In my previous post I described some sections that are typically useful when interpreting AWR data. Sometimes, however, the answer comes from an unexpected source. For example, the workload profile section of the report contains key information for understanding what the database's workload looks like, but it seldom gives a direct answer to the problem (except perhaps for excessive parsing and excessive commits). Recently, though, I came across a case where this section alone was enough to identify the root cause of a non-trivial issue:
                       Per Second    Per Transaction
    Redo size:       1,895,241.12          12,004.40
    Logical reads:     832,945.54           5,275.85
    Block changes:      11,937.82              75.61
    Physical reads:      7,458.75              47.24
    Physical writes:       759.33               4.81
    User calls:            449.83               2.85
    Parses:                225.18               1.43
    Hard parses:            15.90               0.10
    Sorts:                 467.90               2.96
    Logons:                  1.38               0.01
    Executes:          103,266.84             654.09
This excerpt came from an AWR report for a database that had virtually frozen, with 100% CPU consumption on the box. The question was what was causing the high CPU consumption (the system administrators had ruled out other processes on the box).
Looking carefully at the numbers above, one can notice that the rate of executes is enormous. This becomes even more apparent when comparing it to the rate of user calls, which is a few orders of magnitude lower: roughly 103,267 executes per second against only 450 user calls per second, i.e. on the order of 230 internal executions per user call. These numbers, combined with the high CPU usage, are enough to point to SQL-to-PL/SQL context switching as the primary suspect: a SQL statement containing a PL/SQL function which, in turn, executes a SQL statement, hundreds of thousands of times per second.
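To illustrate the kind of pattern that produces such a profile (this is not the actual code from the incident; all object and function names here are hypothetical), consider a PL/SQL function that runs its own query, called from SQL once per candidate row. Every invocation triggers a recursive SQL execution and two context switches between the SQL and PL/SQL engines:

    -- Hypothetical example: a lookup function with an embedded query.
    CREATE OR REPLACE FUNCTION get_region(p_cust_id NUMBER)
      RETURN VARCHAR2
    IS
      l_region VARCHAR2(30);
    BEGIN
      -- One recursive SQL execution (and a pair of context switches)
      -- for every single call of this function.
      SELECT region INTO l_region
        FROM customers
       WHERE cust_id = p_cust_id;
      RETURN l_region;
    END;
    /

    -- Calling the function from a SQL predicate means it can fire once
    -- per scanned row, inflating "Executes" far beyond "User calls".
    SELECT *
      FROM orders o
     WHERE get_region(o.cust_id) = 'EMEA'
       AND o.order_date >= DATE '2010-01-01';

A single user call running the query above can account for thousands of internal executions, which is exactly the asymmetry visible in the workload profile.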
Further investigation confirmed that this was indeed the case. A statistics-gathering job had run shortly before the incident, invalidating the SQL plan, and the new plan was calling the PL/SQL function at an early stage, before most rows had been eliminated.
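To make the plan-order effect concrete, here is a sketch under the same hypothetical names as before: the cost of such a function depends almost entirely on where its filter sits in the plan, and a plan change can move it.

    -- Conceptually, the same query can be filtered in two orders:
    --
    --   Good plan: filter(o.order_date >= :d AND get_region(o.cust_id) = 'EMEA')
    --              -- function runs only on rows surviving the date filter
    --
    --   Bad plan:  filter(get_region(o.cust_id) = 'EMEA' AND o.order_date >= :d)
    --              -- function (and its embedded SQL) runs once per scanned row
    --
    -- A robust fix is to avoid row-by-row SQL inside the function
    -- altogether, e.g. by joining to the lookup table directly:
    SELECT o.*
      FROM orders o
      JOIN customers c ON c.cust_id = o.cust_id
     WHERE c.region = 'EMEA'
       AND o.order_date >= DATE '2010-01-01';

With the join, the optimizer handles the lookup as set-based SQL, and the executes-per-second figure collapses back to something proportional to user calls.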
The point I am trying to make is that one should maintain a good balance between focusing on just a few key performance indicators and paying attention to the secondary details as well.