AWR report case study: stating the obvious

It’s been a while since I came across an interesting and complex case. However, yesterday on an OTN forum thread I saw a case which was interesting by its simplicity. Even though it’s almost trivial on the technical level, it’s very useful to highlight typical tuning “rookie mistakes” (I can remember quite a few cases from not so long ago, when I did similar mistakes, too).

The author posts a question about “library cache: mutex X” events that are ruining performance of his 2-node RAC cluster. The original post doesn’t contain any specifics except for CPU utilization percentage on both nodes.

Within a few hours, a few replies appear, most of them either trying to shed light on this particular wait event or sharing similar experiences. I asked the original poster to provide key sections of an AWR report (workload profile, top events, database/host CPU, top SQL), which he soon did:

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):                1.3                0.5       0.00       0.00
       DB CPU(s):                1.0                0.4       0.00       0.00
       Redo size:           10,348.4            4,055.3
   Logical reads:            9,824.0            3,849.8
   Block changes:               60.7               23.8
  Physical reads:               28.5               11.2
 Physical writes:               14.2                5.6
      User calls:              608.6              238.5
          Parses:              749.8              293.8
     Hard parses:              333.3              130.6
W/A MB processed:                3.0                1.2
          Logons:                0.2                0.1
        Executes:            1,045.4              409.7
       Rollbacks:                0.0                0.0
    Transactions:                2.6

A glance on this section tells everything about the problem: the database is hard-parsing like crazy! 333 hard parses second, that’s more than 1 hard parse per 2 user calls. These numbers are even more impressive if we take into consideration that the database has only 1.3 average active sessions.

When we turn to top events, everything checks out: DB CPU as the largest wait (of course, since hard parsing is a CPU-intensive activity), log file sync with insignificant percentage, and a few smaller waits related to library cache contention:

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
DB CPU                                            7,528          79.2
log file sync                        18,587         258     14    2.7 Commit
library cache pin                   406,927         207      1    2.2 Concurrenc
library cache lock                  263,777         147      1    1.5 Concurrenc
db file sequential read              27,589         146      5    1.5 User I/O

And as a final confirmation, the time model statistics section states the problem as plainly as possible:

-> Total time in database user-calls (DB Time): 9506.3s
-> Statistics including the word measure background process
   time, and so do not contribute to the DB time statistic
-> Ordered by % or DB time desc, Statistic name
&nbsp;
Statistic Name                                       Time (s) % of DB Time
------------------------------------------ ------------------ ------------
DB CPU                                                7,528.4         79.2
sql execute elapsed time                              7,396.1         77.8
parse time elapsed                                    5,290.4         55.7

i.e. yes, the database is spending twice as much time parsing SQL as executing it.

So in this case AWR literally gives the answer in plain English, no further analysis required (of course, one would still have to identify the unshared SQL that is hard-parsed over and over — but that’s hardly a challenging task, given that the library cache is filled with thousands of cursors with similar text and same force_matching_signature).

Let’s turn to the “human” side of the problem. If the answer is right there, then why is it often so difficult to see it?

I see several two very typical mistakes here:

1) trying to draw conclusions without having some context (for a detailed explanation of importance of context, see a great post by J. Lewis), i.e. going to specific events without getting a general idea of the scales from the “load profile” section. Load profile is probably the most frequently (and unjustly) overseen section of the AWR report

2) not being quantitative — just because a “bad” event is present in the system doesn’t mean that it’s the cause of the problem. in order to be relevant, the size of the symptom must match the size of the problem. If you problem has the size of 80% the DB time, then you should put aside small events (no matter how suspicious they seem) and go after the biggest one, no matter how “benign” it looks.