AWR report case study: stating the obvious

9 Jul

It’s been a while since I came across an interesting and complex case. However, yesterday on an OTN forum thread I saw a case which was interesting by its simplicity. Even though it’s almost trivial on the technical level, it’s very useful to highlight typical tuning “rookie mistakes” (I can remember quite a few cases from not so long ago, when I did similar mistakes, too).

The author posts a question about “library cache: mutex X” events that are ruining performance of his 2-node RAC cluster. The original post doesn’t contain any specifics except for CPU utilization percentage on both nodes.

Within a few hours, a few replies appear, most of them either trying to shed light on this particular wait event or sharing similar experiences. I asked the original poster to provide key sections of an AWR report (workload profile, top events, database/host CPU, top SQL), which he soon did:

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):                1.3                0.5       0.00       0.00
       DB CPU(s):                1.0                0.4       0.00       0.00
       Redo size:           10,348.4            4,055.3
   Logical reads:            9,824.0            3,849.8
   Block changes:               60.7               23.8
  Physical reads:               28.5               11.2
 Physical writes:               14.2                5.6
      User calls:              608.6              238.5
          Parses:              749.8              293.8
     Hard parses:              333.3              130.6
W/A MB processed:                3.0                1.2
          Logons:                0.2                0.1
        Executes:            1,045.4              409.7
       Rollbacks:                0.0                0.0
    Transactions:                2.6

A glance on this section tells everything about the problem: the database is hard-parsing like crazy! 333 hard parses second, that’s more than 1 hard parse per 2 user calls. These numbers are even more impressive if we take into consideration that the database has only 1.3 average active sessions.

When we turn to top events, everything checks out: DB CPU as the largest wait (of course, since hard parsing is a CPU-intensive activity), log file sync with insignificant percentage, and a few smaller waits related to library cache contention:

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
DB CPU                                            7,528          79.2
log file sync                        18,587         258     14    2.7 Commit
library cache pin                   406,927         207      1    2.2 Concurrenc
library cache lock                  263,777         147      1    1.5 Concurrenc
db file sequential read              27,589         146      5    1.5 User I/O

And as a final confirmation, the time model statistics section states the problem as plainly as possible:

-> Total time in database user-calls (DB Time): 9506.3s
-> Statistics including the word measure background process
   time, and so do not contribute to the DB time statistic
-> Ordered by % or DB time desc, Statistic name
 
Statistic Name                                       Time (s) % of DB Time
------------------------------------------ ------------------ ------------
DB CPU                                                7,528.4         79.2
sql execute elapsed time                              7,396.1         77.8
parse time elapsed                                    5,290.4         55.7

i.e. yes, the database is spending twice as much time parsing SQL as executing it.

So in this case AWR literally gives the answer in plain English, no further analysis required (of course, one would still have to identify the unshared SQL that is hard-parsed over and over — but that’s hardly a challenging task, given that the library cache is filled with thousands of cursors with similar text and same force_matching_signature).

Let’s turn to the “human” side of the problem. If the answer is right there, then why is it often so difficult to see it?

I see several two very typical mistakes here:

1) trying to draw conclusions without having some context (for a detailed explanation of importance of context, see a great post by J. Lewis), i.e. going to specific events without getting a general idea of the scales from the “load profile” section. Load profile is probably the most frequently (and unjustly) overseen section of the AWR report

2) not being quantitative — just because a “bad” event is present in the system doesn’t mean that it’s the cause of the problem. in order to be relevant, the size of the symptom must match the size of the problem. If you problem has the size of 80% the DB time, then you should put aside small events (no matter how suspicious they seem) and go after the biggest one, no matter how “benign” it looks.

Other posts on AWR

http://savvinov.com/awr/

About these ads

9 Responses to “AWR report case study: stating the obvious”

  1. Steve July 10, 2012 at 10:59 am #

    Nikolay,
    Thanks for the information that you provide on your blog. It is very helpful in understanding some of Oracle’s technical details.

    I was just curious how you determined these two points from the stats that you posted?

    * the database has only 1.3 average active sessions.
    * the database is spending twice as much time parsing SQL as executing it.

    • savvinov July 10, 2012 at 11:33 am #

      Hi Steve,

      thanks for your questions.

      1) the average number of active sessions is simply DB time per second. you find this number in the “load profile” section on the top of the report

      2) I know how much time the database spends parsing by looking at the “time model” section of the report (there is an excerpt from it towards the end of the post, unfortunately, I left the title out when copying). Basically it says that the database spends 77.8% of time running SQL, and 55.7% out of this time is spent on parsing

      Hope that answers your question.

  2. vijay August 5, 2012 at 2:41 pm #

    Nikolay,
    You have higlighted the most ignored section of awr i.e load profile.Thanks for reminding everyone and how useful it is.
    I have a question :

    shouldn’t the sql execute elapsed time and parse time elapsed time sum up to 100% of dbtime?

    • savvinov August 5, 2012 at 2:57 pm #

      Hi Vijay,

      thanks for you comment. Indeed, the load profile is probably the most commonly overlooked AWR session — it deserves a dedicated post, and I’m planning to do it as soon as I get enough free time.

      Regarding your question — no, I don’t think that SQL execute elapsed time and parse elapsed time should sum up to 100% of dbtime. For instance, an important part of parsing is executing recursive SQL — that time would probably be counted as “parse time” for the original statement, but as “execute time” for the recursive statement.

      • vijay August 5, 2012 at 4:12 pm #

        Thank You Very Much. Can you shed light on what actually constitutes user call?

      • savvinov August 6, 2012 at 7:42 am #

        Hi Vijay,

        probably the best to answer that question is in a separate post on load profile section of AWR report (which I am still hoping to write some time soon).

        But briefly — a user call is when a client asks the database server to do something, e.g. connect to the database, parse or execute a SQL statement, or fetch some rows for a SQL statement that has been executed.

  3. vijay August 7, 2012 at 12:22 pm #

    Thank You.

  4. John August 9, 2012 at 10:17 am #

    Nikolay,

    I understand that DB Time = + .

    In “Troubleshooting log file sync waits — a simple case study” i see “CPU time” in the Top 5 events.

    Could you tell what it is ?

    Cheers,
    John

    • savvinov August 9, 2012 at 12:55 pm #

      Hi John,

      thank you for your question.

      generally, a session can be in one of the following states:

      1) idle — not doing anything, waiting for the client to ask it to do something
      2) waiting — when it needs something in order to continue it’s work, and cannot get it (e.g. a lock, a latch, a mutex, a pin); or it could be waiting for the OS to complete a request on its behalf (e.g. db file scattered read), or when it’s waiting for a background session to finish its job (e.g. log file sync waits) etc.
      3) on CPU — the session is not waiting for any event, it’s either running on CPU or expecting its turn to get on CPU

      The first two cases are described by OWI events (where OWI stands for Oracle Wait Interface), which are externalized via V$ views like V$SESSION_WAIT etc. In the third case, the OWI doesn’t have any detailed information about what the session is doing, so all it can do is say that the session is in the “ON CPU” state.

      The time spend by a session in such state is shown in “timed events” as “CPU time”. Note that unlike other items in that list, it’s not a real OWI event (e.g. it doesn’t have a wait class).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 242 other followers

%d bloggers like this: