AWR analysis: another case study

A few weeks ago, I received a request to review an AWR report for a database suffering from instance-level performance issues. Here are the key parts of that report (with some masking):

WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Release     RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
XXXX           XXXXX     XXXXX               1 10.2.0.5.0  NO  XXXX

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     65115 03-May-16 11:00:09       152      17.8
  End Snap:     65116 03-May-16 12:00:18       152      17.7
   Elapsed:               60.16 (mins)
   DB Time:            2,712.41 (mins)

Cache Sizes
~~~~~~~~~~~                       Begin        End
                             ---------- ----------
               Buffer Cache:     5,856M     5,856M  Std Block Size:         8K
           Shared Pool Size:     2,048M     2,048M      Log Buffer:    14,340K

Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:          2,455,599.10             14,087.84
              Logical reads:            613,415.60              3,519.18
              Block changes:             12,238.64                 70.21
             Physical reads:             12,233.70                 70.19
            Physical writes:              1,517.54                  8.71
                 User calls:              1,159.19                  6.65
                     Parses:             39,080.15                224.20
                Hard parses:                 32.45                  0.19
                      Sorts:                708.22                  4.06
                     Logons:                  0.31                  0.00
                   Executes:             39,393.06                226.00
               Transactions:                174.31

 ...

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
db file sequential read          12,719,449      73,404      6   45.1   User I/O
CPU time                                         50,461          31.0
db file parallel read             2,127,146      14,407      7    8.9   User I/O
read by other session             3,660,106      11,240      3    6.9   User I/O
db file scattered read            1,005,907       5,493      5    3.4   User I/O
          -------------------------------------------------------------
Time Model Statistics            DB/Inst: XXXX/XXXX  Snaps: 65115-65116
-> Total time in database user-calls (DB Time): 162744.5s
-> Statistics including the word 'background' measure background process
   time, and so do not contribute to the DB time statistic
-> Ordered by % or DB time desc, Statistic name

Statistic Name                                       Time (s) % of DB Time
------------------------------------------ ------------------ ------------
sql execute elapsed time                            159,211.8         97.8
DB CPU                                               50,461.4         31.0
PL/SQL execution elapsed time                         5,870.2          3.6
parse time elapsed                                    5,788.9          3.6
RMAN cpu time (backup/restore)                        1,328.8           .8
hard parse elapsed time                                 227.6           .1
repeated bind elapsed time                              169.1           .1
connection management call elapsed time                   6.3           .0
PL/SQL compilation elapsed time                           4.2           .0
hard parse (sharing criteria) elapsed time                3.8           .0
hard parse (bind mismatch) elapsed time                   2.7           .0
failed parse elapsed time                                 0.1           .0
sequence load elapsed time                                0.0           .0
DB time                                             162,744.5          N/A
background elapsed time                               9,714.6          N/A
background cpu time                                   1,816.4          N/A
-------------------------------------------------------------
...

Operating System Statistics       DB/Inst: XXXX/XXXX  Snaps: 65115-65116

Statistic                                       Total
-------------------------------- --------------------
AVG_BUSY_TIME                                 345,463
AVG_IDLE_TIME                                  15,407
AVG_SYS_TIME                                   53,275
AVG_USER_TIME                                 292,060
BUSY_TIME                                   5,529,148
IDLE_TIME                                     248,026
SYS_TIME                                      854,375
USER_TIME                                   4,674,773
RSRC_MGR_CPU_WAIT_TIME                              0
VM_IN_BYTES                       -4.771264721965E+18
VM_OUT_BYTES                      -1.354598512669E+15
PHYSICAL_MEMORY_BYTES                  17,178,402,816
NUM_CPUS                                           16
NUM_CPU_CORES                                      16

...

What can we learn from all this?

First of all, there are a few interesting numbers in the load profile section. There are about 1k user calls per second, but the number of executes is much higher: about 40k per second, which suggests that most of the database load falls on SQL statements in PL/SQL routines.

Interestingly, the number of parses is also exceptionally high (quite close to the number of executes). However, only a small fraction of parses are hard ones.
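These ratios are easy to check directly from the load-profile figures above (a quick sketch; all numbers are taken straight from the report):

```python
# Per-second figures from the Load Profile section of the report
user_calls  = 1_159.19
executes    = 39_393.06
parses      = 39_080.15
hard_parses = 32.45

print(f"executes per user call: {executes / user_calls:.1f}")   # ~34.0
print(f"parse-to-execute ratio: {parses / executes:.2f}")       # ~0.99
print(f"hard-parse fraction:    {hard_parses / parses:.4f}")    # ~0.0008
```

So nearly every execution is parsed (softly), but well under 0.1% of those parses are hard.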

The amount of redo generated is fairly small (about 2.4MB/s) and in general the database does much more reading than writing.

The Top 5 Timed Events section doesn’t look particularly exciting. We can see that the database spends most of its time doing I/O, which is to be expected. Some DBAs get alarmed by “read by other session” waits, taking them as a sign of a concurrency problem, but it’s a perfectly normal event that doesn’t necessarily indicate anything wrong: another session is reading a block we’re interested in into memory, so that our session doesn’t have to do it itself.

There are no obvious “bad” events here signalling contention or configuration issues. The second line in the top events list is taken by “CPU time”, which is responsible for 31.0% of the database time. Since the database only has 16 CPUs, 50,461 CPU seconds mean that CPU utilization is high. To convert to a percentage, simply divide by the number of CPUs (16) and the number of seconds in an hour (3600): 50461/16/3600 = 0.876, i.e. the database alone accounts for about 88% CPU utilization! Furthermore, if we look into the OS stats section, we’ll see that the overall CPU utilization, including non-Oracle processes, is even higher, about 95% (just take BUSY_TIME and divide it by BUSY_TIME + IDLE_TIME).
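The two calculations can be spelled out as a quick sketch (figures taken from the report; the snapshot interval is rounded to one hour):

```python
# DB CPU utilization from the Top 5 Timed Events figure
db_cpu_s  = 50_461   # "CPU time" in seconds over the snapshot
num_cpus  = 16
elapsed_s = 3600     # ~60-minute snapshot interval, rounded

print(f"DB CPU utilization:   {db_cpu_s / (num_cpus * elapsed_s):.1%}")  # ~87.6%

# Host-level utilization from the Operating System Statistics section
busy_time, idle_time = 5_529_148, 248_026
print(f"host CPU utilization: {busy_time / (busy_time + idle_time):.1%}")  # ~95.7%
```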

Now the question becomes: what causes the high CPU usage? There are many possibilities here. Basically, anything not related to physical I/O can cause CPU consumption (actually, even physical I/O can contribute to CPU usage, as I’ve shown earlier, but that’s the exception rather than the rule, as physical I/O is typically too slow to contribute significantly). The most common scenarios include parsing, context-switching overhead, and logical I/O.

I don’t think there is enough information in the report to draw a conclusion with 100% certainty, but there is enough to make some educated guesses. Could parsing be the main problem here? The exceptionally high parsing rate (39k/second, i.e. practically one parse per execution) from the load profile certainly seems to suggest so. It’s a possibility. On the other hand, only a tiny fraction (less than 0.1%) of those are hard parses, and it is hard parses that contribute the most to CPU usage. Further, the “Time Model Statistics” section claims that only 3.6% of DB time is taken by parsing, and while this figure can sometimes be inaccurate and I wouldn’t trust it blindly, combined with all the other information it looks sensible. So while parsing certainly seems excessive here, and no doubt it’s making performance worse, there are likely other problems, ones that would give more performance benefit if fixed.

What about context switching? That, too, is a possibility. The strongest evidence in favor of this theory is the disproportion between user calls and executions (1k vs 39k per second), which suggests that there is a lot of SQL called from PL/SQL routines, with a potentially large performance overhead from SQL-to-PL/SQL context switching. So I would keep this possibility in mind. However, analysing this overhead quantitatively is very difficult, and I wouldn’t spend too much time on it before ruling out other possibilities.

Finally, logical I/O. 613k logical I/Os per second is certainly a lot, especially for a database hosted on a server with only 16 cores. Given that the block size here is the standard one, i.e. 8K, this means that the database is reading from cache at about 5GB/s, or 18TB/hr! While many databases read far more than that, they are typically hosted on more powerful servers, or even on multiple servers connected into a RAC cluster.
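The cache-read bandwidth figure comes straight from the load profile and the block size (a quick sketch):

```python
lio_per_s  = 613_415.6   # logical reads per second (load profile)
block_size = 8 * 1024    # standard 8K block, per the Cache Sizes section

bytes_per_s = lio_per_s * block_size
print(f"{bytes_per_s / 1e9:.1f} GB/s")           # ~5.0 GB/s
print(f"{bytes_per_s * 3600 / 1e12:.0f} TB/hr")  # ~18 TB/hr
```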

Is this enough to explain the high CPU usage? Let’s see. While a logical I/O can be carried out in a number of ways, and exact timings depend on the CPU architecture, a logical I/O generally takes somewhere between a few microseconds and a few tens of microseconds. For example, if we assume that on this particular system a logical I/O takes 20 us, that would account for almost the entire CPU load. Of course, we don’t know whether it’s 20 us, or 10 us, or less (or more), but I would say that overall the evidence in favor of excessive logical I/O looks stronger than for the other two hypotheses analyzed here. So that’s what I would focus on first. If the problem is still there after reducing the LIO, then I would start looking at the other possibilities.
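Under that assumed (not measured) per-LIO cost, the arithmetic looks like this:

```python
lio_per_s  = 613_415.6  # logical reads per second, from the load profile
lio_cost_s = 20e-6      # ASSUMPTION: 20 us per logical read on this system
num_cpus   = 16

cpus_busy = lio_per_s * lio_cost_s
print(f"CPUs kept busy by LIO: {cpus_busy:.1f} of {num_cpus}")  # ~12.3
print(f"as utilization:        {cpus_busy / num_cpus:.0%}")     # ~77%
```

That ~12.3 CPUs’ worth of work is in the same ballpark as the ~14 CPUs implied by the 50,461 seconds of DB CPU over the hour, which is why the logical I/O hypothesis looks the most plausible.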

Very often, logical I/O can be dramatically reduced by tuning top SQL. Of course, that requires more database access than just an AWR report. Still, AWR reports provide an excellent starting point for this analysis, as they contain a subsection called “SQL ordered by Gets”. In this particular case, the usefulness of this section was somewhat limited by the abundance of PL/SQL calls and by the fact that the total percentage came to more than 100% (probably due to double-counting of SQL called from PL/SQL routines), but it was still useful. After excluding PL/SQL routine calls, most top statements had a structure like this:

select id
from A
where id = :b1
and colA = :b2
and 0 = (select count(*) from B where colB = :b3)
and 1 < (select count(*) from C where colC1 = :b4 and colC2 = :b5 and colC3 = :b6)
and 0 <= (select count(*) from D where colD1 = :b7 and colD2 = a.ColA2)
...
-- and so on and so forth

So basically, it’s a query that checks for the existence of records in certain tables satisfying certain conditions. Many such queries had an abundance of both correlated and uncorrelated scalar subqueries in the WHERE clause. One obvious problem is that this increases the risk of an inefficient plan, because the optimizer needs to correctly assess whether each subquery should be unnested, and that’s not always easy (and even if it evaluates the situation correctly, it can still fail to do the unnesting). But that’s not all. If you look at the uncorrelated subqueries, you can see that they could be separated out into standalone queries, which would have allowed the execution to be short-circuited as soon as one of the predicates is found to be false. For example, Oracle could have evaluated the condition 0 = (select count(*) from B where colB = :b3) first and, if it was false, exited this branch of logic, avoiding unnecessary I/O.
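As a hypothetical sketch of that short-circuiting idea (using the masked table and column names from the example above, so none of these names are real), the uncorrelated checks could be pulled out into procedural logic that runs the cheap tests first and skips the expensive main query when one of them fails:

```sql
declare
  l_cnt pls_integer;
begin
  -- cheap uncorrelated check #1: stop immediately if B has matching rows
  select count(*) into l_cnt from B where colB = :b3;
  if l_cnt = 0 then
    -- cheap uncorrelated check #2: require at least two matching rows in C
    select count(*) into l_cnt
      from C where colC1 = :b4 and colC2 = :b5 and colC3 = :b6;
    if l_cnt > 1 then
      null;  -- only now run the main query against A,
             -- keeping the correlated subqueries (e.g. against D) inline
    end if;
  end if;
end;
```

This is only an illustration of the control flow, not the actual fix applied; the same effect can sometimes be achieved purely in SQL by ordering cheap filters so the optimizer can evaluate them first.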

9 thoughts on “AWR analysis: another case study”

  1. My first thought was:
    174 tx/s looks like OLTP; 3,500 LIOs/tx looks like very inefficient OLTP

    Executes = parses and both much larger than calls looks like lots of “execute immediate” or function calls inside PL/SQL

    db file scattered, db file parallel, read by other session very large => inappropriate access paths for OLTP.

    What can one say about comparison of (subquery) with zero as a test for (non-)existence!

    Mind you, the sample code has prompted me to think about optimising generic solutions to “there are at least two related rows”.

    1. Hi Jonathan, thanks for sharing your thoughts!

      Yes, the load profile section was clearly suggestive of OLTP, and top SQL text unambiguously confirmed it (PL/SQL function calls, or SQL statements called thousands of times per hour). I didn’t want to put a label on it because you can expect people to run pretty much anything on their databases, but during the time of the report the load was certainly OLTP-like.

      I can’t quite agree about the top events — db file scattered reads are fairly small (1/15th of db file sequential read), db file parallel read only tells you that there’s a lot of NLJ batching/prefetching going on which is very natural for OLTP load. The sheer amount of I/O related waits does suggest inefficient access paths, sure, but that’s not really new information — this is something we already know from LIO and TX stats.

      Best regards,
      Nikolay

  2. 73.4% of db file sequential read is the mark of either an exceptionally well-tuned OLTP system using its indexes and IOTs (rare) or, on the contrary, some very bad, big nested-loop-over-nested-loop access that must be reduced.

    I have a DB which is a ported COBOL application with an index hint in absolutely every SQL statement, and it barely reaches 50% sequential reads. So the 73% is extremely suspicious of wrongdoing.

    The two biggest usual sources are a nested loop over two indexes, or a single deterministic function (or sub-select) between SELECT and FROM whose index access is run for each row. From what you said, I suspect we are in the second case.

    1. Hi Bernard,

      thanks for your comment.

      I myself wouldn’t read too much into the ratio of db file sequential reads to all I/O-related events; in my experience, it doesn’t tell you much other than that it obviously tends to be higher in an OLTP environment. I’ve seen databases where virtually all I/O waits were db file sequential reads, and among them some were well tuned, some poorly tuned, and some were in between…

      It could be helpful, though, when most of the workload falls on just one, two, or three top SQL statements; then, by looking at the statements themselves, you can sometimes judge whether the wait event profile is what you should expect…

      Best regards,
      Nikolay

  3. Hi Nikolay,

    Nice and attractive reading

    A “read by other session” wait event means that while I was trying to read a block from disk, I found that someone else was already reading exactly the same block from disk into the buffer cache. In principle this is good, because someone is doing work I would otherwise have to do myself. But typically this is a sign of a performance issue that we have to care about.

    Best regards
    Mohamed Houri

    1. Hi Mohamed,

      thanks for your input. All this event is telling us is that some of the data is being read simultaneously by several different sessions, which is a fairly normal situation, especially for an OLTP database.

      Best regards,
      Nikolay

  4. Thanks Nikolay for the excellent post here. It’s amazingly informative, and people can learn a lot from your blog. Jonathan has also given a glimpse of his expertise. It’s really good to see knowledgeable hands sharing information; it always goes hand in hand.

    Here, I would really appreciate it if you could shed some light on how you were able to identify the inefficiency in the SQL and its execution plan. It would be worth a read if you could show this with a test case to find such bad SQL and tune it.

    Regards
    Alok

    1. Hi Alok,

      thanks for your comment. There are several ways you can identify inefficiency in SQL. One is by looking at the LIO/row ratio. It’s very useful for making a quick estimate; however, you need to be careful if the query has, e.g., aggregation operations in it, as well as in some other special cases. Another way is good for more detailed analysis, if you have dbms_xplan output with rowsource stats (“A-Rows”) populated, or similar diagnostic information (a SQL real-time monitoring report, SQL trace output, etc.). You look for sudden drops in the A-Rows numbers in the SQL plan tree. When you see one, especially if it occurs late in the plan, this is often an indication that the query is being executed in a suboptimal way, because rows that will not be returned to the user are not eliminated early enough.

      If you search my blog you will find a few posts that focus on the latter technique, e.g. “Efficiency-based tuning”, which covers the basics. You may also find my latest post of some interest, as it gives a real-life example of applying this technique, covering most of the steps involved in identifying and fixing the problem (at least as far as the short-term solution is concerned).

      There are also many resources outside my blog which cover this topic. I don’t have any links at hand, but you can google for “SQL tuning by waste minimization” to find the classic paper on the subject as well as other works that reference it.

      Best regards,
      Nikolay
