CPU starvation disguised as an I/O issue (yet another AWR case study)

In AWR analysis, what appears to be the root cause of the issue can easily turn out to be just a symptom. Last week, Rajat sent me an AWR report which is a perfect illustration of this (thanks Rajat). I posted the key sections from this report below (sorry for the less-than-perfect formatting — I had to manually re-format the HTML version of the report into text).


WORKLOAD REPOSITORY report for
DB Name      DB Id           Instance       Inst num        Release          RAC           Host
DSS          37220993      dss              1               10.2.0.4.0       NO            dssdbnz

                  Snap Id      Snap Time             Sessions      Cursors/Session
Begin Snap:       18471      12-Oct-12 08:30:28      131              1.5
End Snap:         18477      12-Oct-12 14:30:24      108              1.8
Elapsed:          359.93 (mins)
DB Time:          25,730.14 (mins)

Load Profile
                              Per Second      Per Transaction
Redo size:                    325,282.85      103,923.02
Logical reads:                33,390.52       10,667.77
Block changes:                1,307.95        417.87
Physical reads:               1,927.33        615.75
Physical writes:              244.65          78.16
User calls:                   391.34          125.03
Parses:                       68.14           21.77
Hard parses:                  3.33            1.06
Sorts:                        47.86           15.29
Logons:                       3.15            1.01
Executes:                     234.32          74.86
Transactions:                 3.13
% Blocks changed per Read:       3.92       Recursive Call %:      61.11
Rollback per transaction %:      24.71      Rows per Sort:         3325.52

Top 5 Timed Events
Event                               Waits          Time(s)      Avg Wait(ms)      % Total Call Time      Wait Class
free buffer waits              10,726,838      344,377     32      22.3      Configuration
db file sequential read        6,122,262      335,366      55      21.7      User I/O
db file scattered read         3,597,607      305,576      85      19.8      User I/O
CPU time                                      161,491              10.5
read by other session          2,572,875      156,821     61       10.2      User I/O

Operating System Statistics
Statistic                                 Total
AVG_BUSY_TIME                             2,093,109
AVG_IDLE_TIME                             63,212
AVG_IOWAIT_TIME                           18,463
AVG_SYS_TIME                              87,749
AVG_USER_TIME                             2,004,722
BUSY_TIME                                 16,749,988
IDLE_TIME                                 510,692
IOWAIT_TIME                               152,594
SYS_TIME                                  707,137
USER_TIME                                 16,042,851
LOAD                                      4
OS_CPU_WAIT_TIME                          ###############
RSRC_MGR_CPU_WAIT_TIME                    0
VM_IN_BYTES                               5,503,492,096
VM_OUT_BYTES                              2,054,414,336
PHYSICAL_MEMORY_BYTES                     34,288,209,920
NUM_CPUS                                  8
NUM_CPU_SOCKETS                           8

The first thing that leaps out is the I/O problem, which manifests itself via free buffer waits and abnormally high average times for I/O-related events: 55 ms for db file sequential reads (!) and 85 ms for db file scattered reads (!!), i.e. 10-20 times more than normal. That alone should be enough to bring any application to its knees.

So the next question is what could have caused all this. It’s possible that this abnormality stems from a hardware failure or a misconfiguration in the I/O subsystem, so it would make sense to contact the SA and ask him to check the I/O stats from his angle. However, if we look a bit further in the AWR report, we will find yet another problem:


BUSY_TIME 16,749,988
IDLE_TIME 510,692

i.e. the CPU was 97% busy (!!) during a 6-hour interval (!!!). Now that we have established that both the I/O subsystem and the CPUs were bottlenecks, it’s quite unlikely that the two problems are unrelated to each other. Most probably there is a causal link between them, but what is the cause and what is the consequence? An I/O problem is unlikely to affect the CPU, but the inverse is possible: if all CPUs on the system are busy, then it can take forever for the OS to finish processing an I/O request, and eventually the entire system virtually comes to a standstill.
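
For reference, here is how the 97% figure falls out of the OS statistics above (BUSY_TIME and IDLE_TIME are cumulative CPU times across all CPUs, in centiseconds); a quick check from SQL*Plus:

-- host CPU utilization over the snapshot interval, computed directly
-- from the BUSY_TIME and IDLE_TIME values in the report above
SELECT ROUND(16749988 / (16749988 + 510692) * 100, 1) AS pct_cpu_busy
  FROM dual;
-- => 97.0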

But that’s not yet the end of the story; it would also be nice to know why the CPU usage went up so dramatically, so let’s take a look at the “SQL ordered by CPU time” section of the report (column names and SQL ids have been obfuscated at the sender’s request):

CPU Time (s)   Elapsed Time (s)   Executions   CPU per Exec (s)   % Total DB Time   SQL Id          SQL Module   SQL Text
78,629      390,095          6       13104.92            25.27      b3p2r2asdfyf        SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...
5,302       21,053           0                           1.36       0vpp40n6p6zq5       SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...
5,238       21,042           0                           1.36       2y03w8vc4uz4c       SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...
5,230       21,053           0                           1.36       8pcz31gc88vba       SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...
4,976       18,853           1       4975.57             1.22       asd0tw7sj5pxv       SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...
4,928       18,863           1       4928.05             1.22       4fdsaj175f4jk       SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...
4,356       21,049           0                           1.36       1w1fasdff0av        SQL*Plus       SELECT ABC_XYZ || '#'|| ZO...

We can see that one statement is responsible for about 50% of the CPU utilization during the analyzed period, and a bunch of very similar statements (which probably differ only in the values of literals) is responsible for at least another 15%.
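
By the way, the same ranking can be reproduced straight from the workload repository instead of the report. Here is a sketch (the snap ids are the ones from the report header; CPU and elapsed time deltas in DBA_HIST_SQLSTAT are stored in microseconds, and I have omitted the dbid/instance_number filters since this is a single-instance database):

SELECT *
  FROM (SELECT sql_id,
               ROUND(SUM(cpu_time_delta) / 1000000)     AS cpu_secs,
               ROUND(SUM(elapsed_time_delta) / 1000000) AS elapsed_secs,
               SUM(executions_delta)                    AS execs
          FROM dba_hist_sqlstat
         WHERE snap_id > 18471      -- begin snap
           AND snap_id <= 18477     -- end snap
         GROUP BY sql_id
         ORDER BY SUM(cpu_time_delta) DESC)
 WHERE rownum <= 10;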

So everything is pointing to the top statement as the sole culprit of the entire issue — and indeed, after I was provided additional information about that statement, I saw that its plan flipped right around the time of the incident. If I had access to the database, I would’ve pursued the matter further — e.g. by using dbms_xplan.display_awr to see the different plans and dbms_stats.diff_table_stats_in_history to see what changes in statistics may have provoked this change.
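
For those who haven’t used them, this would look roughly as follows. This is only a sketch: the sql_id is the (obfuscated) one from the report, and the owner/table names are placeholders, since I did not have access to the real objects:

-- all execution plans captured in AWR for the top statement
SELECT * FROM table(dbms_xplan.display_awr('b3p2r2asdfyf'));

-- compare the statistics of one of the tables involved as of two points in time
-- (SOME_OWNER / SOME_TABLE are placeholders)
SELECT *
  FROM table(dbms_stats.diff_table_stats_in_history(
               ownname      => 'SOME_OWNER',
               tabname      => 'SOME_TABLE',
               time1        => systimestamp - INTERVAL '7' DAY,
               time2        => systimestamp,
               pctthreshold => 10));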

Conclusion

This post hopefully illustrates the following points:

1) CPU starvation has many different appearances; among other things, it can look like an I/O issue

2) Don’t jump to conclusions. After finding a symptom, resist the temptation to pronounce it the root cause of the problem. Note that in this specific case, if we had decided that the I/O problem was the root cause, we most likely would have proceeded to the “SQL ordered by reads” section of the report rather than “SQL ordered by CPU time”, and that section doesn’t point in the right direction (the culprit query isn’t anywhere near the top of that list)

3) Oracle databases contain tons of diagnostic information that allows you to identify the root cause of almost any performance problem in great detail. You don’t have to settle for a vague answer like “something’s wrong with the I/O subsystem” or “the workload was probably higher”.

Update: I spotted an error in my CPU time calculation: 6 hours is not 216,000 s, but 21,600 s. With this taken into account, busy_time is almost exactly equal to NUM_CPUS x elapsed_time, and we don’t even need to refer to os_cpu_wait_time to know that there was CPU starvation — so I’ve removed that part from my post.
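
In other words (all OS times in the report are in centiseconds): the total CPU capacity over the interval is 8 CPUs x 21,600 s x 100 = 17,280,000 cs, which is almost exactly BUSY_TIME + IDLE_TIME = 17,260,680 cs, and BUSY_TIME alone (16,749,988 cs) accounts for about 97% of it.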

12 thoughts on “CPU starvation disguised as an I/O issue (yet another AWR case study)”

  1. Hi Nikolay..Great post as usual.. :)

    I have one question..

    “BUSY_TIME 16,749,988

    IDLE_TIME 510,692

    i.e. the CPU was 97% busy (!!) during an 8-hour interval (!!!). ”

    How did you calculate this? And what is this 8 hour interval here?

    Thanks in advance.

  2. Hi Sandy – I got the same question. After struggling, I found the formula, but I’m not sure whether it is correct or not: ‘BUSY_TIME / (BUSY_TIME+IDLE_TIME) * 100’.

    1. Hi,

      I used BUSY_TIME / elapsed_time / NUM_CPUS. You can also use your formula with one correction: add IOWAIT_TIME to the denominator (it doesn’t make much of a difference in this case).
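
      With the numbers from this report (all OS times are in centiseconds): 16,749,988 / (359.93 x 60 x 100) / 8 ≈ 0.97, i.e. about 97%.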

  3. Hi Nikolay,

    Good post.

    Too many folks forget that physical I/O finishes with the kernel marking the blocked process as ‘runnable’. The time waiting for the scheduler to put the process on a core is added to the I/O time. Also, an 11g AWR would more clearly point out the waits for CPU.

    Years ago I wrote a post that shows this exact phenomenon adding to Log File Sync. If you’ll allow I’d like to leave a link to that post:

    http://kevinclosson.wordpress.com/2007/07/21/manly-men-only-use-solid-state-disk-for-redo-logging-lgwr-io-is-simple-but-not-lgwr-processing/

    1. Hi Kevin,

      it’s really great to see you in my blog. I have been following yours for a while now, great stuff. And I remember this article on log file sync waits you referenced — it’s probably the best study on the subject, ever!

  4. Hi Nikolay..Great post :)

    I have one question..
    “We can see that one statement is responsible for about 50% of the CPU utilization during the analyzed period, and a bunch of very similar statements (which probably only differ by values of literals) is responsible for at least yet another 15%.”

    How did you calculate this? 50% and 15%.

    Thanks in advance.

    1. Hi Dan,

      it’s actually quite simple: the report shows total CPU usage to be 160,000 CPU seconds (see e.g. “CPU time” in “top 5 timed events”), while the top statement took 80,000 CPU seconds out of it. 80,000/160,000 = 1/2 = 50%. And similarly for the other statements.

  5. Hello Nikolay, I have a question: if ‘CPU time’ isn’t in “top 5 timed events”, how can we know its value?

    1. Hi Lonion,

      it depends on the database version. In 10g, you can find “DB CPU” (which is essentially the same thing) in the “Time model statistics” section.

      In 11g there is an “Instance CPU” section which shows both the % of available CPU used by the instance and the % of total CPU usage that the instance is responsible for (i.e. the former shows CPU usage in terms of total CPU capacity, while the latter shows the share of this database instance among all CPU consumers on the host).
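
      If you prefer to query it directly, DB CPU can also be pulled from the workload repository. A sketch (the snap ids are the ones from the report above; the value is cumulative and stored in microseconds, so we take the delta between the two snapshots, assuming no instance restart in between):

      SELECT ROUND((e.value - b.value) / 1000000) AS db_cpu_secs
        FROM dba_hist_sys_time_model b,
             dba_hist_sys_time_model e
       WHERE b.stat_name       = 'DB CPU'
         AND e.stat_name       = 'DB CPU'
         AND b.snap_id         = 18471
         AND e.snap_id         = 18477
         AND e.dbid            = b.dbid
         AND e.instance_number = b.instance_number;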
