CPU-starved LGWR

In my recent post I showed what log file sync (LFS) and log file parallel write (LFPW) look like on normal systems. I think it would also be interesting to compare that with the situation where LGWR does not have enough CPU.

I happen to have collected LGWR and database-level trace files for an 11.2.0.3 database on a Solaris 10 server that was under serious pressure (50 threads mostly inserting and committing data, with only 32 CPUs). The AWR report showed significant OS_CPU_WAIT_TIME (comparable to BUSY_TIME and much larger than IDLE_TIME), so I know for sure that CPU was an issue. Here is what the LFS and LFPW histograms plotted from the trace file (as described here) looked like:

[Figure: CPU_starved_LGWR_bilog (LFS and LFPW histograms from the trace file, logarithmic scale)]

and then the same plot on the usual (i.e. linear) scale:

[Figure: cpu_starved_lgwr (LFS and LFPW histograms from the trace file, linear scale)]

Note the incredible level of detail. You wouldn’t be able to see any hint of the sharp peaks or other structures on the default coarse 11g histograms! If you don’t believe me, let’s compare. Same everything, but now with AWR histograms:

[Figure: CPU_starved_LGWR_bilog_lores (AWR histograms, logarithmic scale)]

[Figure: cpu_starved_lgwr_lores (AWR histograms, linear scale)]
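To illustrate why the coarse histograms lose the peaks, here is a minimal sketch that bins the same waits twice: once into power-of-two millisecond buckets (the bucketing used by the default wait event histograms, where a bucket labelled N holds waits shorter than N ms) and once into 1 ms bins. The function name `pow2_bucket_ms` and the sample values are made up for illustration; they are not data from the run described here.

```python
from collections import Counter


def pow2_bucket_ms(wait_ms):
    """Return the upper bound (in ms) of the power-of-two bucket that holds wait_ms.

    Buckets are <1, <2, <4, <8, ... ms, so a 10 ms wait lands in the <16 bucket.
    """
    bound = 1
    while wait_ms >= bound:
        bound *= 2
    return bound


# Hypothetical wait times in ms, clustered near multiples of 10 ms
waits_ms = [10.1, 10.3, 13.0, 20.2, 25.0, 30.1, 40.4]

coarse = Counter(pow2_bucket_ms(w) for w in waits_ms)  # power-of-two buckets
fine = Counter(int(w) for w in waits_ms)               # 1 ms bins

print("coarse:", dict(sorted(coarse.items())))
print("fine:  ", dict(sorted(fine.items())))
```

In the coarse view the 20 ms and 30 ms clusters share the <32 bucket and the 10 ms cluster is merged with everything between 8 and 16 ms, so separate peaks cannot be told apart.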

Of course, there is the downside of having to obtain a trace file, which will be a problem in a production environment. In that case you can use ASH instead (the ASH data may be statistically biased towards longer waits, so you need to be careful, but it should at least give you a basic idea).
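For the trace-file route, a minimal sketch of building a high-resolution histogram could look like the following. It assumes the usual extended SQL trace wait-line layout (`WAIT #n: nam='...' ela= <microseconds> ...`); the regular expression, the function name `wait_histogram`, and the sample lines are illustrative assumptions, not the exact method used for the plots in this post.

```python
import re
from collections import Counter

# Assumed shape of a wait line in an extended SQL trace file, e.g.
#   WAIT #3: nam='log file sync' ela= 10231 buffer#=4711 ... tim=...
# where ela is the elapsed wait time in microseconds.
WAIT_RE = re.compile(r"WAIT #\d+: nam='([^']+)' ela= *(\d+)")


def wait_histogram(lines, event, bin_us=100):
    """Count waits for `event`, keyed by the lower edge of each bin (microseconds)."""
    hist = Counter()
    for line in lines:
        m = WAIT_RE.match(line.strip())
        if m and m.group(1) == event:
            ela_us = int(m.group(2))
            hist[ela_us // bin_us * bin_us] += 1
    return dict(sorted(hist.items()))


# Made-up sample lines, for illustration only
sample = [
    "WAIT #3: nam='log file sync' ela= 10231 buffer#=4711 sync scn=99 p3=0 obj#=-1 tim=1000",
    "WAIT #3: nam='log file sync' ela= 10870 buffer#=4712 sync scn=99 p3=0 obj#=-1 tim=2000",
    "WAIT #3: nam='log file sync' ela= 20110 buffer#=4713 sync scn=99 p3=0 obj#=-1 tim=3000",
    "WAIT #0: nam='log file parallel write' ela= 412 files=1 blocks=2 requests=1 obj#=-1 tim=4000",
]

lfs = wait_histogram(sample, "log file sync", bin_us=1000)
lfpw = wait_histogram(sample, "log file parallel write", bin_us=100)
print(lfs)
print(lfpw)
```

The bin width is a free parameter: narrow enough to resolve the structure you care about, wide enough that each bin still collects a meaningful number of waits.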

Thanks to the high resolution, it’s quite easy to see that the LFS problem in this case is not coming from the I/O: the shapes of the LFS (blue) and LFPW (red) curves are completely different, and there is nothing in the LFPW curve that could explain the sharp peaks at 10, 20, 30 and 40 ms or the step-like increase after the 10 ms peak.

The location of the peaks (multiples of 10 ms) is clearly suggestive of a 10 ms OS scheduler quantum, although I didn’t monitor that run at the OS level, so I cannot confirm that with 100% certainty.
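One rough way to put a number on "peaks at multiples of 10 ms" is to measure what fraction of waits fall close to a multiple of the suspected period and compare it with what a smooth distribution would give (about 2 × tolerance / period). The sketch below is only an illustration: the function name, the tolerance, and the sample values are assumptions, and a high fraction would be consistent with a scheduler quantum without proving it.

```python
def fraction_near_multiples(waits_us, period_us=10_000, tol_us=500):
    """Fraction of waits within tol_us of a non-zero multiple of period_us."""
    # Skip short waits that could only be "near" the zero multiple
    eligible = [w for w in waits_us if w >= period_us - tol_us]
    if not eligible:
        return 0.0
    hits = 0
    for w in eligible:
        offset = w % period_us
        # Distance to the nearest multiple, below or above
        if min(offset, period_us - offset) <= tol_us:
            hits += 1
    return hits / len(eligible)


# Hypothetical LFS wait times in microseconds
waits_us = [300, 9_900, 10_120, 14_000, 20_050, 26_000, 30_400, 40_010]

observed = fraction_near_multiples(waits_us)
baseline = 2 * 500 / 10_000  # expected fraction if waits were spread evenly
print(f"observed={observed:.2f} baseline={baseline:.2f}")
```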

Conclusion

When looking at LFS and LFPW, the level of detail is of critical importance:

– averages would confuse and mislead

– default AWR histograms wouldn’t mislead, but they wouldn’t tell you much either

– high-resolution histograms obtained using trace files or ASH would give you the full picture, and in many cases would even give the answer right away!
