Recently I was asked to help with a case involving library cache locks, and even though it was really simple, I think it’s good for illustrating a few points in performance analysis.
As it often happens, it all started with customers complaining about “the database being slow”. I asked for more specific details, and I found that it wasn’t the entire database, but rather a reporting subsystem. Since I didn’t have specific session sid or sql_id at hand, I started with analyzing a 30-min AWR report, and since there wasn’t much time to look at everything, I just focused on top timed events:
Event Waits Time (s) Avg Wait(ms) %Total Wait Class
CPU time 16,188 46.2
db file sequential read 18,230,956 11,787 1 33.6 User I/O
db file parallel write 278,480 4,509 16 12.9 System I/O
library cache lock 1,131 3,297 2,915 9.4 Concurrency
log file sync 120,356 1,686 14 4.8 Commit
SQL tuning is the key to database performance tuning, and yet this seems to be a “blind spot” for many database specialists. I think it’s probably because it falls right on the boundary between DBA’s and developer’s responsibilities: so while a DBA expects database developers take care of performance while developing code, developers often neglect that, relying on a “develop first, let the DBA tune it later” approach. Also, until recently, there were surprisingly few good (and accessible to a newbie) descriptions of how to read a SQL plan.
Now that we have Chritian Antognini’s great book “Troubleshooting Oracle Performance”, the situation has improved dramatically. But still, I think that a blog post on that subject won’t hurt: after all, it’s free and it’s written by someone who still remembers difficulty his first analyzing SQL plans. :)
It is really amazing how many people in this day and age are still relying on good old EXPLAIN PLAN when having access to much more precise diagnostic tools. There are probably several reasons for this:
1) many people haven’t heard of dbms_xplan or don’t have a very good idea of what it can do
2) tracing is often pictured as some utterly complex activity producing results “unreadable”, if not undecipherable, results
3) finally, many people have unrealistically high expectations of EXPLAIN PLAN.
In my earlier post, I mentioned an interesting case which among other symptoms, featured high CPU usage. In this post I would like to tell more about that case.
I was investigating poor performance on a 10.2.0.5 Oracle database, and I was asked to look at it after yet another outage (the database wasn’t accepting any connections for over two hours). The AWR report showed massive mutex and latch contention:
Let’s start with some basic concepts. AWR reports deal with several kinds of time. The simplest kind is the elapsed time , it’s just the interval of time between the start and end snapshots. Another important quantity is DB time, which is defined as time in user calls during that period. It can be (and for a busy system typically is) greater than the elapsed time. However, the reason for that is not the number of CPUs as some experts incorrectly state (apparently, they confuse it with CPU time that we’ll discuss below, e.g. here), it’s that this time is a sum over all active user processes which are using CPU or waiting for something. Note that it only counts time spent in user calls, i.e. background processes are not included in that. Continue reading