In my earlier post, I mentioned an interesting case which, among other symptoms, featured high CPU usage. In this post I would like to say more about that case.
I was investigating poor performance on a 10.2.0.5 Oracle database, and I was asked to look at it after yet another outage (the database wasn’t accepting any connections for over two hours). The AWR report showed massive mutex and latch contention:
Top 5 Timed Events                                             Avg %Total
~~~~~~~~~~~~~~~~~~                                            wait   Call
Event                          Waits        Time (s)       (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
cursor: pin S                  ############   2,072,643      2   44.5 Other
cursor: pin S wait on X          76,424,627     929,303     12   20.0 Concurrenc
latch free                            1,958     246,702 ######    5.3 Other
CPU time                                         58,596           1.3
log file sync                       746,839      44,076     59    0.9 Commit
-------------------------------------------------------------
So the natural question was: what was causing this? The database was showing a very high hard-parse rate (over 100 per second), but that was present all the time, not just during the incident, so while it was obviously an important factor, it couldn't have been solely responsible. I then turned my attention to ASH, where I saw sessions in the ON CPU state showing the same sql_id for a few hours straight before the database blacked out — but that sql_id was nowhere else to be found: nothing in AWR reports, nothing in DBA_HIST_SQLTEXT, nothing in V$SQL.
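The ASH check can be sketched roughly as follows (a sketch only; the four-hour window and the exact column list are my illustrative assumptions, not details from the original investigation):

-- Sketch: which sql_ids accumulated the most ON CPU samples
-- in ASH over a recent window (10g column names).
SELECT sql_id, COUNT(*) AS cpu_samples
FROM   v$active_session_history
WHERE  session_state = 'ON CPU'
AND    sample_time > SYSDATE - 4/24        -- last 4 hours (illustrative)
GROUP  BY sql_id
ORDER  BY cpu_samples DESC;

A sql_id that dominates this output for hours, yet appears in no AWR report, is exactly the kind of anomaly described above.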
Of course, knowing the sql_id without having the text was of no use, so I needed to find the text. The database would only hang once every few weeks, and waiting that long was not a very attractive option. Fortunately, the database performance showed some patterns: at certain times of day it tended to get worse even without hanging completely. I waited until the next such period and got lucky. The problem query was something like:
SELECT col1, col2 FROM tab1 WHERE col1 IN ('value1', 'value2', ... , 'value105')
Interestingly enough, V$SQL was showing: RUNTIME_MEM = 0, LOADED_VERSIONS = 0, EXECUTIONS = 0, SORTS = 0, FETCHES = 0, ELAPSED_TIME = 0… the only non-zero statistics were USERS_OPENING = 14, LOADS = 1 and USERS_EXECUTING = 1.
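Once the sql_id is known, these statistics can be pulled with a simple query against V$SQL (a sketch; '&sql_id' is a placeholder to be substituted with the actual value):

-- Sketch: parse/execution statistics for a given sql_id
SELECT sql_id, runtime_mem, loaded_versions, executions, sorts,
       fetches, elapsed_time, loads, users_opening, users_executing
FROM   v$sql
WHERE  sql_id = '&sql_id';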
So apparently the statement was stuck while being parsed — and all those zeroes above would probably explain why AWR wouldn't pick the statement up for its reports (why would anyone bother to look at a statement with 0 executions and 0 elapsed time?).
Knowing the query structure helped to narrow down my search on MOS and soon I found a bug with fitting symptoms:
Bug 9935787 – Long parse time for large inlists – can cause ‘cursor: pin S wait on X’ waits.
As subsequent analysis showed, there were other factors contributing to the problem:
- concurrent execution of exactly the same statement
- high CPU usage because of excessive hard parsing.
Excessive hard parsing was setting the stage for the incident (CPU usage was already consistently high, so any significant increase was fatal). When the offending statement got stuck parsing, the application would retry the same SQL in other sessions, producing cursor: pin S wait on X waits. From there, one of several possible scenarios would develop. One of them was described by Tanel Poder in his blog:
http://blog.tanelpoder.com/2010/04/21/cursor-pin-s-waits-sporadic-cpu-spikes-and-systematic-troubleshooting/. For example, the OS could notice high CPU usage by an Oracle process and downgrade its priority, while processes at the top of the runqueue burned CPU waiting for the mutex to be released.
Another possible scenario was that the stuck session was holding another resource, such as a library cache lock (e.g. because of a stats job running outside its normal window), so the number of sessions waiting on it kept growing until the entire database froze.
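A blocking chain like that can usually be observed directly in V$SESSION while it is forming (a sketch against 10.2 views; the event filter is an illustrative assumption):

-- Sketch: sessions stuck on library cache waits and who blocks them
SELECT sid, serial#, sql_id, event, blocking_session, seconds_in_wait
FROM   v$session
WHERE  event LIKE 'library cache%'
ORDER  BY seconds_in_wait DESC;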
Once the problem was identified, the solution was simple: as it turned out, the offending SQL was not really needed (it was collecting data for a legacy report), so it was simply removed. Additionally, a list of statements using literals was handed to the application team for fixing. While they haven't implemented all of the fixes yet, the interim results are already noticeable: mutex contention went down by almost an order of magnitude, CPU utilization dropped from 80-90% to 50-80%, and the database has been running incident-free for quite a while now.
I would like to highlight several points from this case:
1) Many database performance problems are “binary bombs”: they are a result of two (or even more) simultaneous factors. In this case, poor database hygiene (lack of bind variables) made the database vulnerable, even though by itself it wasn’t enough to make performance unacceptable. Other factors had to play the role of a trigger. In such cases it is important to have an integral picture of what is going on; otherwise there is a risk of addressing the problem only superficially (treating a symptom, not the root cause).
2) It is of critical importance to observe and identify database workload periodicity, because this gives you the best shot at catching a problem in real time.
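To illustrate the bind-variable point above: the same lookup written with a literal and with a bind variable (schematic, reusing the tab1/col1 names from the example query):

-- Literal: every distinct value produces a new cursor and a hard parse
SELECT col1, col2 FROM tab1 WHERE col1 = 'value1';

-- Bind variable: one shared cursor is reused regardless of the value
SELECT col1, col2 FROM tab1 WHERE col1 = :b1;

With literals, a high statement rate translates directly into a high hard-parse rate, which is exactly the background load that made this database vulnerable.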