This year was rich in interesting performance issues. This post is about one recent issue that was particularly intriguing because:
– it was not associated with any particular database activity
– it tended to appear at a certain time of day (between 2am and 4am database time), although with some variation
– there was no discernible pattern in the days on which the problem appeared
– the performance degradation was due to a whole bunch of different “bad” wait events rather than just one
– there was no correlation between the problem’s appearance and the workload metrics.
Here is an example:
                                                             Avg
                                        %Time  Total Wait    wait    Waits   % DB
Event                            Waits  -outs    Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
db file sequential read      25,418,423     0    106,469       4     31.1   22.9
free buffer waits             1,311,363     0    100,159      76      1.6   21.6
read by other session         7,410,309     0     62,025       8      9.1   13.4
log file sync                   418,838     0     53,045     127      0.5   11.4
buffer busy waits               184,209     0     43,813     238      0.2    9.4
write complete waits                638     0     16,435   25761      0.0    3.5
enq: TX - index contention       17,722     0     13,310     751      0.0    2.9
direct path write               180,155     0      8,257      46      0.2    1.8
log buffer space                 40,891     0      7,302     179      0.0    1.6
So as you can see, it’s a whole bouquet of seemingly unrelated problems (6 different “bad” wait events from 3 different classes). To make it more complicated, the exact proportions changed from time to time (e.g. sometimes the index contention was more prominent than other events).
The problem was visible to the naked eye on the OEM Performance Page: all the “bad” waits appeared at almost exactly the same time, which made a coincidence extremely unlikely.
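If OEM is not at hand, a similar picture can be pieced together from ASH data. Here is a minimal sketch (assuming the Diagnostics Pack is licensed; the time window and grouping are purely illustrative):

-- waiting sessions per minute, broken down by wait class;
-- the sample counts are proportional to average active sessions
select trunc(sample_time, 'MI') as sample_minute,
       wait_class,
       count(*)                 as ash_samples
from   dba_hist_active_sess_history
where  sample_time > sysdate - 1          -- e.g. the last 24 hours
and    session_state = 'WAITING'
group  by trunc(sample_time, 'MI'), wait_class
order  by 1, 2;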
There’s not enough evidence in the AWR excerpt posted above (or even in the entire AWR report) to resolve the problem entirely. But it’s enough to narrow down the possibilities and form a credible hypothesis that can be tested. All we need is common sense, some basic understanding of Oracle architecture, and Occam’s razor.
Occam’s razor dictates that we seek the explanation involving as few entities as possible. So the root cause must be related either to one event, or to several different events of the same nature, while the rest must be some sort of “side effects”. Which events are more likely to be linked to the root cause?
It’s hard to imagine a realistic scenario where, for instance, log file sync waits would cause the rest of the symptoms. A session releases locks and other shared resources before going into a log file sync wait (or else any mild case of log file sync waits would quickly escalate into a major performance issue via locking and latching). It’s also difficult to come up with such a scenario for the index contention event. We could link it to buffer busy waits (via leaf block splits), but not to the rest.
But what about free buffer waits? From the documentation (and MOS) we know that this event is seen when something is preventing the DBW process from flushing dirty buffers to disk fast enough to accommodate the demand for free buffers (typically due to I/O problems). Could that be a good fit? Yes, it could.
First of all, it has to do with the database writer, on whose performance the entire database depends. When something so fundamental is affected, a lot of things can go wrong, which would explain why the observed symptoms are so diverse. Another reason is that we can see other write-related events in the list: for example, write complete waits (similar in nature to free buffer waits), but also log buffer space (slow LGWR writes) and even log file sync (which can also be explained by slow LGWR writes)! As for the rest of the symptoms: as was already mentioned, slow writes can explain virtually anything, because they mean that many resources are held longer (including transaction locks and pins on buffers).
How can we make sure we are on the right track? Easy-peasy: check DBW performance, i.e. look at db file parallel write:
The biggest peak on the plot is exactly at the time when the problem was at its worst, and the positions of the smaller peaks correlate extremely well with other occurrences of the same problem.
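One way to get the data behind such a plot is straight from AWR. A minimal sketch (per-snapshot average wait for db file parallel write, assuming a single instance and no instance restarts within the range):

-- snapshot-over-snapshot deltas, since DBA_HIST_SYSTEM_EVENT is cumulative
select s.begin_interval_time,
       round( (e.time_waited_micro
                 - lag(e.time_waited_micro) over (order by e.snap_id)) / 1000
            / nullif(e.total_waits
                 - lag(e.total_waits) over (order by e.snap_id), 0), 1) as avg_wait_ms
from   dba_hist_system_event e
       join dba_hist_snapshot s
         on  s.snap_id         = e.snap_id
         and s.dbid            = e.dbid
         and s.instance_number = e.instance_number
where  e.event_name = 'db file parallel write'
order  by e.snap_id;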
Since there was not enough I/O in the database to explain the observed picture, I raised a ticket with the storage specialists so that they could take a look from their end. And they found what was wrong. As it turned out, the production SAN storage was not 100% isolated from the test boxes: it was isolated at the disk, HBA card etc. level, but it was still sharing the cache. Because of that, massive write activity on the UAT boxes was affecting production.
When a whole mess of seemingly unrelated performance problems emerges out of nowhere at roughly the same time, it’s worth spending some time trying to find the common root cause rather than tackling them one by one. And the best way to find such a common root cause is often to look at the performance of the fundamental background processes (first of all, the database writer), e.g. their wait profiles.
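One quick way to look at a background process’s wait profile on a live system is V$SESSION_EVENT. A minimal sketch (figures are cumulative since instance startup; V$PROCESS.PNAME assumes 11.2 or later, on older versions filter on V$SESSION.PROGRAM instead):

-- wait profile of the database writer and log writer
select p.pname,
       se.event,
       se.total_waits,
       round(se.time_waited_micro / 1000 / nullif(se.total_waits, 0), 1) as avg_wait_ms
from   v$session s
       join v$process p        on p.addr = s.paddr
       join v$session_event se on se.sid = s.sid
where  p.pname in ('DBW0', 'LGWR')
order  by p.pname, se.time_waited_micro desc;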
Here is a plot showing a strong correlation between db file parallel write and log file parallel write average wait times, which should hopefully convince Frits that db file parallel write can, in fact, reflect write latency:
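For the curious, here is a sketch of the kind of query that could produce the data for such a comparison (same snapshot-delta technique as above, with both events side by side; single instance assumed, join DBA_HIST_SNAPSHOT for wall-clock times):

with ev as (
  select e.snap_id,
         e.event_name,
         -- delta of time waited (ms) and waits between consecutive snapshots
         (e.time_waited_micro - lag(e.time_waited_micro)
             over (partition by e.event_name order by e.snap_id)) / 1000 as delta_ms,
         e.total_waits - lag(e.total_waits)
             over (partition by e.event_name order by e.snap_id)         as delta_waits
  from   dba_hist_system_event e
  where  e.event_name in ('db file parallel write', 'log file parallel write')
)
select snap_id,
       round(max(case when event_name = 'db file parallel write'
                      then delta_ms / nullif(delta_waits, 0) end), 1) as dbwr_avg_ms,
       round(max(case when event_name = 'log file parallel write'
                      then delta_ms / nullif(delta_waits, 0) end), 1) as lgwr_avg_ms
from   ev
group  by snap_id
order  by snap_id;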