Occam’s razor

This year was rich in interesting performance issues. This post is about one recent issue that was particularly intriguing because:

– it was intermittent

– it was not associated with any particular database activity

– it tended to appear at a certain time of day (between 2am and 4am database time), though with some variation

– there was no discernible pattern in the days on which the problem appeared

– the performance degradation was due to a whole bunch of various “bad” wait events rather than just one

– there was no correlation between the problem’s appearance and the workload metrics.

This is an example of it:


                                                             Avg                
                                        %Time Total Wait    wait    Waits   % DB
Event                             Waits -outs   Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
db file sequential read      25,418,423     0    106,469       4     31.1   22.9
free buffer waits             1,311,363     0    100,159      76      1.6   21.6
read by other session         7,410,309     0     62,025       8      9.1   13.4
log file sync                   418,838     0     53,045     127      0.5   11.4
buffer busy waits               184,209     0     43,813     238      0.2    9.4
write complete waits                638     0     16,435   25761      0.0    3.5
enq: TX - index contention       17,722     0     13,310     751      0.0    2.9
direct path write               180,155     0      8,257      46      0.2    1.8
log buffer space                 40,891     0      7,302     179      0.0    1.6

So as you can see, it’s a whole bouquet of seemingly unrelated problems (6 different “bad” wait events from 3 different wait classes). To make it more complicated, the exact proportions changed from time to time (e.g. sometimes the index contention was more prominent than the other events).

The problem was visible to the naked eye when watching the database via the OEM Performance Page: all the “bad” waits appeared at almost exactly the same time, which made coincidence extremely unlikely.

There’s not enough evidence in the posted part of the AWR (or the entire AWR report) to resolve the problem entirely. But it’s enough to narrow down the circle of possibilities and form a credible hypothesis that can be tested. All we need is common sense, some basic understanding of Oracle architecture and Occam’s razor.

Occam’s principle dictates that we seek the explanation involving as few entities as possible. So the root cause must be related either to one event, or to several different events of the same nature, while the rest must be some sort of “side effects”. Which events are more likely to be linked to the root cause?

It’s hard to imagine a realistic scenario where, for instance, log file sync waits would cause the rest of the symptoms. A session releases locks and other shared resources before going into a log file sync wait (otherwise any mild case of log file sync waits would quickly escalate into a major performance issue via locking and latching). It’s also difficult to come up with such a scenario for the index contention event. We could link it to buffer busy waits (via leaf block splits), but not to the rest.

But what about free buffer waits? From the documentation (and MOS) we know that this event is seen when something prevents the DBW process from flushing dirty buffers to disk fast enough to accommodate the demand for free space (typically due to I/O problems). Could that be a good fit? Yes it could.

First of all, it has to do with the database writer, on whose performance the entire database depends. When something so fundamental is affected, a lot of things can go wrong, which would explain why the observed symptoms are so diverse. Another reason is that we can see other write-related events there: for example, write complete waits (similar in nature to free buffer waits), but also log buffer space (slow LGWR writes) and even log file sync (which can also be explained by slow LGWR writes)! As for the rest of the symptoms: as already mentioned, slow writes can explain virtually anything, because they mean that many resources are held longer (including transaction locks or pins on buffers).

How can we make sure we are on the right track? Easy-peasy: check DBW performance, i.e. look at db file parallel write:

[Plot: db_file_parallel_write average wait time]

The biggest peak on the plot occurs exactly when the problem was at its worst, and the positions of the smaller peaks correlate extremely well with the other occurrences of the same problem.
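As an aside, here is a minimal sketch of how such a per-interval average can be derived from AWR-style cumulative counters. The snapshot values below are invented for illustration; on a real system they would come from the AWR history view DBA_HIST_SYSTEM_EVENT (TOTAL_WAITS, TIME_WAITED_MICRO):

```python
# Sketch: deriving a per-interval average wait (ms) for a single event
# from AWR-style cumulative counters. The snapshot values below are
# invented for illustration; on a real system they would come from
# DBA_HIST_SYSTEM_EVENT (TOTAL_WAITS, TIME_WAITED_MICRO).

# (snap_id, cumulative total_waits, cumulative time_waited_micro)
snapshots = [
    (100, 1_000_000, 3_000_000_000),   # baseline
    (101, 1_050_000, 3_200_000_000),   # quiet interval: ~4 ms per wait
    (102, 1_060_000, 12_200_000_000),  # problem window: avg wait spikes
]

def interval_avg_wait_ms(snaps):
    """Average wait per event in each interval, from deltas of cumulative counters."""
    result = []
    for prev, cur in zip(snaps, snaps[1:]):
        waits = cur[1] - prev[1]
        micros = cur[2] - prev[2]
        result.append((cur[0], micros / waits / 1000 if waits else 0.0))
    return result

for snap_id, avg_ms in interval_avg_wait_ms(snapshots):
    print(f"snap {snap_id}: avg wait = {avg_ms:.1f} ms")
```

The counters are cumulative, so the spike only becomes visible once you difference consecutive snapshots; plotting the raw totals would smooth it away.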

Since there was not enough I/O in the database itself to explain the observed picture, I raised a ticket with the storage specialists so that they could look from their end. And they found what was wrong. As it turned out, the production SAN storage was not 100% isolated from the test boxes: it was isolated at the disk, HBA card etc. levels, but it was still sharing the cache. Because of that, massive write activity on the UAT boxes was affecting production.

Conclusion

When a whole mess of seemingly unrelated performance problems emerges out of nowhere at roughly the same time, it’s worth spending some time trying to find the common root cause rather than tackling them one by one. And the best way to find such a common root cause is often to look at the performance of the fundamental background processes (first of all, the database writer), e.g. their wait profile.

Update

Here is a plot showing the strong correlation between db file parallel write and log file parallel write average wait times, which should hopefully convince Frits that db file parallel write can, in fact, reflect write latency:

[Plot: background_io_latency]
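For what it’s worth, the strength of such a correlation can be quantified with Pearson’s r computed over per-snapshot average waits. The two series below are invented for illustration (real values would come from AWR history); they mimic two events that spike together:

```python
from math import sqrt

# Sketch: quantifying the correlation between two wait-time series
# with Pearson's r. The per-snapshot averages below are invented for
# illustration; real values would come from AWR history.

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

db_file_parallel_write = [3, 4, 3, 9, 8, 3, 4, 12, 3]    # avg ms per snapshot
log_file_parallel_write = [1, 2, 1, 6, 5, 1, 2, 9, 1]    # avg ms per snapshot

print(f"r = {pearson_r(db_file_parallel_write, log_file_parallel_write):.3f}")
```

An r close to 1 for series like these would back up the visual impression of the plot with a number, though correlation alone of course does not prove a causal direction.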

10 thoughts on “Occam’s razor”

    1. Frits,

      maybe I’m misinterpreting your tone, but you sound condescending and almost rude. It’s like you’re absolutely sure that I could not possibly know what ‘db file parallel write’ means. Not a nice attitude, especially around holidays.

      However, your post does look interesting, I’ll read it in detail when I have more time.

      Best regards,
      Nikolay

  1. Hi Nikolay, that is absolutely not what I meant nor tried to suggest. Sorry if you got that impression. The intention was to raise awareness that the implementation of DBWR IO and the db file parallel write event is far from obvious. I tried not to come across as self-advertising and thus be brief. But that backfired, I guess.

      1. Thank you for your kind words, Nikolay.

        Yes. That’s the presentation I was aiming at. This is the new version I uploaded after UKOUG Tech 14, which has Oracle 12 stuff, and an overview of the implementation of IO in versions 11.2.0.1-12.1.0.2 added.

        The point essentially about ‘db file parallel write’ is that it doesn’t reflect write IO latency at all.

        1. Frits,

          it’s true that db file parallel write cannot be considered as write I/O latency. I should probably have spent some time clarifying that point, but I felt that my blog post was already much longer than I intended it to be, so I skipped that part. Besides, I felt it was kind of obvious that the write latency didn’t literally go up x450, from 4 ms to 1.8 s (if that had happened, the database instance would have crashed or become completely paralyzed).

          So what happened in that particular case was the following: the number of write requests per event went up as well (up to ~180 or 190 requests per write, if I remember correctly). So the write time per request hadn’t changed by as much as x450. But it was still a very noticeable change, from 2-3 ms to 8 ms, and that change was very similar to the change in the average log file parallel write.
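The arithmetic in the paragraph above can be made explicit. A back-of-the-envelope sketch, using the approximate figures quoted in this thread (rounded recollections, not measured data):

```python
# Back-of-the-envelope check using the approximate numbers from this
# thread (all values are rounded recollections, not measured data):
# the average "db file parallel write" wait went from ~4 ms to ~1.8 s,
# but one wait event came to cover ~190 write requests, so the
# per-request time changed far less than the x450 of the raw average.

baseline_wait_ms = 4.0      # avg wait per event before the problem
peak_wait_ms = 1800.0       # avg wait per event at the peak
requests_per_wait = 190     # approx. write requests covered by one wait

apparent_factor = peak_wait_ms / baseline_wait_ms
per_request_ms = peak_wait_ms / requests_per_wait

print(f"apparent change: x{apparent_factor:.0f}")
print(f"approx. per-request write time at peak: {per_request_ms:.1f} ms")
```

With these rounded inputs the per-request figure lands just under 10 ms, the same ballpark as the 2-3 ms to 8 ms change described above, rather than anything like a x450 latency increase.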

          So I guess what I’m saying is that even though db file parallel write time is not the actual write latency, dramatic increase in this metric can be indicative of DBW being under stress.

          If you don’t agree with that, I’d like to show you both db file parallel write and free buffer waits / local write waits / write complete waits on the same plot. Hopefully the correlation between them will convince you. I don’t know how to add an image to a comment in WordPress, so perhaps I could just tweet it to you or something.

          Best regards,
          Nikolay

          1. When you are using AIO, submitting and reaping IOs is done in two steps. If you are on a filesystem, ‘db file async I/O submit’ times the io_submit() system call. If you are on ASM, ‘db file async I/O submit’ times nothing, the io_submit() call is done outside of the wait event (I am working on getting that fixed, that is a bug).

            The reaping of the IOs is what is done in the ‘db file parallel write’ wait, i.e. the io_getevents() call. I am convinced this is known. However, there is no 1:1 relationship between the number of IOs submitted and the number of IOs the DBWR waits for, as timed by ‘db file parallel write’. In my tests (on a simple VM, no load), the number of IOs io_getevents() was waiting for was 25-33% of the number of submitted IOs. I think the number of IOs to wait for is calculated using the same statistics Oracle uses to determine the number of IO slots for direct path asynchronous IO.

            DBWR also reaps IOs outside of the wait event, so calculating the numbers from the wait event arguments will show “missing” IOs.

            This means that the time waited inside “db file parallel write” is, as far as I can tell now, the time Oracle calculated it is willing to wait. My current guess is that this is used as an IO throttle (while waiting for IOs to finish, the DBWR can’t submit more IOs).

            So it might be better to look at the number of IOs done in the statistics, and the CPU time per DBWR.

            This is in contrast to ‘log file parallel write’, where waiting happens for all previously submitted IOs.
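The reaping behaviour described above can be illustrated with a toy model. This is a sketch of the principle only, not Oracle’s actual algorithm: if the wait only has to cover the fastest k completions of a batch, the measured wait tracks the k-th fastest IO rather than the mean (let alone the slowest) IO.

```python
import statistics

# Toy model of the reaping behaviour described above. This illustrates
# the principle only, not Oracle's actual algorithm: if io_getevents()
# only needs the fastest k completions of a batch, the measured wait
# tracks the k-th fastest IO, not the mean latency of the batch.

def wait_for_k_of_batch(latencies_ms, k):
    """Time until k IOs of a batch complete, all issued at t = 0."""
    return sorted(latencies_ms)[k - 1]

batch = [2, 3, 3, 4, 5, 6, 8, 9, 15, 40]   # per-IO latencies, ms
k = max(1, len(batch) // 4)                # wait for ~25% of the batch

print("true mean IO latency:", statistics.mean(batch), "ms")
print("measured wait       :", wait_for_k_of_batch(batch, k), "ms")
```

In this made-up batch the slow outlier (40 ms) pulls the true mean up, while the measured wait only reflects the second-fastest completion, which is exactly why the wait time and the underlying IO latency can diverge.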

            1. So your studies show that db file parallel write doesn’t always reflect DBW I/O write latency. But it is a big leap in logic to go from that to “db file parallel write never reflects DBW write latency”. Please look at the plot in the update to the post (at the bottom of the page). Would you agree that there is a very strong correlation between the two lines? And if so, would you agree that this correlation proves that db file parallel write is indeed indicative of DBW write I/O latency?

  2. I am not sure what you want me to say.

    Yes, I fully agree that ‘db file parallel write’ indicates you are doing IO, or that the DBWR is busy. I also fully agree that your plot shows that the time spent waiting on the two wait events is correlated.

    I do not agree that db file parallel write is an exact measurement of IO latency. But nor is log file parallel write for that matter.

    log file parallel write is the measurement of the slowest IO of the previously submitted batch, including or excluding io_submit() depending on version. It is reasonably exact in the sense that the wait always waits for the IOs just submitted; no IOs are reaped from any batch other than the one just submitted.

    db file parallel write works as previously described. This means that once the database starts doing DBWR IO, it waits and reaps the fastest IOs, then submits another batch, waits and reaps a number of IOs (from whatever batch) that finish, etc. So there is no relationship between the actual IO time and the time spent waiting on it. On top of that, Oracle determines the actual number of IOs it wants to wait for in io_getevents() (= db file parallel write) after each submitted batch of write IOs.

    Just look at the IO logic of the DBWR process: take my gdb macro, throttle IO, then dirty blocks, checkpoint and watch.

    BTW, the above assumes AIO being used and no Exadata. With Exadata, db file parallel write is different once again.

    1. Frits,

      I can see that your position has evolved from “db file parallel write doesn’t reflect I/O latency at all” to “db file parallel write is not an exact measurement of I/O latency”. That’s good enough for me. I never claimed that db file parallel write is an exact measurement of I/O latency. In fact, a few weeks ago I spent a lot of time convincing some people at work that db file parallel write is complicated and should not be taken as a measurement of I/O latency (they wanted to use it to compare SSD vs HDD, because redo was on SSD while the data files were on HDD).

      But nevertheless, even if it is not the most direct and accurate measure of I/O latency, it is often correlated with it, and therefore changes in db file parallel write are a strong indication that something is not right with the write I/O.

      Of course, filesystem I/O stats or SAN I/O stats would have been more helpful. But they are not accessible to the vast majority of DBAs and database developers in real-life conditions. We have to look at db file parallel write not because it’s such a good measure of I/O, but because it’s the best we can get without raising a ticket for another team and waiting a couple of days for them to pick it up and process it.

      So I guess that explains why we are having trouble understanding each other: you’re probably looking at this from a researcher’s point of view, while I’m looking at this from a DBA’s perspective.
