How not to troubleshoot log file sync waits

Introduction

There is a very common mistake in troubleshooting log file sync (LFS) waits: comparing its average time to average log file parallel write (LFPW) and trying to deduce from that whether the root cause of the wait is slow I/O or something else. The fact that this approach is recommended by Oracle itself (e.g. MOS 1376916.1) and many independent experts unfortunately doesn’t make it any less wrong.

Theory

It is well known that averages and ratios can distort the reality (Milsap and Holt even wrote a whole book about it). A few outliers can significantly bias the average, and as James Morle showed, LFS will be affected more than LFPW (and thus the ratio between them will be distorted).  I will briefly summarize his argument here.

When log writer is writing fast, it generally finishes its work for one session before the next session commits, but when it experiences an occasional hiccup, sessions start piling up waiting on LGWR. The figure below illustrates this mechanism by showing 10 consequtive short LFPW causing 10 short LFS, followed by a long LFPW, causing 10 longer LFS events.

lfs_lfpw2

 

You can easily calculate that average LFPW in this example is 1.8 ms, while average LFS is approximately twice as big, 3.7 ms.

I ran some tests to take a closer look at the effect.

Data

I wrote a very simple PL/SQL procedure to perform inserts and commits inside a loop in several parallel threads. I monitored OS performance to make sure that there’s no CPU starvation (and I separately checked LGWR waits from ASH to make sure that it’s wating on I/O and not something else, like I/O or redo copy latch).  I traced both the user sessions and the LGWR session and read the trace files into the database for analysis as described in my previous post to produce fine-grained histograms. Plotted below are results for 1, 10 and 1000 threads. Note that the plots are in bi-logarithmic scale.

lfs_lfpw_1_10_100

For 1 thread, the shape of LFS and LFPW histograms are almost identical, with a small shift about 100 microseconds explained by pre- and post- processing overhead by LGWR. With multiple threads the behavior is different: the difference starts off as the same ~100 us shift, but at larger times the difference between LFS and LFPW grows nearly exponentially and eventually saturates at the number of threads. This of course makes perfect sense: because of the mechanism described above, the longer LFPW is, the more LFS waits of comparable duration it would produce, but the maximum number of LFS per one LFPW of given duration is limited to the number of threads.

As expected, as the number of threads grows, the average for LFS changes much more dramatically (from 300 us to 20 ms) compared to LFPW (from 200 us to 1 ms).

Note that there aren’t any big outliers (e.g. big enough to produce LGWR warnings) in the picture — normal scatter in I/O service times is sufficient to produce the effect.

Discussion

We have seen that in many cases the comparison between the averages would be very misleading. There would be cases when it wouldn’t, too: when LFS waits are either produced by one session or by different sessions, but at non-overlapping times. But in such cases LFS waits probably wouldn’t be much of a problem.

A busy OLTP system that suffers from LFS is more likely to be similar to our simulations with threads = 100.  Imagine that we analyze its behavior, and we only have the averages, without the histogram to tell us why they are so different. We’d be facing a very confusing picture: LFS is 20 ms, but LFPW is only 1 ms — if as per the MOS note the average LFPW represents the I/O part of the LFS, then where did the remaining 19 ms come from? Is it latching? Is it inter-process communication? Weird effects from multiple public redo strands? Or is it CPU? Or a CPU scheduling issue? The best case scenario, the DBA wastes a few hours doing research on the internet and MOS, but worse things can easily happen, ranging from man-days wasted on combatting non-existent issues to ill-advised hardware upgrades.

Or take another scenario: on a similar system (OLTP with 50-100 frequently commiting writing sessions and a bunch of reading sessions) an intermittent I/O problem causes appearance of a small percentage of anomalously long LFPW events. The percentage is small enough so that average LFPW doesn’t change much, but average LFS changes significantly. If the DBA starts the investigation by comparing LFPW/LFS averages he would conclude that “it’s not I/O” and would go on a wild goose chase instead of dealing with the real problem.

In order to protect oneself against this pitfall, it’s always better to look at histograms rather than averages. But if histograms are not available, there are other possibilities as well: e.g. look at number concurrent waits on ‘log file sync’ vs time in ASH (group by sample_time), or if even that is not available, at least compare event counts for LFS and LFPW (if they are very different, this is a clear indication of concurrency, i.e. the averages are meaningless).

 Summary

Contrary to what note MOS 1376916.1 and numerous internet articles based on it state, LFPW/LFS ratio is not helpful for understanding the source of LFS waits. Comparing histograms for the two events would be much more productive, but if they are not available, looking at ASH data for LFS or LFS/LFPW event counts can help.

 

 

 

 

4 thoughts on “How not to troubleshoot log file sync waits

  1. Hi!

    Great site, with interesting thoughts!

    Few suggestions to this post:
    – It would be helpful if you wrote few sentences about the LFS, LFPW events and the relation between them for the beginners like me. (I did my job, and read about them elsewhere:)
    – Please upgrade/update the top figure. For me it was difficult to follow the events. Color coding the two events, and maybe indicating where the commits were issued would greatly help in the understanding of how you calculated the averages and which parts of the arrows contribute to which event.

    That said, it’s still good sience, and teaching, and post.
    Cheeers,
    Vaurob

    1. Hi Vaurob,

      thank you for your comment. The goal of that post was not to teach people about LFS, but rather to help those of them who already know what LFS is avoid a particular mistake of relying on LFS/LFPW averages when trying to understand its root cause. I’m afraid that adding a few definitions still wouldn’t make this post accessible to beginners. I wish I could recommend you a good entry-level text about LFS, but I unfortunately don’t know any. I am hopeful that I’ll get a chance to write one myself soon.

      Regarding the figure: I don’t think that color-coding would help because everything on that picture except for the line on the bottom (LFPW) is related to the same event, log file sync. There are 10 sessions in that example (each represented by it’s own imaginary horizontal line) committing at a constant rate (1 commit per ms), and the LGWR process processes them one by one, but then experiences a 10 ms delay during which sessions waiting on LFS start to pile up. I hope that this clarifies things a bit — but if not, try James Morle’s blog that I mention in the blog.

      Thanks again for your interest in my blog and your comment,

      Best regards,
      Nikolay

Leave a comment