There is a very common mistake in troubleshooting log file sync (LFS) waits: comparing its average duration to the average duration of log file parallel write (LFPW), and trying to deduce from that whether the root cause of the wait is slow I/O or something else. The fact that this approach is recommended by Oracle itself (e.g. MOS 1376916.1) and by many independent experts unfortunately doesn’t make it any less wrong.
It is well known that averages and ratios can distort reality (Millsap and Holt even wrote a whole book about it). A few outliers can significantly bias the average, and as James Morle showed, LFS is affected more than LFPW (and thus the ratio between them is distorted). I will briefly summarize his argument here.
When the log writer is writing fast, it generally finishes its work for one session before the next session commits, but when it experiences an occasional hiccup, sessions start piling up waiting on LGWR. The figure below illustrates this mechanism by showing 10 consecutive short LFPW waits causing 10 short LFS waits, followed by a long LFPW, which causes 10 longer LFS events.
You can easily calculate that the average LFPW in this example is 1.8 ms, while the average LFS is approximately twice as large, at 3.7 ms.
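The arithmetic can be reproduced with a plausible set of timings consistent with the figure (the exact values here are my assumptions: 1 ms short writes, one 10 ms hiccup, and ten sessions committing at 1 ms intervals during it):

```python
# Hypothetical timings illustrating the piling-up effect
# (assumed values, not taken from an actual trace).
short_write_ms = 1.0          # ten fast LGWR writes
long_write_ms = 10.0          # one slow write (the "hiccup")

# Each fast write completes one session's commit: LFS ~= LFPW for those.
lfpw = [short_write_ms] * 10 + [long_write_ms]
lfs_short = [short_write_ms] * 10

# During the 10 ms slow write, ten sessions commit at 1 ms intervals;
# each must wait out the rest of the slow write, then its own 1 ms write.
lfs_long = [(long_write_ms - t) + short_write_ms for t in range(10)]

avg = lambda xs: sum(xs) / len(xs)
print(round(avg(lfpw), 2))                  # ~1.82 ms average LFPW
print(round(avg(lfs_short + lfs_long), 2))  # ~3.75 ms average LFS
```

Eleven writes thus produce twenty sync waits, and the single slow write is counted ten times on the LFS side: that over-weighting is the whole effect.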
I ran some tests to take a closer look at the effect.
I wrote a very simple PL/SQL procedure that performs inserts and commits inside a loop, in several parallel threads. I monitored OS performance to make sure there was no CPU starvation (and I separately checked LGWR waits in ASH to make sure it was waiting on I/O and not something else, such as a redo copy latch). I traced both the user sessions and the LGWR session, and read the trace files into the database for analysis as described in my previous post to produce fine-grained histograms. Plotted below are the results for 1, 10 and 1000 threads. Note that the plots are on a bi-logarithmic (log-log) scale.
For 1 thread, the shapes of the LFS and LFPW histograms are almost identical, apart from a small shift of about 100 microseconds explained by LGWR’s pre- and post-processing overhead. With multiple threads the behavior is different: the difference starts off as the same ~100 us shift, but at larger wait times the ratio of LFS to LFPW counts grows nearly exponentially and eventually saturates at the number of threads. This of course makes perfect sense: because of the mechanism described above, the longer an LFPW is, the more LFS waits of comparable duration it produces, but the maximum number of LFS waits per LFPW of a given duration is limited by the number of threads.
As expected, as the number of threads grows, the average LFS changes much more dramatically (from 300 us to 20 ms) than the average LFPW (from 200 us to 1 ms).
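This scaling can be illustrated with a toy closed-loop simulation (a sketch under my own assumptions: ~1 ms exponential think times, lognormal write times with ordinary scatter, group commit at write boundaries; this is not the actual test harness from the post):

```python
import random

def simulate(n_threads, n_writes=20000, seed=3):
    """Toy model: n_threads sessions each commit, think ~1 ms, repeat;
    LGWR writes back to back, each write servicing every commit pending
    at its start. All timing parameters are assumptions."""
    random.seed(seed)
    next_commit = [random.expovariate(1.0) for _ in range(n_threads)]
    t, lfpw, lfs = 0.0, [], []
    for _ in range(n_writes):
        t = max(t, min(next_commit))          # LGWR idles until a commit arrives
        d = random.lognormvariate(-1.6, 0.5)  # write duration (= LFPW), ~0.23 ms mean
        lfpw.append(d)
        for i, c in enumerate(next_commit):
            if c <= t:                        # session i is waiting on this write
                lfs.append(t + d - c)         # its LFS = queueing delay + write time
                next_commit[i] = t + d + random.expovariate(1.0)  # think ~1 ms
        t += d
    return sum(lfpw) / len(lfpw), sum(lfs) / len(lfs)

for n in (1, 10, 100):
    w, s = simulate(n)
    print(n, round(w, 3), round(s, 3))  # LFS/LFPW gap widens with thread count
```

With one thread the two averages coincide, since every sync wait is exactly one write; with many threads each slow write is counted many times on the LFS side, and the LFS average inflates even though the write-time distribution is unchanged.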
Note that there aren’t any big outliers (e.g. big enough to produce LGWR warnings) in the picture — normal scatter in I/O service times is sufficient to produce the effect.
We have seen that in many cases comparing the averages would be very misleading. There are also cases where it wouldn’t be: when LFS waits are produced by a single session, or by different sessions at non-overlapping times. But in such cases LFS waits probably wouldn’t be much of a problem anyway.
A busy OLTP system that suffers from LFS is more likely to resemble our simulations with many threads. Imagine that we analyze its behavior and we only have the averages, without the histograms to tell us why they are so different. We’d be facing a very confusing picture: LFS is 20 ms, but LFPW is only 1 ms. If, as per the MOS note, the average LFPW represents the I/O part of LFS, then where did the remaining 19 ms come from? Is it latching? Inter-process communication? Weird effects from multiple public redo strands? CPU? A CPU scheduling issue? In the best-case scenario the DBA wastes a few hours researching on the internet and MOS, but worse things can easily happen, ranging from man-days wasted combating non-existent issues to ill-advised hardware upgrades.
Or take another scenario: on a similar system (OLTP with 50-100 frequently committing writer sessions and a bunch of reader sessions), an intermittent I/O problem causes a small percentage of anomalously long LFPW events. The percentage is small enough that the average LFPW doesn’t change much, but the average LFS changes significantly. If the DBA starts the investigation by comparing the LFPW/LFS averages, he will conclude that “it’s not I/O” and go on a wild goose chase instead of dealing with the real problem.
To protect yourself against this pitfall, it’s always better to look at histograms rather than averages. If histograms are not available, there are other options as well: e.g. look at the number of concurrent waits on ‘log file sync’ over time in ASH (group by sample_time), or, if even that is not available, at least compare the event counts for LFS and LFPW (if they are very different, this is a clear indication of concurrency, i.e. the averages are meaningless).
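The ASH concurrency check can be sketched as follows (the sample data here is hypothetical; in practice you would run the equivalent GROUP BY query against v$active_session_history):

```python
from collections import Counter

# Hypothetical ASH-style samples: (sample_time_s, session_id, event).
# At t=3, five sessions wait on 'log file sync' simultaneously.
samples = [
    (1, 101, 'log file sync'),
    (2, 101, 'db file sequential read'),
    (3, 101, 'log file sync'), (3, 102, 'log file sync'),
    (3, 103, 'log file sync'), (3, 104, 'log file sync'),
    (3, 105, 'log file sync'),
    (4, 102, 'log file sync'),
]

# Same idea as: SELECT sample_time, COUNT(*)
#               FROM v$active_session_history
#               WHERE event = 'log file sync'
#               GROUP BY sample_time;
waiters_per_sample = Counter(t for t, _, e in samples if e == 'log file sync')
peak = max(waiters_per_sample.values())
print(peak)   # 5 concurrent LFS waiters in one sample: averages are suspect
```

A sustained count well above 1 means LFS waits overlap, i.e. one slow write is being counted many times, and the average-to-average comparison tells you nothing.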
Contrary to what MOS note 1376916.1 and numerous internet articles based on it state, the LFPW/LFS ratio is not helpful for understanding the source of LFS waits. Comparing histograms for the two events is much more productive, but if they are not available, looking at ASH data for LFS, or at the LFS/LFPW event counts, can help.