A few years ago I wrote a series of posts about various scenarios of log file sync (LFS) waits. A few days ago I ran into an issue which added yet another scenario to that already diverse collection, and I thought it might be of interest to others in the Oracle community. To set the scene, it's the same 18c (18.10, to be more accurate) 3-node X6-2 Exadata cluster where most of the issues I describe in my blog occur. At a certain moment I noticed LFS go up quite a bit, with log file parallel write (LFPW) also showing signs of an increase, albeit a much more moderate one. I have a script that graphs wait event stats for multiple wait events (e.g. using a regular expression pattern for their names); below you can see a panel summarising average wait times for events containing "log file" or "redo" in their names.
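The script itself isn't shown here, but a query along these lines would produce similar numbers. This is just a minimal sketch, assuming AWR data is available and licensed; the views and columns are standard, and narrowing the snapshot window is left to the reader:

```sql
-- Average wait per snapshot interval (ms) for "log file" / "redo" events,
-- computed as delta(time waited) / delta(wait count) between AWR snapshots
SELECT sn.begin_interval_time,
       e.event_name,
       ROUND( (e.time_waited_micro
                 - LAG(e.time_waited_micro) OVER (PARTITION BY e.event_name,
                       e.instance_number ORDER BY e.snap_id))
            / NULLIF(e.total_waits
                 - LAG(e.total_waits) OVER (PARTITION BY e.event_name,
                       e.instance_number ORDER BY e.snap_id), 0)
            / 1000, 2) AS avg_wait_ms
FROM   dba_hist_system_event e
JOIN   dba_hist_snapshot sn
  ON   sn.snap_id         = e.snap_id
 AND   sn.dbid            = e.dbid
 AND   sn.instance_number = e.instance_number
WHERE  REGEXP_LIKE(e.event_name, 'log file|redo')
ORDER  BY sn.begin_interval_time, e.event_name;
```

Note that a sketch like this doesn't handle counter resets across instance restarts (the deltas would come out negative there), which a production script would need to filter out.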
As you can see, apart from LFS itself, there is only one event really standing out: LGWR wait for redo copy (LFPW has a bump, but it's a moderate, gradual increase rather than the sharp spike seen in LFS). This is a wait experienced by the LGWR background process, and it accounts for the time spent gathering redo information before it's actually flushed to disk (the flush itself is accounted for by LFPW). Note that it would have been just as easy to link this background event to the original symptoms by looking at the LGWR wait profile in ASH; I simply missed it there because of a bug in one of my scripts (so it's a good thing to use multiple sources of diagnostic data to be able to cross-check things).
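For completeness, that ASH cross-check could look something like the sketch below. The time window predicate is an assumption on my part, since the exact timestamps aren't given:

```sql
-- Wait profile of the LGWR background process from AWR-persisted ASH samples
SELECT event,
       COUNT(*) AS samples
FROM   dba_hist_active_sess_history
WHERE  session_type = 'BACKGROUND'
AND    program LIKE '%(LGWR)%'                          -- typical program naming
AND    sample_time > SYSTIMESTAMP - INTERVAL '1' HOUR   -- assumed problem window
GROUP  BY event
ORDER  BY samples DESC;
```

With the LFS spike in progress, one would expect LGWR wait for redo copy to dominate the sample counts here.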
So why did LGWR wait for redo copy go up all of a sudden? As it happened, around the same time we saw the increase in LFS waits, we were also made aware of a different issue: a huge number of rather large trace files overwhelming the diagnostic filesystem on the compute node. The trace files did not contain any mention of an error; rather, they mentioned QESMA and then contained a long list of various numbers for statistics of a partitioned table. Searching MOS for relevant keywords returned some articles mentioning ORA-600 errors due to attempts to fetch a table row using an invalid rowid from an index, and searching the alert log revealed a bunch of ORA-600 [12406] errors, which confirmed that we were hitting bug 30576853 (Doc ID 30576853.8).
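On recent releases the alert log can be searched from SQL as well as with grep on the OS side; a minimal sketch (the exact message pattern is my assumption of how the error would appear):

```sql
-- Search the XML alert log, exposed via V$DIAG_ALERT_EXT, for ORA-600 [12406]
SELECT originating_timestamp,
       message_text
FROM   v$diag_alert_ext
WHERE  message_text LIKE '%ORA-00600%[12406]%'
ORDER  BY originating_timestamp;
```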
The MOS article didn't suggest any workaround, and the patches were all for 19.x versions, but a developer came up with the idea of rebuilding the indexes on the affected table, and it made the desired difference. Further investigation showed that the indexes had been compressed not long before the ORA-600 errors and excessive trace files appeared, so that was probably what triggered the bug in the first place.
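The rebuild itself is a one-liner per index; to cover every index on the table, something like this sketch could generate the statements (the owner and table names are placeholders, not from the actual system):

```sql
-- Generate rebuild statements for all indexes on the affected table
SELECT 'ALTER INDEX "' || owner || '"."' || index_name
       || '" REBUILD ONLINE;' AS ddl
FROM   dba_indexes
WHERE  table_owner = 'APP_OWNER'       -- placeholder schema
AND    table_name  = 'AFFECTED_TABLE'  -- placeholder table
ORDER  BY index_name;
```

REBUILD ONLINE keeps the table available for DML during the rebuild, which matters on a busy production system.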
But back to LFS waits: how were they related to the trace file generation? After all, this is an Exadata system, and the redo log files are supposed to be on the storage cells, which weren't particularly affected. I can only speculate that even though the actual redo write I/O should mostly depend on storage cell performance, the preparatory step of gathering redo information to be written may also involve some local I/O on the compute node itself. And all those excessive trace files being generated did wreak havoc on the local I/O there.
No doubt the problem was further exacerbated by housekeeping jobs trying to free space as the diagnostic filesystem filled up, which probably explains the delay between the time the ORA-600 trace files started being generated and the appearance of the LFS symptoms.
Hopefully this post illustrates several points:
- Excessive trace file generation is more than a simple nuisance; it can cause serious issues
- Compute node I/O performance is almost as important as cell I/O performance on Exadata and needs constant monitoring
- Cross-checking diagnostic information from multiple sources, e.g. DBA_HIST_SYSSTAT, DBA_HIST_SYSTEM_EVENT and DBA_HIST_ACTIVE_SESS_HISTORY, can help avoid missing a problem.