Last year, I spent some time researching redo log related performance problems, which resulted in a mini-series, including one post devoted specifically to one previously unknown scenario of excessive log file sync waits. I am happy to announce that a service request opened on the back of this research resulted in the MOS note on “log file parallel write” wait event (Doc ID 34583.1) having been updated with a general description of this scenario and factors that may contribute to it. Unfortunately, more specific information regarding this issue has been put to an internal Oracle note because of the limitations that concern underscore parameters. So if you are dealing with log file sync (or log buffer space) issues, then I strongly recommend to log in to MOS and familiarize yourself with the updated version of the note as soon as possible.
Let’s consider a hypothetical scenario. Suppose you have a process A that you want to run faster. This process doesn’t commit (much), so it doesn’t wait on log file sync. However, there is another multi-threaded process, B, that commits very frequently, and spends a lot of time on “log file sync”. You don’t care about the process B, your only goal is to make A go faster. After exhausting your tuning arsenal (to no avail), you go to a production DBA. He looks at the AWR report and says “you’re committing too much”. “Look how much time the system spends on log file sync”. You tell him that the process A doesn’t commit much and doesn’t suffer from log file sync waits, but the DBA insists: “Even so, reducing commits would improve the database performance health in general, and by doing so it would benefit your process as well. Besides, getting rid of all that log file sync noise would help us see the problem with process A more clearly”. You are convinced, and after spending some time with the code, you find a bunch of unnecessary commits inside loops. You remove them. You reduce the commit rate per second by several orders of magnitude! Your database is much healthier now! And your process A will now run… almost four times slower than it did before.
If the end of this story surprises you, it really shouldn’t. Let me show you AWR exceprts that simulate this situation:
A few weeks ago I wrote a post about log parallelism causing excessive log file sync waits. Ever since, I’m finding more and more examples how this bug affects OLTP and hybrid databases (and even some data warehouses)! For example, my current employer is a large organization that has several thousand databases (set up at different time by different teams), and according to the studies I conducted on a sample of a few dozen databases, no less than 15-20% of the total Oracle database real estate have this problem. In several cases the scale of the problem is simply scary: e.g. I found a database that spends 43% of its time on log file sync (!) on the average, reaching up to 78% (!!) during peak workloads. Based on the feedback to my posts here and on LinkedIn, the situation in other organizations is no better. It looks like “log file sync” is a very wide-spread problem, and log parallelism is one of the main causes (if not the main cause).
This is why I am happy to announce that Oracle development has created a bug and assigned it a number (19959089). The bug note doesn’t have much at this stage, but at least it does list setting “_log_parallelism_max” to 1 as a workaround, which should encourage more people to test and apply this solution. It would probably still be necessary to get the Oracle support to okay changing the underscore parameter, but a reference to this bug should make it much simpler to obtain. Hopefully, when the bug investigation is concluded, there will be a detailed official note and/or a patch available that would eliminate the need to raise an SR. I’ll post a blog about it when this happens.
Log buffer space is a simple, yet frequently misunderstood wait event. The main reason for that is probably its name. It sounds as if it points immediately to the answer: if space in the log buffer is the issue, then increasing it surely should resolve it. Well, unfortunately even though log buffer space is simple, it’s not that simple.
Continue reading “Log buffer space”
Two most popular causes for log file sync waits are redo filesystem I/O and CPU (or CPU scheduling). I’d like to share an example of a third scenario.
A few days ago, I noticed a weird spike in log file sync average duration on a production 11.2 database:
select time, snap_id, round(time_delta/1e3/nullif(waits_delta,0), 1) avg_wait_ms, waits_delta num_of_waits, round(time_delta/1e6) total_seconds from ( select sn.snap_id, sn.begin_interval_time time, e.total_waits - lag(e.total_waits) over (partition by e.event_name order by e.snap_id) waits_delta, e.time_waited_micro - lag(e.time_waited_micro) OVER (PARTITION BY e.event_name ORDER BY e.snap_id) time_delta from dba_hist_system_event e, dba_hist_snapshot sn where e.snap_id = sn.snap_id AND e.event_name = 'log file sync' ) ev WHERE ev.time_delta > 0 order by time desc
Log parallelism is an optimization introduced in 9.2 that reduces latch contention due to redo copy to the log buffer by enabling multiple public redo buffers (or “strands”). In many cases, however, it can cause a massive degradation of commit performance (“log file sync” wait increase). In this blog post, I will describe the mechanism, illustrate it with test results, and discuss solutions. Tests were performed on several different 11gR2 databases on Solaris servers.
In my recent post I showed how log file sync (LFS) and log file parallel write (LFPW) look for normal systems. I think it would also be interesting to compare that to the situation when LGWR does not have enough CPU.
I happen to have collected LGWR and database-level trace files for a 126.96.36.199 database on a Solaris 10 server which was under serious pressure (50 threads mostly inserting and committing data, only 32 CPUs). The AWR showed significant OS_CPU_WAIT_TIME (comparable to BUSY_TIME and much larger than IDLE_TIME) so I know for sure that CPU was an issue. And here is what LFS and LFPW histograms plotted from the trace file (as described here) looked like:
Not every commit results in a redo write. This is because there are multiple optimizations (some controlled by the user e.g. with COMMIT_LOGGING parameter, some automatic) that aim at reducing the number of redo writes caused by commits by grouping redo records together. Such group or “piggyback” commits are important for understanding log file sync waits and various statistics around it. In particular, “piggyback” commits play a key role when many sessions commit concurrently at a high rate, as described in my previous post. I made myself a little demo to actually see this mechanism in work with my own eyes. I think it could be of interest for others, so I’m sharing it here. Since the demo involves stopping and resuming background process, I wouldn’t recommend running it on anything other than a designated private sandbox environment.
There is a very common mistake in troubleshooting log file sync (LFS) waits: comparing its average time to average log file parallel write (LFPW) and trying to deduce from that whether the root cause of the wait is slow I/O or something else. The fact that this approach is recommended by Oracle itself (e.g. MOS 1376916.1) and many independent experts unfortunately doesn’t make it any less wrong.