Another piece of good news — Oracle has opened a bug for yet another anomaly I reported earlier in my blog: row-by-row processing of bulk DML when the block size on the target table is less than the default 8k. So it’s now officially bug 20039770 – “DML SLOW WITH 4K BLOCK SIZE VS 8K BLOCK SIZE”. Their bug description seems a bit off (because 2K shows the same behavior as 4K, and 16/32K as 8K, so it’s not really a case of “4K VS 8K”), but I’m sure they update it accordingly in the course of the investigation by the development team. Unlike the log parallelism bug, this one is not yet open to public (maybe because they don’t have anything to put into the bug note yet, not even a workaround), I’ll post an update when that changes.
If you ever work with small block sizes then I highly recommend you familiarize yourself with this bug, because its impact on DML performance is quite big (I observed x3-x7 effect, but it could be even larger in a different setup). It makes rewrites of row-by-row logic via bulk DML pretty much useless, so if I had known about it earlier, it would have saved me a couple of weeks!
I will post another blog when Oracle concludes their investigation.
A few weeks ago I wrote a post about log parallelism causing excessive log file sync waits. Ever since, I’m finding more and more examples how this bug affects OLTP and hybrid databases (and even some data warehouses)! For example, my current employer is a large organization that has several thousand databases (set up at different time by different teams), and according to the studies I conducted on a sample of a few dozen databases, no less than 15-20% of the total Oracle database real estate have this problem. In several cases the scale of the problem is simply scary: e.g. I found a database that spends 43% of its time on log file sync (!) on the average, reaching up to 78% (!!) during peak workloads. Based on the feedback to my posts here and on LinkedIn, the situation in other organizations is no better. It looks like “log file sync” is a very wide-spread problem, and log parallelism is one of the main causes (if not the main cause).
This is why I am happy to announce that Oracle development has created a bug and assigned it a number (19959089). The bug note doesn’t have much at this stage, but at least it does list setting “_log_parallelism_max” to 1 as a workaround, which should encourage more people to test and apply this solution. It would probably still be necessary to get the Oracle support to okay changing the underscore parameter, but a reference to this bug should make it much simpler to obtain. Hopefully, when the bug investigation is concluded, there will be a detailed official note and/or a patch available that would eliminate the need to raise an SR. I’ll post a blog about it when this happens.
Log buffer space is a simple, yet frequently misunderstood wait event. The main reason for that is probably its name. It sounds as if it points immediately to the answer: if space in the log buffer is the issue, then increasing it surely should resolve it. Well, unfortunately even though log buffer space is simple, it’s not that simple.
Two most popular causes for log file sync waits are redo filesystem I/O and CPU (or CPU scheduling). I’d like to share an example of a third scenario.
A few days ago, I noticed a weird spike in log file sync average duration on a production 11.2 database:
round(time_delta/1e3/nullif(waits_delta,0), 1) avg_wait_ms,
e.total_waits - lag(e.total_waits) over (partition by e.event_name order by e.snap_id) waits_delta,
e.time_waited_micro - lag(e.time_waited_micro) OVER (PARTITION BY e.event_name ORDER BY e.snap_id) time_delta
from dba_hist_system_event e,
where e.snap_id = sn.snap_id
AND e.event_name = 'log file sync'
WHERE ev.time_delta > 0
order by time desc
Chasing cost efficiency, business often cuts back on money spent on UAT boxes used for performance testing. More often than not, this is a bad-decision, because the only thing worse than not having a UAT environment is having a UAT environment that is nothing like production. It gives a false sense of security while exposing your application to all sorts of nasty surprises. In this post I tried to summarize a few typical configuration differences between UAT and production which can affect performance test results in a major way.
Log parallelism is an optimization introduced in 9.2 that reduces latch contention due to redo copy to the log buffer by enabling multiple public redo buffers (or “strands”). In many cases, however, it can cause a massive degradation of commit performance (“log file sync” wait increase). In this blog post, I will describe the mechanism, illustrate it with test results, and discuss solutions. Tests were performed on several different 11gR2 databases on Solaris servers.
In my recent post I showed how log file sync (LFS) and log file parallel write (LFPW) look for normal systems. I think it would also be interesting to compare that to the situation when LGWR does not have enough CPU.
I happen to have collected LGWR and database-level trace files for a 22.214.171.124 database on a Solaris 10 server which was under serious pressure (50 threads mostly inserting and committing data, only 32 CPUs). The AWR showed significant OS_CPU_WAIT_TIME (comparable to BUSY_TIME and much larger than IDLE_TIME) so I know for sure that CPU was an issue. And here is what LFS and LFPW histograms plotted from the trace file (as described here) looked like:
Not every commit results in a redo write. This is because there are multiple optimizations (some controlled by the user e.g. with COMMIT_LOGGING parameter, some automatic) that aim at reducing the number of redo writes caused by commits by grouping redo records together. Such group or “piggyback” commits are important for understanding log file sync waits and various statistics around it. In particular, “piggyback” commits play a key role when many sessions commit concurrently at a high rate, as described in my previous post. I made myself a little demo to actually see this mechanism in work with my own eyes. I think it could be of interest for others, so I’m sharing it here. Since the demo involves stopping and resuming background process, I wouldn’t recommend running it on anything other than a designated private sandbox environment.
There is a very common mistake in troubleshooting log file sync (LFS) waits: comparing its average time to average log file parallel write (LFPW) and trying to deduce from that whether the root cause of the wait is slow I/O or something else. The fact that this approach is recommended by Oracle itself (e.g. MOS 1376916.1) and many independent experts unfortunately doesn’t make it any less wrong.
SQL trace file provide the highest level of detail possible about SQL execution. The problem with that information is converting it to a convenient format for further analysis. One very good solution is parsetrc tool by Kyle Hailey written in Perl. It gives high-resolution histograms, I/O transfer rates as a function of time, and other very useful info. Unfortunately, I myself am not a Perl expert, so it’s a bit difficult for me to customize this tool when I need something slightly different from defaults (e.g. change histogram resolution, look at events not hardcoded into the script etc.). Another limitation is that since the tool is external to the database, you can’t join the data anything else (like ASH queries). So I found another solution for raw trace file analysis: external tables + regexp queries.