Log buffer space

14 Oct

Introduction

Log buffer space is a simple, yet frequently misunderstood wait event. The main reason for that is probably its name. It sounds as if it points straight to the answer: if space in the log buffer is the issue, then increasing it surely should resolve it. Well, unfortunately, even though log buffer space is simple, it’s not that simple.
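For orientation, here is a minimal sketch (my addition, not taken from the post) of how one might check the configured log buffer size and the waits accumulated against it; the views and the LOG_BUFFER parameter are standard, but treat the queries as illustrative only.

-- configured log buffer size in bytes (the LOG_BUFFER parameter)
select value log_buffer_bytes
from v$parameter
where name = 'log_buffer';

-- accumulated "log buffer space" waits since instance startup
select total_waits,
       round(time_waited_micro/1e6) total_seconds
from v$system_event
where event = 'log buffer space';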
Continue reading

Log file sync from neither I/O nor CPU

8 Oct

The two most common causes of log file sync waits are redo filesystem I/O and CPU (or CPU scheduling). I’d like to share an example of a third scenario.

A few days ago, I noticed a weird spike in log file sync average duration on a production 11.2 database:

select time,
       snap_id,
       round(time_delta/1e3/nullif(waits_delta,0), 1) avg_wait_ms,
       waits_delta num_of_waits,
       round(time_delta/1e6) total_seconds
from
(
  select sn.snap_id,
         sn.begin_interval_time time,
         e.total_waits - lag(e.total_waits) over (partition by e.event_name order by e.snap_id) waits_delta,
         e.time_waited_micro - lag(e.time_waited_micro) over (partition by e.event_name order by e.snap_id) time_delta
  from dba_hist_system_event e,
       dba_hist_snapshot sn
  where e.snap_id = sn.snap_id
  and e.event_name = 'log file sync'
) ev
where ev.time_delta > 0
order by time desc;

Continue reading

Lies, damned lies and non production-like performance testing

24 Sep

Chasing cost efficiency, business often cuts back on money spent on UAT boxes used for performance testing. More often than not, this is a bad decision, because the only thing worse than not having a UAT environment is having a UAT environment that is nothing like production. It gives a false sense of security while exposing your application to all sorts of nasty surprises. In this post I tried to summarize a few typical configuration differences between UAT and production which can affect performance test results in a major way.
Continue reading

High log file sync waits? Check log parallelism!

22 Sep

Log parallelism is an optimization introduced in 9.2 that reduces latch contention during redo copy into the log buffer by enabling multiple public redo buffers (or “strands”). In many cases, however, it can cause a massive degradation of commit performance (an increase in “log file sync” waits). In this blog post, I will describe the mechanism, illustrate it with test results, and discuss solutions. Tests were performed on several different 11gR2 databases on Solaris servers.
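As a starting point for checking your own system, here is a minimal sketch (my addition, not code from the post) of how one might inspect the redo strand setup. Hidden parameters and latch internals are undocumented and require elevated privileges, so treat this purely as illustrative.

-- upper limit on the number of public redo strands (hidden parameter, query as SYS)
select p.ksppinm parameter, v.ksppstvl value
from x$ksppi p, x$ksppcv v
where p.indx = v.indx
and p.ksppinm = '_log_parallelism_max';

-- redo allocation child latches: roughly one per redo strand (public and private)
select count(*) redo_allocation_latches
from v$latch_children
where name = 'redo allocation';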

Continue reading

CPU-starved LGWR

11 Sep

In my recent post I showed what log file sync (LFS) and log file parallel write (LFPW) look like on normal systems. I think it would also be interesting to compare that to the situation when LGWR does not have enough CPU.

I happen to have collected LGWR and database-level trace files for an 11.2.0.3 database on a Solaris 10 server which was under serious pressure (50 threads mostly inserting and committing data, only 32 CPUs). The AWR report showed significant OS_CPU_WAIT_TIME (comparable to BUSY_TIME and much larger than IDLE_TIME), so I knew for sure that CPU was an issue. And here is what the LFS and LFPW histograms plotted from the trace file (as described here) looked like:

 

[Figure: log file sync and log file parallel write wait histograms under CPU starvation]

Continue reading
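A side note on the AWR CPU figures mentioned above: BUSY_TIME, IDLE_TIME and OS_CPU_WAIT_TIME are host-level counters that AWR captures in DBA_HIST_OSSTAT. The following is a minimal sketch of my own, not code from the post; the counters are cumulative, so it is the deltas between two snapshots that matter, and the bind variable is just a placeholder.

-- cumulative host CPU counters for one AWR snapshot
select stat_name, value
from dba_hist_osstat
where snap_id = :snap_id
and stat_name in ('NUM_CPUS', 'BUSY_TIME', 'IDLE_TIME', 'OS_CPU_WAIT_TIME');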

Piggyback commits

10 Sep

Not every commit results in a redo write. This is because there are multiple optimizations (some controlled by the user, e.g. with the COMMIT_LOGGING parameter, some automatic) that aim at reducing the number of redo writes caused by commits by grouping redo records together. Such group or “piggyback” commits are important for understanding log file sync waits and various statistics around them. In particular, “piggyback” commits play a key role when many sessions commit concurrently at a high rate, as described in my previous post. I made myself a little demo to actually see this mechanism at work with my own eyes. I think it could be of interest for others, so I’m sharing it here. Since the demo involves stopping and resuming a background process, I wouldn’t recommend running it on anything other than a designated private sandbox environment.
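A quick way to observe the effect without running the full demo is to compare the commit and redo write counters; here is a minimal sketch (my addition, not the demo from the post). On a busy system with many concurrent commits, “user commits” typically grows much faster than “redo writes”.

-- with group (“piggyback”) commits, many user commits can share a single redo write
select name, value
from v$sysstat
where name in ('user commits', 'redo writes');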

Continue reading

How not to troubleshoot log file sync waits

9 Sep

Introduction

There is a very common mistake in troubleshooting log file sync (LFS) waits: comparing its average time to the average log file parallel write (LFPW) time and trying to deduce from that whether the root cause of the wait is slow I/O or something else. The fact that this approach is recommended by Oracle itself (e.g. MOS 1376916.1) and by many independent experts unfortunately doesn’t make it any less wrong.
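For concreteness, the comparison in question is typically made with something like the query below (an illustrative sketch of mine, not code from the post); the point of the post is that this ratio by itself does not tell you where the log file sync time is actually going.

-- average LFS vs LFPW wait times since instance startup: the comparison
-- commonly (and misleadingly) used as a diagnostic
select event,
       round(time_waited_micro/1e3/nullif(total_waits,0), 1) avg_wait_ms
from v$system_event
where event in ('log file sync', 'log file parallel write');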

Continue reading

Querying trace files

8 Sep

SQL trace files provide the highest level of detail possible about SQL execution. The problem with that information is converting it into a convenient format for further analysis. One very good solution is the parsetrc tool by Kyle Hailey, written in Perl. It gives high-resolution histograms, I/O transfer rates as a function of time, and other very useful info. Unfortunately, I myself am not a Perl expert, so it’s a bit difficult for me to customize this tool when I need something slightly different from the defaults (e.g. changing the histogram resolution, looking at events not hardcoded into the script, etc.). Another limitation is that since the tool is external to the database, you can’t join the data to anything else (like ASH queries). So I found another solution for raw trace file analysis: external tables + regexp queries.
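To give an idea of the approach, here is a minimal sketch (a simplified illustration of mine, not the post’s actual code) of an external table over a trace file plus a regexp query pulling event names and elapsed times out of WAIT lines. The directory path, file name and access parameters are placeholders and will likely need adjusting.

-- directory object pointing at the trace file location (path is a placeholder)
create or replace directory trace_dir as '/u01/app/oracle/diag/rdbms/mydb/mydb/trace';

-- one external table row per trace file line
create table trace_ext (
  line varchar2(4000)
)
organization external (
  type oracle_loader
  default directory trace_dir
  access parameters (
    records delimited by newline
    fields (line char(4000))
  )
  location ('mydb_ora_12345.trc')  -- placeholder trace file name
)
reject limit unlimited;

-- event name and elapsed time (microseconds) from WAIT lines
select regexp_substr(line, 'nam=''([^'']+)''', 1, 1, null, 1) event,
       to_number(regexp_substr(line, 'ela= *(\d+)', 1, 1, null, 1)) ela_us
from trace_ext
where line like 'WAIT #%';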

Continue reading

Block size and multirow DML performance

19 Aug

While 8k is the default block size, Oracle supports other block sizes, too. Smaller block sizes (most commonly 4k) are encouraged for OLTP workloads to reduce block contention, and an even smaller block size, 2k, is recommended for databases running on “Advanced Format” (or “4K”) storage. Oracle documentation warns us of possible implications when storing larger rows in 2k/4k blocks, such as potentially larger space overhead due to metadata, or even the possibility of row chaining. What it doesn’t warn us about is that there is also a big difference in the way DML operations work, which has very serious implications for their performance.
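For context, using a non-default block size requires a matching buffer cache and a tablespace created with that block size; here is a minimal sketch (my addition, not from the post), with the cache size, datafile path and sizes as placeholders.

-- a buffer cache for the non-default block size must exist first
alter system set db_2k_cache_size = 64m;

-- tablespace using a 2k block size
create tablespace ts_2k
  datafile '/u01/oradata/mydb/ts_2k01.dbf' size 512m
  blocksize 2k;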

Continue reading

Peeling through layers of cache

11 Aug

Nowadays, data in databases is wrapped in many layers of cache: result cache, buffer cache, OS page cache, storage hardware cache… They greatly improve performance, but they also make it less stable and harder to predict. And when I/O performance takes a turn for the worse, one has to go through multiple layers of cache trying to understand what went wrong and why. I had such a case not too long ago.

The incident took place on an Oracle 11.2.0.3 database running on a Solaris 10 server. The first symptom was a drop of about 30 percent in the throughput of one of the batch processes in the database. Since the process was spending more than half of its time doing I/O (as ASH indicated), it was easy to establish that the slowdown was linked to increased db file sequential read times by plotting average wait times from DBA_HIST_SYSTEM_EVENT (the SQL for this and other queries in this post can be found below, in the Appendix):

Continue reading
