Lies, damned lies and non production-like performance testing

24 Sep

Chasing cost efficiency, businesses often cut back on money spent on UAT boxes used for performance testing. More often than not, this is a bad decision, because the only thing worse than not having a UAT environment is having a UAT environment that is nothing like production. It gives a false sense of security while exposing your application to all sorts of nasty surprises. In this post I summarize a few typical configuration differences between UAT and production which can affect performance test results in a major way.
Continue reading

High log file sync waits? Check log parallelism!

22 Sep

Log parallelism is an optimization introduced in 9.2 that reduces latch contention due to redo copy to the log buffer by enabling multiple public redo buffers (or “strands”). In many cases, however, it can cause a massive degradation of commit performance (“log file sync” wait increase). In this blog post, I will describe the mechanism, illustrate it with test results, and discuss solutions. Tests were performed on several different 11gR2 databases on Solaris servers.

Continue reading

CPU-starved LGWR

11 Sep

In my recent post I showed how log file sync (LFS) and log file parallel write (LFPW) look for normal systems. I think it would also be interesting to compare that to the situation when LGWR does not have enough CPU.

I happen to have collected LGWR and database-level trace files for a database on a Solaris 10 server which was under serious pressure (50 threads mostly inserting and committing data, only 32 CPUs). The AWR showed significant OS_CPU_WAIT_TIME (comparable to BUSY_TIME and much larger than IDLE_TIME) so I know for sure that CPU was an issue. And here is what LFS and LFPW histograms plotted from the trace file (as described here) looked like:


[Figure: CPU_starved_LGWR_bilog]

Continue reading

Piggyback commits

10 Sep

Not every commit results in a redo write. This is because there are multiple optimizations (some controlled by the user, e.g. with the COMMIT_LOGGING parameter, some automatic) that aim at reducing the number of redo writes caused by commits by grouping redo records together. Such group or “piggyback” commits are important for understanding log file sync waits and the various statistics around them. In particular, “piggyback” commits play a key role when many sessions commit concurrently at a high rate, as described in my previous post. I made myself a little demo to see this mechanism at work with my own eyes. I think it could be of interest to others, so I’m sharing it here. Since the demo involves stopping and resuming a background process, I wouldn’t recommend running it on anything other than a designated private sandbox environment.

Continue reading

How not to troubleshoot log file sync waits

9 Sep


There is a very common mistake in troubleshooting log file sync (LFS) waits: comparing its average time to average log file parallel write (LFPW) and trying to deduce from that whether the root cause of the wait is slow I/O or something else. The fact that this approach is recommended by Oracle itself (e.g. MOS 1376916.1) and many independent experts unfortunately doesn’t make it any less wrong.

Continue reading

Querying trace files

8 Sep

SQL trace files provide the highest level of detail possible about SQL execution. The problem with that information is converting it to a convenient format for further analysis. One very good solution is the parsetrc tool by Kyle Hailey, written in Perl. It gives high-resolution histograms, I/O transfer rates as a function of time, and other very useful info. Unfortunately, I myself am not a Perl expert, so it’s a bit difficult for me to customize this tool when I need something slightly different from the defaults (e.g. change the histogram resolution, look at events not hardcoded into the script etc.). Another limitation is that since the tool is external to the database, you can’t join the data to anything else (like ASH queries). So I found another solution for raw trace file analysis: external tables + regexp queries.
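To give a feel for the regexp part, here is a minimal sketch in Python rather than in SQL over external tables, so it is not the approach from the post itself. It assumes WAIT lines in the usual extended SQL trace layout (`WAIT #<cursor>: nam='<event>' ela= <microseconds> ...`); the function name `wait_histogram`, the power-of-two bucketing and the sample lines are all made up for illustration.

```python
import re
from collections import Counter

# Matches the start of a WAIT line in an extended SQL trace file, e.g.
# WAIT #3: nam='log file sync' ela= 900 buffer#=5 sync scn=99 p3=0 obj#=-1 tim=1000
WAIT_RE = re.compile(r"^WAIT #\d+: nam='(?P<event>[^']+)' ela= *(?P<ela>\d+)")


def wait_histogram(lines, event):
    """Count waits for `event`, bucketed by elapsed time in microseconds.

    Each bucket key is the smallest power of two that is >= the elapsed time.
    """
    hist = Counter()
    for line in lines:
        m = WAIT_RE.match(line)
        if m and m.group("event") == event:
            ela = int(m.group("ela"))
            bucket = 1 << max(ela - 1, 0).bit_length()
            hist[bucket] += 1
    return dict(sorted(hist.items()))


# Hypothetical sample lines; a real run would iterate over an open trace file.
sample = [
    "WAIT #3: nam='log file sync' ela= 900 buffer#=5 sync scn=99 p3=0 obj#=-1 tim=1000",
    "WAIT #3: nam='log file sync' ela= 1500 buffer#=6 sync scn=100 p3=0 obj#=-1 tim=3000",
    "WAIT #3: nam='db file sequential read' ela= 300 file#=4 block#=10 blocks=1 obj#=7 tim=3500",
]
print(wait_histogram(sample, "log file sync"))
```

Changing the histogram resolution or the event of interest is then a one-line change, which is the kind of flexibility the regexp approach is after.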

Continue reading

Block size and multirow DML performance

19 Aug

While 8k is the default block size, Oracle supports other block sizes, too. Smaller block sizes (more commonly, 4k) are encouraged for OLTP workloads to reduce contention, and an even smaller block size, 2k, is recommended for databases running on “Advanced format” (or “4K”) storage. Oracle documentation warns us of possible implications when storing larger rows in 2k/4k blocks, such as potentially larger space overhead due to metadata, or even the possibility of row chaining. What it doesn’t warn us about is that there is also a big difference in the way DML operations work, which has very serious implications for their performance.

Continue reading

Peeling through layers of cache

11 Aug

Nowadays, data in databases is wrapped in many layers of cache: result cache, buffer cache, OS page cache, storage hardware cache… They greatly improve performance, but they also make it less stable and harder to predict. And when I/O performance takes a turn for the worse, one has to go through multiple layers of cache trying to understand what went wrong and why. I had such a case not too long ago.

The incident took place on an Oracle database running on a Solaris 10 server. The first symptom was that the throughput of one of the batch processes in the database dropped by about 30 percent. Since the process was spending more than half of its time doing I/O (as ASH indicated), it was easy to establish that the slowdown was linked to increased db file sequential read times by plotting average wait times from DBA_HIST_SYSTEM_EVENT (the SQL code for this and other queries in this blog post can be found below, in the Appendix):
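The counters in DBA_HIST_SYSTEM_EVENT (TOTAL_WAITS, TIME_WAITED_MICRO) are cumulative since instance startup, so an average wait time per snapshot interval has to be computed from the differences between consecutive snapshots. The following is only a Python sketch of that arithmetic, not the SQL used for the post; it assumes the rows for one event have already been fetched and ordered by snapshot id, and the function name and sample numbers are invented.

```python
def interval_averages(snapshots):
    """Average wait time (ms) per snapshot interval from cumulative counters.

    `snapshots` is a list of (snap_id, total_waits, time_waited_micro)
    tuples for a single wait event, ordered by snap_id.
    """
    result = []
    for (_, w0, t0), (snap_id, w1, t1) in zip(snapshots, snapshots[1:]):
        dw, dt = w1 - w0, t1 - t0
        if dw <= 0 or dt < 0:
            # Either no waits in the interval, or the counters were reset
            # (e.g. by an instance restart); skip such intervals.
            continue
        result.append((snap_id, dt / dw / 1000.0))
    return result


# Hypothetical data: a quiet interval at snap 102, a counter reset at 103.
rows = [
    (100, 1000, 5_000_000),
    (101, 3000, 17_000_000),
    (102, 3000, 17_000_000),
    (103, 500, 4_000_000),
    (104, 1500, 14_000_000),
]
print(interval_averages(rows))
```

Skipping intervals with non-positive deltas is a design choice of this sketch: it avoids dividing by zero and keeps a restart from showing up as a nonsensical negative average.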

Continue reading

Row cache lock

14 Jul

A rather odd performance issue happened at work a few days ago. All of a sudden, one of the databases started to work very slowly, and a quick look in ASH data showed that it was spending over 70% of its time waiting on “row cache lock” (not to be confused with “latch: row cache lock”).

It was a test database on an overloaded server with a non-standard configuration, so my initial reaction was to bounce the database and hope that the symptoms would go away, but it didn’t work. So there was no other choice but to investigate the root cause. The next simplest thing to “turning it off and then back on” was finding the blocking session in ASH (the V$ACTIVE_SESSION_HISTORY view has columns BLOCKING_SESSION and BLOCKING_SESSION_SERIAL#). Most of the entries for this wait event had NULLs in these columns, but among those NULLs there were sid’s and serial#’s of two sessions. Unfortunately, both sessions were from background processes, MMON and SMON, so killing them obviously wasn’t a tempting choice. MMON is responsible for collecting AWR data and such, so the instance could probably survive if it got killed, but SMON must always be running or the instance would crash.

Continue reading

Caution: hot patches

26 Jun

For many Oracle database patches there is an option to apply them “online”, i.e. without stopping the database and related services (listener, ASM etc.). This is very convenient when downtime is impossible or difficult to arrange. However, this convenience comes at a certain price, including some performance penalties. Such side effects of hot patching are not well understood, even by Oracle support engineers.

Continue reading

