This week my blog has crossed an important milestone: 111,111 visits. :) I’d like to use this occasion to thank all my readers for their continued interest in my blog, and announce that there are a few more interesting texts in the pipeline, covering such topics as insert optimization, manual optimization of expensive joins, mechanics of redo log related waits (log file sync, log buffer, log file parallel write), and much else. Stay tuned!
While 8k is the default block size, Oracle supports other block sizes, too. Smaller block sizes (more commonly, 4k) are encouraged for OLTP workloads to reduce concurrency, and even smaller block size, 2k, is recommended for databases running on “Advanced format” (or “4K”) storage. Oracle documentation warns us of possible implications when storing larger rows in 2k/4k blocks, such as potentially larger space overhead due to metadata, or even possibility of row chaining. What it doesn’t warn us about, is that there is also a big difference in the way DML operations work, which has very serious implications for their performance.
Nowadays, data in databases is wrapped in may layers of cache: result cache, buffer cache, OS page cache, storage hardware cache… They greatly improve performance, but they also make it less stable and harder to predict. And when I/O performance takes a turn for worse, one has to go through multiple layers of cache trying to understand what went wrong and why. I had such a case not too long ago.
The incident took place on an Oracle 184.108.40.206 database running on a Solaris 10 server. The first symptom was decreased throughput of one of batch processes in the database by about 30 percent. Since the process was spending more than half of its time doing I/O (as ASH indicated), it was easy to establish that the slowdown was linked to increased time of db file sequential reads by plotting average wait times from DBA_HIST_SYSTEM_EVENT (the SQL code for this and other queries in this blog post can be found below, in the Appendix):
A rather odd performance issue happened at work a few days ago. All of a sudden, one of the databases started to work very slowly, and a quick look in ASH data showed that it was spending over 70% of its time waiting on “row cache lock” (not to be confused with “latch: row cache lock”).
It was a test database (220.127.116.11) on an overloaded server with non-standard configuratioin, so my initial reaction was to bounce the database and hope that the symptoms would go away, but it didn’t work. So there was no other choice but to investigate the root cause. The next simplest thing to “turning it off and then back on” was finding the blocking session in ASH (the V$ACTIVE_SESSION_HISTORY view has columns BLOCKING_SESSION and BLOCKING_SESSION_SERIAL#). Most of the entries for this wait event had NULLs in these columns, but among those NULLs there were sid’s and serial#’s of two sessions. Unfortunately, both sessions were from background processes, MMON and SMON, so killing them obviously wasn’t a tempting choice. MMON is responsible for collecting AWR data and such, so the instance could probably survive if it gets killed, but SMON must always be running or the instace would crash.
For many Oracle database patches there is an option to apply them “online”, i.e. without stopping the database and related services (listener, ASM etc.). This is very convenient when downtime is impossible or difficult to arrange. However, this convenience comes at a certain price, including some performance penalties. Such side effects of hot patching are not well understood, even by Oracle support engineers.
Performance tuning is all about time. You measure the time it takes for a certain process to complete, and then you search for ways to reduce this time to improve end-users experience and/or increase the application productivity. But minimizing time is not enough — it’s important to minimize the correct time metric. A typical mistake in database performance optimization to optimize DB time instead of the duration experienced by the end user. If this happens, this can easily result in a situation when DB time is reduced significantly, but the process is still taking almost as long as before, the SLA is still breached, bosses and users are increasingly frustrated.
In this post, I will give two real-life examples of cases when the difference between DB time and elapsed time was the key to understanding the problem.
SQL performance can degrade for many reasons, some of most common are:
– plan changes
– data skewness
– low caching efficiency
– data growth
All these factors are relatively well known. A somewhat less common, although not exceptionally rare scenario, is read consistency overhead due to concurrent DML against queried tables. Because of being less common, this scenario is often overlooked, which leads to false diagnoses (and eventually to “fixes” that can do more harm than good).