Performance tuning is all about time. You measure how long a certain process takes, and then you search for ways to reduce that time to improve the end-user experience and/or increase application productivity. But minimizing time is not enough: it's important to minimize the correct time metric. A typical mistake in database performance optimization is to optimize DB time instead of the duration experienced by the end user. When that happens, DB time can be reduced significantly while the process still takes almost as long as before, the SLA is still breached, and bosses and users grow increasingly frustrated.
In this post, I will give two real-life examples where the difference between DB time and elapsed time was the key to understanding the problem.
Case 1. Parallel PL/SQL execution workflow
One of the critical business processes on a production database was found to perform 10-15% slower than usual. The investigating DBA was handed an AWR report and asked to pinpoint the root cause of the issue. Of course, AWR always has the problem of mixing the processes you're interested in with the processes you don't care about. But in this case that didn't appear to be a serious issue: the critical process was responsible for most of the database activity during that time, so it was natural to expect anything big enough to account for the observed degradation to show up in the report. AWR also has the general problem of diluting the "problem" portion of the report with the "non-problem" part, but here the overlap between the AWR interval and the duration of the process was good, so that wasn't supposed to be an issue, either.
To the DBA's great frustration, the report revealed nothing. Everything looked normal: there were no suspicious wait events in the top lists, and the top items in the "SQL ordered by" sections of the AWR report were responsible for only 2% of DB time. On some systems, non-use of bind variables results in the top SQL being "smeared out" because each execution gets its own sql_id, but that wasn't the case here, either. How was that possible?
The answer was in the architecture of the process. It used dbms_scheduler jobs to parallelize the workload (with up to 50 simultaneous jobs), but not all of its stages could be parallelized, so parallel operations were interspersed with serial ones. One of those serial SQL statements had started to run slower. It was responsible for less than 0.1% of the DB time, but for almost 10% of the elapsed time. Since DB time is all AWR can offer, it was impossible to find the answer in AWR.
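The shape of such a workflow can be sketched in PL/SQL as follows. This is a minimal illustration, not the actual production code; the procedure names are hypothetical:

```sql
begin
  -- Parallel stage: submit up to 50 worker jobs.
  -- PROCESS_CHUNK is a hypothetical worker procedure.
  for i in 1 .. 50 loop
    dbms_scheduler.create_job(
      job_name   => 'WORKER_' || i,
      job_type   => 'STORED_PROCEDURE',
      job_action => 'PROCESS_CHUNK',
      enabled    => true);
  end loop;

  -- ... wait here until all worker jobs have completed ...

  -- Serial stage: while this single statement runs, all 50 workers
  -- are idle. A statement like this can account for a tiny fraction
  -- of DB time yet a large fraction of the elapsed time.
  merge_results;  -- hypothetical serial step
end;
/
```

Because the serial stage runs on one session while the parallel stage runs on dozens, a slowdown in the serial stage barely registers in DB time even though every end user feels it.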
Ironically, there was no need to look at AWR at all, because every operation of the process was logged (with timestamps) in a database table, and it was very straightforward to identify the faulty statement from that log. But no one was looking there, because AWR was erroneously believed to always hold the answer to any performance problem.
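With such an instrumentation table in place, finding the slow step is a one-liner. A minimal sketch, assuming a hypothetical log table with one row per step and its start/end timestamps:

```sql
-- Hypothetical instrumentation table: process_log(step_name, start_ts, end_ts).
-- Ordering steps by wall-clock duration makes the faulty serial
-- statement stand out immediately, regardless of its share of DB time.
select step_name,
       end_ts - start_ts as duration
from   process_log
where  run_id = :run_id          -- the run being investigated
order  by end_ts - start_ts desc;
```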
Case 2. Parallel SQL
In this case, a SQL statement was showing unstable, poor performance. On some occasions it would complete within two or three hours, which was deemed acceptable, but on others it would run for 5 hours or more and then be killed and re-submitted (often with the same result). The plan, however, was stable. There were no changes or skews in the data, either, nor any fluctuations in I/O performance or caching efficiency.
A senior colleague of mine noted that there was a lot of DML against the queried tables going on and suggested read-consistency overhead as a candidate root cause. Given all the symptoms, it was a good fit. I ran a query against ASH and found that only 17% of DB time was spent on reads from UNDO: not enough to explain the observed degradation, so I started to pursue another line of investigation. A bit later, however, it hit me: that was 17% of DB time, but we had noticed the degradation in terms of elapsed time. The query was using parallelism, so the two weren't necessarily the same.
And indeed, after looking at the elapsed time (it's easy to switch from one to the other in ASH: just replace COUNT(*) with COUNT(DISTINCT sample_time)), I saw that 52% of it was accounted for by reads from the UNDO tablespace. So my colleague had been right all along, and had I looked at the right metric, I would have realized it right away.
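The two metrics can even be computed in a single pass over ASH. A simplified sketch (filters on the wait event and tablespace omitted; ASH samples active sessions roughly once per second, so each row approximates one second of DB time):

```sql
-- COUNT(*) counts every sampled session, so N parallel slaves active in
-- the same second contribute N seconds of DB time.
-- COUNT(DISTINCT sample_time) collapses concurrent samples into one,
-- approximating elapsed (wall-clock) time.
select count(*)                    as db_time_secs,
       count(distinct sample_time) as elapsed_secs
from   v$active_session_history
where  sql_id = :sql_id;
```

For a serial statement the two numbers are close; for a heavily parallel one, a wait that dominates the serial phases can be a small slice of DB time yet most of the elapsed time, which is exactly what happened here.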
It made me wonder why the difference between the two numbers was so big. It turned out that, as in the example above, the execution workflow for the statement was a mix of parallel and serial operations. That's not surprising: some operations (like full scans) can be easily parallelized, while others are harder or impossible to parallelize. For some reason, the serial index probes were affected the most by the concurrent DML activity, so most of the reads from UNDO were associated with them.
Summary
1. From the end user's perspective, performance is measured as the duration between the beginning and the end of an activity
2. Most automatic performance diagnostic and tuning tools (like AWR and ADDM) measure database time instead
3. If an activity is executed in several parallel threads, the two are not necessarily the same, and can be affected by various performance factors to different degrees
4. Code instrumentation techniques can provide information about the linear flow of a process in time by logging steps with timestamps. When such information is available, it is more relevant for investigating performance issues than Statspack/AWR or any other similar tool that reports global aggregates and averages
5. Another helpful tool for measuring elapsed time is ASH, where you can switch between DB time and elapsed time by replacing COUNT(*) with COUNT(DISTINCT sample_time) and vice versa