It’s been forever since I last shared any of my performance troubleshooting experiences at work. This week, I got a case that I think is worth publishing, and I decided to write about it in my blog. So, here we go…
A few days ago, I received a complaint about unstable performance of one of the frequently run SQL reports on an 11gR2 database. Most of the time it completed within a couple of minutes; however, on certain occasions it took much longer than that, and once it even took over 20 minutes.
I have a special stored report in SQL Developer for conveniently displaying key statistics from DBA_HIST_SQLSTAT, which is very helpful as a first step when analyzing unstable SQL performance:
select sn.begin_interval_time,
       plan_hash_value,
       executions_delta,
       round(elapsed_time_delta/1e6/decode(executions_delta, 0, 1, executions_delta), 3) sec_per_exec,
       round(disk_reads_delta/decode(executions_delta, 0, 1, executions_delta), 1) reads_per_exec,
       round(buffer_gets_delta/decode(executions_delta, 0, 1, executions_delta), 1) gets_per_exec,
       round(iowait_delta/1000/nullif(disk_reads_delta, 0), 1) avg_read_msec,
       round(100*(1 - disk_reads_delta/nullif(buffer_gets_delta, 0)), 6) bchr,
       round(rows_processed_delta/decode(executions_delta, 0, 1, executions_delta), 1) rows_per_exec
from dba_hist_sqlstat st,
     dba_hist_snapshot sn
where st.snap_id = sn.snap_id
and sql_id = nvl(:sql_id, sql_id)
and force_matching_signature = nvl(:sig, force_matching_signature)
order by sn.snap_id desc
As you can see, it normalizes performance metrics to the number of executions (with the special case of executions_delta = 0, which essentially means “one unfinished execution”, handled by replacing it with 1 via decode), and calculates the buffer cache hit ratio (BCHR) and the average physical read time. Apart from standard I/O metrics, I always pay attention to the number of rows per execution, as it tells me whether variations in performance have anything to do with data volume.
I don’t see much use in publishing the raw numbers, so I’ll just summarize my observations here:
1) rows per execution remained relatively stable
2) average read times varied from a fraction of a millisecond to almost 20 ms
3) BCHR was typically extremely high — so high that the fraction of physical reads was more conveniently measured in parts-per-million than percent! However, occasionally it dropped to 80-90%
4) longer executions correlated well with spikes in average read times and/or drops in BCHR
5) the SQL was always using the same plan
6) the total amount of logical I/O was rather high: a few million I/Os per execution, with only 1-2k rows returned.
So the general picture is rather clear: the variations in performance have to do with variations in caching, both at the database level (as measured by BCHR) and at the OS level (as measured by average read times). Since the maximum read times were close to 20 ms, higher than what can reasonably be expected from an enterprise-grade storage array (even one on spinning disks), there must have been some queuing at the disks. That is not central to this story, so let’s not focus on it here (besides, it was already known that those disks were somewhat overloaded).
Note that the variations in caching are easy to miss here, as 80% and 99.9999% both look like perfectly healthy hit ratios. However, the execution time is proportional to the miss ratio (1 - BCHR) rather than to BCHR itself, which means that the difference between “lucky” and “unlucky” executions was enormous: 20% vs 0.0001% disk reads (so if cached reads were instantaneous, we would have seen swings in performance of 5-6 orders of magnitude!).
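To put numbers on it, here is a quick sanity check on the miss-rate ratio, using the round BCHR figures above:

```sql
-- time scales with the miss ratio (1 - BCHR), not with BCHR itself:
-- compare an "unlucky" (80% BCHR) run against a "lucky" (99.9999%) one
select round((1 - 0.80) / (1 - 0.999999)) miss_ratio from dual;
-- ~200000, i.e. about 5 orders of magnitude more physical reads
```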
There’s not much that can be done about those variations (except going all-flash or in memory, if you can afford it). However, those variations would hurt a lot less if the SQL was doing less I/O — thousands of I/Os per row retrieved certainly seems very high.
So I re-ran the SQL with “statistics_level = ALL” to see which operations were producing most of the I/O.
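For reference, a minimal way to capture those per-operation statistics (a generic sketch, not the exact commands from the case):

```sql
-- collect row-source statistics for subsequent executions in this session
alter session set statistics_level = all;

-- ... run the problem SQL here ...

-- then display the plan of the last execution, with actual row counts
-- and buffer gets per operation
select * from table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));
```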
Two problems were detected. One was an inefficient hash self-join (with a full table scan) that returned about 0.1% of the table rows. Obviously, a nested loop with index probing would be much more efficient here, but the optimizer was unable to make the right choice because of a strong anti-correlation it was not aware of (i.e. there were two conditions of type colA = A and colB = B, which had very poor selectivity individually, but excellent selectivity when combined). That was easy to fix by creating a statistics extension on the two columns and updating the stats on that table (and while the fix is pending production deployment, a couple of cardinality hints do the same job).
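A sketch of that fix, with hypothetical table and column names (T, colA, colB) standing in for the real ones:

```sql
-- create a column group so the optimizer can estimate the combined
-- selectivity of the two anti-correlated columns
select dbms_stats.create_extended_stats(user, 'T', '(COLA, COLB)') from dual;

-- regather table stats so the new extension gets statistics of its own
exec dbms_stats.gather_table_stats(user, 'T', method_opt => 'for all columns size auto')
```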
The second problem was somewhat more difficult:
----------------------------------------------------------------------------------------
| Id  | Operation                         | Name  | Starts | E-Rows | A-Rows | Buffers |
----------------------------------------------------------------------------------------
|*  5 |  HASH JOIN OUTER                  |       |      1 |      1 |    819 |   3938K |
|   6 |   VIEW                            |       |      1 |      1 |    819 |   2923K |
...
|  26 |   VIEW                            | V     |      1 |   962K |  1006K |   1014K |
|* 27 |    HASH JOIN                      |       |      1 |   962K |  1006K |   1014K |
|  28 |     INDEX FULL SCAN               | X_PK  |      1 |     37 |     37 |       1 |
|* 29 |     HASH JOIN OUTER               |       |      1 |   962K |  1006K |   1014K |
|* 30 |      HASH JOIN                    |       |      1 |   449K |   449K |    543K |
|  31 |       TABLE ACCESS BY INDEX ROWID | Y     |      1 |   449K |   449K |    372K |
|  32 |        INDEX FULL SCAN            | Y_IDX |      1 |   449K |   449K |    4442 |
|  33 |       TABLE ACCESS BY INDEX ROWID | Z     |      1 |   273K |   273K |    170K |
|  34 |        INDEX FULL SCAN            | Z_PK  |      1 |   273K |   273K |    2369 |
|  35 |      TABLE ACCESS BY INDEX ROWID  | W     |      1 |   585K |   585K |    470K |
|  36 |       INDEX FULL SCAN             | W_PK  |      1 |   585K |   585K |    6768 |
----------------------------------------------------------------------------------------
Here, the inline view on line 6 is joined to a named view (let’s call it V). The inline view generates about 3M logical I/Os due to the problem we identified above (and we know how to fix it). But there are another 1M logical I/Os coming from the view V; what can we do about those?
The first thing to note is that the problem is very unlikely to be related to cardinality estimates. The estimate for view V is in the right ballpark. Yes, the estimate for the inline view is off by about three orders of magnitude, but so what? It is lower than the actual number of rows; therefore, if the optimizer could have found a way to merge the view V, or at least push the join predicate inside it, it would have done so. So it looks like something is blocking this transformation. Quite a few things can cause that: analytic functions, rownum, outer joins, etc. Finding the exact cause may be difficult, and fixing it may be even more difficult (if possible at all). Is there an easier way?
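To illustrate the kind of construct that can get in the way, here is a hypothetical shape for V (not the actual view definition from the case):

```sql
create or replace view V as
select joinCol1, joinCol2, col1, col2,
       -- an analytic function like this one is typically enough
       -- to prevent the view from being merged into the outer query
       row_number() over (partition by joinCol1 order by col1) rn
from some_table;
```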
The SQL text goes like this:
SELECT inline.col1, ... inline.colN, V.col1 Vcol1, V.col2 Vcol2
FROM (...) inline, V
WHERE inline.joinCol1 = V.joinCol1 (+)
and inline.joinCol2 = V.joinCol2 (+)
i.e. it selects a whole bunch of columns from the inline view, but only two from the view V. Therefore we can replace the join with correlated subqueries, avoiding instantiation of the view V:
SELECT inline.col1, ... inline.colN,
       (select col1 from V where joinCol1 = inline.joinCol1 and joinCol2 = inline.joinCol2) Vcol1,
       (select col2 from V where joinCol1 = inline.joinCol1 and joinCol2 = inline.joinCol2) Vcol2
FROM (...) inline
With both fixes implemented, the I/O volume is reduced from ~4M to about 130k, i.e. a factor of ~30 improvement (the improvement in terms of time is harder to measure because of caching effects, but it has a similar order of magnitude).
In my opinion, this case is a good illustration that it helps to be flexible when trying to address the root cause: sometimes you can (and should) go all the way, in other cases you can (and should) settle for a simple workaround. Here, the first problem with the query plan was wrong cardinality estimates due to anti-correlation between columns. I could have simply forced the correct plan with hints, but getting to the bottom of the problem has the additional advantage of helping the performance of other queries involving similar predicates (not to mention the fact that hints aren’t always reliable). In the second case, however, settling for a workaround is justified, because it saves a lot of time and effort while delivering a perfectly acceptable result.