An example of tuning — nothing special, but it does illustrate several aspects of tuning work, so I thought I’d make a blog post out of it.
Yesterday I was contacted by a development team working with a 10.2.0.4 database. Their complaint was the traditional “the system is slow”. Despite the very unspecific symptoms, they were very specific about the time when it all started. I didn’t have access to AWR reports on that database, so instead I looked at the top wait events in DBA_HIST_SYSTEM_EVENT for recent snapshots: nothing interesting. The top wait event was db file sequential read (with db file scattered read a very distant second). Since that particular group was working with just one user in the database, I also looked at ASH data for this user — same thing, just a bunch of db file sequential reads.
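A query along these lines is enough to get the top waits. This is only a sketch: the snap_ids are placeholders for the snapshots bracketing the period of interest, and since DBA_HIST_SYSTEM_EVENT holds cumulative counters, two snapshots have to be diffed:

```sql
-- Sketch only: snap_ids 1000/1010 are hypothetical. Counters are
-- cumulative, so subtract the earlier snapshot from the later one.
select e2.event_name,
       e2.wait_class,
       e2.time_waited_micro - e1.time_waited_micro as time_waited_micro
  from dba_hist_system_event e1,
       dba_hist_system_event e2
 where e1.event_name      = e2.event_name
   and e1.instance_number = e2.instance_number
   and e1.snap_id = 1000
   and e2.snap_id = 1010
   and e2.wait_class <> 'Idle'
 order by time_waited_micro desc;
```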
Then I looked at top SQL for that particular user in ASH, and found that two statements were consuming more resources than others, and one of them showed up with a different plan shortly before they noticed performance degradation. As it turned out, it was that particular statement that was giving them the impression of everything being slow.
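Something like this does the job (the username is a placeholder); since ASH records the plan hash value with each sample, the same grouping also reveals a plan change:

```sql
-- Sketch only: group ASH samples by sql_id and plan hash value;
-- a statement appearing with two plan hash values has changed plans.
select sql_id,
       sql_plan_hash_value,
       count(*) as samples
  from v$active_session_history
 where user_id = (select user_id from dba_users where username = 'APP_USER')
 group by sql_id, sql_plan_hash_value
 order by samples desc;
```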
So I turned my attention to that SQL. The cost was low. No “red flags” like cartesian merge joins, full table scans of huge tables or anything of that kind. Just a bunch of nested loops with index range scans, a few hash joins, and a couple of REMOTE operations (reads over a database link).
Without cardinality, predicates and other advanced diagnostic information it’s hard to troubleshoot a query. So the first thing to do was to re-run the query with rowsource statistics on and do a cardinality feedback analysis (http://www.centrexcc.com/Tuning%20by%20Cardinality%20Feedback.ppt.pdf). At that point I faced the first difficulty: the query used bind variables, so I needed some realistic values to reproduce the problem. I found some values in V$SQL_BIND_CAPTURE, but they didn’t reproduce the problem. OK — after all, the statement was bind-sensitive (the developers had observed that and shared their observation with me), so maybe the captured values weren’t “heavy” enough… After some work we found some “heavy” values which were supposed to definitely bring down the performance. Still no luck: the query executed almost as fast as the first time. Apparently, whatever factor had led to the bad plan was no longer there (I think it was a freakish peeked bind value, but there could be other explanations).
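For reference, the two tools involved here look roughly like this (the sql_id is the one from this post; note that bind capture is periodic, so the “heavy” values may simply never have been captured):

```sql
-- Captured bind values (sampled periodically, so not guaranteed
-- to include the "heavy" ones):
select name, datatype_string, value_string
  from v$sql_bind_capture
 where sql_id = 'gq2unm44q6zmh';

-- After re-running the statement with the gather_plan_statistics hint
-- (or with statistics_level = all), E-Rows vs A-Rows can be compared:
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
```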
Then I tried to trick the optimizer into using the same plan by submitting exactly the same text of the query, using something like:
select sql_fulltext into l_sql_text
  from v$sql
 where sql_id = 'gq2unm44q6zmh' and rownum = 1;
execute immediate l_sql_text using 1;
but still no luck — Oracle generated a new child cursor due to a BIND_MISMATCH!
Then I realized that there was a simpler way: I generated an outline using the advanced option of dbms_xplan:
select * from table(dbms_xplan.display_awr('gq2unm44q6zmh', 731024556, null, 'advanced'));
and pasted the generated outline section into the SQL text. Finally I reproduced the problem, and was able to collect all the diagnostic information I needed. The rest was simple: cardinality feedback analysis showed a major discrepancy between E-rows and A-rows that originated from a filter predicate of the kind T.COL1 = T.COL2 || ‘_’ || T.COL3. As it turned out, that was the way they marked test entries in the system… not a very good idea, at least as far as the optimizer is concerned! There is no way for the optimizer to accurately assess the selectivity of such a predicate, and its guess was off by two orders of magnitude. From that point on, the cardinality errors in the plan piled up, and all the joins ended up in the wrong order.
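The pasted-in outline ends up as a hint block at the top of the statement. The hints and object names below are purely illustrative — the real ones are copied verbatim from the OUTLINE section of the ‘advanced’ output:

```sql
-- Illustrative shape only; real hints come from dbms_xplan's OUTLINE section
select /*+
          BEGIN_OUTLINE_DATA
          IGNORE_OPTIM_EMBEDDED_HINTS
          OPTIMIZER_FEATURES_ENABLE('10.2.0.4')
          LEADING(@"SEL$1" "T1"@"SEL$1" "T2"@"SEL$1")
          USE_NL(@"SEL$1" "T2"@"SEL$1")
          END_OUTLINE_DATA
       */
       t1.col1, t2.col2
  from t1, t2
 where t1.id = t2.id
   and t1.col1 = :b1;
```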
There are several possible solutions for this kind of problem, e.g. creating a function-based index on the expression so that the optimizer gets accurate statistics for it (and with 11g there are other options, such as extended statistics), but the best way in my opinion is to address the underlying design issue, namely: one should never use patterns or rules for classifying data!
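To make the options concrete (all table and column names here are hypothetical): in 10g a function-based index on the expression gives the optimizer usable column statistics for it, in 11g extended statistics achieve the same without carrying an index, and the design-level fix is simply an explicit flag column:

```sql
-- 10g option: function-based index on the expression; once statistics
-- are gathered, the optimizer can estimate the predicate's selectivity
create index t_test_marker_ix on t (col2 || '_' || col3);

-- 11g option: extended statistics on the same expression, no index needed
select dbms_stats.create_extended_stats(user, 'T', '(COL2 || ''_'' || COL3)')
  from dual;

-- Design-level fix: an explicit flag instead of a naming pattern
alter table t add (test_entry char(1) default 'N');
```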
To sum up, I think this case teaches us about several things:
1) People can easily mistake a local problem for a global one… so when responding to complaints about the “system being slow” it’s important to find confirmation of a global-level issue before acting on it. When they say “everything is slow”, that “everything” may just be the same query running over and over! I know this sounds trivial, but I continually see cases where people rush to make global recommendations when asked about such “global” issues.
2) People also often tend to overlook bind sensitivity of SQL. If a statement using bind variables doesn’t show slowness for one set of values, it doesn’t mean that it doesn’t have a problem.
3) The ‘advanced’ dbms_xplan format option is great for reproducing a bad plan.
4) Keep things simple, or you’ll confuse the optimizer. It’s simply not worth trying to save time or space by avoiding one small additional column and relying on patterns to distinguish between different classes of data.