There exists a fair amount of confusion around one of the most famous Oracle errors, ORA-01555 (“snapshot too old”). The difficulty probably stems from the fact that ORA-01555 is related to the internal workings of undo generation and usage in an Oracle database, which are rather complex. Because of that, when this error appears on their system, inexperienced DBAs and developers often either start looking at UNDO-related parameters, rollback segment sizes, etc., or start looking for any suspicious activity, which they then link to excessive undo “consumption” or “generation”.
Ever since I heard Tanel Poder talk about application servers causing positive feedback loops (performance gets worse -> open more sessions -> make performance even worse) at the LVOUG conference in Riga in 2012, I was curious to see one myself. Recently I got a chance to observe a very interesting variety of such behavior.
The plot above shows database CPU usage (red line) versus time, plotted together with the number of sessions running the same SQL statement (literally identical, not even different binds: the statement contains no binds at all). There is a clear correlation between the two. Apparently, when database performance degrades past a certain point, the application server stops getting answers fast enough and launches the query again in another session, thus degrading performance even further, and so on.
This behavior is very dangerous and can easily cause prolonged outages: see the plateau towards the right side of the plot, where the database server was maxed out on CPU for over 24 hours!
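One way to catch this pattern as it develops is to count active sessions per SQL statement. A minimal sketch (the threshold of 5 concurrent sessions is an arbitrary assumption, pick one that makes sense for your workload):

select sql_id,
       count(*) active_sessions
  from v$session
 where status = 'ACTIVE'
   and sql_id is not null
 group by sql_id
having count(*) > 5   -- arbitrary threshold, adjust to your workload
 order by count(*) desc;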
Moral of the story: don’t spawn more sessions in response to poor performance, unless you want poor performance to quickly become terrible!
Troubleshooting relatively short plans is simple once one gets a little practice, but the mere sight of a plan with several tens or even hundreds of operations can be very intimidating. In this post I describe a few tricks that make this task easier.
First of all, use V$SQL_PLAN/V$SQL_PLAN_STATISTICS instead of (or in addition to) the formatted plan produced by DBMS_XPLAN.
For example, in DBMS_XPLAN output for a long plan it is hard to see the relationships between operations, but they are straightforward to extract by querying V$SQL_PLAN, e.g.:
-- direct children of operation :op_id (swap parent_id for id to fetch the operation itself)
select *
  from v$sql_plan p
 where p.sql_id = :sql_id
   and p.child_number = :child_no
   and p.parent_id = :op_id;
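For larger chunks of the plan, the same idea extends to whole subtrees. A sketch using a hierarchical query (just one way to do it, with the usual CONNECT BY caveats):

with plan_rows as (
  -- restrict to one plan first, so CONNECT BY doesn't cross into other cursors
  select *
    from v$sql_plan
   where sql_id = :sql_id
     and child_number = :child_no
)
select id,
       parent_id,
       lpad(' ', 2 * (level - 1)) || operation || ' ' || options as op,
       object_name
  from plan_rows
 start with id = :op_id
connect by prior id = parent_id;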
If the statement is a join of several views, and the optimizer doesn’t merge the views, then it is convenient to check which views are contributing the most to the performance problem, e.g.:
select p.object_name, s.last_cr_buffer_gets
  from v$sql_plan p,
       v$sql_plan_statistics s
 where p.sql_id = s.sql_id
   and p.child_number = s.child_number
   and p.id = s.operation_id
   and p.sql_id = :sql_id
   and p.child_number = :child_no
   and p.operation = 'VIEW'
 order by s.last_cr_buffer_gets desc;
In the absence of merging, each view (including inline views) can be tuned separately (of course, if there is reason to believe that a merged view would perform better, then efforts should be focused on making it mergeable, not on tuning it in isolation).
It is also convenient to use V$SQL_PLAN_STATISTICS to determine which “elementary” operations, such as table, index or materialized view access, consume the most resources (gets, reads, etc.):
select p.object_name, s.last_cr_buffer_gets
  from v$sql_plan p,
       v$sql_plan_statistics s
 where p.sql_id = s.sql_id
   and p.child_number = s.child_number
   and p.id = s.operation_id
   and p.sql_id = :sql_id
   and p.child_number = :child_no
   and p.operation in ('TABLE ACCESS', 'INDEX', 'MAT_VIEW ACCESS')
 order by s.last_cr_buffer_gets desc;
If the statement accesses the same tables or indexes several times, then the approach can be taken one step further: you can group the plan statistics by object name to determine which objects contribute the most:
select p.object_name,
       sum(s.last_cr_buffer_gets) total_gets,
       count(*) num_times_used
  from v$sql_plan p,
       v$sql_plan_statistics s
 where p.sql_id = s.sql_id
   and p.child_number = s.child_number
   and p.id = s.operation_id
   and p.sql_id = :sql_id
   and p.child_number = :child_no   -- keep to a single child to avoid double counting
 group by p.object_name
 order by sum(s.last_cr_buffer_gets) desc;
It can also help to left-join the query above to DBA_SEGMENTS to compare the number of gets on an object with its total number of blocks. If the number of gets exceeds the total size of the object, then most likely it is being accessed inefficiently. In such cases it may help to use subquery factoring (the WITH clause), or the entire query logic may need to be reconsidered.
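A minimal sketch of such a comparison (it matches plan object names to segment names naively, so partitioned objects are summed up and some nonsegment objects may not line up):

with gets as (
  select p.object_owner, p.object_name,
         sum(s.last_cr_buffer_gets) total_gets
    from v$sql_plan p,
         v$sql_plan_statistics s
   where p.sql_id = s.sql_id
     and p.child_number = s.child_number
     and p.id = s.operation_id
     and p.sql_id = :sql_id
     and p.child_number = :child_no
   group by p.object_owner, p.object_name
)
select g.object_name,
       g.total_gets,
       seg.blocks,
       round(g.total_gets / nullif(seg.blocks, 0), 1) gets_per_block
  from gets g
  left join (select owner, segment_name, sum(blocks) blocks
               from dba_segments
              group by owner, segment_name) seg
    on seg.owner = g.object_owner
   and seg.segment_name = g.object_name
 order by g.total_gets desc;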
For example, once I saw a query that compiled a financial report from many different views. It wasn’t performing well, but the plan (with about 700 operations in it) didn’t show any obvious problem. However, using this grouping trick I found that one small table was accessed about 40 times, and the total number of gets associated with it was 20 times the size of the table. The table contained FX conversion rates, and as it turned out, each of many sections of the report was performing this conversion independently, resulting in a significant performance overhead.
These simple tricks go well with traditional divide-and-conquer methods, and/or the data access efficiency analysis described in my previous post.
In order to tune a query, you need to know two things:
- whether it can be tuned, and if so, by how much
- which part of the query (which operation, or which data object) is the most promising from the tuning point of view.
Existing tuning methods don’t really answer these questions. Either you focus on the most expensive operation(s) and hope that you can eliminate them (or transform them into something less costly), or you focus on the ones showing a large discrepancy between actual rowcounts and optimizer predictions (cardinality feedback tuning). Either way, you can’t be sure that you’ve set your priorities right. It could well be that the cost of the most expensive operation cannot be reduced by much, while enough performance could be won back elsewhere. With cardinality feedback tuning, you also have no guarantee that improving the accuracy of optimizer estimates will eventually translate into an acceptable level of performance.
Of course, if the plan contains only a few operations, this is not a big issue, and after a few trials you will usually get to the bottom of the problem. However, when dealing with very complex plans, hundreds of operations long, that is not really an option. Working with such plans a few months back, I developed a simple tuning method that allows one to estimate, with good accuracy, the potential tuning benefit of each plan operation, using rowsource statistics and optimizer estimates as input. In this post, I’m sharing this method, as well as a script that implements it.
In my earlier post, I described a method of troubleshooting slow queries using dbms_xplan. While this method is all you need in most cases, it has one serious limitation: it requires the problem query to complete before any diagnostics can be taken. What if the query is so slow that it cannot finish within a reasonable time frame? I’ll present several alternatives in this post.
If your Oracle version is 11g or higher, then the most convenient tool for dealing with “stuck” queries is SQL monitor. It’s very simple to use:
declare
  report clob;
begin
  -- with no arguments, this reports the last statement monitored;
  -- pass sql_id => '...' to target a specific statement
  report := DBMS_SQLTUNE.REPORT_SQL_MONITOR();
  dbms_output.put_line(report);
end;
/
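If the statement is still running in another session, one way to find its sql_id first (a sketch, assuming you know the session’s SID) is:

select sid, sql_id, sql_child_number, status
  from v$session
 where sid = :sid;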
Oracle Virtual Private Database (VPD), also known as Row Level Security (RLS), provides a very high level of flexibility in exposing data to users. It is also a very convenient tool for forcing hard parsing of a SQL statement, either on every execution or depending on some criteria (e.g. see here). VPD works by appending invisible predicates to SQL statements (invisible in the sense that the query text in V$SQL doesn’t contain them). This invisibility is mostly a good thing, but in some cases it can also be a nuisance (e.g. when troubleshooting SQL). There are a couple of blogs describing how to view these predicates (here and here), both suggesting DBMS_XPLAN as one of the ways.
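Another option worth a look (not from those posts, just a sketch of an alternative) is the V$VPD_POLICY view, which exposes the generated predicate for cursors still in the library cache:

select sql_id,
       child_number,
       object_name,
       policy,
       predicate
  from v$vpd_policy
 where sql_id = :sql_id;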
However, the DBMS_XPLAN approach doesn’t always work as expected, and sometimes it doesn’t work at all. Consider an example.
It’s been a while since I came across an interesting and complex case. However, yesterday in an OTN forum thread I saw a case that was interesting in its simplicity. Even though it’s almost trivial on the technical level, it is very useful for highlighting typical tuning “rookie mistakes” (I can remember quite a few cases from not so long ago when I made similar mistakes, too).
The author posts a question about “library cache: mutex X” events that are ruining the performance of his two-node RAC cluster. The original post doesn’t contain any specifics except CPU utilization percentages on both nodes.
Within a few hours, a few replies appear, most of them either trying to shed light on this particular wait event or sharing similar experiences. I asked the original poster to provide key sections of an AWR report (workload profile, top events, database/host CPU, top SQL), which he soon did:
A few days ago, I happened to participate in an interesting thread on an OTN forum. We weren’t able to answer all of the questions raised by the original poster, but the discussion stirred some thoughts that eventually led me to writing this post.
One of the key things that make a database a database, and not a data dump, is integrity constraints. They make it possible to enforce restrictions on data in a very efficient fashion, independently of the current application interface. But constraints also play a very important role in providing additional information about the data to the query optimizer. Failure to provide that information, or providing it incorrectly, may lead to serious performance issues.
Let’s consider an example.
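(The sketch below is an illustrative example of the general point, with a hypothetical table, not the case from the forum thread.) A check constraint tells the optimizer which values are possible, so it can recognize contradictory predicates without touching the data:

-- hypothetical table: the constraint guarantees status takes one of two values
create table orders
(
  order_id number primary key,
  status   varchar2(10) not null
           check (status in ('NEW', 'SHIPPED'))
);

-- with the constraint in place, the optimizer can deduce that this predicate
-- is always false and produce a plan with a "NULL IS NOT NULL" filter,
-- skipping the table access entirely
select *
  from orders
 where status = 'CANCELLED';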
Last week, I attended the Latvian Oracle User Group conference Harmony 2012, held in Riga, Latvia on June 1. The event featured several “stars” of the Oracle world, including Tom Kyte, Tanel Poder, Graham Wood, Daniel Morgan and others. Events like this are a great opportunity to catch up with recent trends in Oracle development and tuning, learn new tricks of the trade, hear the best tuning specialists tell how they tackled their most difficult performance cases, and more!
Many database performance problems can be resolved via a top-down approach: you look at database performance globally (using a statspack or AWR report) and then work your way down to the specific SQL consuming most of the resources. This approach breaks down when the application is not using bind variables. The reason is that performance reports rely on sql_id to identify a statement, and in the absence of cursor sharing a single logical SQL statement has as many sql_id’s as there are combinations of literal parameters. So if your heaviest query is something like
SELECT * FROM T1 WHERE X=:B1,
then you’ll see it in AWR top-SQL lists, but if it’s
SELECT * FROM T1 WHERE X="some literal value",
then at best you’ll see a bunch of similar statements in the lists, each responsible for a small fraction of the workload. There is also a good chance that your heaviest SQL won’t show up in those lists at all, if it’s overshadowed by frequently run statements with unchanging parameters (or using binds).
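One workaround, for statements still in the shared pool, is to aggregate cursors that differ only in literals. A sketch using FORCE_MATCHING_SIGNATURE in V$SQL (which hashes statements as if CURSOR_SHARING=FORCE were in effect):

select force_matching_signature,
       count(*)         num_cursors,
       sum(executions)  total_execs,
       sum(buffer_gets) total_gets,
       max(sql_id)      sample_sql_id   -- one representative sql_id to look at
  from v$sql
 where force_matching_signature > 0
 group by force_matching_signature
 order by sum(buffer_gets) desc;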