Troubleshooting stuck queries

Introduction

In my earlier post, I described a method of troubleshooting slow queries using dbms_xplan. While this method is all you need in most cases, it does have one serious problem: it requires the problem query be completed before any diagnostics are taken. What if the query is so slow that it cannot finish within reasonable time frame? I’ll present several alternatives in this post.

SQL Monitor

If your Oracle version is 11g or higher, then the most convenient tool to deal with “stuck” queries is SQL monitor. It’s very simple in use:

declare
  report clob;
begin
  report := DBMS_SQLTUNE.REPORT_SQL_MONITOR();
  dbms_output.put_line(report);
end;
/

where are optional, and can be sql_id of the statement or session_id or other parameters that can identify the activity of interest (see herefor a full description). If no parameters are specified, then the report is generated for the latest SQL execution monitored. By default, Oracle monitors queries that take longer than 5 seconds, but you can change this behavior with MONITOR hint.

This is what the output looks like:

SQL Monitoring Report

SQL Text
------------------------------
select count(*) from t t1, t

Global Information
------------------------------
 Status              :  EXECUTING
 Instance ID         :  1
 Session             :  SCOTT (200:69)
 SQL ID              :  bdkccn9auvaf2
 SQL Execution ID    :  16777216
 Execution Started   :  11/09/2012 10:43:25
 First Refresh Time  :  11/09/2012 10:43:31
 Last Refresh Time   :  11/09/2012 10:45:51
 Duration            :  147s
 Module/Action       :  SQL*Plus/-
 Service             :  SYS$USERS
 Program             :  sqlplus.exe

Global Stats
=========================================
| Elapsed |   Cpu   |  Other   | Buffer |
| Time(s) | Time(s) | Waits(s) |  Gets  |
=========================================
|     146 |     144 |     1.23 |   2159 |
=========================================

SQL Plan Monitoring Details (Plan Hash Value=840385502)
=====================================================================================================================================
| Id   |        Operation        | Name |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Mem | Activity | Activity Detail |
|      |                         |      | (Estim) |      | Active(s) | Active |       | (Actual) |     |   (%)    |   (# samples)   |
=====================================================================================================================================
|    0 | SELECT STATEMENT        |      |         |      |           |        |     1 |          |     |          |                 |
| -> 1 |   SORT AGGREGATE        |      |       1 |      |       141 |     +6 |     1 |        0 |     |     6.90 | Cpu (10)        |
| -> 2 |    MERGE JOIN CARTESIAN |      |     10G |  43M |       141 |     +6 |     1 |       4G |     |          |                 |
| -> 3 |     TABLE ACCESS FULL   | T    |    100K |  429 |       141 |     +6 |     1 |    39675 |     |     0.69 | Cpu (1)         |
| -> 4 |     BUFFER SORT         |      |    100K |  43M |       145 |     +2 | 39675 |       4G |  2M |    92.41 | Cpu (134)       |
|    5 |      TABLE ACCESS FULL  | T    |    100K |  427 |         1 |     +6 |     1 |     100K |     |          |                 |
=====================================================================================================================================

It’s almost everything you need in order to identify your problem — for each operation, you can find the expected and actual rowcount, memory usage, activity details etc.

For more information on real-time SQL monitoring, see here.

Active session history

If you’re not on 11g yet, then you have to use other tools, and one of the best alternatives is Active Session History (ASH). Using the V$ACTIVE_SESSION_HISTORY view, you can see what your SQL is waiting on, e.g.:

select nvl(event, session_state), count(*)
from v$active_session_history
where sql_id = 'bdkccn9auvaf2'
and session_id = 200
and session_serial# = 69
group by nvl(event, session_state)
order by 2 desc;

This query will tell you what wait events the query is waiting on. Even though you don’t get a break-down by plan operation with this approach, you still get some very valuable clues as to your performance problem. Furthermore, if you find that your query is waiting on disk I/O you can use event parameters (columns p1, p2 and p3 of the view) to identify the file and block the query is waiting for, which you can then use to identify the object name (e.g. here).

Unfortunately, often the query is stuck on CPU (e.g. when performing lots of logical reads because of an inefficient plan), in which case ASH cannot provide any additional details (if you have server access, and your platform is Solaris, you can use DTrace scripts for that, e.g. see Alexander Anokhin’s post).

Longops

If a plan operation is taking more than 6(?) seconds, it will show up in V$SESSION_LONGOPS view (if you have Toad for Oracle with DBA module, you can conveniently see the contents of the view on “Longops” tab of the Session browser). This is very convenient because not only you can identify which operation your plan is stuck on, but also see the progress of this operation using TOTALWORK and SOFAR columns (however keep in mind that for some sorts and joins these numbers can be inaccurate).

Using longops is great for spotting expensive full table scans, index fast full scans, hash joins, sorts etc. However if Oracle spends hours on an operation in total, but only a couple seconds at once, you won’t see it.

Tracing

Tracing is a classic performance troubleshooting tool. Normally it’s enabled via ALTER SESSION SET EVENT statement or a dbms_monitor call (a wrapper to dbms_system), but when a query is stuck then the only way I know is by calling dbms_system.set_ev directly:
exec dbms_system.set_ev(&&sid, &&serial#, 10046, 12, ”), where &&sid and &&serial# are the sid and serial# for the session running the stuck query, 10046 is the event number for SQL tracing (other events will allow you to trace optimizer, joins, sorts etc.), 12 is the level of tracing (it means that both wait events and binds will be captured; if you don’t want binds, use level = 8). To stop the tracing, call exec dbms_system.set_ev(&&sid, &&serial#, 10046, 0, ”).

A serious downside of that it requires access to the database server filesystem (because that’s where the trace file resides). A yet more serious problem is that if tracing stops before the query stopped running, the trace file won’t capture plan statistics, and in any case trace files don’t contain any information about individual logical reads — so it will probably only contain wait events for this session, i.e. essentially same sort of information that you can find in ASH. Of course, with ASH you only have samples (taken once per second), while with tracing you have the actual events — but if the number of samples is big enough samples are almost as good.

Conclusion

In this post, I reviewed several common tools for troubleshooting runaway queries. In most cases the best choice is SQL monitor, which is only available on 11g. For earlier database versions, there are other methods available, which have their pluses and minuses. The bottom line is that there are always ways to obtain accurate diagnostic information, so there is no excuse for guessing. Instead of randomly trying hints, blaming suspiciously looking operations (“my plan has a full table scan — that must be why it’s so slow”) or the database as a whole (“my query is running slow, can you please check the database?”), one should find the most appropriate tool to see where exactly the query is spending most of the time, and work on it.

7 thoughts on “Troubleshooting stuck queries

  1. Hi nikolay..great post..
    From your post, I got that:

    There is one query in our application which takes a lot of time to give result and from the EVENT column in V$SESSION view I found that it waits for asynch descriptor resize. I am using 11.2.0.2. What may be the reason..

    Thanks and regards,

    Sandy

    1. hi Sandy,

      can you tell me how you found that it’s waiting on asynch descriptor resize? The reason I’m asking is that I want to make sure that the session wasn’t in fact running on CPU, which may look very similar (you can tell it by looking at session state or wait_time).

      If your observation was correct, then you may suffer from misconfigured asynchronous I/O. This is an OS issue and needs to be resolved on the OS level.

      Best regards,
      Nikolay

      1. Hi Nikolay..

        The EVENT column shows me asynch descriptor resize for a long time.
        And the STATE column is showing WAITED SHORT TIME and the WAIT_TIME column is showing me -1

        Now if its a OS kernel parameter settings issue, then what parameter should I change…My database is in Linux EL6 and while installing, I set the paramateres as per the oracle doc.

        Are you referring to the kernel parameter fs-aio-nr…

        Thanks in advance Nikolay..

      2. hi Sandy,

        SECONDS_IN_WAIT means what the name suggests only when the session is still in WAITING state, otherwise its true meaning is “seconds since the last wait”. This often leads to confusion.

        So in your case, I don’t think that your problem has anything to do with asynch I/O — the session is busy with something else (probably running on CPU). Run a trace with waits on (level 8 or 12) to find out what exactly the session is waiting for.

        You may find these two links useful for better understanding of wait events:

        http://docs.oracle.com/cd/B28359_01/server.111/b28320/dynviews_3023.htm

        http://oracledoug.com/serendipity/index.php?/archives/1510-Wait-Times.html

  2. Hi Nikolay,

    the SQL Monitor is an excellent tool, but perhaps it’s worth mentioning that you need the diagnostics and the tuning pack to use it in production (if things have not changed since the last time I checked the licence information). The only thing I miss in SQL Monitor is the predicate section – so most of the time I end up using the SQL Monitor and dbms_xplan at the same time.

    Regards

    Martin

    1. Hi Martin,

      both very good points. Regarding the predicate section — you are right that it’s useful, but it can also be easily produced with regular explain plan, so this is something I normally don’t have to worry about (unlike A-time & A-rows which are often hard to obtain in a production environment if a query takes too long to run). But it’s a valid point that SQL Monitor output could benefit from having more plan information (such as cost and predicates) in it.

Leave a comment