A picture paints a thousand words

When I first learned about Active Session History, it was a real game changer for me. It’s (kinda) like tracing which is always on, for every single session… well, active session, sure, but who cares about idle ones? For a while I got so obsessed with it that I almost stopped using other tools — fortunately, that was a only short while, because as great as ASH is, you still need other tools. But to the day, ASH still remains one of my favorites. However, to fully exploit its potential, you need to properly visualize its results, otherwise they can be misleading, as I intend to show in the rest of this post.

The problem I was investigating was about a batch job that took almost x10 longer than it was supposed to. The AWR report wasn’t particularly helpful so I won’t give it here (because it points to a number of issues that aren’t really relevant here). But one of the reasons we don’t like AWR is that it shows numbers based on all database activity, which is not good when you are only interested in a specific part of it. ASH can easily remedy that, as you can add any filters in your query against [G]V$ACTIVE_SESSION_HISTORY or DBA_HIST_ACTIVE_SESS_HISTORY. In my case, the job was spawning child jobs, which could have been identified by a substring in the “action” attribute, so my query was something like:

select nvl(event, session_state) event,
       round(100*count(*)/sum(count(*))over(), 1) pct_total
from dba_hist_active_sess_history ash
where sample_time between timestamp '2019-04-30 00:00:00' and timestamp '2019-04-30 00:03:00'
and action like 'job_identifier_suffix_%'
group by nvl(event, session_state)
order by count(*) desc

and the results were:

ON CPU	                        56.2
cell single block physical read	16.8
cell smart table scan	         3.9
library cache: mutex X	         3.3
latch: redo allocation	         2.6

Where do I go from there? “ON CPU” is the biggest one here, but from AWR I knew that CPU utilization was very low, and the OEM performance screen showed no nasty light-green stuff (“CPU waits”). The next biggest one is “cell single block physical read”, which is User I/O, so it doesn’t look anomalous, either (besides, even if it did, 16.8% is not enough to explain a 1000% performance degradation). Finally, from the AWR report I new that the average time for this event was normal (even better than normally). Same goes about the next one, cell smart table scan. The next two are certainly less normal, but the percentages are very small, so from the practical point of view there is no reason to bother even looking at them.

OK, let’s take a look from a different angle: since the “event” dimension isn’t helpful, what about sql_id? Here is what I saw (cut to the top 5):

34h1wx2z38g5d	15.8
g2x62uyw372yb	7
268gyqpkwp8x3	6.3
dc0hqgfq17649	4.8
6uh4rcuur266b	4.6

Once again, a disappointment. What you want to see is obviously something big, something like some sql_id taking up 99.9% of DB time — then you know who your culprit is, you tune that one sql_id and you’re the hero. But when the load is as spread out as this, things are less easy.

There could be, of course, other ways to slice and dice ASH data, but at this point I turned from numbers to pictures. I have a little utility written in R shiny to produce OEM-style graphs using ASH, it has a simple web interface which allows you to chose the dimension and filters, and optionally specify the quantity to be plotted (by default, it’s average active sessions, but you can also get I/O requests or I/O megabytes for reads or writes, or even interconnect traffic, as ASH has this stuff as well). Here is what it showed me:

ashviewer

You can ignore the second peak on the right, as it’s not relevant here (the minimum interval for plotting in my tool is 1 hour, and we’re only intereseted in first ~30 minutes here), but the shape of the first peaks tells you the entire story. Or rather it’s the small flat plateau to the right of the main peak. Remember the stuff I told you about the job spawning child jobs? So this pattern looks very much like one of the child jobs got stuck at some point, and everything was waiting for it to finish. Let’s now use a filter to only keep ASH samples for that child job, and do the slicing and dicing once again. Here is the breakdown by sql_id:

7jsud6um70kzp	98.7
268gyqpkwp8x3	0.7
3mmrq8bf7ux6b	0.7

Now we’re talking! The first one in the list is responsible for 99% of the time, so we’re in luck — looks like the scope of the problem is limited to a single SQL statement. But what is the nature of the problem with it?

Interestingly enough, querying ASH and DBA_HIST_SQLSTAT showed no changes in the query plan. Also, there no “bad” events associated with this sql_id during that time frame — so Oracle was executing it same way it was always doing, and was performing I/O same was it always did, it’s just that this time it did lots and lots of I/O. But why? The structure of the SQL gives some clues:

SELECT A.ID
FROM A, B
WHERE A.ID = B.ID
AND A.VERSION = B.VERSION
AND B.ATTR1 = :B1
AND A.ATTR2 = :B2
AND NVL(A.ATTR3,'none') = NVL(:B3 ,'none')
AND -- a few more static conditions here
AND ROWNUM                        = 1;

So this is a parametric query with a stopkey condition (ROWNUM=1), which means that depending on how soon you encounter the row matching query’s condition, the query can take a varying amount of time. And if the plan is not very efficient in general, then it can vary a lot. So it’s possible that for most combinations of parameters used by the child jobs it completed quickly, but for some of them it took really long time before finding one good row

I could try using a more efficient index than the actual plan did, but given that this is an Exadata machine (as I’m sure you guessed by this point after seeing things like cell smart table scan events earlier on), and the fact that tables A and B were not huge, forcing a smart scan instead of an index range scan did the trick.

As you can see, the problem isn’t terribly complicated or difficult to fix. Once you visualized it properly, it’s almost trivial. But without this visualization it’s possible to stray for hours and maybe even days, and to arrive to completely wrong conclusions.

I do realize that not everyone is comfortable enough with R or R shiny to do that kind of visualization there, but it’s not really necessary. From what I know, ASH analytics add-on to OEM offers somewhat similar functionality (not sure how good/bad it is as I haven’t used it myself, but it should be okay for the basic stuff I used in this example). Of course if you’re doing okay with R or willing to give it a go, I’m happy to share the code — just ping me in comments or on Twitter.