Non-intrusive tracing

Earlier this year I’ve already touched upon the subject of so called “Observer effect” – impact that the act of observation makes upon the observed process – applied to the database world.  In this blog I’d like to expand on this subject a little bit, and discuss one way to minimize this effect using OS observability tools.

Traditional tools

All tools that we use to obtain diagnostic information about database processes are intrusive to some extent. However, some of them are built-in in Oracle, and therefore their effect is present at all times, and constitutes a part of the normal behavior of the database. Thus, such tools as Oracle Wait Interface and Time Model Statistics, externalized via various V$ views, can be considered non-intrusive. There exist a number of popular utilities that   provide a way to use this information to obtain workload metrics for a process of interest (basically, they take a before and an after snapshots and calculate and output the deltas). Many of them can be found e.g. here.

One common problem with these tools is the fact that V$ views externalize information from X$ “tables”, or more accurately, memory structures that can be queried like a database table, but do not provide read consistency. As a result, when using these tools, you will occasionally see some anomalies (like noticeable increments in some statistics without any workload in the monitored session to explain it). Of course, if you’re only interested in the big picture, then it’s not necessarily a problem. But when looking at subtle effects it can be a real nuisance. These problems can be minimized by taking snapshots in an external session, but this approach introduces problems of its own (e.g. it becomes difficult to automate tests with a large number of tests cases).

Another problem is that such tools only provide aggregate statistics (as opposed to raw trace files that contain individual events along with their parameters and wait times). Often, this is not enough.

OS tools

One tempting alternative is to use OS-level observability tools (such as dtrace, perf, strace etc.), which of course introduce their own overhead, but are unlikely to change Oracle’s logic (because Oracle doesn’t know that it’s being watched!). It’s a unique way to get a peek at internals, but interpreting this diagnostic information can be challenging, because it’s completely decoupled from database diagnostic data. For example, you see an I/O request — how do you match it to a specific database event, especially when there is no 1-to-1 mapping?

Wait events from OS

Fortunately, there’s a way to enjoy the best of both worlds! It was shown by Luca Canali in his brilliant series on observing logical and physical I/O. The idea is really simple, and very cool: trace OWI function calls from the OS! Let me recapitulate it here: there are functions that are called at the beginning and end of a wait event, and probing those function calls with systemtap allows to capture their parameters (which can be then converted to wait times and event parameters p1, p2 and p3 using simple arithmetics).

The systemtap script that Luca is using is doing a lot of things, so I removed anything other than looking at wait events (and only kept the part that looks at the end of the wait event, as it provides all the necessary information). Low-level implementation of wait instrumentation is version-dependent, so for example for 12.1.0.2 you can use the wait.stp script below:

#!/usr/local/bin/stap
probe process("oracle").function("kskthewt") {
xksuse = register("r13") - 3928
ksuudnam = user_string(xksuse + 140)
ksusenum = user_uint16(xksuse + 1704)
ksuseopc = user_uint16(xksuse + 1602)
ksusep1 = user_uint64(xksuse + 1608)
ksusep2 = user_uint64(xksuse + 1616)
ksusep3 = user_uint64(xksuse + 1624)
ksusetim = user_uint32(xksuse + 1632)
ksusesqh = user_uint32(xksuse + 1868)
ksuseobj = user_uint32(xksuse + 2312)
printf("DB WAIT EVENT END: timestamp_ora=%ld, pid=%d, sid=%d, name=%s, event#=%u, p1=%lu, p2=%lu, p3=%lu, wait_time=%u, obj=%d, sql_hash=%u\n==========\n",
register("rdi"), pid(), ksusenum, ksuudnam, ksuseopc, ksusep1, ksusep2, ksusep3, ksusetim, ksuseobj, ksusesqh)
}

Calling the script is very simple:

stap wait.stp -x <pid>

where pid of the OS process can obtained by querying V$PROCESS and V$SESSION:

select p.spid
from v$process p,
     v$session s
where s.paddr = p.addr
and s.sid = userenv('sid');

(I use a cool trick to set terminal window header to display sid, serial# and spid as described here)
 

Automation

The method above works just fine, except for the part that the event capturing script needs to be launched manually in a separate terminal window. When performing large numbers of tests in loops, this becomes a serious inconvenience. Fortunately, there is a solution to this problem: systemtap can run in background using so-called “flight recorder” mode. It allows to launch systemtap monitoring directly from an sqlplus script:

column spid new_value spid
...
select p.spid ...
...
host stap -F -o <output_file_name> wait.stp -x &spid

When all diagnostic information is collected, the monitoring process can be killed using something like:

host kill -9 $(pidof stapio)

Note that this would kill ALL stapio processes so it’s only safe to do this on a sandbox environment where you are the only user! There might be a more precise way to kill the launched stapio process using the process identifier returned by the shell, but there is no elegant way to read in this process id into sqlplus so I just killed all stapio processes indiscriminately.

Postprocessing

Working with flat files is a good way to appreciate the power and flexibility of relational databases. You quickly notice even simplest things become difficult. For example, the script above produces wait event numbers instead of names. Of course, it’s possible to add a sed script like Luca did, but the more complex your analysis becomes, the sooner you start to miss the ability to join information from different sources into a single report, like you can do in Oracle.

There is a solution, for that problem as well: simply create an external table on the top of that flat file (as shown here) and use regexp to parse its contents:

SELECT regexp_substr(text, 'timestamp_ora=(\d*)', 1, 1, null, 1) timestamp_ora,
         regexp_substr(text, 'pid=(\d*)', 1, 1, null, 1) pid,
         regexp_substr(text, 'sid=(\d*)', 1, 1, null, 1) sid,
         regexp_substr(text, 'event#=([^,]*),', 1, 1, null, 1) event,
         regexp_substr(text, 'p1=([^,]*),', 1, 1, null, 1) p1,
         regexp_substr(text, 'p2=([^,]*),', 1, 1, null, 1) p2,
         regexp_substr(text, 'p3=([^,]*),', 1, 1, null, 1) p3,
         regexp_substr(text, 'wait_time=([^,]*),', 1, 1, null, 1) wait_time,
         regexp_substr(text, 'obj=([^,]*),', 1, 1, null, 1) obj,       
         regexp_substr(text, 'sql_hash=(.*)$', 1, 1, null, 1) sql_hash       
  FROM <systemtap_output_file> f
  where text like 'DB WAIT EVENT END%'

That’s it! Now you can do any analytical reporting on your output super-easily. Want to convert event# into a human-readable event name? Simply join to V$EVENT_NAME! Want to see object names rather than their numbers? Simply join to DBA_OBJECTS! And so on, and so forth.

Summary

Oracle built-in tracing tools (extended SQL trace, autotrace, statistics_level = all etc.) can change the behavior of the observed process (e.g. it can disable nested loop optimizations like prefetch or batching). Systemtap provides a unique way to observe internals of such processes in a non-intrusive way. In-flight mode allows to start and stop systemtap monitoring directly from sqlplus prompt or script of the observed session. External files and regexp expressions provide a way to analyze obtained output using all power of SQL reporting.

Relocation

I feel like I owe a big apology to my readers — not also I haven’t blogged anything for about half a year, but also I started a series and I haven’t finished it. But I have a good excuse. In September, I’ve moved to the UK. It’s a great place to be if you’re into databases — UKOUG is one of the strongest Oracle groups in the world, if not the strongest one. And of course it’s just a great place to be (if you’re not spoiled growing up in a place with nice climate — but I’m from Russia, so leaden skies make me feel like I’m at home!).

The relocation has consumed all my free time and energy for several months. Now the transition period is coming to its end and I’m planning to resume blogging soon. I’m not planning to continue the SQL performance series — it’s very time-consuming, and I haven’t received enough feedback to keep me motivated enough to spend all this time.

But there’s a lot of other stuff — I’m still getting many interesting cases, a few studies are underway, and a few more are planned. So as they say, “stay tuned”!