My 2015

Year 2015 was a very good one for me, even though not exactly in a way I expected it to be. I didn’t get to blog as much as I wanted to, and I didn’t get as many interesting performance troubleshooting to do as years before that. But there was lots of other interesting experiences — e.g. designing, running and analyzing all sorts of sophisticated performance tests for a candidate hardware platform.

Of course, the most important event of the year was moving to the UK, and the new challenges and opportunities this move presented. It was a very positive experience overall (although there are a few aspects of life in the UK that I still need to adapt to, but that’s perfectly normal).

I also spoke at Harmony 2015 conference in Tallinn (LGWR stuff from my previous year’s research), and that was also new and important experience for me. I’m hoping to do this more in the future. I attended a very interesting UKOUG Tech’15 conference in Birmingham (as a delegate, not a speaker), and had a few very interesting conversations there (in particular, I’m very grateful to Tanel Poder for finding some time for me — that conversation was extremely useful).

I am really looking forward to 2016. No one knows what it would bring us, but I have good reasons to expect great things from it. For example, there’s a good chance that I’ll get involved with some interesting topics, including Exadata, building integrated solutions using both relational and non-relational technologies, performance-related internals digging, and others (and of course I’ll cover the most interesting stuff in this blog).

I am hoping that 2016 would be an eventful and productive year for you as well. Happy holidays everyone!

Non-intrusive tracing

Earlier this year I’ve already touched upon the subject of so called “Observer effect” – impact that the act of observation makes upon the observed process – applied to the database world.  In this blog I’d like to expand on this subject a little bit, and discuss one way to minimize this effect using OS observability tools.

Traditional tools

All tools that we use to obtain diagnostic information about database processes are intrusive to some extent. However, some of them are built-in in Oracle, and therefore their effect is present at all times, and constitutes a part of the normal behavior of the database. Thus, such tools as Oracle Wait Interface and Time Model Statistics, externalized via various V$ views, can be considered non-intrusive. There exist a number of popular utilities that   provide a way to use this information to obtain workload metrics for a process of interest (basically, they take a before and an after snapshots and calculate and output the deltas). Many of them can be found e.g. here.

One common problem with these tools is the fact that V$ views externalize information from X$ “tables”, or more accurately, memory structures that can be queried like a database table, but do not provide read consistency. As a result, when using these tools, you will occasionally see some anomalies (like noticeable increments in some statistics without any workload in the monitored session to explain it). Of course, if you’re only interested in the big picture, then it’s not necessarily a problem. But when looking at subtle effects it can be a real nuisance. These problems can be minimized by taking snapshots in an external session, but this approach introduces problems of its own (e.g. it becomes difficult to automate tests with a large number of tests cases).

Another problem is that such tools only provide aggregate statistics (as opposed to raw trace files that contain individual events along with their parameters and wait times). Often, this is not enough.

OS tools

One tempting alternative is to use OS-level observability tools (such as dtrace, perf, strace etc.), which of course introduce their own overhead, but are unlikely to change Oracle’s logic (because Oracle doesn’t know that it’s being watched!). It’s a unique way to get a peek at internals, but interpreting this diagnostic information can be challenging, because it’s completely decoupled from database diagnostic data. For example, you see an I/O request — how do you match it to a specific database event, especially when there is no 1-to-1 mapping?

Wait events from OS

Fortunately, there’s a way to enjoy the best of both worlds! It was shown by Luca Canali in his brilliant series on observing logical and physical I/O. The idea is really simple, and very cool: trace OWI function calls from the OS! Let me recapitulate it here: there are functions that are called at the beginning and end of a wait event, and probing those function calls with systemtap allows to capture their parameters (which can be then converted to wait times and event parameters p1, p2 and p3 using simple arithmetics).

The systemtap script that Luca is using is doing a lot of things, so I removed anything other than looking at wait events (and only kept the part that looks at the end of the wait event, as it provides all the necessary information). Low-level implementation of wait instrumentation is version-dependent, so for example for you can use the wait.stp script below:

probe process("oracle").function("kskthewt") {
xksuse = register("r13") - 3928
ksuudnam = user_string(xksuse + 140)
ksusenum = user_uint16(xksuse + 1704)
ksuseopc = user_uint16(xksuse + 1602)
ksusep1 = user_uint64(xksuse + 1608)
ksusep2 = user_uint64(xksuse + 1616)
ksusep3 = user_uint64(xksuse + 1624)
ksusetim = user_uint32(xksuse + 1632)
ksusesqh = user_uint32(xksuse + 1868)
ksuseobj = user_uint32(xksuse + 2312)
printf("DB WAIT EVENT END: timestamp_ora=%ld, pid=%d, sid=%d, name=%s, event#=%u, p1=%lu, p2=%lu, p3=%lu, wait_time=%u, obj=%d, sql_hash=%u\n==========\n",
register("rdi"), pid(), ksusenum, ksuudnam, ksuseopc, ksusep1, ksusep2, ksusep3, ksusetim, ksuseobj, ksusesqh)

Calling the script is very simple:

stap wait.stp -x <pid>

where pid of the OS process can obtained by querying V$PROCESS and V$SESSION:

select p.spid
from v$process p,
     v$session s
where s.paddr = p.addr
and s.sid = userenv('sid');

(I use a cool trick to set terminal window header to display sid, serial# and spid as described here)


The method above works just fine, except for the part that the event capturing script needs to be launched manually in a separate terminal window. When performing large numbers of tests in loops, this becomes a serious inconvenience. Fortunately, there is a solution to this problem: systemtap can run in background using so-called “flight recorder” mode. It allows to launch systemtap monitoring directly from an sqlplus script:

column spid new_value spid
select p.spid ...
host stap -F -o <output_file_name> wait.stp -x &spid

When all diagnostic information is collected, the monitoring process can be killed using something like:

host kill -9 $(pidof stapio)

Note that this would kill ALL stapio processes so it’s only safe to do this on a sandbox environment where you are the only user! There might be a more precise way to kill the launched stapio process using the process identifier returned by the shell, but there is no elegant way to read in this process id into sqlplus so I just killed all stapio processes indiscriminately.


Working with flat files is a good way to appreciate the power and flexibility of relational databases. You quickly notice even simplest things become difficult. For example, the script above produces wait event numbers instead of names. Of course, it’s possible to add a sed script like Luca did, but the more complex your analysis becomes, the sooner you start to miss the ability to join information from different sources into a single report, like you can do in Oracle.

There is a solution, for that problem as well: simply create an external table on the top of that flat file (as shown here) and use regexp to parse its contents:

SELECT regexp_substr(text, 'timestamp_ora=(\d*)', 1, 1, null, 1) timestamp_ora,
         regexp_substr(text, 'pid=(\d*)', 1, 1, null, 1) pid,
         regexp_substr(text, 'sid=(\d*)', 1, 1, null, 1) sid,
         regexp_substr(text, 'event#=([^,]*),', 1, 1, null, 1) event,
         regexp_substr(text, 'p1=([^,]*),', 1, 1, null, 1) p1,
         regexp_substr(text, 'p2=([^,]*),', 1, 1, null, 1) p2,
         regexp_substr(text, 'p3=([^,]*),', 1, 1, null, 1) p3,
         regexp_substr(text, 'wait_time=([^,]*),', 1, 1, null, 1) wait_time,
         regexp_substr(text, 'obj=([^,]*),', 1, 1, null, 1) obj,       
         regexp_substr(text, 'sql_hash=(.*)$', 1, 1, null, 1) sql_hash       
  FROM <systemtap_output_file> f
  where text like 'DB WAIT EVENT END%'

That’s it! Now you can do any analytical reporting on your output super-easily. Want to convert event# into a human-readable event name? Simply join to V$EVENT_NAME! Want to see object names rather than their numbers? Simply join to DBA_OBJECTS! And so on, and so forth.


Oracle built-in tracing tools (extended SQL trace, autotrace, statistics_level = all etc.) can change the behavior of the observed process (e.g. it can disable nested loop optimizations like prefetch or batching). Systemtap provides a unique way to observe internals of such processes in a non-intrusive way. In-flight mode allows to start and stop systemtap monitoring directly from sqlplus prompt or script of the observed session. External files and regexp expressions provide a way to analyze obtained output using all power of SQL reporting.


I feel like I owe a big apology to my readers — not also I haven’t blogged anything for about half a year, but also I started a series and I haven’t finished it. But I have a good excuse. In September, I’ve moved to the UK. It’s a great place to be if you’re into databases — UKOUG is one of the strongest Oracle groups in the world, if not the strongest one. And of course it’s just a great place to be (if you’re not spoiled growing up in a place with nice climate — but I’m from Russia, so leaden skies make me feel like I’m at home!).

The relocation has consumed all my free time and energy for several months. Now the transition period is coming to its end and I’m planning to resume blogging soon. I’m not planning to continue the SQL performance series — it’s very time-consuming, and I haven’t received enough feedback to keep me motivated enough to spend all this time.

But there’s a lot of other stuff — I’m still getting many interesting cases, a few studies are underway, and a few more are planned. So as they say, “stay tuned”!

SQL performance, Part III. Data storage strategies

Imagine that you’re on a desert island where some pirates hid their treasure. You don’t know where exactly it is hidden, but you want to find it. How should you approach this – e.g. should you dig randomly? Or should you follow some sort of a system – and if yes, then which?

I think most readers have smelled a trap right away (it’s not a very sophisticated trap anyway) – of course it doesn’t matter how you do it. If it takes you X minutes to examine 1 square foot of the surface, and the total surface is Y square feet, it will take you anywhere from 0 to Y/X minutes (so, Y/X/2 on the average) to explore the entire island and find the treasure, no matter which system you’re following (or using no system at all). When there is no information whatsoever about the object(s) sought, all search strategies are equally (in)efficient.

Continue reading “SQL performance, Part III. Data storage strategies”

SQL Performance, Part I. Data access efficiency.

Database performance optimization deals with the problem of reducing time (and/or other resources) required to complete database queries by resolving various inefficiencies. There are many kinds of performance issues, and they can be categorized in many different ways. One important way is by scope: statement-level or instance-level. Instance-level performance tuning (which we are not covering in this mini-series) deals with such problems as excessive parsing, cache efficiency, concurrency etc. Statement-level performance tuning deals mostly with inefficient data access.

Continue reading “SQL Performance, Part I. Data access efficiency.”

Giving back

Occasionally, I receive requests to explain stuff I talk about in my blog posts on a level accessible to beginners. Unfortunately, it’s not always possible to do so. You start to explain one little thing, and then you realize that in order to understand that thing, one would to understand a dozen other little things, and before you know it, instead of writing a short blog post you find yourself writing something of the size of Encyclopedia Britannica. I decided to try to go in a different direction. I’ve accumulated some beginner-to-intermediate-level material that hopefully should make certain topics in database performance optimization clearer. I will be posting them in form of short “lessons”. I don’t have how many such lessons will be in total — I guess this depends a lot on the feedback that I’ll be receiving. My plan is to start with statement-level optimization topics, and then, if this goes well, to move on to instance-level stuff. I will try to: 1) provide a high-level overview without getting caught up into details 2) present a holistic view of Oracle database performance rather than asystematic assembly of performance-related topics 3) “derive” facts from higher principles rather than simply state them 4) use analogies to make basic performance concepts more intuitive 5) be as quantitative and specific as possible 6) heavily use demos. Hope you’ll find it useful.



Part I, Data access efficiency

Part II, Disk I/O: metrics and scales

Part III, Data storage strategies

Observer effect

In physics, one important limitation of any experiment is the fact that an act of observation inevitably interferes with the observed process (“observer effect”). Same thing is true about databases. It is a well known fact that, for example, turning on tracing can significantly slow down the process for which it’s enabled. But there exist even nastier forms of this effect: for example, when you try to trace a SQL statement using nested loop batching mechanism, apparently, the batching gets disabled! This is extremely confusing, because plan statistics in the trace file indicate that the batching takes place (i.e. the plan shows the “double nested loop” shape characteristic for NLJ batching), and if you look at the plan using dbms_xplan.display_cursor using “outline” option, you’ll see NLJ_BATCHING among other hints.

This effect was discovered by Randolf Geist, and published in the part 3 of his wonderful series on Logical I/O evolution (see here, here and here). It is very thorough and very detailed. Because of the level of detail in there it’s rather lengthy, which is why not everyone gets to that particular paragraph, so let me quote it in here:

…be aware of an odd behaviour I’ve experienced during my tests: If any kind of row source statistics sampling was enabled by either using STATISTICS_LEVEL = ALL, the GATHER_PLAN_STATISTICS hint or even enabling (extended) SQL trace, the optimized, batched form of physical I/O could not be reproduced. You could tell this from the session statistics that start with “Batched IO%” – these all stayed at 0. Only when disabling all these things the effects were visible and the corresponding statistics where non-zero. I don’t know why this is the case, but it is an important detail when testing this feature.

I think it is important that the community be aware of this behavior, because not knowing about it made me waste several days of work, and I know at least one other person who also had difficulties of the same type. But aside from this particular oddity, this is also a reminder of importance of the “observer effect” in general. So any findings obtained with intrusive methods (like extended SQL trace or statistlcs_level = ALL etc.) need to be cross-checked with less intrusive methods (e.g. dumps from V$MYSTAT, V$SESSION_EVENT, V$SESS_TIME_MODEL etc., or SQL real-time monitor reports).

The obvious downside for using V$ views is necessity to take the baseline, but there is a simple workaround if taking a baseline is inconvenient: one can simply establish a new session, thus resetting all counters and statistics to zero.

Optimizing work environment. Part II, command line

(the first part can be found here)

Now let’s consider command line. There is a lot of things that can be done to improve user experience here as well.

1) Maintain a local script repository on the machine(s) you’re using the most. To make sqlplus search there, define SQLPATH variable (using Control Panel => System => Advanced => Environment variables on Windows, or shell initialization scripts such as .profile or .bash_profile on Unix).

2) To facilitate synchronization of scripts among different environments, use web-repositories. One option here is to go with a plain http (or ftp) repository e.g. using any (free or paid) hosting service, or your own web server if you have any. The other option is to go with a code repository tool like git that utilizes the secure (https) protocol. Plain http has the advantage of you being able to run scripts directly from the online repository (see here for more detail). But if your internet connection requires proxy, this may prove difficult (setting HTTP_PROXY variable might help, but I myself had no luck with it).

Continue reading “Optimizing work environment. Part II, command line”