Nested loop internals. Part 3: comparative efficiency

In the previous parts of this series we looked at some aspects of nested loop I/O optimizations, but we left out the question that matters most from a practical point of view: how do these methods compare time-wise? Which ones are faster, and how much do they save compared to the non-optimized plan? We will turn to these questions now.

Nested loop internals. Part 2: decision making

In the previous part of this mini-series we looked at differences in multiblock read behavior for different nested loop optimization mechanisms depending on the degree of ordering of the data. In this post I’ll continue to explore the subject, but this time we’ll focus on the decision-making process: what factors (other than the obvious ones, like optimizer hints and/or parameters) affect the choice of a specific mechanism?

Nested loop internals

A nested loop join looks like the simplest thing there could be: you go through one table, and for each row found you probe the second table for matching rows. But thanks to a number of optimizations introduced in recent Oracle releases, it has become much more complex than that. Randolf Geist has written a great series of posts about this join mechanism (part 1, part 2 and part 3) where he explores in great detail how the numerous nested loop optimizations interact with various logical I/O optimizations for unique and non-unique indexes. Unfortunately, it doesn’t cover the physical I/O aspects, and that seems to me like the most interesting part. After all, physical I/O was the primary motivation behind introducing all those additional nested loop join mechanisms on top of the basic classical nested loop. So I conducted a study of my own, and I’m presenting my results in the mini-series that I’m opening with this post.

Peeking table block contents

Sometimes you want to know what’s inside a certain block. Of course, the most straightforward way to do it is to dump the block with ALTER SYSTEM DUMP DATAFILE and analyze the output. However, “straightforward” doesn’t mean “simple”. A block dump presents the contents in a raw format which is hard to read. Sure, there are various utilities (like utl_raw) that can help you convert everything to a human-readable format, but it’s going to be a tedious and time-consuming job, especially if you need more than just a few values from just a couple of blocks. Another problem is that the dump is written to a trace file on the server, and you may not have access to the server OS shell (e.g. developers rarely have access to it even on non-production systems, except maybe on private sandboxes).

There’s a better way, at least if the block you’re interested in is a standard table data block and all you want to know is what kind of data it contains (and not internal information like locks, flags, free space etc.). The idea is that rather than going to the block itself, you can use the rowid of each row to calculate its block number and relative file number. If you know the table of interest, you do this calculation for all of its rows and then simply filter the result set down to the block you’re interested in. It’s really much simpler than it sounds, just bear with me a little and you’ll see.

The first step would be identifying the segment name. Normally you already know it from the very beginning. For example, if you found the address of the block that you want to look up in ASH or in a trace file, e.g. as p1 and p2 parameters of “db file sequential read” event, then you can simply take current_obj# and look up the object name in DBA_OBJECTS, using object_id (and/or data_object_id) as the key.
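The lookup described above can be sketched as follows. This is only an illustration: :obj is a bind variable assumed to hold the current_obj# value taken from ASH, and the query checks both keys because either one may match.

```sql
-- :obj is assumed to hold the CURRENT_OBJ# value found in ASH
-- (or the object number reported in a trace file)
select owner,
       object_name,
       subobject_name,
       object_type
from dba_objects
where object_id = :obj
   or data_object_id = :obj;
```

If the result is a partition or subpartition, subobject_name tells you which one, and the parent table name is still what goes into the queries that follow.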

Once you have the table name, you can display block contents using rowid_relative_fno and rowid_block_number functions of dbms_rowid package:

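select *
from (
  -- compute the file and block number of every row from its rowid
  select id1,
         dbms_rowid.rowid_relative_fno(rowid) fno,
         dbms_rowid.rowid_block_number(rowid) block#
  from &mytable
)
-- keep only the rows stored in the block of interest
where fno = :fno
and block# = :blockno;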

where &mytable obviously should be replaced with the name of the table identified in the previous step.

It could be convenient to aggregate the query above so that each block’s contents would be represented by a single row. For example, imagine that you have a table with a composite primary key (id1, id2), so that knowing these two values is enough to identify the row. Then you can use listagg to compactly represent block contents:

select listagg('(' || id1 || ',' || id2 || ')') within group (order by id1, id2) block_contents, fno, block#
from (
  -- both key columns are needed for the aggregation above
  select id1,
         id2,
         dbms_rowid.rowid_relative_fno(rowid) fno,
         dbms_rowid.rowid_block_number(rowid) block#
  from &mytable
)
group by fno, block#;

This is particularly convenient when working with a relatively large number of blocks that can be extracted from another SQL statement, e.g. a query on an external table built on top of a trace or a dump file, as described in one of my recent posts.

As a final touch, you can also add a query block to identify the table’s segment header to make sure you don’t miss it:


select 'segment header' block_contents,
header_file fno,
header_block block#
from dba_segments
where segment_name = '&mytable'

You can’t do the same trick on indexes as there’s no rowid for an index entry, but there are other ways to peek at contents of branch and leaf index blocks. If I get a chance, I’ll show one or two such methods in a separate blog post.


My 2015

The year 2015 was a very good one for me, even though not exactly in the way I expected it to be. I didn’t get to blog as much as I wanted to, and I didn’t get as much interesting performance troubleshooting to do as in previous years. But there were lots of other interesting experiences, e.g. designing, running and analyzing all sorts of sophisticated performance tests for a candidate hardware platform.

Of course, the most important event of the year was moving to the UK, and the new challenges and opportunities this move presented. It was a very positive experience overall (there are a few aspects of life in the UK that I still need to adapt to, but that’s perfectly normal).

I also spoke at the Harmony 2015 conference in Tallinn (LGWR stuff from my previous year’s research), and that was also a new and important experience for me. I’m hoping to do more of this in the future. I attended a very interesting UKOUG Tech’15 conference in Birmingham (as a delegate, not a speaker), and had a few very interesting conversations there (in particular, I’m very grateful to Tanel Poder for finding some time for me; that conversation was extremely useful).

I am really looking forward to 2016. No one knows what it will bring us, but I have good reasons to expect great things from it. For example, there’s a good chance that I’ll get involved with some interesting topics, including Exadata, building integrated solutions using both relational and non-relational technologies, performance-related internals digging, and others (and of course I’ll cover the most interesting stuff in this blog).

I am hoping that 2016 will be an eventful and productive year for you as well. Happy holidays everyone!

Non-intrusive tracing

Earlier this year I touched upon the subject of the so-called “observer effect” (the impact that the act of observation has on the observed process) as applied to the database world. In this post I’d like to expand on this subject a little bit, and discuss one way to minimize this effect using OS observability tools.

Traditional tools

All tools that we use to obtain diagnostic information about database processes are intrusive to some extent. However, some of them are built into Oracle, and therefore their effect is present at all times and constitutes a part of the normal behavior of the database. Thus, such tools as the Oracle Wait Interface and Time Model Statistics, externalized via various V$ views, can be considered non-intrusive. A number of popular utilities use this information to obtain workload metrics for a process of interest (basically, they take a before and an after snapshot, then calculate and output the deltas).
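To make the snapshot-and-delta idea concrete, here is a minimal sketch of what such utilities do, reduced to plain SQL. It is not taken from any particular tool: my_stat_before is a hypothetical scratch table name, and the session is assumed to have select privileges on V$MYSTAT and V$STATNAME.

```sql
-- "Before" snapshot of the current session's statistics.
-- my_stat_before is a hypothetical scratch table name.
create table my_stat_before as
select ms.statistic#, sn.name, ms.value
from v$mystat ms
join v$statname sn on sn.statistic# = ms.statistic#;

-- ... run the workload of interest here ...

-- Compare the current values with the saved snapshot
-- and print only the statistics that changed.
select b.name,
       ms.value - b.value delta
from v$mystat ms
join my_stat_before b on b.statistic# = ms.statistic#
where ms.value - b.value <> 0
order by delta desc;

drop table my_stat_before;
```

Note that taking the snapshot is itself work done by the monitored session, so some of the reported deltas belong to the measurement rather than to the workload.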

One common problem with these tools is the fact that V$ views externalize information from X$ “tables”, or more accurately, memory structures that can be queried like a database table, but do not provide read consistency. As a result, when using these tools, you will occasionally see some anomalies (like noticeable increments in some statistics without any workload in the monitored session to explain them). Of course, if you’re only interested in the big picture, then it’s not necessarily a problem. But when looking at subtle effects it can be a real nuisance. These problems can be minimized by taking snapshots in an external session, but this approach introduces problems of its own (e.g. it becomes difficult to automate tests with a large number of test cases).

Another problem is that such tools only provide aggregate statistics (as opposed to raw trace files that contain individual events along with their parameters and wait times). Often, this is not enough.

OS tools

One tempting alternative is to use OS-level observability tools (such as dtrace, perf, strace etc.), which of course introduce their own overhead, but are unlikely to change Oracle’s logic (because Oracle doesn’t know that it’s being watched!). It’s a unique way to get a peek at internals, but interpreting this diagnostic information can be challenging, because it’s completely decoupled from database diagnostic data. For example, you see an I/O request — how do you match it to a specific database event, especially when there is no 1-to-1 mapping?

Wait events from OS

Fortunately, there’s a way to enjoy the best of both worlds! It was shown by Luca Canali in his brilliant series on observing logical and physical I/O. The idea is really simple, and very cool: trace OWI function calls from the OS! Let me recapitulate it here: there are functions that are called at the beginning and end of a wait event, and probing those function calls with systemtap makes it possible to capture their parameters (which can then be converted to wait times and event parameters p1, p2 and p3 using simple arithmetic).

The systemtap script that Luca uses does a lot of things, so I removed everything other than the wait event tracing (and only kept the part that looks at the end of the wait event, as it provides all the necessary information). The low-level implementation of wait instrumentation is version-dependent, so the wait.stp script below, with its hard-coded offsets, should be treated as an example for one particular version:

# Fires at the end of a wait event. All offsets below are version-specific.
probe process("oracle").function("kskthewt") {
    # session fields are read at fixed offsets from an address derived from r13
    xksuse = register("r13") - 3928
    ksuudnam = user_string(xksuse + 140)
    ksusenum = user_uint16(xksuse + 1704)
    ksuseopc = user_uint16(xksuse + 1602)
    ksusep1 = user_uint64(xksuse + 1608)
    ksusep2 = user_uint64(xksuse + 1616)
    ksusep3 = user_uint64(xksuse + 1624)
    ksusetim = user_uint32(xksuse + 1632)
    ksusesqh = user_uint32(xksuse + 1868)
    ksuseobj = user_uint32(xksuse + 2312)
    # one record per wait event, followed by a separator line
    printf("DB WAIT EVENT END: timestamp_ora=%ld, pid=%d, sid=%d, name=%s, event#=%u, p1=%lu, p2=%lu, p3=%lu, wait_time=%u, obj=%d, sql_hash=%u\n==========\n",
        register("rdi"), pid(), ksusenum, ksuudnam, ksuseopc, ksusep1, ksusep2, ksusep3, ksusetim, ksuseobj, ksusesqh)
}

Calling the script is very simple:

stap wait.stp -x <pid>

where the pid of the OS process can be obtained by querying V$PROCESS and V$SESSION:

select p.spid
from v$process p,
     v$session s
where s.paddr = p.addr
and s.sid = userenv('sid');

(I use a cool trick to set terminal window header to display sid, serial# and spid as described here)


The method above works just fine, except that the event capturing script needs to be launched manually in a separate terminal window. When performing large numbers of tests in loops, this becomes a serious inconvenience. Fortunately, there is a solution to this problem: systemtap can run in the background using the so-called “flight recorder” mode. It allows you to launch systemtap monitoring directly from an sqlplus script:

column spid new_value spid
select p.spid ...
host stap -F -o <output_file_name> wait.stp -x &spid

When all diagnostic information is collected, the monitoring process can be killed using something like:

host kill -9 $(pidof stapio)

Note that this kills ALL stapio processes, so it’s only safe on a sandbox environment where you are the only user! There might be a more precise way to kill the launched stapio process using the process identifier returned by the shell, but there is no elegant way to read this process id into sqlplus, so I just killed all stapio processes indiscriminately.


Working with flat files is a good way to appreciate the power and flexibility of relational databases. You quickly notice that even the simplest things become difficult. For example, the script above produces wait event numbers instead of names. Of course, it’s possible to add a sed script like Luca did, but the more complex your analysis becomes, the sooner you start to miss the ability to join information from different sources into a single report, like you can do in Oracle.

There is a solution for that problem as well: simply create an external table on top of that flat file and use regular expressions to parse its contents:

SELECT regexp_substr(text, 'timestamp_ora=(\d*)', 1, 1, null, 1) timestamp_ora,
         regexp_substr(text, 'pid=(\d*)', 1, 1, null, 1) pid,
         regexp_substr(text, 'sid=(\d*)', 1, 1, null, 1) sid,
         regexp_substr(text, 'event#=([^,]*),', 1, 1, null, 1) event,
         regexp_substr(text, 'p1=([^,]*),', 1, 1, null, 1) p1,
         regexp_substr(text, 'p2=([^,]*),', 1, 1, null, 1) p2,
         regexp_substr(text, 'p3=([^,]*),', 1, 1, null, 1) p3,
         regexp_substr(text, 'wait_time=([^,]*),', 1, 1, null, 1) wait_time,
         regexp_substr(text, 'obj=([^,]*),', 1, 1, null, 1) obj,       
         regexp_substr(text, 'sql_hash=(.*)$', 1, 1, null, 1) sql_hash       
  FROM <systemtap_output_file> f
  where text like 'DB WAIT EVENT END%'

That’s it! Now you can do any analytical reporting on your output super-easily. Want to convert event# into a human-readable event name? Simply join to V$EVENT_NAME! Want to see object names rather than their numbers? Simply join to DBA_OBJECTS! And so on, and so forth.
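As an illustration of the first of these joins, here is a minimal sketch that summarizes the captured waits by event name. The external table name stap_waits_ext is hypothetical (substitute whatever you called yours); it is assumed to expose each line of the systemtap output in a text column, as in the query above.

```sql
-- stap_waits_ext is a hypothetical external table over the systemtap output file
with waits as (
  select to_number(regexp_substr(text, 'event#=([^,]*),', 1, 1, null, 1)) event#,
         to_number(regexp_substr(text, 'wait_time=([^,]*),', 1, 1, null, 1)) wait_time
  from stap_waits_ext
  where text like 'DB WAIT EVENT END%'
)
-- translate event numbers to names and aggregate per event
select en.name event_name,
       count(*) wait_count,
       sum(w.wait_time) total_wait_time
from waits w
join v$event_name en on en.event# = w.event#
group by en.name
order by total_wait_time desc;
```

Event numbers can differ between Oracle versions, so the join should be run against V$EVENT_NAME on the same version of the database that produced the trace.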


Oracle’s built-in tracing tools (extended SQL trace, autotrace, statistics_level = all etc.) can change the behavior of the observed process (e.g. they can disable nested loop optimizations like prefetch or batching). Systemtap provides a unique way to observe the internals of such processes non-intrusively. Flight recorder mode makes it possible to start and stop systemtap monitoring directly from the sqlplus prompt or from a script run in the observed session. External tables and regular expressions provide a way to analyze the resulting output using the full power of SQL reporting.