Nested loop internals. Part 2: decision making

1 Feb

In the previous part of this mini-series we looked at how multiblock read behavior differs between nested loop optimization mechanisms depending on the degree of ordering of the data. In this post I’ll continue to explore the subject, but this time we’ll focus on the decision-making process: what factors (other than the obvious ones, like optimizer hints and/or parameters) affect the specific choice of a mechanism?

Continue reading

Nested loop internals

13 Jan

A nested loop join looks like the simplest thing there could be: you go through one table, and as you go, for each row found you probe the second table to see if you find any matching rows. But thanks to a number of optimizations introduced in recent Oracle releases, it has become much more complex than that. Randolf Geist has written a great series of posts about this join mechanism (part 1, part 2 and part 3) where he explores in great detail how numerous nested loop optimizations interact with various logical I/O optimizations for unique and non-unique indexes. Unfortunately, the series doesn’t cover the physical I/O aspects, and that seems to me like the most interesting part. After all, physical I/O was the primary motivation behind introducing all those additional nested loop join mechanisms on top of the basic classical nested loop. So I conducted a study of my own, and I’m presenting my results in the mini-series that I’m opening with this post.

Continue reading

Covering bases

7 Jan

What would you think if you received a complaint about a plan regression with the following information (from a SQL real-time monitoring report) about the good plan:

Continue reading

Peeking table block contents

4 Jan

Sometimes you want to know what’s inside a certain block. Of course, the most straightforward way to do it is to dump the block using ALTER SYSTEM DUMP DATAFILE and analyze the dump. However, “straightforward” doesn’t mean “simple”. A block dump shows the block’s contents in a raw format which is hard to read. Sure, there are various utilities (like the utl_raw package) that can help you convert everything to a human-readable format, but it’s going to be a tedious and time-consuming job, especially if you need more than just a few values from just a couple of blocks. Another problem is that you may not have access to the server OS shell (e.g. developers rarely have access to it even on non-production systems, except maybe on private sandboxes).

There’s a better way, at least if the block you’re interested in is a standard table data block and all you want to know is what kind of data it contains (and not internal information like locks, flags, free space etc.). The idea is that rather than going to the block itself, you can use the rowid to calculate the block address and the relative file number. If you know the table of interest, and if you do this calculation for all its rows, then you simply filter the result set down to the particular block you’re interested in. It’s really much simpler than it sounds, just bear with me a little and you’ll see.

The first step is to identify the segment name. Normally you already know it from the very beginning. For example, if you found the address of the block that you want to look up in ASH or in a trace file, e.g. as the p1 and p2 parameters of a “db file sequential read” event, then you can simply take current_obj# and look up the object name in DBA_OBJECTS, using object_id (and/or data_object_id) as the key.

Once you have the table name, you can display block contents using the rowid_relative_fno and rowid_block_number functions of the dbms_rowid package:

select *
from
(
  select id1,
         id2,
         dbms_rowid.rowid_relative_fno(rowid) fno,
         dbms_rowid.rowid_block_number(rowid) block#
  from &mytable
)
where fno = :fno
and block# = :blockno;

Here &mytable should obviously be replaced with the name of the table identified in the previous step, and id1 and id2 with the columns you want to see.
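To make the rowid arithmetic less of a black box, here is a minimal Python sketch of what the two dbms_rowid functions extract. It assumes the 18-character extended rowid layout (6 characters of data object number, 3 of relative file number, 6 of block number, 3 of row number, encoded in Oracle’s base-64 alphabet) and a regular smallfile tablespace. The helper name parse_rowid and the sample rowid are made up for illustration; this is not an Oracle API.

```python
# Sketch only: decode an extended rowid of the form OOOOOOFFFBBBBBBRRR.
# Assumption: smallfile tablespace (bigfile tablespaces use the bits differently).
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def _b64_to_int(chunk):
    """Interpret a string of base-64 'digits' as an unsigned integer."""
    value = 0
    for ch in chunk:
        value = value * 64 + B64.index(ch)
    return value


def parse_rowid(rowid):
    """Split an 18-character extended rowid into its four components."""
    if len(rowid) != 18:
        raise ValueError("expected an 18-character extended rowid")
    return {
        "data_object_id": _b64_to_int(rowid[0:6]),
        "relative_fno": _b64_to_int(rowid[6:9]),     # cf. rowid_relative_fno
        "block_number": _b64_to_int(rowid[9:15]),    # cf. rowid_block_number
        "row_number": _b64_to_int(rowid[15:18]),
    }


if __name__ == "__main__":
    # Invented rowid, used only to exercise the decoder.
    print(parse_rowid("AAAAB0AAEAAAACXAAA"))
```

For the invented rowid above, the decoder yields data object 116, relative file 4, block 151, row 0, which is the kind of (fno, block#) pair the query filters on.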

It could be convenient to aggregate the query above so that each block’s contents would be represented by a single row. For example, imagine that you have a table with a composite primary key (id1, id2), so that knowing these two values is enough to identify the row. Then you can use listagg to compactly represent block contents:


select listagg('(' || id1 || ',' || id2 || ')') within group (order by id1, id2) block_contents,
       fno,
       block#
from
(
  select id1,
         id2,
         dbms_rowid.rowid_relative_fno(rowid) fno,
         dbms_rowid.rowid_block_number(rowid) block#
  from &mytable
)
group by fno, block#;

This is particularly convenient when working with a relatively large number of blocks that can be extracted from another SQL statement, e.g. a query on an external table built on top of a trace or a dump file as described in one of my recent posts (e.g. here).

As a final touch, you can also add a query block to identify the table’s segment header to make sure you don’t miss it:

select 'segment header' block_contents,
       header_file fno,
       header_block block#
from dba_segments
where segment_name = '&mytable'

You can’t do the same trick on indexes as there’s no rowid for an index entry, but there are other ways to peek at the contents of branch and leaf index blocks. If I get a chance, I’ll show one or two such methods in a separate blog post.

My 2015

1 Jan

Year 2015 was a very good one for me, even though not exactly in the way I expected it to be. I didn’t get to blog as much as I wanted to, and I didn’t get as much interesting performance troubleshooting to do as in previous years. But there were lots of other interesting experiences, e.g. designing, running and analyzing all sorts of sophisticated performance tests for a candidate hardware platform.

Of course, the most important event of the year was moving to the UK, and the new challenges and opportunities this move presented. It was a very positive experience overall (there are a few aspects of life in the UK that I still need to adapt to, but that’s perfectly normal).

I also spoke at the Harmony 2015 conference in Tallinn (LGWR stuff from my previous year’s research), and that was also a new and important experience for me. I’m hoping to do this more in the future. I attended a very interesting UKOUG Tech’15 conference in Birmingham (as a delegate, not a speaker), and had a few very interesting conversations there (in particular, I’m very grateful to Tanel Poder for finding some time for me; that conversation was extremely useful).

I am really looking forward to 2016. No one knows what it will bring us, but I have good reasons to expect great things from it. For example, there’s a good chance that I’ll get involved with some interesting topics, including Exadata, building integrated solutions using both relational and non-relational technologies, performance-related internals digging, and others (and of course I’ll cover the most interesting stuff in this blog).

I am hoping that 2016 will be an eventful and productive year for you as well. Happy holidays everyone!

Non-intrusive tracing

21 Dec

Earlier this year I touched upon the so-called “observer effect” (the impact that the act of observation has on the observed process) as applied to the database world. In this post I’d like to expand on this subject a little, and discuss one way to minimize this effect using OS observability tools.

Traditional tools

All tools that we use to obtain diagnostic information about database processes are intrusive to some extent. However, some of them are built into Oracle, and therefore their effect is present at all times, and constitutes a part of the normal behavior of the database. Thus, such tools as the Oracle Wait Interface and Time Model Statistics, externalized via various V$ views, can be considered non-intrusive. There exist a number of popular utilities that provide a way to use this information to obtain workload metrics for a process of interest (basically, they take before and after snapshots, then calculate and output the deltas). Many of them can be found e.g. here.
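The snapshot-and-delta idea behind such utilities can be sketched in a few lines of Python. This is only an illustration of the principle, not the code of any particular utility: snapshot_delta is a hypothetical helper, and the statistic values are invented (in practice the two dictionaries would be filled from queries against the V$ views).

```python
def snapshot_delta(before, after):
    """Return only the statistics whose value changed between two snapshots.

    Both arguments map a statistic name to its cumulative value.
    A statistic missing from the 'before' snapshot is treated as 0.
    """
    deltas = {}
    for name, value in after.items():
        diff = value - before.get(name, 0)
        if diff != 0:
            deltas[name] = diff
    return deltas


if __name__ == "__main__":
    # Invented numbers, standing in for two reads of session statistics.
    before = {"session logical reads": 1000, "physical reads": 40}
    after = {"session logical reads": 1250, "physical reads": 40,
             "sorts (memory)": 1}
    print(snapshot_delta(before, after))
```

With the invented values above, only "session logical reads" (250) and "sorts (memory)" (1) are reported, since "physical reads" did not change.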

One common problem with these tools is the fact that V$ views externalize information from X$ “tables”, or more accurately, memory structures that can be queried like a database table, but do not provide read consistency. As a result, when using these tools, you will occasionally see some anomalies (like noticeable increments in some statistics without any workload in the monitored session to explain it). Of course, if you’re only interested in the big picture, then it’s not necessarily a problem. But when looking at subtle effects it can be a real nuisance. These problems can be minimized by taking snapshots in an external session, but this approach introduces problems of its own (e.g. it becomes difficult to automate tests with a large number of test cases).

Another problem is that such tools only provide aggregate statistics (as opposed to raw trace files that contain individual events along with their parameters and wait times). Often, this is not enough.

OS tools

One tempting alternative is to use OS-level observability tools (such as dtrace, perf, strace etc.), which of course introduce their own overhead, but are unlikely to change Oracle’s logic (because Oracle doesn’t know that it’s being watched!). It’s a unique way to get a peek at internals, but interpreting this diagnostic information can be challenging, because it’s completely decoupled from database diagnostic data. For example, you see an I/O request — how do you match it to a specific database event, especially when there is no 1-to-1 mapping?

Wait events from OS

Fortunately, there’s a way to enjoy the best of both worlds! It was shown by Luca Canali in his brilliant series on observing logical and physical I/O. The idea is really simple, and very cool: trace OWI function calls from the OS! Let me recapitulate it here: there are functions that are called at the beginning and end of a wait event, and probing those function calls with systemtap makes it possible to capture their parameters (which can then be converted to wait times and event parameters p1, p2 and p3 using simple arithmetic).

The systemtap script that Luca uses does a lot of things, so I removed everything other than the wait event probing (and only kept the part that looks at the end of the wait event, as it provides all the necessary information). The low-level implementation of wait instrumentation is version-dependent, so for example for 12.1.0.2 you can use the wait.stp script below:

#!/usr/local/bin/stap
# Fires at the end of each wait event. The register and the offsets below
# are specific to Oracle 12.1.0.2 and will differ in other versions.
probe process("oracle").function("kskthewt") {
    xksuse   = register("r13") - 3928
    ksuudnam = user_string(xksuse + 140)
    ksusenum = user_uint16(xksuse + 1704)
    ksuseopc = user_uint16(xksuse + 1602)
    ksusep1  = user_uint64(xksuse + 1608)
    ksusep2  = user_uint64(xksuse + 1616)
    ksusep3  = user_uint64(xksuse + 1624)
    ksusetim = user_uint32(xksuse + 1632)
    ksusesqh = user_uint32(xksuse + 1868)
    ksuseobj = user_uint32(xksuse + 2312)
    # One line per wait event, followed by a separator line.
    printf("DB WAIT EVENT END: timestamp_ora=%ld, pid=%d, sid=%d, name=%s, event#=%u, p1=%lu, p2=%lu, p3=%lu, wait_time=%u, obj=%d, sql_hash=%u\n==========\n",
           register("rdi"), pid(), ksusenum, ksuudnam, ksuseopc, ksusep1, ksusep2, ksusep3, ksusetim, ksuseobj, ksusesqh)
}

Calling the script is very simple:

stap wait.stp -x <pid>

where the pid of the OS process can be obtained by querying V$PROCESS and V$SESSION:

select p.spid
from v$process p,
     v$session s
where s.paddr = p.addr
and s.sid = userenv('sid');

(I use a cool trick to set the terminal window title to display sid, serial# and spid, as described here.)

Automation

The method above works just fine, except that the event capturing script needs to be launched manually in a separate terminal window. When performing large numbers of tests in loops, this becomes a serious inconvenience. Fortunately, there is a solution to this problem: systemtap can run in the background using the so-called “flight recorder” mode. This makes it possible to launch systemtap monitoring directly from an sqlplus script:

column spid new_value spid
...
select p.spid ...
...
host stap -F -o <output_file_name> wait.stp -x &spid

When all diagnostic information is collected, the monitoring process can be killed using something like:

host kill -9 $(pidof stapio)

Note that this kills ALL stapio processes, so it’s only safe to do on a sandbox environment where you are the only user! There might be a more precise way to kill the launched stapio process using the process identifier returned by the shell, but I found no elegant way to read this process id into sqlplus, so I just killed all stapio processes indiscriminately.

Postprocessing

Working with flat files is a good way to appreciate the power and flexibility of relational databases. You quickly notice that even the simplest things become difficult. For example, the script above produces wait event numbers instead of names. Of course, it’s possible to add a sed script like Luca did, but the more complex your analysis becomes, the sooner you start to miss the ability to join information from different sources into a single report, like you can do in Oracle.

There is a solution for that problem as well: simply create an external table on top of that flat file (as shown here) and use regular expressions to parse its contents:

SELECT regexp_substr(text, 'timestamp_ora=(\d*)', 1, 1, null, 1) timestamp_ora,
         regexp_substr(text, 'pid=(\d*)', 1, 1, null, 1) pid,
         regexp_substr(text, 'sid=(\d*)', 1, 1, null, 1) sid,
         regexp_substr(text, 'event#=([^,]*),', 1, 1, null, 1) event,
         regexp_substr(text, 'p1=([^,]*),', 1, 1, null, 1) p1,
         regexp_substr(text, 'p2=([^,]*),', 1, 1, null, 1) p2,
         regexp_substr(text, 'p3=([^,]*),', 1, 1, null, 1) p3,
         regexp_substr(text, 'wait_time=([^,]*),', 1, 1, null, 1) wait_time,
         regexp_substr(text, 'obj=([^,]*),', 1, 1, null, 1) obj,       
         regexp_substr(text, 'sql_hash=(.*)$', 1, 1, null, 1) sql_hash       
  FROM <systemtap_output_file> f
  where text like 'DB WAIT EVENT END%'

That’s it! Now you can do any analytical reporting on your output super-easily. Want to convert event# into a human-readable event name? Simply join to V$EVENT_NAME! Want to see object names rather than their numbers? Simply join to DBA_OBJECTS! And so on, and so forth.
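To make the parsing step concrete outside the database, here is a rough Python sketch of the same field extraction. It assumes the exact line layout produced by the printf in wait.stp above; the helper name parse_wait_line and the sample line are invented for illustration, and the values in the sample do not come from a real trace.

```python
import re

# One named group per field, in the order the wait.stp printf emits them.
WAIT_LINE = re.compile(
    r"DB WAIT EVENT END: timestamp_ora=(?P<timestamp_ora>-?\d+), "
    r"pid=(?P<pid>\d+), sid=(?P<sid>\d+), name=(?P<name>[^,]*), "
    r"event#=(?P<event>\d+), p1=(?P<p1>\d+), p2=(?P<p2>\d+), p3=(?P<p3>\d+), "
    r"wait_time=(?P<wait_time>\d+), obj=(?P<obj>-?\d+), "
    r"sql_hash=(?P<sql_hash>\d+)"
)


def parse_wait_line(line):
    """Return a dict of fields for a wait event line, or None for other lines."""
    match = WAIT_LINE.match(line)
    if match is None:
        return None  # e.g. the '==========' separator lines
    fields = match.groupdict()
    # Everything except the user name is numeric.
    return {key: (value if key == "name" else int(value))
            for key, value in fields.items()}


if __name__ == "__main__":
    # Invented sample line in the format produced by the script.
    sample = ("DB WAIT EVENT END: timestamp_ora=1000, pid=4242, sid=17, "
              "name=SCOTT, event#=153, p1=4, p2=151, p3=1, wait_time=250, "
              "obj=116, sql_hash=12345")
    print(parse_wait_line(sample))
    print(parse_wait_line("=========="))
```

This only gets you as far as a list of dictionaries; translating event numbers or object ids into names still needs the lookups against V$EVENT_NAME and DBA_OBJECTS described above, which is where doing the work in SQL pays off.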

Summary

Oracle’s built-in tracing tools (extended SQL trace, autotrace, statistics_level = all etc.) can change the behavior of the observed process (e.g. they can disable nested loop optimizations like prefetch or batching). Systemtap provides a unique way to observe the internals of such processes non-intrusively. Flight recorder mode makes it possible to start and stop systemtap monitoring directly from the sqlplus prompt or from a script in the observed session. External tables and regular expressions provide a way to analyze the output using the full power of SQL reporting.

Relocation

1 Dec

I feel like I owe a big apology to my readers: not only have I not blogged anything for about half a year, but I also started a series and haven’t finished it. But I have a good excuse. In September, I moved to the UK. It’s a great place to be if you’re into databases: UKOUG is one of the strongest Oracle user groups in the world, if not the strongest one. And of course it’s just a great place to be (if you’re not spoiled by growing up in a place with a nice climate; I’m from Russia, so leaden skies make me feel at home!).

The relocation has consumed all my free time and energy for several months. Now the transition period is coming to an end and I’m planning to resume blogging soon. I’m not planning to continue the SQL performance series: it’s very time-consuming, and I haven’t received enough feedback to stay motivated to spend all this time.

But there’s a lot of other stuff — I’m still getting many interesting cases, a few studies are underway, and a few more are planned. So as they say, “stay tuned”!

SQL Performance, Part IV. Heap tables.

26 May

Demo script

SQL performance, Part III. Data storage strategies

20 May

Imagine that you’re on a desert island where some pirates hid their treasure. You don’t know where exactly it is hidden, but you want to find it. How should you approach this – e.g. should you dig randomly? Or should you follow some sort of a system – and if yes, then which?

I think most readers have smelled a trap right away (it’s not a very sophisticated trap anyway): of course it doesn’t matter how you do it. If it takes you X minutes to examine 1 square foot of the surface, and the total surface is Y square feet, then exploring the entire island takes X·Y minutes, so finding the treasure will take you anywhere from 0 to X·Y minutes (X·Y/2 on average), no matter which system you’re following (or whether you follow one at all), as long as you never dig the same spot twice. When there is no information whatsoever about the object(s) sought, all search strategies are equally (in)efficient.
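The claim is easy to check numerically. The sketch below is only an illustration under stated assumptions: the island is modeled as 100 equally likely squares, a "strategy" is any order in which each square is examined exactly once, and the function names are made up. It counts squares examined rather than minutes; multiplying by the time per square gives the search time.

```python
import random


def digs_until_found(order, treasure):
    """Number of squares examined, following 'order', until the treasure is hit."""
    for count, square in enumerate(order, start=1):
        if square == treasure:
            return count
    raise ValueError("treasure is not on the island")


def average_digs(order):
    """Average over every possible treasure location, each equally likely."""
    return sum(digs_until_found(order, t) for t in order) / len(order)


if __name__ == "__main__":
    squares = 100
    systematic = list(range(squares))        # sweep the island row by row
    haphazard = systematic[:]
    random.Random(0).shuffle(haphazard)      # dig in an arbitrary order

    print(average_digs(systematic))  # 50.5
    print(average_digs(haphazard))   # 50.5
```

Both orders give the same average, about half the island, because every square is visited exactly once. A strategy that revisits squares would do worse, which is why the no-repeat assumption matters.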

Continue reading

SQL Performance, Part II — Disk I/O: metrics and scales

7 May

While query performance depends on a large number of things, the overall scale of query performance for a given database is generally set by disk I/O speed. The most common type of storage device used in databases is still a hard disk drive (HDD), so let’s consider how it works.

Continue reading
