Nested loop internals — summary

In this article, I’ll summarize my observations regarding nested loop join mechanisms, as well as previously known facts, so that everything is in one place.

1) Nested loops are the simplest join mechanism in Oracle: as data is read from one table (or more generally, row source), for each row the other table (or row source) is probed (typically with an index probe, such as an INDEX RANGE/UNIQUE SCAN) to see if there is a match. Unlike other join methods, it doesn’t incur any startup costs, which makes it more attractive for selective queries.
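For illustration, here is a minimal sketch of forcing a nested loop join and inspecting the resulting plan (the tables, columns and date value are made up for the example):

-- hypothetical schema: ORDERS (driving row source) and CUSTOMERS with an index on CUSTOMER_ID
SELECT /*+ LEADING(o) USE_NL(c) INDEX(c) */
       o.order_id, c.customer_name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
WHERE  o.order_date = DATE '2014-01-01';

-- plan of the last executed statement; expect NESTED LOOPS with an
-- INDEX RANGE SCAN + TABLE ACCESS BY INDEX ROWID on the inner (probed) side
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR);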

2) The optimizer calculates the cost of a nested loop join as follows: it’s the cost of acquiring the outer row source, plus its cardinality times the cost of probing the inner row source. However, there are a number of parameters that can change the optimizer cost of nested loops significantly, making them more or less attractive compared to other types of joins (first of all, hash joins). The most important are the infamous optimizer_index_cost_adj and optimizer_index_caching parameters (especially the former, which is basically just a tweak designed to push the optimizer towards choosing more nested-loopy and indexy plans).
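A quick sketch of the arithmetic and of the parameters involved (the numbers are purely illustrative):

-- cost(NL) ~= cost(outer row source) + card(outer) * cost(one inner probe),
-- e.g. 10 + 1000 * 3 = 3010

-- current values of the two parameters mentioned above
SELECT name, value
FROM   v$parameter
WHERE  name IN ('optimizer_index_cost_adj', 'optimizer_index_caching');

-- lowering optimizer_index_cost_adj scales index access costs down,
-- making nested loop plans look cheaper to the optimizer
ALTER SESSION SET optimizer_index_cost_adj = 50;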

3) In most cases, even if the optimizer gets the cardinality and the total logical I/O volume right, the physical I/O volume will tend to be much lower, because nested loops are, so to speak, self-caching (i.e. even if none of the blocks of interest were in the cache before, the cache hit ratio is still likely to be on the high side, because the same blocks will be read over and over). Of course, it’s hard to make any numeric predictions here, but if the buffer cache is sufficiently large compared to the sizes of the data sets involved, and if it is not under much pressure from other database activity, at the very least we can expect that repetitive reads of the branch index blocks will all come from the cache (except maybe the very first read, which populates the buffer cache). The fact that the blocks are read over and over makes them “hotter” and thus less likely to be aged out before the end of the operation.

This is important, because it often results in the nested loop cost being overestimated, as the optimizer is not very good at estimating caching effects (the only way to account for them is to set optimizer_index_caching, but it’s very hard to find a good value for it).
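A simple way to see this in practice (a sketch, assuming SQL*Plus and a nested loop query of your own) is to compare the logical and physical read statistics:

SET AUTOTRACE TRACEONLY STATISTICS
-- run the nested loop query of interest here, then compare "consistent gets"
-- (logical I/O) with "physical reads" in the reported statistics; for a
-- self-caching nested loop the latter is typically a small fraction of the former
SET AUTOTRACE OFF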

4) There are two effects which can greatly affect the amount of logical I/O compared to the optimizer’s prediction:
– Oracle will often pin either the driving table blocks or the root/branch blocks of the inner table’s index, leading to lower logical I/O figures (the number of block visits, however, stays the same; they just become cheaper, because re-visiting a pinned block doesn’t involve any latch manipulations)
– an index range scan of a non-unique index has a larger cost than that of a unique index. This has to do with the fact that leaf blocks don’t contain the upper bounds of the values: e.g. if a leaf block contains values {1, 2, 3} for column x, then for WHERE x = 1 or x = 2 there is no extra read, but for WHERE x = 3 there will be, as Oracle doesn’t know whether the next leaf block could also contain rows with x = 3. Obviously, this overhead is greater when the number of rowids per leaf block is small (i.e. when index entries are relatively long); a quick way to observe it is sketched below.
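Here is a rough sketch of how the second effect can be observed (all object names and values are made up; the extra consistent get only occurs for values that happen to be the last entry in their leaf block, so it is easiest to see in aggregate over many lookups):

-- hypothetical table with unique values in X
CREATE TABLE t (x NUMBER NOT NULL, pad VARCHAR2(200));
-- ... populate t ...
CREATE INDEX t_ix ON t(x);      -- non-unique index

-- a lookup of a value that is the last entry in its leaf block costs one extra
-- consistent get, because Oracle has to peek into the next leaf block for more matches
SELECT /*+ INDEX(t) */ * FROM t WHERE x = 42;    -- 42 is an arbitrary example value

-- after recreating the index as unique, the same lookup never needs that extra get
DROP INDEX t_ix;
CREATE UNIQUE INDEX t_ix ON t(x);
SELECT /*+ INDEX(t) */ * FROM t WHERE x = 42;

-- compare "consistent gets" for the two variants, e.g. with SET AUTOTRACE TRACEONLY STATISTICS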

5) There exist several optimizations for nested loops: prefetching (old, introduced in 10g), batching (new, introduced in 11g), and batched rowid access (the newest, introduced in 12c). Strictly speaking, the last one is not necessarily related to nested loops: it can be observed in a simple INDEX RANGE SCAN followed by a TABLE ACCESS BY ROWID.

All these optimizations basically do the same thing, although there are some differences in how they do it. The basic mechanism is grouping individual reads of the inner row source into a smaller number of multiblock reads. As a rule, those are non-contiguous reads (db file parallel read).

The more random the ordering of the inner row source is (with respect to the order in which the outer row source probes it), the more important those multiblock optimizations become. Oracle knows that, and tends to increase the multiblock read count when the degree of ordering is low. With prefetching, Oracle relies mostly on optimizer statistics (the clustering factor); with the batch optimizations, it relies on the actual data it encounters at the beginning of the operation (so if it sees a small patch of ordered data at the beginning, but the rest is unordered, there is a good chance that multiblock reads won’t be enabled).
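A sketch of how one can experiment with these optimizations (NLJ_BATCHING/NO_NLJ_BATCHING are real 11g+ hints; the tables are the hypothetical ones from the example above):

-- with join batching the plan typically shows a second NESTED LOOPS operation;
-- the optimization can be switched on or off per query for comparison:
SELECT /*+ LEADING(o) USE_NL(c) NO_NLJ_BATCHING(c) */ o.order_id, c.customer_name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
WHERE  o.order_date = DATE '2014-01-01';

-- whether the optimization actually produced non-contiguous multiblock reads
-- can be seen from the session's wait events:
SELECT event, total_waits, time_waited_micro
FROM   v$session_event
WHERE  sid = SYS_CONTEXT('USERENV', 'SID')
AND    event IN ('db file sequential read', 'db file parallel read');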

6) Generally, the more blocks per read, the more efficient multiblock reads tend to be; however, this is only true if there is enough free I/O bandwidth for that. There are a few parameters that affect the multiblock read count for non-contiguous reads. However, it is generally not necessary to mess with them because:
– most of them are undocumented, so it’s not a good idea to change them unless Oracle support explicitly tells you to
– the performance of multiblock reads is not very sensitive to the value of the multiblock read count. The order of magnitude of the effect here is typically tens of percent. Considering that the mere fact of using multiblock reads typically gives an improvement of a few hundred percent, finding (and enforcing) the perfect multiblock read count is rarely worth the trouble (a way to simply observe the actual read sizes is sketched below).
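Rather than tweaking those parameters, it is usually enough to just observe how many blocks the non-contiguous reads actually fetch; a sketch (the blocks-per-read value is exposed as the second parameter of the wait event):

-- parameter names of the wait event (files, blocks, requests)
SELECT parameter1, parameter2, parameter3
FROM   v$event_name
WHERE  name = 'db file parallel read';

-- the per-wait values (p1/p2/p3) can then be watched in v$session_wait,
-- or extracted from a raw 10046 trace taken while the nested loop is running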

7) There are a number of factors which have a strong impact on how efficient non-contiguous multiblock reads are:
– type of storage, SSD or HDD (an SSD doesn’t care whether multiblock reads are contiguous or not, as blocks being physically adjacent to each other doesn’t mean anything there, unlike the case with HDD)
– in case of HDD, striping (basically, if your data is “smeared” across several spindles, it’s easier to benefit from non-contiguous multiblock reads; with a single spindle you can only benefit from reduced scatter when moving from one block to another, while with multiple spindles you can benefit from truly asynchronous I/O)
– I/O scheduler type (on an old I/O scheduler you may not be getting the full benefits of non-contiguous reads, as they may have lower priority than contiguous reads)
– physical or virtual server (if virtual server, there can be complicated interactions between the guest and host OS I/O layers)
– but most importantly, non-contiguous multiblock reads on any type of storage require asynchronous I/O to be enabled (if you don’t enable AIO, you’d still see the multiblock reads, but they won’t give any real performance benefit, as the requests end up serialized at the physical level; a quick check of the relevant settings is sketched below).
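Since asynchronous I/O is the key prerequisite, it is worth checking the relevant settings; a sketch (the parameter names are real, the appropriate values depend on your platform and storage):

-- disk_asynch_io should be TRUE; on a regular filesystem (as opposed to ASM or raw devices)
-- filesystemio_options typically needs to be SETALL or ASYNCH for async I/O to be used
SELECT name, value
FROM   v$parameter
WHERE  name IN ('disk_asynch_io', 'filesystemio_options');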

8) In the ideal case, the performance of non-contiguous multiblock reads can be close to that of contiguous multiblock reads. This means that a nested loop can give you the same benefit as a hash join (reading both row sources with multiblock reads), only better, because with a hash join the price of being able to use multiblock reads is having to read the entire table, whereas with nested loops you don’t have this problem. However, there are other kinds of overhead that the nested loop incurs compared to the hash join:

– lots of buffer visits to the root/branch blocks of the probed index. While reading blocks from memory is cheaper than reading them from disk, we gain the maximum benefit from multiblock reads on SSD, where this difference isn’t that big anymore (tens or hundreds of microseconds per block for a physical multiblock read, versus microseconds or tens of microseconds per logical read), so when the LIO volume is high enough, its contribution to the total elapsed time becomes comparable to that of the physical I/O
– physical reads of the index leaf blocks (the fewer keys per leaf block, the larger this overhead is).

It is also worth mentioning that the actual cost of a hash join might be lower than the optimizer’s estimate when a significant percentage of its physical reads aren’t actually physical (i.e. they are satisfied from the OS or storage cache). When this happens, it shifts the balance back towards hash joins.
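To see where the balance lies for a specific query, the simplest approach is to run it both ways and compare the row source statistics; a sketch (the hints and the DBMS_XPLAN call are standard, the query is the hypothetical one from above):

-- run the join once as a nested loop and once as a hash join
SELECT /*+ GATHER_PLAN_STATISTICS LEADING(o) USE_NL(c) */ o.order_id, c.customer_name
FROM   orders o JOIN customers c ON c.customer_id = o.customer_id
WHERE  o.order_date = DATE '2014-01-01';

SELECT /*+ GATHER_PLAN_STATISTICS LEADING(o) USE_HASH(c) */ o.order_id, c.customer_name
FROM   orders o JOIN customers c ON c.customer_id = o.customer_id
WHERE  o.order_date = DATE '2014-01-01';

-- running this after each of the two executions shows actual elapsed time,
-- buffer gets and physical reads for every plan line
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'ALLSTATS LAST'));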

But even so, nested loops can be much more effective for queries with relatively poor selectivity than they are traditionally considered to be.

To sum up:

Multiblock reads are a complete game changer for nested loop performance. However, extracting the full value from them requires properly configuring the storage, the OS and the database.

Previous parts:

Part I
Part II
Part III

Update: Embarrassingly, I forgot to thank two great storage experts for their help in writing this article: Tom O’Neill and Chris Buckel (the @flashdba). My sincere apologies, and many thanks for your contributions!
