Nested loop internals. Part 3: comparative efficiency

In the previous parts (here and here) of the series we looked at some aspects of nested loop I/O optimizations, but we have left out the most important question (from the practical point of view): how these methods are doing time-wise? Which one(s) is(are) faster, and how much savings are they offering compared to the non-optimized plan? We will turn to these questions now.

First of all, if we simply run everything with settings described in the previous parts we would find almost no difference in timings. This is because I’ve disabled filesystem caching by setting FILESYSTEM_OPTIONS = DIRECTIO, which, as it turns out, not only forces directs reads, but also disables asynchronous I/O, which is vital for obtaining performance benefits from non-contiguous multiblock reads. So the first order of business would be to correct this mistake (I simply reverted it to SETALL). This makes a massive difference in terms of performance.

As we have already seen, nested loop optimizations kick in for high clustered factors (i.e. unordered data sets), so I picked rand = 100 (to have the data set as unordered as it can be), pad = 1000 (to have sufficiently low row-per-block density so as to make TABLE ACCESS BY ROWID very expensive, while steering clear of any special effects that might be there when there is just 1 row per block) for this test. To minimize both systematic and statistical errors, I ran the test 100 times.

We also need to be careful about comparing the results. I calculated the total time for all db file parallel read waits, as it would also allow me to incude batch-rowid optimization into comparison (which otherwise would be difficult to do, as it doesn’t batch index block reads).

Here are the numbers that I got by running the test above on my MacBook Pro laptop with an SSD hard drive:

prefetch	110±13 ms
batch	104±16 ms
batch rowid	78±8 ms

For comparison, the total sequential I/O time for non-optimized nested loop was 1380 ms. Unlike the numbers above, it includes the time for reading the index blocks, but that’s a relatively small correction (since we have about 6 rows per block, we’d be reading approximately 6 times less index leave blocks than the table blocks).

Therefore we can conclude that in this case multiblock reads provide approximately x10 improvement compared to the non-optimized approach. All 3 multiblock optimizations give similar results, but batch rowid seems to be doing a little bit better than prefetch and batching (by about 25-30%), which is most probably accounted for by higher average multiblock read count (121 versus 38 for prefetch and 69 for batching).

It would be interesting to compare these results versus results obtained with the identical setup on my PC with a regular spinning HDD drive:

prefetch	2804±464 ms
batch	2572±445 ms
batch rowid	2038±282 ms

and 12723 &plusmn 4310 ms for the non-optimized nested loop.

As we can see, overall difference in performance is quite large (almost an order of magnitude), but the tendencies are similar. In both cases, multiblock optimizations provide significant performance benefits, although in the first case (SSD) they are higher. This is to be expected, as on an SSD devide one can perform I/O in truly asynchronous way (read data from different channels simultaneously), while with HDD, if we only have one spindle and one platter (as is the case with my PC), we can only benefit from re-arranging the reads and thus reducing the seek time.

Let’s summarize our observations in form of a list:

1. Asynchronous I/O is of key importance for multiblock nested loop optimizations because without it multiblock reads would still be there, but they would be essentially serialized, so there would be no real performance gain. Therefore to make sure that your nested loops are delivering the best performance possible, make sure AIO is enabled on both database and OS level.
2. With SSD one can obtain larger performance benefits from non-contiguous multiblock reads compared to HDD
3. This additional performance benefit isn’t free as it increases the I/O bandwidth usage
4. The larger the multi-block read count, the better performance, although this dependency is not linear (e.g. almost doubling MBRC from 69 to 121 only gives 25% improvement in performance) and we can expect it to saturate at high MBRC
5. In terms of read times per block multiblock reads are approximately 10 times more efficient compared to single-block reads, so multiblock optimizations have a decisive effect on the actual cost of a nested loop.

The last item is very important, especially since the optimizer estimated of the nested loop cost estimations seem to be unaffected by any multiblock optimizations. Therefore it is very easy for the optimizer to overestimate the cost of a nested loop, making it only eligible for very selective queries, whereas in reality it could be efficient for a much broader class of queries.

Part 1
Part 2