Logical I/O

Except a few special cases, optimizing SQL is about minimizing I/O. And by “I/O” we normally mean “physical I/O”, because everybody knows that logical I/O (LIO) is much, much faster. But how much faster exactly? For a long time, this question has been bothering me. It looks like there has been little research in this area. Basically the only thorough investigation I managed to find on the subject was one by Cary Millsap and co-authors. So I conducted some research on my own. Before presenting my findings, here’s a summary of Cary’s paper: the difference between PIO and LIO time is not the difference between disk access and memory access, it’s much smaller (37x vs 10,000x as per their measurement), and therefore performance optimization needs to address LIO as well as physical I/O (PIO).

There are several reasons why this number cannot be very reliable. The most obvious one is that the paper was published over 10 years ago, there has been a lot of changes in hardware since then. But more importantly, LIO is a much trickier concept than a PIO, and this complexity cannot be very well described by a single number.

Let me explain a bit. We can talk about PIO time because it’s independent of the contents. When you’re reading an 8kB block, whether it’s full of 0’s, or 1’s, or of random numbers, it doesn’t matter: mechanical latencies are going to be same. With LIO, that’s not quite the case, because the time spent on processing contents of a block is: a) comparable with the time spent on reading it b) is very difficult to distinguish from it from instrumentation point of view. Besides, LIO time is greatly influenced by concurrent activity (whether or not a read consistent version needs to be built) and various obscure optimizations (like “fastpath”) with little or no documentation about them. Because of all that, one should keep in mind that “LIO time” cannot as clearly defined concept as PIO time. Still, some simple cases can be analyzed, and some useful conclusions can be drawn from that analysis.

So what I did was the following: I took measurements with several different combinations of row length and number of columns so as to find the minimal LIO time (the “pure” LIO time so to say) and also find how much it changes as the table structure becomes more complex. The table below summarizes the results for the two setups: A, a Windows 7 PC and B, Solaris M-8000 server. both running Oracle 11gR2. A few words about setup: I used equally-sized varchar2 columns with 1, 5 and 25 columns, and row length of 25, 250 and 2500 bytes. I.e. 5 columns and 2500 bytes means 5 columns 500 bytes each. The total “payload” (number of rows times average row length) was also the same in all cases, 1GB. The actual table size, however, varied from test case to test case because of difference in relative overhead and other irregularities, so the LIO was measured by dividing total CPU time by the number of consistent gets. In order to minimize instrumentation error, I preloaded all tables into cache, and cross checked the results via several sources.

To make sure that I minimize artifacts from the test setup I also did the following: 1) checked that results are independent of the way buffer cache is populated with table contents (full table scan and randomly ordered reads by rowid interspersed with reads from other segments gave results that were different less than 10%) 2) checked that same was also true for the way data was read from the buffer cache (full table scan and reads by rowid also gave roughly same results).

row length  # columns   LIO(microsec), A  LIO(microsec), B
2500              1           1.44           5.04
2500              5           1.34           5.44
2500              25          1.43           5.39
250               1           2.06           8.87
250               5           4.41           8.97
250               25          2.50           8.64
25                1           5.32           8.56
25                5           5.52           8.29
25                25          6.85           8.94

Looking at the results table, I make following observations:

1) The smaller row length (i.e. the more densely rows are packed into blocks), the more expensive is a LIO

2) The more columns, the more expensive is a LIO (although there are a couple of anomalies here)

3) Row density affects LIO cost more than the number of columns (although this might be different with other data types)

4) Cost of a LIO is different on different machines (I think that it depends primarily on memory characteristics — e.g. NUMA effects could explain why a plain PC outperforms an expensive enterprise server)

5) In all cases LIO time is much lower than the 53 microseconds from the above mentioned paper. So either LIO became faster over the last decade, or instrumentation errors resulted in it being overestimated in Millsap’s paper.

Whatever the reason, does it mean that LIO matters less? Not really. First of all, even 5-9 microseconds per LIO is not that small a number, especially if we take into consideration the fact that LIO doesn’t seem to benefit much from “multiblock reads”. Consider a full table scan with mbrc = 128 and average “db file scattered rea” of 6 ms (which corresponds to 160 MB/s, a fairly standard data transfer rate). On the other hand, 128 LIOs can easily take 1 ms or more. So the difference between PIO and LIO would be only by a factor of 6, and not by several orders of magnitude as most of us would expect. And that doesn’t even include the cost of actually reconstructing the read-consistent version in case it’s necessary because of concurrent activity, or potential contention for latches!

If anything, LIO is becoming more important nowadays because – enterprise grade storage nowadays comes with big cache of its own, so physical reads are often measured in hundreds or even tens of microseconds. I’m not exaggerating — I’ve seen systems with long-term average “physical reads” of 80 microseconds! – low-latency flash storage is becoming more popular by day (driving the difference between PIO and LIO further down) – memory is rapidly becoming cheaper, which means caches are rapidly getting bigger – advent of in-memory option in Oracle means that people would rely on memory even more – high per-CPU-core licensing costs for in-memory would translate the resource cost of LIO into $$$ in a very direct way – spread of virtualization technologies makes it easier to put freed CPU resources to good use, which means that there will be demand for non-SLA optimization. To put it simply, 10 years ago if you reduce CPU utilization from 80% to 50% meant that you’d be wasting more CPU cycles. Now those CPU cycles can be quickly and cheaply allocated to another database (or application) where they would bring value to the business. So there will be a good reason to tune something to reduce resource usage even without any users complaining (and no one would accuse you of having Compulsive Tuning Disorder!). But that’s a topic for another post.

Summary

LIO time depends on block structure, hardware and other factors. Typical values are 1-10 microseconds, which is much lower than the popular estimate of 50 microseconds per LIO. Nevertheless, in many situations time spent on LIO can be comparable to PIO time, so reducing LIO is very important. Because of recent technological changes and other changes, the motivation for LIO optimization is likely to increase in the future.

16 thoughts on “Logical I/O”

  1. Can you share the SQL that is used to touch the columns? A lot of what I think you’re showing here is the cost of walking rows. Oracle can quickly find the first column of a row but other columns require “walking the row”.

    1. Kevin, it’s a simple select count(*) from table. I used a number of tools to make sure that Oracle does what I needed it to be doing. I also used dtrace to see what was going on from the OS angle. It can be used to tell how much time is spent on walking the rows (kcb* vs kd* calls). For large row size, almost none. For smaller row size, all the additional overhead is coming because of increased cost of walking a row.

      1. Datablocks have a row directory in the header. The row directory points to the first column of each row. Why would count(*) need to look at the rows and not just get the count from the row directory? Are you using predicates?

        1. This MIGHT depend on how the table was created and the number of ITLs used, and whether the rows were automatically deemed clean. If the block is is definitely committed and clean for all transactions then count(*) need only check the number of rows in the directory, but if there is any doubt then it may have to walk the directory to check which rows are linked to which ITL, to determine if there is any requirement for read consistency.

          1. Thanks Jonathan. Are there any statistics that could tell for sure which of the two scenarios is taking place?

            About my setup: it was fairly standard, a ASSM tablespace with 1M uniform extents, table created with default storage settings. The table was populated during creation (CTAS) and not touched ever since, so I cannot think of any reason why there would be any doubts regarding “cleanness” of its blocks.

            So the only explanation for variations in CPU time I can come up with is that it has to do with variations in time it takes to scan row directory because the size of the row directory depends on the number of rows (2 bytes per row if I’m not mistaken). It’s still surprising why this would take several extra microseconds though…

        1. Again we get back to how the data set was created. The number of row directory entries is not necessarily the same as the number of rows. Maybe there’s a flag in the block header that tells Oracle to check that every row piece is a First pece. Maybe the flag can be set incorrectly by some operations.

  2. Very good topic indeed! I’d like to understand mode of what you are testing because the cost of a LIOP include the cost of touching the data that caused the LIOP. A header snacking op like count(*) should not show variation based on the shape/sizes of the rows. If this link comes through it is a bit of LIO data from single user SLOB where slob.sql is changed to touch c2 then c13 and then c19

    1. Here is an example of dtrace CPU profiler one-liner output:

        oracle`kcbz_fp_buf            13
        oracle`kdst_fetch             14
        oracle`skgslcas               30
        oracle`kcbgtcr               181
        oracle`kdstf00000010000kmP   316
      

      Here you can see that it’s walking the rows and it’s doing that a lot. For larger row length (i.e. smaller number of rows per block) the share of time spent in kd* calls decreases compared to kcb*.

      1. Could it be variation in the cost of examining row directory in the block’s header? I can’t be 100% sure what names like kdstf* mean, the ORA-600 lookup paper says “kds” are “operations on data such as retrieving a row”, and “tf” seems to come from “table fetch”, but I could be wrong.

  3. Dude,
    Install a RAM DISK on that PC class machine and place Oracle db files there; doing this you should cut the IO latency to minimum and see what you get.

    1. Cristian,

      logical I/O performance doesn’t depend on the kind of storage you have. Also, physical reads don’t contaminate performance statistics for logical I/O because the metrics are different. So unfortunately I don’t see how your suggestion might help here.

      Best regards,
      Nikolay

      1. Nikolay,

        you wrote
        “the difference between PIO and LIO time is not the difference between disk access and memory access”
        “In order to minimize instrumentation error, I preloaded all tables into cache”
        “it’s a simple select count(*) from table.”

        You could run two tests, with more complex queries, 1st with RAM Disk, 2nd without, and check the final measurements, how average, median, std. deviation etc looks like.
        Monitor the tests with DTrace or perf. Maybe a flame graph (http://www.brendangregg.com/flamegraphs.html) could shed more light on the matter.

        Regards,
        Cristian.

        1. Hi Cristian,

          I feel that my setup was adequate for the purposes of the simple test that I wanted to run. If you ever get to running more advanced tests using the tools you mentioned, please share your results, that would make an interesting reading.

          Best regards,
          Nikolay

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s