Block size and multirow DML performance

While 8k is the default block size, Oracle supports other block sizes, too. Smaller block sizes (most commonly 4k) are often encouraged for OLTP workloads to reduce block contention, and an even smaller block size, 2k, is recommended for databases running on “Advanced Format” (or “4K”) storage. The Oracle documentation warns us of possible implications of storing larger rows in 2k/4k blocks, such as potentially larger space overhead due to metadata, or even the possibility of row chaining. What it doesn’t warn us about is that there is also a big difference in the way DML operations work, which has very serious implications for their performance.

As an IT saying goes, a code snippet is worth a thousand words, so without further ado let’s start with a simple example. I ran the test below on Oracle 11.2.0.1 installed on Windows 7 (but I repeated it on Linux/11.2.0.1 and Solaris/11.2.0.3 to make sure it’s not some random glitch in the Windows port). If you want to repeat the test, be sure to define db_4k_cache_size before creating the 4k tablespace if it’s not yet defined.
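If db_4k_cache_size is not set yet, a minimal sketch of enabling it would be something like this (the 64M value is purely illustrative, not a recommendation):

alter system set db_4k_cache_size = 64M;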


create tablespace test_dml_4k
datafile 'd:\test_dml_4k.dbf' size 100M
extent management local uniform size 1M
segment space management auto
blocksize 4096;

create table t4k(x number, text varchar2(4000))
tablespace test_dml_4k;

insert into t4k
(x, text)
select level, rpad('x', 100, 'x') text
from dual
connect by level <= 1e5;


Autotrace this, then repeat the same with an 8k block size, and compare the output.
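The 8k side of the comparison can be reproduced with something along these lines (a sketch; the table name t8k and the USERS tablespace are my choices here, any default 8k tablespace will do):

set autotrace on statistics

create table t8k(x number, text varchar2(4000))
tablespace users;

insert into t8k
(x, text)
select level, rpad('x', 100, 'x') text
from dual
connect by level <= 1e5;

Here is what I got on my machine: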

Statistic                   8k          4k
CPU used by this session    14          70
db block changes            9678        213527
db block gets               11493       120719
redo entries                6703        112511
redo size                   12364360    34910732


As you can see, the difference is quite dramatic: almost 3 times as much redo, 5 times more CPU time, 10 times as many db block gets… and all that just from a different block size? Wow, those big block sizes must really boost DML performance! If we double the block size, perhaps we can make it even faster?..

The truth is that this is not a direct effect of the block size itself: if we repeat this exercise with a 2k block, we’ll get more or less the same results as for 4k, and if we try it with a 16k block, we’ll get more or less the same as with 8k.

The clue to the root cause of this behavior comes from the statistics above. As you can see, the number of redo entries and db block gets is fairly close to the number of rows inserted, and db block changes is suspiciously close to double that number. If you vary the number of rows inserted, you’ll confirm that suspicion.
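One quick way to do that (a sketch, assuming you can query the V$ views) is to snapshot the session statistics before and after each insert and compare the deltas:

select sn.name, ms.value
from v$mystat ms
join v$statname sn on sn.statistic# = ms.statistic#
where sn.name in ('db block changes', 'db block gets', 'redo entries', 'redo size');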

So it looks like for block sizes smaller than 8k, Oracle switches to row-by-row processing of INSERT SELECT statements, and of other DML statements for that matter (I did similar tests with UPDATE, DELETE and MERGE, with the same results). How do I find out for sure? I dumped the redo log, using the MIN SCN clause to limit the amount of information dumped (see e.g. here for more on redo dumping).
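The dump itself can be produced roughly as follows (a sketch: the log file name and the SCN values are placeholders, pick the current online log and an SCN range around your test):

-- identify the current online redo log (illustrative)
select member from v$logfile
where group# = (select group# from v$log where status = 'CURRENT');

-- dump only the changes in the SCN range of interest (placeholder values)
alter system dump logfile 'd:\oradata\redo01.log'
scn min 6385000 scn max 6385100;

Here is what I got (edited for brevity):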

1. Small block size (4k)


CHANGE #1 TYP:0 CLS:1 AFN:17 DBA:0x04400355 OBJ:78233 SCN:0x0000.006168ee SEQ:16 OP:11.2 ENC:0 RBL:0
...
KDO Op code: IRP row dependencies Disabled
xtype: XA flags: 0x00000000 bdba: 0x04400355 hdba: 0x04400209
itli: 1 ispac: 0 maxfr: 1172
tabn: 0 slot: 0(0x0) size/delt: 107
fb: --H-FL-- lb: 0x1 cc: 2
null: --
col 0: [ 2] c1 02
col 1: [100]
78 78 78 78 78 78 78 78 78 ... {total of 100 "78"s}
{three more lines of the same}
CHANGE #2 ...

2. Large block size (8k)


CHANGE #1 TYP:2 CLS:1 AFN:19 DBA:0x04c001d2 OBJ:78235 SCN:0x0000.00617149 SEQ:1 OP:11.11 ENC:0 RBL:0
...
KDO Op code: QMI row dependencies Disabled
xtype: XA flags: 0x00000000 bdba: 0x04c001d2 hdba: 0x04c00083
itli: 2 ispac: 0 maxfr: 4858
tabn: 0 lock: 2 nrow: 56
slot[0]: 10
tl: 107 fb: --H-FL-- lb: 0x0 cc: 2
col 0: [ 2] c1 02
col 1: [100]
78 78 78 78 78 78 78 78 78 ... {total of 100 "78"s}
...
{54 more such records}
slot[55]: 65
tl: 107 fb: --H-FL-- lb: 0x0 cc: 2
col 0: [ 2] c1 39
col 1: [100]
78 78 78 78 78 78 78 78 78 ... {total of 100 "78"s}
{three more lines of the same}
CHANGE #2 ...

One doesn’t need any advanced redo-dump-reading skills to spot the difference: in the first case, there is a change record generated for each row processed, while in the second case one change record covers 55 rows (or “slots” in block-dump lingo), which looks like the number of rows that fit into one block (if we take into account metadata overhead and reserved free space inside the block). And if we look at the operation codes, we’ll see that they are different: in the first case it’s IRP (“Insert Row Piece”), while in the second it’s QMI (“Insert Row Array”) (redo opcodes can be found here).
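That 55-56 rows-per-block figure is easy to sanity-check directly against the 8k test table (a sketch using DBMS_ROWID, not part of the original test; t8k is the table from the 8k run):

select rows_in_block, count(*) blocks
from (
  select dbms_rowid.rowid_block_number(rowid) block#,
         count(*) rows_in_block
  from t8k
  group by dbms_rowid.rowid_block_number(rowid)
)
group by rows_in_block
order by rows_in_block;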

Thus it is confirmed that different mechanisms are involved in DML processing when the block size is smaller than 8k, which explains the difference in performance.

Analysis

So how big is this difference? Obviously, it depends on a number of factors: the size of a row, the number of rows processed by one statement, the block size, tablespace settings, hardware, etc. In my tests I found the difference in terms of CPU time to be 6-7 times for small row sizes (tens of bytes or smaller) and 3-4 times for larger row sizes (hundreds of bytes). Obviously, as the row size approaches the block size, the difference gradually disappears, as row-by-row processing and “block-by-block” processing become one and the same thing. The effect is best observed when at least several thousand rows are processed in one DML statement.

One obvious question pops up: why would Oracle switch to an inefficient mechanism for smaller blocks? I don’t have an exact answer, but my guess would be concurrency. My test above was run in a “dust-free lab” environment, but in a multi-user environment with high concurrency the row-by-row mechanism might provide some advantages.

Another question is: if the difference between the two mechanisms is so big, why is there little or no research on it (I found none)? I have two possible explanations for this:

1) My googling skills aren’t as good as I think they are. Not very likely, but possible, so if you have any links to such research, please leave them in the comments to this post; thanks in advance.

2) Smaller block sizes tend to be used in OLTP environments, where batch DML operations with thousands of rows are less common than single-row DML. Plus, DML often involves physical I/O, which can mask the effect (row-by-row processing, on the other hand, doesn’t increase physical I/O, because we’re reading and modifying the same few blocks over and over).

So, what are the repercussions of this? Does this mean that we have to stop using 2k or 4k blocks? Of course not. But it would be useful to include it in the list of things considered when choosing a block size for a new database; I think it deserves no less consideration than the space overhead due to metadata or the danger of having a chained row once in a blue moon. For small-block databases it could also be useful to have some tablespaces with larger blocks for tables with frequent batch-load operations (see the sketch below). And of course it would be useful to consider the block size before rewriting any inside-the-loop row-by-row DML in the hope of better performance: for small block sizes, the effect is likely to be minimal.
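As a rough sketch of the larger-block tablespace idea (the tablespace name, file name, sizes and the choice of 16k are all illustrative, and db_16k_cache_size has to be configured before such a tablespace can be created):

alter system set db_16k_cache_size = 128M;

create tablespace batch_load_16k
datafile 'd:\batch_load_16k.dbf' size 1G
extent management local uniform size 1M
segment space management auto
blocksize 16384;

-- note: indexes on the table become unusable after the move and must be rebuilt
alter table t4k move tablespace batch_load_16k;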


Summary

For small blocks (2k and 4k), Oracle uses row-by-row internal processing for multi-row DML statements, which means they can be much slower than with larger block sizes (8k and above). For some odd reason, the Oracle documentation doesn’t mention this as a consideration for choosing the optimal block size, but it really should.

9 thoughts on “Block size and multirow DML performance”

  1. Nikolay,

    “why there is little or no research on it”

    A possible partial answer, but working from memory and with no testing on terminal releases: I think the bulk processing optimization for inserts used to disappear if there were indexes on the table that had to be maintained at the same time; and the update and delete optimisation only appeared in 10g – and didn’t apply if indexes had to be maintained.

    So perhaps, historically, the optimization was effectively invisible for all the testing that people did. Remember the ASSM, big block, migration bug that Steve Karam exposed a few years ago: it had been around for three major versions of Oracle, but no-one had noticed it.

    1. Jonathan,

      thanks for your comment. Indeed, that explains it. Apparently, in 10g there was no optimization to bulk-update index blocks in a DML statement, which rendered the optimization for table blocks essentially useless, so they just turned it off… But even so, it was still important for the Oracle user community to know about this behavior, because one of the most basic recommendations for improving the performance of bulk DML is to drop all indexes (I myself have to resort to this trick quite often during application releases, when massive amounts of data need to be loaded or modified within very tight time windows). With indexes dropped, this optimization would become enabled for 8k blocks and larger, but not for smaller blocks, leading to a major difference in performance.

      Now that in 11g the optimization for bulk-updating index blocks is in place, knowing about this difference becomes even more important. I think they should put that into the documentation.

  2. Nikolay

    I think it’s important that you’ve discovered it and made it easy for people to test and reproduce but I think it’s the sort of thing that Oracle will not document because it’s an internal mechanism that may change at any time.

    I’m inclined to think it’s a bug and that it will be fixed if someone wants to make a fuss about it. I’m also curious (but not sufficiently curious to test it) to see if it appears for dictionary managed tablespaces, or if the smaller blocksize is used for the whole database, or if the smaller blocksize is used for the undo tablespace. (I can think of some reasons – though not good ones – why any of these choices might make the problem go away.)

    1. Jonathan,

      I draw a line between the implementation of a feature and its limitations. We can’t expect Oracle to tell us everything about how it implements its features. But we can (and should) expect them to tell us when these features are supposed to work and when they are not. Take direct-path insert, for example: would we agree if Oracle decided not to tell us when it doesn’t work, on the grounds that these are internal details that can change at any time? No, of course not. Why should batch DML be treated differently?

      It probably looks like I’m blowing this out of proportion, but I’m not. At work, I spend most of my time with a 4k block database, and hundreds and thousands of man-hours of development effort spent on optimizing DML using bulk operations were essentially wasted because of this limitation. I don’t think that our situation is untypical, considering that 4k is regarded by Oracle as one of two standard block sizes (and the advent of Advanced Format storage is making 2k more and more popular). “Row by row equals slow by slow” has become a developer’s mantra nowadays, so rewriting via bulk operations is one of the first things a developer would try when facing inadequate DML performance. People would waste much time and effort before they eventually realized that it’s not their fault that they can’t achieve the target performance that way; it shouldn’t be like that.

      I will raise an SR requesting that this behavior either be fixed or documented. Let’s see how Oracle responds.
