Last week I posted a blog entry on high CPU utilization during I/O. The title contained a question mark because it seemed like there were more questions than answers. Were the CPU utilization numbers real, or just an instrumentation artifact? Were they distorted by double-counting? If real, where was the time coming from? About a dozen top experts participated in that discussion, and everybody had a different theory. I've spent some time looking at the problem on different test environments from the OS angle, so now I have some answers I'd like to share.
To avoid any effects due to virtualization, I did the final test on a physical 32-core server running Solaris 10, with Oracle 11.2.0.3. As in my previous tests, I disabled asynchronous I/O (disk_asynch_io=false) and set "_small_table_threshold" to a very high value to disable direct reads (a simpler way to do this is via event 10949, but I found that it hasn't always worked for me). There was one important difference from the initial setup: the block size was 4,096 bytes, because that was the database's default, and I hadn't realized it until my tests were completed and the database was no longer available for further testing. Because of that, the table (set up exactly as shown in my previous post) was somewhat larger: about 1.5 GB instead of 1.2 GB, and 360k blocks instead of 140k. Naturally, the amount of time spent on CPU during the tablescan was also larger: about 27 seconds (previously I used to see numbers in the 5-7 second range). The basic mechanics, however, are independent of the block size.
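For reference, the parameter side of the setup boils down to something like this (a minimal sketch rather than the exact script from the previous post; the table name T is just a placeholder, and disk_asynch_io is a static parameter, so it only takes effect after an instance restart):

-- disable asynchronous I/O (static parameter: set in the spfile, then restart the instance)
alter system set disk_asynch_io = false scope=spfile;

-- push the small-table threshold very high so the tablescan goes through the buffer cache
-- rather than direct path reads; event 10949 is the alternative mentioned above
alter session set "_small_table_threshold" = 1000000;

-- the test itself: a full scan with a simple row count (T is a placeholder)
select /*+ full(T) */ count(*) from T;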
I did some initial studies with prstat (with the "microstates" option, prstat -m), because it's simple to use and doesn't require elevated access. It confirmed that CPU time was consistent with what the database reported, and also showed that most of the CPU time was spent in kernel mode (the microstates were roughly 80% SLP, i.e. waiting for I/O, 15% SYS, i.e. time in kernel calls, and 5% USR, i.e. "userland" calls by Oracle). To get more detail, I used the following dtrace one-liners:
dtrace -n 'profile-1001 /arg1 && pid == xxxxx/ { @[ufunc(arg1)] = count(); }'
for looking at user calls and
dtrace -n 'profile-1001 /arg0 && pid == xxxxx/ { @[func(arg0)] = count(); }'
(where xxxxx is the pid of the Oracle process to trace) for looking at kernel calls.
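For convenience, the same two probes can also be packaged into a single D script that exits on its own after a fixed interval (a rough sketch, equivalent to the one-liners above; the pid is passed as the first argument to the script, and the 60-second window is arbitrary):

#!/usr/sbin/dtrace -s
/* sample at 1001 Hz, with the same predicates as the one-liners above;
   $1 is the pid of the Oracle shadow process to trace */

profile-1001
/arg1 && pid == $1/
{
        @user[ufunc(arg1)] = count();    /* user-space PC -> function name */
}

profile-1001
/arg0 && pid == $1/
{
        @kernel[func(arg0)] = count();   /* kernel PC -> function name */
}

tick-60s
{
        exit(0);                         /* print both aggregations and stop */
}

Running it as dtrace -s profile.d <pid> (as root, or with the necessary dtrace privileges) prints both aggregations when the script exits.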
I had some trouble getting them to work, especially the second one, because of access issues (and the fact that the database was running in a non-global zone). Eventually, I arranged for root access on the global zone, after which everything ran fine and I got the following output:
... (163 lines)
  libc.so.1`_write              130
  oracle`ksfd_odmgo             159
  oracle`skgovprint             160
  oracle`skgo_process_arg       164
  a.out`kslwtectx               165
  oracle`kdst_fetch             166
  libc_psr.so.1`memcpy          177
  oracle`kcbldio                211
  oracle`kcbhvbo                743
  libc.so.1`_ndoprnt            833
for user calls, and for kernel calls:
... (308 lines)
  genunix`as_segcompar                  374
  emcpmpx`MpxPathWeightForAdaptive      390
  scsi`scsi_cache_init_pkt              392
  unix`disp_getwork                     477
  unix`fp_restore                       492
  emcp`PowerDispatchX                   509
  emlxs`emlxs_sli3_issue_iocb_cmd       531
  unix`mutex_exit                       667
  unix`bzero                           1923
  unix`mutex_enter                     2038
What do I conclude from this?
The most important conclusion is that the CPU costs are real (they don't represent some accounting error), and they are not just side effects of asynchronous I/O (like the one described here) or artifacts introduced by virtualization.
CPU time was spent over a large number of calls. I aggregated the output by module for kernel calls (one quick way to do such a roll-up is sketched after the listing), and that gave me the following top modules:
unix     7.4s (37%)  -- OS layer, first of all mutexes
emc      4.5s (23%)  -- hardware layer (EMC PowerPath multipathing)
genunix  2.7s (13%)  -- generic (platform-independent) kernel code
vxfs     1.7s (9%)   -- filesystem layer (Veritas)
...
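The roll-up itself takes just a few lines of awk, along these lines (a sketch, assuming the dtrace kernel profile was saved to a file; kernel_profile.txt is just a placeholder name, and sample counts are converted to approximate CPU seconds by dividing by the 1001 Hz sampling rate):

# sum the sample counts per kernel module (the part before the backtick)
# and convert them to approximate CPU seconds at the 1001 Hz sampling rate
awk -F'`' 'NF == 2 {
        gsub(/^[ \t]+/, "", $1)        # module name, with leading spaces stripped
        n = split($2, f, " ")          # the count is the last field after the backtick
        samples[$1] += f[n]
}
END {
        for (m in samples)
                printf "%-10s %6.1fs\n", m, samples[m] / 1001
}' kernel_profile.txt | sort -k2 -rn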
It's curious that so much time (2s, about 10%) is spent in the unix`bzero function, which fills memory with zero bytes; it would make more sense to see it in some activity that formats or allocates new space, not in a process that simply fetches data for a query. But it's just a curiosity, and I don't think it matters for the analysis (as 90% of the CPU time is spent on other things).
The next important conclusion is one I didn't even need dtrace for (we could already see it in prstat): namely, that most of the CPU time was spent in kernel mode. This is a clear indication that the CPU time is overhead from the OS calls that handle I/O requests. It makes sense: the smaller the MBRC, the more I/O requests need to be issued to the OS, so the costs that are fixed per request (independent of its size) add up to a higher total CPU consumption. But it's still very interesting that the time spent handling I/O requests on the OS side exceeds the time spent on the database side (including acquiring and releasing latches, walking cache hash chains etc.) by that much.
The overhead is about 40 to 60 microseconds per request. For example, if we read a table in 128-block portions (MBRC=128) rather than one block at a time (MBRC=1), we issue 127 fewer requests per 128 blocks, saving roughly 127 x 40-60 microseconds, i.e. about 5-7 milliseconds of CPU time, for each multiblock read. This is a lot (comparable to the physical I/O time itself), and larger than I expected. I thought that small MBRCs are bad mostly because of larger disk latencies, but it turns out that CPU is equally important, if not more so.
Another important finding for me was that CPU profiling confirmed that the CPU usage was coming from the tablescan itself and not from the SORT AGGREGATE step associated with COUNT(*). That is of direct practical importance, because it means that it's safe to use COUNT(*) when studying tablescans in various tests. If it had turned out that counting rows incurred that much CPU overhead, then testing would require selecting rows from a table without aggregation, which is a nuisance (lots of SQL*Net waits, plus additional steps to suppress all the output).
Conclusion
When performing tablescans (and other multiblock I/O) there are significant CPU costs involved, which, depending on the db_file_multiblock_read_count setting and other factors, can reach tens of percent of the total time. A large part of those costs (75% in this particular case) falls on the OS kernel handling the I/O requests (here, 40-60 CPU microseconds per request).