High CPU during I/O (continued)

Last week I posted a blog on high CPU utilization during I/O. The blog title contained a question mark because it seemed like there were more questions than answers. Were the CPU utilization numbers real or just an instrumentation artifact? Were they distorted by double-counting? If real, where was it coming from? About a dozen top experts participated in that discussion, and everybody had a different theory. I’ve spent some time looking at the problem on different test environments from the OS angle, so now I have some answers I’d like to share.

To avoid any effects due to virtualization, I did the final test on a physical 32-core server running Solaris 10, with Oracle 11.2.0.3. As in my previous tests, I disabled asynchronous I/O (disk_asynch_io=false) and set “_small_table_threshold” to a very high value to disable direct reads (a simpler way to do this is via event 10949, but I found that it hasn’t always worked for me). There was one important difference from the initial setup: the block size was 4,096, because that was the database’s default, and I hadn’t realized it until my tests were completed and the database was no longer available for further tests. Because of that, the table (set up exactly as shown in my previous post) was somewhat larger: about 1.5 GB and 360k blocks, instead of 1.2 GB and 140k blocks. Naturally, the amount of time spent on CPU during the tablescan was also larger: about 27 seconds (previously I used to see numbers in the 5-7 second range). The basic mechanics, however, are independent of the block size.
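
For reference, this is roughly how those settings can be applied (a sketch only: the connect strings and the exact threshold value are placeholders rather than what I actually used, and “_small_table_threshold” is an undocumented parameter, so treat this as an illustration, not a recommendation):

# disk_asynch_io is a static parameter: set it in the spfile and restart the instance
sqlplus / as sysdba <<'EOF'
alter system set disk_asynch_io = false scope = spfile;
EOF

# in the session that will run the tablescan, push "_small_table_threshold" (in blocks)
# well above the table size so the scan stays on buffered reads instead of direct path reads;
# the event 10949 alternative mentioned above is shown commented out
sqlplus user/password <<'EOF'
alter session set "_small_table_threshold" = 1000000;
-- alter session set events '10949 trace name context forever, level 1';
EOF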

I did some initial studies with prstat (with the “microstates” option, prstat -m), because it’s simple to use and doesn’t require elevated access. It confirmed that the CPU time was consistent with what the database reported, and also showed that most of the CPU time was spent in kernel mode (the microstates were about 80% sleep, i.e. waiting for I/O, 15% sys, i.e. kernel calls, and 5% usr, i.e. “userland” calls by Oracle, so the CPU portion was dominated by kernel time). To get more detail, I used the following dtrace one-liners:

dtrace -n 'profile-1001 /arg1 && pid == xxxxx/ { @[ufunc(arg1)] = count(); }'

for looking at user calls and

dtrace -n 'profile-1001 /arg0 && pid == xxxxx/ { @[func(arg0)] = count(); }'

(where xxxxx is the pid of the Oracle process to trace) for looking at kernel calls.

I had some trouble getting them to work, especially the second one, because of access issues (and the fact that the database was running in a non-global zone). Eventually, I arranged for root access on the global zone, after which everything worked and I got the following output:

  ... (163 lines)
  libc.so.1`_write                  130
  oracle`ksfd_odmgo                 159
  oracle`skgovprint                 160
  oracle`skgo_process_arg           164
  a.out`kslwtectx                   165
  oracle`kdst_fetch                 166
  libc_psr.so.1`memcpy              177
  oracle`kcbldio                    211
  oracle`kcbhvbo                    743
  libc.so.1`_ndoprnt                833

 

for user calls, and for kernel calls:

  ... (308 lines)
  genunix`as_segcompar              374
  emcpmpx`MpxPathWeightForAdaptive  390
  scsi`scsi_cache_init_pkt          392
  unix`disp_getwork                 477
  unix`fp_restore                   492
  emcp`PowerDispatchX               509
  emlxs`emlxs_sli3_issue_iocb_cmd   531
  unix`mutex_exit                   667
  unix`bzero                       1923
  unix`mutex_enter                 2038

What do I conclude from this?

The most important conclusion is that the CPU costs are real (not just an accounting artifact), and that they are not simply side effects of asynchronous I/O (like the one described here) or artifacts introduced by virtualization.

The CPU time was spread over a large number of calls. I aggregated the output by module for the kernel calls, and that gave me the following top modules:

unix     7.4s (37%) -- OS layer, first of all mutexes
emc      4.5s (23%) -- multipathing driver layer (EMC PowerPath)
genunix  2.7s (13%) -- generic (platform-independent) kernel code
vxfs     1.7s (9%)  -- filesystem layer
...
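
As a side note, a per-module breakdown can also be obtained directly from dtrace, by aggregating on mod() instead of func(). The table above came from post-processing the per-function output, but a one-liner along these lines (a sketch, with the same placeholder pid) would do the same job:

# sample on-CPU kernel PCs of the shadow process and aggregate them by kernel module
dtrace -n 'profile-1001 /arg0 && pid == xxxxx/ { @[mod(arg0)] = count(); }'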

It’s curious that so much time (about 2 s, or roughly 10%) is spent in the unix`bzero function, which fills memory with zero bytes; it would make more sense to see it in some activity that formats or allocates new space, not in a process that simply fetches data for a query. But it’s just a curiosity, and I don’t think it matters much for the analysis, as 90% of the CPU time is spent on other things.
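
If anyone wants to dig further into the bzero curiosity, one way would be to sample kernel stacks rather than individual functions and then look at the code paths that end in bzero. Something along these lines would do (a sketch, not something I ran for this post):

# sample 10-frame kernel stacks of the process; the stacks containing bzero
# show which code paths are doing the zeroing
dtrace -n 'profile-1001 /arg0 && pid == xxxxx/ { @[stack(10)] = count(); }'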

The next important conclusion is one I didn’t even need dtrace for (we could already see it in prstat): most of the CPU time was spent in kernel mode. This is a clear indication that the CPU time is overhead from the OS calls that handle I/O requests. It makes sense: the smaller the MBRC (db_file_multiblock_read_count) is, the more I/O requests need to be raised with the OS, so the per-request costs (those independent of request size) add up to a larger total CPU consumption. But it’s still very interesting that the time spent handling I/O requests on the OS side exceeds the time spent on the database side (including acquiring and releasing latches, walking cache hash chains etc.) by that much.

The overhead is about 40 to 60 microseconds per request. For example, if we read a table in 128-block portions (MBRC=128) instead of one block at a time (MBRC=1), we issue 127 fewer requests per 128 blocks, which saves roughly 5-7 milliseconds of CPU time for every 128 blocks read. This is a lot (comparable to the physical I/O time itself), and larger than I expected. I thought small MBRCs were bad mostly because of larger disk latencies, but it turns out that the CPU cost is equally important, if not more so.

Another important finding for me was that CPU profiling confirmed that the CPU usage was coming from the tablescan itself, and not from the SORT AGGREGATE step associated with COUNT(*). That is of direct practical importance, because it means it’s safe to use COUNT(*) when studying tablescans in various tests. If it had turned out that counting rows incurred that much CPU overhead, testing would have required selecting rows from the table without aggregation, which is a nuisance (lots of SQL*Net waits, plus additional steps needed to suppress all the output).
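
The check itself takes the same profiling approach: sample short user-level stacks and see whether they sit under the tablescan and buffer cache functions (kdst*, kcb*, as in the output above) or under the aggregation code. A sketch:

# sample 5-frame user stacks of the shadow process to see which row source
# the user-mode CPU samples belong to
dtrace -n 'profile-1001 /arg1 && pid == xxxxx/ { @[ustack(5)] = count(); }'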

Conclusion

When performing tablescans (and other multiblock I/O), there are significant CPU costs involved, which, depending on db_file_multiblock_read_count settings and other factors, can reach tens of percent of the total time. A large part of those costs (75% in this particular case) falls on the OS kernel handling the I/O requests, at about 40-60 CPU microseconds per request.
