Log file sync from neither I/O nor CPU

The two most popular causes of log file sync waits are redo filesystem I/O and CPU (or CPU scheduling). I’d like to share an example of a third scenario.

A few days ago, I noticed a weird spike in the average log file sync duration on a production 11.2 database:

select time,
       snap_id,
       round(time_delta/1e3/nullif(waits_delta,0), 1) avg_wait_ms,
       waits_delta num_of_waits,
       round(time_delta/1e6) total_seconds
from
(
  select sn.snap_id,
         sn.begin_interval_time time,
         e.total_waits - lag(e.total_waits) over (partition by e.event_name order by e.snap_id) waits_delta,
         e.time_waited_micro - lag(e.time_waited_micro) over (partition by e.event_name order by e.snap_id) time_delta
  from dba_hist_system_event e,
       dba_hist_snapshot sn
  where e.snap_id = sn.snap_id
  and e.event_name = 'log file sync'
) ev
where ev.time_delta > 0
order by time desc;

========================================================================
begin_interval_time    snap_id  avg_wait_ms  num_of_waits  total_seconds
========================================================================
05-OCT-14 11.30.14 AM  52454    1170.2       14113         16515
05-OCT-14 11.14.04 AM  52454    10.6         8463          89
05-OCT-14 08.00.03 AM  52454    19.8         339975        6715
05-OCT-14 07.30.02 AM  52454    8.6          15728         135
05-OCT-14 07.00.00 AM  52454    18           15678         282
05-OCT-14 06.30.58 AM  52454    12.2         15696         192
05-OCT-14 06.00.57 AM  52454    73           18480         1349
05-OCT-14 05.30.55 AM  52454    12.9         18385         237
05-OCT-14 05.00.51 AM  52454    17.5         17033         299

The wait event histogram showed that the average was distorted by a few super-long waits:

select wait_time_milli, wait_count
from dba_hist_event_histogram h,      
     dba_hist_snapshot sn
where h.snap_id = sn.snap_id
and event_name = 'log file sync'
and sn.snap_id = 52454
order by wait_time_milli;
==========================
wait_time_milli wait_count
==========================
              1        247
              2       8466
              4     364854
              8     214135
             16     183034
             32      91450
             64      23878
            128       7027
            256       9411
            512       3118
           1024        777
           2048          6
           4096         19
           8192         15
          16384         24
          32768          1
          65536         10
         131072         51
         262144         65

I checked the log writer activity in ASH and found that it had been stuck on an “enq: CF - contention” wait for a few straight minutes. “CF” stands for “control file”: the log writer needed to write to the control file (probably because it was about to switch log files, which requires an update to the control files), but couldn’t, because something else was writing to it. It was straightforward to find that the culprit was RMAN, since ASH gave me the blocking session id and serial number (blocking_session, blocking_session_serial#):

select distinct blocking_session, blocking_session_serial#
from dba_hist_active_sess_history ash
where program like '%LGWR%'
and event = 'enq: CF - contention'
and sample_time between to_date('11:00 05-Oct-2014', 'hh24:mi dd-mon-yyyy') 
                and to_date('12:00 05-Oct-2014', 'hh24:mi dd-mon-yyyy');
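
For reference, the initial check of LGWR’s activity in ASH can be sketched along these lines (just a sketch, using the same program filter and time window as above, listing each sample’s state and wait event):

-- LGWR's ASH samples in the problem window: state and wait event per sample
select to_char(sample_time, 'dd-mon-yy hh24:mi:ss') sample_time,
       session_state,
       event
from dba_hist_active_sess_history
where program like '%LGWR%'
and sample_time between to_date('11:00 05-Oct-2014', 'hh24:mi dd-mon-yyyy')
                and to_date('12:00 05-Oct-2014', 'hh24:mi dd-mon-yyyy')
order by sample_time;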

The module and program columns of the blocking session confirmed that it belonged to RMAN:

select distinct module, program, action
from dba_hist_active_sess_history
where session_id = 7395
and session_serial# = 2807;
=======================================================================
module                     program                    action
=======================================================================
rman@gmfxdbp1 (TNS V1-V3)  rman@gmfxdbp1 (TNS V1-V3)  0003613 STARTED53
rman@gmfxdbp1 (TNS V1-V3)  rman@gmfxdbp1 (TNS V1-V3)  0007777 STARTED53
rman@gmfxdbp1 (TNS V1-V3)  rman@gmfxdbp1 (TNS V1-V3)  0012913 STARTED53
rman@gmfxdbp1 (TNS V1-V3)  rman@gmfxdbp1 (TNS V1-V3)  0014887 STARTED53
rman@gmfxdbp1 (TNS V1-V3)  rman@gmfxdbp1 (TNS V1-V3)  0025379 STARTED53
rman@gmfxdbp1 (TNS V1-V3)  rman@gmfxdbp1 (TNS V1-V3)  0027865 STARTED53
....

During that period, the RMAN session was busy with “control file parallel write”:

select to_char(sample_time, 'dd-mon-yy hh24:mi:ss') sample_time,
       event,
       round(delta_write_io_bytes/1024/1024/10) megabytes_per_sec
from dba_hist_active_sess_history
where session_id = 7395 
and session_serial# = 2807
and sample_time between to_date('11:00 05-Oct-2014', 'hh24:mi dd-mon-yyyy') 
                and to_date('12:00 05-Oct-2014', 'hh24:mi dd-mon-yyyy')
order by sample_time desc
==================================================================
sample_time         event                        megabytes_per_sec
==================================================================
05-oct-14 11:57:03  control file parallel write                 10
05-oct-14 11:56:53  control file parallel write                 10
05-oct-14 11:56:42  control file parallel write                  9
...
05-oct-14 11:46:54  control file parallel write                  9

So for about 10 minutes, RMAN was writing at 9-10 MB/s, i.e. the total amount of information written within that time frame was over 5 GB (roughly 600 seconds at 9-10 MB/s), several times the size of the control files. I’m not really sure what all that information was about. There was a lot of non-standard one-off activity in the database at the moment (and it clearly would have been better to disable the backup before starting that activity), and the incident wasn’t impactful enough to warrant further investigation, so I just left it at that. I did discuss this with the production DBA though, and he suggested that one of the reasons for the high write activity was the high value of the control_file_record_keep_time parameter (set to 95 days).
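
To follow up on that theory, one could compare the control file size with the space taken up by its record sections, along these lines (just a sketch against the standard v$ views):

-- control file sizes in MB
select name, round(block_size * file_size_blks / 1024 / 1024) size_mb
from v$controlfile;

-- space occupied by each record section; the reusable sections grow
-- with control_file_record_keep_time
select type,
       round(record_size * records_total / 1024) section_kb,
       records_used,
       records_total
from v$controlfile_record_section
order by section_kb desc;

-- the retention setting itself
select value
from v$parameter
where name = 'control_file_record_keep_time';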

This short example illustrates the following points:
– there are other scenarios behind log file sync waits than slow I/O or CPU/scheduling issues
– the most direct way to investigate the origin of log file sync waits is to look at the LGWR session itself (using trace if possible, or ASH or real-time monitoring tools otherwise) rather than at global statistics (a minimal sketch follows this list)
– high backup activity can have nasty performance side effects, especially when it overlaps with other activities, and when control_file_record_keep_time is high.
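
As an illustration of the second point, a minimal real-time check of the LGWR session could look like this (a sketch; on newer releases with multiple log writer slaves the join would need adjusting):

-- what is LGWR doing right now?
select s.sid, s.serial#, s.status, s.event, s.state, s.seconds_in_wait
from v$session s,
     v$bgprocess b
where s.paddr = b.paddr
and b.name = 'LGWR';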

6 thoughts on “Log file sync from neither I/O nor CPU”

  1. Hi,

    congratulations on the excellent quality of your LGWR/log file sync processing posts! :)

    Just out of curiosity related to this one:
    a) perhaps making the snapshot control file (x2) and several/mirrored control files (e.g. x3) would give this value of 5 GB?
    b) have you by any chance investigated/observed v$wait_chains during such an event (AFAIK not for postmortem analysis, as there is no ASH version of v$wait_chains, true/false?)? Just curious if it would return valid results for LGWR in this case ;)
    c) round(delta_write_io_bytes/1024/1024/10) megabytes_per_sec: why /10? Shouldn’t it be connected somehow with v$ash.delta_time?

    -J

    1. Hi Jakub,

      great questions!

      a) thanks for pointing that out; being more of a developer than a DBA, RMAN is not an area I’m very familiar with
      b) I don’t use v$wait_chains much (and no, there is no ASH version of this view to my knowledge)
      c) I divided by 10 because I assumed the samples were roughly 10 seconds apart (the numbers certainly looked that way), but you are right, it is more correct to use delta_time here. If my assumption was wrong, that might have made the write rate appear bigger than it actually was.
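
      Roughly, a delta_time-normalized version of that calculation would look something like this (just a sketch; delta_time is in microseconds):

      -- write rate per sample, normalized by the actual time between samples
      select to_char(sample_time, 'dd-mon-yy hh24:mi:ss') sample_time,
             event,
             round(delta_write_io_bytes/1024/1024/(delta_time/1e6)) megabytes_per_sec
      from dba_hist_active_sess_history
      where session_id = 7395
      and session_serial# = 2807
      and delta_time > 0
      order by sample_time desc;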

      Best regards,
      Nikolay

      1. My previous comment was about question “B”: ASH-based analysis of wait chains. It’s better than V$WAIT_CHAINS btw, as it may take some time before the chains show up in V$WAIT_CHAINS, whereas in ASH the links are immediate.
