High log file sync waits? Check log parallelism!

Log parallelism is an optimization introduced in 9.2 that reduces latch contention during redo copy into the log buffer by enabling multiple public redo buffers (or “strands”). In many cases, however, it can cause a massive degradation of commit performance (an increase in “log file sync” waits). In this blog post, I will describe the mechanism, illustrate it with test results, and discuss solutions. Tests were performed on several different 11gR2 databases on Solaris servers.

Log parallelism

Log parallelism is controlled by two hidden parameters, _log_parallelism_max and _log_parallelism_dynamic, which, as their names suggest, control the number of public redo buffers and whether this number is adjusted dynamically. By default, Oracle adds up to one public redo buffer per 16 CPUs. One important consequence of multiple redo buffers is that the log writer has to write from each of them, so the number of I/O requests per log write can grow to up to M x P, where M is the redo multiplexing level (the number of redo file copies that Oracle writes simultaneously, so that losing one of them does not compromise recoverability), and P is the log parallelism degree. This is easy to see thanks to the P3 (“requests”) parameter of the log file parallel write event, which you can access either by tracing the log writer session or via V$ACTIVE_SESSION_HISTORY event samples.
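To make the arithmetic concrete, here is a small Python sketch of the numbers above. The function names are hypothetical helpers of my own, and the one-strand-per-16-CPUs rule is the default behavior described in the text:

```python
# Toy calculation of the quantities discussed above (not an Oracle API):
# with P public redo strands and M-way redo multiplexing, one log write
# can fan out into up to M x P separate I/O requests.

def default_strand_count(cpu_count: int) -> int:
    """Default number of public redo strands: up to one per 16 CPUs
    (hypothetical helper based on the rule stated in the text)."""
    return max(1, cpu_count // 16)

def max_requests_per_log_write(cpu_count: int, multiplexing: int) -> int:
    """Upper bound on I/O requests issued by LGWR for a single log write."""
    return multiplexing * default_strand_count(cpu_count)

# A 128-CPU server (like the test box below) with 2-way multiplexed redo:
print(default_strand_count(128))           # 8 strands
print(max_requests_per_log_write(128, 2))  # up to 16 requests per write
```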

The OS normally won’t have a problem submitting all M x P requests asynchronously (provided asynchronous I/O is enabled at the database and OS/storage levels). But that does not guarantee that they will actually complete in parallel, because:

– the OS may not allow multiple simultaneous writes to a single file

– different redo buffers may map to physically close locations, which makes it physically impossible to write them at the same time (e.g. on a hard disk, the actuator arm cannot write to several different sectors/tracks on the same platter at once).

As a result, for each multiplexed redo log file you end up doing several smaller sequential I/O operations instead of one large sequential write. That means up to P times more seek operations (which is where most of the time goes in sequential writes), and additionally you may burn some CPU on kernel locking (observable by watching TM_DELTA_CPU_TIME for the LGWR session grow in ASH despite it not being on CPU).
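A back-of-the-envelope model shows why the split hurts: each extra request pays a seek, while the transfer cost stays the same. This Python sketch uses made-up latency numbers purely for illustration:

```python
# Illustrative model of why P small serialized writes can be much slower
# than one large write of the same total size on rotating media.
# The latency numbers below are assumptions, not measurements.

SEEK_MS = 4.0           # assumed seek + rotational latency per I/O
TRANSFER_MS_PER_MB = 5  # assumed sequential transfer cost

def write_time_ms(total_mb: float, n_requests: int) -> float:
    """Total time if the write is split into n serialized requests."""
    return n_requests * SEEK_MS + total_mb * TRANSFER_MS_PER_MB

one_big = write_time_ms(1.0, 1)      # single 1 MB redo write
eight_small = write_time_ms(1.0, 8)  # same redo spread over 8 strands
print(one_big, eight_small)          # 9.0 vs 37.0 ms: seeks dominate
```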

Therefore, what looks like parallel I/O at the database level can in fact be serialized by the operating system, the storage, or both.

Test results

Let’s illustrate this with some data. I ran a simple multi-threaded looped insert/commit test (with COMMIT WRITE WAIT in order to prevent the PL/SQL commit optimization from distorting the results). The first run used the default log parallelism settings; for the second, log parallelism was disabled by setting the “_log_parallelism_max” parameter to 1 and bouncing the instance.

So, this is log parallelism on:

WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Startup Time    Release     RAC
------------ ----------- ------------ -------- --------------- ----------- ---
xxxxxx         680910931 xxxxxx             1 09-Apr-14 11:01 11.2.0.3.0  NO 

Host Name        Platform                         CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
tfru1            Solaris[tm] OE (64-bit)           128    16       2      48.00

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     92128 17-Sep-14 10:00:24        52       1.8
  End Snap:     92129 17-Sep-14 10:06:49        66       1.5
   Elapsed:                6.41 (mins)
   DB Time:               63.12 (mins)

Cache Sizes                       Begin        End
~~~~~~~~~~~                  ---------- ----------
               Buffer Cache:     7,168M     7,168M  Std Block Size:         8K
           Shared Pool Size:     4,608M     4,608M      Log Buffer:   107,096K

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):                9.9                0.0       0.01      12.67
       DB CPU(s):                0.3                0.0       0.00       0.41
       Redo size:        2,064,749.9            1,588.4
   Logical reads:            7,024.3                5.4
   Block changes:            6,389.7                4.9
  Physical reads:                0.1                0.0
 Physical writes:              283.9                0.2
      User calls:                0.8                0.0
          Parses:                4.2                0.0
     Hard parses:                0.0                0.0
W/A MB processed:                0.1                0.0
          Logons:                0.2                0.0
        Executes:            1,310.5                1.0
       Rollbacks:                0.0                0.0
    Transactions:            1,299.9

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
log file sync                       500,050       3,676      7   97.1 Commit
DB CPU                                              123           3.2
library cache: mutex X                  813           0      0     .0 Concurrenc
log file switch (private stran           20           0      8     .0 Configurat
control file sequential read          2,953           0      0     .0 System I/O

...

Background Wait Events            DB/Inst: xxxxxx/xxxxxx  Snaps: 92128-92129
...
                                                             Avg
                                        %Time Total Wait    wait    Waits   % bg
Event                             Waits -outs   Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
log file parallel write          99,963     0        372       4      0.2   90.5
db file parallel write           13,655     0         10       1      0.0    2.5
os thread startup                    15     0          1      65      0.0     .2
control file parallel writ          415     0          1       2      0.0     .2
db file async I/O submit          6,627     0          0       0      0.0     .1
control file sequential re        1,132     0          0       0      0.0     .0

The test inserted 500,000 rows into test tables in 50 threads, and you can see that the number of log file sync waits is very close to that number (which confirms that the PL/SQL commit optimization is not distorting our results). The vast majority of DB time is spent on log file sync, so the results are not polluted with irrelevant waits. Now let’s disable log parallelism and run the same test:

WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Startup Time    Release     RAC
------------ ----------- ------------ -------- --------------- ----------- ---
xxxxxx         680910931 xxxxxx             1 17-Sep-14 10:28 11.2.0.3.0  NO 

Host Name        Platform                         CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
tfru1            Solaris[tm] OE (64-bit)           128    16       2      48.00

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     92131 17-Sep-14 10:31:31        54       1.8
  End Snap:     92132 17-Sep-14 10:33:29        65       1.6
   Elapsed:                1.96 (mins)
   DB Time:               19.01 (mins)

Cache Sizes                       Begin        End
~~~~~~~~~~~                  ---------- ----------
               Buffer Cache:     7,168M     7,168M  Std Block Size:         8K
           Shared Pool Size:     4,608M     4,608M      Log Buffer:   107,096K

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):                9.7                0.0       0.00       2.76
       DB CPU(s):                1.0                0.0       0.00       0.27
       Redo size:        6,660,802.0            1,565.4
   Logical reads:           26,032.4                6.1
   Block changes:           20,958.9                4.9
  Physical reads:               41.0                0.0
 Physical writes:                0.0                0.0
      User calls:                3.5                0.0
          Parses:               18.3                0.0
     Hard parses:                2.8                0.0
W/A MB processed:                1.9                0.0
          Logons:                0.6                0.0
        Executes:            4,325.6                1.0
       Rollbacks:                0.0                0.0
    Transactions:            4,255.1

...	

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
log file sync                       500,055       1,008      2   88.4 Commit
DB CPU                                              113           9.9
library cache: mutex X                1,247           0      0     .0 Concurrenc
log file switch (private stran           10           0     14     .0 Configurat
db file sequential read               2,208           0      0     .0 User I/O  

Background Wait Events            DB/Inst: xxxxxx/xxxxxx  Snaps: 92131-92132
...
                                                             Avg
                                        %Time Total Wait    wait    Waits   % bg
Event                             Waits -outs   Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
log file parallel write         101,623     0        107       1      0.2   89.9
os thread startup                    16     0          1      66      0.0     .9
ADR block file write                880     0          1       1      0.0     .5
control file parallel writ          125     0          0       1      0.0     .1
ADR block file read               5,208     0          0       0      0.0     .1
ADR file lock                       822     0          0       0      0.0     .0
control file sequential re          901     0          0       0      0.0     .0

The difference is quite impressive: the test now completes more than 3 times faster because of the reduction in log file sync time!
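As a sanity check, the speedup can be computed directly from the figures in the two AWR reports above:

```python
# Quick arithmetic on the numbers from the two AWR reports above.

elapsed_before_min = 6.41  # log parallelism on (default)
elapsed_after_min = 1.96   # _log_parallelism_max = 1

speedup = elapsed_before_min / elapsed_after_min
print(round(speedup, 1))   # ~3.3x faster overall

lfs_before_ms, lfs_after_ms = 7, 2   # avg "log file sync" waits
print(lfs_before_ms / lfs_after_ms)  # 3.5x reduction in commit latency
```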

Risk factors for redo I/O serialization

The described mechanism is very common, but not universal. Certain platforms and configurations carry a higher risk of adverse effects from log parallelism on database performance than others:

– raw/ASM storage means less risk (because Oracle controls many low-level I/O details and can avoid unnecessary locking)

– no write cache, or write cache saturation, means more risk (writes to cache don’t have to be serialized; writes to physically close locations in persistent storage do)

– the more CPUs the database server has, the more aggressive log parallelism is by default, i.e. the higher the risk

– broad tails in log file parallel write histograms (and consequently a high ratio of average log file sync to log file parallel write wait times, 2.5-3 and higher) are strong indicators that log parallelism is the problem.

Since short transactions with frequent commits are typical of OLTP systems, OLTP systems generally suffer more from this problem (a data warehouse is just as likely to experience redo I/O serialization, but the chance of it becoming an actual problem is much lower).

If your database is suffering from slow commits (“log file sync”), the following steps will help you diagnose the problem (you don’t have to do all of them):

– use ASH to look at the number of requests (the P3 parameter of log file parallel write) and see whether there is a correlation between elapsed time and the number of requests

– use ASH to check wait statistics for the LGWR session. If it spends much time on CPU, or waiting on redo allocation/copy latches, control file parallel writes, etc., this means the problem is not the I/O itself
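The first check above can be sketched in Python, assuming the relevant ASH columns (P3 and the wait time) have already been fetched into plain dictionaries; the sample values here are made up for illustration:

```python
# Sketch of the first check: group "log file parallel write" samples by
# the number of requests (P3) and compare average wait times. The sample
# data below is synthetic; in practice it would come from a query against
# V$ACTIVE_SESSION_HISTORY or DBA_HIST_ACTIVE_SESS_HISTORY.
from collections import defaultdict

ash_samples = [
    {"p3": 2, "time_waited_us": 1200},
    {"p3": 2, "time_waited_us": 1500},
    {"p3": 16, "time_waited_us": 9000},
    {"p3": 16, "time_waited_us": 11000},
]

waits_by_requests = defaultdict(list)
for s in ash_samples:
    waits_by_requests[s["p3"]].append(s["time_waited_us"])

for requests, waits in sorted(waits_by_requests.items()):
    avg_ms = sum(waits) / len(waits) / 1000.0
    print(f"{requests:>3} requests -> avg {avg_ms:.2f} ms")

# A clear upward trend (average wait growing with the request count)
# points to redo I/O serialization.
```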

Of course, the ultimate (and simplest) check is to run a test with and without log parallelism. But since this requires SYS-level access and involves changing hidden parameters, it helps to gain some confidence that you’re on the right track before doing it.

Solutions

The simplest solution is to disable log parallelism by setting _log_parallelism_max to 1. This may cause an increase in waits on redo copy/allocation latches, so be sure to test it before applying it in production. Also, since this is an undocumented parameter, changing it requires approval from Oracle Support, but it should be straightforward to obtain if you open an SR and ask for it explicitly (I was able to obtain the approval within a few hours).
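For reference, the change itself is a one-liner. This is only a sketch: remember it is a hidden parameter, so get Oracle Support approval first and test before production. The double quotes are required for underscore parameters, and the change takes effect after an instance restart:

```sql
-- Disable log parallelism (hidden parameter: Oracle Support approval
-- required; takes effect after an instance restart).
ALTER SYSTEM SET "_log_parallelism_max" = 1 SCOPE = SPFILE;
```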

If for whatever reason you don’t like this solution, you can also try moving the redo logs to a different type of storage, e.g. ASM or raw.

Keep in mind that, as always with bottlenecks, removing one can cause another somewhere else, so a realistic test on a production-like environment is important before going live.

Summary

Log parallelism helps reduce latch contention when copying redo to the log buffer. A possible (and rather common) side effect is a dramatic increase in log file sync waits that outweighs the benefits of reduced latching. A few simple checks can tell whether log parallelism is a factor in your log file sync problem, in which case you may be able to cut log file sync waits by a factor of 2 or more by disabling log parallelism with a single parameter change.


13 thoughts on “High log file sync waits? Check log parallelism!”

    1. Hi Hemant,

      this testing actually gives a lower-bound estimate of the effect, because people try to save money on test databases, i.e. they typically have fewer CPUs than production systems, so log parallelism is less aggressive there by default. When _log_parallelism_max defaults to a value based on the number of CPUs in production, I get up to a 15x improvement!

      It’s really surprising that the effect went unnoticed for so long. Well, it wasn’t completely unnoticed: Tony Hasler reported similar observations a few years ago in his blog: http://tonyhasler.wordpress.com/2011/07/24/log-file-sync-and-log-file-parallel-write-part-2/ But he essentially discarded his findings by attributing them to an unknown fault in the I/O subsystem.

  1. Nikolay,

    It’s not the main topic of the article (which is very good), but could you please explain the following:

    “additionally you may burn some CPU on kernel locking (observable by watching TM_DELTA_CPU_TIME for the LGWR session grow in ASH despite it not being on CPU).”

    Thanks,
    Naresh

    1. Naresh,

      generally, a session is either on CPU or waiting. In the former case all signs should point to CPU usage: SESSION_STATE shows “ON CPU”, TM_DELTA_CPU_TIME is incrementing, and OS commands (e.g. prstat on Solaris) also show that the OS process for the session is on CPU. In the latter case none of these signs should be observed.

      However, with the log parallelism bug, the symptoms may send “mixed signals”. At least that’s what I saw on a Solaris 10 machine: the LGWR session was not “ON CPU”, but its TM_DELTA_CPU_TIME statistic in ASH kept increasing, and prstat showed signs of high CPU usage as well.

      Best regards,
      Nikolay

  2. Just for interest’s sake, here are AWR snippets from a test done on Direct NFS to a NetApp filer (in my case, 16 sessions at 100k inserts each, on a 4-core Linux VM on Intel)

    parallelism = default
    =====================
    Elapsed: 3.27 (mins)
    Executes (SQL): 8,124.8 per sec

    Event                        Waits       Total Wait Time (sec)  Avg(ms)  % DB time  Wait Class
    log file sync                1,588,849   2,546.6                1.60     79.2       Commit

    Event                        Waits       Total Wait Time (s)    Avg(ms)  Waits/txn  % bg time
    log file parallel write      235,392     141                    0.60     0.1        37.6

    parallelism = 1
    =====================
    Elapsed: 2.95 (mins)
    Executes (SQL): 9,026.3 per sec

    Event                        Waits       Total Wait Time (sec)  Avg(ms)  % DB time  Wait Class
    log file sync                1,599,080   2,429.3                1.52     87.6       Commit

    Event                        Waits       Total Wait Time (s)    Avg(ms)  Waits/txn  % bg time
    log file parallel write      199,623     114                    0.57     0.1        38.5

    So there is a marginal gain, but nothing of the order of your example…unsure whether that’s a feature of Direct NFS or something else.

    Cheers,
    Connor

    1. Hi Connor,

      thanks for sharing your observations. Obviously, in your case the effect was negligible, and it would be interesting to understand why. Could you provide more details about your case: first of all, what kind of storage are you running on, ASM or a filesystem? If a filesystem, which one? What kind of hardware is it on? Are the redo logs placed on dedicated disks, or sharing disks with other data?

      38.5 ms per write is obviously very bad — can you check I/O stats from the OS side, i.e. are you experiencing a lot of queuing?

      It would also help to know whether the 38.5 ms average comes from consistently bad I/O figures or from a few outliers, e.g. from the AWR histogram (provided your version of Oracle is 11g or higher).

      It sounds like a test setup, so I believe there is no synchronous storage-level replication here?

      Also, it would help to know the average number of requests (or better yet, the distribution of requests per one log file write event).

      I’d be happy to take a closer look if you email me the details.

      Best regards,
      Nikolay

      1. Sorry, it’s the formatting that’s distorting things… I’ll summarise :-)

        log para = 1
        ==============
        total elapsed = 2.95mins
        log file sync, avg wait = 1.52ms
        log file parallel write, 199,623 writes, avg = 0.57ms

        log para = default
        ==============
        total elapsed = 3.27mins
        log file sync, avg wait = 1.60ms
        log file parallel write, 235,392 writes, avg = 0.60ms

        So not much difference, but still, no parallelism outperforms the default

        Cheers
        Connor

    2. Hi Connor,

      ok, now it makes more sense. In your case disabling log parallelism cannot solve your log file sync problem, because there is no such problem: 1-2 ms per LFS event is about as good as it gets, so there’s nothing to fix here. Besides, yours is a 4-core system; to observe log parallelism effects you generally need more cores/CPUs (because the number of public redo strands is proportional to the number of CPUs).

      Best regards,
      Nikolay
