Log parallelism is an optimization introduced in 9.2 that reduces latch contention when copying redo into the log buffer by enabling multiple public redo buffers (or “strands”). In many cases, however, it can cause a massive degradation of commit performance (an increase in “log file sync” waits). In this blog post, I will describe the mechanism, illustrate it with test results, and discuss solutions. Tests were performed on several different 11gR2 databases on Solaris servers.
Log parallelism
Log parallelism is controlled by two hidden parameters, _log_parallelism_max and _log_parallelism_dynamic, which, as their names suggest, control the number of public redo buffers and whether this number should be adjusted dynamically. By default, Oracle adds one public redo buffer for every 16 CPUs. One important effect of using multiple redo buffers is that the log writer has to write from each of them, so the number of I/O requests per log write can increase to up to M x P, where M is the redo multiplexing level (i.e. the number of redo log copies that Oracle writes simultaneously so that it doesn’t lose its ability to recover if one of them is lost), and P is the log parallelism degree. This is easy to see thanks to the P3 (“requests”) parameter of the log file parallel write event, which you can access either by tracing the log writer session or via V$ACTIVE_SESSION_HISTORY event samples.
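For example, a query along these lines (just a sketch using standard ASH columns; adjust the PROGRAM filter to your platform) shows the number of I/O requests behind each sampled log write:

-- Number of I/O requests (P3) per sampled "log file parallel write",
-- as seen in ASH for the log writer process
SELECT sample_time,
       session_state,
       event,
       p3 AS requests
FROM   v$active_session_history
WHERE  program LIKE '%LGWR%'
AND    event = 'log file parallel write'
ORDER  BY sample_time;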
The OS normally won’t have a problem submitting all M x P requests asynchronously (provided asynchronous I/O is enabled at the database, OS and storage levels). But that does not guarantee that they will actually complete in parallel, because:
– the OS may not allow multiple simultaneous writes to a single file
– different redo buffers may map to physically close locations, which would make it physically impossible to write them at the same time (e.g. on a hard disk the actuator arm cannot write to several different sectors/tracks on the same platter at once).
As a result, for each multiplexed redo log file you’d be doing several smaller sequential I/O operations instead of one large sequential write. That means doing up to P times more seek operations (which is what takes most of the time in a sequential write), and additionally you may also burn some CPU on kernel locking (observable by watching TM_DELTA_CPU_TIME for the LGWR session grow in ASH despite it not being on CPU).
Therefore, what looks like parallel I/O at the database level can in fact be serialized by the operating system, the storage, or both.
Test results
Let’s illustrate this with some data. I ran a simple multi-threaded looped insert/commit test (with COMMIT WRITE WAIT in order to prevent the PL/SQL commit optimization from distorting the results). The first time I ran it with default log parallelism settings; the second time, log parallelism was disabled by setting the “_log_parallelism_max” parameter to 1 and bouncing the instance.
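The per-session workload was essentially the following (a minimal sketch; the table name and column are made up for illustration, and the real test ran 50 such sessions concurrently, 10,000 rows each):

-- Each test session runs a loop of single-row inserts, committing after
-- every row with COMMIT WRITE WAIT so that the PL/SQL commit optimization
-- cannot batch the commits
BEGIN
  FOR i IN 1 .. 10000 LOOP
    INSERT INTO t_commit_test (n) VALUES (i);
    COMMIT WRITE WAIT;
  END LOOP;
END;
/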
So, this is log parallelism on:
WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Startup Time    Release     RAC
------------ ----------- ------------ -------- --------------- ----------- ---
xxxxxx         680910931 xxxxxx              1 09-Apr-14 11:01 11.2.0.3.0  NO

Host Name        Platform                         CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
tfru1            Solaris[tm] OE (64-bit)           128    16       2      48.00

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     92128 17-Sep-14 10:00:24        52       1.8
  End Snap:     92129 17-Sep-14 10:06:49        66       1.5
   Elapsed:                6.41 (mins)
   DB Time:               63.12 (mins)

Cache Sizes                       Begin        End
~~~~~~~~~~~                  ---------- ----------
               Buffer Cache:     7,168M     7,168M  Std Block Size:         8K
           Shared Pool Size:     4,608M     4,608M      Log Buffer:   107,096K

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):                9.9                0.0       0.01      12.67
       DB CPU(s):                0.3                0.0       0.00       0.41
       Redo size:        2,064,749.9            1,588.4
   Logical reads:            7,024.3                5.4
   Block changes:            6,389.7                4.9
  Physical reads:                0.1                0.0
 Physical writes:              283.9                0.2
      User calls:                0.8                0.0
          Parses:                4.2                0.0
     Hard parses:                0.0                0.0
W/A MB processed:                0.1                0.0
          Logons:                0.2                0.0
        Executes:            1,310.5                1.0
       Rollbacks:                0.0                0.0
    Transactions:            1,299.9

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
log file sync                       500,050       3,676      7   97.1 Commit
DB CPU                                              123           3.2
library cache: mutex X                  813           0      0     .0 Concurrenc
log file switch (private stran           20           0      8     .0 Configurat
control file sequential read          2,953           0      0     .0 System I/O

...

Background Wait Events              DB/Inst: xxxxxx/xxxxxx  Snaps: 92128-92129
...
                                                             Avg
                                        %Time Total Wait    wait    Waits   % bg
Event                             Waits -outs   Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
log file parallel write          99,963     0        372       4      0.2   90.5
db file parallel write           13,655     0         10       1      0.0    2.5
os thread startup                    15     0          1      65      0.0     .2
control file parallel writ          415     0          1       2      0.0     .2
db file async I/O submit          6,627     0          0       0      0.0     .1
control file sequential re        1,132     0          0       0      0.0     .0
The test inserted 500,000 rows into the test tables from 50 threads, and you can see that the number of log file sync waits is very close to that number (which confirms that the PL/SQL commit optimization is not distorting our test results). The vast majority of DB time is spent on log file sync, so the results are not polluted with irrelevant waits. Now let’s disable log parallelism and run the same test:
WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Startup Time    Release     RAC
------------ ----------- ------------ -------- --------------- ----------- ---
xxxxxx         680910931 xxxxxx              1 17-Sep-14 10:28 11.2.0.3.0  NO

Host Name        Platform                         CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
tfru1            Solaris[tm] OE (64-bit)           128    16       2      48.00

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     92131 17-Sep-14 10:31:31        54       1.8
  End Snap:     92132 17-Sep-14 10:33:29        65       1.6
   Elapsed:                1.96 (mins)
   DB Time:               19.01 (mins)

Cache Sizes                       Begin        End
~~~~~~~~~~~                  ---------- ----------
               Buffer Cache:     7,168M     7,168M  Std Block Size:         8K
           Shared Pool Size:     4,608M     4,608M      Log Buffer:   107,096K

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):                9.7                0.0       0.00       2.76
       DB CPU(s):                1.0                0.0       0.00       0.27
       Redo size:        6,660,802.0            1,565.4
   Logical reads:           26,032.4                6.1
   Block changes:           20,958.9                4.9
  Physical reads:               41.0                0.0
 Physical writes:                0.0                0.0
      User calls:                3.5                0.0
          Parses:               18.3                0.0
     Hard parses:                2.8                0.0
W/A MB processed:                1.9                0.0
          Logons:                0.6                0.0
        Executes:            4,325.6                1.0
       Rollbacks:                0.0                0.0
    Transactions:            4,255.1

...

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
log file sync                       500,055       1,008      2   88.4 Commit
DB CPU                                              113           9.9
library cache: mutex X                1,247           0      0     .0 Concurrenc
log file switch (private stran           10           0     14     .0 Configurat
db file sequential read               2,208           0      0     .0 User I/O

Background Wait Events              DB/Inst: xxxxxx/xxxxxx  Snaps: 92131-92132
...
                                                             Avg
                                        %Time Total Wait    wait    Waits   % bg
Event                             Waits -outs   Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
log file parallel write         101,623     0        107       1      0.2   89.9
os thread startup                    16     0          1      66      0.0     .9
ADR block file write                880     0          1       1      0.0     .5
control file parallel writ          125     0          0       1      0.0     .1
ADR block file read               5,208     0          0       0      0.0     .1
ADR file lock                       822     0          0       0      0.0     .0
control file sequential re          901     0          0       0      0.0     .0
The difference is quite impressive: the test now completes 3 times faster because of the reduction in log file sync time!
Risk factors for redo I/O serialization
The described mechanism is very common, but not universal. Certain platforms and configurations carry more risk of adverse effects from log parallelism on database performance than others:
– raw/ASM storage means less risk (because Oracle controls many low-level I/O details so it can avoid unnecessary locking)
– no write cache, or a saturated write cache, means more risk (writes to cache don’t have to be serialized, while writes to physically close locations in persistent storage do)
– the more CPUs the database server has, the more aggressive log parallelism is by default, and therefore the higher the risk
– broad tails in log file parallel write wait histograms (and, consequently, a high ratio of average log file sync to log file parallel write wait time, around 2.5-3 and higher) are strong indicators of log parallelism being the problem; a rough way to check the ratio is sketched below this list.
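Something like the following gives a quick instance-wide estimate of that ratio (a sketch against V$SYSTEM_EVENT; an AWR- or time-scoped version of the same calculation would be more focused):

-- Ratio of average "log file sync" wait to average "log file parallel write"
-- wait since instance startup; values around 2.5-3+ warrant a closer look
SELECT MAX(CASE WHEN event = 'log file sync'
                THEN time_waited_micro / NULLIF(total_waits, 0) END) /
       MAX(CASE WHEN event = 'log file parallel write'
                THEN time_waited_micro / NULLIF(total_waits, 0) END)
         AS sync_to_write_ratio
FROM   v$system_event
WHERE  event IN ('log file sync', 'log file parallel write');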
Since short transactions with frequent commits are typical of OLTP systems, OLTP systems would generally suffer more from this problem (i.e. a data warehouse is just as likely to have redo I/O serialization as an OLTP system, but the chance that it would present an actual problem is much lower).
If your database is suffering from slow commits (“log file sync”), then the following steps would help you diagnose the problem (you don’t have to do all of them; see the query sketches after this list):
– use ASH to look at statistics for the number of requests (the P3 parameter of log file parallel write) and see whether there is a correlation between elapsed time and the number of requests
– use ASH to check wait statistics for the LGWR session. If it spends much time on CPU, or waiting on redo allocation/copy latches, control file parallel writes etc., then the problem is not the redo I/O itself
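Sketches of both checks against V$ACTIVE_SESSION_HISTORY (TIME_WAITED is in microseconds and is only populated on the sample where the wait completed, so treat the averages as approximate):

-- 1. Does "log file parallel write" get slower as the number of
--    requests (P3) grows?
SELECT p3                                AS requests,
       COUNT(*)                          AS samples,
       ROUND(AVG(time_waited) / 1000, 2) AS avg_wait_ms
FROM   v$active_session_history
WHERE  event = 'log file parallel write'
AND    time_waited > 0
GROUP  BY p3
ORDER  BY p3;

-- 2. Where does LGWR spend its time overall?
SELECT session_state,
       event,
       COUNT(*) AS samples
FROM   v$active_session_history
WHERE  program LIKE '%LGWR%'
GROUP  BY session_state, event
ORDER  BY samples DESC;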
Of course, the ultimate (and simplest) check would be running a test with and without log parallelism, but since this requires SYS-level access and involves changing hidden parameters, it helps to gain some confidence that you’re on the right track before doing it.
Solutions
The simplest solution is to disable log parallelism by setting _log_parallelism_max to 1. This may cause an increase in redo copy/allocation latch waits, so be sure to test it before applying it in production. Also, since this is an undocumented parameter, changing it requires approval from Oracle support, but it should be straightforward to obtain if you open an SR and ask for it explicitly (I was able to obtain the approval within a few hours).
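The change itself looks like this (as noted above, it takes effect only after an instance bounce, so the parameter is set in the spfile):

-- Requires SYSDBA access (and, per the above, a nod from Oracle Support)
ALTER SYSTEM SET "_log_parallelism_max" = 1 SCOPE = SPFILE;
-- then restart the instance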
If for whatever reason you don’t like this solution, you can also try moving the redo logs to a different type of storage, e.g. ASM or raw devices.
Keep in mind that, as always with bottlenecks, removing one can create another somewhere else, so a realistic test on a production-like environment before going to production is important.
Summary
Log parallelism helps reduce latch contention when copying redo to the log buffer. A possible (and rather common) side effect is a dramatic increase in log file sync waits which outweighs the benefits of reduced latching. A few simple checks can tell whether log parallelism is a factor in your log file sync problem, in which case you might be able to reduce log file sync waits by 2x or more by disabling log parallelism with a single parameter change.
Comments

You’ve really achieved significant results in your test. I’d known about log parallelism and redo copy latches, but not their relation to log file sync.
Hi Hemant,
this testing actually gives a lower-bound estimate of the effect, because people try to save money on test databases, i.e. they typically have fewer CPUs than production systems, so log parallelism is less aggressive there by default. When _log_parallelism_max is based on the number of CPUs in production, I get up to a 15x improvement!
It’s really surprising that the effect went unnoticed for so long. Well, it wasn’t completely unnoticed: Tony Hasler reported similar observations a few years ago in his blog: http://tonyhasler.wordpress.com/2011/07/24/log-file-sync-and-log-file-parallel-write-part-2/ But he essentially discarded his own findings by attributing them to an unknown fault in the I/O subsystem.
Nikolay,
It’s not the main topic of the article (which is very good), but could you please explain the following:
“additionally you may also burn some CPU on kernel locking (observable by watching TM_DELTA_CPU_TIME for the LGWR session grow in ASH despite it not being on CPU).”
Thanks,
Naresh
Naresh,
generally a session is either on CPU or in a wait. In the former case, all signs should point to CPU usage: SESSION_STATE shows “ON CPU”, TM_DELTA_CPU_TIME is incrementing, and OS commands (e.g. prstat on Solaris) also show that the OS process for the session is on CPU. In the latter case, none of the signs above should be observed.
However, with the log parallelism issue, the symptoms may be sending “mixed signals”. At least that’s what I saw on a Solaris 10 machine: the LGWR session was not “ON CPU”, but its TM_DELTA_CPU_TIME statistic in ASH kept increasing, and prstat showed signs of high CPU usage as well.
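A quick way to watch for this yourself (a sketch; TM_DELTA_TIME and TM_DELTA_CPU_TIME are standard ASH columns in 11g and later, both in microseconds):

-- Compare LGWR's sampled state with the CPU time accounted to it
SELECT sample_time,
       session_state,
       event,
       tm_delta_time,
       tm_delta_cpu_time
FROM   v$active_session_history
WHERE  program LIKE '%LGWR%'
ORDER  BY sample_time DESC;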
Best regards,
Nikolay
Just for interest’s sake, here are AWR snippets from a test done on Direct NFS to a NetApp filer (in my case, 16 sessions at 100k inserts each, on a 4-core Linux VM on Intel)
parallelism = default
================
Elapsed: 3.27 (mins)
Executes (SQL): 8,124.8 per sec
Total Wait Wait % DB Wait
Event Waits Time (sec) Avg(ms) time Class
------------------------------ ----------- ---------- ---------- ------ --------
log file sync 1,588,849 2546.6 1.60 79.2 Commit
Total Avg
%Time Wait wait Waits % bg
Event Waits -outs Time (s) (ms) /txn time
-------------------------- ----------- ----- -------- ---------- -------- ------
log file parallel write 235,392 0 141 0.60 0.1 37.6
parallelism = 1
================
Elapsed: 2.95 (mins)
Executes (SQL): 9,026.3 per sec
Total Wait Wait % DB Wait
Event Waits Time (sec) Avg(ms) time Class
------------------------------ ----------- ---------- ---------- ------ --------
log file sync 1,599,080 2429.3 1.52 87.6 Commit
Total Avg
%Time Wait wait Waits % bg
Event Waits -outs Time (s) (ms) /txn time
-------------------------- ----------- ----- -------- ---------- -------- ------
log file parallel write 199,623 0 114 0.57 0.1 38.5
So there is a marginal gain, but nothing of the order of your example…unsure whether that’s a feature of Direct NFS or something else.
Cheers,
Connor
Hi Connor,
thanks for sharing your observation. Obviously, in your case the effect was negligible, and it would be interesting to understand why. Could you provide more details about your setup? First of all, what kind of storage are you running on, ASM or a filesystem? If a filesystem, which one? What kind of hardware is it on? Is redo placed on a dedicated disk, or is it sharing disks with other data?
38.5 ms per write is obviously very bad — can you check I/O stats from the OS side, i.e. are you experiencing a lot of queuing?
It would also help to know whether the 38.5 ms average comes from consistently bad I/O figures or from a few outliers, e.g. from the AWR wait event histogram (provided your version of Oracle is 11g or higher).
It sounds like a test setup, so I believe there is no synchronous storage-level replication here?
Also, it would help to know the average number of requests (or better yet, the distribution of requests per one log file write event).
I’d be happy to take a closer look if you email me the details.
Best regards,
Nikolay
Sorry, it’s the formatting that’s distorting things… I’ll summarise :-)
log para = 1
==============
total elapsed = 2.95mins
log file sync, avg wait = 1.52ms
log file parallel write, 199,623 writes, avg = 0.57ms
log para = default
==============
total elapsed = 3.27mins
log file sync, avg wait = 1.60ms
log file parallel write, 235,392 writes, avg = 0.60ms
So not much difference, but still, the no-parallelism run outperforms the default.
Cheers
Connor
Hi Connor,
ok, now it makes more sense. In your case disabling log parallelism cannot solve your log file sync problem, because there is no such problem: 1-2 ms per LFS event is about as good as it gets, so there’s nothing to fix here. Besides, yours is a 4-core system; to observe the effects of log parallelism you generally need more cores/CPUs (because the number of public redo strands is proportional to the number of CPUs).
Best regards,
Nikolay