Troubleshooting log file sync waits — a simple case study

Yesterday I was asked to give my opinion on the AWR below:

DB Name	DB Id	Instance	Inst num	Release	        RAC	Host
*****	*****	*****   	1	NO	******

	        Snap Id	Snap Time	   Sessions  Cursors/Session
Begin Snap:	66340	6/20/2012 14:30	   117	  9.3
End Snap:	66341	6/20/2012 15:00	   115	  9.1
Elapsed:		30.16 (mins)
DB Time:		35.57 (mins)

Report Summary

Cache Sizes

	                Begin	End
Buffer Cache:	        1,888M	1,888M	Std Block Size:	8K
Shared Pool Size:	1,024M	1,024M	Log Buffer:	3,168K

Load Profile

	                Per Second	Per Transaction
Redo size:	        504,437.43	3,150.59
Logical reads:	        23,547.86	147.07
Block changes:	        1,900.01	11.87
Physical reads:	        1,931.69	12.06
Physical writes:	50.85	        0.32
User calls:	        478.35	        2.99
Parses:	                20.39	        0.13
Hard parses:	        0.05	        0
Sorts:	                8.67	        0.05
Logons:	                0.1	        0
Executes:	        250.29	        1.56
Transactions:	        160.11

% Blocks changed per Read:	8.07	Recursive Call %:	18.31
Rollback per transaction %:	0	Rows per Sort:	178.25


Top 5 Timed Events

Event	                Waits	        Time(s)	Avg Wait(ms)	% Total Call Time	Wait Class
CPU time		                1,882		        88.2
log file sync	        293,178	        1,217	4	        57	                Commit
log file parallel write	290,903	        961	3	        45	                System I/O
db file scattered read	301,788	        45	0	        2.1	                User I/O
db file parallel write	11,839	        31	3	        1.5	                System I/O

Of course the first things that leaps into the eyes is that the sum of top-5 events exceeds 100%, but it happens sometimes. According to a MOS note (whose number I unfortunately didn’t keep — and trying to find it again didn’t succeed) this behavior isn’t even considered as bug, and whatever it is, it was fixed in 11g. So let’s move on. The next “big” event in the report is “CPU time” — but according to OS statistics host CPU usage was about 10% at that period so we can skip that.

The second biggest wait event is the infamous log file sync wait. When determining the source of log file sync waits, it is important to look at another redo-related event, log file parallel write. In this case, the wait time for both events is close, which is suggesting that the main component in log file sync wait was the I/O part (recorded in log file parallel write). On the other hand, 3ms per write doesn’t look particularly slow, so the system would’ve probably been okay were it not for such a high number of commits: 160 per second (or 1 commit per 3 user calls, while Oracle recommends 1 per 30).

The next step is finding the source of the commits, and it can be tricky sometimes. In this case it was simpler: there was an insert statement with the number of executions very close to number of commits (269,378 vs 289,696 respectively), so it was reasonable to assume that commits were following those insert statements. A further insight was provided by the name of the table that was inserted to: it was something like ABCERROR. Now all pieces start to fall into their places: apparently, it was the application error logging that was responsible to such a high commit rate. Apparently, the errors themselves weren’t important enough to be noticed by the application users, but the side effects of those errors were quite noticeable. It will probably take the application team a while to follow up on this analysis so I can’t claim that this is a fact with 100% certainty, but I’m pretty sure that this is the case.

– in 10g, AWR reports can show the sum of top event percentages exceed 100%, and Oracle doesn’t consider this a bug (!)
– when troubleshooting log file sync waits, following stats are useful:
a) log file parallel write (count and avg time)
b) commits (per second and per user call)
c) OS CPU usage (to rule out CPU saturation)
– if you’re lucky, you can find culprit of high commits in “SQL by executions”
– names of tables in top SQL can occasionally provide useful tips about the nature of a problem

Other posts on AWR

3 thoughts on “Troubleshooting log file sync waits — a simple case study”

    1. Hi Ranjith,

      you can find both numbers in the Load Profile section (number of commits is simply the number of transactions corrected for the number of rollbacks — in this case there weren’t any rollbacks, so it was simply the number of transactions)

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s