Yesterday I was asked to give my opinion on the AWR below:
DB Name DB Id Instance Inst num Release RAC Host ***** ***** ***** 1 10.2.0.4.0 NO ****** Snap Id Snap Time Sessions Cursors/Session Begin Snap: 66340 6/20/2012 14:30 117 9.3 End Snap: 66341 6/20/2012 15:00 115 9.1 Elapsed: 30.16 (mins) DB Time: 35.57 (mins) Report Summary Cache Sizes Begin End Buffer Cache: 1,888M 1,888M Std Block Size: 8K Shared Pool Size: 1,024M 1,024M Log Buffer: 3,168K Load Profile Per Second Per Transaction Redo size: 504,437.43 3,150.59 Logical reads: 23,547.86 147.07 Block changes: 1,900.01 11.87 Physical reads: 1,931.69 12.06 Physical writes: 50.85 0.32 User calls: 478.35 2.99 Parses: 20.39 0.13 Hard parses: 0.05 0 Sorts: 8.67 0.05 Logons: 0.1 0 Executes: 250.29 1.56 Transactions: 160.11 % Blocks changed per Read: 8.07 Recursive Call %: 18.31 Rollback per transaction %: 0 Rows per Sort: 178.25 ... Top 5 Timed Events Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class CPU time 1,882 88.2 log file sync 293,178 1,217 4 57 Commit log file parallel write 290,903 961 3 45 System I/O db file scattered read 301,788 45 0 2.1 User I/O db file parallel write 11,839 31 3 1.5 System I/O
Of course the first things that leaps into the eyes is that the sum of top-5 events exceeds 100%, but it happens sometimes. According to a MOS note (whose number I unfortunately didn’t keep — and trying to find it again didn’t succeed) this behavior isn’t even considered as bug, and whatever it is, it was fixed in 11g. So let’s move on. The next “big” event in the report is “CPU time” — but according to OS statistics host CPU usage was about 10% at that period so we can skip that.
The second biggest wait event is the infamous log file sync wait. When determining the source of log file sync waits, it is important to look at another redo-related event, log file parallel write. In this case, the wait time for both events is close, which is suggesting that the main component in log file sync wait was the I/O part (recorded in log file parallel write). On the other hand, 3ms per write doesn’t look particularly slow, so the system would’ve probably been okay were it not for such a high number of commits: 160 per second (or 1 commit per 3 user calls, while Oracle recommends 1 per 30).
The next step is finding the source of the commits, and it can be tricky sometimes. In this case it was simpler: there was an insert statement with the number of executions very close to number of commits (269,378 vs 289,696 respectively), so it was reasonable to assume that commits were following those insert statements. A further insight was provided by the name of the table that was inserted to: it was something like ABCERROR. Now all pieces start to fall into their places: apparently, it was the application error logging that was responsible to such a high commit rate. Apparently, the errors themselves weren’t important enough to be noticed by the application users, but the side effects of those errors were quite noticeable. It will probably take the application team a while to follow up on this analysis so I can’t claim that this is a fact with 100% certainty, but I’m pretty sure that this is the case.
- in 10g, AWR reports can show the sum of top event percentages exceed 100%, and Oracle doesn’t consider this a bug (!)
- when troubleshooting log file sync waits, following stats are useful:
a) log file parallel write (count and avg time)
b) commits (per second and per user call)
c) OS CPU usage (to rule out CPU saturation)
- if you’re lucky, you can find culprit of high commits in “SQL by executions”
- names of tables in top SQL can occasionally provide useful tips about the nature of a problem