It looks like we hit 13930580 on one of our systems and I wanted to put some example output showing log file sync times going above 10 milliseconds (10,000 microseconds) after the log writer switches to polling.
Here is the output in the lgwr trace file showing the switch to polling:
*** 2013-12-27 01:50:47.374 kcrfw_update_adaptive_sync_mode: post->poll long#=1 sync#=1 sync=55336 poll=1913 4 rw=5506 rw+=5506 ack=0 min_sleep=19134 *** 2013-12-27 01:50:47.374 Log file sync switching to polling Current scheduling delay is 1 usec Current approximate redo synch write rate is 0 per sec
Here is a report showing the number of log file sync waits and their average time in microseconds:
END_INTERVAL_TIME number of waits ave microseconds -------------------------- --------------- ---------------- 26-DEC-13 03.00.19.333 PM 976539 2279.06861 26-DEC-13 04.00.43.475 PM 681212 2029.32406 26-DEC-13 05.00.07.975 PM 343932 1575.26284 26-DEC-13 06.00.34.772 PM 163911 1850.74354 26-DEC-13 07.01.00.911 PM 73151 3815.28597 26-DEC-13 08.00.24.463 PM 39304 5038.05427 26-DEC-13 09.00.48.970 PM 32122 5677.00557 26-DEC-13 10.00.13.491 PM 472349 2353.95857 26-DEC-13 11.00.40.521 PM 18679 18655.5294 27-DEC-13 12.00.06.276 AM 19618 17046.2287 27-DEC-13 01.00.30.299 AM 18983 5721.99178 27-DEC-13 02.00.54.261 AM 17724 17106.3415 27-DEC-13 03.00.18.282 AM 9088 25342.7271 27-DEC-13 04.00.42.218 AM 14365 12128.7 27-DEC-13 05.00.06.391 AM 16323 12879.8831 27-DEC-13 06.00.31.379 AM 43758 15326.7298 27-DEC-13 07.00.56.027 AM 83819 14796.2851 27-DEC-13 08.00.20.637 AM 168718 13506.4363 27-DEC-13 09.00.47.262 AM 302827 19116.9491 27-DEC-13 10.00.14.014 AM 480347 19358.6655 27-DEC-13 11.00.41.178 AM 512777 15952.2358 27-DEC-13 12.00.08.220 PM 511091 13799.5512 27-DEC-13 01.00.38.131 PM 576341 10183.4347 27-DEC-13 02.00.06.308 PM 524568 10251.1259
Notice how the average wait time goes above 10,000 microseconds consistently once the log writer switches to polling between 1 and 2 am on 12/27/2013. I didn’t show all the output but this long log file sync wait time has continued since the switch.
Also, these long log file sync times don’t correspond to long log file parallel write times. Here are the number and averages of log file parallel write waits for the same time frame:
END_INTERVAL_TIME number of waits ave microseconds -------------------------- --------------- ---------------- 26-DEC-13 03.00.19.333 PM 902849 1426.66601 26-DEC-13 04.00.43.475 PM 659701 1394.87763 26-DEC-13 05.00.07.975 PM 344245 1294.92401 26-DEC-13 06.00.34.772 PM 166643 1586.64944 26-DEC-13 07.01.00.911 PM 80457 4019.29429 26-DEC-13 08.00.24.463 PM 46409 5580.67827 26-DEC-13 09.00.48.970 PM 69218 5115.20904 26-DEC-13 10.00.13.491 PM 475297 2219.80541 26-DEC-13 11.00.40.521 PM 40943 19405.052 27-DEC-13 12.00.06.276 AM 38835 18160.8073 27-DEC-13 01.00.30.299 AM 24734 6321.38425 27-DEC-13 02.00.54.261 AM 33617 11723.6698 27-DEC-13 03.00.18.282 AM 36469 17485.2614 27-DEC-13 04.00.42.218 AM 19344 6955.27042 27-DEC-13 05.00.06.391 AM 17857 4399.75718 27-DEC-13 06.00.31.379 AM 45098 4923.02763 27-DEC-13 07.00.56.027 AM 83700 3610.39713 27-DEC-13 08.00.20.637 AM 160919 2841.31507 27-DEC-13 09.00.47.262 AM 266405 3523.86855 27-DEC-13 10.00.14.014 AM 384795 3367.5075 27-DEC-13 11.00.41.178 AM 437806 2729.84248 27-DEC-13 12.00.08.220 PM 448261 2442.81012 27-DEC-13 01.00.38.131 PM 511648 1880.74418 27-DEC-13 02.00.06.308 PM 481106 1919.21158
The average I/O time – log file parallel write – is pretty low when the system is active (more than 100,000 waits per hour) – usually less than 4000 microseconds and yet log file sync is always more than 10,000 after the switch to polling. Also, the CPU on the system is consistently less than 30% used so it isn’t a system load issue.
Here are some Oracle support documents related to this issue:
Adaptive Switching Between Log Write Methods can Cause 'log file sync' Waits (Doc ID 1462942.1) Waits for "log file sync" with Adaptive Polling vs Post/Wait Choice Enabled (Doc ID 1541136.1) Bug 13930580: LGWR IS BLOCKING SESSIONS
Here is the script I used to get the wait output:
set linesize 32000 set pagesize 1000 set long 2000000000 set longchunksize 1000 set head off; set verify off; set termout off; column u new_value us noprint; column n new_value ns noprint; select name n from v$database; select user u from dual; set sqlprompt &ns:&us> set head on set echo on set termout on set trimspool on UNDEFINE WAITNAME UNDEFINE MINIMUMWAITS spool "&ns.&&WAITNAME..log" column END_INTERVAL_TIME format a26 select sn.END_INTERVAL_TIME, (after.total_waits-before.total_waits) "number of waits", (after.time_waited_micro-before.time_waited_micro)/ (after.total_waits-before.total_waits) "ave microseconds", before.event_name "wait name" from DBA_HIST_SYSTEM_EVENT before, DBA_HIST_SYSTEM_EVENT after, DBA_HIST_SNAPSHOT sn where before.event_name='&&WAITNAME' and after.event_name=before.event_name and after.snap_id=before.snap_id+1 and after.instance_number=1 and before.instance_number=after.instance_number and after.snap_id=sn.snap_id and after.instance_number=sn.instance_number and (after.total_waits-before.total_waits) > &&MINIMUMWAITS order by after.snap_id; spool off
I gave log file sync as the WAITNAME and 1 as MINIMUMWAITS for the first output. I changed WAITNAME to log file parallel write for the second one with 1 still for MINIMUMWAITS.
It looks like there is a new feature in 11.2 that was finally turned on by default in 11.2.0.3. The work around is to set a hidden parameter to false to turn off the new feature. Check out the Oracle support docs I listed for details.
– Bobby
p.s. I forgot to mention that when I tested on a test database with and without this new feature the log file sync times were lower with the new feature, as they should be.
With _use_adaptive_log_file_sync=TRUE (feature enabled):
Top 5 Timed Foreground Events
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
---|---|---|---|---|---|
log file sync | 639,598 | 3,466 | 5 | 86.74 | Commit |
DB CPU | 397 | 9.93 | |||
buffer exterminate | 683 | 14 | 21 | 0.36 | Other |
virtual circuit wait | 912 | 12 | 13 | 0.31 | Network |
SQL*Net message to client | 1,293,196 | 7 | 0 | 0.17 | Network |
With _use_adaptive_log_file_sync=FALSE (disabled as in earlier versions of 11.2):
Top 5 Timed Foreground Events
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
---|---|---|---|---|---|
log file sync | 639,644 | 3,553 | 6 | 87.31 | Commit |
DB CPU | 367 | 9.02 | |||
buffer exterminate | 1,364 | 28 | 21 | 0.69 | Other |
buffer busy waits | 2,679 | 15 | 6 | 0.37 | Concurrency |
virtual circuit wait | 903 | 13 | 15 | 0.32 | Network |
With the new feature enabled log file sync was 5 milliseconds instead of 6 without it. So, the new feature does speed up log file sync waits when it is working normally. But, there must be some bug condition where it degrades to greater than 10 millisecond log file syncs.
We have had no issues with this feature after applying patches for the below 2 bugs
13707904 LGWR SOMETIMES USES POLLING, SOMETIMES POST/WAIT
14823372 ADAPTIVE LFS PICKS INACCURATE POLLING INTERVAL ON RAC
Thank you for this information. We will look at the patches.
Pingback: Bug 13930580 Workaround Effective | Bobby Durrett's DBA Blog
Interestingly post. Would be interested to know what O/S and CPU were in use on the affected database.
Thank you for your comment. It was HP-UX 11.31 on Itanium processor – ia64.