Bug 13930580 Example

It looks like we hit 13930580 on one of our systems and I wanted to put some example output showing log file sync times going above 10 milliseconds (10,000 microseconds) after the log writer switches to polling.

Here is the output in the lgwr trace file showing the switch to polling:

*** 2013-12-27 01:50:47.374
kcrfw_update_adaptive_sync_mode: post->poll long#=1 sync#=1 
sync=55336 poll=1913
4 rw=5506 rw+=5506 ack=0 min_sleep=19134

*** 2013-12-27 01:50:47.374
Log file sync switching to polling
Current scheduling delay is 1 usec
Current approximate redo synch write rate is 0 per sec

Here is a report showing the number of log file sync waits and their average time in microseconds:

END_INTERVAL_TIME          number of waits ave microseconds
-------------------------- --------------- ----------------
26-DEC-13 03.00.19.333 PM           976539       2279.06861
26-DEC-13 04.00.43.475 PM           681212       2029.32406
26-DEC-13 05.00.07.975 PM           343932       1575.26284
26-DEC-13 06.00.34.772 PM           163911       1850.74354
26-DEC-13 07.01.00.911 PM            73151       3815.28597
26-DEC-13 08.00.24.463 PM            39304       5038.05427
26-DEC-13 09.00.48.970 PM            32122       5677.00557
26-DEC-13 10.00.13.491 PM           472349       2353.95857
26-DEC-13 11.00.40.521 PM            18679       18655.5294
27-DEC-13 12.00.06.276 AM            19618       17046.2287
27-DEC-13 01.00.30.299 AM            18983       5721.99178
27-DEC-13 02.00.54.261 AM            17724       17106.3415
27-DEC-13 03.00.18.282 AM             9088       25342.7271
27-DEC-13 04.00.42.218 AM            14365          12128.7
27-DEC-13 05.00.06.391 AM            16323       12879.8831
27-DEC-13 06.00.31.379 AM            43758       15326.7298
27-DEC-13 07.00.56.027 AM            83819       14796.2851
27-DEC-13 08.00.20.637 AM           168718       13506.4363
27-DEC-13 09.00.47.262 AM           302827       19116.9491
27-DEC-13 10.00.14.014 AM           480347       19358.6655
27-DEC-13 11.00.41.178 AM           512777       15952.2358
27-DEC-13 12.00.08.220 PM           511091       13799.5512
27-DEC-13 01.00.38.131 PM           576341       10183.4347
27-DEC-13 02.00.06.308 PM           524568       10251.1259

Notice how the average wait time goes above 10,000 microseconds consistently once the log writer switches to polling between 1 and 2 am on 12/27/2013. I didn’t show all the output but this long log file sync wait time has continued since the switch.

Also, these long log file sync times don’t correspond to long log file parallel write times. Here are the number and averages of log file parallel write waits for the same time frame:

END_INTERVAL_TIME          number of waits ave microseconds
-------------------------- --------------- ----------------
26-DEC-13 03.00.19.333 PM           902849       1426.66601
26-DEC-13 04.00.43.475 PM           659701       1394.87763
26-DEC-13 05.00.07.975 PM           344245       1294.92401
26-DEC-13 06.00.34.772 PM           166643       1586.64944
26-DEC-13 07.01.00.911 PM            80457       4019.29429
26-DEC-13 08.00.24.463 PM            46409       5580.67827
26-DEC-13 09.00.48.970 PM            69218       5115.20904
26-DEC-13 10.00.13.491 PM           475297       2219.80541
26-DEC-13 11.00.40.521 PM            40943        19405.052
27-DEC-13 12.00.06.276 AM            38835       18160.8073
27-DEC-13 01.00.30.299 AM            24734       6321.38425
27-DEC-13 02.00.54.261 AM            33617       11723.6698
27-DEC-13 03.00.18.282 AM            36469       17485.2614
27-DEC-13 04.00.42.218 AM            19344       6955.27042
27-DEC-13 05.00.06.391 AM            17857       4399.75718
27-DEC-13 06.00.31.379 AM            45098       4923.02763
27-DEC-13 07.00.56.027 AM            83700       3610.39713
27-DEC-13 08.00.20.637 AM           160919       2841.31507
27-DEC-13 09.00.47.262 AM           266405       3523.86855
27-DEC-13 10.00.14.014 AM           384795        3367.5075
27-DEC-13 11.00.41.178 AM           437806       2729.84248
27-DEC-13 12.00.08.220 PM           448261       2442.81012
27-DEC-13 01.00.38.131 PM           511648       1880.74418
27-DEC-13 02.00.06.308 PM           481106       1919.21158

The average I/O time – log file parallel write – is pretty low when the system is active (more than 100,000 waits per hour) – usually less than 4000 microseconds and yet log file sync is always more than 10,000 after the switch to polling. Also, the CPU on the system is consistently less than 30% used so it isn’t a system load issue.

Here are some Oracle support documents related to this issue:

Adaptive Switching Between Log Write Methods can Cause 
'log file sync' Waits (Doc ID 1462942.1)

Waits for "log file sync" with Adaptive Polling vs Post/Wait 
Choice Enabled (Doc ID 1541136.1)

Bug 13930580: LGWR IS BLOCKING SESSIONS

Here is the script I used to get the wait output:

set linesize 32000
set pagesize 1000
set long 2000000000
set longchunksize 1000
set head off;
set verify off;
set termout off;

column u new_value us noprint;
column n new_value ns noprint;

select name n from v$database;
select user u from dual;
set sqlprompt &ns:&us>

set head on
set echo on
set termout on
set trimspool on

UNDEFINE WAITNAME
UNDEFINE MINIMUMWAITS

spool "&ns.&&WAITNAME..log"

column END_INTERVAL_TIME format a26

select sn.END_INTERVAL_TIME,
(after.total_waits-before.total_waits) "number of waits",
(after.time_waited_micro-before.time_waited_micro)/
(after.total_waits-before.total_waits) "ave microseconds",
before.event_name "wait name"
from DBA_HIST_SYSTEM_EVENT before, 
DBA_HIST_SYSTEM_EVENT after,
DBA_HIST_SNAPSHOT sn
where before.event_name='&&WAITNAME' and
after.event_name=before.event_name and
after.snap_id=before.snap_id+1 and
after.instance_number=1 and
before.instance_number=after.instance_number and
after.snap_id=sn.snap_id and
after.instance_number=sn.instance_number and
(after.total_waits-before.total_waits) > &&MINIMUMWAITS
order by after.snap_id;

spool off

I gave log file sync as the WAITNAME and 1 as MINIMUMWAITS for the first output. I changed WAITNAME to log file parallel write for the second one with 1 still for MINIMUMWAITS.

It looks like there is a new feature in 11.2 that was finally turned on by default in 11.2.0.3. The work around is to set a hidden parameter to false to turn off the new feature. Check out the Oracle support docs I listed for details.

– Bobby

p.s. I forgot to mention that when I tested on a test database with and without this new feature the log file sync times were lower with the new feature, as they should be.

With _use_adaptive_log_file_sync=TRUE (feature enabled):

Top 5 Timed Foreground Events

Event	Waits	Time(s)	Avg wait (ms)	% DB time	Wait Class
log file sync	639,598	3,466	5	86.74	Commit
DB CPU		397		9.93
buffer exterminate	683	14	21	0.36	Other
virtual circuit wait	912	12	13	0.31	Network
SQL*Net message to client	1,293,196	7	0	0.17	Network

With _use_adaptive_log_file_sync=FALSE (disabled as in earlier versions of 11.2):

Top 5 Timed Foreground Events

Event	Waits	Time(s)	Avg wait (ms)	% DB time	Wait Class
log file sync	639,644	3,553	6	87.31	Commit
DB CPU		367		9.02
buffer exterminate	1,364	28	21	0.69	Other
buffer busy waits	2,679	15	6	0.37	Concurrency
virtual circuit wait	903	13	15	0.32	Network

With the new feature enabled log file sync was 5 milliseconds instead of 6 without it. So, the new feature does speed up log file sync waits when it is working normally. But, there must be some bug condition where it degrades to greater than 10 millisecond log file syncs.

5 Responses to Bug 13930580 Example

fairlierego says:

January 10, 2014 at 12:39 am

We have had no issues with this feature after applying patches for the below 2 bugs

13707904 LGWR SOMETIMES USES POLLING, SOMETIMES POST/WAIT
14823372 ADAPTIVE LFS PICKS INACCURATE POLLING INTERVAL ON RAC

- Bobby says:
  
  January 10, 2014 at 7:11 am
  
  Thank you for this information. We will look at the patches.
  
Pingback: Bug 13930580 Workaround Effective | Bobby Durrett's DBA Blog
Anonymous says:

June 5, 2014 at 2:29 pm

Interestingly post. Would be interested to know what O/S and CPU were in use on the affected database.

- Bobby says:
  
  June 5, 2014 at 2:38 pm
  
  Thank you for your comment. It was HP-UX 11.31 on Itanium processor – ia64.

Bug 13930580 Example

About Bobby

5 Responses to Bug 13930580 Example

Leave a Reply Cancel reply

Subscribe to Blog via Email

Archives

Bobby Durrett\’s DBA Blog

Orafaq blog aggregator

Bug 13930580 Example

Share this:

About Bobby

5 Responses to Bug 13930580 Example

Leave a Reply Cancel reply

Subscribe to Blog via Email

Archives

Bobby Durrett\’s DBA Blog

Orafaq blog aggregator