I had a five-minute conversation with Oracle development Friday that rocked my world. I found out that parsing blocks stats which blocks parsing.
We have a system with queries that are taking minutes to parse. These queries include the main tables on our database, one of which is interval partitioned and has 20,000 sub-partitions. We have seen a situation where Oracle’s delivered statistics gathering job hangs on a library cache lock waiting for a query to finish parsing. But, much worse than that, we find most queries on our system hanging on library cache lock waits blocked by the statistics job.
We have an SR open on this situation because it seemed like a bug, but on Friday someone from Oracle development explained over the phone that this parse blocks stats blocks parse situation is normal. After I got off the phone I built a simple test case proving that what he said was true. I took a query that takes a long time to parse in production and ran it on our development database, where it took 16 seconds to parse. I chose the smallest table that the query included and gathered stats on it. The stats ran in a fraction of a second when run by themselves, but if I started the long-parsing query in one window and ran the stats in another window, the stats hung on a library cache lock wait for 15 seconds. Then I created a trivial query against the same small table I had gathered stats on. The query ran instantly by itself. But if I ran the long-parsing query first, kicked off the stats (which hung on the lock), and then kicked off the short query against the table whose stats were being gathered, the short query hung on a library cache lock as well. This example convinced me that the parse blocks stats blocks parse chain is real.
This morning I built a standalone test case that others can run to reproduce this on their own databases: zip of testcase. To run the test case you need three windows where you can run three sqlplus scripts in rapid succession. In one window, first run tables.sql to create the test tables. Then run these three scripts one after the other, one per window, to create the three-link chain: chain1.sql, chain2.sql, chain3.sql. Chain1.sql runs an explain plan of a query that takes a long time to parse. Chain2.sql gathers stats on one table. Chain3.sql runs a simple query against the table whose stats are being gathered. Chain1 spends all of its time on CPU doing the parse. Chains 2 and 3 spend all of their time on library cache lock waits.
First I created two tables:
create table t1 as select * from dba_tables;
create table t2 as select * from dba_tables;
Next I kicked off the explain plan that takes a long time to run. It joined 100 tables together:
explain plan into plan_table for
select count(*)
from t1, t2, t2 t3, ... t2 t100
where t1.owner=t2.owner and ... t1.owner=t100.owner
/
This explain plan ran for 26 seconds, almost all of which was CPU:
Elapsed: 00:00:26.90
...
CPU in seconds
--------------
         26.81
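Writing out the 100-way join by hand is tedious. As a sketch (assuming the t1 and t2 tables created above), a PL/SQL block can build and run the same long-parsing explain plan dynamically:

```sql
-- Sketch: build the 100-table join and parse it with explain plan.
-- Assumes t1 and t2 exist (created from dba_tables above).
declare
  l_sql varchar2(32767);
begin
  l_sql := 'explain plan into plan_table for select count(*) from t1';
  -- append t2 ninety-nine times under aliases t2 .. t100
  for i in 2 .. 100 loop
    l_sql := l_sql || ', t2 t' || i;
  end loop;
  -- join every copy back to t1 on owner
  for i in 2 .. 100 loop
    l_sql := l_sql || case when i = 2 then ' where ' else ' and ' end ||
             't1.owner = t' || i || '.owner';
  end loop;
  execute immediate l_sql;  -- this is where the long parse happens
end;
/
```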
Right after I kicked off the explain plan I kicked off this statement which gathered stats on the first table in the from clause:
execute dbms_stats.gather_table_stats(NULL,'T1');
This ran for 25 seconds and almost all of the time was spent on a library cache lock wait:
Elapsed: 00:00:25.77
...
Library cache lock in seconds
-----------------------------
                        25.55
Right after I kicked off the gather table stats command I ran this simple query making sure that it was unique and required a hard parse:
select /* comment to force hard parse */ count(*) from T1;
This ran for 24 seconds and almost all of the time was spent on a library cache lock wait:
Elapsed: 00:00:24.48
...
Library cache lock in seconds
-----------------------------
                        24.48
Evidently, when a session parses a query it needs to obtain a shared library cache lock on every table the query references. Gathering statistics on a table requires an exclusive lock on the table, even if you are gathering statistics on just one partition or sub-partition of it. While the statistics gathering session waits to acquire the exclusive lock, any new parse that references the same table hangs behind it.
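While the chain is active it should be visible in v$session. For the library cache lock wait event, P1 is the address of the library cache handle, so all sessions stuck behind the same table show the same P1RAW value. A sketch (exact columns vary a bit by version):

```sql
-- Sessions currently waiting on library cache locks.
-- P1RAW is the library cache handle address of the locked object.
-- blocking_session may be null for this wait on some versions.
select sid, event, p1raw handle_address,
       seconds_in_wait, blocking_session
from v$session
where event = 'library cache lock'
order by seconds_in_wait desc;
```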
Prior to Friday I did not think that there was any non-bug situation where gathering optimizer statistics would lock up sessions. I thought that the only negative to gathering statistics at the same time as other application processing was that statistics gathering would compete for system resources such as CPU and I/O and possibly slow down application code. But, now I know that gathering statistics can hang all queries that use the given table if stats gathering gets hung up waiting for a query that takes a long time to parse.
– Bobby
P.S. After reviewing the SR I wanted to understand what this parse blocks stats blocks parse chain looked like in a state dump. The Oracle support analyst explained how the locks looked in a state dump that we uploaded, but I didn’t get a chance to look at it closely until today. I found the most important information in lines starting with the string “LibraryObjectLock” (after some leading spaces or tabs). There were three types of lines: the held share lock, the waiting exclusive lock, and the many waiting share locks:
LibraryObjectLock:  Address=... Handle=0x5196c8908 Mode=S ...
LibraryObjectLock:  Address=... Handle=0x5196c8908 RequestMode=X ...
LibraryObjectLock:  Address=... Handle=0x5196c8908 RequestMode=S ...
The “…” indicates places I edited out other details. The handle 0x5196c8908 identifies the table being locked. The “Mode=S” string indicates a successful share lock of that table by the session with the long parse time. The “RequestMode=X” was from the stats job trying to get exclusive access to the table. The “RequestMode=S” was all the other sessions trying to get shared access to the table waiting for stats to first get exclusive access. Anyway, I just wanted to translate what Oracle support told me into something that might be useful to others. Plus I want to remember it myself!
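For anyone who wants to produce a similar state dump on their own system, one way is a systemstate dump from a SYSDBA session (level 266 adds short stacks; these dumps can be large and briefly expensive, so use them with care):

```sql
-- As SYSDBA; writes a systemstate dump to the diagnostic trace directory
oradebug setmypid
oradebug unlimit
oradebug dump systemstate 266
oradebug tracefile_name   -- shows where the dump was written
```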
Interesting findings, Bob. Thanks for sharing. I am going through some material on tuning the library cache, and it is directly related to how gathering statistics interacts with hard parsing. This is a very interesting topic.
Thanks Luis. We have been struggling with library cache lock waits on our Exadata system for a long time. We just finished upgrading our production Exadata databases to 11.2.0.4, but we are still having issues. Some of the library cache lock waits appear to be normal and not due to a bug, and I find that surprising.
I hit the same issue on our 11gR2 cluster with a very large table (18B rows). It might be worth trying to turn dynamic sampling off [optimizer_dynamic_sampling=0].
Vitaliy,
Thank you for your comment. We have seen issues with dynamic sampling causing long parse times in the past, but this case is different. I did a quick test setting optimizer_dynamic_sampling=0 and it did not reduce the parse time. I think we are hitting a bug related to a range query on the partitioning column. In a problem query, if I have partitioning column p and a range like p >= to_date('2014-11-07','YYYY-MM-DD') then parsing takes 4 minutes. If I add a condition like p <= to_date('2014-11-26','YYYY-MM-DD') parsing takes 1 second. It looks like the bug relates to an open-ended range. - Bobby
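For illustration, the two predicate shapes compared above look like this (the table name is made up, and p stands in for the interval partitioning column):

```sql
-- Open-ended range on the partitioning column: parse took ~4 minutes
select count(*) from some_partitioned_table
where p >= to_date('2014-11-07','YYYY-MM-DD');

-- Closed range: parse took ~1 second
select count(*) from some_partitioned_table
where p >= to_date('2014-11-07','YYYY-MM-DD')
  and p <= to_date('2014-11-26','YYYY-MM-DD');
```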
Just did the test in 10.2.0.5.0. Thanks for sharing. Very interesting, Bob.
Chain1.sql
————-
Explained.
Elapsed: 00:00:15.23
CPU in seconds
————–
14,83
Elapsed: 00:00:00.01
Chain2.sql
————-
SQL> execute dbms_stats.gather_table_stats(NULL,'T1');
PL/SQL procedure successfully completed.
Elapsed: 00:00:12.73
Library cache lock in seconds
—————————–
12,22
Chain3.sql
————-
COUNT(*)
———-
2505
Elapsed: 00:00:10.26
Library cache lock in seconds
—————————–
10,02
Thanks for your comment. It’s good to see the results duplicated on another system.
Pingback: Long parse time with interval partitioning and subpartitions | Bobby Durrett's DBA Blog
Pingback: Keeping GATHER_SCHEMA_STATS() under control with MAX_RUN_DURATION | Paul Stuart's Oracle Blog