This morning a colleague of mine noticed an alert from our plan change monitor and prevented any impact on the users by forcing a SQL’s plan to its normal fast plan before the users could be affected. I want to share the monitor script and describe how we use its output.
I know this is long but I want to paste the SQL text of the monitor script here. Review my comments on each step. I can not remember the details of the script myself even though I wrote it but the comments hopefully will help:
set linesize 160
set pagesize 1000
set echo on
set termout on
set trimspool on
set define off
-- status active
-- this plan isn't the most commonly
-- run plan.
-- average this plan in v$sql > 10 x average of
-- most common plan
-- status active sessions
-- sql_id and plan_hash_value
-- elapsed and executions
-- max elapsed for the sql
-- eliminate pl/sql which has
-- plan hash value of 0
drop table active_sql;
create table active_sql as
select
vs.sql_id,
sq.plan_hash_value,
max(sq.ELAPSED_TIME) elapsed,
max(sq.executions) executions
from
v$session vs,
v$sql sq
where
vs.sql_id=sq.sql_id and
vs.SQL_CHILD_NUMBER=sq.child_number and
vs.status='ACTIVE' and
sq.plan_hash_value <> 0
group by
vs.sql_id,
sq.plan_hash_value;
-- to get the most frequently
-- used plan first get the number
-- of exections by plan
-- for each of the active sqls
drop table plan_executions;
create table plan_executions as
select
ss.sql_id,
ss.plan_hash_value,
sum(ss.executions_delta) total_executions
from
dba_hist_sqlstat ss,
active_sql a
where
a.sql_id=ss.sql_id
group by
ss.sql_id,
ss.plan_hash_value;
-- use the previous table to get
-- the plans that are most frequently
-- used. note that two plans could
-- have the same number of executions
-- but this is unlikely.
drop table most_frequent_executions;
create table most_frequent_executions as
select
pe1.sql_id,
pe1.plan_hash_value
from plan_executions pe1
where
pe1.total_executions =
(select max(pe2.total_executions)
from plan_executions pe2
where
pe1.sql_id=pe2.sql_id);
-- handle special case of two plans with
-- same number of executions.
-- pick one with highest plan value
-- just to eliminate dups.
drop table most_frequent_nodups;
create table most_frequent_nodups as
select
mfe1.sql_id,
mfe1.plan_hash_value
from most_frequent_executions mfe1
where
mfe1.plan_hash_value =
(select max(mfe2.plan_hash_value)
from most_frequent_executions mfe2
where
mfe1.sql_id=mfe2.sql_id);
-- get list of active sql that
-- are not running the most
-- frequently executed plan
drop table not_most_freq;
create table not_most_freq as
select * from active_sql
where
(sql_id,plan_hash_value) not in
(select sql_id,plan_hash_value from most_frequent_nodups);
-- working on this part of the logic:
-- average this plan in v$sql > 10 x average of
-- most common plan
-- get average elapsed of most
-- frequently executed plans
-- add 1 to handle case of 0 executions
drop table avg_elapsed_most_f;
create table avg_elapsed_most_f as
select
ss.sql_id,
ss.plan_hash_value,
sum(ss.ELAPSED_TIME_DELTA)/(sum(ss.executions_delta)+1)
avg_elapsed
from DBA_HIST_SQLSTAT ss,
most_frequent_nodups nd
where
ss.sql_id = nd.sql_id and
ss.plan_hash_value = nd.plan_hash_value
group by
ss.sql_id,
ss.plan_hash_value;
-- get list of the sqls that are running
-- the plan that isn't most frequently
-- executed and has an average elapsed
-- more than 10 times the average of
-- the most frequently executed plan
-- add 1 to executions to prevent
-- divide by zero
drop table more_than_10x;
create table more_than_10x as
select
n.sql_id,
n.plan_hash_value
from
not_most_freq n,
avg_elapsed_most_f m
where
(n.elapsed/(n.executions+1)) > 10 * m.avg_elapsed and
n.sql_id=m.sql_id;
spool planchangemonitor.log
select name db_name from v$database;
-- The listed sql_id and plan_hash_value items correspond to
-- sql statements that have plans that may be
-- inefficient and need to be investigated.
-- The session id and username are included if a
-- session is currently running the sql with the plan.
select
'CHANGED '||'PLAN' flag,
m.sql_id,
m.plan_hash_value,
s.sid,
s.username
from
more_than_10x m,
v$session s,
v$sql q
where
m.sql_id=s.sql_id(+) and
m.plan_hash_value=q.plan_hash_value(+) and
s.sql_id=q.sql_id and
s.SQL_CHILD_NUMBER=q.child_number
order by
m.sql_id,
m.plan_hash_value,
s.sid,
s.username;
spool off
If I remember correctly I think the script looks for sessions running a plan whose current run time is 10 times that of the most frequently executed plan. This script is not perfect. The join to v$sql is not perfect and in some cases you can get duplicates. People could quibble about the approach. Why 10 times the previous run time? I thought about more complex approaches but I just needed to get something in place. But, on one database with a lot of small transactions we have made good use of this script, despite its flaws.
This morning my colleague noticed emails from the script that runs this SQL listing a query whose plan had changed. The output looked like this:
FLAG SQL_ID PLAN_HASH_VALUE SID USERNAME
------------ ------------- --------------- ---------- ---------
CHANGED PLAN 75ufmwrcmsuwz 2484041482 35 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 394 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 395 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 446 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 463 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 464 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 544 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 613 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 631 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 665 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 678 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 738 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 746 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 750 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 752 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 1333 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 1416 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 1573 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 1943 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 1957 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 3038 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 3445 YOURUSER
CHANGED PLAN 75ufmwrcmsuwz 2484041482 3816 YOURUSER
I changed the real user to YOURUSER. This output indicates that 23 sessions were all running the same SQL – sql_id=75ufmwrcmsuwz – and that this SQL was running on a new plan that was causing the SQL to run at least 10 times the normal run time. In fact it was about 30 times as long.
To resolve the issue my colleague used our script to find the history of plans for 75ufmwrcmsuwz.
select ss.sql_id,
2 ss.plan_hash_value,
3 sn.END_INTERVAL_TIME,
4 ss.executions_delta,
5 ELAPSED_TIME_DELTA/(executions_delta*1000) "Elapsed Average ms",
6 CPU_TIME_DELTA/(executions_delta*1000) "CPU Average ms",
7 IOWAIT_DELTA/(executions_delta*1000) "IO Average ms",
8 CLWAIT_DELTA/(executions_delta*1000) "Cluster Average ms",
9 APWAIT_DELTA/(executions_delta*1000) "Application Average ms",
10 CCWAIT_DELTA/(executions_delta*1000) "Concurrency Average ms",
11 BUFFER_GETS_DELTA/executions_delta "Average buffer gets",
12 DISK_READS_DELTA/executions_delta "Average disk reads",
13 ROWS_PROCESSED_DELTA/executions_delta "Average rows processed"
14 from DBA_HIST_SQLSTAT ss,DBA_HIST_SNAPSHOT sn
15 where ss.sql_id = '75ufmwrcmsuwz'
16 and ss.snap_id=sn.snap_id
17 and executions_delta > 0
18 and ss.INSTANCE_NUMBER=sn.INSTANCE_NUMBER
19 order by ss.snap_id,ss.sql_id;
SQL_ID PLAN_HASH_VALUE END_INTERVAL_TIME EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average rows processed
------------- --------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ----------------------
75ufmwrcmsuwz 2297146707 30-AUG-15 09.00.54.472 AM 830 587.207673 49.9638554 541.617188 0 0 0 5234.01928 125.083133 332.66747
75ufmwrcmsuwz 2297146707 30-AUG-15 10.00.46.872 AM 1178 578.205867 49.3972835 532.377174 0 0 0 4870.22326 126.048387 324.50764
75ufmwrcmsuwz 2297146707 30-AUG-15 11.00.39.206 AM 1433 631.484713 49.1486392 585.826676 0 0 0 4624.11305 125.446615 299.57083
75ufmwrcmsuwz 2297146707 30-AUG-15 12.00.32.247 PM 1620 592.593823 49.5987654 546.29731 0 0 0 4744.17284 121.961728 312.735185
75ufmwrcmsuwz 2297146707 30-AUG-15 01.00.20.783 PM 1774 534.412339 51.059752 485.46836 0 0 0 4983.44983 119.564825 326.067644
75ufmwrcmsuwz 2297146707 30-AUG-15 02.00.08.843 PM 1757 447.385105 44.9345475 404.415659 0 0 0 4525.13147 107.277746 293.739328
75ufmwrcmsuwz 2297146707 30-AUG-15 03.00.57.308 PM 1626 431.718507 45.904059 388.200416 0 0 0 4462.93296 118.027675 300.724477
75ufmwrcmsuwz 2297146707 30-AUG-15 04.00.44.209 PM 1080 375.905966 44.212963 334.434835 0 0 0 4766.81574 109.157407 310.712037
75ufmwrcmsuwz 2297146707 30-AUG-15 05.00.36.753 PM 707 368.289475 44.3140028 327.166223 0 0 0 4894.20509 108.050919 315.565771
75ufmwrcmsuwz 2297146707 30-AUG-15 06.00.26.449 PM 529 341.483588 39.6408318 305.47356 0 0 0 4381.19849 96.2646503 288.030246
75ufmwrcmsuwz 2297146707 30-AUG-15 07.00.17.636 PM 356 380.733635 41.5168539 342.034876 0 0 0 4553.4691 105.272472 292.283708
75ufmwrcmsuwz 2297146707 30-AUG-15 08.00.11.170 PM 313 435.449406 37.1565495 402.636489 0 0 0 4144.30351 92.8690096 264.923323
75ufmwrcmsuwz 2297146707 30-AUG-15 09.00.07.894 PM 214 516.455509 44.5794393 477.020692 0 0 0 4567.67757 114.415888 289.607477
75ufmwrcmsuwz 2297146707 30-AUG-15 10.00.59.991 PM 182 720.749681 44.3956044 684.439467 0 0 0 3811.83516 95.2362637 239.027473
75ufmwrcmsuwz 2297146707 30-AUG-15 11.00.47.388 PM 83 1043.1503 43.7349398 1008.41358 0 0 0 3575.96386 114.289157 250.120482
75ufmwrcmsuwz 2484041482 30-AUG-15 11.00.47.388 PM 6 25314.6558 4311.66667 22971.4913 0 0 0 78533.8333 69813.3333 157.833333
75ufmwrcmsuwz 2484041482 31-AUG-15 12.00.36.033 AM 96 25173.7346 5105.20833 21475.9516 0 0 0 135242.802 62433.3125 118.395833
75ufmwrcmsuwz 2484041482 31-AUG-15 01.00.29.070 AM 39 26877.0626 5540.51282 22977.6229 0 0 0 139959.308 68478.1795 93.7179487
75ufmwrcmsuwz 2484041482 31-AUG-15 02.00.18.755 AM 38 26993.419 5998.15789 22768.4285 0 0 0 153843.342 74492 149.342105
75ufmwrcmsuwz 2484041482 31-AUG-15 03.00.09.615 AM 29 25432.5074 4952.06897 22288.7966 0 0 0 112813.552 69803.0345 187.689655
75ufmwrcmsuwz 2484041482 31-AUG-15 04.00.01.749 AM 34 27281.7339 4541.47059 24543.1609 0 0 0 95144.5 69187.3824 135.676471
75ufmwrcmsuwz 2484041482 31-AUG-15 05.00.55.876 AM 146 30512.9976 5421.43836 26984.2559 0 0 0 115531.801 71886.6644 136.321918
75ufmwrcmsuwz 2484041482 31-AUG-15 06.00.50.769 AM 405 24339.6641 4853.40741 20794.0472 0 0 0 115490.01 62004.4642 229.106173
75ufmwrcmsuwz 2484041482 31-AUG-15 07.00.43.047 AM 838 27552.3731 4903.06683 23661.2101 0 0 0 111654.558 58324.9511 176.058473
75ufmwrcmsuwz 2484041482 31-AUG-15 08.00.50.864 AM 1653 30522.8358 4550.56261 26526.2183 0 0 0 93818.418 49865.4701 137.212341
Based on this output my colleague chose 2297146707 as the good plan. She ran coe_xfr_sql_profile.sql which is found in SQLT’s utl directory like this:
coe_xfr_sql_profile.sql 75ufmwrcmsuwz 2297146707
Then she ran the generated script:
coe_xfr_sql_profile_75ufmwrcmsuwz_2297146707.sql
This forced the plan back to its earlier efficient plan and my coworker did all of this early this morning before we reached our peak time of usage and before the bad plan could affect our users.
Bobby
P.S. This example is from an 11.2.0.3 database running on HP-UX Itanium.