Shared servers prevent web outage

This weekend we had the most convincing evidence yet that our change from dedicated to shared servers on a database that supports a farm of web servers was the right move.  We have had weekend outages caused by a sudden burst of web server-generated database activity.  In the past the CPU load would spike, log file sync (commit) waits would get 20 times slower, and we would have to bounce the database and web servers to recover.  Sunday we had a similar spike in database activity without any sort of outage.

Here is a section of the AWR report from the last weekend outage:

Top 5 Timed Events

Event                        Waits  Time(s)  Avg Wait(ms)  % Total Call Time  Wait Class
log file sync              636,948  157,447           247                9.3  Commit
latch: library cache        81,521   98,589         1,209                5.8  Concurrency
latch: library cache pin    39,580   73,409         1,855                4.3  Concurrency
latch free                  42,929   45,043         1,049                2.7  Other
latch: session allocation   32,766   42,227         1,289                2.5  Other

Here is the same part of the AWR report for this weekend’s spike in activity:

Top 5 Timed Events

Event                      Waits  Time(s)  Avg Wait(ms)  % Total Call Time  Wait Class
log file sync            630,867    6,802            11               43.1  Commit
CPU time                            5,221                             33.1
db file sequential read  604,450    4,498             7               28.5  User I/O
db file parallel write   213,913    3,661            17               23.2  System I/O
log file parallel write  522,021    1,168             2                7.4  System I/O

These are hour-long intervals, in both cases between 9 and 10 am Central on the first Sunday of a month (June and August).  The key is that in both cases there were around 600,000 commits in that hour.  During the outage in June the commits took 247 milliseconds each, a quarter of a second.  This Sunday they took only 11 milliseconds.  Note that in both cases the disk I/O for commits – log file parallel write – was only 2 milliseconds.  So the difference was CPU, and primarily queuing for the CPU: with dedicated servers we had roughly 20 times as much CPU queuing (247 ms / 11 ms).  Note that during our normal peak processing log file sync is 3 milliseconds, so even the 11 milliseconds we saw this weekend represents some queuing.

The key to this is that we have the number of shared server processes set to twice the number of CPUs.  When I say “CPUs” I mean from the Unix perspective.  They are probably cores, but Unix thinks we have 16 CPUs, so we have 32 shared server processes.  This prevents CPU queuing because even if all 32 shared servers are running full out they probably won't max out the CPU, since they will be doing things besides CPU some of the time.  The ideal number may not be 2x the CPUs.  It may be 1.5 or 2.3, but the point is that there is some number of shared server processes that under overwhelming load will keep the CPU busy without allowing a huge queue for the CPU.  Two times the number of CPUs is a good starting point, and this is what Tom Kyte recommended in my ten minute conversation with him that spawned this change to shared servers.
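
For what it's worth, here is a minimal sketch of that starting-point arithmetic as a query against v$parameter (the view and the cpu_count parameter are standard; the 2x multiplier is just the rule of thumb above, not a hard rule):

-- starting point only: twice the CPU count that Oracle sees on the server
select to_number(value) * 2 suggested_shared_servers
from v$parameter
where name = 'cpu_count';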

With dedicated servers we would have hundreds of processes that could easily max out the CPU, and then the log writer process (LGWR) would have to split time on the CPU equally with the hundreds of active dedicated server processes.  I think what was really happening with dedicated servers is that hundreds of sessions were hung up waiting on commits, the session pools from the web servers started spawning new connections which themselves ate up CPU, and a downward spiral would occur that we could not recover from.  With shared servers the commits remained efficient and the web servers didn't need to spawn so many new connections because they weren't hung up waiting on commits.

If you are supporting a database that has a lot of web server connections doing a lot of commits, you might want to consider shared servers as an option to prevent the log writer from being starved for CPU.

Here are my previous posts related to this issue for reference:

 https://www.bobbydurrettdba.com/2013/07/19/shared-servers-results-in-lower-cpu-usage-for-top-query-in-awr/

https://www.bobbydurrettdba.com/2013/06/26/testing-maximum-number-of-oracle-sessions-supported-by-shared-servers/

https://www.bobbydurrettdba.com/2012/08/30/faster-commit-time-with-shared-servers/

https://www.bobbydurrettdba.com/2012/03/21/reducing-size-of-connection-pool-to-improve-web-application-performance/

It may be tough to convince people to move to shared servers since it isn't a commonly used feature of the Oracle database, but in the case of hundreds of sessions doing lots of commits it makes sense as a way of keeping the commit process efficient.

– Bobby

P.S.  Here are our production parameters related to the shared servers change, with the IP address removed.  We had to bump up the large pool and set local_listener in addition to setting the shared_servers and dispatchers parameters.  I added newlines to the dispatchers and local_listener parameter values to fit on this page.

NAME                                 VALUE
------------------------------------ -------------------
max_shared_servers                   32
shared_servers                       32
dispatchers                          (PROTOCOL=TCP)
                                     (DISPATCHERS=64)
max_dispatchers                      
local_listener                       (ADDRESS=
                                     (PROTOCOL=TCP)
                                     (HOST=EDITEDOUT)
                                     (PORT=1521))
large_pool_size                      2G
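
For anyone trying this, here is a hedged sketch of how parameters like these could be set with ALTER SYSTEM.  This is not our actual change script, just an illustration using the same values, with the host edited out as above:

alter system set shared_servers=32 scope=both;
alter system set dispatchers='(PROTOCOL=TCP)(DISPATCHERS=64)' scope=both;
alter system set local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=EDITEDOUT)(PORT=1521))' scope=both;
-- spfile only so these take effect at the next restart
alter system set max_shared_servers=32 scope=spfile;
alter system set large_pool_size=2G scope=spfile;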

P.P.S.  This server is on HP-UX 11.11 and Oracle 10.2.0.3.


Proc to run a long select statement

I'm trying to test some select statements that have lines longer than 4000 characters, and I couldn't get them to run in sqlplus, so I built this proc to run a select statement that is stored in a CLOB and return the number of rows fetched and the elapsed time in seconds.

CREATE OR REPLACE procedure runclob(
       sqlclob  in clob,
       total_rows_fetched out NUMBER,
       elapsed_time_seconds out NUMBER) is
    clob_cursor INTEGER;
    rows_fetched INTEGER;
    before_date date;
    after_date date;
BEGIN
    select sysdate into before_date from dual;

    -- parse the select statement straight from the CLOB
    clob_cursor := DBMS_SQL.OPEN_CURSOR;

    DBMS_SQL.PARSE (clob_cursor,sqlclob,DBMS_SQL.NATIVE);

    -- execute and get the first batch of rows
    rows_fetched := DBMS_SQL.EXECUTE_AND_FETCH (clob_cursor);
    total_rows_fetched := rows_fetched;

    -- keep fetching until no more rows come back
    LOOP
        EXIT WHEN rows_fetched < 1;
        rows_fetched := DBMS_SQL.FETCH_ROWS (clob_cursor);
        total_rows_fetched := total_rows_fetched + rows_fetched;
    END LOOP;

    DBMS_SQL.CLOSE_CURSOR (clob_cursor);

    select sysdate into after_date from dual;

    -- date subtraction gives days; convert to seconds
    elapsed_time_seconds := (after_date-before_date)*24*3600;
END;
/

The queries I'm working on are generated by OBIEE.  I'm testing them with two different optimizer statistics choices and want to see which choice makes the group of queries run faster.  I will have a block of code that loops through a collection of select statements extracted from production and runs each one in my test database with the proc listed above.
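
For example, here is a hedged sketch of calling the proc from an anonymous block.  The select against dba_objects is just a stand-in for one of the long OBIEE queries:

set serveroutput on

declare
    my_sql clob := to_clob('select owner, object_name from dba_objects');
    total_rows number;
    seconds number;
begin
    runclob(my_sql, total_rows, seconds);
    dbms_output.put_line('rows fetched = '||total_rows);
    dbms_output.put_line('elapsed seconds = '||seconds);
end;
/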

– Bobby



Dynamic sampling hint better than multi-column histogram

One of our senior developers found a way to greatly improve the performance of a query by adding a dynamic sampling hint for a table that had multiple correlated conditions in its where clause.  This led me to try to understand why the dynamic sampling hint helped, and it appears that a dynamic sampling hint can deal with correlated predicates in the where clause better than multi-column histograms can.

A quick Google search on the dynamic sampling hint turned up this article by Tom Kyte.  The section titled “When the Optimizer Guesses” seemed to match the symptoms I saw in the query our developer was tuning.  So, I started messing with the test case from the article and found that multi-column histograms would not get the test case to run the efficient plan with the index scan, but dynamic sampling, even at level 1, would.
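
For reference, the test table has two perfectly correlated flag columns and an index on both of them, roughly along these lines (a hedged sketch, not a verbatim copy of the article's test case):

create table t as
select decode(mod(rownum,2),0,'N','Y') flag1,
       decode(mod(rownum,2),0,'Y','N') flag2,
       a.*
from all_objects a;

create index t_idx on t(flag1,flag2);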

I used the following commands to gather stats:

select DBMS_STATS.CREATE_EXTENDED_STATS (NULL,'T','(FLAG1,FLAG2)') from dual;
execute DBMS_STATS.GATHER_TABLE_STATS (NULL,'T',estimate_percent=>NULL,method_opt=>'FOR ALL COLUMNS SIZE 254');
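
If you want to double check that the column group was created, a minimal sketch of a query against USER_STAT_EXTENSIONS will show it:

select extension_name, extension
from user_stat_extensions
where table_name = 'T';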

Here is the plan without the dynamic sampling hint:

Execution Plan
----------------------------------------------------------
Plan hash value: 2153619298

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      | 37371 |  3941K|   580   (1)| 00:00:07 |
|*  1 |  TABLE ACCESS FULL| T    | 37371 |  3941K|   580   (1)| 00:00:07 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("FLAG1"='N' AND "FLAG2"='N')

Here is the plan with this hint: dynamic_sampling(t 1)

Execution Plan
----------------------------------------------------------
Plan hash value: 1020776977

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |    54 |  5832 |     3   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T     |    54 |  5832 |     3   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T_IDX |    54 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("FLAG1"='N' AND "FLAG2"='N')

Note
-----
   - dynamic sampling used for this statement (level=2)

If you look at a 10053 trace you see that dynamic sampling is running this query:

SELECT
...bunch of hints edited out...
NVL (SUM (C1), 0), NVL (SUM (C2), 0), NVL (SUM (C3), 0)
  FROM (SELECT /*+ ...more hints... */
              1 AS C1,
               CASE
                  WHEN "T"."FLAG1" = 'N' AND "T"."FLAG2" = 'N' 
                  THEN 1
                  ELSE 0
               END
                  AS C2,
               CASE
                  WHEN "T"."FLAG2" = 'N' AND "T"."FLAG1" = 'N' 
                  THEN 1
                  ELSE 0
               END
                  AS C3
          FROM "MYUSERID"."T" SAMPLE BLOCK (0.519350, 1) 
               SEED (1) "T") SAMPLESUB

I think this is the same query that is in the article, but in the 10053 trace it is easier to see the constants "T"."FLAG2" = 'N' AND "T"."FLAG1" = 'N'.  So, it shows that dynamic sampling runs a query that includes the actual where clause conditions.  In the 10053 trace for our production query the dynamic sampling query came back with all of the conditions on the table that we hinted.

Note that if you leave stats off of the table you get the same good plan:

Execution Plan
----------------------------------------------------------
Plan hash value: 1020776977

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |     1 |   162 |     1   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T     |     1 |   162 |     1   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T_IDX |     1 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("FLAG1"='N' AND "FLAG2"='N')

Note
-----
   - dynamic sampling used for this statement (level=2)

So, with no stats on the table it must be doing the same dynamic sampling query with all the where clause conditions.

Interestingly, the table in the real query is very small.  I'm thinking we should just delete the stats from the table and lock them empty.  That way every query against the table will dynamically sample the small table and handle correlated predicates better than multi-column stats would.
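
Here is a hedged sketch of what that would look like, using the test table T as a stand-in for the real production table:

-- delete the stats and lock them so the automatic stats job leaves them empty,
-- which makes the optimizer dynamically sample the table for every query
execute DBMS_STATS.DELETE_TABLE_STATS(NULL,'T');
execute DBMS_STATS.LOCK_TABLE_STATS(NULL,'T');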

Here is my test script and its log.

– Bobby


putty the publisher could not be verified Windows 7

So, this isn't really Oracle related, except that I got a new Windows 7 laptop to support Oracle and I'm getting an annoying popup message every time I run putty to access a database server.

[Screenshot: putty popup with the "The publisher could not be verified" warning]

I looked up the message “The publisher could not be verified” along with Windows 7 and found a bunch of stuff about changing Windows policies or internet security settings – yuck!  I found a simpler solution.

I have putty.exe installed in c:\putty on my laptop.  I added a bat file called putty.bat in c:\putty that has this line in it:

start /b c:\putty\putty.exe

Then I created a shortcut on my desktop to run c:\putty\putty.bat but to start the command prompt window minimized:

[Screenshot: shortcut settings for c:\putty\putty.bat with the window set to start minimized]

Now when I click on this shortcut putty pops right up without the “publisher could not be verified” message.

– Bobby



Finally up on Oracle 12c

So, I had to get a new work laptop to install Oracle 12c.  I was on Windows XP, and even though my company would have paid for a new laptop a long time ago I was slow to give up my old one because of the time it would take to reconfigure it.  But, because 12c wasn't out for 32-bit Windows I had to bite the bullet and request a new laptop.  It is really nice.  It is an HP EliteBook 8470P with a 128 gig solid state disk and 4 gig of RAM.  The SSD is so fast it is sweet.  But, I don't have a ton of RAM to build virtual environments in.  The laptop supports up to 16 gig on Windows 7 64-bit, so I may end up upgrading the memory.  Also, I didn't want to eat up space on my SSD for a test environment, so I used a 500 gig external drive connected to the laptop over USB to store the virtual machine and install files.

I’m not going to give all the gory details because it would just be a repeat of what other people have published.  I’ll just give the highlights.

I'm using VMware player 5.02, Oracle Enterprise Linux 6.4, and the 12c 64-bit Linux install zips.  I set up a VM with 2 gig of RAM, 2 CPU cores, and 100 gig of disk space.  I used the default NAT network configuration.

After installing OEL I found that I couldn’t connect to the new VM from Windows over the network.  Apparently the network is not started by default and I had to check a box in the Linux network configuration while doing the install or the network would not come up automatically when the OS boots up.  I found this web site after the fact that documents this issue.

The other surprise was that the way I installed Linux the OS was not configured with a graphical console.  I chose “database server” as the install type.  I guess I'm surprised that they wouldn't install X Windows on a database server by default.  So, to do the Oracle software install I ended up using my Cygwin X Windows server to run the installer, just as I would for a remote install on one of my real work servers where I don't have access to the graphical console.

I found this command in the Linux installation guide for 12c:

yum install oracle-rdbms-server-12cR1-preinstall

This was very nice.  It installed all of the needed packages, downloading them over the internet, and created the needed oracle user and oinstall group.  I think it also made the needed kernel parameter changes.  It was definitely a time saver compared to doing all this manually as I have in the past.

Even though I really didn’t have a /u01 filesystem – I just let it create the default / and /home filesystems – I still put everything under /u01 to make it look like a default install:

mkdir -p /u01/app/oracle
chown -R oracle:oinstall /u01/app/oracle
chmod -R 775 /u01/app/oracle

So, ORACLE_BASE=/u01/app/oracle

Next I got this error running the installer:

PRVF-0002 Could not retrieve local nodename

I found a web site (the link is no longer available) which recommended adding the hostname to the /etc/hosts file.  Here is what I ended up adding:

192.168.133.128   ora12c ora12c.usfoods.com

I had set the host name and domain to ora12c.usfoods.com during the install, and ultimately I found that I needed this hosts entry to get the installer to work.  Also, I found the IP address for the virtual machine by running ifconfig.

Lastly, I found that I had to disable the firewall to be able to access the database from the Windows host using sqlplus.  It is interesting that even though I chose “database server” and did the yum install of the preinstall package, the firewall was still up blocking port 1521 into the machine.  I ran these two commands as root to turn off the firewall:

service iptables stop
chkconfig iptables off
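
An alternative I have not tested on this VM would be to leave iptables running and just open port 1521, along the lines of this hedged sketch:

iptables -I INPUT -p tcp --dport 1521 -j ACCEPT
service iptables save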

Now, after backing up my newly configured VM I can get into my new 12c DB from the Windows host:

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing opt
ions

SQL>

Victory!  Of course, the very first thing I attempt to do, something I would normally do under 11g, doesn't work:

SQL> create user abc123 identified by a9s9d9d9;
create user abc123 identified by a9s9d9d9
            *
ERROR at line 1:
ORA-65096: invalid common user or role name
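
From a quick look at the error, it appears to come from the new multitenant architecture: I must have created a container database, and by default users created in the root container have to be “common” users whose names start with C##.  Here is a hedged sketch of the two obvious workarounds; the pluggable database name PDBORCL is just an example, whatever name the installer created would go there:

-- option 1: create a common user in the root container
create user c##abc123 identified by a9s9d9d9;

-- option 2: switch to a pluggable database and create a normal local user
alter session set container = PDBORCL;
create user abc123 identified by a9s9d9d9;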

So, I’ve got 12c, but now I still have to figure out how to use it!

– Bobby

P.S. Check out this 12c install web page (the original link is no longer available).  It looks pretty comprehensive.


List tables that stats job doesn’t finish

We are running the Oracle-delivered automatic optimizer stats job on some of our 11.2 databases, and the job is not completing for some of our tables.  The list of these tables is buried in a trace file on the database server on some random RAC node.  I built a script to figure out which node and file the trace is in and edit the output down to a list of the tables that stats didn't finish on.  The output looks like this:

Job log start time: 2013-07-17 09:00:07
MYOWNER1.MYTABLE1
MYOWNER2.MYTABLE2
...

The outer script calls a SQL script that builds a list of Unix commands to search through the logs for the table names.

statlogs.sh

rm statlogstmp.sh
sqlplus -s /nolog < statlogs.sql > statlogstmp.sh
chmod 744 statlogstmp.sh
./statlogstmp.sh

statlogs.sql

connect YOURUSERNAME/YOURPASSWORD
set linesize 32000
set pagesize 1000
set echo off
set termout off
set trimspool on
set serveroutput on size 1000000
set feedback off
set heading off

-- This script builds unix commands to find tables 
-- whose stats did not complete during the automatic stats job runs
-- over the past 7 days.

DECLARE 
    CURSOR JRD_CURSOR IS 
    select
      INSTANCE_ID,
      SLAVE_PID,
      cast(ACTUAL_START_DATE as date) start_date_time,
      cast(ACTUAL_START_DATE+RUN_DURATION as date) end_date_time
    from
      DBA_SCHEDULER_JOB_RUN_DETAILS
    where
      JOB_NAME like 'ORA$AT_OS_OPT_SY%' and
      log_date > sysdate - 7
    order by LOG_DATE;    

    JRD_REC JRD_CURSOR%ROWTYPE;

    AT_LEAST_ONE_JOB_RAN boolean;
    TRACE_DIR varchar2(2000);
    log_inst_name varchar2(2000);
    log_host_name varchar2(2000);
    log_INSTANCE_ID number;
    job_slave_pid number;
    job_start_date_time date;
    job_end_date_time date; 

BEGIN
    AT_LEAST_ONE_JOB_RAN := FALSE;
    OPEN JRD_CURSOR;
LOOP
    FETCH JRD_CURSOR INTO JRD_REC;
    EXIT WHEN JRD_CURSOR%NOTFOUND;
    AT_LEAST_ONE_JOB_RAN := TRUE;
    log_INSTANCE_ID := JRD_REC.INSTANCE_ID;
    job_slave_pid :=JRD_REC.SLAVE_PID;
    job_start_date_time := JRD_REC.start_date_time;
    job_end_date_time := JRD_REC.end_date_time;

-- Output echo command to display date and time that the current stats job log was opened    

    DBMS_OUTPUT.PUT_LINE('echo "Job log start time: '|| 
    to_char(job_start_date_time,'YYYY-MM-DD HH24:MI:SS') || '"');

-- Trace directory for the stats job log

    select VALUE into TRACE_DIR from gv$parameter where name='background_dump_dest'
      and INST_ID=log_INSTANCE_ID;  

-- Details needed for name of stats job log

   select INSTANCE_NAME,HOST_NAME into log_inst_name,log_host_name
   from gv$instance
   where INSTANCE_NUMBER=log_INSTANCE_ID;

-- Output ssh command to find names of tables whose stats did not complete

   DBMS_OUTPUT.PUT_LINE('ssh '||log_host_name||' "grep TABLE: '||TRACE_DIR||'/'||log_inst_name||'_j0*_'||
     job_slave_pid||'.trc" | awk ''{ print $3 }'' | sed ''s/"//'' | sed ''s/"."/./'' | sed ''s/"\.".*//'' | sort -u');

END LOOP;
CLOSE JRD_CURSOR;
IF (AT_LEAST_ONE_JOB_RAN = FALSE) THEN
    DBMS_OUTPUT.PUT_LINE('No stats jobs have run in the past 7 days.');
END IF;
END;
/

Put your username and password at the top of the sql script.
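
For reference, the generated statlogstmp.sh ends up containing pairs of lines roughly like this; the host name, trace directory, instance name, and slave PID below are made-up examples:

echo "Job log start time: 2013-07-17 09:00:07"
ssh myhost1 "grep TABLE: /oracle/diag/rdbms/mydb/mydb1/trace/mydb1_j0*_12345.trc" | awk '{ print $3 }' | sed 's/"//' | sed 's/"."/./' | sed 's/"\.".*//' | sort -u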

Here is a link to a zip of these two files.

– Bobby



Shared servers results in lower CPU usage for top query in AWR

In two previous posts I’ve discussed the possibility of switching from the default dedicated server configuration to shared servers to support a large web farm.

In the original post I described how Tom Kyte recommended a YouTube video that champions reducing the number of connections from a web farm to improve database performance.  Shared servers was one option.  Others would be some sort of connection pooling or a transaction processing monitor such as Tuxedo.

The second post described how I built a test with many dedicated server connections running many short transactions – lots of commits – and showed that switching to shared servers greatly reduced the commit time in this scenario.

Saturday night we switched our production database to shared servers, and the top query on our AWR report from last week's peak times is now running several times faster because it consumes much less CPU per execution.  These results come from the DBA_HIST_SQLSTAT view.  I have not verified that the users themselves are seeing performance gains, so I'm assuming that DBA_HIST_SQLSTAT is telling the truth. 🙂

Here is my DBA_HIST_SQLSTAT query:

column END_INTERVAL_TIME format a25

select ss.sql_id,
  ss.plan_hash_value,
  sn.END_INTERVAL_TIME,
  ss.executions_delta,
  ELAPSED_TIME_DELTA/(executions_delta*1000) "Elapsed Average ms",
  CPU_TIME_DELTA/(executions_delta*1000) "CPU Average ms",
  IOWAIT_DELTA/(executions_delta*1000) "IO Average ms",
  CLWAIT_DELTA/(executions_delta*1000) "Cluster Average ms",
  APWAIT_DELTA/(executions_delta*1000) "Application Average ms",
  CCWAIT_DELTA/(executions_delta*1000) "Concurrency Average ms",
  BUFFER_GETS_DELTA/executions_delta "Average buffer gets",
  DISK_READS_DELTA/executions_delta "Average disk reads",
  ROWS_PROCESSED_DELTA/executions_delta "Average rows processed",
  FETCHES_DELTA/executions_delta "Average fetches",
  END_OF_FETCH_COUNT_DELTA/executions_delta "Average end of fetch count",
  SORTS_DELTA/executions_delta "Average sorts",
  PX_SERVERS_EXECS_DELTA/executions_delta "Average PX servers execs",
  LOADS_DELTA/executions_delta "Average loads",
  INVALIDATIONS_DELTA/executions_delta "Average invalidations",
  PARSE_CALLS_DELTA/executions_delta "Average parse calls",
  DIRECT_WRITES_DELTA/executions_delta "Average direct writes",
  PLSEXEC_TIME_DELTA/executions_delta "Average PL/SQL exec time",
  JAVEXEC_TIME_DELTA/executions_delta "Average Java exec time"
from DBA_HIST_SQLSTAT ss,DBA_HIST_SNAPSHOT sn
where ss.sql_id = '1p08fp6u57us5'
and ss.snap_id=sn.snap_id
and executions_delta > 100000
order by ss.snap_id,ss.sql_id;

This averages out various statistics on a per-execution basis for each hour-long interval.  I've excluded any interval with 100,000 or fewer executions so we are looking at peak usage times only.  I edited out all of the output columns that were not different or significant to produce this summary:

PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms CPU Average ms
--------------- ------------------------- ---------------- ------------------ --------------
      156426008 12-JUL-13 03.00.42.806 PM           107606         28.2400079      26.081442
      156426008 15-JUL-13 08.00.21.424 AM           113049         6.54265593     6.45811559

The key details here are that the plan did not change and the number of executions is about the same, but the CPU per execution is considerably lower after the move to shared servers.

Here is an edited version of the query and its plan.  I’ve changed the table and column names.

Plan hash value: 156426008

-------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                  |       |       |    11 (100)|          |
|   1 |  HASH JOIN OUTER             |                  |     6 |   702 |    11  (10)| 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID| TABLE_A          |     6 |   492 |     8   (0)| 00:00:01 |
|   3 |    INDEX RANGE SCAN          | TABLE_A_INDEX    |     6 |       |     3   (0)| 00:00:01 |
|   4 |   TABLE ACCESS FULL          | TABLE_B          |    18 |   630 |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

SELECT TABLE_A.COLUMN1,
       TABLE_A.COLUMN2,
       TABLE_A.COLUMN3,
       TABLE_B.COLUMN5
  FROM TABLE_A, TABLE_B
 WHERE TABLE_A.COLUMN1 = :B1 AND 
       TABLE_A.COLUMN2 = TABLE_B.COLUMN4(+);

I don’t really know why the switch to shared servers caused this query to use so much less CPU.  During these peak times we have about 400 database sessions from about 50 web servers.  With dedicated servers that translates to 400 Unix processes with the 100,000 executions of this query spread evenly across them.  With shared servers the workload is processed by only 32 Unix processes.  So, I’m guessing that there are efficiencies that come from doing the work in fewer Unix processes but I haven’t found a statistic that homes in on where those efficiencies are coming from.

– Bobby



Moving to 64 bit laptop OS to run Oracle 12c

I wanted to download the new version of the Oracle database software to test it out but I have an old laptop running Windows XP, a 32-bit OS.  So far Oracle hasn’t released a 32-bit Windows or Linux version on their download page.  They may eventually release 32 bit versions but I don’t want to wait.  So, I finally have to bite the bullet and get a new Windows 7 OS laptop.  I could have gotten one a long time ago but I didn’t want to take the time to reinstall everything.  I can hardly use my family’s Windows 7 laptop (and the newer version of Office that it has) because I’ve stuck with such an old laptop for work use.

Time to enter this century and move to 64 bits on my laptop.  32 just isn’t enough in 2013.

– Bobby


Using coe_xfr_sql_profile.sql to force a good plan

Sunday night we worked with Oracle Support on a performance issue with a query that had been running for over 14 hours even though it ran the previous weekend in five minutes.  Oracle Support showed us how to use coe_xfr_sql_profile.sql to force the query to run using the same plan it had used on the previous weekend in an attempt to improve the query’s performance.

I've built a simplistic example to demonstrate how to use coe_xfr_sql_profile.sql.  First I run a simple query and dump out its plan, which shows the plan hash value and sql_id:

ORCL:SYS>SELECT sum(blocks) from test;

SUM(BLOCKS)
-----------
     237045

ORCL:SYS>select * from 
table(dbms_xplan.display_cursor(null,null,'ALL'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------
SQL_ID  gpm6wzuy8yabq, child number 0
-------------------------------------
SELECT sum(blocks) from test

Plan hash value: 1950795681

Then I generate and run the script to create the sql profile and enforce it:

-- build script to load profile

@coe_xfr_sql_profile.sql gpm6wzuy8yabq 1950795681

-- run generated script

@coe_xfr_sql_profile_gpm6wzuy8yabq_1950795681.sql

After running these scripts, if I rerun my test the plan output shows the SQL profile in use:

Note
-----
 - SQL profile coe_gpm6wzuy8yabq_1950795681 used for this statement

I referred to the following Oracle documents:

Using Sqltxplain to create a ‘SQL Profile’ to consistently reproduce a good plan [ID 1487302.1]

SQLT (SQLTXPLAIN) – Tool that helps to diagnose a SQL statement performing poorly or one that produces wrong results [ID 215187.1]

This is pretty cool.  I’ve done something similar with SQL Plan Management but this is a nice Oracle Support provided tool.
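
For comparison, here is a hedged sketch of the SQL Plan Management approach I mentioned, loading the same plan from the cursor cache as a baseline (same sql_id and plan hash value as the example above):

set serveroutput on

declare
    plans_loaded pls_integer;
begin
    plans_loaded := dbms_spm.load_plans_from_cursor_cache(
                        sql_id          => 'gpm6wzuy8yabq',
                        plan_hash_value => 1950795681);
    dbms_output.put_line('plans loaded: '||plans_loaded);
end;
/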

Alas, in our case on Sunday forcing the previous week's plan didn't help.  So, if a query suddenly starts running for a long time, the plan may not be the issue even if it has changed.  In our case there was a big change to the data, which caused the plan to change and the performance to degrade.  But it was cool to learn about a tool that I hadn't used before, and I anticipate it being helpful in future cases where a change in plan does result in a performance issue.

– Bobby


Cardinality errors across joins

We have been discussing query tuning at work, and I've come back to an example of the Oracle optimizer choosing the wrong plan due to an error in its calculation of the number of rows returned by a query.  I want to show that it is easy and common to have queries that the optimizer cannot correctly optimize because it can't get the number of rows, or cardinality, right.

I've recast the fourth script from my cardinality talk – old scripts here.  I've made it more realistic by changing the table names to be more business related.  I've also tried to show that multi-column histograms don't help and neither does cardinality feedback.  This is all on 11.2.0.3 on 32-bit Windows.  Here is the updated script and its log.

Two tables are joined together.  First is a small division table that has text names and numeric division numbers:

create table DIVISION (DIVNAME VARCHAR2(2000),DIVNUM NUMBER);

insert into DIVISION values ('Los Angeles',1);
insert into DIVISION values ('Mayberry',2);

Next is the sales detail table.  Note that there was only one sale in Mayberry but a million in Los Angeles:

create table SALES (DIVNUM NUMBER);

/* SALES table has 1,000,000 1's for LA and one 2 for Mayberry */

declare
  i number;
begin
  i:=1;
  loop
    insert into SALES values (1);
    i:=i+1;
    if (mod(i,10000)=0) then
      commit;
    end if;
    exit when (i > 1000000);
  end loop;
  commit;
  insert into SALES values (2);
  commit;
end;
/

create index SALESINDEX on SALES(DIVNUM);

The query is just going to get the one Mayberry row by joining the division and sales tables.  There is an index on SALES.DIVNUM but the optimizer doesn't use it.

SQL> select B.DIVNUM
  2  from DIVISION A,SALES B
  3  where
  4  a.DIVNUM=B.DIVNUM and
  5  A.DIVNAME='Mayberry';

    DIVNUM
----------
         2

Elapsed: 00:00:00.07

--------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)|
--------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   455 (100)|
|*  1 |  HASH JOIN         |          |   500K|  8300K|   455   (2)|
|*  2 |   TABLE ACCESS FULL| DIVISION |     1 |    14 |     3   (0)|
|   3 |   TABLE ACCESS FULL| SALES    |  1000K|  2929K|   448   (2)|
--------------------------------------------------------------------
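
For comparison, here is a hedged sketch of a hinted version that forces the index range scan on SALESINDEX.  I did not capture its plan in the test log, but with only one SALES row for DIVNUM 2 the index access should touch a handful of blocks instead of doing the 448-cost full scan:

select /*+ index(B SALESINDEX) */ B.DIVNUM
from DIVISION A,SALES B
where
a.DIVNUM=B.DIVNUM and
A.DIVNAME='Mayberry';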

We have queries like this all the time where two tables are joined on some sort of numeric key but the where clause has a meaningful text-based condition on the smaller table.  In the example it is a small division table joined to a sales table.  It could also be a date table that relates quarters or fiscal periods to particular days.

Some sort of cross table histogram would be needed to fix this.  It would look something like this:

DIVISION.DIVNAME DIVISION.DIVNUM SALES.DIVNUM NUM_ROWS
     Los Angeles                1           1  1000000
        Mayberry                2           2        1

But, think about how complex this could get in terms of all the cross-table relationships, even if this type of histogram were implemented.  For the optimizer to figure out the row count across joins in the most general case, it would need countless different cross-table histograms to capture the relationships among the many tables and columns that make up a large database.

The type of join demonstrated in the test script is common in our production database applications, so I believe this test shows that we should expect the optimizer to frequently have inaccurate cardinality estimates, which can lead to inefficient query execution plans.

– Bobby

P.S.  I ran across a case yesterday where a dynamic sampling hint helped with a table that had multiple correlated predicates against the one table.  So, I thought I’d try that hint on both tables in the example for this post but it still didn’t get the good plan with the index.  The updated example script and its log are here.  I ran this test on Exadata on 11.2.0.2 because that was where I saw the dynamic sampling hint work.
