Bobby Durrett's DBA Blog

netstat -o shows that (ENABLE=BROKEN) turns on TCP keepalive

Posted on August 31, 2022 by Bobby

In an earlier post I showed a Java program that will login to an Oracle database and wait for 350 seconds. I also talked about how we set the Linux parameter net.ipv4.tcp_keepalive_time to 60 seconds but that I needed to add (ENABLE=BROKEN) to the TNS connect string to enable the keepalive. I found a helpful post that said to use netstat -a -n -o to see connections that are using TCP keepalive. So, I tried my Java program with and without (ENABLE=BROKEN) and ran netstat -a -n -o both ways and it showed that keepalive was only working with (ENABLE=BROKEN).

with (ENABLE=BROKEN)

$ netstat -a -n -o | grep 10.99.94.32
tcp6 0 0 172.99.99.187:44314  10.99.94.32:1523 ESTABLISHED keepalive (27.30/0/0)
$ netstat -a -n -o | grep 10.99.94.32
tcp6 0 0 172.99.99.187:44314  10.99.94.32:1523 ESTABLISHED keepalive (41.47/0/0)

without (ENABLE=BROKEN)

$ netstat -a -n -o | grep 10.99.94.32
tcp6 0 0 172.99.99.187:54884  10.99.94.32:1523 ESTABLISHED off (0.00/0/0)

I edited the IP addresses to obscure them and removed spaces to make it fit better, but the important thing is that with (ENABLE=BROKEN) the 60 second keepalive timer is working, but without it the timer is off.

This information might not be that helpful to others if they do not have this kind of timeout, although I have been told that many firewalls have similar timeouts. Certainly, any AWS customer that connects through their Gateway Load Balancer to an on premises Oracle database would need to know this sort of thing. Hopefully, we are not the only ones in the world doing it this way! But at least I documented it for myself which will be helpful no matter what.

Bobby

Posted in Uncategorized | 2 Comments

350 Second Timeout Causes ORA-03135 Errors in AWS DMS

Posted on August 2, 2022 by Bobby

This is a follow up to an earlier post about the 350 second timeout that is built into Amazon Web Services’ (AWS) Gateway Load Balancer (GWLB).

The earlier post was about Debezium (DBZ) using its Oracle Connector to pull data from an on-premises Oracle database into Kafka in AWS. DBZ used JDBC to connect to the Oracle database so I built a simple Java program that uses JDBC to mimic the behavior we saw in DBZ. With DBZ we were hanging if any SQL statement that DBZ ran took >= 350 seconds to run. If it did, then the Oracle session hung and Debezium never got past that SQL statement.

But for AWS Database Migration Service (DMS) the symptoms were different. For DMS I could not find any SQL statement that ran for >= 350 seconds. All the SQL statements ran much faster. But we did see ORA-03135 errors in DMS’s log like this:

ORA-03135 Error

DMS seemed to be waiting >= 350 seconds between SQL statements in certain cases, maybe doing a large load, and that seemed to be causing the ORA-03135 errors. I also saw DMS Oracle sessions waiting for more than 350 seconds on “SQL*Net message from client” idle waits. These seemed to eventually go away after 6000 or more seconds. I think that the GWLB was silently dropping the network connection, but the Oracle sessions still existed until at some point they realized that the network connection was gone. But I wanted to recreate the problem in a simple test case to prove that the 350 second GWLB timeout would throw the ORA-03135 error and leave the DMS Oracle sessions hanging for several thousand seconds in the SQL*Net wait that I was seeing in our production DMS sessions.

To recreate this error and the orphaned session behavior and to show that it was due to the GWLB 350 second timeout and not some other weird network problem I did some simple tests with SQL*Plus and Instant Client. I installed these on an AWS EC2 Linux machine that already had the firewall and security group configuration setup to allow a connection from the EC2 to an on-premises Oracle database. Then I just logged into that database and sat idle for different lengths of time before running a select statement. I narrowed it down to about 350 seconds as the cutoff point where the session is lost due to too much idle time.

Here is my test with < 350 second wait:

SQL> connect myuser/mypassword@mydatabase
Connected.
SQL> host sleep 348

SQL> select * from dual;

D
-
X

Elapsed: 00:00:00.11

Here is my test with > 350 seconds wait:

SQL> connect myuser/mypassword@mydatabase
Connected.
SQL> host sleep 351

SQL> select * from dual;
select * from dual
*
ERROR at line 1:
ORA-03135: connection lost contact
Process ID: 1208
Session ID: 57 Serial number: 21111

Narrowing it down to 350 seconds at the cutoff showed that just logging in and waiting for > 350 seconds causes an ORA-03135 error. I also verified that the associated Oracle sessions hung around for > 350 seconds stuck on the “SQL*Net message from client” wait. Sure, DMS could be throwing a ORA-03135 error due to some unrelated network problem, but my SQL*Plus test proved that any Oracle connection from our AWS environment back to our on-premises Oracle databases will throw a ORA-03135 error and leave orphaned Oracle sessions if it sits idle for >= 350 seconds unless we put the fix in place that I mentioned in my earlier post.

The fix is to set the Linux parameter net.ipv4.tcp_keepalive_time to < 350 seconds and to use (ENABLE=BROKEN) in your connection strings. Once I put these in place for my SQL*Plus test I could wait longer than 350 seconds and then run a select statement with no errors.

Since March when we noticed this timeout with Debezium I have suspected the timeout would also affect DMS, but I did not know that the symptoms would be throwing ORA-03135 errors and leaving orphaned sessions when the time idle between SQL statements exceeded the timeout. It took a few tickets working with AWS support but last week they put net.ipv4.tcp_keepalive_time < 350 seconds and (ENABLE=BROKEN) in their global DMS configuration for all their customers.

So, from now on anyone setting up a new DMS replication instance version 3.4.5 or later should be able to replicate data from AWS to an on-premises Oracle database through Amazon’s Gateway Load Balancer without facing these ORA-03135 errors. If you created your replication instance before last week you should create a new one >= version 3.4.5 to take advantage of this fix, especially if you are seeing ORA-03135 errors in your logs.

Bobby

Posted in Uncategorized | 2 Comments

Docker Sample Application behind Zscaler

Posted on April 15, 2022 by Bobby

I am trying to learn about Docker by installing it on an Oracle Linux 7 VM on top of VirtualBox on my work laptop. My work laptop uses Zscaler. I had a bunch of certificate issues and ended up learning a lot about Docker by working around them. I tried to do the Sample Application – really the simplest first step in the Docker documentation – and had all kinds of trouble getting it to work. Ultimately, I ended up with a Dockerfile that looked like this:

[root@docker ~]# cat Dockerfile
# syntax=docker/dockerfile:1
FROM oraclelinux:7
COPY z.pem /etc/pki/ca-trust/source/anchors/z.pem
RUN update-ca-trust
RUN echo sslverify=false >> /etc/yum.conf
RUN yum install -y oracle-nodejs-release-el7 oracle-release-el7
RUN yum install -y nodejs
RUN npm install -g npm
RUN npm install -g yarn
WORKDIR /app
COPY . .
RUN yarn config set "strict-ssl" false -g
RUN yarn install --production
CMD ["node", "src/index.js"]
EXPOSE 3000

By contrast the Dockerfile that was supposed to work looks like this:

# syntax=docker/dockerfile:1
FROM node:12-alpine
RUN apk add --no-cache python2 g++ make
WORKDIR /app
COPY . .
RUN yarn install --production
CMD ["node", "src/index.js"]
EXPOSE 3000

I ended up using the oraclelinux:7 image because it had more stuff installed such as update-ca-trust. Because I could not get anything to work with Zscaler I had to start with an image that did not require me to pull more stuff down with yum. Then, after playing with it I still ended up disabling SSL verification on yum and yarn. I had to install node since I was starting with a plain Linux image and not a node image.

I had these instructions for getting Zscaler to work on my Oracle Linux 7 VirtualBox VMs on my company computer:

Had to extract Zscaler .cer root ca from Chrome browser as z.cer. 
Moved to linux and ran:

openssl x509 -inform der -in z.cer -outform der -out z.pem

copied z.pem to /etc/pki/ca-trust/source/anchors/

ran

update-ca-trust

worked.

I do not know if this is really doing anything. It affects curl so that I can use curl without the -k option to disable SSL verification. Maybe things that use curl under the covers are affected by adding z.pem to the trusted certificates.

Anyway, I just wanted to document this for myself. Maybe someone out there will benefit also.

Bobby

Posted in Uncategorized | 4 Comments

350 second AWS timeout causes JDBC call to hang

Posted on March 25, 2022 by Bobby

When I run the following Java program on an AWS EC2 Linux virtual machine connecting to an Oracle database in my company’s internal network it hangs forever.

When I run it on a Linux machine on our internal network it runs fine.

Evidently my company uses an AWS feature called “Gateway Load Balancer” to connect our AWS network to our internal on premises network. Evidently the GLB has a 350 second timeout. See this document:

https://aws.amazon.com/blogs/networking-and-content-delivery/best-practices-for-deploying-gateway-load-balancer/

Here is a quote of the relevant paragraph:

Some applications or API requests, such as synchronous API calls to databases, have long periods of inactivity. GWLB has a fixed idle timeout of 350 seconds for TCP flows and 120 seconds for non-TCP flows. Once the idle timeout is reached for a flow, it is removed from GWLB’s connection state table. As a result, the subsequent packets for that flow are treated as a new flow and may be sent to a different healthy firewall instance. This can result in the flow timing out on the client side. Some firewalls have a default timeout of 3600 seconds (1 hour). In this case, GWLB’s idle timeout is lower than the timeout value on the firewall, which causes GWLB to remove the flow without the firewall or client being aware it was dropped.
Best practices for deploying Gateway Load Balancer

This means that my JDBC call using the thin driver will work fine if I sleep for 349 seconds but will hang forever if I try to sleep for 350 seconds. The solution is to update a Linux operating system parameter and to update the JDBC connect string.

OS:

sysctl -w net.ipv4.tcp_keepalive_time=60

add this line to /etc/sysctl.conf:

net.ipv4.tcp_keepalive_time=60

Evidently our default tcp_keepalive_time value was 7200 seconds which is longer than the 350 second timeout so we had to lower it to 60 seconds to that the Gateway Load Balancer would know that our JDBC call was actually doing something.

You have to add (ENABLE=broken) to the jdbc connect string like this:

jdbc:oracle:thin:MYUSER/MYPASSWORD!@(DESCRIPTION=(ENABLE=broken)(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(Host=myhost)(Port=1521)))(CONNECT_DATA=(SERVICE_NAME=MYSERVICE)))

Once I did this my Java test program worked fine. It ran for about 350 seconds and finished cleanly.

If you are working in AWS and connecting to a on premises database using JDBC and you have a SQL statement that should run for 350 seconds or more and hangs forever you might check whether you are being affected by this timeout.

Bobby

p.s. I forgot to mention that the Oracle database session goes away after 350 seconds. It is just the client side JDBC call that hangs apparently forever.

p.p.s. We have a related issue with Putty sessions connecting to Amazon EC2 Linux VMs timing out after 350 seconds. A coworker offered this article as a solution:

https://patrickmn.com/aside/how-to-keep-alive-ssh-sessions/

The Putty keepalives setting works great!

Another coworker of mine was saying that certain types of firewalls work this way with timeouts. The problem is that the GWLB times out our on-premises side but not our AWS side. So, in the case of using Putty to ssh into an EC2 that does not have keepalives configured my Putty session, which also does not have keepalives configured, times out after 350 seconds of idle time. When I hit enter, I get “Network error: Software caused connection abort” but if I check my BASH shell process id, I see that my shell process was never terminated. So, old processes hang around forever on my EC2 if the ssh connection times out due to the GWLB 350 second timeout.

Maybe it is normal for connections on one side of a firewall to time out and the other side to hang forever? I am not sure.

Posted in Uncategorized | 2 Comments

Do SET_AUDIT_TRAIL_LOCATION before INIT_CLEANUP

Posted on February 4, 2022 by Bobby

This is all old stuff, but I want to record a simple thing I found. I was following Oracle’s support document for setting up audit table cleanup using the DBMS_AUDIT_MGMT package. I used this document:

SCRIPT: Basic example to manage AUD$ table with dbms_audit_mgmt (Doc ID 1362997.1)

This is a very helpful document, but the example script runs DBMS_AUDIT_MGMT.INIT_CLEANUP before it runs DBMS_AUDIT_MGMT.SET_AUDIT_TRAIL_LOCATION and it moves the audit tables SYS.AUD$ first to the SYSAUX tablespace and then to a newly created AUDIT_DATA tablespace. My simple thought is to run SET_AUDIT_TRAIL_LOCATION first to move SYS.AUD$ to AUDIT_DATA and then run INIT_CLEANUP which leaves SYS.AUD$ in AUDIT_DATA. Nothing monumental, but it seems more efficient to move the audit table once.

I did a couple of quick tests on an 18c database to demonstrate that SYS.AUD$ only moves once with SET_AUDIT_TRAIL_LOCATION first.

Test1: Follow the order in the Oracle document:

Before starting:

SQL> select
  2  tablespace_name
  3  from dba_tables
  4  where
  5  owner='SYS' and
  6  table_name='AUD$';

TABLESPACE_NAME
------------------------------
SYSTEM

Create tablespace:

SQL> CREATE TABLESPACE AUDIT_DATA LOGGING
DATAFILE '/oracle/db01/DBA18C/dbf/audit_data_1.dbf'
SIZE 100M
AUTOEXTEND OFF;  2    3    4

Tablespace created.

Do INIT:

SQL> BEGIN
  2  IF NOT DBMS_AUDIT_MGMT.IS_CLEANUP_INITIALIZED
  3  (DBMS_AUDIT_MGMT.AUDIT_TRAIL_AUD_STD)
  4  THEN
  5  dbms_output.put_line('Calling DBMS_AUDIT_MGMT.INIT_CLEANUP');
  6  DBMS_AUDIT_MGMT.INIT_CLEANUP(
  7  audit_trail_type => dbms_audit_mgmt.AUDIT_TRAIL_AUD_STD,
  8  default_cleanup_interval => 24*7);
  9  else
 10  dbms_output.put_line('Cleanup for STD was already initialized');
 11  end if;
 12  end;
 13  /
Calling DBMS_AUDIT_MGMT.INIT_CLEANUP

PL/SQL procedure successfully completed.

Table in SYSAUX:

SQL> select
  2  tablespace_name
  3  from dba_tables
  4  where
  5  owner='SYS' and
  6  table_name='AUD$';

TABLESPACE_NAME
------------------------------
SYSAUX

Set the new table location:

SQL> begin
  2  DBMS_AUDIT_MGMT.SET_AUDIT_TRAIL_LOCATION(
  3  audit_trail_type => dbms_audit_mgmt.AUDIT_TRAIL_AUD_STD,
  4  audit_trail_location_value => 'AUDIT_DATA') ;
  5  end;
  6  /

Table is in AUDIT_DATA (moved twice SYSTEM->SYSAUX->AUDIT_DATA):

SQL> select
  2  tablespace_name
  3  from dba_tables
  4  where
  5  owner='SYS' and
  6  table_name='AUD$';

TABLESPACE_NAME
------------------------------
AUDIT_DATA

Test2: Reverse the order in the Oracle document:

First, I restored my database to its original condition:

SQL> select
  2  tablespace_name
  3  from dba_tables
  4  where
  5  owner='SYS' and
  6  table_name='AUD$';

TABLESPACE_NAME
------------------------------
SYSTEM

After creating the tablespace again, I ran set the trail location and the table is now in AUDIT_DATA:

SQL> select
  2  tablespace_name
  3  from dba_tables
  4  where
  5  owner='SYS' and
  6  table_name='AUD$';

TABLESPACE_NAME
------------------------------
AUDIT_DATA

Next, I do the init and the table does not move:

SQL> select
  2  tablespace_name
  3  from dba_tables
  4  where
  5  owner='SYS' and
  6  table_name='AUD$';

TABLESPACE_NAME
------------------------------
AUDIT_DATA

So, I am not sure why Oracle’s document has you do INIT_CLEANUP before SET_AUDIT_TRAIL_LOCATION but it seems more efficient to do them in the reverse order and move SYS.AUD$ once, from SYSTEM to AUDIT_DATA.

Bobby

Posted in Uncategorized | 2 Comments

Oracle 21c Laptop Install

Posted on January 26, 2022 by Bobby

Finally got around to installing Oracle 21c on my laptop. I installed it on Oracle’s Linux version 7 running in VirtualBox. There are other posts out there, so I won’t get too detailed. I noticed this post:

https://oracle-base.com/articles/21c/oracle-db-21c-installation-on-oracle-linux-7

But I mainly went by the manual:

https://docs.oracle.com/en/database/oracle/oracle-database/21/ladbi/index.html

I stopped the firewall:

service firewalld stop
systemctl disable firewalld

I used the preinstall yum package:

yum install oracle-database-preinstall-21c

As root setup the directories with the right ownership and permission:

mkdir -p /u01/app/oracle
mkdir -p /u01/app/oraInventory
chown -R oracle:oinstall /u01/app/oracle
chown -R oracle:oinstall /u01/app/oraInventory
chmod -R 775 /u01/app

As oracle unzipped the zip:

cd /u01/app/oracle/product/21.0.0/dbhome_1
unzip -q /home/oracle/LINUX.X64_213000_db_home.zip

I was annoyed by the -q option of unzip. If I did it again, I would leave it off so that I could see the list of files as they were unzipped.

From the console I ran this:

cd /u01/app/oracle/product/21.0.0/dbhome_1
./runInstaller

I got a warning about the clock source being kvm-clock and not tsc. I tried Oracle’s instructions to fix this warning, but they did not work for me. It was just a warning apparently coming from the cluster verification utility. Since I am not using RAC, I didn’t see how this could matter so I ignored it. The install was very simple. Forced to use a CDB this time as I already knew.

The only interesting part for me was the read only Oracle home. Evidently the files that are updatable are stored under ORACLE_BASE_HOME instead of ORACLE_HOME.

I ended up adding these lines to .bash_profile:

export ORACLE_SID=orcl
export ORAENV_ASK=NO
. oraenv
export ORACLE_BASE_HOME=/u01/app/oracle/homes/OraDB21Home1

To add a new tnsnames.ora entry for the pdb I had to go to $ORACLE_BASE_HOME/network/admin instead of $ORACLE_HOME/network/admin:

[oracle@orcl21 ~]$ ls -altr $ORACLE_HOME/network/admin
total 4
-rw-r--r--.  1 oracle oinstall 1624 Feb 18  2020 shrept.lst
drwxr-xr-x.  2 oracle oinstall   61 Jul 27  2021 samples
drwxr-xr-x. 10 oracle oinstall   98 Jul 27  2021 ..
drwxr-xr-x.  3 oracle oinstall   37 Jan 26 13:45 .
[oracle@orcl21 ~]$ ls -altr $ORACLE_BASE_HOME/network/admin
total 16
drwxr-x---. 5 oracle oinstall  40 Jan 26 12:52 ..
-rw-r-----. 1 oracle oinstall 190 Jan 26 12:53 sqlnet.ora
-rw-r-----. 1 oracle oinstall 329 Jan 26 12:53 listener.ora
-rw-r-----. 1 oracle oinstall 397 Jan 26 13:49 tnsnames.ora.01262022
-rw-r-----. 1 oracle oinstall 586 Jan 26 13:49 tnsnames.ora
drwxr-x---. 2 oracle oinstall  89 Jan 26 13:49 .

I added this entry to $ORACLE_BASE_HOME/network/admin/tnsnames.ora:

orclpdb =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = orcl21)(PORT = 1521))
      (CONNECT_DATA =
        (SERVER = DEDICATED)
        (SERVICE_NAME = orclpdb)
      )
  )

Pretty simple. The hard part is understanding what is new in 21c.

Bobby

Posted in Uncategorized | Leave a comment

Dropping sequences whose name starts with ISEQ$$

Posted on January 25, 2022 by Bobby

We had an application that created 500,000 tables and 500,000 sequences and the vendor sent us a cleanup script that we thought would drop the tables but not the sequences. It took us a few attempts to get a cleanup script that looked like it would work with the tables, but the sequences seemed totally wrong. The tables had dashes in their names and a bunch of random characters between the dashes and the cleanup script looked for that pattern. But the sequence names all started with ISEQ$$ and they were trying to drop sequences whose names were like the tables’ names. Confusing. Finally, they convinced us to run the script whether it looks like it would work or not. After tweaking it a bit it did run and dropped both the tables and the sequences. What in the world? Then I had a vague memory of something called “identity columns” probably from my 12.1 certification. I do not think I have seen them in a real system, so I checked the tables and sure enough they all had a single identity column and each sequence matched up with each table. So, when we dropped the tables, the sequences went with them. I do not know why I did not clue into the fact that the sequences have dollar signs $$ in their name which means they were probably system generated. Duh!

I thought about not posting anything about this because I was sure there were several good posts out there about this and there are. So, I will link some of them below and not try to recreate them. The funny thing is that the DBAs at the vendor kept talking about flushing the recycle bin to drop the sequence and that also made no sense at the time. Why would flushing the dropped tables out of the recycle bin have anything to do with the sequences? But as you will see in the posts it does.

One thing that probably is not in the posts is that if you turn off the recycle bin with a parameter then dropping the table drops the sequence that is associated with its identity column without having to clear the recycle bin or do a drop table purge. We have this parameter set:

SQL> show parameter recyclebin

NAME                                 TYPE        VALUE
------------------------------------ ----------- -----------------
recyclebin                           string      OFF

Here is my simple test script and its output: zip

I used drop table purge in the test script, so it works even if the recycle bin is enabled. If you run the blog.sql script in the zip, be sure to run it as a user that does not have any tables.

Here are the blog posts:

https://oracle-base.com/articles/12c/identity-columns-in-oracle-12cr1

This is from Tim Hall’s Oracle Base which has a fantastic amount of detail about various Oracle features and versions. It shows the ISEQ$$ sequence names. Why didn’t I just google ISEQ$$?

https://floo.bar/2019/11/29/drop-the-underlying-sequence-of-an-identity-column/

This talks about how to drop the sequence by dropping the table and either clearing the recycle bin or using drop table purge.

https://stackoverflow.com/questions/58984546/cannot-drop-a-system-generated-sequence/58984939

This SO question has a nice answer similar to the previous post.

https://itoug.it/blog/2020/04/identity-columns-a-new-entry-in-oracle-database-12c-part-2-beyond-the-basics/

Covers a lot of ground but talks about the recycle bin.

Anyway, I wanted to note these posts/links as well as my experience. If you want to drop sequences with names starting with ISEQ$$ you need to drop the associated table with an identity column and be sure to purge the table from the recycle bin if you have it enabled.

Bobby

Posted in Uncategorized | Leave a comment

latch: shared pool waits after patching to 19.13

Posted on December 14, 2021 by Bobby

I have been working hard on an issue that happened after we patched an Oracle database to the 19.13 patch set which just came out in October. The application experienced a ton of latch: shared pool waits, and the patch set had to be backed out. Currently the database is running on the 19.5 patch set and running fine. I have been trying to figure this out on my own and I have been working with Oracle support but so far, I have not come up with a proven resolution. But I thought I would document what I know here before I go off on vacation for two weeks around the Christmas and New Year holidays.

It all started on October 26th when we were still on 19.5 and our application hung up with alert log messages consistent with this bug:

Bug 30417732 – Instance Crash After Hitting ORA-00600 [kqrHashTableRemove: X lock] (Doc ID 30417732.8)

Here were the error messages:

One of these:

ORA-00600: internal error code, arguments: [kqrHashTableRemove: X lock], [0x11AA5F3A0]

Then a bunch 0f these:

ORA-00600: internal error code, arguments: [kglpnlt]

I found that the fix for bug 30417732 is in the 19.9 and later patch sets. So, I, perhaps foolishly, decided to apply the latest patch set (19.13) after testing it in our non-production databases. In retrospect I probably should have found a one-off patch for 30417732 but I wanted to catch up to the latest patch set and be on the current and hopefully most stable version of Oracle.

So after applying 19.13 in our lower environments and testing there we applied it on production on November 13th and we had poor performance in the application Monday through Thursday and finally backed out the patch set Thursday night. It has been fine ever since and the original problem, bug 30417732, has fortunately not happened again.

I generated an AWR report during an hour when the problem was occurring, and it showed latch: shared pool as the top foreground event. Like this:

I was able to use my onewait.py and sessioncounts.py to show a correlation between the number of connected sessions and the number of latch waits.

Latch waits:

Number of connected sessions:

This database uses shared servers which uses memory in the shared pool and large pool so it makes sense that a bunch of logins might cause some contention for the shared pool. I tried to recreate the high latch: shared pool waits on a test database by running a Java script that just creates 1000 new sessions:

This did result in similar or even worse latch: shared pool waits during the time the Java script was running:

So, I have been able to recreate the problem in a test database but now I am stuck. This is not a minimal reproducible example. The test database is actively used for development and testing. I have tried to create my own load to apply to a small database but so far, I cannot reproduce the exact results. If I run 10 concurrent copies of MaxSessions.java I can get latch: shared pool waits on a small database, but I get the same behavior on 19.5 as I do on 19.13. So, something is missing. Here is a spreadsheet of my test results:

19.5, 19.12, and 19.13 have the same results but 18.14 is a lot better. I was trying to come up with a test that was a lot better on 19.5 than 19.13 but I have not succeeded.

I have looked at a boatload of other things not listed here but I am running out of steam. Friday is my last day working until January 4th, so I hope to put this out of my mind until then. But I thought it was worth using this blog post to document the journey. Maybe at some point I will be able to post a positive update.

Bobby

12/16/21

Made some progress working with Oracle support on this yesterday. I did not mention in this post that the latch behind all the latch: shared pool waits was

kghfrunp: alloc: wait

Here is the top latch from the AWR report:

An Oracle analyst noted that this bug also involves the same latch:

Bug 33406872 : LATCH SHARED POOL CONTENTION AFTER UPGRADE TO 19.12

I had noticed bug 33406872 but I did not know that it was on the kghfrunp: alloc: wait latch because that information is not visible to Oracle customers.

The same analyst had me rerun the Java script listed above but with a dedicated server connection. I did and it did not have the latch waits. So, this is a shared server bug.

Oracle development is working on it so there is no fix yet but here is the situation as it seems to be now: If you are using shared servers on 19.12 or 19.13 and have high latch: shared pool waits on the kghfrunp: alloc: wait latch you may be hitting a new bug that Oracle is working on.

1/7/2022

After getting back from 2 weeks vacation I took another crack at this and I think I have come up with a useful test case for the SR. Time will tell, but it looks promising. I added a second Java script to the MaxSessions one listed above. Here it is:

This opens 1000 shared server sessions and then runs unique select statements on each of the 1000 sessions one at a time looping up to 1000 times. After about 60 loops I get the latch: shared pool waits. I tried this on both 19.13 and 19.5 and I get the waits in both but with differences.

19.13 after 100 loops:

19.5 after 100 loops

Avg Wait, % DB time, and sleeps for the kghfrunp: alloc: wait latch were all higher on 19.13.

Note that I had to keep the GenLoad.java script running while I ran MaxSessions.java and I took AWR snapshots around the MaxSessions run to get these AWR report outputs.

Maybe this will be enough to help Oracle recreate this behavior internally.

Bobby

10/24/22

Back from a weeklong cruise vacation and Oracle support says that they opened bug 33801843 for this issue and that it will be fixed in a future release. I would not use shared servers going forward after 19.11.

1/6/22

Good news. Oracle had me try my test case with 19.17 and it worked great. I tested with the December monthly patch set on top of 19.17 so I do not know exactly what fixed it. But, based on my tests applying 34419443 and 34819700 on 64-bit Linux resolves the issue described above which was introduced in 19.12.

Here are the zips I used to patch including latest OPatch:

$ ls -altr
total 1798904
drwxr-xr-x. 7 oracle dba         71 Sep 21 11:34 ..
-rw-rw-rw-. 1 oracle dba 1580330977 Jan 13 16:43 p34419443_190000_Linux-x86-64.zip
-rw-rw-rw-. 1 oracle dba  138029131 Jan 13 16:44 p34819700_1917000DBRU_Linux-x86-64.zip
-rw-rw-rw-. 1 oracle dba  123714997 Jan 13 16:47 p6880880_190000_Linux-x86-64.zip

Posted in Uncategorized | 10 Comments

Tweaked script to show top SQL by force matching signature

Posted on September 15, 2021 by Bobby

I just finished another production tuning exercise that is like two recent posts:

Two New ASH FORCE_MATCHING_SIGNATURE scripts

Tuning Example With Outline Hints and SQL Profile

The biggest new thing was that I ran the ashtopelapsed.sql script that I mentioned in the first post above but it missed the top SQL statement for the job I was trying to tune. I am struggling to get the logic right. I am trying to write an ASH-based top SQL report for a given time range. It is like an AWR report where you see the top elapsed time SQL statement for the given time frame. But I am grouping by FORCE_MATCHING_SIGNATURE and not SQL_ID because I want to see the top SQL that all have the same signature but different SQL_IDs. In this case there was a PeopleSoft update and select statement that did not use bind variables, so each execution had its own SQL_ID. Here is the select:

SELECT 
DISTINCT TO_CHAR(A.ACCOUNTING_DT,'YYYY-MM-DD') 
FROM PS_HR_ACCTG_LINE A 
WHERE 
A.APPL_JRNL_ID = 'ABCDE' AND 
A.BUSINESS_UNIT_GL = 'FG' AND 
(A.LEDGER_GROUP = ' ' OR A.LEDGER_GROUP = 'HIJK') AND 
A.GL_DISTRIB_STATUS = 'N' AND 
A.ACCOUNTING_DT >= TO_DATE('2006-09-03','YYYY-MM-DD') AND 
A.ACCOUNTING_DT <= TO_DATE('2021-09-04','YYYY-MM-DD');

I guess this runs for different values of BUSINESS_UNIT_GL, ACCOUNTING_DT, etc.

I am trying to get ashtopelapsed.sql to show the top SQL by force matching signature:

create table topsigs as
select 
FORCE_MATCHING_SIGNATURE,
count(*) active
from DBA_HIST_ACTIVE_SESS_HISTORY a
where 
sample_time 
between 
to_date('13-SEP-2021 01:00:19','DD-MON-YYYY HH24:MI:SS')
and 
to_date('13-SEP-2021 06:52:25','DD-MON-YYYY HH24:MI:SS')
group by
FORCE_MATCHING_SIGNATURE;

In this case during the time frame of the problem job, from about 1 to 7 am yesterday.

But then I want to find at least one SQL_ID for the given value of FORCE_MATCHING_SIGNATURE so I can then get its text as an example of a real SQL statement that has the signature:

create table sigtoid as
select 
t.FORCE_MATCHING_SIGNATURE,
min(a.sql_id) a_sql_id
from
topsigs t,
DBA_HIST_ACTIVE_SESS_HISTORY a,
DBA_HIST_SQLTEXT s
where
t.FORCE_MATCHING_SIGNATURE = a.FORCE_MATCHING_SIGNATURE and
a.sql_id = s.sql_id
group by t.FORCE_MATCHING_SIGNATURE;

There has to be a row in DBA_HIST_SQLTEXT for the signature or else I will not have an example SQL statement and the signature will get dropped from the report. I may need to put some sort of outer join somewhere so I can see the FORCE_MATCHING_SIGNATURE values that do not have an example SQL_ID in DBA_HIST_SQLTEXT. But fortunately, with the last revision (commit) to ashtopelapsed.sql I was able to see the text of an example SQL statement for each of the top signatures during the time when the problem job ran.

The other funny thing is that I forgot I had ashfmscount.sql so I ended up rewriting it. I need to read my own blog posts. 🙂 Hopefully this more detailed post will serve as a model for future performance tuning efforts and so I will not forget the scripts that I used this time.

As for hacking a SQL Profile together, the subject of the second post above, I edited one in vi as described in the post and the other I just extracted using coe_xfr_sql_profile.sql. I also edited both SQL Profiles to use FORCE = TRUE to make the SQL profile apply to all SQL statements with the same signature. I described FORCE = TRUE in earlier posts:

force_match => TRUE option of DBMS_SQLTUNE.IMPORT_SQL_PROFILE

Example of coe_xfr_sql_profile force_match TRUE

I think there is some overlap in all these posts, but I wanted to document today’s tuning exercise. The biggest new thing is the improvement of ashtopelapsed.sql.

Couple of images of the AWR and ASH top SQL for the time of this long running job:

Neither of these captured the top SQL by signature so that is why ashtopelapsed.sql was helpful. There may be some Oracle supplied way to do the same thing that I do not know about but the standard awrrpt.sql and ashrpt.sql do not handle this situation well. I could not find any useful information about my long running job in the top SQL portions of the AWR and ASH reports.

Here is the top output from ashtopelapsed.sql:

    ACTIVE FORCE_MATCHING_SIGNATURE EXAMPLE_SQL_I SQL_TEXT
---------- ------------------------ ------------- --------------------------
     10389                        0 02101wfuxvwtd INSERT INTO PS_AERUNCONTRO
      1394      2407690495325880429 003t0wk45taqs UPDATE PS_HR_ACCTG_LINE SE
       858      9373188292577351646 04hvza4k3u72x INSERT INTO PS_PROJ_RES_TM
                                                  ID, VOUCHER_ID, VOUCHER_LI
                                                  D_DETAIL, CST_DISTRIB_STAT
                                                  _TYPE, JL.RESOURCE_TYPE, J
                                                  JRNL_HEADER JH , PS_JRNL_L

       718       810675553005238195 b8fakhtc5g3rx select * from (select a.os
       646     12856152818202055532 00mmbs0c6bh1a SELECT DISTINCT TO_CHAR(A.

I created FORCE=TRUE SQL Profiles for FORCE_MATCHING_SIGNATURE values 2407690495325880429 and 12856152818202055532. Here was the output from ashfmscount.sql:

FORCE_MATCHING_SIGNATURE     ACTIVE
------------------------ ----------
     2407690495325880429       1394
    12856152818202055532        646
    12319028094632613940         43
    12084330158653099691          2
    10829167302201730459          2
     4612906341224490548          1

Almost all the 6-hour runtime was taken up by 2407690495325880429 and 12856152818202055532. The units are in 10 seconds so in hours it is:

((1394 + 646)*10)/3600.0 = 5.666666666666667

Almost all the job run time was spent on SQL statements that match these two signatures.

These two had full scans and needed index skip scans.

Bad plan:

Plan hash value: 3305693634
-----------------------------------------------
| Id  | Operation          | Name             |
-----------------------------------------------
|   0 | UPDATE STATEMENT   |                  |
|   1 |  UPDATE            | PS_HR_ACCTG_LINE |
|   2 |   TABLE ACCESS FULL| PS_HR_ACCTG_LINE |
-----------------------------------------------

Good plan:

Plan hash value: 1538925713
---------------------------------------------------------
| Id  | Operation                    | Name             |
---------------------------------------------------------
|   0 | UPDATE STATEMENT             |                  |
|   1 |  UPDATE                      | PS_HR_ACCTG_LINE |
|*  2 |   TABLE ACCESS BY INDEX ROWID| PS_HR_ACCTG_LINE |
|*  3 |    INDEX SKIP SCAN           | PSAHR_ACCTG_LINE |
---------------------------------------------------------

Not to beat a dead horse, but my main point is that I updated ashtopelapsed.sql to be useful in this situation. It gave me information that the AWR and ASH reports did not.

Here are the scripts that I used for this tuning example:

ashdump.sql – dump out a single 10 second point in time to find the session that is running the batch job

I also had to run this script which is not in my git repository (it is based on ashcount.sql which is my generic script for grouping by certain ash columns).

     select
  2  SESSION_ID,count(*) active
  3  from DBA_HIST_ACTIVE_SESS_HISTORY a
  4  where
  5  sample_time
  6  between
  7  to_date('13-SEP-2021 01:00:19','DD-MON-YYYY HH24:MI:SS')
  8  and
  9  to_date('13-SEP-2021 06:52:25','DD-MON-YYYY HH24:MI:SS')
 10  and program like 'PSAESRV%'
 11  group by SESSION_ID
 12  order by active desc;

SESSION_ID     ACTIVE
---------- ----------
      1039       2088
      1005        957
       858        804
      1006        195
       993        150
       942        143
      1060        114
       940        109
       927         69
      1026          1

I was trying to figure out which PSAESRV session was the one that ran all 6 hours. Session 1039 was the one, so I used it in ashfmscount.sql.

ashfmscount.sql – find the top signatures for the session that is the batch job

ashtopelapsed.sql – get the text of the top SQL statements grouped by signature in the time frame that the job ran

plan.sql – get a plan and output an outline hint to match it

test2.sql – test query with and without outline hint. also get plan hash value in memory so it can be extracted by sqlt.

optimizerstatistics – get index columns and column statistics

fmsstat2.sql – show execution history of 2407690495325880429 from AWR – here is before the profile:

FORCE_MATCHING_SIGNATURE SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME        
------------------------ ------------- --------------- -------------------------
     2407690495325880429 03jh5xnnsts7q      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 17wshhdq89azx      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 1k4a28bdg7r8h      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 3zhumhfujfwj5      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 59xs2q8215bby      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 666s10cy2k3fh      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 6bdybc28zn3fg      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 7050x70hjg5mj      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 7wvsquchacp2p      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 8yjhydta27wsn      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 ahj6g61c8hzcq      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 ba76xa0dpua3n      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 bfzkxy486cfu5      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 cg65kxmcytw00      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 d63wz9wyhp10y      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 dk7annutx6n4z      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 dpw0m2t5dmcs8      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 fx4q3cpn84zbf      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 gbc4ksykvucqk      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 gn81xu18vsctn      3305693634 06-AUG-21 02.00.46.749 AM
     2407690495325880429 gxvwq870b82tg      3305693634 06-AUG-21 02.00.46.749 AM

vsqlarea.sql – show current execution details such as plan for 2407690495325880429 after the profile

LAST_ACTIVE         FORCE_MATCHING_SIGNATURE SQL_ID        PLAN_HASH_VALUE
------------------- ------------------------ ------------- ---------------
2021-09-14 17:54:36      2407690495325880429 5vczbrua1hnyk      1538925713
2021-09-14 17:54:36      2407690495325880429 gb7awt1g0spj3      1538925713
2021-09-14 17:54:34      2407690495325880429 54tz4ks94ard2      1538925713
2021-09-14 17:49:13      2407690495325880429 g9r370krad681      1538925713
2021-09-14 17:49:12      2407690495325880429 g4mqhstqmb616      1538925713
2021-09-14 17:43:19      2407690495325880429 apfc4uuksxzt2      1538925713
2021-09-14 17:43:18      2407690495325880429 g1k89u3xqb0nt      1538925713
2021-09-14 17:43:17      2407690495325880429 1rshyrb28xppz      1538925713
2021-09-14 17:36:52      2407690495325880429 934a88k6ct5ct      1538925713

The job ran last night in about 36 minutes instead of almost 6 hours, so it was a big improvement.

I was not able to improve the performance of this long running job using AWR or ASH reports but once I fixed an issue with my ashtopelapsed.sql script and used it along with my other ASH and non-ASH scripts I was able to resolve the performance issue. The job had many update and select statements that were too short to show up on the AWR and ASH reports but when summarized by their force matching signatures they showed up at the top of my ashtopelapsed.sql and ashfmscount.sql scripts’ outputs. Going forward I hope to use the collection of scripts described in this post when faced with tuning situations where an AWR report does not point to a top SQL statement. I may start using these on a regular basis even in cases where the AWR report does provide helpful information just to have another source of information.

Bobby

Posted in Uncategorized | 2 Comments

Outline Hint from dbms_xplan does not contain PARALLEL hint

Posted on September 3, 2021 by Bobby

I was tuning a parallel query and was a little surprised to find that an outline hint from EXPLAIN PLAN and dbms_xplan.display did not contain a PARALLEL hint. So, a query I was testing did not run in parallel with the outline hint. I had to add the PARALLEL hint. It is surprising because I think of an outline hint as a full set of hints.

I built a testcase to demonstrate this. Here is a zip of its script and output on a 19c database: blogparalleloutlinehints.zip.

I created a small test table and got the outline for the plan of a select statement with a parallel hint:

drop table test;

create table test noparallel as select * from dba_tables;

explain plan into plan_table for 
select /*+ parallel(test,8) */ sum(blocks) from test;

set markup html preformat on

select * from table(dbms_xplan.display('PLAN_TABLE',NULL,'ADVANCED'));

The outline hint looks like this:

  /*+
      BEGIN_OUTLINE_DATA
      FULL(@"SEL$1" "TEST"@"SEL$1")
      OUTLINE_LEAF(@"SEL$1")
      ALL_ROWS
      DB_VERSION('19.1.0')
      OPTIMIZER_FEATURES_ENABLE('19.1.0')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

Notice that it just has a FULL hint and nothing about running the query in parallel.

Next, I ran the query with the outline hint but without any parallel hint like this:

select   /*+
      BEGIN_OUTLINE_DATA
      FULL(@"SEL$1" "TEST"@"SEL$1")
      OUTLINE_LEAF(@"SEL$1")
      ALL_ROWS
      DB_VERSION('19.1.0')
      OPTIMIZER_FEATURES_ENABLE('19.1.0')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */ sum(blocks) from test;

It ran with a serial (non-parallel) plan:

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |    27 (100)|          |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|   2 |   TABLE ACCESS FULL| TEST |  1785 |  5355 |    27   (0)| 00:00:01 |
---------------------------------------------------------------------------

Then I ran the query with both a parallel hint and the same outline hint:

select   /*+ parallel(test,8)
      BEGIN_OUTLINE_DATA
      FULL(@"SEL$1" "TEST"@"SEL$1")
      OUTLINE_LEAF(@"SEL$1")
      ALL_ROWS
      DB_VERSION('19.1.0')
      OPTIMIZER_FEATURES_ENABLE('19.1.0')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */ sum(blocks) from test;

It ran the desired parallel plan:

-------------------------------------------...---------------------
| Id  | Operation              | Name     |...|IN-OUT| PQ Distrib |
-------------------------------------------...---------------------
|   0 | SELECT STATEMENT       |          |...|      |            |
|   1 |  SORT AGGREGATE        |          |...|      |            |
|   2 |   PX COORDINATOR       |          |...|      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |...| P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |...| PCWP |            |
|   5 |      PX BLOCK ITERATOR |          |...| PCWC |            |
|*  6 |       TABLE ACCESS FULL| TEST     |...| PCWP |            |
-------------------------------------------...---------------------

In other tuning situations with plans that did not execute in parallel the outline hints from dbms_xplan.display_cursor did a great job of capturing all the details of the plan. But when extracting an outline hint from a query that has a parallel hint in it, I needed both the outline hint and the parallel hint to get the same plan.

Bobby

Posted in Uncategorized | Leave a comment

Bobby Durrett's DBA Blog

netstat -o shows that (ENABLE=BROKEN) turns on TCP keepalive

350 Second Timeout Causes ORA-03135 Errors in AWS DMS

Docker Sample Application behind Zscaler

350 second AWS timeout causes JDBC call to hang

Do SET_AUDIT_TRAIL_LOCATION before INIT_CLEANUP

Oracle 21c Laptop Install

Dropping sequences whose name starts with ISEQ$$

latch: shared pool waits after patching to 19.13

Tweaked script to show top SQL by force matching signature

Outline Hint from dbms_xplan does not contain PARALLEL hint

Subscribe to Blog via Email

Archives

Bobby Durrett\’s DBA Blog

Orafaq blog aggregator