Querying Many Databases in Parallel in Python

I have some Python scripts that I use to access a bunch of databases to gather information such as the size of the database. Usually it is not that important how long they take so I have been running queries against a list of 100 databases one at a time and it can take a while depending on what I do. Coworkers of mine have worked on running Unix/Linux shell scripts in parallel to hit a bunch of systems like this more quickly. So, I thought I would look at how to do the same in Python. I am sure there are more complicated ways to do it, but I got good results based on the simplest example from this part of the Python documentation:

https://docs.python.org/3.8/library/multiprocessing.html

Here is a simple Python script to get the size of the database files from four databases at the same time:
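The sketch below assumes the cx_Oracle driver and made-up connect strings – adjust the connection details and the size query for your own environment:

from multiprocessing import Pool

import cx_Oracle

# made-up connect strings - replace with your own databases
databases = [
    "MYUSER/MYPASSWORD@db1",
    "MYUSER/MYPASSWORD@db2",
    "MYUSER/MYPASSWORD@db3",
    "MYUSER/MYPASSWORD@db4",
]

def dbspace(connect_string):
    # connect to one database and return the total size of its datafiles in gigabytes
    con = cx_Oracle.connect(connect_string)
    cur = con.cursor()
    cur.execute("select sum(bytes)/(1024*1024*1024) from dba_data_files")
    total_gig = cur.fetchone()[0]
    cur.close()
    con.close()
    return total_gig

if __name__ == '__main__':
    # Pool(4) works on four databases at a time
    with Pool(4) as p:
        print(p.map(dbspace, databases))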

All I did was take the first example from the multiprocessing documentation and replace f(), which squared a number, with dbspace(), which connects to a database and runs a query to get the total database size. I experimented with different Pool() sizes. It is 4 in the example.

I picked 82 non-production databases that I have access to and ran the same query sequentially in a loop and it took 375 seconds. I took the code above and increased the Pool to 32 processes and it ran in 24 seconds. That is about a 15x speedup.

I am sure that I could get a lot more complicated with all this stuff, but this works well with minimal effort.

Bobby


Cleaned out my desk…due to COVID-19 remote work

I guess I have had a desk in an office since around Summer 1990 when I first reported for work in San Bernardino. I have moved offices multiple times including across the country twice and did not have a desk for a week or two during those moves. But otherwise I have always had a place where I could leave my stuff at work. A couple of weeks ago I cleaned out my desk, maybe for good, even though I am still employed. Like a lot of people COVID-19 has pushed my work to 100% remote work out of my home. So, it was no surprise that I needed to clean out the stuff I had in my unused cubicle in the office. It is fine, but it still is kind of the end of an era and I am a little nostalgic mainly about the framed pictures and certificates that I am disposing of, several of which I have had for two decades. Of course, the funny thing is that the last time I moved desks from one cubicle to another I put all these framed pictures and certificates in a drawer and have never gotten the energy to put them back up again. I can never get them that straight when I hang them up. Anyway, while I was off work on Wednesday, I took all the pictures and certificates out of their frames. Will probably donate the frames and throw away the pictures if they are duplicates. But the certificates are a record of my efforts at self-improvement at work.

Kind of a sad image of all my work pictures out of their frames

These are my family pictures that I had at work. I have had some of these at work for over 28 years. Of course, like I said, I had these in a drawer at work anyway! Most of these are duplicates and one thing about working from home is I see the real people, so I do not need their picture to remind me of them!

Empty frames in my garage ready to be sent to Goodwill

I figured I would donate the frames. I do not plan to hang anything up at home and if we do go back in the office, I think I will just keep images on my computer instead.

Oracle 7 certification before OCP program

This was so long ago. I started on Oracle 7. I think it was 7.0.16. I do not remember much except that it was a hard test for me.

First OCP Test – not an upgrade – I took all the tests

This is the most significant OCP certification. There is so much I could say about my career and my self-development as represented by this point in time. Between the Oracle 7 cert in 1998 and this one in 2003, I went through a time of discouragement in my career, and part of that was not pursuing the Oracle 8 OCP cert. I had a lot of non-technical things to do at work and the challenge of providing for my young growing family and did not focus on Oracle learning. But when I finally decided to leave my first job, which I thought would be my only and last job, I decided to get OCP certified to help me get a new job and that was when I got my 9i cert. I think I had to pay for an Oracle class out of my own pocket to get this one, which was a big deal at the time. Since then all my OCP tests were upgrade exams and my employers have paid for most of my tests. They paid for my first 12c test that I failed but I paid for the one I passed – happy to do it!

10g OCP – cannot remember this at all

I must be getting old because I cannot remember this at all. I just celebrated my 15th year with my current employer so the rest of these OCP certs must have been while here. 3 years per cert seems about the way it goes. I think I was still in South Carolina for this one.

11g OCP cert – I have used 11g a ton

I do not remember this one either, but it was 5 years from the last so that is a little significant. We merged DBA teams into one team around 2010 and I moved to Arizona at that time. We have a ton of 11.2 systems, but I think this was an 11.1 cert.

12c OCP cert – my last one?

This one nearly killed me. I cannot believe it has been 6 years since I took this test. I guess I am due for another one, but I do not know what I would study. Maybe some Oracle cloud cert? 20c? I wrote a couple of blog posts about this one. One reason that I have not had Oracle certs since 2014 is that I have been doing other kinds of self-development such as through Toastmasters.

First Toastmasters Leadership award
First speaking award
Second leader award – next level
Did the first level award again

As you can tell from the certificates, I did a bunch of Toastmasters from 2016-2018. This has been very valuable to me in my career. Toastmasters has really changed the way I think about communicating with other people at work about Oracle database technology. I have put the focus on making myself a better communicator instead of getting frustrated with people who do not understand or agree with things I am trying to say. But since 2018 I have slowed down a bit and the last few months have been a mess with COVID-19. I have not felt like I had time to blog or pursue self-improvement. But hopefully that has changed for the better now. Maybe things will improve going into next year? The big question is what do I work on next for self-development? Oracle certification? Do I develop Toastmasters mastery? Online Computer Science academic studies? Other databases – SQL Server, MySQL, etc? Cloud – the future is cloudy. 🙂

I do not know what the future holds but my desk cleanout and taking my pictures and certificates out of their frames paints a picture of where I have been. What comes next?

Bobby


AWR and V$ Scripts for a SQL_ID

Quick note to myself. If I am tuning a query by its SQL_ID I use these scripts:

From AWR:

findsql.sql – finds SQL_ID of queries that match certain strings

sqlstat.sql – shows execution history of SQL_ID

getplans.sql – shows all plans for SQL_ID

From V$ views:

vfindsqlid.sql – finds SQL_ID of queries that match certain strings

vsqlarea.sql – shows current executions of SQL_ID

extractplansqlid.sql – shows current plan of SQL_ID

I mainly mine the history from the AWR but I have been looking at V$ view information more recently, especially after putting in a fix to a slow query plan. I updated this on 6/18/21 to include the find scripts. The six scripts together are a pretty good toolkit for finding a SQL_ID and getting information about it. The AWR gives you history (we keep 6 weeks of hourly snapshots). The V$ scripts give you current information.

Bobby


Python Formatted Print

I wanted to write a quick post to document how to format nice columns in Python. Mainly I want to remember this myself and if I put it on my blog, I can find it quickly. There are probably better posts out there but so be it.

If you have read my other posts, you will realize that I am command line oriented. Yes, I know that makes me a dinosaur. 😊🦕🦕 My two main GitHub repositories are command line scripts. PythonDBAGraphs displays graphs but you run it from the command line. OracleDatabaseTuningSQL is a collection of SQL*Plus scripts that I run from the command line. Typically I work on my Windows 10 laptop running the Windows version of command line SQL*Plus in a command prompt window. Some of the SQL*Plus scripts like tablestats.sql use the column command to format query output in nicely sized columns:
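The relevant lines are just a couple of column commands, probably something like this:

column owner format a20
column table_name format a30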

The column commands just trim owner to 20 characters and table_name to 30 to format the output nicely.

Since I am not a professional application developer and since programming is not my primary task, I tend to learn just enough of something to get by. In the C language that I learned in college around 1984 – eons ago – I know that there is a formatted print command printf that will do all kinds of fancy output. But I just use it in the most basic way such as:

printf("%s is a string.\n",mystringvariable);

%s is for strings, %d for integers, %f for floats – simple. But you can format columns nicely if you need to. I recently did a little formatted output for a Nethack code modification that I played with. In my hacked version of the delivered files.c I put some commands to print out some debugging information in column format (edited for clarity):

fprintf(fptr,"\nInventory Letter     Object Type or Name                     Priority\n");
fprintf(fptr,"---------------------------------------------------------------------\n");

for (i = 0; i < num_autoletter ; i++) {
    fprintf(fptr,"%c                    %-40s    %4d \n",letter, object_type_or_name, priority);
}

The output looks something like this:

Inventory Letter     Object Type or Name                     Priority
---------------------------------------------------------------------
b                    blindfold                                      1
l                    magic lamp                                     1
l                    lamp                                           2
l                    oil lamp                                       2
l                    brass lantern                                  2
k                    key                                            1

This is just a game, but the point is that I needed to line up the output in columns. %-40s and %4d were the C formatting strings that I can never remember and had to look up for that situation, but they did the job. So, the question is how to do the same in Python?

Usually in Python I just use the print() command and pass it a string that I construct. For example:

Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:20:19) [MSC v.1925 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> i = 5
>>> print("i = "+str(i))
i = 5
>>>

I use the str() function to convert numbers to strings and the concatenation operator + to combine strings. Then I pass the full string to print(). Easy and good enough for basic output that does not need to be formatted into columns or otherwise. I know there is a Python equivalent to C’s printf formats, but it just is not in the forefront of my mind. I have to look it up to use it. I recently did that for a Rosetta Code example program. The output was supposed to look like this:

Name     Bits  Start  End
=======  ====  =====  ===
ID         16      0   15
QR          1     16   16
Opcode      4     17   20
AA          1     21   21
TC          1     22   22
RD          1     23   23
RA          1     24   24
Z           3     25   27
RCODE       4     28   31
QDCOUNT    16     32   47
ANCOUNT    16     48   63
NSCOUNT    16     64   79
ARCOUNT    16     80   95

Here is the code to print this formatted output (simplified):

print("Name     Bits  Start  End")
print("=======  ====  =====  ===")

# each entry in lines is a (name, bits, startbit, endbit) tuple
for name, bits, startbit, endbit in lines:
    print('{0:7}    {1:2d}     {2:2d}   {3:2d}'.format(name, bits, startbit, endbit))

I guess the 0:, 1:, 2:, 3: numbers represent the arguments to the format method of the type str. A regular string like “Hello World” is of type str in Python.

>>> type("asdfa")
<class 'str'>
>>>

:7 seems to just say that this is a string and stick the string in a 7-character width column. :2d seems to say that this is an integer and stick it in a two-character column. The spaces between the {} bracketed format strings line up the columns. I know this is a simple example.
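To tie this back to the C example above, here is a small made-up sketch (not from the Nethack code) that lines up the same kind of columns with format() – {1:<40} does what %-40s does in C and {2:4d} does what %4d does:

# made-up inventory rows just to show the column formatting
items = [("b", "blindfold", 1), ("l", "magic lamp", 1), ("k", "key", 1)]

print("Inventory Letter     Object Type or Name                     Priority")
print("---------------------------------------------------------------------")

for letter, name, priority in items:
    # {1:<40} left-justifies the string in 40 characters like %-40s in C
    # {2:4d} right-justifies the integer in 4 characters like %4d in C
    print("{0}                    {1:<40}    {2:4d}".format(letter, name, priority))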

There are a lot of nice web pages out there. This is from the tutorial:

7.1.2. The String format() Method

This seems to be from the reference:

Format String Syntax

Crazily complex examples:

Format examples

There is a lot of great Python help on print formatting on Stack Overflow. I had a great post that I wanted to link in but could not find it.

Anyway, this is not a database post, but I use Python in my Oracle database work and wanted to do a quick post for my own reference if nothing else.

Bobby


SQL Profile example when best plan not clear

I resolved another production performance issue with a SQL Profile yesterday. I have several posts about SQL Profiles, so I do not want to be redundant, but this case was a little different because it was not clear that I had a better plan. I want to document the challenge that I had deciding if I had the best plan and show the resolution.

On September 21, when I think I was on vacation or otherwise not in the office, there was a big performance issue on an older production database. The on-call DBA identified the sql_id of the top SQL statement as 30q69rbpn7g75. But he and an application support developer together could not connect that SQL statement back to the long-running reports that were impacting the business. Eventually the issue went away later that night. Here is some of the execution history from that original issue:

This output is from my sqlstat.sql script. Notice how earlier in the day there are several plans with widely varying elapsed times. Plan hash value 1732425686 averages 66 milliseconds during the hour ending at 10 am. I do not know why the query has so many different plans. 2823019537 was the bad plan and it got locked in throughout the afternoon and into the night.

Later in that same week I reviewed the plans and found that the longest running plans used certain indexes on the top table and the shorter ones used another index. I looked at the “Segments by Logical Reads” section of an AWR report during the problem time and found that the top segment was index X6_WFCTOTAL. I used my getplans.sql to get all the plans for SQL id 30q69rbpn7g75 and found that the fastest ones used range scans against index X1_WFCTOTAL and the slow ones did range or skip scans against indexes X5_WFCTOTAL or X6_WFCTOTAL. So I picked one plan, 382022017, and used coe_xfr_sql_profile.sql to force 30q69rbpn7g75 to always run with plan 382022017, which used index X1_WFCTOTAL. Here is some execution history of the plan I picked:

This plan was averaging less than 1800 milliseconds. But after putting in the supposedly better plan, it seemed to have worse execution times than other plans. Also, the plan hash value was different than 382022017. Here is the execution history from when I put this SQL Profile in place:

Notice first that the plan hash value is 653445232 instead of 382022017. I usually see things like this when the plan has system-generated temporary table names, but I have not seen that in this plan. There must be something like that going on. Either that or the SQL Profile just locks in a slightly different plan. Anyway, 653445232 is the plan caused by the SQL Profile. Notice how the average elapsed time hangs around 90,000 to 100,000 milliseconds. But 382022017 had elapsed times under 1800 milliseconds. Seeing these results, I dropped the SQL Profile. Kind of like a doctor who swears to “do no harm,” I did not want to slow down queries that were running very efficiently in my attempt to prevent another system slowdown or outage like we had on September 21. I dropped the SQL Profile on September 29.

Then yesterday, October 19, the problem recurred. This time the bad plan was 3908549093 and used a skip scan. I seem to see a lot of bad plans with skip scans. Here was the execution history before I put the SQL Profile back in:

Between 12 and 1 pm the system was really bogged down with executions of 30q69rbpn7g75 taking almost an hour, 3543778 milliseconds. So, I put the SQL Profile back in. I had the script sitting there from the first time I tried it. These are all kept in the sqlt/utl directory. After putting it in things cleared up and the backlog of reports slowly emptied. I think we killed one long running session and had the user rerun the report with the SQL Profile in place.

But with the SQL Profile in place it still runs consistently at around 90,000 milliseconds even today. There are no executions under 1800 milliseconds like before. It is a puzzle, but it got us through the issue. I am planning to leave this in place to prevent another production outage, but I suspect that these reports may be running longer than normal in many cases. But at least they are consistent and the business needs are being met.

So, this is another example of a SQL Profile to the rescue in a production database performance issue. But this was different because it was not clear that the one plan was always the best one. I went so far as to put it in and take it out and then put it in again. I ended up leaving it in because it resolved a performance issue that we had twice and do not want to have in the future. Hopefully any less than optimal performance outside of these peak times will be worth it since we are preventing an outage with significant impact to the business.

Bobby


Upgrade Data Migration Choices – HP to Linux

1. Introduction

I just finished moving and upgrading an Oracle database from 11.2.0.4 on HP Unix Itanium to 19c on x86-64 Linux on VMWare along with an army of coworkers on our project team. I want to use this post to describe the choices I considered for migrating the data from the old to the new system. I mention my coworkers because this was a major collaborative effort and I do not want to take credit for the work of others. But I did have to think through the best way to move this big database cross platform and I think that my evaluation of those options in dialog with the project team was a way that I contributed. So, I can blog about it. Also, I think it is worth posting about this subject because the choice of data migration method was challenging and interesting to me. So, if I can find a way to communicate about it clearly enough it might have value to others and even to myself if I come back to this later.

First, I think I should explain some of the parameters for this upgrade/migration. The database files total about 15 terabytes of space. I guess actual data is 7-8 TB but everything else together adds up to 15. The database has tens of thousands of partitions and subpartitions. These partitions/subpartitions are manipulated by batch processes on the weekend and in the weekday evenings. Some of the tables are partitioned by date but others by columns that are not date related. The source system is running 11.2.0.4 and the target system 19c. The source and target platforms have different endianness. At the beginning of the project our goal was to minimize downtime for the final cutover weekend, so I tried to find methods of cutting over from the old to new database with only an hour or two of downtime max. The system has an online web front end that business users use during the weekdays as well as very resource intensive batch jobs on the weekend and on some weeknights. One goal of the project besides upgrading to a newer platform was to improve performance. On some weekends the batch processes did not complete in time for the users to see the updated data on Monday morning. The much faster x86-64 processors on the new system should speed up the weekend batch processes to meet the business need to complete before the Monday workday. CPU was pegged for several hours each weekend on the older HP Itanium system so we knew that newer, faster CPUs would help.

2. Transportable Tablespaces and Cross Platform Incremental Backup

Next, I want to talk about different options we considered. Another DBA on my team worked with an earlier proof of concept for this project to show the potential benefits of it. At some point she tried out the ideas in an earlier version of this Oracle support document:

V4 Reduce Transportable Tablespace Downtime using Cross Platform Incremental Backup (Doc ID 2471245.1)

This is a fairly complicated procedure designed to speed up the move from Unix to Linux by allowing you to use Transportable Tablespaces to move your data over, converting to the new endianness, and then apply incremental backups of the source system to catch up any changes, changing endianness of the backups as you go. Transportable Tablespaces are kind of a pain, but I have worked on them before and they can be done. But the key limitation of this process is that just before you go live on the new system you must use Data Pump to import all the metadata from your source database. But since we have tens of thousands of partitions and subpartitions the time to import the metadata could be several hours. So, I felt like we could use this approach but with all the complexity and risk we still might have a lengthy period of downtime and another approach might work better. I think the key lesson from this approach is how hard it is to change endianness using a physical backup when you have a huge data dictionary. If we were moving from Linux to Linux, we probably could have used some sort of physical copy along with the application of incremental changes without the big metadata import at the end. But with the big data dictionary and the cross-platform nature of the upgrade the physical copy option (transportable tablespaces) seemed too slow.

One reason I liked this option was that it would work well with NOLOGGING changes. I was concerned early on that I could not use replication methods that pulled changes from the redo logs because some updates would not be logged. But incremental backups do pick up blocks that are updated even if the updates are not written to the redo logs. Eventually I did a bunch of research and found that I could turn on FORCE LOGGING on the source database and that opened up the option of using the redo logs. The problem of the long time to export/import the metadata with the Transportable Tablespaces option pushed me to pursue the FORCE LOGGING research to make sure we could use it.

3. Pre-Loading Static Historical Data

The second approach I considered and discussed with the team early on was exporting and importing historical data in advance and only migrating over actively updated data during the cutover window. If you have most of your data in tables that are partitioned by a date column you could potentially pre-load the older static data on your new system and then migrate a small percentage of your 15 terabyte database during the cutover window. I found that about 4 terabytes had not been updated in the past 90 days and that made me stop pursuing this option. Here is the script I used: statictabsum.sql. I just looked at the last_analyzed date on tables, indexes, partitions, and subpartitions. Not perfect but it gave me some idea of how much data was static. Slightly edited output:

SQL> select sum(bytes)/(1024*1024*1024*1024) total_tb
  2  from withlastanalyzed
  3  where
  4  last_analyzed < sysdate - &&DAYSOLD;

  TOTAL_TB
----------
4.34381223

The system that I was upgrading is a pretty complicated system with many batch jobs and many interfaces with external systems. I would have to work with the development team to understand what data was truly static and there would be the chance that we would make a mistake and not copy something that we should. I would hate to go live and find that I missed some data that should have been updated on the partitions that I thought were static. Yikes! With my imperfect estimate of how much data could be pre-loaded and knowing the complexity of the system I felt that it was not worth trying the pre-load of the static data.

4. Oracle GoldenGate Replication

The third thing I tried, and thought would work really well for us, was Oracle GoldenGate replication. Several years ago, a coworker of mine had upgraded and migrated the database for a transactional system using GoldenGate with minimal downtime. Oracle has a white paper about how to do this:

Zero Downtime Database Upgrade Using Oracle GoldenGate

The idea is to start capturing changes to your source system and then do a full export/import to your target system. Then apply all the changes that happened on the source database during the export/import to the target db using GoldenGate. We had this setup and began testing it but we had failures because of changes to table partitions and subpartitions by the application batch jobs. We had setup GoldenGate to replicate DML but not DDL. We researched it in the manuals and found that to replicate DDL (partition drops and adds) the source system had to be 11.2.0.4 or later and have compatibility set to 11.2.0.4 or later to do DDL replication in the most current and desirable way. See this section of the 12.2.0.1 installation manual:

13.1.1 Support for DDL capture in integrated capture mode

Unfortunately, even though our source system was on 11.2.0.4 the compatible parameter was set to 11.2.0, the default. I wonder why 11.2.0.4 defaults to compatible=11.2.0? Sure, we could change the parameter, but you cannot easily back out a change to compatible. And we would essentially have to retest our entire application before changing it. Given that our project was running out of time and money that was not an option. So, that led to our final choice.

5. Data Pump Export and Import

The last migration method that we considered and the one we just used in the production migration was Oracle Data Pump export and import, the expdp and impdp utilities. This is the simplest and most widely used method for copying data from one Oracle database to another and it works great across platforms with different endianness and different versions of Oracle. We used this method to populate our new databases for testing. Originally we thought about using the same data migration method for testing that we planned to use for the final production cutover, but we did not have the time and money to get GoldenGate set up before all the testing began, so we were not able to use it to populate our test databases. We had to use the simpler Data Pump utility. We had a whole crew of people lined up to work on the project, so we had to simply do a full export/import and cut them loose. As the project went on we streamlined our export/import method, including working with Oracle support to get a fix for a bug that affected our data migrations. As our project dragged on for a number of months beyond our initial deadline and as we discovered that GoldenGate was not going to work for us without a lot more work, I started to think about using Data Pump export/import, our last resort. At the time it was taking 50 hours to do the full export/import and I asked if we could get approval for two extra days of downtime – two full working days of application downtime. To me 50 hours to migrate a 15-terabyte database was not bad. I had expected it to take several days based on prior experiences populating dev and test databases from this same source system. The HP Unix hardware was older and had slower network adapters so that was part of the reason for the long data migration time.

Once we got the approval to take the extra downtime, I had to fend off suggestions to go back to preloading the historical data or to ignore tables that we think are obsolete. By this time, I was convinced that a full export/import made the most sense because all our testing was done on systems that were populated with the same method. We had the business approval for the downtime, and we knew that we had been testing for months on databases that had been populated with this method. Finally, our Unix/Linux/Storage team came up with a clever way to cut our export/import time almost in half without changing anything on the database side. Essentially, they figured out how to move the filesystem that we were exporting to onto an HP Unix blade with a faster network adapter. The original system had a 1 gigabit network adapter and the new system had 10 gigabit. Also, they setup an NFS mounted filesystem so we could import straight over the network rather than copy all the files to the target server and have all that duplicate storage space. We tested this approach several times and then the real thing went in without issues and even a little faster than we expected.

6. Conclusion

My main concern for this project was how to migrate such a large database and change the endianness of the data in a short enough time to meet the needs of our users and our company. This concern drove me to investigate several potentially complicated approaches to this data migration. But in the end, we used the simplest and surest method that we had already used to populate our test databases. We took several steps to tune our full Data Pump export/import process. The seven largest tables were broken out into their own parfile and exported to disk uncompressed in parallel. Their indexes were rebuilt parallel nologging. Then the Unix/Linux/Storage team did their magic with the faster network adapter. But even with these helpful performance enhancements our approach was still simple – Data Pump export and import. The more exotic methods that we tried were thwarted by the nature of the database we were upgrading. It had too many subpartitions. The application manipulated the subpartitions during the week. We had the wrong compatible value. Finally, budget and schedule concerns forced the decision to go with what worked, export/import. And in the end, it worked well.

I want to conclude this post with high level lessons that I learned from this process. There is value in simplicity. Do not use a more complex solution when a simpler one meets your needs. The Transportable Tablespaces and GoldenGate options both were cool to work on and think about. But it takes time to dig into things and complexity adds risk of failure. If a simpler solution is safer, meets the business need, and takes less time and money why not use it? I guess the other high-level lesson is that it is easier to get approval for downtime when your project is behind schedule and over budget. When push came to shove downtime was less important than when we went live. Maybe if I had pushed for the longer downtime from the beginning and planned all along to use export/import the project would have been shorter. But I did not know that GoldenGate would run into the issues it did, so it was harder to insist that there was no way to eliminate substantial downtime up front. I also did not know at the beginning of the project that the export/import method could be streamlined to run in 1 day instead of several. Maybe the real lesson is that you have to work through these data migration options along with your project team and make the best choices that you can at the time based on technical issues that you discover and the business needs as they evolve with the project. This project ended with a simple data migration approach that met the company’s needs, but future projects may have different technical and business parameters and the other more complex approaches may be more appropriate to future situations.

Bobby


Elementary, my dear DBA!

I read all the Arthur Conan Doyle Sherlock Holmes stories and have recently watched a couple Star Trek The Next Generation stories with the android Data pretending to be Holmes and it gave me the idea for this post about diagnosing an Oracle hang/bug that has been giving me fits lately. The logic involved in solving my Oracle problem reminds me of this often-quoted Sherlock Holmes saying:

“…when you have eliminated the impossible, whatever remains, however improbable, must be the truth…”

Sherlock Holmes

I have often had moments in diagnosing Oracle problems that resemble the idea behind this quote. I think this happens when the information I have in front of me does not directly lead to a solution. Most of the time it does. You get an error, you look it up, and you fix the problem. But sometimes things do not make sense. It looks like the database or hardware is not working correctly even though it is. Somehow, I have to discipline my mind to believe that the system is working correctly and follow the logical conclusions. Well, maybe I should talk about my example and it will be clearer.

I spent a week moving my daughter into her college dorm and then visiting my parents in Alabama. It was a lot of travel along with the COVID-19 fears although it all went well. (I wore my mask everywhere). But then I got back in the office (online from home) and found that the project I was working on had a batch job hung up. I was also on call last week, so it was a little stressful. The hung session was a query coordinator process for a parallel INSERT statement, and it was hung on a PX Deq: Join ACK wait. It was hung for several days until we killed it, so it wasn’t just slow. We had already tested this job and every other batch job with no problems. Most were running significantly faster on our newer, faster x86-64 processors. But we were doing a final batch schedule test that runs several jobs at the same time trying to mimic production. No problem, I thought. I will just head to Oracle’s support site and look up the wait event and see what to do. But my search turned up empty. If I was on 10.2 or some old version of RAC maybe, but nothing for 19c non-RAC. I tried Google too but nothing. Hmm. So, I opened an SR with Oracle. I made sure to get a statedump and hanganalyze while the session was still hung up. I figured no problem. Oracle will peek at the dumps and give me a patch and we will be on our way. But no deal. Oracle support was helpful, and I was getting them the information they asked for but nothing just jumped out at us. Plus, I was doing my own investigation looking at logs, trace files, and DBA_HIST views, etc. I was getting frustrated. This upgrade is supposed to go live soon and now all of a sudden at the last minute I am getting this weird wait event that no one seems to have heard of even though we have tested on this system for months with nothing like this.

I think this is where the Sherlock Holmes logic of deduction applies. Underlying my frustration is an assumption about reality. My assumption is that this horrible unknown wait is showing up in the same conditions that all our successful testing was in. The idea that the current testing is in a radically different configuration is the “possible but improbable” category. In this case we have two database VMs setup with the same number of cores and RAM and the same version of Oracle including patches and close to the same data, etc. We have a new production database and a performance database with the idea that development could do their final testing in a performance database that matches production. As it turns out all the init.ora parameters on the two databases are the same except for the ones that size the SGA and PGA memory. We are moving from an HP Unix system with a lot more physical memory than we have on these VMs so the SGA_TARGET and PGA_TARGET settings on that system together exceed the memory available on the new VMs but we copied the parameters from the HP system to the new Linux ones. Evidently, I correctly resized these down on the production system but never on the performance system. So, somewhere in our full-scale batch testing with a bunch of parallel SQL statements overloading the system and a ton of swapping of memory caused by the wrong parameters some parallel server process died leaving my query coordinator hung on PX Deq: Join ACK. I do not really know if this is true. But it fits the facts. I synced these memory parameters up with the working production database, we reran the problem job, and it ran like clockwork. It really helps to have free memory.

In some ways my example is simple. You cannot set SGA_TARGET and PGA_TARGET to more than your total memory and expect good things to happen. DBA 101. But the more interesting thing, at least to me, is the mindset that I need to have when facing a difficult Oracle problem. Here are some of the truths in my mind during this problem:

  1. Our new production database works great and testing has gone very well for months
  2. Our new performance database is very similar to our new production database
  3. Our batch jobs ran fast and error free individually in production
  4. Run together on perf our batch jobs hung up with a weird wait
  5. Lots of things I looked at including AWR reports, alert logs, etc. looked weird

I guess I would like my brain to more quickly jump to questioning my assumption that the performance database is like production. Sure, running a bunch of batch jobs at the same time will put more load on the system and may create problems. But suddenly hitting some weird bug that no one in the world has ever hit doesn’t really make sense. There must be some reasonable explanation that makes sense without assuming some terrible thing is wrong with our database despite months of success using this version of Oracle on this data and code. Other people would be reporting the same thing and blogging about it.

I am not sure how clear this is. I have something I am trying to say about logic and problem solving with Oracle databases but do not know if I have said it in a coherent way. A working database and hardware do not suddenly start doing weird things for no reason. After eliminating that virtual impossibility, you are left with possible solutions that you have not considered because they seem so unlikely. Elementary…

Bobby

P.S. I have a feeling that people are not that interested in this post which is fine. It does not have any useful scripts in it. But, I ran into another situation today that fits the same theme.

Here are the facts:

  1. Import and index build of large schema takes longer than on earlier runs
  2. Linux admin says there is a lot of I/O on the redo logs
  3. Storage admin says there was a bump in service times despite being on a nice array and SSD
  4. Import uses data pump and indexes built parallel nologging
  5. We have been experimenting with GGS on this system
  6. We are importing more data than in previous run

I guess eliminating the impossible means eliminating the idea that heavy redo log I/O is not occurring. It was occurring no matter how much that conflicts with my model of reality. The improbable is that my parallel nologging index build and data pump are somehow doing a boatload of log I/O. Enter the solution. Somehow this database got set to force logging, probably during our work with GGS. It didn’t take me that long to figure it out today but I could have put the redo log I/O into my thoughts earlier. Anyway, I am having fun with this post and maybe it is not so helpful. Sherlock Holmes is not the sum of all knowledge. But there is something to be said for taking the evidence where it leads even if it is not where you expect.


Simple Performance Tuning Example

I helped resolve an Oracle database performance problem yesterday and I thought I would capture what I did in this blog post. It is a “simple” problem in that the solution was easy once I understood the issue and because this sort of problem happens often. But the process to get to the solution is not completely simple. I think documenting the process may have more value than writing down the details of this specific problem.

Today is Tuesday. Thursday last week our development team was doing some testing and performance of the system was terrible. I was busy working on another project that is my primary assignment right now, so I did not look at it. It was also Thursday before the July 4th holiday weekend and we are all mentally and physically checking out a bit early. Some key people were already off on vacation. So, we shut it all down until Monday (yesterday). Also, late Thursday there was some thought that there was a system issue maybe with our virtual machines’ host or networking. So, we let it go. Monday our dev team said they thought the slowness was caused by bad data including duplicates that were throwing errors on a target system (non-Oracle) in the cloud. So, they cleaned up the data and started a new run. But it was slow as molasses and the application front end was also super slow.

In retrospect, looking at the paragraph above, I bet a lot of DBAs with performance tuning experience would jump to the solution. I can think off the top of my head of a few types of things that might cause these symptoms. But I did not immediately think of it so I went through a process to get there. No leap of intuition was required except maybe at the very end.

Since the problem query was currently running the first step was to look at the session monitor in Toad. I may have looked at HP Unix’s Glance tool, which is like Top, to see the CPU usage and top processes at an OS level. What I saw was a parallel query running with 8 parallel child processes and the query had been running about 30 seconds.

Step two was to take an AWR report. First, I ran a command like this to get a snapshot after the current activity:

exec dbms_workload_repository.create_snapshot;

Then I ran an AWR report for the last two snapshots.

The “SQL ordered by Elapsed Time” part of the report looked like this:

SQL ordered by Elapsed Time

The top query azrtxpx1g6mxy was the one I saw in Toad. The second query is a subselect of the top one. Since the top query averaged 51.23 seconds and spent 46.77% of its time on I/O, I eventually went where I usually go with AWR reports. I went to the “Segments by Logical Reads” section of the AWR report. Here is part of that section:

Segments by Logical Reads

I hid the table names, but the point is that the top segment was doing 99.67% of the logical I/Os. So, the issue had to do with I/O. But looking at the “Top 5 Timed Foreground Events” section I/O seems very efficient.

Foreground Events

The top wait was the I/O event “direct path read” which can be deceptively low on systems that use asynchronous I/O. But this system uses HP Unix file system I/O which does not support asynchronous I/O so the direct path read number is correct. One of my list of possible things that could be wrong before I started was some sort of I/O contention but this report pointed away from I/O hardware.

So, the AWR report and my look at the Toad session monitor point to a SQL statement doing parallel query full scans of a table and running for tens of seconds. So, I looked at the execution plan next. I like to use my getplans.sql SQL*Plus script to extract a query’s plan from the AWR using its SQL_ID. I ran this in SQL*Plus:

@getplans azrtxpx1g6mxy

Sure enough, the same top table was running a parallel full table scan in the plan.

Somewhere in this process I did some counts. I think the simplicity of select count(*) queries may keep people from using them, but count(*) queries are one of the biggest helps to me when tuning a SQL statement. I did two types of counts for this case. I did a select count(*) on the top table and found out that it had 160 million rows. The count took 45 seconds even though it was doing a fast full scan of an index and not the full table. Then I did a count grouped by the column that joins the table to its parent table. This is a typical header-detail join where the child table joins the parent on the parent’s unique key. Then I got the idea that the query might run faster if there were an index on the joining column.

I ran my optimizer statistics scripts to check the index columns and to look at the column statistics to see how selective the joining column is. I found that there was no index on the column that the header table used to join to the problem detail table. So, I recommended that the dev team add an index on that column and including the other columns in the where clause for good measure. They might have to try two or three different column choices and orders to get the best index.

But the dev team was skeptical. This same query runs fine in production. It is not new so what changed? I tried running my sqlstat.sql script to get a history of the execution of SQL_ID azrtxpx1g6mxy but there was no history on our test system going back before July 2nd. I ran it on production and there was no history at all. I found out that this SQL had not run for two months on production and since we only keep 6 weeks history in our AWR any execution history of azrtxpx1g6mxy had fallen off the list.

Finally, I had a moment of insight. If I could have seen a faster plan on production, no insight would have been required. It would have been an almost mechanical process of comparing the good and bad plans. But I finally got the idea of running my same optimizer statistics scripts against production and looking at the indexes. Sure enough, production had an index on the joining column. Duh! I probably should have checked for differences between the tables and indexes in test and prod first, but I had no reason to suspect a difference. So, a coworker put the indexes back in test and everything was fine.

So, it is a simple thing. Indexes were dropped on a test system. Happens all the time. Not profound. But maybe the process of driving from problem statement through the different kinds of evidence and tools to the solution is more profound. It gets results. What does the AWR report point to? The top segments pointed to a table. The counts and column statistics suggested the need for an index. Finally, a compare of prod and test showed that an index was missing.

Bobby


Fixed issue with DBA_HIST_SQLSTAT script

I got rid of the old sqlstat.sql, sqlstat2.sql and sqlstat3.sql scripts in my GitHub repository and fixed an issue that was in sqlstat3.sql. Now the current version is just called sqlstat.sql. This one script should work well in all the situations where the other three were useful. (One script to rule them all…)

The old version of sqlstat3.sql was showing rows with plans that were not being run so I added a condition to only show plans that actually had any elapsed time. Slowly getting more useful.

Old output:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA
------------- --------------- ------------------------- ----------------
g0wqzdhxy62mc      2903317857 14-APR-20 01.00.04.320 PM                0
g0wqzdhxy62mc      4109593498 14-APR-20 01.00.04.320 PM              440
g0wqzdhxy62mc      1569680226 14-APR-20 01.00.04.320 PM                0
g0wqzdhxy62mc      2845678604 14-APR-20 01.00.04.320 PM                0

New output:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA
------------- --------------- ------------------------- ----------------
g0wqzdhxy62mc      4109593498 14-APR-20 01.00.04.320 PM              440

I really was not using the other two scripts anymore and was only using sqlstat3.sql. They are in the GitHub history if anyone needs to revisit them. Going forward I will use the new sqlstat.sql.

Bobby


How to email from Python

I got an email asking how to send an alert email from a Python script. Here is an example based on a script I have set up:
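A minimal sketch of that kind of script, using the standard smtplib and email modules (the addresses are made up):

import smtplib
from email.message import EmailMessage

# build the alert message - the addresses here are made up
msg = EmailMessage()
msg["From"] = "oracle.alerts@example.com"
msg["To"] = "dba.team@example.com"
msg["Subject"] = "Alert from Python script"
msg.set_content("Something needs attention on the database server.")

# replace the host name below with your own SMTP server
with smtplib.SMTP("PUT YOUR SMTP SERVER HERE") as server:
    server.send_message(msg)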

You need to replace “PUT YOUR SMTP SERVER HERE” with the host name of your SMTP server.

Bobby
