Delphix support helped me resolve an issue yesterday and the experience gave me the idea of writing this post about several general computer issue troubleshooting tips that I have learned down through the years. Never mind that I ignored these lessons during this particular problem. This is more of a “do as I say” and not a “do as I do” story. Actually, some times I remember these lessons. I didn’t do so well this week. But the several mistakes that I made resolving this recent Delphix issue motivate me to write this post and if nothing else remind myself of the lessons I’ve learned in the past about how to resolve a computer problem.
Don’t panic!
I’m reminded of the friendly advice on the cover of the Hitchhiker’s Guide to the Galaxy: “Don’t panic!”. So, yesterday it was 4:30 pm. I had rebooted the Delphix virtual machine and then in a panic had the Unix team reboot the HP Unix target server. But, still I could not bring up any of the Delphix VDBs. We had people coming over to our house for dinner that night and I was starting to worry that I would be working on this issue all night. I ended up getting out of the office by 5:30 pm and had a great dinner with friends. What was I so stressed about? Even the times that I have been up all night it didn’t kill me. Usually the all night issues lead to me learning things anyway.
Trust support
The primary mistake that I made was to get my mind fixed on a solution to the problem instead of working with Delphix support and trusting them to guide us to the solution. We had a number of system issues due to a recent network issue and I got my mind set on the idea that my Delphix issue was due to some network hangup. I feel sorry for our network team because it seems like the first thought people have any time there is some issue is that it is a “network issue”. I should know better. How many times have I been working on issues when everyone says it is a “database issue” and I’m annoyed because I know that the issue is somewhere else and they are not believing me when I point to things outside the database. Anyway, I opened a case with Delphix on Monday when I couldn’t get a VDB to come down. It just hung for 5 minutes until it gave me an error. I assumed that it was a network hangup and got fixated on rebooting the Delphix VM. Ack! Ultimately, I ended up working with two helpful and capable people in Delphix support and they resolved the issue which was not what I thought at all. There are times to disagree with support and push for your own solution but I did this too early in this case and I was dead wrong.
Keep it simple
I’ve heard people refer to Occam’s razor which I translate in computer terms to mean “look for simple problems first”. Instead of fixing my mind on some vague network issue where the hardware is not working properly, how about assuming that all the hardware and software is working normally and then thinking about what problems might cause my symptoms? I can’t remember how many times this has bit me. There is almost always some simple explanation. In this case I had made a change to a Unix shell script that runs when someone logs in as the oracle user. This caused Delphix to no longer be able to do anything with the VDBs on that server. Oops! It was a simple blunder, no big deal. But I’m kicking myself for not first thinking about a simple problem like a script change instead of focusing on something more exotic.
What changed?
I found myself saying the same dumb thing that I’ve heard people say to me all the time: nothing changed. In this case I said something like “this has worked fine for 3 years now and nothing has changed”. The long-suffering and patient Delphix support folks never called me on this, but I was dead wrong. Something had to have changed for something that was working to stop working. I should have spent time looking at the various parts of our Delphix setup to see if anything had changed before I contacted support. All I had to do was see the timestamp on our login script and I would see that something had recently changed.
Understand how it all works
I think my Delphix skills are a little rusty. We just started a new expansion project to add new database sources to Delphix. It has been a couple of years since I’ve done any heavy configuration and trouble shooting. But I used to have a better feel for how all the pieces fit together. I should have thought about what must have gone on behind the scenes when I asked Delphix to stop a VDB and it hung for 5 minutes. What steps was it doing? Where in the process could the breakdown be occurring? Delphix support did follow this type of reasoning to find the issue. They manually tried some of the steps that the Delphix software would do automatically until they found the problem. If I stopped to think about the pieces of the process I could have done the same. This has been a powerful approach to solving problems all through my career. I think about resolving PeopleSoft issues. It just helps to understand how things work. For example, if you understand how the PeopleSoft login process works you can debug login issues by checking each step of the process for possible issues. The same is true for Oracle logins from clients. In general, the more you understand all the pieces of a computer system, down to the transistors on the chips, the better chance you have of visualizing where the problem might be.
Well, I can’t think of any other pearls of wisdom from this experience but I thought I would write these down while it was on my mind. Plus, I go on call Monday morning so I need to keep these in mind as I resolve any upcoming issues. Thanks to Delphix support for their good work on this issue.