I was looking at the Python sortedcontainers package and it got me thinking. It is a long, convoluted story and I am not sure that I can explain it clearly in a short blog post. I tried to explain all this to my wife in the last few minutes as we were driving up to a friend’s house last night and I’m sure the way I explained it was confusing. But I’m hoping that I can capture some of the ideas that I have thought about in a way that will be useful to others.
I have studied some computer science topics that do not directly relate to my work with Oracle databases. My review of the sortedcontainers implementation documentation tied together several of those topics and related them back to my work with Oracle performance tuning. I have not tested sortedcontainers to make sure that it does everything the web site says it does. But I think it is the best Python package for in-memory structures that behave like balanced trees. An AVL tree or B-tree keeps data ordered so you can quickly search for a range of key values and get them out in sorted order. Normal Oracle indexes are a type of B-tree, but on disk, with blocks cached in memory when queries access them. AVL trees are binary trees, so each node points to at most 2 children. B-tree nodes can have many children.
Sortedcontainers seems to work like a balanced tree with 1000 or so max children per node. I think it makes efficient use of Python’s built-in list objects, and it seems to work well with caching within the CPU. I have not carefully reviewed the theory or tested all this out to prove that it is right, but it seems likely that it is. I find it convincing because it ties back to other computer science topics that I have studied and to my experience with Oracle performance tuning.
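To make the idea of one wide level of sorted Python lists more concrete, here is a minimal sketch using only the standard library's bisect module. It is not the sortedcontainers implementation, and I have not checked it against that code. The class name ChunkedSortedList and its load parameter are made up for illustration. The irange method name is borrowed from sortedcontainers' SortedList, but this toy version only takes a low and a high value. The load here is 4 so the splitting is easy to see, where sortedcontainers documents a much larger default.

```python
from bisect import bisect_left, bisect_right, insort


class ChunkedSortedList:
    """Toy sorted container: a list of small sorted lists.

    Hypothetical class for illustration only; not sortedcontainers' code.
    """

    def __init__(self, load=4):
        self._load = load   # max items per sublist before it is split
        self._lists = []    # the sorted sublists, in key order
        self._maxes = []    # largest value in each sublist

    def add(self, value):
        if not self._lists:
            self._lists.append([value])
            self._maxes.append(value)
            return
        # Find the first sublist whose max is >= value; if there is
        # none, the value belongs at the end of the last sublist.
        pos = bisect_left(self._maxes, value)
        if pos == len(self._lists):
            pos -= 1
        sub = self._lists[pos]
        insort(sub, value)
        self._maxes[pos] = sub[-1]
        # Split a sublist that grew too large, a bit like a B-tree
        # node or an index block splitting.
        if len(sub) > self._load:
            half = len(sub) // 2
            self._lists.insert(pos + 1, sub[half:])
            del sub[half:]
            self._maxes[pos] = sub[-1]
            self._maxes.insert(pos + 1, self._lists[pos + 1][-1])

    def irange(self, low, high):
        """Yield values v with low <= v <= high, in sorted order."""
        pos = bisect_left(self._maxes, low)
        for sub in self._lists[pos:]:
            if sub[0] > high:
                break
            start = bisect_left(sub, low)
            stop = bisect_right(sub, high)
            yield from sub[start:stop]


values = [17, 3, 9, 25, 1, 12, 8, 30, 21, 5]
csl = ChunkedSortedList(load=4)
for v in values:
    csl.add(v)

# Range scan, similar in spirit to an index range scan in Oracle.
print(list(csl.irange(8, 21)))  # [8, 9, 12, 17, 21]
```

Each sublist is a contiguous Python list, so a search touches one small list of maximum values and then one sublist. That is my rough mental picture of why this kind of layout could be friendly to CPU caches, but I have not measured it.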
I have been slowly working through an algorithms class on MIT’s OCW website. I am on a section about AVL trees. So, I was looking around at AVL trees in Python. I noticed that Rosetta Code had an AVL tree topic but no Python example until I added one. I also looked around on PyPI for an AVL tree Python package. Based on my search, I thought that bintrees was the most mature, but its web page has a note saying “Use sortedcontainers instead”. So, that made me think that sortedcontainers was the best balanced tree option in Python. The algorithms class talks about how to prove that you can work with AVL trees in O(log n) time. The sortedcontainers performance documentation has a complex explanation of its big O complexity. Also, I think that my class will discuss some of the concepts used in the sortedcontainers analysis in future lessons. So, that motivates me to go forward.
The assembly language book that I worked through helped me understand how to write programs that run faster because they make better use of the x86-64 processor’s cache and registers. The creator of sortedcontainers seems to have designed it with CPU caches in mind. Right or wrong, in my mind this ties back to memory caches that affect Oracle database performance. How much of Oracle tuning relates back to how and where systems cache database blocks in RAM? You have the database block cache of course. You also have the operating system filesystem cache, which you might bypass with direct I/O. You may have a high-speed memory cache within your SAN’s storage server. I don’t know about today, but in the past disk controller cards and even disk drives themselves had memory caches. You might say, joking, that “cache is king” in database performance. At the least, you have to say that it is important to understand when and where database systems cache disk blocks in memory to understand why you are getting the performance you are seeing.
So, I guess my mind connected sortedcontainers with my algorithms class and assembly language book. I also connected sortedcontainers back to Oracle performance tuning. It makes me feel that digging into some computer science training is not a waste of time. It helps me to step back from Oracle database specific study and get a little theory. Also, my database work is focusing more and more on the x86-64 architecture and the Linux platform so looking at computer science on the same platform that I use for work has clear benefits.
So, I’m concerned that I have not made this post helpful to people who read it. Is it just about my experience or does it have a point for other people? Maybe the point is that it can’t hurt for an Oracle DBA to learn some computer science. Maybe you are like me and studied C.S. in school many years ago. Maybe you have learned Oracle on the job and don’t have a C.S. background. Maybe the message for both of us from my story about sortedcontainers and my “Aha!” moment is that there are benefits to studying a little computer science even if it does not directly relate to your job. There is only so much time in an Oracle DBA’s day and you can’t learn everything, but maybe it is worth putting some time into learning some C.S. theory if you can.
Bobby
In my experience as someone who spent a lot of time outside “technology” before working mostly on data and ML projects (Python ecosystem and SQL), the little bit of CS I’ve learned (intermittently and somewhat randomly along my journey) has definitely made me appreciate databases (indexes) and generally resist the temptation, if at all possible, to do brute-force pairwise comparisons 🙂
The 80/20 seems to have gotten me reasonably far!
Thanks for your comment. Interesting to hear from someone who isn’t a pure technology person.