There are three steps to fixing any bug: 1) reproduction, 2) diagnosis, and 3) fix. This post will focus on the second step, diagnosis.
There are two ways to approach diagnosing a bug. One is that you can stare at the screen, think really hard, and try to guess what might be going wrong. In my experience this is the most common debugging approach that programmers take. It’s neither very efficient nor very enjoyable.
Another way is that you can perform a series of tests to find where the cause of the bug lies without yet worrying about what the cause of the bug is. Once the location of the bug is found, it’s often so obvious what’s causing the bug that little to no thinking is required. This method of bug diagnosis is much easier and more enjoyable than just thinking really hard, although unfortunately it’s not very common.
20 Questions strategies
You’ve probably played the game 20 Questions, where one player thinks of an object and the other players ask up to 20 yes-or-no questions to try to guess what the object is.
Smart players of 20 Questions try to ask questions that divide the possibilities roughly in half. If all that’s known about the object is that it’s an animal, then it’s a waste of a question to ask whether the animal is a turtle: if the answer is no, you’ve eliminated only a tiny fraction of the kinds of animals the object might be, because very few kinds of animals are turtles.
Better to ask something like “Does it live on land?” or “Is it extinct?” which will rule out larger chunks of possibilities.
The binary search algorithm
This “eliminate-half strategy” is the genius of the binary search algorithm, which repeatedly chops a sorted list in half to efficiently find the target item. If you can determine that an item is not present in one half of the list, then it’s necessarily true that the item lies in the other half of the list. If you’re searching a list of 1000 items, you can chop the list down to 500, then 250, then 125, 63, 32, 16, 8, 4, 2, 1. That’s only 10 steps for a list of 1000 items. I find it pretty impressive that a single item can be found in a list of 1000 items in only 10 steps.
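The halving described above can be sketched in a few lines of Python (an illustrative sketch, not code from the original post), with a step counter to confirm the 10-steps-for-1000-items claim:

```python
def binary_search(sorted_list, target):
    """Repeatedly halve the search range until the target is found.

    Returns (index, steps) where index is None if the target is absent.
    """
    lo, hi = 0, len(sorted_list) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_list[mid] == target:
            return mid, steps
        elif sorted_list[mid] < target:
            lo = mid + 1  # target can only be in the upper half
        else:
            hi = mid - 1  # target can only be in the lower half
    return None, steps

index, steps = binary_search(list(range(1000)), 873)
# steps never exceeds 10 for a 1000-item list
```

Each iteration discards half of the remaining range, which is exactly the “eliminate-half” move from 20 Questions.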
The simplicity of the binary search algorithm must be why, a surprisingly large amount of the time, a plastic toy that costs $12.99 is able to correctly determine what object you’re thinking of within 20 guesses. The machine seems miraculously intelligent, but its job probably involves more drudgery than actual thinking. All that presumably has to be done is to place the world’s information into nested categories, each “higher” category being about twice as broad as the one below it. For a human, manually performing these categorizations would take a lot of work, but once they’re made, carrying out the binary search steps is mindlessly simple.
If binary search is powerful enough that a cheap toy can use it to guess almost any object you can think of, it’s not hard to imagine how useful binary search can be in pinpointing a bug in a web application.
Playing “20 Questions” with your IT system
Whenever I’m faced with a bug, I play “20 Questions” with my IT system. I ask the system a series of yes-or-no questions, and each answer rules out a chunk of the system as the location of the bug. Once an answer has allowed me to rule out an area, I ask my next question about the part of the system that hasn’t yet been ruled out. Eventually the area that remains becomes so tiny that, much of the time, the root cause is totally obvious.
Of course, I can’t literally “ask” questions of my system. Instead I devise a series of yes-or-no tests that I can perform. For example, let’s say my web service is not responding to requests. The first question I might ask is “Is the site actually down, or is it down for just me?” For that question my test would involve visiting downforeveryoneorjustme.com and typing in the URL for my website. If the site is in fact down for everyone, then my next question might be “Are requests making it to my web server?” For that question my test would involve looking at my web server’s logs for evidence of HTTP requests being received. And so on.
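Tests like these can even be scripted. Here is a rough Python sketch of the two questions above; the URL and log path are hypothetical placeholders, and the log check is a crude stand-in for actually reading your server’s logs:

```python
import urllib.request


def site_responds(url, timeout=5):
    """Question 1: does the site answer HTTP requests at all?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except OSError:
        return False


def server_received_requests(log_path):
    """Question 2: is there any evidence of HTTP requests in the log?

    log_path is a hypothetical example, e.g. an nginx access log.
    """
    try:
        with open(log_path) as f:
            return any("HTTP/" in line for line in f)
    except FileNotFoundError:
        return False
```

Each function returns a plain yes or no, which is the whole point: the answer tells you which half of the system to stop worrying about.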
Unlike a real binary search algorithm, I don’t always do a 50/50 split. Often it’s more economical to do what you might call a “weighted binary search” instead.
Splitting on size vs. splitting on likelihood
The reason a binary search algorithm splits lists in half is because we want each yes-or-no question to eliminate as much search area as possible, but we don’t know whether the answer will be a yes or a no, and so splitting the list in half guarantees that we never eliminate less than half of the list. If for example we chose a 75/25 split instead of a 50/50 split, we risk eliminating only 25% of the possibilities rather than the other 75% of the possibilities. That would be a waste.
The 50/50 strategy makes sense if and only if you don’t know the probability of the target item being in one half or the other (or if the chances truly are 50/50). The strategy makes less sense if you have clues about where the target item lies.
For example, if I become aware that a certain bug was present after I performed a certain deployment and absent before it, I don’t have to waste time mechanically dividing my entire codebase in half over and over to find the root cause. It’s reasonable to believe that the code introduced by that deployment is likely (although of course not certain) to contain the offending code. The change in the deployment might represent just 1% or less of the codebase, meaning that if my guess is wrong and the deployment didn’t introduce the bug, then I will only have eliminated 1% of the codebase and my search area will still be the remaining 99%. But because the risk of the guess being wrong is usually so low, and because the cost of performing the yes/no test is usually so small, this kind of 1/99 split is a win on average over a 50/50 split.
The “20 Questions Method of Debugging” is to perform a binary search on your codebase. Divide your codebase roughly in half, devise a test to tell you which half you can rule out as containing the root cause of the bug, and repeat until the remaining search area is so small that the root cause can’t help but reveal itself.
More precisely, the 20 Questions Method of Debugging is a “weighted binary search”. If you have a good reason to believe that the root cause lies in a certain area of your codebase, then you can save steps by immediately testing to see if you can rule out everything that’s not the suspect area rather than doing a dumb 50/50 split. You will probably “lose” sometimes and end up eliminating the smaller fraction rather than the larger fraction, but as long as your guesses are right enough of the time, this strategy will be more efficient on average than always doing a 50/50 split.
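To see why the lopsided split can win on average, it helps to compute the expected fraction of the search area left after one test. This is a back-of-the-envelope sketch; the 95% figure is an assumed guess accuracy, not a number from the post:

```python
def expected_remaining(split, p_in_small):
    """Expected fraction of the search area left after one yes/no test.

    split: fraction of the area in the "suspect" chunk (e.g. 0.01)
    p_in_small: probability the bug really is in that chunk (assumed)
    """
    # If the bug is in the suspect chunk, only that chunk remains;
    # otherwise everything outside it remains.
    return p_in_small * split + (1 - p_in_small) * (1 - split)


# A 50/50 split always leaves half the area:
even = expected_remaining(0.5, 0.5)        # 0.5
# A 1/99 split, with an assumed 95% chance the deployment is to
# blame, leaves far less on average:
weighted = expected_remaining(0.01, 0.95)  # 0.059
```

With those assumed numbers, the weighted test eliminates about 94% of the search area on average, versus exactly 50% for the even split — which is the “win on average” claim made above.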
The 20 Questions Method of Debugging beats the “stare at the screen and think really hard” method for two reasons. The first is efficiency: thinking is expensive, but performing a series of yes-or-no tests is cheap. The second reason is my favorite: the “sit and think” method may never yield the right answer, but the 20 Questions Method is virtually guaranteed to eventually turn up the right answer through the sheer power of logic.