In order to fix any problem, you need to know two things:
- What exactly is wrong
- How to fix it
The beautiful thing is that if you have the right answers to both items above, you’re guaranteed to fix the problem. If your fix doesn’t work, then it necessarily means you were wrong about either #1 or #2.
The fastest way to solve any problem is to focus first on finding out what’s wrong, with reasonable certainty, before you move on to the step of trying to fix the problem. “Diagnose before you prescribe.”
Why it’s faster to separate the diagnosis from the fix
There are two reasons why I think it’s faster to separate the diagnosis from the fix than to mix them up. The first is that if you don’t separate the two jobs, you’re liable to resort to guessing.
Guessing has its place, but it’s more productive when the guesses are founded on a decent understanding of the situation. The beginning of a problem-solving endeavor is when your understanding is at its lowest, so that is when your guesses will be the least informed by reality. At best, you will happen to make some correct guesses by luck. More likely, you’ll just waste time. The worst case is that you lead yourself down fruitless and time-consuming paths because you prescribed before you diagnosed.
The other reason has to do with mental clarity. If you mix diagnosing with solving, you’re liable to get confused. You’re more liable to flail and throw things at the wall rather than advance methodically toward a solution. You’re likely to run out of mental “RAM” and subsequently “crash”, losing your whole train of thought. All these things are pure waste.
How to diagnose a problem
Step zero: a scientific mindset
Science is the pursuit of the answer to the question: “What is true?” Since reality is very complex and often misleading, scientists need to be very careful about what they bestow with the label of “true”. It’s surprisingly easy to get fooled. In order for any conclusion to be true, it must be based on premises that are also true, otherwise the conclusion is suspect.
When any new piece of information enters your awareness, you need to pass that piece of information through a filter of scrutiny before you allow yourself to call it a “fact”. Don’t just believe something is true because it seems true. Many things that seem true are not only untrue but wildly wrong (e.g. a geocentric universe). You must require some sort of justification for believing something is true, and the justification you require should be proportionate to that thing’s size and prominence inside of whatever you’re trying to figure out.
This is the type of mindset we need to adopt when problem-solving. We need to practice science.
Step one: state the symptom
The first step toward diagnosing a problem is to make a problem statement. A problem statement is simply an expression of what you know to be wrong based on what you see.
At first your problem statement might be vague because although you know what the symptom of the problem is, you don’t know the root cause. You might not even know in what area the root cause lies.
Resist the temptation to be overly specific in your problem statement. A vague problem statement might not seem very helpful, but at least it’s not hurtful. A problem statement that’s precise but untrue causes much more harm than a problem statement that’s true but vague.
Robert M. Pirsig put this quite nicely in Zen and the Art of Motorcycle Maintenance:
In Part One of formal scientific method, which is the statement of the problem, the main skill is in stating absolutely no more than you are positive you know. It is much better to enter a statement “Solve Problem: Why doesn’t cycle work?” which sounds dumb but is correct, than it is to enter a statement “Solve Problem: What is wrong with the electrical system?” when you don’t absolutely know the trouble is in the electrical system.
When you’re satisfied that you have a true problem statement, write it down. Writing down the problem statement allows you to clear your “mental RAM” so you can use that precious limited RAM for other things.
In the remainder of this diagnosis discussion I’ll use the example problem statement of “site is down”.
By the way, I must make a caveat here. The problem statement of “site is down” depends on me knowing that the site is down. It wouldn’t be valid for me to receive a phone call from someone telling me the site is down and then make a problem statement that the site is down. Literally everything must be scrutinized. Just because someone tells me the site is down doesn’t necessarily mean that the site is actually down. It could mean that that person’s wi-fi isn’t working, or that I’m being pranked. I must verify for myself that the site is down (by attempting to visit it myself, for example) before I accept it as a true fact that the site is down.
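One way to do that verification in a repeatable form is a small script. The following is only a minimal sketch using Python's standard library; the function name `site_responds` and the timeout value are my own assumptions, and a `False` result from one machine still only tells you the site is unreachable from that machine.

```python
import urllib.error
import urllib.request


def site_responds(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with a non-error HTTP status.

    Hypothetical helper for illustration. It checks reachability from
    the machine running it, nothing more.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers DNS failures, refused connections, timeouts,
        # HTTP 4xx/5xx responses, and malformed URLs.
        return False
```

Running it from a second network (a phone hotspot, a different server) is a cheap way to rule out the "my own wi-fi is broken" explanation.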
Step two: isolate the root cause
Isolating the root cause is usually done by degrees. The idea with isolating the root cause is to progressively refine the original problem statement until the problem statement is so narrow and specific that the root cause is obvious.
The way to progressively narrow down the problem is to pose yes-or-no questions and determine what the answer is for each. It’s kind of like playing a game of “20 questions”. When I play 20 questions I like to ask questions like “Is it man-made?”, “Does it use electricity?”, “Could I pick one up?” which are each broad but also each eliminate a broad scope of possibilities.
Unlike 20 questions, we can’t get the answer to each of our questions just by having someone tell us. We need to find the answers ourselves by performing some sort of test. This is another area where it’s very important not to let ourselves get fooled.
Each test we perform needs to change only one variable at a time (using the scientific sense of the word variable, not the programming sense). If, for example, you delete line 26 in a file, delete line 4 from another file, restart the server, and get an error message, you can’t conclude that deleting line 26 is what caused the error message. It might be that deleting line 4 did it. It might even be that none of your changes did it, that somebody else changed something without restarting the server, and your server restart only applied some previous unrelated change. You must poke at each item from several directions in order to be reasonably sure you have the truth.
You will also necessarily need to bring facts into the picture from outside your tests. Be very careful to distinguish “strong” facts from “weak” facts. For example, a user might tell you they saw an error message on page X at about 4:30 yesterday. There’s no particular reason to disbelieve the user, but you can’t simply take their word for it either. They could be misremembering. They could be confused about what page they saw. They might even not understand the boundaries between your application and, e.g., the operating system, meaning that what you thought was a bug report was really just a Windows error that the user experienced. So take these sorts of “facts” with a grain of salt. Give them, say, 60% credence instead of 100% credence.
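The one-variable-at-a-time rule can be sketched in code. This is an illustration only, under the assumption that the system under test can be described by a dictionary of settings and checked by some `passes` function; both names, and `try_changes_one_at_a_time` itself, are hypothetical.

```python
from typing import Callable, Dict, Mapping

Config = Dict[str, object]


def try_changes_one_at_a_time(
    baseline: Mapping[str, object],
    changes: Mapping[str, object],
    passes: Callable[[Config], bool],
) -> Dict[str, bool]:
    """Apply each candidate change alone to a fresh copy of the baseline.

    Returns a mapping of setting name -> whether the system still passed
    with only that one setting changed.
    """
    # If the untouched baseline already fails, no single change can be blamed.
    if not passes(dict(baseline)):
        raise ValueError("baseline fails before any change is applied")

    results: Dict[str, bool] = {}
    for name, value in changes.items():
        trial = dict(baseline)  # start over from the baseline every time
        trial[name] = value     # exactly one variable differs
        results[name] = passes(trial)
    return results
```

Checking the baseline first matters for the same reason as the server-restart example: if the unchanged system is already broken, a failure after your change says nothing about your change.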
Also note that this step is “isolate the root cause”, not “think really hard to guess what the root cause might be”. Empiricism is much cheaper than reasoning. It’s much faster to find where a problem lies than to try to figure out what the problem is. You can save the “what” for after you’ve found the “where”. Once you’ve found the “where”, the “what” will be about 100 times easier.
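One common way to find the "where" cheaply is to halve the search area with each test. The sketch below is a plain binary search, not something specific to this article, and it rests on a loud assumption: the stages are ordered, and once a request fails to reach one stage it also fails to reach every later stage. The function name and the stage names in the usage example are hypothetical.

```python
from typing import Callable


def first_failing_index(n: int, reaches: Callable[[int], bool]) -> int:
    """Return the smallest i in range(n) for which reaches(i) is False.

    Returns n if every stage is reached. Assumes reaches(i) is True for
    all stages before the break and False from the break onward.
    """
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if reaches(mid):
            lo = mid + 1  # the break is somewhere after mid
        else:
            hi = mid      # the break is at mid or earlier
    return lo


if __name__ == "__main__":
    # Hypothetical stages, with a simulated break at "app".
    stages = ["dns", "load balancer", "web server", "app", "database"]
    broken_at = 3
    index = first_failing_index(len(stages), lambda i: i < broken_at)
    print(stages[index])  # prints "app"
```

Each call to `reaches` stands for one real test you perform yourself, so the number of tests grows with the logarithm of the number of stages rather than with the number of stages.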
Let’s continue with the “site is down” example.
Websites work because a user’s request hits a server and then the server successfully returns a response, which is then displayed by a browser. If the website isn’t working, the problem must lie somewhere in that request chain. (BTW, “request chain” is just a term I made up.)
Based on this fact, I can refine my problem statement to: Something is broken in the request chain. It’s important to note that when I say “request chain”, I mean it in the very broadest sense possible. I conceive that chain to include everything that happens between the user typing the URL and a page being displayed. Otherwise I risk excluding the true root cause from my search area.
At this point I haven’t really made my problem statement any narrower, but I have made it more specific without making it any less true. My problem statement is now sufficiently specific that I can identify some investigation steps.
I’ll now ask myself the following questions:
- Does the DNS name for my website successfully resolve?
- When I send a request to my site’s URL, does that request actually make it to a server?
- If the request makes it to the server, do I see my application logging something in response?
- etc.
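The first two questions in the list above can be answered with a few lines of standard-library Python. This is a rough sketch under stated assumptions: the function names are mine, the default ports are guessed from the URL scheme, and a successful TCP connection only shows that something is accepting connections at that address (possibly a load balancer), not that the application itself received the request. The third question needs access to the server's logs and is left out.

```python
import socket
from typing import Optional
from urllib.parse import urlsplit


def dns_resolves(host: str) -> bool:
    """Return True if the host name resolves to at least one address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def port_accepts_connections(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_failed_check(url: str) -> Optional[str]:
    """Run the checks in order and name the first one that fails.

    Returns None if every check passes.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if not host:
        return "url has no host"
    port = parts.port or (443 if parts.scheme == "https" else 80)
    if not dns_resolves(host):
        return "dns"
    if not port_accepts_connections(host, port):
        return "tcp"
    return None
```

Stopping at the first failed check keeps each answer a clean yes or no: a failed TCP check means little if the name never resolved in the first place.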
I typically don’t list out a large number of steps in advance, but rather I list one or two investigation steps, carry them out, and then think of one or two more.
The “site is down” example I’m using is a real one that actually happened to me some time ago. My memory is that when I SSH’d into a server to investigate, I discovered that there was an “out of disk space” error happening, and indeed, the server was out of disk space.
So in this case I refined my problem statement further to: one of the web servers is out of disk space. In this particular case I was done. The problem statement was sufficiently specific that it was obvious what to do: kill this server and spin up a fresh instance (and also add some disk space monitoring so this never happens again!).
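The disk space monitoring mentioned above could start as small as the following sketch. It uses `shutil.disk_usage` from Python's standard library; the function names, the default path, and the 90% threshold are assumptions for illustration, and a real setup would more likely rely on whatever monitoring the hosting environment already provides.

```python
import shutil
from typing import Optional


def disk_used_fraction(path: str = "/") -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual filesystems report a total of zero.
        return 0.0
    return usage.used / usage.total


def disk_alert(path: str = "/", threshold: float = 0.9) -> Optional[str]:
    """Return a warning message if usage is at or above `threshold`, else None."""
    fraction = disk_used_fraction(path)
    if fraction >= threshold:
        return f"{path} is {fraction:.0%} full (threshold {threshold:.0%})"
    return None
```

Run on a schedule (cron, for instance) and wired to email or chat, a check like this turns "site is down" into a warning that arrives before the disk fills up.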
If I hadn’t reached the end of the diagnosis process so easily, I would have continued posing yes/no questions of increasing specificity until I had a specific enough problem statement that I knew what the fix was.
How to fix a problem
You know you’ve found the diagnosis to a problem when your problem statement is sufficiently specific that it’s obvious what the problem is.
When it’s obvious what the problem is, it’s often obvious what the solution is too. But not always. Often there are multiple possible ways to fix a problem. Sometimes it’s not clear what the palette of possible solutions is, or how easy each one would be to implement.
Here’s how to proceed in those situations.
Step one: enumerate all possible solutions
List all the possible solutions to the problem without filtering for elegance, ease of implementation, or anything else.
Step two: rank the solutions from most appealing to least appealing
Put the simplest, most elegant solutions at the top and the dirtiest hacks at the bottom.
Step three: try the most appealing solution
Ideally, the most appealing solution also happens to be easy to implement. If it’s not, then you have a decision to make.
Your decision is whether to power through and implement the elegant solution even though it’s time-consuming, or to give up and implement a less appealing solution instead.
The way I make this decision is to spend time and effort on a solution in proportion to how much worse the next-most-appealing solution is. If there’s another solution that seems equally good, I’ll give up fast since the other solution might give me an equally acceptable outcome more cheaply. If all the other solutions are awful hacks, I’ll spend as much time as needed getting my solution to work rather than resorting to a hack.
Step four: move on
If I’ve decided that my current solution is sufficiently tricky that “the juice is not worth the squeeze”, then I’ll move to the next-most-appealing solution. I’ll cross my current solution off the list (at least for now) and then start again at step three with my next solution.
Step five: regroup
If I’ve tried all possible solutions and nothing works, I’ll see if I can think of any more possible solutions. If so, I’ll repeat the process using my new list of potential solutions. If I can’t think of any more possible solutions, I’ll go back to my original list and repeat the whole process, but this time with more effort.
It always works
Following the above methodology basically always works. It’s not a question of whether this methodology will yield the answer to a problem; it’s a matter of how much time spent solving a problem is justifiable relative to the cost of leaving the problem unsolved. Sometimes it’s wiser just to leave the problem unsolved.
But most of the time this methodology leads to the answer quite quickly. If you follow these steps you’ll be able to solve virtually any problem, and in much less time than most other programmers, and with much less confusion and frustration.
I generally love your posts, and this one too. But you left out what I think is often overlooked and should not be subsumed into one of your two steps: “Figure out how to reproduce your problem”. Often in the process of writing a good Stack Overflow question, I work hard to reproduce a problem in the simplest possible way, and in doing that I figure out what is happening and how to fix it. FWIW.
Thanks, that’s a good point!