My method of systematic troubleshooting

by Jason Swett

Why systematic troubleshooting is valuable

Computers are complicated, programming is complicated, and the problems we programmers have to solve are often complicated.

Because human brains are only so powerful, and because humans are susceptible to logical fallacies, it’s very important to have a systematic approach to troubleshooting if we want to have a hope of solving our problems and solving them in a timely manner.

If you don’t have a good methodology, you’ll guess and flail and get frustrated and ultimately probably fail to fix the issue. If you do have a good methodology, almost any problem will eventually collapse under the crushing weight of your capabilities.

Here are some questions I tend to ask myself when troubleshooting technical issues.

Do I know, with certainty, exactly what’s wrong?

Don’t get fooled

Most of the time, when I’m presented with a problem, I don’t know exactly what’s wrong at first. It’s of course impossible to fix a problem if I don’t know what needs fixing.

The challenge here is not to fool yourself into thinking the problem is something other than what it really is.

I’ve been guilty on a number of occasions of taking a bug report at face value and then discovering later that the reporter of the bug was mistaken and that the bug was something else. If someone tells me “we’re not able to charge American Express cards in the system,” I should usually translate that statement to “something seems to be wrong with credit card payments” until I’ve verified the specifics myself.

State only what you know for sure to be true

Here’s a quote from Zen and the Art of Motorcycle Maintenance (using motorcycles as the subject of investigation rather than computers):

It is much better to enter a statement “Solve Problem: Why doesn’t cycle work?” which sounds dumb but is correct, than it is to enter a statement “Solve Problem: What is wrong with the electrical system?” when you don’t absolutely know the trouble is in the electrical system. What you should state is “Solve Problem: What is wrong with cycle?” and then state as the first entry of Part Two: “Hypothesis Number One: The trouble is in the electrical system.” You think of as many hypotheses as you can, then you design experiments to test them to see which are true and which are false.

Good advice. (And a good book.)

How can I narrow the scope of my investigation?

Most problems are too big

With most problems, there are so many variables interacting with each other that it’s impossible to model the whole situation in my head. So I try to see what I can eliminate from the picture.

Let’s say I’m trying to deploy a Rails application to AWS. Some of my deployment worked but I can’t connect to my RDS instance (that is, my database instance).

In this case, my EC2 instance can’t connect to my RDS instance. Or, more specifically, the Rails app on my EC2 instance can’t connect to my RDS instance. How do I know whether the problem lies with my RDS instance, my EC2 instance, my Rails app, all of these, none of these, or some combination? That’s a lot of possibilities.

Narrowing it down

So rather than trying to tackle the whole problem, I’ll narrow it down to just the RDS instance. Is anything wrong with the RDS instance? How can I interrogate the RDS instance without bringing all the other parts into the picture?

One thing I can do in this case is to use the PostgreSQL CLI client (that is to say, the psql command) on my laptop (not on the EC2 instance, on my laptop) to try to connect to the RDS instance. This way I’m not involving my EC2 instance, I’m not involving environment variables, I’m not involving Rails, I’m only dealing with the RDS instance.

I can run the command psql -h my-rds-hostname -U postgres my-db-name and see what happens. If I can connect to my database that way and run a query, then I can be sure that my database name, my username, my password and my hostname are all good. If I get prompted for a password but it tells me the password is wrong, then I know the problem is my password (or my username). If I don’t even get prompted for a password, then something else is wrong, such as the hostname or network access to the instance.
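The outcomes above can be sketched as a small script. This is only an illustration in Python: the function names are hypothetical, and the error-message fragments it matches are ones libpq commonly prints, which may differ between PostgreSQL versions. It assumes psql is installed locally and relies on the standard PGPASSWORD and PGCONNECT_TIMEOUT environment variables.

```python
import os
import subprocess


def classify_psql_failure(stderr_text):
    """Map psql's error output to the most likely problem area.

    The substrings below are assumptions based on common libpq messages.
    """
    text = stderr_text.lower()
    if "password authentication failed" in text:
        return "credentials"
    if "could not translate host name" in text:
        return "hostname"
    if "database" in text and "does not exist" in text:
        return "database name"
    if ("timed out" in text or "timeout expired" in text
            or "connection refused" in text):
        return "network"
    return "unknown"


def try_psql(host, dbname, user, password, timeout_s=5):
    """Run a trivial query through psql and report what was learned."""
    env = dict(os.environ,
               PGPASSWORD=password,
               PGCONNECT_TIMEOUT=str(timeout_s))
    result = subprocess.run(
        ["psql", "-h", host, "-U", user, "-d", dbname, "-c", "SELECT 1"],
        env=env, capture_output=True, text=True,
    )
    if result.returncode == 0:
        return "ok"  # hostname, user, password and database name all work
    return classify_psql_failure(result.stderr)
```

A call such as try_psql("my-rds-hostname", "my-db-name", "postgres", "secret") returns a single label, which keeps the test narrow: one command, one answer about one component.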

If I don’t know what’s wrong, in what areas might the problem lie and what tests can I perform to get a yes/no in each area?

List hypotheses

Before I actually get to work investigating a problem, I’ll usually list a number of hypotheses that I can test.

Here are some hypotheses for the RDS connection problem.

  • I’m using the wrong database credentials
  • I have the URL for my RDS instance wrong
  • The environment variables for my database credentials, RDS URL, etc., aren’t even set
  • My RDS instance’s security group is set to block traffic
  • Something else is wrong that I can’t think of

Test the hypotheses

Then I’ll try to get an answer to each question. I might start with what’s easiest to test or I might start with what I think is most likely to be the culprit.
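One way to keep this process honest is to write the hypotheses down as data, each paired with a test that returns a yes or a no. The following Python sketch is an illustration under assumptions: the Hypothesis class, the evaluate function, the effort scores and the environment variable names are all hypothetical, not part of Rails or AWS.

```python
import os
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    description: str
    effort: int                  # rough cost to test; lower means easier
    is_true: Callable[[], bool]  # returns True if the hypothesis holds


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


def evaluate(hypotheses):
    """Test the easiest hypotheses first; None means inconclusive."""
    results = []
    for h in sorted(hypotheses, key=lambda h: h.effort):
        try:
            verdict = bool(h.is_true())
        except Exception:
            verdict = None  # the test itself failed, so nothing was learned
        results.append((h.description, verdict))
    return results


if __name__ == "__main__":
    # Hypothetical variable names; use whatever the app actually reads.
    required = ["DATABASE_HOST", "DATABASE_USERNAME", "DATABASE_PASSWORD"]
    hypotheses = [
        Hypothesis("Environment variables aren't set", 1,
                   lambda: bool(missing_env_vars(required))),
    ]
    for description, verdict in evaluate(hypotheses):
        print(f"{description}: {verdict}")
```

Sorting by effort reflects the "easiest first" ordering; sorting by estimated likelihood instead would be an equally reasonable choice.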

I already shared one way of checking the second hypothesis, “I have the URL for my RDS instance wrong”. If I believe I have my RDS instance URL right, or if my tests in that area are inconclusive, I might try testing a different hypothesis.

When working with AWS, I find it easy to accidentally make my security groups (i.e. the sets of rules that control what types of traffic various entities can receive, like HTTP traffic, SSH traffic, etc.) too restrictive. PostgreSQL listens on port 5432 by default, so if my RDS instance’s security group doesn’t allow inbound traffic on port 5432, I won’t be able to connect.

To see if this is the issue I can use an open port checker to hit my RDS URL at port 5432. Again, this test tests only one thing at a time, which means if it doesn’t work, I can be pretty sure exactly what doesn’t work. If I perform a test that could have a large number of reasons for returning negative, I haven’t learned very much. Simple, narrow tests are the most useful kinds of tests to perform.
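The same narrow test can also be run from whichever machine needs access, which matters because security group rules depend on where the traffic comes from. A minimal Python sketch, assuming only the standard library; the function name port_is_open and the hostname in the usage line are hypothetical:

```python
import socket


def port_is_open(host, port=5432, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds.

    A False result covers DNS failure, a refused connection and a
    timeout alike; it says the port is unreachable, not why.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # "my-rds-hostname" is a placeholder for the real RDS endpoint.
    print(port_is_open("my-rds-hostname", 5432))
```

This check involves no credentials, no database name and no Rails, so a failure points at the network path or the hostname and nothing else.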

Have I tried everything that can possibly be tried? If not, what haven’t I tried yet?

This is a question I often ask myself when I get stuck. It might sound like a dumb question, but it turns out to be productive surprisingly often.

Persistence conquers all things

The point of this question is to facilitate persistence. It’s basically true that every problem is solvable and that the only way to fail is to give up. After all, if you understood everything about the problem and all the parts surrounding it, you would know exactly how to fix the problem, and there wouldn’t be a problem.

In the RDS example, if you knew everything about networking, and databases, and Linux, and Rails, then you’d know exactly how to fix the problem. Luckily you never need to know all of that. Usually there’s a relatively small amount of knowledge and understanding that lies between you and the solution to the problem.

So next time you’re confronted with a hairy issue, remember to use a systematic methodology, remember not to get fooled into believing things about the situation that aren’t true, remember to be persistent, and it’s entirely likely that you’ll be able to solve your problem.

2 thoughts on “My method of systematic troubleshooting”

  1. Roger Collins

    You had me at Zen and the Art of Motorcycle Maintenance.

    I also recommend designing the system from the ground up with a high value of visibility. One of my pet peeves is troubleshooting down to a piece of code that is failing silently rather than logging a warning. Don’t do that.

    I’m researching newer testing tools and seeing your name all over the place. Congratulations and keep up the great work, Jason.
