What a flaky test is and why they’re hard to fix
A flaky test is a test that passes sometimes and fails sometimes even though no code has changed.
There are several causes of flaky tests. The commonality among all the causes is that they all involve some form of non-determinism: code that doesn’t always behave the same on every run even though neither the inputs nor the code itself has changed.
Flaky tests are known to present themselves more in a continuous integration (CI) environment than in a local test environment. The reason for this is that certain characteristics of CI test runs make the tests more susceptible to non-determinism.
The fact that the flakiness usually can’t be reproduced locally means that it’s harder to reproduce and diagnose the buggy behavior of the flaky tests.
In addition to the fact that flaky tests often only flake on CI, the fact that flaky tests don’t fail consistently adds to the difficulty of fixing them.
Despite these difficulties, I’ve developed some tactics and strategies for fixing flaky tests that consistently lead to success. In this post I’ll give a detailed account of how I fix flaky tests.
The overall approach
When I’m fixing any bug I divide the bugfix into three stages: reproduction, diagnosis and fix.
I consider a flaky test a type of bug. Therefore, when I try to fix a flaky test, I follow this same three-step process as I would when fixing any other type of bug. In what follows I’ll cover how I approach each of these three steps of reproduction, diagnosis and fix.
Before reproducing: determine whether it’s really a flaky test
Not everything that appears to be a flaky test is actually a flaky test. Sometimes a test that appears to be flaking is just a healthy test that’s legitimately failing.
So when I see a test that’s supposedly flaky, I like to try to find multiple instances of that test flaking before I accept its flakiness as a fact. And even then, there’s no law that says that a test that previously flaked can’t fail legitimately at some point in the future. So the first step is to make sure that the problem I’m solving really is the problem I think I’m solving.
Reproducing a flaky test
If I can’t reproduce a bug, I can’t test for its presence or absence. If I can’t test for a bug’s presence or absence, I can’t know whether a fix attempt actually fixed the bug or not. For this reason, before I attempt to fix any bug, I always devise a test that will tell me whether the bug is present or absent.
My go-to method for reproducing a flaky test is simply to re-run the test suite multiple times on my CI service until I see the flaky test fail. Actually, I like to run the test suite a great number of times to get a feel for how frequently the flaky test fails. The actions I take during the bugfix process may be different depending on how frequently the test fails, as we’ll see later on.
Sometimes a flaky test fails so infrequently that it’s practically impossible to get the test to fail on demand. When this happens, it impossible to tell whether the test is passing due to random chance or because the flakiness has legitimately gone away. The way I handle these cases is to deprioritize the fix attempt and wait for the test to fail again in the natural course of business. That way I can be sure that I’m not wasting my time trying to fix a problem that’s not really there.
That covers the reproduction step of the process. Now let’s turn to diagnosis.
Diagnosing a flaky test
What follows is a list of tactics that can be used to help diagnose flaky tests. The list is kind of linear and kind of not. When I’m working on flaky tests I’ll often jump from tactic to tactic depending on what the scenario calls for rather than rigidly following the tactics in a certain order.
Get familiar with the root causes of flaky tests
If you were a doctor and you needed to diagnose a patient, it would obviously be helpful for you first to be familiar with a repertoire of diseases and their characteristic symptoms so you can recognize diseases when you see them.
Same with flaky tests. If you know the common causes for flaky tests and how to recognize them, you’ll have an easier time with trying to diagnose flaky tests.
In a separate post I show the root causes of flaky tests, which are race conditions, leaked state, network/third-party dependency, fixed time dependency and randomness. I suggest either committing these root causes to memory or reviewing them each time you embark on a flaky test diagnosis project.
Have a completely open mind
One of the biggest dangers in diagnosing flaky tests or in diagnosing any kind of problem is the danger of coming to believe something that’s not true.
Therefore, when starting to investigate a flaky test, I try to be completely agnostic as to what the root cause might be. It’s better to be clueless and right than to be certain and wrong.
Look at the failure messages
The first thing I do when I become aware of a flaky test is to look at the error message. The error message doesn’t always reveal anything terribly helpful but I of course have to start somewhere. It’s worth checking the failure message because sometimes it contains a helpful clue.
It’s important not to be deceived by error messages. Error messages are an indication of a symptom of a root cause, and the symptom of a root cause often has little or nothing to do with the root cause itself. Be careful not to fall into the trap of “the error message says something about X, therefore the root cause has something to do with X”. That’s very often not true.
Look at the test code
After looking at the failure message, I open the flaky test in an editor an look at its code. At first I’m not looking for anything specific. I’m just getting the lay of the land. How big is this test? How easy is it to understand? Does it have a lot of setup data or not much?
I do all this to load the problem area into my head. The more familiar I am with the problem area, the more I can “read from RAM” (use my brain’s short-term memory) as I continue to work on the problem instead of “read from disk” (look at the code). This way I can solve the problem more efficiently.
Once I’ve surveyed the test in this way, I zero in on the line that’s yielding the failure message. Is there anything interesting that jumps out? If so, I pause and consider and potentially investigate.
The next step I take with the test code is to go through the list of causes of flaky tests and look for instances of those.
After I’ve done all that, I study the test code to try to understand, in a meaningful big-picture way, what the test is all about. Obviously I’m going to be more likely to be successful in fixing problems with the test if I actually understand what the test is all about than if I don’t. (Sometimes this involves rewriting part or all of the test.)
Finally, I often go back to the beginning and repeat these steps an additional time, since each run through these steps can arm me with more knowledge that I can use on the next run through.
Look at the application code
The root cause of every flaky test is some sort of non-determinism. Sometimes the non-determinism comes from the test. Sometimes the non-determinism comes from the application code. If I wasn’t able to find the cause of the flakiness in the test code, I turn my attention to the application code.
Just like with the test code, the first thing I do is to just scan the relevant application code to get a feel for what it’s all about.
The next thing I do is to go through the code more carefully and look for causes of flakiness. (Again, you can refer to this blog post for that list.)
Then, just like with the test code, I try to understand the application code in a big-picture sort of way.
Make the test as understandable as possible
Many times when I look at a flaky test, the test code is too confusing to try to troubleshoot. When this is the case, I try to improve the test to the point that I can easily understand it. Easily understandable code is obviously easier to troubleshoot than confusing code.
To my surprise, I’ve often found that, after I improve the structure of the test, the flakiness goes away.
Side note: whenever I modify a test to make it easier to understand, I perform my modifications in one or more small, atomic pieces of work. I do this because I want to keep my refactorings and my fix attempts separate.
Make the application code as easy to understand as possible
If the application code is confusing then it’s obviously going to hurt my ability to understand and fix the flaky test. So, sometimes, I refactor the application code to make it easier to understand.
Make the test environment as understandable as possible
The quality of the test environment has a big bearing on how easy the test suite is to work with. By test environment I mean the tools (RSpec/Minitest, Factory Bot, Faker, etc.), the configurations for the tools, any seed data, the continuous integration service along with its configuration, any files shared among all the tests, and things like that.
The harder the test environment is to understand, the harder it will be to diagnose flaky tests. Not every flaky test fix job prompts me to work on the test environment, but it’s one of the things I look at when I’m having a tough time or I’m out of other ideas.
Check the tests that ran just before the flaky test
Just because a certain test flakes doesn’t necessarily mean that that test itself is flaky—even if the same test flakes consistently.
Sometimes, due to leaked state, test A will create a problem and then test B will fail. (A complete description of leaked state can be found in this post.) The symptom is showing up in test B so it looks like test B has a problem. But there’s nothing at all wrong with test B. The real problem is test A. So the problematic test passes but the innocent test flakes. It’s very deceiving!
Therefore, when I’m trying to diagnose a flaky test, I’ll check the continuous integration service to see what test ran before that test failed. Sometimes this leads me to discover that the test that ran before the flaky one is leaking state and needs to be fixed.
Add diagnostic info to the test
Sometimes, the flaky test’s failure message doesn’t show much useful information. In these cases I might add some diagnostic info to the test (or the relevant application code) in the form of print statements or exceptions.
Perform a binary search
Binary search debugging is a tactic that I use to diagnose bugs quickly. There are two main ideas behind it: 1) it’s easier to find where a bug is than what a bug is, and 2) binary search can be used to quickly find the location of a bug.
I make heavy use of binary search debugging when diagnosing flaky tests. See this blog post for a complete description of how to use binary search debugging.
Repeat all the above steps
If I go through all the above steps and I don’t have any more ideas, I might simply go through the list an additional time. Now that I’m more familiar with the test and everything surrounding it, I might have an enhanced ability to learn new things about the situation when I take an additional pass, and I might have new ideas or realizations that I didn’t have before.
Now let’s talk about the third step in fixing a flaky test, applying the fix itself.
Applying the fix for a flaky test
How to be sure your bugfixes work
A mistake many developers make when fixing bugs is that they don’t figure out a way to know if their bugfix actually worked or not. The result is that they often have to “fix” the bug multiple times before it really gets fixed. And of course, the false fixes create waste and confusion. That’s obviously not good.
The way to ensure that you get the fix right the first time is to devise a test (it can be manual or automated) that a) fails when the bug is present and b) passes when the bug is absent. (See this post for more details on how to apply a bugfix.)
How the nature of flaky tests complicates the bugfix process
Unlike “regular” bugs, which can usually be reproduced on demand once reproduction steps are known, flaky tests are usually only reproducible in one way: by re-running the test suite repeatedly on CI.
This works out okay when the test fails with relative frequency. If the test fails one out of every five test runs, for example, then I can run the test suite 50 times and expect to see (on average) ten failures. This means that if I apply the ostensible fix for the flaky test and then run the test suite 50 more times and see zero failures, then I can be pretty confident that my fix worked.
How certain I can be that my fix worked goes down the more infrequently the flaky test fails. If the test fails only once out of every 50 test runs on average, then if I run my test suite 50 times and see zero failures, then I can’t be sure whether that means the flaky test is fixed or if it just means that all my runs passed due to random chance.
Ideally a bugfix process goes like this:
- Perform a test that shows that the bug is present (i.e run the test suite a bunch of times and observe that the flaky test fails)
- Apply a bugfix on its own branch
- Perform a test on that branch that shows that the bug is absent (i.e run the test suite a bunch of times and observe that the flaky test doesn’t fail)
- Merge the bugfix branch into master
The reason this process is good is because it gives certainty that the bugfix works before the bugfix branch gets merged into master.
But for a test that fails infrequently, it’s not realistic to perform the steps in that order. Instead it has to be like this:
- Perform a test that shows that the bug is present (i.e observe over time that the flaky test fails sometimes)
- Apply a bugfix on its own branch
- Merge the bugfix branch into master
- Perform a test on that branch that shows that the bug is absent (i.e observe over a sufficiently long period of time that the flaky test no longer fails)
Notice how the test that shows that the bug is present is different. When the test fails frequently, we can perform on “on-demand” test where we run the test suite a number of times to observe that the bug is present. When the test fails infrequently, we don’t realistically have this option because it may require a prohibitively large number of test suite runs just to get a single failure. Instead we just have to go off of what has been observed in the test suite over time in the natural course of working.
Notice also that the test that shows that the bug is absent is different. When the test fails frequently, we can perform the same on-demand test after the bugfix as before the bugfix in order to be certain that the bugfix worked. When the test fails infrequently, we can’t do this, and we just have to wait until a bunch of test runs naturally happen over time. If the test goes sufficiently long without failing again, we can be reasonably sure that the bugfix worked.
Lastly, notice how in the process for an infrequently-failing test, merging the fix into master has to happen before we perform the test that ensures that the bugfix worked. This is because the only way to test that the bugfix worked is to actually merge the bugfix into master and let it sit there for a large number of test runs over time. It’s not ideal but there’s not a better way.
A note about deleting and skipping flaky tests
There are two benefits to fixing a flaky test. One benefit of course is that the test will no longer flake. The other is that you gain some skill in fixing flaky tests as well as a better understanding of what causes flaky tests. This means that fixing flaky tests creates a positive feedback loop. The more flaky tests you fix, the more quickly and easily you can fix future flaky tests, and the fewer flaky tests you’ll write in the first place because you know what mistakes not to make.
If you simply delete a flaky test, you’re depriving yourself of that positive feedback loop. And of course, you’re also destroying whatever value that test had. It’s usually better to push through and keep working on fixing the flaky test until the job is done.
It might sometimes seem like the amount of time it takes to fix a certain flaky test than the value of that test can justify. But keep in mind that the significant thing is not the cost/benefit ratio of any individual flaky test fix, but the cost/benefit ratio of all the flaky test fixes on average. Sometimes flaky test fixes will take 20 minutes and sometimes they’ll take two weeks. The flaky test fixes that take two weeks might feel unjustifiable, but if you have a general policy of just giving up when things get too hard and deleting the test, then your test-fixing skills will always stay limited, and your weak skills will incur a cost on the test suite for as long as you keep deleting difficult flaky tests. Better to just bite the bullet and develop the skills to fix hard flaky test cases.
Having said all that, deleting a flaky test is sometimes the right move. When development teams lack the skills to write non-flaky tests, sometimes the teams have other bad testing habits, like writing tests that are pointless. When a flaky test coincidentally happens to also be pointless, it’s better to just delete the test than to pay the cost to fix a test that doesn’t have any value.
Skipping flaky tests is similar in spirit to deleting them. Skipping a flaky test has all the same downsides as deleting it, plus now you have the extra overhead of occasionally stumbling across the test and remembering “Oh yeah, I should fix this eventually.” And what’s worse, the skipped test often gets harder to fix as time goes on because the skipped test is frozen in time but the rest of the codebase continues to change in ways that aren’t compatible with tests. The easiest time to fix a flaky test is right when the flakiness is first discovered.
- The root cause of every flaky test is some sort of non-determinism.
- Flaky tests are known to present themselves more in a CI environment than in a local test environment because certain characteristics of CI test runs make the tests more susceptible to non-determinism.
- I consider a flaky test to be a type of bug. When I’m fixing any bug, including a flaky test, I divide the bugfix into three stages, which are reproduction, diagnosis and fix.
- To reproduce a flaky test, I run the test suite enough times on CI to see the flaky test fail, or if it fails too infrequently I wait for it to fail naturally.
- There are a large number of tactics I use to diagnose flaky tests. I don’t necessarily go through the tactics in a specific order but rather I use intuition and experience to decide which tactic to use next. The important thing is to treat the flaky test diagnosis as a distinct step which occurs after reproduction and before the application of the fix.
- With the application of any bugfix, it’s good to have a test you can perform before and after the fix to be sure that the fix worked. When a flaky test fails frequently enough, you can do this sort of test by simply re-running the test suite in CI a sufficient number of times. If the flaky test fails infrequently, this is not practical, and the fix must be merged to master without being sure that it worked.
- When you delete a flaky test, you not only destroy the value of the test but you also lose the opportunity to build your skills in fixing flaky tests and avoiding writing flaky tests in the first place. Unless the test coincidentally happens to be one that has little or no value, it’s better to fix it.