What is a flaky test?
A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.
In other words, a flaky test is a test that’s non-deterministic.
A test can be non-deterministic if a) the test code is non-deterministic, b) the application code being tested is non-deterministic, or both.
Below are some common causes of flaky tests. I’ll briefly discuss the fix for some of them, but the focus of this post isn’t to provide a guide to fixing flaky tests. It’s to familiarize you with the most common causes so that you know what to look for when you do your investigation work.
The causes I’ll discuss are race conditions, leaked state, network/third-party dependency, randomness and fixed time dependency.
Race conditions
A race condition is when the correct functioning of a program depends on two or more parallel actions completing in a certain sequence, but the actions sometimes complete in a different sequence, resulting in incorrect behavior.
Real-life race condition example
Let’s say I’m going to pick up my friend to go to a party. I know it takes 15 minutes to get to his house so I text him 15 minutes prior so he can make sure to be ready in time. The sequence of events is supposed to be 1) my friend finishes getting ready and then 2) I arrive at my friend’s house. If the events happen in the opposite order then I end up having to wait for my friend, and I lay on the horn and shout obscenities until he finally comes out of his house.
Race conditions are especially likely to occur when the times are very close. Imagine, for example, that it always takes me exactly 15 minutes to get to my friend’s house, and it usually takes him 14 minutes to get ready, but about one time out of 50, say, it takes him 16 minutes. Or maybe it takes my friend 14 minutes to get ready but one day I somehow get to his house in 13 minutes instead of the usual 15. You can imagine how it wouldn’t take a big deviation from the norm to cause the race condition problem to occur.
Hopefully this real-life example illustrates that the significant thing that gives rise to race conditions is parallelism and sequence dependence, and it doesn’t matter what form the parallelism takes. The parallelism could take the form of multithreading, asynchronicity, two entirely separate systems (e.g. two calls to two different third-party APIs), or literally anything else.
Race conditions in DOM interaction/system tests
Race conditions are fairly common in system tests (tests that exercise the full application stack including the browser).
Let’s say there’s a test that 1) submits a form and then 2) clicks a link on the subsequent page. To tie this to the pick-up-my-friend analogy, the submission of the form would be analogous to me texting my friend saying I’ll be there in 15 minutes, and the loading of the subsequent page would be analogous to my friend getting ready. The race condition creates a problem when the test attempts to click the link before the page loads (analogous to me arriving at my friend’s house before he’s ready).
To make the analogy more precise, this sort of failure is analogous to me arriving at my friend’s house before he’s ready, and then just leaving after five minutes because I don’t want to wait (timeout error!).
The solution to this sort of race condition is easy: just remove the asynchronicity. Instead of allowing the test to execute at its natural pace, add a step after the form submission that waits for the next page to load before trying to click the link. This is analogous to me adding a step and saying “text me when you’re ready” before I leave to pick up my friend. If we arrange it like that then there’s no longer a race.
Because DOM interactions often involve asynchronicity, DOM interaction is a common area for race conditions, and therefore flaky tests, to be present.
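To make the “wait for the next page to load” step a bit more concrete, here’s a minimal Python sketch. Everything in it is made up for illustration: FakePage is a hypothetical stand-in for a browser page that finishes loading on a background timer, and wait_until is a hand-rolled polling helper, not any particular testing library’s API.

```python
import threading
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # One last check in case the condition became true right at the deadline
    return condition()


class FakePage:
    """Hypothetical page that finishes loading `load_delay` seconds after creation."""

    def __init__(self, load_delay):
        self.loaded = False
        timer = threading.Timer(load_delay, self._finish_loading)
        timer.daemon = True
        timer.start()

    def _finish_loading(self):
        self.loaded = True

    def click_link(self):
        if not self.loaded:
            raise RuntimeError("link not found: page has not loaded yet")
        return "clicked"


def click_without_waiting(page):
    # Racy: only works if the page happens to load before this line runs
    return page.click_link()


def click_after_waiting(page):
    # No race: the click only happens once the page reports it has loaded
    if not wait_until(lambda: page.loaded, timeout=5.0):
        raise TimeoutError("page did not load within 5 seconds")
    return page.click_link()
```

The waiting version still has a timeout, so it can still fail if the page takes longer than five seconds, but it no longer depends on the click losing the race with the page load.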
Edge-of-timeout race conditions
A race condition can also occur when the amount of time an action takes to complete is just under a timeout value, e.g. a timeout value is five seconds and the action takes four seconds (but sometimes six seconds).
In these cases you can increase the timeout and/or improve the performance of the action such that the timeout value and typical run length are no longer close to each other.
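As a rough illustration of why the gap between the timeout and the typical run length matters, here’s a small Python sketch. The durations are invented numbers (49 runs at four seconds and one at six), and failure_rate is a hypothetical helper, not part of any tool.

```python
def failure_rate(durations, timeout):
    """Return the fraction of runs whose duration exceeds the timeout."""
    timed_out = [d for d in durations if d > timeout]
    return len(timed_out) / len(durations)


# Hypothetical measurements in seconds: usually 4, occasionally 6
durations = [4.0] * 49 + [6.0]

rate_with_tight_timeout = failure_rate(durations, timeout=5.0)
rate_with_roomy_timeout = failure_rate(durations, timeout=10.0)
```

With the five-second timeout, one run in 50 times out. With the ten-second timeout, none of these sample runs do.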
Leaked state
Tests can create flaky behavior when they leak state into other tests.
Let me shoot an apple off your head
Let’s use another analogy to illustrate this one. Let’s say I wanted to perform two tests on myself. The first test is to see if I can shoot an apple off the top of my friend’s head with a bow and arrow. The second test is to see if I can drink 10 shots of tequila in under an hour.
If I were to perform the arrow test immediately followed by the tequila test and do that once a week, I could expect to get basically the same test results each time.
But if I were to perform the tequila test immediately followed by the arrow test, my aim would probably be compromised, and I might miss the apple once in a while. (Sorry, friend.) The problem is that the tequila test “leaks state”: it creates a lasting alteration in the global state, and that alteration affects subsequent tests.
And if I were to perform these two tests in random order, the tequila test would give the same result each time because I’d always be starting it sober, but the arrow test would appear to “flake” because sometimes I’d start it sober and sometimes I’d start it drunk. I might even suspect that there’s a problem with the arrow test because that’s the test that’s showing the symptom, but I’d be wrong. The problem is a different test with leaky state.
Ways for tests to leak state
Returning to computers, there are a lot of ways a test can alter the global state and create non-deterministic behavior.
One way is to alter database data. Imagine there are two tests, each of which creates a user with the email address firstname.lastname@example.org, and the data isn’t cleaned up between tests. If there’s a unique constraint on users.email, whichever test runs first will pass and whichever test runs second will raise an error due to the unique constraint violation. So sometimes the first test will fail and sometimes the second test will fail, depending on which order you run them in.
Another way that a test could leak state is to change a configuration setting. Let’s say that your test environment has background jobs configured not to run for most tests because most background jobs are irrelevant to what’s being tested and would just slow things down. But then let’s imagine that you have one test where you do want background jobs to run, and so at the beginning of that test you change the background job setting from “don’t run” to “run”. If you don’t remember to change the setting back to “don’t run” at the end, background jobs will run for all later tests and potentially cause problematic behavior.
State can also be leaked by altering environment variables, altering the contents of the filesystem, or any number of other ways.
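For the background job setting described above, one way to avoid forgetting the “change it back” step is to make the restore automatic. Here’s a minimal Python sketch. The settings dictionary, the run_background_jobs key and the override_setting helper are all hypothetical, standing in for whatever configuration mechanism your application really uses.

```python
from contextlib import contextmanager

# Hypothetical global test configuration: background jobs are off by default
settings = {"run_background_jobs": False}


@contextmanager
def override_setting(key, value):
    """Temporarily change a setting and restore the old value afterwards."""
    previous = settings[key]
    settings[key] = value
    try:
        yield
    finally:
        # Runs even if the test body raises, so the change cannot leak
        settings[key] = previous


def test_that_needs_background_jobs():
    with override_setting("run_background_jobs", True):
        assert settings["run_background_jobs"] is True
    # Back to the default for every test that runs after this one
    assert settings["run_background_jobs"] is False
```

The same save-and-restore idea applies to environment variables and files: record the old value, make the change, and put the old value back in a step that always runs.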
Network/third-party dependency
The main reason why network dependency can create non-deterministic behavior doesn’t take a lot of explaining: sometimes the network is up and sometimes it’s not.
Moreover, when you’re depending on the network, you’re often depending on some third-party service. Even if the network itself is working just fine, the third-party service could suffer an outage at any time, causing your tests to fail. I’ve also seen cases where a test makes a third-party service call over and over, gets rate-limited, and from that point on, for a period of time, fails.
The way to prevent flaky tests caused by network dependence is to use test doubles in your tests rather than hitting live services.
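Here’s a small Python sketch of what using a test double might look like, built on Mock from the standard library’s unittest.mock. The shipping_quote function and the get_rate_per_kg method are invented for this example. The sketch also assumes the code under test receives its client as an argument, which is what makes it easy to swap in a double.

```python
from unittest.mock import Mock


def shipping_quote(rate_client, weight_kg):
    """Hypothetical application code that depends on a third-party rate service."""
    rate = rate_client.get_rate_per_kg()
    return round(rate * weight_kg, 2)


def test_shipping_quote():
    # The double returns a canned answer, so no network call is made
    fake_client = Mock()
    fake_client.get_rate_per_kg.return_value = 2.5

    assert shipping_quote(fake_client, weight_kg=4) == 10.0
    fake_client.get_rate_per_kg.assert_called_once_with()
```

Because the double never touches the network, an outage or a rate limit on the real service can’t make this test fail.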
Randomness
Randomness is, by definition, non-deterministic. If you, for example, have a test that generates a random integer between 1 and 2 and then asserts that that number is 1, that test is obviously going to fail about half the time. Random inputs lead to random failures.
One way to get bitten by randomness is to grab the first item in a list of things that’s usually in the same order but not always. That’s why it’s usually better to specify a definite item rather than grab the nth item in a list.
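To illustrate the “nth item” problem, here’s a short Python sketch. The fetch_users function is a made-up stand-in for a query with no guaranteed ordering (it shuffles on purpose), and the user records are invented.

```python
import random


def fetch_users(rng):
    """Hypothetical stand-in for a query that returns rows in no fixed order."""
    users = [
        {"name": "alice", "role": "admin"},
        {"name": "bob", "role": "member"},
        {"name": "carol", "role": "member"},
    ]
    rng.shuffle(users)
    return users


def pick_admin_fragile(users):
    # Assumes the admin always comes back first, which nothing guarantees
    return users[0]


def pick_admin_robust(users):
    # Selects the record by a property that actually identifies it
    return next(user for user in users if user["role"] == "admin")
```

A test built on pick_admin_fragile passes only when the admin happens to come back first. A test built on pick_admin_robust gets the same record regardless of order.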
Fixed time dependency
Once I was working late and I noticed that certain tests started to fail for no apparent reason, even though I hadn’t changed any code.
After some investigation I realized that, due to the way they were written, these tests would always fail when run at a certain time of day. I had just never worked that late before.
This is common with tests that cross the boundary of a day (or month or year). Let’s say you have a test that creates an appointment that occurs four hours from the current time, and then asserts that that appointment is included on today’s list of appointments. That test will pass when it’s run at 8am because the appointment will occur at 12pm, which is the same day. But the test will fail when it’s run at 10pm because four hours after 10pm is 2am, which is the next day.
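Here’s the appointment example as a small Python sketch. The function names are made up, and passing the current time in as a now argument is my assumption, made so that the example can be run at a chosen time of day instead of whatever time the clock says.

```python
from datetime import datetime, timedelta


def schedule_appointment(now):
    """Create an appointment four hours from the given time."""
    return now + timedelta(hours=4)


def is_on_todays_list(appointment, now):
    """Return True if the appointment falls on the same calendar day as `now`."""
    return appointment.date() == now.date()


def appointment_is_listed_today(now):
    # This is the assertion the test makes, parameterized by the current time
    appointment = schedule_appointment(now)
    return is_on_todays_list(appointment, now)


# The dates are arbitrary; only the time of day matters here
morning = datetime(2024, 1, 15, 8, 0)   # 8am: appointment lands at 12pm, same day
evening = datetime(2024, 1, 15, 22, 0)  # 10pm: appointment lands at 2am, next day

passes_in_morning = appointment_is_listed_today(morning)
passes_in_evening = appointment_is_listed_today(evening)
```

One possible way to make a test like this deterministic is to have it always use a fixed time such as morning rather than the real current time, so the result no longer depends on when the suite runs.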
Takeaways
- A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.
- Flaky tests are caused by non-determinism either in the test code or in the application code.
- Some of the most common causes of flaky tests are race conditions, leaked state, network/third-party dependency, randomness and fixed time dependency.