Category Archives: Automated Testing

Testing anti-pattern: merged setup data

Misleading details

When we create our jobs, we give each of them an order_index. This matters for the #status test but is totally immaterial to the #start! test. As the author of this test, I happen to know which details matter for which tests, but someone reading this test for the first time would have no easy way of knowing when an order_index is needed and when it’s not.

The only safe assumption is that every detail is needed for every test. If we alter the global setup data somehow, it’s possible that we’ll cause a silent defect. We could cause a test to keep passing but to lose its validity and start showing us a false positive. When data is included in the global setup beyond what’s needed for every test in the file, it creates an unnecessary risk that makes the test harder to change than it needs to be.

Shoehorned data

I’ll now reveal a bit more of the describe "#start!" test case.

describe "#start!" do
  let!(:job) { create(:job) }
  let!(:build) { job.build }

  before do
    fake_job_machine_request = double("JobMachineRequest")

    # The following line, which references "job",
    # is the line to pay attention to
    allow(job).to receive(:job_machine_request).and_return(fake_job_machine_request)
  end
end

In the above code, which shows the original version of the #start! test before we merged all the setup data, the setup and usage of job were straightforward. We created a job and then we used it.

But now that we’ve merged all our setup code, we only have job_1 and job_2 available to us and no plain old job_2 anymore. This makes things awkward for the #start! test where we only need one job to work with. Here are two possible options, both undesirable.

before do
  # Referring to job_1 misleadingly implies that the
  # fact that it's job 1 and not job 2 is significant,
  # which it's not, and also that we might do
  # something with job 2, which we won't
  allow(job_1).to receive(:job_machine_request).and_return(fake_job_machine_request)

  # This is almost better but not really
  job = job_1
  allow(job).to receive(:job_machine_request).and_return(fake_job_machine_request)
end

Another way we can deal with this issue is simply to add a third job to our setup called job:

RSpec.describe Build, type: :model do
  let!(:build) { create(:build) }
  let!(:job) { create(:job) }
  let!(:job_1) { create(:job, build: build, order_index: 1) }
  let!(:job_2) { create(:job, build: build, order_index: 2) }

  describe "#start!" do
  ...

  describe "#status" do
  ...
...

This takes away our problem of having to shoehorn job_1 into standing in for job, and it also takes away our misleading details problem, but it creates a new problem.

It’s unclear what setup data belongs to which test(s)

The other problem with combining the setup data at the top of the test is that it’s not clear which values are needed for which test. This creates a symptom very similar to the one creates by the misleading details problem: we can’t easily know what kinds of changes to the setup data are safe and which are risky.

It’s unclear what the state of the system is and how it will affect any particular test

I once worked on a project where a huge setup script ran before the test suite. The test environment would get some arbitrary number of users, superusers, accounts, customers and almost every other kind of entity imaginable.

The presence of all this data made it extremely hard to understand what the state of the system was and how any existing data might interfere with any particular test. Although it’s sometimes worth it to make compromises, it’s generally much easier if every test starts with a fully clean slate.

What’s the solution?

Instead of merging all of a test file’s setup data at the top, it’s better to give each test only the absolute minimum data that it needs. This usually means giving each test case its own individual setup data. Often it even means creating duplication—although duplication in tests is usually not real duplication. The performance cost is also usually negligible or nonexistent, since even if global setup data only appears once, it actually gets run once per test run anyway, and nothing is saved by combining it. Any cost in duplication or performance is usually overwhelmingly offset by the benefit of not merging setup data: it makes your tests easier to understand and work with.

Why duplication is more acceptable in tests

4 Replies

It’s often taught in programming that duplication is to be avoided. But for some reason it’s often stated that duplication is more acceptable in test code than in application code. Why is this?

We’ll explore this, but first, let’s examine the wrong answers.

Incorrect reasons why duplication is more acceptable in tests

“Duplication isn’t actually that bad.”

Many programmers hold the opinion that duplication isn’t something that should be avoided fastidiously. Instead, a certain amount of duplication should be tolerated, and when the duplication gets to be too painful, then it should be addressed. The “rule of three” for example says to tolerate code that’s duplicated twice, but clean it up once the duplication reaches three instances.

This way of thinking is overly simplistic and misses the point. The cost of duplication doesn’t depend on whether the duplication appears twice or three times but rather on factors like how easy the duplication is to notice, how costly it is to keep the duplicated instances synchronized, and how much “traffic” the duplicated areas receive. (See this post for more details on the nature of duplication and its costs.)

The heuristic for whether to tolerate duplication shouldn’t be “tolerate some but don’t tolerate too much”. Rather the cost of a piece of duplication should be assessed based on the factors above and weighed against any benefits that piece of duplication has. If the costs aren’t justified by the benefits, then the duplication should be cleaned up.

“Duplication in test code can be clearer than the DRY version”

It’s true that duplication in test code can be clearer than the DRY version. But duplication in application code can be clearer than the DRY version too. So if duplicating code can make it clearer, why not prefer duplication in application code to the same exact degree as in test code?

This answer doesn’t actually answer the question. The question is about the difference between duplication in test code and application code.

The real reason why duplication is more acceptable in test code

In order to understand why duplication is more acceptable in test code than application code, it helps to get very clear on what exactly duplication is and why it incurs a cost.

What duplication is and why it costs

Duplication doesn’t mean two identical pieces of code. Duplication is two or more copies of the same behavior. It’s possible to have two identical pieces of code that represent different pieces of behavior. It’s also possible to have the same behavior expressed more than once but in different code.

Let’s also review why duplication incurs a cost. The main reason is because it leaves the program susceptible to logical inconsistencies. If one copy of a behavior gets changed but the other copies don’t, then the other behaviors are now wrong and there’s a bug present. The other reason duplication incurs a cost is because it creates a maintenance burden. Updating something in multiple places is obviously more costly than updating it in just one place.

The difference between test code and application code

The difference between test code and application code is that test code doesn’t contain behaviors. All the behaviors are in the application code. The purpose of the test code is to specify the behaviors of the application code.

What in the codebase determines whether the application code is correct? The tests. If the application code passes its tests (i.e. its specifications), then the application code is correct (for a certain definition of “correct”). What in the code determines whether the tests (specifications) are correct? Nothing! The program’s specifications come entirely from outside the program.

Tests are always correct

This means that whatever the tests specify is, by definition, correct. If we have two tests containing the same code and one of the tests changes, it does not always logically follow that the other test needs to be updated to match. This is different from duplicated application code. If a piece of behavior is duplicated in two places in the application code and one piece of behavior gets changed, it does always logically follow that the other piece of behavior needs to get updated to match. (Otherwise it wouldn’t be in instance of duplication.)

This is the reason why duplication is more acceptable in test code than in application code.

Takeaways

Duplication is when one behavior is specified multiple times.
Duplication in application code is costly because multiple, among other reasons, copies of the same behavior are subject to diverging, thus creating a bug.
Since test code is a human-determined specification, it’s by definition always correct. If one instance of a duplicated piece of code changes, it’s not a logical necessity that the other piece needs to change with it.

Why tests flake more on CI than locally

1 Reply

A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.

The root cause of flaky tests is some sort of non-determinism, either in the test code or in the application code.

In order to understand why a CI test run is more susceptible to flakiness than a local test run, we can go through all the root causes for flakiness one-by-one and consider how a CI test run has a different susceptibility to that specific flaky test cause than a local test run.

The root causes we’ll examine (which are all explained in detail in this post) are leaked state, race conditions, network/third-party dependency, fixed time dependency and randomness.

Leaked state

Sometimes one test leaks some sort of state (e.g. a change to a file or env var) into the global environment which interferes with later tests.

The reason a CI test run is more susceptible to leaked state flakiness is clear. Unlike a local environment where you’re usually just running one test file at a time, in CI you’re running a whole bunch of tests together. This creates more opportunities for tests to interfere with each other.

Race conditions

A race condition is when the correct functioning of a program depends on two or more parallel actions completing in a certain sequence, but the actions sometimes complete in a different sequence, resulting in incorrect behavior.

One way that race conditions can arise is through performance differences. Let’s say there’s a process that times out after 5000ms. Most of the time the process completes in 4500ms, meaning no timeout. But sometimes it takes 5500ms to complete, meaning the process does time out.

It’s very easy for differences to arise between a CI environment and a local environment in ways that affect performance. The OS is different, the memory and processor speed are different, and so on. These differences can mean that race conditions arise on CI that would not have arisen in a local environment.

Network/third-party dependency

Network dependency can lead to flaky tests for the simple reason that sometimes the network works and sometimes it doesn’t. Third-party dependency can lead to flaky tests because sometimes third-party services don’t behave deterministically. For example, the service can have an outage, or the service can rate-limit you.

This is the type of flakiness that should never occur because it’s not a good idea to hit the network in tests. Nonetheless, I have seen this type of flakiness occur in test suites where the developers didn’t know any better.

Part of the reason why CI test runs are more susceptible to this type of flakiness is that there are simply more at-bats. If a test makes a third-party request only once per day locally but 1,000 times per day on CI, there are of course more chances for the CI request to encounter a problem.

Fixed time dependency

There are some tests that always pass at one time of day (or month or year) and always fail at another.

Here’s an excerpt about this from my other post about the causes of flaky tests:

This is common with tests that cross the boundary of a day (or month or year). Let’s say you have a test that creates an appointment that occurs four hours from the current time, and then asserts that that appointment is included on today’s list of appointments. That test will pass when it’s run at 8am because the appointment will appear at 12pm which is the same day. But the test will fail when it’s run at 10pm because four hours after 10pm is 2am which is the next day.

CI test runs are more susceptible to fixed-time-dependency flakiness than local test runs for a few reasons. One is the fact that CI test runs simply have more at-bats than local test runs. Another is that the CI environment’s time zone settings might be different from the local test environment. A third reason is that unlike a local test environment which is normally only used inside of typical working hours, a CI environment is often utilized for a broader stretch of time each day due to developers kicking off test runs from different time zones and from developers’ varying schedule habits.

Randomness

The final cause of flaky tests is randomness. As far as I know, the only way that CI test runs are more susceptible to flakiness due to randomness is the fact that CI test runs have more at-bats than local test runs.

Takeaways

A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.
The root cause of flaky tests is some sort of non-determinism, either in the test code or in the application code.
Whenever flakiness is more frequent in CI, the reason is because some difference between the CI test runs and the local runs make flakiness more likely. When flakiness is more likely, it’s because one of the specific five causes of flaky tests has been made more likely.

How I fix flaky tests

2 Replies

What a flaky test is and why they’re hard to fix

A flaky test is a test that passes sometimes and fails sometimes even though no code has changed.

There are several causes of flaky tests. The commonality among all the causes is that they all involve some form of non-determinism: code that doesn’t always behave the same on every run even though neither the inputs nor the code itself has changed.

Flaky tests are known to present themselves more in a continuous integration (CI) environment than in a local test environment. The reason for this is that certain characteristics of CI test runs make the tests more susceptible to non-determinism.

The fact that the flakiness usually can’t be reproduced locally means that it’s harder to reproduce and diagnose the buggy behavior of the flaky tests.

In addition to the fact that flaky tests often only flake on CI, the fact that flaky tests don’t fail consistently adds to the difficulty of fixing them.

Despite these difficulties, I’ve developed some tactics and strategies for fixing flaky tests that consistently lead to success. In this post I’ll give a detailed account of how I fix flaky tests.

The overall approach

When I’m fixing any bug I divide the bugfix into three stages: reproduction, diagnosis and fix.

I consider a flaky test a type of bug. Therefore, when I try to fix a flaky test, I follow this same three-step process as I would when fixing any other type of bug. In what follows I’ll cover how I approach each of these three steps of reproduction, diagnosis and fix.

Before reproducing: determine whether it’s really a flaky test

Not everything that appears to be a flaky test is actually a flaky test. Sometimes a test that appears to be flaking is just a healthy test that’s legitimately failing.

So when I see a test that’s supposedly flaky, I like to try to find multiple instances of that test flaking before I accept its flakiness as a fact. And even then, there’s no law that says that a test that previously flaked can’t fail legitimately at some point in the future. So the first step is to make sure that the problem I’m solving really is the problem I think I’m solving.

Reproducing a flaky test

If I can’t reproduce a bug, I can’t test for its presence or absence. If I can’t test for a bug’s presence or absence, I can’t know whether a fix attempt actually fixed the bug or not. For this reason, before I attempt to fix any bug, I always devise a test that will tell me whether the bug is present or absent.

My go-to method for reproducing a flaky test is simply to re-run the test suite multiple times on my CI service until I see the flaky test fail. Actually, I like to run the test suite a great number of times to get a feel for how frequently the flaky test fails. The actions I take during the bugfix process may be different depending on how frequently the test fails, as we’ll see later on.

Sometimes a flaky test fails so infrequently that it’s practically impossible to get the test to fail on demand. When this happens, it impossible to tell whether the test is passing due to random chance or because the flakiness has legitimately gone away. The way I handle these cases is to deprioritize the fix attempt and wait for the test to fail again in the natural course of business. That way I can be sure that I’m not wasting my time trying to fix a problem that’s not really there.

That covers the reproduction step of the process. Now let’s turn to diagnosis.

Diagnosing a flaky test

What follows is a list of tactics that can be used to help diagnose flaky tests. The list is kind of linear and kind of not. When I’m working on flaky tests I’ll often jump from tactic to tactic depending on what the scenario calls for rather than rigidly following the tactics in a certain order.

Get familiar with the root causes of flaky tests

If you were a doctor and you needed to diagnose a patient, it would obviously be helpful for you first to be familiar with a repertoire of diseases and their characteristic symptoms so you can recognize diseases when you see them.

Same with flaky tests. If you know the common causes for flaky tests and how to recognize them, you’ll have an easier time with trying to diagnose flaky tests.

In a separate post I show the root causes of flaky tests, which are race conditions, leaked state, network/third-party dependency, fixed time dependency and randomness. I suggest either committing these root causes to memory or reviewing them each time you embark on a flaky test diagnosis project.

Have a completely open mind

One of the biggest dangers in diagnosing flaky tests or in diagnosing any kind of problem is the danger of coming to believe something that’s not true.

Therefore, when starting to investigate a flaky test, I try to be completely agnostic as to what the root cause might be. It’s better to be clueless and right than to be certain and wrong.

Look at the failure messages

The first thing I do when I become aware of a flaky test is to look at the error message. The error message doesn’t always reveal anything terribly helpful but I of course have to start somewhere. It’s worth checking the failure message because sometimes it contains a helpful clue.

It’s important not to be deceived by error messages. Error messages are an indication of a symptom of a root cause, and the symptom of a root cause often has little or nothing to do with the root cause itself. Be careful not to fall into the trap of “the error message says something about X, therefore the root cause has something to do with X”. That’s very often not true.

Look at the test code

After looking at the failure message, I open the flaky test in an editor an look at its code. At first I’m not looking for anything specific. I’m just getting the lay of the land. How big is this test? How easy is it to understand? Does it have a lot of setup data or not much?

I do all this to load the problem area into my head. The more familiar I am with the problem area, the more I can “read from RAM” (use my brain’s short-term memory) as I continue to work on the problem instead of “read from disk” (look at the code). This way I can solve the problem more efficiently.

Once I’ve surveyed the test in this way, I zero in on the line that’s yielding the failure message. Is there anything interesting that jumps out? If so, I pause and consider and potentially investigate.

The next step I take with the test code is to go through the list of causes of flaky tests and look for instances of those.

After I’ve done all that, I study the test code to try to understand, in a meaningful big-picture way, what the test is all about. Obviously I’m going to be more likely to be successful in fixing problems with the test if I actually understand what the test is all about than if I don’t. (Sometimes this involves rewriting part or all of the test.)

Finally, I often go back to the beginning and repeat these steps an additional time, since each run through these steps can arm me with more knowledge that I can use on the next run through.

Look at the application code

The root cause of every flaky test is some sort of non-determinism. Sometimes the non-determinism comes from the test. Sometimes the non-determinism comes from the application code. If I wasn’t able to find the cause of the flakiness in the test code, I turn my attention to the application code.

Just like with the test code, the first thing I do is to just scan the relevant application code to get a feel for what it’s all about.

The next thing I do is to go through the code more carefully and look for causes of flakiness. (Again, you can refer to this blog post for that list.)

Then, just like with the test code, I try to understand the application code in a big-picture sort of way.

Make the test as understandable as possible

Many times when I look at a flaky test, the test code is too confusing to try to troubleshoot. When this is the case, I try to improve the test to the point that I can easily understand it. Easily understandable code is obviously easier to troubleshoot than confusing code.

To my surprise, I’ve often found that, after I improve the structure of the test, the flakiness goes away.

Side note: whenever I modify a test to make it easier to understand, I perform my modifications in one or more small, atomic pieces of work. I do this because I want to keep my refactorings and my fix attempts separate.

Make the application code as easy to understand as possible

If the application code is confusing then it’s obviously going to hurt my ability to understand and fix the flaky test. So, sometimes, I refactor the application code to make it easier to understand.

Make the test environment as understandable as possible

The quality of the test environment has a big bearing on how easy the test suite is to work with. By test environment I mean the tools (RSpec/Minitest, Factory Bot, Faker, etc.), the configurations for the tools, any seed data, the continuous integration service along with its configuration, any files shared among all the tests, and things like that.

The harder the test environment is to understand, the harder it will be to diagnose flaky tests. Not every flaky test fix job prompts me to work on the test environment, but it’s one of the things I look at when I’m having a tough time or I’m out of other ideas.

Check the tests that ran just before the flaky test

Just because a certain test flakes doesn’t necessarily mean that that test itself is flaky—even if the same test flakes consistently.

Sometimes, due to leaked state, test A will create a problem and then test B will fail. (A complete description of leaked state can be found in this post.) The symptom is showing up in test B so it looks like test B has a problem. But there’s nothing at all wrong with test B. The real problem is test A. So the problematic test passes but the innocent test flakes. It’s very deceiving!

Therefore, when I’m trying to diagnose a flaky test, I’ll check the continuous integration service to see what test ran before that test failed. Sometimes this leads me to discover that the test that ran before the flaky one is leaking state and needs to be fixed.

Add diagnostic info to the test

Sometimes, the flaky test’s failure message doesn’t show much useful information. In these cases I might add some diagnostic info to the test (or the relevant application code) in the form of print statements or exceptions.

Perform a binary search

Binary search debugging is a tactic that I use to diagnose bugs quickly. There are two main ideas behind it: 1) it’s easier to find where a bug is than what a bug is, and 2) binary search can be used to quickly find the location of a bug.

I make heavy use of binary search debugging when diagnosing flaky tests. See this blog post for a complete description of how to use binary search debugging.

Repeat all the above steps

If I go through all the above steps and I don’t have any more ideas, I might simply go through the list an additional time. Now that I’m more familiar with the test and everything surrounding it, I might have an enhanced ability to learn new things about the situation when I take an additional pass, and I might have new ideas or realizations that I didn’t have before.

Now let’s talk about the third step in fixing a flaky test, applying the fix itself.

Applying the fix for a flaky test

How to be sure your bugfixes work

A mistake many developers make when fixing bugs is that they don’t figure out a way to know if their bugfix actually worked or not. The result is that they often have to “fix” the bug multiple times before it really gets fixed. And of course, the false fixes create waste and confusion. That’s obviously not good.

The way to ensure that you get the fix right the first time is to devise a test (it can be manual or automated) that a) fails when the bug is present and b) passes when the bug is absent. (See this post for more details on how to apply a bugfix.)

How the nature of flaky tests complicates the bugfix process

Unlike “regular” bugs, which can usually be reproduced on demand once reproduction steps are known, flaky tests are usually only reproducible in one way: by re-running the test suite repeatedly on CI.

This works out okay when the test fails with relative frequency. If the test fails one out of every five test runs, for example, then I can run the test suite 50 times and expect to see (on average) ten failures. This means that if I apply the ostensible fix for the flaky test and then run the test suite 50 more times and see zero failures, then I can be pretty confident that my fix worked.

How certain I can be that my fix worked goes down the more infrequently the flaky test fails. If the test fails only once out of every 50 test runs on average, then if I run my test suite 50 times and see zero failures, then I can’t be sure whether that means the flaky test is fixed or if it just means that all my runs passed due to random chance.

Ideally a bugfix process goes like this:

Perform a test that shows that the bug is present (i.e run the test suite a bunch of times and observe that the flaky test fails)
Apply a bugfix on its own branch
Perform a test on that branch that shows that the bug is absent (i.e run the test suite a bunch of times and observe that the flaky test doesn’t fail)
Merge the bugfix branch into master

The reason this process is good is because it gives certainty that the bugfix works before the bugfix branch gets merged into master.

But for a test that fails infrequently, it’s not realistic to perform the steps in that order. Instead it has to be like this:

Perform a test that shows that the bug is present (i.e observe over time that the flaky test fails sometimes)
Apply a bugfix on its own branch
Merge the bugfix branch into master
Perform a test on that branch that shows that the bug is absent (i.e observe over a sufficiently long period of time that the flaky test no longer fails)

Notice how the test that shows that the bug is present is different. When the test fails frequently, we can perform on “on-demand” test where we run the test suite a number of times to observe that the bug is present. When the test fails infrequently, we don’t realistically have this option because it may require a prohibitively large number of test suite runs just to get a single failure. Instead we just have to go off of what has been observed in the test suite over time in the natural course of working.

Notice also that the test that shows that the bug is absent is different. When the test fails frequently, we can perform the same on-demand test after the bugfix as before the bugfix in order to be certain that the bugfix worked. When the test fails infrequently, we can’t do this, and we just have to wait until a bunch of test runs naturally happen over time. If the test goes sufficiently long without failing again, we can be reasonably sure that the bugfix worked.

Lastly, notice how in the process for an infrequently-failing test, merging the fix into master has to happen before we perform the test that ensures that the bugfix worked. This is because the only way to test that the bugfix worked is to actually merge the bugfix into master and let it sit there for a large number of test runs over time. It’s not ideal but there’s not a better way.

A note about deleting and skipping flaky tests

There are two benefits to fixing a flaky test. One benefit of course is that the test will no longer flake. The other is that you gain some skill in fixing flaky tests as well as a better understanding of what causes flaky tests. This means that fixing flaky tests creates a positive feedback loop. The more flaky tests you fix, the more quickly and easily you can fix future flaky tests, and the fewer flaky tests you’ll write in the first place because you know what mistakes not to make.

If you simply delete a flaky test, you’re depriving yourself of that positive feedback loop. And of course, you’re also destroying whatever value that test had. It’s usually better to push through and keep working on fixing the flaky test until the job is done.

It might sometimes seem like the amount of time it takes to fix a certain flaky test than the value of that test can justify. But keep in mind that the significant thing is not the cost/benefit ratio of any individual flaky test fix, but the cost/benefit ratio of all the flaky test fixes on average. Sometimes flaky test fixes will take 20 minutes and sometimes they’ll take two weeks. The flaky test fixes that take two weeks might feel unjustifiable, but if you have a general policy of just giving up when things get too hard and deleting the test, then your test-fixing skills will always stay limited, and your weak skills will incur a cost on the test suite for as long as you keep deleting difficult flaky tests. Better to just bite the bullet and develop the skills to fix hard flaky test cases.

Having said all that, deleting a flaky test is sometimes the right move. When development teams lack the skills to write non-flaky tests, sometimes the teams have other bad testing habits, like writing tests that are pointless. When a flaky test coincidentally happens to also be pointless, it’s better to just delete the test than to pay the cost to fix a test that doesn’t have any value.

Skipping flaky tests is similar in spirit to deleting them. Skipping a flaky test has all the same downsides as deleting it, plus now you have the extra overhead of occasionally stumbling across the test and remembering “Oh yeah, I should fix this eventually.” And what’s worse, the skipped test often gets harder to fix as time goes on because the skipped test is frozen in time but the rest of the codebase continues to change in ways that aren’t compatible with tests. The easiest time to fix a flaky test is right when the flakiness is first discovered.

Takeaways

The root cause of every flaky test is some sort of non-determinism.
Flaky tests are known to present themselves more in a CI environment than in a local test environment because certain characteristics of CI test runs make the tests more susceptible to non-determinism.
I consider a flaky test to be a type of bug. When I’m fixing any bug, including a flaky test, I divide the bugfix into three stages, which are reproduction, diagnosis and fix.
To reproduce a flaky test, I run the test suite enough times on CI to see the flaky test fail, or if it fails too infrequently I wait for it to fail naturally.
There are a large number of tactics I use to diagnose flaky tests. I don’t necessarily go through the tactics in a specific order but rather I use intuition and experience to decide which tactic to use next. The important thing is to treat the flaky test diagnosis as a distinct step which occurs after reproduction and before the application of the fix.
With the application of any bugfix, it’s good to have a test you can perform before and after the fix to be sure that the fix worked. When a flaky test fails frequently enough, you can do this sort of test by simply re-running the test suite in CI a sufficient number of times. If the flaky test fails infrequently, this is not practical, and the fix must be merged to master without being sure that it worked.
When you delete a flaky test, you not only destroy the value of the test but you also lose the opportunity to build your skills in fixing flaky tests and avoiding writing flaky tests in the first place. Unless the test coincidentally happens to be one that has little or no value, it’s better to fix it.

What causes flaky tests

1 Reply

What is a flaky test?

A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.

In other words, a flaky test is a test that’s non-deterministic.

A test can be non-deterministic if either a) the test code is non-deterministic or b) the application code being tested is non-deterministic, or both.

Below are some common causes of flaky tests. I’ll briefly discuss the fix for some of these common causes, but the focus of this post isn’t to provide a guide to fixing flaky tests, it’s to give you a familiarity with the most common causes for flaky tests so that you can know what to go looking for when you do your investigation work. (I have a separate post for fixing flaky tests.)

The causes I’ll discuss are race conditions, leaked state, network/third-party dependency, randomness and fixed time dependency.

Race conditions

Real-life race condition example

Let’s say I’m going to pick up my friend to go to a party. I know it takes 15 minutes to get to his house so I text him 15 minutes prior so he can make sure to be ready in time. The sequence of events is supposed to be 1) my friend finishes getting ready and then 2) I arrive at my friend’s house. If the events happen in the opposite order then I end up having to wait for my friend, and I lay on the horn and shout obscenities until he finally comes out of his house.

Race conditions are especially likely to occur when the times are very close. Imagine, for example, that it always takes me exactly 15 minutes to get to my friend’s house, and it usually takes him 14 minutes to get ready, but about one time out of 50, say, it takes him 16 minutes. Or maybe it takes my friend 14 minutes to get ready but one day I somehow get to his house in 13 minutes instead of the usual 15. You can imagine how it wouldn’t take a big deviation from the norm to cause the race condition problem to occur.

Hopefully this real-life example illustrates that the significant thing that gives rise to race conditions is parallelism and sequence dependence, and it doesn’t matter what form the parallelism takes. The parallelism could take the form of multithreading, asynchronicity, two entirely separate systems (e.g. two calls to two different third-party APIs), or literally anything else.

Race conditions in DOM interaction/system tests

Race conditions are fairly common in system tests (tests that exercise the full application stack including the browser).

Let’s say there’s a test that 1) submits a form and then 2) clicks a link on the subsequent page. To tie this to the pick-up-my-friend analogy, the submission of the form would be analogous to me texting my friend saying I’ll be there in 15 minutes, and the loading of the subsequent page would be analogous to my friend getting ready. The race condition creates a problem when the test attempts to click the link before the page loads (analogous to me arriving at my friend’s house before he’s ready).

To make the analogy more precise, this sort of failure is analogous to me arriving at my friend’s house before he’s ready, and then just leaving after five minutes because I don’t want to wait (timeout error!).

The solution to this sort of race condition is easy: just remove the asynchronicity. Instead of allowing the test to execute at its natural pace, add a step after the form submission that waits for the next page to load before trying to click the link. This is analogous to me adding a step and saying “text me when you’re ready” before I leave to pick up my friend. If we arrange it like that then there’s no longer a race.

Because DOM interactions often involve asynchronicity, DOM interaction is a common area for race conditions, and therefore flaky tests, to be present.

Edge-of-timeout race conditions

A race condition can also occur when the amount of time an action takes to complete is just under a timeout value, e.g. a timeout value is five seconds and the action takes four seconds (but sometimes six seconds).

In these cases you can increase the timeout and/or improve the performance of the action such that the timeout value and typical run length are no longer close to each other.

Leaked state

Tests can create flaky behavior when they leak state into other tests.

Let me shoot an apple off your head

Let’s use another analogy to illustrate this one. Let’s say I wanted to perform two tests on myself. The first test is to see if I can shoot an apple off the top of my friend’s head with a bow and arrow. The second test is to see if I can drink 10 shots of tequila in under an hour.

If I were to perform the arrow test immediately followed by the tequila test and do that once a week, I could expect to get basically the same test results each time.

But if I were to perform the tequila test immediately followed by the arrow test, my aim would probably be compromised, and I might miss the apple once in a while. (Sorry, friend.) The problem is that the tequila test “leaks state”: it creates a lasting alteration in the global state, and that alteration affects subsequent tests.

And if I were to perform these two tests in random order, the tequila test would give the same result each time because I’d always be starting it sober, but the arrow test would appear to “flake” because sometimes I’d start it sober and sometimes I’d start it drunk. I might even suspect that there’s a problem with the arrow test because that’s the test that’s showing the symptom, but I’d be wrong. The problem is a different test with leaky state.

Ways for tests to leak state

Returning to computers, there are a lot of ways a test can alter the global state and create non-deterministic behavior.

One way is to alter database data. Imagine there are two tests, each of which creates a user with the email address test@example.com. The first test will pass and, if there’s a unique constraint on users.email, the second test will raise an error due to the unique constraint violation. Sometimes the first test will fail and sometimes the second test will fail, depending on which order you run them in.

Another way that a test could leak state is to change a configuration setting. Let’s say that your test environment has background jobs configured not to run for most tests because most background jobs are irrelevant to what’s being tested and would just slow things down. But then let’s imagine that you have one test where you do want background jobs to run, and so at the beginning of that tests you set background job setting from “don’t run” to “run”. If you don’t remember to change the setting back to “don’t run” at the end, background jobs will run for all later tests and potentially cause problematic behavior.

State can also be leaked by altering environment variables, altering the contents of the filesystem, or any number of other ways.

Network/third-party dependency

The main reason why network dependency can create non-deterministic behavior doesn’t take a lot of explaining: sometimes the network is up and sometimes it’s not.

Moreover, when you’re depending on the network, you’re often depending on some third-party service. Even if the network itself is working just fine, the third-party service could suffer an outage at any time, causing your tests to fail. I’ve also seen cases where a test fails because a test makes a third-party service call over and over and then gets rate-limited, and from that point on, for a period of time, that test fails.

The way to prevent flaky tests caused by network dependence is to use test doubles in your tests rather than hitting live services.

Randomness

Randomness is, by definition, non-deterministic. If you, for example, have a test that generates a random integer between 1 and 2 and then asserts that that number is 1, that test is obviously going to fail about half the time. Random inputs lead to random failures.

One way to get bitten by randomness is to grab the first item in a list of things that’s usually in the same order but not always. That’s why it’s usually better to specify a definite item rather than grab the nth item in a list.

Fixed time dependency

Once I was working late and I noticed that certain tests started to fail for no apparent reason, even though I hadn’t changed any code.

After some investigation I realized that, due to the way they were written, these tests would always fail when run at a certain time of day. I had just never worked that late before.

This is common with tests that cross the boundary of a day (or month or year). Let’s say you have a test that creates an appointment that occurs four hours from the current time, and then asserts that that appointment is included on today’s list of appointments. That test will pass when it’s run at 8am because the appointment will appear at 12pm which is the same day. But the test will fail when it’s run at 10pm because four hours after 10pm is 2am which is the next day.

Takeaways

A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.
Flaky tests are caused by non-determinism either in the test code or in the application code.
Some of the most common causes of flaky tests are race conditions, leaked state, network/third-party dependency, randomness and fixed time dependency.

Keep test code and application code separate

Why I organize my tests by domain concept, not by test type

2 Replies

In Rails apps that use RSpec, it’s customary to have a spec directory with certain subdirectories named for the types of tests they contain: models, requests, system. The Minitest organization scheme doesn’t share the exact same nomes but it does share the custom of organizing by test type.

I would like to raise the question: Why do we do it this way?

To get at the answer to that question I’d like to ask a broader question: What’s the benefit of organizing test files at all? Why not just throw all the tests in a single directory? For me there are two reasons.

Reasons to organize test files into directories

Finding tests

Sometimes I’m changing a feature and I want to know where the tests are for that feature so I can change or extend the tests accordingly

When I’m making a change to a feature, I usually want to know where the tests are that are related to that feature so I can update or extend the tests accordingly. Or, if that feature doesn’t have tests, I want to know so, and with a reasonable degree of certainty, so that I don’t accidentally create new tests that duplicate existing ones.

Running tests in groups

If tests are organized into directories then they can be conveniently run in groups.

It is of course possible, at least in some frameworks, to apply certain tags to tests and then run the tagged tests as a group. But doing so depends on developers remembering to add tags. This seems to me like a fragile link in the chain.

I find directories to be better than tags for this purpose since it’s of course impossible to forget to put a file in a directory.

Test type vs. meaning

At some point I realized that if I organize my test files based on meaning rather than test type, it makes it much easier to both a) find the tests when I want to find them and b) run the tests in groups that serve my purposes. Here’s why.

Finding tests

When I want to find the tests that correspond to a certain feature, I don’t necessarily know a lot about the characteristics of those tests. There might be a test that matches the filename of the application code file that I’m working on, but also there might not be. I’m also not always sure whether the application code I’m working on is covered by a model test, a system test, some other type of test, some combination of test types, or no test at all. The best I can do is either guess, search manually, or grep for some keywords and hope that the results aren’t too numerous to be able to examine one-by-one.

If on the other hand the files are organized in a directory tree that corresponds to the tests’ meaning in the domain model, then finding the tests is easier. If I’m working in the application’s billing area, for example, I can look in spec/billing folder to see if the relevant tests are there. If I use a nested structure, I can look in spec/billing/payments to find tests that are specifically related to payments.

I don’t need to worry about whether the payments-related tests are model tests, system tests or some other type of tests. I can just look in spec/billing/payments and work with whatever’s there. (I do, however, like to keep folders at the leaf level with names like models, system, etc. because it can be disorienting to not know what types of tests you’re looking at, and also it can create naming conflicts if you don’t separate files by type.)

Running tests in groups

I don’t often find it particularly useful to, say, run all my model tests or all my system tests. I do however find it useful to run all the tests in a certain conceptual area.

When I make a change in a certain area and I want to check for regressions, I of course want to check in the most likely places first. It’s usually more likely that I’ve introduced a regression to a conceptually related area than a conceptually unrelated area.

To continue the example from above, if I make a change to the payments area, then I can run all the tests in spec/billing/payments to conveniently check for regressions. If those tests all pass then I can zoom out one level and run all the tests in spec/billing. This gives me four “levels” of progressively broader regression testing: 1) a single file in spec/billing/payments, 2) all the tests in spec/billing/payments, 3) all the tests in spec/billing, and 4) all the tests in the whole test suite. If I organize my tests by type, I don’t have that ability.

On breaking convention

I’m not often a big fan of diverging from framework conventions. Breaking conventions often results in a loss of convenience which isn’t made up for by whatever is gained by breaking convention.

But don’t mistake this break from convention with other types of breaks from conventions you might have seen. Test directory structure is a very weak convention and it’s not even a Rails convention, it’s a convention of RSpec or Minitest. And in fact, it’s not even a technical convention, it’s a cultural convention. Unless I’m mistaken, there’s not actually any functionality tied to the test directory structure in RSpec or Minitest, and so diverging from the cultural standard doesn’t translate to a loss of functionality. It’s virtually all upside.

Takeaways

The benefits of organizing tests into directories include to be able to find tests and to be able to run tests in groups.
Organizing tests by meaning rather than type makes it easier to find tests and to run them in groups in a way that’s more logical for the purpose of finding regressions.

Why DSLs are a necessary part of learning Rails testing

The four phases of a test

1 Reply

When writing tests, or reading other people’s tests, it can be helpful to understand that tests are often structured in four distinct phases.

These phases are:

Setup
Exercise
Assertion
Teardown

Let’s illustrate these four phases using an example.

Test phase example

Let’s say we have an application that has a list of users that can receive messages. Only active users are allowed to receive messages. So, we need to assert that when a user is inactive, that user can’t receive messages.

Here’s how this test might go:

Create a User record (setup)
Set the user’s “active” status to false (exercise)
Assert that the user is not “messageable” (assertion)
Delete the User record we created in step 1 (teardown)

In parallel with this example, I’ll also use another example which is somewhat silly but also less abstract. Let’s imagine we’re designing a sharp-shooting robot that can fire a bow and accurately hit a target with an arrow. In order to test our robot’s design, we might:

Get a fresh prototype of the robot from the machine shop (setup)
Allow the robot to fire an arrow (exercise)
Look at the target to make sure it was hit by the arrow (assertion)
Return the prototype to the machine shop for disassembly (teardown)

Now let’s take a look at each step in more detail.

The purpose of each test phase

Setup

The setup phase typically creates all the data that’s needed in order for the test to operate. (There are other things that could conceivably happen during a setup phase but for our current purposes we can think of the setup phase’s role as being to put data in place.)In our case, the creation of the User record is all that’s involved in the setup step, although more complicated tests could of course create any number of database records and potentially establish relationships among them.

Exercise

The exercise phase walks through the motions of the feature we want to test. With our robot example, the exercise phase is when the robot fires the arrow. With our messaging example, the exercise phase is when the user gets put in an inactive state.

Side note: the distinction between setup and exercise may seem blurry, and indeed it sometimes is, especially in low-level tests like our current example. If someone were to argue that setting the user to inactive should actually be part of the setup, I’m not sure how I’d refute them. To help with the distinction in this case, imagine if we instead were writing an integration test that actually opened up a browser and simulated clicks. For this test, our setup would be the same (create a user record) but our exercise might be different. We might visit a settings page, uncheck an “active” checkbox, then save the form.

Assertion

The assertion phase is basically what all the other phases exist in support of. The assertion is the actual test part of the test, the thing that determines whether the test passes or fails.

Teardown

Each test needs to clean up after itself. If it didn’t, then each test would potentially pollute the world in which the test is running and affect the outcome of later tests, making the tests non-deterministic. We don’t want this. We want deterministic tests, i.e. tests that behave the same exact way every single time no matter what. The only thing that should make a test go from passing to failing or vice-versa is if the behavior that the test tests changes.

In reality, Rails tests tend not to have an explicit teardown step. The main pollutant we have to worry about with our tests is database data that gets left behind. RSpec is capable of taking care of this problem for us by running each test in a database transaction. The transaction starts before each test is run and aborts after the test finishes. So really, the data never gets permanently persisted in the first place. So although I’m mentioning the teardown step here for completeness’ sake, you’re unlikely to see it in the wild.

A concrete example

See if you can identify the phases in the following RSpec test.

RSpec.describe User do
  let!(:user) { User.create!(email: 'test@example.com') }

  describe '#messageable?' do
    context 'is inactive' do
      it 'is false' do
        user.update!(active: false)
        expect(user.messageable?).to be false
        user.destroy!
      end
    end
  end
end

Here’s my annotated version.

RSpec.describe User do
  let!(:user) { User.create!(email: 'test@example.com') } # setup

  describe '#messageable?' do
    context 'is inactive' do
      it 'is false' do
        user.update!(active: false)           # exercise
        expect(user.messageable?).to be false # assertion
        user.destroy!                         # teardown
      end
    end
  end
end

Takeaway

Being familiar with the four phases of a test can help you overcome the writer’s block that testers sometimes feel when staring at a blank editor. “Write the setup” is an easier job than “write the whole test”.

Understanding the four phases of a test can also help make it easier to parse the meaning of existing tests.

When I do TDD and when I don’t

9 Replies

Some developers advocate doing test-driven development 100% of the time. Other developers think TDD is for the birds and don’t do it at all. Still other developers go in the middle and practice TDD more than 0% of the time but less than 100% of the time.

I personally am in the camp of practicing TDD some of the time but not all. Here’s my reasoning.

When TDD makes sense to me

It’s not the case that I use TDD, or even write tests at all, for every single project I work on. But I do pretty much always program in feedback loops.

Feedback loops

The “feedback loop method” is where work as follows. First, I think of a tiny goal that I want to accomplish (e.g. make “hello world” appear on the screen). Then I decide on a manual test I can perform in order to see if that goal is accomplished (e.g. refresh the page and observe). Then I perform the test, write some code to try to make the test pass, perform the test again, and repeat the process with a new goal.

TDD == automated feedback loops

The way I view TDD is that it’s just the automated version of the manual work I was going to do anyway. Instead of making a to-do note that says “make ‘hello world’ appear on the screen” and then manually refreshing the page to see if it’s there, I write a test that expects “hello world” to appear on the screen. All the other steps are the exact same.

I’ve found that TDD works great for me when I’m working on what you might call “crisply-defined” work. In other words, the requirements I’m working to fulfill are known and specified. I find that I’ve found that there are other scenarios where TDD doesn’t work so great for me.

When TDD doesn’t make sense to me

Coding as production vs. coding as thinking

It’s easy to think that the reason to write code is to create a work product. But that’s certainly not the only reason to write code. Code isn’t just a medium for producing a product. It’s also a medium for thinking.

This is the idea behind a “spike” in Agile programming. When you’re doing a spike, you have no necessary intention to actually keep any of the code you’re writing. You’re just exploring. You’re seeing what it looks like when you do this or how it feels when you do that.

You can think of coding kind of like playing a piano. Sometimes you have some pieces of music already in your head and you’re trying to record an album. Other times you’re just messing around to see you can come up with any music worth recording. These are two very different modes of engaging with your instrument. Both are very necessary in order to ultimately record some music.

TDD doesn’t mix great with spikes

I often find that a spike phase is necessary when I’m coding, for example, a feature with known big-picture requirements but unknown UI specifics. In that case my test would be so full of guesses and placeholders that it would be kind of a joke of a test, and it wouldn’t help me much. In these cases I give myself permission to forego the testing during the spike period. I come back after I have some working code and backfill the tests.

Takeaways

I don’t practice TDD 100% of the time. (I believe I do practice TDD the vast majority of the time though.)
I view TDD as the automated version of the coding workflow that I already use anyway.
Producing a work product is not the only reason to write code. Code can also be a medium for thinking.
When I’m in the mode of using coding as a way to think, I find that the benefits of TDD don’t really apply.

The Beginner's Guideto Rails Testing

Misleading details

Shoehorned data

It’s unclear what setup data belongs to which test(s)

It’s unclear what the state of the system is and how it will affect any particular test

What’s the solution?

Incorrect reasons why duplication is more acceptable in tests

“Duplication isn’t actually that bad.”

“Duplication in test code can be clearer than the DRY version”

The real reason why duplication is more acceptable in test code

What duplication is and why it costs

The difference between test code and application code

Tests are always correct

Takeaways

Leaked state

Race conditions

Network/third-party dependency

Fixed time dependency

Randomness

Takeaways

What a flaky test is and why they’re hard to fix

The overall approach

Before reproducing: determine whether it’s really a flaky test

Reproducing a flaky test

Diagnosing a flaky test

Get familiar with the root causes of flaky tests

Have a completely open mind

Look at the failure messages

Look at the test code

Look at the application code

Make the test as understandable as possible

Make the application code as easy to understand as possible

Make the test environment as understandable as possible

Check the tests that ran just before the flaky test

Add diagnostic info to the test

Perform a binary search

Repeat all the above steps

Applying the fix for a flaky test

How to be sure your bugfixes work

How the nature of flaky tests complicates the bugfix process

A note about deleting and skipping flaky tests

Takeaways

What is a flaky test?

Race conditions

Real-life race condition example

Race conditions in DOM interaction/system tests

Edge-of-timeout race conditions

Leaked state

Let me shoot an apple off your head

Ways for tests to leak state

Network/third-party dependency

Randomness

Fixed time dependency

Takeaways

Reasons to organize test files into directories

Finding tests

Running tests in groups

Test type vs. meaning

Finding tests

Running tests in groups

On breaking convention

Takeaways

Test phase example

The purpose of each test phase

Setup

Exercise

Assertion

Teardown

A concrete example

Takeaway

When TDD makes sense to me

Feedback loops

TDD == automated feedback loops

When TDD doesn’t make sense to me

Coding as production vs. coding as thinking

TDD doesn’t mix great with spikes

Takeaways

The Beginner's Guide
to Rails Testing