When faced with a bug they have to diagnose, many developers will start making guesses as to what the problem is.
Guessing can be fine, especially when the guesses are good ones and when the guesses are inexpensive to test. But often, the guesses quickly degrade into long shots and the developer spends his or her time flailing around randomly rather than progressing steadily toward a solution.
Diagnosing bugs more methodically
One of my principles of debugging is that it’s almost always easier to determine where the cause of a bug is than what the cause of the bug is. When I need to diagnose a bug, the first thing I ask is not “What is the problem?” but “Where is the problem?” Not only is “where” a much easier question to answer than “what”, but once I’ve found the “where”, the “what” is often plainly evident.
When you change the question from “what” to “where”, the problem changes from a thought-intensive mystery-solving exercise to a relatively straightforward search problem.
When I want to determine where a bug lies, I identify an area of code and ask “Does the bug lie in this area of code?” If the answer is yes, then I perform the search again on a narrower area. If the answer is no, I continue my search in a different area.
When I ask the question “Does the bug lie in this area of code?”, the answer can be obtained as follows.
Perform the steps that reproduce the bug on the latest code. Observe that the bug is present.
Delete or disable some section of the code.
Perform the reproduction steps again.
Observe whether the bug is present. If the bug is gone, the answer is yes. If the bug is still present, the answer is no.
But the question remains: how do you determine which areas of your code to inspect for the bug? Obviously it’s not going to be efficient to just randomly choose areas for inspection. That’s where binary search comes in.
Binary search
Imagine I wanted to know someone’s birthday but I wasn’t allowed to ask them when their birthday was. I was only allowed to ask yes-or-no questions.
In order to figure out this person’s birthday, I could of course ask, “Is it on January 1st?” “Is it on January 2nd?” etc. But that would take me up to 365 guesses (or actually 366 because of leap year birthdays) and on average it would take me 366/2 = 183 guesses (assuming even distribution of birthdays).
Instead I could ask the person “Is your birthday before July 1st?” If so, I can rule out July 1st and every day after it. Then I can cut the remaining portion of the year in half. The next question would be “Is your birthday before April 1st?” If so, I do the same thing again. “Before February 16th?” “Before January 23rd?” “Before January 13th?” and so on. With this method, it takes no more than nine questions to arrive at the right answer. That’s a lot better than 183!
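To make the mechanics concrete, here’s a minimal Ruby sketch of the birthday game. The day-of-year number and the method name are made up for illustration; the point is that each yes-or-no question halves the remaining range.

  # Find an unknown day of the year (1..366) using only yes-or-no questions
  # of the form "is it on day `mid` or earlier?"
  def find_birthday(secret_day)
    low, high = 1, 366
    questions = 0

    while low < high
      mid = (low + high) / 2
      questions += 1

      if secret_day <= mid
        high = mid      # "yes" rules out everything after mid
      else
        low = mid + 1   # "no" rules out mid and everything before it
      end
    end

    [low, questions]
  end

  find_birthday(300) # => [300, 9] -- at most 9 questions, versus up to 366 sequential guesses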
Binary search in code
This binary search method can be applied to searching code as well. Let’s say you have a 100-line file that contains a bug but you don’t know where the bug is. You can (at least in theory) delete the last 50 lines of code and then check the remaining code for the presence of the bug. If the bug is still there, then you can divide that code in half by deleting the last 25 lines, and so on. Obviously you usually can’t literally delete the last 50 lines of a file because then it wouldn’t be syntactically valid and so on, but the principle of course still stands.
Binary search can also be used on the level of an entire codebase. You can devise questions that will divide the codebase roughly in half and then check for the presence or absence of the bug in each half. You don’t even need to necessarily delete code in order for this method to work. You just need a way to eliminate half of the codebase from your search area somehow. (Think about the game 20 Questions. “Is it living? Is it an animal? Is it a mammal?” and so on.)
You can also perform searches across time. Git bisect works by taking a range of commits and then repeatedly dividing it in half, asking you to answer the question “Does the bug lie in this half?” at each step. When you perform a bisect, you’re asking not “What is this bug?” but rather “What commit introduced this bug?” (In other words, “where is this bug”.) If your commits are small and atomic, then the cause of the bug will often be obvious once the offending commit is identified. If the offending commit is large, you might need to do another binary search on the code the commit introduced in order to isolate the root cause.
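In practice, a bisect session looks something like the following. The good revision is whatever you know to be bug-free; v2.3.0 here is just a placeholder.

  git bisect start
  git bisect bad            # the current commit exhibits the bug
  git bisect good v2.3.0    # a commit or tag known not to have the bug

  # Git checks out a commit roughly halfway between good and bad.
  # Run your reproduction steps, then tell git what you observed:
  git bisect good           # or: git bisect bad

  # Repeat until git reports the first bad commit, then clean up:
  git bisect reset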
The beauty of binary search debugging
Before I figured out how to diagnose bugs methodically, debugging was often an extremely frustrating exercise. I would stare at the screen and wonder why what was happening was happening. I would read the code over and over to try to make sense of it. I would make guesses as to what the problem could be. And after my guess turned out to be wrong, I often felt no closer to a solution than before.
Now things are different. The bug diagnosis methodology that I use now—which combines the “where, not what” principle with binary search—has two big benefits. First, this methodology allows me to progress systematically and inexorably toward a solution rather than taking shots in the dark. Second, it almost always works. It’s hard to overstate how good it feels to know that whatever bug I’m faced with, I have the ability to diagnose it in a timely manner, without much mental strain, and with a success rate close to 100%. That sure as hell beats guessing.
Takeaways
When diagnosing bugs, guessing is sometimes fine, but it’s often inefficient.
It’s almost always easier to find where a bug is than what a bug is.
Using binary search to find where a bug is makes the search process more efficient.
A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.
In other words, a flaky test is a test that’s non-deterministic.
A test can be non-deterministic if either a) the test code is non-deterministic or b) the application code being tested is non-deterministic, or both.
Below are some common causes of flaky tests. I’ll briefly discuss the fix for some of these common causes, but the focus of this post isn’t to provide a guide to fixing flaky tests, it’s to give you a familiarity with the most common causes for flaky tests so that you can know what to go looking for when you do your investigation work. (I have a separate post for fixing flaky tests.)
The causes I’ll discuss are race conditions, leaked state, network/third-party dependency, randomness and fixed time dependency.
Race conditions
A race condition is when the correct functioning of a program depends on two or more parallel actions completing in a certain sequence, but the actions sometimes complete in a different sequence, resulting in incorrect behavior.
Real-life race condition example
Let’s say I’m going to pick up my friend to go to a party. I know it takes 15 minutes to get to his house so I text him 15 minutes prior so he can make sure to be ready in time. The sequence of events is supposed to be 1) my friend finishes getting ready and then 2) I arrive at my friend’s house. If the events happen in the opposite order then I end up having to wait for my friend, and I lay on the horn and shout obscenities until he finally comes out of his house.
Race conditions are especially likely to occur when the times are very close. Imagine, for example, that it always takes me exactly 15 minutes to get to my friend’s house, and it usually takes him 14 minutes to get ready, but about one time out of 50, say, it takes him 16 minutes. Or maybe it takes my friend 14 minutes to get ready but one day I somehow get to his house in 13 minutes instead of the usual 15. You can imagine how it wouldn’t take a big deviation from the norm to cause the race condition problem to occur.
Hopefully this real-life example illustrates that the significant thing that gives rise to race conditions is parallelism and sequence dependence, and it doesn’t matter what form the parallelism takes. The parallelism could take the form of multithreading, asynchronicity, two entirely separate systems (e.g. two calls to two different third-party APIs), or literally anything else.
Race conditions in DOM interaction/system tests
Race conditions are fairly common in system tests (tests that exercise the full application stack including the browser).
Let’s say there’s a test that 1) submits a form and then 2) clicks a link on the subsequent page. To tie this to the pick-up-my-friend analogy, the submission of the form would be analogous to me texting my friend saying I’ll be there in 15 minutes, and the loading of the subsequent page would be analogous to my friend getting ready. The race condition creates a problem when the test attempts to click the link before the page loads (analogous to me arriving at my friend’s house before he’s ready).
To make the analogy more precise, this sort of failure is analogous to me arriving at my friend’s house before he’s ready, and then just leaving after five minutes because I don’t want to wait (timeout error!).
The solution to this sort of race condition is easy: just remove the asynchronicity. Instead of allowing the test to execute at its natural pace, add a step after the form submission that waits for the next page to load before trying to click the link. This is analogous to me adding a step and saying “text me when you’re ready” before I leave to pick up my friend. If we arrange it like that then there’s no longer a race.
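In a Capybara-based system spec, that waiting step might look something like this. The page text and labels are made up for illustration; the key is that the have_content matcher retries until the next page has rendered, so the click can’t race it.

  it "lets the user continue after submitting the form" do
    fill_in "Email", with: "test@example.com"
    click_button "Sign up"

    # The "text me when you're ready" step: this matcher waits (and retries)
    # until the next page has loaded, so the click below can't happen too early.
    expect(page).to have_content("Welcome! Your account has been created.")

    click_link "Account settings"
  end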
Because DOM interactions often involve asynchronicity, DOM interaction is a common area for race conditions, and therefore flaky tests, to be present.
Edge-of-timeout race conditions
A race condition can also occur when the amount of time an action takes to complete is just under a timeout value, e.g. a timeout value is five seconds and the action takes four seconds (but sometimes six seconds).
In these cases you can increase the timeout and/or improve the performance of the action such that the timeout value and typical run length are no longer close to each other.
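With Capybara, for example, the relevant knob is default_max_wait_time, and the timeout can also be raised just around a known-slow step. The page text below is made up.

  Capybara.default_max_wait_time = 10   # seconds; the default is 2

  # Or, only where the slow action happens:
  Capybara.using_wait_time(15) do
    expect(page).to have_content("Report generated")
  end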
Leaked state
Tests can create flaky behavior when they leak state into other tests.
Let me shoot an apple off your head
Let’s use another analogy to illustrate this one. Let’s say I wanted to perform two tests on myself. The first test is to see if I can shoot an apple off the top of my friend’s head with a bow and arrow. The second test is to see if I can drink 10 shots of tequila in under an hour.
If I were to perform the arrow test immediately followed by the tequila test and do that once a week, I could expect to get basically the same test results each time.
But if I were to perform the tequila test immediately followed by the arrow test, my aim would probably be compromised, and I might miss the apple once in a while. (Sorry, friend.) The problem is that the tequila test “leaks state”: it creates a lasting alteration in the global state, and that alteration affects subsequent tests.
And if I were to perform these two tests in random order, the tequila test would give the same result each time because I’d always be starting it sober, but the arrow test would appear to “flake” because sometimes I’d start it sober and sometimes I’d start it drunk. I might even suspect that there’s a problem with the arrow test because that’s the test that’s showing the symptom, but I’d be wrong. The problem is a different test with leaky state.
Ways for tests to leak state
Returning to computers, there are a lot of ways a test can alter the global state and create non-deterministic behavior.
One way is to alter database data. Imagine there are two tests, each of which creates a user with the email address test@example.com, and the database isn’t cleaned between tests. Whichever test runs first will pass; if there’s a unique constraint on users.email, whichever test runs second will raise an error due to the unique constraint violation. So which test fails depends on the order you run them in.
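Here’s a sketch of that scenario. It assumes the database isn’t cleaned between tests and that users.email has a unique index; the test descriptions are invented.

  RSpec.describe User do
    it "sends a welcome email" do
      User.create!(email: "test@example.com")
      # ...
    end

    it "defaults to the free plan" do
      # Passes when run first; raises a uniqueness error when run after the
      # test above, because that test's record leaked into this one.
      User.create!(email: "test@example.com")
      # ...
    end
  end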
Another way that a test could leak state is to change a configuration setting. Let’s say that your test environment has background jobs configured not to run for most tests because most background jobs are irrelevant to what’s being tested and would just slow things down. But then let’s imagine that you have one test where you do want background jobs to run, and so at the beginning of that test you change the background job setting from “don’t run” to “run”. If you don’t remember to change the setting back to “don’t run” at the end, background jobs will run for all later tests and potentially cause problematic behavior.
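One way to avoid that kind of leak is to restore the setting no matter what happens, e.g. in an ensure block. The BackgroundJobs setting object here is hypothetical; the pattern is what matters.

  it "generates the invoice in the background" do
    original = BackgroundJobs.mode
    begin
      BackgroundJobs.mode = :run
      # ...exercise the code that enqueues the job...
    ensure
      BackgroundJobs.mode = original  # without this, :run leaks into later tests
    end
  end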
State can also be leaked by altering environment variables, altering the contents of the filesystem, or any number of other ways.
Network/third-party dependency
The main reason why network dependency can create non-deterministic behavior doesn’t take a lot of explaining: sometimes the network is up and sometimes it’s not.
Moreover, when you’re depending on the network, you’re often depending on some third-party service. Even if the network itself is working just fine, the third-party service could suffer an outage at any time, causing your tests to fail. I’ve also seen cases where a test makes a third-party service call over and over until it gets rate-limited, and from that point on, for a period of time, that test fails.
The way to prevent flaky tests caused by network dependence is to use test doubles in your tests rather than hitting live services.
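With WebMock, for instance, a test can stub the HTTP request that the code under test would otherwise send over the network. The URL and response body below are made up.

  require "webmock/rspec"

  it "marks the payment as settled" do
    stub_request(:post, "https://api.example-payments.com/charges")
      .to_return(status: 200, body: { status: "settled" }.to_json)

    # ...exercise the code that would otherwise hit the live service...
  end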
Randomness
Randomness is, by definition, non-deterministic. If you, for example, have a test that generates a random integer between 1 and 2 and then asserts that that number is 1, that test is obviously going to fail about half the time. Random inputs lead to random failures.
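Written out literally, that (deliberately bad) test looks like this:

  it "comes up heads" do
    expect(rand(1..2)).to eq(1) # passes roughly half the time it's run
  end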
One way to get bitten by randomness is to grab the first item in a list of things that’s usually in the same order but not always. That’s why it’s usually better to specify a definite item rather than grab the nth item in a list.
Fixed time dependency
Once I was working late and I noticed that certain tests started to fail for no apparent reason, even though I hadn’t changed any code.
After some investigation I realized that, due to the way they were written, these tests would always fail when run at a certain time of day. I had just never worked that late before.
This is common with tests that cross the boundary of a day (or month or year). Let’s say you have a test that creates an appointment that occurs four hours from the current time, and then asserts that that appointment is included on today’s list of appointments. That test will pass when it’s run at 8am because the appointment will appear at 12pm which is the same day. But the test will fail when it’s run at 10pm because four hours after 10pm is 2am which is the next day.
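One way to reproduce (and then pin down) this kind of failure is to control the clock in the test, for example with ActiveSupport’s time helpers. The Appointment model and its today scope are hypothetical.

  # Assumes ActiveSupport::Testing::TimeHelpers is included in the test
  # (e.g. via `config.include ActiveSupport::Testing::TimeHelpers`).
  it "includes the appointment in today's list" do
    travel_to Time.zone.local(2024, 3, 1, 22, 0, 0) do      # 10pm: the failing case
      appointment = Appointment.create!(starts_at: 4.hours.from_now) # 2am the next day
      expect(Appointment.today).to include(appointment)     # `today` is a hypothetical scope
    end
  end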
Takeaways
A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.
Flaky tests are caused by non-determinism either in the test code or in the application code.
Some of the most common causes of flaky tests are race conditions, leaked state, network/third-party dependency, randomness and fixed time dependency.
Sometimes you’ll be tempted to add things to your application code that don’t affect the functionality of your application but do make testing a little easier.
The drawback to doing this is that it causes your application code to lose cohesion. Instead of doing just one job—making your application work—your code is now doing two jobs: 1) making your application work and 2) helping to test the application. This mixture of jobs is a straw on the camel’s back that makes the application code just that much harder to understand.
Next time you’re tempted to add something to your application code to make testing more convenient, resist the temptation. Instead add it somewhere in the code that supports your tests. This may take more time and thought initially but the investment will pay itself back in the long run.
When you want to understand what a legacy program you’re working on is supposed to do, what’s your first instinct? Often it’s to look at the code.
But unfortunately legacy code is often so convoluted and inscrutable that it’s virtually impossible to tell what the code is supposed to do just by looking at it.
In these cases it may seem that you’re out of luck. But fortunately you’re not.
Modeling
Due to what philosophers call the veil of perception, direct knowledge of anything in the world is impossible. There’s very little that we can know with absolute completeness and correctness.
For most things, we don’t have knowledge of how they work; we only have a model of how they work. A model is a useful approximation to the truth. Our models may not be absolutely complete or correct, but they’re useful enough to get the job done on a day-to-day basis.
Take a car, for example. I don’t have a complete understanding of how every part of a car works. And what I do know probably isn’t even 100% accurate. But my model is sufficiently comprehensive and accurate that I can operate my car without being regularly surprised and confounded by differences in the ways I expect my car to behave versus the way it actually behaves.
Let’s use another example, trees. My understanding of trees is fragmentary. But like my model of cars, my model of trees is sufficiently comprehensive and accurate that trees don’t regularly throw me surprises. I understand trees to be stationary, and so when I cross the street, I never worry that I may be struck by a speeding tree. Although I do know that wind can blow over dead trees and so I’m cautious about entering the woods on a windy day.
How models are developed
How did I develop my models of cars and trees?
Trees, unlike cars, are natural. I can’t look at a tree’s source code or schematics to gain an understanding of how trees work. Everything humans know about trees has been discovered by observation and experimentation—in other words, scientific inquiry. We observe that trees start off small and then grow large. We observe that deciduous trees lose their leaves in the winter and coniferous trees don’t. And so on.
We can of course also learn about trees by reading books and learning from teachers, but that’s not source material, it’s only a regurgitation of the products of someone else’s scientific inquiry.
Cars are a little different. Cars are man-made, so in addition to learning about cars through observation and experimentation, we can learn by, for example, reading the owner’s manual that came with the car. And in theory, we could look at engineering diagrams and talk with the people who designed the car in order to gain a direct understanding of how the car works straight from the source material.
Cars are also different from trees in the sense that much of a car’s mechanics is plainly visible. You can learn something about how a car works by taking apart its engine. Not so much with a tree.
Inspecting a car’s transmission is analogous to reading a line of code. The thing you’re looking at is, itself, an instruction. It can’t lie to you. You can be mystified by it and you can misunderstand its purpose, but the car part can’t outright lie to you because the car part is how the car works.
Modeling the mind
Before we connect all this back to legacy code, I want to share one more example of scientific modeling.
The human brain is a complex machine. So is a car, so is a tree, and so is a codebase. But unlike a car or a codebase, we can’t observe the mechanics of the brain directly, at least not yet, and not in much detail. We can’t look at the brain’s source code or insert breakpoints. For the purposes of understanding how it works, a brain is much more like a tree than a car.
But cognitive scientists have still managed to learn a lot about how the brain works. Or, more precisely, cognitive scientists have gained an understanding of the behavior that the brain produces. They’ve gained an understanding of how the mind (the outward expression of the machinery of the brain) works. They’ve developed a model.
This model of the mind has been developed in the same way that any accurate model has been developed: through scientific inquiry. A scientist can’t have a chat with the brain’s designer or have a look at its schematics, but a scientist can compare the behavior of people with normal brains against people who have had certain parts of their brains excised, for example, and make inferences about the roles that those parts of the brain play.
(Scientists can make diagrams of how they believe the brain works, but remember, those diagrams aren’t source material. They’re not code. They’re only a documentation of our current best understandings based on our scientific inquiry.)
So: if scientists can develop a model of the behavior generated by the brain without having access to the source machinery of the brain, what can we learn from scientists about how to understand the behavior of legacy systems without having access to comprehensible code?
Applying scientific inquiry to software systems
If you haven’t already done so, I’d like to invite you to make a conscious distinction between two ways of learning the behavior of a software system. One way is the obvious way that everyone’s familiar with: reading the code. The other way is the way that many people have probably used informally quite a bit but may not have consciously put a name to, which is scientific inquiry.
A full instruction in scientific inquiry is of course outside the scope of this post, and I wouldn’t be qualified to give one anyway. The point of this post is to invite you to consciously realize that you can develop a usefully accurate model of a software system not just by reading its code, but by using the methods of scientific inquiry.
Takeaways
Direct knowledge of anything in the world is impossible due to the veil of perception. All we can do is develop models, which are useful approximations to the truth.
Models are developed through the process of scientific inquiry.
In addition to reading the code, a model of a software system can be developed by using the process of scientific inquiry.
Most programmers are familiar with the concept of premature optimization and the reasons why it’s bad. As a reminder, the main reason premature optimization is bad is because it’s an effort to solve problems that probably aren’t real. It’s more economical to wait and observe where the performance bottlenecks are than to try to predict where the bottlenecks are going to be.
Perhaps fewer programmers are familiar with the idea of premature generalization, also known as the code smell Speculative Generality. Premature generalization is when you generalize a piece of code beyond its current requirements in anticipation of more general future requirements. In my experience it’s a very common mistake.
Premature generalization is bad for the same exact reason premature optimization is bad: because it’s an effort to solve problems that probably aren’t real.
Making a piece of code more general than it needs to be in anticipation of future needs might seem like a smart planning measure. If you can see that the code you’re writing will probably need to accommodate more use cases in the future, why not just make the code general enough now? That way you only have to write the code once.
When programmers do this they’re making a bet. Sometimes their bet is right and sometimes it’s wrong. In my experience, these sorts of bets are wrong enough of the time that you lose on average. It’s like betting $50 for a 10% chance at winning $100. If you were to do that 10 times, you’d spend $500 and win just once (on average), meaning you’ll have paid $500 to win $100.
It’s more economical to make your code no more general than what’s called for by today’s requirements and accept the risk that you might have to rework the code later to generalize it. This is also a bet but it’s a sounder one. Imagine a lottery system where you can either buy a ticket for $50 for a 10% chance of winning $100, or you can choose not to play and accept a 10% chance of getting fined $30. (I know it’s a weird lottery but bear with me.) If you buy a ticket ten times then on average you lose $400 because you’ve paid $500 to win $100. If ten times in a row you choose not to buy a ticket, then on average you get fined $30. So you’re obviously way better off with a policy of never buying the ticket.
Takeaways
Premature generalization is when you generalize a piece of code beyond its current requirements in anticipation of more general future requirements.
On average, premature generalization doesn’t pay. It’s more economical to write the code in such a way as to only accommodate today’s requirements and then only generalize if and when a genuine need arises.
Every codebase is a story. Well-designed programs tell a coherent, easy-to-understand story. Other programs are poorly designed and tell a confusing, hard-to-understand story. And it’s often the case that a program wasn’t designed at all, and so no attempt was made to tell a coherent story. But there’s some sort of story in the code no matter what.
If a codebase is like a story, a file in a codebase is like a chapter in a book. A well-written chapter will clearly let the reader know what the most important points are and will feature those important points most prominently. A chapter is most understandable when it principally sticks to just one topic.
The telling of the story may unavoidably require the conveyance of incidental details. When this happens, those incidental details will be put in their proper place and not mixed confusingly with essential points. If a detail would pose too much of a distraction or an interruption, it gets moved to a footnote or appendix or parenthetical clause.
A piece of code is cohesive if a) everything in it shares one single idea and b) it doesn’t mix incidental details with essential points.
Now let’s talk about ways that cohesion tends to get lost as well as ways to maintain cohesion.
How cohesion gets lost
Fresh new projects are usually pretty easy to work with. This is because a) when you don’t have very much code, it’s easier to keep your code organized, and b) when the total amount of code is small, you can afford to be fairly disorganized without hurting overall understandability too much.
Things get tougher as the project grows. Entropy (the tendency for all things to decline into disorder) unavoidably sets in. Unless there are constant efforts to fight back against entropy, the codebase grows increasingly disordered. The code grows harder to understand and work with.
One common manifestation of entropy is the tendency for developers to hang new methods onto objects like ornaments on a Christmas tree. A developer is tasked with adding a new behavior. He or she goes looking for the object that seems like the most fitting home for that behavior. He or she adds the new behavior, which doesn’t perfectly fit the object where it was placed, but the new code only makes the object 5% less cohesive, and it’s not clear where might be a better place for that behavior, so in it goes.
This ornament-hanging habit is never curtailed because no individual “offense” appears to be all that bad. This is the nature of entropy: disorder sets in not because anything bad was done but simply because no one is going out of their way to stave off disorder.
So, even though no individual change appears to be all that bad, the result of all these changes in aggregate is a surprisingly bad mess. The objects are huge. They confusingly mix unrelated ideas. Their essential points are obscured by incidental details. They’re virtually impossible to understand. They lack cohesion.
How can this problem be prevented?
How cohesion can be preserved
The first key to maintaining cohesion is to make a clear distinction between what’s essential and what’s incidental. More specifically, a distinction must be made between what’s essential and what’s incidental with respect to the object in question.
For example, let’s say I have a class called Appointment. The concerns of Appointment include, among other things, a start time, a client and some matters related to caching.
I would say that the start time and client are essential concerns of the appointment and that the caching is probably incidental. In the story of Appointment, start time and client are important highlights, whereas caching concerns are incidental details and should be tucked away in a footnote or appendix.
That explains how to identify incidental details conceptually but it doesn’t explain how to separate incidental details mechanically. So, how do we do that?
The primary way I do this is to simply move the incidental details into different objects. Let’s say for example that I have a Customer object with certain methods including one called balance.
Over time the balance calculation becomes increasingly complicated to the point that it causes Customer to lose cohesion. No problem: I can just move the guts of the balance method into a new object (a PORO) called CustomerBalance and delegate all the gory details of balance calculation to that object. Now Customer can once again focus on the essential points and forget about the incidental details.
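Here’s roughly what that extraction might look like. The associations and the calculation itself are made up; the shape of the delegation is the point.

  class Customer < ApplicationRecord
    has_many :invoices
    has_many :payments

    def balance
      CustomerBalance.new(self).amount
    end
  end

  # A PORO that holds all the gory details of balance calculation
  class CustomerBalance
    def initialize(customer)
      @customer = customer
    end

    def amount
      @customer.invoices.sum(&:amount_due) - @customer.payments.sum(&:amount)
    end
  end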
Now, in this case it made perfect sense to recognize the concept of a customer balance as a brand new abstraction. But it doesn’t always work out this way. In our earlier Appointment example, for example, it’s maybe not so natural to take our caching concerns and conceive of them as a new abstraction. It’s not particularly clear how that would go.
What we can do in these cases, when we want to move an incidental detail out of an object but we can’t put our finger on a befitting new abstraction, is we can use a mixin instead. I view mixins as a good way to hold a bit of code which has cohesion with itself but which doesn’t quite qualify as an abstraction and so doesn’t make sense as an object. For me, mixins usually don’t have standalone value, and they’re usually only ever “mixed in” to one object as opposed to being reusable.
(I could have said concern instead of mixin, but a) to me it’s a distinction without a meaningful difference, and b) concerns come along with some conceptual baggage that I didn’t want to bring into the picture here.)
So for our Appointment example, we could move the caching code into a mixin in order to get it out of Appointment so that Appointment could once again focus solely on its essential points and forget about its incidental details.
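In code, that might look something like the following. The caching details are invented; the point is just where they live.

  module AppointmentCaching
    def cache_key_for_schedule
      "appointments/#{id}-#{updated_at.to_i}"
    end

    # ...other caching-related methods...
  end

  class Appointment < ApplicationRecord
    include AppointmentCaching

    belongs_to :client
    # start time, client and other essential concerns stay front and center here
  end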
Where to put these newly-sprouted files
When I make an object more cohesive by breaking out its incidental details into a new model file, you might wonder where I put that new file.
The short answer is that I put these files into app/models, with additional subfolders based on the meaning of the code.
So for the Appointment, I might have app/models/appointment.rb and app/models/scheduling/appointment_caching.rb, provided that the caching code is related specifically to scheduling. The rationale here is that the caching logic will only ever be relevant to scheduling whereas an appointment might be viewed in multiple contexts, e.g. sometimes scheduling and sometimes billing.
For the customer balance example, I might have app/models/customer.rb and app/models/billing/customer_balance.rb. Again, a customer balance is always a billing concern whereas a customer could be looked at through a billing lens or conceivably through some other sort of lens.
Note that even though appointment_caching.rb is a mixin or concern, I don’t put it in a concerns or mixins folder. That’s because I believe in organizing files by meaning rather than type. I find that doing so makes it easier to find what I want to find when I want to find it.
Takeaways
A piece of code is cohesive if a) everything in it shares a single idea and b) it doesn’t mix incidental details with essential points.
Cohesion naturally erodes over time due to entropy.
The first key to maintaining cohesion is to make a clear distinction between what’s essential and what’s incidental.
Incidental details can be moved into either new objects or into mixins/concerns in order to help preserve cohesion.
If you’re going to make a change to an area of code, you have to understand that area. If you don’t understand the area you’re changing very well, your lack of understanding might lead you to accidentally introduce a bug.
Well-written code is loosely coupled from the other pieces of code it touches. “Loosely coupled” means that if you have classes A and B which talk to each other, you can understand class A without having to know much about class B and vice versa.
Conversely, if A and B are tightly coupled, then you might have to understand both class A and class B just to understand class A. Tight coupling makes code harder to work with.
One aspect of loosely-coupled code is that it has crisp boundaries, which are the opposite of blurry boundaries. Here’s an example of a piece of code with blurry boundaries.
class Person
  def initialize(params)
    @name = params[:name]
  end
end

person = Person.new(params)
The only thing Person needs from the outside world is a name, but Person is accepting the broader params as an argument.
Looking outward from inside the Person class, we might wonder: what exactly is in params? Who knows! It could be anything.
The inclusion of params is a “leak” from the outside world into Person.
Looking the other direction, from outside Person inward, we might see Person.new(params) and wonder what exactly Person needs from params in order to do its job. Does Person need everything inside of params? Just some of it? Who knows! Could be anything.
Let’s contrast the blurry-boundary code above with the crisp-boundary code below.
class Person
  def initialize(name)
    @name = name
  end
end

person = Person.new(params[:name])
In this case, looking outward from inside the Person class, it’s clear that Person takes a name and that’s it.
And then looking in from outside, Person.new(params[:name]) makes it clear exactly what’s being sent to Person.
In order to make your classes and methods more understandable, keep your boundaries crisp by accepting the minimum amount of argument information necessary in order to get the job done.
In Rails apps that use RSpec, it’s customary to have a spec directory with certain subdirectories named for the types of tests they contain: models, requests, system. The Minitest organization scheme doesn’t use the exact same names, but it does share the custom of organizing by test type.
I would like to raise the question: Why do we do it this way?
To get at the answer to that question I’d like to ask a broader question: What’s the benefit of organizing test files at all? Why not just throw all the tests in a single directory? For me there are two reasons.
Reasons to organize test files into directories
Finding tests
When I’m making a change to a feature, I usually want to know where the tests are that are related to that feature so I can update or extend the tests accordingly. Or, if that feature doesn’t have tests, I want to know so, and with a reasonable degree of certainty, so that I don’t accidentally create new tests that duplicate existing ones.
Running tests in groups
If tests are organized into directories then they can be conveniently run in groups.
It is of course possible, at least in some frameworks, to apply certain tags to tests and then run the tagged tests as a group. But doing so depends on developers remembering to add tags. This seems to me like a fragile link in the chain.
I find directories to be better than tags for this purpose since it’s of course impossible to forget to put a file in a directory.
Test type vs. meaning
At some point I realized that if I organize my test files based on meaning rather than test type, it makes it much easier to both a) find the tests when I want to find them and b) run the tests in groups that serve my purposes. Here’s why.
Finding tests
When I want to find the tests that correspond to a certain feature, I don’t necessarily know a lot about the characteristics of those tests. There might be a test that matches the filename of the application code file that I’m working on, but also there might not be. I’m also not always sure whether the application code I’m working on is covered by a model test, a system test, some other type of test, some combination of test types, or no test at all. The best I can do is either guess, search manually, or grep for some keywords and hope that the results aren’t too numerous to be able to examine one-by-one.
If on the other hand the files are organized in a directory tree that corresponds to the tests’ meaning in the domain model, then finding the tests is easier. If I’m working in the application’s billing area, for example, I can look in the spec/billing folder to see if the relevant tests are there. If I use a nested structure, I can look in spec/billing/payments to find tests that are specifically related to payments.
I don’t need to worry about whether the payments-related tests are model tests, system tests or some other type of tests. I can just look in spec/billing/payments and work with whatever’s there. (I do, however, like to keep folders at the leaf level with names like models, system, etc. because it can be disorienting to not know what types of tests you’re looking at, and also it can create naming conflicts if you don’t separate files by type.)
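As a sketch, a meaning-based spec directory might look like this (the domain and file names are made up; note the type folders at the leaf level):

  spec/
    billing/
      payments/
        models/
          payment_spec.rb
        system/
          refund_spec.rb
    scheduling/
      models/
        appointment_spec.rb
      system/
        booking_spec.rb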
Running tests in groups
I don’t often find it particularly useful to, say, run all my model tests or all my system tests. I do however find it useful to run all the tests in a certain conceptual area.
When I make a change in a certain area and I want to check for regressions, I of course want to check in the most likely places first. It’s usually more likely that I’ve introduced a regression to a conceptually related area than a conceptually unrelated area.
To continue the example from above, if I make a change to the payments area, then I can run all the tests in spec/billing/payments to conveniently check for regressions. If those tests all pass then I can zoom out one level and run all the tests in spec/billing. This gives me four “levels” of progressively broader regression testing: 1) a single file in spec/billing/payments, 2) all the tests in spec/billing/payments, 3) all the tests in spec/billing, and 4) all the tests in the whole test suite. If I organize my tests by type, I don’t have that ability.
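With RSpec, those four levels map directly onto paths passed on the command line (the paths are illustrative):

  bundle exec rspec spec/billing/payments/models/payment_spec.rb   # one file
  bundle exec rspec spec/billing/payments                          # one conceptual area
  bundle exec rspec spec/billing                                   # the broader area
  bundle exec rspec                                                # the whole suite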
On breaking convention
I’m not often a big fan of diverging from framework conventions. Breaking conventions often results in a loss of convenience which isn’t made up for by whatever is gained by breaking convention.
But don’t mistake this break from convention for other types of breaks from convention you might have seen. Test directory structure is a very weak convention, and it’s not even a Rails convention; it’s a convention of RSpec or Minitest. And in fact, it’s not even a technical convention, it’s a cultural convention. Unless I’m mistaken, there’s not actually any functionality tied to the test directory structure in RSpec or Minitest, and so diverging from the cultural standard doesn’t translate to a loss of functionality. It’s virtually all upside.
Takeaways
The benefits of organizing tests into directories include being able to find tests and being able to run tests in groups.
Organizing tests by meaning rather than type makes it easier to find tests and to run them in groups in a way that’s more logical for the purpose of finding regressions.
Every once in a while I come across the question “Where should I put my POROs in Rails?”
In order to answer this question, I would actually zoom out and ask a broader question: How should we organize our files in Rails in general?
Rails’ organizational limits
To some it might seem that this question already has an answer. Rails already gives us app/controllers for controllers, app/models for models, app/helpers for helpers and so on.
But after a person works with a growing Rails app for a while, it eventually becomes clear that Rails can only take you so far. The sheer quantity of code overwhelms Rails’ ability to help keep the code organized. It’s like piling pound after pound of spaghetti onto a single dinner plate. It only makes sense up to a certain point. Past that point the result is a mess. (This isn’t a criticism of Rails. It’s a natural fact of frameworks in general.)
A Rails codebase can grow both “horizontally” and “vertically”. Horizontal growth means adding more resources: more database tables, more model files, more controller files, etc. Rails can handle horizontal growth just fine, indefinitely.
Vertical growth means a growth in complexity. If the amount of domain logic in an application continues to grow but the number of controllers/models/etc. stays the same, then the result is that the individual files all grow. If the “fat models, skinny controllers” heuristic is followed, then the complexity accumulates in the model files. The result is huge models. These huge models are hard to understand because of their sheer size and because they lack cohesion, meaning that each model isn’t “about” one thing, but rather each model file is just a dumping ground for everything that might be loosely related to that model.
Common (poor) attempts to manage complexity growth
A common way to address the complexity problem is to split the code according to design patterns (decorators, builders, etc.) and put the files in folders that are named for the design patterns: app/decorators, app/builders and so on. The logic of this approach is that it’s a continuation of what Rails is already doing for us, which is to divide files by design pattern. At first glance it seems like a sensible approach.
However, I don’t think this approach does a very good job of addressing the problem of being able to find what we need to find when we need to find it. Here’s why.
Let’s say for example that I need to make a change to some billing-related logic. I know that the code I’m looking for has something to do with billing of course, but I might not know much else about the code I’m looking for. I have no idea whether the code I’m interested in might lie in app/models, app/decorators or anywhere else. I probably have a sense of whether the code is display-related (app/views), domain-logic-related (app/models) or related to the request/response lifecycle (app/controllers), but beyond that, I probably have no clue where the code is located. How could I?
When people try to extend Rails’ convention of dividing files by design pattern, they’re missing an important point. Decorators, builders, commands, queries, etc. are all different from each other, but they’re different from each other in a different way than models, views and controllers are different from each other.
Think of it this way. Imagine if, instead of being divided into meat, produce, dairy, etc. sections, a grocery store were organized by “things in boxes”, “things in plastic bags”, etc. The former is an essential difference while the latter is an incidental difference. Unless you know how everything is packaged, you won’t be sure where to look. The difference between models, views and controllers is like the difference between meat, produce and dairy. The difference between decorators, builders, commands, queries, etc. is more like the difference between how the items are packaged. Again, the former is essential while the latter is incidental.
Organizing by meaning
A better way to organize Rails code is by meaning. Instead of having one folder for each design pattern, I can have one folder for each conceptual area of my app. For example, if my app has a lot of billing code, I can have folders called app/models/billing, app/controllers/billing and so on. This makes it much easier to find a piece of code when I don’t know anything about the code’s structure but only about its meaning.
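As a sketch (with made-up file names), that might look like:

  app/models/billing/
    customer_balance.rb
    invoice_builder.rb
    payment_processor.rb
  app/controllers/billing/
    invoices_controller.rb
    payments_controller.rb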
Regarding design patterns, I think design patterns are both overrated and overemphasized, at least in the Rails world. A lot of Rails developers seem to have the idea that every file they create must belong to some category: model, controller, worker, helper, decorator, service, etc. Maybe this is because in a vanilla Rails app, pretty much everything is slotted in to a category in some way. But there’s no logical reason that every piece of code has to fit into some design pattern. The plain old “object” is an extremely powerful and versatile device.
But what if everything in a Rails app is just plain old Ruby objects? Won’t the app lose structure? Not necessarily. Most objects represent models in the broad sense of the term “model”, which is that the code represents some aspect of the world in a way that’s easy to understand and work with. Therefore, the objects that comprise the app’s domain logic can go in app/models, organized hierarchically by domain concept. Plain old objects can sit quite comfortably in app/models alongside Active Record models.
Now let’s go all the way back to the original question: where should you put POROs in Rails?
The answer depends on how you organize your Rails code in general. It also depends on what you consider POROs to be. I consider most of my POROs to be models, so I put them in app/models.
Takeaways
Rails can only help with code organization when the amount of code is small. Past a certain point it’s up to you to impose your own structure.
If the aim is to be able to find the code you need to find when you need to find it, organizing by design pattern doesn’t help much if you don’t already know how the code is structured. Organizing the code by meaning is better.
In this post I’ll show what duplication is, why it’s such a surprisingly complicated issue, why the popular advice is dubious, and what can be done to address duplication.
We’ll start with a definition of duplication.
What duplication is
We could imagine that duplication could be defined as a piece of code that appears in two or more places. Indeed, this sounds like a very reasonable and accurate definition. But it’s actually wrong.
Here’s what duplication really is. Duplication is when there’s a single behavior that’s specified in two or more places.
Just because two identical pieces of code are present doesn’t necessarily mean duplication exists. And just because there are no two identical pieces of code present doesn’t mean there’s no duplication.
Two pieces of code could happen to be identical, but if they actually serve different purposes and lead separate lives, then they don’t represent the same behavior, and they don’t constitute duplication. To “DRY up” these identical-looking pieces of code would create new problems, like handcuffing two people together who need to walk in two different directions.
On the other hand, it’s possible for a single behavior to be represented in a codebase but with non-identical code. The way to tell if two pieces of code are duplicative isn’t to see if their code matches (although most of the time duplicative behavior and duplicative code do appear together). The question that determines duplication is: if I changed one piece of code in order to meet a new requirement, would it be logically necessary to update the other piece of code the same way? If so, then the two pieces of code are probably duplicates of each other, even if their behavior is not achieved using the same exact syntax.
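For example, these two methods (imagine them living in two different files) share no identical lines, but they encode the same rule. A change to the rule would require changing both, which makes them duplicates in the sense described above. The method and attribute names are invented.

  # In one file:
  def upcoming?
    starts_at >= Time.current
  end

  # In another file, written differently but expressing the same rule:
  def upcoming?
    !starts_at.before?(Time.current)
  end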
Why duplication is bad
The main reason duplication is bad is because it leaves a program susceptible to developing logical inconsistencies.
If a behavior is expressed in two different places in a program, and one of them accidentally doesn’t match the other, then the deviating behavior is necessarily wrong. (Or, if the deviating behavior happens to still meet its requirements, it only does so by accident.)
Another reason duplication can be bad is because it can pose an extra maintenance burden. It takes longer, and requires more mental energy, to apply a change to two areas of code instead of just one.
But not all instances of duplication are equally bad. Some kinds of duplication are more dangerous than others.
When duplication is more dangerous or less dangerous
There are three factors that determine the degree of harm of an instance of duplication: 1) how easily discoverable the duplication is, 2) how much extra overhead the presence of the duplication incurs, and 3) how much “traffic” that area receives, i.e. how frequently that area of code needs to be changed or understood. Let’s look at each of these factors more closely.
Discoverability
If there’s a piece of behavior that’s specified twice in the codebase, but the two pieces of code are only separated by one line, then there’s not a big problem because everyone is basically guaranteed to notice the problem. If someone updates one of the copies of the behavior to meet a new requirement, they’re very unlikely to accidentally miss updating the other one. You might call this the proximity factor.
If two pieces of duplicated behavior appear in different files in different areas of the application, then a “miss” is much more likely to occur, and therefore the duplication constitutes a worse problem.
Another quality that makes discovery of duplication easier is similitude. If two pieces of code look very similar, then their duplication is more likely to be noticed than if the two pieces of code don’t look the same. You might call this the similitude factor.
If the proximity factor is bad (the pieces of duplicated code are at a great distance from each other) and/or if the similitude factor is bad (the duplication is obscured by the pieces of duplicated code not being similar enough to appear obviously duplicative) then it means the duplication is riskier.
Overhead
Some instances of duplication are easier to live with than others. Two short lines of very similar code, located right next to each other, are very easy to keep in sync with one another. Other types of duplication are much more deeply baked into the system and can cause a much bigger headache.
For example, if a piece of duplication exists as part of the database schema, that’s a much higher maintenance cost than a short code duplication. Instances of duplication that are big and aren’t represented by identical code can also be costly to maintain because, in those cases, you can’t just type the same thing twice, you have to perform a potentially expensive translation step in your head.
Traffic level
Duplication is a type of “bad code”, and so principles that apply to bad code apply to duplication as well. One of these principles is that bad code in heavily-trafficked areas costs more than bad code in lightly-trafficked areas.
When considering how much a piece of bad code costs, it’s worth considering when that cost is incurred. When a piece of bad code incurs a cost, we might think of this as analogous to paying a toll on a toll road.
One toll is charged when a piece of code is changed. The more frequently the code is changed, the more of a toll it’s going to incur, and so the bigger a problem it is.
Another toll is charged when a piece of code needs to be understood as a prerequisite to understanding a different piece of code. Every codebase has “leaf code” and “branch code”. If a piece of code is leaf code, meaning nothing depends on it, then we can afford for that code to be pretty bad and it doesn’t matter much. Branch code, on the other hand, gets heavy intellectual traffic, and so incurs a higher toll, and so is a bigger problem.
How to decide whether to DRY up a piece of code or to keep the duplication
The way to decide whether or not to DRY up a piece of duplication is pretty simple, although it’s not easy. There are two factors to consider.
Severity
If a piece of duplication is “severe”—i.e. it has low discoverability, poses high overhead, and/or has a high traffic level—it should probably be fixed. If not, it should probably be left alone.
Quality of alternative
Just because a piece of duplication costs something doesn’t automatically mean that the de-duplicated version costs less. It doesn’t happen very often, but sometimes a de-duplication unavoidably results in code that’s so generalized that it’s virtually impossible to understand. In these cases the duplicated version may be the lesser of two evils.
But be careful to make the distinction between “this code can’t be de-duplicated without making it worse” and “this particular attempt to de-duplicate this code made it worse”. Like all refactoring projects, sometimes you just need to try a few times before you land on something you’re happy with. And sometimes you just need to be careful not to go overboard.
Why the popular guidelines make little sense
It currently seems to be fashionable to hold the belief that developers apply DRY too eagerly. This hasn’t been my experience. The opposite has been my experience.
Claims that developers apply DRY too eagerly are often accompanied by advice to follow WET (“write everything twice”) or the “rule of three”, or “duplication is cheaper than the wrong abstraction”. Here’s why I think these popular guidelines make little sense.
Rule of three/”write everything twice”
Here’s my way of deciding whether to DRY up a duplication: Is the duplication very bad? Are we able to come up with a fix that’s better than the duplicated version and not worse? If so, then clean it up. If not, leave it alone.
Notice that my criteria do not include “Does the duplication appear three times?” I can’t see how that could be among the most meaningful factors.
Imagine, for example, a piece of duplication in the form of three very simple and nearly-identical lines, grouped together in a single file. The file is an unimportant one which only gets touched a couple times a year, and no one needs to understand that piece of code as a prerequisite to understanding anything else.
Now imagine another piece of duplication. The duplication appears in only two places, but the places are distant from one another and therefore the duplication is hard to discover. The two places where the duplicated behavior appear are expressed differently enough that the code would elude detection by a code quality tool or a manual human search. The behavior is a vitally central and important one. It doesn’t get changed often enough that it stays at the top of everyone’s mind, but it gets changed often enough that there are lots of opportunities for divergences to arise. And the two places the behavior appears are brutally painful to keep in sync.
Given this scenario, why on earth would I choose to fix the triple-duplicate and leave the double-duplicate alone?
The rule of three and “write everything twice” (WET) make little sense. The number of times a piece of duplication appears is not the main factor in judging its harmfulness.
Duplication is cheaper than the wrong abstraction
This statement is repeated very frequently in the Ruby community, usually to discourage people from applying the DRY principle too eagerly.
I wish we would think about this statement more deeply. Why are we setting up such a strong connection between duplication and abstractions? It strikes me as a non sequitur.
And why are we imagining such a strong danger of creating the wrong abstraction? Do we not trust ourselves to DRY up a piece of code and end up with something good? And again, why does the result of our de-duplicating have to be an abstraction? I find it an illogical connection.
If we take out the word “abstraction” then the sentiment that remains is “duplicated code is better than a de-duplicated version that’s even worse”. In which case I of course agree, but the statement is so banal that it’s not even a statement worth making.
I think “duplication is cheaper than the wrong abstraction” is a statement devoid of any useful meaning, and one we should stop repeating.
How to fix instances of duplication
A duplication-removal project is just a special case of a refactoring project. (Remember, refactoring means “changing the structure of code without changing its behavior”). Any guidelines that apply to general refactoring projects also apply to de-duplication projects.
When de-duplicating, it helps to work in small, atomic units. If the refactoring was triggered by a need to make a behavior change, don’t mix the behavior change with the refactoring. Perform the refactoring either before implementing the change or after or both, not during. And when you reach the point when you’re no longer sure that your refactorings are an improvement, stop.
When I’m de-duplicating two pieces of code, it’s often not clear how the unification will be achieved. In these cases I like to make it my first step to make the duplicate pieces of code completely identical while still keeping them separate. Merging two subtly different pieces of code can be tricky but merging two identical pieces of code is trivial. So make them identical first, then merge.
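Here’s a tiny illustration of that two-step approach (the methods are made up):

  # Before: two subtly different expressions of the same behavior.
  def full_name
    "#{first_name} #{last_name}"
  end

  def display_name
    [first_name, last_name].join(" ")
  end

  # Step 1: edit one copy until both are character-for-character identical,
  # confirming along the way that behavior hasn't changed.
  def display_name
    "#{first_name} #{last_name}"
  end

  # Step 2: the merge is now trivial: delete one method and point its
  # callers at the other.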
Takeaways
Duplication exists when there’s a single behavior that’s specified in two or more places.
The main reason duplication is bad is because it leaves a program susceptible to developing logical inconsistencies.
Not all instances of duplication are equally dangerous. The severity of a piece of duplication can be judged based on its discoverability, overhead cost and traffic level.
In order to decide whether an instance of duplication is worth fixing, consider the severity of the duplication. Also compare the duplicative code with the de-duplicated code, and only keep the “fixed” version if the fixed version is actually better.
The rule of three/”write everything twice” makes little sense because it doesn’t take into account the factors that determine whether a piece of duplication is dangerous or innocuous. “Duplication is cheaper than the wrong abstraction” makes little sense because it sets up a false dichotomy between duplication and “the wrong abstraction”.
To get good at removing duplication, get good at refactoring.
When attempting to remove an instance of duplication, it’s often helpful to make the duplicative code completely identical as a first step, and then merge the identical code as a second step.