Category Archives: Programming

Using ChatGPT to reduce “study and synthesize” work

When ChatGPT first came out, the first programming use case I thought of was writing code. I thought of it as “GitHub Copilot on steroids”. I imagine that was a lot of other people’s first thought too. But gradually I realized that having ChatGPT write production code is actually not a very good idea.

When ChatGPT gives you a big chunk of code, how can you be sure that the code does exactly what you think it does? How can you be sure that it’s not missing something, and that it doesn’t contain anything extra, and that it doesn’t have bugs?

The answer of course is that you have to test it. But retroactively testing existing code is usually tedious and annoying. You basically have to replay the process of writing the code so that you can test each piece of code individually.

A programming workflow that involves using ChatGPT to write big chunks of code seems dangerous at worst and terribly inefficient at best.

If ChatGPT isn’t great for writing production code, what’s it good for?

Using ChatGPT to reduce mental labor

One part of our jobs as programmers is to learn general principles and then apply parts of those principles to a specific, often peculiar need.

I’ll give a somewhat silly example to illustrate the point starkly. Let’s say I want to integrate Authorize.net into a Lisp program and that I’ve never used either technology before. Without checking, one could pretty safely assume there are no tutorials in existence on how to integrate Authorize.net into a Lisp app.

In order to complete my integration project I’ll need to learn something about a) Authorize.net in general, b) Lisp in general, and c) integrating Authorize.net with a Lisp app specifically. Then I’ll need to synthesize a solution based on what I’ve learned.

This whole process can be wasteful, time-consuming, and at times, quite boring. In the beginning I might know that I need to get familiar with Authorize.net, but I’m not sure yet which parts of Authorize.net I need to be familiar with. So I’ll read a whole bunch about Authorize.net, but I won’t know until the end of the project which areas of my study were actually needed and which were just a waste of time.

And what’s even worse are the cases where the topics you’re studying are of no permanent benefit to your skills as a programmer. In the case of Authorize.net I might not expect to ever use it again. (At least I hope not!) This kind of learning is just intellectual ditch-digging. It’s pure toil with little or no lasting benefit.

This kind of work, where you first study some generalities and then synthesize a specific solution from those generalities, is what I call “study and synthesize” work.

Thanks to ChatGPT, most “study and synthesize” work is a thing of the past. If I tell ChatGPT “Give me a complete tutorial on how to integrate Authorize.net with a Lisp program”, it will. The tutorial may not be correct down to every last detail, but that’s not the point. Just having a high-level plan spelled out saves a lot of mental labor. And then if I need to zoom in on certain details which the tutorial either got wrong or omitted, ChatGPT will quite often correct its mistakes when pressed.

Using ChatGPT to write production code may seem like a natural and logical use for the tool, but it’s actually not a very good one. You’ll get a lot more leverage out of ChatGPT if you use it for “study and synthesize” work.

In defense of productivity

Anti-productivity sentiment

In my career I’ve noticed that a lot of developers have a distaste for the idea of “productivity”. They view it as some sort of toxic, unhealthy obsession. (It always has to be an “obsession”, by the way. One can never just have an interest in productivity.)

Productivity is often associated with working harder and longer, sacrificing oneself for a soulless corporation.

In a lot of ways I actually agree with these people. Having an unhealthy obsession with anything is obviously unhealthy, by definition. And I think working long and hard for its own sake, for no meaningful reward, is a waste of precious time.

But I think sometimes these anti-productivity people are so blinded by their natural aversion to “productivity culture” that they miss out on some good and worthwhile ideas, ideas they would actually like if they opened their minds to them.

“Productivity” is a pretty ambiguous word. It could have a lot of different interpretations. I’d like to share my personal interpretation of productivity which I happen to quite like. Maybe you’d like to adopt it for yourself.

My version of productivity

For me, productivity isn’t about obsessively tracking every minute of the day or working so hard you burn yourself out.

The central idea of productivity for me is decreasing the ratio of effort to value. This could mean working less to create the same value or it could mean working the same to create more value. Or anywhere in between. Each person can decide for themselves where they’d like to set the dial.

I value a calm, healthy mind and body. People obviously do better work when they’re relaxed and even-keeled than when they’re harried and stressed.

Productivity for me is about realizing that our time on this planet is limited and precious, and that we shouldn’t be needlessly wasteful with our time but rather protect it and spend it thoughtfully.

Why duplication is more acceptable in tests

It’s often taught in programming that duplication is to be avoided. But for some reason it’s often stated that duplication is more acceptable in test code than in application code. Why is this?

We’ll explore this, but first, let’s examine the wrong answers.

Incorrect reasons why duplication is more acceptable in tests

“Duplication isn’t actually that bad.”

Many programmers hold the opinion that duplication isn’t something that should be avoided fastidiously. Instead, a certain amount of duplication should be tolerated, and when the duplication gets to be too painful, then it should be addressed. The “rule of three”, for example, says to tolerate two copies of a piece of code but to clean it up once the duplication reaches three instances.

This way of thinking is overly simplistic and misses the point. The cost of duplication doesn’t depend on whether the duplication appears twice or three times but rather on factors like how easy the duplication is to notice, how costly it is to keep the duplicated instances synchronized, and how much “traffic” the duplicated areas receive. (See this post for more details on the nature of duplication and its costs.)

The heuristic for whether to tolerate duplication shouldn’t be “tolerate some but don’t tolerate too much”. Rather the cost of a piece of duplication should be assessed based on the factors above and weighed against any benefits that piece of duplication has. If the costs aren’t justified by the benefits, then the duplication should be cleaned up.

“Duplication in test code can be clearer than the DRY version”

It’s true that duplication in test code can be clearer than the DRY version. But duplication in application code can be clearer than the DRY version too. So if duplicating code can make it clearer, why not prefer duplication in application code to the same exact degree as in test code?

This answer doesn’t actually answer the question. The question is about the difference between duplication in test code and application code.

The real reason why duplication is more acceptable in test code

In order to understand why duplication is more acceptable in test code than application code, it helps to get very clear on what exactly duplication is and why it incurs a cost.

What duplication is and why it costs

Duplication doesn’t mean two identical pieces of code. Duplication is two or more copies of the same behavior. It’s possible to have two identical pieces of code that represent different pieces of behavior. It’s also possible to have the same behavior expressed more than once but in different code.
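
To make this concrete, here’s a small Ruby sketch (all of the class, constant and method names are made up for illustration). The two constants are identical code but represent different behaviors, while the two methods are different code expressing the same behavior:

    # Identical code, different behaviors: the minimum order amount and the
    # minimum refund amount just happen to be equal today. If the business
    # changes one, the other shouldn't necessarily change with it.
    class Order
      MINIMUM_AMOUNT_CENTS = 5_00
    end

    class Refund
      MINIMUM_AMOUNT_CENTS = 5_00
    end

    # Different code, same behavior: both methods express "an order below the
    # minimum is too small", so changing one logically requires changing the other.
    def order_too_small?(amount_cents)
      amount_cents < Order::MINIMUM_AMOUNT_CENTS
    end

    def below_order_minimum?(amount_cents)
      !(amount_cents >= 5_00)
    end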

Let’s also review why duplication incurs a cost. The main reason is that it leaves the program susceptible to logical inconsistencies. If one copy of a behavior gets changed but the other copies don’t, then the other copies are now wrong and there’s a bug present. The other reason duplication incurs a cost is that it creates a maintenance burden. Updating something in multiple places is obviously more costly than updating it in just one place.

The difference between test code and application code

The difference between test code and application code is that test code doesn’t contain behaviors. All the behaviors are in the application code. The purpose of the test code is to specify the behaviors of the application code.

What in the codebase determines whether the application code is correct? The tests. If the application code passes its tests (i.e. its specifications), then the application code is correct (for a certain definition of “correct”). What in the code determines whether the tests (specifications) are correct? Nothing! The program’s specifications come entirely from outside the program.

Tests are always correct

This means that whatever the tests specify is, by definition, correct. If we have two tests containing the same code and one of the tests changes, it does not always logically follow that the other test needs to be updated to match. This is different from duplicated application code. If a piece of behavior is duplicated in two places in the application code and one piece of behavior gets changed, it does always logically follow that the other piece of behavior needs to get updated to match. (Otherwise it wouldn’t be an instance of duplication.)

This is the reason why duplication is more acceptable in test code than in application code.

Takeaways

  • Duplication is when one behavior is specified multiple times.
  • Duplication in application code is costly because, among other reasons, multiple copies of the same behavior are subject to diverging, thus creating a bug.
  • Since test code is a human-determined specification, it’s by definition always correct. If one instance of a duplicated piece of code changes, it’s not a logical necessity that the other piece needs to change with it.

Why tests flake more on CI than locally

A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.

The root cause of flaky tests is some sort of non-determinism, either in the test code or in the application code.

In order to understand why a CI test run is more susceptible to flakiness than a local test run, we can go through the root causes of flakiness one by one and consider how a CI test run’s susceptibility to each specific cause differs from a local test run’s.

The root causes we’ll examine (which are all explained in detail in this post) are leaked state, race conditions, network/third-party dependency, fixed time dependency and randomness.

Leaked state

Sometimes one test leaks some sort of state (e.g. a change to a file or env var) into the global environment which interferes with later tests.

The reason a CI test run is more susceptible to leaked state flakiness is clear. Unlike a local environment where you’re usually just running one test file at a time, in CI you’re running a whole bunch of tests together. This creates more opportunities for tests to interfere with each other.

Race conditions

A race condition is when the correct functioning of a program depends on two or more parallel actions completing in a certain sequence, but the actions sometimes complete in a different sequence, resulting in incorrect behavior.

One way that race conditions can arise is through performance differences. Let’s say there’s a process that times out after 5000ms. Most of the time the process completes in 4500ms, meaning no timeout. But sometimes it takes 5500ms to complete, meaning the process does time out.

It’s very easy for differences to arise between a CI environment and a local environment in ways that affect performance. The OS is different, the memory and processor speed are different, and so on. These differences can mean that race conditions arise on CI that would not have arisen in a local environment.

Network/third-party dependency

Network dependency can lead to flaky tests for the simple reason that sometimes the network works and sometimes it doesn’t. Third-party dependency can lead to flaky tests because sometimes third-party services don’t behave deterministically. For example, the service can have an outage, or the service can rate-limit you.

This is the type of flakiness that should never occur because it’s not a good idea to hit the network in tests. Nonetheless, I have seen this type of flakiness occur in test suites where the developers didn’t know any better.

Part of the reason why CI test runs are more susceptible to this type of flakiness is that there are simply more at-bats. If a test makes a third-party request only once per day locally but 1,000 times per day on CI, there are of course more chances for the CI request to encounter a problem.

Fixed time dependency

There are some tests that always pass at one time of day (or month or year) and always fail at another.

Here’s an excerpt about this from my other post about the causes of flaky tests:

This is common with tests that cross the boundary of a day (or month or year). Let’s say you have a test that creates an appointment that occurs four hours from the current time, and then asserts that that appointment is included on today’s list of appointments. That test will pass when it’s run at 8am because the appointment will appear at 12pm which is the same day. But the test will fail when it’s run at 10pm because four hours after 10pm is 2am which is the next day.

CI test runs are more susceptible to fixed-time-dependency flakiness than local test runs for a few reasons. One is the fact that CI test runs simply have more at-bats than local test runs. Another is that the CI environment’s time zone settings might be different from the local test environment’s. A third reason is that, unlike a local test environment, which is normally only used during typical working hours, a CI environment is often utilized for a broader stretch of time each day because developers kick off test runs from different time zones and on varying schedules.

Randomness

The final cause of flaky tests is randomness. As far as I know, the only way that CI test runs are more susceptible to flakiness due to randomness is the fact that CI test runs have more at-bats than local test runs.

Takeaways

  • A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.
  • The root cause of flaky tests is some sort of non-determinism, either in the test code or in the application code.
  • Whenever flakiness is more frequent in CI, it’s because some difference between the CI test runs and the local test runs makes one or more of the five specific causes of flaky tests more likely.

How I fix flaky tests

What a flaky test is and why they’re hard to fix

A flaky test is a test that passes sometimes and fails sometimes even though no code has changed.

There are several causes of flaky tests. The commonality among all the causes is that they all involve some form of non-determinism: code that doesn’t always behave the same on every run even though neither the inputs nor the code itself has changed.

Flaky tests are known to present themselves more in a continuous integration (CI) environment than in a local test environment. The reason for this is that certain characteristics of CI test runs make the tests more susceptible to non-determinism.

Because the flakiness usually can’t be reproduced locally, the buggy behavior of flaky tests is harder to reproduce and diagnose.

In addition to the fact that flaky tests often only flake on CI, the fact that flaky tests don’t fail consistently adds to the difficulty of fixing them.

Despite these difficulties, I’ve developed some tactics and strategies for fixing flaky tests that consistently lead to success. In this post I’ll give a detailed account of how I fix flaky tests.

The overall approach

When I’m fixing any bug I divide the bugfix into three stages: reproduction, diagnosis and fix.

I consider a flaky test a type of bug. Therefore, when I try to fix a flaky test, I follow this same three-step process as I would when fixing any other type of bug. In what follows I’ll cover how I approach each of these three steps of reproduction, diagnosis and fix.

Before reproducing: determine whether it’s really a flaky test

Not everything that appears to be a flaky test is actually a flaky test. Sometimes a test that appears to be flaking is just a healthy test that’s legitimately failing.

So when I see a test that’s supposedly flaky, I like to try to find multiple instances of that test flaking before I accept its flakiness as a fact. And even then, there’s no law that says that a test that previously flaked can’t fail legitimately at some point in the future. So the first step is to make sure that the problem I’m solving really is the problem I think I’m solving.

Reproducing a flaky test

If I can’t reproduce a bug, I can’t test for its presence or absence. If I can’t test for a bug’s presence or absence, I can’t know whether a fix attempt actually fixed the bug or not. For this reason, before I attempt to fix any bug, I always devise a test that will tell me whether the bug is present or absent.

My go-to method for reproducing a flaky test is simply to re-run the test suite multiple times on my CI service until I see the flaky test fail. Actually, I like to run the test suite a great number of times to get a feel for how frequently the flaky test fails. The actions I take during the bugfix process may be different depending on how frequently the test fails, as we’ll see later on.
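
As a rough illustration, here’s the kind of throwaway Ruby script I might use for this (the spec path and the run count are placeholders; in practice I usually do the equivalent thing by re-triggering builds on the CI service itself):

    # Re-run a test file many times and count how often it fails.
    runs = 20
    failures = 0

    runs.times do |i|
      # `system` returns true when the command exits successfully, false otherwise.
      passed = system("bundle exec rspec spec/models/appointment_spec.rb")
      failures += 1 unless passed
      puts "Run #{i + 1}: #{passed ? 'passed' : 'FAILED'}"
    end

    puts "#{failures} failure(s) out of #{runs} runs"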

Sometimes a flaky test fails so infrequently that it’s practically impossible to get the test to fail on demand. When this happens, it’s impossible to tell whether the test is passing due to random chance or because the flakiness has legitimately gone away. The way I handle these cases is to deprioritize the fix attempt and wait for the test to fail again in the natural course of business. That way I can be sure that I’m not wasting my time trying to fix a problem that’s not really there.

That covers the reproduction step of the process. Now let’s turn to diagnosis.

Diagnosing a flaky test

What follows is a list of tactics that can be used to help diagnose flaky tests. The list is kind of linear and kind of not. When I’m working on flaky tests I’ll often jump from tactic to tactic depending on what the scenario calls for rather than rigidly following the tactics in a certain order.

Get familiar with the root causes of flaky tests

If you were a doctor and you needed to diagnose a patient, it would obviously be helpful for you first to be familiar with a repertoire of diseases and their characteristic symptoms so you can recognize diseases when you see them.

Same with flaky tests. If you know the common causes for flaky tests and how to recognize them, you’ll have an easier time with trying to diagnose flaky tests.

In a separate post I show the root causes of flaky tests, which are race conditions, leaked state, network/third-party dependency, fixed time dependency and randomness. I suggest either committing these root causes to memory or reviewing them each time you embark on a flaky test diagnosis project.

Have a completely open mind

One of the biggest dangers in diagnosing flaky tests or in diagnosing any kind of problem is the danger of coming to believe something that’s not true.

Therefore, when starting to investigate a flaky test, I try to be completely agnostic as to what the root cause might be. It’s better to be clueless and right than to be certain and wrong.

Look at the failure messages

The first thing I do when I become aware of a flaky test is to look at the error message. The error message doesn’t always reveal anything terribly helpful but I of course have to start somewhere. It’s worth checking the failure message because sometimes it contains a helpful clue.

It’s important not to be deceived by error messages. Error messages are an indication of a symptom of a root cause, and the symptom of a root cause often has little or nothing to do with the root cause itself. Be careful not to fall into the trap of “the error message says something about X, therefore the root cause has something to do with X”. That’s very often not true.

Look at the test code

After looking at the failure message, I open the flaky test in an editor and look at its code. At first I’m not looking for anything specific. I’m just getting the lay of the land. How big is this test? How easy is it to understand? Does it have a lot of setup data or not much?

I do all this to load the problem area into my head. The more familiar I am with the problem area, the more I can “read from RAM” (use my brain’s short-term memory) as I continue to work on the problem instead of “read from disk” (look at the code). This way I can solve the problem more efficiently.

Once I’ve surveyed the test in this way, I zero in on the line that’s yielding the failure message. Is there anything interesting that jumps out? If so, I pause and consider and potentially investigate.

The next step I take with the test code is to go through the list of causes of flaky tests and look for instances of those.

After I’ve done all that, I study the test code to try to understand, in a meaningful big-picture way, what the test is all about. Obviously I’m going to be more likely to be successful in fixing problems with the test if I actually understand what the test is all about than if I don’t. (Sometimes this involves rewriting part or all of the test.)

Finally, I often go back to the beginning and repeat these steps an additional time, since each run through these steps can arm me with more knowledge that I can use on the next run through.

Look at the application code

The root cause of every flaky test is some sort of non-determinism. Sometimes the non-determinism comes from the test. Sometimes the non-determinism comes from the application code. If I can’t find the cause of the flakiness in the test code, I turn my attention to the application code.

Just like with the test code, the first thing I do is to just scan the relevant application code to get a feel for what it’s all about.

The next thing I do is to go through the code more carefully and look for causes of flakiness. (Again, you can refer to this blog post for that list.)

Then, just like with the test code, I try to understand the application code in a big-picture sort of way.

Make the test as understandable as possible

Many times when I look at a flaky test, the test code is too confusing to try to troubleshoot. When this is the case, I try to improve the test to the point that I can easily understand it. Easily understandable code is obviously easier to troubleshoot than confusing code.

To my surprise, I’ve often found that, after I improve the structure of the test, the flakiness goes away.

Side note: whenever I modify a test to make it easier to understand, I perform my modifications in one or more small, atomic pieces of work. I do this because I want to keep my refactorings and my fix attempts separate.

Make the application code as easy to understand as possible

If the application code is confusing then it’s obviously going to hurt my ability to understand and fix the flaky test. So, sometimes, I refactor the application code to make it easier to understand.

Make the test environment as understandable as possible

The quality of the test environment has a big bearing on how easy the test suite is to work with. By test environment I mean the tools (RSpec/Minitest, Factory Bot, Faker, etc.), the configurations for the tools, any seed data, the continuous integration service along with its configuration, any files shared among all the tests, and things like that.

The harder the test environment is to understand, the harder it will be to diagnose flaky tests. Not every flaky test fix job prompts me to work on the test environment, but it’s one of the things I look at when I’m having a tough time or I’m out of other ideas.

Check the tests that ran just before the flaky test

Just because a certain test flakes doesn’t necessarily mean that that test itself is flaky—even if the same test flakes consistently.

Sometimes, due to leaked state, test A will create a problem and then test B will fail. (A complete description of leaked state can be found in this post.) The symptom is showing up in test B so it looks like test B has a problem. But there’s nothing at all wrong with test B. The real problem is test A. So the problematic test passes but the innocent test flakes. It’s very deceiving!

Therefore, when I’m trying to diagnose a flaky test, I’ll check the continuous integration service to see what test ran before that test failed. Sometimes this leads me to discover that the test that ran before the flaky one is leaking state and needs to be fixed.

Add diagnostic info to the test

Sometimes, the flaky test’s failure message doesn’t show much useful information. In these cases I might add some diagnostic info to the test (or the relevant application code) in the form of print statements or exceptions.
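
For example, a couple of temporary print statements like these can reveal what state the test actually sees at the moment of failure (the factory and method names here are hypothetical):

    it "shows today's appointments" do
      appointment = create(:appointment, starts_at: 4.hours.from_now)

      # Temporary diagnostics: what time does the test think it is, and what
      # records actually exist when the assertion runs?
      puts "Current time: #{Time.current}"
      puts "Appointments in DB: #{Appointment.pluck(:id, :starts_at).inspect}"

      # todays_appointments stands in for whatever the real test asserts on.
      expect(todays_appointments).to include(appointment)
    end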

Perform a binary search

Binary search debugging is a tactic that I use to diagnose bugs quickly. There are two main ideas behind it: 1) it’s easier to find where a bug is than what a bug is, and 2) binary search can be used to quickly find the location of a bug.

I make heavy use of binary search debugging when diagnosing flaky tests. See this blog post for a complete description of how to use binary search debugging.

Repeat all the above steps

If I go through all the above steps and I don’t have any more ideas, I might simply go through the list an additional time. Now that I’m more familiar with the test and everything surrounding it, I might have an enhanced ability to learn new things about the situation when I take an additional pass, and I might have new ideas or realizations that I didn’t have before.

Now let’s talk about the third step in fixing a flaky test, applying the fix itself.

Applying the fix for a flaky test

How to be sure your bugfixes work

A mistake many developers make when fixing bugs is that they don’t figure out a way to know if their bugfix actually worked or not. The result is that they often have to “fix” the bug multiple times before it really gets fixed. And of course, the false fixes create waste and confusion. That’s obviously not good.

The way to ensure that you get the fix right the first time is to devise a test (it can be manual or automated) that a) fails when the bug is present and b) passes when the bug is absent. (See this post for more details on how to apply a bugfix.)

How the nature of flaky tests complicates the bugfix process

Unlike “regular” bugs, which can usually be reproduced on demand once reproduction steps are known, flaky tests are usually only reproducible in one way: by re-running the test suite repeatedly on CI.

This works out okay when the test fails with relative frequency. If the test fails one out of every five test runs, for example, then I can run the test suite 50 times and expect to see (on average) ten failures. This means that if I apply the ostensible fix for the flaky test and then run the test suite 50 more times and see zero failures, then I can be pretty confident that my fix worked.

The less frequently the flaky test fails, the less certain I can be that my fix worked. If the test fails only once out of every 50 test runs on average, and I run my test suite 50 times and see zero failures, I can’t be sure whether that means the flaky test is fixed or whether all my runs simply passed due to random chance.

Ideally a bugfix process goes like this:

  1. Perform a test that shows that the bug is present (i.e. run the test suite a bunch of times and observe that the flaky test fails)
  2. Apply a bugfix on its own branch
  3. Perform a test on that branch that shows that the bug is absent (i.e. run the test suite a bunch of times and observe that the flaky test doesn’t fail)
  4. Merge the bugfix branch into master

The reason this process is good is that it gives certainty that the bugfix works before the bugfix branch gets merged into master.

But for a test that fails infrequently, it’s not realistic to perform the steps in that order. Instead it has to be like this:

  1. Perform a test that shows that the bug is present (i.e. observe over time that the flaky test fails sometimes)
  2. Apply a bugfix on its own branch
  3. Merge the bugfix branch into master
  4. Perform a test that shows that the bug is absent (i.e. observe over a sufficiently long period of time that the flaky test no longer fails)

Notice how the test that shows that the bug is present is different. When the test fails frequently, we can perform an “on-demand” test where we run the test suite a number of times to observe that the bug is present. When the test fails infrequently, we don’t realistically have this option because it may require a prohibitively large number of test suite runs just to get a single failure. Instead we just have to go off of what has been observed in the test suite over time in the natural course of working.

Notice also that the test that shows that the bug is absent is different. When the test fails frequently, we can perform the same on-demand test after the bugfix as before the bugfix in order to be certain that the bugfix worked. When the test fails infrequently, we can’t do this, and we just have to wait until a bunch of test runs naturally happen over time. If the test goes sufficiently long without failing again, we can be reasonably sure that the bugfix worked.

Lastly, notice how in the process for an infrequently-failing test, merging the fix into master has to happen before we perform the test that ensures that the bugfix worked. This is because the only way to test that the bugfix worked is to actually merge the bugfix into master and let it sit there for a large number of test runs over time. It’s not ideal but there’s not a better way.

A note about deleting and skipping flaky tests

There are two benefits to fixing a flaky test. One benefit of course is that the test will no longer flake. The other is that you gain some skill in fixing flaky tests as well as a better understanding of what causes flaky tests. This means that fixing flaky tests creates a positive feedback loop. The more flaky tests you fix, the more quickly and easily you can fix future flaky tests, and the fewer flaky tests you’ll write in the first place because you know what mistakes not to make.

If you simply delete a flaky test, you’re depriving yourself of that positive feedback loop. And of course, you’re also destroying whatever value that test had. It’s usually better to push through and keep working on fixing the flaky test until the job is done.

It might sometimes seem like the amount of time it takes to fix a certain flaky test is more than the value of that test can justify. But keep in mind that the significant thing is not the cost/benefit ratio of any individual flaky test fix, but the cost/benefit ratio of all the flaky test fixes on average. Sometimes flaky test fixes will take 20 minutes and sometimes they’ll take two weeks. The flaky test fixes that take two weeks might feel unjustifiable, but if you have a general policy of just giving up and deleting the test when things get too hard, then your test-fixing skills will always stay limited, and your weak skills will incur a cost on the test suite for as long as you keep deleting difficult flaky tests. Better to just bite the bullet and develop the skills to fix hard flaky test cases.

Having said all that, deleting a flaky test is sometimes the right move. When development teams lack the skills to write non-flaky tests, sometimes the teams have other bad testing habits, like writing tests that are pointless. When a flaky test coincidentally happens to also be pointless, it’s better to just delete the test than to pay the cost to fix a test that doesn’t have any value.

Skipping flaky tests is similar in spirit to deleting them. Skipping a flaky test has all the same downsides as deleting it, plus now you have the extra overhead of occasionally stumbling across the test and remembering “Oh yeah, I should fix this eventually.” And what’s worse, the skipped test often gets harder to fix as time goes on, because the skipped test is frozen in time while the rest of the codebase continues to change in ways the skipped test isn’t compatible with. The easiest time to fix a flaky test is right when the flakiness is first discovered.

Takeaways

  • The root cause of every flaky test is some sort of non-determinism.
  • Flaky tests are known to present themselves more in a CI environment than in a local test environment because certain characteristics of CI test runs make the tests more susceptible to non-determinism.
  • I consider a flaky test to be a type of bug. When I’m fixing any bug, including a flaky test, I divide the bugfix into three stages, which are reproduction, diagnosis and fix.
  • To reproduce a flaky test, I run the test suite enough times on CI to see the flaky test fail, or if it fails too infrequently I wait for it to fail naturally.
  • There are a large number of tactics I use to diagnose flaky tests. I don’t necessarily go through the tactics in a specific order but rather I use intuition and experience to decide which tactic to use next. The important thing is to treat the flaky test diagnosis as a distinct step which occurs after reproduction and before the application of the fix.
  • With the application of any bugfix, it’s good to have a test you can perform before and after the fix to be sure that the fix worked. When a flaky test fails frequently enough, you can do this sort of test by simply re-running the test suite in CI a sufficient number of times. If the flaky test fails infrequently, this is not practical, and the fix must be merged to master without being sure that it worked.
  • When you delete a flaky test, you not only destroy the value of the test but you also lose the opportunity to build your skills in fixing flaky tests and avoiding writing flaky tests in the first place. Unless the test coincidentally happens to be one that has little or no value, it’s better to fix it.

Binary search debugging

Diagnosing bugs by guessing

When faced with a bug they have to diagnose, many developers will start making guesses as to what the problem is.

Guessing can be fine, especially when the guesses are good ones and when the guesses are inexpensive to test. But often, the guesses quickly degrade into long shots and the developer spends his or her time flailing around randomly rather than progressing steadily toward a solution.

Diagnosing bugs more methodically

One of my principles of debugging is that it’s almost always easier to determine where the cause of a bug is than what the cause of the bug is. When I need to diagnose a bug, the first thing I ask is not “What is the problem?” but “Where is the problem?” Not only is “where” a much easier question to answer than “what”, but once I’ve found the “where”, the “what” is often plainly evident.

When you change the question from “what” to “where”, the problem changes from a thought-intensive mystery-solving exercise to a relatively straightforward search problem.

When I want to determine where a bug lies, I identify an area of code and ask “Does the bug lie in this area of code?” If the answer is yes, then I perform the search again on a narrower area. If the answer is no, I continue my search in a different area.

When I ask the question “Does the bug lie in this area of code?” the answer can be obtained like this:

  1. Perform the steps that reproduce the bug on the latest code. Observe that the bug is present.
  2. Delete or disable some section of the code.
  3. Perform the reproduction steps again.
  4. Observe whether the bug is present. If the bug is gone, the answer is yes. If the bug is still present, the answer is no.

But the question remains: how do you determine which areas of your code to inspect for the bug? Obviously it’s not going to be efficient to just randomly choose areas for inspection. That’s where binary search comes in.

Binary search

Imagine I wanted to know someone’s birthday but I wasn’t allowed to ask them when their birthday was. I was only allowed to ask yes-or-no questions.

In order to figure out this person’s birthday, I could of course ask, “Is it on January 1st?” “Is it on January 2nd?” etc. But that would take me up to 365 guesses (or actually 366 because of leap year birthdays) and on average it would take me 366/2 = 183 guesses (assuming even distribution of birthdays).

Instead I could ask the person “Is your birthday before July 1st?” If so, I can rule out all the days after July 1st. Then I can cut the remaining portion of the year in half. The next question would be “Is your birthday before April 1st?” If so, I do the same thing again. “Before February 16th?” “Before January 23rd?” “Before January 13th?” and so on. With this method, it takes not more than nine questions to arrive at the right answer. That’s a lot better than 183!
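
Here’s a small Ruby sketch of the same idea, just to show the mechanics (the hidden birthday and the yes-or-no question are stand-ins for the unknown location of a bug and whatever test you can actually perform):

    require "date"

    # The unknown we're searching for.
    secret_birthday = Date.new(2024, 9, 17)

    # The only question we're allowed to ask: "Is your birthday before this date?"
    birthday_before = ->(date) { secret_birthday < date }

    low = Date.new(2024, 1, 1)
    high = Date.new(2024, 12, 31)
    questions = 0

    while low < high
      mid = low + ((high - low).to_i / 2)
      questions += 1
      if birthday_before.call(mid + 1) # "Is your birthday before #{mid + 1}?"
        high = mid
      else
        low = mid + 1
      end
    end

    puts "Found #{low} in #{questions} questions" # e.g. 2024-09-17 in 8 questions (never more than 9)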

Binary search in code

This binary search method can be applied to searching code as well. Let’s say you have a 100-line file that contains a bug but you don’t know where the bug is. You can (at least in theory) delete the last 50 lines of code and then check the remaining code for the presence of the bug. If the bug is still there, then you can divide that code in half by deleting the last 25 lines, and so on. Obviously you usually can’t literally delete the last 50 lines of a file because then it wouldn’t be syntactically valid and so on, but the principle of course still stands.

Binary search can also be used on the level of an entire codebase. You can devise questions that will divide the codebase roughly in half and then check for the presence or absence of the bug in each half. You don’t even need to necessarily delete code in order for this method to work. You just need a way to eliminate half of the codebase from your search area somehow. (Think about the game 20 Questions. “Is it living? Is it an animal? Is it a mammal?” and so on.)

You can also perform searches across time. Git bisect works by taking a range of commits and then repeatedly dividing it in half, asking you to answer the question “Does the bug lie in this half?” at each step. When you perform a bisect, you’re asking not “What is this bug?” but rather “What commit introduced this bug?” (In other words, “where is this bug”.) If your commits are small and atomic, then the cause of the bug will often be obvious once the offending commit is identified. If the offending commit is large, you might need to do another binary search on the code the commit introduced in order to isolate the root cause.

The beauty of binary search debugging

Before I figured out how to diagnose bugs methodically, debugging was often an extremely frustrating exercise. I would stare at the screen and wonder why what was happening was happening. I would read the code over and over to try to make sense of it. I would make guesses as to what the problem could be. And after my guess turned out to be wrong, I often felt no closer to a solution than before.

Now things are different. The bug diagnosis methodology that I use now—which combines the “where, not what” principle with binary search—has two big benefits. First, this methodology allows me to progress systematically and inexorably toward a solution rather than taking shots in the dark. Second, it almost always works. It’s hard to overstate how good it feels to know that whatever bug I’m faced with, I have the ability to diagnose it in a timely manner, and without much mental strain, and with a success rate close to 100%. That sure as hell beats guessing.

Takeaways

  • When diagnosing bugs, guessing is sometimes fine, but it’s often inefficient.
  • It’s almost always easier to find where a bug is than what a bug is.
  • Using binary search to find where a bug is makes the search process more efficient.
  • Binary search almost always works.

What causes flaky tests

What is a flaky test?

A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.

In other words, a flaky test is a test that’s non-deterministic.

A test can be non-deterministic if either a) the test code is non-deterministic or b) the application code being tested is non-deterministic, or both.

Below are some common causes of flaky tests. I’ll briefly discuss the fix for some of these common causes, but the focus of this post isn’t to provide a guide to fixing flaky tests, it’s to give you a familiarity with the most common causes for flaky tests so that you can know what to go looking for when you do your investigation work. (I have a separate post for fixing flaky tests.)

The causes I’ll discuss are race conditions, leaked state, network/third-party dependency, randomness and fixed time dependency.

Race conditions

A race condition is when the correct functioning of a program depends on two or more parallel actions completing in a certain sequence, but the actions sometimes complete in a different sequence, resulting in incorrect behavior.

Real-life race condition example

Let’s say I’m going to pick up my friend to go to a party. I know it takes 15 minutes to get to his house so I text him 15 minutes prior so he can make sure to be ready in time. The sequence of events is supposed to be 1) my friend finishes getting ready and then 2) I arrive at my friend’s house. If the events happen in the opposite order then I end up having to wait for my friend, and I lay on the horn and shout obscenities until he finally comes out of his house.

Race conditions are especially likely to occur when the times are very close. Imagine, for example, that it always takes me exactly 15 minutes to get to my friend’s house, and it usually takes him 14 minutes to get ready, but about one time out of 50, say, it takes him 16 minutes. Or maybe it takes my friend 14 minutes to get ready but one day I somehow get to his house in 13 minutes instead of the usual 15. You can imagine how it wouldn’t take a big deviation from the norm to cause the race condition problem to occur.

Hopefully this real-life example illustrates that the significant thing that gives rise to race conditions is parallelism and sequence dependence, and it doesn’t matter what form the parallelism takes. The parallelism could take the form of multithreading, asynchronicity, two entirely separate systems (e.g. two calls to two different third-party APIs), or literally anything else.

Race conditions in DOM interaction/system tests

Race conditions are fairly common in system tests (tests that exercise the full application stack including the browser).

Let’s say there’s a test that 1) submits a form and then 2) clicks a link on the subsequent page. To tie this to the pick-up-my-friend analogy, the submission of the form would be analogous to me texting my friend saying I’ll be there in 15 minutes, and the loading of the subsequent page would be analogous to my friend getting ready. The race condition creates a problem when the test attempts to click the link before the page loads (analogous to me arriving at my friend’s house before he’s ready).

To make the analogy more precise, this sort of failure is analogous to me arriving at my friend’s house before he’s ready, and then just leaving after five minutes because I don’t want to wait (timeout error!).

The solution to this sort of race condition is easy: just remove the asynchronicity. Instead of allowing the test to execute at its natural pace, add a step after the form submission that waits for the next page to load before trying to click the link. This is analogous to me adding a step and saying “text me when you’re ready” before I leave to pick up my friend. If we arrange it like that then there’s no longer a race.
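
In Capybara terms, the fix might look something like this (a rough sketch; the path helper, field names and page text are made up):

    it "lets the user update their profile" do
      visit edit_profile_path
      fill_in "Name", with: "Jason"
      click_button "Save"

      # This assertion is the "text me when you're ready" step: Capybara will
      # wait (up to its default wait time) for the next page to render this
      # text before the test moves on.
      expect(page).to have_content("Profile updated")

      # Only now is it safe to interact with the subsequent page.
      click_link "Back to dashboard"
    end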

Because DOM interactions often involve asynchronicity, DOM interaction is a common area for race conditions, and therefore flaky tests, to be present.

Edge-of-timeout race conditions

A race condition can also occur when the amount of time an action takes to complete is just under a timeout value, e.g. a timeout value is five seconds and the action takes four seconds (but sometimes six seconds).

In these cases you can increase the timeout and/or improve the performance of the action such that the timeout value and typical run length are no longer close to each other.

Leaked state

Tests can create flaky behavior when they leak state into other tests.

Let me shoot an apple off your head

Let’s use another analogy to illustrate this one. Let’s say I wanted to perform two tests on myself. The first test is to see if I can shoot an apple off the top of my friend’s head with a bow and arrow. The second test is to see if I can drink 10 shots of tequila in under an hour.

If I were to perform the arrow test immediately followed by the tequila test and do that once a week, I could expect to get basically the same test results each time.

But if I were to perform the tequila test immediately followed by the arrow test, my aim would probably be compromised, and I might miss the apple once in a while. (Sorry, friend.) The problem is that the tequila test “leaks state”: it creates a lasting alteration in the global state, and that alteration affects subsequent tests.

And if I were to perform these two tests in random order, the tequila test would give the same result each time because I’d always be starting it sober, but the arrow test would appear to “flake” because sometimes I’d start it sober and sometimes I’d start it drunk. I might even suspect that there’s a problem with the arrow test because that’s the test that’s showing the symptom, but I’d be wrong. The problem is a different test with leaky state.

Ways for tests to leak state

Returning to computers, there are a lot of ways a test can alter the global state and create non-deterministic behavior.

One way is to alter database data. Imagine there are two tests, each of which creates a user with the email address test@example.com. The first test will pass and, if there’s a unique constraint on users.email, the second test will raise an error due to the unique constraint violation. Sometimes the first test will fail and sometimes the second test will fail, depending on which order you run them in.

Another way that a test could leak state is to change a configuration setting. Let’s say that your test environment has background jobs configured not to run for most tests because most background jobs are irrelevant to what’s being tested and would just slow things down. But then let’s imagine that you have one test where you do want background jobs to run, and so at the beginning of that test you change the background job setting from “don’t run” to “run”. If you don’t remember to change the setting back to “don’t run” at the end, background jobs will run for all later tests and potentially cause problematic behavior.
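
Here’s a rough Ruby sketch of that kind of leak, using ActiveJob’s queue adapter as the global setting (the job and model names are invented, and the `report` record is assumed to be set up elsewhere):

    # Leaky version: this test flips a global setting so that background jobs
    # run inline, but never flips it back, so the setting leaks into every
    # test that happens to run after this one.
    it "creates the export immediately" do
      ActiveJob::Base.queue_adapter = :inline
      ExportJob.perform_later(report.id)
      expect(report.exports.count).to eq(1)
    end

    # One way to avoid the leak: restore the original setting no matter what.
    it "creates the export immediately" do
      original_adapter = ActiveJob::Base.queue_adapter
      begin
        ActiveJob::Base.queue_adapter = :inline
        ExportJob.perform_later(report.id)
        expect(report.exports.count).to eq(1)
      ensure
        ActiveJob::Base.queue_adapter = original_adapter
      end
    end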

State can also be leaked by altering environment variables, altering the contents of the filesystem, or any number of other ways.

Network/third-party dependency

The main reason why network dependency can create non-deterministic behavior doesn’t take a lot of explaining: sometimes the network is up and sometimes it’s not.

Moreover, when you’re depending on the network, you’re often depending on some third-party service. Even if the network itself is working just fine, the third-party service could suffer an outage at any time, causing your tests to fail. I’ve also seen cases where a test makes a third-party service call over and over and then gets rate-limited, and from that point on, for a period of time, that test fails.

The way to prevent flaky tests caused by network dependence is to use test doubles in your tests rather than hitting live services.
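
In a Ruby test suite, that might look something like this with the WebMock gem (the endpoint URL, response body and service class are invented for illustration):

    require "webmock/rspec"

    it "records a successful charge" do
      # Instead of hitting the real payment gateway, stub the HTTP request so
      # the test behaves the same whether the network and the third party are
      # up, down or rate-limiting us.
      stub_request(:post, "https://api.example-gateway.com/charges")
        .to_return(status: 200, body: { id: "ch_123", status: "succeeded" }.to_json)

      payment = PaymentService.charge(amount_cents: 5_00)

      expect(payment.status).to eq("succeeded")
    end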

Randomness

Randomness is, by definition, non-deterministic. If you, for example, have a test that generates a random integer between 1 and 2 and then asserts that that number is 1, that test is obviously going to fail about half the time. Random inputs lead to random failures.

One way to get bitten by randomness is to grab the first item in a list of things that’s usually in the same order but not always. That’s why it’s usually better to specify a definite item rather than grab the nth item in a list.
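
For example (a contrived sketch with an invented model and data):

    # Fragile: assumes the first result is always the same user, which depends
    # on an ordering the database doesn't guarantee without an explicit ORDER BY.
    admins = User.where(role: "admin").to_a
    expect(admins.first.email).to eq("alice@example.com")

    # Sturdier: assert on the specific item you actually care about.
    emails = User.where(role: "admin").pluck(:email)
    expect(emails).to include("alice@example.com")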

Fixed time dependency

Once I was working late and I noticed that certain tests started to fail for no apparent reason, even though I hadn’t changed any code.

After some investigation I realized that, due to the way they were written, these tests would always fail when run at a certain time of day. I had just never worked that late before.

This is common with tests that cross the boundary of a day (or month or year). Let’s say you have a test that creates an appointment that occurs four hours from the current time, and then asserts that that appointment is included on today’s list of appointments. That test will pass when it’s run at 8am because the appointment will appear at 12pm which is the same day. But the test will fail when it’s run at 10pm because four hours after 10pm is 2am which is the next day.
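
One common way to make this kind of test deterministic in a Rails project is to pin the clock with ActiveSupport’s time helpers (a sketch; the factory and the Appointment.today scope are invented):

    require "active_support/testing/time_helpers"

    RSpec.configure do |config|
      config.include ActiveSupport::Testing::TimeHelpers
    end

    it "includes the appointment on today's list" do
      # Freeze the test at a time where "four hours from now" is still today,
      # so the test behaves the same at 10pm as it does at 8am.
      travel_to Time.zone.local(2024, 6, 1, 8, 0, 0) do
        appointment = create(:appointment, starts_at: 4.hours.from_now)
        expect(Appointment.today).to include(appointment)
      end
    end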

Takeaways

  • A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.
  • Flaky tests are caused by non-determinism either in the test code or in the application code.
  • Some of the most common causes of flaky tests are race conditions, leaked state, network/third-party dependency, randomness and fixed time dependency.

Keep test code and application code separate

Sometimes you’ll be tempted to add things to your application code that don’t affect the functionality of your application but do make testing a little easier.

The drawback to doing this is that it causes your application code to lose cohesion. Instead of doing just one job—making your application work—your code is now doing two jobs: 1) making your application work and 2) helping to test the application. This mixture of jobs is a straw on the camel’s back that makes the application code just that much harder to understand.

Next time you’re tempted to add something to your application code to make testing more convenient, resist the temptation. Instead add it somewhere in the code that supports your tests. This may take more time and thought initially but the investment will pay itself back in the long run.
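
For example, rather than adding a test-only convenience method to a model, the same helper can live with the test code instead (the names here are made up, and the file path is just a typical RSpec convention):

    # Tempting but worse: a method on the model that exists only for tests.
    #
    #   class Order < ApplicationRecord
    #     def self.create_paid_for_testing!
    #       create!(status: "paid", paid_at: Time.current)
    #     end
    #   end

    # Better: keep the convenience in the test support code,
    # e.g. spec/support/order_helpers.rb.
    module OrderHelpers
      def create_paid_order
        Order.create!(status: "paid", paid_at: Time.current)
      end
    end

    RSpec.configure do |config|
      config.include OrderHelpers
    end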

Modeling legacy code behavior using science

When you want to understand what a legacy program you’re working on is supposed to do, what’s your first instinct? Often it’s to look at the code.

But unfortunately legacy code is often so convoluted and inscrutable that it’s virtually impossible to tell what the code is supposed to do just by looking at it.

In these cases it may seem that you’re out of luck. But fortunately you’re not.

Modeling

Due to what philosophers call the veil of perception, direct knowledge of anything in the world is impossible. There’s very little that we can know with absolute completeness and correctness.

We don’t have knowledge of how most things work; we only have a model of how they work. A model is a useful approximation to the truth. Our models may not be absolutely complete or correct, but they’re useful enough to get the job done on a day-to-day basis.

Take a car, for example. I don’t have a complete understanding of how every part of a car works. And what I do know probably isn’t even 100% accurate. But my model is sufficiently comprehensive and accurate that I can operate my car without being regularly surprised and confounded by differences in the ways I expect my car to behave versus the way it actually behaves.

Let’s use another example, trees. My understanding of trees is fragmentary. But like my model of cars, my model of trees is sufficiently comprehensive and accurate that trees don’t regularly throw me surprises. I understand trees to be stationary, and so when I cross the street, I never worry that I may be struck by a speeding tree. Although I do know that wind can blow over dead trees and so I’m cautious about entering the woods on a windy day.

How models are developed

How did I develop my models of cars and trees?

Trees, unlike cars, are natural. I can’t look at a tree’s source code or schematics to gain an understanding of how trees work. Everything humans know about trees has been discovered by observation and experimentation—in other words, scientific inquiry. We observe that trees start off small and then grow large. We observe that deciduous trees lose their leaves in the winter and coniferous trees don’t. And so on.

We can of course also learn about trees by reading books and learning from teachers, but that’s not source material, it’s only a regurgitation of the products of someone else’s scientific inquiry.

Cars are a little different. Cars are man-made, so in addition to learning about cars through observation and experimentation, we can learn by, for example, reading the owner’s manual that came with the car. And in theory, we could look at engineering diagrams and talk with the people who designed the car in order to gain a direct understanding of how the car works straight from the source material.

Cars are also different from trees in the sense that much of the mechanics of a car are very self-evident. You can learn something about how a car works by taking apart its engine. Not so much with a tree.

Inspecting a car’s transmission is analogous to reading a line of code. The thing you’re looking at is, itself, an instruction. It can’t lie to you. You can be mystified by it and you can misunderstand its purpose, but the car part can’t outright lie to you because the car part is how the car works.

Modeling the mind

Before we connect all this back to legacy code, I want to share one more example of scientific modeling.

The human brain is a complex machine. So is a car, so is a tree, and so is a codebase. But unlike a car or a codebase, we can’t observe the mechanics of the brain directly, at least not yet, not very much. We can’t look at the brain’s source code or insert breakpoints. For the purposes of understanding how it works, a brain is much more like a tree than a car.

But cognitive scientists have still managed to learn a lot about how the brain works. Or, more precisely, cognitive scientists have gained an understanding of the behavior that the brain produces. They’ve gained an understanding of how the mind (the outward expression of the machinery of the brain) works. They’ve developed a model.

This model of the mind has been developed in the same way that any accurate model has been developed: through scientific inquiry. A scientist can’t have a chat with the brain’s designer or have a look at its schematics, but a scientist can compare the behavior of people with normal brains against people who have had certain parts of their brains excised, for example, and make inferences about the roles that those parts of the brain play.

(Scientists can make diagrams of how they believe the brain works, but remember, those diagrams aren’t source material. They’re not code. They’re only a documentation of our current best understandings based on our scientific inquiry.)

So: if scientists can develop a model of the behavior generated by the brain without having access to the source machinery of the brain, what can we learn from scientists about how to understand the behavior of legacy systems without having access to comprehensible code?

Applying scientific inquiry to software systems

If you haven’t already done so, I’d like to invite you to make a conscious distinction between two ways of learning the behavior of a software system. One way is the obvious way that everyone’s familiar with: reading the code. The other way is one that many people have probably used informally quite a bit but may never have consciously put a name to: scientific inquiry.

A full instruction in scientific inquiry is of course outside the scope of this post, and I wouldn’t be qualified to give one anyway. The point of this post is to invite you to consciously realize that you can develop a usefully accurate model of a software system not just by reading its code, but by using the methods of scientific inquiry.
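
As one small, hedged example of what that inquiry can look like: suppose a legacy method computes a fee in some inscrutable way. Instead of deciphering the code, you can feed it inputs, observe its outputs and record those observations as tests, building up a model of its behavior. Everything below is invented for illustration:

    # We don't claim to know what LegacyFeeCalculator is "supposed" to do.
    # We record what it actually does for a range of inputs, which gives us a
    # model we can lean on when we need to change the surrounding code.
    RSpec.describe LegacyFeeCalculator do
      it "charges nothing for zero-amount orders" do
        expect(described_class.fee_for(0)).to eq(0)
      end

      it "charges a flat fee for small orders (observed, not documented)" do
        expect(described_class.fee_for(49_99)).to eq(2_00)
      end

      it "switches to a percentage for larger orders (observed, not documented)" do
        expect(described_class.fee_for(100_00)).to eq(3_00)
      end
    end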

If you’re interested in learning more about the methods of science, I would recommend The Magic of Reality by Richard Dawkins and The Demon-Haunted World by Carl Sagan.

Takeaways

  • Direct knowledge of anything in the world is impossible due to the veil of perception. All we can do is develop models, which are useful approximations to the truth.
  • Models are developed through the process of scientific inquiry.
  • In addition to reading the code, a model of a software system can be developed by using the process of scientific inquiry.

Premature generalization

Most programmers are familiar with the concept of premature optimization and the reasons why it’s bad. As a reminder, the main reason premature optimization is bad is because it’s an effort to solve problems that probably aren’t real. It’s more economical to wait and observe where the performance bottlenecks are than to try to predict where the bottlenecks are going to be.

Perhaps fewer programmers are familiar with the idea of premature generalization, also known as the code smell Speculative Generality. Premature generalization is when you generalize a piece of code beyond its current requirements in anticipation of more general future requirements. In my experience it’s a very common mistake.

Premature generalization is bad for the same exact reason premature optimization is bad: because it’s an effort to solve problems that probably aren’t real.

Making a piece of code more general than it needs to be in anticipation of future needs might seem like a smart planning measure. If you can see that the code you’re writing will probably need to accommodate more use cases in the future, why not just make the code general enough now? That way you only have to write the code once.

When programmers do this they’re making a bet. Sometimes their bet is right and sometimes it’s wrong. In my experience, these sorts of bets are wrong enough of the time that you lose on average. It’s like betting $50 for a 10% chance at winning $100. If you were to do that 10 times, you’d spend $500 and win just once (on average), meaning you’ll have paid $500 to win $100.

It’s more economical to make your code no more general than what’s called for by today’s requirements and accept the risk that you might have to rework the code later to generalize it. This is also a bet but it’s a sounder one. Imagine a lottery system where you can either buy a ticket for $50 for a 10% chance of winning $100, or you can choose not to play and accept a 10% chance of getting fined $30. (I know it’s a weird lottery but bear with me.) If you buy a ticket ten times then on average you lose $400 because you’ve paid $500 to win $100. If ten times in a row you choose not to buy a ticket, then on average you get fined $30. So you’re obviously way better off with a policy of never buying the ticket.
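
Spelled out as a quick expected-value calculation with those same made-up numbers:

    rounds = 10

    # Policy 1: always generalize up front ("buy the ticket").
    ticket_cost   = 50
    win_chance    = 0.10
    win_amount    = 100
    generalize_ev = rounds * (win_chance * win_amount - ticket_cost) # about -400

    # Policy 2: wait, and rework later only if needed ("risk the fine").
    fine_chance = 0.10
    fine_amount = 30
    wait_ev     = rounds * (-fine_chance * fine_amount) # about -30

    puts "Generalize up front: #{generalize_ev}, wait and see: #{wait_ev}"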

Takeaways

  • Premature generalization is when you generalize a piece of code beyond its current requirements in anticipation of more general future requirements.
  • On average, premature generalization doesn’t pay. It’s more economical to write the code in such a way as to only accommodate today’s requirements and then only generalize if and when a genuine need arises.