Category Archives: Programming

Programming principles as memes

In 1976 in The Selfish Gene, Richard Dawkins explained that the unit of selection for evolution is not the group or species or individual but the gene. Genes that are best at causing themselves to get replicated are the ones that do get replicated, and so those genes then become predominant.

An organism’s body is a gene’s “survival machine”. Individual organisms die but a gene can survive indefinitely. A gene “uses” an organism to propagate itself from one generation to the next. Since an organism’s design is determined entirely by its DNA, the genes are in charge. The genes are the masters and the organisms are the slaves. Everything an organism does, it’s been programmed to do by its genes in order to propagate those genes.

In The Selfish Gene Dawkins also introduced the idea of a meme. A meme is kind of like a gene but instead of a piece of DNA it’s a behavior or idea. In The Beginning of Infinity David Deutsch uses the idea of a joke as an example of a meme. If a joke is funny, it will make people who hear it want to tell it again, causing the joke to be proliferated. But the joke may not always be retold word-for-word. Some people might remember the idea of the joke but tell it in their own words. Some people might misremember the joke and accidentally tell it a bit differently. Some people may intentionally change the joke in order to try to improve it. Just as a gene competes with variants of itself, so does a joke.

The versions of the joke that are best at getting themselves replicated are the ones that become predominant. Note, importantly, that this does not necessarily mean that the best version of the joke will become predominant, just the version that’s best at getting itself replicated. “Survival of the fittest” is a myth. The versions of genes and memes that survive are the ones that are best at getting themselves copied, nothing more.

Before internet memes our culture had many others. Quotes are often memes, for example, like “if it ain’t broke don’t fix it” and “practice makes perfect”. The content of a meme can be true or false, helpful or unhelpful. “Practice makes perfect”, for example, is sometimes refuted with the expression “practice makes permanent”, which is closer to the truth. “Practice makes perfect” didn’t become a meme because it’s true, it became a meme because it has the attractive qualities of being snappy, alliterative and memorable, and superficially seeming wise and true.

Programming principles can also come in the form of memes. “Don’t Repeat Yourself” (DRY) is memorable thanks to its slightly amusing pronounceable acronym and the fact that its advice is useful and approximately true, although only approximately. The refuting advice, “Write Everything Twice” (WET) is cleverly formulated as the antonym to DRY, and has been very successful in replicating itself, despite the fact that its advice (allow duplication up to two times, and only de-duplicate when the number of instances reaches three) is, at least in my opinion, completely nonsensical.

Just as genes don’t obey “survival of the fittest”, memes, including programming memes, don’t obey “survival of the truest”. The programming principles that replicate themselves the most are simply the ones that are the best at getting themselves replicated, not necessarily the ones that are the truest or best. Sometimes an idea proliferates simply because it’s easy to learn and easy to teach, or because repeating the idea makes the repeater feel wise and sophisticated.

Having said that, sometimes a meme is good at causing itself to be replicated precisely because it is true. Newton’s laws of physics, for example, successfully proliferated because they were genuinely useful, and they were useful because they were true—at least true enough for many purposes. Then, later, Einstein’s laws of physics became successful memes because they were an improvement upon Newton’s laws of physics, and succeeded in certain areas where Newton’s laws failed. This phenomenon of memes proliferating precisely because they’re true should give us hope.

We all know people who seem to be slaves to some particular sort of fashion. Some people repeat the political ideas they hear on a certain news channel, for example, or get caught up in a succession of get-rich-quick schemes. These people seem not to have minds of their own. Like genes’ survival machines, these people in a way exist only as a vehicle for memes to replicate themselves. But of course, not everyone has to be a slave to memes. Instead of uncritically accepting memetic ideas, we can examine them with a careful combination of open-mindedness and skepticism and demand a good explanation for why the idea in the meme is supposedly true. In this way, perhaps our false programming memes can wither away and gradually be replaced with ever truer ones.

How do we tell what’s a good programming practice?

I’ve been thinking lately about the question of how to decide what’s a good programming practice and what’s not.

One way of looking at this is that it’s all just subjective opinion. Just do whatever works for you. You might call this the “anything goes” principle.

The problem with the Anything Goes Principle is that it doesn’t solve any meaningful problems. At best it works as a way to sometimes get strangers to stop arguing on the internet. It certainly doesn’t solve a debate between two team members in an organization. The Anything Goes Principle is simply a way to agree to disagree, not a way to resolve a disagreement.

Another way to approach the question is what you might call the “consensus principle” or the “best practice principle”. This principle says that whatever the prevailing best practice in the industry is, that’s the right answer. But this method is easily debunked. There have been many instances in history where the consensus view in a certain field was just plain wrong. As the saying goes, “What’s popular is not always right”. The truth can’t be discovered via popularity contest.

Here’s what I think is the real key to recognizing a good programming practice: an explanation which stands up to rigorous criticism and which is sufficiently specific that it couldn’t be used to defend any other principle.

Let’s work with an example. It’s a very uncontroversial principle in programming that, in general, clear and accurate variable names, even if they may be long, are better than short, cryptic variable names. This principle has a strong explanation behind it. Clear variable names are, by definition, self-explanatory. When a name is clear, the reader can learn its meaning directly from the name itself rather than having to infer the meaning by studying the surrounding context. In terms of time and cognitive strain, it’s cheaper to learn the meaning of a variable by simply reading the name than by studying the surrounding context. This explanation is, I think, very hard to refute.
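
As a quick illustration (these names are made up, not taken from any particular codebase):

# Cryptic: the reader has to infer what "q", "p" and "d" mean
# from the surrounding context.
def tot(q, p, d)
  q * p - d
end

# Clear: the names carry their own meaning, even though they're longer.
def order_total(quantity, unit_price, discount)
  quantity * unit_price - discount
end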

This principle even has an exception that proves the rule. There are times when a one-letter variable name is actually better than a long-and-clear name. The explanation for this is that when the scope of a variable is very small and the original name of the variable is not particularly short, the repetition of the long variable name can actually present enough noise in the code that the long-variable version is harder to understand than the short-variable version.
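
For example (a made-up sketch), when a variable lives only inside a one-line block, the long name mostly adds noise:

monthly_revenue_totals = [100, 200, 300]

# The long name appears three times on one line and mostly adds noise.
monthly_revenue_totals.map { |monthly_revenue_total| monthly_revenue_total * 1.1 }

# In a scope this small, a one-letter name is arguably easier to read.
monthly_revenue_totals.map { |t| t * 1.1 }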

And by the way I want to note that the argument isn’t that the proponent of this rule personally finds certain code easier to understand. The argument is that a typical programmer would find the code easier to understand, since “a typical programmer” is the likely audience for the code, not specifically the proponent of the rule. So the rule could not be refuted by the argument of “well I personally think short variable names are easy enough to understand”. The argument would have to be “I think short variable names are easy enough for the typical programmer to understand”. It’s the difference between a subjective argument and an objective one.

Here’s an example of a poor explanation. A certain article about service objects claims that “Services are easy and fast to test since they are small Ruby objects that have been separated from their environment.” This explanation is weak partly because it could equally be applied to many other practices. The quality of being small and “separated from their environment” (in other words, loosely coupled from dependencies) is not unique to service objects. Furthermore, the quality of being small isn’t inherent to what a service object supposedly is. I’ve seen plenty of gargantuan service objects. And for that matter, loose coupling isn’t an inherent property of service objects either, and I’ve also seen plenty of service objects with tight coupling to their dependencies. My personal distaste for service objects aside, these explanations for the merits of service objects are objectively bad.

If a certain programming practice can address all the criticism it’s subjected to with specific and objective explanations, then it’s probably a good one. If not, it’s probably a bad one. And while everyone is of course entitled to their own subjective opinion, much of the quality of various programming practices is actually a matter of objectivity. At least that’s what I think.

Programming is technical writing

I enjoy reading nonfiction. Not just programming but also biology, physics, philosophy, mechanical engineering, robotics and other technical topics.

The challenge in technical writing is to convey difficult ideas in ways that are easy to understand. The best technical writing doesn’t make the reader think, “Wow, the writer must be a genius to be able to understand this stuff,” it makes them think, “Wow, I must be a genius to be able to understand this stuff!”

Very few programmers make a serious study of technical writing. I myself, 20+ years into my programming career, have only just begun to do so. But I think improving one’s technical writing ability is a highly profitable undertaking for any programmer, even if you never intend to do any technical writing. In a very real sense, programming IS technical writing.

Donald Knuth has said, “Programs are meant to be read by humans and only incidentally for computers to execute.” When we write code, we’re (hopefully) aiming not just for it to fulfill its functional requirements but also for it to be understandable by human readers. In other words, we’re trying to convey difficult ideas in ways that are easy to understand. That’s what I mean when I say that programming is technical writing.

How does one get better at technical writing? One thing that helps is to do a lot of technical reading. In programming, my favorite technical writers are Martin Fowler and Steve McConnell. Outside of that, my favorites include Richard Dawkins (the master in my book), Carl Sagan (a close second), Steven Pinker, David Deutsch, Sean Carroll and Richard Feynman. If you’d like a specific starting point, I’d recommend The Blind Watchmaker by Richard Dawkins.

One last comment, an important one. Good writing is downstream of good thinking. None of the authors I listed above are writers by profession. The programming authors are programmers and the science authors are scientists. These writers are good writers because they’re excellent practitioners, with good ideas worth sharing. It also so happens that these practitioners have learned how to skillfully put their ideas into words. If Richard Dawkins were a clumsy writer, his books would probably still be worth reading on the merit of the ideas they contain. But if Dawkins were an excellent writer but a poor scientist, his books would be junk. Similarly, a programmer must be skilled not only in coding style but in designing and maintaining a software system. Good code is downstream of good thinking.

Sharp blades and dull blades

We all know the metaphor of technical debt. I want to share some weaknesses of the technical debt metaphor, plus a different metaphor which I think is better.

In real life, debt is not inherently bad. If you have debt, all it means in most cases is that you took out a loan. Loans are a good thing. Loans are what make it possible to buy things like cars and houses that almost nobody has the money for up front. Business loans can allow businesses to make investments to grow faster, pay back the loan, and end up with a net benefit that wouldn’t have been possible without the loan.

Debt in real life is also avoidable. You don’t have to get a car loan or mortgage if you don’t want to. You don’t have to get a credit card. You could theoretically pay for everything with cash your whole life.

In software, what we call “technical debt” is not avoidable. It’s going to happen no matter what.

Managers sometimes talk about “strategic technical debt”, where debt is consciously taken on in order to meet a short-term timeline. The thing is, there’s never a plan to pay the debt back. Indeed it rarely does get paid back. Usually the next move is to take on yet more technical debt. It’s like a payday loan cycle, where each payday loan is used to pay the interest from the last payday loan, with less and less of each paycheck ultimately going into your own pocket. Yet the “strategic technical debt” myth persists, partly because the metaphor is so easy to grasp (and miscomprehend).

I prefer to think of a software system as a collection of blades.

A dull blade is never preferable to a sharp one. Sometimes it may be better to just cut with the dull blade instead of taking the time to sharpen the blade before cutting, but it’s clear that that’s a compromise. No one would talk about “strategic dull blades” the way they talk about strategic technical debt.

Unlike debt which you can simply choose not to take on, blades unavoidably get dull with use. Having technical debt makes it sound like you did something wrong. But you don’t have to do anything wrong in order to end up with bad code. Often, bad code is simply the result of entropy. The blade metaphor makes it clear that dullness is normal and unavoidable.

The blade metaphor also has built-in suggestions about how to address the problem. When is it time to sharpen a dull blade? Right before you use it (or right after). It obviously wouldn’t be a very smart investment to just go randomly sharpening saws in the shed. When you pick up a saw and notice that it’s hard to cut with, that can tell you that it would be worth the investment to sharpen it at least a little before trying to cut more with it. If there’s a blade that never gets used, you can afford not to sharpen it, since a dull blade that’s not being used is not hurting anything.

Lastly, the blade metaphor makes clear the benefit of being in a good state. What exactly is the benefit of not having technical debt? That you don’t have to pay technical interest? The benefit is not so clear. The benefit of having sharp blades is obvious: you can cut faster. Unlike being debt-free, which is simply the absence of a bad thing, having sharp blades means you’re in possession of something good.

As an industry I wish we would do away with the technical debt metaphor and adopt the blade metaphor instead. I think this metaphor would help managers understand the nature of the issue more accurately. Instead of pressuring us to take on “strategic technical debt”, maybe they’ll see the virtues of helping us keep our blades sharp.

Testing anti-pattern: merged setup data

In a single test file, there’s often overlap among the setup data needed for the tests in the file. For whatever reason, perhaps in an effort to improve performance or avoid duplication, test writers often merge the setup code and bring it to the top of the file so it’s available to all test cases.

Let’s take a look at a test that contains apparently duplicative setup data. This test has two test cases, each of which needs one build and at least one job.

RSpec.describe Build, type: :model do
  describe "#start!" do
    let!(:job) { create(:job) }
    let!(:build) { job.build }
  ...

  describe "#status" do
    let!(:build) { create(:build) }
    let!(:job_1) { create(:job, build: build, order_index: 1) }
    let!(:job_2) { create(:job, build: build, order_index: 2) }
  ...
...

There’s obviously some duplication among this setup data. We’re creating two builds and three jobs, but we only really need a total of one build and two jobs. If we wanted to, we could be more “economical” with our setup data by combining it and placing it at the top of the file so that it can be used by all tests, like so:

RSpec.describe Build, type: :model do
  let!(:build) { create(:build) }
  let!(:job_1) { create(:job, build: build, order_index: 1) }
  let!(:job_2) { create(:job, build: build, order_index: 2) }

  describe "#start!" do
  ...

  describe "#status" do
  ...
...

Now our setup code is superficially a little less “wasteful” but we’ve created a couple subtle problems that make our test harder to understand and change.

Misleading details

When we create our jobs, we give each of them an order_index. This matters for the #status test but is totally immaterial to the #start! test. As the author of this test, I happen to know which details matter for which tests, but someone reading this test for the first time would have no easy way of knowing when an order_index is needed and when it’s not.

The only safe assumption is that every detail is needed for every test. If we alter the global setup data somehow, it’s possible that we’ll cause a silent defect. We could cause a test to keep passing but to lose its validity and start showing us a false positive. When data is included in the global setup beyond what’s needed for every test in the file, it creates an unnecessary risk that makes the test harder to change than it needs to be.

Shoehorned data

I’ll now reveal a bit more of the describe "#start!" test case.

describe "#start!" do
  let!(:job) { create(:job) }
  let!(:build) { job.build }

  before do
    fake_job_machine_request = double("JobMachineRequest")

    # The following line, which references "job",
    # is the line to pay attention to
    allow(job).to receive(:job_machine_request).and_return(fake_job_machine_request)
  end
end

In the above code, which shows the original version of the #start! test before we merged all the setup data, the setup and usage of job were straightforward. We created a job and then we used it.

But now that we’ve merged all our setup code, we only have job_1 and job_2 available to us and no plain old job anymore. This makes things awkward for the #start! test, where we only need one job to work with. Here are two possible options, both undesirable.

before do
  fake_job_machine_request = double("JobMachineRequest")

  # Option 1: referring to job_1 misleadingly implies that the
  # fact that it's job 1 and not job 2 is significant,
  # which it's not, and also that we might do
  # something with job 2, which we won't
  allow(job_1).to receive(:job_machine_request).and_return(fake_job_machine_request)

  # Option 2: assigning job_1 to a local variable called job
  # is almost better but not really
  job = job_1
  allow(job).to receive(:job_machine_request).and_return(fake_job_machine_request)
end

Another way we can deal with this issue is simply to add a third job to our setup called job:

RSpec.describe Build, type: :model do
  let!(:build) { create(:build) }
  let!(:job) { create(:job) }
  let!(:job_1) { create(:job, build: build, order_index: 1) }
  let!(:job_2) { create(:job, build: build, order_index: 2) }

  describe "#start!" do
  ...

  describe "#status" do
  ...
...

This takes away our problem of having to shoehorn job_1 into standing in for job, and it also takes away our misleading details problem, but it creates a new problem.

It’s unclear what setup data belongs to which test(s)

The other problem with combining the setup data at the top of the test is that it’s not clear which values are needed for which test. This creates a symptom very similar to the one created by the misleading details problem: we can’t easily know which kinds of changes to the setup data are safe and which are risky.

It’s unclear what the state of the system is and how it will affect any particular test

I once worked on a project where a huge setup script ran before the test suite. The test environment would get some arbitrary number of users, superusers, accounts, customers and almost every other kind of entity imaginable.

The presence of all this data made it extremely hard to understand what the state of the system was and how any existing data might interfere with any particular test. Although it’s sometimes worth it to make compromises, it’s generally much easier if every test starts with a fully clean slate.

What’s the solution?

Instead of merging all of a test file’s setup data at the top, it’s better to give each test only the absolute minimum data that it needs. This usually means giving each test case its own individual setup data. Often it even means creating duplication, although duplication in tests is usually not real duplication. The performance cost is also usually negligible or nonexistent: even if global setup data only appears once in the file, it still gets run before every single test anyway, so nothing is saved by combining it. Any cost in duplication or performance is usually overwhelmingly offset by the benefit of not merging setup data: it makes your tests easier to understand and work with.
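
Applied to the example from earlier, that just means going back to the original version, where each describe block keeps its own setup (test bodies still elided):

RSpec.describe Build, type: :model do
  describe "#start!" do
    # Only what #start! needs: one job and its build
    let!(:job) { create(:job) }
    let!(:build) { job.build }
  ...

  describe "#status" do
    # Only what #status needs: one build with two ordered jobs
    let!(:build) { create(:build) }
    let!(:job_1) { create(:job, build: build, order_index: 1) }
    let!(:job_2) { create(:job, build: build, order_index: 2) }
  ...
...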

Using ChatGPT to reduce “study and synthesize” work

When ChatGPT first came out, the first programming use case I thought of was writing code. I thought of it like “GitHub Copilot on steroids”. I imagine that was a lot of other people’s first thought too. But gradually I realized that having ChatGPT write production code is actually not a very good idea.

When ChatGPT gives you a big chunk of code, how can you be sure that the code does exactly what you think it does? How can you be sure that it’s not missing something, and that it doesn’t contain anything extra, and that it doesn’t have bugs?

The answer of course is that you have to test it. But retroactively testing existing code is usually tedious and annoying. You basically have to replay the process of writing the code so that you can test each piece of code individually.

A programming workflow that involves using ChatGPT to write big chunks of code seems dangerous at worst and terribly inefficient at best.

If ChatGPT isn’t great for writing production code, what’s it good for?

Using ChatGPT to reduce mental labor

One part of our jobs as programmers is to learn general principles and then apply parts of those principles to a certain peculiar need.

I’ll give a somewhat silly example to illustrate the point starkly. Let’s say I want to integrate Authorize.net into a Lisp program and that I’ve never used either technology before. Without checking, one could pretty safely assume there are no tutorials in existence on how to integrate Authorize.net into a Lisp app.

In order to complete my integration project I’ll need to learn something about a) Authorize.net in general, b) Lisp in general, and c) integrating Authorize.net with a Lisp app specifically. Then I’ll need to synthesize a solution based on what I’ve learned.

This whole process can be wasteful, time-consuming, and at times, quite boring. In the beginning I might know that I need to get familiar with Authorize.net, but I’m not sure yet which parts of Authorize.net I need to be familiar with. So I’ll read a whole bunch about Authorize.net, but I won’t know until the end of the project which areas of my study were actually needed and which were just a waste of time.

And what’s even worse is the cases where the topics you’re studying are of no permanent benefit to your skills as a programmer. In the case of Authorize.net I might not expect to ever use it again. (At least I hope not!) This kind of learning is just intellectual ditch-digging. It’s pure toil with little or no lasting benefit.

This kind of work, where you first study some generalities and then synthesize a specific solution from those generalities, is what I call “study and synthesize” work.

Thanks to ChatGPT, most “study and synthesize” work is a thing of the past. If I tell ChatGPT “Give me a complete tutorial on how to integrate Authorize.net with a Lisp program”, it will. The tutorial may not be correct down to every last detail but that’s not the point. Just having a high-level plan spelled out saves a lot of mental labor. And then if I need to zoom in on certain details which the tutorial either got wrong or omitted, ChatGPT will quite often correct its mistakes when pressed.

Using ChatGPT to write production code may seem like a natural and logical use for the tool, but it’s actually not a very good one. You’ll get a lot more leverage out of ChatGPT if you use it for “study and synthesize” work.

In defense of productivity

Anti-productivity sentiment

In my career I’ve noticed that a lot of developers have a distaste for the idea of “productivity”. They view it as some sort of toxic, unhealthy obsession. (It always has to be an “obsession”, by the way. One can never just have an interest in productivity.)

Productivity is often associated with working harder and longer, sacrificing oneself for a soulless corporation.

In a lot of ways I actually agree with these people. Having an unhealthy obsession with anything is obviously unhealthy, by definition. And I think working long and hard for its own sake, for no meaningful reward, is a waste of precious time.

But I think sometimes these anti-productivity people are so blinded by their natural aversion to “productivity culture” that they miss out on some good and worthwhile ideas, ideas they would actually like if they opened their minds to them.

“Productivity” is a pretty ambiguous word. It could have a lot of different interpretations. I’d like to share my personal interpretation of productivity which I happen to quite like. Maybe you’d like to adopt it for yourself.

My version of productivity

For me, productivity isn’t about obsessively tracking every minute of the day or working so hard you burn yourself out.

The central idea of productivity for me is decreasing the ratio of effort to value. This could mean working less to create the same value or it could mean working the same to create more value. Or anywhere in between. Each person can decide for themselves where they’d like to set the dial.

I value a calm, healthy mind and body. People obviously do better work when they’re relaxed and even-keeled than when they’re harried and stressed.

Productivity for me is about realizing that our time on this planet is limited and precious, and that we shouldn’t be needlessly wasteful with our time but rather protect it and spend it thoughtfully.

Why duplication is more acceptable in tests

It’s often taught in programming that duplication is to be avoided. But for some reason it’s often stated that duplication is more acceptable in test code than in application code. Why is this?

We’ll explore this, but first, let’s examine the wrong answers.

Incorrect reasons why duplication is more acceptable in tests

“Duplication isn’t actually that bad.”

Many programmers hold the opinion that duplication isn’t something that should be avoided fastidiously. Instead, a certain amount of duplication should be tolerated, and when the duplication gets to be too painful, then it should be addressed. The “rule of three” for example says to tolerate code that’s duplicated twice, but clean it up once the duplication reaches three instances.

This way of thinking is overly simplistic and misses the point. The cost of duplication doesn’t depend on whether the duplication appears twice or three times but rather on factors like how easy the duplication is to notice, how costly it is to keep the duplicated instances synchronized, and how much “traffic” the duplicated areas receive. (See this post for more details on the nature of duplication and its costs.)

The heuristic for whether to tolerate duplication shouldn’t be “tolerate some but don’t tolerate too much”. Rather the cost of a piece of duplication should be assessed based on the factors above and weighed against any benefits that piece of duplication has. If the costs aren’t justified by the benefits, then the duplication should be cleaned up.

“Duplication in test code can be clearer than the DRY version”

It’s true that duplication in test code can be clearer than the DRY version. But duplication in application code can be clearer than the DRY version too. So if duplicating code can make it clearer, why not prefer duplication in application code to the same exact degree as in test code?

This answer doesn’t actually answer the question. The question is about the difference between duplication in test code and application code.

The real reason why duplication is more acceptable in test code

In order to understand why duplication is more acceptable in test code than application code, it helps to get very clear on what exactly duplication is and why it incurs a cost.

What duplication is and why it costs

Duplication doesn’t mean two identical pieces of code. Duplication is two or more copies of the same behavior. It’s possible to have two identical pieces of code that represent different pieces of behavior. It’s also possible to have the same behavior expressed more than once but in different code.
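
Here’s a contrived sketch of that first case: two methods whose code happens to be identical but which represent different behaviors. If the tax rate changed, the service fee wouldn’t necessarily change with it, so “de-duplicating” these would be a mistake.

def sales_tax(amount)
  amount * 0.05
end

# Identical code, but a coincidence rather than the same behavior.
def service_fee(amount)
  amount * 0.05
end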

Let’s also review why duplication incurs a cost. The main reason is that it leaves the program susceptible to logical inconsistencies. If one copy of a behavior gets changed but the other copies don’t, then the other copies are now wrong and there’s a bug present. The other reason duplication incurs a cost is that it creates a maintenance burden. Updating something in multiple places is obviously more costly than updating it in just one place.

The difference between test code and application code

The difference between test code and application code is that test code doesn’t contain behaviors. All the behaviors are in the application code. The purpose of the test code is to specify the behaviors of the application code.

What in the codebase determines whether the application code is correct? The tests. If the application code passes its tests (i.e. its specifications), then the application code is correct (for a certain definition of “correct”). What in the code determines whether the tests (specifications) are correct? Nothing! The program’s specifications come entirely from outside the program.

Tests are always correct

This means that whatever the tests specify is, by definition, correct. If we have two tests containing the same code and one of the tests changes, it does not always logically follow that the other test needs to be updated to match. This is different from duplicated application code. If a piece of behavior is duplicated in two places in the application code and one piece of behavior gets changed, it does always logically follow that the other piece of behavior needs to get updated to match. (Otherwise it wouldn’t be an instance of duplication.)
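
As a contrived sketch (the User model, factory and scopes are made up): both of these tests contain the same setup line, but if the specification behind the first test changes, the second test doesn’t have to change just because the code it happens to share with the first one changed.

it "includes active users in the active list" do
  user = create(:user, status: "active")
  expect(User.active).to include(user)
end

it "excludes active users from the archive" do
  user = create(:user, status: "active")
  expect(User.archived).not_to include(user)
end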

This is the reason why duplication is more acceptable in test code than in application code.

Takeaways

  • Duplication is when one behavior is specified multiple times.
  • Duplication in application code is costly because, among other reasons, multiple copies of the same behavior are subject to diverging, thus creating a bug.
  • Since test code is a human-determined specification, it’s by definition always correct. If one instance of a duplicated piece of code changes, it’s not a logical necessity that the other piece needs to change with it.

Why tests flake more on CI than locally

A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.

The root cause of flaky tests is some sort of non-determinism, either in the test code or in the application code.

In order to understand why a CI test run is more susceptible to flakiness than a local test run, we can go through all the root causes for flakiness one-by-one and consider how a CI test run has a different susceptibility to that specific flaky test cause than a local test run.

The root causes we’ll examine (which are all explained in detail in this post) are leaked state, race conditions, network/third-party dependency, fixed time dependency and randomness.

Leaked state

Sometimes one test leaks some sort of state (e.g. a change to a file or env var) into the global environment which interferes with later tests.

The reason a CI test run is more susceptible to leaked state flakiness is clear. Unlike a local environment where you’re usually just running one test file at a time, in CI you’re running a whole bunch of tests together. This creates more opportunities for tests to interfere with each other.
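
Here’s a contrived sketch of leaked state (the PaymentClient class and environment variable are made up):

it "uses the sandbox endpoint in sandbox mode" do
  ENV["PAYMENT_MODE"] = "sandbox" # leaked: never reset after the test
  expect(PaymentClient.new.endpoint).to include("sandbox")
end

# Elsewhere, possibly in a different file:
it "uses the live endpoint by default" do
  # Passes when run on its own, but fails when the test above has
  # already run in the same process and leaked PAYMENT_MODE.
  expect(PaymentClient.new.endpoint).to include("live")
end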

Race conditions

A race condition is when the correct functioning of a program depends on two or more parallel actions completing in a certain sequence, but the actions sometimes complete in a different sequence, resulting in incorrect behavior.

One way that race conditions can arise is through performance differences. Let’s say there’s a process that times out after 5000ms. Most of the time the process completes in 4500ms, meaning no timeout. But sometimes it takes 5500ms to complete, meaning the process does time out.
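
A sketch of how that might look in a test (ReportGenerator and its five-second timeout are made up for the sake of the example):

it "generates the report without timing out" do
  # ReportGenerator raises Timeout::Error if generation takes longer
  # than 5000ms. Locally it usually finishes in ~4500ms; on a slower
  # CI machine it sometimes takes ~5500ms, and the test fails.
  expect { ReportGenerator.new.generate! }.not_to raise_error
end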

It’s very easy for differences to arise between a CI environment and a local environment in ways that affect performance. The OS is different, the memory and processor speed are different, and so on. These differences can mean that race conditions arise on CI that would not have arisen in a local environment.

Network/third-party dependency

Network dependency can lead to flaky tests for the simple reason that sometimes the network works and sometimes it doesn’t. Third-party dependency can lead to flaky tests because sometimes third-party services don’t behave deterministically. For example, the service can have an outage, or the service can rate-limit you.

This is the type of flakiness that should never occur because it’s not a good idea to hit the network in tests. Nonetheless, I have seen this type of flakiness occur in test suites where the developers didn’t know any better.

Part of the reason why CI test runs are more susceptible to this type of flakiness is that there are simply more at-bats. If a test makes a third-party request only once per day locally but 1,000 times per day on CI, there are of course more chances for the CI request to encounter a problem.

Fixed time dependency

There are some tests that always pass at one time of day (or month or year) and always fail at another.

Here’s an excerpt about this from my other post about the causes of flaky tests:

This is common with tests that cross the boundary of a day (or month or year). Let’s say you have a test that creates an appointment that occurs four hours from the current time, and then asserts that that appointment is included on today’s list of appointments. That test will pass when it’s run at 8am because the appointment will appear at 12pm which is the same day. But the test will fail when it’s run at 10pm because four hours after 10pm is 2am which is the next day.
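
A minimal sketch of the kind of test that excerpt describes (the Appointment model, factory and today scope are assumptions):

it "includes the appointment in today's list" do
  # Four hours from now is still "today" at 8am, but it's tomorrow
  # at 10pm, so this test fails when run late in the day.
  appointment = create(:appointment, starts_at: Time.current + 4.hours)
  expect(Appointment.today).to include(appointment)
end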

CI test runs are more susceptible to fixed-time-dependency flakiness than local test runs for a few reasons. One is the fact that CI test runs simply have more at-bats than local test runs. Another is that the CI environment’s time zone settings might be different from the local test environment. A third reason is that unlike a local test environment which is normally only used inside of typical working hours, a CI environment is often utilized for a broader stretch of time each day due to developers kicking off test runs from different time zones and from developers’ varying schedule habits.

Randomness

The final cause of flaky tests is randomness. As far as I know, the only way that CI test runs are more susceptible to flakiness due to randomness is the fact that CI test runs have more at-bats than local test runs.
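
For example (a made-up sketch), randomly generated test data can occasionally violate a validation:

it "creates a valid user" do
  # Faker occasionally generates a name longer than the model's
  # (hypothetical) 20-character limit, so this passes on most runs
  # and fails on the unlucky ones.
  user = User.new(name: Faker::Name.name)
  expect(user).to be_valid
end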

Takeaways

  • A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.
  • The root cause of flaky tests is some sort of non-determinism, either in the test code or in the application code.
  • Whenever flakiness is more frequent in CI, it’s because some difference between the CI test runs and the local runs has made one of the five specific causes of flaky tests more likely.

How I fix flaky tests

What a flaky test is and why they’re hard to fix

A flaky test is a test that passes sometimes and fails sometimes even though no code has changed.

There are several causes of flaky tests. The commonality among all the causes is that they all involve some form of non-determinism: code that doesn’t always behave the same on every run even though neither the inputs nor the code itself has changed.

Flaky tests are known to present themselves more in a continuous integration (CI) environment than in a local test environment. The reason for this is that certain characteristics of CI test runs make the tests more susceptible to non-determinism.

Because the flakiness usually can’t be reproduced locally, the buggy behavior of a flaky test is harder to reproduce and diagnose than that of most other bugs.

In addition to the fact that flaky tests often only flake on CI, the fact that flaky tests don’t fail consistently adds to the difficulty of fixing them.

Despite these difficulties, I’ve developed some tactics and strategies for fixing flaky tests that consistently lead to success. In this post I’ll give a detailed account of how I fix flaky tests.

The overall approach

When I’m fixing any bug I divide the bugfix into three stages: reproduction, diagnosis and fix.

I consider a flaky test a type of bug. Therefore, when I try to fix a flaky test, I follow the same three-step process as I would when fixing any other type of bug. In what follows I’ll cover how I approach each of these three steps of reproduction, diagnosis and fix.

Before reproducing: determine whether it’s really a flaky test

Not everything that appears to be a flaky test is actually a flaky test. Sometimes a test that appears to be flaking is just a healthy test that’s legitimately failing.

So when I see a test that’s supposedly flaky, I like to try to find multiple instances of that test flaking before I accept its flakiness as a fact. And even then, there’s no law that says that a test that previously flaked can’t fail legitimately at some point in the future. So the first step is to make sure that the problem I’m solving really is the problem I think I’m solving.

Reproducing a flaky test

If I can’t reproduce a bug, I can’t test for its presence or absence. If I can’t test for a bug’s presence or absence, I can’t know whether a fix attempt actually fixed the bug or not. For this reason, before I attempt to fix any bug, I always devise a test that will tell me whether the bug is present or absent.

My go-to method for reproducing a flaky test is simply to re-run the test suite multiple times on my CI service until I see the flaky test fail. Actually, I like to run the test suite a great number of times to get a feel for how frequently the flaky test fails. The actions I take during the bugfix process may be different depending on how frequently the test fails, as we’ll see later on.

Sometimes a flaky test fails so infrequently that it’s practically impossible to get the test to fail on demand. When this happens, it’s impossible to tell whether the test is passing due to random chance or because the flakiness has legitimately gone away. The way I handle these cases is to deprioritize the fix attempt and wait for the test to fail again in the natural course of business. That way I can be sure that I’m not wasting my time trying to fix a problem that’s not really there.

That covers the reproduction step of the process. Now let’s turn to diagnosis.

Diagnosing a flaky test

What follows is a list of tactics that can be used to help diagnose flaky tests. The list is kind of linear and kind of not. When I’m working on flaky tests I’ll often jump from tactic to tactic depending on what the scenario calls for rather than rigidly following the tactics in a certain order.

Get familiar with the root causes of flaky tests

If you were a doctor and you needed to diagnose a patient, it would obviously be helpful for you first to be familiar with a repertoire of diseases and their characteristic symptoms so you can recognize diseases when you see them.

Same with flaky tests. If you know the common causes for flaky tests and how to recognize them, you’ll have an easier time with trying to diagnose flaky tests.

In a separate post I show the root causes of flaky tests, which are race conditions, leaked state, network/third-party dependency, fixed time dependency and randomness. I suggest either committing these root causes to memory or reviewing them each time you embark on a flaky test diagnosis project.

Have a completely open mind

One of the biggest dangers in diagnosing flaky tests or in diagnosing any kind of problem is the danger of coming to believe something that’s not true.

Therefore, when starting to investigate a flaky test, I try to be completely agnostic as to what the root cause might be. It’s better to be clueless and right than to be certain and wrong.

Look at the failure messages

The first thing I do when I become aware of a flaky test is to look at the error message. The error message doesn’t always reveal anything terribly helpful but I of course have to start somewhere. It’s worth checking the failure message because sometimes it contains a helpful clue.

It’s important not to be deceived by error messages. Error messages are an indication of a symptom of a root cause, and the symptom of a root cause often has little or nothing to do with the root cause itself. Be careful not to fall into the trap of “the error message says something about X, therefore the root cause has something to do with X”. That’s very often not true.

Look at the test code

After looking at the failure message, I open the flaky test in an editor and look at its code. At first I’m not looking for anything specific. I’m just getting the lay of the land. How big is this test? How easy is it to understand? Does it have a lot of setup data or not much?

I do all this to load the problem area into my head. The more familiar I am with the problem area, the more I can “read from RAM” (use my brain’s short-term memory) as I continue to work on the problem instead of “read from disk” (look at the code). This way I can solve the problem more efficiently.

Once I’ve surveyed the test in this way, I zero in on the line that’s yielding the failure message. Is there anything interesting that jumps out? If so, I pause and consider and potentially investigate.

The next step I take with the test code is to go through the list of causes of flaky tests and look for instances of those.

After I’ve done all that, I study the test code to try to understand, in a meaningful big-picture way, what the test is all about. Obviously I’m going to be more likely to be successful in fixing problems with the test if I actually understand what the test is all about than if I don’t. (Sometimes this involves rewriting part or all of the test.)

Finally, I often go back to the beginning and repeat these steps an additional time, since each run through these steps can arm me with more knowledge that I can use on the next run through.

Look at the application code

The root cause of every flaky test is some sort of non-determinism. Sometimes the non-determinism comes from the test. Sometimes the non-determinism comes from the application code. If I’m not able to find the cause of the flakiness in the test code, I turn my attention to the application code.

Just like with the test code, the first thing I do is to just scan the relevant application code to get a feel for what it’s all about.

The next thing I do is to go through the code more carefully and look for causes of flakiness. (Again, you can refer to this blog post for that list.)

Then, just like with the test code, I try to understand the application code in a big-picture sort of way.

Make the test as understandable as possible

Many times when I look at a flaky test, the test code is too confusing to try to troubleshoot. When this is the case, I try to improve the test to the point that I can easily understand it. Easily understandable code is obviously easier to troubleshoot than confusing code.

To my surprise, I’ve often found that, after I improve the structure of the test, the flakiness goes away.

Side note: whenever I modify a test to make it easier to understand, I perform my modifications in one or more small, atomic pieces of work. I do this because I want to keep my refactorings and my fix attempts separate.

Make the application code as easy to understand as possible

If the application code is confusing then it’s obviously going to hurt my ability to understand and fix the flaky test. So, sometimes, I refactor the application code to make it easier to understand.

Make the test environment as understandable as possible

The quality of the test environment has a big bearing on how easy the test suite is to work with. By test environment I mean the tools (RSpec/Minitest, Factory Bot, Faker, etc.), the configurations for the tools, any seed data, the continuous integration service along with its configuration, any files shared among all the tests, and things like that.

The harder the test environment is to understand, the harder it will be to diagnose flaky tests. Not every flaky test fix job prompts me to work on the test environment, but it’s one of the things I look at when I’m having a tough time or I’m out of other ideas.

Check the tests that ran just before the flaky test

Just because a certain test flakes doesn’t necessarily mean that that test itself is flaky—even if the same test flakes consistently.

Sometimes, due to leaked state, test A will create a problem and then test B will fail. (A complete description of leaked state can be found in this post.) The symptom is showing up in test B so it looks like test B has a problem. But there’s nothing at all wrong with test B. The real problem is test A. So the problematic test passes but the innocent test flakes. It’s very deceiving!

Therefore, when I’m trying to diagnose a flaky test, I’ll check the continuous integration service to see what test ran before that test failed. Sometimes this leads me to discover that the test that ran before the flaky one is leaking state and needs to be fixed.

Add diagnostic info to the test

Sometimes, the flaky test’s failure message doesn’t show much useful information. In these cases I might add some diagnostic info to the test (or the relevant application code) in the form of print statements or exceptions.
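
This can be as simple as a few puts statements whose output will show up in the CI log the next time the test flakes. A made-up example:

it "marks the build as passed" do
  # Temporary diagnostics: dump the state we suspect is relevant so
  # that the next CI failure tells us more than the assertion alone.
  puts "job statuses: #{build.jobs.map(&:status).inspect}"
  puts "current time: #{Time.current}"

  expect(build.status).to eq("passed")
end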

Perform a binary search

Binary search debugging is a tactic that I use to diagnose bugs quickly. There are two main ideas behind it: 1) it’s easier to find where a bug is than what a bug is, and 2) binary search can be used to quickly find the location of a bug.

I make heavy use of binary search debugging when diagnosing flaky tests. See this blog post for a complete description of how to use binary search debugging.

Repeat all the above steps

If I go through all the above steps and I don’t have any more ideas, I might simply go through the list an additional time. Now that I’m more familiar with the test and everything surrounding it, I might have an enhanced ability to learn new things about the situation when I take an additional pass, and I might have new ideas or realizations that I didn’t have before.

Now let’s talk about the third step in fixing a flaky test, applying the fix itself.

Applying the fix for a flaky test

How to be sure your bugfixes work

A mistake many developers make when fixing bugs is that they don’t figure out a way to know if their bugfix actually worked or not. The result is that they often have to “fix” the bug multiple times before it really gets fixed. And of course, the false fixes create waste and confusion. That’s obviously not good.

The way to ensure that you get the fix right the first time is to devise a test (it can be manual or automated) that a) fails when the bug is present and b) passes when the bug is absent. (See this post for more details on how to apply a bugfix.)

How the nature of flaky tests complicates the bugfix process

Unlike “regular” bugs, which can usually be reproduced on demand once reproduction steps are known, flaky tests are usually only reproducible in one way: by re-running the test suite repeatedly on CI.

This works out okay when the test fails with relative frequency. If the test fails one out of every five test runs, for example, then I can run the test suite 50 times and expect to see (on average) ten failures. This means that if I apply the ostensible fix for the flaky test and then run the test suite 50 more times and see zero failures, then I can be pretty confident that my fix worked.

How certain I can be that my fix worked goes down the more infrequently the flaky test fails. If the test fails only once out of every 50 test runs on average, then running my test suite 50 times and seeing zero failures doesn’t tell me much: I can’t be sure whether that means the flaky test is fixed or whether all my runs passed due to random chance.
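
The arithmetic behind that difference (a back-of-the-envelope sketch): if a test fails on any given run with probability p, the chance that 50 runs all pass by pure luck is (1 - p) to the 50th power.

# Chance that 50 consecutive runs all pass by luck alone,
# for two different per-run failure rates.
(1 - 1.0 / 5)  ** 50 # => ~0.00001, so 50 green runs strongly suggest the fix worked
(1 - 1.0 / 50) ** 50 # => ~0.36, so 50 green runs tell us very little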

Ideally a bugfix process goes like this:

  1. Perform a test that shows that the bug is present (i.e. run the test suite a bunch of times and observe that the flaky test fails)
  2. Apply a bugfix on its own branch
  3. Perform a test on that branch that shows that the bug is absent (i.e. run the test suite a bunch of times and observe that the flaky test doesn’t fail)
  4. Merge the bugfix branch into master

The reason this process is good is because it gives certainty that the bugfix works before the bugfix branch gets merged into master.

But for a test that fails infrequently, it’s not realistic to perform the steps in that order. Instead it has to be like this:

  1. Perform a test that shows that the bug is present (i.e. observe over time that the flaky test fails sometimes)
  2. Apply a bugfix on its own branch
  3. Merge the bugfix branch into master
  4. Perform a test that shows that the bug is absent (i.e. observe over a sufficiently long period of time that the flaky test no longer fails)

Notice how the test that shows that the bug is present is different. When the test fails frequently, we can perform an “on-demand” test where we run the test suite a number of times to observe that the bug is present. When the test fails infrequently, we don’t realistically have this option because it may require a prohibitively large number of test suite runs just to get a single failure. Instead we just have to go off of what has been observed in the test suite over time in the natural course of working.

Notice also that the test that shows that the bug is absent is different. When the test fails frequently, we can perform the same on-demand test after the bugfix as before the bugfix in order to be certain that the bugfix worked. When the test fails infrequently, we can’t do this, and we just have to wait until a bunch of test runs naturally happen over time. If the test goes sufficiently long without failing again, we can be reasonably sure that the bugfix worked.

Lastly, notice how in the process for an infrequently-failing test, merging the fix into master has to happen before we perform the test that ensures that the bugfix worked. This is because the only way to test that the bugfix worked is to actually merge the bugfix into master and let it sit there for a large number of test runs over time. It’s not ideal but there’s not a better way.

A note about deleting and skipping flaky tests

There are two benefits to fixing a flaky test. One benefit of course is that the test will no longer flake. The other is that you gain some skill in fixing flaky tests as well as a better understanding of what causes flaky tests. This means that fixing flaky tests creates a positive feedback loop. The more flaky tests you fix, the more quickly and easily you can fix future flaky tests, and the fewer flaky tests you’ll write in the first place because you know what mistakes not to make.

If you simply delete a flaky test, you’re depriving yourself of that positive feedback loop. And of course, you’re also destroying whatever value that test had. It’s usually better to push through and keep working on fixing the flaky test until the job is done.

It might sometimes seem like the amount of time it takes to fix a certain flaky test is more than the value of that test can justify. But keep in mind that the significant thing is not the cost/benefit ratio of any individual flaky test fix, but the cost/benefit ratio of all the flaky test fixes on average. Sometimes flaky test fixes will take 20 minutes and sometimes they’ll take two weeks. The flaky test fixes that take two weeks might feel unjustifiable, but if you have a general policy of giving up and deleting the test when things get too hard, then your test-fixing skills will always stay limited, and your weak skills will incur a cost on the test suite for as long as you keep deleting difficult flaky tests. Better to just bite the bullet and develop the skills to fix hard flaky test cases.

Having said all that, deleting a flaky test is sometimes the right move. When development teams lack the skills to write non-flaky tests, sometimes the teams have other bad testing habits, like writing tests that are pointless. When a flaky test coincidentally happens to also be pointless, it’s better to just delete the test than to pay the cost to fix a test that doesn’t have any value.

Skipping flaky tests is similar in spirit to deleting them. Skipping a flaky test has all the same downsides as deleting it, plus now you have the extra overhead of occasionally stumbling across the test and remembering “Oh yeah, I should fix this eventually.” And what’s worse, the skipped test often gets harder to fix as time goes on, because the skipped test is frozen in time while the rest of the codebase continues to change in ways that aren’t compatible with it. The easiest time to fix a flaky test is right when the flakiness is first discovered.

Takeaways

  • The root cause of every flaky test is some sort of non-determinism.
  • Flaky tests are known to present themselves more in a CI environment than in a local test environment because certain characteristics of CI test runs make the tests more susceptible to non-determinism.
  • I consider a flaky test to be a type of bug. When I’m fixing any bug, including a flaky test, I divide the bugfix into three stages, which are reproduction, diagnosis and fix.
  • To reproduce a flaky test, I run the test suite enough times on CI to see the flaky test fail, or if it fails too infrequently I wait for it to fail naturally.
  • There are a large number of tactics I use to diagnose flaky tests. I don’t necessarily go through the tactics in a specific order but rather I use intuition and experience to decide which tactic to use next. The important thing is to treat the flaky test diagnosis as a distinct step which occurs after reproduction and before the application of the fix.
  • With the application of any bugfix, it’s good to have a test you can perform before and after the fix to be sure that the fix worked. When a flaky test fails frequently enough, you can do this sort of test by simply re-running the test suite in CI a sufficient number of times. If the flaky test fails infrequently, this is not practical, and the fix must be merged to master without being sure that it worked.
  • When you delete a flaky test, you not only destroy the value of the test but you also lose the opportunity to build your skills in fixing flaky tests and avoiding writing flaky tests in the first place. Unless the test coincidentally happens to be one that has little or no value, it’s better to fix it.