Category Archives: Programming

Programming principles as memes

In 1976 in The Selfish Gene, Richard Dawkins explained that the unit of selection for evolution is not the group or species or individual but the gene. Genes that are best at causing themselves to get replicated are the ones that do get replicated, and so those genes then become predominant.

An organism’s body is a gene’s “survival machine”. Individual organisms die but a gene can survive indefinitely. A gene “uses” an organism to propagate itself from one generation to the next. Since an organism’s design is determined entirely by its DNA, the genes are in charge. The genes are the masters and the organisms are the slaves. Everything an organism does, it’s been programmed to do by its genes in order to propagate those genes.

In The Selfish Gene Dawkins also introduced the idea of a meme. A meme is kind of like a gene but instead of a piece of DNA it’s a behavior or idea. In The Beginning of Infinity David Deutsch uses the idea of a joke as an example of a meme. If a joke is funny, it will make people who hear it want to tell it again, causing the joke to be proliferated. But the joke may not always be retold word-for-word. Some people might remember the idea of the joke but tell it in their own words. Some people might misremember the joke and accidentally tell it a bit differently. Some people may intentionally change the joke in order to try to improve it. Just as a gene competes with variants of itself, so does a joke.

The versions of the joke that are best at getting themselves replicated are the ones that become predominant. Note, importantly, that this does not necessarily mean that the best version of the joke will become predominant, just the version that’s best at getting itself replicated. “Survival of the fittest” is a myth. The versions of genes and memes that survive are the ones that are best at getting themselves copied, nothing more.

Before internet memes our culture had many others. Quotes are often memes, for example, like “if it ain’t broke don’t fix it” and “practice makes perfect”. The content of a meme can be true or false, helpful or unhelpful. “Practice makes perfect”, for example, is sometimes refuted with the expression “practice makes permanent”, which is closer to the truth. “Practice makes perfect” didn’t become a meme because it’s true, it became a meme because it has the attractive qualities of being snappy, alliterative and memorable, and superficially seeming wise and true.

Programming principles can also come in the form of memes. “Don’t Repeat Yourself” (DRY) is memorable thanks to its slightly amusing pronounceable acronym and the fact that its advice is useful and approximately true, although only approximately. The refuting advice, “Write Everything Twice” (WET) is cleverly formulated as the antonym to DRY, and has been very successful in replicating itself, despite the fact that its advice (allow duplication up to two times, and only de-duplicate when the number of instances reaches three) is, at least in my opinion, completely nonsensical.

Just as genes don’t obey “survival of the fittest”, memes, including programming memes, don’t obey “survival of the truest”. The programming principles that replicate themselves the most are simply the ones that are the best at getting themselves replicated, not necessarily the ones that are the truest or best. Sometimes an idea proliferates simply because it’s easy to learn and easy to teach, or because repeating the idea makes the repeater feel wise and sophisticated.

Having said that, sometimes a meme is good at causing itself to be replicated precisely because it is true. Newton’s laws of physics, for example, successfully proliferated because they were genuinely useful, and they were useful because they were true—at least true enough for many purposes. Then, later, Einstein’s laws of physics became successful memes because they were an improvement upon Newton’s laws of physics, and succeeded in certain areas where Newton’s laws failed. This phenomenon of memes proliferating precisely because they’re true should give us hope.

We all know people who seem to be slaves to some particular sort of fashion. Some people repeat the political ideas they hear on a certain news channel, for example, or get caught up in a succession of get-rich-quick schemes. These people seem not to have minds of their own. Like genes’ survival machines, these people in a way exist only as a vehicle for memes to replicate themselves. But of course, not everyone has to be a slave to memes. Instead of uncritically accepting memetic ideas, we can examine them with a careful combination of open-mindedness and skepticism and demand a good explanation for why the idea in the meme is supposedly true. In this way, perhaps our false programming memes can wither away and gradually be replaced with ever truer ones.

How do we tell what’s a good programming practice?

I’ve been thinking lately about the question of how to decide what’s a good programming practice and what’s not.

One way of looking at this is that it’s all just subjective opinion. Just do whatever works for you. You might call this the “anything goes” principle.

The problem with the Anything Goes Principle is that it doesn’t solve any meaningful problems. At best it works as a way to sometimes get strangers to stop arguing on the internet. It certainly doesn’t solve a debate between two team members in an organization. The Anything Goes Principle is simply a way to agree to disagree, not a way to resolve a disagreement.

Another way to approach the question is what you might call the “consensus principle” or the “best practice principle”. This principle says that whatever the prevailing best practice in the industry is, that’s the right answer. But this method is easily debunked. There have been many instances in history where the consensus view in a certain field was just plain wrong. As the saying goes, “What’s popular is not always right”. The truth can’t be discovered via popularity contest.

Here’s what I think is the real key to recognizing a good programming practice: an explanation which stands up to rigorous criticism and which is sufficiently specific that it couldn’t be used to defend any other principle.

Let’s work with an example. It’s a very uncontroversial principle in programming that, in general, clear and accurate variable names, even if they may be long, are better than short, cryptic variable names. This principle has a strong explanation behind it. Clear variable names are, by definition, self-explanatory. When a name is clear, the reader can learn its meaning directly from the name itself rather than having to infer the meaning by studying the surrounding context. In terms of time and cognitive strain, it’s cheaper to learn the meaning of a variable by simply reading the name than by studying the surrounding context. This explanation is, I think, very hard to refute.
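
As a quick illustration (these names are made up, not taken from any particular codebase):

# Cryptic: the reader has to infer what "q", "p" and "d" mean
# from the surrounding context.
def tot(q, p, d)
  q * p - d
end

# Clear: the names carry their own meaning, even though they're longer.
def order_total(quantity, unit_price, discount)
  quantity * unit_price - discount
end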

This principle even has an exception that proves the rule. There are times when a one-letter variable name is actually better than a long-and-clear name. The explanation for this is that when the scope of a variable is very small and the original name of the variable is not particularly short, the repetition of the long variable name can actually present enough noise in the code that the long-variable version is harder to understand than the short-variable version.
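
For example (a made-up sketch), when a variable lives only inside a one-line block, the long name mostly adds noise:

monthly_revenue_totals = [100, 200, 300]

# The long name appears three times on one line and mostly adds noise.
monthly_revenue_totals.map { |monthly_revenue_total| monthly_revenue_total * 1.1 }

# In a scope this small, a one-letter name is arguably easier to read.
monthly_revenue_totals.map { |t| t * 1.1 }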

And by the way I want to note that the argument isn’t that the proponent of this rule personally finds certain code easier to understand. The argument is that a typical programmer would find the code easier to understand, since “a typical programmer” is the likely audience for the code, not specifically the proponent of the rule. So the rule could not be refuted by the argument of “well I personally think short variable names are easy enough to understand”. The argument would have to be “I think short variable names are easy enough for the typical programmer to understand”. It’s the difference between a subjective argument and an objective one.

Here’s an example of a poor explanation. A certain article about service objects claims that “Services are easy and fast to test since they are small Ruby objects that have been separated from their environment.” This explanation is weak partly because it could equally be applied to many other practices. The quality of being small and “separated from their environment” (in other words, loosely coupled from dependencies) is not unique to service objects. Furthermore, the quality of being small isn’t inherent to what a service object supposedly is. I’ve seen plenty of gargantuan service objects. And for that matter, loose coupling isn’t an inherent property of service objects either, and I’ve also seen plenty of service objects with tight coupling to their dependencies. My personal distaste for service objects aside, these explanations for the merits of service objects are objectively bad.

If a certain programming practice can address all the criticism it’s subjected to with specific and objective explanations, then it’s probably a good one. If not, it’s probably a bad one. And while everyone is of course entitled to their own subjective opinion, much of the quality of various programming practices is actually a matter of objectivity. At least that’s what I think.

Programming is technical writing

I enjoy reading nonfiction. Not just programming but also biology, physics, philosophy, mechanical engineering, robotics and other technical topics.

The challenge in technical writing is to convey difficult ideas in ways that are easy to understand. The best technical writing doesn’t make the reader think, “Wow, the writer must be a genius to be able to understand this stuff,” it makes them think, “Wow, I must be a genius to be able to understand this stuff!”

Very few programmers make a serious study of technical writing. I myself, 20+ years into my programming career, have only just begun to do so. But I think improving one’s technical writing ability is a highly profitable undertaking for any programmer, even if you never intend to do any technical writing. In a very real sense, programming IS technical writing.

Donald Knuth has said, “Programs are meant to be read by humans and only incidentally for computers to execute.” When we write code, we’re (hopefully) aiming not just for it to fulfill its functional requirements but also for it to be understandable by human readers. In other words, we’re trying to convey difficult ideas in ways that are easy to understand. That’s what I mean when I say that programming is technical writing.

How does one get better at technical writing? One thing that helps is to do a lot of technical reading. In programming, my favorite technical writers are Martin Fowler and Steve McConnell. Outside of that, my favorites include Richard Dawkins (the master in my book), Carl Sagan (a close second), Steven Pinker, David Deutsch, Sean Carroll and Richard Feynman. If you’d like a specific starting point, I’d recommend The Blind Watchmaker by Richard Dawkins.

One last comment, an important one. Good writing is downstream of good thinking. None of the authors I listed above are writers by profession. The programming authors are programmers and the science authors are scientists. These writers are good writers because they’re excellent practitioners, with good ideas worth sharing. It also so happens that these practitioners have learned how to skillfully put their ideas into words. If Richard Dawkins were a clumsy writer, his books would probably still be worth reading on the merit of the ideas they contain. But if Dawkins were an excellent writer but a poor scientist, his books would be junk. Similarly, a programmer must be skilled not only in coding style but in designing and maintaining a software system. Good code is downstream of good thinking.

Sharp blades and dull blades

We all know the metaphor of technical debt. I want to share some weaknesses of the technical debt metaphor, plus a different metaphor which I think is better.

In real life, debt is not inherently bad. If you have debt, all it means in most cases is that you took out a loan. Loans are a good thing. Loans are what make it possible to buy things like cars and houses that almost nobody has the money for up front. Business loans can allow businesses to make investments to grow faster, pay back the loan, and end up with a net benefit that wouldn’t have been possible without the loan.

Debt in real life is also avoidable. You don’t have to get a car loan or mortgage if you don’t want to. You don’t have to get a credit card. You could theoretically pay for everything with cash your whole life.

In software, what we call “technical debt” is not avoidable. It’s going to happen no matter what.

Managers sometimes talk about “strategic technical debt”, where debt is consciously taken on in order to meet a short-term timeline. The thing is, there’s never a plan to pay the debt back. Indeed it rarely does get paid back. Usually the next move is to take on yet more technical debt. It’s like a payday loan cycle, where each payday loan is used to pay the interest from the last payday loan, with less and less of each paycheck ultimately going into your own pocket. Yet the “strategic technical debt” myth persists, partly because the metaphor is so easy to grasp (and miscomprehend).

I prefer to think of a software system as a collection of blades.

A dull blade is never preferable to a sharp one. Sometimes it may be better to just cut with the dull blade instead of taking the time to sharpen the blade before cutting, but it’s clear that that’s a compromise. No one would talk about “strategic dull blades” the way they talk about strategic technical debt.

Unlike debt which you can simply choose not to take on, blades unavoidably get dull with use. Having technical debt makes it sound like you did something wrong. But you don’t have to do anything wrong in order to end up with bad code. Often, bad code is simply the result of entropy. The blade metaphor makes it clear that dullness is normal and unavoidable.

The blade metaphor also has built-in suggestions about how to address the problem. When is it time to sharpen a dull blade? Right before you use it (or right after). It obviously wouldn’t be a very smart investment to just go randomly sharpening saws in the shed. When you pick up a saw and notice that it’s hard to cut with, that can tell you that it would be worth the investment to sharpen it at least a little before trying to cut more with it. If there’s a blade that never gets used, you can afford not to sharpen it, since a dull blade that’s not being used is not hurting anything.

Lastly, the blade metaphor makes clear the benefit of being in a good state. What exactly is the benefit of not having technical debt? That you don’t have to pay technical interest? The benefit is not so clear. The benefit of having sharp blades is obvious: you can cut faster. Unlike being debt-free, which is simply the absence of a bad thing, having sharp blades means you’re in possession of something good.

As an industry I wish we would do away with the technical debt metaphor and adopt the blade metaphor instead. I think this metaphor would help managers understand the nature of the issue more accurately. Instead of pressuring us to take on “strategic technical debt”, maybe they’ll see the virtues of helping us keep our blades sharp.

Testing anti-pattern: merged setup data

In a single test file, there’s often overlap among the setup data needed for the tests in the file. For whatever reason, perhaps in an effort to improve performance or avoid duplication, test writers often merge the setup code and bring it to the top of the file so it’s available to all test cases.

Let’s take a look at a test that contains apparently duplicative setup data. This test has two test cases, each of which needs one build and at least one job.

RSpec.describe Build, type: :model do
  describe "#start!" do
    let!(:job) { create(:job) }
    let!(:build) { job.build }
  ...

  describe "#status" do
    let!(:build) { create(:build) }
    let!(:job_1) { create(:job, build: build, order_index: 1) }
    let!(:job_2) { create(:job, build: build, order_index: 2) }
  ...
...

There’s obviously some duplication among this setup data. We’re creating two builds and three jobs, but we only really need a total of one build and two jobs. If we wanted to, we could be more “economical” with our setup data by combining it and placing it at the top of the file so that it can be used by all tests, like so:

RSpec.describe Build, type: :model do
  let!(:build) { create(:build) }
  let!(:job_1) { create(:job, build: build, order_index: 1) }
  let!(:job_2) { create(:job, build: build, order_index: 2) }

  describe "#start!" do
  ...

  describe "#status" do
  ...
...

Now our setup code is superficially a little less “wasteful” but we’ve created a couple subtle problems that make our test harder to understand and change.

Misleading details

When we create our jobs, we give each of them an order_index. This matters for the #status test but is totally immaterial to the #start! test. As the author of this test, I happen to know which details matter for which tests, but someone reading this test for the first time would have no easy way of knowing when an order_index is needed and when it’s not.

The only safe assumption is that every detail is needed for every test. If we alter the global setup data somehow, it’s possible that we’ll cause a silent defect. We could cause a test to keep passing but to lose its validity and start showing us a false positive. When data is included in the global setup beyond what’s needed for every test in the file, it creates an unnecessary risk that makes the test harder to change than it needs to be.

Shoehorned data

I’ll now reveal a bit more of the describe "#start!" test case.

describe "#start!" do
  let!(:job) { create(:job) }
  let!(:build) { job.build }

  before do
    fake_job_machine_request = double("JobMachineRequest")

    # The following line, which references "job",
    # is the line to pay attention to
    allow(job).to receive(:job_machine_request).and_return(fake_job_machine_request)
  end
end

In the above code, which shows the original version of the #start! test before we merged all the setup data, the setup and usage of job were straightforward. We created a job and then we used it.

But now that we’ve merged all our setup code, we only have job_1 and job_2 available to us and no plain old job anymore. This makes things awkward for the #start! test, where we only need one job to work with. Here are two possible options, both undesirable.

before do
  fake_job_machine_request = double("JobMachineRequest")

  # Option 1: referring to job_1 misleadingly implies that the
  # fact that it's job 1 and not job 2 is significant,
  # which it's not, and also that we might do
  # something with job 2, which we won't
  allow(job_1).to receive(:job_machine_request).and_return(fake_job_machine_request)

  # Option 2: assigning job_1 to a local variable called job
  # is almost better but not really
  job = job_1
  allow(job).to receive(:job_machine_request).and_return(fake_job_machine_request)
end

Another way we can deal with this issue is simply to add a third job to our setup called job:

RSpec.describe Build, type: :model do
  let!(:build) { create(:build) }
  let!(:job) { create(:job) }
  let!(:job_1) { create(:job, build: build, order_index: 1) }
  let!(:job_2) { create(:job, build: build, order_index: 2) }

  describe "#start!" do
  ...

  describe "#status" do
  ...
...

This takes away our problem of having to shoehorn job_1 into standing in for job, and it also takes away our misleading details problem, but it creates a new problem.

It’s unclear what setup data belongs to which test(s)

The other problem with combining the setup data at the top of the test is that it’s not clear which values are needed for which test. This creates a symptom very similar to the one created by the misleading details problem: we can’t easily know which kinds of changes to the setup data are safe and which are risky.

It’s unclear what the state of the system is and how it will affect any particular test

I once worked on a project where a huge setup script ran before the test suite. The test environment would get some arbitrary number of users, superusers, accounts, customers and almost every other kind of entity imaginable.

The presence of all this data made it extremely hard to understand what the state of the system was and how any existing data might interfere with any particular test. Although it’s sometimes worth it to make compromises, it’s generally much easier if every test starts with a fully clean slate.

What’s the solution?

Instead of merging all of a test file’s setup data at the top, it’s better to give each test only the absolute minimum data that it needs. This usually means giving each test case its own individual setup data. Often it even means creating duplication, although duplication in tests is usually not real duplication. The performance cost is also usually negligible or nonexistent: even if global setup data only appears once in the file, it still gets run before every single test anyway, so nothing is saved by combining it. Any cost in duplication or performance is usually overwhelmingly offset by the benefit of not merging setup data: it makes your tests easier to understand and work with.
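
Applied to the example from earlier, that just means going back to the original version, where each describe block keeps its own setup (test bodies still elided):

RSpec.describe Build, type: :model do
  describe "#start!" do
    # Only what #start! needs: one job and its build
    let!(:job) { create(:job) }
    let!(:build) { job.build }
  ...

  describe "#status" do
    # Only what #status needs: one build with two ordered jobs
    let!(:build) { create(:build) }
    let!(:job_1) { create(:job, build: build, order_index: 1) }
    let!(:job_2) { create(:job, build: build, order_index: 2) }
  ...
...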

Using ChatGPT to reduce “study and synthesize” work

When ChatGPT first came out, the first programming use case I thought of was writing code. I thought of it like “GitHub Copilot on steroids”. I imagine that was a lot of other people’s first thought too. But gradually I realized that having ChatGPT write production code is actually not a very good idea.

When ChatGPT gives you a big chunk of code, how can you be sure that the code does exactly what you think it does? How can you be sure that it’s not missing something, and that it doesn’t contain anything extra, and that it doesn’t have bugs?

The answer of course is that you have to test it. But retroactively testing existing code is usually tedious and annoying. You basically have to replay the process of writing the code so that you can test each piece of code individually.

A programming workflow that involves using ChatGPT to write big chunks of code seems dangerous at worst and terribly inefficient at best.

If ChatGPT isn’t great for writing production code, what’s it good for?

Using ChatGPT to reduce mental labor

One part of our jobs as programmers is to learn general principles and then apply parts of those principles to a certain peculiar need.

I’ll give a somewhat silly example to illustrate the point starkly. Let’s say I want to integrate Authorize.net into a Lisp program and that I’ve never used either technology before. Without checking, one could pretty safely assume there are no tutorials in existence on how to integrate Authorize.net into a Lisp app.

In order to complete my integration project I’ll need to learn something about a) Authorize.net in general, b) Lisp in general, and c) integrating Authorize.net with a Lisp app specifically. Then I’ll need to synthesize a solution based on what I’ve learned.

This whole process can be wasteful, time-consuming, and at times, quite boring. In the beginning I might know that I need to get familiar with Authorize.net, but I’m not sure yet which parts of Authorize.net I need to be familiar with. So I’ll read a whole bunch about Authorize.net, but I won’t know until the end of the project which areas of my study were actually needed and which were just a waste of time.

And what’s even worse is the cases where the topics you’re studying are of no permanent benefit to your skills as a programmer. In the case of Authorize.net I might not expect to ever use it again. (At least I hope not!) This kind of learning is just intellectual ditch-digging. It’s pure toil with little or no lasting benefit.

This kind of work, where you first study some generalities and then synthesize a specific solution from those generalities, is what I call “study and synthesize” work.

Thanks to ChatGPT, most “study and synthesize” work is a thing of the past. If I tell ChatGPT “Give me a complete tutorial on how to integrate Authorize.net with a Lisp program”, it will. The tutorial may not be correct down to every last detail but that’s not the point. Just having a high-level plan spelled out saves a lot of mental labor. And then if I need to zoom in on certain details which the tutorial either got wrong or omitted, ChatGPT will quite often correct its mistakes when pressed.

Using ChatGPT to write production code may seem like a natural and logical use for the tool, but it’s actually not a very good one. You’ll get a lot more leverage out of ChatGPT if you use it for “study and synthesize” work.

In defense of productivity

Anti-productivity sentiment

In my career I’ve noticed that a lot of developers have a distaste for the idea of “productivity”. They view it as some sort of toxic, unhealthy obsession. (It always has to be an “obsession”, by the way. One can never just have an interest in productivity.)

Productivity is often associated with working harder and longer, sacrificing oneself for a soulless corporation.

In a lot of ways I actually agree with these people. Having an unhealthy obsession with anything is obviously unhealthy, by definition. And I think working long and hard for its own sake, for no meaningful reward, is a waste of precious time.

But I think sometimes these anti-productivity people are so blinded by their natural aversion to “productivity culture” that they miss out on some good and worthwhile ideas, ideas they would actually like if they opened their minds to them.

“Productivity” is a pretty ambiguous word. It could have a lot of different interpretations. I’d like to share my personal interpretation of productivity which I happen to quite like. Maybe you’d like to adopt it for yourself.

My version of productivity

For me, productivity isn’t about obsessively tracking every minute of the day or working so hard you burn yourself out.

The central idea of productivity for me is decreasing the ratio of effort to value. This could mean working less to create the same value or it could mean working the same to create more value. Or anywhere in between. Each person can decide for themselves where they’d like to set the dial.

I value a calm, healthy mind and body. People obviously do better work when they’re relaxed and even-keeled than when they’re harried and stressed.

Productivity for me is about realizing that our time on this planet is limited and precious, and that we shouldn’t be needlessly wasteful with our time but rather protect it and spend it thoughtfully.

Why duplication is more acceptable in tests

It’s often taught in programming that duplication is to be avoided. But for some reason it’s often stated that duplication is more acceptable in test code than in application code. Why is this?

We’ll explore this, but first, let’s examine the wrong answers.

Incorrect reasons why duplication is more acceptable in tests

“Duplication isn’t actually that bad.”

Many programmers hold the opinion that duplication isn’t something that should be avoided fastidiously. Instead, a certain amount of duplication should be tolerated, and when the duplication gets to be too painful, then it should be addressed. The “rule of three” for example says to tolerate code that’s duplicated twice, but clean it up once the duplication reaches three instances.

This way of thinking is overly simplistic and misses the point. The cost of duplication doesn’t depend on whether the duplication appears twice or three times but rather on factors like how easy the duplication is to notice, how costly it is to keep the duplicated instances synchronized, and how much “traffic” the duplicated areas receive. (See this post for more details on the nature of duplication and its costs.)

The heuristic for whether to tolerate duplication shouldn’t be “tolerate some but don’t tolerate too much”. Rather the cost of a piece of duplication should be assessed based on the factors above and weighed against any benefits that piece of duplication has. If the costs aren’t justified by the benefits, then the duplication should be cleaned up.

“Duplication in test code can be clearer than the DRY version”

It’s true that duplication in test code can be clearer than the DRY version. But duplication in application code can be clearer than the DRY version too. So if duplicating code can make it clearer, why not prefer duplication in application code to the same exact degree as in test code?

This answer doesn’t actually answer the question. The question is about the difference between duplication in test code and application code.

The real reason why duplication is more acceptable in test code

In order to understand why duplication is more acceptable in test code than application code, it helps to get very clear on what exactly duplication is and why it incurs a cost.

What duplication is and why it costs

Duplication doesn’t mean two identical pieces of code. Duplication is two or more copies of the same behavior. It’s possible to have two identical pieces of code that represent different pieces of behavior. It’s also possible to have the same behavior expressed more than once but in different code.
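
Here’s a contrived sketch of that first case: two methods whose code happens to be identical but which represent different behaviors. If the tax rate changed, the service fee wouldn’t necessarily change with it, so “de-duplicating” these would be a mistake.

def sales_tax(amount)
  amount * 0.05
end

# Identical code, but a coincidence rather than the same behavior.
def service_fee(amount)
  amount * 0.05
end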

Let’s also review why duplication incurs a cost. The main reason is that it leaves the program susceptible to logical inconsistencies. If one copy of a behavior gets changed but the other copies don’t, then the other copies are now wrong and there’s a bug present. The other reason duplication incurs a cost is that it creates a maintenance burden. Updating something in multiple places is obviously more costly than updating it in just one place.

The difference between test code and application code

The difference between test code and application code is that test code doesn’t contain behaviors. All the behaviors are in the application code. The purpose of the test code is to specify the behaviors of the application code.

What in the codebase determines whether the application code is correct? The tests. If the application code passes its tests (i.e. its specifications), then the application code is correct (for a certain definition of “correct”). What in the code determines whether the tests (specifications) are correct? Nothing! The program’s specifications come entirely from outside the program.

Tests are always correct

This means that whatever the tests specify is, by definition, correct. If we have two tests containing the same code and one of the tests changes, it does not always logically follow that the other test needs to be updated to match. This is different from duplicated application code. If a piece of behavior is duplicated in two places in the application code and one piece of behavior gets changed, it does always logically follow that the other piece of behavior needs to get updated to match. (Otherwise it wouldn’t be an instance of duplication.)
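
As a contrived sketch (the User model, factory and scopes are made up): both of these tests contain the same setup line, but if the specification behind the first test changes, the second test doesn’t have to change just because the code it happens to share with the first one changed.

it "includes active users in the active list" do
  user = create(:user, status: "active")
  expect(User.active).to include(user)
end

it "excludes active users from the archive" do
  user = create(:user, status: "active")
  expect(User.archived).not_to include(user)
end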

This is the reason why duplication is more acceptable in test code than in application code.

Takeaways

  • Duplication is when one behavior is specified multiple times.
  • Duplication in application code is costly because, among other reasons, multiple copies of the same behavior are subject to diverging, thus creating a bug.
  • Since test code is a human-determined specification, it’s by definition always correct. If one instance of a duplicated piece of code changes, it’s not a logical necessity that the other piece needs to change with it.

Why tests flake more on CI than locally

A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.

The root cause of flaky tests is some sort of non-determinism, either in the test code or in the application code.

In order to understand why a CI test run is more susceptible to flakiness than a local test run, we can go through all the root causes for flakiness one-by-one and consider how a CI test run has a different susceptibility to that specific flaky test cause than a local test run.

The root causes we’ll examine (which are all explained in detail in this post) are leaked state, race conditions, network/third-party dependency, fixed time dependency and randomness.

Leaked state

Sometimes one test leaks some sort of state (e.g. a change to a file or env var) into the global environment which interferes with later tests.

The reason a CI test run is more susceptible to leaked state flakiness is clear. Unlike a local environment where you’re usually just running one test file at a time, in CI you’re running a whole bunch of tests together. This creates more opportunities for tests to interfere with each other.
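
Here’s a contrived sketch of leaked state (the PaymentClient class and environment variable are made up):

it "uses the sandbox endpoint in sandbox mode" do
  ENV["PAYMENT_MODE"] = "sandbox" # leaked: never reset after the test
  expect(PaymentClient.new.endpoint).to include("sandbox")
end

# Elsewhere, possibly in a different file:
it "uses the live endpoint by default" do
  # Passes when run on its own, but fails when the test above has
  # already run in the same process and leaked PAYMENT_MODE.
  expect(PaymentClient.new.endpoint).to include("live")
end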

Race conditions

A race condition is when the correct functioning of a program depends on two or more parallel actions completing in a certain sequence, but the actions sometimes complete in a different sequence, resulting in incorrect behavior.

One way that race conditions can arise is through performance differences. Let’s say there’s a process that times out after 5000ms. Most of the time the process completes in 4500ms, meaning no timeout. But sometimes it takes 5500ms to complete, meaning the process does time out.
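
A sketch of how that might look in a test (ReportGenerator and its five-second timeout are made up for the sake of the example):

it "generates the report without timing out" do
  # ReportGenerator raises Timeout::Error if generation takes longer
  # than 5000ms. Locally it usually finishes in ~4500ms; on a slower
  # CI machine it sometimes takes ~5500ms, and the test fails.
  expect { ReportGenerator.new.generate! }.not_to raise_error
end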

It’s very easy for differences to arise between a CI environment and a local environment in ways that affect performance. The OS is different, the memory and processor speed are different, and so on. These differences can mean that race conditions arise on CI that would not have arisen in a local environment.

Network/third-party dependency

Network dependency can lead to flaky tests for the simple reason that sometimes the network works and sometimes it doesn’t. Third-party dependency can lead to flaky tests because sometimes third-party services don’t behave deterministically. For example, the service can have an outage, or the service can rate-limit you.

This is the type of flakiness that should never occur because it’s not a good idea to hit the network in tests. Nonetheless, I have seen this type of flakiness occur in test suites where the developers didn’t know any better.

Part of the reason why CI test runs are more susceptible to this type of flakiness is that there are simply more at-bats. If a test makes a third-party request only once per day locally but 1,000 times per day on CI, there are of course more chances for the CI request to encounter a problem.

Fixed time dependency

There are some tests that always pass at one time of day (or month or year) and always fail at another.

Here’s an excerpt about this from my other post about the causes of flaky tests:

This is common with tests that cross the boundary of a day (or month or year). Let’s say you have a test that creates an appointment that occurs four hours from the current time, and then asserts that that appointment is included on today’s list of appointments. That test will pass when it’s run at 8am because the appointment will appear at 12pm which is the same day. But the test will fail when it’s run at 10pm because four hours after 10pm is 2am which is the next day.
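
A minimal sketch of the kind of test that excerpt describes (the Appointment model, factory and today scope are assumptions):

it "includes the appointment in today's list" do
  # Four hours from now is still "today" at 8am, but it's tomorrow
  # at 10pm, so this test fails when run late in the day.
  appointment = create(:appointment, starts_at: Time.current + 4.hours)
  expect(Appointment.today).to include(appointment)
end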

CI test runs are more susceptible to fixed-time-dependency flakiness than local test runs for a few reasons. One is the fact that CI test runs simply have more at-bats than local test runs. Another is that the CI environment’s time zone settings might be different from the local test environment. A third reason is that unlike a local test environment which is normally only used inside of typical working hours, a CI environment is often utilized for a broader stretch of time each day due to developers kicking off test runs from different time zones and from developers’ varying schedule habits.

Randomness

The final cause of flaky tests is randomness. As far as I know, the only way that CI test runs are more susceptible to flakiness due to randomness is the fact that CI test runs have more at-bats than local test runs.
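
For example (a made-up sketch), randomly generated test data can occasionally violate a validation:

it "creates a valid user" do
  # Faker occasionally generates a name longer than the model's
  # (hypothetical) 20-character limit, so this passes on most runs
  # and fails on the unlucky ones.
  user = User.new(name: Faker::Name.name)
  expect(user).to be_valid
end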

Takeaways

  • A flaky test is a test that passes sometimes and fails sometimes, even though no code has changed.
  • The root cause of flaky tests is some sort of non-determinism, either in the test code or in the application code.
  • Whenever flakiness is more frequent in CI, it’s because some difference between the CI test runs and the local runs has made one of the five specific causes of flaky tests more likely.

How I fix flaky tests

What a flaky test is and why they’re hard to fix

A flaky test is a test that passes sometimes and fails sometimes even though no code has changed.

There are several causes of flaky tests. The commonality among all the causes is that they all involve some form of non-determinism: code that doesn’t always behave the same on every run even though neither the inputs nor the code itself has changed.

Flaky tests are known to present themselves more in a continuous integration (CI) environment than in a local test environment. The reason for this is that certain characteristics of CI test runs make the tests more susceptible to non-determinism.

Because the flakiness usually can’t be reproduced locally, the buggy behavior of a flaky test is harder to reproduce and diagnose than that of most other bugs.

In addition to the fact that flaky tests often only flake on CI, the fact that flaky tests don’t fail consistently adds to the difficulty of fixing them.

Despite these difficulties, I’ve developed some tactics and strategies for fixing flaky tests that consistently lead to success. In this post I’ll give a detailed account of how I fix flaky tests.

The overall approach

When I’m fixing any bug I divide the bugfix into three stages: reproduction, diagnosis and fix.

I consider a flaky test a type of bug. Therefore, when I try to fix a flaky test, I follow the same three-step process as I would when fixing any other type of bug. In what follows I’ll cover how I approach each of these three steps of reproduction, diagnosis and fix.

Before reproducing: determine whether it’s really a flaky test

Not everything that appears to be a flaky test is actually a flaky test. Sometimes a test that appears to be flaking is just a healthy test that’s legitimately failing.

So when I see a test that’s supposedly flaky, I like to try to find multiple instances of that test flaking before I accept its flakiness as a fact. And even then, there’s no law that says that a test that previously flaked can’t fail legitimately at some point in the future. So the first step is to make sure that the problem I’m solving really is the problem I think I’m solving.

Reproducing a flaky test

If I can’t reproduce a bug, I can’t test for its presence or absence. If I can’t test for a bug’s presence or absence, I can’t know whether a fix attempt actually fixed the bug or not. For this reason, before I attempt to fix any bug, I always devise a test that will tell me whether the bug is present or absent.

My go-to method for reproducing a flaky test is simply to re-run the test suite multiple times on my CI service until I see the flaky test fail. Actually, I like to run the test suite a great number of times to get a feel for how frequently the flaky test fails. The actions I take during the bugfix process may be different depending on how frequently the test fails, as we’ll see later on.

Sometimes a flaky test fails so infrequently that it’s practically impossible to get the test to fail on demand. When this happens, it’s impossible to tell whether the test is passing due to random chance or because the flakiness has legitimately gone away. The way I handle these cases is to deprioritize the fix attempt and wait for the test to fail again in the natural course of business. That way I can be sure that I’m not wasting my time trying to fix a problem that’s not really there.

That covers the reproduction step of the process. Now let’s turn to diagnosis.

Diagnosing a flaky test

What follows is a list of tactics that can be used to help diagnose flaky tests. The list is kind of linear and kind of not. When I’m working on flaky tests I’ll often jump from tactic to tactic depending on what the scenario calls for rather than rigidly following the tactics in a certain order.

Get familiar with the root causes of flaky tests

If you were a doctor and you needed to diagnose a patient, it would obviously be helpful for you first to be familiar with a repertoire of diseases and their characteristic symptoms so you can recognize diseases when you see them.

Same with flaky tests. If you know the common causes for flaky tests and how to recognize them, you’ll have an easier time with trying to diagnose flaky tests.

In a separate post I show the root causes of flaky tests, which are race conditions, leaked state, network/third-party dependency, fixed time dependency and randomness. I suggest either committing these root causes to memory or reviewing them each time you embark on a flaky test diagnosis project.

Have a completely open mind

One of the biggest dangers in diagnosing flaky tests or in diagnosing any kind of problem is the danger of coming to believe something that’s not true.

Therefore, when starting to investigate a flaky test, I try to be completely agnostic as to what the root cause might be. It’s better to be clueless and right than to be certain and wrong.

Look at the failure messages

The first thing I do when I become aware of a flaky test is to look at the error message. The error message doesn’t always reveal anything terribly helpful but I of course have to start somewhere. It’s worth checking the failure message because sometimes it contains a helpful clue.

It’s important not to be deceived by error messages. Error messages are an indication of a symptom of a root cause, and the symptom of a root cause often has little or nothing to do with the root cause itself. Be careful not to fall into the trap of “the error message says something about X, therefore the root cause has something to do with X”. That’s very often not true.

Look at the test code

After looking at the failure message, I open the flaky test in an editor and look at its code. At first I’m not looking for anything specific. I’m just getting the lay of the land. How big is this test? How easy is it to understand? Does it have a lot of setup data or not much?

I do all this to load the problem area into my head. The more familiar I am with the problem area, the more I can “read from RAM” (use my brain’s short-term memory) as I continue to work on the problem instead of “read from disk” (look at the code). This way I can solve the problem more efficiently.

Once I’ve surveyed the test in this way, I zero in on the line that’s yielding the failure message. Is there anything interesting that jumps out? If so, I pause and consider and potentially investigate.

The next step I take with the test code is to go through the list of causes of flaky tests and look for instances of those.

After I’ve done all that, I study the test code to try to understand, in a meaningful big-picture way, what the test is all about. Obviously I’m going to be more likely to be successful in fixing problems with the test if I actually understand what the test is all about than if I don’t. (Sometimes this involves rewriting part or all of the test.)

Finally, I often go back to the beginning and repeat these steps an additional time, since each run through these steps can arm me with more knowledge that I can use on the next run through.

Look at the application code

The root cause of every flaky test is some sort of non-determinism. Sometimes the non-determinism comes from the test. Sometimes the non-determinism comes from the application code. If I’m not able to find the cause of the flakiness in the test code, I turn my attention to the application code.

Just like with the test code, the first thing I do is to just scan the relevant application code to get a feel for what it’s all about.

The next thing I do is to go through the code more carefully and look for causes of flakiness. (Again, you can refer to this blog post for that list.)

Then, just like with the test code, I try to understand the application code in a big-picture sort of way.

Make the test as understandable as possible

Many times when I look at a flaky test, the test code is too confusing to try to troubleshoot. When this is the case, I try to improve the test to the point that I can easily understand it. Easily understandable code is obviously easier to troubleshoot than confusing code.

To my surprise, I’ve often found that, after I improve the structure of the test, the flakiness goes away.

Side note: whenever I modify a test to make it easier to understand, I perform my modifications in one or more small, atomic pieces of work. I do this because I want to keep my refactorings and my fix attempts separate.

Make the application code as easy to understand as possible

If the application code is confusing then it’s obviously going to hurt my ability to understand and fix the flaky test. So, sometimes, I refactor the application code to make it easier to understand.

Make the test environment as understandable as possible

The quality of the test environment has a big bearing on how easy the test suite is to work with. By test environment I mean the tools (RSpec/Minitest, Factory Bot, Faker, etc.), the configurations for the tools, any seed data, the continuous integration service along with its configuration, any files shared among all the tests, and things like that.

The harder the test environment is to understand, the harder it will be to diagnose flaky tests. Not every flaky test fix job prompts me to work on the test environment, but it’s one of the things I look at when I’m having a tough time or I’m out of other ideas.

Check the tests that ran just before the flaky test

Just because a certain test flakes doesn’t necessarily mean that that test itself is flaky—even if the same test flakes consistently.

Sometimes, due to leaked state, test A will create a problem and then test B will fail. (A complete description of leaked state can be found in this post.) The symptom is showing up in test B so it looks like test B has a problem. But there’s nothing at all wrong with test B. The real problem is test A. So the problematic test passes but the innocent test flakes. It’s very deceiving!

Therefore, when I’m trying to diagnose a flaky test, I’ll check the continuous integration service to see what test ran before that test failed. Sometimes this leads me to discover that the test that ran before the flaky one is leaking state and needs to be fixed.

Add diagnostic info to the test

Sometimes, the flaky test’s failure message doesn’t show much useful information. In these cases I might add some diagnostic info to the test (or the relevant application code) in the form of print statements or exceptions.
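
This can be as simple as a few puts statements whose output will show up in the CI log the next time the test flakes. A made-up example:

it "marks the build as passed" do
  # Temporary diagnostics: dump the state we suspect is relevant so
  # that the next CI failure tells us more than the assertion alone.
  puts "job statuses: #{build.jobs.map(&:status).inspect}"
  puts "current time: #{Time.current}"

  expect(build.status).to eq("passed")
end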

Perform a binary search

Binary search debugging is a tactic that I use to diagnose bugs quickly. There are two main ideas behind it: 1) it’s easier to find where a bug is than what a bug is, and 2) binary search can be used to quickly find the location of a bug.

I make heavy use of binary search debugging when diagnosing flaky tests. See this blog post for a complete description of how to use binary search debugging.

Repeat all the above steps

If I go through all the above steps and I don’t have any more ideas, I might simply go through the list an additional time. Now that I’m more familiar with the test and everything surrounding it, I might have an enhanced ability to learn new things about the situation when I take an additional pass, and I might have new ideas or realizations that I didn’t have before.

Now let’s talk about the third step in fixing a flaky test, applying the fix itself.

Applying the fix for a flaky test

How to be sure your bugfixes work

A mistake many developers make when fixing bugs is that they don’t figure out a way to know if their bugfix actually worked or not. The result is that they often have to “fix” the bug multiple times before it really gets fixed. And of course, the false fixes create waste and confusion. That’s obviously not good.

The way to ensure that you get the fix right the first time is to devise a test (it can be manual or automated) that a) fails when the bug is present and b) passes when the bug is absent. (See this post for more details on how to apply a bugfix.)

How the nature of flaky tests complicates the bugfix process

Unlike “regular” bugs, which can usually be reproduced on demand once reproduction steps are known, flaky tests are usually only reproducible in one way: by re-running the test suite repeatedly on CI.

This works out okay when the test fails with relative frequency. If the test fails one out of every five test runs, for example, then I can run the test suite 50 times and expect to see (on average) ten failures. This means that if I apply the ostensible fix for the flaky test and then run the test suite 50 more times and see zero failures, then I can be pretty confident that my fix worked.

How certain I can be that my fix worked goes down the more infrequently the flaky test fails. If the test fails only once out of every 50 test runs on average, then running my test suite 50 times and seeing zero failures doesn’t tell me much: I can’t be sure whether that means the flaky test is fixed or whether all my runs passed due to random chance.
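
The arithmetic behind that difference (a back-of-the-envelope sketch): if a test fails on any given run with probability p, the chance that 50 runs all pass by pure luck is (1 - p) to the 50th power.

# Chance that 50 consecutive runs all pass by luck alone,
# for two different per-run failure rates.
(1 - 1.0 / 5)  ** 50 # => ~0.00001, so 50 green runs strongly suggest the fix worked
(1 - 1.0 / 50) ** 50 # => ~0.36, so 50 green runs tell us very little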

Ideally a bugfix process goes like this:

  1. Perform a test that shows that the bug is present (i.e. run the test suite a bunch of times and observe that the flaky test fails)
  2. Apply a bugfix on its own branch
  3. Perform a test on that branch that shows that the bug is absent (i.e. run the test suite a bunch of times and observe that the flaky test doesn’t fail)
  4. Merge the bugfix branch into master

The reason this process is good is because it gives certainty that the bugfix works before the bugfix branch gets merged into master.

But for a test that fails infrequently, it’s not realistic to perform the steps in that order. Instead it has to be like this:

  1. Perform a test that shows that the bug is present (i.e. observe over time that the flaky test fails sometimes)
  2. Apply a bugfix on its own branch
  3. Merge the bugfix branch into master
  4. Perform a test that shows that the bug is absent (i.e. observe over a sufficiently long period of time that the flaky test no longer fails)

Notice how the test that shows that the bug is present is different. When the test fails frequently, we can perform an “on-demand” test where we run the test suite a number of times to observe that the bug is present. When the test fails infrequently, we don’t realistically have this option because it may require a prohibitively large number of test suite runs just to get a single failure. Instead we just have to go off of what has been observed in the test suite over time in the natural course of working.

Notice also that the test that shows that the bug is absent is different. When the test fails frequently, we can perform the same on-demand test after the bugfix as before the bugfix in order to be certain that the bugfix worked. When the test fails infrequently, we can’t do this, and we just have to wait until a bunch of test runs naturally happen over time. If the test goes sufficiently long without failing again, we can be reasonably sure that the bugfix worked.

Lastly, notice how in the process for an infrequently-failing test, merging the fix into master has to happen before we perform the test that ensures that the bugfix worked. This is because the only way to test that the bugfix worked is to actually merge the bugfix into master and let it sit there for a large number of test runs over time. It’s not ideal but there’s not a better way.

A note about deleting and skipping flaky tests

There are two benefits to fixing a flaky test. One benefit of course is that the test will no longer flake. The other is that you gain some skill in fixing flaky tests as well as a better understanding of what causes flaky tests. This means that fixing flaky tests creates a positive feedback loop. The more flaky tests you fix, the more quickly and easily you can fix future flaky tests, and the fewer flaky tests you’ll write in the first place because you know what mistakes not to make.

If you simply delete a flaky test, you’re depriving yourself of that positive feedback loop. And of course, you’re also destroying whatever value that test had. It’s usually better to push through and keep working on fixing the flaky test until the job is done.

It might sometimes seem like the amount of time it takes to fix a certain flaky test is more than the value of that test can justify. But keep in mind that the significant thing is not the cost/benefit ratio of any individual flaky test fix, but the cost/benefit ratio of all the flaky test fixes on average. Sometimes flaky test fixes will take 20 minutes and sometimes they’ll take two weeks. The flaky test fixes that take two weeks might feel unjustifiable, but if you have a general policy of giving up and deleting the test when things get too hard, then your test-fixing skills will always stay limited, and your weak skills will incur a cost on the test suite for as long as you keep deleting difficult flaky tests. Better to just bite the bullet and develop the skills to fix hard flaky test cases.

Having said all that, deleting a flaky test is sometimes the right move. When development teams lack the skills to write non-flaky tests, sometimes the teams have other bad testing habits, like writing tests that are pointless. When a flaky test coincidentally happens to also be pointless, it’s better to just delete the test than to pay the cost to fix a test that doesn’t have any value.

Skipping flaky tests is similar in spirit to deleting them. Skipping a flaky test has all the same downsides as deleting it, plus now you have the extra overhead of occasionally stumbling across the test and remembering “Oh yeah, I should fix this eventually.” And what’s worse, the skipped test often gets harder to fix as time goes on, because the skipped test is frozen in time while the rest of the codebase continues to change in ways that aren’t compatible with it. The easiest time to fix a flaky test is right when the flakiness is first discovered.

Takeaways

  • The root cause of every flaky test is some sort of non-determinism.
  • Flaky tests are known to present themselves more in a CI environment than in a local test environment because certain characteristics of CI test runs make the tests more susceptible to non-determinism.
  • I consider a flaky test to be a type of bug. When I’m fixing any bug, including a flaky test, I divide the bugfix into three stages, which are reproduction, diagnosis and fix.
  • To reproduce a flaky test, I run the test suite enough times on CI to see the flaky test fail, or if it fails too infrequently I wait for it to fail naturally.
  • There are a large number of tactics I use to diagnose flaky tests. I don’t necessarily go through the tactics in a specific order but rather I use intuition and experience to decide which tactic to use next. The important thing is to treat the flaky test diagnosis as a distinct step which occurs after reproduction and before the application of the fix.
  • With the application of any bugfix, it’s good to have a test you can perform before and after the fix to be sure that the fix worked. When a flaky test fails frequently enough, you can do this sort of test by simply re-running the test suite in CI a sufficient number of times. If the flaky test fails infrequently, this is not practical, and the fix must be merged to master without being sure that it worked.
  • When you delete a flaky test, you not only destroy the value of the test but you also lose the opportunity to build your skills in fixing flaky tests and avoiding writing flaky tests in the first place. Unless the test coincidentally happens to be one that has little or no value, it’s better to fix it.