Duplication is an often-discussed topic in programming.
Sadly, much of the popular advice on duplication, such as the rule of three and the refrain duplication is cheaper than the wrong abstraction, treats duplication in an oversimplified way that doesn’t stand up to the nuanced reality of the issue.
In this post I’ll show what duplication is, why it’s such a surprisingly complicated issue, why the popular advice is dubious, and what can be done to address duplication.
We’ll start with a definition of duplication.
What duplication is
We could imagine that duplication could be defined as a piece of code that appears in two or more places. Indeed, this sounds like a very reasonable and accurate definition. But it’s actually wrong.
Here’s what duplication really is. Duplication is when there’s a single behavior that’s specified in two or more places.
Just because two identical pieces of code are present doesn’t necessarily mean duplication exists. And just because there are no two identical pieces of code present doesn’t mean there’s no duplication.
Two pieces of code could happen to be identical, but if they actually serve different purposes and lead separate lives, then they don’t represent the same behavior, and they don’t constitute duplication. To “DRY up” these identical-looking pieces of code would create new problems, like handcuffing two people together who need to walk in two different directions.
On the other hand, it’s possible for a single behavior to be represented in a codebase but with non-identical code. The way to tell if two pieces of code are duplicative isn’t to see if their code matches (although most of the time duplicative behavior and duplicative code do appear together.) The question that determines duplication is: if I changed one piece of code in order to meet a new requirement, would it be logically necessary to update the other piece of code the same way? If so, then the two pieces of code are probably duplicates of each other, even if their behavior is not achieved using the same exact syntax.
Why duplication is bad
The main reason duplication is bad is because it leaves a program susceptible to developing logical inconsistencies.
If a behavior is expressed in two different places in a program, and one of them accidentally doesn’t match the other, then the deviating behavior is necessarily wrong. (Or if the deviating happens to still meet its requirements, it only does so by accident.)
Another reason duplication can be bad is because it can pose an extra maintenance burden. It takes longer, and requires more mental energy, to apply a change to two areas of code instead of just one.
But not all instances of duplication are equally bad. Some kinds of duplication are more dangerous than others.
When duplication is more dangerous or less dangerous
There are three factors that determine the degree of harm of an instance of duplication: 1) how easily discoverable the duplication is, 2) how much extra overhead the presence of the duplication incurs, and 3) how much “traffic” that area receives, i.e. how frequently that area of code needs to be changed or understood. Let’s look at each of these factors more closely.
If there’s a piece of behavior that’s specified twice in the codebase, but the two pieces of code are only separated by one line, then there’s not a big problem because everyone is basically guaranteed to notice the problem. If someone updates one of the copies of the behavior to meet a new requirement, they’re very unlikely to accidentally miss updating the other one. You might call this the proximity factor.
If two pieces of duplicated behavior appear in different files in different areas of the application, then a “miss” is much more likely to occur, and therefore the duplication constitutes a worse problem.
Another quality that makes discovery of duplication easier is similitude. If two pieces of code look very similar, then their duplicity is more likely to be noticed than if the two pieces of code don’t look the same. You might call this the similitude factor.
If the proximity factor is bad (the pieces of duplicated code are at a great distance from each other) and/or if the similitude factor is bad (the duplication is obscured by the pieces of duplicated code not being similar enough to appear obviously duplicative) then it means the duplication is riskier.
Some instances of duplication are easier to live with than others. Two short lines of very similar code, located right next to each other, are very easy to keep in sync with one another. Other types of duplication are much more deeply baked into the system and can cause a much bigger headache.
For example, if a piece of duplication exists as part of the database schema, that’s a much higher maintenance cost than a short code duplication. Instances of duplication that are big and aren’t represented by identical code can also be costly to maintain because, in those cases, you can’t just type the same thing twice, you have to perform a potentially expensive translation step in your head.
Duplication is a type of “bad code”, and so principles that apply to bad code apply to duplication as well. One of these principles is that bad code in heavily-trafficked areas costs more than bad code in lightly-trafficked areas.
When considering how much a piece of bad code costs, it’s worth considering when that cost is incurred. When a piece of bad code incurs a cost, we might think of this as analogous to paying a toll on a toll road.
One tollway is when a piece of code is changed. The more frequently the code is changed, the more of a toll it’s going to incur, and so the bigger a problem it is.
Another tollway is when a piece of code needs to be understood as a prerequisite to understanding a different piece of code. Every codebase has “leaf code” and “branch code”. If a piece of code is leaf code, as in nothing depends on it, then we can afford for that code to be pretty bad and it doesn’t matter. Branch code, on the other hand, gets heavy intellectual traffic, and so incurs a higher toll, and so is a bigger problem.
How to decide whether to DRY up a piece of code or to keep the duplication
The way to decide whether or not to DRY up a piece of duplication is pretty simple, although it’s not easy. There are two factors to consider.
If a piece of duplication is “severe”—i.e. it has low discoverability, poses high overhead, and/or has a high traffic level—it should probably be fixed. If not, it should probably be left alone.
Quality of alternative
Just because a piece of duplication costs something doesn’t automatically mean that the de-duplicated version costs less. It doesn’t happen very often, but sometimes a de-duplication unavoidably results in code that’s so generalized that it’s virtually impossible to understand. In these cases the duplicated version may be the lesser of two evils.
But be careful to make the distinction between “this code can’t be de-duplicated without making it worse” and “this particular attempt to de-duplicate this code made it worse”. Like all refactoring projects, sometimes you just need to try a few times before you land on something you’re happy with. And sometimes you just need to be careful not to go overboard.
Why the popular guidelines make little sense
It currently seems to be fashionable to hold the belief that developers apply DRY too eagerly. This hasn’t been my experience. The opposite has been my experience.
Claims that developers apply DRY too eagerly are often accompanied by advice to follow WET (“write everything twice”) or the “rule of three”, or “duplication is cheaper than the wrong abstraction”. Here’s why I think these popular guidelines make little sense.
Rule of three/”write everything twice”
Here’s my way of deciding whether to DRY up a duplication: Is the duplication very bad? Are we able to come up with a fix that’s better than the duplicated version and not worse? If so, then clean it up. If not, leave it alone.
Notice that my criteria do not include “does the duplication appear three times”? I can’t see how that could be among the most meaningful factors.
Imagine, for example, a piece of duplication in the form of three very simple and nearly-identical lines, grouped together in a single file. The file is an unimportant one which only gets touched a couple times a year, and no one needs to understand that piece of code as a prerequisite to understanding anything else.
Now imagine another piece of duplication. The duplication appears in only two places, but the places are distant from one another and therefore the duplication is hard to discover. The two places where the duplicated behavior appear are expressed differently enough that the code would elude detection by a code quality tool or a manual human search. The behavior is a vitally central and important one. It doesn’t get changed often enough that it stays at the top of everyone’s mind, but it gets changed often enough that there are lots of opportunities for divergences to arise. And the two places the behavior appears are brutally painful to keep in sync.
Given this scenario, why on earth would I choose to fix the triple-duplicate and leave the double-duplicate alone?
The rule of three and “write everything twice” (WET) make little sense. The number of times a piece of duplication appears is not the main factor in judging its harmfulness.
Duplication is cheaper than the wrong abstraction
This statement is repeated very frequently in the Ruby community, usually to discourage people from applying the DRY principle too eagerly.
I wish we would think about this statement more deeply. Why are we setting up such a strong a connection between duplication and abstractions? It strikes me as a non-sequitur.
And why are we imagining such a strong danger of creating the wrong abstraction? Do we not trust ourselves to DRY up a piece of code and end up with something good? And again, why does the result of our de-duplicating have to be an abstraction? I find it an illogical connection.
If we take out the word “abstraction” then the sentiment that remains is “duplicated code is better than a de-duplicated version that’s even worse”. In which case I of course agree, but the statement is so banal that it’s not even a statement worth making.
I think “duplication is cheaper than the wrong abstraction” is a statement devoid of any useful meaning, and one we should stop repeating.
How to fix instances of duplication
A duplication-removal project is just a special case of a refactoring project. (Remember, refactoring means “changing the structure of code without changing its behavior”). Any guidelines that apply to general refactoring projects also apply to de-duplication projects.
When de-duplicating, it helps to work in small, atomic units. If the refactoring was triggered by a need to make a behavior change, don’t mix the behavior change with the refactoring. Perform the refactoring either before implementing the change or after or both, not during. And when you reach the point when you’re no longer sure that your refactorings are an improvement, stop.
When I’m de-duplicating two pieces of code, it’s often not clear how the unification will be achieved. In these cases I like to make it my first step to make the duplicate pieces of code completely identical while still keeping them separate. Merging two subtly different pieces of code can be tricky but merging two identical pieces of code is trivial. So make them identical first, then merge.
You can find a lot of other great refactoring techniques in Martin Fowler’s book Refactoring: Improving the Design of Existing Code.
- Duplication exists when there’s a single behavior that’s specified in two or more places.
- The main reason duplication is bad is because it leaves a program susceptible to developing logical inconsistencies.
- Not all instances of duplication are equally dangerous. The severity of a piece of duplication can be judged based on its discoverability, overhead cost and traffic level.
- In order to decide whether an instance of duplication is worth fixing, consider the severity of the duplication. Also compare the duplicative code with the de-duplicated code, and only keep the “fixed” version if the fixed version is actually better.
- The rule of three/”write everything twice” makes little sense because it doesn’t take the factors into account that determine whether a piece of duplication is dangerous or innocuous. “Duplication is the wrong abstraction” makes little sense because it sets up a false dichotomy between duplication and “the wrong abstraction”.
- To get good at removing duplication, get good at refactoring.
- When attempting to remove an instance of duplication, it’s often helpful to make the duplicative code completely identical as a first step, and then merge the identical code as a second step.