Category Archives: Programming

Modeling legacy code behavior using science

When you want to understand what a legacy program you’re working on is supposed to do, what’s your first instinct? Often it’s to look at the code.

But unfortunately legacy code is often so convoluted and inscrutable that it’s virtually impossible to tell what the code is supposed to do just by looking at it.

In these cases it may seem that you’re out of luck. But fortunately you’re not.

Modeling

Due to what philosophers call the veil of perception, direct knowledge of anything in the world is impossible. There’s very little that we can know with absolute completeness and correctness.

We don’t have knowledge about how most things work, we only have a model of how it works. A model is a useful approximation to the truth. Our models may not be absolutely complete or correct but they’re useful enough to get the job done on a day-to-day basis.

Take a car, for example. I don’t have a complete understanding of how every part of a car works. And what I do know probably isn’t even 100% accurate. But my model is sufficiently comprehensive and accurate that I can operate my car without being regularly surprised and confounded by differences in the ways I expect my car to behave versus the way it actually behaves.

Let’s use another example, trees. My understanding of trees is fragmentary. But like my model of cars, my model of trees is sufficiently comprehensive and accurate that trees don’t regularly throw me surprises. I understand trees to be stationary, and so when I cross the street, I never worry that I may be struck by a speeding tree. Although I do know that wind can blow over dead trees and so I’m cautious about entering the woods on a windy day.

How models are developed

How did I develop my models of cars and trees?

Trees, unlike cars, are natural. I can’t look at a tree’s source code or schematics to gain an understanding of how trees work. Everything humans know about trees has been discovered by observation and experimentation—in other words, scientific inquiry. We observe that trees start off small and then grow large. We observe that deciduous trees lose their leaves in the winter and coniferous trees don’t. And so on.

We can of course also learn about trees by reading books and learning from teachers, but that’s not source material, it’s only a regurgitation of the products of someone else’s scientific inquiry.

Cars are a little different. Cars are man-made, so in addition to learning about cars through observation and experimentation, we can learn by, for example, reading the owner’s manual that came with the car. And in theory, we could look at engineering diagrams and talk with the people who designed the car in order to gain a direct understanding of how the car works straight from the source material.

Cars are also different from trees in the sense that much of the mechanics of a car are very self-evident. You can learn something about how a car works by taking apart its engine. Not so much with a tree.

Inspecting a car’s transmission is analogous to reading a line of code. The thing you’re looking at is, itself, an instruction. It can’t lie to you. You can be mystified by it and you can misunderstand its purpose, but the car part can’t outright lie to you because the car part is how the car works.

Modeling the mind

Before we connect all this back to legacy code, I want to share one more example of scientific modeling.

The human brain is a complex machine. So is a car, so is a tree, and so is a codebase. But unlike a car or a codebase, we can’t observe the mechanics of the brain directly, at least not yet, not very much. We can’t look at the brain’s source code or insert breakpoints. For the purposes of understanding how it works, a brain is much more like a tree than a car.

But cognitive scientists have still managed to learn a lot about how the brain works. Or, more precisely, cognitive scientists have gained an understanding of the behavior that the brain produces. They’ve gained an understanding of how the mind (the outward expression of the machinery of the brain) works. They’ve developed a model.

This model of the mind has been developed in the same way that any accurate model has been developed: through scientific inquiry. A scientist can’t have a chat with the brain’s designer or have a look at its schematics, but a scientist can compare the behavior of people with normal brains against people who have had certain parts of their brains excised, for example, and make inferences about the roles that those parts of the brain play.

(Scientists can make diagrams of how they believe the brain works, but remember, those diagrams aren’t source material. They’re not code. They’re only a documentation of our current best understandings based on our scientific inquiry.)

So: if scientists can develop a model of the behavior generated by the brain without having access to the source machinery of the brain, what can we learn from scientists about how to understand the behavior of legacy systems without having access to comprehensible code?

Applying scientific inquiry to software systems

If you haven’t already done so, I’d like to invite you to make a conscious distinction between two ways of learning the behavior of a software system. One way is the obvious way that everyone’s familiar with: reading the code. The other way is the way that many people have probably used informally quite a bit but may not have consciously put a name to so much, which is scientific inquiry.

A full instruction in scientific inquiry is of course outside the scope of this post, and I wouldn’t be qualified to give one anyway. The point of this post is to invite you to consciously realize that you can develop a usefully accurate model of a software system not just by reading its code, but by using the methods of scientific inquiry.

If you’re interested in learning more about the methods of science, I would recommend The Magic of Reality by Richard Dawkins and The Demon-Haunted World by Carl Sagan.

Takeaways

  • Direct knowledge of anything in the world is impossible due to the veil of perception. All we can do is develop models, which are useful approximations to the truth.
  • Models are developed through the process of scientific inquiry.
  • In addition to reading the code, a model of a software system can be developed by using the process of scientific inquiry.

Premature generalization

Most programmers are familiar with the concept of premature optimization and the reasons why it’s bad. As a reminder, the main reason premature optimization is bad is because it’s an effort to solve problems that probably aren’t real. It’s more economical to wait and observe where the performance bottlenecks are than to try to predict where the bottlenecks are going to be.

Perhaps fewer programmers are familiar with the idea of premature generalization, also known as the code smell Speculative Generality. Premature generalization is when you generalize a piece of code beyond its current requirements in anticipation of more general future requirements. In my experience it’s a very common mistake.

Premature generalization is bad for the same exact reason premature optimization is bad: because it’s an effort to solve problems that probably aren’t real.

Making a piece of code more general than it needs to be in anticipation of future needs might seem like a smart planning measure. If you can see that the code you’re writing will probably need to accommodate more use cases in the future, why not just make the code general enough now? That way you only have to write the code once.

When programmers do this they’re making a bet. Sometimes their bet is right and sometimes it’s wrong. In my experience, these sorts of bets are wrong enough of the time that you lose on average. It’s like betting $50 for a 10% chance at winning $100. If you were to do that 10 times, you’d spend $500 and win just once (on average), meaning you’ll have paid $500 to win $100.

It’s more economical to make your code no more general than what’s called for by today’s requirements and accept the risk that you might have to rework the code later to generalize it. This is also a bet but it’s a sounder one. Imagine a lottery system where you can either buy a ticket for $50 for a 10% chance of winning $100, or you can choose not to play and accept a 10% chance of getting fined $30. (I know it’s a weird lottery but bear with me.) If you buy a ticket ten times then on average you lose $400 because you’ve paid $500 to win $100. If ten times in a row you choose not to buy a ticket, then on average you get fined $30. So you’re obviously way better off with a policy of never buying the ticket.

Takeaways

  • Premature generalization is when you generalize a piece of code beyond its current requirements in anticipation of more general future requirements.
  • On average, premature generalization doesn’t pay. It’s more economical to write the code in such a way as to only accommodate today’s requirements and then only generalize if and when a genuine need arises.

Cohesion

Every codebase is a story. Well-designed programs tell a coherent, easy-to-understand story. Other programs are poorly designed and tell a confusing, hard-to-understand story. And it’s often the case that a program wasn’t designed at all, and so no attempt was made to tell a coherent story. But there’s some sort of story in the code no matter what.

If a codebase is like a story, a file in a codebase is like a chapter in a book. A well written-chapter will clearly let the reader know what the most important points are and will feature those important points most prominently. A chapter is most understandable when it principally sticks to just one topic.

The telling of the story may unavoidably require the conveyance of incidental details. When this happens, those incidental details will be put in their proper place and not mixed confusingly with essential points. If a detail would pose too much of a distraction or an interruption, it gets moved to a footnote or appendix or parenthetical clause.

A piece of code is cohesive if a) everything in it shares one single idea and b) it doesn’t mix incidental details with essential points.

Now let’s talk about ways that cohesion tends to get lost as well as ways to maintain cohesion.

How cohesion gets lost

Fresh new projects are usually pretty easy to work with. This is because a) when you don’t have very much code, it’s easier to keep your code organized, and b) when the total amount of code is small, you can afford to be fairly disorganized without hurting overall understandability too much.

Things get tougher as the project grows. Entropy (the tendency for all things to decline into disorder) unavoidably sets in. Unless there are constant efforts to fight back against entropy, the codebase grows increasingly disordered. The code grows harder to understand and work with.

One common manifestation of entropy is the tendency for developers to hang new methods onto objects like ornaments on a Christmas tree. A developer is tasked with adding a new behavior. He or she goes looking for the object that seems like the most fitting home for that behavior. He or she adds the new behavior, which doesn’t perfectly fit the object where it was placed, but the new code only makes the object 5% less cohesive, and it’s not clear where might be a better place for that behavior, so in it goes.

This ornament-hanging habit is never curtailed because no individual “offense” appears to be all that bad. This is the nature of entropy: disorder sets in not because anything bad was done but simply because no one is going out of their way to stave off disorder.

So, even though no individual change appears to be all that bad, the result of all these changes in aggregate is a surprisingly bad mess. The objects are huge. They confusingly mix unrelated ideas. Their essential points are obscured by incidental details. They’re virtually impossible to understand. They lack cohesion.

How can this problem be prevented?

How cohesion can be preserved

The first key to maintaining cohesion is to make a clear distinction between what’s essential and what’s incidental. More specifically, a distinction must be made between what’s essential and what’s incidental with respect to the object in question.

For example, let’s say I have a class called Appointment. The concerns of Appointment include, among other things, a start time, a client and some matters related to caching.

I would say that the start time and client are essential concerns of the appointment and that the caching is probably incidental. In the story of Appointment, start time and client are important highlights, whereas caching concerns are incidental details and should be tucked away in a footnote or appendix.

That explains how to identify incidental details conceptually but it doesn’t explain how to separate incidental details mechanically. So, how do we do that?

The primary way I do this is to simply move the incidental details into different objects. Let’s say for example that I have a Customer object with certain methods including one called balance.

Over time the balance calculation becomes increasingly complicated to the point that it causes Customer to lose cohesion. No problem: I can just move the guts of the balance method into a new object (a PORO) called CustomerBalance and delegate all the gory details of balance calculation to that object. Now Customer can once again focus on the essential points and forget about the incidental details.

Now, in this case it made perfect sense to recognize the concept of a customer balance as a brand new abstraction. But it doesn’t always work out this way. In our earlier Appointment example, for example, it’s maybe not so natural to take our caching concerns and conceive of them as a new extraction. It’s not particularly clear how that would go.

What we can do in these cases, when we want to move an incidental detail out of an object but we can’t put our finger on a befitting new abstraction, is we can use a mixin instead. I view mixins as a good way to hold a bit of code which has cohesion with itself but which doesn’t quite qualify as an abstraction and so doesn’t make sense as an object. For me, mixins usually don’t have standalone value, and they’re usually only ever “mixed in” to one object as opposed to being reusable.

(I could have said concern instead of mixin, but a) to me it’s a distinction without a meaningful difference, and b) concerns come along with some conceptual baggage that I didn’t want to bring into the picture here.)

So for our Appointment example, we could move the caching code into a mixin in order to get it out of Appointment so that Appointment could once again focus solely on its essential points and forget about its incidental details.

Where to put these newly-sprouted files

When I make an object more cohesive by breaking out its incidental details into new model file, you might wonder where I put that new file.

The short answer is that I put these files into app/models, with additional subfolders based on the meaning of the code.

So for the Appointment, I might have app/models/appointment.rb and app/models/scheduling/appointment_caching.rb, provided that the caching code is related specifically to scheduling. The rationale here is that the caching logic will only ever be relevant to scheduling whereas an appointment might be viewed in multiple contexts, e.g. sometimes scheduling and sometimes billing.

For the customer balance example, I might have app/models/customer.rb and app/models/billing/customer_balance.rb. Again, a customer balance is always a billing concern whereas a customer could be looked at through a billing lens or conceivably through some other sort of lens.

Note that even though appointment_caching.rb is a mixin or concern, I don’t put it in a concerns or mixins folder. That’s because I believe in organizing files by meaning rather than type. I find that doing so makes it easier to find what I want to find when I want to find it.

Takeaways

  • A piece of code is cohesive if a) everything in it shares single idea and b) it doesn’t mix incidental details with essential points.
  • Cohesion naturally erodes over time due to entropy.
  • The first key to maintaining cohesion is to make a clear distinction between what’s essential and what’s incidental.
  • Incidental details can be moved into either new objects or into mixins/concerns in order to help preserve cohesion.

Crisp boundaries

If you’re going to make a change to an area, you have to understand that area. If you don’t understand the area you’re changing very well, your lack of understanding might lead to you accidentally introducing a bug.

Well-written code is loosely coupled from the other pieces of code it touches. “Loosely coupled” means that if you have classes A and B which talk to each other, you can understand class A without having to know much about class B and vice versa.

Conversely, if A and B are tightly coupled, then you might have to understand both class A and class B just to understand class A. Tight coupling makes code harder to work with.

One aspect of loosely-coupled code is that it has crisp boundaries, which are the opposite of blurry boundaries. Here’s an example of a piece of code with blurry boundaries.

class Person
  def initialize(params)
    @name = params[:name]
  end
end

person = Person.new(params)

The only thing Person needs from the outside world is a name, but Person is accepting the broader params as an argument.

Looking outward from inside the Person class, we might wonder: what exactly is in params? Who knows! It could be anything.

The inclusion of params is a “leak” from the outside world into Person.

Looking the other direction, from outside Person inward, we might see Person.new(params) and wonder what exactly of params Person needs in order to do its job. Does Person need everything inside of params? Just some of it? Who knows! Could be anything.

Let’s contrast the blurry-boundary code above with the crisp-boundary code below.

class Person
  def initialize(name)
    @name = name
  end
end

person = Person.new(params[:name])

In this case, looking outward from inside the Person class, it’s clear that Person takes a name and that’s it.

And then looking in from outside, Person.new(params[:name]) makes it clear exactly what’s being sent to Person.

In order to make your classes and methods more understandable, keep your boundaries crisp by accepting the minimum amount of argument information necessary in order to get the job done.

Why I organize my tests by domain concept, not by test type

In Rails apps that use RSpec, it’s customary to have a spec directory with certain subdirectories named for the types of tests they contain: models, requests, system. The Minitest organization scheme doesn’t share the exact same nomes but it does share the custom of organizing by test type.

I would like to raise the question: Why do we do it this way?

To get at the answer to that question I’d like to ask a broader question: What’s the benefit of organizing test files at all? Why not just throw all the tests in a single directory? For me there are two reasons.

Reasons to organize test files into directories

Finding tests

Sometimes I’m changing a feature and I want to know where the tests are for that feature so I can change or extend the tests accordingly

When I’m making a change to a feature, I usually want to know where the tests are that are related to that feature so I can update or extend the tests accordingly. Or, if that feature doesn’t have tests, I want to know so, and with a reasonable degree of certainty, so that I don’t accidentally create new tests that duplicate existing ones.

Running tests in groups

If tests are organized into directories then they can be conveniently run in groups.

It is of course possible, at least in some frameworks, to apply certain tags to tests and then run the tagged tests as a group. But doing so depends on developers remembering to add tags. This seems to me like a fragile link in the chain.

I find directories to be better than tags for this purpose since it’s of course impossible to forget to put a file in a directory.

Test type vs. meaning

At some point I realized that if I organize my test files based on meaning rather than test type, it makes it much easier to both a) find the tests when I want to find them and b) run the tests in groups that serve my purposes. Here’s why.

Finding tests

When I want to find the tests that correspond to a certain feature, I don’t necessarily know a lot about the characteristics of those tests. There might be a test that matches the filename of the application code file that I’m working on, but also there might not be. I’m also not always sure whether the application code I’m working on is covered by a model test, a system test, some other type of test, some combination of test types, or no test at all. The best I can do is either guess, search manually, or grep for some keywords and hope that the results aren’t too numerous to be able to examine one-by-one.

If on the other hand the files are organized in a directory tree that corresponds to the tests’ meaning in the domain model, then finding the tests is easier. If I’m working in the application’s billing area, for example, I can look in spec/billing folder to see if the relevant tests are there. If I use a nested structure, I can look in spec/billing/payments to find tests that are specifically related to payments.

I don’t need to worry about whether the payments-related tests are model tests, system tests or some other type of tests. I can just look in spec/billing/payments and work with whatever’s there. (I do, however, like to keep folders at the leaf level with names like models, system, etc. because it can be disorienting to not know what types of tests you’re looking at, and also it can create naming conflicts if you don’t separate files by type.)

Running tests in groups

I don’t often find it particularly useful to, say, run all my model tests or all my system tests. I do however find it useful to run all the tests in a certain conceptual area.

When I make a change in a certain area and I want to check for regressions, I of course want to check in the most likely places first. It’s usually more likely that I’ve introduced a regression to a conceptually related area than a conceptually unrelated area.

To continue the example from above, if I make a change to the payments area, then I can run all the tests in spec/billing/payments to conveniently check for regressions. If those tests all pass then I can zoom out one level and run all the tests in spec/billing. This gives me four “levels” of progressively broader regression testing: 1) a single file in spec/billing/payments, 2) all the tests in spec/billing/payments, 3) all the tests in spec/billing, and 4) all the tests in the whole test suite. If I organize my tests by type, I don’t have that ability.

On breaking convention

I’m not often a big fan of diverging from framework conventions. Breaking conventions often results in a loss of convenience which isn’t made up for by whatever is gained by breaking convention.

But don’t mistake this break from convention with other types of breaks from conventions you might have seen. Test directory structure is a very weak convention and it’s not even a Rails convention, it’s a convention of RSpec or Minitest. And in fact, it’s not even a technical convention, it’s a cultural convention. Unless I’m mistaken, there’s not actually any functionality tied to the test directory structure in RSpec or Minitest, and so diverging from the cultural standard doesn’t translate to a loss of functionality. It’s virtually all upside.

Takeaways

  • The benefits of organizing tests into directories include to be able to find tests and to be able to run tests in groups.
  • Organizing tests by meaning rather than type makes it easier to find tests and to run them in groups in a way that’s more logical for the purpose of finding regressions.

Organizing Rails files by meaning

Every once in a while I come across the question “Where should I put my POROs in Rails?”

In order to answer this question, I would actually zoom out and ask a broader question: How should we organize our files in Rails in general?

Rails’ organizational limits

To some it might seem that this question already has an answer. Rails already gives us app/controllers for controllers, app/models for models, app/helpers for helpers and so on.

But after a person works with a growing Rails app for a while, it eventually becomes clear that Rails can only take you so far. The sheer quantity of code overwhelms Rails’ ability to help keep the code organized. It’s like piling pound after pound of spaghetti onto a single dinner plate. It only makes sense up to a certain point. Past that point the result is a mess. (This isn’t a criticism of Rails. It’s a natural fact of frameworks in general.)

A Rails codebase can grow both “horizontally” and “vertically”. Horizontal growth means adding more resources: more database tables, more model files, more controller files, etc. Rails can handle horizontal growth just fine, indefinitely.

Vertical growth means a growth in complexity. If the amount of domain logic in an application continues to grow but the number of controllers/models/etc. stays the same, then the result is that the individual files all grow. If the “fat models, skinny controllers” heuristic is followed, then the complexity accumulates in the model files. The result is huge models. These huge models are hard to understand because of their sheer size and because they lack cohesion, meaning that each model isn’t “about” one thing, but rather each model file is just a dumping ground for everything that might be loosely related to that model.

Common (poor) attempts to manage complexity growth

A common way to address the complexity problem is to split the code according to design patterns (decorators, builders, etc.) and put the files in folders that are named for the design patterns: app/decorators, app/builders and so on. The logic of this approach is that it’s a continuation of what Rails is already doing for us, which is to divide files by design pattern. At first glance it seems like a sensible approach.

However, I don’t think this approach does a very good job of addressing the problem of being able to find what we need to find when we need to find it. Here’s why.

Let’s say for example that I need to make a change to some billing-related logic. I know that the code I’m looking for has something to do with billing of course, but I might not know much else about the code I’m looking for. I have no idea whether the code I’m interested in might lie in app/models, app/decorators or anywhere else. I probably have a sense of whether the code is display-related (app/views), domain-logic-related (app/models) or related to the request/response lifecycle (app/controllers), but beyond that, I probably have no clue where the code is located. How could I?

When people try to extend Rails’ convention of dividing files by design pattern, they’re missing an important point. Decorators, builders, commands, queries, etc. are all different from each other, but they’re different from each other in a different way than models, views and controllers are different from each other.

Think of it this way. Imagine if instead of being divided into meat, produce, dairy, etc. sections a grocery store was organized by “things in boxes”, “things in plastic bags”, etc. The former is an essential difference while the latter is an incidental difference. Unless you know how everything is packaged, you won’t be sure where to look. The difference between models, views and controllers is like the difference between meat, produce and dairy. The difference between decorators, builders, commands, queries, etc. is more like the difference between how the items are packaged. Again, the former is essential while the latter is incidental.

Organizing by meaning

A better way to organize Rails code is by meaning. Instead of having one folder for each design pattern I have, I can have one folder for each conceptual area of my app. For example, if my app has a lot of billing code, I can have folders called app/models/billing, app/controllers/billing and so on. This makes it much easier to find a piece of code when I don’t know anything about the code’s structure, I only know about its meaning.

Regarding design patterns, I think design patterns are both overrated and overemphasized, at least in the Rails world. A lot of Rails developers seem to have the idea that every file they create must belong to some category: model, controller, worker, helper, decorator, service, etc. Maybe this is because in a vanilla Rails app, pretty much everything is slotted in to a category in some way. But there’s no logical reason that every piece of code has to fit into some design pattern. The plain old “object” is an extremely powerful and versatile device.

But what if everything in a Rails app is just plain old Ruby objects? Won’t the app lose structure? Not necessarily. Most objects represent models in the broad sense of the term “model”, which is that the code represents some aspect of the world in a way that’s easy to understand and work with. Therefore, the objects that comprise the app’s domain logic can go in app/models, organized hierarchically by domain concept. Plain old objects can sit quite comfortably in app/models alongside Active Record models.

Now let’s go all the way back to the original question: where should you put POROs in Rails?

The answer depends on how you organize your Rails code in general. It also depends on what you consider POROs to be. I consider most of my POROs to be models, so I put them in app/models.

Takeaways

  • Rails can only help with code organization when the amount of code is small. Past a certain point it’s up to you to impose your own structure.
  • If the aim is to be able to find the code you need to find when you need to find it, organizing by design pattern doesn’t help much if you don’t already know how the code is structured. Organizing the code by meaning is better.

Duplication

Duplication can pose serious maintenance problems to codebases. In addition to the potential harm caused by duplication itself, developers’ attempts to fix duplication can sometimes introduce new problems.

Certain popular approaches to addressing duplication exist, such as the rule of three and the refrain duplication is cheaper than the wrong abstraction.

I think these advice snippets treat duplication in an oversimplified way that doesn’t stand up to the nuanced reality of the problem of duplication. We’ll see exactly why shortly.

In this post I’ll show what duplication is, why it’s such a surprisingly complicated problem, and what can be done to address it.

We’ll start with a definition of duplication.

What duplication is

We could imagine that duplication could be defined as a piece of code that appears in two or more places. Indeed, this sounds like a very reasonable and accurate definition. But it’s actually wrong.

Here’s what duplication really is. Duplication is when there’s a single behavior that’s specified in two or more places.

Just because two identical pieces of code are present doesn’t necessarily mean duplication exists. And just because there are no two identical pieces of code present doesn’t mean there’s no duplication.

Two pieces of code could happen to be identical, but if they actually serve different purposes and lead separate lives, then they don’t represent the same behavior, and they don’t constitute duplication. To “DRY up” these identical-looking pieces of code would create new problems, like handcuffing two people together who need to walk in two different directions.

On the other hand, it’s possible for a single behavior to be represented in a codebase but with non-identical code. The way to tell if two pieces of code are duplicative isn’t to see if their code matches (although most of the time duplicative behavior and duplicative code do appear together.) The question that determines duplication is: if I changed one piece of code in order to meet a new requirement, would it be logically necessary to update the other piece of code the same way? If so, then the two pieces of code are probably duplicates of each other, even if their behavior is not achieved using the same exact syntax.

Why duplication is bad

The main reason duplication is bad is because it leaves a program susceptible to developing logical inconsistencies.

If a behavior is expressed in two different places in a program, and one of them accidentally doesn’t match the other, then the deviating behavior is necessarily wrong. (Or if the deviating happens to still meet its requirements, it only does so by accident.)

Another reason duplication can be bad is because it can pose an extra maintenance burden. It takes longer, and requires more mental energy, to apply a change to two areas of code instead of just one.

But not all instances of duplication are equally bad. Some kinds of duplication are more dangerous than others.

When duplication is more dangerous or less dangerous

There are three factors that determine the degree of harm of an instance of duplication: 1) how easily discoverable the duplication is, 2) how much extra overhead the presence of the duplication incurs, and 3) how much “traffic” that area receives, i.e. how frequently that area of code needs to be changed or understood. Let’s look at each of these factors more closely.

Discoverability

If there’s a piece of behavior that’s specified twice in the codebase, but the two pieces of code are only separated by one line, then there’s not a big problem because everyone is basically guaranteed to notice the problem. If someone updates one of the copies of the behavior to meet a new requirement, they’re very unlikely to accidentally miss updating the other one. You might call this the proximity factor.

If two pieces of duplicated behavior appear in different files in different areas of the application, then a “miss” is much more likely to occur, and therefore the duplication constitutes a worse problem.

Another quality that makes discovery of duplication easier is similitude. If two pieces of code look very similar, then their duplicity is more likely to be noticed than if the two pieces of code don’t look the same. You might call this the similitude factor.

If the proximity factor is bad (the pieces of duplicated code are at a great distance from each other) and/or if the similitude factor is bad (the duplication is obscured by the pieces of duplicated code not being similar enough to appear obviously duplicative) then it means the duplication is riskier.

Overhead

Some instances of duplication are easier to live with than others. Two short lines of very similar code, located right next to each other, are very easy to keep in sync with one another. Other types of duplication are much more deeply baked into the system and can cause a much bigger headache.

For example, if a piece of duplication exists as part of the database schema, that’s a much higher maintenance cost than a short code duplication. Instances of duplication that are big and aren’t represented by identical code can also be costly to maintain because, in those cases, you can’t just type the same thing twice, you have to perform a potentially expensive translation step in your head.

Traffic level

Duplication is a type of “bad code”, and so principles that apply to bad code apply to duplication as well. One of these principles is that bad code in heavily-trafficked areas costs more than bad code in lightly-trafficked areas.

When considering how much a piece of bad code costs, it’s worth considering when that cost is incurred. When a piece of bad code incurs a cost, we might think of this as analogous to paying a toll on a toll road.

One tollway is when a piece of code is changed. The more frequently the code is changed, the more of a toll it’s going to incur, and so the bigger a problem it is.

Another tollway is when a piece of code needs to be understood as a prerequisite to understanding a different piece of code. Every codebase has “leaf code” and “branch code”. If a piece of code is leaf code, as in nothing depends on it, then we can afford for that code to be pretty bad and it doesn’t matter. Branch code, on the other hand, gets heavy intellectual traffic, and so incurs a higher toll, and so is a bigger problem.

How to decide whether to DRY up a piece of code or to keep the duplication

The way to decide whether or not to DRY up a piece of duplication is pretty simple, although it’s not easy. There are two factors to consider.

Severity

If a piece of duplication is “severe”—i.e. it has low discoverability, poses high overhead, and/or has a high traffic level—it should probably be fixed. If not, it should probably be left alone.

Quality of alternative

Just because a piece of duplication costs something doesn’t automatically mean that the de-duplicated version costs less. It doesn’t happen very often, but sometimes a de-duplication unavoidably results in code that’s so generalized that it’s virtually impossible to understand. In these cases the duplicated version may be the lesser of two evils.

But be careful to make the distinction between “this code can’t be de-duplicated without making it worse” and “this particular attempt to de-duplicate this code made it worse”. Like all refactoring projects, sometimes you just need to try a few times before you land on something you’re happy with. And sometimes you just need to be careful not to go overboard.

Why the popular guidelines make little sense

It currently seems to be fashionable to hold the belief that developers apply DRY too eagerly. This hasn’t been my experience. The opposite has been my experience.

Claims that developers apply DRY too eagerly are often accompanied by advice to follow WET (“write everything twice”) or the “rule of three”, or “duplication is cheaper than the wrong abstraction”. Here’s why I think these popular guidelines make little sense.

Rule of three/”write everything twice”

Here’s my way of deciding whether to DRY up a duplication: Is the duplication very bad? Are we able to come up with a fix that’s better than the duplicated version and not worse? If so, then clean it up. If not, leave it alone.

Notice that my criteria do not include “does the duplication appear three times”? I can’t see how that could be among the most meaningful factors.

Imagine, for example, a piece of duplication in the form of three very simple and nearly-identical lines, grouped together in a single file. The file is an unimportant one which only gets touched a couple times a year, and no one needs to understand that piece of code as a prerequisite to understanding anything else.

Now imagine another piece of duplication. The duplication appears in only two places, but the places are distant from one another and therefore the duplication is hard to discover. The two places where the duplicated behavior appear are expressed differently enough that the code would elude detection by a code quality tool or a manual human search. The behavior is a vitally central and important one. It doesn’t get changed often enough that it stays at the top of everyone’s mind, but it gets changed often enough that there are lots of opportunities for divergences to arise. And the two places the behavior appears are brutally painful to keep in sync.

Given this scenario, why on earth would I choose to fix the triple-duplicate and leave the double-duplicate alone?

The rule of three and “write everything twice” (WET) make little sense. The number of times a piece of duplication appears is not the main factor in judging its harmfulness.

Duplication is cheaper than the wrong abstraction

This statement is repeated very frequently in the Ruby community, usually to discourage people from applying the DRY principle too eagerly.

I wish we would think about this statement more deeply. Why are we setting up such a strong a connection between duplication and abstractions? It strikes me as a non-sequitur.

And why are we imagining such a strong danger of creating the wrong abstraction? Do we not trust ourselves to DRY up a piece of code and end up with something good? And again, why does the result of our de-duplicating have to be an abstraction? I find it an illogical connection.

If we take out the word “abstraction” then the sentiment that remains is “duplicated code is better than a de-duplicated version that’s even worse”. In which case I of course agree, but the statement is so banal that it’s not even a statement worth making.

I think “duplication is cheaper than the wrong abstraction” is a statement devoid of any useful meaning, and one we should stop repeating.

How to fix instances of duplication

A duplication-removal project is just a special case of a refactoring project. (Remember, refactoring means “changing the structure of code without changing its behavior”). Any guidelines that apply to general refactoring projects also apply to de-duplication projects.

When de-duplicating, it helps to work in small, atomic units. If the refactoring was triggered by a need to make a behavior change, don’t mix the behavior change with the refactoring. Perform the refactoring either before implementing the change or after or both, not during. And when you reach the point when you’re no longer sure that your refactorings are an improvement, stop.

When I’m de-duplicating two pieces of code, it’s often not clear how the unification will be achieved. In these cases I like to make it my first step to make the duplicate pieces of code completely identical while still keeping them separate. Merging two subtly different pieces of code can be tricky but merging two identical pieces of code is trivial. So make them identical first, then merge.

You can find a lot of other great refactoring techniques in Martin Fowler’s book Refactoring: Improving the Design of Existing Code.

Takeaways

  • Duplication exists when there’s a single behavior that’s specified in two or more places.
  • The main reason duplication is bad is because it leaves a program susceptible to developing logical inconsistencies.
  • Not all instances of duplication are equally dangerous. The severity of a piece of duplication can be judged based on its discoverability, overhead cost and traffic level.
  • In order to decide whether an instance of duplication is worth fixing, consider the severity of the duplication. Also compare the duplicative code with the de-duplicated code, and only keep the “fixed” version if the fixed version is actually better.
  • The rule of three/”write everything twice” makes little sense because it doesn’t take the factors into account that determine whether a piece of duplication is dangerous or innocuous. “Duplication is the wrong abstraction” makes little sense because it sets up a false dichotomy between duplication and “the wrong abstraction”.
  • To get good at removing duplication, get good at refactoring.
  • When attempting to remove an instance of duplication, it’s often helpful to make the duplicative code completely identical as a first step, and then merge the identical code as a second step.

The difference between procs and lambdas in Ruby

Note: before starting this post, I recommend reading my other posts about procs and closures for background.

Overview

What’s the difference between a proc and a lambda?

Lambdas actually are procs. Lambdas are just a special kind of proc and they behave a little bit differently from regular procs. In this post we’ll discuss the two main ways in which lambdas differ from regular procs:

  1. The return keyword behaves differently
  2. Arguments are handled differently

Let’s take a look at each one of these differences in more detail.

The behavior of “return”

In lambdas, return means “exit from this lambda”. In regular procs, return means “exit from embracing method”.

Below is an example, pulled straight from the official Ruby docs, which illustrates this difference.

def test_return
  # This is a lambda. The "return" just exits
  # from the lambda, nothing more.
  -> { return 3 }.call

  # This is a regular proc. The "return" returns
  # from the method, meaning control never reaches
  # the final "return 5" line.
  proc { return 4 }.call

  return 5
end

test_return # => 4

Argument handling

Argument matching

A proc will happily execute a call with the wrong number of arguments. A lambda requires all arguments to be present.

> p = proc { |x, y| "x is #{x} and y is #{y}" }
> p.call(1)
 => "x is 1 and y is "
> p.call(1, 2, 3)
 => "x is 1 and y is 2"
> l = lambda { |x, y| "x is #{x} and y is #{y}" }
> l.call(1)
(irb):5:in `block in <main>': wrong number of arguments (given 1, expected 2) (ArgumentError)
> l.call(1, 2, 3)
(irb):14:in `block in <main>': wrong number of arguments (given 3, expected 2) (ArgumentError)

Array deconstruction

If you call a proc with an array instead of separate arguments, the array will get deconstructed, as if the array is preceded with a splat operator.

If you call a lambda with an array instead of separate arguments, the array will be interpreted as the first argument, and an ArgumentError will be raised because the second argument is missing.

> proc { |x, y| "x is #{x} and y is #{y}" }.call([1, 2])
 => "x is 1 and y is 2"
> lambda { |x, y| "x is #{x} and y is #{y}" }.call([1, 2])
(irb):9:in `block in <main>': wrong number of arguments (given 1, expected 2) (ArgumentError)

In other words, lambdas behave exactly like Ruby methods. Regular procs don’t.

Takeaways

  • In lambdas, return means “exit from this lambda”. In regular procs, return means “exit from embracing method”.
  • A regular proc will happily execute a call with the wrong number of arguments. A lambda requires all arguments to be present.
  • Regular procs deconstruct arrays in arguments. Lambdas don’t.
  • Lambdas behave exactly like methods. Regular procs behave differently.

Why global variables are bad

If you’ve been programming for any length of time, you’ve probably come across the advice “don’t use global variables”.

Why are global variables so often advised against?

The reason is that global variables make a program less understandable. When you’re looking at a piece of code that uses a global variable, you don’t know if you’re seeing the whole picture. The code isn’t self-contained. In order to understand your piece of code, you potentially have to venture to some outside place to have a look at some other code that’s influencing your code at a distance.

The key idea is scope. If a local variable is defined inside of a function, for example, then that variable’s scope is limited to that function. Nobody from outside that function can see or mess with that variable. As another example, if a private instance variable is defined for a class, then that variable’s scope is limited to that class, and nobody from outside that class can see or mess with the variable.

The broader a variable’s scope, the more code has to be brought into the picture in order to understand any of the code that involves that variable. If I have a function that depends on its own arguments and nothing else, then that function can be understood in isolation. All I need in order to understand the function (at least in terms of causes and effects, as opposed to conceptual understanding which may require outside context) is the code inside the function. If alternatively the function involves a class instance variable, for example, then I potentially need to look at the other places in the class that involve the instance variable in order to understand the behavior of the function.

The maximum scope a variable can have is global scope. In terms of understanding, a global variable presents the biggest burden and requires the most investigative work. That’s why global variables are so often cautioned against.

Having said that, it’s actually a little simplistic to say “global variables are bad”. It would be more precise to say “global variables are costly”. There are some scenarios where the cost of a global variable is worth the price. In those cases, the idea of a global variable could be said to be good because it’s less costly than the alternatives.

But in the vast majority of cases, it’s good to keep the scope of variables as small as possible. The smaller the scopes of your variables are, the more it will aid the understandability of your code.

When good code is important and when it’s not

Tollways

All code has a maintenance cost. Some code of course is an absolute nightmare to maintain. We would say its maintenance cost is high. Other code is easier to maintain. We would say its maintenance cost is low, or at least relatively low compared to worse code.

When thinking about good code and bad code, it’s worth considering when exactly code’s maintenance cost is incurred. We might refer to these events as “tollways”. We can’t travel these roads without paying a toll.

Tollway 1: when the code is changed

For any particular piece of code, a toll is incurred every time that code needs to be changed. The size of the toll depends on how easy the code is to understand and change.

Tollway 2: when the code needs to be understood in order to support a change in a different area

Even if a piece of code doesn’t need to be changed, the code incurs a toll whenever someone it needs to be understood in order to make a different change. This dependency of understanding happens when pieces of code are coupled via inheritance, passing values in methods, global variables, or any of the other ways that code can be coupled.

We could put code into two categories: “leaf code”, which depends on other code but has no dependencies itself, and “branch code”, which does have dependencies, and may or may not depend on other code. Branch code incurs tolls and leaf code doesn’t.

Good code matters in proportion to future tollway traffic

When any new piece of code is added to a codebase, it may be possible to predict the future “tollway traffic” of that code.

Every codebase has some areas that change more frequently than others. If the code you’re adding lies in a high-change area, then it’s probably safe to predict that that code will have high future tollway traffic. On average it’s a good investment to spend time making this code especially good because the upfront effort will get paid back a little bit every time the code gets touched in the future.

Conversely, if there’s a piece of code that you have good reason to believe will change infrequently, it’s less important to make this code good, because the return on investment won’t be as great. (If you find out that your original prediction was wrong, it may be wise to refactor the code so you don’t end up paying more in toll fees than you have to.)

If there’s a piece of code that’s very clearly “branch code” (other code depends on it) then it’s usually a good idea to spend extra time to make sure this code is easily understandable. Most codebases have a small handful of key models which are touched by a large amount of code in the codebase. If the branch code is sound, it’s a great benefit. If the branch code has problems (e.g. some fundamental concept was poorly-named early on) then those problems will stretch their tentacles throughout the codebase and cause very expensive problems.

On the other hand, if a piece of code can be safely identified as leaf code, then it’s not so important to worry about making that code super high-quality.

But in general, it’s hard to predict whether a piece of code will have high or low future tollway traffic, so it’s good to err on the side of assuming high future tollway traffic. Rarely do codebases suffer from the problem that the code is too good.

Bad reasons to write bad code

It’s commonly believed that it’s wise to take on “strategic technical debt” in order to meet deadlines. In theory this is a smart way to go, but in practice it’s always a farce. The debt gets incurred but then never paid back.

It’s also a mistake to write crappy code because “users don’t care about code”. Users obviously don’t literally care about code, but users do experience the symptoms of bad code when the product is full of bugs and the development team’s productivity slows to a crawl.

Takeaways

  • A piece of code incurs a “toll” when it gets changed or when it needs to be understood in order to support a change in a different piece of code.
  • The return on investment of making a piece of code good is proportionate to the future tollway traffic that code will receive.
  • Predicting future tollway traffic is not always easy or possible, but it’s not always impossible either. Being judicious about when to spend extra effort on code quality or to skip effort is more economical than indiscriminately writing “medium-quality” code throughout the entire codebase.