The big infrastructure switchover
For the last few weeks at work I’ve been working on a project to move our AWS infrastructure off of Elastic Beanstalk and onto a more custom architecture using Ansible.
Today was the riskiest point of the project. The reason is that this last Saturday I performed the big switchover where I pointed our main DNS record at the new infrastructure, leaving our old infrastructure behind. Today, Monday, is the first day that the new infrastructure is being used in production.
A problem with faxes
Our system has a component that sends faxes. It’s a relatively “risky” feature because the successful transmission of a fax involves a background job, a cron job, and a third-party API. Anytime I mess with infrastructure changes, the fax component is the first area I test for breakage.
This morning, around open of business, I got an error notification email from Bugsnag. The error was one I had never seen before. It was an error about not being able to send a fax because the recipient number was missing.
It was probably (probably) the infrastructure
My initial reaction was of course to put two and two together. The fax component is particularly sensitive to infrastructure changes. Today is the first day after a big infrastructure change. Therefore, it’s highly likely that the infrastructure change caused the fax bug.
However, to my admitted surprise, the infrastructure change was not the cause of the bug. The fax component was working just as well as it always had. The never-seen-before error appeared today not because the infrastructure change broke something but because of coincidental timing.
The fax component was working just as well as it always had. The error would have appeared anytime the fax recipient number was missing. It just so happened that the first time the fax recipient number was ever missing was also the first day the new infrastructure was being used in production.
I know the difference between a likelihood and a certainty. Yet I allowed myself to get fooled because I was sloppy with my application of logic.
I should have slowed down and said, “Yes, the infrastructure change is almost certainly the reason for the new bug, but that’s not the same thing as it being certainly the reason for the bug. There is a non-zero chance that the culprit is something else.
And in this case the culprit was in fact something else.