The motivation
About seven months ago I became aware of a slightly worrisome problem in the application I maintain at work.
The application is a medical application. Each patient in the system has a numeric, incremental account number like 4220. Due to coincidental timing of business and technology, the database ID of each patient is pretty close to the account number. The patient with account number 4220 might have a database ID of something like 4838.
You can imagine how this might cause confusion. Imagine wanting to check which patient you’re viewing and you see 4220 on the screen and patients/4838
in the URL. If you’re not paying close attention it can be confusing.
I brought this issue up with my boss. Turns out my boss had actually gotten tripped up on this issue himself. I brought up the option of switching from numeric IDs to UUIDs as a way to fix the problem and he agreed that it was a good idea. I also brought up my judgment that this would likely be a large, risky, time-consuming project, but we both agreed that the risks and costs associated with the UUID change were less than the risks and costs associated with leaving things the way they were.
The approach
Acknowledging the risk of the project, I decided to distribute the risk in small pieces over time so that no single change would carry too much risk.
Rather than trying to convert all my 87 tables to UUIDs at once, which would be way too risky, I decided to convert 1-10 tables at a time, starting with just one and ramping it up over time as the level of uncertainty decreased.
The planned cadence was once per week, spreading the work across 20-40 weeks, depending on how many tables could be converted in a batch. I applied each UUID change on a Saturday morning, although I didn’t start off doing them Saturdays. I was prompted to start doing it this way after one of the early migrations caused a problem. This brings me to my first lesson.
Lesson 1: apply UUID changes off-hours with a significant time buffer
My first mistake was that I didn’t properly calibrate my methodology to the true level of risk involved.
For one of the earlier UUID migrations I performed, I did it maybe an hour and a half before open of business while no one was doing anything. Unfortunately, something went wrong with the migration, and I didn’t have time to fully fix the problem (nor did I have time to roll it back) before people started using the system.
This incident showed me that I needed a bigger buffer time.
If I wanted to perform the migrations off-hours, my options were to a) perform the migrations on weekdays between about 10pm and 6am or b) perform the migrations on weekends. I decided to go with weekends.
In addition to adding buffer time, I added a way for me to conveniently put the app into “maintenance mode” which would block all users except me from using the app while maintenance mode was on.
Since the time I added these changes to my change process, there have been no UUID-related incidents.
Lesson 2: beware of subtle references
When a table’s IDs get converted to UUIDs, all the references to that table of course need to get updated too. A table called customers
needs to have every customer_id
column updated in order for the association to be preserved.
This is easy enough when the referencing column’s name matches the referenced table’s name (e.g. customer_id
matches customers
) and/or when a foreign key constraint is present, making it physically impossible to forget about the association. It’s harder when neither a matching column name nor a foreign key constraint exists to clue you in.
You know that incident I mentioned in Lesson 1? That incident was caused by my failure to detect active_storage_attachments.record_id
as a column that needed its values changed from numeric IDs to UUIDs. This caused a peculiar bug where most of the file uploads in the system started to appear on records to which they did not belong. Not always, just sometimes. The random behavior had do do with the fact that an expression like '123abc'.to_i
evaluates to 123
while an expression like 'abc123'.to_i
evaluates to 0
.
Anyway, the fix to that issue was conceptually straightforward enough once I knew what was happening, even if the fix was a little tedious.
At this point you might wonder whether good test coverage might have caught such a bug. I must have been taking stupid pills on the day this incident happened because, in addition to everything else I did wrong, I ran the test suite but didn’t bother to check the results before I deployed, even though I knew my change was particularly risky. If I had checked the test suite results I would have seen that there were a few failing tests related to file uploads. I haven’t made that mistake since.
Lesson 3: keep the numeric IDs around
For the first several tables I converted, it didn’t occur to me that I might want to keep the numeric IDs around for any reason. Two things that happened later showed me that it is in fact a good idea to keep the numeric IDs around.
First, keeping the numeric IDs makes for easier rollback. The incident from Lesson 1 could have been rolled back fairly trivially if I had just kept the numeric ID column.
Second, sometimes an incremental ID is handy. I recently built a feature that lets patients pay their statements online. The patients can find their statement by entering their statement ID. It’s not realistic to ask 80 year-old un-tech-savvy people to enter values like CB08B2
where they have to make the distinction between a zero and a letter O. So for that feature I used the numeric, sequential ID of the statements to show to patients.
Lesson 4: there are a lot of special cases
As of this writing I’ve converted 37 tables from numeric IDs to sequential IDs. These were done, again, in small batches of 1 to 10, usually more like 1 to 3.
Almost every batch presented me with some small, new, unique obstacle. Most of these obstacles were too small and uninteresting for it to be useful for me to list every one. For example, I learned on one batch that if a view (that is, database view, not Rails view) references the table I’m changing, then the view needs to be dropped before I make the UUID change and then re-created afterward. There was a lot of little stuff like that.
For this reason, I didn’t have much success with the gems that exist to assist with UUID migrations. Everything I did was custom. It wasn’t much code, though, mostly just a script that I used for each migration and added to each time I encountered a new issue. The current total size of the script is 91 lines.
Lesson 5: performance is fine (for me)
I’ve read some comments online that say UUIDs are bad for performance. Some cursory research tells me that, at least for PostgreSQL, this isn’t really true, at least not enough to matter.
I’ve also experienced no discernible performance hits as a result of converting tables from incremental primary keys to UUID primary keys.
Takeaways
If I had a time machine, here’s some advice I would give my past self.
- Apply the UUID changes off-hours with a big buffer.
- For each change, look really carefully for referencing columns.
- Keep the numeric IDs around.
- Expect to hit a lot of little snags along the way.
All in all this project has been worth it. Only one production incident was caused. The incident was recovered from relatively easily. The work hasn’t been that painful. And there’s a great satisfaction I get from looking at patient URLs and knowing they will never cause confusion with patient account numbers again.
Well written. Would be interested to if you could list all the small items you ran into, even if it was just a bullet list so I can keep these in mind.