Why migrations matter

Once an org reaches a certain age, migrations need to become a core competency.
Learning how to deal with the long tail is a skill in its own right.
This also suggests that any platform work should focus on extensible, modifiable, replaceable platforms.
As the complexity of a system increases, the accuracy of any single agent’s own model of that system decreases rapidly.

We are prisoners of the present, in perpetual transition from an inaccessible past to an unknowable future.

Migrations matter because they are usually the only available avenue to make meaningful progress on technical debt.

Engineers hate technical debt. If there is an easy project they can personally do to reduce tech debt, they’ll take it on themselves. Engineering managers hate technical debt, too. If there is an easy project their team can execute in isolation, they’ll get it scheduled. In aggregate, this leads to a dynamic where there is very little low-hanging fruit to reduce technical debt, and most remaining options require many teams working together to implement them: migrations.

Each migration aims to create technical leverage (“your indexes no longer have to fit on a single server!”) or reduce technical debt (“your acknowledged writes are guaranteed to persist through a master failover”). Migrations occupy the awkward territory of reduced immediate contribution today in exchange for more capacity tomorrow. This makes them controversial to schedule, and as your systems become larger, they become more expensive.

Googlers have a phrase, “Running to stand still”, to describe a team whose entire capacity is consumed in upgrading dependencies and patterns, such that it can’t make forward progress on the product/system they own. Spending all your time on migrations is extreme.

Migrations are the only mechanism to effectively manage technical debt as your company and code grow. If you don’t get effective at software and system migrations, you’ll end up languishing in technical debt. (And you’ll still have to do a migration later anyway; it’ll just probably be a full rewrite.)

Running good migrations

The good news is that while migrations are hard, there is a pretty standard playbook that works remarkably well: Derisk, Enable, and then Finish.

Derisk

The first phase of a migration is derisking it, and doing so as quickly and cheaply as possible. Write a design document and shop it with the teams that you believe will have the hardest time migrating. Iterate. Shop it with teams who have atypical patterns and edge cases. Iterate. Test it against the next six to twelve months of roadmap. Iterate.

After you’ve evolved the design, the next step is to embed into the most challenging one or two teams, and work side by side with those teams to build, evolve and migrate to the new system. Don’t start with the easiest migrations, which can lead to a false sense of security.

Effective derisking is essential, because each team that endorses a migration is making a bet on you that you’re going to get this damn thing done, and not leave them with a migration to an abandoned system that they have to revert. If you leave one migration partially finished, folks will be exceedingly suspicious of participating in the next.

Enable

Once you’ve validated that the solution solves the intended problem, it’s time to start sharpening your tools. Many folks start migrations by generating tracking tickets for teams to implement, but it’s better to slow down and build tooling to programmatically migrate the 90%. This radically reduces the migration’s cost to the broader organization, which increases its chance of success and creates more future opportunities to migrate.
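To make "programmatically migrate the 90%" concrete, here is a minimal sketch of a codemod. It assumes the migration is a simple call-site rename; the names `legacy_client.fetch` and `new_client.get` are hypothetical and stand in for whatever your old and new interfaces are. The point is the split it produces: files the tool can rewrite on its own, and files that still need a human.

```python
import re

# Hypothetical names: `legacy_client.fetch(...)` is the old call and
# `new_client.get(...)` its replacement. Neither comes from a real library.
LEGACY_CALL = re.compile(r"\blegacy_client\.fetch\(")


def migrate_source(source: str) -> tuple[str, int]:
    """Rewrite direct legacy call sites; return (new_source, rewrite_count)."""
    return LEGACY_CALL.subn("new_client.get(", source)


def plan_migration(files: dict[str, str]) -> dict[str, list[str]]:
    """Sort files into those the codemod fully handles and those it cannot."""
    plan: dict[str, list[str]] = {"automatic": [], "manual": []}
    for path, source in sorted(files.items()):
        migrated, count = migrate_source(source)
        if "legacy_client" in migrated:
            # Something is left that the pattern does not understand
            # (aliasing, indirect use, ...): leave it for a person.
            plan["manual"].append(path)
        elif count:
            plan["automatic"].append(path)
    return plan
```

A real codemod would more likely work on a syntax tree than on regular expressions, but even a crude pass like this tells you how big the manual remainder really is before you file a single ticket.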

Once you’ve handled as much of the migration programmatically as possible, figure out the self-service tooling and documentation you can provide to allow folks to make the necessary changes without getting stuck. The best migration tools are incremental and reversible: folks should be able to immediately return to previous behavior if something goes wrong, and have the necessary expressiveness to derisk their particular migration path.
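One way to get "incremental and reversible" is a percentage rollout with a fallback to the old path. The sketch below is an illustration under assumptions, not a prescription: `new_path` and `legacy_path` are hypothetical callables for the two systems, and `percent` is a setting each team controls for itself.

```python
import hashlib


def in_rollout(key: str, percent: int) -> bool:
    """Deterministically map `key` to a bucket in 0..99 and compare to `percent`."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def call_with_fallback(key, percent, new_path, legacy_path):
    """Send `key` to the new system if it is in the rollout, else to the old one.

    Setting `percent` back to 0 immediately restores the previous behavior.
    """
    if in_rollout(key, percent):
        try:
            return new_path(key)
        except Exception:
            # Assumption: falling back is safe for this operation. For
            # non-idempotent writes you would want something stricter.
            return legacy_path(key)
    return legacy_path(key)
```

Because the bucket is derived from the key, the same key always lands on the same side for a given percentage, and raising the percentage only ever adds keys to the new path. That lets a team move at its own pace and derisk its own edge cases.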

Documentation and self-service tooling are products, and thrive under the same regime: sit down with some teams and watch them follow your instructions, then improve them. Find another team, repeat. Spending an extra two days intentionally making your documentation clean and tools intuitive can save years in large migrations. Do it!

Finish

The last phase of a migration is deprecating the legacy system you’ve replaced. This requires getting to 100% adoption, and that can be quite challenging.

Start by stopping the bleeding, which means ensuring that all newly written code uses the new approach. That can mean installing a ratchet in your linters, or updating your documentation and self-service tooling. This is always the first step, because it turns time into your friend. Instead of falling behind by default, you’re now making progress by default.
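A ratchet is simple enough to sketch in a few lines. The version below assumes you can count legacy usages with a plain substring match on a hypothetical marker, `legacy_client`, and that the current baseline is stored somewhere your build can read; both are assumptions for illustration.

```python
def count_legacy_usages(sources: dict[str, str], marker: str = "legacy_client") -> int:
    """Count occurrences of the (hypothetical) legacy marker across all files."""
    return sum(source.count(marker) for source in sources.values())


def check_ratchet(current_count: int, baseline: int) -> tuple[bool, int]:
    """Return (passed, new_baseline). The baseline is only ever allowed to go down."""
    if current_count > baseline:
        # New legacy usage was introduced: fail the check, keep the baseline.
        return False, baseline
    # Same or fewer usages: pass, and tighten the baseline to the new count.
    return True, current_count
```

The check fails any change that adds a legacy usage and quietly tightens itself whenever someone removes one, which is what makes the count drift toward zero by default.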

OK, now you should start generating tracking tickets, and set up a mechanism that pushes migration status to teams that need to migrate and to the general management structure. It’s important to give wider management context around migrations because they are the folks who need to prioritize the migrations; if a team isn’t working on a migration, it’s typically because their leadership has not prioritized it.
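The status mechanism does not need to be fancy. As a rough sketch, assuming you already have an inventory of services with an owning team and a migrated flag (the `Service` record here is hypothetical), a per-team rollup sorted with the least-complete teams first is enough to paste into a weekly email or dashboard.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    # Hypothetical inventory record; adapt to whatever your service catalog holds.
    name: str
    team: str
    migrated: bool


def status_by_team(services: list[Service]) -> dict[str, tuple[int, int]]:
    """Return {team: (migrated_count, total_count)}."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for service in services:
        totals[service.team][1] += 1
        if service.migrated:
            totals[service.team][0] += 1
    return {team: (done, total) for team, (done, total) in totals.items()}


def format_report(services: list[Service]) -> list[str]:
    """One line per team, least-complete teams first (ties broken by name)."""
    rows = status_by_team(services)
    ordered = sorted(rows.items(), key=lambda kv: (kv[1][0] / kv[1][1], kv[0]))
    return [f"{team}: {done}/{total} migrated" for team, (done, total) in ordered]
```

Grouping the same rows by manager rather than by team gives leadership the view they need to prioritize.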

At this point you’re pretty close to complete, but you still have the long tail of weird or unstaffed work. Your tool now is to finish it yourself. It’s not necessarily fun, but getting to 100% is going to require the team leading the migration to dig into the nooks and crannies themselves.