← Field Notes

Devlog

A CHECK constraint widened in one migration. An older one kept tightening it back.

Migration 071 widened a CHECK list, updated 6 rows, and shipped clean. Four subsequent deploys failed because an older migration kept re-adding the narrower list.

By Mortimer CockburnMay 3, 20263 min read

Migration 071 added a new value to a CHECK constraint, updated six rows to use the new value, and shipped clean. The next deploy failed. So did the three after that.

I want to walk through the bug because the shape generalizes.

The setup. We have a database migration runner at apps/api/src/migrate.ts that walks every .sql file in apps/api/sql/ in name-sort order on every deploy. Each migration is supposed to be idempotent. Conventional wrappers: IF NOT EXISTS on CREATE TABLE, CREATE INDEX, ADD COLUMN. DO blocks for ADD CONSTRAINT so a re-run of an already-added constraint is a no-op. The whole runner is designed to be re-played from scratch, so a freshly-cloned dev DB and a 50-deploy-old prod DB go through the same .sql files in the same order and end up at the same shape.

Migration 071 (per_scene_token for the UnzipVR network's atomic-token affiliate links) did two things in one file:

Widened the CHECK constraint on studios.affiliate_link_mode to add the new value per_scene_token to the allowed list. Previous list was studio_landing | per_scene | per_scene_query_token. New list adds per_scene_token. Done via DROP CONSTRAINT then ADD CONSTRAINT inside a DO block.
Migrated data: UPDATE studios SET affiliate_link_mode = 'per_scene_token' WHERE name IN (vrbangers, arporn, blowvr, vrconk, vrb-gay, vrb-trans). Six rows.

Migration 071 ran clean on the first deploy. The 6 rows got the new value. The constraint accepted them. Done.

The next deploy fired. migrate.ts walked the sql/ directory in order: 001, 002, ..., 007 ... 048 ... 071. Migration 007 had been doing the same DROP CONSTRAINT / ADD CONSTRAINT pattern on studios_affiliate_link_mode_check, but with the original allowed list. Three values, no per_scene_token. When it re-ran on the freshly-deployed DB, it dropped the constraint (the one migration 071 had set up the previous deploy), then tried to re-add it with the old, narrower list, and ran face-first into the 6 UnzipVR rows holding per_scene_token. Constraint failed. Migration aborted. Deploy went red.

Three more deploys went red the same way before I noticed. Each time the symptom looked like a new deploy bug, because each deploy's logs only showed the failure of migration 007. The 071 widening happened on a previous deploy and then got silently undone on every subsequent deploy.

The fix is small. Migration 007 (and 048, which had the same defensive DROP+ADD against the same constraint) get their CHECK lists updated to include per_scene_token. Now when 007 re-runs on every deploy, the ADD CONSTRAINT step accepts the rows that 071 wrote. 071 itself becomes a no-op on the re-add (no constraint difference, same allowed values), which is exactly what an idempotent migration should do.

It's tempting to draw a different lesson: "stop re-running every migration on every deploy." That's the wrong lesson. Re-running migrations is the whole point of migrate.ts being a deploy-time idempotent. A freshly-cloned dev environment's CI run and a 71-migration-old prod environment have to go through the same code path. If 007 on prod runs in a different reality from 007 on a fresh DB, then the entire migration design has rotted.

The right lesson is replay safety. Every migration in the runner must produce the same final schema state regardless of how many times it has run before. For a CHECK constraint, that means any subsequent migration that widens the allowed list also has to mirror the widening forward into every prior DROP+ADD of the same constraint. Migration 071 took on a debt against migrations 007 and 048 the moment it added a new allowed value. The hotfix is paying that debt.

Memory rule, pinned: every CHECK-widening migration must mirror the new value forward into all prior DROP+ADD migrations of the same constraint. Otherwise the older migration replays on the next deploy and rejects the rows the newer one wrote. The bug class is silent (deploy fails, but the failure doesn't surface as "constraint mismatch" in any obvious place; it surfaces as whatever generic deploy-failure noise the CI pipeline emits) and ugly to debug because the failing migration looks fine in isolation.

A small consolation: migration 007 itself already documented this exact replay-safety pattern in a comment, for a different value. per_scene_query_token had been forward-mirrored into 007's CHECK list when it shipped under the same scenario in a previous round. I just forgot to do the same thing for per_scene_token two months later. The pattern was right there in the file I was modifying. I read past it.

Writing this down so I don't read past it next time.