Five OOM crashes in one day, then we left WHC for Vultr
PornBoxd ran on a 2 GB WebHostingCanada VPS for the first three weeks. WHC is fine, the support is human, the price is good. But the box was always close to its memory ceiling, and on Friday it crashed five times in a single day under nothing more than a catalog import overlapping with the GSC sync cron. The third crash I happened to be awake for. The fourth and fifth I was not. I woke up to the heartbeat alerts, brewed coffee, started writing the migration plan.
The math at our scale is straightforward. The API process needs about 300 MB at idle, jumps to 600 MB during a heavy aggregate query. The Next.js process is around 250 MB. Postgres comfortably uses 700 MB. The image processor cron, when it is actually running, can consume 400 MB depending on how many gallery images its 8-way pool is fetching. Add the OS, sshd, fail2ban, the usual collection of background daemons. The healthy floor is around 2.2 GB. The actual ceiling on the box was 2 GB. We were running on swap permanently and had several PM2 watchdog hacks to restart processes whenever they crossed 1.5 GB resident.
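(If you want to reproduce that survey on your own box, standard Linux tooling is enough; nothing below is specific to our stack.)

```bash
# Per-process resident memory, largest first. RSS is reported in KB.
ps -eo rss,comm --sort=-rss | head -n 10

# Whole-box picture; a shrinking "available" column is the early warning,
# and steady swap usage at idle means the floor is above the ceiling.
free -m
```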
I had been carrying around five OOM-safety tricks for weeks: a swap file, a memory limit on the API cluster, a cron_restart for the processor, a daily pm2 restart to clear Node heap leakage, and a flock around image processing to prevent concurrent runs. All of them existed because the box was too small. None of them would have been needed on a right-sized box.
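For concreteness, here is roughly what those five nets looked like. This is a reconstruction from memory rather than the actual config; the schedules, sizes, and names (api.js, processor.js, /srv/app) are illustrative.

```bash
# 1. Swap file, the classic too-small-VPS band-aid.
fallocate -l 2G /swapfile && chmod 600 /swapfile
mkswap /swapfile && swapon /swapfile

# 2. Memory cap on the API cluster; PM2 restarts any worker that crosses it.
#    1500M matches the watchdog threshold mentioned above.
pm2 start api.js -i 2 --max-memory-restart 1500M

# 3. Scheduled restart for the processor, as an ecosystem.config.js entry:
#      { name: "processor", script: "processor.js", cron_restart: "0 */6 * * *" }

# 4. Daily restart of everything to clear slow Node heap growth (crontab line):
#      0 5 * * * pm2 restart all

# 5. flock so an overlapping cron fire cannot start a second image-processing run:
#      */10 * * * * flock -n /tmp/imgproc.lock node /srv/app/process-images.js
```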
Vultr High Frequency 8 GB is $48 a month. For a hobby project that is more than I want to pay. For a project that has been crashing five times a day, $48 is cheap. We migrated on Saturday afternoon. The new box came with 8 GB of swap auto-provisioned (which we will never touch), we took the opportunity to upgrade Postgres from 14 to 16, and on the first cron pass with everything running concurrently we sat at 38% memory utilization. I removed all five OOM-safety tricks the next morning. The site has been quiet since.
The painful part of the migration was not the migration. On the first deploy after switching to the new box, my deploy script scp'd the latest source bundle into the prod working tree before running git pull, which broke the pull: the copy clobbered files I had modified but never committed during the migration, and Git refused to merge over them. The next deploy attempt overwrote those changes outright, and the one after that failed because the working tree was now in a state Git did not like. Four broken deploys before I figured out what was happening, three rolled back immediately, one caught by the smoke test. The fix is straightforward: never scp into a directory that has a .git/ in it. Use a staging path, and copy from it only after the pull.
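A minimal sketch of the corrected order of operations. The paths and names here are made up for illustration; the real script and layout are not in this post.

```bash
#!/usr/bin/env bash
set -euo pipefail

STAGE=/srv/deploy-stage   # scp target; contains no .git/
APP=/srv/app              # prod working tree; nothing but git touches it pre-pull

# 1. The build machine ships the bundle to the staging path, never into $APP:
#      scp -r dist/ deploy@prod:/srv/deploy-stage/

# 2. Pull while the working tree is still clean. --ff-only makes a dirty
#    tree or divergent history fail loudly instead of half-merging.
git -C "$APP" pull --ff-only

# 3. Only after the pull succeeds, copy the staged artifacts into place.
rsync -a "$STAGE/dist/" "$APP/dist/"
```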
Lessons people who have done a lot of operations work already know but I had to relearn:
When your safety nets all exist because of a single root cause, fix the root cause and remove the nets at the same time. The nets accumulate complexity that becomes its own bug surface.
Migration day is a bad day to also test your deploy script edge cases. Pin the deploy to a known-good path before you switch infrastructure.
$48 a month is not an expensive answer to "the production site keeps crashing". It is a very cheap answer. I should have done this two weeks earlier.