Our backup heartbeats lied for two days
The day after the Vultr migration was supposed to be the calm day. It was not.
I noticed the first sign on Monday morning when the sitemap regen logs had not ticked over since Saturday. Then I checked the GSC sync (should run every six hours) and that had not ticked either. Then the image processor. Then the backup script. Four separate cron jobs, all silent for two days. None of the heartbeat alerts had fired. The Healthchecks dashboard was a wall of green checkmarks for jobs that absolutely had not run.
What happened was banal. During the migration I had run a database script that, as a side effect of an ALTER it did not actually need, clobbered the OS-level crontab. The repo keeps ops/crontab.pornboxd as the declarative source, and the deploy script reinstalls it on every push, so the next deploy would have quietly repaired the damage without me ever noticing. Except I had not pushed since the migration, so the broken crontab stayed broken and none of the cron jobs ran.
That part is the easy lesson. Crontabs should be source-controlled and the deploy should reinstall them idempotently, which ours already did. The catch-up was straightforward: run the deploy, the deploy reinstalled the crontab, everything started ticking again, the world recovered.
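For reference, the deploy-side half is essentially one line. A minimal sketch, assuming a bash deploy script (the file path is the real one from our repo; the rest is illustrative):

# deploy.sh excerpt: reinstall the declarative crontab on every push.
# crontab(1) replaces the user's entire table, so rerunning it is
# naturally idempotent.
crontab ops/crontab.pornboxd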
The harder lesson is why my heartbeats did not catch the silence sooner.
The way our heartbeats worked was that each cron job ended with a curl https://hc-ping.com/<uuid> line on success, and Healthchecks would page me if a uuid stayed silent past its grace period. If a cron job did not run at all, no ping fired, and once the grace period elapsed Healthchecks paged. And it did, eventually; Healthchecks did exactly what it said it would. The problem was my grace periods. I had set them conservatively long for the bigger jobs (12 hours for backup, 8 hours for GSC sync) because some upstream APIs are slow on weekends. The migration broke cron on Saturday afternoon, the next pings were due Sunday morning, and those long grace windows meant nothing crossed into a paging condition until Sunday afternoon. Twenty-something hours of silent failure before the first alert.
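For concreteness, every job ended the way this simplified backup script does. The two work steps are hypothetical stand-ins; the final ping line is the real pattern:

#!/bin/sh
# ops/backup.sh, old shape (simplified; the work steps are illustrative)
pg_dump pornboxd > /tmp/backup.sql
rclone copy /tmp/backup.sql r2:pornboxd-backups/
# Last line: tell Healthchecks we are alive. It pages only when this
# URL stays silent past the job's grace period (12 hours for backup).
curl -sm 10 https://hc-ping.com/<uuid>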
The deeper bug, and the one I have already fixed, is that the backup script itself was not using set -euo pipefail plus an ERR trap for the heartbeat ping. So even on a box where cron WAS running, if any single step inside the backup script failed (a bad rclone token, a Postgres connection refused, a missing R2 bucket), execution would simply continue past the failure, because shell scripts default to keep going on error, and the last line would still curl the heartbeat URL. The heartbeat would fire green even though the backup had not actually backed anything up.
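The keep-going default is easy to demonstrate in isolation; nothing here is specific to our scripts:

#!/bin/sh
false                                  # a step that fails with status 1
echo "still running, prior status $?"  # executes anyway and prints 1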
I had been pinging "I am alive" without verifying the work was done. Healthchecks calls this "false-positive pinging" and warns about it in their docs. I had not read those docs. The fix is three lines at the top of every operational shell script:
set -euo pipefail
trap 'curl -sm 10 "https://hc-ping.com/$UUID/fail"' ERR
trap '[ $? -eq 0 ] && curl -sm 10 "https://hc-ping.com/$UUID"' EXIT
The first line makes any failed command terminate the script. The second pings a dedicated failure URL when a command errors. The third pings the success URL, but only on a clean exit: the EXIT trap also fires after an error, so it has to check the exit status first or it would report success right after reporting failure. Healthchecks has separate fail/success endpoints exactly because false-positive pinging is so common. The endpoints existed for years before I noticed.
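Put together, the hardened shape of the backup script looks like this sketch; pg_dump, rclone, and the r2: remote are hypothetical stand-ins consistent with the failure modes above:

#!/usr/bin/env bash
set -euo pipefail

UUID="<uuid>"   # this job's Healthchecks check id

# On any failing command: ping the fail endpoint, then set -e aborts.
trap 'curl -sm 10 "https://hc-ping.com/$UUID/fail"' ERR
# On exit: ping success only when exiting cleanly; the EXIT trap also
# fires after an error, so it checks the status first.
trap '[ $? -eq 0 ] && curl -sm 10 "https://hc-ping.com/$UUID"' EXIT

pg_dump pornboxd > /tmp/backup.sql
rclone copy /tmp/backup.sql r2:pornboxd-backups/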
Two takeaways. First, cron is supposed to be the most boring thing in your stack, and that is exactly when it bites you, because nobody is watching it. Second, my heartbeat regime was naive. Pinging a URL only tells the monitor that I made it to the end of the script; it does not tell the monitor that the work succeeded. There is a three-line fix and there is no excuse not to use it.