
740 gallery rows had been stuck for 24 hours, the queue was poisoned


The image processor is a cron job that runs every 15 minutes, picks up to 50 images that have not been processed yet (image_status='pending'), downloads each from the studio's URL, uploads a copy to our R2 bucket so we can serve it through Cloudflare's image transformation pipeline, and marks the row complete. Boring infrastructure. It had been running fine for weeks.
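For the shape of it, here is a minimal sketch of that loop. The table name, client interfaces, and helpers are stand-ins I invented for illustration; only image_status comes from the real schema:

```ts
// Hypothetical interfaces standing in for our SQL client and R2 binding.
interface Db {
  query<T>(sql: string, params?: unknown[]): Promise<T[]>;
  exec(sql: string, params?: unknown[]): Promise<void>;
}
interface Bucket {
  put(key: string, body: ArrayBuffer): Promise<void>;
}
interface PendingImage { id: number; source_url: string }

async function runBatch(db: Db, bucket: Bucket): Promise<void> {
  // Pick up to 50 rows that are still waiting on a copy.
  const batch = await db.query<PendingImage>(
    "SELECT id, source_url FROM gallery_images WHERE image_status = 'pending' ORDER BY id LIMIT 50",
  );
  for (const img of batch) {
    const res = await fetch(img.source_url); // download from the studio CDN
    if (!res.ok) continue;                   // leave the row pending for the next run
    await bucket.put(`images/${img.id}`, await res.arrayBuffer()); // mirror into R2
    await db.exec(
      "UPDATE gallery_images SET image_status = 'complete' WHERE id = ?",
      [img.id],
    );
  }
}
```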

On Monday evening I ran a count of pending rows out of habit and got 740. That number had not been higher than 50 in months. Something was wrong.

The processor logs told the story: each run was hitting its 45-minute timeout (the global RUN_TIMEOUT_MS ceiling) before draining the batch. Inside each run, every image fetch had a 30-second timeout. Two hundred backlogged downloads at 30 seconds each is 100 minutes if they run sequentially, which they did, and the cron only had 45. So the cron would die before finishing the batch, the unfinished rows would carry into the next 15-minute run, that run would not finish either, and the queue grew.
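Written out (RUN_TIMEOUT_MS and the 30-second fetch timeout are the real config values; 200 downloads is the backlog case above):

```ts
const RUN_TIMEOUT_MS = 45 * 60 * 1000; // global per-run ceiling: 2,700,000 ms
const FETCH_TIMEOUT_MS = 30 * 1000;    // old per-image fetch timeout

// 200 sequential downloads that each run out the clock:
const worstCaseMs = 200 * FETCH_TIMEOUT_MS; // 6,000,000 ms, i.e. 100 minutes

console.log(worstCaseMs > RUN_TIMEOUT_MS); // true: the run dies long before the batch drains
```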

The reason fetches were eating 30 seconds was not slow networks; it was that a handful of specific source URLs were dead. Some studio CDNs were returning 415 ("Unsupported Media Type"), one was returning 403 thanks to a token-auth issue I will get to in a second, and one was just gone. Each of those dead URLs would burn the full 30 seconds before declaring itself dead, because we had never configured a connection-vs-read timeout split. A batch with three dead URLs at the front had already spent 90 seconds of the run before any healthy URL got fetched. The queue was poisoned.
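For what it is worth, the split is cheap to get in Node through undici's dispatcher options. A sketch, assuming the processor runs on Node with undici available; the millisecond values are illustrative, not tuned:

```ts
import { Agent, fetch } from "undici";

const agent = new Agent({
  connect: { timeout: 3_000 }, // a dead host should fail in ~3s, not 30
  headersTimeout: 8_000,       // time allowed until response headers arrive
  bodyTimeout: 8_000,          // max idle time between body chunks
});

const res = await fetch("https://cdn.example.com/image.jpg", { dispatcher: agent });
```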

Two changes shipped that evening. First, the per-image timeout dropped from 30 seconds to 8. If an image fetch is going to fail it usually fails within the first second; we do not gain anything by waiting longer. Second, the processor switched from sequential await calls to a bounded-concurrency pool of 8 workers. Conceptually: imagine eight cashiers at the bank instead of one. A confused customer can hold up their cashier for at most 8 seconds before the next person steps up, and the other seven cashiers keep serving their own lines the whole time.
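A condensed sketch of both changes together; processImage stands in for the real download-and-upload step, and the names are mine, not the actual code:

```ts
const FETCH_TIMEOUT_MS = 8_000; // new per-image ceiling
const POOL_SIZE = 8;            // eight cashiers

async function processImage(url: string): Promise<void> {
  const res = await fetch(url, { signal: AbortSignal.timeout(FETCH_TIMEOUT_MS) });
  if (!res.ok) throw new Error(`fetch failed: ${res.status}`);
  // ...upload to R2, mark the row complete...
}

async function drain(urls: string[]): Promise<void> {
  const queue = [...urls];
  // Each worker pulls the next customer off the shared line until it is empty.
  const workers = Array.from({ length: POOL_SIZE }, async () => {
    for (let url = queue.shift(); url !== undefined; url = queue.shift()) {
      try {
        await processImage(url);
      } catch {
        // A dead URL costs this worker at most 8 seconds; the other seven keep moving.
      }
    }
  });
  await Promise.all(workers);
}
```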

Within one run after deploy, the queue went from 740 pending rows to under 50. I sat watching the count drop and felt the kind of relief you only get from fixing a problem you did not know was about to become an outage.

The token-auth misdiagnosis is a related side story worth telling. FuckPassVR's CDN had been throwing 403s on our image fetches for three weeks. I had assumed it was Referer-blocking (the CDN refusing requests that did not come from fuckpassvr.com). I tried setting the Referer header explicitly. Still 403. Tried User-Agent rotation. Still 403. Tried a residential-proxy fetch. Still 403. I was running out of theories.

What it actually was: their CDN answers untokenized image URLs with a 403, and you have to pull the tokenized URL, which carries a short-lived signed token, from the page's og:image meta tag instead. The bare URL is dead by design. We had been scraping the bare URL and getting the deserved 403. The fix took twenty minutes to write once I realized what was happening, and three weeks to realize what was happening.
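The fix, roughly; the function name is mine, and the regex is a deliberate shortcut (a real HTML parser would be sturdier):

```ts
async function tokenizedImageUrl(pageUrl: string): Promise<string | null> {
  const res = await fetch(pageUrl, { signal: AbortSignal.timeout(8_000) });
  if (!res.ok) return null;
  const html = await res.text();
  // <meta property="og:image" content="https://cdn...?token=..."> carries the signed URL.
  const match = html.match(
    /<meta[^>]+property=["']og:image["'][^>]+content=["']([^"']+)["']/i,
  );
  return match ? match[1] : null;
}
```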

I have a memory rule now: when a CDN returns 403, probe both the untokenized AND the tokenized URL before deciding what kind of block you are looking at. The first thing I check should not be "what is blocking us"; it should be "are we asking for the right thing".

The 740 rows drained inside an hour. The processor now reliably finishes a batch in under 12 minutes. There is one more lever I want to pull (a per-cron flock -n -E 0 so two instances of the processor can never overlap if one runs long), but the queue is healthy and the silent failure mode is gone.
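For reference, the crontab line I have in mind looks something like this (paths invented for the example). flock -n skips the run instead of queueing when the lock is held, and -E 0 makes that skip exit cleanly so cron does not treat it as a failure:

```
*/15 * * * * flock -n -E 0 /var/lock/image-processor.lock node /srv/app/process-images.js
```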