A free Cloudflare Worker that routes around datacenter IP blocks (after three caching bugs)

Some upstream studio feeds block requests from datacenter IP ranges. Their CDN looks at the IP, sees AWS or Vultr or DigitalOcean, and either returns a 403 or strips the response down to a stub. This is a real defensive measure (people scrape catalogs for shady reasons all the time) and we cannot complain about it. But we still need the data.

The textbook answer is "use a residential proxy". That works, but it also costs roughly $0.30 per 1000 requests on the cheap end, and it puts another vendor in our critical path. For studios with smaller catalogs we can absorb the cost. For studios that change weekly, with thousands of pages to re-check, a residential proxy would dominate our infra bill.

The Cloudflare Worker version is free and feels almost like cheating.

A Worker is a small JavaScript function that runs on Cloudflare's edge network, the same edge that already sits in front of every PornBoxd request. We deployed a Worker called feedproxy with a single /fetch endpoint that takes a ?url= query parameter, fetches the upstream URL from a Cloudflare data center (which the upstream's WAF treats as ordinary traffic, not as our scraper), and pipes the response back. From PornBoxd's importer it looks like an ordinary internal call. From the upstream's perspective it looks like Cloudflare asking, which is fine.
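The whole thing is small. Here is a minimal sketch of the shape, not the production feedproxy code; the /fetch path and ?url= parameter are from above, everything else is illustrative:

```js
// Minimal sketch of the proxy idea (not the real feedproxy worker).
export default {
  async fetch(request) {
    const url = new URL(request.url);
    if (url.pathname !== "/fetch") {
      return new Response("not found", { status: 404 });
    }
    const target = url.searchParams.get("url");
    if (!target) {
      return new Response("missing ?url= parameter", { status: 400 });
    }
    // The upstream sees this request arriving from Cloudflare's edge,
    // not from our datacenter IP, so its WAF treats it as ordinary traffic.
    return fetch(target);
  },
};
```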

The free tier on Workers is generous. 100,000 requests per day. We do maybe 5,000 a day across all studios. So the cost is zero. The latency is one extra hop, which adds about 40 ms. The upstream sees a Cloudflare IP. Everyone is happy.

I shipped this in a hurry on a Saturday and then spent the next three days debugging caching bugs. None of them were Cloudflare's fault. All three were me being too clever.

Bug one. I set cacheEverything: true because hey, why not, lots of these feeds are stable for hours. The bug is that "everything" means everything, including 404s. A single dead URL would respond once with 404, get cached at the edge for an hour, and then every subsequent fetch of that same URL would also return 404 even after the upstream had recovered. The fix is to set the cacheTtlByStatus map so 404s and 5xx responses cache for at most 60 seconds, while 200 responses cache for an hour. Five minutes of work after I figured out what was happening. Three hours of work to figure out what was happening.
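For reference, the cf options after the fix looked roughly like this; the TTL values are a sketch of the policy described above, not necessarily the exact production numbers:

```js
// cf options after the fix: cacheEverything stays on, but errors age out fast.
async function fetchCached(target) {
  return fetch(target, {
    cf: {
      cacheEverything: true,
      cacheTtlByStatus: {
        "200-299": 3600, // healthy responses cache for an hour at the edge
        "404": 60,       // a dead URL stops poisoning the cache within a minute
        "500-599": 60,   // same for transient upstream errors
      },
    },
  });
}
```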

Bug two. Some upstreams return a transient 503 or 522 once in a while, just because their site is having a rough afternoon. Without retry logic, our importer would log the failure and move on, and we would be missing six scenes for the next 24 hours until the next cron tick. Added a retry-once-after-2-seconds path inside the worker for status codes 502, 503, 504, and 522. If the second call also fails, we surface the error to the importer like before. Very simple, very effective.
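Something like this, with the names invented for the post rather than copied from the worker:

```js
// Retry once, two seconds later, on transient upstream failures.
const RETRYABLE = new Set([502, 503, 504, 522]);

async function fetchUpstream(target, init) {
  let res = await fetch(target, init);
  if (RETRYABLE.has(res.status)) {
    // Give the upstream a moment to finish having its rough afternoon.
    await new Promise((resolve) => setTimeout(resolve, 2000));
    res = await fetch(target, init);
  }
  // If the second attempt also fails, the error surfaces to the importer as before.
  return res;
}
```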

Bug three was the one that embarrassed me. I had the worker reading its environment variables at module-load time, like this: const HOST_ALLOWLIST = env.HOST_ALLOWLIST.split(','). That pattern works in Node. It does not work in Workers: module-scope constants get re-instantiated on each cold start, and env only exists as a per-request argument to the handler, not as a module-scope global. So HOST_ALLOWLIST was sometimes the value from deploy time and sometimes undefined, depending on which isolate served the request. Fixed by reading env inside the request handler. Three lines.
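Roughly this shape; HOST_ALLOWLIST is the binding name from above, the handler around it is an illustrative sketch rather than the actual feedproxy code:

```js
// Sketch of the fix: read bindings from the per-request env argument,
// never at module scope.
export default {
  async fetch(request, env) {
    const allowlist = env.HOST_ALLOWLIST.split(",");
    const target = new URL(request.url).searchParams.get("url");
    let host;
    try {
      host = new URL(target).hostname;
    } catch {
      return new Response("bad ?url=", { status: 400 });
    }
    if (!allowlist.includes(host)) {
      return new Response("host not allowed", { status: 403 });
    }
    return fetch(target);
  },
};
```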

You can probably tell which of these I am most annoyed about. The third one. The first two are reasonable mistakes you could read about in any engineering blog. The third is the kind of mistake you make because you assumed Workers behave like Node.js and did not read the bindings docs carefully enough. Cloudflare's docs do tell you, in fairness. They just tell you on a different page from the one I was reading.

The proxy is now boring. We use it for VR Bangers, DarkRoom, VirtualTaboo, and FuckPassVR's detail-page enrichment. Free tier coverage is comfortable. I could deprecate the residential-proxy line item if I felt like it, although I am keeping that contract live as a fallback.