Teaching the importer to merge duplicates (and a $0.15 lesson in humility)

Two things happened today. One is that the catalogue gained a real actor-merge tool, the kind a growing catalogue with overlapping studios genuinely needs. Every imported actor gets a content_hash derived from name + studio, so "Maria" at one studio and "Maria" at another are distinct rows by default (usually correct: same name, different person). But when they really are the same person, merging them has a subtle trap: if you just delete the loser rows, the very next scrape will re-import them via their content_hash and undo your merge. The fix is a merged_into column. Losers stay in place as redirect stubs; the importer sees the hash, reads merged_into, and routes new video links to the canonical winner. An admin UI at /admin/actors surfaces auto-grouped exact-name duplicates with side-by-side thumbnail strips for visual disambiguation, and one click merges them under proper row locks. A re-scrape proved the merge survives: I seeded a duplicate, merged it, re-imported, and the new video attached to the winner.
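For flavour, here is a minimal sketch of the importer-side lookup. The row shape and helper names are my illustration, not the real schema; the point is just that a loser row still answers for its old content_hash but forwards to the winner.

```ts
// Illustrative row shape (assumption, not the real schema).
interface ActorRow {
  id: number;
  contentHash: string;        // derived from name + studio at import time
  mergedInto: number | null;  // id of the canonical winner, or null
}

// Resolve a scraped content_hash to the canonical actor id,
// following any redirect stubs left behind by earlier merges.
function resolveCanonicalId(
  byHash: Map<string, ActorRow>,
  byId: Map<number, ActorRow>,
  hash: string,
): number | null {
  let row = byHash.get(hash) ?? null;
  while (row && row.mergedInto !== null) {
    row = byId.get(row.mergedInto) ?? null; // hop along the merge chain
  }
  return row?.id ?? null;
}
```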

The second thing was a painful lesson. Some studios publish release dates; some don't. For the ones that don't (looking at you, VR Bangers network, roughly 1,700 videos missing dates), I built a dedicated Apify actor to visit each video URL and scrape the displayed release date. The first test run cost $0.046 for four URLs, about 900× the expected baseline. Turns out Apify's default memory allocation is 4 GB; my actor only needs 256 MB, and nobody tells you that 4 GB × even a few seconds of runtime adds up fast. I fixed the memory tier and moved on, but then hit the real issue: the target site sits behind Cloudflare's managed challenge. A residential proxy brought the success rate up, but never to a level that made economic sense at 1,700-page scale. So I called it. The scraper is parked with a full post-mortem in the cost-optimisation playbook, and the backlog is going to get a manual browser userscript instead.
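The arithmetic behind that first surprise, with rough numbers: Apify bills compute as roughly memory-GB × hours, so an oversized memory tier multiplies every second of runtime. The exact per-unit rate depends on your plan; this is a back-of-envelope model, not their billing code.

```ts
// Assumption: cost scales with memory-GB × runtime-hours.
// Multiply by your plan's per-unit rate for a dollar figure.
function computeUnits(memoryGb: number, runtimeSeconds: number): number {
  return memoryGb * (runtimeSeconds / 3600);
}

// The same 60-second run at the default tier vs. the right-sized one:
console.log(computeUnits(4, 60));    // ~0.067 units at the 4 GB default
console.log(computeUnits(0.25, 60)); // ~0.004 units at 256 MB: a 16x saving
```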

In the quieter background: FreeOnes actor enrichment shipped. A separate Apify actor pulls bio, country, aliases, birthday, measurements, and avatar from each performer's public profile, and the writer gained a "tier-2.5" alias-matching step so future imports auto-link by stage name even if the scraper uses a different spelling (sketched after this paragraph). Roughly 61% hit rate (not every performer has a FreeOnes page), about $0.017 per 200-actor run at 256 MB, wired up as an admin button, an inline trigger, and an hourly sweep. The in-flight progress indicators on the admin panel also went in: every long-running processor now brackets its work between a "running" row at start and a completion update, so the admin dashboard can poll and show "import: 1m 47s" next to the Run button (also sketched below).
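The alias step, in spirit. The normalization rules here are assumptions on my part (the shipped matcher may differ); the idea is to case-fold, strip diacritics, and collapse punctuation before comparing stage names against known aliases.

```ts
// Assumed normalization for the "tier-2.5" alias match: decompose
// accented letters, drop combining diacritics, lowercase, and turn
// punctuation runs into single spaces so "Zoe-Doll" matches "Zoé Doll".
function normalizeName(name: string): string {
  return name
    .normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '') // drop combining diacritics
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, ' ')     // punctuation and extra spaces collapse
    .trim();
}

// True if a scraped stage name matches any known alias after normalization.
function matchByAlias(scraped: string, aliases: string[]): boolean {
  const target = normalizeName(scraped);
  return aliases.some((a) => normalizeName(a) === target);
}
```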
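And the bracket pattern behind the progress indicators, again as a sketch: the job_runs table and the Db interface are assumed names standing in for whatever the app actually uses.

```ts
// Assumed minimal async DB handle; swap in the app's real client.
interface Db {
  run(sql: string, params: unknown[]): Promise<void>;
}

// Bracket a long-running processor: insert a 'running' row up front,
// stamp it done (or failed) at the end, and let the dashboard poll
// job_runs to render elapsed time next to the Run button.
async function withProgress<T>(db: Db, job: string, work: () => Promise<T>): Promise<T> {
  const startedAt = Date.now();
  await db.run(
    `INSERT INTO job_runs (job, status, started_at) VALUES (?, 'running', ?)`,
    [job, startedAt],
  );
  try {
    const result = await work();
    await db.run(
      `UPDATE job_runs SET status = 'done', finished_at = ? WHERE job = ? AND started_at = ?`,
      [Date.now(), job, startedAt],
    );
    return result;
  } catch (err) {
    await db.run(
      `UPDATE job_runs SET status = 'failed', finished_at = ? WHERE job = ? AND started_at = ?`,
      [Date.now(), job, startedAt],
    );
    throw err;
  }
}
```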