Enriching 1,350 actresses, shipping a feed, and rolling back a studio whose "VR" was its founders' initials
Sunday in JAV land. Four discrete shipping events between breakfast and dinner: three shipped clean, one was rolled back within hours because the studio in question turned out not to be a VR studio at all. Each one tells you something about how niche metadata pipelines really work, so I am putting them all in one post.
Step one: enriching the actress catalog from JavDatabase.com. The R18.dev SQL dump (our primary JAV source) has good per-scene data but thin per-actress data: names, slugs, scene counts, not much else. JavDatabase.com is an open, community-maintained wiki of around 53,000 idol pages, updated daily, and crucially it exposes blood type, birthday, and birth place fields that R18 omits. None of those fields are mind-blowing on their own, but blood type especially is a thing JAV viewers care about (it is a Japan thing; you can argue with the culture, but you cannot argue with the demand).
The enricher is a tiny Apify actor that takes our JAV actor list, looks up each name against JavDatabase, parses the profile field labels, and writes the fields back. We hit 1,350 enrichments out of 2,180 actresses, about 62%. The unmatched ones are typically performers with unusual transliterations or single-scene careers that never got a JavDatabase page. We will pick at the long tail with manual matching over time.
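For flavor, here is roughly what the lookup looks like. This is a sketch, not the actor's actual code: the URL pattern and the way the profile labels are rendered are assumptions, and the real thing runs inside the Apify SDK rather than bare requests.

```python
import requests
from bs4 import BeautifulSoup

# The three fields we care about, mapped to our own column names.
FIELDS = {"Blood Type": "blood_type", "Birthday": "birthday", "Birth Place": "birth_place"}

def enrich(name_slug: str) -> dict | None:
    # Hypothetical URL pattern -- the real actor resolves slugs differently.
    url = f"https://www.javdatabase.com/idols/{name_slug}/"
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        return None  # no page for this performer: part of the unmatched ~38%
    soup = BeautifulSoup(resp.text, "html.parser")
    out = {}
    # Assumption: profile facts are rendered as "<b>Label:</b> value" rows.
    for label, key in FIELDS.items():
        tag = soup.find("b", string=lambda s: s and s.strip().rstrip(":") == label)
        if tag and tag.next_sibling:
            out[key] = str(tag.next_sibling).strip(" :\n")
    return out or None
```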
Step two: the TMA Shopify feed. TMA (Tokyo-Hot's parent) ships its catalog through a standard Shopify JSON endpoint at /products.json. This is the kind of "you have to know it is there" find that keeps me checking every studio's robots.txt and sitemap on first onboarding, because plenty of studios serve their store data through Shopify without realizing they are also serving it to the public. 153 products in the feed; 126 mapped cleanly to our schema (the rest were merchandise, T-shirts and DVDs our catalog does not carry); of those 126, 49 were genuinely new scenes and 77 reconciled to existing R18.dev rows we already had. The reconciliation step is where the writer's tier-2 dedup logic earns its keep: same scene, two source URLs, one canonical row.
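The fetch side of this is almost insultingly simple; the partition into merch / reconciled / new is where the real logic lives. A rough sketch, with the storefront domain and product_type values as placeholders and match_existing standing in for the writer's tier-2 dedup:

```python
import requests

STORE = "https://shop.example-tma.com"  # hypothetical storefront domain

def fetch_products() -> list[dict]:
    # Public Shopify storefronts expose their catalog at /products.json;
    # 153 products fits comfortably in a single page at the 250-item limit.
    resp = requests.get(f"{STORE}/products.json", params={"limit": 250}, timeout=30)
    resp.raise_for_status()
    return resp.json()["products"]

def partition(products: list[dict], match_existing) -> tuple[list, list, list]:
    # match_existing stands in for the tier-2 dedup: it returns the canonical
    # R18.dev row for the same scene, or None if the scene is genuinely new.
    merch, reconciled, new = [], [], []
    for p in products:
        # Assumed product_type values; anything not a streaming video is merch.
        if p.get("product_type", "").lower() not in {"video", "vod"}:
            merch.append(p)       # T-shirts, physical DVDs, etc.
        elif match_existing(p) is not None:
            reconciled.append(p)  # same scene, second source URL, one canonical row
        else:
            new.append(p)
    return merch, reconciled, new
```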
Step three: V&R Planning, rolled back. This is the embarrassing one. V&R Planning is a Japanese studio in the R18.dev label list, and an automatic pipeline I built classifies each label as VR or non-VR based on whether the token "VR" appears somewhere in the studio name. V&R has the letters V and R in its abbreviation, so my classifier marked it VR and onboarded the catalog. 600 scenes imported.
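Reconstructing the failure mode (the production classifier is messier than this, but the normalization step is the culprit): strip punctuation before tokenizing, and "V&R" quietly becomes "VR".

```python
import re

def looks_like_vr(studio_name: str) -> bool:
    # Strip punctuation, uppercase, then look for a bare "VR" token.
    # Stripping (rather than replacing with a space) is exactly what lets
    # "V&R" collapse into "VR".
    normalized = re.sub(r"[^A-Za-z0-9 ]", "", studio_name).upper()
    return "VR" in normalized.split()

print(looks_like_vr("KMP VR"))        # True, and correct
print(looks_like_vr("V&R Planning"))  # True, and wrong: the 600-scene import
```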
I was browsing the new arrivals page that night and noticed something off. The first thumbnail looked like a regular non-VR title. The second one too. I pulled three random scenes and checked the resolution, the camera setup, the runtime. None of them were VR. V&R Planning, I now know, is the founders' initials. The studio shoots niche scat and bondage content in 2D. The "VR" had nothing to do with format.
Rolled back the import within two hours. 554 actually-imported rows were dropped (the rest of the 600 had failed dedup checks and never landed). Added a memory rule for myself: before onboarding any new label as VR, check at least three thumbnails for stereoscopic markers (dual-eye images, the egg-shape distortion, fisheye). Do not trust a name token. The memory file lives next to my other "I will not make this mistake again" notes.
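If you want the thumbnail check as code rather than a memory note, the crude version looks something like this. It only catches side-by-side dual-eye thumbnails, the threshold is a guess, and it is a supplement to eyeballing three thumbnails, not a replacement.

```python
from PIL import Image, ImageChops, ImageStat

def probably_stereoscopic(path: str, threshold: float = 12.0) -> bool:
    # In a side-by-side VR thumbnail the left and right halves are
    # near-identical, slightly offset views of the same scene.
    img = Image.open(path).convert("L").resize((256, 128))
    left = img.crop((0, 0, 128, 128))
    right = img.crop((128, 0, 256, 128))
    diff = ImageChops.difference(left, right)
    # Mean absolute pixel difference; low means the halves mirror each other.
    return ImageStat.Stat(diff).mean[0] < threshold
```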
Step four: Wikidata identifier sync. Wikidata exposes a "JAV idol identifier" property (P9781) that connects a Wikidata Q-number to a specific R18.dev actress id. With Q-numbers we can pull cross-source biographical data, link to Wikipedia where it exists, and one day ship richer actress detail pages. The sync is a weekly cron; it ran for the first time on Sunday, and we now have 1,111 of 2,180 actresses mapped to Q-numbers, 51% coverage. The other 49% are mostly performers without dedicated Wikipedia pages, which is the limiting factor.
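The cron body is essentially one SPARQL query against the public Wikidata endpoint. A sketch below; the User-Agent contact is a placeholder, and the join from external id back to our actress table happens afterwards in the writer.

```python
import requests

SPARQL = """
SELECT ?item ?extid WHERE {
  ?item wdt:P9781 ?extid .   # the identifier property described above
}
"""

def fetch_p9781_mapping() -> dict[str, str]:
    # Weekly cron body, more or less: one query against the public WDQS endpoint.
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": SPARQL, "format": "json"},
        headers={"User-Agent": "actress-sync/0.1 (contact: ops@example.com)"},
        timeout=60,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    # Map external id -> Q-number, e.g. {"12345": "Q98765432"}.
    return {r["extid"]["value"]: r["item"]["value"].rsplit("/", 1)[-1] for r in rows}
```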
Four shipping events, three good outcomes, one rolled back within hours. About what I would expect from a day of metadata-source plumbing. The cost of being wrong on Step 3 was two hours of cleanup, which is fine. The cost of being right on Steps 1, 2, and 4 is a meaningfully better catalog. Diversification pays.