
Fixing what Google sees

The catalog reports 20,024 video pages, 3,683 actor pages, 1,170 series pages, 45 studio pages, 269 directors, 33 makers, 9 blog posts. Grand total: 25,792 URLs, submitted to Google and waiting.

Google's current answer: 75 are indexed.

That's 0.3%. Pathological even by the low bar of adult affiliate catalogs, where active deprioritization is part of the terrain. The pages are real, imported, de-duped, cross-linked, schema'd. But if Google isn't seeing them, none of that matters.

So today was a day for visibility.

The Crawl Desk

The Search Console API has been wired up since early April, but nothing in PornBoxd actually consumed the data. The operator had to open Google's own UI to see anything, which is to say they rarely did. Today a full data layer and admin panel moved in.

The Crawl Desk, at /admin/insights/seo, is an editorial-terminal surface: per-sitemap coverage bars, a "why aren't pages indexed?" grouped accordion pulling Google's own reason codes, top-query tables with an "almost ranked" filter for queries sitting at position 8-20 (the highest-ROI optimization band), and a URL inspector that returns cached state for any URL pasted in.
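
A minimal sketch of the "almost ranked" filter, assuming cached search-analytics rows shaped like the Search Console API response. The 8-20 band is from the Crawl Desk; the row type and names here are illustrative, not the actual code:

```typescript
// Rows as cached from the GSC search analytics pull.
interface QueryRow {
  query: string;
  clicks: number;
  impressions: number;
  position: number; // average position, as GSC reports it
}

function almostRanked(rows: QueryRow[]): QueryRow[] {
  return rows
    .filter((r) => r.position >= 8 && r.position <= 20)
    // High demand plus near-page-one: the highest-ROI band to work on.
    .sort((a, b) => b.impressions - a.impressions);
}
```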

Behind it: a daily cron at 06:00 UTC pulls sitemap status and yesterday's search analytics (by query and by page), then samples ~1,500 URL inspections within the 2,000/day per-property quota: 80% of slots go to top-clicked URLs refreshed daily, 20% round-robin through the rest. The full catalog cycles in about fifty days while signal-rich pages stay hot.
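
A minimal sketch of that budget split. The ~1,500 budget and 80/20 ratio are from the cron as described; the function names and the persisted cursor are illustrative assumptions:

```typescript
const DAILY_BUDGET = 1500;   // stays under the 2,000/day property quota
const PRIORITY_SHARE = 0.8;

function pickInspectionBatch(
  topClicked: string[], // URLs sorted by recent clicks, refreshed daily
  allUrls: string[],    // full catalog in a stable order
  cursor: number        // round-robin position, persisted between runs
): { batch: string[]; nextCursor: number } {
  const prioritySlots = Math.floor(DAILY_BUDGET * PRIORITY_SHARE); // 1,200
  const priority = topClicked.slice(0, prioritySlots);

  const chosen = new Set(priority);
  const batch = [...priority];
  let i = cursor;
  // Fill the remaining ~300 slots by cycling through the rest of the catalog.
  while (batch.length < DAILY_BUDGET && chosen.size < allUrls.length) {
    const url = allUrls[i % allUrls.length];
    i++;
    if (!chosen.has(url)) {
      chosen.add(url);
      batch.push(url);
    }
  }
  return { batch, nextCursor: i % allUrls.length };
}
```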

Two diagnostics already visible. First: when pages rank, they rank well. The top-trafficked page sits at position 1.0. The bottleneck is purely discovery, not relevance. Second: Google's own per-sitemap "indexed" count is unreliable; it reports indexed: 0 on every sitemap even though every URL we've sampled via the Inspection API returns verdict: PASS. The Crawl Desk hero trusts the Inspection signal; the sitemap coverage card surfaces Google's numbers with a "directional" disclaimer.

The schema lie

While building the Crawl Desk I noticed the video pages were emitting VideoObject JSON-LD. That's a schema claim that says "this page hosts a playable video", and most of ours don't. PornBoxd is a Letterboxd-style catalog: pages carry metadata, cast, tags, studio, maybe a gallery or a trailer embed, an affiliate CTA out to where the video actually lives.

Claiming VideoObject without a real video is a schema lie that Google penalizes the whole property for. The right type is Movie, the same schema Letterboxd uses for every film entry it doesn't host. Adult films qualify. The industry calls them films because they are films.

Swapped. Added actor[] (missing before, a pure win for actor-page link equity), genre[] sourced from the tags, productionCompany, contentRating: "adult", and isFamilyFriendly: false on every node. The trailer sub-VideoObject is now conditional, emitted only when an embed_url actually exists. That one is honest: the trailer IS a VideoObject; the parent is the catalog entry.
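
A minimal sketch of the new JSON-LD builder. The schema.org shape (Movie, with a conditional trailer VideoObject) is the change described above; the field names on the video record are assumptions about the catalog shape, not the actual PornBoxd code:

```typescript
interface CatalogVideo {
  title: string;
  url: string;
  actors: string[];    // actor names, each with their own page
  tags: string[];      // sanitized tag names, emitted as genre[]
  studio: string;
  embedUrl?: string;   // only present when a real trailer embed exists
  thumbnailUrl?: string;
}

function movieJsonLd(video: CatalogVideo): object {
  const node: Record<string, unknown> = {
    "@context": "https://schema.org",
    "@type": "Movie", // a catalog entry, not a hosted video
    name: video.title,
    url: video.url,
    actor: video.actors.map((name) => ({ "@type": "Person", name })),
    genre: video.tags,
    productionCompany: { "@type": "Organization", name: video.studio },
    contentRating: "adult",
    isFamilyFriendly: false,
  };
  // Honest sub-node: only claim a VideoObject when an embed actually exists.
  if (video.embedUrl) {
    node.trailer = {
      "@type": "VideoObject",
      name: `${video.title} (trailer)`,
      embedUrl: video.embedUrl,
      thumbnailUrl: video.thumbnailUrl,
    };
  }
  return node;
}
```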

§§§

245 tags, most slightly wrong

Rendering real pages made the next problem visible. Tag pages, hundreds of them, had casing splits: big Tits and Big Tits existing as two separate indexable pages, competing for the same query, splitting the rank vote. The genre[] array Google now reads included things like "Big Ass", "big Tits", "Blonde" in a single block. Sloppy.

Built a tag sanitizer admin surface. Auto-detect exact duplicates, yes. But the real cleanup opportunities on prod are fuzzier: Ball Sucking (105 videos) vs Balls Sucking (413, plural variant). Natural Tits (2,760) vs Natural Tit (1, typo). double penetration (12) vs double penetation (1, missing 'r'). Word-order swaps: VR Big ass vs Big Ass VR.

So pg_trgm went in alongside. Fuzzy similarity matching with a threshold slider, pairwise presentation with a one-click Review into the merge modal. A persistent Dismiss action for the year-series false positives (Best VR Porn of 2022, 2023, 2024, 2025 are intentionally distinct, not merge candidates). Merges survive re-scrapes via a merged_into redirect column, same pattern as actor merges from last week; the writer chains through it, old bookmarks resolve via a 3-hop fallback, nothing gets silently resurrected by tomorrow's import.
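
A minimal sketch of both halves, assuming a tags table with (id, name, merged_into) and the pg_trgm extension enabled; table and column names are guesses at the shape, not the actual schema:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection details from PG* env vars

// Candidate pairs above the slider threshold, each pair reported once.
async function fuzzyTagPairs(threshold: number) {
  const { rows } = await pool.query(
    `SELECT a.id AS left_id,  a.name AS left_name,
            b.id AS right_id, b.name AS right_name,
            similarity(a.name, b.name) AS score
       FROM tags a
       JOIN tags b ON a.id < b.id              -- each pair once
      WHERE similarity(a.name, b.name) >= $1   -- threshold slider value
      ORDER BY score DESC`,
    [threshold]
  );
  return rows;
}

// Follow merged_into for at most three hops, so old bookmarks and
// re-scraped names land on the surviving tag instead of a ghost.
async function resolveTag(id: number): Promise<number> {
  let current = id;
  for (let hop = 0; hop < 3; hop++) {
    const { rows } = await pool.query(
      "SELECT merged_into FROM tags WHERE id = $1",
      [current]
    );
    if (!rows[0]?.merged_into) break;
    current = rows[0].merged_into;
  }
  return current;
}
```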

A disaster we nearly didn't see

Somewhere between all of that, the GSC cron started erroring. Not loudly, just three errors per run, with the admin Logs page showing the count and the strings buried in a hover tooltip. I almost didn't look.

What had happened, once I did: the deploy workflow's git stash --include-untracked step was stashing the Google service-account JSON on every push (it was not in .gitignore, a setup oversight). Never popped. The file sat banked in stash@{6} on prod, gone from the working tree, and the cron had been failing every run since the first post-setup deploy.

Worse: the cron's per-URL error handler was calling UPDATE gsc_url_status SET inspected_at = NOW() in the catch block, to avoid re-hitting a persistently-broken URL. Reasonable for one bad URL. Catastrophic for an auth failure where every call throws: 3,422 URLs were silently marked as inspected-with-null-verdict, corrupting the priority queue.

Three fixes, all in one commit. .config/ went into .gitignore. The deploy workflow switched from git stash to git reset --hard && git pull (only package-lock.json actually mutates between deploys, and reset --hard handles that cleanly without touching ignored files). And the inspection loop got a circuit breaker: ten consecutive failures abort the batch with a real error, while individual failures retry next cycle instead of burning quota slots.
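
A minimal sketch of that breaker, under the assumption of an inspectUrl() helper that throws on any API failure. The 10-failure cutoff is from the fix above; everything else is illustrative:

```typescript
const MAX_CONSECUTIVE_FAILURES = 10;

async function runInspectionBatch(
  urls: string[],
  inspectUrl: (url: string) => Promise<void> // wraps the Inspection API call
): Promise<void> {
  let consecutive = 0;
  for (const url of urls) {
    try {
      await inspectUrl(url);
      consecutive = 0; // any success proves auth works; reset the breaker
    } catch (err) {
      consecutive++;
      // Crucially: no inspected_at update here. A failed call retries next
      // cycle instead of poisoning the priority queue.
      if (consecutive >= MAX_CONSECUTIVE_FAILURES) {
        throw new Error(
          `Inspection batch aborted after ${consecutive} consecutive failures ` +
          `(likely auth or systemic, not per-URL): ${err}`
        );
      }
    }
  }
}
```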

Also built an expandable errors drawer on the admin Logs page. "3 errors" as a count with a native tooltip is not a debugging interface, and silent counts are exactly how this class of incident stays hidden for days.

§§§

The shape of the week

The catalog is 31 studios, ~15,000 videos, ~240,000 gallery images, a clean set of meta descriptions, and now: honest schema, a cleaner tag corpus on the way, and a working Search Console pipeline. The missing piece is still Google actually noticing. Today didn't move indexation counts, which lag by weeks. It moved our ability to see indexation counts, and fixed the three things that were structurally blocking them.

Tomorrow's cron fills in ground truth for 1,500 more pages. The Crawl Desk will have something to say.