OCR and AI Pipeline over 2.7M Pages with Full-Text Search and Chat

(epstein-file-explorer.com)

3 points | by VibeCodingFG 7 hours ago

1 comment

  • VibeCodingFG 7 hours ago

    I vibecoded a document ingestion and search system over ~1.3M public PDFs (2.7M+ pages total).

    The original goal was to extract structured information (people, places, relationships) via OCR + AI analysis. The pipeline looks roughly like this:

    • Bulk ingest via torrent (aria2c)
    • Normalize + upload to Cloudflare R2
    • OCR and text extraction
    • Media classification + AI prioritization
    • AI analysis (DeepSeek) for entity extraction
    • Load structured results into PostgreSQL
    • Generate person↔document and person↔person relationships
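    Orchestration-wise, the stages above can be sketched as a chain of per-document transforms. This is a toy sketch, not the real implementation; all function and field names here are hypothetical, and the OCR and entity-extraction bodies are stand-ins for the actual service calls:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    doc_id: str
    pages: list                               # raw page bytes from ingest
    text: list = field(default_factory=list)  # one string per page after OCR
    entities: list = field(default_factory=list)

def ocr(doc: Doc) -> Doc:
    # Stand-in for the real OCR step: one text blob per page.
    doc.text = [f"page {i} text" for i, _ in enumerate(doc.pages)]
    return doc

def extract_entities(doc: Doc) -> Doc:
    # Stand-in for the DeepSeek entity-extraction call.
    doc.entities = [{"name": "Example Person", "page": i}
                    for i, _ in enumerate(doc.text)]
    return doc

def run_pipeline(doc: Doc) -> Doc:
    # Each stage takes and returns a Doc, so stages compose linearly.
    for stage in (ocr, extract_entities):
        doc = stage(doc)
    return doc

doc = run_pipeline(Doc("doc-001", pages=[b"...", b"..."]))
print(len(doc.entities))  # 2
```

    The per-stage shape matters mostly because it makes incremental reprocessing possible later: any stage can be re-run over stored intermediate output without redoing ingest or OCR.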

    At this scale, the bottleneck wasn’t storage — it was AI cost and throughput.

    Running full AI enrichment over millions of pages is slow and compute-intensive. Because of that, only portions of the corpus were initially enriched with structured metadata.
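    One way to keep partial enrichment tractable is to track an enrichment status per document and drain the highest-priority pending work in idempotent batches. A minimal sketch with made-up priority scores and an in-memory status map (the real system would persist this in Postgres):

```python
import heapq

def enrich_batch(status: dict, priorities: dict, batch_size: int) -> list:
    """Pick the highest-priority not-yet-enriched docs and mark them done.

    status maps doc_id -> "pending" | "done"; priorities maps doc_id -> score.
    Idempotent: already-enriched docs are skipped, so the job can be re-run
    safely after crashes or after priorities change.
    """
    pending = [(-priorities[d], d) for d, s in status.items() if s == "pending"]
    heapq.heapify(pending)  # max-priority first via negated scores
    picked = []
    while pending and len(picked) < batch_size:
        _, doc_id = heapq.heappop(pending)
        # ... call the (expensive) AI enrichment for doc_id here ...
        status[doc_id] = "done"
        picked.append(doc_id)
    return picked

status = {"a": "pending", "b": "pending", "c": "done"}
priorities = {"a": 0.2, "b": 0.9, "c": 0.5}
print(enrich_batch(status, priorities, batch_size=1))  # ['b']
```

    The key property is that enrichment progress is decoupled from everything downstream: search works over all pages regardless of how far this worker has gotten.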

    To avoid discoverability being limited by enrichment progress, I added:

    • Full-text search across all 2.7M+ pages
    • Page-level deep linking into source documents
    • A conversational “Ask the Archive” feature that retrieves from the indexed corpus
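    The search layer is conceptually an inverted index keyed at page granularity, which is what makes page-level deep linking cheap: every hit already identifies a (document, page) pair. A minimal in-memory sketch, assuming whitespace tokenization and AND semantics (the production system would use a real FTS engine):

```python
from collections import defaultdict

index = defaultdict(set)  # token -> {(doc_id, page_no), ...}

def index_page(doc_id: str, page_no: int, text: str) -> None:
    for token in text.lower().split():
        index[token].add((doc_id, page_no))

def search(query: str):
    """AND-match all query tokens; each hit is a (doc_id, page_no) pair."""
    tokens = query.lower().split()
    hits = set.intersection(*(index[t] for t in tokens)) if tokens else set()
    # A hit like ("doc-42", 7) can render as a deep link, e.g. /doc/doc-42#page=7
    return sorted(hits)

index_page("doc-42", 7, "deposition transcript excerpt")
index_page("doc-42", 8, "exhibit list")
index_page("doc-99", 1, "deposition exhibit")

print(search("deposition exhibit"))  # [('doc-99', 1)]
```

    The same page-keyed index is what a retrieval step for the chat feature can draw on, so “Ask the Archive” answers can cite specific pages rather than whole documents.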

    The architecture today is:

    Ingestion → OCR → Indexed text store → AI enrichment (incremental) → Postgres for relationships → Search + Graph + Chat layer

    Some of the more interesting challenges:

    • Entity collision from messy OCR output
    • Preventing high-degree “hub” entities from polluting graph queries
    • Incremental reprocessing when improving extraction
    • Balancing precomputed graph edges vs. query-time joins
    • Handling burst traffic (20–55k visits/day)
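    On the entity-collision point: a cheap first pass is to canonicalize names before matching, folding case, punctuation, and a few common OCR confusions. The substitution table below is illustrative, not exhaustive, and real deduplication would layer fuzzy matching on top of it:

```python
import re

# Illustrative OCR character confusions (not an exhaustive table).
OCR_FIXES = str.maketrans({"0": "o", "|": "l"})

def canonical_key(name: str) -> str:
    """Collapse an OCR'd name to a canonical key for entity merging."""
    key = name.lower().translate(OCR_FIXES)
    key = re.sub(r"[^a-z\s]", " ", key)      # drop punctuation and digits
    key = re.sub(r"\s+", " ", key).strip()   # collapse runs of whitespace
    return key

print(canonical_key("J0HN  SMITH, Jr."))  # john smith jr
```

    Merging on a canonical key trades some precision for recall, which is usually the right default here: a false merge pollutes one entity, while a missed merge silently splits a person across the graph.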

    I’d appreciate feedback on:

    1. Whether moving relationship storage to a graph-native DB would make sense long-term
    2. Better strategies for incremental AI enrichment at this scale
    3. Techniques to reduce noisy edge generation in large document graphs
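    On the noisy-edges question, one approach worth considering (an assumption on my part, not something from the post) is IDF-style edge damping: weight each person↔person co-occurrence edge by the rarity of its endpoints, so entities that appear in nearly every document contribute almost nothing. A sketch with made-up data and threshold:

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_edges(doc_entities, min_weight=0.1):
    """Build person<->person edges, damping pairs that involve hub entities.

    doc_entities: list of per-document entity sets. Each co-mention adds
    idf(a) * idf(b) to the edge weight, where idf(x) = log(N / df(x)), so
    an entity appearing in every document contributes weight zero.
    """
    n_docs = len(doc_entities)
    df = Counter(e for ents in doc_entities for e in set(ents))
    idf = {e: math.log(n_docs / c) for e, c in df.items()}
    edges = Counter()
    for ents in doc_entities:
        for a, b in combinations(sorted(set(ents)), 2):
            edges[(a, b)] += idf[a] * idf[b]
    return {pair: w for pair, w in edges.items() if w >= min_weight}

docs = [{"hub", "alice"}, {"hub", "bob"}, {"hub", "alice", "bob"}, {"hub"}]
edges = cooccurrence_edges(docs)
# "hub" appears in all 4 docs, so idf("hub") = 0 and its edges vanish;
# only alice<->bob (co-mentioned once, both with idf = log 2) survives.
print(sorted(edges))  # [('alice', 'bob')]
```

    A weight threshold like this also keeps the precomputed edge table small, which bears on the precomputed-edges-vs-query-time-joins trade-off mentioned above.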

    Happy to answer technical questions.