OCR and AI Pipeline over 2.7M Pages with Full-Text Search and Chat

(epstein-file-explorer.com)

3 points | by VibeCodingFG 7 hours ago

1 comment

  • VibeCodingFG 7 hours ago

    I vibecoded a document ingestion and search system over ~1.3M public PDFs (2.7M+ pages total).

    The original goal was to extract structured information (people, places, relationships) via OCR + AI analysis. The pipeline looks roughly like this:

    • Bulk ingest via torrent (aria2c)
    • Normalize + upload to Cloudflare R2
    • OCR and text extraction
    • Media classification + AI prioritization
    • AI analysis (DeepSeek) for entity extraction
    • Load structured results into PostgreSQL
    • Generate person↔document and person↔person relationships
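    Orchestration-wise, the stages above can be sketched as a chain of per-document transforms. This is a toy sketch, not the real implementation; all function and field names here are hypothetical, and the OCR and entity-extraction bodies are stand-ins for the actual service calls:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    doc_id: str
    pages: list                               # raw page bytes from ingest
    text: list = field(default_factory=list)  # one string per page after OCR
    entities: list = field(default_factory=list)

def ocr(doc: Doc) -> Doc:
    # Stand-in for the real OCR step: one text blob per page.
    doc.text = [f"page {i} text" for i, _ in enumerate(doc.pages)]
    return doc

def extract_entities(doc: Doc) -> Doc:
    # Stand-in for the DeepSeek entity-extraction call.
    doc.entities = [{"name": "Example Person", "page": i}
                    for i, _ in enumerate(doc.text)]
    return doc

def run_pipeline(doc: Doc) -> Doc:
    # Each stage takes and returns a Doc, so stages compose linearly.
    for stage in (ocr, extract_entities):
        doc = stage(doc)
    return doc

doc = run_pipeline(Doc("doc-001", pages=[b"...", b"..."]))
print(len(doc.entities))  # 2
```

    The per-stage shape matters mostly because it makes incremental reprocessing possible later: any stage can be re-run over stored intermediate output without redoing ingest or OCR.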

    At this scale, the bottleneck wasn’t storage — it was AI cost and throughput.

    Running full AI enrichment over millions of pages is slow and compute-intensive. Because of that, only portions of the corpus were initially enriched with structured metadata.
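    One way to keep partial enrichment tractable is to track an enrichment status per document and drain the highest-priority pending work in idempotent batches. A minimal sketch with made-up priority scores and an in-memory status map (the real system would persist this in Postgres):

```python
import heapq

def enrich_batch(status: dict, priorities: dict, batch_size: int) -> list:
    """Pick the highest-priority not-yet-enriched docs and mark them done.

    status maps doc_id -> "pending" | "done"; priorities maps doc_id -> score.
    Idempotent: already-enriched docs are skipped, so the job can be re-run
    safely after crashes or after priorities change.
    """
    pending = [(-priorities[d], d) for d, s in status.items() if s == "pending"]
    heapq.heapify(pending)  # max-priority first via negated scores
    picked = []
    while pending and len(picked) < batch_size:
        _, doc_id = heapq.heappop(pending)
        # ... call the (expensive) AI enrichment for doc_id here ...
        status[doc_id] = "done"
        picked.append(doc_id)
    return picked

status = {"a": "pending", "b": "pending", "c": "done"}
priorities = {"a": 0.2, "b": 0.9, "c": 0.5}
print(enrich_batch(status, priorities, batch_size=1))  # ['b']
```

    The key property is that enrichment progress is decoupled from everything downstream: search works over all pages regardless of how far this worker has gotten.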

    To avoid discoverability being limited by enrichment progress, I added:

    • Full-text search across all 2.7M+ pages
    • Page-level deep linking into source documents
    • A conversational “Ask the Archive” feature that retrieves from the indexed corpus
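    The search layer is conceptually an inverted index keyed at page granularity, which is what makes page-level deep linking cheap: every hit already identifies a (document, page) pair. A minimal in-memory sketch, assuming whitespace tokenization and AND semantics (the production system would use a real FTS engine):

```python
from collections import defaultdict

index = defaultdict(set)  # token -> {(doc_id, page_no), ...}

def index_page(doc_id: str, page_no: int, text: str) -> None:
    for token in text.lower().split():
        index[token].add((doc_id, page_no))

def search(query: str):
    """AND-match all query tokens; each hit is a (doc_id, page_no) pair."""
    tokens = query.lower().split()
    hits = set.intersection(*(index[t] for t in tokens)) if tokens else set()
    # A hit like ("doc-42", 7) can render as a deep link, e.g. /doc/doc-42#page=7
    return sorted(hits)

index_page("doc-42", 7, "deposition transcript excerpt")
index_page("doc-42", 8, "exhibit list")
index_page("doc-99", 1, "deposition exhibit")

print(search("deposition exhibit"))  # [('doc-99', 1)]
```

    The same page-keyed index is what a retrieval step for the chat feature can draw on, so “Ask the Archive” answers can cite specific pages rather than whole documents.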

    The architecture today is:

    Ingestion → OCR → Indexed text store → AI enrichment (incremental) → Postgres for relationships → Search + Graph + Chat layer

    Some of the more interesting challenges:

    • Entity collision from messy OCR output
    • Preventing high-degree “hub” entities from polluting graph queries
    • Incremental reprocessing when improving extraction
    • Balancing precomputed graph edges vs. query-time joins
    • Handling burst traffic (20–55k visits/day)
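    On the entity-collision point: a cheap first pass is to canonicalize names before matching, folding case, punctuation, and a few common OCR confusions. The substitution table below is illustrative, not exhaustive, and real deduplication would layer fuzzy matching on top of it:

```python
import re

# Illustrative OCR character confusions (not an exhaustive table).
OCR_FIXES = str.maketrans({"0": "o", "|": "l"})

def canonical_key(name: str) -> str:
    """Collapse an OCR'd name to a canonical key for entity merging."""
    key = name.lower().translate(OCR_FIXES)
    key = re.sub(r"[^a-z\s]", " ", key)      # drop punctuation and digits
    key = re.sub(r"\s+", " ", key).strip()   # collapse runs of whitespace
    return key

print(canonical_key("J0HN  SMITH, Jr."))  # john smith jr
```

    Merging on a canonical key trades some precision for recall, which is usually the right default here: a false merge pollutes one entity, while a missed merge silently splits a person across the graph.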

    I’d appreciate feedback on:

    1. Whether moving relationship storage to a graph-native DB would make sense long-term
    2. Better strategies for incremental AI enrichment at this scale
    3. Techniques to reduce noisy edge generation in large document graphs
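    On the noisy-edges question, one approach worth considering (an assumption on my part, not something from the post) is IDF-style edge damping: weight each person↔person co-occurrence edge by the rarity of its endpoints, so entities that appear in nearly every document contribute almost nothing. A sketch with made-up data and threshold:

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_edges(doc_entities, min_weight=0.1):
    """Build person<->person edges, damping pairs that involve hub entities.

    doc_entities: list of per-document entity sets. Each co-mention adds
    idf(a) * idf(b) to the edge weight, where idf(x) = log(N / df(x)), so
    an entity appearing in every document contributes weight zero.
    """
    n_docs = len(doc_entities)
    df = Counter(e for ents in doc_entities for e in set(ents))
    idf = {e: math.log(n_docs / c) for e, c in df.items()}
    edges = Counter()
    for ents in doc_entities:
        for a, b in combinations(sorted(set(ents)), 2):
            edges[(a, b)] += idf[a] * idf[b]
    return {pair: w for pair, w in edges.items() if w >= min_weight}

docs = [{"hub", "alice"}, {"hub", "bob"}, {"hub", "alice", "bob"}, {"hub"}]
edges = cooccurrence_edges(docs)
# "hub" appears in all 4 docs, so idf("hub") = 0 and its edges vanish;
# only alice<->bob (co-mentioned once, both with idf = log 2) survives.
print(sorted(edges))  # [('alice', 'bob')]
```

    A weight threshold like this also keeps the precomputed edge table small, which bears on the precomputed-edges-vs-query-time-joins trade-off mentioned above.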

    Happy to answer technical questions.