Robust ways to extract bank statements from PDF to CSV beyond raw LLMs?

(exactstatement.com)

1 point | by alexfefun1 11 hours ago

1 comment

alexfefun1 11 hours ago

I’ve built a tool called ExactStatement to help users convert PDF bank statements into specific CSV formats.
Currently, I’m using the Gemini API (Pro/Flash) to directly transform PDF content into structured JSON. While it works surprisingly well for 95% of cases, the "last 5%" is a headache:
Hallucinations: Occasionally, the AI misinterprets a digit or skips a line item, which is unacceptable for financial data.
Context Limits: Very long statements (50+ pages) sometimes lead to degraded performance or missing rows.
I'm looking for a more robust engineering approach. Should I:
Stick with LLMs but add a validation layer (e.g., checking that the opening balance plus all extracted transactions adds up to the statement's closing balance)?
Switch to a hybrid approach (e.g., using LayoutLM or Amazon Textract for OCR/layout analysis first, then using LLMs for cleaning)?
Go back to rule-based parsing for major banks (though maintaining templates seems like a nightmare)?
How are you guys solving the "precision" problem in document extraction today? I'd love to hear your experiences with specific libraries or workflows.
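To make the validation-layer option concrete, here is a rough sketch of the kind of check I mean. Everything in it is an assumption for illustration rather than ExactStatement's actual code: the `Row` type, the convention that amounts are signed (credits positive, debits negative), and the `validate_statement` helper are all hypothetical. It replays the extracted rows from the opening balance using `Decimal`, compares against any running balance printed on the statement, and then checks the closing balance.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional


@dataclass
class Row:
    """One extracted transaction (hypothetical shape of the LLM's JSON output)."""
    date: str
    description: str
    amount: Decimal                    # assumed signed: credits > 0, debits < 0
    balance: Optional[Decimal] = None  # running balance printed on the statement, if any


def validate_statement(opening: Decimal, closing: Decimal, rows: List[Row]) -> List[str]:
    """Return a list of problems; an empty list means the extraction reconciles."""
    problems = []
    running = opening
    for i, row in enumerate(rows):
        running += row.amount
        if row.balance is not None and row.balance != running:
            problems.append(
                f"row {i}: computed balance {running}, statement shows {row.balance}"
            )
            # Resync to the printed balance so one bad row does not flag every later row.
            running = row.balance
    if running != closing:
        problems.append(
            f"closing balance: computed {running}, statement shows {closing}"
        )
    return problems


if __name__ == "__main__":
    rows = [
        Row("2024-01-02", "Coffee", Decimal("-4.50"), Decimal("95.50")),
        Row("2024-01-03", "Salary", Decimal("1000.00"), Decimal("1095.50")),
    ]
    for problem in validate_statement(Decimal("100.00"), Decimal("1095.50"), rows):
        print(problem)
```

A check like this would catch a misread digit in an amount or a skipped row, and the per-row balance comparison narrows down where the mismatch is. It would not catch errors that cancel out, or mistakes in dates and descriptions, so it is a safety net rather than proof of correctness.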