Robust ways to extract bank statements from PDF to CSV beyond raw LLMs?

(exactstatement.com)

1 point | by alexfefun1 11 hours ago

1 comment

  • alexfefun1 11 hours ago

    I’ve built a tool called ExactStatement to help users convert PDF bank statements into specific CSV formats.

    Currently, I’m using the Gemini API (Pro/Flash) to transform PDF content directly into structured JSON. It works surprisingly well in about 95% of cases, but the "last 5%" is a headache:

    Hallucinations: Occasionally, the AI misinterprets a digit or skips a line item, which is unacceptable for financial data.

    Context Limits: Very long statements (50+ pages) sometimes lead to degraded performance or missing rows.
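
    One possible mitigation for the long-statement case, sketched under assumptions: send the PDF to the model in fixed-size page windows instead of all at once, then concatenate the rows. The helper name `page_windows` and the default window of 10 pages are hypothetical, not something ExactStatement does today, and rows that straddle a page break would still need separate handling.

```python
def page_windows(total_pages, window=10):
    """Yield (start, end) page ranges, 1-indexed and inclusive.

    Hypothetical helper: each range would be extracted in its own
    model call so no single request carries the whole statement.
    """
    if total_pages < 0 or window < 1:
        raise ValueError("total_pages must be >= 0 and window >= 1")
    for start in range(1, total_pages + 1, window):
        yield start, min(start + window - 1, total_pages)


# Example: a 53-page statement split into windows of 10 pages.
windows = list(page_windows(53, window=10))
```

    Smaller windows trade more API calls for shorter inputs; whether that actually reduces missing rows would have to be measured per bank format.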

    I'm looking for a more robust engineering approach. Should I:

    Stick with LLMs but add a validation layer (e.g., checking that the balance computed from the extracted rows matches the statement's final balance)?

    Switch to a hybrid approach (e.g., using LayoutLM or Amazon Textract for OCR/layout analysis first, then using LLMs for cleaning)?

    Go back to rule-based parsing for major banks (though maintaining templates seems like a nightmare)?
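
    To make the validation-layer option concrete, here is a minimal sketch of a balance reconciliation check. It assumes each extracted row is a dict with a signed `amount` string and an optional printed running `balance`; those field names and the `reconcile` function are hypothetical, not an existing schema. `Decimal` is used so that cents are compared exactly rather than as floats.

```python
from decimal import Decimal


def reconcile(opening, closing, transactions):
    """Check extracted rows against the statement's own balances.

    Returns (ok, index): index is the first row whose printed running
    balance disagrees with the computed one, or None if no row-level
    mismatch was found.
    """
    running = Decimal(opening)
    for i, tx in enumerate(transactions):
        running += Decimal(tx["amount"])
        stated = tx.get("balance")  # not every bank prints this per row
        if stated is not None and Decimal(stated) != running:
            return False, i
    # With no per-row balances, only the closing total can be checked.
    return running == Decimal(closing), None


rows = [
    {"amount": "-25.10", "balance": "974.90"},
    {"amount": "100.00", "balance": "1074.90"},
]
ok, bad_row = reconcile("1000.00", "1074.90", rows)
```

    A failed check only says that something is wrong (a misread digit or a skipped row); it can flag the statement for a retry or manual review, but it cannot by itself say which value is correct.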

    How are you guys solving the "precision" problem in document extraction today? Would love to hear your experiences with specific libraries or workflows.