Developer & designer (solo)
Invoice Agent
A Telegram bot that turns a photo or PDF of any invoice into structured, confirmed, queryable data — and the spend indicators that come with it. GPT-4o Vision reads the document; a human confirms it; an audit trail and a guardrailed analytics layer make the result safe to trust and easy to question.
Invoices don’t arrive as data
They arrive as a photo snapped over a desk, a PDF forwarded from a supplier, a receipt held up to a phone camera — every vendor with its own layout, its own language, its own idea of where the total goes.
Capturing that by hand is the tax every small finance workflow pays: retype the vendor, the date, the amount, the tax; file it somewhere; and then, somehow, still be able to answer “how much did we spend with this supplier last quarter?”
Invoice Agent removes both halves of that tax — the capture and the control. Send an invoice to a Telegram chat and get back structured, confirmed data in your database, plus the indicators that come with it.
How it works
The whole thing runs as an n8n workflow behind a Telegram bot — no app to install, no portal to log into. The interface is a chat the user already has open.
- Send — a photo or a PDF, straight into the chat. Both are first-class inputs: PDFs are rasterised via CloudConvert so one vision pipeline reads either with no difference to the user.
- Extract — GPT-4o Vision reads the document and returns structured fields — vendor, date, totals, tax, currency — each with a confidence score. Layout-agnostic, and tuned for both English and Portuguese invoices.
- Confirm — the bot replies with a clean preview and asks the human to confirm. Reply NO and you drop into a correction loop that re-extracts against your notes; reply YES and it moves on. Nothing is stored until a person has signed off.
- Classify — choose a payment method (card, transfer, cash, cheque, crypto).
- Store — the confirmed record lands in SQL Server, alongside an immutable audit-log entry.
A invoice_sessions table tracks conversation state, so a multi-turn exchange survives between messages — the bot always knows whether it’s waiting on a confirmation, a correction, or a fresh document.
Trust is the feature
A model that reads a document is only useful if you can trust what it wrote down. Three things make that true here — and none of them are the model:
- Human-in-the-loop by default. Nothing reaches the database until the user confirms it. Low-confidence extractions are flagged with a warning before they’re ever presented as final, so a blurry photo or a faint scan becomes a prompt to correct, not a silent bad row.
- An immutable audit trail. Every event — extraction, correction, confirmation — is written to an append-only log. You can always reconstruct how a stored number got there.
- A hard boundary at the database. The analytics layer (below) lets a model write SQL, but only ever to read.
From capture to control
Captured invoices are only worth the indicators you can pull from them — so the same chat doubles as an analytics interface:
“What’s my total spend?” · “Which company do I pay the most?” · “Show invoices from last month” · “Total spend by payment method”
Ask in plain language; the agent generates a SQL query, runs it, and answers conversationally. And because letting a model write SQL against a live database is exactly where this could go wrong, every generated query passes a safety gate first: SELECT-only, with a blocklist of mutating keywords (INSERT, UPDATE, DELETE, DROP, ALTER, EXEC, TRUNCATE, …) and a query log of its own. The model can interrogate the data; it cannot change it.
For anything that wants a dashboard rather than a sentence, a Power BI model sits on top of a summary view — total spend, top vendors, payment-method breakdown, monthly trend. The indicators are there whether you’d rather ask or look.
Architecture
The pipeline is a single n8n workflow: a Telegram webhook in, a branch on image-vs-PDF, a GPT-4o Vision extraction step, a stateful confirmation loop, and SQL Server writes. Four LLM prompts do distinct, narrow jobs — invoice extraction, correction merge, natural-language-to-SQL, and result formatting — each with its own contract. CloudConvert handles PDF-to-image; a Cloudflare tunnel exposes the webhook in development.
| Layer | Technologies |
|---|---|
| Orchestration | n8n workflow automation · Telegram Bot API · Cloudflare Tunnel |
| AI | OpenAI GPT-4o Vision — extraction, correction merge, NL→SQL, answer formatting |
| Data | SQL Server — sessions, invoices, audit log, query log · CloudConvert (PDF→image) |
| Reporting | Power BI — summary view, spend & vendor dashboards |
| License | MIT — self-hosted |
On a typical image, extraction takes a few seconds; a PDF adds the conversion step. The footprint is small enough to run a year of invoices on a modest box — the constraint is never compute, it’s getting the captured data to a state you can act on.
The broader point
Vision models read documents well now — that part is genuinely solved, and reaching for one here is the right call, not a gimmick. The work that makes it useful is everything wrapped around it: a human confirming before anything persists, an audit log that explains every stored value, and a firm line between a language model and a database it may read but never write.
Automating capture is half the job. The other half is making the captured data safe to trust and easy to question — so an invoice, snapped as a photo or forwarded as a PDF, becomes an indicator someone can actually act on.