Skip to content
FS Fábio Silva
← All work

Developer & designer (solo)

Invoice Agent

A Telegram bot that turns a photo or PDF of any invoice into structured, confirmed, queryable data — and the spend indicators that come with it. GPT-4o Vision reads the document; a human confirms it; an audit trail and a guardrailed analytics layer make the result safe to trust and easy to question.

Invoice Agent in a Telegram chat: send a photo or PDF, the bot extracts the fields, you confirm them, and the data lands in SQL Server — then you ask it questions in plain language.

Invoices don’t arrive as data

They arrive as a photo snapped over a desk, a PDF forwarded from a supplier, a receipt held up to a phone camera — every vendor with its own layout, its own language, its own idea of where the total goes.

Capturing that by hand is the tax every small finance workflow pays: retype the vendor, the date, the amount, the tax; file it somewhere; and then, somehow, still be able to answer “how much did we spend with this supplier last quarter?”

Invoice Agent removes both halves of that tax — the capture and the control. Send an invoice to a Telegram chat and get back structured, confirmed data in your database, plus the indicators that come with it.

How it works

The whole thing runs as an n8n workflow behind a Telegram bot — no app to install, no portal to log into. The interface is a chat the user already has open.

  1. Send — a photo or a PDF, straight into the chat. Both are first-class inputs: PDFs are rasterised via CloudConvert so one vision pipeline reads either with no difference to the user.
  2. Extract — GPT-4o Vision reads the document and returns structured fields — vendor, date, totals, tax, currency — each with a confidence score. Layout-agnostic, and tuned for both English and Portuguese invoices.
  3. Confirm — the bot replies with a clean preview and asks the human to confirm. Reply NO and you drop into a correction loop that re-extracts against your notes; reply YES and it moves on. Nothing is stored until a person has signed off.
  4. Classify — choose a payment method (card, transfer, cash, cheque, crypto).
  5. Store — the confirmed record lands in SQL Server, alongside an immutable audit-log entry.

A invoice_sessions table tracks conversation state, so a multi-turn exchange survives between messages — the bot always knows whether it’s waiting on a confirmation, a correction, or a fresh document.

Trust is the feature

A model that reads a document is only useful if you can trust what it wrote down. Three things make that true here — and none of them are the model:

  • Human-in-the-loop by default. Nothing reaches the database until the user confirms it. Low-confidence extractions are flagged with a warning before they’re ever presented as final, so a blurry photo or a faint scan becomes a prompt to correct, not a silent bad row.
  • An immutable audit trail. Every event — extraction, correction, confirmation — is written to an append-only log. You can always reconstruct how a stored number got there.
  • A hard boundary at the database. The analytics layer (below) lets a model write SQL, but only ever to read.

From capture to control

Captured invoices are only worth the indicators you can pull from them — so the same chat doubles as an analytics interface:

“What’s my total spend?” · “Which company do I pay the most?” · “Show invoices from last month” · “Total spend by payment method”

Ask in plain language; the agent generates a SQL query, runs it, and answers conversationally. And because letting a model write SQL against a live database is exactly where this could go wrong, every generated query passes a safety gate first: SELECT-only, with a blocklist of mutating keywords (INSERT, UPDATE, DELETE, DROP, ALTER, EXEC, TRUNCATE, …) and a query log of its own. The model can interrogate the data; it cannot change it.

For anything that wants a dashboard rather than a sentence, a Power BI model sits on top of a summary view — total spend, top vendors, payment-method breakdown, monthly trend. The indicators are there whether you’d rather ask or look.

Architecture

The pipeline is a single n8n workflow: a Telegram webhook in, a branch on image-vs-PDF, a GPT-4o Vision extraction step, a stateful confirmation loop, and SQL Server writes. Four LLM prompts do distinct, narrow jobs — invoice extraction, correction merge, natural-language-to-SQL, and result formatting — each with its own contract. CloudConvert handles PDF-to-image; a Cloudflare tunnel exposes the webhook in development.

LayerTechnologies
Orchestrationn8n workflow automation · Telegram Bot API · Cloudflare Tunnel
AIOpenAI GPT-4o Vision — extraction, correction merge, NL→SQL, answer formatting
DataSQL Server — sessions, invoices, audit log, query log · CloudConvert (PDF→image)
ReportingPower BI — summary view, spend & vendor dashboards
LicenseMIT — self-hosted

On a typical image, extraction takes a few seconds; a PDF adds the conversion step. The footprint is small enough to run a year of invoices on a modest box — the constraint is never compute, it’s getting the captured data to a state you can act on.

The broader point

Vision models read documents well now — that part is genuinely solved, and reaching for one here is the right call, not a gimmick. The work that makes it useful is everything wrapped around it: a human confirming before anything persists, an audit log that explains every stored value, and a firm line between a language model and a database it may read but never write.

Automating capture is half the job. The other half is making the captured data safe to trust and easy to question — so an invoice, snapped as a photo or forwarded as a PDF, becomes an indicator someone can actually act on.