Automatically extracts structured data from PDF invoices and reconciles them against Stripe transactions, flagging discrepancies for your finance team.
A complete, working implementation of this recipe is available to download as a zip or to browse file by file. It is generated by our build pipeline and tested with full coverage before publishing.
In this tutorial you’ll build an Express API that accepts PDF invoices, extracts structured data using Mistral AI, matches the extracted totals against Stripe payment intents, and emails a discrepancy report to your finance team. By the end you’ll understand how the pipeline chains together PDF text extraction, document classification via a confidence router, structured JSON extraction, Stripe reconciliation, and budget enforcement per file.
Prerequisites
Node.js >= 22
pnpm 10.x
Mistral AI API key
Stripe secret key
SMTP email account (for sending reports)
Step 1: Scaffold the project
Create the directory structure and initialize a TypeScript Node.js project.
Set the project to use ES modules and pin the package manager version.
terminal
pnpm pkg set type="module"
pnpm pkg set packageManager="pnpm@10.0.0"
pnpm pkg set engines.node=">=22"
Step 2: Install dependencies
Install the runtime and dev dependencies. The Mistral SDK talks to the API, unpdf extracts text from PDFs, tesseract.js handles OCR on image-only scans, stripe searches payment intents, nodemailer sends reports, dotenv loads environment variables, and the REAA packages handle classification routing and per-file budget enforcement.
Create .env from the example. Fill in your real API keys and SMTP credentials. The budget limit controls the maximum spend per file; the route thresholds control when the classifier routes, asks for clarification, or falls back.
The .gitignore already excludes .env and .env.local, so your keys won’t be committed.
Step 6: Write the config module
The config module validates all environment variables at startup using Zod, so the app fails fast with a clear error if a required variable is missing or misconfigured.
The type definitions file exports Zod schemas for invoice, receipt, credit note, extraction results, and discrepancy reports. These schemas both define the TypeScript types and validate the data coming back from the Mistral LLM.
The budget module uses the REAA budget engine to enforce a per-file spend limit. Before each Mistral call, checkBudget throws if the estimated cost would exceed the limit. After each call, recordSpend logs the actual cost so the tracker stays accurate.
Create src/lib/budget.ts:
ts
import { BudgetController } from "@reaatech/agent-budget-engine";
import { SpendStore } from "@reaatech/agent-budget-spend-tracker";
import { BudgetScope, BudgetExceededError } from "@reaatech/agent-budget-types";
import { mistralPricingProvider } from "./mistral-pricing.js";
import { config } from "./config.js";

const store = new SpendStore();

export const controller = new BudgetController({
  spendTracker: store,
  pricing: mistralPricingProvider,
});

controller.on("hard-stop", (event) => {
  console.warn(`Budget hard-stop for ${event.scopeType}:${event.scopeKey}`);
});

export function defineFileBudget(scopeKey: string): void {
  controller.defineBudget({
    scopeType: BudgetScope.Task,
    scopeKey,
    limit: config.BUDGET_PER_FILE_LIMIT,
    policy: {
      softCap: 0.8,
      hardCap: 1.0,
      autoDowngrade: [],
      disableTools: [],
    },
  });
}

export function checkBudget(
  scopeKey: string,
  estimatedCost: number,
  modelId: string,
  tools?: string[]
): void {
  const result = controller.check({
    scopeType: BudgetScope.Task,
    scopeKey,
    estimatedCost,
    modelId,
    tools: tools ?? [],
  });
  if (!result.allowed) {
    throw new BudgetExceededError(
      `Budget exceeded for ${scopeKey}`,
      { scopeType: BudgetScope.Task, scopeKey },
      result.limit - result.remaining,
      result.limit,
      result.remaining,
      result.action
    );
  }
}

export function recordSpend(
  requestId: string,
  scopeKey: string,
  cost: number,
  inputTokens: number,
  outputTokens: number,
  modelId: string
): void {
  controller.record({
    requestId,
    scopeType: BudgetScope.Task,
    scopeKey,
    cost,
    inputTokens,
    outputTokens,
    modelId,
    provider: "mistral",
    timestamp: new Date(),
  });
}
Step 9: Write the Mistral pricing provider
The pricing provider implements the PricingProvider interface so the budget controller can convert token estimates to USD.
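The token-to-USD conversion can be sketched as below. This is only an illustration: the `TokenPricing` interface and `estimateCostUsd` function are hypothetical stand-ins, because the actual shape of the `PricingProvider` interface is not shown here. The per-1K-token rates are the ones hard-coded in the cost formula of src/lib/mistral.ts, not official Mistral prices.

```typescript
// Hypothetical shape; the real PricingProvider interface may differ.
interface TokenPricing {
  inputPer1K: number; // USD per 1,000 input tokens
  outputPer1K: number; // USD per 1,000 output tokens
}

// Rates assumed from the cost formula used in src/lib/mistral.ts.
const ASSUMED_RATES: TokenPricing = { inputPer1K: 0.003, outputPer1K: 0.009 };

// Convert a token estimate into an estimated USD cost.
function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  rates: TokenPricing = ASSUMED_RATES
): number {
  if (inputTokens < 0 || outputTokens < 0) {
    throw new RangeError("Token counts must be non-negative");
  }
  return (
    (inputTokens / 1000) * rates.inputPer1K +
    (outputTokens / 1000) * rates.outputPer1K
  );
}
```

Keeping the rates in one object means a price change only touches one place, rather than every call site that estimates cost.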
The PDF processor tries unpdf first (fast, works on text-based PDFs). If the extracted text is shorter than 50 characters, it falls back to Tesseract OCR (slower, works on scanned images). The function returns both the extracted text and a confidence score.
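The fallback logic might look roughly like the sketch below. To keep it self-contained, the two extractors are passed in as functions instead of calling unpdf or tesseract.js directly. The names `extractWithFallback`, `Extractor`, and `ExtractedText` are hypothetical, and the confidence values (1 and 0.5) are placeholders, since the tutorial does not say how the real score is computed.

```typescript
// An extractor takes raw PDF bytes and resolves to plain text.
type Extractor = (pdf: Uint8Array) => Promise<string>;

interface ExtractedText {
  text: string;
  source: "text-layer" | "ocr";
  confidence: number; // placeholder heuristic, not the real scoring
}

// Threshold named in the tutorial: below this, assume an image-only scan.
const MIN_TEXT_LENGTH = 50;

async function extractWithFallback(
  pdf: Uint8Array,
  extractTextLayer: Extractor, // e.g. a thin wrapper around unpdf
  runOcr: Extractor // e.g. a thin wrapper around tesseract.js
): Promise<ExtractedText> {
  // Fast path: use the embedded text layer when it yields enough text.
  const direct = await extractTextLayer(pdf);
  if (direct.length >= MIN_TEXT_LENGTH) {
    return { text: direct, source: "text-layer", confidence: 1 };
  }
  // Slow path: OCR only runs when the text layer is too short.
  const ocrText = await runOcr(pdf);
  return { text: ocrText, source: "ocr", confidence: 0.5 };
}
```

Injecting the extractors also makes the fallback easy to unit test without real PDFs.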
The classifier uses @reaatech/confidence-router with keyword-based rules to determine whether an uploaded document is an invoice, receipt, or credit note. It registers a KeywordClassifier for each document type, then calls router.classify() followed by router.decide() to produce a routing decision: ROUTE, CLARIFY, or FALLBACK.
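To make the classify-then-decide flow concrete, here is a dependency-free sketch of the same idea. It does not use the `@reaatech/confidence-router` API. The keyword lists, the `scoreDocument` and `decide` functions, and the threshold values (0.5 to route, 0.25 to ask for clarification) are all illustrative assumptions.

```typescript
type DocType = "invoice" | "receipt" | "credit_note";

type Decision =
  | { action: "ROUTE"; label: DocType; confidence: number }
  | { action: "CLARIFY"; candidates: DocType[]; confidence: number }
  | { action: "FALLBACK"; confidence: number };

// Illustrative keyword rules; the real rules may differ.
const KEYWORDS: Record<DocType, string[]> = {
  invoice: ["invoice", "bill to", "due date", "invoice number"],
  receipt: ["receipt", "paid", "cash", "change due"],
  credit_note: ["credit note", "credit memo", "refund", "credited"],
};

// Score each type as the fraction of its keywords found in the text.
function scoreDocument(text: string): Record<DocType, number> {
  const lower = text.toLowerCase();
  const scores = {} as Record<DocType, number>;
  for (const type of Object.keys(KEYWORDS) as DocType[]) {
    const words = KEYWORDS[type];
    scores[type] = words.filter((w) => lower.includes(w)).length / words.length;
  }
  return scores;
}

// Turn scores into a routing decision using two thresholds.
function decide(
  scores: Record<DocType, number>,
  routeThreshold = 0.5,
  fallbackThreshold = 0.25
): Decision {
  const ranked = (Object.entries(scores) as [DocType, number][]).sort(
    (a, b) => b[1] - a[1]
  );
  const top = ranked[0];
  if (!top || top[1] < fallbackThreshold) {
    return { action: "FALLBACK", confidence: top ? top[1] : 0 };
  }
  if (top[1] < routeThreshold) {
    const candidates = ranked.filter(([, s]) => s > 0).map(([t]) => t);
    return { action: "CLARIFY", candidates, confidence: top[1] };
  }
  return { action: "ROUTE", label: top[0], confidence: top[1] };
}
```

Calling `decide(scoreDocument(text))` mirrors the two-step `classify` then `decide` sequence described above.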
This module sends the extracted text to Mistral and validates the JSON response against the correct Zod schema. It uses jsonrepair to fix truncated or malformed JSON before parsing. The checkBudget call before the LLM request and recordSpend call after ensure every document stays within budget.
Create src/lib/mistral.ts:
ts
import { Mistral } from "@mistralai/mistralai";
import { jsonrepair } from "jsonrepair";
import { config } from "./config.js";
import { checkBudget, recordSpend } from "./budget.js";
import { InvoiceSchema, ReceiptSchema, CreditNoteSchema } from "./types.js";
import type { ExtractionResult } from "./types.js";

const mistral = new Mistral({ apiKey: config.MISTRAL_API_KEY });

// Carries the raw model output so callers can log or inspect it.
export class ExtractionError extends Error {
  constructor(message: string, public readonly rawOutput: string) {
    super(message);
    this.name = "ExtractionError";
  }
}

function buildPrompt(documentType: string, text: string): string {
  const system = `You are a document extraction assistant. Extract structured data from the provided text and output ONLY valid JSON matching the schema for a ${documentType}.`;
  return `${system}\n\nText:\n${text}`;
}

function parseAndValidate(documentType: string, rawText: string): ExtractionResult {
  // Repair truncated or malformed JSON before parsing.
  let repaired: string;
  try {
    repaired = jsonrepair(rawText);
  } catch {
    throw new ExtractionError("Failed to repair JSON", rawText);
  }

  let parsed: unknown;
  try {
    parsed = JSON.parse(repaired);
  } catch {
    throw new ExtractionError("Failed to parse repaired JSON", rawText);
  }

  // Validate against the schema for the classified document type.
  try {
    if (documentType === "invoice") {
      const data = InvoiceSchema.parse(parsed);
      return { documentType: "invoice", data };
    }
    if (documentType === "receipt") {
      const data = ReceiptSchema.parse(parsed);
      return { documentType: "receipt", data };
    }
    if (documentType === "credit_note") {
      const data = CreditNoteSchema.parse(parsed);
      return { documentType: "credit_note", data };
    }
  } catch {
    throw new ExtractionError("Schema validation failed", rawText);
  }
  throw new ExtractionError(`Unknown document type: ${documentType}`, rawText);
}

export async function extractFromText(
  documentType: string,
  text: string,
  budgetScope: string
): Promise<ExtractionResult> {
  // Rough estimate: about 4 characters per token, plus prompt overhead.
  const estimatedTokens = Math.ceil(text.length / 4) + 500;
  const estimatedCost = (estimatedTokens / 1000) * 0.003 + (2048 / 1000) * 0.009;

  // checkBudget is synchronous and throws if the estimate exceeds the limit.
  checkBudget(budgetScope, estimatedCost, config.MISTRAL_MODEL);

  const response = await mistral.chat.complete({
    model: config.MISTRAL_MODEL,
    messages: [{ role: "user", content: buildPrompt(documentType, text) }],
    maxTokens: 2048,
  });

  const rawContent = response.choices[0]?.message?.content ?? "";
  const contentString = typeof rawContent === "string" ? rawContent : "";
  const result = parseAndValidate(documentType, contentString);

  // Spend is approximated from text lengths, not from reported usage.
  const inputTokens = estimatedTokens;
  const outputTokens = contentString.length / 4;
  const actualCost = (inputTokens / 1000) * 0.003 + (outputTokens / 1000) * 0.009;
  recordSpend(
    `req-${budgetScope}`,
    budgetScope,
    actualCost,
    inputTokens,
    Math.ceil(outputTokens),
    config.MISTRAL_MODEL
  );
  return result;
}
Step 13: Write the Stripe reconciliation module
The reconciliation module searches Stripe for payment intents matching the invoice total within a 5% tolerance and dated within 2 days of the invoice date. The reconcile function then picks the closest matching payment and compares totals to produce a discrepancy report with status “matched”, “mismatch”, “unmatched”, or “error”.
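The matching rules can be sketched without calling Stripe at all. The example below works on plain objects that carry only the `id`, `amount` (smallest currency unit), and `created` (Unix seconds) fields of a payment intent. The names `reconcileSketch`, `InvoiceLike`, and `PaymentLike` are hypothetical, and the sketch assumes that "matched" means the totals are exactly equal while "mismatch" means a candidate was found within tolerance but the totals differ. The "error" status is omitted because it would come from a failed API call.

```typescript
interface PaymentLike {
  id: string;
  amount: number; // smallest currency unit (e.g. cents)
  created: number; // Unix timestamp in seconds
}

interface InvoiceLike {
  totalCents: number;
  issuedAt: Date;
}

interface MatchResult {
  status: "matched" | "mismatch" | "unmatched";
  paymentId?: string;
  differences: string[];
}

const TOLERANCE = 0.05; // 5% of the invoice total
const WINDOW_SECONDS = 2 * 24 * 60 * 60; // 2 days

function reconcileSketch(invoice: InvoiceLike, payments: PaymentLike[]): MatchResult {
  const invoiceTs = Math.floor(invoice.issuedAt.getTime() / 1000);
  const delta = (p: PaymentLike) => Math.abs(p.amount - invoice.totalCents);

  // Keep payments inside both the date window and the amount tolerance.
  const candidates = payments.filter(
    (p) =>
      Math.abs(p.created - invoiceTs) <= WINDOW_SECONDS &&
      delta(p) <= invoice.totalCents * TOLERANCE
  );
  if (candidates.length === 0) {
    return { status: "unmatched", differences: [] };
  }

  // Pick the candidate whose amount is closest to the invoice total.
  const closest = candidates.reduce((best, p) => (delta(p) < delta(best) ? p : best));
  if (closest.amount === invoice.totalCents) {
    return { status: "matched", paymentId: closest.id, differences: [] };
  }
  return {
    status: "mismatch",
    paymentId: closest.id,
    differences: [`total: invoice ${invoice.totalCents} vs payment ${closest.amount}`],
  };
}
```

Working in integer minor units avoids floating-point drift when comparing totals.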
The email module sends a discrepancy report via nodemailer. When the status is “matched”, the subject says the invoice matched successfully. For any other status, it says “Discrepancy report” and includes the differences list in the HTML body.
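The subject and body selection could be written as a pure function, separate from the nodemailer transport, so it can be tested without SMTP. In this sketch the `ReportLike` shape, the `buildEmail` name, and the exact subject wording are assumptions for illustration only.

```typescript
interface ReportLike {
  status: "matched" | "mismatch" | "unmatched" | "error";
  invoiceNumber: string;
  differences: string[];
}

// Escape text before placing it in HTML so invoice data cannot inject markup.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function buildEmail(report: ReportLike): { subject: string; html: string } {
  const invoice = escapeHtml(report.invoiceNumber);
  if (report.status === "matched") {
    return {
      subject: `Invoice ${report.invoiceNumber} matched successfully`,
      html: `<p>Invoice ${invoice} matched a Stripe payment.</p>`,
    };
  }
  // Any other status gets a discrepancy report listing the differences.
  const items = report.differences.map((d) => `<li>${escapeHtml(d)}</li>`).join("");
  return {
    subject: `Discrepancy report: invoice ${report.invoiceNumber} (${report.status})`,
    html: `<p>Invoice ${invoice} needs review.</p><ul>${items}</ul>`,
  };
}
```

The returned `subject` and `html` would then be handed to whatever mail transport the module uses.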
The upload route is the heart of the pipeline. It accepts a multipart PDF upload, runs text extraction, classifies the document, calls Mistral for structured data, matches invoices against Stripe, and sends a report email. Budget is defined at the start of the request so the budget controller can track spend for just this file.
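The order of operations in the route can be sketched as a plain function with each stage injected, leaving out Express and multipart parsing. Everything here is hypothetical: the `Steps` interface, the `handleUpload` name, and the outcome shapes are illustrative. The sketch also assumes that non-invoice documents skip reconciliation and email, which the tutorial does not specify.

```typescript
type DocKind = "invoice" | "receipt" | "credit_note";

type Classification =
  | { action: "ROUTE"; kind: DocKind }
  | { action: "CLARIFY" }
  | { action: "FALLBACK" };

// Each pipeline stage is injected so the ordering can be tested in isolation.
interface Steps {
  defineBudget(scopeKey: string): void;
  extractText(pdf: Uint8Array): Promise<string>;
  classify(text: string): Classification;
  extractFields(kind: DocKind, text: string, scopeKey: string): Promise<{ total: number }>;
  reconcile(total: number): Promise<string>; // resolves to a report status
  sendReport(to: string, status: string): Promise<void>;
}

type UploadOutcome =
  | { ok: true; kind: DocKind; status: string }
  | { ok: false; reason: "unclassified" | "needs_clarification" };

async function handleUpload(
  pdf: Uint8Array,
  email: string,
  scopeKey: string,
  steps: Steps
): Promise<UploadOutcome> {
  // Define the budget first so all later spend is tracked for this file.
  steps.defineBudget(scopeKey);
  const text = await steps.extractText(pdf);

  const decision = steps.classify(text);
  if (decision.action === "FALLBACK") return { ok: false, reason: "unclassified" };
  if (decision.action === "CLARIFY") return { ok: false, reason: "needs_clarification" };

  const data = await steps.extractFields(decision.kind, text, scopeKey);
  if (decision.kind !== "invoice") {
    // Assumption: only invoices are reconciled and reported.
    return { ok: true, kind: decision.kind, status: "skipped" };
  }
  const status = await steps.reconcile(data.total);
  await steps.sendReport(email, status);
  return { ok: true, kind: decision.kind, status };
}
```

Returning early on FALLBACK and CLARIFY means no Mistral call is made, and no budget is spent, for documents the classifier could not route.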
The test setup file mocks the config module so tests run without real environment variables. It also starts an MSW server that intercepts Mistral and Stripe API calls, keeping tests fast and deterministic.
Write integration tests for the upload endpoint to verify the full pipeline from multipart file upload through classification and extraction. The mock returns a successful invoice extraction and a matching Stripe payment intent, so the tests verify the happy path and error cases end to end.
The scripts should already be in package.json from the scaffold step. If any are missing, the commands below set them; running them again is harmless when the scripts are already present:
terminal
pnpm pkg set scripts.typecheck="tsc --noEmit"
pnpm pkg set scripts.lint="eslint ."
pnpm pkg set scripts.test="vitest run --coverage --reporter=json --outputFile=vitest-report.json"
Step 20: Run the tests
Run the test suite to verify the pipeline logic works end to end.
terminal
pnpm test
Expected output includes several passing test groups. The “POST /upload” suite runs 8 cases covering missing file, unsupported type, short text, fallback classification, clarification needed, successful invoice, extraction error, and receipt/credit note flows. The “extractFromText” suite covers invoice, receipt, credit note, and error paths. The “reconcile” suite covers unmatched, exact match, mismatch, and error cases. Coverage thresholds (90% on lines, branches, functions, statements) are met.
Step 21: Start the server
Start the Express server so you can upload invoices.
terminal
node --import dotenv/config src/server.ts
Expected output: Server listening on port 3000.
Step 22: Upload an invoice
With the server running, send a PDF invoice using curl. The email field specifies where the discrepancy report gets sent.
terminal
curl -X POST http://localhost:3000/upload \
  -F "file=@invoice.pdf" \
  -F "email=finance@example.com"
If the invoice total matches a Stripe payment intent within 5% and 2 days, the response includes "status":"matched". If no match is found, you get "status":"unmatched". If the totals differ, you get "status":"mismatch" with a differences array listing the discrepancies. An email report goes to the address you specified.
Next steps
Add a GUI upload page with drag-and-drop so non-technical team members can submit invoices without using curl.
Store the discrepancy reports in a database so the finance team can review historical reconciliations from a dashboard.
Extend the classifier with additional document types such as purchase orders or shipping receipts by adding more keyword rules to the router.
}));
vi.mock("../src/classifier.js", () => ({
classifyDocument: mockClassifyDocument,
}));
vi.mock("../src/lib/mistral.js", () => ({
extractFromText: mockExtractFromText,
ExtractionError: class ExtractionError extends Error {
constructor(message: string, public rawOutput: string) {