A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.
This tutorial walks you through building a document processing pipeline that extracts line items from PDF receipts and Excel expense sheets using Mistral AI, repairs malformed JSON output with a six-strategy repair engine, enforces daily spend budgets, and pushes validated expense data into Xero as ACCREC invoices. You’ll build each layer from scratch — schemas, extractors, planners, parsers, telemetry, budget enforcement, and the Next.js API routes that tie them together.
A Xero custom connection app with client ID and client secret (or leave those placeholders if you only want to test the pipeline without pushing to Xero)
Basic familiarity with TypeScript, Next.js App Router, and Zod schemas
Step 1: Scaffold the project and install dependencies
Start from an empty directory. Create the Next.js project, then install all dependencies at exact pinned versions.
The project already includes an .env.example that lists these same variables — keep it as documentation for the team.
Step 2: Define the expense Zod schemas
The pipeline needs three Zod schemas — one for individual line items, one for a complete document (receipt or invoice), and one for a batch. Create src/schemas/expense-schema.ts:
Expected output: ExpenseDocumentSchema is the schema you’ll pass to the structured repair engine later; it validates that Mistral’s output matches this exact shape.
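The fields these schemas cover can be read off the system prompt used in Step 6. As a rough, dependency-free sketch, the same shape can be written as plain TypeScript types with a hand-written guard. This stands in for the Zod definitions, which are not shown on this page; the names ExpenseLineItemShape, ExpenseDocumentShape, and looksLikeExpenseDocument are hypothetical.

```typescript
// Hypothetical plain-TypeScript mirror of the fields named in the Step 6 system prompt.
// The real src/schemas/expense-schema.ts expresses these with Zod instead.
interface ExpenseLineItemShape {
  itemDescription: string;
  quantity: number;
  unitAmount: number;
  taxType: string;
  lineAmount: number;
  category: string;
}

interface ExpenseDocumentShape {
  vendorName: string;
  date: string;
  totalAmount: number;
  currency: string;
  lineItems: ExpenseLineItemShape[];
  receiptNumber?: string;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Checks that every line-item field is present with the expected primitive type.
function looksLikeLineItem(value: unknown): value is ExpenseLineItemShape {
  if (!isRecord(value)) return false;
  return (
    typeof value["itemDescription"] === "string" &&
    typeof value["quantity"] === "number" &&
    typeof value["unitAmount"] === "number" &&
    typeof value["taxType"] === "string" &&
    typeof value["lineAmount"] === "number" &&
    typeof value["category"] === "string"
  );
}

// Checks the document-level fields, then each line item; receiptNumber is optional.
function looksLikeExpenseDocument(value: unknown): value is ExpenseDocumentShape {
  if (!isRecord(value)) return false;
  const receipt = value["receiptNumber"];
  const items = value["lineItems"];
  return (
    typeof value["vendorName"] === "string" &&
    typeof value["date"] === "string" &&
    typeof value["totalAmount"] === "number" &&
    typeof value["currency"] === "string" &&
    Array.isArray(items) &&
    items.every(looksLikeLineItem) &&
    (receipt === undefined || typeof receipt === "string")
  );
}
```

A Zod schema gives you the same checks plus inferred types and detailed error messages, which is why the pipeline uses it rather than hand-written guards like these.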
Step 3: Create shared processing types
Create src/types/index.ts with the status types and response shapes the API route uses:
Expected output: The DocumentSource interface is what you’ll pass into the extractor — it tells the extractor what kind of file it’s dealing with.
Step 4: Build the document extractor (PDF + XLSX)
The extractor handles two file formats. Create src/services/document-extractor.ts:
ts
import { getDocument } from "pdfjs-dist";
import { read, utils } from "xlsx";
import type { DocumentSource } from "../types/index.js";

export class ExtractionError extends Error {
  code: string;

  constructor(message: string, code = "EXTRACTION_FAILED") {
    super(message);
    this.name = "ExtractionError";
    this.code = code;
  }
}

// Extracts text from at most the first 50 pages of a PDF.
export async function extractTextFromPdf(buffer: Buffer): Promise<string> {
  // Recent pdfjs-dist releases throw when given a Node Buffer, so pass a plain Uint8Array.
  const pdf = await getDocument({ data: new Uint8Array(buffer) }).promise;
  if (pdf.numPages === 0) return "";

  const maxPages = Math.min(pdf.numPages, 50);
  if (pdf.numPages > 50) {
    console.warn(`Document has ${String(pdf.numPages)} pages; extracting first 50`);
  }

  const pageTexts: string[] = [];
  for (let i = 1; i <= maxPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    // Text items carry a `str` field; other item kinds contribute nothing.
    const pageText = content.items
      .map((item) => ("str" in item ? (item as { str: string }).str : ""))
      .join(" ");
    pageTexts.push(pageText);
  }
  return pageTexts.join("\n");
}

// Converts every sheet in the workbook to CSV and joins the results.
export function extractTextFromXlsx(buffer: Buffer): string {
  const workbook = read(buffer, { type: "buffer" });
  const sheetTexts: string[] = [];
  for (const sheetName of workbook.SheetNames) {
    const sheet = workbook.Sheets[sheetName];
    const csv = utils.sheet_to_csv(sheet);
    sheetTexts.push(csv);
  }
  return sheetTexts.join("\n");
}

// Dispatches on MIME type; anything that is not a PDF is treated as XLSX.
export async function extractText(source: DocumentSource, buffer: Buffer): Promise<string> {
  try {
    if (source.mimeType === "application/pdf") {
      return await extractTextFromPdf(buffer);
    }
    return extractTextFromXlsx(buffer);
  } catch (err) {
    const message = err instanceof Error ? err.message : "Unknown extraction error";
    throw new ExtractionError(message);
  }
}
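Expected output: The extractor dispatches on MIME type: application/pdf goes to the PDF path, and any other type is treated as XLSX. PDF text extraction uses pdfjs-dist’s getDocument and iterates over the pages; PDFs longer than 50 pages are truncated to the first 50, with a console warning. XLSX extraction uses xlsx’s read and converts each sheet to CSV. Any failure is rethrown as an ExtractionError.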
Step 5: Build the context-window planner
Large documents can exceed Mistral’s context window. Use @reaatech/context-window-planner to chunk the text. Create src/services/context-planner.ts:
ts
import {
  ContextPlannerBuilder,
  createTokenizer,
  createPriorityGreedyStrategy,
  createConversationTurn,
  createGenerationBuffer,
  type ContextItem,
} from "@reaatech/context-window-planner";

export function planDocumentContext(text: string, maxTokens: number): string {
  // Short text skips the planner entirely.
  if (text.length < maxTokens * 4) {
    return text;
  }

  const tokenizer = createTokenizer("mock");
  const planner = new ContextPlannerBuilder()
    .withBudget(maxTokens - 1000)
    .withReserved(200)
    .withTokenizer(tokenizer)
    .withStrategy(createPriorityGreedyStrategy())
    .build();

  planner.addAll([
    createConversationTurn({ role: "user", content: text }, tokenizer),
    createGenerationBuffer({ reservedTokens: 500 }),
  ]);

  const result = planner.pack();
  if (result.warnings.length > 0) {
    console.warn("Context planner warnings:", result.warnings);
  }

  const includedContent = result.included
    .map((item: ContextItem) => {
      if ("content" in item) return (item as { content: string }).content;
      return "";
    })
    .join("\n");

  // Fall back to the full text if the planner included nothing.
  return includedContent || text;
}
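Expected output: Short documents pass through untouched: the text.length < maxTokens * 4 check (a rough estimate of four characters per token) avoids planner overhead. Longer documents go through the priority-greedy strategy, which keeps the highest-priority content within a budget of maxTokens - 1000 tokens. If the planner includes nothing, the function falls back to the full text.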
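To make the short-document check concrete, here is a minimal sketch of the same character-count heuristic. It assumes roughly four characters per token, which is only a rule of thumb for English text; the helper names estimateTokens and fitsWithoutPlanning are hypothetical and not part of the planner package.

```typescript
// Assumed rule of thumb: about 4 characters per token for English text.
const CHARS_PER_TOKEN = 4;

// Rough token estimate derived from character count alone.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Mirrors the guard in planDocumentContext: true when the text can skip the planner.
function fitsWithoutPlanning(text: string, maxTokens: number): boolean {
  return text.length < maxTokens * CHARS_PER_TOKEN;
}

// A 2,300-character receipt: 100 lines of 23 characters each.
const sampleReceipt = "Coffee 2 x 3.50 = 7.00\n".repeat(100);

console.log(estimateTokens(sampleReceipt)); // 575
console.log(fitsWithoutPlanning(sampleReceipt, 1000)); // true (2300 < 4000)
console.log(fitsWithoutPlanning(sampleReceipt, 500)); // false (2300 >= 2000)
```

Character counts are cheap to compute but imprecise, so a document near the threshold may still be larger than the estimate suggests once a real tokenizer is applied.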
Step 6: Build the expense parser (Mistral AI)
This is the heart of the pipeline. It sends the extracted text to Mistral AI and repairs the response. Create src/services/expense-parser.ts:
ts
import { Mistral } from "@mistralai/mistralai";
import { repair, UnrepairableError } from "@reaatech/structured-repair-core";
import { ExpenseDocumentSchema } from "../schemas/expense-schema.js";
import type { ExpenseDocument } from "../schemas/expense-schema.js";
import type { TelemetryContext } from "@reaatech/llm-cost-telemetry";

export class ParseError extends Error {
  code: string;
  rawInput: string;

  constructor(message: string, rawInput: string) {
    super(message);
    this.name = "ParseError";
    this.code = "PARSE_FAILED";
    this.rawInput = rawInput;
  }
}

const SYSTEM_PROMPT = `You extract expense line items from receipt/invoice text. Return a JSON array of objects matching this schema:
{
  vendorName: string,
  date: string,
  totalAmount: number,
  currency: string,
  lineItems: [
    {
      itemDescription: string,
      quantity: number,
      unitAmount: number,
      taxType: string,
      lineAmount: number,
      category: string
    }
  ],
  receiptNumber?: string
}`;

export async function parseExpenses(
  extractedText: string,
  _telemetryContext?: TelemetryContext,
): Promise<ExpenseDocument[]> {
  const mistral = new Mistral({
    apiKey: process.env.MISTRAL_API_KEY ?? "",
  });
  void _telemetryContext;

  if (!extractedText || extractedText.trim().length === 0) {
    return [];
  }

  try {
    const result = await mistral.chat.complete({
      model: "mistral-large-latest",
      messages: [
        { role: "system", content: SYSTEM_PROMPT },
        { role: "user", content: extractedText },
      ],
      responseFormat: { type: "text" },
    });

    const messageContent = result.choices[0]?.message?.content;
    const rawOutput =
      typeof messageContent === "string"
        ? messageContent
        : JSON.stringify(messageContent ?? "");

    try {
      const data = await repair(ExpenseDocumentSchema, rawOutput);
      if (Array.isArray(data)) {
        return data;
      }
      return [data];
    } catch (repairErr) {
      if (repairErr instanceof UnrepairableError) {
        throw new ParseError(
          "Could not parse Mistral output into expense schema",
          rawOutput,
        );
      }
      throw new ParseError(
        repairErr instanceof Error ? repairErr.message : "Unknown repair error",
        rawOutput,
      );
    }
  } catch (parseErr) {
    // Repair failures are already ParseErrors; anything else came from the API call.
    if (parseErr instanceof ParseError) throw parseErr;
    const message =
      parseErr instanceof Error ? parseErr.message : "Unknown Mistral API error";
    throw new ParseError(message, extractedText);
  }
}
Expected output: The function calls mistral.chat.complete() with a detailed system prompt, then feeds the raw output through repair(). The repair engine strips code fences, fixes trailing commas, coerces types, and maps misnamed keys — so even messy LLM output becomes valid data.
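To give a feel for what two of those repair steps involve, here is a small standalone sketch that strips a Markdown code fence and removes trailing commas before parsing. It is an illustration only, not the implementation inside @reaatech/structured-repair-core; the function names are hypothetical, and the comma removal is deliberately naive (it does not skip commas inside string values).

```typescript
// Built at runtime so this listing never contains a literal fence marker.
const FENCE = "`".repeat(3);
const FENCE_PATTERN = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Returns the contents of the first fenced block, or the trimmed input if none is found.
function stripCodeFences(raw: string): string {
  const match = raw.match(FENCE_PATTERN);
  return (match?.[1] ?? raw).trim();
}

// Drops a comma that directly precedes a closing brace or bracket.
// Naive: it would also alter such a sequence inside a string value.
function removeTrailingCommas(text: string): string {
  return text.replace(/,\s*([}\]])/g, "$1");
}

// Tries progressively more aggressive candidates until one parses.
function repairJsonText(raw: string): unknown {
  const unfenced = stripCodeFences(raw);
  const candidates = [raw, unfenced, removeTrailingCommas(unfenced)];
  for (const candidate of candidates) {
    try {
      return JSON.parse(candidate);
    } catch {
      // Fall through to the next candidate.
    }
  }
  throw new Error("Could not repair JSON output");
}
```

Trying the untouched input first means well-formed output is never altered, and each later candidate applies one more fix than the one before it. Type coercion and key mapping need the target schema, which is why the real engine takes ExpenseDocumentSchema as an argument.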
Note: The v2 Mistral SDK uses new Mistral() (named import) and calls .chat.complete(), not .chat.completions.create() like OpenAI.
Step 7: Build the cost telemetry service
Track all Mistral API calls with token counts and cost. Create src/services/cost-telemetry.ts:
Expected output: calculateCostFromTokens(tokens, pricePerMillion) computes (tokens / 1,000,000) × pricePerMillion. For a 10,000-token call at $4 per million tokens, that’s $0.04. Each span is logged as JSON.
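The formula is simple enough to sketch directly. The version below follows the signature named above but is not copied from the recipe's cost-telemetry.ts, which is not shown here; the input validation and the buildCostSpan helper with its field names are assumptions added for illustration.

```typescript
// Sketch of the formula above: cost = (tokens / 1,000,000) * price per million tokens.
function calculateCostFromTokens(tokens: number, pricePerMillion: number): number {
  if (tokens < 0 || pricePerMillion < 0) {
    throw new RangeError("tokens and pricePerMillion must be non-negative");
  }
  return (tokens / 1e6) * pricePerMillion;
}

// Hypothetical span shape; the real telemetry fields may differ.
interface CostSpan {
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

// Prices input and output tokens separately and returns a JSON-serializable span.
function buildCostSpan(
  model: string,
  inputTokens: number,
  outputTokens: number,
  inputPricePerMillion: number,
  outputPricePerMillion: number,
): CostSpan {
  const costUsd =
    calculateCostFromTokens(inputTokens, inputPricePerMillion) +
    calculateCostFromTokens(outputTokens, outputPricePerMillion);
  return { model, inputTokens, outputTokens, costUsd };
}

// The worked example from the text: 10,000 tokens at $4 per million tokens.
console.log(calculateCostFromTokens(10000, 4)); // 0.04
```

Calling JSON.stringify on the returned span gives a one-line log record, which matches the "logged as JSON" behavior described above.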
Step 8: Build the budget enforcer
Enforce a daily spend cap so an accidental large document doesn’t run up a big bill. Create src/services/budget-enforcer.ts:
Expected output: The budget controller uses a state machine: Active → Warned (at 80% spend) → Stopped (at 100%). The checkBudget() call throws "Budget exceeded" if the hard cap is reached, and the route handler returns a 429 status code.
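The state machine can be sketched in a few lines. This is a simplified in-memory illustration of the Active, Warned, and Stopped transitions described above, not the recipe's budget-enforcer.ts; the class name DailyBudgetSketch is hypothetical, and daily reset logic is left out.

```typescript
type BudgetState = "Active" | "Warned" | "Stopped";

interface BudgetStatus {
  spent: number;
  remaining: number;
  state: BudgetState;
}

// Hypothetical in-memory budget tracker; it does not reset at midnight or persist anything.
class DailyBudgetSketch {
  private spent = 0;
  private readonly dailyLimit: number;
  private readonly warnRatio: number;

  constructor(dailyLimit: number, warnRatio = 0.8) {
    if (dailyLimit <= 0) throw new RangeError("dailyLimit must be positive");
    this.dailyLimit = dailyLimit;
    this.warnRatio = warnRatio;
  }

  // Derives the state from the spend ratio: Stopped at 100%, Warned at the warn ratio.
  getStatus(): BudgetStatus {
    const ratio = this.spent / this.dailyLimit;
    let state: BudgetState = "Active";
    if (ratio >= 1) state = "Stopped";
    else if (ratio >= this.warnRatio) state = "Warned";
    return {
      spent: this.spent,
      remaining: Math.max(0, this.dailyLimit - this.spent),
      state,
    };
  }

  // Throws once the hard cap is reached, so callers can translate it into a 429.
  checkBudget(): void {
    if (this.getStatus().state === "Stopped") {
      throw new Error("Budget exceeded");
    }
  }

  recordSpend(amount: number): void {
    if (amount < 0) throw new RangeError("amount must be non-negative");
    this.spent += amount;
  }
}
```

Deriving the state from the running total on each call, rather than storing it, keeps the transitions consistent: there is no way for the stored state and the spend figure to drift apart.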
Step 9: Build the Xero client
Push parsed expense documents to Xero as ACCREC invoices via OAuth 2.0 client credentials. Create src/services/xero-client.ts:
Expected output: The client uses client_credentials grant — no redirect URI needed. Each document becomes one ACCREC invoice. If an individual invoice push fails, the error is logged and the batch continues with the next document.
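As an illustration of the "one document, one invoice" mapping, the sketch below converts a parsed expense document into the JSON shape that Xero's Accounting API accepts for an invoice. The input field names come from the Step 6 system prompt and the output keys (Type, Contact, Date, CurrencyCode, Reference, LineItems) follow Xero's invoice format, but the function toXeroInvoice and its interfaces are hypothetical; the recipe's xero-client.ts is not shown here and may map fields differently or use an SDK.

```typescript
// Input shape: the subset of parsed expense fields this mapping needs.
interface ExpenseLineInput {
  itemDescription: string;
  quantity: number;
  unitAmount: number;
  taxType: string;
  lineAmount: number;
}

interface ExpenseInput {
  vendorName: string;
  date: string;
  currency: string;
  lineItems: ExpenseLineInput[];
  receiptNumber?: string;
}

// Output shape: a minimal invoice body using Xero's PascalCase field names.
interface XeroInvoicePayload {
  Type: "ACCREC";
  Contact: { Name: string };
  Date: string;
  CurrencyCode: string;
  Reference?: string;
  LineItems: Array<{
    Description: string;
    Quantity: number;
    UnitAmount: number;
    TaxType: string;
    LineAmount: number;
  }>;
}

// Hypothetical mapper: one parsed expense document becomes one ACCREC invoice.
function toXeroInvoice(doc: ExpenseInput): XeroInvoicePayload {
  const invoice: XeroInvoicePayload = {
    Type: "ACCREC",
    Contact: { Name: doc.vendorName },
    Date: doc.date,
    CurrencyCode: doc.currency,
    LineItems: doc.lineItems.map((item) => ({
      Description: item.itemDescription,
      Quantity: item.quantity,
      UnitAmount: item.unitAmount,
      TaxType: item.taxType,
      LineAmount: item.lineAmount,
    })),
  };
  // Only set Reference when the receipt carried a number.
  if (doc.receiptNumber !== undefined) {
    invoice.Reference = doc.receiptNumber;
  }
  return invoice;
}
```

Keeping the mapping as a pure function makes it easy to unit test without network access; the push itself then only has to wrap the result in an Invoices array and send it with the bearer token obtained from the client credentials grant.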
Step 10: Create the API route handlers
Now wire the pipeline into a Next.js API route. Create app/api/process-expense/route.ts:
ts
import { type NextRequest, NextResponse } from "next/server";
import { generateId } from "@reaatech/llm-cost-telemetry";
import { extractText } from "../../../src/services/document-extractor.js";
import { planDocumentContext } from "../../../src/services/context-planner.js";
import { getTokenEstimate, recordLlmCall } from "../../../src/services/cost-telemetry.js";
import { checkBudget, recordSpend } from "../../../src/services/budget-enforcer.js";
import { parseExpenses } from "../../../src/services/expense-parser.js";
import { pushExpensesToXero } from "../../../src/services/xero-client.js";

const MAX_FILE_SIZE = 10 * 1024 * 1024;
const ALLOWED_MIME_TYPES
Then create app/api/budget-status/route.ts:
ts
import { NextResponse } from "next/server";
import { getBudgetStatus } from "../../../src/services/budget-enforcer.js";

export function GET() {
  const status = getBudgetStatus();
  return NextResponse.json(status, { status: 200 });
}
Expected output: POST /api/process-expense accepts a multipart file upload, runs the full pipeline, and returns a JSON response with the parsed documents and cost breakdown. It returns:
400 if no file or unsupported MIME type
413 if the file exceeds 10 MB
429 if the daily budget is exceeded
500 if parsing, repair, or Xero push fails
GET /api/budget-status returns { spent, remaining, state }.
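The status codes listed above can be summarized as two small decision functions. This is a sketch under assumptions rather than the route's actual code: the allow-list below is a guess (the route's own ALLOWED_MIME_TYPES value is not shown on this page), the order of the checks is assumed, and validateUpload and statusForError are hypothetical names.

```typescript
// Same 10 MB limit as MAX_FILE_SIZE in the route.
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024;

// Assumed allow-list (PDF and XLSX); the route's real ALLOWED_MIME_TYPES is not shown.
const ASSUMED_MIME_TYPES: string[] = [
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
];

interface UploadInfo {
  size: number;
  type: string;
}

// Returns the HTTP status for a rejected upload, or null when the upload may proceed.
function validateUpload(file: UploadInfo | null): 400 | 413 | null {
  if (file === null) return 400; // no file in the form data
  if (!ASSUMED_MIME_TYPES.includes(file.type)) return 400; // unsupported MIME type
  if (file.size > MAX_UPLOAD_BYTES) return 413; // larger than 10 MB
  return null;
}

// Maps a pipeline failure to a status: budget errors become 429, everything else 500.
function statusForError(err: unknown): 429 | 500 {
  if (err instanceof Error && err.message.includes("Budget exceeded")) {
    return 429;
  }
  return 500;
}
```

Separating request validation from error mapping keeps the handler body short: validate first and return early, then wrap the pipeline in a single try/catch that calls the error mapper.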
Step 11: Build the upload UI
Replace app/page.tsx with a client component that has a file upload form:
Expected output: A clean upload form. Clicking “Process Expense” disables the button, sends the file to the API, and displays the response JSON below the form. Errors show in red.
Step 12: Create barrel exports and run the tests
Update src/index.ts to re-export everything from the services:
ts
export { extractText, ExtractionError } from "./services/document-extractor.js";
export { planDocumentContext } from "./services/context-planner.js";
export { parseExpenses, ParseError } from "./services/expense-parser.js";
export { recordLlmCall, getTokenEstimate } from "./services/cost-telemetry.js";
export {
  getBudgetController,
  initializeBudget,
  checkBudget,
  recordSpend,
  getBudgetStatus,
} from "./services/budget-enforcer.js";
export { initXeroClient, pushExpensesToXero, XeroPushError } from "./services/xero-client.js";
Now run the type checker, linter, and test suite:
terminal
pnpm typecheck
pnpm lint
pnpm test
Expected output: TypeScript compiles without errors. ESLint passes. Vitest runs with coverage: