Automated Receipt Classifier for Small CPA Firms

Eliminate manual receipt categorization with an AI agent that extracts vendor, amount, and GL category from client uploads.

receipt-classifier cpa-firm document-pipeline nextjs fastify ocr agent-mesh guardrails typescript

The problem

A bookkeeper at a 3-person CPA firm spends 10+ hours per week manually sorting client receipts forwarded via email or upload. Each receipt must be reviewed to identify the vendor, extract the amount, and assign the correct GL category. During tax season, this backlog balloons, causing overtime and errors. The bookkeeper needs a way to automate this drudgery so they can focus on reconciliations and client advisory.

Intro

A bookkeeper at a small CPA firm spends 10+ hours per week manually sorting client receipts: identifying the vendor, extracting the amount, and assigning a GL category. This tutorial walks you through building an automated receipt classifier agent that handles that workflow end to end. It extracts text from PDFs directly and from images via OCR, classifies each receipt with an LLM, validates the output through a guardrail chain, tracks spend and evaluation trajectories, and surfaces everything through a Next.js upload UI. You'll use the Vercel AI SDK with generateText and Output.object for structured LLM output, unpdf for PDF text extraction, tesseract.js for image OCR, and the REAA (Reusable Enterprise AI Agent) packages for agent mesh, guardrails, budget tracking, the golden eval harness, and markdown utilities.
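
The flow described in this intro can be pictured as four stages run in order. The sketch below is only an illustration under stated assumptions: every stage is a stub, and the names Stages, PipelineResult, and runPipeline are hypothetical rather than the tutorial's own pipeline code.

```typescript
// Hypothetical sketch of the extract -> classify -> validate -> track flow.
// All stages are injected stubs; in the tutorial they are backed by
// unpdf/tesseract.js, an LLM call, and the REAA packages.
interface Classification {
  vendor: string;
  amount: number;
  glCategory: string;
}

interface PipelineResult {
  classification: Classification;
  valid: boolean;
  tokensUsed: number;
}

interface Stages {
  extract: (file: Uint8Array) => Promise<string>;
  classify: (text: string) => Promise<{ classification: Classification; tokens: number }>;
  validate: (classification: Classification) => Promise<boolean>;
  track: (tokens: number) => void;
}

async function runPipeline(file: Uint8Array, stages: Stages): Promise<PipelineResult> {
  const text = await stages.extract(file);
  const { classification, tokens } = await stages.classify(text);
  const valid = await stages.validate(classification);
  // Usage is recorded whether or not validation passed, since the tokens were spent either way.
  stages.track(tokens);
  return { classification, valid, tokensUsed: tokens };
}
```

Passing the stages in as functions keeps each one replaceable in tests, which is one reasonable way to exercise the flow without calling a real model.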

Prerequisites

Node.js 22+ and pnpm 10 installed on your machine
An OpenAI API key with access to gpt-5.2-mini
A Langfuse account with public and secret keys (for observability tracing)
Basic familiarity with TypeScript, Next.js App Router, and REST APIs
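
Before going further, it can help to confirm that the keys listed above are actually set. The following is a minimal sketch, not part of the tutorial's code: findMissingEnv is a hypothetical helper, OPENAI_API_KEY is the variable @ai-sdk/openai reads by default, and the two Langfuse variable names are assumed here because the tutorial's environment step is not shown.

```typescript
// Sketch: report which required environment variables are missing or blank.
// The Langfuse variable names below are an assumption, not taken from the tutorial.
const REQUIRED_ENV = ["OPENAI_API_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"] as const;

function findMissingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

In a Node.js entry point you might call findMissingEnv(process.env) at startup and exit with a clear message if the returned list is not empty.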

Step 1: Scaffold the project and install dependencies

Create the project directory and cd into it:

terminal

mkdir agnostic-receipt-classifier-agent && cd agnostic-receipt-classifier-agent

Create package.json with exact-pinned dependencies:

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

173 kB·91 tests·94.6% coverage·vitest passing

SHA-256: 5c07b46483f11d07345ef400907a721c2cee66cfa83a2312a9480e314f3b6bef

// Document extractor: pulls raw text out of an uploaded receipt.
// PDFs go through unpdf; anything else is treated as an image and sent to tesseract.js OCR.
import { extractText, getDocumentProxy } from "unpdf";
import { createWorker } from "tesseract.js";

export class ExtractionError extends Error {
  code: string;
  sourceType: string;
  cause: unknown;

  constructor(message: string, code: string, sourceType: string, cause: unknown) {
    super(message);
    this.name = "ExtractionError";
    this.code = code;
    this.sourceType = sourceType;
    this.cause = cause;
  }
}

export class DocumentExtractor {
  async extractFromPdf(buffer: Uint8Array): Promise<{ totalPages: number; text: string }> {
    const pdf = await getDocumentProxy(new Uint8Array(buffer));
    const result = await extractText(pdf, { mergePages: true });
    return { totalPages: result.totalPages, text: result.text };
  }

  async extractFromImage(buffer: Uint8Array): Promise<string> {
    const worker = await createWorker("eng");
    try {
      const imageBuf = buffer instanceof Buffer ? buffer : Buffer.from(buffer);
      const ret = await worker.recognize(imageBuf);
      return ret.data.text;
    } finally {
      // Always release the OCR worker, even when recognition throws.
      await worker.terminate();
    }
  }

  // Checks the first four bytes for the "%PDF" signature; everything else is assumed to be an image.
  classifySourceType(buffer: Uint8Array): "pdf" | "image" {
    const header = new Uint8Array(buffer.slice(0, 4));
    const isPdf =
      header[0] === 0x25 &&
      header[1] === 0x50 &&
      header[2] === 0x44 &&
      header[3] === 0x46;
    return isPdf ? "pdf" : "image";
  }

  async extract(
    buffer: Uint8Array,
  ): Promise<{ rawText: string; sourceType: "pdf" | "image"; totalPages: number }> {
    const sourceType = this.classifySourceType(buffer);
    try {
      if (sourceType === "pdf") {
        const result = await this.extractFromPdf(buffer);
        return { rawText: result.text, sourceType, totalPages: result.totalPages };
      }
      const text = await this.extractFromImage(buffer);
      return { rawText: text, sourceType, totalPages: 1 };
    } catch (err) {
      // Wrap any parser or OCR failure in a single typed error for callers.
      throw new ExtractionError(
        `Failed to extract text from ${sourceType}`,
        "EXTRACTION_FAILED",
        sourceType,
        err instanceof Error ? err : new Error(String(err)),
      );
    }
  }
}
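
The classifySourceType method treats every non-PDF buffer as an image, so a text file or a corrupt upload would still be handed to OCR. If you want to reject such files earlier, one possible approach is to check the PNG and JPEG signatures as well. The sketch below is an optional variation, not the tutorial's code; sniffFileType and SniffedType are hypothetical names.

```typescript
// Sketch: identify PDF, PNG, and JPEG uploads by their leading "magic" bytes.
type SniffedType = "pdf" | "png" | "jpeg" | "unknown";

const SIGNATURES: Array<{ type: Exclude<SniffedType, "unknown">; bytes: number[] }> = [
  { type: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { type: "png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { type: "jpeg", bytes: [0xff, 0xd8, 0xff] },
];

function sniffFileType(buffer: Uint8Array): SniffedType {
  for (const { type, bytes } of SIGNATURES) {
    // A buffer shorter than the signature cannot match it.
    if (buffer.length >= bytes.length && bytes.every((b, i) => buffer[i] === b)) {
      return type;
    }
  }
  return "unknown";
}
```

A caller could map "unknown" to a 4xx response before any OCR worker is created.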

// Receipt classifier: asks the LLM for a schema-constrained classification and retries once on failure.
import { randomId } from "@reaatech/agents-markdown";
import { generateText, Output } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
import type { CostTelemetry } from "@reaatech/llm-router-core";
import type { ClassificationResult } from "../types/receipt.js";

const ClassifiedReceiptSchema = z.object({
  vendor: z.string(),
  amount: z.number().positive(),
  glCategory: z.enum([
    "Travel",
    "OfficeSupplies",
    "Utilities",
    "Meals",
    "Equipment",
    "Software",
    "ProfessionalServices",
    "Rent",
    "Other",
  ]),
  confidence: z.number().min(0).max(1),
  explanation: z.string(),
  isValid: z.boolean(),
});

const SYSTEM_PROMPT =
  "You are a CPA bookkeeping assistant. Extract the vendor name, total receipt amount, and best-fit GL category from the receipt text below. Respond with the exact JSON schema provided.";

export class ReceiptClassifier {
  private getModel() {
    const modelId = process.env.RECEIPT_CLASSIFIER_MODEL ?? "gpt-5.2-mini";
    return openai(modelId);
  }

  async classify(
    extractedText: string,
  ): Promise<{ classification: ClassificationResult; telemetry: CostTelemetry }> {
    const attempt = async (): Promise<{
      classification: ClassificationResult;
      telemetry: CostTelemetry;
    }> => {
      const result = await generateText({
        model: this.getModel(),
        output: Output.object({ schema: ClassifiedReceiptSchema }),
        system: SYSTEM_PROMPT,
        prompt: extractedText,
      });
      const classification: ClassificationResult = result.output;
      // Only token counts are reported here; cost is left at 0.
      const telemetry: CostTelemetry = {
        requestId: randomId(),
        modelId: process.env.RECEIPT_CLASSIFIER_MODEL ?? "gpt-5.2-mini",
        cost: 0,
        inputTokens: result.usage.inputTokens ?? 0,
        outputTokens: result.usage.outputTokens ?? 0,
        timestamp: new Date(),
        strategy: "receipt-classification",
      };
      return { classification, telemetry };
    };

    try {
      return await attempt();
    } catch {
      // First attempt failed: retry once before giving up.
      try {
        return await attempt();
      } catch {
        // Both attempts failed: return an explicitly invalid result instead of throwing.
        const classification: ClassificationResult = {
          vendor: "",
          amount: 0,
          glCategory: "Other",
          confidence: 0,
          explanation: "Classification failed after retry",
          isValid: false,
        };
        const telemetry: CostTelemetry = {
          requestId: randomId(),
          modelId: process.env.RECEIPT_CLASSIFIER_MODEL ?? "gpt-5.2-mini",
          cost: 0,
          inputTokens: 0,
          outputTokens: 0,
          timestamp: new Date(),
          strategy: "receipt-classification",
        };
        return { classification, telemetry };
      }
    }
  }
}
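
The classifier's nested try/catch amounts to "try up to two times, then return a fallback". If you later want a configurable attempt count, the same behavior can be expressed as a small generic helper. This is a sketch of that refactoring, not code from the tutorial; withRetry is a hypothetical name.

```typescript
// Sketch: run an async operation up to maxAttempts times, then fall back.
// Errors are deliberately swallowed, matching the classifier's behavior of
// returning an invalid result rather than throwing.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  fallback: () => T,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      return await operation();
    } catch {
      // Ignore the failure and move on to the next attempt, if any remain.
    }
  }
  return fallback();
}
```

With this helper, the classify method could reduce to something like withRetry(attempt, 2, buildFailedResult), where buildFailedResult is a hypothetical function that builds the isValid: false result.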

// Guardrail chain: three output checks run against every classification before it is accepted.
import {
  GuardrailChain,
  ChainBuilder,
  setLogger,
  ConsoleLogger,
  LRUCache,
  generateCorrelationId,
  type Guardrail,
  type GuardrailResult,
  type ChainContext,
} from "@reaatech/guardrail-chain";
import type { ClassificationResult, GlCategory } from "../types/receipt.js";

// Imported but not used in this module; `void` marks them as intentionally unused.
void LRUCache;
void generateCorrelationId;

setLogger(new ConsoleLogger());

const VALID_CATEGORIES: GlCategory[] = [
  "Travel",
  "OfficeSupplies",
  "Utilities",
  "Meals",
  "Equipment",
  "Software",
  "ProfessionalServices",
  "Rent",
  "Other",
];

// Fails when the amount is zero, negative, or above 1,000,000.
class AmountSanityGuardrail implements Guardrail<ClassificationResult, ClassificationResult> {
  readonly id = "amount-sanity";
  readonly name = "Amount Sanity";
  readonly type = "output" as const;
  enabled = true;

  execute(
    input: ClassificationResult,
    _context: ChainContext,
  ): Promise<GuardrailResult<ClassificationResult>> {
    void _context;
    if (input.amount <= 0 || input.amount > 1_000_000) {
      return Promise.resolve({ passed: false, output: input });
    }
    return Promise.resolve({ passed: true, output: input });
  }
}

// Fails on an unknown GL category and rewrites it to "Other" with confidence capped at 0.3.
class GlCategoryGuardrail implements Guardrail<ClassificationResult, ClassificationResult> {
  readonly id = "gl-category";
  readonly name = "GL Category";
  readonly type = "output" as const;
  enabled = true;

  execute(
    input: ClassificationResult,
    _context: ChainContext,
  ): Promise<GuardrailResult<ClassificationResult>> {
    void _context;
    if (!VALID_CATEGORIES.includes(input.glCategory)) {
      return Promise.resolve({
        passed: false,
        output: {
          ...input,
          glCategory: "Other",
          confidence: Math.min(input.confidence, 0.3),
        },
      });
    }
    return Promise.resolve({ passed: true, output: input });
  }
}

// Fails when the vendor is missing or whitespace only.
class VendorPresenceGuardrail implements Guardrail<ClassificationResult, ClassificationResult> {
  readonly id = "vendor-presence";
  readonly name = "Vendor Presence";
  readonly type = "output" as const;
  enabled = true;

  execute(
    input: ClassificationResult,
    _context: ChainContext,
  ): Promise<GuardrailResult<ClassificationResult>> {
    void _context;
    if (!input.vendor || input.vendor.trim().length === 0) {
      return Promise.resolve({ passed: false, output: input });
    }
    return Promise.resolve({ passed: true, output: input });
  }
}

export function buildReceiptGuardrailChain(): GuardrailChain {
  return new ChainBuilder()
    .withBudget({ maxLatencyMs: 2000, maxTokens: 8000 })
    .withGuardrail(new AmountSanityGuardrail())
    .withGuardrail(new GlCategoryGuardrail())
    .withGuardrail(new VendorPresenceGuardrail())
    .withSlowGuardrailSkipping(true)
    .build();
}

export async function validateClassification(
  classification: ClassificationResult,
): Promise<{ success: boolean; failedGuardrail?: string }> {
  const chain = buildReceiptGuardrailChain();
  const result = await chain.execute(classification);
  return {
    success: result.success,
    failedGuardrail: result.failedGuardrail,
  };
}
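
To make the shape of the result easier to see, here is a dependency-free stand-in for the three checks. It is a sketch only: it does not use @reaatech/guardrail-chain, it assumes the chain reports the first failing guardrail (the library's actual ordering and short-circuit behavior are not shown in this tutorial), and Check, runChecks, and receiptChecks are hypothetical names.

```typescript
// Sketch: run simple checks in order and report the first one that fails.
interface ReceiptLike {
  vendor: string;
  amount: number;
  glCategory: string;
}

interface Check<T> {
  id: string;
  passes: (input: T) => boolean;
}

const KNOWN_CATEGORIES = [
  "Travel", "OfficeSupplies", "Utilities", "Meals", "Equipment",
  "Software", "ProfessionalServices", "Rent", "Other",
];

const receiptChecks: Check<ReceiptLike>[] = [
  { id: "amount-sanity", passes: (r) => r.amount > 0 && r.amount <= 1_000_000 },
  { id: "gl-category", passes: (r) => KNOWN_CATEGORIES.includes(r.glCategory) },
  { id: "vendor-presence", passes: (r) => r.vendor.trim().length > 0 },
];

function runChecks<T>(input: T, checks: Check<T>[]): { success: boolean; failedGuardrail?: string } {
  for (const check of checks) {
    if (!check.passes(input)) {
      return { success: false, failedGuardrail: check.id };
    }
  }
  return { success: true };
}
```

For example, runChecks({ vendor: "Staples", amount: 0, glCategory: "OfficeSupplies" }, receiptChecks) reports amount-sanity as the failed guardrail, which mirrors the { success, failedGuardrail } shape returned by validateClassification.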

// Spend tracker: a process-wide singleton that records each classification under two scopes (user and client).
import { SpendStore } from "@reaatech/agent-budget-spend-tracker";
import { randomId } from "@reaatech/agents-markdown";

export class ReceiptSpendTracker {
  private static instance: ReceiptSpendTracker | undefined;

  // Structural type covering only the SpendStore methods this class calls.
  private store: {
    record(entry: {
      requestId: string;
      scopeType: string;
      scopeKey: string;
      cost: number;
      inputTokens: number;
      outputTokens: number;
      modelId: string;
      provider: string;
      timestamp: Date;
    }): number;
    getSpend(scopeType: string, scopeKey: string): number;
    getRate(scopeType: string, scopeKey: string, windowMinutes?: number): number;
    detectSpikes(
      scopeType: string,
      scopeKey: string,
      windowSize?: number,
      thresholdStdDev?: number,
    ): Array<{
      entryId: number;
      cost: number;
      expectedCost: number;
      deviation: number;
      timestamp: Date;
    }>;
  };

  private constructor() {
    this.store = new SpendStore({ maxEntries: 100_000 });
  }

  static getInstance(): ReceiptSpendTracker {
    if (ReceiptSpendTracker.instance === undefined) {
      ReceiptSpendTracker.instance = new ReceiptSpendTracker();
    }
    return ReceiptSpendTracker.instance;
  }

  // Records the same request once per scope and returns the sum of the two entry ids.
  recordClassification(params: {
    modelId: string;
    cost: number;
    inputTokens: number;
    outputTokens: number;
    userId: string;
    clientId: string;
  }): number {
    const requestId = randomId();
    const now = new Date();
    const base = {
      requestId,
      cost: params.cost,
      inputTokens: params.inputTokens,
      outputTokens: params.outputTokens,
      modelId: params.modelId,
      provider: "openai",
      timestamp: now,
    };
    const userEntryId = this.store.record({
      ...base,
      scopeType: "user",
      scopeKey: params.userId,
    });
    const clientEntryId = this.store.record({
      ...base,
      scopeType: "client",
      scopeKey: params.clientId,
    });
    return userEntryId + clientEntryId;
  }

  getSpendForUser(userId: string): number {
    return this.store.getSpend("user", userId);
  }

  getSpendForClient(clientId: string): number {
    return this.store.getSpend("client", clientId);
  }

  // Spend rate for the user over a 5-minute window.
  getRateForUser(userId: string): number {
    return this.store.getRate("user", userId, 5);
  }

  detectSpikeForUser(userId: string): Array<{ cost: number; modelId: string }> {
    const spikes = this.store.detectSpikes("user", userId, 2);
    return spikes.map((s) => ({ cost: s.cost, modelId: this.extractModelId(s) }));
  }

  // Spike entries are not typed with a model id, so this falls back to "unknown".
  private extractModelId(s: {
    entryId: number;
    cost: number;
    expectedCost: number;
    deviation: number;
    timestamp: Date;
    modelId?: string;
  }): string {
    return s.modelId ?? "unknown";
  }
}
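
Note that the classifier reports cost: 0 in its telemetry, so if that value is passed straight to recordClassification the tracker will show zero spend. One way to get a usable number is to derive cost from the token counts. The sketch below assumes per-million-token pricing; estimateCost and TokenRates are hypothetical names, and the rates in the usage note are placeholders, not real prices.

```typescript
// Sketch: estimate request cost from token counts and per-million-token rates.
// Rates must come from your provider's current price list; none are hardcoded here.
interface TokenRates {
  inputPerMillion: number;
  outputPerMillion: number;
}

function estimateCost(inputTokens: number, outputTokens: number, rates: TokenRates): number {
  const inputCost = (inputTokens / 1_000_000) * rates.inputPerMillion;
  const outputCost = (outputTokens / 1_000_000) * rates.outputPerMillion;
  return inputCost + outputCost;
}
```

With placeholder rates of 2 per million input tokens and 8 per million output tokens, estimateCost(500_000, 250_000, rates) works out to 1 + 2 = 3 in whatever currency the rates use.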

// Eval tracker: stores each classification as a single-turn golden trajectory and compares later runs against it.
import {
  quickCreateGolden,
  compareAgainstGolden as compareGolden,
  batchQualityCheck,
  createCurator,
} from "@reaatech/agent-eval-harness-golden";
import type { GoldenTrajectory } from "@reaatech/agent-eval-harness-golden";
import type { ClassificationResult } from "../types/receipt.js";

export interface CurationReportEntry {
  goldenId: string;
  passed: boolean;
  score: number;
  issues: string[];
  suggestions: string[];
}

export class ReceiptEvalTracker {
  private static instance: ReceiptEvalTracker | undefined;

  private constructor() {}

  static getInstance(): ReceiptEvalTracker {
    if (ReceiptEvalTracker.instance === undefined) {
      ReceiptEvalTracker.instance = new ReceiptEvalTracker();
    }
    return ReceiptEvalTracker.instance;
  }

  // Wraps the classification (serialized as JSON) in a one-turn trajectory tagged with the model id.
  storeTrajectory(
    receiptId: string,
    classification: ClassificationResult,
    modelId: string,
  ): GoldenTrajectory {
    const trajectory = {
      turns: [
        {
          turn_id: 0,
          role: "agent" as const,
          content: JSON.stringify(classification),
          timestamp: new Date().toISOString(),
        },
      ],
      metadata: {
        agent_id: modelId,
        total_turns: 1,
      },
    };
    return quickCreateGolden(trajectory, `receipt-${receiptId}`, [
      modelId,
      "receipt-classification",
    ]);
  }

  // Compares a candidate against a stored golden; the similarity threshold defaults to 0.85.
  compareAgainstGolden(
    candidate: ClassificationResult,
    golden: GoldenTrajectory,
    config?: { similarityThreshold?: number },
  ) {
    const candidateTrajectory = {
      turns: [
        {
          turn_id: 0,
          role: "agent" as const,
          content: JSON.stringify(candidate),
          timestamp: new Date().toISOString(),
        },
      ],
      metadata: {
        total_turns: 1,
      },
    };
    return compareGolden(golden, candidateTrajectory, {
      similarityThreshold: config?.similarityThreshold ?? 0.85,
    });
  }

  curateTrajectory(golden: GoldenTrajectory): void {
    const curator = createCurator(golden.trajectory);
    curator.annotateTurn({
      turnId: 0,
      expected: true,
      qualityNotes: "Receipt classification",
      alternatives: [],
    });
    curator.validate();
    curator.publish();
  }

  generateCurationReport(goldens: GoldenTrajectory[]): CurationReportEntry[] {
    const results = batchQualityCheck(goldens);
    return results.map((r) => ({
      goldenId: r.id,
      passed: r.result.passed,
      score: r.result.score,
      issues: r.result.issues.map((i) => i.description),
      suggestions: r.result.suggestions,
    }));
  }
}
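
Because the tracker stores each classification as a JSON string, a trajectory-level similarity score does not say which field drifted. When debugging a failed comparison, a field-by-field diff may be easier to read. This sketch is a hypothetical companion helper, not part of @reaatech/agent-eval-harness-golden; diffClassification and ReceiptFields are made-up names.

```typescript
// Sketch: list the fields where a candidate classification differs from a golden one.
interface ReceiptFields {
  vendor: string;
  amount: number;
  glCategory: string;
}

function diffClassification(
  golden: ReceiptFields,
  candidate: ReceiptFields,
): Array<keyof ReceiptFields> {
  const mismatches: Array<keyof ReceiptFields> = [];
  // Vendor names are compared case-insensitively, ignoring surrounding whitespace.
  if (golden.vendor.trim().toLowerCase() !== candidate.vendor.trim().toLowerCase()) {
    mismatches.push("vendor");
  }
  // Amounts are compared in whole cents to avoid floating-point noise.
  if (Math.round(golden.amount * 100) !== Math.round(candidate.amount * 100)) {
    mismatches.push("amount");
  }
  if (golden.glCategory !== candidate.glCategory) {
    mismatches.push("glCategory");
  }
  return mismatches;
}
```

An empty array means the candidate agrees with the golden on all three fields.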

// Fastify server: file upload, text-only classification, per-user spend, and health endpoints.
import Fastify, { type FastifyInstance } from "fastify";
import multipart from "@fastify/multipart";
import { ReceiptPipeline } from "../pipeline/receipt-pipeline.js";
import { ReceiptClassifier } from "../services/receipt-classifier.js";
import { ReceiptSpendTracker } from "../budget/spend-tracker.js";
import { z } from "zod";

const classifySchema = z.object({
  text: z.string().min(1, "Text must not be empty"),
});

// @fastify/multipart exposes text fields as objects with a `value` property, not as bare strings.
function fieldValue(field: unknown): string | undefined {
  if (typeof field === "object" && field !== null && "value" in field) {
    const value = (field as { value: unknown }).value;
    return typeof value === "string" ? value : undefined;
  }
  return undefined;
}

export async function startServer(port?: number): Promise<FastifyInstance> {
  const server = Fastify({ logger: true });
  await server.register(multipart);

  server.post("/api/receipts/upload", async (req, reply) => {
    try {
      const data = await req.file();
      if (!data) {
        return await reply.status(400).send({ error: "No file provided" });
      }
      const buffer = await data.toBuffer();
      if (buffer.length > 10 * 1024 * 1024) {
        return await reply.status(413).send({ error: "File exceeds 10 MB limit" });
      }
      if (buffer.length === 0) {
        return await reply.status(422).send({ error: "File is empty" });
      }
      const userId = fieldValue(data.fields.userId) ?? "anonymous";
      const clientId = fieldValue(data.fields.clientId) ?? "default";
      const pipeline = new ReceiptPipeline();
      const result = await pipeline.processReceipt(
        new Uint8Array(buffer),
        userId,
        clientId,
      );
      return await reply.status(200).send(result);
    } catch (err) {
      const message = err instanceof Error ? err.message : "Upload failed";
      return reply.status(500).send({ error: message });
    }
  });

  server.post("/api/receipts/classify", async (req, reply) => {
    try {
      const parsed = classifySchema.safeParse(req.body);
      if (!parsed.success) {
        const firstIssue = parsed.error.issues[0];
        return await reply.status(400).send({ error: firstIssue.message });
      }
      const classifier = new ReceiptClassifier();
      const classification = await classifier.classify(parsed.data.text);
      return await reply.status(200).send(classification);
    } catch (err) {
      const message = err instanceof Error ? err.message : "Classification failed";
      return reply.status(500).send({ error: message });
    }
  });

  server.get("/api/receipts/spend/:userId", async (req, reply) => {
    const { userId } = req.params as { userId: string };
    const tracker = ReceiptSpendTracker.getInstance();
    const totalSpend = tracker.getSpendForUser(userId);
    return reply.status(200).send({ totalSpend });
  });

  server.get("/api/health", async (_req, reply) => {
    return reply.status(200).send({
      status: "ok",
      timestamp: new Date().toISOString(),
    });
  });

  // Explicit argument wins, then FASTIFY_PORT, then 3001.
  const bindPort = port ?? (Number(process.env.FASTIFY_PORT) || 3001);
  await server.listen({ port: bindPort });
  return server;
}

export async function stopServer(server: FastifyInstance): Promise<void> {
  await server.close();
}
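
The size checks in the upload handler are easy to get subtly wrong at the boundaries, and they are hard to unit test while they live inside a route. One option is to factor them into a pure function. This is a sketch of that idea, not the tutorial's code; checkUploadSize and MAX_UPLOAD_BYTES are hypothetical names, while the status codes and messages are the ones the handler already uses.

```typescript
// Sketch: map an upload's byte length to the error response the handler would send,
// or null when the size is acceptable.
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024; // 10 MB, same limit as the upload route

function checkUploadSize(byteLength: number): { status: number; error: string } | null {
  if (byteLength > MAX_UPLOAD_BYTES) {
    return { status: 413, error: "File exceeds 10 MB limit" };
  }
  if (byteLength === 0) {
    return { status: 422, error: "File is empty" };
  }
  return null;
}
```

A file of exactly 10 MB passes, one byte more is rejected with 413, and an empty file is rejected with 422.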

"use client";
// Upload UI: a single-file form that posts the receipt and shows the JSON result or an error.
import { useState } from "react";

export default function Home() {
  const [file, setFile] = useState<File | null>(null);
  const [loading, setLoading] = useState(false);
  const [result, setResult] = useState<string | null>(null);
  const [error, setError] = useState<string | null>(null);

  async function handleSubmit(e: React.SyntheticEvent<HTMLFormElement>) {
    e.preventDefault();
    if (!file) return;
    setLoading(true);
    setResult(null);
    setError(null);
    try {
      const formData = new FormData();
      formData.set("file", file);
      const res = await fetch("/api/upload", { method: "POST", body: formData });
      const data = await res.json() as { error?: string };
      if (!res.ok) {
        setError(data.error ?? `HTTP ${String(res.status)}`);
      } else {
        setResult(JSON.stringify(data, null, 2));
      }
    } catch (err) {
      setError(err instanceof Error ? err.message : "Upload failed");
    } finally {
      setLoading(false);
    }
  }

  return (
    <main style={{ maxWidth: 640, margin: "2rem auto", fontFamily: "sans-serif" }}>
      <h1>Receipt Classifier</h1>
      <p>Upload a receipt (PDF, PNG, or JPEG) for automated classification.</p>
      <form onSubmit={(e) => { void handleSubmit(e); }}>
        <input
          type="file"
          accept=".pdf,.png,.jpg,.jpeg"
          onChange={(e: React.ChangeEvent<HTMLInputElement>) => {
            setFile(e.target.files?.[0] ?? null);
          }}
          disabled={loading}
          style={{ display: "block", marginBottom: "1rem" }}
        />
        <button type="submit" disabled={!file || loading}>
          {loading ? "Processing..." : "Submit"}
        </button>
      </form>
      {error && (
        <pre style={{ color: "red", whiteSpace: "pre-wrap", marginTop: "1rem" }}>
          Error: {error}
        </pre>
      )}
      {result && (
        <pre style={{ background: "#f5f5f5", padding: "1rem", borderRadius: 4, whiteSpace: "pre-wrap", marginTop: "1rem" }}>
          {result}
        </pre>
      )}
    </main>
  );
}

import { describe, it, expect, vi, beforeEach } from "vitest";

const mockGenerateText = vi.hoisted(() => vi.fn());

vi.mock("ai", () => ({
  generateText: mockGenerateText,
  Output: { object: vi.fn(() => ({ schema: undefined })) },
}));
vi.mock("@ai-sdk/openai", () => ({
  openai: vi.fn(() => ({ provider: "openai", modelId: "gpt-5.2-mini" })),
}));
vi.mock("@reaatech/agents-markdown", () => ({
  randomId: vi.fn(() => "mock-id"),
}));

import { ReceiptClassifier } from "../../src/services/receipt-classifier.js";

describe("ReceiptClassifier", () => {
  let classifier: ReceiptClassifier;

  beforeEach(() => {
    vi.clearAllMocks();
    classifier = new ReceiptClassifier();
  });

  it("classifies a Staples receipt as OfficeSupplies", async () => {
    mockGenerateText.mockResolvedValue({
      output: {
        vendor: "Staples",
        amount: 45.99,
        glCategory: "OfficeSupplies",
        confidence: 0.95,
        explanation: "Office supply purchase from Staples",
        isValid: true,
      },
      usage: { inputTokens: 100, outputTokens: 50 },
    });

    const { classification } = await classifier.classify("Staples store receipt total $45.99");

    expect(classification.vendor).toBe("Staples");
    expect(classification.amount).toBe(45.99);
    expect(classification.glCategory).toBe("OfficeSupplies");
    expect(classification.confidence).toBeGreaterThan(0.8);
    expect(classification.isValid).toBe(true);
  });

  it("retries once when generateText throws, then succeeds", async () => {
    mockGenerateText
      .mockRejectedValueOnce(new Error("API timeout"))
      .mockResolvedValueOnce({
        output: {
          vendor: "Amazon",
          amount: 25.00,
          glCategory: "OfficeSupplies",
          confidence: 0.85,
          explanation: "Office supplies from Amazon",
          isValid: true,
        },
        usage: { inputTokens: 90, outputTokens: 42 },
      });

    const { classification } = await classifier.classify("Amazon order $25.00");

    expect(classification.vendor).toBe("Amazon");
    expect(classification.isValid).toBe(true);
    expect(mockGenerateText).toHaveBeenCalledTimes(2);
  });

  it("returns isValid false when both attempts fail", async () => {
    mockGenerateText.mockRejectedValue(new Error("LLM unavailable"));

    const { classification } = await classifier.classify("Some receipt text");

    expect(classification.isValid).toBe(false);
    expect(classification.confidence).toBe(0);
    expect(classification.glCategory).toBe("Other");
    expect(classification.vendor).toBe("");
    expect(classification.amount).toBe(0);
    expect(mockGenerateText).toHaveBeenCalledTimes(2);
  });
});

Intro

Prerequisites

Step 1: Scaffold the project and install dependencies

Step 2: Configure environment variables

Step 3: Define the receipt types

Step 4: Build the document extractor

Step 5: Build the AI receipt classifier

Step 6: Create the guardrail chain

Step 7: Build the budget spend tracker and eval tracker

Step 8: Add Langfuse observability

Step 9: Wire the receipt pipeline

Step 10: Create the Next.js API routes

Step 11: Build the Fastify server

Step 12: Add instrumentation and the entry point

Step 13: Create the upload UI

Step 14: Run the tests

Next steps