xAI Grok Expense Report Extraction for SMB Finance
Automatically extract line items, totals, and merchant names from scanned receipts and invoices using xAI Grok's vision model, then export to spreadsheets.
This tutorial walks you through building an expense-report extraction pipeline that processes scanned receipts and PDF invoices using xAI Grok’s vision model. You’ll create a Next.js API that accepts document uploads, extracts text via tesseract.js and unpdf, passes it to Grok for structured expense extraction, repairs malformed JSON, validates accuracy against golden reference data, tracks API costs, and exports results as CSV files stored in S3.
Prerequisites
Node.js 22+ and pnpm installed on your machine
An xAI API key with access to the Grok model (set as XAI_API_KEY)
An AWS S3 bucket and IAM credentials (set as AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_BUCKET_NAME)
Familiarity with TypeScript and Next.js App Router basics
Step 1: Scaffold the project and install dependencies
Start by creating a Next.js project with the App Router and installing the required packages.
The REAA packages (@reaatech/media-pipeline-mcp-doc-extraction, @reaatech/structured-repair-core, @reaatech/classifier-evals, @reaatech/llm-cost-telemetry) are vendored internally. Place them in your packages directory or install from a private registry.
terminal
# If vendored locally:
pnpm add @reaatech/media-pipeline-mcp-doc-extraction@0.3.0 @reaatech/structured-repair-core@1.0.0 @reaatech/classifier-evals@0.1.0 @reaatech/llm-cost-telemetry@0.1.0
Expected output: Your package.json now lists all dependencies with exact versions (no ^ or ~ prefixes).
Fill in your xAI API key and AWS credentials in .env. The DEFAULT_DAILY_BUDGET sets a cost cap for Grok API calls.
Step 3: Define shared types
Create the type definitions that all modules share. These describe expense line items, extracted expense records, pipeline inputs and results, and quality-gate output.
Expected output: Six interfaces and one type union that define the data flowing through every stage of the pipeline.
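To make the shapes concrete, here is a minimal sketch of one of these types. It assumes the line-item fields mirror the Zod shape used in Step 5; the field comments and the `sumLineItems` helper are illustrative assumptions, not part of the recipe's type module.

```typescript
// Assumed to mirror the Zod line-item shape from Step 5; the published
// src/types/index.ts may define additional fields.
interface ExpenseLineItem {
  description: string;
  category: string;
  amount: number; // assumed to be the unit price
  quantity: number;
  total: number; // assumed to be the line total
  date?: string;
}

// Hypothetical helper: sum line totals in integer cents to avoid
// floating-point drift, then convert back to currency units.
function sumLineItems(items: ExpenseLineItem[]): number {
  const cents = items.reduce((sum, item) => sum + Math.round(item.total * 100), 0);
  return cents / 100;
}
```

Summing in cents is one common way to keep totals stable when you later compare them against a receipt's printed total.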
Step 4: Build the document ingestion pipeline
The ingestion pipeline handles OCR for images via tesseract.js and text extraction for PDFs via unpdf. It also preprocesses images with sharp (resize to 2048px width, grayscale) before OCR.
terminal
mkdir -p src/lib
Write src/lib/doc-pipeline.ts:
ts
import { createDocumentExtractionOperations } from "@reaatech/media-pipeline-mcp-doc-extraction";
import { ArtifactRegistry } from "@reaatech/media-pipeline-mcp-core";
import { MediaProvider } from "@reaatech/media-pipeline-mcp-provider-core";
import type { ArtifactStore, ArtifactMeta, StorageResult } from "@reaatech/media-pipeline-mcp-storage";
import type { ProviderInput, ProviderOutput } from "@reaatech/media-pipeline-mcp-provider-core";
import type { DocumentType } from "../types/index.js";
import sharp from "sharp";
import { createWorker } from "tesseract.js";
import { extractText as unpdfExtractText, getDocumentProxy } from "unpdf";
import { Readable } from "stream";

class InMemoryArtifactStore implements ArtifactStore {
  private store = new Map<string, { data: Buffer; meta: ArtifactMeta }>();
Expected output: A document pipeline that detects file types by magic bytes, OCRs images via tesseract.js with sharp preprocessing, extracts PDF text via unpdf, and throws clear errors on empty buffers or unsupported types.
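Magic-byte detection can be sketched as follows. This is an illustration of the technique rather than the recipe's actual code: the `sniffDocumentType` name, the `SniffedType` union, and the error messages are all hypothetical, while the byte signatures themselves are the standard ones for each format.

```typescript
// Hypothetical names; only the byte signatures are standard.
type SniffedType = "pdf" | "png" | "jpeg" | "tiff";

const SIGNATURES: Array<{ type: SniffedType; bytes: number[] }> = [
  { type: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { type: "png", bytes: [0x89, 0x50, 0x4e, 0x47] }, // "\x89PNG"
  { type: "jpeg", bytes: [0xff, 0xd8, 0xff] },
  { type: "tiff", bytes: [0x49, 0x49, 0x2a, 0x00] }, // little-endian "II*\0"
  { type: "tiff", bytes: [0x4d, 0x4d, 0x00, 0x2a] }, // big-endian "MM\0*"
];

function sniffDocumentType(data: Uint8Array): SniffedType {
  if (data.length === 0) {
    throw new Error("Cannot detect the type of an empty buffer");
  }
  for (const signature of SIGNATURES) {
    // Reading past the end yields undefined, which never equals a byte value.
    if (signature.bytes.every((byte, index) => data[index] === byte)) {
      return signature.type;
    }
  }
  throw new Error("Unsupported file type: unrecognized magic bytes");
}
```

Checking the leading bytes instead of trusting the file extension or the client-supplied MIME type means a renamed file is still routed to the right extractor. A Node.js `Buffer` is a `Uint8Array`, so it can be passed in directly.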
Step 5: Create the Grok extraction module
This module sends the extracted text to xAI Grok with a structured output schema and returns typed expense data.
Write src/lib/grok-extractor.ts:
ts
import { generateText, Output } from "ai";
import { xai } from "@ai-sdk/xai";
import { z } from "zod";
import type { ExtractedExpense } from "../types/index.js";

const expenseLineItemShape = {
  description: z.string(),
  category: z.string(),
  amount: z.number(),
  quantity: z.number(),
  total: z.number(),
  date: z.string().optional(),
};

const extractedExpenseShape = {
  merchantName: z.string(),
Expected output: A Zod-validated Grok extraction that retries on 429 rate limits with exponential backoff (1s, 4s) and throws ExtractionError when Grok returns empty results.
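The retry behavior described here can be sketched independently of the Grok call. The snippet below is an assumption-laden illustration: `withRateLimitRetry`, `backoffDelayMs`, and the `HttpLikeError` shape are hypothetical names, and it assumes the thrown error exposes an HTTP `status` field. The base-4 schedule is chosen only because it reproduces the 1s and 4s delays mentioned above.

```typescript
// Hypothetical error shape: assumes the thrown error carries an HTTP status.
interface HttpLikeError {
  status?: number;
}

// Delay before the nth retry (0-based): 1000 ms, then 4000 ms.
function backoffDelayMs(retryIndex: number): number {
  return 1000 * Math.pow(4, retryIndex);
}

const realSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

async function withRateLimitRetry<T>(
  call: () => Promise<T>,
  maxRetries = 2,
  sleep: (ms: number) => Promise<void> = realSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status =
        typeof err === "object" && err !== null ? (err as HttpLikeError).status : undefined;
      // Give up on non-429 errors or once the retry budget is spent.
      if (status !== 429 || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Passing `sleep` as a parameter keeps the helper testable: a test can inject a function that records the requested delays instead of actually waiting.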
Step 6: Implement JSON repair
Grok sometimes returns malformed JSON (markdown fences, trailing commas, string values where numbers are expected). The repair module uses @reaatech/structured-repair-core to fix it.
Write src/lib/repair.ts:
ts
import { repair, repairOutput } from "@reaatech/structured-repair-core";
import type { ExtractedExpense } from "../types/index.js";
import { expenseArraySchema } from "./grok-extractor.js";

export async function repairExpenseJson(rawJson: string): Promise<ExtractedExpense[]> {
  return repair(expenseArraySchema, rawJson);
}

export function repairWithDiagnostics(
  raw: string,
): { success: boolean; data: ExtractedExpense[] | null; steps: unknown[] } {
  const result = repairOutput({
    schema: expenseArraySchema,
    input: raw,
    debug: false,
  });
  return {
    success: result.success,
    data: result.success ? result.data : null,
    steps: result.steps,
  };
}
Expected output: Two functions — repairExpenseJson attempts full repair and throws on failure; repairWithDiagnostics returns a structured result with step-by-step diagnostics for debugging.
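To illustrate the kinds of damage being repaired, here is a deliberately naive, dependency-free sketch. It is not how @reaatech/structured-repair-core works internally; every function name below is hypothetical, and the list of numeric fields is an assumption taken from the line-item shape in Step 5.

```typescript
// Build the fence marker programmatically so this listing stays easy to embed.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Keep only the payload of a fenced block, if the model wrapped its answer in one.
function stripMarkdownFences(raw: string): string {
  const match = FENCED_BLOCK.exec(raw);
  return (match?.[1] ?? raw).trim();
}

// Naive: this would also alter a ",}" sequence inside a string value.
function removeTrailingCommas(text: string): string {
  return text.replace(/,\s*([}\]])/g, "$1");
}

// Assumed numeric fields, based on the line-item shape.
const NUMERIC_FIELDS = new Set(["amount", "quantity", "total"]);

// Recursively convert numeric-looking strings in known numeric fields.
function coerceNumericStrings(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => coerceNumericStrings(item));
  if (typeof value === "object" && value !== null) {
    const out: Record<string, unknown> = {};
    for (const [key, field] of Object.entries(value)) {
      const parsed =
        typeof field === "string" && NUMERIC_FIELDS.has(key) && field.trim() !== ""
          ? Number(field)
          : NaN;
      out[key] = Number.isNaN(parsed) ? coerceNumericStrings(field) : parsed;
    }
    return out;
  }
  return value;
}

function naiveRepair(raw: string): unknown {
  const cleaned = removeTrailingCommas(stripMarkdownFences(raw));
  return coerceNumericStrings(JSON.parse(cleaned));
}
```

A regex-based pass like this breaks on edge cases such as commas inside strings, which is one reason to prefer a schema-aware repair library in the real pipeline.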
Step 7: Build the quality gate
The quality gate compares extracted expenses against a golden reference dataset. It validates field accuracy and redacts PII (emails, phone numbers) from expense notes and descriptions.
Expected output: A quality gate that validates accuracy at a 70% threshold against 3 golden records and redacts PII from expense text fields.
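The two responsibilities of the gate, redaction and accuracy scoring, might look roughly like this. Treat it as a sketch under assumptions: the regular expressions (including the North American style phone pattern), the placeholder strings, and the exact-match scoring rule are illustrative choices, not the recipe's implementation.

```typescript
// Illustrative patterns; real-world PII detection needs broader coverage.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_PATTERN = /(?:\+?\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/g;

function redactPii(text: string): string {
  return text
    .replace(EMAIL_PATTERN, "[REDACTED_EMAIL]")
    .replace(PHONE_PATTERN, "[REDACTED_PHONE]");
}

// Fraction of golden fields that the extraction reproduced exactly.
function fieldAccuracy(
  extracted: Record<string, unknown>,
  golden: Record<string, unknown>,
): number {
  const keys = Object.keys(golden);
  if (keys.length === 0) return 1;
  const matches = keys.filter((key) => extracted[key] === golden[key]).length;
  return matches / keys.length;
}

// The 0.7 default corresponds to the 70% threshold.
function passesGate(accuracy: number, threshold = 0.7): boolean {
  return accuracy >= threshold;
}
```

For example, an extraction that matches three of four golden fields scores 0.75 and passes, while one that matches two of four scores 0.5 and fails.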
Step 8: Create the cost tracker
The cost tracker records every Grok API call, calculates its dollar cost using xAI’s per-million-token pricing ($2.50/M input, $10.00/M output), and enforces a configurable daily budget.
Expected output: A cost tracker that validates every span with CostSpanSchema, supports retryWithBackoff for transient failures, and groups costs by feature name.
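The cost arithmetic itself is simple enough to sketch. The snippet below uses the per-million-token rates quoted in this step; the `callCostUsd` function and `DailyBudget` class are hypothetical names, and the sketch ignores day rollover, span validation, and feature grouping.

```typescript
// Rates quoted in this step, in USD per million tokens.
const INPUT_USD_PER_MILLION = 2.5;
const OUTPUT_USD_PER_MILLION = 10;

function callCostUsd(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens * INPUT_USD_PER_MILLION + outputTokens * OUTPUT_USD_PER_MILLION) / 1_000_000
  );
}

// Hypothetical budget guard; a real tracker would also reset spend each day.
class DailyBudget {
  private spentUsd = 0;
  private readonly limitUsd: number;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  // True if a call with this estimated cost would stay within the limit.
  canSpend(estimatedUsd: number): boolean {
    return this.spentUsd + estimatedUsd <= this.limitUsd;
  }

  // Record an actual call and return its cost.
  record(inputTokens: number, outputTokens: number): number {
    const cost = callCostUsd(inputTokens, outputTokens);
    this.spentUsd += cost;
    return cost;
  }
}
```

With these rates, a call that uses 1,000,000 input tokens and 100,000 output tokens costs $2.50 + $1.00 = $3.50.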
Step 9: Build the S3 storage module
The S3 storage module uploads CSV results and original documents to an S3 bucket, using PutObjectCommand and GetObjectCommand. It validates that all required AWS environment variables are present at construction time.
Expected output: An S3 wrapper that throws descriptive errors on missing credentials and on missing S3 objects, with uploadFile and downloadFile methods wrapping the AWS SDK.
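Construction-time validation of the AWS variables could be sketched like this. The variable names come from the prerequisites; the `readS3Config` function, its return type, and the error message format are assumptions for illustration. The environment is passed in as a plain object so the check can be exercised without touching the real process environment.

```typescript
const REQUIRED_S3_ENV = [
  "AWS_REGION",
  "AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY",
  "S3_BUCKET_NAME",
] as const;

type S3EnvKey = (typeof REQUIRED_S3_ENV)[number];
type S3Config = Record<S3EnvKey, string>;

// Hypothetical helper: collect every missing variable before failing,
// so the error names all of them at once.
function readS3Config(env: Record<string, string | undefined>): S3Config {
  const missing: string[] = [];
  const values: Partial<S3Config> = {};
  for (const key of REQUIRED_S3_ENV) {
    const value = env[key];
    if (value === undefined || value === "") {
      missing.push(key);
    } else {
      values[key] = value;
    }
  }
  if (missing.length > 0) {
    throw new Error("Missing required environment variables: " + missing.join(", "));
  }
  return values as S3Config;
}
```

In a Node.js process you would typically call this as `readS3Config(process.env)` inside the storage class constructor, then hand the resulting values to the AWS SDK client.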
Step 10: Create the CSV exporter
The CSV exporter converts extracted expenses into properly escaped CSV strings.
Expected output: A CSV exporter that escapes commas and double-quotes, handles missing optional fields, and returns headers-only for empty arrays.
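The escaping rules can be shown in a few lines. This is a minimal sketch, assuming a flat row with four illustrative columns; the `CsvRow` shape, column order, and function names are hypothetical and simpler than the recipe's real expense records.

```typescript
// Quote a field when it contains a comma, quote, or line break,
// doubling any embedded double-quotes.
function escapeCsvField(value: string | number | undefined): string {
  if (value === undefined) return "";
  const text = String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Hypothetical flat row; the real export likely has more columns.
interface CsvRow {
  merchantName: string;
  description: string;
  amount: number;
  date?: string;
}

const CSV_HEADERS = ["merchantName", "description", "amount", "date"] as const;

function toCsv(rows: CsvRow[]): string {
  const lines = [CSV_HEADERS.join(",")];
  for (const row of rows) {
    lines.push(CSV_HEADERS.map((header) => escapeCsvField(row[header])).join(","));
  }
  return lines.join("\n");
}
```

Calling `toCsv([])` yields just the header line, and a missing optional field such as `date` becomes an empty cell.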
Step 11: Wire everything into the pipeline orchestrator
The ExpensePipeline class coordinates all modules in a precise order: document ingestion, budget check, Grok extraction (with repair fallback), PII redaction, quality gate, CSV export, S3 upload, and cost recording.
Write src/lib/pipeline.ts:
ts
import type { PipelineInput, PipelineResult, ExtractedExpense } from "../types/index.js";
import { extractText, createDocPipeline } from "./doc-pipeline.js";
import { extractExpenses, ExtractionError } from "./grok-extractor.js";
import { repairWithDiagnostics } from "./repair.js";
import { evaluate, redactExpenseData } from "./quality-gate.js";
import { CostTracker } from "./cost-tracker.js";
import { S3Storage } from "./s3-storage.js";
import { exportCsv } from "./csv-exporter.js";

export class ExpensePipeline {
  private docPipeline = createDocPipeline();
  private costTracker =
Expected output: A pipeline that processes a document through extraction, repair, quality checking, and S3 upload — gracefully handling S3 failures (non-fatal) and Grok extraction failures (with repair fallback). Non-extraction errors (empty document, budget exceeded) throw and propagate up to the API.
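The error-handling policy described here (fallback on extraction failure, non-fatal upload, everything else propagates) can be sketched with injected stage functions. This is an illustration only: `runStages`, the `Stages` interface, and the warning strings are hypothetical, and the local `ExtractionError` class is a stand-in for the one exported by grok-extractor.ts.

```typescript
// Local stand-in for the ExtractionError exported by grok-extractor.ts.
class ExtractionError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ExtractionError";
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, ExtractionError.prototype);
  }
}

interface Stages<T> {
  extract: () => Promise<T>;
  repairFallback: () => Promise<T>;
  upload: (result: T) => Promise<string>; // resolves to the stored object key
}

interface RunOutcome<T> {
  result: T;
  uploadedKey: string | null;
  warnings: string[];
}

async function runStages<T>(stages: Stages<T>): Promise<RunOutcome<T>> {
  const warnings: string[] = [];
  let result: T;
  try {
    result = await stages.extract();
  } catch (err) {
    // Only extraction failures trigger the fallback; everything else propagates.
    if (!(err instanceof ExtractionError)) throw err;
    warnings.push("extraction failed, used repair fallback: " + err.message);
    result = await stages.repairFallback();
  }

  let uploadedKey: string | null = null;
  try {
    uploadedKey = await stages.upload(result);
  } catch (err) {
    // Upload problems are recorded but do not fail the run.
    warnings.push("upload skipped: " + (err instanceof Error ? err.message : String(err)));
  }
  return { result, uploadedKey, warnings };
}
```

Distinguishing error classes with `instanceof` is what lets a budget or empty-document error surface to the API route while a malformed model response is quietly repaired.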
Step 12: Create the upload API route
The API route accepts multipart file uploads, validates MIME type (JPEG, PNG, TIFF, PDF) and file size (max 20MB), then runs the pipeline.
Expected output: A route handler at POST /api/upload that accepts expense documents and returns structured extraction results, plus GET /api/upload that returns pipeline usage metrics.
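The validation step can be isolated into a small pure function. The sketch below is an assumption about how such a check might be written: `validateUpload` and its result type are hypothetical, the HTTP status codes are illustrative choices, and "20MB" is interpreted as 20 × 1024 × 1024 bytes.

```typescript
const ALLOWED_MIME_TYPES = new Set([
  "image/jpeg",
  "image/png",
  "image/tiff",
  "application/pdf",
]);

// Assumes "20MB" means 20 MiB.
const MAX_FILE_BYTES = 20 * 1024 * 1024;

type UploadCheck = { ok: true } | { ok: false; status: number; error: string };

// Accepts anything with a MIME type and a byte size, such as a Web API File.
function validateUpload(file: { type: string; size: number }): UploadCheck {
  if (!ALLOWED_MIME_TYPES.has(file.type)) {
    return { ok: false, status: 415, error: "Unsupported file type: " + file.type };
  }
  if (file.size > MAX_FILE_BYTES) {
    return { ok: false, status: 413, error: "File exceeds the 20MB limit" };
  }
  return { ok: true };
}
```

Because a `File` read from multipart form data exposes `type` and `size`, a route handler could pass it straight to this function and return the `status` and `error` as a JSON response when `ok` is false.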
Step 13: Add barrel exports and entry point
Replace the placeholder src/index.ts with barrel exports that expose all public types and classes:
ts
import "dotenv/config";

export type { ExtractedExpense, ExpenseLineItem, PipelineInput, PipelineResult, QualityGateResult, DocumentType } from "./types/index.js";
export { createDocPipeline } from "./lib/doc-pipeline.js";
export { ExpensePipeline } from "./lib/pipeline.js";
export { extractExpenses, ExtractionError } from "./lib/grok-extractor.js";
export { repairExpenseJson, repairWithDiagnostics } from "./lib/repair.js";
export { evaluate, redactExpenseData } from "./lib/quality-gate.js";
export { CostTracker } from "./lib/cost-tracker.js";
export { S3Storage } from "./lib/s3-storage.js";
export { exportCsv } from "./lib/csv-exporter.js";
Expected output: A barrel module that imports dotenv/config for environment variable loading and re-exports every public symbol.
Step 14: Run the tests
The test suite covers every module with happy-path, error-path, and boundary cases. Run all tests with coverage:
terminal
pnpm test
Expected output: All tests pass across 8 test files, with line coverage at 99%, statement coverage at 99%, branch coverage at 96%, and function coverage at 98%. The coverage thresholds are set to 90% across all metrics in vitest.config.ts.
Next steps
Add a dashboard page: Build a Next.js frontend at app/dashboard/page.tsx that displays extraction history and lets users upload receipts through a drag-and-drop zone.
Expand golden dataset: Add more reference records to quality-gate.ts to improve accuracy validation across different receipt formats and currencies.
Add image preprocessing options: Extend the sharp preprocessing in doc-pipeline.ts with thresholding, deskewing, and contrast enhancement for low-quality scans.