OpenAI Invoice Extraction for Xero SMB Accounting

Automatically extract and sync invoice data from PDFs and images directly into Xero, eliminating manual data entry for small businesses.

The problem

SMBs waste hours manually entering invoice details from supplier PDFs and images into Xero, leading to errors and delays in reconciliation.

Intro

This recipe builds an invoice extraction pipeline that turns PDF and image invoices into structured data and syncs them directly into Xero. You’ll use OpenAI’s GPT-5.2 vision model via the Responses API, validate output against Zod schemas, repair malformed results automatically, route high-confidence extractions to Xero, and flag low-confidence ones for human review. The pipeline also includes Langfuse observability, a spend-based budget controller, an S3 document archive, and a Next.js admin API. It runs on Node.js 22+ with Next.js 16 (App Router).
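To make the routing step concrete, here is a minimal sketch of a confidence router. The `ExtractionResult` shape, the `routeExtraction` name, and the 0.85 threshold are assumptions for illustration, not values taken from the recipe. Only the status names `EXTRACTED`, `REVIEW_REQUIRED`, and `FAILED` come from the review route in the example code.

```typescript
// Hypothetical types: the recipe's real extraction result may carry more fields.
type ExtractionStatus = "EXTRACTED" | "REVIEW_REQUIRED" | "FAILED";

interface ExtractionResult {
  confidence: number; // assumed to be a score in [0, 1]
  validationErrors: string[]; // e.g. messages produced by schema validation
}

interface RoutingDecision {
  status: ExtractionStatus;
  reason: string;
}

// Assumed threshold; tune it against your own evaluation set.
const AUTO_SYNC_THRESHOLD = 0.85;

function routeExtraction(
  result: ExtractionResult,
  threshold: number = AUTO_SYNC_THRESHOLD
): RoutingDecision {
  const { confidence, validationErrors } = result;

  // A missing or out-of-range score means the extraction itself is unusable.
  if (!Number.isFinite(confidence) || confidence < 0 || confidence > 1) {
    return { status: "FAILED", reason: "confidence score out of range" };
  }

  // Schema problems always go to a human, whatever the score says.
  if (validationErrors.length > 0) {
    return {
      status: "REVIEW_REQUIRED",
      reason: `schema issues: ${validationErrors.join("; ")}`,
    };
  }

  // Only clean, high-confidence extractions are synced automatically.
  if (confidence >= threshold) {
    return { status: "EXTRACTED", reason: "confidence at or above threshold" };
  }
  return { status: "REVIEW_REQUIRED", reason: "confidence below threshold" };
}
```

Sending schema problems to review regardless of score is a deliberate choice in this sketch: a confident model can still produce an invoice that fails validation.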

Prerequisites

Node.js >= 22
pnpm (v10+)
An OpenAI API key with access to gpt-5.2 (or equivalent vision-capable model)
A Xero developer account with an OAuth2 client_credentials (Custom Connection) app
An AWS S3 bucket with IAM credentials (access key + secret)
A Langfuse account (optional; for tracing and observability)
Familiarity with Next.js App Router, TypeScript, and basic shell usage

Step 1: Scaffold the project and configure dependencies

Create a new Next.js project with pinned dependency versions — no ^ or ~ ranges — since the @reaatech/* packages are vendored.

Create a package.json:
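Once the manifest exists, the pinning rule can be checked mechanically. The sketch below is one way to do that, under assumptions: `findUnpinnedDependencies` and the package names in the usage lines are hypothetical, and the check treats anything other than a plain semver version as unpinned. That includes `file:` or `workspace:` specifiers, so relax the pattern if your vendored packages use them.

```typescript
// Hypothetical helper: lists dependencies whose version is not an exact semver.
type DependencyMap = Record<string, string>;

interface PackageManifest {
  dependencies?: DependencyMap;
  devDependencies?: DependencyMap;
}

// Matches exact versions such as 1.2.3, 1.2.3-beta.1 or 1.2.3+build.5.
const EXACT_VERSION = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$/;

function findUnpinnedDependencies(manifest: PackageManifest): string[] {
  const unpinned: string[] = [];
  for (const section of [manifest.dependencies, manifest.devDependencies]) {
    for (const [name, version] of Object.entries(section ?? {})) {
      // Ranges (^, ~, >=, *), tags ("latest") and URLs all fail the pattern.
      if (!EXACT_VERSION.test(version)) {
        unpinned.push(`${name}@${version}`);
      }
    }
  }
  return unpinned;
}

// Usage with made-up package names:
const unpinned = findUnpinnedDependencies({
  dependencies: { "example-lib": "1.2.3", "ranged-lib": "^4.0.0" },
});
// unpinned is ["ranged-lib@^4.0.0"]
```

In a real project you would read `package.json` from disk and fail the build when the returned list is non-empty.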

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

185 kB·123 tests·100.0% coverage·vitest passing

SHA-256: bc7d6be8e6f70d99af713aeede9d2b59940d81876f0ea71af7a53546490b98d3

// src/extraction/schema-builder.ts
import { InvoiceDataSchema } from "../types/invoice.js";
import type { InvoiceData } from "../types/invoice.js";
import { ZodError } from "zod";

// Validation failure that carries one entry per offending field.
export class ValidationError extends Error {
  public issues: { path: string; message: string }[];

  constructor(message: string, issues: { path: string; message: string }[] = []) {
    super(message);
    this.name = "ValidationError";
    this.issues = issues;
  }
}

// Build the extraction prompt: the target JSON shape followed by the raw invoice text.
export function buildInvoiceExtractionPrompt(rawText: string): string {
  return `You are an invoice data extraction assistant. Extract the following fields from the invoice text and return a JSON object matching this exact schema:

{
  "invoiceNumber": string (required),
  "supplierName": string (required),
  "supplierAddress": string (optional),
  "customerName": string (optional),
  "customerAddress": string (optional),
  "lineItems": [
    {
      "description": string (required),
      "quantity": number (positive, required),
      "unitAmount": number (positive, required),
      "lineAmount": number (optional),
      "accountCode": string (optional),
      "taxType": string (optional),
      "taxAmount": number (optional)
    }
  ] (required, at least one item),
  "subtotal": number (optional),
  "tax": number (optional),
  "total": number (optional),
  "currencyCode": string (optional),
  "invoiceDate": string (YYYY-MM-DD, required),
  "dueDate": string (YYYY-MM-DD, required),
  "reference": string (optional),
  "bankAccount": string (optional)
}

Return ONLY the JSON object, no markdown formatting, no explanation.

Invoice text:
${rawText}`;
}

// Strict variant: returns the parsed invoice or throws a ValidationError.
export function validateExtraction(raw: unknown): InvoiceData {
  try {
    return InvoiceDataSchema.parse(raw);
  } catch (err) {
    if (err instanceof ZodError) {
      const issues = err.issues.map((issue) => ({
        path: issue.path.join("."),
        message: issue.message,
      }));
      throw new ValidationError(
        `Invoice data validation failed: ${issues.map((i) => `${i.path}: ${i.message}`).join("; ")}`,
        issues
      );
    }
    throw new ValidationError(
      `Unexpected validation error: ${String(err)}`
    );
  }
}

// Non-throwing variant: returns either the data or a list of readable error strings.
export function safeParseExtraction(
  raw: unknown
): { data: InvoiceData | null; errors: string[] } {
  const result = InvoiceDataSchema.safeParse(raw);
  if (result.success) {
    return { data: result.data, errors: [] };
  }
  const errors = result.error.issues.map(
    (issue) => `${issue.path.join(".")}: ${issue.message}`
  );
  return { data: null, errors };
}
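The prompt asks for bare JSON, but model output can still arrive wrapped in a Markdown fence, surrounded by chatter, or with trailing commas. A minimal repair pass that could run before `validateExtraction` is sketched below. The `repairModelJson` name and its three heuristics are assumptions for illustration, not the recipe's own repair logic.

```typescript
// Hypothetical repair helper; the heuristics are illustrative, not exhaustive.
const FENCE = "`".repeat(3); // three backticks, built indirectly so the snippet embeds cleanly
const FENCED_BLOCK = new RegExp(`^${FENCE}(?:json)?\\s*([\\s\\S]*?)\\s*${FENCE}$`, "i");

function repairModelJson(raw: string): unknown {
  let text = raw.trim();

  // 1. Unwrap a surrounding Markdown code fence, with or without a "json" tag.
  const fenced = text.match(FENCED_BLOCK);
  if (fenced) {
    text = fenced[1] ?? text;
  }

  // 2. Keep only the outermost {...} span, dropping leading or trailing chatter.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new SyntaxError("No JSON object found in model output");
  }
  text = text.slice(start, end + 1);

  // 3. Drop trailing commas before a closing brace or bracket.
  //    Naive: this also rewrites the same character sequence inside string values.
  text = text.replace(/,\s*([}\]])/g, "$1");

  // JSON.parse still throws a SyntaxError if the text is beyond repair.
  return JSON.parse(text);
}
```

The result is typed as `unknown` on purpose, so callers have to pass it through schema validation before treating it as invoice data.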

// src/integrations/s3-storage.ts
import { S3Client, PutObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";
import pRetry from "p-retry";

export class StorageError extends Error {
  constructor(message: string, cause?: unknown) {
    super(message, { cause });
    this.name = "StorageError";
  }
}

export class S3Storage {
  private client: S3Client;
  private bucketName: string;

  // Reads all settings from the environment and fails fast if any are missing.
  constructor() {
    const region = process.env.AWS_REGION;
    const bucketName = process.env.S3_BUCKET_NAME;
    const accessKeyId = process.env.AWS_ACCESS_KEY_ID;
    const secretAccessKey = process.env.AWS_SECRET_ACCESS_KEY;

    if (!region || !bucketName || !accessKeyId || !secretAccessKey) {
      throw new Error(
        "Missing required AWS environment variables: AWS_REGION, S3_BUCKET_NAME, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY"
      );
    }

    this.bucketName = bucketName;
    this.client = new S3Client({
      region,
      credentials: {
        accessKeyId,
        secretAccessKey,
      },
    });
  }

  // Uploads a document and returns its key; failed attempts are retried up to 3 times.
  async uploadDocument(key: string, body: Uint8Array, contentType: string): Promise<string> {
    return pRetry(
      async () => {
        try {
          await this.client.send(
            new PutObjectCommand({
              Bucket: this.bucketName,
              Key: key,
              Body: body,
              ContentType: contentType,
            })
          );
          return key;
        } catch (err) {
          throw new StorageError(
            `Failed to upload document to s3://${this.bucketName}/${key}: ${String(err)}`,
            err
          );
        }
      },
      { retries: 3 }
    );
  }

  // Downloads a document and returns its full contents as a single byte array.
  async getDocument(key: string): Promise<Uint8Array> {
    return pRetry(
      async () => {
        try {
          const response = await this.client.send(
            new GetObjectCommand({
              Bucket: this.bucketName,
              Key: key,
            })
          );
          const stream = response.Body;
          if (!stream) {
            throw new StorageError("Empty response body from S3");
          }

          // Collect the streamed chunks, then copy them into one contiguous buffer.
          const chunks: Uint8Array[] = [];
          for await (const chunk of stream as AsyncIterable<Uint8Array>) {
            chunks.push(chunk);
          }
          const totalLength = chunks.reduce((acc, c) => acc + c.length, 0);
          const result = new Uint8Array(totalLength);
          let offset = 0;
          for (const chunk of chunks) {
            result.set(chunk, offset);
            offset += chunk.length;
          }
          return result;
        } catch (err) {
          // Preserve errors raised above; wrap everything else.
          if (err instanceof StorageError) {
            throw err;
          }
          throw new StorageError(
            `Failed to get document from s3://${this.bucketName}/${key}: ${String(err)}`,
            err
          );
        }
      },
      { retries: 3 }
    );
  }
}
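`uploadDocument` takes the object key from the caller, so the key scheme is left open. One possible scheme, sketched under assumptions, derives the key from the file's SHA-256 digest so that re-uploading the same bytes overwrites the same object instead of creating a duplicate. The `buildDocumentKey` name, the `invoices/<year>/<month>/` layout, and the extension map are all hypothetical.

```typescript
import { createHash } from "node:crypto";

// Hypothetical mapping from content type to file extension.
const EXTENSIONS: Record<string, string> = {
  "application/pdf": "pdf",
  "image/png": "png",
  "image/jpeg": "jpg",
};

// Hypothetical key layout: invoices/<yyyy>/<mm>/<sha256 hex>.<ext>
function buildDocumentKey(
  body: Uint8Array,
  contentType: string,
  receivedAt: Date = new Date()
): string {
  // Content-addressed name: identical bytes always hash to the same digest.
  const digest = createHash("sha256").update(body).digest("hex");

  // UTC avoids keys shifting between months depending on the server time zone.
  const year = receivedAt.getUTCFullYear();
  const month = String(receivedAt.getUTCMonth() + 1).padStart(2, "0");

  // Unknown content types fall back to a generic extension.
  const ext = EXTENSIONS[contentType] ?? "bin";

  return `invoices/${year}/${month}/${digest}.${ext}`;
}
```

A caller would pass the result straight through, for example `storage.uploadDocument(buildDocumentKey(bytes, "application/pdf"), bytes, "application/pdf")`. Note that the date prefix means the same file received in two different months still produces two keys; drop the prefix if strict deduplication matters more than browsable folders.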

// app/api/review/route.ts
import { type NextRequest, NextResponse } from "next/server";
import { extractionStore } from "../extractions/store";
import { InvoiceDataSchema } from "@/src/types/invoice";
import { XeroInvoiceIntegration } from "@/src/integrations/xero";

const xeroIntegration = new XeroInvoiceIntegration();

// List every extraction that is waiting for human review.
export function GET() {
  const pendingReview = Array.from(extractionStore.values()).filter(
    (entry) => entry.status === "REVIEW_REQUIRED"
  );
  return NextResponse.json({ items: pendingReview, total: pendingReview.length });
}

// Apply a reviewer decision: "approve", "reject", or "edit".
export async function PATCH(request: NextRequest) {
  try {
    const body = (await request.json()) as {
      extractionId: string;
      action: string;
      editedData?: unknown;
    };
    const { extractionId, action, editedData } = body;

    if (!extractionId || !action) {
      return NextResponse.json(
        { error: "Missing extractionId or action" },
        { status: 400 }
      );
    }

    const entry = extractionStore.get(extractionId);
    if (!entry) {
      return NextResponse.json(
        { error: "Extraction not found" },
        { status: 404 }
      );
    }

    if (action === "approve") {
      if (!entry.structured) {
        return NextResponse.json(
          { error: "No structured data to approve" },
          { status: 400 }
        );
      }
      const tenantId = process.env.XERO_TENANT_ID ?? "";
      const result = await xeroIntegration.createWithRetry(entry.structured, tenantId);
      entry.status = "EXTRACTED";
      extractionStore.set(extractionId, entry);
      return NextResponse.json({
        updated: true,
        invoiceId: result.invoiceId,
        warnings: result.warnings,
      });
    }

    if (action === "reject") {
      entry.status = "FAILED";
      extractionStore.set(extractionId, entry);
      return NextResponse.json({ updated: true });
    }

    if (action === "edit") {
      if (!editedData) {
        return NextResponse.json(
          { error: "Missing editedData" },
          { status: 400 }
        );
      }
      // Invalid edits are a client error, so report them as 400 rather than 500.
      const parsed = InvoiceDataSchema.safeParse(editedData);
      if (!parsed.success) {
        return NextResponse.json(
          { error: "Invalid editedData", issues: parsed.error.issues },
          { status: 400 }
        );
      }
      entry.structured = parsed.data;
      const tenantId = process.env.XERO_TENANT_ID ?? "";
      const result = await xeroIntegration.createWithRetry(entry.structured, tenantId);
      // Mark the entry as extracted only after Xero has accepted the invoice,
      // matching the order used by the "approve" branch.
      entry.status = "EXTRACTED";
      extractionStore.set(extractionId, entry);
      return NextResponse.json({
        updated: true,
        invoiceId: result.invoiceId,
        warnings: result.warnings,
      });
    }

    return NextResponse.json(
      { error: `Unknown action: ${action}` },
      { status: 400 }
    );
  } catch (error) {
    const message = error instanceof Error ? error.message : "Internal server error";
    return NextResponse.json({ error: message }, { status: 500 });
  }
}

// tests/types/invoice-schema.test.ts
import { describe, it, expect } from "vitest";
import { ZodError } from "zod";
import { InvoiceDataSchema } from "../../src/types/invoice.js";

describe("InvoiceDataSchema", () => {
  const validInvoice = {
    invoiceNumber: "INV-001",
    supplierName: "Test Corp",
    supplierAddress: "123 Main St",
    customerName: "Customer Inc",
    customerAddress: "456 Oak Ave",
    lineItems: [
      {
        description: "Services",
        quantity: 1,
        unitAmount: 100,
        lineAmount: 100,
        accountCode: "500",
        taxType: "NONE",
        taxAmount: 0,
      },
    ],
    subtotal: 100,
    tax: 0,
    total: 100,
    currencyCode: "USD",
    invoiceDate: "2024-01-15",
    dueDate: "2024-02-15",
    reference: "INV-001",
    bankAccount: "123456789",
  };

  it("parses valid full data", () => {
    const result = InvoiceDataSchema.parse(validInvoice);
    expect(result.invoiceNumber).toBe("INV-001");
    expect(result.supplierName).toBe("Test Corp");
    expect(result.lineItems).toHaveLength(1);
    expect(result.total).toBe(100);
  });

  it("rejects missing invoiceNumber", () => {
    const { invoiceNumber, ...invalid } = validInvoice;
    void invoiceNumber;
    expect(() => InvoiceDataSchema.parse(invalid)).toThrow(ZodError);
  });

  it("rejects negative quantity", () => {
    const invalid = {
      ...validInvoice,
      lineItems: [
        { description: "Bad", quantity: -1, unitAmount: 100, lineAmount: -100 },
      ],
    };
    expect(() => InvoiceDataSchema.parse(invalid)).toThrow(ZodError);
  });

  it("accepts empty lineItems array", () => {
    const result = InvoiceDataSchema.parse({ ...validInvoice, lineItems: [] });
    expect(result.lineItems).toEqual([]);
  });

  it("accepts omitted optional fields like bankAccount", () => {
    const { bankAccount, ...noBank } = validInvoice;
    void bankAccount;
    const result = InvoiceDataSchema.parse(noBank);
    expect(result.bankAccount).toBeUndefined();
  });
});

Contents

Intro
Prerequisites
Step 1: Scaffold the project and configure dependencies
Step 2: Configure environment variables
Step 3: Define the invoice data types with Zod
Step 4: Build the PDF parser and OCR service
Step 5: Create the document loader with concurrency control
Step 6: Build the schema builder and output repair
Step 7: Set up Langfuse observability and Next.js instrumentation
Step 8: Build the S3 storage and Xero integration
Step 9: Create the budget controller and confidence router
Step 10: Wire it all together - the extraction pipeline
Step 11: Create the API routes
Step 12: Write the golden evaluation comparator
Step 13: Write and run the tests
Next steps