Small businesses receive hundreds of invoices in PDF format that need to be manually entered into QuickBooks; existing OCR is brittle and expensive, and LLMs sometimes produce unparseable JSON.
A complete, working implementation of this recipe is available, downloadable as a zip or browsable file by file. It is generated by our build pipeline and tested with full coverage before publishing.
This tutorial walks you through building an invoice extraction pipeline with Next.js and xAI Grok. You’ll build an API endpoint that accepts a PDF invoice, parses it with Unstructured.io, extracts structured fields using Grok, repairs any malformed LLM output with structured-repair-core, tracks LLM spend per tenant with agent-budget-engine, and pushes the final invoice to QuickBooks Online. The pipeline is idempotent — duplicate uploads with the same key return the cached result instead of charging you twice.
This is for TypeScript developers who want a copy-paste-along recipe combining LLM extraction, repair, cost governance, and a third-party accounting API.
Prerequisites
Node.js 22+ and pnpm 10 installed
xAI API key — sign up at console.x.ai and create a key
Unstructured.io API key — sign up at unstructured.io and get a free API key
QuickBooks Online sandbox account — create one at developer.intuit.com and note the OAuth credentials (consumer key, consumer secret, OAuth token, realm ID, refresh token)
Familiarity with Next.js App Router, TypeScript, and Zod
Step 1: Scaffold the project and configure environment variables
Start by creating a Next.js App Router project. Pin every dependency to an exact version so the build is reproducible.
Replace the contents of .env.example with the full list of environment variables the pipeline needs:
env
NODE_ENV=development

# xAI Grok API (OpenAI-compatible)
XAI_API_KEY=<your-xai-api-key>

# Unstructured.io API for PDF parsing
UNSTRUCTURED_API_KEY=<your-unstructured-api-key>

# QuickBooks Online OAuth 2.0 credentials
QUICKBOOKS_CONSUMER_KEY=<your-quickbooks-consumer-key>
QUICKBOOKS_CONSUMER_SECRET=<your-quickbooks-consumer-secret>
QUICKBOOKS_OAUTH_TOKEN=<your-oauth-token>
QUICKBOOKS_REALM_ID=<your-realm-id>
QUICKBOOKS_REFRESH_TOKEN=<your-refresh-token>

# Set to true to use QuickBooks Sandbox environment
QUICKBOOKS_SANDBOX=true

# Budget limits (USD)
DEFAULT_TENANT_BUDGET_USD=10.00
MAX_INVOICE_COST_USD=0.50
Expected output: A Next.js 16 project with all dependencies installed and a .env.example file. Copy .env.example to .env.local and fill in your real API keys to test later.
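Because every later step reads these variables, it can help to fail fast at startup when one is missing. The following is a minimal, dependency-free sketch of that idea. The `loadEnv` function and `PipelineEnv` type are hypothetical names, not part of the recipe's published code, and the fallback values simply mirror `.env.example`.

```typescript
// Hypothetical helper: validates the variables listed in .env.example.
interface PipelineEnv {
  xaiApiKey: string;
  unstructuredApiKey: string;
  quickbooksSandbox: boolean;
  defaultTenantBudgetUsd: number;
  maxInvoiceCostUsd: number;
}

const REQUIRED_KEYS = [
  "XAI_API_KEY",
  "UNSTRUCTURED_API_KEY",
  "QUICKBOOKS_CONSUMER_KEY",
  "QUICKBOOKS_CONSUMER_SECRET",
  "QUICKBOOKS_OAUTH_TOKEN",
  "QUICKBOOKS_REALM_ID",
  "QUICKBOOKS_REFRESH_TOKEN",
] as const;

// Parse a USD amount, using the fallback when the variable is unset or blank.
function parseUsd(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value) || value < 0) {
    throw new Error(`${name} must be a non-negative number, got "${raw}"`);
  }
  return value;
}

function loadEnv(env: Record<string, string | undefined>): PipelineEnv {
  const missing = REQUIRED_KEYS.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    xaiApiKey: env["XAI_API_KEY"] as string,
    unstructuredApiKey: env["UNSTRUCTURED_API_KEY"] as string,
    quickbooksSandbox: env["QUICKBOOKS_SANDBOX"] === "true",
    defaultTenantBudgetUsd: parseUsd(
      "DEFAULT_TENANT_BUDGET_USD",
      env["DEFAULT_TENANT_BUDGET_USD"],
      10,
    ),
    maxInvoiceCostUsd: parseUsd("MAX_INVOICE_COST_USD", env["MAX_INVOICE_COST_USD"], 0.5),
  };
}
```

In a server module this would typically be called once as `loadEnv(process.env)`, so a misconfigured deployment fails before the first invoice is uploaded.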
Step 2: Define the invoice and pipeline types with Zod
Create the invoice schemas. You need two schemas: a strict InvoiceDataSchema for final validation and a forgiving InvoiceDataInputSchema that coerces types and supplies .catch() defaults so the repair engine has room to work.
// src/types/index.ts
export {
  InvoiceDataSchema,
  InvoiceDataInputSchema,
  LineItemSchema,
} from "./invoice.js";
export type { InvoiceData, LineItem } from "./invoice.js";
export type {
  PipelineConfig,
  PipelineCost,
  PipelineResult,
  ProcessingStatus,
} from "./pipeline.js";
Expected output: Three files (src/types/invoice.ts, src/types/pipeline.ts, src/types/index.ts) that define the data shapes for the entire pipeline.
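To illustrate why a forgiving schema and a strict schema are both useful, here is a dependency-free sketch of the same two-stage idea. It is not the recipe's Zod code: `coerceAmount`, `normalizeLineItem`, and `isStrictLineItem` are hypothetical helpers that mimic what coercion with a `.catch()` default and a strict final check achieve, and the strictness rules shown are assumptions.

```typescript
interface LineItem {
  description: string;
  quantity: number;
  unitPrice: number;
}

// Forgiving stage: accept numbers or numeric strings such as "$1,234.50",
// and fall back to a default when nothing numeric can be recovered.
function coerceAmount(value: unknown, fallback = 0): number {
  if (typeof value === "number" && Number.isFinite(value)) return value;
  if (typeof value === "string") {
    const cleaned = value.replace(/[^0-9.-]/g, "");
    const parsed = Number(cleaned);
    if (cleaned !== "" && Number.isFinite(parsed)) return parsed;
  }
  return fallback;
}

function normalizeLineItem(raw: Record<string, unknown>): LineItem {
  const description = raw["description"];
  return {
    description: typeof description === "string" ? description.trim() : "",
    quantity: coerceAmount(raw["quantity"], 1),
    unitPrice: coerceAmount(raw["unitPrice"], 0),
  };
}

// Strict stage: reject anything that is still implausible after coercion.
function isStrictLineItem(item: LineItem): boolean {
  return item.description.length > 0 && item.quantity > 0 && item.unitPrice >= 0;
}
```

The forgiving stage never throws, which gives a repair step something to work with, while the strict stage decides whether the result is good enough to send onward.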
Step 3: Build the PDF processor with Unstructured.io and sharp
The PDF processor reads a file, sends it to Unstructured.io’s API for content extraction, and provides helpers for getting page counts and converting pages to images with sharp. It includes retry logic (3 attempts, exponential backoff) for transient API errors.
ts
// src/services/pdf-processor.ts
import fs from "node:fs";
import sharp from "sharp";
import { UnstructuredClient } from "unstructured-client";
import { Strategy } from "unstructured-client/sdk/models/shared";
import {
  UnstructuredClientError,
  ConnectionError,
} from "unstructured-client/sdk/models/errors";

export class PdfProcessingError extends Error {
  code: "UNSTRUCTURED_API_ERROR" | "FILE_NOT_FOUND" | "PERMISSION_DENIED";

  constructor(
    message: string,
    code: "UNSTRUCTURED_API_ERROR"
Expected output: src/services/pdf-processor.ts with three exported functions and a typed PdfProcessingError class.
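The retry behaviour described in this step (3 attempts, exponential backoff, only for transient errors) can be sketched as a small generic wrapper. This is an illustration under stated assumptions rather than the recipe's actual code: `withRetry` and `RetryOptions` are hypothetical names, and the base delay is left to the caller because the source does not state one.

```typescript
interface RetryOptions {
  attempts: number; // total tries, including the first one
  baseDelayMs: number; // wait before the second try; doubles after each failure
  isRetryable: (err: unknown) => boolean;
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const isLastAttempt = attempt === opts.attempts;
      // Give up immediately on non-transient errors or when out of attempts.
      if (isLastAttempt || !opts.isRetryable(err)) throw err;
      // Exponential backoff: baseDelayMs, then 2x, then 4x, ...
      await sleep(opts.baseDelayMs * 2 ** (attempt - 1));
    }
  }
  throw lastError; // only reached when opts.attempts is less than 1
}
```

A caller would pass a predicate that returns true only for transient failures, such as 5xx responses and connection errors, so that a missing file or a 4xx response fails on the first attempt.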
Step 4: Build the LLM extractor with fallback and repair
The LLM extractor uses a two-tier extraction strategy. First, it tries @instructor-ai/instructor with a Zod response_model for structured output. If that fails, it falls back to a raw chat.completions.create call with response_format: { type: "json_object" }. Both paths pass the raw LLM output through jsonrepair and then through @reaatech/structured-repair-core’s repairOutput, which applies strategies like strip-fences, fix-json-syntax, and coerce-types to turn malformed responses into valid data.
ts
// src/services/llm-extractor.ts
import OpenAI from "openai";
import Instructor from "@instructor-ai/instructor";
import {
  repairOutput,
  UnrepairableError,
} from "@reaatech/structured-repair-core";
import { jsonrepair } from "jsonrepair";
import { z } from "zod";
import {
  InvoiceDataSchema,
  InvoiceDataInputSchema,
} from "../types/invoice.js";
import type { InvoiceData } from "../types/invoice.js";

export interface ExtractionUsage {
  inputTokens: number;
Expected output: src/services/llm-extractor.ts with the Grok client, Instructor wrapper, raw extraction, repair pipeline, and the unified extractInvoice function with fallback.
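The control flow of the two-tier strategy can be sketched without any of the libraries involved. In the sketch below, `stripFences`, `dropTrailingCommas`, `parseLenient`, and `extractWithFallback` are hypothetical stand-ins: the real pipeline delegates cleanup to jsonrepair and repairOutput, whose APIs are not reproduced here. It also assumes the fallback fires both when the first call throws and when its output cannot be parsed.

```typescript
const FENCE = "`".repeat(3); // built at runtime to avoid a literal fence in this example

// Remove a surrounding Markdown code fence, with or without a language tag.
function stripFences(text: string): string {
  const trimmed = text.trim();
  if (!trimmed.startsWith(FENCE) || !trimmed.endsWith(FENCE)) return trimmed;
  const firstNewline = trimmed.indexOf("\n");
  if (firstNewline === -1) return trimmed;
  return trimmed.slice(firstNewline + 1, trimmed.length - FENCE.length).trim();
}

// Drop trailing commas before } or ]. Naive: it does not skip string contents.
function dropTrailingCommas(text: string): string {
  return text.replace(/,\s*([}\]])/g, "$1");
}

function parseLenient(raw: string): unknown {
  return JSON.parse(dropTrailingCommas(stripFences(raw)));
}

// Try the primary path; if the call or the parse fails, use the fallback path.
async function extractWithFallback(
  primary: () => Promise<string>,
  fallback: () => Promise<string>,
): Promise<{ data: unknown; path: "primary" | "fallback" }> {
  try {
    return { data: parseLenient(await primary()), path: "primary" };
  } catch {
    return { data: parseLenient(await fallback()), path: "fallback" };
  }
}
```

Running both paths through the same cleanup keeps the downstream schema validation identical regardless of which tier produced the data.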
Step 5: Build the budget tracker with per-tenant spend control
The budget tracker wraps @reaatech/agent-budget-spend-tracker and @reaatech/agent-budget-engine to enforce per-tenant spending limits. It defines a budget with a soft cap (80%) and hard cap (100%), checks whether a call is allowed before running the LLM, and records spend after each invoice extraction. It also uses @reaatech/llm-cost-telemetry and @reaatech/llm-cost-telemetry-calculator to generate request IDs and calculate costs based on token usage.
Expected output: src/services/budget-tracker.ts with a BudgetExceededError, a factory for the budget engine, and helper functions for configuring, checking, recording, and calculating costs.
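The soft-cap and hard-cap logic is easy to reason about in isolation. The sketch below shows one plausible reading of it, assuming the check runs against the projected spend (current spend plus the estimated cost of the next call). `checkPreCall`, `calculateCostUsd`, and the pricing shape are hypothetical and do not reflect the @reaatech package APIs; token prices are supplied by the caller because they depend on the model.

```typescript
type BudgetDecision = "allow" | "allow_with_warning" | "deny";

interface TenantBudget {
  limitUsd: number;
  spentUsd: number;
}

const SOFT_CAP_RATIO = 0.8; // soft cap at 80% of the limit
const HARD_CAP_RATIO = 1.0; // hard cap at 100% of the limit

// Decide before the LLM call whether the tenant can afford it.
function checkPreCall(budget: TenantBudget, estimatedCostUsd: number): BudgetDecision {
  const projected = budget.spentUsd + estimatedCostUsd;
  if (projected > budget.limitUsd * HARD_CAP_RATIO) return "deny";
  if (projected >= budget.limitUsd * SOFT_CAP_RATIO) return "allow_with_warning";
  return "allow";
}

// Hypothetical pricing shape: USD per one million tokens.
interface TokenPricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

function calculateCostUsd(
  inputTokens: number,
  outputTokens: number,
  pricing: TokenPricing,
): number {
  return (
    (inputTokens / 1_000_000) * pricing.inputPerMillionUsd +
    (outputTokens / 1_000_000) * pricing.outputPerMillionUsd
  );
}
```

With a 10.00 USD limit, a tenant who has spent 7.50 is allowed a 0.40 call silently, gets a warning on a 0.60 call, and is denied once the projected total would exceed 10.00.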
Step 6: Build the QuickBooks integration
The QuickBooks service wraps the node-quickbooks callback-based SDK with a promisify helper and provides functions to find or create customers and push invoices. The constructor requires ten arguments including the consumer key, secret, OAuth token, realm ID, and sandbox flag.
ts
// src/services/quickbooks.ts
import QuickBooks from "node-quickbooks";
import type { InvoiceData } from "../types/invoice.js";

export class QuickBooksError extends Error {
  code: string;

  constructor(message: string, code: string) {
    super(message);
    this.name = "QuickBooksError";
    this.code = code;
  }
}

export function createQuickBooksClient(): QuickBooks {
  return new QuickBooks(
    process.env.QUICKBOOKS_CONSUMER_KEY as
Expected output: src/services/quickbooks.ts with promisified wrappers around node-quickbooks for creating invoices and managing customers.
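Wrapping a callback-based SDK method in a promise mostly comes down to preserving `this` and mapping the `(err, result)` callback onto reject and resolve. The sketch below shows the pattern against a stand-in client. `promisifyMethod` and `FakeAccountingClient` are hypothetical and only imitate the callback style; they are not the node-quickbooks API or the recipe's helper.

```typescript
// Node-style callback: error first, result second.
type NodeCallback<T> = (err: unknown, result?: T) => void;

// Turn a one-argument callback method into a promise-returning function.
function promisifyMethod<A, T>(
  target: object,
  method: (this: any, arg: A, cb: NodeCallback<T>) => void,
): (arg: A) => Promise<T> {
  return (arg) =>
    new Promise<T>((resolve, reject) => {
      // Call with the original client as `this`, since SDK methods rely on it.
      method.call(target, arg, (err, result) => {
        if (err) reject(err);
        else resolve(result as T);
      });
    });
}

// Stand-in for a callback-based accounting client.
class FakeAccountingClient {
  private nextId = 1;

  createInvoice(invoice: { total: number }, cb: NodeCallback<{ Id: string }>): void {
    if (invoice.total < 0) cb(new Error("total must be non-negative"));
    else cb(null, { Id: String(this.nextId++) });
  }
}
```

Usage would look like `const createInvoice = promisifyMethod(client, client.createInvoice)` followed by `await createInvoice({ total: 120 })`, which lets the orchestrator use plain `try`/`catch` around accounting calls.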
Step 7: Build the pipeline orchestrator with idempotency
The pipeline orchestrator ties together every service. It initializes the idempotency middleware (with a 24-hour TTL), reads the PDF, checks the budget, extracts the invoice, records spend, pushes to QuickBooks, and returns a comprehensive PipelineResult. The idempotency middleware ensures that re-uploading the same file with the same key returns the cached result instead of charging you twice.
ts
// src/pipeline/process-invoice.ts
import {
  MemoryAdapter,
  IdempotencyMiddleware,
} from "@reaatech/idempotency-middleware";
import { parsePdf } from "../services/pdf-processor.js";
import { extractInvoice } from "../services/llm-extractor.js";
import {
  createBudgetEngine,
  checkPreCallBudget,
  recordSpend,
  calculateInvoiceCost,
  BudgetExceededError,
} from "../services/budget-tracker.js";
import {
  createQuickBooksClient,
  getOrCreateCustomer,
  pushInvoiceToQuickBooks,
} from "../services/quickbooks.js";
import type { PipelineResult, PipelineCost } from "../types/pipeline.js";

export class PipelineError
Expected output: src/pipeline/process-invoice.ts, the central orchestrator that composes the full pipeline inside an idempotency wrapper.
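To make the idempotency guarantee concrete, here is a minimal in-memory sketch of a keyed cache with a TTL. It is an illustration of the technique, not the @reaatech/idempotency-middleware API: `IdempotencyCache` and its `run` method are hypothetical, and the injectable clock exists only to make expiry testable.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

class IdempotencyCache<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();
  private readonly inFlight = new Map<string, Promise<T>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Run `work` at most once per key within the TTL; later calls get the cached value.
  async run(key: string, work: () => Promise<T>): Promise<T> {
    const cached = this.entries.get(key);
    if (cached && cached.expiresAt > this.now()) return cached.value;

    // Concurrent duplicates share the same in-flight promise.
    const pending = this.inFlight.get(key);
    if (pending) return pending;

    const promise = work()
      .then((value) => {
        this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
        return value;
      })
      .finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, promise);
    return promise;
  }
}
```

Failed runs are deliberately not cached in this sketch, so a retry with the same key gets a fresh attempt. A 24-hour TTL corresponds to `new IdempotencyCache(24 * 60 * 60 * 1000)`.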
Step 8: Build the API route handler
The API route exposes a POST /api/invoices endpoint that accepts a multipart file upload with an Idempotency-Key header, calls the pipeline, and returns the structured result. It also provides a GET /api/invoices health-check endpoint.
Expected output: app/api/invoices/route.ts with POST (file upload + pipeline execution) and GET (health check) handlers. The route uses NextRequest/NextResponse throughout, never bare Request or new Response.
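One way to keep the handler small is to centralise header validation and the mapping from errors to HTTP status codes. The sketch below assumes a particular mapping (400 for a missing key, 402 for an exhausted budget, 409 for an idempotency conflict, 500 otherwise), which the source does not spell out. The error classes here are hypothetical stand-ins for the pipeline's own errors.

```typescript
// Hypothetical error classes standing in for the pipeline's real ones.
class MissingIdempotencyKeyError extends Error {}
class BudgetExceededError extends Error {}
class IdempotencyConflictError extends Error {}

// Read and validate the Idempotency-Key header (lookup is case-insensitive).
function requireIdempotencyKey(headers: Headers): string {
  const key = headers.get("Idempotency-Key")?.trim();
  if (!key) {
    throw new MissingIdempotencyKeyError("Idempotency-Key header is required");
  }
  return key;
}

// Assumed mapping from error type to HTTP status code.
function statusForError(err: unknown): number {
  if (err instanceof MissingIdempotencyKeyError) return 400;
  if (err instanceof BudgetExceededError) return 402;
  if (err instanceof IdempotencyConflictError) return 409;
  return 500;
}
```

Inside a route handler, the `catch` block can then pass `statusForError(err)` as the response status, which keeps status decisions in one testable place.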
Step 9: Run the tests
The test suite covers every module with mocked external calls. The LLM extractor tests use MSW to mock the xAI API and vi.mock to stub the Instructor, openai, jsonrepair, and @reaatech/structured-repair-core modules. The pipeline tests mock all services and verify the idempotency middleware is called correctly. The API route tests verify every HTTP status code (200, 400, 402, 409, 500) and temp file cleanup. The types tests validate both the strict and forgiving Zod schemas.
First, update vitest.config.ts to configure coverage thresholds:
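A minimal sketch of such a configuration follows, assuming Vitest 1.0 or later (where thresholds live under `coverage.thresholds`) and the v8 coverage provider, which needs the `@vitest/coverage-v8` package. The 90% values match the expected output for this step; the provider and reporter choices are assumptions.

```typescript
// vitest.config.ts (sketch)
// Vitest accepts a plain object as the default export. Wrapping it in
// defineConfig from "vitest/config" is the usual form and adds type checking.
const config = {
  test: {
    coverage: {
      provider: "v8",
      reporter: ["text", "json"],
      // Fail the run when any metric drops below 90%.
      thresholds: {
        lines: 90,
        branches: 90,
        functions: 90,
        statements: 90,
      },
    },
  },
};

export default config;
```

With thresholds in place, the test command exits with a non-zero status when coverage falls short, so the check can gate a CI build.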
pnpm vitest run --coverage --reporter=json --outputFile=vitest-report.json
Expected output: The test runner reports numFailedTests: 0 and all four coverage metrics — lines, branches, functions, statements — at or above 90%. The coverage report covers the following files:
src/types/invoice.ts — Zod schema validation and coercion
Re-upload the same file with the same Idempotency-Key — the idempotency middleware returns the cached result (status 200) instead of charging you again. Omitting the key returns a 400 error.
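A small client-side helper makes this easy to try from a script. The sketch below builds the upload request with the standard Fetch API types available in Node.js 22. The form field name `file` and the `buildUploadRequest` helper are assumptions, since the route's expected field name is not shown here.

```typescript
// Build a multipart upload request for the invoices endpoint.
function buildUploadRequest(
  baseUrl: string,
  pdf: Blob,
  fileName: string,
  idempotencyKey: string,
): Request {
  const form = new FormData();
  form.append("file", pdf, fileName); // field name "file" is an assumption
  return new Request(new URL("/api/invoices", baseUrl).toString(), {
    method: "POST",
    headers: { "Idempotency-Key": idempotencyKey },
    body: form, // the multipart Content-Type header is set automatically
  });
}
```

A request body can only be consumed once, so call `buildUploadRequest` again for the second upload rather than reusing the first `Request`. Passing each result to `fetch` with the same key exercises the cached path.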
Next steps
Add a queue worker — use a message broker (BullMQ, SQS) to process invoices asynchronously so the API responds instantly
Add PDF pre-processing — use sharp to convert each page to an image and send both markdown and image to a multimodal model for higher accuracy
Add a web dashboard — build a simple Next.js page where users can drag-and-drop invoices and see extract results in a table
Add rate limiting — wrap the API route with a rate limiter to prevent abuse
Persist the idempotency store — swap MemoryAdapter for Redis or DynamoDB so the cache survives server restarts
    | "FILE_NOT_FOUND"
    | "PERMISSION_DENIED",
  ) {
    super(message);
    this.name = "PdfProcessingError";
    this.code = code;
  }
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

function isRetryableError(err: unknown): boolean {
  if (err instanceof UnstructuredClientError && err.statusCode >= 500) {
    return true;
  }
  if (err instanceof ConnectionError) {
    return true;
  }
  return false;
}

export async function parsePdf(filePath: string): Promise<string> {
  let fileBuffer: Buffer;
  try {
    fileBuffer = fs.readFileSync(filePath);
  } catch (err) {
    const nodeErr = err as { code?: string };
    if (nodeErr.code === "ENOENT") {
      throw new PdfProcessingError(
        `File not found: ${filePath}`,
        "FILE_NOT_FOUND",
      );
    }
    if (nodeErr.code === "EACCES" || nodeErr.code === "EPERM") {
export const EXTRACTION_SYSTEM_PROMPT = `You are an invoice data extraction assistant. Extract structured invoice fields from the provided markdown content. Return a JSON object with the following fields:
- vendorName: the name of the vendor or supplier
- customerName: the name of the customer or bill-to party
- invoiceDate: the invoice date
- invoiceNumber: the unique invoice identifier
- lineItems: an array of line items, each with description, quantity, unitPrice
- subtotal: the subtotal amount
- taxAmount: the tax amount
- totalAmount: the total invoice amount
- currency: the currency code (default "USD")
- notes: freeform additional notes or remarks
Output valid JSON only, no markdown fences or explanatory text.`;
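A system prompt like this is paired with the parsed invoice markdown as the user message. The sketch below builds a request body in the OpenAI-compatible chat format with the `response_format: { type: "json_object" }` setting used by the raw extraction path. `buildExtractionRequest` is a hypothetical helper, `temperature: 0` is an assumption, and the model name is left to the caller because the source does not state one.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Assemble the chat completion request body for one invoice.
function buildExtractionRequest(
  systemPrompt: string,
  invoiceMarkdown: string,
  model: string,
) {
  const messages: ChatMessage[] = [
    { role: "system", content: systemPrompt },
    { role: "user", content: invoiceMarkdown },
  ];
  return {
    model,
    messages,
    temperature: 0, // assumption: favour deterministic output for extraction
    response_format: { type: "json_object" as const },
  };
}
```

Keeping request construction in a pure function makes it straightforward to unit test the prompt wiring without calling the API.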