Turn stacks of invoices and receipts into clean QuickBooks transactions with Vertex AI document parsing and structured repair, cutting manual data entry down to the low-confidence items flagged for human review.
Small business owners waste hours each week manually entering invoice data into QuickBooks. Off-the-shelf OCR tools produce messy, unstructured text that still requires fixing, and generic AI pipelines fail when formats vary.
This recipe turns stacks of PDF invoices and receipts into clean QuickBooks transactions using Google’s Gemini 2.5 Flash on Vertex AI. You’ll build a complete document pipeline that extracts structured data from PDF invoices, repairs malformed LLM output, routes high-confidence fields straight to QuickBooks, flags low-confidence items for human review, and tracks per-document processing costs. By the end you’ll have a Next.js API route and a CLI batch processor that both feed the same extraction pipeline.
Prerequisites
Node.js 22+ and pnpm 10 installed
A Google Cloud Platform project with the Vertex AI API enabled
A service-account JSON key file downloaded to your machine
(Optional) A webhook endpoint that accepts QuickBooks-style transaction payloads
Step 1: Scaffold the project, configure vitest, and install dependencies
Start with a fresh Next.js 16 project. The App Router lives at the project root under app/, while service code is organized under src/.
Create vitest.config.ts at the project root. This sets up the @ path alias that the API route uses for its imports, and configures the 90% coverage thresholds across lines, branches, functions, and statements:
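A minimal sketch of such a config, consistent with the description above. The alias target (./src), the v8 coverage provider (which needs the @vitest/coverage-v8 package), and the include globs are assumptions rather than the recipe's actual file:

```typescript
// vitest.config.ts (sketch; alias target and coverage provider are assumptions)
import { fileURLToPath } from "node:url";
import { defineConfig } from "vitest/config";

export default defineConfig({
  resolve: {
    alias: {
      // Assumed mapping: "@/services/foo" resolves to "<root>/src/services/foo".
      "@": fileURLToPath(new URL("./src", import.meta.url)),
    },
  },
  test: {
    environment: "node",
    coverage: {
      provider: "v8",
      // Runtime code only: library source plus App Router route handlers.
      include: ["src/**/*.ts", "app/**/route.ts"],
      // The run fails if any metric drops below 90%.
      thresholds: { lines: 90, branches: 90, functions: 90, statements: 90 },
    },
  },
});
```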
Expected output: pnpm installs all packages without peer-dependency errors. The node_modules/ directory now contains @google/genai, all four @reaatech/* packages, zod, pdf-parse, and vitest.
Step 2: Define the invoice Zod schemas
Create src/schemas/invoice.ts with the Zod schemas that represent every part of an invoice. These schemas validate Gemini’s output and are also passed to the repair engine as the target shape.
Expected output: The schemas compose naturally — InvoiceHeaderSchema references InvoiceAddressSchema, and InvoiceExtractionSchema bundles header, line items, summary, confidence, and raw text. The confidence field is clamped to the [0, 1] range.
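To make the composition concrete, here is a dependency-free sketch that mirrors the described shape with plain TypeScript interfaces instead of Zod. The field names follow the extraction prompt used in Step 4; which fields are optional is not stated in the recipe, so every field is assumed required here, and clampConfidence is a hypothetical helper illustrating the [0, 1] clamp:

```typescript
// Plain-TypeScript mirror of the invoice shape (illustrative; the recipe itself uses Zod).
interface InvoiceAddress { line1: string; city: string; state: string; zip: string; country: string }
interface InvoiceHeader {
  invoiceNumber: string;
  invoiceDate: string;
  dueDate: string;
  currency: string;
  vendorName: string;
  vendorAddress: InvoiceAddress; // the header embeds the address shape
  customerName: string;
}
interface InvoiceLineItem { description: string; quantity: number; unitPrice: number; total: number }
interface InvoiceSummary { subtotal: number; taxTotal: number; grandTotal: number }
interface InvoiceExtraction {
  header: InvoiceHeader;
  lineItems: InvoiceLineItem[];
  summary: InvoiceSummary;
  confidence: number;
  rawText: string;
}

// Clamp a confidence score into [0, 1]; non-finite values (NaN, Infinity) fall back to 0.
function clampConfidence(value: number): number {
  if (!Number.isFinite(value)) return 0;
  return Math.min(1, Math.max(0, value));
}

const sample: InvoiceExtraction = {
  header: {
    invoiceNumber: "INV-001",
    invoiceDate: "2025-01-15",
    dueDate: "2025-02-14",
    currency: "USD",
    vendorName: "Acme Supplies",
    vendorAddress: { line1: "1 Main St", city: "Austin", state: "TX", zip: "78701", country: "US" },
    customerName: "Example Bakery",
  },
  lineItems: [{ description: "Flour, 25 kg", quantity: 2, unitPrice: 15, total: 30 }],
  summary: { subtotal: 30, taxTotal: 2.4, grandTotal: 32.4 },
  confidence: clampConfidence(1.7), // an out-of-range score is pulled back to 1
  rawText: "ACME SUPPLIES INVOICE INV-001 ...",
};
```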
Step 3: Create the PDF text extraction adapter
Create src/lib/pdf-adapter.ts. This module wraps pdf-parse with a default export function that accepts a Buffer and returns { text: string }. The API route and CLI both call the same adapter.
Expected output: The adapter dynamically imports pdf-parse inside the function body — this avoids module-scope side effects and keeps the import tree clean. The finally block ensures the parser is destroyed even if getText() throws.
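The cleanup guarantee can be sketched without the real library. The TextParser interface and the injected factory below are hypothetical stand-ins; the actual pdf-parse API may differ. In the real adapter, the factory is where the dynamic import of pdf-parse would happen:

```typescript
// Hypothetical minimal parser contract; not the actual pdf-parse types.
interface TextParser {
  getText(): Promise<{ text: string }>;
  destroy(): Promise<void>;
}

// The factory is injected so the heavy module can be loaded lazily inside it.
type ParserFactory = (data: Uint8Array) => Promise<TextParser>;

async function extractText(
  data: Uint8Array,
  createParser: ParserFactory,
): Promise<{ text: string }> {
  const parser = await createParser(data);
  try {
    const result = await parser.getText();
    return { text: result.text };
  } finally {
    // Runs on success and on failure, so parser resources are always released.
    await parser.destroy();
  }
}
```

If getText() rejects, the rejection still propagates to the caller, but only after destroy() has completed.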
Step 4: Build the Vertex AI extraction service
Create src/services/vertex-extractor.ts. This class wraps the @google/genai SDK. It exposes three methods: extractFromPdfBuffer (the full pipeline from PDF bytes to structured data), extractFromText (from raw text to structured data), and extractRawFromText (returns the raw Vertex response for the repair pipeline to handle).
ts
import { GoogleGenAI } from "@google/genai";
import pdf from "../lib/pdf-adapter.js";
import { InvoiceExtractionSchema, type InvoiceExtraction } from "../schemas/invoice.js";

// Error type with a machine-readable code: API_ERROR, JSON_PARSE_ERROR, or VALIDATION_FAILED.
export class ExtractionError extends Error {
  code: string;

  constructor(message: string, code: string, cause?: Error) {
    super(message);
    this.name = "ExtractionError";
    this.code = code;
    if (cause) this.cause = cause;
  }
}

export class VertexExtractor {
  // Usage metadata from the most recent Vertex call, kept for cost tracking.
  lastUsageMetadata: unknown = undefined;

  constructor(private readonly ai: GoogleGenAI) {}

  // Returns the model's raw text so the repair pipeline can handle malformed JSON.
  async extractRawFromText(rawText: string): Promise<{ rawOutput: string; usageMetadata?: unknown }> {
    const prompt = `Extract structured invoice data from the following OCR text. Return ONLY valid JSON matching this schema: header contains invoiceNumber, invoiceDate, dueDate, currency, vendorName, vendorAddress (line1, city, state, zip, country), customerName; lineItems is an array with description, quantity, unitPrice, total; summary has subtotal, taxTotal, grandTotal. Text:\n${rawText}`;
    try {
      const response = await this.ai.models.generateContent({
        model: "gemini-2.5-flash",
        contents: prompt,
      });
      this.lastUsageMetadata = response.usageMetadata;
      return {
        rawOutput: typeof response.text === "string" ? response.text : "",
        usageMetadata: response.usageMetadata,
      };
    } catch (err) {
      throw new ExtractionError(
        err instanceof Error ? err.message : "Vertex API call failed",
        "API_ERROR",
        err instanceof Error ? err : undefined,
      );
    }
  }

  // Strict path: no repair step, so any malformed output throws.
  async extractFromText(rawText: string): Promise<InvoiceExtraction> {
    const { rawOutput } = await this.extractRawFromText(rawText);
    return this.tryParse(rawOutput);
  }

  async extractFromPdfBuffer(pdfBuffer: Buffer): Promise<InvoiceExtraction> {
    const result = await pdf(pdfBuffer);
    return this.extractFromText(result.text);
  }

  private tryParse(raw: string): InvoiceExtraction {
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      throw new ExtractionError("Vertex response is not valid JSON", "JSON_PARSE_ERROR");
    }
    const result = InvoiceExtractionSchema.safeParse(parsed);
    if (!result.success) {
      throw new ExtractionError(`Schema validation failed: ${result.error.message}`, "VALIDATION_FAILED");
    }
    return result.data;
  }
}
Expected output: The extractRawFromText method stores the Vertex response’s usageMetadata on this.lastUsageMetadata so callers can read prompt and candidate token counts for cost tracking. Errors from Vertex SDK calls are wrapped in ExtractionError with a machine-readable code property (API_ERROR, JSON_PARSE_ERROR, VALIDATION_FAILED).
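Because lastUsageMetadata is typed as unknown, callers have to narrow it before handing token counts to cost tracking. A minimal sketch, assuming the metadata carries promptTokenCount and candidatesTokenCount fields (the names used in Gemini API usage metadata); readTokenCounts is a hypothetical helper, not part of the recipe:

```typescript
interface TokenCounts {
  inputTokens: number;
  outputTokens: number;
}

// Accept only finite, non-negative numbers; anything else counts as 0.
function asCount(value: unknown): number {
  return typeof value === "number" && Number.isFinite(value) && value >= 0 ? value : 0;
}

// Narrow an unknown usage-metadata value into the two counts cost tracking needs.
function readTokenCounts(usageMetadata: unknown): TokenCounts {
  if (typeof usageMetadata !== "object" || usageMetadata === null) {
    return { inputTokens: 0, outputTokens: 0 };
  }
  const meta = usageMetadata as Record<string, unknown>;
  return {
    inputTokens: asCount(meta["promptTokenCount"]),
    outputTokens: asCount(meta["candidatesTokenCount"]),
  };
}
```

Falling back to zero keeps the pipeline running when metadata is missing, at the price of under-reporting cost for that document; a stricter variant could throw instead.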
Step 5: Wire up the structured repair pipeline
Create src/services/repair-pipeline.ts. This module calls @reaatech/structured-repair-core to fix common issues in LLM-generated JSON, such as markdown fences, bad syntax, type coercion errors, and hallucinated fields.
ts
import { repair, repairOutput, isValid, analyzeInput, type RepairResult } from "@reaatech/structured-repair-core";
import { InvoiceExtractionSchema, type InvoiceExtraction } from "../schemas/invoice.js";

export type { RepairResult } from "@reaatech/structured-repair-core";

// Main entrypoint: run the listed strategies against the invoice schema.
export function repairInvoiceOutput(raw: string): RepairResult<InvoiceExtraction> {
  return repairOutput({
    schema: InvoiceExtractionSchema,
    input: raw,
    strategies: [
      "strip-fences",
      "extract-json",
      "fix-json-syntax",
      "coerce-types",
      "fuzzy-match-keys",
      "remove-extra-fields",
    ],
    debug: process.env["DEBUG"] === "true",
  });
}

export async function quickRepairInvoice(raw: string): Promise<InvoiceExtraction> {
  return repair(InvoiceExtractionSchema, raw);
}

// Accepts either a JSON string or an already-parsed value.
export function isValidInvoice(data: unknown): data is InvoiceExtraction {
  return isValid(InvoiceExtractionSchema, typeof data === "string" ? data : JSON.stringify(data));
}

export function analyzeInvoiceInput(raw: string) {
  return analyzeInput(raw);
}
Expected output: repairInvoiceOutput is the main entrypoint used by both the API route and CLI. It applies six repair strategies in sequence: strip markdown fences, extract embedded JSON, fix broken JSON syntax, coerce string-typed numbers to actual numbers, fuzzy-match misspelled or hyphenated keys to the schema’s expected keys, and remove any extra fields Gemini hallucinated. Set DEBUG=true in your environment to see each strategy’s before/after state.
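To illustrate what the first two strategies do conceptually, here is a simplified, dependency-free sketch. It is not the implementation used by @reaatech/structured-repair-core; stripFences, extractJson, and naiveRepair are hypothetical names, and the remaining strategies (syntax fixes, coercion, key matching, field removal) are left out:

```typescript
// Three backticks, built at runtime so this sample stays safe to embed in a fenced block.
const FENCE = "`".repeat(3);

// Drop a leading fence line (with optional language tag) and a trailing fence, if present.
function stripFences(raw: string): string {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
  }
  if (text.trimEnd().endsWith(FENCE)) {
    text = text.trimEnd().slice(0, -FENCE.length);
  }
  return text.trim();
}

// Keep only the span from the first "{" to the last "}", discarding surrounding chatter.
function extractJson(raw: string): string {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  return start !== -1 && end > start ? raw.slice(start, end + 1) : raw;
}

// Chain the two steps, then parse. Throws if the result is still not valid JSON.
function naiveRepair(raw: string): unknown {
  return JSON.parse(extractJson(stripFences(raw)));
}
```

Running the steps in this order matters: removing fences first means the brace search in extractJson only has to deal with leftover prose around the object.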
Step 6: Build the confidence router wrapper
Create src/services/invoice-router.ts. This module wraps @reaatech/confidence-router to decide whether each invoice field is trustworthy enough to send to QuickBooks automatically, or whether it needs human review.
Expected output: The router has three decision levels. Fields with confidence >= 0.85 are ROUTE (auto-send to QuickBooks). Fields with confidence above 0.3 and below 0.85 are CLARIFY (flag for human review). Fields at or below 0.3 are FALLBACK (rejected entirely). The routeWholeInvoice function evaluates four key fields (invoice number, vendor name, line items, grand total) and returns the worst-case decision across all of them: if even one field is FALLBACK, the overall result is FALLBACK, and a single CLARIFY is enough to hold the invoice for review.
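The threshold logic and the worst-case aggregation can be sketched in a few lines. This is an illustration of the rules described above, not the @reaatech/confidence-router API; decide and worstCase are hypothetical names:

```typescript
type Decision = "ROUTE" | "CLARIFY" | "FALLBACK";

const ROUTE_THRESHOLD = 0.85;
const FALLBACK_THRESHOLD = 0.3;

// Map one field's confidence to a decision. NaN fails both comparisons and lands on FALLBACK.
function decide(confidence: number): Decision {
  if (confidence >= ROUTE_THRESHOLD) return "ROUTE";
  if (confidence > FALLBACK_THRESHOLD) return "CLARIFY";
  return "FALLBACK";
}

// Higher number means a more severe outcome.
const SEVERITY: Record<Decision, number> = { ROUTE: 0, CLARIFY: 1, FALLBACK: 2 };

// The invoice-level decision is the most severe field-level decision.
// An empty list yields ROUTE here; a real router may prefer to reject that case.
function worstCase(decisions: Decision[]): Decision {
  return decisions.reduce<Decision>(
    (worst, current) => (SEVERITY[current] > SEVERITY[worst] ? current : worst),
    "ROUTE",
  );
}

// Example: three strong fields and one shaky grand total.
const overall = worstCase([0.97, 0.91, 0.88, 0.6].map(decide)); // "CLARIFY"
```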
Step 7: Create the QuickBooks webhook sender
Create src/services/quickbooks-sender.ts. This function maps the extracted invoice data into a QuickBooks-compatible payload and POSTs it to a configurable webhook URL.
Expected output: The function never throws — it catches errors and returns { ok: false, error } instead. This is deliberate: callers in the API route and CLI batch loop continue processing remaining invoices instead of crashing on a single network failure. The payload maps invoice header fields (vendorName to Payee, grandTotal to TotalAmt) and each line item into QuickBooks Line entries.
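A sketch of the two ideas described here: the field mapping and the never-throw contract. Payee and TotalAmt come from the description above; the Line entry field names (Description, Amount), the Bearer authorization header, and the injected FetchLike type are assumptions made for illustration:

```typescript
interface LineItem { description: string; quantity: number; unitPrice: number; total: number }
interface InvoiceLike { vendorName: string; grandTotal: number; lineItems: LineItem[] }
interface SendResult { ok: boolean; status?: number; error?: string }

// Minimal fetch-shaped function type, injected so the sender can be tested offline.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number }>;

// Map the extracted invoice to a QuickBooks-style payload (Line field names are assumed).
function toQuickBooksPayload(invoice: InvoiceLike) {
  return {
    Payee: invoice.vendorName,
    TotalAmt: invoice.grandTotal,
    Line: invoice.lineItems.map((item) => ({
      Description: item.description,
      Amount: item.total,
    })),
  };
}

// Never throws: network errors and non-2xx responses both become { ok: false, ... }.
async function sendSafely(
  invoice: InvoiceLike,
  config: { webhookUrl: string; token?: string },
  fetchImpl: FetchLike,
): Promise<SendResult> {
  try {
    const headers: Record<string, string> = { "Content-Type": "application/json" };
    if (config.token) headers["Authorization"] = `Bearer ${config.token}`; // assumed auth scheme
    const response = await fetchImpl(config.webhookUrl, {
      method: "POST",
      headers,
      body: JSON.stringify(toQuickBooksPayload(invoice)),
    });
    return response.ok
      ? { ok: true, status: response.status }
      : { ok: false, status: response.status, error: `HTTP ${response.status}` };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

Returning a result object instead of throwing lets a batch loop count failures and move on to the next invoice without wrapping every call in its own try/catch.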
Step 8: Implement the human review queue
Create src/queues/review.ts. This in-memory queue stores invoices whose confidence was too low for automatic routing. A human can later review, correct, and re-submit them — at which point processReadyReviews re-evaluates and sends qualifying corrections to QuickBooks.
ts
import { generateId, now } from "@reaatech/llm-cost-telemetry";
import { ConfidenceRouter } from "@reaatech/confidence-router";
import type { InvoiceExtraction } from "../schemas/invoice.js";
import type { InvoiceRoutingResult } from "../services/invoice-router.js";
import { routeWholeInvoice } from "../services/invoice-router.js";
import { sendToQuickBooks, type QuickBooksConfig } from "../services/quickbooks-sender.js";

export interface ReviewItem {
  id: string;
  invoice: InvoiceExtraction;
  routingResult: InvoiceRoutingResult;
  submittedAt: string;
  status: "pending" | "resolved" | "dismissed";
  correctedInvoice?: InvoiceExtraction;
}

export class ReviewQueue {
  private items: Map<string, ReviewItem> = new Map();

  enqueue(invoice: InvoiceExtraction, routingResult: InvoiceRoutingResult): string {
    const id = generateId();
    this.items.set(id, {
      id,
      invoice,
      routingResult,
      submittedAt: now().toISOString(),
      status: "pending",
    });
    return id;
  }

  // Looks up an item by ID; despite the name, it does not remove it from the queue.
  dequeue(id: string): ReviewItem | undefined {
    return this.items.get(id);
  }

  listAll(status?: ReviewItem["status"]): ReviewItem[] {
    const all = [...this.items.values()];
    return status ? all.filter((i) => i.status === status) : all;
  }

  resolve(id: string, correctedInvoice: InvoiceExtraction): boolean {
    const item = this.items.get(id);
    if (!item) return false;
    item.status = "resolved";
    item.correctedInvoice = correctedInvoice;
    return true;
  }

  dismiss(id: string): boolean {
    const item = this.items.get(id);
    if (!item) return false;
    item.status = "dismissed";
    return true;
  }
}

export async function processReadyReviews(
  queue: ReviewQueue,
  routerFactory: () => ConfidenceRouter,
  quickbooksConfig: QuickBooksConfig
): Promise<{ sent: number; failed: number }> {
  // Only items still marked "pending" that already carry a correctedInvoice are considered.
  const pending = queue.listAll("pending");
  let sent = 0;
  let failed = 0;
  for (const item of pending) {
    if (!item.correctedInvoice) continue;
    const router = routerFactory();
    const result = routeWholeInvoice(router, item.correctedInvoice);
    if (result.overallType === "ROUTE") {
      const response = await sendToQuickBooks(item.correctedInvoice, quickbooksConfig);
      if (response.ok) {
        queue.dismiss(item.id);
        sent++;
      } else {
        failed++;
      }
    } else {
      // Corrected invoice still does not meet the route threshold.
      failed++;
    }
  }
  return { sent, failed };
}
Expected output: The queue exposes enqueue (returns a unique string ID), resolve (attach a corrected invoice), and dismiss (discard without sending). The processReadyReviews function iterates pending items that have been corrected, re-runs them through the confidence router, and auto-sends those that now meet the route threshold.
Step 9: Add cost telemetry
Create src/lib/cost-telemetry.ts. This module tracks how much each extraction costs using @reaatech/llm-cost-telemetry and @reaatech/llm-cost-telemetry-calculator. It also checks a daily budget to prevent cost overruns.
ts
import { loadConfig, generateId, now, type CostSpan } from "@reaatech/llm-cost-telemetry";
import { calculateCost } from "@reaatech/llm-cost-telemetry-calculator";

// Loaded lazily on first use so environment variables are available by then.
let telemetryConfig: ReturnType<typeof loadConfig> | null = null;

export function resetTelemetryConfig(): void {
  telemetryConfig = null;
}

export function getTelemetryConfig(): ReturnType<typeof loadConfig> {
  if (!telemetryConfig) telemetryConfig = loadConfig();
  return telemetryConfig;
}

export function trackExtractionCost(args: {
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  tenant?: string;
}): { costUsd: number; span: CostSpan } {
  const result = calculateCost({
    provider: args.provider as "openai" | "anthropic" | "google",
    model: args.model,
    inputTokens: args.inputTokens,
    outputTokens: args.outputTokens,
  });
  const span: CostSpan = {
    id: generateId(),
    provider: args.provider as "openai" | "anthropic" | "google",
    model: args.model,
    inputTokens: args.inputTokens,
    outputTokens: args.outputTokens,
    costUsd: result.costUsd,
    tenant: args.tenant ?? "default",
    feature: "invoice-extraction",
    timestamp: now(),
  };
  return { costUsd: result.costUsd, span };
}

// Note: `tenant` is accepted but not used; only the global daily budget is checked.
export function checkBudget(tenant: string, estimatedCostUsd: number): boolean {
  const config = getTelemetryConfig();
  const globalBudget = config.budget.global;
  if (!globalBudget) return true;
  const dailyBudget = globalBudget.daily;
  if (typeof dailyBudget !== "number") return true;
  return estimatedCostUsd <= dailyBudget;
}

export function formatCostReport(spans: Array<{ costUsd: number; feature: string }>): string {
  const byFeature = new Map<string, number>();
  let total = 0;
  for (const s of spans) {
    byFeature.set(s.feature, (byFeature.get(s.feature) ?? 0) + s.costUsd);
    total += s.costUsd;
  }
  const lines = ["Cost Report:", "Feature | Cost | %"];
  for (const [feature, cost] of byFeature) {
    const pct = total > 0 ? ((cost / total) * 100).toFixed(1) + "%" : "0%";
    lines.push(`${feature} | $${cost.toFixed(4)} | ${pct}`);
  }
  lines.push(`Total | $${total.toFixed(4)} | 100%`);
  return lines.join("\n");
}
Expected output: The config is loaded lazily (not at module-import time) so environment variables are available when it runs. The provider string for Google Vertex / Gemini models passed to calculateCost is "google". The formatCostReport function aggregates spans by feature and prints a table with feature name, cost in USD, and percentage of total spend.
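For intuition, the arithmetic behind a per-document cost is just token counts multiplied by per-token rates. The sketch below is not how @reaatech/llm-cost-telemetry-calculator is implemented, and the rates are placeholder numbers for illustration only, not actual Gemini pricing; estimateCostUsd and withinBudget are hypothetical helpers:

```typescript
interface Pricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// Placeholder rates for illustration only; look up current pricing before relying on them.
const HYPOTHETICAL_PRICING: Pricing = { inputPerMillionUsd: 0.5, outputPerMillionUsd: 2.0 };

// cost = (input tokens / 1M) * input rate + (output tokens / 1M) * output rate
function estimateCostUsd(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  return (
    (inputTokens / 1_000_000) * pricing.inputPerMillionUsd +
    (outputTokens / 1_000_000) * pricing.outputPerMillionUsd
  );
}

// Cumulative budget check: would the next document push spend past the daily limit?
function withinBudget(spentSoFarUsd: number, nextCostUsd: number, dailyLimitUsd: number): boolean {
  return spentSoFarUsd + nextCostUsd <= dailyLimitUsd;
}

// 2,000 input tokens and 500 output tokens at the placeholder rates:
// 0.002 * 0.5 + 0.0005 * 2.0 = 0.001 + 0.001 = 0.002 USD
const perInvoiceUsd = estimateCostUsd(2000, 500, HYPOTHETICAL_PRICING);
```

Note that withinBudget tracks cumulative spend, whereas checkBudget above compares a single estimate against the configured daily budget.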
Step 10: Build the API route handler
Create app/api/extract/route.ts. This Next.js App Router route accepts multipart PDF uploads via POST and a healthcheck via GET. It ties together the extractor, repair pipeline, confidence router, QuickBooks sender, review queue, and cost telemetry into one pipeline.
Expected output: The route extracts PDF text via the adapter, sends it to Gemini for structured extraction in raw mode, pipes the raw output through the repair pipeline, reads real usage metadata from the Vertex response for cost tracking, routes the repaired invoice through the confidence router, and either sends it to QuickBooks (high confidence) or enqueues it for review (low confidence). Start the dev server with pnpm dev and test with curl -F "file=@invoice.pdf" http://localhost:3000/api/extract.
Step 11: Create the CLI batch processor
Create src/cli/batch-process.ts. This CLI processes all PDF files in a directory through the same pipeline as the API route, with added budget checks and a cost report at the end.
ts
import { readdirSync, readFileSync } from "node:fs";
import { resolve } from "node:path";
import { GoogleGenAI } from "@google/genai";
import { VertexExtractor } from "../services/vertex-extractor.js";
import { repairInvoiceOutput } from "../services/repair-pipeline.js";
import { createInvoiceRouter, routeWholeInvoice } from "../services/invoice-router.js";
import { sendToQuickBooks, type QuickBooksConfig } from "../services/quickbooks-sender.js";
import { ReviewQueue } from "../queues/review.js";
import { trackExtractionCost, formatCostReport, checkBudget } from "../lib/cost-telemetry.js";
import pdf from "../lib/pdf-adapter.js";
Expected output: Run the CLI with pnpm tsx src/cli/batch-process.ts --dir ./invoices --quickbooks-url https://hooks.example.com/qb --quickbooks-token tok_abc --budget-limit 10.0. Each file is processed individually with try/catch so one failure doesn’t abort the batch. The budget is checked before each extraction — if the running total exceeds the limit or the daily budget from config, the batch stops early with a warning.
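The flag handling and the batch loop described above can be sketched with Node's built-in parseArgs. The flag names come from the command shown; whether the recipe's CLI actually uses parseArgs is not stated, and parseCliOptions and runBatch are hypothetical names:

```typescript
import { parseArgs } from "node:util";

interface CliOptions {
  dir: string;
  quickbooksUrl?: string;
  quickbooksToken?: string;
  budgetLimit: number;
}

// Parse the flags shown in the example command. Unknown flags throw (parseArgs is strict by default).
function parseCliOptions(argv: string[]): CliOptions {
  const { values } = parseArgs({
    args: argv,
    options: {
      dir: { type: "string" },
      "quickbooks-url": { type: "string" },
      "quickbooks-token": { type: "string" },
      "budget-limit": { type: "string" },
    },
  });
  if (!values.dir) throw new Error("--dir is required");
  const rawLimit = values["budget-limit"];
  const budgetLimit = rawLimit === undefined ? Number.POSITIVE_INFINITY : Number(rawLimit);
  if (Number.isNaN(budgetLimit)) throw new Error("--budget-limit must be a number");
  return {
    dir: values.dir,
    quickbooksUrl: values["quickbooks-url"],
    quickbooksToken: values["quickbooks-token"],
    budgetLimit,
  };
}

// Process files one by one: a failure is counted and skipped, and the loop
// stops early once the running total has exceeded the budget.
async function runBatch(
  files: string[],
  processFile: (file: string) => Promise<number>, // resolves to that file's cost in USD
  budgetLimit: number,
): Promise<{ processed: number; failed: number; totalUsd: number; stoppedEarly: boolean }> {
  let processed = 0;
  let failed = 0;
  let totalUsd = 0;
  for (const file of files) {
    if (totalUsd > budgetLimit) return { processed, failed, totalUsd, stoppedEarly: true };
    try {
      totalUsd += await processFile(file);
      processed++;
    } catch {
      failed++;
    }
  }
  return { processed, failed, totalUsd, stoppedEarly: false };
}
```

In a real entry point, parseCliOptions would receive process.argv.slice(2), since parseArgs expects the arguments without the node binary and script path.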
Step 12: Wire up barrel exports
Replace the placeholder src/index.ts with barrel re-exports that expose the entire library surface from a single entry point:
ts
// Schemas and their inferred types
export {
  InvoiceLineItemSchema,
  InvoiceAddressSchema,
  InvoiceHeaderSchema,
  InvoiceSummarySchema,
  InvoiceExtractionSchema,
} from "./schemas/invoice.js";
export type {
  InvoiceLineItem,
  InvoiceAddress,
  InvoiceHeader,
  InvoiceSummary,
  InvoiceExtraction,
} from "./schemas/invoice.js";

// Services
export { VertexExtractor, ExtractionError } from "./services/vertex-extractor.js";
export {
  repairInvoiceOutput,
  quickRepairInvoice,
  isValidInvoice,
  analyzeInvoiceInput,
} from "./services/repair-pipeline.js";
export {
  createInvoiceRouter,
  routeField,
  routeWholeInvoice,
} from "./services/invoice-router.js";
export type { FieldRoutingDecision, InvoiceRoutingResult } from "./services/invoice-router.js";
export { sendToQuickBooks } from "./services/quickbooks-sender.js";
export type { QuickBooksConfig } from "./services/quickbooks-sender.js";

// Review queue
export { ReviewQueue, processReadyReviews } from "./queues/review.js";
export type { ReviewItem } from "./queues/review.js";

// Cost telemetry
export {
  getTelemetryConfig,
  trackExtractionCost,
  checkBudget,
  formatCostReport,
  resetTelemetryConfig,
} from "./lib/cost-telemetry.js";
Expected output: pnpm typecheck passes. Every @reaatech/* package listed in package.json dependencies is now imported by at least one module under src/.
Step 13: Configure environment variables and run tests
Create .env.example at the project root with these entries:
Copy this to .env.local and fill in real values for your GCP project and service-account key path.
Now run the full quality check:
terminal
pnpm typecheck
pnpm lint
pnpm test
Expected output: TypeScript compiles with zero errors. ESLint passes with no warnings. Vitest runs 89 tests across 20 test suites — all passing, zero failures. Coverage exceeds 90% on lines, branches, functions, and statements for runtime code (source under src/ and app/**/route.ts).
Next steps
Persist the review queue: Replace the in-memory Map with a SQLite or PostgreSQL backend so review items survive server restarts.
Add a review dashboard: Build a GET /api/review endpoint that returns queued items, and a React page that lets human reviewers see extracted invoices, correct fields inline, and re-submit.
Integrate the QuickBooks Online API: Replace the webhook with OAuth2 and the official QuickBooks Online API, connecting vendorName to the Vendor resource and lineItems to Invoice line detail types.
export async function main(argv: string[]): Promise<void> {