Automatically extract and normalize line items from tax forms (1040, W-2, 1099) for small business bookkeeping using Grok’s reasoning and REAA’s output repair engine.
SMB accountants spend hours transcribing numbers from PDF tax forms into spreadsheets. Manual entry is slow and error-prone, and off-the-shelf OCR often produces garbled or malformed JSON that downstream systems can’t use.
This tutorial walks you through building a tax form extraction pipeline for small-business accounting. You’ll create a Next.js route handler that accepts PDF tax forms (Form 1040, W-2, 1099-NEC, 1099-MISC), extracts text using unpdf with a tesseract.js OCR fallback, sends the content to xAI Grok with a structured JSON schema, repairs malformed output using REAA’s structured repair engine, and enforces daily budget caps with per-call cost telemetry and semantic caching. By the end, you’ll have a complete document extraction endpoint that returns schema-validated JSON ready for QuickBooks or Xero upload.
Prerequisites
Node.js >= 22 and pnpm >= 10 installed
An xAI API key — set it as XAI_API_KEY in your environment
Redis running locally on port 6379 (or a remote Redis URL) — used by the LLM cache
An OpenAI API key (optional, for text-embedding-3-small embeddings) — falls back to your xAI API key if not set
Basic familiarity with TypeScript, Next.js App Router, and REST APIs
Step 1: Scaffold the Next.js project
Create a new Next.js project with the App Router. The scaffold gives you TypeScript, ESLint, and the App Router layout. You’ll then add Vitest for testing along with this recipe’s pinned dependencies.
cd xai-grok-tax-form-extraction-for-smb-accounting
The scaffold creates package.json, tsconfig.json, next.config.ts, .env.example, and the app/ directory with a placeholder page. Open package.json and add these pinned dependencies alongside the ones that create-next-app already generated:
Expected output: pnpm resolves all vendor packages, including the six @reaatech/* packages, and writes pnpm-lock.yaml.
Step 2: Configure environment variables
Open .env.example and replace its contents with the full list of environment variables the pipeline needs:
env
# Env vars used by xai-grok-tax-form-extraction-for-smb-accounting.
NODE_ENV=development

# xAI Grok API credentials
XAI_API_KEY=<your-xai-api-key>
XAI_BASE_URL=https://api.x.ai/v1
XAI_MODEL=grok-3

# Redis connection for LLM cache
REDIS_URL=redis://localhost:6379

# Daily budget ceiling for API spend
DAILY_BUDGET_USD=5.00

# Cache similarity threshold
CACHE_SEMANTIC_THRESHOLD=0.85
Copy the file to .env.local so Next.js reads it at dev time:
terminal
cp .env.example .env.local
Now edit .env.local and replace <your-xai-api-key> with your actual xAI API key.
Expected output: cat .env.local shows your real key in XAI_API_KEY, and all other vars have their default values.
Step 3: Define shared types and error classes
Create src/types.ts — this defines the tax form discriminant union, extraction metadata, the pipeline response shape, and typed error classes the pipeline throws at each stage:
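The listing for this file is not shown here, so the following is only a minimal sketch of what src/types.ts might contain. The exported names (ExtractionMethod, FieldError, TaxExtractionResponse, ConfigError, ExtractionError, RepairFailedError, BudgetExceededError) come from the imports used in later steps. The PipelineError base class, the error codes, and the field layout of TaxExtractionResponse are assumptions made for illustration.

```typescript
// Sketch of src/types.ts. Exported names match the imports used elsewhere in
// this recipe; field layouts and error codes are assumptions.

export type ExtractionMethod = "pdf-text" | "ocr" | "hybrid";

export interface FieldError {
  path: string;
  message: string;
}

// Assumed response shape: a success/failure union keyed on `success`.
export type TaxExtractionResponse =
  | { success: true; documents: unknown[]; method: ExtractionMethod; cached: boolean }
  | { success: false; error: { code: string; message: string; fieldErrors?: FieldError[] } };

// Hypothetical base class so handlers can branch on `code` instead of parsing messages.
export class PipelineError extends Error {
  readonly code: string;
  constructor(message: string, code: string) {
    super(message);
    this.name = new.target.name; // e.g. "RepairFailedError"
    this.code = code;
  }
}

export class ConfigError extends PipelineError {
  constructor(message: string) {
    super(message, "CONFIG_ERROR");
  }
}

export class ExtractionError extends PipelineError {
  constructor(message: string) {
    super(message, "EXTRACTION_FAILED");
  }
}

export class BudgetExceededError extends PipelineError {
  constructor(message: string) {
    super(message, "BUDGET_EXCEEDED");
  }
}

export class RepairFailedError extends PipelineError {
  readonly fieldErrors: FieldError[];
  constructor(message: string, fieldErrors: FieldError[] = []) {
    super(message, "REPAIR_FAILED");
    this.fieldErrors = fieldErrors;
  }
}
```

Giving every error a stable code lets the route handler map failures to HTTP statuses without string matching.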
Step 4: Define the Zod schemas and system prompt
Create src/lib/tax-schemas.ts. This file defines Zod schemas for Form 1040, W-2, and 1099-NEC/1099-MISC, a discriminated union that ties them together, a metadata schema, and the system prompt template that xAI Grok receives:
ts
import { z } from "zod";

export const Form1040Schema = z.object({
  formType: z.literal("1040"),
  filingStatus: z.enum([
    "single",
    "married_joint",
    "married_separate",
    "head_of_household",
    "qualifying_widow",
  ]),
  wages: z.number(),
  taxableInterest: z.number(),
  adjustedGrossIncome: z.number(),
  totalTax: z.number(),
  refund: z.number().optional(),
  amountOwed: z.number().optional(),
});

export const W2Schema = z.object({
  formType: z.literal("W-2"),
  employerEIN: z.string(),
  employerName: z.string(),
  wagesTips: z.number(),
  federalIncomeTaxWithheld: z.number(),
  socialSecurityWages: z.number(),
  socialSecurityTaxWithheld: z.number(),
  medicareWages: z.number(),
  medicareTaxWithheld: z.number(),
});

export const Form1099Schema = z.object({
  formType: z.enum(["1099-NEC", "1099-MISC"]),
  payerEIN: z.string(),
  payerName: z.string(),
  nonemployeeCompensation: z.number().optional(),
  rents: z.number().optional(),
  otherIncome: z.number().optional(),
  federalTaxWithheld: z.number(),
});

export const ExtractedTaxDocumentSchema = z.discriminatedUnion("formType", [
  Form1040Schema,
  W2Schema,
  Form1099Schema,
]);

export const ProcessingMetadataSchema = z.object({
  extractionMethod: z.enum(["pdf-text", "ocr", "hybrid"]),
  confidence: z.number().min(0).max(1),
  tokensUsed: z.number().optional(),
  costUsd: z.number().optional(),
  totalPages: z.number(),
});

export const TaxExtractionOutputSchema = z.object({
  documents: z.array(ExtractedTaxDocumentSchema),
  processingMetadata: ProcessingMetadataSchema,
});

export type Form1040 = z.infer<typeof Form1040Schema>;
export type W2Form = z.infer<typeof W2Schema>;
export type Form1099 = z.infer<typeof Form1099Schema>;
export type ExtractedTaxDocument = z.infer<typeof ExtractedTaxDocumentSchema>;
export type TaxExtractionOutput = z.infer<typeof TaxExtractionOutputSchema>;
export type ProcessingMetadata = z.infer<typeof ProcessingMetadataSchema>;

export const SYSTEM_PROMPT_TEMPLATE = `You are a tax form data extraction assistant.
Extract the following fields from the tax form text and return valid JSON.

Supported form types and their fields:

For Form 1040: formType ("1040"), filingStatus (one of: single, married_joint, married_separate, head_of_household, qualifying_widow), wages (number), taxableInterest (number), adjustedGrossIncome (number), totalTax (number), refund (optional number), amountOwed (optional number).

For Form W-2: formType ("W-2"), employerEIN (string), employerName (string), wagesTips (number), federalIncomeTaxWithheld (number), socialSecurityWages (number), socialSecurityTaxWithheld (number), medicareWages (number), medicareTaxWithheld (number).

For Form 1099-NEC or 1099-MISC: formType ("1099-NEC" or "1099-MISC"), payerEIN (string), payerName (string), nonemployeeCompensation (optional number), rents (optional number), otherIncome (optional number), federalTaxWithheld (number).

Return ONLY valid JSON. Do not include any explanatory text.`;
Step 5: Build the configuration loader
Create src/lib/config.ts. This reads environment variables at startup, validates that XAI_API_KEY is set, and exposes a typed config object consumed by every service:
Expected output: Importing config throws ConfigError("XAI_API_KEY is required but not set") unless the env var is present.
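The loader itself is not listed above, so here is a minimal sketch of how it could work, assuming the variable names and defaults from .env.example. The AppConfig field names, the loadConfig and parseNumber helpers, and the locally declared ConfigError are illustrative stand-ins; in the real file, ConfigError would presumably come from src/types.ts.

```typescript
// Sketch of a config loader. Env var names and defaults follow .env.example;
// everything else (helper names, AppConfig fields) is an assumption.

class ConfigError extends Error {}

interface AppConfig {
  xaiApiKey: string;
  xaiBaseUrl: string;
  xaiModel: string;
  redisUrl: string;
  dailyBudgetUsd: number;
  cacheSemanticThreshold: number;
}

// Parse an optional numeric env var, falling back to a default when unset.
function parseNumber(raw: string | undefined, fallback: number, name: string): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed)) {
    throw new ConfigError(`${name} must be a number, got "${raw}"`);
  }
  return parsed;
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const xaiApiKey = env.XAI_API_KEY;
  if (!xaiApiKey) {
    throw new ConfigError("XAI_API_KEY is required but not set");
  }
  return {
    xaiApiKey,
    xaiBaseUrl: env.XAI_BASE_URL ?? "https://api.x.ai/v1",
    xaiModel: env.XAI_MODEL ?? "grok-3",
    redisUrl: env.REDIS_URL ?? "redis://localhost:6379",
    dailyBudgetUsd: parseNumber(env.DAILY_BUDGET_USD, 5, "DAILY_BUDGET_USD"),
    cacheSemanticThreshold: parseNumber(env.CACHE_SEMANTIC_THRESHOLD, 0.85, "CACHE_SEMANTIC_THRESHOLD"),
  };
}

// In src/lib/config.ts this would typically be followed by:
//   export const config = loadConfig(process.env);
// so that a missing key fails at import time.
```

Taking the environment as a parameter, rather than reading process.env inside the function, keeps the loader easy to unit-test.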
Step 6: Implement PDF text extraction with OCR fallback
Create src/services/text-extractor.ts. This attempts PDF text extraction with unpdf first; if that returns no text, it falls back to tesseract.js OCR. If both fail, it throws ExtractionError:
ts
import { getDocumentProxy, extractText } from "unpdf";
import { createWorker } from "tesseract.js";
import { ExtractionError, type ExtractionMethod } from "../types.js";

export async function extractPdfText(
  buffer: Uint8Array,
): Promise<{ text: string; method: ExtractionMethod; totalPages: number }> {
  // Try embedded text first; it is faster and more accurate than OCR.
  const pdfResult = await tryUnpdf(buffer);
  if (pdfResult) {
    return pdfResult;
  }

  const ocrResult = await tryOcr(buffer);
  if (ocrResult) {
    return ocrResult;
  }

  throw new ExtractionError(
    "Failed to extract text from PDF: both unpdf and tesseract.js returned no text",
  );
}

async function tryUnpdf(
  buffer: Uint8Array,
): Promise<{ text: string; method: ExtractionMethod; totalPages: number } | null> {
  try {
    // Pass a copy so the caller's bytes remain available for the OCR fallback.
    const pdf = await getDocumentProxy(new Uint8Array(buffer));
    const result = await extractText(pdf, { mergePages: true });
    const text = result.text;
    if (!text || text.trim().length === 0) {
      return null;
    }
    return { text, method: "pdf-text", totalPages: result.totalPages };
  } catch {
    return null;
  }
}

async function tryOcr(
  buffer: Uint8Array,
): Promise<{ text: string; method: ExtractionMethod; totalPages: number } | null> {
  try {
    const imageBuffer = Buffer.from(buffer);
    const worker = await createWorker("eng");
    try {
      const ret = await worker.recognize(imageBuffer);
      const text = ret.data.text;
      if (!text || text.trim().length === 0) {
        return null;
      }
      return { text, method: "ocr", totalPages: 1 };
    } finally {
      // Always release the worker, even if recognition throws.
      await worker.terminate();
    }
  } catch {
    return null;
  }
}
Expected output: A searchable PDF with embedded text returns { text: "...", method: "pdf-text", totalPages: 3 }. A scanned image-based PDF triggers the OCR fallback and returns { text: "...", method: "ocr", totalPages: 1 }. A corrupt file throws ExtractionError.
Step 7: Create the Grok API client
Create src/services/grok-client.ts. This wraps the OpenAI SDK pointed at the xAI API base URL, sends the extracted text with the system prompt, and returns the raw content plus token usage. It retries once on connection errors and throws typed errors for auth or rate-limit failures:
Expected output: callGrok("Form 1040...", SYSTEM_PROMPT_TEMPLATE) returns raw JSON from Grok and a token count like { promptTokens: 1200, completionTokens: 200, totalTokens: 1400 }.
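The client listing is not included above. To illustrate the retry-once and typed-error behaviour, here is a dependency-free sketch that calls the OpenAI-compatible /chat/completions endpoint with an injected fetch function instead of the OpenAI SDK the recipe uses. The option names, the GrokAuthError and GrokRateLimitError classes, and this variant's third parameter are assumptions, not the recipe's actual signature.

```typescript
// Sketch only: uses an injected fetch-like function rather than the OpenAI SDK.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<HttpResponse>;

interface GrokResult {
  content: string;
  usage: { promptTokens: number; completionTokens: number; totalTokens: number };
}

class GrokAuthError extends Error {}
class GrokRateLimitError extends Error {}

async function callGrok(
  text: string,
  systemPrompt: string,
  opts: { apiKey: string; baseUrl: string; model: string; fetchImpl: FetchLike },
): Promise<GrokResult> {
  const body = JSON.stringify({
    model: opts.model,
    temperature: 0,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: text },
    ],
  });

  let lastError: unknown;
  // Two attempts in total: the initial call plus one retry on connection errors.
  for (let attempt = 0; attempt < 2; attempt++) {
    let res: HttpResponse;
    try {
      res = await opts.fetchImpl(`${opts.baseUrl}/chat/completions`, {
        method: "POST",
        headers: { "Content-Type": "application/json", Authorization: `Bearer ${opts.apiKey}` },
        body,
      });
    } catch (err) {
      lastError = err; // network-level failure: try once more
      continue;
    }
    // HTTP-level failures are not retried; they map to typed errors.
    if (res.status === 401 || res.status === 403) throw new GrokAuthError("xAI rejected the API key");
    if (res.status === 429) throw new GrokRateLimitError("xAI rate limit reached");
    if (!res.ok) throw new Error(`Grok request failed with HTTP ${res.status}`);

    const json = (await res.json()) as {
      choices?: { message?: { content?: string } }[];
      usage?: { prompt_tokens?: number; completion_tokens?: number; total_tokens?: number };
    };
    return {
      content: json.choices?.[0]?.message?.content ?? "",
      usage: {
        promptTokens: json.usage?.prompt_tokens ?? 0,
        completionTokens: json.usage?.completion_tokens ?? 0,
        totalTokens: json.usage?.total_tokens ?? 0,
      },
    };
  }
  throw lastError instanceof Error ? lastError : new Error("Grok request failed after one retry");
}
```

In a Node 22 runtime the global fetch can be passed as fetchImpl; injecting it keeps the retry path testable without network access.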
Step 8: Build the output repair service
Create src/services/repair-service.ts. This wraps @reaatech/structured-repair-core: it takes a Zod schema and the raw JSON string from Grok, runs the library's graduated repair strategies (including stripping fences, fixing syntax, coercing types, fuzzy-matching keys, and removing extra fields), and either returns typed data or throws with per-field error details:
ts
import { z } from "zod";
import { repairOutput, isValid } from "@reaatech/structured-repair-core";
import { RepairFailedError, type FieldError } from "../types.js";

export function repairLlmOutput<T>(
  schema: z.ZodType<T>,
  rawJson: string,
): T {
  const result = repairOutput({
    schema,
    input: rawJson,
  });

  if (result.success) {
    return result.data as T;
  }

  const fieldErrors: FieldError[] = (result.fieldErrors ?? []).map(
    (fe: { path?: string; message?: string }) => ({
      path: fe.path ?? "unknown",
      message: fe.message ?? "Unknown validation error",
    }),
  );

  throw new RepairFailedError(
    "Repair failed: all strategies exhausted",
    fieldErrors,
  );
}

export function isValidJson<T>(
  schema: z.ZodType<T>,
  rawJson: string,
): boolean {
  return isValid(schema, rawJson);
}
Expected output: A string like ```json\n{ "formType": "1040", "wages": "75000" }\n``` — with fences and string-typed numbers — gets repaired into a valid { formType: "1040", wages: 75000 } object through fence stripping and type coercion.
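To make two of those strategies concrete, the sketch below shows fence stripping and type coercion in plain TypeScript. It is an illustration of the idea only, not the implementation inside @reaatech/structured-repair-core; the helper names and the numericKeys parameter are hypothetical.

```typescript
// Illustrative re-implementation of two repair strategies; not the library's code.

// Built at runtime so this example contains no literal fence sequence.
const FENCE = "`".repeat(3);

// Strategy 1: drop a leading fence line (with optional language tag) and a trailing fence.
function stripCodeFences(raw: string): string {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? text.slice(FENCE.length) : text.slice(firstNewline + 1);
  }
  if (text.endsWith(FENCE)) {
    text = text.slice(0, -FENCE.length);
  }
  return text.trim();
}

// Strategy 2: turn string values like "75,000" into numbers, but only for keys
// the schema declares as numeric (so formType "1040" stays a string).
function coerceNumericStrings(value: unknown, numericKeys: ReadonlySet<string>): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => coerceNumericStrings(item, numericKeys));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      if (numericKeys.has(key) && typeof inner === "string") {
        const cleaned = inner.replace(/[$,\s]/g, "");
        const parsed = Number(cleaned);
        out[key] = cleaned !== "" && Number.isFinite(parsed) ? parsed : inner;
      } else {
        out[key] = coerceNumericStrings(inner, numericKeys);
      }
    }
    return out;
  }
  return value;
}

// Run both strategies in order, then hand the result to schema validation.
function repairSketch(raw: string, numericKeys: ReadonlySet<string>): unknown {
  return coerceNumericStrings(JSON.parse(stripCodeFences(raw)), numericKeys);
}
```

Restricting coercion to known numeric keys matters here because identifiers such as formType and EINs look numeric but must stay strings.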
Step 9: Wire up the LLM cache with Redis
Create src/services/cache-service.ts. This implements the StorageAdapter interface from @reaatech/llm-cache backed by Redis, and a factory function that constructs a CacheEngine with OpenAI embeddings for semantic matching:
ts
import { Redis } from "ioredis";
import {
  CacheEngine,
  InMemoryAdapter,
  OpenAIEmbedder,
  type StorageAdapter,
  type StorageStats,
  type HealthStatus,
  type CacheEntry,
  buildPromptHash,
  buildExactMatchKey,
} from "@reaatech/llm-cache";
import { config } from "../lib/config.js";

export class RedisStorageAdapter implements StorageAdapter {
  private redis: Redis;

  constructor(redis: Redis) {
    this.redis = redis;
  }
Expected output: createCacheEngine(redis) returns a configured CacheEngine that checks exact-matches via Redis GET, then falls back to semantic embedding similarity. Cache misses proceed to the Grok call.
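The two-tier lookup described here (exact match first, then semantic similarity against a threshold such as CACHE_SEMANTIC_THRESHOLD=0.85) can be sketched without Redis or an embedding API. The in-memory entry list, the normalizeKey helper, and the hand-written embeddings below are stand-ins for illustration; they do not reflect how @reaatech/llm-cache stores or hashes entries.

```typescript
// Conceptual sketch of exact-then-semantic cache lookup; not the library's code.

interface CachedEntry {
  key: string;         // normalized prompt (a real cache would hash this)
  embedding: number[]; // vector from an embedding model
  response: string;
}

type LookupResult = { hit: "exact" | "semantic"; response: string; score: number } | null;

function normalizeKey(model: string, prompt: string): string {
  return `${model}:${prompt.trim().replace(/\s+/g, " ").toLowerCase()}`;
}

function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length || a.length === 0) return 0;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

function lookup(
  entries: readonly CachedEntry[],
  key: string,
  embedding: readonly number[],
  threshold: number,
): LookupResult {
  // Tier 1: exact key match, the cheap path (a Redis GET in the recipe).
  const exact = entries.find((entry) => entry.key === key);
  if (exact) return { hit: "exact", response: exact.response, score: 1 };

  // Tier 2: best semantic match at or above the threshold.
  let best: CachedEntry | null = null;
  let bestScore = threshold;
  for (const entry of entries) {
    const score = cosineSimilarity(entry.embedding, embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best ? { hit: "semantic", response: best.response, score: bestScore } : null;
}
```

For tax data, a high threshold is the safer choice: two different W-2s can be textually very similar, and a false semantic hit would return another document's numbers.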
Step 10: Add budget enforcement and cost telemetry
Create src/services/budget-service.ts. This wraps @reaatech/agent-budget-engine to enforce a daily spending cap, and src/services/telemetry-service.ts to emit per-call cost spans via the @reaatech/llm-cost-telemetry types:
ts
// src/services/budget-service.ts
import { BudgetController } from "@reaatech/agent-budget-engine";
import { SpendStore } from "@reaatech/agent-budget-spend-tracker";
import { BudgetScope, type BudgetCheckRequest, type SpendEntry } from "@reaatech/agent-budget-types";
import { generateId } from "@reaatech/llm-cost-telemetry";
import { BudgetExceededError } from "../types.js";

export function createBudgetController(dailyLimit: number): BudgetController {
  const spendTracker = new SpendStore();
  const controller = new BudgetController({ spendTracker });

  controller.defineBudget({
    scopeType: BudgetScope.Org,
    scopeKey: "tax-extraction",
    limit: dailyLimit,
    policy: {
      softCap: 0.8,
      hardCap: 1.0,
    },
  });

  return controller;
}

export function checkBudget(
  controller: BudgetController,
  estimatedCost: number,
): void {
  const request: BudgetCheckRequest = {
    scopeType: BudgetScope.Org,
    scopeKey: "tax-extraction",
    estimatedCost,
    modelId: "grok-3",
    tools: [],
  };

  const result = controller.check(request);
  if (!result.allowed) {
    throw new BudgetExceededError(
      "Daily budget exceeded: request blocked",
    );
  }
}

export function recordSpend(
  controller: BudgetController,
  params: { cost: number; inputTokens: number; outputTokens: number },
): void {
  const entry: SpendEntry = {
    requestId: generateId(),
    scopeType: BudgetScope.Org,
    scopeKey: "tax-extraction",
    cost: params.cost,
    inputTokens: params.inputTokens,
    outputTokens: params.outputTokens,
    modelId: "grok-3",
    provider: "xai",
    timestamp: new Date(),
  };

  controller.record(entry);
}
Expected output: checkBudget(controller, 0.02) allows the request if under the daily limit. After the Grok call completes, recordSpend and recordCostSpan log the actual cost. When the $5.00 daily cap is reached, checkBudget throws BudgetExceededError and the endpoint returns HTTP 429.
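The arithmetic behind the cap can be sketched without the @reaatech packages. The example below estimates cost from token counts and tracks a daily total with a soft cap at 80% and a hard cap at 100%, mirroring the policy values above. The DailyBudget class, the estimateCostUsd helper, and especially the per-million-token prices used in the usage note are hypothetical; check xAI's current pricing before relying on any figure.

```typescript
// Conceptual sketch of daily budget enforcement; not the @reaatech engine.

interface Pricing {
  inputPerMTok: number;  // USD per million input tokens (caller-supplied)
  outputPerMTok: number; // USD per million output tokens (caller-supplied)
}

function estimateCostUsd(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  return (inputTokens * pricing.inputPerMTok + outputTokens * pricing.outputPerMTok) / 1_000_000;
}

class DailyBudget {
  private spentUsd = 0;
  private day = "";
  private readonly limitUsd: number;
  private readonly softCap: number;

  constructor(limitUsd: number, softCap = 0.8) {
    this.limitUsd = limitUsd;
    this.softCap = softCap;
  }

  // Would this request fit? `warning` flips once projected spend crosses the soft cap.
  check(estimatedCostUsd: number, now: Date = new Date()): { allowed: boolean; warning: boolean } {
    this.rollover(now);
    const projected = this.spentUsd + estimatedCostUsd;
    return {
      allowed: projected <= this.limitUsd,
      warning: projected >= this.limitUsd * this.softCap,
    };
  }

  // Record the actual cost once the model call has completed.
  record(costUsd: number, now: Date = new Date()): void {
    this.rollover(now);
    this.spentUsd += costUsd;
  }

  // Reset the running total when the UTC date changes.
  private rollover(now: Date): void {
    const today = now.toISOString().slice(0, 10);
    if (today !== this.day) {
      this.day = today;
      this.spentUsd = 0;
    }
  }
}
```

As a worked example with made-up prices of $2 and $10 per million input and output tokens, a call using 1,200 prompt tokens and 200 completion tokens costs (1200 × 2 + 200 × 10) / 1,000,000 = $0.0044, so a $5.00 cap would allow roughly 1,100 such calls per day.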
Step 11: Create the pipeline orchestrator
Create src/services/tax-extractor.ts. This is the heart of the recipe — it ties together text extraction, cache lookup, the Grok call, output repair, budget checks, and telemetry into a single extractTaxData function:
ts
import { Redis } from "ioredis";
import { config } from "../lib/config.js";
import {
  ExtractedTaxDocumentSchema,
  SYSTEM_PROMPT_TEMPLATE,
} from "../lib/tax-schemas.js";
import type {
  TaxExtractionResponse,
  ExtractedTaxDocument,
} from "../types.js";
import {
  RepairFailedError,
  BudgetExceededError,
} from "../types.js";
import { extractPdfText } from "./text-extractor.js";
import { callGrok } from "./grok-client.js";
import { repairLlmOutput } from "./repair-service.js";
Expected output: The orchestrator returns TaxExtractionResponse — either from cache (zero cost, full confidence) or from a fresh Grok invocation. On any recoverable failure it returns a well-structured error response instead of crashing.
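Only the imports of the orchestrator appear above, so the control flow is sketched here with every stage injected as a plain function. The PipelineDeps interface, the runPipeline name, and the result shape are assumptions for illustration; the real extractTaxData presumably wires in the services from the previous steps directly.

```typescript
// Control-flow sketch of the orchestrator with injected stages (all names hypothetical).

interface PipelineDeps<T> {
  extractText(pdf: Uint8Array): Promise<{ text: string; totalPages: number }>;
  cacheGet(text: string): Promise<T | null>;
  cacheSet(text: string, value: T): Promise<void>;
  checkBudget(estimatedCostUsd: number): void; // throws when over the cap
  callModel(text: string): Promise<{ content: string; costUsd: number }>;
  repair(raw: string): T;                      // throws when repair fails
  recordSpend(costUsd: number): void;
}

type PipelineResult<T> =
  | { ok: true; data: T; cached: boolean; costUsd: number }
  | { ok: false; error: string };

async function runPipeline<T>(
  pdf: Uint8Array,
  deps: PipelineDeps<T>,
  estimatedCostUsd = 0.02,
): Promise<PipelineResult<T>> {
  try {
    const { text } = await deps.extractText(pdf);

    // A cache hit short-circuits everything that costs money.
    const cached = await deps.cacheGet(text);
    if (cached !== null) {
      return { ok: true, data: cached, cached: true, costUsd: 0 };
    }

    // Check the budget before spending, record the real cost after.
    deps.checkBudget(estimatedCostUsd);
    const { content, costUsd } = await deps.callModel(text);
    deps.recordSpend(costUsd); // tokens were consumed even if repair fails below

    const data = deps.repair(content);
    await deps.cacheSet(text, data);
    return { ok: true, data, cached: false, costUsd };
  } catch (err) {
    // Convert stage failures into a structured error instead of crashing.
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

Recording spend before running repair is a deliberate ordering: a response that cannot be repaired still counts against the daily cap.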
Step 12: Build the API route
Create app/api/extract-tax/route.ts. This is the Next.js App Router endpoint that accepts multipart PDF uploads, validates the file, and delegates to extractTaxData:
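The route listing is missing here, so the following sketch shows one way to validate the upload using only the Web-standard Request, Response, and FormData types that App Router handlers receive. The form field name "file", the 10 MB limit, the handleExtractTax name, and the injected extract function are all assumptions; in route.ts you would export a POST function that calls this helper with extractTaxData.

```typescript
// Sketch of upload validation for the route handler (field name and limits are hypothetical).

const MAX_BYTES = 10 * 1024 * 1024;               // assumed upload limit
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d]; // the bytes of "%PDF-"

function jsonError(message: string, status: number): Response {
  return Response.json({ error: message }, { status });
}

async function handleExtractTax(
  request: Request,
  extract: (pdf: Uint8Array) => Promise<unknown>,
): Promise<Response> {
  let form: FormData;
  try {
    form = await request.formData();
  } catch {
    return jsonError("Expected a multipart/form-data body", 400);
  }

  const file = form.get("file");
  if (file === null || typeof file === "string") {
    return jsonError("Missing PDF upload in the \"file\" field", 400);
  }
  if (file.size === 0) return jsonError("Uploaded file is empty", 400);
  if (file.size > MAX_BYTES) return jsonError("File exceeds the upload limit", 413);
  if (file.type !== "application/pdf") return jsonError("Only PDF uploads are supported", 415);

  // Do not trust the declared MIME type alone: check the PDF signature too.
  const bytes = new Uint8Array(await file.arrayBuffer());
  if (!PDF_MAGIC.every((byte, i) => bytes[i] === byte)) {
    return jsonError("File does not look like a PDF", 415);
  }

  try {
    return Response.json(await extract(bytes));
  } catch (err) {
    // Budget exhaustion maps to HTTP 429, as described in Step 10.
    const overBudget = err instanceof Error && err.name === "BudgetExceededError";
    return jsonError(overBudget ? "Daily budget exceeded" : "Extraction failed", overBudget ? 429 : 500);
  }
}

// In app/api/extract-tax/route.ts (assumed wiring):
//   export async function POST(request: Request) {
//     return handleExtractTax(request, (pdf) => extractTaxData(pdf));
//   }
```

Keeping the validation in a helper that takes the extractor as a parameter lets you test status codes without touching Grok, Redis, or the file system.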
pnpm vitest run --coverage --reporter=json --outputFile=vitest-report.json
Expected output: All tests pass with zero failures. Line, branch, function, and statement coverage meet the 90% thresholds on runtime code (src/**/*.ts and app/**/route.ts).
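The 90% thresholds and the coverage scope mentioned above would normally live in the Vitest configuration. The recipe's own vitest.config.ts is not shown, so treat the following as a sketch of what such a file could look like; it assumes the V8 coverage provider (the @vitest/coverage-v8 package) is installed and a Node test environment.

```typescript
// vitest.config.ts (sketch; assumes @vitest/coverage-v8 is installed)
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    environment: "node",
    coverage: {
      provider: "v8",
      // Measure runtime code only, matching the scope described above.
      include: ["src/**/*.ts", "app/**/route.ts"],
      // Fail the run when any metric drops below 90%.
      thresholds: {
        lines: 90,
        branches: 90,
        functions: 90,
        statements: 90,
      },
    },
  },
});
```

With thresholds in the config, the pnpm vitest run --coverage command exits non-zero when coverage regresses, which makes it suitable as a CI gate.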
Next steps
Add more form types — extend the Zod schemas and system prompt for Form 1120 (corporate), Schedule C, or state-level returns, then add them to the ExtractedTaxDocumentSchema discriminated union.
Queue-based processing — for high-volume filing periods, add a Redis-backed BullMQ job queue so PDFs are processed asynchronously with webhook callbacks instead of blocking on the HTTP request.
Batch upload — modify the POST handler to accept multiple PDFs in a single request (e.g. a zip archive or a files array) and return an array of TaxExtractionResponse objects keyed by filename.
Cost dashboard — pipe the JSON lines emitted by recordCostSpan into a logging sink (CloudWatch, Loki, or a local file), then build a small dashboard to visualize daily spend per form type.
Frontend upload UI — create a client component in app/ with a drag-and-drop zone that POSTs the PDF to /api/extract-tax and renders the extracted JSON table inline.