Independent insurance brokers manually re‑key data from dozens of policy PDFs every week, a slow, error‑prone process that consumes billable hours and delays client service.
This tutorial walks you through building a Next.js API that extracts structured insurance policy data from uploaded PDFs using Cohere’s LLM. You’ll build a document processing pipeline that parses PDF text, falls back to OCR for scanned documents, sends the content to Cohere for structured field extraction, repairs malformed JSON output, enforces per-broker budgets, and tracks LLM cost telemetry — all exposed through three REST API endpoints.
Prerequisites
Node.js 22+ and pnpm 10+ installed
A Cohere API key with access to the command-a-03-2025 model
Familiarity with TypeScript, Next.js App Router, and Zod schemas
Your terminal ready — you’ll paste every command and code block
Step 1: Scaffold the project and install dependencies
Create a new Next.js project with the App Router and install all third-party and vendored packages:
Expected output: pnpm exits 0 and you see the dependency tree printed.
Create .env.local with these placeholders:
env
# Env vars used by cohere-insurance-policy-data-extraction-for-smb-brokers.
# Keep placeholders only; never commit real values.
COHERE_API_KEY=<your-cohere-api-key>
COHERE_MODEL=command-a-03-2025
DEFAULT_BROKER_BUDGET=100.0
Step 2: Define the types and Zod schemas
Create src/lib/types.ts. This file defines the Zod schema for insurance policy fields and the TypeScript interfaces your pipeline will use:
The PolicyFieldSchema is the contract — Cohere’s output must map to these fields. The optional() on additional_insured handles policies with only one named insured. The z.enum() calls restrict policy_type and status to valid values, which @reaatech/structured-repair-core uses to reject hallucinated categories.
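To make the enum restriction concrete, here is a dependency-free sketch of the membership check that z.enum() performs for policy_type and status. The category lists and the isOneOf and checkCategories helpers are assumptions made up for this illustration; the recipe's actual schema uses Zod and may allow different values.

```typescript
// Hypothetical category lists; the recipe's real z.enum() values may differ.
const POLICY_TYPES = ["auto", "home", "commercial"] as const;
const STATUSES = ["active", "expired", "cancelled"] as const;

// Narrow an unknown value to one of the allowed literals, as z.enum() would.
function isOneOf<T extends string>(allowed: readonly T[], value: unknown): value is T {
  return typeof value === "string" && (allowed as readonly string[]).includes(value);
}

// Collect every problem instead of throwing on the first one,
// so a caller can report all rejected categories together.
function checkCategories(candidate: Record<string, unknown>): string[] {
  const problems: string[] = [];
  if (!isOneOf(POLICY_TYPES, candidate.policy_type)) {
    problems.push(`policy_type: unexpected value ${JSON.stringify(candidate.policy_type)}`);
  }
  if (!isOneOf(STATUSES, candidate.status)) {
    problems.push(`status: unexpected value ${JSON.stringify(candidate.status)}`);
  }
  return problems;
}

// A category the model invented is reported rather than silently accepted.
const problems = checkCategories({ policy_type: "pet-unicorn", status: "active" });
console.log(problems);
```

With a closed list, a value that is merely plausible can never pass validation, which is what lets a repair step flag it.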
Step 3: Create the configuration module
Create src/lib/config.ts to parse environment variables at module load time:
ts
import type { AppConfig } from "./types.js";

// Read a required variable; fail fast if it is unset or empty.
function getEnvOrThrow(key: string): string {
  const value = process.env[key];
  if (!value) {
    throw new Error(`Missing ${key}`);
  }
  return value;
}

function getEnvWithDefault(key: string, defaultVal: string): string {
  return process.env[key] ?? defaultVal;
}

function getEnvNumberWithDefault(key: string, defaultVal: number): number {
  const raw = process.env[key];
  // Treat unset and blank values alike: Number("") is 0, not NaN.
  if (raw === undefined || raw.trim() === "") return defaultVal;
  const parsed = Number(raw);
  return Number.isNaN(parsed) ? defaultVal : parsed;
}

export const config: AppConfig = Object.freeze({
  cohereApiKey: getEnvOrThrow("COHERE_API_KEY"),
  cohereModel: getEnvWithDefault("COHERE_MODEL", "command-a-03-2025"),
  defaultBrokerBudget: getEnvNumberWithDefault("DEFAULT_BROKER_BUDGET", 100.0),
});
Expected output: pnpm typecheck exits 0.
The Object.freeze() prevents any runtime mutation of the config object. The getEnvOrThrow helper gives you a clear, fail-fast error message if COHERE_API_KEY is missing — no silent fallback to an unusable empty string.
Step 4: Build the PDF parser
Create src/lib/pdf-parser.ts. This module reads PDF files using pdfjs-dist and renders pages to images as a fallback for scanned documents:
The parsePdf function iterates every page, extracts text items, and joins them into a single string. When a PDF has no extractable text (a scanned document), renderPageToImage renders a page to raw pixels using pdfjs-dist’s render context, then converts those pixels to a PNG buffer using Sharp — that image is then fed to the OCR pipeline in Step 8.
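The text-joining and scanned-document decision described above can be sketched without pdfjs-dist itself. The TextContentItem type below mirrors only the one property the parser needs (str on text runs); joinPageText, looksScanned, and the 20-character threshold are assumptions for illustration, not the recipe's actual code.

```typescript
// Minimal stand-in for pdfjs-dist text content items: text runs carry `str`,
// while marked-content items do not.
type TextContentItem = { str: string } | { type: string };

// Join the text runs of each page with spaces, then join pages with newlines.
function joinPageText(pages: TextContentItem[][]): string {
  return pages
    .map((items) =>
      items
        .filter((item): item is { str: string } => "str" in item)
        .map((item) => item.str)
        .join(" "),
    )
    .join("\n");
}

// Hypothetical heuristic: treat the PDF as scanned when almost no text was found.
function looksScanned(text: string, minChars = 20): boolean {
  return text.replace(/\s+/g, "").length < minChars;
}

const digital = joinPageText([
  [{ str: "Policy" }, { str: "No. 12345" }],
  [{ str: "Premium: $1,200.00 per year" }],
]);
const scanned = joinPageText([[{ type: "beginMarkedContent" }]]);

console.log(looksScanned(digital)); // false: enough text to send to the model
console.log(looksScanned(scanned)); // true: fall back to rendering and OCR
```

A character threshold is more forgiving than testing for an empty string, since scanned PDFs sometimes carry a stray page number or watermark as real text.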
Step 5: Create the Cohere client
Create src/services/cohere-client.ts. This module wraps the Cohere SDK to extract policy fields from text:
ts
import { CohereClientV2, CohereError, CohereTimeoutError } from "cohere-ai";
import { config } from "../lib/config.js";

export class ExtractionError extends Error {
  public statusCode?: number;
  public body?: unknown;

  constructor(message: string, details?: { statusCode?: number; body?: unknown }) {
    super(message);
    this.name = "ExtractionError";
    this.statusCode = details?.statusCode;
    this.body = details?.body;
  }
}

// Pass the key explicitly: the SDK's own environment fallback looks for
// CO_API_KEY, not the COHERE_API_KEY variable this recipe defines.
const cohere = new CohereClientV2({ token: config.cohereApiKey });

export async function extractPolicyFields(
  text: string
): Promise<{ rawResponse: string; inputTokens: number; outputTokens: number }> {
  const extractionPrompt = `Extract the following insurance policy fields as strict JSON. Return ONLY a valid JSON object with these keys: policy_number, insured_name, policy_type, effective_date, expiration_date, premium_amount, deductible, coverage_limits { liability, property, medical }, status, additional_insured. Policy text:\n\n${text}`;

  try {
    const response = await cohere.chat({
      model: config.cohereModel,
      messages: [{ role: "user" as const, content: extractionPrompt }],
    });

    // The reply content is an array of parts; take the text of the first one.
    const content = response.message.content;
    const rawResponse =
      Array.isArray(content) && content.length > 0
        ? (content[0] as { text?: string }).text ?? ""
        : "";

    // Prefer raw token counts, then billed units, then zero.
    const inputTokens =
      response.usage?.tokens?.inputTokens ?? response.usage?.billedUnits?.inputTokens ?? 0;
    const outputTokens =
      response.usage?.tokens?.outputTokens ?? response.usage?.billedUnits?.outputTokens ?? 0;

    return { rawResponse, inputTokens, outputTokens };
  } catch (err) {
    if (err instanceof CohereError) {
      throw new ExtractionError("Cohere API error", {
        statusCode: err.statusCode,
        body: err.body,
      });
    }
    if (err instanceof CohereTimeoutError) {
      throw new ExtractionError("Cohere API timed out");
    }
    throw err;
  }
}

// Estimated USD cost at $2.50 per million input tokens and $10.00 per million output tokens.
export function estimateCost(inputTokens: number, outputTokens: number): number {
  const inputCost = (inputTokens / 1_000_000) * 2.5;
  const outputCost = (outputTokens / 1_000_000) * 10.0;
  return Number((inputCost + outputCost).toFixed(6));
}
Expected output: pnpm typecheck exits 0.
The CohereClientV2 constructor receives the key explicitly through token: config.cohereApiKey. An empty {} would leave the SDK to fall back on its own environment lookup, which reads CO_API_KEY rather than COHERE_API_KEY, so requests would go out unauthenticated. The extraction prompt tells Cohere to return strict JSON with precisely the keys your PolicyFieldSchema expects. The ExtractionError class wraps both CohereError (HTTP-level failures) and CohereTimeoutError (network timeouts) into a uniform error type the pipeline can catch.
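As a worked example of the cost arithmetic, the snippet below repeats estimateCost from the module above and applies it to a made-up document. The token counts (12,000 in, 800 out) are illustrative assumptions, not measurements.

```typescript
// Same per-million-token rates as estimateCost above: $2.50 input, $10.00 output.
function estimateCost(inputTokens: number, outputTokens: number): number {
  const inputCost = (inputTokens / 1_000_000) * 2.5;
  const outputCost = (outputTokens / 1_000_000) * 10.0;
  return Number((inputCost + outputCost).toFixed(6));
}

// A hypothetical 12,000-token policy with an 800-token JSON answer:
//   input  12,000 / 1,000,000 * 2.5  = 0.030
//   output    800 / 1,000,000 * 10.0 = 0.008
const cost = estimateCost(12_000, 800);
console.log(cost); // 0.038
```

At that size a single extraction costs under four cents, so the default budget of 100.0 would cover roughly 2,600 such documents.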
Step 6: Wire up budget enforcement and cost telemetry
Create src/services/budget-service.ts using @reaatech/agent-budget-engine:
The budget service wraps BudgetController with a soft cap at 80% and hard cap at 100% of the defined limit — at the hard cap, check() starts returning { allowed: false }. The cost telemetry module buffers spans and auto-flushes every 60 seconds (or when the buffer hits 500 spans), sending them to the aggregator which groups by tenant, feature, provider, and model.
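The cap behaviour and the size-triggered flush can be sketched in plain TypeScript. SimpleBudget and SpanBuffer below are hypothetical stand-ins written for this illustration; they are not the API of @reaatech/agent-budget-engine or of the telemetry package, and the 60-second timer is left out so the example exits immediately.

```typescript
type BudgetState = "active" | "soft-cap" | "hard-cap";

interface BudgetCheck {
  allowed: boolean;
  state: BudgetState;
  remaining: number;
}

// Hypothetical budget tracker: one running total per broker.
class SimpleBudget {
  private spent = new Map<string, number>();
  private limit: number;
  private softRatio: number;

  constructor(limit: number, softRatio = 0.8) {
    this.limit = limit;
    this.softRatio = softRatio;
  }

  // Decide whether a request with the given estimated cost may proceed.
  check(brokerId: string, estimatedCost: number): BudgetCheck {
    const used = this.spent.get(brokerId) ?? 0;
    const projected = used + estimatedCost;
    const remaining = Math.max(0, this.limit - used);
    if (projected > this.limit) return { allowed: false, state: "hard-cap", remaining };
    if (projected >= this.limit * this.softRatio) return { allowed: true, state: "soft-cap", remaining };
    return { allowed: true, state: "active", remaining };
  }

  record(brokerId: string, cost: number): void {
    this.spent.set(brokerId, (this.spent.get(brokerId) ?? 0) + cost);
  }
}

// Hypothetical span buffer: flushes to a sink once it holds maxSize spans.
class SpanBuffer<T> {
  private spans: T[] = [];
  private maxSize: number;
  private sink: (batch: T[]) => void;

  constructor(maxSize: number, sink: (batch: T[]) => void) {
    this.maxSize = maxSize;
    this.sink = sink;
  }

  add(span: T): void {
    this.spans.push(span);
    if (this.spans.length >= this.maxSize) this.flush();
  }

  // A real service would also call this from a periodic timer.
  flush(): void {
    if (this.spans.length === 0) return;
    const batch = this.spans;
    this.spans = [];
    this.sink(batch);
  }
}

const budget = new SimpleBudget(100);
budget.record("b1", 85);
console.log(budget.check("b1", 0.05).state); // "soft-cap": allowed, but worth a warning
```

Checking the projected total rather than the amount already spent means the request that would cross the limit is refused before any tokens are billed.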
Step 7: Set up the document extraction operations
Create the artifact store and registry adapters needed by @reaatech/media-pipeline-mcp-doc-extraction, then instantiate the extraction operations.
createDocumentExtractionOperations takes an artifact registry (for metadata) and an artifact store (for binary data). The adapter’s set() method on the store is a convenience used by the pipeline to register in-memory rendered images before calling performOCR.
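To show the split between binary data and metadata, here is a minimal in-memory pair of adapters. Every name and shape in it (ArtifactMeta, InMemoryArtifactStore, InMemoryArtifactRegistry, stageRenderedPage) is an assumption for illustration; the interfaces that @reaatech/media-pipeline-mcp-doc-extraction actually expects may differ.

```typescript
// Hypothetical metadata record kept by the registry.
interface ArtifactMeta {
  id: string;
  mimeType: string;
  sizeBytes: number;
}

// Holds the raw bytes, keyed by artifact id.
class InMemoryArtifactStore {
  private blobs = new Map<string, Uint8Array>();

  set(id: string, data: Uint8Array): void {
    this.blobs.set(id, data);
  }

  get(id: string): Uint8Array | undefined {
    return this.blobs.get(id);
  }
}

// Holds only descriptive metadata, keyed by the same id.
class InMemoryArtifactRegistry {
  private entries = new Map<string, ArtifactMeta>();

  register(meta: ArtifactMeta): void {
    this.entries.set(meta.id, meta);
  }

  lookup(id: string): ArtifactMeta | undefined {
    return this.entries.get(id);
  }
}

// Stage a rendered page so that a later OCR step can find it by id.
function stageRenderedPage(
  store: InMemoryArtifactStore,
  registry: InMemoryArtifactRegistry,
  id: string,
  png: Uint8Array,
): ArtifactMeta {
  const meta: ArtifactMeta = { id, mimeType: "image/png", sizeBytes: png.byteLength };
  store.set(id, png);
  registry.register(meta);
  return meta;
}

const store = new InMemoryArtifactStore();
const registry = new InMemoryArtifactRegistry();
// The first four bytes of the PNG signature stand in for a rendered page.
const staged = stageRenderedPage(store, registry, "doc-1-page-1", new Uint8Array([0x89, 0x50, 0x4e, 0x47]));
console.log(staged.sizeBytes); // 4
```

Keeping metadata and bytes in separate maps mirrors the registry/store split, so either side could later be swapped for a persistent backend without touching the other.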
Step 8: Build the extraction pipeline
Create src/lib/pipeline.ts — this is the heart of the recipe. It orchestrates budget checking, PDF parsing, OCR fallback, Cohere extraction, structured repair, and cost tracking:
ts
import { config } from "./config.js";
import type { ExtractionRequest, ExtractionResult, PolicyFields } from "./types.js";
import { parsePdf, renderPageToImage } from "./pdf-parser.js";
import { extractPolicyFields, estimateCost, ExtractionError } from "../services/cohere-client.js";
import { checkBudget, recordSpend } from "../services/budget-service.js";
import { addCostSpan } from "../services/cost-telemetry.js";
import { performOCR } from "../services/document-extraction.js";
import { repairOutput } from "@reaatech/structured-repair-core";
import { PolicyFieldSchema } from "./types.js";

export async function processDocument( req
Expected output: pnpm typecheck exits 0.
The pipeline follows a strict seven-step flow: (1) check the broker’s budget against a 5-cent estimated cost, (2) extract text from the PDF, (3) fall back to OCR rendering if the PDF is scanned, (4) send extracted text to Cohere, (5) repair Cohere’s output using repairOutput from @reaatech/structured-repair-core (which handles markdown fences, JSON syntax errors, type coercion, and fuzzy key matching), (6) record spend and telemetry, and (7) return a standardized ExtractionResult. Every step is wrapped so that any failure produces a structured result rather than throwing an uncaught exception.
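Step 5 is the least obvious part of the flow, so here is a deliberately simplified sketch of what a repair pass does: try a plain parse, then strip a markdown fence and trailing commas and try again. It is a stand-in for the idea only, not the implementation or API of repairOutput, and it skips type coercion and fuzzy key matching. All function names are assumptions.

```typescript
// Build the fence marker at runtime so this sample contains no literal fence.
const FENCE = "`".repeat(3);
const FENCE_RE = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Return the body of the first fenced block, or the input if there is none.
function stripMarkdownFence(raw: string): string {
  const match = raw.match(FENCE_RE);
  return (match?.[1] ?? raw).trim();
}

// Remove commas that directly precede a closing brace or bracket.
// Naive on purpose: it would also alter a ",}" sequence inside a string value.
function dropTrailingCommas(json: string): string {
  return json.replace(/,\s*([}\]])/g, "$1");
}

type RepairResult =
  | { ok: true; value: unknown; repaired: boolean }
  | { ok: false; error: string };

function parseWithRepair(raw: string): RepairResult {
  try {
    // Well-formed output needs no repair.
    return { ok: true, value: JSON.parse(raw), repaired: false };
  } catch {
    // Fall through to the repair attempt.
  }
  const candidate = dropTrailingCommas(stripMarkdownFence(raw));
  try {
    return { ok: true, value: JSON.parse(candidate), repaired: true };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// A typical model reply: JSON wrapped in a fence, with a trailing comma.
const reply = FENCE + 'json\n{"policy_number": "PN-1",}\n' + FENCE;
console.log(parseWithRepair(reply));
```

Returning a tagged result instead of throwing matches the pipeline's rule that every failure becomes a structured outcome, and the repaired flag can feed a field such as repairAttempted.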
Step 9: Create the API route handlers
Create the three Next.js App Router route handlers. Start with the document upload endpoint at app/api/documents/process/route.ts:
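The handlers use NextRequest and NextResponse from next/server. Route Handlers also accept the standard Web Request and Response types; the Next.js variants extend them with conveniences such as request.nextUrl, cookie helpers, and NextResponse.json(), which sets the JSON Content-Type header for you. The dynamic route segments use the Next.js 15+ Promise<{ brokerId }> pattern, so params must be awaited before its fields are read.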
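The Promise-based params pattern is easiest to see in a small handler. The sketch below uses the standard Web Request and Response types so that it runs outside Next.js; in the recipe's route files you would substitute NextRequest and NextResponse and export the function. The budgets map is a hypothetical stand-in for the budget service, with values borrowed from the budget route test.

```typescript
// Context shape for a dynamic segment in Next.js 15+: params is a Promise.
type RouteContext = { params: Promise<{ brokerId: string }> };

type BudgetStatus = { spent: number; remaining: number; state: string };

// Hypothetical in-memory lookup standing in for the budget service.
const budgets = new Map<string, BudgetStatus>([
  ["b1", { spent: 25, remaining: 75, state: "Active" }],
]);

// In app/api/brokers/[brokerId]/budget/route.ts this would be `export async function GET`.
async function GET(_req: Request, context: RouteContext): Promise<Response> {
  // Await params before reading any segment value.
  const { brokerId } = await context.params;

  // Unknown brokers get zeroed fallback values rather than an error.
  const status = budgets.get(brokerId) ?? { spent: 0, remaining: 0, state: "unknown" };

  return new Response(JSON.stringify(status), {
    status: 200,
    headers: { "Content-Type": "application/json" },
  });
}

// Call the handler directly, the same way the route tests do.
GET(new Request("http://localhost/api/brokers/b1/budget"), {
  params: Promise.resolve({ brokerId: "b1" }),
}).then(async (res) => console.log(res.status, await res.json()));
```

Because the handler is an ordinary async function, tests can invoke it with a hand-built context object and never need a running server.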
Step 10: Define the barrel export
Replace the scaffolded src/index.ts with a barrel re-export:
ts
export { config } from "./lib/config.js";export { processDocument } from "./lib/pipeline.js";export * from "./lib/types.js";
Run the type checker:
terminal
pnpm typecheck
Step 11: Write the tests
Create test files that cover happy paths, error handling, and edge cases. Here’s the Zod schema test at tests/lib/types.test.ts:
Here’s the pipeline test at tests/lib/pipeline.test.ts that mocks all external services (representative subset — the full test suite covers 15+ edge cases including OCR fallback failure, empty Cohere responses, malformed JSON repair, and partial data scenarios):
Create the API route handler test at tests/api/documents/process.test.ts (representative subset — the full test suite additionally covers missing brokerId validation, budget-exceeded 402 responses, and non-Error rejection handling):
ts
import { describe, it, expect, vi, beforeEach } from "vitest";
import { NextRequest } from "next/server";

interface ProcessResponse {
  status?: string;
  error?: string;
  policyData?: Record<string, unknown>;
}

// Hoisted so the mock exists before vi.mock() is evaluated.
const mockProcessDocument = vi.hoisted(() => vi.fn());

vi.mock("../../../src/lib/pipeline.js", () => ({
  processDocument: mockProcessDocument,
}));

// Build a multipart POST request from plain key/value pairs.
function createMockRequest(formData: Record<string, string | Blob>): NextRequest {
  const fd = new FormData();
  for (const [key, value] of Object.entries(formData)) {
    if (value instanceof Blob) {
      fd.append(key, value, (value as File).name || "file.pdf");
    } else {
      fd.append(key, value);
    }
  }
  return new NextRequest("http://localhost/api/documents/process", {
    method: "POST",
    body: fd,
  });
}

describe("POST /api/documents/process", () => {
  beforeEach(() => {
    vi.clearAllMocks();
  });

  it("returns 200 with success result for valid upload", async () => {
    const { POST } = await import("../../../app/api/documents/process/route.js");
    mockProcessDocument.mockResolvedValue({
      documentId: "doc-1",
      brokerId: "broker-1",
      policyData: null,
      confidence: 0,
      costUsd: 0,
      extractedAt: new Date().toISOString(),
      status: "success",
      repairAttempted: false,
    });
    const file = new File(["fake pdf content"], "policy.pdf", { type: "application/pdf" });
    const req = createMockRequest({ file, brokerId: "broker-1" });
    const res = await POST(req);
    expect(res.status).toBe(200);
    const body = (await res.json()) as ProcessResponse;
    expect(body.status).toBe("success");
  });

  it("returns 400 when file is missing", async () => {
    const { POST } = await import("../../../app/api/documents/process/route.js");
    const req = createMockRequest({ brokerId: "broker-1" });
    const res = await POST(req);
    expect(res.status).toBe(400);
    const body = (await res.json()) as ProcessResponse;
    expect(body.error).toBe("Bad request");
  });

  it("returns 500 when pipeline throws unknown error", async () => {
    const { POST } = await import("../../../app/api/documents/process/route.js");
    mockProcessDocument.mockRejectedValue(new Error("Something went wrong"));
    const file = new File(["content"], "policy.pdf", { type: "application/pdf" });
    const req = createMockRequest({ file, brokerId: "broker-1" });
    const res = await POST(req);
    expect(res.status).toBe(500);
    const body = (await res.json()) as ProcessResponse;
    expect(body.error).toBe("Extraction failed");
  });
});
Create the budget route test at tests/api/brokers/budget.test.ts:
ts
import { describe, it, expect, vi } from "vitest";
import { NextRequest } from "next/server";

interface BudgetStatusResponse {
  spent: number;
  remaining: number;
  state: string;
}

const mockGetBrokerState = vi.hoisted(() => vi.fn());

vi.mock("../../../src/services/budget-service.js", () => ({
  getBrokerState: mockGetBrokerState,
}));

describe("GET /api/brokers/[brokerId]/budget", () => {
  it("returns broker budget state", async () => {
    const { GET } = await import("../../../app/api/brokers/[brokerId]/budget/route.js");
    mockGetBrokerState.mockReturnValue({
      spent: 25.0,
      remaining: 75.0,
      state: "Active",
    });
    const req = new NextRequest("http://localhost/api/brokers/b1/budget");
    const res = await GET(req, { params: Promise.resolve({ brokerId: "b1" }) });
    expect(res.status).toBe(200);
    const body = (await res.json()) as BudgetStatusResponse;
    expect(body).toHaveProperty("spent");
    expect(body).toHaveProperty("remaining");
    expect(body).toHaveProperty("state");
  });

  it("returns fallback zero values for unknown broker", async () => {
    const { GET } = await import("../../../app/api/brokers/[brokerId]/budget/route.js");
    mockGetBrokerState.mockReturnValue({
      spent: 0,
      remaining: 0,
      state: "unknown",
    });
    const req = new NextRequest("http://localhost/api/brokers/unknown/budget");
    const res = await GET(req, { params: Promise.resolve({ brokerId: "unknown" }) });
    expect(res.status).toBe(200);
    const body = (await res.json()) as BudgetStatusResponse;
    expect(body.spent).toBe(0);
    expect(body.remaining).toBe(0);
    expect(body.state).toBe("unknown");
  });
});
Create the usage route test at tests/api/brokers/usage.test.ts:
ts
import { describe, it, expect, vi } from "vitest";
import { NextRequest } from "next/server";

interface TenantCosts {
  totalUsd: number;
  byProvider: Record<string, number>;
  byFeature: Record<string, number>;
}

interface UsageResponse {
  brokerId: string;
  usage: TenantCosts;
}

const mockGetTenantCosts = vi.hoisted(() => vi.fn());

vi.mock("../../../src/services/cost-telemetry.js", () => ({
  getTenantCosts: mockGetTenantCosts,
}));

describe("GET /api/brokers/[brokerId]/usage", () => {
  it("returns broker usage with tenant costs", async () => {
    const { GET } = await import("../../../app/api/brokers/[brokerId]/usage/route.js");
    mockGetTenantCosts.mockReturnValue({
      totalUsd: 42.5,
      byProvider: { cohere: 42.5 },
      byFeature: { "policy-extraction": 42.5 },
    });
    const req = new NextRequest("http://localhost/api/brokers/b1/usage");
    const res = await GET(req, { params: Promise.resolve({ brokerId: "b1" }) });
    expect(res.status).toBe(200);
    const body = (await res.json()) as UsageResponse;
    expect(body.brokerId).toBe("b1");
    expect(body.usage.totalUsd).toBe(42.5);
  });

  it("returns zero usage for unknown broker", async () => {
    const { GET } = await import("../../../app/api/brokers/[brokerId]/usage/route.js");
    mockGetTenantCosts.mockReturnValue({
      totalUsd: 0,
      byProvider: {},
      byFeature: {},
    });
    const req = new NextRequest("http://localhost/api/brokers/unknown/usage");
    const res = await GET(req, { params: Promise.resolve({ brokerId: "unknown" }) });
    expect(res.status).toBe(200);
    const body = (await res.json()) as UsageResponse;
    expect(body.usage.totalUsd).toBe(0);
  });
});
Run the full test suite:
terminal
pnpm test
Expected output: numFailedTests: 0, numTotalTests >= 3, and coverage metrics all >= 90%.
Step 12: Run the quality gate
Run the preflight validator to catch any remaining issues:
terminal
pnpm typecheck
pnpm lint
pnpm test
Expected output: each of the three commands exits 0. If pnpm lint flags issues, fix them before proceeding.
Next steps
Add provider registration: Wire up an actual Google Document AI or Anthropic provider to the documentExtractionOps for real OCR instead of the in-memory adapter
Frontend upload form: Build a Next.js client page with a file upload <form> that posts to /api/documents/process and displays the extracted ExtractionResult
Webhook notifications: After a successful extraction, POST the result to a broker-configured webhook endpoint so their agency management system receives the data automatically