Property managers and small legal teams spend hours manually pulling critical dates, rent amounts, and clauses from lease documents, risking costly oversights.
This recipe builds a document processing pipeline that extracts key lease terms from PDF and DOCX files using Claude (Anthropic’s API). The pipeline parses uploaded documents, generates embeddings with VoyageAI, retrieves similar past clauses from a ChromaDB vector store, sends an augmented prompt to Claude for structured extraction, validates the output against a Zod schema, and routes results based on confidence scoring. A budget engine caps API spend per extraction, and circuit breakers prevent cascading failures during document parsing. If you’re a property manager, legal tech builder, or anyone who needs to pull structured data from semi-structured lease documents, this is a solid foundation you can deploy today.
Prerequisites
Node.js 22+ and pnpm 10+
An Anthropic API key (ANTHROPIC_API_KEY) with access to claude-sonnet-4-6
A VoyageAI API key (VOYAGE_API_KEY) for generating text embeddings
A ChromaDB instance running locally (CHROMA_URL=http://localhost:8000) — or any Chroma-compatible endpoint
A Langfuse account (optional) for LLM observability; if you skip it, the pipeline runs in noop mode
Familiarity with Next.js App Router patterns, TypeScript, and the concept of vector embeddings
Step 1: Scaffold the project and configure environment variables
Start from an empty directory. The scaffold agent has already placed the Next.js 16 App Router shell and installed dependencies via pnpm install. Your package.json pins every dependency to an exact version:
Expected output: You see a node_modules/ directory and pnpm-lock.yaml already present. Next, copy .env.example to .env and fill in your API keys. The example file contains every variable the pipeline reads:
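A startup check for the variables named in the prerequisites could look like the following minimal sketch. It is not part of the recipe's code: `findMissingEnv` and `REQUIRED_ENV` are hypothetical names, and only the three variables the prerequisites list as required are checked (the optional Langfuse variables are left out).

```typescript
// Hypothetical helper. Variable names come from the prerequisites above.
const REQUIRED_ENV = ["ANTHROPIC_API_KEY", "VOYAGE_API_KEY", "CHROMA_URL"] as const;

type EnvSource = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function findMissingEnv(env: EnvSource): string[] {
  return REQUIRED_ENV.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}

// Usage: report every missing variable at once instead of failing one at a time.
const missingVars = findMissingEnv({
  ANTHROPIC_API_KEY: "sk-example",
  VOYAGE_API_KEY: "",
  CHROMA_URL: "http://localhost:8000",
});
console.log(missingVars); // ["VOYAGE_API_KEY"]
```

In a real project you would pass `process.env` instead of the object literal and throw if the returned list is non-empty.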
Step 2: Define the lease types and Zod schemas
Create src/types/lease.ts. This file defines the core domain types (lease clauses, extraction records, parsed documents) and their Zod validation schemas. Every step of the pipeline depends on these types:
Expected output: The LeaseClauseSchema uses .strict(), so any extra fields on incoming JSON cause validation to fail instead of being silently stripped (stripping is Zod's default behavior for non-strict objects). The LeaseExtractionSchema validates nested clauses, nullable dates, and the five-state status enum.
Step 3: Build the document parsers (PDF and DOCX)
The pipeline ingests two file formats. Each gets its own dedicated parser under src/lib/.
First, src/lib/pdf-parser.ts uses pdfjs-dist to extract text page by page:
ts
import * as pdfjsLib from "pdfjs-dist";
import { DocumentParseError } from "../types/lease.js";

export async function parsePdf(
  buffer: Buffer,
): Promise<{ text: string; pageCount: number }> {
  // pdfjs-dist expects a plain Uint8Array and rejects Node Buffers,
  // so copy the bytes into a new Uint8Array before loading.
  const loadingTask = pdfjsLib.getDocument({ data: new Uint8Array(buffer) });

  let doc: pdfjsLib.PDFDocumentProxy;
  try {
    doc = await loadingTask.promise;
  } catch (error) {
    const detail = error instanceof Error ? error.message : String(error);
    throw new DocumentParseError(
      `Failed to parse PDF: ${detail}`,
      "unknown.pdf",
      "pdf",
      detail,
    );
  }

  const pageCount = doc.numPages;
  if (pageCount === 0) {
    return { text: "", pageCount: 0 };
  }

  // Collect the text items of each page (pages are 1-indexed).
  const textParts: string[] = [];
  for (let i = 1; i <= pageCount; i++) {
    const page = await doc.getPage(i);
    const content = await page.getTextContent();
    const pageText = content.items
      .map((item) => ("str" in item ? (item as { str: string }).str : ""))
      .join(" ");
    textParts.push(pageText);
  }

  const text = textParts.join(" ");
  return { text, pageCount };
}
Next, src/lib/docx-parser.ts uses mammoth (note: mammoth's extractRawText takes an options object; here it receives { buffer } rather than { path }):
Expected output: A PDF returns { text, pageCount, fileType: "pdf" } with one string of all page text concatenated. A DOCX returns fileType: "docx" and pageCount: 0. Any other extension throws a DocumentParseError.
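The dispatch described above can be sketched without either parsing library. This is an illustration under stated assumptions, not the recipe's document-parser.ts: `detectFileType`, `parseByExtension`, the `Parsers` interface, and `UnsupportedFileTypeError` (a stand-in for the recipe's DocumentParseError) are all hypothetical names, and the real parsers are injected as functions.

```typescript
type SupportedFileType = "pdf" | "docx";

interface ParsedDocument {
  text: string;
  pageCount: number;
  fileType: SupportedFileType;
}

// Hypothetical stand-in for the recipe's DocumentParseError.
class UnsupportedFileTypeError extends Error {
  readonly fileName: string;
  constructor(fileName: string) {
    super(`Unsupported file type: ${fileName}`);
    this.name = "UnsupportedFileTypeError";
    this.fileName = fileName;
  }
}

// Maps a file name to a supported type, case-insensitively.
function detectFileType(fileName: string): SupportedFileType {
  const lower = fileName.toLowerCase();
  if (lower.endsWith(".pdf")) return "pdf";
  if (lower.endsWith(".docx")) return "docx";
  throw new UnsupportedFileTypeError(fileName);
}

// Parsers are injected so the sketch does not depend on pdfjs-dist or mammoth.
interface Parsers {
  pdf: (data: Uint8Array) => Promise<{ text: string; pageCount: number }>;
  docx: (data: Uint8Array) => Promise<{ text: string }>;
}

async function parseByExtension(
  fileName: string,
  data: Uint8Array,
  parsers: Parsers,
): Promise<ParsedDocument> {
  const fileType = detectFileType(fileName);
  if (fileType === "pdf") {
    const { text, pageCount } = await parsers.pdf(data);
    return { text, pageCount, fileType };
  }
  // DOCX has no fixed pagination, so pageCount is reported as 0.
  const { text } = await parsers.docx(data);
  return { text, pageCount: 0, fileType };
}
```

Injecting the parsers also makes the dispatcher easy to unit test with stubs.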
Step 4: Create the embedding service and vector store
The pipeline generates vector embeddings for each document using VoyageAI and stores/retrieves them via ChromaDB.
src/lib/embedding-service.ts wraps the voyageai SDK:
ts
import { VoyageAIClient, VoyageAIError } from "voyageai";

export class EmbeddingError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "EmbeddingError";
  }
}

export class EmbeddingService {
  private client: VoyageAIClient;

  constructor() {
    const apiKey = process.env.VOYAGE_API_KEY;
    if (!apiKey) {
      throw new Error("VOYAGE_API_KEY environment variable is required");
    }
    this.client = new VoyageAIClient({ apiKey });
  }

  // Embeds a single text and returns its vector.
  async generateEmbedding(text: string): Promise<number[]> {
    if (text.length === 0) {
      throw new EmbeddingError("Input text must not be empty");
    }
    try {
      const response = await this.client.embed({ input: text, model: "voyage-3" });
      const data = response.data;
      if (!data || data.length === 0) {
        throw new EmbeddingError("No embedding returned from API");
      }
      const firstItem = data[0];
      if (!firstItem) {
        throw new EmbeddingError("No embedding returned from API");
      }
      const embedding = firstItem.embedding;
      if (!embedding) {
        throw new EmbeddingError("Embedding data is missing");
      }
      return embedding;
    } catch (error) {
      if (error instanceof EmbeddingError) {
        throw error;
      }
      if (error instanceof VoyageAIError) {
        throw new EmbeddingError(`VoyageAI API error: ${error.message}`);
      }
      throw new EmbeddingError(
        `Unexpected error generating embedding: ${error instanceof Error ? error.message : String(error)}`,
      );
    }
  }

  // Embeds several texts in one API call, preserving input order.
  async generateEmbeddings(texts: string[]): Promise<number[][]> {
    if (texts.length === 0) {
      return [];
    }
    try {
      const response = await this.client.embed({ input: texts, model: "voyage-3" });
      const data = response.data;
      if (!data) {
        throw new EmbeddingError("No embeddings returned from API");
      }
      return data.map((d) => {
        const embedding = d.embedding;
        if (!embedding) {
          throw new EmbeddingError("Embedding data is missing for an item");
        }
        return embedding;
      });
    } catch (error) {
      if (error instanceof EmbeddingError) {
        throw error;
      }
      if (error instanceof VoyageAIError) {
        throw new EmbeddingError(`VoyageAI API error: ${error.message}`);
      }
      throw new EmbeddingError(
        `Unexpected error generating embeddings: ${error instanceof Error ? error.message : String(error)}`,
      );
    }
  }
}
src/lib/vector-store.ts manages ChromaDB collections and maps query results into @reaatech/hybrid-rag’s RetrievalResult shape:
Expected output: The ensureCollection method tries createCollection first and falls back to getCollection if the collection already exists, making initialization idempotent.
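The create-then-get fallback can be illustrated with a small stand-in client. This is a sketch under assumptions: `CollectionClient` and `FakeClient` are hypothetical and far simpler than the real ChromaDB client, and the recipe's own ensureCollection may inspect the error before falling back.

```typescript
// Hypothetical minimal interface; the real ChromaDB client has a richer API.
interface CollectionHandle {
  name: string;
}

interface CollectionClient {
  createCollection(args: { name: string }): Promise<CollectionHandle>;
  getCollection(args: { name: string }): Promise<CollectionHandle>;
}

// Try to create the collection; if that fails, assume it exists and fetch it.
async function ensureCollection(
  client: CollectionClient,
  name: string,
): Promise<CollectionHandle> {
  try {
    return await client.createCollection({ name });
  } catch {
    // getCollection still rejects if the failure had another cause.
    return client.getCollection({ name });
  }
}

// In-memory fake used only to demonstrate that repeated calls are safe.
class FakeClient implements CollectionClient {
  private readonly names = new Set<string>();
  createCalls = 0;

  async createCollection({ name }: { name: string }): Promise<CollectionHandle> {
    this.createCalls += 1;
    if (this.names.has(name)) {
      throw new Error(`Collection ${name} already exists`);
    }
    this.names.add(name);
    return { name };
  }

  async getCollection({ name }: { name: string }): Promise<CollectionHandle> {
    if (!this.names.has(name)) {
      throw new Error(`Collection ${name} not found`);
    }
    return { name };
  }
}
```

Calling `ensureCollection` twice with the same name returns a handle both times. The chromadb JavaScript client also provides `getOrCreateCollection`, which may be a simpler way to get the same idempotent behavior.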
Step 5: Build the clause retriever
The ClauseRetriever in src/lib/clause-retriever.ts connects embedding generation with vector search and formats results for the Claude prompt:
ts
import type { RetrievalResult } from "@reaatech/hybrid-rag";
import { EmbeddingService } from "./embedding-service.js";
import { VectorStore } from "./vector-store.js";

export class ClauseRetriever {
  constructor(
    private readonly embeddingService: EmbeddingService,
    private readonly vectorStore: VectorStore,
  ) {}

  // Embeds the query and returns up to topK similar stored clauses.
  async retrieveRelevantClauses(
    queryText: string,
    collectionName: string,
    topK: number = 5,
  ): Promise<RetrievalResult[]> {
    if (queryText.length === 0) {
      return [];
    }
    const embedding = await this.embeddingService.generateEmbedding(queryText);
    const results = await this.vectorStore.searchSimilarClauses(
      collectionName,
      embedding,
      topK,
    );
    if (results.length === 0) {
      console.warn(
        `[clause-retriever] No similar clauses found for query in collection "${collectionName}"`,
      );
      return [];
    }
    return results;
  }

  // Formats retrieved clauses as a Markdown bullet list for the prompt.
  buildContextString(results: RetrievalResult[]): string {
    const items = results.map((r) => `- ${r.content}`).join("\n");
    return `## Similar clauses from past leases:\n${items}`;
  }
}
Expected output: Given a query string, the retriever generates a VoyageAI embedding, searches the ChromaDB collection, and returns matching clauses. If no matches are found, it logs a warning and returns an empty array. buildContextString formats results as a Markdown bullet list for injection into Claude’s system prompt.
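To make "searching for similar clauses" concrete, here is a dependency-free sketch of ranking stored embeddings by cosine similarity. It is illustrative only: in the recipe ChromaDB performs the search, its distance metric depends on how the collection is configured, and `StoredClause`, `cosineSimilarity`, and `topKSimilar` are hypothetical names.

```typescript
interface StoredClause {
  content: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Scores every stored clause against the query and keeps the best topK.
function topKSimilar(
  query: number[],
  clauses: StoredClause[],
  topK: number = 5,
): StoredClause[] {
  return clauses
    .map((clause) => ({ clause, score: cosineSimilarity(query, clause.embedding) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, topK)
    .map((entry) => entry.clause);
}
```

A brute-force scan like this is fine for a handful of clauses; a vector store exists so that the same query stays fast over many documents.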
Step 6: Create the Anthropic client
src/lib/anthropic-client.ts wraps the @anthropic-ai/sdk Messages API with error handling for auth (401), rate limits (429), and server errors (500+):
ts
import Anthropic from "@anthropic-ai/sdk";
import type { Message } from "@anthropic-ai/sdk/resources/messages/messages.js";

export class AnthropicApiError extends Error {
  constructor(
    message: string,
    public readonly statusCode: number,
    public readonly originalError: unknown,
  ) {
    super(message);
    this.name = "AnthropicApiError";
  }
}

// Character limit (not tokens) applied before sending the document.
const MAX_DOCUMENT_LENGTH = 180_000;

export class AnthropicClient {
  private readonly client: Anthropic;

  constructor() {
    const apiKey = process.env.ANTHROPIC_API_KEY;
    if (!apiKey) {
      throw new AnthropicApiError(
        "ANTHROPIC_API_KEY environment variable is not set",
        0,
        null,
      );
    }
    this.client = new Anthropic({ apiKey });
  }

  async extractLeaseClauses(
    documentText: string,
    systemPrompt: string,
  ): Promise<Message> {
    if (documentText.length === 0) {
      throw new AnthropicApiError("documentText cannot be empty", 0, null);
    }
    const text =
      documentText.length > MAX_DOCUMENT_LENGTH
        ? documentText.slice(0, MAX_DOCUMENT_LENGTH)
        : documentText;
    try {
      const message = await this.client.messages.create({
        model: "claude-sonnet-4-6",
        max_tokens: 4096,
        system: systemPrompt,
        messages: [{ role: "user", content: text }],
      });
      // Any other stop reason (for example max_tokens) may mean truncated output.
      if (message.stop_reason !== "end_turn" && message.stop_reason !== "stop_sequence") {
        console.warn(
          `[anthropic-client] Message stopped with reason: ${String(message.stop_reason)}`,
        );
      }
      return message;
    } catch (error) {
      if (error instanceof AnthropicApiError) {
        throw error;
      }
      // Map the HTTP status (when present) to a typed error.
      const apiError = error as Record<string, unknown>;
      const statusCode = typeof apiError.status === "number" ? apiError.status : 0;
      if (statusCode === 401) {
        throw new AnthropicApiError(
          "Authentication failed: Invalid API key",
          statusCode,
          error,
        );
      }
      if (statusCode === 429) {
        throw new AnthropicApiError("Rate limit exceeded", statusCode, error);
      }
      if (statusCode >= 500) {
        throw new AnthropicApiError(
          "Anthropic API server error: " + String(statusCode),
          statusCode,
          error,
        );
      }
      const message = error instanceof Error ? error.message : "unknown";
      throw new AnthropicApiError(
        "Anthropic API error: " + message,
        statusCode,
        error,
      );
    }
  }
}
Expected output: Documents longer than 180,000 characters are silently truncated to that limit. The client throws typed AnthropicApiError with the HTTP status code for every failure mode.
Step 7: Add schema repair for resilient JSON
Lease documents vary in structure and quality. Claude may return valid JSON that doesn’t match the Zod schema, or it may return broken JSON. src/lib/schema-repair.ts handles both cases — it first attempts a direct Zod parse, and if that fails, it sends a correction prompt to Claude with the validation errors:
ts
import {
  LeaseExtractionSchema,
  type LeaseExtraction,
} from "../types/lease.js";
import { AnthropicClient } from "./anthropic-client.js";

export interface RepairError {
  field: string;
  message: string;
}

function dateReviver(_key: string, value: unknown): unknown {
  if (typeof value === "string" && /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}
Expected output: If the JSON is valid and matches the schema, the first parse succeeds and no Claude re-prompt is made. If the JSON is malformed, the schema repair sends a targeted correction request with the exact validation errors listed.
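A correction request that lists the exact validation errors might be assembled as in the sketch below. The `RepairError` shape matches the interface in schema-repair.ts; the prompt wording and the helper names `buildRepairPrompt` and `collectSyntaxError` are hypothetical, since the recipe's actual correction prompt is not shown here.

```typescript
interface RepairError {
  field: string;
  message: string;
}

// Builds a correction prompt that lists each validation error on its own line.
// The wording is an assumption, not the recipe's prompt.
function buildRepairPrompt(rawJson: string, errors: RepairError[]): string {
  const errorLines = errors.map((e) => `- ${e.field}: ${e.message}`).join("\n");
  return [
    "The JSON below failed schema validation.",
    "Fix only the listed problems and return the corrected JSON with no commentary.",
    "",
    "Validation errors:",
    errorLines,
    "",
    "Original JSON:",
    rawJson,
  ].join("\n");
}

// Converts a JSON syntax failure into the same RepairError shape,
// so malformed JSON and schema mismatches share one repair path.
function collectSyntaxError(rawJson: string): RepairError | null {
  try {
    JSON.parse(rawJson);
    return null;
  } catch (error) {
    return {
      field: "(root)",
      message: error instanceof Error ? error.message : String(error),
    };
  }
}
```

Listing field-level errors keeps the follow-up request narrow: the model is asked to fix named fields rather than redo the whole extraction.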
Step 8: Implement confidence routing
Extractions above the confidence threshold are auto-approved; below-threshold ones are flagged for human review. src/lib/confidence-router.ts wraps @reaatech/confidence-router:
Expected output: The router sums all clause confidences, divides by count to get the average, then asks the ConfidenceRouter to decide. ROUTE means auto-approve, CLARIFY means ask for more input, FALLBACK means flag for manual review.
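The averaging and three-way decision can be sketched without the routing library. The thresholds (0.8 and 0.5) and the names `averageConfidence` and `decideRoute` are assumptions for illustration; the recipe delegates the actual decision to @reaatech/confidence-router, whose configuration is not shown here.

```typescript
type RoutingDecision = "ROUTE" | "CLARIFY" | "FALLBACK";

interface ScoredClause {
  confidence: number; // expected range: 0 to 1
}

// Hypothetical thresholds; tune them against reviewed extractions.
const ROUTE_THRESHOLD = 0.8;
const FALLBACK_THRESHOLD = 0.5;

// Mean confidence across clauses; an empty extraction scores 0.
function averageConfidence(clauses: ScoredClause[]): number {
  if (clauses.length === 0) return 0;
  const total = clauses.reduce((sum, clause) => sum + clause.confidence, 0);
  return total / clauses.length;
}

// ROUTE = auto-approve, CLARIFY = ask for more input, FALLBACK = manual review.
function decideRoute(clauses: ScoredClause[]): RoutingDecision {
  const average = averageConfidence(clauses);
  if (average >= ROUTE_THRESHOLD) return "ROUTE";
  if (average >= FALLBACK_THRESHOLD) return "CLARIFY";
  return "FALLBACK";
}
```

Treating an empty clause list as confidence 0 avoids a division by zero and sends documents with nothing extracted to manual review.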
Step 9: Add budget control and circuit breakers
The ExtractionBudgetController in src/lib/budget-controller.ts wraps @reaatech/agent-budget-engine to limit API spend per document:
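The idea of a per-document spend cap can be illustrated without the library. This sketch does not use the @reaatech/agent-budget-engine API; `BudgetState`, `checkBudget`, and `recordSpend` are hypothetical names, and costs are plain USD numbers.

```typescript
interface BudgetState {
  capUsd: number; // maximum spend allowed for this document
  spentUsd: number; // spend recorded so far
}

type BudgetVerdict =
  | { allowed: true; remainingUsd: number }
  | { allowed: false; reason: string };

// Checks whether an estimated call fits under the cap before it is made.
function checkBudget(state: BudgetState, estimatedCostUsd: number): BudgetVerdict {
  if (estimatedCostUsd < 0) {
    throw new Error("Estimated cost must not be negative");
  }
  const projected = state.spentUsd + estimatedCostUsd;
  if (projected > state.capUsd) {
    return {
      allowed: false,
      reason: `Projected spend ${projected} exceeds cap ${state.capUsd}`,
    };
  }
  return { allowed: true, remainingUsd: state.capUsd - projected };
}

// Returns a new state with the actual cost added; the input is not mutated.
function recordSpend(state: BudgetState, actualCostUsd: number): BudgetState {
  return { ...state, spentUsd: state.spentUsd + actualCostUsd };
}
```

Checking before the call and recording after it keeps estimates and actual spend separate, so an inaccurate estimate cannot hide real overspend.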
The circuit breakers in src/lib/circuit-breaker.ts prevent cascading failures. Three breakers guard distinct resources — PDF parsing, the Claude API, and ChromaDB:
ts
import {
  CircuitBreaker,
  InMemoryAdapter,
  CircuitOpenError,
} from "@reaatech/circuit-breaker-agents";

export const pdfBreaker = new CircuitBreaker({
  name: "pdf-parser",
  failureThreshold: 3,
  recoveryTimeoutMs: 60000,
  persistence: new InMemoryAdapter(),
});

export const claudeBreaker = new CircuitBreaker({
  name: "claude-api",
  failureThreshold: 5,
  recoveryTimeoutMs: 30000,
  persistence: new InMemoryAdapter(),
});

export const chromaBreaker = new CircuitBreaker({
  name: "chroma-db",
  failureThreshold: 3,
  recoveryTimeoutMs: 15000,
  persistence: new InMemoryAdapter(),
});

// Runs fn through the breaker and logs when the circuit is open.
// Every error is rethrown so callers can decide how to recover.
export async function withCircuitBreaker<T>(
  breaker: CircuitBreaker,
  fn: () => Promise<T>,
): Promise<T> {
  try {
    return await breaker.execute(fn);
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      console.warn(
        "[circuit-open] circuit=" +
          error.circuitId +
          ", circuit is open, retry after " +
          String(error.timeUntilRetry) +
          "ms",
      );
    }
    throw error;
  }
}
Expected output: After 3 consecutive PDF parsing failures, the PDF breaker opens for 60 seconds before allowing retries. The Claude breaker opens after 5 failures with a 30-second recovery window. The ChromaDB breaker has a 15-second window.
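The open and recover behavior described above follows the usual circuit breaker state machine, sketched below. This is not the @reaatech/circuit-breaker-agents implementation: `SimpleBreaker` is a hypothetical, synchronous simplification with an injectable clock so the timing can be tested.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class SimpleBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly recoveryTimeoutMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    recoveryTimeoutMs: number,
    now: () => number = Date.now,
  ) {
    this.failureThreshold = failureThreshold;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
    this.now = now;
  }

  // Open until the recovery timeout elapses, then half-open (one trial call).
  get state(): BreakerState {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.recoveryTimeoutMs ? "half-open" : "open";
  }

  // Runs fn unless the circuit is open; counts consecutive failures.
  run<T>(fn: () => T): T {
    const wasHalfOpen = this.state === "half-open";
    if (this.state === "open") {
      throw new Error("Circuit is open");
    }
    try {
      const result = fn();
      // Any success closes the circuit and resets the failure count.
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (error) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, (re)opens the circuit.
      if (wasHalfOpen || this.failures >= this.failureThreshold) {
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}

// Usage: mirrors the PDF breaker settings (3 failures, 60-second recovery).
const demoBreaker = new SimpleBreaker(3, 60000);
console.log(demoBreaker.state); // "closed"
```

While open, calls fail immediately without touching the protected resource, which is what stops one failing dependency from tying up the rest of the pipeline.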
Step 10: Add a pricing provider and observability
The pricing provider in src/lib/pricing-provider.ts estimates Claude API costs per model. It uses overloaded method signatures so callers can pass a model string or an object with token counts:
Expected output: If Langfuse env vars are missing, all observability functions return noop objects instead of throwing. The pricing provider uses per-model rates from a lookup table and caps estimation at 1 million tokens.
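Overloaded signatures over a rate lookup table might look like the sketch below. Everything specific here is an assumption: the rates are placeholder numbers rather than real prices, "example-small-model" is a made-up model name, the overload shapes are one possible reading of the description, and the 1 million token cap is applied separately to input and output.

```typescript
interface TokenUsage {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Placeholder rates in USD per million tokens. Illustrative only, not real prices.
const RATES_PER_MILLION: Record<string, { input: number; output: number }> = {
  "claude-sonnet-4-6": { input: 2, output: 8 },
  "example-small-model": { input: 1, output: 4 },
};

// Token counts above this are clamped before estimating (assumed per direction).
const MAX_ESTIMATED_TOKENS = 1_000_000;

// Overloads: positional arguments, or a single usage object.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number;
function estimateCostUsd(usage: TokenUsage): number;
function estimateCostUsd(
  modelOrUsage: string | TokenUsage,
  inputTokens: number = 0,
  outputTokens: number = 0,
): number {
  const usage: TokenUsage =
    typeof modelOrUsage === "string"
      ? { model: modelOrUsage, inputTokens, outputTokens }
      : modelOrUsage;
  const rate = RATES_PER_MILLION[usage.model];
  if (!rate) {
    throw new Error(`No pricing entry for model "${usage.model}"`);
  }
  const cappedInput = Math.min(usage.inputTokens, MAX_ESTIMATED_TOKENS);
  const cappedOutput = Math.min(usage.outputTokens, MAX_ESTIMATED_TOKENS);
  return (cappedInput * rate.input + cappedOutput * rate.output) / 1_000_000;
}
```

Both call styles resolve to the same implementation, so callers that already hold a usage object do not need to unpack it.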
Step 11: Wire up the extraction pipeline
src/lib/extraction-pipeline.ts orchestrates every component into a single runPipeline function. This is the heart of the recipe:
ts
import { type ExtractionRequest, type ExtractionResult } from "../types/lease.js";
import { parseDocument } from "./document-parser.js";
import { EmbeddingService } from "./embedding-service.js";
import { VectorStore } from "./vector-store.js";
import { ClauseRetriever } from "./clause-retriever.js";
import { AnthropicClient } from "./anthropic-client.js";
import { SchemaRepair } from "./schema-repair.js";
import { ExtractionRouter } from "./confidence-router.js";
import { ExtractionBudgetController } from "./budget-controller.js";
import { withCircuitBreaker, claudeBreaker, chromaBreaker, pdfBreaker } from "./circuit-breaker.js"
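The src/instrumentation.ts file initializes the budget controller and Langfuse at server startup (Next.js calls register() once when the server starts; the instrumentation hook has been stable since Next.js 15, so the experimental.instrumentationHook flag is no longer needed):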
Expected output: The pipeline runs in order: parse → embed → retrieve → budget check → Claude extract → schema repair → route. If the budget is exceeded, the pipeline returns a FALLBACK result without calling Claude.
Step 12: Create the API routes and UI
Six route handlers wire the pipeline to HTTP.
app/api/documents/route.ts accepts PDF/DOCX file uploads via multipart form data:
ts
import { type NextRequest, NextResponse } from "next/server";
import { runPipeline } from "../../../src/lib/extraction-pipeline.js";
import { type ExtractionRequest } from "../../../src/types/lease.js";

const MAX_FILE_SIZE_BYTES = 20 * 1024 * 1024;

export async function POST(req: NextRequest): Promise<NextResponse> {
  try {
    const formData = await req.formData();
    const file = formData.get("file");
    // formData.get can return a string, so check for an actual File.
    if (!(file instanceof File)) {
      return NextResponse.json({ error: "No file provided" }, { status: 400 });
    }

    const name = file.name.toLowerCase();
    if (!name.endsWith(".pdf") && !name.endsWith(".docx")) {
      return NextResponse.json(
        { error: "Invalid file type. Only PDF and DOCX are supported." },
        { status: 400 },
      );
    }
    if (file.size > MAX_FILE_SIZE_BYTES) {
      return NextResponse.json(
        { error: "File too large. Maximum size is 20MB." },
        { status: 400 },
      );
    }

    // The pipeline receives the file contents as a base64 string.
    const buffer = Buffer.from(await file.arrayBuffer());
    const request: ExtractionRequest = {
      documentId: crypto.randomUUID(),
      content: buffer.toString("base64"),
      fileType: name.endsWith(".pdf") ? "pdf" : "docx",
      fileName: file.name,
    };

    const result = await runPipeline(request);
    return NextResponse.json(result, { status: 201 });
  } catch (err) {
    const message = err instanceof Error ? err.message : "Unknown error";
    return NextResponse.json({ error: message }, { status: 500 });
  }
}
The remaining routes provide listing and detail endpoints:
Expected output: Navigating to http://localhost:3000 shows the landing page. Clicking “Upload Lease” opens the upload form. Submitting a PDF or DOCX triggers the full pipeline and displays the extraction JSON.
Step 13: Run the tests
The test suite covers every module with mocked externals. Run all tests with:
terminal
pnpm test
This runs vitest run --coverage and writes the report. You can also verify types and linting separately:
terminal
pnpm typecheck   # TypeScript type checking
pnpm lint        # ESLint
pnpm test        # Tests + coverage (90% thresholds on lines, branches, functions, statements)
Expected output: All 17 test suites pass with coverage meeting the 90% thresholds configured in vitest.config.ts. The individual suites cover document parsing, embedding, vector store operations, the Anthropic client, schema repair, confidence routing, budget control, circuit breakers, clause retrieval, pricing, observability, instrumentation, all API endpoints, and an end-to-end pipeline integration test.
Next steps
Add a database backend — Replace the in-memory stores for extractions and review tasks with PostgreSQL or SQLite so data persists across server restarts
Extend clause types — Add more LeaseClauseType variants (escalation clauses, security deposits, options to extend) and update the Zod schema and Claude prompt
Batch processing — Use the budget controller’s soft-cap mechanism to process batches of leases in sequence, auto-downgrading from sonnet to haiku as the daily budget runs low
Review UI — Build a full attorney review interface under /reviews that lets users approve, reject, or edit individual clauses from low-confidence extractions
Multi-tenant isolation — Scope the ChromaDB collections and budget controller scope keys per property management firm
const systemPrompt = `You are a lease document analyst. Extract key lease clauses from the document text below. Return a JSON object matching this schema: