Anthropic Document Pipeline for SMB Lease Abstraction

Extract key lease terms from PDFs and DOCX files with Claude, backed by a retrieval-augmented clause library and confidence-based human review.

Tags: anthropic, document-pipeline, lease-abstraction, rag, nextjs, typescript, claude, chromadb, zod, circuit-breaker

The problem

Property managers and small legal teams spend hours manually pulling critical dates, rent amounts, and clauses from lease documents, risking costly oversights.

Intro

This recipe builds a document processing pipeline that extracts key lease terms from PDF and DOCX files using Claude through the Anthropic API. The pipeline parses each uploaded document, generates embeddings with VoyageAI, retrieves similar past clauses from a ChromaDB vector store, sends the retrieval-augmented prompt to Claude for structured extraction, validates the output against a Zod schema, and routes each result by confidence score, with low-confidence extractions going to human review. A budget engine caps API spend per extraction, and circuit breakers keep document-parsing failures from cascading. If you are a property manager, a legal tech builder, or anyone else who needs structured data from semi-structured lease documents, the recipe gives you a working starting point.
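To make the routing idea concrete, here is a minimal sketch of confidence-based routing. The thresholds (0.9 and 0.6), the `ExtractedField` shape, and the `routeExtraction` function are hypothetical and chosen only for illustration; the recipe's own routing logic may differ.

```typescript
// Hypothetical shape of one extracted lease term with a confidence score.
interface ExtractedField {
  name: string;
  value: string;
  confidence: number; // expected range: 0 to 1
}

type Route = "auto_approve" | "human_review" | "reject";

// Assumed thresholds, for illustration only.
const AUTO_APPROVE_MIN = 0.9;
const REVIEW_MIN = 0.6;

// Route on the weakest field so one doubtful term sends the whole
// extraction to a reviewer instead of being averaged away.
function routeExtraction(fields: ExtractedField[]): Route {
  if (fields.length === 0) return "reject";
  const lowest = Math.min(...fields.map((f) => f.confidence));
  if (lowest >= AUTO_APPROVE_MIN) return "auto_approve";
  if (lowest >= REVIEW_MIN) return "human_review";
  return "reject";
}
```

Routing on the minimum rather than the mean is a deliberate choice in this sketch: a lease with a confidently extracted rent amount but an uncertain termination date should still be reviewed.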

Prerequisites

Node.js 22+ and pnpm 10+
An Anthropic API key (ANTHROPIC_API_KEY) with access to claude-sonnet-4-6
A VoyageAI API key (VOYAGE_API_KEY) for generating text embeddings
A ChromaDB instance running locally (CHROMA_URL=http://localhost:8000) — or any Chroma-compatible endpoint
A Langfuse account (optional) for LLM observability; without it, the pipeline runs in no-op mode
Familiarity with Next.js App Router patterns, TypeScript, and the concept of vector embeddings
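The environment variables named in the list above could be checked once at startup so a missing key fails fast. This is a sketch only: the `PipelineEnv` type and `readPipelineEnv` function are hypothetical names, and the default for `CHROMA_URL` simply mirrors the local address given above.

```typescript
// Hypothetical container for the settings the pipeline needs.
interface PipelineEnv {
  anthropicApiKey: string;
  voyageApiKey: string;
  chromaUrl: string;
}

// Validate required variables and apply the local default for ChromaDB.
function readPipelineEnv(env: Record<string, string | undefined>): PipelineEnv {
  const required = ["ANTHROPIC_API_KEY", "VOYAGE_API_KEY"];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    anthropicApiKey: env["ANTHROPIC_API_KEY"] as string,
    voyageApiKey: env["VOYAGE_API_KEY"] as string,
    // An unset or empty CHROMA_URL falls back to the local instance.
    chromaUrl: env["CHROMA_URL"] || "http://localhost:8000",
  };
}
```

In a Node.js process this would typically be called as `readPipelineEnv(process.env)`.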

Step 1: Scaffold the project and configure environment variables

The project starts from an empty directory in which the scaffold agent has already placed the Next.js 16 App Router shell and installed dependencies with pnpm install. The package.json pins every dependency to an exact version.

Example artifact

A complete, working implementation of this recipe, downloadable as a zip or browsable file by file. It is generated by our build pipeline and tested before publishing.

192 kB · 146 tests · 99.7% coverage · vitest passing

SHA-256: 35bf4b4af7ae94cca2ec60d4285601fb22778e46aa076f1be4bf071b695ce35a

import { VoyageAIClient, VoyageAIError } from "voyageai";

export class EmbeddingError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "EmbeddingError";
  }
}

export class EmbeddingService {
  private client: VoyageAIClient;

  constructor() {
    const apiKey = process.env.VOYAGE_API_KEY;
    if (!apiKey) {
      throw new Error("VOYAGE_API_KEY environment variable is required");
    }
    this.client = new VoyageAIClient({ apiKey });
  }

  // Embed a single text; every failure surfaces as an EmbeddingError.
  async generateEmbedding(text: string): Promise<number[]> {
    if (text.length === 0) {
      throw new EmbeddingError("Input text must not be empty");
    }
    try {
      const response = await this.client.embed({ input: text, model: "voyage-3" });
      const data = response.data;
      if (!data || data.length === 0) {
        throw new EmbeddingError("No embedding returned from API");
      }
      const firstItem = data[0];
      if (!firstItem) {
        throw new EmbeddingError("No embedding returned from API");
      }
      const embedding = firstItem.embedding;
      if (!embedding) {
        throw new EmbeddingError("Embedding data is missing");
      }
      return embedding;
    } catch (error) {
      // Re-throw our own errors unchanged; wrap everything else.
      if (error instanceof EmbeddingError) {
        throw error;
      }
      if (error instanceof VoyageAIError) {
        throw new EmbeddingError(`VoyageAI API error: ${error.message}`);
      }
      throw new EmbeddingError(
        `Unexpected error generating embedding: ${error instanceof Error ? error.message : String(error)}`,
      );
    }
  }

  // Embed several texts in one request; an empty input returns an empty array.
  async generateEmbeddings(texts: string[]): Promise<number[][]> {
    if (texts.length === 0) {
      return [];
    }
    try {
      const response = await this.client.embed({ input: texts, model: "voyage-3" });
      const data = response.data;
      if (!data) {
        throw new EmbeddingError("No embeddings returned from API");
      }
      return data.map((d) => {
        const embedding = d.embedding;
        if (!embedding) {
          throw new EmbeddingError("Embedding data is missing for an item");
        }
        return embedding;
      });
    } catch (error) {
      if (error instanceof EmbeddingError) {
        throw error;
      }
      if (error instanceof VoyageAIError) {
        throw new EmbeddingError(`VoyageAI API error: ${error.message}`);
      }
      throw new EmbeddingError(
        `Unexpected error generating embeddings: ${error instanceof Error ? error.message : String(error)}`,
      );
    }
  }
}
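The embedding service above sends all texts in a single request. For long leases split into many clauses, one option is to embed in fixed-size batches. The sketch below is an assumption-laden illustration: `chunk`, `embedInBatches`, and the `EmbedBatch` callback type are hypothetical names, and the default batch size of 64 is arbitrary, not a documented VoyageAI limit.

```typescript
// Hypothetical callback type: embeds one batch of texts, one vector per text.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

// Split an array into consecutive slices of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Embed texts batch by batch, preserving the input order.
async function embedInBatches(
  texts: string[],
  embedBatch: EmbedBatch,
  batchSize = 64, // assumed value; tune to the provider's actual limits
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, batchSize)) {
    const result = await embedBatch(batch);
    if (result.length !== batch.length) {
      throw new Error("Embedding count does not match batch size");
    }
    vectors.push(...result);
  }
  return vectors;
}
```

With the class above, the callback could be `(batch) => service.generateEmbeddings(batch)`, where `service` is an `EmbeddingService` instance.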

import { ChromaClient, type Metadata } from "chromadb";
import type { RetrievalResult } from "@reaatech/hybrid-rag";

export class VectorStoreError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "VectorStoreError";
  }
}

export class VectorStore {
  private chroma: ChromaClient;

  constructor() {
    this.chroma = new ChromaClient();
  }

  // Try to create the collection; if that fails (for example because it
  // already exists), fetch the existing one instead.
  private async ensureCollection(name: string) {
    try {
      return await this.chroma.createCollection({ name });
    } catch {
      return await this.chroma.getCollection({ name });
    }
  }

  async storeClauses(
    collectionName: string,
    ids: string[],
    embeddings: number[][],
    documents: string[],
    metadatas: Record<string, unknown>[],
  ): Promise<void> {
    try {
      const collection = await this.ensureCollection(collectionName);
      await collection.add({ ids, embeddings, documents, metadatas: metadatas as Metadata[] });
    } catch (error) {
      throw new VectorStoreError(
        `Failed to store clauses: ${error instanceof Error ? error.message : String(error)}`,
      );
    }
  }

  async searchSimilarClauses(
    collectionName: string,
    embedding: number[],
    topK: number,
  ): Promise<RetrievalResult[]> {
    try {
      const collection = await this.ensureCollection(collectionName);
      const results = await collection.query({
        queryEmbeddings: [embedding],
        nResults: topK,
      });
      // Results are grouped per query embedding; only one query is sent.
      if (results.ids.length === 0 || results.ids[0] === undefined) {
        return [];
      }
      const retrievalResults: RetrievalResult[] = [];
      for (let i = 0; i < results.ids[0].length; i++) {
        const id = results.ids[0][i];
        const document = results.documents[0]?.[i];
        retrievalResults.push({
          chunkId: id ?? "",
          documentId: "",
          content: document ?? "",
          // The score is Chroma's distance, so lower means more similar.
          score: results.distances[0]?.[i] ?? 0,
          source: "vector",
          metadata: results.metadatas[0]?.[i] ?? {},
        });
      }
      return retrievalResults;
    } catch (error) {
      throw new VectorStoreError(
        `Failed to search clauses: ${error instanceof Error ? error.message : String(error)}`,
      );
    }
  }
}
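Because the vector store above passes Chroma's distance through as `score`, a lower score means a closer match. If downstream code expects higher-is-better similarity, a conversion step can help. The sketch below assumes the collection uses cosine distance (distance = 1 - cosine similarity), which is an assumption about configuration and not something the code above sets. `ScoredChunk`, `distanceToSimilarity`, `rankBySimilarity`, and the 0.5 cutoff are all hypothetical.

```typescript
// Hypothetical, trimmed-down view of a retrieval result.
interface ScoredChunk {
  chunkId: string;
  content: string;
  score: number;
}

// Valid only under the cosine-distance assumption stated above.
function distanceToSimilarity(distance: number): number {
  return 1 - distance;
}

// Convert distances to similarities, drop weak matches, best match first.
function rankBySimilarity(chunks: ScoredChunk[], minSimilarity = 0.5): ScoredChunk[] {
  return chunks
    .map((c) => ({ ...c, score: distanceToSimilarity(c.score) }))
    .filter((c) => c.score >= minSimilarity)
    .sort((a, b) => b.score - a.score);
}
```

For other distance metrics, such as squared L2, the conversion would need a different formula, so the metric should be confirmed before relying on a fixed cutoff.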

import Anthropic from "@anthropic-ai/sdk";
import type { Message } from "@anthropic-ai/sdk/resources/messages/messages.js";

export class AnthropicApiError extends Error {
  constructor(
    message: string,
    public readonly statusCode: number,
    public readonly originalError: unknown,
  ) {
    super(message);
    this.name = "AnthropicApiError";
  }
}

// Maximum document length in characters; longer input is truncated.
const MAX_DOCUMENT_LENGTH = 180_000;

export class AnthropicClient {
  private readonly client: Anthropic;

  constructor() {
    const apiKey = process.env.ANTHROPIC_API_KEY;
    if (!apiKey) {
      throw new AnthropicApiError(
        "ANTHROPIC_API_KEY environment variable is not set",
        0,
        null,
      );
    }
    this.client = new Anthropic({ apiKey });
  }

  async extractLeaseClauses(
    documentText: string,
    systemPrompt: string,
  ): Promise<Message> {
    if (documentText.length === 0) {
      throw new AnthropicApiError(
        "documentText cannot be empty",
        0,
        null,
      );
    }
    const text =
      documentText.length > MAX_DOCUMENT_LENGTH
        ? documentText.slice(0, MAX_DOCUMENT_LENGTH)
        : documentText;
    try {
      const message = await this.client.messages.create({
        model: "claude-sonnet-4-6",
        max_tokens: 4096,
        system: systemPrompt,
        messages: [{ role: "user", content: text }],
      });
      // Warn when the response did not finish normally (for example, when
      // it stopped at max_tokens), since the output may be incomplete.
      if (message.stop_reason !== "end_turn" && message.stop_reason !== "stop_sequence") {
        console.warn(
          `[anthropic-client] Message stopped with reason: ${String(message.stop_reason)}`,
        );
      }
      return message;
    } catch (error) {
      if (error instanceof AnthropicApiError) {
        throw error;
      }
      // Map the HTTP status, when one is present, to a descriptive error.
      const apiError = error as Record<string, unknown>;
      const statusCode = typeof apiError.status === "number" ? apiError.status : 0;
      if (statusCode === 401) {
        throw new AnthropicApiError(
          "Authentication failed: Invalid API key",
          statusCode,
          error,
        );
      }
      if (statusCode === 429) {
        throw new AnthropicApiError(
          "Rate limit exceeded",
          statusCode,
          error,
        );
      }
      if (statusCode >= 500) {
        throw new AnthropicApiError(
          "Anthropic API server error: " + String(statusCode),
          statusCode,
          error,
        );
      }
      const message = error instanceof Error ? error.message : "unknown";
      throw new AnthropicApiError(
        "Anthropic API error: " + message,
        statusCode,
        error,
      );
    }
  }
}
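The client above tags each failure with a `statusCode`, which makes it straightforward to retry only transient errors (429 and 5xx). The following is a sketch under stated assumptions: `isRetryable`, `backoffDelayMs`, and `withRetry` are hypothetical helpers, and the attempt count and delays are illustrative. The Anthropic SDK may also retry some requests on its own, so an outer retry like this should be kept small.

```typescript
// Treat rate limits and server errors as transient; everything else is final.
function isRetryable(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const status = (error as { statusCode?: unknown }).statusCode;
  return typeof status === "number" && (status === 429 || status >= 500);
}

// Exponential backoff: base, 2x base, 4x base, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Run `fn`, retrying transient failures up to `maxAttempts` total attempts.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      const isLastAttempt = attempt === maxAttempts - 1;
      if (!isRetryable(error) || isLastAttempt) throw error;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw lastError;
}
```

A call site might look like `withRetry(() => client.extractLeaseClauses(text, prompt))`, where `client` is an `AnthropicClient` instance.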

export interface PricingProvider {
  estimateCost(args: {
    model: string;
    inputTokens: number;
    outputTokens: number;
  }): number;
}

// Rates are in USD per million tokens. Check them against the provider's
// current published pricing before relying on the estimates.
const PRICING_TABLE: Record<
  string,
  { inputRate: number; outputRate: number }
> = {
  "claude-sonnet-4-6": { inputRate: 3.0, outputRate: 15.0 },
  "claude-haiku-4-5-20251001": { inputRate: 0.8, outputRate: 4.0 },
};

// Token counts above this value are clamped before estimating.
const MAX_ESTIMATION_TOKENS = 1_000_000;

export class AnthropicPricingProvider implements PricingProvider {
  getModelRates(
    model: string,
  ): { inputRate: number; outputRate: number } {
    const rates = PRICING_TABLE[model];
    if (rates === undefined) {
      console.warn(
        '[pricing-provider] Unknown model "' + model + '", using fallback rates (highest)',
      );
      return { inputRate: 3.0, outputRate: 15.0 };
    }
    return rates;
  }

  // Two call styles: positional (input tokens only) or an args object.
  estimateCost(modelId: string, estimatedInputTokens: number, provider?: string): number;
  estimateCost(args: { model: string; inputTokens: number; outputTokens: number }): number;
  estimateCost(
    modelOrArgs: string | { model: string; inputTokens: number; outputTokens: number },
    estimatedInputTokens?: number,
    provider?: string,
  ): number {
    void provider;
    let model: string;
    let inputTokens: number;
    let outputTokens: number;
    if (typeof modelOrArgs === "string") {
      model = modelOrArgs;
      inputTokens = estimatedInputTokens ?? 0;
      outputTokens = 0;
    } else {
      model = modelOrArgs.model;
      inputTokens = modelOrArgs.inputTokens;
      outputTokens = modelOrArgs.outputTokens;
    }
    if (inputTokens <= 0 && outputTokens <= 0) {
      return 0;
    }
    const clampedInputTokens = Math.min(inputTokens, MAX_ESTIMATION_TOKENS);
    const clampedOutputTokens = Math.min(outputTokens, MAX_ESTIMATION_TOKENS);
    const rates = this.getModelRates(model);
    return (
      (clampedInputTokens * rates.inputRate + clampedOutputTokens * rates.outputRate) /
      1_000_000
    );
  }
}
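As a worked example of the formula above, the sketch below reproduces the cost arithmetic for the `claude-sonnet-4-6` rates in the table and compares the result with a per-extraction cap. The helper names and the 0.25 USD cap are hypothetical; the recipe's budget engine is not shown here and may work differently.

```typescript
// Rates copied from the pricing table above (USD per million tokens).
const SONNET_RATES = { inputRate: 3.0, outputRate: 15.0 };

// Same arithmetic as estimateCost, without clamping or model lookup.
function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  rates: { inputRate: number; outputRate: number } = SONNET_RATES,
): number {
  return (inputTokens * rates.inputRate + outputTokens * rates.outputRate) / 1_000_000;
}

// Hypothetical per-extraction cap check.
function withinBudget(costUsd: number, capUsd: number): boolean {
  return costUsd <= capUsd;
}

// 50,000 input tokens and 2,000 output tokens:
// (50,000 * 3.0 + 2,000 * 15.0) / 1,000,000 = (150,000 + 30,000) / 1,000,000 = 0.18 USD
const exampleCost = estimateCostUsd(50_000, 2_000);
const exampleAllowed = withinBudget(exampleCost, 0.25); // assumed cap of 0.25 USD
```

Output tokens cost five times as much as input tokens at these rates, so the `max_tokens` setting matters for the worst-case estimate even when documents are long.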

import Link from "next/link";

export default function LeasesPage() {
  return (
    <div style={{ padding: "2rem", maxWidth: 960, margin: "0 auto" }}>
      <header
        style={{
          display: "flex",
          justifyContent: "space-between",
          alignItems: "center",
          marginBottom: "2rem",
        }}
      >
        <h1 style={{ fontSize: "1.5rem" }}>Lease Extractions Dashboard</h1>
        <Link
          href="/leases/upload"
          style={{
            padding: "0.5rem 1rem",
            background: "#000",
            color: "#fff",
            borderRadius: 6,
            textDecoration: "none",
            fontSize: "0.9rem",
          }}
        >
          Upload New
        </Link>
      </header>
      <table
        style={{
          width: "100%",
          borderCollapse: "collapse",
          border: "1px solid #ddd",
        }}
      >
        <thead>
          <tr style={{ background: "#f5f5f5", textAlign: "left" }}>
            <th style={{ padding: "0.75rem", borderBottom: "2px solid #ddd" }}>
              Document Name
            </th>
            <th style={{ padding: "0.75rem", borderBottom: "2px solid #ddd" }}>
              Status
            </th>
            <th style={{ padding: "0.75rem", borderBottom: "2px solid #ddd" }}>
              Confidence
            </th>
            <th style={{ padding: "0.75rem", borderBottom: "2px solid #ddd" }}>
              Date
            </th>
            <th style={{ padding: "0.75rem", borderBottom: "2px solid #ddd" }}>
              Action
            </th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td
              colSpan={5}
              style={{
                padding: "2rem",
                textAlign: "center",
                color: "#999",
              }}
            >
              No extractions yet.{" "}
              <Link href="/leases/upload" style={{ color: "#000" }}>
                Upload a lease document
              </Link>{" "}
              to get started.
            </td>
          </tr>
        </tbody>
      </table>
    </div>
  );
}

Intro

Prerequisites

Step 1: Scaffold the project and configure environment variables

Step 2: Define the lease types and Zod schemas

Step 3: Build the document parsers (PDF and DOCX)

Step 4: Create the embedding service and vector store

Step 5: Build the clause retriever

Step 6: Create the Anthropic client

Step 7: Add schema repair for resilient JSON

Step 8: Implement confidence routing

Step 9: Add budget control and circuit breakers

Step 10: Add a pricing provider and observability

Step 11: Wire up the extraction pipeline

Step 12: Create the API routes and UI

Step 13: Run the tests

Next steps