Small law firms and contract-heavy SMBs manually review every agreement to find renewal dates, liability caps, and termination clauses. Missing a deadline or misreading a clause leads to unbilled work and client disputes.
This tutorial walks through building a contract clause extraction pipeline for small law firms and SMBs. You’ll build a Next.js application that ingests PDF contracts, runs OCR via AWS Textract, extracts 12 standard clause types using AWS Bedrock (Claude Sonnet 4), repairs malformed LLM JSON output automatically, and tracks per-contract cost against a daily budget. By the end you’ll have a fully tested API that produces structured contract summaries.
Set up your tsconfig.json for ESM and strict mode, and create the app/layout.tsx and app/page.tsx placeholder files. Also create the src/ directory for server-side code and the API route directories:
Expected output: A clean Next.js project with all dependencies installed, TypeScript configured, and an empty src/ directory ready for server modules. You should see node_modules/ and pnpm-lock.yaml.
Step 2: Define contract Zod schemas
Create the Zod schemas that model every contract artifact — clauses, parties, dates, and the top-level summary. These schemas are used for parsing, validation, and as the target shape for LLM extraction and repair.
Expected output: A TypeScript file that exports all contract schemas — 12 clause types, party fields, date classifications, and the full extraction result shape. The CLAUSE_TYPES array covers the most common SMB contract clauses.
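As a rough illustration of the clause vocabulary these schemas revolve around, the sketch below lists the 12 clause types named in the Step 5 extraction prompt and rejects unknown values. It uses plain TypeScript rather than Zod so it runs without dependencies. The `ClauseSketch` shape and the `isClauseType` and `parseClause` helpers are hypothetical names, not the tutorial's actual exports; in a Zod-based file the same array could be passed to `z.enum(CLAUSE_TYPES)`.

```typescript
// Sketch only: the tutorial's real schemas are Zod objects. Every name here
// except CLAUSE_TYPES is a hypothetical stand-in.
export const CLAUSE_TYPES = [
  "liability_cap",
  "termination",
  "renewal",
  "confidentiality",
  "indemnification",
  "governing_law",
  "dispute_resolution",
  "force_majeure",
  "assignment",
  "waiver",
  "entire_agreement",
  "non_compete",
] as const;

export type ClauseType = (typeof CLAUSE_TYPES)[number];

// Assumed minimal clause shape: type, full text, optional page reference.
export interface ClauseSketch {
  type: ClauseType;
  text: string;
  page?: number;
}

// Type guard: true only for one of the 12 known clause identifiers.
export function isClauseType(value: unknown): value is ClauseType {
  return typeof value === "string" && (CLAUSE_TYPES as readonly string[]).includes(value);
}

// Returns a typed clause, or undefined when the input does not fit the shape.
export function parseClause(input: Record<string, unknown>): ClauseSketch | undefined {
  if (!isClauseType(input.type) || typeof input.text !== "string") {
    return undefined;
  }
  const page = typeof input.page === "number" ? input.page : undefined;
  return { type: input.type, text: input.text, page };
}
```

Returning `undefined` instead of throwing mirrors the idea that a later repair step can still work with partially valid output.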
Step 3: Create the Bedrock provider
Wrap the AWS Bedrock Runtime SDK in a class that conforms to @reaatech/media-pipeline-mcp-provider-core’s MediaProvider interface. This provider calls Claude Sonnet 4 through the Converse API with a temperature of 0.0 for deterministic extraction.
ts
// src/lib/bedrock-provider.ts
import {
  MediaProvider,
  type ProviderInput,
  type ProviderOutput,
  type CostEstimate,
} from "@reaatech/media-pipeline-mcp-provider-core";
import { BedrockRuntimeClient, ConverseCommand } from "@aws-sdk/client-bedrock-runtime";

const DEFAULT_MODEL_ID = "anthropic.claude-sonnet-4-v1:0";
const INPUT_COST_PER_1K = 0.003;
const OUTPUT_COST_PER_1K = 0.015;

export class BedrockProvider extends MediaProvider {
  readonly name = "bedrock";
  readonly supportedOperations = ["document.extract_fields", "document.summarize"];
  private
Expected output: A BedrockProvider class that implements estimateCost() and execute(). It calls Claude Sonnet 4 via the AWS Bedrock Converse API with temperature 0 and calculates per-call cost from token usage.
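The per-call cost arithmetic can be sketched in isolation. The two rate constants below are the ones declared in the provider file. The `TokenUsage` shape and the `costFromUsage` name are assumptions for illustration, modeled on the `inputTokens` and `outputTokens` counts that the Converse API reports in its `usage` field.

```typescript
// Rates copied from the provider file above (USD per 1,000 tokens).
const INPUT_COST_PER_1K = 0.003;
const OUTPUT_COST_PER_1K = 0.015;

// Hypothetical shape: the two token counts needed for the calculation.
export interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Cost in USD for one call: each token count is scaled to thousands
// and multiplied by its own rate.
export function costFromUsage(usage: TokenUsage): number {
  const inputCost = (usage.inputTokens / 1000) * INPUT_COST_PER_1K;
  const outputCost = (usage.outputTokens / 1000) * OUTPUT_COST_PER_1K;
  return inputCost + outputCost;
}
```

For example, a contract that consumes 12,000 input tokens and produces 1,500 output tokens would cost about $0.0585 at these rates (0.036 for input plus 0.0225 for output).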
Step 4: Create the Textract OCR client
Build the Amazon Textract integration that extracts text from scanned PDFs. It supports two modes — full analysis with TABLES and FORMS features, and basic text detection. Both include automatic retries with p-retry:
ts
// src/lib/textract-client.ts
import {
  TextractClient,
  AnalyzeDocumentCommand,
  DetectDocumentTextCommand,
} from "@aws-sdk/client-textract";
import pRetry, { AbortError } from "p-retry";

export class TextractError extends Error {
  constructor(public code: string, message: string) {
    super(message);
    this.name = "TextractError";
  }
}

function getClient(): TextractClient {
  const region = process.env.AWS_REGION ?? "us-east-1";
  return new TextractClient({ region });
}

export async function analyzeDocument(buffer: Uint8Array): Promise<string> {
  const client = getClient();
  const cmd = new AnalyzeDocumentCommand({
    Document: { Bytes: buffer },
    FeatureTypes: ["TABLES", "FORMS"],
  });
  const response = await pRetry(() => client.send(cmd), {
    retries: 3,
    onFailedAttempt: (ctx) => {
      if (ctx.error instanceof AbortError) {
        throw ctx.error;
      }
      if (ctx.retriesLeft > 0) {
        return;
      }
      throw new TextractError("TextractError", String(ctx.error));
    },
  });
  const lines: string[] = [];
  for (const block of response.Blocks ?? []) {
    if (block.BlockType === "LINE" && block.Text) {
      lines.push(block.Text);
    }
  }
  return lines.join("\n");
}

export async function detectDocumentText(buffer: Uint8Array): Promise<string> {
  const client = getClient();
  const cmd = new DetectDocumentTextCommand({
    Document: { Bytes: buffer },
  });
  const response = await pRetry(() => client.send(cmd), {
    retries: 3,
    onFailedAttempt: (ctx) => {
      if (ctx.error instanceof AbortError) {
        throw ctx.error;
      }
      if (ctx.retriesLeft > 0) {
        return;
      }
      throw new TextractError("TextractError", String(ctx.error));
    },
  });
  const lines: string[] = [];
  for (const block of response.Blocks ?? []) {
    if (block.BlockType === "LINE" && block.Text) {
      lines.push(block.Text);
    }
  }
  return lines.join("\n");
}
Expected output: Two Textract functions — analyzeDocument() for full layout analysis with tables and forms detection, and detectDocumentText() for basic OCR. Both retry up to 3 times and only throw TextractError after all retries are exhausted.
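Both functions end with the same loop that keeps only LINE blocks and joins their text. A minimal sketch of that step as a standalone helper is shown below; `BlockLike` and `linesFromBlocks` are hypothetical names, and `BlockLike` is a cut-down stand-in for the SDK's block type so the example runs without the AWS SDK installed.

```typescript
// Hypothetical, reduced stand-in for a Textract block: only the two
// fields the tutorial's loop reads.
export interface BlockLike {
  BlockType?: string;
  Text?: string;
}

// Keeps LINE blocks that carry text and joins them with newlines.
// PAGE and WORD blocks are skipped, so words already covered by a
// LINE block are not emitted twice.
export function linesFromBlocks(blocks: readonly BlockLike[] | undefined): string {
  const lines: string[] = [];
  for (const block of blocks ?? []) {
    if (block.BlockType === "LINE" && block.Text) {
      lines.push(block.Text);
    }
  }
  return lines.join("\n");
}
```

Extracting the loop this way would let `analyzeDocument()` and `detectDocumentText()` share one tested code path, although the tutorial's version keeps the logic inline.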
Step 5: Build the doc-extraction layer
This module connects the Bedrock provider to the REAA media pipeline. It creates an in-memory artifact registry and store, feeds the contract text as an artifact, and calls extractFields with a system prompt targeting the 12 contract clauses.
ts
// src/lib/doc-extraction.ts
import { createDocumentExtractionOperations } from "@reaatech/media-pipeline-mcp-doc-extraction";
import { InMemoryArtifactRegistry, InMemoryArtifactStore } from "./artifact-store.js";
import { BedrockProvider } from "./bedrock-provider.js";
import { BedrockRuntimeClient } from "@aws-sdk/client-bedrock-runtime";

const BEDROCK_SYSTEM_PROMPT = `You are a contract extraction assistant. Extract the following fields from the contract text and return them as a strict JSON object matching the ContractSummary shape:
Parties: For each party, extract name, role, and address (if available).
Clauses: Extract full text and page reference for each of these 12 clause types: liability_cap, termination, renewal, confidentiality, indemnification, governing_law, dispute_resolution, force_majeure, assignment, waiver, entire_agreement, non_compete.
Dates: Extract dates classified as: execution, effective, expiry, renewal, termination.
Also extract: governing law, jurisdiction, and contract value (if present).
Return ONLY valid JSON with no additional text.`;

let bedrockClientSingleton:
You also need the in-memory artifact store that the doc-extraction pipeline uses:
ts
// src/lib/artifact-store.ts
import { Readable } from "stream";
import { ArtifactRegistry } from "@reaatech/media-pipeline-mcp-core";
import type { ArtifactMeta, ArtifactStore, StorageResult } from "@reaatech/media-pipeline-mcp-storage";

export class InMemoryArtifactRegistry extends ArtifactRegistry {}

export class InMemoryArtifactStore implements ArtifactStore {
  private store = new Map<string, Buffer>();

  async put(
    id: string,
    data: Buffer | NodeJS.ReadableStream | string,
    _meta: ArtifactMeta
  ): Promise<string> {
    void _meta;
    if (data instanceof Buffer) {
      this.store.set(id, data);
    } else if (data instanceof Readable) {
      const chunks: Buffer[] = [];
      for await (const chunk of data) {
        if (Buffer.isBuffer(chunk)) {
          chunks.push(chunk);
        } else {
          chunks.push(Buffer.from(String(chunk)));
        }
      }
      this.store.set(id, Buffer.concat(chunks));
    } else {
      this.store.set(id, Buffer.from(data as string));
    }
    return `in-memory://artifacts/${id}`;
  }

  get(id: string): Promise<StorageResult> {
    const data = this.store.get(id);
    if (!data) {
      return Promise.reject(new Error(`Artifact not found: ${id}`));
    }
    return Promise.resolve({
      data,
      meta: {
        id,
        type: "document",
        mimeType: "application/octet-stream",
      },
    });
  }

  getSignedUrl(_id: string, _expiresIn?: number): Promise<string> {
    void _id;
    void _expiresIn;
    return Promise.reject(new Error("getSignedUrl not supported for in-memory store"));
  }

  delete(id: string): Promise<void> {
    this.store.delete(id);
    return Promise.resolve();
  }

  list(prefix?: string): Promise<ArtifactMeta[]> {
    const entries: ArtifactMeta[] = [];
    for (const key of this.store.keys()) {
      if (!prefix || key.startsWith(prefix)) {
        entries.push({
          id: key,
          type: "document",
          mimeType: "application/octet-stream",
        });
      }
    }
    return Promise.resolve(entries);
  }

  healthCheck(): Promise<boolean> {
    return Promise.resolve(true);
  }
}
Expected output: The extractContractSummary() function feeds contract text through the Bedrock provider with a detailed system prompt, reads the LLM output from the artifact store, and returns the raw JSON string plus the estimated token count.
Step 6: Build the JSON repair layer
LLMs sometimes wrap JSON in markdown code fences, add trailing commas, or include extra fields. This module uses @reaatech/structured-repair-core to fix those issues against your Zod schema, trying six strategies in order.
Expected output: A repair pipeline that strips markdown fences, extracts embedded JSON, fixes syntax errors, coerces types, fuzzy-matches keys, removes extra fields, and always provides partial data even when validation fails.
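To make the first few strategies concrete, here is a dependency-free sketch of fence stripping, embedded-object extraction, and trailing-comma removal, applied in order until `JSON.parse` succeeds. It is not the implementation inside @reaatech/structured-repair-core, whose API is not shown in this tutorial; all function names are hypothetical, and the regex-based comma fix is deliberately naive (it would also touch a `,}` sequence inside a string value).

```typescript
// Three backticks, built at runtime so this example can sit inside a fenced block.
const FENCE = "`".repeat(3);

// Strategy 1: keep only the body of the first markdown code fence, if any.
export function stripCodeFences(raw: string): string {
  const pattern = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");
  const match = raw.match(pattern);
  return (match?.[1] ?? raw).trim();
}

// Strategy 2: cut from the first "{" to the last "}" to drop surrounding prose.
export function extractJsonObject(text: string): string {
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  return start !== -1 && end > start ? text.slice(start, end + 1) : text;
}

// Strategy 3: delete commas that directly precede a closing brace or bracket.
export function removeTrailingCommas(text: string): string {
  return text.replace(/,\s*([}\]])/g, "$1");
}

export type RepairOutcome =
  | { ok: true; value: unknown; applied: string[] }
  | { ok: false; applied: string[] };

// Tries to parse before each strategy and once more after the last one;
// records which strategies actually changed the text.
export function repairJsonSketch(raw: string): RepairOutcome {
  const steps: Array<[string, (s: string) => string]> = [
    ["strip_fences", stripCodeFences],
    ["extract_object", extractJsonObject],
    ["remove_trailing_commas", removeTrailingCommas],
  ];
  let current = raw;
  const applied: string[] = [];
  for (const [label, step] of steps) {
    try {
      return { ok: true, value: JSON.parse(current), applied };
    } catch {
      // Not valid yet: fall through to the next strategy.
    }
    const next = step(current);
    if (next !== current) applied.push(label);
    current = next;
  }
  try {
    return { ok: true, value: JSON.parse(current), applied };
  } catch {
    return { ok: false, applied };
  }
}
```

Recording the applied steps is what would let a caller populate fields such as `repairApplied` and `repairSteps` on the extraction result; schema-aware strategies like type coercion and fuzzy key matching are left out of this sketch.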
Step 7: Add cost tracking and budget enforcement
Track every Bedrock call’s cost and enforce a daily budget cap. This module wraps the @reaatech/llm-cost-telemetry library to record cost spans and check budgets before executing LLM calls.
Expected output: A CostTracker that records per-call cost spans and a withCostTracking() wrapper that checks the daily budget before executing. If the budget is exceeded, a BudgetExceededError is thrown.
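The check-then-record flow described above can be sketched without the telemetry library. The class and function names match the ones this step mentions, but the signatures, the in-memory per-day map, and the UTC day boundary are all assumptions; the tutorial's real module wraps @reaatech/llm-cost-telemetry, whose API is not reproduced here.

```typescript
// Sketch only: signatures and storage are assumed, not the tutorial's module.
export class BudgetExceededError extends Error {
  readonly spentUsd: number;
  readonly budgetUsd: number;

  constructor(spentUsd: number, budgetUsd: number) {
    super(`Daily budget of $${budgetUsd.toFixed(2)} reached (spent $${spentUsd.toFixed(2)})`);
    this.name = "BudgetExceededError";
    this.spentUsd = spentUsd;
    this.budgetUsd = budgetUsd;
  }
}

export class CostTracker {
  private readonly spentByDay = new Map<string, number>();
  private readonly dailyBudgetUsd: number;

  constructor(dailyBudgetUsd: number) {
    this.dailyBudgetUsd = dailyBudgetUsd;
  }

  // Assumption: a "day" is a UTC calendar date (YYYY-MM-DD).
  private dayKey(date: Date): string {
    return date.toISOString().slice(0, 10);
  }

  spentOn(date: Date = new Date()): number {
    return this.spentByDay.get(this.dayKey(date)) ?? 0;
  }

  record(costUsd: number, date: Date = new Date()): void {
    this.spentByDay.set(this.dayKey(date), this.spentOn(date) + costUsd);
  }

  // Throws once the day's recorded spend has reached the budget.
  assertWithinBudget(date: Date = new Date()): void {
    const spent = this.spentOn(date);
    if (spent >= this.dailyBudgetUsd) {
      throw new BudgetExceededError(spent, this.dailyBudgetUsd);
    }
  }
}

// Checks the budget first, runs the call, then records what it cost.
export async function withCostTracking<T>(
  tracker: CostTracker,
  call: () => Promise<{ value: T; costUsd: number }>,
  now: Date = new Date(),
): Promise<T> {
  tracker.assertWithinBudget(now);
  const { value, costUsd } = await call();
  tracker.record(costUsd, now);
  return value;
}
```

Because the check runs before the call and the cost is only known afterwards, a single call can push the day's total slightly past the cap; the next call is the one that gets rejected.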
Step 8: Build the pipeline orchestrator
The pipeline ties everything together. It accepts a PDF buffer or raw text, runs PDF parsing with unpdf, optionally enhances with Textract OCR, feeds the text to Bedrock via the doc-extraction layer, repairs the JSON, and returns a structured ExtractionResult with cost and token metadata.
ts
// src/pipeline/contract.ts
import { extractText, getDocumentProxy } from "unpdf";
import { analyzeDocument } from "../lib/textract-client.js";
import { extractContractSummary, ExtractionError } from "../lib/doc-extraction.js";
import { repairExtraction } from "../repair/extract.js";
import { CostTracker, BudgetExceededError, withCostTracking } from "../telemetry/wrapper.js";
import { generateId } from "@reaatech/llm-cost-telemetry";
import { type ExtractionResult, type ContractSummary } from "../lib/contract-schema.js";
import { writeFile, unlink } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join }
Expected output: Two pipeline entry points — extractFromPdf() (writes PDF to temp disk, runs unpdf + optional Textract, extracts via Bedrock, repairs JSON, cleans up) and extractFromText() (same pipeline but skips PDF parsing).
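The stage ordering can be sketched with each stage injected as a plain function, which keeps the example runnable without unpdf, Textract, or Bedrock. Everything here is hypothetical: the `PipelineStages` interface, the `runPipelineSketch` name, and in particular the rule that OCR runs only when the parsed text is empty, which is one plausible reading of "optionally enhances" rather than the tutorial's confirmed behavior.

```typescript
// Hypothetical stage signatures; the real modules are wired in the pipeline file.
export interface PipelineStages {
  parsePdf: (pdf: Uint8Array) => Promise<string>;
  ocr?: (pdf: Uint8Array) => Promise<string>;
  extract: (text: string) => Promise<string>; // returns raw LLM JSON
  repair: (rawJson: string) => { data: unknown; repairApplied: boolean };
}

export interface PipelineResultSketch {
  rawText: string;
  summary: unknown;
  repairApplied: boolean;
}

export async function runPipelineSketch(
  pdf: Uint8Array,
  stages: PipelineStages,
): Promise<PipelineResultSketch> {
  // 1. Try the PDF's embedded text layer first.
  let rawText = await stages.parsePdf(pdf);

  // 2. Assumption: fall back to OCR only when no text was found
  //    and an OCR stage was supplied.
  if (rawText.trim().length === 0 && stages.ocr) {
    rawText = await stages.ocr(pdf);
  }

  // 3. Ask the model for structured JSON, then 4. repair it.
  const rawJson = await stages.extract(rawText);
  const { data, repairApplied } = stages.repair(rawJson);

  return { rawText, summary: data, repairApplied };
}
```

Passing the stages in as arguments also makes the orchestrator straightforward to unit test with stubs, since no AWS credentials are needed.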
Step 9: Set up the database with Drizzle
Create the PostgreSQL schema for persisting extraction results, then build the data access layer with Drizzle ORM.
ts
// src/db/index.ts
import postgres from "postgres";
import { drizzle } from "drizzle-orm/postgres-js";
import { eq, desc } from "drizzle-orm";
import { contractsTable, type Contract, type NewContract } from "./schema.js";

const client = postgres(process.env.DATABASE_URL ?? "");
const db = drizzle(client, { schema: { contracts: contractsTable } });

export async function insertContract(data: NewContract): Promise<Contract> {
  const [row] = await db.insert(contractsTable).values(data).returning();
  return row;
}

export async function getContract(id: string): Promise<Contract | undefined> {
  const rows = await db
    .select()
    .from(contractsTable)
    .where(eq(contractsTable.id, id))
    .limit(1);
  return rows[0];
}

export async function listContracts(
  limit = 50,
  offset = 0,
): Promise<Contract[]> {
  return db
    .select()
    .from(contractsTable)
    .orderBy(desc(contractsTable.createdAt))
    .limit(limit)
    .offset(offset);
}
Expected output: A contracts table with fields for the extraction ID, filename, raw text, summary JSON, cost, tokens, repair metadata, and timestamps. Three access functions: insertContract(), getContract(), and listContracts().
Step 10: Create the API routes
Build the Next.js App Router routes. The extraction endpoint accepts either a PDF file upload (multipart/form-data) or raw text (application/json), runs the pipeline, persists the result, and returns the extraction.
ts
// app/api/extract/route.ts
import { type NextRequest, NextResponse } from "next/server";
import { extractFromPdf, extractFromText } from "../../../src/pipeline/contract.js";
import { insertContract } from "../../../src/db/index.js";
import { BudgetExceededError } from "../../../src/telemetry/wrapper.js";

export async function POST(req: NextRequest) {
  try {
    const contentType = req.headers.get("content-type") ?? "";
    if (contentType.includes("multipart/form-data")) {
      const formData = await req.formData();
      const file = formData.get("file");
      if (!file || !(file instanceof File)) {
        return NextResponse.json({ error: "No PDF file or text provided" }, { status: 400 });
      }
      if (!file.name.toLowerCase().endsWith(".pdf")) {
        return NextResponse.json({ error: "Only PDF files are accepted" }, { status: 400 });
      }
      const buffer = new Uint8Array(await file.arrayBuffer());
      const result = await extractFromPdf(buffer, file.name);
      await insertContract({
        id: result.id,
        filename: result.filename,
        rawText: result.rawText,
        summaryJson: result.summary,
        costUsd: String(result.costUsd),
        totalTokens: String(result.totalTokens),
        repairApplied: result.repairApplied,
        repairSteps: result.repairSteps !== undefined ? String(result.repairSteps) : null,
        errorMessage: result.error ?? null,
        createdAt: new Date(result.createdAt),
      });
      if (result.error) {
        return NextResponse.json({ error: result.error, extractionResult: result }, { status: 422 });
      }
      return NextResponse.json(result, { status: 201 });
    } else {
      const body = await req.json() as { text: string };
      if (!body.text) {
        return NextResponse.json({ error: "No PDF file or text provided" }, { status: 400 });
      }
      const result = await extractFromText(body.text);
      await insertContract({
        id: result.id,
        filename: result.filename,
        rawText: result.rawText,
        summaryJson: result.summary,
        costUsd: String(result.costUsd),
        totalTokens: String(result.totalTokens),
        repairApplied: result.repairApplied,
        repairSteps: result.repairSteps !== undefined ? String(result.repairSteps) : null,
        errorMessage: result.error ?? null,
        createdAt: new Date(result.createdAt),
      });
      if (result.error) {
        return NextResponse.json({ error: result.error, extractionResult: result }, { status: 422 });
      }
      return NextResponse.json(result, { status: 201 });
    }
  } catch (error) {
    if (error instanceof BudgetExceededError) {
      return NextResponse.json({ error: "Daily extraction budget exceeded" }, { status: 429 });
    }
    throw error;
  }
}
ts
// app/api/contracts/route.ts
import { type NextRequest, NextResponse } from "next/server";
import { insertContract, listContracts } from "../../../src/db/index.js";

export async function GET(req: NextRequest) {
  const searchParams = req.nextUrl.searchParams;
  const limit = Number(searchParams.get("limit")) || 50;
  const offset = Number(searchParams.get("offset")) || 0;
  const contracts = await listContracts(limit, offset);
  return NextResponse.json(contracts);
}

export async function POST(req: NextRequest) {
  const body = await req.json() as { filename: string };
  if (!body.filename) {
    return NextResponse.json({ error: "filename is required" }, { status: 400 });
  }
  const contract = await insertContract({ id: crypto.randomUUID(), filename: body.filename });
  return NextResponse.json(contract, { status: 201 });
}
ts
// app/api/contracts/[id]/route.ts
import { type NextRequest, NextResponse } from "next/server";
import { getContract } from "../../../../src/db/index.js";

export async function GET(_req: NextRequest, { params }: { params: Promise<{ id: string }> }) {
  const { id } = await params;
  const contract = await getContract(id);
  if (!contract) return NextResponse.json({ error: "Not found" }, { status: 404 });
  return NextResponse.json(contract);
}
Expected output: Three route files exposing four endpoints: POST /api/extract (accepts a PDF upload or text JSON), GET /api/contracts (lists stored contracts with pagination), POST /api/contracts (creates a contract record), and GET /api/contracts/[id] (fetches a single contract by ID, or returns 404 if it does not exist). The extraction route returns 201 on success, 422 on partial extraction with errors, 400 on missing input, and 429 when the daily budget is exceeded.
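The listing route reads `limit` and `offset` with `Number(...) || default`, which lets a negative value such as `limit=-5` pass through to the query unchanged. One possible hardening is sketched below; the `parsePagination` name and the bounds (minimum 1, maximum 200) are assumptions for illustration, not part of the tutorial's code.

```typescript
// Hypothetical helper: parses and clamps pagination query parameters.
export interface Pagination {
  limit: number;
  offset: number;
}

export function parsePagination(
  params: URLSearchParams,
  defaults: { limit: number; maxLimit: number } = { limit: 50, maxLimit: 200 },
): Pagination {
  // Parses an integer, falls back when missing or non-numeric,
  // and clamps the result into [min, max].
  const clampInt = (raw: string | null, fallback: number, min: number, max: number): number => {
    const n = Number.parseInt(raw ?? "", 10);
    if (Number.isNaN(n)) return fallback;
    return Math.min(Math.max(n, min), max);
  };

  return {
    limit: clampInt(params.get("limit"), defaults.limit, 1, defaults.maxLimit),
    offset: clampInt(params.get("offset"), 0, 0, Number.MAX_SAFE_INTEGER),
  };
}
```

In the route handler this could replace the two `Number(...)` lines, for example `const { limit, offset } = parsePagination(req.nextUrl.searchParams);`.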
Step 11: Add the instrumentation hook
The instrumentation hook loads the cost telemetry configuration at server startup, which reads budget settings from environment variables.
ts
// src/instrumentation.ts
export async function register() {
  if (process.env.NEXT_RUNTIME === "nodejs") {
    const { loadConfig } = await import("@reaatech/llm-cost-telemetry");
    loadConfig();
  }
}
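Expected output: A register() function that runs once at Next.js server startup in the Node.js runtime, loading the telemetry configuration. On Next.js versions earlier than 15, the experimental.instrumentationHook: true flag in next.config.ts is what ensures this function is actually invoked; from Next.js 15 onward the instrumentation hook is stable and the flag is no longer needed.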
Step 12: Configure environment and run the tests
Create the .env.example file with every variable the application reads:
env
# Env vars used by aws-bedrock-contract-clause-extraction-for-smb-legal.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only; never commit real values.
NODE_ENV=development

# AWS Bedrock
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=<your-access-key>
AWS_SECRET_ACCESS_KEY=<your-secret>

# Database
DATABASE_URL=postgres://user:***@host:5432/dbname

# Budget
DEFAULT_DAILY_BUDGET=5.00

# Bedrock model
BEDROCK_MODEL_ID=anthropic.claude-sonnet-4-v1:0

# Features
TEXTRACT_ENABLED=true
Copy to .env.local and fill in real values for local development.
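A small typed reader for these variables might look like the sketch below. The `AppEnv` shape and the `readEnv` name are hypothetical, the defaults are taken from the values shown in .env.example, and treating only the literal string "true" as enabling Textract is an assumption. Failing fast on a missing DATABASE_URL is a design choice of this sketch, not something the tutorial's code does.

```typescript
// Hypothetical typed view of the variables listed in .env.example.
export interface AppEnv {
  awsRegion: string;
  databaseUrl: string;
  dailyBudgetUsd: number;
  modelId: string;
  textractEnabled: boolean;
}

// Accepts any string map so it can be called with process.env or a test object.
export function readEnv(env: Record<string, string | undefined>): AppEnv {
  const databaseUrl = env.DATABASE_URL;
  if (!databaseUrl) {
    throw new Error("DATABASE_URL is required");
  }

  const dailyBudgetUsd = Number(env.DEFAULT_DAILY_BUDGET ?? "5.00");
  if (!Number.isFinite(dailyBudgetUsd) || dailyBudgetUsd < 0) {
    throw new Error("DEFAULT_DAILY_BUDGET must be a non-negative number");
  }

  return {
    awsRegion: env.AWS_REGION ?? "us-east-1",
    databaseUrl,
    dailyBudgetUsd,
    modelId: env.BEDROCK_MODEL_ID ?? "anthropic.claude-sonnet-4-v1:0",
    // Assumption: only the exact string "true" turns OCR on.
    textractEnabled: env.TEXTRACT_ENABLED === "true",
  };
}
```

Calling `readEnv(process.env)` once at startup would surface a misconfigured budget or missing connection string before the first request arrives.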
Now run the three verification commands:
terminal
pnpm typecheck
Your terminal should print no errors — all types resolve correctly across the schema, provider, repair, telemetry, and pipeline modules.
terminal
pnpm lint
No lint errors should appear.
terminal
pnpm test
Your terminal should show vitest output with numFailedTests=0 and 127 passing tests across the contract schema, doc-extraction, textract client, repair pipeline, cost tracker, pipeline orchestrator, instrumentation, database layer, and API routes. Coverage should report 90%+ on lines, branches, functions, and statements for the src/ runtime modules.
Expected output: All three commands exit with code 0. The typecheck confirms structural correctness, lint confirms code quality, and the test suite validates every layer from schema validation to error handling to pipeline orchestration.
Next steps
Extend clause types: Add industry-specific clauses to CLAUSE_TYPES (e.g., data_privacy, service_levels) and update the Bedrock system prompt to extract them.
Add user authentication: Protect the API routes with NextAuth or Clerk so each SMB client sees only their contracts, and attribute costs per tenant.
Build a dashboard UI: Add a Next.js page that lists extracted contracts with visual summaries — clause counts, cost per contract, and a timeline of daily spend.
Replace in-memory artifact store: Swap InMemoryArtifactStore for a production-grade S3-backed store so large PDFs don’t consume server memory.