Anthropic Salesforce Contract Extraction for SMB Sales

Automatically extracts key fields like value, dates, and parties from Salesforce contracts and proposals, eliminating manual data entry for SMB sales teams.

anthropic salesforce contract-extraction document-pipeline nextjs ocr structured-output smb

The problem

Small sales teams keep contracts and proposals as PDFs or scanned documents inside Salesforce, but pulling out amounts, effective dates, and signatory details manually is slow, error-prone, and inconsistent.

Intro

This tutorial walks you through building a contract extraction pipeline for SMB sales teams. You’ll create a Next.js API that takes a Salesforce document ID, fetches the file (PDF or image), and extracts the text through PDF parsing or OCR. The cleaned text goes to Anthropic’s Claude with a structured extraction prompt, and the JSON output is repaired by a graduated repair engine before being returned. Along the way you’ll enforce per-document budget caps, manage multi-page context across token windows, and trace every run through Langfuse for observability.
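The branch between PDF parsing and OCR can be sketched as a small decision function. This is only an illustration: `chooseExtractionPath`, the `FetchedDocument` shape, and the `minTextChars` threshold are hypothetical names, and falling back to OCR for PDFs with almost no text layer is an assumption about how scanned PDFs might be handled, not something the tutorial's packages are known to do.

```typescript
// Hypothetical helper: decides whether a fetched Salesforce file should go
// through PDF text parsing or OCR. All names and the threshold are illustrative.
type ExtractionPath = "pdf-parse" | "ocr";

interface FetchedDocument {
  mimeType: string;
  // Text recovered from the PDF text layer, if a parse was already attempted.
  parsedText?: string;
}

function chooseExtractionPath(
  doc: FetchedDocument,
  minTextChars = 50 // assumption: below this, treat the PDF as a scan
): ExtractionPath {
  // Images have no text layer, so they always need OCR.
  if (doc.mimeType.startsWith("image/")) {
    return "ocr";
  }
  if (doc.mimeType === "application/pdf") {
    // No parse attempted yet: try the cheaper text-layer parse first.
    if (doc.parsedText === undefined) {
      return "pdf-parse";
    }
    // A parse that yields almost no text suggests a scanned PDF.
    return doc.parsedText.trim().length >= minTextChars ? "pdf-parse" : "ocr";
  }
  throw new Error(`Unsupported MIME type: ${doc.mimeType}`);
}
```

Trying the text layer first keeps OCR, which is the slower and billed path, for documents that actually need it.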

The extraction, repair, budgeting, and session management rely on REAA’s npm packages: @reaatech/media-pipeline-mcp-core, @reaatech/media-pipeline-mcp-doc-extraction, @reaatech/structured-repair-core, @reaatech/session-continuity, and @reaatech/agent-budget-engine.

Prerequisites

Node.js 22+ and pnpm 10 installed
An Anthropic API key for Claude (set as ANTHROPIC_API_KEY)
A Salesforce instance with an OAuth access token (SALESFORCE_INSTANCE_URL and SALESFORCE_ACCESS_TOKEN)
AWS credentials for Textract OCR (AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
A Langfuse account (optional — for LLM tracing)
Basic familiarity with TypeScript, Next.js App Router, and REST APIs
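Before wiring anything together, it can help to fail fast when a required variable is missing. The sketch below uses the variable names from the list above; the `findMissingEnvVars` helper is hypothetical and takes the environment as a parameter so it can be exercised without touching the real process environment. Langfuse settings are left out because that account is optional.

```typescript
// Variable names come from the prerequisites; the helper itself is hypothetical.
const REQUIRED_ENV_VARS = [
  "ANTHROPIC_API_KEY",
  "SALESFORCE_INSTANCE_URL",
  "SALESFORCE_ACCESS_TOKEN",
  "AWS_REGION",
  "AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY",
] as const;

type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function findMissingEnvVars(env: Env): string[] {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

In a Node.js process this would typically be called once at startup as `findMissingEnvVars(process.env)`, logging or throwing if the returned list is not empty.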

Step 1: Scaffold the Next.js project

Start with a fresh Next.js App Router project. This gives you the right tsconfig, ESLint config, and project structure with next@16.2.9, react@19.2.4, and react-dom@19.2.4 already installed.

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

189 kB · 76 tests · 100.0% coverage · vitest passing

SHA-256: dd4744de11fc04e5b6a1d36a86371084ed5daff56da186c72605ebe51fce4438

Step 4: Define the Zod schemas and TypeScript types

import { z } from "zod";

// ─── Zod Schemas ──────────────────────────────────────────────

// Describes one field the extractor should look for.
export const ContractFieldSchema = z.object({
  name: z.string(),
  type: z.enum(["string", "number", "date", "boolean", "array"]),
  description: z.string().optional(),
});

// Body accepted by the extraction API route.
export const ContractExtractionRequestSchema = z.object({
  documentId: z.string(),
  sessionId: z.string().optional(),
});

// Every field is optional because a given contract may omit any of them.
export const ExtractedContractSchema = z.object({
  contract_value: z.number().optional(),
  effective_date: z.string().optional(),
  expiration_date: z.string().optional(),
  parties: z
    .array(z.object({ name: z.string(), role: z.string().optional() }))
    .optional(),
  signatory_details: z
    .array(
      z.object({
        name: z.string(),
        title: z.string().optional(),
        signed_date: z.string().optional(),
      })
    )
    .optional(),
  contract_terms: z.string().optional(),
  governing_law: z.string().optional(),
  renewal_terms: z.string().optional(),
});

// Envelope returned to the caller; data is null when extraction fails.
export const ExtractionResultSchema = z.object({
  success: z.boolean(),
  data: ExtractedContractSchema.nullable(),
  documentId: z.string(),
  sessionId: z.string().optional(),
  error: z.string().optional(),
  repairSteps: z.array(z.string()).optional(),
  cost_usd: z.number().optional(),
});

// ─── Inferred TypeScript Types ────────────────────────────────

export type ContractField = z.infer<typeof ContractFieldSchema>;
export type ContractExtractionRequest = z.infer<
  typeof ContractExtractionRequestSchema
>;
export type ExtractedContract = z.infer<typeof ExtractedContractSchema>;
export type ExtractionResult = z.infer<typeof ExtractionResultSchema>;

// ─── Canonical Field Definitions ──────────────────────────────

export const CONTRACT_FIELDS: ContractField[] = [
  { name: "contract_value", type: "number", description: "The total contract value in dollars" },
  { name: "effective_date", type: "date", description: "The contract effective date" },
  { name: "expiration_date", type: "date", description: "The contract expiration or end date" },
  { name: "parties", type: "array", description: "The parties to the contract with names and roles" },
  { name: "signatory_details", type: "array", description: "The signatories with names, titles, and signed dates" },
  { name: "contract_terms", type: "string", description: "Key terms and conditions of the contract" },
  { name: "governing_law", type: "string", description: "The governing law or jurisdiction" },
  { name: "renewal_terms", type: "string", description: "Auto-renewal and termination notice terms" },
];
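One way a field list shaped like `CONTRACT_FIELDS` could feed the extraction prompt is to render each entry as a bullet the model can follow. The sketch below is an assumption about how that might look. `buildFieldInstructions` is a hypothetical helper, the wording of the instructions is illustrative, and the field type is redeclared locally so the block stands alone without Zod.

```typescript
// Local copy of the field shape so the sketch is self-contained.
type FieldType = "string" | "number" | "date" | "boolean" | "array";

interface ContractField {
  name: string;
  type: FieldType;
  description?: string;
}

// Hypothetical helper: renders field definitions as prompt instructions.
function buildFieldInstructions(fields: ContractField[]): string {
  const lines = fields.map((field) => {
    const detail = field.description ? `: ${field.description}` : "";
    return `- ${field.name} (${field.type})${detail}`;
  });
  return [
    "Extract the following fields and reply with a single JSON object.",
    "Omit any field that does not appear in the document.",
    ...lines,
  ].join("\n");
}

const sampleFields: ContractField[] = [
  { name: "contract_value", type: "number", description: "The total contract value in dollars" },
  { name: "governing_law", type: "string" },
];

console.log(buildFieldInstructions(sampleFields));
```

Asking the model to omit absent fields lines up with a schema in which every extracted field is optional, so a sparse but valid response still passes validation.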

Step 12: Build the media pipeline layer

import { ArtifactRegistry, type Artifact } from "@reaatech/media-pipeline-mcp-core";
import {
  createDocumentExtractionOperations,
  type DocumentExtractionOperations,
} from "@reaatech/media-pipeline-mcp-doc-extraction";

// In-memory artifact store: contents live only for the lifetime of the process.
class ArtifactStore {
  private data = new Map<string, Buffer>();

  put(id: string, content: Buffer): Promise<string> {
    this.data.set(id, content);
    return Promise.resolve(id);
  }

  get(id: string): Promise<{
    data: Buffer;
    meta: { type: "document" | "image" | "video" | "audio" | "text"; mimeType: string };
  }> {
    const content = this.data.get(id);
    if (!content) {
      return Promise.reject(new Error("Artifact not found: " + id));
    }
    return Promise.resolve({
      data: content,
      meta: { type: "document" as const, mimeType: "application/octet-stream" },
    });
  }

  getSignedUrl(id: string): Promise<string> {
    return Promise.reject(
      new Error("getSignedUrl not implemented (in-memory store): " + id)
    );
  }

  delete(id: string): Promise<void> {
    this.data.delete(id);
    return Promise.resolve();
  }

  list(): Promise<
    Array<{ id: string; type: "document" | "image" | "video" | "audio" | "text"; mimeType: string }>
  > {
    return Promise.resolve(
      Array.from(this.data.entries()).map(([id]) => ({
        id,
        type: "document" as const,
        mimeType: "application/octet-stream",
      }))
    );
  }

  healthCheck(): Promise<boolean> {
    return Promise.resolve(true);
  }
}

export { ArtifactStore };

export function createPipeline(registry: ArtifactRegistry) {
  const artifactStore = new ArtifactStore();
  const docOps = createDocumentExtractionOperations(registry, artifactStore);

  // Registers the file with the registry and stores its bytes under the new artifact ID.
  function registerArtifact(
    buffer: Buffer,
    fileName: string,
    mimeType: string
  ): string {
    const artifact = registry.register({
      type: "document",
      mimeType,
      uri: "memory:",
      metadata: { fileName },
      sourceStep: "ingest",
    });
    // put() writes to the Map synchronously, so the promise can be ignored here.
    void artifactStore.put(artifact.id, buffer);
    return artifact.id;
  }

  return { registerArtifact, docOps, artifactStore };
}

export async function extractFieldsViaPipeline(
  docOps: DocumentExtractionOperations,
  artifactId: string,
  fields: Array<{
    name: string;
    type: "string" | "number" | "date" | "boolean" | "array";
    description?: string;
  }>
): Promise<Artifact> {
  return docOps.extractFields({ artifactId, fields });
}

// Reads an artifact back as JSON; JSON.parse throws if the content is malformed.
export async function getArtifactContent(
  store: ArtifactStore,
  artifact: Artifact
): Promise<unknown> {
  const { data } = await store.get(artifact.id);
  const text = data.toString("utf-8");
  return JSON.parse(text);
}
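`getArtifactContent` calls `JSON.parse` directly, so malformed model output surfaces as a thrown exception. Since the tutorial sends output through a repair stage, one possible alternative is to return the raw text alongside the error so a later step can attempt a repair. The `parseArtifactJson` helper and `ParseResult` type below are hypothetical and are not part of the REAA packages.

```typescript
// Hypothetical result type: either the parsed value or the raw text plus the error.
type ParseResult =
  | { ok: true; value: unknown }
  | { ok: false; error: string; raw: string };

// Parses JSON without throwing, keeping the raw text available for a repair step.
function parseArtifactJson(raw: string): ParseResult {
  try {
    return { ok: true, value: JSON.parse(raw) as unknown };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message, raw };
  }
}

const truncated = parseArtifactJson('{"contract_value": 12000');
if (!truncated.ok) {
  // The raw string is still available here for a repair attempt.
  console.log("parse failed:", truncated.error);
}
```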

Step 15: Create the API route handler

import { type NextRequest, NextResponse } from "next/server";
import { ArtifactRegistry } from "@reaatech/media-pipeline-mcp-core";
import { ContractExtractionRequestSchema } from "../../../src/types/index.js";
import { getSalesforceConnection } from "../../../src/api/salesforce.js";
import { createTextractClient } from "../../../src/api/textract.js";
import { createAnthropicClient } from "../../../src/api/anthropic.js";
import { createBudgetController } from "../../../src/lib/budget.js";
import { createSessionManager } from "../../../src/lib/session.js";
import { createPipeline } from "../../../src/lib/pipeline.js";
import { ExtractionOrchestrator } from "../../../src/services/extractor.js";

export async function POST(req: NextRequest): Promise<NextResponse> {
  try {
    // Parse the body separately so malformed JSON yields a 400 rather than a 500.
    let body: Record<string, unknown>;
    try {
      const raw: unknown = await req.json();
      body = raw as Record<string, unknown>;
    } catch {
      return NextResponse.json(
        { error: "invalid_request", details: "Request body must be valid JSON" },
        { status: 400 }
      );
    }

    // Validate the request shape against the Zod schema.
    const parsed = ContractExtractionRequestSchema.safeParse(body);
    if (!parsed.success) {
      return NextResponse.json(
        { error: "invalid_request", details: "Validation failed" },
        { status: 400 }
      );
    }

    const { documentId, sessionId: requestSessionId } = parsed.data;

    // Build the clients and controllers the orchestrator depends on.
    const salesforceConn = getSalesforceConnection({
      instanceUrl: process.env.SALESFORCE_INSTANCE_URL ?? "",
      accessToken: process.env.SALESFORCE_ACCESS_TOKEN ?? "",
    });
    const textractClient = createTextractClient(
      process.env.AWS_REGION ?? "us-east-1"
    );
    const anthropicClient = createAnthropicClient();
    const budgetController = createBudgetController();
    const sessionManager = createSessionManager();
    const registry = new ArtifactRegistry();
    const { registerArtifact, docOps, artifactStore } = createPipeline(registry);

    const orchestrator = new ExtractionOrchestrator(
      salesforceConn,
      textractClient,
      anthropicClient,
      budgetController,
      sessionManager,
      registry,
      docOps,
      artifactStore,
      registerArtifact
    );

    const result = await orchestrator.extractDocument(documentId);
    // Prefer the caller's session ID when one was supplied.
    result.sessionId = requestSessionId ?? result.sessionId;

    // 200 for a successful extraction, 422 when the document could not be extracted.
    return NextResponse.json(result, {
      status: result.success ? 200 : 422,
    });
  } catch (err) {
    console.error("Extraction route error:", err);
    return NextResponse.json(
      {
        error: "internal_error",
        message: err instanceof Error ? err.message : "unknown",
      },
      { status: 500 }
    );
  }
}

// Lightweight health check.
export function GET(): NextResponse {
  return NextResponse.json({ status: "ok" });
}

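The route handler distinguishes four outcomes by status code: 200 for a successful extraction, 422 when extraction ran but failed, 400 for an invalid request, and 500 for an unexpected error. A caller might fold those into one typed result, roughly as sketched here. `interpretExtractionResponse` and `RouteOutcome` are hypothetical names, the body keys read (`data`, `error`, `details`, `message`) are the ones the handler emits, and the fallback messages are illustrative.

```typescript
// Hypothetical client-side view of the route's possible outcomes.
type RouteOutcome =
  | { kind: "extracted"; data: unknown }
  | { kind: "extraction_failed"; reason: string }
  | { kind: "bad_request"; reason: string }
  | { kind: "server_error"; reason: string };

// Maps an HTTP status and a parsed JSON body onto a typed outcome.
function interpretExtractionResponse(
  status: number,
  body: Record<string, unknown>
): RouteOutcome {
  const text = (value: unknown, fallback: string): string =>
    typeof value === "string" ? value : fallback;

  switch (status) {
    case 200:
      return { kind: "extracted", data: body.data ?? null };
    case 422:
      return { kind: "extraction_failed", reason: text(body.error, "extraction failed") };
    case 400:
      return { kind: "bad_request", reason: text(body.details, "invalid request") };
    default:
      return { kind: "server_error", reason: text(body.message, `unexpected status ${status}`) };
  }
}
```

Keeping 422 separate from 500 lets a caller tell a document that could not be extracted, which may be worth flagging for manual entry, from an infrastructure failure that is worth retrying.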
Intro

Prerequisites

Step 1: Scaffold the Next.js project

Step 2: Install dependencies

Step 3: Set up environment variables

Step 4: Define the Zod schemas and TypeScript types

Step 5: Build the Salesforce integration

Step 6: Build the PDF parser

Step 7: Build the Textract OCR service

Step 8: Build the Anthropic extraction service

Step 9: Build the structured output repair service

Step 10: Build the budget controller

Step 11: Build the session continuity module

Step 12: Build the media pipeline layer

Step 13: Build the Langfuse telemetry module

Step 14: Wire the extraction orchestrator

Step 15: Create the API route handler

Step 16: Create the barrel export

Step 17: Set up Vitest with MSW

Step 18: Run the tests

Next steps