Anthropic Salesforce Contract Extraction for SMB Sales
Automatically extracts key fields like value, dates, and parties from Salesforce contracts and proposals, eliminating manual data entry for SMB sales teams.
Small sales teams keep contracts and proposals as PDFs or scanned documents inside Salesforce, but pulling out amounts, effective dates, and signatory details manually is slow, error-prone, and inconsistent.
A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.
This tutorial walks you through building a contract extraction pipeline for SMB sales teams. You’ll create a Next.js API that takes a Salesforce document ID, fetches the file (PDF or image), and extracts the text through PDF parsing or OCR. The cleaned text goes to Anthropic’s Claude with a structured extraction prompt, and the JSON output is repaired by a graduated repair engine before being returned. Along the way you’ll enforce per-document budget caps, manage multi-page context across token windows, and trace every run through Langfuse for observability.
The extraction, repair, budgeting, and session management rely on REAA’s npm packages: @reaatech/media-pipeline-mcp-core, @reaatech/media-pipeline-mcp-doc-extraction, @reaatech/structured-repair-core, @reaatech/session-continuity, and @reaatech/agent-budget-engine.
Prerequisites
Node.js 22+ and pnpm 10 installed
An Anthropic API key for Claude (set as ANTHROPIC_API_KEY)
A Salesforce instance with OAuth access token (SALESFORCE_INSTANCE_URL and SALESFORCE_ACCESS_TOKEN)
AWS credentials for Textract OCR (AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
A Langfuse account (optional — for LLM tracing)
Basic familiarity with TypeScript, Next.js App Router, and REST APIs
Step 1: Scaffold the Next.js project
Start with a fresh Next.js App Router project. This gives you the right tsconfig, ESLint config, and project structure with next@16.2.9, react@19.2.4, and react-dom@19.2.4 already installed.
Expected output: node_modules/ populated and pnpm-lock.yaml generated. The package.json dependencies and devDependencies sections are exact-pinned.
Step 3: Set up environment variables
Create a .env.example at the project root with placeholder values for every integration:
env
# Env vars used by anthropic-salesforce-contract-extraction-for-smb-sales.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only — never commit real values.
NODE_ENV=development

# Anthropic Claude API key (required)
ANTHROPIC_API_KEY=<your-anthropic-api-key>

# Salesforce OAuth connection
SALESFORCE_INSTANCE_URL=<https://your-instance.salesforce.com>
SALESFORCE_ACCESS_TOKEN=<your-sf-oauth-access-token>

# AWS Textract for OCR
AWS_REGION=<us-east-1>
AWS_ACCESS_KEY_ID=<your-aws-access-key>
AWS_SECRET_ACCESS_KEY=<your-aws-secret-key>

# Langfuse LLM tracing (optional)
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_BASE_URL=<https://cloud.langfuse.com>
Copy this to .env and fill in real values for local testing:
terminal
cp .env.example .env
Expected output: The .env.example file with all variable names. Your .env file (gitignored) has real credentials.
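Because most of these variables gate a required integration, it can help to fail fast when one is missing. The following is a minimal sketch rather than part of the recipe's code: `REQUIRED_ENV_VARS` and `findMissingEnvVars` are hypothetical names, and the list assumes the Langfuse keys stay optional as described above.

```typescript
// Hypothetical helper: these names are illustrative and do not come from the recipe's source.
const REQUIRED_ENV_VARS = [
  "ANTHROPIC_API_KEY",
  "SALESFORCE_INSTANCE_URL",
  "SALESFORCE_ACCESS_TOKEN",
  "AWS_REGION",
  "AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY",
] as const;

type EnvLike = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function findMissingEnvVars(env: EnvLike): string[] {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

Calling `findMissingEnvVars(process.env)` at startup and throwing when the result is non-empty would surface configuration mistakes before the first extraction request.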
Step 4: Define the Zod schemas and TypeScript types
Create src/types/index.ts. This file holds every schema and type your pipeline uses — Zod schemas for validation plus inferred TypeScript types. The ExtractedContractSchema defines exactly what fields Claude should extract from each contract.
ts
import { z } from "zod";

// ─── Zod Schemas ──────────────────────────────────────────────
export const ContractFieldSchema = z.object({
  name: z.string(),
  type: z.enum(["string", "number", "date", "boolean", "array"]),
  description: z.string().optional(),
});

export const ContractExtractionRequestSchema = z.object({
  documentId: z.string(),
  sessionId: z.string().optional(),
});

export const ExtractedContractSchema = z.object({
  contract_value: z.number().optional(),
  effective_date: z.string().optional(),
  expiration_date: z.string().optional(),
  parties: z
    .array(z.object({ name: z.string(), role: z.string().optional() }))
    .optional(),
  signatory_details: z
    .array(
      z.object({
        name: z.string(),
        title: z.string().optional(),
        signed_date: z.string().optional(),
      })
    )
    .optional(),
  contract_terms: z.string().optional(),
  governing_law: z.string().optional(),
  renewal_terms: z.string().optional(),
});

export const ExtractionResultSchema = z.object({
  success: z.boolean(),
  data: ExtractedContractSchema.nullable(),
  documentId: z.string(),
  sessionId: z.string().optional(),
  error: z.string().optional(),
  repairSteps: z.array(z.string()).optional(),
  cost_usd: z.number().optional(),
});

// ─── Inferred TypeScript Types ────────────────────────────────
export type ContractField = z.infer<typeof ContractFieldSchema>;
export type ContractExtractionRequest = z.infer<
  typeof ContractExtractionRequestSchema
>;
export type ExtractedContract = z.infer<typeof ExtractedContractSchema>;
export type ExtractionResult = z.infer<typeof ExtractionResultSchema>;

// ─── Canonical Field Definitions ──────────────────────────────
export const CONTRACT_FIELDS: ContractField[] = [
  { name: "contract_value", type: "number", description: "The total contract value in dollars" },
  { name: "effective_date", type: "date", description: "The contract effective date" },
  { name: "expiration_date", type: "date", description: "The contract expiration or end date" },
  { name: "parties", type: "array", description: "The parties to the contract with names and roles" },
  { name: "signatory_details", type: "array", description: "The signatories with names, titles, and signed dates" },
  { name: "contract_terms", type: "string", description: "Key terms and conditions of the contract" },
  { name: "governing_law", type: "string", description: "The governing law or jurisdiction" },
  { name: "renewal_terms", type: "string", description: "Auto-renewal and termination notice terms" },
];
Expected output: TypeScript compiles this without errors. ExtractedContractSchema is the single source of truth for what fields Claude will be asked to extract and what the repair engine will validate against.
Step 5: Build the Salesforce integration
Create src/api/salesforce.ts. This module handles authentication via jsforce and two operations: fetching a document’s binary by its ContentVersion ID, and querying attached documents by a parent record ID.
getSalesforceConnection creates a jsforce.Connection with the access token already set — no login step needed for OAuth-authenticated sessions.
fetchDocumentBinary enforces a 10 MB size limit to prevent large documents from blowing out your API timeout or token budget.
The MIME-type mapping covers PDF, common image formats, Office documents, and plain text — anything else gets application/octet-stream and will be rejected downstream.
Expected output: tsc --noEmit passes. The module exports custom error classes (SalesforceDocumentNotFoundError, DocumentTooLargeError) and typed interfaces.
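To make the size limit and MIME fallback described above concrete, here is a small sketch. It is not the recipe's actual table: `MIME_BY_EXTENSION`, `mimeTypeForExtension`, and `assertWithinSizeLimit` are hypothetical names, the extension list is only a sample, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption.

```typescript
// Assumption: "10 MB" is read as 10 * 1024 * 1024 bytes; the recipe may use a different constant.
const MAX_DOCUMENT_BYTES = 10 * 1024 * 1024;

// Illustrative sample only; the recipe's real mapping is not shown in the text.
const MIME_BY_EXTENSION: Record<string, string> = {
  pdf: "application/pdf",
  png: "image/png",
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
  tiff: "image/tiff",
  txt: "text/plain",
  docx: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
};

// Unknown extensions fall back to application/octet-stream so later stages can reject them.
function mimeTypeForExtension(extension: string): string {
  const key = extension.trim().toLowerCase().replace(/^\./, "");
  return MIME_BY_EXTENSION[key] ?? "application/octet-stream";
}

// Throws before any expensive parsing or LLM call happens on an oversized file.
function assertWithinSizeLimit(sizeBytes: number): void {
  if (sizeBytes > MAX_DOCUMENT_BYTES) {
    throw new Error(
      `Document is ${sizeBytes} bytes; the limit is ${MAX_DOCUMENT_BYTES} bytes`
    );
  }
}
```

Checking the size before decoding the file keeps an oversized upload from consuming the request's time or token budget.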
Step 6: Build the PDF parser
Create src/api/pdf.ts. This module uses pdfjs-dist to extract text content from PDF buffers, page by page.
Text is joined per page with [Page N] markers — this is how the session continuity layer knows where page boundaries are.
Individual page parse failures are silently skipped (the continue in the catch block) so a single corrupted page doesn’t lose the entire document.
The import uses pdfjs-dist/legacy/build/pdf.mjs — the ESM entry point required in a Node.js module environment.
Expected output: TypeScript compiles cleanly.
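The page-marker and skip-on-failure behaviour described above can be sketched independently of the PDF library. In this hedged example, `loadPageText` is a hypothetical stand-in for whatever pdfjs-dist calls read a single page (for instance `getPage` followed by `getTextContent`); the recipe's real parsePdf may differ in its details.

```typescript
// `loadPageText` is a hypothetical stand-in for the per-page pdfjs-dist calls.
type PageLoader = (pageNumber: number) => Promise<string>;

async function extractPagesWithMarkers(
  pageCount: number,
  loadPageText: PageLoader
): Promise<string> {
  const sections: string[] = [];
  for (let pageNumber = 1; pageNumber <= pageCount; pageNumber++) {
    try {
      const text = (await loadPageText(pageNumber)).trim();
      // Prefix each page so later stages can find page boundaries.
      sections.push(`[Page ${pageNumber}]\n${text}`);
    } catch {
      // A page that fails to parse is skipped so the rest of the document survives.
      continue;
    }
  }
  return sections.join("\n\n");
}
```

Note that skipped pages leave a gap in the numbering, so a consumer can tell that a page was dropped rather than empty.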
Step 7: Build the Textract OCR service
Create src/api/textract.ts. This module wraps AWS Textract’s DetectDocumentText API for extracting text from scanned images.
ts
import {
  TextractClient,
  DetectDocumentTextCommand,
} from "@aws-sdk/client-textract";

export class TextractServiceError extends Error {
  readonly code = "TEXTRACT_SERVICE_ERROR";

  constructor(message: string, cause?: unknown) {
    super(message);
    this.name = "TextractServiceError";
    this.cause = cause;
  }
}

export interface TextractBlock {
  text: string;
  confidence: number;
  page: number;
}

export interface TextractResult {
  fullText: string;
  blocks: TextractBlock[];
}

export function createTextractClient(region: string): TextractClient {
  return new TextractClient({ region });
}

export async function detectText(
  client: TextractClient,
  buffer: Buffer
): Promise<TextractResult> {
  let response: unknown;
  try {
    response = await client.send(
      new DetectDocumentTextCommand({
        Document: { Bytes: buffer },
      })
    );
  } catch (err) {
    // Wrap SDK failures so callers see a single error type with the original cause attached.
    throw new TextractServiceError(
      `Textract API error: ${err instanceof Error ? err.message : String(err)}`,
      err
    );
  }

  const rawBlocks = (response as { Blocks?: Array<Record<string, unknown>> }).Blocks;
  if (!rawBlocks || rawBlocks.length === 0) {
    return { fullText: "", blocks: [] };
  }

  // Group LINE blocks by page number; all other block types are ignored.
  const pageMap = new Map<number, string[]>();
  const blocks: TextractBlock[] = [];
  for (const b of rawBlocks) {
    if (b["BlockType"] === "LINE") {
      const text = (b["Text"] as string) || "";
      const confidence = (b["Confidence"] as number) || 0;
      const page = (b["Page"] as number) || 1;
      blocks.push({ text, confidence, page });
      const existing = pageMap.get(page);
      if (existing) {
        existing.push(text);
      } else {
        pageMap.set(page, [text]);
      }
    }
  }

  // Emit pages in ascending order, each prefixed with a [Page N] marker.
  const fullText = Array.from(pageMap.entries())
    .sort(([a], [b]) => a - b)
    .map(([pageNum, lines]) => "[Page " + String(pageNum) + "]\n" + lines.join("\n"))
    .join("\n\n");

  return { fullText, blocks };
}
Key details:
Only LINE blocks are extracted — other Textract block types (WORD, TABLE, KEY_VALUE_SET) are filtered out because the pipeline only needs linear text for LLM extraction.
Output is organized by page with [Page N] markers, matching the PDF parser’s format so the orchestrator can treat both paths identically.
Expected output: TypeScript compiles. The AWS SDK v3 client is instantiated with just a region — it picks up credentials from the environment (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY).
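Since both extraction paths emit the same [Page N] markers, a downstream consumer can split the combined text back into pages with one routine. The sketch below shows one way to do that; `splitByPageMarkers` and `PageText` are hypothetical names, and the recipe's orchestrator may split pages differently.

```typescript
interface PageText {
  page: number;
  text: string;
}

// Splits text of the form "[Page 1]\n...\n\n[Page 2]\n..." into per-page entries.
function splitByPageMarkers(fullText: string): PageText[] {
  const marker = /^\[Page (\d+)\]$/;
  const pages: PageText[] = [];
  let current: PageText | null = null;

  for (const line of fullText.split("\n")) {
    const match = marker.exec(line.trim());
    if (match) {
      // A new marker closes the previous page and opens the next one.
      if (current) pages.push(current);
      current = { page: Number(match[1]), text: "" };
    } else if (current) {
      current.text = current.text === "" ? line : `${current.text}\n${line}`;
    }
    // Text that appears before the first marker is ignored in this sketch.
  }
  if (current) pages.push(current);

  return pages.map((p) => ({ page: p.page, text: p.text.trim() }));
}
```

For example, `splitByPageMarkers("[Page 1]\nA\nB\n\n[Page 2]\nC")` yields two entries, one per page, with the blank separator line trimmed away.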
Step 8: Build the Anthropic extraction service
Create src/api/anthropic.ts. This module calls Claude with a structured system prompt and implements exponential-backoff retry for rate limits.
The system prompt explicitly tells Claude to return only valid JSON — no markdown fences, no conversational wrapper. This reduces the repair burden downstream.
Retry logic uses exponential backoff (1s, 2s, 4s) for 429 rate-limit errors, up to 4 total attempts.
AnthropicServiceError carries a retryable boolean so the orchestrator can decide whether to re-queue failed documents.
The model is pinned to claude-sonnet-4-6 — a good balance of extraction quality and cost for SMB contracts.
Expected output: TypeScript compiles. The extractContractFields function handles rate limits gracefully and surfaces structured error metadata.
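The retry schedule described above (waits of 1s, 2s, and 4s across four total attempts) can be sketched as a generic wrapper. This is an illustration under stated assumptions, not the recipe's extractContractFields: `withRateLimitRetry` and `isRateLimitError` are hypothetical names, and the check assumes the thrown error exposes a numeric `status` property equal to 429.

```typescript
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first call
  baseDelayMs: number; // wait before the second attempt; doubles afterwards
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

// Assumption: rate-limit errors carry `status === 429`.
function isRateLimitError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { status?: unknown }).status === 429
  );
}

async function withRateLimitRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions
): Promise<T> {
  const sleep =
    options.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on non-retryable errors or once the attempt budget is spent.
      if (!isRateLimitError(err) || attempt >= options.maxAttempts) throw err;
      await sleep(options.baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

With `maxAttempts: 4` and `baseDelayMs: 1000`, the waits between attempts are 1000 ms, 2000 ms, and 4000 ms, after which the last error is rethrown.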
Step 9: Build the structured output repair service
Create src/services/repair.ts. This module wraps @reaatech/structured-repair-core to fix malformed LLM output before it reaches your application logic.
repairOutput runs six strategies in sequence — stripping markdown fences, extracting JSON from conversational wrappers, fixing truncated JSON syntax, coercing types (string to number, etc.), fuzzy-matching misnamed keys, and removing extra fields that don’t exist in the schema.
repairContractOutput is the primary entry point used by the orchestrator. It returns a success boolean, the repaired data, a detailed steps array, and per-field fieldErrors if repair partially fails.
diagnoseLlmOutput uses analyzeInput for debugging — call it separately to inspect raw LLM output without applying repairs.
Expected output: tsc --noEmit passes.
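To give a feel for what the first two strategies involve, here is a hand-rolled sketch of fence stripping and JSON extraction from a conversational wrapper. It does not use or reproduce the @reaatech/structured-repair-core API; `stripMarkdownFences`, `extractJsonObject`, and `parseLenient` are hypothetical names, and the brace-matching heuristic is deliberately naive.

```typescript
// Built at runtime so this listing does not contain a literal code fence.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Strategy 1: if the output is wrapped in a markdown fence, keep only its body.
function stripMarkdownFences(raw: string): string {
  const match = FENCED_BLOCK.exec(raw);
  return (match?.[1] ?? raw).trim();
}

// Strategy 2: take the span from the first "{" to the last "}".
function extractJsonObject(raw: string): string | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  return start !== -1 && end > start ? raw.slice(start, end + 1) : null;
}

function parseLenient(raw: string): unknown {
  const candidate = extractJsonObject(stripMarkdownFences(raw));
  if (candidate === null) {
    throw new Error("No JSON object found in model output");
  }
  return JSON.parse(candidate);
}
```

A heuristic like this covers the common wrapper cases but not truncated JSON, type coercion, or misnamed keys, which is why the later strategies in the sequence exist.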
Step 10: Build the budget controller
Create src/lib/budget.ts. This module wraps @reaatech/agent-budget-engine to cap per-document spend with soft/hard caps and auto-downgrade rules.
ts
import { BudgetController } from "@reaatech/agent-budget-engine";
import { SpendStore } from "@reaatech/agent-budget-spend-tracker";
import { type BudgetScope } from "@reaatech/agent-budget-types";

// Each document is tracked as its own "session" scope.
const DOCUMENT_SCOPE = "session" as BudgetScope;

export function createBudgetController(): BudgetController {
  const store = new SpendStore();
  return new BudgetController({ spendTracker: store });
}

export function defineDocumentBudget(
  controller: BudgetController,
  documentId: string,
  limitUsd: number = 0.50
): void {
  controller.defineBudget({
    scopeType: DOCUMENT_SCOPE,
    scopeKey: documentId,
    limit: limitUsd,
    policy: {
      softCap: 0.8, // warn at 80% of the limit
      hardCap: 1.0, // block at 100% of the limit
      autoDowngrade: [{ from: ["claude-opus-4-7"], to: "claude-sonnet-4-6" }],
    },
  });
}

export function checkBudget(
  controller: BudgetController,
  documentId: string,
  estimatedCost: number
): { allowed: boolean; action: string } {
  const result = controller.check({
    scopeType: DOCUMENT_SCOPE,
    scopeKey: documentId,
    estimatedCost,
    modelId: "claude-sonnet-4-6",
    tools: [],
  });
  return { allowed: result.allowed, action: result.action };
}

export function recordSpend(
  controller: BudgetController,
  documentId: string,
  cost: number,
  inputTokens: number,
  outputTokens: number
): void {
  controller.record({
    requestId: crypto.randomUUID(),
    scopeType: DOCUMENT_SCOPE,
    scopeKey: documentId,
    cost,
    inputTokens,
    outputTokens,
    modelId: "claude-sonnet-4-6",
    provider: "anthropic",
    timestamp: new Date(),
  });
}

export function getBudgetState(
  controller: BudgetController,
  documentId: string
): { spent: number; remaining: number; state: string } {
  const state = controller.getState(DOCUMENT_SCOPE, documentId);
  // Fall back to neutral values when no budget has been defined for this document.
  return {
    spent: state?.spent ?? 0,
    remaining: state?.remaining ?? 0,
    state: state?.state ?? "unknown",
  };
}
Key decisions:
Each document gets its own budget scope keyed by documentId, with a default $0.50 limit.
The policy uses an 80% soft cap and a 100% hard cap — at 80% spend the engine emits a warning, at 100% it stops further requests.
The autoDowngrade rule catches any call to claude-opus-4-7 and redirects it to claude-sonnet-4-6 — a safety net if a downstream caller requests an expensive model.
getBudgetState lets you report remaining budget in API responses so callers can see how close they are to the cap.
Expected output: TypeScript compiles.
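The soft-cap and hard-cap arithmetic can be illustrated with a standalone function. This is a sketch of the idea only: `decideBudgetAction` is a hypothetical name, and comparing projected spend (spent plus estimated cost) against the caps is an assumption about how such a policy might be evaluated, not a description of @reaatech/agent-budget-engine internals.

```typescript
type BudgetAction = "allow" | "warn" | "deny";

interface CapPolicy {
  softCap: number; // fraction of the limit at which to warn, e.g. 0.8
  hardCap: number; // fraction of the limit at which to block, e.g. 1.0
}

// Assumption: the decision is based on spend so far plus the estimated cost of the next call.
function decideBudgetAction(
  spentUsd: number,
  estimatedCostUsd: number,
  limitUsd: number,
  policy: CapPolicy
): BudgetAction {
  const projected = spentUsd + estimatedCostUsd;
  if (projected > limitUsd * policy.hardCap) return "deny";
  if (projected >= limitUsd * policy.softCap) return "warn";
  return "allow";
}
```

With the default $0.50 limit, the soft cap sits at $0.40 and the hard cap at $0.50, so a first call estimated at $0.10 is allowed, while the same call after $0.45 of spend would be denied.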
Step 11: Build the session continuity module
Create src/lib/session.ts. This module uses @reaatech/session-continuity to manage multi-page document context across token windows using an in-memory storage adapter.
ts
import {
  SessionManager,
  type Session,
  type Message,
  type TokenCounter,
  type ConversationContextResult,
} from "@reaatech/session-continuity";

// ─── Token Counter ───────────────────────────────────────────
const simpleTokenCounter: TokenCounter = {
  count(text: string): number {
    return Math.ceil(text.length / 4);
  },
  countMessages(messages: Message[]): number {
    return messages.reduce((sum, m) => {
Key decisions:
The MemoryStorageAdapter implements the storage interface expected by SessionManager — fully in-memory for this recipe. In production you’d swap this for a database-backed adapter.
Token budget is set to 128K with 4K reserved for overhead; when the budget is exceeded, the "compress" overflow strategy and "sliding_window" compression drop older messages to keep context under 120K tokens.
Each page of the document is added as a separate user message with a [Page N]: prefix. The SessionManager tracks the total token count and automatically compresses when approaching the limit.
Lifecycle helpers (initContractSession, addPageContent, getFullContext, endContractSession) abstract away the raw SessionManager API so the orchestrator only deals with document-level concepts.
Expected output: TypeScript compiles.
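As a rough illustration of sliding-window compression, the sketch below keeps the most recent messages that fit in a token budget, using the same 4-characters-per-token estimate as the token counter above. `slidingWindow`, `ChatMessage`, and `estimateTokens` are hypothetical names; the real SessionManager may select messages differently.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Same rough heuristic as the recipe's token counter: about 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Keeps the newest messages whose combined estimate fits within maxTokens.
function slidingWindow(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let total = 0;
  // Walk from newest to oldest and stop at the first message that would overflow.
  for (const message of [...messages].reverse()) {
    const cost = estimateTokens(message.content);
    if (total + cost > maxTokens) break;
    kept.unshift(message);
    total += cost;
  }
  return kept;
}
```

One consequence worth keeping in mind for contracts: dropping the oldest pages first means fields that usually appear early (parties, effective date) are the first to fall out of context on very long documents.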
Step 12: Build the media pipeline layer
Create src/lib/pipeline.ts. This module wraps @reaatech/media-pipeline-mcp-core and @reaatech/media-pipeline-mcp-doc-extraction for artifact management and field extraction.
The ArtifactRegistry from @reaatech/media-pipeline-mcp-core registers each document as an artifact with a unique ID and metadata. The ArtifactStore is a simple in-memory buffer store.
createDocumentExtractionOperations from @reaatech/media-pipeline-mcp-doc-extraction wires the registry and store together, giving you a docOps.extractFields() method that can run extraction pipeline steps.
registerArtifact both registers the artifact in the registry and stores its binary in one call — the orchestrator uses this after fetching from Salesforce.
Expected output: TypeScript compiles.
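The register-and-store behaviour described above can be pictured with a tiny in-memory class. This sketch is not the @reaatech/media-pipeline-mcp-core API: `InMemoryArtifacts` and `ArtifactMeta` are hypothetical, and the counter-based IDs are a simplification chosen to keep the example dependency-free.

```typescript
interface ArtifactMeta {
  id: string;
  documentId: string;
  mimeType: string;
  sizeBytes: number;
}

// Hypothetical stand-in combining a metadata registry and a binary store.
class InMemoryArtifacts {
  private readonly meta = new Map<string, ArtifactMeta>();
  private readonly data = new Map<string, Uint8Array>();
  private nextId = 1;

  // Registers metadata and stores the binary in a single call.
  register(documentId: string, mimeType: string, bytes: Uint8Array): ArtifactMeta {
    const id = `artifact-${this.nextId++}`;
    const entry: ArtifactMeta = { id, documentId, mimeType, sizeBytes: bytes.byteLength };
    this.meta.set(id, entry);
    this.data.set(id, bytes);
    return entry;
  }

  getMeta(id: string): ArtifactMeta | undefined {
    return this.meta.get(id);
  }

  getBytes(id: string): Uint8Array | undefined {
    return this.data.get(id);
  }
}
```

Keeping metadata and bytes behind one call means a caller cannot end up with a registered artifact that has no stored content.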
Step 13: Build the Langfuse telemetry module
Create src/lib/telemetry.ts. This optional module traces extraction runs to Langfuse for observability.
The module is only invoked when LANGFUSE_SECRET_KEY is set — if Langfuse isn’t configured, tracing is silently skipped.
Each extraction run creates one trace with one generation span, recording token usage, cost, and the list of repair steps that were applied.
shutdownAsync() flushes the event queue so traces are not lost when the process exits after responding.
Expected output: TypeScript compiles. No errors when LANGFUSE_SECRET_KEY is unset — the module is only invoked conditionally.
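The conditional-tracing pattern can be sketched without depending on the Langfuse SDK. In the example below, `TraceSink` is a hypothetical interface standing in for the Langfuse client, with `flush` playing the role that shutdownAsync() plays in the recipe; `traceIfConfigured` is likewise an illustrative name.

```typescript
interface TraceEvent {
  documentId: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  repairSteps: string[];
}

// Hypothetical abstraction over the tracing client.
interface TraceSink {
  record(event: TraceEvent): void;
  flush(): Promise<void>;
}

// Returns true if a trace was sent, false if tracing is not configured.
async function traceIfConfigured(
  env: Record<string, string | undefined>,
  createSink: () => TraceSink,
  event: TraceEvent
): Promise<boolean> {
  // Skip silently when no secret key is present; the sink is never constructed.
  if (!env["LANGFUSE_SECRET_KEY"]) return false;
  const sink = createSink();
  sink.record(event);
  // Flush before returning so events are not lost if the process exits right after.
  await sink.flush();
  return true;
}
```

Passing the sink as a factory means an unconfigured deployment never constructs a client at all, which matches the "silently skipped" behaviour described above.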
Step 14: Wire the extraction orchestrator
Create src/services/extractor.ts. This is the central orchestrator that ties together every module you’ve built. It handles budget checks, document fetching, text extraction (PDF or OCR), session continuity, LLM extraction, output repair, spend recording, and optional Langfuse tracing.
ts
import pLimit from "p-limit";
import type Anthropic from "@anthropic-ai/sdk";
import type { TextractClient } from "@aws-sdk/client-textract";
import type { ExtractedContract, ExtractionResult } from "../types/index.js";
import { ExtractedContractSchema } from "../types/index.js";
import { fetchDocumentBinary, getSalesforceConnection } from "../api/salesforce.js";
import { detectText } from "../api/textract.js";
import { parsePdf } from "../api/pdf.js";
import { extractContractFields } from "../api/anthropic.js";
import {
  defineDocumentBudget,
  checkBudget,
  recordSpend,
How the orchestrator works — step by step:
Budget preflight — Defines a $0.50 budget for the document, then runs a pre-check estimating $0.10. If denied, returns immediately with budget_denied.
Salesforce fetch — Downloads the ContentVersion binary from Salesforce via jsforce.
Artifact registration — Registers the binary in the media pipeline’s artifact registry via registerArtifact.
Text extraction — Routes to parsePdf for PDFs or detectText for images. Returns unsupported_mime or no_text_extracted errors for problematic documents.
Session continuity — Creates a session, adds each page as a message, and retrieves the full context. The SessionManager handles token budget enforcement and sliding-window compression for multi-page documents.
LLM extraction — Sends the full text to Claude with the contract extraction system prompt.
Output repair — Runs the raw LLM output through repairContractOutput with ExtractedContractSchema. If successful, records the spend, ends the session, traces to Langfuse (if configured), and returns the extracted data.
The extractDocuments method uses p-limit to process multiple documents concurrently with a configurable concurrency cap (default 3).
Expected output: TypeScript compiles. The orchestrator is the single entry point that the API route handler instantiates and calls.
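To show what a concurrency cap does, here is a dependency-free sketch of bounded parallel mapping. The recipe itself uses p-limit for this; `mapWithConcurrency` is a hypothetical stand-in written only to illustrate the idea, with the default of 3 mirrored in the usage note rather than in the function.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (true) {
      const index = next++;
      if (index >= items.length) return;
      results[index] = await worker(items[index] as T);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

Calling `mapWithConcurrency(documentIds, 3, extractOne)` would process at most three documents at a time, where `extractOne` is whatever per-document function the caller supplies. In this sketch a single rejected worker rejects the whole batch, so a per-document function that returns an error result instead of throwing is the safer fit.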
Step 15: Create the API route handler
Create app/api/extract/route.ts. This is the Next.js App Router route handler that exposes the extraction pipeline as a REST API.
Uses NextRequest and NextResponse.json() — never bare Request/new Response(JSON.stringify(...)). This ensures the proper Content-Type: application/json header is always set.
Request validation via Zod’s safeParse returns a clear 400 error on malformed input.
On successful extraction the route returns 200; on failed extraction (budget denied, no text, repair failed) it returns 422 — semantically correct for a processing failure that isn’t a server error.
The GET handler provides a simple health-check endpoint at the same path.
Expected output: tsc --noEmit passes. The route handler complies with the Next.js App Router convention — named exports (POST, GET), NextRequest param types, NextResponse.json() return values.
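The status-code rules above (400 for malformed input, 200 for success, 422 for a failed extraction) reduce to a small pure function. This is a sketch with hypothetical names (`httpStatusFor`, `RouteOutcome`), not the recipe's route handler, which would pass the resulting status to NextResponse.json().

```typescript
interface ExtractionOutcome {
  success: boolean;
  error?: string; // e.g. "budget_denied", "no_text_extracted", "repair_failed"
}

// Either the request failed validation, or it was valid and produced an outcome.
type RouteOutcome =
  | { valid: false }
  | { valid: true; outcome: ExtractionOutcome };

function httpStatusFor(result: RouteOutcome): 200 | 400 | 422 {
  if (!result.valid) return 400; // malformed request body
  return result.outcome.success ? 200 : 422; // processed, but extraction failed
}
```

Keeping the mapping in one place makes it easy to unit-test and keeps the handler body focused on parsing the request and calling the orchestrator.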
Step 16: Create the barrel export
Create or update src/index.ts to export the key classes and functions that external consumers (like the route handler) need:
ts
export { ExtractionOrchestrator } from "./services/extractor.js";
export * from "./types/index.js";
export {
  createBudgetController,
  defineDocumentBudget,
  checkBudget,
  recordSpend,
} from "./lib/budget.js";
export {
  createSessionManager,
  initContractSession,
  addPageContent,
  getFullContext,
  endContractSession,
} from "./lib/session.js";
Expected output: tsc --noEmit passes. The barrel export re-exports everything external consumers need without exposing internal implementation details.
Expected output: The test setup starts an MSW test server before all tests, resets handlers between tests, and closes after all tests. Any unhandled HTTP request in tests throws an error — catching unintended network calls.
Step 18: Run the tests
Run the full test suite to verify everything works:
terminal
pnpm test
Expected output: All tests pass across 18 test suites. The coverage report targets 90% on lines, functions, and statements, with 60% on branches — covering runtime code (src/**/*.ts and app/**/route.ts). UI components (*.tsx) are excluded from coverage requirements.
You can also run the type checker and linter separately:
terminal
pnpm typecheck
pnpm lint
Expected output: pnpm typecheck exits 0 with no errors. pnpm lint exits 0 with no warnings.
Next steps
Add database-backed session storage — Replace MemoryStorageAdapter with a PostgreSQL or Redis adapter so sessions survive server restarts and multi-request conversations.
Support batch extraction from Salesforce records — Use the queryContentDocumentByRecordId function to list all documents attached to an Opportunity or Contract record, then call extractDocuments to process them all in parallel with a concurrency cap.
Add a webhook response — Instead of responding synchronously, post the extraction result to a Salesforce Chatter feed or a Slack webhook so sales reps get notified when contract fields are ready.
Implement a retry queue — Documents that fail with budget_denied or repair_failed could be re-queued with a higher budget or a different model. Use getBudgetState to report remaining budget in the response so callers can decide whether to retry.
    );
    this.name = "DocumentTooLargeError";
    this.sizeBytes = sizeBytes;
  }
}

export interface SalesforceConfig {
  instanceUrl: string;
  accessToken: string;
  apiVersion?: string;
}

export interface DocumentRecord {
  documentId: string;
  title: string;
  mimeType: string;
}

export interface DocumentBinary {
  buffer: Buffer;
  fileName: string;
  mimeType: string;
}
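import jsforce from "jsforce";

export function getSalesforceConnection(
  config: SalesforceConfig
) {
  // For an existing OAuth token, pass instanceUrl and accessToken directly;
  // loginUrl only applies to username/password login and would leave the
  // connection without an instance URL for REST calls.
  return new jsforce.Connection({
    instanceUrl: config.instanceUrl,
    accessToken: config.accessToken,
    version: config.apiVersion,
  });
}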
type SalesforceConn = ReturnType<typeof getSalesforceConnection>;

export async function fetchDocumentBinary(
  conn: SalesforceConn,
  documentId: string
): Promise<DocumentBinary> {
  let record: Record<string, unknown>;
  try {
    record = await conn.sobject("ContentVersion").retrieve(documentId);
  } catch {
    throw new SalesforceDocumentNotFoundError(documentId);
  }
  const versionData = record["VersionData"] as string | undefined;
  const title = record["Title"] as string;
  const fileExtension = record["FileExtension"] as string;
export function createAnthropicClient(): Anthropic {
  return new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY,
  });
}
export const CONTRACT_EXTRACTION_SYSTEM_PROMPT = `You are a contract extraction assistant. Extract the specified fields from the contract document below and return ONLY valid JSON.
Extract these fields if present:
- contract_value (number): The total contract value in dollars
- effective_date (string): The contract effective date (ISO format if possible)
- expiration_date (string): The contract expiration or end date (ISO format if possible)
- parties (array): The parties to the contract, each with name and role
- signatory_details (array): The signatories, each with name, title, and signed_date
- contract_terms (string): Key terms and conditions
- governing_law (string): The governing law or jurisdiction
- renewal_terms (string): Auto-renewal and termination notice terms
Return ONLY a valid JSON object. Do not include markdown fences, code blocks, or any explanatory text.`;
async function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));