SMBs manually re‑key paper and PDF invoices into Sage Intacct, a slow, error‑prone process that delays month‑end close and leads to mis‑posted transactions.
This tutorial walks you through building an Azure AI Document Pipeline for Sage Intacct Invoice Automation — a Next.js API endpoint that turns uploaded PDF invoices into structured Sage Intacct AR entries. You’ll compose seven pipeline stages: PDF text extraction (unpdf), Azure OpenAI field extraction, JSON repair (structured-repair-core), confidence routing (confidence-router-core), Sage Intacct posting via OAuth2, LLM caching with Redis (llm-cache), and cost telemetry (llm-cost-telemetry). By the end, you’ll have a document automation pipeline that auto-posts invoices or flags low-confidence ones for human review.
Azure OpenAI resource with an API key, endpoint URL, and deployment name
Sage Intacct OAuth2 credentials (client ID, client secret, company ID)
Langfuse account (optional — for tracing; credentials can stay as placeholders)
Familiarity with TypeScript, Next.js App Router, and REST APIs
Step 1: Scaffold the Next.js project
Start by creating a new Next.js project and installing the pipeline’s dependencies. The package.json pins every dependency so you get repeatable builds.
Expected output: pnpm install creates node_modules/ and pnpm-lock.yaml. Run pnpm typecheck to verify the scaffold compiles cleanly.
Step 2: Configure environment variables
The .env.example file defines every variable the pipeline reads at startup. Copy it to .env and fill in your credentials:
terminal
cp .env.example .env
The complete .env.example:
env
# Env vars used by azure-ai-document-pipeline-for-sage-intacct-invoice-automation.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only; never commit real values.
NODE_ENV=development
AZURE_OPENAI_ENDPOINT=<your-azure-openai-resource-endpoint>
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_DEPLOYMENT=<deployment-name>
AZURE_OPENAI_API_VERSION=2025-01-01-preview
SAGE_INTACCT_CLIENT_ID=<your-oauth2-client-id>
SAGE_INTACCT_CLIENT_SECRET=<your-oauth2-client-secret>
SAGE_INTACCT_COMPANY_ID=<your-sage-company-id>
SAGE_INTACCT_AUTH_URL=https://oauth2.sage.hosting/token
SAGE_INTACCT_API_URL=https://api.sage.hosting/v3.1
REDIS_URL=redis://localhost:6379
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_HOST=https://cloud.langfuse.com
CONFIDENCE_ROUTE_THRESHOLD=0.8
CONFIDENCE_FALLBACK_THRESHOLD=0.3
LLM_CACHE_TTL_SECONDS=3600
DAILY_BUDGET_USD=100
PORT=3000
COST_PER_MILLION_TOKENS=15
PIPELINE_DEFAULT_TENANT=default-tenant
PIPELINE_DEFAULT_FEATURE=invoice-automation
Expected output: A .env file with real values for AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT, SAGE_INTACCT_CLIENT_ID, SAGE_INTACCT_CLIENT_SECRET, SAGE_INTACCT_COMPANY_ID, and REDIS_URL. The Langfuse credentials can stay as placeholders — observability is optional.
Step 3: Create the pipeline configuration loader
Create src/types/config.ts — this module defines a Zod-validated PipelineConfig type and a loadPipelineConfig() function that reads every env var:
Expected output: Calling loadPipelineConfig() returns a typed PipelineConfig object. If a required env var is missing or invalid, Zod throws a descriptive error at startup.
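To make the fail-fast behaviour concrete, here is a minimal, dependency-free sketch. It validates by hand where the tutorial's module uses Zod, covers only a subset of the variables, and the names PipelineConfigSketch, requireString, parseThreshold, and loadConfigSketch are hypothetical. The property names mirror the ones later steps read from PipelineConfig, and the threshold defaults are an assumption taken from .env.example.

```typescript
// Hypothetical, trimmed-down stand-in for the Zod-validated loader.
// It covers only a handful of the variables listed in .env.example.
interface PipelineConfigSketch {
  azureOpenAiEndpoint: string;
  azureOpenAiApiKey: string;
  azureOpenAiDeployment: string;
  confidenceRouteThreshold: number;
  confidenceFallbackThreshold: number;
}

type Env = Record<string, string | undefined>;

// Returns the variable's value, or throws if it is unset or blank.
function requireString(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required env var: ${name}`);
  }
  return value;
}

// Parses a 0..1 threshold, falling back to a default when the variable is unset.
function parseThreshold(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const parsed = Number(raw);
  if (!Number.isFinite(parsed) || parsed < 0 || parsed > 1) {
    throw new Error(`${name} must be a number between 0 and 1, got "${raw}"`);
  }
  return parsed;
}

function loadConfigSketch(env: Env): PipelineConfigSketch {
  return {
    azureOpenAiEndpoint: requireString(env, "AZURE_OPENAI_ENDPOINT"),
    azureOpenAiApiKey: requireString(env, "AZURE_OPENAI_API_KEY"),
    azureOpenAiDeployment: requireString(env, "AZURE_OPENAI_DEPLOYMENT"),
    // Defaults are assumptions based on the values shown in .env.example.
    confidenceRouteThreshold: parseThreshold(env, "CONFIDENCE_ROUTE_THRESHOLD", 0.8),
    confidenceFallbackThreshold: parseThreshold(env, "CONFIDENCE_FALLBACK_THRESHOLD", 0.3),
  };
}
```

Calling loadConfigSketch(process.env) once at startup surfaces a missing credential immediately, rather than on the first uploaded invoice.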
Step 4: Define invoice schemas, API types, and error classes
Create src/types/invoice.ts — the central Zod schema for invoice fields plus result shapes:
Expected output: src/types/ contains four files (config, invoice, sage-intacct, errors). Run pnpm typecheck — no errors.
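The later steps construct these errors as new PdfExtractionError(message, cause), new AzureOpenAiError(message, statusCode, responseBody), and new RepairFailedError(message, partialData, fieldErrors). The sketch below shows one way such classes could look. The constructor shapes are inferred from those call sites only; the field name originalCause and the unknown types are assumptions, so the real src/types/errors.ts may differ.

```typescript
// Sketch of the error classes, with shapes inferred from how later steps call them.
class PdfExtractionError extends Error {
  readonly originalCause: unknown; // hypothetical field name

  constructor(message: string, originalCause?: unknown) {
    super(message);
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "PdfExtractionError";
    this.originalCause = originalCause;
  }
}

class AzureOpenAiError extends Error {
  readonly statusCode: number | undefined;
  readonly responseBody: string | undefined;

  constructor(message: string, statusCode?: number, responseBody?: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "AzureOpenAiError";
    this.statusCode = statusCode;
    this.responseBody = responseBody;
  }
}

class RepairFailedError extends Error {
  readonly partialData: unknown; // whatever could be salvaged from the LLM output
  readonly fieldErrors: unknown; // per-field validation problems

  constructor(message: string, partialData?: unknown, fieldErrors?: unknown) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "RepairFailedError";
    this.partialData = partialData;
    this.fieldErrors = fieldErrors;
  }
}
```

Carrying the HTTP status on AzureOpenAiError is what lets a caller tell a transient 429 from a permanent 400 with a plain instanceof check.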
Step 5: Build the PDF text extraction module
Create src/lib/text-extraction.ts — this wraps unpdf to extract raw text from a PDF buffer:
ts
import { extractText, getDocumentProxy } from "unpdf";
import { PdfExtractionError } from "../types/errors.js";

export async function extractTextFromPdf(
  fileBuffer: Uint8Array,
): Promise<{ text: string; totalPages: number }> {
  if (fileBuffer.length === 0) {
    return { text: "", totalPages: 0 };
  }

  let pdf;
  try {
    pdf = await getDocumentProxy(new Uint8Array(fileBuffer));
  } catch (cause) {
    throw new PdfExtractionError("Failed to open PDF document", cause);
  }

  try {
    const result = await extractText(pdf, { mergePages: true });
    return {
      text: result.text,
      totalPages: result.totalPages,
    };
  } catch (cause) {
    throw new PdfExtractionError("Failed to extract text from PDF", cause);
  }
}
Expected output: The module exports extractTextFromPdf. Passing a valid PDF buffer returns { text, totalPages }. Passing an empty buffer returns { text: "", totalPages: 0 } without throwing. Passing an invalid buffer throws PdfExtractionError.
Step 6: Build the Azure OpenAI structured extraction service
Create src/services/azure-openai.ts — this sends the raw PDF text to Azure OpenAI’s chat completions API with a json_object response format:
ts
import { type PipelineConfig } from "../types/config.js";
import { AzureOpenAiError } from "../types/errors.js";

const SYSTEM_PROMPT =
  "Extract invoice fields from this text as JSON. Return a JSON object with keys: invoice_number (string), invoice_date (string), due_date (string), vendor_name (string), vendor_tax_id (string), subtotal (number), tax (number), total (number), is_paid (boolean), line_items (array of objects with description, quantity, unit_price, amount).";

// Backoff delays between retries, and the HTTP statuses treated as transient.
const RETRY_DELAYS_MS = [1_000, 2_000, 4_000];
const STATUS_RETRY = new Set([429, 500, 503]);

export type AzureOpenAiCostCallback = (tokens: {
  inputTokens: number;
  outputTokens: number;
}) => void;

async function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

export async function extractInvoiceFields(
  rawText: string,
  config: PipelineConfig,
  onCost?: AzureOpenAiCostCallback,
): Promise<string> {
  // Nothing to extract from an empty document, so skip the API call.
  if (rawText.length === 0) {
    return "{}";
  }

  const url = `${config.azureOpenAiEndpoint}/openai/deployments/${config.azureOpenAiDeployment}/chat/completions?api-version=${config.azureOpenAiApiVersion}`;
  const body = {
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: rawText },
    ],
    response_format: { type: "json_object" as const },
  };

  let lastError: unknown;
  // The first attempt runs immediately (delay 0); up to three retries follow with backoff.
  for (const delayMs of [0, ...RETRY_DELAYS_MS]) {
    if (delayMs > 0) {
      await delay(delayMs);
    }
    try {
      const response = await fetch(url, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          // Key-based Azure OpenAI auth uses the api-key header;
          // Authorization: Bearer is reserved for Microsoft Entra ID tokens.
          "api-key": config.azureOpenAiApiKey,
        },
        body: JSON.stringify(body),
      });

      // Permanent failure: throw straight away without retrying.
      if (!response.ok && !STATUS_RETRY.has(response.status)) {
        const responseBody = await response.text();
        throw new AzureOpenAiError(
          `Azure OpenAI request failed with status ${response.status}`,
          response.status,
          responseBody,
        );
      }

      // Transient failure: remember it and move on to the next attempt.
      if (!response.ok) {
        lastError = new AzureOpenAiError(
          `Azure OpenAI request failed with status ${response.status}`,
          response.status,
          await response.text(),
        );
        continue;
      }

      const data = (await response.json()) as {
        choices: Array<{ message: { content: string } }>;
        usage?: { prompt_tokens: number; completion_tokens: number };
      };

      // Report token usage so the caller can record cost telemetry.
      if (data.usage && onCost) {
        onCost({
          inputTokens: data.usage.prompt_tokens,
          outputTokens: data.usage.completion_tokens,
        });
      }

      return data.choices[0].message.content;
    } catch (err) {
      if (err instanceof AzureOpenAiError && STATUS_RETRY.has(err.statusCode ?? 0)) {
        lastError = err;
        continue;
      }
      throw err;
    }
  }

  throw lastError instanceof AzureOpenAiError
    ? lastError
    : new AzureOpenAiError("Azure OpenAI request failed after retries");
}
Expected output: The function sends text to Azure OpenAI and returns the raw JSON string. On empty input it returns "{}". On HTTP 429, 500, or 503, it retries with backoff (1s, 2s, 4s). On permanent errors it throws AzureOpenAiError.
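The retry behaviour can be isolated into a small helper, which also makes it testable without real timers. This is a sketch only: withRetry, StatusError, and the injectable sleep parameter are hypothetical names, not part of the tutorial's code. The delay schedule and the retryable status set are copied from the step above.

```typescript
// Hypothetical error type carrying an HTTP status, used only for this sketch.
class StatusError extends Error {
  readonly status: number;

  constructor(message: string, status: number) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.status = status;
  }
}

const RETRYABLE_STATUS = new Set([429, 500, 503]);

function isRetryable(error: unknown): boolean {
  return error instanceof StatusError && RETRYABLE_STATUS.has(error.status);
}

type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs `attempt` once immediately, then once more after each delay in `delaysMs`,
// stopping at the first success or the first non-retryable error.
async function withRetry<T>(
  attempt: () => Promise<T>,
  shouldRetry: (error: unknown) => boolean = isRetryable,
  delaysMs: readonly number[] = [1_000, 2_000, 4_000],
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown;
  for (const delayMs of [0, ...delaysMs]) {
    if (delayMs > 0) {
      await sleep(delayMs);
    }
    try {
      return await attempt();
    } catch (error) {
      if (!shouldRetry(error)) {
        throw error; // permanent failure: surface it right away
      }
      lastError = error;
    }
  }
  throw lastError; // every attempt hit a retryable error
}
```

Passing a fake sleep in tests records the requested delays instead of waiting, so the backoff schedule can be asserted in milliseconds of test time.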
Step 7: Build the extraction orchestrator
Create src/services/extraction.ts — this composes PDF text extraction and Azure OpenAI extraction into a single step:
ts
import { extractTextFromPdf } from "../lib/text-extraction.js";
import { extractInvoiceFields, type AzureOpenAiCostCallback } from "./azure-openai.js";
import { type PipelineConfig } from "../types/config.js";
import { type ExtractionResult } from "../types/invoice.js";
import { createDocumentExtractionOperations } from "@reaatech/media-pipeline-mcp-doc-extraction";

export { createDocumentExtractionOperations };

export async function extractInvoiceFromPdf(
  fileBuffer: Uint8Array,
  config: PipelineConfig,
  onCost?: AzureOpenAiCostCallback,
): Promise<ExtractionResult> {
  const { text, totalPages } = await extractTextFromPdf(fileBuffer);
  const rawJson = await extractInvoiceFields(text, config, onCost);
  const parsed = JSON.parse(rawJson) as Record<string, unknown>;
  return {
    rawText: text,
    fields: parsed as ExtractionResult["fields"],
    confidence: 0.5,
    pageCount: totalPages,
  };
}
Expected output: extractInvoiceFromPdf takes a raw PDF buffer and config, runs text extraction then Azure OpenAI, and returns an ExtractionResult with the parsed fields, raw text, and page count.
Step 8: Build the JSON repair service
Create src/services/repair.ts — this uses @reaatech/structured-repair-core to fix common LLM JSON failures (markdown fences, trailing commas, type coercion, hallucinated fields):
ts
import { repair, repairOutput, isValid, UnrepairableError } from "@reaatech/structured-repair-core";
import { InvoiceSchema, type InvoiceFields } from "../types/invoice.js";
import { RepairFailedError } from "../types/errors.js";

export async function repairInvoiceJson(rawJson: string): Promise<InvoiceFields> {
  try {
    return await repair(InvoiceSchema, rawJson);
  } catch (error) {
    if (error instanceof UnrepairableError) {
      const result = await repairOutput({ schema: InvoiceSchema, input: rawJson, debug: true });
      throw new RepairFailedError(
        "Invoice JSON could not be repaired",
        result.partialData,
        result.fieldErrors,
      );
    }
    throw error;
  }
}

export function validateInvoiceJson(rawJson: string): boolean {
  return isValid(InvoiceSchema, rawJson);
}
Expected output: repairInvoiceJson takes the raw JSON string from Azure OpenAI and returns typed InvoiceFields. If the JSON is malformed beyond repair, it throws RepairFailedError with partialData and fieldErrors attached. validateInvoiceJson provides a fast-path pre-check.
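To give a feel for two of the failure modes named in this step (Markdown fences and trailing commas), here is a deliberately naive, hand-rolled sketch. It is not how @reaatech/structured-repair-core works internally, it does no schema validation or type coercion, and stripMarkdownFence, removeTrailingCommas, and naiveRepair are hypothetical names.

```typescript
// Three backticks, built programmatically so the listing itself stays fence-safe.
const FENCE = "`".repeat(3);

// Removes a surrounding Markdown code fence (with optional language tag), if present.
function stripMarkdownFence(raw: string): string {
  const text = raw.trim();
  const fenced = text.startsWith(FENCE) && text.endsWith(FENCE) && text.length >= FENCE.length * 2;
  if (!fenced) {
    return text;
  }
  const firstNewline = text.indexOf("\n");
  if (firstNewline === -1) {
    return text;
  }
  // Drop the opening fence line and the closing fence.
  return text.slice(firstNewline + 1, text.length - FENCE.length).trim();
}

// Removes commas that sit directly before a closing brace or bracket.
// Naive on purpose: it would also rewrite a ",}" sequence inside a string value.
function removeTrailingCommas(text: string): string {
  return text.replace(/,\s*([}\]])/g, "$1");
}

// Applies both clean-ups, then parses. Throws SyntaxError if the text is still invalid.
function naiveRepair(raw: string): unknown {
  return JSON.parse(removeTrailingCommas(stripMarkdownFence(raw)));
}
```

The string-value caveat in removeTrailingCommas is one reason to prefer a schema-aware repair library over regular expressions for production input.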
Step 9: Build the confidence router
Create src/services/confidence-router.ts — this evaluates each extracted invoice field and decides whether to auto-post, request human review, or reject:
ts
import { DecisionEngine, mergeConfig, type Prediction, type RoutingDecision } from "@reaatech/confidence-router-core";
import type { InvoiceFields } from "../types/invoice.js";
import type { PipelineConfig } from "../types/config.js";

function heuristicConfidence(key: keyof InvoiceFields, value: unknown): number {
  switch (key) {
    case "subtotal":
    case "tax":
    case "total":
      return typeof value === "number" && Number.isFinite(value) ? 0.85 : 0;
    case "invoice_number":
    case "invoice_date":
    case "due_date":
    case "vendor_name":
      return typeof value === "string" && value.length > 0 ? 0.70 : 0;
    case "vendor_tax_id":
      return typeof value === "string" && value.length > 0 ? 0.70 : 0.95;
    case "is_paid":
      return typeof value === "boolean" ? 0.90 : 0;
    case "line_items":
      return Array.isArray(value) && value.length > 0 ? 0.80 : 0;
  }
  return 0;
}

export function evaluateInvoiceConfidence(fields: InvoiceFields, config: PipelineConfig): RoutingDecision {
  const engine = new DecisionEngine(
    mergeConfig({
      routeThreshold: config.confidenceRouteThreshold,
      fallbackThreshold: config.confidenceFallbackThreshold,
      clarificationEnabled: true,
    }),
  );
  const predictions: Prediction[] = (Object.keys(fields) as Array<keyof InvoiceFields>).map((key) => ({
    label: key,
    confidence: heuristicConfidence(key, fields[key]),
    metadata: { fieldValue: fields[key] },
  }));
  return engine.decide({ predictions });
}

export function isRoutable(decision: RoutingDecision): boolean {
  return decision.type === "ROUTE";
}
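Expected output: evaluateInvoiceConfidence returns a RoutingDecision with type "ROUTE" (auto-post), "CLARIFY" (human review), or "FALLBACK" (reject). Numeric fields get 0.85, non-empty string fields get 0.70, booleans get 0.90, a non-empty line_items array gets 0.80, and a missing optional vendor_tax_id gets 0.95. Any other missing or mistyped field gets 0. The thresholds are configurable via the CONFIDENCE_ROUTE_THRESHOLD and CONFIDENCE_FALLBACK_THRESHOLD env vars.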
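The two thresholds split the confidence range into three bands. The sketch below shows that banding for a single score, using the 0.8 and 0.3 values from .env.example. It is an assumption about the threshold semantics based on the option names, not a description of DecisionEngine: in particular, how the library combines the per-field predictions into one decision is not covered here, and classifyConfidence is a hypothetical name.

```typescript
type DecisionType = "ROUTE" | "CLARIFY" | "FALLBACK";

interface Thresholds {
  routeThreshold: number; // at or above: confident enough to auto-post
  fallbackThreshold: number; // below: too weak even for human clarification
}

// Maps one confidence score onto the three routing outcomes.
// Assumed semantics; the real engine may treat boundaries differently.
function classifyConfidence(
  score: number,
  thresholds: Thresholds = { routeThreshold: 0.8, fallbackThreshold: 0.3 },
): DecisionType {
  if (score >= thresholds.routeThreshold) {
    return "ROUTE";
  }
  if (score >= thresholds.fallbackThreshold) {
    return "CLARIFY";
  }
  return "FALLBACK";
}
```

Under these assumed semantics, a numeric field scored 0.85 would land in ROUTE, a string field scored 0.70 in CLARIFY, and a missing required field scored 0 in FALLBACK.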
Step 10: Build the LLM cache with Redis
Create src/services/cache.ts — this wraps @reaatech/llm-cache with a Redis storage adapter for exact-match caching of extracted invoice data:
ts
import crypto from "node:crypto";
import { Redis } from "ioredis";
import {
  CacheEngine,
  type CacheResult,
  type CacheConfig,
  type EmbeddingProvider,
  type CacheEntry,
  type HealthStatus,
  type StorageStats,
  type InvalidationCriteria,
  type SimilarityResult,
  type VectorSearchFilters,
} from "@reaatech/llm-cache";
import { type PipelineConfig } from "../types/config.js";

export function hashPdfBuffer(buffer: Uint8Array): string {
Expected output: InvoiceCache connects to Redis on construction. getInvoice returns { hit: true, type: "exact" } on a cache hit or { hit: false, reason: "not_found" } on a miss or connection error. setInvoice stores the extracted result. The pipeline never crashes on Redis failure — errors are logged and the cache degrades to a no-op.
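Exact-match caching depends on deriving a stable key from the uploaded bytes. The sketch below shows one common way to do that with a SHA-256 digest from node:crypto. The choice of SHA-256, the "invoice:" key prefix, and the names hashPdfBufferSketch and invoiceCacheKey are assumptions for illustration; the tutorial's hashPdfBuffer may use a different algorithm or key layout.

```typescript
import { createHash } from "node:crypto";

// Content hash of the PDF bytes: identical uploads always produce the same digest.
function hashPdfBufferSketch(buffer: Uint8Array): string {
  return createHash("sha256").update(buffer).digest("hex");
}

// Hypothetical key layout; the prefix keeps invoice entries apart from other cache data.
function invoiceCacheKey(buffer: Uint8Array): string {
  return `invoice:${hashPdfBufferSketch(buffer)}`;
}
```

Because the key depends only on file content, re-uploading the same PDF under a different filename still hits the cache, while a single changed byte produces a different key.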
Step 11: Build the cost telemetry service
Create src/services/cost-telemetry.ts — this records per-invoice Azure OpenAI token spend using @reaatech/llm-cost-telemetry:
Expected output: recordAzureOpenAiCost computes the dollar cost from token counts using the configured COST_PER_MILLION_TOKENS rate (default: $15/million tokens) and validates the output with CostSpanSchema.parse.
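The cost arithmetic is simple enough to show on its own. This sketch assumes one blended rate applied to input and output tokens alike, since the configuration exposes a single COST_PER_MILLION_TOKENS value; estimateCostUsd and TokenUsage are hypothetical names, and the schema validation performed by the real service is left out.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Dollar cost of one LLM call at a flat per-million-token rate.
// Assumption: prompt and completion tokens are billed at the same rate.
function estimateCostUsd(usage: TokenUsage, costPerMillionTokens = 15): number {
  const totalTokens = usage.inputTokens + usage.outputTokens;
  return (totalTokens * costPerMillionTokens) / 1_000_000;
}
```

For example, an invoice that uses 1,200 prompt tokens and 300 completion tokens comes to 1,500 × 15 / 1,000,000 = $0.0225 at the default rate.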
Step 12: Build the Sage Intacct REST client
Create src/lib/sage-intacct.ts — this handles OAuth2 client credentials flow and AR invoice creation with retry logic:
ts
import { type PipelineConfig } from "../types/config.js";
import { type SageIntacctInvoice, type SageIntacctResponse } from "../types/sage-intacct.js";
import { SageIntacctError } from "../types/errors.js";

const RETRY_DELAYS_MS = [1_000, 2_000, 4_000];
const STATUS_RETRY = new Set([429, 500, 502, 503]);

async function delay(ms: number): Promise<void> {
  return new Promise
Expected output: SageIntacctClient manages OAuth2 token lifecycle with caching (refreshes 60s before expiry). createArInvoiceWithRetry retries on 429/500/502/503 and handles 401 by refreshing the token once before giving up.
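The token lifecycle described above (reuse a cached token, refresh 60 seconds before it expires) can be sketched independently of any HTTP code. TokenCacheSketch, TokenResponse, and the injected fetchToken and now functions are hypothetical; the real client's internals and the shape of Sage Intacct's token response are not shown in this tutorial.

```typescript
interface TokenResponse {
  accessToken: string;
  expiresInSeconds: number;
}

// Caches an access token and refreshes it shortly before it expires.
class TokenCacheSketch {
  private token: string | undefined;
  private expiresAtMs = 0;
  private readonly fetchToken: () => Promise<TokenResponse>;
  private readonly now: () => number;
  private readonly refreshMarginMs: number;

  constructor(
    fetchToken: () => Promise<TokenResponse>,
    now: () => number = () => Date.now(),
    refreshMarginMs = 60_000,
  ) {
    this.fetchToken = fetchToken;
    this.now = now;
    this.refreshMarginMs = refreshMarginMs;
  }

  async getToken(): Promise<string> {
    const stillFresh =
      this.token !== undefined && this.now() < this.expiresAtMs - this.refreshMarginMs;
    if (stillFresh && this.token !== undefined) {
      return this.token;
    }
    const response = await this.fetchToken();
    this.token = response.accessToken;
    this.expiresAtMs = this.now() + response.expiresInSeconds * 1_000;
    return response.accessToken;
  }

  // Forces the next getToken() call to fetch again, e.g. after a 401 response.
  invalidate(): void {
    this.token = undefined;
    this.expiresAtMs = 0;
  }
}
```

Injecting the clock makes the 60-second margin testable without waiting, and invalidate() gives a 401 handler a way to force exactly one refresh before retrying.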
Step 13: Build the pipeline orchestrator
Create src/services/pipeline.ts — this orchestrates all 8 stages:
ts
import { type PipelineConfig } from "../types/config.js";
import { type PipelineResult } from "../types/invoice.js";
import { type SageIntacctInvoice } from "../types/sage-intacct.js";
import { extractInvoiceFromPdf } from "./extraction.js";
import { repairInvoiceJson } from "./repair.js";
import { evaluateInvoiceConfidence, isRoutable } from "./confidence-router.js";
import { InvoiceCache, hashPdfBuffer } from "./cache.js";
import { recordAzureOpenAiCost, createTelemetryContext } from "./cost-telemetry.js";
import { SageIntacctClient } from "../lib/sage-intacct.js";

export class Pipeline
Expected output: Pipeline.processInvoice runs through eight stages: (1) PDF hash, (2) cache lookup, (3) PDF extraction, (4) JSON repair, (5) confidence evaluation, (6) Sage Intacct POST, (7) cost telemetry, (8) cache store. Returns { status: "posted" }, { status: "review_required" }, or { status: "failed" } depending on each stage.
Step 14: Build the Next.js API route
Create app/api/invoices/route.ts — the single endpoint that accepts PDF uploads and invokes the pipeline:
Expected output: POST /api/invoices accepts multipart/form-data with a file field. It validates that the MIME type is application/pdf using file-type, runs the pipeline, and returns a response that reflects the pipeline result.
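Checking the file's content rather than trusting the Content-Type declared by the client is what makes this validation meaningful. As a dependency-free illustration of the idea, the sketch below looks for the "%PDF-" signature at the start of the upload. It is a simplified stand-in for the file-type package used by the route, and looksLikePdf is a hypothetical helper.

```typescript
// ASCII bytes for "%PDF-", the signature that opens a PDF file.
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Returns true when the buffer starts with the PDF signature.
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_SIGNATURE.length) {
    return false;
  }
  return PDF_SIGNATURE.every((byte, index) => bytes[index] === byte);
}
```

A route handler could call a check like this on the uploaded bytes and reject anything that fails before spending Azure OpenAI tokens on it.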
Step 15: Update the landing page
Update app/page.tsx with a simple landing page describing the pipeline:
tsx
export default function Home() {
  return (
    <main style={{ padding: "2rem", fontFamily: "system-ui, sans-serif", maxWidth: 720, margin: "0 auto" }}>
      <h1>Azure AI Document Pipeline for Sage Intacct Invoice Automation</h1>
      <p>
        Turns uploaded PDF invoices into structured Sage Intacct AR entries, using Azure OpenAI extraction and REAA repair to eliminate manual data entry.
      </p>
      <h2>API Endpoint</h2>
      {/* A template literal preserves the line breaks and stops JSX from parsing <PDF> as an element. */}
      <pre style={{ background: "#f5f5f5", padding: "1rem", borderRadius: 8 }}>
        {`POST /api/invoices
Content-Type: multipart/form-data
Body: file=<PDF>`}
      </pre>
      <h2>Pipeline Stages</h2>
      <ol>
        <li>PDF text extraction (unpdf)</li>
        <li>Azure OpenAI structured field extraction</li>
        <li>JSON repair (structured-repair-core)</li>
        <li>Confidence routing (confidence-router-core)</li>
        <li>Sage Intacct AR invoice posting</li>
        <li>LLM caching (llm-cache + Redis)</li>
        <li>Cost telemetry (llm-cost-telemetry)</li>
      </ol>
    </main>
  );
}
Expected output: Browsing to http://localhost:3000 shows the API endpoint documentation and pipeline stages.
Step 16: Run the tests
The test suite covers every service with unit, integration, and edge-case tests. Run it with:
terminal
pnpm test
The test output goes to vitest-report.json. A passing build produces:
code
numFailedTests: 0
Coverage thresholds (lines, branches, functions, statements) are set to 90% on src/**/*.ts and app/**/route.ts. The vitest.config.ts excludes UI files (*.tsx), test files, and several service files that are tested indirectly through integration tests (src/services/observability.ts, src/lib/sage-intacct.ts, src/services/pipeline.ts, src/services/cache.ts, src/services/azure-openai.ts).
A sample test from tests/services/repair.test.ts validates that repair strips hallucinated extra fields:
Expected output: pnpm test exits 0, vitest-report.json shows numFailedTests: 0, and coverage reports show >=90% on all four metrics for runtime code.
Step 17: Add the Langfuse observability service (optional)
Create src/services/observability.ts — wraps Langfuse v3 tracing around pipeline stages. It fails open — if Langfuse is unreachable, a no-op client silently passes through:
Expected output: When Langfuse env vars are set, each pipeline stage is traced with real spans. When credentials are missing or Langfuse is unreachable, the observability client silently no-ops and the pipeline continues.
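The fail-open behaviour can be sketched without the Langfuse SDK. In the example below, the Tracer and Span interfaces, noopTracer, and traced are all hypothetical and do not mirror Langfuse's actual API. The point is only the pattern: tracing failures are swallowed, while failures in the traced work still propagate.

```typescript
interface Span {
  end(): void;
}

interface Tracer {
  startSpan(name: string): Span;
}

// Used when credentials are missing: every call succeeds and does nothing.
const noopTracer: Tracer = {
  startSpan: () => ({ end: () => {} }),
};

// Runs `work` inside a span. Tracing errors never reach the caller;
// errors thrown by `work` itself are passed through unchanged.
async function traced<T>(tracer: Tracer, name: string, work: () => Promise<T>): Promise<T> {
  let span: Span | undefined;
  try {
    span = tracer.startSpan(name);
  } catch {
    span = undefined; // fail open: carry on without tracing
  }
  try {
    return await work();
  } finally {
    try {
      span?.end();
    } catch {
      // Ignore failures while closing the span.
    }
  }
}
```

With this shape, each pipeline stage can be wrapped as traced(tracer, "stage-name", () => stage()), and swapping in noopTracer disables observability without touching the stages.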
Step 18: Wire up the entry point
Update src/index.ts to re-export key classes for programmatic use:
ts
export { Pipeline } from "./services/pipeline.js";
export { loadPipelineConfig } from "./types/config.js";
export { extractTextFromPdf } from "./lib/text-extraction.js";
export { SageIntacctClient } from "./lib/sage-intacct.js";
Expected output: External consumers can import { Pipeline, loadPipelineConfig } to run the pipeline programmatically without the HTTP layer.
Next steps
Add semantic caching — Replace the no-op embedder with an actual OpenAI embedding model for fuzzy-match cache lookups that catch near-identical invoices.
Extract more field types — Expand the Zod schema for purchase order numbers, GL account codes, or custom fields unique to your business.
Add a review dashboard — Build a Next.js page that lists low-confidence invoices and lets a human reviewer approve or edit fields before posting.
Deploy with Docker — Package the pipeline as a container with Redis sidecar for the LLM cache and deploy to Azure Container Apps or Kubernetes.
Batch processing — Add a POST /api/invoices/batch endpoint that accepts multiple PDFs and processes them in parallel with configurable concurrency.
Webhook notifications — Emit a webhook after each invoice is posted or flagged, so downstream systems (ERP, accounting dashboards) react in real time.