AWS Bedrock Document Pipeline for Magento Invoice Automation
Automatically extract vendor invoice data from PDFs and emails, then reconcile and post them as Magento purchase orders or payments without manual data entry.
Small e‑commerce businesses receive dozens of supplier invoices weekly as PDFs or images. Entering each manually into Magento is tedious, introduces errors, and delays payment tracking, causing late fees and stock discrepancies.
This tutorial builds an AWS Bedrock Document Pipeline that receives supplier invoices, packing slips, and remittance advices, extracts structured data using Amazon Textract and Bedrock (Claude), repairs the LLM output into validated JSON, and pushes it to the Magento REST API as purchase orders or payment invoices. You’ll wire up four REAA packages (confidence-router, structured-repair-core, session-continuity, and media-pipeline-mcp-doc-extraction), three AWS SDK clients, and a full test suite with 90%+ coverage targets.
This recipe is for intermediate TypeScript developers who want a real-world document ingestion pipeline with AI extraction, fallback handling, and e-commerce integration.
Prerequisites
Node.js >= 22 and pnpm installed
AWS account with access to Bedrock (Claude Sonnet model anthropic.claude-sonnet-4-v1:0) and Textract
Magento 2 instance with admin API access and an integration access token
A Langfuse account (optional, for observability tracing)
Basic familiarity with Next.js App Router route handlers and the AWS SDK for JavaScript v3 client pattern
Step 1: Create the project and install dependencies
Start by scaffolding a Next.js App Router project and installing all runtime dependencies.
Expected output: pnpm-lock.yaml is updated and all 12 runtime packages are listed in package.json under dependencies, alongside the next, react, and react-dom that create-next-app already added.
Step 2: Configure environment variables
Create .env.example with placeholder values for every integration:
env
# Env vars used by aws-bedrock-document-pipeline-for-magento-invoice-automation.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only; never commit real values.
NODE_ENV=development
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=<your-access-key>
AWS_SECRET_ACCESS_KEY=<your-secret>
MAGENTO_ADMIN_API_URL=https://your-magento-instance.com/rest
MAGENTO_INTEGRATION_ACCESS_TOKEN=<your-magento-token>
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_BASE_URL=<your-langfuse-base-url>
SESSION_STORAGE_TYPE=memory
PORT=3000
NEXT_PUBLIC_APP_URL=http://localhost:3000
Copy this to .env.local and fill in real values:
terminal
cp .env.example .env.local
Expected output: At runtime the pipeline reads AWS_REGION, MAGENTO_ADMIN_API_URL, and MAGENTO_INTEGRATION_ACCESS_TOKEN as required vars. Missing any of the three throws an error listing the missing keys.
Step 3: Create the types and configuration module
The pipeline needs shared types for documents, webhook payloads, pipeline results, and config. Create src/types/index.ts:
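As a rough guide to the shapes involved, here is a minimal sketch of what this file might contain. Only PipelineConfig is taken from the config loader in this step; the DocumentType string values, the PipelineResult fields beyond success and documentType, and the isDocumentType helper are assumptions made for illustration.

```typescript
// Hypothetical sketch of src/types/index.ts. Only PipelineConfig follows
// directly from what getConfig() returns; everything else is assumed.

// Labels named in the classifier step. The exact string values are assumptions.
type DocumentType = "invoice" | "packing_slip" | "remittance" | "unknown";

// Mirrors the object returned by getConfig().
interface PipelineConfig {
  region: string;
  magentoApiUrl: string;
  magentoToken: string;
  langfusePublicKey: string;
  langfuseSecretKey: string;
  langfuseBaseUrl: string;
  storageType: string;
}

// Assumed result shape: `success` and `documentType` appear in the webhook
// response; `error` is a hypothetical addition.
interface PipelineResult {
  success: boolean;
  documentType: DocumentType;
  error?: string;
}

const DOCUMENT_TYPES: readonly string[] = [
  "invoice",
  "packing_slip",
  "remittance",
  "unknown",
];

// Runtime guard for values that arrive as plain strings (for example from JSON).
function isDocumentType(value: unknown): value is DocumentType {
  return typeof value === "string" && DOCUMENT_TYPES.includes(value);
}
```

A guard like this keeps untyped input from leaking into code that switches on the document type.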
Now create src/config/index.ts to load these from the environment with validation:
ts
import "dotenv/config";
import type { PipelineConfig } from "../types/index.js";

export function getConfig(): PipelineConfig {
  const missingVars: string[] = [];

  const region = process.env.AWS_REGION;
  if (!region) {
    missingVars.push("AWS_REGION");
  }

  const magentoApiUrl = process.env.MAGENTO_ADMIN_API_URL;
  if (!magentoApiUrl) {
    missingVars.push("MAGENTO_ADMIN_API_URL");
  }

  const magentoToken = process.env.MAGENTO_INTEGRATION_ACCESS_TOKEN;
  if (!magentoToken) {
    missingVars.push("MAGENTO_INTEGRATION_ACCESS_TOKEN");
  }

  if (missingVars.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missingVars.join(", ")}`
    );
  }

  return {
    region: region as string,
    magentoApiUrl: magentoApiUrl as string,
    magentoToken: magentoToken as string,
    langfusePublicKey: process.env.LANGFUSE_PUBLIC_KEY ?? "",
    langfuseSecretKey: process.env.LANGFUSE_SECRET_KEY ?? "",
    langfuseBaseUrl: process.env.LANGFUSE_BASE_URL ?? "",
    storageType: process.env.SESSION_STORAGE_TYPE ?? "memory",
  };
}
Expected output: Running pnpm typecheck shows no type errors.
Step 4: Build the multi-format file parser
The pipeline accepts PDF, DOCX, XLSX, and email attachments. Create src/services/file-parser.ts that handles all four formats using pdfjs-dist, mammoth, xlsx, and mailparser:
Expected output: Empty buffer returns empty text; oversized files (>50MB) return empty with a tooLarge flag; PDFs iterate pages and extract text using pdfjsLib; unsupported formats throw UnsupportedDocumentFormatError.
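The guard behavior described above can be sketched without any of the parsing libraries. The version below is a simplified illustration, not the tutorial's file-parser.ts: the parseWithGuards name, the MIME-type map, the synchronous extractor signature, and the order of the checks are all assumptions, and the real extractors (pdfjs-dist, mammoth, xlsx, mailparser) would be asynchronous.

```typescript
class UnsupportedDocumentFormatError extends Error {
  name = "UnsupportedDocumentFormatError" as const;
}

type SupportedFormat = "pdf" | "docx" | "xlsx" | "email";

interface ParseResult {
  text: string;
  tooLarge: boolean;
}

// 50 MB limit, as stated in the expected output above.
const MAX_FILE_BYTES = 50 * 1024 * 1024;

// Assumed MIME types for the four formats; adjust to whatever the webhook sends.
const FORMAT_BY_MIME = new Map<string, SupportedFormat>([
  ["application/pdf", "pdf"],
  ["application/vnd.openxmlformats-officedocument.wordprocessingml.document", "docx"],
  ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "xlsx"],
  ["message/rfc822", "email"],
]);

// Stand-in for the per-format extraction functions (hypothetical signature).
type Extractor = (data: Uint8Array) => string;

function parseWithGuards(
  data: Uint8Array,
  mimeType: string,
  extractors: Record<SupportedFormat, Extractor>,
  maxBytes: number = MAX_FILE_BYTES
): ParseResult {
  // Empty input short-circuits to empty text.
  if (data.byteLength === 0) {
    return { text: "", tooLarge: false };
  }
  // Oversized input is flagged rather than parsed.
  if (data.byteLength > maxBytes) {
    return { text: "", tooLarge: true };
  }
  const format = FORMAT_BY_MIME.get(mimeType);
  if (format === undefined) {
    throw new UnsupportedDocumentFormatError(
      `Unsupported document format: ${mimeType}`
    );
  }
  return { text: extractors[format](data), tooLarge: false };
}
```

Passing the extractors in as a record keeps the guard logic testable with stubs, independent of the heavier parsing dependencies.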
Step 5: Create the document classifier
The classifier uses @reaatech/confidence-router to decide whether an incoming document is an invoice, packing slip, remittance, or unknown. Create src/services/document-classifier.ts:
Expected output: An empty byte array returns an empty array. A ThrottlingException triggers up to 3 retries with 1s/2s/4s backoff.
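The retry behavior described above could look roughly like the following. This is a generic sketch rather than the tutorial's Textract client: the withThrottleRetry, backoffDelayMs, and isThrottlingError names are hypothetical, and the injectable sleep parameter is an assumption added to make the logic testable without real delays.

```typescript
const MAX_RETRIES = 3;
const BASE_DELAY_MS = 1000;

// Delay before retry number `retryIndex` (0-based): 1000, 2000, 4000 ms.
function backoffDelayMs(retryIndex: number): number {
  return BASE_DELAY_MS * 2 ** retryIndex;
}

// AWS SDK v3 errors expose the service error code on `name`.
function isThrottlingError(error: unknown): boolean {
  return error instanceof Error && error.name === "ThrottlingException";
}

// Runs `operation`, retrying up to MAX_RETRIES times on throttling only.
// Any other error, or a throttling error after the last retry, is re-thrown.
async function withThrottleRetry<T>(
  operation: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  for (let retry = 0; ; retry++) {
    try {
      return await operation();
    } catch (error) {
      if (!isThrottlingError(error) || retry >= MAX_RETRIES) {
        throw error;
      }
      await sleep(backoffDelayMs(retry));
    }
  }
}
```

With these constants an operation is attempted at most four times: the initial call plus three retries.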
Step 7: Build the Bedrock LLM client
The Bedrock client sends extracted text to Claude and asks for structured invoice JSON. Create src/services/bedrock-client.ts:
ts
import {
  BedrockRuntimeClient,
  ConverseCommand,
} from "@aws-sdk/client-bedrock-runtime";

export class BedrockError extends Error {
  name = "BedrockError" as const;
}

export function createBedrockClient(region: string): BedrockRuntimeClient {
  return new BedrockRuntimeClient({ region });
}

export async function extractInvoiceData(
  client: BedrockRuntimeClient,
  rawText: string
): Promise<string> {
  if (!rawText) {
    return "{}";
  }

  const systemMessage = [
    {
      text: `Extract invoice data as JSON with fields: invoiceNumber (string), invoiceDate (string), vendorName (string), vendorTaxId (string, optional), subtotal (number), tax (number), total (number), isPaid (boolean), lineItems (array of {description, quantity, unitPrice, total}), purchaseOrderNumber (string, optional). Respond with ONLY valid JSON.`,
    },
  ];

  let textContent = rawText;
  if (rawText.length > 100000) {
    textContent = rawText.slice(0, 100000);
    systemMessage.unshift({
      text: `[NOTE: the document text was truncated to ${String(textContent.length)} characters]`,
    });
  }

  try {
    const command = new ConverseCommand({
      modelId: "anthropic.claude-sonnet-4-v1:0",
      system: systemMessage,
      messages: [
        {
          role: "user",
          content: [{ text: textContent }],
        },
      ],
      inferenceConfig: {
        maxTokens: 4096,
        temperature: 0.1,
      },
    });
    const response = await client.send(command);
    const text = response.output?.message?.content?.[0]?.text ?? "";
    return text;
  } catch (error) {
    if (
      error instanceof Error &&
      (error.name === "AccessDeniedException" ||
        error.name === "ModelTimeoutException")
    ) {
      throw new BedrockError(
        `Bedrock API error: ${error.name} - ${error.message}`
      );
    }
    throw error;
  }
}
Expected output: Empty input returns "{}". Text longer than 100,000 chars is truncated with a note in the system message. API errors like AccessDeniedException are re-thrown as BedrockError.
Step 8: Create the output repairer with Zod schema
LLM output is rarely perfect JSON — it may have markdown fences, trailing commas, or type mismatches. The repairer uses @reaatech/structured-repair-core with a Zod schema. Create src/services/output-repairer.ts:
Expected output: A fenced JSON block like ```json { "invoiceNumber": "INV-001", ... } ``` is repaired into typed data. A prose string without JSON returns { success: false } with error details.
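To make the repair steps concrete, here is a dependency-free sketch of the same idea. It does not use @reaatech/structured-repair-core or Zod, and it is not that package's API: the repairJsonText function, the RepairResult shape, and the choice of which fields to coerce are assumptions for illustration. It strips a Markdown code fence, removes trailing commas, parses the result, and coerces a few numeric fields that arrive as strings.

```typescript
type RepairResult =
  | { success: true; data: Record<string, unknown> }
  | { success: false; error: string };

// Built at runtime so the fence marker never appears literally in this listing.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Returns the contents of the first fenced block, or the input if none is found.
function stripMarkdownFence(raw: string): string {
  const match = FENCED_BLOCK.exec(raw);
  return (match?.[1] ?? raw).trim();
}

// Naive cleanup: it would also touch a ",}" sequence inside a string value.
function removeTrailingCommas(text: string): string {
  return text.replace(/,\s*([}\]])/g, "$1");
}

// Numeric fields named in the Bedrock prompt; coercing only these is an assumption.
const NUMERIC_FIELDS = ["subtotal", "tax", "total"] as const;

function repairJsonText(raw: string): RepairResult {
  const candidate = removeTrailingCommas(stripMarkdownFence(raw));
  let parsed: unknown;
  try {
    parsed = JSON.parse(candidate);
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    return { success: false, error: `Invalid JSON: ${reason}` };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { success: false, error: "Expected a JSON object" };
  }
  const data = parsed as Record<string, unknown>;
  for (const field of NUMERIC_FIELDS) {
    const value = data[field];
    // Turn numeric strings such as "12.50" into numbers; leave anything else alone.
    if (typeof value === "string" && value.trim() !== "" && Number.isFinite(Number(value))) {
      data[field] = Number(value);
    }
  }
  return { success: true, data };
}
```

A schema validator would normally run after these text-level repairs, so that missing or mistyped fields are reported rather than silently accepted.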
Step 9: Build the Magento REST client
The Magento client wraps the Magento Admin REST API. Create src/services/magento-client.ts with three main operations: createPurchaseOrder, createPaymentInvoice, and getInvoiceByNumber:
ts
export class MagentoAuthError extends Error {
  name = "MagentoAuthError" as const;
}

export class MagentoValidationError extends Error {
  name = "MagentoValidationError" as const;
  body: unknown;

  constructor(message: string, body?: unknown) {
    super(message);
    this.body = body;
  }
}

export class MagentoNetworkError extends Error {
  name = "MagentoNetworkError" as const;
}
Expected output: Creating a purchase order with empty items throws MagentoValidationError immediately. A 401 response from Magento throws MagentoAuthError. Network failures are wrapped as MagentoNetworkError.
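The error mapping described above might be wired up along these lines. This is a sketch under several assumptions, not the tutorial's magento-client.ts: the magentoRequest and assertHasItems helpers, the injected fetchImpl parameter, and the treatment of HTTP 400 as a validation error are all hypothetical, and the path argument is left to the caller because the actual endpoints are not shown here. The Bearer authorization header is the standard way to send a Magento integration token.

```typescript
// Copies of the error classes above so the sketch stands alone.
class MagentoAuthError extends Error {
  name = "MagentoAuthError" as const;
}
class MagentoValidationError extends Error {
  name = "MagentoValidationError" as const;
  body: unknown;
  constructor(message: string, body?: unknown) {
    super(message);
    this.body = body;
  }
}
class MagentoNetworkError extends Error {
  name = "MagentoNetworkError" as const;
}

// Minimal structural types so a real fetch or a test double can be injected.
interface ResponseLike {
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}
interface RequestInitLike {
  method: string;
  headers: Record<string, string>;
  body: string | undefined;
}
type FetchLike = (url: string, init: RequestInitLike) => Promise<ResponseLike>;

// Local guard: reject an empty order before any HTTP call is made.
function assertHasItems(items: readonly unknown[]): void {
  if (items.length === 0) {
    throw new MagentoValidationError("Purchase order must contain at least one item");
  }
}

async function magentoRequest(
  fetchImpl: FetchLike,
  baseUrl: string,
  token: string,
  method: string,
  path: string, // hypothetical: supply the endpoint path for your Magento setup
  payload?: unknown
): Promise<unknown> {
  let response: ResponseLike;
  try {
    response = await fetchImpl(`${baseUrl}${path}`, {
      method,
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: payload === undefined ? undefined : JSON.stringify(payload),
    });
  } catch (error) {
    // fetch rejects on DNS failures, refused connections, and similar problems.
    const reason = error instanceof Error ? error.message : String(error);
    throw new MagentoNetworkError(`Magento request failed: ${reason}`);
  }
  if (response.status === 401) {
    throw new MagentoAuthError("Magento rejected the access token (401)");
  }
  if (response.status === 400) {
    const body = await response.json().catch(() => undefined);
    throw new MagentoValidationError("Magento rejected the payload (400)", body);
  }
  if (!response.ok) {
    throw new Error(`Unexpected Magento response: HTTP ${response.status}`);
  }
  return response.json();
}
```

Keeping the transport injectable means the three error paths can be exercised in unit tests without a live Magento instance.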
Step 10: Create the document extraction adapter
The pipeline uses @reaatech/media-pipeline-mcp-doc-extraction for document processing. Create src/services/document-extractor.ts to wrap the MCP extraction operations:
Expected output: createExtractor returns a DocumentExtractionOperations object that can perform OCR and structured field extraction on document artifacts. Empty results resolve to empty strings or objects.
Step 11: Set up session tracking
The pipeline uses @reaatech/session-continuity to track document processing across pages and reprocessing events. You need a storage adapter, a token counter, and helper functions. Create three files.
First, src/services/session-storage.ts — an in-memory storage adapter:
ts
import {
  type IStorageAdapter,
  type Session,
  type Message,
  type SessionId,
  type MessageId,
  type MessageQueryOptions,
  type SessionFilters,
  type UpdateSessionOptions,
  type HealthStatus,
} from "@reaatech/session-continuity";

export class InMemoryStorageAdapter implements IStorageAdapter {
  private sessionsMap: Map<string, Session> = new Map();
  private messagesMap: Map<string, Message[]> = new Map
Next, src/services/session-tracker.ts — the session manager and document-session helpers:
Expected output: wrapPipelineSpan logs start/end timing to console. If Langfuse is configured, it creates a trace for each span. On failure the error is re-thrown.
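A span wrapper with the behavior described above can be sketched in a few lines. The signature below is an assumption (the tutorial's actual wrapPipelineSpan may take different arguments), the log format is invented, and the Langfuse trace creation is left out entirely; only the console timing and the re-throw are shown.

```typescript
type SpanLogger = (message: string) => void;

// Runs `work`, logging when the span starts and how long it took.
// Errors are logged and then re-thrown unchanged so callers still see them.
async function wrapPipelineSpan<T>(
  name: string,
  work: () => Promise<T>,
  log: SpanLogger = console.log // injectable so tests can capture output
): Promise<T> {
  const startedAt = Date.now();
  log(`[span:start] ${name}`);
  try {
    const result = await work();
    log(`[span:end] ${name} ok in ${Date.now() - startedAt}ms`);
    return result;
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    log(`[span:end] ${name} failed in ${Date.now() - startedAt}ms: ${reason}`);
    throw error;
  }
}
```

A pipeline step would then be wrapped as `await wrapPipelineSpan("parse", () => someAsyncStep())`, where someAsyncStep stands for any promise-returning function.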
Step 13: Wire the pipeline orchestrator
The InvoicePipeline class ties every service together. Create src/pipeline/invoice-pipeline.ts:
ts
import type { ConfidenceRouter } from "@reaatech/confidence-router";
import type { DocumentExtractionOperations } from "@reaatech/media-pipeline-mcp-doc-extraction";
import type { SessionManager } from "@reaatech/session-continuity";
import type { TextractClient } from "@aws-sdk/client-textract";
import type { BedrockRuntimeClient } from "@aws-sdk/client-bedrock-runtime";
import type {
  WebhookPayload,
  PipelineResult,
  DocumentType,
} from "../types/index.js";
import { parseDocument } from "../services/file-parser.js";
import { classifyDocument } from "../services/document-classifier.js";
import { detectText }
Create src/index.ts as the entry point that wires all dependencies together:
ts
export type { PipelineConfig } from "./types/index.js";
export type {
  DocumentClassification,
  ExtractedDocument,
  RepairedInvoice,
  WebhookPayload,
  PipelineResult,
  DocumentType,
} from "./types/index.js";
export { getConfig } from "./config/index.js";
export { InvoicePipeline } from "./pipeline/invoice-pipeline.js";
export { createClassifier } from "./services/document-classifier.js";
export { createExtractor } from "./services/document-extractor.js";
export { createTextractClient } from "./services/textract-client.js";
export { createBedrockClient } from "./services/bedrock-client.js";
export { repairInvoiceOutput, quickRepairInvoice } from "./services/output-repairer.js";
export { createSessionManager } from
Expected output: An invoice document flows through: parse → classify → Textract → Bedrock → repair → Magento PO creation. Each step is wrapped in an observability span. Packing slips stop after Textract extraction. Remittances look up the existing invoice and create a payment. Unknown documents return immediately with success: false.
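The per-type branching described above can be expressed as an exhaustive switch. The sketch below is illustrative only: the DocumentType string values, the TerminalAction labels, and the terminalActionFor helper are assumptions, and the real InvoicePipeline calls the services directly instead of returning a label.

```typescript
// Assumed string values for the four classifications.
type DocumentType = "invoice" | "packing_slip" | "remittance" | "unknown";

// Hypothetical labels for where each document type's flow ends.
type TerminalAction =
  | "createPurchaseOrder"
  | "stopAfterTextract"
  | "lookupInvoiceAndCreatePayment"
  | "reject";

function terminalActionFor(documentType: DocumentType): TerminalAction {
  switch (documentType) {
    case "invoice":
      // parse, classify, Textract, Bedrock, repair, then Magento PO creation
      return "createPurchaseOrder";
    case "packing_slip":
      // nothing is posted to Magento
      return "stopAfterTextract";
    case "remittance":
      return "lookupInvoiceAndCreatePayment";
    case "unknown":
      // the pipeline returns success: false immediately
      return "reject";
    default: {
      // Compile-time exhaustiveness check: adding a DocumentType without a
      // matching case makes this assignment a type error.
      const unreachable: never = documentType;
      throw new Error(`Unhandled document type: ${String(unreachable)}`);
    }
  }
}
```

The never assignment is what keeps the routing table in step with the type: a new label such as a credit memo cannot be added without deciding how it is handled.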
Step 14: Create the API route handlers
The webhook endpoint receives multipart document uploads via the Next.js App Router. Create app/api/webhook/invoice/route.ts:
import { NextResponse } from "next/server";

export function GET(): NextResponse {
  return NextResponse.json({
    status: "ok",
    timestamp: new Date().toISOString(),
  });
}
Expected output: POST /api/webhook/invoice with a PDF multipart form returns { success: true, documentType: "invoice", ... }. No file returns 400. Unsupported MIME type returns 400. GET /api/health returns { status: "ok", timestamp: "..." }.
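The two 400 cases above come down to a small validation step that runs before the pipeline is invoked. The following is a minimal sketch of that check in isolation; the checkUpload helper, the UploadCheck shape, the error messages, and the accepted MIME list are assumptions and not the tutorial's route handler.

```typescript
// Assumed MIME types for the formats the parser step lists.
const ACCEPTED_MIME_TYPES: ReadonlySet<string> = new Set([
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  "message/rfc822",
]);

// Structural subset of the Web File object, so plain objects work in tests.
interface UploadLike {
  type: string;
  size: number;
}

type UploadCheck =
  | { ok: true }
  | { ok: false; status: 400; error: string };

function checkUpload(file: UploadLike | null): UploadCheck {
  if (file === null) {
    return { ok: false, status: 400, error: "No file provided" };
  }
  if (!ACCEPTED_MIME_TYPES.has(file.type)) {
    return { ok: false, status: 400, error: `Unsupported MIME type: ${file.type}` };
  }
  return { ok: true };
}
```

In a route handler, the value read from the multipart form would be passed to this check, and a failed result would be turned into a JSON response with the given status before any parsing starts.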
Step 15: Add Langfuse instrumentation at startup
Enable startup-time Langfuse initialization using the Next.js instrumentation hook. Open next.config.ts (created by create-next-app) and replace its contents with:
import { initLogger } from "./services/observability.js";

export function register(): void {
  if (process.env.LANGFUSE_PUBLIC_KEY && process.env.LANGFUSE_SECRET_KEY) {
    initLogger({
      publicKey: process.env.LANGFUSE_PUBLIC_KEY,
      secretKey: process.env.LANGFUSE_SECRET_KEY,
      baseUrl: process.env.LANGFUSE_BASE_URL ?? "https://cloud.langfuse.com",
    });
  }
}
Expected output: When the app starts, register() fires and initializes Langfuse if the keys are set. The experimental.instrumentationHook: true flag is required — without it the register() function never runs.
Step 16: Run the tests
The project includes a full test suite. Run the typecheck, linter, and tests:
terminal
pnpm typecheck
pnpm lint
pnpm test
Expected output:
pnpm typecheck exits 0 with no errors.
pnpm lint exits 0.
Vitest reports numFailedTests: 0 with line, branch, function, and statement coverage all >= 90%.
The test suite covers:
Pipeline end-to-end: invoice, packing slip, remittance, and unknown document paths
Observability: span completion and error logging, Langfuse trace creation
Next steps
Add a persistent storage adapter — swap InMemoryStorageAdapter for a Redis or DynamoDB adapter from @reaatech/session-continuity packages for production use
Deploy the webhook behind a queue — use SQS or Bull to decouple document ingestion from the response, especially for multi-page PDFs
Add retry logic for Magento calls — implement exponential backoff when Magento returns 429 or 5xx
Expand document types — add support for credit memos, quotes, and return authorizations with dedicated classification labels and extraction schemas