AWS Bedrock Document Pipeline for Magento Invoice Automation

Automatically extract vendor invoice data from PDFs and emails, then reconcile and post them as Magento purchase orders or payments without manual data entry.

aws-bedrock document-pipeline invoice-automation magento textract claude typescript express

The problem

Small e‑commerce businesses receive dozens of supplier invoices weekly as PDFs or images. Entering each manually into Magento is tedious, introduces errors, and delays payment tracking, causing late fees and stock discrepancies.

Built from

Intro

This tutorial builds an AWS Bedrock Document Pipeline that receives supplier invoices, packing slips, and remittance advices, extracts structured data using Amazon Textract and Bedrock (Claude), repairs the LLM output into validated JSON, and pushes it to the Magento REST API as purchase orders or payment invoices. You’ll wire up four REAA packages (confidence-router, structured-repair-core, session-continuity, and media-pipeline-mcp-doc-extraction), three AWS SDK clients, and a full test suite with 90%+ coverage targets.

This recipe is for intermediate TypeScript developers who want a real-world document ingestion pipeline with AI extraction, fallback handling, and e-commerce integration.

Prerequisites

Node.js >= 22 and pnpm installed
AWS account with access to Bedrock (Claude Sonnet model anthropic.claude-sonnet-4-v1:0) and Textract
Magento 2 instance with admin API access and an integration access token
A Langfuse account (optional, for observability tracing)
Basic familiarity with Next.js App Router route handlers and the aws-sdk v3 client pattern

Step 1: Create the project and install dependencies

Start by scaffolding a Next.js App Router project and installing all runtime dependencies.

terminal

npx create-next-app@latest aws-bedrock-document-pipeline --typescript --eslint --app --src-dir

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

Download example (zip)Browse files

197 kB·143 tests·96.1% coverage·vitest passing

SHA-25653b9341b96bed097594fee743d3942b72109189a9a66fa61692a33037b0f08e3

Book a conversation All solutions

Comments

Loading comments…

import mammoth from "mammoth"; import * as pdfjsLib from "pdfjs-dist"; import { read, utils } from "xlsx"; import mailparser from "mailparser"; export class UnsupportedDocumentFormatError extends Error { name = "UnsupportedDocumentFormatError" as const; } export async function parseDocument( content: Buffer, sourceFormat: string ): Promise<{ text: string; metadata: Record<string, unknown> }> { if (content.length === 0) { return { text: "", metadata: { byteCount: 0 } }; } if (content.length > 50 * 1024 * 1024) { return { text: "", metadata: { tooLarge: true } }; } switch (sourceFormat) { case "pdf": { try { const pdfDoc = await pdfjsLib.getDocument({ data: content }).promise; const textParts: string[] = []; for (let i = 1; i <= pdfDoc.numPages; i++) { const page = await pdfDoc.getPage(i); const contentPage = await page.getTextContent(); const pageText = contentPage.items .map((item) => "str" in item ? item.str : "") .join(" "); textParts.push(pageText); } return { text: textParts.join("\n"), metadata: { pageCount: pdfDoc.numPages, sourceFormat }, }; } catch { return { text: "", metadata: { error: "Failed to parse PDF document", sourceFormat }, }; } } case "docx": { const result = await mammoth.extractRawText({ buffer: content }); return { text: result.value, metadata: { sourceFormat } }; } case "xlsx": { const workbook = read(content, { type: "buffer" }); const allText: string[] = []; for (const sheetName of workbook.SheetNames) { const sheet = workbook.Sheets[sheetName]; const rows = utils.sheet_to_json<(string | number | undefined)[]>( sheet, { header: 1 } ); for (const row of rows) { const rowText = row .filter((cell) => cell != null) .map(String) .join(" "); if (rowText.trim()) { allText.push(rowText); } } } return { text: allText.join("\n"), metadata: { sheetCount: workbook.SheetNames.length, sourceFormat }, }; } case "email": { const parsed = await mailparser.simpleParser(content); const subject = parsed.subject ?? ""; const body = parsed.text ?? ""; return { text: subject ? `${subject}\n${body}` : body, metadata: { from: parsed.from?.text, to: parsed.to?.text, date: parsed.date, sourceFormat, }, }; } default: { throw new UnsupportedDocumentFormatError( `Unsupported format: ${sourceFormat}` ); } } }

import { ConfidenceRouter, KeywordClassifier } from "@reaatech/confidence-router"; import type { DocumentClassification } from "../types/index.js"; export function createClassifier(): ConfidenceRouter { const router = new ConfidenceRouter({ routeThreshold: 0.8, fallbackThreshold: 0.3, clarificationEnabled: false, }); router.registerClassifier( new KeywordClassifier([ { label: "invoice", keywords: ["invoice", "inv#", "bill to", "purchase order"] }, { label: "packing_slip", keywords: ["packing slip", "delivery note", "items shipped"] }, { label: "remittance", keywords: ["remittance", "payment advice", "amount paid"] }, ]) ); return router; } export function classifyDocument( router: ConfidenceRouter, text: string ): Promise<DocumentClassification> { if (!text) { return Promise.resolve({ documentType: "unknown", confidence: 0, predictions: [] }); } const sample = text.slice(0, 500).toLowerCase(); const keywordSets: Record<string, string[]> = { invoice: ["invoice", "inv#", "bill to", "purchase order"], packing_slip: ["packing slip", "delivery note", "items shipped"], remittance: ["remittance", "payment advice", "amount paid"], }; const maxConfidenceKey = Object.keys(keywordSets).reduce((best, label) => { const keywords = keywordSets[label]; const matches = keywords.filter((kw) => sample.includes(kw)); const confidence = matches.length / keywords.length; const currentBest = best.confidence; return confidence > currentBest ? { label, confidence } : best; }, { label: "", confidence: 0 }); const predictions = Object.keys(keywordSets).map((label) => ({ label, confidence: label === maxConfidenceKey.label ? maxConfidenceKey.confidence : 0, })); const decision = router.decide({ predictions }); let documentType: "invoice" | "packing_slip" | "remittance" | "unknown"; if (decision.type === "ROUTE") { documentType = (decision.target ?? "unknown") as "invoice" | "packing_slip" | "remittance" | "unknown"; } else { documentType = "unknown"; } return Promise.resolve({ documentType, confidence: maxConfidenceKey.confidence, predictions, }); }

import { TextractClient, DetectDocumentTextCommand, AnalyzeExpenseCommand, } from "@aws-sdk/client-textract"; export function createTextractClient(region: string): TextractClient { return new TextractClient({ region }); } async function sleep(ms: number): Promise<void> { return new Promise((resolve) => setTimeout(resolve, ms)); } export async function detectText( client: TextractClient, bytes: Uint8Array ): Promise<Array<{ text: string; confidence: number }>> { if (bytes.length === 0) { return []; } let lastError: Error | undefined; const delays = [1000, 2000, 4000]; for (let attempt = 0; attempt <= 3; attempt++) { try { const command = new DetectDocumentTextCommand({ Document: { Bytes: bytes }, }); const response = await client.send(command); const blocks = response.Blocks ?? []; return blocks .filter((block) => block.BlockType === "LINE") .map((block) => ({ text: block.Text ?? "", confidence: block.Confidence ?? 0, })); } catch (error) { if (error instanceof Error && error.name === "ThrottlingException") { if (attempt < 3) { lastError = error; await sleep(delays[attempt]); continue; } } throw error; } } throw lastError ?? new Error("detectText failed after retries"); } export async function analyzeExpense( client: TextractClient, bytes: Uint8Array ): Promise<Array<{ label: string; value: string }>> { if (bytes.length === 0) { return []; } let lastError: Error | undefined; const delays = [1000, 2000, 4000]; for (let attempt = 0; attempt <= 3; attempt++) { try { const command = new AnalyzeExpenseCommand({ Document: { Bytes: bytes }, }); const response = await client.send(command); const expenseDocs = response.ExpenseDocuments ?? []; if (expenseDocs.length === 0) { return []; } const summaryFields = expenseDocs[0].SummaryFields ?? []; return summaryFields.map((field) => ({ label: field.Type?.Text ?? "", value: field.ValueDetection?.Text ?? "", })); } catch (error) { if (error instanceof Error && error.name === "ThrottlingException") { if (attempt < 3) { lastError = error; await sleep(delays[attempt]); continue; } } throw error; } } throw lastError ?? new Error("analyzeExpense failed after retries"); }

AWS Bedrock Document Pipeline for Magento Invoice Automation

The problem

Built from

Intro

Prerequisites

Step 1: Create the project and install dependencies

Example artifact

Comments

Intro

Prerequisites

Step 1: Create the project and install dependencies

Step 2: Configure environment variables

Step 3: Create the types and configuration module

Step 4: Build the multi-format file parser

Step 5: Create the document classifier

Step 6: Implement the Textract client with retry logic

Step 7: Build the Bedrock LLM client

Step 8: Create the output repairer with Zod schema

Step 9: Build the Magento REST client

Step 10: Create the document extraction adapter

Step 11: Set up session tracking

Step 12: Create the observability layer

Step 13: Wire the pipeline orchestrator

Step 14: Create the API route handlers

Step 15: Add Langfuse instrumentation at startup

Step 16: Run the tests

Next steps