E‑commerce merchants on Shopify manually re‑type tax details from PDF invoices and receipts into accounting software, leading to errors, delays, and compliance risks.
A complete, working implementation of this recipe is available as a zip download or to browse file by file. It is generated by our build pipeline and tested with full coverage before publishing.
This tutorial walks you through building a document pipeline that extracts tax data from Shopify PDF invoices, DOCX receipts, and scanned images — then validates, repairs, and delivers the structured data. You’ll use Mistral AI for LLM-based extraction, the REAA package family for document processing and cost tracking, and Next.js App Router as the HTTP surface.
By the end, you’ll have a working service with three REST endpoints, a Shopify webhook parser, and a full test suite with 80 passing tests. This is a copy-paste-along tutorial — every code block is the real file content.
Prerequisites
A Shopify store with an app API key and secret (or placeholder values for the mock)
Basic familiarity with TypeScript, Next.js App Router, and REST APIs
Step 1: Scaffold the project
Start from the existing scaffold — a Next.js 16 App Router project with all config files in place. The package.json pins every dependency to exact versions and includes the four REAA packages you’ll use:
Run pnpm install to lock the dependencies. The scaffold already includes next.config.ts with experimental.instrumentationHook: true and vitest.config.ts with coverage thresholds at 90% on all four metrics. The flag was required for the instrumentation file on Next.js 14 and earlier; from Next.js 15 onward the instrumentation file is stable and loads without it, so on Next.js 16 the flag can be removed.
Expected output: pnpm install completes with no errors.
Step 2: Define shared types and Zod schemas
Create src/lib/types.ts with the domain types that flow through every stage of the pipeline. A TaxDocument represents an incoming file, ExtractedTaxData is what the pipeline produces, and ExtractionResult wraps it with status, confidence, and cost metadata.
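The listing for src/lib/types.ts is not reproduced in this section, so here is a minimal sketch of what those types might look like. The TaxDocument fields and the id, status, confidence, and costUsd fields of ExtractionResult are taken from how later steps use them. Everything marked "assumed" (the extra status values, the data field, and every ExtractedTaxData field except total) is an illustrative guess, as is the createPendingResult helper.

```typescript
// Sketch of src/lib/types.ts. Fields marked "assumed" are illustrative guesses,
// not the project's confirmed definitions.

// An incoming file attached to a Shopify order.
interface TaxDocument {
  id: string;
  shopifyOrderId: string;
  fileName: string;
  mimeType: string;
  buffer: Uint8Array;
}

// Assumed shape of a single invoice line.
interface LineItem {
  description: string;
  quantity: number;
  unitPrice: number;
}

// What the pipeline produces. Only `total` is referenced elsewhere in the tutorial.
interface ExtractedTaxData {
  total?: number;
  taxAmount?: number; // assumed
  currency?: string; // assumed
  lineItems?: LineItem[]; // assumed
}

// Only "processing" appears in the tutorial; the other states are assumed.
type ExtractionStatus = "processing" | "completed" | "failed";

// Wraps the extracted data with status, confidence, and cost metadata.
interface ExtractionResult {
  id: string;
  status: ExtractionStatus;
  confidence: number;
  costUsd: number;
  data?: ExtractedTaxData; // assumed
}

// Hypothetical helper: the placeholder record stored while extraction runs.
function createPendingResult(id: string): ExtractionResult {
  return { id, status: "processing", confidence: 0, costUsd: 0 };
}
```

The pending record matches the object the extract route stores in Step 11 before any extraction has finished.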
Now create src/lib/schemas.ts with Zod schemas that mirror the types. These schemas power structured repair — they validate LLM outputs and coerce them into the correct shape.
Expected output: TypeScript type-checks pass with no errors.
Step 3: Create the cost telemetry service
The @reaatech/llm-cost-telemetry package provides calculateCostFromTokens, generateId, and loadConfig. Build a service that wraps these and tracks per-session spend. Mistral’s mistral-large-latest costs approximately $2 per million tokens — you’ll hardcode that constant.
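To make the cost arithmetic concrete, here is a minimal sketch of per-session spend tracking. It does not call calculateCostFromTokens, whose signature is not shown in this section; it reproduces the calculation by hand, taking the flat figure of roughly $2 per million tokens at face value. The function and class names are hypothetical.

```typescript
// Assumed flat rate from the tutorial text: roughly $2 per million tokens.
const MISTRAL_LARGE_USD_PER_MILLION_TOKENS = 2;

// Hand-rolled stand-in for the package's cost calculation (not its real API).
function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  const totalTokens = inputTokens + outputTokens;
  return (totalTokens / 1_000_000) * MISTRAL_LARGE_USD_PER_MILLION_TOKENS;
}

// Hypothetical accumulator that tracks spend per session identifier.
class SessionSpendTracker {
  private readonly spendBySession = new Map<string, number>();

  // Adds the cost of one call to the session's running total and returns that cost.
  record(sessionId: string, inputTokens: number, outputTokens: number): number {
    const cost = estimateCostUsd(inputTokens, outputTokens);
    const previous = this.spendBySession.get(sessionId) ?? 0;
    this.spendBySession.set(sessionId, previous + cost);
    return cost;
  }

  // Returns the accumulated spend, or 0 for a session that was never recorded.
  totalFor(sessionId: string): number {
    return this.spendBySession.get(sessionId) ?? 0;
  }
}
```

For a call with 500 input and 200 output tokens, the estimate is 700 / 1,000,000 × $2 = $0.0014.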
Expected output: pnpm typecheck reports no errors. The loadConfig() call reads optional env vars DEFAULT_DAILY_BUDGET and TENANT_BUDGETS at startup.
Step 4: Build the confidence router
The @reaatech/confidence-router package routes between rule-based and LLM extraction based on confidence scores. Wrap it in a DocumentConfidenceRouter that sets sensible thresholds: above 0.8 means route directly, below 0.3 means fall back to rules, in between means ask for clarification.
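The threshold rule itself is small enough to sketch without the package. The function below is a hand-written illustration of the three-way decision described above, not the @reaatech/confidence-router API; the decision labels and the handling of NaN are assumptions.

```typescript
// Assumed labels for the three outcomes described in the text.
type RoutingDecision = "route" | "clarify" | "fallback";

const ROUTE_THRESHOLD = 0.8; // above this: route directly
const FALLBACK_THRESHOLD = 0.3; // below this: fall back to rules

function decideRoute(confidence: number): RoutingDecision {
  // Assumption: a missing or invalid score is treated as no confidence.
  if (Number.isNaN(confidence)) return "fallback";
  if (confidence > ROUTE_THRESHOLD) return "route";
  if (confidence < FALLBACK_THRESHOLD) return "fallback";
  // Scores from 0.3 to 0.8 inclusive land here.
  return "clarify";
}
```

Because the comparisons are strict, scores of exactly 0.8 and exactly 0.3 both ask for clarification. If you want the boundaries to behave differently, change the operators rather than the constants.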
Expected output: A service that takes a confidence score and returns one of three routing decisions.
Step 5: Implement structured repair
The @reaatech/structured-repair-core package repairs malformed LLM outputs. It handles trailing commas, fuzzy key names, extra fields, and code fences. Build a StructuredRepairService that wraps the four exported functions.
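To show what these repairs involve, here is a hand-rolled sketch that strips a code fence, removes trailing commas, normalizes key names, and drops unknown fields. It is not the @reaatech/structured-repair-core API, and its key matching is simple normalization and not true fuzzy matching. The key list and all function names are assumptions.

```typescript
// Canonical field names this sketch accepts (assumed; the real list lives in the schemas).
const KNOWN_KEYS = ["total", "taxAmount", "currency"] as const;

// Three backticks, built indirectly so the marker never appears literally in this listing.
const FENCE = "`".repeat(3);

// Removes a surrounding Markdown code fence, including an optional language tag.
function stripCodeFence(raw: string): string {
  let text = raw.trim();
  if (text.length >= FENCE.length * 2 && text.startsWith(FENCE) && text.endsWith(FENCE)) {
    const firstNewline = text.indexOf("\n");
    const bodyStart = firstNewline === -1 ? FENCE.length : firstNewline + 1;
    text = text.slice(bodyStart, text.length - FENCE.length);
  }
  return text.trim();
}

// Naive: this would also alter a comma followed by a bracket inside a string value.
function removeTrailingCommas(json: string): string {
  return json.replace(/,\s*([}\]])/g, "$1");
}

// Maps variants such as "Total" or "tax_amount" onto a known key, if one matches.
function canonicalKey(key: string): string | undefined {
  const squashed = key.toLowerCase().replace(/[^a-z0-9]/g, "");
  return KNOWN_KEYS.find((known) => known.toLowerCase() === squashed);
}

function repairTaxJson(raw: string): Record<string, unknown> {
  const parsed: unknown = JSON.parse(removeTrailingCommas(stripCodeFence(raw)));
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Expected a JSON object");
  }
  const cleaned: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(parsed)) {
    const canonical = canonicalKey(key);
    if (canonical !== undefined) cleaned[canonical] = value; // unknown keys are dropped
  }
  return cleaned;
}
```

In the tutorial's pipeline, the cleaned object would then be validated against the Zod schema from Step 2. That step is omitted here to keep the sketch dependency-free.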
Expected output: A service that turns noisy LLM JSON into clean, schema-validated tax data.
Step 6: Build file processing utilities
Create src/lib/file-processor.ts to handle three document formats. Use unpdf for PDF extraction, mammoth for DOCX conversion, and sharp for image preprocessing.
```ts
import { extractText, getDocumentProxy } from "unpdf";
import mammoth from "mammoth";
import sharp from "sharp";

export async function processPdf(buffer: Uint8Array): Promise<string> {
  const pdf = await getDocumentProxy(new Uint8Array(buffer));
  const { text } = await extractText(pdf, { mergePages: true });
  return text;
}

export async function processDocx(buffer: Buffer): Promise<string> {
  const result = await mammoth.extractRawText({ buffer });
  return result.value;
}

export async function preprocessImage(buffer: Uint8Array): Promise<Uint8Array> {
  return sharp(buffer).jpeg({ quality: 85 }).toBuffer();
}

export function detectFileType(fileName: string): "pdf" | "docx" | "image" {
  const lower = fileName.toLowerCase();
  if (lower.endsWith(".pdf")) {
    return "pdf";
  }
  if (lower.endsWith(".docx")) {
    return "docx";
  }
  return "image";
}
```
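Expected output: Four utility functions: two that extract text from PDF and DOCX files, one that re-encodes images as JPEG, and one that classifies a file by its extension (anything that is not .pdf or .docx is treated as an image).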
Step 7: Create the Mistral AI client
The @mistralai/mistralai SDK uses a named Mistral import. Build a factory function and a completion helper that sends tax extraction prompts with a json_object response format.
```ts
import { Mistral } from "@mistralai/mistralai";
import * as errors from "@mistralai/mistralai/models/errors";

export function createMistralClient(): Mistral {
  return new Mistral({ apiKey: process.env.MISTRAL_API_KEY ?? "" });
}

export async function callExtractionCompletion(mistral: Mistral, text: string): Promise<string> {
  try {
    const result = await mistral.chat.complete({
      model: "mistral-large-latest",
      messages: [
        {
          role: "user",
          content: "Extract tax fields as JSON from this invoice text:\n\n" + text,
        },
      ],
      responseFormat: { type: "json_object" },
    });
    const firstChoice = result.choices[0];
    const msg = firstChoice.message;
    if (msg === undefined) {
      return "";
    }
    const content = msg.content;
    if (typeof content !== "string") {
      return "";
    }
    return content;
  } catch (error) {
    if (error instanceof errors.MistralError) {
      const statusStr = String(error.statusCode);
      throw new Error(
        `Mistral API error (status ${statusStr}): ${error.message} - body: ${error.body}`,
      );
    }
    throw error;
  }
}
```
Expected output: A client that sends invoice text to Mistral and returns structured JSON, with proper error wrapping for API failures.
Step 8: Build the document extraction service
This is the orchestrator. It wires together file processing, REAA document extraction, structured repair, cost telemetry, and confidence routing into a single runExtractionPipeline method. It also provides clarifyWithMistral for cases where the confidence router asks for LLM clarification.
```ts
import { createDocumentExtractionOperations } from "@reaatech/media-pipeline-mcp-doc-extraction";
import { ArtifactRegistry } from "@reaatech/media-pipeline-mcp-core";
import { LocalStorage } from "@reaatech/media-pipeline-mcp-storage";
import type { ExtractedTaxData, TaxDocument, ExtractionResult } from "../lib/types.js";
import { StructuredRepairService } from "../services/structured-repair.js";
import { CostTelemetryService } from "../services/cost-telemetry.js";
import { DocumentConfidenceRouter } from "../services/confidence-routing.js";
import { processPdf, processDocx, detectFileType, preprocessImage } from "../lib/file-processor.js";
import { createMistralClient, callExtractionCompletion } from "../lib/mistral-client.js";

interface FieldSchema {
```
Expected output: A service that takes a TaxDocument and runs the full pipeline — file conversion, in-memory artifact storage, field extraction, structured repair, cost logging, and confidence routing.
Step 9: Build the validation service
The ValidationService stores past extractions in memory, keyed by Shopify order ID, and scores new ones. In this version the score depends only on whether total is present and positive; the recorded history is stored but not yet consulted.
```ts
import type { ExtractedTaxData } from "../lib/types.js";

export class ValidationService {
  private history = new Map<string, ExtractedTaxData>();

  recordExtraction(shopifyOrderId: string, data: ExtractedTaxData): void {
    this.history.set(shopifyOrderId, data);
  }

  validateAgainstHistory(extractedData: ExtractedTaxData): { confidence: number } {
    const hasTotal = extractedData.total !== undefined && extractedData.total > 0;
    return { confidence: hasTotal ? 0.9 : 0.3 };
  }
}
```
Expected output: A service that assigns 0.9 confidence if the total field is present and positive, and 0.3 otherwise.
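The class above records history but does not read it when scoring. If you want the score to reflect an earlier extraction for the same order, one possible extension is sketched below. The function name, the 0.5 score for a mismatch, and the one-cent tolerance are all assumptions and not part of the tutorial's ValidationService.

```typescript
// Minimal stand-in for the tutorial's ExtractedTaxData; only `total` is needed here.
interface ExtractedTotals {
  total?: number;
}

// Hypothetical scoring that also consults the previously recorded extraction.
function scoreAgainstHistory(
  history: Map<string, ExtractedTotals>,
  shopifyOrderId: string,
  current: ExtractedTotals,
): { confidence: number } {
  const total = current.total;
  if (total === undefined || total <= 0) {
    return { confidence: 0.3 }; // same floor as the tutorial's check
  }
  const previousTotal = history.get(shopifyOrderId)?.total;
  if (previousTotal !== undefined && Math.abs(previousTotal - total) > 0.01) {
    return { confidence: 0.5 }; // assumed penalty when the total disagrees with history
  }
  // Either no history exists for this order or the totals agree.
  return { confidence: 0.9 };
}
```

With the thresholds from Step 4, a 0.5 score would fall in the clarification band, so a total that contradicts an earlier extraction would be sent for LLM clarification.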
Step 10: Create the Shopify webhook handler
Create src/webhooks/shopify-tax.ts to handle incoming Shopify order webhooks, fetch attached documents, and run the extraction pipeline. The @shopify/shopify-api package requires a side-effect import for the Node adapter.
```ts
import "@shopify/shopify-api/adapters/node";
import { shopifyApi, ApiVersion } from "@shopify/shopify-api";
import type { TaxDocument, ExtractionResult } from "../lib/types.js";
import { DocumentExtractionService } from "../services/document-extraction.js";
import { CostTelemetryService } from "../services/cost-telemetry.js";

export const shopify = shopifyApi({
  apiKey: process.env.SHOPIFY_API_KEY ?? "",
  apiSecretKey: process.env.SHOPIFY_API_SECRET ?? "",
  scopes: ["read_orders"],
  hostName: process.env.SHOPIFY_HOST_NAME ?? "",
  apiVersion: ApiVersion.July25,
  isEmbeddedApp: false,
});

export function handleShopifyOrderWebhook(
  orderData: Record<string, unknown>,
): { orderId: string; documents: TaxDocument[] } {
  const rawId = orderData.id;
  const orderId = typeof rawId === "string" ? rawId : String(rawId);
  const documents: TaxDocument[] = [];
  const lineItems = orderData.line_items as Array<Record<string, unknown>> | undefined;
  if (lineItems) {
    for (const item of lineItems) {
      const attachments = item.attachments as Array<Record<string, unknown>> | undefined;
      if (attachments) {
        for (const att of attachments) {
          const url = typeof att.url === "string" ? att.url : undefined;
          if (url) {
            const docIndex = documents.length + 1;
            const fileName =
              typeof att.filename === "string" ? att.filename : `document-${String(docIndex)}`;
            documents.push({
              id: `doc-${String(docIndex)}`,
              shopifyOrderId: orderId,
              fileName,
              mimeType:
                typeof att.mime_type === "string" ? att.mime_type : "application/octet-stream",
              buffer: new Uint8Array(),
            });
          }
        }
      }
    }
  }
  return { orderId, documents };
}

export async function fetchDocumentFromUrl(url: string): Promise<TaxDocument> {
  const response = await fetch(url);
  if (!response.ok) {
    const statusStr = String(response.status);
    throw new Error(`Failed to fetch document from ${url}: ${statusStr} ${response.statusText}`);
  }
  const buffer = new Uint8Array(await response.arrayBuffer());
  const mimeType = response.headers.get("content-type") ?? "application/octet-stream";
  const fileName = url.split("/").pop() ?? "document";
  return {
    id: `fetched-${crypto.randomUUID()}`,
    shopifyOrderId: "",
    fileName,
    mimeType,
    buffer,
  };
}

export async function processDocument(document: TaxDocument): Promise<ExtractionResult> {
  const extractionService = new DocumentExtractionService();
  const costTelemetry = new CostTelemetryService();
  const extractionId = crypto.randomUUID();
  const result = await extractionService.runExtractionPipeline(document, extractionId);
  costTelemetry.recordTokenUsage(
    "mistral",
    "mistral-large-latest",
    500,
    200,
    document.shopifyOrderId,
    "tax-extraction",
  );
  return result;
}
```
Expected output: Three exported functions that parse Shopify webhooks, fetch remote documents, and run the complete extraction pipeline.
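Note that handleShopifyOrderWebhook checks each attachment's url but stores only an empty buffer, so the caller is left without a URL to pass to fetchDocumentFromUrl. A small companion function could collect the URLs from the same payload. This sketch assumes attachments arrive as line_items[].attachments[] entries with a url field, which is the shape the handler reads; verify that against the payloads your store actually sends. The function and interface names are hypothetical.

```typescript
// Hypothetical record pairing an attachment URL with its order.
interface AttachmentRef {
  orderId: string;
  url: string;
  fileName: string;
}

// Narrows an unknown value to a plain object.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Walks the same payload shape as the handler, but keeps each attachment URL.
function collectAttachmentRefs(orderData: Record<string, unknown>): AttachmentRef[] {
  const orderId = String(orderData.id);
  const refs: AttachmentRef[] = [];
  const lineItems = Array.isArray(orderData.line_items) ? orderData.line_items : [];
  for (const item of lineItems) {
    if (!isRecord(item) || !Array.isArray(item.attachments)) continue;
    for (const att of item.attachments) {
      // Attachments without a string url are skipped, as in the handler.
      if (!isRecord(att) || typeof att.url !== "string") continue;
      const fallbackName = `document-${refs.length + 1}`;
      refs.push({
        orderId,
        url: att.url,
        fileName: typeof att.filename === "string" ? att.filename : fallbackName,
      });
    }
  }
  return refs;
}
```

Each returned url can then be passed to fetchDocumentFromUrl, with shopifyOrderId filled in from the ref before the document goes to processDocument.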
Step 11: Set up API routes
Create three Next.js App Router route handlers. Each must use NextRequest and NextResponse (never bare Request or new Response).
app/api/extract/route.ts — the main extraction endpoint:
```ts
import { NextRequest, NextResponse } from "next/server";
import { extractionResultsStore, getExtractionsForOrder } from "../../../src/services/document-extraction.js";

export async function POST(req: NextRequest) {
  let body: Record<string, unknown>;
  try {
    body = (await req.json()) as Record<string, unknown>;
  } catch {
    return NextResponse.json({ error: "Invalid JSON body" }, { status: 400 });
  }
  const orderId = typeof body.orderId === "string" ? body.orderId : undefined;
  const documentUrls = Array.isArray(body.documentUrls) ? (body.documentUrls as string[]) : undefined;
  if (!orderId || typeof orderId !== "string") {
    return NextResponse.json({ error: "orderId is required" }, { status: 400 });
  }
  if (!Array.isArray(documentUrls) || documentUrls.length === 0) {
    return NextResponse.json({ error: "documentUrls must be a non-empty array" }, { status: 400 });
  }
  const id = crypto.randomUUID();
  extractionResultsStore.set(id, {
    id,
    status: "processing",
    confidence: 0,
    costUsd: 0,
  });
  return NextResponse.json({ id, status: "processing" }, { status: 202 });
}

export function GET(req: NextRequest) {
  const orderId = req.nextUrl.searchParams.get("orderId");
  if (orderId) {
    return NextResponse.json(getExtractionsForOrder(orderId));
  }
  return NextResponse.json(Array.from(extractionResultsStore.values()));
}
```
app/api/status/[id]/route.ts — poll endpoint for extraction results:
```ts
import { NextRequest, NextResponse } from "next/server";
import { extractionResultsStore } from "../../../../src/services/document-extraction.js";

export async function GET(
  _req: NextRequest,
  { params }: { params: Promise<{ id: string }> },
) {
  const { id } = await params;
  if (!id) {
    return NextResponse.json({ error: "not found" }, { status: 404 });
  }
  const result = extractionResultsStore.get(id);
  if (!result) {
    return NextResponse.json({ error: "not found" }, { status: 404 });
  }
  return NextResponse.json(result);
}
```
app/api/webhook/deliver/route.ts — delivery endpoint that sends extracted data to accounting systems:
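The listing for this route is not reproduced here. As a rough sketch of the delivery step, the handler needs to turn a finished extraction into an HTTP request against ACCOUNTING_WEBHOOK_URL. The payload shape, the "completed" status value, and the function name below are assumptions for illustration and not the project's actual WebhookPayload definition.

```typescript
// Assumed subset of an extraction result that the delivery step needs.
interface DeliverableResult {
  id: string;
  status: string;
  confidence: number;
  data?: Record<string, unknown>;
}

// A URL plus fetch-compatible options, kept separate so it can be inspected in tests.
interface DeliveryRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds the outbound request without sending it.
function buildDeliveryRequest(
  result: DeliverableResult,
  webhookUrl: string | undefined,
): DeliveryRequest {
  if (!webhookUrl) {
    throw new Error("ACCOUNTING_WEBHOOK_URL is not set");
  }
  if (result.status !== "completed") {
    throw new Error(`Extraction ${result.id} is not ready for delivery`);
  }
  return {
    url: webhookUrl,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        extractionId: result.id,
        confidence: result.confidence,
        data: result.data ?? {},
      }),
    },
  };
}
```

Inside the route handler you would pass process.env.ACCOUNTING_WEBHOOK_URL as the second argument, call fetch(request.url, request.init), and report the outcome with NextResponse.json().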
Expected output: Three route handlers that accept POST and GET requests, use NextResponse.json() for all responses, and integrate with the extraction pipeline.
Step 12: Add middleware, instrumentation, and the index
Create middleware.ts at the project root to scope request processing to API routes only:
Create src/instrumentation.ts to initialize the cost telemetry config at app startup. The experimental.instrumentationHook: true flag in next.config.ts is already set. Next.js 14 and earlier ignore this file without that flag; Next.js 15 and later load it by default:
```ts
export async function register() {
  if (process.env.NEXT_RUNTIME === "nodejs") {
    const { loadConfig } = await import("@reaatech/llm-cost-telemetry");
    loadConfig();
  }
}
```
Create src/index.ts as the library’s barrel export, re-exporting every type, schema, service, utility, and webhook function:
```ts
// Types
export type {
  TaxDocument,
  ExtractedTaxData,
  ExtractionStatus,
  ExtractionResult,
  WebhookPayload,
} from "./lib/types.js";

// Schemas
export {
  ExtractedTaxDataSchema,
  LineItemSchema,
  ExtractionResultSchema,
  WebhookPayloadSchema,
} from "./lib/schemas.js";

// Services
export { CostTelemetryService } from "./services/cost-telemetry.js";
export { DocumentConfidenceRouter } from "./services/confidence-routing.js";
export { StructuredRepairService } from "./services/structured-repair.js";
export { DocumentExtractionService } from "./services/document-extraction.js";
export { ValidationService } from "./services/validation-service.js";

// Utilities
export { createMistralClient, callExtractionCompletion } from "./lib/mistral-client.js";
export { processPdf, processDocx, preprocessImage, detectFileType } from "./lib/file-processor.js";

// Webhook
export { handleShopifyOrderWebhook, fetchDocumentFromUrl, processDocument } from "./webhooks/shopify-tax.js";
```
Expected output: pnpm typecheck passes with zero errors, and pnpm lint passes with zero errors.
Step 13: Configure environment variables
Add these entries to .env.example so consumers know which variables to set:
```
# Env vars used by mistral-ai-document-pipeline-for-shopify-tax-document-extraction.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only — never commit real values.
NODE_ENV=development
MISTRAL_API_KEY=<your-mistral-key>
SHOPIFY_API_KEY=<your-shopify-api-key>
SHOPIFY_API_SECRET=<your-shopify-api-secret>
SHOPIFY_HOST_NAME=<your-ngrok-or-production-host>
ACCOUNTING_WEBHOOK_URL=<https://webhook-target.example.com/tax-data>
NEXT_PUBLIC_APP_URL=http://localhost:3000
```
Expected output: All process.env.X references in the code have corresponding placeholders in .env.example.
Step 14: Run the tests
The test suite covers every service with happy-path, error, and boundary cases. Run the full suite:
```shell
pnpm test
```
Expected output: All tests pass, numFailedTests=0, numTotalTests=80, and coverage meets the 90% threshold on all four metrics (lines, branches, functions, statements).
```shell
pnpm typecheck
```
Expected output: Zero TypeScript errors.
```shell
pnpm lint
```
Expected output: Zero ESLint errors.
Step 15: Try the pipeline
Start the dev server:
```shell
pnpm dev
```
You can test the API endpoints with curl. Trigger a document extraction:
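An extraction can also be triggered from a script. The sketch below builds a request that matches what the POST handler in Step 11 validates: a string orderId and a non-empty documentUrls array. The base URL assumes the dev server's default port, the order ID and document URL are placeholders, and the helper name is hypothetical.

```typescript
// A URL plus fetch-compatible options for the extract endpoint.
interface ExtractRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds the POST /api/extract request without sending it.
function buildExtractRequest(
  baseUrl: string,
  orderId: string,
  documentUrls: string[],
): ExtractRequest {
  if (documentUrls.length === 0) {
    // Mirrors the handler's own validation so the mistake surfaces before the call.
    throw new Error("documentUrls must be a non-empty array");
  }
  return {
    url: `${baseUrl.replace(/\/$/, "")}/api/extract`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ orderId, documentUrls }),
    },
  };
}

// Placeholder values; replace them with a real order ID and document URL.
const extractRequest = buildExtractRequest("http://localhost:3000", "1001", [
  "https://example.com/invoice.pdf",
]);
// With the dev server running: const res = await fetch(extractRequest.url, extractRequest.init);
```

According to the handler in Step 11, a valid request returns status 202 with a body containing id and status: "processing". You can then poll /api/status/ followed by that id to read the stored result.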