# Anthropic Document Pipeline for Square SMB Receipt Extraction
 
> Automatically extract line items, totals, and vendor info from Square receipts and push structured data to accounting systems.
 
A tutorial-style reference solution from [reaatech.com](https://reaatech.com) that demonstrates how to build production-grade AI systems with the `@reaatech/*` package family.
 
## What it does
 
This pipeline takes a receipt image URL and processes it in stages. It preprocesses the image with Unstructured for OCR and text extraction, then gates the result with `@reaatech/confidence-router` to filter out low-quality scans. Scans that pass are sent to Anthropic Claude for structured extraction, and any malformed JSON in the response is repaired with `@reaatech/structured-repair-core`. Spend is capped with `@reaatech/agent-budget-engine`, and the final structured receipt is pushed to Square.
 
## Pipeline flow
 
```
Image URL → Unstructured partition → Confidence threshold check → Claude structured extraction → JSON repair → Square push
```
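
The confidence threshold check in the flow above can be illustrated with a small, dependency-free sketch. It assumes the two thresholds split scores into three bands. The names `GateDecision` and `gateByConfidence` and the `"review"` label for the middle band are hypothetical, chosen for illustration; this is not the `@reaatech/confidence-router` API, and the real router may treat the middle band differently.

```typescript
// Illustrative only: these names are not part of @reaatech/confidence-router.
type GateDecision = "route" | "review" | "fallback";

interface GateThresholds {
  route: number; // mirrors CONFIDENCE_ROUTE_THRESHOLD (default 0.8)
  fallback: number; // mirrors CONFIDENCE_FALLBACK_THRESHOLD (default 0.3)
}

function gateByConfidence(
  score: number,
  thresholds: GateThresholds = { route: 0.8, fallback: 0.3 },
): GateDecision {
  if (!Number.isFinite(score) || score < 0 || score > 1) {
    throw new RangeError(`confidence must be in [0, 1], got ${score}`);
  }
  // At or above the route threshold: continue to Claude extraction.
  if (score >= thresholds.route) return "route";
  // Below the fallback threshold: treat the scan as unusable.
  if (score < thresholds.fallback) return "fallback";
  // Assumption: the middle band is neither routed nor dropped outright.
  return "review";
}
```

With the default thresholds, a score of 0.45 lands in the middle band, which is consistent with the 422 example under `POST /api/ingest`, where 0.45 fails the 0.80 route threshold.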
 
## Configuration
 
| Variable | Description |
|---|---|
| `NODE_ENV` | Runtime environment (`development`, `production`, `test`) |
| `ANTHROPIC_API_KEY` | Anthropic API credential |
| `ANTHROPIC_MODEL` | Claude model ID (default: `claude-sonnet-4-6`) |
| `ANTHROPIC_MAX_TOKENS` | Max output tokens per extraction call (default: `4096`) |
| `SQUARE_ACCESS_TOKEN` | Square SDK auth token |
| `SQUARE_LOCATION_ID` | Target Square location for expense pushes |
| `UNSTRUCTURED_API_KEY` | Unstructured partition API key |
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key for tracing |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key for tracing |
| `LANGFUSE_BASE_URL` | Langfuse base URL (default: `https://cloud.langfuse.com`) |
| `CONFIDENCE_ROUTE_THRESHOLD` | ConfidenceRouter route threshold (default: `0.8`) |
| `CONFIDENCE_FALLBACK_THRESHOLD` | ConfidenceRouter fallback threshold (default: `0.3`) |
| `BUDGET_DAILY_LIMIT` | Daily USD spend cap (default: `5.0`) |
| `BUDGET_SOFT_CAP` | Soft-cap ratio for budget warnings (default: `0.8`) |
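
As a rough sketch of how the documented defaults might be applied, the loader below reads the variables that have defaults in the table and falls back when they are unset. The project validates config with Zod; this version uses plain TypeScript so it runs without dependencies. `PipelineConfig`, `numberFromEnv`, and `loadConfig` are hypothetical names, and the required secrets (API keys, tokens) are left out.

```typescript
// Hypothetical shape; the repository's real config schema may differ.
interface PipelineConfig {
  anthropicModel: string;
  anthropicMaxTokens: number;
  langfuseBaseUrl: string;
  confidenceRouteThreshold: number;
  confidenceFallbackThreshold: number;
  budgetDailyLimit: number;
  budgetSoftCap: number;
}

type Env = Record<string, string | undefined>;

// Parse a numeric variable, using the documented default when it is unset.
function numberFromEnv(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value)) {
    throw new Error(`${key} must be a number, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env): PipelineConfig {
  return {
    anthropicModel: env["ANTHROPIC_MODEL"] ?? "claude-sonnet-4-6",
    anthropicMaxTokens: numberFromEnv(env, "ANTHROPIC_MAX_TOKENS", 4096),
    langfuseBaseUrl: env["LANGFUSE_BASE_URL"] ?? "https://cloud.langfuse.com",
    confidenceRouteThreshold: numberFromEnv(env, "CONFIDENCE_ROUTE_THRESHOLD", 0.8),
    confidenceFallbackThreshold: numberFromEnv(env, "CONFIDENCE_FALLBACK_THRESHOLD", 0.3),
    budgetDailyLimit: numberFromEnv(env, "BUDGET_DAILY_LIMIT", 5.0),
    budgetSoftCap: numberFromEnv(env, "BUDGET_SOFT_CAP", 0.8),
  };
}
```

In a Node process this would be called as `loadConfig(process.env)`. If the soft-cap ratio is applied to the daily limit, the defaults put the warning point at 5.0 × 0.8 = 4.00 USD.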
 
## API endpoints
 
### POST /api/ingest
 
Ingest a single receipt image URL.
 
**Request:**
```json
{
  "receiptImageUrl": "https://example.com/receipt.jpg",
  "source": "mobile-upload",
  "callbackUrl": "https://hooks.example.com/callback"
}
```
 
**Response (200):**
```json
{
  "receiptId": "rcpt_abc123",
  "status": "success",
  "extractedData": {
    "vendorName": "Acme Coffee",
    "date": "2025-06-01",
    "lineItems": [
      { "name": "Latte", "quantity": 1, "unitPrice": 4.50, "totalPrice": 4.50 }
    ],
    "subtotal": 4.50,
    "total": 5.13,
    "currency": "USD"
  },
  "costUsd": 0.0023
}
```
 
**Response (422):**
```json
{
  "receiptId": "rcpt_abc123",
  "status": "low_confidence",
  "error": "OCR confidence below route threshold (0.45 < 0.80)",
  "costUsd": 0.0004
}
```
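
A client consuming this endpoint could model the two documented bodies as a discriminated union on `status`. The sketch below is one way to do that. The type names and `summarizeIngest` are hypothetical, the field types are inferred from the examples above, and other statuses (such as `budget_exceeded`, which appears in the batch example) are not covered.

```typescript
// Types inferred from the example payloads; not taken from the project's source.
interface LineItem {
  name: string;
  quantity: number;
  unitPrice: number;
  totalPrice: number;
}

interface ExtractedReceipt {
  vendorName: string;
  date: string; // e.g. "2025-06-01"
  lineItems: LineItem[];
  subtotal: number;
  total: number;
  currency: string;
}

interface IngestSuccess {
  receiptId: string;
  status: "success";
  extractedData: ExtractedReceipt;
  costUsd: number;
}

interface IngestLowConfidence {
  receiptId: string;
  status: "low_confidence";
  error: string;
  costUsd: number;
}

type IngestResponse = IngestSuccess | IngestLowConfidence;

// Narrowing on `status` gives type-safe access to the fields of each variant.
function summarizeIngest(res: IngestResponse): string {
  switch (res.status) {
    case "success": {
      const { vendorName, lineItems, total, currency } = res.extractedData;
      return `${res.receiptId}: ${vendorName}, ${lineItems.length} item(s), ${total.toFixed(2)} ${currency}`;
    }
    case "low_confidence":
      return `${res.receiptId}: rejected (${res.error})`;
  }
}
```

Switching on `status` rather than the HTTP code keeps the same handler usable for per-item results, which carry a `status` field as well.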
 
### GET /api/health
 
**Response (200):**
```json
{
  "status": "ok",
  "timestamp": "2025-06-22T12:00:00.000Z"
}
```
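
For orientation, a Next.js App Router route handler that returns this shape can be very small. The following is a minimal sketch, not the repository's actual handler; the file path `app/api/health/route.ts` is assumed from the App Router convention.

```typescript
// Assumed location: app/api/health/route.ts
// App Router route handlers export a function named after the HTTP method.
export function GET(): Response {
  return Response.json({
    status: "ok",
    // toISOString() yields the same format as the example, e.g. "2025-06-22T12:00:00.000Z".
    timestamp: new Date().toISOString(),
  });
}
```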
 
### POST /api/batch
 
Ingest up to 50 receipt image URLs in a single request. Larger batches are rejected with a 400.
 
**Request:**
```json
{
  "requests": [
    { "receiptImageUrl": "https://example.com/receipt1.jpg" },
    { "receiptImageUrl": "https://example.com/receipt2.jpg" }
  ]
}
```
 
**Response (200):**
```json
{
  "results": [
    { "receiptId": "rcpt_001", "status": "success", ... },
    { "receiptId": "rcpt_002", "status": "budget_exceeded", ... }
  ]
}
```
 
**Response (400):**
```json
{
  "error": "max batch size 50"
}
```
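
Because of the 50-item limit, a client with a longer list of receipts needs to split it before calling the endpoint. Here is one possible client-side helper; `toBatchRequests` and the interface names are hypothetical, and only the documented `receiptImageUrl` field is sent.

```typescript
const MAX_BATCH_SIZE = 50; // limit documented for POST /api/batch

interface BatchItem {
  receiptImageUrl: string;
}

interface BatchRequest {
  requests: BatchItem[];
}

// Split a list of URLs into request bodies of at most `size` items each.
function toBatchRequests(
  urls: readonly string[],
  size: number = MAX_BATCH_SIZE,
): BatchRequest[] {
  if (!Number.isInteger(size) || size < 1 || size > MAX_BATCH_SIZE) {
    throw new RangeError(`size must be an integer in [1, ${MAX_BATCH_SIZE}]`);
  }
  const batches: BatchRequest[] = [];
  for (let i = 0; i < urls.length; i += size) {
    batches.push({
      requests: urls
        .slice(i, i + size)
        .map((receiptImageUrl) => ({ receiptImageUrl })),
    });
  }
  return batches;
}
```

Each returned object can be serialized with `JSON.stringify` and sent as the body of one `POST /api/batch` call.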
 
## Tech Stack
 
- **Next.js 16+ App Router** — API route handlers and server infrastructure
- **@anthropic-ai/sdk** — Claude structured extraction from receipt text
- **Square SDK v44** — Pushing structured receipts to Square accounting
- **unstructured-client** — OCR and text extraction from receipt images
- **Zod** — Runtime schema validation for requests, config, and receipt data
- **@reaatech/confidence-router** — Quality gating on OCR confidence scores
- **@reaatech/structured-repair-core** — Repair malformed Claude JSON output
- **@reaatech/llm-cost-telemetry** — Token and cost tracking per LLM call
- **@reaatech/agent-budget-engine** — Daily USD spend caps and soft-cap warnings
- **Langfuse** — Observability and tracing across the pipeline
- **vitest** — Test runner with v8 coverage at ≥90%
 
## Running locally
 
```bash
pnpm install
pnpm test            # vitest run with coverage
pnpm dev             # next dev
```
 
## Project layout
 
```
app/                  Next.js App Router pages + API routes
src/                  services, lib, adapters
tests/                vitest suite (mirrors src/)
packages/             API references for every dependency (read these first)
DEV_PLAN.md           build plan for this recipe
```
 
## License
 
MIT — see [LICENSE](./LICENSE).