# OpenAI Invoice Extraction for Xero SMB Accounting
 
> Automatically extract and sync invoice data from PDFs and images directly into Xero, eliminating manual data entry for small businesses.
 
A production-grade AI pipeline from [reaatech.com](https://reaatech.com) built with the `@reaatech/*` package family.
 
## What it does
 
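The pipeline runs in five stages:

1. **Upload**: a PDF or image invoice is uploaded.
2. **Text extraction**: `unpdf` for PDFs, `tesseract.js` OCR for images.
3. **Structured extraction**: OpenAI GPT-5.2 vision extraction with Zod schema validation.
4. **Output repair**: model output that fails to parse or validate is repaired.
5. **Confidence routing** (via `@reaatech/confidence-router`): high-confidence items are auto-posted to Xero via `xero-node`; low-confidence items are flagged for human review in the dashboard.

Budget monitoring is handled by `@reaatech/agent-budget-engine`, and observability by Langfuse.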
 
## Architecture
 
```
User uploads invoice → S3 storage → Document loader (PDF parser / OCR)
  → OpenAI Responses API structured extraction → Output repair
    → Confidence routing → Xero API (auto) or review queue (manual)
```
 
## Prerequisites
 
- Node.js >= 22
- pnpm
- OpenAI API key
- Xero developer account (OAuth2 `client_credentials` / Custom Connection)
- AWS S3 bucket + IAM credentials
- Langfuse account (optional, for tracing)
 
## Getting Started
 
1. Copy `.env.example` to `.env` and fill in all values.
2. Run `pnpm install` to install dependencies.
3. Run `pnpm dev` to start the development server.
4. Open http://localhost:3000
5. Upload a PDF invoice via the web UI or POST to `/api/ingest`
 
## API Reference
 
### POST /api/ingest
 
Upload a PDF or image invoice for extraction. Accepts `multipart/form-data` with a `file` field.
 
- **Success**: `200` `{ extractionId, status, invoice, confidence, warnings, requiresReview }`
- **No file**: `400`
- **Oversized**: `413`
- **Budget exhausted**: `429`
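
The status codes above can be handled on the client roughly as follows. This is a minimal sketch, not the project's own client: the field types in `IngestResult`, the `IngestOutcome` union, and both function names are assumptions made for illustration, since this README only lists the field names and status codes.

```typescript
// Response fields listed for a 200 from POST /api/ingest.
// The field TYPES are assumptions; the README only names the fields.
interface IngestResult {
  extractionId: string;
  status: string;
  invoice: unknown;
  confidence: number;
  warnings: string[];
  requiresReview: boolean;
}

type IngestOutcome =
  | { ok: true; result: IngestResult }
  | {
      ok: false;
      reason: "no_file" | "too_large" | "budget_exhausted" | "unexpected";
      status: number;
    };

// Map the documented status codes to an outcome the caller can branch on.
function interpretIngestResponse(status: number, body: unknown): IngestOutcome {
  switch (status) {
    case 200:
      return { ok: true, result: body as IngestResult };
    case 400:
      return { ok: false, reason: "no_file", status };
    case 413:
      return { ok: false, reason: "too_large", status };
    case 429:
      return { ok: false, reason: "budget_exhausted", status };
    default:
      return { ok: false, reason: "unexpected", status };
  }
}

// Upload one invoice as multipart/form-data under the `file` field.
// `baseUrl` defaults to the local dev server from Getting Started.
async function uploadInvoice(
  file: Blob,
  filename: string,
  baseUrl = "http://localhost:3000",
): Promise<IngestOutcome> {
  const form = new FormData();
  form.append("file", file, filename);
  const res = await fetch(`${baseUrl}/api/ingest`, { method: "POST", body: form });
  // Error responses may not carry JSON, so parse defensively.
  const body: unknown = await res.json().catch(() => null);
  return interpretIngestResponse(res.status, body);
}
```

A caller would typically check `outcome.ok` first, then read `outcome.result.requiresReview` to decide whether to point the user at the review dashboard.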
 
### GET /api/review
 
Returns extractions awaiting human review.
 
### PATCH /api/review
 
Approve, reject, or edit a pending extraction. Body: `{ extractionId, action: "approve"|"reject"|"edit", editedData? }`
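
A request body for this endpoint could be built as in the sketch below. The helper name `buildReviewRequest`, the `ext_123` ID, and the `total` field are hypothetical. The rule that `editedData` is required for `"edit"` and omitted otherwise is an assumption; the README only marks the field as optional.

```typescript
// Body shape from the README: { extractionId, action, editedData? }.
type ReviewAction = "approve" | "reject" | "edit";

interface ReviewRequest {
  extractionId: string;
  action: ReviewAction;
  editedData?: Record<string, unknown>;
}

// Build a PATCH /api/review body.
// Assumption: `editedData` is required for "edit" and dropped for other actions.
function buildReviewRequest(
  extractionId: string,
  action: ReviewAction,
  editedData?: Record<string, unknown>,
): ReviewRequest {
  if (action === "edit") {
    if (editedData === undefined) {
      throw new Error('action "edit" requires editedData');
    }
    return { extractionId, action, editedData };
  }
  return { extractionId, action };
}

// Hypothetical extraction ID and field name, for illustration only.
const approveBody = buildReviewRequest("ext_123", "approve");
const editBody = buildReviewRequest("ext_123", "edit", { total: 118.5 });
```

Either body can then be serialized with `JSON.stringify` and sent with `Content-Type: application/json`.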
 
### GET /api/invoices
 
Returns processed invoices. Supports `?status=` and `?page=`/`?limit=` params.
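
The query parameters can be assembled as in this small sketch. `buildInvoicesPath` and `InvoiceQuery` are hypothetical names, and the accepted `status` values are not listed in this README, so any value passed here is a placeholder.

```typescript
interface InvoiceQuery {
  status?: string; // accepted values are not listed in this README
  page?: number;
  limit?: number;
}

// Build the path and query string for GET /api/invoices, skipping unset params.
function buildInvoicesPath(query: InvoiceQuery = {}): string {
  const params = new URLSearchParams();
  if (query.status !== undefined) params.set("status", query.status);
  if (query.page !== undefined) params.set("page", String(query.page));
  if (query.limit !== undefined) params.set("limit", String(query.limit));
  const qs = params.toString();
  return qs ? `/api/invoices?${qs}` : "/api/invoices";
}
```

For example, `buildInvoicesPath({ page: 2, limit: 20 })` yields `/api/invoices?page=2&limit=20`.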
 
## Project Structure
 
| Path | Responsibility |
|------|---------------|
| `src/types/invoice.ts` | InvoiceData + LineItem interfaces and Zod schemas |
| `src/types/extraction.ts` | ExtractionResult, ExtractionMetadata, DocumentSource types |
| `src/types/xero.ts` | Xero API types |
| `src/services/pdf-parser.ts` | PDF text extraction via unpdf |
| `src/services/ocr-service.ts` | Image OCR via tesseract.js |
| `src/services/document-loader.ts` | Unified document loading (PDF or image → text) |
| `src/extraction/schema-builder.ts` | Invoice schema prompt + validation |
| `src/extraction/output-repair.ts` | Output repair with trim/coerce/regex fallback |
| `src/extraction/pipeline.ts` | Main extraction orchestration |
| `src/integrations/s3-storage.ts` | S3 document upload/download |
| `src/integrations/xero.ts` | Xero invoice creation via xero-node |
| `src/budget/budget-controller.ts` | LLM spend monitoring |
| `src/classification/confidence-router.ts` | Confidence-based routing |
| `src/evaluation/golden-comparator.ts` | Golden trajectory comparison |
| `src/lib/langfuse.ts` | Langfuse observability |
| `app/api/ingest/route.ts` | Document upload endpoint |
| `app/api/review/route.ts` | Review queue endpoints |
| `app/api/invoices/route.ts` | Processed invoices endpoint |
| `app/page.tsx` | Landing page with upload form |
| `app/admin/page.tsx` | Review dashboard |
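
To make the "trim/coerce/regex fallback" description of `src/extraction/output-repair.ts` concrete, here is a rough sketch of what such a repair step might look like. It is not the project's implementation: every function name below is hypothetical, and the actual module may repair output differently.

```typescript
// Hypothetical repair helpers; the real logic in output-repair.ts may differ.

// Trim: drop surrounding whitespace and a wrapping Markdown code fence, if any.
function stripFences(raw: string): string {
  const trimmed = raw.trim();
  const fenced = trimmed.match(/^```(?:json)?\s*([\s\S]*?)\s*```$/);
  return fenced?.[1] ?? trimmed;
}

// Regex fallback: take the outermost {...} span when JSON is wrapped in prose.
function extractJsonObject(text: string): string | null {
  const match = text.match(/\{[\s\S]*\}/);
  return match ? match[0] : null;
}

// Coerce: turn numeric strings such as "12.50" into numbers for named fields.
function coerceNumbers(
  obj: Record<string, unknown>,
  fields: string[],
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...obj };
  for (const field of fields) {
    const value = out[field];
    if (typeof value === "string" && value.trim() !== "" && Number.isFinite(Number(value))) {
      out[field] = Number(value);
    }
  }
  return out;
}

// Try the cleaned text first, then the regex-extracted object; null if both fail.
function repairOutput(
  raw: string,
  numericFields: string[] = [],
): Record<string, unknown> | null {
  const cleaned = stripFences(raw);
  const candidates = [cleaned, extractJsonObject(cleaned)];
  for (const candidate of candidates) {
    if (candidate === null) continue;
    try {
      const parsed: unknown = JSON.parse(candidate);
      if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
        return coerceNumbers(parsed as Record<string, unknown>, numericFields);
      }
    } catch {
      // Not valid JSON; fall through to the next candidate.
    }
  }
  return null;
}
```

A repaired object would still need to pass schema validation afterwards; repair only gets the output into a parseable shape.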
 
## Configuration
 
| Variable | Default | Description |
|----------|---------|-------------|
| `OPENAI_API_KEY` | — | OpenAI API key |
| `OPENAI_MODEL` | `gpt-5.2` | OpenAI model ID |
| `XERO_CLIENT_ID` | — | Xero OAuth2 client ID |
| `XERO_CLIENT_SECRET` | — | Xero OAuth2 client secret |
| `AWS_REGION` | `us-east-1` | AWS region |
| `AWS_ACCESS_KEY_ID` | — | AWS access key |
| `AWS_SECRET_ACCESS_KEY` | — | AWS secret key |
| `S3_BUCKET_NAME` | — | S3 bucket for document storage |
| `LANGFUSE_PUBLIC_KEY` | — | Langfuse public key |
| `LANGFUSE_SECRET_KEY` | — | Langfuse secret key |
| `LANGFUSE_BASE_URL` | — | Langfuse base URL |
| `CONFIDENCE_ROUTE_THRESHOLD` | `0.8` | Extractions with confidence above this value are routed automatically |
| `CONFIDENCE_FALLBACK_THRESHOLD` | `0.3` | Confidence below this triggers fallback |
| `BUDGET_MONTHLY_LIMIT` | `50.0` | Monthly LLM spend cap in USD |
| `EMBEDDING_MODEL` | `text-embedding-3-small` | Embedding model |
| `EMBEDDING_BATCH_SIZE` | `100` | Embedding batch size |
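
The two confidence thresholds split extractions into three bands. The sketch below illustrates that split using the documented defaults; it does not use the `@reaatech/confidence-router` API, and the names `readThresholds`, `routeByConfidence`, and the `Route` labels are hypothetical. What the fallback band actually does is not specified in this README.

```typescript
type Route = "auto" | "review" | "fallback";

interface Thresholds {
  route: number; // CONFIDENCE_ROUTE_THRESHOLD
  fallback: number; // CONFIDENCE_FALLBACK_THRESHOLD
}

// Read thresholds from an env-like record, using the documented defaults
// when a variable is unset, empty, or not a finite number.
function readThresholds(env: Record<string, string | undefined>): Thresholds {
  const parse = (value: string | undefined, fallback: number): number => {
    if (value === undefined || value.trim() === "") return fallback;
    const n = Number(value);
    return Number.isFinite(n) ? n : fallback;
  };
  return {
    route: parse(env["CONFIDENCE_ROUTE_THRESHOLD"], 0.8),
    fallback: parse(env["CONFIDENCE_FALLBACK_THRESHOLD"], 0.3),
  };
}

// Above the route threshold: handled automatically.
// Below the fallback threshold: fallback path.
// Anything in between (including exactly at either threshold): human review.
function routeByConfidence(confidence: number, t: Thresholds): Route {
  if (confidence > t.route) return "auto";
  if (confidence < t.fallback) return "fallback";
  return "review";
}
```

With the defaults, a confidence of `0.92` routes to `"auto"`, `0.5` to `"review"`, and `0.1` to `"fallback"`. In a Node.js process, `readThresholds(process.env)` would pick up the values from `.env`.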
 
## Packages
 
| Package | Purpose |
|---------|-----|
| [`@reaatech/agent-budget-engine`](packages/reaatech__agent-budget-engine.md) | LLM spend monitoring |
| [`@reaatech/agent-eval-harness-golden`](packages/reaatech__agent-eval-harness-golden.md) | Golden trajectory evaluation |
| [`@reaatech/confidence-router`](packages/reaatech__confidence-router.md) | Confidence-based routing |
| [`@reaatech/hybrid-rag`](packages/reaatech__hybrid-rag.md) | Hybrid retrieval-augmented generation |
| [`@reaatech/hybrid-rag-embedding`](packages/reaatech__hybrid-rag-embedding.md) | Embedding utilities |
| [`unpdf`](packages/unpdf.md) | PDF text extraction |
| [`tesseract.js`](packages/tesseract.js.md) | Image OCR |
| [`openai`](packages/openai.md) | OpenAI Responses API |
| [`xero-node`](packages/xero-node.md) | Xero API integration |
| [`zod`](packages/zod.md) | Schema validation |
| [`@aws-sdk/client-s3`](packages/aws-sdk__client-s3.md) | S3 document storage |
| [`langfuse`](packages/langfuse.md) | Observability tracing |
| [`p-limit`](packages/p-limit.md) | Concurrency limiting |
| [`p-retry`](packages/p-retry.md) | Retry logic |
 
## Testing
 
```bash
pnpm test
```
 
All tests mock external dependencies (MSW for OpenAI, `vi.mock` for other packages). Coverage is >= 90% on runtime code.
 
## License
 
MIT — see [LICENSE](./LICENSE).