# Mistral AI Document Pipeline for Xero Expense Report Processing
 
> A production-grade reference solution that uses Mistral AI to extract line items from receipts and invoices, categorizes expenses, enforces a daily LLM spend limit, and pushes the results into Xero.
 
Built on the `@reaatech/*` package family, this pipeline demonstrates how to compose structured repair, context-window planning, LLM cost telemetry, and agent budget enforcement into a cohesive document processing system.
 
## Architecture
 
The pipeline processes documents through six sequential stages:
 
```
Extract → Plan → Parse → Repair → Record → Push
```
 
| Stage   | Service                        | Description |
|---------|--------------------------------|-------------|
| Extract | `document-extractor`           | Reads text from PDF (via `pdfjs-dist`) and XLSX (via `xlsx`) files |
| Plan    | `context-planner`              | Chunks multi-page documents to fit within the LLM context window using `@reaatech/context-window-planner` |
| Parse   | `expense-parser`               | Sends extracted text to Mistral AI (`mistral-large-latest`) for structured expense JSON extraction |
| Repair  | `@reaatech/structured-repair-core` | Fixes malformed JSON (missing braces, trailing commas, code fences) via a six-strategy repair pipeline |
| Record  | `cost-telemetry` + `budget-enforcer` | Records token usage and cost via `@reaatech/llm-cost-telemetry`; enforces daily spend caps via `@reaatech/agent-budget-engine` |
| Push    | `xero-client`                  | Creates ACCREC invoices in Xero via `xero-node` (client_credentials OAuth 2.0) |
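
As a rough illustration of the Plan stage, the sketch below greedily packs pages into chunks that stay under a token budget. It does not use the `@reaatech/context-window-planner` API; the function names `estimateTokens` and `packPages` and the 4-characters-per-token estimate are assumptions made for this example.

```typescript
// Rough token estimate: about 4 characters per token.
// This is an assumption for illustration, not a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily pack consecutive pages into chunks that stay within maxTokens.
// A single page larger than the budget still becomes its own chunk.
function packPages(pages: string[], maxTokens: number): string[][] {
  const chunks: string[][] = [];
  let current: string[] = [];
  let used = 0;

  for (const page of pages) {
    const cost = estimateTokens(page);
    // Start a new chunk when adding this page would exceed the budget.
    if (current.length > 0 && used + cost > maxTokens) {
      chunks.push(current);
      current = [];
      used = 0;
    }
    current.push(page);
    used += cost;
  }

  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each chunk can then be sent to the Parse stage as a separate request. A real planner would also reserve room for the system prompt and the model's response.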
 
## Layout
 
```
.
├── app/
│   ├── api/
│   │   ├── process-expense/route.ts   # POST — upload and process a receipt
│   │   └── budget-status/route.ts     # GET — current budget state
│   ├── layout.tsx
│   └── page.tsx                       # Upload UI (client component)
├── src/
│   ├── schemas/expense-schema.ts      # Zod schemas for expense documents
│   ├── types/index.ts                 # Shared TypeScript types
│   ├── services/
│   │   ├── document-extractor.ts      # PDF/XLSX text extraction
│   │   ├── context-planner.ts         # Context window packing
│   │   ├── expense-parser.ts          # Mistral AI parsing orchestration
│   │   ├── cost-telemetry.ts          # LLM cost tracking
│   │   ├── budget-enforcer.ts         # Daily budget enforcement
│   │   └── xero-client.ts             # Xero API integration
│   └── index.ts                       # Barrel exports
├── tests/
│   ├── unit/                          # Unit tests per service
│   └── integration/                   # End-to-end route tests
├── packages/                          # API references for every dependency
├── DEV_PLAN.md                        # Build plan
└── package.json
```
 
## Quick Start
 
```bash
pnpm install
pnpm dev              # Start Next.js dev server (http://localhost:3000)
pnpm test             # Run vitest with coverage (target: 90%+)
pnpm typecheck        # TypeScript type checking
pnpm lint             # ESLint
```
 
## Environment Variables
 
| Variable               | Description                                          |
|------------------------|------------------------------------------------------|
| `MISTRAL_API_KEY`      | Mistral AI API key for model inference               |
| `XERO_CLIENT_ID`       | Xero OAuth 2.0 client ID (custom connection)         |
| `XERO_CLIENT_SECRET`   | Xero OAuth 2.0 client secret (custom connection)     |
| `DEFAULT_DAILY_BUDGET` | Default daily LLM spend limit in USD (default: 10.0) |
 
Copy `.env.example` to `.env.local` and fill in your credentials.
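
A minimal sketch of how these variables might be read and validated at startup. The `PipelineConfig` shape and the `loadConfig` helper are hypothetical and are not taken from the repository; only the variable names and the 10.0 default come from the table above.

```typescript
// Hypothetical config shape; the field names are assumptions.
interface PipelineConfig {
  mistralApiKey: string;
  xeroClientId: string;
  xeroClientSecret: string;
  defaultDailyBudget: number;
}

function loadConfig(env: Record<string, string | undefined>): PipelineConfig {
  // Fail fast when a credential is missing or empty.
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required environment variable: ${name}`);
    return value;
  };

  // Fall back to the documented default of 10.0 USD when unset.
  const rawBudget = env.DEFAULT_DAILY_BUDGET;
  const budget = rawBudget === undefined || rawBudget === "" ? 10.0 : Number(rawBudget);
  if (!Number.isFinite(budget) || budget <= 0) {
    throw new Error("DEFAULT_DAILY_BUDGET must be a positive number");
  }

  return {
    mistralApiKey: required("MISTRAL_API_KEY"),
    xeroClientId: required("XERO_CLIENT_ID"),
    xeroClientSecret: required("XERO_CLIENT_SECRET"),
    defaultDailyBudget: budget,
  };
}
```

In a Node.js process this would typically be called once as `loadConfig(process.env)`.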
 
## API Endpoints
 
### POST /api/process-expense
 
Upload a receipt or invoice (PDF or XLSX, max 10 MB) for processing.
 
```bash
curl -X POST http://localhost:3000/api/process-expense \
  -F "file=@receipt.pdf"
```
 
**Response** (200):
 
```json
{
  "batchId": "abc123",
  "status": "completed",
  "documents": [
    {
      "vendorName": "Acme Corp",
      "date": "2025-01-15",
      "totalAmount": 42.00,
      "currency": "USD",
      "lineItems": [
        {
          "itemDescription": "Widget",
          "quantity": 2,
          "unitAmount": 21.00,
          "taxType": "GST",
          "lineAmount": 42.00,
          "category": "Supplies"
        }
      ],
      "receiptNumber": "INV-001"
    }
  ],
  "costBreakdown": {
    "totalTokens": 1450,
    "totalCostUsd": 0.0058
  },
  "xeroStatus": "pushed",
  "errors": []
}
```
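
For API consumers, the `documents` entries in the response above could be typed roughly as follows. The interface names and the `lineItemsMatchTotal` check are assumptions for illustration; the project's actual Zod schemas live in `src/schemas/expense-schema.ts` and may differ.

```typescript
// Field names mirror the example response; the types are inferred from it.
interface ExpenseLineItem {
  itemDescription: string;
  quantity: number;
  unitAmount: number;
  taxType: string;
  lineAmount: number;
  category: string;
}

interface ExpenseDocument {
  vendorName: string;
  date: string; // ISO date, e.g. "2025-01-15"
  totalAmount: number;
  currency: string;
  lineItems: ExpenseLineItem[];
  receiptNumber: string;
}

// Sanity check on parsed output.
// Assumption: totalAmount equals the sum of lineAmount values, as in the
// example above; the tolerance absorbs floating-point rounding.
function lineItemsMatchTotal(doc: ExpenseDocument, tolerance = 0.005): boolean {
  const sum = doc.lineItems.reduce((acc, item) => acc + item.lineAmount, 0);
  return Math.abs(sum - doc.totalAmount) <= tolerance;
}
```

A check like this is one way to flag documents where the model misread a quantity or amount before they reach Xero.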
 
**Error responses:**
 
| Status | Condition            |
|--------|----------------------|
| 400    | No file or unsupported MIME type |
| 413    | File exceeds 10 MB   |
| 429    | Daily budget exceeded |
| 500    | Internal error (parse failure, Xero error, etc.) |
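
The 400 and 413 conditions can be sketched as a small pre-check that runs before any LLM call. The helper name `rejectUpload`, the exact MIME strings, and reading "10 MB" as 10 × 1024 × 1024 bytes are assumptions; the actual route handler may check these differently.

```typescript
// Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024;

// Assumed MIME types for PDF and XLSX uploads.
const ALLOWED_MIME_TYPES = new Set<string>([
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
]);

interface UploadInfo {
  type: string; // MIME type reported for the uploaded file
  size: number; // size in bytes
}

// Returns the HTTP status for a rejected upload, or null when it passes.
function rejectUpload(file: UploadInfo | null): 400 | 413 | null {
  if (file === null || !ALLOWED_MIME_TYPES.has(file.type)) return 400;
  if (file.size > MAX_UPLOAD_BYTES) return 413;
  return null;
}
```

Rejecting bad uploads this early means they never consume any of the daily budget.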
 
### GET /api/budget-status
 
Returns the current daily budget state.
 
```bash
curl http://localhost:3000/api/budget-status
```
 
**Response** (200):
 
```json
{
  "spent": 3.50,
  "remaining": 6.50,
  "state": "Active"
}
```
 
`state` is `"Active"` while spend is below the soft cap (80% of the daily budget), `"Warned"` between 80% and 100%, and `"Exceeded"` once the hard cap (100%) is reached.
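
The state rules can be sketched as a pure function. This is not the `@reaatech/agent-budget-engine` API; `classifyBudget` and `budgetStatus` are hypothetical names, and treating spend of exactly 80% as `"Warned"` is an assumption about the boundary.

```typescript
type BudgetState = "Active" | "Warned" | "Exceeded";

// Classify spend against the daily limit.
// Assumption: the soft cap is inclusive, so exactly 80% counts as "Warned".
function classifyBudget(spent: number, dailyLimit: number, softCapRatio = 0.8): BudgetState {
  if (spent >= dailyLimit) return "Exceeded";
  if (spent >= dailyLimit * softCapRatio) return "Warned";
  return "Active";
}

// Build a payload shaped like the /api/budget-status response.
function budgetStatus(spent: number, dailyLimit: number) {
  return {
    spent,
    remaining: Math.max(0, dailyLimit - spent), // never report a negative remainder
    state: classifyBudget(spent, dailyLimit),
  };
}
```

With a 10.00 USD limit and 3.50 USD spent, `budgetStatus(3.5, 10)` matches the example response: 6.50 remaining and state `"Active"`.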
 
## Packages
 
### REAA Packages
 
| Package                                | Role                                                      |
|----------------------------------------|-----------------------------------------------------------|
| `@reaatech/structured-repair-core`     | Six-strategy JSON repair pipeline for malformed LLM output |
| `@reaatech/llm-cost-telemetry`         | Token-based cost calculation and telemetry span generation |
| `@reaatech/context-window-planner`     | Context-window packing and priority-based chunk dropping  |
| `@reaatech/agent-budget-engine`        | Daily spend budget enforcement with soft/hard cap policies |
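
To make the repair idea concrete, here is a minimal sketch of three of the failure modes named in the Architecture table: code fences, missing braces, and trailing commas. It is not the `@reaatech/structured-repair-core` implementation or API. All function names are hypothetical, and the comma handling is deliberately naive.

```typescript
// Built at runtime so the example contains no literal fence marker.
const FENCE = "`".repeat(3);

// Strategy 1: strip a surrounding Markdown code fence, including a language tag.
function stripCodeFence(text: string): string {
  let body = text.trim();
  if (body.startsWith(FENCE)) {
    const firstNewline = body.indexOf("\n");
    body = firstNewline === -1 ? "" : body.slice(firstNewline + 1);
    if (body.trimEnd().endsWith(FENCE)) {
      body = body.trimEnd().slice(0, -FENCE.length);
    }
  }
  return body.trim();
}

// Strategy 2: append closers for braces/brackets left open, skipping string contents.
function closeOpenBrackets(text: string): string {
  const stack: string[] = [];
  let inString = false;
  let escaped = false;
  for (const ch of text) {
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") stack.push("}");
    else if (ch === "[") stack.push("]");
    else if (ch === "}" || ch === "]") stack.pop();
  }
  return text + stack.reverse().join("");
}

// Strategy 3: drop commas directly before a closing brace or bracket.
// Naive: it would also rewrite such a sequence inside a string literal.
function removeTrailingCommas(text: string): string {
  return text.replace(/,\s*([}\]])/g, "$1");
}

// Try to parse; on failure apply the next strategy and try again.
function repairJson(raw: string): unknown {
  const strategies = [stripCodeFence, closeOpenBrackets, removeTrailingCommas];
  let candidate = raw;
  for (const strategy of strategies) {
    try {
      return JSON.parse(candidate);
    } catch {
      candidate = strategy(candidate);
    }
  }
  return JSON.parse(candidate); // throws if every strategy failed
}
```

Applying strategies only after a parse failure leaves already-valid output untouched, which matters when most model responses are well-formed.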
 
### Third-Party Packages
 
| Package            | Version | Role                                     |
|--------------------|---------|------------------------------------------|
| `@mistralai/mistralai` | 2.2.5 | Mistral AI SDK (chat completion)          |
| `next`             | 16.2.7  | Next.js framework (App Router)           |
| `react`            | 19.2.4  | React UI library                         |
| `react-dom`        | 19.2.4  | React DOM renderer                       |
| `pdfjs-dist`       | 6.0.227 | PDF text extraction                      |
| `xlsx`             | 0.18.5  | XLSX spreadsheet parsing                 |
| `xero-node`        | 18.0.0  | Xero accounting API SDK                  |
| `zod`              | 3.23.8  | Schema validation                        |
 
## Testing
 
The project uses [Vitest](https://vitest.dev) with `@vitest/coverage-v8` for test coverage. Tests mirror the `src/` directory structure under `tests/`.
 
```bash
pnpm test    # Runs all tests with coverage report
```
 
The test suite includes:
- **Unit tests** for each service (`document-extractor`, `context-planner`, `expense-parser`, `cost-telemetry`, `budget-enforcer`, `xero-client`)
- **Integration tests** for the API route handlers (using MSW for HTTP mocking)
- Coverage target: 90%+ across lines, branches, functions, and statements
 
## License
 
MIT — see [LICENSE](./LICENSE).