# Vertex AI Invoice Extraction for SMB Accounting
> Turn stacks of invoices and receipts into clean QuickBooks transactions with Vertex AI document parsing and structured repair, cutting manual data entry down to reviewing the fields the pipeline is unsure about.
Small business owners waste hours each week manually entering invoice data into QuickBooks. This recipe demonstrates a production-grade document pipeline that:
1. Ingests PDF invoices via a Next.js API route or CLI
2. Extracts structured data using Google's Gemini 2.5 Flash on Vertex AI
3. Repairs malformed LLM output with @reaatech/structured-repair-core
4. Routes high-confidence transactions to QuickBooks and flags low-confidence fields for human review using @reaatech/confidence-router
5. Tracks per-document processing costs with @reaatech/llm-cost-telemetry
## Problem
Entering invoice data into QuickBooks by hand is slow and error-prone, and bookkeeping mistakes ripple into tax filings and cash-flow decisions. This recipe removes most of that work by turning stacks of PDF invoices and receipts into clean, structured QuickBooks transactions, leaving only low-confidence fields for human review.
## Architecture
PDF Upload → Vertex AI Gemini 2.5 Flash → Structured Repair → Confidence Routing → QuickBooks (high confidence) | Review Queue (low confidence)
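The routing step at the end of this flow can be pictured with a small sketch. The types, the function name `routeInvoice`, and the `0.9` threshold are all assumptions made for illustration; this is not the API of `@reaatech/confidence-router`, only one plausible way to split documents between QuickBooks and the review queue.

```typescript
// Hypothetical shapes for illustration; not the @reaatech/confidence-router API.
interface ExtractedField {
  name: string;
  value: string;
  confidence: number; // 0..1, assumed to be produced by the extraction step
}

type RouteDecision =
  | { status: "sent"; fields: ExtractedField[] }
  | { status: "review_required"; lowConfidence: string[] };

// Send to QuickBooks only when at least one field was extracted and every
// field meets the threshold; otherwise name the fields a human should check.
function routeInvoice(fields: ExtractedField[], threshold = 0.9): RouteDecision {
  const lowConfidence = fields
    .filter((f) => f.confidence < threshold)
    .map((f) => f.name);
  if (fields.length > 0 && lowConfidence.length === 0) {
    return { status: "sent", fields };
  }
  // An empty extraction also lands here, with an empty lowConfidence list.
  return { status: "review_required", lowConfidence };
}
```

Requiring every field to clear the threshold is the conservative choice: a single doubtful total or vendor name keeps the whole document out of the books until someone has looked at it.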
## Getting Started
```bash
pnpm install
pnpm dev
```
### Environment Variables
Copy `.env.example` to `.env` and fill in your credentials. The pipeline reads these variables:
| Variable | Description |
|----------|-------------|
| `GOOGLE_CLOUD_PROJECT` | GCP project ID for Vertex AI |
| `GOOGLE_CLOUD_LOCATION` | GCP location (e.g. `us-central1`) |
| `GOOGLE_GENAI_USE_VERTEXAI` | Set to `true` to use Vertex AI |
| `GOOGLE_APPLICATION_CREDENTIALS` | Path to service-account JSON key |
| `QUICKBOOKS_WEBHOOK_URL` | Webhook URL to receive QuickBooks transactions |
| `QUICKBOOKS_API_TOKEN` | Bearer token for the webhook |
| `DEFAULT_DAILY_BUDGET` | Daily LLM cost budget in USD (default `5.0`) |
| `DEBUG` | Set to `true` for verbose repair-pipeline logging |
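A startup check along these lines can catch a missing variable before the first invoice is processed. This is a minimal sketch, not code from the recipe: `loadConfig` and `AppConfig` are hypothetical names, and it assumes that `DEFAULT_DAILY_BUDGET` and `DEBUG` are optional while the other six variables must be set.

```typescript
// Variable names come from the table above; the loader itself is hypothetical.
const REQUIRED_VARS = [
  "GOOGLE_CLOUD_PROJECT",
  "GOOGLE_CLOUD_LOCATION",
  "GOOGLE_GENAI_USE_VERTEXAI",
  "GOOGLE_APPLICATION_CREDENTIALS",
  "QUICKBOOKS_WEBHOOK_URL",
  "QUICKBOOKS_API_TOKEN",
] as const;

type RequiredVar = (typeof REQUIRED_VARS)[number];

interface AppConfig {
  values: Record<RequiredVar, string>;
  dailyBudgetUsd: number;
  debug: boolean;
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  // Report every missing variable at once rather than failing one at a time.
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  // Fall back to the documented default of 5.0 USD when unset or empty.
  const dailyBudgetUsd = Number(env["DEFAULT_DAILY_BUDGET"] || "5.0");
  if (!Number.isFinite(dailyBudgetUsd) || dailyBudgetUsd < 0) {
    throw new Error("DEFAULT_DAILY_BUDGET must be a non-negative number");
  }
  const values = {} as Record<RequiredVar, string>;
  for (const name of REQUIRED_VARS) {
    values[name] = env[name] as string;
  }
  return { values, dailyBudgetUsd, debug: env["DEBUG"] === "true" };
}
```

In a Node.js process this would typically be called once as `loadConfig(process.env)`.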
## API
### POST /api/extract
Upload a PDF invoice as multipart/form-data (field name: `file`).
- **200**: `{ status: "sent", transactionId }` — high confidence, sent to QuickBooks
- **200**: `{ status: "review_required", reviewId }` — low confidence, queued for review
- **400**: `{ error: "file required" }`
- **422**: `{ error: "extraction failed", details }`
- **500**: `{ error: "internal error", message }`
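On the client side, the response shapes above map naturally onto a discriminated union. The sketch below is an assumption about how a caller might consume the endpoint: `interpretExtractResponse` and `ExtractResult` are hypothetical names, and the IDs are assumed to be strings, which the response list does not state.

```typescript
// Response shapes follow the list above; ID types are assumed to be strings.
type ExtractResult =
  | { kind: "sent"; transactionId: string }
  | { kind: "review_required"; reviewId: string }
  | { kind: "error"; httpStatus: number; error: string };

// Turns an HTTP status and parsed JSON body into a typed result.
function interpretExtractResponse(httpStatus: number, body: unknown): ExtractResult {
  const b = (typeof body === "object" && body !== null ? body : {}) as Record<string, unknown>;
  const transactionId = b["transactionId"];
  const reviewId = b["reviewId"];
  const error = b["error"];

  if (httpStatus === 200 && b["status"] === "sent" && typeof transactionId === "string") {
    return { kind: "sent", transactionId };
  }
  if (httpStatus === 200 && b["status"] === "review_required" && typeof reviewId === "string") {
    return { kind: "review_required", reviewId };
  }
  // 400, 422, 500, and anything that does not match the documented shapes.
  return {
    kind: "error",
    httpStatus,
    error: typeof error === "string" ? error : "unexpected response",
  };
}
```

A caller would post the PDF with `fetch` and a `FormData` body whose `file` field holds the document, then pass `res.status` and the parsed JSON to this function and switch on `kind`.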
### GET /api/extract
Health check. Returns `{ status: "ok", version: "0.1.0" }`.
## CLI
```bash
pnpm tsx src/cli/batch-process.ts \
  --dir /path/to/invoices \
  --quickbooks-url https://your-webhook.example.com \
  --quickbooks-token your-token \
  --budget-limit 10.0
```
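The README does not spell out how `--budget-limit` is enforced. One plausible reading is that the batch stops once the accumulated LLM cost reaches the limit, which could be sketched as follows. `TokenRates`, `documentCostUsd`, and `BudgetGuard` are hypothetical names rather than exports of `@reaatech/llm-cost-telemetry`, and the per-token rates are left as inputs for the caller to fill in from current pricing.

```typescript
// Hypothetical rate structure; supply the current prices for your model.
interface TokenRates {
  inputUsdPerMillion: number;
  outputUsdPerMillion: number;
}

// Cost of one document from its token counts.
function documentCostUsd(inputTokens: number, outputTokens: number, rates: TokenRates): number {
  return (
    (inputTokens / 1e6) * rates.inputUsdPerMillion +
    (outputTokens / 1e6) * rates.outputUsdPerMillion
  );
}

// Accumulates spend and tells the batch loop when to stop.
class BudgetGuard {
  private spentUsd = 0;
  private readonly limitUsd: number;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  // Records a document's cost; returns true while spend is below the limit.
  record(costUsd: number): boolean {
    this.spentUsd += costUsd;
    return this.spentUsd < this.limitUsd;
  }

  get spent(): number {
    return this.spentUsd;
  }
}
```

A batch loop would call `record` after each invoice and break as soon as it returns `false`, so at most one document overshoots the limit.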
## Testing
```bash
pnpm test
```
## Packages
| Package | Role |
|---------|------|
| @google/genai | Vertex AI Gemini SDK for structured extraction |
| @reaatech/structured-repair-core | Repairs malformed LLM JSON output |
| @reaatech/confidence-router | Routes low-confidence fields to human review |
| @reaatech/llm-cost-telemetry | Tracks per-document LLM cost |
| @reaatech/llm-cost-telemetry-calculator | Calculates cost from token counts |
| pdf-parse | Extracts text from PDF files |
| zod | Schema validation for invoice data |
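
To give a feel for what "repairing malformed LLM JSON output" can involve, here is a deliberately small sketch. It is not how `@reaatech/structured-repair-core` works internally, which this README does not describe; `repairJson` is a hypothetical helper that handles only two common failure modes, namely text wrapped around the JSON object and trailing commas.

```typescript
// Hypothetical helper for illustration; a real repair library covers far more cases.
function repairJson(raw: string): unknown {
  // 1. Keep only the outermost {...} span. This discards leading or trailing
  //    chatter and any Markdown code fence around the payload.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("no JSON object found");
  }
  let text = raw.slice(start, end + 1);

  // 2. Remove trailing commas before a closing brace or bracket.
  //    Limitation: this would also rewrite a ",}" sequence inside a string value.
  text = text.replace(/,\s*([}\]])/g, "$1");

  // 3. Let the standard parser decide; it throws if the text is still invalid.
  return JSON.parse(text);
}
```

The parsed value is still untyped at this point, so it would need schema validation (the role `zod` plays in the table above) before being treated as invoice data.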