# xAI Grok Tax Form Extraction for SMB Accounting
> Automatically extract and normalize line items from tax forms (1040, W-2, 1099) for small business bookkeeping using Grok's reasoning and REAA's output repair engine.
## Problem
SMB accountants spend hours transcribing numbers from PDF tax forms into spreadsheets. Manual entry is slow and error-prone, and off-the-shelf OCR often produces garbled or malformed JSON that downstream systems can't use.
## Architecture
This pipeline ingests scanned or digital tax forms, extracts text with OCR, sends the content to xAI Grok with a structured JSON schema, and repairs any malformed output using `@reaatech/structured-repair-core`'s graduated repair strategies. `@reaatech/llm-cost-telemetry` tracks per-form spend, `@reaatech/llm-cache` avoids re-processing semantically identical pages, and `@reaatech/agent-budget-engine` enforces daily cost caps.
```
PDF Upload  →  Text Extraction  →  xAI Grok LLM      →  Schema Repair  →  Validated JSON
(unpdf +       (unpdf primary,     (OpenAI SDK,         (structured-      (Zod-validated
tesseract.js   tesseract.js        Chat Completions     repair-core)      response)
fallback)      OCR fallback)       API)
```
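The schema-repair stage can be pictured as a ladder of increasingly aggressive fixes, tried in order until the model output parses. The sketch below illustrates that idea in plain TypeScript. It is not the API of `@reaatech/structured-repair-core`: the names `repairJson`, `stripCodeFences`, `extractObject`, and `dropTrailingCommas`, and the choice of strategies, are assumptions made for illustration.

```typescript
// Hypothetical illustration of graduated JSON repair; this is NOT the
// @reaatech/structured-repair-core API.
type RepairStep = (raw: string) => string;

// Built at runtime so the sample itself contains no literal code fence.
const FENCE = "`".repeat(3);

// Step 1: remove a Markdown code fence wrapped around the JSON.
const stripCodeFences: RepairStep = (raw) =>
  raw
    .trim()
    .replace(new RegExp(`^${FENCE}(?:json)?`, "i"), "")
    .replace(new RegExp(`${FENCE}$`), "")
    .trim();

// Step 2: keep only the outermost {...} span, dropping chatter around it.
const extractObject: RepairStep = (raw) => {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  return start >= 0 && end > start ? raw.slice(start, end + 1) : raw;
};

// Step 3: drop trailing commas before a closing brace or bracket.
// Naive: it would also rewrite such a sequence inside a string value.
const dropTrailingCommas: RepairStep = (raw) =>
  raw.replace(/,\s*([}\]])/g, "$1");

const steps: RepairStep[] = [stripCodeFences, extractObject, dropTrailingCommas];

// Try to parse; on failure apply the next step and try again.
function repairJson(raw: string): { value: unknown; stepsApplied: number } {
  let candidate = raw;
  for (let applied = 0; applied <= steps.length; applied++) {
    try {
      return { value: JSON.parse(candidate), stepsApplied: applied };
    } catch {
      if (applied === steps.length) break;
      candidate = steps[applied](candidate);
    }
  }
  throw new Error("Unable to repair model output into valid JSON");
}
```

Cheap, low-risk fixes run first, so well-formed output is returned untouched and only genuinely broken output pays for the more invasive rewrites. The repaired value would still need schema validation before being returned to a caller.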
## API
### `POST /api/extract-tax`
Upload a PDF tax form for extraction.
**Request:** `multipart/form-data` with a `file` field containing a PDF.
**Success Response (200):**
```json
{
  "documents": [
    {
      "formType": "1040",
      "filingStatus": "single",
      "wages": 75000,
      "taxableInterest": 1200,
      "adjustedGrossIncome": 76200,
      "totalTax": 12500
    }
  ],
  "processingMetadata": {
    "extractionMethod": "pdf-text",
    "confidence": 1,
    "tokensUsed": 700,
    "costUsd": 0.021,
    "totalPages": 1
  }
}
```
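A client consuming this response may want to check its shape before using it. The types and guard below are a sketch inferred only from the sample above; the project's real Zod schemas in `src/lib/tax-schemas.ts` may be stricter or carry more fields, and the names `ExtractionResponse`, `ProcessingMetadata`, and `isExtractionResponse` are hypothetical.

```typescript
// Shapes inferred from the sample response; not the project's own schemas.
interface ProcessingMetadata {
  extractionMethod: string; // "pdf-text" in the sample; other values are not documented here
  confidence: number;
  tokensUsed: number;
  costUsd: number;
  totalPages: number;
}

interface ExtractionResponse {
  // Every document carries a formType; the remaining fields vary by form.
  documents: Array<{ formType: string } & Record<string, unknown>>;
  processingMetadata: ProcessingMetadata;
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Structural check only: it does not validate form-specific fields.
function isExtractionResponse(body: unknown): body is ExtractionResponse {
  if (!isRecord(body)) return false;
  const { documents, processingMetadata: meta } = body;
  if (!Array.isArray(documents) || !isRecord(meta)) return false;
  const numericKeys = ["confidence", "tokensUsed", "costUsd", "totalPages"];
  return (
    documents.every((d) => isRecord(d) && typeof d.formType === "string") &&
    typeof meta.extractionMethod === "string" &&
    numericKeys.every((k) => typeof meta[k] === "number")
  );
}
```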
**Error Responses:**
| Status | Meaning |
|--------|---------|
| 400 | No file uploaded |
| 413 | File exceeds 20 MB |
| 415 | File is not a PDF |
| 422 | Extraction or repair failed |
| 429 | Daily budget exhausted |
| 500 | Internal server error |
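One way to keep these codes consistent is to route every failure through a single lookup. The sketch below assumes hypothetical failure kinds (the project's actual error classes live in `src/types.ts`), assumes the 20 MB limit is counted in binary megabytes, and picks an arbitrary order for the pre-flight checks.

```typescript
// Hypothetical failure kinds; not the project's real error classes.
type ExtractionFailure =
  | "no-file"
  | "file-too-large"
  | "not-pdf"
  | "extraction-failed"
  | "budget-exhausted";

// Status codes taken from the error table.
const STATUS_BY_FAILURE: Record<ExtractionFailure, number> = {
  "no-file": 400,
  "file-too-large": 413,
  "not-pdf": 415,
  "extraction-failed": 422,
  "budget-exhausted": 429,
};

// Assumption: the 20 MB limit is counted in binary megabytes.
const MAX_UPLOAD_BYTES = 20 * 1024 * 1024;

// Pre-flight checks that need no model call. The check order is an assumption.
function validateUpload(
  file: { type: string; size: number } | null,
): ExtractionFailure | null {
  if (file === null) return "no-file";
  if (file.type !== "application/pdf") return "not-pdf";
  if (file.size > MAX_UPLOAD_BYTES) return "file-too-large";
  return null;
}

// Any failure without a known kind falls through to 500.
function statusFor(failure: ExtractionFailure | undefined): number {
  return failure === undefined ? 500 : STATUS_BY_FAILURE[failure];
}
```

Rejecting bad uploads before text extraction means a 400, 413, or 415 never consumes any of the daily budget.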
### `GET /api/extract-tax`
Health check. Returns `{ "status": "ok" }`.
## Supported Tax Forms
- **1040**: Individual Income Tax Return
- **W-2**: Wage and Tax Statement
- **1099-NEC**: Nonemployee Compensation
- **1099-MISC**: Miscellaneous Income
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `XAI_API_KEY` | (required) | xAI Grok API key |
| `XAI_BASE_URL` | `https://api.x.ai/v1` | xAI API base URL |
| `XAI_MODEL` | `grok-3` | Grok model identifier |
| `REDIS_URL` | `redis://localhost:6379` | Redis connection string |
| `DAILY_BUDGET_USD` | `5.00` | Per-day API spend ceiling |
| `CACHE_SEMANTIC_THRESHOLD` | `0.85` | Cosine similarity cutoff |
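The table above can be read as a small loader that applies the defaults and fails fast on a missing key. This is a sketch only: `loadConfig` and the `AppConfig` field names are hypothetical and may not match `src/lib/config.ts`.

```typescript
// Hypothetical config shape; the field names are illustrative.
interface AppConfig {
  xaiApiKey: string;
  xaiBaseUrl: string;
  xaiModel: string;
  redisUrl: string;
  dailyBudgetUsd: number;
  cacheSemanticThreshold: number;
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const xaiApiKey = env.XAI_API_KEY;
  if (!xaiApiKey) {
    throw new Error("XAI_API_KEY is required");
  }

  // Parse an optional numeric variable, falling back to the documented default.
  const num = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw === "") return fallback;
    const parsed = Number(raw);
    if (!Number.isFinite(parsed)) {
      throw new Error(`${name} must be a number, got "${raw}"`);
    }
    return parsed;
  };

  return {
    xaiApiKey,
    xaiBaseUrl: env.XAI_BASE_URL ?? "https://api.x.ai/v1",
    xaiModel: env.XAI_MODEL ?? "grok-3",
    redisUrl: env.REDIS_URL ?? "redis://localhost:6379",
    dailyBudgetUsd: num("DAILY_BUDGET_USD", 5.0),
    cacheSemanticThreshold: num("CACHE_SEMANTIC_THRESHOLD", 0.85),
  };
}
```

In a Node process this would typically be called once at startup as `loadConfig(process.env)`, so a misconfigured deployment fails before the first upload rather than on it.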
## Running locally
```bash
pnpm install
pnpm typecheck
pnpm lint
pnpm test
pnpm dev
```
Example curl:
```bash
curl -X POST http://localhost:3000/api/extract-tax -F "file=@w2-sample.pdf"
```
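The same upload can be made from TypeScript with the standard `fetch`, `FormData`, and `Blob` globals (available in Node 18 and later). The helper below is a sketch: `uploadTaxForm` and the injectable `fetchImpl` parameter are hypothetical conveniences, not part of this project, and the default base URL simply mirrors the curl example.

```typescript
// Minimal fetch-like signature so a stub can be injected in tests.
type FetchLike = (
  url: string,
  init: { method: string; body: FormData },
) => Promise<Response>;

async function uploadTaxForm(
  pdf: Blob,
  filename: string,
  baseUrl = "http://localhost:3000", // same host and port as the curl example
  fetchImpl: FetchLike = fetch,
): Promise<unknown> {
  const form = new FormData();
  form.append("file", pdf, filename); // same field name as curl's -F flag

  const res = await fetchImpl(`${baseUrl}/api/extract-tax`, {
    method: "POST",
    body: form,
  });
  if (!res.ok) {
    // Non-2xx statuses are surfaced to the caller as an exception.
    throw new Error(`extract-tax failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

The `Content-Type` header is deliberately left unset so that `fetch` can generate the multipart boundary itself.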
## Project layout
```
app/api/extract-tax/route.ts        API route handler
src/services/tax-extractor.ts       Pipeline orchestrator
src/services/grok-client.ts         xAI Grok client (OpenAI SDK)
src/services/text-extractor.ts      PDF/OCR text extraction
src/services/repair-service.ts      JSON repair via structured-repair-core
src/services/cache-service.ts       LLM cache with Redis backend
src/services/telemetry-service.ts   Cost telemetry
src/services/budget-service.ts      Budget enforcement
src/lib/tax-schemas.ts              Zod schemas for tax forms
src/lib/config.ts                   Environment config
src/types.ts                        TypeScript types and error classes
tests/                              Vitest test suite
```
## License
MIT — see [LICENSE](./LICENSE).