# xAI Grok Tax Form Extraction for SMB Accounting
 
> Automatically extract and normalize line items from tax forms (1040, W-2, 1099) for small business bookkeeping using Grok's reasoning and REAA's output repair engine.
 
## Problem
 
SMB accountants spend hours transcribing numbers from PDF tax forms into spreadsheets. Manual entry is slow and error-prone, and off-the-shelf OCR often produces garbled text, so JSON generated from it can be malformed and unusable by downstream systems.
 
## Architecture
 
This pipeline ingests scanned or digital tax forms, extracts their text (directly from the PDF with `unpdf`, falling back to `tesseract.js` OCR), sends the content to xAI Grok with a structured JSON schema, and repairs any malformed output using the graduated repair strategies in `@reaatech/structured-repair-core`. `@reaatech/llm-cost-telemetry` tracks per-form spend, `@reaatech/llm-cache` avoids re-processing semantically identical pages, and `@reaatech/agent-budget-engine` enforces daily cost caps.
 
```
PDF Upload
  → Text Extraction   (unpdf primary, tesseract.js OCR fallback)
  → xAI Grok LLM      (OpenAI SDK, Chat Completions API)
  → Schema Repair     (structured-repair-core)
  → Validated JSON    (Zod-validated response)
```
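
The flow above can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not the project's code: `PipelineDeps`, `extractTaxForm`, `repairJson`, and the individual repair steps are hypothetical names. The stage functions are injected stubs rather than real `unpdf`, `tesseract.js`, or OpenAI SDK calls. The repair ladder is a generic stand-in for what "graduated repair" might look like, not the API of `@reaatech/structured-repair-core`. The `"ocr"` value for `extractionMethod` is also assumed.

```typescript
// Hypothetical sketch: none of these names come from the real packages.
interface PipelineDeps {
  extractPdfText: (pdf: Uint8Array) => Promise<string>; // embedded text layer
  ocrPdf: (pdf: Uint8Array) => Promise<string>;         // OCR fallback
  callModel: (text: string) => Promise<string>;         // raw model output
}

const FENCE = "`".repeat(3); // Markdown code fence marker

// Remove a surrounding Markdown code fence if the model wrapped its JSON in one.
function stripFences(raw: string): string {
  const t = raw.trim();
  if (!t.startsWith(FENCE)) return t;
  const firstNewline = t.indexOf("\n");
  const end = t.lastIndexOf(FENCE);
  return t.slice(firstNewline + 1, end > firstNewline ? end : undefined).trim();
}

// Keep only the outermost {...} span, dropping any chatter around it.
function sliceOutermostObject(s: string): string {
  const start = s.indexOf("{");
  const end = s.lastIndexOf("}");
  return start >= 0 && end > start ? s.slice(start, end + 1) : s;
}

// Drop commas that directly precede a closing brace or bracket.
// Naive: it would also touch such sequences inside string values.
function dropTrailingCommas(s: string): string {
  return s.replace(/,\s*([}\]])/g, "$1");
}

// Cheapest fix first; each step builds on the previous candidate.
const repairSteps: Array<(s: string) => string> = [
  (s) => s,
  stripFences,
  sliceOutermostObject,
  dropTrailingCommas,
];

function repairJson(raw: string): unknown {
  let candidate = raw;
  for (const step of repairSteps) {
    candidate = step(candidate);
    try {
      return JSON.parse(candidate);
    } catch {
      // This step was not enough; escalate to the next one.
    }
  }
  throw new Error("Repair failed: output is not valid JSON");
}

async function extractTaxForm(
  pdf: Uint8Array,
  deps: PipelineDeps,
): Promise<{ data: unknown; extractionMethod: "pdf-text" | "ocr" }> {
  let text = await deps.extractPdfText(pdf);
  let extractionMethod: "pdf-text" | "ocr" = "pdf-text";
  if (text.trim() === "") {
    // No usable text layer, so fall back to OCR.
    text = await deps.ocrPdf(pdf);
    extractionMethod = "ocr";
  }
  const raw = await deps.callModel(text);
  return { data: repairJson(raw), extractionMethod };
}
```

Injecting the stages keeps the control flow testable without network or OCR dependencies. A real implementation would still validate the repaired object against the form schema before returning it.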
 
## API
 
### `POST /api/extract-tax`
 
Upload a PDF tax form for extraction.
 
**Request:** `multipart/form-data` with a `file` field containing a PDF.
 
**Success Response (200):**
```json
{
  "documents": [
    {
      "formType": "1040",
      "filingStatus": "single",
      "wages": 75000,
      "taxableInterest": 1200,
      "adjustedGrossIncome": 76200,
      "totalTax": 12500
    }
  ],
  "processingMetadata": {
    "extractionMethod": "pdf-text",
    "confidence": 1,
    "tokensUsed": 700,
    "costUsd": 0.021,
    "totalPages": 1
  }
}
```
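
For callers consuming this response from TypeScript, a dependency-free type guard along these lines may be useful. It is a sketch inferred only from the sample above: `ExtractTaxResponse`, `ProcessingMetadata`, and `isExtractTaxResponse` are hypothetical names, and the project's own Zod schemas in `src/lib/tax-schemas.ts` remain the authoritative definition of the shape.

```typescript
// Shape inferred from the sample response; field names follow the README.
interface ProcessingMetadata {
  extractionMethod: string;
  confidence: number;
  tokensUsed: number;
  costUsd: number;
  totalPages: number;
}

interface ExtractTaxResponse {
  // Per-form fields vary by form type, so only formType is assumed here.
  documents: Array<Record<string, unknown> & { formType: string }>;
  processingMetadata: ProcessingMetadata;
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Runtime check that a parsed JSON value has the documented top-level shape.
function isExtractTaxResponse(v: unknown): v is ExtractTaxResponse {
  if (!isRecord(v)) return false;
  const docs = v.documents;
  const meta = v.processingMetadata;
  if (!Array.isArray(docs) || !isRecord(meta)) return false;

  const numericKeys = ["confidence", "tokensUsed", "costUsd", "totalPages"];
  return (
    docs.every((d) => isRecord(d) && typeof d.formType === "string") &&
    typeof meta.extractionMethod === "string" &&
    numericKeys.every((k) => typeof meta[k] === "number")
  );
}
```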
 
**Error Responses:**
| Status | Meaning |
|--------|---------|
| 400 | No file uploaded |
| 413 | File exceeds 20 MB |
| 415 | File is not a PDF |
| 422 | Extraction or repair failed |
| 429 | Daily budget exhausted |
| 500 | Internal server error |
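
The three request-validation statuses (400, 413, 415) could be produced by a check like the one below. This is a sketch, not the route handler's actual code: `checkUpload`, `UploadLike`, and `MAX_UPLOAD_BYTES` are hypothetical names, the order of the checks is a guess, and "20MB" is read here as 20 × 1024 × 1024 bytes, which the README does not confirm.

```typescript
// Assumption: the 20 MB limit means 20 MiB; the real limit may be 20,000,000 bytes.
const MAX_UPLOAD_BYTES = 20 * 1024 * 1024;

// Minimal view of an uploaded file (the browser File type has these fields).
interface UploadLike {
  type: string; // MIME type reported by the client
  size: number; // size in bytes
}

type UploadCheck =
  | { ok: true }
  | { ok: false; status: 400 | 413 | 415; error: string };

// Map an uploaded file (or its absence) to the documented error statuses.
function checkUpload(file: UploadLike | null | undefined): UploadCheck {
  if (!file) {
    return { ok: false, status: 400, error: "No file uploaded" };
  }
  if (file.size > MAX_UPLOAD_BYTES) {
    return { ok: false, status: 413, error: "File exceeds 20 MB" };
  }
  if (file.type !== "application/pdf") {
    return { ok: false, status: 415, error: "File is not a PDF" };
  }
  return { ok: true };
}
```

Checking only the client-reported MIME type is easy to spoof, so a stricter handler might also inspect the leading `%PDF-` bytes of the upload.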
 
### `GET /api/extract-tax`
 
Health check. Returns `{ "status": "ok" }`.
 
## Supported Tax Forms
 
- **1040**: Individual Income Tax Return
- **W-2**: Wage and Tax Statement
- **1099-NEC**: Nonemployee Compensation
- **1099-MISC**: Miscellaneous Income
 
## Environment Variables
 
| Variable | Default | Description |
|----------|---------|-------------|
| `XAI_API_KEY` | (required) | xAI Grok API key |
| `XAI_BASE_URL` | `https://api.x.ai/v1` | xAI API base URL |
| `XAI_MODEL` | `grok-3` | Grok model identifier |
| `REDIS_URL` | `redis://localhost:6379` | Redis connection string |
| `DAILY_BUDGET_USD` | `5.00` | Per-day API spend ceiling |
| `CACHE_SEMANTIC_THRESHOLD` | `0.85` | Minimum cosine similarity for a semantic cache hit |
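
As an illustration of how these variables and defaults fit together, the loader below reads them from a plain record such as `process.env`. It is a minimal sketch: `AppConfig` and `loadConfig` are hypothetical names, and the actual `src/lib/config.ts` may parse or validate values differently.

```typescript
interface AppConfig {
  xaiApiKey: string;
  xaiBaseUrl: string;
  xaiModel: string;
  redisUrl: string;
  dailyBudgetUsd: number;
  cacheSemanticThreshold: number;
}

// Build a config object from environment-style key/value pairs,
// applying the defaults listed in the table above.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const apiKey = env.XAI_API_KEY;
  if (!apiKey) {
    throw new Error("XAI_API_KEY is required");
  }

  // Parse a numeric variable, falling back when it is unset or empty.
  const num = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw === "") return fallback;
    const n = Number(raw);
    if (!Number.isFinite(n)) {
      throw new Error(`${name} must be a number, got "${raw}"`);
    }
    return n;
  };

  return {
    xaiApiKey: apiKey,
    xaiBaseUrl: env.XAI_BASE_URL || "https://api.x.ai/v1",
    xaiModel: env.XAI_MODEL || "grok-3",
    redisUrl: env.REDIS_URL || "redis://localhost:6379",
    dailyBudgetUsd: num("DAILY_BUDGET_USD", 5.0),
    cacheSemanticThreshold: num("CACHE_SEMANTIC_THRESHOLD", 0.85),
  };
}
```

In a Node.js process this would be called as `loadConfig(process.env)`. Failing fast on a missing key or a non-numeric budget surfaces misconfiguration at startup rather than on the first request.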
 
## Running locally
 
```bash
pnpm install
pnpm typecheck
pnpm lint
pnpm test
pnpm dev
```
 
Example curl:
```bash
curl -X POST http://localhost:3000/api/extract-tax -F "file=@w2-sample.pdf"
```
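
The same request can be made from TypeScript. The sketch below assumes a runtime with global `fetch`, `FormData`, and `Blob` (for example Node.js 18 or later). `uploadTaxForm` and `FetchLike` are hypothetical names, and the `fetchImpl` parameter exists only so the function can be exercised without a running server.

```typescript
// Minimal structural type for the part of fetch this sketch uses.
type FetchLike = (
  url: string,
  init: { method: string; body: FormData },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// POST a PDF to /api/extract-tax as multipart/form-data under the "file" field.
async function uploadTaxForm(
  pdf: Blob,
  filename: string,
  baseUrl: string = "http://localhost:3000",
  fetchImpl: FetchLike = fetch,
): Promise<unknown> {
  const form = new FormData();
  form.append("file", pdf, filename);

  const res = await fetchImpl(`${baseUrl}/api/extract-tax`, {
    method: "POST",
    body: form,
  });
  if (!res.ok) {
    // Non-2xx statuses are listed under "Error Responses".
    throw new Error(`extract-tax failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

The `Content-Type` header is deliberately not set by hand, since `fetch` adds the multipart boundary itself when the body is a `FormData` instance.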
 
## Project layout
 
```
app/api/extract-tax/route.ts    API route handler
src/services/tax-extractor.ts   Pipeline orchestrator
src/services/grok-client.ts     xAI Grok client (OpenAI SDK)
src/services/text-extractor.ts  PDF/OCR text extraction
src/services/repair-service.ts  JSON repair via structured-repair-core
src/services/cache-service.ts   LLM cache with Redis backend
src/services/telemetry-service.ts  Cost telemetry
src/services/budget-service.ts  Budget enforcement
src/lib/tax-schemas.ts          Zod schemas for tax forms
src/lib/config.ts               Environment config
src/types.ts                    TypeScript types and error classes
tests/                          Vitest test suite
```
 
## License
 
MIT — see [LICENSE](./LICENSE).