# Cohere Insurance Policy Data Extraction for SMB Brokers
> Extract structured data from insurance policy PDFs with Cohere's models, automatic PDF parsing, and per-broker cost tracking.
A Next.js API that accepts PDF uploads and extracts structured insurance policy data using Cohere's LLM. It repairs malformed JSON output with `@reaatech/structured-repair-core`, enforces per-broker budgets via `@reaatech/agent-budget-engine`, and tracks cost telemetry with `@reaatech/llm-cost-telemetry-aggregation`.
## Quick Start
```bash
pnpm install
cp .env.example .env # set COHERE_API_KEY to your key
pnpm dev # start dev server on port 3000
```
Submit a policy PDF:
```bash
curl -X POST http://localhost:3000/api/documents/process \
-F "file=@policy.pdf" \
-F "brokerId=broker-1"
```
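The same upload can be made from TypeScript. The following is a minimal sketch, assuming a runtime with global `fetch`, `FormData`, and `Blob` (Node 18+ or a browser); `buildPolicyUpload` and `submitPolicy` are hypothetical helper names, not part of this project.

```typescript
// Sketch of a client for POST /api/documents/process.
// Helper names are illustrative; only the URL path and field names come from the README.

/** Builds the multipart body with the two fields the endpoint expects. */
function buildPolicyUpload(pdf: Blob, brokerId: string): FormData {
  const form = new FormData();
  form.append("file", pdf, "policy.pdf");
  form.append("brokerId", brokerId);
  return form;
}

/** Posts the form and returns the HTTP status plus the parsed JSON body. */
async function submitPolicy(
  baseUrl: string,
  pdf: Blob,
  brokerId: string,
): Promise<{ status: number; body: unknown }> {
  const res = await fetch(`${baseUrl}/api/documents/process`, {
    method: "POST",
    // No Content-Type header: fetch sets the multipart boundary itself.
    body: buildPolicyUpload(pdf, brokerId),
  });
  return { status: res.status, body: await res.json() };
}
```

With the dev server running, `submitPolicy("http://localhost:3000", pdfBlob, "broker-1")` would mirror the `curl` call above.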
## API Endpoints
### `POST /api/documents/process`
Upload a PDF and extract structured policy data.
- **Request:** multipart/form-data with `file` (PDF) and `brokerId` (string)
- **Response 200:** `{ documentId, brokerId, policyData, status, costUsd, extractedAt }`
- **Response 400:** `{ error: "Bad request" }` — missing file or brokerId
- **Response 402:** `{ error: "Budget exceeded" }` — broker has exhausted their budget
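A caller will usually want to branch on these three outcomes. The sketch below shows one way to do that with a discriminated union. The field types in `ProcessSuccess` are assumptions inferred from the field names (the list above only gives the keys), and `interpretProcessResponse` is a hypothetical helper.

```typescript
// Assumed field types; the README documents only the key names.
interface ProcessSuccess {
  documentId: string;
  brokerId: string;
  policyData: unknown; // shape depends on the extraction schema
  status: string;
  costUsd: number;
  extractedAt: string;
}

type ProcessResult =
  | { kind: "ok"; data: ProcessSuccess }
  | { kind: "bad_request"; error: string }
  | { kind: "budget_exceeded"; error: string }
  | { kind: "unexpected"; httpStatus: number };

/** Maps an HTTP status and parsed JSON body onto one of the documented outcomes. */
function interpretProcessResponse(httpStatus: number, body: unknown): ProcessResult {
  const error = (body as { error?: unknown } | null)?.error;
  const message = typeof error === "string" ? error : "";
  switch (httpStatus) {
    case 200:
      return { kind: "ok", data: body as ProcessSuccess };
    case 400:
      return { kind: "bad_request", error: message };
    case 402:
      return { kind: "budget_exceeded", error: message };
    default:
      // Any status not listed in this README.
      return { kind: "unexpected", httpStatus };
  }
}
```

Keeping `unexpected` as its own variant means statuses this README does not document are surfaced rather than silently treated as success.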
### `GET /api/brokers/:brokerId/budget`
Get current budget state for a broker.
- **Response 200:** `{ spent, remaining, state }` — state is `Active`, `Warned`, `Degraded`, `Stopped`, or `unknown`
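Because `state` is a closed set of strings, a client can validate it before use. This is a rough sketch, assuming `spent` and `remaining` are JSON numbers (the README does not state their type); `parseBudgetSnapshot` is a hypothetical helper.

```typescript
// The five state values are taken from the response description above.
const BUDGET_STATES = ["Active", "Warned", "Degraded", "Stopped", "unknown"] as const;
type BudgetState = (typeof BUDGET_STATES)[number];

interface BudgetSnapshot {
  spent: number; // assumed numeric
  remaining: number; // assumed numeric
  state: BudgetState;
}

/** Validates a parsed JSON body from GET /api/brokers/:brokerId/budget. */
function parseBudgetSnapshot(body: unknown): BudgetSnapshot {
  if (typeof body !== "object" || body === null) {
    throw new Error("budget response is not an object");
  }
  const { spent, remaining, state } = body as Record<string, unknown>;
  if (typeof spent !== "number" || typeof remaining !== "number") {
    throw new Error("budget response is missing numeric spent/remaining");
  }
  const known = BUDGET_STATES.find((s) => s === state);
  if (known === undefined) {
    throw new Error(`unrecognized budget state: ${String(state)}`);
  }
  return { spent, remaining, state: known };
}
```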
### `GET /api/brokers/:brokerId/usage`
Get aggregated cost usage for a broker.
- **Response 200:** `{ brokerId, usage: { totalUsd, byProvider, byFeature } }`
## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `COHERE_API_KEY` | Yes | — | Cohere API key |
| `COHERE_MODEL` | No | `command-a-03-2025` | Cohere model ID |
| `DEFAULT_BROKER_BUDGET` | No | `100.0` | Default monthly budget in USD |
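The table above can be read as a small config loader. The following is a sketch only, not the project's actual `src/lib` config module; `AppConfig` and `loadConfig` are hypothetical names, while the variable names and defaults come from the table.

```typescript
interface AppConfig {
  cohereApiKey: string;
  cohereModel: string;
  defaultBrokerBudgetUsd: number;
}

/** Reads the three documented variables, applying the documented defaults. */
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const cohereApiKey = env.COHERE_API_KEY;
  if (!cohereApiKey) {
    throw new Error("COHERE_API_KEY is required");
  }
  const rawBudget = env.DEFAULT_BROKER_BUDGET ?? "100.0";
  const defaultBrokerBudgetUsd = Number(rawBudget);
  // Rejecting negative or non-numeric budgets is an assumption of this sketch.
  if (!Number.isFinite(defaultBrokerBudgetUsd) || defaultBrokerBudgetUsd < 0) {
    throw new Error(`DEFAULT_BROKER_BUDGET must be a non-negative number, got "${rawBudget}"`);
  }
  return {
    cohereApiKey,
    cohereModel: env.COHERE_MODEL ?? "command-a-03-2025",
    defaultBrokerBudgetUsd,
  };
}
```

In a Node process this would typically be called once at startup as `loadConfig(process.env)`, so a missing key fails fast instead of on the first upload.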
## Pipeline Architecture
1. **Upload** — PDF file received via multipart POST
2. **Budget check** — `@reaatech/agent-budget-engine` verifies broker has remaining allowance
3. **PDF parsing** — `pdfjs-dist` extracts text; falls back to page-render OCR via `sharp` + `@reaatech/media-pipeline-mcp-doc-extraction`
4. **LLM extraction** — `cohere-ai` (CohereClientV2) extracts structured policy fields
5. **JSON repair** — `@reaatech/structured-repair-core` fixes malformed output against a Zod schema
6. **Cost tracking** — spend recorded in `BudgetController` and telemetry aggregated via `CostCollector`/`CostAggregator`
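The ordering of steps 2 to 6 can be sketched as a small orchestrator over injected step functions. Everything named here (`PipelineSteps`, `runPipeline`, `BudgetExceededError`) is hypothetical and does not reflect the real signatures of the packages listed above; in the actual project each step would wrap the corresponding library.

```typescript
// Each member stands in for one numbered step; real implementations would
// delegate to the packages named in the list above.
interface PipelineSteps<T> {
  hasBudget(brokerId: string): Promise<boolean>; // step 2
  parsePdf(pdf: Uint8Array): Promise<string>; // step 3
  extract(text: string): Promise<{ raw: string; costUsd: number }>; // step 4
  repair(raw: string): Promise<T>; // step 5
  recordCost(brokerId: string, costUsd: number): Promise<void>; // step 6
}

class BudgetExceededError extends Error {}

/** Runs the steps in order, stopping before any LLM call if the budget check fails. */
async function runPipeline<T>(
  steps: PipelineSteps<T>,
  brokerId: string,
  pdf: Uint8Array,
): Promise<{ policyData: T; costUsd: number }> {
  if (!(await steps.hasBudget(brokerId))) {
    // A route handler could translate this into the 402 response.
    throw new BudgetExceededError("Budget exceeded");
  }
  const text = await steps.parsePdf(pdf);
  const { raw, costUsd } = await steps.extract(text);
  const policyData = await steps.repair(raw);
  await steps.recordCost(brokerId, costUsd);
  return { policyData, costUsd };
}
```

Injecting the steps keeps the ordering testable with stubs, without a Cohere key or a real PDF.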
## Project Layout
```
app/api/ Next.js App Router API route handlers
src/lib/ types, config, PDF parser, pipeline orchestrator
src/services/ REAA package wrappers (budget, telemetry, document extraction, Cohere client)
tests/ vitest suite with unit + integration tests
packages/ API references for every dependency
```
## Running Tests
```bash
pnpm test # vitest run with coverage (thresholds: >=90% lines/branches/functions/statements)
pnpm typecheck # tsc --noEmit
pnpm lint # eslint flat config
```
## License
MIT — see [LICENSE](./LICENSE).