# Cohere Insurance Policy Data Extraction for SMB Brokers
 
> Extract structured data from insurance policy PDFs with Cohere's models, automatic PDF parsing, and per-broker cost tracking.
 
A Next.js API that accepts PDF uploads, extracts structured insurance policy data using Cohere's LLM, repairs malformed JSON output with `@reaatech/structured-repair-core`, enforces per-broker budgets via `@reaatech/agent-budget-engine`, and tracks cost telemetry with `@reaatech/llm-cost-telemetry-aggregation`.
 
## Quick Start
 
```bash
pnpm install
cp .env.example .env   # edit COHERE_API_KEY with your key
pnpm dev               # start dev server on port 3000
```
 
Submit a policy PDF:
 
```bash
curl -X POST http://localhost:3000/api/documents/process \
  -F "file=@policy.pdf" \
  -F "brokerId=broker-1"
```
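
The same request can be sent from TypeScript. The sketch below is a minimal example, assuming Node 18+ (or a browser) where `FormData`, `Blob`, and `fetch` are globals. The helper names `buildProcessForm` and `submitPolicy` are hypothetical and not part of this project.

```typescript
// Build the same multipart body that the curl command sends.
// Field names ("file", "brokerId") come from the README; the helper names are hypothetical.
function buildProcessForm(pdf: Blob, brokerId: string, filename = "policy.pdf"): FormData {
  const form = new FormData();
  form.append("file", pdf, filename);
  form.append("brokerId", brokerId);
  return form;
}

// POST the form to a locally running dev server (port 3000 per Quick Start).
async function submitPolicy(
  pdf: Blob,
  brokerId: string,
  baseUrl = "http://localhost:3000",
): Promise<unknown> {
  const res = await fetch(`${baseUrl}/api/documents/process`, {
    method: "POST",
    body: buildProcessForm(pdf, brokerId),
  });
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  return res.json();
}
```

Do not set the `Content-Type` header by hand here: `fetch` derives the multipart boundary from the `FormData` body.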
 
## API Endpoints
 
### `POST /api/documents/process`
Upload a PDF and extract structured policy data.
 
- **Request:** multipart/form-data with `file` (PDF) and `brokerId` (string)
- **Response 200:** `{ documentId, brokerId, policyData, status, costUsd, extractedAt }`
- **Response 400:** `{ error: "Bad request" }` — missing file or brokerId
- **Response 402:** `{ error: "Budget exceeded" }` — broker has exhausted their budget
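
A client might map these status codes onto a discriminated union so callers handle each case explicitly. This is a sketch only: the field names come from the list above, but the field types (for example `costUsd` as a number and `extractedAt` as a string) and the `interpretProcessResponse` helper are assumptions.

```typescript
// Field names are from the README; the types are assumed for illustration.
interface ProcessSuccess {
  documentId: string;
  brokerId: string;
  policyData: Record<string, unknown>;
  status: string;
  costUsd: number;
  extractedAt: string;
}

type ProcessOutcome =
  | { kind: "ok"; data: ProcessSuccess }
  | { kind: "bad_request"; message: string }
  | { kind: "budget_exceeded"; message: string }
  | { kind: "unexpected"; status: number };

// Hypothetical helper: classify a response by its HTTP status.
function interpretProcessResponse(status: number, body: unknown): ProcessOutcome {
  // Pull out the `error` string when the body carries one.
  const message =
    typeof body === "object" && body !== null && "error" in body
      ? String((body as { error: unknown }).error)
      : "";
  switch (status) {
    case 200:
      return { kind: "ok", data: body as ProcessSuccess };
    case 400:
      return { kind: "bad_request", message };
    case 402:
      return { kind: "budget_exceeded", message };
    default:
      return { kind: "unexpected", status };
  }
}
```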
 
### `GET /api/brokers/:brokerId/budget`
Get current budget state for a broker.
 
- **Response 200:** `{ spent, remaining, state }` — state is `Active`, `Warned`, `Degraded`, `Stopped`, or `unknown`
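
Callers may want to validate this payload before trusting it. The sketch below assumes `spent` and `remaining` are numbers in USD, which the README does not state. `parseBrokerBudget` is a hypothetical helper, not an export of this project.

```typescript
// The five state values are the ones listed in the README.
const BUDGET_STATES = ["Active", "Warned", "Degraded", "Stopped", "unknown"] as const;
type BudgetState = (typeof BUDGET_STATES)[number];

interface BrokerBudget {
  spent: number; // assumed to be USD
  remaining: number; // assumed to be USD
  state: BudgetState;
}

// Hypothetical helper: narrow an untyped JSON body to BrokerBudget or throw.
function parseBrokerBudget(body: unknown): BrokerBudget {
  if (typeof body !== "object" || body === null) {
    throw new Error("Budget response is not an object");
  }
  const { spent, remaining, state } = body as Record<string, unknown>;
  if (typeof spent !== "number" || typeof remaining !== "number") {
    throw new Error("spent and remaining must be numbers");
  }
  if (typeof state !== "string" || !(BUDGET_STATES as readonly string[]).includes(state)) {
    throw new Error(`Unrecognized budget state: ${String(state)}`);
  }
  return { spent, remaining, state: state as BudgetState };
}
```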
 
### `GET /api/brokers/:brokerId/usage`
Get aggregated cost usage for a broker.
 
- **Response 200:** `{ brokerId, usage: { totalUsd, byProvider, byFeature } }`
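
As an illustration of consuming this payload, the sketch below assumes `byProvider` and `byFeature` each map a name to USD spent. The README does not specify their shape, so treat the types, the helper names, and the example keys in the usage note as hypothetical.

```typescript
// Assumed shape: each breakdown maps a provider or feature name to USD spent.
type UsdBreakdown = Record<string, number>;

interface BrokerUsage {
  brokerId: string;
  usage: { totalUsd: number; byProvider: UsdBreakdown; byFeature: UsdBreakdown };
}

// Return the [name, usd] pair with the highest spend, or undefined when empty.
function largestEntry(breakdown: UsdBreakdown): [string, number] | undefined {
  let best: [string, number] | undefined;
  for (const [name, usd] of Object.entries(breakdown)) {
    if (best === undefined || usd > best[1]) {
      best = [name, usd];
    }
  }
  return best;
}

// Produce a one-line summary of a broker's usage.
function summarizeUsage(u: BrokerUsage): string {
  const total = u.usage.totalUsd.toFixed(2);
  const top = largestEntry(u.usage.byFeature);
  return top
    ? `${u.brokerId}: $${total} total, top feature ${top[0]} ($${top[1].toFixed(2)})`
    : `${u.brokerId}: $${total} total`;
}
```

For example, a payload with `totalUsd: 1.5` and `byFeature: { extraction: 1.2, ocr: 0.3 }` for `broker-1` would summarize as `broker-1: $1.50 total, top feature extraction ($1.20)`.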
 
## Environment Variables
 
| Variable | Required | Default | Description |
|---|---|---|---|
| `COHERE_API_KEY` | Yes | — | Cohere API key |
| `COHERE_MODEL` | No | `command-a-03-2025` | Cohere model ID |
| `DEFAULT_BROKER_BUDGET` | No | `100.0` | Default monthly budget in USD |
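
One way to read these variables with the defaults from the table is sketched below. The `AppConfig` shape and `loadConfig` name are hypothetical, and the project's own config module in `src/lib/` may differ.

```typescript
interface AppConfig {
  cohereApiKey: string;
  cohereModel: string;
  defaultBrokerBudget: number; // USD per month
}

// Hypothetical loader: takes the environment as an argument so it is easy to test.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const cohereApiKey = env.COHERE_API_KEY;
  if (!cohereApiKey) {
    throw new Error("COHERE_API_KEY is required");
  }
  // Treat empty strings as unset so the documented defaults apply.
  const rawBudget = env.DEFAULT_BROKER_BUDGET?.trim() || "100.0";
  const defaultBrokerBudget = Number(rawBudget);
  if (!Number.isFinite(defaultBrokerBudget) || defaultBrokerBudget < 0) {
    throw new Error(`DEFAULT_BROKER_BUDGET must be a non-negative number, got "${rawBudget}"`);
  }
  return {
    cohereApiKey,
    cohereModel: env.COHERE_MODEL || "command-a-03-2025",
    defaultBrokerBudget,
  };
}
```

In application code this would typically be called once at startup as `loadConfig(process.env)`.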
 
## Pipeline Architecture
 
1. **Upload**: PDF file received via multipart POST
2. **Budget check**: `@reaatech/agent-budget-engine` verifies broker has remaining allowance
3. **PDF parsing**: `pdfjs-dist` extracts text; falls back to page-render OCR via `sharp` + `@reaatech/media-pipeline-mcp-doc-extraction`
4. **LLM extraction**: `cohere-ai` (CohereClientV2) extracts structured policy fields
5. **JSON repair**: `@reaatech/structured-repair-core` fixes malformed output against a Zod schema
6. **Cost tracking**: spend recorded in `BudgetController` and telemetry aggregated via `CostCollector`/`CostAggregator`
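
The ordering of the steps above can be sketched with injected dependencies. Everything here is hypothetical: the `PipelineDeps` interface stands in for the real package wrappers and does not reflect their actual APIs, and falling back to OCR when the extracted text is empty is an assumed trigger.

```typescript
// Hypothetical step interfaces; the real project wires these to the packages listed above.
interface PipelineDeps {
  hasBudget(brokerId: string): Promise<boolean>;
  extractText(pdf: Uint8Array): Promise<string>;
  ocrFallback(pdf: Uint8Array): Promise<string>;
  extractWithLlm(text: string): Promise<{ rawJson: string; costUsd: number }>;
  repairJson(rawJson: string): Promise<Record<string, unknown>>;
  recordSpend(brokerId: string, costUsd: number): Promise<void>;
}

class BudgetExceededError extends Error {
  constructor(brokerId: string) {
    super(`Budget exceeded for ${brokerId}`);
    this.name = "BudgetExceededError";
  }
}

interface PipelineResult {
  brokerId: string;
  policyData: Record<string, unknown>;
  costUsd: number;
}

async function runPipeline(
  deps: PipelineDeps,
  brokerId: string,
  pdf: Uint8Array,
): Promise<PipelineResult> {
  // Step 2: refuse early, before any paid LLM call is made.
  if (!(await deps.hasBudget(brokerId))) {
    throw new BudgetExceededError(brokerId);
  }
  // Step 3: try the text layer first; use OCR only if it yields nothing (assumed trigger).
  let text = await deps.extractText(pdf);
  if (text.trim().length === 0) {
    text = await deps.ocrFallback(pdf);
  }
  // Steps 4 and 5: extract with the LLM, then repair and validate its JSON output.
  const { rawJson, costUsd } = await deps.extractWithLlm(text);
  const policyData = await deps.repairJson(rawJson);
  // Step 6: record spend only after the extraction has succeeded.
  await deps.recordSpend(brokerId, costUsd);
  return { brokerId, policyData, costUsd };
}
```

Passing the steps in as an object keeps the orchestration testable with stubs, and a route handler could translate `BudgetExceededError` into the 402 response.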
 
## Project Layout
 
```
app/api/              Next.js App Router API route handlers
src/lib/              types, config, PDF parser, pipeline orchestrator
src/services/         REAA package wrappers (budget, telemetry, document extraction, Cohere client)
tests/                vitest suite with unit + integration tests
packages/             API references for every dependency
```
 
## Running Tests
 
```bash
pnpm test            # vitest run with coverage (thresholds: >=90% lines/branches/functions/statements)
pnpm typecheck       # tsc --noEmit
pnpm lint            # eslint flat config
```
 
## License
 
MIT — see [LICENSE](./LICENSE).