# Databricks RAG Pipeline for Insurance Policy Analysis
 
> A retrieval‑augmented generation service that lets small insurance agencies query policy documents with natural language, backed by Databricks LLMs and pgvector.
 
A tutorialized reference solution from [reaatech.com](https://reaatech.com), demonstrating how to build production-grade AI systems with the `@reaatech/*` package family.
 
## Architecture
 
The pipeline exposes two main API endpoints:
 
```
POST /api/ingest
  PDF → pdf.js (text extraction) → VoyageAI (embeddings) → pgvector (via PostgresMemoryStorage)
 
POST /api/query
  question → VoyageAI (embed) → pgvector (retrieve) → ContextPlanner → Databricks LLM → structured repair → cache
```
 
**Ingest flow:** A PDF file is uploaded as `multipart/form-data`. Its text is extracted via `pdfjs-dist`, split into chunks, embedded through the VoyageAI API, and stored in PostgreSQL with the pgvector extension using the `PostgresMemoryStorage` class from `@reaatech/agent-memory-storage`.
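
The "split into chunks" step could look like the minimal sketch below. The helper name `chunkText`, the character-based sizing, and the default sizes are assumptions made for illustration; the repository's own chunker may split on tokens or sentences instead.

```typescript
// Hypothetical helper (not from the repository): fixed-size character
// chunks with overlap, so text cut at a boundary also appears in the next chunk.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("chunkSize must be positive and overlap must be in [0, chunkSize)");
  }
  // Collapse the irregular whitespace that PDF extraction tends to produce.
  const normalized = text.replace(/\s+/g, " ").trim();
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < normalized.length; start += step) {
    chunks.push(normalized.slice(start, start + chunkSize));
    if (start + chunkSize >= normalized.length) break; // this chunk reached the end
  }
  return chunks;
}
```

Overlap trades some storage and embedding cost for a lower chance that a clause is split across two chunks and retrieved only in part.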
 
**Query flow:** A natural-language question is embedded via VoyageAI, and semantically similar chunks are retrieved from pgvector. `@reaatech/context-window-planner` packs them into a token-budgeted prompt, and the Databricks LLM serving endpoint generates a structured answer. That output is repaired via `@reaatech/structured-repair-core`, and the result is cached for repeat queries.
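
To make "token-budgeted" concrete, here is a rough sketch of greedy packing under a budget. It is not the API of `@reaatech/context-window-planner`: the names `RetrievedChunk`, `estimateTokens`, and `packChunks` are hypothetical, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer.

```typescript
// Hypothetical shape of a retrieved chunk (an assumption for this sketch).
interface RetrievedChunk {
  id: string;
  text: string;
  score: number; // similarity to the question; higher is better
}

// Rough heuristic: about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedy packing: visit chunks in descending score order and skip any
// chunk that would push the running total past the budget.
function packChunks(chunks: RetrievedChunk[], tokenBudget: number): RetrievedChunk[] {
  const selected: RetrievedChunk[] = [];
  let used = 0;
  for (const chunk of [...chunks].sort((a, b) => b.score - a.score)) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > tokenBudget) continue;
    selected.push(chunk);
    used += cost;
  }
  return selected;
}
```

Skipping an oversized chunk rather than stopping lets smaller, lower-ranked chunks still fill the remaining budget.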
 
### Internal modules (src/lib/)
 
| Module | File | Role |
|---|---|---|
| config | `src/lib/config.ts` | Zod-validated env vars and merged telemetry config |
| context-planner | `src/lib/context-planner.ts` | Token-budgeted prompt assembly via `@reaatech/context-window-planner` |
| repair | `src/lib/repair.ts` | LLM output structuring via `@reaatech/structured-repair-core` |
| cost-telemetry | `src/lib/cost-telemetry.ts` | Per-call cost tracking with `@reaatech/llm-cost-telemetry` |
| observability | `src/lib/observability.ts` | Langfuse trace generation for LLM calls |
 
### Services (src/services/)
 
| Service | Responsibility |
|---|---|
| `ingestion-service` | Orchestrates PDF ingestion (extract → chunk → embed → store) |
| `embedding-service` | Wraps VoyageAI embedding API |
| `storage-service` | Wraps `@reaatech/agent-memory-storage` (pgvector) |
| `retrieval-service` | Semantic search over stored embeddings |
| `query-service` | Orchestrates query (embed → retrieve → plan → LLM → repair → cache) |
| `databricks-service` | Calls Databricks LLM serving endpoint |
| `cache-service` | Deduplicates repeat queries using an embedding-similarity threshold |
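
The similarity-threshold idea behind `cache-service` can be sketched as follows. This is an illustration under assumptions, not the `@reaatech/llm-cache` API: `CacheEntry`, `cosineSimilarity`, and `findCached` are hypothetical names, and the `0.8` default mirrors the documented default of `LLM_CACHE_SIMILARITY_THRESHOLD`.

```typescript
// Hypothetical cache entry: the embedding of a past question plus its answer.
interface CacheEntry {
  embedding: number[];
  answer: string;
  expiresAt: number; // epoch milliseconds
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Return the unexpired entry most similar to the query embedding,
// or undefined when nothing reaches the threshold.
function findCached(
  entries: CacheEntry[],
  query: number[],
  threshold = 0.8,
  now = Date.now()
): CacheEntry | undefined {
  let best: CacheEntry | undefined;
  let bestScore = threshold;
  for (const entry of entries) {
    if (entry.expiresAt <= now) continue; // TTL elapsed
    const score = cosineSimilarity(entry.embedding, query);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best;
}
```

A higher threshold yields fewer cache hits but lowers the risk of answering a different question with a stale answer.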
 
## Setup
 
### Environment variables
 
Create a `.env.local` file (see `envSchema` in `src/lib/config.ts`):
 
| Variable | Required | Description |
|---|---|---|
| `DATABRICKS_HOST` | Yes | Databricks workspace URL |
| `DATABRICKS_TOKEN` | Yes | Databricks personal access token (PAT) |
| `DATABRICKS_CLIENT_ID` | No | OAuth client ID |
| `DATABRICKS_CLIENT_SECRET` | No | OAuth client secret |
| `DATABRICKS_SERVING_ENDPOINT` | No | Serving endpoint name (default: `databricks-dbrx-instruct`) |
| `VOYAGE_API_KEY` | Yes | VoyageAI API key |
| `PGHOST` | No | PostgreSQL host (default: `localhost`) |
| `PGPORT` | No | PostgreSQL port (default: `5432`) |
| `PGDATABASE` | No | Database name (default: `insurance_policies`) |
| `PGUSER` | No | DB user (default: `postgres`) |
| `PGPASSWORD` | No | DB password |
| `LANGFUSE_PUBLIC_KEY` | No | Langfuse public key |
| `LANGFUSE_SECRET_KEY` | No | Langfuse secret key |
| `LANGFUSE_HOST` | No | Langfuse host (default: `https://cloud.langfuse.com`) |
| `LLM_CACHE_SIMILARITY_THRESHOLD` | No | Cosine-similarity threshold for a cache hit (default: `0.8`) |
| `LLM_CACHE_TTL_SECONDS` | No | Cache TTL in seconds (default: `3600`) |
| `DEFAULT_DAILY_BUDGET` | No | Daily cost budget in USD (default: `100.0`) |
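
The required/default behaviour in the table can be illustrated with a dependency-free sketch. The project itself validates with Zod (`envSchema` in `src/lib/config.ts`); the `AppConfig` shape and the `loadConfig` helper below are hypothetical stand-ins that cover only a subset of the variables.

```typescript
// Hypothetical config shape; field names are assumptions for this sketch.
interface AppConfig {
  databricksHost: string;
  databricksToken: string;
  servingEndpoint: string;
  voyageApiKey: string;
  pgHost: string;
  pgPort: number;
  pgDatabase: string;
  pgUser: string;
  cacheSimilarityThreshold: number;
  cacheTtlSeconds: number;
  dailyBudgetUsd: number;
}

type Env = Record<string, string | undefined>;

function required(env: Env, key: string): string {
  const value = env[key];
  if (!value) throw new Error(`Missing required environment variable: ${key}`);
  return value;
}

function numberOr(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed)) throw new Error(`${key} must be a number, got "${raw}"`);
  return parsed;
}

// Apply the documented defaults and fail fast on missing required values.
function loadConfig(env: Env): AppConfig {
  return {
    databricksHost: required(env, "DATABRICKS_HOST"),
    databricksToken: required(env, "DATABRICKS_TOKEN"),
    servingEndpoint: env["DATABRICKS_SERVING_ENDPOINT"] || "databricks-dbrx-instruct",
    voyageApiKey: required(env, "VOYAGE_API_KEY"),
    pgHost: env["PGHOST"] || "localhost",
    pgPort: numberOr(env, "PGPORT", 5432),
    pgDatabase: env["PGDATABASE"] || "insurance_policies",
    pgUser: env["PGUSER"] || "postgres",
    cacheSimilarityThreshold: numberOr(env, "LLM_CACHE_SIMILARITY_THRESHOLD", 0.8),
    cacheTtlSeconds: numberOr(env, "LLM_CACHE_TTL_SECONDS", 3600),
    dailyBudgetUsd: numberOr(env, "DEFAULT_DAILY_BUDGET", 100.0),
  };
}
```

In an application this would typically be called once at startup as `loadConfig(process.env)`, so a missing key surfaces before the first request.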
 
### Prerequisites
 
- **PostgreSQL 15+** with the `pgvector` extension installed
- **Databricks** workspace with a serving endpoint and a personal access token (PAT)
- **VoyageAI** API key
- **Langfuse** account (optional, for observability)
 
### Quick start
 
```bash
pnpm install
pnpm dev             # next dev on http://localhost:3000
```
 
Run tests:
 
```bash
pnpm test            # vitest run with coverage
```
 
## API Reference
 
### POST /api/ingest
 
Upload a PDF policy document for indexing.
 
**Request:** `multipart/form-data`
 
| Field | Type | Required | Description |
|---|---|---|---|
| `file` | File | Yes | PDF document |
| `tenantId` | string | No | Tenant identifier (default: `"default"`) |
| `description` | string | No | Human-readable description |
| `tags` | string | No | Comma-separated tags |
 
**Example:**
 
```bash
curl -X POST http://localhost:3000/api/ingest \
  -F "file=@policy.pdf" \
  -F "tenantId=acme-insurance" \
  -F "description=2025 Auto Policy" \
  -F "tags=auto,2025"
```
 
**Response (201):**
 
```json
{
  "documentId": "abc-123",
  "chunkCount": 42,
  "tenantId": "acme-insurance"
}
```
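
On the server side, the optional text fields of this request could be normalized roughly as shown below. The `IngestFields` type and `normalizeIngestFields` helper are hypothetical; only the `"default"` tenant fallback and the comma-separated `tags` format come from the request table.

```typescript
// Hypothetical normalized form of the request's text fields.
interface IngestFields {
  tenantId: string;
  description: string | undefined;
  tags: string[];
}

// Trim values, fall back to the documented default tenant, and split the
// comma-separated tags into a clean array.
function normalizeIngestFields(fields: Record<string, string | undefined>): IngestFields {
  const tenantId = fields["tenantId"]?.trim() || "default";
  const description = fields["description"]?.trim() || undefined;
  const tags = (fields["tags"] ?? "")
    .split(",")
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0);
  return { tenantId, description, tags };
}
```

For the `curl` example above, this would yield the tenant `acme-insurance` and the tags `auto` and `2025`.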
 
---
 
### POST /api/query
 
Ask a natural-language question about indexed policies.
 
**Request:** `application/json`
 
| Field | Type | Required | Description |
|---|---|---|---|
| `question` | string | Yes | Natural-language query |
| `tenantId` | string | Yes | Tenant scope |
 
**Example:**
 
```bash
curl -X POST http://localhost:3000/api/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the deductible for collision coverage?", "tenantId": "acme-insurance"}'
```
 
**Response (200):**
 
```json
{
  "answer": "The deductible for collision coverage is $500.",
  "sources": ["chunk-001", "chunk-042"],
  "confidence": 0.94,
  "coverageGaps": [],
  "cached": false
}
```
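
A client may want to check this shape before trusting it. The guard below is a minimal sketch: the `QueryResponse` interface is inferred from the example response, and treating `coverageGaps` as an array of strings is an assumption, since the example shows only an empty array.

```typescript
// Inferred from the example response above; not the repository's own type.
interface QueryResponse {
  answer: string;
  sources: string[];
  confidence: number;
  coverageGaps: string[]; // element type is an assumption
  cached: boolean;
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Runtime type guard for a parsed JSON body.
function isQueryResponse(value: unknown): value is QueryResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["answer"] === "string" &&
    isStringArray(v["sources"]) &&
    typeof v["confidence"] === "number" &&
    isStringArray(v["coverageGaps"]) &&
    typeof v["cached"] === "boolean"
  );
}
```

Calling `isQueryResponse(await res.json())` after a `fetch` narrows the value to `QueryResponse` for the rest of the branch.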
 
---
 
### GET /api/health
 
Returns the status of each upstream service.
 
**Example:**
 
```bash
curl http://localhost:3000/api/health
```
 
**Response (200):**
 
```json
{
  "status": "ok",
  "services": {
    "database": true,
    "voyage": true,
    "databricks": true
  }
}
```
 
## Dependencies
 
### REAA packages
 
| Package | Version | Role |
|---|---|---|
| `@reaatech/agent-memory-retrieval` | `0.1.0` | Semantic search over pgvector embeddings |
| `@reaatech/agent-memory-storage` | `0.1.0` | PostgresMemoryStorage for pgvector persistence |
| `@reaatech/context-window-planner` | `0.1.0` | Token-budgeted prompt assembly with RAG strategy |
| `@reaatech/llm-cache` | `0.1.0` | Embedding-similarity cache for repeat queries |
| `@reaatech/llm-cost-telemetry` | `0.2.0` | Per-call cost calculation and budget enforcement |
| `@reaatech/structured-repair-core` | `1.0.0` | Zod-schema-based LLM output repair |
 
### Third-party packages
 
| Package | Version | Role |
|---|---|---|
| `@databricks/sdk-experimental` | `0.18.0` | Databricks REST API client |
| `pdfjs-dist` | `6.0.227` | PDF text extraction |
| `voyageai` | `0.3.1` | VoyageAI embedding API client |
| `pg` | `8.14.1` | PostgreSQL driver |
| `pgvector` | `0.3.0` | pgvector extension client |
| `postgres` | `3.4.9` | Postgres.js SQL client |
| `langfuse` | `3.38.20` | LLM observability and tracing |
| `zod` | `4.4.3` | Runtime schema validation |
 
## License
 
MIT — see [LICENSE](./LICENSE).