# AWS Bedrock RAG Eval Harness for SMB Customer Support Bots
 
> Automatically score RAG answer quality, track evaluation costs, and block deployments when your AI support bot's accuracy dips.
 
**Problem:** SMB customer support bots powered by RAG (Retrieval-Augmented Generation) frequently ship hallucinated or irrelevant answers. Without automated quality regression detection, every deployment risks degrading the user experience.
 
**What this recipe does:**
- **On-demand RAG evaluations** using AWS Bedrock as a judge LLM — score faithfulness, relevance, context precision, and context recall.
- **Cost tracking** — monitor per-run and cumulative evaluation spend with configurable budgets and hard stop limits.
- **CI gating** — define quality thresholds (gates) that block deployments when metrics dip below acceptable levels.
- **Observability** — traces and metrics sent to Langfuse for inspection and dashboarding.
 
## Prerequisites
 
- Node.js >= 22
- pnpm 10
- AWS Bedrock access (Claude Sonnet 4 or compatible model)
- Langfuse account (free tier works)
 
## Quick Start
 
```bash
pnpm install
cp .env.example .env
# Fill in your AWS credentials and Langfuse keys in .env
pnpm dev
```
 
Then trigger an evaluation:
 
```bash
curl -X POST http://localhost:3000/api/evals \
  -H "Content-Type: application/json" \
  -d '{}'
```
 
## Configuration
 
| Variable | Default | Description |
|---|---|---|
| `PORT` | `3000` | Server port |
| `AWS_REGION` | `us-east-1` | AWS region for Bedrock |
| `AWS_ACCESS_KEY_ID` | — | AWS access key |
| `AWS_SECRET_ACCESS_KEY` | — | AWS secret key |
| `LANGFUSE_PUBLIC_KEY` | — | Langfuse public key |
| `LANGFUSE_SECRET_KEY` | — | Langfuse secret key |
| `LANGFUSE_HOST` | — | Langfuse host URL |
| `EVAL_DATASET_PATH` | `./datasets/eval-samples.jsonl` | Path to evaluation dataset |
| `EVAL_BUDGET_LIMIT` | `10.00` | Maximum evaluation spend (USD) |
| `JUDGE_MODEL_ID` | `anthropic.claude-sonnet-4-v1:0` | Bedrock model for judging |
 
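The table above can be read as a small env-to-config mapping. The following is a minimal sketch of such a loader, not the contents of `src/config.ts`: the `EvalConfig` shape and the `loadConfig` and `parseNumber` helpers are hypothetical names, and only the variable names and defaults come from the table.

```typescript
// Illustrative sketch only: the real loader in src/config.ts may differ.
// EvalConfig, loadConfig, and parseNumber are hypothetical names.
interface EvalConfig {
  port: number;
  awsRegion: string;
  datasetPath: string;
  budgetLimitUsd: number;
  judgeModelId: string;
}

type Env = Record<string, string | undefined>;

// Parse a numeric variable and fail fast on blank or invalid input.
function parseNumber(name: string, raw: string, isValid: (n: number) => boolean): number {
  const value = Number(raw);
  if (raw.trim() === "" || !isValid(value)) {
    throw new Error(`Invalid value for ${name}: "${raw}"`);
  }
  return value;
}

// Apply the defaults from the configuration table when a variable is unset.
function loadConfig(env: Env): EvalConfig {
  return {
    port: parseNumber("PORT", env["PORT"] ?? "3000", (n) => Number.isInteger(n) && n > 0),
    awsRegion: env["AWS_REGION"] ?? "us-east-1",
    datasetPath: env["EVAL_DATASET_PATH"] ?? "./datasets/eval-samples.jsonl",
    budgetLimitUsd: parseNumber(
      "EVAL_BUDGET_LIMIT",
      env["EVAL_BUDGET_LIMIT"] ?? "10.00",
      (n) => Number.isFinite(n) && n >= 0,
    ),
    judgeModelId: env["JUDGE_MODEL_ID"] ?? "anthropic.claude-sonnet-4-v1:0",
  };
}
```

In a Node.js process this would be called as `loadConfig(process.env)`. Credentials and Langfuse keys are left out of the sketch because they have no defaults.
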
## API Reference
 
### `POST /api/evals`
 
Trigger an evaluation run.
 
**Request body:**
 
```json
{
  "dataset": "./datasets/eval-samples.jsonl",
  "config": "./datasets/eval-config.yaml",
  "gates": [{ "name": "min-faithfulness", "type": "threshold", "metric": "avg_faithfulness", "operator": ">=", "threshold": 0.85 }],
  "baseline": { "run_id": "...", "metrics": { "overall_score": 0.92, ... } },
  "traceId": "optional-custom-trace-id"
}
```
 
All fields are optional. When `dataset` is omitted, the run falls back to the `EVAL_DATASET_PATH` environment variable; when `gates` is omitted, the default gates apply.
 
**Response:**
 
```json
{
  "results": {
    "run_id": "uuid",
    "dataset": "./datasets/eval-samples.jsonl",
    "samples": [...],
    "metrics": {
      "overall_score": 0.91,
      "avg_faithfulness": 0.94,
      "avg_relevance": 0.88,
      "avg_context_precision": 0.92,
      "avg_context_recall": 0.90,
      "cost_per_sample": 0.002,
      "total_samples": 5
    },
    "total_cost": 0.01,
    "duration_ms": 4523
  },
  "gates": {
    "passed": true,
    "gates": [...],
    "failures": []
  }
}
```
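
The sample numbers above are consistent with `overall_score` being the unweighted mean of the four metric averages, and with `total_cost` being `cost_per_sample` times `total_samples`. That relationship is inferred from the example payload only. The harness may weight or round differently, and the `overallScore` and `totalCost` helpers below are hypothetical.

```typescript
// Sketch under an assumption: overall_score is the plain mean of the four
// averages. This matches the sample response but is not confirmed by the harness.
interface MetricAverages {
  avg_faithfulness: number;
  avg_relevance: number;
  avg_context_precision: number;
  avg_context_recall: number;
}

function overallScore(m: MetricAverages): number {
  const values = [
    m.avg_faithfulness,
    m.avg_relevance,
    m.avg_context_precision,
    m.avg_context_recall,
  ];
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Assumed relationship between the per-sample cost and the run total.
function totalCost(costPerSample: number, totalSamples: number): number {
  return costPerSample * totalSamples;
}
```

With the sample values, the mean of 0.94, 0.88, 0.92, and 0.90 is 0.91, and 0.002 USD across 5 samples gives 0.01 USD, up to floating-point rounding.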
 
**Status codes:** `200` (success), `400` (missing dataset), `422` (invalid dataset).
 
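A caller typically needs to turn the status code and the `gates` object into a single decision. The sketch below shows one way to do that. The `EvalOutcome` type and `interpretEvalResponse` function are hypothetical, and it assumes a run whose gates fail still returns `200` with `gates.passed` set to `false`, which the status-code list does not state explicitly.

```typescript
// Hypothetical client-side helper; not part of the harness.
type EvalOutcome =
  | { kind: "passed" }
  | { kind: "gates_failed"; failures: unknown[] }
  | { kind: "missing_dataset" }
  | { kind: "invalid_dataset" }
  | { kind: "unexpected"; status: number };

interface GateSummary {
  passed: boolean;
  failures: unknown[];
}

// Narrow an unknown value to the documented shape of the "gates" object.
function isGateSummary(value: unknown): value is GateSummary {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v["passed"] === "boolean" && Array.isArray(v["failures"]);
}

function interpretEvalResponse(status: number, body: unknown): EvalOutcome {
  if (status === 400) return { kind: "missing_dataset" };
  if (status === 422) return { kind: "invalid_dataset" };
  if (status === 200 && typeof body === "object" && body !== null) {
    const gates = (body as Record<string, unknown>)["gates"];
    if (isGateSummary(gates)) {
      // Assumption: failed gates are reported in the body, not via the status code.
      return gates.passed
        ? { kind: "passed" }
        : { kind: "gates_failed", failures: gates.failures };
    }
  }
  return { kind: "unexpected", status };
}
```

Anything outside the three documented status codes, or a `200` body without a well-formed `gates` object, is reported as `unexpected` rather than silently treated as a pass.
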
### `GET /api/evals` (health check)
 
```json
{ "status": "ok", "uptime": 12345.6 }
```
 
### `GET /api/evals/cost`
 
Returns a cost report and breakdown from the cost tracking module.
 
```json
{
  "report": {
    "totalCost": 0.01,
    "costPerSample": 0.002,
    "trend": "stable"
  },
  "breakdown": {
    "total": 0.01,
    "by_metric": {},
    "by_provider": {},
    "per_sample": []
  }
}
```
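
One use for the `report` object is to compare spend against `EVAL_BUDGET_LIMIT` before starting another run. This is a minimal sketch of that check, not the cost module's own logic. The `budgetStatus` and `samplesRemaining` helpers and the `warnRatio` parameter are hypothetical.

```typescript
// Hypothetical helpers built on the documented "report" fields.
type BudgetStatus = "ok" | "warning" | "exceeded";

interface CostReport {
  totalCost: number;
  costPerSample: number;
  trend: string;
}

// Classify spend against a limit; warnRatio is an assumed soft threshold.
function budgetStatus(report: CostReport, limitUsd: number, warnRatio = 0.8): BudgetStatus {
  if (report.totalCost >= limitUsd) return "exceeded";
  if (report.totalCost >= limitUsd * warnRatio) return "warning";
  return "ok";
}

// Estimate how many more samples fit in the remaining budget.
function samplesRemaining(report: CostReport, limitUsd: number): number {
  if (report.costPerSample <= 0) return Infinity;
  const remaining = limitUsd - report.totalCost;
  return Math.max(0, Math.floor(remaining / report.costPerSample));
}
```

A script could call `GET /api/evals/cost`, pass the `report` field to `budgetStatus`, and skip the next evaluation when the result is `"exceeded"`.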
 
## CI Integration
 
Add a step to your GitHub Actions workflow to gate deployments. The step reads `./eval-results.json`, so an earlier step in the job must have produced that file:
 
```yaml
- name: Run RAG evaluation gates
  run: |
    pnpm exec rag-eval-pack gate \
      --results ./eval-results.json \
      --gates ./datasets/gate-config.yaml
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    AWS_REGION: us-east-1
```
 
Or use the programmatic API in a custom script:
 
```typescript
import { runCiGateFile } from "./src/lib/gate.js";
 
const exitCode = await runCiGateFile("./eval-results.json", "./datasets/gate-config.yaml");
process.exit(exitCode);
```
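
To make the gating behavior concrete, here is a minimal sketch of how a threshold gate could be checked against run metrics and mapped to a process exit code. It is not the implementation behind `runCiGateFile`. The gate shape follows the request-body example, but `evaluateGates`, `toExitCode`, and every operator other than `>=` are assumptions.

```typescript
// Illustrative gate check; the real engine in @reaatech/rag-eval-gate may differ.
type Operator = ">=" | ">" | "<=" | "<" | "==";

interface ThresholdGate {
  name: string;
  type: "threshold";
  metric: string;
  operator: Operator; // only ">=" appears in this README; the rest are assumed
  threshold: number;
}

interface GateReport {
  passed: boolean;
  failures: string[];
}

function compare(value: number, operator: Operator, threshold: number): boolean {
  switch (operator) {
    case ">=": return value >= threshold;
    case ">": return value > threshold;
    case "<=": return value <= threshold;
    case "<": return value < threshold;
    case "==": return value === threshold;
  }
}

// A gate fails when its metric is missing or does not satisfy the comparison.
function evaluateGates(metrics: Record<string, number>, gates: ThresholdGate[]): GateReport {
  const failures: string[] = [];
  for (const gate of gates) {
    const value = metrics[gate.metric];
    if (value === undefined || !compare(value, gate.operator, gate.threshold)) {
      failures.push(gate.name);
    }
  }
  return { passed: failures.length === 0, failures };
}

// Conventional CI mapping: 0 lets the pipeline continue, 1 blocks it.
function toExitCode(report: GateReport): number {
  return report.passed ? 0 : 1;
}
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a typo in a gate's `metric` name then blocks the deployment instead of passing unnoticed.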
 
## Project Structure
 
```
.
├── app/
│   ├── api/evals/route.ts       App Router POST + GET handlers
│   ├── api/evals/cost/route.ts  Cost report endpoint
│   ├── layout.tsx               Root layout with metadata
│   ├── page.tsx                 Recipe home page
│   └── globals.css              Global styles
├── src/
│   ├── config.ts                Env-based configuration loader
│   ├── types.ts                 Request/response type definitions
│   └── lib/
│       ├── judge.ts             AWS Bedrock judge adapter
│       ├── dataset.ts           Dataset loading and validation
│       ├── gate.ts              Quality gate CI integration
│       ├── cost.ts              Evaluation cost tracking
│       ├── observability.ts     Langfuse tracing and scoring
│       └── runner.ts            CLI runner entry point
├── datasets/
│   ├── eval-samples.jsonl       Sample evaluation dataset
│   ├── eval-config.yaml         Evaluation suite configuration
│   └── gate-config.yaml         CI gate definitions
├── tests/                       Vitest test suite (mirrors src/)
├── packages/                    API references for dependencies
├── .env.example                 Environment variable template
├── DEV_PLAN.md                  Build plan
└── README.md                    This file
```
 
## Related Packages
 
| Package | Role |
|---|---|
| `@reaatech/rag-eval-core` | Core types, schemas, and domain models |
| `@reaatech/rag-eval-dataset` | Dataset loading, validation, and versioning |
| `@reaatech/rag-eval-cost` | Cost tracking, budget management, and reporting |
| `@reaatech/rag-eval-gate` | Quality gate engine and CI integration |
| `@reaatech/rag-eval-cli` | CLI suite (`rag-eval-pack evaluate\|gate\|compare\|cost\|report\|judge`) |
 
## License
 
MIT — see [LICENSE](./LICENSE).