# AWS Bedrock RAG Eval Harness for SMB Customer Support Bots
> Automatically score RAG answer quality, track evaluation costs, and block deployments when your AI support bot's accuracy dips.
**Problem:** SMB customer support bots powered by RAG (Retrieval-Augmented Generation) frequently ship hallucinated or irrelevant answers. Without automated quality regression detection, every deployment risks degrading the user experience.
**What this recipe does:**
- **On-demand RAG evaluations** using AWS Bedrock as a judge LLM — score faithfulness, relevance, context precision, and context recall.
- **Cost tracking** — monitor per-run and cumulative evaluation spend with configurable budgets and hard stop limits.
- **CI gating** — define quality thresholds (gates) that block deployments when metrics dip below acceptable levels.
- **Observability** — traces and metrics sent to Langfuse for inspection and dashboarding.
## Prerequisites
- Node.js >= 22
- pnpm 10
- AWS Bedrock access (Claude Sonnet 4 or compatible model)
- Langfuse account (free tier works)
## Quick Start
```bash
pnpm install
cp .env.example .env
# Fill in your AWS credentials and Langfuse keys in .env
pnpm dev
```
Then trigger an evaluation:
```bash
curl -X POST http://localhost:3000/api/evals \
  -H "Content-Type: application/json" \
  -d '{}'
```
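
The same request can also be sent from a script. The sketch below is a minimal TypeScript equivalent of the curl call, assuming Node's built-in `fetch` and the default port from `.env.example`. `buildEvalRequest` and `triggerEval` are hypothetical helper names and are not part of this recipe.

```typescript
// Hypothetical helpers: the endpoint path and empty JSON body mirror the curl call above.
interface EvalRequestInit {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build the request without sending it, so it can be inspected or unit-tested.
function buildEvalRequest(
  baseUrl: string = "http://localhost:3000",
  body: Record<string, unknown> = {},
): EvalRequestInit {
  return {
    url: new URL("/api/evals", baseUrl).toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Send the request and return the parsed JSON response.
async function triggerEval(
  baseUrl?: string,
  body?: Record<string, unknown>,
): Promise<unknown> {
  const { url, init } = buildEvalRequest(baseUrl, body);
  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`Evaluation request failed with status ${res.status}`);
  }
  return res.json();
}
```

Keeping the request construction separate from the network call makes it easy to check the payload in a test without a running server.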
## Configuration
| Variable | Default | Description |
|---|---|---|
| `PORT` | `3000` | Server port |
| `AWS_REGION` | `us-east-1` | AWS region for Bedrock |
| `AWS_ACCESS_KEY_ID` | — | AWS access key |
| `AWS_SECRET_ACCESS_KEY` | — | AWS secret key |
| `LANGFUSE_PUBLIC_KEY` | — | Langfuse public key |
| `LANGFUSE_SECRET_KEY` | — | Langfuse secret key |
| `LANGFUSE_HOST` | — | Langfuse host URL |
| `EVAL_DATASET_PATH` | `./datasets/eval-samples.jsonl` | Path to evaluation dataset |
| `EVAL_BUDGET_LIMIT` | `10.00` | Maximum evaluation spend (USD) |
| `JUDGE_MODEL_ID` | `anthropic.claude-sonnet-4-v1:0` | Bedrock model for judging |
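
To illustrate how the defaults in the table could be applied, here is a minimal sketch of an env-based loader. The `EvalConfig` shape and the `loadConfig` name are assumptions made for this example; the actual contents of `src/config.ts` are not shown in this README.

```typescript
// A sketch of an env-based loader using the defaults from the table above.
// EvalConfig and loadConfig are assumed names, not the contents of src/config.ts.
interface EvalConfig {
  port: number;
  awsRegion: string;
  evalDatasetPath: string;
  evalBudgetLimit: number; // USD
  judgeModelId: string;
  langfuseHost: string | undefined;
}

type Env = Record<string, string | undefined>;

function loadConfig(env: Env): EvalConfig {
  const port = Number(env.PORT ?? "3000");
  const budget = Number(env.EVAL_BUDGET_LIMIT ?? "10.00");

  // Fail fast on malformed numeric values instead of running with NaN.
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error(`PORT must be a positive integer, got "${env.PORT}"`);
  }
  if (!Number.isFinite(budget) || budget < 0) {
    throw new Error(
      `EVAL_BUDGET_LIMIT must be a non-negative number, got "${env.EVAL_BUDGET_LIMIT}"`,
    );
  }

  return {
    port,
    awsRegion: env.AWS_REGION ?? "us-east-1",
    evalDatasetPath: env.EVAL_DATASET_PATH ?? "./datasets/eval-samples.jsonl",
    evalBudgetLimit: budget,
    judgeModelId: env.JUDGE_MODEL_ID ?? "anthropic.claude-sonnet-4-v1:0",
    langfuseHost: env.LANGFUSE_HOST,
  };
}
```

In a Node process this would typically be called once at startup as `loadConfig(process.env)`.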
## API Reference
### `POST /api/evals`
Trigger an evaluation run.
**Request body:**
```json
{
  "dataset": "./datasets/eval-samples.jsonl",
  "config": "./datasets/eval-config.yaml",
  "gates": [{ "name": "min-faithfulness", "type": "threshold", "metric": "avg_faithfulness", "operator": ">=", "threshold": 0.85 }],
  "baseline": { "run_id": "...", "metrics": { "overall_score": 0.92, ... } },
  "traceId": "optional-custom-trace-id"
}
```
All fields are optional. When `dataset` is omitted, the run falls back to the `EVAL_DATASET_PATH` environment variable; when `gates` is omitted, the default gates apply.
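
As an illustration of how a threshold gate with the shape shown in the request body could be checked against run metrics, consider the sketch below. Only the `>=` operator appears in this README; the other operators, the `GateOutcome` and `GateReport` shapes, and the `evaluateGates` name are assumptions for this example and may differ from the gate module's real behavior.

```typescript
// Gate shape taken from the request example. The result shape is an assumption,
// and operators other than ">=" are included only for illustration.
type Operator = ">=" | ">" | "<=" | "<" | "==";

interface ThresholdGate {
  name: string;
  type: "threshold";
  metric: string;
  operator: Operator;
  threshold: number;
}

interface GateOutcome {
  name: string;
  passed: boolean;
  actual: number | undefined;
}

interface GateReport {
  passed: boolean;
  gates: GateOutcome[];
  failures: string[];
}

function compare(actual: number, operator: Operator, threshold: number): boolean {
  switch (operator) {
    case ">=": return actual >= threshold;
    case ">": return actual > threshold;
    case "<=": return actual <= threshold;
    case "<": return actual < threshold;
    case "==": return actual === threshold;
  }
}

function evaluateGates(
  metrics: Record<string, number | undefined>,
  gates: ThresholdGate[],
): GateReport {
  const outcomes: GateOutcome[] = gates.map((gate) => {
    const actual = metrics[gate.metric];
    // A metric missing from the results counts as a failure rather than a silent pass.
    const passed = actual !== undefined && compare(actual, gate.operator, gate.threshold);
    return { name: gate.name, passed, actual };
  });
  const failures = outcomes.filter((o) => !o.passed).map((o) => o.name);
  return { passed: failures.length === 0, gates: outcomes, failures };
}
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a misspelled metric name in a gate definition should block a deployment, not wave it through.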
**Response:**
```json
{
  "results": {
    "run_id": "uuid",
    "dataset": "./datasets/eval-samples.jsonl",
    "samples": [...],
    "metrics": {
      "overall_score": 0.91,
      "avg_faithfulness": 0.94,
      "avg_relevance": 0.88,
      "avg_context_precision": 0.92,
      "avg_context_recall": 0.90,
      "cost_per_sample": 0.002,
      "total_samples": 5
    },
    "total_cost": 0.01,
    "duration_ms": 4523
  },
  "gates": {
    "passed": true,
    "gates": [...],
    "failures": []
  }
}
```
**Status codes:** `200` (success), `400` (missing dataset), `422` (invalid dataset).
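
The `metrics` object in the response can be read as an aggregation over per-sample scores. The sketch below shows one way to compute it, assuming a hypothetical `SampleScores` shape (the response example elides the contents of `samples`) and assuming `overall_score` is the unweighted mean of the four averages. That assumption is consistent with the example numbers (0.91 is the mean of 0.94, 0.88, 0.92, and 0.90), but the weighting this recipe actually uses is not stated here.

```typescript
// Per-sample shape is an assumption: the response example elides "samples".
interface SampleScores {
  faithfulness: number;
  relevance: number;
  context_precision: number;
  context_recall: number;
  cost: number; // USD spent judging this sample
}

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function aggregate(samples: SampleScores[]) {
  const avg_faithfulness = mean(samples.map((s) => s.faithfulness));
  const avg_relevance = mean(samples.map((s) => s.relevance));
  const avg_context_precision = mean(samples.map((s) => s.context_precision));
  const avg_context_recall = mean(samples.map((s) => s.context_recall));
  const total_cost = samples.reduce((sum, s) => sum + s.cost, 0);

  return {
    metrics: {
      // Assumption: unweighted mean of the four per-metric averages.
      overall_score: mean([
        avg_faithfulness,
        avg_relevance,
        avg_context_precision,
        avg_context_recall,
      ]),
      avg_faithfulness,
      avg_relevance,
      avg_context_precision,
      avg_context_recall,
      cost_per_sample: samples.length === 0 ? 0 : total_cost / samples.length,
      total_samples: samples.length,
    },
    total_cost,
  };
}
```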
### `GET /api/evals` (health check)
```json
{ "status": "ok", "uptime": 12345.6 }
```
### `GET /api/evals/cost`
Returns a cost report and breakdown from the cost tracking module.
```json
{
  "report": {
    "totalCost": 0.01,
    "costPerSample": 0.002,
    "trend": "stable"
  },
  "breakdown": {
    "total": 0.01,
    "by_metric": {},
    "by_provider": {},
    "per_sample": []
  }
}
```
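
The budget and hard stop behavior described under cost tracking can be pictured with a small accumulator. This is a sketch only: `BudgetTracker` and its methods are hypothetical names and do not reflect the API of `src/lib/cost.ts` or `@reaatech/rag-eval-cost`.

```typescript
// A sketch of per-run cost accounting with a hard stop.
// BudgetTracker is a hypothetical class, not the cost module's real API.
class BudgetTracker {
  private readonly limitUsd: number;
  private spent = 0;
  private samples = 0;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  // Record the cost of one judged sample; refuse it if the limit would be exceeded.
  record(costUsd: number): void {
    if (this.spent + costUsd > this.limitUsd) {
      throw new Error(
        `Budget of $${this.limitUsd.toFixed(2)} would be exceeded after ${this.samples} samples`,
      );
    }
    this.spent += costUsd;
    this.samples += 1;
  }

  // Summarize spend in the same terms as the "report" object above.
  report(): { totalCost: number; costPerSample: number } {
    return {
      totalCost: this.spent,
      costPerSample: this.samples === 0 ? 0 : this.spent / this.samples,
    };
  }
}
```

Checking the limit before adding the cost means a run stops before it overspends, which matches the idea of a hard stop tied to `EVAL_BUDGET_LIMIT`.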
## CI Integration
Add a step to your GitHub Actions workflow to gate deployments:
```yaml
- name: Run RAG evaluation gates
  run: |
    pnpm exec rag-eval-pack gate \
      --results ./eval-results.json \
      --gates ./datasets/gate-config.yaml
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    AWS_REGION: us-east-1
```
Or use the programmatic API in a custom script:
```typescript
import { runCiGateFile } from "./src/lib/gate.js";
const exitCode = await runCiGateFile("./eval-results.json", "./datasets/gate-config.yaml");
process.exit(exitCode);
```
## Project Structure
```
.
├── app/
│   ├── api/evals/route.ts        App Router POST + GET handlers
│   ├── api/evals/cost/route.ts   Cost report endpoint
│   ├── layout.tsx                Root layout with metadata
│   ├── page.tsx                  Recipe home page
│   └── globals.css               Global styles
├── src/
│   ├── config.ts                 Env-based configuration loader
│   ├── types.ts                  Request/response type definitions
│   └── lib/
│       ├── judge.ts              AWS Bedrock judge adapter
│       ├── dataset.ts            Dataset loading and validation
│       ├── gate.ts               Quality gate CI integration
│       ├── cost.ts               Evaluation cost tracking
│       ├── observability.ts      Langfuse tracing and scoring
│       └── runner.ts             CLI runner entry point
├── datasets/
│   ├── eval-samples.jsonl        Sample evaluation dataset
│   ├── eval-config.yaml          Evaluation suite configuration
│   └── gate-config.yaml          CI gate definitions
├── tests/                        Vitest test suite (mirrors src/)
├── packages/                     API references for dependencies
├── .env.example                  Environment variable template
├── DEV_PLAN.md                   Build plan
└── README.md                     This file
```
## Related Packages
| Package | Role |
|---|---|
| `@reaatech/rag-eval-core` | Core types, schemas, and domain models |
| `@reaatech/rag-eval-dataset` | Dataset loading, validation, and versioning |
| `@reaatech/rag-eval-cost` | Cost tracking, budget management, and reporting |
| `@reaatech/rag-eval-gate` | Quality gate engine and CI integration |
| `@reaatech/rag-eval-cli` | CLI suite (`rag-eval-pack evaluate\|gate\|compare\|cost\|report\|judge`) |
## License
MIT — see [LICENSE](./LICENSE).