# Anthropic Eval Harness for Agent Quality Assurance
 
Continuous regression testing and safety scoring for Anthropic-powered agents, with automated quality gates before any customer-facing deployment.
 
## Problem
 
Small and mid-sized businesses (SMBs) shipping customer-support or sales agents on Anthropic's models see quality drift over time (toxic phrasing, hallucinated facts, missed tool calls) but lack a repeatable test suite to catch these regressions before they reach users.
 
## Architecture
 
This harness wraps the Anthropic SDK with REAA's evaluation suite, using judge-based scoring and golden datasets to compare every new model revision against known-good responses. Key components:
 
- **Evaluator Engine** (`src/lib/evaluator.ts`) — Runs trajectories through Claude, scores with judge modules, and aggregates results
- **Gates Engine** (`src/lib/gates.ts`) — Enforces pass/fail thresholds on quality, cost, and latency metrics
- **Incident Integration** (`src/lib/incidents.ts`) — Creates incident workflows when gates fail
- **Replay Integration** (`src/lib/replay.ts`) — Records and replays trajectories for root-cause analysis
- **Langfuse Observability** (`src/lib/observability.ts`) — Captures all evaluations for a live dashboard
- **API Routes** (`src/app/api/eval/run` and `src/app/api/health`) — RESTful interfaces for triggering evaluations and health checks
 
## Prerequisites
 
- Node.js >=22
- pnpm >=10
- Anthropic API key (set `ANTHROPIC_API_KEY`)
- Langfuse account (set `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`)
 
## Setup
 
```bash
pnpm install
cp .env.example .env
# Fill in your API keys in .env
```
 
## Running Evaluations
 
With the dev server running (`pnpm dev`), trigger an evaluation via the API:
 
```bash
curl -X POST http://localhost:3000/api/eval/run \
  -H 'Content-Type: application/json' \
  -d '{
    "suiteConfig": "metrics:\n  - faithfulness\n  - relevance",
    "trajectories": [
      {
        "trajectory_id": "t1",
        "turns": [{ "turn_id": 1, "role": "user", "content": "Hello", "timestamp": "2024-01-01T00:00:00Z" }]
      }
    ]
  }'
```
 
## View the Dashboard
 
```bash
pnpm dev
# Open http://localhost:3000/dashboard
```
 
## Environment Variables
 
| Variable | Description |
|----------|-------------|
| `ANTHROPIC_API_KEY` | Anthropic API key |
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key |
| `LANGFUSE_HOST` | Langfuse host URL |
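
A missing key tends to surface only when the first evaluation runs, so it can help to check the variables above at startup. The following is a minimal sketch, not code from this repository: `REQUIRED_ENV_VARS` and `missingEnvVars` are hypothetical names, and the function takes the environment as a plain object so it can be exercised without touching `process.env`.

```typescript
// Hypothetical helper (not part of the harness): lists which of the
// documented environment variables are unset or blank.
const REQUIRED_ENV_VARS = [
  "ANTHROPIC_API_KEY",
  "LANGFUSE_PUBLIC_KEY",
  "LANGFUSE_SECRET_KEY",
  "LANGFUSE_HOST",
] as const;

function missingEnvVars(env: Record<string, string | undefined>): string[] {
  // A variable counts as missing when it is undefined or only whitespace.
  return REQUIRED_ENV_VARS.filter((name) => (env[name] ?? "").trim() === "");
}
```

In a Node.js entry point this could be called as `missingEnvVars(process.env)`, failing fast when the returned list is non-empty.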
 
## Golden Datasets
 
Use the `goldenScenario` parameter in eval requests to compare against known-good trajectories. The harness uses `quickCreateGolden()` from `@reaatech/agent-eval-harness-golden` to bootstrap datasets from existing trajectories and `compareAgainstGolden()` to detect regressions.
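
To illustrate the idea of regression detection against a golden baseline, here is a small self-contained sketch. It is not the implementation or signature of `compareAgainstGolden()`: the `Scores` type, the `findRegressions` name, and the `tolerance` parameter are all assumptions made for this example, and it assumes higher scores are better (cost or latency would need the comparison inverted).

```typescript
// Illustrative only: per-metric scores keyed by metric name, for example
// { faithfulness: 0.9, relevance: 0.85 }. Not the package's data model.
type Scores = Record<string, number>;

// Returns the metrics where the candidate run dropped more than `tolerance`
// below the golden baseline, or where the candidate has no score at all.
function findRegressions(
  golden: Scores,
  candidate: Scores,
  tolerance: number = 0.05,
): string[] {
  const regressed: string[] = [];
  for (const [metric, baseline] of Object.entries(golden)) {
    const current = candidate[metric];
    if (current === undefined || baseline - current > tolerance) {
      regressed.push(metric);
    }
  }
  return regressed;
}
```

A tolerance keeps small judge-score noise from being flagged as a regression; the value 0.05 is arbitrary and would need tuning per metric.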
 
## Gate Threshold Configuration
 
Three presets are available:
 
| Preset | Quality | Faithfulness | Relevance | Tool Correctness | Cost | Latency P99 | Pass Rate |
|--------|---------|-------------|-----------|-----------------|------|-------------|-----------|
| Standard | >= 0.80 | >= 0.80 | >= 0.80 | >= 0.90 | <= $0.05 | <= 5000ms | >= 95% |
| Strict | >= 0.90 | >= 0.90 | >= 0.90 | >= 0.95 | <= $0.02 | <= 2000ms | >= 99% |
| Lenient | >= 0.60 | >= 0.60 | >= 0.60 | >= 0.70 | <= $0.10 | <= 10000ms | — |
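
The presets in the table can be read as plain data plus one comparison per column. The sketch below encodes them that way; it is an assumption-laden illustration, not the contents of `src/lib/gates.ts`. The type names, field names, and `checkGates` function are hypothetical, pass rate is expressed as a fraction rather than a percentage, and the cost limit is treated as a single dollar figure because the table does not say what it is measured per.

```typescript
// Hypothetical shapes for illustration; the real gates engine may differ.
interface GateThresholds {
  minQuality: number;
  minFaithfulness: number;
  minRelevance: number;
  minToolCorrectness: number;
  maxCostUsd: number;
  maxLatencyP99Ms: number;
  minPassRate: number | null; // null means no pass-rate gate (Lenient)
}

interface RunMetrics {
  quality: number;
  faithfulness: number;
  relevance: number;
  toolCorrectness: number;
  costUsd: number;
  latencyP99Ms: number;
  passRate: number; // fraction in [0, 1]
}

// Values copied from the preset table above.
const PRESETS: Record<"standard" | "strict" | "lenient", GateThresholds> = {
  standard: { minQuality: 0.8, minFaithfulness: 0.8, minRelevance: 0.8, minToolCorrectness: 0.9, maxCostUsd: 0.05, maxLatencyP99Ms: 5000, minPassRate: 0.95 },
  strict: { minQuality: 0.9, minFaithfulness: 0.9, minRelevance: 0.9, minToolCorrectness: 0.95, maxCostUsd: 0.02, maxLatencyP99Ms: 2000, minPassRate: 0.99 },
  lenient: { minQuality: 0.6, minFaithfulness: 0.6, minRelevance: 0.6, minToolCorrectness: 0.7, maxCostUsd: 0.1, maxLatencyP99Ms: 10000, minPassRate: null },
};

// Returns the names of the gates that failed; an empty array means the run passes.
function checkGates(m: RunMetrics, t: GateThresholds): string[] {
  const failures: string[] = [];
  if (m.quality < t.minQuality) failures.push("quality");
  if (m.faithfulness < t.minFaithfulness) failures.push("faithfulness");
  if (m.relevance < t.minRelevance) failures.push("relevance");
  if (m.toolCorrectness < t.minToolCorrectness) failures.push("toolCorrectness");
  if (m.costUsd > t.maxCostUsd) failures.push("cost");
  if (m.latencyP99Ms > t.maxLatencyP99Ms) failures.push("latencyP99");
  if (t.minPassRate !== null && m.passRate < t.minPassRate) failures.push("passRate");
  return failures;
}
```

Returning the list of failed gates, rather than a single boolean, makes it easier to say which threshold blocked a deployment when an incident is opened.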
 
## Example YAML Suite Config
 
```yaml
metrics:
  - faithfulness
  - relevance
  - cost
  - latency
judge_model: claude-sonnet-4-6
budget_limit: 10.00
parallel_workers: 4
```
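
In the request shown under Running Evaluations, `suiteConfig` is a YAML document passed as a JSON string, which means newlines have to be escaped. Letting `JSON.stringify` do the escaping avoids hand-writing `\n` sequences. This is a minimal sketch under two assumptions: that the full suite config above is accepted in the same field as the shorter one in the curl example, and that the `Turn` and `Trajectory` shapes match that example. `buildEvalRequestBody` is a hypothetical helper, not part of the harness.

```typescript
// Shapes inferred from the curl example; the real API may accept more fields.
interface Turn {
  turn_id: number;
  role: string;
  content: string;
  timestamp: string;
}

interface Trajectory {
  trajectory_id: string;
  turns: Turn[];
}

// Hypothetical helper: serializes the YAML suite config and trajectories
// into the JSON body expected by POST /api/eval/run.
function buildEvalRequestBody(suiteYaml: string, trajectories: Trajectory[]): string {
  return JSON.stringify({ suiteConfig: suiteYaml, trajectories });
}

const suiteYaml = [
  "metrics:",
  "  - faithfulness",
  "  - relevance",
  "  - cost",
  "  - latency",
  "judge_model: claude-sonnet-4-6",
  "budget_limit: 10.00",
  "parallel_workers: 4",
].join("\n");

const requestBody = buildEvalRequestBody(suiteYaml, [
  {
    trajectory_id: "t1",
    turns: [{ turn_id: 1, role: "user", content: "Hello", timestamp: "2024-01-01T00:00:00Z" }],
  },
]);

// To send it (requires the dev server to be running):
// await fetch("http://localhost:3000/api/eval/run", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: requestBody,
// });
```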
 
## License
 
MIT