# OpenAI Agent Eval Harness for SMB Customer Support Quality
 
Automatically evaluate every production AI support interaction to catch bad answers, hallucinations, and policy violations before they affect customers. This harness combines four REAA packages (suite runner, LLM-as-judge, CI regression gates, and cost telemetry) with Langfuse observability to provide a production-grade evaluation pipeline for AI-powered customer support.
 
## Quick Start
 
```bash
pnpm install
pnpm typecheck
pnpm lint
pnpm test
```
 
## CLI Usage
 
```bash
node src/cli/index.js eval --trajectories ./trajectories --config eval-config.yaml --output ./results
```
 
## API Usage
 
```bash
# Run an evaluation
curl -X POST http://localhost:3000/api/eval \
  -H 'Content-Type: application/json' \
  -d '{"trajectoriesPath": "./trajectories", "preset": "standard"}'
 
# List result runs
curl http://localhost:3000/api/results
 
# Check gate status
curl -X POST http://localhost:3000/api/gate \
  -H 'Content-Type: application/json' \
  -d '{"resultsPath": "./eval-results/run-123/results.json", "preset": "standard"}'
```
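
For CI, the gate call can be wrapped so that a failed request stops the pipeline. The following is a minimal sketch, not part of the harness: `check_gate` is a hypothetical helper, and it only detects connection errors and HTTP error statuses through curl's `--fail` flag. How the endpoint reports a failed gate (status code or response body) is not documented here, so inspect the response before relying on this in a pipeline.

```bash
# Hypothetical helper: POST a results file to the gate endpoint and return
# non-zero on connection errors or HTTP status >= 400.
check_gate() {
  local base_url="$1"           # e.g. http://localhost:3000
  local results_path="$2"       # e.g. ./eval-results/run-123/results.json
  local preset="${3:-standard}" # standard, strict, or lenient

  # The JSON body is built by plain string interpolation, so paths that
  # contain double quotes or backslashes are not supported in this sketch.
  curl --fail --silent --show-error --max-time 30 \
    -X POST "${base_url}/api/gate" \
    -H 'Content-Type: application/json' \
    -d "{\"resultsPath\": \"${results_path}\", \"preset\": \"${preset}\"}"
}
```

A CI step could then run `check_gate http://localhost:3000 ./eval-results/run-123/results.json strict || exit 1`.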
 
## Environment Variables
 
| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | Yes | OpenAI API key for LLM judge |
| `LANGFUSE_PUBLIC_KEY` | No | Langfuse project public key |
| `LANGFUSE_SECRET_KEY` | No | Langfuse project secret key |
| `LANGFUSE_BASE_URL` | No | Langfuse base URL (default: `https://cloud.langfuse.com`) |
| `EVAL_BUDGET_LIMIT` | No | Maximum evaluation spend in USD (default: `10.00`) |
| `EVAL_CONCURRENCY` | No | Parallel evaluation concurrency (default: `4`) |
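
As an illustration, the variables above could be set in a shell session before starting the harness. This is a sketch under assumptions: the key value is a placeholder, and the optional variables are simply set to the defaults listed in the table, so exporting them is only needed to override those defaults.

```bash
# Required: the LLM judge needs an OpenAI API key. The fallback value is a
# placeholder for illustration only; replace it with a real key.
export OPENAI_API_KEY="${OPENAI_API_KEY:-replace-with-your-key}"

# Optional: Langfuse base URL. LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY
# are also optional and are left unset in this sketch.
export LANGFUSE_BASE_URL="${LANGFUSE_BASE_URL:-https://cloud.langfuse.com}"

# Optional: spend cap in USD and parallelism. Each keeps a value already
# present in the environment and otherwise falls back to the documented default.
export EVAL_BUDGET_LIMIT="${EVAL_BUDGET_LIMIT:-10.00}"
export EVAL_CONCURRENCY="${EVAL_CONCURRENCY:-4}"
```

The `${VAR:-default}` expansion leaves any value that is already exported untouched, which makes the snippet safe to source repeatedly.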
 
## Architecture
 
The harness is built on four REAA packages:
 
- **@reaatech/agent-eval-harness-suite** — YAML-driven batch evaluation runner with configurable concurrency, multi-metric scoring, and results aggregation (JSON, JUnit, CSV, Markdown)
- **@reaatech/agent-eval-harness-judge** — Provider-agnostic LLM-as-judge engine that supports Claude, GPT-4, and Gemini, with calibration, consensus, and cost tracking
- **@reaatech/agent-eval-harness-gate** — CI/CD regression gates with threshold presets (standard, strict, lenient), baseline comparison, and JUnit / GitHub Actions output
- **@reaatech/llm-cost-telemetry** — Cost span creation, budget enforcement, and cost aggregation across providers and features
 
Results are exported to Langfuse for trace-level observability and dashboards.