# Azure AI Agent Eval Harness for SMB Support QA
 
> Automated quality gates for Azure AI-powered support agents, catching regressions in tool use, answer quality, and cost before they reach customers.
 
A tutorial-style reference solution from [reaatech.com](https://reaatech.com) that demonstrates how to build production-grade AI evaluation infrastructure with the `@reaatech/*` package family.
 
---
 
## Problem
 
SMBs rely on AI chatbots to handle customer support, but every prompt tweak, model upgrade, or tool change risks silent regressions. Without automated evaluation, teams ship blind: degraded answers, broken tool calls, and ballooning costs reach customers first. This harness addresses that by bringing CI/CD discipline to AI agent quality: every change is evaluated, and a gate blocks the pipeline when quality, cost, or latency thresholds are breached.
 
---
 
## Architecture
 
```
┌─────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  Azure      │────▶│  Express Eval    │────▶│  Next.js         │
│  OpenAI     │     │  Server (:4567)  │     │  Dashboard       │
└─────────────┘     │                  │     │  (/dashboard)    │
                    │  /trajectories   │     └──────────────────┘
                    │  /eval/*         │
                    │  /cost/report    │     ┌──────────────────┐
                    │  /health         │     │  CI Gate         │
                    └──────┬───────────┘     │  (ci-check.ts)   │
                           │                 │  GitHub Actions  │
                           │                 └──────────────────┘
                    ┌──────▼───────────┐
                    │  Observability   │
                    │  Langfuse +      │
                    │  OpenTelemetry   │
                    └──────────────────┘
```
 
- **Eval Server** — Express app that stores AI agent trajectories, runs LLM-as-judge evaluations, and tracks cost. Wires together `@reaatech/agent-eval-harness-*` packages.
- **CI Gate** — `src/gates/ci-check.ts` runs inside GitHub Actions to gate deployments when quality, cost, or latency thresholds are breached. Produces JUnit XML for pipeline reporting.
- **Dashboard** — Next.js App Router pages that display eval results, trajectory history, and cost reports.
- **Tracing** — Langfuse + OpenTelemetry via `@traceloop/node-server-sdk` for observability.
 
---
 
## Quick Start
 
```bash
# Install dependencies
pnpm install
 
# Configure environment
cp .env.example .env
# Edit .env with your Azure OpenAI endpoint, key, deployment name,
# Langfuse credentials, and eval server port.
 
# Run the test suite (vitest with coverage)
pnpm test
 
# Start the Next.js dashboard
pnpm dev
 
# Start the Express eval server (separate terminal)
node --loader ts-node/esm src/server.ts
```
 
### Environment Variables
 
| Variable | Description | Default |
|---|---|---|
| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI base URL | — |
| `AZURE_OPENAI_API_KEY` | Azure OpenAI API key | — |
| `AZURE_OPENAI_DEPLOYMENT_NAME` | Deployment/model name | — |
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key | — |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key | — |
| `LANGFUSE_BASE_URL` | Langfuse instance URL | `https://cloud.langfuse.com` |
| `EVAL_PORT` | Express server port | `4567` |
| `EVAL_BUDGET_LIMIT` | Daily budget in USD | `10.00` |
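
The table above can be turned into a small startup check. The sketch below is one possible way to do that, not the repository's actual loader: `loadEvalConfig`, `requireVar`, `numberVar`, and the `EvalConfig` shape are hypothetical names, and it assumes that every variable without a documented default is required.

```typescript
// Sketch only: the repository's real config loading is not shown in this README.
interface EvalConfig {
  azureEndpoint: string;
  azureApiKey: string;
  azureDeployment: string;
  langfusePublicKey: string;
  langfuseSecretKey: string;
  langfuseBaseUrl: string;
  port: number;
  budgetLimitUsd: number;
}

type Env = Record<string, string | undefined>;

// Returns the variable's value, or throws if it is unset or blank.
// Assumption: variables with no default in the table are required.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Parses a numeric variable, falling back to the documented default when unset.
function numberVar(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed)) {
    throw new Error(`${name} must be a number, got "${raw}"`);
  }
  return parsed;
}

function loadEvalConfig(env: Env): EvalConfig {
  return {
    azureEndpoint: requireVar(env, "AZURE_OPENAI_ENDPOINT"),
    azureApiKey: requireVar(env, "AZURE_OPENAI_API_KEY"),
    azureDeployment: requireVar(env, "AZURE_OPENAI_DEPLOYMENT_NAME"),
    langfusePublicKey: requireVar(env, "LANGFUSE_PUBLIC_KEY"),
    langfuseSecretKey: requireVar(env, "LANGFUSE_SECRET_KEY"),
    // Defaults below are the ones listed in the table.
    langfuseBaseUrl: env["LANGFUSE_BASE_URL"] || "https://cloud.langfuse.com",
    port: numberVar(env, "EVAL_PORT", 4567),
    budgetLimitUsd: numberVar(env, "EVAL_BUDGET_LIMIT", 10),
  };
}
```

In a Node process this would typically be called once at startup as `loadEvalConfig(process.env)`, so a missing key fails fast instead of surfacing later as a failed Azure OpenAI call.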
 
---
 
## API Reference
 
All endpoints are served by the Express eval server on `EVAL_PORT` (default `4567`).
 
### Trajectories
 
| Method | Path | Request Body | Response |
|---|---|---|---|
| `POST` | `/trajectories` | `{ id?: string, timestamp: string, turns: Turn[], metadata?: Record<string, unknown> }` | `200 { ok: true, id: string }` |
| `GET` | `/trajectories` | — | `200 Trajectory[]` |
| `GET` | `/trajectories/:id` | — | `200 Trajectory` or `404 { error: "not found" }` |
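
As an illustration of the `POST /trajectories` contract, the sketch below builds the request for a two-turn conversation. `buildTrajectoryRequest` is a hypothetical helper, the `Turn` fields are placeholders (the real type lives in `src/lib/types.ts` and is not reproduced here), and the ISO 8601 timestamp format is an assumption, since the table only says `string`.

```typescript
// Placeholder Turn shape: the real fields are defined in src/lib/types.ts
// and are not shown in this README, so `role` and `content` are assumptions.
interface Turn {
  role: string;
  content: string;
}

// Mirrors the POST /trajectories request body documented in the table.
interface TrajectoryPayload {
  id?: string;
  timestamp: string;
  turns: Turn[];
  metadata?: Record<string, unknown>;
}

interface JsonRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: returns the URL and fetch options for storing a trajectory.
function buildTrajectoryRequest(baseUrl: string, payload: TrajectoryPayload): JsonRequest {
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: `${baseUrl.replace(/\/+$/, "")}/trajectories`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    },
  };
}

const request = buildTrajectoryRequest("http://localhost:4567/", {
  timestamp: "2025-01-01T00:00:00.000Z", // assumed ISO 8601
  turns: [
    { role: "user", content: "Where is my order?" },
    { role: "assistant", content: "Let me check that for you." },
  ],
  metadata: { channel: "chat" },
});
```

Passing `request.url` and `request.init` to `fetch` against a running eval server should, per the table, return `{ ok: true, id }`; that `id` can then be used with `GET /trajectories/:id`.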
 
### Evaluation
 
| Method | Path | Request Body | Response |
|---|---|---|---|
| `POST` | `/eval/run` | — | `200 { runId, status, overallMetrics }` |
| `GET` | `/eval/results/:runId` | — | `200 AggregatedResults` or `404` |
| `POST` | `/eval/compare` | `{ baselineRunId: string, candidateRunId: string }` | `200 RunComparisonResult` |
 
### Cost
 
| Method | Path | Request Body | Response |
|---|---|---|---|
| `GET` | `/cost/report` | — | `200 CostReport` |
 
### Health
 
| Method | Path | Response |
|---|---|---|
| `GET` | `/health` | `200 { status: "ok", trajectoryCount: number }` |
 
---
 
## Evaluation Metrics
 
All five dimensions are scored per trajectory and aggregated across a run:
 
| Metric | Description | Scale / Unit |
|---|---|---|
| **Faithfulness** | Does the response stay grounded in the provided context? | 0–1 |
| **Relevance** | Does the response address the user's intent? | 0–1 |
| **Tool Correctness** | Did the agent call the right tool with the right arguments? | 0–1 |
| **Cost** | Token usage costs per trajectory, checked against budget | USD |
| **Latency** | End-to-end response time per trajectory | ms |
 
Scores feed into configurable gates (e.g. `overallQuality >= 0.85`, `perTrajectoryCost <= $0.05`) that block CI pipelines on regression.
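
To make the per-trajectory to per-run step concrete, here is a minimal aggregation sketch. It is not the `@reaatech/agent-eval-harness-suite` implementation: `aggregateRun` and both interfaces are hypothetical, and the aggregation rules are assumptions (quality scores are averaged, `overallQuality` is the mean of the three quality scores, and cost and latency take the worst trajectory so a per-trajectory limit can be checked).

```typescript
// Hypothetical shapes; the real AggregatedResults type is not shown in this README.
interface TrajectoryScores {
  faithfulness: number; // 0 to 1
  relevance: number; // 0 to 1
  toolCorrectness: number; // 0 to 1
  costUsd: number;
  latencyMs: number;
}

interface RunMetrics {
  faithfulness: number;
  relevance: number;
  toolCorrectness: number;
  overallQuality: number;
  perTrajectoryCost: number;
  latencyMs: number;
}

const mean = (values: number[]): number =>
  values.reduce((sum, v) => sum + v, 0) / values.length;

function aggregateRun(scores: TrajectoryScores[]): RunMetrics {
  if (scores.length === 0) throw new Error("Cannot aggregate an empty run");
  const faithfulness = mean(scores.map((s) => s.faithfulness));
  const relevance = mean(scores.map((s) => s.relevance));
  const toolCorrectness = mean(scores.map((s) => s.toolCorrectness));
  return {
    faithfulness,
    relevance,
    toolCorrectness,
    // Assumption: overall quality is the unweighted mean of the three quality scores.
    overallQuality: mean([faithfulness, relevance, toolCorrectness]),
    // Assumption: report the most expensive and slowest trajectory in the run.
    perTrajectoryCost: Math.max(...scores.map((s) => s.costUsd)),
    latencyMs: Math.max(...scores.map((s) => s.latencyMs)),
  };
}

const runMetrics = aggregateRun([
  { faithfulness: 1, relevance: 0.5, toolCorrectness: 1, costUsd: 0.02, latencyMs: 800 },
  { faithfulness: 0.5, relevance: 0.5, toolCorrectness: 1, costUsd: 0.04, latencyMs: 1200 },
]);
```

With these two trajectories, `runMetrics.overallQuality` comes out at 0.75, which would fail a gate such as `overallQuality >= 0.85`.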
 
---
 
## CI/CD Integration
 
Run `ci-check.ts` in your GitHub Actions pipeline after every eval run:
 
```yaml
- name: Run eval gate
  run: node --loader ts-node/esm src/gates/ci-check.ts
```
 
The gate engine (`@reaatech/agent-eval-harness-gate`) evaluates aggregated results against preset or custom thresholds:
 
- **Standard preset** — enforces `overallQuality >= 0.80`, `faithfulness >= 0.85`, `relevance >= 0.80`, `perTrajectoryCost <= $0.05`.
- **Threshold overrides** — pass custom thresholds per metric.
- **Exit code** — `0` for pass, `1` for fail (CI-friendly).
- **JUnit XML** — `generateJUnitReport()` produces standard XML for pipeline reporting (Jenkins, GitHub Actions, etc.).
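
The gate behaviour listed above can be sketched in a few lines. This is an illustrative re-implementation, not the `@reaatech/agent-eval-harness-gate` API: `evaluateGate`, `toJUnitXml`, and `STANDARD_PRESET` are hypothetical names, and the XML layout is a minimal JUnit-style report rather than the output of `generateJUnitReport()`. The threshold values are copied from the standard preset bullet.

```typescript
// Illustration only; this is not the @reaatech/agent-eval-harness-gate API.
type Comparator = ">=" | "<=";

interface Threshold {
  metric: string;
  comparator: Comparator;
  limit: number;
}

interface GateCheck extends Threshold {
  actual: number;
  passed: boolean;
}

// Values copied from the "Standard preset" bullet.
const STANDARD_PRESET: Threshold[] = [
  { metric: "overallQuality", comparator: ">=", limit: 0.8 },
  { metric: "faithfulness", comparator: ">=", limit: 0.85 },
  { metric: "relevance", comparator: ">=", limit: 0.8 },
  { metric: "perTrajectoryCost", comparator: "<=", limit: 0.05 },
];

function evaluateGate(
  metrics: Record<string, number>,
  overrides: Record<string, number> = {},
): { checks: GateCheck[]; exitCode: 0 | 1 } {
  const checks = STANDARD_PRESET.map((t): GateCheck => {
    const limit = overrides[t.metric] ?? t.limit;
    // A missing metric becomes NaN, so both comparisons are false and the check fails.
    const actual = metrics[t.metric] ?? Number.NaN;
    const passed = t.comparator === ">=" ? actual >= limit : actual <= limit;
    return { ...t, limit, actual, passed };
  });
  // 0 for pass, 1 for fail, matching the CI-friendly exit codes.
  return { checks, exitCode: checks.every((c) => c.passed) ? 0 : 1 };
}

const escapeXml = (s: string): string =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;").replace(/"/g, "&quot;");

// Renders one <testcase> per threshold, with a <failure> child when it was breached.
function toJUnitXml(checks: GateCheck[]): string {
  const failures = checks.filter((c) => !c.passed).length;
  const cases = checks.map((c) => {
    const name = escapeXml(`${c.metric} ${c.comparator} ${c.limit}`);
    return c.passed
      ? `  <testcase name="${name}"/>`
      : `  <testcase name="${name}"><failure message="actual ${c.actual}"/></testcase>`;
  });
  return [
    `<testsuite name="eval-gate" tests="${checks.length}" failures="${failures}">`,
    ...cases,
    `</testsuite>`,
  ].join("\n");
}

// faithfulness 0.80 is below the 0.85 preset, so this gate fails with exitCode 1.
const gateResult = evaluateGate({
  overallQuality: 0.9,
  faithfulness: 0.8,
  relevance: 0.9,
  perTrajectoryCost: 0.03,
});
```

Passing `{ faithfulness: 0.75 }` as the second argument overrides that one threshold and lets the same metrics pass. In a CI script the result would typically end with `process.exit(gateResult.exitCode)` after writing the XML to a file the pipeline picks up.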
 
---
 
## Project Structure
 
```
├── app/                          # Next.js App Router
│   ├── dashboard/page.tsx        # Eval results dashboard
│   ├── layout.tsx                # Root layout
│   └── page.tsx                  # Landing page
├── src/
│   ├── index.ts                  # Library entry point (re-exports)
│   ├── instrumentation.ts        # OpenTelemetry + Langfuse tracing
│   ├── server.ts                 # Express API server (createApp factory)
│   ├── lib/
│   │   ├── azure-client.ts       # Azure OpenAI client factory (singleton)
│   │   ├── cost-service.ts       # Cost tracking (wraps @reaatech/agent-eval-harness-cost)
│   │   ├── eval-service.ts       # Evaluation orchestration (wraps @reaatech/agent-eval-harness-suite)
│   │   ├── judge-service.ts      # LLM-as-judge (wraps @reaatech/agent-eval-harness-judge)
│   │   ├── trajectory-store.ts   # In-memory trajectory storage
│   │   └── types.ts              # Domain types (Trajectory, Turn, ToolCall)
│   └── gates/
│       └── ci-check.ts           # CI/CD gate checker (wraps @reaatech/agent-eval-harness-gate)
├── tests/                        # Vitest test suite (mirrors src/)
│   ├── setup.ts                  # MSW mocks + test fixtures
│   ├── lib/
│   │   ├── azure-client.test.ts
│   │   ├── cost-service.test.ts
│   │   ├── eval-service.test.ts
│   │   ├── judge-service.test.ts
│   │   └── trajectory-store.test.ts
│   ├── gates/
│   │   └── ci-check.test.ts
│   └── server.test.ts            # Express integration tests
├── packages/                     # API references for every dependency
├── .env.example                  # Environment variable template
├── DEV_PLAN.md                   # Build plan
└── vitest.config.ts              # Test configuration
```
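
For orientation, an in-memory store like the one in `src/lib/trajectory-store.ts` can be as small as a `Map` keyed by id. The class below is a sketch under assumptions, not the repository's code: `InMemoryTrajectoryStore`, its method names, and the counter-based id scheme are hypothetical, and `turns` is typed as `unknown[]` because the real `Turn` type is not reproduced in this README.

```typescript
// Hypothetical sketch; the real store in src/lib/trajectory-store.ts may differ.
interface Trajectory {
  id: string;
  timestamp: string;
  turns: unknown[]; // the real Turn type lives in src/lib/types.ts
  metadata?: Record<string, unknown>;
}

class InMemoryTrajectoryStore {
  private readonly items = new Map<string, Trajectory>();
  private nextId = 1;

  // Stores a trajectory, assigning an id when the caller omits one.
  // Re-using an existing id overwrites the earlier entry.
  add(input: Omit<Trajectory, "id"> & { id?: string }): string {
    const id = input.id ?? `traj-${this.nextId++}`; // assumed id scheme
    this.items.set(id, { ...input, id });
    return id;
  }

  // Returns undefined for unknown ids; a route handler can map that to a 404.
  get(id: string): Trajectory | undefined {
    return this.items.get(id);
  }

  list(): Trajectory[] {
    return Array.from(this.items.values());
  }

  // Could back the trajectoryCount field reported by GET /health.
  get count(): number {
    return this.items.size;
  }
}
```

Because the data lives only in process memory, everything is lost on restart; that is usually acceptable for tests and short-lived CI runs, while longer-lived deployments would likely need a persistent backend.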
 
---
 
## Tech Stack
 
| Package | Version | Purpose |
|---|---|---|
| `@azure/openai` | `2.0.0` | Azure OpenAI client SDK |
| `@reaatech/agent-eval-harness-cost` | `0.1.0` | Trajectory cost calculation & budget enforcement |
| `@reaatech/agent-eval-harness-gate` | `0.1.1` | CI gate engine with standard presets & JUnit reporting |
| `@reaatech/agent-eval-harness-judge` | `0.1.0` | LLM-as-judge for faithfulness, relevance, tool correctness |
| `@reaatech/agent-eval-harness-suite` | `0.1.1` | Suite runner, results aggregation, run comparison |
| `@traceloop/node-server-sdk` | `0.27.0` | OpenTelemetry instrumentation |
| `express` | `5.1.0` | Eval API server |
| `langfuse` | `3.38.20` | LLM observability & tracing |
| `next` | `16.2.9` | Dashboard framework |
| `openai` | `4.104.0` | OpenAI client (used internally) |
| `react` / `react-dom` | `19.2.4` | UI library |
| `zod` | `4.4.3` | Schema validation |
| `vitest` | `4.1.8` | Test runner |
| `msw` | `2.14.6` | API mocking in tests |
| `supertest` | `7.1.0` | HTTP integration testing |
 
---
 
## License
 
MIT — see [LICENSE](./LICENSE).