# Vercel AI Gateway Agent Eval Harness for SMB Support Bots
> An automated regression testing pipeline that evaluates SMB support agents against golden datasets, using Vercel AI Gateway as the LLM backbone and exporting observability to Langfuse.
A tutorial-style reference solution from [reaatech.com](https://reaatech.com), demonstrating how to build production-grade AI systems with the `@reaatech/*` package family.
## What this is
This is an automated evaluation pipeline for SMB support bots. It runs a curated set of golden question/answer pairs (`.jsonl` files) against a support agent, sends responses to an LLM judge routed through Vercel AI Gateway, applies quality gates with configurable thresholds, and exports every evaluation trace to Langfuse for dashboard-level observability. A failing gate halts CI with a non-zero exit code, preventing regressions from reaching production.
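As a rough illustration of the golden dataset input, the sketch below parses `.jsonl` text into typed cases. The field names `question` and `expected` and the helper `parseGoldenJsonl` are assumptions made for this example; the actual schema used by the harness is not documented in this README.

```typescript
// Hypothetical record shape: the real golden schema may use different fields.
interface GoldenCase {
  question: string;
  expected: string;
}

// Parse JSONL text (one JSON object per line) into golden cases.
function parseGoldenJsonl(text: string): GoldenCase[] {
  const cases: GoldenCase[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // tolerate blank lines and a trailing newline
    const record = JSON.parse(line) as Partial<GoldenCase>;
    if (typeof record.question !== "string" || typeof record.expected !== "string") {
      throw new Error(`line ${index + 1}: expected string "question" and "expected" fields`);
    }
    cases.push({ question: record.question, expected: record.expected });
  });
  return cases;
}
```

Failing fast on a malformed line keeps a broken dataset from silently shrinking the evaluation set.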
## Prerequisites
The following API keys and credentials are required:
| Variable | Purpose |
|---|---|
| `AI_GATEWAY_API_KEY` | Routes LLM calls through Vercel AI Gateway |
| `LANGFUSE_SECRET_KEY` | Langfuse ingestion authentication |
| `LANGFUSE_PUBLIC_KEY` | Langfuse ingestion authentication |
| `OPENAI_API_KEY` | LLM cache embedder and fallback provider |
## Quick Start
```bash
pnpm install
```
Copy `.env.example` to `.env` and fill in the credentials above, then run:
```bash
node . eval --golden ./golden --output ./results
```
When the run finishes, open `results/report.md` for a human-readable summary.
### Running locally
```bash
pnpm test # vitest run with coverage
pnpm dev # next dev
```
### Project layout
```
app/         Next.js App Router pages + API routes
src/         services, lib, adapters
tests/       vitest suite (mirrors src/)
packages/    API references for every dependency (read these first)
golden/      golden .jsonl evaluation datasets
results/     eval output (report.md + per-run artifacts)
bin/         CLI entry points
DEV_PLAN.md  build plan for this recipe
```
## Architecture
```
        ┌───────────────┐
        │ golden .jsonl │
        └───────┬───────┘
                │
     ┌──────────▼──────────┐
     │     SuiteRunner     │  parallel executor
     │     (per-suite)     │
     └──────────┬──────────┘
                │
     ┌──────────▼──────────┐
     │  Vercel AI Gateway  │  LLM judge
     │  (judge endpoint)   │
     └──────────┬──────────┘
                │
     ┌──────────▼──────────┐
     │       repair        │  strip markdown fences
     │       (strip)       │  from judge output
     └──────────┬──────────┘
                │
     ┌──────────▼──────────┐
     │  ResultsAggregator  │  merge scores
     └──────────┬──────────┘
                │
     ┌──────────▼──────────┐
     │     GateEngine      │  pass/fail thresholds
     └──────────┬──────────┘
                │
        ┌───────┴───────┐
        │               │
 ┌──────▼──────┐ ┌──────▼──────┐
 │  Langfuse   │ │   CI exit   │
 │  (export)   │ │ code (0/1)  │
 └─────────────┘ └─────────────┘
```
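The repair stage in the diagram strips markdown fences from judge output before it is parsed. A minimal sketch of that idea follows; `stripMarkdownFences` is a hypothetical helper written for this example, not the harness's actual implementation, and the JSON score shape in the usage note is likewise assumed.

```typescript
// Build the fence marker at runtime so this sample nests safely inside markdown.
const FENCE = "`".repeat(3);

// Matches: opening fence, optional language tag, body, closing fence.
const FENCED_BLOCK = new RegExp(
  `^${FENCE}[\\w-]*[ \\t]*\\r?\\n([\\s\\S]*?)\\r?\\n?${FENCE}$`
);

// Hypothetical helper: return the fenced body if the whole string is one
// fenced block, otherwise return the trimmed input unchanged.
function stripMarkdownFences(raw: string): string {
  const trimmed = raw.trim();
  const match = FENCED_BLOCK.exec(trimmed);
  return match ? match[1].trim() : trimmed;
}
```

If the judge is prompted to reply with JSON, the cleaned string can then be passed to `JSON.parse`. Output that is not fenced passes through untouched, so the step is safe to apply unconditionally.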
## Configuration
All configuration is via environment variables. Copy `.env.example` to `.env` and adjust as needed.
| Variable | Default | Effect |
|---|---|---|
| `NODE_ENV` | `development` | Runtime environment |
| `AI_GATEWAY_API_KEY` | — | Vercel AI Gateway API key |
| `LANGFUSE_SECRET_KEY` | — | Langfuse secret key |
| `LANGFUSE_PUBLIC_KEY` | — | Langfuse public key |
| `LANGFUSE_HOST` | `https://cloud.langfuse.com` | Langfuse API host |
| `EVAL_JUDGE_MODEL` | `openai/gpt-5.2` | Model ID for the LLM judge |
| `EVAL_BUDGET_LIMIT` | `10.00` | Max USD spend per eval run |
| `EVAL_GATE_PRESET` | `standard` | Quality gate preset name |
| `EVAL_CONCURRENCY` | `4` | Parallel suite runners |
| `EVAL_GOLDEN_PATH` | `./golden` | Directory containing `.jsonl` files |
| `EVAL_OUTPUT_DIR` | `./results` | Output directory for reports |
| `OPENAI_API_KEY` | — | OpenAI API key (cache embedder) |
| `CACHE_ENABLED` | `true` | Enable LLM response caching |
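One way to read these variables with the defaults from the table is sketched below. The `EvalConfig` shape and `loadEvalConfig` function are hypothetical names for this example, and the parsing rules (numeric validation, treating only `false` as disabling the cache) are assumptions rather than documented behavior.

```typescript
// Hypothetical config shape covering the EVAL_* and cache settings.
interface EvalConfig {
  judgeModel: string;
  budgetLimitUsd: number;
  gatePreset: string;
  concurrency: number;
  goldenPath: string;
  outputDir: string;
  cacheEnabled: boolean;
}

// Defaults are copied from the table above; the parsing rules are assumptions.
function loadEvalConfig(env: Record<string, string | undefined>): EvalConfig {
  const readNumber = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw === "") return fallback;
    const value = Number(raw);
    if (!Number.isFinite(value)) {
      throw new Error(`${name} must be a number, got "${raw}"`);
    }
    return value;
  };

  return {
    judgeModel: env["EVAL_JUDGE_MODEL"] ?? "openai/gpt-5.2",
    budgetLimitUsd: readNumber("EVAL_BUDGET_LIMIT", 10),
    gatePreset: env["EVAL_GATE_PRESET"] ?? "standard",
    concurrency: readNumber("EVAL_CONCURRENCY", 4),
    goldenPath: env["EVAL_GOLDEN_PATH"] ?? "./golden",
    outputDir: env["EVAL_OUTPUT_DIR"] ?? "./results",
    // Assumption: any value other than "false" leaves caching on.
    cacheEnabled: (env["CACHE_ENABLED"] ?? "true").toLowerCase() !== "false",
  };
}
```

In a Node process this would typically be called as `loadEvalConfig(process.env)`. Taking the environment as a parameter keeps the function easy to exercise in tests.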
## CI Integration
The eval + gate workflow runs on every push. A non-zero exit code from the gate step blocks the pipeline.
```yaml
name: Eval & Gate
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: npx tsx src/index.ts eval
        env:
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - run: npx tsx src/index.ts gate ./results --preset standard
        env:
          EVAL_GATE_PRESET: standard
```
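To make the link between gate thresholds and the CI exit code concrete, here is a minimal sketch. The metric names, threshold values, and the `evaluateGate` and `exitCodeFor` functions are all hypothetical; the real `standard` preset and the GateEngine API are not described in this README.

```typescript
type Scores = Record<string, number>;
type Thresholds = Record<string, number>;

interface GateResult {
  passed: boolean;
  failures: string[];
}

// Hypothetical thresholds for illustration; not the real "standard" preset.
const exampleThresholds: Thresholds = { accuracy: 0.8, helpfulness: 0.7 };

// A metric fails when it is missing or falls below its minimum score.
function evaluateGate(scores: Scores, thresholds: Thresholds): GateResult {
  const failures: string[] = [];
  for (const [metric, minimum] of Object.entries(thresholds)) {
    const actual = scores[metric];
    if (actual === undefined || actual < minimum) {
      failures.push(`${metric}: ${actual ?? "missing"} < ${minimum}`);
    }
  }
  return { passed: failures.length === 0, failures };
}

// Map the gate result to a process exit code: 0 passes CI, 1 blocks it.
function exitCodeFor(result: GateResult): number {
  return result.passed ? 0 : 1;
}
```

A CLI entry point could then set `process.exitCode = exitCodeFor(result)` after printing the failures, which is what lets the workflow step fail the job.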
## License
MIT — see [LICENSE](./LICENSE).