# Vercel AI Gateway Agent Eval Harness for SMB Support Bots
 
> An automated regression testing pipeline that evaluates SMB support agents against golden datasets, using Vercel AI Gateway as the LLM backbone and exporting observability to Langfuse.
 
A tutorial-style reference solution from [reaatech.com](https://reaatech.com) that demonstrates how to build production-grade AI systems with the `@reaatech/*` package family.
 
## What this is
 
This pipeline evaluates SMB support bots automatically. It runs a curated set of golden question/answer pairs (stored as `.jsonl` files) against a support agent, sends each response to an LLM judge routed through Vercel AI Gateway, applies quality gates with configurable thresholds, and exports every evaluation trace to Langfuse for dashboard-level observability. A failing gate exits with a non-zero code, which halts CI and keeps regressions from reaching production.
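
To make the golden dataset format concrete, here is a minimal sketch of reading a `.jsonl` file, one JSON object per line. The `GoldenCase` fields (`id`, `question`, `expected`) and the `parseGoldenJsonl` helper are assumptions for illustration; the actual schema used by this recipe may differ.

```typescript
// Hypothetical shape of one golden record; the real schema may differ.
interface GoldenCase {
  id: string;
  question: string;
  expected: string;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseGoldenJsonl(text: string): GoldenCase[] {
  const cases: GoldenCase[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    const trimmed = line.trim();
    if (trimmed === "") return; // tolerate blank lines and a trailing newline

    let record: unknown;
    try {
      record = JSON.parse(trimmed);
    } catch {
      throw new Error(`Invalid JSON on line ${index + 1}`);
    }
    if (typeof record !== "object" || record === null) {
      throw new Error(`Expected a JSON object on line ${index + 1}`);
    }

    const r = record as Partial<GoldenCase>;
    if (
      typeof r.id !== "string" ||
      typeof r.question !== "string" ||
      typeof r.expected !== "string"
    ) {
      throw new Error(`Missing required field on line ${index + 1}`);
    }
    cases.push({ id: r.id, question: r.question, expected: r.expected });
  });
  return cases;
}
```

Failing fast with a line number keeps a malformed golden file from silently shrinking the test set.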
 
## Prerequisites
 
The following API keys and credentials are required:
 
| Variable | Purpose |
|---|---|
| `AI_GATEWAY_API_KEY` | Routes LLM calls through Vercel AI Gateway |
| `LANGFUSE_SECRET_KEY` | Langfuse ingestion authentication |
| `LANGFUSE_PUBLIC_KEY` | Langfuse ingestion authentication |
| `OPENAI_API_KEY` | LLM cache embedder and fallback provider |
 
## Quick Start
 
```bash
pnpm install
```
 
Copy `.env.example` to `.env` and fill in the credentials above, then run:
 
```bash
node . eval --golden ./golden --output ./results
```
 
When the run finishes, open `results/report.md` for a human-readable summary.
 
### Running locally
 
```bash
pnpm test            # vitest run with coverage
pnpm dev             # next dev
```
 
### Project layout
 
```
app/                  Next.js App Router pages + API routes
src/                  services, lib, adapters
tests/                vitest suite (mirrors src/)
packages/             API references for every dependency (read these first)
golden/               golden .jsonl evaluation datasets
results/              eval output (report.md + per-run artifacts)
bin/                  CLI entry points
DEV_PLAN.md           build plan for this recipe
```
 
## Architecture
 
```
            ┌──────────────┐
            │ golden .jsonl │
            └──────┬───────┘
                   │
            ┌──────▼──────────────┐
            │    SuiteRunner      │  parallel executor
            │   (per-suite)       │
            └──────┬──────────────┘
                   │
            ┌──────▼──────────────────┐
            │  Vercel AI Gateway       │  LLM judge
            │  (judge endpoint)        │
            └──────┬──────────────────┘
                   │
            ┌──────▼─────────┐
            │   repair        │  strip markdown fences
            │   (strip)       │  from judge output
            └──────┬─────────┘
                   │
            ┌──────▼──────────────┐
            │  ResultsAggregator  │  merge scores
            └──────┬──────────────┘
                   │
            ┌──────▼────────┐
            │   GateEngine   │  pass/fail thresholds
            └──────┬────────┘
                   │
          ┌────────┴────────────┐
          │                     │
   ┌──────▼──────┐      ┌──────▼──────┐
   │   Langfuse   │      │  CI exit     │
   │  (export)    │      │  code (0/1)  │
   └──────────────┘      └─────────────┘
```
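
The repair step in the diagram strips markdown fences from the judge output before it is parsed. A minimal sketch of that idea follows; `stripMarkdownFences`, `parseJudgeOutput`, and the `JudgeVerdict` shape are hypothetical names, not the recipe's actual API.

```typescript
// Three backticks, built programmatically so this snippet can sit inside a markdown fence.
const FENCE = "`".repeat(3);

// Opening fence with an optional language tag, captured body, closing fence.
const FENCED_BLOCK = new RegExp(
  "^" + FENCE + "[^\\n]*\\n([\\s\\S]*?)\\n?" + FENCE + "$"
);

// Return the body of a fenced block, or the trimmed input if it is not fenced.
function stripMarkdownFences(raw: string): string {
  const trimmed = raw.trim();
  const match = FENCED_BLOCK.exec(trimmed);
  return match ? match[1].trim() : trimmed;
}

// Hypothetical verdict shape; the real judge schema may differ.
interface JudgeVerdict {
  score: number;
  reasoning: string;
}

// Strip fences, parse JSON, and validate the fields this sketch assumes.
function parseJudgeOutput(raw: string): JudgeVerdict {
  const parsed: unknown = JSON.parse(stripMarkdownFences(raw));
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Judge output is not a JSON object");
  }
  const v = parsed as Partial<JudgeVerdict>;
  if (typeof v.score !== "number" || !Number.isFinite(v.score)) {
    throw new Error("Judge output has no numeric score");
  }
  return { score: v.score, reasoning: typeof v.reasoning === "string" ? v.reasoning : "" };
}
```

Models often wrap JSON in a fenced block even when asked not to, so normalizing the output before `JSON.parse` avoids spurious evaluation failures.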
 
## Configuration
 
All configuration is via environment variables. Copy `.env.example` to `.env` and adjust as needed.
 
| Variable | Default | Effect |
|---|---|---|
| `NODE_ENV` | `development` | Runtime environment |
| `AI_GATEWAY_API_KEY` | — | Vercel AI Gateway API key |
| `LANGFUSE_SECRET_KEY` | — | Langfuse secret key |
| `LANGFUSE_PUBLIC_KEY` | — | Langfuse public key |
| `LANGFUSE_HOST` | `https://cloud.langfuse.com` | Langfuse API host |
| `EVAL_JUDGE_MODEL` | `openai/gpt-5.2` | Model ID for the LLM judge |
| `EVAL_BUDGET_LIMIT` | `10.00` | Max USD spend per eval run |
| `EVAL_GATE_PRESET` | `standard` | Quality gate preset name |
| `EVAL_CONCURRENCY` | `4` | Parallel suite runners |
| `EVAL_GOLDEN_PATH` | `./golden` | Directory containing `.jsonl` files |
| `EVAL_OUTPUT_DIR` | `./results` | Output directory for reports |
| `OPENAI_API_KEY` | — | OpenAI API key (cache embedder and fallback provider) |
| `CACHE_ENABLED` | `true` | Enable LLM response caching |
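
As an illustration of how the `EVAL_*` variables and their defaults could be read, here is a minimal sketch. The `EvalConfig` shape, the `loadEvalConfig` helper, and the validation rules are assumptions; only the variable names and default values come from the table above.

```typescript
// Hypothetical typed view of the eval-related settings.
interface EvalConfig {
  judgeModel: string;
  budgetLimitUsd: number;
  gatePreset: string;
  concurrency: number;
  goldenPath: string;
  outputDir: string;
  cacheEnabled: boolean;
}

type Env = Record<string, string | undefined>;

// Read settings from an env-like object, falling back to the documented defaults.
function loadEvalConfig(env: Env): EvalConfig {
  const budgetLimitUsd = Number(env.EVAL_BUDGET_LIMIT ?? "10.00");
  const concurrency = Number(env.EVAL_CONCURRENCY ?? "4");

  // Assumed validation: reject values that cannot be a budget or a worker count.
  if (!Number.isFinite(budgetLimitUsd) || budgetLimitUsd <= 0) {
    throw new Error("EVAL_BUDGET_LIMIT must be a positive number");
  }
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new Error("EVAL_CONCURRENCY must be a positive integer");
  }

  return {
    judgeModel: env.EVAL_JUDGE_MODEL ?? "openai/gpt-5.2",
    budgetLimitUsd,
    gatePreset: env.EVAL_GATE_PRESET ?? "standard",
    concurrency,
    goldenPath: env.EVAL_GOLDEN_PATH ?? "./golden",
    outputDir: env.EVAL_OUTPUT_DIR ?? "./results",
    // Assumption: only the literal string "true" (any case) enables the cache.
    cacheEnabled: (env.CACHE_ENABLED ?? "true").toLowerCase() === "true",
  };
}
```

Taking the environment as a parameter keeps the loader easy to unit test; a CLI entry point would typically call `loadEvalConfig(process.env)`.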
 
## CI Integration
 
The eval + gate workflow runs on every push. A non-zero exit code from the gate step blocks the pipeline.
 
```yaml
name: Eval & Gate
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: npx tsx src/index.ts eval
        env:
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - run: npx tsx src/index.ts gate ./results --preset standard
        env:
          EVAL_GATE_PRESET: standard
```
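
To show how a gate decision can map onto the exit code that CI reads, here is a minimal sketch. The `SuiteSummary` and `GateThresholds` shapes, the function names, and the threshold values are illustrative assumptions, not the actual definition of the `standard` preset.

```typescript
// Hypothetical per-suite summary produced by the eval step.
interface SuiteSummary {
  suite: string;
  meanScore: number; // assumed to be in the range 0..1
  passRate: number;  // assumed to be in the range 0..1
}

interface GateThresholds {
  minMeanScore: number;
  minPassRate: number;
}

interface GateResult {
  passed: boolean;
  failures: string[];
}

// Illustrative numbers only; the real preset values are not documented here.
const ILLUSTRATIVE_THRESHOLDS: GateThresholds = { minMeanScore: 0.8, minPassRate: 0.9 };

// Compare every suite against the thresholds and collect human-readable failures.
function evaluateGate(summaries: SuiteSummary[], t: GateThresholds): GateResult {
  const failures: string[] = [];
  if (summaries.length === 0) {
    failures.push("no suites were evaluated"); // an empty run should not pass silently
  }
  for (const s of summaries) {
    if (s.meanScore < t.minMeanScore) {
      failures.push(`${s.suite}: mean score ${s.meanScore} < ${t.minMeanScore}`);
    }
    if (s.passRate < t.minPassRate) {
      failures.push(`${s.suite}: pass rate ${s.passRate} < ${t.minPassRate}`);
    }
  }
  return { passed: failures.length === 0, failures };
}

// 0 lets the pipeline continue; 1 blocks it.
function exitCodeFor(result: GateResult): 0 | 1 {
  return result.passed ? 0 : 1;
}
```

A gate command built this way would print `failures` and then set `process.exitCode = exitCodeFor(result)`, so the CI log explains why the step failed.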
 
## License
 
MIT. See [LICENSE](./LICENSE).