Vercel AI Gateway Agent Eval Harness for SMB Support Bots
An automated regression testing pipeline that evaluates SMB support agents against golden datasets, using Vercel AI Gateway as the LLM backbone and exporting observability to Langfuse.
Small businesses deploying AI support bots lack a systematic way to catch regressions before they reach customers. Ad‑hoc manual testing and single‑metric checks miss subtle degradations in answer quality, tool‑use accuracy, and cost creep.
This tutorial walks you through building an automated regression testing pipeline for SMB support bots. You’ll create a CI-friendly evaluation harness that replays golden conversations, scores responses with an LLM judge routed through Vercel AI Gateway, enforces quality gates, and exports every trace to Langfuse for dashboard-level observability. By the end, a failing gate halts CI with a non-zero exit code so regressions never reach production.
Prerequisites
Node.js 22+ and pnpm 10+ installed
A Vercel AI Gateway API key (AI_GATEWAY_API_KEY)
A Langfuse account (cloud or self-hosted) with secret and public keys
An OpenAI API key for the LLM cache embedder
Basic familiarity with TypeScript and environment variables
Step 1: Create the project and install dependencies
Start from an empty directory. Create the project, add configuration files, install dependencies, and set up your environment.
terminal
mkdir my-eval-harness && cd my-eval-harness
Create package.json with the required dependencies:
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  /* config options here */
};

export default nextConfig;
Now install everything:
terminal
pnpm install
Copy the environment example file and fill in your credentials:
terminal
cp .env.example .env
Your .env should look like this:
env
NODE_ENV=development

# Vercel AI Gateway: routes to all providers
AI_GATEWAY_API_KEY=<your-vercel-ai-gateway-key>

# Langfuse observability
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_HOST=https://cloud.langfuse.com

# Evaluation configuration
EVAL_JUDGE_MODEL=openai/gpt-5.2
EVAL_BUDGET_LIMIT=10.00
EVAL_GATE_PRESET=standard
EVAL_CONCURRENCY=4
EVAL_GOLDEN_PATH=./golden
EVAL_OUTPUT_DIR=./results

# LLM cache embedder
OPENAI_API_KEY=<your-openai-key>

# Cache toggle
CACHE_ENABLED=true
The EVAL_JUDGE_MODEL is the model identifier your Vercel AI Gateway routes to. The default openai/gpt-5.2 works if you’ve configured that provider in your gateway. Adjust EVAL_BUDGET_LIMIT to cap how much each evaluation run can spend.
Expected output: pnpm install exits cleanly and your .env file is populated with real keys.
Step 2: Define the core types
Create src/lib/types.ts to define the data structures that flow through the pipeline. These types describe golden conversations, evaluation results, and judge outputs.
Expected output: A clean TypeScript module with no type errors. These interfaces document every data shape your pipeline will handle.
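The listing for src/lib/types.ts is not reproduced in this step. The sketch below is inferred from how later steps use these types: EvalRunConfig mirrors the object returned by loadConfig(), JudgeOutput mirrors what the judge adapter returns, and GoldenConversation follows the fields named in the next steps. Anything marked "assumed" is a guess rather than the published definition.

```typescript
// Sketch of src/lib/types.ts, inferred from usage elsewhere in this tutorial.
// Fields marked "assumed" do not appear in the tutorial text.

export interface ToolCall {
  name: string;
  arguments: Record<string, unknown>; // value type assumed
}

export interface ToolResult {
  name: string; // assumed
  output: unknown; // assumed
}

export interface ConversationMessage {
  // "assistant" is assumed; the runner maps any non-user role to "agent".
  role: "user" | "assistant";
  content: string;
  toolCalls?: ToolCall[];
}

export interface GoldenConversation {
  id: string;
  messages: ConversationMessage[];
  expectedQuality: number; // assumed to be a 0.0 to 1.0 score
  expectedTools?: string[]; // assumed to be tool names
  metadata?: Record<string, unknown>;
}

export interface JudgeOutput {
  score: number;
  reasoning: string;
  tokenUsage: { input: number; output: number };
}

export interface EvalRunConfig {
  judgeModel: string;
  budgetLimit: number;
  concurrency: number;
  goldenPath: string;
  outputDir: string;
  gatePreset: "standard" | "strict" | "lenient";
  cacheEnabled: boolean;
}

export interface EvalResult {
  // The CLI only checks for "error"; the other value is assumed.
  status: "success" | "error";
}

// A hypothetical golden conversation that satisfies the shapes above.
export const sampleConversation: GoldenConversation = {
  id: "refund-001",
  messages: [
    { role: "user", content: "Where is my refund?" },
    {
      role: "assistant",
      content: "Let me look that up for you.",
      toolCalls: [{ name: "lookupOrder", arguments: { orderId: "A-42" } }],
    },
  ],
  expectedQuality: 0.8,
  expectedTools: ["lookupOrder"],
};
```

Keeping the shapes in one module means the config loader, the judge adapter, and the runner all agree on the same contracts, and a mismatch surfaces as a compile error rather than a runtime surprise.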
Step 3: Load configuration from environment variables
Create src/lib/config.ts to centralize all environment variable reads. This module loads your .env values into a typed EvalRunConfig object with sensible defaults.
ts
import type { EvalRunConfig } from "./types.js";

// Read a string variable, falling back when it is unset.
function env(name: string, fallback: string): string {
  return process.env[name] ?? fallback;
}

// Read a float variable, falling back when it is unset or not a number.
function envFloat(name: string, fallback: number): number {
  const raw = process.env[name];
  if (raw === undefined) return fallback;
  const parsed = Number.parseFloat(raw);
  return Number.isNaN(parsed) ? fallback : parsed;
}

// Read an integer variable, falling back when it is unset or not a number.
function envInt(name: string, fallback: number): number {
  const raw = process.env[name];
  if (raw === undefined) return fallback;
  const parsed = Number.parseInt(raw, 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

export function loadConfig(): EvalRunConfig {
  // The judge model has no default: fail fast when it is missing.
  const judgeModel = process.env.EVAL_JUDGE_MODEL;
  if (!judgeModel) {
    throw new Error("EVAL_JUDGE_MODEL must be set");
  }
  return {
    judgeModel,
    budgetLimit: envFloat("EVAL_BUDGET_LIMIT", 10.0),
    concurrency: envInt("EVAL_CONCURRENCY", 4),
    goldenPath: env("EVAL_GOLDEN_PATH", "./golden"),
    outputDir: env("EVAL_OUTPUT_DIR", "./results"),
    gatePreset: parseGatePreset(env("EVAL_GATE_PRESET", "standard")),
    cacheEnabled: env("CACHE_ENABLED", "true") === "true",
  };
}

// Any unrecognized preset name falls back to "standard".
function parseGatePreset(value: string): "standard" | "strict" | "lenient" {
  if (value === "strict" || value === "lenient") {
    return value;
  }
  return "standard";
}
Expected output: loadConfig() returns a fully typed config object. If EVAL_JUDGE_MODEL is missing, it throws immediately with a clear error so your pipeline fails fast rather than using an empty string.
Step 4: Create the Vercel AI Gateway judge adapter
Create src/eval/vercel-adapter.ts. This is the core LLM-as-judge module. It uses the Vercel AI SDK (ai package) to send evaluation prompts through Vercel AI Gateway and record cost telemetry.
ts
import { generateId, now, calculateCostFromTokens, type CostSpan } from "@reaatech/llm-cost-telemetry";
import { generateText } from "ai";
import type { JudgeOutput } from "../lib/types.js";

// No-op sink for cost spans.
function recordCostSpan(_span: CostSpan): void {
  void _span;
}

export type JudgeAdapter = (prompt: string, context?: string) => Promise<JudgeOutput>;

export function createVercelJudgeAdapter(config: { model: string; budgetLimit: number }): JudgeAdapter {
  return async (prompt: string, context?: string): Promise<JudgeOutput> => {
    try {
      // Prepend optional context so the judge sees it before the prompt.
      const fullPrompt = context ? `Context: ${context}\n\n${prompt}` : prompt;
      const result = await generateText({
        model: config.model,
        prompt: fullPrompt,
        system: "You are an AI response quality judge. Return a score from 0.0 to 1.0 and a brief reasoning.",
      });

      const input = result.usage.inputTokens ?? 0;
      const output = result.usage.outputTokens ?? 0;
      const trimmedScore = result.text.trim();
      const totalTokens = input + output;

      recordCostSpan({
        id: generateId(),
        provider: "openai",
        model: config.model,
        inputTokens: input,
        outputTokens: output,
        costUsd: calculateCostFromTokens(totalTokens, 0),
        tenant: "eval",
        feature: "llm-judge",
        timestamp: now(),
      });

      return {
        // parseFloat yields NaN when the reply does not begin with a number.
        score: parseFloat(trimmedScore),
        reasoning: result.text,
        tokenUsage: { input, output },
      };
    } catch (err) {
      // Wrap any failure with model and prompt metadata for debugging.
      const message = err instanceof Error ? err.message : String(err);
      throw new JudgeError(message, config.model, prompt.length);
    }
  };
}

export class JudgeError extends Error {
  constructor(
    message: string,
    public readonly model: string,
    public readonly promptLength: number,
  ) {
    super(message);
    this.name = "JudgeError";
  }
}
Expected output: createVercelJudgeAdapter returns a function that takes any evaluation prompt, sends it through Vercel AI Gateway via generateText, and returns a score with token usage. If the LLM call fails, it throws a typed JudgeError with model and prompt metadata for debugging.
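The adapter accepts a budgetLimit, but the listing above does not show where the cap is enforced. One possible approach, sketched here under assumptions, is a small tracker that accumulates per-call cost and throws once the run exceeds EVAL_BUDGET_LIMIT. The names estimateCostUsd, createBudgetTracker, and BudgetExceededError are hypothetical, and the per-million-token prices are placeholders rather than real provider rates.

```typescript
// Hypothetical per-million-token prices in USD; substitute your provider's real rates.
interface TokenRates {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Convert token counts into an estimated USD cost.
export function estimateCostUsd(inputTokens: number, outputTokens: number, rates: TokenRates): number {
  return (inputTokens * rates.inputPerMillion + outputTokens * rates.outputPerMillion) / 1_000_000;
}

export class BudgetExceededError extends Error {
  readonly spentUsd: number;
  readonly limitUsd: number;

  constructor(spentUsd: number, limitUsd: number) {
    super(`Budget exceeded: spent $${spentUsd.toFixed(4)} of $${limitUsd.toFixed(2)}`);
    this.name = "BudgetExceededError";
    this.spentUsd = spentUsd;
    this.limitUsd = limitUsd;
  }
}

export function createBudgetTracker(limitUsd: number) {
  let spentUsd = 0;
  return {
    // Add one call's cost; throw once the running total passes the limit.
    record(costUsd: number): void {
      spentUsd += costUsd;
      if (spentUsd > limitUsd) {
        throw new BudgetExceededError(spentUsd, limitUsd);
      }
    },
    get spent(): number {
      return spentUsd;
    },
    get remaining(): number {
      return Math.max(0, limitUsd - spentUsd);
    },
  };
}
```

A runner could create one tracker per evaluation run and call record() after each judge response, so a runaway dataset stops the run instead of silently overspending.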
Step 5: Build an LLM judge response cache
Create src/services/cache.ts. Running an LLM judge on every evaluation can get expensive. This module provides a caching layer that uses @reaatech/llm-cache for semantic similarity-based caching — identical and near-identical prompts reuse previous results.
Expected output: createJudgeCache({ enabled: true }) returns a working cache backend using OpenAI embeddings for semantic lookup. When enabled is false or initialization fails, it returns a no-op cache that always misses, so your pipeline keeps working.
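The cache listing is not reproduced in this step, and the @reaatech/llm-cache API is not shown anywhere in this tutorial. To illustrate the fallback pattern described above, here is a minimal sketch that swaps in an exact-match in-memory store instead of semantic lookup. The CacheService shape (synchronous get and set) and the name createInMemoryJudgeCache are assumptions made for illustration.

```typescript
// Sketch only: an exact-match in-memory stand-in, not the semantic cache
// the tutorial builds. The CacheService shape below is an assumption.
interface CachedJudgeOutput {
  score: number;
  reasoning: string;
}

export interface CacheService {
  get(prompt: string): CachedJudgeOutput | undefined;
  set(prompt: string, value: CachedJudgeOutput): void;
}

// Collapse whitespace and case so trivially different prompts share a key.
function normalizePrompt(prompt: string): string {
  return prompt.trim().replace(/\s+/g, " ").toLowerCase();
}

// A cache that never stores anything, so every lookup misses.
function createNoopCache(): CacheService {
  return {
    get: () => undefined,
    set: () => {},
  };
}

export function createInMemoryJudgeCache(options: { enabled: boolean }): CacheService {
  if (!options.enabled) return createNoopCache();
  const store = new Map<string, CachedJudgeOutput>();
  return {
    get: (prompt) => store.get(normalizePrompt(prompt)),
    set: (prompt, value) => {
      store.set(normalizePrompt(prompt), value);
    },
  };
}
```

Because both variants satisfy the same interface, the caller never needs to check whether caching is on: a disabled cache simply misses every time and the judge is called as usual.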
Step 6: Repair and sanitize judge outputs
LLM judges sometimes return malformed JSON wrapped in markdown fences. Create src/repair/strip.ts to clean those outputs before they reach your metric computation.
Expected output: repairJudgeOutput takes a raw string like ```json\n{ "score": 0.85, "reasoning": "Looks good" }\n``` and returns a clean { score: 0.85, reasoning: "Looks good" } object. If the output is truly garbage, it throws a JudgeRepairError with the original input for debugging.
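The listing for src/repair/strip.ts is not reproduced in this step. The sketch below shows one way the behavior described above could work: strip a surrounding markdown fence, isolate the outermost JSON object, parse it, and validate the two fields. It reuses the names repairJudgeOutput and JudgeRepairError from the description, but the body, the rawOutput field, and the 0 to 1 range check are assumptions, and the published module may differ.

```typescript
export class JudgeRepairError extends Error {
  readonly rawOutput: string; // original judge text, kept for debugging (field name assumed)

  constructor(message: string, rawOutput: string) {
    super(message);
    this.name = "JudgeRepairError";
    this.rawOutput = rawOutput;
  }
}

export interface RepairedJudgeOutput {
  score: number;
  reasoning: string;
}

// Three backticks, built indirectly so this listing stays fence-safe.
const FENCE = "`".repeat(3);

export function repairJudgeOutput(raw: string): RepairedJudgeOutput {
  let text = raw.trim();

  // Remove the opening fence line, including an optional language tag such as "json".
  if (text.startsWith(FENCE)) {
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
  }
  // Remove the closing fence.
  text = text.trimEnd();
  if (text.endsWith(FENCE)) {
    text = text.slice(0, -FENCE.length);
  }

  // Keep only the outermost {...} so stray prose around the JSON is ignored.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new JudgeRepairError("No JSON object found in judge output", raw);
  }

  let parsed: unknown;
  try {
    parsed = JSON.parse(text.slice(start, end + 1));
  } catch {
    throw new JudgeRepairError("Judge output is not valid JSON", raw);
  }
  if (typeof parsed !== "object" || parsed === null) {
    throw new JudgeRepairError("Judge output is not a JSON object", raw);
  }

  const candidate = parsed as Record<string, unknown>;
  const score = candidate.score;
  const reasoning = candidate.reasoning;
  if (typeof score !== "number" || Number.isNaN(score) || score < 0 || score > 1) {
    throw new JudgeRepairError("Score must be a number between 0 and 1", raw);
  }
  if (typeof reasoning !== "string") {
    throw new JudgeRepairError("Reasoning must be a string", raw);
  }
  return { score, reasoning };
}
```

Validating after parsing matters as much as stripping the fence: a reply can be well-formed JSON and still carry a score outside the expected range, which would skew every downstream metric.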
Step 7: Wire up CI quality gates
Create src/gate/ci-gate.ts. This module reads the aggregated evaluation results and runs them through configurable quality gates. It produces a JUnit XML report that CI systems can parse, and returns a pass/fail verdict with an exit code.
ts
import {
  createGateEngine,
  getStandardPreset,
  getStrictPreset,
  getLenientPreset,
  CIIntegration,
} from "@reaatech/agent-eval-harness-gate";
import type { AggregatedResults } from "@reaatech/agent-eval-harness-suite";
import fs from "node:fs/promises";
import path from "node:path";

export async function evaluateGates(
  resultsPath: string,
  presetName: string,
): Promise<{ passed: boolean; exitCode: number }> {
  // Resolve the preset by name; unknown names are rejected.
  let preset: ReturnType<typeof getStandardPreset>;
  switch (presetName) {
    case "standard": {
      preset = getStandardPreset();
      break;
    }
    case "strict": {
      preset = getStrictPreset();
      break;
    }
    case "lenient": {
      preset = getLenientPreset();
      break;
    }
    default: {
      throw new Error("Unknown preset: " + presetName);
    }
  }

  // Load the aggregated results written by the evaluation run.
  let raw: string;
  try {
    raw = await fs.readFile(resultsPath, "utf-8");
  } catch {
    throw new Error("Failed to read results file at " + resultsPath + ": ensure the file exists and is readable");
  }
  const results = JSON.parse(raw) as AggregatedResults;

  // Run the gates and derive the CI exit code.
  const engine = createGateEngine(preset.gates);
  const summary = engine.evaluate(results);
  const exitCode = CIIntegration.getExitCode(summary);

  // Write a JUnit report that CI systems can parse.
  const outputDir = path.dirname("./results/gate-results.xml");
  await fs.mkdir(outputDir, { recursive: true });
  const junitXml = CIIntegration.generateJUnitReport(summary);
  await fs.writeFile("./results/gate-results.xml", junitXml, "utf-8");

  return { passed: summary.overallPassed, exitCode };
}
Expected output: evaluateGates("./results/results.json", "standard") reads your evaluation results, runs them through the standard quality gate preset, writes a JUnit report to results/gate-results.xml, and returns { passed: true, exitCode: 0 } if all gates pass.
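The gate engine's internals are not shown in this tutorial, so it may help to see the general idea in isolation. The sketch below is a simplified threshold gate under assumed shapes: each gate names a metric and a minimum or maximum bound, and the overall verdict maps to a CI exit code. ThresholdGate, evaluateThresholdGates, and the metric names are hypothetical and are not the types or presets of @reaatech/agent-eval-harness-gate.

```typescript
// Hypothetical shapes for illustration; the real gate engine's types are not shown here.
interface ThresholdGate {
  name: string;
  metric: string;
  min?: number; // fail when the metric falls below this value
  max?: number; // fail when the metric rises above this value
}

interface GateVerdict {
  name: string;
  passed: boolean;
  message: string;
}

export function evaluateThresholdGates(
  metrics: Record<string, number>,
  gates: ThresholdGate[],
): { overallPassed: boolean; exitCode: number; verdicts: GateVerdict[] } {
  const verdicts: GateVerdict[] = gates.map((gate) => {
    const value = metrics[gate.metric];
    // A gate whose metric is absent fails rather than passing silently.
    if (value === undefined) {
      return { name: gate.name, passed: false, message: `metric "${gate.metric}" is missing` };
    }
    const belowMin = gate.min !== undefined && value < gate.min;
    const aboveMax = gate.max !== undefined && value > gate.max;
    return {
      name: gate.name,
      passed: !belowMin && !aboveMax,
      message: `${gate.metric}=${value}`,
    };
  });

  // CI convention: exit code 0 means success, anything else blocks the build.
  const overallPassed = verdicts.every((verdict) => verdict.passed);
  return { overallPassed, exitCode: overallPassed ? 0 : 1, verdicts };
}
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that cannot be evaluated should not let a regression through by default.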
Step 8: Export evaluation traces to Langfuse
Create src/observability/langfuse-exporter.ts. This module sends your evaluation run as a trace to Langfuse, with one span per trajectory so you can inspect each conversation’s quality score in the Langfuse dashboard.
ts
import Langfuse from "langfuse";
import type { EvalRunResult } from "@reaatech/agent-eval-harness-suite";
import type { EvalRunConfig } from "../lib/types.js";

// Returns null when credentials are missing so callers can skip the export.
export function createLangfuseExporter(): Langfuse | null {
  const secretKey = process.env.LANGFUSE_SECRET_KEY;
  const publicKey = process.env.LANGFUSE_PUBLIC_KEY;
  if (!secretKey || !publicKey) return null;
  return new Langfuse({
    secretKey,
    publicKey,
    baseUrl: process.env.LANGFUSE_HOST ?? "https://cloud.langfuse.com",
  });
}

export async function exportRunToLangfuse(
  runResult: EvalRunResult,
  evalConfig: EvalRunConfig,
): Promise<string | null> {
  const langfuse = createLangfuseExporter();
  if (!langfuse) return null;
  try {
    // One trace per evaluation run.
    const trace = langfuse.trace({
      name: "agent-eval-run",
      metadata: {
        judgeModel: evalConfig.judgeModel,
        totalTrajectories: runResult.totalTrajectories,
        gatePreset: evalConfig.gatePreset,
      },
    });

    // One span per trajectory, carrying its scores and any error.
    for (const tr of runResult.trajectoryResults) {
      trace.span({
        name: `trajectory-${tr.trajectoryId}`,
        input: { trajectoryId: tr.trajectoryId },
        output: {
          overallScore: tr.result.overall_score,
          passed: tr.result.passed ?? false,
        },
        metadata: {
          metricScores: tr.result.metrics,
          errors: tr.error,
        },
        startTime: new Date(runResult.startedAt),
      });
    }

    await langfuse.flushAsync();
    return trace.id;
  } catch (error) {
    // Export failures are logged, never rethrown.
    console.warn("Failed to export run to Langfuse:", error);
    return null;
  }
}
Expected output: exportRunToLangfuse creates a trace named agent-eval-run with trajectory-level spans and flushes the data to Langfuse. If credentials are missing or the request fails, it returns null. Observability is optional and never blocks the pipeline.
Step 9: Build the evaluation runner
Create src/eval/runner.ts. This is the pipeline orchestrator. It loads golden .jsonl files from disk, converts them to trajectories, runs each trajectory through the Vercel AI Gateway judge, aggregates results, writes JSON and Markdown reports, and invokes the quality gate.
ts
import {
  SuiteRunner,
  parseConfig,
  createResultsAggregator,
  validateConfig,
  type SuiteConfig,
  type EvalRunResult,
} from "@reaatech/agent-eval-harness-suite";
import { reportCommand } from "@reaatech/agent-eval-harness-cli";
import type { Trajectory, EvalResult as HarnessEvalResult } from "@reaatech/agent-eval-harness-types";
import { generateId, now } from "@reaatech/llm-cost-telemetry";
import type { EvalRunConfig, GoldenConversation, EvalResult } from "../lib/types.js";
import { createVercelJudgeAdapter, type JudgeAdapter } from "./vercel-adapter.js";
import { createJudgeCache, type CacheService } from
Expected output: runEvaluation(config) loads all .jsonl files from the golden directory, evaluates each trajectory through the Vercel AI Gateway judge, aggregates results using the suite framework, writes results/results.json and results/report.md, and returns a typed EvalResult.
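The runner's first job is turning each .jsonl file into golden conversations. As a minimal sketch of that parsing step, assuming one JSON object per line with the id, messages, and expectedQuality fields this tutorial names, the hypothetical helper below validates each line and reports the file and line number of any bad record. Reading the file from disk is left out so the example stays self-contained, and treating expectedQuality as a number is an assumption.

```typescript
interface GoldenMessage {
  role: string;
  content: string;
}

interface GoldenRecord {
  id: string;
  messages: GoldenMessage[];
  expectedQuality: number; // assumed numeric
  expectedTools?: string[];
  metadata?: Record<string, unknown>;
}

// Parse JSONL text into golden records; `source` is only used in error messages.
export function parseGoldenJsonl(text: string, source = "<inline>"): GoldenRecord[] {
  const records: GoldenRecord[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    // Tolerate blank lines and a trailing newline.
    if (line.trim() === "") return;
    const where = `${source}:${index + 1}`;

    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      throw new Error(`${where}: invalid JSON`);
    }
    if (typeof parsed !== "object" || parsed === null) {
      throw new Error(`${where}: expected a JSON object`);
    }

    // Check the required fields before trusting the shape.
    const record = parsed as Partial<GoldenRecord>;
    if (
      typeof record.id !== "string" ||
      !Array.isArray(record.messages) ||
      typeof record.expectedQuality !== "number"
    ) {
      throw new Error(`${where}: missing id, messages, or expectedQuality`);
    }
    records.push(record as GoldenRecord);
  });
  return records;
}
```

Failing on the first malformed line, with its location, keeps a typo in a golden file from quietly shrinking the dataset and inflating the pass rate.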
Step 10: Wire the CLI entry point
Create src/index.ts as the CLI entry point. It exposes three commands — eval, gate, and report — so you can run the pipeline from your terminal or CI system.
ts
export { loadConfig } from "./lib/config.js";
export type {
  EvalRunConfig,
  EvalResult,
  GoldenConversation,
  ConversationMessage,
  JudgeOutput,
  ToolCall,
  ToolResult,
} from "./lib/types.js";
export { createVercelJudgeAdapter, JudgeError } from "./eval/vercel-adapter.js";
export type { JudgeAdapter } from "./eval/vercel-adapter.js";
export { createJudgeCache } from "./services/cache.js";
export type { CacheService } from "./services/cache.js";
export { runEvaluation, runGateCheck } from "./eval/runner.js";
export { evaluateGates } from "./gate/ci-gate.js";
export { createLangfuseExporter, exportRunToLangfuse } from "./observability/langfuse-exporter.js";
export {
  JudgeOutputSchema,
  repairJudgeOutput,
  repairJudgeOutputWithTrace,
  isValidJudgeOutput,
  analyzeJudgeOutput,
  JudgeRepairError,
} from "./repair/strip.js";

import { loadConfig as _loadConfig } from "./lib/config.js";
import { runEvaluation as _runEvaluation } from "./eval/runner.js";
import { evaluateGates as _evaluateGates } from "./gate/ci-gate.js";
import { reportCommand as _reportCommand } from "@reaatech/agent-eval-harness-cli";

export async function main(): Promise<void> {
  const args = process.argv.slice(2);

  if (args.length === 0 || args[0] === "--help" || args[0] === "-h") {
    console.log(`Usage: node . <command> [options]

Commands:
  eval            Run evaluation pipeline
  gate <path>     Check CI gates against results JSON
  report <path>   Generate report from results JSON

Options:
  --help, -h      Show this help message`);
    return;
  }

  const command = args[0];
  try {
    switch (command) {
      case "eval": {
        // Run the full pipeline; exit non-zero when the run reports an error.
        const config = _loadConfig();
        const result = await _runEvaluation(config);
        process.exit(result.status === "error" ? 1 : 0);
      }
      case "gate": {
        const resultsPath = args[1];
        if (!resultsPath) {
          console.error("Usage: node . gate <results-json-path> [preset]");
          process.exit(1);
        }
        // The preset is optional and defaults to "standard".
        const preset = args[2] ?? "standard";
        const { passed } = await _evaluateGates(resultsPath, preset);
        process.exit(passed ? 0 : 1);
      }
      case "report": {
        const resultsPath = args[1];
        if (!resultsPath) {
          console.error("Usage: node . report <results-json-path>");
          process.exit(1);
        }
        await _reportCommand(resultsPath, { format: "markdown" });
        break;
      }
      default:
        console.error(`Unknown command: ${command}`);
        console.error("Run 'node . --help' for usage.");
        process.exit(1);
    }
  } catch (err) {
    console.error("Fatal error:", err instanceof Error ? err.message : String(err));
    process.exit(1);
  }
}

// Skip auto-run under vitest so tests can call main() themselves.
if (!process.env.VITEST) {
  void main();
}
Expected output: Running node . --help prints the usage text. node . eval loads config, runs the full evaluation pipeline, and exits 0 on success. node . gate ./results/results.json strict evaluates gates and exits 1 on failure — ideal for CI blocking.
Step 11: Run the tests
The project includes a full vitest test suite with test files mirroring every source module. Run the tests to verify everything works end to end.
config — default values, custom overrides, missing required vars
entry point — CLI commands, help text, unknown commands, exit codes
Expected output: All tests pass (numFailedTests=0), coverage meets thresholds (lines, branches, functions, statements all above 90%), and pnpm typecheck returns no TypeScript errors.
Next steps
Add more golden datasets. Create .jsonl files in the golden/ directory with real support conversations from your bot. Each line should be a JSON object with id, messages, expectedQuality, and optional expectedTools and metadata fields.
Customize the quality gate presets. Adjust threshold values or add new presets in @reaatech/agent-eval-harness-gate to match your own quality bar. You can switch between standard, strict, and lenient via the EVAL_GATE_PRESET environment variable.
Wire into your CI pipeline. Add the eval and gate steps to your GitHub Actions, GitLab CI, or Jenkins configuration. The non-zero exit code from node . gate automatically blocks deploys that would have shipped regressions.
Connect a live support agent. The golden dataset evaluates a static snapshot of your bot’s behavior. For continuous monitoring, point the evaluator at your production support bot’s actual conversation log and run the harness on a schedule.
"../services/cache.js"
;
import { evaluateGates } from "../gate/ci-gate.js";
import fs from "node:fs/promises";
import path from "node:path";
export function convertConversationToTrajectory(conv: GoldenConversation): Trajectory {
  const turns = conv.messages.map((msg, idx) => ({
    turn_id: idx + 1,
    role: msg.role === "user" ? "user" as const : "agent" as const,
    content: msg.content,
    timestamp: new Date().toISOString(),
    tool_calls: msg.toolCalls?.map((tc) => ({
      name: tc.name,
      arguments: tc.arguments,
    })),
    ...(conv.metadata?.quality_notes ? { quality_notes: conv.metadata.quality_notes as string } : {}),
  }));
  return {
    trajectory_id: conv.id,
    turns,
    metadata: {
      agent_id: "eval-judge",
      session_id: conv.id,
      total_turns: turns.length,
      ...(conv.metadata ? { ...conv.metadata } : {}),
    },
  };
}

export function buildPromptFromTrajectory(trajectory: Trajectory): string {
  const lines: string[] = [];
  lines.push("Evaluate the following conversation:");