Vercel AI Gateway Agent Eval Harness for SMB Support Bots

An automated regression testing pipeline that evaluates SMB support agents against golden datasets, using Vercel AI Gateway as the LLM backbone and exporting observability to Langfuse.

vercel-ai-gateway eval-harness langfuse smb-support regression-testing llm-judge quality-gates

The problem

Small businesses deploying AI support bots lack a systematic way to catch regressions before they reach customers. Ad‑hoc manual testing and single‑metric checks miss subtle degradations in answer quality, tool‑use accuracy, and cost creep.

Intro

This tutorial walks you through building an automated regression testing pipeline for SMB support bots. You’ll create a CI-friendly evaluation harness that replays golden conversations, scores responses with an LLM judge routed through Vercel AI Gateway, enforces quality gates, and exports every trace to Langfuse for dashboard-level observability. By the end, a failing gate halts CI with a non-zero exit code so regressions never reach production.
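The gate-then-exit idea described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual gate logic: the `ScoredConversation`, `GateThresholds`, and `GateReport` shapes, the `checkGates` and `exitCodeFor` names, and the two thresholds are all hypothetical.

```typescript
// Hypothetical shapes for illustration; the tutorial's own types may differ.
interface ScoredConversation {
  id: string;
  score: number; // judge score normalized to the range 0..1
  costUsd: number; // LLM cost of replaying this conversation
}

interface GateThresholds {
  minMeanScore: number; // fail when the average score drops below this
  maxTotalCostUsd: number; // fail when the whole run costs more than this
}

interface GateReport {
  passed: boolean;
  failures: string[];
}

// Compare aggregate metrics against thresholds and collect every failure.
function checkGates(
  results: ScoredConversation[],
  thresholds: GateThresholds,
): GateReport {
  if (results.length === 0) {
    // An empty run proves nothing, so treat it as a failure.
    return { passed: false, failures: ["no conversations were evaluated"] };
  }
  const failures: string[] = [];
  const meanScore =
    results.reduce((sum, r) => sum + r.score, 0) / results.length;
  const totalCost = results.reduce((sum, r) => sum + r.costUsd, 0);
  if (meanScore < thresholds.minMeanScore) {
    failures.push(
      `mean score ${meanScore.toFixed(2)} is below ${thresholds.minMeanScore}`,
    );
  }
  if (totalCost > thresholds.maxTotalCostUsd) {
    failures.push(
      `total cost $${totalCost.toFixed(4)} exceeds $${thresholds.maxTotalCostUsd}`,
    );
  }
  return { passed: failures.length === 0, failures };
}

// Map the report to a process exit code: 0 lets CI continue, 1 halts it.
function exitCodeFor(report: GateReport): 0 | 1 {
  return report.passed ? 0 : 1;
}
```

A CI script would pass the result of `exitCodeFor` to `process.exit`, so any failed gate stops the job.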

Prerequisites

Node.js 22+ and pnpm 10+ installed
A Vercel AI Gateway API key (AI_GATEWAY_API_KEY)
A Langfuse account (cloud or self-hosted) with secret and public keys
An OpenAI API key (OPENAI_API_KEY) for the LLM cache embedder
Basic familiarity with TypeScript and environment variables
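Before running anything, it can help to fail fast when a key is missing. The sketch below assumes the Langfuse keys are supplied as LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY (the names the Langfuse SDKs conventionally read; the tutorial's own config loader may use different ones), and `findMissingEnvVars` is a hypothetical helper name.

```typescript
// Variable names assumed for this sketch. AI_GATEWAY_API_KEY and
// OPENAI_API_KEY appear in the tutorial; the Langfuse names are an assumption.
const REQUIRED_ENV_VARS = [
  "AI_GATEWAY_API_KEY",
  "LANGFUSE_SECRET_KEY",
  "LANGFUSE_PUBLIC_KEY",
  "OPENAI_API_KEY",
] as const;

type EnvSource = Record<string, string | undefined>;

// Return the names of required variables that are unset or blank,
// preserving the order of REQUIRED_ENV_VARS.
function findMissingEnvVars(env: EnvSource): string[] {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

At startup you might call `findMissingEnvVars(process.env)` and exit with an error message listing whatever it returns.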

Step 1: Create the project and install dependencies

Start from an empty directory. Create the project, add configuration files, install dependencies, and set up your environment.

terminal

mkdir my-eval-harness && cd my-eval-harness

Create package.json with the required dependencies:

Example artifact

A complete, working implementation of this recipe, downloadable as a zip or browsable file by file. Generated by our build pipeline and tested with near-full coverage before publishing.

175 kB · 87 tests · 98.4% coverage · vitest passing

SHA-256: 84dad4328903bab3994569ec06164d53b0139c56cdf2e8711988e92bab9d34ce

import {
  CacheEngine,
  InMemoryAdapter,
  OpenAIEmbedder,
} from "@reaatech/llm-cache";

export interface CacheLookupOptions {
  model: string;
  modelVersion?: string;
  useCase?: string;
}

export interface CacheSetOptions {
  model: string;
  modelVersion?: string;
  useCase?: string;
}

export type CacheResult = {
  hit: boolean;
  data?: unknown;
  type?: "exact" | "semantic";
};

export interface CacheService {
  get(prompt: string, options: CacheLookupOptions): Promise<CacheResult>;
  set(
    prompt: string,
    response: unknown,
    options: CacheSetOptions,
  ): Promise<void>;
}

function createNoopCache(): CacheService {
  return {
    get(): Promise<CacheResult> {
      return Promise.resolve({ hit: false });
    },
    async set(): Promise<void> {},
  };
}

function createEnabledCache(): CacheService {
  const engine = new CacheEngine({
    storage: new InMemoryAdapter(),
    vectorStorage: new InMemoryAdapter(),
    embedder: new OpenAIEmbedder({
      provider: "openai",
      model: "text-embedding-3-small",
      dimensions: 1536,
      apiKey: process.env.OPENAI_API_KEY ?? "",
    }),
    config: {
      storage: { adapter: "memory" },
      vectorStorage: { adapter: "memory" },
      embedding: {
        provider: "openai",
        model: "text-embedding-3-small",
        dimensions: 1536,
        batchSize: 100,
        maxRetries: 3,
      },
      similarity: {
        threshold: 0.8,
        metric: "cosine",
        maxResults: 10,
      },
      ttl: {
        default: 3600,
        factual: 1800,
        creative: 7200,
        analytical: 3600,
        sensitive: 600,
        byUseCase: {},
      },
      segmentation: {
        enabled: true,
        defaultUseCase: "general",
      },
      cost: {
        enabled: true,
        currency: "USD",
      },
      observability: {
        metrics: true,
        tracing: false,
        logging: "info",
      },
    },
  });

  return {
    async get(
      prompt: string,
      options: CacheLookupOptions,
    ): Promise<CacheResult> {
      const result = await engine.get(prompt, {
        model: options.model,
        modelVersion: options.modelVersion,
        useCase: options.useCase,
      });
      if (result.hit) {
        return {
          hit: true,
          data: result.entry.response,
          type: result.type,
        };
      }
      return { hit: false };
    },
    async set(
      prompt: string,
      response: unknown,
      options: CacheSetOptions,
    ): Promise<void> {
      await engine.set(prompt, response, {
        model: options.model,
        modelVersion: options.modelVersion,
        useCase: options.useCase,
      });
    },
  };
}

export function createJudgeCache(config: { enabled: boolean }): CacheService {
  if (!config.enabled) {
    return createNoopCache();
  }
  try {
    return createEnabledCache();
  } catch (error) {
    console.warn(
      "[cache] Failed to initialize CacheEngine, falling back to no-op cache:",
      error,
    );
    return createNoopCache();
  }
}

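One way a caller might use the CacheService interface above is a read-through wrapper around the judge call. The sketch below is an assumption about usage, not code from the example artifact: `createMapCache` is a hypothetical exact-match stand-in for createJudgeCache (so the block runs without the @reaatech/llm-cache package), and `Judge` and `judgeWithCache` are illustrative names.

```typescript
// Local copies of the shapes defined above, so this sketch compiles on its own.
interface CacheOptions {
  model: string;
  modelVersion?: string;
  useCase?: string;
}
type CacheResult = {
  hit: boolean;
  data?: unknown;
  type?: "exact" | "semantic";
};
interface CacheService {
  get(prompt: string, options: CacheOptions): Promise<CacheResult>;
  set(prompt: string, response: unknown, options: CacheOptions): Promise<void>;
}

// Hypothetical stand-in for createJudgeCache: exact matches only, backed by a Map.
function createMapCache(): CacheService {
  const store = new Map<string, unknown>();
  const keyOf = (prompt: string, o: CacheOptions): string =>
    JSON.stringify([o.model, o.modelVersion ?? "", o.useCase ?? "", prompt]);
  return {
    async get(prompt: string, options: CacheOptions): Promise<CacheResult> {
      const key = keyOf(prompt, options);
      if (store.has(key)) {
        return { hit: true, data: store.get(key), type: "exact" };
      }
      return { hit: false };
    },
    async set(
      prompt: string,
      response: unknown,
      options: CacheOptions,
    ): Promise<void> {
      store.set(keyOf(prompt, options), response);
    },
  };
}

// Any function that turns a judge prompt into a verdict.
type Judge = (prompt: string) => Promise<unknown>;

// Read-through: return the cached verdict on a hit; otherwise call the
// judge once and store its answer for the next lookup.
async function judgeWithCache(
  cache: CacheService,
  judge: Judge,
  prompt: string,
  options: CacheOptions,
): Promise<{ data: unknown; fromCache: boolean }> {
  const cached = await cache.get(prompt, options);
  if (cached.hit) {
    return { data: cached.data, fromCache: true };
  }
  const fresh = await judge(prompt);
  await cache.set(prompt, fresh, options);
  return { data: fresh, fromCache: false };
}
```

Because the wrapper depends only on the interface, swapping in the no-op cache simply makes every call a miss, which matches the fallback behavior of createJudgeCache.
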
export { loadConfig } from "./lib/config.js";
export type {
  EvalRunConfig,
  EvalResult,
  GoldenConversation,
  ConversationMessage,
  JudgeOutput,
  ToolCall,
  ToolResult,
} from "./lib/types.js";
export { createVercelJudgeAdapter, JudgeError } from "./eval/vercel-adapter.js";
export type { JudgeAdapter } from "./eval/vercel-adapter.js";
export { createJudgeCache } from "./services/cache.js";
export type { CacheService } from "./services/cache.js";
export { runEvaluation, runGateCheck } from "./eval/runner.js";
export { evaluateGates } from "./gate/ci-gate.js";
export { createLangfuseExporter, exportRunToLangfuse } from "./observability/langfuse-exporter.js";
export {
  JudgeOutputSchema,
  repairJudgeOutput,
  repairJudgeOutputWithTrace,
  isValidJudgeOutput,
  analyzeJudgeOutput,
  JudgeRepairError,
} from "./repair/strip.js";

import { loadConfig as _loadConfig } from "./lib/config.js";
import { runEvaluation as _runEvaluation } from "./eval/runner.js";
import { evaluateGates as _evaluateGates } from "./gate/ci-gate.js";
import { reportCommand as _reportCommand } from "@reaatech/agent-eval-harness-cli";

export async function main(): Promise<void> {
  const args = process.argv.slice(2);

  if (args.length === 0 || args[0] === "--help" || args[0] === "-h") {
    console.log(`
Usage: node . <command> [options]

Commands:
  eval            Run evaluation pipeline
  gate <path>     Check CI gates against results JSON
  report <path>   Generate report from results JSON

Options:
  --help, -h      Show this help message
`);
    return;
  }

  const command = args[0];

  try {
    switch (command) {
      case "eval": {
        const config = _loadConfig();
        const result = await _runEvaluation(config);
        process.exit(result.status === "error" ? 1 : 0);
      }
      case "gate": {
        const resultsPath = args[1];
        if (!resultsPath) {
          console.error("Usage: node . gate <results-json-path> [preset]");
          process.exit(1);
        }
        const preset = args[2] ?? "standard";
        const { passed } = await _evaluateGates(resultsPath, preset);
        process.exit(passed ? 0 : 1);
      }
      case "report": {
        const resultsPath = args[1];
        if (!resultsPath) {
          console.error("Usage: node . report <results-json-path>");
          process.exit(1);
        }
        await _reportCommand(resultsPath, { format: "markdown" });
        break;
      }
      default:
        console.error(`Unknown command: ${command}`);
        console.error("Run 'node . --help' for usage.");
        process.exit(1);
    }
  } catch (err) {
    console.error("Fatal error:", err instanceof Error ? err.message : String(err));
    process.exit(1);
  }
}

if (!process.env.VITEST) {
  void main();
}
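
Because main() calls process.exit inside the switch, its dispatch logic is awkward to unit-test directly. One possible refactor, sketched here as an assumption rather than something the example artifact does, is to parse the arguments into a plain value first and keep the side effects in a thin wrapper. The `Command` union and `parseCommand` name are hypothetical; the usage strings and the "standard" default preset are copied from the code above.

```typescript
// A parsed CLI invocation, with no side effects attached.
type Command =
  | { kind: "help" }
  | { kind: "eval" }
  | { kind: "gate"; resultsPath: string; preset: string }
  | { kind: "report"; resultsPath: string }
  | { kind: "invalid"; message: string };

// Turn process.argv.slice(2) into a Command without touching the process.
function parseCommand(args: string[]): Command {
  const [name, path, preset] = args;
  if (name === undefined || name === "--help" || name === "-h") {
    return { kind: "help" };
  }
  switch (name) {
    case "eval":
      return { kind: "eval" };
    case "gate":
      if (!path) {
        return {
          kind: "invalid",
          message: "Usage: node . gate <results-json-path> [preset]",
        };
      }
      // Same default preset as the switch in main().
      return { kind: "gate", resultsPath: path, preset: preset ?? "standard" };
    case "report":
      if (!path) {
        return {
          kind: "invalid",
          message: "Usage: node . report <results-json-path>",
        };
      }
      return { kind: "report", resultsPath: path };
    default:
      return { kind: "invalid", message: `Unknown command: ${name}` };
  }
}
```

With this split, main() would only switch on `kind`, run the matching function, and choose the exit code, while tests exercise `parseCommand` with plain arrays.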

Intro

Prerequisites

Step 1: Create the project and install dependencies

Step 2: Define the core types

Step 3: Load configuration from environment variables

Step 4: Create the Vercel AI Gateway judge adapter

Step 5: Build an LLM judge response cache

Step 6: Repair and sanitize judge outputs

Step 7: Wire up CI quality gates

Step 8: Export evaluation traces to Langfuse

Step 9: Build the evaluation runner

Step 10: Wire the CLI entry point

Step 11: Run the tests

Next steps