An SMB running on‑premises support agents on vLLM lacks systematic regression testing after model updates or prompt changes. Manual conversation review is slow, and a bad deployment can degrade customer satisfaction before anyone notices.
This tutorial builds a quality-gate pipeline for on-premises, vLLM-hosted support bots. You’ll load golden trajectory datasets, run evaluations against a vLLM endpoint, compare agent responses against reference trajectories, enforce configurable quality thresholds, and export metrics to Langfuse for observability. By the end, you’ll have a CLI tool and an HTTP endpoint that a CI pipeline can call before deploying a model update.
Prerequisites
Node.js >= 22
pnpm 10.x
A running vLLM instance with the OpenAI-compatible API enabled (/v1 endpoint)
A Langfuse account (optional — metric tracking works without it, only the export step is skipped)
Basic familiarity with TypeScript, Next.js App Router, and Zod schema validation
Step 1: Scaffold the project
Start with an empty directory and create the Next.js project shell. The scaffold includes the package manifest, TypeScript config, and app layout.
Create package.json with exact-pinned dependencies:
Create next.config.ts with standard Next.js settings:
ts
// next.config.ts
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  /* config options here */
};

export default nextConfig;
Expected output: pnpm install completes without errors, and node_modules/ and pnpm-lock.yaml are created.
Step 2: Configure environment variables
Create .env.example with all the variables the pipeline reads:
env
# Env vars used by vllm-agent-quality-gate-for-on-prem-smb-support-bots.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only; never commit real values.
NODE_ENV=development
VLLM_ENDPOINT=http://localhost:8000/v1
VLLM_API_KEY=<your-vllm-api-key>
VLLM_MODEL=deepseek-v4-flash
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_BASE_URL=https://cloud.langfuse.com
GOLDEN_DATA_DIR=./golden
EVAL_RESULTS_DIR=./results
GATE_PRESET=standard
QUALITY_THRESHOLD=0.9
CI_MODE=false
Copy it to .env and fill in your real values:
terminal
cp .env.example .env
Expected output: .env exists with your vLLM endpoint, API key, and Langfuse credentials filled in.
Step 3: Create TypeScript interfaces for evaluation config
Create src/eval/types.ts with the interfaces that define the shape of your evaluation configuration and results:
Expected output: src/eval/types.ts compiles cleanly with pnpm typecheck.
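As a minimal sketch of what these interfaces might look like, the shapes below are inferred from the environment variables in Step 2 and from the results the CLI prints in Step 10. Every field name here is an assumption for illustration; the real src/eval/types.ts may be organized differently.

```typescript
// Hypothetical shapes inferred from the Step 2 environment variables.
// Field names are illustrative, not the recipe's actual definitions.
type GatePreset = "standard" | "strict" | "lenient";

interface EvalRunConfig {
  vllm: { endpoint: string; apiKey: string; model: string };
  // Langfuse is optional: when absent, the export step is skipped.
  langfuse?: { publicKey: string; secretKey: string; baseUrl: string };
  goldenDir: string;
  resultsDir: string;
  gatePreset: GatePreset;
  qualityThreshold: number; // expected to lie in [0, 1]
  ciMode: boolean;
}

interface EvalRunResult {
  passed: boolean;
  gatesPassed: number;
  gatesFailed: number;
  avgSimilarity: number;
  langfuseTraceId?: string;
}

// A config value matching the placeholders in .env.example.
const exampleConfig: EvalRunConfig = {
  vllm: {
    endpoint: "http://localhost:8000/v1",
    apiKey: "<your-vllm-api-key>",
    model: "deepseek-v4-flash",
  },
  goldenDir: "./golden",
  resultsDir: "./results",
  gatePreset: "standard",
  qualityThreshold: 0.9,
  ciMode: false,
};

// A result value of the kind runEval() might return.
const exampleResult: EvalRunResult = {
  passed: true,
  gatesPassed: 3,
  gatesFailed: 0,
  avgSimilarity: 0.94,
};
```

Nesting the vLLM settings under a vllm key matches the config.vllm.endpoint access used in Step 4.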
Step 4: Build a Zod-validated config parser
Create src/eval/config.ts. This reads process.env, validates every value with Zod schemas, and returns a fully typed EvalRunConfig. Using Zod means you catch misconfigured environment variables at startup with clear error messages:
Expected output: Running pnpm tsx -e "import { parseConfigFromEnv } from './src/eval/config.js'; console.log(JSON.stringify(parseConfigFromEnv().vllm.endpoint))" prints your vLLM endpoint URL.
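The fail-fast idea behind the parser can be sketched without any dependencies. The version below deliberately uses hand-written checks instead of Zod so that it runs on its own; parseConfigSketch, the ParsedConfig shape, and the default values (taken from .env.example) are assumptions, not the recipe's actual code.

```typescript
// Dependency-free stand-in for a Zod-based parser (hypothetical names).
// It collects every problem first, then throws one readable error.
type Env = Record<string, string | undefined>;

interface ParsedConfig {
  vllm: { endpoint: string; apiKey: string; model: string };
  gatePreset: "standard" | "strict" | "lenient";
  qualityThreshold: number;
  ciMode: boolean;
}

function parseConfigSketch(env: Env): ParsedConfig {
  const errors: string[] = [];

  const endpoint = env.VLLM_ENDPOINT ?? "";
  if (!/^https?:\/\/\S+$/.test(endpoint)) {
    errors.push("VLLM_ENDPOINT must be an http(s) URL");
  }

  const model = env.VLLM_MODEL ?? "";
  if (model === "") errors.push("VLLM_MODEL is required");

  const preset = env.GATE_PRESET ?? "standard";
  if (preset !== "standard" && preset !== "strict" && preset !== "lenient") {
    errors.push(`GATE_PRESET must be standard, strict, or lenient (got "${preset}")`);
  }

  // Number("abc") is NaN, which fails both comparisons below.
  const threshold = Number(env.QUALITY_THRESHOLD ?? "0.9");
  if (!(threshold >= 0 && threshold <= 1)) {
    errors.push("QUALITY_THRESHOLD must be a number between 0 and 1");
  }

  if (errors.length > 0) {
    throw new Error(`Invalid environment:\n- ${errors.join("\n- ")}`);
  }

  return {
    vllm: { endpoint, apiKey: env.VLLM_API_KEY ?? "", model },
    gatePreset: preset as ParsedConfig["gatePreset"],
    qualityThreshold: threshold,
    ciMode: env.CI_MODE === "true",
  };
}
```

In a real entry point this would be called once at startup with process.env, so a typo such as QUALITY_THRESHOLD=9 stops the run before any evaluation starts.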
Step 5: Build the golden trajectory comparison library
Create src/lib/golden.ts. This module wraps the @reaatech/agent-eval-harness-trajectory and @reaatech/agent-eval-harness-golden packages to load golden trajectories from JSONL files and compare candidate agent runs against them:
The key function is compareTrajectoriesAgainstGoldens. For each candidate trajectory it calls findBestGolden to pick the closest reference, then compareAgainstGolden to produce a turn-level similarity score and regression list.
Expected output: A module that, given a directory of JSONL golden files and a directory of candidate results, returns per-trajectory similarity scores and aggregate stats.
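To make the "pick the closest reference, then score turn by turn" flow concrete, here is a small self-contained sketch. It does not use the @reaatech packages and is not their algorithm: the Turn and Trajectory shapes, the word-overlap (Jaccard) similarity, and the function names are all assumptions chosen for illustration.

```typescript
// Illustrative only: a toy turn-level comparison, not the package's scoring.
interface Turn { role: "user" | "assistant"; content: string }
interface Trajectory { id: string; turns: Turn[] }

// Jaccard overlap of the lowercase word sets of two strings.
function textSimilarity(a: string, b: string): number {
  const wordsA = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const wordsB = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  if (wordsA.size === 0 && wordsB.size === 0) return 1;
  let shared = 0;
  wordsA.forEach((w) => { if (wordsB.has(w)) shared += 1; });
  return shared / (wordsA.size + wordsB.size - shared);
}

// Average per-turn similarity; missing or role-mismatched turns score 0.
function compareToGoldenSketch(candidate: Trajectory, golden: Trajectory): number {
  const n = Math.max(candidate.turns.length, golden.turns.length);
  if (n === 0) return 1;
  let total = 0;
  for (let i = 0; i < n; i++) {
    const c = candidate.turns[i];
    const g = golden.turns[i];
    if (c && g && c.role === g.role) total += textSimilarity(c.content, g.content);
  }
  return total / n;
}

// Return the golden with the highest similarity, or undefined if none exist.
function findBestGoldenSketch(
  candidate: Trajectory,
  goldens: Trajectory[],
): { golden: Trajectory; similarity: number } | undefined {
  let best: { golden: Trajectory; similarity: number } | undefined;
  for (const golden of goldens) {
    const similarity = compareToGoldenSketch(candidate, golden);
    if (!best || similarity > best.similarity) best = { golden, similarity };
  }
  return best;
}
```

A production comparison would likely also weigh tool calls and ordering, which is part of what the harness packages are for.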
Step 6: Wrap the CI gate engine
Create src/lib/gate.ts. This wraps the @reaatech/agent-eval-harness-gate package to build a gate engine from one of three presets (standard, strict, lenient), evaluate aggregated results against it, and produce CI exit codes and Markdown reports:
ts
import {
  createGateEngine,
  getStandardPreset,
  getStrictPreset,
  getLenientPreset,
  CIIntegration,
} from "@reaatech/agent-eval-harness-gate";
import type {
  GateEvaluationSummary,
  GateDefinition,
} from "@reaatech/agent-eval-harness-gate";

type AggregatedResults = Parameters<
  ReturnType<typeof createGateEngine>["evaluate"]
>[0];

export function buildGateEngine(config: {
  preset: string;
  customThreshold?: number;
}): ReturnType<typeof createGateEngine> {
  let gates: GateDefinition[];
  switch (config.preset) {
    case "standard":
      gates = getStandardPreset().gates;
      break;
    case "strict":
      gates = getStrictPreset().gates;
      break;
    case "lenient":
      gates = getLenientPreset().gates;
      break;
    default:
      throw new Error(
        `Unknown gate preset: ${config.preset}. Use "standard", "strict", or "lenient".`
      );
  }

  const engine = createGateEngine(gates);
  if (config.customThreshold !== undefined) {
    engine.removeGate("overall-quality");
    engine.addGate({
      name: "overall-quality",
      type: "threshold",
      metric: "overallScore",
      operator: ">=",
      threshold: config.customThreshold,
      enabled: true,
      description: `Overall quality score >= ${String(config.customThreshold)}`,
    });
  }
  return engine;
}

export function evaluateAndCheck(
  engine: ReturnType<typeof createGateEngine>,
  aggregatedResults: AggregatedResults
): { summary: GateEvaluationSummary; passed: boolean } {
  const summary = engine.evaluate(aggregatedResults);
  return { summary, passed: summary.overallPassed };
}

export function getCiExitCode(summary: GateEvaluationSummary): number {
  return CIIntegration.getExitCode(summary);
}

export function generateCiReport(summary: GateEvaluationSummary): string {
  return CIIntegration.generateStepSummary(summary);
}
Expected output: The module exports buildGateEngine (which selects a preset and optionally overrides the threshold), evaluateAndCheck (runs gates, returns pass/fail), and getCiExitCode / generateCiReport for CI pipeline integration.
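For intuition about what a threshold gate does with a metric such as overallScore, the following dependency-free sketch evaluates a list of gates against a metrics map. It mirrors the field names of the gate object passed to addGate above, but the evaluation logic and the names evaluateGatesSketch and GateResult are assumptions, not the behavior of @reaatech/agent-eval-harness-gate.

```typescript
// Toy threshold-gate evaluation (hypothetical; not the package's engine).
type Operator = ">=" | ">" | "<=" | "<";

interface ThresholdGate {
  name: string;
  metric: string;
  operator: Operator;
  threshold: number;
  enabled: boolean;
}

interface GateResult { name: string; passed: boolean; value: number | undefined }

function compare(value: number, operator: Operator, threshold: number): boolean {
  switch (operator) {
    case ">=": return value >= threshold;
    case ">": return value > threshold;
    case "<=": return value <= threshold;
    case "<": return value < threshold;
  }
}

// Disabled gates are skipped; a missing metric counts as a failure.
function evaluateGatesSketch(
  gates: ThresholdGate[],
  metrics: Record<string, number>,
): { results: GateResult[]; overallPassed: boolean } {
  const results = gates
    .filter((gate) => gate.enabled)
    .map((gate): GateResult => {
      const value: number | undefined = metrics[gate.metric];
      const passed = value !== undefined && compare(value, gate.operator, gate.threshold);
      return { name: gate.name, passed, value };
    });
  return { results, overallPassed: results.every((r) => r.passed) };
}

const qualityGate: ThresholdGate = {
  name: "overall-quality",
  metric: "overallScore",
  operator: ">=",
  threshold: 0.9,
  enabled: true,
};
```

With this gate, a run whose overallScore is 0.87 fails and one at 0.93 passes, which is the kind of decision the CI exit code is derived from.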
Step 7: Wire up Langfuse metric export
Create src/lib/langfuse.ts. This wraps the langfuse SDK to create a client, export evaluation metrics as trace scores, and flush on shutdown:
Expected output: exportEvalMetrics creates a Langfuse trace named "eval-run" and scores it with avg-similarity, pass-rate, and trajectory-count. The Langfuse dashboard will show each evaluation run as a trace.
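The export step described above could be sketched as follows. To stay runnable without the SDK installed, the function takes any object with a trace() method that returns something with an id and a score() method; these LangfuseLike and TraceLike interfaces are assumptions modeled on the subset of the langfuse client this step needs, and exportEvalMetricsSketch is a hypothetical name.

```typescript
// Minimal structural types for the parts of a Langfuse-style client used here.
// They are assumptions for this sketch, not the SDK's own type definitions.
interface TraceLike {
  id: string;
  score(args: { name: string; value: number }): unknown;
}
interface LangfuseLike {
  trace(args: { name: string }): TraceLike;
}

interface EvalMetrics {
  avgSimilarity: number;
  passRate: number;
  trajectoryCount: number;
}

// Creates one "eval-run" trace and attaches the three scores to it.
// Returns the trace ID, or undefined when no client is configured.
function exportEvalMetricsSketch(
  client: LangfuseLike | undefined,
  metrics: EvalMetrics,
): string | undefined {
  if (!client) return undefined; // Langfuse is optional; skip the export
  const trace = client.trace({ name: "eval-run" });
  trace.score({ name: "avg-similarity", value: metrics.avgSimilarity });
  trace.score({ name: "pass-rate", value: metrics.passRate });
  trace.score({ name: "trajectory-count", value: metrics.trajectoryCount });
  return trace.id;
}
```

Accepting an undefined client keeps the rest of the pipeline unchanged when Langfuse credentials are not set, and injecting the client makes the function easy to test with a fake.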
Step 8: Wire up the instrumentation hook
Create src/instrumentation.ts. Next.js calls the exported register() function once at server startup; the NEXT_RUNTIME check makes it return early in any runtime other than Node.js:
ts
export async function register(): Promise<void> {
  if (process.env.NEXT_RUNTIME !== "nodejs") return;
  await Promise.resolve();
}
Expected output: At server startup, register() fires and returns immediately (a no-op unless you extend it with custom startup logic).
Step 9: Build the evaluation pipeline orchestrator
Create src/eval/run.ts. This is the core of the recipe — a runEval() function that orchestrates the full quality gate pipeline: verify the vLLM endpoint, load golden references, run the evaluation, compare results, run quality gates, and export to Langfuse.
ts
import { evalCommand } from "@reaatech/agent-eval-harness-cli";
import OpenAI from "openai";
import { randomUUID } from "node:crypto";
import { loadTrajectories, compareTrajectoriesAgainstGoldens } from "../lib/golden.js";
import { createLangfuse, exportEvalMetrics, shutdownLangfuse } from "../lib/langfuse.js";
import { buildGateEngine, evaluateAndCheck, getCiExitCode, generateCiReport } from "../lib/gate.js";
import { parseConfigFromEnv } from "./config.js";
import type { EvalRunResult } from "./types.js";
import type { GoldenTrajectory } from "@reaatech/agent-eval-harness-golden";

export async function runEval(
Expected output: Calling runEval() runs the full pipeline: connect to vLLM, load golden trajectories, evaluate the agent, compare results, run gates, export to Langfuse, and return a pass/fail result.
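The first stage, verifying the vLLM endpoint, might look like the sketch below. It relies on the model listing at /models under the OpenAI-compatible base URL and checks that the configured model is being served. The function name verifyVllmEndpoint, the injected fetchImpl parameter, and the FetchLike type are assumptions made so the example is testable; the recipe itself imports the openai client for this stage.

```typescript
// Narrow structural types so a fake can stand in for fetch in tests.
interface HttpResponseLike {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<HttpResponseLike>;

// Resolves with the served model IDs, or rejects with a descriptive error.
async function verifyVllmEndpoint(
  endpoint: string,
  apiKey: string,
  model: string,
  fetchImpl: FetchLike,
): Promise<string[]> {
  const base = endpoint.replace(/\/+$/, ""); // tolerate a trailing slash
  const res = await fetchImpl(`${base}/models`, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) {
    throw new Error(`vLLM endpoint responded with HTTP ${res.status}`);
  }
  const body = (await res.json()) as { data?: Array<{ id?: string }> };
  const ids = (body.data ?? [])
    .map((entry) => entry.id)
    .filter((id): id is string => typeof id === "string");
  if (!ids.some((id) => id === model)) {
    throw new Error(`Model "${model}" is not served; available: ${ids.join(", ") || "none"}`);
  }
  return ids;
}
```

Failing here, before any golden data is loaded, gives a clear error when the server is down or VLLM_MODEL does not match what vLLM was started with. In Node.js 22 the global fetch can be passed as fetchImpl.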
Step 10: Create the CLI entry point
Create src/index.ts. This is the executable entry point that calls runEval() from the command line. It also handles SIGINT and SIGTERM for clean shutdown:
Expected output: Running pnpm tsx src/index.ts executes the full quality-gate pipeline and prints pass/fail results with gate counts and a Langfuse trace ID.
Step 11: Expose the pipeline as an HTTP API route
Create app/api/eval/route.ts. This Next.js App Router route handler allows CI webhooks or other services to trigger evaluations over HTTP:
Expected output: POST /api/eval with {"goldenDir":"./golden"} returns {"ok":true,"passed":true,...}. GET /api/eval returns the current config state (endpoint URL, gate preset, threshold, and CI mode; the API key is not exposed).
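One way to keep the API key out of the GET response is to build the payload from an explicit allow-list of fields rather than deleting secrets from the full config. The sketch below assumes a hypothetical RuntimeConfig shape and toPublicConfig helper; the actual route may structure this differently.

```typescript
// Hypothetical config shape; only the fields relevant to the GET response.
interface RuntimeConfig {
  vllm: { endpoint: string; apiKey: string; model: string };
  gatePreset: string;
  qualityThreshold: number;
  ciMode: boolean;
}

interface PublicConfig {
  endpoint: string;
  gatePreset: string;
  qualityThreshold: number;
  ciMode: boolean;
}

// Copies only the non-secret fields, so new secrets added to RuntimeConfig
// later are not exposed by accident.
function toPublicConfig(config: RuntimeConfig): PublicConfig {
  return {
    endpoint: config.vllm.endpoint,
    gatePreset: config.gatePreset,
    qualityThreshold: config.qualityThreshold,
    ciMode: config.ciMode,
  };
}
```

A route handler could then serialize toPublicConfig(config) as its JSON body.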
Step 12: Write and run the test suite
Create tests that verify every module. Start with the config parser test at tests/eval/config.test.ts:
The route handler test at tests/app/api/eval/route.test.ts should look like this:
ts
import { describe, it, expect, vi, beforeEach } from "vitest";
import { POST, GET } from "../../../../app/api/eval/route.js";
import { NextRequest } from "next/server";

vi.mock("../../../../src/eval/run.js", () => ({
  runEval: vi.fn(),
}));

vi.mock("../../../../src/eval/config.js", () => ({
  parseConfigFromEnv: vi.fn(),
}));

describe("POST /api/eval", () => {
  beforeEach(() => {
    vi.clearAllMocks();
  });

  it
Create the remaining test files for each module (tests/lib/golden.test.ts, tests/lib/gate.test.ts, tests/lib/langfuse.test.ts, tests/eval/run.test.ts, tests/index.test.ts, tests/instrumentation.test.ts). Each should mock external dependencies and cover both success and failure paths.
Run the full test suite:
terminal
pnpm vitest run --coverage --reporter=json --outputFile=vitest-report.json
Expected output: numFailedTests = 0, numTotalTests = 64, and coverage above 90% on lines, branches, functions, and statements. All tests mock external dependencies, so no live HTTP calls are made during tests.
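If you want a script to check these numbers instead of reading the report by eye, a small helper along these lines could work. It only looks at the numTotalTests and numFailedTests fields mentioned above; the function name checkVitestReport and the optional expectedTotal parameter are assumptions for this sketch.

```typescript
// The two summary fields this check relies on.
interface VitestJsonSummary {
  numTotalTests: number;
  numFailedTests: number;
}

// Returns a list of problems; an empty list means the report looks healthy.
function checkVitestReport(reportJson: string, expectedTotal?: number): string[] {
  const report = JSON.parse(reportJson) as Partial<VitestJsonSummary>;
  if (
    typeof report.numTotalTests !== "number" ||
    typeof report.numFailedTests !== "number"
  ) {
    return ["report is missing numTotalTests or numFailedTests"];
  }
  const problems: string[] = [];
  if (report.numFailedTests > 0) {
    problems.push(`${report.numFailedTests} test(s) failed`);
  }
  if (expectedTotal !== undefined && report.numTotalTests !== expectedTotal) {
    problems.push(`expected ${expectedTotal} tests, found ${report.numTotalTests}`);
  }
  return problems;
}
```

In practice the input would be the contents of vitest-report.json, for example read with readFileSync from node:fs. Checking the total as well as the failures guards against tests being silently skipped or deleted.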
Step 13: Configure the CI pipeline
Add a GitLab CI job that runs the evaluation on every merge request. Setting CI_MODE=true enables the exit-code behavior that blocks failing merges:
Expected output: A merge request that degrades agent quality below the 0.85 threshold will fail the CI pipeline with a non-zero exit code and a Markdown report in the logs showing which gates failed.
Next steps
Add a dashboard route — Build a Next.js page under app/dashboard/ that fetches from GET /api/eval and shows live gate status with a quality score gauge.
Extend the golden library — Use the GoldenCurator class from @reaatech/agent-eval-harness-golden to build a curation UI where support engineers can annotate and approve new golden trajectories.
Add multiple gate presets per environment — Use standard for staging, strict for production, and lenient for development environments by setting GATE_PRESET differently in each deployment.