vLLM Agent Quality Gate for On-Prem SMB Support Bots

Automated regression testing for self‑hosted LLM agents, with CI gates that block deployment when support‑bot quality drops.

Tags: vllm, eval-harness, quality-gate, support-bot, on-premise, langfuse, ci-cd, regression-testing

The problem

An SMB running on‑premises support agents on vLLM lacks systematic regression testing after model updates or prompt changes. Manual conversation review is slow, and a bad deployment can degrade customer satisfaction before anyone notices.

Intro

This tutorial builds a quality-gate pipeline for support bots hosted on on-premises vLLM. You’ll load golden trajectory datasets, run evaluations against a vLLM endpoint, compare agent responses against the reference trajectories, enforce configurable quality thresholds, and export metrics to Langfuse for observability. By the end, you’ll have a CLI tool and an HTTP endpoint that a CI pipeline can call before deploying a model update.

Prerequisites

Node.js >= 22
pnpm 10.x
A running vLLM instance with the OpenAI-compatible API enabled (/v1 endpoint)
A Langfuse account (optional: metric tracking works without it; only the export step is skipped)
Basic familiarity with TypeScript, Next.js App Router, and Zod schema validation
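
Before running any evaluation, it can help to confirm that the vLLM instance from the list above is reachable. The following is a minimal sketch, assuming the server exposes the standard OpenAI-compatible GET /v1/models route. The names listVllmModels and FetchLike are hypothetical and not part of the tutorial's code; the fetch function is passed in so the helper can be exercised without a live server.

```typescript
// Minimal shape of the fetch API this sketch relies on (hypothetical helper type).
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Lists the model IDs served by an OpenAI-compatible endpoint.
// `baseUrl` is expected to end in /v1, e.g. "http://localhost:8000/v1".
async function listVllmModels(
  baseUrl: string,
  fetchImpl: FetchLike,
  apiKey?: string,
): Promise<string[]> {
  // Drop trailing slashes so ".../v1/" and ".../v1" resolve to the same URL.
  const url = `${baseUrl.replace(/\/+$/, "")}/models`;

  const headers: Record<string, string> = {};
  if (apiKey) headers["Authorization"] = `Bearer ${apiKey}`;

  const res = await fetchImpl(url, { headers });
  if (!res.ok) {
    throw new Error(`vLLM endpoint returned HTTP ${res.status} for ${url}`);
  }

  // OpenAI-compatible servers answer with { object: "list", data: [{ id: "..." }, ...] }.
  const body = (await res.json()) as { data?: Array<{ id?: unknown }> };
  return (body.data ?? [])
    .map((m) => m.id)
    .filter((id): id is string => typeof id === "string");
}
```

On Node.js 22 the built-in fetch can be passed directly, for example listVllmModels("http://localhost:8000/v1", fetch). If the model you plan to evaluate is not in the returned list, the evaluation run is likely to fail later with a less obvious error.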

Step 1: Scaffold the project

Start with an empty directory and create the Next.js project shell. The scaffold includes the package manifest, TypeScript config, and app layout.

Create package.json with exact-pinned dependencies:

Example artifact

A complete, working implementation of this recipe is available as a zip download or browsable file by file. It is generated by our build pipeline and tested with full coverage before publishing.

170 kB · 64 tests · 100.0% coverage · vitest passing

SHA-256: 4c5d192481edee85239bf3c902df103903f0992b875cd51e7b678839d229fed1

import { loadFromDirectory } from "@reaatech/agent-eval-harness-trajectory";
import {
  compareAgainstGolden,
  findBestGolden,
  batchCompare,
} from "@reaatech/agent-eval-harness-golden";
import type { GoldenTrajectory } from "@reaatech/agent-eval-harness-golden";
import type { Trajectory } from "@reaatech/agent-eval-harness-types";

export interface TrajectoryComparisonResult {
  similarity: number;
  passesThreshold: boolean;
  regressions: unknown[];
  turnComparisons: unknown[];
  matchingTurns: number;
  divergentTurns: number;
  diffSummary: string;
}

export async function loadTrajectories(dirPath: string): Promise<Trajectory[]> {
  return loadFromDirectory(dirPath);
}

export function batchCompareTrajectories(
  golden: GoldenTrajectory,
  candidates: Trajectory[],
  threshold?: number
): Array<{ trajectory: Trajectory; result: TrajectoryComparisonResult }> {
  return batchCompare(golden, candidates, {
    similarityThreshold: threshold ?? 0.85,
  });
}

export function compareTrajectoriesAgainstGoldens(
  candidates: Trajectory[],
  goldens: GoldenTrajectory[],
  threshold?: number
): {
  comparisons: TrajectoryComparisonResult[];
  summary: { avgSimilarity: number; passCount: number; failCount: number };
} {
  const thresholdValue = threshold ?? 0.85;
  const comparisons: TrajectoryComparisonResult[] = [];

  for (const candidate of candidates) {
    const bestGolden = findBestGolden(candidate, goldens, {
      similarityThreshold: thresholdValue,
    });
    if (!bestGolden) continue;

    const result = compareAgainstGolden(bestGolden.golden, candidate, {
      similarityThreshold: thresholdValue,
    });
    comparisons.push(result);
  }

  const passCount = comparisons.filter((c) => c.passesThreshold).length;
  const failCount = comparisons.length - passCount;
  const avgSimilarity =
    comparisons.length > 0
      ? comparisons.reduce((sum, c) => sum + c.similarity, 0) / comparisons.length
      : 0;

  return {
    comparisons,
    summary: { avgSimilarity, passCount, failCount },
  };
}
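
One detail of compareTrajectoriesAgainstGoldens is worth noting: a candidate for which findBestGolden returns nothing is skipped, so it appears in neither passCount nor failCount. A gate that should fail closed may want to count such candidates explicitly. The sketch below shows one way to turn the summary into a pass/fail decision. The names decideGate, GateDecision, and minAvgSimilarity are hypothetical, and the rule itself (no failures, no unmatched candidates, average at or above the threshold) is an assumption, not the gate engine the tutorial wraps in a later step.

```typescript
// Same shape as the `summary` object returned by compareTrajectoriesAgainstGoldens.
interface ComparisonSummary {
  avgSimilarity: number;
  passCount: number;
  failCount: number;
}

// Hypothetical result type for this sketch.
interface GateDecision {
  passed: boolean;
  unmatchedCount: number;
  reasons: string[];
}

function decideGate(
  summary: ComparisonSummary,
  candidateCount: number,
  minAvgSimilarity: number,
): GateDecision {
  // Candidates that were skipped because no golden trajectory matched them.
  const compared = summary.passCount + summary.failCount;
  const unmatchedCount = Math.max(0, candidateCount - compared);
  const reasons: string[] = [];

  if (candidateCount === 0) {
    reasons.push("no candidate trajectories were evaluated");
  }
  if (unmatchedCount > 0) {
    reasons.push(`${unmatchedCount} candidate(s) matched no golden trajectory`);
  }
  if (summary.failCount > 0) {
    reasons.push(`${summary.failCount} comparison(s) fell below the similarity threshold`);
  }
  if (compared > 0 && summary.avgSimilarity < minAvgSimilarity) {
    reasons.push(
      `average similarity ${summary.avgSimilarity.toFixed(3)} is below ${minAvgSimilarity}`,
    );
  }

  // The gate passes only when nothing above raised an objection.
  return { passed: reasons.length === 0, unmatchedCount, reasons };
}
```

A CLI wrapper could map passed to the process exit code (0 or 1) and print reasons, so the CI job fails with an explanation when the gate does.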

import { describe, it, expect, afterEach, beforeEach } from "vitest";
import { parseConfigFromEnv } from "../../src/eval/config.js";

const OLD_ENV = process.env;

beforeEach(() => {
  process.env = { ...OLD_ENV };
});

afterEach(() => {
  process.env = OLD_ENV;
});

describe("parseConfigFromEnv", () => {
  it("returns valid EvalRunConfig when all env vars are set", () => {
    process.env.VLLM_ENDPOINT = "http://localhost:8000/v1";
    process.env.VLLM_API_KEY = "test-key";
    process.env.VLLM_MODEL = "test-model";
    process.env.LANGFUSE_PUBLIC_KEY = "pk-test";
    process.env.LANGFUSE_SECRET_KEY = "sk-test";
    process.env.GOLDEN_DATA_DIR = "./golden";
    process.env.EVAL_RESULTS_DIR = "./results";
    process.env.GATE_PRESET = "strict";
    process.env.QUALITY_THRESHOLD = "0.95";
    process.env.CI_MODE = "true";

    const config = parseConfigFromEnv();

    expect(config.vllm.endpoint).toBe("http://localhost:8000/v1");
    expect(config.vllm.apiKey).toBe("test-key");
    expect(config.vllm.model).toBe("test-model");
    expect(config.gate.preset).toBe("strict");
    expect(config.gate.qualityThreshold).toBe(0.95);
    expect(config.ciMode).toBe(true);
  });

  it("throws ZodError when VLLM_ENDPOINT is missing", () => {
    process.env.VLLM_ENDPOINT = undefined;
    process.env.VLLM_API_KEY = "key";
    process.env.VLLM_MODEL = "m";
    process.env.LANGFUSE_PUBLIC_KEY = "pk";
    process.env.LANGFUSE_SECRET_KEY = "sk";

    expect(() => parseConfigFromEnv()).toThrow();
  });

  it("throws ZodError when LANGFUSE_PUBLIC_KEY is missing", () => {
    process.env.VLLM_ENDPOINT = "http://localhost:8000/v1";
    process.env.VLLM_API_KEY = "key";
    process.env.VLLM_MODEL = "m";
    process.env.LANGFUSE_PUBLIC_KEY = undefined;
    process.env.LANGFUSE_SECRET_KEY = "sk";

    expect(() => parseConfigFromEnv()).toThrow();
  });

  it("applies defaults when optional vars are omitted", () => {
    process.env.VLLM_ENDPOINT = "http://localhost:8000/v1";
    process.env.VLLM_API_KEY = "key";
    process.env.VLLM_MODEL = "m";
    process.env.LANGFUSE_PUBLIC_KEY = "pk";
    process.env.LANGFUSE_SECRET_KEY = "sk";

    const config = parseConfigFromEnv();

    expect(config.gate.qualityThreshold).toBe(0.9);
    expect(config.gate.preset).toBe("standard");
    expect(config.ciMode).toBe(false);
    expect(config.goldenDataDir).toBe("./golden");
    expect(config.evalResultsDir).toBe("./results");
  });

  it("throws ZodError when GATE_PRESET is invalid", () => {
    process.env.VLLM_ENDPOINT = "http://localhost:8000/v1";
    process.env.VLLM_API_KEY = "key";
    process.env.VLLM_MODEL = "m";
    process.env.LANGFUSE_PUBLIC_KEY = "pk";
    process.env.LANGFUSE_SECRET_KEY = "sk";
    process.env.GATE_PRESET = "nonexistent";

    expect(() => parseConfigFromEnv()).toThrow();
  });
});
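
The tests above pin down the contract of parseConfigFromEnv: five required variables, defaults for the rest, and rejection of unknown presets. The sketch below restates that contract in dependency-free TypeScript so it can be read and run on its own. It deliberately swaps Zod for hand-written checks and is not the tutorial's parser. The names parseConfigSketch and EvalRunConfigSketch, the langfuse field layout, the 0 to 1 threshold range, and the preset list (only the two values that appear in the tests) are all assumptions.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: only the presets seen in the tests; the real list may be longer.
const KNOWN_PRESETS = ["standard", "strict"] as const;
type GatePreset = (typeof KNOWN_PRESETS)[number];

interface EvalRunConfigSketch {
  vllm: { endpoint: string; apiKey: string; model: string };
  langfuse: { publicKey: string; secretKey: string }; // hypothetical layout
  gate: { preset: GatePreset; qualityThreshold: number };
  goldenDataDir: string;
  evalResultsDir: string;
  ciMode: boolean;
}

// Returns the variable's value or throws when it is unset or blank.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable ${name}`);
  }
  return value;
}

function parseConfigSketch(env: Env): EvalRunConfigSketch {
  const endpoint = requireVar(env, "VLLM_ENDPOINT");
  if (!/^https?:\/\//.test(endpoint)) {
    throw new Error(`VLLM_ENDPOINT must be an http(s) URL, got "${endpoint}"`);
  }

  const preset = env["GATE_PRESET"] ?? "standard";
  if (!(KNOWN_PRESETS as readonly string[]).includes(preset)) {
    throw new Error(`Unknown GATE_PRESET "${preset}"`);
  }

  // Fall back to the default when the variable is unset or blank.
  const rawThreshold = env["QUALITY_THRESHOLD"];
  const qualityThreshold =
    rawThreshold === undefined || rawThreshold.trim() === "" ? 0.9 : Number(rawThreshold);
  if (!Number.isFinite(qualityThreshold) || qualityThreshold < 0 || qualityThreshold > 1) {
    throw new Error(`QUALITY_THRESHOLD must be a number between 0 and 1`);
  }

  return {
    vllm: {
      endpoint,
      apiKey: requireVar(env, "VLLM_API_KEY"),
      model: requireVar(env, "VLLM_MODEL"),
    },
    langfuse: {
      publicKey: requireVar(env, "LANGFUSE_PUBLIC_KEY"),
      secretKey: requireVar(env, "LANGFUSE_SECRET_KEY"),
    },
    gate: { preset: preset as GatePreset, qualityThreshold },
    goldenDataDir: env["GOLDEN_DATA_DIR"] ?? "./golden",
    evalResultsDir: env["EVAL_RESULTS_DIR"] ?? "./results",
    ciMode: env["CI_MODE"] === "true",
  };
}
```

Taking the environment as a parameter, rather than reading process.env inside the function, keeps the parser easy to test without mutating global state; a caller would pass process.env at the entry point.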

Intro

Prerequisites

Step 1: Scaffold the project

Step 2: Configure environment variables

Step 3: Create TypeScript interfaces for evaluation config

Step 4: Build a Zod-validated config parser

Step 5: Build the golden trajectory comparison library

Step 6: Wrap the CI gate engine

Step 7: Wire up Langfuse metric export

Step 8: Wire up the instrumentation hook

Step 9: Build the evaluation pipeline orchestrator

Step 10: Create the CLI entry point

Step 11: Expose the pipeline as an HTTP API route

Step 12: Write and run the test suite

Step 13: Configure the CI pipeline

Next steps