vLLM Agent Eval Harness for Fine-Tuned Model Quality

Intro

This tutorial walks you through building a vLLM Agent Eval Harness — a CLI tool that runs automated, CI/CD-quality evaluations on fine-tuned LLMs hosted behind a local vLLM server. You’ll wire up LLM-as-judge scoring (via GPT-4), per-trajectory cost tracking, regression gate enforcement, and Langfuse observability, all orchestrated through a single CLI with six subcommands.

This recipe is for developers who fine-tune open models on their own hardware and need a structured, repeatable way to verify model quality before deployment.

Prerequisites

Node.js >= 22
pnpm 10 (npm install -g pnpm@10)
A running vLLM server with the OpenAI-compatible endpoint (typically http://localhost:8000/v1)
An OpenAI API key with GPT-4 access (used by the judge)
A Langfuse account (free tier works — optional, toggleable via env var)
Basic familiarity with TypeScript, zod, and the openai SDK

Step 1: Scaffold the project and install dependencies

Start from an empty directory. Create the project with exact-pinned dependencies for all packages you’ll use.

json

{
  "name": "vllm-agent-eval-harness-for-fine-tuned-model-quality",
  "version": "0.1.0",
  "private": true,
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "tsx src/cli/index.ts",
    "lint": "eslint .",

terminal

pnpm install

Expected output: pnpm creates node_modules/ and pnpm-lock.yaml. Every dependency is pinned to an exact semver (no ^ or ~).

Next, create the environment variable template. This file serves as documentation for every env var the harness reads.

env

# .env.example
 
NODE_ENV=development
VLLM_BASE_URL=http://localhost:8000/v1
VLLM_MODEL=<your-fine-tuned-model-name>
VLLM_API_KEY=<optional-vllm-api-key>
VLLM_MAX_TOKENS=4096
VLLM_TEMPERATURE=0.1
OPENAI_API_KEY=<your-openai-key-for-judge>
EVAL_JUDGE_MODEL=gpt-5.2
EVAL_JUDGE_PROVIDER=gpt4
EVAL_JUDGE_TEMPERATURE=0
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_BASE_URL=https://cloud.langfuse.com
LANGFUSE_ENABLED=false
EVAL_CONFIG_PATH=./eval-config.yaml
GOLDEN_DIR=./golden
RESULTS_DIR=./results
EVAL_BUDGET_USD=10.00
EVAL_GATE_PRESET=standard
JUDGE_MOCK=false

Expected output: A .env.example file with 20 environment variables. Copy it to .env and fill in your actual values before running the harness.

Step 2: Define the TypeScript type interfaces

Create src/lib/types.ts with the type interfaces that define the shape of every configuration object in the system.

export interface VLLMConfig {
  baseUrl: string;
  model: string;
  apiKey?: string;
  maxTokens?: number;
  temperature?: number;
}
 
export type JudgeProvider = 'claude' | 'gpt4' | 'gemini' | 'openrouter';
 
export interface JudgeServiceConfig {

Expected output: Six exported types/interfaces. AppConfig is the top-level config that aggregates all sub-configs. JudgeProvider is a union of four literal provider names matching the @reaatech/agent-eval-harness-judge package.
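
One way to keep the JudgeProvider union and its runtime validation in sync is to derive the type from a constant list. The sketch below is an alternative to declaring the union directly; the names JUDGE_PROVIDERS, isJudgeProvider, and resolveJudgeProvider are hypothetical, and the fallback to "gpt4" is an assumption based on the value in .env.example, not confirmed behaviour of the recipe.

```typescript
// Hypothetical sketch: derive the JudgeProvider union from a runtime list
// so the type and the validation set cannot drift apart.
const JUDGE_PROVIDERS = ["claude", "gpt4", "gemini", "openrouter"] as const;

type JudgeProvider = (typeof JUDGE_PROVIDERS)[number];

// Type guard: narrows an arbitrary string to JudgeProvider.
function isJudgeProvider(value: string): value is JudgeProvider {
  return (JUDGE_PROVIDERS as readonly string[]).includes(value);
}

// Assumed fallback: unknown or missing values resolve to "gpt4",
// mirroring EVAL_JUDGE_PROVIDER in .env.example.
function resolveJudgeProvider(raw: string | undefined): JudgeProvider {
  const candidate = (raw ?? "").trim().toLowerCase();
  return isJudgeProvider(candidate) ? candidate : "gpt4";
}
```

Whether an unknown provider should fall back or throw is a design choice; throwing surfaces typos in CI earlier, while a fallback keeps local runs going.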

Step 3: Build the zod schemas and parser functions

Create src/lib/schemas.ts with zod schemas that validate runtime config data and throw descriptive errors on mismatch. Each schema mirrors one of the type interfaces.

import { z } from 'zod';
import type { VLLMConfig, JudgeServiceConfig, ObservabilityConfig, AppConfig } from './types';
 
export const VLLMConfigSchema = z.object({
  baseUrl: z.url(),
  model: z.string().min(1),
  apiKey: z.string().optional(),

Then add the parser functions that use safeParse and throw wrapped errors on failure:

export function parseVLLMConfig(raw: unknown): VLLMConfig {
  const result = VLLMConfigSchema.safeParse(raw);
  if (!result.success) {
    const field = result.error.issues[0]?.path.join('.') || 'unknown'

Expected output: Four zod schemas and four parser functions. Each parser uses safeParse and throws an Error whose message names the specific field that failed. For example, a vLLM config with no baseUrl throws parseVLLMConfig: validation failed for field "baseUrl".
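
The wrap-and-throw pattern can be sketched as one generic helper. To keep the example runnable without zod installed, it uses structural stand-ins (Issue, SafeParseResult, SchemaLike) for the parts of a safeParse result it reads, plus a hand-rolled stub schema; all of these names, and the helper parseWith, are hypothetical and not part of the recipe.

```typescript
// Structural stand-ins for the parts of a zod safeParse result used here.
interface Issue { path: (string | number)[]; message: string }
type SafeParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: { issues: Issue[] } };
interface SchemaLike<T> { safeParse(raw: unknown): SafeParseResult<T> }

// Generic parser: returns typed data or throws an Error naming the first failing field.
function parseWith<T>(name: string, schema: SchemaLike<T>, raw: unknown): T {
  const result = schema.safeParse(raw);
  if (!result.success) {
    // An empty path (a top-level failure) is reported as "unknown".
    const field = result.error.issues[0]?.path.join(".") || "unknown";
    throw new Error(`${name}: validation failed for field "${field}"`);
  }
  return result.data;
}

// Hand-rolled stub schema so the sketch runs without any dependency.
const stubVLLMSchema: SchemaLike<{ baseUrl: string; model: string }> = {
  safeParse(raw) {
    const obj = (raw ?? {}) as Record<string, unknown>;
    for (const key of ["baseUrl", "model"]) {
      if (typeof obj[key] !== "string" || obj[key] === "") {
        return { success: false, error: { issues: [{ path: [key], message: "Required" }] } };
      }
    }
    return { success: true, data: { baseUrl: String(obj.baseUrl), model: String(obj.model) } };
  },
};
```

A real zod schema should satisfy the same structural contract closely enough to be passed in place of the stub, though that is worth verifying against the installed zod version.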

Step 4: Build the configuration loader

Create src/lib/config.ts to read environment variables and optional JSON config files, merge them with zod validation, and return a validated AppConfig.

import { readFileSync } from 'node:fs';
import { parseAppConfig } from './schemas';
import type

Expected output: Three exported functions. loadEnvConfig() reads every var from process.env, parses booleans and numbers with helper functions, and validates the result through parseAppConfig. loadFileConfig(path) reads a JSON file and returns a partial AppConfig. getConfig(path?) merges env config over file config, with env vars always taking final precedence for secrets.
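
A minimal sketch of the env parsing and merge precedence described above. The helper names parseBool and parseNum come from the recipe, but their accepted inputs here are assumptions, and FlatConfig, envConfig, and mergeConfig are hypothetical simplifications of the real AppConfig handling.

```typescript
// Assumed behaviour: unrecognised or missing values fall back to the default.
function parseBool(raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined) return fallback;
  const v = raw.trim().toLowerCase();
  if (["true", "1", "yes"].includes(v)) return true;
  if (["false", "0", "no"].includes(v)) return false;
  return fallback;
}

function parseNum(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return Number.isFinite(n) ? n : fallback;
}

// Hypothetical flattened config; the real AppConfig nests sub-configs.
interface FlatConfig {
  baseUrl?: string;
  apiKey?: string;
  maxTokens?: number;
  langfuseEnabled?: boolean;
}

type Env = Record<string, string | undefined>;

// Reads only the variables that are actually set, leaving the rest undefined.
function envConfig(env: Env): FlatConfig {
  return {
    baseUrl: env.VLLM_BASE_URL,
    apiKey: env.VLLM_API_KEY,
    maxTokens: env.VLLM_MAX_TOKENS === undefined ? undefined : parseNum(env.VLLM_MAX_TOKENS, 4096),
    langfuseEnabled:
      env.LANGFUSE_ENABLED === undefined ? undefined : parseBool(env.LANGFUSE_ENABLED, false),
  };
}

// Env values win whenever they are defined; file values fill the gaps.
function mergeConfig(file: FlatConfig, env: FlatConfig): FlatConfig {
  return {
    baseUrl: env.baseUrl ?? file.baseUrl,
    apiKey: env.apiKey ?? file.apiKey,
    maxTokens: env.maxTokens ?? file.maxTokens,
    langfuseEnabled: env.langfuseEnabled ?? file.langfuseEnabled,
  };
}
```

In the real loader the merged object would then go through parseAppConfig, so malformed values are rejected rather than silently defaulted.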

Step 5: Create the vLLM client adapter

Create src/services/vllm-adapter.ts. This wraps the openai SDK and points it at your local vLLM server’s OpenAI-compatible /v1 endpoint.

Expected output: VLLMClient with generate (non-streaming) and generateStream (async generator) methods. Both wrap API errors in a typed VLLMError preserving status and requestId. The factory createVLLMClient() reads env defaults when called without arguments. vLLM may not require an API key, so the constructor falls back to "not-needed" when none is provided.
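
The error wrapping and API key fallback can be sketched without the openai SDK. The example below is hypothetical: the helper names toVLLMError and resolveApiKey are invented for illustration, and it assumes the underlying error exposes a numeric status and a request id under either requestID or request_id, which should be checked against the SDK version in use.

```typescript
// Typed error that keeps the HTTP status and request id of the failed call.
class VLLMError extends Error {
  readonly status?: number;
  readonly requestId?: string;

  constructor(message: string, status?: number, requestId?: string) {
    super(message);
    this.name = "VLLMError";
    this.status = status;
    this.requestId = requestId;
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Duck-typed wrapper: the field names read here are assumptions about
// the shape of the underlying SDK error.
function toVLLMError(err: unknown): VLLMError {
  if (err instanceof VLLMError) return err;
  const e = (typeof err === "object" && err !== null ? err : {}) as Record<string, unknown>;
  const message = typeof e.message === "string" ? e.message : String(err);
  const status = typeof e.status === "number" ? e.status : undefined;
  const rawId = e.requestID ?? e.request_id;
  const requestId = typeof rawId === "string" ? rawId : undefined;
  return new VLLMError(message, status, requestId);
}

// vLLM may not require a key, so blank or missing values become a placeholder.
function resolveApiKey(apiKey?: string): string {
  const trimmed = apiKey?.trim();
  return trimmed ? trimmed : "not-needed";
}
```

Inside generate and generateStream, a try/catch that rethrows toVLLMError(err) would give callers a single error type to handle.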

Step 6: Build the eval runner (orchestration)

Create src/services/eval-runner.ts. This is the centerpiece that wires the model-under-test through the judge, cost tracker, and gate engine.

Expected output: EvalRunner with four key methods:

evaluateResponse — generates text from vLLM, judges it across 4 dimensions, calculates cost. If vLLM fails, returns an error object instead of crashing.
runBatch — processes trajectories sequentially, accumulates costs in a CostTracker, aggregates scores, runs gateEngine.evaluate() against the configured preset.
compareRuns — uses compareCosts from the cost package to diff two result sets.
generateReport — outputs JSON, JUnit XML, or CSV depending on the format argument.

The factory createEvalRunner(config?, vllm?) reads env var defaults when called with no arguments, providing safe fallbacks for every field. It also validates the judge provider against the known set and normalizes the gate preset name.
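
The batch flow can be sketched with the model and judge injected as plain functions. Everything here is a simplification under stated assumptions: the types, the single numeric score (the real runner judges four dimensions), the flat per-token price, and the gate rule are all hypothetical and do not reflect the @reaatech packages' APIs.

```typescript
// Hypothetical shapes for this sketch; the recipe's real types may differ.
interface Trajectory { id: string; prompt: string }
interface Usage { promptTokens: number; completionTokens: number }
interface EvalResult { id: string; score: number; costUsd: number; error?: string }
interface BatchSummary {
  results: EvalResult[];
  meanScore: number;
  totalCostUsd: number;
  passed: boolean;
}
interface GateOptions { minMeanScore: number; budgetUsd: number; usdPer1kTokens: number }

type Generate = (prompt: string) => Promise<{ text: string; usage: Usage }>;
type Judge = (prompt: string, response: string) => Promise<number>; // score in [0, 1]

async function runBatchSketch(
  trajectories: Trajectory[],
  generate: Generate,
  judge: Judge,
  gate: GateOptions,
): Promise<BatchSummary> {
  const results: EvalResult[] = [];
  let totalCostUsd = 0;

  // Trajectories are processed one at a time, matching the sequential runBatch.
  for (const t of trajectories) {
    try {
      const { text, usage } = await generate(t.prompt);
      const tokens = usage.promptTokens + usage.completionTokens;
      const costUsd = (tokens / 1000) * gate.usdPer1kTokens;
      totalCostUsd += costUsd;

      // A judge failure falls back to score 0 rather than aborting the batch.
      const score = await judge(t.prompt, text).catch(() => 0);
      results.push({ id: t.id, score, costUsd });
    } catch (err) {
      // A generation failure is recorded as an error result, not thrown.
      const message = err instanceof Error ? err.message : String(err);
      results.push({ id: t.id, score: 0, costUsd: 0, error: message });
    }
  }

  const meanScore =
    results.length === 0 ? 0 : results.reduce((sum, r) => sum + r.score, 0) / results.length;
  // Assumed gate rule: non-empty batch, mean score at or above the floor, cost within budget.
  const passed =
    results.length > 0 && meanScore >= gate.minMeanScore && totalCostUsd <= gate.budgetUsd;
  return { results, meanScore, totalCostUsd, passed };
}
```

Injecting generate and judge as functions also makes the orchestration testable without a running vLLM server or an OpenAI key.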

Step 7: Add Langfuse observability

Create src/services/observability.ts to wrap the Langfuse SDK. When LANGFUSE_ENABLED=false, every method becomes a no-op so you can develop offline.

import Langfuse from "langfuse";
import type { LangfuseTraceClient }

Expected output: ObservabilityService with createTrace, recordGeneration, recordScore, isEnabled, and shutdown. When enabled: false or when the Langfuse constructor throws (e.g. network unreachable), isEnabled() returns false and all recording methods silently no-op.
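
The no-op behaviour can be sketched by injecting the tracing backend through a factory function. TraceHandle and TracingBackend below are stand-in interfaces written for this example, not the Langfuse SDK types, and ObservabilitySketch is a hypothetical, trimmed-down version of the service that covers only trace creation and scoring.

```typescript
// Stand-in interfaces for this sketch; they are NOT the Langfuse SDK types.
interface TraceHandle {
  score(data: { name: string; value: number }): void;
}
interface TracingBackend {
  trace(input: { name: string }): TraceHandle;
}

class ObservabilitySketch {
  private backend: TracingBackend | null = null;
  private readonly traces = new Map<string, TraceHandle>();
  private nextId = 0;

  constructor(enabled: boolean, createBackend: () => TracingBackend) {
    if (!enabled) return;
    try {
      this.backend = createBackend();
    } catch {
      // A failing backend constructor disables tracing instead of crashing.
      this.backend = null;
    }
  }

  isEnabled(): boolean {
    return this.backend !== null;
  }

  // Returns a trace id, or null when tracing is disabled.
  createTrace(name: string): string | null {
    if (this.backend === null) return null;
    const id = `trace-${++this.nextId}`;
    this.traces.set(id, this.backend.trace({ name }));
    return id;
  }

  // Silently ignores null and unknown trace ids.
  recordScore(traceId: string | null, name: string, value: number): void {
    if (traceId === null) return;
    this.traces.get(traceId)?.score({ name, value });
  }
}
```

Because callers never branch on isEnabled(), the eval pipeline reads the same whether tracing is on or off.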

Step 8: Wire it all together with the CLI entry point

Create src/cli/index.ts. This is the main entry — it parses command-line arguments with Node 22’s built-in parseArgs and dispatches to six subcommands: eval, judge, compare, gate, report, and golden.

The full file is 519 lines, so here are the key sections.

First, the imports and the usage printer:

import { parseArgs } from "node:util";
import { readFileSync, readdirSync, statSync, mkdirSync, writeFileSync } from "node:fs";
import { join, resolve, dirname } from "node:path";
import { getConfig } from "../lib/config.js";
import { createVLLMClient } from "../services/vllm-adapter.js";
import type { VLLMClient } from "../services/vllm-adapter.js";
import { createEvalRunner } from "../services/eval-runner.js";
import

The subcommand dispatch, which prints usage and returns when no arguments are given:

export async function execCLI(rawArgs: string[]): Promise<void> {
  if (rawArgs.length === 0) {
    printUsage();
    return;
  }
 
  const subcommand = rawArgs[0];
  const subArgs = rawArgs.slice(1);

The eval subcommand is the most complex — it discovers JSONL trajectory files, parses them, runs the batch, writes results, and traces to Langfuse:

async function handleEval(
  subArgs: string[],
  observability:
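
The discovery and parsing part of that flow might look like the sketch below. The Trajectory fields (id, prompt, expectedTool) are assumptions about the JSONL record shape, and parseJsonl, discoverJsonlFiles, and loadTrajectories are hypothetical helper names rather than functions from the recipe.

```typescript
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

// Assumed record shape; adjust to match the real trajectory files.
interface Trajectory { id: string; prompt: string; expectedTool?: string }

// Parses JSONL text, skipping blank lines and reporting the failing line number.
function parseJsonl(text: string, source = "<input>"): Trajectory[] {
  const out: Trajectory[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return;
    let value: unknown;
    try {
      value = JSON.parse(line);
    } catch {
      throw new Error(`${source}:${index + 1}: invalid JSON`);
    }
    if (typeof value !== "object" || value === null) {
      throw new Error(`${source}:${index + 1}: expected a JSON object`);
    }
    const rec = value as Record<string, unknown>;
    if (typeof rec.id !== "string" || typeof rec.prompt !== "string") {
      throw new Error(`${source}:${index + 1}: expected "id" and "prompt" strings`);
    }
    out.push({
      id: rec.id,
      prompt: rec.prompt,
      expectedTool: typeof rec.expectedTool === "string" ? rec.expectedTool : undefined,
    });
  });
  return out;
}

// Accepts a single file or a directory; directories are scanned one level deep.
function discoverJsonlFiles(target: string): string[] {
  if (statSync(target).isFile()) return [target];
  return readdirSync(target)
    .filter((name) => name.endsWith(".jsonl"))
    .sort()
    .map((name) => join(target, name));
}

function loadTrajectories(target: string): Trajectory[] {
  return discoverJsonlFiles(target).flatMap((file) => parseJsonl(readFileSync(file, "utf8"), file));
}
```

Sorting the file names keeps batch order stable between runs, which makes result diffs easier to read.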

Each of the remaining subcommands (judge, compare, gate, report, golden) follows the same pattern: parse its flags, instantiate the relevant @reaatech package classes, execute, and print or write output. The full implementation is at src/cli/index.ts.

Finally, the self-execution wrapper that lets you run the file directly:

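import { fileURLToPath } from "node:url";

// Compare real filesystem paths: a raw URL pathname is percent-encoded and
// does not match process.argv[1] on Windows or for paths containing spaces.
const isMain = process.argv[1] === fileURLToPath(import.meta.url);
if (isMain) {
  void execCLI(process.argv.slice(2));
}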

Expected output: A CLI entry point with 6 dispatchable subcommands plus a default usage printer. Running pnpm start without arguments prints usage. Running pnpm start eval ./golden/ --format json processes trajectory files, runs the full pipeline, and writes results to disk.
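
As a rough illustration of the parse-then-dispatch pattern, the sketch below routes subcommands through a Map and parses the eval flags with parseArgs from node:util. The usage text, handler names, and returned strings are hypothetical; only eval is implemented, and the other five subcommands are placeholders.

```typescript
import { parseArgs } from "node:util";

const USAGE = "usage: <cli> <eval|judge|compare|gate|report|golden> [options]";

type Handler = (args: string[]) => string;

// Sketch of an eval handler: one positional path plus a --format flag.
function handleEvalSketch(args: string[]): string {
  const { values, positionals } = parseArgs({
    args,
    options: { format: { type: "string", short: "f", default: "json" } },
    allowPositionals: true,
  });
  const target = positionals[0];
  if (target === undefined) throw new Error("eval: missing <path> argument");
  return `eval path=${target} format=${values.format}`;
}

const handlers = new Map<string, Handler>([["eval", handleEvalSketch]]);
// Placeholder handlers for the remaining five subcommands.
for (const name of ["judge", "compare", "gate", "report", "golden"]) {
  handlers.set(name, () => `${name}: not implemented in this sketch`);
}

// Returns the output text instead of printing, which keeps the sketch easy to test.
function dispatch(rawArgs: string[]): string {
  const [subcommand, ...subArgs] = rawArgs;
  if (subcommand === undefined) return USAGE;
  const handler = handlers.get(subcommand);
  if (handler === undefined) throw new Error(`unknown subcommand "${subcommand}"\n${USAGE}`);
  return handler(subArgs);
}
```

Calling dispatch(["eval", "./golden/", "--format", "json"]) routes to the eval handler with the path as the first positional. Because parseArgs runs in strict mode by default, an unrecognised flag throws instead of being silently ignored.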

Step 9: Run the tests

This recipe ships with 6 test files covering schemas, config loading, the vLLM client, the EvalRunner, the ObservabilityService, and CLI integration. Run them all now.

terminal

pnpm test

Expected output: All 126 tests pass across 19 test suites. Coverage reaches 100% on lines, functions, and statements, with 95%+ branches on runtime code (src/**/*.ts). The vitest-report.json file is written with numFailedTests: 0 and numPassedTests: 126.

Individual coverage highlights:

schemas.test.ts (16 tests): happy paths (valid configs pass), error paths (missing fields throw), boundaries (temperature edges, empty strings, null input)
config.test.ts (12 tests): env var loading, file config parsing, merge precedence, missing optional vars, parseBool/parseNum edge cases
vllm-adapter.test.ts (20 tests): generate returns text+usage, streaming yields deltas, API errors wrap as VLLMError, empty prompt and empty response edge cases, maxTokens/temperature fallbacks, createVLLMClient env/override behavior
eval-runner.test.ts (25 tests): evaluateResponse success+failure, runBatch with 0/3/mixed trajectories, cost comparison, gate preset differentiation (standard/strict/lenient), judge failure fallback to score 0, expectedTool forwarding, generateReport JSON/JUnit/CSV, createEvalRunner factory with all judge providers and gate presets
observability.test.ts (12 tests): enabled vs disabled, trace creation, recordGeneration/recordScore, Langfuse constructor failure handling, null-trace no-op edge cases, unknown-trace-id no-ops, shutdown
cli/index.test.ts (41 tests): each subcommand dispatch, error paths for missing args, gate --exit-code behavior, report format routing to @reaatech/agent-eval-harness-cli, golden subcommand delegation, verbose output, usage printing

Next steps

Add more judgment types — Extend EvalRunner.evaluateResponse to call the judge with custom prompts via @reaatech/agent-eval-harness-judge’s createCustomTemplate for domain-specific criteria.
Set up baseline comparison in CI — Store the results.json from a known-good run, then use pnpm start compare baseline.json candidate.json --format markdown in your CI pipeline to detect regressions in cost and quality.
Enable Langfuse dashboards — Set LANGFUSE_ENABLED=true with valid keys, then build Langfuse dashboards that track score trends over time, model cost burn rates, and per-trajectory quality breakdowns.
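
For the baseline comparison idea, a CI check could be as small as the sketch below. It is independent of the harness's own compare subcommand: the RunSummary shape, the thresholds, and the function findRegressions are all hypothetical, and a real results.json may use different field names.

```typescript
// Hypothetical summary shape; map the real results.json fields onto it.
interface RunSummary { meanScore: number; totalCostUsd: number }

interface Regression { metric: "score" | "cost"; baseline: number; candidate: number }

interface Thresholds { maxScoreDrop: number; maxCostIncreaseRatio: number }

// Flags a score drop larger than maxScoreDrop, or a cost increase larger
// than maxCostIncreaseRatio relative to the baseline.
function findRegressions(
  baseline: RunSummary,
  candidate: RunSummary,
  thresholds: Thresholds = { maxScoreDrop: 0.02, maxCostIncreaseRatio: 0.1 },
): Regression[] {
  const regressions: Regression[] = [];
  if (baseline.meanScore - candidate.meanScore > thresholds.maxScoreDrop) {
    regressions.push({ metric: "score", baseline: baseline.meanScore, candidate: candidate.meanScore });
  }
  if (candidate.totalCostUsd > baseline.totalCostUsd * (1 + thresholds.maxCostIncreaseRatio)) {
    regressions.push({ metric: "cost", baseline: baseline.totalCostUsd, candidate: candidate.totalCostUsd });
  }
  return regressions;
}
```

In a pipeline, setting process.exitCode to 1 when the returned list is non-empty would fail the job while still letting the regression details be logged.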
