SMBs that fine-tune open models locally lack a structured way to verify model quality before production, exposing them to regressions and failed customer interactions.
This tutorial walks you through building a vLLM Agent Eval Harness — a CLI tool that runs automated, CI/CD-quality evaluations on fine-tuned LLMs hosted behind a local vLLM server. You’ll wire up LLM-as-judge scoring (via GPT-4), per-trajectory cost tracking, regression gate enforcement, and Langfuse observability, all orchestrated through a single CLI with six subcommands.
This recipe is for developers who fine-tune open models on their own hardware and need a structured, repeatable way to verify model quality before deployment.
Expected output: Six exported types/interfaces. AppConfig is the top-level config that aggregates all sub-configs. JudgeProvider is a union of four literal provider names matching the @reaatech/agent-eval-harness-judge package.
Step 3: Build the zod schemas and parser functions
Create src/lib/schemas.ts with zod schemas that validate runtime config data and throw descriptive errors on mismatch. Each schema mirrors one of the type interfaces.
ts
import { z } from 'zod';
import type { VLLMConfig, JudgeServiceConfig, ObservabilityConfig, AppConfig } from './types';

export const VLLMConfigSchema = z.object({
  baseUrl: z.url(),
  model: z.string().min(1),
  apiKey: z.string().optional(),
Then add the parser functions that use safeParse and throw wrapped errors on failure:
ts
export function parseVLLMConfig(raw: unknown): VLLMConfig {
  const result = VLLMConfigSchema.safeParse(raw);
  if (!result.success) {
    const field = result.error.issues[0]?.path.join('.') || 'unknown'
Expected output: Four zod schemas and four parser functions. Each parser uses safeParse and throws an Error whose message names the field that failed. For example, a vLLM config with no baseUrl throws an Error with the message: parseVLLMConfig: validation failed for field "baseUrl".
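Because the parser listing above is cut off, here is a dependency-free sketch of the same wrap-and-throw pattern. It is not the recipe's code: the SafeParseResult type and the hand-written safeParseVLLMConfig validator are hypothetical stand-ins for what VLLMConfigSchema.safeParse would return. Only the field-extraction expression and the error message format come from the text above.

```typescript
// Hypothetical stand-in for the result shape of a zod-style safeParse call.
type Issue = { path: (string | number)[]; message: string };
type SafeParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: { issues: Issue[] } };

interface VLLMConfig {
  baseUrl: string;
  model: string;
  apiKey?: string;
}

// True when the value is a string that the URL constructor accepts.
function isUrl(value: unknown): value is string {
  if (typeof value !== "string") return false;
  try {
    new URL(value);
    return true;
  } catch {
    return false;
  }
}

// Hand-written validator standing in for VLLMConfigSchema.safeParse (assumption).
function safeParseVLLMConfig(raw: unknown): SafeParseResult<VLLMConfig> {
  const obj = (typeof raw === "object" && raw !== null ? raw : {}) as Record<string, unknown>;
  const issues: Issue[] = [];
  if (!isUrl(obj["baseUrl"])) issues.push({ path: ["baseUrl"], message: "must be a URL" });
  if (typeof obj["model"] !== "string" || obj["model"].length < 1) {
    issues.push({ path: ["model"], message: "must be a non-empty string" });
  }
  if (obj["apiKey"] !== undefined && typeof obj["apiKey"] !== "string") {
    issues.push({ path: ["apiKey"], message: "must be a string" });
  }
  if (issues.length > 0) return { success: false, error: { issues } };
  const data: VLLMConfig = { baseUrl: obj["baseUrl"] as string, model: obj["model"] as string };
  if (typeof obj["apiKey"] === "string") data.apiKey = obj["apiKey"];
  return { success: true, data };
}

// Wraps the first validation issue in an Error that names the failing field.
function parseVLLMConfig(raw: unknown): VLLMConfig {
  const result = safeParseVLLMConfig(raw);
  if (!result.success) {
    const field = result.error.issues[0]?.path.join(".") || "unknown";
    throw new Error(`parseVLLMConfig: validation failed for field "${field}"`);
  }
  return result.data;
}
```

Reporting only the first issue keeps the message short. A caller that wants every failing field could join all issue paths instead.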
Step 4: Build the configuration loader
Create src/lib/config.ts to read environment variables and optional JSON config files, merge them with zod validation, and return a validated AppConfig.
ts
import { readFileSync } from 'node:fs';
import { parseAppConfig } from './schemas';
import type
Expected output: Three exported functions. loadEnvConfig() reads every var from process.env, parses booleans and numbers with helper functions, and validates the result through parseAppConfig. loadFileConfig(path) reads a JSON file and returns a partial AppConfig. getConfig(path?) merges env config over file config, with env vars always taking final precedence for secrets.
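The boolean and number helpers and the env-over-file merge described above might look roughly like the following sketch. The SketchConfig shape, the helper names, and the VLLM_BASE_URL and JUDGE_TEMPERATURE variable names are assumptions for illustration. LANGFUSE_ENABLED is the only variable name taken from this tutorial.

```typescript
// Sketch only: a reduced config shape, not the recipe's AppConfig.
type SketchConfig = {
  vllmBaseUrl: string;
  judgeTemperature: number;
  langfuseEnabled: boolean;
};

// Interprets common truthy/falsy strings; anything else yields the fallback.
function parseBool(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined) return fallback;
  const v = value.trim().toLowerCase();
  if (["true", "1", "yes"].includes(v)) return true;
  if (["false", "0", "no"].includes(v)) return false;
  return fallback;
}

// Converts a string to a finite number; empty or non-numeric input yields the fallback.
function parseNumber(value: string | undefined, fallback: number): number {
  if (value === undefined || value.trim() === "") return fallback;
  const n = Number(value);
  return Number.isFinite(n) ? n : fallback;
}

// Reads only the variables that are actually set, so unset ones never mask file values.
function readEnv(env: Record<string, string | undefined>): Partial<SketchConfig> {
  const out: Partial<SketchConfig> = {};
  const baseUrl = env["VLLM_BASE_URL"]; // hypothetical variable name
  const temperature = env["JUDGE_TEMPERATURE"]; // hypothetical variable name
  const langfuse = env["LANGFUSE_ENABLED"];
  if (baseUrl !== undefined) out.vllmBaseUrl = baseUrl;
  if (temperature !== undefined) out.judgeTemperature = parseNumber(temperature, 0);
  if (langfuse !== undefined) out.langfuseEnabled = parseBool(langfuse, false);
  return out;
}

// Env config is spread last, so it wins over the file config on any shared key.
function mergeConfig(
  file: Partial<SketchConfig>,
  env: Partial<SketchConfig>,
): Partial<SketchConfig> {
  return { ...file, ...env };
}
```

In this sketch the merged object would still need to pass through a validator such as parseAppConfig before use, as the step describes.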
Step 5: Create the vLLM client adapter
Create src/services/vllm-adapter.ts. This wraps the openai SDK and points it at your local vLLM server’s OpenAI-compatible /v1 endpoint.
Expected output: VLLMClient with generate (non-streaming) and generateStream (async generator) methods. Both wrap API errors in a typed VLLMError preserving status and requestId. The factory createVLLMClient() reads env defaults when called without arguments. vLLM may not require an API key, so the constructor falls back to "not-needed" when none is provided.
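The adapter source is not shown in this section, so the following is only a sketch of the two behaviors described: wrapping failures in a typed error that keeps status and requestId, and falling back to "not-needed" for the API key. The shape of the incoming error object (plain status and requestId fields) is an assumption. Mapping from the openai SDK's own error class is left out.

```typescript
// Typed error that keeps the HTTP status and request id of the failed call.
class VLLMError extends Error {
  readonly status: number | undefined;
  readonly requestId: string | undefined;

  constructor(message: string, status?: number, requestId?: string) {
    super(message);
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "VLLMError";
    this.status = status;
    this.requestId = requestId;
  }
}

// Converts any thrown value into a VLLMError, copying fields when present.
// Assumes the source error exposes `status` and `requestId` (hypothetical shape).
function toVLLMError(err: unknown): VLLMError {
  if (err instanceof VLLMError) return err;
  const source = (typeof err === "object" && err !== null ? err : {}) as {
    message?: unknown;
    status?: unknown;
    requestId?: unknown;
  };
  const message = typeof source.message === "string" ? source.message : String(err);
  const status = typeof source.status === "number" ? source.status : undefined;
  const requestId = typeof source.requestId === "string" ? source.requestId : undefined;
  return new VLLMError(message, status, requestId);
}

// Runs an async call and rethrows any failure as a VLLMError.
async function withVLLMErrors<T>(call: () => Promise<T>): Promise<T> {
  try {
    return await call();
  } catch (err) {
    throw toVLLMError(err);
  }
}

// Placeholder key for servers that do not check credentials.
function resolveApiKey(apiKey?: string): string {
  return apiKey && apiKey.length > 0 ? apiKey : "not-needed";
}
```

A generate method could then wrap its SDK call as withVLLMErrors(() => client.chat.completions.create(...)), so callers only ever need to catch one error type.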
Step 6: Build the eval runner (orchestration)
Create src/services/eval-runner.ts. This is the centerpiece that wires the model-under-test through the judge, cost tracker, and gate engine.
Expected output: EvalRunner with four key methods:
evaluateResponse — generates text from vLLM, judges it across 4 dimensions, calculates cost. If vLLM fails, returns an error object instead of crashing.
runBatch — processes trajectories sequentially, accumulates costs in a CostTracker, aggregates scores, runs gateEngine.evaluate() against the configured preset.
compareRuns — uses compareCosts from the cost package to diff two result sets.
generateReport — outputs JSON, JUnit XML, or CSV depending on the format argument.
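The sequential batch behavior listed above can be sketched as follows. This is an illustration under stated assumptions, not the EvalRunner source: the Trajectory and EvalResult shapes, the evaluate callback, and the summarize helper are hypothetical, and the real runner delegates cost accumulation to CostTracker and gating to gateEngine.evaluate().

```typescript
// Hypothetical shapes for illustration only.
type Trajectory = { id: string; prompt: string };
type EvalResult =
  | { id: string; ok: true; scores: Record<string, number>; costUsd: number }
  | { id: string; ok: false; error: string };
type EvaluateFn = (trajectory: Trajectory) => Promise<EvalResult>;

// Aggregates successful results: total cost, per-dimension mean score, failure count.
function summarize(results: EvalResult[]) {
  let totalCostUsd = 0;
  let scored = 0;
  const sums: Record<string, number> = {};
  for (const result of results) {
    if (!result.ok) continue;
    scored += 1;
    totalCostUsd += result.costUsd;
    for (const [dimension, score] of Object.entries(result.scores)) {
      sums[dimension] = (sums[dimension] ?? 0) + score;
    }
  }
  const meanScores: Record<string, number> = {};
  for (const [dimension, sum] of Object.entries(sums)) {
    meanScores[dimension] = sum / scored;
  }
  return { totalCostUsd, meanScores, failed: results.length - scored };
}

// Evaluates trajectories one at a time; a failure becomes an error result
// instead of aborting the rest of the batch.
async function runBatchSketch(trajectories: Trajectory[], evaluate: EvaluateFn) {
  const results: EvalResult[] = [];
  for (const trajectory of trajectories) {
    try {
      results.push(await evaluate(trajectory));
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ id: trajectory.id, ok: false, error: message });
    }
  }
  return { results, summary: summarize(results) };
}
```

Running sequentially keeps load on a single local vLLM server predictable. A concurrency limit would be the natural extension if throughput matters more than ordering.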
The factory createEvalRunner(config?, vllm?) reads env var defaults when called with no arguments, providing safe fallbacks for every field. It also validates the judge provider against the known set and normalizes the gate preset name.
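The validation and normalization the factory performs might look something like the sketch below. The three preset names match the GatePreset type exported by this recipe. The function names, the trimming and lowercasing, and the choice of "standard" as the fallback are assumptions.

```typescript
type GatePreset = "standard" | "strict" | "lenient";
const GATE_PRESETS: readonly GatePreset[] = ["standard", "strict", "lenient"];

// Normalizes user input to a known preset; unknown or missing input yields the fallback.
// The "standard" default is an assumption for this sketch.
function normalizeGatePreset(
  raw: string | undefined,
  fallback: GatePreset = "standard",
): GatePreset {
  const candidate = (raw ?? "").trim().toLowerCase();
  return (GATE_PRESETS as readonly string[]).includes(candidate)
    ? (candidate as GatePreset)
    : fallback;
}

// Generic membership check, usable for the judge provider set or any other closed list.
function assertOneOf<T extends string>(
  value: string,
  allowed: readonly T[],
  label: string,
): T {
  if ((allowed as readonly string[]).includes(value)) return value as T;
  throw new Error(`${label}: unknown value "${value}" (expected one of ${allowed.join(", ")})`);
}
```

Normalizing the preset silently while rejecting an unknown provider loudly is one reasonable split: a wrong preset only changes thresholds, whereas a wrong provider cannot run at all.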
Step 7: Add Langfuse observability
Create src/services/observability.ts to wrap the Langfuse SDK. When LANGFUSE_ENABLED=false, every method becomes a no-op so you can develop offline.
ts
import Langfuse from "langfuse";
import type { LangfuseTraceClient }
Expected output: ObservabilityService with createTrace, recordGeneration, recordScore, isEnabled, and shutdown. When enabled: false or when the Langfuse constructor throws (e.g. network unreachable), isEnabled() returns false and all recording methods silently no-op.
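The no-op behavior can be illustrated without the Langfuse SDK. In this sketch the TraceSink interface and the injected createSink factory are hypothetical stand-ins for the real client and its constructor. The point is only the pattern: if the service is disabled or construction throws, every recording method quietly does nothing.

```typescript
// Hypothetical minimal interface standing in for the tracing client.
interface TraceSink {
  trace(name: string): void;
  score(name: string, value: number): void;
  flush(): Promise<void>;
}

class ObservabilitySketch {
  private sink: TraceSink | null = null;

  // createSink stands in for constructing the real client and is allowed to throw.
  constructor(enabled: boolean, createSink: () => TraceSink) {
    if (!enabled) return;
    try {
      this.sink = createSink();
    } catch {
      this.sink = null; // Construction failed: degrade to no-op mode.
    }
  }

  isEnabled(): boolean {
    return this.sink !== null;
  }

  // Optional chaining makes each method a no-op when there is no sink.
  createTrace(name: string): void {
    this.sink?.trace(name);
  }

  recordScore(name: string, value: number): void {
    this.sink?.score(name, value);
  }

  async shutdown(): Promise<void> {
    await this.sink?.flush();
  }
}
```

Keeping the null check inside the service means call sites never branch on whether tracing is configured.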
Step 8: Wire it all together with the CLI entry point
Create src/cli/index.ts. This is the main entry — it parses command-line arguments with Node 22’s built-in parseArgs and dispatches to six subcommands: eval, judge, compare, gate, report, and golden.
The full file is 519 lines, so here are the key sections.
First, the imports and the usage printer:
ts
import { parseArgs } from "node:util";
import { readFileSync, readdirSync, statSync, mkdirSync, writeFileSync } from "node:fs";
import { join, resolve, dirname } from "node:path";
import { getConfig } from "../lib/config.js";
import { createVLLMClient } from "../services/vllm-adapter.js";
import type { VLLMClient } from "../services/vllm-adapter.js";
import { createEvalRunner } from "../services/eval-runner.js";
import
The eval subcommand is the most complex — it discovers JSONL trajectory files, parses them, runs the batch, writes results, and traces to Langfuse:
ts
async function handleEval( subArgs: string[], observability:
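Since the handleEval listing is truncated, the file discovery and JSONL parsing steps it is described as performing are sketched here separately. The function names, the non-recursive directory scan, and the error message format are assumptions. The actual handler may differ.

```typescript
import { readdirSync, statSync } from "node:fs";
import { join } from "node:path";

// Accepts either a single file or a directory; for a directory, returns its
// .jsonl entries in sorted order (non-recursive, an assumption of this sketch).
function discoverJsonlFiles(target: string): string[] {
  if (statSync(target).isFile()) return [target];
  return readdirSync(target)
    .filter((name) => name.endsWith(".jsonl"))
    .sort()
    .map((name) => join(target, name));
}

// Parses one JSON value per non-blank line and reports the 1-based line
// number of the first line that is not valid JSON.
function parseJsonl(text: string, source = "<input>"): unknown[] {
  const records: unknown[] = [];
  const lines = text.split(/\r?\n/);
  for (let i = 0; i < lines.length; i += 1) {
    const line = (lines[i] ?? "").trim();
    if (line === "") continue;
    try {
      records.push(JSON.parse(line));
    } catch {
      throw new Error(`${source}:${i + 1}: invalid JSON`);
    }
  }
  return records;
}
```

Each parsed record would still need schema validation before being treated as a trajectory, since JSON.parse only guarantees syntactic validity.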
Each of the remaining subcommands (judge, compare, gate, report, golden) follows the same pattern: parse its flags, instantiate the relevant REAA package classes, execute, and print or write output. The full implementation is at src/cli/index.ts.
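The shared parse-then-dispatch pattern can be sketched with parseArgs from node:util as follows. The six subcommand names and the --format flag come from this tutorial. The --verbose flag name, the "json" default, and the ParsedCli shape are assumptions for illustration.

```typescript
import { parseArgs } from "node:util";

const SUBCOMMANDS = ["eval", "judge", "compare", "gate", "report", "golden"] as const;
type Subcommand = (typeof SUBCOMMANDS)[number];

type ParsedCli =
  | { command: "usage" }
  | { command: Subcommand; positionals: string[]; format: string; verbose: boolean };

function isSubcommand(value: string | undefined): value is Subcommand {
  return value !== undefined && (SUBCOMMANDS as readonly string[]).includes(value);
}

// Splits off the subcommand, then parses the remaining flags and positionals.
// A missing or unknown subcommand falls through to the usage printer.
function parseCli(argv: string[]): ParsedCli {
  const [command, ...subArgs] = argv;
  if (!isSubcommand(command)) return { command: "usage" };
  const { values, positionals } = parseArgs({
    args: subArgs,
    options: {
      format: { type: "string", default: "json" },
      verbose: { type: "boolean", default: false }, // hypothetical flag name
    },
    allowPositionals: true,
  });
  return {
    command,
    positionals,
    format: values.format ?? "json",
    verbose: values.verbose ?? false,
  };
}
```

Because parseArgs runs in strict mode by default, an unrecognized flag throws, which a real CLI would catch and turn into a usage message with a non-zero exit code.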
Finally, the self-execution wrapper that lets you run the file directly:
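The wrapper listing is not reproduced in this section. A common way to write one in an ES module is sketched below, with the caveat that isEntryPoint and runMain are hypothetical names and the recipe's actual wrapper may be structured differently.

```typescript
import { pathToFileURL } from "node:url";

// True when the module identified by moduleUrl is the script Node was started with.
function isEntryPoint(moduleUrl: string, argv1: string | undefined): boolean {
  if (!argv1) return false;
  return moduleUrl === pathToFileURL(argv1).href;
}

// Runs main, forwards its exit code, and maps any thrown error to exit code 1.
async function runMain(
  main: () => Promise<number>,
  exit: (code: number) => void,
): Promise<void> {
  try {
    exit(await main());
  } catch (err) {
    console.error(err instanceof Error ? err.message : String(err));
    exit(1);
  }
}

// In the entry file this would be wired up as (shown as a comment, not executed here):
// if (isEntryPoint(import.meta.url, process.argv[1])) {
//   void runMain(() => main(process.argv.slice(2)), (code) => process.exit(code));
// }
```

Guarding on the entry point keeps the module importable from tests without triggering the CLI, which matters for the CLI integration tests in the next step.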
Expected output: A CLI entry point with 6 dispatchable subcommands plus a default usage printer. Running pnpm start without arguments prints usage. Running pnpm start eval ./golden/ --format json processes trajectory files, runs the full pipeline, and writes results to disk.
Step 9: Run the tests
This recipe ships with 6 test files covering schemas, config loading, the VLLM client, the EvalRunner, the ObservabilityService, and CLI integration. Run them all now.
terminal
pnpm test
Expected output: All 126 tests pass across 19 test suites. Coverage reaches 100% on lines, functions, and statements, with 95%+ branches on runtime code (src/**/*.ts). The vitest-report.json file is written with numFailedTests: 0 and numPassedTests: 126.
cli/index.test.ts: 41 tests covering each subcommand dispatch, error paths for missing args, gate --exit-code behavior, report format routing to @reaatech/agent-eval-harness-cli, golden subcommand delegation, verbose output, and usage printing
Next steps
Add more judgment types — Extend EvalRunner.evaluateResponse to call the judge with custom prompts via @reaatech/agent-eval-harness-judge’s createCustomTemplate for domain-specific criteria.
Set up baseline comparison in CI — Store the results.json from a known-good run, then use pnpm start compare baseline.json candidate.json --format markdown in your CI pipeline to detect regressions in cost and quality.
Enable Langfuse dashboards — Set LANGFUSE_ENABLED=true with valid keys, then build Langfuse dashboards that track score trends over time, model cost burn rates, and per-trajectory quality breakdowns.
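For the baseline comparison idea above, the kind of check a CI job could apply to two result summaries is sketched here. This is not the compare subcommand's logic: the RunSummary fields and the threshold defaults are hypothetical, and the actual results.json layout is not shown in this tutorial.

```typescript
// Hypothetical summary shape; the real results.json fields may differ.
type RunSummary = { meanScore: number; totalCostUsd: number };

type RegressionOptions = {
  maxScoreDrop: number; // largest tolerated absolute drop in mean score
  maxCostIncreaseRatio: number; // largest tolerated relative cost increase
};

// Flags a regression when quality drops or cost rises beyond the thresholds.
function detectRegression(
  baseline: RunSummary,
  candidate: RunSummary,
  options: RegressionOptions = { maxScoreDrop: 0.05, maxCostIncreaseRatio: 0.1 },
): { regressed: boolean; reasons: string[] } {
  const reasons: string[] = [];
  const scoreDrop = baseline.meanScore - candidate.meanScore;
  if (scoreDrop > options.maxScoreDrop) {
    reasons.push(`mean score dropped by ${scoreDrop.toFixed(3)}`);
  }
  if (baseline.totalCostUsd > 0) {
    const ratio = (candidate.totalCostUsd - baseline.totalCostUsd) / baseline.totalCostUsd;
    if (ratio > options.maxCostIncreaseRatio) {
      reasons.push(`cost increased by ${(ratio * 100).toFixed(1)}%`);
    }
  }
  return { regressed: reasons.length > 0, reasons };
}
```

A CI step would exit non-zero when regressed is true and print the reasons, so the pipeline log shows why the candidate was rejected.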
"typecheck"
:
"tsc --noEmit"
,
"test": "vitest run --coverage --reporter=json --outputFile=vitest-report.json"
},
"dependencies": {
"@ai-sdk/openai-compatible": "2.0.47",
"@reaatech/agent-eval-harness-cli": "0.1.0",
"@reaatech/agent-eval-harness-cost": "0.1.0",
"@reaatech/agent-eval-harness-gate": "0.1.0",
"@reaatech/agent-eval-harness-judge": "0.1.0",
"langfuse": "3.38.20",
"next": "16.2.6",
"openai": "6.38.0",
"react": "19.2.4",
"react-dom": "19.2.4",
"zod": "4.4.3"
},
"devDependencies": {
"@types/node": "20.17.30",
"@types/react": "19.1.4",
"@types/react-dom": "19.1.4",
"@vitest/coverage-v8": "4.1.7",
"eslint": "9.27.0",
"eslint-config-next": "16.2.6",
"msw": "2.14.6",
"tsx": "4.19.4",
"typescript": "5.8.3",
"typescript-eslint": "8.59.4",
"vitest": "4.1.7"
},
"type": "module",
"engines": {
"node": ">=22"
},
"packageManager": "pnpm@10.0.0"
}
model: string;
provider: JudgeProvider;
temperature?: number;
maxTokens?: number;
apiKey?: string;
}
export type GatePreset = 'standard' | 'strict' | 'lenient';