SMBs deploying AI agents have no systematic way to test whether updates or new prompts break business-critical tasks, which leads to customer-facing errors and erodes trust.
A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.
You’ll build a CLI evaluation harness that lets your small business systematically test AI agents before they touch customers. By the end, you’ll have three commands — init, run, and report — that scaffold an evaluation config, execute LangChain-powered agent evaluations across multiple scenarios, and export structured reports with scores, latency percentiles, and regression detection. The harness uses REAA’s eval suite packages for golden dataset management, LLM-as-judge scoring, latency tracking, and OpenTelemetry observability, all wired together with Commander, Zod, and TypeScript.
Prerequisites
Node.js >= 22 (the package.json engines field requires it)
pnpm 10.x (the project uses pnpm@10.15.1 as its package manager)
OpenAI, Anthropic, Gemini, or OpenRouter API key — the judge engine needs at least one provider to score agent responses
Familiarity with TypeScript and running CLI tools from the terminal
Step 1: Scaffold the project and install dependencies
Create a new directory and set up a package.json with all the dependencies the harness needs. The project uses ES modules ("type": "module") and TypeScript with NodeNext module resolution.
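Create the directory, then add a package.json along these lines. This is a sketch: the dependency list mirrors what later steps import, but the version ranges are placeholders, so prefer the exact versions pinned in the downloadable artifact.
json
{
  "name": "agent-eval-harness-cli",
  "version": "1.0.0",
  "type": "module",
  "packageManager": "pnpm@10.15.1",
  "engines": { "node": ">=22" },
  "scripts": {
    "typecheck": "tsc --noEmit",
    "test": "vitest run --coverage"
  },
  "dependencies": {
    "@reaatech/agent-eval-harness-golden": "^1.0.0",
    "@reaatech/agent-eval-harness-judge": "^1.0.0",
    "@reaatech/agent-eval-harness-latency": "^1.0.0",
    "@reaatech/agent-eval-harness-observability": "^1.0.0",
    "@reaatech/agent-eval-harness-suite": "^1.0.0",
    "commander": "^12.0.0",
    "js-yaml": "^4.1.0",
    "picocolors": "^1.0.0",
    "zod": "^3.23.0"
  },
  "devDependencies": {
    "@types/js-yaml": "^4.0.9",
    "@types/node": "^22.0.0",
    "tsx": "^4.0.0",
    "typescript": "^5.5.0",
    "vitest": "^2.0.0"
  }
}
Then install:
terminal
pnpm install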
Expected output: pnpm downloads and links all dependencies. You should see a node_modules/ directory appear, and pnpm install completes without errors.
Step 2: Configure TypeScript and Vitest
Set up TypeScript to target ES2022 with NodeNext module resolution, and configure Vitest with a test setup file and 90% coverage thresholds.
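Minimal sketches of both files under those constraints follow; adjust if your layout differs. First, tsconfig.json:
json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "strict": true,
    "declaration": true,
    "outDir": "dist",
    "skipLibCheck": true
  },
  "include": ["src", "tests"]
}
Then vitest.config.ts (the json reporter and outputFile produce the vitest-report.json mentioned in Step 10):
ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    setupFiles: ['./tests/setup.ts'],
    reporters: ['default', 'json'],
    outputFile: './vitest-report.json',
    coverage: {
      provider: 'v8',
      thresholds: { lines: 90, functions: 90, branches: 90, statements: 90 },
    },
  },
});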
The setupFiles entry points at ./tests/setup.ts, which we’ll fill in later.
Step 3: Set environment variables
The judge engine authenticates against LLM providers using standard API key variables. Create a .env file in the project root with every key the harness might use:
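A sketch of the file follows. The provider variable names below are the usual conventions, but check the judge package's docs for the exact names it reads; the Gemini key in particular varies between GEMINI_API_KEY and GOOGLE_API_KEY. The collector URL uses the standard OTLP/HTTP port as an example value.
code
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
OPENROUTER_API_KEY=sk-or-...
EVAL_OBSERVABILITY_URL=http://localhost:4318
LOG_LEVEL=info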
Set at least one provider key — the judge engine picks the provider you specify in the eval config. The EVAL_OBSERVABILITY_URL points to an OpenTelemetry collector (optional; metrics recording works without it in local mode). LOG_LEVEL defaults to info if left unset.
Step 4: Create the configuration loader
The evaluation harness reads its entire setup from a single JSON file. You need a Zod-validated loader that parses the file, checks required fields, and surfaces clear errors for missing or malformed configs.
Create the src/lib/ directory, then create src/lib/config.ts:
ts
import { z } from 'zod';
import { readFileSync } from 'node:fs';

/** Schema for a single scenario in the eval configuration. */
const ScenarioSchema = z.object({
  name: z.string(),
  description: z.string().optional(),
  trajectoryPath: z.string().optional(),
  tags: z.array(z.string()).optional(),
});

/** Schema for judge configuration in the eval config. */
const JudgeConfigSchema = z.object({
  model: z.string().optional(),
  provider: z.string().optional(),
  temperature: z.number().optional(),
  maxTokens: z.number().optional(),
  budgetLimit: z.number().optional(),
  calibrationEnabled: z.boolean().optional(),
});

/** Schema for latency configuration in the eval config. */
const LatencyConfigSchema = z.object({
  preset: z.enum(['strict', 'moderate', 'lenient']).optional(),
  p50ThresholdMs: z.number().optional(),
  p90ThresholdMs: z.number().optional(),
  p99ThresholdMs: z.number().optional(),
  maxTurnMs: z.number().optional(),
  totalMs: z.number().optional(),
});

/** Schema for observability configuration in the eval config. */
const ObsConfigSchema = z.object({
  logLevel: z.string().optional(),
  logFormat: z.enum(['json', 'pretty']).optional(),
  metricsEnabled: z.boolean().optional(),
  tracingEnabled: z.boolean().optional(),
  dashboardEnabled: z.boolean().optional(),
  otlpEndpoint: z.string().optional(),
});

/** Schema for the full eval configuration file. */
const EvalConfigSchema = z.object({
  name: z.string(),
  description: z.string().optional(),
  scenarios: z.array(ScenarioSchema),
  judge: JudgeConfigSchema.optional(),
  latency: LatencyConfigSchema.optional(),
  observability: ObsConfigSchema.optional(),
});

export type Scenario = z.infer<typeof ScenarioSchema>;
export type JudgeConfig = z.infer<typeof JudgeConfigSchema>;
export type LatencyConfig = z.infer<typeof LatencyConfigSchema>;
export type ObsConfig = z.infer<typeof ObsConfigSchema>;
export type EvalConfig = z.infer<typeof EvalConfigSchema>;

export function loadConfig(filePath: string): EvalConfig {
  let raw: string;
  try {
    raw = readFileSync(filePath, 'utf-8');
  } catch (err) {
    const nodeErr = err as NodeJS.ErrnoException;
    if (nodeErr.code === 'ENOENT') {
      throw new Error(`Config file not found: ${filePath}`);
    }
    throw new Error(`Failed to read config file at ${filePath}: ${(err as Error).message}`);
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error(`Invalid JSON syntax in config file: ${filePath}`);
  }
  const result = EvalConfigSchema.safeParse(parsed);
  if (!result.success) {
    const issues = result.error.issues
      .map((i) => `${i.path.join('.')}: ${i.message}`)
      .join('; ');
    throw new Error(`Config validation failed for ${filePath}: ${issues}`);
  }
  return result.data;
}
This module gives you both runtime validation and TypeScript types (EvalConfig, Scenario, JudgeConfig, etc.) from the same Zod schemas.
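A quick usage sketch (the config path is illustrative):
ts
import { loadConfig } from './src/lib/config.js';

// Throws with a descriptive message if the file is missing, malformed, or invalid.
const config = loadConfig('./eval.config.json');
console.log(`Loaded "${config.name}" with ${config.scenarios.length} scenario(s)`);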
Step 5: Build the golden dataset module
Golden datasets are reference agent trajectories stored in JSONL format. This module wraps the @reaatech/agent-eval-harness-golden package to load, create, validate, and compare golden entries. It also defines the local TypeScript interfaces used throughout the harness.
Create src/lib/golden.ts:
ts
import { readFileSync } from 'node:fs';
import {
  loadGoldenTrajectories,
  createGolden,
  validateGolden,
  compareAgainstGolden,
} from '@reaatech/agent-eval-harness-golden';

export interface TrajectoryComparisonResult {
  similarity: number;
  turnComparisons: TurnComparison[];
  matchingTurns: number;
  divergentTurns: number;
  passesThreshold: boolean;
  regressions: Regression[];
  diffSummary: string;
}

export interface TurnComparison {
  turnId: number;
  similarity: number;
  contentMatch: boolean;
  toolMatch: boolean;
  differences: string[];
}

export interface Regression {
  type: string;
  severity: string;
  turnId: number;
  description: string;
}

export interface ComparisonConfig {
  similarityThreshold: number;
  compareTools: boolean;
  semanticComparison: boolean;
  turnMatching: string;
}

export interface GoldenValidationResult {
  valid: boolean;
  errors: string[];
  warnings: string[];
  score: number;
}

export function loadGoldenDataset(filePath: string): ReturnType<typeof loadGoldenTrajectories> {
  let content: string;
  try {
    content = readFileSync(filePath, 'utf-8');
  } catch (err) {
    const nodeErr = err as NodeJS.ErrnoException;
    if (nodeErr.code === 'ENOENT') {
      throw new Error(`Golden dataset file not found: ${filePath}`);
    }
    throw new Error(`Failed to read golden dataset at ${filePath}: ${(err as Error).message}`);
  }
  try {
    return loadGoldenTrajectories(content);
  } catch (err) {
    throw new Error(
      `Failed to parse golden dataset at ${filePath}: expected JSONL format with one JSON object per line: ${(err as Error).message}`,
    );
  }
}

export function createGoldenEntry(
  trajectory: unknown,
  options?: { description?: string; tags?: string[] },
): ReturnType<typeof createGolden> {
  try {
    return createGolden(
      trajectory as Parameters<typeof createGolden>[0],
      options as Parameters<typeof createGolden>[1],
    );
  } catch (err) {
    throw new Error(`Failed to create golden entry: ${(err as Error).message}`);
  }
}

export function validateGoldenDataset(golden: unknown): GoldenValidationResult {
  try {
    return validateGolden(golden as Parameters<typeof validateGolden>[0]) as never;
  } catch (err) {
    throw new Error(`Failed to validate golden dataset: ${(err as Error).message}`);
  }
}

export function compareTrajectories(
  golden: unknown,
  candidate: unknown,
  config?: Partial<ComparisonConfig>,
): TrajectoryComparisonResult {
  try {
    return compareAgainstGolden(
      golden as Parameters<typeof compareAgainstGolden>[0],
      candidate as Parameters<typeof compareAgainstGolden>[1],
      config as Parameters<typeof compareAgainstGolden>[2],
    ) as TrajectoryComparisonResult;
  } catch (err) {
    throw new Error(`Failed to compare trajectories: ${(err as Error).message}`);
  }
}
Each function wraps a REAA package call with proper error handling and re-throws with descriptive messages so the CLI can report failures clearly.
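A hypothetical regression check, assuming the loader returns an array of golden entries (the dataset path and trajectory shape are placeholders):
ts
import { loadGoldenDataset, compareTrajectories } from './src/lib/golden.js';

// Hypothetical candidate trajectory; the real shape comes from your agent runs.
const candidate = { turns: [] };

const goldens = loadGoldenDataset('./goldens/checkout.jsonl') as unknown[];
const result = compareTrajectories(goldens[0], candidate, { similarityThreshold: 0.85 });
if (!result.passesThreshold) {
  console.error(`Regression detected:\n${result.diffSummary}`);
}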
Step 6: Build the judge and latency modules
The judge module uses @reaatech/agent-eval-harness-judge to score agent responses with an LLM. It exposes single-request and batch evaluation functions. The latency module wraps @reaatech/agent-eval-harness-latency to track P50/P90/P99 percentiles and enforce SLA budgets.
Create src/lib/judge.ts:
ts
import { JudgeEngine } from '@reaatech/agent-eval-harness-judge';
import type { JudgeRequest, JudgeScore, JudgeConfig } from '@reaatech/agent-eval-harness-judge';

export type { JudgeRequest, JudgeScore, JudgeConfig };

export async function evaluateWithJudge(
  request: JudgeRequest,
  config: JudgeConfig,
): Promise<JudgeScore> {
  const engine = new JudgeEngine(config);
  try {
    return await engine.judge(request);
  } catch (err) {
    throw new Error(`Judge evaluation failed: ${(err as Error).message}`);
  }
}

export async function evaluateBatch(
  requests: Array<{ id: string; request: JudgeRequest }>,
  config: JudgeConfig,
  concurrency?: number,
): Promise<unknown> {
  const engine = new JudgeEngine(config);
  try {
    return await engine.judgeBatch(requests, concurrency);
  } catch (err) {
    throw new Error(`Batch judge evaluation failed: ${(err as Error).message}`);
  }
}
Create src/lib/latency.ts:
ts
import {
  monitorLatency,
  enforceBudget,
  createLatencyBudget,
  LatencyTracker,
} from '@reaatech/agent-eval-harness-latency';

export interface LatencyResult {
  totalLatencyMs: number;
  avgLatencyMs: number;
  p50Ms: number;
  p90Ms: number;
  p99Ms: number;
  maxLatencyMs: number;
  minLatencyMs: number;
  turnCount: number;
}

export interface BudgetEnforcementResult {
  passed: boolean;
  violations: LatencyViolation[];
  score: number;
}

export interface LatencyViolation {
  type: string;
  severity: string;
  description: string;
  actual: number;
  threshold: number;
  turnId?: number;
}

export function trackLatency(trajectory: unknown): LatencyResult {
  const result = monitorLatency(trajectory as Parameters<typeof monitorLatency>[0]);
  return result as LatencyResult;
}

export function checkSLABudget(
  result: LatencyResult,
  preset: 'strict' | 'moderate' | 'lenient',
): BudgetEnforcementResult {
  const budget = createLatencyBudget(preset);
  const enforcement = enforceBudget(
    result as Parameters<typeof enforceBudget>[0],
    budget as Parameters<typeof enforceBudget>[1],
  );
  return enforcement as BudgetEnforcementResult;
}

export function createTracker(): LatencyTracker {
  return new LatencyTracker();
}

export function getTrackerTrend(tracker: LatencyTracker): ReturnType<LatencyTracker['getTrend']> {
  return tracker.getTrend();
}
Together, evaluateWithJudge scores single responses, checkSLABudget validates whether latency falls within your chosen preset, and createTracker gives you trend data across multiple runs.
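A hypothetical SLA check (the trajectory shape is a placeholder; the real one comes from the REAA latency package):
ts
import { trackLatency, checkSLABudget } from './src/lib/latency.js';

const trajectory = { turns: [] }; // placeholder shape
const latency = trackLatency(trajectory);
console.log(`p50=${latency.p50Ms}ms p90=${latency.p90Ms}ms p99=${latency.p99Ms}ms`);

const budget = checkSLABudget(latency, 'moderate');
if (!budget.passed) {
  for (const v of budget.violations) {
    console.warn(`${v.severity} ${v.type}: ${v.description} (${v.actual}ms > ${v.threshold}ms)`);
  }
}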
Step 7: Build the observability module
The observability module initializes logging, metrics, tracing, and dashboard managers from @reaatech/agent-eval-harness-observability. It provides functions to record evaluation results and flush or shut down telemetry at process exit.
Each manager is a singleton — the first call initializes it, subsequent calls return the same instance.
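Create src/lib/observability.ts. A minimal sketch follows; only getLogger is confirmed by this recipe (Step 8 uses it), so the logger calls, the manager wiring behind the ellipsis comments, and everything else here are assumptions. Swap in the package's real initializers, keeping the singleton guard the paragraph above describes:
ts
import { getLogger } from '@reaatech/agent-eval-harness-observability';
import type { ObsConfig } from './config.js';

// Singleton guard: the first init call wins, later calls are no-ops.
let initialized = false;

export function initObservability(config?: ObsConfig): void {
  if (initialized) return;
  // ... construct the logging/metrics/tracing/dashboard managers from the
  // REAA package here (their constructors are not shown in this recipe).
  initialized = true;
  getLogger().info(`Observability initialized (logLevel=${config?.logLevel ?? 'info'})`);
}

export function recordRunMetrics(run: Record<string, unknown>): void {
  // ... forward scores and latency percentiles to the metrics manager (assumed API).
  getLogger().info(`Recorded metrics for run keys: ${Object.keys(run).join(', ')}`);
}

export async function flushObservability(): Promise<void> {
  // ... flush any buffered metric/trace exports before process exit (assumed API).
}

export async function shutdownObservability(): Promise<void> {
  await flushObservability();
  initialized = false;
}

export async function withTracing<T>(name: string, fn: () => Promise<T>): Promise<T> {
  // ... wrap fn in an OpenTelemetry span via the tracing manager (assumed API);
  // this sketch just passes the function through.
  getLogger().info(`trace: ${name}`);
  return fn();
}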
Step 8: Create the suite runner
The runner orchestrates the full evaluation pipeline: it converts a config into YAML for @reaatech/agent-eval-harness-suite, creates a suite runner with concurrency and timeout settings, executes all trajectories, and aggregates results. It also exposes a detectRegressions function for comparing two runs.
Create src/lib/runner.ts:
ts
import * as YAML from 'js-yaml';
import {
  createSuiteRunner,
  parseConfig,
  createRunComparator,
  createResultsAggregator,
} from '@reaatech/agent-eval-harness-suite';
import type {
  SuiteConfig,
  EvalRunResult,
  RunComparisonResult,
} from '@reaatech/agent-eval-harness-suite';
import { getLogger } from '@reaatech/agent-eval-harness-observability';

export type { SuiteConfig, EvalRunResult, RunComparisonResult };

export async function runEvaluation(
  config: {
    concurrency?: number;
    continueOnError?: boolean;
    timeoutMs?: number;
    metrics?: string[];
  },
  trajectories: unknown[],
  evaluator: (trajectory: unknown) => Promise<Record<string, unknown>>,
): Promise<EvalRunResult> {
  const yamlString = YAML.dump({
    metrics: config.metrics ?? ['faithfulness', 'relevance'],
    judge_model: 'claude-opus',
    budget_limit: 10.00,
    parallel_workers: config.concurrency ?? 1,
  });
  const parsedConfig = parseConfig(yamlString);
  const runner = createSuiteRunner({
    concurrency: config.concurrency ?? 1,
    continueOnError: config.continueOnError ?? true,
    timeoutMs: config.timeoutMs ?? 30000,
    metrics: config.metrics ?? ['faithfulness', 'relevance'],
  });
  const result = await runner.run(trajectories as never, evaluator as never);
  const aggregator = createResultsAggregator(parsedConfig);
  aggregator.aggregate(result as never);
  return result;
}

export function detectRegressions(
  baselineRun: EvalRunResult,
  candidateRun: EvalRunResult,
): RunComparisonResult {
  const comparator = createRunComparator();
  const comparison = comparator.compare(baselineRun as never, candidateRun as never);
  const logger = getLogger();
  if (comparison.regressions.length > 0) {
    logger.warn('Regressions detected in comparison: ' + String(comparison.regressions.length));
  }
  return comparison;
}
The runEvaluation function is the heart of the harness — it accepts a config, trajectories, and an evaluator function, then hands everything to the suite package for parallel execution.
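As a smoke test, you can hand it a trivial evaluator; the trajectory shape below is a placeholder, not the real REAA schema:
ts
import { runEvaluation } from './src/lib/runner.js';

// Hypothetical smoke test: one placeholder trajectory, an evaluator that always passes.
const result = await runEvaluation(
  { concurrency: 2, timeoutMs: 10_000, metrics: ['faithfulness'] },
  [{ name: 'example-scenario', turns: [] }],
  async () => ({ score: 1.0 }),
);
console.log(result);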
Step 9: Wire up the CLI
Now you’ll build the three commands users actually type. The entry point uses Commander to register init, run, and report. init scaffolds a default eval.config.json, run executes an evaluation and saves results to .hermes/last-run.json, and report reads that file and exports a structured report.
Create the src/cli/ directory, then create src/cli/index.ts:
ts
import { Command } from 'commander';
import pc from 'picocolors';
import { readFileSync, writeFileSync, existsSync } from 'node:fs';
import { resolve } from 'node:path';
import { loadConfig } from '../lib/config.js';
import { runEvaluation } from '../lib/runner.js';
import { createResultsAggregator, parseConfig } from '@reaatech/agent-eval-harness-suite';

const DEFAULT_SUITE_YAML = 'metrics:\n - faithfulness\n - relevance\n';

export function createProgram(): Command {
  const program = new Command();

  program
    .name('agent-eval-harness')
    .description('CLI for LangChain Agent Eval Harness')
    .version('1.0.0');

  program
    .command('run')
    .description('Run an evaluation suite')
    .requiredOption('--config <path>', 'Path to the eval config JSON file')
    .action(async (options: { config: string }) => {
      const configPath = resolve(options.config);
      if (!existsSync(configPath)) {
        console.error(pc.red(`Config file not found: ${configPath}`));
        // Assumed error handling; the artifact's exact behavior may differ.
        process.exitCode = 1;
        return;
      }
      // ... load the config, execute runEvaluation, and write results to
      // .hermes/last-run.json; copy the full command body from the
      // downloadable artifact.
    });

  // ... the init and report commands follow the same pattern (see the artifact).

  return program;
}
The snippet shows the imports, the program setup, and the start of the run command; the complete init, run, and report implementations ship in the downloadable artifact.
Now create the barrel export at src/index.ts so consumers can import from the package root:
ts
export { createProgram } from './cli/index.js';
export { loadConfig } from './lib/config.js';
export type { EvalConfig, Scenario, JudgeConfig, LatencyConfig, ObsConfig } from './lib/config.js';
export {
  loadGoldenDataset,
  createGoldenEntry,
  validateGoldenDataset,
  compareTrajectories,
} from './lib/golden.js';
export type {
  TrajectoryComparisonResult,
  TurnComparison,
  Regression,
  ComparisonConfig,
  GoldenValidationResult,
} from './lib/golden.js';
export { evaluateWithJudge, evaluateBatch } from './lib/judge.js';
export type { JudgeRequest, JudgeScore, JudgeConfig as JudgeLibConfig } from './lib/judge.js';
export { trackLatency, checkSLABudget, createTracker, getTrackerTrend } from './lib/latency.js';
export type { LatencyResult, BudgetEnforcementResult, LatencyViolation } from './lib/latency.js';
export {
  initObservability,
  recordRunMetrics,
  flushObservability,
  shutdownObservability,
  withTracing,
} from './lib/observability.js';
export { runEvaluation, detectRegressions } from './lib/runner.js';
export type { SuiteConfig, EvalRunResult, RunComparisonResult } from './lib/runner.js';
Verify the project compiles:
terminal
pnpm typecheck
Expected output: nothing printed to the terminal (zero exit code). If there are type errors, they’ll be listed with file paths and line numbers.
Step 10: Write the test suite
The test suite mocks every REAA package so tests run without real API calls. The setup file configures Vitest’s mock factories for golden datasets, judge engines, latency trackers, and observability managers. Integration tests then exercise the CLI and runner directly.
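Create tests/setup.ts. A minimal skeleton follows; the mocked return shapes are assumptions matching how src/lib consumes each package:
ts
import { vi } from 'vitest';

// Mock each REAA package so no real API calls or network access happen in tests.
vi.mock('@reaatech/agent-eval-harness-golden', () => ({
  loadGoldenTrajectories: vi.fn(() => []),
  createGolden: vi.fn((t: unknown) => t),
  validateGolden: vi.fn(() => ({ valid: true, errors: [], warnings: [], score: 1 })),
  compareAgainstGolden: vi.fn(() => ({
    similarity: 1,
    turnComparisons: [],
    matchingTurns: 0,
    divergentTurns: 0,
    passesThreshold: true,
    regressions: [],
    diffSummary: '',
  })),
}));

vi.mock('@reaatech/agent-eval-harness-judge', () => ({
  JudgeEngine: vi.fn().mockImplementation(() => ({
    judge: vi.fn(async () => ({ score: 0.9 })),
    judgeBatch: vi.fn(async () => []),
  })),
}));

// ... equivalent vi.mock blocks for -latency, -suite, -observability,
// @langchain/core, and commander.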
The full tests/setup.ts runs ~390 lines and includes complete mock factories for all five REAA packages plus @langchain/core and commander. Copy the full file from the downloadable artifact — what’s shown above is the skeleton; the complete version handles every import path the tests exercise.
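Run the suite with coverage (this assumes the test script from Step 1, which invokes vitest run --coverage):
terminal
pnpm test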
Expected output: all 67 tests pass, and a coverage summary prints to the terminal showing each source file with >= 90% line, function, branch, and statement coverage. The terminal also prints a summary line like Tests 67 passed (67) with a green checkmark. A vitest-report.json file is written to the project root, and a coverage/ directory appears with the detailed coverage report.
Step 11: Run the evaluation pipeline
Now you’ll exercise the full flow: scaffold a config, run an evaluation, and generate a report. Start by initializing a default configuration:
terminal
pnpm tsx src/cli/index.ts init
Expected output:
code
Scaffolded eval.config.json at /home/you/agent-eval-harness-cli/eval.config.json
Open eval.config.json — it contains one scenario (example-scenario), a judge config using claude-opus, a moderate latency preset, and observability settings with metrics and dashboard enabled. Customize the scenarios array and judge.provider to match your real agents and API keys.
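Based on the schema from Step 4, the scaffolded file should look roughly like this; the top-level name and descriptions are illustrative:
json
{
  "name": "default-eval",
  "description": "Scaffolded evaluation suite",
  "scenarios": [
    { "name": "example-scenario", "description": "Replace with a real workflow" }
  ],
  "judge": { "model": "claude-opus", "provider": "anthropic" },
  "latency": { "preset": "moderate" },
  "observability": { "metricsEnabled": true, "dashboardEnabled": true }
}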
Now run an evaluation against that config:
terminal
pnpm tsx src/cli/index.ts run --config ./eval.config.json
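The run command executes each scenario and saves structured results to .hermes/last-run.json. Then generate the report; the exact invocation below is reconstructed from the --output behavior described after the output, so treat the flag as an assumption:
terminal
pnpm tsx src/cli/index.ts report --output ./report.md
Expected output:
code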
Report written to /home/you/agent-eval-harness-cli/report.md
Open ./report.md — it contains a Markdown-formatted summary of the last evaluation run. Pass --output ./report.json instead (any path ending in .json) to get a machine-readable JSON export.
Next steps
Add real scenarios — replace the placeholder example-scenario in eval.config.json with scenarios that exercise your business-critical agent workflows
Integrate golden datasets — use loadGoldenDataset to load reference trajectories and compareTrajectories to detect regressions between runs
Wire up OpenTelemetry — set tracingEnabled: true in your eval config and point EVAL_OBSERVABILITY_URL at a real collector (Jaeger, Grafana Tempo) for distributed tracing across evaluation runs