SMBs deploying AI agents have no systematic way to test whether updates or new prompts break business-critical tasks, which leads to customer-facing errors and erodes trust.
A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.
You’ll build a CLI evaluation harness that lets your small business systematically test AI agents before they touch customers. By the end, you’ll have three commands — init, run, and report — that scaffold an evaluation config, execute LangChain-powered agent evaluations across multiple scenarios, and export structured reports with scores, latency percentiles, and regression detection. The harness uses REAA’s eval suite packages for golden dataset management, LLM-as-judge scoring, latency tracking, and OpenTelemetry observability, all wired together with Commander, Zod, and TypeScript.
Prerequisites
Node.js >= 22 (the package.json engines field requires it)
pnpm 10.x (the project uses pnpm@10.15.1 as its package manager)
OpenAI, Anthropic, Gemini, or OpenRouter API key — the judge engine needs at least one provider to score agent responses
Familiarity with TypeScript and running CLI tools from the terminal
Step 1: Scaffold the project and install dependencies
Create a new directory and set up a package.json with all the dependencies the harness needs. The project uses ES modules ("type": "module") and TypeScript with NodeNext module resolution.
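Create the directory, then add a package.json along these lines. This is a sketch: the dependency list mirrors what later steps import, but the version ranges are placeholders, so prefer the exact versions pinned in the downloadable artifact.
json
{
  "name": "agent-eval-harness-cli",
  "version": "1.0.0",
  "type": "module",
  "packageManager": "pnpm@10.15.1",
  "engines": { "node": ">=22" },
  "scripts": {
    "typecheck": "tsc --noEmit",
    "test": "vitest run --coverage"
  },
  "dependencies": {
    "@reaatech/agent-eval-harness-golden": "^1.0.0",
    "@reaatech/agent-eval-harness-judge": "^1.0.0",
    "@reaatech/agent-eval-harness-latency": "^1.0.0",
    "@reaatech/agent-eval-harness-observability": "^1.0.0",
    "@reaatech/agent-eval-harness-suite": "^1.0.0",
    "commander": "^12.0.0",
    "js-yaml": "^4.1.0",
    "picocolors": "^1.0.0",
    "zod": "^3.23.0"
  },
  "devDependencies": {
    "@types/js-yaml": "^4.0.9",
    "@types/node": "^22.0.0",
    "tsx": "^4.0.0",
    "typescript": "^5.5.0",
    "vitest": "^2.0.0"
  }
}
Then install:
terminal
pnpm install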
Expected output: pnpm downloads and links all dependencies. You should see a node_modules/ directory appear, and pnpm install completes without errors.
Step 2: Configure TypeScript and Vitest
Set up TypeScript to target ES2022 with NodeNext module resolution, and configure Vitest with a test setup file and 90% coverage thresholds.
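Minimal sketches of both files under those constraints follow; adjust if your layout differs. First, tsconfig.json:
json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "strict": true,
    "declaration": true,
    "outDir": "dist",
    "skipLibCheck": true
  },
  "include": ["src", "tests"]
}
Then vitest.config.ts (the json reporter and outputFile produce the vitest-report.json mentioned in Step 10):
ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    setupFiles: ['./tests/setup.ts'],
    reporters: ['default', 'json'],
    outputFile: './vitest-report.json',
    coverage: {
      provider: 'v8',
      thresholds: { lines: 90, functions: 90, branches: 90, statements: 90 },
    },
  },
});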
The setupFiles entry points at ./tests/setup.ts, which we’ll fill in later.
Step 3: Set environment variables
The judge engine authenticates against LLM providers using standard API key variables. Create a .env file in the project root with every key the harness might use:
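A sketch of the file follows. The provider variable names below are the usual conventions, but check the judge package's docs for the exact names it reads; the Gemini key in particular varies between GEMINI_API_KEY and GOOGLE_API_KEY. The collector URL uses the standard OTLP/HTTP port as an example value.
code
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
OPENROUTER_API_KEY=sk-or-...
EVAL_OBSERVABILITY_URL=http://localhost:4318
LOG_LEVEL=info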
Set at least one provider key — the judge engine picks the provider you specify in the eval config. The EVAL_OBSERVABILITY_URL points to an OpenTelemetry collector (optional; metrics recording works without it in local mode). LOG_LEVEL defaults to info if left unset.
Step 4: Create the configuration loader
The evaluation harness reads its entire setup from a single JSON file. You need a Zod-validated loader that parses the file, checks required fields, and surfaces clear errors for missing or malformed configs.
Create the src/lib/ directory, then create src/lib/config.ts:
ts
import { z } from 'zod';
import { readFileSync } from 'node:fs';

/** Schema for a single scenario in the eval configuration. */
const ScenarioSchema = z.object({
  name: z.string(),
  description: z.string().optional(),
  trajectoryPath: z.string().optional(),
  tags: z.array(z.string()).optional(),
});

/** Schema for judge configuration in the eval config. */
const JudgeConfigSchema = z.object({
  model: z.string().optional(),
  provider: z.string().optional(),
  temperature: z.number().optional(),
  maxTokens: z.number().optional(),
  budgetLimit: z.number().optional(),
  calibrationEnabled: z.boolean().optional(),
});

/** Schema for latency configuration in the eval config. */
const LatencyConfigSchema = z.object({
  preset: z.enum(['strict', 'moderate', 'lenient']).optional(),
  p50ThresholdMs: z.number().optional(),
  p90ThresholdMs: z.number().optional(),
  p99ThresholdMs: z.number().optional(),
  maxTurnMs: z.number().optional(),
  totalMs: z.number().optional(),
});

/** Schema for observability configuration in the eval config. */
const ObsConfigSchema = z.object({
  logLevel: z.string().optional(),
  logFormat: z.enum(['json', 'pretty']).optional(),
  metricsEnabled: z.boolean().optional(),
  tracingEnabled: z.boolean().optional(),
  dashboardEnabled: z.boolean().optional(),
  otlpEndpoint: z.string().optional(),
});

/** Schema for the full eval configuration file. */
const EvalConfigSchema = z.object({
  name: z.string(),
  description: z.string().optional(),
  scenarios: z.array(ScenarioSchema),
  judge: JudgeConfigSchema.optional(),
  latency: LatencyConfigSchema.optional(),
  observability: ObsConfigSchema.optional(),
});

export type Scenario = z.infer<typeof ScenarioSchema>;
export type JudgeConfig = z.infer<typeof JudgeConfigSchema>;
export type LatencyConfig = z.infer<typeof LatencyConfigSchema>;
export type ObsConfig = z.infer<typeof ObsConfigSchema>;
export type EvalConfig = z.infer<typeof EvalConfigSchema>;

export function loadConfig(filePath: string): EvalConfig {
  let raw: string;
  try {
    raw = readFileSync(filePath, 'utf-8');
  } catch (err) {
    const nodeErr = err as NodeJS.ErrnoException;
    if (nodeErr.code === 'ENOENT') {
      throw new Error(`Config file not found: ${filePath}`);
    }
    throw new Error(`Failed to read config file at ${filePath}: ${(err as Error).message}`);
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error(`Invalid JSON syntax in config file: ${filePath}`);
  }
  const result = EvalConfigSchema.safeParse(parsed);
  if (!result.success) {
    const issues = result.error.issues
      .map((i) => `${i.path.join('.')}: ${i.message}`)
      .join('; ');
    throw new Error(`Config validation failed for ${filePath}: ${issues}`);
  }
  return result.data;
}
This module gives you both runtime validation and TypeScript types (EvalConfig, Scenario, JudgeConfig, etc.) from the same Zod schemas.
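A quick usage sketch (the config path is illustrative):
ts
import { loadConfig } from './src/lib/config.js';

// Throws with a descriptive message if the file is missing, malformed, or invalid.
const config = loadConfig('./eval.config.json');
console.log(`Loaded "${config.name}" with ${config.scenarios.length} scenario(s)`);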
Step 5: Build the golden dataset module
Golden datasets are reference agent trajectories stored in JSONL format. This module wraps the @reaatech/agent-eval-harness-golden package to load, create, validate, and compare golden entries. It also defines the local TypeScript interfaces used throughout the harness.
Create src/lib/golden.ts:
ts
import { readFileSync } from 'node:fs';
import {
  loadGoldenTrajectories,
  createGolden,
  validateGolden,
  compareAgainstGolden,
} from '@reaatech/agent-eval-harness-golden';

export interface TrajectoryComparisonResult {
  similarity: number;
  turnComparisons: TurnComparison[];
  matchingTurns: number;
  divergentTurns: number;
  passesThreshold: boolean;
  regressions: Regression[];
  diffSummary: string;
}

export interface TurnComparison {
  turnId: number;
  similarity: number;
  contentMatch: boolean;
  toolMatch: boolean;
  differences: string[];
}

export interface Regression {
  type: string;
  severity: string;
  turnId: number;
  description: string;
}

export interface ComparisonConfig {
  similarityThreshold: number;
  compareTools: boolean;
  semanticComparison: boolean;
  turnMatching: string;
}

export interface GoldenValidationResult {
  valid: boolean;
  errors: string[];
  warnings: string[];
  score: number;
}

export function loadGoldenDataset(filePath: string): ReturnType<typeof loadGoldenTrajectories> {
  let content: string;
  try {
    content = readFileSync(filePath, 'utf-8');
  } catch (err) {
    const nodeErr = err as NodeJS.ErrnoException;
    if (nodeErr.code === 'ENOENT') {
      throw new Error(`Golden dataset file not found: ${filePath}`);
    }
    throw new Error(`Failed to read golden dataset at ${filePath}: ${(err as Error).message}`);
  }
  try {
    return loadGoldenTrajectories(content);
  } catch (err) {
    throw new Error(
      `Failed to parse golden dataset at ${filePath}: expected JSONL format with one JSON object per line: ${(err as Error).message}`,
    );
  }
}

export function createGoldenEntry(
  trajectory: unknown,
  options?: { description?: string; tags?: string[] },
): ReturnType<typeof createGolden> {
  try {
    return createGolden(
      trajectory as Parameters<typeof createGolden>[0],
      options as Parameters<typeof createGolden>[1],
    );
  } catch (err) {
    throw new Error(`Failed to create golden entry: ${(err as Error).message}`);
  }
}

export function validateGoldenDataset(golden: unknown): GoldenValidationResult {
  try {
    return validateGolden(golden as Parameters<typeof validateGolden>[0]) as never;
  } catch (err) {
    throw new Error(`Failed to validate golden dataset: ${(err as Error).message}`);
  }
}

export function compareTrajectories(
  golden: unknown,
  candidate: unknown,
  config?: Partial<ComparisonConfig>,
): TrajectoryComparisonResult {
  try {
    return compareAgainstGolden(
      golden as Parameters<typeof compareAgainstGolden>[0],
      candidate as Parameters<typeof compareAgainstGolden>[1],
      config as Parameters<typeof compareAgainstGolden>[2],
    ) as TrajectoryComparisonResult;
  } catch (err) {
    throw new Error(`Failed to compare trajectories: ${(err as Error).message}`);
  }
}
Each function wraps a REAA package call with proper error handling and re-throws with descriptive messages so the CLI can report failures clearly.
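A hypothetical regression check, assuming the loader returns an array of golden entries (the dataset path and trajectory shape are placeholders):
ts
import { loadGoldenDataset, compareTrajectories } from './src/lib/golden.js';

// Hypothetical candidate trajectory; the real shape comes from your agent runs.
const candidate = { turns: [] };

const goldens = loadGoldenDataset('./goldens/checkout.jsonl') as unknown[];
const result = compareTrajectories(goldens[0], candidate, { similarityThreshold: 0.85 });
if (!result.passesThreshold) {
  console.error(`Regression detected:\n${result.diffSummary}`);
}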
Step 6: Build the judge and latency modules
The judge module uses @reaatech/agent-eval-harness-judge to score agent responses with an LLM. It exposes single-request and batch evaluation functions. The latency module wraps @reaatech/agent-eval-harness-latency to track P50/P90/P99 percentiles and enforce SLA budgets.
Create src/lib/judge.ts:
ts
import { JudgeEngine } from '@reaatech/agent-eval-harness-judge';
import type { JudgeRequest, JudgeScore, JudgeConfig } from '@reaatech/agent-eval-harness-judge';

export type { JudgeRequest, JudgeScore, JudgeConfig };

export async function evaluateWithJudge(
  request: JudgeRequest,
  config: JudgeConfig,
): Promise<JudgeScore> {
  const engine = new JudgeEngine(config);
  try {
    return await engine.judge(request);
  } catch (err) {
    throw new Error(`Judge evaluation failed: ${(err as Error).message}`);
  }
}

export async function evaluateBatch(
  requests: Array<{ id: string; request: JudgeRequest }>,
  config: JudgeConfig,
  concurrency?: number,
): Promise<unknown> {
  const engine = new JudgeEngine(config);
  try {
    return await engine.judgeBatch(requests, concurrency);
  } catch (err) {
    throw new Error(`Batch judge evaluation failed: ${(err as Error).message}`);
  }
}
Create src/lib/latency.ts:
ts
import {
  monitorLatency,
  enforceBudget,
  createLatencyBudget,
  LatencyTracker,
} from '@reaatech/agent-eval-harness-latency';

export interface LatencyResult {
  totalLatencyMs: number;
  avgLatencyMs: number;
  p50Ms: number;
  p90Ms: number;
  p99Ms: number;
  maxLatencyMs: number;
  minLatencyMs: number;
  turnCount: number;
}

export interface BudgetEnforcementResult {
  passed: boolean;
  violations: LatencyViolation[];
  score: number;
}

export interface LatencyViolation {
  type: string;
  severity: string;
  description: string;
  actual: number;
  threshold: number;
  turnId?: number;
}

export function trackLatency(trajectory: unknown): LatencyResult {
  const result = monitorLatency(trajectory as Parameters<typeof monitorLatency>[0]);
  return result as LatencyResult;
}

export function checkSLABudget(
  result: LatencyResult,
  preset: 'strict' | 'moderate' | 'lenient',
): BudgetEnforcementResult {
  const budget = createLatencyBudget(preset);
  const enforcement = enforceBudget(
    result as Parameters<typeof enforceBudget>[0],
    budget as Parameters<typeof enforceBudget>[1],
  );
  return enforcement as BudgetEnforcementResult;
}

export function createTracker(): LatencyTracker {
  return new LatencyTracker();
}

export function getTrackerTrend(tracker: LatencyTracker): ReturnType<LatencyTracker['getTrend']> {
  return tracker.getTrend();
}
Together, evaluateWithJudge scores single responses, checkSLABudget validates whether latency falls within your chosen preset, and createTracker gives you trend data across multiple runs.
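A hypothetical SLA check (the trajectory shape is a placeholder; the real one comes from the REAA latency package):
ts
import { trackLatency, checkSLABudget } from './src/lib/latency.js';

const trajectory = { turns: [] }; // placeholder shape
const latency = trackLatency(trajectory);
console.log(`p50=${latency.p50Ms}ms p90=${latency.p90Ms}ms p99=${latency.p99Ms}ms`);

const budget = checkSLABudget(latency, 'moderate');
if (!budget.passed) {
  for (const v of budget.violations) {
    console.warn(`${v.severity} ${v.type}: ${v.description} (${v.actual}ms > ${v.threshold}ms)`);
  }
}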
Step 7: Build the observability module
The observability module initializes logging, metrics, tracing, and dashboard managers from @reaatech/agent-eval-harness-observability. It provides functions to record evaluation results and flush or shut down telemetry at process exit.
Each manager is a singleton — the first call initializes it, subsequent calls return the same instance.
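Create src/lib/observability.ts. A minimal sketch follows; only getLogger is confirmed by this recipe (Step 8 uses it), so the logger calls, the manager wiring behind the ellipsis comments, and everything else here are assumptions. Swap in the package's real initializers, keeping the singleton guard the paragraph above describes:
ts
import { getLogger } from '@reaatech/agent-eval-harness-observability';
import type { ObsConfig } from './config.js';

// Singleton guard: the first init call wins, later calls are no-ops.
let initialized = false;

export function initObservability(config?: ObsConfig): void {
  if (initialized) return;
  // ... construct the logging/metrics/tracing/dashboard managers from the
  // REAA package here (their constructors are not shown in this recipe).
  initialized = true;
  getLogger().info(`Observability initialized (logLevel=${config?.logLevel ?? 'info'})`);
}

export function recordRunMetrics(run: Record<string, unknown>): void {
  // ... forward scores and latency percentiles to the metrics manager (assumed API).
  getLogger().info(`Recorded metrics for run keys: ${Object.keys(run).join(', ')}`);
}

export async function flushObservability(): Promise<void> {
  // ... flush any buffered metric/trace exports before process exit (assumed API).
}

export async function shutdownObservability(): Promise<void> {
  await flushObservability();
  initialized = false;
}

export async function withTracing<T>(name: string, fn: () => Promise<T>): Promise<T> {
  // ... wrap fn in an OpenTelemetry span via the tracing manager (assumed API);
  // this sketch just passes the function through.
  getLogger().info(`trace: ${name}`);
  return fn();
}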
Step 8: Create the suite runner
The runner orchestrates the full evaluation pipeline: it converts a config into YAML for @reaatech/agent-eval-harness-suite, creates a suite runner with concurrency and timeout settings, executes all trajectories, and aggregates results. It also exposes a detectRegressions function for comparing two runs.
Create src/lib/runner.ts:
ts
import * as YAML from 'js-yaml';
import {
  createSuiteRunner,
  parseConfig,
  createRunComparator,
  createResultsAggregator,
} from '@reaatech/agent-eval-harness-suite';
import type {
  SuiteConfig,
  EvalRunResult,
  RunComparisonResult,
} from '@reaatech/agent-eval-harness-suite';
import { getLogger } from '@reaatech/agent-eval-harness-observability';

export type { SuiteConfig, EvalRunResult, RunComparisonResult };

export async function runEvaluation(
  config: {
    concurrency?: number;
    continueOnError?: boolean;
    timeoutMs?: number;
    metrics?: string[];
  },
  trajectories: unknown[],
  evaluator: (trajectory: unknown) => Promise<Record<string, unknown>>,
): Promise<EvalRunResult> {
  const yamlString = YAML.dump({
    metrics: config.metrics ?? ['faithfulness', 'relevance'],
    judge_model: 'claude-opus',
    budget_limit: 10.00,
    parallel_workers: config.concurrency ?? 1,
  });
  const parsedConfig = parseConfig(yamlString);
  const runner = createSuiteRunner({
    concurrency: config.concurrency ?? 1,
    continueOnError: config.continueOnError ?? true,
    timeoutMs: config.timeoutMs ?? 30000,
    metrics: config.metrics ?? ['faithfulness', 'relevance'],
  });
  const result = await runner.run(trajectories as never, evaluator as never);
  const aggregator = createResultsAggregator(parsedConfig);
  aggregator.aggregate(result as never);
  return result;
}

export function detectRegressions(
  baselineRun: EvalRunResult,
  candidateRun: EvalRunResult,
): RunComparisonResult {
  const comparator = createRunComparator();
  const comparison = comparator.compare(baselineRun as never, candidateRun as never);
  const logger = getLogger();
  if (comparison.regressions.length > 0) {
    logger.warn('Regressions detected in comparison: ' + String(comparison.regressions.length));
  }
  return comparison;
}
The runEvaluation function is the heart of the harness — it accepts a config, trajectories, and an evaluator function, then hands everything to the suite package for parallel execution.
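As a smoke test, you can hand it a trivial evaluator; the trajectory shape below is a placeholder, not the real REAA schema:
ts
import { runEvaluation } from './src/lib/runner.js';

// Hypothetical smoke test: one placeholder trajectory, an evaluator that always passes.
const result = await runEvaluation(
  { concurrency: 2, timeoutMs: 10_000, metrics: ['faithfulness'] },
  [{ name: 'example-scenario', turns: [] }],
  async () => ({ score: 1.0 }),
);
console.log(result);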
Step 9: Wire up the CLI
Now you’ll build the three commands users actually type. The entry point uses Commander to register init, run, and report. init scaffolds a default eval.config.json, run executes an evaluation and saves results to .hermes/last-run.json, and report reads that file and exports a structured report.
Create the src/cli/ directory, then create src/cli/index.ts:
ts
import { Command } from 'commander';
import pc from 'picocolors';
import { readFileSync, writeFileSync, existsSync } from 'node:fs';
import { resolve } from 'node:path';
import { loadConfig } from '../lib/config.js';
import { runEvaluation } from '../lib/runner.js';
import { createResultsAggregator, parseConfig } from '@reaatech/agent-eval-harness-suite';

const DEFAULT_SUITE_YAML = 'metrics:\n - faithfulness\n - relevance\n';

export function createProgram(): Command {
  const program = new Command();

  program
    .name('agent-eval-harness')
    .description('CLI for LangChain Agent Eval Harness')
    .version('1.0.0');

  program
    .command('run')
    .description('Run an evaluation suite')
    .requiredOption('--config <path>', 'Path to the eval config JSON file')
    .action(async (options: { config: string }) => {
      const configPath = resolve(options.config);
      if (!existsSync(configPath)) {
        console.error(pc.red(`Config file not found: ${configPath}`));
        // Assumed error handling; the artifact's exact behavior may differ.
        process.exitCode = 1;
        return;
      }
      // ... load the config, execute runEvaluation, and write results to
      // .hermes/last-run.json; copy the full command body from the
      // downloadable artifact.
    });

  // ... the init and report commands follow the same pattern (see the artifact).

  return program;
}
The snippet shows the imports, the program setup, and the start of the run command; the complete init, run, and report implementations ship in the downloadable artifact.
Now create the barrel export at src/index.ts so consumers can import from the package root:
ts
export { createProgram } from './cli/index.js';
export { loadConfig } from './lib/config.js';
export type { EvalConfig, Scenario, JudgeConfig, LatencyConfig, ObsConfig } from './lib/config.js';
export {
  loadGoldenDataset,
  createGoldenEntry,
  validateGoldenDataset,
  compareTrajectories,
} from './lib/golden.js';
export type {
  TrajectoryComparisonResult,
  TurnComparison,
  Regression,
  ComparisonConfig,
  GoldenValidationResult,
} from './lib/golden.js';
export { evaluateWithJudge, evaluateBatch } from './lib/judge.js';
export type { JudgeRequest, JudgeScore, JudgeConfig as JudgeLibConfig } from './lib/judge.js';
export { trackLatency, checkSLABudget, createTracker, getTrackerTrend } from './lib/latency.js';
export type { LatencyResult, BudgetEnforcementResult, LatencyViolation } from './lib/latency.js';
export {
  initObservability,
  recordRunMetrics,
  flushObservability,
  shutdownObservability,
  withTracing,
} from './lib/observability.js';
export { runEvaluation, detectRegressions } from './lib/runner.js';
export type { SuiteConfig, EvalRunResult, RunComparisonResult } from './lib/runner.js';
Verify the project compiles:
terminal
pnpm typecheck
Expected output: nothing printed to the terminal (zero exit code). If there are type errors, they’ll be listed with file paths and line numbers.
Step 10: Write the test suite
The test suite mocks every REAA package so tests run without real API calls. The setup file configures Vitest’s mock factories for golden datasets, judge engines, latency trackers, and observability managers. Integration tests then exercise the CLI and runner directly.
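Create tests/setup.ts. A minimal skeleton follows; the mocked return shapes are assumptions matching how src/lib consumes each package:
ts
import { vi } from 'vitest';

// Mock each REAA package so no real API calls or network access happen in tests.
vi.mock('@reaatech/agent-eval-harness-golden', () => ({
  loadGoldenTrajectories: vi.fn(() => []),
  createGolden: vi.fn((t: unknown) => t),
  validateGolden: vi.fn(() => ({ valid: true, errors: [], warnings: [], score: 1 })),
  compareAgainstGolden: vi.fn(() => ({
    similarity: 1,
    turnComparisons: [],
    matchingTurns: 0,
    divergentTurns: 0,
    passesThreshold: true,
    regressions: [],
    diffSummary: '',
  })),
}));

vi.mock('@reaatech/agent-eval-harness-judge', () => ({
  JudgeEngine: vi.fn().mockImplementation(() => ({
    judge: vi.fn(async () => ({ score: 0.9 })),
    judgeBatch: vi.fn(async () => []),
  })),
}));

// ... equivalent vi.mock blocks for -latency, -suite, -observability,
// @langchain/core, and commander.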
The full tests/setup.ts runs ~390 lines and includes complete mock factories for all five REAA packages plus @langchain/core and commander. Copy the full file from the downloadable artifact — what’s shown above is the skeleton; the complete version handles every import path the tests exercise.
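Run the suite with coverage (this assumes the test script from Step 1, which invokes vitest run --coverage):
terminal
pnpm test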
Expected output: all 67 tests pass, and a coverage summary prints to the terminal showing each source file with >= 90% line, function, branch, and statement coverage. The terminal also prints a summary line like Tests 67 passed (67) with a green checkmark. A vitest-report.json file is written to the project root, and a coverage/ directory appears with the detailed coverage report.
Step 11: Run the evaluation pipeline
Now you’ll exercise the full flow: scaffold a config, run an evaluation, and generate a report. Start by initializing a default configuration:
terminal
pnpm tsx src/cli/index.ts init
Expected output:
code
Scaffolded eval.config.json at /home/you/agent-eval-harness-cli/eval.config.json
Open eval.config.json — it contains one scenario (example-scenario), a judge config using claude-opus, a moderate latency preset, and observability settings with metrics and dashboard enabled. Customize the scenarios array and judge.provider to match your real agents and API keys.
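Based on the schema from Step 4, the scaffolded file should look roughly like this; the top-level name and descriptions are illustrative:
json
{
  "name": "default-eval",
  "description": "Scaffolded evaluation suite",
  "scenarios": [
    { "name": "example-scenario", "description": "Replace with a real workflow" }
  ],
  "judge": { "model": "claude-opus", "provider": "anthropic" },
  "latency": { "preset": "moderate" },
  "observability": { "metricsEnabled": true, "dashboardEnabled": true }
}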
Now run an evaluation against that config:
terminal
pnpm tsx src/cli/index.ts run --config ./eval.config.json
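The run command executes each scenario and saves structured results to .hermes/last-run.json. Then generate the report; the exact invocation below is reconstructed from the --output behavior described after the output, so treat the flag as an assumption:
terminal
pnpm tsx src/cli/index.ts report --output ./report.md
Expected output:
code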
Report written to /home/you/agent-eval-harness-cli/report.md
Open ./report.md — it contains a Markdown-formatted summary of the last evaluation run. Pass --output ./report.json instead (any path ending in .json) to get a machine-readable JSON export.
Next steps
Add real scenarios — replace the placeholder example-scenario in eval.config.json with scenarios that exercise your business-critical agent workflows
Integrate golden datasets — use loadGoldenDataset to load reference trajectories and compareTrajectories to detect regressions between runs
Wire up OpenTelemetry — set tracingEnabled: true in your eval config and point EVAL_OBSERVABILITY_URL at a real collector (Jaeger, Grafana Tempo) for distributed tracing across evaluation runs