OpenAI Agent Eval Harness for SMB Customer Support Quality
Automatically evaluate every production AI support interaction to catch bad answers, hallucination, and policy violations before they affect customers.
SMB customer support agents powered by OpenAI often drift in tone, hallucinate product details, or miss steps, but manual spot-checking doesn't scale as ticket volume grows.
A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.
Small and medium businesses that use AI for customer support need a systematic way to catch bad answers, hallucinated product details, and policy violations. Manual spot-checking doesn’t scale as ticket volume grows. This tutorial builds an automated evaluation harness that scores support conversations across faithfulness, relevance, cost, latency, and PII safety, then gates deployments on quality regressions. You’ll write a configurable pipeline that loads recorded support trajectories, scores them with a custom PII redaction metric and an OpenAI LLM judge, checks results against CI-compatible quality gates, logs costs, and exports everything to Langfuse for dashboards.
Prerequisites
Node.js 22+ and pnpm 10 installed on your machine
An OpenAI API key (set as OPENAI_API_KEY in your environment or .env)
A Langfuse account (free tier works) — optional, for the observability export step
Familiarity with TypeScript and basic Next.js App Router patterns
Step 1: Scaffold the project and install dependencies
Create a new Next.js project with the App Router, then replace package.json with the exact dependencies below. Every package is pinned to a specific version so the harness behaves the same way every time you install:
EvalSummary is the return type of the pipeline orchestrator — clients (CLI, API routes) get a compact result with the run ID, pass rate, overall score, and whether the quality gates passed. ConfigError gives you programmatic error discrimination so error handlers can branch on .code rather than parsing strings.
Step 3: Build the YAML config loader
Create src/lib/config.ts that reads and validates an eval configuration file (YAML) via the @reaatech/agent-eval-harness-suite package:
ts
import { readFileSync } from 'node:fs';import { parseConfig, mergeConfig, validateConfig, createDefaultConfig, type SuiteConfig } from '@reaatech/agent-eval-harness-suite';import { ConfigError } from './types.js';export function loadEvalConfig(path?: string): SuiteConfig { const configPath = path ?? './eval-config.yaml'; let content: string; try { content = readFileSync(configPath, 'utf-8'); } catch { throw new ConfigError(`Config file not found: ${configPath}`, 'FILE_NOT_FOUND'); } let config: SuiteConfig; try { config = parseConfig(content); } catch { throw new ConfigError('Invalid config: failed to parse YAML', 'INVALID_CONFIG'); } const validation = validateConfig(config); if (!validation.valid) { throw new ConfigError(`Invalid config: ${validation.errors.join(', ')}`, 'INVALID_CONFIG'); } return config;}export function createDefaultSuiteConfig(name: string): SuiteConfig { const config = createDefaultConfig(name); return mergeConfig({ ...config, metrics: [ ...config.metrics, { name: 'pii_redaction', enabled: true, weight: 0.1, threshold: 0.95, config: {}, }, ], });}export function loadAndValidateConfig(path?: string): SuiteConfig { const config = loadEvalConfig(path); const validation = validateConfig(config); if (!validation.valid) { throw new ConfigError(`Config validation failed: ${validation.errors.join(', ')}`, 'INVALID_CONFIG'); } return config;}
Three exports, three responsibilities: loadEvalConfig reads and parses a YAML file at a given path (falling back to ./eval-config.yaml), createDefaultSuiteConfig generates a starter config with all five standard metrics plus a custom PII redaction metric, and loadAndValidateConfig chains load + validate so callers can’t forget the second step.
Expected output:pnpm typecheck passes with no errors on this file.
Step 4: Create the PII redaction metric and suite runner
The src/lib/suite.ts module is the evaluation engine. It defines a custom PII redaction metric that scans support responses for emails, phone numbers, SSNs, and credit card numbers, then wraps the SuiteRunner from @reaatech/agent-eval-harness-suite:
ts
import { SuiteRunner, createResultsAggregator, type SuiteRunnerConfig, type EvalRunResult, type SuiteConfig, type ResultsAggregator } from '@reaatech/agent-eval-harness-suite';import type { PiiRedactionResult } from './types.js';export function createPiiRedactionMetric(): { name: string; score(response: string): PiiRedactionResult } { const patterns: Array<{ regex: RegExp; type: string }> = [ { regex: /\b[\w.-]+@[\w.-]+\.\w+\b/, type: 'email' }, { regex: /\b\d{3}-\d{3}-\d{4}\b/, type: 'phone' }, { regex: /\b\d{3}-\d{2}-\d{4}\b/, type: 'ssn' }, { regex: /\b(?:\d[ -]*?){13,16}\b/, type: 'credit_card' }, ]; return { name: 'pii_redaction', score(response: string): PiiRedactionResult { const found = new Set<string>(); let totalCount = 0; for (const { regex, type } of patterns) { const matches = response.match(regex); if (matches) { found.add(type); totalCount += matches.length; } } if (totalCount === 0) { return { score: 1.0, hasPiiLeak: false, piiTypesFound: [], redactedCount: 0 }; } return { score: Math.max(0, 1 - totalCount * 0.25), hasPiiLeak: true, piiTypesFound: Array.from(found), redactedCount: totalCount, }; }, };}export function createEvalSuiteRunner(config: SuiteRunnerConfig, progressCallback?: (progress: { completed: number; total: number }) => void): SuiteRunner { if (progressCallback) { return new SuiteRunner(config, (update) => { progressCallback({ completed: update.completed, total: update.total }); }); } return new SuiteRunner(config);}export function createAggregator(config: SuiteConfig): ResultsAggregator { return createResultsAggregator(config);}export function aggregateAndExport(results: EvalRunResult, config: SuiteConfig, format: 'json' | 'junit' | 'csv' | 'markdown'): string { const aggregator = createResultsAggregator(config); const aggregated = aggregator.aggregate(results); return aggregator.export(aggregated, format);}
The PII metric scores a response from 1.0 (no leaks) down to 0.0 (many leaks), deducting 0.25 points per PII snippet found. The createEvalSuiteRunner factory accepts an optional progress callback so the CLI can show live completed/total output during long runs.
Expected output:pnpm typecheck reports zero errors.
Step 5: Wire up the LLM-as-judge engine
src/lib/judges.ts wraps @reaatech/agent-eval-harness-judge and the openai SDK. It provides a factory for an OpenAI-powered judge engine, error-safe judging with a custom JudgeError class, batch judging, cost tracking, and human-label calibration:
ts
import OpenAI from 'openai'import { JudgeEngine, JudgeCalibrator, JudgeCostTracker, type JudgeRequest, type JudgeScore } from '@reaatech/agent-eval-harness-judge'export class JudgeError extends Error { status?: number requestId?: string constructor(message: string, status?: number, requestId?: string) { super(message) this.name = 'JudgeError' this.status = status this.requestId = requestId }}export function createOpenaiJudgeEngine(options?: { model?: string; temperature?: number }): JudgeEngine { return new JudgeEngine({ model: options?.model ?? 'gpt-4o', provider: 'gpt4', temperature: options?.temperature ?? 0.1, })}export async function judgeResponse(judgeEngine: JudgeEngine, request: JudgeRequest): Promise<JudgeScore> { try { return await judgeEngine.judge(request) } catch (error) { if (error instanceof OpenAI.APIError) { throw new JudgeError(error.message, error.status as number | undefined) } const judgeErr = error as { message: string; status?: number } throw new JudgeError(judgeErr.message, judgeErr.status) }}export async function judgeBatch( judgeEngine: JudgeEngine, requests: Array<{ id: string; request: JudgeRequest }>, concurrency?: number,) { return judgeEngine.judgeBatch(requests, concurrency ?? 5)}export function createJudgeCostTracker(opts: { budgetLimit?: number; maxCostPerJudgment?: number; alertThresholds?: number[] }): JudgeCostTracker { return new JudgeCostTracker({ budgetLimit: opts.budgetLimit ?? 10.00, maxCostPerJudgment: opts.maxCostPerJudgment ?? 0.05, alertThresholds: opts.alertThresholds ?? [0.5, 0.75, 0.9], })}export function trackJudgmentCost( tracker: JudgeCostTracker, opts: { judgmentId: string; provider: string; model: string; inputTokens: number; outputTokens: number },): void { const result = tracker.recordJudgment( opts.judgmentId, opts.provider as 'claude' | 'gpt4' | 'gemini' | 'openrouter', opts.model, opts.inputTokens, opts.outputTokens, ) if (result.alerts.length > 0) { for (const alert of result.alerts) { console.warn(`[JudgeCostTracker] ${alert.level}: ${alert.message}`) } }}export function calibrateJudge( humanLabels: Array<{ sampleId: string; score: number; type: string }>, judgeScores: JudgeScore[],): JudgeCalibrator { if (humanLabels.length < 3) { throw new Error('At least 3 human-labeled samples are required for calibration') } const calibrator = new JudgeCalibrator('temperature_scaling') calibrator.addCalibrationData(humanLabels, judgeScores) calibrator.calibrate() return calibrator}
Note how judgeResponse catches OpenAI.APIError and wraps it in your own JudgeError. This keeps the call site from having to import the openai package just for error handling — your own type travels across module boundaries.
Expected output: TypeScript compiles. You can verify by importing createOpenaiJudgeEngine and checking the returned object’s type.
Step 6: Integrate CI regression gates
src/lib/gate.ts maps evaluation results to pass/fail decisions using @reaatech/agent-eval-harness-gate. It supports standard and strict presets, adds a custom PII redaction gate, and writes JUnit XML reports for CI:
ts
import { createGateEngine, getStandardPreset, getStrictPreset, writeJUnitReport, type GateEvaluationSummary } from '@reaatech/agent-eval-harness-gate'import { mkdirSync, writeFileSync } from 'node:fs'import { join } from 'node:path'export type { GateEvaluationSummary }export function createEvalGateEngine(preset?: 'standard' | 'strict') { return createGateEngine(preset === 'strict' ? getStrictPreset().gates : getStandardPreset().gates)}export function addCustomPiiGate(engine: ReturnType<typeof createGateEngine>): void { engine.addGate({ name: 'pii-redaction', type: 'threshold' as const, metric: 'pii_redaction', operator: '>=' as const, threshold: 0.95, enabled: true, description: 'PII redaction quality must be >= 0.95', })}export function evaluateResults( engine: ReturnType<typeof createGateEngine>, results: object, comparison?: object,): GateEvaluationSummary { return engine.evaluate(results as never, comparison as never)}export function gatesPassed(summary: GateEvaluationSummary): boolean { return summary.overallPassed}export async function writeGateReport(summary: GateEvaluationSummary, dir: string): Promise<void> { const json = JSON.stringify(summary, null, 2) await writeJUnitReport(summary, join(dir, 'gate-results.xml')) writeFileSync(join(dir, 'gate-summary.json'), json, 'utf-8')}export function addPreCommitHook(engine: ReturnType<typeof createGateEngine>): void { void engine mkdirSync('.husky', { recursive: true }) writeFileSync( join('.husky', 'pre-commit'), `pnpm exec agent-eval-harness gate eval-results/results.json --preset standard --exit-code`, )}
The standard preset gates check overall quality >= 0.80, faithfulness >= 0.80, relevance >= 0.80, tool correctness >= 0.90, cost per task <= $0.05, and P99 latency <= 5000ms. The strict preset raises the bar: quality >= 0.90, latency <= 2000ms, and no SLA violations. Your addCustomPiiGate adds a seventh gate requiring PII redaction >= 0.95.
Expected output:pnpm typecheck and pnpm lint both pass.
Step 7: Track evaluation costs with telemetry
src/lib/telemetry.ts uses @reaatech/llm-cost-telemetry to create cost spans, build telemetry contexts, load budget configuration from environment variables, and summarize costs:
ts
import { generateId, now, calculateCostFromTokens, loadBudgetConfig, retryWithBackoff, type CostSpan, type TelemetryContext, type Provider } from '@reaatech/llm-cost-telemetry'export { retryWithBackoff }export function createCostSpan(opts: { provider: string; model: string; inputTokens: number; outputTokens: number; tenant: string; feature: string }): CostSpan { if (opts.inputTokens < 0) { throw new Error('inputTokens must not be negative') } if (opts.outputTokens < 0) { throw new Error('outputTokens must not be negative') } return { id: generateId(), provider: opts.provider as Provider, model: opts.model, inputTokens: opts.inputTokens, outputTokens: opts.outputTokens, costUsd: calculateCostFromTokens(opts.inputTokens + opts.outputTokens, 15), tenant: opts.tenant, feature: opts.feature, timestamp: now(), }}export function createTelemetryContext(tenant: string, feature: string, route?: string, metadata?: Record<string, unknown>): TelemetryContext { return { tenant, feature, route: route ?? '', metadata: metadata ?? {} }}export function loadBudgetFromEnv(): { daily: number; monthly: number } { const config = loadBudgetConfig() return { daily: config.global?.daily ?? 100, monthly: config.global?.monthly ?? 3000, }}export function summarizeCosts(spans: CostSpan[]): { totalCostUsd: number; byFeature: Record<string, { totalCostUsd: number; spanCount: number }> } { const totalCostUsd = spans.reduce((sum, s) => sum + s.costUsd, 0) const byFeature: Record<string, { totalCostUsd: number; spanCount: number }> = {} for (const span of spans) { const feature = span.feature ?? 'unknown' byFeature[feature] = byFeature[feature] ?? { totalCostUsd: 0, spanCount: 0 } const entry = byFeature[feature] entry.totalCostUsd += span.costUsd entry.spanCount++ } return { totalCostUsd, byFeature }}
createCostSpan validates that token counts are non-negative (throwing a descriptive error if they’re not) and uses the calculateCostFromTokens utility to compute USD cost. loadBudgetFromEnv delegates to @reaatech/llm-cost-telemetry’s loadBudgetConfig() which reads EVAL_BUDGET_LIMIT and other env vars — a single call site for all budget configuration.
Step 8: Export results to Langfuse
src/lib/langfuse.ts wraps the Langfuse SDK for trace-level observability:
ts
import Langfuse from 'langfuse'let client: Langfuse | undefinedexport function createLangfuseClient(): Langfuse { if (!client) { const publicKey = process.env.LANGFUSE_PUBLIC_KEY ?? '' const secretKey = process.env.LANGFUSE_SECRET_KEY ?? '' client = new Langfuse({ publicKey, secretKey, baseUrl: process.env.LANGFUSE_BASE_URL ?? 'https://cloud.langfuse.com', }) } return client}export function exportToLangfuse( runResult: { runId: string totalTrajectories: number summary: { passRate: number } trajectoryResults?: Array<{ trajectoryId?: string; overallScore: number; passed: boolean; metricScores?: Record<string, number> }> }, langfuseClient: Langfuse,): void { try { langfuseClient.trace({ id: runResult.runId, name: 'eval-run', metadata: { total: runResult.totalTrajectories, passRate: runResult.summary.passRate }, }) if (runResult.trajectoryResults) { for (const [idx, trajectory] of runResult.trajectoryResults.entries()) { try { langfuseClient.event({ traceId: runResult.runId, name: `trajectory-${String(trajectory.trajectoryId ?? idx)}`, input: {} as object, output: {} as object, metadata: { overallScore: trajectory.overallScore, passed: trajectory.passed, metricScores: trajectory.metricScores }, }) } catch { } } } } catch (err) { console.error('Failed to export to Langfuse:', err) }}export async function flushLangfuse(langfuseClient: Langfuse): Promise<void> { try { await langfuseClient.shutdownAsync() } catch (err) { console.error('Failed to flush Langfuse:', err) }}
The client is a lazy singleton — createLangfuseClient() only calls new Langfuse(...) once and reuses the instance. exportToLangfuse creates one root trace per eval run and one event per trajectory result, so you can filter your Langfuse dashboard by run ID and drill into individual trajectories. All Langfuse errors are caught and logged rather than thrown, keeping the eval pipeline running even when observability is unavailable.
Step 9: Orchestrate the full evaluation pipeline
src/lib/eval-pipeline.ts is the heart of the harness. It chains config loading → trajectory loading → suite runner → results aggregation → LLM judging → gate evaluation → export → Langfuse sync:
ts
import { loadAndValidateConfig } from './config.js'import { createEvalSuiteRunner, createAggregator, aggregateAndExport, createPiiRedactionMetric } from './suite.js'import { createOpenaiJudgeEngine, judgeBatch, createJudgeCostTracker, trackJudgmentCost } from './judges.js'import { createEvalGateEngine, addCustomPiiGate, evaluateResults, gatesPassed, writeGateReport } from './gate.js'import { createLangfuseClient, exportToLangfuse, flushLangfuse } from './langfuse.js'import type { PipelineOptions, EvalSummary } from './types.js'import { readFileSync, readdirSync, existsSync, mkdirSync, writeFileSync } from 'node:fs'import { join } from 'node:path'import { randomUUID } from 'node:crypto'interface LoadedTrajectory { trajectory_id?: string turns: Array<{ turn_id
This is the only function your CLI and API routes need to call. It reads JSONL trajectory files from a directory, runs them through the suite runner with the PII metric, aggregates the results, sends them to the OpenAI judge for faithfulness scoring, checks every result against the CI gates, writes JSON/Markdown/JUnit exports, and optionally ships everything to Langfuse.
Expected output:pnpm typecheck reports zero errors across all src/ files.
Step 10: Build the CLI entry point and API routes
The CLI at src/cli/index.ts dispatches process.argv to subcommand handlers. It uses the @reaatech/agent-eval-harness-cli package’s output helpers (cliOut, cliError, cliWarn) and delegates report, compare, and golden commands to the package’s built-in command functions:
ts
import { compareCommand, goldenCommand, reportCommand, cliOut, cliError, cliWarn } from '@reaatech/agent-eval-harness-cli'import { runEvalPipeline } from '../lib/eval-pipeline.js'import { createOpenaiJudgeEngine, judgeResponse } from '../lib/judges.js'import { evaluateResults, createEvalGateEngine } from '../lib/gate.js'import { readFileSync } from 'node:fs'function parseFlags(args: string[]): Record<string, string | boolean> { const flags: Record<string, string | boolean> = {} for (let i = 0; i < args.length; i
The three API routes under app/api/ expose the pipeline via HTTP:
app/api/eval/route.ts — accepts POST with { trajectoriesPath, configPath?, preset?, outputDir? }, validated via Zod, and calls runEvalPipeline:
app/api/results/route.ts — GET handler that lists past eval runs (when called without runId) or returns a specific run’s results in JSON, Markdown, or JUnit format.
app/api/gate/route.ts — POST handler that accepts { resultsPath, preset? }, reads the results file, runs gate evaluation, and returns the gate summary.
All route handlers use NextRequest and NextResponse.json() as required by Next.js App Router conventions.
Expected output:pnpm typecheck passes. pnpm lint passes with zero violations.
Step 11: Run the test suite
The project ships with 12 test files covering every module. Here’s a sample to show the testing pattern — tests/lib/suite.test.ts tests the PII redaction metric:
ts
import { describe, it, expect, vi, beforeEach } from 'vitest'vi.mock('@reaatech/agent-eval-harness-suite', () => ({ SuiteRunner: vi.fn(function () { return { run: vi.fn() } }), createResultsAggregator: vi.fn(),}))import { SuiteRunner, createResultsAggregator } from '@reaatech/agent-eval-harness-suite'import { createPiiRedactionMetric, createEvalSuiteRunner, createAggregator, aggregateAndExport } from '../../src/lib/suite.js'describe('createPiiRedactionMetric', () => { const metric = createPiiRedactionMetric() it('returns score 1.0 for clean response with no PII', () => { const result = metric.score('Clean response with no personal data.') expect(result).toEqual({ score: 1.0, hasPiiLeak: false, piiTypesFound: [], redactedCount: 0 }) }) it('detects email and phone in response', () => { const result = metric.score('Contact john.doe@example.com or call 555-123-4567') expect(result.hasPiiLeak).toBe(true) expect(result.piiTypesFound).toContain('email') expect(result.piiTypesFound).toContain('phone') }) it('returns score 1.0 for empty string', () => { const result = metric.score('') expect(result.score).toBe(1.0) expect(result.hasPiiLeak).toBe(false) }) it('detects SSN pattern', () => { const result = metric.score('My SSN is 123-45-6789') expect(result.hasPiiLeak).toBe(true) expect(result.piiTypesFound).toContain('ssn') }) it('detects credit card pattern', () => { const result = metric.score('Card: 4111-1111-1111-1111') expect(result.hasPiiLeak).toBe(true) expect(result.piiTypesFound).toContain('credit_card') })})
The test file also includes describe blocks for createEvalSuiteRunner, createAggregator, and aggregateAndExport — the same pattern: mock the dependencies, call the function, and assert the result.
Run the full suite with:
terminal
pnpm typecheck && pnpm lint && pnpm test
The Vitest configuration uses pool: "threads", the v8 coverage provider, and enforces 90%+ thresholds on lines, branches, functions, and statements across src/**/*.ts and app/**/route.ts.
Expected output: All tests pass, zero test failures, and coverage exceeds the 90% threshold on all four metrics. The vitest-report.json written to disk shows numFailedTests: 0.
Next steps
Add more metrics — extend createPiiRedactionMetric with additional patterns (IP addresses, API keys, custom regex sets) or create new metric modules for tone analysis and policy compliance
Integrate with GitHub Actions — use the @reaatech/agent-eval-harness-gateCIIntegration class to generate PR comments and step summaries when gates fail
Add golden trajectory comparison — use the CLI’s golden subcommand to curate a set of ideal support responses and compare every new run against them
Deploy as a dashboard — run next dev and point a cron job at POST /api/eval to produce weekly quality reports in your Langfuse workspace
:
number
; role
:
'user'
|
'agent'
; content
:
string
; timestamp
:
string
}>
metadata?: Record<string, unknown>
}
export async function runEvalPipeline(opts: PipelineOptions): Promise<EvalSummary> {