OpenAI Agent Eval Harness for SMB Customer Support Quality

Automatically evaluate every production AI support interaction to catch bad answers, hallucinations, and policy violations before they affect customers.

openai eval-harness customer-support llm-evaluation quality-gates typescript nextjs langfuse

The problem

SMB customer support agents powered by OpenAI often drift in tone, hallucinate product details, or miss steps, but manual spot-checking doesn't scale as ticket volume grows.

Intro

Small and medium businesses that use AI for customer support need a systematic way to catch bad answers, hallucinated product details, and policy violations. Manual spot-checking doesn’t scale as ticket volume grows. This tutorial builds an automated evaluation harness that scores support conversations across faithfulness, relevance, cost, latency, and PII safety, then gates deployments on quality regressions. You’ll write a configurable pipeline that loads recorded support trajectories, scores them with a custom PII redaction metric and an OpenAI LLM judge, checks results against CI-compatible quality gates, logs costs, and exports everything to Langfuse for dashboards.
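To make the idea of gating on quality regressions concrete, here is a minimal sketch of a threshold check over averaged metric scores. The type names, metric keys, thresholds, and the `checkGates` function are all hypothetical and chosen for illustration; the harness packages used later in this tutorial define their own types and gate logic.

```typescript
// Hypothetical shapes for illustration only; not the harness's own types.
type MetricName = 'faithfulness' | 'relevance' | 'piiSafety' | 'latencyMs' | 'costUsd';

interface ConversationScore {
  conversationId: string;
  metrics: Record<MetricName, number>;
}

interface Gate {
  metric: MetricName;
  // 'min': the mean must be at least `threshold`; 'max': at most `threshold`.
  direction: 'min' | 'max';
  threshold: number;
}

interface GateResult {
  gate: Gate;
  mean: number;
  passed: boolean;
}

function meanOf(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Evaluates each gate against the mean of its metric across all conversations.
// An empty score set fails every gate, so a broken loader cannot pass CI silently.
function checkGates(scores: ConversationScore[], gates: Gate[]): GateResult[] {
  return gates.map((gate) => {
    if (scores.length === 0) {
      return { gate, mean: Number.NaN, passed: false };
    }
    const mean = meanOf(scores.map((s) => s.metrics[gate.metric]));
    const passed = gate.direction === 'min' ? mean >= gate.threshold : mean <= gate.threshold;
    return { gate, mean, passed };
  });
}

// Made-up scores and thresholds, purely to exercise the function.
const sampleScores: ConversationScore[] = [
  { conversationId: 'c1', metrics: { faithfulness: 1.0, relevance: 0.9, piiSafety: 1, latencyMs: 800, costUsd: 0.01 } },
  { conversationId: 'c2', metrics: { faithfulness: 0.5, relevance: 0.8, piiSafety: 1, latencyMs: 1200, costUsd: 0.02 } },
];

const sampleGates: Gate[] = [
  { metric: 'faithfulness', direction: 'min', threshold: 0.8 },
  { metric: 'latencyMs', direction: 'max', threshold: 1500 },
];

const gateResults = checkGates(sampleScores, sampleGates);
// A CI script could set a non-zero exit code when this is false.
const allGatesPassed = gateResults.every((r) => r.passed);
```

With these sample values the latency gate passes and the faithfulness gate fails, so `allGatesPassed` is false. Averaging is only one possible aggregation; a stricter gate might look at the worst conversation or a percentile instead.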

Prerequisites

Node.js 22+ and pnpm 10 installed on your machine
An OpenAI API key (set as OPENAI_API_KEY in your environment or .env)
A Langfuse account (free tier works); optional, needed only for the observability export step
Familiarity with TypeScript and basic Next.js App Router patterns
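Because the harness depends on OPENAI_API_KEY while Langfuse is optional, it can help to fail fast on the required key and degrade gracefully on the optional ones. The sketch below is a hypothetical helper, not part of the tutorial's code. It takes the environment as a plain object so it can be tested without touching `process.env`, and it assumes the Langfuse variable names LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY, which you should confirm against your Langfuse setup.

```typescript
// Hypothetical helper for illustration; names are not taken from the harness packages.
type Env = Record<string, string | undefined>;

interface HarnessSettings {
  openaiApiKey: string;
  langfuseEnabled: boolean;
}

// Returns the trimmed value or throws with a clear message when it is missing or blank.
function requireEnv(env: Env, name: string): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

function loadSettings(env: Env): HarnessSettings {
  return {
    openaiApiKey: requireEnv(env, 'OPENAI_API_KEY'),
    // Assumed variable names; the export step is skipped unless both are set.
    langfuseEnabled: Boolean(env['LANGFUSE_PUBLIC_KEY']?.trim() && env['LANGFUSE_SECRET_KEY']?.trim()),
  };
}
```

In a Node.js entry point this would typically be called once at startup as `loadSettings(process.env)`.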

Step 1: Scaffold the project and install dependencies

Create a new Next.js project with the App Router, then replace package.json with the exact dependencies below. Every package is pinned to a specific version so the harness behaves the same way every time you install:
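As a side check on the pinning claim, a dependency map can be scanned for range specifiers such as `^` or `~`. The sketch below is a hypothetical helper, not part of the recipe. The package names and version numbers in the sample are placeholders, not the versions this tutorial uses.

```typescript
// Matches an exact semver such as 1.2.3 or 1.2.3-beta.1, with no range operator or dist-tag.
const EXACT_VERSION = /^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$/;

// Returns the names of dependencies whose version is not an exact pin.
function findUnpinned(dependencies: Record<string, string>): string[] {
  return Object.keys(dependencies).filter(
    (name) => !EXACT_VERSION.test(dependencies[name] ?? ''),
  );
}

// Placeholder manifest fragment, purely illustrative.
const sampleDependencies: Record<string, string> = {
  'exact-pkg': '1.2.3',
  'caret-pkg': '^1.2.3',
  'tilde-pkg': '~2.0.0',
  'tag-pkg': 'latest',
};

const unpinned = findUnpinned(sampleDependencies);
```

Here `unpinned` contains `caret-pkg`, `tilde-pkg`, and `tag-pkg`. A lockfile still matters for transitive dependencies; exact versions in package.json pin only the direct ones.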

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

179 kB · 92 tests · 96.1% coverage · vitest passing

SHA-256: 3a7d840bd0143f810b60840feead01bf55ac51cbad0153841bfeae3d21ccbe74

import {
  SuiteRunner,
  createResultsAggregator,
  type SuiteRunnerConfig,
  type EvalRunResult,
  type SuiteConfig,
  type ResultsAggregator,
} from '@reaatech/agent-eval-harness-suite';
import type { PiiRedactionResult } from './types.js';

export function createPiiRedactionMetric(): { name: string; score(response: string): PiiRedactionResult } {
  // The `g` flag makes match() return every occurrence, so totalCount
  // counts individual leaks rather than at most one per pattern.
  const patterns: Array<{ regex: RegExp; type: string }> = [
    { regex: /\b[\w.-]+@[\w.-]+\.\w+\b/g, type: 'email' },
    { regex: /\b\d{3}-\d{3}-\d{4}\b/g, type: 'phone' },
    { regex: /\b\d{3}-\d{2}-\d{4}\b/g, type: 'ssn' },
    { regex: /\b(?:\d[ -]*?){13,16}\b/g, type: 'credit_card' },
  ];

  return {
    name: 'pii_redaction',
    score(response: string): PiiRedactionResult {
      const found = new Set<string>();
      let totalCount = 0;

      for (const { regex, type } of patterns) {
        const matches = response.match(regex);
        if (matches) {
          found.add(type);
          totalCount += matches.length;
        }
      }

      if (totalCount === 0) {
        return { score: 1.0, hasPiiLeak: false, piiTypesFound: [], redactedCount: 0 };
      }

      // Each detected item costs 0.25, so four or more leaks score 0.
      return {
        score: Math.max(0, 1 - totalCount * 0.25),
        hasPiiLeak: true,
        piiTypesFound: Array.from(found),
        redactedCount: totalCount,
      };
    },
  };
}

// Wraps SuiteRunner so callers receive only the completed/total counts.
export function createEvalSuiteRunner(
  config: SuiteRunnerConfig,
  progressCallback?: (progress: { completed: number; total: number }) => void,
): SuiteRunner {
  if (progressCallback) {
    return new SuiteRunner(config, (update) => {
      progressCallback({ completed: update.completed, total: update.total });
    });
  }
  return new SuiteRunner(config);
}

export function createAggregator(config: SuiteConfig): ResultsAggregator {
  return createResultsAggregator(config);
}

// Aggregates one run and serializes it in the requested report format.
export function aggregateAndExport(
  results: EvalRunResult,
  config: SuiteConfig,
  format: 'json' | 'junit' | 'csv' | 'markdown',
): string {
  const aggregator = createResultsAggregator(config);
  const aggregated = aggregator.aggregate(results);
  return aggregator.export(aggregated, format);
}
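A PII metric that penalizes each detected item depends on how `String.prototype.match` counts. Without the `g` flag, `match` returns only the first match (plus any capture groups), so a response that leaks two phone numbers counts as one. The standalone sketch below illustrates the difference and the 0.25-per-item penalty; `countMatches` and `scoreFor` are hypothetical helper names, and the sample text is made up.

```typescript
const phonePattern = /\b\d{3}-\d{3}-\d{4}\b/;
const sampleText = 'Call 555-123-4567 or 555-987-6543 for help.';

// Non-global: an array holding just the first match, or null.
const firstOnly = sampleText.match(phonePattern);

// Counts every non-overlapping occurrence by forcing the global flag.
function countMatches(text: string, pattern: RegExp): number {
  const flags = pattern.flags.includes('g') ? pattern.flags : pattern.flags + 'g';
  const matches = text.match(new RegExp(pattern.source, flags));
  return matches ? matches.length : 0;
}

// Same penalty shape as the metric: 0.25 per item, clamped at 0.
function scoreFor(piiCount: number): number {
  return Math.max(0, 1 - piiCount * 0.25);
}

const phoneCount = countMatches(sampleText, phonePattern);
const phoneScore = scoreFor(phoneCount);
```

For the sample text, `firstOnly` has length 1 while `phoneCount` is 2, giving a score of 0.5. Regex-based detection is a coarse filter: it will miss unusual formats and can flag harmless strings such as order numbers, so treat the score as a signal rather than a guarantee.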

import OpenAI from 'openai'
import {
  JudgeEngine,
  JudgeCalibrator,
  JudgeCostTracker,
  type JudgeRequest,
  type JudgeScore,
} from '@reaatech/agent-eval-harness-judge'

export class JudgeError extends Error {
  status?: number
  requestId?: string

  constructor(message: string, status?: number, requestId?: string) {
    super(message)
    this.name = 'JudgeError'
    this.status = status
    this.requestId = requestId
  }
}

export function createOpenaiJudgeEngine(options?: { model?: string; temperature?: number }): JudgeEngine {
  return new JudgeEngine({
    model: options?.model ?? 'gpt-4o',
    provider: 'gpt4',
    temperature: options?.temperature ?? 0.1,
  })
}

// Normalizes any failure from the judge into a JudgeError.
export async function judgeResponse(judgeEngine: JudgeEngine, request: JudgeRequest): Promise<JudgeScore> {
  try {
    return await judgeEngine.judge(request)
  } catch (error) {
    if (error instanceof OpenAI.APIError) {
      throw new JudgeError(error.message, error.status as number | undefined)
    }
    const judgeErr = error as { message: string; status?: number }
    throw new JudgeError(judgeErr.message, judgeErr.status)
  }
}

export async function judgeBatch(
  judgeEngine: JudgeEngine,
  requests: Array<{ id: string; request: JudgeRequest }>,
  concurrency?: number,
) {
  return judgeEngine.judgeBatch(requests, concurrency ?? 5)
}

export function createJudgeCostTracker(opts: {
  budgetLimit?: number
  maxCostPerJudgment?: number
  alertThresholds?: number[]
}): JudgeCostTracker {
  return new JudgeCostTracker({
    budgetLimit: opts.budgetLimit ?? 10.00,
    maxCostPerJudgment: opts.maxCostPerJudgment ?? 0.05,
    alertThresholds: opts.alertThresholds ?? [0.5, 0.75, 0.9],
  })
}

// Records one judgment and logs any budget alerts the tracker raises.
export function trackJudgmentCost(
  tracker: JudgeCostTracker,
  opts: { judgmentId: string; provider: string; model: string; inputTokens: number; outputTokens: number },
): void {
  const result = tracker.recordJudgment(
    opts.judgmentId,
    opts.provider as 'claude' | 'gpt4' | 'gemini' | 'openrouter',
    opts.model,
    opts.inputTokens,
    opts.outputTokens,
  )
  if (result.alerts.length > 0) {
    for (const alert of result.alerts) {
      console.warn(`[JudgeCostTracker] ${alert.level}: ${alert.message}`)
    }
  }
}

export function calibrateJudge(
  humanLabels: Array<{ sampleId: string; score: number; type: string }>,
  judgeScores: JudgeScore[],
): JudgeCalibrator {
  if (humanLabels.length < 3) {
    throw new Error('At least 3 human-labeled samples are required for calibration')
  }
  const calibrator = new JudgeCalibrator('temperature_scaling')
  calibrator.addCalibrationData(humanLabels, judgeScores)
  calibrator.calibrate()
  return calibrator
}
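The cost tracker above is configured with a budget limit and alert thresholds expressed as fractions of that budget. The sketch below shows one plausible way such thresholds could behave: each fires once, the first time cumulative spend crosses it. It is an illustration only, not the `JudgeCostTracker` implementation. The `BudgetSketch` class and `estimateCost` function are hypothetical, and the per-token prices are made-up numbers rather than real OpenAI pricing.

```typescript
// Hypothetical pricing shape; values passed in are illustrative, not real rates.
interface Pricing {
  inputPer1kTokens: number;
  outputPer1kTokens: number;
}

function estimateCost(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  return (inputTokens / 1000) * pricing.inputPer1kTokens + (outputTokens / 1000) * pricing.outputPer1kTokens;
}

class BudgetSketch {
  private spent = 0;
  private readonly fired = new Set<number>();
  private readonly budgetLimit: number;
  private readonly thresholds: number[];

  constructor(budgetLimit: number, thresholds: number[]) {
    this.budgetLimit = budgetLimit;
    // Sort ascending so alerts come out in order of severity.
    this.thresholds = [...thresholds].sort((a, b) => a - b);
  }

  // Adds a cost and returns a message for each threshold crossed for the first time.
  record(cost: number): string[] {
    this.spent += cost;
    const alerts: string[] = [];
    for (const threshold of this.thresholds) {
      if (!this.fired.has(threshold) && this.spent >= threshold * this.budgetLimit) {
        this.fired.add(threshold);
        alerts.push(`Spent at least ${Math.round(threshold * 100)}% of the budget`);
      }
    }
    return alerts;
  }

  get totalSpent(): number {
    return this.spent;
  }
}

// Same defaults as the tracker config above: 10.00 budget, alerts at 50%, 75%, 90%.
const budget = new BudgetSketch(10, [0.5, 0.75, 0.9]);
const alertsAfterFirst = budget.record(4);  // below 50%: no alerts
const alertsAfterSecond = budget.record(2); // crosses 50%
const alertsAfterThird = budget.record(4);  // crosses 75% and 90% together
const alertsAfterFourth = budget.record(1); // every threshold has already fired
```

Firing each threshold only once keeps CI logs readable when many judgments are recorded in a batch.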

Intro

Prerequisites

Step 1: Scaffold the project and install dependencies

Step 2: Define the domain types and error classes

Step 3: Build the YAML config loader

Step 4: Create the PII redaction metric and suite runner

Step 5: Wire up the LLM-as-judge engine

Step 6: Integrate CI regression gates

Step 7: Track evaluation costs with telemetry

Step 8: Export results to Langfuse

Step 9: Orchestrate the full evaluation pipeline

Step 10: Build the CLI entry point and API routes

Step 11: Run the test suite

Next steps