Small businesses deploying Azure AI chatbots for customer support struggle to keep answer quality consistent as prompts, models, and knowledge bases change. Manual testing is slow and unreliable, so wrong answers, inappropriate tool calls, and surprise cost overruns slip through.
This tutorial walks you through building an automated evaluation harness for Azure AI-powered customer support agents. You’ll create an Express API server that ingests agent trajectory logs, runs them through an LLM-as-judge evaluation pipeline, tracks cost, enforces quality gates, and surfaces results in a Next.js dashboard. By the end, your support bot QA will run automatically on every deployment, catching regressions in answer accuracy, tool use, and cost before they reach customers.
Prerequisites
Node.js 22+ and pnpm 10 installed on your machine
An Azure OpenAI resource with a deployed model (GPT-4 or similar) and its endpoint, API key, and deployment name
A Langfuse account (free tier works) for OpenTelemetry tracing — you’ll need the public and secret keys
Familiarity with TypeScript, Express, and Next.js App Router — this is a hands-on code-along, not an introduction to these
About 30 minutes to complete all steps
Step 1: Scaffold the project and install dependencies
Start from an empty directory. Create the project structure and install dependencies with pnpm. The project uses Next.js 16 (App Router) for the dashboard, Express for the eval API server, and four REAA evaluation packages for the heavy lifting.
Create a package.json with exact-pinned dependencies:
Next, create your environment variable template. Every env var your runtime code reads must be listed here:
env
# Env vars used by azure-ai-agent-eval-harness-for-smb-support-qa.
# Keep placeholders only. Never commit real values.
NODE_ENV=development

# Azure OpenAI
AZURE_OPENAI_ENDPOINT=<your-azure-openai-endpoint>
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_DEPLOYMENT_NAME=<your-deployment-name>

# Langfuse observability
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_BASE_URL=<your-langfuse-base-url>

# Eval server
EVAL_PORT=4567
EVAL_BUDGET_LIMIT=10.00
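You’ll also need a next.config.ts with the instrumentation hook flag enabled, since you’ll add OpenTelemetry tracing later. (Next.js 15 and later load instrumentation.ts by default, so the flag matters only on older versions.)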
ts
import type { NextConfig } from "next";

const nextConfig = {
  experimental: { instrumentationHook: true },
} as NextConfig;

export default nextConfig;
Expected output: After pnpm install, you see a node_modules/ directory and a pnpm-lock.yaml.
Step 2: Define your domain types
Your eval harness works with multi-turn agent conversations called trajectories. Define the core types that model a support conversation. Create src/lib/types.ts:
A Trajectory represents one customer support interaction — a sequence of turns between the user, the AI assistant, and any tools it calls. Each turn optionally records token usage so you can calculate cost per turn later.
Expected output: The file compiles without errors when you run pnpm typecheck.
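As a rough illustration of the types described in this step, here is a minimal sketch. The field names (`role`, `content`, `toolName`, `usage`, `promptTokens`, `completionTokens`) and the `totalTokens` helper are assumptions made for this example and may differ from the recipe's actual src/lib/types.ts.

```typescript
// Hypothetical shapes: every field name below is an assumption for illustration.

/** Who produced a turn in the conversation. */
type TurnRole = "user" | "assistant" | "tool";

/** Token counts reported by the model for one turn, when available. */
interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
}

/** One message in the conversation. */
interface Turn {
  role: TurnRole;
  content: string;
  /** Name of the tool that was called; only meaningful for "tool" turns. */
  toolName?: string;
  /** Optional, so turns without usage data still type-check. */
  usage?: TokenUsage;
}

/** One customer support interaction from start to finish. */
interface Trajectory {
  id: string;
  turns: Turn[];
}

/** Sums token usage across a trajectory, skipping turns that recorded none. */
function totalTokens(trajectory: Trajectory): number {
  return trajectory.turns.reduce(
    (sum, turn) =>
      sum + (turn.usage ? turn.usage.promptTokens + turn.usage.completionTokens : 0),
    0,
  );
}

const exampleTrajectory: Trajectory = {
  id: "traj-1",
  turns: [
    { role: "user", content: "Where is my order?" },
    { role: "tool", content: '{"status":"shipped"}', toolName: "lookupOrder" },
    {
      role: "assistant",
      content: "It shipped yesterday.",
      usage: { promptTokens: 120, completionTokens: 30 },
    },
  ],
};
```

Keeping `usage` optional means tool and user turns need no dummy token counts, while the cost step can still add up whatever the assistant turns reported.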
Step 3: Create the Azure OpenAI client
The eval harness needs an Azure OpenAI client to run LLM-as-judge evaluations. Create src/lib/azure-client.ts that reads credentials from environment variables and configures the AzureOpenAI client from the openai SDK:
readAzureConfig() validates env vars at runtime, throwing a typed ConfigError if any are missing.
createAzureClient() accepts an optional config — pass one directly for tests, or omit it to fall back to env vars.
getAzureClient() is a singleton accessor that reuses the same client across your app.
Expected output: Calling readAzureConfig() with valid env vars returns an object with endpoint, apiKey, and deploymentName. Calling it without AZURE_OPENAI_API_KEY throws a ConfigError with a descriptive message.
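The validation behaviour described above could look roughly like the sketch below. It covers only the config-reading half and leaves the SDK client out. Unlike the tutorial's `readAzureConfig()`, this version takes the environment as an explicit argument so it can run without Node typings; that parameter, the `Env` type, and the `requireVar` helper are assumptions of this example.

```typescript
// Sketch only: the env parameter and requireVar helper are hypothetical.

/** Error thrown when a required setting is absent. */
class ConfigError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ConfigError";
  }
}

interface AzureConfig {
  endpoint: string;
  apiKey: string;
  deploymentName: string;
}

type Env = Record<string, string | undefined>;

/** Reads one variable, rejecting missing or blank values. */
function requireVar(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value.trim() === "") {
    throw new ConfigError(`Missing required environment variable: ${key}`);
  }
  return value;
}

/** Validates the three Azure OpenAI settings and returns them as one object. */
function readAzureConfig(env: Env): AzureConfig {
  return {
    endpoint: requireVar(env, "AZURE_OPENAI_ENDPOINT"),
    apiKey: requireVar(env, "AZURE_OPENAI_API_KEY"),
    deploymentName: requireVar(env, "AZURE_OPENAI_DEPLOYMENT_NAME"),
  };
}
```

In application code you would pass `process.env`; in tests you can pass a plain object, which avoids mutating global state between test cases.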
Step 4: Build the in-memory trajectory store
Create a simple in-memory store for agent trajectory logs. Your Express API receives trajectories from the support bot and stores them here before evaluation. Create src/lib/trajectory-store.ts:
The store uses a private Map for O(1) lookups by trajectory ID. You’ll bind it to the Express API routes in a later step.
Expected output: A fresh TrajectoryStore with no trajectories returns 0 from count() and [] from getAll(). After add(), get() returns the same trajectory object.
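A Map-backed store with the four methods named above might be sketched as follows. The `StoredTrajectory` shape is a simplified stand-in for the tutorial's `Trajectory` type, and the decision to overwrite on a duplicate ID is an assumption of this example.

```typescript
// Simplified stand-in for the tutorial's Trajectory type (assumption).
interface StoredTrajectory {
  id: string;
  turns: { role: string; content: string }[];
}

class TrajectoryStore {
  // Keyed by trajectory ID, so get() is a constant-time lookup.
  private readonly items = new Map<string, StoredTrajectory>();

  /** Stores a trajectory; a later add() with the same ID replaces the earlier one. */
  add(trajectory: StoredTrajectory): void {
    this.items.set(trajectory.id, trajectory);
  }

  /** Returns the stored object itself, or undefined when the ID is unknown. */
  get(id: string): StoredTrajectory | undefined {
    return this.items.get(id);
  }

  /** Returns all trajectories in insertion order. */
  getAll(): StoredTrajectory[] {
    return Array.from(this.items.values());
  }

  count(): number {
    return this.items.size;
  }
}
```

Because everything lives in process memory, the contents are lost on restart, which is acceptable for a code-along but not for historical analysis.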
Step 5: Wire up the LLM-as-judge service
The evaluation harness uses an LLM-as-judge approach — one AI model scores another AI model’s responses. The @reaatech/agent-eval-harness-judge package provides a JudgeEngine that evaluates agent responses on four dimensions: faithfulness, relevance, tool correctness, and overall quality.
The JudgeEngine uses an OpenAI-compatible provider. When NODE_ENV=test or JUDGE_MOCK=true is set, the package automatically returns a mock score of 0.85 so your tests don’t need live API calls.
Expected output: judgeFaithfulness("context", "response") returns a JudgeScore with score between 0 and 1 and an explanation string.
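To make the LLM-as-judge idea concrete, the sketch below shows the two pieces that surround the model call: building a grading prompt and turning the model's reply into a bounded score. It does not use the @reaatech judge package, whose internals are not shown in this tutorial. The prompt wording, `buildFaithfulnessPrompt`, `parseJudgeReply`, and `shouldMockJudge` are all hypothetical; only the 0.85 mock score and the two environment switches come from the description above.

```typescript
interface JudgeScore {
  score: number; // always within [0, 1]
  explanation: string;
}

// Mock score described in this step; used when no live model should be called.
const MOCK_JUDGE_SCORE: JudgeScore = { score: 0.85, explanation: "mock judge" };

/** True when the environment asks for the mock judge (hypothetical helper). */
function shouldMockJudge(env: Record<string, string | undefined>): boolean {
  return env.NODE_ENV === "test" || env.JUDGE_MOCK === "true";
}

/** Builds the grading instructions sent to the judge model (wording is illustrative). */
function buildFaithfulnessPrompt(context: string, response: string): string {
  return [
    "You are grading a customer support answer for faithfulness.",
    "Score 1.0 if every claim is supported by the context and 0.0 if none are.",
    'Reply with JSON only: {"score": <number 0..1>, "explanation": "<one sentence>"}',
    "",
    `Context:\n${context}`,
    "",
    `Answer:\n${response}`,
  ].join("\n");
}

/** Parses the judge's JSON reply and clamps the score into [0, 1]. */
function parseJudgeReply(raw: string): JudgeScore {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("Judge reply was not valid JSON");
  }
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Judge reply was not a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  const score = record.score;
  const explanation = record.explanation;
  if (typeof score !== "number") {
    throw new Error("Judge reply is missing a numeric score");
  }
  return {
    score: Math.min(1, Math.max(0, score)),
    explanation: typeof explanation === "string" ? explanation : "",
  };
}
```

Clamping and strict parsing matter because a judge model can return out-of-range numbers or prose instead of JSON, and a gate should never act on an unparseable score.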
Step 6: Add cost tracking with REAA cost service
Every LLM call costs money, and the bill grows with production traffic. The @reaatech/agent-eval-harness-cost package calculates per-trajectory token costs, enforces daily budgets, and generates cost reports.
You’ll need an adapter that converts your local Trajectory type to the REAA package’s expected shape. Create src/lib/reaa-adapter.ts:
The interfaces describe cost data at three levels: a per-turn breakdown (TurnCost), a per-trajectory summary (CostBreakdown), and an accumulated report (CostReport). The CostService wraps the package’s stateless functions and adds a CostTracker for budget enforcement.
Expected output: calculateTrajectoryCost() returns a breakdown with a non-zero total_cost for a trajectory that has token usage on its turns. getTotalCost() returns 0 before any trajectories are tracked.
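The arithmetic behind per-trajectory cost and budget enforcement can be sketched without the REAA package. Everything here is hypothetical: the `Pricing` shape, the `BudgetTracker` class, and above all the rates, which are placeholders rather than real Azure OpenAI prices.

```typescript
interface Usage {
  promptTokens: number;
  completionTokens: number;
}

interface Pricing {
  inputPer1K: number; // USD per 1,000 prompt tokens
  outputPer1K: number; // USD per 1,000 completion tokens
}

// Placeholder rates for illustration only. These are NOT real Azure prices.
const PLACEHOLDER_PRICING: Pricing = { inputPer1K: 0.01, outputPer1K: 0.03 };

/** Cost of one turn: tokens divided by 1,000, times the matching rate. */
function turnCost(usage: Usage, pricing: Pricing): number {
  return (
    (usage.promptTokens / 1000) * pricing.inputPer1K +
    (usage.completionTokens / 1000) * pricing.outputPer1K
  );
}

/** Cost of a whole trajectory; turns without usage contribute nothing. */
function trajectoryCost(turns: { usage?: Usage }[], pricing: Pricing): number {
  return turns.reduce(
    (sum, turn) => sum + (turn.usage ? turnCost(turn.usage, pricing) : 0),
    0,
  );
}

/** Accumulates spend and reports when it exceeds a fixed limit. */
class BudgetTracker {
  private spent = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  record(cost: number): void {
    this.spent += cost;
  }

  getTotalCost(): number {
    return this.spent;
  }

  isOverBudget(): boolean {
    return this.spent > this.limit;
  }
}
```

With these placeholder rates, a turn using 1,000 prompt tokens and 500 completion tokens costs about $0.025. In the tutorial's setup the limit would plausibly come from EVAL_BUDGET_LIMIT, parsed as a number.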
Step 7: Build the evaluation orchestration service
This is the central piece that ties everything together. The EvalService combines your judge, cost, and store layers into a single “run evaluation” operation using @reaatech/agent-eval-harness-suite. Create src/lib/eval-service.ts:
ts
import {
  SuiteRunner,
  parseConfig,
  createResultsAggregator,
  ResultsAggregator,
  RunComparator,
} from "@reaatech/agent-eval-harness-suite";
import type {
  SuiteConfig,
  EvalRunResult,
  AggregatedResults,
  RunComparisonResult,
} from "@reaatech/agent-eval-harness-suite";
import type { EvalResult } from "@reaatech/agent-eval-harness-types";
import { calculateTrajectoryCost as rawCalculateTrajectoryCost } from "@reaatech/agent-eval-harness-cost";
import type { Trajectory } from "./types.js";
import { EvalJudgeService } from "./judge-service.js";
import { CostService } from "./cost-service.js";
import { TrajectoryStore } from "./trajectory-store.js";
The SuiteRunner takes a list of trajectories and an evaluateTrajectory callback. It runs evaluations in parallel (4 concurrent workers by default) and returns an EvalRunResult with per-trajectory scores and overall metrics. The ResultsAggregator then rolls those scores into a structured summary with averages, pass rates, and breakdowns.
The cost library import is aliased as rawCalculateTrajectoryCost to keep it distinct from the CostService.calculateTrajectoryCost instance method called on line 85 of the file.
Expected output: runEvaluation() with 2 trajectories in the store returns an EvalRunResult with status: "completed" and totalTrajectories: 2.
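The roll-up that an aggregator performs, averages and pass rates over per-trajectory scores, can be sketched in a few lines. This is not the ResultsAggregator from the suite package; `TrajectoryScore`, `RunSummary`, and `summarizeRun` are hypothetical names used only to show the calculation.

```typescript
// Hypothetical per-trajectory result; the real package's shape may differ.
interface TrajectoryScore {
  trajectoryId: string;
  score: number; // 0..1, as produced by the judge
  passed: boolean;
}

interface RunSummary {
  total: number;
  averageScore: number;
  passRate: number; // fraction of trajectories that passed, 0..1
}

/** Rolls individual scores into run-level averages. */
function summarizeRun(results: TrajectoryScore[]): RunSummary {
  if (results.length === 0) {
    // Avoid dividing by zero; an empty run has nothing to average.
    return { total: 0, averageScore: 0, passRate: 0 };
  }
  const scoreSum = results.reduce((sum, result) => sum + result.score, 0);
  const passedCount = results.filter((result) => result.passed).length;
  return {
    total: results.length,
    averageScore: scoreSum / results.length,
    passRate: passedCount / results.length,
  };
}
```

Reporting pass rate separately from the average is useful because a single very poor conversation can hide behind a healthy mean.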
Step 8: Create the Express API server
The Express server exposes REST endpoints for trajectory ingestion, evaluation runs, cost reports, and health checks. It wires together all the services you’ve built so far. Create src/server.ts:
ts
import express, { type Request, type Response } from "express";
import { z } from "zod";
import { TrajectoryStore } from "./lib/trajectory-store.js";
import { EvalService } from "./lib/eval-service.js";
import { EvalJudgeService, createEvalJudgeEngine } from "./lib/judge-service.js";
import { CostService } from "./lib/cost-service.js";
import { createAzureClient, readAzureConfig } from "./lib/azure-client.js";
import type { Trajectory } from "./lib/types.js";
import type { EvalRunResult } from "@reaatech/agent-eval-harness-suite";

const TrajectoryInputSchema =
The server exports a createApp() factory function so your tests can create fresh instances without binding to a port. The startup guard (NODE_ENV !== "test") prevents the server from listening during vitest runs.
Expected output: Starting the server with NODE_ENV=development and EVAL_PORT=4567 prints Eval server listening on port 4567. A POST /trajectories with a valid body returns { "ok": true, "id": "..." }.
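The ingestion endpoint has to reject malformed bodies before they reach the store. The tutorial does this with a zod schema; the hand-written validator below is a dependency-free stand-in that illustrates the same kind of check, not a reconstruction of that schema. The accepted fields (`id`, `turns`, `role`, `content`) and the `parseTrajectoryInput` name are assumptions of this example.

```typescript
type InputRole = "user" | "assistant" | "tool";

interface TrajectoryInput {
  id: string;
  turns: { role: InputRole; content: string }[];
}

type ParseResult =
  | { ok: true; value: TrajectoryInput }
  | { ok: false; error: string };

const ALLOWED_ROLES: readonly InputRole[] = ["user", "assistant", "tool"];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

/** Checks an untrusted request body and returns either a typed value or an error message. */
function parseTrajectoryInput(body: unknown): ParseResult {
  if (!isRecord(body)) {
    return { ok: false, error: "body must be a JSON object" };
  }
  const id = body.id;
  const turns = body.turns;
  if (typeof id !== "string" || id.length === 0) {
    return { ok: false, error: "id must be a non-empty string" };
  }
  if (!Array.isArray(turns) || turns.length === 0) {
    return { ok: false, error: "turns must be a non-empty array" };
  }
  const parsedTurns: TrajectoryInput["turns"] = [];
  for (const turn of turns as unknown[]) {
    if (!isRecord(turn)) {
      return { ok: false, error: "each turn must be an object" };
    }
    const content = turn.content;
    const role = ALLOWED_ROLES.find((allowed) => allowed === turn.role);
    if (typeof content !== "string" || role === undefined) {
      return { ok: false, error: "each turn needs a valid role and string content" };
    }
    parsedTurns.push({ role, content });
  }
  return { ok: true, value: { id, turns: parsedTurns } };
}
```

A route handler would typically respond with HTTP 400 and the error message when `ok` is false, and store `value` otherwise.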
Step 9: Add the CI/CD gate checker
The gate checker evaluates aggregated results against configurable quality thresholds and produces a pass/fail summary suitable for CI pipelines like GitHub Actions. Create src/gates/ci-check.ts:
ts
import {
  createGateEngine,
  getStandardPreset,
  CIIntegration,
} from "@reaatech/agent-eval-harness-gate";
import type { GateEvaluationSummary, GateDefinition } from "@reaatech/agent-eval-harness-gate";
import type { AggregatedResults } from "@reaatech/agent-eval-harness-suite";

export function runGateCheck(
  results: AggregatedResults,
  thresholdOverrides?: Record<string, number>,
): GateEvaluationSummary {
  const preset = getStandardPreset();
  const gates = preset.gates.map((gate: GateDefinition) => {
    const name = gate.name;
    if (thresholdOverrides && name in thresholdOverrides) {
      return { ...gate, threshold: thresholdOverrides[name] };
    }
    return gate;
  });
  const engine = createGateEngine(gates);
  return engine.evaluate(results);
}

export function getCIExitCode(summary: GateEvaluationSummary): number {
  return CIIntegration.getExitCode(summary);
}

export function generateJUnitReport(summary: GateEvaluationSummary): string {
  return CIIntegration.generateJUnitReport(summary);
}
The standard preset gates include:
Overall quality >= 0.80
Faithfulness >= 0.80
Relevance >= 0.80
Tool correctness >= 0.90
Cost per task <= $0.05
Latency P99 <= 5000ms
Pass rate >= 95%
You can override any threshold by passing thresholdOverrides — useful when a particular deployment needs stricter or looser gates.
Expected output: runGateCheck() with all scores at 1.0 returns { overallPassed: true }. With a quality score of 0.5, it returns { overallPassed: false } and a failedGates count greater than 0.
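To show what threshold gating amounts to, here is a library-free sketch that applies the seven thresholds listed above to a flat metrics object. It is not the gate package's implementation. The metric keys (`overallQuality`, `costPerTask`, and so on), the `evaluateGates` and `gateExitCode` functions, and the rule that a missing metric counts as a failure are all assumptions of this example.

```typescript
interface Gate {
  metric: string; // key looked up in the metrics object (hypothetical naming)
  comparator: ">=" | "<=";
  threshold: number;
}

// Thresholds taken from the list above; metric keys are illustrative.
const SKETCH_GATES: Gate[] = [
  { metric: "overallQuality", comparator: ">=", threshold: 0.8 },
  { metric: "faithfulness", comparator: ">=", threshold: 0.8 },
  { metric: "relevance", comparator: ">=", threshold: 0.8 },
  { metric: "toolCorrectness", comparator: ">=", threshold: 0.9 },
  { metric: "costPerTask", comparator: "<=", threshold: 0.05 },
  { metric: "latencyP99Ms", comparator: "<=", threshold: 5000 },
  { metric: "passRate", comparator: ">=", threshold: 0.95 },
];

interface GateSummary {
  overallPassed: boolean;
  failedGates: number;
  failures: string[]; // metric keys of the gates that failed
}

/** Checks every gate, applying any per-metric threshold override first. */
function evaluateGates(
  metrics: Record<string, number | undefined>,
  overrides: Record<string, number | undefined> = {},
): GateSummary {
  const failures: string[] = [];
  for (const gate of SKETCH_GATES) {
    const threshold = overrides[gate.metric] ?? gate.threshold;
    const value = metrics[gate.metric];
    // A metric that was never measured fails its gate rather than passing silently.
    const passed =
      value !== undefined &&
      (gate.comparator === ">=" ? value >= threshold : value <= threshold);
    if (!passed) failures.push(gate.metric);
  }
  return { overallPassed: failures.length === 0, failedGates: failures.length, failures };
}

/** Conventional CI exit code: 0 on success, 1 on any failed gate. */
function gateExitCode(summary: GateSummary): number {
  return summary.overallPassed ? 0 : 1;
}
```

A CI step can pass the exit code to the process so that the pipeline fails whenever any gate does.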
Step 10: Set up OpenTelemetry tracing
Create src/instrumentation.ts to initialize OpenTelemetry tracing and Langfuse observability at startup. This file runs automatically when the Next.js dev server starts, because you enabled experimental.instrumentationHook in next.config.ts:
The NEXT_RUNTIME === "nodejs" guard is essential — register() runs in both Node.js and Edge runtimes, but @traceloop/node-server-sdk and langfuse are Node-only packages. Dynamic import() ensures they’re only loaded when safe.
Expected output: When you run next dev, the register() function executes and initializes OpenTelemetry. No errors appear during startup.
Step 11: Create the Next.js dashboard pages
The Next.js dashboard connects to the Express eval server and displays trajectory counts and cost data. First, create the layout at app/layout.tsx:
ts
import type { Metadata } from "next";
import { Geist, Geist_Mono } from "next/font/google";
import "./globals.css";

const geistSans = Geist({
  variable: "--font-geist-sans",
  subsets: ["latin"],
});

const geistMono = Geist_Mono({
  variable: "--font-geist-mono",
  subsets: ["latin"],
});

export const metadata: Metadata = {
  title: "Create Next App",
  description: "Generated by create next app",
};

export default function RootLayout({
  children,
}: Readonly<{
  children: React.ReactNode;
}>) {
  return (
    <html lang="en" className={`${geistSans.variable} ${geistMono.variable}`}>
      <body>{children}</body>
    </html>
  );
}
Create a minimal CSS file at app/globals.css with basic body styling:
Both pages are server components that fetch from the Express eval server at render time. If the eval server isn’t running yet, they gracefully degrade instead of crashing.
Expected output: Visiting http://localhost:3000 shows the landing page with the eval server status. The dashboard at /dashboard lists ingested trajectories and cumulative cost.
Step 12: Write tests and run the suite
Your test infrastructure uses vitest with MSW (Mock Service Worker) to intercept Azure OpenAI network calls, plus vi.mock for package-level mocks. Create tests/setup.ts as the global test configuration:
Keep the vi.mock calls above the MSW imports. vitest hoists vi.mock to the top of the module regardless of where it appears, so writing the calls first makes the source order match the execution order. The test helpers (makeTrajectory, makeTrajectoryResult, makeAggregatedResults) produce valid objects for test assertions without repeating boilerplate in every test file.
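A factory helper of the kind mentioned above usually returns a valid default object and lets each test override only the fields it cares about. The sketch below shows that pattern for makeTrajectory alone; the `TestTrajectory` shape, the default turns, and the counter-based IDs are assumptions of this example rather than the recipe's actual helper.

```typescript
// Simplified stand-in for the tutorial's Trajectory type (assumption).
interface TestTrajectory {
  id: string;
  turns: { role: "user" | "assistant" | "tool"; content: string }[];
}

let trajectoryCounter = 0;

/** Returns a valid trajectory; fields in `overrides` replace the defaults. */
function makeTrajectory(overrides: Partial<TestTrajectory> = {}): TestTrajectory {
  trajectoryCounter += 1;
  return {
    // A fresh ID per call keeps trajectories from colliding in a shared store.
    id: `traj-${trajectoryCounter}`,
    turns: [
      { role: "user", content: "Where is my order?" },
      { role: "assistant", content: "It shipped yesterday." },
    ],
    ...overrides,
  };
}
```

A test that only cares about the ID can then write `makeTrajectory({ id: "known-id" })` and rely on the defaults for everything else.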
Now run the full test suite:
terminal
pnpm test
Expected output: vitest reports numFailedTests: 0 and coverage thresholds all above 90% (lines, branches, functions, statements). The test run produces a vitest-report.json file and a coverage report in ./coverage/.
Next steps
Add Langfuse traces to each evaluation run — instrument EvalService.runEvaluation() to create a Langfuse trace per run with spans for each judge call, giving you full observability into evaluation latency and failure modes.
Connect it to real Azure OpenAI — replace the mock judge with a real Azure OpenAI deployment by setting your env vars and removing JUDGE_MOCK=true. Your first real evaluation will score actual agent trajectories against the four quality dimensions.
Integrate the CI gate into GitHub Actions — create a workflow that runs runGateCheck() on every PR, posts the JUnit report as a check annotation, and blocks merges when quality dips below the threshold.
Persist trajectories to a database — replace the in-memory TrajectoryStore with SQLite or Postgres so trajectories survive server restarts and you can run historical trend analysis across evaluation runs.