Azure AI Agent Eval Harness for SMB Support QA

Automated quality gates for Azure AI-powered support agents, catching regressions in tool use, answer quality, and cost before they reach customers.

azure-ai eval-harness llm-as-judge quality-gates customer-support nextjs express typescript langfuse ci-cd

The problem

Small businesses deploying Azure AI chatbots for customer support struggle to keep answer quality consistent as prompts, models, and knowledge bases change. Manual testing is slow and unreliable, so wrong answers, inappropriate tool calls, and surprise cost overruns slip through to customers.

Intro

This tutorial walks you through building an automated evaluation harness for Azure AI-powered customer support agents. You’ll create an Express API server that ingests agent trajectory logs, runs them through an LLM-as-judge evaluation pipeline, tracks cost, enforces quality gates, and surfaces results in a Next.js dashboard. By the end, your support bot QA will run automatically on every deployment, catching regressions in answer accuracy, tool use, and cost before they reach customers.
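To make the idea of a quality gate concrete, here is a minimal sketch of the check that could run at the end of an evaluation. The metric names (`overallScore`, `toolCorrectnessRate`, `avgCostPerTask`) mirror fields used in this recipe's test fixtures. The `GateThresholds` shape and the `checkGates` function are hypothetical names introduced for illustration, not the harness's actual API.

```typescript
// Hypothetical shapes for illustration; the real harness types may differ.
interface RunSummary {
  overallScore: number; // 0..1, aggregated judge score
  toolCorrectnessRate: number; // 0..1, share of correct tool calls
  avgCostPerTask: number; // average cost per evaluated task
}

interface GateThresholds {
  minOverallScore: number;
  minToolCorrectnessRate: number;
  maxAvgCostPerTask: number;
}

interface GateResult {
  passed: boolean;
  failures: string[];
}

// Compare each metric with its threshold and collect readable failure reasons.
function checkGates(summary: RunSummary, thresholds: GateThresholds): GateResult {
  const failures: string[] = [];
  if (summary.overallScore < thresholds.minOverallScore) {
    failures.push(`overallScore ${summary.overallScore} is below ${thresholds.minOverallScore}`);
  }
  if (summary.toolCorrectnessRate < thresholds.minToolCorrectnessRate) {
    failures.push(
      `toolCorrectnessRate ${summary.toolCorrectnessRate} is below ${thresholds.minToolCorrectnessRate}`,
    );
  }
  if (summary.avgCostPerTask > thresholds.maxAvgCostPerTask) {
    failures.push(`avgCostPerTask ${summary.avgCostPerTask} is above ${thresholds.maxAvgCostPerTask}`);
  }
  return { passed: failures.length === 0, failures };
}
```

In a CI job, a script could call `checkGates` on the aggregated run metrics and exit with a non-zero status when `passed` is false, which blocks the deployment.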

Prerequisites

Node.js 22+ and pnpm 10 installed on your machine
An Azure OpenAI resource with a deployed model (GPT-4 or similar) and its endpoint, API key, and deployment name
A Langfuse account (free tier works) for OpenTelemetry tracing — you’ll need the public and secret keys
Familiarity with TypeScript, Express, and Next.js App Router — this is a hands-on code-along, not an introduction to these
About 30 minutes to complete all steps
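Before starting, it can help to fail fast when the Azure OpenAI settings are missing. The sketch below assumes the three environment variable names used in this recipe's test setup (`AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_DEPLOYMENT_NAME`). The `loadAzureConfig` helper and the `AzureOpenAIConfig` type are hypothetical names introduced for illustration.

```typescript
// Hypothetical config shape; field names are illustrative.
interface AzureOpenAIConfig {
  endpoint: string;
  apiKey: string;
  deploymentName: string;
}

const REQUIRED_VARS = [
  "AZURE_OPENAI_ENDPOINT",
  "AZURE_OPENAI_API_KEY",
  "AZURE_OPENAI_DEPLOYMENT_NAME",
] as const;

// Read the required settings from an env-like object and throw one error
// that names every missing variable.
function loadAzureConfig(env: Record<string, string | undefined>): AzureOpenAIConfig {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    endpoint: env["AZURE_OPENAI_ENDPOINT"] as string,
    apiKey: env["AZURE_OPENAI_API_KEY"] as string,
    deploymentName: env["AZURE_OPENAI_DEPLOYMENT_NAME"] as string,
  };
}
```

At startup you would call `loadAzureConfig(process.env)`. Taking the environment as a parameter keeps the function easy to test without mutating global state.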

Step 1: Scaffold the project and install dependencies

Start from an empty directory. Create the project structure and install dependencies with pnpm. The project uses Next.js 16 (App Router) for the dashboard, Express for the eval API server, and four REAA evaluation packages for the heavy lifting.

Create a package.json with exact-pinned dependencies:

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

184 kB · 42 tests · 100.0% coverage · vitest passing

SHA-256: d9901549df4932d30fcdb42ea3bb716d7c5a0b9c1d50f99bd066da67351fe292


import {
  calculateTrajectoryCost,
  checkBudget,
  createBudget,
  generateCostReport,
  CostTracker,
} from "@reaatech/agent-eval-harness-cost";
import type { BudgetConfig, BudgetCheckResult } from "@reaatech/agent-eval-harness-cost";
import type { Trajectory } from "./types.js";
import { toReaaTrajectory } from "./reaa-adapter.js";

interface TurnCost {
  turn_id: number;
  cost: number;
  llm_cost?: number;
  tool_cost?: number;
  total_cost?: number;
  input_tokens?: number;
  output_tokens?: number;
  tokens?: {
    input: number;
    output: number;
  };
}

interface CostBreakdown {
  trajectory_id?: string;
  total_cost: number;
  llm_cost?: number;
  tool_cost?: number;
  input_tokens?: number;
  output_tokens?: number;
  breakdown?: {
    llm_calls: number;
    tool_invocations?: number;
    judge_evaluations?: number;
  };
  per_turn?: TurnCost[];
}

interface CostComponentBreakdown {
  llmCalls: number;
  toolInvocations: number;
  judgeEvaluations?: number;
}

interface TrajectoryCostEntry {
  trajectoryId: string;
  totalCost: number;
  inputTokens: number;
  outputTokens: number;
  turnCount: number;
  timestamp?: string;
}

interface CostTrend {
  timestamp: string;
  cost: number;
  trajectoryCount: number;
  avgCost: number;
}

interface ExpensiveOperation {
  type: "turn" | "tool_call" | "trajectory";
  id: string | number;
  cost: number;
  details?: string;
}

interface CostReport {
  generatedAt: string;
  totalCost: number;
  trajectoryCount: number;
  avgCostPerTrajectory: number;
  breakdown: CostComponentBreakdown;
  perTrajectory: TrajectoryCostEntry[];
  trends?: CostTrend[];
  topExpensive: ExpensiveOperation[];
}

export class CostService {
  #tracker: CostTracker;

  constructor() {
    // Falls back to a limit of 10.00 when EVAL_BUDGET_LIMIT is unset.
    const budgetLimit = Number.parseFloat(process.env.EVAL_BUDGET_LIMIT ?? "10.00");
    this.#tracker = new CostTracker(budgetLimit);
  }

  // Convert the local trajectory to the REAA shape before delegating.
  calculateTrajectoryCost(trajectory: Trajectory, provider: string): CostBreakdown {
    return calculateTrajectoryCost(toReaaTrajectory(trajectory), provider);
  }

  checkBudget(cost: CostBreakdown, budget: BudgetConfig): BudgetCheckResult {
    return checkBudget(cost, budget);
  }

  generateReport(
    trajectories: Array<{ trajectory: Trajectory; cost: CostBreakdown }>,
  ): CostReport {
    const converted = trajectories.map(({ trajectory, cost }) => ({
      trajectory: toReaaTrajectory(trajectory),
      cost,
    }));
    return generateCostReport(converted);
  }

  getTotalCost(): number {
    return this.#tracker.getTotalCost();
  }

  createModerateBudget(): BudgetConfig {
    return createBudget("moderate");
  }

  createStrictBudget(): BudgetConfig {
    return createBudget("strict");
  }
}

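The cost service delegates the actual arithmetic to the REAA cost package. To show the kind of calculation such a service wraps, here is a standalone sketch of token-based cost and a budget check. It is not that package's implementation: the per-1K-token prices are placeholders rather than real Azure OpenAI pricing, and `turnCost`, `totalCost`, and `withinBudget` are hypothetical helper names.

```typescript
// Placeholder prices per 1,000 tokens. These are NOT real Azure OpenAI prices.
const PRICE_PER_1K_TOKENS = { input: 0.01, output: 0.03 };

interface TurnTokens {
  input: number;
  output: number;
}

// Cost of one turn: tokens scaled to thousands, times the matching price.
function turnCost(tokens: TurnTokens): number {
  return (
    (tokens.input / 1000) * PRICE_PER_1K_TOKENS.input +
    (tokens.output / 1000) * PRICE_PER_1K_TOKENS.output
  );
}

// Cost of a whole trajectory: the sum over its turns.
function totalCost(turns: TurnTokens[]): number {
  return turns.reduce((sum, turn) => sum + turnCost(turn), 0);
}

// A run is within budget when its cost does not exceed the limit.
function withinBudget(cost: number, limit: number): boolean {
  return cost <= limit;
}
```

With these placeholder prices, a turn of 1,000 input and 1,000 output tokens costs about 0.04, well inside the default limit of 10.00 read from `EVAL_BUDGET_LIMIT`.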
import { vi, beforeAll, afterEach, afterAll } from "vitest";
import { setupServer } from "msw/node";
import { http, HttpResponse } from "msw";
import type { Trajectory } from "../src/lib/types.js";
import type {
  TrajectoryResult,
  AggregatedResults,
} from "@reaatech/agent-eval-harness-suite";

// Mock the tracing SDKs so tests never send telemetry.
vi.mock("@traceloop/node-server-sdk", () => ({
  default: { initialize: vi.fn() },
  initialize: vi.fn(),
}));

vi.mock("langfuse", () => ({
  Langfuse: vi.fn().mockImplementation(() => ({
    trace: vi.fn().mockReturnValue({ update: vi.fn() }),
  })),
  default: { Langfuse: vi.fn() },
}));

// Test environment configuration.
process.env.AZURE_OPENAI_ENDPOINT = "https://test.openai.azure.com";
process.env.AZURE_OPENAI_API_KEY = "test-key";
process.env.AZURE_OPENAI_DEPLOYMENT_NAME = "test-deployment";
Object.assign(process.env, { NODE_ENV: "test" });
process.env.JUDGE_MOCK = "true";
process.env.EVAL_PORT = "4567";
process.env.EVAL_BUDGET_LIMIT = "10.00";

const azureEndpoint = process.env.AZURE_OPENAI_ENDPOINT;
const deploymentName = process.env.AZURE_OPENAI_DEPLOYMENT_NAME;
const chatUrl = `${azureEndpoint}/openai/deployments/${deploymentName}/chat/completions`;

// Intercept Azure OpenAI chat completion calls and return a canned response.
const server = setupServer(
  http.post(chatUrl, () =>
    HttpResponse.json({
      choices: [
        {
          message: {
            role: "assistant",
            content: "mocked",
          },
        },
      ],
      usage: {
        prompt_tokens: 10,
        completion_tokens: 5,
        total_tokens: 15,
      },
    }),
  ),
);

beforeAll(() => {
  server.listen({ onUnhandledRequest: "bypass" });
});

afterEach(() => {
  server.resetHandlers();
});

afterAll(() => {
  server.close();
});

// Fixture factories: each returns a valid default that callers can override.
export function makeTrajectory(overrides?: Partial<Trajectory>): Trajectory {
  return {
    id: "test-trajectory-id",
    timestamp: new Date("2025-01-01T00:00:00Z"),
    turns: [
      { role: "user", content: "What is the return policy?" },
      {
        role: "assistant",
        content: "Our return policy allows returns within 30 days.",
      },
    ],
    ...overrides,
  };
}

export function makeTrajectoryResult(
  overrides?: Partial<TrajectoryResult>,
): TrajectoryResult {
  return {
    trajectoryId: "test-trajectory-id",
    overallScore: 0.85,
    metricScores: { faithfulness: 0.9, relevance: 0.8 },
    passed: true,
    ...overrides,
  };
}

export function makeAggregatedResults(
  overrides?: Partial<AggregatedResults>,
): AggregatedResults {
  return {
    runId: "test-run-id",
    config: {
      name: "test-suite",
      metrics: [],
    },
    overallMetrics: {
      overallScore: 0.85,
      avgFaithfulness: 0.9,
      avgRelevance: 0.8,
      toolCorrectnessRate: 1.0,
      avgCostPerTask: 0.02,
      latencyP50: 100,
      latencyP90: 200,
      latencyP99: 300,
      slaViolations: 0,
    },
    metricBreakdown: {},
    trajectoryResults: [],
    summary: {
      totalTrajectories: 0,
      passedTrajectories: 0,
      failedTrajectories: 0,
      passRate: 0,
      overallPassed: true,
      durationMs: 0,
    },
    timestamp: new Date().toISOString(),
    ...overrides,
  };
}

Intro

Prerequisites

Step 1: Scaffold the project and install dependencies

Step 2: Define your domain types

Step 3: Create the Azure OpenAI client

Step 4: Build the in-memory trajectory store

Step 5: Wire up the LLM-as-judge service

Step 6: Add cost tracking with REAA cost service

Step 7: Build the evaluation orchestration service

Step 8: Create the Express API server

Step 9: Add the CI/CD gate checker

Step 10: Set up OpenTelemetry tracing

Step 11: Create the Next.js dashboard pages

Step 12: Write tests and run the suite

Next steps