xAI Grok Agent Eval Harness for SMB Support QA

Continuously evaluate your xAI Grok-powered customer support agents to catch regressions before they affect customers.

The problem

Small businesses using xAI Grok for customer support agents have no automated way to verify response quality across prompt changes, model updates, or conversation scenarios. Manual spot-checks miss regressions, leading to incorrect answers, safety issues, and lost trust.

Intro

This tutorial walks you through building an automated evaluation harness for xAI Grok-powered customer support agents. You’ll wire together the REAA agent-eval-harness suite (the @reaatech/agent-eval-harness-* packages) to run batch evaluations against golden test datasets, score responses with Grok as an LLM judge, enforce quality thresholds in CI pipelines, and stream OpenTelemetry traces to Langfuse. By the end, you’ll have a CLI tool with four subcommands (eval, gate, results, compare) and a Next.js API route for browsing past runs.
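As a rough illustration of how the four subcommands might be dispatched, here is a minimal sketch. The subcommand names come from the tutorial; parseCommand, isCommand, and the ParsedCommand shape are hypothetical names used only for this example, and the harness's real CLI entry point may be structured differently.

```typescript
// Hypothetical sketch: only the four subcommand names come from the tutorial.
const COMMANDS = ["eval", "gate", "results", "compare"] as const;
type Command = (typeof COMMANDS)[number];

interface ParsedCommand {
  command: Command;
  args: string[];
}

// Type guard that narrows an arbitrary string to one of the known subcommands.
function isCommand(value: string): value is Command {
  return (COMMANDS as readonly string[]).includes(value);
}

// argv is expected to be process.argv.slice(2): the subcommand, then its arguments.
function parseCommand(argv: string[]): ParsedCommand {
  const [name, ...args] = argv;
  if (name === undefined || !isCommand(name)) {
    throw new Error(
      `Unknown command: ${name ?? "(none)"}. Expected one of: ${COMMANDS.join(", ")}`,
    );
  }
  return { command: name, args };
}
```

Rejecting unknown subcommands up front keeps a typo in a CI script from silently doing nothing.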

Prerequisites

Node.js >= 22 and pnpm (tested with pnpm@10.0.0)
An xAI API key — set XAI_API_KEY in your environment (see .env.example)
A Langfuse account (optional, for observability) — set LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST, and LANGFUSE_OTLP_ENDPOINT
Familiarity with TypeScript, Next.js App Router, and Vitest
The project scaffold (Next.js 16+ with App Router, TypeScript, Vitest, and ESLint already configured)
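The environment variables above could be validated once at startup. The following is a sketch under assumptions: readEnv and HarnessEnv are hypothetical names, and treating a partial Langfuse configuration as "disabled" is one possible policy, not necessarily the one the harness uses.

```typescript
// Sketch only: the variable names come from the prerequisites above;
// readEnv and HarnessEnv are hypothetical names for illustration.
interface HarnessEnv {
  xaiApiKey: string;
  // Present only when all four Langfuse variables are set.
  langfuse?: {
    publicKey: string;
    secretKey: string;
    host: string;
    otlpEndpoint: string;
  };
}

function readEnv(env: Record<string, string | undefined>): HarnessEnv {
  const xaiApiKey = env["XAI_API_KEY"];
  if (!xaiApiKey) {
    throw new Error("XAI_API_KEY is required");
  }
  const publicKey = env["LANGFUSE_PUBLIC_KEY"];
  const secretKey = env["LANGFUSE_SECRET_KEY"];
  const host = env["LANGFUSE_HOST"];
  const otlpEndpoint = env["LANGFUSE_OTLP_ENDPOINT"];
  if (publicKey && secretKey && host && otlpEndpoint) {
    return { xaiApiKey, langfuse: { publicKey, secretKey, host, otlpEndpoint } };
  }
  // Langfuse is optional: a partial configuration is treated as "disabled".
  return { xaiApiKey };
}
```

In a Node.js process this would typically be called as readEnv(process.env).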

Step 1: Set up the project and install dependencies

Start by reviewing the project scaffold. Open package.json and confirm it pins every dependency to an exact version:
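Checking pins by eye gets tedious as the dependency list grows. Here is a minimal sketch of an automated check, assuming that "exact" means a plain x.y.z version with an optional prerelease or build suffix. findUnpinned is a hypothetical helper, not part of the scaffold.

```typescript
// Sketch under assumptions: findUnpinned is a hypothetical helper.
// Ranges (^, ~, >=, *), dist-tags, and URLs are all reported as unpinned.
type DependencyMap = Record<string, string>;

const EXACT_VERSION =
  /^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$/;

function findUnpinned(pkg: {
  dependencies?: DependencyMap;
  devDependencies?: DependencyMap;
}): string[] {
  // Merge both sections; a missing section spreads as an empty object.
  const all = { ...pkg.dependencies, ...pkg.devDependencies };
  return Object.entries(all)
    .filter(([, version]) => !EXACT_VERSION.test(version))
    .map(([name, version]) => `${name}@${version}`);
}
```

Passing the parsed contents of package.json returns the entries that still use a range; an empty array means everything is pinned.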

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

180 kB · 91 tests · 100.0% coverage · vitest passing

SHA-256: 7238d2d998a5ed6a277f9e20dd50af567d67951b928ece48a25904f4636232c2

```typescript
// Configuration layer: loads the suite config and golden trajectories,
// and maps a preset name to its gate definitions.
import {
  parseConfig,
  validateConfig,
} from "@reaatech/agent-eval-harness-suite";
import {
  getStandardPreset,
  getStrictPreset,
  getLenientPreset,
} from "@reaatech/agent-eval-harness-gate";
import type { GateDefinition } from "@reaatech/agent-eval-harness-gate";
import type { SuiteConfig } from "@reaatech/agent-eval-harness-suite";
import { readFileSync } from "fs";

// Read, parse, and validate the suite config; throw on any failure.
export function loadSuiteConfig(goldenPath: string): SuiteConfig {
  let raw: string;
  try {
    raw = readFileSync(goldenPath, "utf-8");
  } catch {
    throw new Error(`Cannot read config at: ${goldenPath}`);
  }
  const config = parseConfig(raw);
  const validation = validateConfig(config);
  if (!validation.valid) {
    throw new Error(`Invalid config: ${validation.errors.join("; ")}`);
  }
  return config;
}

// Load golden trajectories from a file with one JSON object per line.
export function loadGoldenTrajectories(
  goldenPath: string,
): Array<{
  id: string;
  input: string;
  expectedToolCalls?: Array<{
    name: string;
    arguments: Record<string, unknown>;
  }>;
  expectedContext?: string;
}> {
  let raw: string;
  try {
    raw = readFileSync(goldenPath, "utf-8");
  } catch {
    throw new Error(`Cannot read golden file at: ${goldenPath}`);
  }
  type RawTrajectoryLine = {
    id?: string;
    input?: string;
    expectedToolCalls?: Array<{ name: string; arguments: Record<string, unknown> }>;
    expectedContext?: string;
  };
  // Drop blank lines so a trailing newline does not produce an empty record.
  const lines = raw.trim().split("\n").filter(Boolean);
  return lines.map((line, i) => {
    const parsed = JSON.parse(line) as RawTrajectoryLine;
    // Missing fields get defaults: a positional id and an empty input.
    return {
      id: parsed.id ?? `trajectory-${String(i)}`,
      input: parsed.input ?? "",
      expectedToolCalls: parsed.expectedToolCalls,
      expectedContext: parsed.expectedContext,
    };
  });
}

// Map a preset name to its gate definitions; unknown names are an error.
export function resolveGatePreset(presetName: string): GateDefinition[] {
  switch (presetName) {
    case "standard":
      return getStandardPreset().gates;
    case "strict":
      return getStrictPreset().gates;
    case "lenient":
      return getLenientPreset().gates;
    default:
      throw new Error(`Unknown gate preset: ${presetName}`);
  }
}
```
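To show the file format loadGoldenTrajectories expects, here is an illustrative two-line golden file together with a stricter line parser. The field names mirror the loader above, but the sample values (including the lookup_order tool) are made up, and parseJsonlLines is a hypothetical helper that reports which line failed to parse. The loader above does not do this itself.

```typescript
// Illustrative only: sample values are invented, and parseJsonlLines is a
// hypothetical helper, not part of the harness packages.
const sampleGolden = [
  '{"id":"refund-1","input":"Where is my refund?","expectedToolCalls":[{"name":"lookup_order","arguments":{"orderId":"A-100"}}]}',
  '{"input":"What are your opening hours?","expectedContext":"Open 9-5 on weekdays"}',
].join("\n");

function parseJsonlLines(raw: string): unknown[] {
  return raw
    .split(/\r?\n/)
    // Record the 1-based line number before blank lines are dropped.
    .map((line, index) => ({ line: line.trim(), lineNumber: index + 1 }))
    .filter(({ line }) => line.length > 0)
    .map(({ line, lineNumber }) => {
      try {
        return JSON.parse(line) as unknown;
      } catch (error) {
        const reason = error instanceof Error ? error.message : String(error);
        throw new Error(`Invalid JSON on line ${String(lineNumber)}: ${reason}`);
      }
    });
}
```

The second sample line omits id, so loadGoldenTrajectories would assign it the positional id trajectory-1.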

```typescript
// Gate layer: evaluates aggregated results against quality gates and
// formats the outcome for CI.
import { createGateEngine, CIIntegration } from "@reaatech/agent-eval-harness-gate";
import { getStandardPreset, getStrictPreset, getLenientPreset } from "@reaatech/agent-eval-harness-gate";
import {
  createOverallQualityGate,
  createFaithfulnessGate,
  createRelevanceGate,
  createToolCorrectnessGate,
  createPassRateGate,
} from "@reaatech/agent-eval-harness-gate";
import type { AggregatedResults, RunComparisonResult } from "@reaatech/agent-eval-harness-suite";
import type { GateEvaluationSummary } from "@reaatech/agent-eval-harness-gate";
import * as fs from "node:fs";
import * as path from "node:path";

// Evaluate results against a named preset.
// Note: an unrecognized preset name falls back to the standard preset.
export function evaluateWithPreset(
  results: AggregatedResults,
  presetName: string,
): GateEvaluationSummary {
  let preset = getStandardPreset();
  switch (presetName) {
    case "standard":
      preset = getStandardPreset();
      break;
    case "strict":
      preset = getStrictPreset();
      break;
    case "lenient":
      preset = getLenientPreset();
      break;
  }
  const engine = createGateEngine(preset.gates);
  return engine.evaluate(results);
}

// Evaluate results with the five individual gates, passing a comparison
// against a baseline run to the engine.
export function evaluateWithBaseline(
  results: AggregatedResults,
  comparison: RunComparisonResult,
): GateEvaluationSummary {
  const engine = createGateEngine([
    createOverallQualityGate(),
    createFaithfulnessGate(),
    createRelevanceGate(),
    createToolCorrectnessGate(),
    createPassRateGate(),
  ]);
  return engine.evaluate(results, comparison);
}

// Render a gate summary in every CI-facing format at once.
export function formatCIOutput(summary: GateEvaluationSummary): {
  junitXml: string;
  annotations: string;
  exitCode: number;
  prComment: string;
} {
  return {
    junitXml: CIIntegration.generateJUnitReport(summary),
    annotations: CIIntegration.generateGitHubAnnotations(summary),
    exitCode: CIIntegration.getExitCode(summary),
    prComment: CIIntegration.generatePRComment(summary),
  };
}

// Terminate the process with the gate's exit code when it is non-zero.
export function checkAndExit(summary: GateEvaluationSummary): void {
  const exitCode = CIIntegration.getExitCode(summary);
  if (exitCode !== 0) {
    process.exit(exitCode);
  }
}

// Write the JUnit report and the raw JSON summary to outputDir,
// creating the directory if needed.
export function exportGateResults(summary: GateEvaluationSummary, outputDir: string): void {
  const junitXml = CIIntegration.generateJUnitReport(summary);
  const jsonOutput = JSON.stringify(summary, null, 2);
  fs.mkdirSync(outputDir, { recursive: true });
  fs.writeFileSync(path.join(outputDir, "gate-junit.xml"), junitXml, "utf-8");
  fs.writeFileSync(path.join(outputDir, "gate-results.json"), jsonOutput, "utf-8");
}
```
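To make the idea of a quality gate concrete, here is a conceptual sketch of threshold checking. It is not how @reaatech/agent-eval-harness-gate is implemented: ThresholdGate, evaluateGates, the metric names, and the choice of exit code 1 for failure are all assumptions made for illustration.

```typescript
// Conceptual sketch, not the @reaatech gate engine: every name here is hypothetical.
interface ThresholdGate {
  name: string;
  metric: string;
  min: number; // the gate passes when the metric is >= min
}

interface GateOutcome {
  name: string;
  passed: boolean;
  actual: number | undefined;
}

function evaluateGates(
  metrics: Record<string, number>,
  gates: ThresholdGate[],
): { passed: boolean; outcomes: GateOutcome[]; exitCode: number } {
  const outcomes: GateOutcome[] = gates.map((gate) => {
    const actual: number | undefined = metrics[gate.metric];
    // A missing metric fails the gate rather than passing silently.
    const passed = actual !== undefined && actual >= gate.min;
    return { name: gate.name, passed, actual };
  });
  // The run passes only if every gate passes; CI reads the exit code.
  const passed = outcomes.every((outcome) => outcome.passed);
  return { passed, outcomes, exitCode: passed ? 0 : 1 };
}
```

Failing on a missing metric is a deliberate choice in this sketch: a renamed or dropped metric should break the build instead of quietly disabling a check.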

```typescript
// Results API route: returns one run (when ?runId= is given) or all runs,
// as JSON by default or as a Markdown table with ?format=markdown.
import { type NextRequest, NextResponse } from "next/server";
import { getResultById, getAllResults } from "../../../src/results-store.js";

export function GET(req: NextRequest): NextResponse {
  const runId = req.nextUrl.searchParams.get("runId");
  const format = req.nextUrl.searchParams.get("format");

  if (runId) {
    const record = getResultById(runId);
    if (!record) {
      return NextResponse.json({ error: "not found" }, { status: 404 });
    }
    if (format === "markdown") {
      const md = formatSingleRecord(record);
      return new NextResponse(md, {
        headers: { "content-type": "text/markdown; charset=utf-8" },
      });
    }
    return NextResponse.json(record);
  }

  const all = getAllResults();
  if (format === "markdown") {
    const md = formatAllRecords(all);
    return new NextResponse(md, {
      headers: { "content-type": "text/markdown; charset=utf-8" },
    });
  }
  return NextResponse.json({ runs: all });
}

// Render a single run as a one-row Markdown table.
function formatSingleRecord(r: {
  runId: string;
  timestamp: string;
  status: string;
  overallScore: number;
  passRate: number;
  totalTrajectories: number;
  durationMs: number;
  exportedFormats?: string[];
}): string {
  const e = r.exportedFormats?.join(", ") ?? "";
  return [
    "| Run ID | Timestamp | Status | Score | Pass Rate | Trajectories | Duration (ms) | Formats |",
    "|--------|-----------|--------|-------|-----------|--------------|--------------|---------|",
    `| ${r.runId} | ${r.timestamp} | ${r.status} | ${String(r.overallScore)} | ${String(r.passRate)} | ${String(r.totalTrajectories)} | ${String(r.durationMs)} | ${e} |`,
  ].join("\n");
}

// Render all runs as a Markdown table, or a short message when there are none.
function formatAllRecords(
  records: Array<{
    runId: string;
    timestamp: string;
    status: string;
    overallScore: number;
    passRate: number;
    totalTrajectories: number;
    durationMs: number;
    exportedFormats?: string[];
  }>,
): string {
  if (records.length === 0) return "No results found.";
  const rows = records
    .map(
      (r) =>
        `| ${r.runId} | ${r.timestamp} | ${r.status} | ${String(r.overallScore)} | ${String(r.passRate)} | ${String(r.totalTrajectories)} | ${String(r.durationMs)} | ${r.exportedFormats?.join(", ") ?? ""} |`,
    )
    .join("\n");
  return [
    "| Run ID | Timestamp | Status | Score | Pass Rate | Trajectories | Duration (ms) | Formats |",
    "|--------|-----------|--------|-------|-----------|--------------|--------------|---------|",
    rows,
  ].join("\n");
}
```

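The Markdown formatters in the route interpolate values into table rows directly, so a run ID or status containing a pipe character or a newline would break the table layout. One possible hardening is sketched below. escapeCell and toMarkdownRow are hypothetical helpers that are not part of the route as written.

```typescript
// Hypothetical helpers: not part of the route above.
// Escape characters that would otherwise split or terminate a table cell.
function escapeCell(value: string | number): string {
  return String(value)
    .replace(/\|/g, "\\|") // a literal pipe would start a new column
    .replace(/\r?\n/g, " "); // a newline would end the row
}

// Build one Markdown table row from already-ordered cell values.
function toMarkdownRow(cells: Array<string | number>): string {
  return `| ${cells.map(escapeCell).join(" | ")} |`;
}
```

With helpers like these, each formatter could build its rows from an array of cell values instead of a hand-written template string.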
Intro

Prerequisites

Step 1: Set up the project and install dependencies

Step 2: Configure environment variables

Step 3: Define the type system

Step 4: Build the configuration layer

Step 5: Create the xAI Grok judge adapter

Step 6: Implement the evaluator

Step 7: Wire up the gate engine

Step 8: Add observability with Langfuse

Step 9: Build the results store and API route

Step 10: Implement the CLI entry point

Step 11: Add Next.js instrumentation

Step 12: Set up MSW test mocks

Step 13: Verify with tests

Next steps