Small businesses using xAI Grok for customer support agents have no automated way to verify response quality across prompt changes, model updates, or conversation scenarios. Manual spot-checks miss regressions, leading to incorrect answers, safety issues, and lost trust.
This tutorial walks you through building an automated evaluation harness for xAI Grok-powered customer support agents. You’ll wire together the REAA agent-eval-harness suite — running batch evaluations against golden test datasets, scoring responses with Grok as an LLM judge, enforcing quality thresholds in CI pipelines, and streaming OpenTelemetry traces to Langfuse. By the end, you’ll have a CLI tool with four subcommands (eval, gate, results, compare) and a Next.js API route for browsing past runs.
Prerequisites
Node.js >= 22 and pnpm (tested with pnpm@10.0.0)
An xAI API key — set XAI_API_KEY in your environment (see .env.example)
A Langfuse account (optional, for observability) — set LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST, and LANGFUSE_OTLP_ENDPOINT
Familiarity with TypeScript, Next.js App Router, and Vitest
The project scaffold (Next.js 16+ with App Router, TypeScript, Vitest, and ESLint already configured)
Step 1: Set up the project and install dependencies
Start by reviewing the project scaffold. Open package.json and confirm it pins every dependency to an exact version:
All versions are pinned without ^ or ~. Run pnpm install to lock the dependency tree.
Expected output: pnpm install prints a success message. No warnings about missing peer dependencies.
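To check the pinning rule mechanically rather than by eye, a minimal sketch follows. The helper name findUnpinned and the sample dependency map are hypothetical and not part of the recipe; the check only flags versions that start with ^ or ~, as described above.

```typescript
// Hypothetical helper: list dependencies whose version is a range, not an exact pin.
type DependencyMap = Record<string, string>;

function findUnpinned(deps: DependencyMap): string[] {
  return Object.entries(deps)
    .filter(([, version]) => version.startsWith("^") || version.startsWith("~"))
    .map(([name]) => name)
    .sort();
}

// Illustrative data only; the real package.json lists different packages and versions.
const sampleDeps: DependencyMap = {
  "example-pinned": "1.2.3",
  "example-caret": "^4.0.0",
  "example-tilde": "~0.9.1",
};

const unpinned = findUnpinned(sampleDeps);
```

In a real project you would pass the dependencies and devDependencies objects read from package.json and fail the build when the returned list is not empty.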
Step 2: Configure environment variables
Add the environment variables your tool reads at runtime. Create or open .env.example and confirm it contains entries for xAI and Langfuse:
```env
# Env vars used by xai-grok-agent-eval-harness-for-smb-support-qa.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only — never commit real values.
NODE_ENV=development
XAI_API_KEY=<your-xai-api-key>
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_HOST=<your-langfuse-host>
LANGFUSE_OTLP_ENDPOINT=<https://cloud.langfuse.com/api/public/otel/v1/traces>
```
Copy .env.example to .env.local and fill in your real xAI API key if you plan to run evaluations. In tests, MSW (Mock Service Worker) intercepts all HTTP calls, so you don’t need live keys in CI.
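A startup check can catch a missing key before the first API call. The sketch below is an assumption about how such a check might look, not code from the recipe: checkEnv and isUnset are hypothetical names, and treating `<...>` placeholders as unset is a choice made for this example.

```typescript
// Hypothetical helper: report required variables that are missing or still placeholders.
type Env = Record<string, string | undefined>;

const REQUIRED_VARS = ["XAI_API_KEY"];
const LANGFUSE_VARS = [
  "LANGFUSE_PUBLIC_KEY",
  "LANGFUSE_SECRET_KEY",
  "LANGFUSE_HOST",
  "LANGFUSE_OTLP_ENDPOINT",
];

function isUnset(value: string | undefined): boolean {
  if (value === undefined) return true;
  const trimmed = value.trim();
  // Treat empty strings and "<placeholder>" values copied from .env.example as unset.
  return trimmed === "" || /^<.*>$/.test(trimmed);
}

function checkEnv(env: Env): { missing: string[]; langfuseEnabled: boolean } {
  const missing = REQUIRED_VARS.filter((key) => isUnset(env[key]));
  // Langfuse is optional: enable it only when every Langfuse variable is set.
  const langfuseEnabled = LANGFUSE_VARS.every((key) => !isUnset(env[key]));
  return { missing, langfuseEnabled };
}
```

At runtime you would call checkEnv(process.env) and abort with a clear message when missing is not empty.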
Step 3: Define the type system
Create src/types.ts with Zod schemas and TypeScript interfaces that describe evaluation options, run records, CLI commands, and trajectory inputs:
EvalOptionsSchema validates CLI flags at runtime. EvalRunRecord is what the results store persists. TrajectoryInput describes one golden test case: a user input, optional expected context for faithfulness scoring, and optional expected tool calls for tool correctness scoring.
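The recipe validates these shapes with Zod. To illustrate the idea without a dependency, the sketch below writes the TrajectoryInput shape as a plain interface and checks it with a hand-written type guard. The field names follow the description above and the loader in Step 4; the guard functions themselves are hypothetical stand-ins for a Zod schema.

```typescript
// One expected tool call inside a golden test case.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Shape of one golden test case, following the fields described above.
interface TrajectoryInput {
  id: string;
  input: string;
  expectedContext?: string; // used for faithfulness scoring
  expectedToolCalls?: ToolCall[]; // used for tool correctness scoring
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isToolCall(value: unknown): value is ToolCall {
  if (!isRecord(value)) return false;
  return typeof value["name"] === "string" && isRecord(value["arguments"]);
}

// Runtime check standing in for a schema parse (hypothetical, not the recipe's code).
function isTrajectoryInput(value: unknown): value is TrajectoryInput {
  if (!isRecord(value)) return false;
  if (typeof value["id"] !== "string" || typeof value["input"] !== "string") return false;
  const context = value["expectedContext"];
  if (context !== undefined && typeof context !== "string") return false;
  const calls = value["expectedToolCalls"];
  if (calls !== undefined) {
    if (!Array.isArray(calls) || !calls.every((call) => isToolCall(call))) return false;
  }
  return true;
}
```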
Step 4: Build the configuration layer
Create src/config.ts — this module loads and validates YAML configs, parses JSONL golden trajectories, and resolves gate presets:
```ts
import {
  parseConfig,
  validateConfig,
} from "@reaatech/agent-eval-harness-suite";
import {
  getStandardPreset,
  getStrictPreset,
  getLenientPreset,
} from "@reaatech/agent-eval-harness-gate";
import type { GateDefinition } from "@reaatech/agent-eval-harness-gate";
import type { SuiteConfig } from "@reaatech/agent-eval-harness-suite";
import { readFileSync } from "fs";

export function loadSuiteConfig(goldenPath: string): SuiteConfig {
  let raw: string;
  try {
    raw = readFileSync(goldenPath, "utf-8");
  } catch {
    throw new Error(`Cannot read config at: ${goldenPath}`);
  }
  const config = parseConfig(raw);
  const validation = validateConfig(config);
  if (!validation.valid) {
    throw new Error(`Invalid config: ${validation.errors.join("; ")}`);
  }
  return config;
}

export function loadGoldenTrajectories(
  goldenPath: string,
): Array<{
  id: string;
  input: string;
  expectedToolCalls?: Array<{
    name: string;
    arguments: Record<string, unknown>;
  }>;
  expectedContext?: string;
}> {
  let raw: string;
  try {
    raw = readFileSync(goldenPath, "utf-8");
  } catch {
    throw new Error(`Cannot read golden file at: ${goldenPath}`);
  }
  type RawTrajectoryLine = {
    id?: string;
    input?: string;
    expectedToolCalls?: Array<{ name: string; arguments: Record<string, unknown> }>;
    expectedContext?: string;
  };
  const lines = raw.trim().split("\n").filter(Boolean);
  return lines.map((line, i) => {
    const parsed = JSON.parse(line) as RawTrajectoryLine;
    return {
      id: parsed.id ?? `trajectory-${String(i)}`,
      input: parsed.input ?? "",
      expectedToolCalls: parsed.expectedToolCalls,
      expectedContext: parsed.expectedContext,
    };
  });
}

export function resolveGatePreset(presetName: string): GateDefinition[] {
  switch (presetName) {
    case "standard":
      return getStandardPreset().gates;
    case "strict":
      return getStrictPreset().gates;
    case "lenient":
      return getLenientPreset().gates;
    default:
      throw new Error(`Unknown gate preset: ${presetName}`);
  }
}
```
loadSuiteConfig reads a YAML file, parses it with @reaatech/agent-eval-harness-suite’s parseConfig, and validates it with validateConfig. It throws a descriptive error on any failure. loadGoldenTrajectories reads a JSONL file (one JSON object per line), fills in default IDs when missing, and returns a typed array ready for the evaluator. resolveGatePreset maps preset names to arrays of GateDefinition objects from the three presets in @reaatech/agent-eval-harness-gate.
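To make the JSONL format concrete, here is a small sketch that parses two invented golden cases from a string instead of a file. The sample questions and the lookup_order tool name are made up for illustration; parseGolden is a hypothetical in-memory variant of the loader, using the same line splitting and default-ID rule.

```typescript
interface GoldenCase {
  id: string;
  input: string;
  expectedContext?: string;
  expectedToolCalls?: Array<{ name: string; arguments: Record<string, unknown> }>;
}

// Two illustrative golden cases, one JSON object per line. The second has no "id".
const sampleJsonl = [
  '{"id":"refund-policy","input":"Can I return an opened item?","expectedContext":"Opened items can be returned within 30 days."}',
  '{"input":"Where is my order?","expectedToolCalls":[{"name":"lookup_order","arguments":{"orderId":"A-1"}}]}',
].join("\n");

function parseGolden(raw: string): GoldenCase[] {
  return raw
    .trim()
    .split("\n")
    .filter(Boolean) // drop blank lines before indexing
    .map((line, i) => {
      const parsed = JSON.parse(line) as Partial<GoldenCase>;
      return {
        id: parsed.id ?? `trajectory-${String(i)}`, // default ID uses the line index
        input: parsed.input ?? "",
        expectedContext: parsed.expectedContext,
        expectedToolCalls: parsed.expectedToolCalls,
      };
    });
}

const goldenCases = parseGolden(sampleJsonl);
```

Because the default ID is index-based, the second case above becomes trajectory-1. Reordering lines in the golden file therefore changes the IDs of cases that lack an explicit id, so giving every case its own id keeps run comparisons stable.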
Step 5: Create the xAI Grok judge adapter
Create src/grok-judge.ts. This module wraps @ai-sdk/xai and @reaatech/agent-eval-harness-judge so the rest of your code doesn’t depend directly on the xAI client library:
```ts
import { JudgeEngine } from "@reaatech/agent-eval-harness-judge";
import { xai } from "@ai-sdk/xai";

export function createGrokJudge(
  overrides?: { model?: string; temperature?: number },
): JudgeEngine {
  return new JudgeEngine({
    model: overrides?.model ?? "grok-3",
    provider: "openrouter",
    temperature: overrides?.temperature ?? 0,
    apiKey: process.env.XAI_API_KEY,
  });
}

export function createGrokModel(modelId?: string) {
  return xai(modelId ?? "grok-3");
}
```
The JudgeEngine constructor takes a model name, provider type (set to "openrouter" because xAI’s API is OpenAI-compatible), temperature, and the API key from the environment. The createGrokModel function returns a Vercel AI SDK model instance for direct generateText calls.
Step 6: Implement the evaluator
Create src/evaluator.ts. This is the core orchestration module — it evaluates each golden trajectory by asking Grok for a response, then scoring that response across multiple quality metrics:
```ts
import { createResultsAggregator, RunComparator } from "@reaatech/agent-eval-harness-suite";
import type {
  SuiteConfig,
  EvalRunResult,
  AggregatedResults,
  RunComparisonResult,
  TrajectoryResult,
} from "@reaatech/agent-eval-harness-suite";
import { JudgeEngine } from "@reaatech/agent-eval-harness-judge";
import { generateText } from "ai";
import { createGrokModel } from "./grok-judge.js";

export interface TrajectoryItem {
  id: string;
  input: string;
  expectedContext?: string;
  expectedToolCalls?: Array<{ name: string; arguments: Record<string, unknown> }>;
}

export function createTrajectoryEvaluator(
  judge: JudgeEngine,
): (trajectory: TrajectoryItem) => Promise<TrajectoryResult> {
  // …
```
The flow for a single trajectory: generateText with Grok gets the agent’s response, then the JudgeEngine scores it on faithfulness (does it match expected context?), relevance (is it on-topic?), and overall quality. If expectedToolCalls are provided, tool_correctness is also scored. Any judge call failure marks the trajectory as failed and records the error.
aggregateAndExport uses the REAA suite’s createResultsAggregator to produce formatted output in JSON, JUnit XML, CSV, or Markdown. compareRuns uses RunComparator for statistical comparison between baseline and candidate runs.
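The per-trajectory flow described above can be sketched with stand-in functions. Everything here is an assumption for illustration: Generate and Judge are hypothetical function types that replace generateText and JudgeEngine, the Outcome shape is invented, and the metric name overall_quality is a guess (only tool_correctness is named in the text).

```typescript
// Hypothetical stand-ins for the model call and the judge.
type Generate = (input: string) => Promise<string>;
type Judge = (metric: string, input: string, response: string) => Promise<number>;

interface Trajectory {
  id: string;
  input: string;
  expectedToolCalls?: unknown[];
}

interface Outcome {
  id: string;
  status: "scored" | "failed";
  scores: Record<string, number>;
  error?: string;
}

async function evaluateOne(t: Trajectory, generate: Generate, judge: Judge): Promise<Outcome> {
  const scores: Record<string, number> = {};
  try {
    // 1. Ask the model for the agent's response.
    const response = await generate(t.input);
    // 2. Score the response on each metric.
    const metrics = ["faithfulness", "relevance", "overall_quality"];
    // Tool correctness is only scored when the golden case lists expected tool calls.
    if (t.expectedToolCalls !== undefined) metrics.push("tool_correctness");
    for (const metric of metrics) {
      scores[metric] = await judge(metric, t.input, response);
    }
    return { id: t.id, status: "scored", scores };
  } catch (err) {
    // 3. Any generation or judge failure marks the trajectory as failed and records the error.
    const message = err instanceof Error ? err.message : String(err);
    return { id: t.id, status: "failed", scores, error: message };
  }
}
```

Injecting the model and judge as functions keeps this flow testable with plain stubs, which is also how the sketch can run without network access.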
Step 7: Wire up the gate engine
Create src/gate.ts. This module evaluates aggregated results against quality thresholds and produces CI artifacts (JUnit XML, GitHub Actions annotations, PR comments):
```ts
import { createGateEngine, CIIntegration } from "@reaatech/agent-eval-harness-gate";
import { getStandardPreset, getStrictPreset, getLenientPreset } from "@reaatech/agent-eval-harness-gate";
import {
  createOverallQualityGate,
  createFaithfulnessGate,
  createRelevanceGate,
  createToolCorrectnessGate,
  createPassRateGate,
} from "@reaatech/agent-eval-harness-gate";
import type { AggregatedResults, RunComparisonResult } from "@reaatech/agent-eval-harness-suite";
import type { GateEvaluationSummary } from "@reaatech/agent-eval-harness-gate";
import * as fs from "node:fs";
import * as path from "node:path";

export function evaluateWithPreset(
  results: AggregatedResults,
  presetName: string,
): GateEvaluationSummary {
  let preset = getStandardPreset();
  switch (presetName) {
    case "standard":
      preset = getStandardPreset();
      break;
    case "strict":
      preset = getStrictPreset();
      break;
    case "lenient":
      preset = getLenientPreset();
      break;
  }
  const engine = createGateEngine(preset.gates);
  return engine.evaluate(results);
}

export function evaluateWithBaseline(
  results: AggregatedResults,
  comparison: RunComparisonResult,
): GateEvaluationSummary {
  const engine = createGateEngine([
    createOverallQualityGate(),
    createFaithfulnessGate(),
    createRelevanceGate(),
    createToolCorrectnessGate(),
    createPassRateGate(),
  ]);
  return engine.evaluate(results, comparison);
}

export function formatCIOutput(summary: GateEvaluationSummary): {
  junitXml: string;
  annotations: string;
  exitCode: number;
  prComment: string;
} {
  return {
    junitXml: CIIntegration.generateJUnitReport(summary),
    annotations: CIIntegration.generateGitHubAnnotations(summary),
    exitCode: CIIntegration.getExitCode(summary),
    prComment: CIIntegration.generatePRComment(summary),
  };
}

export function checkAndExit(summary: GateEvaluationSummary): void {
  const exitCode = CIIntegration.getExitCode(summary);
  if (exitCode !== 0) {
    process.exit(exitCode);
  }
}

export function exportGateResults(summary: GateEvaluationSummary, outputDir: string): void {
  const junitXml = CIIntegration.generateJUnitReport(summary);
  const jsonOutput = JSON.stringify(summary, null, 2);
  fs.mkdirSync(outputDir, { recursive: true });
  fs.writeFileSync(path.join(outputDir, "gate-junit.xml"), junitXml, "utf-8");
  fs.writeFileSync(path.join(outputDir, "gate-results.json"), jsonOutput, "utf-8");
}
```
This module exposes three entry points. evaluateWithPreset picks one of the three built-in presets (standard, strict, lenient); as written, an unrecognized preset name silently falls back to standard, unlike resolveGatePreset in src/config.ts, which throws. The standard preset checks overall quality >= 0.8, faithfulness >= 0.8, relevance >= 0.8, and tool correctness >= 0.9. evaluateWithBaseline passes a RunComparisonResult to detect regressions against a previous run. formatCIOutput generates all CI artifacts from a single summary. The checkAndExit helper calls process.exit with the code returned by CIIntegration.getExitCode whenever that code is non-zero, which makes it usable in GitHub Actions workflows.
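To show what a threshold gate does in principle, the sketch below compares average scores with the standard-preset thresholds quoted above and derives an exit code. It does not use the REAA gate engine; evaluateGates, exitCodeFor, the metric keys, and the rule that a missing metric counts as 0 are all assumptions made for this example.

```typescript
// Thresholds quoted above for the standard preset; the key names are hypothetical.
const STANDARD_THRESHOLDS: Record<string, number> = {
  overall_quality: 0.8,
  faithfulness: 0.8,
  relevance: 0.8,
  tool_correctness: 0.9,
};

interface GateOutcome {
  metric: string;
  score: number;
  threshold: number;
  passed: boolean;
}

function evaluateGates(
  scores: Record<string, number | undefined>,
  thresholds: Record<string, number>,
): GateOutcome[] {
  return Object.entries(thresholds).map(([metric, threshold]) => {
    const score = scores[metric] ?? 0; // a missing metric counts as 0 in this sketch
    return { metric, score, threshold, passed: score >= threshold };
  });
}

// 0 when every gate passes, 1 otherwise, so a CI job fails on any regression.
function exitCodeFor(outcomes: GateOutcome[]): number {
  return outcomes.every((outcome) => outcome.passed) ? 0 : 1;
}

const gateOutcomes = evaluateGates(
  { overall_quality: 0.85, faithfulness: 0.8, relevance: 0.9, tool_correctness: 0.85 },
  STANDARD_THRESHOLDS,
);
const gateExitCode = exitCodeFor(gateOutcomes);
```

In this example three gates pass (a score equal to the threshold passes) and tool_correctness fails at 0.85 against 0.9, so gateExitCode is 1.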
Step 8: Add observability with Langfuse
Create src/observability.ts. This module initializes OpenTelemetry tracing, metrics, and structured logging through the REAA observability package:
When otlpEndpoint is provided (from LANGFUSE_OTLP_ENDPOINT), the tracing manager exports spans via the OTLP protocol to Langfuse. When it’s not set, tracing falls back to console output. The metrics manager always logs to console and the dashboard manager stores an in-memory run history.
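The fallback rule can be expressed as a small pure function. This is only a sketch of the decision described above: TraceExporter and resolveTraceExporter are hypothetical names, and the real module delegates the actual exporting to the REAA observability package.

```typescript
// Hypothetical description of where spans go.
type TraceExporter =
  | { kind: "otlp"; endpoint: string }
  | { kind: "console" };

function resolveTraceExporter(otlpEndpoint: string | undefined): TraceExporter {
  const endpoint = otlpEndpoint?.trim();
  // An unset or blank endpoint falls back to console output.
  return endpoint ? { kind: "otlp", endpoint } : { kind: "console" };
}
```

You would call resolveTraceExporter(process.env.LANGFUSE_OTLP_ENDPOINT) once at startup and log the result, which makes it obvious in CI logs whether traces are leaving the machine.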
Step 9: Build the results store and API route
Create src/results-store.ts — a simple in-memory store backed by a Map<string, EvalRunRecord>:
```ts
import type { EvalRunRecord } from "./types.js";

const store = new Map<string, EvalRunRecord>();

export function storeResult(record: EvalRunRecord): void {
  store.set(record.runId, record);
}

export function getAllResults(): EvalRunRecord[] {
  return Array.from(store.values());
}

export function getResultById(id: string): EvalRunRecord | undefined {
  return store.get(id);
}

export function clearResults(): void {
  store.clear();
}
```
Now create the API route at app/api/results/route.ts. This is a Next.js App Router route handler that exposes the stored results over HTTP:
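Notice the use of NextRequest and NextResponse.json(). Prefer these over bare Request/Response in route handlers: NextResponse.json() serializes the body and sets the Content-Type header to application/json, and NextRequest exposes the parsed URL through nextUrl, which makes query parameters easy to read. The route supports two query parameters: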
?runId=<id> — returns a single record (or 404 JSON)
?format=markdown — returns results formatted as a Markdown table
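The branching behind those two parameters can be sketched as a framework-free function. This is not the route handler itself: RunRecord (beyond runId), the passRate field, the table columns, and the rule that runId takes priority over format are assumptions for illustration.

```typescript
// Hypothetical record shape; only runId is known from the results store.
interface RunRecord {
  runId: string;
  passRate: number;
}

interface RouteResult {
  status: number;
  contentType: string;
  body: string;
}

function jsonResult(status: number, payload: unknown): RouteResult {
  return { status, contentType: "application/json", body: JSON.stringify(payload) };
}

function toMarkdownTable(records: RunRecord[]): string {
  const header = "| Run ID | Pass rate |\n| --- | --- |";
  const rows = records.map((r) => `| ${r.runId} | ${r.passRate.toFixed(2)} |`);
  return [header, ...rows].join("\n");
}

function handleResultsQuery(
  store: Map<string, RunRecord>,
  query: { runId?: string; format?: string },
): RouteResult {
  // ?runId=<id>: return one record, or a 404 JSON body when it does not exist.
  if (query.runId !== undefined) {
    const record = store.get(query.runId);
    return record
      ? jsonResult(200, record)
      : jsonResult(404, { error: `Run not found: ${query.runId}` });
  }
  const all = Array.from(store.values());
  // ?format=markdown: render every stored run as a Markdown table.
  if (query.format === "markdown") {
    return { status: 200, contentType: "text/markdown", body: toMarkdownTable(all) };
  }
  return jsonResult(200, all);
}
```

Inside the actual route you would read runId and format from request.nextUrl.searchParams and turn the returned object into a NextResponse.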
Step 10: Implement the CLI entry point
Create src/cli.ts. This is the main executable that parses arguments and dispatches to the right subcommand:
```ts
import { initObservability, getEvalLogger } from "./observability.js";
import { runSuite, aggregateAndExport, compareRuns } from "./evaluator.js";
import {
  loadSuiteConfig,
  loadGoldenTrajectories,
} from "./config.js";
import { createGrokJudge } from "./grok-judge.js";
import {
  evaluateWithPreset,
  evaluateWithBaseline,
  formatCIOutput,
  checkAndExit,
  exportGateResults,
} from "./gate.js";
import { storeResult, getAllResults } from "./results-store.js";
import type { AggregatedResults, RunComparisonResult } from "@reaatech/agent-eval-harness-suite";
import type { CliCommand } …
```
The argument parser is a zero-dependency implementation — it reads process.argv.slice(2), identifies the first token as the command, and parses --key value pairs into an options object. Each command handler follows the same pattern: validate required flags, load data, execute the core logic, and output results.
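A zero-dependency parser of that kind can be sketched in a few lines. The version below is an illustration rather than the recipe's parser: parseArgs and ParsedArgs are hypothetical names, and storing "true" for a flag that has no value is an assumption.

```typescript
interface ParsedArgs {
  command: string | undefined;
  options: Record<string, string>;
}

function parseArgs(argv: string[]): ParsedArgs {
  // The first token is the subcommand; the rest are --key value pairs.
  const [command, ...rest] = argv;
  const options: Record<string, string> = {};
  for (let i = 0; i < rest.length; i++) {
    const token = rest[i];
    if (token === undefined || !token.startsWith("--")) continue; // ignore stray positionals
    const key = token.slice(2);
    const next = rest[i + 1];
    if (next === undefined || next.startsWith("--")) {
      options[key] = "true"; // flag without a value (handling assumed for this sketch)
    } else {
      options[key] = next;
      i++; // skip the value that was just consumed
    }
  }
  return { command, options };
}

// The flag names below are invented for the example.
const parsedGate = parseArgs(["gate", "--preset", "strict", "--output", "out"]);
```

In the CLI you would call parseArgs(process.argv.slice(2)) and dispatch on the command field (eval, gate, results, or compare).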
Then update src/index.ts to export the CLI entry:
```ts
import { main } from "./cli.js";

export const VERSION = "0.1.0";
export { main };
```
Step 11: Add Next.js instrumentation
Create src/instrumentation.ts. This file auto-starts observability when the Next.js process boots:
The register() function uses a dynamic import inside a NEXT_RUNTIME === "nodejs" guard because it runs in both Node and Edge runtimes, and Edge fails on modules that import Node-only APIs. The matching next.config.ts looks like this:
```ts
import type { NextConfig } from "next";

const nextConfig: NextConfig = {};

export default nextConfig;
```
Next.js 15 and later load the instrumentation file automatically, so the empty config above is enough; the experimental: { instrumentationHook: true } flag is only required on Next.js 14 and earlier. The instrumentation module is only needed when the eval harness runs inside the Next.js dev server or production runtime. For pure CLI usage, you can skip this step entirely.
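The runtime guard in register() can be sketched with the module loader passed in as a parameter, which keeps the example runnable outside Next.js. registerWith and ObservabilityModule are hypothetical names, and calling initObservability without arguments is an assumption about its signature.

```typescript
// Assumed shape of the observability module's entry point.
type ObservabilityModule = { initObservability: () => void };

async function registerWith(
  runtime: string | undefined,
  load: () => Promise<ObservabilityModule>,
): Promise<boolean> {
  // Only the Node.js runtime may load modules that touch Node-only APIs.
  if (runtime !== "nodejs") return false;
  // The dynamic import happens inside the guard, so Edge never evaluates it.
  const mod = await load();
  mod.initObservability();
  return true;
}

// In src/instrumentation.ts this would be wired roughly as:
//   export async function register() {
//     await registerWith(process.env.NEXT_RUNTIME, () => import("./observability.js"));
//   }
```

A static top-level import would be evaluated in the Edge bundle as well, which is why the loader is only invoked after the runtime check.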
Step 12: Set up MSW test mocks
Create tests/setup.ts to intercept xAI and Langfuse HTTP calls during tests:
MSW intercepts outgoing HTTP requests at the network level, so the @ai-sdk/xai client never reaches a real API during tests. The xAI handler returns a realistic Grok response with usage stats, and the Langfuse handler returns an empty 200 so traces don’t fail.
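For reference, a mocked Grok reply might carry a payload like the one built below. The sketch assumes the OpenAI-compatible chat-completions response shape; mockGrokCompletion, the fixed id, and the word-count stand-in for token usage are invented for illustration, and the MSW handler wiring is left out.

```typescript
// Assumed response shape, modelled on the OpenAI-compatible chat completions format.
interface ChatCompletion {
  id: string;
  object: "chat.completion";
  created: number;
  model: string;
  choices: Array<{
    index: number;
    message: { role: "assistant"; content: string };
    finish_reason: "stop";
  }>;
  usage: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
}

function mockGrokCompletion(content: string, promptTokens = 12): ChatCompletion {
  // Word count is a rough stand-in for completion tokens, not a real tokenizer.
  const completionTokens = content.split(/\s+/).filter(Boolean).length;
  return {
    id: "chatcmpl-mock-1",
    object: "chat.completion",
    created: 0,
    model: "grok-3",
    choices: [
      { index: 0, message: { role: "assistant", content }, finish_reason: "stop" },
    ],
    usage: {
      prompt_tokens: promptTokens,
      completion_tokens: completionTokens,
      total_tokens: promptTokens + completionTokens,
    },
  };
}

const mockReply = mockGrokCompletion("Refunds take five days");
```

An MSW handler would return this object as the JSON body for the intercepted xAI request, so the client under test sees both a message and usage numbers.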
Step 13: Verify with tests
The vitest config (vitest.config.ts) sets up coverage thresholds at 90% across all categories and uses the v8 provider:
Expected output: Vitest runs all 12 test files (types, config, grok-judge, evaluator, gate, observability, results-store, cli, index, instrumentation, api-results, integration) and prints a passing summary. All lines, statements, functions, and branches meet or exceed the 90% coverage thresholds. The test suite covers happy paths, error paths (missing files, API timeouts, unknown presets, missing arguments), and boundary cases (empty trajectory lists, empty stores, default model IDs, zero-trajectory runs).
Next steps
Add more gate presets — define custom GateDefinition arrays for your specific quality thresholds using createOverallQualityGate(), createFaithfulnessGate(), and createPassRateGate().
Integrate into GitHub Actions — use the gate command in a PR workflow to block deployments when evaluation scores drop below your threshold. The CIIntegration classes already generate annotations and PR comments.
Store results persistently — replace the in-memory Map in results-store.ts with a SQLite database or your production data store so results survive server restarts.
Add a dashboard UI — build a Next.js page at app/dashboard/ that calls GET /api/results and renders a chart showing score trends over time.