Small businesses deploying AI chat or email agents struggle to know when an update breaks quality—manual testing doesn't scale, and proprietary LLM judges are expensive to use at volume.
A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.
In this tutorial, you’ll build a CLI-based AI agent evaluation harness that uses Perplexity as a neutral LLM judge to score your customer-facing AI agents. You’ll wire up golden test case datasets, feed them to your agent under test, get judgment scores from Perplexity, compute classification metrics, gate prompt-version promotions based on threshold checks, lint agent definition files, and stream results to Langfuse for observability dashboards. The final pipeline runs as a Next.js API route and a standalone CLI command that’s CI-ready.
This recipe is for anyone deploying AI chat or email agents at small-to-medium businesses who needs automated quality gating without the cost of proprietary judge LLMs.
Prerequisites
Node.js 22+ and pnpm 10
A Perplexity API key — set as PERPLEXITY_API_KEY
Langfuse account (free tier is fine) — for telemetry dashboards
Basic familiarity with TypeScript, Next.js App Router patterns, and asynchronous pipelines
A running agent under test (any HTTP endpoint that accepts POST with {"input": "...", ...} and returns a response)
Step 1: Review the project layout
The scaffold has already been created for you. Let’s orient ourselves by looking at the file layout:
code
app/api/eval/route.ts          - webhook-triggered evaluation
src/
  index.ts                     - CLI entrypoint
  lib/
    types.ts                   - shared interfaces
    config.ts                  - configuration loader (zod-validated)
    eval-pipeline.ts           - central orchestrator
  services/
    golden-dataset.ts          - golden trajectory management
    agent-under-test.ts        - agent HTTP client
    judge-service.ts           - Perplexity-as-judge bridge
    classifier-metrics.ts      - classification metrics engine
    pvc-service.ts             - prompt version control client
    markdown-linter.ts         - agent definition linter
    langfuse-exporter.ts       - observability exporter
tests/                         - Vitest suite (mirrors src/)
packages/                      - API references for every dependency
The core dependencies are already pinned in package.json:
Expected output: The file src/lib/types.ts exists with these interfaces, serving as the single source of truth for every cross-module data shape.
Step 4: Build the configuration loader
The configuration system uses Zod to validate a JSON config file at runtime, with environment variable overrides. Create src/lib/config.ts:
ts
import { z } from "zod";
import { readFileSync, existsSync } from "node:fs";
import type { EvalConfig } from "./types.js";

// Every field has a default, so an empty object is a valid config.
const evalConfigSchema = z.object({
  metrics: z.array(z.string()).default(["faithfulness", "relevance"]),
  judgeModel: z.string().default("pplx-7b-online"),
  threshold: z.number().min(0).max(1).default(0.7),
  concurrency: z.number().min(1).default(4),
  budgetLimit: z.number().min(0).default(10.0),
  outputFormats: z.array(z.string()).default(["json"]),
});

export function loadConfig(path?: string): EvalConfig {
  // Note: the file is read with JSON.parse below, so its contents must be
  // JSON even though the default filename ends in .yaml.
  const configPath =
    path ?? process.env.EVAL_CONFIG_PATH ?? "./eval-config.yaml";
  if (!existsSync(configPath)) {
    throw new Error(`Config file not found: ${configPath}`);
  }
  const raw = readFileSync(configPath, "utf-8");
  const parsed = JSON.parse(raw) as Record<string, unknown>;
  return validateConfig(parsed);
}

export function validateConfig(raw: unknown): EvalConfig {
  const result = evalConfigSchema.parse(raw);
  return result;
}

// Applied after validation: the env value is not re-checked by the schema.
export function mergeWithEnv(config: EvalConfig): EvalConfig {
  return {
    ...config,
    threshold: process.env.EVAL_THRESHOLD_SCORE
      ? Number(process.env.EVAL_THRESHOLD_SCORE)
      : config.threshold,
  };
}

export function createDefaultConfig(): EvalConfig {
  const result = evalConfigSchema.parse({});
  return result;
}
Expected output: The loadConfig() function reads a JSON file from disk, validates it against the Zod schema, and returns a typed EvalConfig. The mergeWithEnv() function lets the EVAL_THRESHOLD_SCORE environment variable override the file-based threshold at runtime — useful for CI pipelines that set thresholds per branch.
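One caveat with the override as written: Number() returns NaN for a non-numeric string, and mergeWithEnv() runs after Zod validation, so a malformed EVAL_THRESHOLD_SCORE is not caught by the schema. A minimal sketch of a guarded override follows. The helper name parseThresholdOverride is hypothetical and not part of the recipe's code.

```typescript
// Hypothetical helper (not in the recipe): returns the env override only
// when it is a finite number in [0, 1]; otherwise keeps the file value.
function parseThresholdOverride(
  raw: string | undefined,
  fallback: number
): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value) || value < 0 || value > 1) return fallback;
  return value;
}
```

Inside mergeWithEnv(), the threshold line could then read parseThresholdOverride(process.env.EVAL_THRESHOLD_SCORE, config.threshold). Whether to fall back quietly or throw on a bad value is a design choice; throwing is arguably safer in CI.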
Step 5: Wire the golden dataset loader
Golden datasets are serialized trajectories — known-good interactions you want to test against. The golden-dataset.ts service wraps @reaatech/agent-eval-harness-golden to load, validate, and compare them:
ts
import { readFileSync, existsSync } from "node:fs";
import {
  loadGoldenTrajectories,
  validateGolden,
  compareAgainstGolden,
  batchCompare,
  quickCreateGolden,
} from "@reaatech/agent-eval-harness-golden";
import type { GoldenTrajectory } from "@reaatech/agent-eval-harness-golden";
import type { Trajectory } from "@reaatech/agent-eval-harness-types";

function textToTrajectory(text: string): Trajectory {
  return {
    turns: [
      {
        turn_id: 0,
        role: "agent" as const,
        content: text,
        timestamp: new Date().toISOString(),
      },
    ],
  };
}

export async function loadGoldenDataset(
  path: string
): Promise<GoldenTrajectory[]> {
  if (!existsSync(path)) {
    throw new Error(`Golden dataset file not found: ${path}`);
  }
  const content = readFileSync(path, "utf-8");
  return await Promise.resolve(loadGoldenTrajectories(content));
}

export async function validateGoldens(
  goldens: GoldenTrajectory[]
): Promise<ReturnType<typeof validateGolden>[]> {
  return await Promise.resolve(goldens.map((g) => validateGolden(g)));
}

export async function batchCompareRuns(
  goldens: GoldenTrajectory[],
  candidates: Array<{ id: string; text: string }>
): Promise<ReturnType<typeof batchCompare>[number]["result"][]> {
  const results = await Promise.all(
    goldens.map((golden) =>
      Promise.resolve(
        batchCompare(
          golden,
          candidates.map((c) => textToTrajectory(c.text))
        )
      )
    )
  );
  return results.flat().map((r) => r.result);
}
Expected output: loadGoldenDataset() reads a JSONL file of golden trajectories and deserializes them. Each trajectory becomes a test case that the pipeline feeds to the agent under test. The batchCompareRuns() function compares the agent’s actual output against the golden reference for regression detection.
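To make the JSONL idea concrete, here is a rough sketch of a line-by-line parser. It is not the implementation of loadGoldenTrajectories(), whose on-disk format is not shown here. It assumes one JSON object per line with a turns array shaped like the one built in textToTrajectory(). The "user" role and the parseJsonl name are assumptions for illustration.

```typescript
// Shapes mirror textToTrajectory(); the real library types may differ.
interface Turn {
  turn_id: number;
  role: "agent" | "user"; // "user" is an assumption
  content: string;
  timestamp: string;
}

interface Trajectory {
  turns: Turn[];
}

// Hypothetical parser: one JSON object per non-empty line.
function parseJsonl(content: string): Trajectory[] {
  const out: Trajectory[] = [];
  content.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // skip blank lines
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      throw new Error(`Invalid JSON on line ${index + 1}`);
    }
    if (typeof parsed !== "object" || parsed === null) {
      throw new Error(`Expected an object on line ${index + 1}`);
    }
    const turns = (parsed as { turns?: unknown }).turns;
    if (!Array.isArray(turns)) {
      throw new Error(`Missing "turns" array on line ${index + 1}`);
    }
    // Only the outer shape is checked; individual turns are trusted as-is.
    out.push({ turns: turns as Turn[] });
  });
  return out;
}
```

Reporting the line number in each error makes a malformed golden file much quicker to repair than a bare JSON.parse failure.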
Step 6: Create the agent under test HTTP client
The AgentUnderTest class sends each test case to your live agent endpoint with retry logic and concurrency control:
Expected output: Each test case is sent to your agent endpoint with up to 3 retries (exponential backoff) and a 30-second timeout. The callBatch() method uses p-limit to bound concurrency — you set the concurrency level in the config. Client errors (4xx) return immediately without retry.
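The retry and concurrency behaviour described above can be sketched without any dependencies. This is an illustration, not the recipe's AgentUnderTest class: the transport is injected as a hypothetical send function (which would wrap fetch and the 30-second timeout), and a small worker pool stands in for p-limit.

```typescript
interface AgentResponse {
  status: number;
  body: string;
}

// Hypothetical transport; in practice this would wrap fetch() plus a timeout.
type Send = (input: string) => Promise<AgentResponse>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries 5xx responses and thrown errors with exponential backoff.
// Anything below 500 (success or client error) is returned immediately.
async function callWithRetry(
  send: Send,
  input: string,
  maxRetries = 3,
  baseDelayMs = 500
): Promise<AgentResponse> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const res = await send(input);
      if (res.status < 500) return res;
      lastError = new Error(`Server error ${res.status}`);
    } catch (err) {
      lastError = err; // network failure or timeout
    }
    // Delay doubles each time: base, 2x base, 4x base, ...
    if (attempt < maxRetries) await sleep(baseDelayMs * 2 ** attempt);
  }
  throw lastError;
}

// Runs at most `limit` calls at a time and preserves input order.
async function callBatch(
  send: Send,
  inputs: string[],
  limit: number,
  baseDelayMs = 500
): Promise<AgentResponse[]> {
  const results = new Array<AgentResponse>(inputs.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < inputs.length) {
      const i = next++;
      const input = inputs[i];
      if (input === undefined) return;
      results[i] = await callWithRetry(send, input, 3, baseDelayMs);
    }
  }
  const workers = Math.max(1, Math.min(limit, inputs.length));
  await Promise.all(Array.from({ length: workers }, () => worker()));
  return results;
}
```

Injecting send keeps the retry logic testable without a live endpoint. In this sketch, one exhausted input rejects the whole batch; returning a per-item error instead would be a reasonable alternative.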
Step 7: Build the Perplexity judge service
The JudgeService is the heart of the evaluation. It takes each agent response, builds a prompt from a judge template, sends it to Perplexity for scoring, and parses the result:
ts
import Perplexity from "perplexity-sdk";
import pLimit from "p-limit";
import {
  JudgeCalibrator,
  JudgeCostTracker,
  getFaithfulnessTemplate,
  getRelevanceTemplate,
  getToolCorrectnessTemplate,
  getOverallQualityTemplate,
  buildPrompt,
} from "@reaatech/agent-eval-harness-judge";
import type {
  JudgeRequest,
  JudgeScore,
} from "@reaatech/agent-eval-harness-judge";

interface PerplexityClient {
  chatCompletionsPost(options: {
    model: string;
    messages: Array<{ role:
Expected output: When NODE_ENV=test, the judge returns a mock score of 0.85 so you can run tests without real API calls. In production, it sends the prompt to Perplexity’s chat completions API, extracts a numeric score from the response text using a regex, and returns a JudgeScore between 0 and 1. The cost tracker logs every judgment for budget monitoring.
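The regex step deserves a concrete illustration, since free-text judge replies vary in shape. The recipe's actual pattern is not shown, so the sketch below is one plausible approach under stated assumptions: the judge is prompted to answer with a labeled score such as "Score: 0.85" or "Score: 8/10", and extractScore is a hypothetical name.

```typescript
// Hypothetical parser (the recipe's real regex is not shown).
// Accepts "score: 0.85", "Score = 8/10", etc., and normalizes to [0, 1].
function extractScore(text: string): number | null {
  const pattern =
    /score\s*[:=]\s*(\d+(?:\.\d+)?)(?:\s*\/\s*(\d+(?:\.\d+)?))?/i;
  const match = pattern.exec(text);
  if (!match) return null; // no labeled score found

  const value = Number(match[1]);
  // An optional "/N" suffix gives the scale; otherwise assume 0..1.
  const scale = match[2] !== undefined ? Number(match[2]) : 1;
  if (scale <= 0) return null;

  // Clamp so an over-enthusiastic judge cannot exceed the valid range.
  return Math.min(1, Math.max(0, value / scale));
}
```

Requiring the "score" label avoids picking up unrelated numbers in the judge's explanation. Returning null rather than a default lets the caller decide whether an unparseable reply counts as a failure or triggers a retry.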
Step 8: Compute classification metrics
The ClassifierMetricsService builds a confusion matrix from judge scores and computes accuracy, precision, recall, F1, the Matthews correlation coefficient (MCC), and Cohen’s Kappa:
ts
import { randomUUID } from "node:crypto";
import {
  logger,
  setEvalRunId,
  logEvalStart,
  logEvalComplete,
  startEvalSpan,
  endSpan,
  recordEvalRun,
  recordSamplesEvaluated,
} from "@reaatech/classifier-evals";
import type {
  ConfusionMatrix,
  ClassificationMetrics,
  EvalRun,
} from "@reaatech/classifier-evals";

export class ClassifierMetricsService {
  buildConfusionMatrix(
    results: Array<{ testCaseId: string; score: number; type: string }>,
Expected output: For a run with results scored as “faithfulness” and “relevance”, the service builds a 2x2 confusion matrix and computes accuracy, macro/micro/weighted precision/recall/F1, MCC, and Cohen’s Kappa. Weighted metrics are computed proportionally by class support size rather than hardcoded to zero. These metrics feed into the run report that gets exported to Langfuse.
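For reference, the standard formulas behind these metrics can be written out for the 2x2 case. This sketch starts from raw counts and does not reproduce how ClassifierMetricsService maps judge scores to classes, which is not shown here. The names BinaryCounts and binaryMetrics are illustrative.

```typescript
interface BinaryCounts {
  tp: number; // predicted positive, actually positive
  fp: number; // predicted positive, actually negative
  fn: number; // predicted negative, actually positive
  tn: number; // predicted negative, actually negative
}

// Returns 0 instead of NaN when a denominator is empty.
const safeDiv = (num: number, den: number): number =>
  den === 0 ? 0 : num / den;

function binaryMetrics({ tp, fp, fn, tn }: BinaryCounts) {
  const n = tp + fp + fn + tn;
  const accuracy = safeDiv(tp + tn, n);
  const precision = safeDiv(tp, tp + fp);
  const recall = safeDiv(tp, tp + fn);
  const f1 = safeDiv(2 * precision * recall, precision + recall);

  // Matthews correlation coefficient: ranges from -1 to 1.
  const mcc = safeDiv(
    tp * tn - fp * fn,
    Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  );

  // Cohen's kappa: observed agreement corrected for chance agreement.
  const expected = safeDiv(
    (tp + fp) * (tp + fn) + (fn + tn) * (fp + tn),
    n * n
  );
  const kappa = safeDiv(accuracy - expected, 1 - expected);

  return { accuracy, precision, recall, f1, mcc, kappa };
}
```

With tp=40, fp=10, fn=5, tn=45, this gives accuracy 0.85, precision 0.8, recall of about 0.889, F1 of about 0.842, MCC of about 0.704, and kappa of about 0.7. MCC and kappa are worth watching alongside accuracy because they stay low when a classifier merely predicts the majority class.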
Step 9: Add prompt version control gating
The PVCService wraps @reaatech/prompt-version-control to check whether the overall score clears the threshold, then decides whether to promote or block the prompt version:
Expected output: If the overall score across all test cases meets the threshold, evaluateAndGate() returns { action: "promote", score }. Otherwise it returns { action: "block", reason: ... }. This decision is recorded in the run report and used by downstream CI/CD systems to gate deployments.
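The promote-or-block decision itself is small enough to sketch. The version below only illustrates the return shape described above; the real evaluateAndGate() signature and its calls into @reaatech/prompt-version-control are not shown, so gateOnScore and the wording of the reason string are assumptions.

```typescript
// Return shape as described in the text; a discriminated union lets
// callers switch on `action` with full type narrowing.
type GateDecision =
  | { action: "promote"; score: number }
  | { action: "block"; reason: string };

// Hypothetical stand-in for the gating decision.
function gateOnScore(overallScore: number, threshold: number): GateDecision {
  // "Meets the threshold" is read as >=, so an exact match promotes.
  if (Number.isFinite(overallScore) && overallScore >= threshold) {
    return { action: "promote", score: overallScore };
  }
  return {
    action: "block",
    reason: `Score ${overallScore.toFixed(2)} is below threshold ${threshold.toFixed(2)}`,
  };
}
```

The Number.isFinite guard means a NaN score blocks rather than silently promoting, which is the safer failure mode for a deployment gate.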
Step 10: Lint agent definition files
The MarkdownLinterService uses @reaatech/agents-markdown-linter to check AGENTS.md and SKILL.md files against its built-in lint rules:
Expected output: The linter scans a directory for AGENTS.md and SKILL.md files, parses each one, runs all available lint rules, and returns error/warning counts per file. The autoFixFile() method can auto-fix trailing whitespace, missing final newlines, and other fixable issues.
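To show what the two named auto-fixes amount to, here is a small sketch that strips trailing whitespace and adds a missing final newline. It is an illustration only, not the linter's autoFixFile() implementation, and it assumes LF line endings. The function name fixWhitespace is hypothetical.

```typescript
// Hypothetical illustration of two auto-fixable rules.
function fixWhitespace(markdown: string): { fixed: string; changes: number } {
  let changes = 0;

  // Rule 1: remove spaces and tabs at the end of each line.
  const lines = markdown.split("\n").map((line) => {
    const trimmed = line.replace(/[ \t]+$/, "");
    if (trimmed !== line) changes++;
    return trimmed;
  });

  // Rule 2: a non-empty file should end with exactly one newline added
  // when none is present.
  let fixed = lines.join("\n");
  if (fixed.length > 0 && !fixed.endsWith("\n")) {
    fixed += "\n";
    changes++;
  }

  return { fixed, changes };
}
```

One thing to keep in mind: in Markdown, two trailing spaces produce a hard line break, so stripping them can change rendering. It may be worth reviewing the diff before committing auto-fixed files.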
Step 11: Build the Langfuse exporter
The LangfuseExporter creates a trace with spans for each test case and score annotations:
Expected output: When valid Langfuse credentials are configured, each evaluation run creates a trace with a child span per test case and a score annotation per judgment. The returned trace URL is included in the run report so you can click directly to the Langfuse dashboard. When credentials are placeholders (e.g. ***), the exporter skips the call and logs a warning instead of failing, which is useful during local development.
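The placeholder check can be sketched as a small guard that runs before any Langfuse call. The exact rules the exporter uses are not shown; this version assumes a credential counts as a placeholder when it is missing, empty, or made up only of asterisks. The names isPlaceholder and shouldExport are hypothetical.

```typescript
// Hypothetical guard: treats missing, blank, or all-asterisk values
// (such as "***") as placeholders.
function isPlaceholder(value: string | undefined): boolean {
  if (value === undefined) return true;
  const trimmed = value.trim();
  return trimmed === "" || /^\*+$/.test(trimmed);
}

// Export only when both keys look real; otherwise warn and skip.
function shouldExport(
  publicKey: string | undefined,
  secretKey: string | undefined,
  warn: (message: string) => void = console.warn
): boolean {
  if (isPlaceholder(publicKey) || isPlaceholder(secretKey)) {
    warn("Langfuse credentials missing or placeholder; skipping export");
    return false;
  }
  return true;
}
```

Passing the warn function in makes the skip path easy to assert on in tests without capturing console output.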
Step 12: Wire the evaluation pipeline
Now all the services come together. The EvalPipeline class orchestrates the full flow: load goldens, validate them, feed to agent, judge responses, compute metrics, lint agent files, gate prompt versions, and export:
ts
import { nanoid } from "nanoid";
import {
  checkThresholds,
  createResultsAggregator,
} from "@reaatech/agent-eval-harness-suite";
import type {
  ResultsAggregator,
} from "@reaatech/agent-eval-harness-suite";
import type {
  EvalConfig,
  EvalRunReport,
  EvalTestCase,
} from "@/src/lib/types";
import {
  loadGoldenDataset,
  validateGoldens,
  batchCompareRuns,
} from "@/src/services/golden-dataset";
import { AgentUnderTest } from "@/src/services/agent-under-test";
import { JudgeService } from
Expected output: The run() method executes all steps in sequence and returns a complete EvalRunReport with pass/fail status, score breakdown, PVC decision, and the Langfuse trace URL. If the golden dataset doesn’t load or goldens fail validation, it returns an error report early without proceeding. The pipeline handles empty result sets safely (overallScore defaults to 0 when there are no results).
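The empty-result handling is easy to get wrong, because averaging zero values yields NaN. Below is a minimal sketch of a summary step that defaults to 0. It assumes the overall score is a plain mean over all judgments, which the recipe does not confirm; ScoredResult, mean, and summarize are illustrative names.

```typescript
interface ScoredResult {
  testCaseId: string;
  metric: string; // e.g. "faithfulness" or "relevance"
  score: number;
}

// Mean that returns 0 for an empty list instead of NaN.
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Assumption: overall score is the mean of every judgment in the run.
function summarize(results: ScoredResult[], threshold: number) {
  // Group scores by metric for the per-metric breakdown.
  const byMetric = new Map<string, number[]>();
  for (const r of results) {
    const bucket = byMetric.get(r.metric) ?? [];
    bucket.push(r.score);
    byMetric.set(r.metric, bucket);
  }

  const breakdown: Record<string, number> = {};
  byMetric.forEach((scores, metric) => {
    breakdown[metric] = mean(scores);
  });

  const overallScore = mean(results.map((r) => r.score));
  const status: "passed" | "failed" =
    overallScore >= threshold ? "passed" : "failed";
  return { overallScore, breakdown, status };
}
```

With a default of 0, an empty run fails any positive threshold, so a misconfigured dataset cannot pass the gate by accident.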
Step 13: Create the CLI entrypoint
The CLI entrypoint in src/index.ts parses command-line arguments, loads the config, instantiates the pipeline, and outputs the JSON report:
ts
import "dotenv/config";
import { writeFileSync } from "node:fs";
import type { CLIOptions, EvalConfig } from "./lib/types.js";
import {
  loadConfig,
  mergeWithEnv,
  createDefaultConfig,
} from "./lib/config.js";
import { EvalPipeline } from "./lib/eval-pipeline.js";

async function main(): Promise<void> {
  const options = parseArgs(process.argv);

  if (!process.env.PERPLEXITY_API_KEY) {
    console.error(
      "Error: PERPLEXITY_API_KEY environment variable is required"
    );
    process.exit(1);
  }

  let config: EvalConfig;
  if (options.config) {
    config = loadConfig(options.config);
  } else {
    try {
      config = loadConfig();
    } catch {
      config = createDefaultConfig();
    }
  }

  config = mergeWithEnv(config);
  if (options.threshold !== undefined) {
    config = { ...config, threshold: options.threshold };
  }
  if (options.model) {
    config = { ...config, judgeModel: options.model };
  }

  const pipeline = new EvalPipeline(config);
  const report = await pipeline.run();

  const jsonOutput = JSON.stringify(report, null, 2);
  console.log(jsonOutput);
  if (options.output) {
    writeFileSync(options.output, jsonOutput, "utf-8");
  }

  process.exit(report.status === "passed" ? 0 : 1);
}

function parseArgs(argv: string[]): CLIOptions {
  const options: CLIOptions = { command: "run" };
  for (let i = 2; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--config":
      case "-c":
        options.config = argv[++i];
        break;
      case "--output":
      case "-o":
        options.output = argv[++i];
        break;
      case "--verbose":
      case "-v":
        options.verbose = true;
        break;
      case "--threshold":
      case "-t":
        options.threshold = Number(argv[++i]);
        break;
      case "--model":
      case "-m":
        options.model = argv[++i];
        break;
    }
  }
  return options;
}

main().catch((error: unknown) => {
  console.error(error instanceof Error ? error.message : String(error));
  process.exit(1);
});
Expected output: Running node src/index.js prints the JSON report to stdout and exits with 0 if the report status is "passed", 1 otherwise. Flags let you override the config path (--config, -c), output file (--output, -o), threshold (--threshold, -t), and judge model (--model, -m); --verbose (-v) is also accepted. A CLI flag takes precedence over the EVAL_THRESHOLD_SCORE environment variable, which in turn takes precedence over the config file. The exit code makes it suitable for CI gating: your workflow can read it to block deployments.
Step 14: Create the API route handler
The Next.js App Router route at app/api/eval/route.ts exposes a webhook endpoint that triggers evaluations asynchronously:
Expected output: A POST to /api/eval with the correct x-api-key header (matching AGENT_API_KEY) triggers an asynchronous evaluation and returns a 202 Accepted with the run ID. The GET endpoint returns a health check for monitoring. This lets you integrate evaluations into CI/CD webhooks or manual testing dashboards.
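The accept-then-run-in-background flow can be sketched as a plain function, kept free of Next.js types so it runs anywhere. This is not the recipe's route handler. The 401 response for a bad key, the response body fields, and the names handleEvalPost, startRun, and newRunId are all assumptions.

```typescript
interface RouteResult {
  status: number;
  body: Record<string, string>;
}

// Hypothetical core of the POST handler.
// apiKeyHeader: value of the x-api-key request header (null if absent).
// expectedKey:  value of AGENT_API_KEY from the environment.
function handleEvalPost(
  apiKeyHeader: string | null,
  expectedKey: string | undefined,
  startRun: (runId: string) => void,
  newRunId: () => string
): RouteResult {
  // Reject when the server has no key configured or the header differs.
  if (!expectedKey || apiKeyHeader !== expectedKey) {
    return { status: 401, body: { error: "Unauthorized" } };
  }

  const runId = newRunId();
  // Kick off the evaluation without waiting for it to finish.
  startRun(runId);

  // 202 Accepted: the work has been queued, not completed.
  return { status: 202, body: { runId, status: "accepted" } };
}
```

In an App Router route, a wrapper would typically read the header with request.headers.get("x-api-key") and return the result via Response.json(body, { status }). Refusing all requests when AGENT_API_KEY is unset avoids accidentally exposing an open endpoint.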
Step 15: Run the tests
The test suite covers every service with mocked externals. Run it with coverage:
terminal
pnpm test
This executes Vitest with the config from vitest.config.ts. The test suite includes:
Next steps
Add more judge dimensions — wire additional templates from @reaatech/agent-eval-harness-judge like getToolCorrectnessTemplate() to score tool-use accuracy in agent responses
Deploy as a scheduled job — run the pipeline on a cron schedule to continuously monitor your agent’s quality and catch regressions before they reach customers
Integrate with your CI/CD — add the eval CLI command as a required step in your deployment workflow to block low-quality prompt versions automatically