A 25-person B2B SaaS company receives hundreds of pieces of feedback each week via support tickets, NPS comments, and sales calls. The product manager manually reads through a fraction of them, often missing the most requested features or the most painful bugs. This leads to a product roadmap that doesn't align with customer needs, increasing churn. They need an agent that ingests all feedback sources, deduplicates, clusters by topic, estimates the number of affected accounts, and generates a weekly report of the top 10 most impactful product changes. This ensures the product team works on what matters most to retention.
This tutorial builds an automated evaluation pipeline that scores, costs, and gates AI agent trajectories, using customer feedback triage as the example domain. You’ll wire 6 REAA agent-eval-harness packages into a Hono API stack behind a Next.js route handler, then validate the pipeline with a test suite running at 90%+ coverage. This is for backend/ML engineers who want a reusable evaluation harness they can point at any agent workflow — support-ticket triage is just the demo scenario.
Prerequisites
Node.js >= 22 and pnpm 10 installed
An OpenAI API key (for the LLM-as-judge engine — the key is used at runtime via process.env)
Basic familiarity with TypeScript, Next.js App Router, and REST APIs
Familiarity with Zod schemas and the concept of LLM trajectory evaluation
Step 1: Create the project and install dependencies
Start from an empty directory and scaffold a Next.js 16 project with App Router. Then install the 6 REAA packages plus their supporting dependencies.
Expected output: pnpm-lock.yaml is created. No peer-dependency warnings.
Step 2: Configure Vitest
Create vitest.config.ts at the project root. This sets up the v8 coverage provider with 90% thresholds across all metrics and excludes config files, type declarations, and UI pages from coverage tracking.
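A minimal sketch of what such a config might look like, using Vitest's documented `defineConfig`, `coverage.provider`, and `coverage.thresholds` options. The include and exclude globs and the JSON report path are assumptions inferred from the later steps of this tutorial, not the recipe's actual file.

```typescript
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    environment: "node",
    // Assumed layout: tests live under tests/ and mirror src/.
    include: ["tests/**/*.test.ts"],
    // Writes the JSON report that Step 15 inspects (path is an assumption).
    reporters: ["default", "json"],
    outputFile: { json: "vitest-report.json" },
    coverage: {
      provider: "v8",
      include: ["src/**/*.ts", "app/**/route.ts"],
      // Assumed globs for config files, type declarations, UI pages, and the barrel export.
      exclude: ["**/*.config.*", "**/*.d.ts", "app/**/page.tsx", "app/**/layout.tsx", "src/index.ts"],
      // Fail the run when any metric drops below 90%.
      thresholds: { lines: 90, branches: 90, functions: 90, statements: 90 },
    },
  },
});
```

With thresholds set this way, Vitest exits with a non-zero code when coverage falls short, which is what lets the test command act as a quality gate.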
Expected output: The file exists at the project root. You can now run pnpm test to see the configuration in action (it will fail until you write source files and tests).
Step 3: Configure environment variables
Create .env.example with placeholder entries. Every env var your code reads must be listed here — never commit real secrets.
env
# Env vars used by agnostic-customer-feedback-triage.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only; never commit real values.
NODE_ENV=development

# LLM provider for LLM-as-judge (required)
OPENAI_API_KEY=<your-openai-key>

# Optional additional providers for multi-model judge consensus
ANTHROPIC_API_KEY=<your-anthropic-key>

# Langfuse observability tracing (optional but recommended)
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_HOST=https://cloud.langfuse.com

# Upstash Vector for storing evaluation run embeddings
# (optional; the service degrades gracefully when unset)
UPSTASH_VECTOR_URL=<your-upstash-vector-url>
UPSTASH_VECTOR_TOKEN=<your-upstash-vector-token>

# Evaluation runner concurrency
EVAL_CONCURRENCY=4

# Judge engine defaults (overridable per request)
JUDGE_PROVIDER=openai
JUDGE_MODEL=gpt-5.2
Copy this to .env.local and fill in your real values:
terminal
cp .env.example .env.local
# Then edit .env.local with your keys
Expected output: Both .env.example and .env.local exist. .env.example is committed to git, .env.local is git-ignored.
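To illustrate how these variables might be consumed, here is a small sketch of a config reader. `readEvalConfig` and `EvalRuntimeConfig` are hypothetical names, not part of the REAA packages; the defaults mirror the placeholder values in .env.example.

```typescript
// Hypothetical helper: not part of the REAA packages or the original recipe.
type EvalRuntimeConfig = {
  concurrency: number;
  judgeProvider: string;
  judgeModel: string;
  hasVectorStore: boolean;
};

// Takes the env map as a parameter so it can be tested without touching process.env.
function readEvalConfig(env: Record<string, string | undefined>): EvalRuntimeConfig {
  if (!env.OPENAI_API_KEY) {
    throw new Error("OPENAI_API_KEY is required for the LLM-as-judge engine");
  }
  const parsed = Number.parseInt(env.EVAL_CONCURRENCY ?? "", 10);
  return {
    // Fall back to 4 workers when the value is missing or not a positive integer.
    concurrency: Number.isInteger(parsed) && parsed > 0 ? parsed : 4,
    judgeProvider: env.JUDGE_PROVIDER ?? "openai",
    judgeModel: env.JUDGE_MODEL ?? "gpt-5.2",
    // The vector store is optional: both values must be present to enable it.
    hasVectorStore: Boolean(env.UPSTASH_VECTOR_URL && env.UPSTASH_VECTOR_TOKEN),
  };
}
```

In the app this would be called once at startup as `readEvalConfig(process.env)`, so a missing required key fails fast instead of surfacing midway through an evaluation run.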
Step 4: Define types and Zod schemas
Create the domain types that wrap the REAA types package. Start with src/types/eval.ts — this file re-exports the key types and defines the API request/response schemas.
Next create src/types/feedback-eval-inputs.ts — a specialization of the Trajectory type for the feedback triage domain:
ts
import type { Trajectory } from "@reaatech/agent-eval-harness-types";

export type FeedbackTrajectory = Trajectory & {
  metadata?: {
    scenario?: string;
    feedbackSource?: "support_ticket" | "nps_comment" | "sales_call";
    expectedClusterCount?: number;
    expectedItemCount?: number;
    total_turns?: number;
  };
};
Finally, create src/types/index.ts to re-export everything from a single entry point:
ts
export { EvalApiRequestSchema, JudgeRequestSchema } from "./eval.js";
export type {
  EvalApiRequest,
  JudgeRequest,
  EvalApiResponse,
  Trajectory,
  Turn,
  JudgeScore,
  CostBreakdown,
  EvalResult,
} from "./eval.js";
export { TrajectorySchema, TurnSchema, JudgeScoreSchema, EvalResultSchema } from "./eval.js";
export type { FeedbackTrajectory } from "./feedback-eval-inputs.js";
Expected output: Three files under src/types/. The barrel export at src/types/index.ts is the only import any other module needs.
Step 5: Create example trajectories
Create src/lib/example-trajectories.ts with 3 golden trajectories that simulate customer feedback triage scenarios. Each trajectory is validated against TrajectorySchema from the REAA types package.
ts
import type { Trajectory } from "@reaatech/agent-eval-harness-types";
import { TrajectorySchema } from "@reaatech/agent-eval-harness-types";

const happyPathTrajectory = {
  turns: [
    {
      turn_id: 1,
      role: "user",
      content: "Support ticket feedback: Login page is too slow, takes 8 seconds to load. Our team of 12 engineers loses 2 hours daily waiting.",
      timestamp: "2026-04-14T09:00:00Z",
      quality_notes: "Expected: 8 feedback items clustered into 3 roadmap topics",
    },
    {
      turn_id: 2,
      role: "user",
      content: "NPS comment (score 3): 'Dashboard is confusing. I keep clicking the wrong button to find reports. Been using this for 6 months and still get lost.'",
      timestamp: "2026-04-14T10:30:00Z"
Expected output: Three validated trajectory objects. The happy-path has 8 turns (varied feedback sources), the error-path has 4 turns (with null and malformed content), and the boundary has 1 turn (single positive comment).
Step 6: Build the evaluation suite service
The suite service wraps @reaatech/agent-eval-harness-suite to run batch evaluations. Create src/services/eval-suite.ts:
Expected output: Three exported functions. runEvalSuite runs trajectories through a SuiteRunner with 4 concurrent workers. aggregateResults compresses a run result into aggregated metrics. exportResultsToMarkdown produces a human-readable report. Errors are logged via the observability logger.
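To make "4 concurrent workers" concrete, the sketch below shows one common way to cap concurrency over a batch. `mapWithConcurrency` is a hypothetical, dependency-free helper written for illustration; it is not the SuiteRunner API, whose internals this tutorial does not show.

```typescript
// Hypothetical helper illustrating bounded concurrency; it is not the SuiteRunner API.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until the queue is drained.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  // Never start more runners than there are items, and always start at least one.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

Results are written by index, so the output order matches the input order even when trajectories finish out of order. A rejected worker rejects the whole batch here; a service like the one described above would catch and log that error instead.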
Step 7: Build the judge service
The judge service wraps @reaatech/agent-eval-harness-judge to create LLM-as-judge engines and score individual trajectories. Create src/services/eval-judge.ts:
Expected output: createJudge returns a configured JudgeEngine instance. judgeTrajectory calls the judge’s judge() method with the trajectory context and returns a structured EvalResult. An empty trajectory returns score 0 with the message “No turns to evaluate”.
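The empty-trajectory behaviour can be sketched as a guard that runs before any LLM call. The types below are local stand-ins and the `judge` callback is hypothetical; the real Trajectory and EvalResult shapes come from the REAA types package and may carry more fields.

```typescript
// Local stand-ins for illustration only; the real types live in the REAA types package.
type SketchTurn = { turn_id: number; role: string; content: string };
type SketchTrajectory = { turns: SketchTurn[] };
type SketchEvalResult = { score: number; message: string };

// The judge callback is hypothetical: it stands in for whatever the real JudgeEngine exposes.
async function judgeWithGuard(
  trajectory: SketchTrajectory,
  judge: (t: SketchTrajectory) => Promise<SketchEvalResult>,
): Promise<SketchEvalResult> {
  // Short-circuit before spending tokens on a call that has nothing to score.
  if (trajectory.turns.length === 0) {
    return { score: 0, message: "No turns to evaluate" };
  }
  return judge(trajectory);
}
```

Checking for the empty case first keeps the boundary condition deterministic and free of API cost, which also makes it easy to cover in a unit test.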
Step 8: Build the cost tracking service
The cost service wraps @reaatech/agent-eval-harness-cost to calculate per-trajectory costs and enforce budgets. Create src/services/eval-cost.ts:
Expected output: trackCosts delegates to calculateTrajectoryCost with a fallback that returns zero costs on failure. enforceBudget creates a moderate budget and checks the cost against it. The error path produces all-zero costs instead of crashing.
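The fallback-to-zero pattern described above might look roughly like this. `SketchCost`, `trackCostsSafely`, and `checkBudget` are hypothetical names, and the cost fields are assumed for illustration; they are not the signatures of the REAA cost package.

```typescript
// Assumed cost shape for illustration; the real CostBreakdown type may differ.
type SketchCost = { inputCost: number; outputCost: number; totalCost: number };

const ZERO_COST: SketchCost = { inputCost: 0, outputCost: 0, totalCost: 0 };

// Wraps any cost calculator so that a failure yields all-zero costs instead of throwing.
function trackCostsSafely<T>(trajectory: T, calculate: (t: T) => SketchCost): SketchCost {
  try {
    return calculate(trajectory);
  } catch {
    return { ...ZERO_COST };
  }
}

// Compares a cost against a fixed limit (in the same currency unit as totalCost).
function checkBudget(
  cost: SketchCost,
  limit: number,
): { withinBudget: boolean; remaining: number } {
  const remaining = limit - cost.totalCost;
  return { withinBudget: remaining >= 0, remaining };
}
```

One trade-off worth noting: a zero-cost fallback always passes the budget check, so it is sensible to log the underlying error so that a broken calculator does not go unnoticed.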
Step 9: Build the gate and observability services
The gate service wraps @reaatech/agent-eval-harness-gate to run quality, cost, and latency regression gates. Create src/services/eval-gate.ts:
ts
import {
  createGateEngine,
  getStandardPreset,
  CIIntegration,
  type GateEvaluationSummary,
} from "@reaatech/agent-eval-harness-gate";
import type { AggregatedResults } from "@reaatech/agent-eval-harness-suite";

export type GateEngine = ReturnType<typeof createGateEngine>;

export function createStandardGateEngine(): GateEngine {
  const gates = getStandardPreset().gates;
  return createGateEngine(gates);
}

export function evaluateGates(engine: GateEngine, results: AggregatedResults): GateEvaluationSummary {
  try {
    return engine.evaluate(results);
  } catch {
    return {
      runId: results.runId,
      overallPassed: false,
      passedGates: 0,
      failedGates: 0,
      totalGates: 0,
      results: [],
      durationMs: 0,
    };
  }
}

export function getCIExitCode(summary: GateEvaluationSummary): number {
  return CIIntegration.getExitCode(summary);
}

export function generateJUnitReport(summary: GateEvaluationSummary): string {
  return CIIntegration.generateJUnitReport(summary);
}
The observability service wraps @reaatech/agent-eval-harness-observability for tracing, logging, metrics, and dashboard recording. Create src/services/eval-observability.ts:
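Expected output: Both services handle failures gracefully. evaluateGates catches engine errors and returns a summary with overallPassed set to false, and every public function in the observability service wraps its call in try/catch and logs an [observability-skipped] message rather than crashing the pipeline.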
Step 10: Build the summariser and vector store services
The summariser uses the Vercel AI SDK to generate human-readable summaries. Create src/services/eval-summariser.ts:
ts
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function summariseResults(aggregated: {
  overallMetrics: { overallScore: number; avgCostPerTask: number; latencyP99: number };
  summary: { totalTrajectories: number; passRate: number; passedTrajectories: number; failedTrajectories: number };
}): Promise<string> {
  if (aggregated.summary.totalTrajectories === 0) {
    return "No trajectories evaluated";
  }
  try {
    const { text } = await generateText({
      model: openai("gpt-5.2"),
      system: "You are an evaluation results summariser. Summarise in 2-3 clear, concise sentences.",
      prompt: JSON.stringify(aggregated),
    });
    return text;
  } catch (err) {
    console.error("[eval-summariser] generateText failed:", err);
    return `Evaluation run completed: ${String(aggregated.summary.totalTrajectories)} trajectories evaluated, overall score ${aggregated.overallMetrics.overallScore.toFixed(2)}, pass rate ${aggregated.summary.passRate.toFixed(0)}%.`;
  }
}
The vector store service wraps @upstash/vector for storing and querying evaluation run embeddings. Create src/services/eval-vector-store.ts:
ts
import { Index } from "@upstash/vector";

export type VectorIndex = Index | null;

export function createVectorIndex(): VectorIndex {
  const url = process.env.UPSTASH_VECTOR_URL;
  const token = process.env.UPSTASH_VECTOR_TOKEN;
  if (!url || !token) return null;
  return new Index({ url, token });
}

export async function storeEvalRunEmbedding(
  index: VectorIndex,
  runId: string,
  aggregated: { overallMetrics: { overallScore: number }; summary: { passRate: number } },
): Promise<void> {
  if (!index) return;
  try {
    await index.upsert({
      id: runId,
      vector: [
        aggregated.overallMetrics.overallScore,
        aggregated.summary.passRate / 100,
      ],
      metadata: {
        overallScore: aggregated.overallMetrics.overallScore,
        passRate: aggregated.summary.passRate,
        timestamp: new Date().toISOString(),
      },
    });
  } catch (err) {
    console.warn("[eval-vector-store] upsert failed:", err);
  }
}

export async function querySimilarRuns(
  index: VectorIndex,
  aggregated: { overallMetrics: { overallScore: number }; summary: { passRate: number } },
  topK: number,
): Promise<Array<{ id: string; score: number }>> {
  if (!index) return [];
  try {
    const results = await index.query({
      vector: [
        aggregated.overallMetrics.overallScore,
        aggregated.summary.passRate / 100,
      ],
      topK,
      includeMetadata: true,
    });
    return results.map((r) => ({ id: String(r.id), score: r.score }));
  } catch (err) {
    console.warn("[eval-vector-store] query failed:", err);
    return [];
  }
}
Expected output: The summariser returns a template fallback when the AI SDK call fails. The vector store returns null from createVectorIndex() when env vars are missing, and all operations are no-ops in that case.
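To see what the template fallback produces, the sketch below extracts the same string template from the catch branch of summariseResults into a pure function. `fallbackSummary` and `SketchAggregated` are hypothetical names used only for this illustration.

```typescript
// Trimmed-down input shape with only the fields the fallback template reads.
type SketchAggregated = {
  overallMetrics: { overallScore: number };
  summary: { totalTrajectories: number; passRate: number };
};

// Same template as the catch branch of summariseResults, extracted as a pure function.
function fallbackSummary(aggregated: SketchAggregated): string {
  const { totalTrajectories, passRate } = aggregated.summary;
  const score = aggregated.overallMetrics.overallScore.toFixed(2);
  return `Evaluation run completed: ${String(totalTrajectories)} trajectories evaluated, overall score ${score}, pass rate ${passRate.toFixed(0)}%.`;
}

const exampleFallback = fallbackSummary({
  overallMetrics: { overallScore: 0.9 },
  summary: { totalTrajectories: 3, passRate: 75 },
});
// exampleFallback is:
// "Evaluation run completed: 3 trajectories evaluated, overall score 0.90, pass rate 75%."
```

Note that passRate is treated as a percentage between 0 and 100, which is consistent with the vector store dividing it by 100 before storing it alongside the score.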
Step 11: Wire the orchestrator
The orchestrator ties the seven service modules from Steps 6 through 10 together into a single runFullEvaluation pipeline. Create src/services/eval-orchestrator.ts:
ts
import type { Trajectory } from "@reaatech/agent-eval-harness-types";
import type { AggregatedResults, EvalRunResult } from "@reaatech/agent-eval-harness-suite";
import type { EvalApiResponse } from "../types/index.js";
import { createJudge, judgeTrajectory } from "./eval-judge.js";
import { runEvalSuite, aggregateResults } from "./eval-suite.js";
import { trackCosts, enforceBudget } from "./eval-cost.js";
import { createStandardGateEngine, evaluateGates } from "./eval-gate.js";
import { initObservability, logRunStart, logRunEnd, recordDashboardRun } from "./eval-observability.js";
import { summariseResults } from "./eval-summariser.js";
import { createVectorIndex, storeEvalRunEmbedding } from "./eval-vector-store.js";
Expected output: The orchestrator runs 10 sequential stages: observability init, judge creation, batch evaluation, aggregation, cost tracking, budget enforcement, gate evaluation, LLM summarisation, vector store sync, and dashboard recording. Each stage is wrapped in its own try/catch, so a failure in one stage never crashes the entire pipeline.
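Per-stage isolation of this kind can be sketched with a small wrapper that runs a stage, records any failure, and substitutes a fallback value. `runStage`, `StageFailure`, and `runPipelineSketch` are hypothetical names; this is one plausible shape for the pattern, not the orchestrator's actual code.

```typescript
type StageFailure = { stage: string; error: string };

// Runs one stage; on failure, records the error and returns the fallback instead of throwing.
async function runStage<T>(
  stage: string,
  failures: StageFailure[],
  fn: () => Promise<T>,
  fallback: T,
): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    failures.push({ stage, error: err instanceof Error ? err.message : String(err) });
    return fallback;
  }
}

// Two illustrative stages: the second one fails and the pipeline still completes.
async function runPipelineSketch(): Promise<{ summary: string; failures: StageFailure[] }> {
  const failures: StageFailure[] = [];
  const score = await runStage("aggregation", failures, async () => 0.9, 0);
  const summary = await runStage<string>(
    "summarisation",
    failures,
    async () => {
      throw new Error("LLM unavailable");
    },
    `score ${score.toFixed(2)}`,
  );
  return { summary, failures };
}
```

Collecting failures in a list, rather than only logging them, lets the final response report which stages were degraded.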
Step 12: Create the Hono API and Next.js route handler
The Hono app exposes 4 endpoints. Create src/hono-eval-app.ts:
Now bridge the Hono app to Next.js via the catch-all route handler. Create app/api/eval/[[...route]]/route.ts:
ts
import { handle } from "hono/vercel";
import app from "../../../../src/hono-eval-app.js";

export const GET = handle(app);
export const POST = handle(app);
export const PUT = handle(app);
export const DELETE = handle(app);
Expected output: HTTP requests to /api/eval/run, /api/eval/judge, /api/eval/gates, and /api/eval/report are all routed through Hono. Each route has Zod validation on the input and returns 400 on schema mismatch, 500 on unexpected errors.
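The 400-versus-500 split can be illustrated without any framework. The sketch below uses a hand-written type guard in place of Zod so it has no dependencies; the `{ trajectories: [...] }` request shape, `isRunRequest`, and `handleRun` are all assumptions made for this example, not the recipe's actual schema or handlers.

```typescript
// Assumed request shape for illustration; the real one is defined by EvalApiRequestSchema.
type RunRequest = { trajectories: { turns: unknown[] }[] };
type HttpResult = { status: 200 | 400 | 500; body: unknown };

// Hand-written type guard standing in for a Zod safeParse call.
function isRunRequest(value: unknown): value is RunRequest {
  if (typeof value !== "object" || value === null) return false;
  const trajectories = (value as { trajectories?: unknown }).trajectories;
  return (
    Array.isArray(trajectories) &&
    trajectories.every(
      (t) => typeof t === "object" && t !== null && Array.isArray((t as { turns?: unknown }).turns),
    )
  );
}

// Maps validation failures to 400 and unexpected runtime errors to 500.
function handleRun(payload: unknown, run: (req: RunRequest) => unknown): HttpResult {
  if (!isRunRequest(payload)) {
    return { status: 400, body: { error: "Invalid request body" } };
  }
  try {
    return { status: 200, body: run(payload) };
  } catch (err) {
    return { status: 500, body: { error: err instanceof Error ? err.message : "Unexpected error" } };
  }
}
```

Validating before entering the try block keeps the two failure classes separate: a malformed payload is the caller's problem, while anything thrown afterwards is treated as a server fault.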
Step 13: Create the barrel export and update the home page
Replace src/index.ts with a barrel export that surfaces everything consumers need:
ts
export { runFullEvaluation } from "./services/eval-orchestrator.js";
export { runEvalSuite, aggregateResults, exportResultsToMarkdown } from "./services/eval-suite.js";
export { createJudge, judgeTrajectory } from "./services/eval-judge.js";
export { trackCosts, enforceBudget, createDailyCostTracker, generateReport } from "./services/eval-cost.js";
export { createStandardGateEngine, evaluateGates, getCIExitCode, generateJUnitReport } from "./services/eval-gate.js";
export { initObservability, logRunStart, logRunEnd, logGateResult, recordDashboardRun } from "./services/eval-observability.js";
export { summariseResults } from "./services/eval-summariser.js";
export { createVectorIndex, storeEvalRunEmbedding, querySimilarRuns } from "./services/eval-vector-store.js";
export {
  EvalApiRequestSchema,
  JudgeRequestSchema,
  TrajectorySchema,
  TurnSchema,
  JudgeScoreSchema,
  EvalResultSchema,
} from "./types/index.js";
export type {
  EvalApiRequest,
  JudgeRequest,
  EvalApiResponse,
  Trajectory,
  Turn,
  JudgeScore,
  CostBreakdown,
  EvalResult,
  FeedbackTrajectory,
} from "./types/index.js";
export { exampleTrajectories, getExampleTrajectories } from "./lib/example-trajectories.js";
export { default as honoApp } from "./hono-eval-app.js";
Update app/page.tsx to show a minimal dashboard with the API endpoints and package list:
tsx
import styles from "./page.module.css";

export default function Home() {
  return (
    <div className={styles.page}>
      <main className={styles.main}>
        <h1>Customer Feedback Triage — Evaluation Dashboard</h1>
        <p>
          This service wires{" "}
          <strong>6 REAA agent-eval-harness packages</strong> into a{" "}
          <strong>Hono API</strong> stack, demonstrated on customer feedback
          triage evaluation trajectories.
        </p>
        <section className={styles.section}>
          <h2>API Endpoints</h2>
          <ul>
            <li>
              <code>POST /api/eval/run</code> — Run a full evaluation suite
            </li>
            <li>
              <code>POST /api/eval/judge</code> — Judge a single trajectory
            </li>
            <li>
              <code>POST /api/eval/gates</code> — Evaluate gates on results
            </li>
            <li>
              <code>GET /api/eval/report</code> — Dashboard report
            </li>
          </ul>
        </section>
        <section className={styles.section}>
          <h2>Packages</h2>
          <ul>
            <li><code>@reaatech/agent-eval-harness-types</code> — Domain types & Zod schemas</li>
            <li><code>@reaatech/agent-eval-harness-suite</code> — Batch evaluation runner</li>
            <li><code>@reaatech/agent-eval-harness-judge</code> — LLM-as-judge engine</li>
            <li><code>@reaatech/agent-eval-harness-cost</code> — Cost tracking & budgets</li>
            <li><code>@reaatech/agent-eval-harness-gate</code> — CI regression gates</li>
            <li><code>@reaatech/agent-eval-harness-observability</code> — OTel tracing, metrics, dashboards</li>
          </ul>
        </section>
      </main>
    </div>
  );
}
Expected output: src/index.ts exports 20+ named exports from a single entry point. The home page renders a clean API reference card. Both files are excluded from coverage (UI pages and barrel exports aren’t tested).
Step 14: Write the test suite
Create tests under tests/ that mirror the source structure. Here’s the core test for the orchestrator — create tests/services/eval-orchestrator.test.ts:
Write tests for every service module: types, judge, cost, gate, observability, summariser, vector store, suite, Hono app, barrel exports, example trajectories, and integration pipelines. The complete test suite covers happy paths, error paths, and boundary conditions across all 14 source modules. Use the orchestrator test above as a template — mock the internal service modules with vi.mock, then import the function under test and assert on its return value.
Expected output: 14 test files under tests/ mirroring the src/ structure, with 60+ individual test cases. Every module has at least one happy-path, one error-path, and one boundary-condition test.
Step 15: Run the full verification suite
Run all quality gates to confirm the pipeline works end to end:
terminal
pnpm typecheck
Expected output: Zero TypeScript errors. If you get module-resolution errors, verify all import paths use the .js extension (required by NodeNext module resolution).
terminal
pnpm lint
Expected output: Zero ESLint warnings or errors.
terminal
pnpm test
Expected output: All tests pass (0 failed), and the coverage report shows >= 90% on lines, branches, functions, and statements for src/**/*.ts and app/**/route.ts.
The test runner generates vitest-report.json — you can inspect it to confirm numFailedTests: 0 and numTotalTests >= 60.
Next steps
Add a multi-model judge consensus — configure both OpenAI and Anthropic judge engines in the orchestrator and average their scores for more balanced evaluations
Connect real feedback sources — replace the hardcoded example trajectories with a webhook that ingests support tickets from Zendesk or Intercom, NPS scores from SurveyMonkey, and sales call transcripts from Gong
Add CI integration — use getCIExitCode and generateJUnitReport in a GitHub Actions workflow that gates PR merges based on evaluation quality thresholds
,
},
{
turn_id: 3,
role: "user",
content: "Sales call transcript excerpt: 'Client on enterprise plan says they need CSV export — without it they can't do their monthly board reporting. At least 5 accounts have mentioned this.'",
timestamp: "2026-04-14T14:00:00Z",
},
{
turn_id: 4,
role: "user",
content: "Support ticket: Error 500 when exporting reports larger than 100 rows. Happens every time for the past 3 releases.",
timestamp: "2026-04-15T08:15:00Z",
},
{
turn_id: 5,
role: "user",
content: "NPS comment (score 2): 'Had to restart the app 4 times today because of crashes after saving. Lost work twice. Very frustrating.'",
timestamp: "2026-04-15T11:00:00Z",
},
{
turn_id: 6,
role: "user",
content: "Sales call: 3 prospects from healthcare vertical asked about HIPAA compliance. They said it's a dealbreaker without it.",
timestamp: "2026-04-15T16:30:00Z",
},
{
turn_id: 7,
role: "user",
content: "Support ticket: Search functionality returns incomplete results. Searching for 'invoice' misses half the related tickets.",
timestamp: "2026-04-16T07:45:00Z",
},
{
turn_id: 8,
role: "user",
content: "NPS comment (score 4): 'Mobile app is great but I need push notifications when teammates comment on my tickets.'",