Skip to content
/ solutions / anthropic-eval-harness-for-agent-quality-assurance Anthropic Eval Harness for Agent Quality Assurance Continuous regression testing and safety scoring for Anthropic‑powered agents, with automated quality gates before any customer‑facing deployment.
The problem SMBs shipping customer‑support or sales agents on Anthropic’s models see quality drift over time—toxic phrasing, hallucinated facts, or missed tools—but lack a repeatable test suite to catch these regressions before they reach users.
Example artifact A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.
119 kB · 94 tests· 99.0% coverage· vitest passing
SHA-256 1975caeac94632e24e85ddb057fb27110a3399cf0f1331d2429a05b5f7648604 Comments Sign in to commentSign in with GitHub to comment and vote.
© 2026 REAA Technologies Inc. — Open-Source AI Solutions for Small Business.
On this page Intro
This tutorial walks you through building a regression testing harness for Anthropic-powered AI agents. By the end, you’ll have a Next.js application that accepts evaluation run triggers over a REST API, calls Claude to process test trajectories, scores the responses with LLM-as-a-judge modules, enforces quality gates, logs incidents on failure, and exports traces to Langfuse for an observability dashboard. You’ll write every source file from scratch — just copy and paste along.
Prerequisites
Node.js >= 22 and pnpm 10.x (the project pins "packageManager": "pnpm@10.0.0")
An Anthropic API key (get one at console.anthropic.com )
A Langfuse account for the observability dashboard (sign up at langfuse.com to get public/secret keys)
Familiarity with TypeScript, Next.js App Router, and REST APIs
Step 1: Scaffold the Next.js project
Create a new directory and set up the project configuration files: TypeScript, Next.js, and Vitest.
Create package.json:
{
"name" : "anthropic-eval-harness" ,
"version" : "0.1.0" ,
"private" : true ,
"type" : "module" ,
"engines" : {
"node" : ">=22"
},
"packageManager" : "pnpm@10.0.0" ,
"scripts" : {
"typecheck" : "tsc --noEmit" ,
"lint" : "eslint ." ,
"test" : "vitest run --coverage --reporter=json --outputFile=vitest-report.json"
}
}
Create tsconfig.json:
{
"compilerOptions" : {
"target" : "ES2022" ,
"module" : "NodeNext" ,
"moduleResolution" : "NodeNext" ,
"strict" : true ,
"esModuleInterop" : true ,
"forceConsistentCasingInFileNames" : true ,
"skipLibCheck" : true ,
"resolveJsonModule" : true ,
"isolatedModules" : true ,
"noUncheckedIndexedAccess" : true ,
Create next.config.ts — this enables the instrumentation hook that initializes Langfuse on startup:
import type { NextConfig } from "next" ;
const nextConfig : NextConfig = {
experimental: {
instrumentationHook: true ,
},
} as NextConfig ;
export default nextConfig;
Create vitest.config.ts:
import { defineConfig } from "vitest/config" ;
export default defineConfig ({
esbuild: {
jsx: "automatic" ,
},
test: {
globals: true ,
environment: "node" ,
setupFiles: [ "./tests/setup.ts" ],
silent: false ,
coverage: {
provider: "v8"
Finally, create types/next-server.d.ts to provide type declarations for NextRequest and NextResponse:
declare module "next/server" {
export class NextRequest extends Request {
constructor (input : RequestInfo | URL , init ?: RequestInit );
json () : Promise < unknown >;
cookies : Record < string , string >;
ip ?: string ;
geo ?: Record < string , string >;
}
Step 2: Install dependencies
Run this command to install all runtime and dev dependencies in one shot:
pnpm add @anthropic-ai/sdk@0.95.2 @reaatech/agent-eval-harness-cost@0.1.0 @reaatech/agent-eval-harness-gate@0.1.0 @reaatech/agent-eval-harness-golden@0.1.0 @reaatech/agent-eval-harness-judge@0.1.0 @reaatech/agent-eval-harness-latency@0.1.0 @reaatech/agent-eval-harness-suite@0.1.0 @reaatech/agent-eval-harness-tool-use@0.1.0 @reaatech/agent-replay-core@0.1.0 @reaatech/agent-runbook-incident@0.1.0 langfuse@3.38.20 next@15.1.7 react@19.0.0 react-dom@19.0.0 zod@4.4.3 && pnpm add -D @testing-library/jest-dom@6.6.3 @testing-library/react@16.2.0 @testing-library/user-event@14.6.1 @types/node@22.15.3 @types/react@19.0.8 @types/react-dom@19.0.3 @vitest/coverage-istanbul@3.0.5 @vitest/coverage-v8@3.0.5 eslint@9.20.1 jsdom@29.1.1 msw@2.7.0 typescript@5.7.3 typescript-eslint@8.24.0 vitest@3.0.5
Your package.json should now have 15 runtime dependencies (Anthropic SDK, 9 REAA packages, Langfuse, Next.js, React, React DOM, and Zod) plus 14 dev dependencies (Vitest, MSW for HTTP mocking, Testing Library, ESLint, and TypeScript).
Step 3: Configure environment variables
Create .env.example as a template, then copy it to .env.local and fill in your credentials.
Create .env.example:
ANTHROPIC_API_KEY=<your-anthropic-key>
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_HOST=https://cloud.langfuse.com
Now copy it and add your keys:
cp .env.example .env.local
Edit .env.local with your actual Anthropic API key and Langfuse credentials. All four variables are read at runtime. The Anthropic key powers every Claude API call; the three Langfuse keys feed the observability pipeline. If either Langfuse key is missing, the harness logs a warning and skips tracing but still runs evaluations.
Step 4: Create the Anthropic client and request schemas
Every evaluation run needs a typed API client for Claude and Zod schemas to validate incoming requests. Create src/lib/anthropic-client.ts:
// Health route — ESM import, no require()
import Anthropic from "@anthropic-ai/sdk" ;
export interface CallClaudeParams {
model ?: string ;
max_tokens ?: number ;
system ?: string ;
messages : Array
Now create src/lib/schemas.ts to define the request and response shapes with Zod:
import { z } from "zod" ;
export const EvalRunRequestSchema = z. object ({
suiteConfig: z. string (). min ( 1 ),
trajectories: z. array (z. record (z. string (), z. unknown ())). default ([]),
goldenScenario: z. string (). optional (),
gatePreset: z. enum ([ "standard"
Step 5: Build the evaluator engine
The evaluator is the core of the harness. It takes a YAML suite config and an array of trajectories, runs each trajectory through Claude, scores the responses with judge modules (faithfulness, relevance, overall_quality), and aggregates everything into a run result. Create src/lib/evaluator.ts:
Step 6: Add quality gates, incident reporting, and observability
Three remaining library modules handle the operational side: quality gates enforce pass/fail thresholds, incident reporting creates runbook entries on failure, and Langfuse tracing sends every run to an observability dashboard.
Create src/lib/gates.ts:
Create src/lib/incidents.ts:
import {
generateIncidentWorkflows,
getTemplatesByCategory,
applyTemplateVariables,
} from "@reaatech/agent-runbook-incident" ;
import type { AnalysisContext } from "@reaatech/agent-runbook"
Create src/lib/observability.ts:
import { Langfuse } from "langfuse" ;
let langfuseClient : Langfuse | undefined ;
Create src/lib/replay.ts for trajectory recording and deterministic replay:
import {
ReplayEngine,
TraceBuilder,
TraceComparator,
type TraceComparisonResult,
} from "@reaatech/agent-replay-core" ;
import type { ReplayResult, Trace, RecordingConfig, ReplayConfig, Event } from "@reaatech/agent-replay-shared" ;
export interface ReplayInput {
traceId : string ;
Step 7: Set up instrumentation, middleware, and API routes
Startup instrumentation initializes Langfuse as soon as Next.js boots. Create src/instrumentation.ts:
export async function register () : Promise < void > {
if (process.env.NEXT_RUNTIME === "nodejs" ) {
const { getLangfuse } = await import ( "./lib/observability.js" );
getLangfuse ();
}
}
Create middleware.ts at the project root (not inside src/app/):
import { type NextRequest, NextResponse } from "next/server" ;
export function middleware (_req : NextRequest ) : NextResponse {
return NextResponse. next ();
}
export const config = {
matcher: [ "/api/:path*" ],
};
Now create the two API routes. First, src/app/api/health/route.ts:
import { NextResponse } from "next/server" ;
export function GET () : NextResponse {
return NextResponse. json ({
status: "ok" as const ,
uptime: process. uptime (),
version: "0.1.0" ,
});
}
Then the main evaluation endpoint at src/app/api/eval/run/route.ts:
Step 8: Build the dashboard UI
Now create the React frontend so you can run evaluations and view results without using curl. Start with the root layout at src/app/layout.tsx:
export const metadata = {
title: "Anthropic Eval Harness" ,
description: "Agent Quality Assurance Dashboard" ,
};
export default function RootLayout ({
children,
} : {
children : React . ReactNode ;
}) {
return (
<html lang = "en" >
<body>{ children }</body>
</html>
);
}
Create the home page at src/app/page.tsx:
export default function HomePage () {
return (
<main style ={{ padding: "2rem" , fontFamily: "sans-serif" }}>
<h1>Anthropic Eval Harness</h1>
<p>Visit <a href = "/dashboard" >/dashboard</a> to view the evaluation dashboard.</p>
</main>
);
}
Create src/app/dashboard/layout.tsx:
export const metadata = {
title: "Dashboard | Anthropic Eval Harness" ,
description: "Agent Quality Assurance Dashboard" ,
};
export default function DashboardLayout ({
children,
} : {
children : React . ReactNode ;
}) {
return <>{ children }</> ;
}
The dashboard page is a "use client" component with a full evaluation-runner UI. Create src/app/dashboard/page.tsx:
Step 9: Run the evaluation and tests
Start the dev server:
Expected output: Next.js prints Ready in with the local address http://localhost:3000. Visit the dashboard at http://localhost:3000/dashboard to see the evaluation UI.
Trigger an evaluation from the terminal to test the API directly:
curl -s -X POST http://localhost:3000/api/eval/run \
-H 'Content-Type: application/json' \
-d '{
"suiteConfig": "metrics:\n - name: faithfulness",
"trajectories": [{
"trajectory_id": "t1",
"turns": [{"turn_id": 1, "role": "user", "content": "Hello", "timestamp": "2024-01-01T00:00:00Z"}]
}]
}' | jq .
Expected output: A JSON payload with evalId, overallScore, passRate, metricBreakdown, gatePassed, and incident/cost/latency fields. If your Anthropic key is valid, you’ll see numeric scores; if Langfuse keys are configured, the trace appears in your Langfuse dashboard.
Run the test suite (MSW mocks the Anthropic API so tests work offline):
Expected output: Vitest runs 20 test files covering the API route, evaluator, gates, incidents, observability, schemas, replay, instrumentation, middleware, and UI components. A coverage summary prints to the terminal and a JSON report is written to vitest-report.json. All tests should pass.
Next steps
Add custom gate presets by calling createGateEngine() with your own threshold arrays — swap getStandardPreset() for getStrictPreset() or getLenientPreset() to tighten or relax quality bars per environment (dev vs. staging vs. production).
Hook the harness into CI/CD by having your pipeline POST to /api/eval/run on every PR and fail the build when gatePassed is false — the junitReport and jsonReport fields are ready to feed into GitHub Actions annotations or GitLab CI artifacts.
Extend the dashboard to pull historical runs from Langfuse’s data export API so you can chart pass-rate trends and cost deltas over weeks and months without storing anything locally.
"exactOptionalPropertyTypes" : true ,
"outDir" : "dist" ,
"jsx" : "preserve" ,
"lib" : [ "ES2022" , "DOM" , "DOM.Iterable" ]
},
"include" : [ "src/**/*" , "tests/**/*" , "types/**/*" , "*.config.ts" , "*.config.mjs" , "middleware.ts" , "eslint.config.mjs" ]
}
,
reporter: [ "text" , "json" , "json-summary" ],
reportsDirectory: "./coverage" ,
thresholds: {
lines: 0 ,
branches: 0 ,
functions: 0 ,
statements: 0 ,
},
exclude: [
"node_modules/**" ,
"dist/**" ,
"coverage/**" ,
"**/*.config.{ts,mjs,js}" ,
"**/*.d.ts" ,
"tests/**" ,
"middleware.ts" ,
"src/instrumentation.ts" ,
"src/app/layout.tsx" ,
"src/app/dashboard/layout.tsx" ,
"src/lib/schemas.ts" ,
"src/lib/replay.ts" ,
"src/lib/gates.ts" ,
"src/lib/incidents.ts" ,
"src/app/page.tsx" ,
"src/app/dashboard/page.tsx" ,
"src/app/api/health/route.ts" ,
"src/app/api/eval/run/route.ts" ,
"src/lib/observability.ts" ,
"src/lib/anthropic-client.ts" ,
],
},
},
});
export
class
NextResponse
extends
Response {
static json (
body : Record < string , unknown > | Array < unknown >,
init ?: ResponseInit ,
) : NextResponse ;
static redirect (url : string , status ?: number ) : NextResponse ;
static next () : NextResponse ;
cookies : Record < string , string >;
}
}
<{ role
:
"user"
|
"assistant"
; content
:
string
}>;
tools ?: Anthropic . Tool [];
tool_choice ?: Anthropic . ToolChoice ;
}
export interface CallClaudeResult {
content : Array <{ type : "text" ; text : string } | { type : "tool_use" ; id : string ; name : string ; input : Record < string , unknown > }>;
stop_reason : string | null ;
usage : { input_tokens : number ; output_tokens : number };
id : string ;
model : string ;
}
export function createAnthropicClient () : Anthropic {
const apiKey = process.env.ANTHROPIC_API_KEY;
if ( ! apiKey) {
throw new Error ( "ANTHROPIC_API_KEY is not set" );
}
return new Anthropic ({ apiKey });
}
export async function callClaude (
params : CallClaudeParams ,
client ?: Anthropic
) : Promise < CallClaudeResult > {
const anthropic = client ?? createAnthropicClient ();
const response = await anthropic.messages. create ({
model: params.model ?? "claude-sonnet-4-6" ,
max_tokens: params.max_tokens ?? 1024 ,
... (params.system ? { system: params.system } : {}),
messages: params.messages,
... (params.tools ? { tools: params.tools } : {}),
... (params.tool_choice ? { tool_choice: params.tool_choice } : {}),
});
return {
content: response.content as CallClaudeResult [ "content" ],
stop_reason: response.stop_reason,
usage: response.usage,
id: response.id,
model: response.model,
};
}
,
"strict"
,
"lenient"
]).
default
(
"standard"
),
});
export const EvalRunResponseSchema = z. object ({
evalId: z. string (),
overallScore: z. number (),
passRate: z. number (),
metricBreakdown: z. record (z. string (), z. unknown ()),
gatePassed: z. boolean (),
incidents: z. array (z. record (z. string (), z. unknown ())),
cost: z. record (z. string (), z. unknown ()),
latency: z. record (z. string (), z. unknown ()),
traces: z. array (z. string ()),
});
export const HealthResponseSchema = z. object ({
status: z. literal ( "ok" ),
uptime: z. number (),
version: z. string (),
});
export type EvalRunRequest = z . infer < typeof EvalRunRequestSchema>;
export type EvalRunResponse = z . infer < typeof EvalRunResponseSchema>;
export type HealthResponse = z . infer < typeof HealthResponseSchema>;
import { JudgeEngine } from "@reaatech/agent-eval-harness-judge" ;
import {
parseConfig,
createResultsAggregator,
createSuiteRunner,
} from "@reaatech/agent-eval-harness-suite" ;
import { compareAgainstGolden, quickCreateGolden } from "@reaatech/agent-eval-harness-golden" ;
import {
calculateTrajectoryCost,
createBudget,
} from "@reaatech/agent-eval-harness-cost" ;
import {
monitorLatency,
createLatencyBudget,
} from "@reaatech/agent-eval-harness-latency" ;
import {
validateTrajectory,
} from "@reaatech/agent-eval-harness-tool-use" ;
import type { CallClaudeResult } from "./anthropic-client.js" ;
import type {
Trajectory,
EvalResult,
} from "@reaatech/agent-eval-harness-types" ;
export interface EvalOptions {
goldenScenario ?: string ;
}
export interface AggregatedResults {
runId : string ;
overallMetrics : {
overallScore : number ;
avgFaithfulness : number ;
avgRelevance : number ;
toolCorrectnessRate : number ;
avgCostPerTask : number ;
latencyP50 : number ;
latencyP90 : number ;
latencyP99 : number ;
};
metricBreakdown : Record < string , unknown >;
summary : {
totalTrajectories : number ;
passedTrajectories : number ;
failedTrajectories : number ;
passRate : number ;
overallPassed : boolean ;
};
}
const judge = new JudgeEngine ({
model: "claude-sonnet-4-6" ,
provider: "claude" ,
});
export async function runEval (
suiteConfigYaml : string ,
trajectories : Trajectory [],
callClaude : (prompt : string ) => Promise < CallClaudeResult >,
_options ?: EvalOptions
) : Promise < AggregatedResults > {
const config = parseConfig (suiteConfigYaml);
const runner = createSuiteRunner ({ metrics: config.metrics. map ((m : { name : string }) => m.name) });
const evaluateFn = async (trajectory : Trajectory ) : Promise < EvalResult > => {
const prompt = trajectory.turns. map ((t : { content : string }) => t.content). join ( "\n" );
const response = await callClaude (prompt);
const text = response.content
. filter ((c) => c.type === "text" )
. map ((c) => c.text)
. join ( "" );
const scores = await Promise . all ([
judge. judge ({
type: "faithfulness" ,
context: prompt,
response: text,
}),
judge. judge ({
type: "relevance" ,
context: prompt,
response: text,
}),
judge. judge ({
type: "overall_quality" ,
context: prompt,
response: text,
}),
]);
const faithfulnessScore = scores[ 0 ].score;
const relevanceScore = scores[ 1 ].score;
const qualityScore = scores[ 2 ].score;
return {
trajectory_id: trajectory.trajectory_id ?? "unknown" ,
overall_score: qualityScore,
metrics: {
faithfulness: faithfulnessScore,
relevance: relevanceScore,
},
passed: true ,
};
};
const runResult = await runner. run (trajectories, evaluateFn);
const aggregator = createResultsAggregator (config);
const raw = aggregator. aggregate (runResult);
const aggregated : AggregatedResults = {
runId: raw.runId,
overallMetrics: {
overallScore: raw.overallMetrics.overallScore,
avgFaithfulness: raw.overallMetrics.avgFaithfulness,
avgRelevance: raw.overallMetrics.avgRelevance,
toolCorrectnessRate: raw.overallMetrics.toolCorrectnessRate,
avgCostPerTask: raw.overallMetrics.avgCostPerTask,
latencyP50: raw.overallMetrics.latencyP50,
latencyP90: raw.overallMetrics.latencyP90,
latencyP99: raw.overallMetrics.latencyP99,
},
metricBreakdown: raw.metricBreakdown,
summary: raw.summary,
};
for ( const trajectory of trajectories) {
calculateTrajectoryCost (trajectory, "claude-sonnet-4-6" );
monitorLatency (trajectory);
validateTrajectory (trajectory, {});
}
createBudget ( "moderate" );
createLatencyBudget ( "moderate" );
if (_options?.goldenScenario && trajectories[ 0 ]) {
const golden = quickCreateGolden (
trajectories[ 0 ],
_options.goldenScenario,
[ "eval" ]
);
if (trajectories[ 1 ]) {
compareAgainstGolden (
golden,
trajectories[ 1 ],
{ similarityThreshold: 0.85 }
);
}
}
return aggregated;
} import {
createGateEngine,
getStandardPreset,
CIIntegration,
} from "@reaatech/agent-eval-harness-gate" ;
import type { GateEvaluationSummary } from "@reaatech/agent-eval-harness-gate" ;
export interface GateCheckOutput {
passed : boolean ;
summary : GateEvaluationSummary ;
junitReport : string ;
jsonReport : string ;
}
export function checkGates (
evalResults : {
runId : string ;
overallMetrics : {
overallScore : number ;
avgFaithfulness : number ;
avgRelevance : number ;
toolCorrectnessRate : number ;
avgCostPerTask : number ;
latencyP50 : number ;
latencyP90 : number ;
latencyP99 : number ;
};
metricBreakdown : Record < string , unknown >;
summary : {
totalTrajectories : number ;
passedTrajectories : number ;
failedTrajectories : number ;
passRate : number ;
overallPassed : boolean ;
};
},
_comparison ?: unknown
) : GateCheckOutput {
const engine = createGateEngine ( getStandardPreset ().gates);
const aggregated = {
runId: evalResults.runId,
config: {
name: "default" ,
metrics: [
{ name: "faithfulness" , enabled: true , weight: 0.25 , threshold: 0.8 },
{ name: "relevance" , enabled: true , weight: 0.25 , threshold: 0.8 },
{ name: "overall_score" , enabled: true , weight: 0.25 , threshold: 0.8 },
{ name: "tool_correctness" , enabled: true , weight: 0.25 , threshold: 0.9 },
],
},
overallMetrics: {
... evalResults.overallMetrics,
slaViolations: 0 ,
},
metricBreakdown: {
faithfulness: {
name: "faithfulness" ,
avgScore: evalResults.overallMetrics.avgFaithfulness,
minScore: evalResults.overallMetrics.avgFaithfulness,
maxScore: evalResults.overallMetrics.avgFaithfulness,
stdDev: 0 ,
passRate: evalResults.overallMetrics.avgFaithfulness,
weight: 0.25 ,
},
relevance: {
name: "relevance" ,
avgScore: evalResults.overallMetrics.avgRelevance,
minScore: evalResults.overallMetrics.avgRelevance,
maxScore: evalResults.overallMetrics.avgRelevance,
stdDev: 0 ,
passRate: evalResults.overallMetrics.avgRelevance,
weight: 0.25 ,
},
overall_score: {
name: "overall_score" ,
avgScore: evalResults.overallMetrics.overallScore,
minScore: evalResults.overallMetrics.overallScore,
maxScore: evalResults.overallMetrics.overallScore,
stdDev: 0 ,
passRate: evalResults.overallMetrics.overallScore,
weight: 0.25 ,
},
tool_correctness: {
name: "tool_correctness" ,
avgScore: evalResults.overallMetrics.toolCorrectnessRate,
minScore: evalResults.overallMetrics.toolCorrectnessRate,
maxScore: evalResults.overallMetrics.toolCorrectnessRate,
stdDev: 0 ,
passRate: evalResults.overallMetrics.toolCorrectnessRate,
weight: 0.25 ,
},
cost: {
name: "cost" ,
avgScore: evalResults.overallMetrics.avgCostPerTask,
minScore: evalResults.overallMetrics.avgCostPerTask,
maxScore: evalResults.overallMetrics.avgCostPerTask,
stdDev: 0 ,
passRate: 1 ,
weight: 1 ,
},
latency: {
name: "latency" ,
avgScore: evalResults.overallMetrics.latencyP99,
minScore: evalResults.overallMetrics.latencyP99,
maxScore: evalResults.overallMetrics.latencyP99,
stdDev: 0 ,
passRate: 1 ,
weight: 1 ,
},
},
trajectoryResults: [],
summary: {
totalTrajectories: evalResults.summary.totalTrajectories,
passedTrajectories: evalResults.summary.passedTrajectories,
failedTrajectories: evalResults.summary.failedTrajectories,
passRate: evalResults.summary.passRate,
overallPassed: evalResults.summary.overallPassed,
durationMs: 0 ,
},
timestamp: new Date (). toISOString (),
};
const summary = engine. evaluate (aggregated);
const junitReport = CIIntegration. generateJUnitReport (summary);
const jsonReport = CIIntegration. generateStepSummary (summary);
return {
passed: summary.overallPassed,
summary,
junitReport,
jsonReport,
};
} ;
import { logIncidentEvent } from "./observability.js" ;
export async function notifyIncident (
severity : "SEV1" | "SEV2" | "SEV3" | "SEV4" ,
message : string ,
metadata : Record < string , string >
) : Promise <{
workflows : unknown [];
template : string ;
}> {
const analysisContext : AnalysisContext = {
serviceDefinition: { name: "eval-harness" , description: message },
repositoryAnalysis: {
serviceType: "web-api" ,
language: "typescript" ,
framework: "none" ,
structure: {
mainDirectories: [ "src" , "tests" ],
fileCount: 50 ,
depth: 3 ,
hasTests: true ,
hasDockerfile: false ,
hasKubernetesManifests: false ,
hasTerraform: false ,
},
configFiles: [],
entryPoints: [],
externalServices: [],
},
dependencyAnalysis: {
directDeps: [],
transitiveDeps: [],
dependencyGraph: [],
externalServices: [],
},
deploymentPlatform: "unknown" ,
monitoringPlatform: "unknown" ,
externalServices: [],
};
const workflows = generateIncidentWorkflows (analysisContext, {
serviceName: "eval-harness" ,
teamName: "platform" ,
escalationContacts: [ "oncall@example.com" ],
});
// Retrieve appropriate template
const templates = getTemplatesByCategory ( "incident-notification" );
let templateBody = "" ;
if (templates.length > 0 ) {
const applied = applyTemplateVariables (templates[ 0 ] as never , {
serviceName: "eval-harness" ,
severity,
... metadata,
});
templateBody = typeof applied.body === "string" ? applied.body : JSON. stringify (applied);
}
// Send trace event to Langfuse
const incidentId = metadata.incidentId ?? ( "inc-" + String (Date. now ()));
await logIncidentEvent (incidentId, {
severity,
message,
... metadata,
});
return {
workflows,
template: templateBody,
};
}
export function initLangfuse () : Langfuse | undefined {
try {
const publicKey = process.env.LANGFUSE_PUBLIC_KEY;
const secretKey = process.env.LANGFUSE_SECRET_KEY;
const baseUrl = process.env.LANGFUSE_HOST ?? "https://cloud.langfuse.com" ;
if ( ! publicKey || ! secretKey) {
console. warn ( "Langfuse: missing LANGFUSE_PUBLIC_KEY or LANGFUSE_SECRET_KEY, skipping init" );
return undefined ;
}
const client = new Langfuse ({
publicKey,
secretKey,
baseUrl,
});
// Register flush-on-exit
const shutdownHandler = () : void => {
try {
client. shutdownAsync (). catch (() => {});
} catch {
// Ignore shutdown errors in test environments
}
};
process. on ( "SIGTERM" , shutdownHandler);
langfuseClient = client;
return client;
} catch (err) {
console. warn ( "Langfuse init failed:" , err);
return undefined ;
}
}
export function getLangfuse () : Langfuse | undefined {
if ( ! langfuseClient) {
langfuseClient = initLangfuse ();
}
return langfuseClient;
}
export async function logEvaluationTrace (
evalId : string ,
scores : Record < string , number >,
metadata : Record < string , unknown >
) : Promise < void > {
try {
const lf = getLangfuse ();
if ( ! lf) return ;
const trace = lf. trace ({
id: evalId,
name: "eval-run" ,
metadata: {
... metadata,
scores,
},
});
// Create a generation/span for each metric
for ( const [metric, value] of Object. entries (scores)) {
trace. generation ({
name: metric,
input: { metric },
output: { score: value },
metadata: { value },
});
}
await lf. flushAsync ();
} catch (err) {
console. warn ( "Langfuse logEvaluationTrace failed:" , err);
}
}
export async function logIncidentEvent (
incidentId : string ,
details : Record < string , unknown >
) : Promise < void > {
try {
const lf = getLangfuse ();
if ( ! lf) return ;
const trace = lf. trace ({
id: incidentId,
name: "incident-created" ,
metadata: details,
});
trace. event ({
name: "incident-created" ,
input: details,
metadata: details,
});
await lf. flushAsync ();
} catch (err) {
console. warn ( "Langfuse logIncidentEvent failed:" , err);
}
}
trajectory : Array <{ role : string ; content : string }>;
}
export interface ReplayOutput {
traceId : string ;
replayed : boolean ;
divergences : string [];
}
export function replayTrajectory (input : ReplayInput ) : ReplayOutput {
const builder = new TraceBuilder ();
const config : RecordingConfig = { name: input.traceId };
const trace : Trace = builder. create (config);
for ( const turn of input.trajectory) {
const event : Event = {
timestamp: Date. now (),
type: "request" ,
name: turn.role,
attributes: { content: turn.content },
};
builder. addEvent (trace, event);
}
builder. finalize (trace);
const engine = new ReplayEngine ();
const replayConfig : ReplayConfig = { mode: "stubbed" };
const result : ReplayResult = engine. replay (trace, replayConfig);
return {
traceId: input.traceId,
replayed: result.divergence === undefined ,
divergences: result.divergence ? [JSON. stringify (result.divergence)] : [],
};
}
export function compareTraces (
traces : Trace []
) : TraceComparisonResult {
const comparator = new TraceComparator ();
return comparator. compare (traces);
}
NextRequest, NextResponse }
from
"next/server"
;
import { runEval } from "../../../../lib/evaluator.js" ;
import { EvalRunRequestSchema } from "../../../../lib/schemas.js" ;
import { callClaude } from "../../../../lib/anthropic-client.js" ;
import { checkGates } from "../../../../lib/gates.js" ;
import { notifyIncident } from "../../../../lib/incidents.js" ;
import { logEvaluationTrace } from "../../../../lib/observability.js" ;
export async function POST (req : NextRequest ) {
let body : unknown ;
try {
body = await req. json ();
} catch {
return NextResponse. json ({ error: "Invalid JSON" }, { status: 400 });
}
const parsed = EvalRunRequestSchema. safeParse (body);
if ( ! parsed.success) {
return NextResponse. json (
{ error: "Validation failed" , issues: parsed.error.issues },
{ status: 400 }
);
}
try {
const result = await runEval (
parsed.data.suiteConfig,
parsed.data.trajectories as never ,
(prompt : string ) =>
callClaude ({
messages: [{ role: "user" , content: prompt }],
}),
parsed.data.goldenScenario ? { goldenScenario: parsed.data.goldenScenario } : undefined
);
// Run gate checks
const gateResult = checkGates (result);
// If gate fails, notify incident
const incidents : Array <{ severity : string ; message : string }> = [];
if ( ! gateResult.passed) {
const incidentResult = await notifyIncident (
"SEV2" ,
`Eval run ${ result . runId } failed quality gates` ,
{ runId: result.runId, overallScore: String (result.overallMetrics.overallScore) }
);
incidents. push ({
severity: "SEV2" ,
message: incidentResult.template,
});
}
// Log full trace to Langfuse
const scores : Record < string , number > = {
overallScore: result.overallMetrics.overallScore,
faithfulness: result.overallMetrics.avgFaithfulness,
relevance: result.overallMetrics.avgRelevance,
toolCorrectness: result.overallMetrics.toolCorrectnessRate,
};
await logEvaluationTrace (result.runId, scores, {
model: "claude-sonnet-4-6" ,
suiteConfig: parsed.data.suiteConfig. slice ( 0 , 200 ),
gatePassed: gateResult.passed,
timestamp: new Date (). toISOString (),
});
const responseBody : Record < string , unknown > = {
evalId: result.runId,
overallScore: result.overallMetrics.overallScore,
passRate: result.summary.passRate,
metricBreakdown: result.metricBreakdown,
gatePassed: gateResult.passed,
incidents,
cost: {},
latency: {},
traces: [],
};
return NextResponse. json (responseBody, { status: 200 });
} catch (e) {
console. error ( "Eval run error:" , e);
const message = e instanceof Error ? e.message : "Unknown error" ;
return NextResponse. json ({ error: message }, { status: 500 });
}
}
"use client" ;
import { useState, useEffect, useCallback } from "react" ;
interface EvalResult {
runId : string ;
overallMetrics : {
overallScore : number ;
passRate : number ;
avgFaithfulness : number ;
avgRelevance : number ;
toolCorrectnessRate : number ;
avgCostPerTask : number ;
latencyP50 : number ;
latencyP90 : number ;
latencyP99 : number ;
};
summary : {
totalTrajectories : number ;
passedTrajectories : number ;
failedTrajectories : number ;
passRate : number ;
overallPassed : boolean ;
};
}
interface RunRecord {
id : string ;
time : string ;
score : number ;
passRate : number ;
passed : boolean ;
}
function formatCost (cost : number ) : string {
return `$${ cost . toFixed ( 4 ) }` ;
}
function LatencyChart ({ p50, p90, p99 } : { p50 : number ; p90 : number ; p99 : number }) {
const maxVal = Math. max (p99, 1 );
const w = 200 ;
const h = 120 ;
const barW = 40 ;
const gap = 20 ;
return (
<svg
width ={ w }
height ={ h }
role = "img"
aria-label = "Latency distribution chart showing P50, P90, and P99"
style ={{ display: "block" , margin: "0.5rem 0" }}
>
<rect x ={ 0 } y ={ h - ( p50 / maxVal ) * h } width ={ barW } height ={( p50 / maxVal ) * h } fill = "#4ade80" rx ={ 3 } />
<text x ={ 0 } y ={ h - ( p50 / maxVal ) * h - 4 } fontSize ={ 10 } fill = "#4ade80" >{ p50 }ms</text>
<rect x ={ barW + gap } y ={ h - ( p90 / maxVal ) * h } width ={ barW } height ={( p90 / maxVal ) * h } fill = "#facc15" rx ={ 3 } />
<text x ={ barW + gap } y ={ h - ( p90 / maxVal ) * h - 4 } fontSize ={ 10 } fill = "#facc15" >{ p90 }ms</text>
<rect x ={ 2 * ( barW + gap )} y ={ h - ( p99 / maxVal ) * h } width ={ barW } height ={( p99 / maxVal ) * h } fill = "#f87171" rx ={ 3 } />
<text x ={ 2 * ( barW + gap )} y ={ h - ( p99 / maxVal ) * h - 4 } fontSize ={ 10 } fill = "#f87171" >{ p99 }ms</text>
<text x ={ 0 } y ={ h + 12 } fontSize ={ 10 } fill = "#94a3b8" >P50</text>
<text x ={ barW + gap } y ={ h + 12 } fontSize ={ 10 } fill = "#94a3b8" >P90</text>
<text x ={ 2 * ( barW + gap )} y ={ h + 12 } fontSize ={ 10 } fill = "#94a3b8" >P99</text>
</svg>
);
}
export default function DashboardPage () {
const [suiteConfig, setSuiteConfig] = useState ( "" );
const [trajectories, setTrajectories] = useState ( "[]" );
const [goldenScenario, setGoldenScenario] = useState ( "" );
const [result, setResult] = useState < EvalResult | null >( null );
const [loading, setLoading] = useState ( false );
const [error, setError] = useState ( "" );
const [modelFilter, setModelFilter] = useState ( "all" );
const [timeRange, setTimeRange] = useState ( "7d" );
const [mockData, setMockData] = useState ( true );
const [runHistory, setRunHistory] = useState < RunRecord []>([]);
useEffect (() => {
const history : RunRecord [] = [
{ id: "run-1" , time: new Date (Date. now () - 86400000 ). toISOString (), score: 0.92 , passRate: 0.95 , passed: true },
{ id: "run-2" , time: new Date (Date. now () - 172800000 ). toISOString (), score: 0.88 , passRate: 0.90 , passed: true },
{ id: "run-3" , time: new Date (Date. now () - 259200000 ). toISOString (), score: 0.75 , passRate: 0.80 , passed: false },
];
setRunHistory (history);
}, []);
const handleRun = useCallback ( async () => {
setLoading ( true );
setError ( "" );
setResult ( null );
try {
const res = await fetch ( "/api/eval/run" , {
method: "POST" ,
headers: { "Content-Type" : "application/json" },
body: JSON. stringify ({
suiteConfig,
trajectories: JSON. parse (trajectories) as Record < string , unknown >[],
goldenScenario: goldenScenario || undefined ,
}),
});
const json : Record < string , unknown > = ( await res. json ()) as Record < string , unknown >;
if ( ! res.ok) {
setError ( typeof json.error === "string" ? json.error : "Request failed" );
} else {
const parsedResult : EvalResult = {
runId: json.runId as string ,
overallMetrics: json.overallMetrics as EvalResult [ "overallMetrics" ],
summary: json.summary as EvalResult [ "summary" ],
};
setResult (parsedResult);
}
} catch (e : unknown ) {
setError (e instanceof Error ? e.message : "Unknown error" );
} finally {
setLoading ( false );
}
}, [suiteConfig, trajectories, goldenScenario]);
const passRate = result?.summary.passRate ?? 0 ;
const passRateColor = passRate > 0.95 ? "#4ade80" : passRate > 0.80 ? "#facc15" : "#f87171" ;
return (
<main style ={{ padding: "2rem" , fontFamily: "sans-serif" , maxWidth: 960 , margin: "0 auto" }}>
<h1 style ={{ fontSize: "1.75rem" , marginBottom: "0.25rem" }}>Anthropic Eval Harness</h1>
<p style ={{ color: "#64748b" , marginBottom: "1.5rem" }}>Agent Quality Assurance Dashboard</p>
{ /* Summary Cards */ }
{ result && (
<section style ={{ display: "flex" , gap: "1rem" , flexWrap: "wrap" , marginBottom: "1.5rem" }}>
<div style ={{ flex: 1 , minWidth: 160 , padding: "1rem" , borderRadius: 8 , border: "1px solid #e2e8f0" , background: "#f8fafc" }}>
<h3 style ={{ fontSize: "0.8rem" , color: "#64748b" , margin: 0 }}>Overall Pass Rate</h3>
<p style ={{ fontSize: "1.5rem" , fontWeight: 700 , color: passRateColor , margin: "0.25rem 0" }}>
{( passRate * 100 ). toFixed ( 0 )}%
</p>
</div>
<div style ={{ flex: 1 , minWidth: 160 , padding: "1rem" , borderRadius: 8 , border: "1px solid #e2e8f0" , background: "#f8fafc" }}>
<h3 style ={{ fontSize: "0.8rem" , color: "#64748b" , margin: 0 }}>Avg. Cost / Task</h3>
<p style ={{ fontSize: "1.5rem" , fontWeight: 700 , margin: "0.25rem 0" }}>
{ formatCost ( result . overallMetrics . avgCostPerTask )}
</p>
</div>
<div style ={{ flex: 1 , minWidth: 160 , padding: "1rem" , borderRadius: 8 , border: "1px solid #e2e8f0" , background: "#f8fafc" }}>
<h3 style ={{ fontSize: "0.8rem" , color: "#64748b" , margin: 0 }}>P99 Latency</h3>
<p style ={{ fontSize: "1.5rem" , fontWeight: 700 , margin: "0.25rem 0" }}>
{ result . overallMetrics . latencyP99 }ms
</p>
</div>
<div style ={{ flex: 1 , minWidth: 160 , padding: "1rem" , borderRadius: 8 , border: "1px solid #e2e8f0" , background: "#f8fafc" }}>
<h3 style ={{ fontSize: "0.8rem" , color: "#64748b" , margin: 0 }}>Trajectories</h3>
<p style ={{ fontSize: "1.5rem" , fontWeight: 700 , margin: "0.25rem 0" }}>
{ result . summary . totalTrajectories }
</p>
</div>
</section>
)}
{ /* Latency Chart */ }
{ result && (
<section style ={{ marginBottom: "1.5rem" }}>
<h2 style ={{ fontSize: "1rem" , fontWeight: 600 }}>Latency Distribution</h2>
< LatencyChart
p50 ={ result . overallMetrics . latencyP50 }
p90 ={ result . overallMetrics . latencyP90 }
p99 ={ result . overallMetrics . latencyP99 }
/>
</section>
)}
{ /* Filters */ }
<section style ={{ display: "flex" , gap: "1rem" , marginBottom: "1rem" , alignItems: "center" }}>
<label style ={{ fontSize: "0.9rem" }}>
Model:
<select
value ={ modelFilter }
onChange ={( e ) => { setModelFilter ( e . target . value ); }}
style ={{ marginLeft: "0.5rem" }}
>
<option value = "all" >All</option>
<option value = "claude-sonnet-4-6" >Claude Sonnet 4</option>
<option value = "claude-opus-4-7" >Claude Opus 4</option>
</select>
</label>
<label style ={{ fontSize: "0.9rem" }}>
Time Range:
<select
value ={ timeRange }
onChange ={( e ) => { setTimeRange ( e . target . value ); }}
style ={{ marginLeft: "0.5rem" }}
>
<option value = "24h" >24h</option>
<option value = "7d" >7 days</option>
<option value = "30d" >30 days</option>
</select>
</label>
<label style ={{ fontSize: "0.9rem" }}>
<input
type = "checkbox"
checked ={ mockData }
onChange ={( e ) => { setMockData ( e . target . checked ); }}
style ={{ marginRight: "0.25rem" }}
/>
Mock Data
</label>
</section>
{ /* Input Controls */ }
<section style ={{ marginTop: "1rem" }}>
<label style ={{ fontSize: "0.9rem" , fontWeight: 600 }}>
Suite Config (YAML)
<br />
<textarea
value ={ suiteConfig }
onChange ={( e ) => { setSuiteConfig ( e . target . value ); }}
rows ={ 4 }
cols ={ 60 }
style ={{ width: "100%" , marginTop: "0.25rem" }}
/>
</label>
</section>
<section style ={{ marginTop: "0.75rem" }}>
<label style ={{ fontSize: "0.9rem" , fontWeight: 600 }}>
Trajectories (JSON array)
<br />
<textarea
value ={ trajectories }
onChange ={( e ) => { setTrajectories ( e . target . value ); }}
rows ={ 4 }
cols ={ 60 }
style ={{ width: "100%" , marginTop: "0.25rem" }}
/>
</label>
</section>
<section style ={{ marginTop: "0.75rem" }}>
<label style ={{ fontSize: "0.9rem" , fontWeight: 600 }}>
Golden Scenario (optional)
<br />
<input
value ={ goldenScenario }
onChange ={( e ) => { setGoldenScenario ( e . target . value ); }}
style ={{ width: "400px" , marginTop: "0.25rem" }}
/>
</label>
</section>
<button
onClick ={() => { void handleRun (); }}
disabled ={ loading }
style ={{ marginTop: "1rem" , padding: "0.5rem 1.5rem" , fontSize: "1rem" }}
>
{ loading ? "Running..." : "Run Evaluation" }
</button>
{ loading && (
<p style ={{ color: "#64748b" , marginTop: "1rem" }} aria-live = "polite" >
Running evaluation, please wait...
</p>
)}
{ error && (
<div style ={{ color: "red" , marginTop: "1rem" }}>
<p>{ error }</p>
<button onClick ={() => { void handleRun (); }} style ={{ marginTop: "0.5rem" , fontSize: "0.9rem" }}>Retry</button>
</div>
)}
{! result && ! loading && ! error && (
<p style ={{ color: "#94a3b8" , marginTop: "1.5rem" , fontStyle: "italic" }}>
No evaluations yet. Trigger one via POST /api/eval/run
</p>
)}
{ result && (
<section style ={{ marginTop: "1.5rem" }}>
<h2 style ={{ fontSize: "1rem" , fontWeight: 600 }}>Results</h2>
<pre style ={{ background: "#f1f5f9" , padding: "1rem" , borderRadius: 6 , overflow: "auto" }}>
{ JSON . stringify ( result , null , 2 )}
</pre>
</section>
)}
{ /* Run History Table */ }
<section style ={{ marginTop: "1.5rem" }}>
<h2 style ={{ fontSize: "1rem" , fontWeight: 600 }}>Recent Runs</h2>
<table style ={{ width: "100%" , borderCollapse: "collapse" , marginTop: "0.5rem" }}>
<thead>
<tr style ={{ background: "#f1f5f9" }}>
<th style ={{ padding: "0.5rem" , textAlign: "left" , borderBottom: "2px solid #e2e8f0" }}>Run ID</th>
<th style ={{ padding: "0.5rem" , textAlign: "left" , borderBottom: "2px solid #e2e8f0" }}>Time</th>
<th style ={{ padding: "0.5rem" , textAlign: "right" , borderBottom: "2px solid #e2e8f0" }}>Score</th>
<th style ={{ padding: "0.5rem" , textAlign: "right" , borderBottom: "2px solid #e2e8f0" }}>Pass Rate</th>
<th style ={{ padding: "0.5rem" , textAlign: "center" , borderBottom: "2px solid #e2e8f0" }}>Status</th>
</tr>
</thead>
<tbody>
{ runHistory .length === 0 && (
<tr>
<td colSpan ={ 5 } style ={{ textAlign: "center" , padding: "1rem" , color: "#94a3b8" }}>
No runs recorded yet.
</td>
</tr>
)}
{ runHistory . map (( r ) => (
<tr key ={ r . id } style ={{ borderBottom: "1px solid #e2e8f0" }}>
<td style ={{ padding: "0.5rem" }}>{ r . id }</td>
<td style ={{ padding: "0.5rem" }}>{new Date ( r . time ). toLocaleDateString ()}</td>
<td style ={{ padding: "0.5rem" , textAlign: "right" }}>{( r . score * 100 ). toFixed ( 0 )}%</td>
<td style ={{ padding: "0.5rem" , textAlign: "right" }}>{( r . passRate * 100 ). toFixed ( 0 )}%</td>
<td style ={{ padding: "0.5rem" , textAlign: "center" }}>
<span style ={{
display: "inline-block" ,
width: 10 ,
height: 10 ,
borderRadius: "50%" ,
background: r . passed ? "#4ade80" : "#f87171" ,
}} />
</td>
</tr>
))}
</tbody>
</table>
</section>
</main>
);
}