Anthropic Eval Harness for Agent Quality Assurance

Intro

This tutorial walks you through building a regression-testing harness for Anthropic-powered AI agents. By the end, you’ll have a Next.js application that accepts evaluation run triggers over a REST API, calls Claude to process test trajectories, scores the responses with LLM-as-a-judge modules, enforces quality gates, logs incidents on failure, and exports traces to Langfuse for an observability dashboard. You’ll create every source file from scratch, and each listing can be copied and pasted as you go.

Prerequisites

Node.js >= 22 and pnpm 10.x (the project pins "packageManager": "pnpm@10.0.0")
An Anthropic API key (get one at console.anthropic.com)
A Langfuse account for the observability dashboard (sign up at langfuse.com to get public/secret keys)
Familiarity with TypeScript, Next.js App Router, and REST APIs

Step 1: Scaffold the Next.js project

Create a new directory and set up the project configuration files: TypeScript, Next.js, and Vitest.

Create package.json:

json

{
  "name": "anthropic-eval-harness",
  "version": "0.1.0",
  "private": true,
  "type": "module",
  "engines": {
    "node": ">=22"
  },
  "packageManager": "pnpm@10.0.0",
  "scripts": {
    "typecheck": "tsc --noEmit",
    "lint": "eslint .",
    "test": "vitest run --coverage --reporter=json --outputFile=vitest-report.json"
  }
}

Create tsconfig.json:

json

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "strict": true,
    "esModuleInterop": true,
    "forceConsistentCasingInFileNames": true,
    "skipLibCheck": true,
    "resolveJsonModule": true,
    "isolatedModules": true,
    "noUncheckedIndexedAccess": true,

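Create next.config.ts. The experimental.instrumentationHook flag opts in to the instrumentation hook that initializes Langfuse on startup. Instrumentation became stable in Next.js 15, so with the next@15.1.7 version pinned in Step 2 the flag should be redundant, and the as NextConfig cast is there to keep the type checker from rejecting it: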

import type { NextConfig } from "next";
 
const nextConfig: NextConfig = {
  experimental: {
    instrumentationHook: true,
  },
} as NextConfig;
 
export default nextConfig;

Create vitest.config.ts:

import { defineConfig } from "vitest/config";
 
export default defineConfig({
  esbuild: {
    jsx: "automatic",
  },
  test: {
    globals: true,
    environment: "node",
    setupFiles: ["./tests/setup.ts"],
    silent: false,
    coverage: {
      provider: "v8"

Finally, create types/next-server.d.ts to provide type declarations for NextRequest and NextResponse:

declare module "next/server" {
  export class NextRequest extends Request {
    constructor(input: RequestInfo | URL, init?: RequestInit);
    json(): Promise<unknown>;
    cookies: Record<string, string>;
    ip?: string;
    geo?: Record<string, string>;
  }

Step 2: Install dependencies

Run this command to install all runtime and dev dependencies in one shot:

terminal

pnpm add \
  @anthropic-ai/sdk@0.95.2 \
  @reaatech/agent-eval-harness-cost@0.1.0 \
  @reaatech/agent-eval-harness-gate@0.1.0 \
  @reaatech/agent-eval-harness-golden@0.1.0 \
  @reaatech/agent-eval-harness-judge@0.1.0 \
  @reaatech/agent-eval-harness-latency@0.1.0 \
  @reaatech/agent-eval-harness-suite@0.1.0 \
  @reaatech/agent-eval-harness-tool-use@0.1.0 \
  @reaatech/agent-replay-core@0.1.0 \
  @reaatech/agent-runbook-incident@0.1.0 \
  langfuse@3.38.20 \
  next@15.1.7 \
  react@19.0.0 \
  react-dom@19.0.0 \
  zod@4.4.3 \
  && pnpm add -D \
  @testing-library/jest-dom@6.6.3 \
  @testing-library/react@16.2.0 \
  @testing-library/user-event@14.6.1 \
  @types/node@22.15.3 \
  @types/react@19.0.8 \
  @types/react-dom@19.0.3 \
  @vitest/coverage-istanbul@3.0.5 \
  @vitest/coverage-v8@3.0.5 \
  eslint@9.20.1 \
  jsdom@29.1.1 \
  msw@2.7.0 \
  typescript@5.7.3 \
  typescript-eslint@8.24.0 \
  vitest@3.0.5

Your package.json should now have 15 runtime dependencies (the Anthropic SDK, nine @reaatech packages, Langfuse, Next.js, React, React DOM, and Zod) plus 14 dev dependencies (including Vitest and its coverage providers, MSW for HTTP mocking, Testing Library, ESLint, and TypeScript).

Step 3: Configure environment variables

Create .env.example as a template, then copy it to .env.local and fill in your credentials.

Create .env.example:

env

ANTHROPIC_API_KEY=<your-anthropic-key>
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_HOST=https://cloud.langfuse.com

Now copy it and add your keys:

terminal

cp .env.example .env.local

Edit .env.local with your actual Anthropic API key and Langfuse credentials. All four variables are read at runtime: the Anthropic key powers every Claude API call, and the three Langfuse variables (public key, secret key, and host) feed the observability pipeline. If either Langfuse key is missing, the harness logs a warning and skips tracing but still runs evaluations.
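The fallback described above can be sketched as a small pure function. This is an illustration only: resolveTracingConfig and TracingConfig are hypothetical names (not part of the Langfuse SDK), and defaulting the host to the value from .env.example is an assumption.

```typescript
// Hypothetical helper: decides from environment variables whether tracing can run.
export interface TracingConfig {
  enabled: boolean;
  host: string;
  publicKey?: string;
  secretKey?: string;
  warning?: string;
}

// Assumption: fall back to the host value shown in .env.example.
const DEFAULT_LANGFUSE_HOST = "https://cloud.langfuse.com";

export function resolveTracingConfig(
  env: Record<string, string | undefined>,
): TracingConfig {
  const publicKey = env["LANGFUSE_PUBLIC_KEY"];
  const secretKey = env["LANGFUSE_SECRET_KEY"];
  const host = env["LANGFUSE_HOST"] || DEFAULT_LANGFUSE_HOST;

  // Either key missing: disable tracing but let evaluations continue.
  if (!publicKey || !secretKey) {
    return {
      enabled: false,
      host,
      warning: "Langfuse keys missing; tracing is disabled, evaluations still run.",
    };
  }
  return { enabled: true, host, publicKey, secretKey };
}
```

A caller would typically pass process.env, log the warning once if it is present, and only construct a tracing client when enabled is true.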

Step 4: Create the Anthropic client and request schemas

Every evaluation run needs a typed API client for Claude and Zod schemas to validate incoming requests. Create src/lib/anthropic-client.ts:

// Anthropic client: ESM import, no require()
import Anthropic from "@anthropic-ai/sdk";
 
export interface CallClaudeParams {
  model?: string;
  max_tokens?: number;
  system?: string;
  messages: Array
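As a rough sketch of how a wrapper around the Messages API might apply defaults and extract text, consider the following. It is not the tutorial's actual module: the element type of messages is not shown above, so ChatMessage is an assumption, as are the MessagesApi interface (a minimal stand-in modeled loosely on the SDK's messages.create call), the placeholder model ID, and the default token limit.

```typescript
// Assumed message shape; the original element type is not shown.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

interface CallClaudeParams {
  model?: string;
  max_tokens?: number;
  system?: string;
  messages: ChatMessage[];
}

interface ContentBlock {
  type: string;
  text?: string;
}

// Minimal stand-in for the part of the client this sketch needs (hypothetical).
interface MessagesApi {
  create(body: {
    model: string;
    max_tokens: number;
    system?: string;
    messages: ChatMessage[];
  }): Promise<{ content: ContentBlock[] }>;
}

const DEFAULT_MODEL = "claude-model-id-placeholder"; // assumption: replace with a real model ID
const DEFAULT_MAX_TOKENS = 1024; // assumption

export async function callClaude(
  api: MessagesApi,
  params: CallClaudeParams,
): Promise<string> {
  const response = await api.create({
    model: params.model ?? DEFAULT_MODEL,
    max_tokens: params.max_tokens ?? DEFAULT_MAX_TOKENS,
    // Only send a system prompt when one was provided.
    ...(params.system !== undefined ? { system: params.system } : {}),
    messages: params.messages,
  });
  // Concatenate text blocks and ignore any other block types.
  return response.content
    .filter((block) => block.type === "text")
    .map((block) => block.text ?? "")
    .join("");
}
```

Injecting the API object keeps the function easy to test with a fake; in the application you would pass something shaped like the SDK client's messages property.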

Now create src/lib/schemas.ts to define the request and response shapes with Zod:

import { z } from "zod";
 
export const EvalRunRequestSchema = z.object({
  suiteConfig: z.string().min(1),
  trajectories: z.array(z.record(z.string(), z.unknown())).default([]),
  goldenScenario: z.string().optional(),
  gatePreset: z.enum(["standard"

Step 5: Build the evaluator engine

The evaluator is the core of the harness. It takes a YAML suite config and an array of trajectories, runs each trajectory through Claude, scores the responses with judge modules (faithfulness, relevance, overall_quality), and aggregates everything into a run result. Create src/lib/evaluator.ts:
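To make the aggregation step concrete, here is a minimal sketch of how per-trajectory judge scores could be rolled up into a run result. Everything in it is an assumption for illustration: the TrajectoryScore and RunSummary types, the aggregate function, scores normalized to the range 0 to 1, and the 0.7 pass threshold are not taken from the @reaatech packages.

```typescript
type MetricName = "faithfulness" | "relevance" | "overall_quality";

// Hypothetical per-trajectory judge output, each score assumed to lie in [0, 1].
interface TrajectoryScore {
  trajectoryId: string;
  scores: Record<MetricName, number>;
}

interface RunSummary {
  overallScore: number;
  passRate: number;
  metricBreakdown: Record<MetricName, number>;
}

const METRICS: MetricName[] = ["faithfulness", "relevance", "overall_quality"];

function mean(values: number[]): number {
  return values.length === 0
    ? 0
    : values.reduce((sum, v) => sum + v, 0) / values.length;
}

export function aggregate(
  results: TrajectoryScore[],
  passThreshold = 0.7, // assumption: a trajectory passes when its mean score reaches this
): RunSummary {
  // Average each metric across all trajectories.
  const metricBreakdown = {} as Record<MetricName, number>;
  for (const metric of METRICS) {
    metricBreakdown[metric] = mean(results.map((r) => r.scores[metric]));
  }
  // Average the metrics within each trajectory, then across trajectories.
  const perTrajectory = results.map((r) => mean(METRICS.map((m) => r.scores[m])));
  const passed = perTrajectory.filter((score) => score >= passThreshold).length;
  return {
    overallScore: mean(perTrajectory),
    passRate: results.length === 0 ? 0 : passed / results.length,
    metricBreakdown,
  };
}
```

An unweighted mean is the simplest choice; a real suite config might weight metrics differently or treat an empty run as an error rather than a score of zero.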

Step 6: Add quality gates, incident reporting, and observability

Three remaining library modules handle the operational side: quality gates enforce pass/fail thresholds, incident reporting creates runbook entries on failure, and Langfuse tracing sends every run to an observability dashboard.

Create src/lib/gates.ts:
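The idea behind a threshold gate can be sketched without any library. The code below is not the API of @reaatech/agent-eval-harness-gate; GateThreshold, GateFailure, GateResult, and checkGates are hypothetical names, and treating a missing metric as a failure is an assumption.

```typescript
// Hypothetical threshold definition: a metric must reach at least `min`.
interface GateThreshold {
  metric: string;
  min: number;
}

interface GateFailure {
  metric: string;
  actual: number | undefined;
  min: number;
}

interface GateResult {
  gatePassed: boolean;
  failures: GateFailure[];
}

export function checkGates(
  scores: Record<string, number>,
  thresholds: GateThreshold[],
): GateResult {
  const failures = thresholds
    .map((t): GateFailure => ({ metric: t.metric, actual: scores[t.metric], min: t.min }))
    // A metric fails when it is missing or below its minimum.
    .filter((f) => f.actual === undefined || f.actual < f.min);
  return { gatePassed: failures.length === 0, failures };
}
```

For example, checkGates({ faithfulness: 0.9 }, [{ metric: "faithfulness", min: 0.8 }]) reports gatePassed as true, while lowering the score below 0.8 adds an entry to failures that an incident report could then reference.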

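Create src/lib/incidents.ts. Note that the type import below comes from @reaatech/agent-runbook, which the Step 2 command does not install directly; pnpm does not expose transitive dependencies to project code by default, so add that package explicitly if the import fails to resolve: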

import {
  generateIncidentWorkflows,
  getTemplatesByCategory,
  applyTemplateVariables,
} from "@reaatech/agent-runbook-incident";
import type { AnalysisContext } from "@reaatech/agent-runbook"

Create src/lib/observability.ts:

import { Langfuse } from "langfuse";
 
let langfuseClient: Langfuse | undefined;

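Create src/lib/replay.ts for trajectory recording and deterministic replay. The type import from @reaatech/agent-replay-shared refers to a package that the Step 2 command does not install directly, so add it explicitly if pnpm cannot resolve it: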

import {
  ReplayEngine,
  TraceBuilder,
  TraceComparator,
  type TraceComparisonResult,
} from "@reaatech/agent-replay-core";
import type { ReplayResult, Trace, RecordingConfig, ReplayConfig, Event } from "@reaatech/agent-replay-shared";
 
export interface ReplayInput {
  traceId: string;
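To illustrate what a deterministic replay check does conceptually, the sketch below compares a recorded event sequence with a replayed one and reports the first point where they diverge. It does not use ReplayEngine or TraceComparator, whose APIs are not shown here; TraceEvent, Divergence, and firstDivergence are hypothetical names.

```typescript
// Hypothetical event shape; the real trace types live in the replay packages.
interface TraceEvent {
  type: string;
  payload: unknown;
}

interface Divergence {
  index: number;
  expected: TraceEvent | undefined;
  actual: TraceEvent | undefined;
}

export function firstDivergence(
  recorded: TraceEvent[],
  replayed: TraceEvent[],
): Divergence | null {
  const length = Math.max(recorded.length, replayed.length);
  for (let i = 0; i < length; i++) {
    const expected: TraceEvent | undefined = recorded[i];
    const actual: TraceEvent | undefined = replayed[i];
    // JSON comparison is sensitive to key order, which is acceptable for a sketch.
    const same =
      expected !== undefined &&
      actual !== undefined &&
      expected.type === actual.type &&
      JSON.stringify(expected.payload) === JSON.stringify(actual.payload);
    if (!same) {
      return { index: i, expected, actual };
    }
  }
  return null; // the replay matched the recording event for event
}
```

A null result means the replay was deterministic with respect to the recording; a Divergence pinpoints the first event to inspect, including the case where one trace is shorter than the other.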

Step 7: Set up instrumentation, middleware, and API routes

Startup instrumentation initializes Langfuse as soon as Next.js boots. Create src/instrumentation.ts:

export async function register(): Promise<void> {
  if (process.env.NEXT_RUNTIME === "nodejs") {
    const { getLangfuse } = await import("./lib/observability.js");
    getLangfuse();
  }
}

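Create middleware.ts. Next.js looks for this file at the same level as the app directory, so in this src/-based layout it belongs at src/middleware.ts (not inside src/app/):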

import { type NextRequest, NextResponse } from "next/server";
 
export function middleware(_req: NextRequest): NextResponse {
  return NextResponse.next();
}
 
export const config = {
  matcher: ["/api/:path*"],
};

Now create the two API routes. First, src/app/api/health/route.ts:

import { NextResponse } from "next/server";
 
export function GET(): NextResponse {
  return NextResponse.json({
    status: "ok" as const,
    uptime: process.uptime(),
    version: "0.1.0",
  });
}

Then the main evaluation endpoint at src/app/api/eval/run/route.ts:

import { type
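As a hedged sketch of the request-handling flow for this endpoint, the handler below parses the body, validates the two fields visible in EvalRunRequestSchema (suiteConfig and trajectories), and returns a 400 response on bad input. It uses the standard Request and Response types, which App Router route handlers accept, and hand-written checks in place of Zod so that it runs without dependencies. The parseEvalRunRequest and jsonResponse helpers and the trajectoryCount field are placeholders, not the tutorial's actual route.

```typescript
interface EvalRunRequest {
  suiteConfig: string;
  trajectories: Array<Record<string, unknown>>;
}

// Hand-written stand-in for schema validation (the tutorial itself uses Zod).
function parseEvalRunRequest(body: unknown): EvalRunRequest | null {
  if (typeof body !== "object" || body === null) return null;
  const record = body as Record<string, unknown>;
  const suiteConfig = record["suiteConfig"];
  if (typeof suiteConfig !== "string" || suiteConfig.length === 0) return null;
  // Mirror the schema default: a missing trajectories field becomes [].
  const rawTrajectories: unknown =
    record["trajectories"] === undefined ? [] : record["trajectories"];
  if (!Array.isArray(rawTrajectories)) return null;
  const allObjects = rawTrajectories.every(
    (t) => typeof t === "object" && t !== null && !Array.isArray(t),
  );
  if (!allObjects) return null;
  return {
    suiteConfig,
    trajectories: rawTrajectories as Array<Record<string, unknown>>,
  };
}

function jsonResponse(data: unknown, status: number): Response {
  return new Response(JSON.stringify(data), {
    status,
    headers: { "Content-Type": "application/json" },
  });
}

export async function POST(req: Request): Promise<Response> {
  let body: unknown;
  try {
    body = await req.json();
  } catch {
    return jsonResponse({ error: "Request body must be valid JSON" }, 400);
  }
  const parsed = parseEvalRunRequest(body);
  if (!parsed) {
    return jsonResponse({ error: "Invalid evaluation request" }, 400);
  }
  // Placeholder: the real route would run the evaluator, gates, and tracing here.
  return jsonResponse({ trajectoryCount: parsed.trajectories.length }, 200);
}
```

Rejecting malformed input before any Claude call keeps invalid requests from consuming API quota.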

Step 8: Build the dashboard UI

Now create the React frontend so you can run evaluations and view results without using curl. Start with the root layout at src/app/layout.tsx:

tsx

export const metadata = {
  title: "Anthropic Eval Harness",
  description: "Agent Quality Assurance Dashboard",
};
 
export default function RootLayout({
  children,
}: {
  children: React.ReactNode;
}) {
  return (
    <html lang="en">
      <body>{children}</body>
    </html>
  );
}

Create the home page at src/app/page.tsx:

tsx

export default function HomePage() {
  return (
    <main style={{ padding: "2rem", fontFamily: "sans-serif" }}>
      <h1>Anthropic Eval Harness</h1>
      <p>Visit <a href="/dashboard">/dashboard</a> to view the evaluation dashboard.</p>
    </main>
  );
}

Create src/app/dashboard/layout.tsx:

tsx

export const metadata = {
  title: "Dashboard | Anthropic Eval Harness",
  description: "Agent Quality Assurance Dashboard",
};
 
export default function DashboardLayout({
  children,
}: {
  children: React.ReactNode;
}) {
  return <>{children}</>;
}

The dashboard page is a "use client" component with a full evaluation-runner UI. Create src/app/dashboard/page.tsx:

Step 9: Run the evaluation and tests

Start the dev server:

terminal

npx next dev

Expected output: Next.js prints a "Ready in ..." line along with the local address http://localhost:3000. Visit the dashboard at http://localhost:3000/dashboard to see the evaluation UI.

Trigger an evaluation from the terminal to test the API directly:

terminal

curl -s -X POST http://localhost:3000/api/eval/run \
  -H 'Content-Type: application/json' \
  -d '{
    "suiteConfig": "metrics:\n  - name: faithfulness",
    "trajectories": [{
      "trajectory_id": "t1",
      "turns": [{"turn_id": 1, "role": "user", "content": "Hello", "timestamp": "2024-01-01T00:00:00Z"}]
    }]
  }' | jq .

Expected output: A JSON payload with evalId, overallScore, passRate, metricBreakdown, gatePassed, and incident/cost/latency fields. If your Anthropic key is valid, you’ll see numeric scores; if Langfuse keys are configured, the trace appears in your Langfuse dashboard.

Run the test suite (MSW mocks the Anthropic API so tests work offline):

terminal

pnpm test

Expected output: Vitest runs 20 test files covering the API route, evaluator, gates, incidents, observability, schemas, replay, instrumentation, middleware, and UI components. A coverage summary prints to the terminal and a JSON report is written to vitest-report.json. All tests should pass.

Next steps

Tune quality bars per environment (dev vs. staging vs. production): swap getStandardPreset() for getStrictPreset() or getLenientPreset(), or define custom gate presets by calling createGateEngine() with your own threshold arrays.
Hook the harness into CI/CD by having your pipeline POST to /api/eval/run on every PR and fail the build when gatePassed is false. The junitReport and jsonReport fields are ready to feed into GitHub Actions annotations or GitLab CI artifacts.
Extend the dashboard to pull historical runs from Langfuse’s data export API so you can chart pass-rate trends and cost deltas over weeks and months without storing anything locally.
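
The CI step suggested in the list above could be sketched as a small function that posts a run and maps the result to a process exit code. The names ciExitCode and FetchLike are hypothetical, and the only response field relied on is gatePassed; any non-2xx response or missing field is treated as a failure, which is an assumption.

```typescript
// Narrow view of fetch so the function can be tested with a fake.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

export async function ciExitCode(
  fetchImpl: FetchLike,
  baseUrl: string,
  payload: unknown,
): Promise<0 | 1> {
  const res = await fetchImpl(`${baseUrl}/api/eval/run`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  // Treat transport or validation errors as a failed gate.
  if (!res.ok) return 1;
  const data = (await res.json()) as { gatePassed?: unknown };
  return data.gatePassed === true ? 0 : 1;
}
```

In a pipeline script you would pass a fetch implementation (Node's global fetch should fit this shape), the harness URL, and the same payload as the curl example, then exit the process with the returned code.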
