Intro
You’ll build a CI/CD-ready evaluation harness that runs automated quality gates on your AI agents using Databricks as an LLM-as-judge backend, Langfuse for observability traces, and golden-trajectory regression checks. By the end you’ll have a Next.js application that receives GitHub webhooks on pull requests, executes batched trajectory evaluations, enforces pass/fail thresholds, writes JUnit XML and JSON reports, and posts GitHub check results back to the PR.
Prerequisites
- Node.js >= 22
- pnpm 10.x
- A Databricks workspace with a model serving endpoint (e.g., databricks-dbrx-instruct)
- A Langfuse project (cloud or self-hosted)
- A GitHub repository with a webhook configured (optional for the webhook step)
- Familiarity with TypeScript, Next.js App Router route handlers, and environment variables
Step 1: Install dependencies
Start from the project root. Install all runtime and dev dependencies at once. The recipe pins pnpm@10.0.0 as the package manager.
```bash
pnpm install
```

Expected output: The install completes with no error-level output. You’ll see the @reaatech/agent-eval-harness-* packages resolved under node_modules/@reaatech/.
Step 2: Set up environment variables
Copy the example env file and fill in your credentials. Every variable is required for at least one part of the pipeline.
```bash
cp .env.example .env.local
```

Open .env.local and fill in each placeholder. The file at .env.example documents every variable:
```
NODE_ENV=development
DATABRICKS_HOST=https://<workspace>.cloud.databricks.com
DATABRICKS_TOKEN=<your-databricks-token>
DATABRICKS_MODEL_SERVING_ENDPOINT=databricks-dbrx-instruct
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_HOST=https://cloud.langfuse.com
GITHUB_WEBHOOK_SECRET=<your-github-webhook-secret>
GITHUB_TOKEN=<your-github-token>
ANTHROPIC_API_KEY=<your-anthropic-api-key>
LOG_LEVEL=info
OTEL_EXPORTER_OTLP_ENDPOINT=
```

DATABRICKS_HOST, DATABRICKS_TOKEN, and DATABRICKS_MODEL_SERVING_ENDPOINT control which Databricks model serving endpoint the judge calls. LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST configure trace export. GITHUB_WEBHOOK_SECRET and GITHUB_TOKEN are used by the webhook route to verify payloads and post GitHub check results.
Step 3: Enable the Next.js instrumentation hook
The src/instrumentation.ts file sets up the observability stack at process startup. Next.js requires experimental.instrumentationHook: true in the config; otherwise, the register() function never fires.
Open next.config.ts and replace its contents with this:
```ts
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  experimental: {
    instrumentationHook: true,
  },
};

export default nextConfig;
```

The instrumentationHook flag tells Next.js to call the exported register() function from src/instrumentation.ts when the Node.js server starts. Without it, the observability initialization is dead code.
Step 4: Set up observability and Langfuse
The observability layer wires Pino logging, OpenTelemetry tracing, a metrics manager, and a dashboard manager. Create src/lib/observability.ts:
```ts
import {
  getLogger,
  getTracingManager,
  getMetricsManager,
  getDashboardManager,
} from "@reaatech/agent-eval-harness-observability";

export function initializeObservability(): void {
  // Configure the Pino-based logger first so the other managers can log.
  getLogger({
    level: process.env.LOG_LEVEL ?? "info",
    format: process.env.NODE_ENV === "production" ? "json" : "pretty",
    includeRunId: true,
  });

  // Initialize the OpenTelemetry tracing, metrics, and dashboard singletons.
  // (Default options are assumed here; the recipe's file may pass explicit config.)
  getTracingManager();
  getMetricsManager();
  getDashboardManager();
}
```
The dashboard manager tracks run summaries and fires alerts when metrics cross configurable thresholds. initializeObservability() is called once in instrumentation.ts and then the singleton managers are used throughout the pipeline.
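For orientation, a minimal register() along those lines, assuming initializeObservability() from src/lib/observability.ts above, looks roughly like this (the starter's src/instrumentation.ts may do more):

```ts
// src/instrumentation.ts (sketch): Next.js calls register() once at server startup.
export async function register(): Promise<void> {
  // The hook also runs in the Edge runtime; only initialize the Node.js stack there.
  if (process.env.NEXT_RUNTIME === "nodejs") {
    const { initializeObservability } = await import("./lib/observability");
    initializeObservability();
  }
}
```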
Create src/lib/langfuse.ts to wrap the Langfuse client with a trace helper:
```ts
import { Langfuse } from "langfuse";
import { getLogger } from "@reaatech/agent-eval-harness-observability";

let langfuseInstance: Langfuse | null = null;

// Lazily create a singleton Langfuse client; return null when credentials are missing.
export function getLangfuseClient(): Langfuse | null {
  if (langfuseInstance) return langfuseInstance;
  const publicKey = process.env.LANGFUSE_PUBLIC_KEY;
  const secretKey = process.env.LANGFUSE_SECRET_KEY;
  if (!publicKey || !secretKey) {
    // getLogger() is assumed to return the singleton configured in initializeObservability().
    getLogger().warn("Langfuse credentials not set; traces will not be exported");
    return null;
  }
  langfuseInstance = new Langfuse({ publicKey, secretKey, baseUrl: process.env.LANGFUSE_HOST });
  return langfuseInstance;
}
```
getLangfuseTrace() returns a helper that wraps any async function in a Langfuse trace span. If Langfuse credentials are missing, it falls through to calling the function directly — the pipeline keeps working without observability.
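A minimal sketch of that helper, assuming the trace/update/flush calls from the Langfuse JS SDK (the exact shape in the recipe may differ):

```ts
export function getLangfuseTrace() {
  const client = getLangfuseClient();
  return async function withTrace<T>(name: string, fn: () => Promise<T>): Promise<T> {
    // No credentials configured: run the function without tracing.
    if (!client) return fn();
    const trace = client.trace({ name });
    const result = await fn();
    trace.update({ output: JSON.stringify(result) });
    await client.flushAsync(); // flush before short-lived route handlers exit
    return result;
  };
}
```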
Step 5: Implement the Databricks judge service
The judge is the core of the eval harness. It calls Databricks model serving to score agent trajectories across four dimensions: faithfulness, relevance, tool correctness, and coherence. Create src/services/databricks-judge.ts:
The judge service uses the OpenAI SDK configured with a Databricks base URL to call your model serving endpoint. Each metric runs in parallel via Promise.all. Responses are validated with a Zod schema, and the call retries up to 3 times with exponential backoff. The cumulative cost tracker throws BudgetExceededError once the configured budget is consumed.
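The file itself isn’t reproduced here, so the following is a condensed sketch of that flow under the stated assumptions (Databricks’ OpenAI-compatible /serving-endpoints route, a simple Zod score schema, three attempts with exponential backoff); cost tracking and BudgetExceededError are omitted:

```ts
import OpenAI from "openai";
import { z } from "zod";

// Databricks model serving exposes an OpenAI-compatible API under /serving-endpoints.
const client = new OpenAI({
  apiKey: process.env.DATABRICKS_TOKEN,
  baseURL: `${process.env.DATABRICKS_HOST}/serving-endpoints`,
});

const JudgeScore = z.object({ score: z.number().min(0).max(1), reasoning: z.string() });
type JudgeScore = z.infer<typeof JudgeScore>;

const METRICS = ["faithfulness", "relevance", "tool_correctness", "coherence"] as const;

async function scoreMetric(metric: string, trajectory: string, attempt = 1): Promise<JudgeScore> {
  try {
    const response = await client.chat.completions.create({
      model: process.env.DATABRICKS_MODEL_SERVING_ENDPOINT!,
      messages: [
        { role: "system", content: `Score the trajectory for ${metric}. Reply with JSON {"score", "reasoning"}.` },
        { role: "user", content: trajectory },
      ],
    });
    return JudgeScore.parse(JSON.parse(response.choices[0].message.content ?? "{}"));
  } catch (error) {
    if (attempt >= 3) throw error;
    // Exponential backoff: 1s, then 2s, between retries.
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** (attempt - 1)));
    return scoreMetric(metric, trajectory, attempt + 1);
  }
}

// All four metrics run in parallel for a single trajectory.
export async function judgeTrajectory(trajectory: string) {
  const scores = await Promise.all(METRICS.map((metric) => scoreMetric(metric, trajectory)));
  return Object.fromEntries(METRICS.map((metric, i) => [metric, scores[i]]));
}
```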
Step 6: Build the eval pipeline runner
The eval runner orchestrates the full pipeline: parse config, load trajectories, run the Databricks judge, aggregate results, evaluate gates, and write output files. Create src/eval-runner.ts:
The pipeline takes an EvalRunConfig with a YAML config string, a path to candidate trajectories in JSONL format, an optional golden dataset path, and an output directory. It writes results.json, junit-report.xml, and results.md to the output directory, plus CI artifacts from the gate engine.
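As a condensed sketch (field names are assumptions drawn from the description above; gate evaluation, JUnit XML, and Markdown output are elided):

```ts
import { readFile, writeFile, mkdir } from "node:fs/promises";
import { join } from "node:path";
import { judgeTrajectory } from "./services/databricks-judge";

// Field names below are assumptions based on the description in this step.
export interface EvalRunConfig {
  configYaml: string;       // pipeline settings as a YAML string
  trajectoriesPath: string; // candidate trajectories, one JSON object per line
  goldenPath?: string;      // optional golden dataset for regression comparison
  outputDir: string;        // where results.json, junit-report.xml, results.md are written
}

export async function runEvalPipeline(config: EvalRunConfig) {
  // 1. Load candidate trajectories from JSONL.
  const lines = (await readFile(config.trajectoriesPath, "utf8")).split("\n").filter(Boolean);
  const trajectories = lines.map((line) => JSON.parse(line));

  // 2. Run the Databricks judge over each trajectory.
  const results = [];
  for (const trajectory of trajectories) {
    results.push(await judgeTrajectory(JSON.stringify(trajectory)));
  }

  // 3. Aggregate scores and write artifacts (gate checks and the other report formats omitted here).
  const summary = { total: results.length, results };
  await mkdir(config.outputDir, { recursive: true });
  await writeFile(join(config.outputDir, "results.json"), JSON.stringify(summary, null, 2));
  return summary;
}
```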
Step 7: Wire the GitHub webhook route
Create app/api/webhook/route.ts. This route receives pull request events from GitHub, verifies the HMAC signature, runs the eval pipeline, and posts a GitHub check result back to the PR.
The route returns 200 immediately and runs the pipeline asynchronously so GitHub doesn’t time out waiting for a long-running eval. The Octokit client posts a GitHub check annotation with pass/fail status.
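A hedged sketch of that handler, assuming the standard x-hub-signature-256 HMAC check and an Octokit checks.create call; the eval invocation itself is left as a placeholder:

```ts
import { createHmac, timingSafeEqual } from "node:crypto";
import { NextRequest, NextResponse } from "next/server";
import { Octokit } from "@octokit/rest";

function verifySignature(body: string, signature: string | null): boolean {
  if (!signature) return false;
  const expected =
    "sha256=" + createHmac("sha256", process.env.GITHUB_WEBHOOK_SECRET!).update(body).digest("hex");
  return signature.length === expected.length && timingSafeEqual(Buffer.from(signature), Buffer.from(expected));
}

export async function POST(request: NextRequest): Promise<NextResponse> {
  const body = await request.text();
  if (!verifySignature(body, request.headers.get("x-hub-signature-256"))) {
    return NextResponse.json({ error: "invalid signature" }, { status: 401 });
  }
  if (request.headers.get("x-github-event") !== "pull_request") {
    return NextResponse.json({ ok: true });
  }

  // Respond immediately; run the eval and post the check asynchronously.
  void runEvalAndReport(JSON.parse(body));
  return NextResponse.json({ ok: true });
}

async function runEvalAndReport(event: any): Promise<void> {
  // Run the eval pipeline here (see Step 6) and derive pass/fail from the gate results.
  const passed = true; // placeholder: replace with the gate engine's verdict
  const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
  await octokit.rest.checks.create({
    owner: event.repository.owner.login,
    repo: event.repository.name,
    name: "agent-eval",
    head_sha: event.pull_request.head.sha,
    status: "completed",
    conclusion: passed ? "success" : "failure",
  });
}
```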
Step 8: Add the dashboard route
Create app/api/dashboard/route.ts to expose the current eval summary for the home page:
```ts
import { NextResponse } from "next/server";
import { getDashboardManager } from "@reaatech/agent-eval-harness-observability";

export async function GET(): Promise<NextResponse> {
  const dashboard = getDashboardManager();
  const summary = dashboard.getSummary();
  return NextResponse.json(summary);
}
```

The home page reads the dashboard manager’s singleton and renders the current score, pass rate, latency, cost, and any active alerts; since it renders on the server, it reads directly from the singleton without calling this route.
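For illustration, a server-component home page along those lines (the markup and summary shape are assumptions) can be as small as:

```tsx
// app/page.tsx: read the singleton directly in a React Server Component.
import { getDashboardManager } from "@reaatech/agent-eval-harness-observability";

export default function HomePage() {
  const summary = getDashboardManager().getSummary();
  return (
    <main>
      <h1>Agent Eval Dashboard</h1>
      {/* Render score, pass rate, latency, cost, and alerts from the summary. */}
      <pre>{JSON.stringify(summary, null, 2)}</pre>
    </main>
  );
}
```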
Step 9: Run the tests
Run the vitest suite with coverage reporting:
```bash
pnpm test
```

Expected output: All tests pass. The terminal shows the coverage table with percentages for each source file, and the final line reads Test Files X passed. The suite covers the eval runner, webhook handler, Databricks judge service, gate check, golden loader, observability module, and the index export.
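If you need to reproduce that coverage output in a fresh checkout, a minimal vitest.config.ts along these lines would do it (the recipe’s actual config may include more reporters and thresholds), with `"test": "vitest run --coverage"` wired into the package.json scripts:

```ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    environment: "node",
    coverage: {
      provider: "v8",     // requires @vitest/coverage-v8
      reporter: ["text"], // prints the per-file coverage table in the terminal
      include: ["src/**/*.ts", "app/**/*.ts"],
    },
  },
});
```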
Next steps
- Register the GitHub webhook in your repository settings, pointing it to https://your-domain.com/api/webhook. Each pull request then triggers an evaluation run and posts a GitHub check result automatically.
- Add golden trajectory files to ./goldens/ as .jsonl and pass the directory path in goldenPath when calling runEvalPipeline. The comparator diffs candidate trajectories against the reference set and surfaces regressions.
- Experiment with the gate presets in src/services/gate-check.ts: swap getStandardPreset() for getStrictPreset() or getLenientPreset() to tighten or relax the pass thresholds in CI, as shown in the sketch below.
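As a rough illustration of the last two items (paths and option names are hypothetical and depend on your layout), the pipeline call might look like:

```ts
import { readFile } from "node:fs/promises";
import { runEvalPipeline } from "./src/eval-runner";

const summary = await runEvalPipeline({
  configYaml: await readFile("eval.config.yaml", "utf8"), // hypothetical config file
  trajectoriesPath: "./trajectories/candidate.jsonl",     // hypothetical candidate set
  goldenPath: "./goldens",                                // directory of golden .jsonl trajectories
  outputDir: "./eval-output",
});

// In src/services/gate-check.ts, swapping getStandardPreset() for getStrictPreset()
// (or getLenientPreset()) raises or lowers the thresholds the gate engine enforces.
console.log(summary);
```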
