Intro
You’ll build a CI/CD-ready evaluation harness that runs automated quality gates on your AI agents using Databricks as an LLM-as-judge backend, Langfuse for observability traces, and golden-trajectory regression checks. By the end you’ll have a Next.js application that receives GitHub webhooks on pull requests, executes batched trajectory evaluations, enforces pass/fail thresholds, writes JUnit XML and JSON reports, and posts GitHub check results back to the PR.
Prerequisites
- Node.js >= 22
- pnpm 10.x
- A Databricks workspace with a model serving endpoint (e.g., databricks-dbrx-instruct)
- A Langfuse project (cloud or self-hosted)
- A GitHub repository with a webhook configured (optional for the webhook step)
- Familiarity with TypeScript, Next.js App Router route handlers, and environment variables
Step 1: Install dependencies
Start from the project root. Install all runtime and dev dependencies at once. The recipe pins pnpm@10.0.0 as the package manager.
```bash
pnpm install
```

Expected output: The install completes with no error-level output. You’ll see the @reaatech/agent-eval-harness-* packages resolved under node_modules/@reaatech/.
Step 2: Set up environment variables
Copy the example env file and fill in your credentials. Every variable is required for at least one part of the pipeline.
```bash
cp .env.example .env.local
```

Open .env.local and fill in each placeholder. The file at .env.example documents every variable:
```
NODE_ENV=development
DATABRICKS_HOST=https://<workspace>.cloud.databricks.com
DATABRICKS_TOKEN=<your-databricks-token>
DATABRICKS_MODEL_SERVING_ENDPOINT=databricks-dbrx-instruct
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_HOST=https://cloud.langfuse.com
GITHUB_WEBHOOK_SECRET=<your-github-webhook-secret>
GITHUB_TOKEN=<your-github-token>
ANTHROPIC_API_KEY=<your-anthropic-api-key>
LOG_LEVEL=info
OTEL_EXPORTER_OTLP_ENDPOINT=
```

DATABRICKS_HOST, DATABRICKS_TOKEN, and DATABRICKS_MODEL_SERVING_ENDPOINT control which Databricks model serving endpoint the judge calls. LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST configure trace export. GITHUB_WEBHOOK_SECRET and GITHUB_TOKEN are used by the webhook route to verify payloads and post GitHub check results.
Step 3: Enable the Next.js instrumentation hook
The src/instrumentation.ts file sets up the observability stack at process startup. Next.js requires experimental.instrumentationHook: true in the config; otherwise, the register() function never fires.
Open next.config.ts and replace its contents with this:
```ts
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  experimental: {
    instrumentationHook: true,
  },
};

export default nextConfig;
```

The instrumentationHook flag tells Next.js to call the exported register() function from src/instrumentation.ts when the Node.js server starts. Without it, the observability initialization is dead code.
Step 4: Set up observability and Langfuse
The observability layer wires Pino logging, OpenTelemetry tracing, a metrics manager, and a dashboard manager. Create src/lib/observability.ts:
```ts
import {
  getLogger,
  getTracingManager,
  getMetricsManager,
  getDashboardManager,
} from "@reaatech/agent-eval-harness-observability";

export function initializeObservability(): void {
  // Configure the Pino-based logger first so the other managers can log.
  getLogger({
    level: process.env.LOG_LEVEL ?? "info",
    format: process.env.NODE_ENV === "production" ? "json" : "pretty",
    includeRunId: true,
  });

  // Initialize the OpenTelemetry tracing, metrics, and dashboard singletons.
  // (Default options are assumed here; the recipe's file may pass explicit config.)
  getTracingManager();
  getMetricsManager();
  getDashboardManager();
}
```
The dashboard manager tracks run summaries and fires alerts when metrics cross configurable thresholds. initializeObservability() is called once in instrumentation.ts and then the singleton managers are used throughout the pipeline.
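For orientation, a minimal register() along those lines, assuming initializeObservability() from src/lib/observability.ts above, looks roughly like this (the starter's src/instrumentation.ts may do more):

```ts
// src/instrumentation.ts (sketch): Next.js calls register() once at server startup.
export async function register(): Promise<void> {
  // The hook also runs in the Edge runtime; only initialize the Node.js stack there.
  if (process.env.NEXT_RUNTIME === "nodejs") {
    const { initializeObservability } = await import("./lib/observability");
    initializeObservability();
  }
}
```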
Create src/lib/langfuse.ts to wrap the Langfuse client with a trace helper:
```ts
import { Langfuse } from "langfuse";
import { getLogger } from "@reaatech/agent-eval-harness-observability";

let langfuseInstance: Langfuse | null = null;

// Lazily create a singleton Langfuse client; return null when credentials are missing.
export function getLangfuseClient(): Langfuse | null {
  if (langfuseInstance) return langfuseInstance;
  const publicKey = process.env.LANGFUSE_PUBLIC_KEY;
  const secretKey = process.env.LANGFUSE_SECRET_KEY;
  if (!publicKey || !secretKey) {
    // getLogger() is assumed to return the singleton configured in initializeObservability().
    getLogger().warn("Langfuse credentials not set; traces will not be exported");
    return null;
  }
  langfuseInstance = new Langfuse({ publicKey, secretKey, baseUrl: process.env.LANGFUSE_HOST });
  return langfuseInstance;
}
```
getLangfuseTrace() returns a helper that wraps any async function in a Langfuse trace span. If Langfuse credentials are missing, it falls through to calling the function directly — the pipeline keeps working without observability.
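A minimal sketch of that helper, assuming the trace/update/flush calls from the Langfuse JS SDK (the exact shape in the recipe may differ):

```ts
export function getLangfuseTrace() {
  const client = getLangfuseClient();
  return async function withTrace<T>(name: string, fn: () => Promise<T>): Promise<T> {
    // No credentials configured: run the function without tracing.
    if (!client) return fn();
    const trace = client.trace({ name });
    const result = await fn();
    trace.update({ output: JSON.stringify(result) });
    await client.flushAsync(); // flush before short-lived route handlers exit
    return result;
  };
}
```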
Step 5: Implement the Databricks judge service
The judge is the core of the eval harness. It calls Databricks model serving to score agent trajectories across four dimensions: faithfulness, relevance, tool correctness, and coherence. Create src/services/databricks-judge.ts:
The judge service uses the OpenAI SDK configured with a Databricks base URL to call your model serving endpoint. Each metric runs in parallel via Promise.all. Responses are validated with a Zod schema, and the call retries up to 3 times with exponential backoff. The cumulative cost tracker throws BudgetExceededError once the configured budget is consumed.
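The file itself isn’t reproduced here, so the following is a condensed sketch of that flow under the stated assumptions (Databricks’ OpenAI-compatible /serving-endpoints route, a simple Zod score schema, three attempts with exponential backoff); cost tracking and BudgetExceededError are omitted:

```ts
import OpenAI from "openai";
import { z } from "zod";

// Databricks model serving exposes an OpenAI-compatible API under /serving-endpoints.
const client = new OpenAI({
  apiKey: process.env.DATABRICKS_TOKEN,
  baseURL: `${process.env.DATABRICKS_HOST}/serving-endpoints`,
});

const JudgeScore = z.object({ score: z.number().min(0).max(1), reasoning: z.string() });
type JudgeScore = z.infer<typeof JudgeScore>;

const METRICS = ["faithfulness", "relevance", "tool_correctness", "coherence"] as const;

async function scoreMetric(metric: string, trajectory: string, attempt = 1): Promise<JudgeScore> {
  try {
    const response = await client.chat.completions.create({
      model: process.env.DATABRICKS_MODEL_SERVING_ENDPOINT!,
      messages: [
        { role: "system", content: `Score the trajectory for ${metric}. Reply with JSON {"score", "reasoning"}.` },
        { role: "user", content: trajectory },
      ],
    });
    return JudgeScore.parse(JSON.parse(response.choices[0].message.content ?? "{}"));
  } catch (error) {
    if (attempt >= 3) throw error;
    // Exponential backoff: 1s, then 2s, between retries.
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** (attempt - 1)));
    return scoreMetric(metric, trajectory, attempt + 1);
  }
}

// All four metrics run in parallel for a single trajectory.
export async function judgeTrajectory(trajectory: string) {
  const scores = await Promise.all(METRICS.map((metric) => scoreMetric(metric, trajectory)));
  return Object.fromEntries(METRICS.map((metric, i) => [metric, scores[i]]));
}
```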
Step 6: Build the eval pipeline runner
The eval runner orchestrates the full pipeline: parse config, load trajectories, run the Databricks judge, aggregate results, evaluate gates, and write output files. Create src/eval-runner.ts:
The pipeline takes an EvalRunConfig with a YAML config string, a path to candidate trajectories in JSONL format, an optional golden dataset path, and an output directory. It writes results.json, junit-report.xml, and results.md to the output directory, plus CI artifacts from the gate engine.
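As a condensed sketch (field names are assumptions drawn from the description above; gate evaluation, JUnit XML, and Markdown output are elided):

```ts
import { readFile, writeFile, mkdir } from "node:fs/promises";
import { join } from "node:path";
import { judgeTrajectory } from "./services/databricks-judge";

// Field names below are assumptions based on the description in this step.
export interface EvalRunConfig {
  configYaml: string;       // pipeline settings as a YAML string
  trajectoriesPath: string; // candidate trajectories, one JSON object per line
  goldenPath?: string;      // optional golden dataset for regression comparison
  outputDir: string;        // where results.json, junit-report.xml, results.md are written
}

export async function runEvalPipeline(config: EvalRunConfig) {
  // 1. Load candidate trajectories from JSONL.
  const lines = (await readFile(config.trajectoriesPath, "utf8")).split("\n").filter(Boolean);
  const trajectories = lines.map((line) => JSON.parse(line));

  // 2. Run the Databricks judge over each trajectory.
  const results = [];
  for (const trajectory of trajectories) {
    results.push(await judgeTrajectory(JSON.stringify(trajectory)));
  }

  // 3. Aggregate scores and write artifacts (gate checks and the other report formats omitted here).
  const summary = { total: results.length, results };
  await mkdir(config.outputDir, { recursive: true });
  await writeFile(join(config.outputDir, "results.json"), JSON.stringify(summary, null, 2));
  return summary;
}
```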
Step 7: Wire the GitHub webhook route
Create app/api/webhook/route.ts. This route receives pull request events from GitHub, verifies the HMAC signature, runs the eval pipeline, and posts a GitHub check result back to the PR.
The route returns 200 immediately and runs the pipeline asynchronously so GitHub doesn’t time out waiting for a long-running eval. The Octokit client posts a GitHub check annotation with pass/fail status.
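A hedged sketch of that handler, assuming the standard x-hub-signature-256 HMAC check and an Octokit checks.create call; the eval invocation itself is left as a placeholder:

```ts
import { createHmac, timingSafeEqual } from "node:crypto";
import { NextRequest, NextResponse } from "next/server";
import { Octokit } from "@octokit/rest";

function verifySignature(body: string, signature: string | null): boolean {
  if (!signature) return false;
  const expected =
    "sha256=" + createHmac("sha256", process.env.GITHUB_WEBHOOK_SECRET!).update(body).digest("hex");
  return signature.length === expected.length && timingSafeEqual(Buffer.from(signature), Buffer.from(expected));
}

export async function POST(request: NextRequest): Promise<NextResponse> {
  const body = await request.text();
  if (!verifySignature(body, request.headers.get("x-hub-signature-256"))) {
    return NextResponse.json({ error: "invalid signature" }, { status: 401 });
  }
  if (request.headers.get("x-github-event") !== "pull_request") {
    return NextResponse.json({ ok: true });
  }

  // Respond immediately; run the eval and post the check asynchronously.
  void runEvalAndReport(JSON.parse(body));
  return NextResponse.json({ ok: true });
}

async function runEvalAndReport(event: any): Promise<void> {
  // Run the eval pipeline here (see Step 6) and derive pass/fail from the gate results.
  const passed = true; // placeholder: replace with the gate engine's verdict
  const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
  await octokit.rest.checks.create({
    owner: event.repository.owner.login,
    repo: event.repository.name,
    name: "agent-eval",
    head_sha: event.pull_request.head.sha,
    status: "completed",
    conclusion: passed ? "success" : "failure",
  });
}
```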
Step 8: Add the dashboard route
Create app/api/dashboard/route.ts to expose the current eval summary for the home page:
```ts
import { NextResponse } from "next/server";
import { getDashboardManager } from "@reaatech/agent-eval-harness-observability";

export async function GET(): Promise<NextResponse> {
  const dashboard = getDashboardManager();
  const summary = dashboard.getSummary();
  return NextResponse.json(summary);
}
```

The home page reads the dashboard manager’s singleton and renders the current score, pass rate, latency, cost, and any active alerts; since it renders on the server, it reads directly from the singleton without calling this route.
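For illustration, a server-component home page along those lines (the markup and summary shape are assumptions) can be as small as:

```tsx
// app/page.tsx: read the singleton directly in a React Server Component.
import { getDashboardManager } from "@reaatech/agent-eval-harness-observability";

export default function HomePage() {
  const summary = getDashboardManager().getSummary();
  return (
    <main>
      <h1>Agent Eval Dashboard</h1>
      {/* Render score, pass rate, latency, cost, and alerts from the summary. */}
      <pre>{JSON.stringify(summary, null, 2)}</pre>
    </main>
  );
}
```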
Step 9: Run the tests
Run the vitest suite with coverage reporting:
```bash
pnpm test
```

Expected output: All tests pass. The terminal shows the coverage table with percentages for each source file, and the final line reads Test Files X passed. The suite covers the eval runner, webhook handler, Databricks judge service, gate check, golden loader, observability module, and the index export.
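If you need to reproduce that coverage output in a fresh checkout, a minimal vitest.config.ts along these lines would do it (the recipe’s actual config may include more reporters and thresholds), with `"test": "vitest run --coverage"` wired into the package.json scripts:

```ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    environment: "node",
    coverage: {
      provider: "v8",     // requires @vitest/coverage-v8
      reporter: ["text"], // prints the per-file coverage table in the terminal
      include: ["src/**/*.ts", "app/**/*.ts"],
    },
  },
});
```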
Next steps
- Register the GitHub webhook in your repository settings, pointing it to https://your-domain.com/api/webhook. Each pull request then triggers an evaluation run and posts a GitHub check result automatically.
- Add golden trajectory files to ./goldens/ as .jsonl and pass the directory path in goldenPath when calling runEvalPipeline. The comparator diffs candidate trajectories against the reference set and surfaces regressions.
- Experiment with the gate presets in src/services/gate-check.ts: swap getStandardPreset() for getStrictPreset() or getLenientPreset() to tighten or relax the pass thresholds in CI, as shown in the sketch below.
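As a rough illustration of the last two items (paths and option names are hypothetical and depend on your layout), the pipeline call might look like:

```ts
import { readFile } from "node:fs/promises";
import { runEvalPipeline } from "./src/eval-runner";

const summary = await runEvalPipeline({
  configYaml: await readFile("eval.config.yaml", "utf8"), // hypothetical config file
  trajectoriesPath: "./trajectories/candidate.jsonl",     // hypothetical candidate set
  goldenPath: "./goldens",                                // directory of golden .jsonl trajectories
  outputDir: "./eval-output",
});

// In src/services/gate-check.ts, swapping getStandardPreset() for getStrictPreset()
// (or getLenientPreset()) raises or lowers the thresholds the gate engine enforces.
console.log(summary);
```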
