Databricks Agent Eval Harness for SMB Support Bots

Intro

This tutorial walks you through building a CI-friendly eval harness that tests AI support agent responses against golden conversation datasets, scores them with a Databricks model serving endpoint, tracks costs, enforces quality gates, and exports results to Braintrust for analytics. By the end, you’ll have a working pipeline you can drop into any CI workflow.

You’ll build it step-by-step from an empty Next.js scaffold, wiring together five @reaatech/agent-eval-harness-* packages, the Databricks SDK, Braintrust, Zod, p-limit, and dotenv.

Prerequisites

Node.js 22+ and pnpm 10 installed
A Databricks workspace with a model serving endpoint deployed (e.g., DBRX Instruct or Llama 3.1 70B)
A Braintrust account with an API key (free tier works)
Basic familiarity with TypeScript, Next.js App Router, and vitest

Step 1: Scaffold the project

The project starts with a Next.js 16 App Router scaffold. Install its dependencies:

terminal

pnpm install

This installs the dependencies, all of which are pinned to exact versions in package.json (no ^ or ~ ranges anywhere). The key dependencies are:

@reaatech/agent-eval-harness-golden@0.1.0 — golden trajectory management
@reaatech/agent-eval-harness-judge@0.1.0 — LLM-as-judge engine
@reaatech/agent-eval-harness-cost@0.1.0 — cost tracking and budgets
@reaatech/agent-eval-harness-gate@0.1.0 — CI regression gates
@reaatech/agent-eval-harness-cli@0.1.0 — CLI subcommand functions
braintrust@3.16.0 — experiment logging and analytics
@databricks/sdk-experimental@0.18.0 — Databricks workspace client
zod@4.4.3, p-limit@7.3.0, dotenv@17.4.2

Your project now has the scaffold files in place: tsconfig.json, next.config.ts, vitest.config.ts, eslint.config.mjs, and the placeholder app/page.tsx, src/index.ts, and tests/index.test.ts.

Step 2: Configure environment variables

Open .env.example and add the variables the pipeline needs:

env

# Env vars used by databricks-agent-eval-harness-for-smb-support-bots.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only — never commit real values.
 
NODE_ENV=development
DATABRICKS_HOST=<your-databricks-workspace-url>
DATABRICKS_TOKEN=<your-databricks-pat-token>
DATABRICKS_SERVING_ENDPOINT=<your-serving-endpoint-name>
BRAINTRUST_API_KEY=<your-braintrust-api-key>
ANTHROPIC_API_KEY=<your-anthropic-key>
OPENAI_API_KEY=<your-openai-key>

Copy it to .env and fill in your real values:

terminal

cp .env.example .env

Your .env file is git-ignored. Never commit secrets.
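A small guard around the environment lookup lets the pipeline fail fast when one of these variables is missing. The sketch below is an assumption about how such a helper might look: the test suite in Step 9 mentions a requireEnv function, but its signature is not shown in this tutorial, and loadRequiredEnv is a hypothetical name. The environment is passed in as a parameter so the helper can be exercised without touching process.env.

```typescript
// Hypothetical helpers: requireEnv is named after the function mentioned in
// Step 9, but this signature and behavior are assumptions, not project code.
export type Env = Record<string, string | undefined>;

export function requireEnv(name: string, env: Env): string {
  const value = env[name];
  // Treat unset and blank values the same way: both are configuration errors.
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Collect the four variables the full pipeline run reads from the environment.
export function loadRequiredEnv(env: Env) {
  return {
    databricksHost: requireEnv("DATABRICKS_HOST", env),
    databricksToken: requireEnv("DATABRICKS_TOKEN", env),
    servingEndpoint: requireEnv("DATABRICKS_SERVING_ENDPOINT", env),
    braintrustApiKey: requireEnv("BRAINTRUST_API_KEY", env),
  };
}
```

In a CLI entry point you would typically call loadRequiredEnv(process.env) once dotenv has loaded .env, so a misconfigured CI job stops before any judge calls are made.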

Step 3: Define shared types

Create src/lib/types.ts with the interfaces and Zod schemas that the rest of the codebase depends on:

import { z } from "zod";
import type { GateEvaluationSummary } from "@reaatech/agent-eval-harness-gate";
 
export const EvalConfigSchema = z.object({
  databricksHost: z.string(),
  databricksToken: z.string(),
  // Remaining fields are not shown in this listing.
});

The JudgeAdapter interface is the contract every judge must fulfill. EvalConfig captures all the configuration the pipeline needs. EvalRunResult records what comes out of each test scenario.
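Since the full type definitions are not reproduced here, the sketch below shows one plausible shape for the contract. Only the type names (JudgeRequest, JudgeScore, JudgeAdapter, EvalRunResult) come from this tutorial; every field name and the ExactMatchJudge class are illustrative assumptions.

```typescript
// All field names below are assumptions made for illustration.
export interface JudgeRequest {
  scenarioId: string;
  agentResponse: string;
  expectedResponse: string;
}

export interface JudgeScore {
  score: number; // assumed to be normalized to the range 0..1
  reasoning: string;
}

// The contract every judge fulfills: score one request, or a batch of them.
export interface JudgeAdapter {
  judge(request: JudgeRequest): Promise<JudgeScore>;
  judgeBatch(requests: JudgeRequest[]): Promise<JudgeScore[]>;
}

export interface EvalRunResult {
  scenarioId: string;
  score: number;
  passed: boolean;
}

// A trivial adapter that scores by exact string match, usable as a test double.
export class ExactMatchJudge implements JudgeAdapter {
  async judge(request: JudgeRequest): Promise<JudgeScore> {
    const match = request.agentResponse.trim() === request.expectedResponse.trim();
    return { score: match ? 1 : 0, reasoning: match ? "exact match" : "responses differ" };
  }

  async judgeBatch(requests: JudgeRequest[]): Promise<JudgeScore[]> {
    return Promise.all(requests.map((request) => this.judge(request)));
  }
}
```

Coding the pipeline against the interface rather than a concrete class is what lets a stub like this stand in for the Databricks judge in unit tests.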

Step 4: Create the Databricks Judge adapter

Create src/lib/databricks-judge.ts. This class wraps a Databricks model serving endpoint as an LLM-as-judge, implementing the JudgeAdapter interface.

Key design decisions in this adapter:

Never throws — all errors produce a zero-score fallback, making the pipeline resilient to transient endpoint failures
Uses raw fetch to POST to the Databricks serving endpoint’s /invocations URL directly, with the bearer token stored on the instance
judgeBatch uses p-limit to bound concurrency and fires independent judge() calls for each request
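The three decisions above can be sketched as follows. This is not the project's DatabricksJudge: the request and response shapes are assumptions (an OpenAI-style chat payload whose reply is a JSON string with score and reasoning), the class name is hypothetical, fetch is injected so the code can run against a fake, and a small hand-written limiter stands in for p-limit to keep the example dependency-free.

```typescript
// Sketch only: type shapes, request body, and response format are assumptions.
export interface JudgeRequest { prompt: string }
export interface JudgeScore { score: number; reasoning: string }

// Minimal structural type so the global fetch or a test fake can be injected.
export type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

const fallback = (reason: string): JudgeScore => ({ score: 0, reasoning: `judge failed: ${reason}` });

// Dependency-free stand-in for p-limit: at most `limit` calls run at once.
async function mapWithLimit<T, R>(items: T[], limit: number, fn: (item: T) => Promise<R>): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index] as T);
    }
  };
  const workerCount = Math.min(Math.max(1, limit), items.length);
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}

export class DatabricksJudgeSketch {
  private readonly url: string;
  private readonly token: string;
  private readonly fetchFn: FetchLike;
  private readonly concurrency: number;

  // `host` is expected to include the scheme, e.g. https://<workspace-url>.
  constructor(host: string, token: string, endpoint: string, fetchFn: FetchLike, concurrency = 3) {
    this.url = `${host}/serving-endpoints/${endpoint}/invocations`;
    this.token = token;
    this.fetchFn = fetchFn;
    this.concurrency = concurrency;
  }

  // Never throws: every failure path resolves to a zero-score fallback.
  async judge(request: JudgeRequest): Promise<JudgeScore> {
    try {
      const response = await this.fetchFn(this.url, {
        method: "POST",
        headers: { Authorization: `Bearer ${this.token}`, "Content-Type": "application/json" },
        body: JSON.stringify({ messages: [{ role: "user", content: request.prompt }] }),
      });
      if (!response.ok) return fallback(`HTTP ${response.status}`);
      const data = (await response.json()) as { choices?: { message?: { content?: string } }[] };
      const content = data.choices?.[0]?.message?.content ?? "";
      const parsed = JSON.parse(content) as { score?: unknown; reasoning?: unknown };
      if (typeof parsed.score !== "number") return fallback("missing score field");
      return { score: parsed.score, reasoning: String(parsed.reasoning ?? "") };
    } catch (error) {
      // Network errors and malformed JSON both land here.
      return fallback(error instanceof Error ? error.message : String(error));
    }
  }

  // Each request is judged independently, so one failure cannot sink the batch.
  judgeBatch(requests: JudgeRequest[]): Promise<JudgeScore[]> {
    return mapWithLimit(requests, this.concurrency, (request) => this.judge(request));
  }
}
```

Because judge() resolves even on failure, judgeBatch() can await all results without extra error handling. The trade-off is that an outage looks like a run of zero scores, so the reasoning field is worth surfacing in reports.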

Step 5: Create the Braintrust exporter

Create src/lib/braintrust-exporter.ts. This class logs evaluation results to Braintrust experiments for historical dashboards:

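import * as braintrust from "braintrust";
import type { Experiment } from "braintrust";
import type { EvalRunResult } from "./types.js";
 
export class BraintrustExporter {
  // Class body is not shown in this listing; the method flow is described below.
}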

The flow is: initExperiment() creates a Braintrust experiment, logResults() writes each eval result as a log entry with the scenario as input, the score as output, and 1.0 as the expected value (perfect match), and summarize() fetches an auto-generated summary. The exportAll() convenience method chains all three.
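The mapping performed by logResults() can be illustrated with a pure function that turns one result into a log record. This is a sketch under assumptions: the EvalRunResult fields, the toLogRecord name, and the extra scores and metadata entries are hypothetical. Only the input, output, and expected mapping comes from the description above.

```typescript
// Field names on EvalRunResult are assumptions made for this sketch.
export interface EvalRunResult {
  scenario: string;
  score: number;
  passed: boolean;
}

export interface ExperimentLogRecord {
  input: string;
  output: number;
  expected: number;
  scores: Record<string, number>; // assumption: a named score for dashboards
  metadata: { passed: boolean }; // assumption: extra context per record
}

// Scenario becomes the input, the judge score the output, and 1.0 the target.
export function toLogRecord(result: EvalRunResult): ExperimentLogRecord {
  return {
    input: result.scenario,
    output: result.score,
    expected: 1.0,
    scores: { judge: result.score },
    metadata: { passed: result.passed },
  };
}
```

Keeping the mapping separate from the Braintrust client calls makes the record shape easy to unit test without network access or an API key.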

Step 6: Build the eval pipeline service

Create src/services/eval-pipeline.ts. This is the core orchestrator: it loads golden datasets, scores each candidate trajectory through the Databricks judge (plus a secondary JudgeEngine for multi-model consensus), tracks costs, runs regression gates, and exports to Braintrust.

The pipeline’s runFullEval() method is the main entry point. It loads *.jsonl golden files from the dataset directory and processes each one through runSingleEntry() with bounded concurrency via pLimit. For each entry it compares candidates against goldens and scores every turn through both the Databricks judge and a Claude-based JudgeEngine, combining the two scores via weighted consensus. Along the way it tracks costs with CostTracker and enforces budget limits. Finally it evaluates quality gates using the configured preset and exports results to Braintrust alongside a JUnit XML report.
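Two of the steps in runFullEval() are simple arithmetic and can be sketched in isolation: combining judge scores by weight, and checking accumulated cost against a budget. The function names, the weights, and the dollar figures below are hypothetical, and the real pipeline delegates cost tracking to CostTracker rather than a helper like this.

```typescript
export interface WeightedScore {
  score: number; // judge score, assumed to be in the range 0..1
  weight: number; // relative trust placed in this judge
}

// Weighted mean of the judge scores; returns 0 when no weight is assigned.
export function weightedConsensus(scores: WeightedScore[]): number {
  const totalWeight = scores.reduce((sum, s) => sum + s.weight, 0);
  if (totalWeight <= 0) return 0;
  const weightedSum = scores.reduce((sum, s) => sum + s.score * s.weight, 0);
  return weightedSum / totalWeight;
}

// True once the summed per-call costs exceed the configured budget.
export function isOverBudget(costsUsd: number[], budgetUsd: number): boolean {
  const total = costsUsd.reduce((sum, cost) => sum + cost, 0);
  return total > budgetUsd;
}

// Example: the Databricks judge weighted 0.6, the secondary judge 0.4.
const consensus = weightedConsensus([
  { score: 0.8, weight: 0.6 },
  { score: 0.6, weight: 0.4 },
]);
console.log(consensus.toFixed(2)); // prints "0.72"
```

Dividing by the total weight means the weights do not need to sum to 1, which makes it easier to add a third judge later without rebalancing the existing ones.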

Step 7: Create the CLI entry point

Create src/ci/run-evals.ts. This is the command-line script that wires env vars, CLI flags, and the pipeline together.

The script supports multiple modes:

No subcommand — runs the full eval pipeline using DATABRICKS_HOST, DATABRICKS_TOKEN, DATABRICKS_SERVING_ENDPOINT, and BRAINTRUST_API_KEY from the environment, with CLI overrides for --golden-path, --budget, --gate-preset, --concurrency, and --judge-model
gate <path> — runs only the gate evaluation on existing results using the @reaatech/agent-eval-harness-cli gateCommand
eval <paths...> — runs only evaluation via the CLI package’s evalCommand
golden [--create <path>] — lists or creates golden trajectories via goldenCommand
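The dispatch between these modes depends on splitting the argument list into an optional subcommand, positional paths, and flags. A minimal sketch of such a parser is shown below. Step 9 mentions a parseCliArgs function, but its actual signature and return shape are not shown in this tutorial, so everything here is an assumption.

```typescript
// Hypothetical return shape; the project's parseCliArgs may differ.
export interface ParsedArgs {
  subcommand?: string;
  positionals: string[];
  flags: Record<string, string | boolean>;
}

export function parseCliArgs(argv: string[]): ParsedArgs {
  const parsed: ParsedArgs = { positionals: [], flags: {} };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i] as string;
    if (arg.startsWith("--")) {
      const name = arg.slice(2);
      const next = argv[i + 1];
      if (next !== undefined && !next.startsWith("--")) {
        // Flag-value pair such as `--concurrency 3`.
        parsed.flags[name] = next;
        i++;
      } else {
        // A flag with no following value is treated as a boolean switch.
        parsed.flags[name] = true;
      }
    } else if (parsed.subcommand === undefined) {
      // The first bare word selects the mode (gate, eval, golden).
      parsed.subcommand = arg;
    } else {
      parsed.positionals.push(arg);
    }
  }
  return parsed;
}
```

With this shape, a missing subcommand maps to the full pipeline run. One limitation of the simple approach: a boolean flag placed directly before a positional path would consume that path as its value, so switches are safest at the end of the command line.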

Step 8: Set up the library entry point

Replace the placeholder src/index.ts with re-exports so consumers can import everything from one module:

export { DatabricksJudge } from "./lib/databricks-judge.js";
export { BraintrustExporter } from "./lib/braintrust-exporter.js";
export { EvalPipeline } from "./services/eval-pipeline.js";
export type { EvalConfig, EvalRunResult, JudgeAdapter, JudgeRequest, JudgeScore, EvalSummary } from "./lib/types.js";

This makes the three main classes and all key types available via a single import:

import { DatabricksJudge, BraintrustExporter, EvalPipeline } from "databricks-agent-eval-harness-for-smb-support-bots";

Step 9: Run the tests

The project comes with a comprehensive test suite covering all modules. Run it with:

terminal

pnpm test

This runs vitest with v8 coverage. The test suite covers:

DatabricksJudge — happy path (returning parsed scores), error path (HTTP 500, network failure, malformed JSON), boundary cases (empty response, missing score field), judgeBatch (multiple requests, empty input, partial failure), and judge requests with intent, tools, and arguments
BraintrustExporter — initExperiment, error propagation, logResults shape and empty-array boundary, summarize, exportAll integration sequence, error wrapping in all methods
EvalPipeline — full pipeline happy path (all stages called), gate preset selection, getExitCode pass/fail, error handling (parsing failure, non-Error throws), empty turns array, empty dataset, budget enforcement, cost tracking, Braintrust export integration, score thresholds, comparison threshold failures
run-evals CLI — parseCliArgs (flag-value pairs, boolean flags, missing args, edge cases), requireEnv (present, missing, empty, logging), main function (gate/eval/golden subcommands, full eval with env vars, missing env vars, error propagation)
index entry — verifies all three classes are exported

Run type checking and linting too:

terminal

pnpm typecheck
pnpm lint

Step 10: Run the evaluation pipeline

With your .env populated and golden JSONL files in a directory (e.g., ./golden/), run the pipeline:

terminal

npx tsx src/ci/run-evals.ts --golden-path ./golden --output ./results

You can customize the run with the available flags:

terminal

npx tsx src/ci/run-evals.ts \
  --golden-path ./golden \
  --output ./results \
  --budget strict \
  --gate-preset strict \
  --concurrency 3 \
  --judge-model databricks-dbrx-instruct

The pipeline outputs:

results/results.json — per-scenario scores with pass/fail
results/junit.xml — JUnit report for CI integration
Braintrust experiment with all logs and a summary
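To make the JUnit output concrete, the sketch below renders per-scenario results as a minimal testsuite document. It is an illustration, not the harness's own report writer: the ScenarioResult fields and the toJUnitXml name are assumptions, and only a small subset of the common JUnit attributes is emitted.

```typescript
// Assumed per-scenario shape, loosely mirroring results.json.
export interface ScenarioResult {
  scenario: string;
  score: number;
  passed: boolean;
}

// Escape characters that are not allowed inside XML attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

export function toJUnitXml(suiteName: string, results: ScenarioResult[]): string {
  const failures = results.filter((r) => !r.passed).length;
  const cases = results.map((r) => {
    const name = escapeXml(r.scenario);
    if (r.passed) return `  <testcase name="${name}"/>`;
    // A failed scenario carries its score so the CI log explains the failure.
    return [
      `  <testcase name="${name}">`,
      `    <failure message="score ${r.score} did not pass"/>`,
      `  </testcase>`,
    ].join("\n");
  });
  return [
    `<?xml version="1.0" encoding="UTF-8"?>`,
    `<testsuite name="${escapeXml(suiteName)}" tests="${results.length}" failures="${failures}">`,
    ...cases,
    `</testsuite>`,
  ].join("\n");
}
```

Most CI systems that ingest JUnit reports key on the testcase name and the presence of a failure element, so each scenario appears as its own pass or fail row.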

Next steps

Add more judge providers — wire in GPT-4 or Gemini via the JudgeEngine from @reaatech/agent-eval-harness-judge and expand the consensus model weights
Human-in-the-loop calibration — collect human-labeled scores and use JudgeCalibrator with temperature scaling to correct systematic bias in the LLM judge
Baseline comparison gates — use createNoRegressionGate() and createSignificanceGate() from @reaatech/agent-eval-harness-gate to compare against a previous run’s results
GitHub Actions CI — add the eval pipeline to your PR workflow using the JUnit report and CIIntegration.generateGitHubAnnotations() for inline annotations
Latency monitoring — use @reaatech/agent-eval-harness-latency to track P50/P90/P99 response times and add SLA violation gates

