Intro
This tutorial walks you through building a CI-friendly eval harness that tests AI support agent responses against golden conversation datasets, scores them with a Databricks model serving endpoint, tracks costs, enforces quality gates, and exports results to Braintrust for analytics. By the end, you’ll have a working pipeline you can drop into any CI workflow.
You’ll build it step-by-step from an empty Next.js scaffold, wiring together five @reaatech/agent-eval-harness-* packages, the Databricks SDK, Braintrust, Zod, p-limit, and dotenv.
Prerequisites
- Node.js 22+ and pnpm 10 installed
- A Databricks workspace with a model serving endpoint deployed (e.g., DBRX Instruct or Llama 3.1 70B)
- A Braintrust account with an API key (free tier works)
- Basic familiarity with TypeScript, Next.js App Router, and vitest
Step 1: Scaffold the project
The project starts with a Next.js 16 App Router scaffold. Install its dependencies:
pnpm install
This installs every dependency already pinned in package.json, with no ^ or ~ ranges anywhere. The key dependencies are:
- @reaatech/agent-eval-harness-golden@0.1.0: golden trajectory management
- @reaatech/agent-eval-harness-judge@0.1.0: LLM-as-judge engine
- @reaatech/agent-eval-harness-cost@0.1.0: cost tracking and budgets
- @reaatech/agent-eval-harness-gate@0.1.0: CI regression gates
- @reaatech/agent-eval-harness-cli@0.1.0: CLI subcommand functions
- braintrust@3.16.0: experiment logging and analytics
- @databricks/sdk-experimental@0.18.0: Databricks workspace client
- zod@4.4.3, p-limit@7.3.0, dotenv@17.4.2
Your project now has the scaffold files in place: tsconfig.json, next.config.ts, vitest.config.ts, eslint.config.mjs, and the placeholder app/page.tsx, src/index.ts, and tests/index.test.ts.
Step 2: Configure environment variables
Open .env.example and add the variables the pipeline needs:
# Env vars used by databricks-agent-eval-harness-for-smb-support-bots.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only — never commit real values.
NODE_ENV=development
DATABRICKS_HOST=<your-databricks-workspace-url>
DATABRICKS_TOKEN=<your-databricks-pat-token>
DATABRICKS_SERVING_ENDPOINT=<your-serving-endpoint-name>
BRAINTRUST_API_KEY=<your-braintrust-api-key>
ANTHROPIC_API_KEY=<your-anthropic-key>
OPENAI_API_KEY=<your-openai-key>
Copy it to .env and fill in your real values:
cp .env.example .env
Your .env file is git-ignored. Never commit secrets.
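Failing fast on a missing variable saves a confusing error later in the run. The following is a minimal sketch of that idea, not the project's actual code: the `requireEnv` name matches the helper mentioned in the test step, but its signature and the `loadRequiredEnv` wrapper are assumptions made for illustration.

```typescript
// Hypothetical helpers: the signatures below are assumptions, not the
// tutorial's real implementation.
type Env = Record<string, string | undefined>;

// Returns the trimmed value, or throws when the variable is unset or blank.
function requireEnv(name: string, env: Env): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// The four variables the full eval run reads from the environment.
const REQUIRED_VARS = [
  "DATABRICKS_HOST",
  "DATABRICKS_TOKEN",
  "DATABRICKS_SERVING_ENDPOINT",
  "BRAINTRUST_API_KEY",
] as const;
type RequiredVar = (typeof REQUIRED_VARS)[number];

// Validates every required variable up front and returns them as one object.
function loadRequiredEnv(env: Env): Record<RequiredVar, string> {
  const loaded = {} as Record<RequiredVar, string>;
  for (const name of REQUIRED_VARS) {
    loaded[name] = requireEnv(name, env);
  }
  return loaded;
}
```

In a real script you would call `loadRequiredEnv(process.env)` once, after the .env file has been loaded. Taking the environment as a parameter keeps the helper easy to unit-test.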
Step 3: Define shared types
Create src/lib/types.ts with the interfaces and Zod schemas that the rest of the codebase depends on:
import { z } from "zod";
import type { GateEvaluationSummary } from "@reaatech/agent-eval-harness-gate";
export const EvalConfigSchema = z.object({
databricksHost: z.string(),
databricksToken: z.string
The JudgeAdapter interface is the contract every judge must fulfill. EvalConfig captures all the configuration the pipeline needs. EvalRunResult records what comes out of each test scenario.
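To make the contract concrete, here is a rough sketch of what such types could look like. Only the type names and the `judge` / `judgeBatch` method names come from this tutorial; every field name, the 0 to 1 score range, and the `StubJudge` and `toRunResult` helpers are assumptions chosen for illustration.

```typescript
// Hypothetical shapes: field names are illustrative assumptions.
interface JudgeRequest {
  scenario: string;       // assumed: identifier of the golden scenario
  userMessage: string;    // assumed
  agentResponse: string;  // assumed
  intent?: string;
  tools?: string[];
}

interface JudgeScore {
  score: number;          // assumed to lie in the range 0..1
  reasoning: string;
}

// The contract every judge fulfills: score one request, or a batch.
interface JudgeAdapter {
  judge(request: JudgeRequest): Promise<JudgeScore>;
  judgeBatch(requests: JudgeRequest[]): Promise<JudgeScore[]>;
}

interface EvalRunResult {
  scenario: string;
  score: number;
  passed: boolean;
}

// A fixed-score stub judge, handy for unit tests of code that takes a JudgeAdapter.
class StubJudge implements JudgeAdapter {
  private readonly fixed: number;
  constructor(fixed: number) {
    this.fixed = fixed;
  }
  async judge(_request: JudgeRequest): Promise<JudgeScore> {
    return { score: this.fixed, reasoning: "stub" };
  }
  async judgeBatch(requests: JudgeRequest[]): Promise<JudgeScore[]> {
    return Promise.all(requests.map((r) => this.judge(r)));
  }
}

// Turns a judged request into a per-scenario result using a pass threshold.
function toRunResult(request: JudgeRequest, judged: JudgeScore, threshold: number): EvalRunResult {
  return { scenario: request.scenario, score: judged.score, passed: judged.score >= threshold };
}
```

Coding against the interface rather than a concrete class is what lets the pipeline swap the Databricks judge for a stub in tests.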
Step 4: Create the Databricks Judge adapter
Create src/lib/databricks-judge.ts. This class wraps a Databricks model serving endpoint as an LLM-as-judge, implementing the JudgeAdapter interface:
Key design decisions in this adapter:
- Never throws — all errors produce a zero-score fallback, making the pipeline resilient to transient endpoint failures
- Uses raw fetch to POST to the Databricks serving endpoint’s /invocations URL directly, with the bearer token stored on the instance
- judgeBatch uses p-limit to bound concurrency and fires independent judge() calls for each request
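The design decisions above can be sketched roughly as follows. This is an illustration under stated assumptions, not the adapter itself: the request body, the OpenAI-style response shape, and the JSON-in-content score format are all assumed; `JudgeTarget` and `FetchLike` are hypothetical names; and a small hand-rolled worker pool stands in for p-limit so the block has no dependencies.

```typescript
interface JudgeScore { score: number; reasoning: string }
interface HttpResponseLike { ok: boolean; status: number; json(): Promise<unknown> }
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<HttpResponseLike>;

// Hypothetical bundle of connection settings; fetchFn is injectable for tests.
interface JudgeTarget { host: string; token: string; endpoint: string; fetchFn: FetchLike }

function zeroScore(reason: string): JudgeScore {
  return { score: 0, reasoning: reason };
}

// Never throws: every failure path resolves to a zero-score fallback.
async function judge(target: JudgeTarget, prompt: string): Promise<JudgeScore> {
  try {
    const url = `${target.host}/serving-endpoints/${target.endpoint}/invocations`;
    const res = await target.fetchFn(url, {
      method: "POST",
      headers: { Authorization: `Bearer ${target.token}`, "Content-Type": "application/json" },
      body: JSON.stringify({ messages: [{ role: "user", content: prompt }] }),
    });
    if (!res.ok) return zeroScore(`HTTP ${res.status}`);
    // Assumed response: a chat completion whose message content is a JSON
    // string such as {"score": 0.9, "reasoning": "..."}.
    const data = (await res.json()) as { choices?: { message?: { content?: string } }[] };
    const parsed = JSON.parse(data.choices?.[0]?.message?.content ?? "") as Partial<JudgeScore>;
    if (typeof parsed.score !== "number") return zeroScore("missing score field");
    return { score: parsed.score, reasoning: parsed.reasoning ?? "" };
  } catch (err) {
    return zeroScore(err instanceof Error ? err.message : String(err));
  }
}

// Bounded concurrency: at most `concurrency` judge() calls are in flight.
async function judgeBatch(target: JudgeTarget, prompts: string[], concurrency = 3): Promise<JudgeScore[]> {
  const results: JudgeScore[] = new Array(prompts.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < prompts.length) {
      const i = next++;
      results[i] = await judge(target, prompts[i]);
    }
  }
  const workerCount = Math.max(1, Math.min(concurrency, prompts.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Because `judge` swallows every error, one failed request in a batch yields a zero score in its slot instead of rejecting the whole batch. Passing the global `fetch` as `fetchFn` would give the production behaviour.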
Step 5: Create the Braintrust exporter
Create src/lib/braintrust-exporter.ts. This class logs evaluation results to Braintrust experiments for historical dashboards:
import * as braintrust from "braintrust";
import type { Experiment } from "braintrust";
import type { EvalRunResult } from "./types.js";
export
The flow is: initExperiment() creates a Braintrust experiment, logResults() writes each eval result as a log entry with the scenario as input, the score as output, and 1.0 as the expected value (perfect match), and summarize() fetches an auto-generated summary. The exportAll() convenience method chains all three.
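The flow just described can be sketched as below. To keep the block dependency-free, `ExperimentLike` is a hypothetical structural stand-in for the object the Braintrust SDK returns; the `scores` and `metadata` fields, the `judge` score name, and the `EvalRunResult` fields are assumptions.

```typescript
// Hypothetical result shape; the real EvalRunResult may carry more fields.
interface EvalRunResult { scenario: string; score: number; passed: boolean }

// Structural stand-in for the parts of a Braintrust experiment used here.
interface ExperimentLike {
  log(event: {
    input: unknown;
    output: unknown;
    expected: unknown;
    scores: Record<string, number>;
    metadata?: Record<string, unknown>;
  }): unknown;
  summarize(): Promise<unknown>;
}

// One log entry per result: scenario as input, score as output, 1.0 expected.
function logResults(experiment: ExperimentLike, results: EvalRunResult[]): void {
  for (const r of results) {
    experiment.log({
      input: r.scenario,
      output: r.score,
      expected: 1.0,
      scores: { judge: r.score },      // assumed score name
      metadata: { passed: r.passed },  // assumed extra field
    });
  }
}

// Chains init, log, and summarize, wrapping any failure in one error.
async function exportAll(
  initExperiment: () => ExperimentLike | Promise<ExperimentLike>,
  results: EvalRunResult[],
): Promise<unknown> {
  try {
    const experiment = await initExperiment();
    logResults(experiment, results);
    return await experiment.summarize();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(`Braintrust export failed: ${reason}`);
  }
}
```

With the real SDK, `initExperiment` would presumably wrap a call to `braintrust.init(...)` with the project name and API key. Injecting it as a function lets tests pass a fake experiment instead.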
Step 6: Build the eval pipeline service
Create src/services/eval-pipeline.ts. This is the core orchestrator — it loads golden datasets, scores each candidate trajectory through the Databricks judge (plus a secondary JudgeEngine for multi-model consensus), tracks costs, runs regression gates, and exports to Braintrust:
The pipeline’s runFullEval() method is the main entry point. It loads *.jsonl golden files from the dataset directory, processes each through runSingleEntry() with bounded concurrency via pLimit, compares candidates against goldens, scores every turn through both the Databricks judge and a Claude-based JudgeEngine combined via weighted consensus, tracks costs with CostTracker, enforces budget limits, evaluates quality gates using the configured preset, and exports results to Braintrust alongside a JUnit XML report.
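The scoring arithmetic in that flow can be illustrated with a short sketch. It assumes weighted consensus means a weight-normalized average, which the harness packages may or may not implement exactly this way; the 0.6 / 0.4 weights, the `WeightedScore` shape, and the `getExitCode` signature are hypothetical.

```typescript
// Hypothetical shape: one judge's score for a turn plus its consensus weight.
interface WeightedScore { judge: string; score: number; weight: number }

// Weight-normalized average of the judges' scores for one turn.
function weightedConsensus(scores: WeightedScore[]): number {
  const totalWeight = scores.reduce((sum, s) => sum + s.weight, 0);
  if (totalWeight <= 0) return 0;
  return scores.reduce((sum, s) => sum + s.score * s.weight, 0) / totalWeight;
}

// Scenario score as the plain mean of its per-turn consensus scores.
function averageScore(turnScores: number[]): number {
  if (turnScores.length === 0) return 0;
  return turnScores.reduce((sum, x) => sum + x, 0) / turnScores.length;
}

// CI exit code: 0 when every scenario meets the minimum score, 1 otherwise.
function getExitCode(scenarioScores: number[], minScore: number): number {
  return scenarioScores.every((s) => s >= minScore) ? 0 : 1;
}

// Two judges score one turn:
// 0.8 * 0.6 + 0.5 * 0.4 = 0.68, up to floating-point rounding.
const turnScore = weightedConsensus([
  { judge: "databricks", score: 0.8, weight: 0.6 },
  { judge: "claude", score: 0.5, weight: 0.4 },
]);
```

Normalizing by the total weight means the weights do not have to sum to 1, so adding a third judge later only requires choosing its relative weight.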
Step 7: Create the CLI entry point
Create src/ci/run-evals.ts. This is the command-line script that wires env vars, CLI flags, and the pipeline together:
The script supports multiple modes:
- No subcommand: runs the full eval pipeline using DATABRICKS_HOST, DATABRICKS_TOKEN, DATABRICKS_SERVING_ENDPOINT, and BRAINTRUST_API_KEY from the environment, with CLI overrides for --golden-path, --budget, --gate-preset, --concurrency, and --judge-model
- gate <path>: runs only the gate evaluation on existing results using the @reaatech/agent-eval-harness-cli gateCommand
- eval <paths...>: runs only evaluation via the CLI package’s evalCommand
- golden [--create <path>]: lists or creates golden trajectories via goldenCommand
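A rough sketch of how such a script might split subcommands from flags follows. The `parseCliArgs` name appears in the test step, but the return shape and parsing rules here are assumptions; the real script may differ.

```typescript
// Hypothetical return shape for the parsed command line.
interface CliArgs {
  command?: string;                         // e.g. "gate", "eval", "golden"
  positionals: string[];                    // remaining non-flag arguments
  flags: Record<string, string | boolean>;  // --key value pairs or bare --key
}

// Parses an argv slice (without the node and script entries).
// A "--flag" followed by a non-flag token takes that token as its value;
// otherwise it is treated as a boolean flag set to true.
function parseCliArgs(argv: string[]): CliArgs {
  const flags: Record<string, string | boolean> = {};
  const positionals: string[] = [];
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg.startsWith("--")) {
      const key = arg.slice(2);
      const next = argv[i + 1];
      if (next !== undefined && !next.startsWith("--")) {
        flags[key] = next;
        i++; // skip the value we just consumed
      } else {
        flags[key] = true;
      }
    } else {
      positionals.push(arg);
    }
  }
  const [command, ...rest] = positionals;
  return { command, positionals: rest, flags };
}
```

One limitation of this simple rule is that a boolean flag placed directly before a positional argument would swallow it as its value, so boolean flags are safest at the end of the command line.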
Step 8: Set up the library entry point
Replace the placeholder src/index.ts with re-exports so consumers can import everything from one module:
export { DatabricksJudge } from "./lib/databricks-judge.js";
export { BraintrustExporter } from "./lib/braintrust-exporter.js";
export { EvalPipeline } from "./services/eval-pipeline.js";
export type { EvalConfig, EvalRunResult, JudgeAdapter, JudgeRequest, JudgeScore, EvalSummary } from "./lib/types.js";
This makes the three main classes and all key types available via a single import:
import { DatabricksJudge, BraintrustExporter, EvalPipeline } from "databricks-agent-eval-harness-for-smb-support-bots";
Step 9: Run the tests
The project comes with a comprehensive test suite covering all modules. Run it with:
pnpm test
This runs vitest with v8 coverage. The test suite covers:
- DatabricksJudge: happy path (returning parsed scores), error path (HTTP 500, network failure, malformed JSON), boundary cases (empty response, missing score field), judgeBatch (multiple requests, empty input, partial failure), and judge requests with intent, tools, and arguments
- BraintrustExporter: initExperiment, error propagation, logResults shape and empty-array boundary, summarize, exportAll integration sequence, error wrapping in all methods
- EvalPipeline: full pipeline happy path (all stages called), gate preset selection, getExitCode pass/fail, error handling (parsing failure, non-Error throws), empty turns array, empty dataset, budget enforcement, cost tracking, Braintrust export integration, score thresholds, comparison threshold failures
- run-evals CLI: parseCliArgs (flag-value pairs, boolean flags, missing args, edge cases), requireEnv (present, missing, empty, logging), main function (gate/eval/golden subcommands, full eval with env vars, missing env vars, error propagation)
- index entry: verifies all three classes are exported
Run type checking and linting too:
pnpm typecheck
pnpm lint
Step 10: Run the evaluation pipeline
With your .env populated and golden JSONL files in a directory (e.g., ./golden/), run the pipeline:
npx tsx src/ci/run-evals.ts --golden-path ./golden --output ./results
You can customize the run with the available flags:
npx tsx src/ci/run-evals.ts \
--golden-path ./golden \
--output ./results \
--budget strict \
--gate-preset strict \
--concurrency 3 \
--judge-model databricks-dbrx-instruct
The pipeline outputs:
- results/results.json: per-scenario scores with pass/fail
- results/junit.xml: JUnit report for CI integration
- Braintrust experiment with all logs and a summary
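For readers unfamiliar with the JUnit format that CI systems consume, the sketch below shows how per-scenario results could map onto it. It uses the common testsuite / testcase / failure layout; the exact XML the harness packages emit is not shown in this tutorial and may differ, and the `EvalRunResult` fields and `toJUnitXml` helper are hypothetical.

```typescript
// Hypothetical result shape, mirroring "per-scenario scores with pass/fail".
interface EvalRunResult { scenario: string; score: number; passed: boolean }

// Escapes the characters that are unsafe inside XML attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Renders one <testcase> per scenario; failed scenarios get a <failure> child.
function toJUnitXml(suiteName: string, results: EvalRunResult[]): string {
  const suite = escapeXml(suiteName);
  const failures = results.filter((r) => !r.passed).length;
  const cases = results.map((r) => {
    const name = escapeXml(r.scenario);
    if (r.passed) {
      return `  <testcase name="${name}" classname="${suite}"/>`;
    }
    return [
      `  <testcase name="${name}" classname="${suite}">`,
      `    <failure message="score ${r.score.toFixed(2)} did not pass"/>`,
      `  </testcase>`,
    ].join("\n");
  });
  return [
    `<?xml version="1.0" encoding="UTF-8"?>`,
    `<testsuite name="${suite}" tests="${results.length}" failures="${failures}">`,
    ...cases,
    `</testsuite>`,
  ].join("\n");
}
```

Most CI systems that ingest JUnit reports key off the `tests` and `failures` counts and the per-case `<failure>` elements, which is why each scenario becomes its own test case.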
Next steps
- Add more judge providers: wire in GPT-4 or Gemini via the JudgeEngine from @reaatech/agent-eval-harness-judge and expand the consensus model weights
- Human-in-the-loop calibration: collect human-labeled scores and use JudgeCalibrator with temperature scaling to correct systematic bias in the LLM judge
- Baseline comparison gates: use createNoRegressionGate() and createSignificanceGate() from @reaatech/agent-eval-harness-gate to compare against a previous run’s results
- GitHub Actions CI: add the eval pipeline to your PR workflow using the JUnit report and CIIntegration.generateGitHubAnnotations() for inline annotations
- Latency monitoring: use @reaatech/agent-eval-harness-latency to track P50/P90/P99 response times and add SLA violation gates
