# Databricks Agent Eval Harness for SMB Support Bots
 
> Automated regression testing for SMB customer support agents, running on Databricks with Braintrust analytics.
 
A reference solution from [reaatech.com](https://reaatech.com), demonstrating how to build CI-friendly eval harnesses using the `@reaatech/agent-eval-harness-*` package family.
 
## Problem
 
SMBs deploying AI support agents struggle to catch regressions before they reach customers, which leads to poor responses and avoidable handoffs. Manual QA is costly and inconsistent.
 
## Architecture
 
This solution provides a CI-friendly eval harness that uses golden conversation datasets to test agent responses:
 
1. **Golden datasets**: `@reaatech/agent-eval-harness-golden` manages reference trajectories with turn-level similarity scoring.
2. **LLM judge**: `DatabricksJudge` wraps a Databricks model serving endpoint as an LLM-as-judge adapter, scoring responses on quality.
3. **Cost tracking**: `@reaatech/agent-eval-harness-cost` calculates per-trajectory LLM token and tool costs with budget enforcement.
4. **Regression gates**: `@reaatech/agent-eval-harness-gate` blocks deployment if quality, cost, or latency thresholds are breached.
5. **Braintrust export**: results are logged to Braintrust experiments for historical dashboards and trend analysis.
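
To make the regression-gate step concrete, here is a minimal sketch of comparing aggregate run metrics against thresholds. Every name in it (`RunMetrics`, `GateThresholds`, `evaluateGates`, `shouldBlockDeploy`) and every threshold value is hypothetical and chosen for illustration; it is not the API of `@reaatech/agent-eval-harness-gate`.

```ts
// Hypothetical shapes for illustration only; not the real gate package API.
interface RunMetrics {
  meanQuality: number; // judge score averaged over trajectories, 0..1
  totalCostUsd: number; // summed LLM token and tool cost
  p95LatencyMs: number; // 95th percentile latency across trajectories
}

interface GateThresholds {
  minQuality: number;
  maxCostUsd: number;
  maxP95LatencyMs: number;
}

interface GateResult {
  gate: "quality" | "cost" | "latency";
  passed: boolean;
  detail: string;
}

// Compare each metric with its threshold and report one result per gate.
function evaluateGates(m: RunMetrics, t: GateThresholds): GateResult[] {
  return [
    {
      gate: "quality",
      passed: m.meanQuality >= t.minQuality,
      detail: `mean quality ${m.meanQuality} (min ${t.minQuality})`,
    },
    {
      gate: "cost",
      passed: m.totalCostUsd <= t.maxCostUsd,
      detail: `total cost $${m.totalCostUsd} (max $${t.maxCostUsd})`,
    },
    {
      gate: "latency",
      passed: m.p95LatencyMs <= t.maxP95LatencyMs,
      detail: `p95 latency ${m.p95LatencyMs} ms (max ${t.maxP95LatencyMs} ms)`,
    },
  ];
}

// A deployment is blocked as soon as any single gate fails.
function shouldBlockDeploy(results: GateResult[]): boolean {
  return results.some((r) => !r.passed);
}
```

Keeping one result per gate, rather than a single boolean, lets a CI log show which threshold was breached.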
 
## Quick Start
 
```bash
# Install dependencies
pnpm install
 
# Copy and populate environment variables
cp .env.example .env
# Edit .env with your Databricks workspace URL, token, serving endpoint, and Braintrust API key
 
# Run the eval pipeline
npx tsx src/ci/run-evals.ts --golden-path ./golden --output ./results
```
 
## Environment Variables
 
| Variable | Required | Description |
|----------|----------|-------------|
| `DATABRICKS_HOST` | Yes | Databricks workspace URL (e.g. `https://your-workspace.cloud.databricks.com`) |
| `DATABRICKS_TOKEN` | Yes | Databricks personal access token |
| `DATABRICKS_SERVING_ENDPOINT` | Yes | Name of the model serving endpoint to use for judging |
| `BRAINTRUST_API_KEY` | Yes | Braintrust API key for experiment logging |
| `ANTHROPIC_API_KEY` | No | Used by `JudgeEngine` when judging directly via Claude |
| `OPENAI_API_KEY` | No | Used by `JudgeEngine` when judging via GPT-4/OpenRouter |
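
A script that depends on these variables may want to fail fast when a required one is missing. The following is a sketch of one way to do that; `readRequiredEnv` is a hypothetical helper, not something this repository is known to export.

```ts
// The four variables marked "Yes" in the table above.
const REQUIRED_VARS = [
  "DATABRICKS_HOST",
  "DATABRICKS_TOKEN",
  "DATABRICKS_SERVING_ENDPOINT",
  "BRAINTRUST_API_KEY",
] as const;

type RequiredVar = (typeof REQUIRED_VARS)[number];

// Hypothetical helper: returns the required values, or throws one error
// that names every missing (or blank) variable at once.
function readRequiredEnv(
  env: Record<string, string | undefined>,
): Record<RequiredVar, string> {
  const missing = REQUIRED_VARS.filter((name) => !env[name]?.trim());
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(", ")}`,
    );
  }
  const out = {} as Record<RequiredVar, string>;
  for (const name of REQUIRED_VARS) {
    out[name] = env[name] as string;
  }
  return out;
}
```

In a Node script this would typically be called as `readRequiredEnv(process.env)` before any network request is made.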
 
## CLI Reference
 
```text
npx tsx src/ci/run-evals.ts [options]
 
Options:
  --golden-path <path>        Path to golden dataset directory [default: ./golden]
  --output <path>             Output directory for results [default: ./results]
  --budget <preset>           Budget preset: strict | moderate | lenient [default: moderate]
  --gate-preset <preset>      Gate preset: standard | strict | lenient [default: standard]
  --concurrency <n>           Parallel evaluation limit [default: 5]
  --judge-model <model>       Model name for the Databricks serving endpoint [default: databricks-dbrx-instruct]
```
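
For readers who want to see how such options could be parsed, here is a sketch using `parseArgs` from Node's built-in `node:util` module. It covers only the options listed above, with the same defaults. The `CliOptions` shape and the `parseCliOptions` function are assumptions for illustration; the actual `run-evals.ts` may parse its arguments differently.

```ts
import { parseArgs } from "node:util";

// Hypothetical result shape; field names mirror the documented flags.
interface CliOptions {
  goldenPath: string;
  output: string;
  budget: string;
  gatePreset: string;
  concurrency: number;
  judgeModel: string;
}

const BUDGET_PRESETS = ["strict", "moderate", "lenient"];
const GATE_PRESETS = ["standard", "strict", "lenient"];

function parseCliOptions(argv: string[]): CliOptions {
  // parseArgs only yields strings or booleans, so numeric flags
  // are declared as strings and converted afterwards.
  const { values } = parseArgs({
    args: argv,
    options: {
      "golden-path": { type: "string", default: "./golden" },
      output: { type: "string", default: "./results" },
      budget: { type: "string", default: "moderate" },
      "gate-preset": { type: "string", default: "standard" },
      concurrency: { type: "string", default: "5" },
      "judge-model": { type: "string", default: "databricks-dbrx-instruct" },
    },
  });

  const budget = String(values.budget);
  if (!BUDGET_PRESETS.includes(budget)) {
    throw new Error(`--budget must be one of: ${BUDGET_PRESETS.join(", ")}`);
  }
  const gatePreset = String(values["gate-preset"]);
  if (!GATE_PRESETS.includes(gatePreset)) {
    throw new Error(`--gate-preset must be one of: ${GATE_PRESETS.join(", ")}`);
  }
  const concurrency = Number(values.concurrency);
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new Error("--concurrency must be a positive integer");
  }

  return {
    goldenPath: String(values["golden-path"]),
    output: String(values.output),
    budget,
    gatePreset,
    concurrency,
    judgeModel: String(values["judge-model"]),
  };
}
```

A script entry point would pass `process.argv.slice(2)` as `argv`.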
 
## API
 
### `DatabricksJudge`
 
Wraps a Databricks serving endpoint as an LLM-as-judge adapter. Implements `JudgeAdapter`.
 
```ts
const judge = new DatabricksJudge({
  databricksHost: "https://workspace.databricks.com",
  databricksToken: "dapi...",
  servingEndpoint: "my-endpoint",
});
 
const result = await judge.judge({
  type: "faithfulness",
  context: "The balance is $42.50",
  response: "Your balance is $42.50.",
});
// { score: 0.95, explanation: "...", confidence: 0.9 }
```
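
The internals of `DatabricksJudge` are not shown here, but an adapter of this kind plausibly builds a request for the endpoint's `invocations` route and parses a structured reply. The sketch below assumes the endpoint accepts a chat-style `messages` payload and that the model is prompted to answer in JSON. The prompt wording and the helper names `buildJudgeRequest` and `parseJudgeReply` are hypothetical.

```ts
interface DatabricksJudgeConfig {
  databricksHost: string;
  databricksToken: string;
  servingEndpoint: string;
}

interface JudgeInput {
  type: string;
  context: string;
  response: string;
}

interface JudgeResult {
  score: number;
  explanation: string;
  confidence: number;
}

// Hypothetical helper: builds the URL and fetch options for one judge call.
function buildJudgeRequest(cfg: DatabricksJudgeConfig, input: JudgeInput) {
  const host = cfg.databricksHost.replace(/\/+$/, ""); // drop trailing slashes
  const url = `${host}/serving-endpoints/${encodeURIComponent(cfg.servingEndpoint)}/invocations`;
  const prompt = [
    `Rate the ${input.type} of the response with respect to the context.`,
    `Context: ${input.context}`,
    `Response: ${input.response}`,
    'Reply with JSON only: {"score": 0-1, "explanation": string, "confidence": 0-1}',
  ].join("\n");
  return {
    url,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${cfg.databricksToken}`,
        "Content-Type": "application/json",
      },
      // Assumption: the endpoint serves a chat model that accepts `messages`.
      body: JSON.stringify({
        messages: [{ role: "user", content: prompt }],
        temperature: 0,
      }),
    },
  };
}

// Hypothetical helper: extracts the first JSON object from the model's text
// and clamps both numeric fields into the range [0, 1].
function parseJudgeReply(content: string): JudgeResult {
  const start = content.indexOf("{");
  const end = content.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("judge reply contains no JSON object");
  }
  const raw = JSON.parse(content.slice(start, end + 1)) as Record<string, unknown>;
  const clamp01 = (value: unknown): number => {
    const n = Number(value);
    if (!Number.isFinite(n)) throw new Error("judge reply has a non-numeric field");
    return Math.min(1, Math.max(0, n));
  };
  return {
    score: clamp01(raw.score),
    explanation: String(raw.explanation ?? ""),
    confidence: clamp01(raw.confidence),
  };
}
```

The pair would be used as `fetch(url, init)` followed by `parseJudgeReply` on the text of the model's reply. Where that text sits in the response body depends on the model being served, so it is left out of the sketch.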
 
### `BraintrustExporter`
 
Logs evaluation results to Braintrust experiments.
 
```ts
const exporter = new BraintrustExporter("your-braintrust-api-key");
const experiment = await exporter.initExperiment("project-name");
exporter.logResults(experiment, results);
const summary = await exporter.summarize(experiment);
```
 
| Method | Signature | Description |
|--------|-----------|-------------|
| `constructor` | `(apiKey: string)` | Creates exporter with Braintrust API key |
| `initExperiment` | `(projectName: string) => Promise<Experiment>` | Initializes a Braintrust experiment |
| `logResults` | `(experiment: Experiment, results: EvalRunResult[]) => void` | Logs evaluation results to the experiment |
| `summarize` | `(experiment: Experiment) => Promise<string>` | Gets experiment summary |
| `exportAll` | `(config: { projectName: string; results: EvalRunResult[] }) => Promise<{ summary: string }>` | Full flow: init, log, summarize |
 
### `EvalPipeline`
 
Orchestrates the full evaluation pipeline: load golden datasets, score responses via Databricks judge, track costs, evaluate gates, and export to Braintrust.
 
```ts
const pipeline = new EvalPipeline({
  databricksHost: "https://workspace.databricks.com",
  databricksToken: process.env.DATABRICKS_TOKEN,
  databricksServingEndpoint: "my-endpoint",
  braintrustApiKey: process.env.BRAINTRUST_API_KEY,
  goldenDatasetPath: "./golden",
  judgeModel: "databricks-dbrx-instruct",
  budgetPreset: "moderate",
  gatePreset: "standard",
  concurrency: 5,
});
const { results, gateSummary } = await pipeline.runFullEval();
const exitCode = pipeline.getExitCode(gateSummary);
```
 
| Method | Signature | Description |
|--------|-----------|-------------|
| `constructor` | `(config: EvalConfig)` | Creates pipeline with Databricks, Braintrust, and eval configuration |
| `loadGoldenDataset` | `(path: string) => Promise<GoldenTrajectory[]>` | Loads golden trajectories from JSONL files in directory |
| `runSingleEntry` | `(golden: GoldenTrajectory, candidateTrajectory: Trajectory) => Promise<{ result: EvalRunResult; cost: CostBreakdown }>` | Evaluates a single candidate against a golden reference |
| `runFullEval` | `() => Promise<{ results: EvalRunResult[]; gateSummary: GateEvaluationSummary }>` | Runs the full evaluation pipeline |
| `getExitCode` | `(summary: GateEvaluationSummary) => number` | Returns 0 if all gates pass, 1 otherwise |
 
## CI Integration
 
```yaml
# .github/workflows/eval.yml
name: Agent Evaluation
 
on:
  pull_request:
    branches: [main]
 
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "22"
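      # setup-node does not install pnpm; Corepack (bundled with Node 22) provides it
      - run: corepack enable
      - run: pnpm install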
      - name: Run eval pipeline
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
          DATABRICKS_SERVING_ENDPOINT: ${{ secrets.DATABRICKS_SERVING_ENDPOINT }}
          BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
        run: npx tsx src/ci/run-evals.ts --gate-preset standard --exit-code
```
 
## Testing
 
```bash
pnpm test            # vitest run with coverage
pnpm typecheck       # TypeScript type checking
pnpm lint            # ESLint
```
 
## License
 
MIT — see [LICENSE](./LICENSE).