
@reaatech/agent-eval-harness-cli

npm version License: MIT CI

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Command-line interface for the agent-eval-harness ecosystem. Provides 7 subcommands for full evaluation runs, on-the-fly LLM judging, baseline comparison, CI gate checking, golden trajectory management, multi-format reporting, and an MCP server in stdio mode.

Installation

```bash
npm install @reaatech/agent-eval-harness-cli
# or
npm install -g @reaatech/agent-eval-harness-cli
```

Feature Overview

  • 7 subcommands — eval, judge, compare, gate, golden, report, serve
  • Full evaluation pipeline — load trajectories from files or directories, run multi-metric evaluation, output results as JSON or CSV
  • On-the-fly judging — evaluate faithfulness, relevance, tool correctness, or overall quality with a single command
  • CI gate checking — evaluate gate presets (standard, strict, lenient) against results with exit codes for pipeline integration
  • Golden trajectory management — list, create, update, and validate golden reference trajectories
  • Multi-format reporting — JSON, HTML, Markdown, and PDF output for evaluation results
  • MCP server — stdio-mode MCP server exposing all 13 eval tools to AI coding agents

Quick Start

```bash
# Install globally
npm install -g @reaatech/agent-eval-harness-cli

# Run evaluation on a directory of JSONL trajectories
agent-eval-harness eval trajectories/ --config eval-config.yaml --output results/

# Judge a single response on faithfulness
agent-eval-harness judge faithfulness \
  --context "The user's account is associated with email john@example.com" \
  --response "I've sent the password reset to john@example.com"

# Compare two evaluation runs
agent-eval-harness compare results/baseline.json results/candidate.json --format markdown

# Check CI regression gates
agent-eval-harness gate results/results.json --preset standard --exit-code

# List golden trajectories
agent-eval-harness golden --list

# Generate HTML report
agent-eval-harness report results/results.json --format html --output report.html

# Start MCP server
agent-eval-harness serve
```

API Reference

Binary Entry

```
agent-eval-harness [global-options] <command> [command-options]
```

Global Options

| Flag | Type | Default | Description |
|---|---|---|---|
| `-v, --verbose` | boolean | `false` | Enable verbose output |
| `-c, --config <path>` | string | `eval-config.yaml` | Path to configuration file |
| `-o, --output <path>` | string | `results` | Output directory for results |

Subcommand: eval <paths...>

Run full evaluation on trajectory files or directories.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-g, --golden <path>` | string | | Path to golden trajectory for comparison |
| `-m, --metrics <metrics>` | string | | Comma-separated list of metrics to evaluate |
| `--judge-model <model>` | string | `claude-opus` | Model to use for LLM judge |
| `--no-judge` | boolean | `false` | Disable LLM judge evaluation |
| `--budget <budget>` | string | `10.00` | Cost budget limit (USD) |
| `-f, --format <format>` | string | `json` | Output format (`json`, `junit`, `csv`) |

Subcommand: judge <aspect>

Run LLM judge on a specific evaluation aspect.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-t, --trajectory <path>` | string | | Path to trajectory file |
| `--context <text>` | string | | Context for faithfulness evaluation |
| `--response <text>` | string | | Response to evaluate |
| `--intent <text>` | string | | User intent for relevance evaluation |
| `--model <model>` | string | `claude-opus` | Model to use for judging |
| `--calibrated` | boolean | `false` | Use calibrated scores |

Valid aspects: `faithfulness`, `relevance`, `tool_correctness`, `overall`
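To validate the aspect argument before shelling out to the CLI from a wrapper script, you can check it against the documented set. The sketch below is illustrative only (it is not the CLI's actual validation code; `parseAspect` is a hypothetical helper):

```typescript
// Hypothetical helper mirroring the documented set of valid judge aspects.
const VALID_ASPECTS = ["faithfulness", "relevance", "tool_correctness", "overall"] as const;
type Aspect = (typeof VALID_ASPECTS)[number];

function parseAspect(input: string): Aspect {
  // Narrow an arbitrary string to the Aspect union, or fail loudly.
  if ((VALID_ASPECTS as readonly string[]).includes(input)) {
    return input as Aspect;
  }
  throw new Error(`Invalid aspect "${input}". Expected one of: ${VALID_ASPECTS.join(", ")}`);
}
```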

Subcommand: compare <baseline> <candidate>

Compare two evaluation runs.

| Flag | Type | Default | Description |
|---|---|---|---|
| `--statistical` | boolean | `false` | Run statistical significance tests |
| `-f, --format <format>` | string | `json` | Output format (`json`, `markdown`, `table`) |

Subcommand: gate <results>

Check regression gates against evaluation results.

| Flag | Type | Default | Description |
|---|---|---|---|
| `--gates <path>` | string | `gates.yaml` | Path to gate configuration file |
| `--preset <preset>` | string | `standard` | Gate preset (`standard`, `strict`, `lenient`) |
| `--exit-code` | boolean | `true` | Return CI-compatible exit code |

Subcommand: golden

Manage golden reference trajectories.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-l, --list` | boolean | `false` | List all golden trajectories |
| `-c, --create <path>` | string | | Create new golden trajectory from file |
| `-u, --update <id>` | string | | Update existing golden trajectory |
| `-d, --delete <id>` | string | | Delete golden trajectory |
| `--validate <path>` | string | | Validate golden trajectory quality |
| `--dir <path>` | string | `golden` | Golden trajectories directory |

Subcommand: report <results>

Generate evaluation reports.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-f, --format <format>` | string | `markdown` | Output format (`html`, `markdown`, `json`, `pdf`) |
| `-o, --output <path>` | string | | Output file path |
| `--template <path>` | string | | Custom report template |
| `--include-raw` | boolean | `false` | Include raw trajectory data in report |

Subcommand: serve

Start the MCP server.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-p, --port <port>` | string | `3000` | Server port |
| `--host <host>` | string | `localhost` | Server host |
| `--transport <transport>` | string | `http` | Transport type (`http`, `stdio`) |
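To expose the server to an MCP client over stdio, a typical client configuration might look like the following. This is an example of the common `mcpServers` config shape used by MCP clients, not something shipped by this package; the server name `agent-eval-harness` is arbitrary:

```json
{
  "mcpServers": {
    "agent-eval-harness": {
      "command": "agent-eval-harness",
      "args": ["serve", "--transport", "stdio"]
    }
  }
}
```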

Programmatic Use

Command functions and output helpers are available as library exports:

```typescript
import {
  evalCommand,
  judgeCommand,
  compareCommand,
  gateCommand,
  goldenCommand,
  reportCommand,
  cliOut,
  cliError,
  cliWarn,
} from "@reaatech/agent-eval-harness-cli";
```
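The exported `*Options` interfaces correspond to each subcommand's flags. The exact field names are defined by the package; the sketch below only infers a plausible shape from the `eval` subcommand's documented flags, and `EvalOptionsSketch` is a hypothetical stand-in, not the real `EvalOptions` export:

```typescript
// Hypothetical shape inferred from the eval subcommand's flag table.
interface EvalOptionsSketch {
  golden?: string;                     // -g, --golden <path>
  metrics?: string;                    // -m, --metrics <metrics>, comma-separated
  judgeModel?: string;                 // --judge-model <model>
  judge?: boolean;                     // --no-judge would set this to false
  budget?: string;                     // --budget <budget>, in USD
  format?: "json" | "junit" | "csv";   // -f, --format <format>
}

// Equivalent of: eval -m faithfulness,relevance --budget 5.00 -f json
const options: EvalOptionsSketch = {
  metrics: "faithfulness,relevance",
  judgeModel: "claude-opus",
  budget: "5.00",
  format: "json",
};
```

Check the package's own `EvalOptions` type export for the authoritative field names before passing options to `evalCommand`.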

Type Exports

| Type | Description |
|---|---|
| `EvalOptions` | Options interface for `evalCommand` |
| `JudgeOptions` | Options interface for `judgeCommand` |
| `CompareOptions` | Options interface for `compareCommand` |
| `GateOptions` | Options interface for `gateCommand` |
| `GoldenOptions` | Options interface for `goldenCommand` |
| `ReportOptions` | Options interface for `reportCommand` |

Usage Patterns

Using in Docker

```bash
# Build the image
docker build -t agent-eval-harness .

# Run evaluation with mounted volumes
docker run -v ./trajectories:/app/trajectories \
  -v ./results:/app/results \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  agent-eval-harness eval trajectories/ --output results/

# Start MCP server in stdio mode (the default transport is http)
docker run -i agent-eval-harness serve --transport stdio
```

CI Pipeline Integration

Use the gate subcommand in CI workflows to block regressions:

```yaml
# .github/workflows/eval.yml
name: Agent Evaluation

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluation suite
        run: |
          npx @reaatech/agent-eval-harness-cli eval trajectories/ \
            --config eval-config.yaml \
            --output results/

      - name: Run regression gates
        run: |
          npx @reaatech/agent-eval-harness-cli gate results/results.json \
            --preset standard \
            --exit-code

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
```

The `--exit-code` flag causes the command to exit with code 1 when any gate fails, which fails the CI step.

Gate presets provide ready-made thresholds:

| Preset | Overall Quality | Cost Limit | Latency P99 | Tool Correctness | Faithfulness |
|---|---|---|---|---|---|
| `standard` | >= 0.80 | <= $0.05 | <= 5000ms | >= 0.90 | >= 0.80 |
| `strict` | >= 0.90 | <= $0.02 | <= 2000ms | >= 0.95 | >= 0.90 |
| `lenient` | >= 0.60 | <= $0.10 | <= 10000ms | >= 0.70 | >= 0.60 |
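The preset thresholds above reduce to a simple conjunction of comparisons. The sketch below shows how the `standard` preset could be applied to a result summary; it is illustrative only (the field names in `ResultSummary` are hypothetical, and the CLI reads actual results from `results.json`):

```typescript
// Hypothetical result-summary shape for demonstrating the standard preset.
interface ResultSummary {
  overallQuality: number;
  costUsd: number;
  latencyP99Ms: number;
  toolCorrectness: number;
  faithfulness: number;
}

// Every documented "standard" threshold must hold for the gate to pass.
function passesStandardGate(r: ResultSummary): boolean {
  return (
    r.overallQuality >= 0.80 &&
    r.costUsd <= 0.05 &&
    r.latencyP99Ms <= 5000 &&
    r.toolCorrectness >= 0.90 &&
    r.faithfulness >= 0.80
  );
}
```

With `--exit-code`, a `false` result like this is what maps to exit code 1 in the CI step.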

License

MIT