# @reaatech/agent-eval-harness-cli

> **Status:** Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Command-line interface for the agent-eval-harness ecosystem. It provides 7 subcommands covering full evaluation runs, on-the-fly LLM judging, baseline comparison, CI gate checking, golden trajectory management, multi-format reporting, and an MCP server (stdio or HTTP transport).
## Installation

```shell
npm install @reaatech/agent-eval-harness-cli
# or
npm install -g @reaatech/agent-eval-harness-cli
```

## Feature Overview
- **7 subcommands** — `eval`, `judge`, `compare`, `gate`, `golden`, `report`, `serve`
- **Full evaluation pipeline** — load trajectories from files or directories, run multi-metric evaluation, output results as JSON or CSV
- **On-the-fly judging** — evaluate faithfulness, relevance, tool correctness, or overall quality with a single command
- **CI gate checking** — evaluate gate presets (standard, strict, lenient) against results, with exit codes for pipeline integration
- **Golden trajectory management** — list, create, update, validate, and delete golden reference trajectories
- **Multi-format reporting** — JSON, HTML, Markdown, and PDF output for evaluation results
- **MCP server** — MCP server exposing all 13 eval tools to AI coding agents over stdio or HTTP
## Quick Start

```shell
# Install globally
npm install -g @reaatech/agent-eval-harness-cli

# Run evaluation on a directory of JSONL trajectories
agent-eval-harness eval trajectories/ --config eval-config.yaml --output results/

# Judge a single response on faithfulness
agent-eval-harness judge faithfulness \
  --context "The user's account is associated with email john@example.com" \
  --response "I've sent the password reset to john@example.com"

# Compare two evaluation runs
agent-eval-harness compare results/baseline.json results/candidate.json --format markdown

# Check CI regression gates
agent-eval-harness gate results/results.json --preset standard --exit-code

# List golden trajectories
agent-eval-harness golden --list

# Generate HTML report
agent-eval-harness report results/results.json --format html --output report.html

# Start the MCP server
agent-eval-harness serve
```

## API Reference
### Binary Entry

```shell
agent-eval-harness [global-options] <command> [command-options]
```

### Global Options

| Flag | Type | Default | Description |
|---|---|---|---|
| `-v, --verbose` | boolean | `false` | Enable verbose output |
| `-c, --config <path>` | string | `eval-config.yaml` | Path to configuration file |
| `-o, --output <path>` | string | `results` | Output directory for results |
### Subcommand: `eval <paths...>`

Run a full evaluation on trajectory files or directories.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-g, --golden <path>` | string | — | Path to golden trajectory for comparison |
| `-m, --metrics <metrics>` | string | — | Comma-separated list of metrics to evaluate |
| `--judge-model <model>` | string | `claude-opus` | Model to use for the LLM judge |
| `--no-judge` | boolean | `false` | Disable LLM judge evaluation |
| `--budget <budget>` | string | `10.00` | Cost budget limit (USD) |
| `-f, --format <format>` | string | `json` | Output format (`json`, `junit`, `csv`) |
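For example, evaluating a directory against a golden trajectory with a restricted metric set and a tighter budget might look like this (the paths and metric names are illustrative, not shipped defaults):

```shell
agent-eval-harness eval trajectories/checkout-flow/ \
  --golden golden/checkout.json \
  --metrics faithfulness,tool_correctness \
  --budget 5.00 \
  --format junit
```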
### Subcommand: `judge <aspect>`

Run the LLM judge on a specific evaluation aspect.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-t, --trajectory <path>` | string | — | Path to trajectory file |
| `--context <text>` | string | — | Context for faithfulness evaluation |
| `--response <text>` | string | — | Response to evaluate |
| `--intent <text>` | string | — | User intent for relevance evaluation |
| `--model <model>` | string | `claude-opus` | Model to use for judging |
| `--calibrated` | boolean | `false` | Use calibrated scores |

Valid aspects: `faithfulness`, `relevance`, `tool_correctness`, `overall`
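Beyond the faithfulness example in Quick Start, the other aspects combine the same flags; the trajectory path below is illustrative:

```shell
# Relevance: does the response address the stated intent?
agent-eval-harness judge relevance \
  --intent "Cancel my subscription" \
  --response "I've cancelled your subscription effective today."

# Tool correctness and overall quality are judged from a trajectory file
agent-eval-harness judge tool_correctness --trajectory runs/session-01.jsonl
agent-eval-harness judge overall --trajectory runs/session-01.jsonl --calibrated
```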
### Subcommand: `compare <baseline> <candidate>`

Compare two evaluation runs.

| Flag | Type | Default | Description |
|---|---|---|---|
| `--statistical` | boolean | `false` | Run statistical significance tests |
| `-f, --format <format>` | string | `json` | Output format (`json`, `markdown`, `table`) |
### Subcommand: `gate <results>`

Check regression gates against evaluation results.

| Flag | Type | Default | Description |
|---|---|---|---|
| `--gates <path>` | string | `gates.yaml` | Path to gate configuration file |
| `--preset <preset>` | string | `standard` | Gate preset (`standard`, `strict`, `lenient`) |
| `--exit-code` | boolean | `true` | Return CI-compatible exit code |
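`--gates` points at a custom gate configuration file. The exact schema is defined by the harness and is not documented here, so treat this sketch as a guess at its shape (all key names are assumptions) rather than a reference:

```yaml
# gates.yaml — hypothetical shape; verify against the harness documentation
gates:
  overall_quality:
    min: 0.80
  cost_usd:
    max: 0.05
  tool_correctness:
    min: 0.90
```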
### Subcommand: `golden`

Manage golden reference trajectories.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-l, --list` | boolean | `false` | List all golden trajectories |
| `-c, --create <path>` | string | — | Create a new golden trajectory from a file |
| `-u, --update <id>` | string | — | Update an existing golden trajectory |
| `-d, --delete <id>` | string | — | Delete a golden trajectory |
| `--validate <path>` | string | — | Validate golden trajectory quality |
| `--dir <path>` | string | `golden` | Golden trajectories directory |
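A typical lifecycle promotes a reviewed trajectory to golden status and validates it before use; the paths below are illustrative:

```shell
# Promote a reviewed trajectory to a golden reference
agent-eval-harness golden --create runs/session-01.jsonl --dir golden/

# Validate its quality, then list what's stored
agent-eval-harness golden --validate golden/session-01.json
agent-eval-harness golden --list --dir golden/
```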
### Subcommand: `report <results>`

Generate evaluation reports.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-f, --format <format>` | string | `markdown` | Output format (`html`, `markdown`, `json`, `pdf`) |
| `-o, --output <path>` | string | — | Output file path |
| `--template <path>` | string | — | Custom report template |
| `--include-raw` | boolean | `false` | Include raw trajectory data in report |
### Subcommand: `serve`

Start the MCP server.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-p, --port <port>` | string | `3000` | Server port |
| `--host <host>` | string | `localhost` | Server host |
| `--transport <transport>` | string | `http` | Transport type (`http`, `stdio`) |
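To attach an MCP client over stdio, a client configuration along these lines should work. The `mcpServers` shape follows the common MCP client convention (e.g. Claude Desktop); adjust the key names for your specific client:

```json
{
  "mcpServers": {
    "agent-eval-harness": {
      "command": "agent-eval-harness",
      "args": ["serve", "--transport", "stdio"]
    }
  }
}
```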
## Programmatic Use

Command functions and output helpers are available as library exports:

```typescript
import {
  evalCommand,
  judgeCommand,
  compareCommand,
  gateCommand,
  goldenCommand,
  reportCommand,
  cliOut,
  cliError,
  cliWarn,
} from "@reaatech/agent-eval-harness-cli";
```

### Type Exports
| Type | Description |
|---|---|
| `EvalOptions` | Options interface for `evalCommand` |
| `JudgeOptions` | Options interface for `judgeCommand` |
| `CompareOptions` | Options interface for `compareCommand` |
| `GateOptions` | Options interface for `gateCommand` |
| `GoldenOptions` | Options interface for `goldenCommand` |
| `ReportOptions` | Options interface for `reportCommand` |
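A minimal sketch of driving a run from code, assuming each command function takes its positional input plus an options object. The exact signatures and option fields are not documented here, so verify this against the package's type declarations before relying on it:

```typescript
import { evalCommand, gateCommand, cliError } from "@reaatech/agent-eval-harness-cli";
import type { EvalOptions, GateOptions } from "@reaatech/agent-eval-harness-cli";

// Hypothetical option fields — they mirror the CLI flags above, but
// confirm the real shapes in EvalOptions and GateOptions.
const evalOpts = { format: "json", budget: "10.00" } as EvalOptions;
const gateOpts = { preset: "standard", exitCode: true } as GateOptions;

try {
  await evalCommand(["trajectories/"], evalOpts);
  await gateCommand("results/results.json", gateOpts);
} catch (err) {
  cliError(`evaluation failed: ${err}`);
}
```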
## Usage Patterns

### Using in Docker

```shell
# Build the image
docker build -t agent-eval-harness .

# Run evaluation with mounted volumes (docker run requires absolute host paths)
docker run -v "$(pwd)/trajectories:/app/trajectories" \
  -v "$(pwd)/results:/app/results" \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  agent-eval-harness eval trajectories/ --output results/

# Start the MCP server in stdio mode
docker run -i agent-eval-harness serve --transport stdio
```

### CI Pipeline Integration
Use the `gate` subcommand in CI workflows to block regressions:

```yaml
# .github/workflows/eval.yml
name: Agent Evaluation

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluation suite
        run: |
          npx @reaatech/agent-eval-harness-cli eval trajectories/ \
            --config eval-config.yaml \
            --output results/

      - name: Run regression gates
        run: |
          npx @reaatech/agent-eval-harness-cli gate results/results.json \
            --preset standard \
            --exit-code

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
```

The `--exit-code` flag causes the command to exit with code 1 when any gate fails, which fails the CI step.
Gate presets provide ready-made thresholds:

| Preset | Overall Quality | Cost Limit | Latency P99 | Tool Correctness | Faithfulness |
|---|---|---|---|---|---|
| `standard` | >= 0.80 | <= $0.05 | <= 5000 ms | >= 0.90 | >= 0.80 |
| `strict` | >= 0.90 | <= $0.02 | <= 2000 ms | >= 0.95 | >= 0.90 |
| `lenient` | >= 0.60 | <= $0.10 | <= 10000 ms | >= 0.70 | >= 0.60 |
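To make the preset semantics concrete, here is an illustrative sketch of how the `standard` thresholds above translate into pass/fail logic. This is not the package's implementation, and the result-field names (`overall_quality`, `cost_usd`, etc.) are assumptions for the sketch:

```python
# Thresholds from the "standard" preset row above.
# ("min", x) means the metric must be >= x; ("max", x) means <= x.
STANDARD_PRESET = {
    "overall_quality": ("min", 0.80),
    "cost_usd": ("max", 0.05),
    "latency_p99_ms": ("max", 5000),
    "tool_correctness": ("min", 0.90),
    "faithfulness": ("min", 0.80),
}

def check_gates(results: dict, preset: dict) -> list[str]:
    """Return the names of failed gates; an empty list means all gates pass."""
    failures = []
    for metric, (direction, threshold) in preset.items():
        value = results[metric]
        ok = value >= threshold if direction == "min" else value <= threshold
        if not ok:
            failures.append(metric)
    return failures

run = {
    "overall_quality": 0.85,
    "cost_usd": 0.07,        # over the $0.05 cost limit
    "latency_p99_ms": 4200,
    "tool_correctness": 0.93,
    "faithfulness": 0.88,
}
print(check_gates(run, STANDARD_PRESET))  # -> ['cost_usd']
```

A failing list maps naturally onto the CLI's `--exit-code` behavior: non-empty means exit 1.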
Related Packages
| Package | Description |
|---|---|
| @reaatech/agent-eval-harness-types | Shared domain types and schemas |
| @reaatech/agent-eval-harness-trajectory | Trajectory evaluation |
| @reaatech/agent-eval-harness-tool-use | Tool-use validation |
| @reaatech/agent-eval-harness-cost | Cost tracking |
| @reaatech/agent-eval-harness-latency | Latency monitoring |
| @reaatech/agent-eval-harness-judge | LLM-as-judge |
| @reaatech/agent-eval-harness-golden | Golden trajectories |
| @reaatech/agent-eval-harness-suite | Suite runner |
| @reaatech/agent-eval-harness-gate | CI gates |
| @reaatech/agent-eval-harness-mcp-server | MCP server |
| @reaatech/agent-eval-harness-observability | Observability |
