@reaatech/agent-eval-harness-suite
Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.
Orchestrated evaluation suite runner with results aggregation and run comparison. Executes multi-metric evaluations across trajectory batches with configurable concurrency, YAML-driven configuration, and statistical comparison between runs.
Installation
```bash
npm install @reaatech/agent-eval-harness-suite
```

Feature Overview
- Batch evaluation — run evaluations across hundreds of trajectories with configurable parallel workers
- YAML-driven config — declare metrics, judge models, budget limits, and gate configs in a single file
- Multi-metric scoring — aggregates faithfulness, relevance, tool correctness, cost, latency, coherence, and goal completion into an overall score
- Results aggregation — exports to JSON, JUnit XML, CSV, and Markdown with per-metric breakdowns
- Run comparison — statistical comparison between baseline and candidate runs with regression detection
- Threshold checking — validate results against configurable per-metric thresholds
- Progress tracking — real-time progress callbacks for long-running suites
Quick Start
```ts
import { SuiteRunner, parseConfig, createResultsAggregator } from '@reaatech/agent-eval-harness-suite';
import { evaluate } from '@reaatech/agent-eval-harness-trajectory';
import type { Trajectory } from '@reaatech/agent-eval-harness-types';

const config = parseConfig(`
metrics:
  - faithfulness
  - relevance
  - cost
  - latency
judge_model: claude-opus
budget_limit: 10.00
parallel_workers: 4
`);

// trajectories is a Trajectory[] loaded or constructed elsewhere
const runner = new SuiteRunner(config);
const result = await runner.run(trajectories, evaluate);
console.log(`Overall: ${result.overallMetrics.overallScore}, Pass rate: ${result.summary.passRate}`);
```

API Reference
Suite Runner
| Name | Type | Description |
|---|---|---|
| SuiteRunner | class | Orchestrates batch evaluation with configurable concurrency, timeout, error handling, and progress callbacks |
| createSuiteRunner(config?) | function | Factory: returns a new SuiteRunner instance with optional partial config |
SuiteRunner constructor accepts config?: Partial<SuiteRunnerConfig> and an optional progressCallback. The run(trajectories, evaluator) method executes evaluations in concurrent batches and returns EvalRunResult.
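A minimal sketch of a runner with a progress callback, assuming the callback is passed as the second constructor argument and that `trajectories` is provided by the caller:

```ts
import { SuiteRunner } from '@reaatech/agent-eval-harness-suite';
import type { ProgressUpdate } from '@reaatech/agent-eval-harness-suite';
import { evaluate } from '@reaatech/agent-eval-harness-trajectory';
import type { Trajectory } from '@reaatech/agent-eval-harness-types';

declare const trajectories: Trajectory[]; // loaded elsewhere

// Partial<SuiteRunnerConfig>: unspecified fields fall back to defaults
const runner = new SuiteRunner(
  { concurrency: 8, continueOnError: true, timeoutMs: 60_000 },
  (update: ProgressUpdate) => {
    // Fires as batches complete; useful for long-running suites
    console.log(`${update.completed}/${update.total} trajectories (${update.status})`);
  },
);

const result = await runner.run(trajectories, evaluate);
console.log(`Run ${result.runId}: ${result.status} in ${result.durationMs} ms`);
```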
Configuration
| Name | Type | Description |
|---|---|---|
| parseConfig(yamlString) | function | Parse a YAML configuration string into a SuiteConfig object |
| validateConfig(config) | function | Validate a SuiteConfig; returns { valid, errors } — checks weights sum to 1.0, threshold ranges, required fields |
| createDefaultConfig(name) | function | Create a default SuiteConfig with all five standard metrics pre-configured |
| mergeConfig(partial) | function | Merge a partial config object with sensible defaults |
| calculateOverallScore(metricScores, config) | function | Weighted composite score from per-metric scores using config weights |
| checkThresholds(metricScores, config) | function | Verify all enabled metric thresholds are met; returns { passed, failures } |
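The helpers above compose as follows. This is a sketch; the exact shape of `metricScores` and of the `errors`/`failures` arrays is an assumption:

```ts
import {
  createDefaultConfig,
  validateConfig,
  calculateOverallScore,
  checkThresholds,
} from '@reaatech/agent-eval-harness-suite';

// Start from the five standard metrics and validate before running
const config = createDefaultConfig('nightly-agent-suite');
const { valid, errors } = validateConfig(config);
if (!valid) {
  throw new Error(`Invalid suite config: ${JSON.stringify(errors)}`);
}

// Hypothetical per-metric scores for one trajectory (shape assumed)
const metricScores = { faithfulness: 0.92, relevance: 0.88, cost: 0.75, latency: 0.81 };

const overall = calculateOverallScore(metricScores, config);
const { passed, failures } = checkThresholds(metricScores, config);
console.log({ overall, passed, failures });
```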
Results Aggregation
| Name | Type | Description |
|---|---|---|
| ResultsAggregator | class | Aggregates raw run results into structured breakdowns with export methods |
| createResultsAggregator(config) | function | Factory: returns a new ResultsAggregator for the given SuiteConfig |
ResultsAggregator methods:
| Method | Returns | Description |
|---|---|---|
| aggregate(runResult) | AggregatedResults | Compute per-metric breakdowns, trajectory results, and summary statistics |
| exportJSON(results) | string | Export aggregated results as formatted JSON |
| exportJUnit(results) | string | Export as JUnit XML for CI test reporters |
| exportCSV(results) | string | Export as CSV with one row per trajectory |
| exportMarkdown(results) | string | Export as Markdown with summary table and per-metric breakdown |
| export(results, format) | string | Export in any supported format (json, junit, csv, or markdown) |
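A sketch of aggregation and export, assuming `config` is the suite's SuiteConfig and `runResult` is the EvalRunResult returned by `SuiteRunner.run()`; the output file names are illustrative:

```ts
import { writeFileSync } from 'node:fs';
import { createResultsAggregator } from '@reaatech/agent-eval-harness-suite';
import type { SuiteConfig, EvalRunResult } from '@reaatech/agent-eval-harness-suite';

declare const config: SuiteConfig;       // suite configuration used for the run
declare const runResult: EvalRunResult;  // returned by SuiteRunner.run()

const aggregator = createResultsAggregator(config);
const results = aggregator.aggregate(runResult);

// Write whichever formats your pipeline consumes
writeFileSync('eval-results.json', aggregator.exportJSON(results));
writeFileSync('eval-results.junit.xml', aggregator.exportJUnit(results));
writeFileSync('eval-results.md', aggregator.export(results, 'markdown'));

console.log(`Pass rate: ${results.summary.passRate}`);
```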
Run Comparison
| Name | Type | Description |
|---|---|---|
| RunComparator | class | Statistical comparison engine for two evaluation runs |
| createRunComparator(significanceLevel?, minEffectSize?) | function | Factory with configurable significance alpha (default 0.05) and minimum effect size (default 0.1) |
RunComparator methods:
| Method | Returns | Description |
|---|---|---|
| compare(baseline, candidate) | RunComparisonResult | Full comparison with metric diffs, statistical significance, regressions, improvements, and verdict |
| generateVisualizationData(comparison) | VisualizationData | Generate bar chart, waterfall, and heatmap data for chart rendering |
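A sketch of a baseline/candidate comparison, assuming `compare()` accepts the two runs' EvalRunResult objects:

```ts
import { createRunComparator } from '@reaatech/agent-eval-harness-suite';
import type { EvalRunResult } from '@reaatech/agent-eval-harness-suite';

declare const baselineRun: EvalRunResult;  // e.g. loaded from a previous run
declare const candidateRun: EvalRunResult; // the run under test

// alpha = 0.05, minimum effect size = 0.1 (the documented defaults)
const comparator = createRunComparator(0.05, 0.1);
const comparison = comparator.compare(baselineRun, candidateRun);

if (comparison.regressions.length > 0) {
  console.error('Regressions detected:', comparison.regressions);
  process.exitCode = 1;
}

// Optional chart data (bar chart, waterfall, heatmap)
console.log(comparator.generateVisualizationData(comparison));
```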
Types
| Name | Type | Description |
|---|---|---|
| SuiteConfig | interface | Top-level suite configuration: name, metrics, judge, goldenPath, baseline, output |
| MetricConfig | interface | Per-metric config: name, enabled, weight, threshold, config |
| JudgeConfig | interface | Judge settings: model, provider, budgetLimit, calibrationEnabled |
| OutputConfig | interface | Output settings: formats, directory, includeDetails |
| SuiteRunnerConfig | interface | Runtime config: concurrency, continueOnError, timeoutMs, metrics |
| EvalRunResult | interface | Full run result: runId, status, totalTrajectories, trajectoryResults[], overallMetrics, durationMs |
| OverallMetrics | interface | Aggregate scores: overallScore, avgFaithfulness, avgRelevance, toolCorrectnessRate, avgCostPerTask, latencyP50/P90/P99, slaViolations |
| ProgressUpdate | interface | Real-time progress: runId, status, progress, completed, total, currentTrajectory |
| AggregatedResults | interface | Full aggregation: runId, config, overallMetrics, metricBreakdown, trajectoryResults[], summary, timestamp |
| MetricBreakdown | interface | Per-metric stats: name, avgScore, minScore, maxScore, stdDev, passRate, weight |
| TrajectoryResult | interface | Per-trajectory: trajectoryId, overallScore, metricScores, passed, errors |
| SummaryStatistics | interface | Aggregate counts: totalTrajectories, passedTrajectories, failedTrajectories, passRate, overallPassed, durationMs |
| RunComparisonResult | interface | Comparison output: scoreDiff, metricDiffs[], statisticalSignificance, regressions[], improvements[], summary |
| MetricDiff | interface | Per-metric change: metric, baseline, candidate, diff, percentChange, effectSize (Cohen's d) |
| StatisticalResult | interface | Significance test: test, pValue, confidenceInterval, significant, alpha |
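For orientation, an illustrative SuiteConfig literal assembled from the interfaces above; which fields are optional, and the valid provider and format identifiers, are assumptions:

```ts
import type { SuiteConfig } from '@reaatech/agent-eval-harness-suite';

const config: SuiteConfig = {
  name: 'checkout-agent-nightly',
  metrics: [
    // Weights sum to 1.0, as validateConfig requires
    { name: 'faithfulness', enabled: true, weight: 0.4, threshold: 0.8 },
    { name: 'relevance', enabled: true, weight: 0.3, threshold: 0.7 },
    { name: 'cost', enabled: true, weight: 0.15, threshold: 0.6 },
    { name: 'latency', enabled: true, weight: 0.15, threshold: 0.6 },
  ],
  judge: {
    model: 'claude-opus',
    provider: 'anthropic', // assumed provider identifier
    budgetLimit: 10.0,
    calibrationEnabled: true,
  },
  output: {
    formats: ['json', 'junit'],
    directory: './eval-output',
    includeDetails: true,
  },
};
```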
Related Packages
| Package | Description |
|---|---|
| @reaatech/agent-eval-harness-types | Shared domain types and Zod schemas |
| @reaatech/agent-eval-harness-trajectory | Trajectory loading, evaluation, and golden comparison |
| @reaatech/agent-eval-harness-tool-use | Tool-use validation and schema compliance |
| @reaatech/agent-eval-harness-cost | Cost tracking, budgets, and reporting |
| @reaatech/agent-eval-harness-latency | Latency monitoring, SLA enforcement, and optimization |
| @reaatech/agent-eval-harness-judge | LLM-as-judge with calibration and consensus |
| @reaatech/agent-eval-harness-golden | Golden trajectory management and curation |
| @reaatech/agent-eval-harness-suite | Suite runner, results aggregation, and comparison |
| @reaatech/agent-eval-harness-gate | CI regression gates with JUnit and GitHub output |
| @reaatech/agent-eval-harness-mcp-server | MCP server with three-layer tool architecture |
| @reaatech/agent-eval-harness-cli | Command-line interface |
| @reaatech/agent-eval-harness-observability | OTel tracing, metrics, structured logging, and dashboards |
