# @reaatech/agent-eval-harness-suite


Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Orchestrated evaluation suite runner with results aggregation and run comparison. Executes multi-metric evaluations across trajectory batches with configurable concurrency, YAML-driven configuration, and statistical comparison between runs.

## Installation

```bash
npm install @reaatech/agent-eval-harness-suite
```

## Feature Overview

- Batch evaluation — run evaluations across hundreds of trajectories with configurable parallel workers
- YAML-driven config — declare metrics, judge models, budget limits, and gate configs in a single file
- Multi-metric scoring — aggregates faithfulness, relevance, tool correctness, cost, latency, coherence, and goal completion into an overall score
- Results aggregation — exports to JSON, JUnit XML, CSV, and Markdown with per-metric breakdowns
- Run comparison — statistical comparison between baseline and candidate runs with regression detection
- Threshold checking — validate results against configurable per-metric thresholds
- Progress tracking — real-time progress callbacks for long-running suites

## Quick Start

```typescript
import { SuiteRunner, parseConfig, createResultsAggregator } from '@reaatech/agent-eval-harness-suite';
import { evaluate } from '@reaatech/agent-eval-harness-trajectory';
import type { Trajectory } from '@reaatech/agent-eval-harness-types';

const config = parseConfig(`
metrics:
  - faithfulness
  - relevance
  - cost
  - latency
judge_model: claude-opus
budget_limit: 10.00
parallel_workers: 4
`);

// Your recorded agent trajectories to evaluate.
const trajectories: Trajectory[] = [];

const runner = new SuiteRunner(config);
const result = await runner.run(trajectories, evaluate);
console.log(`Overall: ${result.overallMetrics.overallScore}, Pass rate: ${result.summary.passRate}`);
```

## API Reference

### Suite Runner

| Name | Type | Description |
| --- | --- | --- |
| `SuiteRunner` | class | Orchestrates batch evaluation with configurable concurrency, timeout, error handling, and progress callbacks |
| `createSuiteRunner(config?)` | function | Factory: returns a new `SuiteRunner` instance with optional partial config |

The `SuiteRunner` constructor accepts `config?: Partial<SuiteRunnerConfig>` and an optional `progressCallback`. The `run(trajectories, evaluator)` method executes evaluations in concurrent batches and returns an `EvalRunResult`.
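
A minimal sketch of wiring up a progress callback; the config fields follow `SuiteRunnerConfig` below, and the assumptions that `ProgressUpdate` is exported from this package and that `progress` is a fraction are ours:

```typescript
import { SuiteRunner } from '@reaatech/agent-eval-harness-suite';
import type { ProgressUpdate } from '@reaatech/agent-eval-harness-suite';

// Runtime options per SuiteRunnerConfig; the callback fires as trajectories complete.
const runner = new SuiteRunner(
  { concurrency: 8, continueOnError: true, timeoutMs: 120_000 },
  (update: ProgressUpdate) => {
    // Assumption: progress is a fraction in [0, 1].
    const pct = Math.round(update.progress * 100);
    console.log(`[${update.status}] ${pct}% (${update.completed}/${update.total})`);
  },
);
```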

### Configuration

| Name | Type | Description |
| --- | --- | --- |
| `parseConfig(yamlString)` | function | Parse a YAML configuration string into a `SuiteConfig` object |
| `validateConfig(config)` | function | Validate a `SuiteConfig`; returns `{ valid, errors }` — checks that weights sum to 1.0, threshold ranges, and required fields |
| `createDefaultConfig(name)` | function | Create a default `SuiteConfig` with all five standard metrics pre-configured |
| `mergeConfig(partial)` | function | Merge a partial config object with sensible defaults |
| `calculateOverallScore(metricScores, config)` | function | Compute a weighted composite score from per-metric scores using config weights |
| `checkThresholds(metricScores, config)` | function | Verify all enabled metric thresholds are met; returns `{ passed, failures }` |
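
For example, a validate-then-gate flow might look like the sketch below; the shape of `metricScores` (metric names mapped to scores in [0, 1]) is an assumption:

```typescript
import { parseConfig, validateConfig, checkThresholds } from '@reaatech/agent-eval-harness-suite';

const config = parseConfig(`
metrics:
  - faithfulness
  - relevance
judge_model: claude-opus
`);

// Fail fast on a malformed config before spending any evaluation budget.
const { valid, errors } = validateConfig(config);
if (!valid) throw new Error(`Invalid suite config: ${errors.join('; ')}`);

// Assumed shape: per-metric scores keyed by metric name, in [0, 1].
const metricScores = { faithfulness: 0.92, relevance: 0.88 };
const { passed, failures } = checkThresholds(metricScores, config);
if (!passed) console.warn('Threshold failures:', failures);
```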

### Results Aggregation

| Name | Type | Description |
| --- | --- | --- |
| `ResultsAggregator` | class | Aggregates raw run results into structured breakdowns with export methods |
| `createResultsAggregator(config)` | function | Factory: returns a new `ResultsAggregator` for the given `SuiteConfig` |

`ResultsAggregator` methods:

| Method | Returns | Description |
| --- | --- | --- |
| `aggregate(runResult)` | `AggregatedResults` | Compute per-metric breakdowns, trajectory results, and summary statistics |
| `exportJSON(results)` | `string` | Export aggregated results as formatted JSON |
| `exportJUnit(results)` | `string` | Export as JUnit XML for CI test reporters |
| `exportCSV(results)` | `string` | Export as CSV with one row per trajectory |
| `exportMarkdown(results)` | `string` | Export as Markdown with a summary table and per-metric breakdown |
| `export(results, format)` | `string` | Export in any supported format (`json`, `junit`, `csv`, or `markdown`) |
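
Continuing from the Quick Start (`config` and `result` in scope), exporting results might look like this; the output file names are illustrative:

```typescript
import { writeFileSync } from 'node:fs';
import { createResultsAggregator } from '@reaatech/agent-eval-harness-suite';

// Aggregate the raw EvalRunResult into per-metric breakdowns and summaries.
const aggregator = createResultsAggregator(config);
const aggregated = aggregator.aggregate(result);

// Write machine-readable artifacts and print a human-readable report.
writeFileSync('eval-results.json', aggregator.exportJSON(aggregated));
writeFileSync('eval-results.junit.xml', aggregator.export(aggregated, 'junit'));
console.log(aggregator.exportMarkdown(aggregated));
```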

### Run Comparison

| Name | Type | Description |
| --- | --- | --- |
| `RunComparator` | class | Statistical comparison engine for two evaluation runs |
| `createRunComparator(significanceLevel?, minEffectSize?)` | function | Factory with configurable significance alpha (default 0.05) and minimum effect size (default 0.1) |

`RunComparator` methods:

| Method | Returns | Description |
| --- | --- | --- |
| `compare(baseline, candidate)` | `RunComparisonResult` | Full comparison with metric diffs, statistical significance, regressions, improvements, and a verdict |
| `generateVisualizationData(comparison)` | `VisualizationData` | Generate bar chart, waterfall, and heatmap data for chart rendering |
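
A sketch of regression detection between two runs; `baselineRun` and `candidateRun` stand for `EvalRunResult` values from two `runner.run()` calls, and the assumption that `regressions[]` entries follow the `MetricDiff` shape listed under Types is ours:

```typescript
import { createRunComparator } from '@reaatech/agent-eval-harness-suite';

// Alpha 0.05 and minimum effect size 0.1 are the documented defaults.
const comparator = createRunComparator(0.05, 0.1);

// baselineRun and candidateRun: EvalRunResult values from two runner.run() calls.
const comparison = comparator.compare(baselineRun, candidateRun);

// Assumption: regressions[] entries follow the MetricDiff interface.
for (const diff of comparison.regressions) {
  console.warn(
    `Regression in ${diff.metric}: ${diff.baseline} -> ${diff.candidate} ` +
      `(${diff.percentChange.toFixed(1)}%, d=${diff.effectSize.toFixed(2)})`,
  );
}
```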

### Types

| Name | Type | Description |
| --- | --- | --- |
| `SuiteConfig` | interface | Top-level suite configuration: name, metrics, judge, goldenPath, baseline, output |
| `MetricConfig` | interface | Per-metric config: name, enabled, weight, threshold, config |
| `JudgeConfig` | interface | Judge settings: model, provider, budgetLimit, calibrationEnabled |
| `OutputConfig` | interface | Output settings: formats, directory, includeDetails |
| `SuiteRunnerConfig` | interface | Runtime config: concurrency, continueOnError, timeoutMs, metrics |
| `EvalRunResult` | interface | Full run result: runId, status, totalTrajectories, trajectoryResults[], overallMetrics, durationMs |
| `OverallMetrics` | interface | Aggregate scores: overallScore, avgFaithfulness, avgRelevance, toolCorrectnessRate, avgCostPerTask, latencyP50/P90/P99, slaViolations |
| `ProgressUpdate` | interface | Real-time progress: runId, status, progress, completed, total, currentTrajectory |
| `AggregatedResults` | interface | Full aggregation: runId, config, overallMetrics, metricBreakdown, trajectoryResults[], summary, timestamp |
| `MetricBreakdown` | interface | Per-metric stats: name, avgScore, minScore, maxScore, stdDev, passRate, weight |
| `TrajectoryResult` | interface | Per-trajectory: trajectoryId, overallScore, metricScores, passed, errors |
| `SummaryStatistics` | interface | Aggregate counts: totalTrajectories, passedTrajectories, failedTrajectories, passRate, overallPassed, durationMs |
| `RunComparisonResult` | interface | Comparison output: scoreDiff, metricDiffs[], statisticalSignificance, regressions[], improvements[], summary |
| `MetricDiff` | interface | Per-metric change: metric, baseline, candidate, diff, percentChange, effectSize (Cohen's d) |
| `StatisticalResult` | interface | Significance test: test, pValue, confidenceInterval, significant, alpha |
## Related Packages

| Package | Description |
| --- | --- |
| `@reaatech/agent-eval-harness-types` | Shared domain types and Zod schemas |
| `@reaatech/agent-eval-harness-trajectory` | Trajectory loading, evaluation, and golden comparison |
| `@reaatech/agent-eval-harness-tool-use` | Tool-use validation and schema compliance |
| `@reaatech/agent-eval-harness-cost` | Cost tracking, budgets, and reporting |
| `@reaatech/agent-eval-harness-latency` | Latency monitoring, SLA enforcement, and optimization |
| `@reaatech/agent-eval-harness-judge` | LLM-as-judge with calibration and consensus |
| `@reaatech/agent-eval-harness-golden` | Golden trajectory management and curation |
| `@reaatech/agent-eval-harness-suite` | Suite runner, results aggregation, and comparison |
| `@reaatech/agent-eval-harness-gate` | CI regression gates with JUnit and GitHub output |
| `@reaatech/agent-eval-harness-mcp-server` | MCP server with three-layer tool architecture |
| `@reaatech/agent-eval-harness-cli` | Command-line interface |
| `@reaatech/agent-eval-harness-observability` | OTel tracing, metrics, structured logging, and dashboards |

## License

MIT