# @reaatech/agent-eval-harness-gate
Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.
CI/CD regression gates for AI agent evaluation. Define quality, cost, latency, and correctness thresholds that block merges when agents regress. Outputs JUnit XML for test reporters, GitHub Actions annotations for PR comments, and structured JSON for dashboards.
## Installation

```bash
npm install @reaatech/agent-eval-harness-gate
```
## Feature Overview

- Threshold gates — overall quality, faithfulness, relevance, tool correctness, cost, latency, pass rate, SLA violations
- Baseline comparison gates — no-regression, improvement-required, statistical significance, per-metric regression
- Three presets — standard (quality >= 0.80), strict (quality >= 0.90), lenient (quality >= 0.60)
- Custom gate functions — programmatic gates with access to full results and comparison data
- CI integration — JUnit XML output, GitHub Actions annotations, step outputs, PR comments
- Result caching — configurable TTL caching to speed repeated evaluations
## Quick Start

```typescript
import { createGateEngine, getStandardPreset, CIIntegration } from '@reaatech/agent-eval-harness-gate';

const engine = createGateEngine(getStandardPreset().gates);
const results = await getAggregatedResults();
const summary = engine.evaluate(results);

console.log(`Passed: ${summary.overallPassed}, ${summary.passedGates}/${summary.totalGates} gates`);
console.log(`Exit code: ${CIIntegration.getExitCode(summary)}`);
```
## API Reference

### GateEngine

| Method | Signature | Description |
| --- | --- | --- |
| `evaluate` | `(results: AggregatedResults, comparison?: RunComparisonResult) => GateEvaluationSummary` | Evaluate all gates against results |
| `clearCache` | `() => void` | Clear the result cache |
| `getGates` | `() => GateDefinition[]` | Get all registered gates |
| `addGate` | `(gate: GateDefinition) => void` | Add a gate dynamically |
| `removeGate` | `(name: string) => void` | Remove a gate by name |

Factory: `createGateEngine(gates: GateDefinition[], cacheTTL?: number): GateEngine`
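A quick sketch of the engine lifecycle, reusing the placeholder results loader from the Quick Start; treating the cache TTL as milliseconds is an assumption:

```typescript
import {
  createGateEngine,
  createOverallQualityGate,
  createCostGate,
} from '@reaatech/agent-eval-harness-gate';

// One initial gate plus a result cache (TTL unit assumed to be milliseconds).
const engine = createGateEngine([createOverallQualityGate(0.85)], 60_000);

// Gates can also be registered after construction.
engine.addGate(createCostGate(0.03));
console.log(engine.getGates().length); // 2

const summary = engine.evaluate(await getAggregatedResults()); // placeholder loader, as in the Quick Start
engine.clearCache();
```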
### Threshold Gate Builders

| Builder | Default | Description |
| --- | --- | --- |
| `createOverallQualityGate(threshold?)` | 0.8 | Overall quality score >= threshold |
| `createFaithfulnessGate(threshold?)` | 0.8 | Faithfulness score >= threshold |
| `createRelevanceGate(threshold?)` | 0.8 | Relevance score >= threshold |
| `createToolCorrectnessGate(threshold?)` | 0.9 | Tool correctness rate >= threshold |
| `createCostGate(maxCost?)` | 0.05 | Cost per task <= maxCost |
| `createLatencyGate(maxLatencyMs?)` | 5000 | P99 latency <= maxLatencyMs |
| `createPassRateGate(minPassRate?)` | 0.95 | Pass rate >= minPassRate |
| `createSLAViolationsGate(maxViolations?)` | 0 | SLA violations <= maxViolations |
| `buildThresholdGates(config)` | — | Build gates from a config object |
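Builders can be mixed freely and passed straight to `createGateEngine`. A small sketch keeping one default threshold and overriding two others:

```typescript
import {
  createGateEngine,
  createToolCorrectnessGate,
  createLatencyGate,
  createPassRateGate,
} from '@reaatech/agent-eval-harness-gate';

// Keep the default tool-correctness threshold but tighten latency and pass rate.
const engine = createGateEngine([
  createToolCorrectnessGate(), // default 0.9
  createLatencyGate(3000),     // P99 latency <= 3000 ms
  createPassRateGate(0.97),    // pass rate >= 97%
]);
```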
### Presets

| Preset | Function | Quality | Faithfulness | Relevance | Tool Correctness | Cost | Latency P99 | Pass Rate | SLA Violations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard | `getStandardPreset()` | >= 0.80 | >= 0.80 | >= 0.80 | >= 0.90 | <= $0.05 | <= 5000ms | >= 95% | — |
| Strict | `getStrictPreset()` | >= 0.90 | >= 0.90 | >= 0.90 | >= 0.95 | <= $0.02 | <= 2000ms | >= 99% | <= 0 |
| Lenient | `getLenientPreset()` | >= 0.60 | >= 0.60 | >= 0.60 | >= 0.70 | <= $0.10 | <= 10000ms | — | — |
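Presets can also serve as a starting point rather than being used as-is. The sketch below assumes each preset exposes a `gates` array, as the Quick Start does for `getStandardPreset()`:

```typescript
import {
  createGateEngine,
  getLenientPreset,
  createPassRateGate,
} from '@reaatech/agent-eval-harness-gate';

// The lenient preset has no pass-rate gate, so one is appended on top of it.
const engine = createGateEngine([
  ...getLenientPreset().gates,
  createPassRateGate(0.9),
]);
```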
### Baseline Gate Builders

| Builder | Description |
| --- | --- |
| `createNoRegressionGate()` | Fail if any regression detected vs baseline |
| `createImprovementGate(minImprovement?)` | Require minimum overall score improvement |
| `createSignificanceGate(alpha?)` | Require statistical significance (default α = 0.05) |
| `createMetricRegressionGate(metric, allowDecline?)` | Per-metric regression gate with tolerance |
| `getBaselinePreset()` | Returns [noRegression, improvement(0)] |
| `getStrictBaselinePreset()` | Returns [noRegression, improvement(0.05), significance(0.05), metricRegression × 3] |
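Baseline gates only make sense when comparison data is passed as the optional second argument to `evaluate()`. A sketch, where `compareAgainstBaseline()` is a hypothetical helper standing in for whatever produces your `RunComparisonResult`, and the `'faithfulness'` metric key is an assumption:

```typescript
import {
  createGateEngine,
  createNoRegressionGate,
  createImprovementGate,
  createMetricRegressionGate,
} from '@reaatech/agent-eval-harness-gate';

const engine = createGateEngine([
  createNoRegressionGate(),
  createImprovementGate(0.02),                // require at least a 0.02 overall improvement
  createMetricRegressionGate('faithfulness'), // metric name is an assumption; use your own key
]);

// Baseline gates read from the optional second argument to evaluate().
const results = await getAggregatedResults();      // placeholder loader, as in the Quick Start
const comparison = await compareAgainstBaseline(); // hypothetical helper producing a RunComparisonResult
const summary = engine.evaluate(results, comparison);
```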
### CI Integration

| Export | Type | Description |
| --- | --- | --- |
| `CIIntegration` | Class (static methods) | Generate annotations, JUnit XML, PR comments, env vars |
| `writeJUnitReport(summary, filePath)` | Function | Write JUnit XML to file |
| `outputGitHubAnnotations(summary)` | Function | Print GitHub Actions workflow commands |
| `setGitHubOutput(key, value)` | Function | Set GitHub Actions step output |
| `exportForCI(summary, outputDir)` | Function | Export JUnit XML + JSON results + PR comment |

`CIIntegration` static methods:

| Method | Returns | Description |
| --- | --- | --- |
| `generateGitHubAnnotations(summary)` | `string` | Workflow command string for GitHub Actions |
| `generateJUnitReport(summary)` | `string` | JUnit XML for test reporters |
| `generatePRComment(summary)` | `string` | Markdown table for PR comments |
| `generateStepSummary(summary)` | `string` | Markdown for GitHub Actions step summary |
| `generateEnvVars(summary)` | `Record<string, string>` | Environment variables for CI |
| `getExitCode(summary)` | `number` | 0 if all gates passed, 1 otherwise |
| `parseGateConfig(yamlString)` | `GateConfig[]` | Parse gate config from YAML |
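A sketch of a small reporting script built from these helpers. The output path is arbitrary, and `writeJUnitReport` is awaited defensively since its sync/async behavior is not documented here:

```typescript
import {
  createGateEngine,
  getStandardPreset,
  CIIntegration,
  writeJUnitReport,
  outputGitHubAnnotations,
  setGitHubOutput,
} from '@reaatech/agent-eval-harness-gate';

const engine = createGateEngine(getStandardPreset().gates);
const summary = engine.evaluate(await getAggregatedResults()); // placeholder loader, as in the Quick Start

// Write a machine-readable report and surface annotations in the workflow log.
await writeJUnitReport(summary, 'results/junit.xml'); // path is arbitrary
outputGitHubAnnotations(summary);
setGitHubOutput('gates_passed', String(summary.overallPassed));

// Exit non-zero when any gate failed so the CI job is marked as failed.
process.exit(CIIntegration.getExitCode(summary));
```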
### Types

#### GateDefinition

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `name` | `string` | yes | Unique gate name |
| `type` | `GateType` | yes | `threshold` \| `baseline-comparison` \| `regression` \| `custom` |
| `metric` | `string` | no | Metric to check (for threshold/baseline gates) |
| `operator` | `GateOperator` | no | `>=` \| `<=` \| `>` \| `<` \| `==` \| `!=` |
| `threshold` | `number` | no | Threshold value for comparison |
| `baseline` | `string` | no | Baseline run ID |
| `allowRegression` | `boolean` | no | Whether regression is allowed |
| `customFn` | `(results, comparison?) => GateResult` | no | Custom evaluation function |
| `enabled` | `boolean` | no | Gate enabled flag (default `true`) |
| `description` | `string` | no | Human-readable description |
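Threshold gates can also be written declaratively instead of through the builders. In this sketch the metric key `'faithfulness'` is an assumption about how metrics are named in your results:

```typescript
import { createGateEngine, type GateDefinition } from '@reaatech/agent-eval-harness-gate';

// A declarative threshold gate; adjust the metric key to your own metric names.
const faithfulnessFloor: GateDefinition = {
  name: 'faithfulness-floor',
  type: 'threshold',
  metric: 'faithfulness',
  operator: '>=',
  threshold: 0.85,
  description: 'Average faithfulness must stay at or above 0.85',
};

const engine = createGateEngine([faithfulnessFloor]);
```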
#### GateResult

| Field | Type | Description |
| --- | --- | --- |
| `name` | `string` | Gate name |
| `passed` | `boolean` | Whether the gate passed |
| `reason` | `string` | Pass/fail reason |
| `actualValue` | `number?` | Actual value observed |
| `expectedValue` | `number?` | Expected/threshold value |
| `type` | `GateType` | Gate type |
#### GateEvaluationSummary

| Field | Type | Description |
| --- | --- | --- |
| `runId` | `string` | Evaluation run ID |
| `totalGates` | `number` | Total gates evaluated |
| `passedGates` | `number` | Gates that passed |
| `failedGates` | `number` | Gates that failed |
| `overallPassed` | `boolean` | All gates passed |
| `results` | `GateResult[]` | Individual gate results |
| `durationMs` | `number` | Evaluation duration in ms |
| `cacheHitRate` | `number?` | Cache hit rate (0-1) |
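The summary can be walked to produce a custom failure report, for example (assuming the `GateEvaluationSummary` type is exported by the package):

```typescript
import type { GateEvaluationSummary } from '@reaatech/agent-eval-harness-gate';

// Log every failed gate, then a one-line roll-up of the run.
function reportFailures(summary: GateEvaluationSummary): void {
  for (const gate of summary.results.filter((r) => !r.passed)) {
    const detail =
      gate.actualValue !== undefined
        ? ` (actual ${gate.actualValue}, expected ${gate.expectedValue})`
        : '';
    console.error(`FAIL ${gate.name}: ${gate.reason}${detail}`);
  }
  console.log(
    `${summary.passedGates}/${summary.totalGates} gates passed in ${summary.durationMs} ms`,
  );
}
```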
## Advanced Patterns

### Custom Programmatic Gates
Custom gates have full access to evaluation results and comparison data, enabling arbitrary logic beyond simple thresholds:
import { createGateEngine } from '@reaatech/agent-eval-harness-gate' ;
const customGate : GateDefinition = {
name: 'composite-quality' ,
type: 'custom' ,
description: 'Composite gate combining multiple metrics' ,
customFn : (results, comparison) => {
const overall = results.overallMetrics.overallScore;
const faithfulness = results.metricBreakdown.faithfulness?.avgScore ?? 0 ;
const cost = results.metricBreakdown.cost?.avgScore ?? 0 ;
const composite = overall * 0.5 + faithfulness * 0.3 + ( 1 - cost) * 0.2 ;
const passed = composite >= 0.75 ;
return {
passed,
reason: passed
? `Composite score ${ composite . toFixed ( 2 ) } >= 0.75`
: `Composite score ${ composite . toFixed ( 2 ) } < 0.75` ,
};
},
};
const engine = createGateEngine ([customGate]);
const summary = engine. evaluate (results);
### CI Pipeline Integration
```yaml
# .github/workflows/eval-gates.yml
name: Agent Evaluation Gates

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluation
        run: |
          npx agent-eval-harness eval trajectories/*.jsonl \
            --output results/

      - name: Run regression gates
        id: gates
        run: |
          npx agent-eval-harness gate results/results.json \
            --preset standard \
            --exit-code

      - name: Upload JUnit report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: gate-results
          path: results/

      - name: Comment on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const { CIIntegration } = require('@reaatech/agent-eval-harness-gate');
            const results = require('./results/results.json');
            const summary = CIIntegration.evaluateFromResults(results);
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: CIIntegration.generatePRComment(summary)
            });
```
## Related Packages

## License

MIT