
@reaatech/agent-eval-harness-gate


Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

CI/CD regression gates for AI agent evaluation. Define quality, cost, latency, and correctness thresholds that block merges when agents regress. Outputs JUnit XML for test reporters, GitHub Actions annotations for PR comments, and structured JSON for dashboards.

Installation

```bash
npm install @reaatech/agent-eval-harness-gate
```

Feature Overview

  • Threshold gates — overall quality, faithfulness, relevance, tool correctness, cost, latency, pass rate, SLA violations
  • Baseline comparison gates — no-regression, improvement-required, statistical significance, per-metric regression
  • Three presets — standard (quality >= 0.80), strict (quality >= 0.90), lenient (quality >= 0.60)
  • Custom gate functions — programmatic gates with access to full results and comparison data
  • CI integration — JUnit XML output, GitHub Actions annotations, step outputs, PR comments
  • Result caching — configurable TTL caching to speed repeated evaluations

Quick Start

```typescript
import { createGateEngine, getStandardPreset, CIIntegration } from '@reaatech/agent-eval-harness-gate';

const engine = createGateEngine(getStandardPreset().gates);
const results = await getAggregatedResults(); // your own loader returning AggregatedResults
const summary = engine.evaluate(results);

console.log(`Passed: ${summary.overallPassed}, ${summary.passedGates}/${summary.totalGates} gates`);
console.log(`Exit code: ${CIIntegration.getExitCode(summary)}`);
```

API Reference

GateEngine

| Method | Signature | Description |
| --- | --- | --- |
| `evaluate` | `(results: AggregatedResults, comparison?: RunComparisonResult) => GateEvaluationSummary` | Evaluate all gates against results |
| `clearCache` | `() => void` | Clear the result cache |
| `getGates` | `() => GateDefinition[]` | Get all registered gates |
| `addGate` | `(gate: GateDefinition) => void` | Add a gate dynamically |
| `removeGate` | `(name: string) => void` | Remove a gate by name |

Factory: `createGateEngine(gates: GateDefinition[], cacheTTL?: number): GateEngine`
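
The `cacheTTL` parameter controls how long evaluation results are reused before being recomputed. A minimal self-contained sketch of such a TTL cache (illustrative only, not the package's internal implementation; names are hypothetical):

```typescript
// Minimal TTL cache sketch: stores one value per key and evicts it once
// `ttlMs` milliseconds have elapsed. Illustrates the "Result caching"
// behavior described above.
class TTLCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}

const cache = new TTLCache<number>(60_000); // 60s TTL
cache.set("run-1", 42);
```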

Threshold Gate Builders

| Builder | Default | Description |
| --- | --- | --- |
| `createOverallQualityGate(threshold?)` | 0.8 | Overall quality score >= threshold |
| `createFaithfulnessGate(threshold?)` | 0.8 | Faithfulness score >= threshold |
| `createRelevanceGate(threshold?)` | 0.8 | Relevance score >= threshold |
| `createToolCorrectnessGate(threshold?)` | 0.9 | Tool correctness rate >= threshold |
| `createCostGate(maxCost?)` | 0.05 | Cost per task <= maxCost |
| `createLatencyGate(maxLatencyMs?)` | 5000 | P99 latency <= maxLatencyMs |
| `createPassRateGate(minPassRate?)` | 0.95 | Pass rate >= minPassRate |
| `createSLAViolationsGate(maxViolations?)` | 0 | SLA violations <= maxViolations |
| `buildThresholdGates(config)` | n/a | Build gates from a config object |
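
Each builder above reduces to a single operator comparison between an observed metric and a threshold. A sketch of that comparison (the `GateOperator` values come from the Types section below; `checkThreshold` is a hypothetical helper, not a package export):

```typescript
// Supported comparison operators, as listed in the GateDefinition type.
type GateOperator = ">=" | "<=" | ">" | "<" | "==" | "!=";

// Hypothetical helper: true when `actual` satisfies `operator threshold`,
// e.g. checkThreshold(0.85, ">=", 0.8) for a quality gate at the
// default 0.8 threshold.
function checkThreshold(actual: number, operator: GateOperator, threshold: number): boolean {
  switch (operator) {
    case ">=": return actual >= threshold;
    case "<=": return actual <= threshold;
    case ">":  return actual > threshold;
    case "<":  return actual < threshold;
    case "==": return actual === threshold;
    case "!=": return actual !== threshold;
  }
  throw new Error(`unknown operator: ${operator}`); // unreachable safeguard
}
```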

Presets

| Preset | Function | Quality | Faithfulness | Relevance | Tool Correctness | Cost | Latency P99 | Pass Rate | SLA Violations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard | `getStandardPreset()` | >= 0.80 | >= 0.80 | >= 0.80 | >= 0.90 | <= $0.05 | <= 5000ms | >= 95% | n/a |
| Strict | `getStrictPreset()` | >= 0.90 | >= 0.90 | >= 0.90 | >= 0.95 | <= $0.02 | <= 2000ms | >= 99% | <= 0 |
| Lenient | `getLenientPreset()` | >= 0.60 | >= 0.60 | >= 0.60 | >= 0.70 | <= $0.10 | <= 10000ms | n/a | n/a |

Baseline Gate Builders

| Builder | Description |
| --- | --- |
| `createNoRegressionGate()` | Fail if any regression detected vs baseline |
| `createImprovementGate(minImprovement?)` | Require minimum overall score improvement |
| `createSignificanceGate(alpha?)` | Require statistical significance (default α = 0.05) |
| `createMetricRegressionGate(metric, allowDecline?)` | Per-metric regression gate with tolerance |
| `getBaselinePreset()` | Returns `[noRegression, improvement(0)]` |
| `getStrictBaselinePreset()` | Returns `[noRegression, improvement(0.05), significance(0.05), metricRegression × 3]` |

CI Integration

| Export | Type | Description |
| --- | --- | --- |
| `CIIntegration` | Class (static methods) | Generate annotations, JUnit XML, PR comments, env vars |
| `writeJUnitReport(summary, filePath)` | Function | Write JUnit XML to file |
| `outputGitHubAnnotations(summary)` | Function | Print GitHub Actions workflow commands |
| `setGitHubOutput(key, value)` | Function | Set a GitHub Actions step output |
| `exportForCI(summary, outputDir)` | Function | Export JUnit XML + JSON results + PR comment |

`CIIntegration` static methods:

| Method | Returns | Description |
| --- | --- | --- |
| `generateGitHubAnnotations(summary)` | `string` | Workflow command string for GitHub Actions |
| `generateJUnitReport(summary)` | `string` | JUnit XML for test reporters |
| `generatePRComment(summary)` | `string` | Markdown table for PR comments |
| `generateStepSummary(summary)` | `string` | Markdown for GitHub Actions step summary |
| `generateEnvVars(summary)` | `Record<string, string>` | Environment variables for CI |
| `getExitCode(summary)` | `number` | 0 if all gates passed, 1 otherwise |
| `parseGateConfig(yamlString)` | `GateConfig[]` | Parse gate config from YAML |
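
In a CI script, the exit code and annotations are what actually block a merge. A sketch of that mapping over a subset of the documented `GateEvaluationSummary` shape (the annotation string follows GitHub's `::error::` workflow-command syntax; these helpers are illustrative, not the package source):

```typescript
// Subset of GateEvaluationSummary used by the helpers below.
interface Summary {
  overallPassed: boolean;
  results: { name: string; passed: boolean; reason: string }[];
}

// 0 when every gate passed, 1 otherwise (getExitCode's documented contract).
function exitCodeFor(summary: Summary): number {
  return summary.overallPassed ? 0 : 1;
}

// One ::error:: workflow command per failed gate, so each failure
// surfaces as an inline annotation on the PR.
function annotationsFor(summary: Summary): string[] {
  return summary.results
    .filter((r) => !r.passed)
    .map((r) => `::error title=Gate failed (${r.name})::${r.reason}`);
}
```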

Types

GateDefinition

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `name` | `string` | yes | Unique gate name |
| `type` | `GateType` | yes | `threshold` \| `baseline-comparison` \| `regression` \| `custom` |
| `metric` | `string` | no | Metric to check (for threshold/baseline gates) |
| `operator` | `GateOperator` | no | `>=` \| `<=` \| `>` \| `<` \| `==` \| `!=` |
| `threshold` | `number` | no | Threshold value for comparison |
| `baseline` | `string` | no | Baseline run ID |
| `allowRegression` | `boolean` | no | Whether regression is allowed |
| `customFn` | `(results, comparison?) => GateResult` | no | Custom evaluation function |
| `enabled` | `boolean` | no | Gate enabled flag (default `true`) |
| `description` | `string` | no | Human-readable description |
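
Putting the fields together, a threshold gate roughly equivalent to `createCostGate(0.05)` might look like this (the local interface mirrors the table above; the `metric` key is an assumption about your results payload):

```typescript
// Local copy of the GateDefinition shape from the table above,
// restricted to the fields a threshold gate uses.
interface GateDefinition {
  name: string;
  type: "threshold" | "baseline-comparison" | "regression" | "custom";
  metric?: string;
  operator?: ">=" | "<=" | ">" | "<" | "==" | "!=";
  threshold?: number;
  enabled?: boolean;
  description?: string;
}

// Cost gate: average cost per task must stay at or below $0.05.
const costGate: GateDefinition = {
  name: "max-cost",
  type: "threshold",
  metric: "cost", // assumed metric key; check your results payload
  operator: "<=",
  threshold: 0.05,
  description: "Cost per task <= $0.05",
};
```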

GateResult

| Field | Type | Description |
| --- | --- | --- |
| `name` | `string` | Gate name |
| `passed` | `boolean` | Whether the gate passed |
| `reason` | `string` | Pass/fail reason |
| `actualValue` | `number?` | Actual value observed |
| `expectedValue` | `number?` | Expected/threshold value |
| `type` | `GateType` | Gate type |

GateEvaluationSummary

| Field | Type | Description |
| --- | --- | --- |
| `runId` | `string` | Evaluation run ID |
| `totalGates` | `number` | Total gates evaluated |
| `passedGates` | `number` | Gates that passed |
| `failedGates` | `number` | Gates that failed |
| `overallPassed` | `boolean` | Whether all gates passed |
| `results` | `GateResult[]` | Individual gate results |
| `durationMs` | `number` | Evaluation duration in ms |
| `cacheHitRate` | `number?` | Cache hit rate (0-1) |
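
The count fields obey simple invariants: `passedGates + failedGates === totalGates`, and `overallPassed` is true exactly when `failedGates` is 0. A sketch of that aggregation (hypothetical helper, not the package's code):

```typescript
// Derive the summary's count fields from individual gate outcomes.
function summarize(results: { passed: boolean }[]) {
  const passedGates = results.filter((r) => r.passed).length;
  const failedGates = results.length - passedGates;
  return {
    totalGates: results.length,
    passedGates,
    failedGates,
    overallPassed: failedGates === 0, // every gate must pass
  };
}
```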

Advanced Patterns

Custom Programmatic Gates

Custom gates have full access to evaluation results and comparison data, enabling arbitrary logic beyond simple thresholds:

```typescript
import { createGateEngine, type GateDefinition } from '@reaatech/agent-eval-harness-gate';

const customGate: GateDefinition = {
  name: 'composite-quality',
  type: 'custom',
  description: 'Composite gate combining multiple metrics',
  customFn: (results, comparison) => {
    const overall = results.overallMetrics.overallScore;
    const faithfulness = results.metricBreakdown.faithfulness?.avgScore ?? 0;
    const cost = results.metricBreakdown.cost?.avgScore ?? 0;

    // Weighted blend: quality 50%, faithfulness 30%, inverted cost 20%
    const composite = overall * 0.5 + faithfulness * 0.3 + (1 - cost) * 0.2;
    const passed = composite >= 0.75;

    return {
      passed,
      reason: passed
        ? `Composite score ${composite.toFixed(2)} >= 0.75`
        : `Composite score ${composite.toFixed(2)} < 0.75`,
    };
  },
};

const engine = createGateEngine([customGate]);
const summary = engine.evaluate(results); // results: AggregatedResults from your evaluation run
```
CI Pipeline Integration

```yaml
# .github/workflows/eval-gates.yml
name: Agent Evaluation Gates

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluation
        run: |
          npx agent-eval-harness eval trajectories/*.jsonl \
            --output results/

      - name: Run regression gates
        id: gates
        run: |
          npx agent-eval-harness gate results/results.json \
            --preset standard \
            --exit-code

      - name: Upload JUnit report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: gate-results
          path: results/

      - name: Comment on PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const { createGateEngine, getStandardPreset, CIIntegration } = require('@reaatech/agent-eval-harness-gate');
            const results = require('./results/results.json');
            const summary = createGateEngine(getStandardPreset().gates).evaluate(results);

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: CIIntegration.generatePRComment(summary)
            });
```

License

MIT