@reaatech/agent-eval-harness-golden





Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Golden trajectory management for agent evaluation regression testing. Create, annotate, validate, and curate reference trajectories, then compare candidate agent runs against them with detailed diff analysis and regression detection.

Installation

```shell
npm install @reaatech/agent-eval-harness-golden
```

Feature Overview

  • Golden trajectory CRUD — load, create, update, and filter reference trajectories by tags and scenarios
  • Annotation workflow — mark expected turns, add quality notes, tag golden trajectories for organization
  • Curation pipeline — structured workflow: identify → annotate → validate → publish with batch quality checks
  • Comparison engine — diff candidate trajectories against goldens with turn-level similarity scoring
  • Regression detection — identify missing turns, tool mismatches, and low-similarity responses
  • Batch comparison — compare multiple candidates against a library of golden references

Quick Start

```typescript
import { quickCreateGolden, compareAgainstGolden } from '@reaatech/agent-eval-harness-golden';
import type { Trajectory } from '@reaatech/agent-eval-harness-types';

// `trajectory` and `candidateTrajectory` are Trajectory values loaded elsewhere

// Quick creation for simple scenarios: auto-annotate, validate, and publish in one call
const golden = quickCreateGolden(trajectory, 'password-reset', ['auth', 'critical']);

// Compare a new run against the golden
const result = compareAgainstGolden(golden, candidateTrajectory, { similarityThreshold: 0.85 });
console.log(`Similarity: ${result.similarity}, Regressions: ${result.regressions.length}`);
```

API Reference

Golden Manager

| Name | Type | Description |
| --- | --- | --- |
| `loadGoldenTrajectories(jsonlContent)` | function | Parse a JSONL string into an array of `GoldenTrajectory` objects |
| `validateGolden(golden)` | function | Validate a golden trajectory's structure; returns `{ valid, errors, warnings, score }` |
| `goldenToJSONL(golden)` | function | Serialize a golden trajectory back to JSONL string format |
| `createGolden(trajectory, options)` | function | Create a new golden trajectory from a candidate trajectory with metadata options |
| `updateGolden(golden, changes)` | function | Update a golden trajectory's metadata and bump the `updatedAt` timestamp |
| `filterByTags(goldens, tags)` | function | Filter golden trajectories by tag intersection |
| `getByScenario(goldens, scenario)` | function | Search golden trajectories by scenario name (description or trajectory ID match) |

Comparison Engine

| Name | Type | Description |
| --- | --- | --- |
| `compareAgainstGolden(golden, candidate, config?)` | function | Compare a candidate trajectory against a golden; returns `TrajectoryComparisonResult` |
| `batchCompare(golden, candidates, config?)` | function | Compare multiple candidates against a single golden in one call |
| `findBestGolden(candidate, goldens, config?)` | function | Find the best-matching golden for a candidate across a golden library |

Curation

| Name | Type | Description |
| --- | --- | --- |
| `GoldenCurator` | class | Structured curation workflow with `start()`, `annotateTurn()`, `autoAnnotate()`, `runQualityChecks()`, `validate()`, `publish()`, `exportJSONL()` |
| `createCurator(trajectory)` | function | Factory: returns a new `GoldenCurator` instance |
| `quickCreateGolden(trajectory, description, tags)` | function | One-shot: auto-annotate, validate, and publish a golden in a single call |
| `batchQualityCheck(goldens)` | function | Run quality checks across a library of goldens; returns per-golden results |
| `generateCurationReport(goldens)` | function | Generate a human-readable curation report with issues and suggestions |

Types

| Name | Type | Description |
| --- | --- | --- |
| `GoldenTrajectory` | interface | Golden reference with id, metadata (version, tags, description, quality notes), and trajectory |
| `GoldenMetadata` | interface | Version, timestamps, description, tags, quality notes, expected outcomes |
| `GoldenValidationResult` | interface | Result of `validateGolden` with `valid`, `errors`, `warnings`, `score` |
| `TrajectoryComparisonResult` | interface | Comparison output: `similarity`, `turnComparisons`, `matchingTurns`, `divergentTurns`, `passesThreshold`, `regressions`, `diffSummary` |
| `TurnComparison` | interface | Per-turn diff: `turnId`, `similarity`, `contentMatch`, `toolMatch`, `differences` |
| `Regression` | interface | Detected regression with type (`tool_mismatch` / `content_divergence` / `missing_turn` / `extra_turn`), severity, `turnId`, description |
| `ComparisonConfig` | interface | Comparison options: `similarityThreshold`, `compareTools`, `semanticComparison`, `turnMatching` |
| `CurationState` | interface | Curation workflow step and accumulated annotations |
| `TurnAnnotation` | interface | Per-turn annotation: `turnId`, `expected`, `qualityNotes`, `alternatives` |
| `QualityCheckResult` | interface | Curation quality check output: `passed`, `score`, `issues`, `suggestions` |

Related Packages

| Package | Description |
| --- | --- |
| `@reaatech/agent-eval-harness-types` | Shared domain types and Zod schemas |
| `@reaatech/agent-eval-harness-trajectory` | Trajectory loading, evaluation, and golden comparison |
| `@reaatech/agent-eval-harness-tool-use` | Tool-use validation and schema compliance |
| `@reaatech/agent-eval-harness-cost` | Cost tracking, budgets, and reporting |
| `@reaatech/agent-eval-harness-latency` | Latency monitoring, SLA enforcement, and optimization |
| `@reaatech/agent-eval-harness-judge` | LLM-as-judge with calibration and consensus |
| `@reaatech/agent-eval-harness-golden` | Golden trajectory management and curation |
| `@reaatech/agent-eval-harness-suite` | Suite runner, results aggregation, and comparison |
| `@reaatech/agent-eval-harness-gate` | CI regression gates with JUnit and GitHub output |
| `@reaatech/agent-eval-harness-mcp-server` | MCP server with three-layer tool architecture |
| `@reaatech/agent-eval-harness-cli` | Command-line interface |
| `@reaatech/agent-eval-harness-observability` | OTel tracing, metrics, structured logging, and dashboards |

License

MIT