@reaatech/agent-eval-harness-trajectory
Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.
Trajectory loading, multi-turn evaluation, and golden-comparison utilities for agent conversation analysis. Parses JSONL turn files, scores coherence and goal completion, and diffs candidate trajectories against golden references.
Installation
npm install @reaatech/agent-eval-harness-trajectory
# or
pnpm add @reaatech/agent-eval-harness-trajectory
Feature Overview
JSONL loader — parse, validate, and serialize trajectory files with full Zod schema validation via @reaatech/agent-eval-harness-types
Multi-turn evaluation — coherence analysis, goal completion scoring, and conversation flow metrics
Golden comparison — diff candidate trajectories against reference with regression detection and improvement tracking
Directory batch load — load and validate all .jsonl files in a directory in a single call
Custom error types — TrajectoryLoadError with file path and cause tracking for precise debugging
Quick Start
import { loadFromContent, evaluate, compare } from '@reaatech/agent-eval-harness-trajectory';
import type { Trajectory } from '@reaatech/agent-eval-harness-types';

// Two JSONL lines: one user turn followed by one agent turn.
const jsonl =
  '{"turn_id":1,"role":"user","content":"Reset password","timestamp":"2026-04-15T00:00:00Z"}\n' +
  '{"turn_id":1,"role":"agent","content":"What\'s your email?","tool_calls":[],"timestamp":"2026-04-15T00:00:01Z"}';

// Parse and validate the content, then score the resulting trajectory.
const trajectory = loadFromContent(jsonl);
const result = evaluate(trajectory);
console.log(`Score: ${result.overall_score}, Coherence: ${result.metrics.coherence}`);
API Reference
Loader Functions
| Name | Type | Description |
| --- | --- | --- |
| `parseTurn(line, lineNumber)` | `(string, number) => Turn` | Parse a single JSONL line into a validated Turn object |
| `loadFromContent(content, options?)` | `(string, LoadOptions?) => Trajectory` | Load a trajectory from a JSONL content string |
| `loadFromFile(filePath, options?)` | `(string, LoadOptions?) => Promise<Trajectory>` | Load a trajectory from a .jsonl file on disk |
| `loadFromDirectory(dirPath, options?)` | `(string, LoadOptions?) => Promise<Trajectory[]>` | Load all .jsonl files in a directory |
| `serializeToJsonl(trajectory)` | `(Trajectory) => string` | Serialize a trajectory to JSONL string format |
| `saveToFile(trajectory, filePath)` | `(Trajectory, string) => Promise<void>` | Save a trajectory to a .jsonl file |
Loader Types
| Name | Type | Description |
| --- | --- | --- |
| `LoadOptions` | interface | Options with `validate` (boolean, default true) and `generateId` (boolean, default true) |
| `TrajectoryLoadError` | class extends `Error` | Custom error with cause, filePath, and descriptive message |
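To make the loader behavior concrete, here is a minimal, self-contained sketch of line-by-line JSONL parsing with an error that carries a file path and cause. It does not import the package and is not its implementation: `SketchTurn`, `SketchLoadError`, `parseTurnSketch`, and `parseJsonlSketch` are hypothetical names, the field checks stand in for the real Zod validation, and skipping blank lines is an assumption.

```typescript
// Illustrative sketch only. None of these names are exports of the package.
interface SketchTurn {
  turn_id: number;
  role: string;
  content: string;
  timestamp: string;
}

// Hypothetical stand-in for an error type that tracks file path and cause.
class SketchLoadError extends Error {
  readonly filePath?: string;
  readonly cause?: unknown;

  constructor(message: string, filePath?: string, cause?: unknown) {
    super(message);
    // Keep instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = 'SketchLoadError';
    this.filePath = filePath;
    this.cause = cause;
  }
}

// Parse one JSONL line; lineNumber is 1-based and used only for messages.
function parseTurnSketch(line: string, lineNumber: number, filePath?: string): SketchTurn {
  let raw: unknown;
  try {
    raw = JSON.parse(line);
  } catch (err) {
    throw new SketchLoadError(`Line ${lineNumber}: invalid JSON`, filePath, err);
  }
  if (typeof raw !== 'object' || raw === null) {
    throw new SketchLoadError(`Line ${lineNumber}: expected a JSON object`, filePath);
  }
  const { turn_id, role, content, timestamp } = raw as Record<string, unknown>;
  if (
    typeof turn_id !== 'number' ||
    typeof role !== 'string' ||
    typeof content !== 'string' ||
    typeof timestamp !== 'string'
  ) {
    throw new SketchLoadError(`Line ${lineNumber}: missing or mistyped turn field`, filePath);
  }
  return { turn_id, role, content, timestamp };
}

// Parse a whole JSONL string, skipping blank lines (an assumption of this sketch).
function parseJsonlSketch(content: string, filePath?: string): SketchTurn[] {
  const turns: SketchTurn[] = [];
  content.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() !== '') {
      turns.push(parseTurnSketch(line, index + 1, filePath));
    }
  });
  return turns;
}

// Serialize turns back to JSONL, one JSON object per line.
function serializeSketch(turns: SketchTurn[]): string {
  return turns.map((turn) => JSON.stringify(turn)).join('\n');
}
```

Reporting the line number in the message and the path on the error object is what makes a failure in a large batch of files traceable to a single record.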
Evaluator Functions
| Name | Type | Description |
| --- | --- | --- |
| `evaluate(trajectory, options?)` | `(Trajectory, EvaluateOptions?) => EvalResult` | Full trajectory evaluation returning overall score, per-metric scores, and issues |
| `analyzeCoherence(trajectory)` | `(Trajectory) => CoherenceResult` | Multi-turn coherence analysis with per-transition scoring |
| `analyzeGoalCompletion(trajectory)` | `(Trajectory) => GoalCompletionResult` | Determine if the agent completed the user’s goal with confidence scoring |
| `analyzeConversationFlow(trajectory)` | `(Trajectory) => FlowAnalysis` | Conversation flow analysis with topic changes, interruptions, and flow score |
Evaluator Types
| Name | Type | Description |
| --- | --- | --- |
| `EvaluateOptions` | interface | Options with `checkCoherence`, `checkGoalCompletion`, `analyzeFlow`, and `coherenceThreshold` |
| `CoherenceResult` | interface | Coherence score, issues list, and per-transition analysis |
| `TurnTransition` | interface | Single turn transition with `from`, `to`, `coherent`, and optional `reason` |
| `GoalCompletionResult` | interface | Completion status, confidence, evidence array, and unresolved turn IDs |
| `FlowAnalysis` | interface | Flow metrics: avg turns per topic, topic changes, interruptions, and flow score |
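The idea of per-transition coherence scoring can be illustrated with a small, self-contained sketch. The heuristic below (a transition is incoherent when two consecutive turns share a role or the reply is empty) is purely an assumption for illustration and is not how the package scores coherence. All names are hypothetical, and treating `from` and `to` as turn IDs is also an assumption.

```typescript
// Illustrative sketch only. The scoring rule here is a made-up heuristic.
interface CoherenceTurnSketch {
  turn_id: number;
  role: string;
  content: string;
}

// Mirrors the documented shape: from, to, coherent, optional reason.
interface TransitionSketch {
  from: number;
  to: number;
  coherent: boolean;
  reason?: string;
}

// Build one transition record per pair of consecutive turns.
function analyzeTransitionsSketch(turns: CoherenceTurnSketch[]): TransitionSketch[] {
  const transitions: TransitionSketch[] = [];
  let prev: CoherenceTurnSketch | undefined;
  for (const next of turns) {
    if (prev !== undefined) {
      if (prev.role === next.role) {
        transitions.push({
          from: prev.turn_id,
          to: next.turn_id,
          coherent: false,
          reason: `consecutive ${next.role} turns`,
        });
      } else if (next.content.trim() === '') {
        transitions.push({ from: prev.turn_id, to: next.turn_id, coherent: false, reason: 'empty reply' });
      } else {
        transitions.push({ from: prev.turn_id, to: next.turn_id, coherent: true });
      }
    }
    prev = next;
  }
  return transitions;
}

// Score is the fraction of coherent transitions; no transitions counts as 1.
function coherenceScoreSketch(transitions: TransitionSketch[]): number {
  if (transitions.length === 0) {
    return 1;
  }
  const coherentCount = transitions.filter((t) => t.coherent).length;
  return coherentCount / transitions.length;
}

// Hypothetical use of a threshold to turn a score into pass/fail.
function meetsCoherenceThresholdSketch(score: number, threshold: number): boolean {
  return score >= threshold;
}
```

Keeping a `reason` on each failed transition is what lets an issues list point at the exact pair of turns that lowered the score.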
Comparator
| Name | Type | Description |
| --- | --- | --- |
| `compare(candidate, golden, options?)` | `(Trajectory, GoldenTrajectory \| Trajectory, CompareOptions?) => ComparisonResult` | Compare a candidate trajectory against a golden reference |
Comparator Types
| Name | Type | Description |
| --- | --- | --- |
| `CompareOptions` | interface | Options with `similarityThreshold` (default 0.85), `compareTools`, `compareLatency`, `compareCosts`, and `strict` |
| `ComparisonResult` | interface | Overall similarity, pass/fail, diff, regressions, improvements, and per-turn comparisons |
| `TrajectoryDiff` | interface | Detailed diff with missing turns, extra turns, modified turns, and tool differences |
| `TurnDiff` | interface | Difference in a single turn field with expected and actual values |
| `ToolDiff` | interface | Tool call difference with expected/actual tool names and argument differences |
| `ArgumentDiff` | interface | Single argument difference with expected and actual values |
| `Regression` | interface | Regression record with type, severity, description, turn ID, and impact score |
| `Improvement` | interface | Improvement record with type, description, turn ID, and benefit score |
| `TurnComparison` | interface | Per-turn comparison with similarity, match status, and differences |
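As a rough illustration of how a similarity threshold can gate a golden comparison, the self-contained sketch below aligns turns by position, scores each pair with token-level Jaccard similarity, and averages the result. The similarity measure, the positional alignment, and every name here are assumptions for illustration. Only the 0.85 default threshold comes from the table above.

```typescript
// Illustrative sketch only. Not the package's comparison algorithm.
interface CompareTurnSketch {
  turn_id: number;
  role: string;
  content: string;
}

interface ComparisonSketch {
  similarity: number;
  passed: boolean;
  missingTurns: number[]; // turn IDs present in golden but not in candidate
  extraTurns: number[]; // turn IDs present in candidate but not in golden
}

// Lowercased word tokens of a string.
function tokenSetSketch(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter((token) => token !== ''));
}

// Jaccard similarity: |intersection| / |union|; two empty texts count as identical.
function jaccardSketch(a: string, b: string): number {
  const left = tokenSetSketch(a);
  const right = tokenSetSketch(b);
  if (left.size === 0 && right.size === 0) {
    return 1;
  }
  let shared = 0;
  left.forEach((token) => {
    if (right.has(token)) {
      shared += 1;
    }
  });
  return shared / (left.size + right.size - shared);
}

// Compare by position; unmatched turns contribute a similarity of 0.
function compareSketch(
  candidate: CompareTurnSketch[],
  golden: CompareTurnSketch[],
  similarityThreshold = 0.85,
): ComparisonSketch {
  const missingTurns: number[] = [];
  const extraTurns: number[] = [];
  const length = Math.max(candidate.length, golden.length);
  if (length === 0) {
    return { similarity: 1, passed: true, missingTurns, extraTurns };
  }
  let total = 0;
  for (let i = 0; i < length; i++) {
    const c: CompareTurnSketch | undefined = candidate[i];
    const g: CompareTurnSketch | undefined = golden[i];
    if (c !== undefined && g !== undefined) {
      // A role mismatch is treated as a completely different turn.
      total += c.role === g.role ? jaccardSketch(c.content, g.content) : 0;
    } else if (g !== undefined) {
      missingTurns.push(g.turn_id);
    } else if (c !== undefined) {
      extraTurns.push(c.turn_id);
    }
  }
  const similarity = total / length;
  return { similarity, passed: similarity >= similarityThreshold, missingTurns, extraTurns };
}
```

Counting missing and extra turns as zero similarity means a candidate that drops a turn cannot pass on the strength of its remaining turns alone.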
Related Packages
| Package | Description |
| --- | --- |
| @reaatech/agent-eval-harness-types | Shared domain types and Zod schemas |
| @reaatech/agent-eval-harness-trajectory | Trajectory loading, evaluation, and golden comparison |
| @reaatech/agent-eval-harness-tool-use | Tool-use validation and schema compliance |
| @reaatech/agent-eval-harness-cost | Cost tracking, budgets, and reporting |
| @reaatech/agent-eval-harness-latency | Latency monitoring, SLA enforcement, and optimization |
| @reaatech/agent-eval-harness-judge | LLM-as-judge with calibration and consensus |
| @reaatech/agent-eval-harness-golden | Golden trajectory management and curation |
| @reaatech/agent-eval-harness-suite | Suite runner, results aggregation, and comparison |
| @reaatech/agent-eval-harness-gate | CI regression gates with JUnit and GitHub output |
| @reaatech/agent-eval-harness-mcp-server | MCP server with three-layer tool architecture |
| @reaatech/agent-eval-harness-cli | Command-line interface |
| @reaatech/agent-eval-harness-observability | OTel tracing, metrics, structured logging, and dashboards |
License
MIT