These packages give you a full evaluation pipeline for AI agent trajectories: loading multi-turn conversations, scoring them on quality, tool correctness, cost, and latency, then running those scores through CI/CD regression gates. You'd adopt them to catch regressions in agent behavior before deploying, replacing ad-hoc manual review or single-metric checks with structured, repeatable evaluation. The monorepo is organized as independent packages (trajectory loading, tool-use validation, cost tracking, LLM-as-judge, golden comparisons, suite orchestration, CI gates, MCP server, observability) that each export plain TypeScript functions and Zod schemas, so you compose exactly the pieces you need without framework lock-in.
A CLI providing 7 subcommands (`eval`, `judge`, `compare`, `gate`, `golden`, `report`, `serve`) for running and managing LLM agent evaluations, including trajectory loading, multi-metric scoring, CI gate checks with exit codes, golden reference management, multi-format reporting (JSON, HTML, Markdown, PDF), and an MCP server exposing evaluation tools to AI coding agents.
Calculates per-task LLM token and tool invocation costs for AI agent trajectories, with budget enforcement and cost reporting. Exports functions like `calculateTrajectoryCost`, `checkBudget`, and `generateCostReport` that operate on `Trajectory` objects from `@reaatech/agent-eval-harness-types`.
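To make the cost model concrete, here is a minimal sketch of per-trajectory cost calculation with a budget check. It does not use the package's actual API: `TurnUsage`, `Pricing`, `sketchTrajectoryCost`, and `sketchCheckBudget` are hypothetical names, and the real `Trajectory` type and function signatures may differ.

```typescript
// Hypothetical shapes for illustration; the package's real `Trajectory` type may differ.
interface TurnUsage {
  inputTokens: number;
  outputTokens: number;
  toolCalls: number;
}

interface Pricing {
  inputPerMillionUsd: number; // USD per 1M input tokens
  outputPerMillionUsd: number; // USD per 1M output tokens
  perToolCallUsd: number; // flat USD charge per tool invocation
}

// Sum token and tool-invocation costs over every turn of a trajectory.
function sketchTrajectoryCost(turns: TurnUsage[], pricing: Pricing): number {
  return turns.reduce((total, turn) => {
    const tokenCost =
      (turn.inputTokens / 1e6) * pricing.inputPerMillionUsd +
      (turn.outputTokens / 1e6) * pricing.outputPerMillionUsd;
    return total + tokenCost + turn.toolCalls * pricing.perToolCallUsd;
  }, 0);
}

// Compare a computed cost against a budget and report any overage.
function sketchCheckBudget(costUsd: number, budgetUsd: number) {
  return {
    withinBudget: costUsd <= budgetUsd,
    overageUsd: Math.max(0, costUsd - budgetUsd),
  };
}
```

Keeping pricing as a plain object passed in by the caller means the same trajectory can be re-costed under different provider rates without reloading it.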
A function that checks agent evaluation results against configurable quality, cost, latency, and correctness thresholds and returns a pass/fail summary for CI/CD gating.
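A threshold gate of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the package's exported function: the field names in `EvalSummary` and `GateThresholds`, and the name `sketchGate`, are hypothetical.

```typescript
// Hypothetical result and threshold shapes; the real package may use different fields.
interface EvalSummary {
  quality: number; // 0..1, higher is better
  toolCorrectness: number; // 0..1, higher is better
  costUsd: number;
  p99LatencyMs: number;
}

interface GateThresholds {
  minQuality: number;
  minToolCorrectness: number;
  maxCostUsd: number;
  maxP99LatencyMs: number;
}

interface GateResult {
  passed: boolean;
  failures: string[];
}

// Collect every violated threshold so a CI log shows all failures at once.
function sketchGate(summary: EvalSummary, thresholds: GateThresholds): GateResult {
  const failures: string[] = [];
  if (summary.quality < thresholds.minQuality) {
    failures.push(`quality ${summary.quality} < ${thresholds.minQuality}`);
  }
  if (summary.toolCorrectness < thresholds.minToolCorrectness) {
    failures.push(`toolCorrectness ${summary.toolCorrectness} < ${thresholds.minToolCorrectness}`);
  }
  if (summary.costUsd > thresholds.maxCostUsd) {
    failures.push(`costUsd ${summary.costUsd} > ${thresholds.maxCostUsd}`);
  }
  if (summary.p99LatencyMs > thresholds.maxP99LatencyMs) {
    failures.push(`p99LatencyMs ${summary.p99LatencyMs} > ${thresholds.maxP99LatencyMs}`);
  }
  return { passed: failures.length === 0, failures };
}
```

In a CI job, the caller would typically map `passed` to a process exit code (0 on pass, non-zero on failure) so the pipeline step fails when any threshold is violated.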
A library for creating, managing, and comparing golden reference trajectories against candidate agent runs to detect regressions. It provides functions (`createGolden`, `compareAgainstGolden`, `batchCompare`) and a `GoldenCurator` class for structured curation workflows, returning diff analysis with turn-level similarity scores and regression details.
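The idea of turn-level similarity scoring can be illustrated with a small sketch. It substitutes token-set Jaccard similarity for whatever metric the library actually uses, and `jaccardSimilarity`, `SketchDiff`, and `sketchCompareAgainstGolden` are hypothetical names, not the signatures of `compareAgainstGolden` or `batchCompare`.

```typescript
// Token-set Jaccard similarity in [0, 1]; a stand-in metric, not the library's own.
function jaccardSimilarity(a: string, b: string): number {
  const tokensA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tokensB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (tokensA.size === 0 && tokensB.size === 0) return 1;
  let shared = 0;
  tokensA.forEach((token) => {
    if (tokensB.has(token)) shared += 1;
  });
  return shared / (tokensA.size + tokensB.size - shared);
}

interface SketchDiff {
  turnScores: number[]; // one similarity score per golden turn
  regressedTurns: number[]; // indices of turns scoring below the threshold
}

// Compare each golden turn with the candidate turn at the same index.
// A missing candidate turn scores 0; extra candidate turns are ignored here.
function sketchCompareAgainstGolden(
  golden: string[],
  candidate: string[],
  minSimilarity: number = 0.5,
): SketchDiff {
  const turnScores = golden.map((goldenTurn, i) =>
    i < candidate.length ? jaccardSimilarity(goldenTurn, candidate[i]) : 0,
  );
  const regressedTurns: number[] = [];
  turnScores.forEach((score, i) => {
    if (score < minSimilarity) regressedTurns.push(i);
  });
  return { turnScores, regressedTurns };
}
```

Positional alignment is the simplest possible choice; a candidate that inserts or drops a turn would need an alignment step before scoring.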
Terraform modules and environment configurations for deploying the agent-eval-harness application to AWS, Azure, GCP, OCI, Netlify, and Vercel, providing reusable infrastructure-as-code for compute, database, cache, and storage resources across each platform.
A provider-agnostic LLM-as-judge engine that scores agent responses on faithfulness, relevance, tool correctness, and overall quality using Claude, GPT-4, Gemini, or any OpenAI-compatible provider. Exports a `JudgeEngine` class with `judge` and `judgeBatch` methods, plus a `JudgeCalibrator` for calibrating scores against human labels using temperature scaling, isotonic regression, or linear regression.
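Of the three calibration methods named above, linear regression is the simplest to show. The sketch below fits a least-squares line from judge scores to human labels and applies it to new scores. It assumes scores and labels lie in [0, 1], and `fitLinearCalibration` and `applyCalibration` are hypothetical helpers rather than `JudgeCalibrator` methods.

```typescript
interface LinearCalibration {
  slope: number;
  intercept: number;
}

// Ordinary least-squares fit of human labels (y) against judge scores (x).
function fitLinearCalibration(judgeScores: number[], humanLabels: number[]): LinearCalibration {
  if (judgeScores.length !== humanLabels.length || judgeScores.length < 2) {
    throw new Error("need at least two paired samples");
  }
  const n = judgeScores.length;
  const meanX = judgeScores.reduce((sum, x) => sum + x, 0) / n;
  const meanY = humanLabels.reduce((sum, y) => sum + y, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (judgeScores[i] - meanX) * (humanLabels[i] - meanY);
    sxx += (judgeScores[i] - meanX) ** 2;
  }
  // If every judge score is identical, fall back to predicting the mean label.
  const slope = sxx === 0 ? 0 : sxy / sxx;
  return { slope, intercept: meanY - slope * meanX };
}

// Map a raw judge score to a calibrated one, clamped to the assumed [0, 1] range.
function applyCalibration(score: number, calibration: LinearCalibration): number {
  const calibrated = calibration.slope * score + calibration.intercept;
  return Math.min(1, Math.max(0, calibrated));
}
```

A linear fit corrects a constant bias or scale mismatch between judge and humans; isotonic regression would be the choice when the relationship is monotonic but not linear.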
A latency monitoring and SLA enforcement toolkit for AI agent evaluation, providing functions to compute P50/P90/P99 percentiles per turn and trajectory, detect anomalous turns, and validate against configurable latency budgets with severity-graded violations and optimization recommendations.
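The percentile and budget checks described here can be sketched as follows. This uses the nearest-rank percentile definition, which may not match the toolkit's own interpolation, and the severity rule (warning up to 1.5x the budget, critical beyond) is a hypothetical grading chosen for illustration. All function names are assumptions.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no latency samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

interface LatencySummary {
  p50: number;
  p90: number;
  p99: number;
}

function summarizeLatency(latenciesMs: number[]): LatencySummary {
  return {
    p50: percentile(latenciesMs, 50),
    p90: percentile(latenciesMs, 90),
    p99: percentile(latenciesMs, 99),
  };
}

type Severity = "ok" | "warning" | "critical";

// Hypothetical grading: within budget is ok, up to 1.5x is a warning, beyond is critical.
function checkLatencyBudget(summary: LatencySummary, budgetP99Ms: number): Severity {
  if (summary.p99 <= budgetP99Ms) return "ok";
  return summary.p99 <= 1.5 * budgetP99Ms ? "warning" : "critical";
}
```

With small sample counts, P99 under nearest-rank is simply the maximum, so per-turn percentiles are only meaningful once a trajectory set has enough turns.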
An MCP (Model Context Protocol) server that exposes 13 evaluation tools across three layers—atomic judge operations, orchestrated suite runs, and CI gate operations—via stdio transport for integration with AI coding agents like Claude Desktop. It provides a `createMCPServer` function that returns an `EvalHarnessMCPServer` instance, with no external database dependency and session-scoped in-memory state.
Provides OpenTelemetry tracing, 7 pre-configured metrics, Pino-based structured logging with automatic PII redaction, and an in-memory dashboard manager with trend analysis and alerting for agent evaluation pipelines. Exports factory functions (`getLogger`, `getTracingManager`, `getMetricsManager`, `getDashboardManager`) and a `withTracing()` decorator.
A YAML-driven batch evaluation runner that executes multi-metric assessments across trajectory collections with configurable concurrency, then aggregates results into JSON, JUnit XML, CSV, or Markdown and supports statistical comparison between baseline and candidate runs. It provides `SuiteRunner`, `parseConfig`, `ResultsAggregator`, and threshold-checking utilities, and pairs with `@reaatech/agent-eval-harness-trajectory` for per-trajectory evaluation.
Validates agent tool calls across trajectories, checking tool selection, argument schema compliance, hallucinated results, and integration. Exports functions like `validateToolCall`, `validateTrajectory`, and `verifyResult` that operate on `ToolCall` and `Turn` types from `@reaatech/agent-eval-harness-types`.
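As a rough illustration of the tool-selection and argument-schema checks, the sketch below validates a call against a small registry of known tools. It is not the package's `validateToolCall`: the `ToolSpec` registry format, the `SketchToolCall` shape, and `sketchValidateToolCall` are all hypothetical, and a real implementation would more likely validate arguments with Zod schemas.

```typescript
type ArgType = "string" | "number" | "boolean";

// Hypothetical registry entry: required argument names and their primitive types.
interface ToolSpec {
  required: Record<string, ArgType>;
}

// Hypothetical call shape; the real `ToolCall` type may differ.
interface SketchToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Return a list of problems; an empty list means the call looks valid.
function sketchValidateToolCall(
  call: SketchToolCall,
  registry: Record<string, ToolSpec>,
): string[] {
  // Own-property check so names like "toString" are not treated as known tools.
  if (!Object.prototype.hasOwnProperty.call(registry, call.name)) {
    return [`unknown tool: ${call.name}`];
  }
  const spec = registry[call.name];
  const errors: string[] = [];
  for (const key of Object.keys(spec.required)) {
    const expected = spec.required[key];
    const actual = call.args[key];
    if (actual === undefined) {
      errors.push(`missing argument: ${key}`);
    } else if (typeof actual !== expected) {
      errors.push(`argument ${key} should be ${expected}, got ${typeof actual}`);
    }
  }
  return errors;
}
```

Returning a list of messages instead of throwing lets a trajectory-level validator aggregate problems across every call in a run.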
Parses JSONL turn files into validated trajectories, then scores them for coherence, goal completion, and conversation flow, and diffs candidate trajectories against golden references for regression detection.
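The parsing step can be sketched without any dependencies. This example assumes one JSON object per line with `role` and `content` fields, which is a guess at the turn format rather than the package's actual schema, and `SketchTurn`, `isSketchTurn`, and `parseJsonlTurns` are hypothetical names. The real package would presumably validate with its Zod schemas instead of a hand-written type guard.

```typescript
// Assumed turn shape; the package's real schema may carry more fields.
type SketchRole = "user" | "assistant" | "tool";

interface SketchTurn {
  role: SketchRole;
  content: string;
}

// Hand-written type guard standing in for schema validation.
function isSketchTurn(value: unknown): value is SketchTurn {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const role = record["role"];
  return (
    (role === "user" || role === "assistant" || role === "tool") &&
    typeof record["content"] === "string"
  );
}

// Parse JSONL text into turns, collecting per-line errors instead of failing fast.
function parseJsonlTurns(text: string): { turns: SketchTurn[]; errors: string[] } {
  const turns: SketchTurn[] = [];
  const errors: string[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // skip blank lines
    try {
      const parsed: unknown = JSON.parse(line);
      if (isSketchTurn(parsed)) {
        turns.push(parsed);
      } else {
        errors.push(`line ${index + 1}: not a valid turn`);
      }
    } catch {
      errors.push(`line ${index + 1}: invalid JSON`);
    }
  });
  return { turns, errors };
}
```

Reporting errors with 1-based line numbers makes it straightforward to locate a malformed turn in a large trajectory file.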