agent-eval-harness · packages
Every package shipped from reaatech/agent-eval-harness, published or pending.
13 packages · page 1 of 2
@reaatech/agent-eval-harness-cli
This CLI provides a suite of commands for executing agent evaluation pipelines, managing golden trajectories, and enforcing CI quality gates. It also functions as an MCP server in stdio mode, exposing its evaluation tools to other AI agents.
- status: published · 7 days ago
@reaatech/agent-eval-harness-cost
Calculates and enforces spending limits for AI agent trajectories by providing functions to compute token-based costs, compare performance, and trigger budget alerts. It exports a suite of utility functions that operate on trajectory objects to generate granular cost breakdowns and optimization recommendations across major LLM providers.
- status: published · 8 days ago
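The token-cost arithmetic the cost package automates is easy to sketch. The `computeCost` helper and the pricing table below are illustrative stand-ins, not the package's actual API or real provider prices:

```typescript
// Illustrative token-based cost accounting. Function names, shapes,
// and per-million-token prices (USD) are hypothetical.
interface Usage { model: string; inputTokens: number; outputTokens: number; }

const PRICING: Record<string, { input: number; output: number }> = {
  "model-a": { input: 2.5, output: 10 },
  "model-b": { input: 3, output: 15 },
};

function computeCost(turns: Usage[]): number {
  return turns.reduce((total, t) => {
    const p = PRICING[t.model];
    if (!p) throw new Error(`No pricing for model: ${t.model}`);
    return total +
      (t.inputTokens / 1_000_000) * p.input +
      (t.outputTokens / 1_000_000) * p.output;
  }, 0);
}

// A granular breakdown falls out by running computeCost per turn
// instead of over the whole trajectory.
const cost = computeCost([
  { model: "model-a", inputTokens: 10_000, outputTokens: 2_000 },
]);
console.log(cost.toFixed(3)); // "0.045"
```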
@reaatech/agent-eval-harness-gate
Enforces CI/CD regression thresholds for AI agent performance, cost, and quality metrics. It provides a `GateEngine` class to evaluate agent results against configurable gates and generates JUnit XML, GitHub Actions annotations, and JSON summaries.
- status: published · 8 days ago
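The core of threshold gating can be sketched in a few lines. The gate shape and `evaluateGates` function below are hypothetical, not the `GateEngine` API:

```typescript
// Minimal sketch of metric gating: each gate pins a metric to an
// optional min/max bound. Names and shapes are illustrative only.
interface Gate { metric: string; max?: number; min?: number; }
interface GateResult { metric: string; value: number; passed: boolean; }

function evaluateGates(metrics: Record<string, number>, gates: Gate[]): GateResult[] {
  return gates.map((g) => {
    const value = metrics[g.metric] ?? Number.NaN; // missing metric fails the gate
    const passed =
      Number.isFinite(value) &&
      (g.max === undefined || value <= g.max) &&
      (g.min === undefined || value >= g.min);
    return { metric: g.metric, value, passed };
  });
}

const results = evaluateGates(
  { costUsd: 0.12, score: 0.91 },
  [{ metric: "costUsd", max: 0.5 }, { metric: "score", min: 0.8 }],
);
console.log(results.every((r) => r.passed)); // true
```

A CI reporter would then render these results as JUnit XML or GitHub Actions annotations, as the package description notes.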
@reaatech/agent-eval-harness-golden
Manages reference agent trajectories for regression testing through a collection of utility functions and a `GoldenCurator` class. It provides tools to create, annotate, and validate golden datasets, and includes a comparison engine to detect regressions by diffing candidate trajectories against these references.
- status: published · 8 days ago
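Diffing a candidate trajectory against a golden reference can be sketched as a positional comparison of tool calls plus a final-answer check. The record shape and `diffAgainstGolden` helper are illustrative, not the comparison engine's real interface:

```typescript
// Sketch of golden-reference diffing; field names are hypothetical.
interface Trajectory { id: string; toolCalls: string[]; finalAnswer: string; }

function diffAgainstGolden(candidate: Trajectory, golden: Trajectory): string[] {
  const issues: string[] = [];
  golden.toolCalls.forEach((call, i) => {
    if (candidate.toolCalls[i] !== call) {
      issues.push(`turn ${i}: expected tool ${call}, got ${candidate.toolCalls[i] ?? "nothing"}`);
    }
  });
  if (candidate.toolCalls.length > golden.toolCalls.length) {
    issues.push(`unexpected extra tool calls: ${candidate.toolCalls.slice(golden.toolCalls.length).join(", ")}`);
  }
  if (candidate.finalAnswer !== golden.finalAnswer) {
    issues.push("final answer diverges from golden");
  }
  return issues; // empty array means no regression detected
}
```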
@reaatech/agent-eval-harness-infra
Provides a collection of Terraform modules and environment configurations for deploying the agent-eval-harness across AWS, Azure, GCP, OCI, Vercel, and Netlify. It requires Terraform 1.0+ and cloud-specific provider credentials to provision the necessary compute, database, and storage infrastructure.
- status: awaiting publish
@reaatech/agent-eval-harness-judge
Evaluates agent responses using LLM-as-a-judge patterns with support for multi-model consensus, automated calibration, and cost tracking. It provides a `JudgeEngine` class that interfaces with OpenAI-compatible providers to score faithfulness, relevance, and tool correctness.
- status: published · 8 days ago
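Multi-model consensus reduces to aggregating per-judge scores and checking agreement. The sketch below stubs out the judge calls entirely (the real package talks to OpenAI-compatible providers); the `consensus` helper and its spread heuristic are assumptions, not the `JudgeEngine` API:

```typescript
// Illustrative consensus over judge scores in [0, 1]. The agreement
// rule (max-min spread) is a hypothetical stand-in for whatever
// calibration the package actually applies.
interface JudgeScore { model: string; score: number; }

function consensus(scores: JudgeScore[], maxSpread = 0.3) {
  const values = scores.map((s) => s.score);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const spread = Math.max(...values) - Math.min(...values);
  return { mean, agreed: spread <= maxSpread };
}

const verdict = consensus([
  { model: "judge-a", score: 0.8 },
  { model: "judge-b", score: 0.9 },
]);
console.log(verdict.agreed); // true
```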
@reaatech/agent-eval-harness-latency
Computes latency metrics, enforces SLA budgets, and identifies performance bottlenecks for AI agent trajectories. It provides a suite of utility functions and a `LatencyTracker` class to analyze turn-level and component-specific timing data.
- status: published · 8 days ago
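SLA budget enforcement boils down to a percentile computation over turn latencies. The nearest-rank percentile and `withinBudget` helper below are a sketch, unrelated to the actual `LatencyTracker` internals:

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

// Hypothetical budget check: pass if the p95 latency fits the SLA.
function withinBudget(samplesMs: number[], p95BudgetMs: number): boolean {
  return percentile(samplesMs, 95) <= p95BudgetMs;
}

const turns = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000];
console.log(withinBudget(turns, 1000)); // true
```

Component-specific bottlenecks fall out of the same math run per component rather than per trajectory.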
@reaatech/agent-eval-harness-mcp-server
Exposes 13 evaluation tools for AI agents via the Model Context Protocol (MCP) using stdio transport. It provides a factory function to instantiate a server that handles atomic judgments, suite orchestration, and CI gate operations.
- status: published · 8 days ago
@reaatech/agent-eval-harness-observability
Provides OpenTelemetry instrumentation, Pino-based structured logging with PII redaction, and an in-memory dashboard manager for tracking agent evaluation pipelines. It exposes a set of singleton managers for recording metrics, tracing execution spans, and aggregating performance trends.
- status: published · 8 days ago
@reaatech/agent-eval-harness-suite
Executes batch evaluations of agent trajectories using a YAML-configured runner class that aggregates multi-metric scores and performs statistical regression analysis between runs. It requires an external evaluator function and trajectory data to process concurrent test suites.
- status: published · 8 days ago
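The regression analysis between runs can be sketched as a mean-score comparison with a tolerance; the package's actual statistical test may be more rigorous, and the helper names here are hypothetical:

```typescript
// Illustrative run-over-run regression check: flag the candidate run
// if its mean score drops more than `tolerance` below the baseline.
function meanScore(scores: number[]): number {
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}

function detectRegression(baseline: number[], candidate: number[], tolerance = 0.05): boolean {
  return meanScore(baseline) - meanScore(candidate) > tolerance;
}

console.log(detectRegression([0.9, 0.9], [0.8, 0.8])); // true
```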
@reaatech/agent-eval-harness-tool-use
Validates agent tool-use trajectories by checking schema compliance, argument accuracy, and result integration. It provides a set of utility functions to evaluate individual tool calls or full conversation turns against defined tool schemas.
- status: published · 8 days ago
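Schema-compliance checking for a single tool call can be sketched with a minimal schema that only tracks required argument names; real tool schemas (and the package's validators) carry full type information:

```typescript
// Illustrative tool-call validation; shapes are hypothetical.
interface ToolSchema { name: string; required: string[]; }
interface ToolCall { name: string; args: Record<string, unknown>; }

function validateCall(call: ToolCall, schemas: ToolSchema[]): string[] {
  const schema = schemas.find((s) => s.name === call.name);
  if (!schema) return [`unknown tool: ${call.name}`];
  return schema.required
    .filter((key) => !(key in call.args))
    .map((key) => `missing required argument: ${key}`);
}

const schemas = [{ name: "search", required: ["query"] }];
console.log(validateCall({ name: "search", args: { query: "x" } }, schemas)); // []
```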
@reaatech/agent-eval-harness-trajectory
Provides utilities for loading, validating, and evaluating agent conversation trajectories from JSONL files. It exports functions for parsing data, calculating coherence and goal completion metrics, and comparing candidate trajectories against golden references, requiring `@reaatech/agent-eval-harness-types` for schema validation.
- status: published · 8 days ago
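JSONL loading is one JSON object per line, with malformed lines worth reporting rather than silently dropping. The record shape below is illustrative; the real schema validation lives in `@reaatech/agent-eval-harness-types`:

```typescript
// Sketch of JSONL trajectory parsing with per-line error reporting.
// Field names are hypothetical, not the package's schema.
interface TrajectoryRecord { id: string; turns: unknown[]; }

function parseJsonl(text: string): { records: TrajectoryRecord[]; errors: string[] } {
  const records: TrajectoryRecord[] = [];
  const errors: string[] = [];
  text.split("\n").forEach((line, i) => {
    if (!line.trim()) return; // skip blank lines
    try {
      records.push(JSON.parse(line) as TrajectoryRecord);
    } catch {
      errors.push(`line ${i + 1}: invalid JSON`);
    }
  });
  return { records, errors };
}

const parsed = parseJsonl('{"id":"a","turns":[]}\nnot json\n');
console.log(parsed.records.length, parsed.errors.length); // 1 1
```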
