# @reaatech/agent-eval-harness-mcp-server

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Three-layer MCP (Model Context Protocol) server exposing evaluation tools. Provides 13 tools across three layers — atomic judge operations, orchestrated suite runs, and CI gate operations — all accessible via MCP stdio transport for integration with AI coding agents like Claude Desktop.

## Installation

```bash
npm install @reaatech/agent-eval-harness-mcp-server
```

## Feature Overview

- **13 MCP tools** — covering the full evaluation lifecycle, from atomic judgment to CI gate checking
- **Three-layer architecture** — `eval.judge.*` (5 fast, stateless atomic ops), `eval.suite.*` (5 orchestrated, longer-running ops), `eval.gate.*` (3 blocking CI gate ops)
- **Stdio transport** — standard MCP protocol over stdin/stdout, no HTTP server required
- **Auto-discovery** — agents can list available tools and their input/output schemas at connection (see the client sketch after this list)
- **In-memory state** — session-scoped run storage with no external database dependency
- **JSON Schema tool definitions** — each tool declares its input shape for type-safe agent invocation
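
For example, discovery from the client side is a single `tools/list` round trip. Here is a minimal sketch using the `@modelcontextprotocol/sdk` client; the `node server.js` launch command is illustrative and assumes a small script that calls `createMCPServer()`:

```typescript
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

// Spawn the eval-harness server as a child process and speak MCP over stdio.
const transport = new StdioClientTransport({
  command: 'node',
  args: ['server.js'], // illustrative: a script that calls createMCPServer()
});

const client = new Client({ name: 'example-client', version: '1.0.0' });
await client.connect(transport);

// Auto-discovery: list all 13 tools with their JSON Schema inputs.
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));
```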

## Quick Start

```typescript
import { createMCPServer } from '@reaatech/agent-eval-harness-mcp-server';

const server = await createMCPServer();
await server.run(); // Connects via stdio — ready for MCP clients
```

Configure tool layers individually:

```typescript
import { createMCPServer } from '@reaatech/agent-eval-harness-mcp-server';

const server = await createMCPServer({
  name: 'my-eval-server',
  enableJudgeTools: true,
  enableSuiteTools: true,
  enableGateTools: false, // gate ops disabled
});
```

## API Reference

### Server

| Export | Type | Description |
| --- | --- | --- |
| `EvalHarnessMCPServer` | class | MCP server instance wrapping `@modelcontextprotocol/sdk` |
| `createMCPServer(config?)` | `async (config?: Partial<MCPServerConfig>) => Promise<EvalHarnessMCPServer>` | Create and start the server in one call |

`EvalHarnessMCPServer` methods:

| Method | Description |
| --- | --- |
| `run()` | Connect and start listening on the stdio transport |
| `getServer()` | Access the underlying MCP `Server` instance |
| `close()` | Gracefully close the server connection |
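
A minimal lifecycle sketch wiring `close()` to a shutdown signal (the signal handling is ordinary host-process boilerplate, not part of the package):

```typescript
import { createMCPServer } from '@reaatech/agent-eval-harness-mcp-server';

const server = await createMCPServer();

// Close the stdio connection cleanly when the host process is interrupted.
process.on('SIGINT', async () => {
  await server.close();
  process.exit(0);
});
```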

## Configuration

### `MCPServerConfig`

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | `string` | `'agent-eval-harness'` | Server name reported to MCP clients |
| `version` | `string` | `package.json` version | Server version |
| `enableJudgeTools` | `boolean` | `true` | Register `eval.judge.*` tools |
| `enableSuiteTools` | `boolean` | `true` | Register `eval.suite.*` tools |
| `enableGateTools` | `boolean` | `true` | Register `eval.gate.*` tools |
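
Spelled out, a configuration equivalent to calling `createMCPServer()` with no arguments would look like this (every field is optional):

```typescript
import { createMCPServer } from '@reaatech/agent-eval-harness-mcp-server';

const server = await createMCPServer({
  name: 'agent-eval-harness', // default server name
  version: '0.1.0',           // defaults to the package's own version
  enableJudgeTools: true,
  enableSuiteTools: true,
  enableGateTools: true,
});
```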

## Tool Reference

### Layer 1 — `eval.judge.*` (Atomic Operations)

Fast, stateless operations designed for mid-task self-evaluation by agents.

| Tool | Input | Output | Description |
| --- | --- | --- | --- |
| `eval.judge.faithfulness` | `{ context: string, response: string }` | `{ score, explanation, confidence }` | Score response faithfulness to the context |
| `eval.judge.relevance` | `{ intent: string, response: string }` | `{ score, explanation, confidence }` | Score response relevance to the intent |
| `eval.judge.tool_correctness` | `{ expected_tool: string, actual_tool: string, arguments?: object, result?: object }` | `{ score, explanation, confidence }` | Validate tool selection and arguments |
| `eval.judge.cost_check` | `{ trajectory: object, budget: number }` | `{ within_budget, cost, budget, usage_percentage }` | Verify cost is within budget |
| `eval.judge.latency_check` | `{ trajectory: object, sla: number }` | `{ within_sla, p50_ms, p90_ms, p99_ms, total_ms }` | Verify latency is within the SLA |
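
From a connected MCP client (like the sketch under Feature Overview), an atomic judgment is one `tools/call` round trip. A sketch against `eval.judge.faithfulness`; how the server serializes its `{ score, explanation, confidence }` payload into the tool result is an assumption here:

```typescript
const result = await client.callTool({
  name: 'eval.judge.faithfulness',
  arguments: {
    context: 'Refunds are accepted within 30 days of purchase.',
    response: 'You can return the item for a refund within 30 days.',
  },
});

// Assumption: the { score, explanation, confidence } payload comes back
// as a JSON-encoded text content block.
const verdict = JSON.parse((result as any).content[0].text);
console.log(verdict.score, verdict.confidence);
```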

### Layer 2 — `eval.suite.*` (Orchestrated Runs)

Stateful operations for eval-driven development, backed by in-memory run storage per session.

| Tool | Input | Output | Description |
| --- | --- | --- | --- |
| `eval.suite.run` | `{ trajectories: object[], config?: { metrics?, judge_model?, budget_limit? } }` | `{ run_id, status, total_trajectories, completed, failed, duration_ms }` | Execute an evaluation suite |
| `eval.suite.status` | `{ run_id: string }` | `{ run_id, status, progress, completed, total, started_at, ended_at }` | Get run progress |
| `eval.suite.results` | `{ run_id: string, format?: 'json' \| 'summary' }` | Aggregated results or summary | Retrieve evaluation results |
| `eval.suite.compare` | `{ baseline_run: string, candidate_run: string }` | `{ score_diff, verdict, regressions, improvements, key_findings }` | Compare two runs |
| `eval.suite.baseline` | `{ run_id: string, name?: string }` | `{ baseline_id, name, set_at }` | Set a baseline for regression checks |
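
A sketch of an eval-driven loop over these tools — run a suite, pin it as the baseline, then compare a later run against it. The trajectory payload, the candidate run id, and the `parse` helper (which assumes the JSON payload arrives as a text content block) are all illustrative:

```typescript
// Run a suite and capture its run_id.
const run = await client.callTool({
  name: 'eval.suite.run',
  arguments: { trajectories: [{ /* trajectory object */ }] },
});

// Assumption: the harness returns its JSON payload as a text content block.
const parse = (res: any) => JSON.parse(res.content[0].text);
const { run_id } = parse(run);

// Pin this run as the regression baseline...
await client.callTool({
  name: 'eval.suite.baseline',
  arguments: { run_id, name: 'v1-baseline' },
});

// ...then compare a later candidate run against it.
const diff = await client.callTool({
  name: 'eval.suite.compare',
  arguments: { baseline_run: run_id, candidate_run: 'candidate-run-id' },
});
console.log(parse(diff).verdict);
```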

### Layer 3 — `eval.gate.*` (CI Gates)

Blocking, opinionated operations for CI/CD pipelines, with in-memory gate storage per session.

| Tool | Input | Output | Description |
| --- | --- | --- | --- |
| `eval.gate.run` | `{ run_id?: string, gate_config?: string, results?: object, comparison?: object }` | `{ passed, total_gates, passed_gates, failed_gates, results, exit_code }` | Run a CI-style pass/fail gate |
| `eval.gate.config` | `{ action: 'get' \| 'set' \| 'list', config?: object[], preset?: 'standard' \| 'strict' \| 'lenient' }` | `{ gates }` or `{ success, gates_loaded }` | Get, set, or list gate configuration |
| `eval.gate.diff` | `{ baseline: object, candidate: object, metrics?: string[] }` | `{ score_diff, metric_diffs, regressions, improvements, verdict }` | Detailed diff against a baseline |
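
In a CI job, the returned `exit_code` maps naturally onto the process exit status. A sketch, again using the MCP client from earlier; combining `action: 'set'` with a `preset`, the run id, and the result parsing are assumptions:

```typescript
// Load the 'strict' preset, then gate on a completed run.
await client.callTool({
  name: 'eval.gate.config',
  arguments: { action: 'set', preset: 'strict' },
});

const gate = await client.callTool({
  name: 'eval.gate.run',
  arguments: { run_id: 'candidate-run-id' }, // placeholder id
});

// Assumption: the JSON payload arrives as a text content block.
const { passed, exit_code } = JSON.parse((gate as any).content[0].text);
if (!passed) process.exit(exit_code); // fail the CI job
```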

## License

MIT