@reaatech/agent-eval-harness-mcp-server

npm v0.1.1

An MCP (Model Context Protocol) server that exposes 13 evaluation tools across three layers—atomic judge operations, orchestrated suite runs, and CI gate operations—via stdio transport for integration with AI coding agents like Claude Desktop. It provides a `createMCPServer` function that returns an `EvalHarnessMCPServer` instance, with no external database dependency and session-scoped in-memory state.

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Installation

```shell
npm install @reaatech/agent-eval-harness-mcp-server
```

Feature Overview

  • 13 MCP tools: cover the full evaluation lifecycle, from atomic judgment to CI gate checking
  • Three-layer architecture: `eval.judge.*` (5 fast, stateless atomic ops), `eval.suite.*` (5 orchestrated, longer-running ops), `eval.gate.*` (3 blocking CI gate ops)
  • Stdio transport: standard MCP protocol over stdin/stdout, no HTTP server required
  • Auto-discovery: agents can list available tools and their input/output schemas at connection
  • In-memory state: session-scoped run storage with no external database dependency
  • JSON Schema tool definitions: each tool declares its input shape for type-safe agent invocation
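
For orientation, MCP clients discover and invoke tools with JSON-RPC 2.0 messages sent over the transport after the initialize handshake: `tools/list` for discovery and `tools/call` for invocation. The sketch below builds both request payloads as plain objects. The helper names (`listToolsRequest`, `callToolRequest`) and the sample arguments are illustrative and not part of this package; in practice an MCP client library would normally construct these messages for you.

```typescript
// Minimal shape of a JSON-RPC 2.0 request as used by MCP.
interface JsonRpcRequest {
  jsonrpc: '2.0';
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Hypothetical helper: discovery request asking the server for its tool list.
function listToolsRequest(id: number): JsonRpcRequest {
  return { jsonrpc: '2.0', id, method: 'tools/list' };
}

// Hypothetical helper: invocation request for a single tool by name.
function callToolRequest(
  id: number,
  name: string,
  args: Record<string, unknown>,
): JsonRpcRequest {
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: { name, arguments: args },
  };
}

const discover = listToolsRequest(1);
const judge = callToolRequest(2, 'eval.judge.faithfulness', {
  context: 'The Eiffel Tower is in Paris.',
  response: 'The Eiffel Tower is located in Paris, France.',
});
```

Over the stdio transport each such message is serialized as a single line of JSON, which is why no HTTP server is involved.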

Quick Start

```typescript
import { createMCPServer } from '@reaatech/agent-eval-harness-mcp-server';

const server = await createMCPServer();
await server.run(); // Connects over stdio; ready for MCP clients
```

Configure tool layers individually:

```typescript
import { createMCPServer } from '@reaatech/agent-eval-harness-mcp-server';

const server = await createMCPServer({
  name: 'my-eval-server',
  enableJudgeTools: true,
  enableSuiteTools: true,
  enableGateTools: false, // gate ops disabled
});
```

API Reference

Server

| Export | Type | Description |
| --- | --- | --- |
| `EvalHarnessMCPServer` | Class | MCP server instance wrapping `@modelcontextprotocol/sdk` |
| `createMCPServer(config?)` | `async (config?: Partial<MCPServerConfig>) => Promise<EvalHarnessMCPServer>` | Create and start server in one call |

EvalHarnessMCPServer methods:

| Method | Description |
| --- | --- |
| `run()` | Connect and start listening on stdio transport |
| `getServer()` | Access underlying MCP `Server` instance |
| `close()` | Gracefully close the server connection |

Configuration

MCPServerConfig

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | `string` | `agent-eval-harness` | Server name reported to MCP clients |
| `version` | `string` | `package.version` | Server version |
| `enableJudgeTools` | `boolean` | `true` | Register `eval.judge.*` tools |
| `enableSuiteTools` | `boolean` | `true` | Register `eval.suite.*` tools |
| `enableGateTools` | `boolean` | `true` | Register `eval.gate.*` tools |
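
Because `createMCPServer` accepts a `Partial<MCPServerConfig>`, omitted fields presumably fall back to the defaults in the table. The following is a minimal sketch of that kind of merge, not the package's actual implementation: the interface is a local mirror of the documented fields, `resolveConfig` and `enabledLayers` are hypothetical helpers, and the version string is a placeholder for `package.version`.

```typescript
// Local mirror of the documented MCPServerConfig fields (not imported from the package).
interface MCPServerConfig {
  name: string;
  version: string;
  enableJudgeTools: boolean;
  enableSuiteTools: boolean;
  enableGateTools: boolean;
}

// Documented defaults; the version string is a placeholder for package.version.
const DEFAULT_CONFIG: MCPServerConfig = {
  name: 'agent-eval-harness',
  version: '0.0.0-placeholder',
  enableJudgeTools: true,
  enableSuiteTools: true,
  enableGateTools: true,
};

// Hypothetical helper: overlay caller-supplied fields on the defaults.
function resolveConfig(partial: Partial<MCPServerConfig> = {}): MCPServerConfig {
  return { ...DEFAULT_CONFIG, ...partial };
}

// Hypothetical helper: list the tool namespaces that would be registered.
function enabledLayers(config: MCPServerConfig): string[] {
  const layers: string[] = [];
  if (config.enableJudgeTools) layers.push('eval.judge.*');
  if (config.enableSuiteTools) layers.push('eval.suite.*');
  if (config.enableGateTools) layers.push('eval.gate.*');
  return layers;
}

const resolved = resolveConfig({ name: 'my-eval-server', enableGateTools: false });
const layers = enabledLayers(resolved); // ['eval.judge.*', 'eval.suite.*']
```

A spread-based merge lets an explicit `undefined` override a default, so a stricter version would filter such values out before merging.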

Tool Reference

Layer 1 — eval.judge.* (Atomic Operations)

Fast, stateless operations designed for mid-task self-evaluation by agents.

| Tool | Input | Output | Description |
| --- | --- | --- | --- |
| `eval.judge.faithfulness` | `{ context: string, response: string }` | `{ score, explanation, confidence }` | Score response faithfulness to context |
| `eval.judge.relevance` | `{ intent: string, response: string }` | `{ score, explanation, confidence }` | Score response relevance to intent |
| `eval.judge.tool_correctness` | `{ expected_tool: string, actual_tool: string, arguments?: object, result?: object }` | `{ score, explanation, confidence }` | Validate tool selection and arguments |
| `eval.judge.cost_check` | `{ trajectory: object, budget: number }` | `{ within_budget, cost, budget, usage_percentage }` | Verify cost within budget |
| `eval.judge.latency_check` | `{ trajectory: object, sla: number }` | `{ within_sla, p99_ms, p50_ms, p90_ms, total_ms }` | Verify latency within SLA |
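
To make the `eval.judge.cost_check` output fields concrete, here is a small sketch of how they could relate to each other. It assumes `usage_percentage` is `cost / budget * 100` and that spending exactly the budget still counts as within budget; neither rule is stated in this README, and how the tool derives `cost` from a trajectory is not documented here. `checkCost` is a hypothetical helper, not a package export.

```typescript
// Output shape documented for eval.judge.cost_check.
interface CostCheckResult {
  within_budget: boolean;
  cost: number;
  budget: number;
  usage_percentage: number;
}

// Hypothetical helper: derive the result fields from a known cost and budget.
// Assumes usage_percentage = cost / budget * 100 and an inclusive budget limit.
function checkCost(cost: number, budget: number): CostCheckResult {
  if (budget <= 0) {
    throw new RangeError('budget must be a positive number');
  }
  return {
    within_budget: cost <= budget,
    cost,
    budget,
    usage_percentage: (cost / budget) * 100,
  };
}

const underBudget = checkCost(0.5, 2); // within_budget: true, usage_percentage: 25
const overBudget = checkCost(3, 2); // within_budget: false, usage_percentage: 150
```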

Layer 2 — eval.suite.* (Orchestrated Runs)

Stateful operations for eval-driven development. In-memory storage per session.

| Tool | Input | Output | Description |
| --- | --- | --- | --- |
| `eval.suite.run` | `{ trajectories: object[], config?: { metrics?, judge_model?, budget_limit? } }` | `{ run_id, status, total_trajectories, completed, failed, duration_ms }` | Execute evaluation suite |
| `eval.suite.status` | `{ run_id: string }` | `{ run_id, status, progress, completed, total, started_at, ended_at }` | Get run progress |
| `eval.suite.results` | `{ run_id: string, format?: 'json' \| 'summary' }` | Aggregated results or summary | Retrieve evaluation results |
| `eval.suite.compare` | `{ baseline_run: string, candidate_run: string }` | `{ score_diff, verdict, regressions, improvements, key_findings }` | Compare two runs |
| `eval.suite.baseline` | `{ run_id: string, name?: string }` | `{ baseline_id, name, set_at }` | Set baseline for regression |
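
Session-scoped in-memory storage means run records live only as long as the server process. The sketch below illustrates the idea with a `Map` keyed by `run_id`, using the field names from the `eval.suite.status` output. It is an assumption-laden illustration rather than the package's implementation: `InMemoryRunStore`, the status values, the `run-N` id format, and the definition of `progress` as `completed / total` are all hypothetical.

```typescript
type RunStatus = 'running' | 'completed' | 'failed';

// Fields mirror the documented eval.suite.status output.
interface RunRecord {
  run_id: string;
  status: RunStatus;
  progress: number; // assumed to be completed / total, in the range 0..1
  completed: number;
  total: number;
  started_at: string;
  ended_at: string | null;
}

// Hypothetical session-scoped store: everything lives in a Map and is lost on exit.
class InMemoryRunStore {
  private readonly runs = new Map<string, RunRecord>();
  private nextId = 1;

  // Register a new run covering `total` trajectories.
  start(total: number, now: Date = new Date()): RunRecord {
    const run: RunRecord = {
      run_id: `run-${this.nextId++}`,
      status: 'running',
      progress: 0,
      completed: 0,
      total,
      started_at: now.toISOString(),
      ended_at: null,
    };
    this.runs.set(run.run_id, run);
    return run;
  }

  // Record one finished trajectory and close the run once all are done.
  markCompleted(runId: string, now: Date = new Date()): RunRecord {
    const run = this.get(runId);
    run.completed += 1;
    run.progress = run.total === 0 ? 1 : run.completed / run.total;
    if (run.completed >= run.total) {
      run.status = 'completed';
      run.ended_at = now.toISOString();
    }
    return run;
  }

  // Look up a run, failing loudly on an unknown id.
  get(runId: string): RunRecord {
    const run = this.runs.get(runId);
    if (!run) throw new Error(`Unknown run_id: ${runId}`);
    return run;
  }
}

const store = new InMemoryRunStore();
const run = store.start(2);
store.markCompleted(run.run_id);
const halfway = store.get(run.run_id).progress; // 0.5
store.markCompleted(run.run_id);
const finalStatus = store.get(run.run_id).status; // 'completed'
```

One practical consequence of this design is that a `run_id` from an earlier session cannot be passed to `eval.suite.compare` or `eval.gate.run` after the server restarts.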

Layer 3 — eval.gate.* (CI Gates)

Blocking, opinionated operations for CI/CD pipelines. In-memory gate storage per session.

| Tool | Input | Output | Description |
| --- | --- | --- | --- |
| `eval.gate.run` | `{ run_id?: string, gate_config?: string, results?: object, comparison?: object }` | `{ passed, total_gates, passed_gates, failed_gates, results, exit_code }` | Run CI-style pass/fail gate |
| `eval.gate.config` | `{ action: 'get' \| 'set' \| 'list', config?: object[], preset?: 'standard' \| 'strict' \| 'lenient' }` | `{ gates }` or `{ success, gates_loaded }` | Get/set/list gate configuration |
| `eval.gate.diff` | `{ baseline: object, candidate: object, metrics?: string[] }` | `{ score_diff, metric_diffs, regressions, improvements, verdict }` | Detailed diff from baseline |
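
In a CI script, the `eval.gate.run` output would typically be reduced to a log line and a process exit code. The sketch below shows one way to do that from an already-parsed result. `summarizeGate` is a hypothetical helper, and the fallback to exit code 1 when a failed run reports 0 is a defensive assumption, not documented behavior.

```typescript
// Subset of the documented eval.gate.run output used here.
interface GateRunOutput {
  passed: boolean;
  total_gates: number;
  passed_gates: number;
  failed_gates: number;
  exit_code: number;
}

// Hypothetical helper: turn a gate result into a log line and an exit code.
// Falls back to 1 if the run failed but exit_code was reported as 0.
function summarizeGate(result: GateRunOutput): { message: string; exitCode: number } {
  const verdict = result.passed ? 'PASSED' : 'FAILED';
  const message =
    `${verdict}: ${result.passed_gates}/${result.total_gates} gates passed, ` +
    `${result.failed_gates} failed`;
  const exitCode = result.passed ? 0 : (result.exit_code || 1);
  return { message, exitCode };
}

const summary = summarizeGate({
  passed: false,
  total_gates: 3,
  passed_gates: 2,
  failed_gates: 1,
  exit_code: 1,
});
// summary.message is 'FAILED: 2/3 gates passed, 1 failed'
// summary.exitCode is 1
```

A pipeline step could then print `summary.message` and exit with `summary.exitCode` so that a failed gate blocks the build.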

License

MIT