@reaatech/classifier-evals-mcp-server

npm v0.1.0

Exposes classifier evaluation workflows—including running evaluations, checking regression gates, and performing LLM-as-judge comparisons—as a set of Model Context Protocol (MCP) tools. It provides a CLI executable and a `startMCPServer` function that runs over stdio, requiring the `@modelcontextprotocol/sdk` at runtime.

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

MCP (Model Context Protocol) server exposing classifier evaluation tools via stdio transport. Integrates with MCP-compatible clients (Claude Desktop, agent-mesh, and other MCP hosts) to run evaluations, check gates, compare models, run LLM-as-judge, and generate reports.

Installation

terminal
npm install @reaatech/classifier-evals-mcp-server @modelcontextprotocol/sdk
# or
pnpm add @reaatech/classifier-evals-mcp-server @modelcontextprotocol/sdk

Feature Overview

  • 5 MCP tools: run_eval, check_gates, compare_models, llm_judge, generate_report
  • Stdio transport: standard MCP server over stdin/stdout, compatible with MCP clients that support the stdio transport
  • Full evaluation pipeline: dataset loading, metrics calculation, and result construction in a single call
  • YAML and JSON configs: load gate configs and eval results from file paths
  • Structured logging: Pino-based logging of all tool invocations and results
  • Error handling: typed MCP errors with descriptive messages

Quick Start

As an MCP Server

terminal
# Start the MCP server (uses stdio transport)
npx @reaatech/classifier-evals-mcp-server
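
Once the server is running, a client talks to it over stdin/stdout with JSON-RPC 2.0 messages, one JSON object per line. The sketch below shows roughly what a `tools/call` request looks like on the wire. The helper names `buildToolCall` and `frameForStdio` are illustrative and not part of this package; in practice an MCP client library handles this framing, along with the `initialize` handshake that must precede any tool call.

```typescript
// JSON-RPC 2.0 request shape that MCP uses for tool invocation.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Build a tools/call request for a tool name and its arguments.
// (Hypothetical helper, shown only to illustrate the message shape.)
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// The stdio transport is newline-delimited, so each message is serialized
// without indentation (JSON.stringify escapes newlines inside strings)
// and terminated with a single "\n".
function frameForStdio(message: ToolCallRequest): string {
  return JSON.stringify(message) + "\n";
}

const runEvalRequest = buildToolCall(1, "run_eval", {
  dataset_path: "datasets/test-set.csv",
  metrics: ["accuracy", "f1"],
});
const wireLine = frameForStdio(runEvalRequest);
```

The `name` and `arguments` fields here correspond to the tool call examples shown for each tool in the MCP Tools section.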

As a Library

typescript
import { startMCPServer } from "@reaatech/classifier-evals-mcp-server";
 
// Start the MCP server programmatically
await startMCPServer();

MCP Tools

run_eval

Execute a full evaluation pipeline on a dataset.

json
{
  "name": "run_eval",
  "arguments": {
    "dataset_path": "datasets/test-set.csv",
    "predictions": [],
    "metrics": ["accuracy", "f1", "confusion_matrix"],
    "output_format": "json"
  }
}

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `dataset_path` | `string` | Yes | Path to the dataset file (CSV, JSON, JSONL) |
| `predictions` | `object[]` | No | Array of prediction objects with `text`, `label`, `predicted_label`, `confidence` |
| `metrics` | `string[]` | No | Metrics to calculate |
| `output_format` | `"json" \| "html"` | No | Output format |

If predictions is provided, those samples are used directly. Otherwise, the dataset is loaded from dataset_path and its samples are used.
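
The selection rule above, together with two of the metrics from the example, can be sketched as follows. This is an illustrative approximation, not the package's implementation: `selectSamples`, `accuracy`, and `macroF1` are hypothetical names, `loadDataset` stands in for the real dataset loader, and treating an empty `predictions` array as "not provided" is an assumption.

```typescript
// Sample shape taken from the parameter table; confidence is optional here.
interface Sample {
  text: string;
  label: string;
  predicted_label: string;
  confidence?: number;
}

// Prefer inline predictions; fall back to the loader when none are given.
// Assumption: an empty array counts as "not provided".
function selectSamples(
  predictions: Sample[] | undefined,
  loadDataset: () => Sample[],
): Sample[] {
  return predictions && predictions.length > 0 ? predictions : loadDataset();
}

// Fraction of samples whose predicted label matches the true label.
function accuracy(samples: Sample[]): number {
  if (samples.length === 0) return 0;
  const correct = samples.filter((s) => s.label === s.predicted_label).length;
  return correct / samples.length;
}

// Macro-averaged F1: F1 per class, then the unweighted mean over all
// classes that appear as either a true or a predicted label.
function macroF1(samples: Sample[]): number {
  const classSet = new Set<string>();
  samples.forEach((s) => {
    classSet.add(s.label);
    classSet.add(s.predicted_label);
  });
  const classes = Array.from(classSet);
  if (classes.length === 0) return 0;
  let total = 0;
  classes.forEach((c) => {
    const tp = samples.filter((s) => s.label === c && s.predicted_label === c).length;
    const fp = samples.filter((s) => s.label !== c && s.predicted_label === c).length;
    const fn = samples.filter((s) => s.label === c && s.predicted_label !== c).length;
    const denom = 2 * tp + fp + fn;
    total += denom === 0 ? 0 : (2 * tp) / denom;
  });
  return total / classes.length;
}
```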

check_gates

Evaluate regression gates against evaluation results for CI.

json
{
  "name": "check_gates",
  "arguments": {
    "eval_results": "results/latest.json",
    "gate_config": "gates.yaml",
    "baseline_results": "results/baseline.json"
  }
}

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `eval_results` | `string \| EvalRun` | Yes | Path to evaluation results JSON or inline `EvalRun` object |
| `gate_config` | `string \| RegressionGate[]` | Yes | Path to gate YAML config or inline gate definitions |
| `baseline_results` | `string` | No | Path to baseline results for comparison gates |

Returns a pass/fail summary with individual gate results.
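
To make the pass/fail summary concrete, here is a minimal sketch of threshold-style gate evaluation. The `Gate` shape and the `checkGates` helper are assumptions for illustration only; the package's actual `RegressionGate` type and gate kinds (including baseline comparison gates) may look quite different.

```typescript
// Hypothetical gate definition; not the package's RegressionGate type.
interface Gate {
  name: string;
  metric: string;
  operator: ">=" | "<=";
  threshold: number;
}

interface GateResult {
  name: string;
  passed: boolean;
  actual: number | undefined;
}

interface GateSummary {
  passed: boolean;
  results: GateResult[];
}

// Evaluate every gate against a flat metric map and aggregate the outcome.
function checkGates(
  metrics: Partial<Record<string, number>>,
  gates: Gate[],
): GateSummary {
  const results = gates.map((gate): GateResult => {
    const actual: number | undefined = metrics[gate.metric];
    // A missing metric fails the gate instead of silently passing.
    const passed =
      actual !== undefined &&
      (gate.operator === ">=" ? actual >= gate.threshold : actual <= gate.threshold);
    return { name: gate.name, passed, actual };
  });
  // The overall run passes only if every individual gate passes.
  return { passed: results.every((r) => r.passed), results };
}
```

In CI, the overall `passed` flag would typically decide the exit code, while the per-gate results explain which threshold was missed.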

compare_models

Compare two model evaluation results with statistical significance.

json
{
  "name": "compare_models",
  "arguments": {
    "baseline_results": "results/model-v1.json",
    "candidate_results": "results/model-v2.json"
  }
}

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `baseline_results` | `string` | Yes | Path to baseline model results |
| `candidate_results` | `string` | Yes | Path to candidate model results |

Returns accuracy difference, p-value, significance flag, effect size, and per-class F1 comparison.
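
The README does not say which statistical test the tool uses. As one plausible illustration of how an accuracy difference, p-value, significance flag, and effect size relate, the sketch below applies a two-proportion z-test with Cohen's h as the effect size. All names here (`ModelCounts`, `erfcApprox`, `compareAccuracy`) are hypothetical. Note that this test assumes independent samples; when both models are scored on the same dataset, a paired test such as McNemar's is usually more appropriate. Per-class F1 comparison is not covered.

```typescript
// Summary counts for one model's evaluation run (hypothetical shape).
interface ModelCounts {
  correct: number;
  total: number;
}

// Complementary error function for x >= 0, using the Abramowitz-Stegun
// 7.1.26 polynomial approximation (absolute error around 1.5e-7).
function erfcApprox(x: number): number {
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 +
      t * (-0.284496736 +
        t * (1.421413741 +
          t * (-1.453152027 + t * 1.061405429))));
  return poly * Math.exp(-x * x);
}

function compareAccuracy(baseline: ModelCounts, candidate: ModelCounts, alpha = 0.05) {
  const p1 = baseline.correct / baseline.total;
  const p2 = candidate.correct / candidate.total;
  // Pooled proportion under the null hypothesis of equal accuracy.
  const pooled =
    (baseline.correct + candidate.correct) / (baseline.total + candidate.total);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / baseline.total + 1 / candidate.total));
  // A zero standard error (pooled accuracy of 0 or 1) means no evidence of a difference.
  const z = se === 0 ? 0 : (p2 - p1) / se;
  // Two-sided p-value: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
  const pValue = Math.min(1, erfcApprox(Math.abs(z) / Math.SQRT2));
  // Cohen's h: difference of arcsine-transformed proportions.
  const effectSize = 2 * Math.asin(Math.sqrt(p2)) - 2 * Math.asin(Math.sqrt(p1));
  return { accuracyDiff: p2 - p1, pValue, significant: pValue < alpha, effectSize };
}
```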

llm_judge

Run LLM-as-judge on samples with cost tracking and consensus voting.

json
{
  "name": "llm_judge",
  "arguments": {
    "samples": [
      { "text": "Reset my password", "label": "password_reset", "predicted_label": "password_reset" }
    ],
    "judge_model": "claude-opus",
    "consensus_count": 3,
    "budget_limit": 10.00
  }
}

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `samples` | `object[]` | Yes | Array of samples to judge |
| `judge_model` | `string` | Yes | LLM model to use for judging |
| `consensus_count` | `number` | No | Number of judges for consensus (default: 1) |
| `budget_limit` | `number` | No | Maximum budget in USD |
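
How `consensus_count` and `budget_limit` might interact can be sketched with a simple majority vote and a running cost total. This is an assumption-laden illustration rather than the package's judge engine: `judgeWithConsensus`, `Verdict`, and the injected `judge` function are hypothetical, and the judge is synchronous here for brevity where a real LLM call would be asynchronous.

```typescript
interface JudgeSample {
  text: string;
  label: string;
  predicted_label: string;
}

// One judge call: whether the judge accepts the prediction, and what it cost.
interface Verdict {
  correct: boolean;
  costUsd: number;
}

type JudgeFn = (sample: JudgeSample) => Verdict;

interface ConsensusResult {
  sample: JudgeSample;
  votes: boolean[];
  correct: boolean;
}

function judgeWithConsensus(
  samples: JudgeSample[],
  judge: JudgeFn,
  consensusCount = 1,
  budgetLimit = Infinity,
) {
  const results: ConsensusResult[] = [];
  let spentUsd = 0;
  let budgetExhausted = false;
  for (const sample of samples) {
    const votes: boolean[] = [];
    for (let i = 0; i < consensusCount; i++) {
      // Checked before each call, so spending stops once the limit is
      // reached but the final call may overshoot it slightly.
      if (spentUsd >= budgetLimit) {
        budgetExhausted = true;
        break;
      }
      const verdict = judge(sample);
      spentUsd += verdict.costUsd;
      votes.push(verdict.correct);
    }
    // Drop a sample whose votes were cut short by the budget.
    if (budgetExhausted) break;
    const yes = votes.filter(Boolean).length;
    // Strict majority; a tie counts as "not correct".
    results.push({ sample, votes, correct: yes * 2 > votes.length });
  }
  return { results, spentUsd, budgetExhausted };
}
```

Using an odd `consensus_count` avoids ties under this majority rule.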

generate_report

Generate a JSON or HTML report from evaluation results.

json
{
  "name": "generate_report",
  "arguments": {
    "eval_results": "results/latest.json",
    "format": "html"
  }
}

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `eval_results` | `string` | Yes | Path to evaluation results JSON |
| `format` | `"json" \| "html"` | No | Report format (default: json) |
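
As a rough idea of what the two formats could involve, the sketch below renders a small results object either as pretty-printed JSON or as a bare HTML table. The `EvalSummary` shape and the `renderReport` and `escapeHtml` helpers are hypothetical; the actual report layout produced by the tool is not documented here and is likely richer.

```typescript
// Hypothetical, minimal stand-in for an evaluation result.
interface EvalSummary {
  run_id: string;
  metrics: Record<string, number>;
}

// Escape characters that are significant in HTML text and attributes.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render the summary; JSON is the default, mirroring the parameter table.
function renderReport(summary: EvalSummary, format: "json" | "html" = "json"): string {
  if (format === "json") {
    return JSON.stringify(summary, null, 2);
  }
  const rows = Object.keys(summary.metrics)
    .map((name) => {
      const value = summary.metrics[name].toFixed(4);
      return `<tr><td>${escapeHtml(name)}</td><td>${value}</td></tr>`;
    })
    .join("\n");
  return [
    `<h1>Evaluation ${escapeHtml(summary.run_id)}</h1>`,
    "<table>",
    "<tr><th>Metric</th><th>Value</th></tr>",
    rows,
    "</table>",
  ].join("\n");
}
```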

Configuration

MCP Client Integration

Add to your MCP client configuration (e.g., Claude Desktop):

json
{
  "mcpServers": {
    "classifier-evals": {
      "command": "npx",
      "args": ["@reaatech/classifier-evals-mcp-server"]
    }
  }
}

Or using the executable directly:

json
{
  "mcpServers": {
    "classifier-evals": {
      "command": "node",
      "args": ["./node_modules/@reaatech/classifier-evals-mcp-server/dist/index.js"]
    }
  }
}

Usage Patterns

Integration with agent-mesh

Register classifier-evals as an agent in agent-mesh:

yaml
agent_id: classifier-evals
display_name: Classifier Evaluation
endpoint: "${CLASSIFIER_EVALS_ENDPOINT:-http://localhost:8083}"
type: mcp
is_default: false
confidence_threshold: 0.9

Headless Evaluation Pipeline

typescript
import { startMCPServer } from "@reaatech/classifier-evals-mcp-server";
 
// The server handles tool dispatch automatically
// Clients connect via stdio and call:
//   run_eval → dataset loading → metrics → eval run
//   check_gates → gate evaluation → pass/fail
//   compare_models → statistical comparison
//   llm_judge → judge engine → consensus → results
//   generate_report → JSON/HTML export
 
await startMCPServer();

License

MIT