@reaatech/rag-eval-mcp-server

Exposes RAG evaluation tools—including atomic judges, test suites, and regression gates—as an MCP server for integration with clients like Claude Desktop or Cursor. It provides a set of tool handler functions and server initialization utilities that rely on the `@modelcontextprotocol/sdk` to execute evaluation tasks via stdio.

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

MCP (Model Context Protocol) server exposing RAG evaluation tools as a three-layer API: atomic judge operations, orchestrated suite runs, and CI-style regression gates. Connect from MCP clients like Claude Desktop or Cursor to evaluate RAG systems in real time.

Installation

terminal
npm install @reaatech/rag-eval-mcp-server @modelcontextprotocol/sdk
# or
pnpm add @reaatech/rag-eval-mcp-server @modelcontextprotocol/sdk

Feature Overview

  • Three-layer tool architecture: rag_eval.judge.* (atomic), rag_eval.suite.* (orchestrated), rag_eval.gate.* (CI gates)
  • Judge tools: faithfulness, relevance, context precision, context recall, and cost check
  • Suite tools: run, status, results, compare, and baseline management
  • Gate tools: run gates, get/set config, and diff from baseline
  • Stdio transport: standard MCP transport for local agent integration
  • Programmatic and standalone modes: embed as a library or run as a standalone server

Quick Start

Standalone Server

terminal
# Start the MCP server via CLI
npx rag-eval-pack mcp-server
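
To connect a client to the standalone server, the client usually needs a config entry that tells it which command to spawn. The sketch below builds such an entry in the `mcpServers` shape used by Claude Desktop's `claude_desktop_config.json`. The server label `rag-eval` and the helper name `buildClientConfig` are hypothetical, and other clients may expect a different file or layout.

```typescript
// One entry under "mcpServers" in an MCP client config file.
interface McpServerEntry {
  command: string;
  args: string[];
  env?: Record<string, string>;
}

interface McpClientConfig {
  mcpServers: Record<string, McpServerEntry>;
}

// "rag-eval" is a hypothetical label; the client uses it only as a key.
function buildClientConfig(serverName = "rag-eval"): McpClientConfig {
  return {
    mcpServers: {
      [serverName]: {
        // Same command as the standalone CLI invocation above.
        command: "npx",
        args: ["rag-eval-pack", "mcp-server"],
      },
    },
  };
}

// Print JSON that can be merged into the client's config file.
console.log(JSON.stringify(buildClientConfig(), null, 2));
```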

Programmatic Usage

typescript
import { createMcpServer, startMcpServer } from "@reaatech/rag-eval-mcp-server";
 
// Create a server instance
const server = createMcpServer();
 
// Start via stdio (for MCP client integration)
await startMcpServer();

API Reference

Server Functions

typescript
import { createMcpServer, startMcpServer } from "@reaatech/rag-eval-mcp-server";
 
const server = createMcpServer();
await startMcpServer();

| Function | Returns | Description |
| --- | --- | --- |
| `createMcpServer()` | `Server` | Create an MCP server with all tools registered |
| `startMcpServer()` | `Promise<void>` | Start the server via stdio transport |

Tool Handler Functions

For programmatic invocation outside the MCP protocol:

typescript
import { handleJudgeTool, handleSuiteTool, handleGateTool } from "@reaatech/rag-eval-mcp-server";

| Function | Description |
| --- | --- |
| `handleJudgeTool(name, input)` | Invoke a judge tool programmatically |
| `handleSuiteTool(name, input)` | Invoke a suite tool programmatically |
| `handleGateTool(name, input)` | Invoke a gate tool programmatically |

Each returns an MCP-compatible CallToolResult.
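
Because the handlers return a `CallToolResult` rather than a plain object, callers typically need to unwrap the text payload before using it. The helper below is a minimal sketch of that step. `ToolResultLike` is a simplified stand-in for the SDK type, `parseToolResult` is a hypothetical name, and the assumption that the payload is JSON in a text content item is taken from the usage examples in this README.

```typescript
// Simplified stand-in for the parts of CallToolResult used here (assumption).
interface ToolResultLike {
  content: Array<{ type: string; text?: string }>;
  isError?: boolean;
}

// Find the first text item, surface tool errors, and parse the JSON payload.
function parseToolResult<T>(result: ToolResultLike): T {
  const first = result.content.find(
    (item) => item.type === "text" && typeof item.text === "string",
  );
  if (!first || first.text === undefined) {
    throw new Error("Tool result has no text content");
  }
  if (result.isError) {
    throw new Error(`Tool reported an error: ${first.text}`);
  }
  return JSON.parse(first.text) as T;
}

// Usage with a hand-built result; a real one would come from a handler call.
const sample: ToolResultLike = {
  content: [{ type: "text", text: '{"score":0.9}' }],
};
const parsed = parseToolResult<{ score: number }>(sample);
console.log(parsed.score);
```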

Tool Reference

Layer 1: rag_eval.judge.* (Atomic Operations)

Fast, stateless, composable operations for mid-task self-evaluation.

rag_eval.judge.faithfulness

Check if a generated answer is faithful to the provided context.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `context` | `string[]` | Yes | Retrieved context chunks |
| `generated_answer` | `string` | Yes | RAG system’s generated answer |

Returns: { score, statements, supported_count }

rag_eval.judge.relevance

Check if an answer is relevant to the query.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `query` | `string` | Yes | User query |
| `generated_answer` | `string` | Yes | RAG system’s generated answer |

Returns: { score, semantic_similarity, intent_score }

rag_eval.judge.context_precision

Check context ranking quality against a ground truth.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `query` | `string` | Yes | User query |
| `context` | `string[]` | Yes | Retrieved context chunks |
| `ground_truth` | `string` | Yes | Expected answer |

Returns: { score, map, ndcg }
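
For readers unfamiliar with the `map` and `ndcg` fields, the sketch below shows the textbook definitions of average precision and NDCG over a ranked list with binary relevance labels. It is an illustration of the metrics only; how the package derives relevance labels from `ground_truth`, and whether it uses exactly these formulas, is not stated in this README. Note that this `averagePrecision` normalizes by the number of relevant chunks actually retrieved.

```typescript
// relevance[i] is 1 if the chunk at rank i (0-based) is relevant, else 0.
function averagePrecision(relevance: number[]): number {
  let hits = 0;
  let sum = 0;
  relevance.forEach((rel, i) => {
    if (rel > 0) {
      hits += 1;
      sum += hits / (i + 1); // precision at this rank
    }
  });
  return hits === 0 ? 0 : sum / hits;
}

// Discounted cumulative gain: later ranks contribute less.
function dcg(relevance: number[]): number {
  return relevance.reduce((acc, rel, i) => acc + rel / Math.log2(i + 2), 0);
}

// Normalize DCG by the DCG of the best possible ordering of the same labels.
function ndcg(relevance: number[]): number {
  const ideal = dcg([...relevance].sort((a, b) => b - a));
  return ideal === 0 ? 0 : dcg(relevance) / ideal;
}

// Relevant chunks at ranks 1 and 3, an irrelevant one at rank 2.
const ranking = [1, 0, 1];
console.log(averagePrecision(ranking).toFixed(3), ndcg(ranking).toFixed(3));
```

A ranking that places every relevant chunk first scores 1 on both metrics, so lower values indicate relevant chunks buried beneath irrelevant ones.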

rag_eval.judge.context_recall

Check ground truth coverage in retrieved context.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `query` | `string` | Yes | User query |
| `context` | `string[]` | Yes | Retrieved context chunks |
| `ground_truth` | `string` | Yes | Expected answer |

Returns: { score, total_facts, covered_facts }

rag_eval.judge.cost_check

Verify evaluation cost is within budget.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `eval_result` | `object` | Yes | Partial evaluation result |
| `budget` | `number` | Yes | Budget limit |

Returns: { within_budget, cost }

Layer 2: rag_eval.suite.* (Orchestrated Runs)

Stateful, longer-running operations for eval-driven development.

| Tool | Description |
| --- | --- |
| `rag_eval.suite.run` | Execute a full evaluation suite with dataset and config |
| `rag_eval.suite.status` | Get the status of a running evaluation |
| `rag_eval.suite.results` | Retrieve results for a completed evaluation |
| `rag_eval.suite.compare` | Compare two evaluation runs |
| `rag_eval.suite.baseline` | Set a baseline for regression comparison |

rag_eval.suite.run

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `dataset` | `string \| EvaluationSample[]` | Yes | Path to dataset or in-memory samples |
| `config` | `EvalSuiteConfig` | Yes | Evaluation configuration |

Returns: { run_id, status }
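
Since `rag_eval.suite.run` returns a `run_id` and a status rather than final results, a caller will usually poll until the run finishes. The sketch below shows one way to do that with an injected status function. `waitForRun` is a hypothetical helper, the terminal status values `"completed"` and `"failed"` are assumptions, and in practice `fetchStatus` would wrap a call to `rag_eval.suite.status`, whose input shape is not documented here.

```typescript
// Injected so the helper does not depend on a particular tool input shape.
type StatusFetcher = (runId: string) => Promise<{ status: string }>;

interface WaitOptions {
  intervalMs?: number;
  maxAttempts?: number;
  terminal?: string[]; // assumed status names that end polling
}

async function waitForRun(
  runId: string,
  fetchStatus: StatusFetcher,
  opts: WaitOptions = {},
): Promise<string> {
  const {
    intervalMs = 1000,
    maxAttempts = 60,
    terminal = ["completed", "failed"],
  } = opts;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { status } = await fetchStatus(runId);
    if (terminal.includes(status)) return status;
    // Wait before asking again so the server is not flooded with requests.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Run ${runId} did not finish after ${maxAttempts} checks`);
}
```

Bounding the number of attempts keeps a stuck run from blocking a pipeline indefinitely.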

Layer 3: rag_eval.gate.* (CI Gates)

Opinionated, blocking operations for CI/CD integration.

| Tool | Description |
| --- | --- |
| `rag_eval.gate.run` | Run CI-style pass/fail gate evaluation |
| `rag_eval.gate.config` | Get or set gate configuration |
| `rag_eval.gate.diff` | Get detailed diff from baseline |

rag_eval.gate.run

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `results` | `EvalResults` | Yes | Evaluation results to gate |
| `gate_config` | `GateConfig[]` | Yes | Gate configuration |

Returns: { passed, failures[] }

Usage Patterns

Agent Self-Evaluation Mid-Task

typescript
import { handleJudgeTool } from "@reaatech/rag-eval-mcp-server";
 
const response = await handleJudgeTool("rag_eval.judge.faithfulness", {
  context: [
    "Refunds are processed within 14 days of purchase.",
    "Contact support@example.com for refund requests.",
  ],
  generated_answer: "You can request a refund within 14 days by emailing support.",
});
 
const result = JSON.parse(response.content[0].text);
if (result.score < 0.85) {
  // Trigger fallback: retrieve more context
}

CI Pipeline Gate

typescript
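import { handleGateTool } from "@reaatech/rag-eval-mcp-server";

// latestEvalResults: EvalResults produced by an earlier evaluation run
const gateResponse = await handleGateTool("rag_eval.gate.run", {
  results: latestEvalResults,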
  gate_config: [
    { name: "min-faithfulness", type: "threshold", metric: "avg_faithfulness", operator: ">=", threshold: 0.85 },
    { name: "no-regression", type: "baseline-comparison", metric: "overall_score", allow_regression: false },
  ],
});
 
const gateResult = JSON.parse(gateResponse.content[0].text);
if (!gateResult.passed) {
  console.error("Gates failed:", gateResult.failures);
  process.exit(1);
}
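
A threshold gate such as `min-faithfulness` above can be read as a single comparison between a metric and a limit. The sketch below evaluates gates of that shape against a plain metrics map to make the semantics concrete. It is not the package's implementation: `checkThresholdGates`, the `GateFailure` shape, and every operator other than `>=` are assumptions, and baseline-comparison gates are left out.

```typescript
type Operator = ">=" | ">" | "<=" | "<";

interface ThresholdGate {
  name: string;
  type: "threshold";
  metric: string;
  operator: Operator;
  threshold: number;
}

// Hypothetical failure record; the real failures[] shape is not documented here.
interface GateFailure {
  gate: string;
  metric: string;
  actual: number | undefined;
  expected: string;
}

function compare(actual: number, operator: Operator, threshold: number): boolean {
  switch (operator) {
    case ">=": return actual >= threshold;
    case ">": return actual > threshold;
    case "<=": return actual <= threshold;
    case "<": return actual < threshold;
  }
}

function checkThresholdGates(
  metrics: Record<string, number>,
  gates: ThresholdGate[],
): { passed: boolean; failures: GateFailure[] } {
  const failures: GateFailure[] = [];
  for (const gate of gates) {
    const actual: number | undefined = metrics[gate.metric];
    // A missing metric counts as a failure rather than a silent pass.
    const ok = actual !== undefined && compare(actual, gate.operator, gate.threshold);
    if (!ok) {
      failures.push({
        gate: gate.name,
        metric: gate.metric,
        actual,
        expected: `${gate.operator} ${gate.threshold}`,
      });
    }
  }
  return { passed: failures.length === 0, failures };
}

const outcome = checkThresholdGates({ avg_faithfulness: 0.8 }, [
  { name: "min-faithfulness", type: "threshold", metric: "avg_faithfulness", operator: ">=", threshold: 0.85 },
]);
console.log(outcome.passed, outcome.failures);
```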

License

MIT