@reaatech/pi-bench-mcp-server

npm v1.0.1

An MCP server that exposes four tools (`run_benchmark`, `compare_defenses`, `generate_report`, `submit_results`) for running prompt injection benchmarks against defenses, comparing results, and generating reports. It provides a `createMCPServer` factory function and a `normalizeReportData` utility for standardizing benchmark results across different formats.

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

MCP (Model Context Protocol) server for prompt-injection-bench, plus report data normalization and deterministic reproducibility tools. Exposes benchmark operations as MCP tools consumable by any MCP client.

Installation

terminal
npm install @reaatech/pi-bench-mcp-server
# or
pnpm add @reaatech/pi-bench-mcp-server

Feature Overview

  • 4 MCP tools: run_benchmark, compare_defenses, generate_report, submit_results
  • Stdio transport — Connect to any MCP-compatible client via standard I/O
  • Report generation — HTML and Markdown reports with category breakdowns
  • Path traversal protection — Validates all file paths in tool operations
  • Deterministic seed manager — LCG-based PRNG with SHA-256 proof hashes for reproducible benchmarks
  • Dual ESM/CJS output — works with import and require
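
The path traversal protection listed above is not documented in detail here. As a rough illustration of the general technique, a check of this kind usually resolves the user-supplied path against an allowed base directory and rejects anything that lands outside it. The helper name `resolveInside` and the base directory are hypothetical and are not part of this package's API.

```typescript
import * as path from "node:path";

/**
 * Resolve `userPath` against `baseDir` and reject anything that escapes it.
 * Hypothetical helper for illustration; not the package's implementation.
 */
function resolveInside(baseDir: string, userPath: string): string {
  const base = path.resolve(baseDir);
  const target = path.resolve(base, userPath);
  const rel = path.relative(base, target);
  // A relative path equal to ".." or starting with "../" (or an absolute one,
  // e.g. a different drive on Windows) points outside the base directory.
  if (rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    throw new Error(`Path escapes base directory: ${userPath}`);
  }
  return target;
}

// Accepted: stays under the base directory.
console.log(resolveInside("/tmp/bench", "results/latest.json"));

// Rejected: climbs out of the base directory.
try {
  resolveInside("/tmp/bench", "../etc/passwd");
} catch (err) {
  console.log((err as Error).message);
}
```

This sketch compares paths lexically and does not resolve symlinks; a stricter check would also compare real paths on disk.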

Quick Start

typescript
import { createMCPServer } from "@reaatech/pi-bench-mcp-server";
 
const server = createMCPServer();
await server.start();

Or connect via MCP client configuration:

json
{
  "mcpServers": {
    "prompt-injection-bench": {
      "command": "npx",
      "args": ["@reaatech/pi-bench-mcp-server"]
    }
  }
}

API Reference

BenchmarkMCPServer

Method    Description
start()   Start the MCP server on stdio transport
stop()    Stop the server

createMCPServer(config?)

Factory function that returns a BenchmarkMCPServer. The config argument is optional.

MCP Tools

run_benchmark

Execute a full benchmark against a defense:

json
{
  "name": "run_benchmark",
  "arguments": {
    "defense": "rebuff",
    "corpus": "default",
    "categories": ["direct-injection", "prompt-leaking", "role-playing"],
    "parallel": 10,
    "timeout_ms": 30000
  }
}
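
On the wire, MCP clients send tool calls as JSON-RPC 2.0 requests with the method `tools/call`, wrapping the name and arguments shown above in `params`. The sketch below builds such a request. The `RunBenchmarkArgs` type is inferred from the example, and which fields are optional is an assumption.

```typescript
// Shape of an MCP `tools/call` request (JSON-RPC 2.0).
type ToolCallRequest = {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
};

// Field names mirror the run_benchmark example; optionality is assumed.
type RunBenchmarkArgs = {
  defense: string;
  corpus?: string;
  categories?: string[];
  parallel?: number;
  timeout_ms?: number;
};

function buildRunBenchmarkRequest(id: number, args: RunBenchmarkArgs): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "run_benchmark", arguments: { ...args } },
  };
}

const request = buildRunBenchmarkRequest(1, {
  defense: "rebuff",
  corpus: "default",
  categories: ["direct-injection", "prompt-leaking", "role-playing"],
  parallel: 10,
  timeout_ms: 30000,
});

console.log(JSON.stringify(request, null, 2));
```

Most MCP client libraries build this envelope for you, so the sketch is mainly useful for debugging or for writing a minimal client by hand.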

compare_defenses

Compare multiple defense results:

json
{
  "name": "compare_defenses",
  "arguments": {
    "results": ["results/rebuff.json", "results/lakera.json"],
    "significance_level": 0.05
  }
}
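
This README does not say which statistical test `compare_defenses` applies. To show what a `significance_level` of 0.05 means in practice, here is a sketch of one common choice, a two-sided two-proportion z-test on detection counts. The function names and the sample counts are illustrative assumptions, not the package's method or data.

```typescript
// Abramowitz-Stegun 7.1.26 approximation of erf(x); absolute error is about 1.5e-7.
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

/** Two-sided p-value for the difference between two detection rates. */
function twoProportionPValue(hitsA: number, nA: number, hitsB: number, nB: number): number {
  const pA = hitsA / nA;
  const pB = hitsB / nB;
  const pooled = (hitsA + hitsB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  if (se === 0) return 1; // both rates are 0% or 100%: no evidence of a difference
  const z = (pA - pB) / se;
  // p = 2 * (1 - Phi(|z|)) = 1 - erf(|z| / sqrt(2))
  return 1 - erf(Math.abs(z) / Math.SQRT2);
}

// Hypothetical counts: defense A detects 90 of 100 attacks, defense B 60 of 100.
const pValue = twoProportionPValue(90, 100, 60, 100);
const significanceLevel = 0.05;
console.log(pValue < significanceLevel ? "difference is significant" : "not significant");
```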

generate_report

Generate HTML/JSON/Markdown reports:

json
{
  "name": "generate_report",
  "arguments": {
    "results": "results/latest.json",
    "format": "html",
    "include_categories": true,
    "output": "reports/benchmark-report.html"
  }
}

submit_results

Submit results to the public leaderboard:

json
{
  "name": "submit_results",
  "arguments": {
    "results": "results/latest.json",
    "defense_name": "my-custom-defense",
    "defense_version": "1.0.0",
    "reproducibility_proof": {
      "seed": "abc123",
      "corpus_version": "2026.04",
      "adapter_versions": { "rebuff": "1.2.0" }
    }
  }
}

Report Data Normalization

typescript
import { normalizeReportData } from "@reaatech/pi-bench-mcp-server";
 
const data = normalizeReportData(rawResults);
// Returns normalized { defense, score, overallMetrics } regardless of input format

normalizeReportData accepts multiple result formats:

  • Full BenchmarkResult objects
  • Pre-computed DefenseScore objects
  • JSON file paths (auto-reads and parses)
  • Mixed arrays of any of the above
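
Accepting several input forms like this usually comes down to dispatching on the runtime shape of each item. The sketch below shows that pattern with deliberately simplified, hypothetical versions of `BenchmarkResult` and `DefenseScore`. The real types and the real scoring formula are not documented here and will differ.

```typescript
import { readFileSync } from "node:fs";

// Simplified, hypothetical shapes for illustration only.
type DefenseScore = { defense: string; overallScore: number };
type BenchmarkResult = { defense: string; samples: { detected: boolean }[] };
type Input = string | DefenseScore | BenchmarkResult;

function isScore(value: DefenseScore | BenchmarkResult): value is DefenseScore {
  return "overallScore" in value;
}

function normalizeOne(input: Input): DefenseScore {
  // A string is treated as a path to a JSON file holding one of the object forms.
  const value: DefenseScore | BenchmarkResult =
    typeof input === "string" ? JSON.parse(readFileSync(input, "utf8")) : input;
  if (isScore(value)) return value;
  // Stand-in scoring: fraction of samples detected (an assumption).
  const detected = value.samples.filter((s) => s.detected).length;
  const overallScore = value.samples.length === 0 ? 0 : detected / value.samples.length;
  return { defense: value.defense, overallScore };
}

/** Accept a single input or a mixed array and return one score per input. */
function normalizeAll(inputs: Input | Input[]): DefenseScore[] {
  return (Array.isArray(inputs) ? inputs : [inputs]).map(normalizeOne);
}

const normalized = normalizeAll([
  { defense: "precomputed", overallScore: 0.5 },
  { defense: "raw", samples: [{ detected: true }, { detected: false }] },
]);
console.log(normalized);
```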

SeedManager

Deterministic PRNG for reproducible benchmarks:

typescript
import { createSeedManager } from "@reaatech/pi-bench-mcp-server";
 
const seed = createSeedManager({ seed: "my-benchmark-v1" });
 
const nextInt = seed.nextInt(1, 1000);
const shuffled = seed.shuffle(samples);
const proofHash = seed.generateProof(corpusVersion, adapterVersions);

Method                                          Description
next()                                          Next float in [0, 1)
nextInt(min, max)                               Next integer in [min, max]
shuffle(array)                                  Deterministic Fisher-Yates shuffle
generateProof(corpusVersion, adapterVersions)   SHA-256 proof hash
reset()                                         Reset to initial seed

createSeedManager(config?)

Factory function. Accepts optional SeedConfig with seed (string) and algorithm (default: "lcg").
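
To make the moving parts concrete, here is a minimal sketch of how an LCG-seeded Fisher-Yates shuffle and a SHA-256 proof hash can fit together. It is not the package's implementation: the class name `Lcg`, the FNV-1a seed hashing, the Numerical Recipes LCG constants, and the proof payload layout are all assumptions chosen for illustration.

```typescript
import { createHash } from "node:crypto";

// Derive a 32-bit starting state from a string seed (FNV-1a; an assumption).
function hashSeed(seed: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < seed.length; i++) {
    h ^= seed.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

class Lcg {
  private readonly seed: string;
  private state: number;

  constructor(seed: string) {
    this.seed = seed;
    this.state = hashSeed(seed);
  }

  /** Next float in [0, 1): state = (1664525 * state + 1013904223) mod 2^32. */
  next(): number {
    this.state = (Math.imul(this.state, 1664525) + 1013904223) >>> 0;
    return this.state / 0x100000000;
  }

  /** Next integer in [min, max], both inclusive. */
  nextInt(min: number, max: number): number {
    return min + Math.floor(this.next() * (max - min + 1));
  }

  /** Fisher-Yates shuffle on a copy; the input array is left untouched. */
  shuffle<T>(items: readonly T[]): T[] {
    const out = [...items];
    for (let i = out.length - 1; i > 0; i--) {
      const j = this.nextInt(0, i);
      const tmp = out[i] as T;
      out[i] = out[j] as T;
      out[j] = tmp;
    }
    return out;
  }

  reset(): void {
    this.state = hashSeed(this.seed);
  }
}

/** Hash the seed, corpus version, and adapter versions (sorted by name) into a hex digest. */
function proofHash(seed: string, corpusVersion: string, adapterVersions: Record<string, string>): string {
  const adapters = Object.keys(adapterVersions).sort().map((name) => [name, adapterVersions[name]]);
  const payload = JSON.stringify({ seed, corpusVersion, adapters });
  return createHash("sha256").update(payload).digest("hex");
}

const rng = new Lcg("my-benchmark-v1");
const firstOrder = rng.shuffle(["a", "b", "c", "d", "e"]);
rng.reset();
const secondOrder = rng.shuffle(["a", "b", "c", "d", "e"]);
// Both shuffles match because reset() restores the initial state.
console.log(firstOrder, secondOrder);
console.log(proofHash("my-benchmark-v1", "2026.04", { mock: "1.0.0" }));
```

Sorting the adapter names before hashing keeps the proof stable regardless of object key order. An LCG is fine for reproducible ordering but is not suitable for anything security-sensitive.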

Usage Patterns

MCP Tool Invocation from an Agent

typescript
// From an MCP client (e.g., Claude Desktop, agent-mesh):
const result = await mcpClient.callTool("run_benchmark", {
  defense: "mock",
  corpus: "default",
  categories: ["direct-injection", "role-playing"],
  parallel: 10,
});
 
console.log(`Defense: ${result.defense}`);
console.log(`Detection rate: ${result.detectionRate}`);
console.log(`Score: ${result.overallScore}`);
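
Depending on the client library, `callTool` may return the raw MCP result envelope rather than an already parsed object. In the MCP specification a tool result carries a `content` array (for example text items) and an optional `isError` flag. The helper below is a sketch for that case. It assumes the server encodes its payload as JSON in the first text item, which this README does not confirm, and `BenchmarkSummary` simply mirrors the fields used in the example above.

```typescript
// Minimal view of an MCP tool result: a content array plus an optional error flag.
type ToolContent = { type: string; text?: string };
type ToolResult = { content: ToolContent[]; isError?: boolean };

// Fields taken from the usage example; the full payload shape is an assumption.
type BenchmarkSummary = { defense: string; detectionRate: number; overallScore: number };

function parseToolJson<T>(result: ToolResult): T {
  const item = result.content.find((c) => c.type === "text" && typeof c.text === "string");
  const text = item?.text;
  if (result.isError) throw new Error(`Tool call failed: ${text ?? "no details"}`);
  if (text === undefined) throw new Error("Tool result has no text content");
  return JSON.parse(text) as T;
}

// Simulated envelope, standing in for what a client might hand back.
const envelope: ToolResult = {
  content: [
    { type: "text", text: JSON.stringify({ defense: "mock", detectionRate: 0.9, overallScore: 0.8 }) },
  ],
};

const summary = parseToolJson<BenchmarkSummary>(envelope);
console.log(`Defense: ${summary.defense}`);
```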

Generating Reports Programmatically

typescript
import { normalizeReportData } from "@reaatech/pi-bench-mcp-server";
 
const data = normalizeReportData("results/benchmark.json");
 
// Build a Markdown report
let md = `# Benchmark Report\n\n`;
md += `**Defense:** ${data.defense}\n`;
md += `**Overall Score:** ${data.score.overallScore.toFixed(3)}\n`;
// Round the percentage so floating-point noise does not leak into the report
md += `**Detection Rate:** ${((1 - data.score.attackSuccessRate) * 100).toFixed(1)}%\n\n`;
 
for (const [category, catData] of Object.entries(data.score.categoryScores)) {
  md += `- **${category}:** ${(catData.detectionRate * 100).toFixed(1)}%\n`;
}

Reproducible Runs

typescript
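import { createSeedManager } from "@reaatech/pi-bench-mcp-server";

// Placeholder corpus; substitute your own loaded samples
const corpus = ["sample-a", "sample-b", "sample-c"];
const seed = createSeedManager({ seed: "my-benchmark-v1" });

// Use seed for deterministic corpus generation and result ordering
const shuffled = seed.shuffle(corpus);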
const proof = seed.generateProof("2026.04", { mock: "1.0.0" });
console.log(`Reproducibility proof: ${proof}`);

License

MIT