@reaatech/pi-bench-mcp-server


Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

MCP (Model Context Protocol) server for prompt-injection-bench, plus report data normalization and deterministic reproducibility tools. Exposes benchmark operations as MCP tools consumable by any MCP client.

Installation

terminal
npm install @reaatech/pi-bench-mcp-server
# or
pnpm add @reaatech/pi-bench-mcp-server
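
Because the package is pre-1.0, pinning an exact version in package.json avoids surprise breaking changes (the version shown below is a placeholder, not a real release):

json
{
  "dependencies": {
    "@reaatech/pi-bench-mcp-server": "0.4.2"
  }
}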

Feature Overview

  • 4 MCP tools — run_benchmark, compare_defenses, generate_report, submit_results
  • Stdio transport — Connect to any MCP-compatible client via standard I/O
  • Report generation — HTML and Markdown reports with category breakdowns
  • Path traversal protection — Validates all file paths in tool operations (see the sketch after this list)
  • Deterministic seed manager — LCG-based PRNG with SHA-256 proof hashes for reproducible benchmarks
  • Dual ESM/CJS output — works with import and require
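
The package does not document its path validation routine; a typical guard resolves each requested path and confirms it stays under an allowed root. A minimal sketch of that pattern (the function name and root handling are illustrative, not the package's API):

typescript
import { resolve, sep } from "node:path";

// Illustrative only: not the package's actual API.
// Rejects any path that escapes the allowed root after resolution,
// including "../" sequences and absolute paths.
function assertInsideRoot(requested: string, root: string): string {
  const base = resolve(root);
  const resolved = resolve(base, requested);
  if (resolved !== base && !resolved.startsWith(base + sep)) {
    throw new Error(`Path escapes allowed root: ${requested}`);
  }
  return resolved;
}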

Quick Start

typescript
import { createMCPServer } from "@reaatech/pi-bench-mcp-server";
 
const server = createMCPServer();
await server.start();

Or connect via MCP client configuration:

json
{
  "mcpServers": {
    "prompt-injection-bench": {
      "command": "npx",
      "args": ["@reaatech/pi-bench-mcp-server"]
    }
  }
}
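
For Claude Desktop, this block belongs in claude_desktop_config.json; most MCP clients accept a similar mcpServers entry in their own configuration files.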

API Reference

BenchmarkMCPServer

Method | Description
start() | Start the MCP server on stdio transport
stop() | Stop the server

createMCPServer(config?)

Factory function that returns a BenchmarkMCPServer instance; the config argument is optional.

MCP Tools

run_benchmark

Execute a full benchmark against a defense:

json
{
  "name": "run_benchmark",
  "arguments": {
    "defense": "rebuff",
    "corpus": "default",
    "categories": ["direct-injection", "prompt-leaking", "role-playing"],
    "parallel": 10,
    "timeout_ms": 30000
  }
}

compare_defenses

Compare multiple defense results:

json
{
  "name": "compare_defenses",
  "arguments": {
    "results": ["results/rebuff.json", "results/lakera.json"],
    "significance_level": 0.05
  }
}

generate_report

Generate HTML/JSON/Markdown reports:

json
{
  "name": "generate_report",
  "arguments": {
    "results": "results/latest.json",
    "format": "html",
    "include_categories": true,
    "output": "reports/benchmark-report.html"
  }
}

submit_results

Submit results to the public leaderboard:

json
{
  "name": "submit_results",
  "arguments": {
    "results": "results/latest.json",
    "defense_name": "my-custom-defense",
    "defense_version": "1.0.0",
    "reproducibility_proof": {
      "seed": "abc123",
      "corpus_version": "2026.04",
      "adapter_versions": { "rebuff": "1.2.0" }
    }
  }
}
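
The reproducibility_proof fields mirror the inputs to SeedManager's generateProof (described below); a sketch of deriving the proof hash from the same inputs, assuming that is how the two connect:

typescript
import { createSeedManager } from "@reaatech/pi-bench-mcp-server";

// Assumption: the proof hash is derived from the same seed/corpus/adapter inputs.
const seed = createSeedManager({ seed: "abc123" });
const proofHash = seed.generateProof("2026.04", { rebuff: "1.2.0" });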

Report Data Normalization

typescript
import { normalizeReportData } from "@reaatech/pi-bench-mcp-server";
 
const data = normalizeReportData(rawResults);
// Returns normalized { defense, score, overallMetrics } regardless of input format

normalizeReportData accepts multiple result formats (example after the list):

  • Full BenchmarkResult objects
  • Pre-computed DefenseScore objects
  • JSON file paths (auto-reads and parses)
  • Mixed arrays of any of the above
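
For example, a file path and an in-memory score can be mixed in a single call (the array return shape and the precomputedScore variable are assumptions for illustration):

typescript
import { normalizeReportData } from "@reaatech/pi-bench-mcp-server";

// Assumed to exist: a DefenseScore object already in memory.
declare const precomputedScore: object;

// Assumption: an array input yields one normalized entry per element.
const entries = normalizeReportData(["results/rebuff.json", precomputedScore]);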

SeedManager

Deterministic PRNG for reproducible benchmarks:

typescript
import { createSeedManager } from "@reaatech/pi-bench-mcp-server";
 
const seed = createSeedManager({ seed: "my-benchmark-v1" });
 
const nextInt = seed.nextInt(1, 1000);
const shuffled = seed.shuffle(samples);
const proofHash = seed.generateProof(corpusVersion, adapterVersions);

Method | Description
next() | Next float in [0, 1)
nextInt(min, max) | Next integer in [min, max]
shuffle(array) | Deterministic Fisher-Yates shuffle
generateProof(corpusVersion, adapterVersions) | SHA-256 proof hash
reset() | Reset to initial seed

createSeedManager(config?)

Factory function. Accepts optional SeedConfig with seed (string) and algorithm (default: "lcg").
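
The internals are not published here, but the described design (an LCG stream plus a SHA-256 hash over the run's determining inputs) can be sketched roughly as follows; the constants and hashed field layout are illustrative:

typescript
import { createHash } from "node:crypto";

// Illustrative LCG with common 32-bit constants; not the package's actual state.
let state = 12345;
function next(): number {
  state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
  return state / 2 ** 32; // float in [0, 1)
}

// Illustrative proof: hash everything that determines a run.
function generateProof(
  seedValue: string,
  corpusVersion: string,
  adapterVersions: Record<string, string>,
): string {
  return createHash("sha256")
    .update(JSON.stringify({ seedValue, corpusVersion, adapterVersions }))
    .digest("hex");
}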

Usage Patterns

MCP Tool Invocation from an Agent

typescript
// From an MCP client (e.g., Claude Desktop, agent-mesh):
const result = await mcpClient.callTool("run_benchmark", {
  defense: "mock",
  corpus: "default",
  categories: ["direct-injection", "role-playing"],
  parallel: 10,
});
 
console.log(`Defense: ${result.defense}`);
console.log(`Detection rate: ${result.detectionRate}`);
console.log(`Score: ${result.overallScore}`);

Generating Reports Programmatically

typescript
import { normalizeReportData } from "@reaatech/pi-bench-mcp-server";
 
const data = normalizeReportData("results/benchmark.json");
 
// Build a Markdown report
let md = `# Benchmark Report\n\n`;
md += `**Defense:** ${data.defense}\n`;
md += `**Overall Score:** ${data.score.overallScore.toFixed(3)}\n`;
md += `**Detection Rate:** ${((1 - data.score.attackSuccessRate) * 100).toFixed(1)}%\n`;
 
for (const [category, catData] of Object.entries(data.score.categoryScores)) {
  md += `- **${category}:** ${(catData.detectionRate * 100).toFixed(1)}%\n`;
}
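
To persist the report, write the accumulated string with Node's fs module:

typescript
import { mkdirSync, writeFileSync } from "node:fs";

mkdirSync("reports", { recursive: true });
writeFileSync("reports/benchmark-report.md", md);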

Reproducible Runs

typescript
import { createSeedManager } from "@reaatech/pi-bench-mcp-server";

const seed = createSeedManager({ seed: "my-benchmark-v1" });
 
// Use seed for deterministic corpus generation and result ordering
const shuffled = seed.shuffle(corpus);
const proof = seed.generateProof("2026.04", { mock: "1.0.0" });
console.log(`Reproducibility proof: ${proof}`);

License

MIT