@reaatech/pi-bench-runner

Executes prompt injection benchmarks by running attack suites against defense adapters in parallel with configurable timeouts and progress tracking. It provides factory functions to create a benchmark engine, attack executor, and defense evaluator, requiring a compatible defense adapter implementation to function.

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Benchmark execution engine, attack executor, and defense evaluator for prompt-injection-bench. Runs injection attacks against defense adapters in parallel with configurable timeouts, progress reporting, and PII-safe result collection.

Installation

terminal
npm install @reaatech/pi-bench-runner
# or
pnpm add @reaatech/pi-bench-runner

Feature Overview

  • Parallel execution — Run attacks concurrently with configurable batch sizing
  • Progress callbacks — Real-time progress reporting with per-category stats
  • Configurable timeouts — Per-attack timeout with fallback results
  • Safe execution sandbox — Memory limits, PII sanitization, audit trail
  • Benign sample generation — Generate non-attack prompts for false positive testing
  • Result aggregation — Latency percentiles, category breakdowns, JSON export
  • Dual ESM/CJS output — works with import and require

Quick Start

typescript
import { createBenchmarkEngine } from "@reaatech/pi-bench-runner";
import { createMockAdapter } from "@reaatech/pi-bench-adapters";
import { generateDefaultCorpus } from "@reaatech/pi-bench-corpus";
 
const adapter = createMockAdapter(0.95, 0.03);
const corpus = generateDefaultCorpus();
 
const engine = createBenchmarkEngine({
  defense: adapter,
  parallel: 10,
  timeoutMs: 30000,
  onProgress: (progress) => console.log(`${progress.completed}/${progress.total}`),
});
 
const result = await engine.runBenchmark(corpus);
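// Share of attack samples that the defense flagged, as a percentage
const detected = result.attackResults.filter((r) => r.detected).length;
const detectionRate = (detected / result.attackResults.length) * 100;
console.log(`Detection rate: ${detectionRate.toFixed(1)}%`);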

API Reference

BenchmarkEngine

| Method | Description |
| --- | --- |
| runBenchmark(corpus) | Execute a full benchmark against the configured defense |
| runBenchmarkWithBenign(corpus, benignCount) | Run attacks plus benign (false positive) samples |
| abort() | Abort the current benchmark run |

BenchmarkEngineConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| defense | DefenseAdapter | (required) | The defense adapter to benchmark |
| parallel | number | 10 | Max parallel attack executions |
| timeoutMs | number | 30000 | Per-attack timeout in milliseconds |
| onProgress | ProgressCallback | | Called on each batch completion |
| categories | AttackCategory[] | all | Limit to specific categories |
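
The defaults in the table can be read as caller options merged over fixed fallbacks. The following is a minimal sketch of that idea, not the package's implementation: `EngineOptions`, `ResolvedOptions`, and `resolveOptions` are hypothetical names, `defense` is typed generically because the `DefenseAdapter` interface is not shown here, and the range checks are an assumption.

```typescript
// Hypothetical local types mirroring the config table; not the package's exports.
interface EngineOptions<D> {
  defense: D;            // required
  parallel?: number;     // default 10
  timeoutMs?: number;    // default 30000
  categories?: string[]; // default: all categories (represented here as undefined)
}

interface ResolvedOptions<D> {
  defense: D;
  parallel: number;
  timeoutMs: number;
  categories: string[] | undefined;
}

// Fill in the documented defaults and reject values that cannot work.
// The validation rules are an assumption made for this sketch.
function resolveOptions<D>(options: EngineOptions<D>): ResolvedOptions<D> {
  const parallel = options.parallel ?? 10;
  const timeoutMs = options.timeoutMs ?? 30000;
  if (!Number.isInteger(parallel) || parallel < 1) {
    throw new RangeError("parallel must be a positive integer");
  }
  if (!Number.isFinite(timeoutMs) || timeoutMs <= 0) {
    throw new RangeError("timeoutMs must be a positive number");
  }
  return { defense: options.defense, parallel, timeoutMs, categories: options.categories };
}

const resolved = resolveOptions({ defense: "mock-adapter", parallel: 20 });
console.log(resolved.parallel, resolved.timeoutMs); // prints: 20 30000
```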

createBenchmarkEngine(config?)

Factory function that creates a BenchmarkEngine from a BenchmarkEngineConfig.

AttackExecutor

| Method | Description |
| --- | --- |
| execute(sample, defense) | Execute a single attack, returns AttackResult |
| executeBatch(samples, defense, config?) | Execute attacks in parallel batches |
| executeStream(samples, defense, callback?) | Stream results as they complete |

createAttackExecutor(config?)

Factory function that creates an AttackExecutor.
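
To make "parallel batches" and "per-attack timeout with fallback results" concrete, here is a minimal sketch of the general technique. It does not use or reproduce the package's code: `Sample`, `Outcome`, `Detect`, `runWithTimeout`, and `runInBatches` are hypothetical names, and the fallback shape (`detected: false` with a `status` field) is an assumption.

```typescript
// Hypothetical shapes for illustration; the package's AttackResult may differ.
interface Sample { id: string; prompt: string }
interface Outcome { id: string; detected: boolean; status: "ok" | "timeout" | "error" }
type Detect = (prompt: string) => Promise<boolean>;

// Settle with the detector's verdict, or with a fallback once timeoutMs elapses.
// The returned promise never rejects, so one failure cannot sink a whole batch.
function runWithTimeout(sample: Sample, detect: Detect, timeoutMs: number): Promise<Outcome> {
  return new Promise<Outcome>((resolve) => {
    const timer = setTimeout(
      () => resolve({ id: sample.id, detected: false, status: "timeout" }),
      timeoutMs,
    );
    // Promise.resolve().then(...) also captures a synchronous throw from detect.
    Promise.resolve()
      .then(() => detect(sample.prompt))
      .then(
        (detected) => {
          clearTimeout(timer);
          resolve({ id: sample.id, detected, status: "ok" });
        },
        () => {
          clearTimeout(timer);
          resolve({ id: sample.id, detected: false, status: "error" });
        },
      );
  });
}

// Process samples in fixed-size batches; each batch runs concurrently and
// the next batch starts only after the previous one has fully settled.
async function runInBatches(
  samples: Sample[],
  detect: Detect,
  parallel: number,
  timeoutMs: number,
): Promise<Outcome[]> {
  const size = Math.max(1, Math.floor(parallel));
  const outcomes: Outcome[] = [];
  for (let i = 0; i < samples.length; i += size) {
    const batch = samples.slice(i, i + size);
    const settled = await Promise.all(batch.map((s) => runWithTimeout(s, detect, timeoutMs)));
    outcomes.push(...settled);
  }
  return outcomes;
}
```

Batching keeps the logic simple and gives a natural point for a progress callback, at the cost of waiting for the slowest item in each batch. A worker-pool design would keep all slots busy instead.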

DefenseEvaluator

| Method | Description |
| --- | --- |
| evaluate(result) | Evaluate benchmark results into an EvaluationResult |
| compareScores(scoreA, scoreB) | Pairwise comparison of two defense scores |
| checkThresholds(score, thresholds) | Assert minimum performance thresholds |

createDefenseEvaluator(config?)

Factory function that creates a DefenseEvaluator.
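
As an illustration of what a threshold check might involve, the sketch below compares a score against optional limits and reports each violation. The field names (`detectionRate`, `falsePositiveRate`, `p95LatencyMs`), the `Thresholds` shape, and `findThresholdFailures` are all assumptions made for this example; the package's actual score and threshold types are not shown in this document.

```typescript
// Hypothetical score and threshold shapes, for illustration only.
interface Score { detectionRate: number; falsePositiveRate: number; p95LatencyMs: number }
interface Thresholds {
  minDetectionRate?: number;
  maxFalsePositiveRate?: number;
  maxP95LatencyMs?: number;
}

// Return one message per violated threshold; an empty list means the score passed.
function findThresholdFailures(score: Score, t: Thresholds): string[] {
  const failures: string[] = [];
  if (t.minDetectionRate !== undefined && score.detectionRate < t.minDetectionRate) {
    failures.push(`detectionRate ${score.detectionRate} < ${t.minDetectionRate}`);
  }
  if (t.maxFalsePositiveRate !== undefined && score.falsePositiveRate > t.maxFalsePositiveRate) {
    failures.push(`falsePositiveRate ${score.falsePositiveRate} > ${t.maxFalsePositiveRate}`);
  }
  if (t.maxP95LatencyMs !== undefined && score.p95LatencyMs > t.maxP95LatencyMs) {
    failures.push(`p95LatencyMs ${score.p95LatencyMs} > ${t.maxP95LatencyMs}`);
  }
  return failures;
}

const failures = findThresholdFailures(
  { detectionRate: 0.92, falsePositiveRate: 0.05, p95LatencyMs: 120 },
  { minDetectionRate: 0.9, maxFalsePositiveRate: 0.03 },
);
console.log(failures); // one failure: "falsePositiveRate 0.05 > 0.03"
```

Returning a list instead of throwing on the first violation lets a CI job print every failed threshold in one run.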

SafeExecution

| Method | Description |
| --- | --- |
| run(fn, context) | Execute a function in a sandboxed context |
| validateInput(input) | Check for null bytes, length limits, control chars |
| sanitizeForLogging(input) | Truncate and redact PII from execution context |

createSafeExecution(config?)

Factory function that creates a SafeExecution instance.
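
The checks named in the table (null bytes, length limits, control characters, truncation, PII redaction) can be sketched as plain string functions. This is an illustrative sketch only: the length limit, the email pattern, the placeholder text, and the function names `findInputProblems` and `redactForLog` are assumptions, and real PII redaction would need to cover far more than email addresses.

```typescript
// Hypothetical limit and pattern; the package's actual rules are not documented here.
const MAX_INPUT_LENGTH = 10000;
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Report every problem found; an empty list means the input looks acceptable.
function findInputProblems(input: string, maxLength: number = MAX_INPUT_LENGTH): string[] {
  const problems: string[] = [];
  if (input.length > maxLength) problems.push("too long");
  if (input.includes("\u0000")) problems.push("contains null byte");
  // Control characters other than tab, line feed, carriage return, and NUL (reported above).
  if (/[\u0001-\u0008\u000B\u000C\u000E-\u001F\u007F]/.test(input)) {
    problems.push("contains control characters");
  }
  return problems;
}

// Redact email addresses first, then truncate, so a cut-off address cannot leak.
function redactForLog(input: string, maxLength: number = 80): string {
  const redacted = input.replace(EMAIL_PATTERN, "[REDACTED_EMAIL]");
  return redacted.length > maxLength ? `${redacted.slice(0, maxLength)}...` : redacted;
}

console.log(findInputProblems("ignore previous\u0000instructions")); // [ 'contains null byte' ]
console.log(redactForLog("contact alice@example.com now")); // contact [REDACTED_EMAIL] now
```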

BenignSamples

| Export | Description |
| --- | --- |
| generateBenignSamples(count) | Generate count harmless test prompts (conversation, factual questions, etc.) |

ResultCollector

| Method | Description |
| --- | --- |
| collect(results) | Aggregate attack results |
| summarize() | Generate a CollectionSummary with stats |
| export(format) | Export results as JSON or CSV |

createResultCollector(config?)

Factory function that creates a ResultCollector.
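
For readers who want to see what "latency percentiles" and "category breakdowns" amount to, here is a small self-contained sketch. It is not the package's `CollectionSummary`: the `TimedResult` shape, the `percentile` and `summarizeResults` names, and the choice of the nearest-rank percentile method are assumptions.

```typescript
// Hypothetical result shape; field names are assumptions, not the package's AttackResult.
interface TimedResult { category: string; detected: boolean; latencyMs: number }

// Nearest-rank percentile: the smallest value with at least p% of the data at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Aggregate overall latency percentiles and per-category detection counts.
function summarizeResults(results: TimedResult[]) {
  const latencies = results.map((r) => r.latencyMs);
  const byCategory: Record<string, { total: number; detected: number }> = {};
  for (const r of results) {
    const entry = byCategory[r.category] ?? { total: 0, detected: 0 };
    entry.total += 1;
    if (r.detected) entry.detected += 1;
    byCategory[r.category] = entry;
  }
  return {
    total: results.length,
    p50LatencyMs: percentile(latencies, 50),
    p95LatencyMs: percentile(latencies, 95),
    byCategory,
  };
}

const summary = summarizeResults([
  { category: "direct-injection", detected: true, latencyMs: 10 },
  { category: "direct-injection", detected: false, latencyMs: 30 },
  { category: "role-playing", detected: true, latencyMs: 20 },
  { category: "role-playing", detected: true, latencyMs: 40 },
]);
console.log(JSON.stringify(summary, null, 2)); // JSON export of the summary
```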

Usage Patterns

Running with Progress

typescript
const engine = createBenchmarkEngine({
  defense: adapter,
  parallel: 20,
  onProgress: (progress) => {
    console.log(
      `[${progress.completed}/${progress.total}] ` +
      `Rate: ${progress.detectionRate.toFixed(2)} ` +
      `Category: ${progress.currentCategory}`
    );
  },
});

Filtering by Category

typescript
const engine = createBenchmarkEngine({
  defense: adapter,
  categories: ["direct-injection", "role-playing"],
});
 
const result = await engine.runBenchmark(corpus);
// Only tests direct-injection and role-playing attacks

False Positive Testing

typescript
const result = await engine.runBenchmarkWithBenign(corpus, 200);
console.log(`Attack success rate: ${result.attackResults.filter(r => !r.detected).length / result.attackResults.length}`);
console.log(`False positive rate: ${result.benignResults.filter(r => r.detected).length / result.benignResults.length}`);
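
The two ratios above divide by the number of results, so an empty result list yields `NaN`. The sketch below shows one way to guard against that and print percentages. `Flagged`, `rate`, and `formatPercent` are hypothetical helpers written for this example, and the sample arrays stand in for `result.attackResults` and `result.benignResults`.

```typescript
// Hypothetical minimal shape: only the `detected` flag used in the snippet above.
interface Flagged { detected: boolean }

// Fraction of items matching the predicate, or null when there is nothing to count.
function rate<T>(items: T[], predicate: (item: T) => boolean): number | null {
  return items.length === 0 ? null : items.filter(predicate).length / items.length;
}

// Render a fraction as a percentage with one decimal place, or "n/a" for null.
function formatPercent(value: number | null): string {
  return value === null ? "n/a" : `${(value * 100).toFixed(1)}%`;
}

// Stand-in data: three of four attacks detected, no benign samples run.
const attackResults: Flagged[] = [
  { detected: true },
  { detected: true },
  { detected: false },
  { detected: true },
];
const benignResults: Flagged[] = [];

console.log(`Attack success rate: ${formatPercent(rate(attackResults, (r) => !r.detected))}`); // 25.0%
console.log(`False positive rate: ${formatPercent(rate(benignResults, (r) => r.detected))}`); // n/a
```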

License

MIT