@reaatech/pi-bench-runner
Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.
Benchmark execution engine, attack executor, and defense evaluator for prompt-injection-bench. Runs injection attacks against defense adapters in parallel with configurable timeouts, progress reporting, and PII-safe result collection.
Installation
terminal
npm install @reaatech/pi-bench-runner
# or
pnpm add @reaatech/pi-bench-runner
Feature Overview
- Parallel execution — Run attacks concurrently with configurable batch sizing
- Progress callbacks — Real-time progress reporting with per-category stats
- Configurable timeouts — Per-attack timeout with fallback results
- Safe execution sandbox — Memory limits, PII sanitization, audit trail
- Benign sample generation — Generate non-attack prompts for false positive testing
- Result aggregation — Latency percentiles, category breakdowns, JSON export
- Dual ESM/CJS output — works with `import` and `require`
Quick Start
typescript
import { createBenchmarkEngine } from "@reaatech/pi-bench-runner";
import { createMockAdapter } from "@reaatech/pi-bench-adapters";
import { generateDefaultCorpus } from "@reaatech/pi-bench-corpus";
const adapter = createMockAdapter(0.95, 0.03);
const corpus = generateDefaultCorpus();
const engine = createBenchmarkEngine({
  defense: adapter,
  parallel: 10,
  timeoutMs: 30000,
  onProgress: (progress) => console.log(`${progress.completed}/${progress.total}`),
});
const result = await engine.runBenchmark(corpus);
const detected = result.attackResults.filter((r) => r.detected).length;
console.log(`Detection rate: ${((detected / result.attackResults.length) * 100).toFixed(1)}%`);
API Reference
BenchmarkEngine
| Method | Description |
|---|---|
| `runBenchmark(corpus)` | Execute a full benchmark against the configured defense |
| `runBenchmarkWithBenign(corpus, benignCount)` | Run attacks plus benign (false positive) samples |
| `abort()` | Abort the current benchmark run |
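The attack and benign result arrays support two headline metrics: the detection rate over attacks and the false positive rate over benign samples. A minimal sketch of that arithmetic follows. `DetectionResult` and `rate` are hypothetical names introduced here, only the `detected` field is taken from the examples in this README, and the data is made up for illustration.

```typescript
// Hypothetical minimal result shape; only `detected` appears in this README's examples.
interface DetectionResult {
  detected: boolean;
}

// Fraction of results flagged by the defense; 0 for an empty list to avoid NaN.
function rate(results: DetectionResult[]): number {
  if (results.length === 0) return 0;
  const flagged = results.filter((r) => r.detected).length;
  return flagged / results.length;
}

// Stand-in data: 3 of 4 attacks detected, 1 of 5 benign prompts flagged.
const attackResults: DetectionResult[] = [
  { detected: true }, { detected: true }, { detected: true }, { detected: false },
];
const benignResults: DetectionResult[] = [
  { detected: false }, { detected: false }, { detected: true },
  { detected: false }, { detected: false },
];

const detectionRate = rate(attackResults);     // 0.75
const falsePositiveRate = rate(benignResults); // 0.2
console.log(`Detection rate: ${(detectionRate * 100).toFixed(1)}%`);
console.log(`False positive rate: ${(falsePositiveRate * 100).toFixed(1)}%`);
```

Guarding the empty case matters when a category filter leaves no samples, since dividing by a zero length would otherwise print `NaN`.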
BenchmarkEngineConfig
| Property | Type | Default | Description |
|---|---|---|---|
| `defense` | `DefenseAdapter` | (required) | The defense adapter to benchmark |
| `parallel` | `number` | `10` | Max parallel attack executions |
| `timeoutMs` | `number` | `30000` | Per-attack timeout in milliseconds |
| `onProgress` | `ProgressCallback` | — | Called on each batch completion |
| `categories` | `AttackCategory[]` | all | Limit to specific categories |
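One common way to get a per-attack timeout with a fallback result is to race the defense call against a timer. The sketch below illustrates that general pattern only; it is not taken from the package's source, and `TimedResult`, `withTimeout`, and the `timedOut` field are hypothetical names.

```typescript
// Hypothetical result shape for this sketch; the package's real AttackResult may differ.
interface TimedResult {
  detected: boolean;
  timedOut: boolean;
}

// Race a defense call against a timer; resolve with a fallback result if the timer wins.
function withTimeout(call: Promise<TimedResult>, timeoutMs: number): Promise<TimedResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const fallback = new Promise<TimedResult>((resolve) => {
    timer = setTimeout(() => resolve({ detected: false, timedOut: true }), timeoutMs);
  });
  // Clear the timer in both outcomes so it does not keep the process alive.
  const clear = () => {
    if (timer !== undefined) clearTimeout(timer);
  };
  return Promise.race([call, fallback]).then(
    (value) => { clear(); return value; },
    (err) => { clear(); throw err; }
  );
}

// Usage: a simulated defense call that takes 200 ms, raced against a 20 ms timeout.
const slowCall = new Promise<TimedResult>((resolve) => {
  setTimeout(() => resolve({ detected: true, timedOut: false }), 200);
});
const pending = withTimeout(slowCall, 20);
pending.then((r) => console.log(r.timedOut)); // logs true: the 20 ms timer wins
```

Note that losing the race does not cancel the underlying call. A real implementation would likely also pass an abort signal to the adapter if the adapter supports one.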
createBenchmarkEngine(config?)
Factory function that returns a `BenchmarkEngine`; takes a `BenchmarkEngineConfig`.
AttackExecutor
| Method | Description |
|---|---|
| `execute(sample, defense)` | Execute a single attack, returns `AttackResult` |
| `executeBatch(samples, defense, config?)` | Execute attacks in parallel batches |
| `executeStream(samples, defense, callback?)` | Stream results as they complete |
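Executing in parallel batches, with a progress callback after each batch, can be sketched as follows. This is an illustration of the general technique under stated assumptions, not the package's code: `runInBatches` and its parameters are hypothetical, and the worker stands in for a call to a defense adapter.

```typescript
// Run `worker` over `items` in batches of at most `parallel`, reporting after each batch.
async function runInBatches<T, R>(
  items: T[],
  parallel: number,
  worker: (item: T) => Promise<R>,
  onBatchDone?: (completed: number, total: number) => void
): Promise<R[]> {
  // Guard against zero or negative sizes, which would otherwise loop forever.
  const size = Math.max(1, Math.floor(parallel));
  const results: R[] = [];
  for (let start = 0; start < items.length; start += size) {
    const batch = items.slice(start, start + size);
    // Items in one batch run concurrently; the next batch waits for all of them.
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    batchResults.forEach((r) => results.push(r));
    if (onBatchDone) onBatchDone(results.length, items.length);
  }
  return results;
}

// Usage: five stand-in samples, two at a time, with a trivial async worker.
const progressLog: string[] = [];
const done = runInBatches(
  [1, 2, 3, 4, 5],
  2,
  async (n) => n * 10,
  (completed, total) => { progressLog.push(`${completed}/${total}`); }
);
done.then((out) => console.log(out, progressLog));
```

Fixed batches are simple and keep results in input order, at the cost of waiting for the slowest item in each batch before the next batch starts.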
createAttackExecutor(config?)
Factory function for `AttackExecutor`; `config` is optional.
DefenseEvaluator
| Method | Description |
|---|---|
| `evaluate(result)` | Evaluate benchmark results into an `EvaluationResult` |
| `compareScores(scoreA, scoreB)` | Pairwise comparison of two defense scores |
| `checkThresholds(score, thresholds)` | Assert minimum performance thresholds |
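A threshold check of this kind boils down to comparing a score against configured limits and reporting what failed. The sketch below shows one way to do that. The `SketchScore` and `SketchThresholds` shapes, their field names, and `findThresholdFailures` are all assumptions made for illustration; the package's real score and threshold types are not documented in this README.

```typescript
// Hypothetical score and threshold shapes; the package's real types may differ.
interface SketchScore {
  detectionRate: number;     // fraction of attacks detected, 0..1
  falsePositiveRate: number; // fraction of benign prompts flagged, 0..1
}
interface SketchThresholds {
  minDetectionRate: number;
  maxFalsePositiveRate: number;
}

// Return a list of human-readable failures; an empty list means the score passes.
function findThresholdFailures(score: SketchScore, t: SketchThresholds): string[] {
  const failures: string[] = [];
  if (score.detectionRate < t.minDetectionRate) {
    failures.push(`detection rate ${score.detectionRate} is below ${t.minDetectionRate}`);
  }
  if (score.falsePositiveRate > t.maxFalsePositiveRate) {
    failures.push(`false positive rate ${score.falsePositiveRate} is above ${t.maxFalsePositiveRate}`);
  }
  return failures;
}

const failures = findThresholdFailures(
  { detectionRate: 0.92, falsePositiveRate: 0.05 },
  { minDetectionRate: 0.9, maxFalsePositiveRate: 0.03 }
);
console.log(failures); // one entry: the false positive rate exceeds its limit
```

Returning the failures instead of throwing lets a CI script print every violated limit before exiting with a non-zero status.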
createDefenseEvaluator(config?)
Factory function for `DefenseEvaluator`; `config` is optional.
SafeExecution
| Method | Description |
|---|---|
| `run(fn, context)` | Execute a function in a sandboxed context |
| `validateInput(input)` | Check for null bytes, length limits, control chars |
| `sanitizeForLogging(input)` | Truncate and redact PII from execution context |
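The input checks and log sanitization described in the table can be pictured with a short sketch. The code below is illustrative only and makes several assumptions: the 10,000-character limit, the function names `findInputProblem` and `redactForLog`, and the choice to redact only email-like strings are all hypothetical. Real PII redaction would cover more than emails.

```typescript
const MAX_INPUT_LENGTH = 10000; // assumed limit for this sketch

// Return the reason an input is rejected, or null if it passes all checks.
function findInputProblem(input: string): string | null {
  if (input.length > MAX_INPUT_LENGTH) return "too long";
  if (input.indexOf("\u0000") !== -1) return "null byte";
  // Control characters other than tab, newline, and carriage return.
  if (/[\u0001-\u0008\u000B\u000C\u000E-\u001F\u007F]/.test(input)) return "control character";
  return null;
}

// Redact email-like substrings, then truncate, before the text reaches a log.
function redactForLog(input: string, maxLength: number = 80): string {
  const redacted = input.replace(
    /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
    "[REDACTED_EMAIL]"
  );
  return redacted.length > maxLength ? redacted.slice(0, maxLength) + "..." : redacted;
}

console.log(findInputProblem("hello\tworld\n")); // null: tab and newline are allowed
console.log(findInputProblem("bad\u0000input")); // "null byte"
console.log(redactForLog("Contact alice@example.com for access"));
// "Contact [REDACTED_EMAIL] for access"
```

Redacting before truncating avoids cutting an email address in half and leaving a partial match in the log.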
createSafeExecution(config?)
Factory function for `SafeExecution`; `config` is optional.
BenignSamples
| Export | Description |
|---|---|
| `generateBenignSamples(count)` | Generate `count` harmless test prompts (conversation, factual questions, etc.) |
ResultCollector
| Method | Description |
|---|---|
| `collect(results)` | Aggregate attack results |
| `summarize()` | Generate a `CollectionSummary` with stats |
| `export(format)` | Export results as JSON or CSV |
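Latency percentiles are among the stats a summary of this kind typically reports. As a rough illustration, the sketch below computes them with the nearest-rank method. The method choice, the `percentile` helper, and the sample latencies are assumptions; this README does not say how the package computes percentiles.

```typescript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty list");
  const sorted = values.slice().sort((a, b) => a - b); // copy, then numeric sort
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Made-up latencies in milliseconds, including one slow outlier.
const latenciesMs = [12, 15, 11, 240, 14, 13, 16, 18, 17, 19];
const summary = {
  p50: percentile(latenciesMs, 50),
  p95: percentile(latenciesMs, 95),
  p99: percentile(latenciesMs, 99),
};
console.log(JSON.stringify(summary)); // {"p50":15,"p95":240,"p99":240}
```

With only ten samples, a single outlier determines both p95 and p99, which is one reason to run a larger corpus before comparing tail latencies between defenses.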
createResultCollector(config?)
Factory function for `ResultCollector`; `config` is optional.
Usage Patterns
Running with Progress
typescript
const engine = createBenchmarkEngine({
  defense: adapter,
  parallel: 20,
  onProgress: (progress) => {
    console.log(
      `[${progress.completed}/${progress.total}] ` +
        `Rate: ${progress.detectionRate.toFixed(2)} ` +
        `Category: ${progress.currentCategory}`
    );
  },
});
Filtering by Category
typescript
const engine = createBenchmarkEngine({
  defense: adapter,
  categories: ["direct-injection", "role-playing"],
});
const result = await engine.runBenchmark(corpus);
// Only tests direct-injection and role-playing attacks
False Positive Testing
typescript
const result = await engine.runBenchmarkWithBenign(corpus, 200);
console.log(`Attack success rate: ${result.attackResults.filter(r => !r.detected).length / result.attackResults.length}`);
console.log(`False positive rate: ${result.benignResults.filter(r => r.detected).length / result.benignResults.length}`);
Related Packages
- `@reaatech/pi-bench-core` — Core types
- `@reaatech/pi-bench-adapters` — Defense adapters
- `@reaatech/pi-bench-corpus` — Attack corpus
- `@reaatech/pi-bench-scoring` — Scoring engine
- `prompt-injection-bench` — CLI and umbrella package
