@reaatech/pi-bench-scoring
Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.
Scoring engine, statistical analysis, and ranking algorithms for prompt-injection-bench. Computes weighted scores, confidence intervals, effect sizes, and cross-defense comparisons with multiple statistical tests.
Installation
terminal
npm install @reaatech/pi-bench-scoring
# or
pnpm add @reaatech/pi-bench-scoringFeature Overview
- Weighted scoring — Category-weighted overall score with FPR penalty and consistency bonus
- Confidence intervals — Wilson score intervals for detection rates
- Effect sizes — Cohen’s h with
small/medium/largeinterpretation - Statistical tests — Two-proportion z-test, chi-square, Welch’s t-test, one-way ANOVA
- Bayesian smoothing — Category-level scores smoothed with prior for small samples
- Tiered ranking — S/A/B/C/D composite ranking with configurable weights
- Dual ESM/CJS output — works with
importandrequire
Quick Start
typescript
import {
calculateDefenseScore,
compareMetrics,
createStatisticalTests,
} from "@reaatech/pi-bench-scoring";
const score = calculateDefenseScore(benchmarkResult);
console.log(`Detection rate: ${(1 - score.attackSuccessRate) * 100}%`);
console.log(`Weighted score: ${score.overallScore.toFixed(3)}`);
// Compare two defense results
const comparison = compareMetrics(scoreA, scoreB);
console.log(`Effect size (Cohen's h): ${comparison.effectSize}`);
// Run statistical tests
const stats = createStatisticalTests({ significanceLevel: 0.05 });
const zResult = stats.twoProportionZTest(0.9, 100, 0.7, 100);
console.log(`p-value: ${zResult.pValue.toFixed(4)}`);API Reference
MetricsCalculator
| Export | Description |
|---|---|
calculateDefenseScore(result) | Compute full DefenseScore from a BenchmarkResult |
calculateCategoryScore(results, category) | Score a single attack category |
calculateConfidenceInterval(proportion, n, level?) | Wilson score interval |
calculateStatisticalSignificance(n1, r1, n2, r2, level?) | z-test between two proportions |
calculateEffectSize(p1, p2) | Cohen’s h effect size |
interpretEffectSize(h) | Returns small, medium, or large |
calculateConsistencyBonus(scores) | Computes bonus for consistent detection across categories |
calculateMetrics(result) | Compute raw metrics (ASR, FPR, latency stats) |
compareMetrics(scoreA, scoreB) | Pairwise comparison with effect size and significance |
MetricsConfig
| Property | Type | Default | Description |
|---|---|---|---|
confidenceLevel | number | 0.95 | 95% confidence level |
minSampleSize | number | 10 | Minimum samples for valid statistics |
fprPenalty | number | 1.0 | Multiplier for FPR penalty |
CategoryScorer
| Method | Description |
|---|---|
score(results, category) | Score a single attack category |
scoreAll(results) | Score all categories with Bayesian smoothing |
aggregate(scores) | Aggregate into AggregatedCategoryScores |
createCategoryScorer(config?)
Factory function.
StatisticalTests
| Method | Description |
|---|---|
twoProportionZTest(p1, n1, p2, n2) | Z-test for two independent proportions |
chiSquareTest(observed, expected?) | Chi-square goodness of fit or independence |
oneWayAnova(groups) | One-way ANOVA for comparing 3+ defense categories |
tTest(a, b) | Welch’s t-test (unequal variance assumed) |
HypothesisTestResult
| Property | Type | Description |
|---|---|---|
pValue | number | The computed p-value |
significant | boolean | Whether p < significance level |
testStatistic | number | The computed test statistic |
degreesOfFreedom | number | Degrees of freedom |
createStatisticalTests(config?)
Factory function.
RankingAlgorithm
| Method | Description |
|---|---|
rank(scores) | Rank defenses by composite score |
computeComposite(scores) | Compute composite weighted ranking |
summarize(rankings) | Generate tiered summary with S/A/B/C/D tiers |
compare(entryA, entryB) | Head-to-head comparison of two entries |
Tier Thresholds
| Tier | Composite Score | Description |
|---|---|---|
| S | ≥ 0.90 | Exceptional — top-tier defense |
| A | ≥ 0.75 | Excellent — strong all-around |
| B | ≥ 0.60 | Good — acceptable for production |
| C | ≥ 0.40 | Marginal — needs improvement |
| D | < 0.40 | Poor — easily bypassed |
createRankingAlgorithm(config?)
Factory function.
Usage Patterns
Scoring a Benchmark Run
typescript
import { BenchmarkEngine } from "@reaatech/pi-bench-runner";
import { calculateDefenseScore, compareMetrics } from "@reaatech/pi-bench-scoring";
const result = await engine.runBenchmark(corpus);
const score = calculateDefenseScore(result);
// Check individual category performance
for (const cat of Object.keys(score.categoryScores)) {
const cs = score.categoryScores[cat];
console.log(`${cat}: ${(cs.detectionRate * 100).toFixed(1)}%`);
}Comparing Defenses
typescript
const comparison = compareMetrics(rebuffScore, lakeraScore);
console.log(`Rebuff ${comparison.isSignificant ? "significantly" : "marginally"} better`);
console.log(`Effect size (Cohen's h): ${comparison.effectSize} (${comparison.interpretation})`);Related Packages
@reaatech/pi-bench-core— Core types and domain models@reaatech/pi-bench-runner— Benchmark execution engine@reaatech/pi-bench-leaderboard— Leaderboard managementprompt-injection-bench— CLI and umbrella package
