@reaatech/pi-bench-scoring

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Scoring engine, statistical analysis, and ranking algorithms for prompt-injection-bench. Computes weighted scores, confidence intervals, effect sizes, and cross-defense comparisons with multiple statistical tests.

Installation

terminal

npm install @reaatech/pi-bench-scoring
# or
pnpm add @reaatech/pi-bench-scoring

Feature Overview

Weighted scoring — Category-weighted overall score with FPR penalty and consistency bonus
Confidence intervals — Wilson score intervals for detection rates
Effect sizes — Cohen’s h with small / medium / large interpretation
Statistical tests — Two-proportion z-test, chi-square, Welch’s t-test, one-way ANOVA
Bayesian smoothing — Category-level scores smoothed with prior for small samples
Tiered ranking — S/A/B/C/D composite ranking with configurable weights
Dual ESM/CJS output — works with import and require

Quick Start

typescript

import {
  calculateDefenseScore,
  compareMetrics,
  createStatisticalTests,
} from "@reaatech/pi-bench-scoring";
 
const score = calculateDefenseScore(benchmarkResult);
console.log(`Detection rate: ${(1 - score.attackSuccessRate) * 100}%`);
console.log(`Weighted score: ${score.overallScore.toFixed(3)}`);
 
// Compare two defense results
const comparison = compareMetrics(scoreA, scoreB);
console.log(`Effect size (Cohen's h): ${comparison.effectSize}`);
 
// Run statistical tests
const stats = createStatisticalTests({ significanceLevel: 0.05 });
const zResult = stats.twoProportionZTest(0.9, 100, 0.7, 100);
console.log(`p-value: ${zResult.pValue.toFixed(4)}`);

API Reference

MetricsCalculator

Export	Description
`calculateDefenseScore(result)`	Compute full `DefenseScore` from a `BenchmarkResult`
`calculateCategoryScore(results, category)`	Score a single attack category
`calculateConfidenceInterval(proportion, n, level?)`	Wilson score interval
`calculateStatisticalSignificance(n1, r1, n2, r2, level?)`	z-test between two proportions
`calculateEffectSize(p1, p2)`	Cohen’s h effect size
`interpretEffectSize(h)`	Returns `small`, `medium`, or `large`
`calculateConsistencyBonus(scores)`	Computes bonus for consistent detection across categories
`calculateMetrics(result)`	Compute raw metrics (ASR, FPR, latency stats)
`compareMetrics(scoreA, scoreB)`	Pairwise comparison with effect size and significance

`MetricsConfig`

Property	Type	Default	Description
`confidenceLevel`	`number`	`0.95`	95% confidence level
`minSampleSize`	`number`	`10`	Minimum samples for valid statistics
`fprPenalty`	`number`	`1.0`	Multiplier for FPR penalty

CategoryScorer

Method	Description
`score(results, category)`	Score a single attack category
`scoreAll(results)`	Score all categories with Bayesian smoothing
`aggregate(scores)`	Aggregate into `AggregatedCategoryScores`

`createCategoryScorer(config?)`

Factory function.

StatisticalTests

Method	Description
`twoProportionZTest(p1, n1, p2, n2)`	Z-test for two independent proportions
`chiSquareTest(observed, expected?)`	Chi-square goodness of fit or independence
`oneWayAnova(groups)`	One-way ANOVA for comparing 3+ defense categories
`tTest(a, b)`	Welch’s t-test (unequal variance assumed)

HypothesisTestResult

Property	Type	Description
`pValue`	`number`	The computed p-value
`significant`	`boolean`	Whether p < significance level
`testStatistic`	`number`	The computed test statistic
`degreesOfFreedom`	`number`	Degrees of freedom

`createStatisticalTests(config?)`

Factory function.

RankingAlgorithm

Method	Description
`rank(scores)`	Rank defenses by composite score
`computeComposite(scores)`	Compute composite weighted ranking
`summarize(rankings)`	Generate tiered summary with S/A/B/C/D tiers
`compare(entryA, entryB)`	Head-to-head comparison of two entries

Tier Thresholds

Tier	Composite Score	Description
S	≥ 0.90	Exceptional — top-tier defense
A	≥ 0.75	Excellent — strong all-around
B	≥ 0.60	Good — acceptable for production
C	≥ 0.40	Marginal — needs improvement
D	< 0.40	Poor — easily bypassed

`createRankingAlgorithm(config?)`

Factory function.

Usage Patterns

Scoring a Benchmark Run

typescript

import { BenchmarkEngine } from "@reaatech/pi-bench-runner";
import { calculateDefenseScore, compareMetrics } from "@reaatech/pi-bench-scoring";
 
const result = await engine.runBenchmark(corpus);
const score = calculateDefenseScore(result);
 
// Check individual category performance
for (const cat of Object.keys(score.categoryScores)) {
  const cs = score.categoryScores[cat];
  console.log(`${cat}: ${(cs.detectionRate * 100).toFixed(1)}%`);
}

Comparing Defenses

typescript

const comparison = compareMetrics(rebuffScore, lakeraScore);
console.log(`Rebuff ${comparison.isSignificant ? "significantly" : "marginally"} better`);
console.log(`Effect size (Cohen's h): ${comparison.effectSize} (${comparison.interpretation})`);

@reaatech/pi-bench-core — Core types and domain models
@reaatech/pi-bench-runner — Benchmark execution engine
@reaatech/pi-bench-leaderboard — Leaderboard management
prompt-injection-bench — CLI and umbrella package

License

MIT

@reaatech/pi-bench-scoring

@reaatech/pi-bench-scoring

Installation

Feature Overview

Quick Start

API Reference

MetricsCalculator

MetricsConfig

CategoryScorer

createCategoryScorer(config?)

StatisticalTests

HypothesisTestResult

createStatisticalTests(config?)

RankingAlgorithm

Tier Thresholds

createRankingAlgorithm(config?)

Usage Patterns

Scoring a Benchmark Run

Comparing Defenses

Related Packages

License

`MetricsConfig`

`createCategoryScorer(config?)`

`createStatisticalTests(config?)`

`createRankingAlgorithm(config?)`