
@reaatech/pi-bench-scoring

Calculates weighted scores, statistical significance, and effect sizes for prompt injection defense benchmarks. It provides a collection of utility functions for computing metrics like Wilson score intervals, Cohen's h, and z-tests from benchmark result objects.

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Scoring engine, statistical analysis, and ranking algorithms for prompt-injection-bench. Computes weighted scores, confidence intervals, effect sizes, and cross-defense comparisons with multiple statistical tests.

Installation

```sh
npm install @reaatech/pi-bench-scoring
# or
pnpm add @reaatech/pi-bench-scoring
```

Feature Overview

  • Weighted scoring — Category-weighted overall score with FPR penalty and consistency bonus
  • Confidence intervals — Wilson score intervals for detection rates
  • Effect sizes — Cohen’s h with small / medium / large interpretation
  • Statistical tests — Two-proportion z-test, chi-square, Welch’s t-test, one-way ANOVA
  • Bayesian smoothing — Category-level scores smoothed with prior for small samples
  • Tiered ranking — S/A/B/C/D composite ranking with configurable weights
  • Dual ESM/CJS output — works with import and require

Quick Start

```typescript
import {
  calculateDefenseScore,
  compareMetrics,
  createStatisticalTests,
} from "@reaatech/pi-bench-scoring";

// `benchmarkResult` is a BenchmarkResult from a completed benchmark run.
const score = calculateDefenseScore(benchmarkResult);
console.log(`Detection rate: ${((1 - score.attackSuccessRate) * 100).toFixed(1)}%`);
console.log(`Weighted score: ${score.overallScore.toFixed(3)}`);

// Compare two defense results. `scoreA` and `scoreB` are DefenseScore
// objects returned by calculateDefenseScore.
const comparison = compareMetrics(scoreA, scoreB);
console.log(`Effect size (Cohen's h): ${comparison.effectSize}`);

// Run statistical tests: proportions 0.9 and 0.7, each over 100 samples
const stats = createStatisticalTests({ significanceLevel: 0.05 });
const zResult = stats.twoProportionZTest(0.9, 100, 0.7, 100);
console.log(`p-value: ${zResult.pValue.toFixed(4)}`);
```

API Reference

MetricsCalculator

| Export | Description |
| --- | --- |
| calculateDefenseScore(result) | Compute full DefenseScore from a BenchmarkResult |
| calculateCategoryScore(results, category) | Score a single attack category |
| calculateConfidenceInterval(proportion, n, level?) | Wilson score interval |
| calculateStatisticalSignificance(n1, r1, n2, r2, level?) | z-test between two proportions |
| calculateEffectSize(p1, p2) | Cohen's h effect size |
| interpretEffectSize(h) | Returns small, medium, or large |
| calculateConsistencyBonus(scores) | Computes bonus for consistent detection across categories |
| calculateMetrics(result) | Compute raw metrics (ASR, FPR, latency stats) |
| compareMetrics(scoreA, scoreB) | Pairwise comparison with effect size and significance |
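
To make the Wilson score interval and Cohen's h concrete, here is a minimal standalone sketch of both formulas. It is not the package's source: the names `wilsonInterval`, `cohensH`, and `interpretH` are hypothetical, the fixed z = 1.96 stands in for a 95% level, and the 0.5 / 0.8 cutoffs are Cohen's conventional ones, which the package may or may not use.

```typescript
// Illustrative re-implementation; names and cutoffs are assumptions,
// not the API of @reaatech/pi-bench-scoring.
interface Interval {
  lower: number;
  upper: number;
}

// Wilson score interval for a proportion p observed over n trials.
// z = 1.96 corresponds to a 95% confidence level.
function wilsonInterval(p: number, n: number, z = 1.96): Interval {
  if (n <= 0) return { lower: 0, upper: 1 }; // no data: the interval is uninformative
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const margin = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return {
    lower: Math.max(0, center - margin),
    upper: Math.min(1, center + margin),
  };
}

// Cohen's h: difference between arcsine-transformed proportions.
function cohensH(p1: number, p2: number): number {
  return 2 * Math.asin(Math.sqrt(p1)) - 2 * Math.asin(Math.sqrt(p2));
}

// Conventional cutoffs: |h| >= 0.8 large, |h| >= 0.5 medium, otherwise small.
function interpretH(h: number): "small" | "medium" | "large" {
  const a = Math.abs(h);
  if (a >= 0.8) return "large";
  if (a >= 0.5) return "medium";
  return "small";
}

const ci = wilsonInterval(0.9, 100);
const h = cohensH(0.9, 0.7);
console.log(ci, h.toFixed(3), interpretH(h));
```

Unlike the plain normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for detection rates near 0 or 1, which is presumably why it suits benchmark detection rates.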

MetricsConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| confidenceLevel | number | 0.95 | 95% confidence level |
| minSampleSize | number | 10 | Minimum samples for valid statistics |
| fprPenalty | number | 1.0 | Multiplier for FPR penalty |

CategoryScorer

| Method | Description |
| --- | --- |
| score(results, category) | Score a single attack category |
| scoreAll(results) | Score all categories with Bayesian smoothing |
| aggregate(scores) | Aggregate into AggregatedCategoryScores |
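
The README does not say which prior `scoreAll` uses for Bayesian smoothing. As a rough sketch of the general idea only, the hypothetical helper below shrinks an observed detection rate toward a prior rate, with the prior weighted as a number of pseudo-samples. The function name, the 0.5 prior, and the strength of 10 are all assumptions.

```typescript
// Hypothetical helper, not part of @reaatech/pi-bench-scoring.
// Shrinks an observed rate toward `priorRate`; `priorStrength` is the
// number of pseudo-samples the prior is worth (both values are assumptions).
function smoothedRate(
  detected: number,
  total: number,
  priorRate = 0.5,
  priorStrength = 10,
): number {
  return (detected + priorRate * priorStrength) / (total + priorStrength);
}

// A category with 2 of 2 detections is pulled strongly toward the prior,
console.log(smoothedRate(2, 2).toFixed(3));
// while 200 of 200 detections stays close to the observed rate.
console.log(smoothedRate(200, 200).toFixed(3));
```

The effect is that categories with very few samples cannot dominate a ranking on the strength of one or two lucky results.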

createCategoryScorer(config?)

Factory function. Returns a CategoryScorer, optionally customized by config.

StatisticalTests

| Method | Description |
| --- | --- |
| twoProportionZTest(p1, n1, p2, n2) | Z-test for two independent proportions |
| chiSquareTest(observed, expected?) | Chi-square goodness of fit or independence |
| oneWayAnova(groups) | One-way ANOVA for comparing 3+ defense categories |
| tTest(a, b) | Welch's t-test (unequal variance assumed) |
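
For readers unfamiliar with Welch's t-test, the sketch below computes its test statistic and the Welch-Satterthwaite degrees of freedom from two samples. It is a standalone illustration with hypothetical names (`welchStatistic`, `WelchResult`), not the package's `tTest`, and it stops short of the p-value, which would need a Student's t CDF.

```typescript
// Illustrative only; names are assumptions, not the package API.
interface WelchResult {
  testStatistic: number;
  degreesOfFreedom: number;
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

// Welch's t statistic with Welch-Satterthwaite degrees of freedom.
// Each sample needs at least two values, and the variances must not both be zero.
function welchStatistic(a: number[], b: number[]): WelchResult {
  const va = sampleVariance(a) / a.length; // squared standard error of mean(a)
  const vb = sampleVariance(b) / b.length; // squared standard error of mean(b)
  const testStatistic = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  const degreesOfFreedom =
    (va + vb) ** 2 / (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { testStatistic, degreesOfFreedom };
}

console.log(welchStatistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]));
```

Because the two variances are not pooled, the degrees of freedom are generally fractional and smaller than in the equal-variance t-test.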

HypothesisTestResult

| Property | Type | Description |
| --- | --- | --- |
| pValue | number | The computed p-value |
| significant | boolean | Whether p < significance level |
| testStatistic | number | The computed test statistic |
| degreesOfFreedom | number | Degrees of freedom |
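
As an illustration of how these fields relate, here is a self-contained sketch of a pooled two-proportion z-test that fills in `testStatistic`, `pValue`, and `significant`. It is an assumption-laden stand-in rather than the package's implementation: the names `pooledZTest` and `normalCdf` are hypothetical, the test is two-sided, and the normal CDF uses the Abramowitz-Stegun erf approximation.

```typescript
// Illustrative only; not the source of twoProportionZTest.
interface ZTestResult {
  testStatistic: number;
  pValue: number;
  significant: boolean;
}

// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation
// (absolute error around 1.5e-7).
function normalCdf(x: number): number {
  const y = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * y);
  const poly =
    t * (0.254829592 +
      t * (-0.284496736 +
        t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-y * y);
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Two-sided z-test for proportions p1 (over n1 samples) and p2 (over n2 samples),
// using the pooled proportion for the standard error.
function pooledZTest(
  p1: number,
  n1: number,
  p2: number,
  n2: number,
  significanceLevel = 0.05,
): ZTestResult {
  const pooled = (p1 * n1 + p2 * n2) / (n1 + n2);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  if (se === 0) {
    // Both proportions are 0 or both are 1: no evidence of a difference.
    return { testStatistic: 0, pValue: 1, significant: false };
  }
  const testStatistic = (p1 - p2) / se;
  const pValue = 2 * (1 - normalCdf(Math.abs(testStatistic)));
  return { testStatistic, pValue, significant: pValue < significanceLevel };
}

console.log(pooledZTest(0.9, 100, 0.7, 100));
```

With the Quick Start inputs (0.9 and 0.7 over 100 samples each), the z statistic is about 3.54, so the difference is significant at the 0.05 level.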

createStatisticalTests(config?)

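Factory function. Returns a StatisticalTests instance; config accepts significanceLevel, as shown in the Quick Start.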

RankingAlgorithm

| Method | Description |
| --- | --- |
| rank(scores) | Rank defenses by composite score |
| computeComposite(scores) | Compute composite weighted ranking |
| summarize(rankings) | Generate tiered summary with S/A/B/C/D tiers |
| compare(entryA, entryB) | Head-to-head comparison of two entries |

Tier Thresholds

| Tier | Composite Score | Description |
| --- | --- | --- |
| S | ≥ 0.90 | Exceptional: top-tier defense |
| A | ≥ 0.75 | Excellent: strong all-around |
| B | ≥ 0.60 | Good: acceptable for production |
| C | ≥ 0.40 | Marginal: needs improvement |
| D | < 0.40 | Poor: easily bypassed |
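
The thresholds above amount to a simple lookup from composite score to tier. The helper below, `tierFor`, is a hypothetical name used for illustration; the package's own tiering may differ in detail, for example if the thresholds are configurable.

```typescript
// Hypothetical helper mirroring the tier table; not the package API.
type Tier = "S" | "A" | "B" | "C" | "D";

function tierFor(composite: number): Tier {
  if (composite >= 0.9) return "S";
  if (composite >= 0.75) return "A";
  if (composite >= 0.6) return "B";
  if (composite >= 0.4) return "C";
  return "D";
}

for (const score of [0.93, 0.75, 0.61, 0.4, 0.12]) {
  console.log(`${score.toFixed(2)} -> ${tierFor(score)}`);
}
```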

createRankingAlgorithm(config?)

Factory function. Returns a RankingAlgorithm, optionally customized by config.

Usage Patterns

Scoring a Benchmark Run

```typescript
import { BenchmarkEngine } from "@reaatech/pi-bench-runner";
import { calculateDefenseScore } from "@reaatech/pi-bench-scoring";

// `engine` is a configured BenchmarkEngine and `corpus` is the attack corpus;
// their setup belongs to @reaatech/pi-bench-runner and is omitted here.
const result = await engine.runBenchmark(corpus);
const score = calculateDefenseScore(result);

// Check individual category performance
for (const [cat, cs] of Object.entries(score.categoryScores)) {
  console.log(`${cat}: ${(cs.detectionRate * 100).toFixed(1)}%`);
}
```

Comparing Defenses

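```typescript
import { compareMetrics } from "@reaatech/pi-bench-scoring";

// `rebuffScore` and `lakeraScore` are DefenseScore objects from calculateDefenseScore.
const comparison = compareMetrics(rebuffScore, lakeraScore);
// A non-significant result means the data cannot separate the two defenses,
// not that one of them is marginally better.
console.log(`Difference is ${comparison.isSignificant ? "statistically significant" : "not statistically significant"}`);
console.log(`Effect size (Cohen's h): ${comparison.effectSize} (${comparison.interpretation})`);
```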

License

MIT