
@reaatech/llm-judge-calibration


Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Calibration suite for measuring LLM judge accuracy against human-labeled gold-standard datasets. Computes Cohen’s kappa, confusion matrices, and F1 scores, and detects calibration drift over time. Batch evaluation is driven by the `CalibrationRunner` class, and you supply a custom `JudgmentEngine` implementation to execute the judgments.

Installation

terminal
npm install @reaatech/llm-judge-calibration
# or
pnpm add @reaatech/llm-judge-calibration

Feature Overview

  • Cohen’s kappa inter-rater reliability coefficient
  • Confusion matrix with precision/recall/F1 computation
  • Bundled gold-standard datasets for 5 evaluation criteria
  • CalibrationRunner for automated batch evaluation
  • DriftDetector to detect performance regression
  • Concurrency-controlled parallel execution

Quick Start

typescript
import { CalibrationRunner } from '@reaatech/llm-judge-calibration';

// judgmentEngine is your own JudgmentEngine implementation
// (see the API Reference below).
const runner = new CalibrationRunner({
  engine: judgmentEngine,
  criteria: 'faithfulness',
  concurrency: 5,
  onProgress: (done, total) => console.log(`${done}/${total}`),
});

const { report, judgments } = await runner.run();
console.log(report.cohensKappa, report.accuracy);
typescript
import { CalibrationMetrics } from '@reaatech/llm-judge-calibration';

// judgments come from runner.run(); humanLabels are the gold labels
// from the matching calibration dataset.
const report = CalibrationMetrics.generateReport(judgments, humanLabels);
console.log(report);
// { cohensKappa: 0.78, accuracy: 0.91, precision: 0.88, recall: 0.85, f1Score: 0.86, ... }

API Reference

CalibrationMetrics

Static class for computing calibration statistics.

| Method | Description |
| --- | --- |
| `cohensKappa(judgments, humanLabels)` | Cohen’s kappa inter-rater reliability coefficient |
| `confusionMatrix(judgments, humanLabels, thresholds?)` | 3-class confusion matrix |
| `accuracy(confusionMatrix)` | Overall accuracy from a confusion matrix |
| `precisionRecallF1(confusionMatrix, positiveClass?)` | Precision, recall, and F1 score |
| `generateReport(judgments, humanLabels)` | Full `CalibrationReport` with all metrics |
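
If you need individual metrics rather than the full report, the helpers compose. A minimal sketch, assuming `judgments` and `humanLabels` are the same parallel arrays used in the Quick Start; the exact return shape of `precisionRecallF1` is not documented here, so check the package’s type declarations:

typescript
import { CalibrationMetrics } from '@reaatech/llm-judge-calibration';

// Build the 3-class confusion matrix once, then derive the other
// metrics from it instead of recomputing from the raw arrays.
const matrix = CalibrationMetrics.confusionMatrix(judgments, humanLabels);
const accuracy = CalibrationMetrics.accuracy(matrix);
const prf = CalibrationMetrics.precisionRecallF1(matrix); // default positive class
console.log(accuracy, prf);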

CalibrationRunner

| Constructor Option | Type | Description |
| --- | --- | --- |
| `engine` | `JudgmentEngine` | Engine to evaluate examples |
| `criteria` | `EvaluationCriteria` | Criteria to calibrate against |
| `concurrency` | `number` | Parallel evaluations (default 3) |
| `onProgress` | `(completed, total) => void` | Progress callback |

| Method | Description |
| --- | --- |
| `run()` | Execute the calibration run and return a `CalibrationRunnerResult` |
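
The `JudgmentEngine` interface itself is not documented in this reference, so the stub below is only a sketch of the shape the runner plausibly expects: a single async evaluation method that returns a score. Consult the package’s type declarations for the real contract.

typescript
// Hypothetical JudgmentEngine — the method name and signature are assumptions.
const judgmentEngine = {
  async judge(example: unknown): Promise<number> {
    // Call your LLM judge here and map its verdict onto the
    // numeric scale used by the gold labels.
    return 0.9; // placeholder score
  },
};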

DatasetManager

| Constructor | Description |
| --- | --- |
| `new DatasetManager(datasetsDir?)` | Optional custom directory for calibration JSON files |

| Method | Description |
| --- | --- |
| `load(criteria)` | Load a dataset JSON file (`faithfulness.json`, etc.) |
| `loadAll()` | Load all criteria datasets as a `Map` |
| `getStats(dataset)` | Summary stats (total, good/bad/borderline, average label) |
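
A sketch of inspecting a bundled gold-standard dataset before calibrating. It is written with `await` on the assumption that `load` is asynchronous; it may well be synchronous:

typescript
import { DatasetManager } from '@reaatech/llm-judge-calibration';

const manager = new DatasetManager(); // bundled datasets directory by default
const dataset = await manager.load('faithfulness');
console.log(manager.getStats(dataset)); // totals plus the good/bad/borderline breakdown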

DriftDetector

| Constructor | Description |
| --- | --- |
| `new DriftDetector(kappaThreshold?, accuracyThreshold?)` | Thresholds for drift detection (default 0.1 each) |

| Method | Description |
| --- | --- |
| `detect(baseline, current)` | Compare two reports and return a `DriftReport` |
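
A sketch of a drift check, assuming `baselineReport` is a `CalibrationReport` you persisted from an earlier run and `currentReport` comes from a fresh `CalibrationRunner` run:

typescript
import { DriftDetector } from '@reaatech/llm-judge-calibration';

const detector = new DriftDetector(0.1, 0.1); // kappa and accuracy thresholds
const drift = detector.detect(baselineReport, currentReport);
if (drift.hasDrift) {
  console.warn(drift.recommendation); // e.g. re-calibrate or revise the judge prompt
}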

Key Types

CalibrationReport

| Field | Type | Description |
| --- | --- | --- |
| `cohensKappa` | `number` | Inter-rater reliability coefficient |
| `accuracy` | `number` | Overall accuracy |
| `precision` | `number` | Positive-class precision |
| `recall` | `number` | Positive-class recall |
| `f1Score` | `number` | Harmonic mean of precision and recall |
| `confusionMatrix` | `ConfusionMatrix` | 3-class confusion matrix |

CalibrationDataset

| Field | Type | Description |
| --- | --- | --- |
| `criteria` | `EvaluationCriteria` | Evaluation criteria this dataset targets |
| `version` | `string` | Dataset version |
| `examples` | `CalibrationExample[]` | Array of labeled examples |

DriftReport

| Field | Type | Description |
| --- | --- | --- |
| `hasDrift` | `boolean` | Whether significant drift was detected |
| `kappaDelta` | `number` | Change in Cohen’s kappa |
| `accuracyDelta` | `number` | Change in accuracy |
| `recommendation` | `string` | Actionable recommendation |
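
For illustration, a drift check that trips the kappa threshold might produce a report shaped like this (all values hypothetical, and assuming the `DriftReport` type is exported):

typescript
import type { DriftReport } from '@reaatech/llm-judge-calibration';

const example: DriftReport = {
  hasDrift: true,
  kappaDelta: -0.14, // exceeds the default 0.1 threshold
  accuracyDelta: -0.05, // within its threshold
  recommendation: 'Judge agreement regressed; review recent prompt or model changes.',
};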

License

MIT