# @reaatech/agent-eval-harness-judge

> **Status: Pre-1.0** — APIs may change in minor versions. Pin to a specific version in production.

Provider-agnostic LLM-as-judge engine with calibration and multi-model consensus. Scores agent responses on faithfulness, relevance, tool correctness, and overall quality using Claude, GPT-4, Gemini, or any OpenAI-compatible provider.
## Installation

```bash
npm install @reaatech/agent-eval-harness-judge
```
## Feature Overview

- **4 provider support** — Claude (Anthropic SDK), GPT-4 (OpenAI SDK), Gemini (Google Generative AI), OpenRouter (OpenAI-compatible), with automatic API key detection from environment variables
- **4 judgment types** — faithfulness (context adherence), relevance (intent alignment), tool_correctness (selection + arguments), overall_quality (multi-dimensional holistic assessment)
- **3 calibration methods** — Temperature scaling (grid search over logit temperature), isotonic regression (non-parametric, rank-preserving), and linear regression fit against human labels
- **Multi-model consensus** — Weighted, majority, and unweighted voting strategies with tie-breaking by highest confidence or averaging
- **Built-in rate limiting** — Per-provider rate limits with automatic backoff (50 rpm Claude, 60 rpm GPT-4/Gemini, 30 rpm OpenRouter)
- **Retry with exponential backoff** — Configurable max retries (default 3) with doubling delay starting at 1s
- **Cost tracking** — Per-judgment cost estimation with provider-aware pricing, budget alerts at configurable thresholds (50%/75%/90%), and optimization recommendations
- **Mock fallback** — Returns `score: 0.85` when `NODE_ENV=test` or `JUDGE_MOCK=true` to enable offline testing
- **Custom prompt templates** — Pre-built templates for all judgment types plus `createCustomTemplate` for bespoke evaluation criteria
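The retry behavior described above (doubling delay, starting at the base delay) can be pictured with a small self-contained sketch. This is an illustration of the general technique, not the package's internal code; `backoffDelays` and `retryWithBackoff` are hypothetical names.

```typescript
// Illustrative sketch of retry with exponential backoff -- hypothetical
// helper names, NOT the package's internal implementation.

// Delay before retry k is baseDelayMs * 2^k: 1s, 2s, 4s for the defaults.
function backoffDelays(maxRetries: number, baseDelayMs: number): number[] {
  return Array.from({ length: maxRetries }, (_, k) => baseDelayMs * 2 ** k);
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,    // matches the documented default of 3 retries
  baseDelayMs = 1000, // matches the documented 1s starting delay
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // retries exhausted
      // Wait before the next attempt; the delay doubles each time.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```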
## Quick Start

```typescript
import { JudgeEngine } from '@reaatech/agent-eval-harness-judge';

const judge = new JudgeEngine({
  model: 'claude-opus',
  provider: 'claude',
  temperature: 0.1,
});

const result = await judge.judge({
  type: 'faithfulness',
  context: 'The account balance is $42.50',
  response: 'Your balance is $42.50. Would you like to make a payment?',
});

console.log(`Score: ${result.score}, Confidence: ${result.confidence}`);
console.log(result.explanation);
```
## API Reference

### JudgeEngine

| Method Signature | Description |
| --- | --- |
| `constructor(config: JudgeConfig, retryConfig?: { maxRetries, baseDelayMs })` | Initializes the engine with provider config and builds the rate limiter |
| `judge(request: JudgeRequest) => Promise<JudgeScore>` | Evaluates a single request with rate limiting and retry logic |
| `judgeBatch(requests: Array<{ id, request: JudgeRequest }>, concurrency?: number) => Promise<BatchJudgeResult>` | Evaluates multiple requests with configurable concurrency (default 5) |
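The concurrency cap behind `judgeBatch` can be pictured with a small pool runner like the sketch below. It is self-contained and illustrative only; `runBatch` is a hypothetical name, not the package's API.

```typescript
// Sketch of a concurrency-capped batch runner, similar in spirit to
// judgeBatch(requests, concurrency). `runBatch` is hypothetical.
async function runBatch<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency = 5, // judgeBatch's documented default
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Start up to `concurrency` lanes; each lane pulls the next unclaimed index.
  const lanes = Array.from({ length: Math.min(concurrency, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  });
  await Promise.all(lanes);
  return results; // results stay in input order regardless of completion order
}
```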
### JudgeCalibrator

| Method Signature | Description |
| --- | --- |
| `constructor(method?: CalibrationMethod)` | Creates a calibrator (default: `temperature_scaling`) |
| `addCalibrationData(humanLabels: HumanLabel[], judgeScores: JudgeScore[]) => void` | Pairs human labels with raw judge scores as calibration points |
| `calibrate() => CalibrationResult` | Fits the calibration model against collected data (≥3 points required); returns before/after MAE and improvement percentage |
| `apply(rawScore: number) => number` | Transforms a raw judge score using the fitted calibration parameters |
| `getIsCalibrated() => boolean` | Returns whether calibration has been completed |
### ConsensusEngine

| Method Signature | Description |
| --- | --- |
| `constructor(config: ConsensusConfig)` | Creates a consensus engine with a voting strategy and model weights |
| `consensus(scores: Array<{ model, score: JudgeScore }>) => ConsensusResult` | Computes a final score from multiple judges using the configured voting strategy and agreement threshold |
### JudgeCostTracker

| Method Signature | Description |
| --- | --- |
| `constructor(config?: JudgeCostConfig)` | Creates a tracker with optional budget limit, max cost per judgment, alert thresholds, and custom pricing |
| `recordJudgment(judgmentId, provider, model, inputTokens, outputTokens) => { cost, alerts }` | Records a judgment and returns its cost plus any budget alerts triggered |
| `estimateCost(provider, estimatedInputTokens, estimatedOutputTokens) => number` | Estimates cost without recording |
| `canAfford(estimatedCost) => { allowed, reason? }` | Checks whether the projected total would exceed the budget |
| `getBreakdown() => JudgeCostBreakdown` | Returns total cost, token counts, per-provider costs, and budget usage percentage |
| `getRemainingBudget() => number` | Returns remaining budget (`Infinity` if no limit set) |
| `getOptimizationRecommendations() => string[]` | Returns actionable cost-saving recommendations |
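Cost estimation is token arithmetic over per-provider pricing. The sketch below shows the general shape; the prices are placeholders for illustration, not the package's actual provider-aware rates, and the names are hypothetical.

```typescript
// Sketch of token-based cost estimation. The prices below are PLACEHOLDERS,
// not the package's real provider pricing; names are hypothetical.
interface TokenPricing {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

const examplePricing: Record<string, TokenPricing> = {
  claude: { inputPerMTok: 3, outputPerMTok: 15 }, // placeholder values
  gpt4: { inputPerMTok: 10, outputPerMTok: 30 },  // placeholder values
};

function estimateCostUSD(provider: string, inputTokens: number, outputTokens: number): number {
  const p = examplePricing[provider];
  if (!p) throw new Error(`Unknown provider: ${provider}`);
  // cost = input tokens at the input rate + output tokens at the output rate
  return (inputTokens / 1e6) * p.inputPerMTok + (outputTokens / 1e6) * p.outputPerMTok;
}
```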
### Prompt Templates

| Function | Returns | Description |
| --- | --- | --- |
| `getFaithfulnessTemplate` | `PromptTemplate` | Context-adherence scoring prompt with 0–1 rubric |
| `getRelevanceTemplate` | `PromptTemplate` | Intent-alignment scoring prompt with 0–1 rubric |
| `getToolCorrectnessTemplate` | `PromptTemplate` | Tool selection and argument validation prompt (includes `issues` field) |
| `getOverallQualityTemplate` | `PromptTemplate` | Multi-dimensional quality prompt with dimension-level scores (accuracy, completeness, clarity, helpfulness) |
| `getAvailableTemplates` | `Record<string, PromptTemplate>` | Returns all four built-in templates keyed by judgment type |
| `buildPrompt` | `{ system, user }` | Substitutes `PromptVariables` into a `PromptTemplate` |
| `createCustomTemplate` | `PromptTemplate` | Creates a custom template with name, system prompt, user prompt, and response format |
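Template substitution along the lines of `buildPrompt` can be sketched as simple placeholder replacement. The `{{variable}}` syntax here is an assumption for illustration, not necessarily the package's actual placeholder format, and `fillTemplate` is a hypothetical name.

```typescript
// Sketch of prompt-variable substitution. The {{name}} placeholder syntax is
// an assumption for illustration; `fillTemplate` is hypothetical.
function fillTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, name: string) =>
    name in vars ? vars[name] : match, // leave unknown placeholders untouched
  );
}

const userPrompt = fillTemplate(
  'Context: {{context}}\nResponse: {{response}}\nScore faithfulness from 0 to 1.',
  { context: 'Balance is $42.50', response: 'Your balance is $42.50.' },
);
```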
## Types

### JudgeConfig

| Field | Type | Description |
| --- | --- | --- |
| `model` | `string` | Primary judge model name |
| `provider` | `JudgeProvider` | One of `'claude' \| 'gpt4' \| 'gemini' \| 'openrouter'` |
| `fallbackModels` | `string[]?` | Fallback model chain for failover |
| `temperature` | `number?` | Sampling temperature (default: 0) |
| `maxTokens` | `number?` | Max output tokens |
| `apiKey` | `string?` | API key override (alternatively via env vars) |
### JudgeRequest

| Field | Type | Description |
| --- | --- | --- |
| `type` | `JudgmentType` | `'faithfulness' \| 'relevance' \| 'tool_correctness' \| 'overall_quality'` |
| `context` | `string?` | Reference context for faithfulness/quality |
| `intent` | `string?` | User intent for relevance/quality |
| `response` | `string` | Agent response to evaluate |
| `expected_tool` | `string?` | Expected tool name (tool_correctness) |
| `actual_tool` | `string?` | Actual tool name (tool_correctness) |
| `arguments` | `Record<string, unknown>?` | Tool arguments (tool_correctness) |
### JudgeScore

| Field | Type | Description |
| --- | --- | --- |
| `score` | `number` | Score from 0.0 to 1.0 |
| `explanation` | `string` | Human-readable explanation |
| `confidence` | `number` | Confidence in the score (0.0 to 1.0) |
| `calibrated` | `boolean` | Whether the score has been calibrated |
| `rawScore` | `number?` | Pre-calibration score |
| `cost` | `number?` | Cost of this judge call in USD |
### JudgeProvider

```typescript
type JudgeProvider = 'claude' | 'gpt4' | 'gemini' | 'openrouter';
```

### JudgmentType

```typescript
type JudgmentType = 'faithfulness' | 'relevance' | 'tool_correctness' | 'overall_quality';
```
## Calibration Methods

| Method | Description | Best For |
| --- | --- | --- |
| `temperature_scaling` | Adjusts logit temperature via grid search (0.1–5.0) to minimize MAE. Keeps ranking intact. | Scores with consistent bias |
| `isotonic_regression` | Non-parametric least-squares fit preserving monotonicity. Approximated via linear slope + offset. | Non-linear calibration curves |
| `linear` | Simple linear regression (y = slope × x + intercept). Fastest calibration. | Scores with linear bias |
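The `linear` method, for instance, reduces to ordinary least squares over (raw score, human score) pairs. The sketch below is a self-contained illustration of that fit, not the package's internal implementation; `fitLinear` and `applyLinear` are hypothetical names, and clamping to [0, 1] is an assumption matching `JudgeScore`'s score range.

```typescript
// Sketch of the `linear` calibration method: ordinary least squares fitting
// human = slope * raw + intercept. NOT the package's internal code.
function fitLinear(raw: number[], human: number[]): { slope: number; intercept: number } {
  const n = raw.length;
  const meanX = raw.reduce((a, b) => a + b, 0) / n;
  const meanY = human.reduce((a, b) => a + b, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (raw[i] - meanX) * (human[i] - meanY);
    sxx += (raw[i] - meanX) ** 2;
  }
  const slope = sxy / sxx;
  return { slope, intercept: meanY - slope * meanX };
}

// Apply the fitted line, clamping into [0, 1] (an assumption here, matching
// the documented 0.0-1.0 score range).
function applyLinear(rawScore: number, p: { slope: number; intercept: number }): number {
  return Math.min(1, Math.max(0, p.slope * rawScore + p.intercept));
}
```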
## Consensus Voting Strategies

| Strategy | Description | Use Case |
| --- | --- | --- |
| `weighted` | Score-weighted average using per-model weights from config | Best when model quality varies |
| `majority` | Bins scores into low (<0.33), medium (0.33–0.67), high (>0.67) and uses a weighted majority vote | Quick pass/fail-style decisions |
| `unweighted` | Simple arithmetic mean of all scores | Equal confidence in all models |
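The `weighted` and `majority` strategies in the table above can be illustrated with a self-contained sketch (the binning thresholds match the table; the function names are hypothetical, not the package's code):

```typescript
// Sketch of two consensus strategies from the table above.
// NOT the package's internal code; names are hypothetical.
interface WeightedScore {
  score: number;
  weight: number;
}

// `weighted`: weight-normalized average of the model scores.
function weightedConsensus(scores: WeightedScore[]): number {
  const totalWeight = scores.reduce((sum, s) => sum + s.weight, 0);
  return scores.reduce((sum, s) => sum + s.score * s.weight, 0) / totalWeight;
}

// `majority`: bin each score into low/medium/high (thresholds as documented:
// <0.33, 0.33-0.67, >0.67), then take the bin with the largest total weight.
function majorityBin(scores: WeightedScore[]): 'low' | 'medium' | 'high' {
  const tally = { low: 0, medium: 0, high: 0 };
  for (const s of scores) {
    const bin = s.score < 0.33 ? 'low' : s.score > 0.67 ? 'high' : 'medium';
    tally[bin] += s.weight;
  }
  return (Object.entries(tally) as Array<['low' | 'medium' | 'high', number]>)
    .sort((a, b) => b[1] - a[1])[0][0];
}
```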
## Advanced: Calibration with Human Labels

Human-label calibration corrects systematic bias in LLM judge scores, aligning them with ground truth:
```typescript
import { JudgeCalibrator, JudgeEngine } from '@reaatech/agent-eval-harness-judge';

const calibrator = new JudgeCalibrator('temperature_scaling');

// Collect human-labeled samples
const humanLabels = [
  { sampleId: 's1', score: 0.80, type: 'faithfulness' },
  { sampleId: 's2', score: 0.95, type: 'faithfulness' },
  { sampleId: 's3', score: 0.60, type: 'faithfulness' },
];

// Get raw judge scores for the same samples
const judge = new JudgeEngine({ model: 'claude-sonnet-4-20250514', provider: 'claude' });
const judgeScores = await Promise.all([
  judge.judge({ type: 'faithfulness', context: '...', response: '...' }),
  judge.judge({ type: 'faithfulness', context: '...', response: '...' }),
  judge.judge({ type: 'faithfulness', context: '...', response: '...' }),
]);

calibrator.addCalibrationData(humanLabels, judgeScores);
const result = calibrator.calibrate();
console.log(`MAE: ${result.beforeMAE} → ${result.afterMAE} (${result.improvement}% improvement)`);

// Apply calibration to future scores
const futureScore = await judge.judge({ type: 'faithfulness', context: '...', response: '...' });
const calibrated = calibrator.apply(futureScore.score);
console.log(`Raw: ${futureScore.score}, Calibrated: ${calibrated}`);
```
## Advanced: Multi-Model Consensus

Combine multiple judge models to improve reliability and reduce single-model bias:
```typescript
import { ConsensusEngine } from '@reaatech/agent-eval-harness-judge';

const consensusEngine = new ConsensusEngine({
  enabled: true,
  models: [
    { id: 'claude-opus', weight: 0.5 },
    { id: 'gpt-4-turbo', weight: 0.3 },
    { id: 'gemini-pro', weight: 0.2 },
  ],
  votingStrategy: 'weighted',
  minAgreement: 0.7,
  tieBreaker: 'highest_confidence',
});

// Assume scores collected from three separate JudgeEngine instances
const consensusResult = consensusEngine.consensus([
  { model: 'claude-opus', score: { score: 0.85, confidence: 0.9, /* ... */ } },
  { model: 'gpt-4-turbo', score: { score: 0.78, confidence: 0.85, /* ... */ } },
  { model: 'gemini-pro', score: { score: 0.82, confidence: 0.8, /* ... */ } },
]);

console.log(`Consensus score: ${consensusResult.score}`);
console.log(`Agreement: ${consensusResult.agreement}`);
console.log(`Consensus reached: ${consensusResult.consensusReached}`);
```
## Related Packages
## License

MIT