reaatech/llm-judge-toolkit

Last commit: Jun 4, 2026

These packages provide a toolkit for using LLMs to evaluate generated text: built-in prompt templates for five criteria (faithfulness, relevance, coherence, safety, tool-use), multi-provider support (OpenAI, Anthropic, local endpoints), and a judgment engine that handles retries, caching, and rate limiting. Adopt them to replace ad-hoc LLM evaluation scripts with a structured pipeline that adds statistical calibration against human labels, bias detection (position, length, style), and multi-judge consensus strategies.

Each package installs independently, and all of them share a common type system and pluggable interfaces. The engine, providers, templates, cache backends, and bias detectors each implement a shared contract, so you can swap implementations or use only the pieces you need without pulling in the rest.

Packages

10 packages

@reaatech/llm-judge-bias

v0.1.0
A set of classes (`PositionBiasDetector`, `LengthBiasDetector`, `StyleBiasDetector`, and `ComprehensiveBiasDetector`) that detect systematic position, length, and style biases in LLM judgments by comparing scores across controlled variations (e.g., swapped order, length correlation, style transformations). Each detector requires a judge function (e.g., an LLM call) to evaluate candidates or responses.
Status: published, 23 days ago
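
The swapped-order check mentioned for this package can be sketched in a few lines. This is a minimal illustration, not the package's API: `PairwiseJudge`, `PositionBiasReport`, and `measurePositionBias` are hypothetical names, and the judge is assumed to return which slot ("first" or "second") it prefers.

```typescript
// Sketch only: hypothetical names, not the exports of @reaatech/llm-judge-bias.
type Slot = "first" | "second";

// Stand-in for an LLM call that picks the better of two candidates.
type PairwiseJudge = (first: string, second: string) => Promise<Slot>;

interface PositionBiasReport {
  pairs: number;
  flipRate: number;      // share of pairs whose winning candidate changed after the swap
  firstSlotRate: number; // share of all verdicts that favored the first slot
}

async function measurePositionBias(
  judge: PairwiseJudge,
  pairs: Array<[string, string]>,
): Promise<PositionBiasReport> {
  let flips = 0;
  let firstSlotWins = 0;
  for (const [a, b] of pairs) {
    const original = await judge(a, b); // a is shown first
    const swapped = await judge(b, a);  // b is shown first
    // Map each slot verdict back to the candidate that actually won.
    const winnerOriginal = original === "first" ? a : b;
    const winnerSwapped = swapped === "first" ? b : a;
    if (winnerOriginal !== winnerSwapped) flips += 1;
    if (original === "first") firstSlotWins += 1;
    if (swapped === "first") firstSlotWins += 1;
  }
  const n = pairs.length;
  return {
    pairs: n,
    flipRate: n === 0 ? 0 : flips / n,
    firstSlotRate: n === 0 ? 0 : firstSlotWins / (2 * n),
  };
}
```

Under these assumptions, a judge with no position preference should show a `flipRate` near 0 and a `firstSlotRate` near 0.5, while a judge that always picks the first slot scores 1 on both.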

@reaatech/llm-judge-cache

v0.1.0
A CacheManager facade with pluggable in-memory, file-system, and Redis backends for caching LLM judgment results, using SHA-256 content-addressed keys and configurable TTL.
Status: published, 23 days ago
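
To make "SHA-256 content-addressed keys and configurable TTL" concrete, here is a minimal sketch of an in-memory variant. The names `JudgmentRequest`, `cacheKey`, and `InMemoryTtlCache` are hypothetical and the choice of hashed fields is an assumption; the package's `CacheManager` may work differently.

```typescript
import { createHash } from "node:crypto";

// Sketch only: hypothetical names, not the CacheManager API.
interface JudgmentRequest {
  criterion: string;
  model: string;
  prompt: string;
  response: string;
}

// Content-addressed key: identical inputs always hash to the same 64-char hex digest.
function cacheKey(req: JudgmentRequest): string {
  // A fixed-order array avoids depending on object key order.
  const canonical = JSON.stringify([req.criterion, req.model, req.prompt, req.response]);
  return createHash("sha256").update(canonical).digest("hex");
}

class InMemoryTtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return entry.value;
  }
}
```

Because the key is derived only from the request content, two processes that judge the same input would compute the same key, which is what lets a shared backend such as Redis or the file system serve both.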

@reaatech/llm-judge-calibration

v0.1.0
A TypeScript library that measures how accurately an LLM judge system classifies text against human-labeled gold-standard datasets, providing Cohen's kappa, confusion matrices, precision/recall/F1 scores, and drift detection. It exports `CalibrationRunner`, `CalibrationMetrics`, `DatasetManager`, and `DriftDetector` classes for batch evaluation and ongoing monitoring.
Status: published, 23 days ago
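
Cohen's kappa corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the chance agreement computed from each rater's label frequencies. The standalone function below is a sketch of that formula; `cohensKappa` is a hypothetical name and not part of the package's `CalibrationMetrics` class.

```typescript
// Sketch only: a standalone function, not the package's CalibrationMetrics API.
function cohensKappa(judge: string[], human: string[]): number {
  if (judge.length !== human.length || judge.length === 0) {
    throw new Error("label arrays must be non-empty and of equal length");
  }
  const n = judge.length;
  const judgeCounts = new Map<string, number>();
  const humanCounts = new Map<string, number>();
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    if (judge[i] === human[i]) agreements += 1;
    judgeCounts.set(judge[i], (judgeCounts.get(judge[i]) ?? 0) + 1);
    humanCounts.set(human[i], (humanCounts.get(human[i]) ?? 0) + 1);
  }
  const observed = agreements / n; // p_o
  let expected = 0;                // p_e
  judgeCounts.forEach((count, label) => {
    expected += (count / n) * ((humanCounts.get(label) ?? 0) / n);
  });
  // Degenerate case: both raters used one identical label throughout.
  if (expected === 1) return 1;
  return (observed - expected) / (1 - expected);
}
```

For example, judge labels `["pass", "pass", "fail", "fail"]` against human labels `["pass", "fail", "fail", "fail"]` give p_o = 0.75 and p_e = 0.5, so kappa is 0.5.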

@reaatech/llm-judge-cli

v0.1.0
A CLI that evaluates LLM responses against criteria (faithfulness, relevance, coherence, safety, tool-use) and calibrates judgments against human labels, reading JSONL input and outputting scored results via `evaluate` and `calibrate` subcommands.
Status: published, 23 days ago

@reaatech/llm-judge-consensus

v0.1.0
Provides consensus strategies (majority voting, weighted voting, and a cheap-first tiebreaker) as classes implementing a shared `ConsensusStrategy` interface, combining individual judgment scores into a final evaluation with an agreement score.
Status: published, 23 days ago
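
A majority-voting strategy with an agreement score might look like the sketch below. The interface shape is an assumption: the source names a `ConsensusStrategy` interface but does not show its methods, so `SketchConsensusStrategy`, `JudgeScore`, `ConsensusOutcome`, and `MajorityVote` are hypothetical, as are the 0.5 pass threshold and the tie rule.

```typescript
// Sketch only: the real ConsensusStrategy interface may have a different shape.
interface JudgeScore {
  judgeId: string;
  score: number; // assumed to lie in [0, 1]
}

interface ConsensusOutcome {
  verdict: "pass" | "fail";
  meanScore: number;
  agreement: number; // share of judges that voted with the majority
}

interface SketchConsensusStrategy {
  combine(scores: JudgeScore[]): ConsensusOutcome;
}

class MajorityVote implements SketchConsensusStrategy {
  private threshold: number;

  constructor(threshold = 0.5) {
    this.threshold = threshold;
  }

  combine(scores: JudgeScore[]): ConsensusOutcome {
    if (scores.length === 0) throw new Error("at least one judgment is required");
    const passVotes = scores.filter((s) => s.score >= this.threshold).length;
    const failVotes = scores.length - passVotes;
    // Ties resolve to "fail"; that is a choice made for this sketch.
    const verdict = passVotes > failVotes ? "pass" : "fail";
    const meanScore = scores.reduce((sum, s) => sum + s.score, 0) / scores.length;
    const agreement = Math.max(passVotes, failVotes) / scores.length;
    return { verdict, meanScore, agreement };
  }
}
```

A weighted variant would follow the same contract and differ only in how votes are counted, which is the benefit of putting strategies behind one interface.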

@reaatech/llm-judge-engine

v0.1.0
Orchestrates LLM-based judgment calls with automatic retry, caching, rate limiting, and a typed event bus, exposing a `JudgmentEngine` class that takes a provider, template, and config.
Status: published, 23 days ago
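
The automatic retry described here is commonly implemented with exponential backoff. The helper below is a sketch of that general technique under stated assumptions: `withRetry` and `RetryOptions` are hypothetical names, and the source does not say which backoff policy `JudgmentEngine` actually uses.

```typescript
// Sketch only: hypothetical helper, not the JudgmentEngine implementation.
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(call: () => Promise<T>, options: RetryOptions): Promise<T> {
  if (options.maxAttempts < 1) throw new Error("maxAttempts must be at least 1");
  const sleep = options.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      if (attempt < options.maxAttempts) {
        // Exponential backoff: base, 2x base, 4x base, ...
        await sleep(options.baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  // Every attempt failed: surface the most recent error.
  throw lastError;
}
```

A production version would likely retry only on transient errors (timeouts, rate-limit responses) and add jitter to the delay; both are left out to keep the sketch short.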

@reaatech/llm-judge-infra

v0.1.0
Infrastructure utilities for LLM judgment pipelines, providing a `CostTracker` with period-aware budget enforcement, a `BatchProcessor` with configurable concurrency and retry, a `MetricsCollector`, and structured Pino logging helpers. Depends on `pino` at runtime.
Status: published, 23 days ago
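
One way to read "period-aware budget enforcement" is a spending limit that resets at each period boundary. The class below sketches that idea only; `PeriodBudget` and `tryRecord` are hypothetical names, the fixed-window period is an assumption, and the package's `CostTracker` may behave differently.

```typescript
// Sketch only: hypothetical class, not the package's CostTracker.
class PeriodBudget {
  private limitUsd: number;
  private periodMs: number;
  private currentPeriod = -1;
  private spentUsd = 0;

  constructor(limitUsd: number, periodMs: number) {
    this.limitUsd = limitUsd;
    this.periodMs = periodMs;
  }

  // Records the cost and returns true if it fits in the current period's budget;
  // returns false and records nothing otherwise.
  tryRecord(costUsd: number, timestampMs: number): boolean {
    const period = Math.floor(timestampMs / this.periodMs);
    if (period !== this.currentPeriod) {
      this.currentPeriod = period; // new period: spending starts from zero
      this.spentUsd = 0;
    }
    if (this.spentUsd + costUsd > this.limitUsd) return false;
    this.spentUsd += costUsd;
    return true;
  }

  remainingUsd(): number {
    return this.limitUsd - this.spentUsd;
  }
}
```

Passing the timestamp in explicitly keeps the class deterministic; a caller would normally supply `Date.now()` and a period such as one day in milliseconds.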

@reaatech/llm-judge-providers

v0.1.0
A factory-pattern provider that exposes `ProviderFactory.create()` and `ProviderFactory.fromEnv()` to instantiate OpenAI, Anthropic, or local (OpenAI-compatible) LLM clients, all conforming to a shared `LLMProvider` interface with built-in cost calculation and health checks. It lazily loads the `openai` or `@anthropic-ai/sdk` packages only when the corresponding provider is used.
Status: published, 23 days ago
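
The lazy-loading behavior can be illustrated with a small registry that runs a provider's loader only on first use. This is a sketch of the general pattern, not of `ProviderFactory` itself: `SketchProvider`, `Loader`, and `LazyProviderRegistry` are hypothetical names, and a real loader would perform a dynamic `import()` of the SDK where the comment indicates.

```typescript
// Sketch only: hypothetical names; the package exposes ProviderFactory instead.
interface SketchProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

type Loader = () => Promise<SketchProvider>;

class LazyProviderRegistry {
  private loaders = new Map<string, Loader>();
  private loaded = new Map<string, Promise<SketchProvider>>();

  register(kind: string, loader: Loader): void {
    this.loaders.set(kind, loader);
  }

  // The loader runs on first use only. A real loader would dynamically
  // import the vendor SDK here, so unused SDKs are never loaded.
  get(kind: string): Promise<SketchProvider> {
    const cached = this.loaded.get(kind);
    if (cached !== undefined) return cached;
    const loader = this.loaders.get(kind);
    if (loader === undefined) {
      return Promise.reject(new Error(`unknown provider: ${kind}`));
    }
    const pending = loader();
    this.loaded.set(kind, pending); // memoize so later calls share one instance
    return pending;
  }
}
```

Memoizing the promise rather than the resolved provider means concurrent first calls share a single load. One consequence in this sketch is that a failed load stays cached; a fuller version would clear the entry on rejection.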

@reaatech/llm-judge-templates

v0.1.0
A set of evaluation prompt templates (faithfulness, relevance, coherence, safety, tool-use) that implement a `JudgmentTemplate` interface with `buildPrompt` and `parseResponse` methods, returning structured JSON judgments with score, reasoning, confidence, and metadata.
Status: published, 23 days ago
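
The `buildPrompt` / `parseResponse` pairing can be sketched as follows. Only those two method names and the four result fields (score, reasoning, confidence, metadata) come from the description above; the input shape, the 0 to 1 scale, the prompt wording, and the names `SketchTemplate`, `SketchJudgment`, and `relevanceTemplate` are assumptions for illustration.

```typescript
// Sketch only: everything beyond buildPrompt/parseResponse is assumed.
interface SketchJudgment {
  score: number;
  reasoning: string;
  confidence: number;
  metadata: Record<string, unknown>;
}

interface SketchTemplate<I> {
  buildPrompt(input: I): string;
  parseResponse(raw: string): SketchJudgment;
}

interface RelevanceInput {
  question: string;
  answer: string;
}

const relevanceTemplate: SketchTemplate<RelevanceInput> = {
  buildPrompt(input) {
    return [
      "Rate how relevant the answer is to the question on a scale from 0 to 1.",
      `Question: ${input.question}`,
      `Answer: ${input.answer}`,
      'Reply with JSON only: {"score": number, "reasoning": string, "confidence": number}',
    ].join("\n");
  },

  parseResponse(raw) {
    // Models often wrap JSON in extra text, so take the outermost braces.
    const start = raw.indexOf("{");
    const end = raw.lastIndexOf("}");
    if (start === -1 || end <= start) throw new Error("no JSON object in response");
    const data = JSON.parse(raw.slice(start, end + 1)) as Record<string, unknown>;
    const { score, reasoning, confidence } = data;
    if (typeof score !== "number" || score < 0 || score > 1) {
      throw new Error("score must be a number in [0, 1]");
    }
    if (typeof reasoning !== "string") throw new Error("reasoning must be a string");
    if (typeof confidence !== "number") throw new Error("confidence must be a number");
    return { score, reasoning, confidence, metadata: { rawLength: raw.length } };
  },
};
```

Keeping prompt construction and response parsing in one object means the output format requested in the prompt and the validation applied to the reply cannot drift apart unnoticed.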

@reaatech/llm-judge-types

v0.1.0
Shared TypeScript types, Zod schemas, and error classes for the LLM Judge Toolkit ecosystem, providing 70+ exported types and 6 typed error classes with zero runtime dependencies beyond Zod.
Status: published, 23 days ago
