reaatech/llm-judge-toolkit

★ 0 · Last commit: May 14, 2026 · GitHub →

These packages provide a modular framework for automating LLM-based evaluation, including prompt templates, consensus strategies, and bias detection. You would adopt them to standardize how you score model outputs while managing costs, caching, and calibration against human-labeled datasets. The system is built on a decoupled architecture where a central `JudgmentEngine` orchestrates pluggable providers, cache backends, and statistical analysis tools through a shared set of TypeScript interfaces.

agentic-ai ai ai-agents anthropic developer-tools generative-ai llm openai typescript

Packages

10 packages

llm-judge-bias

@reaatech/llm-judge-bias

v0.1.0

Identifies systematic position, length, and style biases in LLM evaluations using detector classes that analyze judgment consistency across varied inputs. It provides a `ComprehensiveBiasDetector` to orchestrate these checks and return structured reports, requiring an LLM judge interface to perform the underlying scoring.
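
One common way to check for position bias is to judge each pair of responses twice, once in each order, and see whether the verdict follows the response or the slot. The sketch below illustrates that idea on already-collected verdicts. It does not use the package's API: `SwapTrial`, `PositionBiasReport`, and `positionBiasReport` are hypothetical names chosen for this example.

```typescript
// Hypothetical illustration; not the @reaatech/llm-judge-bias API.
// A verdict names the slot (not the response) that the judge preferred.
type Slot = "first" | "second";

interface SwapTrial {
  original: Slot; // verdict with the responses shown in order (A, B)
  swapped: Slot; // verdict with the same responses shown in order (B, A)
}

interface PositionBiasReport {
  trials: number;
  inconsistentRate: number; // share of trials whose winner changed with the order
  firstSlotRate: number; // share of all verdicts that picked the first slot
}

function positionBiasReport(trials: SwapTrial[]): PositionBiasReport {
  if (trials.length === 0) throw new Error("at least one trial is required");
  let inconsistent = 0;
  let firstSlot = 0;
  for (const t of trials) {
    // An order-independent judge picks opposite slots after the swap,
    // because the same response now sits in the other slot.
    if (t.original === t.swapped) inconsistent++;
    if (t.original === "first") firstSlot++;
    if (t.swapped === "first") firstSlot++;
  }
  return {
    trials: trials.length,
    inconsistentRate: inconsistent / trials.length,
    firstSlotRate: firstSlot / (2 * trials.length),
  };
}
```

A `firstSlotRate` well above 0.5 together with a high `inconsistentRate` would suggest that the judge favors whichever response appears first.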

status: published
published: 1 day ago

llm-judge-cache

@reaatech/llm-judge-cache

v0.1.0

Provides a `CacheManager` class to store and retrieve LLM judgment results using deterministic SHA-256 keys. It supports in-memory, file-system, and Redis backends, with the Redis implementation requiring an external `ioredis`-compatible client.
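
To make the idea of a deterministic SHA-256 key concrete, here is a minimal sketch using Node's built-in `crypto` module. It is not the package's implementation: `canonicalize` and `cacheKey` are hypothetical helpers, and the sorted-key serialization is an assumption about how one might make logically equal inputs hash to the same key.

```typescript
import { createHash } from "node:crypto";

// Hypothetical illustration; not the @reaatech/llm-judge-cache API.
// Serialize a JSON-like value with object keys sorted, so that logically
// equal inputs produce the same string regardless of key order.
function canonicalize(value: unknown): string {
  if (value === undefined) return "null";
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((v) => canonicalize(v)).join(",") + "]";
  const record = value as Record<string, unknown>;
  const keys = Object.keys(record)
    .filter((k) => record[k] !== undefined)
    .sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalize(record[k])).join(",") + "}";
}

// Hash the canonical form; the hex digest is always 64 characters long.
function cacheKey(input: unknown): string {
  return createHash("sha256").update(canonicalize(input)).digest("hex");
}

// Usage with an in-memory Map standing in for a cache backend.
// The field names below are made up for the example.
const memoryCache = new Map<string, number>();
const request = { model: "example-model", prompt: "Is this faithful?", response: "Yes." };
memoryCache.set(cacheKey(request), 0.9);
```

Because the key depends only on the canonical content, a repeated request with the same fields in a different order would hit the same cache entry.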

status: published
published: 1 day ago

llm-judge-calibration

@reaatech/llm-judge-calibration

v0.1.0

Measures LLM judge accuracy against human-labeled datasets using a `CalibrationRunner` class for batch evaluation and a `CalibrationMetrics` utility for computing Cohen's kappa, F1 scores, and confusion matrices. It provides tools to detect performance drift over time and requires a custom `JudgmentEngine` implementation to execute the evaluations.
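
For readers unfamiliar with Cohen's kappa, the following sketch computes it from two parallel label arrays using the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The function name `cohensKappa` is hypothetical and this is not the package's `CalibrationMetrics` code.

```typescript
// Hypothetical illustration; not the @reaatech/llm-judge-calibration API.
// Computes Cohen's kappa between judge labels and human labels.
function cohensKappa(judge: string[], human: string[]): number {
  if (judge.length === 0 || judge.length !== human.length) {
    throw new Error("label arrays must be non-empty and of equal length");
  }
  const n = judge.length;
  const judgeCounts = new Map<string, number>();
  const humanCounts = new Map<string, number>();
  let agreements = 0;

  for (let i = 0; i < n; i++) {
    const j = judge[i] as string;
    const h = human[i] as string;
    if (j === h) agreements++;
    judgeCounts.set(j, (judgeCounts.get(j) ?? 0) + 1);
    humanCounts.set(h, (humanCounts.get(h) ?? 0) + 1);
  }

  // Observed agreement: share of items where both raters gave the same label.
  const observed = agreements / n;

  // Chance agreement: sum over labels of the product of marginal frequencies.
  let expected = 0;
  judgeCounts.forEach((count, label) => {
    expected += (count / n) * ((humanCounts.get(label) ?? 0) / n);
  });

  // Both raters used one identical label for everything: treat as full agreement.
  if (expected === 1) return 1;
  return (observed - expected) / (1 - expected);
}
```

A kappa of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean the judge agrees with the human labels less often than chance would predict.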

status: published
published: 1 day ago

llm-judge-cli

@reaatech/llm-judge-cli

v0.1.0

Provides a CLI for batch-evaluating LLM responses and calibrating judgment criteria against human-labeled datasets using JSONL input. It supports multiple LLM providers and configurable concurrency, outputting scored results directly to stdout or a file.

status: published
published: 1 day ago

llm-judge-consensus

@reaatech/llm-judge-consensus

v0.1.0

Aggregates multiple LLM evaluation scores into a single consensus result using strategies like majority voting, weighted voting, or cost-optimized tiebreaking. It provides a set of classes implementing a shared `execute` method that returns a normalized score and an agreement metric.
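
As a rough illustration of majority voting with an agreement metric, the sketch below turns each judge's normalized score into a pass or fail vote and reports how many judges sided with the majority. Everything here is an assumption made for the example: the `ConsensusResult` shape, the 0.5 threshold, and the choice to report the mean score are not taken from the package, whose strategies expose an `execute` method instead.

```typescript
// Hypothetical illustration; not the @reaatech/llm-judge-consensus API.
interface ConsensusResult {
  verdict: "pass" | "fail" | "tie";
  score: number; // mean of the judges' normalized scores, in [0, 1]
  agreement: number; // share of judges that voted with the larger side
}

function majorityVote(scores: number[], passThreshold = 0.5): ConsensusResult {
  if (scores.length === 0) throw new Error("at least one score is required");
  let sum = 0;
  let passes = 0;
  for (const s of scores) {
    if (!(s >= 0 && s <= 1)) throw new RangeError("scores must be normalized to [0, 1]");
    sum += s;
    // Each judge casts one vote based on its own score.
    if (s >= passThreshold) passes++;
  }
  const fails = scores.length - passes;
  const verdict = passes > fails ? "pass" : fails > passes ? "fail" : "tie";
  return {
    verdict,
    score: sum / scores.length,
    agreement: Math.max(passes, fails) / scores.length,
  };
}
```

For example, three judges scoring 1, 0.75, and 0.25 would produce a "pass" verdict with two of three judges in agreement. A low agreement value is a signal that a tiebreaker or an additional judge may be worth the extra cost.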

status: published
published: 1 day ago

llm-judge-engine

@reaatech/llm-judge-engine

v0.1.0

Orchestrates LLM evaluation workflows by providing a `JudgmentEngine` class that handles retries, rate limiting, caching, and event-driven logging. It requires a provider implementation and a prompt template to execute structured judgments against LLM outputs.
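
The retry behavior mentioned above can be pictured with a small generic wrapper. This is a sketch under stated assumptions, not the `JudgmentEngine` implementation: `withRetry` and `RetryOptions` are hypothetical names, and exponential backoff is one common policy that the package may or may not use.

```typescript
// Hypothetical illustration; not the @reaatech/llm-judge-engine API.
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first one
  baseDelayMs: number; // delay before the second attempt; doubles afterwards
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  task: (attempt: number) => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  if (options.maxAttempts < 1) throw new RangeError("maxAttempts must be at least 1");
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      // Wait baseDelayMs, then 2x, then 4x, ... but not after the final attempt.
      if (attempt < options.maxAttempts) {
        await sleep(options.baseDelayMs * Math.pow(2, attempt - 1));
      }
    }
  }
  // All attempts failed: surface the most recent error to the caller.
  throw lastError;
}
```

A provider call would be passed in as `task`, so a transient failure such as a rate-limit response is retried while a persistent failure still reaches the caller after the last attempt.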

status: published
published: 1 day ago

llm-judge-infra

@reaatech/llm-judge-infra

v0.1.0

Provides infrastructure utilities for LLM evaluation, including a `BatchProcessor` for concurrent execution, a `CostTracker` for budget enforcement, and a `MetricsCollector` for monitoring performance. It exports these as class-based tools and structured logging helpers that integrate with Pino.
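
Budget enforcement can be sketched as a running total that refuses any charge that would cross a fixed limit. The class below is a hypothetical stand-in written for this example; its name, methods, and error behavior are assumptions and do not describe the package's `CostTracker`.

```typescript
// Hypothetical illustration; not the @reaatech/llm-judge-infra API.
class SimpleBudgetTracker {
  private spentUsd = 0;

  constructor(private readonly budgetUsd: number) {
    if (budgetUsd < 0) throw new RangeError("budget must be non-negative");
  }

  // Record the cost of one call, or throw without recording it
  // if the call would push total spend over the budget.
  charge(costUsd: number): void {
    if (costUsd < 0) throw new RangeError("cost must be non-negative");
    if (this.spentUsd + costUsd > this.budgetUsd) {
      throw new Error(
        "budget exceeded: spent " + this.spentUsd + ", requested " + costUsd +
          ", limit " + this.budgetUsd,
      );
    }
    this.spentUsd += costUsd;
  }

  remaining(): number {
    return this.budgetUsd - this.spentUsd;
  }
}
```

Checking the budget before recording the charge means a rejected call leaves the running total unchanged, so a batch can stop cleanly at the limit.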

status: published
published: 1 day ago

llm-judge-providers

@reaatech/llm-judge-providers

v0.1.0

Provides a unified interface and factory for interacting with OpenAI, Anthropic, and local OpenAI-compatible LLM APIs. It includes built-in cost calculation and health checks, lazily loading the required SDKs only when a specific provider is instantiated.

status: published
published: 1 day ago

llm-judge-templates

@reaatech/llm-judge-templates

v0.1.0

Provides a set of TypeScript classes implementing a `JudgmentTemplate` interface to generate LLM evaluation prompts and parse their structured JSON responses. Each template includes built-in logic for cleaning markdown, handling malformed output, and normalizing scores for criteria like faithfulness, relevance, and safety.
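
The parsing steps described here (cleaning markdown, tolerating malformed output, normalizing scores) might look roughly like the sketch below. It is an illustration only: `ParsedJudgment`, `stripCodeFence`, and `parseJudgeResponse` are hypothetical names, and the `score` and `reasoning` fields and the 0 to 5 scale are assumptions, not the package's actual response format.

```typescript
// Hypothetical illustration; not the @reaatech/llm-judge-templates API.
interface ParsedJudgment {
  score: number; // normalized to [0, 1]
  reasoning: string;
}

// If the model wrapped its JSON in a markdown code fence, keep only the inside.
function stripCodeFence(raw: string): string {
  const fenced = raw.match(/`{3}(?:json)?\s*([\s\S]*?)`{3}/i);
  return (fenced ? (fenced[1] as string) : raw).trim();
}

// Returns null instead of throwing when the output cannot be used.
function parseJudgeResponse(raw: string, maxScore = 5): ParsedJudgment | null {
  const text = stripCodeFence(raw);
  // Ignore any chatter before or after the outermost JSON object.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let data: unknown;
  try {
    data = JSON.parse(text.slice(start, end + 1));
  } catch {
    return null; // malformed JSON
  }
  if (typeof data !== "object" || data === null) return null;
  const record = data as Record<string, unknown>;

  // Accept numbers and non-empty numeric strings such as "4".
  const field = record.score;
  const rawScore =
    typeof field === "number" ? field
    : typeof field === "string" && field.trim() !== "" ? Number(field)
    : NaN;
  if (!Number.isFinite(rawScore)) return null;

  return {
    score: Math.min(1, Math.max(0, rawScore / maxScore)), // clamp into [0, 1]
    reasoning: typeof record.reasoning === "string" ? record.reasoning : "",
  };
}
```

Returning `null` for unusable output keeps the decision about retrying or discarding the judgment with the caller.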

status: published
published: 1 day ago

llm-judge-types

@reaatech/llm-judge-types

v0.1.0

Provides a shared library of TypeScript interfaces, Zod schemas, and custom error classes for defining LLM judgment results, provider configurations, and evaluation metrics. It serves as the type-safe foundation for the LLM Judge Toolkit ecosystem and requires Zod as a runtime dependency.