
llm-judge-toolkit · packages

Every package shipped from reaatech/llm-judge-toolkit, published or pending.

10 packages

@reaatech/llm-judge-bias

v0.1.0
Identifies systematic position, length, and style biases in LLM evaluations using detector classes that analyze judgment consistency across varied inputs. It provides a `ComprehensiveBiasDetector` to orchestrate these checks and return structured reports, requiring an LLM judge interface to perform the underlying scoring.
status: published · 1 day ago
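The core consistency check can be sketched without the package's actual API: a position-biased judge changes its verdict when the same two candidates are shown in swapped order. The `PairedJudgment` shape and `positionFlipRate` helper below are illustrative assumptions, not the package's `ComprehensiveBiasDetector` interface.

```typescript
// Minimal position-bias sketch (types and names are assumptions, not the
// package's API). A consistent judge should pick the same winner regardless
// of which candidate it saw first.
interface PairedJudgment {
  winnerFirstOrder: "A" | "B";   // winner when A was shown first
  winnerSwappedOrder: "A" | "B"; // winner when B was shown first
}

// Fraction of pairs where swapping the presentation order flipped the verdict.
function positionFlipRate(pairs: PairedJudgment[]): number {
  if (pairs.length === 0) return 0;
  const flips = pairs.filter(
    (p) => p.winnerFirstOrder !== p.winnerSwappedOrder
  ).length;
  return flips / pairs.length;
}

const sample: PairedJudgment[] = [
  { winnerFirstOrder: "A", winnerSwappedOrder: "A" }, // consistent
  { winnerFirstOrder: "A", winnerSwappedOrder: "B" }, // order-sensitive
  { winnerFirstOrder: "B", winnerSwappedOrder: "B" }, // consistent
  { winnerFirstOrder: "A", winnerSwappedOrder: "B" }, // order-sensitive
];

console.log(positionFlipRate(sample)); // 0.5
```

A flip rate near 0 suggests order-robust judgments; a rate near 0.5 means the verdict is close to a coin flip on presentation order.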

@reaatech/llm-judge-cache

v0.1.0
Provides a `CacheManager` class to store and retrieve LLM judgment results using deterministic SHA-256 keys. It supports in-memory, file-system, and Redis backends, with the Redis implementation requiring an external `ioredis`-compatible client.
status: published · 1 day ago
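Deterministic SHA-256 keying can be sketched with Node's built-in `crypto` module. The `JudgmentRequest` fields below are assumptions, not the package's actual schema; the point is that a stable serialization makes the same logical request always hash to the same key.

```typescript
import { createHash } from "node:crypto";

// Hypothetical request shape (not the package's schema).
interface JudgmentRequest {
  model: string;
  prompt: string;
  response: string;
}

function cacheKey(req: JudgmentRequest): string {
  // Passing a sorted key list as the replacer fixes the property order,
  // so logically identical requests serialize identically.
  const stable = JSON.stringify(req, Object.keys(req).sort());
  return createHash("sha256").update(stable).digest("hex");
}

const a = cacheKey({ model: "gpt-4o", prompt: "Rate this.", response: "Hi" });
const b = cacheKey({ model: "gpt-4o", prompt: "Rate this.", response: "Hi" });
console.log(a === b); // true: identical inputs yield identical keys
```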

@reaatech/llm-judge-calibration

v0.1.0
Measures LLM judge accuracy against human-labeled datasets using a `CalibrationRunner` class for batch evaluation and a `CalibrationMetrics` utility for computing Cohen's kappa, F1 scores, and confusion matrices. It provides tools to detect performance drift over time and requires a custom `JudgmentEngine` implementation to execute the evaluations.
status: published · 1 day ago
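Cohen's kappa, one of the metrics the description names, corrects raw judge-vs-human agreement for agreement expected by chance. This dependency-free sketch for binary labels illustrates the computation; it is not the package's `CalibrationMetrics` code.

```typescript
// Cohen's kappa for binary (0/1) labels: (p_o - p_e) / (1 - p_e), where
// p_o is observed agreement and p_e is chance agreement from the marginals.
function cohenKappa(judge: number[], human: number[]): number {
  const n = judge.length;
  let agree = 0, judgePos = 0, humanPos = 0;
  for (let i = 0; i < n; i++) {
    if (judge[i] === human[i]) agree++;
    if (judge[i] === 1) judgePos++;
    if (human[i] === 1) humanPos++;
  }
  const po = agree / n; // observed agreement
  // chance agreement: both say 1, plus both say 0, from label frequencies
  const pe =
    (judgePos / n) * (humanPos / n) +
    ((n - judgePos) / n) * ((n - humanPos) / n);
  return (po - pe) / (1 - pe);
}

const judge = [1, 0, 1, 1, 0, 1];
const human = [1, 0, 1, 0, 0, 1];
console.log(cohenKappa(judge, human).toFixed(2)); // "0.67"
```

Kappa of 1 is perfect agreement, 0 is chance-level; values above roughly 0.6 are usually read as substantial agreement.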

@reaatech/llm-judge-cli

v0.1.0
Provides a CLI for batch-evaluating LLM responses and calibrating judgment criteria against human-labeled datasets using JSONL input. It supports multiple LLM providers and configurable concurrency, outputting scored results directly to stdout or a file.
status: published · 1 day ago
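The JSONL input the CLI consumes is one standalone JSON object per line, which makes large batches streamable. The record fields below are assumptions for illustration, not the CLI's documented schema.

```typescript
// Hypothetical JSONL records for batch evaluation (field names are an
// assumption; check the CLI's docs for the actual schema).
const records = [
  { id: "ex-1", prompt: "Summarize the article.", response: "..." },
  { id: "ex-2", prompt: "Translate to French.", response: "..." },
];

// Writing: one JSON.stringify'd object per line, newline-separated.
const jsonl = records.map((r) => JSON.stringify(r)).join("\n");
console.log(jsonl.split("\n").length); // 2 — one record per line

// Reading it back: parse each non-empty line independently.
const parsed = jsonl
  .split("\n")
  .filter((line) => line.trim() !== "")
  .map((line) => JSON.parse(line));
console.log(parsed[1].id); // "ex-2"
```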

@reaatech/llm-judge-consensus

v0.1.0
Aggregates multiple LLM evaluation scores into a single consensus result using strategies like majority voting, weighted voting, or cost-optimized tiebreaking. It provides a set of classes implementing a shared `execute` method that returns a normalized score and an agreement metric.
status: published · 1 day ago
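Majority voting with an agreement metric can be sketched as follows; the `ConsensusResult` shape is inferred from the description and the exact types are assumptions, not the package's `execute` signature.

```typescript
// Sketch of majority voting: the most frequent score wins, and agreement is
// the fraction of judges that voted for it.
interface ConsensusResult {
  score: number;     // the winning score
  agreement: number; // fraction of judges that chose it
}

function majorityVote(scores: number[]): ConsensusResult {
  const counts = new Map<number, number>();
  for (const s of scores) counts.set(s, (counts.get(s) ?? 0) + 1);
  let best = scores[0];
  let bestCount = 0;
  for (const [score, count] of counts) {
    if (count > bestCount) {
      best = score;
      bestCount = count;
    }
  }
  return { score: best, agreement: bestCount / scores.length };
}

console.log(majorityVote([1, 1, 0])); // score 1 wins with 2-of-3 agreement
```

A weighted variant would multiply each vote by a per-judge weight before counting; a cost-optimized tiebreaker would only invoke an extra (more expensive) judge when agreement falls below a threshold.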

@reaatech/llm-judge-engine

v0.1.0
Orchestrates LLM evaluation workflows by providing a `JudgmentEngine` class that handles retries, rate limiting, caching, and event-driven logging. It requires a provider implementation and a prompt template to execute structured judgments against LLM outputs.
status: published · 1 day ago
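The retry behavior such an engine wraps around provider calls can be sketched as a retry-with-exponential-backoff loop; the parameters and delays below are illustrative, not `JudgmentEngine`'s actual options.

```typescript
// Minimal retry-with-backoff sketch (delay values are illustrative).
async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // exponential backoff: 100ms, 200ms, 400ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError; // all attempts exhausted
}

// A flaky call that fails twice before succeeding.
let calls = 0;
const result = await withRetries(async () => {
  calls++;
  if (calls < 3) throw new Error("transient failure");
  return "judged";
}, 3, 1);
console.log(result, calls); // "judged" on the third attempt
```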

@reaatech/llm-judge-infra

v0.1.0
Provides infrastructure utilities for LLM evaluation, including a `BatchProcessor` for concurrent execution, a `CostTracker` for budget enforcement, and a `MetricsCollector` for monitoring performance. It exports these as class-based tools and structured logging helpers that integrate with Pino.
status: published · 1 day ago
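Concurrency-limited batch execution of the kind a `BatchProcessor` provides can be sketched with a shared cursor and a fixed pool of workers; the function name and `limit` parameter here are assumptions, not the package's API.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight at once.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because JS runs one task at a time
  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }
  // Start `limit` workers that each pull the next unclaimed item.
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, run));
  return results;
}

const doubled = await mapWithConcurrency([1, 2, 3, 4], 2, async (n) => n * 2);
console.log(doubled); // [2, 4, 6, 8]
```

A real batch processor would layer the description's other concerns on top: the worker can consult a cost tracker before each call and report latency to a metrics collector afterward.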

@reaatech/llm-judge-providers

v0.1.0
Provides a unified interface and factory for interacting with OpenAI, Anthropic, and local OpenAI-compatible LLM APIs. It includes built-in cost calculation and health checks, lazily loading the required SDKs only when a specific provider is instantiated.
status: published · 1 day ago
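The lazy-loading factory pattern can be sketched as a registry of loader functions that run at most once per provider. The `Provider` interface and loader bodies below are stand-ins; a real implementation would `await import(...)` the SDK inside the loader rather than at module top level.

```typescript
// Sketch of a lazy provider factory (interface and loaders are assumptions).
interface Provider {
  name: string;
  complete(prompt: string): Promise<string>;
}

type Loader = () => Promise<Provider>;

const loaders: Record<string, Loader> = {
  // A real loader would `await import("openai")` here, deferring the SDK
  // load until the provider is first requested.
  openai: async () => ({ name: "openai", complete: async (p) => `openai:${p}` }),
  anthropic: async () => ({ name: "anthropic", complete: async (p) => `anthropic:${p}` }),
};

const loaded = new Map<string, Promise<Provider>>();

function getProvider(name: string): Promise<Provider> {
  const loader = loaders[name];
  if (!loader) throw new Error(`unknown provider: ${name}`);
  if (!loaded.has(name)) loaded.set(name, loader()); // load at most once
  return loaded.get(name)!;
}

const p = await getProvider("openai");
console.log(p.name); // "openai"
```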

@reaatech/llm-judge-templates

v0.1.0
Provides a set of TypeScript classes implementing a `JudgmentTemplate` interface to generate LLM evaluation prompts and parse their structured JSON responses. Each template includes built-in logic for cleaning markdown, handling malformed output, and normalizing scores for criteria like faithfulness, relevance, and safety.
status: published · 1 day ago
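The markdown-cleanup-then-parse step can be sketched as follows; the regexes, field names, and score clamp are illustrative, not the package's template code.

```typescript
// Strip a markdown code fence the model may have wrapped around its JSON,
// parse it, and normalize the score into [0, 1].
function parseJudgment(raw: string): { score: number; reasoning: string } {
  const cleaned = raw
    .replace(/^\s*```(?:json)?\s*/i, "") // leading ```json fence, if any
    .replace(/\s*```\s*$/, "")           // trailing fence, if any
    .trim();
  const parsed = JSON.parse(cleaned);
  // Clamp so downstream code always sees a score on one scale.
  const score = Math.min(1, Math.max(0, Number(parsed.score)));
  return { score, reasoning: String(parsed.reasoning ?? "") };
}

// Simulated model output wrapped in a ```json fence.
const fence = "`".repeat(3);
const modelOutput =
  fence + 'json\n{"score": 0.9, "reasoning": "Faithful."}\n' + fence;
console.log(parseJudgment(modelOutput)); // { score: 0.9, reasoning: "Faithful." }
```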

@reaatech/llm-judge-types

v0.1.0
Provides a shared library of TypeScript interfaces, Zod schemas, and custom error classes for defining LLM judgment results, provider configurations, and evaluation metrics. It serves as the type-safe foundation for the LLM Judge Toolkit ecosystem and requires Zod as a runtime dependency.
status: published · 1 day ago
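The package ships Zod schemas; this dependency-free sketch shows the kind of runtime validation they provide, using a plain type guard instead of Zod. The `JudgmentResult` fields are assumptions, not the package's actual types.

```typescript
// Hypothetical result shape (the real package defines this with interfaces
// plus a matching Zod schema so it can be validated at runtime).
interface JudgmentResult {
  score: number; // normalized 0..1
  reasoning: string;
  model: string;
}

// Hand-rolled type guard standing in for schema.parse()/safeParse().
function isJudgmentResult(value: unknown): value is JudgmentResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.score === "number" &&
    v.score >= 0 &&
    v.score <= 1 &&
    typeof v.reasoning === "string" &&
    typeof v.model === "string"
  );
}

console.log(isJudgmentResult({ score: 0.8, reasoning: "ok", model: "gpt-4o" })); // true
console.log(isJudgmentResult({ score: 2 })); // false — score out of range, fields missing
```

Centralizing these definitions in one package is what lets the other nine packages exchange results without each re-declaring (and drifting on) the shapes.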