llm-judge-toolkit · packages
Every package shipped from reaatech/llm-judge-toolkit, published or pending.
10 packages
@reaatech/llm-judge-bias
Identifies systematic position, length, and style biases in LLM evaluations using detector classes that analyze judgment consistency across varied inputs. It provides a `ComprehensiveBiasDetector` to orchestrate these checks and return structured reports, requiring an LLM judge interface to perform the underlying scoring.
- status: published · 1 day ago
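A position-bias probe in the spirit of the detectors described above can be sketched as follows: judge the same pair in both orders and measure how often the verdict tracks position rather than content. The `PairJudge` signature and function name are illustrative assumptions, not the package's actual interface.

```typescript
// Assumed minimal judge interface: given two candidates, pick a winner.
type PairJudge = (a: string, b: string) => Promise<"a" | "b">;

// Fraction of pairs where the verdict flips with presentation order.
// 0 means order-invariant judgments; 1 means the winner is pure position.
async function positionBiasRate(
  judge: PairJudge,
  pairs: Array<[string, string]>,
): Promise<number> {
  let flips = 0;
  for (const [a, b] of pairs) {
    const forward = await judge(a, b);
    const reversed = await judge(b, a);
    // A consistent judge picks the same underlying answer in both orders.
    const consistent =
      (forward === "a" && reversed === "b") ||
      (forward === "b" && reversed === "a");
    if (!consistent) flips++;
  }
  return pairs.length === 0 ? 0 : flips / pairs.length;
}
```

A judge that always prefers the first candidate scores 1.0 on this probe; a judge that decides on content alone scores 0.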
@reaatech/llm-judge-cache
Provides a `CacheManager` class to store and retrieve LLM judgment results using deterministic SHA-256 keys. It supports in-memory, file-system, and Redis backends, with the Redis implementation requiring an external `ioredis`-compatible client.
- status: published · 1 day ago
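The deterministic SHA-256 keying described above can be sketched like this: canonicalize the request before hashing so semantically identical requests always map to the same key. The `JudgmentRequest` shape and `cacheKey` helper are assumptions for illustration, not the `CacheManager` API.

```typescript
import { createHash } from "node:crypto";

// Hypothetical request shape; the real package's fields may differ.
interface JudgmentRequest {
  model: string;
  prompt: string;
  criteria: string[];
}

// Sort the keys before serializing so property order in the source
// object cannot change the hash.
function cacheKey(req: JudgmentRequest): string {
  const canonical = JSON.stringify(req, Object.keys(req).sort());
  return createHash("sha256").update(canonical).digest("hex");
}
```

Because the key depends only on request content, any backend (memory, file system, Redis) can share the same keyspace.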
@reaatech/llm-judge-calibration
Measures LLM judge accuracy against human-labeled datasets using a `CalibrationRunner` class for batch evaluation and a `CalibrationMetrics` utility for computing Cohen's kappa, F1 scores, and confusion matrices. It provides tools to detect performance drift over time and requires a custom `JudgmentEngine` implementation to execute the evaluations.
- status: published · 1 day ago
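Cohen's kappa, the agreement statistic `CalibrationMetrics` is described as computing, corrects observed judge–human agreement for the agreement expected by chance. A standalone sketch (not the package's API):

```typescript
// Cohen's kappa over two parallel label arrays:
// kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
// p_e is the agreement expected from the marginal label frequencies.
function cohensKappa(judge: number[], human: number[]): number {
  const n = judge.length;
  if (n === 0 || n !== human.length) throw new Error("label arrays must match");
  const labels = Array.from(new Set([...judge, ...human]));
  // Observed agreement: fraction of items where judge and human agree.
  const po = judge.filter((j, i) => j === human[i]).length / n;
  // Expected agreement under independence.
  let pe = 0;
  for (const label of labels) {
    const pj = judge.filter((x) => x === label).length / n;
    const ph = human.filter((x) => x === label).length / n;
    pe += pj * ph;
  }
  return (po - pe) / (1 - pe);
}
```

Kappa is 1 for perfect agreement and 0 when the judge does no better than chance, which is why it is a stricter calibration signal than raw accuracy.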
@reaatech/llm-judge-cli
Provides a CLI for batch-evaluating LLM responses and calibrating judgment criteria against human-labeled datasets using JSONL input. It supports multiple LLM providers and configurable concurrency, outputting scored results directly to stdout or a file.
- status: published · 1 day ago
@reaatech/llm-judge-consensus
Aggregates multiple LLM evaluation scores into a single consensus result using strategies like majority voting, weighted voting, or cost-optimized tiebreaking. It provides a set of classes implementing a shared `execute` method that returns a normalized score and an agreement metric.
- status: published · 1 day ago
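The majority-voting strategy can be sketched with the result shape the description implies (a normalized score plus an agreement metric). The interface and field names below are illustrative assumptions, not the package's exports.

```typescript
// Assumed result shape: the winning verdict and how strongly the
// individual judges backed it.
interface ConsensusResult {
  score: number;      // the majority verdict
  agreement: number;  // fraction of judges that voted for it
}

function majorityVote(scores: number[]): ConsensusResult {
  if (scores.length === 0) throw new Error("no scores to aggregate");
  const counts = new Map<number, number>();
  for (const s of scores) counts.set(s, (counts.get(s) ?? 0) + 1);
  let winner = scores[0];
  let best = 0;
  for (const [score, count] of counts) {
    if (count > best) {
      best = count;
      winner = score;
    }
  }
  return { score: winner, agreement: best / scores.length };
}
```

Reporting agreement alongside the verdict lets callers route low-agreement items to a tiebreaker (for instance, a stronger and more expensive judge) rather than trusting a narrow majority.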
@reaatech/llm-judge-engine
Orchestrates LLM evaluation workflows by providing a `JudgmentEngine` class that handles retries, rate limiting, caching, and event-driven logging. It requires a provider implementation and a prompt template to execute structured judgments against LLM outputs.
- status: published · 1 day ago
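The retry handling a `JudgmentEngine` wraps around provider calls can be sketched as exponential backoff; the helper name, attempt count, and delays below are assumptions, not the engine's configuration surface.

```typescript
// Retry an async operation with exponential backoff between attempts.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Backoff doubles each attempt: 100ms, 200ms, 400ms, ...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```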
@reaatech/llm-judge-infra
Provides infrastructure utilities for LLM evaluation, including a `BatchProcessor` for concurrent execution, a `CostTracker` for budget enforcement, and a `MetricsCollector` for monitoring performance. It exports these as class-based tools and structured logging helpers that integrate with Pino.
- status: published · 1 day ago
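The concurrent execution a `BatchProcessor` provides can be sketched as a fixed pool of worker lanes draining a shared queue; the function name and options are assumptions, not the class's real API.

```typescript
// Run `worker` over `items` with at most `concurrency` calls in flight,
// preserving input order in the results.
async function processBatch<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency = 4,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each lane claims the next unprocessed index until the queue drains.
  // The claim (next++) happens synchronously after the check, so lanes
  // never process the same item twice.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }
  const lanes = Array.from(
    { length: Math.min(concurrency, items.length) },
    () => lane(),
  );
  await Promise.all(lanes);
  return results;
}
```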
@reaatech/llm-judge-providers
Provides a unified interface and factory for interacting with OpenAI, Anthropic, and local OpenAI-compatible LLM APIs. It includes built-in cost calculation and health checks, lazily loading the required SDKs only when a specific provider is instantiated.
- status: published · 1 day ago
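The lazy-loading factory pattern described above can be sketched as follows: each provider registers a loader (in practice, an SDK import) that runs only on first instantiation. The provider names match the description; everything else here is illustrative.

```typescript
// Assumed minimal provider interface.
interface Provider {
  name: string;
  complete(prompt: string): Promise<string>;
}

type Loader = () => Provider;

class ProviderFactory {
  private loaders = new Map<string, Loader>();
  private cache = new Map<string, Provider>();

  register(name: string, loader: Loader): void {
    this.loaders.set(name, loader);
  }

  // The loader runs only the first time a provider is requested;
  // later calls return the cached instance.
  get(name: string): Provider {
    let provider = this.cache.get(name);
    if (!provider) {
      const loader = this.loaders.get(name);
      if (!loader) throw new Error(`unknown provider: ${name}`);
      provider = loader();
      this.cache.set(name, provider);
    }
    return provider;
  }
}
```

Deferring the SDK load means an application that only ever talks to Anthropic never pays the cost of pulling in the OpenAI client, and vice versa.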
@reaatech/llm-judge-templates
Provides a set of TypeScript classes implementing a `JudgmentTemplate` interface to generate LLM evaluation prompts and parse their structured JSON responses. Each template includes built-in logic for cleaning markdown, handling malformed output, and normalizing scores for criteria like faithfulness, relevance, and safety.
- status: published · 1 day ago
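The "clean markdown, then parse" step the templates are described as performing can be sketched like this: strip the code fence an LLM often wraps around its JSON before parsing. The function name is an assumption.

```typescript
// LLMs frequently return `{"score": 4}` wrapped in a ```json fence.
// Strip a leading fence (with or without a language tag) and a trailing
// fence before handing the remainder to JSON.parse.
function parseJudgmentJson(raw: string): unknown {
  const cleaned = raw
    .trim()
    .replace(/^```[a-zA-Z]*\s*/, "")
    .replace(/```$/, "")
    .trim();
  return JSON.parse(cleaned);
}
```

Plain JSON passes through unchanged, so the same parser handles both well-behaved and fence-wrapped model output.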
@reaatech/llm-judge-types
Provides a shared library of TypeScript interfaces, Zod schemas, and custom error classes for defining LLM judgment results, provider configurations, and evaluation metrics. It serves as the type-safe foundation for the LLM Judge Toolkit ecosystem and requires Zod as a runtime dependency.
- status: published · 1 day ago
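The kind of shared result shape this package defines can be sketched as a plain interface with a runtime guard. In the real package these are Zod schemas with inferred types; the field names below are assumptions for illustration only.

```typescript
// Hypothetical judgment-result shape shared across the toolkit.
interface JudgmentResult {
  score: number;      // normalized to [0, 1]
  reasoning: string;  // the judge's explanation
  criterion: string;  // which rubric dimension was scored
}

// Runtime guard standing in for what a Zod schema's `safeParse` provides:
// untrusted LLM output is validated before the rest of the pipeline sees it.
function isJudgmentResult(value: unknown): value is JudgmentResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.score === "number" && v.score >= 0 && v.score <= 1 &&
    typeof v.reasoning === "string" &&
    typeof v.criterion === "string"
  );
}
```

Centralizing these shapes in one package is what lets the engine, templates, and consensus packages interoperate without redeclaring types.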
