reaatech/rag-eval-pack
These packages provide a modular toolkit for evaluating RAG systems using heuristic scorers, LLM-as-judge, and automated quality gates. They help teams measure retrieval and generation performance while enforcing cost budgets and CI/CD regression thresholds. The system is built as a composable suite where an orchestration engine coordinates data loading, metric calculation, and observability across independent, type-safe packages.
Packages (10)
@reaatech/rag-eval-cli
Provides a CLI for executing, gating, and comparing RAG evaluation suites, while also acting as a barrel package that re-exports the entire `@reaatech/rag-eval-*` library for programmatic use.
- Status: awaiting publish
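Because the CLI doubles as a barrel package, programmatic use is a plain import. The sketch below is hypothetical: `EvaluationSuite` and `GateEngine` are the classes named elsewhere on this page, but the option shapes and method names are assumptions, not the published API.

```typescript
// Hypothetical sketch: import the re-exported library through the CLI barrel.
// The constructor options and run()/evaluate() calls are assumptions.
import { EvaluationSuite, GateEngine } from "@reaatech/rag-eval-cli";

const suite = new EvaluationSuite({ dataset: "datasets/smoke.jsonl" }); // assumed options
const results = await suite.run();                                      // assumed method

const gate = new GateEngine({ thresholds: { faithfulness: 0.8 } });     // assumed options
process.exitCode = gate.evaluate(results).passed ? 0 : 1;               // assumed method
```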
@reaatech/rag-eval-core
Provides TypeScript types and Zod schemas for defining RAG evaluation suites, including configurations for judges, cost tracking, and quality gates. It serves as a shared schema library for the `@reaatech/rag-eval-*` ecosystem, requiring only `zod` as a runtime dependency.
- Status: awaiting publish
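Since the package ships Zod schemas, suite definitions can be validated at load time. The schema and field names below are hypothetical stand-ins; only the general pattern (define with `zod`, then `parse`) is taken from the description.

```typescript
import { z } from "zod";

// Hypothetical stand-in for the kind of suite schema this package exports;
// the real schema and field names are not documented on this page.
const SuiteConfigSchema = z.object({
  name: z.string(),
  judge: z.object({ model: z.string(), temperature: z.number().min(0).max(2) }),
  cost: z.object({ maxUsd: z.number().positive() }).optional(),
  gates: z.record(z.string(), z.number()), // metric name -> minimum score
});

type SuiteConfig = z.infer<typeof SuiteConfigSchema>;

// Throws a ZodError with per-field details if the config is malformed.
const config: SuiteConfig = SuiteConfigSchema.parse(
  JSON.parse(process.argv[2] ?? "{}"),
);
```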
@reaatech/rag-eval-cost
Tracks token consumption and enforces budget limits for RAG evaluations using a set of classes for cost accounting, model pricing lookups, and report generation. It provides utilities to record per-sample costs and export results in JSON or JUnit XML formats for CI integration.
- Status: awaiting publish
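A sketch of per-sample cost accounting under a budget cap. The `CostTracker` class and its methods are assumptions; only the responsibilities (record token usage, enforce a budget, export a CI report) come from the description.

```typescript
// Hypothetical usage sketch; class, option, and method names are assumptions.
import { CostTracker } from "@reaatech/rag-eval-cost";

const tracker = new CostTracker({ budgetUsd: 5.0 });  // assumed option
tracker.record({                                      // assumed method
  sampleId: "sample-001",
  model: "gpt-4o-mini",
  inputTokens: 1200,
  outputTokens: 350,
});

if (tracker.totalUsd() > 5.0) {                       // assumed method
  throw new Error("Evaluation run exceeded its cost budget");
}

// JSON and JUnit XML are the export formats named above.
await tracker.export({ format: "junit", path: "reports/cost.xml" }); // assumed signature
```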
@reaatech/rag-eval-dataset
Manages RAG evaluation datasets by providing classes to load, validate, and version-track samples from JSON, JSONL, and YAML files. It relies on Zod for schema enforcement and integrates with `@reaatech/rag-eval-core` for sample definitions.
- Status: awaiting publish
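A minimal sketch of loading and iterating a dataset. The `DatasetLoader` class, its `load` method, and the sample field names are assumptions; the supported file formats are the ones listed in the description.

```typescript
// Hypothetical sketch; the loader class and method names are assumptions.
import { DatasetLoader } from "@reaatech/rag-eval-dataset";

const loader = new DatasetLoader();
// JSON, JSONL, and YAML are the formats named in the description.
const dataset = await loader.load("datasets/support-queries.jsonl"); // assumed method

for (const sample of dataset.samples) {
  // Samples are expected to follow the shapes defined in @reaatech/rag-eval-core.
  console.log(sample.question, sample.expectedAnswer);               // assumed fields
}
```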
@reaatech/rag-eval-gate
Enforces quality standards on RAG evaluation metrics using a `GateEngine` class that validates results against fixed thresholds or historical baselines. It provides CI-friendly output and configurable exit codes, typically paired with evaluation data structures from `@reaatech/rag-eval-core`.
- Status: awaiting publish
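The sketch below shows the kind of threshold-or-baseline check the `GateEngine` is described as performing. The constructor options, `evaluate` call, and verdict shape are assumptions; the CI-style exit code is the behavior named above.

```typescript
// Hypothetical sketch around the GateEngine named above; option and method
// names are assumptions, the threshold/baseline split comes from the description.
import { GateEngine } from "@reaatech/rag-eval-gate";

const currentMetrics = { faithfulness: 0.9, contextRecall: 0.65 };

const gate = new GateEngine({
  thresholds: { faithfulness: 0.85, contextRecall: 0.7 }, // fixed minimums
  baseline: "reports/baseline.json",                      // or compare against a prior run
});

const verdict = gate.evaluate(currentMetrics);            // assumed method
if (!verdict.passed) {
  console.error(verdict.failures);
  process.exit(1);                                        // CI-friendly exit code
}
```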
@reaatech/rag-eval-judge
Evaluates RAG pipeline outputs using LLM-as-a-judge with support for multi-model consensus, provider fallbacks, and human-label calibration. It provides a `JudgeEngine` class that executes pre-defined prompt templates for metrics like faithfulness and relevance, returning structured scores and reasoning.
- Status: awaiting publish
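A sketch of a single judge call. The `JudgeEngine` class is named above; its constructor options, `judge` method, and input fields are assumptions, while consensus, fallbacks, and the faithfulness/relevance metrics come from the description.

```typescript
// Hypothetical sketch; constructor options and the judge() signature are assumptions.
import { JudgeEngine } from "@reaatech/rag-eval-judge";

const judge = new JudgeEngine({
  models: ["gpt-4o", "claude-sonnet-4"], // multi-model consensus
  fallback: "gpt-4o-mini",               // provider fallback
});

const result = await judge.judge({       // assumed method
  metric: "faithfulness",                // faithfulness and relevance are named above
  question: "What is the refund window?",
  answer: "Refunds are accepted within 30 days.",
  contexts: ["Our policy allows refunds within 30 days of purchase."],
});

console.log(result.score, result.reasoning); // structured score plus reasoning
```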
@reaatech/rag-eval-mcp-server
Exposes RAG evaluation tools—including atomic judges, test suites, and regression gates—as an MCP server for integration with clients like Claude Desktop or Cursor. It provides a set of tool handler functions and server initialization utilities that rely on the `@modelcontextprotocol/sdk` to execute evaluation tasks via stdio.
- Status: awaiting publish
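The wiring below follows the standard stdio pattern from `@modelcontextprotocol/sdk` (the `Server`, `StdioServerTransport`, and request-handler calls are the SDK's documented API); the tool name, its input schema, and the handler body are hypothetical stand-ins for the evaluation tools this package exposes.

```typescript
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  { name: "rag-eval", version: "0.1.0" },
  { capabilities: { tools: {} } },
);

// Advertise one hypothetical evaluation tool.
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: "judge_faithfulness", // assumed tool name
      description: "Score an answer against its retrieved contexts",
      inputSchema: {
        type: "object",
        properties: {
          answer: { type: "string" },
          contexts: { type: "array", items: { type: "string" } },
        },
        required: ["answer", "contexts"],
      },
    },
  ],
}));

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  // A real handler would dispatch on request.params.name and delegate to the
  // judge/metrics/gate packages; this stub just echoes a placeholder score.
  return {
    content: [{ type: "text", text: JSON.stringify({ tool: request.params.name, score: 0.9 }) }],
  };
});

await server.connect(new StdioServerTransport());
```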
@reaatech/rag-eval-metrics
Calculates heuristic-based RAG evaluation metrics including faithfulness, relevance, context precision, and context recall without requiring LLM API calls. It provides individual scorer classes and a `MetricsEngine` orchestrator for executing these evaluations in parallel.
- Status: awaiting publish
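A sketch of running the heuristic scorers through the `MetricsEngine` orchestrator named above. The metric list matches the description; the constructor options, `evaluate` method, and input field names are assumptions.

```typescript
// Hypothetical sketch; option and method names are assumptions.
import { MetricsEngine } from "@reaatech/rag-eval-metrics";

const engine = new MetricsEngine({
  metrics: ["faithfulness", "relevance", "contextPrecision", "contextRecall"],
});

// Heuristic scorers run locally, so no LLM API calls or keys are needed.
const scores = await engine.evaluate({   // assumed method
  question: "What is the refund window?",
  answer: "Refunds are accepted within 30 days.",
  contexts: ["Our policy allows refunds within 30 days of purchase."],
  groundTruth: "30 days",
});

console.log(scores.contextRecall);
```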
@reaatech/rag-eval-observability
Provides structured logging via Pino and OpenTelemetry instrumentation for tracing and metrics specific to RAG evaluation workflows. It exports a set of wrapper functions for tracing evaluation runs, judge calls, and metric calculations, alongside a factory function for pre-configured loggers.
- Status: awaiting publish
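A sketch of the logger factory plus a tracing wrapper around an evaluation run. The `createLogger` and `traceEvaluationRun` names are assumptions based on the description; the two-argument `logger.info` call follows Pino's standard API.

```typescript
// Hypothetical sketch; the factory and wrapper names are assumptions.
import { createLogger, traceEvaluationRun } from "@reaatech/rag-eval-observability";

const logger = createLogger({ name: "rag-eval", level: "info" }); // assumed options

const report = await traceEvaluationRun("nightly-regression", async () => {
  logger.info("starting evaluation run");
  // ... run the suite here; spans for judge calls and metric
  // calculations would be nested under this run span ...
  return { samples: 120, passed: true };
});

logger.info({ report }, "evaluation run finished"); // Pino merge-object + message
```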
@reaatech/rag-eval-suite
Orchestrates RAG pipeline evaluations by combining metric computation, LLM-based judging, cost tracking, and quality gate enforcement. It provides an `EvaluationSuite` class that executes these tasks against datasets to generate aggregated performance reports and regression analysis.
- Status: awaiting publish
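An end-to-end sketch tying the pieces together through the `EvaluationSuite` class named above. The constructor options and `run` method are assumptions; the combination of metrics, judging, cost tracking, and gates mirrors the description.

```typescript
// Hypothetical sketch; constructor options and method names are assumptions.
import { EvaluationSuite } from "@reaatech/rag-eval-suite";

const suite = new EvaluationSuite({
  dataset: "datasets/support-queries.jsonl",             // loaded via rag-eval-dataset
  metrics: ["faithfulness", "contextRecall"],            // heuristic scorers
  judge: { models: ["gpt-4o"], metrics: ["relevance"] }, // LLM-as-judge
  cost: { budgetUsd: 5.0 },                              // cost tracking
  gates: { faithfulness: 0.85 },                         // quality gate thresholds
});

const report = await suite.run();                        // assumed method
console.log(report.aggregate, report.regressions);       // aggregated scores + regression analysis
```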
