These packages give you a full RAG evaluation pipeline: heuristic scorers for faithfulness, relevance, context precision, and context recall; an LLM-as-judge with multi-provider support; cost tracking with budget enforcement; and CI quality gates that can fail a build. You'd adopt them to catch regressions in a RAG system before deployment, whether as a pre-commit smoke check or a nightly regression suite. The distinctive design is that every metric can run at three fidelity levels (free lexical scoring, embedding-based semantic scoring, or LLM judging), so you can trade cost for accuracy per use case without changing the evaluation interface.
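To make the "three fidelity levels, one interface" idea concrete, here is a minimal sketch. Every name in it (`Fidelity`, `FaithfulnessScorer`, `lexicalOverlap`, `pickScorer`) is hypothetical and chosen for illustration; it is not the packages' actual API, and the token-overlap heuristic is only one plausible form of free lexical scoring.

```typescript
// Illustrative sketch only: all names are hypothetical, not the packages' API.
type Fidelity = "lexical" | "semantic" | "judge";

interface ScoreResult {
  score: number; // expected range 0..1
  fidelity: Fidelity;
  costUsd: number; // 0 for the free lexical tier
}

// One signature for every tier, so callers never branch on fidelity.
type FaithfulnessScorer = (answer: string, context: string) => Promise<ScoreResult>;

const tokenize = (text: string): string[] =>
  text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Free tier (assumed heuristic): fraction of answer tokens found in the context.
function lexicalOverlap(answer: string, context: string): number {
  const contextTokens = new Set(tokenize(context));
  const answerTokens = tokenize(answer);
  if (answerTokens.length === 0) return 0;
  const hits = answerTokens.filter((t) => contextTokens.has(t)).length;
  return hits / answerTokens.length;
}

// The paid tiers are injected by the caller, so this sketch makes no network calls.
function pickScorer(
  fidelity: Fidelity,
  paid: { semantic: FaithfulnessScorer; judge: FaithfulnessScorer },
): FaithfulnessScorer {
  if (fidelity === "lexical") {
    return async (answer, context) => ({
      score: lexicalOverlap(answer, context),
      fidelity: "lexical",
      costUsd: 0,
    });
  }
  return paid[fidelity];
}
```

Because all three tiers share one function type, switching a pre-commit check from lexical scoring to LLM judging would only change the `fidelity` argument, not the calling code.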
A CLI, exposed as the `rag-eval-pack` command, that runs RAG evaluation suites, checks quality gates, compares runs, breaks down costs, generates markdown reports, performs LLM-based judging, and starts an MCP server. It also re-exports the full programmatic API of all `@reaatech/rag-eval-*` packages as a single importable library.
Canonical TypeScript types and Zod schemas for RAG evaluation data shapes. Exports 18+ types (`EvaluationSample`, `EvalSuiteConfig`, `SampleEvalResult`, `GateConfig`, `JudgeConfig`, etc.) and two Zod schemas (`EvaluationSampleSchema`, `EvalSuiteConfigSchema`) for runtime validation, with zero runtime dependencies beyond `zod`.
Cost tracking, pricing, budgeting, and reporting infrastructure for RAG evaluations. It provides `CostTracker`, `Pricing`, `BudgetManager`, and `CostReporter` classes that track per-sample token consumption, enforce budget limits with configurable alert thresholds, and generate cost reports in JSON and JUnit XML formats.
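The budget behavior described above (per-call token costs, a hard limit, and alert thresholds) can be sketched as follows. This is a sketch under stated assumptions: `BudgetSketch`, `ModelPrice`, and the per-million-token pricing shape are hypothetical names invented for illustration, not the real `BudgetManager` or `Pricing` API.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not the package API.
interface ModelPrice {
  inputPerMTok: number; // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
}

class BudgetSketch {
  private spentUsd = 0;
  private readonly fired = new Set<number>();
  private readonly limitUsd: number;
  private readonly alertFractions: number[];
  private readonly onAlert: (fraction: number, spentUsd: number) => void;

  constructor(
    limitUsd: number,
    alertFractions: number[], // e.g. [0.5, 0.8] of the limit
    onAlert: (fraction: number, spentUsd: number) => void,
  ) {
    this.limitUsd = limitUsd;
    this.alertFractions = alertFractions;
    this.onAlert = onAlert;
  }

  // Records one call; throws instead of recording if it would exceed the limit.
  record(price: ModelPrice, inputTokens: number, outputTokens: number): number {
    const cost =
      (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1_000_000;
    if (this.spentUsd + cost > this.limitUsd) {
      throw new Error(`Budget of $${this.limitUsd} would be exceeded`);
    }
    this.spentUsd += cost;
    // Fire each alert threshold at most once.
    for (const fraction of this.alertFractions) {
      if (!this.fired.has(fraction) && this.spentUsd >= fraction * this.limitUsd) {
        this.fired.add(fraction);
        this.onAlert(fraction, this.spentUsd);
      }
    }
    return cost;
  }

  get totalUsd(): number {
    return this.spentUsd;
  }
}
```

Rejecting the call before recording it is one possible design choice; it keeps the running total at or below the limit, so a caller can catch the error and fall back to a cheaper fidelity level.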
A Zod-validated dataset loader and validator for RAG evaluation samples, supporting JSONL, JSON, and YAML formats with duplicate detection, synthetic generation from templates, and version tracking. Exports `DatasetLoader`, `DatasetValidator`, and `loadEvalConfig`.
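As an illustration of JSONL loading with validation and duplicate detection, here is a dependency-free sketch. The sample fields (`query`, `answer`, `contexts`), the `loadJsonl` function, and the choice of the normalized query as the duplicate key are all assumptions made for the example; the real loader validates with Zod schemas, for which the hand-written type guard below is only a stand-in.

```typescript
// Hypothetical sample shape; the real `EvaluationSample` fields may differ.
interface Sample {
  query: string;
  answer: string;
  contexts: string[];
}

interface LoadReport {
  samples: Sample[];
  invalidLines: number[]; // 1-based line numbers that failed parsing or validation
  duplicateLines: number[]; // 1-based line numbers repeating an earlier query
}

// Stand-in for schema validation (the package itself uses Zod).
function isSample(value: unknown): value is Sample {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const contexts = v["contexts"];
  return (
    typeof v["query"] === "string" &&
    typeof v["answer"] === "string" &&
    Array.isArray(contexts) &&
    contexts.every((c) => typeof c === "string")
  );
}

function loadJsonl(text: string): LoadReport {
  const report: LoadReport = { samples: [], invalidLines: [], duplicateLines: [] };
  const seen = new Set<string>();
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // skip blank lines
    const lineNumber = index + 1;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      report.invalidLines.push(lineNumber);
      return;
    }
    if (!isSample(parsed)) {
      report.invalidLines.push(lineNumber);
      return;
    }
    // Assumed duplicate key: the query, trimmed and lowercased.
    const key = parsed.query.trim().toLowerCase();
    if (seen.has(key)) {
      report.duplicateLines.push(lineNumber);
      return;
    }
    seen.add(key);
    report.samples.push(parsed);
  });
  return report;
}
```

Reporting line numbers rather than throwing on the first bad line lets a dataset author fix every problem in one pass.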
A quality gate engine for RAG evaluation pipelines that enforces threshold-based metric checks and baseline regression detection, returning a `GateResult` object with pass/fail status and per-gate failure messages. It pairs with `@reaatech/rag-eval-core` for evaluation result types and is designed for CI/CD integration with formatted output and configurable exit codes.
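The two checks named above, absolute thresholds and baseline regression detection, can be sketched in a few lines. The names `GateSpec`, `GateOutcome`, and `evaluateGates` are hypothetical and deliberately differ from the package's `GateConfig` and `GateResult`, whose exact fields are not shown here.

```typescript
// Hypothetical shapes, named differently from the package's own types on purpose.
interface GateSpec {
  metric: string;
  min?: number; // absolute threshold: fail if the score is below this
  maxRegression?: number; // fail if the score drops more than this vs. the baseline
}

interface GateOutcome {
  passed: boolean;
  failures: string[]; // one message per failed check
}

function evaluateGates(
  gates: GateSpec[],
  current: Record<string, number>,
  baseline?: Record<string, number>,
): GateOutcome {
  const failures: string[] = [];
  for (const gate of gates) {
    const value: number | undefined = current[gate.metric];
    if (value === undefined) {
      failures.push(`${gate.metric}: metric missing from results`);
      continue;
    }
    if (gate.min !== undefined && value < gate.min) {
      failures.push(`${gate.metric}: ${value} is below threshold ${gate.min}`);
    }
    const previous: number | undefined = baseline?.[gate.metric];
    if (
      gate.maxRegression !== undefined &&
      previous !== undefined &&
      previous - value > gate.maxRegression
    ) {
      failures.push(`${gate.metric}: dropped from ${previous} to ${value}`);
    }
  }
  return { passed: failures.length === 0, failures };
}
```

In a CI job, the caller would typically print `failures` and map `passed` to an exit code (for example 0 on pass and 1 on fail), which is what lets a gate fail a build.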
A TypeScript class (`JudgeEngine`) that uses an LLM (Anthropic, OpenAI, or Google) to score RAG outputs on metrics like faithfulness and relevance, with optional consensus voting across multiple models and calibration against human labels.
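How consensus voting combines scores is not specified above, so the following is only one plausible strategy: take the median of the per-model scores and flag disagreement when the spread between the highest and lowest score is too wide. `JudgeVote`, `Consensus`, `combineVotes`, and the default `maxSpread` of 0.2 are all hypothetical.

```typescript
// One possible consensus strategy (median plus spread check); an assumption,
// not necessarily what `JudgeEngine` does.
interface JudgeVote {
  model: string; // placeholder identifier for the judging model
  score: number; // 0..1
}

interface Consensus {
  score: number; // median of the votes
  spread: number; // highest score minus lowest score
  agreed: boolean; // true when the spread is within tolerance
}

function combineVotes(votes: JudgeVote[], maxSpread = 0.2): Consensus {
  if (votes.length === 0) {
    throw new Error("combineVotes needs at least one vote");
  }
  const sorted = votes.map((v) => v.score).sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // The median resists a single outlier model better than the mean does.
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const spread = sorted[sorted.length - 1] - sorted[0];
  return { score: median, spread, agreed: spread <= maxSpread };
}
```

A caller could treat `agreed: false` as a signal to route the sample to human review, which is also where calibration labels would come from.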
An MCP server that exposes RAG evaluation tools as a three-layer API of atomic judge operations, orchestrated suite runs, and CI-style regression gates, providing `createMcpServer()` and `startMcpServer()` functions for integration with MCP clients like Claude Desktop or Cursor.
Provides four heuristic metric scorers (faithfulness, relevance, context precision, context recall) for evaluating RAG outputs, plus a `MetricsEngine` orchestrator that runs them in parallel with configurable concurrency. Each scorer is a class with a `score` method that returns a numeric score and supporting details, using only NLP libraries (`compromise`, `natural`) with no LLM calls.
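Running scorers "in parallel with configurable concurrency" usually means capping the number of in-flight tasks. The helper below is a generic sketch of that pattern; `mapWithConcurrency` is a hypothetical name, and nothing here is taken from the `MetricsEngine` implementation.

```typescript
// Generic concurrency-limited map; a sketch of the pattern, not the package's code.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane claims items one at a time until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Never start more lanes than there are items, and always at least one.
  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

Results are written by index, so the output order matches the input order even when tasks finish out of order. A usage sketch, with a hypothetical `scoreSample` function: `await mapWithConcurrency(samples, 4, (s) => scoreSample(s))`.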
Provides structured JSON logging via Pino, OpenTelemetry tracing, and OpenTelemetry metrics specifically for RAG evaluation pipelines, exporting functions like `createLogger`, `traceEvalRun`, and `recordEvalRun`.
A class (`EvaluationSuite`) that orchestrates RAG evaluation runs by executing heuristic metrics, optional LLM judge scoring, cost tracking, and quality gates against a dataset, returning a `SuiteRunResult` with aggregated metrics and gate pass/fail status.
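The "aggregated metrics" part of a suite run can be illustrated with a per-metric mean over sample results. The shapes `SampleScores` and `SuiteSummary` and the function `summarize` are hypothetical; the real `SuiteRunResult` may aggregate differently or carry more fields.

```typescript
// Hypothetical shapes; the real `SuiteRunResult` may differ.
interface SampleScores {
  sampleId: string;
  scores: Record<string, number>; // metric name -> score for this sample
}

interface SuiteSummary {
  sampleCount: number;
  means: Record<string, number>; // metric name -> mean over samples that have it
}

function summarize(results: SampleScores[]): SuiteSummary {
  const sums: Record<string, number> = {};
  const counts: Record<string, number> = {};
  for (const result of results) {
    for (const [metric, score] of Object.entries(result.scores)) {
      sums[metric] = (sums[metric] ?? 0) + score;
      counts[metric] = (counts[metric] ?? 0) + 1;
    }
  }
  const means: Record<string, number> = {};
  for (const metric of Object.keys(sums)) {
    // Divide by the number of samples that reported this metric, not by all samples.
    means[metric] = sums[metric] / counts[metric];
  }
  return { sampleCount: results.length, means };
}
```

The resulting `means` object is the kind of input a quality gate would then compare against thresholds or a stored baseline.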