@reaatech/rag-eval-cli

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

CLI entry point and commands for the RAG evaluation toolkit. Also serves as the master barrel package, re-exporting the full API surface of all @reaatech/rag-eval-* packages for programmatic consumers.

Installation

terminal
npm install @reaatech/rag-eval-cli
# or
pnpm add @reaatech/rag-eval-cli
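
The package installs the rag-eval-pack binary used throughout this README. If you only need the CLI, a global install or a one-off npx run also works; this is standard npm mechanics rather than anything specific to this package:

terminal
npm install -g @reaatech/rag-eval-cli
# or, without installing:
npx -p @reaatech/rag-eval-cli rag-eval-pack evaluate --dataset dataset.jsonl --output results.json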

Feature Overview

  • Seven CLI commands — evaluate, gate, compare, cost, report, judge, and mcp-server
  • Multi-format output — write results as JSON, Markdown, or both simultaneously
  • Config-driven evaluation — load suite configuration from YAML or JSON files
  • Master barrel export — re-exports all types, scorers, judges, trackers, gates, and tools from sibling packages
  • Dual ESM/CJS — works as both a CLI tool and an importable library

Quick Start

CLI Usage

terminal
# Run evaluation suite on a dataset
rag-eval-pack evaluate --dataset dataset.jsonl --output results.json
 
# Run quality gates against results
rag-eval-pack gate --results results.json --gates gates.yaml
 
# Compare two evaluation runs
rag-eval-pack compare --baseline baseline.json --candidate candidate.json
 
# View cost breakdown
rag-eval-pack cost --results results.json
 
# Generate markdown report
rag-eval-pack report --results results.json --output report.md
 
# Run LLM judge on a dataset
rag-eval-pack judge --dataset dataset.jsonl --metric faithfulness
 
# Start MCP server
rag-eval-pack mcp-server

Programmatic Usage

typescript
import {
  EvaluationSuite,
  FaithfulnessScorer,
  GateEngine,
  JudgeEngine,
} from "@reaatech/rag-eval-cli";
 
// The CLI package re-exports everything from all @reaatech/rag-eval-* packages
// Use it as a single dependency if you need the full toolkit
 
const suite = new EvaluationSuite({
  metrics: ["faithfulness", "relevance"],
});
const result = await suite.runFromFile("dataset.jsonl");
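
One way to mix both modes is to run a suite in code and hand the results to the CLI for gating. The sketch below assumes the in-memory result object serializes to the same JSON shape that gate --results expects; that mapping is an assumption, not verified here:

typescript
import { writeFile } from "node:fs/promises";
import { EvaluationSuite } from "@reaatech/rag-eval-cli";

// Run the suite programmatically, then persist the results so
// `rag-eval-pack gate --results results/results.json` can consume
// them in CI.
// Assumption: JSON.stringify(result) matches the results format
// the gate command reads.
const suite = new EvaluationSuite({
  metrics: ["faithfulness", "relevance"],
});
const result = await suite.runFromFile("dataset.jsonl");
await writeFile("results/results.json", JSON.stringify(result, null, 2));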

Commands

evaluate

Run the evaluation suite on a dataset.

terminal
rag-eval-pack evaluate \
  --dataset datasets/samples.jsonl \
  --config eval-config.yaml \
  --output results/results.json \
  --format json,markdown \
  --no-judge

Option       Type      Default       Description
--dataset    string    (required)    Path to evaluation dataset (JSONL, JSON, YAML)
--config     string                  Path to evaluation config (YAML or JSON)
--output     string                  Output file path for results
--format     string    json          Output formats: json, markdown, or json,markdown
--no-judge   boolean   false         Skip LLM judge evaluation
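
The dataset schema is defined by the toolkit's dataset layer and is not documented here. Purely as an illustration, a JSONL dataset with one sample per line could look like the following; the field names (question, answer, contexts, ground_truth) are assumptions, not the confirmed schema:

jsonl
{"question": "What is the refund window?", "answer": "Refunds are accepted within 30 days.", "contexts": ["Our policy allows refunds within 30 days of purchase."], "ground_truth": "30 days"}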

gate

Run CI gates against evaluation results.

terminal
rag-eval-pack gate \
  --results results/results.json \
  --gates gates.yaml \
  --baseline results/baseline.json

Option       Type     Default       Description
--results    string   (required)    Path to evaluation results JSON
--gates      string   (required)    Path to gate config (YAML or JSON)
--baseline   string                 Path to baseline results for comparison gates
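
gate is intended for CI pipelines. The GitHub Actions fragment below is illustrative and assumes the command exits with a nonzero status when any gate fails (the usual convention for CI gates, but confirm against your version):

yaml
# .github/workflows/eval.yml (illustrative fragment)
steps:
  - run: npx -p @reaatech/rag-eval-cli rag-eval-pack evaluate --dataset dataset.jsonl --output results/results.json
  # Assumption: a failing gate exits nonzero and fails the job.
  - run: npx -p @reaatech/rag-eval-cli rag-eval-pack gate --results results/results.json --gates gates.yaml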

compare

Compare two evaluation runs.

terminal
rag-eval-pack compare \
  --baseline results/v1.json \
  --candidate results/v2.json \
  --output diff.json

Option        Type     Default       Description
--baseline    string   (required)    Path to baseline evaluation results
--candidate   string   (required)    Path to candidate evaluation results
--output      string                 Output file for diff

cost

Display cost breakdown for an evaluation run.

terminal
rag-eval-pack cost --results results/results.json

Option      Type     Default       Description
--results   string   (required)    Path to evaluation results JSON

report

Generate a formatted report from evaluation results.

terminal
rag-eval-pack report \
  --results results/results.json \
  --gates gates.yaml \
  --output report.md

Option      Type     Default       Description
--results   string   (required)    Path to evaluation results JSON
--gates     string                 Path to gate config for gate status in report
--output    string   (required)    Output file path for report

judge

Run LLM judge evaluation on a dataset.

terminal
rag-eval-pack judge \
  --dataset dataset.jsonl \
  --metric faithfulness \
  --model claude-opus \
  --output judge-results.json

Option        Type      Default        Description
--dataset     string    (required)     Path to dataset (JSONL, JSON, YAML)
--metric      string    faithfulness   Metric to evaluate
--model       string    claude-opus    LLM model to use
--output      string                   Output file for judge results
--consensus   boolean   false          Enable consensus voting
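
For example, to keep the default metric and model but enable consensus voting (all flags used here are documented above; as a boolean flag, --consensus takes no value):

terminal
rag-eval-pack judge --dataset dataset.jsonl --consensus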

mcp-server

Start the MCP server for agent integration.

terminal
rag-eval-pack mcp-server

The server communicates over stdio. Register it in your MCP client's settings (e.g., claude_desktop_config.json) to expose the evaluation tools to agents.
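
A minimal Claude Desktop entry could look like this. The server name rag-eval is arbitrary, and the command assumes a global install; otherwise point command at the installed binary:

json
{
  "mcpServers": {
    "rag-eval": {
      "command": "rag-eval-pack",
      "args": ["mcp-server"]
    }
  }
}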

Configuration

Suite Config (YAML)

yaml
# eval-config.yaml
metrics:
  - faithfulness
  - relevance
  - context_precision
  - context_recall
 
judge:
  model: claude-opus
  enabled: true
  consensus:
    enabled: false
 
cost:
  budget_limit: 10.00
  hard_limit: true
  alert_thresholds: [0.5, 0.75, 0.9]
 
execution:
  parallel_jobs: 5
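
Because the loader accepts JSON as well as YAML, the same suite config can be written as JSON. This is a direct translation of the YAML above:

json
{
  "metrics": ["faithfulness", "relevance", "context_precision", "context_recall"],
  "judge": {
    "model": "claude-opus",
    "enabled": true,
    "consensus": { "enabled": false }
  },
  "cost": {
    "budget_limit": 10.0,
    "hard_limit": true,
    "alert_thresholds": [0.5, 0.75, 0.9]
  },
  "execution": {
    "parallel_jobs": 5
  }
}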

Gate Config (YAML)

yaml
# gates.yaml
gates:
  - name: min-faithfulness
    type: threshold
    metric: avg_faithfulness
    operator: ">="
    threshold: 0.85
 
  - name: max-cost-per-sample
    type: threshold
    metric: cost_per_sample
    operator: "<="
    threshold: 0.05
 
  - name: no-regression
    type: baseline-comparison
    metric: overall_score
    allow_regression: false

License

MIT