
@reaatech/rag-eval-metrics




Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Heuristic metric scorers for RAG evaluation. Provides four independent scorers — faithfulness, relevance, context precision, and context recall — plus a MetricsEngine orchestrator that runs them in parallel with configurable concurrency.

Installation

terminal
npm install @reaatech/rag-eval-metrics
# or
pnpm add @reaatech/rag-eval-metrics

Feature Overview

  • Faithfulness — measures factual grounding of the answer in retrieved context (statement-level decomposition)
  • Relevance — measures semantic alignment between query and answer (intent decomposition + cosine similarity)
  • Context Precision — measures retrieval ranking quality via MAP (Mean Average Precision) and NDCG
  • Context Recall — measures ground truth coverage by decomposing facts and checking context overlap
  • Parallel execution — MetricsEngine runs all configured scorers concurrently with configurable parallelJobs
  • Heuristic-first — no LLM calls required; all scorers use NLP libraries (compromise, natural)

Quick Start

typescript
import {
  FaithfulnessScorer,
  RelevanceScorer,
  ContextPrecisionScorer,
  ContextRecallScorer,
  MetricsEngine,
} from "@reaatech/rag-eval-metrics";
 
const engine = new MetricsEngine({ parallelJobs: 4 });
 
const result = await engine.evaluateSample(
  {
    query: "What is the refund policy?",
    context: [
      "Refunds are processed within 14 days of purchase.",
      "Contact support@example.com for refund requests.",
    ],
    ground_truth: "Refunds must be requested within 14 days by contacting support.",
    generated_answer: "You can request a refund within 14 days by emailing support.",
  },
  { metrics: ["faithfulness", "relevance", "context_precision", "context_recall"] },
  0 // sample index
);
 
console.log(result.faithfulness?.score); // ~0.95
console.log(result.relevance?.score);    // ~0.88

API Reference

FaithfulnessScorer

Decomposes the generated answer into atomic statements and verifies each against the provided context.

typescript
import { FaithfulnessScorer } from "@reaatech/rag-eval-metrics";
 
const scorer = new FaithfulnessScorer();
const result = await scorer.score(sample);
// → { score: 0.90, statements: [...], supported_count: 8, total_count: 9 }
| Property | Type | Description |
| --- | --- | --- |
| `score` | `number` | Ratio of supported statements to total (0–1) |
| `statements` | `string[]` | Decomposed atomic statements from the answer |
| `supported_count` | `number` | Number of statements supported by context |
| `total_count` | `number` | Total number of extracted statements |
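Conceptually, the score is the fraction of decomposed statements that the retrieved context supports. The sketch below illustrates that idea with a crude token-overlap check; the tokenizer, the 0.6 support threshold, and the helper names are assumptions for illustration, not the package's actual NLP pipeline (which uses compromise and natural).

```typescript
// Illustrative only: approximates statement support via token overlap.
// The regex tokenizer and the 0.6 threshold are assumptions for this
// sketch, not the library's real statement-decomposition logic.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Fraction of a statement's tokens that appear somewhere in the context.
function supportRatio(statement: string, context: string[]): number {
  const stmtTokens = tokenize(statement);
  const ctxTokens = tokenize(context.join(" "));
  if (stmtTokens.size === 0) return 0;
  let hits = 0;
  for (const t of stmtTokens) if (ctxTokens.has(t)) hits++;
  return hits / stmtTokens.size;
}

// score = supported statements / total statements, as in the result shape.
function faithfulnessSketch(statements: string[], context: string[]): number {
  if (statements.length === 0) return 0;
  const supported = statements.filter((s) => supportRatio(s, context) >= 0.6);
  return supported.length / statements.length;
}
```

With one grounded and one ungrounded statement, the sketch returns 0.5 — the same supported/total ratio the real scorer reports, just with a much weaker notion of "supported".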

RelevanceScorer

Decomposes the query into intents and checks how well the answer addresses each intent using semantic similarity.

typescript
import { RelevanceScorer } from "@reaatech/rag-eval-metrics";
 
const scorer = new RelevanceScorer();
const result = await scorer.score(sample);
// → { score: 0.88, intents: [...], similarity: 0.82 }
| Property | Type | Description |
| --- | --- | --- |
| `score` | `number` | Composite relevance score (0–1) |
| `intents` | `string[]` | Decomposed query intents |
| `similarity` | `number` | Cosine similarity between intent and answer embeddings |
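The `similarity` field is a cosine score between intent and answer representations. As a stand-in for the library's embedding step (which isn't exposed in this API), the sketch below computes cosine similarity over plain bag-of-words vectors:

```typescript
// Bag-of-words term counts; a simplified stand-in for real embeddings.
function bagOfWords(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const tok of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    counts.set(tok, (counts.get(tok) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity: dot product over the product of vector norms.
function cosineSimilarity(a: string, b: string): number {
  const va = bagOfWords(a);
  const vb = bagOfWords(b);
  let dot = 0, na = 0, nb = 0;
  for (const [tok, w] of va) {
    dot += w * (vb.get(tok) ?? 0);
    na += w * w;
  }
  for (const [, w] of vb) nb += w * w;
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}
```

Identical texts score 1.0 and texts with no shared terms score 0.0; real embeddings give smoother values in between, which is where scores like the 0.82 above come from.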

ContextPrecisionScorer

Evaluates how well the retrieval system ranks relevant context chunks. Computes MAP and NDCG against the ground truth.

typescript
import { ContextPrecisionScorer } from "@reaatech/rag-eval-metrics";
 
const scorer = new ContextPrecisionScorer();
const result = await scorer.score(sample);
// → { score: 0.75, map: 0.72, ndcg: 0.78, relevant_ranks: [1, 3] }
| Property | Type | Description |
| --- | --- | --- |
| `score` | `number` | Average of MAP and NDCG |
| `map` | `number` | Mean Average Precision |
| `ndcg` | `number` | Normalized Discounted Cumulative Gain |
| `relevant_ranks` | `number[]` | Rank positions of relevant chunks (1-indexed) |
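MAP and NDCG are standard ranking metrics. The sketch below computes both from binary relevance labels in retrieval order; how the scorer decides per-chunk relevance against the ground truth is not part of this API surface, so the labels are an assumed input.

```typescript
// Average Precision over binary labels (1 = relevant, 0 = not), in
// retrieval order: the mean of precision@k taken at each relevant rank.
function averagePrecision(rels: number[]): number {
  let hits = 0;
  let sum = 0;
  rels.forEach((rel, i) => {
    if (rel === 1) {
      hits++;
      sum += hits / (i + 1); // precision at rank i + 1
    }
  });
  return hits ? sum / hits : 0;
}

// NDCG with the standard log2 discount, normalized by the ideal ordering.
function ndcg(rels: number[]): number {
  const dcg = (rs: number[]) =>
    rs.reduce((acc, rel, i) => acc + rel / Math.log2(i + 2), 0);
  const ideal = dcg([...rels].sort((a, b) => b - a));
  return ideal ? dcg(rels) / ideal : 0;
}
```

For the example above with relevant chunks at ranks 1 and 3 (labels `[1, 0, 1]`), `averagePrecision` gives (1/1 + 2/3) / 2 ≈ 0.83 — the misranked chunk at position 2 is what drags `map` below 1.0.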

ContextRecallScorer

Decomposes the ground truth into individual facts and measures how many are covered by the retrieved context.

typescript
import { ContextRecallScorer } from "@reaatech/rag-eval-metrics";
 
const scorer = new ContextRecallScorer();
const result = await scorer.score(sample);
// → { score: 0.90, total_facts: 5, covered_facts: 4 }
| Property | Type | Description |
| --- | --- | --- |
| `score` | `number` | Ratio of covered facts to total (0–1) |
| `total_facts` | `number` | Number of facts extracted from ground truth |
| `covered_facts` | `number` | Number of facts found in retrieved context |

MetricsEngine

Orchestrates parallel metric computation.

typescript
import { MetricsEngine } from "@reaatech/rag-eval-metrics";
 
const engine = new MetricsEngine({ parallelJobs: 5 });
 
// Evaluate a single sample
const result = await engine.evaluateSample(sample, config, index);
 
// Aggregate results across all samples
const aggregated = engine.aggregateResults(sampleResults);
// → { overall_score, avg_faithfulness, avg_relevance, ..., std_dev: { ... } }

Constructor Options

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `parallelJobs` | `number` | `5` | Maximum concurrent metric evaluations |
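`parallelJobs` caps how many evaluations run at once. A minimal promise-pool sketch of that semantics (an illustration of the concept, not the engine's internals):

```typescript
// Minimal promise pool: at most `limit` tasks are in flight at a time.
// Results are returned in task order regardless of completion order.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  // Each worker pulls the next unstarted task until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, worker)
  );
  return results;
}
```

With `limit` set to the engine's `parallelJobs`, large batches are processed without firing every evaluation simultaneously.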

Usage Patterns

Individual Scorer

typescript
import { FaithfulnessScorer } from "@reaatech/rag-eval-metrics";
 
const scorer = new FaithfulnessScorer();
 
const result = await scorer.score({
  query: "What is the refund policy?",
  context: ["Refunds are processed within 14 days."],
  ground_truth: "Refunds within 14 days.",
  generated_answer: "You have 14 days to request a refund.",
});
 
if (result.score < 0.85) {
  console.warn("Answer may contain hallucinations");
}

Batch Evaluation with Aggregation

typescript
import { MetricsEngine } from "@reaatech/rag-eval-metrics";
import type { EvaluationSample, EvalSuiteConfig } from "@reaatech/rag-eval-core";
 
const engine = new MetricsEngine({ parallelJobs: 8 });
const config: EvalSuiteConfig = {
  metrics: ["faithfulness", "relevance", "context_precision", "context_recall"],
};
 
const results = await Promise.all(
  samples.map((sample, i) => engine.evaluateSample(sample, config, i))
);
 
const aggregated = engine.aggregateResults(results);
console.log("Overall score:", aggregated.overall_score);
console.log("Faithfulness:", aggregated.avg_faithfulness);
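Per-metric aggregates such as `avg_faithfulness` and `std_dev` are means and standard deviations over the per-sample scores. A sketch of that computation (whether the library uses population or sample variance is an assumption here):

```typescript
// Mean and population standard deviation over per-sample scores —
// the kind of aggregate aggregateResults reports per metric.
function meanAndStdDev(scores: number[]): { mean: number; stdDev: number } {
  if (scores.length === 0) return { mean: 0, stdDev: 0 };
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((acc, s) => acc + (s - mean) ** 2, 0) / scores.length;
  return { mean, stdDev: Math.sqrt(variance) };
}
```

A high mean with a large standard deviation usually means a handful of badly failing samples rather than uniform mediocrity, so inspect the per-sample results before tuning retrieval.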

License

MIT