# @reaatech/rag-eval-metrics

> **Status:** Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Heuristic metric scorers for RAG evaluation. Provides four independent scorers — faithfulness, relevance, context precision, and context recall — plus a `MetricsEngine` orchestrator that runs them in parallel with configurable concurrency.
## Installation

```bash
npm install @reaatech/rag-eval-metrics
# or
pnpm add @reaatech/rag-eval-metrics
```

## Feature Overview
- Faithfulness — measures factual grounding of the answer in retrieved context (statement-level decomposition)
- Relevance — measures semantic alignment between query and answer (intent decomposition + cosine similarity)
- Context Precision — measures retrieval ranking quality via MAP (Mean Average Precision) and NDCG
- Context Recall — measures ground truth coverage by decomposing facts and checking context overlap
- Parallel execution — `MetricsEngine` runs all configured scorers concurrently with a configurable `parallelJobs` limit
- Heuristic-first — no LLM calls required; all scorers use NLP libraries (`compromise`, `natural`)
## Quick Start

```ts
import {
FaithfulnessScorer,
RelevanceScorer,
ContextPrecisionScorer,
ContextRecallScorer,
MetricsEngine,
} from "@reaatech/rag-eval-metrics";
const engine = new MetricsEngine({ parallelJobs: 4 });
const result = await engine.evaluateSample(
{
query: "What is the refund policy?",
context: [
"Refunds are processed within 14 days of purchase.",
"Contact support@example.com for refund requests.",
],
ground_truth: "Refunds must be requested within 14 days by contacting support.",
generated_answer: "You can request a refund within 14 days by emailing support.",
},
{ metrics: ["faithfulness", "relevance", "context_precision", "context_recall"] },
0
);
console.log(result.faithfulness?.score); // ~0.95
console.log(result.relevance?.score); // ~0.88
```

## API Reference
### FaithfulnessScorer
Decomposes the generated answer into atomic statements and verifies each against the provided context.
```ts
import { FaithfulnessScorer } from "@reaatech/rag-eval-metrics";
const scorer = new FaithfulnessScorer();
const result = await scorer.score(sample);
// → { score: 0.89, statements: [...], supported_count: 8, total_count: 9 }
```

| Property | Type | Description |
|---|---|---|
| `score` | `number` | Ratio of supported statements to total (0–1) |
| `statements` | `string[]` | Decomposed atomic statements from the answer |
| `supported_count` | `number` | Number of statements supported by context |
| `total_count` | `number` | Total number of extracted statements |
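To build intuition for the statement-level approach, the sketch below reimplements the idea with naive sentence splitting and a token-overlap support check. It is illustrative only: the package's actual decomposition and matching are built on `compromise` and `natural` and are more robust.

```ts
// Illustrative sketch, NOT this package's implementation: split the
// answer into rough statements and treat a statement as "supported"
// when most of its tokens appear in the retrieved context.
function naiveFaithfulness(answer: string, context: string[]) {
  const statements = answer
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter(Boolean);

  const contextTokens = new Set(
    context.join(" ").toLowerCase().match(/\w+/g) ?? []
  );

  const supported = statements.filter((s) => {
    const tokens = s.toLowerCase().match(/\w+/g) ?? [];
    const hits = tokens.filter((t) => contextTokens.has(t)).length;
    return tokens.length > 0 && hits / tokens.length >= 0.7; // overlap threshold
  });

  return {
    score: statements.length ? supported.length / statements.length : 0,
    supported_count: supported.length,
    total_count: statements.length,
  };
}
```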
### RelevanceScorer
Decomposes the query into intents and checks how well the answer addresses each intent using semantic similarity.
```ts
import { RelevanceScorer } from "@reaatech/rag-eval-metrics";
const scorer = new RelevanceScorer();
const result = await scorer.score(sample);
// → { score: 0.88, intents: [...], similarity: 0.82 }
```

| Property | Type | Description |
|---|---|---|
| `score` | `number` | Composite relevance score (0–1) |
| `intents` | `string[]` | Decomposed query intents |
| `similarity` | `number` | Cosine similarity between intent and answer embeddings |
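The `similarity` field is a standard cosine similarity. For reference, the formula looks like this; how the scorer embeds intents and answers is internal to the package.

```ts
// Standard cosine similarity between two equal-length vectors:
// dot(a, b) / (|a| * |b|), in the range [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}
```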
### ContextPrecisionScorer
Evaluates how well the retrieval system ranks relevant context chunks. Computes MAP and NDCG against the ground truth.
```ts
import { ContextPrecisionScorer } from "@reaatech/rag-eval-metrics";
const scorer = new ContextPrecisionScorer();
const result = await scorer.score(sample);
// → { score: 0.75, map: 0.72, ndcg: 0.78, relevant_ranks: [1, 3] }
```

| Property | Type | Description |
|---|---|---|
| `score` | `number` | Average of MAP and NDCG |
| `map` | `number` | Mean Average Precision |
| `ndcg` | `number` | Normalized Discounted Cumulative Gain |
| `relevant_ranks` | `number[]` | Rank positions of relevant chunks (1-indexed) |
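To make the two metrics concrete, here is a worked sketch for binary relevance, given the 1-indexed ranks of relevant chunks. The package's exact gain and cutoff choices may differ, so the numbers below will not necessarily match its output.

```ts
// Worked sketch of MAP and NDCG for binary relevance. Illustrative
// only; the scorer's exact gain/cutoff choices may differ.
function mapAndNdcg(relevantRanks: number[]) {
  const ranks = [...relevantRanks].sort((a, b) => a - b);

  // Average Precision: mean of precision@k taken at each relevant rank.
  const map =
    ranks.reduce((sum, rank, i) => sum + (i + 1) / rank, 0) /
    (ranks.length || 1);

  // DCG with binary gains; the ideal ranking packs relevant chunks first.
  const dcg = ranks.reduce((sum, rank) => sum + 1 / Math.log2(rank + 1), 0);
  const idcg = ranks.reduce((sum, _, i) => sum + 1 / Math.log2(i + 2), 0);

  return { map, ndcg: idcg === 0 ? 0 : dcg / idcg };
}

mapAndNdcg([1, 3]); // → { map: ~0.83, ndcg: ~0.92 }
```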
### ContextRecallScorer
Decomposes the ground truth into individual facts and measures how many are covered by the retrieved context.
```ts
import { ContextRecallScorer } from "@reaatech/rag-eval-metrics";
const scorer = new ContextRecallScorer();
const result = await scorer.score(sample);
// → { score: 0.80, total_facts: 5, covered_facts: 4 }
```

| Property | Type | Description |
|---|---|---|
| `score` | `number` | Ratio of covered facts to total (0–1) |
| `total_facts` | `number` | Number of facts extracted from ground truth |
| `covered_facts` | `number` | Number of facts found in retrieved context |
### MetricsEngine
Orchestrates parallel metric computation.
```ts
import { MetricsEngine } from "@reaatech/rag-eval-metrics";
const engine = new MetricsEngine({ parallelJobs: 5 });
// Evaluate a single sample
const result = await engine.evaluateSample(sample, config, index);
// Aggregate results across all samples
const aggregated = engine.aggregateResults(sampleResults);
// → { overall_score, avg_faithfulness, avg_relevance, ..., std_dev: { ... } }
```
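The aggregate pairs a per-metric mean with a standard deviation. A minimal sketch of that kind of aggregation for a single metric is shown below; the exact output shape is defined by the package's own types.

```ts
// Sketch of per-metric aggregation: mean plus (population) standard
// deviation over the per-sample scores. Illustrates the math only.
function aggregateScores(scores: number[]) {
  const n = scores.length || 1;
  const mean = scores.reduce((s, x) => s + x, 0) / n;
  const variance = scores.reduce((s, x) => s + (x - mean) ** 2, 0) / n;
  return { avg: mean, std_dev: Math.sqrt(variance) };
}
```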
#### Constructor Options

| Property | Type | Default | Description |
|---|---|---|---|
| `parallelJobs` | `number` | `5` | Maximum concurrent metric evaluations |
## Usage Patterns

### Individual Scorer

```ts
import { FaithfulnessScorer } from "@reaatech/rag-eval-metrics";
const scorer = new FaithfulnessScorer();
const result = await scorer.score({
query: "What is the refund policy?",
context: ["Refunds are processed within 14 days."],
ground_truth: "Refunds within 14 days.",
generated_answer: "You have 14 days to request a refund.",
});
if (result.score < 0.85) {
console.warn("Answer may contain hallucinations");
}
```

### Batch Evaluation with Aggregation

```ts
import { MetricsEngine } from "@reaatech/rag-eval-metrics";
import type { EvaluationSample, EvalSuiteConfig } from "@reaatech/rag-eval-core";
const engine = new MetricsEngine({ parallelJobs: 8 });
const config: EvalSuiteConfig = {
metrics: ["faithfulness", "relevance", "context_precision", "context_recall"],
};
const results = await Promise.all(
samples.map((sample, i) => engine.evaluateSample(sample, config, i))
);
const aggregated = engine.aggregateResults(results);
console.log("Overall score:", aggregated.overall_score);
console.log("Faithfulness:", aggregated.avg_faithfulness);Related Packages
## Related Packages

- `@reaatech/rag-eval-core` — Types and schemas
- `@reaatech/rag-eval-judge` — LLM-based evaluation (higher accuracy, higher cost)
- `@reaatech/rag-eval-suite` — Central orchestrator
