@reaatech/hybrid-rag-evaluation

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Evaluation runner, ablation studies, and performance benchmarking for hybrid RAG systems. Provides standard IR metrics, component contribution analysis, and latency/throughput/cost measurement.

Installation

terminal
npm install @reaatech/hybrid-rag-evaluation
# or
pnpm add @reaatech/hybrid-rag-evaluation

Feature Overview

  • Standard IR metrics — Precision@K, Recall@K, NDCG@K, MAP, MRR with per-query and aggregate results
  • Generation metrics — Relevance, fluency, coherence, faithfulness, answer correctness scoring
  • Ablation studies — YAML-configurable variant testing with delta comparisons and cost tracking
  • Dataset management — load .jsonl datasets, split into train/test, generate synthetic datasets
  • Performance benchmarking — latency percentiles, throughput, cost breakdown, environment capture
  • Markdown reporting — auto-generated benchmark and ablation reports

Quick Start

Evaluation

typescript
import { EvaluationRunner, loadEvaluationDataset } from '@reaatech/hybrid-rag-evaluation';
 
const dataset = await loadEvaluationDataset('./datasets/eval.jsonl');
 
const runner = new EvaluationRunner({
  topK: 10,
  metrics: ['precision', 'recall', 'ndcg', 'map', 'mrr'],
});
 
// `pipeline` is your own RAG system; the callback returns its retrieval results
const results = await runner.evaluate(dataset, async (query) => {
  return pipeline.query(query);
});
 
console.log(`NDCG@10: ${results.summary.ndcgAtK}`);
console.log(`MAP: ${results.summary.map}`);
console.log(`MRR: ${results.summary.mrr}`);

Ablation Study

typescript
import { AblationRunner } from '@reaatech/hybrid-rag-evaluation';
 
const config = {
  baseline: {
    chunking: 'fixed-size',
    chunkSize: 512,
    overlap: 50,
    retrieval: 'hybrid',
    vectorWeight: 0.7,
    bm25Weight: 0.3,
    reranker: 'cohere',
    topK: 10,
  },
  variants: [
    { name: 'no-reranker', changes: { reranker: null } },
    { name: 'vector-only', changes: { retrieval: 'vector' } },
    { name: 'bm25-only', changes: { retrieval: 'bm25' } },
  ],
};
 
// builderFn runs once per variant with the merged pipeline config and returns
// the query callback to evaluate; RAGPipeline is your own pipeline class,
// and `dataset` comes from the Evaluation example above
const results = await AblationRunner.run(config, dataset, async (pipelineConfig) => {
  const pipeline = new RAGPipeline(pipelineConfig);
  await pipeline.initialize();
  return async (query) => pipeline.query(query);
});

Benchmarking

typescript
import { benchmarkLatency, benchmarkThroughput } from '@reaatech/hybrid-rag-evaluation';
 
const latencyResult = await benchmarkLatency({
  queries: ['How do I...', 'What is...', /* ... */],
  queryFn: async (q) => pipeline.query(q),
  warmupQueries: 5,
  iterations: 50,
});
 
console.log(`P50: ${latencyResult.p50}ms, P99: ${latencyResult.p99}ms`);
 
const throughputResult = await benchmarkThroughput({
  queries: loadQueries(), // your own helper returning a string[] of test queries
  queryFn: async (q) => pipeline.query(q),
  concurrentQueries: 10,
});

API Reference

Evaluation

EvaluationRunner

| Constructor Option | Type | Default | Description |
| --- | --- | --- | --- |
| topK | number | 10 | K value for @K metrics |
| metrics | string[] | ['precision','recall','ndcg','map','mrr'] | Metrics to compute |

| Method | Returns | Description |
| --- | --- | --- |
| evaluate(dataset, queryFn) | EvaluationResults | Run evaluation on all samples |
| runEvaluation(dataset, queryFn, config?) | EvaluationResults | Alias for evaluate |

Retrieval Metrics

| Function | Description |
| --- | --- |
| precisionAtK(retrieved, relevant, k) | Fraction of retrieved items that are relevant |
| recallAtK(retrieved, relevant, k) | Fraction of relevant items that were retrieved |
| ndcgAtK(retrieved, relevant, k) | Normalized Discounted Cumulative Gain |
| dcgAtK(retrieved, relevant, k) | Discounted Cumulative Gain |
| idcgAtK(relevant, k) | Ideal DCG |
| averagePrecision(retrieved, relevant) | Mean of precision at each relevant position |
| reciprocalRank(retrieved, relevant) | 1 / rank of first relevant item |
| evaluateQuery(retrieved, relevant, config) | Compute all metrics for a single query |
| aggregateMetrics(perQueryResults) | Aggregate per-query results into summary |
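
These functions can also be called directly on a single ranked list. A minimal sketch, assuming retrieved is the ordered list of result IDs and relevant is the list of ground-truth IDs (the sample data is hypothetical):

typescript
import { precisionAtK, recallAtK, reciprocalRank } from '@reaatech/hybrid-rag-evaluation';

// Ranked IDs returned by a retriever (hypothetical)
const retrieved = ['chunk-010-2', 'chunk-003-1', 'chunk-010-4'];
// Ground-truth relevant IDs from the dataset
const relevant = ['chunk-010-2', 'chunk-010-4'];

console.log(precisionAtK(retrieved, relevant, 3)); // 2 of 3 retrieved are relevant
console.log(recallAtK(retrieved, relevant, 3));    // both relevant items retrieved
console.log(reciprocalRank(retrieved, relevant));  // first relevant item is at rank 1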

Generation Metrics

| Function | Description |
| --- | --- |
| relevanceScore(generated, query) | How relevant the answer is to the query |
| fluencyScore(generated) | Grammatical correctness and readability |
| coherenceScore(generated) | Logical flow and consistency |
| faithfulnessScore(generated, source) | How well the answer adheres to source material |
| answerCorrectnessScore(generated, ideal) | Factual correctness against ground truth |
| evaluateGeneration(generated, query, config) | Compute all generation metrics |
| aggregateGenerationMetrics(results) | Aggregate per-query generation results |
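
The individual scorers can be combined through evaluateGeneration. A sketch, assuming an empty config falls back to defaults and that the scorers may be awaited:

typescript
import { evaluateGeneration, faithfulnessScore } from '@reaatech/hybrid-rag-evaluation';

// Hypothetical inputs; `source` is the retrieved context the answer should stay grounded in
const query = 'What is the refund policy?';
const generated = 'Refunds are available within 30 days of purchase.';
const source = 'Our refund policy allows returns within 30 days of purchase.';

const scores = await evaluateGeneration(generated, query, {});
console.log(scores);

console.log(await faithfulnessScore(generated, source));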

Ablation Studies

AblationRunner

| Static Method | Description |
| --- | --- |
| run(config, dataset, builderFn) | Execute full ablation study |
| runAblation(config, dataset, builderFn) | Alias for run |

Config Format (YAML)

yaml
baseline:
  chunking: fixed-size
  chunk_size: 512
  overlap: 50
  retrieval: hybrid
  vector_weight: 0.7
  bm25_weight: 0.3
  reranker: cohere
  top_k: 10
 
variants:
  - name: no-reranker
    description: "Skip reranking step"
    changes:
      reranker: null
 
  - name: semantic-chunking
    changes:
      chunking: semantic
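
A config in this format can be loaded with loadEvaluationConfig (see Dataset Management below) and passed straight to AblationRunner.run. A sketch, assuming the loaded object matches the config shape run expects and using a hypothetical buildQueryFn helper:

typescript
import { AblationRunner, loadEvaluationConfig, loadEvaluationDataset } from '@reaatech/hybrid-rag-evaluation';
import { buildQueryFn } from './my-pipeline'; // hypothetical: builds a query callback per variant config

const config = await loadEvaluationConfig('./configs/ablation.yaml');
const dataset = await loadEvaluationDataset('./datasets/eval.jsonl');

const results = await AblationRunner.run(config, dataset, buildQueryFn);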

Reporter

| Function | Description |
| --- | --- |
| generateMarkdownTable(results) | Render results as a Markdown table |
| generateSummary(results) | One-line summary of best/worst variants |
| sortByNDCG(results) | Sort variants by NDCG descending |
| sortByDelta(results) | Sort variants by delta vs baseline |
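
Chained together, these turn an ablation run into a readable report. A sketch, assuming results is the value returned by AblationRunner.run:

typescript
import { generateMarkdownTable, generateSummary, sortByDelta } from '@reaatech/hybrid-rag-evaluation';

// `results` comes from AblationRunner.run (see Quick Start)
const sorted = sortByDelta(results);
console.log(generateSummary(sorted));       // one-line best/worst summary
console.log(generateMarkdownTable(sorted)); // full comparison table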

Benchmarking

Latency

| Function | Description |
| --- | --- |
| benchmarkLatency(options) | Measure per-query latency with warmup |
| benchmarkComponentLatency(options) | Measure latency per pipeline component |
| calculateLatencyStats(latencies) | Compute p50/p90/p95/p99/mean/min/max |

| Option | Type | Description |
| --- | --- | --- |
| queries | string[] | Test queries |
| queryFn | (q: string) => Promise<unknown> | Function to benchmark |
| warmupQueries | number | Queries to run before measurement |
| iterations | number | Measurement iterations |
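
calculateLatencyStats can also be applied to timings collected outside benchmarkLatency. A sketch, assuming it takes an array of millisecond values and returns the stats listed above:

typescript
import { calculateLatencyStats } from '@reaatech/hybrid-rag-evaluation';

// `pipeline` is your own RAG system, as in Quick Start
const latencies: number[] = [];
for (let i = 0; i < 50; i++) {
  const start = performance.now();
  await pipeline.query('How do I reset my password?');
  latencies.push(performance.now() - start);
}

const stats = calculateLatencyStats(latencies);
console.log(`p50=${stats.p50}ms p99=${stats.p99}ms mean=${stats.mean}ms`);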

Throughput

| Function | Description |
| --- | --- |
| benchmarkThroughput(options) | Measure queries per second with concurrency |

| Option | Type | Description |
| --- | --- | --- |
| queries | string[] | Test queries |
| queryFn | (q: string) => Promise<unknown> | Function to benchmark |
| concurrentQueries | number | Max concurrent queries |

Cost

| Function / Class | Description |
| --- | --- |
| calculateEmbeddingCost(tokens, model) | Cost of embedding generation |
| calculateRerankerCost(documents, provider) | Cost of reranking |
| calculateVectorSearchCost(queries) | Cost of vector search |
| calculateQueryCost(breakdown) | Total per-query cost |
| CostTracker | Class for tracking cumulative costs |
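
A sketch of the per-component cost helpers; the concrete argument values (token count, model name, document count, provider) are hypothetical, so check the package's exported types for the exact signatures:

typescript
import { calculateEmbeddingCost, calculateRerankerCost } from '@reaatech/hybrid-rag-evaluation';

// Hypothetical values: 1200 embedding tokens, 10 reranked documents
const embedCost = calculateEmbeddingCost(1200, 'text-embedding-3-small');
const rerankCost = calculateRerankerCost(10, 'cohere');

console.log(`embedding: $${embedCost}, reranking: $${rerankCost}`);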

Reporter

| Function | Description |
| --- | --- |
| generateMarkdownReport(result) | Render benchmark as Markdown |
| saveBenchmarkReport(result, path) | Save report to file |
| getEnvironmentInfo() | Capture Node version, platform, CPU, memory |
| createBenchmarkReport(latency, throughput, cost) | Combine all results into a report |
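
These compose into a single saved report. A sketch, assuming latencyResult and throughputResult come from the Benchmarking quick start and cost is a breakdown from the cost helpers above:

typescript
import { createBenchmarkReport, saveBenchmarkReport } from '@reaatech/hybrid-rag-evaluation';

// latencyResult / throughputResult: see the Benchmarking quick start; `cost`: see Cost above
const report = createBenchmarkReport(latencyResult, throughputResult, cost);
await saveBenchmarkReport(report, './reports/benchmark.md');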

Dataset Management

Loader

| Function | Description |
| --- | --- |
| loadEvaluationDataset(path) | Load a .jsonl dataset file |
| loadEvaluationConfig(path) | Load a YAML evaluation config |
| validateEvaluationSample(sample) | Zod-validate a sample |
| splitDataset(dataset, ratio?) | Split into train/test (default 80/20) |
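
A sketch of a train/test split; the { train, test } return shape and array-like datasets are assumptions:

typescript
import { loadEvaluationDataset, splitDataset } from '@reaatech/hybrid-rag-evaluation';

const dataset = await loadEvaluationDataset('./datasets/eval.jsonl');
const { train, test } = splitDataset(dataset, 0.8); // 0.8 is the default ratio

console.log(`${train.length} train samples, ${test.length} test samples`);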

Generator

| Function | Description |
| --- | --- |
| generateDataset(config) | Generate synthetic evaluation samples |
| generateAndSaveDataset(config, path) | Generate and save to file |
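
A minimal sketch; the config field shown is hypothetical, so take the real shape from the package's exported types:

typescript
import { generateAndSaveDataset } from '@reaatech/hybrid-rag-evaluation';

// Hypothetical config field; consult the exported types for the real options
await generateAndSaveDataset({ sampleCount: 100 }, './datasets/synthetic.jsonl');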

Dataset Format (.jsonl)

jsonl
{"query_id": "q1", "query": "How do I reset my password?", "relevant_docs": ["doc-001"], "relevant_chunks": ["chunk-001-3"]}
{"query_id": "q2", "query": "What is the refund policy?", "relevant_docs": ["doc-010"], "relevant_chunks": ["chunk-010-2", "chunk-010-4"]}

License

MIT