Continuously evaluate your small business RAG knowledge base using Perplexity’s LLM-as-judge, heuristic metrics, and cost-tracked CI gates from REAA’s eval packs.
SMBs that deploy internal RAG bots for employee or customer support find their answers drift as documents change. Without automated evaluation, they only discover quality regressions through user complaints, with no reproducible benchmark and no way to track LLM judging costs.
This tutorial walks you through building a CLI-powered RAG evaluation suite for small business knowledge bases. You’ll use Perplexity as a low-cost LLM-as-judge, heuristic metrics for fast scoring, and CI quality gates to catch answer regressions before your users do. The final tool loads a golden Q&A dataset, runs faithfulness/relevance/precision/recall scoring (heuristic first, then Perplexity-powered judging for ambiguous cases), tracks every cent of LLM spend, and fails a CI pipeline if scores dip below your thresholds.
Prerequisites
Node.js >= 22 with pnpm installed (npm install -g pnpm)
A Perplexity API key — sign up at perplexity.ai and create a key
A Langfuse account (optional, for tracing) — create one at langfuse.com
Familiarity with TypeScript and Node.js CLI development — you’ll work in src/ building a command-line evaluation tool
Step 1: Scaffold the project and install dependencies
Start with a Next.js 16 project (used as the TypeScript build system for this CLI tool):
Expected output: a working Next.js project with all dependencies installed. Run pnpm typecheck to confirm the TypeScript compiler is happy with the scaffold.
Step 2: Define the evaluation types
The whole pipeline speaks a shared language of types. Create src/lib/types.ts to re-export core types and define your CLI option shapes:
ts
import {
  type EvaluationSample,
  type EvalSuiteConfig,
  type SampleEvalResult,
  type EvalResults,
  type GateConfig,
  type GateResult,
  type AggregatedMetrics,
  type GateFailure,
  EvaluationSampleSchema,
  EvalSuiteConfigSchema,
} from "@reaatech/rag-eval-core";
import { z } from "zod";

export type FidelityMode = "heuristic-only" | "full-judge";
export type OutputFormat = "json" | "junit";

export interface CliOptions {
  datasetPath: string;
  configPath?: string;
  fidelity: FidelityMode;
  output: OutputFormat;
  baselinePath?: string;
}

export interface EvalRunResult {
  results: EvalResults;
  gateResult: GateResult;
  totalCost: number;
  durationMs: number;
}

export class EvalPipelineError extends Error {
  constructor(
    message: string,
    public readonly code: string,
    public readonly details?: Record<string, unknown>,
  ) {
    super(message);
    this.name = "EvalPipelineError";
  }
}

export {
  EvaluationSample,
  EvalSuiteConfig,
  SampleEvalResult,
  EvalResults,
  GateConfig,
  GateResult,
  AggregatedMetrics,
  GateFailure,
  EvaluationSampleSchema,
  EvalSuiteConfigSchema,
  z,
};
Expected output: a clean pnpm typecheck — these are just type re-exports plus your own CliOptions and EvalRunResult interfaces.
Step 3: Build the heuristic scorer
The heuristic scorer computes four metrics using string-based algorithms — no LLM calls. This gives you fast, free scores on every sample. Create src/services/heuristic-scorer.ts:
ts
import { type EvaluationSample, type SampleEvalResult } from "@reaatech/rag-eval-core";function bigramJaccard(a: string, b: string): number { const wordsA = a.toLowerCase().split(/\s+/).filter(Boolean); const wordsB = b.toLowerCase().split(/\s+/).filter(Boolean); if (wordsA.length <= 1 || wordsB.length <= 1) return 0; const
Expected output: pnpm typecheck passes. The scorer uses four algorithms: bigram Jaccard for faithfulness, word overlap for relevance, keyword matching for precision, and sentence overlap for recall. Any sample scoring below 0.7 on a metric gets flagged with needsJudge: true for deeper LLM analysis.
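To make the faithfulness metric concrete, here is a minimal sketch of word-bigram Jaccard similarity and the 0.7 flagging rule. The listing above is cut off, so this is not the recipe's actual implementation: the helper names (wordBigrams, bigramJaccardSketch, needsJudge) and the exact set arithmetic are assumptions made for illustration.

```typescript
// Illustrative sketch only: names and details are assumptions, not the recipe's code.

/** Split text into lowercase word bigrams, e.g. "a b c" -> {"a b", "b c"}. */
function wordBigrams(text: string): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const bigrams = new Set<string>();
  for (let i = 0; i < words.length - 1; i++) {
    bigrams.add(words[i] + " " + words[i + 1]);
  }
  return bigrams;
}

/** Jaccard similarity |A ∩ B| / |A ∪ B| over word bigrams. */
function bigramJaccardSketch(a: string, b: string): number {
  const setA = wordBigrams(a);
  const setB = wordBigrams(b);
  // Fewer than two words on either side means there are no bigrams to compare.
  if (setA.size === 0 || setB.size === 0) return 0;
  let intersection = 0;
  setA.forEach((bigram) => {
    if (setB.has(bigram)) intersection++;
  });
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}

interface HeuristicScores {
  faithfulness: number;
  relevance: number;
  precision: number;
  recall: number;
}

// Threshold taken from the tutorial text: any metric below 0.7 flags the sample.
const JUDGE_THRESHOLD = 0.7;

function needsJudge(scores: HeuristicScores): boolean {
  return (
    scores.faithfulness < JUDGE_THRESHOLD ||
    scores.relevance < JUDGE_THRESHOLD ||
    scores.precision < JUDGE_THRESHOLD ||
    scores.recall < JUDGE_THRESHOLD
  );
}
```

Bigram overlap rewards answers that reuse phrases from the retrieved context, which is a cheap proxy for groundedness. It cannot recognize paraphrases, which is why low scorers go on to the LLM judge.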
Step 4: Build the Perplexity judge adapter
When heuristic scores are low, you delegate to a Perplexity-powered LLM judge. Create src/api/perplexity-adapter.ts:
ts
import Perplexity from "perplexity-sdk";import { ChatCompletionsPostRequestModelEnum } from "perplexity-sdk";import { calculateCostFromTokens } from "@reaatech/llm-cost-telemetry";export interface JudgeErrorDetails { provider: string; model: string; cause?: unknown;}export class JudgeError extends Error { public details: JudgeErrorDetails; constructor(msg: string, details: JudgeErrorDetails) { super(msg); this.name = "JudgeError"
Expected output: pnpm typecheck passes. The adapter builds a structured prompt for each metric, sends it to Perplexity’s pplx-7b-online model, and parses the JSON response. It also includes a fallback regex parser in case the model returns free-form text instead of JSON.
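The JSON-first, regex-fallback parsing described above could look roughly like the sketch below. The adapter listing is truncated, so the response shape ({ score, reasoning }), the function name parseJudgeResponse, and the regex are assumptions for illustration, not the adapter's real code.

```typescript
// Illustrative sketch only: the verdict shape and regexes are assumptions.

interface JudgeVerdict {
  score: number; // expected to be in [0, 1]
  reasoning: string;
}

function isUnitInterval(value: unknown): value is number {
  return typeof value === "number" && value >= 0 && value <= 1;
}

function parseJudgeResponse(raw: string): JudgeVerdict | null {
  // First attempt: pull out the outermost {...} span and parse it as JSON.
  const jsonMatch = raw.match(/\{[\s\S]*\}/);
  if (jsonMatch) {
    try {
      const parsed: unknown = JSON.parse(jsonMatch[0]);
      if (typeof parsed === "object" && parsed !== null) {
        const record = parsed as Record<string, unknown>;
        const score = record["score"];
        const reasoning = record["reasoning"];
        if (isUnitInterval(score)) {
          return { score, reasoning: typeof reasoning === "string" ? reasoning : "" };
        }
      }
    } catch {
      // Malformed JSON: fall through to the regex fallback below.
    }
  }

  // Fallback: look for free-form text such as "score: 0.8" or "Score = 1".
  const scoreMatch = raw.match(/score\s*[:=]\s*(\d*\.?\d+)/i);
  if (scoreMatch) {
    const score = Number(scoreMatch[1]);
    if (isUnitInterval(score)) {
      return { score, reasoning: raw.trim() };
    }
  }

  // Nothing usable: let the caller decide how to treat an unparseable verdict.
  return null;
}
```

Returning null rather than a default score keeps an unparseable reply from silently passing or failing a gate.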
Step 5: Build the judge scorer service
The judge scorer bridges your pipeline to the JudgeEngine from @reaatech/rag-eval-judge. Create src/services/judge-scorer.ts:
Expected output: pnpm typecheck passes. This service only calls the LLM judge for flagged samples. If all heuristic scores are above 0.7, the judge is skipped entirely, saving you money on every run.
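The "judge only what is flagged" rule comes down to a partition step before any LLM call. The sketch below shows one way to write it; the ScoredSample shape and the partitionForJudge name are hypothetical and do not come from @reaatech/rag-eval-judge.

```typescript
// Illustrative sketch only: types and function names are assumptions.

interface MetricScores {
  faithfulness: number;
  relevance: number;
  precision: number;
  recall: number;
}

interface ScoredSample {
  id: string;
  scores: MetricScores;
}

// Same 0.7 cut-off the tutorial uses for heuristic scores.
const JUDGE_THRESHOLD = 0.7;

function isFlagged(sample: ScoredSample): boolean {
  const { faithfulness, relevance, precision, recall } = sample.scores;
  return [faithfulness, relevance, precision, recall].some(
    (score) => score < JUDGE_THRESHOLD,
  );
}

/** Split samples into those that need an LLM judge call and those that do not. */
function partitionForJudge(samples: ScoredSample[]): {
  flagged: ScoredSample[];
  passed: ScoredSample[];
} {
  const flagged: ScoredSample[] = [];
  const passed: ScoredSample[] = [];
  for (const sample of samples) {
    if (isFlagged(sample)) {
      flagged.push(sample);
    } else {
      passed.push(sample);
    }
  }
  return { flagged, passed };
}
```

If flagged comes back empty, the caller can return early without constructing a judge client, which is where the cost saving comes from.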
Step 6: Build the cost tracker
Track every Perplexity call with the REAA cost telemetry package. Create src/services/cost-tracker.ts:
Expected output: pnpm typecheck passes. The tracker reads a daily budget from the environment (defaulting to $10), records each API call as a CostSpan, and can report whether you’re over budget.
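As a rough picture of what the tracker does, the sketch below reads a budget with a $10 default, records per-call costs, and reports overspend. It does not use @reaatech/llm-cost-telemetry: the environment variable name EVAL_DAILY_BUDGET_USD, the CostEntry shape, and the BudgetTrackerSketch class are all hypothetical stand-ins.

```typescript
// Illustrative sketch only: the env var name and all types are assumptions.

interface CostEntry {
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

/**
 * Read the daily budget from an env-like object (pass process.env in real code),
 * falling back to $10 when the variable is missing or not a positive number.
 */
function readDailyBudget(
  env: Record<string, string | undefined>,
  fallbackUsd = 10,
): number {
  const raw = env["EVAL_DAILY_BUDGET_USD"]; // hypothetical variable name
  const parsed = raw === undefined ? NaN : Number(raw);
  return isFinite(parsed) && parsed > 0 ? parsed : fallbackUsd;
}

class BudgetTrackerSketch {
  private readonly entries: CostEntry[] = [];
  private readonly dailyBudgetUsd: number;

  constructor(dailyBudgetUsd: number) {
    this.dailyBudgetUsd = dailyBudgetUsd;
  }

  /** Record the cost of one judge call. */
  record(entry: CostEntry): void {
    this.entries.push(entry);
  }

  /** Sum of all recorded costs in USD. */
  totalUsd(): number {
    return this.entries.reduce((sum, entry) => sum + entry.costUsd, 0);
  }

  isOverBudget(): boolean {
    return this.totalUsd() > this.dailyBudgetUsd;
  }
}
```

Checking isOverBudget() before each judge call lets the pipeline stop spending and fall back to heuristic scores once the cap is reached.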
Step 7: Build the output formatter
Results need to come out as JSON or JUnit XML. Create src/lib/output-formatter.ts:
Expected output: pnpm typecheck passes. The JUnit formatter creates one testsuite per metric, with testcase entries for each sample and failure elements for any gate violations. CI tools like Jenkins and GitLab CI understand JUnit XML natively.
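For orientation, here is a minimal sketch of emitting one JUnit testsuite for a single metric. The recipe's formatter listing is not shown, so the MetricCase shape, the helper names, and the failure message wording are assumptions, not the real output-formatter.ts.

```typescript
// Illustrative sketch only: input shape and message wording are assumptions.

interface MetricCase {
  sampleId: string;
  score: number;
  threshold: number;
}

/** Escape the five characters that are unsafe inside XML attribute values. */
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

/** Render one <testsuite> for a metric, with a <failure> for each low score. */
function toJUnitSuite(metric: string, cases: MetricCase[]): string {
  const failureCount = cases.filter((c) => c.score < c.threshold).length;
  const lines: string[] = [
    `<testsuite name="${escapeXml(metric)}" tests="${cases.length}" failures="${failureCount}">`,
  ];
  for (const c of cases) {
    const name = escapeXml(c.sampleId);
    if (c.score < c.threshold) {
      const message = escapeXml(`score ${c.score} below threshold ${c.threshold}`);
      lines.push(`  <testcase name="${name}">`);
      lines.push(`    <failure message="${message}"/>`);
      lines.push("  </testcase>");
    } else {
      lines.push(`  <testcase name="${name}"/>`);
    }
  }
  lines.push("</testsuite>");
  return lines.join("\n");
}
```

A full report would wrap several of these suites in a testsuites root element, one suite per metric.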
Step 8: Build the Langfuse tracer
Trace every eval run in Langfuse for historical comparison. Create src/lib/langfuse-tracer.ts:
ts
import Langfuse from "langfuse";

export function initLangfuse(): unknown {
  try {
    const publicKey = process.env.LANGFUSE_PUBLIC_KEY;
    const secretKey = process.env.LANGFUSE_SECRET_KEY;
    const baseUrl = process.env.LANGFUSE_HOST;
    if (!publicKey || !secretKey) {
      return null;
    }
    const options: Record<string, unknown> = { publicKey, secretKey };
    if (baseUrl) {
      options.baseUrl = baseUrl;
    }
    return new Langfuse(options);
  } catch {
    console.warn("Warning: Failed to initialize Langfuse client");
    return null;
  }
}

export function traceEvalRun(
  client: unknown,
  runId: string,
  input: unknown,
): void {
  if (!client) return;
  (client as Langfuse).trace({ name: "rag-eval-run", id: runId, input });
}

export function finalizeEvalTrace(
  client: unknown,
  output: { passed: boolean; totalCost: number; durationMs: number },
): void {
  if (!client) return;
  (client as Langfuse).trace({
    name: "rag-eval-result",
    output: JSON.stringify(output),
  });
}
Expected output: pnpm typecheck passes. The tracer gracefully returns null when Langfuse credentials aren’t set, so your pipeline works with or without it.
Step 9: Build the gate checker
Quality gates compare evaluation results against thresholds and fail the CI run if any metric drops below the bar. Create src/services/gate-checker.ts:
ts
import { GateEngine, CIIntegration } from "@reaatech/rag-eval-gate";
import {
  type GateConfig,
  type GateResult,
  type EvalResults,
} from "@reaatech/rag-eval-core";

export function createGateEngine(gates: GateConfig[]): GateEngine {
  const engine = new GateEngine();
  engine.loadGates(gates);
  return engine;
}

export function runGates(
  engine: GateEngine,
  results: EvalResults,
  baseline?: EvalResults,
): GateResult {
  try {
    return engine.evaluate(results, baseline);
  } catch {
    const now = new Date().toISOString();
    return {
      passed: false,
      gates: [],
      failures: [{
        gate_name: "evaluation-error",
        metric: "unknown",
        actual: 0,
        expected: 0,
        difference: 0,
      }],
      warnings: [],
      evaluated_at: now,
    };
  }
}

export function formatGateResultForCi(
  result: GateResult,
): { text: string; exitCode: number } {
  try {
    const ci = new CIIntegration();
    const output = ci.generateGitHubActionsOutput(result);
    return { text: output.summary, exitCode: ci.getExitCode(result) };
  } catch (err) {
    return {
      text: `Gate evaluation error: ${String(err)}`,
      exitCode: 1,
    };
  }
}
Expected output: pnpm typecheck passes. The gate engine evaluates threshold gates (like “avg_faithfulness >= 0.85”) against your results. The formatGateResultForCi function produces GitHub Actions-compatible output your CI pipeline can parse.
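To show what a threshold gate such as "avg_faithfulness >= 0.85" amounts to, here is a stand-alone sketch. It is not how GateEngine works internally: the ThresholdGate shape and evaluateThresholdGates are hypothetical. Only the failure fields (gate_name, metric, actual, expected, difference) mirror the fallback object in the listing above.

```typescript
// Illustrative sketch only: gate shape and evaluator are assumptions.

type Operator = ">=" | "<=";

interface ThresholdGate {
  name: string;
  metric: string;
  operator: Operator;
  threshold: number;
}

interface GateFailureSketch {
  gate_name: string;
  metric: string;
  actual: number;
  expected: number;
  difference: number;
}

function evaluateThresholdGates(
  gates: ThresholdGate[],
  metrics: Record<string, number>,
): { passed: boolean; failures: GateFailureSketch[] } {
  const failures: GateFailureSketch[] = [];
  for (const gate of gates) {
    const value = metrics[gate.metric];
    // Assumption: a metric missing from the results counts as a failure with actual = 0.
    const actual = value === undefined ? 0 : value;
    const ok =
      value !== undefined &&
      (gate.operator === ">=" ? actual >= gate.threshold : actual <= gate.threshold);
    if (!ok) {
      failures.push({
        gate_name: gate.name,
        metric: gate.metric,
        actual,
        expected: gate.threshold,
        difference: actual - gate.threshold,
      });
    }
  }
  return { passed: failures.length === 0, failures };
}
```

The passed flag maps directly onto the CI exit code: zero failures means exit 0, anything else means exit 1.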
Step 10: Wire up the eval pipeline
The eval pipeline ties everything together — load dataset, run heuristic scores, optionally run the judge, compute aggregates, evaluate gates. Create src/services/eval-pipeline.ts:
ts
import { readFile } from "node:fs/promises";import { DatasetLoader, DatasetValidator, loadEvalConfig,} from "@reaatech/rag-eval-dataset";import { type GateConfig, type EvalResults, type SampleEvalResult, type EvalSuiteConfig, type CostBreakdown,} from "@reaatech/rag-eval-core";import { generateId } from "@reaatech/llm-cost-telemetry";import { type CliOptions, type EvalRunResult, EvalPipelineError } from "../lib/types.js";import { traceEvalRun, initLangfuse, finalizeEvalTrace,
Expected output: pnpm typecheck passes. This is the heart of the recipe: the pipeline loads a dataset, validates it, computes heuristic scores, conditionally invokes the Perplexity judge, aggregates metrics, runs quality gates, and traces everything to Langfuse.
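The aggregation step is the easiest part of the pipeline to picture in isolation. The sketch below averages per-sample scores into avg_-prefixed metrics like the avg_faithfulness gate mentioned earlier. The pipeline listing is truncated, so the SampleScores shape, the AveragedMetrics type, and the aggregateScores name are assumptions, not the real AggregatedMetrics type from @reaatech/rag-eval-core.

```typescript
// Illustrative sketch only: types and names are assumptions.

interface SampleScores {
  faithfulness: number;
  relevance: number;
  precision: number;
  recall: number;
}

interface AveragedMetrics {
  avg_faithfulness: number;
  avg_relevance: number;
  avg_precision: number;
  avg_recall: number;
}

/** Arithmetic mean of a list of numbers; 0 for an empty list. */
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

/** Collapse per-sample scores into one average per metric. */
function aggregateScores(samples: SampleScores[]): AveragedMetrics {
  return {
    avg_faithfulness: mean(samples.map((s) => s.faithfulness)),
    avg_relevance: mean(samples.map((s) => s.relevance)),
    avg_precision: mean(samples.map((s) => s.precision)),
    avg_recall: mean(samples.map((s) => s.recall)),
  };
}
```

Gates then run against these averages rather than against individual samples, so a single weak answer lowers the score without failing the build on its own.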
Step 11: Build the CLI entry point with Commander
The CLI wraps the pipeline in a clean command-line interface. Create src/index.ts as a simple re-export hub:
ts
export { runEvalPipeline } from "./services/eval-pipeline.js";
export { formatOutput } from "./lib/output-formatter.js";
Now create the CLI runner. You’ll wire it through Commander in src/cli/eval.ts:
ts
import { Command } from "commander";
import { runEvalPipeline } from "../services/eval-pipeline.js";
import { formatOutput } from "../lib/output-formatter.js";
import {
  type CliOptions,
  type FidelityMode,
  type OutputFormat,
} from "../lib/types.js";

interface CommanderOptions {
  dataset: string;
  config?: string;
  fidelity: string;
  output: string;
  baseline?: string;
}

const VALID_FIDELITY: FidelityMode[] = ["heuristic-only", "full-judge"];
const VALID_OUTPUT: OutputFormat[] = ["json", "junit"];

const program = new Command();

program
  .name("perplexity-rag-eval")
  .description(
    "Continuously evaluate your small business RAG knowledge base answer quality with Perplexity-powered LLM judges and CI gating",
  )
  .requiredOption("--dataset <path>", "Path to evaluation dataset (JSONL/JSON/YAML)")
  .option("--config <path>", "Path to eval config YAML", "./eval-config.yaml")
  .option("--fidelity <mode>", "Evaluation fidelity: heuristic-only or full-judge", "heuristic-only")
  .option("--output <format>", "Output format: json or junit", "json")
  .option("--baseline <path>", "Path to baseline results JSON for regression gates")
  .action(async (rawOptions: CommanderOptions) => {
    if (!VALID_FIDELITY.includes(rawOptions.fidelity as FidelityMode)) {
      console.error(`Error: --fidelity must be one of: ${VALID_FIDELITY.join(", ")}`);
      process.exit(2);
    }
    if (!VALID_OUTPUT.includes(rawOptions.output as OutputFormat)) {
      console.error(`Error: --output must be one of: ${VALID_OUTPUT.join(", ")}`);
      process.exit(2);
    }

    let resolvedFidelity: FidelityMode = rawOptions.fidelity as FidelityMode;
    if (resolvedFidelity === "full-judge" && !process.env.PERPLEXITY_API_KEY) {
      console.warn("Warning: PERPLEXITY_API_KEY not set, falling back to heuristic-only");
      resolvedFidelity = "heuristic-only";
    }

    const cliOptions: CliOptions = {
      datasetPath: rawOptions.dataset,
      configPath: rawOptions.config,
      fidelity: resolvedFidelity,
      output: rawOptions.output as OutputFormat,
      baselinePath: rawOptions.baseline,
    };

    try {
      const result = await runEvalPipeline(cliOptions);
      const output = formatOutput(result, cliOptions.output);
      console.log(output);
      process.exit(result.gateResult.passed ? 0 : 1);
    } catch (err) {
      console.error("Error:", (err as Error).message);
      process.exit(2);
    }
  });

program.parse();
Expected output: The CLI validates inputs up front, gracefully falls back to heuristic-only when the Perplexity API key is missing, and exits with code 1 when gates fail — ready for CI integration.
Step 12: Write targeted tests
Test the heuristic scorer to validate the core logic. Create tests/services/heuristic-scorer.test.ts:
Expected output: pnpm test passes all tests. The heuristic scorer tests cover high overlap, empty answers, single-word inputs, whitespace-only context, missing ground truth, and edge cases. The CLI tests mock the pipeline and verify correct exit codes for passing/failing gates, invalid options, pipeline errors, and API key fallback behavior.
Step 13: Run the full quality gate
The package.json already includes test, lint, and typecheck scripts. Run them all:
terminal
pnpm typecheck
pnpm lint
pnpm test
Expected output: all three commands exit 0 — no TypeScript errors, no lint violations, and all tests passing with coverage above 90% across lines, branches, functions, and statements.
Next steps
Add regression gates — store a baseline JSON file from a passing run and pass --baseline baseline.json to catch regressions on every CI run
Deploy as a GitHub Action — wrap perplexity-rag-eval --dataset datasets/latest.jsonl --fidelity full-judge --output junit in a workflow that runs on every PR
Scale to larger datasets — generate synthetic datasets using @reaatech/rag-eval-dataset’s DatasetGenerator to cover hundreds of knowledge base scenarios
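As a starting point for the regression-gate idea in the first bullet, the sketch below compares current averages with a stored baseline and reports any metric that dropped by more than a tolerance. It is independent of @reaatech/rag-eval-gate: the findRegressions function, its 0.02 default tolerance, and the flat metrics-record shape are all assumptions for illustration.

```typescript
// Illustrative sketch only: function, tolerance, and input shape are assumptions.

interface RegressionFailure {
  metric: string;
  baseline: number;
  current: number;
  drop: number;
}

/**
 * Report every metric whose current value fell more than `tolerance`
 * below the baseline. Metrics missing from either side are skipped.
 */
function findRegressions(
  baseline: Record<string, number>,
  current: Record<string, number>,
  tolerance = 0.02,
): RegressionFailure[] {
  const failures: RegressionFailure[] = [];
  for (const metric of Object.keys(baseline)) {
    const before = baseline[metric];
    const after = current[metric];
    if (before === undefined || after === undefined) continue;
    const drop = before - after;
    if (drop > tolerance) {
      failures.push({ metric, baseline: before, current: after, drop });
    }
  }
  return failures;
}
```

A small tolerance absorbs run-to-run noise from the LLM judge, so the build only fails on drops large enough to matter.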