Perplexity RAG Eval Suite for SMB Knowledge Bases

Continuously evaluate your small business RAG knowledge base using Perplexity’s LLM-as-judge, heuristic metrics, and cost-tracked CI gates from REAA’s eval packs.

perplexity rag eval-harness llm-as-judge cli smb knowledge-base ci-gate cost-tracking

The problem

SMBs that deploy internal RAG bots for employee or customer support find their answers drift as documents change. Without automated evaluation, they only discover quality regressions through user complaints, with no reproducible benchmark and no way to track LLM judging costs.

Intro

This tutorial walks you through building a CLI-powered RAG evaluation suite for small business knowledge bases. You’ll use Perplexity as a low-cost LLM-as-judge, heuristic metrics for fast scoring, and CI quality gates to catch answer regressions before your users do. The final tool loads a golden Q&A dataset, runs faithfulness/relevance/precision/recall scoring (heuristic first, then Perplexity-powered judging for ambiguous cases), tracks every cent of LLM spend, and fails a CI pipeline if scores dip below your thresholds.
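The "heuristic first, then judge for ambiguous cases" flow described above can be sketched roughly as follows. This is a minimal illustration, not the artifact's implementation: the token-overlap heuristic, the `Judge` callback, the `band` thresholds, and every identifier here are assumptions introduced for the example.

```typescript
// Hypothetical names throughout; none of these identifiers come from the published artifact.
type Judge = (answer: string, context: string) => Promise<{ score: number; costUsd: number }>;

interface ScoreResult {
  score: number;
  source: "heuristic" | "judge";
  costUsd: number; // 0 when no LLM call was made
}

// Lowercase alphanumeric word tokens; deliberately crude, for illustration only.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Heuristic faithfulness: fraction of answer tokens that also appear in the retrieved context.
function heuristicFaithfulness(answer: string, context: string): number {
  const answerTokens = tokenize(answer);
  if (answerTokens.length === 0) return 0;
  const contextTokens = new Set(tokenize(context));
  const supported = answerTokens.filter((t) => contextTokens.has(t)).length;
  return supported / answerTokens.length;
}

// Call the (paid) judge only when the heuristic score falls inside the ambiguous band.
async function scoreFaithfulness(
  answer: string,
  context: string,
  judge: Judge,
  band: { low: number; high: number } = { low: 0.4, high: 0.8 }, // assumed defaults
): Promise<ScoreResult> {
  const h = heuristicFaithfulness(answer, context);
  if (h < band.low || h > band.high) {
    return { score: h, source: "heuristic", costUsd: 0 };
  }
  const verdict = await judge(answer, context);
  return { score: verdict.score, source: "judge", costUsd: verdict.costUsd };
}
```

Summing `costUsd` across all samples gives the per-run judging spend, and clearly good or clearly bad answers never reach the judge, which is where most of the savings would come from under these assumptions.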

Prerequisites

Node.js >= 22 with pnpm installed (npm install -g pnpm)
A Perplexity API key — sign up at perplexity.ai and create a key
A Langfuse account (optional, for tracing) — create one at langfuse.com
Familiarity with TypeScript and Node.js CLI development — you’ll work in src/ building a command-line evaluation tool

Step 1: Scaffold the project and install dependencies

Start with a Next.js 16 project (used as the TypeScript build system for this CLI tool):

terminal

pnpm create next-app eval-suite --typescript --app --src-dir --import-alias

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

163 kB · 81 tests · 99.6% coverage · vitest passing

SHA-256: c85388c0f986e942088799999f90c6a956219d92ecb70eca419845b00b35c9ba

// Output formatter (the CLI imports it as "../lib/output-formatter.js").
import { EvalRunResult, OutputFormat, SampleEvalResult, GateFailure } from "./types.js";

const METRIC_KEYS = ["faithfulness", "relevance", "context_precision", "context_recall"];

// Escape the five XML special characters. "&" is replaced first so the
// entities produced by the later replacements are not escaped a second time.
function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Read one metric's score from a sample; a missing metric counts as 0.
function getMetricScore(sample: SampleEvalResult, metric: string): number {
  switch (metric) {
    case "faithfulness":
      return sample.faithfulness?.score ?? 0;
    case "relevance":
      return sample.relevance?.score ?? 0;
    case "context_precision":
      return sample.context_precision?.score ?? 0;
    case "context_recall":
      return sample.context_recall?.score ?? 0;
    default:
      return 0;
  }
}

export function formatJsonOutput(result: EvalRunResult): string {
  return JSON.stringify(result, null, 2);
}

// Emit one <testsuite> per metric and one <testcase> per sample within it.
export function formatJunitOutput(result: EvalRunResult): string {
  const { results, gateResult, durationMs } = result;
  const samples = results.samples;

  if (samples.length === 0) {
    return `<?xml version="1.0" encoding="UTF-8"?>\n<testsuites name="rag-eval-suite" tests="0" failures="0" time="0.000"></testsuites>`;
  }

  let suitesXml = "";
  for (const metric of METRIC_KEYS) {
    let testsXml = "";
    let failuresCount = 0;

    for (const sampleResult of samples) {
      const query = sampleResult.sample.query;
      const score = getMetricScore(sampleResult, metric);
      let failureXml = "";

      // Attach every gate failure for this metric to the test case.
      for (const gateFail of gateResult.failures) {
        const gf = gateFail as GateFailure & { message?: string };
        if (gf.metric === metric) {
          const msg: string =
            gf.message ?? `Score ${String(score)} below expected ${String(gf.expected)}`;
          failureXml += `<failure message="${escapeXml(msg)}">${escapeXml(gf.gate_name)}: actual=${String(gf.actual)}, expected=${String(gf.expected)}, diff=${String(gf.difference)}</failure>`;
          failuresCount++;
        }
      }

      testsXml += `<testcase name="${escapeXml(query)}" classname="${escapeXml(metric)}">${failureXml}</testcase>`;
    }

    suitesXml += `<testsuite name="${escapeXml(metric)}" tests="${String(samples.length)}" failures="${String(failuresCount)}">${testsXml}</testsuite>`;
  }

  const totalTests = samples.length * METRIC_KEYS.length;
  const totalFailures = gateResult.failures.length;
  const timeSeconds = (durationMs / 1000).toFixed(3);

  return `<?xml version="1.0" encoding="UTF-8"?>\n<testsuites name="rag-eval-suite" tests="${String(totalTests)}" failures="${String(totalFailures)}" time="${timeSeconds}">${suitesXml}</testsuites>`;
}

// Any format other than "json" falls through to JUnit XML.
export function formatOutput(
  result: EvalRunResult,
  format: OutputFormat,
): string {
  if (format === "json") return formatJsonOutput(result);
  return formatJunitOutput(result);
}
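The formatter imports its types from `./types.js`, which is not shown here. The shapes below are a sketch inferred only from the fields the formatter reads; the artifact's real definitions may carry more fields or different names. `MetricResult` and the fixture values are hypothetical.

```typescript
// Inferred from how the formatter reads these values; the real types.ts may differ.
type OutputFormat = "json" | "junit";

// Hypothetical name for the per-metric result; only `score` is read by the formatter.
interface MetricResult {
  score: number;
}

interface SampleEvalResult {
  sample: { query: string };
  faithfulness?: MetricResult;
  relevance?: MetricResult;
  context_precision?: MetricResult;
  context_recall?: MetricResult;
}

interface GateFailure {
  gate_name: string;
  metric: string;
  actual: number;
  expected: number;
  difference: number;
}

interface EvalRunResult {
  results: { samples: SampleEvalResult[] };
  gateResult: { passed: boolean; failures: GateFailure[] };
  durationMs: number;
}

// Hand-written illustrative fixture (not real evaluation output).
const exampleRun: EvalRunResult = {
  results: {
    samples: [
      {
        sample: { query: "What is the refund window?" },
        faithfulness: { score: 0.5 },
        // relevance, context_precision, context_recall omitted: the formatter treats them as 0
      },
    ],
  },
  gateResult: {
    passed: false,
    failures: [
      {
        gate_name: "min_faithfulness", // hypothetical gate name
        metric: "faithfulness",
        actual: 0.5,
        expected: 0.75,
        difference: -0.25,
      },
    ],
  },
  durationMs: 1250,
};

const chosenFormat: OutputFormat = "junit";
```

Passing a value shaped like `exampleRun` to `formatOutput` with `"junit"` would produce four test suites (one per metric key), with the failure attached under the `faithfulness` suite.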

// CLI entry point built with Commander.
import { Command } from "commander";
import { runEvalPipeline } from "../services/eval-pipeline.js";
import { formatOutput } from "../lib/output-formatter.js";
import {
  type CliOptions,
  type FidelityMode,
  type OutputFormat,
} from "../lib/types.js";

interface CommanderOptions {
  dataset: string;
  config?: string;
  fidelity: string;
  output: string;
  baseline?: string;
}

const VALID_FIDELITY: FidelityMode[] = ["heuristic-only", "full-judge"];
const VALID_OUTPUT: OutputFormat[] = ["json", "junit"];

const program = new Command();

program
  .name("perplexity-rag-eval")
  .description(
    "Continuously evaluate your small business RAG knowledge base answer quality with Perplexity-powered LLM judges and CI gating",
  )
  .requiredOption("--dataset <path>", "Path to evaluation dataset (JSONL/JSON/YAML)")
  .option("--config <path>", "Path to eval config YAML", "./eval-config.yaml")
  .option("--fidelity <mode>", "Evaluation fidelity: heuristic-only or full-judge", "heuristic-only")
  .option("--output <format>", "Output format: json or junit", "json")
  .option("--baseline <path>", "Path to baseline results JSON for regression gates")
  .action(async (rawOptions: CommanderOptions) => {
    // Exit code 2 signals a usage or runtime error, as opposed to a failed gate (1).
    if (!VALID_FIDELITY.includes(rawOptions.fidelity as FidelityMode)) {
      console.error(`Error: --fidelity must be one of: ${VALID_FIDELITY.join(", ")}`);
      process.exit(2);
    }
    if (!VALID_OUTPUT.includes(rawOptions.output as OutputFormat)) {
      console.error(`Error: --output must be one of: ${VALID_OUTPUT.join(", ")}`);
      process.exit(2);
    }

    // Without an API key the judge cannot run, so degrade to heuristic scoring.
    let resolvedFidelity: FidelityMode = rawOptions.fidelity as FidelityMode;
    if (resolvedFidelity === "full-judge" && !process.env.PERPLEXITY_API_KEY) {
      console.warn("Warning: PERPLEXITY_API_KEY not set, falling back to heuristic-only");
      resolvedFidelity = "heuristic-only";
    }

    const cliOptions: CliOptions = {
      datasetPath: rawOptions.dataset,
      configPath: rawOptions.config,
      fidelity: resolvedFidelity,
      output: rawOptions.output as OutputFormat,
      baselinePath: rawOptions.baseline,
    };

    try {
      const result = await runEvalPipeline(cliOptions);
      const output = formatOutput(result, cliOptions.output);
      console.log(output);
      // 0 when every gate passed, 1 when at least one gate failed.
      process.exit(result.gateResult.passed ? 0 : 1);
    } catch (err) {
      console.error("Error:", (err as Error).message);
      process.exit(2);
    }
  });

program.parse();
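The CLI exits with 0 or 1 depending on `gateResult.passed`, but the gate logic itself lives in the pipeline and is not shown here. A minimal sketch of a threshold gate that would produce a compatible result is given below. The function name `checkThresholdGates`, the `min_<metric>` gate naming, the mean-based aggregation, and the sign convention for `difference` are all assumptions made for illustration.

```typescript
// Field names mirror what the formatter reads; everything else is hypothetical.
interface GateFailure {
  gate_name: string;
  metric: string;
  actual: number;
  expected: number;
  difference: number; // assumed convention: actual - expected (negative on failure)
}

interface GateResult {
  passed: boolean;
  failures: GateFailure[];
}

// Arithmetic mean; an empty list counts as 0 so a metric with no scores fails its gate.
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Compare each metric's mean score against its minimum threshold.
function checkThresholdGates(
  scoresByMetric: Record<string, number[]>,
  thresholds: Record<string, number>,
): GateResult {
  const failures: GateFailure[] = [];
  for (const [metric, expected] of Object.entries(thresholds)) {
    const actual = mean(scoresByMetric[metric] ?? []);
    if (actual < expected) {
      failures.push({
        gate_name: `min_${metric}`,
        metric,
        actual,
        expected,
        difference: actual - expected,
      });
    }
  }
  return { passed: failures.length === 0, failures };
}

// Usage: relevance averages 0.375, below its 0.5 threshold, so the gate fails.
const gate = checkThresholdGates(
  { faithfulness: [1, 0.5], relevance: [0.5, 0.25] },
  { faithfulness: 0.5, relevance: 0.5 },
);
const exitCode = gate.passed ? 0 : 1; // same mapping the CLI uses
```

In a CI job, a non-zero exit code from the CLI is what fails the pipeline step; the JUnit output then shows which metric tripped the gate.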

Intro

Prerequisites

Step 1: Scaffold the project and install dependencies

Step 2: Define the evaluation types

Step 3: Build the heuristic scorer

Step 4: Build the Perplexity judge adapter

Step 5: Build the judge scorer service

Step 6: Build the cost tracker

Step 7: Build the output formatter

Step 8: Build the Langfuse tracer

Step 9: Build the gate checker

Step 10: Wire up the eval pipeline

Step 11: Build the CLI entry point with Commander

Step 12: Write targeted tests

Step 13: Run the full quality gate

Next steps