SMB support teams rely on RAG chatbots to handle customer questions, but hallucinations or irrelevant answers slip through unnoticed, damaging trust. They have no systematic way to continuously measure answer quality and catch regressions before customers do.
A complete, working implementation of this recipe is available as a zip download or browsable file by file. It is generated by our build pipeline and tested with full coverage before publishing.
This tutorial walks you through building an automated RAG evaluation harness for a customer support chatbot. You’ll create a Next.js API that scores RAG answer quality on four metrics (faithfulness, relevance, context precision, context recall) using AWS Bedrock as a judge LLM, tracks evaluation spend with configurable budgets, and gates CI/CD deployments when quality dips below defined thresholds. Evaluation traces and scores are pushed to Langfuse for dashboarding and alerting.
This is for developers who run customer-facing RAG chatbots and need a systematic way to catch answer quality regressions before they reach users.
Prerequisites
Node.js >= 22 and pnpm 10 installed
An AWS account with Bedrock access (Claude Sonnet 4 or another compatible model enabled)
A Langfuse account (the free tier works) with a public and secret key
Basic familiarity with TypeScript, Next.js App Router, and AWS SDK
AWS credentials configured in your environment (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION)
Step 1: Scaffold the Next.js project
Create the project directory and initialize a Next.js 16+ App Router project with the required dependencies:
Expected output: A clean project directory with node_modules/ and working TypeScript compilation.
Step 2: Configure environment variables
Create a .env.example file with placeholders for every credential the harness needs:
env
# Env vars used by aws-bedrock-rag-eval-harness-for-smb-customer-support-bots.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only; never commit real values.
NODE_ENV=development
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=<your-aws-access-key>
AWS_SECRET_ACCESS_KEY=<your-aws-secret>
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_HOST=<your-langfuse-host>
EVAL_DATASET_PATH=./datasets/eval-samples.jsonl
EVAL_BUDGET_LIMIT=10.00
JUDGE_MODEL_ID=anthropic.claude-sonnet-4-v1:0
PORT=3000
Copy it to .env and fill in your real values:
terminal
cp .env.example .env
Expected output: The .env file is ready. Never commit it — add .env to .gitignore.
Step 3: Define the types
Create src/types.ts with the TypeScript interfaces for your API request/response shapes and app configuration:
Expected output: A clean interface that centralizes every shape your modules will exchange. AppConfig is populated from environment variables in the next step.
Step 4: Create the configuration loader
Create src/config.ts to read environment variables and produce typed configuration objects:
Expected output: loadAppConfig() returns sensible defaults when env vars aren’t set and reads overrides when they are. loadDefaultGates() defines the three quality thresholds used in CI.
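To make the defaults-plus-overrides behavior concrete, here is a minimal sketch of such a loader. The `AppConfigSketch` shape, the `loadAppConfigSketch` and `readNumber` names, and the fallback values (copied from the `.env.example` placeholders) are assumptions for illustration; the recipe's actual `src/config.ts` may differ.

```ts
// Hypothetical config shape; the real AppConfig in src/types.ts may differ.
interface AppConfigSketch {
  awsRegion: string;
  datasetPath: string;
  budgetLimit: number;
  judgeModelId: string;
  port: number;
}

// Any key/value source works here, e.g. process.env in a Node.js runtime.
type EnvSource = Record<string, string | undefined>;

// Parse a numeric env var, falling back when it is unset, blank, or not finite.
function readNumber(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback;
}

// Defaults mirror the placeholders in .env.example (an assumption of this sketch).
function loadAppConfigSketch(env: EnvSource): AppConfigSketch {
  return {
    awsRegion: env.AWS_REGION ?? "us-east-1",
    datasetPath: env.EVAL_DATASET_PATH ?? "./datasets/eval-samples.jsonl",
    budgetLimit: readNumber(env.EVAL_BUDGET_LIMIT, 10),
    judgeModelId: env.JUDGE_MODEL_ID ?? "anthropic.claude-sonnet-4-v1:0",
    port: readNumber(env.PORT, 3000),
  };
}
```

Passing the environment in as an argument, rather than reading `process.env` inside the function, keeps the loader easy to unit test with a plain object.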
Step 5: Build the dataset manager
Create src/lib/dataset.ts. This wraps the @reaatech/rag-eval-dataset package to load, validate, and version-control evaluation samples:
ts
import {
  DatasetLoader,
  DatasetValidator,
  DatasetVersioning,
  loadEvalConfig,
} from "@reaatech/rag-eval-dataset";
import {
  type EvaluationSample,
  type EvalSuiteConfig,
  type DatasetVersion,
} from "@reaatech/rag-eval-core";

export class EvalDatasetManager {
  private readonly loader: DatasetLoader;
  private readonly validator: DatasetValidator;
  private readonly versioning: DatasetVersioning;

  constructor() {
    this.loader = new DatasetLoader();
    this.validator = new DatasetValidator();
    this.versioning = new DatasetVersioning();
  }

  async loadDataset(path: string): Promise<EvaluationSample[]> {
    return this.loader.load(path);
  }

  validateDataset(
    samples: EvaluationSample[],
  ): { valid: boolean; errors: Array<{ field: string; message: string }> } {
    const result = this.validator.validate(samples);
    return { valid: result.valid, errors: result.errors };
  }

  trackVersion(version: string, description: string): void {
    this.versioning.createVersion([], { version, description });
  }

  getVersionHistory(): DatasetVersion[] {
    return this.versioning.getAllVersions();
  }

  async loadSuiteConfig(path: string): Promise<EvalSuiteConfig> {
    return loadEvalConfig(path);
  }
}

export function createDatasetManager(): EvalDatasetManager {
  return new EvalDatasetManager();
}
Expected output: A reusable manager that decouples your API routes from the dataset library’s constructor details.
Step 6: Create the AWS Bedrock judge adapter
Create src/lib/judge.ts. This is the core module — it sends evaluation samples to Bedrock via the Converse API, asks the model to score the RAG output on four metrics, and parses the JSON response:
ts
import {
  type EvaluationSample,
  type SampleEvalResult,
  type JudgeConfig,
} from "@reaatech/rag-eval-core";
import { BedrockRuntimeClient, ConverseCommand } from "@aws-sdk/client-bedrock-runtime";

export class JudgeAdapter {
  private readonly model: string;
  private readonly enabled: boolean;
  private readonly client: BedrockRuntimeClient;

  constructor(config: JudgeConfig, client: BedrockRuntimeClient) {
    this.model = config.model ?? "anthropic.claude-sonnet-4-v1:0";
Expected output: The judge adapter builds an LLM prompt from a RAG sample, calls Bedrock’s Converse API with temperature 0 for deterministic scoring, and parses the four metric scores from the response. It gracefully handles disabled mode, malformed JSON, out-of-range values, and API errors.
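The handling of malformed JSON and out-of-range values can be sketched as a small pure function. This is an illustration under stated assumptions, not the recipe's parser: the `MetricScores` shape, the `parseJudgeScores` and `clampScore` names, and the snake_case JSON keys are hypothetical.

```ts
// Hypothetical score shape for this sketch; the recipe's SampleEvalResult may differ.
interface MetricScores {
  faithfulness: number;
  relevance: number;
  contextPrecision: number;
  contextRecall: number;
}

// Clamp a value into [0, 1]; anything that is not a finite number becomes 0.
function clampScore(value: unknown): number {
  if (typeof value !== "number" || !Number.isFinite(value)) return 0;
  return Math.min(1, Math.max(0, value));
}

// Take the span from the first "{" to the last "}" in the model text and parse it.
// Returns null when no JSON object can be recovered from the text.
function parseJudgeScores(text: string): MetricScores | null {
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(text.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return null;
  }

  // The JSON key names below are assumed; match them to your judge prompt.
  const record = parsed as Record<string, unknown>;
  return {
    faithfulness: clampScore(record.faithfulness),
    relevance: clampScore(record.relevance),
    contextPrecision: clampScore(record.context_precision),
    contextRecall: clampScore(record.context_recall),
  };
}
```

Returning `null` for unparseable output lets the caller decide whether to retry, skip the sample, or record a failed evaluation, instead of silently reporting zeros.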
Step 7: Build the cost tracker
Create src/lib/cost.ts to track per-sample evaluation costs and enforce budgets:
Expected output: The cost manager tracks spending at a granular per-sample level, alerts at 50%, 75%, and 90% of budget, and supports a hard stop that halts evaluation runs when the budget is exhausted.
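A minimal sketch of the alert-and-hard-stop logic described above follows. The `BudgetTrackerSketch` class and its method names are hypothetical and stand in for whatever `src/lib/cost.ts` actually exposes; only the 50%, 75%, and 90% levels come from the text.

```ts
// Alert thresholds as fractions of the budget (the 50%, 75%, and 90% levels).
const ALERT_LEVELS: readonly number[] = [0.5, 0.75, 0.9];

class BudgetTrackerSketch {
  private spent = 0;
  private readonly fired = new Set<number>();
  private readonly limit: number;
  private readonly hardStop: boolean;

  constructor(limit: number, hardStop: boolean = true) {
    this.limit = limit;
    this.hardStop = hardStop;
  }

  // Record one sample's cost and return the alert levels crossed for the first time.
  record(cost: number): number[] {
    this.spent += cost;
    const crossed: number[] = [];
    for (const level of ALERT_LEVELS) {
      if (this.spent >= this.limit * level && !this.fired.has(level)) {
        this.fired.add(level);
        crossed.push(level);
      }
    }
    return crossed;
  }

  // With hardStop enabled, the run should halt once the budget is used up.
  shouldStop(): boolean {
    return this.hardStop && this.spent >= this.limit;
  }

  totalSpent(): number {
    return this.spent;
  }
}
```

An evaluation loop would call `record()` after each judged sample and check `shouldStop()` before starting the next one. Each alert level fires only once per run, so a single expensive sample can report several levels at the same time.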
Step 8: Create the quality gate
Create src/lib/gate.ts. This evaluates evaluation results against quality thresholds and integrates with CI pipelines:
ts
import { readFileSync } from "node:fs";
import { GateEngine, CIIntegration } from "@reaatech/rag-eval-gate";
import { type EvalResults, type GateConfig, type GateResult } from "@reaatech/rag-eval-core";

export class QualityGate {
  private readonly engine: GateEngine;
  private readonly ci: CIIntegration;

  constructor() {
    this.engine = new GateEngine();
    this.ci = new CIIntegration();
  }

  loadGates(gates: GateConfig[]): void {
    this.engine.loadGates(gates);
  }

  evaluate(results: EvalResults, baseline?: EvalResults): GateResult {
    return this.engine.evaluate(results, baseline);
  }

  setBaseline(baseline: EvalResults): void {
    this.engine.setBaseline(baseline);
  }

  formatResult(result: GateResult): string {
    return this.ci.generateMarkdownReport(result);
  }

  getExitCode(result: GateResult): number {
    return this.ci.getExitCode(result);
  }
}

export function createQualityGate(): QualityGate {
  return new QualityGate();
}

export function runCiGateFile(
  resultsPath: string,
  gatesConfigPath: string,
  baselinePath?: string,
): Promise<number> {
  const gate = new QualityGate();

  const gatesRaw = readFileSync(gatesConfigPath, "utf-8");
  const gates: GateConfig[] = JSON.parse(gatesRaw) as GateConfig[];
  gate.loadGates(gates);

  const resultsRaw = readFileSync(resultsPath, "utf-8");
  const results: EvalResults = JSON.parse(resultsRaw) as EvalResults;

  let baseline: EvalResults | undefined;
  if (baselinePath !== undefined) {
    const baselineRaw = readFileSync(baselinePath, "utf-8");
    baseline = JSON.parse(baselineRaw) as EvalResults;
  }

  const gateResult = gate.evaluate(results, baseline);
  const output = gate.formatResult(gateResult);
  process.stdout.write(output + "\n");
  return Promise.resolve(gate.getExitCode(gateResult));
}
Expected output: QualityGate wraps the gate engine for programmatic use. runCiGateFile is the CI entry point: it reads results and gate config from JSON files, evaluates, prints a Markdown report to stdout, and returns a non-zero exit code when thresholds fail.
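To show what a threshold check and its exit code amount to, here is a simplified sketch that does not use `@reaatech/rag-eval-gate`. The `GateSketch` and `GateOutcome` shapes and the function names are hypothetical; the library's `GateConfig` and `GateResult` types may carry more fields, such as baseline regression rules.

```ts
// Hypothetical gate shape: a metric name and the minimum acceptable score.
interface GateSketch {
  metric: string;
  min: number;
}

interface GateOutcome {
  metric: string;
  min: number;
  actual: number;
  passed: boolean;
}

// Compare aggregated metric scores against each gate's minimum.
function evaluateGatesSketch(
  metrics: Record<string, number | undefined>,
  gates: GateSketch[],
): GateOutcome[] {
  return gates.map((gate) => {
    const actual = metrics[gate.metric];
    // A missing metric fails the gate rather than passing silently.
    const passed = actual !== undefined && actual >= gate.min;
    return { metric: gate.metric, min: gate.min, actual: actual ?? Number.NaN, passed };
  });
}

// CI convention: exit code 0 means every gate passed, 1 means at least one failed.
function exitCodeForSketch(outcomes: GateOutcome[]): number {
  return outcomes.every((outcome) => outcome.passed) ? 0 : 1;
}
```

In a CI job, the process would exit with the returned code so that a failing gate blocks the deployment step that follows.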
Step 9: Wire up observability with Langfuse
Create src/lib/observability.ts to push evaluation traces and scores to Langfuse:
Expected output: The observability manager gracefully degrades when Langfuse credentials are missing — trace() returns null and score() becomes a no-op. When keys are present, traces and metric scores are sent to Langfuse for dashboarding.
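The graceful-degradation pattern can be sketched without the Langfuse SDK by injecting a small sink interface. Everything here (`TraceSink`, `ObservabilitySketch`, the factory argument) is a hypothetical stand-in; in the recipe, the sink's role is played by the Langfuse client.

```ts
// Minimal stand-in for a tracing backend; the recipe itself uses the Langfuse SDK.
interface TraceSink {
  createTrace(traceName: string): string; // returns a trace id
  recordScore(traceId: string, metric: string, value: number): void;
}

class ObservabilitySketch {
  private readonly sink: TraceSink | null;

  constructor(
    publicKey: string | undefined,
    secretKey: string | undefined,
    makeSink: () => TraceSink,
  ) {
    // Only build the sink when both keys are present and non-empty.
    this.sink = publicKey && secretKey ? makeSink() : null;
  }

  // Returns a trace id, or null when observability is disabled.
  trace(traceName: string): string | null {
    return this.sink ? this.sink.createTrace(traceName) : null;
  }

  // No-op when disabled or when the caller has no trace id.
  score(traceId: string | null, metric: string, value: number): void {
    if (!this.sink || traceId === null) return;
    this.sink.recordScore(traceId, metric, value);
  }
}
```

Because `score()` accepts a `null` trace id, calling code can pass the result of `trace()` straight through without branching on whether credentials were configured.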
Step 10: Create the CLI runner
Create src/lib/runner.ts as an entry point for running evaluations from the command line:
ts
import { EvaluationSuite } from "@reaatech/rag-eval-cli";
import type { EvalResults, EvalSuiteConfig } from "@reaatech/rag-eval-core";

export async function runFromCLI(
  datasetPath: string,
  config: EvalSuiteConfig,
): Promise<EvalResults> {
  const suite = new EvaluationSuite(config);
  const result = await suite.runFromFile(datasetPath);
  return result.results;
}
Expected output: runFromCLI() lets you trigger evaluations from scripts or CLI tools without starting the Next.js dev server.
Step 11: Build the API route handlers
Create app/api/evals/route.ts — the endpoint that accepts POST requests to trigger evaluation runs and GET requests for health checks:
ts
import { type NextRequest, NextResponse } from "next/server";
import {
  type SampleEvalResult,
  type EvalResults,
  type GateConfig,
  type EvalSuiteConfig,
} from "@reaatech/rag-eval-core";
import {
  loadAppConfig,
  loadJudgeConfig,
  loadDefaultGates,
  loadEvalSuiteConfigFromPath,
} from "../../../src/config.js";
import { createDatasetManager } from "../../../src/lib/dataset.js";
import { createQualityGate } from "../../../src/lib/gate.js";
import { createObservabilityManager } from "../../../src/lib/observability.js";
import { createJudgeAdapter } from "../../../src/lib/judge.js";
import { createEvalCostManager } from "../../../src/lib/cost.js";
import crypto from
Create app/api/evals/cost/route.ts for the cost report endpoint:
Expected output: Three API endpoints — POST /api/evals triggers a full evaluation run, GET /api/evals returns a health status, and GET /api/evals/cost returns a cost report and breakdown.
Step 12: Prepare the evaluation dataset
Create datasets/eval-samples.jsonl with sample question-answer-context triples representing common customer support questions:
jsonl
{"query":"What is the return policy?","context":["Our return policy allows returns within 30 days of purchase.","Items must be in original condition with receipt."],"ground_truth":"Returns are accepted within 30 days with original receipt and condition.","generated_answer":"You can return items within 30 days as long as you have the receipt."}
{"query":"How do I track my order?","context":["Order tracking is available in your account dashboard under 'My Orders'.","You will receive a tracking number via email once your order ships."],"ground_truth":"Track your order through the account dashboard or via the tracking number emailed after shipment.","generated_answer":"You can track your order in your account dashboard."}
{"query":"Do you offer international shipping?","context":["We ship to over 50 countries worldwide.","International shipping rates vary by destination and are calculated at checkout.","Delivery times range from 5-14 business days depending on the destination."],"ground_truth":"We ship to 50+ countries with rates calculated at checkout and delivery in 5-14 business days.","generated_answer":"International shipping is available and rates are shown at checkout."}
{"query":"Can I change my subscription plan?","context":["You can upgrade or downgrade your subscription at any time from the Billing settings.","Changes take effect at the start of the next billing cycle.","Price differences are prorated for the remainder of the current cycle."],"ground_truth":"Subscription changes are made in Billing settings, take effect next billing cycle, and are prorated.","generated_answer":"Yes, you can change your plan anytime in Billing settings."}
{"query":"What happens when my free trial ends?","context":["After your 14-day free trial, your account will be downgraded to a free tier with limited features.","You can upgrade to a paid plan at any time to regain full access.","Your data is retained for 30 days after the trial ends before being archived."],"ground_truth":"After the 14-day trial, accounts downgrade to a limited free tier; data is retained for 30 days before archiving.","generated_answer":"Your free trial ends after 14 days and you will need to upgrade to continue using all features."}
Expected output: Five customer-support Q&A samples with known ground truth answers and AI-generated answers. The YAML files define which metrics to evaluate and which quality thresholds to enforce.
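Before running an evaluation, it can help to sanity-check the JSONL file yourself. The sketch below parses one JSON object per line and checks the four fields used in the samples above (`query`, `context`, `ground_truth`, `generated_answer`). The `SampleSketch` type and the function names are hypothetical and independent of the `DatasetValidator` in `@reaatech/rag-eval-dataset`.

```ts
// Field names follow the sample records in datasets/eval-samples.jsonl.
interface SampleSketch {
  query: string;
  context: string[];
  ground_truth: string;
  generated_answer: string;
}

// Runtime type guard: every field must be present with the expected type.
function isSampleSketch(value: unknown): value is SampleSketch {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record.query === "string" &&
    Array.isArray(record.context) &&
    record.context.every((item) => typeof item === "string") &&
    typeof record.ground_truth === "string" &&
    typeof record.generated_answer === "string"
  );
}

// Parse JSONL text: one JSON object per non-blank line, collecting errors per line.
function parseJsonlSketch(text: string): { samples: SampleSketch[]; errors: string[] } {
  const samples: SampleSketch[] = [];
  const errors: string[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // skip blank lines
    try {
      const parsed: unknown = JSON.parse(line);
      if (isSampleSketch(parsed)) {
        samples.push(parsed);
      } else {
        errors.push(`line ${index + 1}: missing or mistyped field`);
      }
    } catch {
      errors.push(`line ${index + 1}: invalid JSON`);
    }
  });
  return { samples, errors };
}
```

Reporting errors by line number makes it straightforward to find a record that was pasted with a stray line break or a missing field.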
Step 13: Write and run the tests
Create tests/lib/judge.test.ts — the most important test file, since the judge adapter is the core of the harness. It mocks the Bedrock client using aws-sdk-client-mock:
ts
import { describe, it, expect, vi, beforeEach } from "vitest";
import { mockClient } from "aws-sdk-client-mock";
import { BedrockRuntimeClient, ConverseCommand } from "@aws-sdk/client-bedrock-runtime";
import { JudgeAdapter, createJudgeAdapter } from "../../src/lib/judge.js";
import type { EvaluationSample, JudgeConfig } from "@reaatech/rag-eval-core";

const bedrockMock = mockClient(BedrockRuntimeClient);

const sample: EvaluationSample = {
  query: "What is RAG?",
  context: ["Retrieval Augmented Generation is a technique."],
  ground_truth: "RAG is a technique for enhancing LLMs with external knowledge.",
  generated_answer: "RAG stands for Retrieval Augmented Generation.",
Then create tests/api/evals/route.test.ts to test the API route handler end to end with mocked dependencies. The full test file is available in the repository; here are the key tests:
pnpm vitest run --coverage --reporter=json --outputFile=vitest-report.json
Expected output: All tests pass with at least 90% line, branch, function, and statement coverage on runtime code (src/**/*.ts and app/**/route.ts). The mock-based approach means no real AWS or Langfuse API calls are made during tests.
Step 14: Start the dev server and trigger an evaluation
terminal
pnpm dev
In another terminal, send a POST request to trigger an evaluation:
Expected output: The API responds with a JSON object containing results (per-sample scores, aggregated metrics, cost breakdown), gates (pass/fail status for each quality threshold), and costBreakdown. The response should look like:
Add Slack or Discord alerts — wire the gate result to send a notification when CI gates fail, so your team knows immediately when answer quality drops.
Schedule nightly evaluations — use cron or GitHub Actions scheduled workflows to run the evaluation suite every night and compare results against a stored baseline.
Expand the metrics — add custom metrics like answer conciseness, tone consistency, or brand-voice adherence by extending the judge prompt and parser.
Deploy to production — run this harness as a sidecar service alongside your customer support bot, triggered by each deployment pipeline step.
    this.enabled = config.enabled ?? true;
    this.client = client;
  }

  buildPrompt(sample: EvaluationSample): string {
    const contextText = sample.context.length > 0
      ? sample.context.join("\n")
      : "(no context provided)";
    return `You are an expert RAG evaluator. Score the following RAG system output on four metrics on a 0-1 scale.