Perplexity Agent Eval Harness for SMB AI Quality Assurance

Run continuous, automated evaluations of your customer‑facing AI agents using Perplexity as a neutral LLM judge, with version‑gated prompt promotions.

perplexity eval-harness ai-agents quality-assurance cli typescript nextjs langfuse prompt-version-control

The problem

Small businesses deploying AI chat or email agents struggle to know when an update breaks quality—manual testing doesn't scale, and proprietary LLM judges are expensive to use at volume.

Intro

In this tutorial, you’ll build a CLI-based AI agent evaluation harness that uses Perplexity as a neutral LLM judge to score your customer-facing AI agents. You’ll wire up golden test case datasets, feed them to your agent under test, get judgment scores from Perplexity, compute classification metrics, gate prompt-version promotions based on threshold checks, lint agent definition files, and stream results to Langfuse for observability dashboards. The final pipeline runs as a Next.js API route and a standalone CLI command that’s CI-ready.

This recipe is for anyone deploying AI chat or email agents at small-to-medium businesses who needs automated quality gating without the cost of proprietary judge LLMs.
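As a rough illustration of the "compute classification metrics" step mentioned above, the sketch below treats each judged test case as a binary prediction (a judge score at or above a cutoff counts as a pass) and compares it with a label from the golden dataset. The `JudgedCase` shape, the `passCutoff` parameter, and the function names are assumptions made for this example; the recipe's own metrics engine may define things differently.

```typescript
// Hypothetical shape: not taken from the recipe's src/lib/types.ts.
interface JudgedCase {
  id: string;
  expectedPass: boolean; // ground-truth label from the golden dataset
  judgeScore: number; // judge score, assumed to lie in [0, 1]
}

interface ClassificationMetrics {
  accuracy: number;
  precision: number;
  recall: number;
  f1: number;
}

// Divide safely: a metric with an empty denominator is reported as 0.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function computeMetrics(
  cases: JudgedCase[],
  passCutoff: number
): ClassificationMetrics {
  let tp = 0;
  let fp = 0;
  let tn = 0;
  let fn = 0;
  for (const c of cases) {
    const predictedPass = c.judgeScore >= passCutoff;
    if (predictedPass && c.expectedPass) tp++;
    else if (predictedPass && !c.expectedPass) fp++;
    else if (!predictedPass && c.expectedPass) fn++;
    else tn++;
  }
  const precision = ratio(tp, tp + fp);
  const recall = ratio(tp, tp + fn);
  return {
    accuracy: ratio(tp + tn, cases.length),
    precision,
    recall,
    f1: ratio(2 * precision * recall, precision + recall),
  };
}
```

Reporting 0 for an empty denominator keeps the output JSON-serializable (no `NaN`), at the cost of hiding the difference between "no data" and "all wrong"; a stricter variant could return `null` instead.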

Prerequisites

Node.js 22+ and pnpm 10
A Perplexity API key — set as PERPLEXITY_API_KEY
Langfuse account (free tier is fine) — for telemetry dashboards
Basic familiarity with TypeScript, Next.js App Router patterns, and asynchronous pipelines
A running agent under test (any HTTP endpoint that accepts POST with {"input": "...", ...} and returns a response)
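To make the last prerequisite concrete, here is a minimal sketch of a client for such an endpoint. It assumes only what is stated above: the agent accepts a POST whose JSON body contains an `input` field. The function names, the `FetchLike` type, and the decision to return the raw response text are illustrative choices, since the response format is not specified here.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings.
interface HttpResponseLike {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}

interface HttpRequestInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}

type FetchLike = (
  url: string,
  init: HttpRequestInit
) => Promise<HttpResponseLike>;

// Build the POST request for one test input. Extra fields are merged in
// because the endpoint contract allows additional keys next to "input".
function buildAgentRequest(
  input: string,
  extra: Record<string, unknown> = {}
): HttpRequestInit {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ ...extra, input }),
  };
}

// Send one input to the agent and return the raw response body.
// The body is returned as text rather than parsed, because its format
// is an unknown in this sketch.
async function callAgent(
  fetchImpl: FetchLike,
  endpoint: string,
  input: string
): Promise<string> {
  const response = await fetchImpl(endpoint, buildAgentRequest(input));
  if (!response.ok) {
    throw new Error(`Agent returned HTTP ${response.status}`);
  }
  return response.text();
}
```

Passing the fetch implementation as a parameter lets tests substitute a stub; in Node.js 22 the built-in `fetch` should be usable as `fetchImpl` for real calls.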

Step 1: Review the project layout

The scaffold has already been created for you. Let’s orient ourselves by looking at the file layout:

app/api/eval/route.ts      — webhook-triggered evaluation
src/
  index.ts                 — CLI entrypoint
  lib/
    types.ts               — shared interfaces
    config.ts              — configuration loader (zod-validated)
    eval-pipeline.ts       — central orchestrator
  services/
    golden-dataset.ts      — golden trajectory management
    agent-under-test.ts    — agent HTTP client
    judge-service.ts       — Perplexity-as-judge bridge
    classifier-metrics.ts  — classification metrics engine
    pvc-service.ts         — prompt version control client
    markdown-linter.ts     — agent definition linter
    langfuse-exporter.ts   — observability exporter
tests/                     — Vitest suite (mirrors src/)
packages/                  — API references for every dependency
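The judge bridge is the piece of this layout that the rest of the harness depends on, so a small sketch may help before the later steps. It assumes the judge is called through Perplexity's OpenAI-compatible chat completions endpoint (POST https://api.perplexity.ai/chat/completions with a Bearer token) and is instructed to reply with a JSON object. The prompt wording, the 0 to 1 score scale, and the function names are assumptions for illustration, not the contents of the recipe's `judge-service.ts`.

```typescript
// Assumed reply format: {"score": 0.8, "explanation": "..."}.
interface JudgeVerdict {
  score: number; // clamped to [0, 1]
  explanation: string;
}

interface ChatRequestBody {
  model: string;
  messages: Array<{ role: "system" | "user"; content: string }>;
}

// Build the request body for one judgment. The model name is supplied
// by the caller (for example from configuration).
function buildJudgeRequest(
  model: string,
  input: string,
  agentOutput: string,
  expectedOutput?: string
): ChatRequestBody {
  const reference = expectedOutput
    ? `\nReference answer: ${expectedOutput}`
    : "";
  return {
    model,
    messages: [
      {
        role: "system",
        content:
          "You are an impartial evaluator. Reply only with JSON: " +
          '{"score": <number between 0 and 1>, "explanation": "<one sentence>"}.',
      },
      {
        role: "user",
        content: `Customer message: ${input}\nAgent reply: ${agentOutput}${reference}`,
      },
    ],
  };
}

// Take the span from the first "{" to the last "}" in the reply,
// parse it as JSON, and validate the score.
function parseJudgeReply(content: string): JudgeVerdict {
  const match = content.match(/\{[\s\S]*\}/);
  if (!match) {
    throw new Error("Judge reply contained no JSON object");
  }
  const record = JSON.parse(match[0]) as Record<string, unknown>;
  const score = record["score"];
  if (typeof score !== "number" || Number.isNaN(score)) {
    throw new Error("Judge reply has no numeric score");
  }
  const explanation = record["explanation"];
  return {
    score: Math.min(1, Math.max(0, score)),
    explanation: typeof explanation === "string" ? explanation : "",
  };
}
```

Keeping request construction and reply parsing as pure functions means both can be unit-tested without network access; only a thin wrapper that sends the body and reads `choices[0].message.content` from the response would need the API key.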

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

192 kB · 122 tests · 99.7% coverage · Vitest passing

SHA-256: 94237642b7cb2b5b36a41d2a90e909b78adaead8b104d835a04834720caf4edf

// src/services/langfuse-exporter.ts
import { Langfuse } from "langfuse";

export class LangfuseExporter {
  private client: Langfuse;
  private baseUrl: string;
  private credentialsValid: boolean;

  constructor(config: {
    publicKey: string;
    secretKey: string;
    baseUrl?: string;
  }) {
    this.baseUrl = config.baseUrl ?? "https://cloud.langfuse.com";
    // An empty or masked ("***") secret key means Langfuse is not configured.
    this.credentialsValid =
      config.secretKey !== "***" && config.secretKey.length > 0;
    this.client = new Langfuse({
      publicKey: config.publicKey,
      secretKey: config.secretKey,
      baseUrl: this.baseUrl,
    });
  }

  // Records one trace per evaluation run, with a span per test case and a
  // score per judge result. Returns the trace URL, or "" if export is skipped.
  exportEvalRun(
    report: {
      runId: string;
      status: string;
      overallScore: number;
      passRate: number;
      threshold: number;
      totalTests: number;
      passedTests: number;
      failedTests: number;
      timestamp: string;
    },
    testCases: Array<{ id: string; input: string; expectedOutput?: string }>,
    scores: Array<{
      testCaseId: string;
      score: number;
      explanation: string;
    }>
  ): string {
    if (!this.credentialsValid) {
      console.warn(
        "Langfuse secretKey is a placeholder — skipping export"
      );
      return "";
    }

    // The run-level summary is stored as trace metadata.
    const trace = this.client.trace({
      name: `perplexity-eval-${report.runId}`,
      metadata: {
        status: report.status,
        threshold: report.threshold,
        overallScore: report.overallScore,
        passRate: report.passRate,
        totalTests: report.totalTests,
        passedTests: report.passedTests,
        failedTests: report.failedTests,
        timestamp: report.timestamp,
      },
    });

    // One span per golden test case.
    for (const testCase of testCases) {
      trace.span({
        name: `test-case-${testCase.id}`,
        input: testCase.input,
        metadata: testCase.expectedOutput
          ? { expectedOutput: testCase.expectedOutput }
          : undefined,
      });
    }

    // One score per judge result, attached to the trace.
    for (const score of scores) {
      trace.score({ name: "judge_score", value: score.score });
    }

    return `${this.baseUrl}/trace/${trace.id}`;
  }

  // Sends any queued events to Langfuse.
  async flush(): Promise<void> {
    await this.client.flushAsync();
  }

  // Flushes and releases the client; call once before the process exits.
  async shutdown(): Promise<void> {
    await this.client.shutdownAsync();
  }
}
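The exporter receives a ready-made `report` object. As a sketch of where fields such as `overallScore`, `passRate`, and `status` might come from, the function below aggregates judge scores against a threshold. The rules used here are assumptions for illustration: a test passes when its score is at or above the threshold, `overallScore` is the mean score, and the run is marked "passed" only when at least one test ran and the mean clears the threshold. The name `buildReport` and the "failed" status string are hypothetical.

```typescript
interface JudgeScore {
  testCaseId: string;
  score: number;
  explanation: string;
}

// Same field names as the report parameter of exportEvalRun.
interface EvalReport {
  runId: string;
  status: string;
  overallScore: number;
  passRate: number;
  threshold: number;
  totalTests: number;
  passedTests: number;
  failedTests: number;
  timestamp: string;
}

function buildReport(
  runId: string,
  scores: JudgeScore[],
  threshold: number,
  now: Date = new Date()
): EvalReport {
  const totalTests = scores.length;
  // Assumed per-test rule: score >= threshold counts as a pass.
  const passedTests = scores.filter((s) => s.score >= threshold).length;
  const sum = scores.reduce((acc, s) => acc + s.score, 0);
  const overallScore = totalTests === 0 ? 0 : sum / totalTests;
  return {
    runId,
    // An empty run never passes the gate.
    status: totalTests > 0 && overallScore >= threshold ? "passed" : "failed",
    overallScore,
    passRate: totalTests === 0 ? 0 : passedTests / totalTests,
    threshold,
    totalTests,
    passedTests,
    failedTests: totalTests - passedTests,
    timestamp: now.toISOString(),
  };
}
```

Gating on the mean score is only one option; a stricter gate could require `passRate` to clear its own minimum so that one very high score cannot mask several failures.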

// src/index.ts
import "dotenv/config";
import { writeFileSync } from "node:fs";
import type { CLIOptions, EvalConfig } from "./lib/types.js";
import {
  loadConfig,
  mergeWithEnv,
  createDefaultConfig,
} from "./lib/config.js";
import { EvalPipeline } from "./lib/eval-pipeline.js";

async function main(): Promise<void> {
  const options = parseArgs(process.argv);

  // The judge cannot run without an API key, so fail fast.
  if (!process.env.PERPLEXITY_API_KEY) {
    console.error(
      "Error: PERPLEXITY_API_KEY environment variable is required"
    );
    process.exit(1);
  }

  // An explicit --config path must load; otherwise try the default
  // location and fall back to built-in defaults.
  let config: EvalConfig;
  if (options.config) {
    config = loadConfig(options.config);
  } else {
    try {
      config = loadConfig();
    } catch {
      config = createDefaultConfig();
    }
  }

  // Precedence: config file, then environment, then CLI flags.
  config = mergeWithEnv(config);
  if (options.threshold !== undefined) {
    // Reject non-numeric values, which would otherwise yield a NaN threshold.
    if (Number.isNaN(options.threshold)) {
      console.error("Error: --threshold requires a numeric value");
      process.exit(1);
    }
    config = { ...config, threshold: options.threshold };
  }
  if (options.model) {
    config = { ...config, judgeModel: options.model };
  }

  const pipeline = new EvalPipeline(config);
  const report = await pipeline.run();

  const jsonOutput = JSON.stringify(report, null, 2);
  console.log(jsonOutput);
  if (options.output) {
    writeFileSync(options.output, jsonOutput, "utf-8");
  }

  // A non-zero exit code lets CI block the promotion.
  process.exit(report.status === "passed" ? 0 : 1);
}

// Minimal flag parser; unknown flags are ignored.
function parseArgs(argv: string[]): CLIOptions {
  const options: CLIOptions = { command: "run" };
  for (let i = 2; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--config":
      case "-c":
        options.config = argv[++i];
        break;
      case "--output":
      case "-o":
        options.output = argv[++i];
        break;
      case "--verbose":
      case "-v":
        options.verbose = true;
        break;
      case "--threshold":
      case "-t":
        options.threshold = Number(argv[++i]);
        break;
      case "--model":
      case "-m":
        options.model = argv[++i];
        break;
    }
  }
  return options;
}

main().catch((error: unknown) => {
  console.error(error instanceof Error ? error.message : String(error));
  process.exit(1);
});

Intro

Prerequisites

Step 1: Review the project layout

Step 2: Set up the environment file

Step 3: Create the shared types

Step 4: Build the configuration loader

Step 5: Wire the golden dataset loader

Step 6: Create the agent under test HTTP client

Step 7: Build the Perplexity judge service

Step 8: Compute classification metrics

Step 9: Add prompt version control gating

Step 10: Lint agent definition files

Step 11: Build the Langfuse exporter

Step 12: Wire the evaluation pipeline

Step 13: Create the CLI entrypoint

Step 14: Create the API route handler

Step 15: Run the tests

Next steps