# Perplexity Agent Eval Harness for SMB AI Quality Assurance
> Run continuous, automated evaluations of your customer‑facing AI agents using Perplexity as a neutral LLM judge, with version‑gated prompt promotions.
A tutorial-style reference solution from [reaatech.com](https://reaatech.com) that demonstrates how to build production-grade AI systems with the `@reaatech/*` package family.
## Quick Start
```bash
pnpm install
pnpm test # vitest run with coverage
pnpm dev # next dev
```
### CLI Usage
```bash
# Set up your environment
cp .env.example .env
# Edit .env with your API keys
# Run an evaluation
node src/index.js run
# With custom config
node src/index.js run --config ./my-config.json --verbose
# Output report to file
node src/index.js run --output ./report.json
```
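For readers wiring up their own entrypoint, the flags shown above could be parsed roughly as follows. This is a minimal sketch, not the actual contents of `src/index.ts`: the `CliOptions` shape and the `parseCliArgs` function are hypothetical names, and only the `run` command and the `--config`, `--output`, and `--verbose` flags come from the examples above.

```typescript
// Hypothetical option shape; the real CLI may expose more fields.
interface CliOptions {
  command: string;
  configPath?: string;
  outputPath?: string;
  verbose: boolean;
}

// Parses the arguments that follow the script name, e.g. ["run", "--verbose"].
function parseCliArgs(argv: string[]): CliOptions {
  const [command, ...rest] = argv;
  if (command !== "run") {
    throw new Error(`unknown command: ${command ?? "(none)"}`);
  }
  const options: CliOptions = { command, verbose: false };
  for (let i = 0; i < rest.length; i++) {
    const arg = rest[i];
    if (arg === "--verbose") {
      options.verbose = true;
    } else if (arg === "--config" || arg === "--output") {
      // Both flags take a path as the next argument.
      const value = rest[i + 1];
      if (value === undefined || value.startsWith("--")) {
        throw new Error(`${arg} requires a path`);
      }
      if (arg === "--config") options.configPath = value;
      else options.outputPath = value;
      i++; // skip the consumed value
    } else {
      throw new Error(`unknown argument: ${arg}`);
    }
  }
  return options;
}
```

In a Node entrypoint this would typically be called as `parseCliArgs(process.argv.slice(2))`.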
## Pipeline Stages
The evaluation pipeline runs these stages in order:
1. **Golden Load** — Loads golden test cases from a JSONL file via `@reaatech/agent-eval-harness-golden`
2. **Agent Feed** — Sends each test case to your agent under test and collects responses
3. **Judge Scoring** — Scores each response with Perplexity acting as a neutral LLM judge, using prompt templates from `@reaatech/agent-eval-harness-judge`
4. **Golden Comparison** — Compares candidate trajectories against golden references to detect regressions
5. **Classifier Metrics** — Computes accuracy, precision, recall, F1 via `@reaatech/classifier-evals`
6. **Markdown Lint** — Lints AGENTS.md/SKILL.md files via `@reaatech/agents-markdown-linter`
7. **Threshold Gating** — Compares scores against the configured threshold, then promotes or blocks prompt versions via `@reaatech/prompt-version-control`
8. **Langfuse Export** — Streams results to Langfuse dashboards via the `langfuse` SDK
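
To make the Classifier Metrics stage concrete, here is a minimal sketch of how accuracy, precision, recall, and F1 can be derived from binary outcomes. It does not use or reproduce the `@reaatech/classifier-evals` API; the `BinaryOutcome` and `BinaryMetrics` types and the `computeBinaryMetrics` function are hypothetical, and returning `0` for an undefined ratio is an assumption.

```typescript
// Hypothetical shapes for illustration; not the @reaatech/classifier-evals API.
interface BinaryOutcome {
  expected: boolean; // label from the golden test case
  predicted: boolean; // label derived from the agent's response
}

interface BinaryMetrics {
  accuracy: number;
  precision: number;
  recall: number;
  f1: number;
}

function computeBinaryMetrics(outcomes: BinaryOutcome[]): BinaryMetrics {
  let tp = 0;
  let fp = 0;
  let tn = 0;
  let fn = 0;
  for (const { expected, predicted } of outcomes) {
    if (predicted && expected) tp++;
    else if (predicted && !expected) fp++;
    else if (!predicted && expected) fn++;
    else tn++;
  }
  // Assumption: an undefined ratio (zero denominator) is reported as 0.
  const safeDiv = (n: number, d: number): number => (d === 0 ? 0 : n / d);
  const precision = safeDiv(tp, tp + fp);
  const recall = safeDiv(tp, tp + fn);
  return {
    accuracy: safeDiv(tp + tn, outcomes.length),
    precision,
    recall,
    f1: safeDiv(2 * precision * recall, precision + recall),
  };
}
```

F1 is the harmonic mean of precision and recall, so it only scores high when both do.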
## Configuration
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `PERPLEXITY_API_KEY` | Perplexity API key (required) | — |
| `PVC_API_KEY` | Prompt Version Control API key | — |
| `PVC_BASE_URL` | PVC server base URL | `http://localhost:3000` |
| `LANGFUSE_PUBLIC_KEY` | Langfuse project public key | — |
| `LANGFUSE_SECRET_KEY` | Langfuse project secret key | — |
| `LANGFUSE_BASE_URL` | Langfuse server URL | `https://cloud.langfuse.com` |
| `EVAL_CONFIG_PATH` | Path to evaluation config JSON | `./eval-config.yaml` |
| `AGENT_ENDPOINT` | Agent-under-test HTTP endpoint | — |
| `AGENT_API_KEY` | Agent endpoint auth key | — |
| `EVAL_THRESHOLD_SCORE` | Pass/fail threshold (0.0–1.0) | `0.7` |
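
As an illustration of how the defaults in this table might be applied, the sketch below reads a subset of the variables from a plain object. The `EvalEnv` type and the `loadEvalEnv` function are hypothetical, and the project's own loader in `src/lib/config.ts` may behave differently. Rejecting a threshold outside 0.0 to 1.0 is an assumption based on the documented range.

```typescript
// Hypothetical subset of the environment, for illustration only.
interface EvalEnv {
  perplexityApiKey: string;
  pvcBaseUrl: string;
  langfuseBaseUrl: string;
  evalConfigPath: string;
  thresholdScore: number;
}

function loadEvalEnv(env: Record<string, string | undefined>): EvalEnv {
  const perplexityApiKey = env["PERPLEXITY_API_KEY"];
  if (!perplexityApiKey) {
    throw new Error("PERPLEXITY_API_KEY is required");
  }

  // Fall back to the documented default when the variable is unset or blank.
  const rawThreshold = env["EVAL_THRESHOLD_SCORE"];
  const thresholdScore =
    rawThreshold === undefined || rawThreshold.trim() === "" ? 0.7 : Number(rawThreshold);
  if (!Number.isFinite(thresholdScore) || thresholdScore < 0 || thresholdScore > 1) {
    throw new Error("EVAL_THRESHOLD_SCORE must be a number between 0.0 and 1.0");
  }

  return {
    perplexityApiKey,
    pvcBaseUrl: env["PVC_BASE_URL"] ?? "http://localhost:3000",
    langfuseBaseUrl: env["LANGFUSE_BASE_URL"] ?? "https://cloud.langfuse.com",
    evalConfigPath: env["EVAL_CONFIG_PATH"] ?? "./eval-config.yaml",
    thresholdScore,
  };
}
```

In Node, `loadEvalEnv(process.env)` would be the usual call once `dotenv` has loaded `.env`.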
### Config File
Create a config file with JSON contents at the default path `eval-config.yaml`, or point `EVAL_CONFIG_PATH` at a different file:
```json
{
  "metrics": ["faithfulness", "relevance"],
  "judgeModel": "pplx-7b-online",
  "threshold": 0.7,
  "concurrency": 4,
  "budgetLimit": 10.0,
  "outputFormats": ["json"]
}
```
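The project lists `zod` for runtime schema validation in `src/lib/config.ts`. As a dependency-free stand-in, the sketch below shows the kind of checks such a schema might enforce on the fields above. The `EvalConfig` type and `parseEvalConfig` function are hypothetical, and the specific constraints (non-empty `metrics`, integer `concurrency` of at least 1, non-negative `budgetLimit`) are assumptions rather than documented rules.

```typescript
// Hypothetical config shape mirroring the example file above.
interface EvalConfig {
  metrics: string[];
  judgeModel: string;
  threshold: number;
  concurrency: number;
  budgetLimit: number;
  outputFormats: string[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

function parseEvalConfig(jsonText: string): EvalConfig {
  const raw: unknown = JSON.parse(jsonText);
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("config must be a JSON object");
  }
  const { metrics, judgeModel, threshold, concurrency, budgetLimit, outputFormats } =
    raw as Record<string, unknown>;

  // Each check throws immediately so the narrowed types carry through to the return.
  if (!isStringArray(metrics) || metrics.length === 0) {
    throw new Error("metrics must be a non-empty array of strings");
  }
  if (typeof judgeModel !== "string" || judgeModel === "") {
    throw new Error("judgeModel must be a non-empty string");
  }
  if (typeof threshold !== "number" || threshold < 0 || threshold > 1) {
    throw new Error("threshold must be a number between 0.0 and 1.0");
  }
  if (typeof concurrency !== "number" || !Number.isInteger(concurrency) || concurrency < 1) {
    throw new Error("concurrency must be a positive integer");
  }
  if (typeof budgetLimit !== "number" || budgetLimit < 0) {
    throw new Error("budgetLimit must be a non-negative number");
  }
  if (!isStringArray(outputFormats)) {
    throw new Error("outputFormats must be an array of strings");
  }
  return { metrics, judgeModel, threshold, concurrency, budgetLimit, outputFormats };
}
```

Failing fast on a malformed config keeps a bad threshold or concurrency value from silently skewing an evaluation run.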
## CI Integration
The CLI exits with code `0` when all tests pass (score >= threshold) and code `1` otherwise, making it suitable for CI gating:
```yaml
# .github/workflows/eval.yml
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pnpm install
      - run: node src/index.js run
      - name: Check evaluation result
        if: failure()
        run: echo "Evaluation failed — prompt blocked from production"
```
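The exit-code rule described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the CLI's actual logic: the `CaseScore` type and the `exitCodeFor` and `failingCases` functions are hypothetical, the threshold is assumed to apply to each test case individually, and an empty result set is assumed to count as a failure.

```typescript
// Hypothetical per-case result shape.
interface CaseScore {
  id: string;
  score: number; // judge score in the range 0.0 to 1.0
}

// Returns the cases that fall below the threshold, useful for the CI log.
function failingCases(scores: CaseScore[], threshold: number): CaseScore[] {
  return scores.filter((s) => s.score < threshold);
}

// 0 when every case meets the threshold, 1 otherwise.
// Assumption: a run with no cases fails rather than passing vacuously.
function exitCodeFor(scores: CaseScore[], threshold: number): 0 | 1 {
  if (scores.length === 0) return 1;
  return failingCases(scores, threshold).length === 0 ? 0 : 1;
}
```

An entrypoint could then set `process.exitCode = exitCodeFor(results, threshold)` so that the workflow step fails whenever a prompt version should be blocked.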
## Project Layout
```
app/api/eval/route.ts        API route for webhook-triggered evaluations
src/
  index.ts                   CLI entrypoint
  lib/
    types.ts                 Shared interfaces
    config.ts                Configuration loader
    eval-pipeline.ts         Central pipeline orchestrator
  services/
    golden-dataset.ts        Golden trajectory management
    agent-under-test.ts      Agent HTTP client
    judge-service.ts         Perplexity-as-judge bridge
    classifier-metrics.ts    Metrics computation
    pvc-service.ts           Prompt version control client
    markdown-linter.ts       Agent definition linter
    langfuse-exporter.ts     Observability exporter
tests/                       Vitest suite (mirrors src/)
packages/                    API references for every dependency
DEV_PLAN.md                  Build plan for this recipe
```
## Packages
### REAA Packages
| Package | Version | Role | Used In |
|---------|---------|------|-------------|
| `@reaatech/agent-eval-harness-suite` | `0.1.0` | Batch evaluation runner, threshold checking, results aggregation | `src/lib/eval-pipeline.ts` |
| `@reaatech/agent-eval-harness-judge` | `0.1.0` | LLM-as-judge with calibration, cost tracking, prompt templates | `src/services/judge-service.ts` |
| `@reaatech/agent-eval-harness-golden` | `0.1.0` | Golden trajectory management and regression comparison | `src/services/golden-dataset.ts` |
| `@reaatech/prompt-version-control` | `0.1.0` | Typed PVC API client with retry and caching | `src/services/pvc-service.ts` |
| `@reaatech/classifier-evals` | `0.1.0` | Classification metrics, confusion matrix, structured logging, OTel | `src/services/classifier-metrics.ts` |
| `@reaatech/agents-markdown-linter` | `1.0.0` | 18 built-in lint rules for AGENTS.md and SKILL.md | `src/services/markdown-linter.ts` |
### Third-Party Packages
| Package | Version | Role | Used In |
|---------|---------|------|-------------|
| `perplexity-sdk` | `1.0.4` | Perplexity AI chat completions | `src/services/judge-service.ts` |
| `langfuse` | `3.38.20` | LLM observability and tracing | `src/services/langfuse-exporter.ts` |
| `zod` | `4.4.3` | Runtime schema validation | `src/lib/config.ts` |
| `p-limit` | `7.3.0` | Concurrency limiting | `src/services/judge-service.ts` |
| `nanoid` | `5.1.11` | URL-friendly unique ID generator | `src/lib/eval-pipeline.ts` |
| `dotenv` | `17.4.2` | Environment variable loading | `src/index.ts` |
## License
MIT — see [LICENSE](./LICENSE).