# Perplexity Agent Eval Harness for SMB AI Quality Assurance
 
> Run continuous, automated evaluations of your customer‑facing AI agents using Perplexity as a neutral LLM judge, with version‑gated prompt promotions.
 
A tutorialized reference solution from [reaatech.com](https://reaatech.com), demonstrating how to build production-grade AI systems with the `@reaatech/*` package family.
 
## Quick Start
 
```bash
pnpm install
pnpm test            # vitest run with coverage
pnpm dev             # next dev
```
 
### CLI Usage
 
```bash
# Set up your environment
cp .env.example .env
# Edit .env with your API keys
 
# Run an evaluation
node src/index.js run
 
# With custom config
node src/index.js run --config ./my-config.json --verbose
 
# Output report to file
node src/index.js run --output ./report.json
```
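
The flags used above could be parsed along these lines. This is a minimal sketch, not the harness's actual parser: `CliOptions` and `parseCliArgs` are hypothetical names, and only the `run` subcommand and the `--config`, `--verbose`, and `--output` flags shown in the examples are assumed to exist.

```typescript
// Hypothetical shape of the options accepted by the CLI.
interface CliOptions {
  command: string;
  config?: string;
  output?: string;
  verbose: boolean;
}

// Parse argv (without the leading `node` and script path) into CliOptions.
function parseCliArgs(argv: string[]): CliOptions {
  const [command, ...rest] = argv;
  if (command === undefined) {
    throw new Error("Missing command, expected e.g. `run`");
  }
  const options: CliOptions = { command, verbose: false };

  for (let i = 0; i < rest.length; i++) {
    const flag = rest[i];
    if (flag === "--verbose") {
      options.verbose = true;
    } else if (flag === "--config" || flag === "--output") {
      const value = rest[i + 1];
      if (value === undefined || value.startsWith("--")) {
        throw new Error(`Flag ${flag} requires a value`);
      }
      if (flag === "--config") options.config = value;
      else options.output = value;
      i++; // skip the value that was just consumed
    } else {
      throw new Error(`Unknown flag: ${flag}`);
    }
  }
  return options;
}

const parsed = parseCliArgs(["run", "--config", "./my-config.json", "--verbose"]);
```

Rejecting unknown flags up front keeps a typo such as `--ouput` from silently discarding the report path.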
 
## Pipeline Stages
 
The evaluation pipeline runs these stages in order:
 
1. **Golden Load** — Loads golden test cases from a JSONL file via `@reaatech/agent-eval-harness-golden`
2. **Agent Feed** — Sends each test case to your agent under test and collects responses
3. **Judge Scoring** — Scores each response with Perplexity as a neutral LLM judge, applying prompt templates from `@reaatech/agent-eval-harness-judge`
4. **Golden Comparison** — Compares candidate trajectories against golden references to detect regressions
5. **Classifier Metrics** — Computes accuracy, precision, recall, F1 via `@reaatech/classifier-evals`
6. **Markdown Lint** — Lints AGENTS.md/SKILL.md files via `@reaatech/agents-markdown-linter`
7. **Threshold Gating** — Compares scores against the configured threshold, then promotes or blocks prompt versions via `@reaatech/prompt-version-control`
8. **Langfuse Export** — Streams results to Langfuse dashboards via the `langfuse` SDK
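
The ordering above can be pictured as a list of stages applied one after another to a shared context. The sketch below is illustrative only: `Stage`, `PipelineContext`, and `runPipeline` are hypothetical names, the stage bodies are stubs rather than calls into the `@reaatech/*` packages, and the hard-coded `0.7` gate mirrors the documented default threshold.

```typescript
// Hypothetical context handed from one stage to the next.
interface PipelineContext {
  completed: string[]; // names of stages that have already run
  scores: number[];    // per-case judge scores in [0, 1]
  passed?: boolean;    // set by the gating stage
}

interface Stage {
  name: string;
  run: (ctx: PipelineContext) => Promise<PipelineContext>;
}

// Run stages strictly in order; a thrown error aborts the remaining stages.
async function runPipeline(
  stages: Stage[],
  initial: PipelineContext,
): Promise<PipelineContext> {
  let ctx = initial;
  for (const stage of stages) {
    ctx = await stage.run(ctx);
    ctx = { ...ctx, completed: [...ctx.completed, stage.name] };
  }
  return ctx;
}

// Stub stages standing in for three of the eight stages listed above.
const stages: Stage[] = [
  { name: "golden-load", run: async (ctx) => ctx },
  {
    name: "judge-scoring",
    run: async (ctx) => ({ ...ctx, scores: [0.9, 0.8, 0.75] }),
  },
  {
    name: "threshold-gating",
    // Assumption: a run passes only if every case meets the threshold.
    run: async (ctx) => ({ ...ctx, passed: ctx.scores.every((s) => s >= 0.7) }),
  },
];
```

Calling `runPipeline(stages, { completed: [], scores: [] })` resolves to a context whose `completed` array records the stage order, which is convenient for logging where a failed run stopped.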
 
## Configuration
 
### Environment Variables
 
| Variable | Description | Default |
|----------|-------------|---------|
| `PERPLEXITY_API_KEY` | Perplexity API key (required) | — |
| `PVC_API_KEY` | Prompt Version Control API key | — |
| `PVC_BASE_URL` | PVC server base URL | `http://localhost:3000` |
| `LANGFUSE_PUBLIC_KEY` | Langfuse project public key | — |
| `LANGFUSE_SECRET_KEY` | Langfuse project secret key | — |
| `LANGFUSE_BASE_URL` | Langfuse server URL | `https://cloud.langfuse.com` |
| `EVAL_CONFIG_PATH` | Path to evaluation config JSON | `./eval-config.yaml` |
| `AGENT_ENDPOINT` | Agent-under-test HTTP endpoint | — |
| `AGENT_API_KEY` | Agent endpoint auth key | — |
| `EVAL_THRESHOLD_SCORE` | Pass/fail threshold (0.0–1.0) | `0.7` |
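
A loader for these variables might apply the defaults from the table and reject a missing `PERPLEXITY_API_KEY`. This is a sketch under assumptions: `EnvConfig` and `loadEnv` are hypothetical names, only a subset of the variables is covered, and the project's own `src/lib/config.ts` may be structured differently.

```typescript
// Hypothetical typed view over a subset of the documented variables.
interface EnvConfig {
  perplexityApiKey: string;
  pvcBaseUrl: string;
  langfuseBaseUrl: string;
  evalConfigPath: string;
  thresholdScore: number;
}

// Accepts any key/value map so it can be exercised without touching process.env.
function loadEnv(env: Record<string, string | undefined>): EnvConfig {
  const perplexityApiKey = env["PERPLEXITY_API_KEY"];
  if (!perplexityApiKey) {
    throw new Error("PERPLEXITY_API_KEY is required");
  }

  // Defaults below are copied from the table above.
  const thresholdScore = Number(env["EVAL_THRESHOLD_SCORE"] ?? "0.7");
  if (!Number.isFinite(thresholdScore) || thresholdScore < 0 || thresholdScore > 1) {
    throw new Error("EVAL_THRESHOLD_SCORE must be a number between 0.0 and 1.0");
  }

  return {
    perplexityApiKey,
    pvcBaseUrl: env["PVC_BASE_URL"] ?? "http://localhost:3000",
    langfuseBaseUrl: env["LANGFUSE_BASE_URL"] ?? "https://cloud.langfuse.com",
    evalConfigPath: env["EVAL_CONFIG_PATH"] ?? "./eval-config.yaml",
    thresholdScore,
  };
}

const envConfig = loadEnv({
  PERPLEXITY_API_KEY: "test-key",
  EVAL_THRESHOLD_SCORE: "0.85",
});
```

In a real entrypoint the argument would be `process.env`, populated by `dotenv` from the `.env` file.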
 
### Config File
 
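Create a config file at the path given by `EVAL_CONFIG_PATH` (default: `./eval-config.yaml`). Despite the `.yaml` extension of the default name, the contents are JSON: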
 
```json
{
  "metrics": ["faithfulness", "relevance"],
  "judgeModel": "pplx-7b-online",
  "threshold": 0.7,
  "concurrency": 4,
  "budgetLimit": 10.0,
  "outputFormats": ["json"]
}
```
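
The fields above could be checked at load time roughly as follows. This is a dependency-free sketch: `EvalConfig` and `parseEvalConfig` are hypothetical names, and the validation rules (threshold within 0 to 1, positive integer concurrency, non-negative budget) are assumptions inferred from the field names rather than documented constraints.

```typescript
// Hypothetical type mirroring the example config file.
interface EvalConfig {
  metrics: string[];
  judgeModel: string;
  threshold: number;
  concurrency: number;
  budgetLimit: number;
  outputFormats: string[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Parse and validate a JSON config string, throwing on the first invalid field.
function parseEvalConfig(json: string): EvalConfig {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("Config must be a JSON object");
  }
  const { metrics, judgeModel, threshold, concurrency, budgetLimit, outputFormats } =
    raw as Record<string, unknown>;

  if (!isStringArray(metrics) || metrics.length === 0) {
    throw new Error("metrics must be a non-empty array of strings");
  }
  if (typeof judgeModel !== "string" || judgeModel === "") {
    throw new Error("judgeModel must be a non-empty string");
  }
  if (typeof threshold !== "number" || threshold < 0 || threshold > 1) {
    throw new Error("threshold must be a number between 0 and 1");
  }
  if (typeof concurrency !== "number" || !Number.isInteger(concurrency) || concurrency < 1) {
    throw new Error("concurrency must be a positive integer");
  }
  if (typeof budgetLimit !== "number" || budgetLimit < 0) {
    throw new Error("budgetLimit must be a non-negative number");
  }
  if (!isStringArray(outputFormats)) {
    throw new Error("outputFormats must be an array of strings");
  }
  return { metrics, judgeModel, threshold, concurrency, budgetLimit, outputFormats };
}

const exampleConfig = parseEvalConfig(`{
  "metrics": ["faithfulness", "relevance"],
  "judgeModel": "pplx-7b-online",
  "threshold": 0.7,
  "concurrency": 4,
  "budgetLimit": 10.0,
  "outputFormats": ["json"]
}`);
```

Failing fast on a malformed config avoids spending judge budget on a run that would be discarded anyway.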
 
## CI Integration
 
The CLI exits with code `0` when all tests pass (score >= threshold) and code `1` otherwise, making it suitable for CI gating:
 
```yaml
# .github/workflows/eval.yml
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
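      # pnpm is not preinstalled on GitHub-hosted runners; the version here is an assumption
      - uses: pnpm/action-setup@v4
        with:
          version: 9
      - run: pnpm install
      # A non-zero exit code fails this step and therefore the job
      - run: node src/index.js run
        env:
          # Assumes a repository secret with this name has been configured
          PERPLEXITY_API_KEY: ${{ secrets.PERPLEXITY_API_KEY }}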
      - name: Check evaluation result
        if: failure()
        run: echo "Evaluation failed — prompt blocked from production"
```
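
The exit-code contract can be sketched as follows, assuming each test case yields a single score and that a run passes only when every score meets the threshold. `CaseResult` and `exitCodeFor` are hypothetical names, not part of the harness API.

```typescript
// Hypothetical per-case result produced by the judge stage.
interface CaseResult {
  id: string;
  score: number; // judge score in [0, 1]
}

// Map results to a process exit code: 0 if every case passes, 1 otherwise.
// Note: an empty result list passes vacuously under this rule.
function exitCodeFor(results: CaseResult[], threshold: number): 0 | 1 {
  const allPass = results.every((result) => result.score >= threshold);
  return allPass ? 0 : 1;
}

const passingRun = exitCodeFor(
  [
    { id: "case-1", score: 0.9 },
    { id: "case-2", score: 0.7 },
  ],
  0.7,
);

const failingRun = exitCodeFor(
  [
    { id: "case-1", score: 0.9 },
    { id: "case-2", score: 0.69 },
  ],
  0.7,
);
```

In a CLI entrypoint the returned value would typically be assigned to `process.exitCode` so that pending output is flushed before the process ends.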
 
## Project Layout
 
```
app/api/eval/route.ts     API route for webhook-triggered evaluations
src/
  index.ts                CLI entrypoint
  lib/
    types.ts              Shared interfaces
    config.ts             Configuration loader
    eval-pipeline.ts      Central pipeline orchestrator
  services/
    golden-dataset.ts     Golden trajectory management
    agent-under-test.ts   Agent HTTP client
    judge-service.ts      Perplexity-as-judge bridge
    classifier-metrics.ts Metrics computation
    pvc-service.ts        Prompt version control client
    markdown-linter.ts    Agent definition linter
    langfuse-exporter.ts  Observability exporter
tests/                    Vitest suite (mirrors src/)
packages/                 API references for every dependency
DEV_PLAN.md               Build plan for this recipe
```
 
## Packages
 
### REAA Packages
 
| Package | Version | Role | Used In |
|---------|---------|------|-------------|
| `@reaatech/agent-eval-harness-suite` | `0.1.0` | Batch evaluation runner, threshold checking, results aggregation | `src/lib/eval-pipeline.ts` |
| `@reaatech/agent-eval-harness-judge` | `0.1.0` | LLM-as-judge with calibration, cost tracking, prompt templates | `src/services/judge-service.ts` |
| `@reaatech/agent-eval-harness-golden` | `0.1.0` | Golden trajectory management and regression comparison | `src/services/golden-dataset.ts` |
| `@reaatech/prompt-version-control` | `0.1.0` | Typed PVC API client with retry and caching | `src/services/pvc-service.ts` |
| `@reaatech/classifier-evals` | `0.1.0` | Classification metrics, confusion matrix, structured logging, OTel | `src/services/classifier-metrics.ts` |
| `@reaatech/agents-markdown-linter` | `1.0.0` | 18 built-in lint rules for AGENTS.md and SKILL.md | `src/services/markdown-linter.ts` |
 
### Third-Party Packages
 
| Package | Version | Role | Used In |
|---------|---------|------|-------------|
| `perplexity-sdk` | `1.0.4` | Perplexity AI chat completions | `src/services/judge-service.ts` |
| `langfuse` | `3.38.20` | LLM observability and tracing | `src/services/langfuse-exporter.ts` |
| `zod` | `4.4.3` | Runtime schema validation | `src/lib/config.ts` |
| `p-limit` | `7.3.0` | Concurrency limiting | `src/services/judge-service.ts` |
| `nanoid` | `5.1.11` | URL-friendly unique ID generator | `src/lib/eval-pipeline.ts` |
| `dotenv` | `17.4.2` | Environment variable loading | `src/index.ts` |
 
## License
 
MIT — see [LICENSE](./LICENSE).