# vLLM Agent Eval Harness for Fine-Tuned Model Quality
 
> Automated CI/CD-quality evaluations for locally-hosted fine-tuned LLMs using vLLM with LLM-as-judge and cost tracking.
 
Small and medium businesses fine-tuning open models on their own hardware lack a structured, repeatable quality verification pipeline. Without automated eval, regressions slip into production. This recipe solves that with a CLI tool that runs trajectories against your local vLLM endpoint, scores responses via GPT-4 as a judge, tracks token costs, enforces CI gates, and exports observability to Langfuse.
 
## Architecture
 
```
┌─────────────────────────────────────────────────────────────────────┐
│  Local Machine (recipe)                                             │
│                                                                     │
│  ┌──────────────┐     ┌────────────────┐     ┌──────────────────┐  │
│  │  Trajectory   │────>│  VLLMClient    │────>│  EvalRunner      │  │
│  │  JSONL files  │     │  (openai SDK   │     │  Orchestrates:   │  │
│  └──────────────┘     │   → vLLM        │     │  ┌────────────┐  │  │
│                       │   /v1/chat/     │     │  │ JudgeEngine│  │  │
│                       │   completions)  │     │  │(GPT-4 cloud)│  │  │
│                       └────────────────┘     │  ├────────────┤  │  │
│                                              │  │ CostTracker│  │  │
│  ┌──────────────────┐                        │  ├────────────┤  │  │
│  │  Langfuse SDK    │<────── traces/─────────│  │ GateEngine │  │  │
│  │  (observability) │       gens/scores      │  └────────────┘  │  │
│  └──────────────────┘                        └──────────────────┘  │
│         │                                                          │
└─────────┼──────────────────────────────────────────────────────────┘
          │
          ▼
┌──────────────────────┐
│  Langfuse Cloud      │
│  (traces, generations│
│   scores, costs)     │
└──────────────────────┘
 
  ┌──────────────────────────────────────┐
  │  vLLM Server (local)                 │
  │  Hosts fine-tuned model(s)           │
  │  Exposes OpenAI-compatible API       │
  └──────────────────────────────────────┘
 
  ┌──────────────────────────────────────┐
  │  OpenAI API (cloud)                  │
  │  GPT-4 judge scores responses        │
  └──────────────────────────────────────┘
```
 
The judge runs GPT-4 via the OpenAI API. Everything else — trajectory storage, model-under-test generation via vLLM, cost tracking, gate evaluation, CLI orchestration — runs on your local machine or CI runner.
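
To make the model-under-test call concrete, here is a minimal sketch of the request body that could be sent to vLLM's `/v1/chat/completions` route. The field names `model`, `messages`, `max_tokens`, and `temperature` are standard for the OpenAI-compatible API. The `buildChatRequest` helper and the choice to send the trajectory `context` as a system message are assumptions made for illustration, not the recipe's documented behavior.

```typescript
// Sketch only: the message layout (context as a system message) is an assumption.
interface Trajectory {
  id: string;
  context: string;
  prompt: string;
  expectedTool?: string;
}

interface VLLMSettings {
  model: string;       // VLLM_MODEL
  maxTokens: number;   // VLLM_MAX_TOKENS
  temperature: number; // VLLM_TEMPERATURE
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ChatCompletionRequest {
  model: string;
  messages: ChatMessage[];
  max_tokens: number;
  temperature: number;
}

// Hypothetical helper: maps one trajectory to a chat completion request body.
function buildChatRequest(t: Trajectory, cfg: VLLMSettings): ChatCompletionRequest {
  const messages: ChatMessage[] = [];
  if (t.context.trim() !== "") {
    messages.push({ role: "system", content: t.context });
  }
  messages.push({ role: "user", content: t.prompt });
  return {
    model: cfg.model,
    messages,
    max_tokens: cfg.maxTokens,
    temperature: cfg.temperature,
  };
}

const exampleRequest = buildChatRequest(
  { id: "t-1", context: "You are a support agent.", prompt: "Reset my password." },
  { model: "my-fine-tuned-model", maxTokens: 4096, temperature: 0.1 },
);
```

Because vLLM speaks the same wire format as the OpenAI API, the same body shape works for both the model under test and the cloud judge; only the base URL and model name differ.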
 
## Prerequisites
 
- **Node.js** >= 22
- **pnpm** 10 (see `packageManager` in `package.json`)
- A running **vLLM server** with the [OpenAI-compatible endpoint](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) enabled (typically `http://localhost:8000/v1`)
- An **OpenAI API key** with access to GPT-4 (used by the judge)
 
## Quick Start
 
```bash
# Install dependencies
pnpm install
 
# Create environment config
cp .env.example .env
 
# Edit .env — set at minimum:
#   VLLM_BASE_URL=http://localhost:8000/v1
#   VLLM_MODEL=your-model-name
#   OPENAI_API_KEY=sk-...
 
# Run a sample eval against the golden trajectories bundled with the project
pnpm start eval ./golden/ --format json
```
 
Results are written to `./results/results.json` and `./results/gate-summary.json` by default (configurable via `RESULTS_DIR`).
 
## Configuration
 
The recipe is configured through environment variables. All values can be set in `.env` or passed directly to the process.
 
| Variable | Default | Description |
|---|---|---|
| `VLLM_BASE_URL` | `http://localhost:8000/v1` | Base URL of the vLLM OpenAI-compatible endpoint |
| `VLLM_MODEL` | — | Name of the model hosted on vLLM (required) |
| `VLLM_API_KEY` | — | Optional API key if vLLM requires authentication |
| `VLLM_MAX_TOKENS` | `4096` | Max output tokens for the model-under-test |
| `VLLM_TEMPERATURE` | `0.1` | Sampling temperature for generation |
| `OPENAI_API_KEY` | — | API key for the GPT-4 judge |
| `EVAL_JUDGE_MODEL` | `gpt-5.2` | Model ID used by JudgeEngine |
| `EVAL_JUDGE_PROVIDER` | `gpt4` | Judge provider (`gpt4`, `claude`, `gemini`, `openrouter`) |
| `EVAL_JUDGE_TEMPERATURE` | `0` | Judge sampling temperature |
| `LANGFUSE_PUBLIC_KEY` | — | Langfuse project public key |
| `LANGFUSE_SECRET_KEY` | — | Langfuse project secret key |
| `LANGFUSE_BASE_URL` | `https://cloud.langfuse.com` | Langfuse API base URL |
| `LANGFUSE_ENABLED` | `false` | Set to `true` to enable Langfuse tracing |
| `EVAL_CONFIG_PATH` | `./eval-config.yaml` | Path to config file (JSON or YAML) |
| `GOLDEN_DIR` | `./golden` | Directory for golden trajectory files |
| `RESULTS_DIR` | `./results` | Output directory for eval results |
| `EVAL_BUDGET_USD` | — | Max budget in USD before the cost tracker halts |
| `EVAL_GATE_PRESET` | `standard` | Gate preset: `standard`, `strict`, or `lenient` |
| `JUDGE_MOCK` | `false` | When `true`, judge returns dummy scores (useful for smoke tests) |
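
The `EVAL_BUDGET_USD` row describes a spend limit that stops the run. A rough sketch of how such a tracker might work is shown below. The `createBudgetTracker` helper, the result shape, and the per-token prices are all hypothetical; the actual logic lives in `@reaatech/agent-eval-harness-cost` and may differ.

```typescript
// Sketch only. The prices used below are hypothetical placeholders, not real rates.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface Pricing {
  inputUsdPerMillion: number;
  outputUsdPerMillion: number;
}

interface BudgetStatus {
  costUsd: number;   // cost of the call just recorded
  totalUsd: number;  // running total for the eval run
  exceeded: boolean; // true once the running total passes the budget
}

function createBudgetTracker(pricing: Pricing, budgetUsd?: number) {
  let totalUsd = 0;
  return {
    record(usage: TokenUsage): BudgetStatus {
      const costUsd =
        (usage.inputTokens * pricing.inputUsdPerMillion +
          usage.outputTokens * pricing.outputUsdPerMillion) / 1e6;
      totalUsd += costUsd;
      // With no budget configured, the tracker never reports an overrun.
      const exceeded = budgetUsd !== undefined && totalUsd > budgetUsd;
      return { costUsd, totalUsd, exceeded };
    },
  };
}

// Budget of 0.01 USD; each call costs 0.006 USD at the placeholder prices.
const tracker = createBudgetTracker({ inputUsdPerMillion: 2, outputUsdPerMillion: 8 }, 0.01);
const first = tracker.record({ inputTokens: 1000, outputTokens: 500 });
const second = tracker.record({ inputTokens: 1000, outputTokens: 500 });
```

In this sketch `first.exceeded` is `false` and `second.exceeded` is `true`, at which point a caller would stop issuing further judge calls.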
 
### Config file format
 
The recipe loads config from a JSON (or YAML) file specified by `EVAL_CONFIG_PATH`. The file schema mirrors the `AppConfig` interface and is validated at runtime by zod. Environment variables override file values. Example `eval-config.json`:
 
```json
{
  "vllm": {
    "baseUrl": "http://localhost:8000/v1",
    "model": "my-fine-tuned-model",
    "maxTokens": 4096,
    "temperature": 0.1
  },
  "judge": {
    "model": "gpt-5.2",
    "provider": "gpt4",
    "temperature": 0
  },
  "budget": 10.0,
  "gatePreset": "standard"
}
```
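
The precedence described above (environment variables over file values, with the table defaults as a fallback) can be sketched as a small merge function. This is an illustration under assumptions, not the recipe's `getConfig`: `resolveConfig` and its types are hypothetical, it covers only a subset of the fields, and it uses hand-written checks where the recipe validates with zod.

```typescript
type GatePreset = "standard" | "strict" | "lenient";

interface ResolvedConfig {
  vllm: { baseUrl: string; model: string; maxTokens: number; temperature: number };
  budget: number | undefined;
  gatePreset: GatePreset;
}

interface FileConfig {
  vllm?: { baseUrl?: string; model?: string; maxTokens?: number; temperature?: number };
  budget?: number;
  gatePreset?: GatePreset;
}

type Env = Record<string, string | undefined>;

// Parse a numeric env value; unset, blank, or non-numeric values are ignored.
function toNumber(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const n = Number(raw);
  return Number.isFinite(n) ? n : undefined;
}

function toPreset(raw: string | undefined): GatePreset | undefined {
  return raw === "standard" || raw === "strict" || raw === "lenient" ? raw : undefined;
}

// Precedence for every field: env value, then file value, then default.
function resolveConfig(file: FileConfig, env: Env): ResolvedConfig {
  const model = env["VLLM_MODEL"] ?? file.vllm?.model;
  if (!model) throw new Error("VLLM_MODEL is required");
  return {
    vllm: {
      baseUrl: env["VLLM_BASE_URL"] ?? file.vllm?.baseUrl ?? "http://localhost:8000/v1",
      model,
      maxTokens: toNumber(env["VLLM_MAX_TOKENS"]) ?? file.vllm?.maxTokens ?? 4096,
      temperature: toNumber(env["VLLM_TEMPERATURE"]) ?? file.vllm?.temperature ?? 0.1,
    },
    budget: toNumber(env["EVAL_BUDGET_USD"]) ?? file.budget,
    gatePreset: toPreset(env["EVAL_GATE_PRESET"]) ?? file.gatePreset ?? "standard",
  };
}

const resolved = resolveConfig(
  { vllm: { model: "file-model", temperature: 0.5 }, gatePreset: "strict" },
  { VLLM_MODEL: "env-model", EVAL_BUDGET_USD: "5" },
);
```

Here `resolved.vllm.model` comes from the environment, `temperature` and `gatePreset` come from the file, and `maxTokens` falls back to the default. Passing the environment in as a plain object keeps the function easy to test.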
 
### Gate presets
 
| Preset | Description |
|---|---|
| `standard` | Default thresholds — overall score >= 0.7, pass rate >= 80% |
| `strict` | Higher bar — overall score >= 0.9, pass rate >= 95%, zero critical failures |
| `lenient` | Relaxed — overall score >= 0.5, pass rate >= 60% |
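
As an illustration of how the presets translate into checks, the sketch below encodes the thresholds from the table and evaluates a run summary against them. The `RunSummary` shape, the gate names, and `evaluateGates` are hypothetical; the real schema and logic belong to `@reaatech/agent-eval-harness-gate`.

```typescript
type GatePreset = "standard" | "strict" | "lenient";

interface GateThresholds {
  minOverallScore: number;
  minPassRate: number;          // fraction, so 80% is 0.8
  maxCriticalFailures?: number; // only the strict preset sets this
}

// Thresholds copied from the preset table.
const PRESETS: Record<GatePreset, GateThresholds> = {
  standard: { minOverallScore: 0.7, minPassRate: 0.8 },
  strict: { minOverallScore: 0.9, minPassRate: 0.95, maxCriticalFailures: 0 },
  lenient: { minOverallScore: 0.5, minPassRate: 0.6 },
};

// Hypothetical shape; the real gate-summary.json schema may differ.
interface RunSummary {
  overallScore: number;
  passRate: number;
  criticalFailures: number;
}

interface GateCheck {
  gate: string;
  passed: boolean;
}

function evaluateGates(summary: RunSummary, preset: GatePreset): GateCheck[] {
  const t = PRESETS[preset];
  const checks: GateCheck[] = [
    { gate: "overall_score", passed: summary.overallScore >= t.minOverallScore },
    { gate: "pass_rate", passed: summary.passRate >= t.minPassRate },
  ];
  if (t.maxCriticalFailures !== undefined) {
    checks.push({
      gate: "critical_failures",
      passed: summary.criticalFailures <= t.maxCriticalFailures,
    });
  }
  return checks;
}

const runSummary: RunSummary = { overallScore: 0.85, passRate: 0.9, criticalFailures: 1 };
const standardPassed = evaluateGates(runSummary, "standard").every((c) => c.passed); // true
const strictPassed = evaluateGates(runSummary, "strict").every((c) => c.passed);     // false
```

The same run passes `standard` but fails all three `strict` checks, which is the kind of difference to expect when tightening the preset on a release branch.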
 
## Commands
 
The CLI exposes the subcommands listed below. The global flags `-v`/`--verbose` and `-c`/`--config` work with all of them.
 
| Subcommand | Description | Key Flags | Example |
|---|---|---|---|
| `eval <paths...>` | Run evaluation on trajectory JSONL files | `--output`, `--format`, `--budget`, `--golden`, `--judge-model`, `--verbose` | `pnpm start eval ./golden/ --format json --budget 5.00` |
| `judge <aspect>` | On-the-fly judging of a single response | `--context`, `--response`, `--intent`, `--model`, `--calibrated` | `pnpm start judge faithfulness --context "..." --response "..."` |
| `compare <baseline> <candidate>` | Compare two eval run result files | `--format` (`json`, `markdown`, `table`) | `pnpm start compare results/v1.json results/v2.json --format table` |
| `gate <results>` | Check eval results against CI regression gates | `--preset` (`standard`, `strict`, `lenient`), `--exit-code` | `pnpm start gate results/gate-summary.json --preset strict --exit-code` |
| `report <results>` | Generate a formatted report from eval results | `--format` (`html`, `markdown`, `json`, `pdf`), `--output`, `--include-raw` | `pnpm start report results/results.json --format html --output report.html` |
| `golden` | Manage golden trajectories | `--list`, `--create <path>`, `--update <id>`, `--delete <id>`, `--validate <path>`, `--dir <path>` | `pnpm start golden --list` |
 
### eval
 
The primary subcommand. It discovers all `.jsonl` files under the given paths (directories are scanned recursively) and parses each line as a trajectory with `id`, `context`, `prompt`, and an optional `expectedTool`. For each trajectory it:

1. Generates a response from the vLLM-hosted model via `VLLMClient.generate`
2. Scores the response across four judgment types (faithfulness, relevance, tool_correctness, overall_quality) using `JudgeEngine`
3. Calculates per-trajectory cost via `@reaatech/agent-eval-harness-cost`

Once every trajectory has been processed, it:

4. Aggregates all results and runs the gate engine against the configured preset
5. Writes `results.json` (full output) and `gate-summary.json` (gate pass/fail) to the output directory
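
The parsing stage that precedes these steps can be sketched as follows. This is a minimal illustration of reading trajectories from JSONL text, assuming only the four fields named above. `parseTrajectories` is a hypothetical helper that uses hand-written checks in place of the recipe's zod schemas.

```typescript
interface Trajectory {
  id: string;
  context: string;
  prompt: string;
  expectedTool?: string;
}

interface ParseIssue {
  line: number; // 1-based line number in the JSONL text
  message: string;
}

interface ParseResult {
  trajectories: Trajectory[];
  errors: ParseIssue[];
}

function parseTrajectories(jsonl: string): ParseResult {
  const trajectories: Trajectory[] = [];
  const errors: ParseIssue[] = [];
  jsonl.split(/\r?\n/).forEach((raw, index) => {
    const text = raw.trim();
    if (text === "") return; // skip blank lines
    let value: unknown;
    try {
      value = JSON.parse(text);
    } catch {
      errors.push({ line: index + 1, message: "invalid JSON" });
      return;
    }
    if (typeof value !== "object" || value === null) {
      errors.push({ line: index + 1, message: "expected a JSON object" });
      return;
    }
    const { id, context, prompt, expectedTool } = value as Record<string, unknown>;
    if (typeof id !== "string" || typeof context !== "string" || typeof prompt !== "string") {
      errors.push({ line: index + 1, message: "id, context, and prompt must be strings" });
      return;
    }
    if (expectedTool !== undefined && typeof expectedTool !== "string") {
      errors.push({ line: index + 1, message: "expectedTool must be a string" });
      return;
    }
    trajectories.push({ id, context, prompt, ...(expectedTool !== undefined ? { expectedTool } : {}) });
  });
  return { trajectories, errors };
}

const sample = [
  '{"id":"t-1","context":"ctx","prompt":"hello"}',
  "",
  '{"id":"t-2","context":"ctx","prompt":"call a tool","expectedTool":"search"}',
  '{"id":"t-3"}',
].join("\n");
const parsed = parseTrajectories(sample);
```

Collecting errors with line numbers, instead of throwing on the first bad line, lets one malformed entry be reported without discarding the rest of the file. Whether the recipe itself behaves this way is not stated here.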
 
### judge
 
Sends a single `(context, response, aspect)` tuple to the judge without involving the model-under-test. Useful for iterating on judge prompt engineering or manually verifying edge cases.
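
For readers iterating on judge prompts, the sketch below shows one possible way to turn the tuple into chat messages and to read a score back. Everything in it is an assumption: the prompt wording, the JSON reply format, the 0 to 1 scale, and the helper names are hypothetical and do not describe the actual `JudgeEngine`.

```typescript
type JudgeAspect = "faithfulness" | "relevance" | "tool_correctness" | "overall_quality";

interface JudgeInput {
  aspect: JudgeAspect;
  context: string;
  response: string;
}

interface JudgeMessage {
  role: "system" | "user";
  content: string;
}

// Hypothetical prompt layout; the real judge prompt may look quite different.
function buildJudgeMessages(input: JudgeInput): JudgeMessage[] {
  return [
    {
      role: "system",
      content:
        `You are an evaluator. Rate the response for ${input.aspect} on a scale from 0 to 1. ` +
        `Reply with JSON only: {"score": <number>, "reasoning": "<text>"}.`,
    },
    {
      role: "user",
      content: `Context:\n${input.context}\n\nResponse:\n${input.response}`,
    },
  ];
}

// Returns the score, or undefined when the reply is not usable.
function parseJudgeReply(reply: string): number | undefined {
  try {
    const parsedReply: unknown = JSON.parse(reply);
    if (typeof parsedReply !== "object" || parsedReply === null) return undefined;
    const score = (parsedReply as Record<string, unknown>)["score"];
    return typeof score === "number" && score >= 0 && score <= 1 ? score : undefined;
  } catch {
    return undefined;
  }
}

const judgeMessages = buildJudgeMessages({
  aspect: "faithfulness",
  context: "The refund window is 30 days.",
  response: "You can request a refund within 30 days.",
});
const goodScore = parseJudgeReply('{"score": 0.8, "reasoning": "matches the context"}');
const badScore = parseJudgeReply("I think it is fine");
```

Treating an unparseable reply as `undefined` instead of a zero score keeps judge formatting failures from being counted as model failures.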
 
### compare
 
Reads two result files (baseline and candidate), delegates to `compareCosts` from `@reaatech/agent-eval-harness-cost`, and prints which run is cheaper and by how much.
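
The arithmetic behind such a comparison is simple enough to sketch. The function below is not the package's `compareCosts`; the `RunCost` shape and `compareRunCosts` name are hypothetical and assume each result file yields a single total cost in USD.

```typescript
// Hypothetical input shape: one total per run.
interface RunCost {
  label: string;
  totalCostUsd: number;
}

interface CostComparison {
  cheaper: string | null;      // label of the cheaper run, or null on a tie
  deltaUsd: number;            // candidate minus baseline
  deltaPercent: number | null; // relative to baseline; null when baseline is 0
}

function compareRunCosts(baseline: RunCost, candidate: RunCost): CostComparison {
  const deltaUsd = candidate.totalCostUsd - baseline.totalCostUsd;
  const deltaPercent =
    baseline.totalCostUsd === 0 ? null : (deltaUsd / baseline.totalCostUsd) * 100;
  let cheaper: string | null = null;
  if (deltaUsd < 0) cheaper = candidate.label;
  else if (deltaUsd > 0) cheaper = baseline.label;
  return { cheaper, deltaUsd, deltaPercent };
}

const comparison = compareRunCosts(
  { label: "v1", totalCostUsd: 2.0 },
  { label: "v2", totalCostUsd: 1.5 },
);
// comparison: { cheaper: "v2", deltaUsd: -0.5, deltaPercent: -25 }
```

A negative delta means the candidate run cost less than the baseline.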
 
### gate
 
Reads a `gate-summary.json` file, evaluates it against the chosen preset, prints per-gate pass/fail, and optionally exits non-zero (for CI). Uses `@reaatech/agent-eval-harness-gate`.
 
### report
 
Reads a `results.json` file and produces a formatted report. For non-JSON formats delegates to `reportCommand` from `@reaatech/agent-eval-harness-cli`.
 
### golden
 
Manages golden trajectory files — the set of canonical eval examples that serve as the regression test suite. Delegates to `goldenCommand` from `@reaatech/agent-eval-harness-cli`. Supports listing, creating from a file, updating by ID, deleting, and validating trajectories against a golden set.
 
## CI Pipeline Integration
 
The `gate` subcommand is designed for CI. Below is a GitHub Actions workflow that runs eval and gates a PR:
 
```yaml
# .github/workflows/eval-gate.yml
name: vLLM Eval Gate
on:
  pull_request:
    paths: ['golden/**', 'src/**']
 
jobs:
  eval-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'pnpm'
 
      - run: pnpm install
 
      - name: Run evaluation
        run: pnpm start eval ./golden/ --format json
        env:
          VLLM_BASE_URL: ${{ secrets.VLLM_BASE_URL }}
          VLLM_MODEL: ${{ secrets.VLLM_MODEL }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
 
      - name: Gate results
        run: pnpm start gate results/gate-summary.json --preset standard --exit-code
 
      - name: Upload results
        # Run even when the gate step fails, so results from a failing run are still uploaded
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
```
 
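The `--exit-code` flag makes `gate` return a non-zero exit code when any gate fails, which fails the workflow job. If that job is a required status check in the repository's branch protection rules, the failure blocks the PR from merging.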
 
## Langfuse Observability
 
When `LANGFUSE_ENABLED=true`, the `ObservabilityService` wraps the Langfuse SDK and captures a trace for every eval run:
 
- **Traces** — One trace per evaluated trajectory, named `eval-<trajectoryId>`. Metadata includes the trajectory ID, model name, and user prompt.
- **Generations** — Each vLLM call is recorded as a generation on the trace, with `input`, `output`, `model`, and `usage` (input/output token counts).
- **Scores** — Judge scores (faithfulness, relevance, tool_correctness, overall_quality) can be attached as trace scores via `recordScore`.
 
To disable (e.g. during local development), leave `LANGFUSE_ENABLED=false`; all observability methods become no-ops and no data is sent.
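
The no-op behavior can be illustrated with a small wrapper around an injected sink. This is a sketch under assumptions, not the recipe's `ObservabilityService`: the `TraceSink` interface stands in for the Langfuse SDK, and the class and method names other than `recordScore` are hypothetical.

```typescript
// Hypothetical stand-in for the Langfuse client, so the sketch has no dependencies.
interface TraceSink {
  trace(traceName: string, metadata: Record<string, unknown>): void;
  score(traceName: string, scoreName: string, value: number): void;
}

class ObservabilitySketch {
  private readonly enabled: boolean;
  private readonly sink: TraceSink;

  constructor(enabled: boolean, sink: TraceSink) {
    this.enabled = enabled;
    this.sink = sink;
  }

  // Returns the trace name so callers can attach scores to it later.
  startTrace(trajectoryId: string, model: string, prompt: string): string {
    const traceName = `eval-${trajectoryId}`;
    if (this.enabled) {
      this.sink.trace(traceName, { trajectoryId, model, prompt });
    }
    return traceName;
  }

  recordScore(traceName: string, scoreName: string, value: number): void {
    if (!this.enabled) return; // disabled: nothing is sent
    this.sink.score(traceName, scoreName, value);
  }
}

// In-memory sink that records calls, used here in place of a real client.
const sentEvents: string[] = [];
const memorySink: TraceSink = {
  trace: (traceName) => { sentEvents.push(`trace:${traceName}`); },
  score: (traceName, scoreName, value) => { sentEvents.push(`score:${traceName}:${scoreName}=${value}`); },
};

const disabledObs = new ObservabilitySketch(false, memorySink);
disabledObs.recordScore(disabledObs.startTrace("t-1", "my-model", "hello"), "faithfulness", 0.9);
const eventsWhileDisabled = sentEvents.length; // 0

const enabledObs = new ObservabilitySketch(true, memorySink);
enabledObs.recordScore(enabledObs.startTrace("t-1", "my-model", "hello"), "faithfulness", 0.9);
```

Keeping the enabled check inside the wrapper means calling code never needs to branch on `LANGFUSE_ENABLED`.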
 
## Project Structure
 
```
├── bin/
│   └── preflight.js              Pre-build validation script
├── golden/                       Golden trajectory files (eval regression suite)
├── results/                      Eval output (gitignored)
├── src/
│   ├── cli/
│   │   └── index.ts              CLI entry point (all 7 subcommands)
│   ├── lib/
│   │   ├── config.ts             loadEnvConfig / loadFileConfig / getConfig
│   │   ├── schemas.ts            Zod schemas + parser functions
│   │   └── types.ts              TypeScript interfaces (VLLMConfig, AppConfig, etc.)
│   ├── services/
│   │   ├── eval-runner.ts        EvalRunner (judge + cost + gate orchestration)
│   │   ├── observability.ts      ObservabilityService (Langfuse wrapper)
│   │   └── vllm-adapter.ts       VLLMClient (openai SDK → vLLM endpoint)
│   └── index.ts                  Barrel export
├── tests/
│   ├── cli/
│   │   └── index.test.ts         CLI integration tests
│   ├── lib/
│   │   ├── config.test.ts        Config loading tests
│   │   └── schemas.test.ts       Zod schema validation tests
│   └── services/
│       ├── eval-runner.test.ts   EvalRunner unit + integration tests
│       ├── observability.test.ts Langfuse wrapper tests
│       └── vllm-adapter.test.ts  VLLMClient tests (MSW mocked)
├── .env.example                  Environment variable template
├── DEV_PLAN.md                   Build plan for this recipe
├── package.json                  pnpm workspace root (engines, scripts, deps)
└── tsconfig.json                 TypeScript configuration
```
 
## License
 
MIT — see [LICENSE](./LICENSE).