# vLLM Agent Eval Harness for Fine-Tuned Model Quality
> Automated CI/CD-quality evaluations for locally-hosted fine-tuned LLMs using vLLM with LLM-as-judge and cost tracking.
Small and medium businesses fine-tuning open models on their own hardware lack a structured, repeatable quality verification pipeline. Without automated eval, regressions slip into production. This recipe solves that with a CLI tool that runs trajectories against your local vLLM endpoint, scores responses via GPT-4 as a judge, tracks token costs, enforces CI gates, and exports observability to Langfuse.
## Architecture
```
┌────────────────────────────────────────────────────────────────────┐
│                       Local Machine (recipe)                       │
│                                                                    │
│  ┌──────────────┐     ┌────────────────┐     ┌──────────────────┐  │
│  │ Trajectory   │────>│ VLLMClient     │────>│ EvalRunner       │  │
│  │ JSONL files  │     │ (openai SDK    │     │ Orchestrates:    │  │
│  └──────────────┘     │  → vLLM        │     │ ┌──────────────┐ │  │
│                       │  /v1/chat/     │     │ │ JudgeEngine  │ │  │
│                       │  completions)  │     │ │ (GPT-4 cloud)│ │  │
│                       └────────────────┘     │ ├──────────────┤ │  │
│                                              │ │ CostTracker  │ │  │
│  ┌──────────────────┐                        │ ├──────────────┤ │  │
│  │ Langfuse SDK     │<────── traces/ ────────│ │ GateEngine   │ │  │
│  │ (observability)  │        gens/scores     │ └──────────────┘ │  │
│  └──────────────────┘                        └──────────────────┘  │
│           │                                                        │
└───────────┼────────────────────────────────────────────────────────┘
            │
            ▼
 ┌──────────────────────┐
 │ Langfuse Cloud       │
 │ (traces, generations,│
 │  scores, costs)      │
 └──────────────────────┘

┌──────────────────────────────────────┐
│ vLLM Server (local)                  │
│ Hosts fine-tuned model(s)            │
│ Exposes OpenAI-compatible API        │
└──────────────────────────────────────┘

┌──────────────────────────────────────┐
│ OpenAI API (cloud)                   │
│ GPT-4 judge scores responses         │
└──────────────────────────────────────┘
```
The judge runs GPT-4 via the OpenAI API. Everything else — trajectory storage, model-under-test generation via vLLM, cost tracking, gate evaluation, CLI orchestration — runs on your local machine or CI runner.
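To make the first hop in the diagram concrete, here is a minimal sketch of the request body a client could send to vLLM's `/v1/chat/completions` route. The `Trajectory` shape follows the fields listed under `eval`; `buildChatRequest`, and the choice to send `context` as a system message, are assumptions for illustration and not necessarily how the recipe's `VLLMClient` does it.

```typescript
// Hypothetical trajectory shape, based on the fields the eval subcommand parses.
interface Trajectory {
  id: string;
  context: string;
  prompt: string;
  expectedTool?: string;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Subset of the OpenAI-compatible chat completions request body.
interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  max_tokens: number;
  temperature: number;
}

// Assumption: context is sent as the system message, the prompt as the user turn.
function buildChatRequest(
  t: Trajectory,
  model: string,
  maxTokens: number = 4096, // mirrors the documented VLLM_MAX_TOKENS default
  temperature: number = 0.1, // mirrors the documented VLLM_TEMPERATURE default
): ChatRequest {
  return {
    model,
    messages: [
      { role: "system", content: t.context },
      { role: "user", content: t.prompt },
    ],
    max_tokens: maxTokens,
    temperature,
  };
}
```

The resulting object is what would be serialized and POSTed to `${VLLM_BASE_URL}/chat/completions`.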
## Prerequisites
- **Node.js** >= 22
- **pnpm** 10 (see `packageManager` in `package.json`)
- A running **vLLM server** with the [OpenAI-compatible endpoint](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) enabled (typically `http://localhost:8000/v1`)
- An **OpenAI API key** with access to GPT-4 (used by the judge)
## Quick Start
```bash
# Install dependencies
pnpm install
# Create environment config
cp .env.example .env
# Edit .env — set at minimum:
# VLLM_BASE_URL=http://localhost:8000/v1
# VLLM_MODEL=your-model-name
# OPENAI_API_KEY=sk-...
# Run a sample eval against the golden trajectories bundled with the project
pnpm start eval ./golden/ --format json
```
Results are written to `./results/results.json` and `./results/gate-summary.json` by default (configurable via `RESULTS_DIR`).
## Configuration
The recipe is configured through environment variables. All values can be set in `.env` or passed directly to the process.
| Variable | Default | Description |
|---|---|---|
| `VLLM_BASE_URL` | `http://localhost:8000/v1` | Base URL of the vLLM OpenAI-compatible endpoint |
| `VLLM_MODEL` | — | Name of the model hosted on vLLM (required) |
| `VLLM_API_KEY` | — | Optional API key if vLLM requires authentication |
| `VLLM_MAX_TOKENS` | `4096` | Max output tokens for the model-under-test |
| `VLLM_TEMPERATURE` | `0.1` | Sampling temperature for generation |
| `OPENAI_API_KEY` | — | API key for the GPT-4 judge |
| `EVAL_JUDGE_MODEL` | `gpt-5.2` | Model ID used by JudgeEngine |
| `EVAL_JUDGE_PROVIDER` | `gpt4` | Judge provider (`gpt4`, `claude`, `gemini`, `openrouter`) |
| `EVAL_JUDGE_TEMPERATURE` | `0` | Judge sampling temperature |
| `LANGFUSE_PUBLIC_KEY` | — | Langfuse project public key |
| `LANGFUSE_SECRET_KEY` | — | Langfuse project secret key |
| `LANGFUSE_BASE_URL` | `https://cloud.langfuse.com` | Langfuse API base URL |
| `LANGFUSE_ENABLED` | `false` | Set to `true` to enable Langfuse tracing |
| `EVAL_CONFIG_PATH` | `./eval-config.yaml` | Path to config file (JSON or YAML) |
| `GOLDEN_DIR` | `./golden` | Directory for golden trajectory files |
| `RESULTS_DIR` | `./results` | Output directory for eval results |
| `EVAL_BUDGET_USD` | — | Max budget in USD before the cost tracker halts |
| `EVAL_GATE_PRESET` | `standard` | Gate preset: `standard`, `strict`, or `lenient` |
| `JUDGE_MOCK` | `false` | When `true`, judge returns dummy scores (useful for smoke tests) |
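`EVAL_BUDGET_USD` implies a running total that is checked as costs accumulate. The following is a rough sketch of that idea only: `BudgetGuard` and its methods are hypothetical names, and the real tracker in `@reaatech/agent-eval-harness-cost` may behave differently (for example, in whether it halts before or after the trajectory that crosses the limit).

```typescript
// Hypothetical budget guard; not the API of @reaatech/agent-eval-harness-cost.
class BudgetGuard {
  private spentUsd = 0;
  private readonly budgetUsd: number | undefined;

  // An undefined budget means "no limit", matching an unset EVAL_BUDGET_USD.
  constructor(budgetUsd?: number) {
    this.budgetUsd = budgetUsd;
  }

  // Add a cost to the running total and report whether the run may continue.
  record(costUsd: number): boolean {
    this.spentUsd += costUsd;
    return !this.exceeded();
  }

  // True once the running total is strictly above the budget.
  exceeded(): boolean {
    return this.budgetUsd !== undefined && this.spentUsd > this.budgetUsd;
  }

  get total(): number {
    return this.spentUsd;
  }
}
```

An eval loop could call `record` after each trajectory and stop scheduling new work as soon as it returns `false`.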
### Config file format
The recipe loads config from a JSON or YAML file specified by `EVAL_CONFIG_PATH`. The file schema mirrors the `AppConfig` interface and is validated at runtime by Zod. Environment variables override file values. Example `eval-config.json`:
```json
{
"vllm": {
"baseUrl": "http://localhost:8000/v1",
"model": "my-fine-tuned-model",
"maxTokens": 4096,
"temperature": 0.1
},
"judge": {
"model": "gpt-5.2",
"provider": "gpt4",
"temperature": 0
},
"budget": 10.0,
"gatePreset": "standard"
}
```
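The precedence rule (environment over file, then defaults) can be sketched as a small merge function. This is illustrative only: `resolveVllmConfig` is a hypothetical helper that handles just the `vllm` block, and the recipe's actual loader in `src/lib/config.ts` validates with Zod rather than by hand.

```typescript
interface VllmConfig {
  baseUrl: string;
  model: string;
  maxTokens: number;
  temperature: number;
}

type Env = Record<string, string | undefined>;

// Parse a numeric env value; unset, empty, or non-numeric input yields undefined.
function num(value: string | undefined): number | undefined {
  if (value === undefined || value.trim() === "") return undefined;
  const n = Number(value);
  return Number.isFinite(n) ? n : undefined;
}

// Hypothetical merge order: env var first, then file value, then documented default.
function resolveVllmConfig(file: Partial<VllmConfig>, env: Env): VllmConfig {
  const model = env["VLLM_MODEL"] ?? file.model;
  if (!model) throw new Error("VLLM_MODEL is required");
  return {
    baseUrl: env["VLLM_BASE_URL"] ?? file.baseUrl ?? "http://localhost:8000/v1",
    model,
    maxTokens: num(env["VLLM_MAX_TOKENS"]) ?? file.maxTokens ?? 4096,
    temperature: num(env["VLLM_TEMPERATURE"]) ?? file.temperature ?? 0.1,
  };
}
```

Using `??` rather than `||` matters here: a temperature of `0` is a valid setting and must not fall through to the default.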
### Gate presets
| Preset | Description |
|---|---|
| `standard` | Default thresholds — overall score >= 0.7, pass rate >= 80% |
| `strict` | Higher bar — overall score >= 0.9, pass rate >= 95%, zero critical failures |
| `lenient` | Relaxed — overall score >= 0.5, pass rate >= 60% |
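As a sketch of how these presets could translate into checks, the snippet below encodes the thresholds from the table and returns the gates that fail for a given run summary. The type names, field names, and `failedGates` function are assumptions made for illustration; the actual logic lives in `@reaatech/agent-eval-harness-gate` and may differ.

```typescript
type GatePreset = "standard" | "strict" | "lenient";

interface GateThresholds {
  minOverallScore: number;
  minPassRate: number; // fraction in [0, 1]
  maxCriticalFailures?: number; // only set for presets that limit critical failures
}

// Thresholds copied from the preset table.
const PRESETS: Record<GatePreset, GateThresholds> = {
  standard: { minOverallScore: 0.7, minPassRate: 0.8 },
  strict: { minOverallScore: 0.9, minPassRate: 0.95, maxCriticalFailures: 0 },
  lenient: { minOverallScore: 0.5, minPassRate: 0.6 },
};

// Hypothetical summary shape; the real gate-summary.json schema is not shown here.
interface RunSummary {
  overallScore: number;
  passRate: number;
  criticalFailures: number;
}

// Returns the names of failed gates; an empty array means the run passes.
function failedGates(summary: RunSummary, preset: GatePreset): string[] {
  const t = PRESETS[preset];
  const failed: string[] = [];
  if (summary.overallScore < t.minOverallScore) failed.push("overall_score");
  if (summary.passRate < t.minPassRate) failed.push("pass_rate");
  if (
    t.maxCriticalFailures !== undefined &&
    summary.criticalFailures > t.maxCriticalFailures
  ) {
    failed.push("critical_failures");
  }
  return failed;
}
```

A CI wrapper could map a non-empty result to a non-zero exit code, which is the behavior the `--exit-code` flag describes.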
## Commands
The CLI exposes the subcommands listed below. The global flags `-v`/`--verbose` and `-c`/`--config` work with all of them.
| Subcommand | Description | Key Flags | Example |
|---|---|---|---|
| `eval <paths...>` | Run evaluation on trajectory JSONL files | `--output`, `--format`, `--budget`, `--golden`, `--judge-model`, `--verbose` | `pnpm start eval ./golden/ --format json --budget 5.00` |
| `judge <aspect>` | On-the-fly judging of a single response | `--context`, `--response`, `--intent`, `--model`, `--calibrated` | `pnpm start judge faithfulness --context "..." --response "..."` |
| `compare <baseline> <candidate>` | Compare two eval run result files | `--format` (`json`, `markdown`, `table`) | `pnpm start compare results/v1.json results/v2.json --format table` |
| `gate <results>` | Check eval results against CI regression gates | `--preset` (`standard`, `strict`, `lenient`), `--exit-code` | `pnpm start gate results/gate-summary.json --preset strict --exit-code` |
| `report <results>` | Generate a formatted report from eval results | `--format` (`html`, `markdown`, `json`, `pdf`), `--output`, `--include-raw` | `pnpm start report results/results.json --format html --output report.html` |
| `golden` | Manage golden trajectories | `--list`, `--create <path>`, `--update <id>`, `--delete <id>`, `--validate <path>`, `--dir <path>` | `pnpm start golden --list` |
### eval
The primary subcommand. It discovers all `.jsonl` files under the given paths (directories are scanned recursively) and parses each line as a trajectory with `id`, `context`, `prompt`, and an optional `expectedTool`. For each trajectory it:
1. Generates a response from the vLLM-hosted model via `VLLMClient.generate`
2. Scores the response across four judgment types (faithfulness, relevance, tool_correctness, overall_quality) using `JudgeEngine`
3. Calculates per-trajectory cost via `@reaatech/agent-eval-harness-cost`
4. Aggregates all results, runs the gate engine against configured presets
5. Writes `results.json` (full output) and `gate-summary.json` (gate pass/fail) to the output directory
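The parsing step that precedes this loop can be sketched as follows. This is a hand-rolled illustration under the assumption that each JSONL line is a flat object with the four fields named above; `parseTrajectories` is a hypothetical helper, and the recipe itself validates with the Zod schemas in `src/lib/schemas.ts`.

```typescript
interface Trajectory {
  id: string;
  context: string;
  prompt: string;
  expectedTool?: string;
}

// Parse JSONL text into trajectories, skipping blank lines.
// Throws with the 1-based line number when a line is malformed.
function parseTrajectories(jsonl: string): Trajectory[] {
  const out: Trajectory[] = [];
  const lines = jsonl.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i] ?? "";
    if (line.trim() === "") continue;

    let raw: unknown;
    try {
      raw = JSON.parse(line);
    } catch {
      throw new Error(`line ${i + 1}: invalid JSON`);
    }
    if (typeof raw !== "object" || raw === null) {
      throw new Error(`line ${i + 1}: expected a JSON object`);
    }

    const rec = raw as Record<string, unknown>;
    const id = rec["id"];
    const context = rec["context"];
    const prompt = rec["prompt"];
    const tool = rec["expectedTool"];
    if (
      typeof id !== "string" ||
      typeof context !== "string" ||
      typeof prompt !== "string"
    ) {
      throw new Error(`line ${i + 1}: id, context, and prompt must be strings`);
    }

    const t: Trajectory = { id, context, prompt };
    if (typeof tool === "string") t.expectedTool = tool; // optional field
    out.push(t);
  }
  return out;
}
```

Reporting the line number in the error makes it easier to locate a bad record in a large golden file.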
### judge
Sends a single `(context, response, aspect)` tuple to the judge without involving the model-under-test. Useful for iterating on judge prompt engineering or manually verifying edge cases.
### compare
Reads two result files (baseline and candidate), delegates to `compareCosts` from `@reaatech/agent-eval-harness-cost`, and prints which run is cheaper and by how much.
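The kind of output this produces can be illustrated with a small comparison over two total costs. `compareTotals` and `CostComparison` are hypothetical names for this sketch and are not the signature of `compareCosts`, which is not documented here.

```typescript
interface CostComparison {
  cheaper: "baseline" | "candidate" | "equal";
  deltaUsd: number; // absolute difference between the two totals
  deltaPercent: number; // relative to the baseline; 0 when the baseline cost is 0
}

// Compare total run costs in USD and report which run is cheaper and by how much.
function compareTotals(baselineUsd: number, candidateUsd: number): CostComparison {
  const deltaUsd = Math.abs(candidateUsd - baselineUsd);
  const cheaper =
    candidateUsd < baselineUsd
      ? "candidate"
      : candidateUsd > baselineUsd
        ? "baseline"
        : "equal";
  // Guard against division by zero when the baseline run cost nothing.
  const deltaPercent = baselineUsd === 0 ? 0 : (deltaUsd / baselineUsd) * 100;
  return { cheaper, deltaUsd, deltaPercent };
}
```

For example, a baseline of 4 USD against a candidate of 3 USD yields `cheaper: "candidate"`, a 1 USD difference, and a 25% reduction.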
### gate
Reads a `gate-summary.json` file, evaluates it against the chosen preset, prints per-gate pass/fail, and optionally exits non-zero (for CI). Uses `@reaatech/agent-eval-harness-gate`.
### report
Reads a `results.json` file and produces a formatted report. For non-JSON formats it delegates to `reportCommand` from `@reaatech/agent-eval-harness-cli`.
### golden
Manages golden trajectory files — the set of canonical eval examples that serve as the regression test suite. Delegates to `goldenCommand` from `@reaatech/agent-eval-harness-cli`. Supports listing, creating from a file, updating by ID, deleting, and validating trajectories against a golden set.
## CI Pipeline Integration
The `gate` subcommand is designed for CI. Below is a GitHub Actions workflow that runs eval and gates a PR:
```yaml
# .github/workflows/eval-gate.yml
name: vLLM Eval Gate
on:
pull_request:
paths: ['golden/**', 'src/**']
jobs:
eval-gate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: pnpm/action-setup@v4
- uses: actions/setup-node@v4
with:
node-version: '22'
cache: 'pnpm'
- run: pnpm install
- name: Run evaluation
run: pnpm start eval ./golden/ --format json
env:
VLLM_BASE_URL: ${{ secrets.VLLM_BASE_URL }}
VLLM_MODEL: ${{ secrets.VLLM_MODEL }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
- name: Gate results
run: pnpm start gate results/gate-summary.json --preset standard --exit-code
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: eval-results
path: results/
```
The `--exit-code` flag makes `gate` return a non-zero exit code when any gate fails, which blocks the PR from merging.
## Langfuse Observability
When `LANGFUSE_ENABLED=true`, the `ObservabilityService` wraps the Langfuse SDK and captures a trace for every evaluated trajectory:
- **Traces** — One trace per evaluated trajectory, named `eval-<trajectoryId>`. Metadata includes the trajectory ID, model name, and user prompt.
- **Generations** — Each vLLM call is recorded as a generation on the trace, with `input`, `output`, `model`, and `usage` (input/output token counts).
- **Scores** — Judge scores (faithfulness, relevance, tool_correctness, overall_quality) can be attached as trace scores via `recordScore`.
To disable (e.g. during local development), leave `LANGFUSE_ENABLED=false`; all observability methods become no-ops and no data is sent.
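The enabled/disabled switch can be sketched as a factory that returns either a real recorder or a no-op. To avoid guessing at Langfuse SDK calls, the sketch below writes to a stand-in `TraceSink` interface; `TraceSink`, `Observability`, and `createObservability` are all hypothetical names and not the recipe's `ObservabilityService` API.

```typescript
// Hypothetical sink standing in for the Langfuse SDK client.
interface TraceSink {
  trace(name: string, metadata: Record<string, string>): void;
}

interface Observability {
  startTrace(trajectoryId: string, model: string, prompt: string): void;
}

// Returns a no-op implementation when disabled, so callers never need to branch.
function createObservability(enabled: boolean, sink: TraceSink): Observability {
  if (!enabled) {
    return {
      startTrace(_trajectoryId: string, _model: string, _prompt: string): void {
        // Intentionally empty: nothing is sent when tracing is off.
      },
    };
  }
  return {
    startTrace(trajectoryId: string, model: string, prompt: string): void {
      // Trace name follows the eval-<trajectoryId> convention described above.
      sink.trace(`eval-${trajectoryId}`, { trajectoryId, model, prompt });
    },
  };
}
```

Keeping the no-op behind the same interface means the eval loop can call `startTrace` unconditionally, whether or not Langfuse credentials are configured.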
## Project Structure
```
├── bin/
│   └── preflight.js              Pre-build validation script
├── golden/                       Golden trajectory files (eval regression suite)
├── results/                      Eval output (gitignored)
├── src/
│   ├── cli/
│   │   └── index.ts              CLI entry point (all subcommands)
│   ├── lib/
│   │   ├── config.ts             loadEnvConfig / loadFileConfig / getConfig
│   │   ├── schemas.ts            Zod schemas + parser functions
│   │   └── types.ts              TypeScript interfaces (VLLMConfig, AppConfig, etc.)
│   ├── services/
│   │   ├── eval-runner.ts        EvalRunner (judge + cost + gate orchestration)
│   │   ├── observability.ts      ObservabilityService (Langfuse wrapper)
│   │   └── vllm-adapter.ts       VLLMClient (openai SDK → vLLM endpoint)
│   └── index.ts                  Barrel export
├── tests/
│   ├── cli/
│   │   └── index.test.ts         CLI integration tests
│   ├── lib/
│   │   ├── config.test.ts        Config loading tests
│   │   └── schemas.test.ts       Zod schema validation tests
│   └── services/
│       ├── eval-runner.test.ts   EvalRunner unit + integration tests
│       ├── observability.test.ts Langfuse wrapper tests
│       └── vllm-adapter.test.ts  VLLMClient tests (MSW mocked)
├── .env.example                  Environment variable template
├── DEV_PLAN.md                   Build plan for this recipe
├── package.json                  pnpm workspace root (engines, scripts, deps)
└── tsconfig.json                 TypeScript configuration
```
## License
MIT — see [LICENSE](./LICENSE).