# Ollama Agent Eval Harness for On-Prem SMB Support QA
 
> Run continuous quality evaluation on local AI agents using Ollama, with regression gating and cost tracking, all from a CLI.
 
SMBs that run LLMs on-prem with Ollama often have no automated QA, so regressions in agent behavior reach customers before anyone notices, and support quality drifts over time. This harness evaluates agent trajectories against golden references, gates releases on quality thresholds, tracks per-call token costs, and exports results to Langfuse dashboards, all from a single CLI entrypoint.
 
> _A tutorialized reference solution from [reaatech.com](https://reaatech.com)_
 
## Quick start
 
```bash
# 1. Install dependencies
pnpm install
 
# 2. Configure environment
cp .env.example .env
# Edit .env with your Langfuse keys and Ollama settings
 
# 3. Ensure Ollama is running with a model pulled
ollama pull llama3.1
 
# 4. Run the evaluation harness
npx tsx src/cli.ts
```
 
## Prerequisites
 
- **Node.js >= 22**
- **pnpm >= 10.x**
- **Ollama** installed and running (`ollama serve`)
- **A model pulled** via `ollama pull llama3.1` (or another model set in `OLLAMA_MODEL`)
 
## Configuration
 
All configuration is via environment variables. Copy `.env.example` to `.env` and fill in your values.
 
| Variable | Description | Default | Required |
|---|---|---|---|
| `OLLAMA_HOST` | Ollama REST API endpoint | `http://127.0.0.1:11434` | Yes |
| `OLLAMA_MODEL` | Model name on the Ollama host | `llama3.1` | Yes |
| `OLLAMA_API_KEY` | API key for Ollama Cloud (omit for local) | — | No |
| `LANGFUSE_PUBLIC_KEY` | Langfuse project public key | — | Yes |
| `LANGFUSE_SECRET_KEY` | Langfuse project secret key | — | Yes |
| `LANGFUSE_HOST` | Langfuse API host | `https://cloud.langfuse.com` | Yes |
| `GATE_PRESET` | Quality gate preset | `standard` | Yes |
| `BUDGET_PRESET` | Cost budget preset | `moderate` | Yes |
| `GOLDEN_DIR` | Directory of golden trajectory files | `./golden` | Yes |
| `RESULTS_DIR` | Output directory for eval results | `./results` | Yes |
| `EVAL_CONFIG_PATH` | Path to custom eval-config.yaml | `./eval-config.yaml` | No |
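
As a rough illustration of how the table above could translate into code, here is a minimal sketch of an environment-based loader. The names `loadConfig`, `HarnessConfig`, and `Env` are hypothetical and are not taken from `src/config.ts`; the defaults are copied from the table, and the sketch assumes that only the variables without a default have to be set explicitly.

```typescript
// Hypothetical sketch: these names are illustrative, not the repo's actual API.
type Env = Record<string, string | undefined>;

interface HarnessConfig {
  ollamaHost: string;
  ollamaModel: string;
  ollamaApiKey: string | undefined;
  langfusePublicKey: string;
  langfuseSecretKey: string;
  langfuseHost: string;
  gatePreset: string;
  budgetPreset: string;
  goldenDir: string;
  resultsDir: string;
  evalConfigPath: string;
}

function loadConfig(env: Env): HarnessConfig {
  // Variables with no default in the table must be supplied by the caller.
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required environment variable: ${name}`);
    return value;
  };
  // Unset or empty variables fall back to the documented default.
  const withDefault = (name: string, fallback: string): string => env[name] || fallback;

  return {
    ollamaHost: withDefault("OLLAMA_HOST", "http://127.0.0.1:11434"),
    ollamaModel: withDefault("OLLAMA_MODEL", "llama3.1"),
    ollamaApiKey: env["OLLAMA_API_KEY"] || undefined, // only needed for Ollama Cloud
    langfusePublicKey: required("LANGFUSE_PUBLIC_KEY"),
    langfuseSecretKey: required("LANGFUSE_SECRET_KEY"),
    langfuseHost: withDefault("LANGFUSE_HOST", "https://cloud.langfuse.com"),
    gatePreset: withDefault("GATE_PRESET", "standard"),
    budgetPreset: withDefault("BUDGET_PRESET", "moderate"),
    goldenDir: withDefault("GOLDEN_DIR", "./golden"),
    resultsDir: withDefault("RESULTS_DIR", "./results"),
    evalConfigPath: withDefault("EVAL_CONFIG_PATH", "./eval-config.yaml"),
  };
}
```

In a Node.js entrypoint this would typically be called as `loadConfig(process.env)`; taking the environment as a parameter keeps the loader easy to unit test.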
 
## Architecture
 
The evaluation pipeline follows a five-stage flow:
 
```
golden load → eval → cost → gate → langfuse export
```
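
The flow above can be sketched as a small orchestrator that runs the five stages in order. This is an illustration only: `runPipeline`, `Stages`, and the data shapes are hypothetical names, and the actual `src/eval/runner.ts` may be organized differently.

```typescript
// Hypothetical sketch of the five-stage flow; all names are illustrative.
interface GoldenCase { id: string; expected: string; }
interface EvalResult { id: string; score: number; promptTokens: number; completionTokens: number; }
interface RunSummary { results: EvalResult[]; totalCostUsd: number; passed: boolean; }

interface Stages {
  loadGolden: () => Promise<GoldenCase[]>;                      // 1. golden load
  evaluate: (c: GoldenCase) => Promise<EvalResult>;             // 2. eval
  cost: (results: EvalResult[]) => number;                      // 3. cost
  gate: (results: EvalResult[], totalCostUsd: number) => boolean; // 4. gate
  exportRun: (summary: RunSummary) => Promise<void>;            // 5. langfuse export
}

async function runPipeline(stages: Stages): Promise<RunSummary> {
  const cases = await stages.loadGolden();

  // Evaluate sequentially so a single local Ollama instance is not overloaded.
  const results: EvalResult[] = [];
  for (const c of cases) {
    results.push(await stages.evaluate(c));
  }

  const totalCostUsd = stages.cost(results);
  const passed = stages.gate(results, totalCostUsd);

  const summary: RunSummary = { results, totalCostUsd, passed };
  await stages.exportRun(summary);
  return summary;
}
```

Passing the stages in as functions is a design choice made for this sketch: it lets tests substitute fakes for the Ollama and Langfuse calls.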
 
| File | Responsibility |
|---|---|
| `src/eval/runner.ts` | Pipeline orchestrator — loads golden sets, runs eval, tracks cost, applies quality gates |
| `src/telemetry/instrumentation.ts` | Cost span instrumentation for every Ollama call |
| `src/export/langfuse.ts` | Pushes eval results and cost spans to Langfuse dashboards |
| `src/cli.ts` | Single CLI entrypoint — loads config, validates, orchestrates runner, handles exit codes |
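
To make the cost instrumentation more concrete, the sketch below turns the token counts that Ollama reports on a completed response (`prompt_eval_count` and `eval_count`) into a cost record. The per-token rates are an assumption: this README does not say how a dollar figure is assigned to local inference, so `Pricing`, `CostSpan`, and `toCostSpan` are hypothetical names with placeholder semantics.

```typescript
// Hypothetical sketch: the pricing model and all names here are assumptions.
interface OllamaUsage {
  prompt_eval_count?: number; // tokens in the prompt, as reported by Ollama
  eval_count?: number;        // tokens generated, as reported by Ollama
}

interface Pricing {
  inputUsdPerMillion: number;  // assumed rate, e.g. amortized hardware cost
  outputUsdPerMillion: number; // assumed rate
}

interface CostSpan {
  model: string;
  promptTokens: number;
  completionTokens: number;
  costUsd: number;
}

function toCostSpan(model: string, usage: OllamaUsage, pricing: Pricing): CostSpan {
  // Treat missing counts as zero rather than failing the whole run.
  const promptTokens = usage.prompt_eval_count ?? 0;
  const completionTokens = usage.eval_count ?? 0;
  const costUsd =
    (promptTokens * pricing.inputUsdPerMillion +
      completionTokens * pricing.outputUsdPerMillion) / 1_000_000;
  return { model, promptTokens, completionTokens, costUsd };
}

// Sum the spans of one run to get the figure a budget gate would compare against.
function totalCostUsd(spans: CostSpan[]): number {
  return spans.reduce((sum, span) => sum + span.costUsd, 0);
}
```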
 
## Usage
 
Run the harness with default settings:
 
```bash
npx tsx src/cli.ts
```
 
### Interpreting gate results
 
| Preset | Quality threshold | Budget limit |
|---|---|---|
| `standard` | >= 0.80 | $0.05/task |
| `strict` | >= 0.90 | $0.01/task |
| `lenient` | >= 0.60 | $0.10/task |
 
The CLI exits with code 0 when the run passes, 1 when the gate fails, and 2 when the pipeline encounters a fatal error.
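
The preset table and exit codes can be sketched as follows. This assumes a single preset name selects both the quality threshold and the budget limit, exactly as the table lists them, and that the gate compares one overall score and one cost-per-task figure; the real CLI may combine `GATE_PRESET` and `BUDGET_PRESET` differently. `PRESETS`, `gatePasses`, and `exitCodeFor` are hypothetical names.

```typescript
// Hypothetical sketch: thresholds are copied from the table, names are illustrative.
type Preset = "standard" | "strict" | "lenient";

interface GateLimits { minScore: number; maxUsdPerTask: number; }
interface RunMetrics { overallScore: number; costPerTaskUsd: number; }

const PRESETS: Record<Preset, GateLimits> = {
  standard: { minScore: 0.8, maxUsdPerTask: 0.05 },
  strict: { minScore: 0.9, maxUsdPerTask: 0.01 },
  lenient: { minScore: 0.6, maxUsdPerTask: 0.1 },
};

function gatePasses(preset: Preset, metrics: RunMetrics): boolean {
  const limits = PRESETS[preset];
  return (
    metrics.overallScore >= limits.minScore &&
    metrics.costPerTaskUsd <= limits.maxUsdPerTask
  );
}

// Maps a run to the documented exit codes: 0 pass, 1 gate failure, 2 fatal error.
function exitCodeFor(preset: Preset, run: () => RunMetrics): 0 | 1 | 2 {
  try {
    return gatePasses(preset, run()) ? 0 : 1;
  } catch {
    return 2;
  }
}
```

A CI job can then branch on the exit status, for example by setting `process.exitCode = exitCodeFor("standard", () => metrics)` at the end of the run and treating 1 as a blocked release and 2 as an infrastructure problem.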
 
### Viewing Langfuse dashboards
 
After a successful run, results appear in your Langfuse project under the `eval-run` trace name. Each failed gate is recorded as a score on the trace, and every Ollama call is attached as a span with token counts and cost.
 
### Example output
 
```
Evaluation Summary
━━━━━━━━━━━━━━━━━
  Trajectories      : 12
  Passed            : 10
  Failed            : 2
  Total cost        : $0.0423
 
Gate Results
──────────────────────────────────────────
  overall_score     : 0.83  PASS  (≥ 0.80)
  response_quality  : 0.91  PASS  (≥ 0.70)
  tool_accuracy     : 0.76  PASS  (≥ 0.60)
  latency_p95       : 0.65  FAIL  (≥ 0.80)
 
  Budget            : $0.0423 / $0.0500  PASS  (84.6%)
 
  OVERALL           : PASS  ✓
 
  Exit code         : 0
```
 
## Project structure
 
```
project-root/
├── src/
│   ├── cli.ts                 # CLI entry point
│   ├── config.ts              # Config loader + validation
│   ├── types.ts               # Shared domain types
│   ├── eval/
│   │   └── runner.ts          # Evaluation pipeline orchestrator
│   ├── export/
│   │   └── langfuse.ts        # Langfuse dashboard export
│   ├── ollama/
│   │   └── client.ts          # Ollama wrapper with cost telemetry
│   └── telemetry/
│       └── instrumentation.ts # Cost span instrumentation
├── tests/                     # Vitest test suite
│   ├── cli.test.ts
│   ├── config.test.ts
│   ├── eval/
│   │   └── runner.test.ts
│   ├── export/
│   │   └── langfuse.test.ts
│   ├── ollama/
│   │   └── client.test.ts
│   ├── telemetry/
│   │   └── instrumentation.test.ts
│   └── types.test.ts
├── .env.example
├── package.json
├── tsconfig.json
├── vitest.config.ts
├── eslint.config.mjs
└── DEV_PLAN.md
```
 
## License
 
MIT — see [LICENSE](./LICENSE).