SMBs running on-prem LLMs with Ollama lack automated QA to catch regressions in agent performance before customers encounter errors, leading to support drift and quality degradation.
This tutorial builds an Ollama Agent Eval Harness — a CLI tool that runs continuous quality evaluation on local AI agents using Ollama, with regression gating and cost tracking, all from a single command. You’ll wire up a five-stage pipeline: load golden trajectories, evaluate agent responses, track Ollama token costs, gate releases on quality thresholds, and export results to Langfuse dashboards. By the end you’ll have a working src/cli.ts entrypoint you can run against any Ollama-hosted model.
Prerequisites
Node.js >= 22 — verify with node --version
pnpm >= 10.x — verify with pnpm --version
Ollama installed and running (ollama serve)
A model pulled — ollama pull llama3.1 (or set another in OLLAMA_MODEL)
A Langfuse account (free tier at cloud.langfuse.com) with a project public and secret key
Basic TypeScript knowledge — interfaces, async/await, and module imports
Step 1: Scaffold the project and install dependencies
Create a fresh TypeScript project directory. Start with a package.json that wires up the toolchain — Next.js is included as a build framework, Vitest for testing, and ESLint for linting.
Expected output: node_modules/ and pnpm-lock.yaml are created. pnpm typecheck exits 0 (there are no source files yet).
Step 2: Configure environment variables
Create .env.example with placeholders for every setting the harness reads at runtime. Copy it to .env and fill in your real values.
env
# Env vars used by ollama-agent-eval-harness-for-on-prem-smb-support-qa.
# Keep placeholders only; never commit real values.
OLLAMA_HOST=http://127.0.0.1:11434
OLLAMA_MODEL=llama3.1
OLLAMA_API_KEY=<your-ollama-api-key>
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_HOST=https://cloud.langfuse.com
GATE_PRESET=standard
BUDGET_PRESET=moderate
GOLDEN_DIR=./golden
RESULTS_DIR=./results
EVAL_CONFIG_PATH=./eval-config.yaml
terminal
cp .env.example .env
# Edit .env with your real Langfuse keys and Ollama settings
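Expected output: .env exists with your values. The harness reads these via the getEnvVar() helper from @reaatech/llm-cost-telemetry. Most fields fall back to built-in defaults; the exceptions are LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY, which default to empty strings and are rejected by validation, and OLLAMA_API_KEY, which is optional.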
Step 3: Define the shared domain types
Create src/types.ts — the central type definitions that every other module imports. This file defines the configuration shape, the evaluation run result, the Ollama call record, a defaultConfig() helper, and default pricing for local models.
Expected output: pnpm typecheck compiles src/types.ts without errors. The defaultConfig() function provides a quick-start config for consumers who just need reasonable defaults.
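The listing for src/types.ts is not reproduced here, so the following is a minimal sketch of what it might contain. The EvalConfig fields are inferred from the environment variables in Step 2 and the loader in Step 4; OllamaCallRecord, ModelPricing, and DEFAULT_LOCAL_PRICING are hypothetical names and shapes chosen for illustration.

```typescript
// Sketch of src/types.ts. EvalConfig mirrors the env vars from Step 2;
// the call record and pricing shapes below are illustrative assumptions.

export type GatePreset = "standard" | "strict" | "lenient";
export type BudgetPreset = "strict" | "moderate" | "lenient";

export interface EvalConfig {
  ollamaHost: string;
  ollamaModel: string;
  ollamaApiKey?: string;
  langfusePublicKey: string;
  langfuseSecretKey: string;
  langfuseHost: string;
  gatePreset: GatePreset;
  budgetPreset: BudgetPreset;
  goldenDir: string;
  resultsDir: string;
  evalConfigPath: string;
}

// Hypothetical record of one Ollama call (field names are assumptions).
export interface OllamaCallRecord {
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  durationMs: number;
}

// Hypothetical pricing shape: USD per one million tokens.
export interface ModelPricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// Local models carry no per-token API charge, so the default is zero.
export const DEFAULT_LOCAL_PRICING: ModelPricing = {
  inputPerMillionUsd: 0,
  outputPerMillionUsd: 0,
};

// Quick-start config using the same defaults as .env.example.
export function defaultConfig(): EvalConfig {
  return {
    ollamaHost: "http://127.0.0.1:11434",
    ollamaModel: "llama3.1",
    langfusePublicKey: "",
    langfuseSecretKey: "",
    langfuseHost: "https://cloud.langfuse.com",
    gatePreset: "standard",
    budgetPreset: "moderate",
    goldenDir: "./golden",
    resultsDir: "./results",
    evalConfigPath: "./eval-config.yaml",
  };
}
```

Keeping the preset names as string-literal unions lets the compiler reject a misspelled preset at the call site instead of at runtime.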
Step 4: Build the configuration loader
Create src/config.ts, a module that reads environment variables via the getEnvVar() helper from @reaatech/llm-cost-telemetry, falls back to sensible defaults, and validates the config before the pipeline runs.
ts
import { getEnvVar } from "@reaatech/llm-cost-telemetry";
import {
  getStandardPreset,
  getStrictPreset,
  getLenientPreset,
} from "@reaatech/agent-eval-harness-gate";
import type { EvalConfig } from "./types.js";

// Read every setting from the environment, falling back to a default.
export function loadEvalConfig(): EvalConfig {
  return {
    ollamaHost: getEnvVar("OLLAMA_HOST") ?? "http://127.0.0.1:11434",
    ollamaModel: getEnvVar("OLLAMA_MODEL") ?? "llama3.1",
    ollamaApiKey: getEnvVar("OLLAMA_API_KEY") ?? undefined,
    langfusePublicKey: getEnvVar("LANGFUSE_PUBLIC_KEY") ?? "",
    langfuseSecretKey: getEnvVar("LANGFUSE_SECRET_KEY") ?? "",
    langfuseHost: getEnvVar("LANGFUSE_HOST") ?? "https://cloud.langfuse.com",
    gatePreset: (getEnvVar("GATE_PRESET") ?? "standard") as EvalConfig["gatePreset"],
    budgetPreset: (getEnvVar("BUDGET_PRESET") ?? "moderate") as EvalConfig["budgetPreset"],
    goldenDir: getEnvVar("GOLDEN_DIR") ?? "./golden",
    resultsDir: getEnvVar("RESULTS_DIR") ?? "./results",
    evalConfigPath: getEnvVar("EVAL_CONFIG_PATH") ?? "./eval-config.yaml",
  };
}

// Return one human-readable message per problem; an empty array means valid.
export function validateConfig(config: EvalConfig): string[] {
  const errors: string[] = [];
  if (!config.ollamaHost) errors.push("OLLAMA_HOST must be set");
  else if (!config.ollamaHost.startsWith("http"))
    errors.push("OLLAMA_HOST must start with http:// or https://");
  if (!config.ollamaModel) errors.push("OLLAMA_MODEL must be set");
  if (!config.langfusePublicKey) errors.push("LANGFUSE_PUBLIC_KEY must be set");
  if (!config.langfuseSecretKey) errors.push("LANGFUSE_SECRET_KEY must be set");
  if (!["standard", "strict", "lenient"].includes(config.gatePreset))
    errors.push(`GATE_PRESET must be one of: standard, strict, lenient (got "${config.gatePreset}")`);
  if (!["strict", "moderate", "lenient"].includes(config.budgetPreset))
    errors.push(`BUDGET_PRESET must be one of: strict, moderate, lenient (got "${config.budgetPreset}")`);
  return errors;
}

// Map the preset name to the gate definitions shipped with the gate package.
export function getGateEnginePreset(
  preset: "standard" | "strict" | "lenient"
): { gates: import("@reaatech/agent-eval-harness-gate").GateDefinition[] } {
  switch (preset) {
    case "strict":
      return getStrictPreset();
    case "lenient":
      return getLenientPreset();
    default:
      return getStandardPreset();
  }
}
Expected output: pnpm typecheck passes. The validateConfig() function returns an empty array for valid configs and descriptive error strings for missing or invalid fields.
Step 5: Create the Ollama client with cost telemetry
Create src/ollama/client.ts — a wrapper around the ollama npm package that captures token counts and computes cost using @reaatech/llm-cost-telemetry. This is what you’ll call when you want to chat with a model and know what it cost.
Expected output: pnpm typecheck passes. For local models like llama3.1 the pricing is zero, so costUsd is 0: on-prem models have no per-token API cost.
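The client listing is not reproduced here, so the following is a rough sketch of the idea rather than the actual src/ollama/client.ts. It assumes the chat transport is injected as a function and that the response carries prompt_eval_count and eval_count, the token counters Ollama reports on a non-streaming chat response. The Pricing shape, computeCostUsd, and chatWithCost are hypothetical names, and the cost arithmetic is written out by hand instead of calling @reaatech/llm-cost-telemetry.

```typescript
// Sketch of a cost-aware chat wrapper. The transport is injected so the
// cost logic can be exercised without a running Ollama server.

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Subset of the fields reported on a non-streaming chat response.
interface ChatResponseLike {
  message: { role: string; content: string };
  prompt_eval_count?: number; // prompt tokens; treated as 0 when absent
  eval_count?: number; // generated tokens; treated as 0 when absent
}

type ChatFn = (req: { model: string; messages: ChatMessage[] }) => Promise<ChatResponseLike>;

// Hypothetical pricing shape: USD per one million tokens.
interface Pricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

interface ChatResult {
  content: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

// Pure cost arithmetic: tokens times the per-million rate, scaled back down.
function computeCostUsd(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  return (
    (inputTokens * pricing.inputPerMillionUsd + outputTokens * pricing.outputPerMillionUsd) / 1e6
  );
}

// Call the model once and attach token counts and cost to the reply.
async function chatWithCost(
  chat: ChatFn,
  model: string,
  messages: ChatMessage[],
  pricing: Pricing,
): Promise<ChatResult> {
  const res = await chat({ model, messages });
  const inputTokens = res.prompt_eval_count ?? 0;
  const outputTokens = res.eval_count ?? 0;
  return {
    content: res.message.content,
    inputTokens,
    outputTokens,
    costUsd: computeCostUsd(inputTokens, outputTokens, pricing),
  };
}
```

With the ollama npm package, the injected function could be as small as `(req) => new Ollama({ host }).chat(req)`. Injecting it keeps the unit tests free of network calls.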
Step 6: Add telemetry instrumentation
Create src/telemetry/instrumentation.ts — a module that builds CostSpan objects for every Ollama call, validated against Zod schemas from @reaatech/llm-cost-telemetry. The createTelemetryContext function establishes a tenant/feature identity that flows through every span.
Expected output: pnpm typecheck passes. The CostSpanSchema.parse() call throws if any field is missing or has the wrong type, because Zod enforces the shape at runtime.
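To illustrate the validate-on-build pattern without depending on the real schema, here is a small sketch that uses a hand-rolled runtime check as a stand-in for CostSpanSchema.parse(). The span fields, makeTelemetryContext, parseCostSpan, and buildCostSpan are all hypothetical; the actual CostSpan shape is defined by @reaatech/llm-cost-telemetry and may differ.

```typescript
// Stand-in for Zod validation: a manual shape check that throws on bad input.
// Field names are assumptions, not the library's real CostSpan schema.

interface TelemetryContext {
  tenant: string;
  feature: string;
}

interface CostSpanSketch {
  tenant: string;
  feature: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

// Tenant/feature identity that is copied onto every span.
function makeTelemetryContext(tenant: string, feature: string): TelemetryContext {
  return { tenant, feature };
}

// Throws with a descriptive message when a field is missing or mistyped.
function parseCostSpan(value: unknown): CostSpanSketch {
  if (typeof value !== "object" || value === null) {
    throw new Error("span must be an object");
  }
  const v = value as Record<string, unknown>;
  for (const key of ["tenant", "feature", "model"]) {
    if (typeof v[key] !== "string" || v[key] === "") {
      throw new Error(`${key} must be a non-empty string`);
    }
  }
  for (const key of ["inputTokens", "outputTokens", "costUsd"]) {
    const n = v[key];
    if (typeof n !== "number" || !isFinite(n) || n < 0) {
      throw new Error(`${key} must be a non-negative number`);
    }
  }
  return v as unknown as CostSpanSketch;
}

// Merge the context with one call's numbers, then validate before returning.
function buildCostSpan(
  ctx: TelemetryContext,
  call: { model: string; inputTokens: number; outputTokens: number; costUsd: number },
): CostSpanSketch {
  return parseCostSpan({ ...ctx, ...call });
}
```

Validating at construction time means a malformed span fails where it is created, not later when it is exported.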
Step 7: Build the evaluation pipeline runner
Create src/eval/runner.ts — the heart of the harness. This module orchestrates the full pipeline: loading golden references, running the evaluation command, tracking per-trajectory cost against budget, evaluating quality gates, and returning a structured result.
ts
import { evalCommand, goldenCommand, cliOut, cliWarn } from "@reaatech/agent-eval-harness-cli";
import type { EvalOptions } from "@reaatech/agent-eval-harness-cli";
import {
  calculateTrajectoryCost,
  checkBudget,
  createBudget,
  generateCostReport,
  CostTracker,
} from "@reaatech/agent-eval-harness-cost";
import type { CostBreakdown, Trajectory } from "@reaatech/agent-eval-harness-types";
import type { AggregatedResults } from "@reaatech/agent-eval-harness-suite";
import { createGateEngine, CIIntegration } from "@reaatech/agent-eval-harness-gate";
import type { GateEngine, GateEvaluationSummary } from "@reaatech/agent-eval-harness-gate";
import { generateId, now } from "@reaatech/llm-cost-telemetry";
Expected output: pnpm typecheck passes. The runner has runtime imports from four REAA packages: @reaatech/agent-eval-harness-cli for the CLI commands, @reaatech/agent-eval-harness-cost for cost tracking, @reaatech/agent-eval-harness-gate for quality gating, and @reaatech/llm-cost-telemetry for identifiers and timestamps. It also pulls type-only imports from @reaatech/agent-eval-harness-types and @reaatech/agent-eval-harness-suite.
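The runner body is not shown beyond its imports, so the following sketch only illustrates the decision logic described for this step: check each trajectory's cost against a budget, compare aggregate quality to gate thresholds, and return a structured result. It deliberately does not call the REAA packages. TrajectoryResult, QualityGate, RunSummary, and summarizeRun are hypothetical names, and a single mean-score gate is an assumed simplification.

```typescript
// Sketch of the budget and gate decision, independent of the REAA packages.

interface TrajectoryResult {
  id: string;
  overall: number; // quality score in [0, 1]
  costUsd: number;
}

interface QualityGate {
  name: string;
  minMeanOverall: number; // gate fails when the mean score is below this
}

interface RunSummary {
  meanOverall: number;
  totalCostUsd: number;
  overBudget: string[]; // trajectory ids that exceeded the per-trajectory budget
  failedGates: string[]; // names of gates whose threshold was not met
  passed: boolean;
}

function summarizeRun(
  results: TrajectoryResult[],
  gates: QualityGate[],
  budgetPerTrajectoryUsd: number,
): RunSummary {
  const totalCostUsd = results.reduce((sum, r) => sum + r.costUsd, 0);
  // An empty run has no evidence of quality, so its mean is treated as 0.
  const meanOverall =
    results.length === 0 ? 0 : results.reduce((sum, r) => sum + r.overall, 0) / results.length;
  const overBudget = results.filter((r) => r.costUsd > budgetPerTrajectoryUsd).map((r) => r.id);
  const failedGates = gates.filter((g) => meanOverall < g.minMeanOverall).map((g) => g.name);
  return {
    meanOverall,
    totalCostUsd,
    overBudget,
    failedGates,
    passed: overBudget.length === 0 && failedGates.length === 0,
  };
}
```

Returning the lists of offenders, not just a boolean, gives the CLI and the exporter enough detail to explain why a release was blocked.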
Step 8: Create the Langfuse exporter
Create src/export/langfuse.ts — a module that pushes evaluation results and cost spans to Langfuse for dashboarding. Each eval run becomes a trace, failed gates are recorded as scores, and each Ollama call is attached as a span.
Expected output: pnpm typecheck passes. The Langfuse client is imported from the langfuse npm package (pinned to exactly version 3.38.20). On a real run, you’d see the trace appear in your Langfuse project under the eval-run name.
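As a sketch of the mapping described in this step (one trace per run, one score per failed gate, one span per Ollama call), the code below is written against a narrow interface instead of the SDK itself. TracerLike and TraceLike are assumed to mirror the subset of the Langfuse client used here (trace(), span(), score()); EvalRunSketch and exportRun are hypothetical, and recording a failed gate as a score of 0 is an illustrative choice.

```typescript
// Exporter sketch against a minimal tracer interface (an assumed subset
// of the Langfuse client), so it can run with an in-memory fake.

interface TraceLike {
  span(body: { name: string; metadata?: Record<string, unknown> }): unknown;
  score(body: { name: string; value: number; comment?: string }): unknown;
}

interface TracerLike {
  trace(body: { name: string; metadata?: Record<string, unknown> }): TraceLike;
}

interface EvalRunSketch {
  runId: string;
  failedGates: string[];
  calls: { model: string; inputTokens: number; outputTokens: number; costUsd: number }[];
}

function exportRun(client: TracerLike, run: EvalRunSketch): void {
  // One trace per evaluation run, named so it is easy to find in the dashboard.
  const trace = client.trace({ name: "eval-run", metadata: { runId: run.runId } });

  // Each failed gate becomes a score of 0 on that trace.
  for (const gate of run.failedGates) {
    trace.score({ name: gate, value: 0, comment: "gate failed" });
  }

  // Each Ollama call becomes a span carrying its token and cost numbers.
  run.calls.forEach((call, i) => {
    trace.span({ name: `ollama-call-${i}`, metadata: { ...call } });
  });
}
```

In a real run the client would be a Langfuse instance built from the public key, secret key, and host in the config, and the process would need to flush pending events before exiting. Treat those details as assumptions to check against the SDK version you pin.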
Step 9: Wire the CLI entrypoint
Create src/cli.ts — the single entry point that ties everything together. It loads the config, validates it, creates the Langfuse client, establishes telemetry context, runs the evaluation, pushes results, prints the summary, and exits with the correct code.
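The entrypoint listing is not reproduced here. The sketch below shows one way the described sequence could be arranged, with every collaborator injected so the control flow is visible in isolation. CliDeps, main, and exitCodeFor are hypothetical names, and the exit-code convention (0 for pass, 1 for gate failure, 2 for invalid config) is an assumption, not taken from the original code.

```typescript
// Sketch of the CLI control flow with injected collaborators.

interface CliDeps {
  validate(): string[]; // config errors; empty when valid
  runEval(): Promise<{ passed: boolean; summary: string }>;
  exportResults(): Promise<void>;
  log(line: string): void;
}

// Assumed convention: 2 = invalid config, 1 = quality gate failed, 0 = pass.
function exitCodeFor(configErrors: string[], gatesPassed: boolean): number {
  if (configErrors.length > 0) return 2;
  return gatesPassed ? 0 : 1;
}

async function main(deps: CliDeps): Promise<number> {
  // Fail fast on bad configuration before touching Ollama or Langfuse.
  const errors = deps.validate();
  if (errors.length > 0) {
    errors.forEach((e) => deps.log(`config error: ${e}`));
    return exitCodeFor(errors, false);
  }

  // Run the evaluation, push results, then print the summary.
  const result = await deps.runEval();
  await deps.exportResults();
  deps.log(result.summary);
  return exitCodeFor([], result.passed);
}
```

A real src/cli.ts would pass the value returned by main to process.exit so that CI can gate a release on the exit code.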
terminal
pnpm vitest run --coverage --reporter=json --outputFile=vitest-report.json
Expected output: All 8 tests pass, and coverage meets the 90% thresholds for lines, branches, functions, and statements across all src/ runtime files. The report is written to vitest-report.json.
Step 12: Run the harness end to end
Before you run the harness, make sure Ollama is serving and you have a golden trajectory file. Create a sample golden file:
terminal
mkdir -p golden results
cat > golden/support-qa.jsonl << 'EOF'
{"messages":[{"role":"user","content":"How do I reset my password?"},{"role":"assistant","content":"Go to Settings > Account > Reset Password."}],"expected":{"overall":0.9}}
EOF
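For reference, a golden file in this shape holds one JSON object per line. The sketch below parses such text and checks the two fields used in the sample above, messages and expected.overall. GoldenTrajectory and parseGoldenJsonl are hypothetical names; the harness's own loader may accept a richer schema.

```typescript
// Sketch of a JSONL parser for golden trajectories like the sample above.

interface GoldenMessage {
  role: string;
  content: string;
}

interface GoldenTrajectory {
  messages: GoldenMessage[];
  expected: { overall: number };
}

function parseGoldenJsonl(text: string): GoldenTrajectory[] {
  const trajectories: GoldenTrajectory[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // tolerate blank lines and a trailing newline

    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      throw new Error(`line ${index + 1}: invalid JSON`);
    }
    if (typeof parsed !== "object" || parsed === null) {
      throw new Error(`line ${index + 1}: expected a JSON object`);
    }

    // Check only the fields the sample file uses.
    const rec = parsed as Partial<GoldenTrajectory>;
    if (!Array.isArray(rec.messages) || typeof rec.expected?.overall !== "number") {
      throw new Error(`line ${index + 1}: needs "messages" and "expected.overall"`);
    }
    trajectories.push(rec as GoldenTrajectory);
  });
  return trajectories;
}
```

Reporting the line number in each error makes a malformed golden file quick to fix, which matters once the file grows past a handful of trajectories.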
Now invoke the CLI:
terminal
npx tsx src/cli.ts
Expected output (with Ollama running and Langfuse keys set):