Ollama Agent Eval Harness for On-Prem SMB Support QA

Run continuous quality evaluation on local AI agents using Ollama, with regression gating and cost tracking, all from a CLI.

ollama eval-harness cli typescript langfuse agent-evaluation regression-gating cost-tracking smb-support-qa

The problem

Small and medium-sized businesses (SMBs) that run LLMs on-premises with Ollama often have no automated QA to catch regressions in agent behavior before customers hit them. Without it, support quality drifts and degrades unnoticed.

Intro

This tutorial builds an Ollama Agent Eval Harness: a CLI tool that continuously evaluates local AI agents served by Ollama, with regression gating and cost tracking, from a single command. You’ll wire up a five-stage pipeline: (1) load golden trajectories, (2) evaluate agent responses, (3) track Ollama token costs, (4) gate releases on quality thresholds, and (5) export results to Langfuse dashboards. By the end you’ll have a working src/cli.ts entrypoint you can run against any Ollama-hosted model.
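To make the gating stage concrete, here is a minimal sketch of a regression gate. It assumes a run is summarized by pass and fail counts and compared against the pass rate of the last accepted run. The names GateInput, GateThresholds, evaluateGate, and exitCodeFor are hypothetical and are not the API of @reaatech/agent-eval-harness-gate; the threshold values in the usage note are illustrative only.

```typescript
// Hypothetical shapes for illustration; not the harness's real gate API.
interface GateInput {
  passCount: number;
  failCount: number;
  baselinePassRate: number; // pass rate of the last accepted run, 0..1
}

interface GateThresholds {
  minPassRate: number;   // absolute floor, 0..1
  maxRegression: number; // largest tolerated drop versus the baseline, 0..1
}

interface GateDecision {
  passed: boolean;
  passRate: number;
  reasons: string[];
}

function evaluateGate(input: GateInput, thresholds: GateThresholds): GateDecision {
  const total = input.passCount + input.failCount;
  // An empty run cannot demonstrate quality, so it is treated as a failure.
  const passRate = total === 0 ? 0 : input.passCount / total;
  const reasons: string[] = [];

  if (total === 0) reasons.push("no trajectories were evaluated");
  if (passRate < thresholds.minPassRate) {
    reasons.push(`pass rate ${passRate.toFixed(2)} is below the floor ${thresholds.minPassRate.toFixed(2)}`);
  }
  const drop = input.baselinePassRate - passRate;
  if (drop > thresholds.maxRegression) {
    reasons.push(`pass rate dropped ${drop.toFixed(2)} versus the baseline`);
  }
  return { passed: reasons.length === 0, passRate, reasons };
}

// A CLI would typically map the decision to a process exit code so CI can block a release.
function exitCodeFor(decision: GateDecision): number {
  return decision.passed ? 0 : 1;
}
```

For instance, evaluateGate({ passCount: 9, failCount: 1, baselinePassRate: 0.95 }, { minPassRate: 0.8, maxRegression: 0.1 }) passes, while a run with 7 passes and 3 failures against the same baseline fails on both the floor and the regression check.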

Prerequisites

Node.js >= 22 — verify with node --version
pnpm >= 10.x — verify with pnpm --version
Ollama installed and running (ollama serve)
A model pulled — ollama pull llama3.1 (or set another in OLLAMA_MODEL)
A Langfuse account (free tier at cloud.langfuse.com) with a project public and secret key
Basic TypeScript knowledge — interfaces, async/await, and module imports

Step 1: Scaffold the project and install dependencies

Create a fresh TypeScript project directory. Start with a package.json that wires up the toolchain: Next.js as the build framework, Vitest for testing, and ESLint for linting.

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

174 kB · 63 tests · 99.2% coverage · Vitest passing

SHA-256: 15136547bd15153e0ee109e098f661a87990334eff3e695073cf36e16f2147f4

// Shared domain types for the eval harness.
import type { GateEvaluationSummary } from "@reaatech/agent-eval-harness-gate";
import type { CostBreakdown } from "@reaatech/agent-eval-harness-types";
import type { BudgetCheckResult } from "@reaatech/agent-eval-harness-cost";

export type OllamaGatePreset = "standard" | "strict" | "lenient";
export type OllamaBudgetPreset = "strict" | "moderate" | "lenient";

// Runtime configuration for one evaluation run.
export interface EvalConfig {
  ollamaHost: string;
  ollamaModel: string;
  ollamaApiKey?: string;
  langfusePublicKey: string;
  langfuseSecretKey: string;
  langfuseHost: string;
  gatePreset: OllamaGatePreset;
  budgetPreset: OllamaBudgetPreset;
  goldenDir: string;
  resultsDir: string;
  evalConfigPath: string;
}

// Outcome of a single evaluation run.
export interface EvalRunResult {
  evalId: string;
  timestamp: Date;
  resultsPath: string;
  gateSummary: GateEvaluationSummary;
  costBreakdown: CostBreakdown;
  trajectoryCount: number;
  passCount: number;
  failCount: number;
  totalCost: number;
}

// One recorded chat call against Ollama, including token counts and latency.
export interface OllamaCallRecord {
  id: string;
  model: string;
  messages: Array<{ role: string; content: string }>;
  response: string;
  inputTokens: number;
  outputTokens: number;
  durationMs: number;
  timestamp: Date;
}

// Per-model token rates; every entry is 0 for these locally hosted models.
export const DEFAULT_OLLAMA_PRICING: Record<string, { input: number; output: number }> = {
  "llama3.1": { input: 0, output: 0 },
  "llama3.2": { input: 0, output: 0 },
  "mistral": { input: 0, output: 0 },
  "qwen2.5": { input: 0, output: 0 },
  "deepseek-r1": { input: 0, output: 0 },
};

// Builds a config from process.env, falling back to local defaults.
export function defaultConfig(): EvalConfig {
  return {
    ollamaHost: process.env.OLLAMA_HOST ?? "http://127.0.0.1:11434",
    ollamaModel: process.env.OLLAMA_MODEL ?? "llama3.1",
    ollamaApiKey: process.env.OLLAMA_API_KEY,
    langfusePublicKey: process.env.LANGFUSE_PUBLIC_KEY ?? "",
    langfuseSecretKey: process.env.LANGFUSE_SECRET_KEY ?? "",
    langfuseHost: process.env.LANGFUSE_HOST ?? "https://cloud.langfuse.com",
    gatePreset: (process.env.GATE_PRESET ?? "standard") as EvalConfig["gatePreset"],
    budgetPreset: (process.env.BUDGET_PRESET ?? "moderate") as EvalConfig["budgetPreset"],
    goldenDir: process.env.GOLDEN_DIR ?? "./golden",
    resultsDir: process.env.RESULTS_DIR ?? "./results",
    evalConfigPath: process.env.EVAL_CONFIG_PATH ?? "./eval-config.yaml",
  };
}

export type { GateEvaluationSummary, BudgetCheckResult };
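Ollama model names often carry a tag, for example llama3.1:8b, and such a name would miss a lookup keyed on the bare model name. Below is a minimal sketch of a tag-tolerant lookup. The PRICING table is a local stand-in for DEFAULT_OLLAMA_PRICING so the example is self-contained, and lookupPricing is a hypothetical helper, not part of the harness.

```typescript
type Pricing = { input: number; output: number };

// Local stand-in for DEFAULT_OLLAMA_PRICING so the sketch runs on its own.
const PRICING: Record<string, Pricing> = {
  "llama3.1": { input: 0, output: 0 },
  "mistral": { input: 0, output: 0 },
};

// Fallback used when a model has no entry in the table.
const FREE: Pricing = { input: 0, output: 0 };

// Reads only own keys, so names such as "constructor" never match inherited properties.
function ownEntry(table: Record<string, Pricing>, key: string): Pricing | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// Hypothetical helper: try the exact name first, then the name without its ":tag" suffix.
function lookupPricing(model: string, table: Record<string, Pricing> = PRICING): Pricing {
  const exact = ownEntry(table, model);
  if (exact) return exact;
  const colon = model.indexOf(":");
  const base = colon === -1 ? model : model.slice(0, colon);
  return ownEntry(table, base) ?? FREE;
}
```

With this helper, lookupPricing("llama3.1:8b") resolves to the llama3.1 entry instead of silently falling through to the default.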

// Configuration loader: reads environment variables and validates the result.
import { getEnvVar } from "@reaatech/llm-cost-telemetry";
import {
  getStandardPreset,
  getStrictPreset,
  getLenientPreset,
  type GateDefinition,
} from "@reaatech/agent-eval-harness-gate";
import type { EvalConfig } from "./types.js";

export function loadEvalConfig(): EvalConfig {
  return {
    ollamaHost: getEnvVar("OLLAMA_HOST") ?? "http://127.0.0.1:11434",
    ollamaModel: getEnvVar("OLLAMA_MODEL") ?? "llama3.1",
    ollamaApiKey: getEnvVar("OLLAMA_API_KEY") ?? undefined,
    langfusePublicKey: getEnvVar("LANGFUSE_PUBLIC_KEY") ?? "",
    langfuseSecretKey: getEnvVar("LANGFUSE_SECRET_KEY") ?? "",
    langfuseHost: getEnvVar("LANGFUSE_HOST") ?? "https://cloud.langfuse.com",
    gatePreset: (getEnvVar("GATE_PRESET") ?? "standard") as EvalConfig["gatePreset"],
    budgetPreset: (getEnvVar("BUDGET_PRESET") ?? "moderate") as EvalConfig["budgetPreset"],
    goldenDir: getEnvVar("GOLDEN_DIR") ?? "./golden",
    resultsDir: getEnvVar("RESULTS_DIR") ?? "./results",
    evalConfigPath: getEnvVar("EVAL_CONFIG_PATH") ?? "./eval-config.yaml",
  };
}

// Returns a list of human-readable problems; an empty list means the config is usable.
export function validateConfig(config: EvalConfig): string[] {
  const errors: string[] = [];

  if (!config.ollamaHost) errors.push("OLLAMA_HOST must be set");
  // Match the full scheme so values such as "httpfoo" are rejected.
  else if (!/^https?:\/\//.test(config.ollamaHost))
    errors.push("OLLAMA_HOST must start with http:// or https://");

  if (!config.ollamaModel) errors.push("OLLAMA_MODEL must be set");
  if (!config.langfusePublicKey) errors.push("LANGFUSE_PUBLIC_KEY must be set");
  if (!config.langfuseSecretKey) errors.push("LANGFUSE_SECRET_KEY must be set");

  // The presets were narrowed with a type assertion above, so check them at runtime here.
  if (!["standard", "strict", "lenient"].includes(config.gatePreset))
    errors.push(`GATE_PRESET must be one of: standard, strict, lenient (got "${config.gatePreset}")`);
  if (!["strict", "moderate", "lenient"].includes(config.budgetPreset))
    errors.push(`BUDGET_PRESET must be one of: strict, moderate, lenient (got "${config.budgetPreset}")`);

  return errors;
}

// Maps a preset name to its gate definitions; unknown names fall back to the standard preset.
export function getGateEnginePreset(
  preset: "standard" | "strict" | "lenient"
): { gates: GateDefinition[] } {
  switch (preset) {
    case "strict":
      return getStrictPreset();
    case "lenient":
      return getLenientPreset();
    default:
      return getStandardPreset();
  }
}
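The loader narrows GATE_PRESET and BUDGET_PRESET with a type assertion, so an invalid value only surfaces later in validateConfig. One alternative, sketched here under the assumption that falling back to a default with a warning is acceptable, is to check the raw string at runtime before it enters the config. parsePreset and GATE_PRESETS are hypothetical names and are not part of the harness.

```typescript
// Hypothetical constant listing the allowed gate presets.
const GATE_PRESETS = ["standard", "strict", "lenient"] as const;
type GatePreset = (typeof GATE_PRESETS)[number];

// Returns the raw value when it is one of the allowed presets,
// otherwise the fallback plus a warning the caller can log.
function parsePreset<T extends string>(
  raw: string | undefined,
  allowed: readonly T[],
  fallback: T,
): { value: T; warning?: string } {
  if (raw === undefined || raw === "") return { value: fallback };
  const match = allowed.find((candidate) => candidate === raw);
  if (match !== undefined) return { value: match };
  return { value: fallback, warning: `unknown preset "${raw}", using "${fallback}"` };
}

// Usage: the result is typed as GatePreset without any assertion.
const gate: { value: GatePreset; warning?: string } = parsePreset("strict", GATE_PRESETS, "standard");
```

The trade-off is that a typo in the environment no longer stops the run. If a hard failure is preferred, the warning can be pushed into the same error list that validateConfig returns.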

// Ollama client wrapper that records token usage and cost for each chat call.
import ollama, { Ollama } from "ollama";
import { generateId, now, calculateCostFromTokens, type CostSpan } from "@reaatech/llm-cost-telemetry";
import { DEFAULT_OLLAMA_PRICING } from "../types.js";

export interface ChatParams {
  model: string;
  messages: Array<{ role: string; content: string }>;
  host?: string;
  tenant?: string;
  feature?: string;
}

export interface ChatResult {
  response: string;
  span: Omit<CostSpan, "provider"> & { provider: string };
}

export async function callOllamaChat(params: ChatParams): Promise<ChatResult> {
  const startMs = Date.now();
  const pricing = DEFAULT_OLLAMA_PRICING[params.model] ?? { input: 0, output: 0 };

  // The host is a client option, not a chat request field, so a host override
  // needs its own client; otherwise the default client is used.
  const client = params.host ? new Ollama({ host: params.host }) : ollama;
  const ollamaResponse = await client.chat({ model: params.model, messages: params.messages });
  const durationMs = Date.now() - startMs;

  // Fall back to 0 when a token count is missing from the response.
  const inputTokens = ollamaResponse.prompt_eval_count || 0;
  const outputTokens = ollamaResponse.eval_count || 0;
  const responseText = ollamaResponse.message.content;

  // Prices the combined token count at the input rate; pricing.output is not used here.
  const totalTokens = inputTokens + outputTokens;
  const costUsd = calculateCostFromTokens(totalTokens, pricing.input);

  const spanObj = {
    id: generateId(),
    provider: "ollama" as string,
    model: params.model,
    inputTokens,
    outputTokens,
    costUsd,
    tenant: params.tenant ?? "unknown",
    feature: params.feature ?? "chat",
    timestamp: now(),
    durationMs,
  };
  const span = spanObj as unknown as CostSpan;

  return { response: responseText, span };
}

// Thin wrapper around a dedicated Ollama client bound to a specific host.
export function createOllamaClient(config: { host: string; headers?: Record<string, string> }) {
  const client = new Ollama(config);
  return {
    chat: async (params: { model: string; messages: Array<{ role: string; content: string }> }) => {
      return client.chat(params);
    },
    generate: async (params: Parameters<typeof client.generate>[0]) => {
      return client.generate(params);
    },
    embed: async (params: Parameters<typeof client.embed>[0]) => {
      return client.embed(params);
    },
  };
}
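Because callOllamaChat records input and output token counts separately, a cost figure can price each side at its own rate. The sketch below assumes rates are expressed in USD per one million tokens, a unit the source does not state. estimateCostUsd is a hypothetical helper and is not the calculateCostFromTokens function from @reaatech/llm-cost-telemetry.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Assumed unit: USD per 1,000,000 tokens (not confirmed by the source).
interface Rates {
  input: number;
  output: number;
}

const TOKENS_PER_RATE_UNIT = 1_000_000;

// Prices input and output tokens separately and returns the sum in USD.
function estimateCostUsd(usage: TokenUsage, rates: Rates): number {
  const inputCost = (usage.inputTokens / TOKENS_PER_RATE_UNIT) * rates.input;
  const outputCost = (usage.outputTokens / TOKENS_PER_RATE_UNIT) * rates.output;
  return inputCost + outputCost;
}
```

With the all-zero rates in DEFAULT_OLLAMA_PRICING this returns 0 for every call. Non-zero rates become useful if hardware or energy costs are amortized into a per-token figure.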

Intro

Prerequisites

Step 1: Scaffold the project and install dependencies

Step 2: Configure environment variables

Step 3: Define the shared domain types

Step 4: Build the configuration loader

Step 5: Create the Ollama client with cost telemetry

Step 6: Add telemetry instrumentation

Step 7: Build the evaluation pipeline runner

Step 8: Create the Langfuse exporter

Step 9: Wire the CLI entrypoint

Step 10: Export the public API barrel

Step 11: Write and run the test suite

Step 12: Run the harness end to end

Next steps