Ollama Reliability Suite for On-Prem SMB Agent Operations
Keep local Ollama agents running 24/7 with automatic circuit breaking, runbook generation, session durability, and repair of broken structured outputs.
Small businesses running local AI agents on Ollama face silent failures when models return garbled JSON, hit rate limits, or crash mid-session — leaving customers hanging. They lack the ops tooling to detect and recover from these failures without a dedicated SRE. This tutorial builds a complete reliability suite: a circuit breaker that stops hammering a downed model, session continuity that preserves conversation state across restarts, a structured output repair engine that fixes malformed LLM JSON before it reaches your application logic, and automated runbook generation via Trigger.dev durable workflows when sessions time out.
You will build a CLI agent and a Next.js API route that demonstrate all four reliability layers working together. The code runs against a local Ollama instance and works on any Linux or macOS machine with Node.js 22+, pnpm, and a running Ollama server.
Prerequisites
Node.js >= 22 installed
pnpm 10+ (npm install -g pnpm@10)
Ollama running locally (ollama serve)
An Upstash Redis account (free tier works) for rate limiting
A Trigger.dev account for durable workflows
An Anthropic API key for runbook generation with Claude
A Langfuse account for LLM tracing (free tier works)
Step 1: Scaffold the project and install dependencies
The project uses Next.js 16 (App Router) as its shell, with TypeScript, Vitest for testing, and ESLint for linting. Start by installing the dependencies already listed in package.json:
terminal
pnpm install
This installs the four @reaatech reliability packages plus third-party dependencies:
| Package | Version | Purpose |
| --- | --- | --- |
| @reaatech/circuit-breaker-core | 0.1.0 | Circuit breaker state machine |
| @reaatech/session-continuity | 0.1.0 | Session lifecycle manager |
| @reaatech/structured-repair-core | 1.0.0 | Structured output repair engine |
| @reaatech/agent-runbook-agent | 0.1.0 | AI-powered runbook generation |
| ollama | 0.6.3 | Ollama TypeScript client |
| @trigger.dev/sdk | 4.4.6 | Durable workflow framework |
| zod | 4.4.3 | Schema validation |
| jsonrepair | 3.14.0 | JSON syntax repair |
| @upstash/ratelimit | 2.0.8 | Rate limiting via Upstash Redis |
| @upstash/redis | 1.38.0 | Upstash Redis client (used by ratelimit) |
| langfuse | 3.38.20 | LLM observability and tracing |
| commander | 14.0.0 | CLI argument parsing |
| next | 16.2.6 | Next.js framework |
| react / react-dom | 19.2.4 | Next.js peer dependencies |
Next, set up environment variables. The file .env.example already contains all required placeholders. Copy it to .env:
terminal
cp .env.example .env
Fill in real values for each key in .env. The file expects:
env
# Env vars used by ollama-reliability-suite-for-on-prem-smb-agent-operations.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only; never commit real values.
NODE_ENV=development

# Ollama
OLLAMA_HOST=http://127.0.0.1:11434
OLLAMA_API_KEY=<your-ollama-api-key>

# Anthropic: required by @reaatech/agent-runbook-agent for Claude provider
ANTHROPIC_API_KEY=<your-anthropic-key>

# Trigger.dev: durable workflow
TRIGGER_API_KEY=<your-trigger-api-key>

# Upstash Redis: rate limiting
UPSTASH_REDIS_REST_URL=<your-upstash-url>
UPSTASH_REDIS_REST_TOKEN=<your-upstash-token>

# Langfuse: LLM tracing
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
Expected output: pnpm install prints a summary of the installed packages and exits with code 0. Your .env file has real API keys (not placeholders) for the services you plan to use.
Step 2: Create the Ollama API client wrapper
The first reliability building block is a typed wrapper around the Ollama client. It handles error classification so upstream code can distinguish model-not-found errors from network failures from timeouts.
Expected output: The TypeScript compiler type-checks this file with no errors. The classifyOllamaError function maps TypeError to model-not-found, connection failures to network-failure, abort/timeout messages to timeout, and everything else to unknown.
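The wrapper's source is not reproduced in this step, so the following is only a minimal sketch of how such a classifier could look. The type name OllamaErrorKind, the message substrings, and the order of the checks are assumptions for illustration; only the four category names and the mapping described above come from this tutorial.

```typescript
// Hypothetical sketch: the matching rules below are assumptions, not the recipe's actual source.
type OllamaErrorKind = 'model-not-found' | 'network-failure' | 'timeout' | 'unknown';

function classifyOllamaError(err: unknown): OllamaErrorKind {
  // Anything that is not an Error instance cannot be inspected further.
  if (!(err instanceof Error)) return 'unknown';

  // The tutorial states that TypeError maps to model-not-found.
  if (err instanceof TypeError) return 'model-not-found';

  const msg = err.message.toLowerCase();

  // Abort and timeout wording (assumed substrings).
  if (err.name === 'AbortError' || msg.includes('timeout') || msg.includes('timed out') || msg.includes('abort')) {
    return 'timeout';
  }

  // Common Node.js connection failure codes (assumed substrings).
  if (msg.includes('econnrefused') || msg.includes('econnreset') || msg.includes('enotfound')) {
    return 'network-failure';
  }

  return 'unknown';
}
```

Upstream code can then switch on the returned category, for example counting only network-failure and timeout results toward the circuit breaker.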
Step 3: Configure the circuit breaker
The circuit breaker prevents cascading failures by detecting repeated Ollama failures and stopping new requests until the server recovers. It exposes five configurable settings: failure count, recovery timeout, minimum confidence, cost per minute, and recovery strategy.
Expected output: After 5 consecutive failures, breaker.getState(DEFAULT_CIRCUIT_ID) returns 'OPEN'. The circuit stays open for 30 seconds (the recoveryTimeoutMs), then transitions to HALF_OPEN and gradually allows test calls until confidence is restored.
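To make the state transitions concrete, here is a dependency-free sketch of a breaker with the same two numbers (5 failures, 30 seconds). It does not use the @reaatech/circuit-breaker-core API; the class SketchBreaker and its methods are hypothetical, and it simplifies recovery by closing on the first successful probe instead of restoring confidence gradually.

```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Hypothetical illustration only; not the @reaatech/circuit-breaker-core implementation.
class SketchBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly recoveryTimeoutMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 5, recoveryTimeoutMs = 30_000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
    this.now = now; // injectable clock so the transitions can be tested without waiting
  }

  getState(): CircuitState {
    if (this.openedAt === null) return 'CLOSED';
    // Once the recovery timeout has elapsed, allow probe calls.
    return this.now() - this.openedAt >= this.recoveryTimeoutMs ? 'HALF_OPEN' : 'OPEN';
  }

  recordFailure(): void {
    const state = this.getState();
    if (state === 'OPEN') return; // calls should not be made while open
    if (state === 'HALF_OPEN') {
      this.openedAt = this.now(); // probe failed: restart the recovery timer
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = this.now();
  }

  recordSuccess(): void {
    // Simplification: a single success closes the circuit.
    this.failures = 0;
    this.openedAt = null;
  }
}
```

Injecting the clock keeps the sketch deterministic: a test can advance time by 30 seconds and observe the move from OPEN to HALF_OPEN immediately.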
Step 4: Build the structured output repair pipeline
When Ollama emits malformed JSON — markdown fences, trailing commas, unquoted keys, type mismatches — the repair pipeline fixes it before it reaches your agent logic. This module wraps @reaatech/structured-repair-core with jsonrepair as a pre-processing fallback.
Create src/lib/repair-pipeline.ts:
ts
import { repair, repairOutput, isValid, analyzeInput, UnrepairableError } from '@reaatech/structured-repair-core';
import type { RepairResult, RepairStrategyName, InputAnalysis } from '@reaatech/structured-repair-core';
import { jsonrepair } from 'jsonrepair';
import { z } from 'zod';

export { UnrepairableError };
export type { RepairResult, RepairStrategyName, InputAnalysis };

export const repairStrategies: RepairStrategyName[] = [
  'strip-fences',
  'extract-json',
  'fix-json-syntax',
  'coerce-types',
  'fuzzy-match-keys',
  'remove-extra-fields',
];

export async function repairLlmOutput<T>(schema: z.ZodType<T>, input: string): Promise<T> {
  return repair(schema, input);
}

export function repairLlmOutputDetailed<T>(
  schema: z.ZodType<T>,
  input: string,
  debug?: boolean,
): Promise<RepairResult<T>> {
  return Promise.resolve(repairOutput({ schema, input, debug }));
}

export function quickValidate<T>(schema: z.ZodType<T>, input: string): boolean {
  return isValid(schema, input);
}

export function diagnoseInput(input: string): InputAnalysis {
  return analyzeInput(input);
}

export function preprocessWithJsonrepair(raw: string): string {
  return jsonrepair(raw);
}
Expected output: repairLlmOutput(z.object({ name: z.string() }), '```json\n{ "name": "Alice" }\n```') resolves to { name: 'Alice' }; the markdown fence wrapper around the JSON is stripped automatically. If the first repair pass fails, the orchestrator retries with preprocessWithJsonrepair as a fallback.
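The two-pass idea (strip the fence, parse, and fall back to a syntax-repair pass on failure) can be sketched without either library. The helpers stripFences and parseWithFallback are hypothetical names, and the preprocess argument stands in for preprocessWithJsonrepair; the real strategies in @reaatech/structured-repair-core are likely more thorough than this.

```typescript
// Build the fence marker programmatically so this sketch contains no literal fence.
const FENCE = '`'.repeat(3);
const FENCED_BLOCK = new RegExp('^' + FENCE + '[a-zA-Z]*\\s*([\\s\\S]*?)\\s*' + FENCE + '$');

// Remove a surrounding markdown code fence (with optional language tag), if present.
function stripFences(raw: string): string {
  const trimmed = raw.trim();
  const match = FENCED_BLOCK.exec(trimmed);
  return match?.[1] ?? trimmed;
}

// First pass: plain JSON.parse. Second pass: run a caller-supplied repair
// function (jsonrepair in this tutorial) and parse again.
function parseWithFallback(raw: string, preprocess: (text: string) => string): unknown {
  const cleaned = stripFences(raw);
  try {
    return JSON.parse(cleaned);
  } catch {
    return JSON.parse(preprocess(cleaned));
  }
}
```

If the second parse also fails, the error propagates to the caller, which is where a repair-failure fallback would take over.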
Step 5: Create the token counter and rate limiter
The token counter provides approximate token accounting for the session manager, using a character-based heuristic (1 token per 4 characters). The rate limiter wraps Upstash’s sliding window implementation.
Create src/lib/token-counter.ts:
ts
import { TokenCounter } from '@reaatech/session-continuity';
import type { Message } from '@reaatech/session-continuity';

export { type Message };

export class SimpleTokenCounter implements TokenCounter {
  readonly model = 'simple-approximate';
  readonly tokenizer = 'character-div-4';

  count(text: string): number {
    return Math.ceil(text.length / 4);
  }

  countMessages(messages: Message[]): number {
    return messages.reduce((sum, m) => {
      const text = typeof m.content === 'string' ? m.content : JSON.stringify(m.content);
      return sum + Math.ceil(text.length / 4);
    }, 0);
  }
}
Create src/lib/rate-limiter.ts:
ts
import { Ratelimit } from '@upstash/ratelimit';
import type { Duration } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

export function createRateLimiter(limit?: number, windowMs?: number): Ratelimit {
  const effectiveLimit = limit ?? 60;
  const windowSeconds = (windowMs ?? 60000) / 1000;
  return new Ratelimit({
    redis: Redis.fromEnv(),
    limiter: Ratelimit.slidingWindow(effectiveLimit, `${String(windowSeconds)} s` as Duration),
    analytics: true,
  });
}

export async function checkRateLimit(
  ratelimit: Ratelimit,
  identifier: string,
): Promise<{ allowed: boolean; remaining: number; reset: number }> {
  const result = await ratelimit.limit(identifier);
  return {
    allowed: result.success,
    remaining: result.remaining,
    reset: result.reset,
  };
}
Expected output: new SimpleTokenCounter().countMessages([{ role: 'user', content: 'Hello world' }]) returns 3 (11 characters / 4, ceiling). The rate limiter defaults to 60 requests per 60-second sliding window per identifier.
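For local experiments without a Redis account, the per-identifier window can be approximated in memory. The class below is a hypothetical stand-in that keeps a log of timestamps per identifier; it is not the algorithm Upstash runs in Redis, and it only mirrors the { allowed, remaining, reset } shape returned by checkRateLimit above.

```typescript
// Hypothetical in-memory stand-in for experiments; not Upstash's Redis-backed limiter.
class InMemorySlidingWindow {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(limit = 60, windowMs = 60_000, now: () => number = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  check(identifier: string): { allowed: boolean; remaining: number; reset: number } {
    const t = this.now();
    // Keep only the requests that still fall inside the window ending at t.
    const recent = (this.hits.get(identifier) ?? []).filter((ts) => t - ts < this.windowMs);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(t);
    this.hits.set(identifier, recent);
    // The window frees a slot when the oldest recorded request expires.
    const oldest = recent[0] ?? t;
    return {
      allowed,
      remaining: Math.max(0, this.limit - recent.length),
      reset: oldest + this.windowMs,
    };
  }
}
```

Because state lives in process memory, this variant resets on restart and does not coordinate across processes, which is the gap the Redis-backed limiter fills.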
Step 6: Set up observability with Langfuse
Tracing gives you visibility into every LLM call, circuit state transition, and repair attempt. This module creates a Langfuse client from environment variables and exposes a trace/span API that degrades gracefully when credentials are absent.
Expected output: When LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY are set, createTrace('agent-call', 'session-1') returns a real Langfuse trace handle. Without them, it returns a no-op handle — your code runs without crashing whether or not observability is configured.
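The graceful-degradation pattern can be sketched independently of the Langfuse SDK. Everything here is hypothetical: the TraceHandle interface, the isNoop flag, and createTraceOrNoop, which takes the real trace factory as a parameter so no SDK call signatures need to be assumed.

```typescript
// Hypothetical handle shape; the module in this recipe may expose different methods.
interface TraceHandle {
  span(name: string, data?: Record<string, unknown>): void;
  end(): void;
  readonly isNoop: boolean;
}

// A handle whose methods do nothing, so callers never need to branch on configuration.
const NOOP_TRACE: TraceHandle = {
  span(): void {},
  end(): void {},
  isNoop: true,
};

function createTraceOrNoop(
  env: Record<string, string | undefined>,
  makeReal: (name: string, sessionId: string) => TraceHandle,
  name: string,
  sessionId: string,
): TraceHandle {
  // Missing credentials: skip tracing entirely.
  if (!env['LANGFUSE_PUBLIC_KEY'] || !env['LANGFUSE_SECRET_KEY']) return NOOP_TRACE;
  try {
    return makeReal(name, sessionId);
  } catch {
    // A tracing failure should never take the agent down.
    return NOOP_TRACE;
  }
}
```

Call sites then use trace.span(...) and trace.end() unconditionally, and the no-op handle absorbs the calls when observability is off.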
Step 7: Implement the in-memory storage adapter
Session continuity needs a storage backend. This in-memory implementation of IStorageAdapter manages sessions and messages with optimistic concurrency control, monotonic message sequencing, and filtering capabilities.
Create src/services/memory-adapter.ts:
ts
import {
  type IStorageAdapter,
  type Session,
  type Message,
  type SessionId,
  type MessageId,
  type SessionFilters,
  type MessageQueryOptions,
  type HealthStatus,
  type UpdateSessionOptions,
  ConcurrencyError,
} from '@reaatech/session-continuity';
import crypto from 'node:crypto';

let nextSequence = 1;

function generateId(): string {
  return crypto.randomUUID();
}

export class MemoryAdapter implements IStorageAdapter {
  private sessions: Map<SessionId, Session> = new Map();
  private messages: Map<SessionId, Message[]> = new Map();
Expected output: adapter.createSession({ userId: 'user-1', status: 'active', ... }) returns a session with a generated UUID, createdAt timestamp, and version: 1. Calling updateSession with a stale expectedVersion throws ConcurrencyError.
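Since the adapter listing above is cut off, here is a small standalone sketch of the optimistic concurrency check it describes. SketchStore, SketchSession, and VersionConflictError are hypothetical stand-ins (the last one for ConcurrencyError); the real IStorageAdapter has a much larger surface.

```typescript
// Stand-in for ConcurrencyError from @reaatech/session-continuity (hypothetical).
class VersionConflictError extends Error {
  constructor(expected: number, actual: number) {
    super(`expected version ${String(expected)}, found ${String(actual)}`);
    this.name = 'VersionConflictError';
  }
}

interface SketchSession {
  id: string;
  status: string;
  version: number;
}

class SketchStore {
  private readonly sessions = new Map<string, SketchSession>();

  create(id: string, status: string): SketchSession {
    const session: SketchSession = { id, status, version: 1 };
    this.sessions.set(id, session);
    return session;
  }

  // Apply the update only if the caller saw the latest version; bump the version on success.
  update(id: string, status: string, expectedVersion: number): SketchSession {
    const current = this.sessions.get(id);
    if (!current) throw new Error(`session ${id} not found`);
    if (current.version !== expectedVersion) {
      throw new VersionConflictError(expectedVersion, current.version);
    }
    const next: SketchSession = { ...current, status, version: current.version + 1 };
    this.sessions.set(id, next);
    return next;
  }
}
```

A writer that loses the race gets the conflict error, re-reads the session, and retries with the fresh version instead of silently overwriting the other update.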
Step 8: Create the session manager
The session manager wraps the storage adapter with token budget enforcement and automatic sliding-window compression. It wires event listeners for budget warnings and compression notifications.
Create src/services/session-manager.ts:
ts
import {
  SessionManager,
  type Session,
  type Message,
} from '@reaatech/session-continuity';
import { MemoryAdapter } from './memory-adapter.js';
import { SimpleTokenCounter } from '../lib/token-counter.js';

export function createSessionManager(): SessionManager {
  const manager = new SessionManager({
    storage: new MemoryAdapter(),
    tokenCounter: new SimpleTokenCounter(),
    tokenBudget: {
      maxTokens: 4096,
      reserveTokens: 500,
      overflowStrategy: 'compress',
    },
    compression: {
      strategy: 'sliding_window',
      targetTokens: 3500,
      minMessages: 5,
    },
  });

  manager.on('budget:exceeded', (payload) => {
    console.warn(
      `[session-manager] budget exceeded for session ${payload.sessionId}`,
      payload.data,
    );
  });

  manager.on('compression:applied', (payload) => {
    console.info(
      `[session-manager] compression applied for session ${payload.sessionId}`,
      payload.data,
    );
  });

  manager.on('error', (payload) => {
    console.error(
      `[session-manager] error on session ${payload.sessionId}`,
      payload.data,
    );
  });

  return manager;
}

export async function createAgentSession(
  manager: SessionManager,
  userId?: string,
): Promise<Session> {
  return manager.createSession({ userId });
}

export async function addAgentMessage(
  manager: SessionManager,
  sessionId: string,
  role: 'user' | 'assistant' | 'system',
  content: string,
): Promise<Message> {
  return manager.addMessage(sessionId, { role, content });
}

export async function getAgentContext(
  manager: SessionManager,
  sessionId: string,
): Promise<Message[]> {
  return manager.getConversationContext(sessionId);
}

export async function endAgentSession(
  manager: SessionManager,
  sessionId: string,
): Promise<void> {
  return manager.endSession(sessionId);
}
Expected output: After adding enough messages to exceed the 4096-token budget, the session manager automatically applies sliding-window compression, keeping the context at or below 3500 tokens while preserving the 5 most recent messages.
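The interaction between targetTokens and minMessages is easier to see in a small sketch. This is an assumed reading of sliding-window compression (drop the oldest messages first, never going below minMessages), not the library's actual code; compressSlidingWindow and Msg are hypothetical names, and tokens are counted with the same 4-characters-per-token heuristic as SimpleTokenCounter.

```typescript
interface Msg {
  role: 'user' | 'assistant' | 'system';
  content: string;
}

// Same heuristic as SimpleTokenCounter: one token per 4 characters, rounded up.
const countTokens = (m: Msg): number => Math.ceil(m.content.length / 4);

// Hypothetical sketch: drop the oldest messages until the total fits the target,
// but always keep at least minMessages of the most recent ones.
function compressSlidingWindow(messages: Msg[], targetTokens: number, minMessages: number): Msg[] {
  const kept = [...messages];
  let total = kept.reduce((sum, m) => sum + countTokens(m), 0);
  while (total > targetTokens && kept.length > minMessages) {
    const dropped = kept.shift();
    if (dropped) total -= countTokens(dropped);
  }
  return kept;
}
```

Under this reading, minMessages wins when the two settings conflict: if the five most recent messages alone exceed the target, the result stays above targetTokens.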
Step 9: Build the agent orchestrator
The orchestrator ties all reliability layers together into a single processMessage method. Every user message passes through rate limiting, circuit breaking, Ollama chat, and structured output repair in sequence. The class also includes a streaming variant (processMessageStream) and internal helpers for circuit-open recovery and repair-failure fallback.
Create src/services/agent-orchestrator.ts:
ts
import type { Ollama, ChatResponse } from 'ollama';
import { type CircuitBreaker, CircuitOpenError } from '@reaatech/circuit-breaker-core';
import type { SessionManager } from '@reaatech/session-continuity';
import type { z } from 'zod';
import { DEFAULT_CIRCUIT_ID } from '../lib/breaker-config.js';
import { getAgentContext, addAgentMessage } from './session-manager.js';

export interface RateLimitResult {
  allowed: boolean;
  remaining: number;
  reset: number;
}

export interface
(The full source also includes processMessageStream for streaming chat, makeSingleChunkIterable for fallback streaming, and isCircuitOpenError for type-safe error checks — download the finished artifact to see the complete file.)
Expected output: orchestrator.processMessage(sessionId, 'Hello') goes through the full reliability chain: rate limit check (blocks if over 60 req/min), circuit breaker (falls back to degraded mode after 5 failures), Ollama chat, structured output repair, session persistence, and Langfuse tracing.
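Because the orchestrator listing above is truncated, the ordering of the layers is sketched here with injected dependencies. This is not the AgentOrchestrator class: PipelineDeps, PipelineResult, and processMessageSketch are hypothetical, and session persistence and tracing are left out to keep the control flow visible.

```typescript
// Hypothetical dependency bundle; each member stands in for one reliability layer.
interface PipelineDeps {
  checkRateLimit(identifier: string): Promise<{ allowed: boolean }>;
  isCircuitOpen(): boolean;
  chat(prompt: string): Promise<string>;
  repair(raw: string): unknown;
  onSuccess(): void;
  onFailure(): void;
}

type PipelineResult =
  | { status: 'ok'; data: unknown }
  | { status: 'rate-limited' }
  | { status: 'degraded'; reason: string };

async function processMessageSketch(
  deps: PipelineDeps,
  sessionId: string,
  text: string,
): Promise<PipelineResult> {
  // 1. Rate limiting comes first so blocked callers never touch the model.
  if (!(await deps.checkRateLimit(sessionId)).allowed) return { status: 'rate-limited' };

  // 2. Short-circuit while the breaker is open.
  if (deps.isCircuitOpen()) return { status: 'degraded', reason: 'circuit-open' };

  // 3. Call the model and report the outcome to the breaker.
  let raw: string;
  try {
    raw = await deps.chat(text);
    deps.onSuccess();
  } catch {
    deps.onFailure();
    return { status: 'degraded', reason: 'ollama-error' };
  }

  // 4. Repair the structured output; an unrepairable response degrades instead of throwing.
  try {
    return { status: 'ok', data: deps.repair(raw) };
  } catch {
    return { status: 'degraded', reason: 'unrepairable-output' };
  }
}
```

Returning a tagged result instead of throwing lets the CLI and the API route share one place that decides what a degraded reply looks like.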
Step 10: Set up Trigger.dev workflow for runbook generation
When a session times out, a Trigger.dev durable workflow generates an incident runbook using @reaatech/agent-runbook-agent. The workflow retrieves the session’s conversation history, feeds it to Claude via the analysis agent, and produces failure modes plus three runbook sections (alerts, incident response, health checks).
Expected output: When Trigger.dev fires a session.timeout event with a { sessionId } payload, the task retrieves the session’s messages, calls Claude to identify failure modes, and returns a structured runbook with alerts, incident-response, and health-checks sections. If the session is missing or Claude errors, it returns a partial runbook with an error note.
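The partial-runbook behavior can be sketched apart from Trigger.dev and the runbook agent. In this hypothetical version, loadMessages and analyze are injected functions standing in for session retrieval and the Claude-backed analysis; the Runbook shape and field names are assumptions based only on the three sections named above.

```typescript
// Assumed runbook shape; the real package may structure its output differently.
interface Runbook {
  sessionId: string;
  failureModes: string[];
  sections: { alerts: string; incidentResponse: string; healthChecks: string };
  error?: string;
}

type Analysis = Pick<Runbook, 'failureModes' | 'sections'>;

const EMPTY_ANALYSIS: Analysis = {
  failureModes: [],
  sections: { alerts: '', incidentResponse: '', healthChecks: '' },
};

async function buildRunbookSketch(
  sessionId: string,
  loadMessages: (id: string) => Promise<string[] | null>,
  analyze: (messages: string[]) => Promise<Analysis>,
): Promise<Runbook> {
  const messages = await loadMessages(sessionId);
  // Missing session: return a partial runbook with an error note instead of throwing.
  if (messages === null) {
    return { sessionId, ...EMPTY_ANALYSIS, error: 'session not found' };
  }
  try {
    return { sessionId, ...(await analyze(messages)) };
  } catch (err) {
    // Analysis failure (for example, a model API error): still return a usable shape.
    const note = err instanceof Error ? err.message : String(err);
    return { sessionId, ...EMPTY_ANALYSIS, error: `analysis failed: ${note}` };
  }
}
```

Inside a durable task, a function like this would be the body of the run handler, so a failed analysis produces a recorded partial result and not a lost event.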
Step 11: Build the CLI
The CLI wires all services together into an interactive REPL or single-shot mode. It uses commander for argument parsing and supports --model, --host, --session-id, --non-interactive, and --prompt flags.
Create src/cli/run.ts:
ts
import { Command } from 'commander';
import { createOllamaClient } from '../lib/ollama-client.js';
import { createOllamaBreaker } from '../lib/breaker-config.js';
import { createSessionManager, createAgentSession } from '../services/session-manager.js';
import { AgentOrchestrator } from '../services/agent-orchestrator.js';
import { repairLlmOutput, preprocessWithJsonrepair } from '../lib/repair-pipeline.js';
import { createRateLimiter, checkRateLimit } from '../lib/rate-limiter.js';
import { createTrace, endTrace, recordSpan, flushAll } from '../lib/observability.js';

const program = new Command();

program
  .name('ollama-reliability-cli'
Expected output: Running node src/cli/run.ts --non-interactive --prompt "Hello" --host http://127.0.0.1:11434 sends a message through the full reliability pipeline and prints the assistant’s response. Running without --non-interactive starts a REPL where each line is processed through the same pipeline, and Ctrl+C invokes a clean shutdown.
Step 12: Add the Next.js API route
The webhook route receives Trigger.dev workflow events, validates them with Zod, authenticates via the x-trigger-api-key header, and dispatches to the workflow handler.
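The route's source is not shown in this step, so the following sketch covers only the authenticate-then-validate order. It deliberately swaps Zod for hand-written checks to stay dependency-free, and the envelope shape ({ event, sessionId }), the status codes, and the name handleTriggerWebhook are assumptions. Header names are expected in lowercase.

```typescript
interface WebhookResult {
  status: number;
  body: { ok: boolean; error?: string; sessionId?: string };
}

// Hypothetical core of the route handler, kept free of framework types so it is easy to test.
function handleTriggerWebhook(
  headers: Record<string, string | undefined>,
  payload: unknown,
  expectedKey: string,
): WebhookResult {
  // Authenticate first; an unset server key must never match an empty header.
  if (!expectedKey || headers['x-trigger-api-key'] !== expectedKey) {
    return { status: 401, body: { ok: false, error: 'unauthorized' } };
  }

  // Manual validation stands in for the Zod schema used by the real route.
  if (typeof payload !== 'object' || payload === null) {
    return { status: 400, body: { ok: false, error: 'invalid payload' } };
  }
  const { event, sessionId } = payload as { event?: unknown; sessionId?: unknown };
  if (event !== 'session.timeout' || typeof sessionId !== 'string' || sessionId.length === 0) {
    return { status: 400, body: { ok: false, error: 'invalid payload' } };
  }

  // Accepted: the caller would now dispatch to the workflow handler.
  return { status: 202, body: { ok: true, sessionId } };
}
```

In an App Router route, the exported POST function would read the header and JSON body from the request, pass them to a function like this, and turn the result into a response.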
Step 13: Run the tests
The test suite covers every service and lib module with mocked dependencies. Each module's tests cover happy paths, error conditions, and boundary cases.
Run the full test suite:
terminal
pnpm test
All tests use vi.mock to isolate modules and MSW to mock Ollama’s HTTP API, exercising the Ollama client, circuit breaker, repair pipeline, token counter, rate limiter, observability, memory adapter, session manager, agent orchestrator, and Trigger.dev workflow.
Run the TypeScript type check:
terminal
pnpm typecheck
Run the linter:
terminal
pnpm lint
Expected output: pnpm test passes all test suites. pnpm typecheck reports zero errors. pnpm lint reports zero warnings.
Next steps
Add a persistence adapter — swap MemoryAdapter for @reaatech/session-continuity’s Redis or DynamoDB adapters so sessions survive process restarts
Wire the CLI into your agent framework — use AgentOrchestrator as the backend for LangChain, Vercel AI SDK, or a custom agent loop
Add structured output schemas — pass Zod schemas to processMessage() so the repair engine enforces schema conformance on every LLM response
Deploy the Trigger.dev workflow — connect the session-timeout task to a real Trigger.dev client and fire events from the session manager when sessions expire
Build a dashboard — render session health data (circuit states, rate limit hits, repair counts) using the Langfuse trace output