AI Runbook Automation for Agent Failure Recovery

Automatically trigger health checks and runbooks when AI agents fail, using circuit breakers and idempotent retries, without locking you into a single provider.

ai-runbook-automation agent-failure-recovery circuit-breaker idempotency trigger-dev langfuse reliability-ops typescript nextjs

The problem

Small businesses relying on AI agents for customer support or operations face unpredictable outages and errors; when an agent goes down or returns garbage, the business needs automatic failover and recovery without 24/7 ops staff.

Intro

When AI agents power your customer support or operations, you can’t afford silent outages. This recipe builds a complete runbook automation system that automatically detects agent failures, isolates broken dependencies with circuit breakers, retries critical actions idempotently, and notifies your team — all without locking you into a single LLM provider. You’ll wire up a Next.js app with Trigger.dev workflows, Langfuse observability, and Slack notifications in about 30 minutes.

Prerequisites

Node.js 22+ and pnpm 10
A Next.js project scaffolded with App Router
A Slack bot token (xoxb-...) and a channel ID for notifications
A Langfuse account (public + secret keys) for tracing — sign up at cloud.langfuse.com
An OpenAI API key or Anthropic API key (or both)
Basic familiarity with TypeScript and Next.js App Router

Step 1: Scaffold the project and pin dependencies

Create a new Next.js project with the App Router, then install the REAA ecosystem packages and third-party dependencies. Every version must be exact-pinned (no ^ or ~).

terminal

npx create-next-app@latest ai-runbook-automation --typescript --app --src-dir --no-tailwind --import-alias "@/*"
cd
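The exact-pin rule can be checked mechanically. The sketch below is one possible approach and not part of the recipe itself: `findUnpinned` is a hypothetical helper, and the version numbers in the sample are placeholders rather than the recipe's actual pins.

```typescript
// Hypothetical helper: lists dependencies whose version is not an exact pin.
type DependencyMap = Record<string, string>;

// Accepts plain semver such as 1.2.3 or 1.2.3-rc.1; rejects ranges (^, ~), tags, and wildcards.
const EXACT_VERSION = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$/;

function findUnpinned(deps: DependencyMap): string[] {
  return Object.entries(deps)
    .filter(([, version]) => !EXACT_VERSION.test(version))
    .map(([name, version]) => `${name}@${version}`);
}

// Placeholder versions for illustration only.
const sample: DependencyMap = {
  "@slack/web-api": "7.0.0",
  "@trigger.dev/sdk": "^4.0.0",
  langfuse: "~3.0.0",
};

console.log(findUnpinned(sample));
// prints [ '@trigger.dev/sdk@^4.0.0', 'langfuse@~3.0.0' ]
```

Running a check like this over both `dependencies` and `devDependencies` in CI would catch a range specifier before it reaches the lockfile.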

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.


187 kB · 115 tests · 94.6% coverage · vitest passing

SHA-256: ecffc5672dcb283e8e281f610c011c1e783c8912de32ff760201c9e64904ca07


// src/services/circuit-breaker-manager.ts
import {
  CircuitBreaker,
  CircuitOpenError,
  InMemoryAdapter,
  DefaultMetricsCollector,
} from "@reaatech/circuit-breaker-agents";
import { createTrace, endSpan } from "../lib/tracing.js";

class CircuitBreakerManager {
  private breakers = new Map<string, CircuitBreaker>();

  getOrCreate(serviceId: string): CircuitBreaker {
    const existing = this.breakers.get(serviceId);
    if (existing) return existing;

    const persistence = new InMemoryAdapter();
    const metrics = new DefaultMetricsCollector();
    const breaker = new CircuitBreaker({
      name: serviceId,
      failureThreshold: 5,
      recoveryTimeoutMs: 30000,
      minConfidence: 0.7,
      recoveryStrategy: "gradual",
      persistence,
      metricsCollector: metrics,
    });
    void persistence.connect();

    breaker.on("stateChange", (event) => {
      const trace = createTrace(`circuit-breaker.${event.circuit_id}.state-change`, ["circuit-breaker"]);
      const span = trace.span({ name: `${String(event.data.from)}->${String(event.data.to)}` });
      endSpan(span);
    });

    this.breakers.set(serviceId, breaker);
    return breaker;
  }

  async executeWithBreaker<T>(
    serviceId: string,
    fn: () => Promise<T>,
    opts?: { onSuccess?: (res: T) => { confidence: number; costUsd?: number } },
  ): Promise<T> {
    const breaker = this.getOrCreate(serviceId);
    if (opts?.onSuccess) {
      const onSuccess = opts.onSuccess;
      return breaker.execute(fn, {
        onSuccess: (result: unknown) => onSuccess(result as T),
      });
    }
    return breaker.execute(fn);
  }

  getState(serviceId: string): string {
    const breaker = this.getOrCreate(serviceId);
    return breaker.getState(serviceId);
  }

  getStats(serviceId: string): object {
    const breaker = this.breakers.get(serviceId);
    if (!breaker) throw new Error(`Unknown circuit breaker: ${serviceId}`);
    return breaker.getStats(serviceId);
  }

  resetBreaker(serviceId: string): void {
    const breaker = this.breakers.get(serviceId);
    if (breaker) breaker.reset();
  }

  async evaluateCircuitState(serviceId: string): Promise<void> {
    const breaker = this.getOrCreate(serviceId);
    try {
      await breaker.execute(
        () => Promise.resolve("noop"),
        { onSuccess: () => ({ confidence: 1.0 }) },
      );
    } catch (err) {
      if (err instanceof CircuitOpenError) {
        return;
      }
      throw err;
    }
  }
}

export const circuitBreakerManager = new CircuitBreakerManager();
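The manager above configures each breaker with `failureThreshold: 5` and `recoveryTimeoutMs: 30000`. To show what those two settings typically control, here is a minimal sketch of the general closed/open/half-open pattern. It is not the implementation inside `@reaatech/circuit-breaker-agents`: `SimpleBreaker`, `BreakerOpenError`, and the injectable clock are hypothetical names, and the sketch leaves out `minConfidence` and the `"gradual"` recovery strategy.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

// Hypothetical error type, thrown when a call is rejected without running.
class BreakerOpenError extends Error {
  constructor(name: string) {
    super(`Circuit "${name}" is open`);
    this.name = "BreakerOpenError";
  }
}

// Hypothetical minimal breaker; illustrates the pattern only.
class SimpleBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private readonly name: string;
  private readonly failureThreshold: number;
  private readonly recoveryTimeoutMs: number;
  private readonly now: () => number;

  constructor(
    name: string,
    failureThreshold: number,
    recoveryTimeoutMs: number,
    now: () => number = () => Date.now(),
  ) {
    this.name = name;
    this.failureThreshold = failureThreshold;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
    this.now = now;
  }

  getState(): BreakerState {
    // An open circuit becomes half-open once the recovery timeout has elapsed.
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.recoveryTimeoutMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.getState() === "OPEN") throw new BreakerOpenError(this.name);
    try {
      const result = await fn();
      // Any success closes the circuit and clears the failure count.
      this.state = "CLOSED";
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed half-open probe, or reaching the threshold, opens the circuit.
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

With the values used above, a breaker of this shape would reject calls after five failures and let a single probe through 30 seconds later. Passing the clock as a parameter keeps the timeout testable without real waiting.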

// src/services/slack-notifier.ts
import { WebClient } from "@slack/web-api";
import { config } from "../lib/config.js";
import { withIdempotency } from "../lib/actions.js";
import type { IncidentEvent, HealthCheckResult } from "../types/index.js";

export class SlackNotifier {
  private web: WebClient;

  constructor(web?: WebClient) {
    this.web = web ?? new WebClient(config.SLACK_TOKEN);
  }

  async sendAlert(channelId: string, message: string, severity: string): Promise<{ ts: string }> {
    return withIdempotency(`alert-${channelId}-${severity}`, "POST", "/chat.postMessage", async () => {
      const result = await this.web.chat.postMessage({
        channel: channelId,
        text: `[${severity.toUpperCase()}] ${message}`,
      });
      if (!result.ts) {
        throw new Error("Slack postMessage returned no timestamp");
      }
      return { ts: result.ts };
    });
  }

  async sendIncidentNotification(incident: IncidentEvent): Promise<void> {
    const text = [
      `🚨 Incident: ${incident.severity}`,
      `Service: ${incident.serviceId}`,
      `Message: ${incident.message}`,
      `Time: ${incident.occurredAt.toISOString()}`,
    ].join("\n");
    const incidentKey = `incident-${incident.serviceId}-${String(incident.occurredAt.getTime())}`;
    await withIdempotency(incidentKey, "POST", "/chat.postMessage", async () => {
      await this.web.chat.postMessage({
        channel: config.SLACK_CHANNEL,
        text,
      });
    });
  }

  async sendHealthReport(results: HealthCheckResult[]): Promise<void> {
    const healthy = results.filter((r) => r.status === "healthy");
    const degraded = results.filter((r) => r.status === "degraded");
    const down = results.filter((r) => r.status === "down");

    const lines: string[] = [];
    const resultsLen = String(results.length);
    const healthyLen = String(healthy.length);
    const degradedLen = String(degraded.length);
    const downLen = String(down.length);

    lines.push(`Health Check Report — ${resultsLen} services`);
    lines.push("");
    lines.push(`✅ Healthy: ${healthyLen}`);

    if (degraded.length > 0) {
      lines.push(`⚠️ Degraded: ${degradedLen}`);
      for (const r of degraded) {
        const dMs = String(r.latencyMs);
        lines.push(` • ${r.serviceId} (${dMs}ms)${r.error ? ` — ${r.error}` : ""}`);
      }
    }

    if (down.length > 0) {
      lines.push(`❌ Down: ${downLen}`);
      for (const r of down) {
        const dMs = String(r.latencyMs);
        lines.push(` • ${r.serviceId} (${dMs}ms)${r.error ? ` — ${r.error}` : ""}`);
      }
    }

    const key = `health-report-${String(results.length)}-${String(Date.now())}`;
    await withIdempotency(key, "POST", "/chat.postMessage", async () => {
      await this.web.chat.postMessage({
        channel: config.SLACK_CHANNEL,
        text: lines.join("\n"),
      });
    });
  }
}
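The notifier wraps every Slack call in `withIdempotency(key, method, path, fn)`, imported from `../lib/actions.js`, whose body is not shown here. As a rough illustration of what such a wrapper could do, the sketch below keeps results in an in-memory map so that a repeated key does not run the action twice. The storage choice, the cache-key format, and the decision to forget failed attempts are all assumptions; the recipe's actual helper may behave differently.

```typescript
// Hypothetical in-memory stand-in for the withIdempotency helper; not the recipe's implementation.
const idempotencyCache = new Map<string, Promise<unknown>>();

async function withIdempotency<T>(
  key: string,
  method: string,
  path: string,
  fn: () => Promise<T>,
): Promise<T> {
  // Assumed cache-key format: scope the caller's key by method and path.
  const cacheKey = `${method} ${path} ${key}`;
  const cached = idempotencyCache.get(cacheKey);
  if (cached) return cached as Promise<T>;

  const pending = fn().catch((err: unknown) => {
    // Forget failed attempts so that a retry can run the action again.
    idempotencyCache.delete(cacheKey);
    throw err;
  });
  idempotencyCache.set(cacheKey, pending);
  return pending;
}

// Usage: the second call with the same key reuses the first result.
let sends = 0;
const send = async () => {
  sends += 1;
  return { ts: "mock-ts-123" };
};

async function demo(): Promise<void> {
  await withIdempotency("alert-C123456-sev2", "POST", "/chat.postMessage", send);
  await withIdempotency("alert-C123456-sev2", "POST", "/chat.postMessage", send);
  console.log(sends); // prints 1
}
```

Whatever the helper does internally, the chosen keys decide what counts as a duplicate. In the notifier, `sendAlert` keys on channel and severity only, so two different messages with the same severity share a key, while `sendHealthReport` includes `Date.now()` and therefore produces a fresh key on every call.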

// src/runbooks/daily-health.ts
import { schedules, tasks, logger } from "@trigger.dev/sdk";
import { HealthChecker } from "../services/health-checker.js";
import { CircuitOpenError } from "@reaatech/circuit-breaker-agents";
import { circuitBreakerManager } from "../services/circuit-breaker-manager.js";
import { SlackNotifier } from "../services/slack-notifier.js";
import { createTrace, createSpan, endSpan, shutdownTracer } from "../lib/tracing.js";
import type { IncidentEvent } from "../types/index.js";

export async function dailyHealthCheckRun(): Promise<void> {
  const trace = createTrace("daily-health-check", ["runbook", "health"]);
  const span = createSpan(trace, "workflow");
  try {
    const checker = new HealthChecker({
      services: [
        { serviceId: "ai-agent-1", endpointUrl: "http://localhost:3001", serviceType: "web-api" },
        { serviceId: "ai-agent-2", endpointUrl: "http://localhost:3002", serviceType: "web-api" },
      ],
    });
    const results = await checker.checkAll();
    const downServices = results.filter((r) => r.status === "down");
    const degradedServices = results.filter((r) => r.status === "degraded");

    const notifier = new SlackNotifier();
    await notifier.sendHealthReport(results);

    for (const svc of degradedServices) {
      const state = circuitBreakerManager.getState(svc.serviceId);
      if (state === "OPEN") {
        logger.info(`Skipping degraded report for ${svc.serviceId} — circuit is OPEN`);
        continue;
      }
      await circuitBreakerManager.executeWithBreaker(svc.serviceId, () => Promise.resolve(svc), {
        onSuccess: () => ({ confidence: 0.5 }),
      });
    }

    for (const svc of downServices) {
      const state = circuitBreakerManager.getState(svc.serviceId);
      if (state === "OPEN") {
        logger.info(`Skipping probe for ${svc.serviceId} — circuit is OPEN`);
        continue;
      }
      const incident: IncidentEvent = {
        serviceId: svc.serviceId,
        severity: "SEV2",
        message: svc.error ?? "Health check failed",
        occurredAt: new Date(),
        metadata: { latencyMs: svc.latencyMs },
      };
      await tasks.trigger("incident-response", incident);
    }
    endSpan(span);
  } catch (err) {
    if (err instanceof CircuitOpenError) {
      logger.info("Circuit is OPEN — degraded, not critical", {
        error: err instanceof Error ? err.message : String(err),
      });
    } else {
      logger.error("Daily health check failed", {
        error: err instanceof Error ? err.message : String(err),
      });
    }
    endSpan(span);
  } finally {
    await shutdownTracer();
  }
}

export const dailyHealthCheck = schedules.task({
  id: "daily-health-check",
  cron: "*/5 * * * *",
  run: dailyHealthCheckRun,
});

// tests/setup.ts
import { setupServer } from "msw/node";
import { http, HttpResponse } from "msw";
import { beforeAll, afterEach, afterAll } from "vitest";

// Set env vars at module level so modules that parse process.env at import
// time (e.g. config.ts) see the correct values before any hook runs.
process.env.SLACK_TOKEN = "slack-token-placeholder";
process.env.SLACK_CHANNEL = "C123456";
process.env.LANGFUSE_PUBLIC_KEY = "pk-lf-test";
process.env.LANGFUSE_SECRET_KEY = "sk-lf-test";
process.env.LANGFUSE_HOST = "https://cloud.langfuse.com";
process.env.OPENAI_API_KEY = "sk-test";
process.env.ANTHROPIC_API_KEY = "sk-ant-test";
process.env.DEFAULT_PROVIDER = "openai";
process.env.HEALTH_CHECK_INTERVAL_MS = "300000";

export const server = setupServer(
  // OpenAI Chat Completions
  http.post("https://api.openai.com/v1/chat/completions", () =>
    HttpResponse.json({
      id: "cmpl_test",
      model: "gpt-5.2",
      choices: [{ message: { role: "assistant", content: "mocked OpenAI response" }, finish_reason: "stop" }],
      usage: { prompt_tokens: 10, completion_tokens: 5, total_tokens: 15 },
    }),
  ),
  // Anthropic Messages
  http.post("https://api.anthropic.com/v1/messages", () =>
    HttpResponse.json({
      id: "msg_test",
      type: "message",
      role: "assistant",
      model: "claude-sonnet-4-6",
      content: [{ type: "text", text: "mocked Anthropic response" }],
      stop_reason: "end_turn",
      usage: { input_tokens: 10, output_tokens: 5 },
    }),
  ),
  // Slack chat.postMessage
  http.post("https://slack.com/api/chat.postMessage", () =>
    HttpResponse.json({ ok: true, ts: "mock-ts-123" }),
  ),
  // Health check endpoints — dynamic based on URL pattern
  http.get("*/health", ({ request }) => {
    const url = new URL(request.url);
    if (url.hostname.includes("down")) {
      return new HttpResponse(null, { status: 500 });
    }
    return HttpResponse.json({ status: "ok" }, { status: 200 });
  }),
);

beforeAll(() => {
  server.listen({ onUnhandledRequest: "error" });
});

afterEach(() => {
  server.resetHandlers();
});

afterAll(() => {
  server.close();
});
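The mocked `*/health` handler returns a 500 for any hostname containing "down" and a 200 otherwise, and the daily workflow then branches on a `status` of `"healthy"`, `"degraded"`, or `"down"`. The `HealthChecker` source is not shown here, so the sketch below is only a guess at how a response might be mapped onto those three values. The `classify` function, the result shape, and the 1000 ms latency threshold are assumptions made for illustration.

```typescript
type HealthStatus = "healthy" | "degraded" | "down";

// Assumed shape, inferred from the fields the workflow and notifier read.
interface HealthCheckResult {
  serviceId: string;
  status: HealthStatus;
  latencyMs: number;
  error?: string;
}

// Hypothetical threshold: slower successful responses count as degraded.
const DEGRADED_LATENCY_MS = 1000;

// Hypothetical classifier; httpStatus is null when the request got no response.
function classify(
  serviceId: string,
  httpStatus: number | null,
  latencyMs: number,
  degradedAfterMs: number = DEGRADED_LATENCY_MS,
): HealthCheckResult {
  if (httpStatus === null) {
    return { serviceId, status: "down", latencyMs, error: "No response" };
  }
  if (httpStatus < 200 || httpStatus >= 300) {
    return { serviceId, status: "down", latencyMs, error: `HTTP ${String(httpStatus)}` };
  }
  if (latencyMs > degradedAfterMs) {
    return { serviceId, status: "degraded", latencyMs };
  }
  return { serviceId, status: "healthy", latencyMs };
}

console.log(classify("ai-agent-1", 200, 120).status); // prints healthy
console.log(classify("ai-agent-2", 500, 80).status); // prints down
```

Keeping the classification a pure function of status code and latency means it can be unit-tested directly, leaving the MSW handlers to cover only the HTTP plumbing.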

Intro

Prerequisites

Step 1: Scaffold the project and pin dependencies

Step 2: Configure environment variables with Zod

Step 3: Define shared domain types

Step 4: Set up Langfuse observability tracing

Step 5: Build a provider-agnostic LLM client

Step 6: Implement the circuit breaker manager

Step 7: Add idempotency middleware for safe retries

Step 8: Create the Slack notification service

Step 9: Build the health check service

Step 10: Create the daily health check workflow

Step 11: Create the incident response workflow

Step 12: Wire up the Next.js API routes

Step 13: Set up test infrastructure with MSW and run the suite

Next steps