Small businesses relying on AI agents for customer support or operations face unpredictable outages and errors; when an agent goes down or returns garbage, the business needs automatic failover and recovery without 24/7 ops staff.
When AI agents power your customer support or operations, you can’t afford silent outages. This recipe builds a complete runbook automation system that automatically detects agent failures, isolates broken dependencies with circuit breakers, retries critical actions idempotently, and notifies your team — all without locking you into a single LLM provider. You’ll wire up a Next.js app with Trigger.dev workflows, Langfuse observability, and Slack notifications in about 30 minutes.
Prerequisites
Node.js 22+ and pnpm 10
A Next.js project scaffolded with App Router
A Slack bot token (xoxb-...) and a channel ID for notifications
A Langfuse account (public + secret keys) for tracing — sign up at cloud.langfuse.com
An OpenAI API key or Anthropic API key (or both)
Basic familiarity with TypeScript and Next.js App Router
Step 1: Scaffold the project and pin dependencies
Create a new Next.js project with the App Router, then install the REAA ecosystem packages and third-party dependencies. Every version must be exact-pinned (no ^ or ~).
Expected output: pnpm install completes without errors, and your node_modules/ contains all REAA and third-party packages.
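One way to enforce the exact-pin rule is a small check over the dependency maps in package.json. The following is a minimal sketch, not part of the recipe's published code: the helper name `findUnpinned` and the sample package names and versions are hypothetical.

```typescript
// Hypothetical helper: lists dependencies whose version is not an exact pin.
type DependencyMap = Record<string, string>;

// An exact pin is a plain semver such as "1.2.3" or "1.2.3-beta.1".
// Ranges and wildcards (^, ~, >=, *, 1.x) do not match.
const EXACT_VERSION = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$/;

function findUnpinned(deps: DependencyMap): string[] {
  return Object.entries(deps)
    .filter(([, version]) => !EXACT_VERSION.test(version))
    .map(([name, version]) => `${name}@${version}`);
}

// Illustrative input only; these names and version numbers are made up.
const sampleDependencies: DependencyMap = {
  "pinned-package": "1.4.2",
  "caret-package": "^2.0.0",
  "tilde-package": "~0.9.1",
};

const unpinned = findUnpinned(sampleDependencies);
console.log(unpinned);
```

Running a check like this over both `dependencies` and `devDependencies` in CI would catch a range specifier before it reaches the lockfile.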
Step 2: Configure environment variables with Zod
Create src/lib/config.ts to validate every environment variable at import time using Zod. This eliminates the “missing env var” mystery during runtime.
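A sketch of what such a schema might look like is shown here. The variable names are the ones set in tests/setup.ts in Step 13; which of them are optional, and the rule that at least one LLM key must be present, are assumptions made for illustration, and `parseConfig` is a hypothetical helper name.

```typescript
// src/lib/config.ts (sketch; optionality rules are assumptions)
import { z } from "zod";

const envSchema = z
  .object({
    SLACK_TOKEN: z.string().min(1),
    SLACK_CHANNEL: z.string().min(1),
    LANGFUSE_PUBLIC_KEY: z.string().min(1),
    LANGFUSE_SECRET_KEY: z.string().min(1),
    LANGFUSE_HOST: z.string().url(),
    OPENAI_API_KEY: z.string().min(1).optional(),
    ANTHROPIC_API_KEY: z.string().min(1).optional(),
    DEFAULT_PROVIDER: z.enum(["openai", "anthropic"]),
    // Environment variables are strings, so coerce before validating.
    HEALTH_CHECK_INTERVAL_MS: z.coerce.number().int().positive(),
  })
  .refine((env) => Boolean(env.OPENAI_API_KEY ?? env.ANTHROPIC_API_KEY), {
    message: "Set OPENAI_API_KEY or ANTHROPIC_API_KEY (or both)",
  });

type Config = z.infer<typeof envSchema>;

// Hypothetical helper: validates an env-like object and reports every
// problem in one error instead of failing on the first missing variable.
function parseConfig(env: Record<string, string | undefined>): Config {
  const result = envSchema.safeParse(env);
  if (!result.success) {
    const details = result.error.issues
      .map((issue) => `${issue.path.join(".") || "env"}: ${issue.message}`)
      .join("\n");
    throw new Error(`Invalid environment configuration:\n${details}`);
  }
  return result.data;
}

// In the real module, validation would run at import time:
//   export const config = parseConfig(process.env);
```

Because the parse happens when the module is first imported, a missing or malformed variable stops the process at startup rather than surfacing later inside a workflow.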
Expected output: These types are the contract between every module. ProviderClient is the abstraction that lets you swap OpenAI for Anthropic without changing a single line of workflow code.
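As an illustration of that contract, the sketch below reconstructs a few of the shared types from how they are used elsewhere in this recipe. The field lists are inferred rather than copied from the recipe's own types file, `RunbookAction` is omitted because its shape is not shown, and `EchoProvider` and `summarize` are hypothetical names used only to demonstrate swapping providers.

```typescript
// src/types/index.ts (sketch; field lists are inferred from usage)
type ServiceStatus = "healthy" | "degraded" | "down";

interface HealthCheckResult {
  serviceId: string;
  status: ServiceStatus;
  latencyMs: number;
  error?: string;
}

interface IncidentEvent {
  serviceId: string;
  severity: string; // the workflows in this recipe use "SEV2"
  message: string;
  occurredAt: Date;
  metadata?: Record<string, unknown>;
}

interface ModelOptions {
  model: string;
  maxTokens?: number;
  systemPrompt?: string;
}

interface ProviderResponse {
  content: string;
  usage?: { inputTokens: number; outputTokens: number };
}

interface ProviderClient {
  complete(prompt: string, options: ModelOptions): Promise<ProviderResponse>;
}

// Hypothetical stand-in provider: no network calls, just echoes the prompt.
class EchoProvider implements ProviderClient {
  async complete(prompt: string, options: ModelOptions): Promise<ProviderResponse> {
    return { content: `[${options.model}] ${prompt}` };
  }
}

// Workflow code depends only on the interface, so any provider can be passed in.
async function summarize(provider: ProviderClient, text: string): Promise<string> {
  const response = await provider.complete(`Summarize: ${text}`, { model: "example-model" });
  return response.content;
}
```

Replacing `EchoProvider` with a class that calls OpenAI or Anthropic would leave `summarize` unchanged, which is the property the workflows rely on.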
Step 4: Set up Langfuse observability tracing
Create src/lib/tracing.ts to initialize Langfuse as a lazy singleton. Every workflow will create traces and spans for full observability.
```ts
// src/lib/tracing.ts
import { Langfuse } from "langfuse";
import { config } from "./config.js";

let client: Langfuse | null = null;

export function getTracer(): Langfuse {
  if (!client) {
    client = new Langfuse({
      publicKey: config.LANGFUSE_PUBLIC_KEY,
      secretKey: config.LANGFUSE_SECRET_KEY,
      baseUrl: config.LANGFUSE_HOST,
    });
  }
  return client;
}

export function createTrace(name: string, tags?: string[]) {
  return getTracer().trace({ name, tags });
}
export type TraceClient = ReturnType<typeof createTrace>;

export function createSpan(trace: TraceClient, name: string) {
  return trace.span({ name });
}
export type SpanClient = ReturnType<typeof createSpan>;

export function endSpan(
  span: SpanClient,
  status?: "ok" | "error",
  usage?: { inputTokens: number; outputTokens: number },
) {
  span.end({ metadata: { status, usage } });
}

export async function shutdownTracer() {
  await client?.shutdownAsync();
}
```
Expected output: Call createTrace("my-workflow") to start a trace, createSpan(trace, "step-1") to create a child span, and endSpan(span, "ok", { inputTokens: 10, outputTokens: 5 }) to record token usage.
Step 5: Build a provider-agnostic LLM client
Create src/services/provider-client.ts with OpenAIProvider and AnthropicProvider implementations behind the ProviderClient interface. The factory function createProvider picks the right class and validates the matching API key is present.
```ts
// src/services/provider-client.ts
import { type ProviderClient, type ProviderResponse, type ModelOptions } from "../types/index.js";
import { config } from "../lib/config.js";
import { ConfigurationError } from "@reaatech/agent-runbook";
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";

export class ProviderError extends Error {
  code: string;
  statusCode: number;
  constructor(message: string, code: string, statusCode: number) {
```
Expected output: getDefaultProvider() returns an OpenAIProvider or AnthropicProvider based on your DEFAULT_PROVIDER env var. Calling complete() returns a normalized ProviderResponse regardless of which provider is active.
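The key-validation step in the factory can be sketched on its own. This is an illustration under assumptions, not the recipe's createProvider: it returns a stub object instead of a real provider class, and it throws a local `MissingApiKeyError` where the recipe imports `ConfigurationError` from @reaatech/agent-runbook. The names `createProviderSketch`, `ProviderKeys`, and `ProviderStub` are hypothetical.

```typescript
type ProviderName = "openai" | "anthropic";

interface ProviderKeys {
  OPENAI_API_KEY?: string;
  ANTHROPIC_API_KEY?: string;
}

// Local stand-in for a configuration error type.
class MissingApiKeyError extends Error {
  constructor(provider: ProviderName, variable: string) {
    super(`Provider "${provider}" was selected but ${variable} is not set`);
    this.name = "MissingApiKeyError";
  }
}

// Stub result: a real factory would return an object implementing ProviderClient.
interface ProviderStub {
  readonly name: ProviderName;
}

function createProviderSketch(name: ProviderName, keys: ProviderKeys): ProviderStub {
  // Each provider needs its own key; having the other provider's key is not enough.
  const variable = name === "openai" ? "OPENAI_API_KEY" : "ANTHROPIC_API_KEY";
  if (!keys[variable]) {
    throw new MissingApiKeyError(name, variable);
  }
  return { name };
}
```

Failing in the factory means a misconfigured DEFAULT_PROVIDER is reported once, with the name of the missing variable, instead of as an authentication error on the first completion call.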
Step 6: Implement the circuit breaker manager
Create src/services/circuit-breaker-manager.ts using @reaatech/circuit-breaker-agents. Each service gets its own CircuitBreaker with a 5-failure threshold, 30-second recovery timeout, and gradual recovery strategy.
Expected output: circuitBreakerManager.getOrCreate("ai-agent-1") returns a breaker in CLOSED state. After 5 failures, it trips to OPEN. The stateChange event logs every transition to Langfuse.
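To make the state transitions concrete, here is a dependency-free sketch of a circuit breaker with the same 5-failure threshold and 30-second recovery timeout. It does not use or mirror the @reaatech/circuit-breaker-agents API; `SimpleCircuitBreaker` and its options are hypothetical, and recovery is simplified to a single trial call in HALF_OPEN rather than a gradual strategy.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface BreakerOptions {
  failureThreshold: number; // consecutive failures before tripping
  recoveryTimeoutMs: number; // how long to stay OPEN before allowing a trial call
  now?: () => number; // injectable clock, useful in tests
}

class SimpleCircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private readonly options: BreakerOptions;
  private readonly now: () => number;

  constructor(options: BreakerOptions) {
    this.options = options;
    this.now = options.now ?? (() => Date.now());
  }

  getState(): BreakerState {
    // An OPEN breaker moves to HALF_OPEN once the recovery timeout has elapsed.
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.options.recoveryTimeoutMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  async execute<T>(action: () => Promise<T>): Promise<T> {
    if (this.getState() === "OPEN") {
      throw new Error("Circuit is OPEN");
    }
    try {
      const result = await action();
      // Any success closes the breaker and clears the failure count.
      this.failures = 0;
      this.state = "CLOSED";
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or reaching the threshold, (re)opens the breaker.
      if (this.state === "HALF_OPEN" || this.failures >= this.options.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}

const breaker = new SimpleCircuitBreaker({ failureThreshold: 5, recoveryTimeoutMs: 30_000 });
```

A manager like the one described above would keep one such breaker per service ID, so a failing agent cannot trip the breaker of a healthy one.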
Step 7: Add idempotency middleware for safe retries
Create src/lib/actions.ts using @reaatech/idempotency-middleware. This wraps any mutation (Slack messages, incident responses) so duplicate invocations with the same key return the cached first result instead of re-executing.
```ts
// src/lib/actions.ts
import { MemoryAdapter, IdempotencyMiddleware, IdempotencyError } from "@reaatech/idempotency-middleware";
import { generateId } from "@reaatech/agent-runbook";
import { createTrace, createSpan, endSpan } from "./tracing.js";

const storage = new MemoryAdapter();
await storage.connect();

const middleware = new IdempotencyMiddleware(storage, {
  ttl: 86_400_000,
  lockTimeout: 30_000,
  includeBodyInKey: true,
});

export async function withIdempotency<T>(
  key: string,
  method: string,
  path: string,
  handler: () => Promise<T>,
): Promise<T> {
  const span = createSpan(createTrace("idempotency", ["middleware"]), "withIdempotency");
  const MAX_RETRIES = 3;
  let lastError: Error = new Error("Idempotency retries exhausted");

  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    try {
      const result = await middleware.execute(key, { method, path }, handler);
      endSpan(span, "ok");
      return result;
    } catch (err) {
      // Recoverable errors are retried after a fixed 1-second delay.
      if (err instanceof IdempotencyError && err.isRecoverable()) {
        lastError = err;
        if (attempt < MAX_RETRIES) {
          await new Promise((resolve) => setTimeout(resolve, 1000));
        }
        continue;
      }
      // Non-recoverable: close the span before rethrowing so it is not left open.
      endSpan(span, "error");
      throw err;
    }
  }

  endSpan(span, "error");
  throw lastError;
}

export function generateIdempotencyKey(serviceId: string, action: string): string {
  return `${serviceId}--${action}--${generateId("ik")}`;
}
```
Expected output: Calling withIdempotency("my-key", "POST", "/chat.postMessage", handler) twice with the same key only invokes the handler once. The second call returns the cached result. Recoverable errors (lock timeouts, storage errors) auto-retry up to 3 times with 1-second backoff.
Step 8: Create the Slack notification service
Create src/services/slack-notifier.ts to send alerts, incident notifications, and health reports through Slack. Every Slack API call is wrapped in withIdempotency to prevent duplicate messages.
Expected output: notifier.sendIncidentNotification(incident) posts a formatted message to your Slack channel. The idempotency key prevents the same notification from being sent twice if the workflow retries.
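The core of such a notifier is a single call to Slack's chat.postMessage Web API method. The sketch below shows that call with a simplified in-memory de-duplication map standing in for withIdempotency. `SlackNotifierSketch`, `postOnce`, and `FetchLike` are hypothetical names, and the fetch implementation is injected so the example does not depend on a global.

```typescript
interface SlackPostResult {
  ok: boolean;
  ts?: string;
  error?: string;
}

// Minimal fetch shape needed here; the global fetch satisfies it.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ json(): Promise<unknown> }>;

class SlackNotifierSketch {
  // Simplified stand-in for withIdempotency: remembers results by key.
  private readonly sent = new Map<string, SlackPostResult>();
  private readonly token: string;
  private readonly channel: string;
  private readonly fetchImpl: FetchLike;

  constructor(token: string, channel: string, fetchImpl: FetchLike) {
    this.token = token;
    this.channel = channel;
    this.fetchImpl = fetchImpl;
  }

  async postOnce(idempotencyKey: string, text: string): Promise<SlackPostResult> {
    const cached = this.sent.get(idempotencyKey);
    if (cached) return cached;

    const response = await this.fetchImpl("https://slack.com/api/chat.postMessage", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${this.token}`,
        "Content-Type": "application/json; charset=utf-8",
      },
      body: JSON.stringify({ channel: this.channel, text }),
    });

    // Slack reports most failures as HTTP 200 with ok: false, so check the body.
    const result = (await response.json()) as SlackPostResult;
    if (!result.ok) {
      throw new Error(`Slack API error: ${result.error ?? "unknown"}`);
    }
    this.sent.set(idempotencyKey, result);
    return result;
  }
}
```

In production code you would pass the global `fetch` as the third constructor argument. Unlike a lock-based middleware, this map does not protect against two concurrent calls with the same key, which is one reason the recipe routes Slack calls through withIdempotency.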
Step 9: Build the health check service
Create src/services/health-checker.ts that probes each service endpoint, retries on transient failures, and generates Kubernetes probe YAML.
```ts
// src/services/health-checker.ts
import { suggestHealthChecks, generateKubernetesProbeYaml } from "@reaatech/agent-runbook-health-checks";
import { type HealthCheck, retry } from "@reaatech/agent-runbook";
import type { HealthCheckResult } from "../types/index.js";

export class HealthCheckError extends Error {
  code: string;
  serviceId: string;
  constructor(message: string, code: string, serviceId: string) {
    super(message);
    this.name = "HealthCheckError";
    this.code = code;
    this.serviceId
```
Expected output: checker.checkAll() returns a HealthCheckResult[] where each entry is "healthy", "degraded", or "down". Transient network errors auto-retry 3 times via @reaatech/agent-runbook’s retry().
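The probe-classify-retry logic can be sketched without the @reaatech packages. Everything here is illustrative: the helper names are hypothetical, the retry loop is a plain stand-in for the library's retry(), and treating a slow but successful response as "degraded" is an assumption about how the recipe defines that status.

```typescript
type ProbeStatus = "healthy" | "degraded" | "down";

interface ProbeOutcome {
  ok: boolean; // whether the endpoint answered with a success status
  latencyMs: number;
}

// Assumption: a successful but slow response counts as "degraded".
function classify(outcome: ProbeOutcome, degradedAboveMs: number): ProbeStatus {
  if (!outcome.ok) return "down";
  return outcome.latencyMs > degradedAboveMs ? "degraded" : "healthy";
}

// Retries only when the probe itself throws (for example, a network error).
// A completed probe with ok: false is a definite answer and is not retried.
async function checkService(
  probe: () => Promise<ProbeOutcome>,
  attempts: number,
  degradedAboveMs: number,
): Promise<ProbeStatus> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return classify(await probe(), degradedAboveMs);
    } catch {
      // Transient failure: fall through and try again.
    }
  }
  return "down";
}
```

The `probe` argument would typically wrap a timed HTTP GET against the service's health endpoint; keeping it as a parameter makes the classification logic testable without a network.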
Step 10: Create the daily health check workflow
Create src/runbooks/daily-health.ts — a Trigger.dev scheduled task that runs every 5 minutes, probes all services, publishes incident events for down services, and records degraded signals through the circuit breaker.
```ts
// src/runbooks/daily-health.ts
import { schedules, tasks, logger } from "@trigger.dev/sdk";
import { HealthChecker } from "../services/health-checker.js";
import { CircuitOpenError } from "@reaatech/circuit-breaker-agents";
import { circuitBreakerManager } from "../services/circuit-breaker-manager.js";
import { SlackNotifier } from "../services/slack-notifier.js";
import { createTrace, createSpan, endSpan, shutdownTracer } from "../lib/tracing.js";
import type { IncidentEvent } from "../types/index.js";

export async function dailyHealthCheckRun(): Promise<void> {
  const trace = createTrace("daily-health-check", ["runbook", "health"]);
  const span = createSpan(trace, "workflow");
  try {
    const checker = new HealthChecker({
      services: [
        { serviceId: "ai-agent-1", endpointUrl: "http://localhost:3001", serviceType: "web-api" },
        { serviceId: "ai-agent-2", endpointUrl: "http://localhost:3002", serviceType: "web-api" },
      ],
    });
    const results = await checker.checkAll();
    const downServices = results.filter((r) => r.status === "down");
    const degradedServices = results.filter((r) => r.status === "degraded");

    const notifier = new SlackNotifier();
    await notifier.sendHealthReport(results);

    for (const svc of degradedServices) {
      const state = circuitBreakerManager.getState(svc.serviceId);
      if (state === "OPEN") {
        logger.info(`Skipping degraded report for ${svc.serviceId} — circuit is OPEN`);
        continue;
      }
      await circuitBreakerManager.executeWithBreaker(svc.serviceId, () => Promise.resolve(svc), {
        onSuccess: () => ({ confidence: 0.5 }),
      });
    }

    for (const svc of downServices) {
      const state = circuitBreakerManager.getState(svc.serviceId);
      if (state === "OPEN") {
        logger.info(`Skipping probe for ${svc.serviceId} — circuit is OPEN`);
        continue;
      }
      const incident: IncidentEvent = {
        serviceId: svc.serviceId,
        severity: "SEV2",
        message: svc.error ?? "Health check failed",
        occurredAt: new Date(),
        metadata: { latencyMs: svc.latencyMs },
      };
      await tasks.trigger("incident-response", incident);
    }
    endSpan(span);
  } catch (err) {
    if (err instanceof CircuitOpenError) {
      logger.info("Circuit is OPEN — degraded, not critical", {
        error: err instanceof Error ? err.message : String(err),
      });
    } else {
      logger.error("Daily health check failed", {
        error: err instanceof Error ? err.message : String(err),
      });
    }
    endSpan(span);
  } finally {
    await shutdownTracer();
  }
}

export const dailyHealthCheck = schedules.task({
  id: "daily-health-check",
  cron: "*/5 * * * *",
  run: dailyHealthCheckRun,
});
```
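Expected output: Every 5 minutes the workflow probes ai-agent-1 and ai-agent-2 and posts a Slack health report. For each service that is down, it triggers the incident response workflow; for each degraded service, it records a reduced-confidence signal through the circuit breaker. If a service’s circuit is already OPEN, the workflow skips the incident trigger and the degraded signal for that service. The probe itself still runs, because checkAll() checks every configured service before the circuit state is consulted.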
Step 11: Create the incident response workflow
Create src/runbooks/incident-response.ts — a Trigger.dev event-driven task that receives incident events, runs a full runbook (escalation policy, communication templates, restart decisions), and resets the circuit breaker on resolution.
```ts
// src/runbooks/incident-response.ts
import { task, logger } from "@trigger.dev/sdk";
import { CircuitOpenError } from "@reaatech/circuit-breaker-agents";
import { circuitBreakerManager } from "../services/circuit-breaker-manager.js";
import { SlackNotifier } from "../services/slack-notifier.js";
import { getDefaultProvider } from "../services/provider-client.js";
import { withIdempotency, generateIdempotencyKey } from "../lib/actions.js";
import { createTrace, createSpan, endSpan, shutdownTracer } from "../lib/tracing.js";
import {
  generateIncidentWorkflows,
  generateEscalationPolicy,
  getTemplatesByCategory,
  applyTemplateVariables,
} from "@reaatech/agent-runbook-incident";
import type { IncidentEvent } from "../types/index.js";
import type
```
Expected output: When an incident event arrives, the workflow creates an idempotent incident response — generates escalation policies, fetches notification templates, posts to Slack, optionally asks the LLM whether to restart, and resets the circuit breaker on resolution.
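Because the restart decision comes back as model-generated text, it is worth parsing it defensively: a response wrapped in prose or a code fence, or one with the wrong field types, should not crash the workflow or trigger a restart. The sketch below is one way to do that; `parseRestartDecision` is a hypothetical helper, and failing closed (no restart) is a design choice made for this example.

```typescript
interface RestartDecision {
  shouldRestart: boolean;
  reason: string;
}

// Hypothetical helper: never throws, and defaults to "do not restart"
// whenever the model output cannot be trusted.
function parseRestartDecision(content: string): RestartDecision {
  const fallback = (reason: string): RestartDecision => ({ shouldRestart: false, reason });

  // Models sometimes wrap JSON in prose or code fences; take the outermost {...}.
  const start = content.indexOf("{");
  const end = content.lastIndexOf("}");
  if (start === -1 || end <= start) {
    return fallback("Model response contained no JSON object");
  }

  let value: unknown;
  try {
    value = JSON.parse(content.slice(start, end + 1));
  } catch {
    return fallback("Model response was not valid JSON");
  }
  if (typeof value !== "object" || value === null) {
    return fallback("Model response was not a JSON object");
  }

  const { shouldRestart, reason } = value as Record<string, unknown>;
  if (typeof shouldRestart !== "boolean" || typeof reason !== "string") {
    return fallback("Model response was missing shouldRestart or reason");
  }
  return { shouldRestart, reason };
}

const decision = parseRestartDecision('{"shouldRestart": true, "reason": "Process is unresponsive"}');
```

Failing closed keeps a malformed model response from restarting a service; the incident is still reported to Slack, so a person can make the call.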
Step 12: Wire up the Next.js API routes
Create app/api/health/route.ts for checking service health and app/api/trigger/route.ts for handling Trigger.dev webhook events.
Expected output: GET /api/health returns { status: "ok", services: [...], timestamp: "..." } with individual service statuses. POST /api/trigger with a valid secret dispatches to the correct workflow. Both routes use NextRequest/NextResponse (not bare Request/Response).
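The logic behind both routes can be kept in small pure functions that the route handlers call. The sketch below is illustrative only: the helper names, the event-type names in the task map, and the idea of comparing a shared secret in constant time are assumptions; the two task IDs are the ones defined in Steps 10 and 11.

```typescript
type RouteServiceStatus = "healthy" | "degraded" | "down";

interface ServiceHealth {
  serviceId: string;
  status: RouteServiceStatus;
}

// Builds the body described for GET /api/health.
function buildHealthPayload(services: ServiceHealth[], now: Date = new Date()) {
  return { status: "ok" as const, services, timestamp: now.toISOString() };
}

// Length check plus XOR accumulation, so the comparison time does not
// depend on where the first differing character is.
function secretsMatch(provided: string | null, expected: string): boolean {
  if (provided === null || provided.length !== expected.length) return false;
  let diff = 0;
  for (let i = 0; i < expected.length; i++) {
    diff |= provided.charCodeAt(i) ^ expected.charCodeAt(i);
  }
  return diff === 0;
}

// Hypothetical event names mapped to the Trigger.dev task IDs used in this recipe.
const TASK_BY_EVENT = new Map<string, string>([
  ["health-check", "daily-health-check"],
  ["incident", "incident-response"],
]);

function selectTask(eventType: string): string | null {
  return TASK_BY_EVENT.get(eventType) ?? null;
}
```

A route handler would then return `NextResponse.json(buildHealthPayload(results))` for the health route, and for the trigger route respond with 401 when `secretsMatch` is false and 400 when `selectTask` returns null. Node's `crypto.timingSafeEqual` is a sturdier choice for the comparison if you are on the Node.js runtime.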
Step 13: Set up test infrastructure with MSW and run the suite
Create tests/setup.ts to mock all external APIs (OpenAI, Anthropic, Slack, health endpoints) with MSW so tests run without any live credentials:
```ts
// tests/setup.ts
import { setupServer } from "msw/node";
import { http, HttpResponse } from "msw";
import { beforeAll, afterEach, afterAll } from "vitest";

// Set env vars at module level so modules that parse process.env at import
// time (e.g. config.ts) see the correct values before any hook runs.
process.env.SLACK_TOKEN = "slack-token-placeholder";
process.env.SLACK_CHANNEL = "C123456";
process.env.LANGFUSE_PUBLIC_KEY = "pk-lf-test";
process.env.LANGFUSE_SECRET_KEY = "sk-lf-test";
process.env.LANGFUSE_HOST = "https://cloud.langfuse.com";
process.env.OPENAI_API_KEY = "sk-test";
process.env.ANTHROPIC_API_KEY = "sk-ant-test";
process.env.DEFAULT_PROVIDER = "openai";
process.env.HEALTH_CHECK_INTERVAL_MS = "300000";

export const server = setupServer(
  // OpenAI Chat Completions
  http.post("https://api.openai.com/v1/chat/completions", () =>
    HttpResponse.json({
      id: "cmpl_test",
      model: "gpt-5.2",
      choices: [{ message: { role: "assistant", content: "mocked OpenAI response" }, finish_reason: "stop" }],
      usage: { prompt_tokens: 10, completion_tokens: 5, total_tokens: 15 },
    }),
  ),
  // Anthropic Messages
  http.post("https://api.anthropic.com/v1/messages", () =>
    HttpResponse.json({
      id: "msg_test",
      type: "message",
      role: "assistant",
      model: "claude-sonnet-4-6",
      content: [{ type: "text", text: "mocked Anthropic response" }],
      stop_reason: "end_turn",
      usage: { input_tokens: 10, output_tokens: 5 },
    }),
  ),
  // Slack chat.postMessage
  http.post("https://slack.com/api/chat.postMessage", () =>
    HttpResponse.json({ ok: true, ts: "mock-ts-123" }),
  ),
  // Health check endpoints — dynamic based on URL pattern
  http.get("*/health", ({ request }) => {
    const url = new URL(request.url);
    if (url.hostname.includes("down")) {
      return new HttpResponse(null, { status: 500 });
    }
    return HttpResponse.json({ status: "ok" }, { status: 200 });
  }),
);

beforeAll(() => {
  server.listen({ onUnhandledRequest: "error" });
});

afterEach(() => {
  server.resetHandlers();
});

afterAll(() => {
  server.close();
});
```
Create vitest.config.ts to load the setup file, use the threads pool, and enforce 90% coverage thresholds:
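A minimal sketch of such a config follows. The setup-file path, the threads pool, and the 90% thresholds come from the description above; the `v8` coverage provider is an assumption, and the config is written as a plain object so the example has no imports (wrapping it in `defineConfig` from `vitest/config` adds type checking in a real project).

```typescript
// vitest.config.ts (sketch; the coverage provider is an assumption)
const vitestConfig = {
  test: {
    // Runs before each test file: starts MSW and sets the env vars.
    setupFiles: ["./tests/setup.ts"],
    pool: "threads" as const,
    coverage: {
      provider: "v8" as const,
      // The run fails if any metric falls below its threshold.
      thresholds: {
        lines: 90,
        branches: 90,
        functions: 90,
        statements: 90,
      },
    },
  },
};

export default vitestConfig;
```

With thresholds set in the config, `vitest run --coverage` exits non-zero when coverage drops, so the quality gate does not need a separate coverage check.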
Then replace the placeholder src/index.ts to export all public APIs:
```ts
// src/index.ts
export { HealthChecker, HealthCheckError, type ServiceTarget } from "./services/health-checker.js";
export { circuitBreakerManager } from "./services/circuit-breaker-manager.js";
export { SlackNotifier } from "./services/slack-notifier.js";
export { withIdempotency, generateIdempotencyKey } from "./lib/actions.js";
export {
  getDefaultProvider,
  createProvider,
  OpenAIProvider,
  AnthropicProvider,
  ProviderError,
} from "./services/provider-client.js";
export { getTracer, createTrace, createSpan, endSpan, shutdownTracer } from "./lib/tracing.js";
export { config, type Config } from "./lib/config.js";
export type {
  ServiceStatus,
  HealthCheckResult,
  IncidentEvent,
  RunbookAction,
  ProviderResponse,
  ModelOptions,
  ProviderClient,
} from "./types/index.js";

import "@reaatech/agent-runbook";
import "@reaatech/agent-runbook-incident";
import "@reaatech/agent-runbook-health-checks";
import { CircuitBreaker } from "@reaatech/circuit-breaker-core";
import "@reaatech/circuit-breaker-agents";
import "@reaatech/idempotency-middleware";
void CircuitBreaker;
```
Now run the full quality gate:
```sh
pnpm typecheck
pnpm lint
pnpm test
```
Expected output: pnpm typecheck exits 0 (no type errors). pnpm lint exits 0 (no lint errors). pnpm test runs vitest run --coverage and reports zero failed tests with lines, branches, functions, and statements all at or above 90%.
Next steps
Add a Slack interactive component — replace the plain-text alerts with Block Kit buttons for “Acknowledge” and “Escalate” actions
Deploy to production — run the Trigger.dev workflows on your actual infrastructure, pointing HEALTH_CHECK_SERVICE_ENDPOINTS at your real agent services
Add more providers — implement GroqProvider or BedrockProvider behind the same ProviderClient interface to expand your LLM options
Persist circuit breaker state — swap InMemoryAdapter for RedisAdapter so breaker state survives process restarts in a distributed deployment
```ts
// src/services/provider-client.ts (fragment: ProviderError constructor body and the start of OpenAIProvider)
    super(message);
    this.name = "ProviderError";
    this.code = code;
    this.statusCode = statusCode;
  }
}

export class OpenAIProvider implements ProviderClient {
  private client: OpenAI;
  constructor() {
    this.client = new OpenAI({ apiKey: config.OPENAI_API_KEY });
```
```ts
// src/runbooks/incident-response.ts (fragment: LLM restart decision)
    w.steps.some((s) => typeof s !== "string" && s.action === "restart")
  );
  let parsed: { shouldRestart: boolean; reason: string } = {
    shouldRestart: false,
    reason: "No restart action in runbook",
  };
  if (hasRestartAction) {
    const provider = getDefaultProvider();
    const decision = await provider.complete(
      `Context: Service "${serviceId}" is down with severity ${severity}. Message: ${message}\n\nShould we restart the service? Respond with JSON: { "shouldRestart": boolean, "reason": string }`,
      { model: "gpt-5.2", maxTokens: 256, systemPrompt: "You are a reliability engineer." },
    );
    parsed = JSON.parse(decision.content) as { shouldRestart: boolean; reason: string };
```