SMBs lack dedicated DevOps: agent runbooks are either nonexistent or stale, causing prolonged downtime when AI agents fail. Manually writing and maintaining them isn’t viable.
This tutorial builds an automated incident runbook system that scans your service repositories, generates structured runbooks with alert definitions, validates recovery steps through chaos scenarios, and produces an AI-generated summary using Anthropic Claude. Everything is orchestrated through durable Trigger.dev workflows and persisted to DynamoDB for audit.
You’ll wire up six REAA (Reliability Engineering Agentic Automation) packages, the Anthropic SDK, Slack notifications, and a background freshness job that re-syncs stale runbooks on a schedule. The result is a Next.js API that lets any SMB ops team trigger a full runbook sync with a single POST request.
Prerequisites
Node.js >= 22 and pnpm 10 installed
An Anthropic API key for Claude summary generation
A Slack bot token and channel ID for notifications
A Trigger.dev API key and endpoint for durable workflows
AWS credentials with DynamoDB access (region, access key, secret)
Basic familiarity with TypeScript and Next.js App Router patterns
Step 1: Scaffold the project and configure environment variables
Start from an empty directory. You’ll use Next.js 16 with the App Router, TypeScript, and a set of vendored REAA packages for reliability automation.
Create package.json with all dependencies pinned to exact versions:
Then create .env.example with all the environment variables the system needs:
env
# Env vars used by anthropic-ai-runbook-automation-for-smb-devops-incident-recovery.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only; never commit real values.
NODE_ENV=development
ANTHROPIC_API_KEY=<your-anthropic-key>
SLACK_TOKEN=<your-slack-bot-token>
SLACK_CHANNEL=<your-channel-id>
TRIGGER_API_KEY=<your-trigger-dev-api-key>
TRIGGER_API_ENDPOINT=<your-trigger-dev-endpoint>
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=<your-access-key>
AWS_SECRET_ACCESS_KEY=<your-secret>
DYNAMODB_TABLE_NAME=runbook-sessions
RUNBOOK_SYNC_INTERVAL_MS=3600000
LOG_LEVEL=info
Expected output: pnpm install exits without errors. The .env.example file lists 12 environment variables covering Anthropic, Slack, Trigger.dev, AWS DynamoDB, and application config.
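Because .env.example ships with angle-bracket placeholders, it can help to fail fast when a value was never filled in. The following is a minimal sketch, not part of the recipe's code: the function name findMissingEnvVars and the choice of which variables count as required are assumptions made for illustration.

```typescript
// Assumption: these are the variables from .env.example that have no usable default.
const REQUIRED_ENV_VARS = [
  "ANTHROPIC_API_KEY",
  "SLACK_TOKEN",
  "SLACK_CHANNEL",
  "TRIGGER_API_KEY",
  "TRIGGER_API_ENDPOINT",
  "AWS_REGION",
  "AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY",
] as const;

type Env = Record<string, string | undefined>;

// Returns the required variables that are unset, blank, or still a <placeholder>.
export function findMissingEnvVars(env: Env): string[] {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name]?.trim();
    return !value || /^<.*>$/.test(value);
  });
}
```

Calling findMissingEnvVars(process.env) at startup and logging the returned names is one way to surface a half-configured environment before the first sync runs.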
Step 2: Create shared types and structured logger
Define the core data types used throughout the system. These describe what a runbook sync request looks like and what results it produces.
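The type definitions themselves are not reproduced in this text. As a rough sketch, the shapes below are inferred from the fields that the Slack and storage services read later in the tutorial (runbookId, repoUrl, status, alertsGenerated, chaosScenariosValidated, summary). The status values and the RunbookSyncRequest fields are assumptions, and the real src/services/types.ts may carry more.

```typescript
// Assumed status values; the recipe's actual union may differ.
export type RunbookSyncStatus = "completed" | "partial" | "failed";

// Hypothetical request shape: a repo URL plus an optional override.
export interface RunbookSyncRequest {
  repoUrl: string;
  chaosScenarioDir?: string; // assumed name for an optional override
}

// Fields below are the ones later steps read from a sync result.
export interface RunbookSyncResult {
  runbookId: string;
  repoUrl: string;
  status: RunbookSyncStatus;
  alertsGenerated: number;
  chaosScenariosValidated: number;
  summary?: string;
}

// Example value, useful as a fixture in tests.
export const exampleResult: RunbookSyncResult = {
  runbookId: "rb_123",
  repoUrl: "https://example.com/acme/orders-service.git",
  status: "completed",
  alertsGenerated: 4,
  chaosScenariosValidated: 2,
};
```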
Now create the pino-based logger in src/services/logger.ts:
ts
import pino from "pino";

export const logger = pino({ level: process.env.LOG_LEVEL ?? "info" });

export function createContextLogger(ctx: Record<string, unknown>) {
  return logger.child(ctx);
}
Expected output: Two files that compile cleanly. The logger respects the LOG_LEVEL env var, defaulting to "info".
Step 3: Build the Anthropic Claude summary service
This service calls Claude to generate a human-readable runbook summary from the analysis context. It guards against empty input and missing API keys before making any network calls.
Create src/services/anthropic.ts:
ts
import Anthropic from "@anthropic-ai/sdk";
import { logger } from "./logger.js";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

export async function generateRunbookSummary(analysis: Record<string, unknown>): Promise<string> {
  if (Object.keys(analysis).length === 0) return "No summary available";
  if (!process.env.ANTHROPIC_API_KEY) throw new Error("ANTHROPIC_API_KEY is not configured");
  try {
    const message = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      system: "You are a DevOps runbook summary generator. Summarize the following analysis into a concise incident runbook overview for SMB operators.",
      messages: [{ role: "user", content: JSON.stringify(analysis) }],
    });
    const textBlock = message.content.find((c) => c.type === "text");
    if (textBlock && "text" in textBlock) {
      return textBlock.text;
    }
    throw new Error("No text content in Claude response");
  } catch (err) {
    logger.error({ err, service: "anthropic" }, "Claude API call failed");
    throw err instanceof Error ? err : new Error("Unknown Claude API error");
  }
}
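Key details: max_tokens is required (the API returns a 400 without it), and you must check that a content block has type === "text" before accessing .text, since a response can also contain other block types such as tool_use. The service searches message.content with find rather than reading content[0], so a leading non-text block does not hide the text.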
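The narrowing step can be isolated as a small pure function. This sketch uses local stand-in types rather than the SDK's own content block union (an assumption made to keep it self-contained), and the helper name firstTextBlock is hypothetical.

```typescript
// Local stand-ins for the SDK's content block union; only the fields
// needed for narrowing are modelled here.
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: unknown };
type ContentBlock = TextBlock | ToolUseBlock;

// Returns the text of the first text block, wherever it sits in the array.
export function firstTextBlock(content: ContentBlock[]): string {
  // The type predicate lets TypeScript treat the result as a TextBlock.
  const block = content.find((c): c is TextBlock => c.type === "text");
  if (!block) throw new Error("No text content in Claude response");
  return block.text;
}
```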
Expected output: A function that returns "No summary available" for empty input, throws on a missing API key, and returns the Claude response text on success.
Step 4: Wire up Slack notifications
The Slack service sends runbook sync results and alert counts to a configured channel. It handles missing credentials by logging a warning instead of throwing, keeping notifications non-critical to the workflow.
Create src/services/slack.ts:
ts
import { WebClient, ErrorCode } from "@slack/web-api";
import type { RunbookSyncResult } from "./types.js";
import { logger } from "./logger.js";

const web = new WebClient(process.env.SLACK_TOKEN);

export async function notifyRunbookSync(result: RunbookSyncResult, channel?: string): Promise<void> {
  if (!process.env.SLACK_TOKEN) {
    logger.warn("SLACK_TOKEN not configured, skipping notification");
    return;
  }
  const targetChannel = channel ?? process.env.SLACK_CHANNEL;
  if (!targetChannel) {
    logger.warn("SLACK_CHANNEL not configured, skipping notification");
    return;
  }
  try {
    await web.chat.postMessage({
      text: `Runbook Sync ${result.status.toUpperCase()}\nRepo: ${result.repoUrl}\nAlerts: ${String(result.alertsGenerated)}\nChaos scenarios: ${String(result.chaosScenariosValidated)}\n${result.summary ? `Summary: ${result.summary}` : ""}`,
      channel: targetChannel,
    });
  } catch (error: unknown) {
    const childLogger = logger.child({ service: "slack" });
    if (error && typeof error === "object" && "code" in error && error.code === ErrorCode.PlatformError) {
      childLogger.error({ error }, "Slack platform error (non-critical)");
    } else {
      childLogger.error({ error }, "Slack request error (non-critical)");
    }
  }
}

export async function notifyAlertGenerated(alertCount: number): Promise<void> {
  if (!process.env.SLACK_TOKEN || !process.env.SLACK_CHANNEL) return;
  try {
    await web.chat.postMessage({
      text: `Generated ${String(alertCount)} alert definitions from runbook analysis`,
      channel: process.env.SLACK_CHANNEL,
    });
  } catch {
    // non-critical, silently ignore
  }
}
Expected output: Both functions safely early-return when credentials are missing. Platform and request errors are caught and logged but never propagated.
Step 5: Create DynamoDB persistence layer
The storage service wraps the @reaatech/session-continuity-storage-dynamodb adapter. It provides saveRunbookSession, getRunbookSession, and listStaleRunbooks — all guarded by a PersistenceError wrapper.
Create src/services/storage.ts:
ts
import { DynamoDBAdapter } from "@reaatech/session-continuity-storage-dynamodb";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient } from "@aws-sdk/lib-dynamodb";
import type { RunbookSyncResult } from "./types.js";

export class PersistenceError extends Error {
  constructor(message: string, public readonly cause?: unknown) {
    super(message);
    this.name = "PersistenceError";
  }
}

function getAdapter(): DynamoDBAdapter {
  if (!process.env.AWS_REGION) {
    throw new Error("AWS_REGION is not configured");
  }
  const ddbClient = new DynamoDBClient({ region: process.env.AWS_REGION });
  const ddbDocClient = DynamoDBDocumentClient.from(ddbClient);
  return new DynamoDBAdapter({ client: ddbDocClient, tableName: process.env.DYNAMODB_TABLE_NAME ?? "sessions" });
}

export async function saveRunbookSession(session: RunbookSyncResult): Promise<void> {
  try {
    const adapter = getAdapter();
    const sessionRecord = {
      ...session,
      id: session.runbookId,
      metadata: {},
      status: "active" as const,
      participants: [],
      schemaVersion: 1,
      ttl: Math.floor(Date.now() / 1000) + 86400,
    };
    await adapter.createSession(sessionRecord);
  } catch (err) {
    throw new PersistenceError("Failed to save runbook session", err);
  }
}

export async function getRunbookSession(id: string): Promise<RunbookSyncResult | null> {
  try {
    const adapter = getAdapter();
    const result: unknown = await adapter.getSession(id);
    if (result === undefined || result === null) return null;
    return result as RunbookSyncResult;
  } catch (err) {
    throw new PersistenceError("Failed to get runbook session", err);
  }
}

export async function listStaleRunbooks(before: Date): Promise<string[]> {
  try {
    const adapter = getAdapter();
    const sessions: unknown = await adapter.getExpiredSessions(before);
    if (!Array.isArray(sessions)) return [];
    return (sessions as Array<Record<string, unknown>>).map((s) => {
      const id = s.id;
      const pk = s.PK;
      return typeof id === "string" ? id : typeof pk === "string" ? pk : "";
    });
  } catch (err) {
    throw new PersistenceError("Failed to list stale runbooks", err);
  }
}
Expected output: A complete storage module that creates an adapter lazily on each call, never caches the client in module scope (so it works with env-var mocks in tests), and wraps every adapter error in a typed PersistenceError.
Step 6: Build the repository analyzer adapter
The analyzer delegates to @reaatech/agent-runbook-analyzer to scan a service repository for topology, dependencies, configs, and code structure. If the repo has no detectable language, it returns a partial result.
Expected output: When the scanner returns language: "unknown", the adapter skips analyzeCode and returns a partial result. Errors include the repo URL in the message for traceability.
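The control flow described above can be sketched with the analyzer functions injected as dependencies. The interface names, the scanRepository function, and the result shapes here are hypothetical stand-ins; the actual exports of @reaatech/agent-runbook-analyzer may look different.

```typescript
// Hypothetical shapes; the real package's types are not shown in this tutorial.
interface ScanResult { language: string; dependencies: string[] }
interface CodeAnalysis { entryPoints: string[] }

interface AnalyzerDeps {
  scanRepository(repoUrl: string): Promise<ScanResult>;
  analyzeCode(repoUrl: string, language: string): Promise<CodeAnalysis>;
}

interface RepoAnalysis {
  repoUrl: string;
  partial: boolean;
  scan: ScanResult;
  code?: CodeAnalysis;
}

export async function analyzeRepo(repoUrl: string, deps: AnalyzerDeps): Promise<RepoAnalysis> {
  try {
    const scan = await deps.scanRepository(repoUrl);
    // No detectable language: skip code analysis and return what we have.
    if (scan.language === "unknown") {
      return { repoUrl, partial: true, scan };
    }
    const code = await deps.analyzeCode(repoUrl, scan.language);
    return { repoUrl, partial: false, scan, code };
  } catch (err) {
    // Include the repo URL so the failure can be traced to its source.
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(`Failed to analyze ${repoUrl}: ${reason}`);
  }
}
```

Injecting the dependencies keeps the adapter testable without touching a real repository.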
Step 7: Build the alert generation adapter
The alerts module extracts existing alert definitions from the repo and generates new ones based on the analysis context. It uses Zod schema validation before passing the context to generateAlerts.
Expected output: generateServiceAlerts validates its context against AnalysisContextSchema, so an invalid context throws immediately. When extractAlerts returns nothing, the module still proceeds with generated-only alerts.
Step 8: Implement chaos scenario validation
The chaos validator loads YAML and JSON scenario files from a directory, validates each against a schema, and produces a ChaosScenarioResult per file. Each result uses the .valid property (not .isValid).
Expected output: An empty directory returns an empty array. Parse errors and validation failures are mapped to { valid: false, errors: [...] } — they never throw through to the caller.
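The "never throw, always return a result" behaviour can be sketched as follows. To stay dependency-free this version handles JSON text only and takes file contents as an in-memory map. The minimal schema (a name and a non-empty steps array) and the function names are assumptions, not the recipe's actual schema.

```typescript
interface ChaosScenarioResult { file: string; valid: boolean; errors: string[] }

// Assumed minimal schema: a non-empty name and at least one step.
function schemaErrors(data: unknown): string[] {
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return ["scenario must be an object"];
  }
  const scenario = data as Record<string, unknown>;
  const errors: string[] = [];
  if (typeof scenario.name !== "string" || scenario.name.length === 0) {
    errors.push("name must be a non-empty string");
  }
  if (!Array.isArray(scenario.steps) || scenario.steps.length === 0) {
    errors.push("steps must be a non-empty array");
  }
  return errors;
}

// Parse failures and schema failures both become { valid: false, errors }.
export function validateScenarioText(file: string, text: string): ChaosScenarioResult {
  let data: unknown;
  try {
    data = JSON.parse(text);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { file, valid: false, errors: [`parse error: ${reason}`] };
  }
  const errors = schemaErrors(data);
  return { file, valid: errors.length === 0, errors };
}

// Maps file name -> contents to one result per file; no files yields [].
export function validateScenarios(files: Record<string, string>): ChaosScenarioResult[] {
  return Object.entries(files).map(([file, text]) => validateScenarioText(file, text));
}
```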
Step 9: Orchestrate the full pipeline with a Trigger.dev durable workflow
The workflow chains seven steps: generate the runbook via the CLI, analyze the repo, generate alerts, validate chaos scenarios, produce the Claude summary, persist to DynamoDB, and notify Slack. Each non-critical step is wrapped in its own try/catch so a failure in alert generation doesn’t abort the entire workflow.
Expected output: A Trigger.dev task that handles partial failures gracefully — alert generation failing doesn’t prevent the chaos validation from running. On critical failure (the CLI generation step), it persists the failed state and notifies Slack before re-throwing.
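The partial-failure behaviour is independent of Trigger.dev and can be sketched as a plain step runner. This is an illustration of the pattern only: the Step shape, runPipeline, and the onCriticalFailure callback are hypothetical names, and the real task would wrap the equivalent logic in a Trigger.dev task definition.

```typescript
interface Step {
  name: string;
  critical: boolean;
  run: () => Promise<void>;
}

interface PipelineOutcome {
  status: "completed" | "partial";
  failedSteps: string[];
}

export async function runPipeline(
  steps: Step[],
  // Called before re-throwing, e.g. to persist the failed state and notify Slack.
  onCriticalFailure: (stepName: string, err: unknown) => Promise<void>,
): Promise<PipelineOutcome> {
  const failedSteps: string[] = [];
  for (const step of steps) {
    try {
      await step.run();
    } catch (err) {
      if (step.critical) {
        await onCriticalFailure(step.name, err);
        throw err;
      }
      // Non-critical: record the failure and keep going.
      failedSteps.push(step.name);
    }
  }
  return { status: failedSteps.length === 0 ? "completed" : "partial", failedSteps };
}
```

Marking only the CLI generation step as critical reproduces the behaviour described above: every other step can fail and the run still finishes with a partial status.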
Step 10: Create the API route to trigger a sync
The POST endpoint accepts a repo URL and optional overrides, validates the input with Zod, triggers the workflow, and returns a 202 with a generated runbook ID. This is a Next.js 16 App Router route handler.
Key Next.js rules followed here: NextRequest parameter type, NextResponse.json() for JSON responses (never new Response(JSON.stringify(...))), and a named POST export (not a default export).
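The accept-or-reject decision behind the POST handler can be sketched without framework imports. The recipe validates with Zod; the hand-rolled check below is a stand-in used only to keep the example dependency-free, and the repoUrl field name, acceptSyncRequest, and the injected newId generator are assumptions.

```typescript
type SyncResponse =
  | { status: 202; body: { runbookId: string } }
  | { status: 400; body: { error: string } };

// True only for strings that parse as http: or https: URLs.
function isHttpUrl(value: unknown): value is string {
  if (typeof value !== "string") return false;
  try {
    const { protocol } = new URL(value);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false;
  }
}

// Decides the response for a parsed JSON body; triggering the workflow is omitted.
export function acceptSyncRequest(body: unknown, newId: () => string): SyncResponse {
  const repoUrl =
    typeof body === "object" && body !== null
      ? (body as Record<string, unknown>).repoUrl
      : undefined;
  if (!isHttpUrl(repoUrl)) {
    return { status: 400, body: { error: "repoUrl must be an http(s) URL" } };
  }
  return { status: 202, body: { runbookId: newId() } };
}
```

Inside the route handler, the result would be returned with NextResponse.json(result.body, { status: result.status }).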
Expected output: GET /api/runbooks/rb_123 returns the runbook result on success, 404 if the session doesn’t exist in DynamoDB, and 500 if storage throws.
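The three outcomes map directly onto the storage function's return value and failure mode. A minimal sketch, with the storage lookup injected so it runs without DynamoDB; lookupRunbook and the HttpResult shape are hypothetical names for illustration.

```typescript
type Session = Record<string, unknown>;
type HttpResult = { status: number; body: Record<string, unknown> };

export async function lookupRunbook(
  id: string,
  getSession: (id: string) => Promise<Session | null>,
): Promise<HttpResult> {
  try {
    const session = await getSession(id);
    if (session === null) {
      // The lookup succeeded but no session exists under this ID.
      return { status: 404, body: { error: `Runbook ${id} not found` } };
    }
    return { status: 200, body: session };
  } catch {
    // Storage threw: report a server error without leaking details.
    return { status: 500, body: { error: "Failed to load runbook" } };
  }
}
```

In the real route, getSession would be getRunbookSession from the storage module, and the result would be passed to NextResponse.json(result.body, { status: result.status }).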
Step 12: Create the background freshness job and instrumentation hook
The freshness job checks for runbooks that haven’t been synced within the configured interval and re-triggers them. The instrumentation hook starts this job on a setInterval when the Node.js runtime loads.
The dynamic import() in register() is required — the function runs in both Node.js and Edge runtimes, and Node-only modules would break Edge on module resolution.
Expected output: When NEXT_RUNTIME === "nodejs", the server starts a periodic freshness check that re-triggers runbook syncs for stale entries. On Next.js 15 and later, instrumentation is stable and register() fires without extra configuration; the experimental.instrumentationHook: true flag is only needed on older versions.
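One pass of the freshness check can be sketched as a function with its dependencies injected. The listStaleRunbooks signature matches the storage module from Step 5; triggerSync, FreshnessDeps, and runFreshnessCheck are hypothetical names, and the decision to continue after a failed trigger is an assumption.

```typescript
interface FreshnessDeps {
  listStaleRunbooks(before: Date): Promise<string[]>;
  triggerSync(runbookId: string): Promise<void>;
}

// Runs one check and returns the IDs that were successfully re-triggered.
export async function runFreshnessCheck(
  deps: FreshnessDeps,
  intervalMs: number,
  now: Date = new Date(),
): Promise<string[]> {
  // Anything last synced before this cutoff counts as stale.
  const cutoff = new Date(now.getTime() - intervalMs);
  const staleIds = await deps.listStaleRunbooks(cutoff);
  const retriggered: string[] = [];
  for (const id of staleIds) {
    // The storage module maps rows without a usable ID to "".
    if (id === "") continue;
    try {
      await deps.triggerSync(id);
      retriggered.push(id);
    } catch {
      // Assumption: one failing runbook should not block the rest.
    }
  }
  return retriggered;
}
```

The instrumentation hook would then call this from setInterval, using RUNBOOK_SYNC_INTERVAL_MS for both the timer period and intervalMs.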
Step 13: Set up MSW test infrastructure and run the tests
Create a test setup that mocks external HTTP services (Anthropic API, Slack API) using MSW so your tests never make real network calls.
Expected output: All 74 tests pass with numFailedTests: 0 and coverage above 90% on all four metrics (lines, branches, functions, statements). The coverage scope is runtime code only — UI files (*.tsx) are excluded.
Next steps
Extend the analyzer to support private Git repos with SSH key authentication for scanning internal services
Add a Slack interactive message with Approve / Reject buttons on the runbook summary so operators can confirm before the runbook is saved
Integrate PagerDuty — trigger an alert when a chaos scenario validation fails repeatedly
Build a dashboard in the Next.js frontend that lists all runbook syncs with drill-down into alert definitions and chaos scenario results
Add PII redaction to the Claude summary input to strip secrets and tokens before sending data to the API