Anthropic AI Runbook Automation for SMB DevOps Incident Recovery

Automatically generate and test agent incident runbooks from your service repositories, then trigger them via durable workflows.

The problem

Most SMBs lack a dedicated DevOps team, so agent runbooks are either nonexistent or stale, and downtime drags on when AI agents fail. Writing and maintaining runbooks by hand isn’t viable at that scale.

Intro

This tutorial builds an automated incident runbook system that scans your service repositories, generates structured runbooks with alert definitions, validates recovery steps through chaos scenarios, and produces an AI-generated summary using Anthropic Claude. Everything is orchestrated through durable Trigger.dev workflows and persisted to DynamoDB for audit.
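The workflow and storage code in this recipe import RunbookSyncRequest and RunbookSyncResult from a types module that is not reproduced here. The sketch below infers their shape from how the fields are used in that code. The string types for provider and summary, and the initialResult helper, are assumptions for illustration rather than the recipe's actual definitions.

```typescript
// Sketch of the shared types, inferred from field usage in the workflow code.
// Items marked "assumed" are guesses, not the recipe's actual definitions.

/** Lifecycle states that appear in the workflow code. */
export type RunbookSyncStatus = "queued" | "completed" | "failed";

/** Payload accepted by the sync workflow. */
export interface RunbookSyncRequest {
  repoUrl: string;
  repoPath?: string; // the workflow derives a path from repoUrl when omitted
  provider?: string; // assumed string; the workflow defaults to "claude"
  model?: string;
}

/** Outcome of one sync run, persisted for audit. */
export interface RunbookSyncResult {
  runbookId: string;
  repoUrl: string;
  timestamp: string; // ISO 8601
  alertsGenerated: number;
  chaosScenariosValidated: number;
  status: RunbookSyncStatus;
  summary?: string; // assumed string returned by the summary service
}

/** Hypothetical helper: the starting state before any pipeline step has run. */
export function initialResult(repoUrl: string, now: Date = new Date()): RunbookSyncResult {
  return {
    runbookId: "",
    repoUrl,
    timestamp: now.toISOString(),
    alertsGenerated: 0,
    chaosScenariosValidated: 0,
    status: "queued",
  };
}
```

Keeping the counters and the optional summary on a single result object lets a run that fails halfway still report what it managed to complete.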

You’ll wire up six REAA (Reliability Engineering Agentic Automation) packages, the Anthropic SDK, Slack notifications, and a background freshness job that re-syncs stale runbooks on a schedule. The result is a Next.js API that lets any SMB ops team trigger a full runbook sync with a single POST request.
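To make the single-POST trigger concrete, here is a minimal client-side sketch. The route path /api/runbooks/sync and the body fields are assumptions, since the actual route is defined in a later step that this section does not show.

```typescript
// Hypothetical route path; substitute whatever the API route step defines.
const SYNC_ENDPOINT = "/api/runbooks/sync";

interface SyncRequestBody {
  repoUrl: string;
  model?: string;
}

interface PreparedRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

/** Builds the URL and fetch options for triggering a runbook sync. */
export function buildSyncRequest(baseUrl: string, body: SyncRequestBody): PreparedRequest {
  // Reject obviously malformed input before it reaches the server.
  if (!/^https?:\/\//.test(body.repoUrl)) {
    throw new Error("repoUrl must be an http(s) URL");
  }
  return {
    // Strip trailing slashes so the joined URL has exactly one separator.
    url: baseUrl.replace(/\/+$/, "") + SYNC_ENDPOINT,
    init: {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}
```

A caller would pass the result straight to fetch, for example `const { url, init } = buildSyncRequest("http://localhost:3000", { repoUrl }); await fetch(url, init);`.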

Prerequisites

Node.js >= 22 and pnpm 10 installed
An Anthropic API key for Claude summary generation
A Slack bot token and channel ID for notifications
A Trigger.dev API key and endpoint for durable workflows
AWS credentials with DynamoDB access (region, access key, secret)
Basic familiarity with TypeScript and Next.js App Router patterns
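These credentials are typically supplied through environment variables. The sketch below shows one way to fail fast at startup when any are missing. AWS_REGION and DYNAMODB_TABLE_NAME (with its "sessions" default) appear in the recipe's storage code; ANTHROPIC_API_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY follow the usual SDK conventions; the Slack and Trigger.dev variable names are assumptions and may differ from the recipe's.

```typescript
// Variable names for Slack and Trigger.dev are assumed for illustration.
const REQUIRED_KEYS = [
  "ANTHROPIC_API_KEY",
  "SLACK_BOT_TOKEN", // assumed name
  "SLACK_CHANNEL_ID", // assumed name
  "TRIGGER_SECRET_KEY", // assumed name
  "AWS_REGION",
  "AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY",
] as const;

type RequiredKey = (typeof REQUIRED_KEYS)[number];

export type AppEnv = Record<RequiredKey, string> & { DYNAMODB_TABLE_NAME: string };

/** Validates required variables and applies the table-name default. */
export function loadEnv(source: Record<string, string | undefined>): AppEnv {
  // Treat unset and whitespace-only values as missing.
  const missing = REQUIRED_KEYS.filter((key) => (source[key] ?? "").trim() === "");
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  const values = {} as Record<RequiredKey, string>;
  for (const key of REQUIRED_KEYS) {
    values[key] = source[key] as string;
  }
  return { ...values, DYNAMODB_TABLE_NAME: source["DYNAMODB_TABLE_NAME"] ?? "sessions" };
}
```

Calling `loadEnv(process.env)` once at startup reports every missing variable in a single error instead of failing one service at a time.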

Step 1: Scaffold the project and configure environment variables

Start from an empty directory. You’ll use Next.js 16 with the App Router, TypeScript, and a set of vendored REAA packages for reliability automation.

Create package.json with all dependencies pinned to exact versions.

Example artifact

A complete, working implementation of this recipe, downloadable as a zip or browsable file by file. It is generated by our build pipeline and tested before publishing.

186 kB · 74 tests · 95.8% coverage · vitest passing

SHA-256: 2fbf94b068a837c7d564e6b6af9227b52b9553be59b62dfbea650b5a2bcf250e

// storage.ts: DynamoDB persistence for runbook sync results.
import { DynamoDBAdapter } from "@reaatech/session-continuity-storage-dynamodb";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient } from "@aws-sdk/lib-dynamodb";
import type { RunbookSyncResult } from "./types.js";

// Wraps any storage failure so callers can tell persistence errors apart.
export class PersistenceError extends Error {
  constructor(message: string, public readonly cause?: unknown) {
    super(message);
    this.name = "PersistenceError";
  }
}

// Builds an adapter from environment configuration on each call.
function getAdapter(): DynamoDBAdapter {
  if (!process.env.AWS_REGION) {
    throw new Error("AWS_REGION is not configured");
  }
  const ddbClient = new DynamoDBClient({ region: process.env.AWS_REGION });
  const ddbDocClient = DynamoDBDocumentClient.from(ddbClient);
  return new DynamoDBAdapter({
    client: ddbDocClient,
    tableName: process.env.DYNAMODB_TABLE_NAME ?? "sessions",
  });
}

export async function saveRunbookSession(session: RunbookSyncResult): Promise<void> {
  try {
    const adapter = getAdapter();
    const sessionRecord = {
      ...session,
      id: session.runbookId,
      metadata: {},
      status: "active" as const,
      participants: [],
      schemaVersion: 1,
      // Expire the record 24 hours from now (epoch seconds).
      ttl: Math.floor(Date.now() / 1000) + 86400,
    };
    await adapter.createSession(sessionRecord);
  } catch (err) {
    throw new PersistenceError("Failed to save runbook session", err);
  }
}

export async function getRunbookSession(id: string): Promise<RunbookSyncResult | null> {
  try {
    const adapter = getAdapter();
    const result: unknown = await adapter.getSession(id);
    if (result === undefined || result === null) return null;
    return result as RunbookSyncResult;
  } catch (err) {
    throw new PersistenceError("Failed to get runbook session", err);
  }
}

// Returns the IDs of sessions that expired before the given date.
export async function listStaleRunbooks(before: Date): Promise<string[]> {
  try {
    const adapter = getAdapter();
    const sessions: unknown = await adapter.getExpiredSessions(before);
    if (!Array.isArray(sessions)) return [];
    return (sessions as Array<Record<string, unknown>>)
      .map((s) => {
        // Prefer the logical id; fall back to the partition key.
        const id = s.id;
        const pk = s.PK;
        return typeof id === "string" ? id : typeof pk === "string" ? pk : "";
      })
      // Drop records without a usable ID instead of returning empty strings.
      .filter((id) => id !== "");
  } catch (err) {
    throw new PersistenceError("Failed to list stale runbooks", err);
  }
}
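The ttl field in saveRunbookSession is an expiry expressed in Unix epoch seconds, which is the format DynamoDB's time-to-live feature expects. As a small illustration of that arithmetic, the helpers below are hypothetical and not part of the recipe; whether the table actually has TTL enabled on this attribute is an assumption.

```typescript
const SECONDS_PER_DAY = 86400;

/** Expiry timestamp in epoch seconds, `lifetimeSeconds` after `now`. */
export function ttlEpochSeconds(now: Date, lifetimeSeconds: number = SECONDS_PER_DAY): number {
  // Date#getTime returns milliseconds; TTL attributes are whole seconds.
  return Math.floor(now.getTime() / 1000) + lifetimeSeconds;
}

/** True once the TTL has passed, even if the item has not been deleted yet. */
export function isExpired(ttl: number, now: Date): boolean {
  return ttl <= Math.floor(now.getTime() / 1000);
}
```

DynamoDB removes expired items in the background rather than at the exact expiry time, so a read-side check like isExpired is a reasonable guard when a stale record must never be treated as current.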

// Trigger.dev durable workflow: runs the full runbook sync pipeline for one repository.
import { task } from "@trigger.dev/sdk/v3";
import { generateRunbook } from "@reaatech/agent-runbook-cli";
import type { RunbookSyncRequest, RunbookSyncResult } from "./types.js";
import { analyzeServiceRepo } from "./analyzer.js";
import { generateServiceAlerts } from "./alerts.js";
import { validateChaosScenarios } from "./chaos.js";
import { generateRunbookSummary } from "./anthropic.js";
import { saveRunbookSession } from "./storage.js";
import { notifyRunbookSync } from "./slack.js";
import { logger } from "./logger.js";

export const runbookSyncTask = task({
  id: "runbook-sync",
  run: async (payload: RunbookSyncRequest): Promise<RunbookSyncResult> => {
    // Default to a local checkout path derived from the repository URL.
    const repoPath =
      payload.repoPath ?? `/tmp/repos/${payload.repoUrl.replace(/[^a-zA-Z0-9]/g, "-")}`;

    // Accumulates progress so a partially completed run can still be reported.
    // runbookId starts empty; storage uses it as the session ID, so assign one
    // before persisting if status lookups are keyed on it.
    const partialResult: RunbookSyncResult = {
      runbookId: "",
      repoUrl: payload.repoUrl,
      timestamp: new Date().toISOString(),
      alertsGenerated: 0,
      chaosScenariosValidated: 0,
      status: "queued",
    };

    try {
      // Runbook generation and repository analysis are required steps:
      // a failure here aborts the run.
      await generateRunbook({
        path: repoPath,
        output: `${repoPath}/runbook.md`,
        format: "markdown",
        provider: (payload.provider ?? "claude") as "claude",
        model: payload.model ?? "claude-sonnet-4-6",
        sections: ["alerts", "failure-modes", "rollback", "incident-response"],
      });
      const analysis = await analyzeServiceRepo(payload.repoUrl);

      // The remaining steps are optional: log the failure and keep going.
      try {
        const alerts = generateServiceAlerts(repoPath, analysis);
        partialResult.alertsGenerated = alerts.length;
      } catch (alertErr) {
        logger.error({ err: alertErr }, "Alert generation failed, continuing");
      }

      try {
        const chaosDir = `${repoPath}/scenarios`;
        const chaosResults = await validateChaosScenarios(chaosDir);
        partialResult.chaosScenariosValidated = chaosResults.length;
      } catch (chaosErr) {
        logger.error({ err: chaosErr }, "Chaos validation failed, continuing");
      }

      try {
        partialResult.summary = await generateRunbookSummary(analysis);
      } catch (summaryErr) {
        logger.error({ err: summaryErr }, "Summary generation failed, continuing");
      }

      partialResult.status = "completed";

      try {
        await saveRunbookSession(partialResult);
      } catch (persistErr) {
        logger.error({ err: persistErr }, "Persistence failed");
      }

      try {
        await notifyRunbookSync(partialResult);
      } catch {
        // non-critical
      }

      return partialResult;
    } catch (err) {
      logger.error({ err, payload }, "Runbook sync workflow failed");
      partialResult.status = "failed";

      // Best effort: record and announce the failure independently, so a
      // persistence error does not suppress the Slack notification.
      try {
        await saveRunbookSession(partialResult);
      } catch (persistErr) {
        logger.error({ err: persistErr }, "Persistence failed");
      }
      try {
        await notifyRunbookSync(partialResult);
      } catch {
        // non-critical
      }

      // Rethrow so Trigger.dev marks the run as failed.
      throw err;
    }
  },
});

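The workflow above repeats one pattern several times: run an optional step, log any failure, and continue with a default value. That pattern could be factored into a small helper, sketched below. The name bestEffort and its signature are hypothetical and do not come from the recipe or from Trigger.dev.

```typescript
type StepLogger = (message: string, err: unknown) => void;

/**
 * Runs an optional pipeline step. On failure, logs the error and resolves
 * with `fallback` so the surrounding workflow can keep going.
 */
export async function bestEffort<T>(
  label: string,
  step: () => Promise<T> | T,
  fallback: T,
  log: StepLogger,
): Promise<T> {
  try {
    // Awaiting inside the try block catches both sync throws and rejections.
    return await step();
  } catch (err) {
    log(`${label} failed, continuing`, err);
    return fallback;
  }
}
```

With a helper like this, the alert step might collapse to a single call along the lines of `bestEffort("Alert generation", () => generateServiceAlerts(repoPath, analysis).length, 0, log)`, at the cost of hiding which steps are optional behind one more layer of indirection.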