Small teams deploying AI agents (e.g., customer support bots, lead intake) face catastrophic outages like database deletions or cost spikes, but lack DevOps staff to craft and run reliable incident playbooks. Manual fixes waste hours and erode customer trust.
In this tutorial, you’ll build an AI-powered incident runbook system that monitors your services, detects failures, and generates plain-language triage reports using Amazon Bedrock — all without a dedicated DevOps team. You’ll wire up a Fastify server that ingests webhook alerts, classifies them against known failure modes, calls Bedrock to produce summaries and remediation plans, triggers automatic rollbacks for critical incidents, and sends Slack notifications. On the front end, a Next.js dashboard gives you a real-time service map and alert view, while an admin panel lets you customize runbooks in natural language. By the end, you’ll have a working reliability system you can point at your own AI agents.
Prerequisites
Node.js >= 22 (the project’s package.json specifies "node": ">=22")
pnpm 10.x (the project’s package.json specifies "packageManager": "pnpm@10.21.0")
An AWS account with Bedrock access — you’ll need the model ID for the model you’ve enabled (the default is amazon.nova-micro-v1:0)
A Slack webhook URL (optional but recommended) for incident notifications
Familiarity with TypeScript, Next.js pages routing, and Fastify
Step 1: Scaffold the project
Create a new directory and scaffold the project configuration files. These define your build, linting, and test tooling.
Step 2: Install dependencies

Run the package manager to install everything declared in package.json:
terminal
pnpm install
Expected output: pnpm resolves and installs all dependencies, then prints a summary of the number of packages added.
Step 3: Configure environment variables
Create .env.local with the credentials and settings your runbook system needs. Every variable here is referenced by the source code — leave none empty.
BEDROCK_REGION — the AWS region where Bedrock is enabled (e.g. us-east-1)
BEDROCK_MODEL_ID — the Bedrock model you’ve requested access to (the default Nova Micro model works well for incident triage)
SLACK_WEBHOOK_URL — an incoming webhook URL from your Slack workspace (the server sends critical alerts here)
FASTIFY_PORT — port for the Fastify incident API (defaults to 3001 if unset)
NEXT_PUBLIC_API_BASE — the Next.js dashboard calls this base URL to reach the Fastify server
These environment variables are loaded by the config validator you’ll build in the next step; if any required variable is missing, the server refuses to start with a clear error message.
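For reference, a filled-in .env.local might look like this (every value below is a placeholder; substitute your own):

```shell
# .env.local: placeholder values, substitute your own
BEDROCK_REGION=us-east-1
BEDROCK_MODEL_ID=amazon.nova-micro-v1:0
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T000/B000/XXXX
FASTIFY_PORT=3001
NEXT_PUBLIC_API_BASE=http://localhost:3001
```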
Step 4: Create the config loader and observability init
The config module validates environment variables at startup using Zod schemas (from @reaatech/agent-runbook). The init module bootstraps logging, tracing, and metrics.
Create the directories:
terminal
mkdir -p src/lib
Create src/lib/config.ts:
ts
import { ConfigurationError } from "@reaatech/agent-runbook";
import { z } from "zod";

const configSchema = z.object({
  region: z.string().min(1, "BEDROCK_REGION is required"),
  modelId: z.string().min(1, "BEDROCK_MODEL_ID is required"),
  slackWebhook: z.string().min(1, "SLACK_WEBHOOK_URL is required"),
  fastifyPort: z.coerce.number().int().positive().default(3001),
});

export interface AppConfig {
  region: string;
  modelId: string;
  slackWebhook: string;
  fastifyPort: number;
}

export function loadConfig(): AppConfig {
  const parsed = configSchema.safeParse({
    region: process.env.BEDROCK_REGION ?? "",
    modelId: process.env.BEDROCK_MODEL_ID ?? "",
    slackWebhook: process.env.SLACK_WEBHOOK_URL ?? "",
    fastifyPort: process.env.FASTIFY_PORT ?? "3001",
  });
  if (!parsed.success) {
    const msg = parsed.error.issues[0]?.message ?? "Invalid configuration";
    throw new ConfigurationError(msg);
  }
  return parsed.data as AppConfig;
}
These two modules are the first things loaded when the application starts — they validate your environment and set up the observability pipeline before any other code runs.
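To see the fail-fast contract in isolation, here is a dependency-free sketch of the same behavior (the helper name requireEnv is ours for illustration, not part of the package):

```typescript
// Hypothetical, dependency-free stand-in for the Zod-backed loader above:
// same fail-fast contract, shown in isolation.
function requireEnv(
  env: Record<string, string | undefined>,
  key: string,
): string {
  const value = env[key];
  if (!value) {
    // Mirrors the ConfigurationError thrown by loadConfig()
    throw new Error(`${key} is required`);
  }
  return value;
}

// A present variable passes through; a missing one throws before startup continues.
const demo: Record<string, string | undefined> = { BEDROCK_REGION: "us-east-1" };
console.log(requireEnv(demo, "BEDROCK_REGION")); // us-east-1
```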
Step 5: Create the Bedrock AI integration
This is the core of the recipe. The Bedrock module creates a client for the Bedrock Runtime, builds prompts from the incident context, and calls the Converse API (via ConverseCommand) to generate plain-language summaries and structured remediation plans.
Create src/lib/bedrock.ts:
ts
import {
  BedrockRuntimeClient,
  ConverseCommand,
} from "@aws-sdk/client-bedrock-runtime";
import { type FailureMode, LLMError } from "@reaatech/agent-runbook";
import { error } from "@reaatech/agent-runbook-observability";

export interface IncidentContext {
  serviceName: string;
  failureModes: FailureMode[];
  alertPayload: Record<string, unknown>;
  timestamp: string;
}

export function createBedrockClient(): BedrockRuntimeClient {
  const region = process.env.BEDROCK_REGION ?? "";
  return new BedrockRuntimeClient({ region });
}

function buildPrompt(context: IncidentContext): string {
  return [
    `You are an AI incident triage assistant for the service "${context.serviceName}".`,
    "",
    "The following failure modes have been identified:",
    ...context.failureModes.map((fm) => `- ${fm.name}: ${fm.description}`),
    "",
    "Alert payload:",
    JSON.stringify(context.alertPayload, null, 2),
    "",
    "Provide a concise incident summary and triage analysis.",
  ].join("\n");
}

function buildRemediationPrompt(context: IncidentContext): string {
  return [
    `You are an AI incident remediation planner for the service "${context.serviceName}".`,
    "",
    "The following failure modes have been identified:",
    ...context.failureModes.map((fm) => `- ${fm.name}: ${fm.description}`),
    "",
    "Alert payload:",
    JSON.stringify(context.alertPayload, null, 2),
    "",
    "Provide a structured remediation plan as a numbered list of steps to resolve the incident.",
    "Each step should include the action, the target system, and expected outcome.",
  ].join("\n");
}

// Sends a prompt through the Bedrock Converse API and returns the model's text.
// Any transport or model failure is wrapped in a typed LLMError.
async function converse(
  client: BedrockRuntimeClient,
  prompt: string,
): Promise<string> {
  try {
    const response = await client.send(
      new ConverseCommand({
        modelId: process.env.BEDROCK_MODEL_ID ?? "amazon.nova-micro-v1:0",
        messages: [{ role: "user", content: [{ text: prompt }] }],
      }),
    );
    return response.output?.message?.content?.[0]?.text ?? "";
  } catch (err) {
    error(`Bedrock Converse call failed: ${String(err)}`);
    throw new LLMError("Bedrock request failed");
  }
}

export async function generateIncidentSummary(
  client: BedrockRuntimeClient,
  context: IncidentContext,
): Promise<string> {
  return converse(client, buildPrompt(context));
}

export async function proposeRemediation(
  client: BedrockRuntimeClient,
  context: IncidentContext,
): Promise<string> {
  return converse(client, buildRemediationPrompt(context));
}
The module exports two main functions: generateIncidentSummary asks Bedrock to describe what went wrong, and proposeRemediation asks for a step-by-step fix plan. Both wrap the Bedrock Converse API call in proper error handling — if Bedrock is unreachable, they throw a typed LLMError that the server catches downstream.
Step 6: Create incident classification
When an alert arrives, the system needs to classify it against known failure modes. This module extracts a keyword from the alert payload, looks it up in the failure mode catalog (provided by @reaatech/agent-runbook-failure-modes), and returns a severity level. Unrecognized alerts fall back to the “application” category at warning severity.
Create src/lib/classify.ts:
ts
import {
  type AlertSeverity,
  type FailureMode,
  ValidationError,
} from "@reaatech/agent-runbook";
import {
  findFailureMode,
  getFailureModesByCategory,
} from "@reaatech/agent-runbook-failure-modes";

export interface ClassificationResult {
  matchedMode: FailureMode | undefined;
  severity: AlertSeverity | undefined;
}

export function classifyIncident(
  alertPayload: Record<string, unknown>,
): ClassificationResult {
  if (
    typeof alertPayload !== "object" ||
    Object.keys(alertPayload).length === 0
  ) {
    throw new ValidationError("Alert payload is empty or invalid");
  }
  const keyword = extractKeyword(alertPayload);
  const matchedMode = findFailureMode(keyword);
  if (matchedMode) {
    const severity = mapFailureSeverityToAlertSeverity(matchedMode.severity);
    return { matchedMode, severity };
  }
  const fallbackModes = getFailureModesByCategory("application");
  const firstFallback = fallbackModes[0];
  return {
    matchedMode: firstFallback,
    severity: "warning",
  };
}

function extractKeyword(payload: Record<string, unknown>): string {
  const possibleFields = ["type", "error", "name", "kind", "failure", "mode"];
  for (const field of possibleFields) {
    const value = payload[field];
    if (typeof value === "string") {
      return value;
    }
  }
  return "unknown";
}

function mapFailureSeverityToAlertSeverity(
  failureSeverity: string | undefined,
): AlertSeverity | undefined {
  if (failureSeverity === "critical" || failureSeverity === "high") {
    return "critical";
  }
  if (failureSeverity === "medium") {
    return "warning";
  }
  return "info";
}
The severity mapping drives the rest of the pipeline: critical incidents trigger automatic rollback procedures in the next step.
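To make the mapping concrete, here is a self-contained walk-through of the two pure helpers from classify.ts applied to a sample alert (the payload values are invented for illustration):

```typescript
// Copies of the two pure helpers from classify.ts, applied to a sample alert.
type AlertSeverity = "critical" | "warning" | "info";

function extractKeyword(payload: Record<string, unknown>): string {
  const possibleFields = ["type", "error", "name", "kind", "failure", "mode"];
  for (const field of possibleFields) {
    const value = payload[field];
    if (typeof value === "string") return value;
  }
  return "unknown";
}

function mapFailureSeverityToAlertSeverity(
  failureSeverity: string | undefined,
): AlertSeverity | undefined {
  if (failureSeverity === "critical" || failureSeverity === "high") return "critical";
  if (failureSeverity === "medium") return "warning";
  return "info";
}

// "type" wins because it is checked first; a "high" failure severity
// maps to a "critical" alert, which later triggers automatic rollback.
const payload = { type: "database_deletion", error: "FATAL: dropped table" };
console.log(extractKeyword(payload));                   // database_deletion
console.log(mapFailureSeverityToAlertSeverity("high")); // critical
```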
Step 7: Create the Fastify server
The Fastify server is the operational backbone. It exposes three endpoints — /health for uptime monitoring, /alerts for generated alert definitions, and POST /api/incidents for processing incoming alerts. It bootstraps observability, registers the incidents plugin, and handles graceful shutdown.
The server is structured as a buildServer factory so tests can create isolated instances. The isMainModule guard prevents the server from auto-starting when imported by tests or the Next.js build.
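The guard itself can be as small as a URL comparison. A sketch of the idea (the helper name and exact comparison are assumptions about how server.ts is written):

```typescript
import { pathToFileURL } from "node:url";

// Returns true only when the given module URL is the process entry point,
// so `node dist/server.js` starts the server but importing the module
// (from tests or the Next.js build) does not.
function isMainModule(
  metaUrl: string,
  entryPath: string | undefined,
): boolean {
  if (!entryPath) return false;
  return metaUrl === pathToFileURL(entryPath).href;
}

// In server.ts you would call: isMainModule(import.meta.url, process.argv[1])
console.log(isMainModule("file:///tmp/server.js", "/tmp/server.js")); // true
```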
Step 8: Create the incidents API plugin
This is the heart of the runbook engine. When POST /api/incidents receives a webhook payload, it validates the body with Zod, classifies the incident, calls Bedrock for a summary and remediation plan, generates automatic rollback procedures for critical incidents, and optionally sends a Slack alert.
Create the API directory:
terminal
mkdir -p src/api/incidents
Create src/api/incidents/route.ts:
ts
import {
  type FastifyInstance,
  type FastifyRequest,
  type FastifyReply,
} from "fastify";
import { IncomingWebhook } from "@slack/webhook";
import {
  type AnalysisContext,
  generateId,
  LLMError,
  type FailureMode,
  ValidationError,
} from "@reaatech/agent-runbook";
import {
  generateRollbackProcedures,
  generateVerificationChecklist,
} from "@reaatech/agent-runbook-rollback";
import {
  recordGeneration,
  recordAgentCall,
  startGenerationSpan,
  endSpanSuccess,
  endSpanError,
  info,
} from "@reaatech/agent-runbook-observability";
The handler follows a clear pipeline: validate, classify, summarize with Bedrock, propose remediation, optionally rollback, optionally notify Slack, and return a structured response. The observability spans and record calls feed metrics so you can track every incident through the pipeline.
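Stripped of its integrations, the handler's control flow can be sketched as a single function with injected dependencies. Everything below is an illustrative stand-in: the names, the response shape, and the stubs where the real route calls Bedrock, the rollback package, and Slack:

```typescript
// Illustrative skeleton of the POST /api/incidents control flow.
// Stub dependencies stand in for Bedrock, rollback generation, and Slack.
type Severity = "critical" | "warning" | "info";

interface IncidentResponse {
  incidentId: string;
  status: "processed";
  summary: string;
  remediation: string;
  rollback: string | null;
}

async function processIncident(
  payload: Record<string, unknown>,
  deps: {
    classify: (p: Record<string, unknown>) => Severity;
    summarize: (p: Record<string, unknown>) => Promise<string>;
    remediate: (p: Record<string, unknown>) => Promise<string>;
    rollback: () => string;
    notify: (summary: string) => Promise<void>;
  },
): Promise<IncidentResponse> {
  if (Object.keys(payload).length === 0) throw new Error("empty payload"); // validate
  const severity = deps.classify(payload);                                 // classify
  const summary = await deps.summarize(payload);                           // Bedrock summary
  const remediation = await deps.remediate(payload);                       // Bedrock plan
  const rollback = severity === "critical" ? deps.rollback() : null;       // auto-rollback
  if (severity === "critical") await deps.notify(summary);                 // Slack alert
  return { incidentId: "inc-1", status: "processed", summary, remediation, rollback };
}

// Example run with stub dependencies: a critical incident gets a rollback plan.
const result = await processIncident(
  { type: "database_deletion" },
  {
    classify: () => "critical",
    summarize: async () => "Database deleted",
    remediate: async () => "1. Restore from snapshot",
    rollback: () => "restore-latest-snapshot",
    notify: async () => {},
  },
);
console.log(result.rollback); // restore-latest-snapshot
```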
Step 9: Create the Next.js admin UI and dashboard
The Next.js front end uses the pages router with three pages: _app (the root wrapper), a dashboard for service health and alert monitoring, and an admin panel for natural-language runbook customization.
The dashboard shows four states — loading, error, empty, and loaded — so the UI never leaves the user guessing. The health bar at the top turns green when the Fastify server responds with "ok" and red otherwise.
Create src/pages/admin.tsx. This page accepts natural-language prompts to customize your runbooks, sending them to Bedrock via the Next.js API route you’ll build next:
Step 10: Create the Next.js API routes

Two API routes power the dashboard and admin panel: /api/service-map generates a Mermaid-format dependency graph, and /api/admin/interpret sends custom prompts to Bedrock.
The interpret route reuses generateIncidentSummary from the Bedrock module — the same function that powers incident triage also interprets admin customization prompts. It validates the input with Zod, calls Bedrock, and returns the result.
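The validation half of that route is simple enough to sketch without Zod (the body field name prompt is an assumption about the route's request shape):

```typescript
// Dependency-free stand-in for the interpret route's Zod validation:
// reject non-object bodies and empty prompts before calling Bedrock.
function parseInterpretBody(body: unknown): { prompt: string } {
  if (typeof body !== "object" || body === null) {
    throw new Error("Request body must be a JSON object");
  }
  const prompt = (body as Record<string, unknown>).prompt;
  if (typeof prompt !== "string" || prompt.trim() === "") {
    throw new Error("prompt is required");
  }
  return { prompt };
}

console.log(parseInterpretBody({ prompt: "Escalate all database alerts" }).prompt);
```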
Step 11: Create the entry point and test setup
The entry point bootstraps observability and validates configuration at startup. The test setup imports the jest-dom matchers for DOM-based tests.
Create the entry point src/index.ts:
ts
import { initRunbookSystem } from "./lib/init.js";
import { loadConfig } from "./lib/config.js";

// Bootstrap observability before any other module that may log
await initRunbookSystem();

// Validate configuration at startup
loadConfig();
Additional test files — for the server, the incidents API endpoint, classification, configuration, and the admin panel — are included in the full downloadable artifact. The test suite covers every route, the Bedrock integration (through mocks), classification logic, and server lifecycle.
Step 12: Build, test, and run
First, compile the TypeScript to JavaScript:
terminal
pnpm build
Expected output: TypeScript compiles to dist/ with no errors. If you see type errors, double-check that all source files match exactly and that pnpm install completed without issues.
Run the full test suite with coverage:
terminal
pnpm test
Expected output: vitest runs all tests, each marked with a green checkmark. The coverage report prints at the end showing >= 90% across lines, branches, functions, and statements. The command also writes vitest-report.json and coverage/ with machine-readable results.
Start the Fastify incident API server:
terminal
pnpm start
Expected output: the terminal prints a log line indicating the server started on port 3001 (or your configured FASTIFY_PORT).
In a second terminal, start the Next.js admin dashboard:
terminal
pnpm dev
Expected output: Next.js starts on http://localhost:3000. Open that URL in your browser to see the dashboard. Visit http://localhost:3000/admin to try natural-language runbook customization.
To send a test incident, use curl against the Fastify server:
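For example (the payload fields here are illustrative; any JSON body with a recognizable type field will do):

```shell
curl -X POST http://localhost:3001/api/incidents \
  -H "Content-Type: application/json" \
  -d '{"type": "database_deletion", "service": "checkout-bot"}'
```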
Expected output: a JSON response containing incidentId, status: "processed", a Bedrock-generated summary, a structured remediation plan, and rollback / postRollbackChecklist fields (non-null when the incident is critical). Check your Slack channel — if you configured the webhook URL, you’ll see a notification arrive.
Next steps

Customize the failure mode catalog. The @reaatech/agent-runbook-failure-modes package includes common patterns, but you should register your own by adding entries that match the alert types your AI agents actually produce — prompt injection, runaway spend, tool misuse, and model latency are good starting points.
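The exact registration API depends on the package version, but an entry generally needs at least a name, a description, and a severity, since those are the fields the prompt builder and the severity mapping consume. A hypothetical entry:

```typescript
// Hypothetical failure-mode entry. The field names mirror what this tutorial's
// code reads: name and description feed the Bedrock prompts, and severity
// drives the alert-severity mapping ("high" maps to a "critical" alert).
interface FailureModeEntry {
  name: string;
  description: string;
  severity: "critical" | "high" | "medium" | "low";
}

const runawaySpend: FailureModeEntry = {
  name: "runaway-spend",
  description: "Agent token or API spend exceeds its per-hour budget",
  severity: "high",
};

console.log(runawaySpend.name); // runaway-spend
```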
Deploy to production. The server includes graceful shutdown and SIGTERM handling, so it’s ready for a platform like ECS, Kubernetes, or a simple systemd service. Point your agent monitoring tools at the /api/incidents webhook endpoint.
Extend the dashboard. The current dashboard renders Mermaid markup as plain text — integrate mermaid.js in the browser for a live-rendered service map, or add a history view that shows past incidents, their classifications, and remediation outcomes over time.