Small field service businesses lose revenue when their AI dispatch agents fail during after-hours or peak times. Manual recovery is slow and requires operations staff that small teams can’t afford.
A complete, working implementation of this recipe is available as a downloadable zip or browsable file by file. It is generated by our build pipeline and tested with full coverage before publishing.
This recipe builds an AI dispatch failure detection and remediation pipeline. You’ll wire up six REAA agent-runbook packages, create a Next.js API route that receives webhook failure events, classify incidents, generate runbooks, and alert Slack only when automatic recovery fails. By the end, you’ll have a runbook engine that autonomously triages and remediates field service failures.
Prerequisites
Node.js 22+ and pnpm 10 (npm install -g pnpm@10)
An Anthropic API key (free tier at console.anthropic.com)
A Slack bot token and channel ID (create an app at api.slack.com/apps, add chat:write scope, install to workspace)
A Langfuse account for observability (free tier at langfuse.com — get public/secret keys)
Familiarity with Next.js App Router route handlers, TypeScript, and vitest
Step 1: Scaffold the project and install dependencies
Create the project directory and scaffold a Next.js project:
Expected output: A node_modules/ directory with all third-party packages, a pnpm-lock.yaml, and no errors.
Step 2: Configure your environment
Create a .env.example (safe to commit) with every variable your services need. Copy it to .env.local (never commit) and fill in real values:
env
# Env vars used by anthropic-ai-runbook-automation-for-smb-field-service-dispatching.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only; never commit real values.
NODE_ENV=development
ANTHROPIC_API_KEY=<your-anthropic-api-key>
SLACK_TOKEN=<your-slack-bot-token>
SLACK_CHANNEL_ID=<your-slack-channel-id>
TRIGGER_API_KEY=<your-trigger-dev-api-key>
TRIGGER_PROJECT_ID=<your-trigger-dev-project-id>
TRIGGER_ENVIRONMENT=<production|staging>
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_HOST=<your-langfuse-host>
Expected output: Ten environment variables (NODE_ENV plus nine placeholders) ready for real values. ANTHROPIC_API_KEY is mandatory: the webhook handler returns a 500 immediately if it is missing.
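The fail-fast check described above can be sketched as a small helper. This is a minimal sketch, not the recipe's actual handler code: findMissingEnv and REQUIRED_ENV are hypothetical names, and treating only ANTHROPIC_API_KEY as mandatory is an assumption taken from the sentence above.

```typescript
// Hypothetical helper: list required variables that are absent or blank.
type Env = Record<string, string | undefined>;

// Assumption: only ANTHROPIC_API_KEY is mandatory, per the text above.
const REQUIRED_ENV: readonly string[] = ["ANTHROPIC_API_KEY"];

export function findMissingEnv(
  env: Env,
  required: readonly string[] = REQUIRED_ENV,
): string[] {
  return required.filter((name) => {
    const value = env[name];
    // Treat undefined and whitespace-only values as missing.
    return value === undefined || value.trim() === "";
  });
}

// Usage inside a route handler (sketch only):
//   const missing = findMissingEnv(process.env);
//   if (missing.length > 0) {
//     return NextResponse.json({ error: `Missing: ${missing.join(", ")}` }, { status: 500 });
//   }
```

Passing the environment in as an argument keeps the check easy to unit test without mutating process.env.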
Step 3: Set up the observability layer
Create src/lib/default-context.ts — a shared analysis context that every REAA package consumes:
Now create src/lib/observability.ts — wraps the @reaatech/agent-runbook-observability package and re-exports loggers, span tracking, and cost recording:
Expected output: Two files (default-context.ts, observability.ts) in src/lib/. initializeObservability() reads LANGFUSE_HOST from the environment at runtime to configure OpenTelemetry export.
Step 4: Build the health check service
Create src/lib/health-check.ts — probes agent endpoints over HTTP and generates Kubernetes probe YAML from the REAA health checks package:
Expected output: src/lib/health-check.ts. probeAgentEndpoints uses fetch with a 5-second abort timeout per endpoint and reports alive: false for any HTTP or network failure, including DNS errors, timeouts, and non-OK status codes.
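The probing behavior described above might look roughly like the following. This is a sketch under stated assumptions, not the recipe's health-check.ts: probeEndpoint, probeAll, ProbeResult, and FetchLike are hypothetical names, and the fetch implementation is injectable so the logic can be tested without a network.

```typescript
// Hypothetical result shape for one probed endpoint.
export interface ProbeResult {
  url: string;
  alive: boolean;
  status?: number;
  error?: string;
}

// Minimal fetch-compatible signature so tests can inject a fake.
type FetchLike = (
  url: string,
  init?: { signal?: AbortSignal },
) => Promise<{ ok: boolean; status: number }>;

export async function probeEndpoint(
  url: string,
  fetchImpl: FetchLike = fetch,
  timeoutMs = 5000,
): Promise<ProbeResult> {
  const controller = new AbortController();
  // Abort the request if it has not settled within the timeout.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, { signal: controller.signal });
    // A non-OK status code counts as "not alive".
    return { url, alive: res.ok, status: res.status };
  } catch (err) {
    // DNS failures, refused connections, and aborts all land here.
    return { url, alive: false, error: err instanceof Error ? err.message : String(err) };
  } finally {
    clearTimeout(timer);
  }
}

// Probe every endpoint concurrently; probeEndpoint never rejects.
export function probeAll(urls: string[], fetchImpl: FetchLike = fetch): Promise<ProbeResult[]> {
  return Promise.all(urls.map((url) => probeEndpoint(url, fetchImpl)));
}
```

Because every failure is converted into an alive: false result, one unreachable agent does not prevent the others from being reported.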
Step 5: Create the runbook engine
This is the heart of the recipe. src/lib/runbook-engine.ts orchestrates the full pipeline: failure classification, mitigation generation, runbook assembly, and export.
ts
import { createAnalysisAgent } from "@reaatech/agent-runbook-agent";
import { generateRunbookArtifacts, exportRunbook, validateCompleteness } from "@reaatech/agent-runbook-runbook";
import { identifyFailureModes, generateMitigations } from "@reaatech/agent-runbook-failure-modes";
import { generateAlerts } from "@reaatech/agent-runbook-alerts";
import Anthropic from "@anthropic-ai/sdk";
import { createGenerationSpan, info, trackRunbookGeneration, trackAgentCall } from "./observability.js";
import { defaultContext } from "./default-context.js";

export interface RunbookEngineConfig {
  anthropicApiKey: string;
  repoPath: string;
  triggerApiKey?: string;
Expected output: src/lib/runbook-engine.ts. The engine wraps every execution in a generation span, classifies failure modes, generates mitigations, assembles a five-section runbook, validates its completeness, and flags the failure path for human escalation. The createAnthropicClient export provides a standalone Anthropic SDK client.
Step 6: Wire up Slack notifications
Create src/lib/notify.ts — posts formatted alerts to a Slack channel when automatic recovery fails:
ts
import { WebClient, ErrorCode } from "@slack/web-api";
import { error } from "./observability.js";

export interface SlackAlertMessage {
  title: string;
  description: string;
  severity: "info" | "warning" | "critical";
  runbookId?: string;
  markdown?: string;
}

export function createSlackNotifier(token: string, channelId: string) {
  const web = new WebClient(token);
  return {
    async sendAlert(message: SlackAlertMessage): Promise<void> {
      try {
        // Build the alert text: severity tag, title, description, optional runbook ID.
        let text = `*[${message.severity}]* ${message.title}\n\n${message.description}`;
        if (message.runbookId) {
          text += `\n\nRunbook ID: ${message.runbookId}`;
        }
        await web.chat.postMessage({ text, channel: channelId });
      } catch (err) {
        // Log and swallow every error so a Slack outage never takes the service down.
        if (err && typeof err === "object" && "code" in err && (err as Record<string, unknown>).code === ErrorCode.PlatformError) {
          error("Slack platform error", { data: (err as Record<string, unknown>).data });
        } else {
          error("Slack notification error", { message: err instanceof Error ? err.message : String(err) });
        }
      }
    },
  };
}
Expected output: src/lib/notify.ts. A message renders as, for example, *[critical]* Agent down\n\nRunbook failed for agent dispatch-01, followed by an optional Runbook ID: trailer. Slack platform errors (wrong channel, insufficient scopes) are logged but never rethrown, so the service stays up.
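To check the message format without calling Slack, the template used inside sendAlert can be lifted into a pure function. This is a sketch only: formatAlertText and AlertText are hypothetical names that are not part of the recipe's exports, and the runbook ID rb-123 is an invented sample value.

```typescript
// Hypothetical pure helper that mirrors the template used inside sendAlert.
interface AlertText {
  title: string;
  description: string;
  severity: "info" | "warning" | "critical";
  runbookId?: string;
}

export function formatAlertText(message: AlertText): string {
  let text = `*[${message.severity}]* ${message.title}\n\n${message.description}`;
  if (message.runbookId) {
    // The trailer is appended only when a runbook ID is present.
    text += `\n\nRunbook ID: ${message.runbookId}`;
  }
  return text;
}

const example = formatAlertText({
  severity: "critical",
  title: "Agent down",
  description: "Runbook failed for agent dispatch-01",
  runbookId: "rb-123", // invented sample value
});
// example is:
// "*[critical]* Agent down\n\nRunbook failed for agent dispatch-01\n\nRunbook ID: rb-123"
```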
Step 7: Create the webhook trigger route
Create app/api/runbooks/trigger/route.ts, the App Router route handler that receives failure events (the App Router only recognizes handlers in files named route.ts):
Expected output: next.config.ts with instrumentationHook: true. This flag is required: without it, the register() function in the next step never fires.
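A minimal sketch of that setting follows, assuming a Next.js release in which the instrumentation hook is still opt-in under the experimental key. Newer releases treat instrumentation as stable, in which case the flag may be unnecessary. The recipe's actual next.config.ts is not reproduced here, so treat this as illustrative.

```typescript
// next.config.ts (sketch)
// Assumption: the installed Next.js version still expects the hook to be
// enabled under `experimental`.
const nextConfig = {
  experimental: {
    instrumentationHook: true,
  },
};

export default nextConfig;
```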
Step 8: Set up instrumentation for startup init
Create src/instrumentation.ts — Next.js calls register() once at startup, giving you a hook to initialize observability before any request arrives:
ts
export async function register() {
  if (process.env.NEXT_RUNTIME === "nodejs") {
    const { initializeObservability } = await import("./lib/observability.js");
    await initializeObservability();
  }
}
Expected output: src/instrumentation.ts. The NEXT_RUNTIME guard ensures the code only runs in the Node.js server, not the Edge runtime. The dynamic import() keeps Node-only packages (like OpenTelemetry) out of the Edge bundle.
Step 9: Export the public API and create a landing page
Replace the placeholder src/index.ts with real re-exports so consumers can import everything from a single entry point:
ts
export { createRunbookEngine } from "./lib/runbook-engine.js";
export { createSlackNotifier } from "./lib/notify.js";
export { probeAgentEndpoints, generateHealthProbes, getExistingHealthChecks } from "./lib/health-check.js";
export { initializeObservability, createGenerationSpan, trackRunbookGeneration, trackAgentCall, trackAgentCost } from "./lib/observability.js";

export type { RunbookEngineConfig, FailureContext, RunbookResult, RunbookStep } from "./lib/runbook-engine.js";
export type { SlackAlertMessage } from "./lib/notify.js";
export type { HealthCheckResult, HealthCheckConfig } from "./lib/health-check.js";
Now update app/page.tsx with a minimal landing page:
tsx
export default function Home() {
  return (
    <main style={{ maxWidth: 720, margin: "0 auto", padding: "2rem 1rem", fontFamily: "system-ui, sans-serif", lineHeight: 1.6 }}>
      <h1>Anthropic AI Runbook Automation for SMB Field Service Dispatching</h1>
      <p style={{ fontSize: "1.1rem", color: "#555" }}>
        Automatically triage and remediate AI dispatch agent failures using Claude-powered runbooks.
      </p>
      <hr style={{ margin: "1.5rem 0" }} />
      <h2>API Endpoint</h2>
      <p>
        <code>POST /api/runbooks/trigger</code>: Submit a failure event and receive an auto-generated runbook.
      </p>
      <h2>Architecture</h2>
      <p>
        Trigger.dev webhook → <code>POST /api/runbooks/trigger</code> → health check probes → failure classification → runbook generation → Claude-powered execution → Slack alert on escalation.
      </p>
    </main>
  );
}
Expected output: Every module re-exported from src/index.ts and a readable landing page at the root URL.
Expected output: One test file covering initialization order, span success/error paths (including non-Error throws, re-throws, and undefined returns), and every tracking method. Run pnpm vitest run --coverage to verify 90%+ coverage before moving on.
Create the remaining test files for the health check service, runbook engine, Slack notifier, trigger route, public entry, instrumentation, and an integration test. Here’s a summary of every test file you need:
Trigger route tests: Valid POST, missing fields, empty body, handler throw, requiresHuman alert, GET health
tests/index.test.ts: Every export is a function
tests/integration.test.ts: End-to-end MSW-mocked Anthropic flow, 500 from API, logger verification
tests/instrumentation.test.ts: register() calls initializeObservability
Run the full suite:
terminal
pnpm vitest run --coverage --reporter=json --outputFile=vitest-report.json
Expected output: All tests pass (numFailedTests: 0), total tests >= 50, and coverage thresholds >= 90% across lines, branches, functions, and statements.
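The test-count and failure checks above can be automated against the JSON report. The sketch below assumes the report exposes numTotalTests alongside the numFailedTests field mentioned above; checkReport and VitestJsonReport are hypothetical names. Coverage thresholds are not read from this report here and would still need to be enforced by the coverage run itself.

```typescript
// Subset of the JSON report fields this sketch relies on (assumed shape).
interface VitestJsonReport {
  numTotalTests: number;
  numFailedTests: number;
}

// Returns a list of human-readable problems; an empty list means the gate passes.
export function checkReport(report: VitestJsonReport, minTests = 50): string[] {
  const problems: string[] = [];
  if (report.numFailedTests !== 0) {
    problems.push(`${report.numFailedTests} test(s) failed`);
  }
  if (report.numTotalTests < minTests) {
    problems.push(`only ${report.numTotalTests} tests, expected at least ${minTests}`);
  }
  return problems;
}

// Usage (sketch only):
//   import { readFileSync } from "node:fs";
//   const report = JSON.parse(readFileSync("vitest-report.json", "utf8"));
//   const problems = checkReport(report);
//   if (problems.length > 0) { console.error(problems.join("\n")); process.exit(1); }
```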
Step 11: Type-check, lint, and preflight
Run the final verification commands:
terminal
pnpm typecheck
pnpm lint
pnpm exec vitest run --coverage --reporter=json --outputFile=vitest-report.json
Expected output: TypeScript compiles with zero errors. ESLint passes. Vitest reports every test green with coverage >= 90% on all four metrics.
Expected output: Both checks print PASS. The instrumentation hook flag is set correctly, and every route handler uses NextRequest / NextResponse.json().
If the runbook pipeline fails (missing API key, network error), the response returns success: false, requiresHuman: true, and a critical Slack alert is dispatched to your configured channel.
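The escalation path described above can be sketched as a wrapper around the pipeline call. This is illustrative rather than the recipe's route code: runWithEscalation and the response shapes are hypothetical, the pipeline and notifier are passed in as plain functions, and returning requiresHuman: false on success is an assumption.

```typescript
// Hypothetical response shapes; only success and requiresHuman come from the text above.
interface SuccessResponse {
  success: true;
  requiresHuman: false; // assumption: a successful run needs no human
  runbookId: string;
}
interface FailureResponse {
  success: false;
  requiresHuman: true;
  error: string;
}
interface CriticalAlert {
  title: string;
  description: string;
  severity: "critical";
}

export async function runWithEscalation(
  runPipeline: () => Promise<{ runbookId: string }>,
  sendAlert: (alert: CriticalAlert) => Promise<void>,
): Promise<SuccessResponse | FailureResponse> {
  try {
    const { runbookId } = await runPipeline();
    return { success: true, requiresHuman: false, runbookId };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    // Alert only on the failure path, so a healthy run stays silent.
    await sendAlert({ title: "Runbook pipeline failed", description: reason, severity: "critical" });
    return { success: false, requiresHuman: true, error: reason };
  }
}
```

Injecting the pipeline and notifier keeps the escalation decision testable without real Anthropic or Slack calls.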
Next steps
Add PagerDuty escalation — extend the sendAlert notifier to call PagerDuty’s Events API when severity is critical, so on-call engineers are paged directly.
Store runbook results in a database — persist each RunbookResult to SQLite (via better-sqlite3) so you can review failure history, track SLO attainment, and build a dashboard.
Add a Trigger.dev job — replace the manual curl with a Trigger.dev cron job that regularly probes all registered agent endpoints and calls POST /api/runbooks/trigger automatically on failure.
Add a retry step — modify handleFailure to retry the runbook once on transient errors before escalating to human, improving the autonomous recovery rate.
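The retry step in the last item could be sketched as a generic wrapper that handleFailure might call before escalating. Everything here is an assumption for illustration: retryOnce and looksTransient are hypothetical names, and the list of error codes treated as transient is only a starting point.

```typescript
// Run the operation, and retry exactly once if the first failure looks transient.
export async function retryOnce<T>(
  operation: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
): Promise<T> {
  try {
    return await operation();
  } catch (err) {
    // Permanent errors propagate immediately so escalation is not delayed.
    if (!isTransient(err)) throw err;
    // Second and final attempt; a failure here propagates to the caller.
    return operation();
  }
}

// Assumption: errors carrying these Node.js network codes are worth one retry.
const TRANSIENT_CODES = ["ETIMEDOUT", "ECONNRESET", "EAI_AGAIN"];

export function looksTransient(err: unknown): boolean {
  if (typeof err !== "object" || err === null || !("code" in err)) return false;
  return TRANSIENT_CODES.includes(String((err as { code: unknown }).code));
}
```

Limiting the wrapper to a single retry keeps the worst-case delay before a human is alerted to roughly two pipeline runs.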