Databricks AI Runbook Automation for SMB Data Pipelines
Auto-generate runbooks and automate incident recovery for Databricks data pipelines so small teams can resolve failures without a dedicated DevOps hire.
Small businesses running ETL jobs on Databricks lack on-call expertise; a failed pipeline stalls reporting and can stay broken for hours because no one knows the recovery steps.
A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.
This tutorial walks you through building a Databricks AI Runbook Automation system that auto-generates runbooks from Databricks job definitions, triggers incident playbooks on failure webhooks, and uses circuit breakers to isolate misbehaving pipelines. By the end, you’ll have a Next.js application with eight API endpoints, a dashboard, persistent circuit breaker state in Redis, and Trigger.dev workflows for durable recovery orchestration.
This is for small teams running ETL on Databricks who need automated incident recovery without a dedicated DevOps engineer.
Prerequisites
Node.js >= 22 and pnpm >= 10 installed
A Databricks SQL warehouse with a SQL endpoint (host, path, token)
A Redis instance (local or remote)
An OpenAI API key (for the analysis agent)
A Langfuse account (for LLM tracing)
A Trigger.dev account and API key (for durable workflows)
Familiarity with TypeScript, Next.js App Router, and basic Databricks concepts
Step 1: Scaffold the project
Create a new Next.js project with the App Router and install all dependencies.
Fill in the real values for your Databricks warehouse, Redis, OpenAI, Langfuse, and Trigger.dev accounts.
Expected output: A .env file with your real credentials ready for the config validator to parse.
Step 3: Add config validation with Zod
Create src/config.ts to parse and validate all environment variables at startup using Zod. This catches misconfiguration early with clear error messages.
Expected output: On missing env vars, the module throws a ValidationError listing every missing field. On success, config is a typed object with all values.
Step 4: Set up structured logging
Create src/logger.ts using pino, a high-performance structured JSON logger.
Expected output:logger.info("started") produces a JSON line with timestamp, level, and the logger name.
Step 5: Define Databricks schemas with Zod
Create src/types/databricks.ts with Zod schemas for the three Databricks data shapes you’ll handle: job definitions, run records, and failure webhooks.
Create src/types/runbook.ts to extend the REAA @reaatech/agent-runbook types with Databricks-specific fields:
ts
import { type Runbook, type RunbookSection, type ServiceDefinition, type FailureMode, type IncidentWorkflow, type EscalationPolicy, type CommunicationTemplate, type AnalysisContext, type AlertDefinition,} from "@reaatech/agent-runbook";export interface DatabricksRunbook extends Runbook { jobId: string; notebooks: string[]; scheduleCron: string | null;}export type { Runbook, RunbookSection, ServiceDefinition, FailureMode, IncidentWorkflow, EscalationPolicy, CommunicationTemplate, AnalysisContext, AlertDefinition,};
Create src/types/index.ts to re-export everything:
ts
export * from "./databricks.js";export * from "./runbook.js";
Expected output: Typed schemas for Databricks jobs, runs, and webhooks, plus runbook types re-exported from a single barrel file.
Step 6: Build the Databricks job collector
Create src/runbooks/databricks-collector.ts. This is the core data-fetching module — it connects to a Databricks SQL warehouse, queries job definitions, and transforms them into runbook pages. The Databricks connection is wrapped in p-retry for resilience against transient failures.
ts
import { DBSQLClient } from "@databricks/sql";import { type Runbook, type RunbookSection, type FailureMode, generateId, type ProgrammingLanguage, type Framework } from "@reaatech/agent-runbook";import { scanRepository, analyzeCode, mapDependencies, parseConfigs } from "@reaatech/agent-runbook-analyzer";import pRetry from "p-retry";import { config } from "../config.js";export async function collectJobs(connection?: { host: string; path: string; token: string }): Promise<Runbook[]> { const dbHost
Expected output: A module that connects to Databricks SQL, fetches job metadata, and returns typed Runbook and FailureMode arrays.
Step 7: Build the analysis agent
Create src/lib/analysis-agent.ts wrapping the @reaatech/agent-runbook-agent LLM analysis agent. It uses OpenAI’s GPT-4.1 Mini to identify failure modes and generate runbook sections.
ts
import { type AnalysisContext } from "@reaatech/agent-runbook";import { createAnalysisAgent, type AnalysisAgent, type AgentConfig, generatePrompt } from "@reaatech/agent-runbook-agent";import { config } from "../config.js";const agentConfig: AgentConfig = { provider: "openai", model: "gpt-4.1-mini", apiKey: config.OPENAI_API_KEY, temperature: 0.3,};const agent: AnalysisAgent = createAnalysisAgent(agentConfig);export async function identifyFailures(context: AnalysisContext): Promise<string[]> { generatePrompt("failure-mode-identification", {}); return agent.identifyFailureModes(context);}export async function generateSection(sectionType: string, context: AnalysisContext): Promise<string> { return agent.generateRunbookSection(sectionType as "alerts" | "dashboards" | "failure-modes" | "rollback" | "incident-response" | "health-checks", context);}
Expected output: An agent that uses OpenAI to analyze infrastructure context and return structured failure insights.
Step 8: Build the alert generator
Create src/lib/alert-generator.ts. It uses @reaatech/agent-runbook-alerts to extract existing alerts, generate new SLO-based alerts, and format them for Prometheus.
ts
import { type AnalysisContext, type AlertDefinition, ValidationError } from "@reaatech/agent-runbook";import { extractAlerts, generateAlerts, generateDefaultAlerts, formatAlertsForPlatform, calculateSloThresholds } from "@reaatech/agent-runbook-alerts";export async function generatePipelineAlerts(context: AnalysisContext): Promise<AlertDefinition[]> { const ctx: Record<string, unknown> = JSON.parse(JSON.stringify(context)) as Record<string, unknown>; const serviceDef = ctx["serviceDefinition"] as Record<string, unknown> | undefined; const serviceName = serviceDef?.["name"] as string | undefined ?? ctx["serviceName"] as string | undefined; if (!serviceName) { throw new ValidationError("Missing serviceName in analysis context"); } const monitoringConfig = ctx["monitoringConfig"]; if (!monitoringConfig) { return await Promise.resolve(generateDefaultAlerts(serviceName, true, false, true)); } extractAlerts("/path/to/repo"); const alerts = await Promise.resolve(generateAlerts(context, { sloTargets: { availability: 99.5, latencyP99: 1000 }, platform: "prometheus", })); return alerts;}export async function formatAlerts(alertDefs: AlertDefinition[]): Promise<string> { return await Promise.resolve(formatAlertsForPlatform(alertDefs, "prometheus"));}export function getSloRecommendations(availability: number, latencyP99: number) { return calculateSloThresholds({ availability, latencyP99 });}
Expected output: Functions that generate Prometheus-compatible alert definitions from an analysis context.
Step 9: Build the incident handler
Create src/api/incidents/handler.ts. This receives a Databricks failure webhook payload, validates it with Zod, then uses @reaatech/agent-runbook-incident to generate incident workflows, escalation policies, and communication templates.
ts
import { z } from "zod";import { AnalysisContextSchema, type AnalysisContext, type IncidentWorkflow, type EscalationPolicy, ValidationError, validateInput, NotFoundError } from "@reaatech/agent-runbook";import { generateIncidentWorkflows, generateEscalationPolicy, getTemplatesByCategory, applyTemplateVariables } from "@reaatech/agent-runbook-incident";export interface IncidentResult { workflows: IncidentWorkflow[]; escalation: EscalationPolicy; templates: Record<string, unknown>[];}const WebhookSchema = z.object({ jobId: z.string(), runId: z.
Expected output: A handler that transforms a Databricks failure webhook into structured incident workflows, escalation policies, and notification templates.
Step 10: Create the circuit breaker with Redis persistence
Create src/services/redis.ts — a singleton Redis client using ioredis that stores circuit breaker state and publishes state change events via Pub/Sub.
ts
import { Redis } from "ioredis";import { config } from "../config.js";let client: Redis | null = null;export function getRedis(): Redis { if (!client) { client = new Redis(config.REDIS_URL); } return client;}export async function getCircuitState(circuitId: string): Promise<string | null> { return getRedis().get("cb:" + circuitId);}export async function setCircuitState(circuitId: string, state: string): Promise<void> { await getRedis().set("cb:" + circuitId, state);}export async function publishCircuitEvent(channel: string, event: Record<string, unknown>): Promise<void> { const serialized = JSON.stringify(event); await getRedis().publish(channel, serialized);}
Create src/services/circuit-breaker.ts — a CircuitBreaker from @reaatech/circuit-breaker-core configured for the Databricks pipeline, listening to stateChange events and persisting state to Redis.
ts
import { CircuitBreaker, CircuitOpenError, type ResultMetadata, type CircuitEvent, type CircuitBreakerOptions } from "@reaatech/circuit-breaker-core";import { publishCircuitEvent, setCircuitState } from "./redis.js";import logger from "../logger.js";const breakerOptions: CircuitBreakerOptions = { name: "databricks-pipelines", failureThreshold: 5, recoveryTimeoutMs: 60000, minConfidence: 0.7, recoveryStrategy: "gradual",};const breaker = new CircuitBreaker(breakerOptions);breaker.on("stateChange", (event: CircuitEvent) => { const circuitId = event.circuit_id; const toState = JSON.stringify(event.data.to); setCircuitState(circuitId, toState).catch((err: unknown) => { logger.error({ err, circuitId }, "failed to persist circuit state"); }); const eventRecord: Record<string, unknown> = JSON.parse(JSON.stringify(event)) as Record<string, unknown>; publishCircuitEvent("circuit:events", eventRecord).catch((err: unknown) => { logger.error({ err }, "failed to publish circuit event"); });});export function checkPipelineCircuit(pipelineId: string): 'CLOSED' | 'OPEN' | 'HALF_OPEN' { return breaker.getState(pipelineId);}export async function recordPipelineFailure(pipelineId: string, meta: ResultMetadata): Promise<void> { try { await breaker.execute( () => { throw new Error("pipeline failed"); }, { onSuccess: () => meta, onFailure: () => meta }, ); } catch (err: unknown) { if (err instanceof CircuitOpenError) { logger.warn({ pipelineId }, "circuit is open, skipping re-run"); return; } throw err; }}export async function recordPipelineSuccess(pipelineId: string, meta: ResultMetadata): Promise<void> { await breaker.execute( async () => {}, { onSuccess: () => meta }, );}export function resetCircuit(_circuitId: string): void { void _circuitId; breaker.reset();}export function forceCircuitState(circuitId: string, state: 'CLOSED' | 'OPEN'): void { breaker.forceState(circuitId, state);}export function getCircuitStats(circuitId: string) { return breaker.getStats(circuitId);}
Expected output: A circuit breaker that tracks pipeline health, automatically trips after 5 failures, recovers after 60 seconds, and persists state changes to Redis.
Step 11: Define Trigger.dev workflows
Create src/trigger/workflows.ts. Define two durable workflows: pipelineRecoveryTask (triggered on pipeline failure) and runbookGenerationTask (triggered to regenerate runbooks). Both use the task() API from @trigger.dev/sdk.
Expected output: A set of fully wired API routes. Start the dev server with pnpm dev and hit curl http://localhost:3000/api/health to see the health check response.
Step 13: Create the dashboard page
Replace app/page.tsx with a dashboard that shows system health, circuit breaker status, and usage hints.
export const metadata: Metadata = { title: "Databricks AI Runbook Automation", description: "Auto-generate runbooks and automate incident recovery for Databricks data pipelines so small teams can resolve failures without a dedicated DevOps hire.",};
Expected output: A dashboard at / that fetches and displays circuit breaker states and health status on load.
Step 14: Add Langfuse instrumentation
Create src/lib/tracing.ts for LLM observability. It wraps the Langfuse client that is initialized during server startup.
Create src/instrumentation.ts to initialize Langfuse during Next.js server startup. This requires experimental.instrumentationHook: true in next.config.ts.
ts
import { config } from "./config.js";export async function register(): Promise<void> { if (process.env.NEXT_RUNTIME === "nodejs") { const LangfuseModule = await import("langfuse"); const Langfuse = LangfuseModule.default; const client = new Langfuse({ secretKey: config.LANGFUSE_SECRET_KEY, publicKey: config.LANGFUSE_PUBLIC_KEY, }); (globalThis as Record<string, unknown>)["__langfuse"] = client; }}
Update next.config.ts to enable the instrumentation hook:
ts
import type { NextConfig } from "next";const nextConfig: NextConfig = { experimental: { instrumentationHook: true, } as NextConfig["experimental"],};export default nextConfig;
Expected output: Langfuse client initialized at server start, available globally for the tracing module. All LLM calls can be traced by wrapping them in startTrace / endTrace.
Step 15: Export the public API surface
Create src/index.ts to re-export every public function consumers will need:
ts
export { collectJobs, transformToRunbookPage, getJobFailureHistory } from "./runbooks/databricks-collector.js";export { handleIncident } from "./api/incidents/handler.js";export { checkPipelineCircuit, recordPipelineFailure, recordPipelineSuccess, resetCircuit, forceCircuitState, getCircuitStats } from "./services/circuit-breaker.js";export { getRedis, getCircuitState, setCircuitState, publishCircuitEvent } from "./services/redis.js";export { generatePipelineAlerts, formatAlerts, getSloRecommendations } from "./lib/alert-generator.js";export { identifyFailures, generateSection } from "./lib/analysis-agent.js";export { startTrace } from "./lib/tracing.js";export { pipelineRecoveryTask, runbookGenerationTask } from "./trigger/workflows.js";export { config } from "./config.js";export { default as logger } from "./logger.js";export { AppError, NotFoundError, validateInput } from "@reaatech/agent-runbook";
Step 16: Run the quality checks and tests
Type-check the project to catch any type errors:
terminal
pnpm typecheck
Run the linter:
terminal
pnpm lint
Run the test suite with coverage:
terminal
pnpm test
Expected output: All typechecks pass, linter clean, and tests pass with at least 90% line, branch, function, and statement coverage on runtime code under src/ and app/**/route.ts.
Next steps
Add authentication to the API routes — wrap each route handler with a middleware that validates a shared API key or JWT token.
Deploy the Trigger.dev workflows — connect your project to Trigger.dev’s cloud with npx trigger.dev init to run the durable tasks in production.
Expand runbook sections — add RollbackProcedure, HealthCheck, and ServiceDependency sections to the runbook generation pipeline.
Build a notification channel — use the communication templates from the incident handler to send Slack or email alerts when a pipeline fails.
Connect multiple Databricks workspaces — extend the config to accept an array of workspace credentials and run the collector against each.