Databricks AI Runbook Automation for SMB Data Pipelines

Auto-generate runbooks and automate incident recovery for Databricks data pipelines so small teams can resolve failures without a dedicated DevOps hire.

databricks runbook-automation incident-recovery etl-pipelines nextjs trigger-dev circuit-breaker reliability-ops

The problem

Small businesses running ETL jobs on Databricks lack on-call expertise; a failed pipeline stalls reporting and can stay broken for hours because no one knows the recovery steps.

Intro

This tutorial walks you through building a Databricks AI Runbook Automation system that auto-generates runbooks from Databricks job definitions, triggers incident playbooks on failure webhooks, and uses circuit breakers to isolate misbehaving pipelines. By the end, you’ll have a Next.js application with eight API endpoints, a dashboard, persistent circuit breaker state in Redis, and Trigger.dev workflows for durable recovery orchestration.

This is for small teams running ETL on Databricks who need automated incident recovery without a dedicated DevOps engineer.
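To make "auto-generates runbooks from Databricks job definitions" concrete, here is a minimal sketch of the idea. The `JobDefinition` and `TaskDefinition` shapes and the `generateRunbook` function are hypothetical and simplified for illustration; they are not the Databricks Jobs API response format, and the tutorial's own generator may work differently.

```typescript
// Hypothetical, simplified job shape (an assumption, not the Databricks Jobs API).
interface TaskDefinition {
  taskKey: string;
  dependsOn: string[];
  maxRetries: number;
}

interface JobDefinition {
  jobId: number;
  name: string;
  tasks: TaskDefinition[];
}

// Build a plain-text runbook: a header line, then one recovery step per task,
// emitted in dependency order so upstream tasks are checked first.
function generateRunbook(job: JobDefinition): string[] {
  const steps: string[] = [`Runbook for job "${job.name}" (id ${job.jobId})`];
  const done = new Set<string>();
  let remaining = [...job.tasks];

  while (remaining.length > 0) {
    // A task is ready once every task it depends on has already been emitted.
    const ready = remaining.filter((t) => t.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) {
      throw new Error("cyclic or unresolved task dependencies");
    }
    for (const task of ready) {
      done.add(task.taskKey);
      const upstream = task.dependsOn.length > 0 ? task.dependsOn.join(", ") : "none";
      steps.push(
        `${steps.length}. If "${task.taskKey}" fails: check upstream (${upstream}), ` +
          `then re-run (configured retries: ${task.maxRetries}).`,
      );
    }
    remaining = remaining.filter((t) => !done.has(t.taskKey));
  }
  return steps;
}
```

Ordering the steps by dependency means a responder is pointed at the earliest possible cause first, which matters most when nobody on the team knows the pipeline well.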

Prerequisites

Node.js >= 22 and pnpm >= 10 installed
A Databricks SQL warehouse and its connection details (server hostname, HTTP path, access token)
A Redis instance (local or remote)
An OpenAI API key (for the analysis agent)
A Langfuse account (for LLM tracing)
A Trigger.dev account and API key (for durable workflows)
Familiarity with TypeScript, Next.js App Router, and basic Databricks concepts
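Several of the prerequisites above end up as credentials the app needs at startup. As a rough sketch of failing fast when one is missing, the helper below checks a set of environment variables. The variable names and the `loadConfig` function are assumptions made for illustration, and the check is hand-rolled so the example stays dependency-free; the tutorial's own validation step uses Zod.

```typescript
// Hypothetical variable names (assumptions); use whatever names your .env file defines.
const REQUIRED_KEYS = [
  "DATABRICKS_HOST",
  "DATABRICKS_HTTP_PATH",
  "DATABRICKS_TOKEN",
  "REDIS_URL",
  "OPENAI_API_KEY",
  "TRIGGER_SECRET_KEY",
] as const;

type RequiredKey = (typeof REQUIRED_KEYS)[number];
type AppConfig = Record<RequiredKey, string>;

// Collect every missing key before throwing, so one run reports all problems.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const missing: string[] = [];
  const config: Partial<AppConfig> = {};

  for (const key of REQUIRED_KEYS) {
    const value = env[key]?.trim();
    if (value) {
      config[key] = value;
    } else {
      missing.push(key);
    }
  }

  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return config as AppConfig;
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`, so a misconfigured deployment stops immediately instead of failing on the first incident.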

Step 1: Scaffold the project

Create a new Next.js project with the App Router and install all dependencies.

terminal

npx create-next-app@latest databricks-runbook --typescript --app --src-dir --eslint --turbopack
cd databricks-runbook

Example artifact

A complete, working implementation of this recipe, downloadable as a zip or browsable file by file. Generated by our build pipeline and tested before publishing.


199 kB · 81 tests · 93.5% coverage · vitest passing

SHA-256: 078e6332b2da79bd71caaa51ec35398a6dd4e62e0c3a52e682c09dc79b5c99cb


import {
  CircuitBreaker,
  CircuitOpenError,
  type ResultMetadata,
  type CircuitEvent,
  type CircuitBreakerOptions,
} from "@reaatech/circuit-breaker-core";
import { publishCircuitEvent, setCircuitState } from "./redis.js";
import logger from "../logger.js";

const breakerOptions: CircuitBreakerOptions = {
  name: "databricks-pipelines",
  failureThreshold: 5,
  recoveryTimeoutMs: 60000,
  minConfidence: 0.7,
  recoveryStrategy: "gradual",
};

const breaker = new CircuitBreaker(breakerOptions);

// On every state change, persist the new state to Redis and publish the event.
// Both calls are fire-and-forget; failures are logged rather than thrown.
breaker.on("stateChange", (event: CircuitEvent) => {
  const circuitId = event.circuit_id;
  const toState = JSON.stringify(event.data.to);
  setCircuitState(circuitId, toState).catch((err: unknown) => {
    logger.error({ err, circuitId }, "failed to persist circuit state");
  });

  // Round-trip through JSON to get a plain, serializable record.
  const eventRecord: Record<string, unknown> = JSON.parse(JSON.stringify(event)) as Record<string, unknown>;
  publishCircuitEvent("circuit:events", eventRecord).catch((err: unknown) => {
    logger.error({ err }, "failed to publish circuit event");
  });
});

export function checkPipelineCircuit(pipelineId: string): 'CLOSED' | 'OPEN' | 'HALF_OPEN' {
  return breaker.getState(pipelineId);
}

// Record a failure by running a callback that always throws through the breaker.
// An open circuit is logged and swallowed; any other error is rethrown to the caller.
export async function recordPipelineFailure(pipelineId: string, meta: ResultMetadata): Promise<void> {
  try {
    await breaker.execute(
      () => {
        throw new Error("pipeline failed");
      },
      { onSuccess: () => meta, onFailure: () => meta },
    );
  } catch (err: unknown) {
    if (err instanceof CircuitOpenError) {
      logger.warn({ pipelineId }, "circuit is open, skipping re-run");
      return;
    }
    throw err;
  }
}

// Record a success by running a no-op callback through the breaker.
export async function recordPipelineSuccess(pipelineId: string, meta: ResultMetadata): Promise<void> {
  await breaker.execute(
    async () => {},
    { onSuccess: () => meta },
  );
}

// The circuit ID is accepted but unused; reset() is called without it.
export function resetCircuit(_circuitId: string): void {
  void _circuitId;
  breaker.reset();
}

export function forceCircuitState(circuitId: string, state: 'CLOSED' | 'OPEN'): void {
  breaker.forceState(circuitId, state);
}

export function getCircuitStats(circuitId: string) {
  return breaker.getStats(circuitId);
}
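The module above sets `failureThreshold: 5` and `recoveryTimeoutMs: 60000` but delegates the state machine to the library. To show how those two settings are commonly understood to interact, here is a minimal in-memory sketch. `SketchBreaker` and its methods are hypothetical names, and this is not how `@reaatech/circuit-breaker-core` is implemented; it ignores the `minConfidence` and `recoveryStrategy` options entirely.

```typescript
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface SketchOptions {
  failureThreshold: number;
  recoveryTimeoutMs: number;
}

// Hypothetical illustration of a circuit breaker for a single pipeline.
class SketchBreaker {
  private readonly options: SketchOptions;
  private readonly now: () => number;
  private failures = 0;
  private openedAt: number | null = null;

  // The clock is injectable so the timeout can be tested without waiting.
  constructor(options: SketchOptions, now: () => number = () => Date.now()) {
    this.options = options;
    this.now = now;
  }

  getState(): CircuitState {
    if (this.openedAt === null) return "CLOSED";
    const elapsed = this.now() - this.openedAt;
    // Once the recovery timeout has passed, allow a trial run.
    return elapsed >= this.options.recoveryTimeoutMs ? "HALF_OPEN" : "OPEN";
  }

  recordFailure(): void {
    const state = this.getState();
    if (state === "OPEN") return; // already isolating the pipeline
    if (state === "HALF_OPEN") {
      // The trial run failed: reopen and restart the recovery timeout.
      this.openedAt = this.now();
      return;
    }
    this.failures += 1;
    if (this.failures >= this.options.failureThreshold) {
      this.openedAt = this.now();
    }
  }

  recordSuccess(): void {
    if (this.getState() === "OPEN") return; // no runs are expected while open
    // A success in CLOSED or HALF_OPEN clears the failure count and closes the circuit.
    this.failures = 0;
    this.openedAt = null;
  }
}
```

With the values from `breakerOptions`, this sketch would stop re-running a pipeline after five recorded failures and allow one trial run 60 seconds later.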

"use client";

import { useEffect, useState } from "react";
import styles from "./page.module.css";

interface CircuitData {
  circuits: Array<{ id: string; state: string | null }>;
}

interface ApiStatus {
  status: string;
}

export default function Home() {
  const [circuits, setCircuits] = useState<CircuitData | null>(null);
  const [health, setHealth] = useState<ApiStatus | null>(null);
  const [loading, setLoading] = useState(true);

  useEffect(() => {
    async function fetchData() {
      try {
        const [circuitsRes, healthRes] = await Promise.all([
          fetch("/api/circuit-breaker"),
          fetch("/api/health"),
        ]);
        if (circuitsRes.ok) {
          const circuitsData = await circuitsRes.json() as CircuitData;
          setCircuits(circuitsData);
        }
        if (healthRes.ok) {
          const healthData = await healthRes.json() as ApiStatus;
          setHealth(healthData);
        }
      } catch {
        // Fail silently; the dashboard then shows limited data.
      } finally {
        setLoading(false);
      }
    }
    void fetchData();
  }, []);

  return (
    <div className={styles.page}>
      <main className={styles.main}>
        <h1>Databricks Pipeline Runbook Automation</h1>
        {loading ? (
          <p>Loading dashboard data...</p>
        ) : (
          <>
            <section>
              <h2>System Health</h2>
              <p>Status: {health?.status ?? "unknown"}</p>
            </section>
            <section>
              <h2>Circuit Status</h2>
              {circuits?.circuits.length ? (
                <ul>
                  {circuits.circuits.map((c) => (
                    <li key={c.id}>{c.id}: {c.state}</li>
                  ))}
                </ul>
              ) : (
                <p>No circuits found. Check <code>/api/circuit-breaker</code>.</p>
              )}
            </section>
            <section>
              <h2>Recent Incidents</h2>
              <p>POST <code>/api/incidents</code> with a webhook payload to trigger incident response.</p>
            </section>
            <section>
              <h2>Runbooks</h2>
              <p>GET <code>/api/runbooks</code> to view generated runbooks.</p>
            </section>
          </>
        )}
      </main>
    </div>
  );
}

Intro

Prerequisites

Step 1: Scaffold the project

Step 2: Configure environment variables

Step 3: Add config validation with Zod

Step 4: Set up structured logging

Step 5: Define Databricks schemas with Zod

Step 6: Build the Databricks job collector

Step 7: Build the analysis agent

Step 8: Build the alert generator

Step 9: Build the incident handler

Step 10: Create the circuit breaker with Redis persistence

Step 11: Define Trigger.dev workflows

Step 12: Create the API route handlers

Step 13: Create the dashboard page

Step 14: Add Langfuse instrumentation

Step 15: Export the public API surface

Step 16: Run the quality checks and tests

Next steps