# Databricks AI Runbook Automation for SMB Data Pipelines
> Auto-generate runbooks and automate incident recovery for Databricks data pipelines so small teams can resolve failures without a dedicated DevOps hire.

A tutorial-style reference solution from [reaatech.com](https://reaatech.com), demonstrating how to build production-grade AI systems with the `@reaatech/*` package family.

Small businesses running ETL jobs on Databricks often lack on-call expertise, so a failed pipeline stalls reporting and can stay broken for hours because no one knows the recovery steps. This recipe generates runbooks from Databricks job definitions, triggers incident playbooks on failure webhooks, and uses circuit breakers to isolate misbehaving pipelines.
## Features
- **Runbook generation** — Scan Databricks SQL warehouses for job definitions and produce human-readable runbooks via `@reaatech/agent-runbook-analyzer`
- **Incident response** — Receive Databricks failure webhooks and automatically generate SEV1–SEV4 incident workflows via `@reaatech/agent-runbook-incident`
- **Circuit breaker** — Isolate failing pipelines using `@reaatech/circuit-breaker-core` with Redis-persisted state
- **Durable workflows** — Orchestrate recovery through `@trigger.dev/sdk` for retryable, stateful execution
- **Dashboard** — Next.js App Router dashboard with real-time circuit status and runbook views
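
To make the circuit breaker feature concrete, here is a minimal in-memory sketch of the closed / open / half-open state machine that such a breaker typically implements. It is an illustration only: `PipelineBreaker`, `failureThreshold`, and `cooldownMs` are hypothetical names, this is not the `@reaatech/circuit-breaker-core` API, and the Redis persistence described above is left out.

```typescript
// Hypothetical in-memory sketch; not the @reaatech/circuit-breaker-core API.
type CircuitState = "closed" | "open" | "half-open";

interface BreakerOptions {
  failureThreshold: number; // consecutive failures before the circuit opens
  cooldownMs: number; // how long the circuit stays open before a probe run is allowed
}

class PipelineBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly options: BreakerOptions;

  constructor(options: BreakerOptions) {
    this.options = options;
  }

  // Returns true when the pipeline may run at time `now` (milliseconds).
  canRun(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.options.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: let a probe run through
    }
    return this.state !== "open";
  }

  // A successful run clears the failure count and closes the circuit.
  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // A failed probe reopens immediately; otherwise open once the threshold is reached.
  recordFailure(now: number): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.options.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }

  current(): CircuitState {
    return this.state;
  }
}
```

Passing `now` explicitly instead of reading the clock inside the class keeps the sketch deterministic and easy to test. A persisted variant would store `state`, `failures`, and `openedAt` per pipeline in Redis.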
## API Endpoints
| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/incidents` | Receive Databricks failure webhook, trigger incident workflow |
| GET | `/api/circuit-breaker` | List all circuit states |
| GET | `/api/circuit-breaker/[id]` | Get circuit state + stats |
| POST | `/api/circuit-breaker/[id]` | Reset or force circuit state |
| GET | `/api/health` | Health check (Redis + Databricks connectivity) |
| GET | `/api/runbooks` | List generated runbooks |
| POST | `/api/runbooks` | Generate new runbooks |
| GET | `/api/runbooks/[id]` | Get specific runbook |
| GET | `/api/alerts` | List generated alert definitions |
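
As an illustration of what `POST /api/incidents` has to do before it can trigger an incident workflow, the sketch below validates an incoming webhook body and rejects malformed input with a 400-style result. The field names (`jobId`, `runId`, `errorMessage`) and the `parseFailureEvent` helper are assumptions made for this example. The actual Databricks webhook payload and the repo's Zod schemas may look different.

```typescript
// Hypothetical payload shape; the real webhook body and the repo's schemas may differ.
interface FailureEvent {
  jobId: string;
  runId: string;
  errorMessage: string;
}

type ParseResult =
  | { ok: true; event: FailureEvent }
  | { ok: false; status: 400; error: string };

// Checks that the body is an object carrying the three required non-empty string fields.
function parseFailureEvent(body: unknown): ParseResult {
  if (typeof body !== "object" || body === null) {
    return { ok: false, status: 400, error: "body must be a JSON object" };
  }
  const record = body as Record<string, unknown>;
  for (const field of ["jobId", "runId", "errorMessage"] as const) {
    const value = record[field];
    if (typeof value !== "string" || value.length === 0) {
      return { ok: false, status: 400, error: `missing or empty field: ${field}` };
    }
  }
  return {
    ok: true,
    event: {
      jobId: record["jobId"] as string,
      runId: record["runId"] as string,
      errorMessage: record["errorMessage"] as string,
    },
  };
}
```

A route handler could call this on the parsed request JSON, return the error with its status when `ok` is false, and otherwise hand `event` to the incident workflow.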
## Prerequisites
- Node.js >= 22
- pnpm >= 10
- A Databricks SQL warehouse (with SQL endpoint)
- Redis instance
- OpenAI API key
- Langfuse account (for LLM tracing)
- Trigger.dev account (for workflow orchestration)
## Getting Started
1. Clone the repo and install dependencies:
```bash
pnpm install
```
2. Copy `.env.example` to `.env` and fill in your credentials:
```bash
cp .env.example .env
```
3. Run the test suite:
```bash
pnpm test
```
4. Start the development server:
```bash
pnpm dev
```
## Project layout
```
app/               Next.js App Router pages + API routes
src/
  api/incidents/   Incident webhook handler
  lib/             Alert generator, analysis agent, tracing
  runbooks/        Databricks job collector
  services/        Circuit breaker, Redis client
  trigger/         Trigger.dev workflow definitions
  types/           Zod schemas and domain type extensions
tests/             Vitest suite (mirrors src/)
packages/          API references for every dependency
```
## Packages
### REAA
- `@reaatech/agent-runbook` — Core domain types, Zod schemas, utilities, error classes
- `@reaatech/agent-runbook-agent` — LLM-powered analysis agent wrapping OpenAI
- `@reaatech/agent-runbook-alerts` — Alert extraction and generation (Prometheus, Datadog, CloudWatch)
- `@reaatech/agent-runbook-analyzer` — Repository scanning and code analysis
- `@reaatech/agent-runbook-incident` — Incident response workflows and templates
- `@reaatech/circuit-breaker-core` — Circuit breaker state machine for pipeline isolation
### Third-party
- `@databricks/sql` — Databricks SQL driver (Thrift API)
- `@trigger.dev/sdk` — Durable, retryable workflow orchestration
- `ioredis` — Redis client for circuit breaker state persistence
- `openai` — OpenAI client for analysis agent
- `langfuse` — LLM observability and tracing
- `pino` — High-performance structured logging
- `p-retry` — Retry with exponential backoff
- `zod` — Runtime schema validation
## Trigger.dev Workflow Triggers
| Event | Task | Description |
|-------|------|-------------|
| `pipeline.failure` | `pipelineRecoveryTask` | Fired when a Databricks pipeline fails — generates incident workflows and updates circuit breaker |
| `runbook.generate` | `runbookGenerationTask` | Fired to regenerate runbooks from current Databricks job definitions |
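
The event-to-task mapping in the table can be expressed as a small typed lookup, sketched below. The event and task names are taken from the table. The `EVENT_TO_TASK` constant and the `taskFor` helper are hypothetical and only show the routing idea; they do not call `@trigger.dev/sdk` and are not necessarily how the repo wires its tasks.

```typescript
// Event and task names come from the table above; the helpers are hypothetical.
const EVENT_TO_TASK = {
  "pipeline.failure": "pipelineRecoveryTask",
  "runbook.generate": "runbookGenerationTask",
} as const;

type WorkflowEvent = keyof typeof EVENT_TO_TASK;

// Type guard: true only for event names listed in the mapping (own keys, not inherited ones).
function isWorkflowEvent(name: string): name is WorkflowEvent {
  return Object.prototype.hasOwnProperty.call(EVENT_TO_TASK, name);
}

// Returns the task name for a known event, or undefined for anything else.
function taskFor(name: string): string | undefined {
  return isWorkflowEvent(name) ? EVENT_TO_TASK[name] : undefined;
}
```

Returning `undefined` for unknown events lets the caller decide whether to log, ignore, or reject them instead of triggering the wrong task.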
## License
MIT — see [LICENSE](./LICENSE).