# AI Runbook Automation for Agent Failure Recovery
 
Automatically trigger health checks and runbooks when AI agents fail, using circuit breakers and idempotent retries, without locking you into a single LLM provider.
 
## Problem
 
Small businesses that rely on AI agents for customer support or operations face unpredictable outages and errors. When an agent goes down or returns unusable output, the business needs automatic failover and recovery without 24/7 ops staff.
 
## Architecture
 
```
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│  Trigger.dev     │     │  Health Checker       │     │  Circuit Breaker │
│  Daily Cron      │────▶│  (@reaatech/agent-    │────▶│  Manager          │
│  (every 5 min)   │     │  runbook-health-      │     │  (@reaatech/      │
│                  │     │  checks)              │     │  circuit-breaker) │
└─────────────────┘     └──────────────────────┘     └─────────────────┘
        │                                                       │
        ▼                                                       ▼
┌─────────────────┐                                    ┌─────────────────┐
│  Incident        │                                    │  Slack           │
│  Response        │◀────────────────────────────────────│  Notifier        │
│  Workflow        │                                    │  (@slack/web-api)│
│  (@trigger.dev)  │                                    └─────────────────┘
└─────────────────┘


┌─────────────────┐     ┌──────────────────────┐
│  Idempotency     │     │  Langfuse Tracing     │
│  Middleware      │     │  (Observability)      │
│  (@reaatech/     │     └──────────────────────┘
│   idempotency)   │
└─────────────────┘
```
 
- A **Trigger.dev** scheduled task runs every 5 minutes
- The **Health Checker** probes agent endpoints using health checks from `@reaatech/agent-runbook-health-checks`
- On failure, an incident event is published to the **Incident Response** workflow
- The **Circuit Breaker Manager** isolates failing dependencies
- All external calls are wrapped with **idempotency middleware** to ensure safe retries
- **Langfuse** provides tracing and observability across all operations
- **Slack** notifications alert the team
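
To make the circuit breaker step more concrete, here is a minimal, dependency-free sketch of a closed / open / half-open state machine. It is an illustration only: the class name, option names, and the synchronous `call` signature are assumptions and do not mirror the actual API of `@reaatech/circuit-breaker-core`.

```typescript
// Hypothetical sketch of a circuit breaker state machine.
// Names and signatures are assumptions, not the @reaatech/circuit-breaker-core API.
type BreakerState = "closed" | "open" | "half-open";

interface BreakerOptions {
  failureThreshold: number; // consecutive failures before the breaker opens
  resetTimeoutMs: number; // time to stay open before allowing one trial call
}

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly options: BreakerOptions;
  private readonly now: () => number;

  // `now` is injectable so the timeout can be exercised without real waiting.
  constructor(options: BreakerOptions, now: () => number = () => Date.now()) {
    this.options = options;
    this.now = now;
  }

  // Returns the current state, promoting "open" to "half-open" after the timeout.
  getState(): BreakerState {
    if (
      this.state === "open" &&
      this.now() - this.openedAt >= this.options.resetTimeoutMs
    ) {
      this.state = "half-open";
    }
    return this.state;
  }

  // Runs `fn` through the breaker; rejects immediately while the breaker is open.
  call<T>(fn: () => T): T {
    if (this.getState() === "open") {
      throw new Error("circuit open: call rejected");
    }
    try {
      const result = fn();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }

  private recordFailure(): void {
    this.failures += 1;
    // A failed trial call in half-open reopens the breaker right away.
    if (
      this.state === "half-open" ||
      this.failures >= this.options.failureThreshold
    ) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

A real deployment would wrap asynchronous calls and persist breaker state, which is what the persistence adapters in `@reaatech/circuit-breaker-agents` are described as providing.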
 
## Packages Used
 
### REAA Packages
- `@reaatech/agent-runbook@0.1.0` — Core types, Zod schemas, utilities, error classes
- `@reaatech/agent-runbook-incident@0.1.0` — Incident workflow generation
- `@reaatech/agent-runbook-health-checks@0.1.0` — Health check probe generation
- `@reaatech/circuit-breaker-core@0.1.1` — Circuit breaker state machine
- `@reaatech/circuit-breaker-agents@0.1.1` — Circuit breaker with persistence adapters
- `@reaatech/idempotency-middleware@1.0.0` — Idempotent request middleware
 
### Third-Party Packages
- `@trigger.dev/sdk@4.4.6` — Durable workflow engine
- `langfuse@3.38.20` — LLM observability and tracing
- `@slack/web-api@7.16.0` — Slack messaging
- `zod@4.4.3` — Schema validation
- `openai@6.42.0` — OpenAI provider
- `@anthropic-ai/sdk@0.102.0` — Anthropic provider
 
## Getting Started
 
1. Install dependencies:
   ```bash
   pnpm install
   ```
 
2. Copy `.env.example` to `.env` and fill in your credentials:
   ```bash
   cp .env.example .env
   ```
 
3. Run the dev server:
   ```bash
   pnpm dev
   ```
 
4. Check the health endpoint:
   ```bash
   curl http://localhost:3000/api/health
   ```
 
## API Routes
 
### `GET /api/health`
Returns the status of all monitored agent services:
```json
{
  "status": "ok",
  "services": [
    { "serviceId": "ai-agent-1", "status": "healthy", "latencyMs": 42, "checkedAt": "..." }
  ],
  "timestamp": "..."
}
```
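
The response shape above can be described with TypeScript types. The sketch below is an assumption-laden illustration: the field names come from the sample response, but the `"unhealthy"` and `"degraded"` values and the `buildHealthResponse` helper are hypothetical and may not match what the route actually returns.

```typescript
// Types inferred from the sample response. Only "ok" and "healthy" appear in
// the README; "degraded" and "unhealthy" are assumed values for illustration.
type ServiceStatus = "healthy" | "unhealthy";

interface ServiceHealth {
  serviceId: string;
  status: ServiceStatus;
  latencyMs: number;
  checkedAt: string; // ISO 8601 timestamp
}

interface HealthResponse {
  status: "ok" | "degraded";
  services: ServiceHealth[];
  timestamp: string; // ISO 8601 timestamp
}

// Hypothetical helper: reports "ok" only when every service is healthy.
// An empty service list also yields "ok", since there is nothing failing.
function buildHealthResponse(
  services: ServiceHealth[],
  now: Date = new Date(),
): HealthResponse {
  const allHealthy = services.every((s) => s.status === "healthy");
  return {
    status: allHealthy ? "ok" : "degraded",
    services,
    timestamp: now.toISOString(),
  };
}
```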
 
### `POST /api/trigger`
Webhook endpoint for Trigger.dev events. Accepts JSON with `event` and `payload` fields. Supports `"health-check"` and `"incident"` event types.
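
As a rough sketch of how the request body could be validated before dispatch, the function below accepts only the two documented event types. The function name, the error messages, and the rule that `payload` is required are assumptions. The recipe lists `zod` as a dependency, so the real route may well validate with a Zod schema instead.

```typescript
// Hypothetical validation for the POST /api/trigger body.
// Only the `event` and `payload` field names and the two event types come
// from the README; everything else here is an assumption.
type TriggerEventName = "health-check" | "incident";

interface TriggerEvent {
  event: TriggerEventName;
  payload: unknown;
}

function parseTriggerEvent(body: unknown): TriggerEvent {
  if (typeof body !== "object" || body === null) {
    throw new Error("body must be a JSON object");
  }
  const { event, payload } = body as { event?: unknown; payload?: unknown };
  if (event !== "health-check" && event !== "incident") {
    throw new Error(`unsupported event type: ${String(event)}`);
  }
  // Assumption: a missing payload is treated as a client error.
  if (payload === undefined) {
    throw new Error("missing payload");
  }
  return { event, payload };
}
```

A route handler could call `parseTriggerEvent` on the decoded JSON body and answer with a 400 status when it throws.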
 
## Trigger.dev Workflows
 
This recipe defines two Trigger.dev tasks:
 
1. **Daily Health Check** (`src/runbooks/daily-health.ts`): Despite the "daily" in its name, this task is scheduled every 5 minutes via `schedules.task`. It probes all registered agent services and reports the results.

2. **Incident Response** (`src/runbooks/incident-response.ts`): Triggered by incident events. Opens circuit breakers, generates runbook workflows and escalation policies, notifies Slack, and optionally asks an LLM whether to restart the service.
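
Because a workflow step may be retried, side effects such as a Slack notification should take effect only once per incident. The sketch below shows the general idea with an in-memory store keyed by an idempotency key. The class name, the key format, and the in-memory `Map` are all assumptions for illustration, not the API of `@reaatech/idempotency-middleware`.

```typescript
// Hypothetical in-memory idempotency store. A production setup would use
// shared, durable storage so that retries on other workers see the same keys.
class IdempotencyStore {
  private readonly results = new Map<string, unknown>();

  // Runs `fn` the first time a key is seen and replays the stored result on
  // later calls. If `fn` throws, nothing is stored, so a retry runs it again.
  run<T>(key: string, fn: () => T): T {
    if (this.results.has(key)) {
      return this.results.get(key) as T;
    }
    const result = fn();
    this.results.set(key, result);
    return result;
  }
}

// Usage sketch: the key format "incident:<id>:slack" is an assumption.
const store = new IdempotencyStore();
let messagesSent = 0;
const notify = (): string => {
  messagesSent += 1; // stands in for a real Slack API call
  return `notification-${messagesSent}`;
};

const firstAttempt = store.run("incident:42:slack", notify);
const retryAttempt = store.run("incident:42:slack", notify);
```

Here `retryAttempt` replays the stored result of `firstAttempt` and `notify` runs only once. If `fn` returns a Promise, the Promise itself is stored, so a rejected Promise would need to be evicted before a retry could run again.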
 
## Provider-Agnostic Design
 
The recipe supports both OpenAI and Anthropic as LLM providers. Set `DEFAULT_PROVIDER=openai` or `DEFAULT_PROVIDER=anthropic` in your `.env` to choose one. The `createProvider()` factory and the `getDefaultProvider()` function in `src/services/provider-client.ts` handle instantiation.
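
The factory pattern described above might look roughly like the following. The function names match the ones mentioned in this README, but the signatures, the `LlmProvider` interface, the stubbed `complete` method, and the choice to throw on a missing or unknown value are assumptions. The real code in `src/services/provider-client.ts` may differ.

```typescript
// Hypothetical sketch of a provider factory. The stub providers stand in for
// real SDK clients, which would return Promises from network calls.
type ProviderName = "openai" | "anthropic";

interface LlmProvider {
  readonly name: ProviderName;
  complete(prompt: string): string;
}

const PROVIDER_NAMES: readonly ProviderName[] = ["openai", "anthropic"];

function isProviderName(value: string): value is ProviderName {
  return PROVIDER_NAMES.some((n) => n === value);
}

// Builds a provider for the given name (stubbed for illustration).
function createProvider(name: ProviderName): LlmProvider {
  return {
    name,
    complete: (prompt: string) => `[${name} stub] ${prompt}`,
  };
}

// Reads DEFAULT_PROVIDER from an env-like object and builds that provider.
// Assumption: a missing or unknown value is an error rather than a fallback.
function getDefaultProvider(
  env: Record<string, string | undefined>,
): LlmProvider {
  const raw = env["DEFAULT_PROVIDER"];
  if (raw === undefined || !isProviderName(raw)) {
    throw new Error(
      `DEFAULT_PROVIDER must be one of: ${PROVIDER_NAMES.join(", ")}`,
    );
  }
  return createProvider(raw);
}
```

Passing the environment in as an argument (for example `getDefaultProvider(process.env)` in Node.js) keeps the sketch easy to test without touching global state.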
 
## License
 
MIT