# AI Runbook Automation for Agent Failure Recovery
Automatically trigger health checks and runbooks when AI agents fail, with circuit breakers and idempotent retries, and without locking you into a single LLM provider.
## Problem
Small businesses relying on AI agents for customer support or operations face unpredictable outages and errors; when an agent goes down or returns garbage, the business needs automatic failover and recovery without 24/7 ops staff.
## Architecture
```
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│  Trigger.dev    │     │  Health Checker      │     │ Circuit Breaker │
│  Daily Cron     │────▶│  (@reaatech/agent-   │────▶│    Manager      │
│  (every 5 min)  │     │   runbook-health-    │     │  (@reaatech/    │
│                 │     │   checks)            │     │ circuit-breaker)│
└─────────────────┘     └──────────────────────┘     └─────────────────┘
         │                                                    │
         ▼                                                    ▼
┌─────────────────┐                                  ┌─────────────────┐
│    Incident     │                                  │      Slack      │
│    Response     │◀─────────────────────────────────│    Notifier     │
│    Workflow     │                                  │ (@slack/web-api)│
│ (@trigger.dev)  │                                  └─────────────────┘
└─────────────────┘
         │
         ▼
┌─────────────────┐     ┌──────────────────────┐
│  Idempotency    │     │  Langfuse Tracing    │
│  Middleware     │     │  (Observability)     │
│  (@reaatech/    │     └──────────────────────┘
│  idempotency)   │
└─────────────────┘
```
- A **Trigger.dev** scheduled task runs every 5 minutes
- The **Health Checker** probes agent endpoints using health checks from `@reaatech/agent-runbook-health-checks`
- On failure, an incident event is published to the **Incident Response** workflow
- The **Circuit Breaker Manager** isolates failing dependencies
- All external calls are wrapped with **idempotency middleware** to ensure safe retries
- **Langfuse** records traces for all operations, providing observability
- **Slack** notifications alert the team
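
To illustrate how a circuit breaker isolates a failing dependency, here is a minimal sketch of the usual closed / open / half-open state machine. It is not the `@reaatech/circuit-breaker-core` API: the class name `SimpleCircuitBreaker`, its methods, and the threshold and cooldown parameters are all hypothetical and chosen only for illustration.

```typescript
// Hypothetical sketch of a circuit breaker state machine.
// This is NOT the @reaatech/circuit-breaker-core API; all names are illustrative.
type BreakerState = "closed" | "open" | "half-open";

class SimpleCircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number; // consecutive failures before opening
  private readonly cooldownMs: number;       // time to stay open before allowing a probe
  private readonly now: () => number;        // injectable clock, useful in tests

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Reports the current state, moving open -> half-open once the cooldown has elapsed.
  getState(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  // Callers check this before contacting the dependency.
  canRequest(): boolean {
    return this.getState() !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe in half-open re-opens immediately; otherwise wait for the threshold.
    if (this.getState() === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

A health-check task could call `canRequest()` before probing an agent and report the outcome with `recordSuccess()` or `recordFailure()`. The packages listed below also mention persistence adapters, which this in-memory sketch leaves out.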
## Packages Used
### REAA Packages
- `@reaatech/agent-runbook@0.1.0` — Core types, Zod schemas, utilities, error classes
- `@reaatech/agent-runbook-incident@0.1.0` — Incident workflow generation
- `@reaatech/agent-runbook-health-checks@0.1.0` — Health check probe generation
- `@reaatech/circuit-breaker-core@0.1.1` — Circuit breaker state machine
- `@reaatech/circuit-breaker-agents@0.1.1` — Circuit breaker with persistence adapters
- `@reaatech/idempotency-middleware@1.0.0` — Idempotent request middleware
### Third-Party Packages
- `@trigger.dev/sdk@4.4.6` — Durable workflow engine
- `langfuse@3.38.20` — LLM observability and tracing
- `@slack/web-api@7.16.0` — Slack messaging
- `zod@4.4.3` — Schema validation
- `openai@6.42.0` — OpenAI provider
- `@anthropic-ai/sdk@0.102.0` — Anthropic provider
## Getting Started
1. Install dependencies:
```bash
pnpm install
```
2. Copy `.env.example` to `.env` and fill in your credentials:
```bash
cp .env.example .env
```
3. Run the dev server:
```bash
pnpm dev
```
4. Check the health endpoint:
```bash
curl http://localhost:3000/api/health
```
## API Routes
### `GET /api/health`
Returns the status of all monitored agent services:
```json
{
"status": "ok",
"services": [
{ "serviceId": "ai-agent-1", "status": "healthy", "latencyMs": 42, "checkedAt": "..." }
],
"timestamp": "..."
}
```
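
A client consuming this response could type it roughly as follows. The field names come from the sample above; the `unhealthyServices` helper is hypothetical, and so is the assumption that any status other than `"healthy"` should be treated as a problem.

```typescript
// Field names are taken from the sample response; everything else is an assumption.
interface ServiceHealth {
  serviceId: string;
  status: string;    // "healthy" in the sample; other values are not documented here
  latencyMs: number;
  checkedAt: string; // elided in the sample, assumed to be a timestamp string
}

interface HealthResponse {
  status: string;
  services: ServiceHealth[];
  timestamp: string;
}

// Hypothetical helper: returns the ids of services a client might alert on.
function unhealthyServices(res: HealthResponse): string[] {
  return res.services
    .filter((service) => service.status !== "healthy")
    .map((service) => service.serviceId);
}
```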
### `POST /api/trigger`
Webhook endpoint for Trigger.dev events. Accepts JSON with `event` and `payload` fields. Supports `"health-check"` and `"incident"` event types.
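
As a rough sketch of how such a body might be validated before dispatch, the function below accepts only the two documented event types. The name `parseTriggerBody` is hypothetical, the payload shape is not specified in this README so it stays `unknown`, and rejecting a missing `payload` is an assumption. The recipe itself may well use Zod schemas for this instead.

```typescript
// Event names come from the route description; the payload shape is unspecified.
type TriggerEventName = "health-check" | "incident";

interface TriggerEvent {
  event: TriggerEventName;
  payload: unknown;
}

// Hypothetical validator: returns the typed event, or null if the body is not acceptable.
function parseTriggerBody(body: unknown): TriggerEvent | null {
  if (typeof body !== "object" || body === null) return null;
  const { event, payload } = body as { event?: unknown; payload?: unknown };
  // Assumption: both fields must be present.
  if (payload === undefined) return null;
  if (event !== "health-check" && event !== "incident") return null;
  return { event: event as TriggerEventName, payload };
}
```

A route handler could respond with a 400 status when this returns `null` and otherwise branch on `event`.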
## Trigger.dev Workflows
This recipe defines two Trigger.dev tasks:
1. **Daily Health Check** (`src/runbooks/daily-health.ts`): Despite its name, this task is scheduled every 5 minutes via `schedules.task`. It probes all registered agent services and reports the results.
2. **Incident Response** (`src/runbooks/incident-response.ts`): Triggered by incident events. It opens circuit breakers, generates runbook workflows and escalation policies, notifies Slack, and optionally asks an LLM whether to restart the affected service.
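
Steps such as the Slack notification have side effects, so a retried incident run should not repeat them. The sketch below shows the general idea behind idempotency keys. It is not the `@reaatech/idempotency-middleware` API: `runOnce`, the in-memory map, and the `incidentId:step` key format are all hypothetical, and a real deployment would likely need a store that survives restarts.

```typescript
// Hypothetical sketch of idempotency keys; not the @reaatech/idempotency-middleware API.
const attemptsByKey = new Map<string, Promise<unknown>>();

// Runs `operation` at most once per key while it is pending or has succeeded.
// A failed attempt is evicted so that a later retry can run it again.
function runOnce<T>(key: string, operation: () => Promise<T>): Promise<T> {
  const existing = attemptsByKey.get(key);
  if (existing) return existing as Promise<T>;

  const attempt = operation();
  attemptsByKey.set(key, attempt);
  attempt.catch(() => {
    // Only evict if a newer attempt has not already replaced this one.
    if (attemptsByKey.get(key) === attempt) attemptsByKey.delete(key);
  });
  return attempt;
}

// Stand-in for a real Slack call, counting how often it actually runs.
let slackCalls = 0;
async function notifySlack(incidentId: string): Promise<string> {
  slackCalls += 1;
  return `notified:${incidentId}`;
}
```

Calling `runOnce("incident-42:notify-slack", () => notifySlack("incident-42"))` twice returns the same promise and invokes `notifySlack` once.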
## Provider-Agnostic Design
The recipe supports both OpenAI and Anthropic as LLM providers. Set `DEFAULT_PROVIDER=openai` or `DEFAULT_PROVIDER=anthropic` in your `.env` to choose. The `createProvider()` factory and the `getDefaultProvider()` function in `src/services/provider-client.ts` handle instantiation.
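
The factory pattern described here might look roughly like the sketch below. The function names match the ones mentioned above, but their signatures, the `LlmProvider` interface, and the behavior on an unsupported value are assumptions; the real code in `src/services/provider-client.ts` may differ. Stub providers stand in for the OpenAI and Anthropic SDK clients so the example stays self-contained.

```typescript
// Hypothetical sketch; the real signatures in src/services/provider-client.ts may differ.
type ProviderName = "openai" | "anthropic";

interface LlmProvider {
  readonly name: ProviderName;
  complete(prompt: string): Promise<string>;
}

// Stubs stand in for the real SDK clients; no network calls are made.
function createProvider(name: ProviderName): LlmProvider {
  switch (name) {
    case "openai":
      return { name, complete: async (prompt) => `[openai stub] ${prompt}` };
    case "anthropic":
      return { name, complete: async (prompt) => `[anthropic stub] ${prompt}` };
  }
}

// Reads DEFAULT_PROVIDER from the given environment.
// Assumption: any value other than the two supported names is an error.
function getDefaultProvider(env: Record<string, string | undefined>): LlmProvider {
  const raw = env.DEFAULT_PROVIDER;
  if (raw !== "openai" && raw !== "anthropic") {
    throw new Error(`Unsupported DEFAULT_PROVIDER: ${String(raw)}`);
  }
  return createProvider(raw);
}
```

In the application this would typically be called as `getDefaultProvider(process.env)`. Passing the environment in as an argument keeps the function easy to test.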
## License
MIT