# Azure AI Reliability Suite for SMB AI Operations
Proactive incident detection, self-healing, and cost-aware failure recovery for SMB AI agent operations, powered by Azure AI.
## Architecture
This reference implementation combines four REAA packages to build a reliability-operations layer for Azure AI applications:
- **@reaatech/circuit-breaker-agents** — isolate failing tool calls and prevent cascading failures
- **@reaatech/idempotency-middleware** — prevent duplicate side effects from retries
- **@reaatech/agent-runbook-agent** — automated incident runbook generation and failure-mode analysis
- **@reaatech/agent-runbook-mcp** — expose reliability tools as an MCP server for AI agents
All coordination is orchestrated via **@trigger.dev/sdk** durable workflows, ensuring recovery steps survive server restarts. **Azure OpenAI** provides the model inference layer. **Langfuse** provides LLM observability and tracing.
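To make the isolation idea concrete, here is a minimal sketch of a circuit breaker state machine. It is not the `@reaatech/circuit-breaker-agents` API: the class name `SimpleCircuitBreaker`, its constructor arguments, and the closed/open/half-open transitions shown are assumptions chosen for illustration.

```typescript
// Illustrative sketch only: NOT the @reaatech/circuit-breaker-agents API.
// Class name, parameters, and transition rules are assumptions.
type CircuitState = "closed" | "open" | "half-open";

class SimpleCircuitBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    cooldownMs: number,
    now: () => number = () => Date.now(), // injectable clock for testing
  ) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  getState(): CircuitState {
    // Once the cooldown has elapsed, let a trial call through.
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed trial call, or too many consecutive failures, opens the circuit.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  // Runs `call` unless the circuit is open, and records the outcome.
  async exec<T>(call: () => Promise<T>): Promise<T> {
    if (this.getState() === "open") {
      throw new Error("circuit open: call rejected");
    }
    try {
      const result = await call();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }
}
```

Wrapping a tool call as `breaker.exec(() => callTool())` (with `callTool` standing in for any async operation) means that, after repeated failures, further calls are rejected immediately instead of hitting the failing dependency again.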
## Quick Start
```bash
pnpm install
pnpm dev
```
### Required Environment Variables
Copy `.env.example` to `.env` and fill in the values:
| Variable | Description |
|----------|-------------|
| `AZURE_OPENAI_ENDPOINT` | Your Azure OpenAI endpoint URL |
| `AZURE_OPENAI_API_KEY` | Your Azure OpenAI API key |
| `AZURE_OPENAI_DEPLOYMENT` | Your deployment name |
| `TRIGGER_SECRET_KEY` | Your trigger.dev secret key |
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key (optional) |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key (optional) |
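A startup check for these variables might look like the sketch below. The `loadConfig` helper and the `SuiteConfig` shape are hypothetical and not part of this repository; the sketch also assumes that Langfuse tracing is enabled only when both optional keys are set.

```typescript
// Hypothetical helper: validates the variables listed in the table above.
interface SuiteConfig {
  azureOpenAiEndpoint: string;
  azureOpenAiApiKey: string;
  azureOpenAiDeployment: string;
  triggerSecretKey: string;
  langfuse?: { publicKey: string; secretKey: string };
}

type Env = Record<string, string | undefined>;

const REQUIRED_VARS = [
  "AZURE_OPENAI_ENDPOINT",
  "AZURE_OPENAI_API_KEY",
  "AZURE_OPENAI_DEPLOYMENT",
  "TRIGGER_SECRET_KEY",
];

function loadConfig(env: Env): SuiteConfig {
  // Report every missing variable at once rather than failing on the first.
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }

  const config: SuiteConfig = {
    azureOpenAiEndpoint: env["AZURE_OPENAI_ENDPOINT"] as string,
    azureOpenAiApiKey: env["AZURE_OPENAI_API_KEY"] as string,
    azureOpenAiDeployment: env["AZURE_OPENAI_DEPLOYMENT"] as string,
    triggerSecretKey: env["TRIGGER_SECRET_KEY"] as string,
  };

  // Assumption: Langfuse is optional and needs both keys to be useful.
  const publicKey = env["LANGFUSE_PUBLIC_KEY"];
  const secretKey = env["LANGFUSE_SECRET_KEY"];
  if (publicKey && secretKey) {
    config.langfuse = { publicKey, secretKey };
  }
  return config;
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`.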
## API Endpoints
### Health Check
- `GET /api/health` — Overall suite health status (healthy/degraded/unhealthy)
- `POST /api/health` — Targeted health check for a specific service
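The README does not say how the overall status is derived. One plausible rule, sketched below, is to report the worst status among the individual services; the `aggregateHealth` function and the service names in the usage note are hypothetical.

```typescript
type HealthStatus = "healthy" | "degraded" | "unhealthy";

interface ServiceHealth {
  name: string;
  status: HealthStatus;
}

// Assumption: higher rank means worse health.
const STATUS_RANK: Record<HealthStatus, number> = {
  healthy: 0,
  degraded: 1,
  unhealthy: 2,
};

// Returns the worst status reported by any service ("healthy" if none are given).
function aggregateHealth(services: ServiceHealth[]): HealthStatus {
  let worst: HealthStatus = "healthy";
  for (const service of services) {
    if (STATUS_RANK[service.status] > STATUS_RANK[worst]) {
      worst = service.status;
    }
  }
  return worst;
}
```

Under this rule, `aggregateHealth([{ name: "azure-openai", status: "healthy" }, { name: "langfuse", status: "degraded" }])` evaluates to `"degraded"`.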
### Dashboard
- `GET /api/dashboard/stats` — Aggregated statistics
- `GET /api/dashboard/incidents` — Incident list (optional `?status=` and `?severity=` filters)
- `POST /api/dashboard/incidents` — Create a new incident report
### Workflows
- `POST /api/workflows/trigger` — Trigger a workflow task
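As a small client-side sketch, the helper below builds the incident list URL with the optional `status` and `severity` filters. The `incidentsUrl` function is hypothetical, and the accepted filter values are not documented here, so the values in the usage note are placeholders.

```typescript
// Hypothetical helper for building the incident list URL.
interface IncidentFilters {
  status?: string;
  severity?: string;
}

function incidentsUrl(baseUrl: string, filters: IncidentFilters = {}): string {
  const params: string[] = [];
  if (filters.status) {
    params.push(`status=${encodeURIComponent(filters.status)}`);
  }
  if (filters.severity) {
    params.push(`severity=${encodeURIComponent(filters.severity)}`);
  }
  // Strip trailing slashes so the path is not doubled up.
  const path = `${baseUrl.replace(/\/+$/, "")}/api/dashboard/incidents`;
  return params.length > 0 ? `${path}?${params.join("&")}` : path;
}
```

Assuming the dev server listens on `http://localhost:3000`, `incidentsUrl("http://localhost:3000", { status: "open", severity: "high" })` yields `http://localhost:3000/api/dashboard/incidents?status=open&severity=high`, which can be passed to `fetch`.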
### Dashboard UI
Open `/dashboard` in the browser for the web interface.
## Recovery Workflow
1. Health check detects a failure → triggers `detectIncidentTask`
2. Circuit breaker analysis checks whether the circuit is open
3. `resetCircuitTask` resets the circuit and generates a recovery plan
4. `rollbackTask` executes runbook rollback procedures
5. `retryTask` retries failed operations with exponential backoff
6. `escalateTask` marks incidents as escalated when max retries are exhausted
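Steps 5 and 6 can be sketched as a retry loop with exponential backoff that reports an escalation once retries run out. This is not the code of `retryTask` or `escalateTask`: the function names, the 1-second base delay, the 30-second cap, and the injected `sleep` function are all assumptions.

```typescript
// Sketch of steps 5 and 6; names and defaults are assumptions.
type RetryOutcome<T> =
  | { status: "recovered"; value: T; attempts: number }
  | { status: "escalated"; attempts: number; lastError: unknown };

// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ...
// capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt - 1));
}

async function retryOrEscalate<T>(
  operation: () => Promise<T>,
  maxRetries: number,
  sleep: (ms: number) => Promise<void>, // injected so callers control timing
): Promise<RetryOutcome<T>> {
  let lastError: unknown;
  // One initial attempt plus up to `maxRetries` retries.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (attempt > 0) {
      await sleep(backoffDelayMs(attempt));
    }
    try {
      const value = await operation();
      return { status: "recovered", value, attempts: attempt + 1 };
    } catch (err) {
      lastError = err;
    }
  }
  // Retries exhausted: the caller would mark the incident as escalated.
  return { status: "escalated", attempts: maxRetries + 1, lastError };
}
```

A real caller could pass `(ms) => new Promise((resolve) => setTimeout(resolve, ms))` as `sleep`. Inside a durable workflow, a wait primitive provided by the workflow engine would be the more natural choice, so that the delay survives restarts.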
## Testing
```bash
pnpm typecheck # TypeScript checks
pnpm lint # ESLint
pnpm test # Vitest with coverage
```
## License
MIT