Skip to content
reaatechREAATECH

Files · Azure AI Reliability Suite for SMB AI Operations

77 (1 binary, 625.3 kB total)attempt 1

README.md·2592 B·markdown
markdown
# Azure AI Reliability Suite for SMB AI Operations
 
Proactive incident detection, self-healing, and cost-aware failure recovery for SMB AI agent operations, powered by Azure AI.
 
## Architecture
 
This reference implementation combines four REAA packages to build a reliability-operations layer for Azure AI applications:
 
- **@reaatech/circuit-breaker-agents** — isolate failing tool calls and prevent cascading failures
- **@reaatech/idempotency-middleware** — prevent duplicate side effects from retries
- **@reaatech/agent-runbook-agent** — automated incident runbook generation and failure-mode analysis
- **@reaatech/agent-runbook-mcp** — expose reliability tools as MCP server for AI agents
 
All coordination is orchestrated via **@trigger.dev/sdk** durable workflows, ensuring recovery steps survive server restarts. **Azure OpenAI** provides the model inference layer. **Langfuse** provides LLM observability and tracing.
 
## Quick Start
 
```bash
pnpm install
pnpm dev
```
 
### Required Environment Variables
 
Copy `.env.example` to `.env` and fill in the values:
 
| Variable | Description |
|----------|-------------|
| `AZURE_OPENAI_ENDPOINT` | Your Azure OpenAI endpoint URL |
| `AZURE_OPENAI_API_KEY` | Your Azure OpenAI API key |
| `AZURE_OPENAI_DEPLOYMENT` | Your deployment name |
| `TRIGGER_SECRET_KEY` | Your trigger.dev secret key |
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key (optional) |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key (optional) |
 
## API Endpoints
 
### Health Check
 
- `GET /api/health` — Overall suite health status (healthy/degraded/unhealthy)
- `POST /api/health` — Targeted health check for a specific service
 
### Dashboard
 
- `GET /api/dashboard/stats` — Aggregated statistics
- `GET /api/dashboard/incidents` — Incident list (optional `?status=` and `?severity=` filters)
- `POST /api/dashboard/incidents` — Create a new incident report
- `POST /api/workflows/trigger` — Trigger a workflow task
 
### Dashboard UI
 
Open `/dashboard` in the browser for the web interface.
 
## Workflow Flow
 
1. Health check detects a failure → triggers `detectIncidentTask`
2. Circuit breaker analysis checks if circuit is open
3. `resetCircuitTask` resets the circuit and generates a recovery plan
4. `rollbackTask` executes runbook rollback procedures
5. `retryTask` retries failed operations with exponential backoff
6. `escalateTask` marks incidents as escalated when max retries are exhausted
 
## Testing
 
```bash
pnpm typecheck   # TypeScript checks
pnpm lint        # ESLint
pnpm test        # Vitest with coverage
```
 
## License
 
MIT