# xAI Grok Reliability Suite for SMB AI Operations
 
> Proactively monitor, diagnose, and self-heal your AI agent operations with an automated reliability suite powered by xAI Grok.
 
## Problem
 
Small and medium-sized businesses (SMBs) often cannot afford dedicated site reliability engineers (SREs), yet agent failures still disrupt the business. This suite fills the gap with automated monitoring, fault isolation, runbook generation, and incident response, without requiring a dedicated operations team.
 
## Architecture
 
Four pillars compose the reliability suite:
 
- **xAI Grok analysis** (`@ai-sdk/xai`) — analyze agent logs and metrics for anomalies using Grok's structured output and repair fallbacks
- **Circuit breaker fault isolation** (`@reaatech/circuit-breaker-core`) — isolate failing services to prevent cascading failures across multi-agent systems
- **Runbook generation** (`@reaatech/agent-runbook`) — auto-generate operational runbooks from service definitions, alerts, and health checks
- **Automated incident response** (`@trigger.dev/sdk`) — trigger incident workflows when anomalies are detected, with circuit-state-aware rollback
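
The fault isolation pillar follows the usual closed / open / half-open circuit breaker pattern. The sketch below is a generic, self-contained illustration of that pattern and is not the `@reaatech/circuit-breaker-core` API. The class name `SimpleCircuitBreaker`, its methods, and the injectable clock are all hypothetical; the two constructor defaults mirror `CIRCUIT_BREAKER_FAILURE_THRESHOLD` and `CIRCUIT_BREAKER_RECOVERY_TIMEOUT_MS`.

```typescript
// Illustrative only: a generic circuit breaker, NOT the @reaatech/circuit-breaker-core API.
type CircuitState = "closed" | "open" | "half-open";

class SimpleCircuitBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number = 5, // cf. CIRCUIT_BREAKER_FAILURE_THRESHOLD
    private readonly recoveryTimeoutMs: number = 30_000, // cf. CIRCUIT_BREAKER_RECOVERY_TIMEOUT_MS
    private readonly now: () => number = () => Date.now(), // injectable clock (assumption, eases testing)
  ) {}

  getState(): CircuitState {
    // An open circuit becomes half-open once the recovery timeout has elapsed.
    if (this.state === "open" && this.now() - this.openedAt >= this.recoveryTimeoutMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  recordSuccess(): void {
    // Any success closes the circuit and resets the failure count.
    this.state = "closed";
    this.failures = 0;
  }

  recordFailure(): void {
    const current = this.getState();
    this.failures += 1;
    // A failed probe while half-open, or reaching the threshold, opens the circuit.
    if (current === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  // Wrap an async call: reject immediately while open, otherwise record the outcome.
  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.getState() === "open") {
      throw new Error("circuit open: call rejected");
    }
    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }
}
```

In a multi-agent setup, one breaker instance per downstream service keeps a single failing agent from dragging down the others, since callers fail fast instead of queueing behind it.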
 
## Quick Start
 
```bash
pnpm install
cp .env.example .env
# Fill in your API keys in .env
npx tsx src/cli/index.ts help
```
 
## CLI Reference
 
```bash
# Monitor a set of services at a polling interval
rel-ops monitor --config ./services.json --interval 5000
 
# Generate an operational runbook for a service
rel-ops generate-runbook --service agent-api --team my-team --platform kubernetes
```
 
## Configuration
 
| Variable | Description | Default |
|----------|-------------|---------|
| `XAI_API_KEY` | xAI Grok API key | — |
| `TRIGGER_SECRET_KEY` | Trigger.dev webhook secret | — |
| `TRIGGER_API_KEY` | Trigger.dev API key | — |
| `TRIGGER_API_URL` | Trigger.dev API base URL | `https://api.trigger.dev` |
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key for tracing | — |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key | — |
| `LANGFUSE_BASE_URL` | Langfuse API base URL | `https://cloud.langfuse.com` |
| `CIRCUIT_BREAKER_FAILURE_THRESHOLD` | Failure count before circuit opens | `5` |
| `CIRCUIT_BREAKER_RECOVERY_TIMEOUT_MS` | Milliseconds an open circuit waits before entering the half-open state | `30000` |
| `CIRCUIT_BREAKER_MAX_COST_PER_MINUTE` | Per-minute cost cap enforced by the circuit breaker | `0.50` |
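
As a rough illustration of how the numeric circuit breaker settings above could be read with their documented defaults, here is a minimal sketch. The `CircuitBreakerConfig` shape and the `readNumber` and `loadCircuitBreakerConfig` helpers are hypothetical names, not part of the suite. Rejecting negative or non-numeric values is likewise an assumption about sensible validation.

```typescript
// Hypothetical config shape; field names are illustrative, not the suite's actual types.
interface CircuitBreakerConfig {
  failureThreshold: number;
  recoveryTimeoutMs: number;
  maxCostPerMinute: number;
}

type Env = Record<string, string | undefined>;

// Parse a numeric variable, falling back to the documented default when it is unset or blank.
function readNumber(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value) || value < 0) {
    throw new Error(`${name} must be a non-negative number, got "${raw}"`);
  }
  return value;
}

// Defaults below are copied from the configuration table.
function loadCircuitBreakerConfig(env: Env): CircuitBreakerConfig {
  return {
    failureThreshold: readNumber(env, "CIRCUIT_BREAKER_FAILURE_THRESHOLD", 5),
    recoveryTimeoutMs: readNumber(env, "CIRCUIT_BREAKER_RECOVERY_TIMEOUT_MS", 30_000),
    maxCostPerMinute: readNumber(env, "CIRCUIT_BREAKER_MAX_COST_PER_MINUTE", 0.5),
  };
}
```

Under Node.js this would typically be called as `loadCircuitBreakerConfig(process.env)`. Passing the environment in as a parameter keeps the function easy to test.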
 
## Package Dependencies
 
| Package | Role |
|---------|------|
| `@ai-sdk/xai` | xAI Grok provider for the AI SDK |
| `@reaatech/agent-runbook` | Core types, schemas, and utilities for runbook generation |
| `@reaatech/agent-runbook-alerts` | Alert extraction and generation |
| `@reaatech/agent-runbook-health-checks` | Health check identification and Kubernetes probe YAML generation |
| `@reaatech/circuit-breaker-core` | Circuit breaker state machine and persistence |
| `@reaatech/circuit-breaker-agents` | Circuit breaker with AI agent optimizations |
| `@reaatech/structured-repair-core` | JSON repair and structured output recovery |
| `@trigger.dev/sdk` | Incident workflow orchestration |
| `langfuse` | Observability and tracing |
| `zod` | Runtime schema validation |
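
The table pairs structured model output with a JSON repair package, which suggests a parse, repair, then validate flow. The sketch below illustrates that general idea only and is not the `@reaatech/structured-repair-core` API. The `AnomalyReport` shape and every function name are hypothetical, and the repair step is deliberately naive: it trims surrounding prose and removes trailing commas, which can mangle string values that happen to contain `,}`.

```typescript
// Illustrative only: NOT the @reaatech/structured-repair-core API.
// Hypothetical shape for a model's anomaly verdict.
interface AnomalyReport {
  anomaly: boolean;
  summary: string;
}

// Runtime type guard; a schema library such as zod could play this role instead.
function isAnomalyReport(value: unknown): value is AnomalyReport {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.anomaly === "boolean" && typeof v.summary === "string";
}

// Naive repair: keep the outermost {...} span and drop trailing commas.
function repairJsonText(raw: string): string {
  let text = raw.trim();
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start !== -1 && end > start) text = text.slice(start, end + 1);
  return text.replace(/,\s*([}\]])/g, "$1");
}

// Try the raw text first, then the repaired text; return null if neither validates.
function parseAnomalyReport(raw: string): AnomalyReport | null {
  for (const candidate of [raw, repairJsonText(raw)]) {
    try {
      const parsed: unknown = JSON.parse(candidate);
      if (isAnomalyReport(parsed)) return parsed;
    } catch {
      // Not valid JSON; fall through to the next candidate.
    }
  }
  return null;
}
```

Returning `null` instead of throwing lets the caller decide whether an unparseable response should count as a failure for the circuit breaker.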
 
## Development
 
```bash
pnpm typecheck
pnpm lint
pnpm test
```
 
## License
 
MIT — see [LICENSE](./LICENSE).