# Vertex AI Reliability Suite for SMB Agent Operations
> Keep AI agents running reliably with automated circuit breakers, idempotent retries, and self-healing runbooks backed by Vertex AI.
AI agents run by small businesses often fail because of downstream tool outages, LLM hallucinations, and retry storms. This recipe combines four reliability layers (circuit breakers, idempotency middleware, structured output repair, and automated runbook incident workflows), orchestrated with Inngest durable workflows for agents built on Vertex AI.
## Reliability layers
- **Circuit breakers** (`@reaatech/circuit-breaker-agents`) — isolate failing tools with configurable failure thresholds and automatic recovery
- **Idempotency middleware** (`@reaatech/idempotency-middleware`) — prevent duplicate execution of side-effecting Vertex AI calls using idempotency keys
- **Structured output repair** (`@instructor-ai/instructor`) — validate and repair malformed LLM outputs against Zod schemas
- **Runbook incident workflows** (`@reaatech/agent-runbook-incident`) — SEV1–SEV4 incident response with escalation policies and communication templates
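
To illustrate the first layer, here is a minimal circuit breaker sketch. It does not use the `@reaatech/circuit-breaker-agents` API, which this README does not document. `SimpleCircuitBreaker` and its option names are hypothetical, and the sketch counts consecutive failures rather than failures inside a time window.

```typescript
// Hypothetical sketch of the circuit breaker pattern; not the library's API.
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface BreakerOptions {
  failureThreshold: number; // consecutive failures before the circuit opens
  recoveryTimeoutMs: number; // time to stay OPEN before allowing one probe call
  now?: () => number; // injectable clock, useful in tests
}

class SimpleCircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(opts: BreakerOptions) {
    this.opts = opts;
    this.now = opts.now ?? Date.now;
  }

  getState(): BreakerState {
    // An OPEN circuit becomes HALF_OPEN once the recovery timeout has elapsed.
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.opts.recoveryTimeoutMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    const stateBefore = this.getState();
    if (stateBefore === "OPEN") {
      throw new Error("circuit open: call rejected without reaching the tool");
    }
    try {
      const result = await fn();
      // Any success closes the circuit and resets the failure count.
      this.failures = 0;
      this.state = "CLOSED";
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed probe, or too many consecutive failures, opens the circuit.
      if (stateBefore === "HALF_OPEN" || this.failures >= this.opts.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

A caller would wrap each tool invocation, for example `await breaker.execute(() => callTool(input))`, where `callTool` stands in for whatever function reaches the downstream tool.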
## Prerequisites
- Node.js 22+, pnpm 10+
- GCP project with Vertex AI API enabled
- Supabase project (for incident records)
- Langfuse account (for LLM telemetry)
- Inngest account (for durable workflow orchestration)
- OpenAI API key (for Instructor structured output repair; `@instructor-ai/instructor` relies on the `openai` client as a peer dependency)
## Quick start
```bash
pnpm install
cp .env.example .env # fill in your credentials
pnpm dev # Next.js dev server
pnpm test # vitest run with coverage
pnpm typecheck # TypeScript type checking
pnpm lint # ESLint
```
## API
### POST /api/runbook/webhook
Receives circuit-breaker state change alerts and triggers incident response workflows via Inngest.
**Request body:**
```json
{
  "circuitBreakerName": "vertex-tool-call",
  "state": "OPEN",
  "failureCount": 5,
  "timestamp": "2025-01-01T00:00:00Z"
}
```
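
A handler for this endpoint would normally validate the body before acting on it. The project's own Zod schemas live in `src/types/index.ts` and are not shown in this README, so the following is only a dependency-free sketch of the same idea. `BreakerAlert` and `isBreakerAlert` are hypothetical names, and the list of allowed states is an assumption (only `"OPEN"` appears above).

```typescript
// Assumed set of states; this README only shows "OPEN".
const BREAKER_STATES = ["CLOSED", "OPEN", "HALF_OPEN"] as const;
type BreakerStateName = (typeof BREAKER_STATES)[number];

// Hypothetical type mirroring the example request body.
interface BreakerAlert {
  circuitBreakerName: string;
  state: BreakerStateName;
  failureCount: number;
  timestamp: string;
}

// Type guard: returns true only when every field has the expected shape.
function isBreakerAlert(value: unknown): value is BreakerAlert {
  if (typeof value !== "object" || value === null) return false;
  const { circuitBreakerName, state, failureCount, timestamp } = value as Record<string, unknown>;
  return (
    typeof circuitBreakerName === "string" &&
    circuitBreakerName.length > 0 &&
    typeof state === "string" &&
    BREAKER_STATES.some((s) => s === state) &&
    typeof failureCount === "number" &&
    Number.isInteger(failureCount) &&
    failureCount >= 0 &&
    typeof timestamp === "string" &&
    !Number.isNaN(Date.parse(timestamp))
  );
}
```

In a route handler, a failed check would typically map to a `400` response before any workflow is triggered.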
**Response (200):**
```json
{
  "received": true,
  "severity": "SEV2",
  "incidentId": "uuid"
}
```
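
Alert senders often retry a delivery after a timeout, so a webhook like this one may receive the same alert more than once. Whether the project deduplicates at this point is not stated here. The sketch below only illustrates the idempotency-key idea from the reliability layers, using a TTL in the spirit of `RELIABILITY_IDEMPOTENCY_TTL_MS`. `IdempotencyCache` and `alertKey` are hypothetical names, the store is in-memory, and none of this is the `@reaatech/idempotency-middleware` API.

```typescript
// Hypothetical in-memory idempotency cache keyed by an idempotency key.
class IdempotencyCache<T> {
  private readonly entries = new Map<string, { value: Promise<T>; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Runs `fn` at most once per key within the TTL; repeat calls share the first result.
  run(key: string, fn: () => Promise<T>): Promise<T> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value;

    const value = fn();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    // Failures are not cached, so a later retry can run the work again.
    value.catch(() => {
      if (this.entries.get(key)?.value === value) this.entries.delete(key);
    });
    return value;
  }
}

// Builds a key from the fields that identify one state change.
// Treating these three fields as the duplicate criterion is an assumption.
function alertKey(alert: { circuitBreakerName: string; state: string; timestamp: string }): string {
  return `${alert.circuitBreakerName}:${alert.state}:${alert.timestamp}`;
}
```

With this shape, two deliveries of the same alert inside the TTL resolve to the same result, so only one incident would be created. A multi-instance deployment would need a shared store rather than a `Map`.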
## Project structure
```
app/api/runbook/webhook/route.ts         # Next.js App Router webhook endpoint
src/types/index.ts                       # shared Zod schemas and TypeScript interfaces
src/services/vertex-client.ts            # Vertex AI GenerativeModel wrapper
src/services/circuit-breaker-service.ts  # CircuitBreaker lifecycle manager
src/services/idempotency-service.ts      # IdempotencyMiddleware wrapper
src/services/structured-output.ts        # Instructor-based output repair
src/services/runbook-service.ts          # Agent-runbook-incident wrapper
src/middleware/reliability.ts            # Composed reliability middleware
src/workflows/retry-orchestrator.ts      # Inngest durable workflow
src/lib/supabase.ts                      # Supabase client
src/lib/langfuse.ts                      # Langfuse telemetry
src/lib/pricing.ts                       # Gemini pricing calculator
tests/                                   # vitest suite (mirrors src/)
```
## Environment variables
| Variable | Description |
|---|---|
| `GOOGLE_CLOUD_PROJECT` | GCP project ID for Vertex AI |
| `GOOGLE_CLOUD_LOCATION` | GCP region (e.g. us-central1) |
| `GOOGLE_APPLICATION_CREDENTIALS` | Path to GCP service account JSON |
| `SUPABASE_URL` | Supabase project URL |
| `SUPABASE_ANON_KEY` | Supabase anonymous key |
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key for LLM telemetry |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key |
| `LANGFUSE_HOST` | Langfuse host (default: https://cloud.langfuse.com) |
| `INNGEST_EVENT_KEY` | Inngest event key for durable workflow orchestration |
| `INNGEST_SIGNING_KEY` | Inngest signing key |
| `OPENAI_API_KEY` | OpenAI API key for the `openai` client (peer dependency of `@instructor-ai/instructor`) |
| `RELIABILITY_CIRCUIT_BREAKER_THRESHOLD` | Failure count before circuit opens |
| `RELIABILITY_CIRCUIT_BREAKER_WINDOW_MS` | Time window for failure counting (ms) |
| `RELIABILITY_IDEMPOTENCY_TTL_MS` | Idempotency key TTL (ms) |
| `RELIABILITY_MAX_RETRIES` | Maximum retry attempts |
| `RELIABILITY_CONCURRENCY_LIMIT` | Maximum concurrent operations |
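
The `RELIABILITY_*` variables are numeric, so they need parsing and range checks before use. The sketch below shows one way to do that. `loadReliabilityConfig`, `readInt`, and the fallback values are assumptions made for this example and are not the project's actual defaults.

```typescript
type Env = Record<string, string | undefined>;

interface ReliabilityConfig {
  circuitBreakerThreshold: number;
  circuitBreakerWindowMs: number;
  idempotencyTtlMs: number;
  maxRetries: number;
  concurrencyLimit: number;
}

// Placeholder fallbacks chosen for this sketch only.
const FALLBACKS: ReliabilityConfig = {
  circuitBreakerThreshold: 5,
  circuitBreakerWindowMs: 60_000,
  idempotencyTtlMs: 300_000,
  maxRetries: 3,
  concurrencyLimit: 10,
};

// Reads an integer variable, falling back when unset and rejecting bad values.
function readInt(env: Env, name: string, fallback: number, min: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < min) {
    throw new Error(`${name} must be an integer >= ${min}, got "${raw}"`);
  }
  return parsed;
}

function loadReliabilityConfig(env: Env): ReliabilityConfig {
  return {
    circuitBreakerThreshold: readInt(env, "RELIABILITY_CIRCUIT_BREAKER_THRESHOLD", FALLBACKS.circuitBreakerThreshold, 1),
    circuitBreakerWindowMs: readInt(env, "RELIABILITY_CIRCUIT_BREAKER_WINDOW_MS", FALLBACKS.circuitBreakerWindowMs, 1),
    idempotencyTtlMs: readInt(env, "RELIABILITY_IDEMPOTENCY_TTL_MS", FALLBACKS.idempotencyTtlMs, 1),
    // Zero retries is allowed here; the other settings must be at least 1.
    maxRetries: readInt(env, "RELIABILITY_MAX_RETRIES", FALLBACKS.maxRetries, 0),
    concurrencyLimit: readInt(env, "RELIABILITY_CONCURRENCY_LIMIT", FALLBACKS.concurrencyLimit, 1),
  };
}
```

In a Node.js process this would be called once at startup as `loadReliabilityConfig(process.env)`, so a malformed value fails fast instead of surfacing mid-request.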
## Running tests
```bash
pnpm test # vitest run with coverage (requires 90%+ on all metrics)
pnpm typecheck # TypeScript strict type checking
pnpm lint # ESLint
```
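
The 90% coverage requirement is usually enforced through Vitest's coverage thresholds. The project's actual `vitest.config.ts` is not shown in this README, so the fragment below is only a sketch of how such a gate might look, assuming Vitest 1.x or later (where thresholds sit under `coverage.thresholds`).

```typescript
// vitest.config.ts (illustrative fragment, not the project's real config)
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    coverage: {
      // The run fails when any metric drops below 90%.
      thresholds: {
        lines: 90,
        functions: 90,
        branches: 90,
        statements: 90,
      },
    },
  },
});
```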
## License
MIT — see [LICENSE](./LICENSE).