# Vertex AI Reliability Suite for SMB Agent Operations
 
> Keep AI agents running reliably with automated circuit breakers, idempotent retries, and self-healing runbooks backed by Vertex AI.
 
AI agents run by small businesses often fail because of downstream tool outages, LLM hallucinations, and retry storms. This recipe combines four reliability layers for agents built on Vertex AI: circuit breakers, idempotency middleware, structured output repair, and automated runbook incident workflows. Inngest durable workflows orchestrate the layers.
 
## Reliability layers
 
- **Circuit breakers** (`@reaatech/circuit-breaker-agents`) — isolate failing tools with configurable failure thresholds and automatic recovery
- **Idempotency middleware** (`@reaatech/idempotency-middleware`) — prevent duplicate execution of side-effecting Vertex AI calls using idempotency keys
- **Structured output repair** (`@instructor-ai/instructor`) — validate and repair malformed LLM outputs against Zod schemas
- **Runbook incident workflows** (`@reaatech/agent-runbook-incident`) — SEV1–SEV4 incident response with escalation policies and communication templates
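
To make the first layer concrete, here is a minimal, dependency-free sketch of the circuit-breaker idea: stop calling a failing tool after a number of failures, then allow a trial call once a recovery timeout has passed. The class name `SimpleCircuitBreaker`, its constructor parameters, and the consecutive-failure counting are assumptions for illustration; this is not the API of `@reaatech/circuit-breaker-agents`, which may count failures differently (for example, within a time window).

```typescript
// Hypothetical sketch; names and behavior are NOT taken from
// @reaatech/circuit-breaker-agents.
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class SimpleCircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private failureThreshold: number;
  private recoveryTimeoutMs: number;
  private now: () => number;

  constructor(
    failureThreshold: number,
    recoveryTimeoutMs: number,
    now: () => number = () => Date.now(), // injectable clock for tests
  ) {
    this.failureThreshold = failureThreshold;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
    this.now = now;
  }

  getState(): BreakerState {
    // Once the recovery timeout has elapsed, let trial calls through.
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.recoveryTimeoutMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.getState() === "OPEN") {
      throw new Error("circuit open: call rejected without reaching the tool");
    }
    try {
      const result = await fn();
      // Any success closes the circuit and resets the failure count.
      this.state = "CLOSED";
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

A caller would wrap each tool invocation, for example `await breaker.call(() => callTool(args))`, where `callTool` stands in for whatever function reaches the downstream tool.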
 
## Prerequisites
 
- Node.js >=22, pnpm 10+
- GCP project with Vertex AI API enabled
- Supabase project (for incident records)
- Langfuse account (for LLM telemetry)
- Inngest account (for durable workflow orchestration)
- OpenAI API key (used by Instructor for structured output repair; `@instructor-ai/instructor` declares the `openai` SDK as a peer dependency)
 
## Quick start
 
```bash
pnpm install
cp .env.example .env        # fill in your credentials
pnpm dev                     # Next.js dev server
pnpm test                    # vitest run with coverage
pnpm typecheck               # TypeScript type checking
pnpm lint                    # ESLint
```
 
## API
 
### POST /api/runbook/webhook
 
Receives circuit-breaker state change alerts and triggers incident response workflows via Inngest.
 
**Request body:**
```json
{
  "circuitBreakerName": "vertex-tool-call",
  "state": "OPEN",
  "failureCount": 5,
  "timestamp": "2025-01-01T00:00:00Z"
}
```
 
**Response (200):**
```json
{
  "received": true,
  "severity": "SEV2",
  "incidentId": "uuid"
}
```
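
As an illustration of the request contract above, the following sketch validates an incoming body before any workflow is triggered. The function name `parseCircuitBreakerAlert` and the exact checks are assumptions based only on the example payload; the project itself keeps its schemas in `src/types/index.ts` as Zod schemas, which this dependency-free version merely approximates.

```typescript
// Hypothetical shape inferred from the example request body; the real
// schema lives in src/types/index.ts and may differ.
interface CircuitBreakerAlert {
  circuitBreakerName: string;
  state: string;
  failureCount: number;
  timestamp: string;
}

function parseCircuitBreakerAlert(body: unknown): CircuitBreakerAlert {
  if (typeof body !== "object" || body === null) {
    throw new Error("body must be a JSON object");
  }
  const { circuitBreakerName, state, failureCount, timestamp } =
    body as Record<string, unknown>;

  if (typeof circuitBreakerName !== "string" || circuitBreakerName.length === 0) {
    throw new Error("circuitBreakerName must be a non-empty string");
  }
  if (typeof state !== "string" || state.length === 0) {
    throw new Error("state must be a non-empty string");
  }
  if (typeof failureCount !== "number" || !Number.isInteger(failureCount) || failureCount < 0) {
    throw new Error("failureCount must be a non-negative integer");
  }
  // Date.parse returns NaN for strings it cannot interpret as a date.
  if (typeof timestamp !== "string" || Number.isNaN(Date.parse(timestamp))) {
    throw new Error("timestamp must be a parseable date string");
  }
  return { circuitBreakerName, state, failureCount, timestamp };
}
```

Rejecting malformed alerts at the edge keeps bad input from starting an incident workflow; a route handler could return a 400 response when this function throws.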
 
## Project structure
 
```
app/api/runbook/webhook/route.ts   — Next.js App Router webhook endpoint
src/types/index.ts                 — shared Zod schemas and TypeScript interfaces
src/services/vertex-client.ts      — Vertex AI GenerativeModel wrapper
src/services/circuit-breaker-service.ts — CircuitBreaker lifecycle manager
src/services/idempotency-service.ts     — IdempotencyMiddleware wrapper
src/services/structured-output.ts       — Instructor-based output repair
src/services/runbook-service.ts         — Agent-runbook-incident wrapper
src/middleware/reliability.ts           — Composed reliability middleware
src/workflows/retry-orchestrator.ts     — Inngest durable workflow
src/lib/supabase.ts              — Supabase client
src/lib/langfuse.ts              — Langfuse telemetry
src/lib/pricing.ts               — Gemini pricing calculator
tests/                           — vitest suite (mirrors src/)
```
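
The structure above lists `src/services/idempotency-service.ts` as a wrapper around the idempotency middleware. As a rough picture of what such a layer does, this sketch remembers the result for each idempotency key until a TTL expires, so a repeated request with the same key does not run the side effect again. `IdempotencyCache` and its in-memory `Map` are hypothetical; they are not the API of `@reaatech/idempotency-middleware`, and a real deployment would likely need shared storage rather than process memory.

```typescript
// Hypothetical in-memory idempotency guard; NOT the API of
// @reaatech/idempotency-middleware.
class IdempotencyCache<T> {
  private entries = new Map<string, { expiresAt: number; result: Promise<T> }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock for tests
  }

  run(key: string, fn: () => Promise<T>): Promise<T> {
    const existing = this.entries.get(key);
    if (existing && existing.expiresAt > this.now()) {
      // Duplicate request: reuse the in-flight or completed result.
      return existing.result;
    }
    const result = fn();
    this.entries.set(key, { expiresAt: this.now() + this.ttlMs, result });
    // Forget failed calls so a retry with the same key can run again.
    result.catch(() => {
      if (this.entries.get(key)?.result === result) {
        this.entries.delete(key);
      }
    });
    return result;
  }
}
```

Storing the promise rather than the resolved value means two concurrent requests with the same key share one execution instead of racing each other.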
 
## Environment variables
 
| Variable | Description |
|---|---|
| `GOOGLE_CLOUD_PROJECT` | GCP project ID for Vertex AI |
| `GOOGLE_CLOUD_LOCATION` | GCP region (e.g. us-central1) |
| `GOOGLE_APPLICATION_CREDENTIALS` | Path to GCP service account JSON |
| `SUPABASE_URL` | Supabase project URL |
| `SUPABASE_ANON_KEY` | Supabase anonymous key |
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key for LLM telemetry |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key |
| `LANGFUSE_HOST` | Langfuse host (default: `https://cloud.langfuse.com`) |
| `INNGEST_EVENT_KEY` | Inngest event key for durable workflow orchestration |
| `INNGEST_SIGNING_KEY` | Inngest signing key |
| `OPENAI_API_KEY` | OpenAI API key used by Instructor (`@instructor-ai/instructor` has the `openai` SDK as a peer dependency) |
| `RELIABILITY_CIRCUIT_BREAKER_THRESHOLD` | Failure count before circuit opens |
| `RELIABILITY_CIRCUIT_BREAKER_WINDOW_MS` | Time window for failure counting (ms) |
| `RELIABILITY_IDEMPOTENCY_TTL_MS` | Idempotency key TTL (ms) |
| `RELIABILITY_MAX_RETRIES` | Maximum retry attempts |
| `RELIABILITY_CONCURRENCY_LIMIT` | Maximum concurrent operations |
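
The `RELIABILITY_*` variables in the table are all numeric, so it can help to parse and validate them once at startup. The sketch below shows one way to do that. The names `ReliabilityConfig` and `loadReliabilityConfig` are hypothetical, and the fallback numbers are placeholders for illustration, not the project's actual defaults.

```typescript
interface ReliabilityConfig {
  circuitBreakerThreshold: number;
  circuitBreakerWindowMs: number;
  idempotencyTtlMs: number;
  maxRetries: number;
  concurrencyLimit: number;
}

type Env = Record<string, string | undefined>;

// Reads an integer variable, falling back when it is unset or blank and
// throwing when it is set to something that is not an integer >= min.
function readInt(env: Env, name: string, fallback: number, min: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const value = Number(raw);
  if (!Number.isInteger(value) || value < min) {
    throw new Error(`${name} must be an integer >= ${min}, got "${raw}"`);
  }
  return value;
}

function loadReliabilityConfig(env: Env): ReliabilityConfig {
  // Fallback values below are illustrative placeholders only.
  return {
    circuitBreakerThreshold: readInt(env, "RELIABILITY_CIRCUIT_BREAKER_THRESHOLD", 5, 1),
    circuitBreakerWindowMs: readInt(env, "RELIABILITY_CIRCUIT_BREAKER_WINDOW_MS", 60000, 1),
    idempotencyTtlMs: readInt(env, "RELIABILITY_IDEMPOTENCY_TTL_MS", 86400000, 1),
    maxRetries: readInt(env, "RELIABILITY_MAX_RETRIES", 3, 0),
    concurrencyLimit: readInt(env, "RELIABILITY_CONCURRENCY_LIMIT", 10, 1),
  };
}
```

In a Node.js process this would be called as `loadReliabilityConfig(process.env)`. Failing fast on a malformed value avoids a silent `NaN` threshold that would keep a circuit from ever opening.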
 
## Running tests
 
```bash
pnpm test        # vitest run with coverage (requires 90%+ on all metrics)
pnpm typecheck   # TypeScript strict type checking
pnpm lint        # ESLint
```
 
## License
 
MIT — see [LICENSE](./LICENSE).