Vertex AI Reliability Suite for SMB Agent Operations

Keep AI agents running reliably with automated circuit breakers, idempotent retries, and self-healing runbooks backed by Vertex AI.

vertex-ai reliability circuit-breaker idempotency structured-output runbook inngest nextjs hono typescript

The problem

Small-business AI agents regularly fail because of downstream tool outages, LLM hallucinations, and retry storms. Without a dedicated SRE team, these failures lead to customer-facing disruptions and uncontrolled costs.

Intro

In this tutorial you’ll build a Vertex AI reliability suite that gives small-business AI agents production-grade fault tolerance without a dedicated SRE team. You’ll layer automated circuit breakers over Vertex AI model calls to isolate failing tools, add idempotency middleware to prevent duplicate side effects, wire up structured output repair that fixes malformed LLM responses against Zod schemas, and connect it all to runbook incident workflows with severity-based escalation. By the end you’ll have a reusable reliability middleware, a webhook endpoint that reacts to circuit-breaker state changes, and an Inngest durable workflow that orchestrates backoff, health-check recovery, and escalation — all backed by Gemini models and fully tested.
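To make the circuit-breaker idea concrete before you start, here is a minimal sketch of the CLOSED, OPEN, and HALF_OPEN state machine. It is not the tutorial's circuit-breaker-service: the class name `SimpleCircuitBreaker`, its options, and the injectable `now` clock are all assumptions made for illustration.

```typescript
// Minimal circuit breaker sketch. All names are illustrative assumptions,
// not the API of the tutorial's circuit-breaker-service.
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN"

interface BreakerOptions {
  failureThreshold: number // consecutive failures before the breaker opens
  resetTimeoutMs: number // how long to stay OPEN before allowing a probe
  now?: () => number // injectable clock, useful in tests
}

class SimpleCircuitBreaker {
  private state: BreakerState = "CLOSED"
  private failures = 0
  private openedAt = 0
  private readonly failureThreshold: number
  private readonly resetTimeoutMs: number
  private readonly now: () => number

  constructor(opts: BreakerOptions) {
    this.failureThreshold = opts.failureThreshold
    this.resetTimeoutMs = opts.resetTimeoutMs
    this.now = opts.now ?? (() => Date.now())
  }

  getState(): BreakerState {
    // Move OPEN -> HALF_OPEN lazily once the reset timeout has elapsed.
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "HALF_OPEN"
    }
    return this.state
  }

  recordSuccess(): void {
    this.failures = 0
    this.state = "CLOSED"
  }

  recordFailure(): void {
    this.failures += 1
    // A failed probe in HALF_OPEN reopens the breaker immediately.
    if (this.getState() === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN"
      this.openedAt = this.now()
    }
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.getState() === "OPEN") {
      throw new Error("Circuit breaker is open; call rejected without reaching the tool")
    }
    try {
      const result = await fn()
      this.recordSuccess()
      return result
    } catch (error) {
      this.recordFailure()
      throw error
    }
  }
}
```

While the breaker is OPEN, `execute` fails fast instead of calling the tool, which is what keeps one failing dependency from dragging down the rest of the agent. This sketch allows any number of probes in HALF_OPEN; a production breaker would usually limit that to one at a time.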

Prerequisites

Node.js >=22 and pnpm 10.x (the project uses pnpm workspaces)
A Google Cloud project with the Vertex AI API enabled
A Supabase project (stores incident records and recovery actions)
A Langfuse account for LLM telemetry
An Inngest account for durable workflow orchestration
An OpenAI API key (used by @instructor-ai/instructor for structured output repair)
Familiarity with TypeScript, Next.js App Router route handlers, and basic Zod schema definitions

Step 1: Scaffold the project and install dependencies

Start with a fresh Next.js project using the App Router. The scaffold provides the TypeScript config, ESLint, Vitest with coverage, and a clean lockfile. Once the shell is in place, install all the reliability packages.

terminal

pnpm

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

172 kB · 71 tests · 98.4% coverage · vitest passing

SHA-256: f50a7f1fb503a536932bb067eaa40fbb9baf5cb9e4eceef4aeb471dc1397081e

// Reliability middleware: composes idempotency, circuit breaking, and
// structured output repair around a Vertex AI call.
import { circuitBreakerService } from "../services/circuit-breaker-service.js"
import { idempotencyService } from "../services/idempotency-service.js"
import { outputRepair } from "../services/structured-output.js"
import { vertexClient } from "../services/vertex-client.js"
import type { ReliabilityConfig, ExecutionResult } from "../types/index.js"
import pRetry from "p-retry"
import pLimit from "p-limit"
import { z } from "zod"

// Re-exported so callers can abort a retry loop without importing p-retry.
export { AbortError } from "p-retry"

type AnyZodObject = z.ZodObject

export class ReliabilityMiddleware {
  // Layers, outermost first: idempotency -> circuit breaker -> Vertex AI call,
  // followed by schema repair of the raw model text.
  async callWithReliability<T>(
    opts: {
      toolName: string
      idempotencyKey: string
      prompt: string
      outputSchema: AnyZodObject
      config?: ReliabilityConfig // accepted but not read in this snippet
    },
  ): Promise<ExecutionResult<T>> {
    const context = { toolName: opts.toolName, prompt: opts.prompt }
    try {
      const result = await idempotencyService.executeWithIdempotency(
        opts.idempotencyKey,
        context,
        async () => {
          const rawText = await circuitBreakerService.executeWithBreaker(
            opts.toolName,
            async () => vertexClient.generateContent(opts.toolName, opts.prompt),
          )
          const repaired = await outputRepair.repair<T>(rawText, opts.outputSchema)
          return repaired
        },
      )
      return { success: true, data: result }
    } catch (error) {
      // Errors are returned as a result object rather than thrown.
      return { success: false, error: String(error) }
    }
  }

  // Retries fn with p-retry; toolName is not used in this snippet.
  async callWithRetry<T>(
    toolName: string,
    fn: () => Promise<T>,
    opts?: { maxRetries?: number; signal?: AbortSignal },
  ): Promise<T> {
    return pRetry(fn, {
      retries: opts?.maxRetries ?? 3,
      signal: opts?.signal,
    })
  }

  // Runs all tasks with at most `limit` (default 5) in flight at once.
  async callWithConcurrencyLimit<T>(
    tasks: (() => Promise<T>)[],
    limit?: number,
  ): Promise<T[]> {
    const limiter = pLimit(limit ?? 5)
    const wrapped = tasks.map((task) => limiter(task))
    return Promise.all(wrapped)
  }
}

export const reliabilityMiddleware = new ReliabilityMiddleware()
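The middleware above relies on `idempotencyService.executeWithIdempotency`, whose implementation is not shown here. As a rough illustration of the idea only, the sketch below deduplicates calls by key in memory. The class name `InMemoryIdempotencyStore` is hypothetical, it omits the `context` argument the real service receives, and a production version would presumably persist keys in a shared store rather than a `Map`.

```typescript
// In-memory idempotency sketch. The tutorial's idempotency-service is not
// shown in this section, so this class and its behavior are assumptions.
class InMemoryIdempotencyStore {
  private readonly results = new Map<string, Promise<unknown>>()

  // Runs fn at most once per key. Later calls with the same key share the
  // first call's promise instead of repeating the side effect.
  executeWithIdempotency<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const existing = this.results.get(key)
    if (existing) return existing as Promise<T>

    const pending = fn()
    this.results.set(key, pending)

    // Forget failed attempts so a later retry with the same key can run.
    pending.catch(() => {
      if (this.results.get(key) === pending) this.results.delete(key)
    })
    return pending
  }
}

// Usage: the second call reuses the first result, so the customer is
// charged once even if the agent retries.
const store = new InMemoryIdempotencyStore()
let chargeCount = 0
const chargeCustomer = async (): Promise<string> => {
  chargeCount += 1
  return "charged"
}
const first = store.executeWithIdempotency("order-1", chargeCustomer)
const second = store.executeWithIdempotency("order-1", chargeCustomer)
```

Storing the promise rather than the resolved value means concurrent duplicates are also collapsed, not just sequential ones. The trade-off is that entries never expire in this sketch; a real store would need a TTL.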

// Webhook route: reacts to circuit-breaker state changes and starts the
// incident workflow when a breaker opens.
import { type NextRequest, NextResponse } from "next/server"
import type { WebhookPayload } from "../../../../src/types/index.js"
import { runbookService } from "../../../../src/services/runbook-service.js"
import { inngestClient } from "../../../../src/workflows/retry-orchestrator.js"
import type { CommunicationTemplate } from "@reaatech/agent-runbook"

export async function POST(req: NextRequest): Promise<NextResponse> {
  let body: WebhookPayload
  try {
    body = (await req.json()) as WebhookPayload
  } catch {
    return NextResponse.json({ error: "Invalid JSON body" }, { status: 400 })
  }

  // Reject payloads whose state is not a known breaker state.
  const validStates: ReadonlyArray<string> = ["OPEN", "HALF_OPEN", "CLOSED"]
  if (!validStates.includes(body.state)) {
    return NextResponse.json(
      { error: `Invalid state: ${body.state}. Must be one of ${validStates.join(", ")}` },
      { status: 400 },
    )
  }

  const severity = runbookService.determineSeverity(body.failureCount)

  // Only an OPEN breaker creates an incident and triggers the workflow.
  if (body.state === "OPEN") {
    const incidentId = crypto.randomUUID()

    runbookService.evaluateIncident(
      {
        serviceDefinition: { name: body.circuitBreakerName },
        repositoryAnalysis: {
          serviceType: "web-api",
          language: "typescript",
          framework: "express",
          structure: {
            mainDirectories: ["src"],
            fileCount: 10,
            depth: 2,
            hasTests: true,
            hasDockerfile: false,
            hasKubernetesManifests: false,
            hasTerraform: false,
          },
          configFiles: ["package.json"],
          entryPoints: [{ file: "src/index.ts", type: "http_server" }],
          externalServices: [],
        },
        dependencyAnalysis: { directDeps: [], transitiveDeps: [], dependencyGraph: [], externalServices: [] },
        deploymentPlatform: "cloud-run",
        monitoringPlatform: "prometheus",
        externalServices: [],
      },
      { serviceName: body.circuitBreakerName, teamName: "platform-engineering", escalationContacts: [] },
    )

    const templates = runbookService.getNotificationTemplates("incident-notification")
    const appliedTemplates = templates.map((t: CommunicationTemplate) =>
      runbookService.applyTemplate(t, {
        serviceName: body.circuitBreakerName,
        severity,
        incidentId,
        timestamp: body.timestamp,
      }),
    )

    // Hand off to the Inngest workflow for backoff, recovery, and escalation.
    await inngestClient.send({
      name: "reliability/circuit-breaker-tripped",
      data: {
        circuitBreakerName: body.circuitBreakerName,
        state: body.state,
        failureCount: body.failureCount,
        severity,
        incidentId,
        timestamp: body.timestamp,
        metadata: body.metadata,
      },
    })

    return NextResponse.json({
      received: true,
      severity,
      incidentId,
      templates: appliedTemplates,
    })
  }

  return NextResponse.json({ received: true, severity, incidentId: crypto.randomUUID() })
}

// Simple liveness check for the webhook endpoint.
export function GET(_req: NextRequest): NextResponse {
  return NextResponse.json({ status: "ok" })
}
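The route above casts the request body to `WebhookPayload` and only checks `state` at runtime, and it delegates severity to `runbookService.determineSeverity`, which is not shown. The sketch below illustrates one way both pieces might look. The payload shape is inferred from the fields the handler reads (with `timestamp` assumed to be a string), and the function names and severity thresholds are hypothetical, not the runbook service's actual rules.

```typescript
// Hypothetical payload shape, inferred from the fields the route reads.
type CircuitState = "OPEN" | "HALF_OPEN" | "CLOSED"

interface WebhookPayloadSketch {
  circuitBreakerName: string
  state: CircuitState
  failureCount: number
  timestamp: string // assumption: an ISO 8601 string
  metadata?: Record<string, unknown>
}

type Severity = "low" | "medium" | "high" | "critical"

const VALID_STATES: readonly string[] = ["OPEN", "HALF_OPEN", "CLOSED"]

// Runtime type guard: checks every required field, not just `state`.
function isWebhookPayload(value: unknown): value is WebhookPayloadSketch {
  if (typeof value !== "object" || value === null) return false
  const { circuitBreakerName, state, failureCount, timestamp } =
    value as Record<string, unknown>
  return (
    typeof circuitBreakerName === "string" &&
    typeof state === "string" &&
    VALID_STATES.includes(state) &&
    typeof failureCount === "number" &&
    Number.isInteger(failureCount) &&
    failureCount >= 0 &&
    typeof timestamp === "string"
  )
}

// Illustrative thresholds only; the real mapping lives in the runbook service.
function severityFromFailures(failureCount: number): Severity {
  if (failureCount >= 20) return "critical"
  if (failureCount >= 10) return "high"
  if (failureCount >= 5) return "medium"
  return "low"
}
```

With a guard like this, a body of `null` or one missing `failureCount` would be rejected with a 400 before any property is read, instead of throwing inside the handler.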

Intro

Prerequisites

Step 1: Scaffold the project and install dependencies

Step 2: Configure environment variables

Step 3: Create shared types with Zod

Step 4: Create the Vertex AI client

Step 5: Create the circuit breaker service

Step 6: Create the idempotency service

Step 7: Create structured output repair and runbook services

Step 8: Create supporting libraries

Step 9: Compose the reliability middleware

Step 10: Create the Inngest retry orchestrator

Step 11: Wire up the webhook API route

Step 12: Run the tests

Next steps