Vercel AI Gateway Reliability Suite for SMB AI Operations
A self‑serve reliability dashboard that monitors, replays, and self‑heals AI workflows running through Vercel AI Gateway, so small teams can keep LLM apps running 24/7.
SMBs deploying LLM features on Vercel have no visibility into why a response was slow, failed, or drifted. Without replays, health checks, and incident runbooks, a weekend spike in errors means lost revenue and frantic debugging on Monday.
A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.
You’ll build a self-serve reliability dashboard that monitors, replays, and self-heals AI workflows running through Vercel AI Gateway. By the end of this tutorial, you’ll have a Next.js application with health probes, incident workflows, replay tracing of every LLM call, service dependency maps, and rollback procedures — all wired together into a single admin UI. Every LLM interaction is recorded for replay, health checks run on a schedule, and incidents auto-trigger when anomalies are detected.
Prerequisites
Node.js >= 22 (the project sets "engines": { "node": ">=22" } in package.json)
pnpm 10.x (the project declares "packageManager": "pnpm@10.0.0" — enable it with corepack enable)
A Vercel AI Gateway API key or any provider key the ai SDK supports (set as AI_PROVIDER_API_KEY in your .env)
Assumed knowledge: Familiarity with TypeScript, Next.js App Router routing, and basic React hooks. You should be comfortable with pnpm, fetch, and async/await patterns.
Step 1: Scaffold and configure the Next.js project
Start by creating the project directory and its core configuration files. These files define your dependencies, TypeScript settings, testing infrastructure, ESLint rules, and Next.js runtime behavior.
Create the project at ./vercel-ai-gateway-reliability-suite-for-smb-ai-operations/:
Now write the package.json with all the dependencies this recipe needs — the ai SDK for Vercel AI Gateway calls, REAA operational packages for health checks, incidents, rollbacks, service maps, and replay traces, plus Trigger.dev for scheduled jobs:
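The paragraph above names every package this recipe depends on. As a hedged sketch, a package.json along these lines would satisfy it; the project name and all version ranges below are illustrative placeholders, not values from the recipe:

```json
{
  "name": "vercel-ai-gateway-reliability-suite",
  "private": true,
  "engines": { "node": ">=22" },
  "packageManager": "pnpm@10.0.0",
  "dependencies": {
    "ai": "^4.0.0",
    "@reaatech/agent-replay-core": "^1.0.0",
    "@reaatech/agent-runbook-health-checks": "^1.0.0",
    "@reaatech/agent-runbook-incident": "^1.0.0",
    "@reaatech/agent-runbook-service-map": "^1.0.0",
    "@reaatech/agent-runbook-rollback": "^1.0.0",
    "@trigger.dev/sdk": "^3.0.0",
    "next": "^16.0.0",
    "react": "^19.0.0",
    "react-dom": "^19.0.0"
  },
  "devDependencies": {
    "typescript": "^5.0.0",
    "vitest": "^2.0.0",
    "@vitest/coverage-v8": "^2.0.0"
  }
}
```

The engines and packageManager fields match the prerequisites above; pin exact versions once your lockfile exists.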
The next.config.ts enables the instrumentation hook so that src/instrumentation.ts fires at startup and can initialize the AI client with its replay interceptors. The key experimental.instrumentationHook: true must be spelled exactly as shown; misspell it and Next.js silently ignores the option, so your startup code never runs:
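Based on that description, a minimal next.config.ts would look like this (assuming the project uses only the experimental flag named above):

```typescript
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  experimental: {
    // Must match this exact key, or src/instrumentation.ts never runs.
    instrumentationHook: true,
  },
};

export default nextConfig;
```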
Step 2: Install dependencies and set environment variables
Install everything with pnpm, then create your environment files. The .env.example file serves as a template; you’ll copy it to .env and fill in your own values.
Expected output: pnpm install resolves all dependencies and devDependencies and creates pnpm-lock.yaml and node_modules/. No errors.
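The environment variables referenced throughout this recipe can be collected into an .env.example like the following; every value shown is a placeholder to replace in your .env:

```shell
# Provider key for the ai SDK / Vercel AI Gateway
AI_PROVIDER_API_KEY=your-gateway-or-provider-key

# Directory where recorded LLM traces are persisted for replay
REPLAY_TRACE_DIR=./traces

# Repository scanned for health checks and the service map
SERVICE_MAP_REPO_PATH=./

# Git remote used when generating rollback procedures
ROLLBACK_GIT_REMOTE=./

# Base URL the health worker probes
HEALTH_ENDPOINT_BASE=http://localhost:3000

# Trigger.dev API token for scheduled tasks
TRIGGER_TOKEN=tr_dev_placeholder

# On-call contact for incident escalation
INCIDENT_ESCALATION_RECIPIENT=oncall@example.com
```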
Step 3: Define shared types
Type definitions keep every module speaking the same language. Create a src/types/ directory with four files: health, incident, replay, and rollback types. You'll also add a type shim for Next.js 16's server types.
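As a sketch of what src/types/health.ts might contain: the shape below is inferred from how health-checker.ts uses these types later in this recipe (probe fields like expectedStatus, timeout, and interval, and the three-state OverallHealth), so treat the exact field names as assumptions:

```typescript
// Sketch of src/types/health.ts. Field names are inferred from later steps
// (expectedStatus: 200, timeout: 5000, interval: 30, status "down"/"degraded").
export type OverallHealth = "healthy" | "degraded" | "down";

export interface HealthProbeDefinition {
  name: string;           // e.g. "probe-/api/health"
  endpoint: string;       // path or URL to probe
  expectedStatus: number; // HTTP status that counts as healthy
  timeout: number;        // ms before the probe counts as down
  interval: number;       // seconds between probe runs
}

export interface HealthCheckResult {
  probe: string;          // name of the probe that produced this result
  status: OverallHealth;
  latencyMs: number;
  checkedAt: string;      // ISO timestamp
}
```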
For the type shim, create src/types/next-server.d.ts — this provides NextRequest and NextResponse types that route handlers need while Next.js 16 refines its public types:
Also create a re-export convenience shim in src/types/next/server.d.ts:
terminal
mkdir -p src/types/next
Create src/types/next/server.d.ts:
ts
import type { NextFetchEvent } from 'next/dist/server/web/spec-extension/fetch-event'
import type { NextRequest } from 'next/dist/server/web/spec-extension/request'
import type { NextResponse } from 'next/dist/server/web/spec-extension/response'
import type { NextMiddleware, MiddlewareConfig } from 'next/dist/server/web/spec-extension/middleware'
import type { UserAgent } from 'next/dist/compiled/@edge-runtime/primitives'

export type { NextFetchEvent, NextRequest, NextResponse, NextMiddleware, MiddlewareConfig, UserAgent }
Step 4: Build the AI client with replay instrumentation
This is the heart of the recipe: an ai SDK wrapper that records every LLM call for later replay. Whenever your application calls generateText, the client opens a recording session with the RecordingEngine from @reaatech/agent-replay-core, captures the request/response pair as trace events, and persists the trace to disk via LocalFileStorage. If the LLM call fails, the error is recorded before the trace is saved and the original error is re-thrown.
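The exact RecordingEngine and LocalFileStorage APIs belong to @reaatech/agent-replay-core and aren't reproduced here, but the record-then-rethrow pattern the paragraph describes can be sketched with a hypothetical in-memory recorder standing in for the real storage:

```typescript
// Minimal sketch of the record-then-rethrow pattern. TraceRecorder is a
// stand-in for RecordingEngine + LocalFileStorage; the real client persists
// trace events to REPLAY_TRACE_DIR rather than an in-memory array.
interface TraceEvent {
  type: "request" | "response" | "error";
  payload: unknown;
  at: number;
}

class TraceRecorder {
  events: TraceEvent[] = [];
  record(type: TraceEvent["type"], payload: unknown): void {
    this.events.push({ type, payload, at: Date.now() });
  }
}

async function recordedGenerate(
  recorder: TraceRecorder,
  generate: (prompt: string) => Promise<string>,
  prompt: string
): Promise<string> {
  recorder.record("request", { prompt });
  try {
    const text = await generate(prompt);
    recorder.record("response", { text });
    return text;
  } catch (err) {
    // Record the failure before re-throwing so the trace captures errors too.
    recorder.record("error", { message: (err as Error).message });
    throw err;
  }
}
```

The real ai-client.ts applies the same shape around generateText, with the recorder flushing each completed trace to disk.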
Step 5: Build the trace replay and diffing library
The replay module provides three functions: recordTrace (for the API route to store traces), replayTrace (load and replay a saved trace via ReplayEngine), and diffTraces (compare two traces to detect response drift). All three use LocalFileStorage backed by the REPLAY_TRACE_DIR environment variable.
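diffTraces is the drift detector. A self-contained sketch of the idea: compare two traces position by position and report every response that changed. The real implementation works over ReplayEngine trace events rather than plain strings, so this is an illustration of the comparison, not the module's actual code:

```typescript
// Sketch of response-drift detection between two recorded traces.
interface Trace {
  id: string;
  responses: string[]; // LLM responses, in call order
}

interface TraceDiff {
  index: number;  // position of the drifted call
  before: string; // response in the baseline trace
  after: string;  // response in the replayed trace
}

function diffTraces(a: Trace, b: Trace): TraceDiff[] {
  const diffs: TraceDiff[] = [];
  const len = Math.max(a.responses.length, b.responses.length);
  for (let i = 0; i < len; i++) {
    // Missing entries (one trace is shorter) also count as drift.
    const before = a.responses[i] ?? "";
    const after = b.responses[i] ?? "";
    if (before !== after) diffs.push({ index: i, before, after });
  }
  return diffs;
}
```

An empty result means the replay reproduced the baseline exactly; any entries flag calls whose output drifted.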
Step 6: Build the health checker, incident manager, service mapper, and rollback executor
These four library modules wrap the REAA operational packages. Each is a thin facade that calls the underlying package functions with configuration pulled from environment variables.
Create src/lib/health-checker.ts:
ts
import {
  identifyHealthChecks,
  generateHealthChecks,
  generateKubernetesProbeYaml,
  suggestHealthChecks,
} from "@reaatech/agent-runbook-health-checks";
import type { HealthProbeDefinition, HealthCheckResult, OverallHealth } from "../types/health.js";

export function generateProbeDefinitions(): HealthProbeDefinition[] {
  const context = { repoPath: process.env.SERVICE_MAP_REPO_PATH ?? "./" };
  const checks = identifyHealthChecks(
    process.env.SERVICE_MAP_REPO_PATH ?? "./",
    context as never
  );
  return (checks as { name?: string; endpoint: string }[]).map((c) => ({
    name: c.name ?? `probe-${c.endpoint}`,
    endpoint: c.endpoint,
    expectedStatus: 200,
    timeout: 5000,
    interval: 30,
  }));
}

export function generateKubernetesChecks(): unknown[] {
  return generateHealthChecks(
    process.env.SERVICE_MAP_REPO_PATH ?? "./",
    { repoPath: process.env.SERVICE_MAP_REPO_PATH ?? "./" } as never,
    { platform: "kubernetes", serviceName: "my-api", port: 3000 }
  );
}

export function generateProbeYaml(checks: unknown[]): string {
  return generateKubernetesProbeYaml(checks as never, "my-container", 3000);
}

export function getOverallHealth(results: HealthCheckResult[]): OverallHealth {
  if (results.length === 0) return "healthy";
  if (results.some((r) => r.status === "down")) return "down";
  if (results.some((r) => r.status === "degraded")) return "degraded";
  return "healthy";
}

export { suggestHealthChecks };
Create src/lib/incident-manager.ts:
ts
import {
  generateIncidentWorkflows,
  generateEscalationPolicy,
  getTemplatesByCategory,
  applyTemplateVariables,
} from "@reaatech/agent-runbook-incident";

export function createIncident(details: {
  serviceName: string;
  teamName: string;
  severity: string;
  description: string;
  escalationContacts: string[];
}): unknown[] {
  if (!details.serviceName || !details.teamName) {
    throw new Error("Incident requires serviceName and teamName");
  }
  return generateIncidentWorkflows({} as never, {
    serviceName: details.serviceName,
    teamName: details.teamName,
    escalationContacts: details.escalationContacts,
  });
}

export function getEscalationPolicy(): unknown {
  return generateEscalationPolicy({
    serviceName: "my-api",
    teamName: "platform-engineering",
  });
}

export function sendNotification(_incident: unknown): unknown {
  const templates = getTemplatesByCategory("incident-notification");
  const template = templates[0];
  if (!template) {
    throw new Error("No notification templates found");
  }
  return applyTemplateVariables(template, {
    serviceName: "my-api",
    severity: "sev3",
    incidentId: "inc-001",
    description: "test incident",
  });
}

export function closeIncident(incidentId: string): unknown {
  const templates = getTemplatesByCategory("postmortem");
  const template = templates[0];
  if (!template) {
    throw new Error("No postmortem templates found");
  }
  return applyTemplateVariables(template, {
    incidentId,
    serviceName: "my-api",
  });
}
Create src/lib/service-mapper.ts:
ts
import {
  analyzeDependencies,
  generateServiceMap,
  exportGraph,
  exportToMermaid,
  exportToDot,
  exportToJson,
  exportToYaml,
} from "@reaatech/agent-runbook-service-map";

export function generateServiceMapGraph(repoPath: string): ReturnType<typeof generateServiceMap> {
  const deps = analyzeDependencies(repoPath, {} as never);
  return generateServiceMap(deps, "my-service", {} as never);
}

export {
  exportGraph,
  exportToMermaid,
  exportToDot,
  exportToJson,
  exportToYaml,
};
Create src/lib/rollback-executor.ts:
ts
import {
  generateRollbackProcedures as genRollbackProcedures,
  getRollbackCommands,
  generateVerificationSteps,
} from "@reaatech/agent-runbook-rollback";

export function generateRollbackProcedures(
  platform: string
): ReturnType<typeof genRollbackProcedures> {
  const validPlatforms = ["kubernetes", "ecs", "cloud-run"];
  if (!validPlatforms.includes(platform) && platform !== "cloudrun") {
    throw new Error(`Unsupported platform: ${platform}`);
  }
  const actualPlatform = platform === "cloudrun" ? "cloud-run" : platform;
  return genRollbackProcedures(
    { repoPath: process.env.ROLLBACK_GIT_REMOTE ?? "./" } as never,
    actualPlatform as never
  );
}

export function getRollbackCommandList(
  platform: string,
  serviceName: string
): string[] {
  const actualPlatform = platform === "cloudrun" ? "cloud-run" : platform;
  return getRollbackCommands(actualPlatform as never, serviceName);
}

export function generateVerification(
  platform: string
): Record<string, unknown> {
  const actualPlatform = platform === "cloudrun" ? "cloud-run" : platform;
  const result = generateVerificationSteps(
    { repoPath: process.env.ROLLBACK_GIT_REMOTE ?? "./" } as never,
    actualPlatform as never
  );
  return result as never;
}
Step 7: Create the Next.js API routes
Five API routes expose the reliability suite's operations over HTTP. Each route sits in the App Router at src/app/api/<name>/route.ts and exports named functions for the HTTP methods it handles (GET, POST). The routes use NextResponse.json() for proper content-type headers. The service map route, for example, lives at src/app/api/service-map/route.ts:
ts
import { type NextRequest, NextResponse } from "next/server";
import { generateServiceMapGraph } from "../../../lib/service-mapper.js";

export function GET(_req: NextRequest): NextResponse {
  const repoPath = process.env.SERVICE_MAP_REPO_PATH ?? "./";
  const graph = generateServiceMapGraph(repoPath);
  return NextResponse.json({ graph, format: "json" });
}
Step 8: Build the dashboard UI pages
The dashboard consists of a root layout with navigation and five client-component pages: Dashboard, Incidents, Replays, Service Map, and Rollback. Each page fetches its data from the API routes you just built.
Step 9: Wire up instrumentation and background workers
Next.js runs src/instrumentation.ts at startup — your entry point for initializing the AI client with its replay interceptors. The workers directory holds the health probe runner and Trigger.dev scheduled tasks that ping endpoints on cron schedules.
Create the directories:
terminal
mkdir -p src/workers
Create src/instrumentation.ts:
ts
export async function register(): Promise<void> {
  if (process.env.NEXT_RUNTIME === "nodejs") {
    const { createAiClient } = await import("./lib/ai-client.js");
    createAiClient();
  }
}
The health worker probes every endpoint returned by generateProbeDefinitions() and classifies each as healthy, degraded, or down:
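A minimal sketch of that three-way classification; the degraded-latency threshold below is an assumption for illustration, not a value taken from the real worker:

```typescript
// Sketch of the per-probe healthy/degraded/down classification.
// degradedAboveMs is an assumed threshold, not the worker's actual value.
type ProbeStatus = "healthy" | "degraded" | "down";

function classifyProbe(
  ok: boolean,          // did the endpoint return the expected status?
  latencyMs: number,    // measured round-trip time
  timeoutMs: number,    // probe timeout from its definition
  degradedAboveMs = 1000
): ProbeStatus {
  if (!ok || latencyMs >= timeoutMs) return "down";    // failed or timed out
  if (latencyMs > degradedAboveMs) return "degraded";  // responding, but slow
  return "healthy";
}
```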
The Trigger.dev scheduled tasks run on cron schedules — the health check sweep pings every minute, and the incident polling checks the service map every 5 minutes, creating an incident when anomalies are detected:
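The anomaly decision inside the incident-polling task can be sketched as a simple threshold over recent probe results; the 50% cutoff here is an assumption for illustration, not the recipe's actual rule:

```typescript
// Sketch of the anomaly check that decides whether to open an incident:
// trigger when the share of "down" results in the recent window crosses
// a threshold. maxDownRatio = 0.5 is an assumed default.
type ProbeStatus = "healthy" | "degraded" | "down";

function shouldOpenIncident(
  recentResults: ProbeStatus[],
  maxDownRatio = 0.5
): boolean {
  if (recentResults.length === 0) return false; // no data, no incident
  const down = recentResults.filter((r) => r === "down").length;
  return down / recentResults.length >= maxDownRatio;
}
```

When this returns true, the task would call createIncident from Step 6 with the affected service's details.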
Step 10: Run the tests and start the dev server
The test suite covers the AI client, replay library, library modules, API routes, workers, and dashboard pages. Vitest is configured with 90% coverage thresholds. Run the tests first to confirm everything works, then start the Next.js dev server.
terminal
pnpm test
Expected output: Vitest runs all test files in the tests/ directory — ai-client.test.ts, replay.test.ts, api-health.test.ts, api-incidents.test.ts, api-replay.test.ts, api-rollback.test.ts, api-service-map.test.ts, health-checker.test.ts, incident-manager.test.ts, service-mapper.test.ts, rollback-executor.test.ts, health-worker.test.ts, trigger.test.ts, instrumentation.test.ts, dashboard-pages.test.ts, incidents-page.test.ts, layout.test.ts, replays-page.test.ts, rollback-page.test.ts, service-map-page.test.ts, and types.test.ts. All tests should pass, and the coverage report shows at least 90% across lines, branches, functions, and statements.
With tests passing, start the dev server. Since the project doesn’t include a dev script in package.json, use npx to run Next.js directly:
terminal
npx next dev
Expected output: Next.js compiles your application and prints its startup banner, including the local URL.
Open http://localhost:3000 in your browser. You’ll see the dashboard page with navigation links to Incidents, Replays, Service Map, and Rollback. The dashboard fetches health status from /api/health and shows “Health Status: healthy” with zero active incidents.
Try the API routes directly with curl:
terminal
curl http://localhost:3000/api/health
Expected output: A JSON response with a timestamp, status “healthy”, and an empty results array.
Next steps
Configure Trigger.dev with a real TRIGGER_TOKEN so the healthCheckSweep and incidentPolling scheduled tasks run in production. Deploy to Trigger.dev’s cloud platform to replace the in-process cron schedule.
Point HEALTH_ENDPOINT_BASE at your staging or production Vercel AI Gateway endpoints and expand the health probe definitions in generateProbeDefinitions() to cover latency budgets, token rate limits, and error rate thresholds.
Integrate a real notification channel (Slack, PagerDuty, email) into the sendNotification function by replacing the template-based output with an actual API call. Wire the INCIDENT_ESCALATION_RECIPIENT env var to your on-call rotation.