Vercel AI Gateway Reliability Suite for SMB AI Operations
A self‑serve reliability dashboard that monitors, replays, and self‑heals AI workflows running through Vercel AI Gateway, so small teams can keep LLM apps running 24/7.
SMBs deploying LLM features on Vercel have no visibility into why a response was slow, failed, or drifted. Without replays, health checks, and incident runbooks, a weekend spike in errors means lost revenue and frantic debugging on Monday.
A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.
You’ll build a self-serve reliability dashboard that monitors, replays, and self-heals AI workflows running through Vercel AI Gateway. By the end of this tutorial, you’ll have a Next.js application with health probes, incident workflows, replay tracing of every LLM call, service dependency maps, and rollback procedures — all wired together into a single admin UI. Every LLM interaction is recorded for replay, health checks run on a schedule, and incidents auto-trigger when anomalies are detected.
Prerequisites
Node.js >= 22 (the project sets "engines": { "node": ">=22" } in package.json)
pnpm 10.x (the project declares "packageManager": "pnpm@10.0.0" — enable it with corepack enable)
A Vercel AI Gateway API key or any provider key the ai SDK supports (set as AI_PROVIDER_API_KEY in your .env)
Assumed knowledge: Familiarity with TypeScript, Next.js App Router routing, and basic React hooks. You should be comfortable with pnpm, fetch, and async/await patterns.
Step 1: Scaffold and configure the Next.js project
Start by creating the project directory and its core configuration files. These files define your dependencies, TypeScript settings, testing infrastructure, ESLint rules, and Next.js runtime behavior.
Create the project at ./vercel-ai-gateway-reliability-suite-for-smb-ai-operations/:
Now write the package.json with all the dependencies this recipe needs — the ai SDK for Vercel AI Gateway calls, REAA operational packages for health checks, incidents, rollbacks, service maps, and replay traces, plus Trigger.dev for scheduled jobs:
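The paragraph above names every package this recipe depends on. As a hedged sketch, a package.json along these lines would satisfy it; the project name and all version ranges below are illustrative placeholders, not values from the recipe:

```json
{
  "name": "vercel-ai-gateway-reliability-suite",
  "private": true,
  "engines": { "node": ">=22" },
  "packageManager": "pnpm@10.0.0",
  "dependencies": {
    "ai": "^4.0.0",
    "@reaatech/agent-replay-core": "^1.0.0",
    "@reaatech/agent-runbook-health-checks": "^1.0.0",
    "@reaatech/agent-runbook-incident": "^1.0.0",
    "@reaatech/agent-runbook-service-map": "^1.0.0",
    "@reaatech/agent-runbook-rollback": "^1.0.0",
    "@trigger.dev/sdk": "^3.0.0",
    "next": "^16.0.0",
    "react": "^19.0.0",
    "react-dom": "^19.0.0"
  },
  "devDependencies": {
    "typescript": "^5.0.0",
    "vitest": "^2.0.0",
    "@vitest/coverage-v8": "^2.0.0"
  }
}
```

The engines and packageManager fields match the prerequisites above; pin exact versions once your lockfile exists.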
The next.config.ts enables the instrumentation hook so that src/instrumentation.ts fires at startup and can initialize the AI client with its replay interceptors. The key experimental.instrumentationHook: true must be spelled exactly as shown; misspell it and Next.js silently ignores the option, so your startup code never runs:
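Based on that description, a minimal next.config.ts would look like this (assuming the project uses only the experimental flag named above):

```typescript
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  experimental: {
    // Must match this exact key, or src/instrumentation.ts never runs.
    instrumentationHook: true,
  },
};

export default nextConfig;
```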
Step 2: Install dependencies and set environment variables
Install everything with pnpm, then create your environment files. The .env.example file serves as a template; you’ll copy it to .env and fill in your own values.
Expected output: pnpm install resolves all dependencies and devDependencies and creates pnpm-lock.yaml and node_modules/. No errors.
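The environment variables referenced throughout this recipe can be collected into an .env.example like the following; every value shown is a placeholder to replace in your .env:

```shell
# Provider key for the ai SDK / Vercel AI Gateway
AI_PROVIDER_API_KEY=your-gateway-or-provider-key

# Directory where recorded LLM traces are persisted for replay
REPLAY_TRACE_DIR=./traces

# Repository scanned for health checks and the service map
SERVICE_MAP_REPO_PATH=./

# Git remote used when generating rollback procedures
ROLLBACK_GIT_REMOTE=./

# Base URL the health worker probes
HEALTH_ENDPOINT_BASE=http://localhost:3000

# Trigger.dev API token for scheduled tasks
TRIGGER_TOKEN=tr_dev_placeholder

# On-call contact for incident escalation
INCIDENT_ESCALATION_RECIPIENT=oncall@example.com
```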
Step 3: Define shared types
Type definitions keep every module speaking the same language. Create a src/types/ directory with four files: health, incident, replay, and rollback types. You'll also add a type shim for Next.js 16's server types.
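As a sketch of what src/types/health.ts might contain: the shape below is inferred from how health-checker.ts uses these types later in this recipe (probe fields like expectedStatus, timeout, and interval, and the three-state OverallHealth), so treat the exact field names as assumptions:

```typescript
// Sketch of src/types/health.ts. Field names are inferred from later steps
// (expectedStatus: 200, timeout: 5000, interval: 30, status "down"/"degraded").
export type OverallHealth = "healthy" | "degraded" | "down";

export interface HealthProbeDefinition {
  name: string;           // e.g. "probe-/api/health"
  endpoint: string;       // path or URL to probe
  expectedStatus: number; // HTTP status that counts as healthy
  timeout: number;        // ms before the probe counts as down
  interval: number;       // seconds between probe runs
}

export interface HealthCheckResult {
  probe: string;          // name of the probe that produced this result
  status: OverallHealth;
  latencyMs: number;
  checkedAt: string;      // ISO timestamp
}
```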
For the type shim, create src/types/next-server.d.ts — this provides NextRequest and NextResponse types that route handlers need while Next.js 16 refines its public types:
Also create a re-export convenience shim in src/types/next/server.d.ts:
terminal
mkdir -p src/types/next
Create src/types/next/server.d.ts:
ts
import type { NextFetchEvent } from 'next/dist/server/web/spec-extension/fetch-event'
import type { NextRequest } from 'next/dist/server/web/spec-extension/request'
import type { NextResponse } from 'next/dist/server/web/spec-extension/response'
import type { NextMiddleware, MiddlewareConfig } from 'next/dist/server/web/spec-extension/middleware'
import type { UserAgent } from 'next/dist/compiled/@edge-runtime/primitives'

export type { NextFetchEvent, NextRequest, NextResponse, NextMiddleware, MiddlewareConfig, UserAgent }
Step 4: Build the AI client with replay instrumentation
This is the heart of the recipe: an ai SDK wrapper that records every LLM call for later replay. Whenever your application calls generateText, the client opens a recording session with the RecordingEngine from @reaatech/agent-replay-core, captures the request/response pair as trace events, and persists the trace to disk via LocalFileStorage. If the LLM call fails, the error is recorded before the trace is saved and the original error is re-thrown.
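The exact RecordingEngine and LocalFileStorage APIs belong to @reaatech/agent-replay-core and aren't reproduced here, but the record-then-rethrow pattern the paragraph describes can be sketched with a hypothetical in-memory recorder standing in for the real storage:

```typescript
// Minimal sketch of the record-then-rethrow pattern. TraceRecorder is a
// stand-in for RecordingEngine + LocalFileStorage; the real client persists
// trace events to REPLAY_TRACE_DIR rather than an in-memory array.
interface TraceEvent {
  type: "request" | "response" | "error";
  payload: unknown;
  at: number;
}

class TraceRecorder {
  events: TraceEvent[] = [];
  record(type: TraceEvent["type"], payload: unknown): void {
    this.events.push({ type, payload, at: Date.now() });
  }
}

async function recordedGenerate(
  recorder: TraceRecorder,
  generate: (prompt: string) => Promise<string>,
  prompt: string
): Promise<string> {
  recorder.record("request", { prompt });
  try {
    const text = await generate(prompt);
    recorder.record("response", { text });
    return text;
  } catch (err) {
    // Record the failure before re-throwing so the trace captures errors too.
    recorder.record("error", { message: (err as Error).message });
    throw err;
  }
}
```

The real ai-client.ts applies the same shape around generateText, with the recorder flushing each completed trace to disk.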
Step 5: Build the trace replay and diffing library
The replay module provides three functions: recordTrace (for the API route to store traces), replayTrace (load and replay a saved trace via ReplayEngine), and diffTraces (compare two traces to detect response drift). All three use LocalFileStorage backed by the REPLAY_TRACE_DIR environment variable.
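diffTraces is the drift detector. A self-contained sketch of the idea: compare two traces position by position and report every response that changed. The real implementation works over ReplayEngine trace events rather than plain strings, so this is an illustration of the comparison, not the module's actual code:

```typescript
// Sketch of response-drift detection between two recorded traces.
interface Trace {
  id: string;
  responses: string[]; // LLM responses, in call order
}

interface TraceDiff {
  index: number;  // position of the drifted call
  before: string; // response in the baseline trace
  after: string;  // response in the replayed trace
}

function diffTraces(a: Trace, b: Trace): TraceDiff[] {
  const diffs: TraceDiff[] = [];
  const len = Math.max(a.responses.length, b.responses.length);
  for (let i = 0; i < len; i++) {
    // Missing entries (one trace is shorter) also count as drift.
    const before = a.responses[i] ?? "";
    const after = b.responses[i] ?? "";
    if (before !== after) diffs.push({ index: i, before, after });
  }
  return diffs;
}
```

An empty result means the replay reproduced the baseline exactly; any entries flag calls whose output drifted.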
Step 6: Build the health checker, incident manager, service mapper, and rollback executor
These four library modules wrap the REAA operational packages. Each is a thin facade that calls the underlying package functions with configuration pulled from environment variables.
Create src/lib/health-checker.ts:
ts
import {
  identifyHealthChecks,
  generateHealthChecks,
  generateKubernetesProbeYaml,
  suggestHealthChecks,
} from "@reaatech/agent-runbook-health-checks";
import type { HealthProbeDefinition, HealthCheckResult, OverallHealth } from "../types/health.js";

export function generateProbeDefinitions(): HealthProbeDefinition[] {
  const context = { repoPath: process.env.SERVICE_MAP_REPO_PATH ?? "./" };
  const checks = identifyHealthChecks(
    process.env.SERVICE_MAP_REPO_PATH ?? "./",
    context as never
  );
  return (checks as { name?: string; endpoint: string }[]).map((c) => ({
    name: c.name ?? `probe-${c.endpoint}`,
    endpoint: c.endpoint,
    expectedStatus: 200,
    timeout: 5000,
    interval: 30,
  }));
}

export function generateKubernetesChecks(): unknown[] {
  return generateHealthChecks(
    process.env.SERVICE_MAP_REPO_PATH ?? "./",
    { repoPath: process.env.SERVICE_MAP_REPO_PATH ?? "./" } as never,
    { platform: "kubernetes", serviceName: "my-api", port: 3000 }
  );
}

export function generateProbeYaml(checks: unknown[]): string {
  return generateKubernetesProbeYaml(checks as never, "my-container", 3000);
}

export function getOverallHealth(results: HealthCheckResult[]): OverallHealth {
  if (results.length === 0) return "healthy";
  if (results.some((r) => r.status === "down")) return "down";
  if (results.some((r) => r.status === "degraded")) return "degraded";
  return "healthy";
}

export { suggestHealthChecks };
Create src/lib/incident-manager.ts:
ts
import {
  generateIncidentWorkflows,
  generateEscalationPolicy,
  getTemplatesByCategory,
  applyTemplateVariables,
} from "@reaatech/agent-runbook-incident";

export function createIncident(details: {
  serviceName: string;
  teamName: string;
  severity: string;
  description: string;
  escalationContacts: string[];
}): unknown[] {
  if (!details.serviceName || !details.teamName) {
    throw new Error("Incident requires serviceName and teamName");
  }
  return generateIncidentWorkflows({} as never, {
    serviceName: details.serviceName,
    teamName: details.teamName,
    escalationContacts: details.escalationContacts,
  });
}

export function getEscalationPolicy(): unknown {
  return generateEscalationPolicy({
    serviceName: "my-api",
    teamName: "platform-engineering",
  });
}

export function sendNotification(_incident: unknown): unknown {
  const templates = getTemplatesByCategory("incident-notification");
  const template = templates[0];
  if (!template) {
    throw new Error("No notification templates found");
  }
  return applyTemplateVariables(template, {
    serviceName: "my-api",
    severity: "sev3",
    incidentId: "inc-001",
    description: "test incident",
  });
}

export function closeIncident(incidentId: string): unknown {
  const templates = getTemplatesByCategory("postmortem");
  const template = templates[0];
  if (!template) {
    throw new Error("No postmortem templates found");
  }
  return applyTemplateVariables(template, {
    incidentId,
    serviceName: "my-api",
  });
}
Create src/lib/service-mapper.ts:
ts
import {
  analyzeDependencies,
  generateServiceMap,
  exportGraph,
  exportToMermaid,
  exportToDot,
  exportToJson,
  exportToYaml,
} from "@reaatech/agent-runbook-service-map";

export function generateServiceMapGraph(repoPath: string): ReturnType<typeof generateServiceMap> {
  const deps = analyzeDependencies(repoPath, {} as never);
  return generateServiceMap(deps, "my-service", {} as never);
}

export {
  exportGraph,
  exportToMermaid,
  exportToDot,
  exportToJson,
  exportToYaml,
};
Create src/lib/rollback-executor.ts:
ts
import {
  generateRollbackProcedures as genRollbackProcedures,
  getRollbackCommands,
  generateVerificationSteps,
} from "@reaatech/agent-runbook-rollback";

export function generateRollbackProcedures(
  platform: string
): ReturnType<typeof genRollbackProcedures> {
  const validPlatforms = ["kubernetes", "ecs", "cloud-run"];
  if (!validPlatforms.includes(platform) && platform !== "cloudrun") {
    throw new Error(`Unsupported platform: ${platform}`);
  }
  const actualPlatform = platform === "cloudrun" ? "cloud-run" : platform;
  return genRollbackProcedures(
    { repoPath: process.env.ROLLBACK_GIT_REMOTE ?? "./" } as never,
    actualPlatform as never
  );
}

export function getRollbackCommandList(
  platform: string,
  serviceName: string
): string[] {
  const actualPlatform = platform === "cloudrun" ? "cloud-run" : platform;
  return getRollbackCommands(actualPlatform as never, serviceName);
}

export function generateVerification(
  platform: string
): Record<string, unknown> {
  const actualPlatform = platform === "cloudrun" ? "cloud-run" : platform;
  const result = generateVerificationSteps(
    { repoPath: process.env.ROLLBACK_GIT_REMOTE ?? "./" } as never,
    actualPlatform as never
  );
  return result as never;
}
Step 7: Create the Next.js API routes
Five API routes expose the reliability suite's operations over HTTP. Each route sits in the App Router at src/app/api/<name>/route.ts and exports named functions for the HTTP methods it handles (GET, POST). The routes use NextResponse.json() for proper content-type headers. The service map route, for example, lives at src/app/api/service-map/route.ts:
ts
import { type NextRequest, NextResponse } from "next/server";
import { generateServiceMapGraph } from "../../../lib/service-mapper.js";

export function GET(_req: NextRequest): NextResponse {
  const repoPath = process.env.SERVICE_MAP_REPO_PATH ?? "./";
  const graph = generateServiceMapGraph(repoPath);
  return NextResponse.json({ graph, format: "json" });
}
Step 8: Build the dashboard UI pages
The dashboard consists of a root layout with navigation and five client-component pages: Dashboard, Incidents, Replays, Service Map, and Rollback. Each page fetches its data from the API routes you just built.
Step 9: Wire up instrumentation and background workers
Next.js runs src/instrumentation.ts at startup — your entry point for initializing the AI client with its replay interceptors. The workers directory holds the health probe runner and Trigger.dev scheduled tasks that ping endpoints on cron schedules.
Create the directories:
terminal
mkdir -p src/workers
Create src/instrumentation.ts:
ts
export async function register(): Promise<void> {
  if (process.env.NEXT_RUNTIME === "nodejs") {
    const { createAiClient } = await import("./lib/ai-client.js");
    createAiClient();
  }
}
The health worker probes every endpoint returned by generateProbeDefinitions() and classifies each as healthy, degraded, or down:
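A minimal sketch of that three-way classification; the degraded-latency threshold below is an assumption for illustration, not a value taken from the real worker:

```typescript
// Sketch of the per-probe healthy/degraded/down classification.
// degradedAboveMs is an assumed threshold, not the worker's actual value.
type ProbeStatus = "healthy" | "degraded" | "down";

function classifyProbe(
  ok: boolean,          // did the endpoint return the expected status?
  latencyMs: number,    // measured round-trip time
  timeoutMs: number,    // probe timeout from its definition
  degradedAboveMs = 1000
): ProbeStatus {
  if (!ok || latencyMs >= timeoutMs) return "down";    // failed or timed out
  if (latencyMs > degradedAboveMs) return "degraded";  // responding, but slow
  return "healthy";
}
```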
The Trigger.dev scheduled tasks run on cron schedules — the health check sweep pings every minute, and the incident polling checks the service map every 5 minutes, creating an incident when anomalies are detected:
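The anomaly decision inside the incident-polling task can be sketched as a simple threshold over recent probe results; the 50% cutoff here is an assumption for illustration, not the recipe's actual rule:

```typescript
// Sketch of the anomaly check that decides whether to open an incident:
// trigger when the share of "down" results in the recent window crosses
// a threshold. maxDownRatio = 0.5 is an assumed default.
type ProbeStatus = "healthy" | "degraded" | "down";

function shouldOpenIncident(
  recentResults: ProbeStatus[],
  maxDownRatio = 0.5
): boolean {
  if (recentResults.length === 0) return false; // no data, no incident
  const down = recentResults.filter((r) => r === "down").length;
  return down / recentResults.length >= maxDownRatio;
}
```

When this returns true, the task would call createIncident from Step 6 with the affected service's details.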
Step 10: Run the tests and start the dev server
The test suite covers the AI client, replay library, library modules, API routes, workers, and dashboard pages. Vitest is configured with 90% coverage thresholds. Run the tests first to confirm everything works, then start the Next.js dev server.
terminal
pnpm test
Expected output: Vitest runs all test files in the tests/ directory — ai-client.test.ts, replay.test.ts, api-health.test.ts, api-incidents.test.ts, api-replay.test.ts, api-rollback.test.ts, api-service-map.test.ts, health-checker.test.ts, incident-manager.test.ts, service-mapper.test.ts, rollback-executor.test.ts, health-worker.test.ts, trigger.test.ts, instrumentation.test.ts, dashboard-pages.test.ts, incidents-page.test.ts, layout.test.ts, replays-page.test.ts, rollback-page.test.ts, service-map-page.test.ts, and types.test.ts. All tests should pass, and the coverage report shows at least 90% across lines, branches, functions, and statements.
With tests passing, start the dev server. Since the project doesn’t include a dev script in package.json, use npx to run Next.js directly:
terminal
npx next dev
Expected output: Next.js compiles your application and prints its startup banner, including the local URL.
Open http://localhost:3000 in your browser. You’ll see the dashboard page with navigation links to Incidents, Replays, Service Map, and Rollback. The dashboard fetches health status from /api/health and shows “Health Status: healthy” with zero active incidents.
Try the API routes directly with curl:
terminal
curl http://localhost:3000/api/health
Expected output: A JSON response with a timestamp, status “healthy”, and an empty results array.
Next steps
Configure Trigger.dev with a real TRIGGER_TOKEN so the healthCheckSweep and incidentPolling scheduled tasks run in production. Deploy to Trigger.dev’s cloud platform to replace the in-process cron schedule.
Point HEALTH_ENDPOINT_BASE at your staging or production Vercel AI Gateway endpoints and expand the health probe definitions in generateProbeDefinitions() to cover latency budgets, token rate limits, and error rate thresholds.
Integrate a real notification channel (Slack, PagerDuty, email) into the sendNotification function by replacing the template-based output with an actual API call. Wire the INCIDENT_ESCALATION_RECIPIENT env var to your on-call rotation.