Azure AI Reliability Suite for SMB AI Operations

Proactive incident detection, self-healing, and cost-aware failure recovery for SMB AI agent operations, powered by Azure AI.

The problem

Small businesses running AI agents on Azure often face unpredictable outages, silent failures, and runaway costs, usually without a dedicated SRE team to monitor, alert on, and recover from them.

Intro

This tutorial walks through building the Azure AI Reliability Suite, a reliability layer for small businesses running AI agents on Azure. You’ll wire together circuit breakers, idempotency middleware, observability tracing, and durable incident workflows — a system that detects failures, prevents duplicate side effects, isolates failing services, and runs automated recovery. By the end you’ll have a Next.js 16 App Router app with two API routes, @trigger.dev durable tasks, and a complete Vitest test suite.

This is an intermediate-level recipe. You should be comfortable with TypeScript, Next.js App Router, and basic testing with Vitest.

Prerequisites

Node.js 22+ and pnpm 10 installed
An Azure OpenAI endpoint, API key, and deployment name (or set AGENT_PROVIDER=mock in .env to skip live LLM calls)
A free trigger.dev account and secret key (for durable workflow orchestration)
A Langfuse account (public + secret keys) if you want observability tracing — the system degrades gracefully when disabled
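
One way to picture "degrades gracefully" is a tracer factory that hands back a no-op when the Langfuse keys are missing, so calling code never has to check whether tracing is on. This is a minimal sketch, not the suite's observability service: `Tracer`, `createTracer`, and the `sink` callback are hypothetical, and the `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY` variable names are assumed rather than taken from this project's .env.example.

```typescript
// Hypothetical tracer interface; the real observability service may differ.
interface Tracer {
  readonly enabled: boolean;
  record(name: string, metadata?: Record<string, unknown>): void;
}

type Env = Record<string, string | undefined>;

// No-op tracer used when tracing is not configured.
const noopTracer: Tracer = {
  enabled: false,
  record: () => {},
};

// Returns a recording tracer only when both keys are present.
// `sink` stands in for whatever sends events to the tracing backend.
function createTracer(env: Env, sink: (event: string) => void): Tracer {
  const publicKey = env.LANGFUSE_PUBLIC_KEY; // assumed variable name
  const secretKey = env.LANGFUSE_SECRET_KEY; // assumed variable name
  if (!publicKey || !secretKey) {
    return noopTracer;
  }
  return {
    enabled: true,
    record: (name, metadata = {}) => {
      sink(`${name} ${JSON.stringify(metadata)}`);
    },
  };
}
```

Callers use `tracer.record(...)` unconditionally; with no keys set the call simply does nothing.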

Step 1: Scaffold the project and configure environment variables

Start from the scaffolded Next.js 16 App Router project. The dependency tree is already set — install everything, then copy the example environment file:

terminal

pnpm install
cp .env.example .env

Open .env and fill in your credentials. The key variables are:
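
However the variables are named in your .env, it helps to validate them once at startup so a missing credential fails fast instead of surfacing later as a runtime error. A minimal sketch under stated assumptions: only `AGENT_PROVIDER` comes from the prerequisites above, while `loadConfig`, `AppConfig`, and the `AZURE_OPENAI_*` names are hypothetical placeholders to adapt to .env.example.

```typescript
type Env = Record<string, string | undefined>;

interface AppConfig {
  provider: "azure" | "mock";
  azure?: { endpoint: string; apiKey: string; deploymentName: string };
}

// Hypothetical variable names; check .env.example for the real ones.
const REQUIRED_AZURE_VARS = [
  "AZURE_OPENAI_ENDPOINT",
  "AZURE_OPENAI_API_KEY",
  "AZURE_OPENAI_DEPLOYMENT",
] as const;

function loadConfig(env: Env): AppConfig {
  // AGENT_PROVIDER=mock skips live LLM calls, so no Azure credentials are needed.
  if (env.AGENT_PROVIDER === "mock") {
    return { provider: "mock" };
  }
  const missing = REQUIRED_AZURE_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    provider: "azure",
    azure: {
      endpoint: env.AZURE_OPENAI_ENDPOINT as string,
      apiKey: env.AZURE_OPENAI_API_KEY as string,
      deploymentName: env.AZURE_OPENAI_DEPLOYMENT as string,
    },
  };
}
```

In a Node.js process you would call `loadConfig(process.env)` once and pass the result to the services that need it.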

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

187 kB · 102 tests · 97.7% coverage · Vitest passing

SHA-256: 931dc4e18c4ff73a21896c48ffae354bcf3589dcc195c7bd9f8b7a23ce89699c

Step 3: Build the Azure OpenAI service

import OpenAI from "openai";
import type { ChatCompletionMessageParam } from "openai/resources/chat/completions";

interface CompletionResult {
  text: string;
  usage: { promptTokens: number; completionTokens: number };
}

interface HealthResult {
  healthy: boolean;
  latencyMs: number;
}

interface ErrorResult {
  error: string;
  statusCode: number;
}

// Normalize any thrown value into an ErrorResult. The OpenAI SDK reports the
// HTTP status of a failed request as `status`; `statusCode` is kept as a fallback.
function toErrorResult(err: unknown): ErrorResult {
  const e = (err ?? {}) as { status?: number; statusCode?: number; message?: string };
  return {
    error: e.message ?? "Unknown error",
    statusCode: e.status ?? e.statusCode ?? 500,
  };
}

export class AzureOpenAIService {
  private client: OpenAI;
  private deploymentName: string;

  // `endpoint` is interpolated as the Azure resource name, not as a full URL.
  constructor(endpoint: string, deploymentName: string, apiKey?: string) {
    this.deploymentName = deploymentName;
    this.client = new OpenAI({
      baseURL: `https://${endpoint}.openai.azure.com/openai/deployments/${deploymentName}`,
      apiKey,
      // Azure key authentication expects the key in the `api-key` header.
      defaultHeaders: { "api-key": apiKey },
      defaultQuery: { "api-version": "2024-10-01-preview" },
    });
  }

  // Returns the completion text and token usage, or an ErrorResult instead of throwing.
  async getCompletion(
    messages: ChatCompletionMessageParam[],
  ): Promise<CompletionResult | ErrorResult> {
    try {
      const response = await this.client.chat.completions.create({
        model: this.deploymentName,
        messages,
      });
      return {
        text: response.choices[0]?.message?.content ?? "",
        usage: {
          promptTokens: response.usage?.prompt_tokens ?? 0,
          completionTokens: response.usage?.completion_tokens ?? 0,
        },
      };
    } catch (err: unknown) {
      return toErrorResult(err);
    }
  }

  // Sends a one-token "ping" request and measures round-trip latency.
  async healthCheck(): Promise<HealthResult | ErrorResult> {
    try {
      const start = Date.now();
      await this.client.chat.completions.create({
        model: this.deploymentName,
        messages: [{ role: "user", content: "ping" }],
        max_tokens: 1,
      });
      return { healthy: true, latencyMs: Date.now() - start };
    } catch (err: unknown) {
      return toErrorResult(err);
    }
  }
}

export function createAzureOpenAIService(config: {
  endpoint: string;
  deploymentName: string;
  apiKey?: string;
}): AzureOpenAIService {
  return new AzureOpenAIService(config.endpoint, config.deploymentName, config.apiKey);
}

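Because getCompletion returns failures as values rather than throwing, callers have to narrow the union before touching `text`. The helpers below are a hedged sketch of that narrowing: `isErrorResult`, `isRetryable`, and `unwrapOrThrow` are hypothetical names, and treating 429 and 5xx as retryable is an assumption, not a rule taken from the suite.

```typescript
// Result shapes as declared in the Azure OpenAI service.
interface CompletionResult {
  text: string;
  usage: { promptTokens: number; completionTokens: number };
}

interface ErrorResult {
  error: string;
  statusCode: number;
}

// Type guard: true when the service reported a failure.
function isErrorResult(r: CompletionResult | ErrorResult): r is ErrorResult {
  return "error" in r;
}

// Assumption: rate limits (429) and server errors (5xx) are worth retrying.
function isRetryable(r: ErrorResult): boolean {
  return r.statusCode === 429 || r.statusCode >= 500;
}

// Turn an error value back into an exception, so a wrapper that counts
// thrown errors (a circuit breaker, for example) can see the failure.
function unwrapOrThrow(r: CompletionResult | ErrorResult): CompletionResult {
  if (isErrorResult(r)) {
    throw new Error(`Azure OpenAI call failed (${r.statusCode}): ${r.error}`);
  }
  return r;
}
```

The last helper matters if the call is wrapped in something that only reacts to exceptions: an ErrorResult returned as a normal value would otherwise look like a success.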
Step 4: Build the circuit breaker manager

import {
  CircuitBreaker,
  CircuitOpenError,
  DefaultMetricsCollector,
  InMemoryAdapter,
} from "@reaatech/circuit-breaker-agents";

interface ExecuteSuccess {
  rejected: false;
  result: unknown;
}

interface ExecuteRejected {
  rejected: true;
  reason: "circuit_open";
}

type ExecuteResult = ExecuteSuccess | ExecuteRejected;

interface BreakerStats {
  state: string;
  failureCount: number;
  consecutiveSuccesses: number;
}

export class CircuitBreakerManager {
  // One breaker per service name, created lazily.
  private breakers: Map<string, CircuitBreaker> = new Map();

  getOrCreate(
    name: string,
    options?: { failureThreshold?: number; recoveryTimeoutMs?: number },
  ): CircuitBreaker {
    const existing = this.breakers.get(name);
    if (existing) {
      return existing;
    }
    const breaker = new CircuitBreaker({
      name,
      failureThreshold: options?.failureThreshold ?? 5,
      recoveryTimeoutMs: options?.recoveryTimeoutMs ?? 30000,
      persistence: new InMemoryAdapter(),
      metricsCollector: new DefaultMetricsCollector(),
    });
    this.breakers.set(name, breaker);
    return breaker;
  }

  // Runs the action through the named breaker. An open circuit is reported
  // as a value; any other error is rethrown to the caller.
  async execute(name: string, action: () => Promise<unknown>): Promise<ExecuteResult> {
    try {
      const breaker = this.getOrCreate(name);
      const result = await breaker.execute(() => action());
      return { rejected: false, result };
    } catch (err: unknown) {
      if (err instanceof CircuitOpenError) {
        return { rejected: true, reason: "circuit_open" };
      }
      throw err;
    }
  }

  // Resets an existing breaker, or creates a fresh one if none exists yet.
  reset(name: string): void {
    const breaker = this.breakers.get(name);
    if (breaker) {
      breaker.reset();
    } else {
      this.getOrCreate(name);
    }
  }

  getStats(name: string): BreakerStats {
    const breaker = this.getOrCreate(name);
    const stats = breaker.getStats();
    return {
      state: stats.state,
      failureCount: stats.failure_count,
      consecutiveSuccesses: stats.success_count,
    };
  }
}

export const circuitBreakerManager = new CircuitBreakerManager();
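
If the failureThreshold and recoveryTimeoutMs options feel abstract, the classic circuit breaker state machine behind them can be sketched without any dependency. This is an illustration of the general pattern only, not the @reaatech/circuit-breaker-agents implementation: `SimpleBreaker` and its methods are hypothetical, and the sketch skips refinements such as limiting half-open to a single probe request.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class SimpleBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private failureThreshold: number;
  private recoveryTimeoutMs: number;
  private now: () => number;

  // The clock is injectable so tests can control time.
  constructor(failureThreshold = 5, recoveryTimeoutMs = 30_000, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
    this.now = now;
  }

  // Reports the state, moving open -> half-open once the timeout has elapsed.
  getState(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.recoveryTimeoutMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    // A failed probe in half-open re-opens the circuit immediately.
    if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  // Rejects fast while open; otherwise runs the action and records the outcome.
  async execute<T>(action: () => Promise<T>): Promise<T> {
    if (this.getState() === "open") {
      throw new Error("circuit_open");
    }
    try {
      const result = await action();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }
}
```

With the defaults used by the manager (5 failures, 30000 ms), five consecutive failures open the circuit, calls are rejected for 30 seconds, and the next call after that acts as a probe that either closes the circuit or re-opens it.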

Step 10: Build the health check API route

import { type NextRequest, NextResponse } from "next/server";
import { healthCheckPayloadSchema, type IncidentReport } from "@/src/types/index.js";
import { runHealthCheck, recoverFromIncident } from "@/src/lib/reliability.js";
import { idempotencyService } from "@/src/services/idempotency.js";
import { observabilityService } from "@/src/services/observability.js";

const KNOWN_SERVICES = ["azure-openai"];

// GET: check every known service and report the worst status found.
export async function GET() {
  const trace = observabilityService.createTrace("health-check-list");
  const services = await Promise.all(
    KNOWN_SERVICES.map((name) => runHealthCheck(name)),
  );
  const hasUnhealthy = services.some((s) => s.status === "unhealthy");
  const hasDegraded = services.some((s) => s.status === "degraded");
  const overallStatus: "healthy" | "degraded" | "unhealthy" = hasUnhealthy
    ? "unhealthy"
    : hasDegraded
      ? "degraded"
      : "healthy";
  observabilityService.finalizeTrace(trace, "success");
  return NextResponse.json({
    status: overallStatus,
    services,
    timestamp: new Date().toISOString(),
  });
}

// POST: check one service and start recovery if it is degraded or unhealthy.
export async function POST(req: NextRequest) {
  let body: unknown;
  try {
    body = await req.json();
  } catch {
    return NextResponse.json({ error: "invalid payload" }, { status: 400 });
  }

  const parsed = healthCheckPayloadSchema.safeParse(body);
  if (!parsed.success) {
    return NextResponse.json({ error: "invalid payload" }, { status: 400 });
  }

  const { serviceName } = parsed.data;
  const idempotencyKey = req.headers.get("Idempotency-Key");

  const executeHealthCheck = async () => {
    const healthResult = await runHealthCheck(serviceName);
    if (healthResult.status === "unhealthy" || healthResult.status === "degraded") {
      const trace = observabilityService.createTrace("health-check-failure", {
        serviceName,
        status: healthResult.status,
      });
      observabilityService.finalizeTrace(
        trace,
        "error",
        `Service ${serviceName} is ${healthResult.status}`,
      );
      const incident: IncidentReport = {
        id: `health-${String(Date.now())}`,
        timestamp: new Date().toISOString(),
        service: serviceName,
        severity: "high",
        status: "detected",
        failureMode: `health check returned ${healthResult.status}`,
      };
      await recoverFromIncident(incident);
    }
    return healthResult;
  };

  // With an Idempotency-Key header, repeated requests reuse the first result
  // instead of triggering recovery again.
  if (idempotencyKey) {
    const result = await idempotencyService.execute(
      idempotencyKey,
      { method: "POST", path: req.nextUrl.pathname, body },
      executeHealthCheck,
    );
    return NextResponse.json(result);
  }

  const result = await executeHealthCheck();
  return NextResponse.json(result);
}

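The POST handler hands the Idempotency-Key, a description of the request, and the work to run to idempotencyService.execute. The idempotency service itself is not shown here, so the following is only a sketch of how a call with that shape could deduplicate work in memory: `InMemoryIdempotency`, its TTL default, and the JSON fingerprint are all assumptions.

```typescript
interface Entry {
  fingerprint: string;
  expiresAt: number;
  promise: Promise<unknown>;
}

class InMemoryIdempotency {
  private entries = new Map<string, Entry>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs = 60_000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  execute<T>(key: string, request: unknown, fn: () => Promise<T>): Promise<T> {
    // JSON.stringify depends on property order; a production fingerprint
    // would canonicalize or hash the request instead.
    const fingerprint = JSON.stringify(request);
    const existing = this.entries.get(key);
    if (existing && existing.expiresAt > this.now()) {
      if (existing.fingerprint !== fingerprint) {
        return Promise.reject(new Error("Idempotency-Key reused with a different request"));
      }
      // Same key, same request: share the first (possibly in-flight) result.
      return existing.promise as Promise<T>;
    }
    const promise = fn();
    this.entries.set(key, { fingerprint, expiresAt: this.now() + this.ttlMs, promise });
    // Forget failed attempts so a retry with the same key can run again.
    promise.catch(() => {
      if (this.entries.get(key)?.promise === promise) {
        this.entries.delete(key);
      }
    });
    return promise;
  }
}
```

Storing the promise rather than the resolved value is what covers concurrent requests: a second request that arrives while the first is still running waits on the same promise instead of starting a second recovery.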
Step 12: Run the tests

Test file	What it validates
`tests/services/circuit-breaker.test.ts`	Breaker creation, execution, open/close transitions
`tests/services/idempotency.test.ts`	Key deduplication, concurrent requests, TTL
`tests/services/observability.test.ts`	Trace creation, span recording, Langfuse fallback
`tests/lib/reliability.test.ts`	`reliableCall`, `runHealthCheck`, `recoverFromIncident`
`tests/workflows/incident.test.ts`	All five trigger.dev task handlers
`tests/api/health.test.ts`	GET aggregate status, POST with/without idempotency
`tests/api/workflow-trigger.test.ts`	Workflow dispatch for all five workflow types
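
The "GET aggregate status" case comes down to one rule: the worst service status wins. As a sketch of how that rule could be checked in isolation, the helper below pulls it into a pure function. `aggregateStatus` and `SEVERITY` are hypothetical names, and the suite's own tests use Vitest rather than this standalone form.

```typescript
type Status = "healthy" | "degraded" | "unhealthy";

// Higher number means worse status.
const SEVERITY: Record<Status, number> = {
  healthy: 0,
  degraded: 1,
  unhealthy: 2,
};

// Returns the worst status in the list; an empty list counts as healthy,
// which matches the nested ternary in the GET handler.
function aggregateStatus(statuses: Status[]): Status {
  return statuses.reduce<Status>(
    (worst, current) => (SEVERITY[current] > SEVERITY[worst] ? current : worst),
    "healthy",
  );
}
```

A pure function like this can be covered with a handful of table-style cases, with no need to mock the route or the services behind it.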
