Small businesses running AI agents on Azure often face unpredictable outages, silent failures, and runaway costs without dedicated SRE teams to monitor, alert, and recover.
A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.
This tutorial walks through building the Azure AI Reliability Suite, a reliability layer for small businesses running AI agents on Azure. You’ll wire together circuit breakers, idempotency middleware, observability tracing, and durable incident workflows — a system that detects failures, prevents duplicate side effects, isolates failing services, and runs automated recovery. By the end you’ll have a Next.js 16 App Router app with two API routes, @trigger.dev durable tasks, and a complete Vitest test suite.
This is an intermediate-level recipe. You should be comfortable with TypeScript, Next.js App Router, and basic testing with Vitest.
Prerequisites
Node.js 22+ and pnpm 10 installed
An Azure OpenAI endpoint, API key, and deployment name (or set AGENT_PROVIDER=mock in .env to skip live LLM calls)
A free trigger.dev account and secret key (for durable workflow orchestration)
A Langfuse account (public + secret keys) if you want observability tracing — the system degrades gracefully when disabled
Step 1: Scaffold the project and configure environment variables
Start from the scaffolded Next.js 16 App Router project. The dependency tree is already set — install everything, then copy the example environment file:
terminal
pnpm install
cp .env.example .env
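Open .env and fill in your credentials: the Azure OpenAI endpoint, API key, and deployment name, the trigger.dev secret key, and, if you use tracing, the Langfuse public and secret keys.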
If you want to skip live LLM calls for local development, set AGENT_PROVIDER=mock and LANGFUSE_ENABLED=false.
Expected output: pnpm install finishes with no errors and your .env file has real values for the services you’re using.
Step 2: Define the shared types and Zod schemas
All the types — health check results, incident reports, recovery actions, and reliability configuration — live in a single module. This gives every service a consistent vocabulary.
Expected output: pnpm typecheck passes with no errors. The interfaces and Zod schemas compile cleanly.
Step 3: Build the Azure OpenAI service
This service wraps the openai SDK for Azure endpoints. Every AI call your agents make flows through this adapter, which also exposes a healthCheck() method that the reliability layer polls.
Expected output: pnpm typecheck passes. The constructor builds an OpenAI client pointed at Azure’s {endpoint}.openai.azure.com with the 2024-10-01-preview API version.
Step 4: Build the circuit breaker manager
The circuit breaker pattern prevents your system from hammering a failing service. After failureThreshold consecutive failures the breaker opens and subsequent calls are rejected immediately with a CircuitOpenError. After recoveryTimeoutMs it transitions to half-open and allows one probe call.
Expected output: The CircuitBreakerManager wraps the @reaatech/circuit-breaker-agents library. Each named circuit gets its own singleton instance. The execute() method normalizes the CircuitOpenError into a { rejected: true, reason: "circuit_open" } result so callers don’t have to catch the exception themselves.
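The state machine described in this step can be sketched without any library. This is a minimal illustration, not the @reaatech/circuit-breaker-agents API: SimpleBreaker, BreakerResult, and the injectable clock are hypothetical names chosen for the example, and the sketch does not limit concurrent probes while half-open.

```typescript
// Illustrative only: SimpleBreaker and BreakerResult are hypothetical names,
// not the @reaatech/circuit-breaker-agents API.
type BreakerState = "closed" | "open" | "half-open";

type BreakerResult<T> =
  | { rejected: false; value: T }
  | { rejected: true; reason: "circuit_open" };

class SimpleBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private failureThreshold: number;
  private recoveryTimeoutMs: number;
  private now: () => number;

  constructor(
    failureThreshold: number,
    recoveryTimeoutMs: number,
    now: () => number = () => Date.now(),
  ) {
    this.failureThreshold = failureThreshold;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
    this.now = now; // injectable clock so the timeout can be tested without waiting
  }

  getState(): BreakerState {
    // An open breaker becomes half-open once the recovery timeout has elapsed.
    if (this.state === "open" && this.now() - this.openedAt >= this.recoveryTimeoutMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  async execute<T>(fn: () => Promise<T>): Promise<BreakerResult<T>> {
    if (this.getState() === "open") {
      // Fail fast instead of calling the struggling service.
      return { rejected: true, reason: "circuit_open" };
    }
    try {
      const value = await fn();
      this.failures = 0;
      this.state = "closed"; // a successful call (including a probe) closes the circuit
      return { rejected: false, value };
    } catch (err) {
      this.failures += 1;
      // A failed probe, or reaching the threshold, opens the circuit.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

Errors from the wrapped call are still rethrown here; only the open-circuit case is turned into a result value, which mirrors the normalization described above.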
Step 5: Build the idempotency service
Idempotency ensures that retrying a request with the same idempotency key returns the original result instead of repeating its side effects. This is critical for AI agent operations, where a retry after a timeout could otherwise cause a duplicate charge or a double order.
Expected output: The module-level await service.connect() runs during module initialization, priming the in-memory storage with a 24-hour TTL by default. The execute() method delegates to IdempotencyMiddleware, which handles duplicate detection, distributed locking, and cache expiry.
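To make the duplicate-detection idea concrete, here is a small in-memory sketch. It is not the IdempotencyMiddleware implementation: IdempotencyCache and its injectable clock are hypothetical, and the in-flight map is only a single-process stand-in for the distributed locking the real middleware provides.

```typescript
// Illustrative only: IdempotencyCache is a hypothetical name, not part of
// @reaatech/idempotency-middleware.
interface CachedEntry {
  value: unknown;
  expiresAt: number;
}

class IdempotencyCache {
  private entries = new Map<string, CachedEntry>();
  private inFlight = new Map<string, Promise<unknown>>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number = 24 * 60 * 60 * 1000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs; // defaults to 24 hours
    this.now = now;
  }

  async execute<T>(key: string, fn: () => Promise<T>): Promise<T> {
    // 1. A completed, unexpired result is replayed without running fn again.
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value as T;

    // 2. A concurrent duplicate waits for the call that is already running.
    const pending = this.inFlight.get(key);
    if (pending) return pending as Promise<T>;

    // 3. Otherwise run the operation once and cache a successful result.
    const run = Promise.resolve()
      .then(fn)
      .then((value) => {
        this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
        return value;
      });
    this.inFlight.set(key, run);
    const cleanup = () => {
      this.inFlight.delete(key);
    };
    run.then(cleanup, cleanup); // failures are not cached, so a retry can run again
    return run;
  }
}
```

With this shape, a client that times out and retries with the same key gets the stored result back, and the side effect inside fn runs only once per TTL window.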
Step 6: Build the observability service
Observability traces every reliability operation — health checks, circuit breaker trips, idempotency hits, and incident recovery — via Langfuse. When Langfuse is not configured, the service falls back to no-op stubs so your system never crashes due to missing telemetry infrastructure.
Expected output: The ObservabilityService checks LANGFUSE_ENABLED once on first use. If disabled or misconfigured, every method is a no-op. If enabled, it creates real Langfuse traces and spans.
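The no-op fallback can be sketched as follows. The Tracer and Span interfaces and the makeReal factory are assumptions made for this example and do not reflect the Langfuse SDK surface; only the LANGFUSE_ENABLED variable comes from the recipe, and treating the string "true" as enabled is also an assumption.

```typescript
// Illustrative only: Tracer, Span, and makeReal are hypothetical; a real
// implementation would wrap the Langfuse client inside makeReal.
interface Span {
  end(): void;
}

interface Tracer {
  startSpan(name: string, metadata?: Record<string, unknown>): Span;
}

const noopSpan: Span = { end() {} };
const noopTracer: Tracer = { startSpan: () => noopSpan };

function resolveTracer(
  env: Record<string, string | undefined>,
  makeReal: () => Tracer,
): Tracer {
  // Assumption: the flag is the literal string "true" when tracing is on.
  if (env.LANGFUSE_ENABLED !== "true") return noopTracer;
  try {
    return makeReal();
  } catch {
    // Misconfigured telemetry must never take the application down.
    return noopTracer;
  }
}

// Resolve lazily and only once, on first use.
function createLazyTracer(
  env: Record<string, string | undefined>,
  makeReal: () => Tracer,
): () => Tracer {
  let cached: Tracer | null = null;
  return () => {
    if (cached === null) cached = resolveTracer(env, makeReal);
    return cached;
  };
}
```

Callers always receive an object with the same methods, so instrumentation code does not need to branch on whether tracing is enabled.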
Step 7: Build the runbook service
The runbook service uses the @reaatech/agent-runbook-agent package to generate recovery plans and identify failure modes through an LLM. This is where your system delegates reasoning about how to recover to an AI agent.
Expected output: The RunbookService creates an AnalysisAgent configured from environment variables. The generateRecoveryPlan() call is the primary method used by the incident workflow.
Step 8: Wire the reliability orchestration layer
Now you connect everything (the circuit breaker, idempotency, observability, and runbook services) in a single src/lib/reliability.ts module. This is the heart of the system.
Create src/lib/reliability.ts:
typescript
import { createAzureOpenAIService } from "../services/azure-openai.js";
import { circuitBreakerManager } from "../services/circuit-breaker.js";
import { idempotencyService } from "../services/idempotency.js";
import { runbookService } from "../services/runbook.js";
import { observabilityService } from "../services/observability.js";
import type { HealthCheckResult, IncidentReport, ReliabilityConfig } from "../types/index.js";

let azureServiceInstance: ReturnType<typeof createAzureOpenAIService> | null = null;

export function createAnalysisContext(data: { service?: string; incidentId?: string
Expected output: reliableCall() chains idempotency (outer) around the circuit breaker (inner), so duplicate requests are caught before they hit the breaker. runHealthCheck() polls the Azure OpenAI health endpoint through the breaker. recoverFromIncident() resets the circuit and generates an LLM-driven recovery plan.
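The layering order is the part that is easy to get backwards, so here is a stripped-down sketch of it. reliableCallSketch, throughBreaker, and the module-level map are hypothetical stand-ins for illustration and are not the code in src/lib/reliability.ts.

```typescript
// Illustrative only: stand-ins that show the order of the two layers.
const completed = new Map<string, unknown>();
let breakerInvocations = 0;

// Inner layer: a placeholder for the circuit breaker's execute().
async function throughBreaker<T>(fn: () => Promise<T>): Promise<T> {
  breakerInvocations += 1; // real breaker bookkeeping would happen here
  return fn();
}

// Outer layer: the idempotency check runs first, so a duplicate request
// returns early and never counts against the breaker.
async function reliableCallSketch<T>(key: string, fn: () => Promise<T>): Promise<T> {
  if (completed.has(key)) return completed.get(key) as T;
  const value = await throughBreaker(fn);
  completed.set(key, value);
  return value;
}
```

If the layers were reversed, every replayed duplicate would still pass through the breaker and could skew its success and failure counts.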
Step 9: Build the @trigger.dev incident workflows
Durable workflows ensure recovery steps survive server restarts. @trigger.dev/sdk tasks are persisted to trigger.dev’s infrastructure and retried on failure.
Create src/workflows/incident.ts:
typescript
import { task } from "@trigger.dev/sdk/v3";
import { circuitBreakerManager } from "../services/circuit-breaker.js";
import { runbookService } from "../services/runbook.js";
import { observabilityService } from "../services/observability.js";
import type { IncidentReport } from "../types/index.js";

function createWorkflowContext(data: { service?: string; incidentId?: string }) {
  return {
    serviceDefinition: {
      name: data.service ?? "unknown",
      repository: undefined,
      description: undefined,
      version: undefined,
Expected output: Five @trigger.dev tasks: detectIncidentTask checks if the circuit is open and queues a reset if so; resetCircuitTask resets the breaker and generates a recovery plan; rollbackTask resets plus generates a rollback-specific plan; retryTask uses exponential backoff with escalation on exhaustion; escalateTask logs the escalation to observability.
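The retry behavior described for retryTask (exponential backoff, then escalation when attempts run out) can be sketched as a plain function. This is not the trigger.dev task itself: retryWithEscalation, the doubling schedule, and the injectable sleep are assumptions for the example.

```typescript
// Illustrative only: a plain-function sketch of backoff plus escalation.
const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithEscalation<T>(
  fn: () => Promise<T>,
  maxRetries: number,
  baseDelayMs: number,
  escalate: (reason: string) => void,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T | undefined> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch {
      // Wait baseDelayMs, then 2x, 4x, ... but not after the final attempt.
      if (attempt < maxRetries - 1) {
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  // Every attempt failed: hand the incident to a human or another workflow.
  escalate(`retries exhausted after ${maxRetries} attempts`);
  return undefined;
}
```

In the durable version, the waits and the escalation would be handled by the workflow runtime so they survive a restart; the control flow is the same.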
Step 10: Build the health check API route
The health API at app/api/health/route.ts exposes two handlers:
GET — runs health checks on all known services and returns an aggregate status
POST — runs a health check on a specific service and auto-recovers if unhealthy
Expected output: GET /api/health returns { status: "healthy"|"degraded"|"unhealthy", services: [...], timestamp: "..." }. POST /api/health with { serviceName: "azure-openai" } runs a health check and automatically triggers recovery when the service is unhealthy or degraded.
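One plausible way to derive the aggregate status is a worst-status-wins rule. The recipe does not spell out its rule, so treat this as an assumption; ServiceHealth, aggregateStatus, and buildHealthResponse are hypothetical names, and only the response shape is taken from the expected output above.

```typescript
// Illustrative only: assumes the aggregate is the worst individual status.
type HealthStatus = "healthy" | "degraded" | "unhealthy";

interface ServiceHealth {
  serviceName: string;
  status: HealthStatus;
}

function aggregateStatus(services: ServiceHealth[]): HealthStatus {
  if (services.some((s) => s.status === "unhealthy")) return "unhealthy";
  if (services.some((s) => s.status === "degraded")) return "degraded";
  return "healthy";
}

// Builds a body with the same keys as the GET response: status, services, timestamp.
function buildHealthResponse(services: ServiceHealth[]) {
  return {
    status: aggregateStatus(services),
    services,
    timestamp: new Date().toISOString(),
  };
}
```

Under this rule a single unhealthy dependency marks the whole system unhealthy, which is a conservative choice for a small team without round-the-clock monitoring.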
Step 11: Build the workflow trigger API route
This route accepts a workflowType and payload and dispatches to the matching @trigger.dev task.
Create app/api/workflows/trigger/route.ts:
typescript
import { type NextRequest, NextResponse } from "next/server";
import { workflowTriggerSchema, type IncidentReport } from "@/src/types/index.js";
import {
  detectIncidentTask,
  resetCircuitTask,
  rollbackTask,
  retryTask,
  escalateTask,
} from "@/src/workflows/incident.js";

export async function POST(req: NextRequest) {
  let body: unknown;
  try {
    body = await req.json();
  } catch {
    return NextResponse.json({ error: "invalid payload" }, { status: 400 });
  }

  const parsed = workflowTriggerSchema.safeParse(body);
  if (!parsed.success) {
    return NextResponse.json({ error: "invalid payload" }, { status: 400 });
  }

  const { workflowType, payload } = parsed.data;
  switch (workflowType) {
    case "detect_incident":
      await detectIncidentTask.trigger(payload as IncidentReport);
      break;
    case "reset_circuit":
      await resetCircuitTask.trigger(payload as { serviceName: string });
      break;
    case "rollback":
      await rollbackTask.trigger(payload as { serviceName: string; incidentId: string });
      break;
    case "retry":
      await retryTask.trigger(payload as { serviceName: string; maxRetries: number });
      break;
    case "escalate":
      await escalateTask.trigger(payload as { incidentId: string; reason: string });
      break;
  }

  return NextResponse.json({
    accepted: true,
    workflowType,
  });
}
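The route relies on workflowTriggerSchema to reject malformed bodies before dispatch. The actual Zod schema lives in the types module and is not shown here, so the following is a dependency-free approximation of the same check. parseWorkflowTrigger is a hypothetical helper, and requiring payload to be an object is an assumption; the five workflow type names are taken from the switch statement in the route.

```typescript
// Illustrative only: a hand-written stand-in for the Zod schema's validation.
const WORKFLOW_TYPES = [
  "detect_incident",
  "reset_circuit",
  "rollback",
  "retry",
  "escalate",
] as const;

type WorkflowType = (typeof WORKFLOW_TYPES)[number];

interface WorkflowTrigger {
  workflowType: WorkflowType;
  payload: Record<string, unknown>;
}

function parseWorkflowTrigger(body: unknown): WorkflowTrigger | null {
  if (typeof body !== "object" || body === null) return null;
  const { workflowType, payload } = body as Record<string, unknown>;
  if (typeof workflowType !== "string") return null;
  if (!(WORKFLOW_TYPES as readonly string[]).includes(workflowType)) return null;
  // Assumption: every workflow expects an object payload.
  if (typeof payload !== "object" || payload === null) return null;
  return {
    workflowType: workflowType as WorkflowType,
    payload: payload as Record<string, unknown>,
  };
}
```

A body such as { workflowType: "retry", payload: { serviceName: "azure-openai", maxRetries: 3 } } passes this check, while an unknown workflowType yields null, which corresponds to the route's 400 response.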
Step 12: Run the test suite
The test suite covers every service and route handler. All external dependencies (Azure OpenAI, Langfuse, trigger.dev, and the REAA packages) are mocked so the tests run without network access.
terminal
pnpm test
Expected output: All tests pass with zero failures and coverage above 90% on runtime code (services, lib, and route handlers). The output includes a JSON report in vitest-report.json.
Health API route tests: GET aggregate status, POST with and without idempotency
tests/api/workflow-trigger.test.ts: workflow dispatch for all five workflow types
Each test mocks its external dependency layer using vi.mock so every scenario — happy path, circuit open, idempotency dedup, retry exhaustion, escalation, invalid payloads — runs in isolation.
Next steps
Switch to persistent storage for the circuit breaker — swap InMemoryAdapter for DynamoDBAdapter so breaker state survives server restarts across your fleet (the @aws-sdk/client-dynamodb dependency is already in package.json).
Replace the in-memory idempotency storage with a Redis or DynamoDB adapter from @reaatech/idempotency-middleware so idempotency keys survive process restarts and scale horizontally.
Expose runbook analysis as MCP tools — the src/services/mcp-server.ts module wraps @reaatech/agent-runbook-mcp to create a local MCP server consumable by Claude Code, Cursor, or other MCP clients.