SMBs running AI-driven customer support or automation can't afford dedicated Site Reliability Engineers. Agent failures, broken tool calls, and unexpected behaviors disrupt business without automated recovery.
A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.
The xAI Grok Reliability Suite is a CLI toolchain that helps SMBs proactively monitor, diagnose, and self-heal their AI agent operations — without needing a dedicated Site Reliability Engineer. You’ll build the entire suite from scratch: shared types, an xAI Grok integration service, a circuit breaker manager, a structured repair service, Langfuse telemetry, a Commander-based CLI, a runbook generator, an anomaly monitor loop, Trigger.dev incident workflows, Next.js API routes, and a full test suite.
Prerequisites
Node.js >= 22 and pnpm installed
An xAI API key for Grok language model access
Langfuse account keys (optional for telemetry, but required to run the full suite)
Trigger.dev account keys (optional for incident webhooks, but required for the webhook route)
Familiarity with TypeScript, Next.js App Router patterns, and basic CLI design
Step 1: Set up the project and environment variables
Start from an empty Next.js 16 (App Router) project with pnpm as the package manager. The scaffold provides package.json, tsconfig.json, next.config.ts, and vitest.config.ts — you’ll modify package.json and later next.config.ts as you go. Your first step is to pin every dependency and set up the environment configuration.
Add the required dependencies to package.json. The scaffold already includes next, react, react-dom, and dev tooling. You add the recipe-specific packages. Open package.json and verify your dependencies and devDependencies blocks match the following:
Notice the bin field in package.json — it maps the rel-ops CLI command to the entry point:
json
"bin": { "rel-ops": "./src/cli/index.ts"}
Now populate .env.example with every environment variable the suite reads. This file is your single source of truth for configuration:
env
# Env vars used by xai-grok-reliability-suite-for-smb-ai-operations.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only — never commit real values.
NODE_ENV=development
XAI_API_KEY=<your-xai-api-key>
TRIGGER_SECRET_KEY=<your-trigger-secret>
TRIGGER_API_KEY=<your-trigger-api-key>
TRIGGER_API_URL=https://api.trigger.dev
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_BASE_URL=https://cloud.langfuse.com
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
CIRCUIT_BREAKER_RECOVERY_TIMEOUT_MS=30000
CIRCUIT_BREAKER_MAX_COST_PER_MINUTE=0.50
Install the new dependencies:
terminal
pnpm install
Expected output: pnpm resolves and links all packages into node_modules/ without errors.
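The circuit breaker settings in .env.example arrive as strings, so they need to be parsed and validated before use. The following is a minimal sketch of one way to do that; the helper names (readNumber, loadBreakerConfig) and the BreakerConfig shape are illustrative assumptions, not part of the suite's published code. Only the variable names and default values come from .env.example.

```typescript
// Hypothetical config shape; field names are assumptions for illustration.
interface BreakerConfig {
  failureThreshold: number;
  recoveryTimeoutMs: number;
  maxCostPerMinute: number;
}

type Env = Record<string, string | undefined>;

// Parse one numeric variable, falling back to a default when it is unset or blank.
function readNumber(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value) || value < 0) {
    throw new Error(`${key} must be a non-negative number, got "${raw}"`);
  }
  return value;
}

// Defaults mirror the placeholder values listed in .env.example.
function loadBreakerConfig(env: Env): BreakerConfig {
  return {
    failureThreshold: readNumber(env, "CIRCUIT_BREAKER_FAILURE_THRESHOLD", 5),
    recoveryTimeoutMs: readNumber(env, "CIRCUIT_BREAKER_RECOVERY_TIMEOUT_MS", 30000),
    maxCostPerMinute: readNumber(env, "CIRCUIT_BREAKER_MAX_COST_PER_MINUTE", 0.5),
  };
}
```

In a real module you would likely call loadBreakerConfig(process.env) once at startup, so a malformed value fails fast instead of surfacing later as a breaker that never trips.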
Step 2: Define the shared types
The entire suite speaks a common language of interfaces. Create src/types/index.ts with every type the services, CLI, workflows, and API routes will share:
Expected output: TypeScript compiles these interfaces without error. They form the data contract between every module you’ll build next.
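To make the idea of a shared data contract concrete, here is a rough sketch of what one such type could look like together with a runtime guard. The field names are taken from the visible part of the anomalyReportSchema in the Grok service; the real AnomalyReport interface may carry more fields, and the isAnomalyReport guard is a hypothetical helper added only for illustration.

```typescript
// Assumed shape: mirrors the fields visible in the Grok service's Zod schema.
interface AnomalyReport {
  anomalyId: string;
  severity: string;
  description: string;
  affectedService: string;
  detectedAt: string; // assumed to be an ISO-8601 timestamp
  confidence: number;
}

// Hypothetical runtime guard for values that cross a module boundary as `unknown`.
function isAnomalyReport(value: unknown): value is AnomalyReport {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const stringKeys = ["anomalyId", "severity", "description", "affectedService", "detectedAt"];
  return (
    stringKeys.every((key) => typeof v[key] === "string") &&
    typeof v["confidence"] === "number"
  );
}
```

A guard like this is mainly useful at the edges (webhook bodies, CLI input), since code inside the suite can rely on the compile-time types.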
Step 3: Create the xAI Grok service
The GrokService class sends logs and metrics to xAI Grok for anomaly detection, failure diagnosis, and metrics summarization. It uses the Vercel AI SDK (ai package) with the @ai-sdk/xai provider, wraps API calls in retry() from @reaatech/agent-runbook, and falls back to repair() from @reaatech/structured-repair-core when Grok returns malformed output.
Create src/services/grok-service.ts:
ts
import { generateText, Output } from "ai";
import { xai } from "@ai-sdk/xai";
import { z } from "zod";
import { retry } from "@reaatech/agent-runbook";
import { repair } from "@reaatech/structured-repair-core";
import type { LogEntry, AnomalyReport } from "../types/index.js";

const anomalyReportSchema = z.object({
  anomalyId: z.string(),
  severity: z.string(),
  description: z.string(),
  affectedService: z.string(),
  detectedAt: z.string(),
  confidence: z.number
Key details about this module:
generateText with Output.object() uses Zod schema inference from ai@6.0.191 to produce typed structured output
The retry(fn, 3, 1000) utility from @reaatech/agent-runbook handles transient API failures — max 3 attempts with 1000ms backoff
When generateText throws but a raw response was captured, the service calls repair() from @reaatech/structured-repair-core as a last-resort recovery — it can extract JSON from markdown fences, fix trailing commas, coerce types, and more
Empty logs short-circuit to [] without any API call
Expected output: pnpm typecheck reports no type errors for this module.
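The retry and short-circuit behavior described above can be sketched without any external packages. This is an illustrative stand-in, not the @reaatech/agent-runbook implementation: retryWithDelay and analyzeWithRetry are hypothetical names, and the fixed delay between attempts is an assumption about what "1000ms backoff" means.

```typescript
// Illustrative stand-in for a retry helper: fixed delay between attempts.
async function retryWithDelay<T>(
  fn: () => Promise<T>,
  maxAttempts: number,
  delayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait only if another attempt will follow.
      if (attempt < maxAttempts) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Mirrors the "empty logs short-circuit" rule: no call is made for an empty batch.
async function analyzeWithRetry<T>(
  logs: readonly unknown[],
  analyze: () => Promise<T[]>,
): Promise<T[]> {
  if (logs.length === 0) return [];
  return retryWithDelay(analyze, 3, 1000);
}
```

Returning early for an empty batch keeps idle services from consuming API quota, which matters when a per-minute cost ceiling is also being enforced.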
Step 4: Build the circuit breaker manager
The circuit breaker prevents cascading failures by tripping to OPEN when an error threshold is exceeded, then recovering to HALF_OPEN after a timeout. You’ll use @reaatech/circuit-breaker-agents — a meta-package that re-exports everything from @reaatech/circuit-breaker-core and all persistence adapters.
Create src/services/circuit-breaker-service.ts:
ts
import {
  CircuitBreaker,
  CircuitOpenError,
  InMemoryAdapter,
  DefaultMetricsCollector,
  type CircuitState,
  type CircuitBreakerStats,
  type CircuitEvent,
} from "@reaatech/circuit-breaker-agents";
import { NoOpMetricsCollector } from "@reaatech/circuit-breaker-core";

const circuits = new Map<string, CircuitBreaker>();
let sharedAdapter: InMemoryAdapter | null = null;

function getAdapter(): InMemoryAdapter {
  if (!sharedAdapter) {
    sharedAdapter
The module also exports free functions (executeWithCircuit, getCircuitStats, resetAll) that operate on the same module-level circuits map — these let CLI callers work without holding a reference to the manager instance.
Expected output: Each circuit is isolated by serviceName, uses the "gradual" recovery strategy, and shares an InMemoryAdapter for state persistence.
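If the CLOSED, OPEN, and HALF_OPEN transitions are unfamiliar, the toy breaker below may help. It is a deliberately simplified sketch and not the @reaatech/circuit-breaker-agents API: SimpleBreaker is a hypothetical class, it is synchronous, it takes an injectable clock so the timeout can be tested, and it does not model the "gradual" recovery strategy.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

// Toy breaker for illustration only; names and behavior are assumptions.
class SimpleBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly recoveryTimeoutMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, recoveryTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
    this.now = now;
  }

  // OPEN becomes HALF_OPEN once the recovery timeout has elapsed.
  getState(): BreakerState {
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.recoveryTimeoutMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  execute<T>(fn: () => T, fallback: () => T): T {
    // While OPEN, the protected function is never invoked.
    if (this.getState() === "OPEN") return fallback();
    try {
      const result = fn();
      this.failures = 0;
      this.state = "CLOSED";
      return result;
    } catch {
      this.failures += 1;
      // A failed trial call in HALF_OPEN reopens the circuit immediately.
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      return fallback();
    }
  }
}
```

Keeping one breaker per serviceName, as the manager does, means a failing dependency only trips its own circuit rather than blocking calls to healthy services.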
Step 5: Build the structured repair service
The RepairService is a thin wrapper around @reaatech/structured-repair-core that adds logging context. It repairs malformed LLM output using six graduated strategies: fence stripping, prose extraction, JSON syntax fixing, type coercion, fuzzy key matching, and extra field removal.
Expected output: A service that handles JSON-within-markdown-fences (```json ... ```), trailing commas, type coercion ("30" → 30), and completely invalid input — all delegated to the @reaatech/structured-repair-core package.
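To give a feel for three of those strategies (fence stripping, trailing-comma fixing, and type coercion), here is a small dependency-free sketch. It is not the @reaatech/structured-repair-core API; stripFences, removeTrailingCommas, coerceNumbers, and repairJson are hypothetical helpers, and the real package may order or implement its strategies differently.

```typescript
// Matches a Markdown code fence (three backticks), optionally tagged "json".
const FENCE_PATTERN = new RegExp("`{3}(?:json)?\\s*([\\s\\S]*?)`{3}");

// Strategy 1: keep only the fenced payload when the model wrapped its answer.
function stripFences(text: string): string {
  const match = text.match(FENCE_PATTERN);
  return (match?.[1] ?? text).trim();
}

// Strategy 2: drop commas before a closing brace or bracket.
// Naive on purpose: it would also alter ",}" inside a string value.
function removeTrailingCommas(json: string): string {
  return json.replace(/,\s*([}\]])/g, "$1");
}

// Strategy 3: turn numeric strings such as "30" into numbers for the listed keys.
function coerceNumbers(
  obj: Record<string, unknown>,
  numericKeys: string[],
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...obj };
  for (const key of numericKeys) {
    const value = out[key];
    if (typeof value === "string" && value.trim() !== "" && Number.isFinite(Number(value))) {
      out[key] = Number(value);
    }
  }
  return out;
}

// Returns null when the input cannot be repaired into a JSON object.
function repairJson(raw: string, numericKeys: string[]): Record<string, unknown> | null {
  try {
    const parsed: unknown = JSON.parse(removeTrailingCommas(stripFences(raw)));
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) return null;
    return coerceNumbers(parsed as Record<string, unknown>, numericKeys);
  } catch {
    return null;
  }
}
```

Applying the cheapest fixes first and returning null for unrecoverable input keeps the caller's error handling simple: it either gets a usable object or a clear signal to fall back.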
Step 6: Set up Langfuse telemetry
Telemetry wraps every significant operation (Grok analysis, runbook generation, incident response) with tracing and structured events. The langfuse SDK is initialized as a singleton and configured from environment variables.
Expected output: Every call to withTelemetry("name", fn) creates a Langfuse trace and span and records success or error metadata. On process exit, a handler calls shutdownAsync() to flush pending telemetry.
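The wrapping pattern itself is independent of Langfuse and can be sketched with an in-memory recorder. In the sketch below, withTelemetrySketch, TraceRecord, and the recordedTraces array are all hypothetical; a real implementation would send the same information to the Langfuse client instead of pushing to an array.

```typescript
// Hypothetical record of one traced operation.
interface TraceRecord {
  name: string;
  status: "success" | "error";
  durationMs: number;
  error?: string;
}

// Stand-in for a telemetry backend; assumed for illustration.
const recordedTraces: TraceRecord[] = [];

// Run fn, record its outcome and duration, and rethrow on failure so callers still see the error.
async function withTelemetrySketch<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    recordedTraces.push({ name, status: "success", durationMs: Date.now() - start });
    return result;
  } catch (err) {
    recordedTraces.push({
      name,
      status: "error",
      durationMs: Date.now() - start,
      error: err instanceof Error ? err.message : String(err),
    });
    throw err;
  }
}
```

Rethrowing after recording is the important design choice here: telemetry observes failures without swallowing them, so the circuit breaker upstream still counts the error.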
Step 7: Create the CLI entry point
The CLI uses commander to expose two commands: rel-ops monitor and rel-ops generate-runbook.
Create src/cli/index.ts:
ts
import { Command } from "commander";
import { runMonitor } from "./agent-monitor.js";
import { runGenerateRunbook } from "./runbook-gen.js";

export function createProgram(): Command {
  const program = new Command();

  program
    .name("rel-ops")
    .description("xAI Grok Reliability Suite — proactively monitor, diagnose, and self-heal your AI agent operations")
    .version("0.1.0");

  program
    .command("monitor")
    .description("Monitor service health and detect anomalies")
    .option("--config <path>", "Path to service config JSON", "./services.json")
    .option("--interval <ms>", "Polling interval in ms", "5000")
    .action(async (opts: { config: string; interval: string }) => {
      await runMonitor({ config: opts.config, interval: opts.interval });
    });

  program
    .command("generate-runbook")
    .description("Generate a runbook for a service")
    .requiredOption("--service <name>", "Service name")
    .requiredOption("--team <name>", "Team name")
    .option("--platform <name>", "Deployment platform k8s|ecs|cloud-run", "kubernetes")
    .action(async (opts: { service: string; team: string; platform: string }) => {
      await runGenerateRunbook({ service: opts.service, team: opts.team, platform: opts.platform });
    });

  return program;
}

async function main() {
  const program = createProgram();
  await program.parseAsync(process.argv);
}

const runningDirectly = process.argv[1]?.endsWith("index.ts");
if (runningDirectly) {
  main().catch((err: unknown) => {
    console.error("Fatal error:", err);
    process.exit(1);
  });
}
The module auto-invokes main() when run directly via npx tsx src/cli/index.ts. It also exports createProgram() for programmatic use and testing.
Expected output: Running npx tsx src/cli/index.ts help prints the rel-ops usage with monitor and generate-runbook subcommands.
Step 8: Implement the agent monitor
The monitor is the heart of the suite. Each tick collects logs, runs Grok anomaly detection through the circuit breaker, records anomalies via Langfuse, and emits JSON results to stdout.
Create src/cli/agent-monitor.ts:
ts
import type { ServiceConfig, LogEntry, MonitorResult, AnomalyReport } from "../types/index.js";
import { GrokService } from "../services/grok-service.js";
import { CircuitBreakerManager } from "../services/circuit-breaker-service.js";
import { RepairService } from "../services/repair-service.js";
import { withTelemetry, recordAnomaly } from "../lib/telemetry.js";
import { readJsonFile } from "@reaatech/agent-runbook";

let monitorInterval: ReturnType<typeof setInterval> | null = null;
const grokService = new GrokService();
const breakerManager = new CircuitBreakerManager
The monitor pipes data through a clear chain: collectLogs → breakerManager.executeWithCircuit (wrapping grokService.analyzeLogs, which is itself wrapped in withTelemetry) → recordAnomaly → return MonitorResult. When the circuit is OPEN, the fallback returns an empty anomaly array so monitoring degrades gracefully.
Expected output: With collectLogs returning simulated log entries, detectAnomalies returns a structured MonitorResult with service health, anomaly list, circuit state, and error rate.
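A single tick of that chain can be approximated in a few lines. The sketch below is an assumption-heavy simplification: SketchLog and SketchResult are hypothetical stand-ins for the suite's LogEntry and MonitorResult types, detection is passed in as a plain function, and the rule used to decide "healthy" is invented for illustration.

```typescript
type TickCircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

// Hypothetical, reduced versions of the shared types.
interface SketchLog {
  level: "info" | "warn" | "error";
  message: string;
}

interface SketchResult {
  service: string;
  healthy: boolean;
  anomalies: string[];
  circuitState: TickCircuitState;
  errorRate: number;
}

// Fraction of log entries at "error" level; 0 for an empty batch.
function errorRate(logs: readonly SketchLog[]): number {
  if (logs.length === 0) return 0;
  const errors = logs.filter((log) => log.level === "error").length;
  return errors / logs.length;
}

// One monitor tick: skip detection entirely while the circuit is OPEN.
function runTick(
  service: string,
  logs: readonly SketchLog[],
  circuitState: TickCircuitState,
  detect: (logs: readonly SketchLog[]) => string[],
): SketchResult {
  const anomalies = circuitState === "OPEN" ? [] : detect(logs);
  return {
    service,
    // Assumed health rule: circuit not OPEN and nothing flagged.
    healthy: circuitState !== "OPEN" && anomalies.length === 0,
    anomalies,
    circuitState,
    errorRate: errorRate(logs),
  };
}
```

Note that the error rate is still computed while the circuit is OPEN, so the emitted JSON remains informative even when anomaly detection is temporarily unavailable.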
Step 9: Implement the runbook generator
The runbook generator uses @reaatech/agent-runbook, @reaatech/agent-runbook-alerts, and @reaatech/agent-runbook-health-checks to produce a complete markdown runbook. It builds an AnalysisContext from a RunbookConfig, discovers existing alerts and health checks, generates new ones, and writes runbook.md to disk.
Create src/cli/runbook-gen.ts:
ts
import {
  writeFile,
  generateId,
  escapeMarkdown,
  sanitizeAnchor,
  AnalysisContextSchema,
  retry,
  ValidationError,
  ConfigurationError,
  type ServiceDefinition,
} from "@reaatech/agent-runbook";
import {
  extractAlerts,
  generateAlerts,
  formatAlertsForPlatform,
} from "@reaatech/agent-runbook-alerts";
import {
  identifyHealthChecks,
  generateHealthChecks,
  generateKubernetesProbeYaml,
} from "@reaatech/agent-runbook-health-checks";
import type { RunbookConfig } from "../types/index.js";
import { withTelemetry } from "../lib/telemetry.js"
Expected output: Running npx tsx src/cli/index.ts generate-runbook --service agent-api --team my-team --platform kubernetes writes a runbook.md file with alerts, health checks, Kubernetes probe YAML, and failure mode documentation.
Step 10: Wire up incident response workflows
The incident response module wraps Trigger.dev for event-driven rollback. When an anomaly.detected event fires, the defineIncidentJob handler checks the circuit breaker state: if OPEN, it executes the rollback steps from the payload; if CLOSED or HALF_OPEN, it logs and skips.
Expected output: The TriggerClient class wraps the @trigger.dev/sdk primitives (task, tasks.trigger, configure) into a clean interface. The defineIncidentJob function wires a Trigger.dev job that routes on the anomaly.detected event name.
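The routing decision at the core of that handler is simple enough to sketch without Trigger.dev. Everything below is hypothetical: the AnomalyEvent payload shape, the handleIncident name, and the injected getState and runStep callbacks are assumptions made so the logic can run standalone, and logging on the skip path is left out.

```typescript
type IncidentCircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

// Assumed event shape; the real payload may differ.
interface AnomalyEvent {
  name: string;
  payload: { service: string; rollbackSteps: string[] };
}

interface IncidentOutcome {
  action: "rolled_back" | "skipped" | "ignored";
  stepsRun: string[];
}

// Roll back only when the affected service's circuit is OPEN.
function handleIncident(
  event: AnomalyEvent,
  getState: (service: string) => IncidentCircuitState,
  runStep: (step: string) => void,
): IncidentOutcome {
  if (event.name !== "anomaly.detected") {
    return { action: "ignored", stepsRun: [] };
  }
  if (getState(event.payload.service) !== "OPEN") {
    // CLOSED or HALF_OPEN: the service is healthy or recovering, so do nothing.
    return { action: "skipped", stepsRun: [] };
  }
  const stepsRun: string[] = [];
  for (const step of event.payload.rollbackSteps) {
    runStep(step);
    stepsRun.push(step);
  }
  return { action: "rolled_back", stepsRun };
}
```

Gating rollback on the breaker state means a single noisy anomaly report does not trigger a rollback by itself; the service has to have actually crossed its failure threshold first.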
Step 11: Add the webhook route and health check
The Next.js App Router handles two API routes: a health check endpoint and a Trigger.dev webhook receiver. These route files go under app/api/ at the project root.
Create app/api/health/route.ts:
ts
import { NextResponse } from "next/server";

export const dynamic = "force-dynamic";

export function GET() {
  return NextResponse.json({
    status: "ok",
    timestamp: new Date().toISOString(),
  });
}
Expected output: GET /api/health returns { status: "ok", timestamp: "<ISO-8601>" } with a 200 status. POST /api/webhook/trigger validates the x-trigger-secret header, parses the body, forwards anomaly.detected payloads to the incident flow, and returns { received: true }.
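The webhook's validate, parse, and forward sequence can be expressed as a pure function, which makes it easy to test before wiring it into a route handler. This is a sketch under assumptions: handleWebhook and safeEqual are hypothetical names, the 401 and 400 status codes are guesses at reasonable behavior, and the expected secret is passed in rather than read from a specific environment variable.

```typescript
interface WebhookOutcome {
  status: number;
  body: Record<string, unknown>;
  forwarded: boolean;
}

// Compare two strings without exiting early on the first mismatch.
function safeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const length = Math.max(a.length, b.length);
  for (let i = 0; i < length; i++) {
    // charCodeAt returns NaN past the end; treat that as 0.
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

// secretHeader is the x-trigger-secret value, or null when the header is absent.
function handleWebhook(
  secretHeader: string | null,
  rawBody: string,
  expectedSecret: string | undefined,
): WebhookOutcome {
  if (!expectedSecret || secretHeader === null || !safeEqual(secretHeader, expectedSecret)) {
    return { status: 401, body: { error: "unauthorized" }, forwarded: false };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return { status: 400, body: { error: "invalid JSON" }, forwarded: false };
  }
  const eventName =
    typeof parsed === "object" && parsed !== null
      ? (parsed as Record<string, unknown>)["name"]
      : undefined;
  // Only anomaly.detected events are handed to the incident flow.
  return { status: 200, body: { received: true }, forwarded: eventName === "anomaly.detected" };
}
```

Rejecting requests when no secret is configured at all avoids the failure mode where a missing environment variable silently disables authentication.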
Step 12: Add Next.js instrumentation
The instrumentation file hooks into Next.js’s server lifecycle to initialize Langfuse before any request handler runs.
Create src/instrumentation.ts:
ts
export async function register() {
  if (process.env.NEXT_RUNTIME === "nodejs") {
    await import("./lib/telemetry.js");
  }
}
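For this to work on older Next.js releases, you need to enable the instrumentation hook in next.config.ts. Open the file (the scaffold created it with placeholder config options) and replace its contents with the config below. Note that instrumentation was stabilized in Next.js 15, so on the Next.js 16 scaffold used here the experimental.instrumentationHook flag should no longer be needed and may be reported as an unrecognized option; if that happens, remove the experimental block.
ts
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  experimental: {
    instrumentationHook: true,
  },
};

export default nextConfig;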
Expected output: When Next.js starts in Node runtime, register() dynamically imports the telemetry module, initializing the Langfuse client before any route handler executes. The if (process.env.NEXT_RUNTIME === "nodejs") guard prevents the import from failing in Edge runtime contexts.
Step 13: Expose the public API surface
Replace the scaffold placeholder src/index.ts with re-exports of the main public API:
ts
export { createProgram } from "./cli/index.js";
export { GrokService } from "./services/grok-service.js";
export { CircuitBreakerManager } from "./services/circuit-breaker-service.js";
export { RepairService } from "./services/repair-service.js";
Expected output: Consumers who install this package can import { createProgram, GrokService, CircuitBreakerManager, RepairService } from "xai-grok-reliability-suite-for-smb-ai-operations".
Step 14: Run the tests
The test suite covers every module with mocked externals. The Grok service tests use MSW (Mock Service Worker) to intercept HTTP calls to https://api.x.ai/*, circuit breaker tests mock @reaatech/circuit-breaker-agents directly, and repair service tests call the real @reaatech/structured-repair-core functions.
Run the full test suite:
terminal
pnpm test
Expected output: All tests pass with numFailedTests: 0 and coverage thresholds of at least 90% across lines, branches, functions, and statements on runtime code (src/**/*.ts and app/**/route.ts). UI files (*.tsx) are excluded from coverage targets.
Then run type checking and linting:
terminal
pnpm typecheck
pnpm lint
Both exit 0 with no errors.
Next steps
Add real log ingestion — Replace collectLogs in agent-monitor.ts with an actual log aggregator (e.g., reading from a file, a database, or a streaming API like Kafka)
Deploy the webhook endpoint — Deploy the Next.js app to Vercel or a Node.js host and configure Trigger.dev to send anomaly.detected events to the /api/webhook/trigger route
Swap persistence — Replace the InMemoryAdapter in the circuit breaker with RedisAdapter or DynamoDBAdapter from @reaatech/circuit-breaker-agents for production state persistence across restarts
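As a starting point for the first item, a file-based replacement for collectLogs could begin with a JSON Lines parser like the one sketched below. The ParsedLog shape and the parseJsonLines name are assumptions; adapt the fields to whatever your LogEntry type actually requires, and read the text from a file or stream of your choice.

```typescript
// Hypothetical log shape; align it with the suite's LogEntry type.
interface ParsedLog {
  timestamp: string;
  level: string;
  message: string;
}

// Parse newline-delimited JSON, skipping blank, malformed, or incomplete lines.
function parseJsonLines(text: string): ParsedLog[] {
  const entries: ParsedLog[] = [];
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      const value: unknown = JSON.parse(trimmed);
      if (typeof value !== "object" || value === null) continue;
      const record = value as Record<string, unknown>;
      const timestamp = record["timestamp"];
      const level = record["level"];
      const message = record["message"];
      if (typeof timestamp === "string" && typeof level === "string" && typeof message === "string") {
        entries.push({ timestamp, level, message });
      }
    } catch {
      // One corrupt line should not abort the whole batch.
    }
  }
  return entries;
}
```

Skipping bad lines rather than throwing keeps the monitor loop alive when a log writer is interrupted mid-line, which is common with files that are being appended to.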