SMBs deploying AI customer support agents often face unpredictable monthly bills as chat volume spikes, with no built-in controls to limit spending per customer or automatically switch to cheaper models when budgets are exhausted.
This tutorial walks you through building a layered budget guard for xAI Grok-powered customer support agents. You’ll build an Express API that enforces per-tenant daily spending limits, automatically falls back to a cheaper OpenAI model when budgets tighten, records cost telemetry, and renders a spend dashboard in Next.js. By the end, you’ll have a reference implementation that prevents runaway AI costs without interrupting your users’ conversations.
Prerequisites
Node.js >= 22 and pnpm 10
An xAI API key (for Grok)
An OpenAI API key (for the fallback model)
A Helicone API key (for cost observability)
Familiarity with TypeScript, Express, and Next.js App Router basics
Step 1: Inspect the scaffold and configure environment
The project scaffold already exists with Next.js 16 (App Router), Vitest, ESLint, and TypeScript configured. Start by inspecting what’s on disk.
terminal
ls -la
cat .env.example
You’ll see the .env.example already has placeholder entries for all the environment variables you need:
env
# Env vars used by xai-grok-cost-control-for-smb-customer-support-agents.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only; never commit real values.
NODE_ENV=development
XAI_API_KEY=<your-xai-api-key>
OPENAI_API_KEY=<your-openai-api-key>
HELICONE_API_KEY=<your-helicone-api-key>
HELICONE_BASE_URL=https://api.hconeai.com
DEFAULT_DAILY_BUDGET=5.00
TENANT_BUDGETS={"acme-corp":{"dailyLimit":10.00,"softCap":0.8,"hardCap":1.0}}
PORT=3001
NEXT_PUBLIC_EXPRESS_URL=http://localhost:3001
Copy it to .env and fill in your real API keys:
terminal
cp .env.example .env
Now install the dependencies:
terminal
pnpm install
Expected output: No errors. The node_modules/ directory is populated with all runtime and dev dependencies.
Step 2: Define shared types and validation schemas
Create src/lib/types.ts with the core domain types used across every module:
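The listing for this file is not reproduced here. As a rough sketch only, the shapes below are inferred from the validation schema and the client wrappers in the following steps. ChatMessage is the only type those steps import by name; ChatRole, ChatUsage, ChatResult, and totalTokens are hypothetical names added for illustration, and the real file may differ.

```typescript
// Hypothetical sketch of src/lib/types.ts; not the published file.

// Roles accepted by the chat request schema.
export type ChatRole = "user" | "assistant" | "system";

// One message in a conversation, as passed to the LLM client wrappers.
export interface ChatMessage {
  role: ChatRole;
  content: string;
}

// Token counts reported by a completion (assumed name).
export interface ChatUsage {
  inputTokens: number;
  outputTokens: number;
}

// Shape returned by the Grok and fallback wrappers (assumed name).
export interface ChatResult {
  content: string;
  model: string;
  usage: ChatUsage;
}

// Small helper showing the types in use.
export function totalTokens(result: ChatResult): number {
  return result.usage.inputTokens + result.usage.outputTokens;
}
```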
Next, create src/lib/validation.ts with a Zod schema that validates incoming chat requests at the API boundary:
ts
import { z } from "zod";

export const ChatRequestSchema = z.object({
  tenantId: z.string().min(1, "tenantId is required"),
  messages: z
    .array(
      z.object({
        role: z.enum(["user", "assistant", "system"]),
        content: z.string(),
      }),
    )
    .min(1, "at least one message is required"),
  maxTokens: z.number().int().positive().optional(),
});

export type ChatRequest = z.infer<typeof ChatRequestSchema>;
Expected output: Both files type-check with zero errors. You can verify with pnpm typecheck, though you’ll want a few more files before it reports clean.
Step 3: Create LLM client wrappers for Grok and OpenAI fallback
xAI Grok uses an OpenAI-compatible API, so you can use the openai npm package for both providers.
Create src/lib/grok-client.ts:
ts
import OpenAI from "openai";
import type { ChatCompletionMessageParam } from "openai/resources/index.js";
import type { ChatMessage } from "./types.js";

export class GrokApiError extends Error {
  constructor(
    message: string,
    public readonly status?: number,
  ) {
    super(message);
    this.name = "GrokApiError";
  }
}

export function createGrokClient(): OpenAI {
  const apiKey = process.env["XAI_API_KEY"];
  if (!apiKey) {
    throw new GrokApiError("XAI_API_KEY is not configured");
  }
  return new OpenAI({ apiKey, baseURL: "https://api.x.ai/v1" });
}

function toOpenAIMessages(messages: ChatMessage[]): ChatCompletionMessageParam[] {
  return messages.map((m) => ({ role: m.role, content: m.content }));
}

export async function chatWithGrok(
  client: OpenAI,
  messages: ChatMessage[],
  maxTokens?: number,
): Promise<{ content: string; model: string; usage: { inputTokens: number; outputTokens: number } }> {
  try {
    const completion = await client.chat.completions.create({
      model: "grok-3",
      messages: toOpenAIMessages(messages),
      max_tokens: maxTokens ?? 1024,
    });
    if (completion.choices.length === 0) {
      throw new GrokApiError("Empty choices in Grok response");
    }
    const content = completion.choices[0]?.message?.content ?? "";
    const apiUsage = completion.usage;
    return {
      content,
      model: "grok-3",
      usage: {
        inputTokens: apiUsage?.prompt_tokens ?? 0,
        outputTokens: apiUsage?.completion_tokens ?? 0,
      },
    };
  } catch (err) {
    if (err instanceof OpenAI.APIError) {
      const rawStatus: unknown = err.status;
      const statusCode = typeof rawStatus === "number" ? rawStatus : undefined;
      throw new GrokApiError(err.message, statusCode);
    }
    throw err;
  }
}
Create src/lib/fallback-client.ts similarly but targeting gpt-5.2-mini via the default OpenAI endpoint:
ts
import OpenAI from "openai";
import type { ChatCompletionMessageParam } from "openai/resources/index.js";
import type { ChatMessage } from "./types.js";

export class FallbackClientError extends Error {
  constructor(
    message: string,
    public readonly status?: number,
  ) {
    super(message);
    this.name = "FallbackClientError";
  }
}

export function createFallbackClient(): OpenAI {
  const apiKey = process.env["OPENAI_API_KEY"];
  if (!apiKey) {
    throw new FallbackClientError("OPENAI_API_KEY is not configured");
  }
  return new OpenAI({ apiKey });
}

function toOpenAIMessages(messages: ChatMessage[]): ChatCompletionMessageParam[] {
  return messages.map((m) => ({ role: m.role, content: m.content }));
}

export async function chatWithFallback(
  client: OpenAI,
  messages: ChatMessage[],
  maxTokens?: number,
): Promise<{ content: string; model: string; usage: { inputTokens: number; outputTokens: number } }> {
  try {
    const completion = await client.chat.completions.create({
      model: "gpt-5.2-mini",
      messages: toOpenAIMessages(messages),
      max_tokens: maxTokens ?? 1024,
    });
    if (completion.choices.length === 0) {
      throw new FallbackClientError("Empty choices in fallback response");
    }
    const content = completion.choices[0]?.message?.content ?? "";
    return {
      content,
      model: "gpt-5.2-mini",
      usage: {
        inputTokens: completion.usage?.prompt_tokens ?? 0,
        outputTokens: completion.usage?.completion_tokens ?? 0,
      },
    };
  } catch (err) {
    if (err instanceof OpenAI.APIError) {
      const rawStatus: unknown = err.status;
      const statusCode = typeof rawStatus === "number" ? rawStatus : undefined;
      throw new FallbackClientError(err.message, statusCode);
    }
    throw err;
  }
}
Expected output: Each file compiles cleanly. The createGrokClient function throws immediately if XAI_API_KEY is missing, and both wrappers catch OpenAI.APIError and wrap it in a typed error class so upstream code can distinguish API failures from other exceptions.
Step 4: Build the pricing service and in-memory spend store
The pricing service maps model IDs to per-token costs and uses calculateCostFromTokens from @reaatech/llm-cost-telemetry for the actual math.
The spend store tracks per-scope balances in memory. It extends the SpendStore abstract class from @reaatech/agent-budget-spend-tracker. Create src/services/spend-store.ts:
Expected output: PricingService returns zero when you estimate cost for zero tokens on a known model. SpendStore returns 0 for uninitialized scopes without throwing.
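The listings for this step are not shown, so here is a minimal, dependency-free sketch of the two behaviours described: per-token cost lookup and an in-memory balance per scope. It does not use the @reaatech packages. The prices are placeholders rather than real xAI or OpenAI rates, and estimateCost and InMemorySpendStore are hypothetical names.

```typescript
// Sketch only: placeholder prices in USD per million tokens, not real rates.
type Price = { inputPerMTok: number; outputPerMTok: number };

const PRICING_TABLE: Record<string, Price> = {
  "grok-3": { inputPerMTok: 3.0, outputPerMTok: 15.0 },
  "gpt-5.2-mini": { inputPerMTok: 0.25, outputPerMTok: 2.0 },
};

class UnknownModelError extends Error {
  constructor(modelId: string) {
    super(`Unknown model: ${modelId}`);
    this.name = "UnknownModelError";
  }
}

// Cost in USD for a given token usage on a known model.
function estimateCost(modelId: string, inputTokens: number, outputTokens: number): number {
  const price = PRICING_TABLE[modelId];
  if (!price) throw new UnknownModelError(modelId);
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1_000_000;
}

// Tracks spend per scope key (for example a tenant ID) in process memory.
class InMemorySpendStore {
  private readonly balances = new Map<string, number>();

  // Unknown scopes read as 0 instead of throwing.
  get(scopeKey: string): number {
    return this.balances.get(scopeKey) ?? 0;
  }

  // Adds to the scope's balance and returns the new total.
  add(scopeKey: string, amountUsd: number): number {
    const next = this.get(scopeKey) + amountUsd;
    this.balances.set(scopeKey, next);
    return next;
  }

  reset(scopeKey: string): void {
    this.balances.delete(scopeKey);
  }
}
```

Because the balances live in a Map inside one process, they are lost on restart and are not shared between processes.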
Step 5: Wire up budget enforcement and cost telemetry
The budget service wraps BudgetController from @reaatech/agent-budget-engine. Create src/services/budget-service.ts:
Now create the telemetry service at src/services/telemetry-service.ts. It stores CostSpan objects in memory and uses loadConfig, generateId, now, and getWindowStart from @reaatech/llm-cost-telemetry:
Expected output: BudgetService uses BudgetScope.User (runtime string "user") for all scope operations. TelemetryService calls loadConfig() in its constructor to pick up OTel and budget defaults from environment variables.
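To make the soft-cap and hard-cap idea concrete, here is a small sketch of one way such a check could work. It treats softCap and hardCap as fractions of dailyLimit, as in the TENANT_BUDGETS example. This is an assumption about the semantics, not the actual logic of BudgetController, and the action strings and function name are hypothetical.

```typescript
// Sketch only: assumed semantics, not the @reaatech/agent-budget-engine API.
interface TenantBudget {
  dailyLimit: number; // USD per day
  softCap: number; // fraction of dailyLimit at which to start downgrading
  hardCap: number; // fraction of dailyLimit at which to block
}

interface BudgetDecision {
  allowed: boolean;
  action: "allow" | "downgrade" | "block";
  suggestedModel?: string;
}

function checkBudget(
  budget: TenantBudget,
  spentUsd: number,
  estimatedCostUsd: number,
  cheaperModel = "gpt-5.2-mini",
): BudgetDecision {
  // Judge the request by where spend would land if it went through.
  const projected = spentUsd + estimatedCostUsd;

  if (projected > budget.dailyLimit * budget.hardCap) {
    return { allowed: false, action: "block" };
  }
  if (projected >= budget.dailyLimit * budget.softCap) {
    // Still within budget, but close enough to suggest the cheaper model.
    return { allowed: true, action: "downgrade", suggestedModel: cheaperModel };
  }
  return { allowed: true, action: "allow" };
}
```

With the acme-corp values (limit 10.00, soft cap 0.8, hard cap 1.0), a request that would bring spend to 9.00 is allowed with a downgrade suggestion, and one that would bring it to 10.40 is blocked.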
Step 6: Build the fallback router
The router service uses createFallbackChain from @reaatech/llm-router-fallback and ModelDefinitionSchema from @reaatech/llm-router-core to define a two-model chain (Grok first, fallback second).
Expected output: The service creates a fallback chain with a circuit breaker (5 failures, 60-second reset, 3 half-open calls). executeFrom starts with “grok-3” and only tries “gpt-5.2-mini” if the first model fails.
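The circuit-breaker behaviour can be sketched without the router packages. The example below is illustrative only: all names are hypothetical, the model calls are synchronous for brevity (real calls would be async), and it assumes a breaker closes again after one successful half-open call, which may not match createFallbackChain.

```typescript
// Sketch only: hypothetical names, not the @reaatech/llm-router-fallback API.
const FAILURE_THRESHOLD = 5; // failures before the breaker opens
const RESET_MS = 60_000; // how long the breaker stays open
const HALF_OPEN_MAX = 3; // trial calls allowed while half-open

type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private halfOpenCalls = 0;
  private readonly now: () => number;

  // The clock is injectable so the reset window can be tested.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  canAttempt(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= RESET_MS) {
      this.state = "half-open";
      this.halfOpenCalls = 0;
    }
    if (this.state === "open") return false;
    if (this.state === "half-open") {
      if (this.halfOpenCalls >= HALF_OPEN_MAX) return false;
      this.halfOpenCalls += 1;
    }
    return true;
  }

  recordSuccess(): void {
    this.state = "closed";
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    // A failure while half-open re-opens the breaker immediately.
    if (this.state === "half-open" || this.failures >= FAILURE_THRESHOLD) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}

interface ChainStep<T> {
  id: string;
  breaker: CircuitBreaker;
  call: () => T;
}

// Tries each model in order, skipping any whose breaker is open.
function executeWithFallback<T>(chain: ChainStep<T>[]): { modelId: string; value: T } {
  for (const step of chain) {
    if (!step.breaker.canAttempt()) continue;
    try {
      const value = step.call();
      step.breaker.recordSuccess();
      return { modelId: step.id, value };
    } catch {
      step.breaker.recordFailure();
    }
  }
  throw new Error("All models failed");
}
```

Once the primary model's breaker opens, requests go straight to the fallback without paying the latency of a call that is likely to fail.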
Step 7: Create the chat API handler
The chat handler wires everything together: validate input, estimate cost, check budget, execute the router, record spend and telemetry, and log to Helicone.
Create src/api/chat.ts:
ts
import { Router, type Request, type Response } from "express";
import { z } from "zod";
import { ChatRequestSchema } from "../lib/validation.js";
import { generateId } from "@reaatech/llm-cost-telemetry";
import { logToHelicone } from "../lib/helicone-logger.js";
import type { PricingProvider } from "../services/pricing-service.js";

interface ChatBudgetService {
  checkBudget(
    tenantId: string,
    estimatedCost: number,
    modelId: string,
  ): { allowed: boolean; suggestedModel?: string; action
Expected output: The handler covers five paths: 200 (success), 400 (validation failure), 429 (budget exceeded), 503 (all models failed), and 500 (internal error).
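Since the handler listing is cut off, the control flow behind those status codes can be sketched as a pure function with injected dependencies. Everything here is hypothetical (ChatDeps, AllModelsFailedError, handleChat), and it is synchronous for brevity. It is one plausible ordering of the steps described above, not the published handler.

```typescript
// Sketch only: hypothetical dependency shapes, synchronous for brevity.
interface ChatDeps {
  parseTenantId(body: unknown): string | null; // null means validation failed
  estimateCost(modelId: string): number;
  checkBudget(
    tenantId: string,
    estimatedCost: number,
    modelId: string,
  ): { allowed: boolean; suggestedModel?: string };
  runModel(modelId: string): { content: string; model: string; costUsd: number };
  recordSpend(tenantId: string, costUsd: number): void;
}

interface ChatOutcome {
  status: number;
  body: Record<string, unknown>;
}

// Hypothetical error thrown once every model in the chain has failed.
class AllModelsFailedError extends Error {
  constructor() {
    super("All models failed");
    this.name = "AllModelsFailedError";
  }
}

function handleChat(deps: ChatDeps, body: unknown): ChatOutcome {
  try {
    const tenantId = deps.parseTenantId(body);
    if (tenantId === null) {
      return { status: 400, body: { error: "Invalid request" } };
    }

    const primary = "grok-3";
    const decision = deps.checkBudget(tenantId, deps.estimateCost(primary), primary);
    if (!decision.allowed) {
      return { status: 429, body: { error: "Budget exceeded" } };
    }

    // Honour a downgrade suggestion if the budget check made one.
    const result = deps.runModel(decision.suggestedModel ?? primary);
    deps.recordSpend(tenantId, result.costUsd);
    return { status: 200, body: { content: result.content, model: result.model } };
  } catch (err) {
    if (err instanceof Error && err.name === "AllModelsFailedError") {
      return { status: 503, body: { error: "All models failed" } };
    }
    return { status: 500, body: { error: "Internal error" } };
  }
}
```

Keeping the decision logic free of Express types makes each status path testable with plain stubs.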
Step 8: Boot the Express server with graceful shutdown
The server assembles all services, registers tenant budgets, mounts the chat router, and listens for HTTP requests.
Create src/server.ts:
ts
import express from "express";
import cors from "cors";
import { loadAppConfig } from "./config.js";
import { SpendStore } from "./services/spend-store.js";
import { PricingService } from "./services/pricing-service.js";
import { BudgetService } from "./services/budget-service.js";
import { TelemetryService } from "./services/telemetry-service.js";
import { RouterService } from "./services/router-service.js";
import { createChatRouter } from "./api/chat.js";

export function createApp() {
  const app = express();
  app.use(cors());
  app.use(express.json());

  const config = loadAppConfig();
  const spendStore = new SpendStore();
  const pricingService = new PricingService();
  const budgetService = new BudgetService(pricingService, spendStore);
  const telemetryService = new TelemetryService();
  const routerService = new RouterService();

  for (const [tenantId, budgetDef] of Object.entries(config.tenantBudgets)) {
    budgetService.defineTenantBudget(tenantId, budgetDef.dailyLimit, {
      softCap: budgetDef.softCap,
      hardCap: budgetDef.hardCap,
    });
  }

  const chatRouter = createChatRouter(
    budgetService,
    telemetryService,
    routerService,
    pricingService,
  );
  app.use("/api/chat", chatRouter);

  app.get("/api/health", (_req, res) => {
    res.json({ status: "ok", uptime: process.uptime() });
  });

  app.get("/api/spend", (req, res) => {
    const tenantId = req.query.tenantId as string | undefined;
    const spans = tenantId
      ? telemetryService.getSpans(tenantId)
      : telemetryService.getAllSpans();
    const totalCost = spans.reduce((sum, s) => sum + s.costUsd, 0);
    const totalInputTokens = spans.reduce((sum, s) => sum + s.inputTokens, 0);
    const totalOutputTokens = spans.reduce((sum, s) => sum + s.outputTokens, 0);
    const totalCalls = spans.length;
    const modelsUsed: Record<string, number> = {};
    for (const s of spans) {
      modelsUsed[s.model] = (modelsUsed[s.model] ?? 0) + 1;
    }
    res.json({ totalCost, totalInputTokens, totalOutputTokens, totalCalls, modelsUsed });
  });

  return app;
}

const PORT = parseInt(process.env["PORT"] ?? "3001", 10);
const app = createApp();
const server = app.listen(PORT, () => {
  console.log(`Server listening on port ${String(PORT)}`);
});

const shutdown = () => {
  server.close(() => process.exit(0));
};
process.on("SIGTERM", shutdown);
process.on("SIGINT", shutdown);
The config loader at src/config.ts parses the TENANT_BUDGETS JSON and validates all values:
ts
import { getEnvVar, getEnvFloat } from "@reaatech/llm-cost-telemetry";

export interface TenantBudgetDef {
  dailyLimit: number;
  softCap: number;
  hardCap: number;
}

export interface AppConfig {
  port: number;
  defaultDailyBudget: number;
  tenantBudgets: Record<string, TenantBudgetDef>;
}

export function loadAppConfig(): AppConfig {
  const portValue = getEnvVar("PORT", "3001") ?? "3001";
  const port = parseInt(portValue, 10);
  const defaultDailyBudget = getEnvFloat("DEFAULT_DAILY_BUDGET", 5.0);
  const raw = getEnvVar("TENANT_BUDGETS", "{}") ?? "{}";
  const tenantBudgets: Record<string, TenantBudgetDef> = {};

  try {
    const parsed: Record<string, unknown> = JSON.parse(raw) as Record<string, unknown>;
    for (const [key, val] of Object.entries(parsed)) {
      const v = val as Record<string, unknown>;
      if (typeof v.dailyLimit !== "number" || v.dailyLimit <= 0) {
        throw new Error(`Invalid dailyLimit for tenant "${key}": must be positive number`);
      }
      tenantBudgets[key] = {
        dailyLimit: v.dailyLimit,
        softCap: typeof v.softCap === "number" ? v.softCap : 0.8,
        hardCap: typeof v.hardCap === "number" ? v.hardCap : 1.0,
      };
    }
  } catch (err) {
    throw new Error(`Failed to parse TENANT_BUDGETS: ${(err as Error).message}`);
  }

  if (Number.isNaN(port) || port <= 0) {
    throw new Error("PORT must be a positive integer");
  }

  return { port, defaultDailyBudget, tenantBudgets };
}
Update src/index.ts to export the config types:
ts
// Main entry: the Express server is booted from src/server.ts
// Import this file in tests or other entry points that need side effects
export { loadAppConfig } from "./config.js";
export type { AppConfig, TenantBudgetDef } from "./config.js";
Expected output: Run pnpm server and you’ll see Server listening on port 3001. A curl http://localhost:3001/api/health returns {"status":"ok","uptime":...}.
Expected output: When HELICONE_API_KEY is set, each chat request is logged to Helicone with Helicone-User-Id and Helicone-Property-Cost headers. When the key is missing, getLogger() returns null and the call is skipped silently.
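The null-when-unconfigured pattern can be illustrated with a small sketch. It only builds the request headers named above and does not use the helicone package. Passing env as a parameter, the HeliconeLogger interface, and the headersFor method are assumptions made so the example is self-contained.

```typescript
// Sketch only: builds headers, does not call Helicone or use its SDK.
interface HeliconeLogger {
  headersFor(tenantId: string, costUsd: number): Record<string, string>;
}

// Returns null when no key is configured so callers can skip logging.
function getLogger(env: Record<string, string | undefined>): HeliconeLogger | null {
  const apiKey = env["HELICONE_API_KEY"];
  if (!apiKey) return null;

  return {
    headersFor: (tenantId, costUsd) => ({
      "Helicone-Auth": `Bearer ${apiKey}`,
      "Helicone-User-Id": tenantId,
      // Custom properties are sent as strings; six decimals is an arbitrary choice.
      "Helicone-Property-Cost": costUsd.toFixed(6),
    }),
  };
}

// Usage: optional chaining turns the missing-key case into a silent no-op.
const headers = getLogger({ HELICONE_API_KEY: "sk-example" })?.headersFor("acme-corp", 0.25);
```

Returning null rather than throwing keeps observability optional, so a missing key never fails a chat request.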
Step 10: Set up test infrastructure with MSW
MSW (Mock Service Worker) intercepts HTTP requests so your tests never hit real APIs. Create tests/setup.ts:
Expected output: The onUnhandledRequest callback throws on any HTTP request that doesn’t match the MSW handlers or a localhost URL, preventing accidental real network calls in test runs.
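The guard logic itself is simple enough to sketch in isolation. The function names below are hypothetical, and the example leaves out the MSW wiring. A callback of this shape could be supplied as the onUnhandledRequest option when starting the MSW server, but check the callback signature for your MSW version.

```typescript
// Sketch only: the decision logic, without any MSW imports.

// True for requests aimed at the local machine (for example the Express app under test).
function isLocalRequest(url: string): boolean {
  const { hostname } = new URL(url);
  return hostname === "localhost" || hostname === "127.0.0.1" || hostname === "[::1]";
}

// Throws for any unhandled request that would leave the machine.
function failOnUnhandled(request: { method: string; url: string }): void {
  if (isLocalRequest(request.url)) return;
  throw new Error(`Unhandled ${request.method} request to ${request.url}`);
}
```

Throwing instead of warning turns a forgotten mock into a failing test rather than a silent call to a paid API.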
Step 11: Run the tests and verify coverage
Run the full test suite with coverage:
terminal
pnpm test
This runs all tests under tests/, including:
LLM client tests: tests/lib/grok-client.test.ts and tests/lib/fallback-client.test.ts mock the OpenAI module and verify that each client returns the correct content shape and throws typed errors on API failures.
Service tests: tests/services/pricing-service.test.ts verifies cost calculation for known models and checks that UnknownModelError is thrown for unknown ones. tests/services/spend-store.test.ts verifies add, get, and reset. tests/services/budget-service.test.ts tests budget checks, state transitions (Active → Warned), auto-downgrade suggestions, and wildcard fallback. tests/services/telemetry-service.test.ts validates span creation and schema rejection. tests/services/router-service.test.ts tests the fallback chain with happy path, primary failure, and total exhaustion scenarios.
Chat API tests: tests/api/chat.test.ts tests every HTTP path: 200 with full response, 400 on invalid/missing body, 429 on budget exceeded, 503 on all models failed, and 500 on internal error. One integration-style test verifies that all service calls were made.
Server tests: tests/server.test.ts tests GET /api/health, POST /api/chat, GET /api/chat (404), and GET /api/spend with and without tenant filtering.
Integration tests: tests/integration/full-flow.test.ts tests the real Express app wiring with mocked services: happy path through Grok, budget-exhausted 429 response, and budget-reset recovery.
After the tests pass, run the type checker and linter:
terminal
pnpm typecheck
pnpm lint
Expected output: pnpm test exits 0 with numFailedTests: 0, numTotalTests >= 40, and all four coverage metrics (lines, branches, functions, statements) at 90% or above. pnpm typecheck and pnpm lint both exit 0.
Step 12: Build the Next.js dashboard and spend API route
The dashboard at app/dashboard/page.tsx fetches spend data from the Next.js /api/spend proxy route and renders it as an ISR-cached server component:
tsx
import Link from "next/link";

interface SpendResponse {
  totalCost: number;
  totalInputTokens: number;
  totalOutputTokens: number;
  totalCalls: number;
  modelsUsed: Record<string, number>;
}

interface HealthResponse {
  status: string;
  uptime: number;
}

async function getSpendData(): Promise<SpendResponse | null> {
Expected output: With the Express server running (pnpm server in one terminal), run pnpm dev in another and navigate to http://localhost:3000/dashboard. You’ll see the Spend Overview cards and Express API Status. When the Express server is stopped, the dashboard shows fallback messages instead of crashing.
Next steps
Add a shared database — Replace the in-memory SpendStore and TelemetryService with PostgreSQL or Redis-backed stores so the Express and Next.js processes share the same data.
Extend the model catalog — Add more models to PRICING_TABLE and MODEL_DEFINITIONS with different cost tiers. The fallback chain and auto-downgrade rules reference model IDs, so new entries slot in without code changes.
Wire up OpenTelemetry — The @reaatech/llm-cost-telemetry package exports loadTelemetryConfig for OTel setup. Enable export to your preferred backend (CloudWatch, GCP Cloud Monitoring, or Grafana Loki via the @reaatech/llm-cost-telemetry-exporters package).
Add admin endpoints — Create PATCH /api/budgets/:tenant and DELETE /api/budgets/:tenant routes using controller.defineBudget() and controller.reset() so operators can adjust limits at runtime.
Replace helicone with @helicone/helicone — The current helicone@1.0.7 package is deprecated. Migrate to @helicone/helicone for ongoing support and the latest async logging features.