Enforce daily AI spend limits, automatically downgrade to cheaper Cohere models, and get real-time cost dashboards without modifying existing agent code.
Small businesses deploying AI agents often see unpredictable monthly bills because every customer interaction triggers expensive model calls. They need a way to cap spending, pick the right model for each query, and audit costs without hiring an MLOps engineer.
This tutorial shows you how to build a spend-control system for Cohere’s API using the @reaatech/* package family. You’ll create per-tenant daily budgets, automatic model downgrade from command-a-03-2025 to command-r-03-2025 when budgets get tight, and real-time cost dashboards, all served through a Next.js API or a standalone Express CLI daemon.
Prerequisites
Node.js 22+ and pnpm 10 installed
A Cohere API key (free trial available at dashboard.cohere.com)
(Optional) Langfuse and Portkey accounts for observability export
Basic familiarity with TypeScript, Next.js App Router, and Express
Step 1: Scaffold the project and install dependencies
Start from a fresh directory created by the Next.js scaffold. Your package.json should include all the packages you’ll need — the @reaatech/* family for telemetry, routing, and budget enforcement, plus Cohere’s SDK, Express, Langfuse, Portkey, and dev tooling.
Every version is pinned to an exact semver (no ^ or ~). Now install everything:
terminal
pnpm install
Expected output: pnpm resolves and links all 28 packages (17 dependencies + 11 dev dependencies) into node_modules/ with zero warnings. You’ll see the pnpm-lock.yaml file appear.
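If you want to confirm that every version really is pinned before installing, a check along the following lines can help. This is a minimal sketch: findUnpinned is a hypothetical helper, not part of any package in this recipe, and the package names and versions in the example are made up.

```typescript
// Hypothetical helper (not part of the recipe): returns the names of
// dependencies whose version is not an exact semver such as "1.2.3".
type DependencyMap = Record<string, string>;

// Exact version with optional prerelease and build suffixes.
const EXACT_SEMVER = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$/;

function findUnpinned(...groups: DependencyMap[]): string[] {
  const unpinned: string[] = [];
  for (const group of groups) {
    for (const [name, version] of Object.entries(group)) {
      if (!EXACT_SEMVER.test(version)) unpinned.push(name);
    }
  }
  return unpinned.sort();
}

// Example with made-up package names: the first map stands in for
// "dependencies", the second for "devDependencies".
const offenders = findUnpinned(
  { "pkg-a": "1.4.0", "pkg-b": "^2.0.1" },
  { "pkg-c": "~3.1.0", "pkg-d": "0.9.2-beta.1" },
);
// offenders is ["pkg-b", "pkg-c"]
```

In a real project you would pass the dependencies and devDependencies objects read from package.json instead of the literals above.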
Step 2: Configure environment variables
Copy .env.example to .env and fill in your credentials. This file drives the entire system — Cohere API key, model IDs, budget defaults, and optional observability keys for Langfuse and Portkey.
env
# Env vars used by cohere-ai-spend-control-for-budget-conscious-smbs.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only; never commit real values.
NODE_ENV=development
COHERE_API_KEY=<your-cohere-api-key>
COHERE_DEFAULT_MODEL=command-a-03-2025
COHERE_BUDGET_MODEL=command-r-03-2025
DEFAULT_DAILY_BUDGET=50.0
DEFAULT_MONTHLY_BUDGET=1000.0
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
PORTKEY_API_KEY=<your-portkey-api-key>
PORTKEY_VIRTUAL_KEY=<your-portkey-virtual-key>
PORT=<port-number>
OTEL_SERVICE_NAME=cohere-spend-control
terminal
cp .env.example .env
# Then edit .env and set COHERE_API_KEY to your real key
Expected output: The .env file now contains your Cohere API key. The system reads COHERE_API_KEY at runtime — the Cohere SDK picks it up automatically from the environment.
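Because these variables drive the budgets, it can be worth validating them once at startup rather than discovering a typo at the first request. The loader below is a sketch under assumptions: loadConfig, requireVar, and parseBudget are hypothetical names that do not appear in the recipe, and the fallback values simply mirror the defaults in .env.example.

```typescript
// Hypothetical startup validation for the variables in .env.example.
interface SpendControlConfig {
  cohereApiKey: string;
  defaultModel: string;
  budgetModel: string;
  dailyBudgetUsd: number;
  monthlyBudgetUsd: number;
}

type Env = Record<string, string | undefined>;

function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

function parseBudget(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value) || value <= 0) {
    throw new Error(`${name} must be a positive number, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env): SpendControlConfig {
  return {
    cohereApiKey: requireVar(env, "COHERE_API_KEY"),
    defaultModel: env["COHERE_DEFAULT_MODEL"] ?? "command-a-03-2025",
    budgetModel: env["COHERE_BUDGET_MODEL"] ?? "command-r-03-2025",
    dailyBudgetUsd: parseBudget(env, "DEFAULT_DAILY_BUDGET", 50),
    monthlyBudgetUsd: parseBudget(env, "DEFAULT_MONTHLY_BUDGET", 1000),
  };
}
```

In the application you would call loadConfig(process.env) once; taking the environment as a parameter keeps the function easy to test.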
Step 3: Define core types
Create src/lib/types.ts with the branded types and interfaces that the rest of the codebase will share. You’ll re-export the CostSpan type from @reaatech/llm-cost-telemetry so all modules reference a single source of truth.
ts
import { type CostSpan as CostSpanFromTelemetry, type Provider } from "@reaatech/llm-cost-telemetry";

export type { CostSpanFromTelemetry as CostSpan };

export function widenProvider(p: string): Provider {
  return p as Provider;
}

export type TenantId = string & { readonly __brand: "TenantId" };
export type CohereModelId = "command-a-03-2025" | "command-r-03-2025";

export interface BudgetBreakdown {
  totalAllocated: number;
  totalSpent: number;
  remaining: number;
  tenant: string;
}

export interface DashboardSummary {
  totalSpend: number;
  byTenant: Record<string, number>;
  byModel: Record<string, number>;
  budgetStatuses: Array<{
    tenant: string;
    state: string;
    dailyPct: number;
    monthlyPct: number;
  }>;
}
The widenProvider helper casts the string "cohere" to the Provider union type that the telemetry package expects. CohereModelId is a strict union of the two models you’ll support: premium command-a-03-2025 and budget-friendly command-r-03-2025.
Expected output: src/lib/types.ts exists and exports all the types above. You can verify with pnpm typecheck, which should pass with no errors.
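A branded type such as TenantId only pays off if there is one place where plain strings become branded values. The sketch below shows one way to do that; toTenantId, describeTenant, and the validation pattern are hypothetical and not part of the recipe's types.ts.

```typescript
type TenantId = string & { readonly __brand: "TenantId" };

// Hypothetical rule: lowercase letters, digits, and hyphens, 1 to 64 characters.
const TENANT_PATTERN = /^[a-z0-9-]{1,64}$/;

// The only place where a plain string is cast to the branded type.
function toTenantId(raw: string): TenantId {
  if (!TENANT_PATTERN.test(raw)) {
    throw new Error(`Invalid tenant id: "${raw}"`);
  }
  return raw as TenantId;
}

// Functions that accept TenantId can no longer be called with an unchecked string.
function describeTenant(id: TenantId): string {
  return `tenant:${id}`;
}

const acme = toTenantId("acme-bakery");
const label = describeTenant(acme); // "tenant:acme-bakery"
// describeTenant("acme-bakery") would fail to compile: a plain string is not a TenantId.
```

The brand exists only at compile time, so at runtime a TenantId is still an ordinary string and serializes as one.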
Step 4: Add Cohere pricing
Create src/lib/pricing.ts to define the per-model token costs and register them with the @reaatech/llm-cost-telemetry-calculator via its addCustomPricing function. The premium model costs $15/M input tokens and $75/M output tokens; the economy model costs $2.50/M and $10/M.
Note the side-effect call at module scope: registerCoherePricing() runs the first time this module is imported, pushing both Cohere models into the calculator’s global pricing table. The estimateCohereCost function computes cost from raw token counts — all the downstream services will use it.
Expected output: src/lib/pricing.ts exists. Running pnpm typecheck shows no errors.
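To make the arithmetic concrete, here is a standalone sketch of the cost calculation using the per-million rates quoted in this step. The function name estimateCohereCost comes from the text, but its signature here is an assumption, and the sketch deliberately leaves out the addCustomPricing registration because that function's signature is not shown.

```typescript
type CohereModelId = "command-a-03-2025" | "command-r-03-2025";

interface ModelPricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// Rates copied from the figures quoted in this step.
const COHERE_PRICING: Record<CohereModelId, ModelPricing> = {
  "command-a-03-2025": { inputPerMillionUsd: 15, outputPerMillionUsd: 75 },
  "command-r-03-2025": { inputPerMillionUsd: 2.5, outputPerMillionUsd: 10 },
};

// Assumed signature: cost in USD from raw token counts.
function estimateCohereCost(
  model: CohereModelId,
  inputTokens: number,
  outputTokens: number,
): number {
  if (inputTokens < 0 || outputTokens < 0) {
    throw new RangeError("Token counts must be non-negative");
  }
  const pricing = COHERE_PRICING[model];
  return (
    (inputTokens * pricing.inputPerMillionUsd +
      outputTokens * pricing.outputPerMillionUsd) / 1e6
  );
}

// 2,000 input tokens and 500 output tokens:
// premium: (2000 * 15 + 500 * 75) / 1e6 = 0.0675 USD
// economy: (2000 * 2.5 + 500 * 10) / 1e6 = 0.01 USD
const premiumCost = estimateCohereCost("command-a-03-2025", 2000, 500);
const economyCost = estimateCohereCost("command-r-03-2025", 2000, 500);
```

With these rates the same call costs roughly 6.75 times more on the premium model, which is the gap the automatic downgrade is meant to exploit.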
Step 5: Build the Cohere client wrapper
Create src/services/cohere-client.ts — a wrapper around CohereClientV2 that adds cost tracking, retry logic, and streaming support. Every call produces a CostSpan that propagates to your telemetry pipeline.
ts
import { CohereClientV2, CohereError, CohereTimeoutError } from "cohere-ai";
import pino from "pino";
import { generateId, now, type CostSpan } from "@reaatech/llm-cost-telemetry";
import { type CohereModelId, widenProvider } from "../lib/types.js";
import { estimateCohereCost } from "../lib/pricing.js";

const logger = pino({ name: "cohere-client" });
const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

export interface CohereClientOptions { tenant
Key design decisions in this wrapper:
The CohereClientV2 constructor reads COHERE_API_KEY from the environment automatically — no need to pass it explicitly
CohereTimeoutError triggers one retry after a 1-second delay; other CohereError types are logged and re-thrown without retry
The onCostSpan callback lets consumers (like the telemetry service) react to each completed span without coupling the wrapper to any particular observer
chatStream accumulates output token estimates from content-delta events and emits a final CostSpan when the stream ends
Expected output: src/services/cohere-client.ts exists. pnpm typecheck produces no errors.
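The retry rule described above (one retry after a delay for timeouts, immediate re-throw for everything else) can be sketched independently of the SDK. In the block below, TimeoutError stands in for CohereTimeoutError so the example runs without cohere-ai installed, and callWithOneRetry and isTimeoutError are hypothetical names, not the wrapper's actual methods.

```typescript
// Stand-in for the SDK's timeout error class.
class TimeoutError extends Error {}

const isTimeoutError = (err: unknown): boolean => err instanceof TimeoutError;

type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `call`; on a timeout waits `delayMs` and tries exactly once more.
// Any other error, and any error from the second attempt, propagates.
async function callWithOneRetry<T>(
  call: () => Promise<T>,
  isTimeout: (err: unknown) => boolean = isTimeoutError,
  sleep: Sleep = realSleep,
  delayMs = 1000,
): Promise<T> {
  try {
    return await call();
  } catch (err) {
    if (!isTimeout(err)) throw err; // non-timeout errors are not retried
    await sleep(delayMs);
    return call(); // second and final attempt
  }
}
```

Injecting the sleep function is a design choice for testability: a unit test can pass a sleep that resolves immediately instead of waiting a full second.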
Step 6: Build the cost telemetry service
Create src/services/cost-telemetry.ts to collect and aggregate cost spans using CostCollector and CostAggregator from @reaatech/llm-cost-telemetry-aggregation. This service buffers spans, flushes them on a 60-second interval, and provides query methods by tenant, model, and time period.
ts
import { type CostSpan } from "@reaatech/llm-cost-telemetry";
import { CostCollector, CostAggregator, type AggregationDimension } from "@reaatech/llm-cost-telemetry-aggregation";
import type { DashboardSummary } from "../lib/types.js";

export class CostTelemetryService {
  private collector: CostCollector;
  private aggregator: CostAggregator;

  constructor() {
    this.aggregator = new CostAggregator({
      dimensions: ["tenant", "feature", "provider", "model"] as never,
      timeWindows: ["hour", "day", "month"] as never,
    });
    this.collector = new CostCollector({
      maxBufferSize: 1000,
      flushIntervalMs: 60000,
      onFlush: (spans: CostSpan[]) => {
        for (const span of spans) {
          this.aggregator.add(span);
        }
      },
    });
  }

  ingestSpan(span: CostSpan): void {
    this.collector.add(span);
  }

  getTenantCosts(tenant: string): { totalUsd: number; byProvider: Record<string, number> } {
    const records = this.aggregator.getByTenant(tenant);
    const totalUsd = records.reduce((sum, r) => sum + (r.totalUsd ?? 0), 0);
    const byProvider: Record<string, number> = {};
    for (const r of records) {
      const providerName = r.key?.provider ?? "unknown";
      byProvider[providerName] = (byProvider[providerName] ?? 0) + (r.totalUsd ?? 0);
    }
    return { totalUsd, byProvider };
  }

  getModelCosts(modelId?: string, period?: "day" | "month"): { totalUsd: number; callCount: number } {
    const options: { period?: "day" | "month"; groupBy?: AggregationDimension[] } = {};
    if (period) options.period = period;
    if (modelId) options.groupBy = ["model"];
    const summary = this.aggregator.getSummary(options);
    return { totalUsd: summary.totalUsd ?? 0, callCount: summary.totalCalls ?? 0 };
  }

  getDailyTrend(): Array<{ date: string; totalUsd: number }> {
    const records = this.aggregator.getAll();
    const byDate: Record<string, number> = {};
    for (const r of records) {
      const ws = r.key?.windowStart ?? r.windowStart;
      if (!ws) continue;
      const date = typeof ws === "string" ? ws.slice(0, 10) : ws.toISOString().slice(0, 10);
      byDate[date] = (byDate[date] ?? 0) + (r.totalUsd ?? 0);
    }
    return Object.entries(byDate)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([date, totalUsd]) => ({ date, totalUsd }));
  }

  getDashboardSummary(): DashboardSummary {
    const summary = this.aggregator.getSummary({});
    const records = this.aggregator.getAll();
    const byTenant: Record<string, number> = {};
    const byModel: Record<string, number> = {};
    for (const r of records) {
      const t = r.key?.tenant ?? "unknown";
      const m = r.key?.model ?? "unknown";
      byTenant[t] = (byTenant[t] ?? 0) + (r.totalUsd ?? 0);
      byModel[m] = (byModel[m] ?? 0) + (r.totalUsd ?? 0);
    }
    return {
      totalSpend: summary.totalUsd ?? 0,
      byTenant,
      byModel,
      budgetStatuses: [],
    };
  }

  close(): void {
    void this.collector.close();
  }
}
The collector buffers spans until it holds 1,000 of them or 60 seconds have passed, whichever comes first, then flushes them into the aggregator. The aggregator maintains rolling time-windowed totals by tenant, feature, provider, and model. The getDashboardSummary() method is what the API route and CLI dashboard endpoint will call.
Expected output: src/services/cost-telemetry.ts exists and compiles.
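The buffering behavior is easier to reason about with a toy version. The class below illustrates size-triggered flushing with a manual flush that a timer could also call; it is a sketch for intuition only, SpanBuffer is a hypothetical name, and it is not how CostCollector is implemented.

```typescript
// Toy buffer: flushes when `maxSize` items are held, or when flush() is called.
class SpanBuffer<T> {
  private items: T[] = [];
  private readonly maxSize: number;
  private readonly onFlush: (batch: T[]) => void;

  constructor(maxSize: number, onFlush: (batch: T[]) => void) {
    this.maxSize = maxSize;
    this.onFlush = onFlush;
  }

  add(item: T): void {
    this.items.push(item);
    if (this.items.length >= this.maxSize) this.flush();
  }

  // An interval timer would call this as well, e.g. every 60 seconds.
  flush(): void {
    if (this.items.length === 0) return; // nothing to deliver
    const batch = this.items;
    this.items = [];
    this.onFlush(batch);
  }
}

// Example with a buffer of size 3 and numbers standing in for spans.
const batches: number[][] = [];
const buffer = new SpanBuffer<number>(3, (batch) => {
  batches.push(batch);
});
buffer.add(1);
buffer.add(2);
buffer.add(3); // size limit reached: first batch [1, 2, 3] is flushed
buffer.add(4); // held until the next flush
buffer.flush(); // second batch [4]
```

In this sketch nothing reaches onFlush until a flush happens, which suggests that aggregated totals in a design like this can trail the most recent calls by up to one flush interval.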
Step 7: Build the budget enforcement service
Create src/services/budget-controller.ts to define per-tenant budgets, check whether a call is allowed, record spend, and enforce automatic downgrades. This uses BudgetController from @reaatech/agent-budget-engine backed by a SpendStore.
The constructor wires three event listeners: threshold-breach fires when spend crosses configurable percentage milestones (logged as warnings), hard-stop fires when the budget is fully exhausted (logged as error), and state-change fires on every state transition. The defineTenantBudget method accepts a policy with autoDowngrade rules — when the budget hits the soft cap, the system suggests switching from command-a-03-2025 to command-r-03-2025.
Expected output: src/services/budget-controller.ts compiles. The event listeners don’t do anything until the controller emits events, which happens when checkCall or recordSpend is called against a defined budget.
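The soft-cap and hard-stop behavior described in this step can be pictured as a small pure function. The sketch below is an illustration under assumptions: decideCall, BudgetPolicy, and the 80% soft cap are hypothetical, and the real BudgetController from @reaatech/agent-budget-engine may model policies and states differently.

```typescript
type BudgetState = "ok" | "soft-cap" | "hard-stop";

// Hypothetical policy shape for illustration.
interface BudgetPolicy {
  dailyLimitUsd: number;
  softCapPct: number; // fraction of the limit, e.g. 0.8 for 80%
  downgradeFrom: string;
  downgradeTo: string;
}

interface CallDecision {
  allowed: boolean;
  state: BudgetState;
  model: string;
}

// Decides whether a call may proceed and which model it should use,
// based on spend so far plus the estimated cost of this call.
function decideCall(
  policy: BudgetPolicy,
  spentUsd: number,
  estimatedCostUsd: number,
  requestedModel: string,
): CallDecision {
  const projected = spentUsd + estimatedCostUsd;
  if (projected > policy.dailyLimitUsd) {
    return { allowed: false, state: "hard-stop", model: requestedModel };
  }
  if (projected >= policy.dailyLimitUsd * policy.softCapPct) {
    const model =
      requestedModel === policy.downgradeFrom ? policy.downgradeTo : requestedModel;
    return { allowed: true, state: "soft-cap", model };
  }
  return { allowed: true, state: "ok", model: requestedModel };
}

const policy: BudgetPolicy = {
  dailyLimitUsd: 50,
  softCapPct: 0.8,
  downgradeFrom: "command-a-03-2025",
  downgradeTo: "command-r-03-2025",
};

// $10.05 projected: under the $40 soft cap, premium model kept.
const early = decideCall(policy, 10, 0.05, "command-a-03-2025");
// $41.05 projected: past the soft cap, downgraded to the economy model.
const late = decideCall(policy, 41, 0.05, "command-a-03-2025");
// $50.04 projected: over the $50 limit, call refused.
const over = decideCall(policy, 49.99, 0.05, "command-a-03-2025");
```

Checking the projected spend rather than the current spend means a call that would push the tenant over the limit is refused before any money is spent.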
Step 8: Build the LLM router service
Create src/services/llm-router.ts to dynamically select models based on budget state. The router uses @reaatech/llm-router-engine with two registered models: premium command-a-03-2025 (with reasoning capability) and economy command-r-03-2025 (general-only).
ts
import { widenProvider } from "../lib/types.js";
import { LLMRouter, parseRouterConfig, createRouter, ModelRegistry, evalHooksManager } from "@reaatech/llm-router-engine";
import { type ModelDefinition, type RoutingRequest } from "@reaatech/llm-router-core";
import { CostTelemetryService } from "./cost-telemetry.js";
import pino from "pino";

const logger = pino({ name: "llm-router" });

const PREMIUM_MODEL: ModelDefinition = {
  id: "command-a-03-2025",
  provider: "cohere",
  costPerMillionInput: 15.00,
  costPerMillionOutput:
The updateRouterWorkhorsePool function is the budget-control lever: when you pass usePremium: false, only the economy model stays registered, so all routed requests will select command-r-03-2025. Pass usePremium: true to restore the premium model. The evalHooksManager.registerPostExecution callback fires after every successful route, piping cost data into CostTelemetryService.
Expected output: src/services/llm-router.ts compiles. The WeakMap router-registry association keeps the registry alive as long as the router instance is reachable.
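The effect of shrinking the model pool can be shown without the router engine. In the sketch below, buildPool and pickModel are hypothetical helpers, and the selection policy (cheapest model that offers the capability, otherwise the cheapest model in the pool) is an assumption for illustration, not the documented behavior of @reaatech/llm-router-engine.

```typescript
interface ModelEntry {
  id: string;
  costPerMillionInput: number;
  capabilities: string[];
}

const PREMIUM: ModelEntry = {
  id: "command-a-03-2025",
  costPerMillionInput: 15,
  capabilities: ["general", "reasoning"],
};

const ECONOMY: ModelEntry = {
  id: "command-r-03-2025",
  costPerMillionInput: 2.5,
  capabilities: ["general"],
};

// The budget lever: drop the premium model from the pool when budgets are tight.
function buildPool(usePremium: boolean): ModelEntry[] {
  return usePremium ? [PREMIUM, ECONOMY] : [ECONOMY];
}

// Assumed policy: cheapest capable model, falling back to the cheapest overall.
function pickModel(pool: ModelEntry[], capability: string): ModelEntry {
  let cheapest: ModelEntry | undefined;
  let cheapestCapable: ModelEntry | undefined;
  for (const model of pool) {
    if (!cheapest || model.costPerMillionInput < cheapest.costPerMillionInput) {
      cheapest = model;
    }
    if (
      model.capabilities.includes(capability) &&
      (!cheapestCapable || model.costPerMillionInput < cheapestCapable.costPerMillionInput)
    ) {
      cheapestCapable = model;
    }
  }
  const chosen = cheapestCapable ?? cheapest;
  if (!chosen) throw new Error("Model pool is empty");
  return chosen;
}

const reasoningFull = pickModel(buildPool(true), "reasoning").id; // "command-a-03-2025"
const generalFull = pickModel(buildPool(true), "general").id; // "command-r-03-2025"
const reasoningTight = pickModel(buildPool(false), "reasoning").id; // "command-r-03-2025"
```

Under this policy, removing the premium model guarantees that every request lands on the economy model, at the price of answering reasoning requests with a model that was not registered for them.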
Step 9: Wire up observability adapters
Create src/adapters/observability.ts to export cost data to Langfuse and Portkey. Both clients are initialized as singletons from environment variables — if the keys are missing, the clients are null and all tracing calls become no-ops.
Both clients are constructed at module load time from environment variables. traceCohereCall creates a Langfuse trace with full cost metadata. reportCostToPortkey sends a lightweight metadata-only chat completion to Portkey to register the cost.
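The "null client means no-op" idea can be sketched without either SDK. Everything in the block below is hypothetical: TraceSink and CostEvent stand in for the real Langfuse and Portkey clients and payloads, and createSink and traceCall only illustrate the pattern, not the adapter's actual functions.

```typescript
// Stand-ins for a real observability client and its payload.
interface CostEvent {
  tenant: string;
  model: string;
  costUsd: number;
}

interface TraceSink {
  record(event: CostEvent): void;
}

type Env = Record<string, string | undefined>;

// Returns null when either key is missing, so callers never need to
// check configuration themselves.
function createSink(
  env: Env,
  make: (publicKey: string, secretKey: string) => TraceSink,
): TraceSink | null {
  const publicKey = env["LANGFUSE_PUBLIC_KEY"];
  const secretKey = env["LANGFUSE_SECRET_KEY"];
  if (!publicKey || !secretKey) return null;
  return make(publicKey, secretKey);
}

// Returns true if the event was forwarded, false if tracing is disabled.
function traceCall(sink: TraceSink | null, event: CostEvent): boolean {
  if (sink === null) return false; // no-op when observability is not configured
  sink.record(event);
  return true;
}
```

Keeping the null check inside traceCall means call sites stay identical whether or not the optional accounts from the prerequisites are configured.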
Step 10: Create the API route
Create app/api/cost/route.ts. The route handler uses lazy singleton accessors from src/index.ts (which you’ll wire next). Each HTTP verb maps to a single responsibility: GET returns the dashboard summary; POST checks the budget, makes the Cohere call, and records the spend (returning 400, 403, or 502 on errors); PUT defines a tenant budget; and DELETE resets one.
Expected output: app/api/cost/route.ts exists. This uses App Router conventions: route handlers are exported as named functions for each HTTP verb.
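The POST flow (check budget, call the model, record spend) can be sketched as a framework-free function with injected dependencies. This is an illustration under assumptions: handlePost and PostDeps are hypothetical names, and the mapping of 400 to invalid input, 403 to an exhausted budget, and 502 to an upstream failure is a plausible reading of the status codes listed above rather than something the source spells out.

```typescript
// Hypothetical dependencies; the real route gets these from src/index.ts.
interface PostDeps {
  isWithinBudget(tenant: string): boolean;
  callModel(tenant: string, prompt: string): Promise<{ content: string; costUsd: number }>;
  recordSpend(tenant: string, costUsd: number): void;
}

interface HandlerResult {
  status: number;
  body: Record<string, unknown>;
}

async function handlePost(input: unknown, deps: PostDeps): Promise<HandlerResult> {
  // 400: the request body is missing or malformed.
  if (typeof input !== "object" || input === null) {
    return { status: 400, body: { error: "Body must be a JSON object" } };
  }
  const { tenant, prompt } = input as { tenant?: unknown; prompt?: unknown };
  if (typeof tenant !== "string" || tenant === "" || typeof prompt !== "string" || prompt === "") {
    return { status: 400, body: { error: "tenant and prompt are required" } };
  }
  // 403: the tenant has no budget left for this call.
  if (!deps.isWithinBudget(tenant)) {
    return { status: 403, body: { error: "Budget exhausted" } };
  }
  try {
    const result = await deps.callModel(tenant, prompt);
    deps.recordSpend(tenant, result.costUsd); // spend is recorded only after success
    return { status: 200, body: { content: result.content, costUsd: result.costUsd } };
  } catch {
    // 502: the upstream model call failed.
    return { status: 502, body: { error: "Upstream model call failed" } };
  }
}
```

In an App Router file, an exported POST function could parse the request with await request.json(), pass the result to a function like this, and return Response.json(result.body, { status: result.status }).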
Step 11: Wire up singletons in the main module
Replace the placeholder src/index.ts with lazy-initialized singleton accessors for all four services. This is the module that the API route imports from.
ts
import { CohereClientWrapper } from "./services/cohere-client.js";
import { CostTelemetryService } from "./services/cost-telemetry.js";
import { BudgetEnforcementService } from "./services/budget-controller.js";
import { createCohereRouter } from "./services/llm-router.js";
import { shutdown as shutdownObservability } from "./adapters/observability.js";
import type { LLMRouter } from "@reaatech/llm-router-engine";

let _costTelemetry: CostTelemetryService | undefined;
let _budgetEnforcement: BudgetEnforcementService | undefined;
let _cohereClient: CohereClientWrapper | undefined;
let _router: LLMRouter | undefined;

export function getCostTelemetry(): CostTelemetryService {
  _costTelemetry ??= new CostTelemetryService();
  return _costTelemetry;
}

export function getBudgetEnforcement(): BudgetEnforcementService {
  _budgetEnforcement ??= new BudgetEnforcementService();
  return _budgetEnforcement;
}

export function getCohereClient(): CohereClientWrapper {
  if (!_cohereClient) {
    _cohereClient = new CohereClientWrapper({
      onCostSpan: (span) => {
        getCostTelemetry().ingestSpan(span);
      },
    });
  }
  return _cohereClient;
}

export function getRouter(): LLMRouter {
  if (!_router) {
    _router = createCohereRouter({
      executeModel: async (model, request) => {
        const result = await getCohereClient().chat({
          model: model.id,
          messages: [{ role: "user", content: request.prompt }],
        });
        return {
          content: result.content,
          inputTokens: result.costSpan.inputTokens,
          outputTokens: result.costSpan.outputTokens,
        };
      },
    });
  }
  return _router;
}

export async function init(options?: { cohereApiKey?: string }): Promise<void> {
  if (options?.cohereApiKey) {
    process.env.COHERE_API_KEY = options.cohereApiKey;
  }
  getCostTelemetry();
  getBudgetEnforcement();
  getCohereClient();
  getRouter();
  await Promise.resolve();
}

export async function shutdown(): Promise<void> {
  if (_costTelemetry) _costTelemetry.close();
  if (_budgetEnforcement) _budgetEnforcement.close();
  await shutdownObservability();
}
The CohereClientWrapper is constructed with an onCostSpan callback that automatically pipes every cost span into the telemetry service — this is the glue that makes the entire pipeline work without explicit wiring at the call site. The getRouter() accessor creates a router whose executeModel callback delegates to the wrapped Cohere client, completing the flow: route -> execute -> cost span -> telemetry ingestion.
Expected output: src/index.ts exports all four accessor functions plus init and shutdown. The API route imports from this module using "../../../src/index.js".
Step 12: Create the CLI daemon
Create src/cli/cost-daemon.ts — an Express-based standalone server and CLI tool built with commander. It provides the same budget-management capabilities as the Next.js API route but as a separate process you can run independently.
ts
import { Command } from "commander";
import express from "express";
import pino from "pino";
import { CohereClientWrapper } from "../services/cohere-client.js";
import { CostTelemetryService } from "../services/cost-telemetry.js";
import { BudgetEnforcementService } from "../services/budget-controller.js";
import { createDefaultRouterConfig } from "../services/llm-router.js";
import { BudgetScope } from "@reaatech/agent-budget-types";
import { shutdown as shutdownObservability } from "../adapters/observability.js";

const logger = pino({ name: "cost-daemon" });
Four subcommands give you full control:
start — launch the Express server with /api/dashboard, /api/simulate, /api/budget/:tenant, /api/budget/:tenant/define, and /api/budget/:tenant/reset endpoints
status — dump budget state and costs for all tenants to stdout
config — print the current model configuration
reset --tenant <name> — reset a tenant’s budget back to zero
The server handles SIGTERM and SIGINT gracefully, closing all services and flushing observability before exiting.
Expected output: src/cli/cost-daemon.ts exists and compiles. You can run it with node --import tsx src/cli/cost-daemon.ts start (requires tsx installed; on Node.js 22, tsx is loaded with --import rather than --loader).
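Graceful shutdown is easy to get subtly wrong when both SIGTERM and SIGINT arrive, so it helps to make the shutdown routine idempotent. The sketch below shows one way to do that; createShutdown and Closer are hypothetical names, and the daemon's actual signal handling is not shown in the excerpt above.

```typescript
// A named resource with a close function that may be sync or async.
type Closer = { name: string; close: () => void | Promise<void> };

// Returns a function that closes every resource in order, at most once.
// Repeated calls (for example SIGTERM followed by SIGINT) share one run.
function createShutdown(closers: Closer[]): () => Promise<string[]> {
  let pending: Promise<string[]> | undefined;
  return () => {
    pending ??= (async () => {
      const closed: string[] = [];
      for (const closer of closers) {
        try {
          await closer.close();
          closed.push(closer.name);
        } catch {
          // A failing resource must not prevent the others from closing.
        }
      }
      return closed;
    })();
    return pending;
  };
}

// Wiring it to signals might look like this (left as a comment so the
// sketch has no side effects):
//
// const shutdown = createShutdown([
//   { name: "telemetry", close: () => telemetry.close() },
//   { name: "observability", close: () => shutdownObservability() },
// ]);
// for (const signal of ["SIGTERM", "SIGINT"] as const) {
//   process.once(signal, () => void shutdown().then(() => process.exit(0)));
// }
```

Closing in a fixed order lets you flush telemetry before tearing down the exporters that receive it.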
Step 13: Run the tests
The test suite covers every module with unit tests (mocking external APIs) plus an integration test that simulates the full pipeline end-to-end. Run the full suite with coverage:
terminal
pnpm test
Expected output: vitest runs all 10 test files, lists each passing file in the console, and writes a JSON report to vitest-report.json.
All tests mock external dependencies — the Cohere SDK, Langfuse, Portkey, and the agent-budget packages — so no network calls are made during the test run. The integration test uses MSW with onUnhandledRequest: "error" to catch any leaked real HTTP requests.
Next steps
Add Slack alerts: Wire the threshold-breach and hard-stop events to a Slack webhook so your team gets notified the moment a tenant crosses 80% of its daily budget
Extend to multi-provider routing: Register OpenAI or Anthropic models in the router alongside Cohere and let the budget controller decide which provider to route to based on current costs
Persist budget state to Postgres: Swap the in-memory SpendStore for a persistent database-backed store so budget state survives server restarts
Add a dashboard UI: Replace the JSON API with a real-time dashboard using a charting library like Recharts or Tremor, polling GET /api/cost every 10 seconds