Small businesses running agents on self‑hosted vLLM struggle to see aggregated LLM spend per customer, team, or use case. Without built‑in budgeting, a runaway prompt or misconfigured agent can balloon compute costs before anyone notices.
This recipe adds cost-aware budget enforcement to any self-hosted vLLM agent workflow. You’ll build a Next.js API layer that intercepts every call to your vLLM server, runs it through a budget controller with soft and hard caps per user or tenant, records spend in an in-memory store, and exports cost telemetry to Langfuse and Helicone. By the end, a chat endpoint will reject requests when budgets are exhausted and emit cost traces to your observability stack — all without slowing down responses.
The tutorial is designed for developers who run their own LLM inference and want per-scope budget limits without signing up for a managed proxy or rewriting their agent code.
Prerequisites
Node.js 22+ and pnpm 10 installed
A running vLLM server (or access to one — the code defaults to http://localhost:8000/v1)
Optional: Langfuse account (for cost trace export) and Helicone API key (for usage logging)
Familiarity with TypeScript and Next.js App Router basics
Step 1: Scaffold the project and install dependencies
Create a new Next.js project with the App Router and install the exact dependency versions this recipe needs.
Also create .env.example with the same entries and placeholder values — the recipe ships one for reference so your team knows which variables are required.
Now enable the Next.js instrumentation hook in next.config.ts — this is required because the recipe uses src/instrumentation.ts to initialize budget stores at startup:
Expected output: The file next.config.ts contains the experimental.instrumentationHook flag set to true. Without this flag, the register() function in instrumentation.ts is dead code — the framework never calls it.
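For reference, a minimal next.config.ts carrying the flag described above might look like the sketch below. It is based only on the flag name given in this step, and any other options your project needs are omitted. In some newer Next.js releases instrumentation is enabled by default and the flag may be unnecessary, so check the version you installed.

```typescript
// next.config.ts (sketch): only the flag this step describes is shown.
const nextConfig = {
  experimental: {
    // Lets Next.js call register() from src/instrumentation.ts at startup.
    instrumentationHook: true,
  },
};

export default nextConfig;
```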
Step 3: Create shared types
Create src/lib/types.ts to define the interfaces and error classes your modules will share. These types travel through every layer of the system: from the vLLM response shape to the budget report.
Expected output: A file with three interfaces (VllmCallResult, CostReport, SpendState) and two error classes (VllmApiError, BudgetExceededError), plus the BudgetScope enum imported from an external types package.
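To make the shape of this file concrete, here is a rough sketch of the three interfaces and two error classes. The field names, constructor signatures, and the two small helper functions are assumptions for illustration only; the recipe's real definitions may differ, and the BudgetScope import is left out because its source package is not shown here.

```typescript
// Sketch of src/lib/types.ts. Field names and constructor signatures are
// assumptions for illustration, not the recipe's actual definitions.

/** Structured result of one vLLM chat call. */
interface VllmCallResult {
  content: string;
  inputTokens: number;
  outputTokens: number;
}

/** Cost record for one completed call, as handed to telemetry. */
interface CostReport {
  model: string;
  scopeKey: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

/** Spend snapshot for one scope. */
interface SpendState {
  spent: number;
  limit: number;
  remaining: number;
}

/** Thrown when the vLLM server answers with a non-success status. */
class VllmApiError extends Error {
  readonly status: number;
  constructor(message: string, status: number) {
    super(message);
    this.name = "VllmApiError";
    this.status = status;
  }
}

/** Thrown when a scope has exhausted its budget. */
class BudgetExceededError extends Error {
  readonly scopeKey: string;
  constructor(scopeKey: string, limitUsd: number) {
    super(`Budget of $${limitUsd} exceeded for scope "${scopeKey}"`);
    this.name = "BudgetExceededError";
    this.scopeKey = scopeKey;
  }
}

/** Hypothetical helper: combines a call result with its computed cost. */
function toCostReport(
  result: VllmCallResult,
  model: string,
  scopeKey: string,
  costUsd: number,
): CostReport {
  return {
    model,
    scopeKey,
    inputTokens: result.inputTokens,
    outputTokens: result.outputTokens,
    costUsd,
  };
}

/** Hypothetical helper: derives the remaining budget, never below zero. */
function toSpendState(spent: number, limit: number): SpendState {
  return { spent, limit, remaining: Math.max(0, limit - spent) };
}
```

Giving each error its own class is what lets a route handler branch on `err instanceof BudgetExceededError` instead of parsing error messages.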
Step 4: Build the vLLM client
Create src/services/vllm-client.ts to wrap calls to your vLLM server using the @ai-sdk/openai-compatible adapter. This adapter talks the OpenAI-compatible chat completions protocol that vLLM exposes.
Expected output: Two exports — vllmModel() which returns a chat model handle, and callVllm() which fires the request and returns structured content and token counts. The ?? 0 fallback handles cases where the vLLM response omits usage data.
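The ?? 0 fallback is easiest to see against the raw OpenAI-compatible response shape. The sketch below does not use the @ai-sdk/openai-compatible adapter; it only illustrates the same idea on a plain response object. The names ChatCompletionResponse, ParsedCall, and parseCompletion are hypothetical.

```typescript
// Relevant parts of an OpenAI-compatible chat completion response.
// Only the fields used below are modelled.
interface ChatCompletionResponse {
  choices: { message: { content: string | null } }[];
  usage?: { prompt_tokens?: number; completion_tokens?: number };
}

interface ParsedCall {
  content: string;
  inputTokens: number;
  outputTokens: number;
}

// Extracts text and token counts, defaulting to 0 when usage is missing
// so downstream cost math never sees undefined or NaN.
function parseCompletion(res: ChatCompletionResponse): ParsedCall {
  return {
    content: res.choices[0]?.message.content ?? "",
    inputTokens: res.usage?.prompt_tokens ?? 0,
    outputTokens: res.usage?.completion_tokens ?? 0,
  };
}
```

One consequence of this design: a call with missing usage data is recorded at zero cost, so you may want to log those cases if they show up often.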
Step 5: Build the budget infrastructure
This step creates three files that form the budget backbone: an in-memory spend store, a pricing engine with vLLM model costs, and a budget controller wired together.
Spend store
ts
// src/modules/budget/spend-store.service.ts
import { SpendStore } from "@reaatech/agent-budget-spend-tracker";

export function createSpendStore(maxEntries = 100_000): SpendStore {
  return new SpendStore({ maxEntries });
}
Expected output: A factory function that creates an in-memory circular-buffer spend store with room for up to 100,000 entries. The store provides O(1) per-scope spend lookups via getSpend().
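If you are curious how a circular buffer can still answer per-scope lookups in constant time, the sketch below shows one way to do it: keep a running total per scope and adjust it whenever an entry is written or overwritten. This is not the SpendStore implementation from @reaatech/agent-budget-spend-tracker. RingSpendStore and SpendEntry are hypothetical names, and subtracting evicted entries from the total is an assumption about the behavior.

```typescript
interface SpendEntry {
  scopeKey: string;
  costUsd: number;
}

// Hypothetical fixed-size spend store. Assumes maxEntries >= 1.
class RingSpendStore {
  private readonly entries: (SpendEntry | undefined)[];
  private readonly totals = new Map<string, number>();
  private next = 0; // index of the slot to overwrite next

  constructor(maxEntries: number) {
    this.entries = new Array<SpendEntry | undefined>(maxEntries).fill(undefined);
  }

  record(entry: SpendEntry): void {
    const evicted = this.entries[this.next];
    if (evicted) {
      // The oldest entry leaves the buffer, so remove it from its scope total.
      this.add(evicted.scopeKey, -evicted.costUsd);
    }
    this.entries[this.next] = entry;
    this.add(entry.scopeKey, entry.costUsd);
    this.next = (this.next + 1) % this.entries.length;
  }

  // O(1): a single map lookup, independent of buffer size.
  getSpend(scopeKey: string): number {
    return this.totals.get(scopeKey) ?? 0;
  }

  private add(scopeKey: string, delta: number): void {
    this.totals.set(scopeKey, (this.totals.get(scopeKey) ?? 0) + delta);
  }
}
```

Because the totals are floating-point sums, long-running stores can accumulate small rounding drift; tracking spend in integer micro-dollars is one way to avoid that.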
Pricing engine
ts
// src/modules/budget/pricing.service.ts
import { PricingEngine } from "@reaatech/agent-budget-pricing";
import { calculateCost } from "@reaatech/llm-cost-telemetry-calculator";

export function createPricingEngine(): PricingEngine {
  const pricing = new PricingEngine({ cacheTtlMs: 3600_000 });
  pricing.loadTable("vllm", {
    "mistral-7b-instruct": { inputPricePerMillion: 0.07, outputPricePerMillion: 0.07 },
    "llama-3-8b": { inputPricePerMillion: 0.06, outputPricePerMillion: 0.06 },
    "deepseek-v4-flash": { inputPricePerMillion: 0.14, outputPricePerMillion: 0.28 },
  });
  return pricing;
}

export function verifyCostCalculation(): number {
  const result = calculateCost({
    provider: "openai",
    model: "gpt-4o",
    inputTokens: 1000,
    outputTokens: 500,
  });
  return result.costUsd;
}
Expected output: A pricing engine seeded with three open-source model price entries and a helper function that verifies the cost calculator works end-to-end. Price lookups are cached for one hour.
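The per-million prices translate into a per-call cost with simple arithmetic. The sketch below reuses two of the price entries from the table above. computeCostUsd and vllmPrices are hypothetical names for illustration and are not part of the PricingEngine API.

```typescript
interface ModelPrice {
  inputPricePerMillion: number;
  outputPricePerMillion: number;
}

// Two entries copied from the pricing table in this step.
const vllmPrices: Record<string, ModelPrice> = {
  "mistral-7b-instruct": { inputPricePerMillion: 0.07, outputPricePerMillion: 0.07 },
  "deepseek-v4-flash": { inputPricePerMillion: 0.14, outputPricePerMillion: 0.28 },
};

// cost = (input tokens / 1M) * input price + (output tokens / 1M) * output price
function computeCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = vllmPrices[model];
  if (!price) {
    throw new Error(`No price entry for model "${model}"`);
  }
  return (
    (inputTokens / 1_000_000) * price.inputPricePerMillion +
    (outputTokens / 1_000_000) * price.outputPricePerMillion
  );
}
```

For example, 1,000 input tokens and 500 output tokens on deepseek-v4-flash come to 0.00014 + 0.00014, or about $0.00028.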
Expected output: Three exports: a factory that wires SpendStore into BudgetController, a default-budget definer that reads from env vars, and a state query helper. The wildcard scope key "*" applies the default budget to any scope that doesn’t have its own explicit definition.
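The wildcard fallback boils down to a two-step lookup: try the exact scope key first, then fall back to "*". The sketch below illustrates the idea with a plain Map. BudgetDefinition, its field names, and resolveBudget are assumptions for illustration, not the BudgetController API.

```typescript
// Hypothetical budget shape; the real definition may carry more fields.
interface BudgetDefinition {
  softCapUsd: number;
  hardCapUsd: number;
}

// Returns the scope's own budget if one exists, otherwise the "*" default.
function resolveBudget(
  budgets: Map<string, BudgetDefinition>,
  scopeKey: string,
): BudgetDefinition | undefined {
  return budgets.get(scopeKey) ?? budgets.get("*");
}

// Usage: one default budget plus one explicit override.
const budgets = new Map<string, BudgetDefinition>([
  ["*", { softCapUsd: 5, hardCapUsd: 10 }],
  ["tenant-enterprise", { softCapUsd: 50, hardCapUsd: 100 }],
]);
const enterpriseBudget = resolveBudget(budgets, "tenant-enterprise"); // explicit entry
const fallbackBudget = resolveBudget(budgets, "tenant-new"); // falls back to "*"
```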
Step 6: Create the cost tracking service
The CostTrackingService sits between the interceptor and the underlying budget/pricing modules. It handles pre-flight budget checks, records completed calls, and returns formatted spend state — all through a single class that the interceptor depends on.
Expected output: A single class with three methods — preflightCheck() estimates cost and asks the controller whether the request is allowed, recordCall() computes actual cost and records it, and getState() returns the formatted spend state for the current scope.
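As a rough illustration of what the pre-flight decision involves, the sketch below projects spend forward by the estimated cost and compares it against both caps. It assumes that crossing the soft cap is what triggers the downgrade suggestion and that crossing the hard cap rejects the request; the recipe delegates this decision to the budget controller, so treat preflightCheck and PreflightResult here as hypothetical stand-ins.

```typescript
interface PreflightResult {
  allowed: boolean;
  suggestDowngrade: boolean;
}

// Decides whether a request may proceed, given spend so far and the
// estimated cost of the next call (all values in USD).
function preflightCheck(
  spentUsd: number,
  estimatedCostUsd: number,
  softCapUsd: number,
  hardCapUsd: number,
): PreflightResult {
  const projected = spentUsd + estimatedCostUsd;
  if (projected > hardCapUsd) {
    // Hard cap: the request would exceed the budget, so block it.
    return { allowed: false, suggestDowngrade: false };
  }
  // Soft cap: still allowed, but hint that a cheaper model should be used.
  return { allowed: true, suggestDowngrade: projected > softCapUsd };
}
```

Checking the projected total rather than current spend means the request that would cross the hard cap is the one that gets rejected, instead of the one after it.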
Step 7: Create the cost interceptor
The interceptor is the central orchestration point. Every chat request passes through interceptVllmCall(), which: (1) runs a pre-flight budget check, (2) conditionally downgrades to a cheaper model if suggested, (3) fires the vLLM call, (4) records actual cost, and (5) emits telemetry fire-and-forget.
Expected output: One async function that chains the pre-flight check, model downgrade, vLLM call, cost recording, and telemetry emission. Telemetry is intentionally not awaited — the response returns immediately while cost traces ship in the background.
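The five-step chain can be sketched with injected dependencies, which also makes the fire-and-forget behavior easy to see. This is a simplified stand-in, not the recipe's interceptVllmCall(). The InterceptorDeps interface, its method names, and the request and result shapes are all assumptions.

```typescript
interface ChatRequest {
  model: string;
  prompt: string;
  scopeKey: string;
}

interface ChatResult {
  content: string;
  costUsd: number;
  model: string;
}

// Hypothetical dependency bundle standing in for the cost tracking
// service, the vLLM client, and the telemetry service.
interface InterceptorDeps {
  preflight(req: ChatRequest): { allowed: boolean; downgradeTo?: string };
  callModel(model: string, prompt: string): Promise<{ content: string; costUsd: number }>;
  record(scopeKey: string, costUsd: number): void;
  emitTelemetry(result: ChatResult): Promise<void>;
}

async function interceptCall(req: ChatRequest, deps: InterceptorDeps): Promise<ChatResult> {
  // (1) Pre-flight budget check.
  const check = deps.preflight(req);
  if (!check.allowed) {
    throw new Error(`Budget exceeded for scope "${req.scopeKey}"`);
  }
  // (2) Use the cheaper model when the check suggests one.
  const model = check.downgradeTo ?? req.model;
  // (3) Fire the model call.
  const { content, costUsd } = await deps.callModel(model, req.prompt);
  // (4) Record the actual cost against the scope.
  deps.record(req.scopeKey, costUsd);
  const result: ChatResult = { content, costUsd, model };
  // (5) Fire-and-forget: the promise is not awaited, and failures are
  // swallowed so telemetry can never fail the request.
  void deps.emitTelemetry(result).catch(() => undefined);
  return result;
}
```

Attaching the catch handler matters even though the promise is not awaited: without it, a rejected telemetry call would surface as an unhandled rejection.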
Step 8: Create the telemetry services
The telemetry layer fans out cost reports to two backends: Langfuse (structured traces) and Helicone (usage logging). Each integration is a separate module so you can swap or extend them independently.
Langfuse service
ts
// src/modules/telemetry/langfuse.service.ts
import Langfuse from "langfuse";
import { CostReport } from "../../lib/types.js";

export function createLangfuseClient(): Langfuse {
  return new Langfuse({
    secretKey: process.env.LANGFUSE_SECRET_KEY,
    publicKey: process.env.LANGFUSE_PUBLIC_KEY,
    baseUrl: process.env.LANGFUSE_BASE_URL,
  });
}

export function emitCostTrace(langfuse: Langfuse, costReport: CostReport): void {
  langfuse.trace({
    name: "vllm-cost",
    input: costReport,
    metadata: { costUsd: costReport.costUsd },
  });
}
Telemetry service
ts
// src/modules/telemetry/telemetry.service.ts
import Langfuse from "langfuse";
import { CostReport } from "../../lib/types.js";
import { emitCostTrace } from "./langfuse.service.js";
import { emitCostToHelicone } from "./helicone.service.js";

export class TelemetryService {
  private langfuse: Langfuse;

  constructor(langfuse: Langfuse) {
    this.langfuse = langfuse;
  }

  async emitCost(costReport: CostReport): Promise<void> {
    // Fan out to both backends; allSettled keeps one failure from blocking the other.
    const results = await Promise.allSettled([
      Promise.resolve().then(() => {
        emitCostTrace(this.langfuse, costReport);
      }),
      emitCostToHelicone(costReport),
    ]);
    for (const result of results) {
      if (result.status === "rejected") {
        console.error("Telemetry emission failed:", result.reason);
      }
    }
  }
}
Expected output: Three files totaling about 40 lines. TelemetryService.emitCost() uses Promise.allSettled() so a failure in one backend never blocks the other. Network errors in Helicone are silently caught — telemetry is never allowed to crash the calling request.
Step 9: Wire up instrumentation and route handlers
Instrumentation
Next.js calls the register() function at startup when the instrumentationHook flag is enabled. This is where you initialize the spend store, pricing engine, budget controller, and Langfuse client so they’re ready before the first request arrives.
Note the dynamic import() calls — these are required because register() runs in both Node.js and Edge runtimes. Edge can’t import Node-specific modules, so you gate them behind NEXT_RUNTIME === "nodejs".
Chat route handler
This is the main endpoint. It accepts messages, passes them through the cost interceptor, and returns the response with usage and cost data.
ts
// app/api/chat/route.ts
import { NextRequest, NextResponse } from "next/server";
import { z } from "zod";
import { interceptVllmCall } from "../../../src/interceptors/cost.interceptor.js";
import { createSpendStore } from "../../../src/modules/budget/spend-store.service.js";
import { createPricingEngine } from "../../../src/modules/budget/pricing.service.js";
import { createBudgetController } from "../../../src/modules/budget/budget.service.js";
import { CostTrackingService } from "../../../src/services/cost-tracking.service.js";
import { createLangfuseClient } from "../../../src/modules/telemetry/langfuse.service.js";
import { TelemetryService } from "../../../src/modules/telemetry/telemetry.service.js";
import { BudgetExceededError } from "../../../src/lib/types.js";

const spendStore = createSpendStore();
const pricing = createPricingEngine();
const budgetController = createBudgetController(spendStore, pricing);
const costTracking = new CostTrackingService(budgetController, pricing);
const langfuse = createLangfuseClient();
const telemetry = new TelemetryService(langfuse);

const chatSchema = z.object({
  model: z.string().optional(),
  messages: z
    .array(
      z.object({
        role: z.enum(["user", "assistant", "system"]),
        content: z.string(),
      }),
    )
    .min(1),
  scopeType: z.string().optional(),
  scopeKey: z.string().optional(),
});

export async function POST(req: NextRequest) {
  let body: unknown;
  try {
    body = await req.json();
  } catch {
    return NextResponse.json(
      { error: "bad_request", message: "Malformed JSON body" },
      { status: 400 },
    );
  }

  const parsed = chatSchema.safeParse(body);
  if (!parsed.success) {
    return NextResponse.json(
      { error: "bad_request", details: z.treeifyError(parsed.error) },
      { status: 400 },
    );
  }

  const { model, messages, scopeType, scopeKey } = parsed.data;
  try {
    const result = await interceptVllmCall(
      {
        model: model ?? "mistral-7b-instruct",
        messages,
        scopeType,
        scopeKey,
      },
      costTracking,
      telemetry,
    );
    return NextResponse.json(result);
  } catch (err) {
    if (err instanceof BudgetExceededError) {
      return NextResponse.json(
        { error: "budget_exceeded", message: err.message },
        { status: 402 },
      );
    }
    throw err;
  }
}
Budget management route handler
This route provides CRUD operations on budgets — querying spend state per scope, defining new budgets, and removing them.
Health route handler
ts
// app/api/health/route.ts
import { NextResponse } from "next/server";

export function GET() {
  return NextResponse.json({ status: "ok", timestamp: new Date().toISOString() });
}
Expected output: Four route handler files under app/api/. The chat route uses NextRequest/NextResponse (never bare Request/Response) and returns proper JSON responses with correct status codes: 200 on success, 400 for bad input, 402 when the budget is exceeded.
Step 10: Write tests and verify
This recipe ships with a test suite that covers every module: the vLLM client, the cost interceptor, the budget controller, the telemetry service, and the route handlers. Create your test files under tests/ mirroring the source structure. Here is the test for the cost interceptor:
ts
// tests/interceptors/cost.interceptor.test.ts
import { describe, it, expect, vi, beforeEach } from "vitest";
import { BudgetExceededError } from "../../src/lib/types.js";

vi.mock("../../src/services/vllm-client.js", () => ({
  callVllm: vi.fn(),
}));

import { interceptVllmCall } from "../../src/interceptors/cost.interceptor.js";
import { callVllm } from "../../src/services/vllm-client.js";

function mockPricing() {
  return { computeCost: vi.fn().mockReturnValue(0.005) };
}

function mockController() {
  return { getState: vi.fn().mockReturnValue({ remaining: 0, limit: 0.001 }) };
}

function mockCostTracking(overrides: Record<string, unknown> = {}) {
  // ...
Run the verification suite:
terminal
pnpm typecheck
pnpm lint
pnpm vitest run --coverage --reporter=json --outputFile=vitest-report.json
Expected output: TypeScript compiles with zero errors, ESLint passes, and the test suite reports numFailedTests: 0 with coverage thresholds above 90% for lines, branches, functions, and statements. The coverage is scoped to runtime code under src/ and app/**/route.ts; UI files like page.tsx and layout.tsx are excluded.
Next steps
Add per-scope budget definitions via API: use POST /api/budget to define different limits per user or tenant, going beyond the default wildcard budget
Replace the in-memory SpendStore with PostgreSQL: the postgres dependency is already in package.json; wire a real database so spend persists across restarts
Integrate with Langfuse prompt management: extend the Langfuse trace to include the prompt template version and model parameters for richer cost attribution
Add a dashboard page: create an app/dashboard/ page that reads GET /api/budget and displays spend state per scope with charts