Small businesses running vLLM for AI inference struggle to monitor token usage, latency, and cost across multiple agents, leading to overspend and undetected performance regressions.
This recipe builds a complete vLLM observability stack that automatically instruments every LLM call to your vLLM OpenAI-compatible endpoint, exports spans to Langfuse, tracks per-model token usage and cost, and displays real-time metrics in a Next.js dashboard. You’ll wire up OpenTelemetry span processors, a Drizzle + SQLite aggregation pipeline, and a server-rendered dashboard — all in a few hundred lines of TypeScript.
Prerequisites
Node.js >= 22 and pnpm (install via corepack enable && corepack prepare pnpm@10 --activate)
A running vLLM instance with an OpenAI-compatible endpoint (default: http://localhost:8000/v1)
A Langfuse account (cloud at https://langfuse.com or self-hosted) with a public and secret API key
Basic familiarity with Next.js App Router and OpenTelemetry concepts
Step 1: Create the Next.js project and install dependencies
Start by scaffolding a Next.js 16 project with TypeScript, then install the REAA observability packages and their third-party dependencies. The project uses App Router, strict TypeScript, ESM modules, and exact version pinning.
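Next, create next.config.ts with the experimental.instrumentationHook flag. On Next.js 14 and earlier this flag is required: without it, the register() function in src/instrumentation.ts is dead code and the OpenTelemetry setup never fires. From Next.js 15 onward, instrumentation is stable and Next.js loads the file without the flag, so check the behavior against the version you scaffolded.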
Create .env.example with the environment variables the app reads at startup:
env
# Env vars used by vllm-observability-suite-for-smb-ai-operations.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only — never commit real values.
NODE_ENV=development
VLLM_BASE_URL=http://localhost:8000/v1
VLLM_API_KEY=<your-vllm-key>
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_BASE_URL=https://cloud.langfuse.com
OTLP_ENDPOINT=http://localhost:4318/v1/traces
DATABASE_URL=file:local.db
Create the minimal Next.js App Router shell. First, app/layout.tsx:
tsx
import type { Metadata } from "next";
import { Geist, Geist_Mono } from "next/font/google";
import "./globals.css";

const geistSans = Geist({
  variable: "--font-geist-sans",
  subsets: ["latin"],
});

const geistMono = Geist_Mono({
  variable: "--font-geist-mono",
  subsets: ["latin"],
});

export const metadata: Metadata = {
  title: "vLLM Observability Suite",
  description:
    "Real-time cost, latency and token usage monitoring for any AI agent using vLLM as the inference backend.",
};

export default function RootLayout({
  children,
}: Readonly<{
  children: React.ReactNode;
}>) {
  return (
    <html lang="en" className={`${geistSans.variable} ${geistMono.variable}`}>
      <body>{children}</body>
    </html>
  );
}
Then app/globals.css (any global styles you want — the dashboard has its own inline styles), and app/page.tsx (a simple landing page linking to the dashboard).
Expected output: pnpm install exits cleanly, all config files are in place, and you have an .env.example ready for real values.
Step 2: Set up the database schema and connection
The recipe stores aggregated span metrics and cost alerts in a local SQLite database using Drizzle ORM with libSQL (the Turso client). Start with the schema in src/db/schema.ts:
The spanMetrics table holds one row per OTel observation: the model used, token counts, cost in USD, duration in milliseconds, and a provider/status. The costAlerts table is a placeholder for budget threshold tracking.
Now wire up the libSQL client and Drizzle instance in src/db/index.ts:
ts
import { createClient } from "@libsql/client";
import { drizzle } from "drizzle-orm/libsql";

// Fall back to a local SQLite file when DATABASE_URL is not set.
const client = createClient({ url: process.env["DATABASE_URL"] ?? "file:local.db" });

export const db = drizzle(client);
export const turso = client;
Expected output: Both files compile without errors. The db export is a Drizzle instance pointed at a local SQLite file.
Step 3: Configure OpenTelemetry instrumentation
This is the heart of the recipe. The src/instrumentation.ts file exports a register() function that Next.js calls at startup (because of instrumentationHook: true in next.config.ts). It sets up the OpenTelemetry Node SDK with two span processors — one OTLP HTTP exporter and one Langfuse exporter — then wraps the OpenAI client with GenAI semantic convention instrumentation.
ts
import { OpenAIInstrumentation } from "@reaatech/otel-genai-semconv-openai";
import { LangfuseExporter } from "@reaatech/otel-genai-semconv-exporters";
import { MetricsManager, getLogger } from "@reaatech/llm-cost-telemetry-observability";
import type OpenAI from "openai";

declare global {
  var __langfuseExporter: LangfuseExporter | undefined;
  var __metricsManager: MetricsManager | undefined;
  var __vllmClient: OpenAI | undefined;
}

export async function register() {
  // Skip the Edge runtime: the OpenTelemetry Node SDK needs Node.js APIs.
  if (process.env["NEXT_RUNTIME"] !== "nodejs") {
    return;
  }

  // Node-only modules are loaded dynamically, inside the nodejs branch.
  const { NodeSDK } = await import("@opentelemetry/sdk-node");
  const { SimpleSpanProcessor } = await import("@opentelemetry/sdk-trace-base");
  const { OTLPTraceExporter } = await import("@opentelemetry/exporter-trace-otlp-http");
  const { default: OpenAI } = await import("openai");

  const otlpExporter = new OTLPTraceExporter({
    url: process.env["OTLP_ENDPOINT"] ?? "http://localhost:4318/v1/traces",
  });

  const langfuseExporter = new LangfuseExporter({
    publicKey: process.env["LANGFUSE_PUBLIC_KEY"] ?? "",
    secretKey: process.env["LANGFUSE_SECRET_KEY"] ?? "",
    baseUrl: process.env["LANGFUSE_BASE_URL"],
  });
  // Shared with the span aggregator, which drains the exporter's buffer.
  globalThis.__langfuseExporter = langfuseExporter;

  const sdk = new NodeSDK({
    spanProcessors: [
      new SimpleSpanProcessor(otlpExporter),
      new SimpleSpanProcessor(langfuseExporter),
    ],
  });
  sdk.start();

  const metrics = new MetricsManager({ serviceName: "vllm-observability" });
  metrics.init();
  globalThis.__metricsManager = metrics;

  const logger = getLogger({ name: "vllm-observability" });
  logger.logInfo("instrumentation booted");

  const client = new OpenAI({
    baseURL: process.env["VLLM_BASE_URL"] ?? "http://localhost:8000/v1",
    apiKey: process.env["VLLM_API_KEY"] ?? "not-needed",
  });
  new OpenAIInstrumentation({ trackCosts: true }).instrument(client);
  globalThis.__vllmClient = client;

  // Flush and close telemetry on shutdown; errors here are deliberately ignored.
  process.on("SIGTERM", () => {
    sdk.shutdown().catch(() => {});
    metrics.close().catch(() => {});
  });
}
Key details:
The if (process.env["NEXT_RUNTIME"] !== "nodejs") guard skips registration in Edge runtime — Node-only imports are fetched dynamically with await import() inside the nodejs branch.
Two span processors run in parallel: the OTLP exporter sends spans to your OTLP collector endpoint, and the LangfuseExporter buffers GenAI spans for later batch export.
MetricsManager tracks token counters, cost histograms, and API call counts as OTel metrics.
OpenAIInstrumentation wraps client.chat.completions.create() so every call automatically emits spans with gen_ai.* attributes (model, temperature, token usage, streaming metrics, cost).
"not-needed" is the default VLLM_API_KEY value because vLLM often runs without authentication in local/development setups.
Note that metrics.init() is synchronous — no await.
Expected output: When you run pnpm dev, the server logs “instrumentation booted”. Every chat.completions.create() call made through the instrumented client (globalThis.__vllmClient) is now traced.
Step 4: Write the vLLM client wrapper
The src/services/llm-client.ts module provides a factory function and two convenience wrappers for sending chat completions (non-streaming and streaming) to the vLLM endpoint. Each function handles OpenAI API errors and extracts token usage from the response.
The PROVIDER_SYSTEMS.OPENAI constant comes from @reaatech/otel-genai-semconv-core and ensures the provider identifier matches what the span builder expects. Both sendChatCompletion and sendChatCompletionStream catch OpenAI.APIError and re-throw with a vLLM API error (<status>): <message> format so the operator can distinguish vLLM issues from other failures.
Expected output: Calling createVllmClient() returns an OpenAI SDK client pointed at your vLLM instance. Calling sendChatCompletion(client, params) returns a CompletionResult with the provider, response, model, and token usage.
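The error re-throw described in this step can be sketched as follows. This is a minimal illustration, not the recipe's actual code: ApiErrorLike is a hypothetical stand-in for OpenAI.APIError (so the snippet runs without the openai package), and toVllmError and withVllmErrors are illustrative names.

```typescript
// Hypothetical stand-in for OpenAI.APIError: any Error that carries a numeric HTTP status.
interface ApiErrorLike extends Error {
  status: number;
}

function isApiErrorLike(err: unknown): err is ApiErrorLike {
  return err instanceof Error && typeof (err as { status?: unknown }).status === "number";
}

// Converts API errors to the "vLLM API error (<status>): <message>" format.
// Anything else is returned unchanged so unrelated failures keep their identity.
function toVllmError(err: unknown): unknown {
  if (isApiErrorLike(err)) {
    return new Error(`vLLM API error (${err.status}): ${err.message}`);
  }
  return err;
}

// Wraps any async call to the endpoint with the conversion above.
async function withVllmErrors<T>(call: () => Promise<T>): Promise<T> {
  try {
    return await call();
  } catch (err) {
    throw toVllmError(err);
  }
}
```

In the real module you would presumably check err instanceof OpenAI.APIError instead of the duck-typed guard; the guard is only there to keep the sketch dependency-free.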
Step 5: Build the cost tracking service
The src/services/cost-tracker.ts uses @reaatech/llm-cost-telemetry-calculator to compute per-call costs and estimate costs before making a request.
trackCost calls calculateCost from the calculator package, which looks up the model’s built-in pricing table and returns a detailed breakdown.
estimateCostBeforeCall runs an async estimation with a confidence score — useful for pre-call budget gating.
registerCustomModel adds custom pricing tiers at runtime (for fine-tuned vLLM models not in the built-in table). The addCustomPricing function accepts an array of pricing objects with inputPricePerMillion and outputPricePerMillion.
Expected output: trackCost("gpt-4", 500, 200) returns approximate cost and breakdown values based on the calculator’s built-in pricing table.
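To make the per-million pricing arithmetic concrete, here is a small self-contained sketch. It does not use the calculator package: the pricingTable, registerPricing, and computeCost names are hypothetical, and only the inputPricePerMillion and outputPricePerMillion field names come from the description above.

```typescript
// Pricing in USD per one million tokens, using the field names mentioned above.
interface ModelPricing {
  model: string;
  inputPricePerMillion: number;
  outputPricePerMillion: number;
}

interface CostBreakdown {
  inputCostUsd: number;
  outputCostUsd: number;
  totalCostUsd: number;
}

// Hypothetical in-memory pricing table; the real calculator ships its own.
const pricingTable = new Map<string, ModelPricing>();

function registerPricing(pricing: ModelPricing): void {
  pricingTable.set(pricing.model, pricing);
}

// Returns null for unknown models so the caller decides how to handle them.
function computeCost(model: string, inputTokens: number, outputTokens: number): CostBreakdown | null {
  const pricing = pricingTable.get(model);
  if (pricing === undefined) return null;
  const inputCostUsd = (inputTokens / 1_000_000) * pricing.inputPricePerMillion;
  const outputCostUsd = (outputTokens / 1_000_000) * pricing.outputPricePerMillion;
  return { inputCostUsd, outputCostUsd, totalCostUsd: inputCostUsd + outputCostUsd };
}

// Example: a fine-tuned model priced at $1.50 (input) and $6.00 (output) per million tokens.
registerPricing({ model: "my-model", inputPricePerMillion: 1.5, outputPricePerMillion: 6.0 });
const exampleCost = computeCost("my-model", 500, 200); // total is roughly 0.00195 USD
```

Returning null for an unregistered model is one possible design; a self-hosted vLLM deployment might instead prefer a zero-cost default or a thrown error.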
Step 6: Implement the Langfuse span pusher
The Langfuse pusher in src/services/langfuse-pusher.ts creates a Langfuse SDK client from environment variables and pushes observation arrays. Each observation carries the span metadata that the LangfuseExporter buffered.
The function de-duplicates trace creation (only calls client.trace() once per unique traceId) and wraps each observation push in a try/catch so a single failing observation doesn’t cancel the entire batch.
Expected output: pushObservations(client, [obs1, obs2, obs3]) returns { pushed: 3, errors: 0 } and, when each observation has a distinct traceId, creates 3 spans under 3 traces in Langfuse.
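The de-duplication and per-observation error isolation can be sketched like this. The Observation and TraceSink shapes are hypothetical simplifications, not the Langfuse SDK's real types, and pushObservationsSketch is an illustrative name rather than the recipe's function.

```typescript
// Hypothetical shapes: the real Langfuse client and observation format carry more fields.
interface Observation {
  id: string;
  traceId: string;
  name: string;
}

interface TraceSink {
  trace(input: { id: string }): void;
  span(input: { id: string; traceId: string; name: string }): void;
}

function pushObservationsSketch(
  client: TraceSink,
  observations: Observation[],
): { pushed: number; errors: number } {
  const seenTraces = new Set<string>();
  let pushed = 0;
  let errors = 0;

  for (const obs of observations) {
    try {
      // Create each trace once, the first time its id is seen.
      if (!seenTraces.has(obs.traceId)) {
        client.trace({ id: obs.traceId });
        seenTraces.add(obs.traceId);
      }
      client.span({ id: obs.id, traceId: obs.traceId, name: obs.name });
      pushed += 1;
    } catch {
      // Isolate the failure and keep processing the rest of the batch.
      errors += 1;
    }
  }

  return { pushed, errors };
}
```

Because the trace id is only added to the set after client.trace() succeeds, a failed trace creation is retried on the next observation that shares the same traceId.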
Step 7: Build the span aggregator
The SpanAggregator in src/services/span-aggregator.ts is the background worker that bridges the OTel span buffer and the SQLite database. It reads buffered Langfuse-format observations, persists each one to SQLite with parsed token/cost/duration values, and pushes them to the Langfuse API.
Calls this.langfuseExporter.getLangfuseFormat() to drain the buffered observations.
Parses gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and llm.cost.total attributes from each observation’s metadata.
Calculates duration by subtracting parsed start and end timestamps.
Inserts a row into the spanMetrics SQLite table.
Pushes the same observations to the live Langfuse API via pushObservations.
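The parsing and duration steps in the list above can be sketched as a pure function. The BufferedObservation and SpanMetricRow shapes are assumptions made for illustration (the exporter's real format and the actual table columns may differ); only the four attribute keys come from the description.

```typescript
// Hypothetical observation shape; the exporter's real format may differ.
interface BufferedObservation {
  id: string;
  traceId: string;
  startTime: string; // ISO 8601 timestamp
  endTime: string; // ISO 8601 timestamp
  metadata: Record<string, unknown>;
}

// Hypothetical row shape for the spanMetrics table.
interface SpanMetricRow {
  traceId: string;
  spanId: string;
  model: string | null;
  inputTokens: number | null;
  outputTokens: number | null;
  costUsd: number | null;
  durationMs: number | null;
}

// Accepts numbers or numeric strings; anything else becomes null.
function toNumber(value: unknown): number | null {
  if (typeof value === "number") return Number.isFinite(value) ? value : null;
  if (typeof value === "string" && value.trim() !== "") {
    const parsed = Number(value);
    return Number.isFinite(parsed) ? parsed : null;
  }
  return null;
}

function toSpanMetricRow(obs: BufferedObservation): SpanMetricRow {
  const meta = obs.metadata;
  const model = meta["gen_ai.request.model"];
  const start = Date.parse(obs.startTime);
  const end = Date.parse(obs.endTime);
  return {
    traceId: obs.traceId,
    spanId: obs.id,
    model: typeof model === "string" ? model : null,
    inputTokens: toNumber(meta["gen_ai.usage.input_tokens"]),
    outputTokens: toNumber(meta["gen_ai.usage.output_tokens"]),
    costUsd: toNumber(meta["llm.cost.total"]),
    // Duration is end minus start; unparseable timestamps yield null.
    durationMs: Number.isNaN(start) || Number.isNaN(end) ? null : end - start,
  };
}
```

Mapping missing or malformed attributes to null rather than 0 keeps "no data" distinguishable from "zero tokens" when the dashboard later sums and averages the rows.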
The startLoop and stopLoop methods manage a configurable interval timer — call startLoop(30000) to aggregate every 30 seconds, or call collectAndAggregate() directly for one-shot use.
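One way to structure such an interval loop is shown below. AggregationLoop is a hypothetical class written for this illustration, not the recipe's SpanAggregator; the collect callback stands in for collectAndAggregate(). The sketch also skips a tick while the previous run is still in flight, which is an added assumption rather than documented behavior.

```typescript
class AggregationLoop {
  private timer: ReturnType<typeof setInterval> | undefined;
  private running = false;
  private readonly collect: () => Promise<unknown>;

  constructor(collect: () => Promise<unknown>) {
    this.collect = collect;
  }

  // Starts the timer once; repeated calls are no-ops until stopLoop() is called.
  startLoop(intervalMs: number): void {
    if (this.timer !== undefined) return;
    this.timer = setInterval(() => {
      void this.tick();
    }, intervalMs);
  }

  stopLoop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }

  isActive(): boolean {
    return this.timer !== undefined;
  }

  // Runs one aggregation pass, skipping it if the previous pass has not finished.
  async tick(): Promise<void> {
    if (this.running) return;
    this.running = true;
    try {
      await this.collect();
    } catch (err) {
      // A failed pass is logged and the loop keeps going.
      console.error("aggregation run failed", err);
    } finally {
      this.running = false;
    }
  }
}
```

Remember to call stopLoop() on shutdown; an active interval timer keeps the Node.js process alive.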
Expected output: Calling aggregator.collectAndAggregate() returns { pushed: N, stored: N }. Each buffered observation appears as a row in the SQLite database and a span in Langfuse.
Step 8: Create the API routes
Three Next.js App Router route handlers expose the aggregated data.
Health check at app/api/health/route.ts:
ts
import { NextRequest, NextResponse } from "next/server";

export async function GET(_req: NextRequest): Promise<NextResponse> {
  void _req.headers;
  await Promise.resolve();
  return NextResponse.json({
    status: "ok",
    service: "vllm-observability",
    timestamp: new Date().toISOString(),
  });
}
Span query at app/api/spans/route.ts with filtering by model and status:
Expected output: After seeding some span data, GET /api/costs?groupBy=model returns an array of cost rows aggregated by model name. GET /api/spans?limit=10 returns the 10 most recent span records.
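The groupBy=model aggregation amounts to summing cost and token counts per model name. The route itself presumably does this in SQL through Drizzle; the in-memory version below is only a sketch of the same grouping, with hypothetical SpanRecord and ModelCost shapes and an assumed "unknown" bucket for spans without a model.

```typescript
// Hypothetical subset of a stored span row.
interface SpanRecord {
  model: string | null;
  costUsd: number | null;
  inputTokens: number | null;
  outputTokens: number | null;
}

interface ModelCost {
  key: string;
  costUsd: number;
  inputTokens: number;
  outputTokens: number;
}

// Sums cost and tokens per model; null values count as zero.
function groupCostsByModel(spans: SpanRecord[]): ModelCost[] {
  const byModel = new Map<string, ModelCost>();
  for (const span of spans) {
    const key = span.model ?? "unknown";
    const row = byModel.get(key) ?? { key, costUsd: 0, inputTokens: 0, outputTokens: 0 };
    row.costUsd += span.costUsd ?? 0;
    row.inputTokens += span.inputTokens ?? 0;
    row.outputTokens += span.outputTokens ?? 0;
    byModel.set(key, row);
  }
  // Most expensive model first.
  return Array.from(byModel.values()).sort((a, b) => b.costUsd - a.costUsd);
}
```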
Step 9: Create the barrel export
The src/index.ts file re-exports all public symbols so consumers can import from the package root:
ts
export { createVllmClient, sendChatCompletion, sendChatCompletionStream } from "./services/llm-client.js";
export { trackCost, estimateCostBeforeCall, registerCustomModel } from "./services/cost-tracker.js";
export { SpanAggregator } from "./services/span-aggregator.js";
export { createLangfuseClient, pushObservations } from "./services/langfuse-pusher.js";
This is a standard barrel export — it collects the four service modules into a single entry point.
Expected output: The file compiles cleanly. Other modules in the project can now import from "../index.js" instead of reaching into individual service files.
Step 10: Build the dashboard page
The app/dashboard/page.tsx is a Next.js server component that fetches data from the API routes and renders it as summary cards and tables.
tsx
export const revalidate = 30;

interface CostRow {
  key: string;
  costUsd: number | null;
  inputTokens: number | null;
  outputTokens: number | null;
}

interface SpanRow {
  id: number;
  traceId: string;
  spanId: string;
  model: string | null;
  inputTokens: number | null
The page is a pure server component — no hooks, no client-side state. It fetches both endpoints in parallel with Promise.all, computes summary stats (total spend, total tokens, average duration), and renders two tables: a model cost breakdown and a recent-spans log. The CostRow and SpanRow interfaces provide type safety for the API responses, and style definitions are extracted into named React.CSSProperties constants at the bottom of the file.
Expected output: When you visit http://localhost:3000/dashboard, you see three summary cards and two tables. If no data exists yet, a “No data yet” message is shown.
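The three summary figures (total spend, total tokens, average duration) can be computed with a single pass over the span rows. The sketch below is an assumption about how that might look: the SpanSummaryInput fields costUsd, outputTokens, and durationMs are guessed names, and averaging only over spans that have a duration is a design choice of this example.

```typescript
// Hypothetical subset of the fields the dashboard reads from each span row.
interface SpanSummaryInput {
  costUsd: number | null;
  inputTokens: number | null;
  outputTokens: number | null;
  durationMs: number | null;
}

interface DashboardSummary {
  totalSpendUsd: number;
  totalTokens: number;
  avgDurationMs: number | null; // null when no span has a duration
}

function summarize(spans: SpanSummaryInput[]): DashboardSummary {
  let totalSpendUsd = 0;
  let totalTokens = 0;
  let durationSum = 0;
  let durationCount = 0;

  for (const span of spans) {
    totalSpendUsd += span.costUsd ?? 0;
    totalTokens += (span.inputTokens ?? 0) + (span.outputTokens ?? 0);
    // Only spans with a recorded duration contribute to the average.
    if (span.durationMs !== null) {
      durationSum += span.durationMs;
      durationCount += 1;
    }
  }

  return {
    totalSpendUsd,
    totalTokens,
    avgDurationMs: durationCount > 0 ? durationSum / durationCount : null,
  };
}
```

Returning null for the average on an empty data set gives the page an explicit signal for rendering the "No data yet" state instead of a misleading 0 ms.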
Step 11: Run the tests
The project ships with a comprehensive test suite using Vitest and MSW. The test setup in tests/setup.ts creates an MSW server that intercepts calls to http://localhost:8000/v1/chat/completions:
API routes: 200 responses, empty-data responses, 500 on DB error, filter params, aggregation grouping
REAA package imports: verifies all five @reaatech/* packages export their expected symbols
Expected output: All tests pass (0 failures) with at least 90% line, branch, function, and statement coverage on runtime code — matching the thresholds configured in vitest.config.ts.
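For reference, 90% thresholds of this kind are typically declared under test.coverage.thresholds in vitest.config.ts. The fragment below is a sketch rather than the project's actual file: the v8 coverage provider (which needs the @vitest/coverage-v8 package) and the setupFiles entry are assumptions.

```typescript
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Assumed: loads the MSW server described above before each test file.
    setupFiles: ["tests/setup.ts"],
    coverage: {
      // Assumed provider; requires @vitest/coverage-v8 to be installed.
      provider: "v8",
      // The run fails if any metric drops below 90%.
      thresholds: {
        lines: 90,
        branches: 90,
        functions: 90,
        statements: 90,
      },
    },
  },
});
```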
Next steps
Register custom model pricing — If your vLLM instance serves a fine-tuned model not in the calculator’s built-in pricing table, call registerCustomModel("my-model", 1.5, 6.0) during startup to add it.
Add budget alerting — Use the costAlerts table schema to implement threshold checks in the SpanAggregator’s collect loop, and wire notifications (email, Slack, webhook) when spend crosses a budget limit.
Deploy with a real OTLP collector — Point the OTLP_ENDPOINT env var to a production OpenTelemetry collector or an OTLP-compatible backend (such as Grafana Tempo, SigNoz, or Honeycomb) for long-term trace storage and querying.
Move from a local SQLite file to Turso — Switch DATABASE_URL to a remote Turso database URL for multi-instance deployments where each server needs access to the same span data. The libSQL client stays the same; a remote database typically also needs an auth token passed to createClient.
Add per-tenant cost breakdowns — Extend the spanMetrics schema with a tenant column and update the dashboard to show per-tenant spend when you share a vLLM instance across teams or customers.
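As a starting point for the budget alerting idea, the threshold check itself can be a small pure function that the aggregation loop calls after each pass. Everything here is hypothetical: BudgetRule, BudgetAlert, and checkBudgets are illustrative names, and the warning fraction is an added assumption, not part of the costAlerts schema.

```typescript
// Hypothetical budget rule: a spend limit plus a fraction at which to warn early.
interface BudgetRule {
  name: string;
  limitUsd: number;
  warnAtFraction: number; // e.g. 0.8 warns at 80% of the limit
}

interface BudgetAlert {
  rule: string;
  level: "warning" | "exceeded";
  spendUsd: number;
  limitUsd: number;
}

// Compares current spend against each rule and returns the alerts to send.
function checkBudgets(spendUsd: number, rules: BudgetRule[]): BudgetAlert[] {
  const alerts: BudgetAlert[] = [];
  for (const rule of rules) {
    if (spendUsd >= rule.limitUsd) {
      alerts.push({ rule: rule.name, level: "exceeded", spendUsd, limitUsd: rule.limitUsd });
    } else if (spendUsd >= rule.limitUsd * rule.warnAtFraction) {
      alerts.push({ rule: rule.name, level: "warning", spendUsd, limitUsd: rule.limitUsd });
    }
  }
  return alerts;
}
```

Keeping the check free of I/O makes it easy to unit test; persisting alerts and sending notifications (email, Slack, webhook) can then live in a separate step that consumes the returned array.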