vLLM Observability Suite for SMB AI Operations

Prebuilt observability stack with OpenTelemetry traces and dashboards for any AI agent using vLLM as the inference backend.

Tags: vllm, observability, opentelemetry, langfuse, nextjs, cost-tracking, llm-monitoring, smb

The problem

Small businesses running vLLM for AI inference struggle to monitor token usage, latency, and cost across multiple agents, leading to overspend and undetected performance regressions.

Intro

This recipe builds a complete vLLM observability stack. It automatically instruments every LLM call to your vLLM OpenAI-compatible endpoint, exports spans to Langfuse, tracks per-model token usage and cost, and displays real-time metrics in a Next.js dashboard. You'll wire up OpenTelemetry span processors, a Drizzle + SQLite aggregation pipeline, and a server-rendered dashboard, all in a few hundred lines of TypeScript.
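To make "per-model token usage and cost" concrete, here is a minimal sketch of how a cost figure can be derived from token counts. The rate table, the model name `example-model`, and the function `estimateCostUsd` are hypothetical and are not part of the recipe's packages. A self-hosted vLLM instance has no list price, so the rates are assumed to be values you supply yourself (for example, an amortized GPU cost).

```typescript
// Hypothetical per-model rates in USD per one million tokens.
// These numbers are placeholders, not real prices.
interface ModelRate {
  inputPerMillion: number;
  outputPerMillion: number;
}

const RATES: Record<string, ModelRate> = {
  "example-model": { inputPerMillion: 0.2, outputPerMillion: 0.6 },
};

// Returns the estimated cost in USD, or null when the model has no configured rate.
function estimateCostUsd(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number | null {
  const rate = RATES[model];
  if (rate === undefined) {
    return null; // unknown model: report "no cost" rather than guessing
  }
  const total =
    inputTokens * rate.inputPerMillion + outputTokens * rate.outputPerMillion;
  return total / 1_000_000;
}
```

Returning `null` for an unknown model mirrors the nullable cost column used later in the recipe, so missing rates show up as gaps rather than as zero spend.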

Prerequisites

Node.js >= 22 and pnpm (install via corepack enable && corepack prepare pnpm@10 --activate)
A running vLLM instance with an OpenAI-compatible endpoint (default: http://localhost:8000/v1)
A Langfuse account (cloud at https://langfuse.com or self-hosted) with a public and secret API key
Basic familiarity with Next.js App Router and OpenTelemetry concepts
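Before wiring up instrumentation, it can help to confirm that the vLLM endpoint is reachable. The sketch below is one way to do that, assuming the server exposes the OpenAI-compatible model listing at `/models` under the base URL. The names `FetchLike`, `modelsUrl`, and `listServedModels` are hypothetical helpers, and the fetch function is passed in so the check can be exercised without a network.

```typescript
// Minimal fetch-like signature; the global fetch in Node.js 22 satisfies it.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

// Join the base URL and "models" without producing a double slash.
function modelsUrl(baseUrl: string): string {
  return `${baseUrl.replace(/\/+$/, "")}/models`;
}

// Returns the ids of the models the endpoint reports, or throws on a non-2xx response.
async function listServedModels(
  baseUrl: string,
  fetchImpl: FetchLike,
): Promise<string[]> {
  const res = await fetchImpl(modelsUrl(baseUrl));
  if (!res.ok) {
    throw new Error(`vLLM endpoint returned HTTP ${res.status}`);
  }
  // Assumed response shape: { data: [{ id: "model-name", ... }, ...] }
  const body = (await res.json()) as { data?: Array<{ id?: unknown }> };
  return (body.data ?? [])
    .map((entry) => entry.id)
    .filter((id): id is string => typeof id === "string");
}
```

With a running server, `await listServedModels("http://localhost:8000/v1", fetch)` should resolve to the model ids being served.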

Step 1: Create the Next.js project and install dependencies

Start by scaffolding a Next.js 16 project with TypeScript, then install the REAA observability packages (imported under the @reaatech scope in the code later in this recipe) and their third-party dependencies. The project uses App Router, strict TypeScript, ESM modules, and exact version pinning.

Create package.json:

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

174 kB · 61 tests · 100.0% coverage · vitest passing

SHA-256: 6a1be26adca42bda4bd54d3b1b556e5e8a4ab55b3967290a6132e394aafb88a2

import { OpenAIInstrumentation } from "@reaatech/otel-genai-semconv-openai";
import { LangfuseExporter } from "@reaatech/otel-genai-semconv-exporters";
import { MetricsManager, getLogger } from "@reaatech/llm-cost-telemetry-observability";
import type OpenAI from "openai";

// Shared singletons, stored on globalThis so other modules can reach them.
declare global {
  var __langfuseExporter: LangfuseExporter | undefined;
  var __metricsManager: MetricsManager | undefined;
  var __vllmClient: OpenAI | undefined;
}

export async function register() {
  // Only instrument the Node.js runtime; skip any other runtime.
  if (process.env["NEXT_RUNTIME"] !== "nodejs") {
    return;
  }

  // Load Node-only modules lazily, after the runtime check.
  const { NodeSDK } = await import("@opentelemetry/sdk-node");
  const { SimpleSpanProcessor } = await import("@opentelemetry/sdk-trace-base");
  const { OTLPTraceExporter } = await import("@opentelemetry/exporter-trace-otlp-http");
  const { default: OpenAI } = await import("openai");

  const otlpExporter = new OTLPTraceExporter({
    url: process.env["OTLP_ENDPOINT"] ?? "http://localhost:4318/v1/traces",
  });

  const langfuseExporter = new LangfuseExporter({
    publicKey: process.env["LANGFUSE_PUBLIC_KEY"] ?? "",
    secretKey: process.env["LANGFUSE_SECRET_KEY"] ?? "",
    baseUrl: process.env["LANGFUSE_BASE_URL"],
  });
  globalThis.__langfuseExporter = langfuseExporter;

  // Every span goes to both the OTLP collector and Langfuse.
  const sdk = new NodeSDK({
    spanProcessors: [
      new SimpleSpanProcessor(otlpExporter),
      new SimpleSpanProcessor(langfuseExporter),
    ],
  });
  sdk.start();

  const metrics = new MetricsManager({ serviceName: "vllm-observability" });
  metrics.init();
  globalThis.__metricsManager = metrics;

  const logger = getLogger({ name: "vllm-observability" });
  logger.logInfo("instrumentation booted");

  // vLLM speaks the OpenAI API, so the stock OpenAI client is pointed at it.
  const client = new OpenAI({
    baseURL: process.env["VLLM_BASE_URL"] ?? "http://localhost:8000/v1",
    apiKey: process.env["VLLM_API_KEY"] ?? "not-needed",
  });
  new OpenAIInstrumentation({ trackCosts: true }).instrument(client);
  globalThis.__vllmClient = client;

  // Flush telemetry on shutdown; errors are deliberately ignored here.
  process.on("SIGTERM", () => {
    sdk.shutdown().catch(() => {});
    metrics.close().catch(() => {});
  });
}
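The registration code above falls back to empty strings when the Langfuse keys are unset, which may surface later as failed exports rather than as a clear startup error. As a sketch of one alternative, the hypothetical helper `readLangfuseConfig` below (not part of the recipe or of any @reaatech package) validates the same environment variables up front. It takes the environment as a plain record so it can be tested without touching `process.env`.

```typescript
// Hypothetical helper: fail fast when required Langfuse settings are missing.
type Env = Record<string, string | undefined>;

interface LangfuseConfig {
  publicKey: string;
  secretKey: string;
  baseUrl: string | undefined; // optional, as in the registration code
}

function readLangfuseConfig(env: Env): LangfuseConfig {
  const publicKey = env["LANGFUSE_PUBLIC_KEY"];
  const secretKey = env["LANGFUSE_SECRET_KEY"];

  // Collect every missing name so the error lists them all at once.
  const missing: string[] = [];
  if (!publicKey) missing.push("LANGFUSE_PUBLIC_KEY");
  if (!secretKey) missing.push("LANGFUSE_SECRET_KEY");

  if (!publicKey || !secretKey) {
    throw new Error(
      `Missing required environment variables: ${missing.join(", ")}`,
    );
  }
  return { publicKey, secretKey, baseUrl: env["LANGFUSE_BASE_URL"] };
}
```

In the registration function this could be used as `new LangfuseExporter(readLangfuseConfig(process.env))`, assuming the exporter accepts the same three options shown above.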

import type { LangfuseExporter } from "@reaatech/otel-genai-semconv-exporters";
import type { Langfuse } from "langfuse";
import { db } from "../db/index.js";
import { spanMetrics } from "../db/schema.js";
import { getLogger } from "@reaatech/llm-cost-telemetry-observability";
import { pushObservations, type LangfuseObservation } from "./langfuse-pusher.js";

export class SpanAggregator {
  private langfuseExporter: LangfuseExporter;
  private langfuseClient: Langfuse;
  private _timer: ReturnType<typeof setInterval> | null = null;

  constructor(langfuseExporter: LangfuseExporter, langfuseClient: Langfuse) {
    this.langfuseExporter = langfuseExporter;
    this.langfuseClient = langfuseClient;
  }

  // Reads buffered observations, stores one row per span, then pushes them to Langfuse.
  async collectAndAggregate(): Promise<{ pushed: number; stored: number }> {
    const rawObservations = this.langfuseExporter.getLangfuseFormat();
    const observations = rawObservations as LangfuseObservation[];
    const logger = getLogger({ name: "span-aggregator" });

    if (observations.length === 0) {
      logger.logAggregation({
        dimension: "batch",
        value: "0",
        totalUsd: 0,
        totalCalls: 0,
      });
      return { pushed: 0, stored: 0 };
    }

    for (const obs of observations) {
      // Span attributes are untyped, so each value is checked before use.
      const attrs = obs.metadata.attributes;

      const modelVal = attrs["gen_ai.request.model"];
      const model = typeof modelVal === "string" ? modelVal : undefined;

      const inputTokensVal = attrs["gen_ai.usage.input_tokens"];
      const inputTokens = typeof inputTokensVal === "number" ? inputTokensVal : undefined;

      const outputTokensVal = attrs["gen_ai.usage.output_tokens"];
      const outputTokens = typeof outputTokensVal === "number" ? outputTokensVal : undefined;

      const costVal = attrs["llm.cost.total"];
      const costUsd = typeof costVal === "number" ? costVal : undefined;

      // Duration is left undefined when either timestamp fails to parse.
      const startMs = Date.parse(obs.startTime);
      const endMs = Date.parse(obs.endTime);
      const durationMs =
        !Number.isNaN(startMs) && !Number.isNaN(endMs) ? endMs - startMs : undefined;

      await db.insert(spanMetrics).values({
        traceId: obs.traceId,
        spanId: obs.observationId,
        model: model ?? null,
        inputTokens: inputTokens ?? null,
        outputTokens: outputTokens ?? null,
        costUsd: costUsd ?? null,
        durationMs: durationMs ?? null,
        provider: "openai",
        status: obs.level === "ERROR" ? "error" : "ok",
        timestamp: obs.endTime,
      });
    }

    const pushResult = pushObservations(this.langfuseClient, observations);

    logger.logAggregation({
      dimension: "batch",
      value: String(observations.length),
      totalUsd: 0,
      totalCalls: observations.length,
    });

    return { pushed: pushResult.pushed, stored: observations.length };
  }

  // Runs collectAndAggregate on a fixed interval; errors are swallowed.
  startLoop(intervalMs: number): void {
    this._timer = setInterval(() => {
      this.collectAndAggregate().catch(() => {});
    }, intervalMs);
  }

  stopLoop(): void {
    if (this._timer !== null) {
      clearInterval(this._timer);
      this._timer = null;
    }
  }
}
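The aggregator repeats the same `typeof` check for each span attribute. As an illustration only, the checks could be factored into small readers like the ones below. The names `readString`, `readNumber`, and `durationMs` are hypothetical and not part of the recipe. Note one deliberate difference: `readNumber` also rejects `NaN` and infinite values, which the original inline checks would accept.

```typescript
// Span attributes arrive as an untyped bag of values.
type Attributes = Record<string, unknown>;

// Returns the attribute only if it is a string, otherwise null.
function readString(attrs: Attributes, key: string): string | null {
  const value = attrs[key];
  return typeof value === "string" ? value : null;
}

// Returns the attribute only if it is a finite number, otherwise null.
function readNumber(attrs: Attributes, key: string): number | null {
  const value = attrs[key];
  return typeof value === "number" && Number.isFinite(value) ? value : null;
}

// Elapsed milliseconds between two timestamps, or null if either fails to parse.
function durationMs(startIso: string, endIso: string): number | null {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  return Number.isNaN(start) || Number.isNaN(end) ? null : end - start;
}
```

Because these return `null` directly, the results could be passed to the insert without the `?? null` fallbacks.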

import { NextRequest, NextResponse } from "next/server";
import { db } from "../../../src/db/index.js";
import { spanMetrics } from "../../../src/db/schema.js";
import { and, gte, lte, sql } from "drizzle-orm";

export async function GET(req: NextRequest): Promise<NextResponse> {
  try {
    const from = req.nextUrl.searchParams.get("from");
    const to = req.nextUrl.searchParams.get("to");
    const groupBy = req.nextUrl.searchParams.get("groupBy");

    // Both bounds are optional; a missing bound is passed to and() as undefined.
    const fromFilter = from ? gte(spanMetrics.timestamp, from) : undefined;
    const toFilter = to ? lte(spanMetrics.timestamp, to) : undefined;
    const dateWhere = and(fromFilter, toFilter);

    let costs: Array<{
      key: string;
      costUsd: number | null;
      inputTokens: number | null;
      outputTokens: number | null;
    }>;

    if (groupBy === "day") {
      // One row per calendar day.
      costs = await db
        .select({
          key: sql<string>`date(${spanMetrics.timestamp})`,
          costUsd: sql<number>`coalesce(sum(${spanMetrics.costUsd}), 0)`,
          inputTokens: sql<number>`coalesce(sum(${spanMetrics.inputTokens}), 0)`,
          outputTokens: sql<number>`coalesce(sum(${spanMetrics.outputTokens}), 0)`,
        })
        .from(spanMetrics)
        .where(dateWhere)
        .groupBy(sql`date(${spanMetrics.timestamp})`);
    } else {
      // Default: one row per model; a missing model is reported as "".
      costs = await db
        .select({
          key: sql<string>`coalesce(${spanMetrics.model}, '')`,
          costUsd: sql<number>`coalesce(sum(${spanMetrics.costUsd}), 0)`,
          inputTokens: sql<number>`coalesce(sum(${spanMetrics.inputTokens}), 0)`,
          outputTokens: sql<number>`coalesce(sum(${spanMetrics.outputTokens}), 0)`,
        })
        .from(spanMetrics)
        .where(dateWhere)
        .groupBy(spanMetrics.model);
    }

    // Grand total over the same date range, independent of the grouping.
    const totalResult = await db
      .select({
        totalCostUsd: sql<number>`coalesce(sum(${spanMetrics.costUsd}), 0)`,
      })
      .from(spanMetrics)
      .where(dateWhere);
    const totalCostUsd = totalResult[0]?.totalCostUsd ?? 0;

    return NextResponse.json({
      costs,
      totalCostUsd,
      period: { from, to },
    });
  } catch (err) {
    return NextResponse.json({ error: String(err) }, { status: 500 });
  }
}
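To illustrate what the two `groupBy` branches of the route return, the sketch below reproduces the same grouping over an in-memory array instead of SQLite. The types `SpanRow` and `CostBucket` and the function `aggregateCosts` are hypothetical names for this illustration. The sketch assumes timestamps are ISO-8601 UTC strings, so taking the first ten characters gives the same day key as SQLite's `date()`.

```typescript
// One stored span, mirroring the columns the route reads.
interface SpanRow {
  model: string | null;
  timestamp: string; // assumed ISO-8601 UTC, e.g. "2025-01-01T12:00:00.000Z"
  costUsd: number | null;
  inputTokens: number | null;
  outputTokens: number | null;
}

interface CostBucket {
  key: string;
  costUsd: number;
  inputTokens: number;
  outputTokens: number;
}

// Sums cost and token counts per day or per model; nulls count as zero.
function aggregateCosts(rows: SpanRow[], groupBy: "day" | "model"): CostBucket[] {
  const buckets = new Map<string, CostBucket>();
  for (const row of rows) {
    // A missing model is reported under the empty-string key.
    const key = groupBy === "day" ? row.timestamp.slice(0, 10) : (row.model ?? "");
    let bucket = buckets.get(key);
    if (bucket === undefined) {
      bucket = { key, costUsd: 0, inputTokens: 0, outputTokens: 0 };
      buckets.set(key, bucket);
    }
    bucket.costUsd += row.costUsd ?? 0;
    bucket.inputTokens += row.inputTokens ?? 0;
    bucket.outputTokens += row.outputTokens ?? 0;
  }
  return Array.from(buckets.values());
}
```

Each bucket has the same shape as an entry in the route's `costs` array, which is the data a per-day or per-model chart on the dashboard would consume.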

Intro

Prerequisites

Step 1: Create the Next.js project and install dependencies

Step 2: Set up the database schema and connection

Step 3: Configure OpenTelemetry instrumentation

Step 4: Write the vLLM client wrapper

Step 5: Build the cost tracking service

Step 6: Implement the Langfuse span pusher

Step 7: Build the span aggregator

Step 8: Create the API routes

Step 9: Create the barrel export

Step 10: Build the dashboard page

Step 11: Run the tests

Next steps