Databricks LLM Observability Suite for SMB AI Operations

Gain end-to-end visibility into every LLM call on Databricks, from token usage to cost, with ready-made dashboards and alerts.

databricks llm-observability opentelemetry langfuse nextjs express openai anthropic cost-monitoring smb

The problem

SMBs using Databricks for AI workloads have no easy way to monitor spending, latency, or error rates across multiple models, leading to bill shock and debugging blind spots.

Intro

This tutorial walks you through building a complete LLM observability suite for a small-to-medium business running AI workloads on Databricks. You’ll instrument OpenAI and Anthropic SDK calls with OpenTelemetry GenAI semantic conventions, track token usage and cost per team, detect budget anomalies, expose Prometheus metrics, and serve a real-time admin dashboard — all with Next.js App Router and Express. By the end you’ll have a working observability pipeline you can extend to any model provider.
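To make "token usage and cost per team" concrete, here is a minimal sketch of the arithmetic involved. The `LlmCall` shape, the model names, and the prices in `EXAMPLE_PRICES` are all hypothetical placeholders rather than values from this project or from any provider's price list.

```typescript
// Hypothetical price table: USD per one million tokens. Real prices differ per model.
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Hypothetical record of one LLM call, tagged with the team that made it.
interface LlmCall {
  team: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

const EXAMPLE_PRICES: Record<string, Price> = {
  "example-small": { inputPerMTok: 1, outputPerMTok: 2 },
  "example-large": { inputPerMTok: 10, outputPerMTok: 30 },
};

// Cost of a single call; calls to models missing from the table cost 0 in this sketch.
function callCostUsd(call: LlmCall, prices: Record<string, Price>): number {
  const price = prices[call.model];
  if (!price) return 0;
  return (
    (call.inputTokens * price.inputPerMTok +
      call.outputTokens * price.outputPerMTok) /
    1_000_000
  );
}

// Sum the cost of every call, grouped by team.
function costByTeam(
  calls: LlmCall[],
  prices: Record<string, Price>
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const call of calls) {
    const previous = totals.get(call.team) ?? 0;
    totals.set(call.team, previous + callCostUsd(call, prices));
  }
  return totals;
}

const exampleTotals = costByTeam(
  [
    { team: "search", model: "example-small", inputTokens: 1_000_000, outputTokens: 500_000 },
    { team: "search", model: "example-large", inputTokens: 100_000, outputTokens: 0 },
  ],
  EXAMPLE_PRICES
);
console.log(exampleTotals.get("search")); // 3
```

Treating unknown models as zero cost keeps the call counted without guessing a price; a stricter variant could throw instead.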

Prerequisites

Node.js 22+ and pnpm 10+ installed
A Databricks workspace (for the SQL-backed store) — or skip it and use the in-memory store for local development
API keys for OpenAI, Anthropic, and Langfuse (Langfuse is the visualization backend)
Basic familiarity with TypeScript, Next.js App Router, and OpenTelemetry concepts (traces, spans, span processors)

Step 1: Scaffold the project and configure environment variables

The project has been scaffolded with Next.js 16 (App Router). All the config files — package.json, tsconfig.json, next.config.ts, and vitest.config.ts — are on disk. Open .env.example to see the environment variables you’ll need:

env
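The contents of .env.example are not reproduced on this page, so the following is only a sketch of how you might check configuration at startup. `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` are read by the Langfuse helper in this tutorial; `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are assumptions based on the defaults the official SDKs read, and the helper names `missingEnvVars` and `assertEnv` are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

// The two LANGFUSE_* names come from this tutorial's code; the provider key names are assumed.
const REQUIRED_VARS: readonly string[] = [
  "LANGFUSE_PUBLIC_KEY",
  "LANGFUSE_SECRET_KEY",
  "OPENAI_API_KEY",
  "ANTHROPIC_API_KEY",
];

// Return the names of required variables that are unset or blank.
function missingEnvVars(
  env: Env,
  required: readonly string[] = REQUIRED_VARS
): string[] {
  return required.filter((name) => (env[name] ?? "").trim() === "");
}

// Fail fast with one message that lists everything missing.
function assertEnv(env: Env): void {
  const missing = missingEnvVars(env);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
}
```

In a Node.js process you would call `assertEnv(process.env)` once, before any client is constructed.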

Example artifact

A complete, working implementation of this recipe — downloadable as a zip or browsable file by file. Generated by our build pipeline; tested with full coverage before publishing.

211 kB · 127 tests · 100.0% coverage · vitest passing

SHA-256: fcbc898d1ea69fe181969b582f47c16b9dec99c7d6b6cf5d648c23f1e6051595

// Langfuse client helpers and span processor; the telemetry module imports this file as "./langfuse.js".
import Langfuse from "langfuse";
import type { CostSpan } from "@reaatech/llm-cost-telemetry";
import type { Context } from "@opentelemetry/api";
import type {
  ReadableSpan,
  Span,
  SpanExporter,
  SpanProcessor,
} from "@opentelemetry/sdk-trace-base";

// Module-level singleton; stays null until initLangfuse() succeeds.
let langfuseClient: Langfuse | null = null;

export function initLangfuse(config?: {
  publicKey?: string;
  secretKey?: string;
  baseUrl?: string;
}): Langfuse {
  // Explicit config takes precedence over environment variables.
  const publicKey = config?.publicKey ?? process.env.LANGFUSE_PUBLIC_KEY ?? "";
  const secretKey = config?.secretKey ?? process.env.LANGFUSE_SECRET_KEY ?? "";
  const baseUrl = config?.baseUrl ?? process.env.LANGFUSE_BASE_URL;

  if (!publicKey || !secretKey) {
    throw new Error("Langfuse public key and secret key are required");
  }

  langfuseClient = new Langfuse({ publicKey, secretKey, baseUrl });
  return langfuseClient;
}

export function getLangfuseClient(): Langfuse | null {
  return langfuseClient;
}

export async function flushLangfuse(): Promise<void> {
  if (langfuseClient) {
    await langfuseClient.flushAsync();
  }
}

export async function shutdownLangfuse(): Promise<void> {
  if (langfuseClient) {
    await langfuseClient.shutdownAsync();
    langfuseClient = null;
  }
}

export function createLangfuseSpanProcessor(): SpanProcessor {
  return {
    onStart(span: Span, parentContext: Context): void {
      // Nothing is recorded at span start; the voids mark the parameters as used.
      void span;
      void parentContext;
    },
    onEnd(span: ReadableSpan): void {
      // Silently skip spans until initLangfuse() has been called.
      if (!langfuseClient) return;

      const attrs = span.attributes;
      // Prefer the model the provider reported, then the requested model.
      const model =
        typeof attrs["gen_ai.response.model"] === "string"
          ? attrs["gen_ai.response.model"]
          : typeof attrs["gen_ai.request.model"] === "string"
            ? attrs["gen_ai.request.model"]
            : "";
      const inputTokens = Number(attrs["gen_ai.usage.input_tokens"]) || 0;
      const outputTokens = Number(attrs["gen_ai.usage.output_tokens"]) || 0;
      const costUsd =
        typeof attrs["llm.cost.total"] === "number" ? attrs["llm.cost.total"] : 0;

      langfuseClient.trace({
        name: "llm-cost",
        metadata: {
          model,
          // Uses the text before the first "/" in the model name as the provider.
          provider: model.split("/")[0] || "unknown",
          costUsd,
          inputTokens,
          outputTokens,
          tenant: "",
          timestamp: new Date().toISOString(),
        },
      });
    },
    async forceFlush(): Promise<void> {},
    async shutdown(): Promise<void> {},
  };
}

// Sends an already-computed cost span to Langfuse; a no-op before initialization.
export function sendCostDataToLangfuse(span: CostSpan): void {
  if (!langfuseClient) {
    return;
  }
  langfuseClient.trace({
    name: "llm-cost",
    metadata: {
      model: span.model,
      provider: span.provider,
      costUsd: span.costUsd,
      inputTokens: span.inputTokens,
      outputTokens: span.outputTokens,
      tenant: span.tenant,
      timestamp: span.timestamp instanceof Date ? span.timestamp.toISOString() : "",
    },
  });
}

export { Langfuse };
export type { SpanExporter };
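To see what the span processor's `onEnd` derives from a span, it can help to pull the attribute handling into a pure function. The sketch below mirrors that logic under simplified, hypothetical types (`Attributes`, `CostMetadata`, `toCostMetadata`) and has no dependency on OpenTelemetry or Langfuse; it is an illustration, not part of the project.

```typescript
// Simplified stand-in for OpenTelemetry span attributes.
type Attributes = Record<string, string | number | boolean | undefined>;

interface CostMetadata {
  model: string;
  provider: string;
  costUsd: number;
  inputTokens: number;
  outputTokens: number;
}

// Same derivation as the processor: response model first, then request model,
// numeric fields defaulting to 0 when absent or not numeric.
function toCostMetadata(attrs: Attributes): CostMetadata {
  const responseModel = attrs["gen_ai.response.model"];
  const requestModel = attrs["gen_ai.request.model"];
  const model =
    typeof responseModel === "string"
      ? responseModel
      : typeof requestModel === "string"
        ? requestModel
        : "";
  const cost = attrs["llm.cost.total"];
  return {
    model,
    provider: model.split("/")[0] || "unknown",
    costUsd: typeof cost === "number" ? cost : 0,
    inputTokens: Number(attrs["gen_ai.usage.input_tokens"]) || 0,
    outputTokens: Number(attrs["gen_ai.usage.output_tokens"]) || 0,
  };
}

console.log(toCostMetadata({ "gen_ai.request.model": "openai/gpt-4o" }).provider); // openai
console.log(toCostMetadata({ "gen_ai.request.model": "gpt-4o" }).provider); // gpt-4o
console.log(toCostMetadata({}).provider); // unknown
```

The second case shows a property worth knowing: when the model name has no "/" prefix, the whole model name ends up in the `provider` field.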

// OpenTelemetry SDK setup plus instrumented OpenAI and Anthropic clients.
import { NodeSDK } from "@opentelemetry/sdk-node";
import type { SpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OpenAIInstrumentation } from "@reaatech/otel-genai-semconv-openai";
import { AnthropicInstrumentation } from "@reaatech/otel-genai-semconv-anthropic";
import {
  SpanBuilder,
  type ProviderType,
  GEN_AI_ATTRIBUTES,
  COST_ATTRIBUTES,
} from "@reaatech/otel-genai-semconv-core";
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";
import { initCostTracking } from "./cost.js";
import { createLangfuseSpanProcessor } from "./langfuse.js";

let sdk: NodeSDK | null = null;
let initialized = false;

// Idempotent: a second call returns the SDK that is already running.
export function initTelemetry(spanProcessors?: SpanProcessor[]): NodeSDK {
  if (initialized && sdk) {
    return sdk;
  }

  sdk = new NodeSDK({
    serviceName: "databricks-llm-observability",
    instrumentations: [],
    spanProcessors,
  });
  initialized = true;
  sdk.start();

  // Shut the SDK down when the process is asked to terminate.
  process.on("SIGTERM", () => {
    void shutdownTelemetry();
  });

  return sdk;
}

// Starts the SDK with both the cost-tracking and the Langfuse span processors.
export async function initFullTelemetry(): Promise<NodeSDK> {
  const costProcessor = await initCostTracking();
  const langfuseProcessor = createLangfuseSpanProcessor();
  return initTelemetry([costProcessor, langfuseProcessor]);
}

export async function shutdownTelemetry(): Promise<void> {
  if (!initialized || !sdk) {
    return;
  }
  await sdk.shutdown();
  sdk = null;
  initialized = false;
}

export function createInstrumentedOpenAIClient(apiKey?: string): {
  client: OpenAI;
  instrumentation: OpenAIInstrumentation;
} {
  const client = new OpenAI({ apiKey });
  const instrumentation = new OpenAIInstrumentation({ trackCosts: true });
  instrumentation.instrument(client);
  return { client, instrumentation };
}

export function createInstrumentedAnthropicClient(apiKey?: string): {
  client: Anthropic;
  instrumentation: AnthropicInstrumentation;
} {
  const client = new Anthropic({ apiKey });
  const instrumentation = new AnthropicInstrumentation({ trackCosts: true });
  instrumentation.instrument(client);
  return { client, instrumentation };
}

export function getSDK(): NodeSDK | null {
  return sdk;
}

export function createSpanBuilder(provider: ProviderType): SpanBuilder {
  return new SpanBuilder({ provider, addMessageEvents: true, addChoiceEvents: true });
}

export { GEN_AI_ATTRIBUTES, COST_ATTRIBUTES };
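`initTelemetry` and `shutdownTelemetry` follow a small lifecycle pattern: initialization is idempotent, and a shutdown clears the state so a later call can start fresh. The sketch below isolates that pattern in a generic, hypothetical helper (`createLifecycle` and `Stoppable` are not part of this project or of OpenTelemetry) so the behavior can be checked without starting a real SDK.

```typescript
// Anything with an async shutdown(), such as an SDK object.
interface Stoppable {
  shutdown(): Promise<void>;
}

// Hypothetical helper: init() builds the instance once; shutdown() allows a fresh init().
function createLifecycle<T extends Stoppable>(factory: () => T) {
  let instance: T | null = null;
  return {
    init(): T {
      if (instance) return instance; // already running: reuse it
      instance = factory();
      return instance;
    },
    async shutdown(): Promise<void> {
      if (!instance) return; // nothing to stop
      await instance.shutdown();
      instance = null;
    },
    get(): T | null {
      return instance;
    },
  };
}

let created = 0;
const lifecycle = createLifecycle(() => {
  created += 1;
  return { id: created, shutdown: async (): Promise<void> => {} };
});
const first = lifecycle.init();
const second = lifecycle.init();
console.log(first === second, created); // true 1
```

As in the module above, the instance is cleared only after its `shutdown()` resolves, so a failed shutdown leaves the existing instance in place.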

// Admin dashboard page: a client component that loads the last 24 hours of summary data.
"use client";

import { useEffect, useState } from "react";

interface Summary {
  totalCalls: number;
  totalCost: number;
  avgLatencyMs: number;
  modelsUsed: number;
  teamsActive: number;
}

export default function Home() {
  const [summary, setSummary] = useState<Summary | null>(null);
  const [loading, setLoading] = useState(true);
  const [error, setError] = useState<string | null>(null);

  useEffect(() => {
    // Query window: now minus 24 hours (86,400,000 ms) up to now.
    const end = new Date().toISOString();
    const start = new Date(Date.now() - 86400000).toISOString();

    fetch(`/api/observability/summary?start=${start}&end=${end}`)
      .then((r) => {
        if (!r.ok) throw new Error("Failed to fetch summary");
        return r.json() as Promise<Summary>;
      })
      .then(setSummary)
      .catch((e: unknown) => {
        setError(e instanceof Error ? e.message : String(e));
      })
      .finally(() => {
        setLoading(false);
      });
  }, []);

  if (loading) {
    return <div style={{ padding: 24 }}>Loading dashboard...</div>;
  }
  if (error) {
    return <div style={{ padding: 24, color: "red" }}>Error: {error}</div>;
  }
  if (!summary) {
    return <div style={{ padding: 24 }}>No data available.</div>;
  }

  return (
    <div style={{ padding: 24, fontFamily: "monospace" }}>
      <h1>LLM Observability Dashboard</h1>
      <div
        style={{
          display: "grid",
          gridTemplateColumns: "repeat(auto-fit, minmax(200px, 1fr))",
          gap: 16,
          marginTop: 24,
        }}
      >
        <Card title="Total Calls" value={summary.totalCalls} />
        <Card title="Total Cost" value={`$${summary.totalCost.toFixed(4)}`} />
        <Card title="Avg Latency" value={String(summary.avgLatencyMs) + "ms"} />
        <Card title="Active Models" value={summary.modelsUsed} />
        <Card title="Active Teams" value={summary.teamsActive} />
      </div>
    </div>
  );
}

// One metric tile in the dashboard grid.
function Card({ title, value }: { title: string; value: string | number }) {
  return (
    <div
      style={{
        border: "1px solid #ccc",
        borderRadius: 8,
        padding: 16,
        textAlign: "center",
      }}
    >
      <div style={{ fontSize: 12, color: "#666" }}>{title}</div>
      <div style={{ fontSize: 24, fontWeight: "bold", marginTop: 8 }}>
        {value}
      </div>
    </div>
  );
}

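The dashboard expects `/api/observability/summary` to return an object with the five `Summary` fields for a `start` and `end` window. The route handler is not shown on this page, so the following is only a sketch of how such an aggregation might look; the `CallRecord` shape and the `summarize` function are assumptions made for illustration.

```typescript
// Hypothetical stored record for one LLM call.
interface CallRecord {
  timestamp: string; // ISO 8601
  model: string;
  team: string;
  costUsd: number;
  latencyMs: number;
}

// Same shape the dashboard component reads.
interface Summary {
  totalCalls: number;
  totalCost: number;
  avgLatencyMs: number;
  modelsUsed: number;
  teamsActive: number;
}

// Aggregate every record whose timestamp falls inside [start, end].
function summarize(records: CallRecord[], start: string, end: string): Summary {
  const from = Date.parse(start);
  const to = Date.parse(end);
  const inRange = records.filter((record) => {
    const time = Date.parse(record.timestamp);
    return time >= from && time <= to;
  });

  const totalCalls = inRange.length;
  const totalCost = inRange.reduce((sum, record) => sum + record.costUsd, 0);
  const totalLatency = inRange.reduce((sum, record) => sum + record.latencyMs, 0);

  return {
    totalCalls,
    totalCost,
    // Avoid dividing by zero when the window is empty.
    avgLatencyMs: totalCalls === 0 ? 0 : Math.round(totalLatency / totalCalls),
    modelsUsed: new Set(inRange.map((record) => record.model)).size,
    teamsActive: new Set(inRange.map((record) => record.team)).size,
  };
}
```

An unparseable `start` or `end` makes `Date.parse` return `NaN`, so every comparison fails and the function returns an all-zero summary rather than throwing; a real handler would probably validate the query parameters first.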
Intro
Prerequisites
Step 1: Scaffold the project and configure environment variables
Step 2: Create the in-memory observability store
Step 3: Implement cost tracking with `@reaatech/llm-cost-telemetry`
Step 4: Build the Langfuse span processor
Step 5: Create the budget service
Step 6: Build the anomaly detector
Step 7: Wire up the OpenTelemetry instrumentation
Step 8: Build the Express metrics server
Step 9: Create the Next.js API route handlers
    GET /api/observability/teams
    GET /api/observability/teams/[teamId]
    GET /api/observability/latency
    GET /api/observability/anomalies
    GET /api/observability/summary
    GET /api/observability/models
    GET /api/observability/timeseries
Step 10: Create the Databricks SQL-backed store
Step 11: Build the admin dashboard page
Step 12: Write and run the tests
Step 13: Run the quality gate
Next steps
