# Databricks LLM Observability Suite for SMB AI Operations
 
> Gain end-to-end visibility into every LLM call on Databricks, from token usage to cost, with ready-made dashboards and alerts.
 
A tutorial-style reference solution from [reaatech.com](https://reaatech.com) that demonstrates how to build production-grade AI systems with the `@reaatech/*` package family.
 
## Architecture
 
```
OpenTelemetry instrumentation → Cost metrics → Express /metrics → Langfuse export → Databricks storage → Next.js admin panel
```
 
1. **OTel Instrumentation** — `@reaatech/otel-genai-semconv-openai` and `@reaatech/otel-genai-semconv-anthropic` wrap OpenAI/Anthropic clients to emit GenAI semantic convention spans with token usage and cost attributes.
2. **Cost Metrics** — `@reaatech/llm-cost-telemetry` computes the per-request cost in USD from span attributes and records it in Prometheus counters and histograms.
3. **Express /metrics** — A lightweight Express server (`server/index.ts`) exposes Prometheus metrics at `/metrics` and a health check at `/health`.
4. **Langfuse Export** — `@reaatech/otel-cost-exporter` forwards spans to Langfuse for trace-level observability.
5. **Databricks Storage** — Cost and observability data is persisted in Databricks SQL warehouses (`src/services/databricks-store.ts`) for long-term querying.
6. **Next.js Admin Panel** — The Next.js app (`app/`) provides a dashboard with summary cards, model rankings, team budget tracking, latency percentiles, and anomaly alerts.
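
To make step 2 more concrete, here is a minimal sketch of how a per-request cost in USD could be derived from token-usage span attributes. It does not use the `@reaatech/llm-cost-telemetry` API. The `PRICES` table, its placeholder values, the model name `example-model`, and the `costUsd` helper are all hypothetical, and the attribute keys assume the spans follow the OpenTelemetry GenAI semantic conventions.

```typescript
// Hypothetical price table in USD per one million tokens.
// The values are placeholders for illustration, not real provider prices.
type Price = { inputPerMillion: number; outputPerMillion: number };

const PRICES: Record<string, Price> = {
  "example-model": { inputPerMillion: 1.0, outputPerMillion: 2.0 },
};

type SpanAttributes = Record<string, string | number | undefined>;

// Returns the cost in USD, or undefined when the span lacks the needed
// attributes or the model has no entry in the price table.
function costUsd(attrs: SpanAttributes): number | undefined {
  const model = attrs["gen_ai.request.model"];
  const input = attrs["gen_ai.usage.input_tokens"];
  const output = attrs["gen_ai.usage.output_tokens"];
  if (
    typeof model !== "string" ||
    typeof input !== "number" ||
    typeof output !== "number"
  ) {
    return undefined;
  }
  const price = PRICES[model];
  if (!price) return undefined;
  return (
    (input * price.inputPerMillion + output * price.outputPerMillion) /
    1_000_000
  );
}
```

Returning `undefined` for unknown models, instead of 0, keeps unpriced traffic distinguishable from free traffic when the values are later aggregated.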
 
## Quick start
 
```bash
pnpm install
pnpm dev             # starts Next.js dev server
```
 
In a separate terminal, start the metrics server:
 
```bash
pnpm tsx server/index.ts
```
 
Open http://localhost:3000 to view the admin dashboard.
 
## Running tests
 
```bash
pnpm test            # vitest run with coverage
pnpm typecheck       # TypeScript type checking
pnpm lint            # ESLint
```
 
## Environment variables
 
| Variable | Description | Required |
|---|---|---|
| `DATABRICKS_HOST` | Databricks workspace hostname | Yes |
| `DATABRICKS_HTTP_PATH` | HTTP path for the Databricks SQL warehouse | Yes |
| `DATABRICKS_TOKEN` | Databricks personal access token | Yes |
| `OPENAI_API_KEY` | OpenAI API key | Yes |
| `ANTHROPIC_API_KEY` | Anthropic API key | Yes |
| `LANGFUSE_PUBLIC_KEY` | Langfuse project public key | Yes |
| `LANGFUSE_SECRET_KEY` | Langfuse project secret key | Yes |
| `LANGFUSE_BASE_URL` | Langfuse API base URL | No (defaults to Langfuse Cloud) |
| `METRICS_PORT` | Port for the Express metrics server | No (default: 9090) |
| `OTEL_SERVICE_NAME` | OpenTelemetry service name | No (default: `databricks-llm-observability`) |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP exporter gRPC endpoint | Yes |
 
Copy `.env.example` to `.env.local` and fill in the values.
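
A startup check can catch a missing variable before the first request fails. The sketch below is one possible way to do this, based only on the table above. The helper names `missingVars` and `resolveOptional` are hypothetical and are not part of the repository.

```typescript
// Variables marked "Yes" in the table above.
const REQUIRED_VARS = [
  "DATABRICKS_HOST",
  "DATABRICKS_HTTP_PATH",
  "DATABRICKS_TOKEN",
  "OPENAI_API_KEY",
  "ANTHROPIC_API_KEY",
  "LANGFUSE_PUBLIC_KEY",
  "LANGFUSE_SECRET_KEY",
  "OTEL_EXPORTER_OTLP_ENDPOINT",
] as const;

type Env = Record<string, string | undefined>;

// Lists required variables that are unset or empty.
function missingVars(env: Env): string[] {
  return REQUIRED_VARS.filter((name) => !env[name]);
}

// Applies the documented defaults for two of the optional variables.
function resolveOptional(env: Env): { metricsPort: number; serviceName: string } {
  return {
    metricsPort: Number(env["METRICS_PORT"] ?? "9090"),
    serviceName: env["OTEL_SERVICE_NAME"] ?? "databricks-llm-observability",
  };
}
```

In a Node.js process, `missingVars(process.env)` could be called at startup and the process stopped with a clear message if the returned list is not empty.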
 
## Express API
 
| Endpoint | Method | Description |
|---|---|---|
| `/metrics` | GET | Prometheus-formatted metrics (LLM request count, cost, tokens, latency histogram) |
| `/health` | GET | Health check returning `{ "status": "ok" }` |
 
The Express server is started separately via `pnpm tsx server/index.ts` on the port configured by `METRICS_PORT` (default 9090).
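
For a quick smoke test of `/metrics`, the response body can be checked with a small parser for the Prometheus text exposition format. The following is a simplified sketch, not a full parser: it ignores comment lines, assumes label values contain no unusual brace patterns, and skips non-finite values such as `+Inf`. The helper `sumSamples` and the metric name used in the usage note are hypothetical, since this README does not list the exact metric names.

```typescript
// Sums all samples of one metric across its label sets.
// Sample line shape: name{labels} value [timestamp]
function sumSamples(text: string, metric: string): number {
  const sampleLine =
    /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$/;
  let total = 0;
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    // Skip blank lines and "# HELP" / "# TYPE" comments.
    if (line === "" || line.startsWith("#")) continue;
    const m = sampleLine.exec(line);
    if (!m || m[1] !== metric) continue;
    const value = Number(m[3]);
    if (Number.isFinite(value)) total += value;
  }
  return total;
}
```

With the metrics server running, the body of `GET /metrics` could be passed in as `text`, for example `sumSamples(body, "llm_requests_total")`, where `llm_requests_total` stands in for whatever counter name the server actually exports.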
 
## Next.js API routes (`/api/observability/*`)
 
| Route | Method | Query params | Description |
|---|---|---|---|
| `/api/observability/summary` | GET | `start`, `end` (ISO strings) | Aggregate dashboard summary (total calls, cost, models, teams) |
| `/api/observability/models` | GET | — | Top 10 models ranked by cost and call count |
| `/api/observability/teams` | GET | — | All teams with spend, call count, and budget usage |
| `/api/observability/teams/[teamId]` | GET | — | Per-team detail: total cost and per-model breakdown |
| `/api/observability/latency` | GET | `model` (required) | Latency percentiles (p50, p90, p95, p99, mean) for a model |
| `/api/observability/timeseries` | GET | `interval` (minute\|hour\|day) | Time-series cost data over the last 7 days |
| `/api/observability/anomalies` | GET | — | Anomaly alerts detected in the last 24 hours |
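
The routes above can be called from any HTTP client. As a sketch, the helpers below build request URLs with the documented query parameters. The function names and the `base` argument are hypothetical; only the paths and parameter names come from the table.

```typescript
type Interval = "minute" | "hour" | "day";

// Summary over a time window; start and end are sent as ISO strings.
function summaryUrl(base: string, start: Date, end: Date): string {
  const url = new URL("/api/observability/summary", base);
  url.searchParams.set("start", start.toISOString());
  url.searchParams.set("end", end.toISOString());
  return url.toString();
}

// Latency percentiles; the model parameter is required by the route.
function latencyUrl(base: string, model: string): string {
  if (model.trim() === "") throw new Error("model is required");
  const url = new URL("/api/observability/latency", base);
  url.searchParams.set("model", model);
  return url.toString();
}

// Time-series cost data at the chosen bucket size.
function timeseriesUrl(base: string, interval: Interval): string {
  const url = new URL("/api/observability/timeseries", base);
  url.searchParams.set("interval", interval);
  return url.toString();
}
```

With the dev server from the quick start running, `fetch(timeseriesUrl("http://localhost:3000", "hour"))` would request hourly buckets. The shape of the JSON response is not specified in this README.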
 
## Admin panel pages
 
| Route | Description |
|---|---|
| `/` | Dashboard with summary cards (total calls, cost, avg latency, active models, active teams) |
| `/models` | Model rankings table sorted by cost and call volume |
| `/teams` | Team budget tracking with spend, call count, and budget utilization percentage |
| `/teams/[teamId]` | Per-team drilldown showing per-model cost breakdown |
| `/latency` | Latency percentile bar chart (p50/p90/p95/p99/mean) for a selected model |
| `/anomalies` | Anomaly alerts table with severity, timestamp, model, and team context |
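
For readers unfamiliar with the figures on the `/latency` page, the sketch below shows one common way to compute them, the nearest-rank method. This is an illustration only: the README does not say which percentile method the suite uses or where it runs, and `percentile` and `summarizeLatency` are hypothetical names.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. Input must be sorted ascending.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const rank = Math.max(1, Math.ceil((p * sortedAsc.length) / 100));
  return sortedAsc[rank - 1] as number;
}

// Computes the same five figures the latency page displays.
function summarizeLatency(samplesMs: number[]) {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  return {
    p50: percentile(sorted, 50),
    p90: percentile(sorted, 90),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
    mean,
  };
}
```

The gap between p50 and p99 is usually more informative than the mean, because a few slow calls can move the mean without affecting most requests.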
 
## Project layout
 
```
app/                  Next.js App Router pages + API routes
src/                  services, lib, adapters
server/               Express metrics server
tests/                vitest suite (mirrors src/)
packages/             API references for every dependency (read these first)
bin/                  CLI scripts
DEV_PLAN.md           build plan for this recipe
```
 
## License
 
MIT — see [LICENSE](./LICENSE).