Files · Databricks MCP Server for SMB Data Analytics
61 (1 binary, 655.8 kB total)attempt 1
README.md·5400 B·markdown
markdown
# Databricks MCP Server for SMB Data Analytics
> Expose Databricks SQL queries as secure, multi-tenant MCP tools for AI agents, with built-in auth, rate limiting, and observability.
A tutorialized reference solution from [reaatech.com](https://reaatech.com), demonstrating how to build production-grade AI systems with the `@reaatech/*` package family.
## Overview
This server wraps Databricks SQL warehouses as MCP (Model Context Protocol) tools, enabling AI agents to query data through a secure, multi-tenant gateway. Each request is authenticated, rate-limited, and scoped to a tenant via an `x-tenant-id` header. Tools include running SQL queries, listing tables, describing schemas, validating SQL, and estimating row counts. Observability is provided through Langfuse and OpenTelemetry tracing.
## Prerequisites
- Node.js >= 22
- pnpm 10
- A Databricks SQL warehouse (with HTTP path and access token)
- A Langfuse account (optional, for observability)
## Quick Start
```bash
git clone <repo-url> databricks-mcp-server
cd databricks-mcp-server
cp .env.example .env
```
Fill in your credentials in `.env`:
```env
DATABRICKS_HOST=your-workspace.cloud.databricks.com
DATABRICKS_PATH=/sql/2.0/warehouses/your-warehouse-id
DATABRICKS_TOKEN=your-pat-token
```
```bash
pnpm install
pnpm dev
```
The server starts on port 8080. Connect any MCP client to `http://localhost:8080` with `x-tenant-id` and `x-api-key` headers.
## Architecture
The middleware pipeline processes every request in this order:
- **Auth** — API key or bearer token verification via `@reaatech/mcp-server-auth`
- **Tenant resolution** — reads `x-tenant-id` header via `HeaderTenantResolver`
- **Rate limiting** — per-tenant RPM and TPM enforcement via `DefaultRateLimiter`
- **Tool visibility** — filters exposed tools based on the tenant's plan (`basic` or `enterprise`)
- **Observability** — automatic span creation and tool execution logging via `@reaatech/mcp-server-observability`
Five tools are registered through the multi-tenant middleware on an MCP `Server` instance. SQL queries execute against a shared `DatabricksSessionManager` that wraps the `@databricks/sql` Thrift driver. Row-level security is applied automatically: every `SELECT` query is wrapped in a subquery filtered by `tenant_id`.
## MCP Tools
| Tool | Input | Behavior |
|---|---|---|
| `run_sql` | `{ sql: string }` | Execute SQL and return results as a markdown table. RLS subquery wrapping is applied for `SELECT` queries. |
| `list_tables` | `{}` | List all tables from `information_schema.tables` (excluding system schemas). |
| `describe_table` | `{ tableName: string }` | Describe columns of a table via `DESCRIBE TABLE`. Table name is backtick-escaped. |
| `validate_sql` | `{ sql: string }` | Validate SQL using `EXPLAIN` without executing the query. |
| `estimate_rows` | `{ sql: string }` | Estimate row count by wrapping the query in `SELECT COUNT(*)`. |
## Multi-Tenancy
The server uses `@reaatech/multi-tenant-mcp-middleware` to enforce tenant isolation:
- **Tenant resolution** — the `x-tenant-id` HTTP header identifies the requesting tenant
- **Rate limits** — configured via `RATE_LIMIT_RPM` and `RATE_LIMIT_TPM` environment variables
- **Tool visibility** — `basic` plan tenants can use `run_sql` and `list_tables` only; `enterprise` tenants have access to all five tools
- **RLS** — every `SELECT` query via `run_sql` is wrapped with `WHERE tenant_id = '<tenant-id>'` using the resolved tenant context
## Observability
Observability is provided by `@reaatech/mcp-server-observability`:
- **Langfuse** — tool invocations are logged with status and duration via `recordToolInvocation()`
- **OpenTelemetry** — automatic span creation via `withSpan()` and configurable OTLP exporter
- **Structured logging** — JSON logs via the `logger` instance at `info` and `error` levels
Configure Langfuse with `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and `LANGFUSE_BASE_URL` in your `.env`.
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `DATABRICKS_HOST` | — | Databricks workspace URL |
| `DATABRICKS_PATH` | — | SQL warehouse HTTP path |
| `DATABRICKS_TOKEN` | — | Databricks personal access token |
| `PORT` | `8080` | Express server listen port |
| `NODE_ENV` | `development` | Environment mode |
| `LOG_LEVEL` | `info` | Logging level |
| `API_KEY` | — | Shared secret for API key auth |
| `AUTH_MODE` | `api-key` | Authentication mode |
| `AUTH_BYPASS_IN_DEV` | `true` | Skip auth in development when no API_KEY set |
| `CORS_ORIGIN` | `*` | Allowed CORS origin |
| `TENANT_RESOLVER_HEADER` | `x-tenant-id` | Header for tenant resolution |
| `RATE_LIMIT_RPM` | `60` | Requests per minute per tenant |
| `RATE_LIMIT_TPM` | `10000` | Tokens per minute per tenant |
| `LANGFUSE_PUBLIC_KEY` | — | Langfuse public key |
| `LANGFUSE_SECRET_KEY` | — | Langfuse secret key |
| `LANGFUSE_BASE_URL` | `https://cloud.langfuse.com` | Langfuse API base URL |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | — | OTLP collector endpoint |
| `OTEL_SERVICE_NAME` | `databricks-mcp-server` | OpenTelemetry service name |
## Testing
```bash
pnpm typecheck # TypeScript type checking (tsc --noEmit)
pnpm lint # ESLint with strict type-checked rules
pnpm test # Vitest with V8 coverage (thresholds: 90%+)
```
## License
MIT — see [LICENSE](./LICENSE).