# Databricks Security Guardrails for SMB Data Pipelines
 
> Add PII redaction, prompt injection defense, and content policy enforcement to your Databricks model-serving pipelines — no retraining required.
 
A tutorialized reference solution from [reaatech.com](https://reaatech.com), demonstrating how to build production-grade AI systems with the `@reaatech/*` package family.
 
## Overview
 
Small and medium businesses running AI workloads on Databricks model-serving endpoints often lack the security infrastructure that large enterprises take for granted. Exposed endpoints can receive prompts containing personally identifiable information (PII), prompt injection attacks, toxic content, or requests that violate topic boundaries — any of which can lead to compliance violations, data leaks, or reputational damage. Retraining models to handle these edge cases is impractical; a defense-in-depth layer is required.
 
**Databricks Security Guardrails** addresses this with a pluggable guardrail chain that sits between the client and Databricks. Every incoming request is run through PII detection via Presidio (`@presidio-dev/hai-guardrails`) and then processed by a configurable sequence of guardrails from the `@reaatech/guardrail-chain` ecosystem: PII redaction, prompt injection detection, toxicity filtering, topic boundary enforcement, and cost precheck. The guardrail chain supports budgets (max latency, max tokens) and can dynamically skip slow checks under pressure. Output responses from Databricks are also scanned (PII leak detection, toxicity). All guardrail executions are traced, logged, and metered through Langfuse via `@reaatech/guardrail-chain-observability` adapters, providing an auditable trail for security reviews.
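
The budget-aware chain described above can be sketched as follows. This is a minimal, synchronous illustration under stated assumptions, not the `@reaatech/guardrail-chain` API: the `Verdict`, `Guardrail`, and `runChain` names, the `slow` flag, and the two toy guardrails are all hypothetical. It shows guardrails running in order, sanitizing or rejecting input, and skipping checks marked slow once the latency budget is spent.

```typescript
// Hypothetical types for illustration; the real @reaatech/guardrail-chain API may differ.
type Verdict =
  | { kind: "pass" }
  | { kind: "sanitize"; text: string }
  | { kind: "reject"; reason: string };

interface Guardrail {
  name: string;
  slow: boolean; // candidate for skipping once the latency budget is spent
  check(text: string): Verdict;
}

interface ChainResult {
  allowed: boolean;
  text: string;
  violations: string[];
  skipped: string[];
}

// Runs guardrails in order. Once elapsed time reaches the budget,
// remaining guardrails marked `slow` are skipped instead of executed.
function runChain(
  guardrails: Guardrail[],
  input: string,
  maxLatencyMs: number,
  now: () => number = Date.now, // injectable clock, useful in tests
): ChainResult {
  const start = now();
  let text = input;
  const skipped: string[] = [];

  for (const guardrail of guardrails) {
    const underPressure = now() - start >= maxLatencyMs;
    if (underPressure && guardrail.slow) {
      skipped.push(guardrail.name);
      continue;
    }
    const verdict = guardrail.check(text);
    if (verdict.kind === "reject") {
      const violations = [`${guardrail.name}: ${verdict.reason}`];
      return { allowed: false, text, violations, skipped };
    }
    if (verdict.kind === "sanitize") text = verdict.text;
  }
  return { allowed: true, text, violations: [], skipped };
}

// Toy guardrail: masks anything that looks like an email address.
const piiRedaction: Guardrail = {
  name: "pii-redaction",
  slow: false,
  check: (t) => {
    const masked = t.replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "<EMAIL>");
    return masked === t ? { kind: "pass" } : { kind: "sanitize", text: masked };
  },
};

// Toy guardrail: a naive blocklist standing in for a real toxicity model.
const toxicityFilter: Guardrail = {
  name: "toxicity-filter",
  slow: true,
  check: (t) =>
    /\b(idiot|moron)\b/i.test(t)
      ? { kind: "reject", reason: "toxic language" }
      : { kind: "pass" },
};
```

Calling `runChain([piiRedaction, toxicityFilter], prompt, 5000)` returns the sanitized text when the request is allowed, or the violation list when it is rejected. Which guardrails are safe to skip under pressure is a policy decision; the sketch only shows the mechanism.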
 
The architecture is straightforward: a Next.js 16+ API route (`POST /api/guardrails/chat`) accepts chat requests, passes them through the `GuardrailService` (which bundles Presidio analysis + guardrail chain execution), forwards clean requests to Databricks, scans the model output, and returns the result — or rejects the request with a structured error. A health endpoint (`GET /api/health`) checks both the server and Databricks connectivity.
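
The request flow above can be sketched as a single function with injected dependencies. This is an illustration under assumptions, not the code in `app/api/guardrails/chat/route.ts`: `handleChat`, `Deps`, and the `invalid_request` and `output_blocked` error codes are hypothetical, and the `messages` check is an extra assumption. The `guardrail_blocked` and `upstream_error` bodies follow the shapes documented under API Endpoints.

```typescript
// All names below are hypothetical stand-ins, not the project's actual exports.
interface ChatRequest {
  endpoint: string;
  messages: { role: string; content: string }[];
}

interface HandlerResult {
  status: 200 | 400 | 403 | 500 | 502;
  body: unknown;
}

interface Deps {
  // Input guardrails: violations (empty when clean) plus the sanitized request.
  checkInput(req: ChatRequest): { violations: string[]; sanitized: ChatRequest };
  // Forwards to Databricks; resolves with the upstream HTTP status and parsed body.
  callDatabricks(req: ChatRequest): Promise<{ status: number; body: unknown }>;
  // Output guardrails: violations found in the model response.
  checkOutput(body: unknown): string[];
}

function isChatRequest(raw: unknown): raw is ChatRequest {
  if (typeof raw !== "object" || raw === null) return false;
  const fields = raw as Record<string, unknown>;
  return typeof fields["endpoint"] === "string" && Array.isArray(fields["messages"]);
}

async function handleChat(raw: unknown, deps: Deps): Promise<HandlerResult> {
  // 1. Validate the request shape.
  if (!isChatRequest(raw)) {
    return { status: 400, body: { error: "invalid_request" } };
  }

  // 2. Run input guardrails; block before anything reaches Databricks.
  const input = deps.checkInput(raw);
  if (input.violations.length > 0) {
    return { status: 403, body: { error: "guardrail_blocked", violations: input.violations } };
  }

  // 3. Forward the sanitized request and surface upstream failures.
  const upstream = await deps.callDatabricks(input.sanitized);
  if (upstream.status < 200 || upstream.status >= 300) {
    return {
      status: 502,
      body: { error: "upstream_error", status: upstream.status, detail: upstream.body },
    };
  }

  // 4. Scan the model output before returning it to the client.
  if (deps.checkOutput(upstream.body).length > 0) {
    return { status: 500, body: { error: "output_blocked" } };
  }
  return { status: 200, body: upstream.body };
}
```

Injecting the guardrail and Databricks calls keeps the ordering logic testable without a network connection.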
 
## Prerequisites
 
- **Node.js** >= 22
- **pnpm** (see `packageManager` in `package.json`)
- A **Databricks workspace** with at least one model-serving endpoint deployed
- A **Langfuse** account (self-hosted or cloud) for observability
 
## Quick Start
 
```bash
pnpm install
cp .env.example .env
```
 
Fill in your credentials in `.env` (see [Configuration](#configuration)), then:
 
```bash
pnpm dev
```
 
The server starts on `http://localhost:3000`. Send test requests to `POST /api/guardrails/chat` and `GET /api/health`.
 
## Configuration
 
All configuration is driven by environment variables. Copy `.env.example` to `.env` and supply the following:
 
| Variable | Description |
|---|---|
| `DATABRICKS_HOST` | Base URL of your Databricks workspace (e.g. `https://<workspace>.cloud.databricks.com`) |
| `DATABRICKS_TOKEN` | Databricks personal access token |
| `LANGFUSE_SECRET_KEY` | Langfuse API secret key |
| `LANGFUSE_PUBLIC_KEY` | Langfuse API public key |
| `LANGFUSE_HOST` | Langfuse base URL (e.g. `https://cloud.langfuse.com`) |
| `GUARDRAIL_CHAIN_BUDGET_MAX_LATENCY_MS` | Override the default maximum latency budget (ms) for the guardrail chain |
| `GUARDRAIL_CHAIN_BUDGET_MAX_TOKENS` | Override the default maximum token budget for cost-precheck guardrails |
 
The `GUARDRAIL_CHAIN_*` variables are consumed by `@reaatech/guardrail-chain-config`, which loads and validates them via `loadConfigFromEnv`. If unset, sensible defaults apply (5000 ms max latency, 10000 max tokens). Per-endpoint profiles can override these budgets programmatically through the `EndpointProfileManager`.
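
As a rough illustration of the fallback behaviour, the sketch below reads the two budget variables from an environment map and applies the defaults listed above. It does not call `loadConfigFromEnv`, whose signature is not shown in this README; `readBudgetFromEnv`, `parsePositiveInt`, and `ChainBudget` are hypothetical names, and rejecting non-positive values is an assumption about what validation might look like.

```typescript
// Hypothetical helper for illustration; the project itself uses
// `loadConfigFromEnv` from @reaatech/guardrail-chain-config.
interface ChainBudget {
  maxLatencyMs: number;
  maxTokens: number;
}

// Defaults as documented in this README.
const DEFAULT_BUDGET: ChainBudget = { maxLatencyMs: 5000, maxTokens: 10000 };

// Returns the fallback when the variable is unset or blank;
// throws when it is set to something other than a positive integer.
function parsePositiveInt(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

function readBudgetFromEnv(env: Record<string, string | undefined>): ChainBudget {
  return {
    maxLatencyMs: parsePositiveInt(
      "GUARDRAIL_CHAIN_BUDGET_MAX_LATENCY_MS",
      env["GUARDRAIL_CHAIN_BUDGET_MAX_LATENCY_MS"],
      DEFAULT_BUDGET.maxLatencyMs,
    ),
    maxTokens: parsePositiveInt(
      "GUARDRAIL_CHAIN_BUDGET_MAX_TOKENS",
      env["GUARDRAIL_CHAIN_BUDGET_MAX_TOKENS"],
      DEFAULT_BUDGET.maxTokens,
    ),
  };
}
```

In a Node.js process this would be called as `readBudgetFromEnv(process.env)`. Failing fast on a malformed value is a design choice in this sketch; silently falling back to the default would hide configuration mistakes.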
 
## API Endpoints
 
### `POST /api/guardrails/chat`
 
Routes a chat request through the guardrail chain to a Databricks model-serving endpoint.
 
**Request body:**
 
```json
{
  "endpoint": "my-model-endpoint",
  "messages": [{ "role": "user", "content": "What is my account balance?" }],
  "model": "gpt-4",
  "temperature": 0.7,
  "maxTokens": 2048
}
```
 
| Field | Type | Required | Description |
|---|---|---|---|
| `endpoint` | `string` | yes | Databricks serving endpoint name |
| `messages` | `Array<{role, content}>` | yes | Chat messages (typically one user message) |
| `model` | `string` | no | Model identifier passed to Databricks |
| `temperature` | `number` | no | Sampling temperature |
| `maxTokens` | `number` | no | Max tokens for the response |
 
**Responses:**
 
| Status | Description |
|---|---|
| `200` | Request passed all guardrails. Returns the raw Databricks response body (`{id, choices, usage}`). |
| `400` | Invalid JSON body or missing `endpoint` field. |
| `403` | Guardrail chain rejected the input. Body includes `{error: "guardrail_blocked", violations}`. |
| `500` | Output guardrail rejected the model response, or internal error. |
| `502` | Databricks upstream error. Body includes `{error: "upstream_error", status, detail}`. |
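
On the client side, the status codes above can be mapped to a discriminated union so callers handle each case explicitly. This is a sketch only: `ChatOutcome` and `interpretResponse` are hypothetical names, and the fields read from the body are the ones listed in the table.

```typescript
// Hypothetical client-side types; not exported by this project.
type ChatOutcome =
  | { kind: "ok"; completion: unknown }
  | { kind: "bad_request" }
  | { kind: "blocked"; violations: unknown }
  | { kind: "upstream_error"; upstreamStatus: unknown; detail: unknown }
  | { kind: "server_error" };

// Narrows an unknown JSON body to a plain record, or an empty one.
function asRecord(body: unknown): Record<string, unknown> {
  return typeof body === "object" && body !== null
    ? (body as Record<string, unknown>)
    : {};
}

// Maps an HTTP status and parsed JSON body to a typed outcome.
function interpretResponse(status: number, body: unknown): ChatOutcome {
  const fields = asRecord(body);
  switch (status) {
    case 200:
      return { kind: "ok", completion: body };
    case 400:
      return { kind: "bad_request" };
    case 403:
      return { kind: "blocked", violations: fields["violations"] };
    case 502:
      return {
        kind: "upstream_error",
        upstreamStatus: fields["status"],
        detail: fields["detail"],
      };
    default:
      // Covers 500 (output guardrail rejection or internal error) and anything unexpected.
      return { kind: "server_error" };
  }
}
```

With `fetch`, a caller would pass `res.status` and the parsed JSON body to `interpretResponse`, then switch on `kind`.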
 
### `GET /api/health`
 
Health check that reports server and Databricks connectivity status.
 
```json
{
  "status": "ok",
  "databricks": true
}
```
 
## Guardrail Profiles
 
Per-endpoint configuration is managed by the `EndpointProfileManager` (`src/config/rules.ts`). A default profile is created at startup; profiles for specific endpoints can be registered at any time.
 
**Default profile guardrails:**
 
| Guardrail | Role |
|---|---|
| `pii-redaction` | Masks detected PII (email, phone, SSN, credit card numbers) in prompts |
| `prompt-injection` | Detects and blocks prompt injection / jailbreak attempts |
| `toxicity-filter` | Blocks toxic or hateful content |
| `topic-boundary` | Enforces allowed/blocked topic lists |
| `cost-precheck` | Estimates token usage and rejects requests that exceed the budget |
| `rate-limiter` | (Optional) Enforces per-window request limits |
 
The rate limiter is available in profile configuration but is not enabled in the default profile. Enable it by adding `"rate-limiter"` to the profile's `enabledGuardrails` array.
 
Budgets (max latency, max tokens, skip-slow-under-pressure) are set per profile and can be tuned independently for each endpoint via `profileManager.setProfile()`.
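
The per-endpoint override can be pictured with the in-memory sketch below. It is not the `EndpointProfileManager` from `src/config/rules.ts`: `ProfileManagerSketch`, `ProfileSketch`, `getProfile`, and the budget field names are hypothetical, and the `skipSlowUnderPressure: true` default is an assumption. Only `setProfile` and `enabledGuardrails` are names taken from this README.

```typescript
// Hypothetical shapes for illustration; the real profile type may differ.
interface ProfileSketch {
  enabledGuardrails: string[];
  budget: { maxLatencyMs: number; maxTokens: number; skipSlowUnderPressure: boolean };
}

class ProfileManagerSketch {
  private readonly profiles = new Map<string, ProfileSketch>();
  private readonly defaultProfile: ProfileSketch;

  constructor(defaultProfile: ProfileSketch) {
    this.defaultProfile = defaultProfile;
  }

  // Registers or replaces the profile for one serving endpoint.
  setProfile(endpoint: string, profile: ProfileSketch): void {
    this.profiles.set(endpoint, profile);
  }

  // Falls back to the default profile for endpoints without an override.
  getProfile(endpoint: string): ProfileSketch {
    return this.profiles.get(endpoint) ?? this.defaultProfile;
  }
}

const defaultProfile: ProfileSketch = {
  enabledGuardrails: [
    "pii-redaction",
    "prompt-injection",
    "toxicity-filter",
    "topic-boundary",
    "cost-precheck",
  ],
  // skipSlowUnderPressure: true is assumed here, not documented.
  budget: { maxLatencyMs: 5000, maxTokens: 10000, skipSlowUnderPressure: true },
};

const profileManager = new ProfileManagerSketch(defaultProfile);

// Tighter latency budget plus rate limiting for one endpoint only.
profileManager.setProfile("my-model-endpoint", {
  enabledGuardrails: [...defaultProfile.enabledGuardrails, "rate-limiter"],
  budget: { ...defaultProfile.budget, maxLatencyMs: 2000 },
});
```

Endpoints without a registered profile keep the default guardrails and budgets, so an override affects only the endpoint it names.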
 
## Observability
 
Observability is wired through `@reaatech/guardrail-chain-observability`, which defines three interfaces: `Logger`, `MetricsCollector`, and `Tracer`. The `initObservability()` function (`src/observability/index.ts`) instantiates a Langfuse client and wraps it behind each interface:
 
- **`LangfuseLogger`**: Writes debug, info, warn, and error log entries as Langfuse traces with metadata.
- **`LangfuseMetricsCollector`**: Records counter increments as Langfuse scores, and histograms and gauges as traces with generation events.
- **`LangfuseTracer`**: Creates hierarchical spans (as Langfuse trace → generation pairs) for every guardrail execution.
 
`initObservability()` is called automatically during Next.js instrumentation (`src/instrumentation.ts`), so it runs once at server startup. Every guardrail execution in the chain is traced end-to-end, giving you full visibility into latency, failure points, and violation patterns.
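
To make the per-guardrail tracing concrete, here is a sketch of wrapping one check in a span. The `TracerSketch` and `SpanSketch` interfaces, the `traced` helper, and the in-memory tracer are hypothetical; the actual `Tracer` interface in `@reaatech/guardrail-chain-observability` may have a different shape, and no Langfuse calls are made here.

```typescript
// Hypothetical minimal tracer shape, for illustration only.
interface SpanSketch {
  end(attributes: Record<string, unknown>): void;
}

interface TracerSketch {
  startSpan(name: string): SpanSketch;
}

// Runs `fn` inside a span and records duration and success or failure.
function traced<T>(tracer: TracerSketch, name: string, fn: () => T): T {
  const span = tracer.startSpan(name);
  const startedAt = Date.now();
  try {
    const result = fn();
    span.end({ ok: true, durationMs: Date.now() - startedAt });
    return result;
  } catch (err) {
    span.end({ ok: false, durationMs: Date.now() - startedAt, error: String(err) });
    throw err; // tracing must not swallow guardrail failures
  }
}

interface RecordedSpan {
  name: string;
  attributes: Record<string, unknown>;
}

// In-memory tracer that collects finished spans; handy in tests.
function createMemoryTracer(): { tracer: TracerSketch; spans: RecordedSpan[] } {
  const spans: RecordedSpan[] = [];
  const tracer: TracerSketch = {
    startSpan: (name) => ({
      end: (attributes) => {
        spans.push({ name, attributes });
      },
    }),
  };
  return { tracer, spans };
}
```

A Langfuse-backed tracer would implement the same two methods and forward each finished span to Langfuse instead of an array.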
 
## Project Structure
 
```
app/
├── api/
│   ├── guardrails/chat/route.ts   POST handler — guardrail chain + Databricks proxy
│   └── health/route.ts            GET handler — health check
├── globals.css
├── layout.tsx
└── page.tsx
src/
├── api/
│   └── databricks-proxy.ts        DatabricksClient — HTTP forwarding with timeout & error handling
├── config/
│   ├── index.ts                   Re-exports EndpointProfileManager
│   └── rules.ts                   EndpointProfileManager — per-endpoint guardrail profiles
├── guards/
│   ├── index.ts                   Re-exports PresidioAnalyzer
│   └── presidio.ts                PresidioAnalyzer — PII detection & sanitization via GuardrailsEngine
├── observability/
│   ├── index.ts                   initObservability() — Langfuse wiring
│   ├── langfuse-logger.ts         Logger → Langfuse traces
│   ├── langfuse-metrics.ts        MetricsCollector → Langfuse scores/traces
│   └── langfuse-tracer.ts         Tracer → Langfuse generations
├── services/
│   └── guardrail-service.ts       GuardrailService — Presidio + guardrail chain composition
├── index.ts                       Public API exports
├── instrumentation.ts             Next.js instrumentation — calls initObservability()
└── types.ts                       Shared TypeScript interfaces
tests/                             Vitest suite (mirrors src/ layout)
packages/                          API references for every dependency
DEV_PLAN.md                        Build plan
```
 
## Testing
 
```bash
pnpm test
```
 
Runs the full Vitest test suite with coverage reporting. Tests are organized to mirror the `src/` directory layout under `tests/`. The test suite uses **msw** (Mock Service Worker) to intercept HTTP requests to Databricks and Langfuse, isolating unit and integration tests from external dependencies. Guardrail chain execution is tested with mock profiles and synthetic input data to verify correct verdicts (pass / reject / sanitize) across all guardrail types.
 
## License
 
MIT — see [LICENSE](./LICENSE).