# Anthropic Prompt Injection Shield for SMB Support Chat
 
> Protect your small business customer chat from prompt injection, PII leaks, and harmful content with a plug‑and‑play Anthropic guardrails layer.
 
A tutorialized reference solution from [reaatech.com](https://reaatech.com), demonstrating how to build production-grade AI safety systems with the `@reaatech/*` package family.
 
## Features
 
- **Presidio PII Redaction**: Detects and redacts personally identifiable information (email addresses, phone numbers, etc.) using `@presidio-dev/hai-guardrails`, which combines PII scanning with a heuristic injection guard.
- **Custom Injection Classifier**: Pattern-based heuristic detection of common prompt injection attacks (jailbreaks, system prompt overrides, token injections).
- **Anthropic Content Moderation**: Uses Claude (`claude-sonnet-4-6`) to classify content as safe or unsafe, with configurable thresholds.
- **Guardrail Chain Orchestration**: All three guardrails run in sequence via `@reaatech/guardrail-chain`, with budget-aware scheduling, timeout handling, a circuit breaker, and fail-open support.
- **Langfuse Audit Logging**: Guardrail events, metrics, and traces are streamed to Langfuse via the `@reaatech/guardrail-chain-observability` adapters.
- **Benchmark Regression Testing**: Uses `prompt-injection-bench` to run weekly defense regression suites and compare scores on the REAA leaderboard.
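
To make the pattern-based detection idea concrete, here is a minimal sketch of a heuristic injection classifier. The pattern list, the `classifyInjection` name, and the verdict shape are hypothetical illustrations and not the rule set shipped in this repository.

```typescript
// Hypothetical pattern list for illustration only; a real deployment
// needs a broader, regularly maintained set.
interface InjectionPattern {
  label: string;
  regex: RegExp;
}

const PATTERNS: InjectionPattern[] = [
  { label: "system-prompt-override", regex: /ignore\s+(all\s+)?(previous|prior)\s+instructions/i },
  { label: "jailbreak", regex: /\b(DAN mode|developer mode|jailbreak)\b/i },
  { label: "token-injection", regex: /<\|(im_start|im_end|system)\|>/i },
];

interface InjectionVerdict {
  flagged: boolean;
  matched: string[]; // labels of every pattern that matched
}

// Flags a message when at least one known attack pattern matches.
function classifyInjection(message: string): InjectionVerdict {
  const matched = PATTERNS.filter((p) => p.regex.test(message)).map((p) => p.label);
  return { flagged: matched.length > 0, matched };
}
```

Regular expressions alone are easy to evade, which is presumably why the stack layers this check with an LLM-based moderation guard.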
 
## Architecture
 
```
POST /api/moderate  →  SecurityGuardService
                        ├── PresidioGuard (PII redaction)
                        ├── InjectionClassifierGuard (heuristic patterns)
                        ├── AnthropicModerationGuard (Claude classification)
                        └── results → Langfuse (audit logs)
```
 
Messages arrive at the `/api/moderate` endpoint and pass through the three guardrails in sequence; the verdict is returned as JSON. If any guardrail rejects a message, the chain short-circuits and the response names the blocking guardrail in `failedGuardrail`. Under budget pressure, slow guardrails are skipped.
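
The control flow described above can be sketched in a few lines. This is not the `@reaatech/guardrail-chain` API: the `Guardrail` and `ChainResult` types, the `slow` flag, and the `runChain` function are assumptions made for illustration, and the latency budget is modeled simply as elapsed wall-clock time.

```typescript
// Hypothetical types for illustration; not the @reaatech/guardrail-chain API.
interface Guardrail {
  name: string;
  slow: boolean; // e.g. true for an LLM-backed guard
  check: (message: string) => Promise<boolean>; // resolves true when the message passes
}

interface ChainResult {
  passed: boolean;
  failedGuardrail: string | null;
  skipped: string[]; // slow guards skipped because the budget ran out
}

async function runChain(
  guardrails: Guardrail[],
  message: string,
  maxLatencyMs: number,
  now: () => number = Date.now, // injectable clock, handy in tests
): Promise<ChainResult> {
  const start = now();
  const skipped: string[] = [];
  for (const guard of guardrails) {
    const underPressure = now() - start >= maxLatencyMs;
    if (underPressure && guard.slow) {
      skipped.push(guard.name); // budget exhausted: skip slow guards
      continue;
    }
    if (!(await guard.check(message))) {
      // Short-circuit: later guards never run once one rejects the message.
      return { passed: false, failedGuardrail: guard.name, skipped };
    }
  }
  return { passed: true, failedGuardrail: null, skipped };
}
```

Ordering the cheap, local guards first means most malicious input is rejected before the chain pays for a model call.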
 
## Prerequisites
 
All environment variables are listed in `.env.example`. The following are required:
 
| Variable | Description |
|---|---|
| `ANTHROPIC_API_KEY` | Anthropic API key for Claude content moderation |
| `LANGFUSE_PUBLIC_KEY` | Langfuse project public key |
| `LANGFUSE_SECRET_KEY` | Langfuse project secret key |
| `LANGFUSE_BASE_URL` | Langfuse host URL |
 
Optional configuration:
 
| Variable | Default | Description |
|---|---|---|
| `GUARDRAIL_CHAIN_BUDGET_MAX_LATENCY_MS` | 2000 | Max total latency for guardrail chain (ms) |
| `GUARDRAIL_CHAIN_BUDGET_MAX_TOKENS` | 8000 | Max total token budget across guardrails |
| `PRESIDIO_HEURISTIC_THRESHOLD` | 0.7 | Presidio injection guard threshold |
| `MODERATION_MODEL` | claude-sonnet-4-6 | Anthropic model for content moderation |
| `MODERATION_MAX_TOKENS` | 1024 | Max tokens for moderation LLM calls |
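
As a rough illustration of how these optional variables and their defaults might be read, the sketch below parses them from a plain key/value map. The `loadGuardrailEnv` and `readNumber` names and the config shape are hypothetical; the project itself loads configuration through `@reaatech/guardrail-chain-config`, whose API is not shown here.

```typescript
// Defaults are taken from the table above; names and shapes are hypothetical.
interface GuardrailEnvConfig {
  maxLatencyMs: number;
  maxTokens: number;
  presidioThreshold: number;
  moderationModel: string;
  moderationMaxTokens: number;
}

type Env = Record<string, string | undefined>;

// Returns the fallback when the variable is unset or blank; rejects non-numeric values.
function readNumber(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value)) {
    throw new Error(`${key} must be a number, got "${raw}"`);
  }
  return value;
}

function loadGuardrailEnv(env: Env): GuardrailEnvConfig {
  return {
    maxLatencyMs: readNumber(env, "GUARDRAIL_CHAIN_BUDGET_MAX_LATENCY_MS", 2000),
    maxTokens: readNumber(env, "GUARDRAIL_CHAIN_BUDGET_MAX_TOKENS", 8000),
    presidioThreshold: readNumber(env, "PRESIDIO_HEURISTIC_THRESHOLD", 0.7),
    moderationModel: env["MODERATION_MODEL"] ?? "claude-sonnet-4-6",
    moderationMaxTokens: readNumber(env, "MODERATION_MAX_TOKENS", 1024),
  };
}
```

In a Node.js process this would typically be called as `loadGuardrailEnv(process.env)`.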
 
## Getting Started
 
```bash
pnpm install
pnpm dev              # starts Next.js dev server
pnpm test             # runs vitest with coverage
pnpm typecheck        # TypeScript type checking
pnpm lint             # ESLint
```
 
### Example usage
 
```bash
# Moderate a message
curl -X POST http://localhost:3000/api/moderate \
  -H 'Content-Type: application/json' \
  -d '{"message": "What are your return policies?"}'
 
# Run a security benchmark
curl -X POST http://localhost:3000/api/security-bench
 
# Health check
curl http://localhost:3000/api/health
```
 
## API Reference
 
### POST /api/moderate
 
Moderate a message through the guardrail chain.
 
**Request body:**
```json
{ "message": "string (required)", "userId": "string (optional)", "sessionId": "string (optional)" }
```
 
**Response (200):**
```json
{ "passed": true, "correlationId": "uuid", "failedGuardrail": null, "details": {} }
```
 
**Response (400):**
```json
{ "error": { "message": { "_errors": ["Required"] } } }
```
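
For callers written in TypeScript, a typed client helper could look roughly like the sketch below. The interfaces mirror the JSON examples above; the `moderate` function, the `FetchLike` type, and the loose typing of `details` are assumptions, since the contents of `details` are not documented here.

```typescript
// Shapes mirror the JSON examples above; `details` is typed loosely
// because its contents are not documented in this README.
interface ModerateRequest {
  message: string;
  userId?: string;
  sessionId?: string;
}

interface ModerateResponse {
  passed: boolean;
  correlationId: string;
  failedGuardrail: string | null;
  details: Record<string, unknown>;
}

// Minimal structural type so the global `fetch` or a test double can be passed in.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function moderate(
  fetchFn: FetchLike,
  baseUrl: string,
  request: ModerateRequest,
): Promise<ModerateResponse> {
  const res = await fetchFn(`${baseUrl}/api/moderate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(request),
  });
  const payload = await res.json();
  if (!res.ok) {
    // A 400 response carries the validation errors under `error`.
    throw new Error(`Moderation request failed (${res.status}): ${JSON.stringify(payload)}`);
  }
  return payload as ModerateResponse;
}
```

Against the dev server, a call might read `await moderate(fetch, "http://localhost:3000", { message: "What are your return policies?" })`.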
 
### GET /api/moderate
 
Health check for the moderation endpoint.
 
### GET /api/security-bench
 
Returns recent prompt-injection-bench leaderboard scores.
 
### POST /api/security-bench
 
Runs a benchmark against the current defense stack using `prompt-injection-bench`.
 
**Response (200):** `{ "detectionRate": 0.95, "totalAttacks": 100, "detected": 95 }`
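
The fields in this response are related by `detectionRate = detected / totalAttacks`. The sketch below shows that arithmetic and one possible way to flag a regression between two runs; the `summarize` and `hasRegressed` helpers and the 0.01 tolerance are hypothetical and are not part of `prompt-injection-bench`.

```typescript
interface BenchResult {
  detectionRate: number;
  totalAttacks: number;
  detected: number;
}

// Builds a result in the shape of the response above from raw counts.
function summarize(totalAttacks: number, detected: number): BenchResult {
  if (totalAttacks <= 0 || detected < 0 || detected > totalAttacks) {
    throw new RangeError("expected 0 <= detected <= totalAttacks and totalAttacks > 0");
  }
  return { detectionRate: detected / totalAttacks, totalAttacks, detected };
}

// Hypothetical regression rule: the rate dropped by more than `tolerance`.
function hasRegressed(previous: BenchResult, current: BenchResult, tolerance = 0.01): boolean {
  return current.detectionRate < previous.detectionRate - tolerance;
}
```

A small tolerance keeps week-to-week noise from triggering alerts, at the cost of missing very gradual drift.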
 
### GET /api/health
 
General health check.
 
**Response (200):** `{ "status": "ok", "service": "anthropic-prompt-injection-shield", "version": "0.1.0", "timestamp": "..." }`
 
## Configuration
 
The guardrail chain can be configured via environment variables (see `.env.example`) or by placing a `guardrail.config.yaml` file in the project root. The `@reaatech/guardrail-chain-config` package deep-merges the file config with environment variables; when both set the same key, the environment variable takes precedence.
 
Example `guardrail.config.yaml`:
 
```yaml
budget:
  maxLatencyMs: 1000
  maxTokens: 8000
  skipSlowGuardrailsUnderPressure: true
```
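
To illustrate what deep-merging with environment precedence means in practice, here is a minimal sketch. The `deepMerge` helper is a generic illustration and not the implementation inside `@reaatech/guardrail-chain-config`; the `envConfig` object stands in for values already parsed from environment variables.

```typescript
type Plain = { [key: string]: unknown };

function isPlainObject(value: unknown): value is Plain {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Recursively merges `override` into a copy of `base`; values in `override` win.
function deepMerge(base: Plain, override: Plain): Plain {
  const out: Plain = { ...base };
  for (const [key, value] of Object.entries(override)) {
    if (value === undefined || key === "__proto__") continue; // ignore unset and unsafe keys
    const existing = out[key];
    out[key] =
      isPlainObject(existing) && isPlainObject(value) ? deepMerge(existing, value) : value;
  }
  return out;
}

// File config, as in the YAML example above.
const fileConfig: Plain = {
  budget: { maxLatencyMs: 1000, maxTokens: 8000, skipSlowGuardrailsUnderPressure: true },
};
// Stand-in for a value parsed from GUARDRAIL_CHAIN_BUDGET_MAX_LATENCY_MS.
const envConfig: Plain = { budget: { maxLatencyMs: 2000 } };

const merged = deepMerge(fileConfig, envConfig);
```

Here `merged.budget` keeps `maxTokens` and `skipSlowGuardrailsUnderPressure` from the file, while `maxLatencyMs` takes the environment value of 2000.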
 
## Testing
 
```bash
pnpm test
```
 
Tests use MSW (Mock Service Worker) to intercept all HTTP calls, so no real network traffic occurs during testing. Coverage is measured only over `src/**/*.ts` and `app/**/route.ts`; UI components are excluded.
 
## Packages
 
| Package | Purpose |
|---|---|
| `@reaatech/guardrail-chain` | Guardrail chain orchestration, budget management, circuit breaker |
| `@reaatech/guardrail-chain-config` | Configuration loading from YAML/JSON/env |
| `@reaatech/guardrail-chain-observability` | Logging, metrics, and tracing interfaces |
| `prompt-injection-bench` | Benchmark engine and defense regression testing |
| `@presidio-dev/hai-guardrails` | PII redaction and heuristic injection detection |
| `langfuse` | LLM observability and audit logging |
| `@anthropic-ai/sdk` | Anthropic Claude API for content moderation |
| `zod` | Schema validation for config and request bodies |
 
## License
 
MIT — see [LICENSE](./LICENSE).