# Anthropic Code Sandbox for SMB Data Cleansing Pipelines
 
> Safely run LLM-generated data transformation code in an isolated sandbox, with cost tracking and automatic output repair.
 
A tutorial-style reference solution from [reaatech.com](https://reaatech.com) that demonstrates how to build production-grade AI systems with the `@reaatech/*` package family.
 
## Problem
 
Small businesses often need to clean and transform CSV, JSON, or database exports but lack the infrastructure to safely execute LLM-generated code. Running it directly risks data corruption, runaway costs, or exposure of sensitive records.
 
This recipe wraps Anthropic's Claude API with four REAA packages: `structured-repair-core` repairs malformed code or data outputs, `confidence-router` selects the cleansing strategy based on the shape of the data, `llm-cost-telemetry` enforces per-job budgets, and `idempotency-middleware` makes retries safe. Generated code runs inside an E2B sandbox, so real data never leaves the sandbox boundary, and the router adapts to CSV, JSON, or SQL dump inputs.
 
## Pipeline flow
 
```
POST /api/jobs (CSV/JSON/SQL data)
  → 1. Format classifier (Anthropic Claude + confidence-router)
  → 2. Code generator (Anthropic Claude → structured-repair-core validates JSON)
  → 3. Sandbox executor (E2B code-interpreter runs transformation)
  → 4. Result validator (structured-repair-core repairs malformed output)
  → 5. Return cleaned data
```
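
The flow above can be read as a chain of four asynchronous stages. The following is a minimal sketch of that orchestration, assuming each stage is injected as a function. The names `Stages`, `runPipeline`, and `stubStages` are hypothetical and do not come from this repository; the stubs stand in for the Claude and E2B calls so the sketch runs offline.

```typescript
// Hypothetical shapes for illustration; not the repository's actual types.
type Format = "csv" | "json" | "sql";

interface Job {
  inputData: string;
  instructions: string;
}

interface Stages {
  classify: (job: Job) => Promise<Format>; // step 1
  generateCode: (job: Job, format: Format) => Promise<string>; // step 2
  execute: (code: string, data: string) => Promise<string>; // step 3
  validate: (raw: string, format: Format) => Promise<string>; // step 4
}

interface PipelineResult {
  cleanedData: string;
  originalFormat: Format;
  transformationsApplied: string[];
}

// Runs the stages in order and returns the cleaned data (step 5).
async function runPipeline(job: Job, stages: Stages): Promise<PipelineResult> {
  const format = await stages.classify(job);
  const code = await stages.generateCode(job, format);
  const raw = await stages.execute(code, job.inputData);
  const cleanedData = await stages.validate(raw, format);
  return { cleanedData, originalFormat: format, transformationsApplied: [code] };
}

// Offline stubs: a real setup would call Claude and the E2B sandbox here.
const stubStages: Stages = {
  classify: async () => "csv",
  generateCode: async () => "rows => rows",
  execute: async (_code, data) => data.toUpperCase(),
  validate: async (raw) => raw.trim(),
};
```

Injecting the stages keeps the orchestration testable without API keys, which is one reasonable way to structure it; the repository may organize this differently.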
 
## Quick Start
 
```bash
# Install dependencies
pnpm install
 
# Set environment variables
export ANTHROPIC_API_KEY=<your-anthropic-key>
export E2B_API_KEY=<your-e2b-api-key>
 
# Start the dev server
pnpm dev
```
 
### Example request
 
```bash
curl -X POST http://localhost:3000/api/jobs \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: my-unique-key" \
  -d '{
    "inputType": "csv",
    "inputData": "name,age\nalice,30\nBOB,25",
    "instructions": "Normalize names to title case"
  }'
```
 
### Expected response
 
```json
{
  "id": "uuid-here",
  "cleanedData": "name,age\nAlice,30\nBob,25",
  "originalFormat": "csv",
  "transformationsApplied": ["function clean..."],
  "executionTimeMs": 1234,
  "tokensUsed": { "input": 150, "output": 75 }
}
```
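
A response like this only parses cleanly if the model's output is well-formed, which is why the pipeline includes a repair step. The sketch below hand-rolls two of the repairs this README attributes to `@reaatech/structured-repair-core` (fence stripping and type coercion). It is an illustration of the idea, not the package's API: `stripFences`, `toNumber`, and `repairResult` are hypothetical names.

```typescript
// Built with repeat() so this example does not embed a literal code fence.
const FENCE = "`".repeat(3);

// Removes a surrounding Markdown code fence if the model added one.
function stripFences(text: string): string {
  let body = text.trim();
  if (body.startsWith(FENCE)) {
    const firstNewline = body.indexOf("\n");
    // Drop the opening fence line (including any language tag).
    body = firstNewline === -1 ? body.slice(FENCE.length) : body.slice(firstNewline + 1);
  }
  if (body.endsWith(FENCE)) body = body.slice(0, -FENCE.length);
  return body.trim();
}

// Coerces numeric strings such as "1234" to numbers; rejects everything else.
function toNumber(value: unknown): number {
  const n = typeof value === "string" && value.trim() !== "" ? Number(value) : value;
  if (typeof n !== "number" || !Number.isFinite(n)) {
    throw new Error("expected a finite number");
  }
  return n;
}

interface RepairedResult {
  cleanedData: string;
  executionTimeMs: number;
}

// Parses possibly fenced JSON and coerces the fields checked here.
function repairResult(raw: string): RepairedResult {
  const parsed: unknown = JSON.parse(stripFences(raw));
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("expected a JSON object");
  }
  const obj = parsed as Record<string, unknown>;
  const cleanedData = obj["cleanedData"];
  if (typeof cleanedData !== "string") {
    throw new Error("cleanedData must be a string");
  }
  return { cleanedData, executionTimeMs: toNumber(obj["executionTimeMs"]) };
}
```

Truncation repair and fuzzy key matching are left out here because they need a schema to repair against.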
 
## Packages used
 
### REAA packages
 
- **`@reaatech/structured-repair-core`** — Zod schema-driven repair engine. Repairs malformed JSON from Claude responses (fences, truncation, type coercion, fuzzy key matching, extra field removal).
- **`@reaatech/confidence-router`** — Threshold-based decision engine. Routes to the correct format path (CSV/JSON/SQL) when Claude's confidence is high, clarifies when ambiguous, falls back when uncertain.
- **`@reaatech/llm-cost-telemetry`** — Cost tracking and budget enforcement. Tracks token spend per request and enforces per-job budget limits via `loadConfig()` and `BudgetConfig`.
- **`@reaatech/idempotency-middleware`** — Idempotency via `Idempotency-Key` header. Caches pipeline responses so duplicate requests don't re-execute the transformation. Uses in-memory `MemoryAdapter` with TTL expiry.
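
The three-way decision described for the confidence router (route, clarify, fall back) can be sketched as a pair of thresholds. This is a plain TypeScript illustration rather than the `@reaatech/confidence-router` API; the `decide` function and both threshold values are assumptions chosen for the example.

```typescript
type DataFormat = "csv" | "json" | "sql";

interface Classification {
  format: DataFormat;
  confidence: number; // expected in the range 0..1
}

type Decision =
  | { action: "route"; format: DataFormat }
  | { action: "clarify"; candidate: DataFormat }
  | { action: "fallback" };

// Illustrative thresholds only; the package's defaults may differ.
const ROUTE_THRESHOLD = 0.8;
const FALLBACK_THRESHOLD = 0.4;

function decide(c: Classification): Decision {
  // High confidence: take the format-specific path.
  if (c.confidence >= ROUTE_THRESHOLD) {
    return { action: "route", format: c.format };
  }
  // Middle band: ambiguous, so ask for clarification.
  if (c.confidence >= FALLBACK_THRESHOLD) {
    return { action: "clarify", candidate: c.format };
  }
  // Low confidence: use the generic fallback path.
  return { action: "fallback" };
}
```

A discriminated union for the result lets callers `switch` on `action` and have the compiler check that all three outcomes are handled.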
 
### Third-party packages
 
- **`@anthropic-ai/sdk`** — Claude API client for format classification and code generation.
- **`@e2b/code-interpreter`** — Secure sandbox for executing LLM-generated code against real data.
- **`zod`** — Runtime schema validation for all request/response shapes.
- **`dotenv`** — Environment variable loading.
 
## Idempotency
 
Send the `Idempotency-Key` header with a unique value for each transformation request. If the same key and body are sent again within the cache TTL (24 hours), the cached response is returned without re-executing the pipeline. The cache key includes a SHA-256 hash of the request body, so reusing a key with a different body produces a different cache key and is treated as a new request.
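
To make the key-plus-body-hash behavior concrete, here is a minimal in-memory sketch. It assumes the cache key is the header value joined with a hex SHA-256 digest of the raw body. `IdempotencyCacheSketch` is a hypothetical class written for this example and is not the `MemoryAdapter` shipped with `@reaatech/idempotency-middleware`, whose key format may differ.

```typescript
import { createHash } from "node:crypto";

const TTL_MS = 24 * 60 * 60 * 1000; // 24 hours, as stated above

interface CacheEntry {
  response: string;
  expiresAt: number; // epoch milliseconds
}

class IdempotencyCacheSketch {
  private readonly entries = new Map<string, CacheEntry>();

  // Combines the header value with a SHA-256 digest of the request body.
  static cacheKey(idempotencyKey: string, body: string): string {
    const bodyHash = createHash("sha256").update(body).digest("hex");
    return `${idempotencyKey}:${bodyHash}`;
  }

  // Returns the cached response, or undefined on a miss or after expiry.
  get(idempotencyKey: string, body: string, now: number = Date.now()): string | undefined {
    const key = IdempotencyCacheSketch.cacheKey(idempotencyKey, body);
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.response;
  }

  set(idempotencyKey: string, body: string, response: string, now: number = Date.now()): void {
    const key = IdempotencyCacheSketch.cacheKey(idempotencyKey, body);
    this.entries.set(key, { response, expiresAt: now + TTL_MS });
  }
}
```

The `now` parameter exists only so expiry can be exercised deterministically in tests.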
 
## Budget enforcement
 
The `CostTracker` loads budget limits from `loadConfig()`, which reads `DEFAULT_DAILY_BUDGET` from the environment. Before each LLM call, `enforceBudgetOrThrow()` compares the accumulated session cost with the daily budget and throws `BudgetExceededError` if the budget has been exceeded.
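
The check-before-call pattern can be sketched in a few lines. The code below is a standalone illustration and not the `@reaatech/llm-cost-telemetry` implementation: `CostTrackerSketch`, `BudgetExceededSketch`, `parseDailyBudget`, the USD unit, and the per-million-token prices are all assumptions made for the example.

```typescript
// Hypothetical pricing in USD per million tokens; substitute real rates.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

class BudgetExceededSketch extends Error {
  readonly spentUsd: number;
  readonly limitUsd: number;

  constructor(spentUsd: number, limitUsd: number) {
    super(`Budget exceeded: spent ${spentUsd} of ${limitUsd} USD`);
    this.name = "BudgetExceededSketch";
    this.spentUsd = spentUsd;
    this.limitUsd = limitUsd;
  }
}

// Parses a value such as process.env.DEFAULT_DAILY_BUDGET, with a fallback.
function parseDailyBudget(raw: string | undefined, fallbackUsd: number): number {
  if (raw === undefined || raw.trim() === "") return fallbackUsd;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallbackUsd;
}

class CostTrackerSketch {
  private spentUsd = 0;
  private readonly dailyBudgetUsd: number;
  private readonly pricing: Pricing;

  constructor(dailyBudgetUsd: number, pricing: Pricing) {
    this.dailyBudgetUsd = dailyBudgetUsd;
    this.pricing = pricing;
  }

  // Adds the cost of one completed LLM call to the session total.
  record(inputTokens: number, outputTokens: number): void {
    this.spentUsd +=
      (inputTokens * this.pricing.inputPerMTok + outputTokens * this.pricing.outputPerMTok) /
      1_000_000;
  }

  // Call before each LLM request; throws once the session total is over budget.
  enforceBudgetOrThrow(): void {
    if (this.spentUsd > this.dailyBudgetUsd) {
      throw new BudgetExceededSketch(this.spentUsd, this.dailyBudgetUsd);
    }
  }
}
```

Because the check runs before a call rather than after it, a single request can still push spend past the limit; the next call is the one that gets rejected.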
 
## Project layout
 
```
app/                  Next.js App Router pages + API routes
src/                  services, lib, adapters
tests/                vitest suite (mirrors src/)
packages/             API references for every dependency (read these first)
DEV_PLAN.md           build plan for this recipe
```
 
## License
 
MIT — see [LICENSE](./LICENSE).