xAI Grok Secure Code Sandbox for SMB Data Pipelines
Empower SMBs to safely run AI-generated code on their data with cost controls and execution quarantine, preventing runaway bills and destructive operations.
Small businesses want to use LLMs to automate Excel transformations, CSV analysis, or generate reports, but blindly executing generated code risks data corruption, infinite loops, and unpredictable cloud costs.
This tutorial walks you through building a secure code execution sandbox powered by xAI’s Grok API. You’ll create a Next.js application with an Express companion server that takes natural-language prompts from users, generates Python code via Grok, validates and repairs the structured output, checks it against a security policy (blocked patterns, allowed libraries), enforces per-user budget limits, and executes approved code inside an E2B sandbox. Every step along the way is traced to Langfuse for observability. By the end, you’ll have both an Express REST API and a Next.js App Router API serving the same pipeline, with a full test suite hitting over 90% coverage.
Prerequisites
Node.js >= 22 with pnpm (v10+) installed
An xAI API key — set as XAI_API_KEY (get one from the xAI console)
An E2B API key — set as E2B_API_KEY (sign up at e2b.dev)
(Optional) Langfuse credentials for telemetry — skip this and the pipeline still works
Basic familiarity with TypeScript, Next.js App Router, and Express
Step 1: Create the project scaffold
Create a new Next.js project and install all dependencies. This recipe uses Next.js 16 (App Router), six @reaatech/* vendored packages, the E2B sandbox SDK, the Vercel AI SDK, and xlsx for spreadsheet processing.
Expected output: package.json lists all dependencies with exact versions (no ^ or ~ prefixes). Run pnpm install to generate the lockfile.
Step 2: Set up environment variables
Create .env.example with every key the application reads at runtime. The configuration module validates XAI_API_KEY and E2B_API_KEY as required and falls back to defaults for everything else.
cp .env.example .env.local
# Fill in your real API keys in .env.local and never commit them
Step 3: Create the typed configuration module
src/lib/config.ts wraps every environment variable in a typed config object. It throws immediately at import time if a required variable is missing — no silent failures at runtime.
Expected output: Importing config from anywhere in your app reads typed env vars. If you forget XAI_API_KEY, the module throws on the first import.
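The fail-fast behavior described above can be sketched roughly as follows. This is a minimal illustration, not the recipe's actual src/lib/config.ts: only XAI_API_KEY and E2B_API_KEY come from the tutorial, while GROK_MODEL, DEFAULT_BUDGET_USD, and their defaults are hypothetical names chosen for the example.

```typescript
// Sketch of a fail-fast config loader. Only XAI_API_KEY and E2B_API_KEY are
// taken from the tutorial; the optional keys and defaults are hypothetical.
interface AppConfig {
  xaiApiKey: string;
  e2bApiKey: string;
  grokModel: string; // hypothetical optional setting
  defaultBudgetUsd: number; // hypothetical optional setting
}

type Env = Record<string, string | undefined>;

// Return the variable's value or throw with a message naming the missing key.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

function loadConfig(env: Env): AppConfig {
  const budget = Number(env["DEFAULT_BUDGET_USD"] ?? "10");
  return {
    xaiApiKey: requireVar(env, "XAI_API_KEY"),
    e2bApiKey: requireVar(env, "E2B_API_KEY"),
    grokModel: env["GROK_MODEL"] ?? "grok-3",
    // Fall back to the default when the value is not a usable number.
    defaultBudgetUsd: Number.isFinite(budget) ? budget : 10,
  };
}
```

In a real module you would likely call loadConfig(process.env) once at the top level and export the result, so that a missing key throws on first import.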
Step 4: Wire up xAI Grok via the AI SDK
src/lib/llm.ts creates an OpenAI-compatible provider pointed at xAI’s API. The generateCode function calls Grok with a Zod schema so the response is always typed. This is the entry point for every code generation request.
Expected output: generateCode("write a python script to sum a column", MySchema) returns a typed object with token usage. A 401 from xAI surfaces as CodeGenerationError.
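To make the error path concrete, here is a sketch that talks to xAI's OpenAI-compatible chat completions endpoint over plain HTTP instead of going through the AI SDK, so it is not how src/lib/llm.ts is written. The FetchLike type, the requestCode function, and the fields on CodeGenerationError are assumptions made for the example; only the error's name and the grok-3 model name appear in the tutorial.

```typescript
// Minimal fetch-like signature so the sketch can be exercised with a fake.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// The tutorial names CodeGenerationError; this shape is an assumption.
class CodeGenerationError extends Error {
  readonly status: number | undefined;
  constructor(message: string, status?: number) {
    super(message);
    this.name = "CodeGenerationError";
    this.status = status;
  }
}

interface GenerationResult {
  text: string;
  totalTokens: number;
}

// Hypothetical helper: send one prompt, return the raw text plus token usage.
async function requestCode(
  prompt: string,
  apiKey: string,
  fetchImpl: FetchLike,
): Promise<GenerationResult> {
  const res = await fetchImpl("https://api.x.ai/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({
      model: "grok-3",
      messages: [
        { role: "system", content: "Reply with JSON: {code, language, description}." },
        { role: "user", content: prompt },
      ],
    }),
  });
  // Any non-2xx response (including a 401) becomes a typed error.
  if (!res.ok) {
    throw new CodeGenerationError(`xAI request failed with HTTP ${res.status}`, res.status);
  }
  const body = (await res.json()) as {
    choices?: { message?: { content?: string } }[];
    usage?: { total_tokens?: number };
  };
  const text = body.choices?.[0]?.message?.content;
  if (typeof text !== "string") {
    throw new CodeGenerationError("xAI response contained no message content");
  }
  return { text, totalTokens: body.usage?.total_tokens ?? 0 };
}
```

Passing the fetch implementation as a parameter keeps the function testable without network access, which is the same goal the test suite pursues with MSW.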
Step 5: Build the structured-output repair layer
LLMs don’t always emit valid JSON. The @reaatech/structured-repair-core package fixes common issues — stray markdown fences, trailing commas, unquoted keys. src/lib/repair.ts defines the expected CodeOutput shape and wraps repair logic.
ts
// src/lib/repair.ts
import { z } from "zod";
import {
  repair,
  repairOutput,
  isValid,
  analyzeInput,
  UnrepairableError,
} from "@reaatech/structured-repair-core";

export const CodeOutputSchema = z.object({
  code: z.string(),
  language: z.string(),
  description: z.string(),
});

export type CodeOutput = z.infer<typeof CodeOutputSchema>;

export function repairLlmOutput<T>(schema: z.ZodType<T>, raw: string): Promise<T> {
  const result = repairOutput({ schema, input: raw });
  if (result.success) return Promise.resolve(result.data as T);
  return Promise.reject(
    new UnrepairableError("All repair strategies exhausted", raw, result.steps),
  );
}

export function validateCodeOutput(o: unknown): o is CodeOutput {
  if (typeof o !== "object" || o === null) return false;
  return isValid(CodeOutputSchema, JSON.stringify(o));
}

export { analyzeInput, repair };
Expected output: repairLlmOutput(CodeOutputSchema, '```json\n{"code":"x","language":"python","description":"d"}\n```') strips the fences and returns a clean { code, language, description } object.
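For intuition about what a repair pass does, the sketch below tries progressively more aggressive fixes (strip markdown fences, then drop trailing commas) until the text parses into the expected shape. It is a simplified stand-in and does not reflect how @reaatech/structured-repair-core is implemented; naiveRepair and its helpers are hypothetical names.

```typescript
interface CodeOutput {
  code: string;
  language: string;
  description: string;
}

// Build the fence marker programmatically to keep this listing easy to embed.
const FENCE = "`".repeat(3);
const OPENING_FENCE = new RegExp("^" + FENCE + "[a-zA-Z]*\\s*");
const CLOSING_FENCE = new RegExp("\\s*" + FENCE + "$");

// Remove a leading fence (with optional language tag) and a trailing fence.
function stripFences(raw: string): string {
  return raw.trim().replace(OPENING_FENCE, "").replace(CLOSING_FENCE, "").trim();
}

// Drop commas that sit directly before a closing brace or bracket.
// Naive: it would also alter such sequences inside string values.
function dropTrailingCommas(text: string): string {
  return text.replace(/,\s*([}\]])/g, "$1");
}

function isCodeOutput(value: unknown): value is CodeOutput {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["code"] === "string" &&
    typeof v["language"] === "string" &&
    typeof v["description"] === "string"
  );
}

// Try each candidate in order and return the first one that parses and validates.
function naiveRepair(raw: string): CodeOutput {
  const unfenced = stripFences(raw);
  const candidates = [raw, unfenced, dropTrailingCommas(unfenced)];
  for (const candidate of candidates) {
    try {
      const parsed: unknown = JSON.parse(candidate);
      if (isCodeOutput(parsed)) return parsed;
    } catch {
      // Fall through to the next, more aggressive strategy.
    }
  }
  throw new Error("All repair strategies exhausted");
}
```

Ordering the strategies from least to most invasive means already-valid output passes through untouched.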
Step 6: Set up the sandbox service with a firewall
src/lib/sandbox.ts is the heart of the recipe. It wraps the E2B Sandbox SDK and implements two middleware classes — CodeInspectionMiddleware (checks imports and blocked patterns) and BudgetGateMiddleware (checks budget before execution). The SandboxService writes generated code to /tmp/script.py inside the sandbox and runs it with python.
ts
// src/lib/sandbox.ts
import Sandbox from "e2b";
import {
  FirewallError,
  PolicyViolationError,
  BudgetExceededError,
  Logger,
  redact,
  safeRegExp,
  createRequestContext,
} from "@reaatech/tool-use-firewall-core";
import type {
  Middleware,
  MiddlewareResult,
  RequestContext,
} from "@reaatech/tool-use-firewall-core";
import { BudgetService } from "./budget";

export interface SandboxPolicy {
  allowedLibraries: string[];
  maxExecutionTimeMs: number;
  blockedPatterns: string[];
}

export interface ExecutionResult {
  stdout: string;
  stderr: string;
  exitCode: number;
  executionTimeMs: number;
}

export class CodeInspectionMiddleware implements Middleware {
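The kind of check CodeInspectionMiddleware performs can be approximated with a line-based scan like the one below. This is a sketch under assumptions, not the recipe's middleware: inspectCode and extractImports are hypothetical names, and blocked patterns are treated here as literal substrings rather than regular expressions.

```typescript
interface SandboxPolicy {
  allowedLibraries: string[];
  maxExecutionTimeMs: number;
  blockedPatterns: string[];
}

interface InspectionResult {
  allowed: boolean;
  violations: string[];
}

// Collect top-level module names from `import x` and `from x import y` lines.
function extractImports(pythonSource: string): string[] {
  const modules = new Set<string>();
  for (const line of pythonSource.split("\n")) {
    const fromMatch = /^\s*from\s+([A-Za-z_][\w.]*)\s+import\b/.exec(line);
    if (fromMatch) {
      modules.add(fromMatch[1]!.split(".")[0]!);
      continue;
    }
    const importMatch = /^\s*import\s+(.+)$/.exec(line);
    if (importMatch) {
      // Ignore anything after a statement separator or a comment marker.
      const importList = importMatch[1]!.split(/[;#]/)[0]!;
      for (const part of importList.split(",")) {
        const name = part.trim().split(/\s+as\s+/)[0]!.split(".")[0]!.trim();
        if (name) modules.add(name);
      }
    }
  }
  return Array.from(modules);
}

// Report every disallowed import and every blocked substring found in the source.
function inspectCode(source: string, policy: SandboxPolicy): InspectionResult {
  const violations: string[] = [];
  for (const mod of extractImports(source)) {
    if (!policy.allowedLibraries.includes(mod)) {
      violations.push(`import not allowed: ${mod}`);
    }
  }
  for (const pattern of policy.blockedPatterns) {
    if (source.includes(pattern)) {
      violations.push(`blocked pattern: ${pattern}`);
    }
  }
  return { allowed: violations.length === 0, violations };
}
```

A textual scan like this can be evaded (for example through dynamic imports), so it works best as a first filter in front of the sandbox's isolation rather than as a replacement for it.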
Step 7: Define the security policy
policies/default.yaml lists the libraries SMB data pipelines typically need (pandas, numpy, openpyxl, csv, json) and the patterns that are never allowed (subprocess, os.system, file writes outside /tmp). The policy loader in src/lib/policy-loader.ts parses this file and extracts a machine-readable SandboxPolicy.
The loader wraps @reaatech/tool-use-firewall-config:
ts
// src/lib/policy-loader.ts
import { loadPolicyConfig, validatePolicyFile } from "@reaatech/tool-use-firewall-config";
import type { PolicyConfig } from "@reaatech/tool-use-firewall-config";
import { readFileSync } from "node:fs";
import type { SandboxPolicy } from "./sandbox";

export interface LoadedPolicy {
  config: PolicyConfig;
  sandbox: SandboxPolicy;
}

export function loadPolicy(path: string): LoadedPolicy {
  const config = loadPolicyConfig(path);
  const content = readFileSync(path, "utf-8");
  const sandbox = extractSandboxPolicy(content);
  return { config, sandbox };
}

function extractSandboxPolicy(yaml: string): SandboxPolicy {
  const allowedLibraries: string[] = [];
  const blockedPatterns: string[] = [];
  let maxExecutionTimeMs = 30000;
  let section: "libs" | "blocks" | null = null;

  for (const line of yaml.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "allowed_libraries:") {
      section = "libs";
    } else if (trimmed === "blocked_patterns:") {
      section = "blocks";
    } else if (trimmed.startsWith("max_execution_time_ms:")) {
      const parts = trimmed.split(":");
      if (parts.length >= 2) {
        const val = parseInt(parts[1].trim(), 10);
        if (!isNaN(val)) maxExecutionTimeMs = val;
      }
      section = null;
    } else if (trimmed.startsWith("- ") && section) {
      const value = trimmed.slice(2);
      if (section === "libs") {
        allowedLibraries.push(value);
      } else {
        blockedPatterns.push(value);
      }
    } else if (trimmed.includes(":") && !trimmed.startsWith("-")) {
      section = null;
    }
  }

  return { allowedLibraries, maxExecutionTimeMs, blockedPatterns };
}

export function validatePolicy(
  path: string,
): { valid: boolean; errors: string[]; warnings: string[] } {
  return validatePolicyFile(path);
}
Expected output: loadPolicy("policies/default.yaml") returns an object with config (full firewall config) and sandbox (the extracted SandboxPolicy with 12 allowed libraries and 4 blocked patterns).
Step 8: Implement budget enforcement
src/lib/budget.ts wraps @reaatech/agent-budget-engine to define per-user spending limits, check whether a generation is within budget, record spend after execution, and reset budgets. It subscribes to threshold-breach and hard-stop events and logs them to your telemetry service.
Expected output: After calling defineUserBudget("alice", 10), checkBudget("alice", "grok-3", 4000) returns { allowed: true, action: "Allow" } when the user’s spend is still under the limit.
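The check-then-record flow can be illustrated with a small in-memory stand-in. This does not use @reaatech/agent-budget-engine and is not the recipe's BudgetService: the InMemoryBudget class, the flat per-token rate, and the "HardStop" action value are assumptions for the example, while the method names and the { allowed, action: "Allow" } shape follow the text above.

```typescript
type BudgetAction = "Allow" | "HardStop"; // "HardStop" is an assumed value

interface BudgetDecision {
  allowed: boolean;
  action: BudgetAction;
}

// Hypothetical flat rate in USD per 1,000 tokens; real pricing varies by model.
const USD_PER_1K_TOKENS: Record<string, number> = { "grok-3": 0.01 };

class InMemoryBudget {
  private readonly limits = new Map<string, number>();
  private readonly spent = new Map<string, number>();

  // Set (or reset) a user's limit and clear their recorded spend.
  defineUserBudget(userId: string, limitUsd: number): void {
    this.limits.set(userId, limitUsd);
    this.spent.set(userId, 0);
  }

  estimateCost(model: string, tokens: number): number {
    const rate = USD_PER_1K_TOKENS[model];
    if (rate === undefined) throw new Error(`No rate configured for ${model}`);
    return (tokens / 1000) * rate;
  }

  // Allow only if current spend plus the estimated cost stays within the limit.
  checkBudget(userId: string, model: string, estimatedTokens: number): BudgetDecision {
    const limit = this.limits.get(userId);
    if (limit === undefined) return { allowed: false, action: "HardStop" };
    const projected = (this.spent.get(userId) ?? 0) + this.estimateCost(model, estimatedTokens);
    return projected <= limit
      ? { allowed: true, action: "Allow" }
      : { allowed: false, action: "HardStop" };
  }

  recordSpend(userId: string, costUsd: number): void {
    this.spent.set(userId, (this.spent.get(userId) ?? 0) + costUsd);
  }
}
```

Checking the projected spend before execution, rather than the current spend, is what stops a single expensive request from pushing a user past the limit.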
Step 9: Add approval workflow, telemetry, and spreadsheet utilities
Three smaller modules complete the library layer:
src/lib/approval.ts — an in-memory store for pending, approved, and rejected execution requests. When the preflight check returns APPROVAL_REQUIRED, the pipeline creates an ApprovalRequest and returns early with its ID.
ts
// src/lib/approval.ts
import { ApprovalRequiredError } from "@reaatech/tool-use-firewall-core";

export interface ApprovalRequest {
  id: string;
  code: string;
  userId: string;
  reason: string;
  riskLevel: "low" | "medium" | "high";
  createdAt: Date;
}

export class ApprovalStore {
  private pending: Map<string, ApprovalRequest> = new Map();
  private rejected: Map<string, { request: ApprovalRequest; reason: string }> = new Map();
  private approved: Set<string> = new Set();

  create(request: Omit<ApprovalRequest, "id" | "createdAt">): ApprovalRequest {
    const full: ApprovalRequest = {
      ...request,
      id: crypto.randomUUID(),
      createdAt: new Date(),
    };
    this.pending.set(full.id, full);
    return full;
  }

  approve(id: string): void {
    const request = this.pending.get(id);
    if (!request) {
      throw new Error(`Approval request ${id} not found or already processed`);
    }
    this.pending.delete(id);
    this.approved.add(id);
  }

  reject(id: string, reason: string): void {
    const request = this.pending.get(id);
    if (!request) {
      throw new Error(`Approval request ${id} not found or already processed`);
    }
    this.pending.delete(id);
    this.rejected.set(id, { request, reason });
  }

  listPending(): ApprovalRequest[] {
    return Array.from(this.pending.values());
  }

  getRequest(id: string): ApprovalRequest | undefined {
    return this.pending.get(id);
  }
}

export { ApprovalRequiredError };
src/lib/telemetry.ts — a singleton wrapping the Langfuse SDK. Every code generation and sandbox execution creates a Langfuse trace so you can audit what was generated, what the result was, and how much it cost.
src/lib/xlsx-utils.ts — wraps the xlsx npm package for round-tripping spreadsheet data. The pipeline uses this to parse uploaded Excel files before feeding the rows into a code-generation prompt.
Expected output: transformAndExport(xlsxBuffer, rows => rows.filter(r => (r.value as number) > 10)) returns a new XLSX buffer with only the rows that match the filter.
Step 10: Build the pipeline orchestrator
src/services/pipeline.ts ties everything together. It takes a user prompt (and an optional Excel file), enriches the prompt with data rows, generates code via Grok, repairs the structured output, runs the preflight check, checks the budget, executes in the E2B sandbox, records spend, and traces the result. If a preflight check requires human approval, it returns early with the approval ID instead of executing.
ts
// src/services/pipeline.ts
import { CodeOutputSchema, repairLlmOutput } from "../lib/repair";
import { generateCode } from "../lib/llm";
import { loadPolicy } from "../lib/policy-loader";
import { type SandboxService } from "../lib/sandbox";
import { type BudgetService } from "../lib/budget";
import { type TelemetryService } from "../lib/telemetry";
import { type ApprovalStore } from "../lib/approval";
import { readWorkbook, sheetToJson } from "../lib/xlsx-utils";
import { PolicyViolationError, BudgetExceededError } from "@reaatech/tool-use-firewall-core";
Expected output: runCodePipeline({ prompt: "print the numbers 1 to 5", userId: "alice" }, deps) returns a PipelineOutput with the generated code, sandbox stdout, and cost. A call with prompt: "import subprocess; subprocess.run" throws PolicyViolationError.
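The ordering of the stages, including the early return for approvals, can be sketched with injected dependencies as below. This is an illustration of the control flow only: runPipelineSketch, PipelineDeps, and the result shapes are hypothetical and simpler than the recipe's runCodePipeline, and telemetry and spreadsheet enrichment are left out.

```typescript
interface CodeOutput {
  code: string;
  language: string;
  description: string;
}

type Preflight = "ALLOW" | "APPROVAL_REQUIRED";

// Each stage is injected so the flow can be exercised without real services.
interface PipelineDeps {
  generate(prompt: string): Promise<{ raw: string; tokens: number }>;
  repair(raw: string): CodeOutput;
  preflight(code: string): Preflight; // expected to throw on a policy violation
  checkBudget(userId: string, tokens: number): boolean;
  execute(code: string): Promise<{ stdout: string; exitCode: number }>;
  recordSpend(userId: string, tokens: number): void;
  requestApproval(userId: string, code: string): string; // returns an approval ID
}

type PipelineResult =
  | { status: "success"; code: string; stdout: string }
  | { status: "approval_required"; approvalId: string };

async function runPipelineSketch(
  input: { prompt: string; userId: string },
  deps: PipelineDeps,
): Promise<PipelineResult> {
  const generated = await deps.generate(input.prompt);
  const output = deps.repair(generated.raw);

  // Stop before spending budget or touching the sandbox if a human must approve.
  if (deps.preflight(output.code) === "APPROVAL_REQUIRED") {
    const approvalId = deps.requestApproval(input.userId, output.code);
    return { status: "approval_required", approvalId };
  }
  if (!deps.checkBudget(input.userId, generated.tokens)) {
    throw new Error(`Budget exceeded for ${input.userId}`);
  }

  const result = await deps.execute(output.code);
  // Spend is recorded only after the sandbox run has completed.
  deps.recordSpend(input.userId, generated.tokens);
  return { status: "success", code: output.code, stdout: result.stdout };
}
```

Injecting the stages this way is also what makes it practical to test the orchestration with fakes instead of live xAI and E2B calls.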
Step 11: Create the Express server and Next.js App Router routes
The recipe exposes the pipeline through two surfaces: an Express server at src/server.ts and a Next.js App Router API at app/api/. Both wire the same runCodePipeline function.
Express server:
ts
// src/server.ts
import express from "express";
import { config } from "./lib/config";
import { runCodePipeline } from "./services/pipeline";
import { SandboxService } from "./lib/sandbox";
import { BudgetService } from "./lib/budget";
import { TelemetryService } from "./lib/telemetry";
import { ApprovalStore } from "./lib/approval";
import { PolicyViolationError, BudgetExceededError } from "@reaatech/tool-use-firewall-core";

const app = express();
app.use(express.json());
app.use
Next.js App Router routes — three files in app/api/, one per resource. Each uses NextRequest and NextResponse:
Expected output: curl -X POST http://localhost:3000/api/code -H 'Content-Type: application/json' -d '{"prompt":"print hello","userId":"alice"}' returns JSON with status: "success", the generated code, sandbox stdout, and the cost.
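Because both surfaces call the same pipeline, it can help to translate pipeline errors into HTTP responses in one shared helper. The sketch below shows one way to do that; the status codes, the response body shape, and the toHttpError name are assumptions, and the two error classes are local stand-ins for the ones exported by @reaatech/tool-use-firewall-core.

```typescript
// Local stand-ins for the firewall package's error classes; the real
// constructors and fields may differ.
class PolicyViolationError extends Error {}
class BudgetExceededError extends Error {}

interface HttpError {
  status: number;
  body: { status: "error"; error: string; message: string };
}

// Map a thrown value to a response. The status codes are an assumption.
function toHttpError(err: unknown): HttpError {
  if (err instanceof PolicyViolationError) {
    return {
      status: 403,
      body: { status: "error", error: "policy_violation", message: err.message },
    };
  }
  if (err instanceof BudgetExceededError) {
    return {
      status: 402,
      body: { status: "error", error: "budget_exceeded", message: err.message },
    };
  }
  // Avoid leaking internal details for anything unexpected.
  return {
    status: 500,
    body: { status: "error", error: "internal_error", message: "Unexpected server error" },
  };
}
```

An Express handler could pass the result to res.status(...).json(...), and an App Router handler could pass it to NextResponse.json(body, { status }), so both surfaces report failures identically.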
Step 12: Write tests and verify
The test suite covers every module, the pipeline orchestration, and all API routes. It uses MSW to mock the xAI and Langfuse endpoints and vi.mock("e2b") to avoid real sandbox calls. Create tests/setup.ts to configure MSW and environment variables:
pnpm vitest run --coverage --reporter=json --outputFile=vitest-report.json
Expected output: All 117 tests pass with 0 failures. The coverage report shows all four metrics (lines, branches, functions, statements) at 90% or above.
Verify types and lint:
terminal
pnpm typecheck # zero errors
pnpm lint # zero warnings
Next steps
Add persistence: Replace the in-memory ApprovalStore with a PostgreSQL-backed implementation so approvals survive restarts.
Extend the policy YAML: Add more granular rules — for example, allow requests but block requests.get("http://internal-admin").
Build a dashboard UI: Create a React page that calls the Next.js API routes and displays execution history with real-time polling.
SSO integration: Tie budget scopes to authenticated user sessions rather than requiring userId in every API call.