# vLLM AI Spend Control for SMB Agent Workflows
 
This recipe routes every vLLM call through a cost interceptor. The interceptor passes token counts to `@reaatech/agent-budget-spend-tracker`, which accumulates spend using the `@reaatech/agent-budget-pricing` mappings for open-source models, while `@reaatech/llm-cost-telemetry-calculator` converts token usage into dollar amounts. `@reaatech/agent-budget-engine` enforces soft and hard caps per agent or tenant, and cost telemetry is exported to Langfuse and Helicone for real-time observability.
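
To make the token-to-dollar step concrete, here is a minimal sketch of the arithmetic. It does not call the `@reaatech/*` packages; the `ModelPrice` and `TokenUsage` types, the `costUsd` function, the `example-model` name, and the per-million-token prices are all hypothetical placeholders chosen for illustration.

```typescript
// Hypothetical price entry: USD per one million tokens.
// The values are placeholders, not data from @reaatech/agent-budget-pricing.
interface ModelPrice {
  inputPerMillion: number;
  outputPerMillion: number;
}

interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
}

// Illustrative pricing table keyed by model name.
const PRICES: Record<string, ModelPrice> = {
  "example-model": { inputPerMillion: 0.5, outputPerMillion: 1.5 },
};

// Convert the token counts of a single call into a dollar amount.
function costUsd(model: string, usage: TokenUsage): number {
  const price = PRICES[model];
  if (price === undefined) {
    throw new Error(`No price configured for model "${model}"`);
  }
  return (
    (usage.promptTokens / 1_000_000) * price.inputPerMillion +
    (usage.completionTokens / 1_000_000) * price.outputPerMillion
  );
}
```

Failing loudly on an unknown model is a design choice in this sketch: silently pricing a call at zero would let spend go untracked.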
 
## Architecture
 
The system is organized around a single interception point:
 
```mermaid
graph LR
   A[Client] --> B[POST /api/chat]
   B --> C[Cost Interceptor]
   C --> D["vLLM API<br/>(@ai-sdk/openai-compatible)"]
   C --> E[BudgetController]
   E --> F[SpendStore]
   C --> G[TelemetryService]
   G --> H[Langfuse]
   G --> I[Helicone]
```
 
- **Cost Interceptor** (`src/interceptors/cost.interceptor.ts`) wraps every vLLM API call (via `@ai-sdk/openai-compatible`), checks budgets via `BudgetController`, records spend via `SpendStore`, and emits cost traces to Langfuse + Helicone via `TelemetryService`.
- **BudgetController** (`src/modules/budget/budget.service.ts`) enforces soft-cap warnings and hard-cap rejections per scope using `@reaatech/agent-budget-engine`.
- **SpendStore** (`src/modules/budget/spend-store.service.ts`) accumulates token counts and converts them to dollar amounts via `@reaatech/llm-cost-telemetry-calculator` with pricing data from `@reaatech/agent-budget-pricing`.
- **TelemetryService** (`src/modules/telemetry/telemetry.service.ts`) fans out cost events to the Langfuse (`src/modules/telemetry/langfuse.service.ts`) and Helicone (`src/modules/telemetry/helicone.service.ts`) backends.
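
The flow described by the components above can be sketched as a single wrapper function. This is an illustration under stated assumptions, not the contents of `cost.interceptor.ts`: `InterceptorDeps`, `withCostInterception`, and the `"ok" | "soft" | "hard"` status values are hypothetical names, and the real services may expose different interfaces.

```typescript
// All names here are hypothetical stand-ins for the recipe's services.
interface Usage {
  promptTokens: number;
  completionTokens: number;
}

interface ModelResult {
  text: string;
  usage: Usage;
}

interface CostEvent {
  scope: string;
  usd: number;
  warned: boolean;
}

interface InterceptorDeps {
  // "ok": under the soft cap, "soft": warn but allow, "hard": reject.
  checkBudget(scope: string): "ok" | "soft" | "hard";
  recordSpend(scope: string, usd: number): void;
  emitTelemetry(event: CostEvent): void;
  costOf(usage: Usage): number;
}

// Wrap one model call: check the budget, call the model, record the spend,
// then emit a cost event. The model is never called once the hard cap is hit.
async function withCostInterception(
  scope: string,
  callModel: () => Promise<ModelResult>,
  deps: InterceptorDeps,
): Promise<ModelResult> {
  const status = deps.checkBudget(scope);
  if (status === "hard") {
    throw new Error(`Hard cap reached for scope "${scope}"`);
  }

  const result = await callModel();

  const usd = deps.costOf(result.usage);
  deps.recordSpend(scope, usd);
  deps.emitTelemetry({ scope, usd, warned: status === "soft" });
  return result;
}
```

Checking the budget before the call means a request that crosses the cap is still served and billed; only the following request is rejected. A stricter variant would estimate the cost up front, at the price of needing a token estimate before the model responds.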
 
## Quick Start
 
```bash
cp .env.example .env.local
# Edit .env.local — set VLLM_BASE_URL to your vLLM server endpoint
pnpm install
pnpm dev
```
 
Send a test request:
 
```bash
curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "scope": "tenant-1"
  }'
```
 
## Environment Variables
 
| Variable | Description |
|---|---|
| `VLLM_BASE_URL` | Base URL of the vLLM OpenAI-compatible API |
| `VLLM_MODEL` | Default model name used in requests |
| `LANGFUSE_PUBLIC_KEY` | Langfuse project public key |
| `LANGFUSE_SECRET_KEY` | Langfuse project secret key |
| `LANGFUSE_BASE_URL` | Langfuse API base URL |
| `HELICONE_API_KEY` | Helicone API key for usage telemetry |
| `DATABASE_URL` | PostgreSQL connection string for spend persistence |
| `BUDGET_DEFAULT_LIMIT` | Default dollar limit for any scope without an explicit budget |
| `BUDGET_SOFT_CAP` | Fraction of the limit that triggers a soft-cap warning (e.g., 0.8) |
| `BUDGET_HARD_CAP` | Fraction of the limit that triggers a hard-cap rejection (e.g., 1.0) |
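
As an illustration of how the three budget variables fit together, the sketch below parses and validates them. The `BudgetConfig` type and `loadBudgetConfig` function are hypothetical, and the validation rules (positive numbers, soft cap not above hard cap) are assumptions of this sketch; the recipe itself may apply defaults or different checks.

```typescript
// Hypothetical parsed form of the BUDGET_* variables.
interface BudgetConfig {
  defaultLimit: number; // dollars
  softCap: number; // fraction of the limit
  hardCap: number; // fraction of the limit
}

// Parse and validate the budget variables from an env-like object.
// Assumed rules: each value is a positive number and softCap <= hardCap.
function loadBudgetConfig(
  env: Record<string, string | undefined>,
): BudgetConfig {
  const read = (name: string): number => {
    const raw = env[name];
    const value = Number(raw);
    if (raw === undefined || !isFinite(value) || value <= 0) {
      throw new Error(`${name} must be a positive number, got "${raw}"`);
    }
    return value;
  };

  const config: BudgetConfig = {
    defaultLimit: read("BUDGET_DEFAULT_LIMIT"),
    softCap: read("BUDGET_SOFT_CAP"),
    hardCap: read("BUDGET_HARD_CAP"),
  };
  if (config.softCap > config.hardCap) {
    throw new Error("BUDGET_SOFT_CAP must not exceed BUDGET_HARD_CAP");
  }
  return config;
}
```

In a Node.js process this would typically be called once at startup as `loadBudgetConfig(process.env)`, so a misconfigured cap fails fast instead of surfacing on the first request.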
 
## API Reference
 
### `POST /api/chat`
 
Send messages to a vLLM model with budget enforcement and cost telemetry.
 
**Request body:**
 
```json
{
  "messages": [{ "role": "user", "content": "string" }],
  "scope": "string",
  "model": "string (optional)"
}
```
 
| Field | Description |
|---|---|
| `messages` | Array of chat messages in OpenAI format |
| `scope` | Budget scope identifier (e.g., `"agent-1"` or `"tenant-1"`) |
| `model` | Model override (defaults to `VLLM_MODEL`) |
 
**Response:** Server-sent events (SSE) stream of the vLLM chat completion.
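
A client has to reassemble SSE events from network chunks that can split an event at any byte. The following is a simplified sketch of an incremental parser for `data:` fields; the `SseDataParser` class is a hypothetical helper, it ignores other SSE fields (`event:`, `id:`, `retry:`), does not handle lone-CR line endings, and makes no assumption about what each payload contains, since the payload format of this endpoint is not specified here.

```typescript
// Simplified incremental parser for the `data:` fields of an SSE stream.
// Events are separated by a blank line; multiple data lines in one event
// are joined with "\n".
class SseDataParser {
  private buffer = "";

  // Feed one decoded text chunk. Returns the data payloads of every event
  // completed by this chunk; an incomplete trailing event stays buffered.
  push(chunk: string): string[] {
    // Normalize CRLF to LF. A trailing lone "\r" is kept until the next
    // chunk arrives, so a CRLF split across chunks is still handled.
    this.buffer = (this.buffer + chunk).replace(/\r\n/g, "\n");

    const payloads: string[] = [];
    let boundary = this.buffer.indexOf("\n\n");
    while (boundary !== -1) {
      const rawEvent = this.buffer.slice(0, boundary);
      this.buffer = this.buffer.slice(boundary + 2);

      const dataLines: string[] = [];
      for (const line of rawEvent.split("\n")) {
        if (line.slice(0, 5) === "data:") {
          const value = line.slice(5);
          // A single leading space after the colon is not part of the value.
          dataLines.push(value.charAt(0) === " " ? value.slice(1) : value);
        }
      }
      if (dataLines.length > 0) {
        payloads.push(dataLines.join("\n"));
      }
      boundary = this.buffer.indexOf("\n\n");
    }
    return payloads;
  }
}
```

In practice each chunk read from the response body would be decoded to text and passed to `push`, and every returned payload handled as one event.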
 
### `GET /api/budget`
 
Query the current budget state for a scope.
 
**Query parameters:** `?scope=<scope-id>`
 
**Response:**
 
```json
{
  "scope": "string",
  "limit": 10.0,
  "spent": 0.0,
  "remaining": 10.0,
  "softCapReached": false,
  "hardCapReached": false
}
```
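
The fields in this response can be derived from a scope's limit, its accumulated spend, and the two cap fractions. The sketch below shows one plausible derivation; the `budgetState` function is hypothetical, and both the `>=` comparison and clamping `remaining` at zero are assumptions rather than documented behavior.

```typescript
// Shape of the GET /api/budget response shown above.
interface BudgetState {
  scope: string;
  limit: number;
  spent: number;
  remaining: number;
  softCapReached: boolean;
  hardCapReached: boolean;
}

// Derive the budget state for a scope. softCap and hardCap are fractions
// of the limit, e.g. 0.8 and 1.0.
function budgetState(
  scope: string,
  limit: number,
  spent: number,
  softCap: number,
  hardCap: number,
): BudgetState {
  return {
    scope,
    limit,
    spent,
    // Assumption: remaining never goes negative, even after an overshoot.
    remaining: Math.max(0, limit - spent),
    // Assumption: a cap counts as reached when spend meets the threshold.
    softCapReached: spent >= limit * softCap,
    hardCapReached: spent >= limit * hardCap,
  };
}
```

With a limit of 10, a soft cap of 0.8, and a hard cap of 1.0, a scope that has spent 9 dollars has reached the soft cap but not the hard cap, and has 1 dollar remaining.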
 
### `POST /api/budget`
 
Define or update a budget for a scope.
 
**Request body:**
 
```json
{
  "scope": "string",
  "limit": 50.0
}
```
 
**Response:** `201 Created`
 
### `DELETE /api/budget`
 
Remove a budget for a scope.
 
**Query parameters:** `?scope=<scope-id>`
 
**Response:** `204 No Content`
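
For callers scripting against the three budget endpoints, a small set of request builders keeps the method, path, and body of each call in one place. This is a sketch with hypothetical names (`RequestSpec`, `budgetRequests`); it only describes the requests documented above and leaves the HTTP client to the caller.

```typescript
// Plain description of an HTTP request, independent of any HTTP client.
interface RequestSpec {
  method: "GET" | "POST" | "DELETE";
  path: string;
  body?: string;
}

// Builders for the /api/budget endpoints documented above.
const budgetRequests = {
  // Query the current budget state for a scope.
  get(scope: string): RequestSpec {
    return {
      method: "GET",
      path: `/api/budget?scope=${encodeURIComponent(scope)}`,
    };
  },
  // Define or update the dollar limit for a scope.
  set(scope: string, limit: number): RequestSpec {
    return {
      method: "POST",
      path: "/api/budget",
      body: JSON.stringify({ scope, limit }),
    };
  },
  // Remove the budget for a scope.
  remove(scope: string): RequestSpec {
    return {
      method: "DELETE",
      path: `/api/budget?scope=${encodeURIComponent(scope)}`,
    };
  },
};
```

A caller would prefix `path` with the server's base URL and send `body` with a `Content-Type: application/json` header. Encoding the scope with `encodeURIComponent` keeps identifiers containing spaces or `&` from corrupting the query string.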
 
### `GET /api/health`
 
Health check endpoint.
 
**Response:**
 
```json
{ "status": "ok" }
```
 
## License
 
MIT — see [LICENSE](./LICENSE).