Small service businesses miss after-hours calls, losing customers because they can't afford a 24/7 receptionist. Existing AI voice solutions require expensive cloud LLM APIs and send sensitive call data off-site.
This recipe builds a self-hosted voice agent that answers after-hours calls for small businesses using your own vLLM inference server. The agent handles appointment booking and FAQ queries over Twilio phone calls with no cloud LLM API calls: LLM inference runs entirely on your own infrastructure, although call audio and the text to be spoken still pass through Twilio, Deepgram, and Cartesia. You’ll wire up Deepgram STT (speech-to-text), Cartesia TTS (text-to-speech), a vLLM-powered intent router, and Redis-backed session storage into a pipeline that runs on an Express server with Twilio Media Streams.
Prerequisites
Node.js 22+ and pnpm 10+ installed
A vLLM server running with an OpenAI-compatible endpoint
Twilio account with a phone number that supports voice
Deepgram API key for speech-to-text
Cartesia API key for text-to-speech
Redis instance (local or remote) for session storage
Langfuse account (optional, for tracing)
Familiarity with TypeScript, Express, WebSockets, and Next.js App Router
Step 1: Scaffold the project and install dependencies
Start with a Next.js 16 project using the App Router. If you don’t have one yet, scaffold it with npx create-next-app@latest . (choose TypeScript, App Router, and src/ directory). Then pin all dependencies:
Create .env.example with all the environment variables your services will need:
env
# Env vars used by vllm-voice-agent-for-after-hours-small-business-support.
# The builder adds entries here as it wires up each integration.
# Keep placeholders only; never commit real values.
NODE_ENV=development
# vLLM
VLLM_ENDPOINT=http://localhost:8000/v1
VLLM_API_KEY=<optional>
VLLM_MODEL=mistral-7b-instruct-v0.3
# Deepgram STT
DEEPGRAM_API_KEY=<your-deepgram-api-key>
# Cartesia TTS
CARTESIA_API_KEY=<your-cartesia-api-key>
# Twilio
TWILIO_ACCOUNT_SID=<your-twilio-account-sid>
TWILIO_AUTH_TOKEN=<your-twilio-auth-token>
TWILIO_PHONE_NUMBER=<your-twilio-phone-number>
# Redis
REDIS_URL=redis://localhost:6379
# Langfuse
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_BASE_URL=https://cloud.langfuse.com
# Express server
PORT=8080
WS_URL=ws://localhost:8080
Expected output: a package.json with all versions pinned, a .env.example with placeholder values, and a node_modules/ directory installed via pnpm install.
Step 2: Define types and configuration schema
Create src/types.ts to hold the domain types used across your services:
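The type definitions are not reproduced in this step. As a rough sketch, inferred only from how later steps use `Intent` and `FaqEntry`, the module might look something like the following. The recipe's actual src/types.ts may define more fields and more types, and `describeIntent` is a hypothetical helper added here purely to show exhaustiveness checking.

```typescript
// Sketch of src/types.ts, inferred from how later steps use these types.
// The real module may carry additional fields or types.

// Discriminated union: the `type` field tells TypeScript which variant it has.
type Intent =
  | { type: "greeting" }
  | { type: "appointment" }
  | { type: "faq" }
  | { type: "unknown" };

// Shape of the entries in the FAQ service's static list (Step 5).
interface FaqEntry {
  question: string;
  answer: string;
  keywords: string[];
}

// Hypothetical helper: the `never` assignment makes the compiler flag any
// Intent variant that is added later but not handled in this switch.
function describeIntent(intent: Intent): string {
  switch (intent.type) {
    case "greeting":
      return "caller said hello";
    case "appointment":
      return "caller wants to book";
    case "faq":
      return "caller has a question";
    case "unknown":
      return "could not classify";
    default: {
      const unreachable: never = intent;
      return unreachable;
    }
  }
}
```

The benefit of the union over a plain string is that adding a fifth intent later produces a compile error in every switch that forgot to handle it.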
Expected output: two modules — src/types.ts with discriminated union types and src/config.ts with a Zod-validated parseConfig() function. Running pnpm typecheck should pass.
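The recipe validates configuration with Zod; that schema is not shown here. The block below is a dependency-free stand-in, not the recipe's code, that illustrates the same fail-fast idea: read the environment once at startup and report every missing variable together. The env var names come from .env.example, while the `AppConfig` field names and the choice of which variables are required are assumptions.

```typescript
// Dependency-free stand-in for the Zod schema in src/config.ts.
// AppConfig field names are assumptions; env var names come from .env.example.
interface AppConfig {
  vllmEndpoint: string;
  vllmModel: string;
  vllmApiKey?: string;
  deepgramApiKey: string;
  cartesiaApiKey: string;
  redisUrl: string;
  port: number;
}

type Env = Record<string, string | undefined>;

// Read a required variable, recording its name if it is missing or empty.
function required(env: Env, name: string, missing: string[]): string {
  const value = env[name];
  if (value === undefined || value === "") {
    missing.push(name);
    return "";
  }
  return value;
}

function parseConfig(env: Env): AppConfig {
  const missing: string[] = [];
  const config: AppConfig = {
    vllmEndpoint: required(env, "VLLM_ENDPOINT", missing),
    vllmModel: required(env, "VLLM_MODEL", missing),
    vllmApiKey: env.VLLM_API_KEY || undefined,
    deepgramApiKey: required(env, "DEEPGRAM_API_KEY", missing),
    cartesiaApiKey: required(env, "CARTESIA_API_KEY", missing),
    redisUrl: env.REDIS_URL ?? "redis://localhost:6379",
    port: Number(env.PORT ?? "8080"),
  };
  // Report every missing name at once so a bad deploy fails at startup.
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
  if (!Number.isInteger(config.port) || config.port < 1 || config.port > 65535) {
    throw new Error(`PORT must be an integer between 1 and 65535, got "${env.PORT}"`);
  }
  return config;
}
```

Calling `parseConfig(process.env)` once in the server entry point keeps the rest of the code free of direct `process.env` reads.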
Step 3: Build the vLLM client
The vLLM client wraps the OpenAI SDK to talk to your vLLM server’s OpenAI-compatible endpoint. Create src/services/vllm-client.ts:
The client provides two methods: chat() for full responses and streamChat() for streaming tokens via an AsyncGenerator. Both wrap OpenAI SDK errors into VLLMError with the HTTP status code and body. The baseURL points to your vLLM server’s /v1 path, which mirrors the OpenAI API structure.
Expected output: VLLMClient that connects to an OpenAI-compatible vLLM endpoint.
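The client source is not reproduced in this step, so the following is only a sketch of the two ideas described above: wrapping thrown errors into a `VLLMError` that keeps the status and body, and exposing streamed tokens through an `AsyncGenerator`. It does not import the OpenAI SDK. `toVLLMError`, `streamTokens`, and the `VLLMError` constructor signature are assumptions, and the chunk shape is the `choices[0].delta.content` layout used by OpenAI-compatible streaming responses.

```typescript
// Error type carrying the HTTP status and response body.
// The constructor signature is an assumption.
class VLLMError extends Error {
  readonly status?: number;
  readonly body?: unknown;
  constructor(message: string, status?: number, body?: unknown) {
    super(message);
    this.name = "VLLMError";
    this.status = status;
    this.body = body;
  }
}

// Normalize anything thrown by the SDK call. SDK API errors are assumed to
// expose `status` and `error`; network failures typically have neither.
function toVLLMError(err: unknown): VLLMError {
  if (err instanceof VLLMError) return err;
  const e = err as { message?: string; status?: number; error?: unknown } | null;
  return new VLLMError(e?.message ?? "vLLM request failed", e?.status, e?.error);
}

// One streamed chunk from an OpenAI-compatible /chat/completions call.
interface ChatChunk {
  choices: { delta: { content?: string | null } }[];
}

// Turn a stream of chunks into a stream of text tokens, skipping empty deltas
// and rethrowing any failure as a VLLMError.
async function* streamTokens(chunks: AsyncIterable<ChatChunk>): AsyncGenerator<string> {
  try {
    for await (const chunk of chunks) {
      const token = chunk.choices[0]?.delta?.content;
      if (token) yield token;
    }
  } catch (err) {
    throw toVLLMError(err);
  }
}
```

A caller can then write `for await (const token of client.streamChat(messages))` and forward each token to TTS as it arrives, rather than waiting for the full completion.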
Step 4: Create the intent router
The intent router classifies a caller’s transcribed speech into one of four intents: greeting, appointment, faq, or unknown. It uses the VLLMClient to prompt the local LLM. Create src/services/intent-router.ts:
ts
import type { Intent } from "../types.js";
import { VLLMClient } from "./vllm-client.js";
export class IntentRouter {
  readonly vllmClient: VLLMClient;
  constructor(vllmClient: VLLMClient) {
    this.vllmClient = vllmClient;
  }
  // Ask the local LLM for a one-word label; anything unexpected maps to "unknown".
  async classify(transcript: string): Promise<Intent> {
    try {
      const response = await this.vllmClient.chat([
        {
          role: "user",
          content: `Classify this caller statement into one word: greeting, appointment, faq, or unknown. Statement: "${transcript}"`,
        },
      ]);
      const trimmed = response.trim().toLowerCase();
      if (trimmed === "greeting") return { type: "greeting" };
      if (trimmed === "appointment") return { type: "appointment" };
      if (trimmed === "faq") return { type: "faq" };
      return { type: "unknown" };
    } catch {
      // A vLLM failure must not crash the call; treat it as "unknown".
      return { type: "unknown" };
    }
  }
  getGreeting(): string {
    return "Thank you for calling after hours. How can I help you? You can book an appointment or ask about our services.";
  }
  getFallbackMessage(): string {
    return "I didn't quite catch that. Could you say 'appointment' or 'questions'?";
  }
}
The classify() method catches errors from VLLMClient and returns { type: "unknown" }, so a vLLM outage doesn’t crash your call handler; the caller gets the fallback message instead.
Expected output: IntentRouter with three methods. Tests verify it maps utterances like “I need a haircut” to { type: "appointment" } and returns unknown when the client throws.
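Because classify() compares the trimmed reply with exact strings, a reply such as "Appointment." or "faq\n(question)" would be treated as unknown. If your model tends to add punctuation or a short preamble, a more tolerant parser is one option. The sketch below is an illustration rather than part of the recipe; `parseIntent` and `LABELS` are hypothetical names.

```typescript
type Intent =
  | { type: "greeting" }
  | { type: "appointment" }
  | { type: "faq" }
  | { type: "unknown" };

const LABELS = ["greeting", "appointment", "faq"] as const;

// Hypothetical helper: pull the first recognised label out of a raw model reply.
function parseIntent(raw: string): Intent {
  // Lowercase and split on anything that is not a letter, so trailing
  // punctuation, quotes, or surrounding words do not defeat the match.
  const words = raw.toLowerCase().split(/[^a-z]+/).filter(Boolean);
  for (const word of words) {
    const label = LABELS.find((candidate) => candidate === word);
    if (label) return { type: label };
  }
  return { type: "unknown" };
}
```

The trade-off is that a reply like "not an appointment" would now match `appointment`, so this loosening only makes sense alongside a prompt that keeps replies to a single word.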
Step 5: Implement the FAQ service
Create src/services/faq-service.ts to answer common business questions using your vLLM server. The class embeds a static set of default FAQs covering hours, services, pricing, location, and emergencies:
ts
import type { FaqEntry } from "../types.js";
import { VLLMClient } from "./vllm-client.js";
export class FAQService {
  readonly vllmClient: VLLMClient;
  // Static FAQ data that is embedded into the system prompt on every call.
  private static readonly DEFAULT_FAQS: FaqEntry[] = [
    { question: "What are your hours?", answer: "We're open Monday through Friday, 9am to 5pm.", keywords: ["hours", "open", "close", "schedule"] },
    { question: "What services do you offer?", answer: "We offer haircuts, coloring, styling, beard trims, and more.", keywords: ["services", "offer", "provide", "do"] },
    { question: "How much does it cost?", answer: "Our prices start at $30 for a basic haircut. We offer senior and student discounts.", keywords: ["price", "cost", "pricing", "rate", "how much"] },
    { question: "Where are you located?", answer: "We are located at 123 Main Street, Suite 100, Anytown USA.", keywords: ["location", "address", "where", "directions"] },
    { question: "What if there's an emergency?", answer: "For emergencies please call 911. For after-hours urgent questions, leave a message and we'll call back within 24 hours.", keywords: ["emergency", "urgent", "after hours", "critical"] },
  ];
  constructor(vllmClient: VLLMClient) {
    this.vllmClient = vllmClient;
  }
  // Errors from the vLLM client propagate to the caller; they are not caught here.
  async answer(question: string): Promise<string> {
    const systemPrompt =
      "You are a helpful business assistant. Use the following FAQ data to answer. If the question does not match any FAQ, say 'I don't have information about that.'\n\nFAQs:\n" +
      JSON.stringify(FAQService.DEFAULT_FAQS);
    const response = await this.vllmClient.chat([
      { role: "system", content: systemPrompt },
      { role: "user", content: question },
    ]);
    return response;
  }
}
The system prompt embeds the FAQs as JSON so the LLM can reference them. If the caller asks something the FAQ data doesn’t cover, the prompt instructs the model to reply “I don’t have information about that.”
Expected output: FAQService with an answer() method. A test for the happy path:
ts
it("answer returns FAQ content for hours question", async () => {
  const result = await faqService.answer("hours?");
  expect(result).toContain("Monday through Friday");
});
Step 6: Build the appointment scheduler
Create src/services/appointment-scheduler.ts to manage appointment slots in Redis. It uses sorted sets (ZADD, ZRANGE) and optimistic concurrency (WATCH/MULTI/EXEC) for atomic booking:
Three operations: seedDefaultSlots(date) populates a Redis sorted set with hourly slots (9am-5pm), each with score 0 (available). getAvailableSlots(date) reads the set, returning each slot’s availability. bookSlot(date, time, customerName) uses WATCH on the key for optimistic locking, then runs a MULTI/EXEC block that marks the slot as booked (score 1) and stores a booking record.
Expected output: an AppointmentScheduler that can seed, list, and book slots atomically. Tests verify that booking an already-taken slot returns { success: false }.
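The scheduler source is not reproduced in this step. As a minimal sketch of the WATCH/MULTI/EXEC booking flow described above, the function below is written against a small structural interface that mirrors the ioredis calls it needs (`watch`, `unwatch`, `zscore`, `multi().zadd().set().exec()`). The key names (`slots:<date>`, `booking:<date>:<time>`), the `reason` field, and the free-function shape are assumptions, not the recipe's actual layout.

```typescript
// Minimal structural subset of an ioredis-style client used by this sketch.
interface Tx {
  zadd(key: string, score: number, member: string): Tx;
  set(key: string, value: string): Tx;
  exec(): Promise<unknown[] | null>; // null: a WATCHed key changed, nothing ran
}
interface RedisLike {
  watch(key: string): Promise<unknown>;
  unwatch(): Promise<unknown>;
  zscore(key: string, member: string): Promise<string | null>;
  multi(): Tx;
}

type BookingResult = { success: true } | { success: false; reason: string };

// Book one slot atomically. Score 0 means available, score 1 means booked.
async function bookSlot(
  redis: RedisLike,
  date: string,
  time: string,
  customerName: string,
): Promise<BookingResult> {
  const key = `slots:${date}`; // hypothetical key layout

  // WATCH first, then read: if another client touches the key between this
  // read and EXEC, the transaction is discarded instead of double-booking.
  await redis.watch(key);
  const score = await redis.zscore(key, time);
  if (score === null || Number(score) !== 0) {
    await redis.unwatch();
    return { success: false, reason: score === null ? "no such slot" : "already booked" };
  }

  const result = await redis
    .multi()
    .zadd(key, 1, time) // mark the slot as booked
    .set(`booking:${date}:${time}`, JSON.stringify({ customerName, date, time }))
    .exec();

  return result === null
    ? { success: false, reason: "concurrent update, try again" }
    : { success: true };
}
```

WATCH state belongs to the connection, so concurrent bookings that share a single client can interfere with one another; giving each booking attempt its own connection (or retrying on a null EXEC result) is a common way to handle that.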
Step 7: Create the Redis session storage adapter
The @reaatech/session-continuity package provides IStorageAdapter — an interface you implement to manage session data. Create src/services/redis-storage-adapter.ts that wraps ioredis:
ts
import { Redis } from "ioredis";
import type {
  IStorageAdapter,
  Session,
  Message,
  SessionId,
  HealthStatus,
  UpdateSessionOptions,
  SessionFilters,
  MessageQueryOptions,
  MessageId,
} from "@reaatech/session-continuity";
import { ConcurrencyError } from "@reaatech/session-continuity";
export class RedisStorageAdapter implements IStorageAdapter {
  private redis: Redis;
  private prefix: string;
  constructor(redis: Redis, prefix: string = "session:"
Also create SimpleTokenCounter at src/services/simple-token-counter.ts. It implements the TokenCounter interface from @reaatech/session-continuity with a rough approximation of one token per four characters, rounded up:
ts
import type { TokenCounter, Message } from "@reaatech/session-continuity";
export class SimpleTokenCounter implements TokenCounter {
  readonly model = "simple";
  readonly tokenizer = "simple-char-count";
  // Rough estimate: one token per four characters, rounded up.
  count(text: string): number {
    return Math.ceil(text.length / 4);
  }
  // Sum the estimate over plain-string content and over text blocks only.
  countMessages(messages: Message[]): number {
    let total = 0;
    for (const message of messages) {
      if (typeof message.content === "string") {
        total += this.count(message.content);
      } else {
        for (const block of message.content) {
          if (block.type === "text") {
            total += this.count(block.text);
          }
        }
      }
    }
    return total;
  }
}
Expected output: a full IStorageAdapter implementation backed by Redis, plus a TokenCounter. Tests confirm create+get roundtrip, missing ID returns null, and stale version throws ConcurrencyError.
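The "stale version throws ConcurrencyError" behaviour is an optimistic-concurrency check: each write states which version it read, and a mismatch is rejected. The in-memory sketch below shows only that rule. It does not use the @reaatech/session-continuity interfaces (their signatures are not shown here), and `ConcurrencyError`, `VersionedSession`, and `InMemorySessionStore` are local, hypothetical stand-ins.

```typescript
// Local stand-in for the package's ConcurrencyError, so the sketch is self-contained.
class ConcurrencyError extends Error {
  constructor(id: string, expected: number, actual: number) {
    super(`Session ${id}: expected version ${expected}, found ${actual}`);
    this.name = "ConcurrencyError";
  }
}

interface VersionedSession {
  id: string;
  version: number;
  data: Record<string, unknown>;
}

// Hypothetical in-memory store showing the compare-then-bump rule.
class InMemorySessionStore {
  private sessions = new Map<string, VersionedSession>();

  create(id: string, data: Record<string, unknown>): VersionedSession {
    const session: VersionedSession = { id, version: 1, data };
    this.sessions.set(id, session);
    return { ...session };
  }

  // A missing ID returns null rather than throwing.
  get(id: string): VersionedSession | null {
    const session = this.sessions.get(id);
    return session ? { ...session } : null;
  }

  // The caller passes the version it last read. A mismatch means another
  // writer got there first, so the update is rejected instead of overwriting.
  update(id: string, expectedVersion: number, data: Record<string, unknown>): VersionedSession {
    const current = this.sessions.get(id);
    if (!current) throw new Error(`Session ${id} not found`);
    if (current.version !== expectedVersion) {
      throw new ConcurrencyError(id, expectedVersion, current.version);
    }
    const next: VersionedSession = {
      id,
      version: current.version + 1,
      data: { ...current.data, ...data },
    };
    this.sessions.set(id, next);
    return { ...next };
  }
}
```

In a Redis-backed adapter the compare and the write have to happen atomically, for example with WATCH/MULTI/EXEC or a Lua script; a plain read followed by a write would reintroduce the race this check is meant to prevent.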
Step 8: Build the TwiML handler
Create src/services/twiml-handler.ts with two functions — generating TwiML for Twilio to connect a call to your WebSocket media stream, and validating incoming Twilio webhook signatures:
ts
import twilio from "twilio";
// Build the TwiML that greets the caller and connects the call audio to the WebSocket stream.
export function buildTwiMLResponse(wsUrl: string): string {
  const vr = new twilio.twiml.VoiceResponse();
  vr.say("Thank you for calling. Please hold while we connect you.");
  const connect = vr.connect();
  connect.stream({ url: wsUrl + "/api/twilio/stream" });
  return vr.toString();
}
// Check the X-Twilio-Signature header against the request URL and form parameters.
export function validateTwilioRequest(
  signature: string,
  url: string,
  params: Record<string, string>,
  authToken: string,
): boolean {
  return twilio.validateRequest(authToken, signature, url, params);
}
buildTwiMLResponse() constructs XML that tells Twilio to stream the call’s audio to your server at /api/twilio/stream
validateTwilioRequest() verifies the X-Twilio-Signature header to prevent forged webhooks
Expected output: a module that generates TwiML XML and validates Twilio signatures.
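To make the signature check less of a black box, here is a sketch of the scheme Twilio documents for form-encoded webhooks: an HMAC-SHA1 over the request URL followed by the sorted POST parameters, keyed with the auth token and Base64-encoded. It is for illustration only; the recipe's code should keep calling `twilio.validateRequest`. `computeTwilioSignature` and `isValidSignature` are hypothetical names.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Build the string Twilio signs: the full request URL, then each POST
// parameter name immediately followed by its value, sorted by name.
function computeTwilioSignature(
  authToken: string,
  url: string,
  params: Record<string, string>,
): string {
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return createHmac("sha1", authToken).update(data, "utf8").digest("base64");
}

// Compare the received header with the expected value in constant time.
function isValidSignature(
  authToken: string,
  signature: string,
  url: string,
  params: Record<string, string>,
): boolean {
  const expected = Buffer.from(computeTwilioSignature(authToken, url, params));
  const received = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Because the URL is part of the signed data, the `url` argument has to be the exact public URL Twilio called. Behind a reverse proxy or tunnel, the protocol and host seen by Express can differ from that URL, which is a frequent cause of valid requests failing validation.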
Step 9: Wire the voice call handler
This is the main integration point. Create src/services/voice-call-handler.ts, which orchestrates the full STT → intent router → LLM → TTS pipeline using the REAA packages:
ts
import { createPipeline, createLatencyBudget, LatencyBudgetEnforcer, initializeSessionManager } from "@reaatech/voice-agent-core";
import { createTwilioHandler } from "@reaatech/voice-agent-telephony";
import { DeepgramSTTProvider } from "@reaatech/voice-agent-stt";
import { CartesiaTTSProvider } from "@reaatech/voice-agent-tts";
import { SessionManager } from "@reaatech/session-continuity";
import { RedisStorageAdapter } from "./redis-storage-adapter.js";
import { SimpleTokenCounter } from "./simple-token-counter.js";
import { VLLMClient } from "./vllm-client.js";
import { IntentRouter } from "./intent-router.js";
import { AppointmentScheduler } from "./appointment-scheduler.js";
import { FAQService } from "./faq-service.js";
import type { AppConfig } from "../config.js";
import type WebSocket from "ws";
import { Redis } from "ioredis";
import type { AgentResponse, VoiceAgentKitConfig, AudioChunk } from "@reaatech/voice-agent-core";
The VoiceCallHandler class takes all services in its constructor and implements handleConnection(ws). Create the handler class with the full pipeline wiring:
This handler accepts a WebSocket, wires Deepgram STT to receive audio from Twilio, routes speech through vLLM intent classification, dispatches to FAQ or appointment logic, and streams Cartesia TTS audio back through the Twilio handler.
Expected output: a VoiceCallHandler class whose handleConnection() method orchestrates the full voice pipeline.
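The handler class itself is not reproduced in this step, and the @reaatech pipeline APIs are not shown, so the sketch below covers only the dispatch decision in the middle of the pipeline: given one final transcript, which service produces the reply text. The `Classifier` and `FaqAnswerer` interfaces follow the method names from Steps 4 and 5; `BookingFlow.startBooking` and `respondTo` are hypothetical placeholders.

```typescript
type Intent =
  | { type: "greeting" }
  | { type: "appointment" }
  | { type: "faq" }
  | { type: "unknown" };

// Narrow views of the collaborators, so the dispatch logic is easy to test.
interface Classifier {
  classify(transcript: string): Promise<Intent>;
  getGreeting(): string;
  getFallbackMessage(): string;
}
interface FaqAnswerer {
  answer(question: string): Promise<string>;
}
interface BookingFlow {
  // Hypothetical: the recipe does not show how the booking dialogue is driven.
  startBooking(transcript: string): Promise<string>;
}

// Decide what the agent should say next for one final transcript.
async function respondTo(
  transcript: string,
  router: Classifier,
  faq: FaqAnswerer,
  booking: BookingFlow,
): Promise<string> {
  const intent = await router.classify(transcript);
  try {
    if (intent.type === "greeting") return router.getGreeting();
    if (intent.type === "faq") return await faq.answer(transcript);
    if (intent.type === "appointment") return await booking.startBooking(transcript);
  } catch {
    // FAQService.answer() does not catch vLLM errors itself, so a failure
    // here falls through to the same reply as an unrecognised intent.
  }
  return router.getFallbackMessage();
}
```

Keeping this decision in a plain async function means it can be unit-tested with stub services, separately from the WebSocket, STT, and TTS wiring.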
Step 10: Create the Express server
Create src/server.ts — it bootstraps Express, attaches the Twilio webhook route, the health check, and the WebSocket server for Twilio Media Streams:
ts
import express from "express";
import { createServer } from "node:http";
import { WebSocketServer } from "ws";
import { Redis } from "ioredis";
import { Langfuse } from "langfuse";
import { parseConfig } from "./config.js";
import { VLLMClient } from "./services/vllm-client.js";
import { IntentRouter } from "./services/intent-router.js";
import { AppointmentScheduler } from "./services/appointment-scheduler.js";
import { FAQService } from "./services/faq-service.js";
import { VoiceCallHandler } from "./services/voice-call-handler.js";
The server auto-starts unless it is running under Vitest (it checks the VITEST environment variable). Tests call createApp() directly and listen on a random port.
To run the Express server separately from the Next.js dev server, use:
terminal
npx tsx src/server.ts
Expected output: an Express server with three routes:
POST /api/twilio/voice — validates the Twilio signature and returns TwiML
GET /health — returns { "status": "ok", "uptime": ... }
WebSocket at /api/twilio/stream — connects calls to the voice pipeline
Step 11: Add the Next.js health endpoint and landing page
Add a Next.js App Router health endpoint at app/api/health/route.ts:
ts
import { NextResponse } from "next/server";
export function GET() {
  return NextResponse.json({
    status: "ok",
    service: "vllm-voice-agent-for-after-hours-small-business-support",
  });
}
Replace the scaffolded landing page at app/page.tsx with a simple description:
tsx
import styles from "./page.module.css";
export default function Home() {
  return (
    <div className={styles.page}>
      <main className={styles.main}>
        <h1>vLLM Voice Agent</h1>
        <p className={styles.tagline}>
          A self-hosted voice agent that answers after-hours calls using your own vLLM inference, with customizable workflows for appointment booking and FAQs.
        </p>
        <p>
          This service runs an Express server that handles Twilio voice webhooks and WebSocket media streams. The LLM brain runs on your vLLM server, keeping all voice data on-premises.
        </p>
        <h2>Quick Start</h2>
        <ol>
          <li>Clone the repository</li>
          <li>Run <code>pnpm install</code></li>
          <li>Copy <code>.env.example</code> to <code>.env</code> and fill in your API keys</li>
          <li>Run <code>pnpm run dev</code> to start the Next.js development server</li>
          <li>In a separate terminal, run <code>npx tsx src/server.ts</code> for the Express voice server</li>
          <li>Configure your Twilio phone number voice webhook to point to <code>https://your-domain.com/api/twilio/voice</code></li>
        </ol>
        <p>See the README for detailed configuration instructions.</p>
        <div className={styles.ctas}>
          <a className={styles.primary} href="/api/health">
            Health Check
          </a>
        </div>
      </main>
    </div>
  );
}
Note: pnpm run dev starts the Next.js development server for the landing page and health endpoint. Start the Express voice server separately with npx tsx src/server.ts in another terminal.
Expected output: a health endpoint returning JSON and a landing page describing the recipe.
Step 12: Run tests, typecheck, and lint
The test suite uses vitest with MSW mocks for outbound HTTP calls. The MSW server is configured in tests/setup.ts:
This mocks the vLLM /chat/completions endpoint and the Langfuse ingestion API so tests run without network calls. Server tests mock ioredis and other services at the module level.
To run the full verification suite:
terminal
pnpm test
pnpm typecheck
pnpm lint
All three should exit with code 0.
Next steps
Add SMS fallback — when the voice agent can’t resolve a query, capture the caller’s number and send an SMS with a link to book online or call back during business hours
Extend the FAQ database — replace the hardcoded DEFAULT_FAQS array with a Redis-backed FAQ store that can be updated at runtime through a management API
Add multi-language support — configure Cartesia TTS with different language codes and update the intent router’s prompt to detect and respond in the caller’s language
Deploy with Docker — containerize the Express server, Redis, and vLLM into a docker-compose.yml stack for one-command deployment