Small businesses miss after‑hours calls and can’t afford a 24/7 receptionist. They need an automated phone system that understands natural language and completes tasks without costly human staffing.
This recipe builds a Google Gemini voice agent that handles inbound Twilio phone calls in real time. When a caller speaks, Deepgram transcribes their words, Gemini classifies the intent, a confidence router decides what to do, and ElevenLabs speaks the response back — all over a bidirectional WebSocket media stream. The agent handles two primary intents: appointment booking (checking calendar availability and confirming a slot) and FAQ lookup (grounding responses in conversation history). All calls are traced through Langfuse for observability, and repeated phrases are served from an LLM cache to reduce latency and cost.
Prerequisites
Node.js 22 or later
pnpm installed
A Twilio account with a phone number configured for webhook callbacks
A Google AI Studio API key for Gemini
A Deepgram API key for speech-to-text
An ElevenLabs API key for text-to-speech
An OpenAI API key for embeddings (used by agent-memory and llm-cache)
Optional: a Langfuse account for observability tracing
Step 1: Configure environment variables
Copy the .env.example file to .env.local and fill in your credentials.
terminal
cp .env.example .env.local
Edit .env.local with your real values. These are the environment variables the config module reads at startup:
Expected output: The .env.local file is created with all 16 variables populated. dotenv loads these into process.env at module-load time, and the Zod schema in src/lib/config.ts validates every required key before the server starts.
Step 2: Understand the types and shared interfaces
Create src/lib/types.ts. This file re-exports types from @reaatech/agent-mesh (the canonical protocol shapes) and adds call-specific interfaces. You do not need to hand-roll types for session records or confidence decisions — the agent-mesh package already exports those.
Expected output: src/lib/types.ts exports all shared types. CallState extends SessionRecord (from agent-mesh) so it already has session_id, status, and turn_history. The Intent union defines the two classified intents plus a catch-all. OrchestrationResult carries the text response alongside the routing decision and an action label the caller can switch on.
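A minimal sketch of how these shapes could fit together, assuming local stand-ins for the agent-mesh types so the snippet runs on its own. The field names callSid and intent, the two non-catch-all intent labels, and the action values are hypothetical; only session_id, status, turn_history, and the general_query catch-all come from this recipe.

```typescript
// Stand-ins for types that the recipe re-exports from @reaatech/agent-mesh.
// The real SessionRecord may carry more fields than the three named above.
interface TurnEntry {
  speaker: "caller" | "agent";
  content: string;
  timestamp: string;
}

interface SessionRecord {
  session_id: string;
  status: string;
  turn_history: TurnEntry[];
}

// Call-specific state layered on the session record (callSid is an assumed field).
interface CallState extends SessionRecord {
  callSid: string;
}

// Two classified intents plus a catch-all; the first two labels are assumptions.
type Intent = "appointment_booking" | "faq_lookup" | "general_query";

// Text response plus an action label the caller can switch on (values assumed).
interface OrchestrationResult {
  text: string;
  intent: Intent;
  action: "respond" | "clarify" | "fallback";
}

// Exhaustive switch: the never check fails to compile if an action is added
// to the union without a matching branch here.
function describeResult(result: OrchestrationResult): string {
  switch (result.action) {
    case "respond":
      return `speak: ${result.text}`;
    case "clarify":
      return `ask again: ${result.text}`;
    case "fallback":
      return "hand off to the general handler";
    default: {
      const unreachable: never = result.action;
      return unreachable;
    }
  }
}

const exampleCall: CallState = {
  session_id: "session-1",
  status: "active",
  turn_history: [],
  callSid: "CA-example",
};
```

Keeping the action label a closed union lets the compiler flag any handler that forgets a case.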
Step 3: Build the config validation module
Create src/lib/config.ts. This is the first piece of code that runs — it reads environment variables, validates them with Zod, and exports the typed config object. If a required variable is missing, the server refuses to start with a clear validation error.
Expected output: src/lib/config.ts exports a singleton config object. Required fields (the six API keys) throw at startup if absent. Optional fields fall back to defaults: nova-3 for the Deepgram model, gemini-2.5-flash for the Gemini model, and eleven_flash_v2_5 for ElevenLabs (the lowest-latency option).
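The fail-fast behavior can be illustrated without Zod. The sketch below is a dependency-free stand-in for the idea, not the recipe's schema: the environment variable names are hypothetical (check .env.example for the real ones), only three of the six required keys are shown, and deepgramModel and elevenlabsModel are assumed field names.

```typescript
interface Config {
  googleApiKey: string;
  deepgramApiKey: string;
  elevenlabsApiKey: string;
  deepgramModel: string;
  geminiModel: string;
  elevenlabsModel: string;
}

type Env = Record<string, string | undefined>;

// Hypothetical variable names, used here for illustration only.
const REQUIRED_KEYS = ["GOOGLE_API_KEY", "DEEPGRAM_API_KEY", "ELEVENLABS_API_KEY"] as const;

function loadConfig(env: Env): Config {
  // Collect every missing key so the error names all of them at once.
  const missing = REQUIRED_KEYS.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    googleApiKey: env["GOOGLE_API_KEY"] as string,
    deepgramApiKey: env["DEEPGRAM_API_KEY"] as string,
    elevenlabsApiKey: env["ELEVENLABS_API_KEY"] as string,
    // Optional settings fall back to the defaults named in this step.
    deepgramModel: env["DEEPGRAM_MODEL"] || "nova-3",
    geminiModel: env["GEMINI_MODEL"] || "gemini-2.5-flash",
    elevenlabsModel: env["ELEVENLABS_MODEL"] || "eleven_flash_v2_5",
  };
}
```

Passing the environment in as an argument, rather than reading process.env inside the function, keeps the validation easy to unit test.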
Step 4: Build the Gemini LLM service
Create src/services/gemini.ts. This service wraps @google/genai and exposes generateResponse for free-form text and classifyIntent for structured intent classification via function calling.
ts
import { GoogleGenAI } from "@google/genai"
import type { ClassifierOutput } from "@reaatech/agent-mesh"
import type { Config } from "../lib/config.js"

export class GeminiServiceError extends Error {
  statusCode: number
  declare cause: unknown

  constructor(message: string, statusCode: number = 500, cause?: unknown) {
    super(message)
    this.name = "GeminiServiceError"
    this.statusCode = statusCode
    this.cause = cause
  }
}

export class GeminiService {
  private ai: GoogleGenAI
  private model: string

  constructor(config: Config) {
    this.ai = new GoogleGenAI({ apiKey: config.googleApiKey })
    this.model = config.geminiModel
  }

  // Free-form text generation; throws GeminiServiceError on empty or failed responses.
  async generateResponse(prompt: string): Promise<string> {
    try {
      const response = await this.ai.models.generateContent({
        model: this.model,
        contents: prompt,
      })
      if (!response.text) {
        throw new GeminiServiceError("Empty response from Gemini", 500)
      }
      return response.text
    } catch (err: unknown) {
      if (err instanceof GeminiServiceError) throw err
      const e = err as { name?: string; message?: string; status?: number }
      throw new GeminiServiceError(
        e.message ?? "Gemini API error",
        e.status ?? 500,
        err,
      )
    }
  }

  // Structured intent classification via a function-declaration tool.
  async classifyIntent(transcript: string, intentLabels: string[]): Promise<ClassifierOutput[]> {
    try {
      const declaration = {
        name: "classify_intent",
        parametersJsonSchema: {
          type: "object" as const,
          properties: {
            label: { type: "string" as const },
            confidence: { type: "number" as const },
          },
          required: ["label", "confidence"],
        },
      }
      const contents = `Classify this transcript into one of: ${intentLabels.join(", ")}.\nTranscript: ${transcript}`
      const response = await this.ai.models.generateContent({
        model: this.model,
        contents,
        config: {
          tools: [{ functionDeclarations: [declaration] }],
        },
      })
      // No function call returned: fall back to the catch-all intent with zero confidence.
      if (!response.functionCalls || response.functionCalls.length === 0) {
        return [{ agent_id: "general_query", confidence: 0, ambiguous: false, detected_language: "en", intent_summary: "", entities: {} }]
      }
      return response.functionCalls.map((fc) => ({
        agent_id: (fc.args as { label?: string }).label ?? "general_query",
        confidence: (fc.args as { confidence?: number }).confidence ?? 0,
        ambiguous: false,
        detected_language: "en",
        intent_summary: "",
        entities: {},
      }))
    } catch (err: unknown) {
      const e = err as { name?: string; message?: string; status?: number }
      throw new GeminiServiceError(
        e.message ?? "Gemini classification error",
        e.status ?? 500,
        err,
      )
    }
  }

  // Rough heuristic: about four characters per token.
  estimateTokens(text: string): number {
    return Math.ceil(text.length / 4)
  }
}
Expected output: GeminiService is instantiated with a Config object and calls new GoogleGenAI({ apiKey }) as shown in the @google/genai README. generateResponse returns response.text and throws GeminiServiceError on empty or failed responses. classifyIntent sends a function-declaration tool to Gemini and maps response.functionCalls back to ClassifierOutput[]. The ClassifierOutput shape (agent_id, confidence, ambiguous, detected_language, intent_summary, and entities) comes from @reaatech/agent-mesh.
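The mapping from function-call arguments to ClassifierOutput can be isolated as a pure function, which makes it testable without calling Gemini. The sketch below uses local stand-in types and adds two guards that the listing above does not have: labels outside the allowed list fall back to general_query, and confidence is clamped to the range 0 to 1. Treat both guards and the helper name toClassifierOutput as assumptions.

```typescript
// Local stand-in for ClassifierOutput from @reaatech/agent-mesh.
interface ClassifierOutput {
  agent_id: string;
  confidence: number;
  ambiguous: boolean;
  detected_language: string;
  intent_summary: string;
  entities: Record<string, unknown>;
}

// Minimal shape of one entry in response.functionCalls.
interface FunctionCallLike {
  name?: string;
  args?: Record<string, unknown>;
}

const FALLBACK_LABEL = "general_query";

function toClassifierOutput(call: FunctionCallLike, allowedLabels: readonly string[]): ClassifierOutput {
  const label = call.args?.["label"];
  const rawConfidence = call.args?.["confidence"];

  // Only accept labels the prompt actually offered to the model.
  const agentId = typeof label === "string" && allowedLabels.includes(label) ? label : FALLBACK_LABEL;
  const labelAccepted = agentId === label;

  // Clamp to [0, 1]; anything non-numeric counts as zero confidence.
  const clamped =
    typeof rawConfidence === "number" && Number.isFinite(rawConfidence)
      ? Math.min(1, Math.max(0, rawConfidence))
      : 0;

  return {
    agent_id: agentId,
    confidence: labelAccepted ? clamped : 0,
    ambiguous: false,
    detected_language: "en",
    intent_summary: "",
    entities: {},
  };
}
```

Zeroing the confidence for a rejected label keeps a hallucinated intent from passing the router's threshold.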
Step 5: Build the memory, router, and cache services
Create three service files. These are supporting services that wrap the @reaatech/agent-memory, @reaatech/confidence-router, and @reaatech/llm-cache packages respectively.
src/services/memory.ts — stores each conversation turn and retrieves history for context:
Expected output: MemoryService wraps AgentMemory with the extractAndStore / retrieve / close lifecycle from the README, passing { speaker, content, timestamp } objects and mapping results back to TurnEntry[]. RouterService wraps ConfidenceRouter and maps the RoutingDecision shape (with .type, .target, .prompt, .confidence) to the agent-mesh ConfidenceDecision shape. CacheService uses CacheEngine with InMemoryAdapter for both storage layers and OpenAIEmbedder for semantic similarity. Its get method returns the discriminated CacheResult union, and set stores entries with queryType: "factual" so that the 1800-second TTL for factual responses applies.
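To make the routing step concrete, here is a sketch of threshold-based routing over classifier candidates. It is not the @reaatech/confidence-router implementation: the decision type names, the two thresholds (0.8 and 0.5), and the function name decideRoute are assumptions chosen for illustration. Only the field names type, target, prompt, and confidence come from the description above.

```typescript
interface Candidate {
  agent_id: string;
  confidence: number;
}

// Discriminated union so callers can switch on decision.type.
type RoutingDecision =
  | { type: "route"; target: string; confidence: number }
  | { type: "clarify"; prompt: string; confidence: number }
  | { type: "fallback"; target: string; confidence: number };

// Assumed thresholds; tune them against real call transcripts.
const ROUTE_THRESHOLD = 0.8;
const CLARIFY_THRESHOLD = 0.5;

function decideRoute(candidates: readonly Candidate[]): RoutingDecision {
  // Pick the highest-confidence candidate without mutating the input.
  const best = [...candidates].sort((a, b) => b.confidence - a.confidence)[0];

  if (best === undefined || best.confidence < CLARIFY_THRESHOLD) {
    return { type: "fallback", target: "general_query", confidence: best?.confidence ?? 0 };
  }
  if (best.confidence >= ROUTE_THRESHOLD) {
    return { type: "route", target: best.agent_id, confidence: best.confidence };
  }
  // Middle band: ask the caller to confirm before acting.
  return {
    type: "clarify",
    prompt: `Just to confirm, is this about ${best.agent_id.replace(/_/g, " ")}?`,
    confidence: best.confidence,
  };
}
```

A middle "clarify" band is useful on the phone, where a wrong booking costs more than one extra question.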
Step 6: Build the Twilio, audio, and observability services
src/services/twilio-call.ts — Twilio client wrapper for webhook validation and call control:
Expected output: TwilioCallService uses the ESM import pattern shown in the twilio README: import twilio from 'twilio', then const { RestException } = twilio. generateTwiML produces the TwiML XML that Twilio expects to open the WebSocket media stream. AudioService calls deepgram.listen.v1.connect for real-time transcription and elevenlabs.textToSpeech.convert for non-streaming TTS (returns a ReadableStream that is read to completion and converted to a Buffer). ObservabilityService creates Langfuse traces and generation spans for each LLM call.
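For reference, a bidirectional media stream is opened with a Connect verb wrapping a Stream noun. The sketch below shows one way generateTwiML could build that document by hand; the function signature and the escapeXml helper are assumptions, and the recipe's version may use the twilio library's TwiML builder instead.

```typescript
// Escape characters that are special inside XML attribute values.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build TwiML that tells Twilio to open a bidirectional WebSocket stream.
function generateTwiML(streamUrl: string): string {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXml(streamUrl)}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

// Example: point the stream at the media-stream path used in this recipe.
const exampleTwiml = generateTwiML("wss://example.ngrok.io/api/media-stream");
```

The URL must use the wss scheme, and the host here is a placeholder for your own tunnel or domain.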
Step 7: Build the calendar integration
src/integrations/calendar.ts — a MockCalendarProvider (for dev/test) and a CalendarService that delegates to the provider:
Expected output: CalendarProvider is the interface. MockCalendarProvider generates hourly slots from 9 AM to 5 PM and stores bookings in an in-memory Map. CalendarService is the delegation wrapper. In production, swap MockCalendarProvider for a real calendar adapter (Google Calendar, Calendly, etc.) that implements the same interface.
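A sketch of what such a mock provider might look like. Assumptions: the method name getAvailableSlots, the synchronous signatures (a real adapter would almost certainly be async), the mock-N ID format, and the reading of "9 AM to 5 PM" as eight one-hour slots starting at 9:00 through 16:00. bookAppointment and eventId are the only names taken from this recipe.

```typescript
interface BookingResult {
  eventId: string;
  slot: Date;
}

interface CalendarProvider {
  getAvailableSlots(day: Date): Date[];
  bookAppointment(slot: Date, callerName: string): BookingResult;
}

class MockCalendarProvider implements CalendarProvider {
  // Bookings keyed by slot start time in epoch milliseconds.
  private bookings = new Map<number, { callerName: string; result: BookingResult }>();
  private nextId = 1;

  // Hourly slots in local time, skipping any that are already booked.
  getAvailableSlots(day: Date): Date[] {
    const slots: Date[] = [];
    for (let hour = 9; hour < 17; hour++) {
      const slot = new Date(day.getFullYear(), day.getMonth(), day.getDate(), hour);
      if (!this.bookings.has(slot.getTime())) {
        slots.push(slot);
      }
    }
    return slots;
  }

  // Reserve a slot and return a confirmation ID plus the booked time.
  bookAppointment(slot: Date, callerName: string): BookingResult {
    const key = slot.getTime();
    if (this.bookings.has(key)) {
      throw new Error("Slot already booked");
    }
    const result: BookingResult = { eventId: `mock-${this.nextId++}`, slot };
    this.bookings.set(key, { callerName, result });
    return result;
  }
}
```

Because the orchestrator only sees the CalendarProvider interface, a real adapter can replace the mock without touching the booking handler.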
Step 8: Build the orchestrator
src/agent/orchestrator.ts — the central coordinator that wires all services together via dependency injection. No hidden new calls — every dependency is passed in through the constructor so tests can mock each service independently.
ts
import type { Config } from "../lib/config.js";
import type { OrchestrationResult, TurnEntry, CallState } from "../lib/types.js";
import type { GeminiService } from "../services/gemini.js";
import type { MemoryService } from "../services/memory.js";
import type { RouterService } from "../services/router.js";
import type { CacheService } from "../services/cache.js";
import type { AudioService } from "../services/audio.js";
import type { TwilioCallService } from "../services/twilio-call.js";
import type { CalendarService } from "../integrations/calendar.js";
import
Expected output: Orchestrator accepts nine injected dependencies (all services) through its constructor. processTurn is the core flow and runs these stages in order: cache check, history retrieval, intent classification, routing decision, intent handler, LLM tracing, caching, and memory storage. The three intent handlers (handleAppointmentBooking, handleFAQ, handleGeneralQuery) each build a prompt and call GeminiService.generateResponse. Cache hits short-circuit the full flow and return immediately without calling Gemini.
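The shape of that flow, including the cache short-circuit, can be sketched with plain function dependencies. This is an illustration under stated assumptions rather than the recipe's Orchestrator: the TurnDeps interface, its method names, the 0.5 confidence cutoff, and the prompt wording are hypothetical, and tracing is omitted.

```typescript
// Hypothetical dependency bundle; the real Orchestrator injects service objects.
interface TurnDeps {
  cacheGet(key: string): Promise<string | undefined>;
  cacheSet(key: string, value: string): Promise<void>;
  getHistory(callId: string): Promise<string[]>;
  classify(transcript: string): Promise<{ label: string; confidence: number }>;
  generate(prompt: string): Promise<string>;
  storeTurn(callId: string, speaker: "caller" | "agent", content: string): Promise<void>;
}

interface TurnResult {
  text: string;
  cached: boolean;
  intent?: string;
}

async function processTurn(deps: TurnDeps, callId: string, transcript: string): Promise<TurnResult> {
  // 1. Cache check: a hit returns immediately without calling the LLM.
  const hit = await deps.cacheGet(transcript);
  if (hit !== undefined) {
    return { text: hit, cached: true };
  }

  // 2. History retrieval and 3. intent classification.
  const history = await deps.getHistory(callId);
  const { label, confidence } = await deps.classify(transcript);

  // 4. Routing: low-confidence labels fall back to the catch-all (assumed cutoff).
  const intent = confidence >= 0.5 ? label : "general_query";

  // 5. Intent handler: build a prompt and generate the reply.
  const prompt = `Intent: ${intent}\nHistory:\n${history.join("\n")}\nUser: ${transcript}`;
  const text = await deps.generate(prompt);

  // 6. Cache the reply, then 7. store both sides of the turn.
  await deps.cacheSet(transcript, text);
  await deps.storeTurn(callId, "caller", transcript);
  await deps.storeTurn(callId, "agent", text);

  return { text, cached: false, intent };
}
```

Because every dependency arrives as an argument, a test can swap in in-memory fakes and count how often the LLM is called.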
Step 9: Build the call handler and WebSocket server
src/api/call.ts — the CallHandler class that bridges Twilio webhooks and the WebSocket media stream. This file also contains the raw WebSocket server that runs inside the Next.js instrumentation hook.
ts
import type { Config } from "../lib/config.js";
import type { Orchestrator } from "../agent/orchestrator.js";
import type { TwilioCallService } from "../services/twilio-call.js";
import type { AudioService } from "../services/audio.js";
import type { MemoryService } from "../services/memory.js";
import type { ObservabilityService } from "../services/observability.js";

interface WebSocketLike {
  on(event: string, handler: (...args: unknown[]) => void): void;
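The WebSocket server lives in src/instrumentation.ts. It uses Node’s built-in HTTP server and a custom WebSocket implementation, so the ws npm package is not needed. Note that the WebSocket built into Node 22+ is a client only; the server side here handles the HTTP upgrade and frame parsing itself.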
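Any hand-written WebSocket server has to complete the RFC 6455 opening handshake before frames can flow. As an illustration of that one piece (not a reproduction of src/instrumentation.ts), the sketch below computes the Sec-WebSocket-Accept header from the client's Sec-WebSocket-Key. The function names are hypothetical; the GUID is the fixed value defined by the RFC.

```typescript
import { createHash } from "node:crypto";

// Fixed GUID from RFC 6455, appended to the client's key before hashing.
const WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

// Sec-WebSocket-Accept = base64(SHA-1(clientKey + GUID)).
function computeAcceptKey(clientKey: string): string {
  return createHash("sha1").update(clientKey + WEBSOCKET_GUID).digest("base64");
}

// Raw HTTP response that switches the connection to the WebSocket protocol.
function buildUpgradeResponse(clientKey: string): string {
  return [
    "HTTP/1.1 101 Switching Protocols",
    "Upgrade: websocket",
    "Connection: Upgrade",
    `Sec-WebSocket-Accept: ${computeAcceptKey(clientKey)}`,
    "",
    "",
  ].join("\r\n");
}

// Sample key from RFC 6455 section 1.3; yields "s3pPLMBiTxaQ9kYGzzhZRbK+xOo=".
const sampleAccept = computeAcceptKey("dGhlIHNhbXBsZSBub25jZQ==");
```

In an http.Server "upgrade" listener, the response string would be written to the raw socket once the request path matches /api/media-stream.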
Expected output: CallHandler.handleWebhook receives the parsed Twilio webhook params, creates the Orchestrator call state, and returns TwiML pointing to the WebSocket URL. CallHandler.handleMediaStream sets up the bidirectional flow: Twilio WebSocket messages (base64 audio) go to Deepgram for transcription, the transcript goes to Orchestrator.processTurn, the reply goes to ElevenLabs TTS, and the resulting base64 audio is sent back to Twilio. instrumentation.ts runs register() on Node.js startup (guarded by NEXT_RUNTIME === "nodejs") and monkey-patches http.createServer to attach a WebSocket upgrade listener on /api/media-stream.
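The Twilio side of that flow is a series of JSON text messages. The sketch below parses the inbound events a handler cares about (start, media, stop) and builds the outbound media message. The type and function names are hypothetical; the message fields (event, start.streamSid, media.payload, streamSid) follow Twilio's Media Streams message format.

```typescript
// Narrowed view of the inbound messages this handler acts on.
type InboundEvent =
  | { kind: "start"; streamSid: string }
  | { kind: "media"; payload: string }
  | { kind: "stop" }
  | { kind: "ignored" };

function parseTwilioMessage(raw: string): InboundEvent {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { kind: "ignored" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { kind: "ignored" };
  }
  const msg = parsed as {
    event?: unknown;
    start?: { streamSid?: unknown } | null;
    media?: { payload?: unknown } | null;
  };
  const streamSid = msg.start?.streamSid;
  const payload = msg.media?.payload;

  if (msg.event === "start" && typeof streamSid === "string") {
    return { kind: "start", streamSid };
  }
  if (msg.event === "media" && typeof payload === "string") {
    // payload is base64-encoded audio from the caller.
    return { kind: "media", payload };
  }
  if (msg.event === "stop") {
    return { kind: "stop" };
  }
  // "connected", "mark", and anything unrecognized are skipped.
  return { kind: "ignored" };
}

// Outbound audio must reference the streamSid received in the start event.
function buildOutboundMedia(streamSid: string, base64Audio: string): string {
  return JSON.stringify({ event: "media", streamSid, media: { payload: base64Audio } });
}
```

Twilio expects outbound audio as base64-encoded 8 kHz mu-law, so the TTS output format has to match before it is wrapped in this message.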
Step 10: Wire the API route handlers
Create app/api/call/route.ts. This is the only place service instances are created at module level — they are singletons for the lifetime of the server.
ts
import { type NextRequest, NextResponse } from "next/server";
import { config } from "@/src/lib/config.js";
import { GeminiService } from "@/src/services/gemini.js";
import { MemoryService } from "@/src/services/memory.js";
import { RouterService } from "@/src/services/router.js";
import { CacheService } from "@/src/services/cache.js";
import { AudioService } from "@/src/services/audio.js";
import { TwilioCallService, RestException } from "@/src/services/twilio-call.js";
import { CalendarService, MockCalendarProvider } from "@/src/integrations/calendar.js";
import { ObservabilityService } from "@/src/services/observability.js";
import
Also create app/api/health/route.ts:
ts
import { NextResponse } from "next/server";
import { config } from "@/src/lib/config.js";

export function GET(): NextResponse {
  return NextResponse.json({
    status: "healthy",
    uptime: process.uptime(),
    timestamp: new Date().toISOString(),
    checks: {
      gemini: Boolean(config.googleApiKey),
      deepgram: Boolean(config.deepgramApiKey),
      elevenlabs: Boolean(config.elevenlabsApiKey),
    },
  });
}
Expected output: The POST handler parses the URL-encoded Twilio webhook body, validates the X-Twilio-Signature header, delegates to CallHandler.handleWebhook, and returns TwiML XML with Content-Type: text/xml. Errors are caught: RestException returns the Twilio error code; everything else returns 500. The GET handler on the same route returns a health check payload. The health route is a simple standalone health check with no auth.
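To show what validating X-Twilio-Signature involves, here is a sketch of the documented algorithm using only node:crypto: append each POST parameter (sorted by name) as key followed by value to the full request URL, sign the result with HMAC-SHA1 using the account auth token, and base64-encode it. The function names are hypothetical. In production code the twilio library's validateRequest helper is the more likely choice.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Parse an application/x-www-form-urlencoded body into a flat record.
function parseFormBody(body: string): Record<string, string> {
  const params: Record<string, string> = {};
  new URLSearchParams(body).forEach((value, key) => {
    params[key] = value;
  });
  return params;
}

// URL + sorted key/value pairs, signed with HMAC-SHA1 and base64-encoded.
function computeTwilioSignature(authToken: string, url: string, params: Record<string, string>): string {
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return createHmac("sha1", authToken).update(data, "utf8").digest("base64");
}

// Constant-time comparison against the header value sent by Twilio.
function isValidTwilioRequest(
  authToken: string,
  signatureHeader: string,
  url: string,
  params: Record<string, string>,
): boolean {
  const expected = Buffer.from(computeTwilioSignature(authToken, url, params));
  const received = Buffer.from(signatureHeader);
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The URL passed in must be the exact public URL Twilio called (for example the ngrok address), not the localhost address the server sees.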
Step 11: Enable the instrumentation hook and start the server
Before starting the dev server, add experimental.instrumentationHook: true to next.config.ts. On Next.js 14 and earlier, this flag is required for src/instrumentation.ts to run; without it, the WebSocket server never starts and calls have no media stream. Next.js 15 runs instrumentation files by default, so the flag can be omitted there.
ts
import type { NextConfig } from "next"

const nextConfig: NextConfig = {
  experimental: {
    instrumentationHook: true,
  },
}

export default nextConfig
Once the config is in place, start the dev server:
terminal
pnpm dev
Expected output: The Next.js server starts and listens on port 3000. The register() function in src/instrumentation.ts fires, instantiates all services, and attaches the WebSocket upgrade listener. You can verify the server is up by calling the health endpoint:
terminal
curl http://localhost:3000/api/health
The response should show "status": "healthy", with each of the three API key checks (gemini, deepgram, elevenlabs) reported as a boolean that is true when the key is set.
To expose the webhook to Twilio in development, use ngrok:
terminal
ngrok http 3000
Then configure your Twilio phone number’s voice webhook to point to https://your-ngrok-url.ngrok.io/api/call.
Step 12: Run the tests
The test suite covers every service and the API route. Run it with:
terminal
pnpm test
Expected output: All tests pass.
Next steps
Connect to a real calendar provider: swap MockCalendarProvider for a Google Calendar or Calendly adapter that implements the CalendarProvider interface. The bookAppointment method already returns a confirmation ID and time slot.
Handle multiple callers concurrently — the Orchestrator.activeCalls Map tracks each live call independently. To scale horizontally, replace the in-memory map with a Redis store and the AgentMemory storage with PostgreSQL/pgvector.
Add multi-turn FAQ grounding — populate the FAQ context by calling MemoryService.storeCallTurn with knowledge-base excerpts before the first call. The FAQ handler already calls getRelevantContext for semantic retrieval.
Stream the TTS response — use AudioService.synthesizeSpeechStream (which calls elevenlabs.textToSpeech.stream) to send audio chunks back to Twilio as they are generated, reducing the perceived latency of the first word.
Excerpts from src/agent/orchestrator.ts (Step 8): the remaining import and the prompt templates used by the three intent handlers.
ts
import type { ObservabilityService } from "../services/observability.js";
type ObservabilityTrace = ReturnType<ObservabilityService["traceCall"]>;

// Appointment booking handler
const extractionPrompt = `Extract appointment booking details from this conversation. Return JUST the appointment date and time, and the caller's name if mentioned.\n\nTranscript: ${transcript}\nContext:\n${context}`;
      return `Great, I've booked your appointment for ${slot.toLocaleDateString()} at ${slot.toLocaleTimeString()}. Here's what I understood: ${details}. Confirmation ID: ${bookingResult.eventId}`;
    }
  }
  return "I'm sorry, there are no available time slots at the moment. Could you try a different date?";

// FAQ handler
const prompt = `You are a helpful FAQ assistant. Answer the following question based on the context provided.\n\nFAQ Context:\n${contextHistory}\n\nRecent conversation:\n${context}\n\nQuestion: ${transcript}\n\nProvide a helpful, concise answer.`;

// General query handler
const prompt = `You are a helpful voice assistant. Respond naturally to the user's query.\n\nConversation history:\n${context}\n\nUser: ${transcript}\n\nRespond helpfully and concisely.`;