Small clinics miss after-hours calls and get overwhelmed during peak times, leading to lost appointments and patient frustration. Staff spend too much time on the phone instead of in-clinic care.
In this tutorial you’ll build a voice AI receptionist for medical and dental clinics using Google Gemini. The system answers real-time phone calls via Twilio, transcribes speech with Deepgram Nova-3, runs conversation logic through Gemini 2.5 Flash with tool calling, and speaks back through Cartesia Sonic-3.5 TTS. It connects to an EHR system for checking availability, booking appointments, and sending SMS confirmations and reminders. By the end you’ll have a fully tested Next.js project that wires the entire STT → LLM → TTS pipeline.
Prerequisites
Node.js 22+ and pnpm 10 installed on your machine
A Twilio account with a voice-enabled phone number (for telephony and SMS)
API keys from: Google Gemini, Deepgram, Cartesia, and Langfuse (optional)
Familiarity with TypeScript, Next.js App Router, and basic WebSocket concepts
An MCP-compatible EHR/calendar server endpoint (or you can mock one)
Step 1: Scaffold the project and pin dependencies
Start by creating a new Next.js project with App Router and TypeScript. Then install all the dependencies the voice pipeline needs.
Now open package.json and replace its contents with the exact-pinned dependencies shown below. Every version is pinned precisely (no ^ or ~ range prefixes) so builds are reproducible.
Run pnpm install to install everything. Next, create .env.example with the placeholder values the pipeline needs:
env
# Env vars used by google-gemini-voice-agent-for-clinic-appointment-scheduling.
# Keep placeholders only; never commit real values.
NODE_ENV=development

# Google Gemini AI
GEMINI_API_KEY=<your-gemini-api-key>

# Twilio telephony + SMS
TWILIO_ACCOUNT_SID=<your-twilio-account-sid>
TWILIO_AUTH_TOKEN=<your-twilio-auth-token>
TWILIO_FROM_NUMBER=<your-twilio-phone-number>

# Deepgram speech-to-text
DEEPGRAM_API_KEY=<your-deepgram-api-key>

# Cartesia text-to-speech
CARTESIA_TOKEN=<your-cartesia-token>
CARTESIA_VOICE_ID=<optional-voice-id>

# Langfuse observability
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>

# MCP endpoint for EHR/calendar tools
MCP_ENDPOINT=<your-mcp-server-endpoint>
MCP_API_KEY=<your-mcp-api-key>

# Session TTL in seconds (default 3600)
SESSION_TTL=3600

# Latency budgets in ms (defaults: target=800, hardCap=1200)
LATENCY_TARGET_MS=800
LATENCY_HARD_CAP_MS=1200
Expected output: pnpm install finishes without errors, and pnpm typecheck passes. A .env file copied from .env.example and filled with real API keys is ready.
Step 2: Set up the Next.js configuration
The voice pipeline starts a WebSocket server when the app boots via src/instrumentation.ts. The default Next.js configuration detects the instrumentation file automatically, so no special config is needed.
Create next.config.ts:
ts
import type { NextConfig } from "next";

const nextConfig: NextConfig = {};

export default nextConfig;
Expected output: The config file compiles without TypeScript errors. The instrumentation file at src/instrumentation.ts will be picked up automatically by the Next.js runtime.
Step 3: Create the Zod-validated environment configuration
You’ll validate every environment variable at runtime with Zod. Create src/lib/config.ts.
defineConfig creates a typed configuration object that the voice-agent-core pipeline consumes. The ClinicEnvSchema validates that required keys exist — if GEMINI_API_KEY is missing or MCP_ENDPOINT isn’t a URL, parseEnv() throws a ZodError with a clear path.
Expected output: Running npx tsx src/lib/config.ts prints no errors when env vars are set, and throws a ZodError with path: ['GEMINI_API_KEY'] when they’re missing.
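To make the validation behaviour concrete, here is a rough, dependency-free sketch of the same idea. It deliberately does not use Zod and is not the recipe's ClinicEnvSchema: the names EnvIssueError and parseClinicEnv are hypothetical, and only three of the variables from .env.example are checked.

```typescript
// Hypothetical stand-in for a Zod schema. It mirrors the "throw with a path"
// behaviour described above, not the recipe's real config module.
type RawEnv = Record<string, string | undefined>;

interface ClinicEnv {
  geminiApiKey: string;
  mcpEndpoint: string;
  sessionTtlSeconds: number;
}

class EnvIssueError extends Error {
  readonly path: string[];

  constructor(path: string[], message: string) {
    super(`${path.join('.')}: ${message}`);
    this.name = 'EnvIssueError';
    this.path = path;
  }
}

function requireString(env: RawEnv, key: string): string {
  const value = env[key];
  if (value === undefined || value.trim() === '') {
    throw new EnvIssueError([key], 'required');
  }
  return value;
}

function parseClinicEnv(env: RawEnv): ClinicEnv {
  const geminiApiKey = requireString(env, 'GEMINI_API_KEY');
  const mcpEndpoint = requireString(env, 'MCP_ENDPOINT');
  try {
    new URL(mcpEndpoint); // throws when the value is not a valid URL
  } catch {
    throw new EnvIssueError(['MCP_ENDPOINT'], 'must be a valid URL');
  }
  // SESSION_TTL is optional; .env.example documents 3600 as the default.
  const ttl = Number(env['SESSION_TTL'] ?? '3600');
  if (!Number.isInteger(ttl) || ttl <= 0) {
    throw new EnvIssueError(['SESSION_TTL'], 'must be a positive integer');
  }
  return { geminiApiKey, mcpEndpoint, sessionTtlSeconds: ttl };
}
```

Passing the environment in as an argument, rather than reading process.env inside the function, keeps the check easy to unit test with plain objects.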
Step 4: Build the EHR adapter for clinic operations
The EHRAdapter communicates with the clinic’s appointment system. It checks availability, books appointments, cancels them, looks up patients, lists existing appointments, and reschedules. Every method validates input with Zod before making HTTP calls.
An HTTP 204 response is treated as an empty result rather than an error (for example, checkAvailability with no open slots). Network failures are wrapped in EHRAPIError with a 'NETWORK' code, and error responses include the response body for context.
Expected output: You can instantiate new EHRAdapter({ baseUrl: 'http://localhost:9090', apiKey: 'test-key' }) and call checkAvailability('2026-06-10') — it returns [] if the server returns 204.
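The response handling described in this step can be sketched as a pair of pure functions. This is an illustration under assumptions, not the recipe's EHRAdapter: the names Slot, EhrRequestError, interpretAvailability, and wrapNetworkFailure are hypothetical, and the slot shape is invented for the example. It follows the expected output above in mapping a 204 to an empty list.

```typescript
// Hypothetical names throughout; the recipe's EHRAdapter and EHRAPIError are
// not shown, so this only illustrates the behaviour described in the text.
interface Slot {
  start: string; // ISO 8601 timestamp (assumed shape)
  providerId: string;
}

class EhrRequestError extends Error {
  readonly code: string;

  constructor(code: string, message: string) {
    super(message);
    this.name = 'EhrRequestError';
    this.code = code;
  }
}

// Turns an HTTP status plus the raw body text into a list of slots.
function interpretAvailability(status: number, bodyText: string): Slot[] {
  if (status === 204) return []; // no content: nothing is available
  if (status < 200 || status >= 300) {
    // Keep the response body so the caller can see why the server refused.
    throw new EhrRequestError(`HTTP_${status}`, bodyText || 'empty error body');
  }
  const parsed: unknown = JSON.parse(bodyText);
  if (!Array.isArray(parsed)) {
    throw new EhrRequestError('BAD_PAYLOAD', 'expected a JSON array of slots');
  }
  return parsed as Slot[];
}

// Wraps a failed network attempt (DNS failure, refused connection, and so on).
function wrapNetworkFailure(cause: unknown): EhrRequestError {
  const detail = cause instanceof Error ? cause.message : String(cause);
  return new EhrRequestError('NETWORK', detail);
}
```

Separating status interpretation from the HTTP call itself means the 204, error-body, and network paths can each be tested without a running server.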
Step 5: Build the Twilio SMS reminder service
The SMSService sends appointment reminders, confirmations, and cancellation notices via Twilio SMS. All message bodies are kept under 160 characters to stay in a single SMS segment.
Create src/lib/sms-service.ts:
ts
import twilio from 'twilio';
import { RestException } from 'twilio';

export interface SMSResult {
  sid?: string;
  status?: string;
  error?: string;
  detail?: string;
}

export class SMSServiceError extends Error {
  public readonly code: string;

  constructor(code: string, message: string) {
    super(message);
    this.name = 'SMSServiceError';
    // ...
The service reads TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN from the environment and wraps every message in error handling: 4xx errors (like invalid phone numbers) return a structured { error, detail } result, while 5xx errors rethrow as SMSServiceError.
Expected output: new SMSService() throws SMSServiceError with code 'MISSING_CONFIG' if env vars aren’t set. With mocked twilio, the service returns { sid, status }.
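Two rules from this step, the 160-character ceiling and the 4xx versus 5xx split, can be sketched without touching the Twilio SDK. The helper names and the message wording below are hypothetical, and the real SMSService may enforce the limit differently.

```typescript
// Hypothetical helpers; the recipe's SMSService is only partly shown above.
// 160 is the single-segment limit stated in this step. It applies to GSM-7
// text; messages with non-GSM characters have a lower per-segment limit.
const SINGLE_SEGMENT_LIMIT = 160;

// Builds a confirmation text and trims it to fit in one segment.
function buildConfirmationBody(clinic: string, when: string): string {
  const body = `${clinic}: your appointment is confirmed for ${when}.`;
  if (body.length <= SINGLE_SEGMENT_LIMIT) return body;
  return body.slice(0, SINGLE_SEGMENT_LIMIT - 3) + '...';
}

interface SmsOutcome {
  sid?: string;
  status?: string;
  error?: string;
  detail?: string;
}

// 4xx: the request itself was wrong (for example an invalid number), so the
// failure is reported as data. 5xx: the provider failed, so it is rethrown.
function outcomeForFailure(httpStatus: number, message: string): SmsOutcome {
  if (httpStatus >= 400 && httpStatus < 500) {
    return { error: `HTTP_${httpStatus}`, detail: message };
  }
  throw new Error(`SMS provider error ${httpStatus}: ${message}`);
}
```

Returning 4xx failures as values lets the voice agent tell the caller that the number looks wrong, while 5xx failures surface to whatever retry or alerting logic wraps the service.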
Step 6: Build the Gemini-powered LLM service
The ClinicLLMService wraps Google Gemini with function declarations for clinic operations. It sends the conversation history, processes the function calls that Gemini requests, and returns natural-language responses.
Create src/lib/llm-service.ts:
ts
import { GoogleGenAI, type FunctionDeclaration } from '@google/genai';

const clinicTools: FunctionDeclaration[] = [
  {
    name: 'check_availability',
    description: 'Check available appointment slots for a given date and optional provider',
    parametersJsonSchema: {
      type: 'object',
      properties: {
        date: { type: 'string', description: 'Date in YYYY-MM-DD format' },
        providerId: { type: 'string', description: 'Optional provider ID to filter by' },
      },
      required: ['date'],
    },
  },
  {
    name: 'book_appointment',
    description: 'Book an appointment for a patient at a specific time slot'
    // ...
When Gemini returns a functionCalls array, the service resolves each one through registered handlers, sends the results back to Gemini for a final response, and handles 429 rate-limits with a single retry after one second.
Expected output: With mocked @google/genai, generateResponse([{ role: 'user', content: 'Hello' }]) returns the text from the mock.
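The handler dispatch and single-retry behaviour described above might look roughly like this. The sketch assumes a function call can be modelled as a name plus an args object and that handlers live in a Map keyed by tool name; ToolCall, resolveToolCalls, and withSingleRetry are hypothetical names, not part of @google/genai or the recipe's ClinicLLMService.

```typescript
// Hypothetical shapes: a function call is modelled as { name, args } and
// handlers are looked up by name. No SDK types are used here.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

interface ToolResult {
  name: string;
  response: unknown;
}

// Runs every requested call through its registered handler. Unknown tools and
// handler failures become error payloads so the model can respond to them.
async function resolveToolCalls(
  calls: ToolCall[],
  handlers: Map<string, ToolHandler>,
): Promise<ToolResult[]> {
  const results: ToolResult[] = [];
  for (const call of calls) {
    const handler = handlers.get(call.name);
    if (!handler) {
      results.push({ name: call.name, response: { error: 'unknown tool' } });
      continue;
    }
    try {
      results.push({ name: call.name, response: await handler(call.args) });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ name: call.name, response: { error: message } });
    }
  }
  return results;
}

// Retries exactly once, after a delay, when the failure looks like a rate limit.
async function withSingleRetry<T>(
  attempt: () => Promise<T>,
  isRateLimited: (err: unknown) => boolean,
  delayMs = 1000,
): Promise<T> {
  try {
    return await attempt();
  } catch (err) {
    if (!isRateLimited(err)) throw err;
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    return attempt();
  }
}
```

The results array is what would be sent back to the model for its final answer; how a 429 is detected depends on the error shape the SDK throws, which is why the predicate is passed in.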
Step 7: Create the speech-to-text and text-to-speech providers
The STT provider wraps Deepgram Nova-3 for live transcription and the TTS provider wraps Cartesia Sonic-3.5 for speech synthesis.
Create src/lib/stt-service.ts:
ts
import { DeepgramClient } from '@deepgram/sdk';

export function createDeepgramSTTProvider(): DeepgramClient {
  const apiKey = process.env.DEEPGRAM_API_KEY;
  if (!apiKey) {
    throw new Error('DEEPGRAM_API_KEY environment variable is required');
  }
  return new DeepgramClient({ apiKey });
}

export interface UtteranceEvent {
  isFinal: boolean;
  transcript: string;
  confidence: number;
}

export type DeepgramConnection = ReturnType<
  Awaited<ReturnType<DeepgramClient['listen']['v1']['connect']>>['connect']
>;

export async function connectDeepgramLive(
  client: DeepgramClient,
  onUtterance: (utterance: UtteranceEvent) => void,
): Promise<Awaited<ReturnType<DeepgramClient['listen']['v1']['connect']>>> {
  const apiKey = process.env.DEEPGRAM_API_KEY ?? '';
  const connection = await client.listen.v1.connect({
    model: 'nova-3',
    language: 'en',
    punctuate: 'true',
    interim_results: 'true',
    Authorization: `token ${apiKey}`,
  });
  connection.connect();
  connection.on('message', (msg: unknown) => {
    const data = msg as {
      type?: string;
      channel?: { alternatives?: Array<{ transcript: string; confidence: number }> };
      is_final?: boolean;
    };
    if (data.type === 'Results' && data.channel?.alternatives) {
      for (const alt of data.channel.alternatives) {
        onUtterance({
          isFinal: data.is_final ?? false,
          transcript: alt.transcript,
          confidence: alt.confidence,
        });
      }
    }
  });
  return connection;
}

export function sendAudioToDeepgram(
  connection: { sendMedia: (data: Buffer) => void },
  audioData: Buffer,
): void {
  connection.sendMedia(audioData);
}
The Deepgram connection filters for Results messages and emits structured UtteranceEvent objects with isFinal, transcript, and confidence. Cartesia’s synthesizeSpeech returns a Buffer that can be sent directly to the Twilio media stream.
Expected output: createDeepgramSTTProvider() returns a DeepgramClient instance. createCartesiaTTSProvider() returns a Cartesia instance.
Step 8: Build session management with memory storage
The SessionManager from @reaatech/session-continuity maintains multi-turn conversation context. It needs a storage adapter and a token counter.
Create src/lib/character-token-counter.ts:
ts
import { type TokenCounter, type Message } from '@reaatech/session-continuity';

export function countText(text: string): number {
  if (!text) return 0;
  return text.split(/\s+/).filter(Boolean).length;
}

export class CharacterTokenCounter implements TokenCounter {
  readonly model = 'character-estimator';
  readonly tokenizer = 'whitespace-split';

  count(text: string): number {
    return countText(text);
  }

  countMessages(messages: Message[]): number {
    let total = 0;
    for (const message of messages) {
      const content = typeof message.content === 'string' ? message.content : '';
      total += this.count(content);
    }
    return total;
  }
}
Create src/lib/memory-storage-adapter.ts, which implements IStorageAdapter on top of an in-memory Map. It stores sessions with optimistic concurrency (a version mismatch throws ConcurrencyError) and supports message CRUD, session listing with filters, and getExpiredSessions.
ts
import {
  type IStorageAdapter,
  type Session,
  type Message,
  type SessionId,
  type MessageId,
  type UpdateSessionOptions,
  type SessionFilters,
  type MessageQueryOptions,
  type HealthStatus,
  ConcurrencyError,
} from '@reaatech/session-continuity';

interface InternalMessage extends Message {
  sequence: number;
}

interface InternalSession extends Session {
  version: number;
}

export class MemoryStorageAdapter implements IStorageAdapter {
  // ...
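The optimistic-concurrency idea behind the adapter can be shown in isolation. The following is a minimal sketch with hypothetical names (VersionedStore, VersionConflictError); it does not implement IStorageAdapter and stands in for the library's ConcurrencyError with its own error class.

```typescript
// Hypothetical store showing optimistic concurrency with a version field.
// It is not the recipe's MemoryStorageAdapter.
class VersionConflictError extends Error {
  constructor(id: string, expected: number, actual: number) {
    super(`session ${id}: expected version ${expected}, found ${actual}`);
    this.name = 'VersionConflictError';
  }
}

interface VersionedRecord<T> {
  version: number;
  value: T;
}

class VersionedStore<T> {
  private readonly records = new Map<string, VersionedRecord<T>>();

  create(id: string, value: T): VersionedRecord<T> {
    const record = { version: 1, value };
    this.records.set(id, record);
    return { ...record }; // hand out a copy so callers cannot mutate the store
  }

  get(id: string): VersionedRecord<T> | undefined {
    const record = this.records.get(id);
    return record ? { ...record } : undefined;
  }

  // The write succeeds only if the caller read the latest version.
  update(id: string, expectedVersion: number, value: T): VersionedRecord<T> {
    const current = this.records.get(id);
    if (!current) throw new Error(`session ${id} not found`);
    if (current.version !== expectedVersion) {
      throw new VersionConflictError(id, expectedVersion, current.version);
    }
    const next = { version: current.version + 1, value };
    this.records.set(id, next);
    return { ...next };
  }
}
```

A writer that loses the race gets a conflict error instead of silently overwriting newer data, and can re-read the session and retry.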
Create src/services/session-service.ts that wires the adapter and counter into a SessionManager:
ts
import {
  SessionManager,
  type Session,
  type Message,
} from '@reaatech/session-continuity';
import { MemoryStorageAdapter } from '../lib/memory-storage-adapter.js';
import { CharacterTokenCounter } from '../lib/character-token-counter.js';

export function createClinicSessionManager(): SessionManager {
  return new SessionManager({
    storage: new MemoryStorageAdapter(),
    tokenCounter: new CharacterTokenCounter(),
    tokenBudget: {
      maxTokens: 4096,
      reserveTokens: 500,
      overflowStrategy: 'compress',
    },
    compression: {
      strategy: 'sliding_window',
      targetTokens: 3500,
    },
  });
}

export async function createSession(
  manager: SessionManager,
  userId: string,
): Promise<Session> {
  return manager.createSession({ userId });
}

export async function addMessage(
  manager: SessionManager,
  sessionId: string,
  role: 'user' | 'assistant' | 'system' | 'tool',
  content: string,
): Promise<Message> {
  return manager.addMessage(sessionId, { role, content });
}

export async function getConversationContext(
  manager: SessionManager,
  sessionId: string,
): Promise<Message[]> {
  return manager.getConversationContext(sessionId);
}

export async function endSession(
  manager: SessionManager,
  sessionId: string,
): Promise<void> {
  await manager.endSession(sessionId);
}
Expected output: createClinicSessionManager() returns a SessionManager. createSession(manager, 'user-1') returns a Session with the userId set.
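The configuration above asks for 'sliding_window' compression toward a token target. As a rough illustration of what such a strategy does, and not the library's actual implementation, the sketch below keeps the newest messages that fit the target, using the same whitespace estimate as CharacterTokenCounter. ChatMessage, estimateTokens, and slidingWindow are hypothetical names.

```typescript
// Hypothetical sketch of sliding-window trimming; the compression performed by
// @reaatech/session-continuity may differ.
interface ChatMessage {
  role: 'user' | 'assistant' | 'system' | 'tool';
  content: string;
}

// Same whitespace-based estimate as the token counter in this step.
function estimateTokens(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

// Keeps the most recent messages whose combined estimate fits the target.
function slidingWindow(messages: ChatMessage[], targetTokens: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;
  // Walk from newest to oldest and stop at the first message that no longer fits.
  for (const message of [...messages].reverse()) {
    const cost = estimateTokens(message.content);
    if (used + cost > targetTokens) break;
    kept.unshift(message); // restore chronological order
    used += cost;
  }
  return kept;
}
```

Note that a whitespace split counts words rather than model tokens, so the real token usage will usually be higher than this estimate.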
Step 9: Create the voice pipeline orchestrator
The pipeline service wires STT, TTS, MCP, session management, and latency enforcement into a unified call pipeline. It registers event handlers for observability and exposes startCallSession, processInboundAudio, endCallSession, and handleBargeIn.
Create src/services/pipeline-service.ts:
ts
import {
  createPipeline,
  initializeSessionManager,
  createLatencyBudget,
  LatencyBudgetEnforcer,
  type Pipeline,
  type PipelineEvent,
  type STTProvider,
  type TTSProvider,
  type MCPClient as CoreMCPClient,
  type PipelineDependencies,
  type AudioChunk,
} from '@reaatech/voice-agent-core';
import { clinicConfig } from '../lib/config.js';
import { traceTurn, traceLLMCall } from '../lib/observability.js';

export function createVoicePipeline(
  sttProvider: STTProvider,
  ttsProvider: TTSProvider,
  mcpClient: CoreMCPClient,
): ReturnType<typeof createPipelineBackend> {
  const sessionManager = initializeSessionManager({
    defaultTTL: 1800,
    maxTurns: 20,
    maxTokens: 4096,
  });
  const budget = createLatencyBudget({
    target: 800,
    hardCap: 1200,
    stt: 200,
    mcp: 400,
    tts: 200,
  });
  const latencyEnforcer = new LatencyBudgetEnforcer(budget);
  return createPipelineBackend({
    sessionManager,
    latencyEnforcer,
    sttProvider,
    ttsProvider,
    mcpClient,
    config: clinicConfig,
  });
}

function createPipelineBackend(deps: PipelineDependencies) {
  const pipeline: Pipeline = createPipeline(deps);

  pipeline.on('pipeline:stt:final', (event: PipelineEvent) => {
    traceTurn(event.sessionId, event.turnId ?? '', { stage: 'stt', payload: event.data });
  });
  pipeline.on('pipeline:mcp:response', () => {
    traceLLMCall('gemini-2.5-flash', 0, 0, 0);
  });
  pipeline.on('pipeline:turn:end', (event: PipelineEvent) => {
    const eventData = event.data as Record<string, unknown>;
    const metrics = eventData.metrics as Record<string, unknown> | undefined;
    if (metrics) {
      traceTurn(event.sessionId, event.turnId ?? '', metrics);
    }
  });
  pipeline.on('pipeline:error', (event: PipelineEvent) => {
    console.error('[pipeline:error]', event);
  });

  async function startCallSession(sessionId: string) {
    await pipeline.startSession({ sessionId, status: 'active' });
  }

  async function processInboundAudio(sessionId: string, chunk: AudioChunk) {
    await pipeline.processAudioChunk(sessionId, chunk);
  }

  async function endCallSession(sessionId: string) {
    await pipeline.endSession(sessionId);
  }

  function handleBargeIn(sessionId: string) {
    pipeline.bargeIn(sessionId);
  }

  function destroy() {
    pipeline.destroy();
  }

  return { pipeline, startCallSession, processInboundAudio, endCallSession, handleBargeIn, destroy };
}
Expected output: createVoicePipeline(sttProvider, ttsProvider, mcpClient) returns an object with pipeline, startCallSession, processInboundAudio, endCallSession, handleBargeIn, and destroy functions.
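The budget numbers in the pipeline are easier to reason about with a small worked example. The per-stage allowances (200 + 400 + 200 ms) add up to the 800 ms target, and the 1200 ms hard cap leaves headroom above it. The checker below is a hypothetical illustration of those numbers, not the LatencyBudgetEnforcer from @reaatech/voice-agent-core; slowStages and classifyTurn are invented names.

```typescript
// Hypothetical checker that only illustrates the budget values used above.
interface LatencyBudget {
  target: number; // desired end-to-end latency per turn, in ms
  hardCap: number; // absolute ceiling per turn, in ms
  stt: number;
  mcp: number;
  tts: number;
}

type Stage = 'stt' | 'mcp' | 'tts';
type Verdict = 'within-target' | 'over-target' | 'over-hard-cap';

const clinicBudget: LatencyBudget = { target: 800, hardCap: 1200, stt: 200, mcp: 400, tts: 200 };

// Lists the stages that exceeded their own share of the budget.
function slowStages(budget: LatencyBudget, timings: Record<Stage, number>): Stage[] {
  const stages: Stage[] = ['stt', 'mcp', 'tts'];
  return stages.filter((stage) => timings[stage] > budget[stage]);
}

// Classifies the whole turn against the target and the hard cap.
function classifyTurn(budget: LatencyBudget, timings: Record<Stage, number>): Verdict {
  const total = timings.stt + timings.mcp + timings.tts;
  if (total > budget.hardCap) return 'over-hard-cap';
  if (total > budget.target) return 'over-target';
  return 'within-target';
}
```

For a turn that takes 150 ms in STT, 600 ms in MCP, and 180 ms in TTS, the total is 930 ms: over the target but under the hard cap, with MCP as the only stage over its allowance.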
Step 10: Build the telephony and MCP client services
The telephony service wraps @reaatech/voice-agent-telephony to handle Twilio Media Streams. The MCP client service connects to an MCP endpoint for EHR/calendar tool discovery.
Expected output: createVoiceCallHandler() returns a handler configured with barge-in enabled. createClinicMCPClient() connects to the MCP endpoint and discovers available tools.
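For orientation, this is roughly what the telephony layer has to deal with underneath. Twilio Media Streams delivers JSON text frames over the WebSocket, where "media" frames carry base64-encoded 8 kHz mu-law audio. The sketch below parses those frames by hand; the StreamEvent type and function names are hypothetical and are not the API of @reaatech/voice-agent-telephony.

```typescript
// Field names (event, start.streamSid, start.callSid, media.payload) follow
// Twilio's Media Streams messages; everything else here is hypothetical.
type StreamEvent =
  | { kind: 'start'; streamSid: string; callSid: string }
  | { kind: 'media'; audio: Uint8Array }
  | { kind: 'stop' }
  | { kind: 'ignored' };

// Decodes a base64 string into raw bytes.
function decodeBase64(payload: string): Uint8Array {
  const binary = atob(payload);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i += 1) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Converts one WebSocket text frame into a typed event.
function parseStreamMessage(raw: string): StreamEvent {
  const msg = JSON.parse(raw) as {
    event?: string;
    start?: { streamSid?: string; callSid?: string };
    media?: { payload?: string };
  };
  switch (msg.event) {
    case 'start':
      return {
        kind: 'start',
        streamSid: msg.start?.streamSid ?? '',
        callSid: msg.start?.callSid ?? '',
      };
    case 'media':
      // The payload is base64-encoded mu-law audio sampled at 8 kHz.
      return { kind: 'media', audio: decodeBase64(msg.media?.payload ?? '') };
    case 'stop':
      return { kind: 'stop' };
    default:
      return { kind: 'ignored' }; // "connected", "mark", and anything unknown
  }
}
```

The decoded bytes from each "media" event are what would be forwarded to the speech-to-text connection, and a "stop" event is the natural point to end the call session.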
Step 11: Wire the WebSocket server and route handler
The instrumentation file starts a WebSocket server on boot that accepts Twilio Media Streams connections. The route handler returns TwiML that directs Twilio to stream audio to the WebSocket.
Create src/instrumentation.ts:
The TwiML response tells Twilio: “When a call comes in, connect it to a Media Stream at wss://your-domain.com/api/voice.” The GET handler rejects non-POST requests with 405.
Expected output: A POST to /api/voice with CallSid=CA123 returns XML containing <Stream url="wss://localhost:3000/api/voice"/>. A GET returns 405 with { error: "method not allowed" }.
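The TwiML itself is small enough to build by hand. The helper below is a sketch under assumptions: buildStreamTwiml and escapeXmlAttribute are hypothetical names, the host is passed in by the caller, and the recipe's route handler may construct the response differently (for example with Twilio's helper library).

```typescript
// Hypothetical helper; the recipe's route handler is not shown in this step.
function escapeXmlAttribute(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Builds TwiML that connects the call to a bidirectional Media Stream.
function buildStreamTwiml(host: string, path = '/api/voice'): string {
  const url = escapeXmlAttribute(`wss://${host}${path}`);
  return (
    '<?xml version="1.0" encoding="UTF-8"?>' +
    `<Response><Connect><Stream url="${url}"/></Connect></Response>`
  );
}
```

Using Connect rather than Start matters here: a stream opened with Connect is bidirectional, which the agent needs in order to send synthesized speech back to the caller.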
Step 12: Write and run the tests
The project ships with a comprehensive test suite covering every service and lib module. Here are excerpts from the pipeline service and voice route tests.
And the voice API route test at tests/app/api/voice/route.test.ts:
ts
import { describe, it, expect } from 'vitest';
import { NextRequest } from 'next/server';
import { POST, GET } from '../../../../app/api/voice/route';

describe('voice route', () => {
  it('POST with CallSid returns 200 with text/xml containing Connect and Stream', async () => {
    const req = new NextRequest('http://localhost:3000/api/voice', {
      method: 'POST',
      headers: { 'content-type': 'application/x-www-form-urlencoded' },
      body: new URLSearchParams({ CallSid: 'CA123', From: '+155****4567' }).toString(),
    });
    const response = await POST(req);
    const text = await response.text();
    expect(response.status).toBe(200);
    expect(response.headers.get('content-type')).toBe('text/xml');
    expect(text).toContain('<Connect>');
    expect(text).toContain('<Stream');
  });

  it('POST with empty body returns 400', async () => {
    const req = new NextRequest('http://localhost:3000/api/voice', {
      method: 'POST',
      headers: { 'content-type': 'application/x-www-form-urlencoded' },
      body: '',
    });
    const response = await POST(req);
    const body = await response.json() as { error?: string };
    expect(response.status).toBe(400);
    expect(body.error).toBe('missing CallSid');
  });

  it('GET returns 405 with JSON error', async () => {
    const response = GET();
    const body = await response.json() as { error?: string };
    expect(response.status).toBe(405);
    expect(body.error).toBe('method not allowed');
  });
});
Now run the test suite:
terminal
pnpm test
Expected output: All tests pass with zero failures. The coverage report shows lines, branches, functions, and statements all above 90%.
Next steps
Replace MemoryStorageAdapter with PostgreSQL or Redis — the in-memory store loses data on restart; swap it for a persistent backend that implements the same IStorageAdapter interface.
Add a voice selection menu — extend Gemini’s tool declarations to let callers say “I want Dr. Smith” and route to a specific provider’s availability.
Deploy to production — wrap the Next.js app in a Docker container, point a Twilio phone number at your public endpoint, and add monitoring with Langfuse dashboards.