Small online stores using Square Online lose sales when customers can't get quick order updates and end up calling the owner's personal phone repeatedly.
A complete, working implementation of this recipe is available to download as a zip or to browse file by file. It is generated by our build pipeline and tested with full coverage before publishing.
This recipe builds a conversational AI voice agent that answers inbound phone calls via Twilio, transcribes speech using Deepgram, classifies caller intent with OpenAI, and looks up Square Online order statuses in real time. You will wire up the full pipeline — telephony webhook, WebSocket media relay, speech-to-text, LLM-powered intent classification, Square API order retrieval, text-to-speech, and cost tracking with Langfuse — all running inside a Next.js 16 app with REAA’s voice agent framework.
Prerequisites
Node.js 22+ and pnpm 10+ installed on your machine
A Twilio account with a purchased phone number (trial works)
A Square account with an access token (sandbox is fine)
An OpenAI API key
A Deepgram API key
A Cartesia API key for text-to-speech
A Langfuse account (free tier) for cost observability
Familiarity with TypeScript and basic Next.js App Router patterns
Step 1: Scaffold the project and install dependencies
Create a new Next.js project and install the exact pinned versions of every dependency the voice agent needs.
Expected output: All packages install cleanly with no peer-dep warnings. The node_modules directory contains the REAA packages with their TypeScript declarations.
Add a test script to your package.json and configure vitest. Your package.json scripts block should look like this:
Create .env.example with all the secrets the app reads at runtime. Every env var here must be set in your actual .env.local when you deploy.
env
# Env vars used by openai-voice-agent-for-square-online-order-status-inquiries.
# Keep placeholders only; never commit real values.
NODE_ENV=development
TWILIO_ACCOUNT_SID=<your-twilio-account-sid>
TWILIO_AUTH_TOKEN=<your-twilio-auth-token>
TWILIO_PHONE_NUMBER=<your-twilio-phone-number>
SQUARE_ACCESS_TOKEN=<your-square-access-token>
OPENAI_API_KEY=<your-openai-key>
DEEPGRAM_API_KEY=<your-deepgram-key>
CARTESIA_API_KEY=<your-cartesia-key>
LANGFUSE_PUBLIC_KEY=<your-langfuse-public-key>
LANGFUSE_SECRET_KEY=<your-langfuse-secret-key>
LANGFUSE_HOST=https://cloud.langfuse.com
WS_PORT=8080
Copy this to .env.local and fill in real credentials for local testing. The WS_PORT controls where the WebSocket media server listens for Twilio audio streams.
Step 3: Define the shared domain types
Create src/lib/types.ts — these interfaces are the shared vocabulary across every service in the app.
Expected output: TypeScript compiles these without errors. OrderInfo holds the fields you will map from Square’s API response.
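As a rough illustration of the order type, here is a minimal sketch. The field names (orderId, status, lineItems, totalAmount, placedAt) are the ones this recipe's response-generation prompt reads from orderInfo. The concrete types and the describeOrder helper are assumptions made for illustration, not the recipe's actual src/lib/types.ts.

```typescript
// Sketch only: field names follow how the recipe's prompt reads orderInfo.
// The types chosen for each field are assumptions.
export interface OrderInfo {
  orderId: string;
  status: string; // e.g. "OPEN" or "COMPLETED"
  lineItems: string[]; // human-readable item descriptions
  totalAmount: string; // already formatted for speech, e.g. "25.99 USD"
  placedAt: string; // timestamp string as returned by the order source
}

// Hypothetical helper: builds the one-line order summary sent to the LLM.
export function describeOrder(order: OrderInfo): string {
  return (
    `Order ${order.orderId}: status ${order.status}, ` +
    `items ${order.lineItems.join(", ")}, ` +
    `total ${order.totalAmount}, placed at ${order.placedAt}.`
  );
}
```

Keeping totalAmount as a preformatted string means the prompt template never has to know about currency units.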
Step 4: Build the Square order lookup service
Create src/services/square-client.ts. This service wraps Square’s SDK to look up an order by ID and also searches orders by phone number via Square’s REST search endpoint.
Expected output: The Square SDK initializes with your SQUARE_ACCESS_TOKEN. getOrder fetches a single order by ID; searchOrdersByPhone does a POST to the Square Orders search API with a customer phone filter.
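To make the field mapping concrete, here is a minimal sketch that converts a Square order, as it appears in the REST API's JSON, into the OrderInfo shape. The function name toOrderInfo and the fallback strings are assumptions. It also assumes a two-decimal currency such as USD, because Square reports total_money.amount in the smallest currency unit. The Square Node SDK exposes camelCase properties and bigint amounts instead, so the recipe's own square-client.ts may look different.

```typescript
// Subset of the Square Order JSON used by this sketch (snake_case, as in the REST API).
interface SquareOrderJson {
  id: string;
  state?: string;
  created_at?: string;
  line_items?: { name?: string; quantity: string }[];
  total_money?: { amount?: number; currency?: string };
}

// Same shape as the OrderInfo sketch; the field types are assumptions.
interface OrderInfo {
  orderId: string;
  status: string;
  lineItems: string[];
  totalAmount: string;
  placedAt: string;
}

// Hypothetical mapper from the raw Square order to the app's OrderInfo.
function toOrderInfo(order: SquareOrderJson): OrderInfo {
  const money = order.total_money;
  // Assumption: two-decimal currency, so dividing by 100 yields major units.
  const totalAmount =
    money?.amount !== undefined && money.currency !== undefined
      ? `${(money.amount / 100).toFixed(2)} ${money.currency}`
      : "unknown";
  return {
    orderId: order.id,
    status: order.state ?? "UNKNOWN",
    lineItems: (order.line_items ?? []).map(
      (item) => `${item.quantity} x ${item.name ?? "item"}`,
    ),
    totalAmount,
    placedAt: order.created_at ?? "",
  };
}
```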
Step 5: Build the OpenAI client for intent classification
Create src/services/openai-client.ts. This module wraps the OpenAI SDK with four functions: classifying the caller’s intent, extracting an order number from the transcript, generating a friendly response given order data, and asking a clarification question when the intent is ambiguous.
ts
import OpenAI from "openai";
import { OrderInfo } from "../lib/types.js";

let _openai: OpenAI | null = null;

function getOpenAI(): OpenAI {
  if (!_openai) {
    _openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY ?? "" });
  }
  return _openai;
}

export const openai = new Proxy({} as OpenAI, {
  get(_target: OpenAI, prop: keyof OpenAI
Expected output: The openai export uses a Proxy to lazily initialize the SDK, so the client isn’t created until the first API call. Classification and extraction use gpt-5.2-mini, response generation uses gpt-5.2, and the structured tasks set response_format: { type: "json_object" }.
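Because the classifier's answer arrives as a JSON string from the model, it is worth parsing it defensively before handing it to the router. The sketch below shows one way to do that. The helper name parseClassification and the "unknown" fallback label are assumptions. Only the { label, confidence } return shape comes from classifyOrderQuery's signature.

```typescript
interface Classification {
  label: string;
  confidence: number;
}

// Hypothetical fallback used whenever the model output cannot be trusted.
const FALLBACK: Classification = { label: "unknown", confidence: 0 };

// Parses the raw message content from a JSON-mode completion into a
// Classification, clamping confidence into the range [0, 1].
function parseClassification(raw: string | null | undefined): Classification {
  if (!raw) return { ...FALLBACK };
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) return { ...FALLBACK };
    const { label, confidence } = parsed as Record<string, unknown>;
    if (typeof label !== "string" || typeof confidence !== "number") {
      return { ...FALLBACK };
    }
    return { label, confidence: Math.min(1, Math.max(0, confidence)) };
  } catch {
    // JSON.parse throws on malformed output; treat it as "no usable intent".
    return { ...FALLBACK };
  }
}
```

Returning a zero-confidence fallback instead of throwing lets the confidence router handle bad model output the same way it handles an unclear caller.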
Step 6: Set up Langfuse cost tracking
Create src/services/cost-tracker.ts to record per-turn LLM usage to Langfuse for observability and billing analysis.
Expected output: Each call to trackUsage creates a CostSpan via @reaatech/llm-cost-telemetry’s generateId and calculateCostFromTokens, then emits a Langfuse trace so you can visualize your per-call spend in the Langfuse dashboard.
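The arithmetic behind per-turn cost tracking can be sketched independently of any library. The code below does not use @reaatech/llm-cost-telemetry or the Langfuse SDK, and the type names, function names, and prices in the usage note are placeholders rather than real rates. It only illustrates how token counts turn into a dollar figure.

```typescript
interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
}

// Prices are expressed per one million tokens. The values are supplied by the
// caller; nothing here reflects actual published pricing.
interface ModelPricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// Cost of a single LLM call: input and output tokens are billed separately.
function estimateCostUsd(usage: TokenUsage, pricing: ModelPricing): number {
  return (
    (usage.promptTokens / 1_000_000) * pricing.inputPerMillionUsd +
    (usage.completionTokens / 1_000_000) * pricing.outputPerMillionUsd
  );
}

// Cost of a whole phone call: the sum over every turn's LLM usage.
function totalCallCostUsd(turns: TokenUsage[], pricing: ModelPricing): number {
  return turns.reduce((sum, turn) => sum + estimateCostUsd(turn, pricing), 0);
}
```

With placeholder prices of 1 USD per million input tokens and 2 USD per million output tokens, a turn using 1,000 prompt tokens and 500 completion tokens costs about 0.002 USD.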
Step 7: Create session continuity and the confidence router
Create src/services/session-store.ts to manage call sessions across multiple turns. This wraps REAA’s SessionManager with a memory-backed store and tokenizer configuration.
ts
import { SessionManager, type Session, type Message } from "@reaatech/session-continuity";
import { TiktokenTokenizer } from "@reaatech/session-continuity-tokenizers";
import { MemoryAdapter } from "@reaatech/session-continuity-storage-memory";

export const callSessionManager = new SessionManager({
  storage: new MemoryAdapter(),
  tokenCounter: new TiktokenTokenizer("gpt-4"),
  tokenBudget: {
    maxTokens: 4096,
    reserveTokens: 500,
    overflowStrategy: "compress" as const,
  },
  compression: {
    strategy: "sliding_window" as const,
    targetTokens: 3500,
  },
  sessionTTL: 3600,
});

export async function createCallSession(callSid: string, from: string): Promise<Session> {
  const session = await callSessionManager.createSession({ userId: from });
  session.metadata.custom = { callSid };
  return session;
}

export async function addUtterance(
  sessionId: string,
  role: "user" | "assistant",
  content: string,
): Promise<Message> {
  return callSessionManager.addMessage(sessionId, { role, content });
}

export async function getHistory(sessionId: string): Promise<Message[]> {
  return callSessionManager.getConversationContext(sessionId);
}

export async function endCallSession(sessionId: string): Promise<void> {
  return callSessionManager.endSession(sessionId);
}
Now create src/services/confidence-router-service.ts to route caller intent based on confidence thresholds:
Create src/services/audio-processor.ts to convert audio formats between Twilio’s mu-law / 8 kHz and Deepgram’s expected linear16 / 16 kHz:
ts
import { STTProviderInterface } from "@reaatech/voice-agent-stt";
import { TTSProviderInterface } from "@reaatech/voice-agent-tts";
import { type AudioChunk } from "@reaatech/voice-agent-core";

export function convertTwilioAudioForDeepgram(chunk: AudioChunk): AudioChunk {
  return STTProviderInterface.convertAudioFormat(chunk, 16000, "linear16");
}

export function formatTTSAudioForTwilio(chunk: AudioChunk): AudioChunk {
  return TTSProviderInterface.formatAudioForTwilio(chunk);
}

export function injectSilence(durationMs: number): AudioChunk {
  return TTSProviderInterface.createSilenceChunk(durationMs);
}
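For readers curious about what a mu-law / 8 kHz to linear16 / 16 kHz conversion involves, here is a minimal sketch. It is not the REAA helper's implementation, which is not shown in this recipe. It uses the standard G.711 mu-law expansion followed by naive linear interpolation to double the sample rate, and both function names are made up for illustration. A production resampler would also apply a low-pass filter.

```typescript
// Expands one 8-bit G.711 mu-law byte into a signed 16-bit linear PCM sample.
function muLawToLinear16(byte: number): number {
  const u = ~byte & 0xff; // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  // 0x84 (132) is the standard mu-law bias added before encoding.
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign !== 0 ? -magnitude : magnitude;
}

// Decodes a mu-law buffer and upsamples 8 kHz -> 16 kHz by inserting the
// midpoint between each pair of neighbouring samples (linear interpolation).
function muLaw8kToPcm16k(muLaw: Uint8Array): Int16Array {
  const decoded = Array.from(muLaw, (b) => muLawToLinear16(b));
  const out = new Int16Array(decoded.length * 2);
  decoded.forEach((current, i) => {
    const next = i + 1 < decoded.length ? decoded[i + 1] : current;
    out[2 * i] = current;
    out[2 * i + 1] = (current + next) >> 1;
  });
  return out;
}
```

The byte 0xFF decodes to silence (0), while 0x80 and 0x00 decode to the largest positive and negative magnitudes, 32124 and -32124.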
Expected output: The session manager uses a 3600-second TTL with sliding-window compression. The confidence router routes when confidence >= 0.8 and falls back (asks for clarification) when confidence < 0.3. STTProviderInterface and TTSProviderInterface provide the built-in audio format conversion helpers.
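The threshold logic can be sketched in a few lines. The 0.8 and 0.3 cut-offs come from this step. The behaviour for the band in between is not specified here, so the "confirm" action below is an assumption, as are the type and function names. The sketch does not use the REAA confidence router's API.

```typescript
type RouteDecision =
  | { action: "route"; label: string } // confident: act on the intent
  | { action: "confirm"; label: string } // assumption: middle band asks the caller to confirm
  | { action: "fallback" }; // low confidence: ask a clarification question

const ROUTE_THRESHOLD = 0.8;
const FALLBACK_THRESHOLD = 0.3;

function decideRoute(c: { label: string; confidence: number }): RouteDecision {
  if (c.confidence >= ROUTE_THRESHOLD) {
    return { action: "route", label: c.label };
  }
  if (c.confidence < FALLBACK_THRESHOLD) {
    return { action: "fallback" };
  }
  // Assumed behaviour for 0.3 <= confidence < 0.8.
  return { action: "confirm", label: c.label };
}
```

Modelling the decision as a discriminated union lets the pipeline switch on action and have TypeScript check that every case is handled.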
Step 8: Create the runtime configuration loader
Create src/lib/config.ts to validate all environment variables at startup using Zod, giving you early failure instead of cryptic runtime errors.
ts
import "dotenv/config";
import { z } from "zod";

const AppConfigSchema = z.object({
  twilioAccountSid: z.string().min(1, "TWILIO_ACCOUNT_SID is required"),
  twilioAuthToken: z.string().min(1, "TWILIO_AUTH_TOKEN is required"),
  twilioPhoneNumber: z.string().min(1, "TWILIO_PHONE_NUMBER is required"),
  squareAccessToken: z.string().min(1, "SQUARE_ACCESS_TOKEN is required"),
  openaiApiKey: z.string().min(1, "OPENAI_API_KEY is required"),
  deepgramApiKey: z.string().min(1, "DEEPGRAM_API_KEY is required"),
  cartesiaApiKey: z.string().min(1, "CARTESIA_API_KEY is required"),
  langfusePublicKey: z.string().min(1, "LANGFUSE_PUBLIC_KEY is required"),
  langfuseSecretKey: z.string().min(1, "LANGFUSE_SECRET_KEY is required"),
  langfuseHost: z.string().default("https://cloud.langfuse.com"),
  wsPort: z.string().default("8080").transform(Number),
});

export type AppConfig = z.infer<typeof AppConfigSchema>;

export function loadConfig(): AppConfig {
  return Object.freeze(
    AppConfigSchema.parse({
      twilioAccountSid: process.env.TWILIO_ACCOUNT_SID,
      twilioAuthToken: process.env.TWILIO_AUTH_TOKEN,
      twilioPhoneNumber: process.env.TWILIO_PHONE_NUMBER,
      squareAccessToken: process.env.SQUARE_ACCESS_TOKEN,
      openaiApiKey: process.env.OPENAI_API_KEY,
      deepgramApiKey: process.env.DEEPGRAM_API_KEY,
      cartesiaApiKey: process.env.CARTESIA_API_KEY,
      langfusePublicKey: process.env.LANGFUSE_PUBLIC_KEY,
      langfuseSecretKey: process.env.LANGFUSE_SECRET_KEY,
      langfuseHost: process.env.LANGFUSE_HOST,
      wsPort: process.env.WS_PORT,
    }),
  );
}
Expected output: The module loads dotenv/config once, at import time. loadConfig() then validates every env var; if any required key is missing or empty, Zod throws immediately with an error identifying the offending field.
Step 9: Wire up the voice agent pipeline
Create src/services/agent-pipeline.ts. This is the core of the app: it creates the REAA Pipeline with a Deepgram STT provider, a Cartesia TTS provider, session management, a latency budget enforcer, and an MCP client that routes utterances through the OpenAI intent classifier and Square order service.
ts
import {
  createPipeline,
  createLatencyBudget,
  initializeSessionManager,
  LatencyBudgetEnforcer,
  type AudioChunk,
  type VoiceAgentKitConfig,
  type MCPClient,
  type AgentResponse,
} from "@reaatech/voice-agent-core";
import { DeepgramSTTProvider } from "@reaatech/voice-agent-stt";
import { CartesiaTTSProvider } from "@reaatech/voice-agent-tts";
import {
  classifyOrderQuery,
  extractOrderNumber,
  generateAgentResponse,
  generateClarificationQuestion,
} from "./openai-client.js";
import { squareService } from "./square-client.js";
import { costTracker } from "./cost-tracker.js";
import { addUtterance } from "./session-store.js";
Expected output: The OrderStatusAgent ties together STT, intent classification, order lookup, and TTS. The pipeline emits lifecycle events — pipeline:stt:final fires on each completed transcription, pipeline:error logs pipeline errors, and pipeline:turn:end fires after each response (where cost tracking is wired in). Barge-in interrupts in-progress TTS when the caller starts speaking.
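Barge-in is easiest to reason about as a cancellation problem: each spoken response owns a cancellation signal, and caller speech aborts it. The sketch below illustrates that idea with the standard AbortController. It is not how the REAA pipeline implements barge-in, which this recipe does not show, and the SpeechPlayback class is hypothetical.

```typescript
// Tracks the in-progress TTS playback (if any) and lets caller speech cancel it.
class SpeechPlayback {
  private controller: AbortController | null = null;

  // Call when the agent starts speaking. The returned signal should be checked
  // by whatever loop streams TTS audio chunks to the caller.
  start(): AbortSignal {
    this.controller?.abort(); // a new response replaces any older one
    this.controller = new AbortController();
    return this.controller.signal;
  }

  // Call when caller speech is detected. Returns true only if playback was
  // actually interrupted.
  bargeIn(): boolean {
    if (this.controller === null || this.controller.signal.aborted) {
      return false;
    }
    this.controller.abort();
    return true;
  }

  // Call when playback finishes normally.
  finish(): void {
    this.controller = null;
  }
}
```

A streaming loop would test signal.aborted before sending each audio chunk and stop as soon as it becomes true.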
Step 10: Create the WebSocket media server
Create src/services/media-server.ts. This starts a WebSocket server that Twilio connects to for bidirectional audio streaming. It handles call start, audio chunks, DTMF input, barge-in detection, and call end, all relayed through createTwilioHandler from @reaatech/voice-agent-telephony.
ts
import { createTwilioHandler } from "@reaatech/voice-agent-telephony";
import { WebSocketServer, type WebSocket } from "ws";
import { createCallSession, endCallSession } from "./session-store.js";
import { createOrderStatusAgent } from "./agent-pipeline.js";
import { defineConfig, type AudioChunk } from "@reaatech/voice-agent-core";

let wss: WebSocketServer | null = null;

export function startMediaServer(port: number): void {
  wss = new WebSocketServer({ port });
  const agent =
Expected output: The WebSocket server starts on the configured WS_PORT (default 8080). When a Twilio call connects, call:start fires, a session is created, and the agent’s STT pipeline begins. DTMF digits are buffered with a 3-second timeout before being injected as an audio chunk for processing. The defineConfig function from @reaatech/voice-agent-core produces the typed config object used to construct the OrderStatusAgent.
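The DTMF buffering described above can be sketched as a small class that collects digits and releases them after 3 seconds without a new key press. This is an illustrative sketch, not the recipe's media-server code. The DtmfBuffer name, the explicit nowMs arguments (used instead of real timers so the logic is easy to test), and the "#" terminator are all assumptions.

```typescript
// Collects keypad digits and releases them as one string once the caller has
// been idle for timeoutMs, or immediately when "#" is pressed (assumption).
class DtmfBuffer {
  private digits = "";
  private lastDigitAt = 0;
  private readonly timeoutMs: number;

  constructor(timeoutMs = 3000) {
    this.timeoutMs = timeoutMs;
  }

  // Records a digit. Returns the buffered digits if "#" ended the entry,
  // otherwise null.
  push(digit: string, nowMs: number): string | null {
    if (digit === "#") return this.flush();
    this.digits += digit;
    this.lastDigitAt = nowMs;
    return null;
  }

  // Call periodically. Returns the buffered digits once the idle timeout has
  // elapsed since the last key press, otherwise null.
  poll(nowMs: number): string | null {
    if (this.digits.length > 0 && nowMs - this.lastDigitAt >= this.timeoutMs) {
      return this.flush();
    }
    return null;
  }

  private flush(): string | null {
    const result = this.digits;
    this.digits = "";
    return result.length > 0 ? result : null;
  }
}
```

In a real server, poll would typically be driven by a timer, and the returned string would be passed on as the caller's order number.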
Step 11: Create observability, the instrumentation hook, and the Twilio webhook
Create src/services/observability.ts to initialize the REAA OpenTelemetry integration:
ts
import { initializeObservability } from "@reaatech/voice-agent-core";

export async function initObservability(): Promise<void> {
  await initializeObservability({
    serviceName: "openai-voice-agent-square",
    serviceVersion: "0.1.0",
    enabled: true,
  });
}
Create src/instrumentation.ts — the Next.js instrumentation hook that runs the media server and observability on server startup:
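Configure next.config.ts to enable the instrumentation hook. The experimental.instrumentationHook flag is only required on Next.js 14 and earlier; from Next.js 15 onward, instrumentation.ts is picked up automatically, which is why the option needs a type cast here: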
ts
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  experimental: {
    instrumentationHook: true,
  } as NextConfig["experimental"],
};

export default nextConfig;
Now create the Twilio webhook route at app/api/twilio/incoming/route.ts. When Twilio receives an incoming call, it POSTs to this endpoint, which returns TwiML instructing Twilio to stream audio to your WebSocket server:
ts
import { type NextRequest, NextResponse } from "next/server";
import twilio from "twilio";

export async function POST(req: NextRequest): Promise<NextResponse> {
  const formData = await req.formData();
  const callSid = formData.get("CallSid") as string | null;

  if (!callSid) {
    return new NextResponse(
      "<Response><Say>Error: Missing call identifier</Say></Response>",
      { status: 400, headers: { "Content-Type": "text/xml" } },
    );
  }

  const VoiceResponse = twilio.twiml.VoiceResponse;
  const twiml = new VoiceResponse();
  twiml.say("Welcome to order status. Tell me your order number and I will check it for you.");

  const connect = twiml.connect();
  connect.stream({ url: `wss://${req.headers.get("host") ?? "localhost:3000"}/media-stream` });

  return new NextResponse(twiml.toString(), {
    headers: { "Content-Type": "text/xml" },
  });
}
Expected output: When your Next.js app starts, register() runs, calls initObservability() to wire up OpenTelemetry, and starts the WebSocket media server. Incoming Twilio calls POST to /api/twilio/incoming, the route returns TwiML with <Connect><Stream>, and Twilio opens a WebSocket to your media server.
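The route above accepts any POST that carries a CallSid. For a public endpoint, it is common to also verify the X-Twilio-Signature header. The sketch below follows Twilio's documented scheme for form-encoded webhooks: append each parameter's name and value to the full URL in alphabetical key order, HMAC-SHA1 the result with the auth token, and Base64-encode it. The function names are made up for illustration. In a real app, the twilio package's own validateRequest helper is likely the safer choice.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Computes the signature Twilio would send for this URL and set of POST params.
function computeTwilioSignature(
  authToken: string,
  url: string,
  params: Record<string, string>,
): string {
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return createHmac("sha1", authToken).update(data, "utf8").digest("base64");
}

// Compares the expected signature with the received header in constant time.
function isValidTwilioSignature(
  authToken: string,
  url: string,
  params: Record<string, string>,
  signatureHeader: string,
): boolean {
  const expected = Buffer.from(computeTwilioSignature(authToken, url, params));
  const received = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The URL must match exactly what Twilio requested, including scheme and query string, so behind a proxy it usually has to be rebuilt from the forwarded headers.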
Step 12: Wire up the public API exports
Create src/index.ts to re-export every service for external consumption (and to satisfy the import gate for testing):
ts
import "dotenv/config";
import { z } from "zod";
import { DeepgramClient } from "@deepgram/sdk";
import Cartesia from "@cartesia/cartesia-js";

export const _importGate = {
  zodSchema: z,
  deepgramClient: DeepgramClient,
  cartesiaClient: Cartesia,
};

export { OrderStatusAgent, createOrderStatusAgent } from "./services/agent-pipeline.js";
export { squareService, SquareOrderService } from "./services/square-client.js";
export {
  callSessionManager,
  createCallSession,
  addUtterance,
  getHistory,
  endCallSession,
} from "./services/session-store.js";
export { router, routeClassification } from "./services/confidence-router-service.js";
export { CostTracker, costTracker, langfuse } from "./services/cost-tracker.js";
export {
  convertTwilioAudioForDeepgram,
  formatTTSAudioForTwilio,
  injectSilence,
} from "./services/audio-processor.js";
export {
  openai,
  classifyOrderQuery,
  extractOrderNumber,
  generateAgentResponse,
  generateClarificationQuestion,
} from "./services/openai-client.js";
Step 13: Write tests and verify coverage
Now write tests for the critical modules. Create tests/services/square-client.test.ts to verify the Square service handles success, 404, boundary cases, and search-by-phone:
ts
import { describe, it, expect, vi } from "vitest";

const mockGet = vi.fn();
let MockErrorClass: new (code: number) => Error & { statusCode: number } = class extends Error {
  statusCode: number = 0;
  constructor(c: number) {
    super();
    this.statusCode = c;
  }
};

vi.mock("square", () => {
  class MockSquareClient {
    orders = { get: mockGet };
  }
  class MockSquareError extends
Create tests/services/session-store.test.ts to cover the session lifecycle:
ts
import { describe, it, expect } from "vitest";
import {
  createCallSession,
  addUtterance,
  getHistory,
  endCallSession,
} from "../../src/services/session-store.js";

describe("session store", () => {
  it("createCallSession returns a session", async () => {
    const session = await createCallSession("CA-test123", "+155****4567");
    expect(session).toBeDefined();
    expect(session.userId).toBe("+155****4567");
  });

  it("addUtterance and getHistory return messages in order", async () => {
    const session = await createCallSession("CA-test456", "+155****0000");
    await addUtterance(session.id, "user", "Where is my order?");
    await addUtterance(session.id, "assistant", "Let me check.");
    const history = await getHistory(session.id);
    expect(history).toHaveLength(2);
    expect(history[0].content).toBe("Where is my order?");
    expect(history[1].content).toBe("Let me check.");
  });

  it("getHistory returns empty array for new session", async () => {
    const session = await createCallSession("CA-empty", "+155****0001");
    const history = await getHistory(session.id);
    expect(history).toEqual([]);
  });

  it("endCallSession marks session completed", async () => {
    const session = await createCallSession("CA-end", "+155****0002");
    await endCallSession(session.id);
  });

  it("addUtterance with non-existent session throws", async () => {
    await expect(addUtterance("nonexistent", "user", "test")).rejects.toThrow();
  });
});
Now run all checks:
terminal
pnpm typecheck
Expected output: tsc exits 0 with no errors. All your TypeScript types, imports, and module references are correct.
terminal
pnpm lint
Expected output: eslint exits 0. No lint errors in any source or test file.
Expected output: numFailedTests is 0, numTotalTests is >= 3, and all coverage thresholds (lines, branches, functions, statements) are 90% or higher.
You now have a fully functional voice agent pipeline. Start the app with pnpm dev, configure your Twilio phone number’s voice webhook URL to https://your-domain.com/api/twilio/incoming, and call it — the agent will greet callers, transcribe their speech via Deepgram, classify their intent with OpenAI, look up their Square order, and respond over the phone with Cartesia TTS.
Next steps
Replace the in-memory session store with a database. Swap MemoryAdapter for a PostgreSQL or Redis adapter from @reaatech/session-continuity-storage-* to persist sessions across server restarts.
Add phone-number-based order search. Wire squareService.searchOrdersByPhone() into the agent pipeline so callers can say “check my orders” without knowing their order number, then have the agent disambiguate between matches.
Deploy the WebSocket server behind a load balancer. For production, run the media WebSocket server on a dedicated subdomain (e.g., wss://media.yourdomain.com) and configure Twilio’s stream.url accordingly, with sticky sessions for call continuity.
Monitor with Langfuse dashboards. The cost tracker already emits traces — build a Langfuse dashboard showing per-day spend, average latency per turn, and most-requested features.
): unknown {
    return getOpenAI()[prop];
  },
});
export async function classifyOrderQuery(
transcript: string,
): Promise<{ label: string; confidence: number }> {
"You are a helpful order status assistant. Explain the order status in a friendly, concise way to the customer over the phone.",
},
{
role: "user",
content: `Order ${orderInfo.orderId}: status ${orderInfo.status}, items ${orderInfo.lineItems.join(", ")}, total ${orderInfo.totalAmount}, placed at ${orderInfo.placedAt}.`,
},
],
});
return completion.choices[0]?.message?.content ?? "Sorry, I couldn't look up your order. Please try again.";
} catch {
return "Sorry, I couldn't look up your order. Please try again.";
}
}
export async function generateClarificationQuestion(