@reaatech/voice-agent-tts

npm v0.1.0

Provider-agnostic text-to-speech interface with five adapter implementations (Deepgram Aura, AWS Polly, Google Cloud TTS, ElevenLabs, Cartesia), returning streaming audio as `AsyncIterable<AudioChunk>` with cancelable synthesis and Twilio-ready audio formatting.

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Installation

terminal
npm install @reaatech/voice-agent-tts
# or
pnpm add @reaatech/voice-agent-tts

Provider SDKs (install only what you use)

The cloud adapters load their provider SDKs lazily and declare them as optional peer dependencies, so you only install the SDK for the provider you actually use. Deepgram needs no extra SDK.

terminal
# AWS Polly
npm install @aws-sdk/client-polly @aws-sdk/credential-provider-ini
 
# Google Cloud Text-to-Speech
npm install @google-cloud/text-to-speech
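
To illustrate the lazy-loading pattern for optional peer dependencies, here is a minimal sketch. It is not the package's actual loader: `loadOptionalSdk` and its parameters are hypothetical names, and the error wording is only an example of what a helpful failure could look like.

```typescript
// Hypothetical helper: import an optional peer dependency only when a
// provider that needs it is actually used.
async function loadOptionalSdk<T = unknown>(
  packageName: string,
  providerLabel: string,
): Promise<T> {
  try {
    // Dynamic import defers resolution until this line runs, so the
    // package is never required at module load time.
    return (await import(packageName)) as T;
  } catch {
    // Turn an opaque resolution failure into an actionable message.
    throw new Error(
      `${providerLabel} requires the optional peer dependency "${packageName}". ` +
        `Install it with: npm install ${packageName}`,
    );
  }
}
```

With a loader like this, importing the top-level package stays cheap, and a missing SDK only surfaces when the corresponding adapter is first used.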

Feature Overview

  • Unified TTS interface: `TTSProvider` with `synthesize()` returning `AsyncIterable<AudioChunk>`
  • Deepgram Aura adapter: Low-latency HTTP/2 streaming with voice selection and mulaw encoding
  • AWS Polly adapter: Neural engine with SSML support, multiple voice IDs, sample rate configuration
  • Google Cloud TTS adapter: 220+ voices, speaking rate, pitch, volume control, and SSML gender
  • ElevenLabs adapter: Streaming HTTP/2 with ultra-realistic voices (Turbo v2.5, Flash v2.5)
  • Cartesia adapter: Ultra-low latency streaming with Sonic model and emotion control
  • Cancelable synthesis: `cancel()` stops in-progress TTS immediately (barge-in support)
  • Twilio audio formatting: Automatic mulaw 8kHz conversion via `formatAudioForTwilio()`
  • Silence generation: `createSilenceChunk()` for injecting pauses between utterances
  • Text chunking: `chunkTextForStreaming()` to split long responses for streaming TTS
  • Provider factory: `createTTSProvider()` for runtime provider selection

Quick Start

typescript
import { DeepgramTTSProvider } from '@reaatech/voice-agent-tts';
 
const tts = new DeepgramTTSProvider();
 
for await (const chunk of tts.synthesize('Hello, how can I help you today?', {
  provider: 'deepgram',
  apiKey: process.env.DEEPGRAM_API_KEY,
  voice: 'asteria',
  model: 'aura',
  encoding: 'mulaw',
  sampleRate: 8000,
})) {
  // Forward each audio chunk to your Twilio Media Stream handler.
  // `twilioHandler` is assumed to be defined elsewhere in your application.
  twilioHandler.sendAudio(chunk);
}

API Reference

TTSProvider Interface

typescript
interface TTSProvider {
  readonly name: string;
  synthesize(text: string, config: DeepgramTTSConfig | AWSPollyConfig | GoogleCloudTTSConfig): AsyncIterable<AudioChunk>;
  readonly supportsStreaming: boolean;
  readonly firstByteLatencyMs: number | null;
  cancel(): void;
  connect?(config: unknown): Promise<void>;
}
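
The `firstByteLatencyMs` field suggests that providers track the time until the first audio chunk arrives. As a rough sketch of how such a measurement can be taken around any `AsyncIterable`, consider the wrapper below. The name `withFirstByteLatency`, the callback, and the injectable clock are hypothetical and not part of the package.

```typescript
// Hypothetical wrapper: reports the delay between the start of iteration
// and the arrival of the first item, then passes every item through.
async function* withFirstByteLatency<T>(
  source: AsyncIterable<T>,
  onFirstChunk: (latencyMs: number) => void,
  now: () => number = Date.now,
): AsyncGenerator<T> {
  // Generator bodies run lazily, so the clock starts when the consumer
  // requests the first item, not when this function is called.
  const startedAt = now();
  let isFirst = true;
  for await (const item of source) {
    if (isFirst) {
      isFirst = false;
      onFirstChunk(now() - startedAt);
    }
    yield item;
  }
}
```

Wrapping `tts.synthesize(text, config)` in a helper like this lets you compare providers under your own network conditions rather than relying on published figures.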

TTSProviderInterface (Static Utilities)

typescript
class TTSProviderInterface {
  static formatAudioForTwilio(chunk: AudioChunk): AudioChunk;
  static createSilenceChunk(durationMs: number, sampleRate?: number): AudioChunk;
  static chunkTextForStreaming(text: string, maxChunkSize?: number): string[];
}

| Method | Description |
| --- | --- |
| `formatAudioForTwilio` | Converts any audio chunk to mulaw 8kHz for Twilio Media Streams |
| `createSilenceChunk` | Creates a mulaw silence buffer of specified duration (default 8kHz) |
| `chunkTextForStreaming` | Splits long text at sentence boundaries for sentence-by-sentence TTS |
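
For context on what converting to mulaw involves, the sketch below shows the standard G.711 mu-law companding step for 16-bit linear PCM samples. It is not the package's implementation of `formatAudioForTwilio()`: the function names are hypothetical, and resampling to 8 kHz is left out.

```typescript
const MULAW_BIAS = 0x84; // 132, added so every magnitude falls in a segment
const MULAW_CLIP = 32635; // largest magnitude that still fits after the bias

// Encode one signed 16-bit PCM sample as one mu-law byte (G.711).
function linear16SampleToMulaw(sample: number): number {
  const sign = sample < 0 ? 0x80 : 0x00;
  const magnitude = Math.min(Math.abs(sample), MULAW_CLIP) + MULAW_BIAS;

  // Locate the segment (exponent): the position of the highest set bit,
  // searched downward from bit 14.
  let exponent = 7;
  for (let mask = 0x4000; (magnitude & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent -= 1;
  }

  // Four bits below the leading bit form the mantissa.
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;

  // mu-law bytes are stored bit-inverted.
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Encode a whole buffer: two bytes per input sample become one byte.
function linear16ToMulaw(pcm: Int16Array): Uint8Array {
  return Uint8Array.from(pcm, (s) => linear16SampleToMulaw(s));
}
```

Because each 16-bit sample becomes a single byte, a mulaw stream at 8 kHz carries 8000 bytes per second of audio.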

DeepgramTTSProvider

typescript
class DeepgramTTSProvider implements TTSProvider {
  readonly name = 'deepgram';
  readonly supportsStreaming = true;
  constructor(options?: DeepgramTTSOptions);
  getLastFirstByteLatency(): number | null;
}
 
interface DeepgramTTSOptions {
  apiUrl?: string;   // default: 'api.deepgram.com'
  version?: string;  // default: 'v1'
}
 
interface DeepgramTTSConfig extends TTSConfig {
  model?: 'aura';
  voice?: string;        // e.g., 'asteria', 'luna', 'stella', 'arcas'
  encoding?: 'mulaw' | 'linear16' | 'pcm';
  sampleRate?: number;   // 8000, 16000, 24000, 48000
  container?: 'none' | 'wav';
}

AWSPollyProvider

typescript
class AWSPollyProvider extends EventEmitter implements TTSProvider {
  readonly name = 'aws-polly';
  readonly supportsStreaming = true;
  constructor(options?: AWSPollyOptions);
  connect(config: AWSPollyConfig): Promise<void>;
  onError(cb: (error: Error) => void): void;
  close(): Promise<void>;
  isConnected(): boolean;
}
 
interface AWSPollyOptions {
  region?: string;          // default: 'us-east-1'
  defaultVoiceId?: string;  // default: 'Joanna'
  defaultEngine?: Engine;   // default: NEURAL
}
 
interface AWSPollyConfig extends TTSConfig {
  region: string;
  voiceId?: string;          // Joanna, Matthew, Salli, etc.
  engine?: 'standard' | 'neural';
  languageCode?: string;
  sampleRate?: number;       // 8000, 16000, 22050
  textType?: 'text' | 'ssml';
}

GoogleCloudTTSProvider

typescript
class GoogleCloudTTSProvider implements TTSProvider {
  readonly name = 'google-cloud-tts';
  readonly supportsStreaming = true;
  constructor(options?: GoogleCloudTTSOptions);
  getLastFirstByteLatency(): number | null;
}
 
interface GoogleCloudTTSOptions {
  projectId?: string;
  keyFilename?: string;
}
 
interface GoogleCloudTTSConfig extends TTSConfig {
  projectId: string;
  voiceName?: string;              // e.g., 'en-US-Standard-A'
  languageCode?: string;           // e.g., 'en-US'
  ssmlGender?: 'MALE' | 'FEMALE' | 'NEUTRAL';
  audioEncoding?: 'MP3' | 'LINEAR16' | 'OGG_OPUS' | 'MULAW' | 'ALAW';
  sampleRateHertz?: number;
  speakingRate?: number;           // 0.25–4.0
  pitch?: number;                  // -20.0–20.0
  volumeGainDb?: number;           // -96.0–16.0
}

ElevenLabsProvider

typescript
class ElevenLabsProvider implements TTSProvider {
  readonly name = 'elevenlabs';
  readonly supportsStreaming = true;
  constructor(options?: ElevenLabsOptions);
  getLastFirstByteLatency(): number | null;
}
 
interface ElevenLabsConfig extends TTSConfig {
  modelId?: 'eleven_turbo_v2_5' | 'eleven_flash_v2_5';
  voiceId?: string;
  stability?: number;
  similarityBoost?: number;
  optimizeStreamingLatency?: number;
  outputFormat?: 'mp3_44100' | 'pcm_8000' | 'mulaw_8000';
}

Streaming HTTP/2 adapter for ElevenLabs ultra-realistic voices. Supports latency optimization and multiple output formats.

CartesiaProvider

typescript
class CartesiaProvider implements TTSProvider {
  readonly name = 'cartesia';
  readonly supportsStreaming = true;
  constructor(options?: CartesiaOptions);
  getLastFirstByteLatency(): number | null;
}
 
interface CartesiaConfig extends TTSConfig {
  modelId?: 'sonic' | 'sonic-2';
  voiceId?: string;
  speed?: 'slowest' | 'slow' | 'normal' | 'fast' | 'fastest';
  emotion?: 'anger' | 'positivity' | 'surprise' | 'sadness' | 'curiosity' | 'neutral';
  language?: string;
  outputFormat?: 'raw' | 'wav' | 'mp3';
  sampleRate?: number;
}

Ultra-low latency streaming adapter with Sonic model and emotion control. Sub-100ms P50 latency for real-time use.

Provider Factory

typescript
import { createTTSProvider } from '@reaatech/voice-agent-tts';
 
const tts = createTTSProvider({
  provider: 'deepgram',             // 'deepgram' | 'aws-polly' | 'google-cloud-tts' | 'elevenlabs' | 'cartesia'
  config: { provider: 'deepgram', apiKey: '...' },
});
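
When the provider name comes from configuration, it helps to validate it before handing it to the factory. The sketch below is one way to do that. `selectProvider` and `isProviderName` are hypothetical helpers, not package exports, and the list of names is copied from the comment in the example above.

```typescript
type ProviderName =
  | 'deepgram'
  | 'aws-polly'
  | 'google-cloud-tts'
  | 'elevenlabs'
  | 'cartesia';

const PROVIDER_NAMES: readonly ProviderName[] = [
  'deepgram',
  'aws-polly',
  'google-cloud-tts',
  'elevenlabs',
  'cartesia',
];

// Type guard: narrows an arbitrary string to a known provider name.
function isProviderName(value: string): value is ProviderName {
  return (PROVIDER_NAMES as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical helper: resolve a provider name from untrusted input,
// falling back to a default when nothing was configured.
function selectProvider(
  requested: string | undefined,
  fallback: ProviderName = 'deepgram',
): ProviderName {
  if (requested === undefined || requested === '') return fallback;
  if (!isProviderName(requested)) {
    throw new Error(
      `Unknown TTS provider "${requested}". Expected one of: ${PROVIDER_NAMES.join(', ')}`,
    );
  }
  return requested;
}
```

The result could then be passed as the `provider` field, for instance `selectProvider(process.env.TTS_PROVIDER)`, where `TTS_PROVIDER` is an environment variable name chosen purely for illustration.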

Usage Patterns

Barge-In (Cancel In-Progress TTS)

typescript
// Start TTS
const ttsStream = tts.synthesize(text, config);
 
// User interrupts — cancel immediately
tts.cancel();
// The synthesize() generator will exit cleanly
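
The snippet above cancels before the stream is consumed. In practice the stream is usually being iterated when the interruption arrives, so here is a self-contained sketch of that flow. `FakeTTSProvider` is a stand-in written for this example and only mimics the `synthesize()`/`cancel()` shape; the real adapters may stop their streams differently.

```typescript
interface AudioChunkSketch {
  buffer: Uint8Array;
}

// Stand-in provider: yields one chunk per word and checks a cancel flag
// before each chunk, the way a streaming adapter might between reads.
class FakeTTSProvider {
  private cancelled = false;

  async *synthesize(text: string): AsyncGenerator<AudioChunkSketch> {
    this.cancelled = false;
    for (const word of text.split(' ')) {
      await Promise.resolve(); // placeholder for waiting on the network
      if (this.cancelled) return; // exit cleanly, no error thrown
      yield { buffer: new Uint8Array(word.length).fill(0xff) };
    }
  }

  cancel(): void {
    this.cancelled = true;
  }
}

// Consume the stream and simulate the caller speaking after N chunks.
async function speakUntilBargeIn(
  tts: FakeTTSProvider,
  text: string,
  bargeInAfterChunks: number,
): Promise<AudioChunkSketch[]> {
  const sent: AudioChunkSketch[] = [];
  for await (const chunk of tts.synthesize(text)) {
    sent.push(chunk);
    if (sent.length === bargeInAfterChunks) tts.cancel();
  }
  return sent; // the loop ends normally once the generator returns
}
```

The key point is that cancellation ends the `for await` loop without an exception, so the caller needs no special error handling for barge-in.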

Sentence-Level Streaming for Low Latency

typescript
import { TTSProviderInterface } from '@reaatech/voice-agent-tts';
 
const sentences = TTSProviderInterface.chunkTextForStreaming(longText, 200);
 
for (const sentence of sentences) {
  for await (const chunk of tts.synthesize(sentence, config)) {
    handler.sendAudio(chunk);
  }
}
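
To make the splitting behavior concrete, here is a minimal sketch of sentence-boundary chunking with a size limit. It is an approximation written for this README, not the code behind `chunkTextForStreaming()`. The name `chunkBySentence` is hypothetical, and the simple regular expression will also split on abbreviations such as "e.g.".

```typescript
// Hypothetical sketch: group whole sentences into chunks of at most
// maxChunkSize characters.
function chunkBySentence(text: string, maxChunkSize = 200): string[] {
  // A sentence is a run of non-terminators followed by optional
  // terminators (., !, ?) and trailing whitespace.
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
  const chunks: string[] = [];
  let current = '';

  for (const raw of sentences) {
    const sentence = raw.trim();
    if (sentence === '') continue;

    const candidate = current === '' ? sentence : `${current} ${sentence}`;
    if (candidate.length <= maxChunkSize) {
      current = candidate; // still fits, keep accumulating
    } else {
      if (current !== '') chunks.push(current);
      current = sentence; // a sentence longer than the limit stays whole
    }
  }

  if (current !== '') chunks.push(current);
  return chunks;
}
```

Smaller limits start playback sooner because the first request is shorter, at the cost of more requests per response.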

Silence Between Utterances

typescript
import { TTSProviderInterface } from '@reaatech/voice-agent-tts';
 
// 500ms silence gap
const silence = TTSProviderInterface.createSilenceChunk(500);
handler.sendAudio(silence);
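
The size of a silence buffer follows directly from the format: mulaw uses one byte per sample, and digital silence (a zero-amplitude sample) encodes to the byte `0xFF`. The sketch below shows that arithmetic. `silenceMulaw` is a hypothetical name, and the package's `createSilenceChunk()` may wrap the bytes in additional metadata.

```typescript
const MULAW_SILENCE_BYTE = 0xff; // mu-law encoding of a zero-amplitude sample

// Hypothetical sketch: raw mulaw silence of the given duration.
function silenceMulaw(durationMs: number, sampleRate = 8000): Uint8Array {
  // One byte per sample, so byte count equals sample count.
  const sampleCount = Math.round((durationMs / 1000) * sampleRate);
  return new Uint8Array(sampleCount).fill(MULAW_SILENCE_BYTE);
}

// 500 ms at 8000 samples per second is 4000 samples, hence 4000 bytes.
const halfSecondOfSilence = silenceMulaw(500);
```

Filling with zero bytes instead would not be silent, since `0x00` in mulaw represents the largest negative amplitude.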

License

MIT