@reaatech/media-pipeline-mcp-openai

npm v0.3.0

An OpenAI provider for the media-pipeline framework that exposes a class (`OpenAIProvider`) supporting DALL-E 3 image generation, GPT-4o vision-based image description, TTS-1 text-to-speech, and Whisper-1 speech-to-text transcription via the OpenAI REST API with no SDK dependency.

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

OpenAI provider for the media pipeline framework. Supports image generation (DALL-E 3), vision-based image description (GPT-4o), text-to-speech (TTS-1), and speech-to-text transcription (Whisper-1). Fully self-contained using only the OpenAI REST API — no SDK dependency required.

Installation

terminal
npm install @reaatech/media-pipeline-mcp-openai
# or
pnpm add @reaatech/media-pipeline-mcp-openai

Feature Overview

  • DALL-E 3 image generation with quality (standard/hd), size, and style control
  • GPT-4o / GPT-4o-mini vision-based image description at three detail levels
  • TTS-1 text-to-speech with voice selection and speaking speed
  • Whisper-1 speech-to-text with verbose JSON output and optional language hint
  • Streaming support for TTS, text completion, and image description (supportsStreaming)
  • Organization and project header support for multi-tenant OpenAI accounts
  • Base URL override for custom endpoints and proxies
  • Per-operation cost estimation with size and quality multipliers

Quick Start

typescript
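import { readFile } from "node:fs/promises";
import { OpenAIProvider } from "@reaatech/media-pipeline-mcp-openai";

const provider = new OpenAIProvider({ apiKey: process.env.OPENAI_API_KEY! });

// Placeholder inputs for the describe and transcribe calls below; substitute your own files.
const imageBuffer = await readFile("./input.png");
const audioBuffer = await readFile("./input.mp3");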
 
// Generate an image
const image = await provider.execute({
  operation: "image.generate",
  params: { prompt: "A futuristic city skyline at sunset", dimensions: "1024x1024", quality: "standard", style: "vivid" },
  config: {},
});
 
// Describe an image
const description = await provider.execute({
  operation: "image.describe",
  params: { artifact_data: imageBuffer, detail_level: "detailed", mime_type: "image/png" },
  config: {},
});
 
// Text to speech
const audio = await provider.execute({
  operation: "audio.tts",
  params: { text: "Hello, welcome to our service", voice: "alloy", speed: 1.0, output_format: "mp3" },
  config: {},
});
 
// Speech to text
const transcript = await provider.execute({
  operation: "audio.stt",
  params: { audio_data: audioBuffer, language: "en" },
  config: {},
});

Supported Operations

| Operation | Default Model | Description | Output Format |
| --- | --- | --- | --- |
| image.generate | dall-e-3 | Text-to-image with size/quality/style options | PNG image buffer |
| image.describe | gpt-4o | Vision-based image description at brief, detailed, or structured levels | Plain text |
| audio.tts | tts-1 | Text-to-speech with voice and speed control | Audio bytes (mp3, wav, opus) |
| audio.stt | whisper-1 | Speech-to-text transcription with verbose JSON output | JSON with text and segments |
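
The table above can be read as a lookup from operation name to default model and REST endpoint. The sketch below is illustrative only: `OPERATION_ROUTES` and `resolveRoute` are hypothetical names, and the endpoint paths are the public OpenAI REST paths for each capability, assumed here and not taken from the package source.

```typescript
type Operation = "image.generate" | "image.describe" | "audio.tts" | "audio.stt";

interface Route {
  defaultModel: string;
  endpoint: string; // path relative to the base URL (assumed, see lead-in)
}

// Hypothetical routing table mirroring the Supported Operations table.
const OPERATION_ROUTES: Record<Operation, Route> = {
  "image.generate": { defaultModel: "dall-e-3", endpoint: "/images/generations" },
  "image.describe": { defaultModel: "gpt-4o", endpoint: "/chat/completions" },
  "audio.tts": { defaultModel: "tts-1", endpoint: "/audio/speech" },
  "audio.stt": { defaultModel: "whisper-1", endpoint: "/audio/transcriptions" },
};

function resolveRoute(
  operation: string,
  baseUrl: string = "https://api.openai.com/v1",
): { model: string; url: string } {
  // Own-property check so names like "toString" are rejected too.
  if (!Object.prototype.hasOwnProperty.call(OPERATION_ROUTES, operation)) {
    throw new Error(`Unsupported operation: ${operation}`);
  }
  const route = OPERATION_ROUTES[operation as Operation];
  // Strip trailing slashes so a custom base URL joins cleanly.
  return { model: route.defaultModel, url: `${baseUrl.replace(/\/+$/, "")}${route.endpoint}` };
}
```

A table-driven dispatch keeps the operation list, default models, and endpoints in one place, which is one plausible way to implement the routing that execute() performs.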

Configuration Parameters

image.generate

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| prompt | string | required | Text description of the desired image |
| dimensions | string | 1024x1024 | Image size: 1024x1024, 1024x1792, 1792x1024 |
| quality | string | standard | Image quality: standard or hd |
| style | string | vivid | Image style: vivid or natural |
| num_outputs | number | 1 | Number of images to generate |
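
As a rough illustration of how these parameters and defaults could be validated and translated into an OpenAI image request body, consider the sketch below. `buildImageRequest` is a hypothetical helper, not part of the package API; the `size` and `n` field names are those of the OpenAI images endpoint, and the exact mapping used by the provider is an assumption.

```typescript
interface ImageGenerateParams {
  prompt: string;
  dimensions?: string;
  quality?: string;
  style?: string;
  num_outputs?: number;
}

const ALLOWED_DIMENSIONS = ["1024x1024", "1024x1792", "1792x1024"];

// Hypothetical helper: applies the documented defaults and rejects unknown values.
function buildImageRequest(params: ImageGenerateParams) {
  const size = params.dimensions ?? "1024x1024";
  const quality = params.quality ?? "standard";
  const style = params.style ?? "vivid";

  if (params.prompt.trim().length === 0) throw new Error("prompt is required");
  if (!ALLOWED_DIMENSIONS.includes(size)) throw new Error(`Unsupported dimensions: ${size}`);
  if (quality !== "standard" && quality !== "hd") throw new Error(`Unsupported quality: ${quality}`);
  if (style !== "vivid" && style !== "natural") throw new Error(`Unsupported style: ${style}`);

  // Assumed mapping: dimensions -> size, num_outputs -> n.
  return { model: "dall-e-3", prompt: params.prompt, size, quality, style, n: params.num_outputs ?? 1 };
}
```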

image.describe

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| artifact_data | Buffer | required | Image as raw buffer |
| detail_level | string | detailed | Description detail: brief, detailed, structured |
| mime_type | string | image/png | Image MIME type |

audio.tts

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text | string | required | Text to convert to speech |
| voice | string | alloy | Voice: alloy, echo, fable, onyx, nova, shimmer |
| speed | number | 1.0 | Speaking speed (0.25 to 4.0) |
| output_format | string | mp3 | Audio format: mp3, wav, opus |
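
The constraints in this table (six voices, a speed range of 0.25 to 4.0, three output formats) lend themselves to a small validation step before the request is sent. The following is a minimal sketch under stated assumptions: `buildSpeechRequest` is a hypothetical name, and the body fields (`input`, `response_format`) follow the OpenAI speech endpoint, not confirmed package internals.

```typescript
interface TtsParams {
  text: string;
  voice?: string;
  speed?: number;
  output_format?: string;
}

const VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];
const FORMATS = ["mp3", "wav", "opus"];

// Hypothetical helper: applies documented defaults and enforces documented ranges.
function buildSpeechRequest(params: TtsParams) {
  const voice = params.voice ?? "alloy";
  const speed = params.speed ?? 1.0;
  const format = params.output_format ?? "mp3";

  if (params.text.length === 0) throw new Error("text is required");
  if (!VOICES.includes(voice)) throw new Error(`Unknown voice: ${voice}`);
  if (!(speed >= 0.25 && speed <= 4.0)) throw new Error(`speed out of range: ${speed}`);
  if (!FORMATS.includes(format)) throw new Error(`Unsupported format: ${format}`);

  // Assumed mapping: text -> input, output_format -> response_format.
  return { model: "tts-1", input: params.text, voice, speed, response_format: format };
}
```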

audio.stt

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| audio_data | Buffer | required | Audio data as raw buffer |
| language | string | (none) | Optional BCP-47 language code hint |

API Reference

OpenAIProvider

typescript
class OpenAIProvider extends MediaProvider {
  constructor(config: OpenAIConfig)
 
  healthCheck(): Promise<ProviderHealth>
  estimateCost(input: ProviderInput): Promise<CostEstimate>
  execute(input: ProviderInput): Promise<ProviderOutput>
}

OpenAIConfig

typescript
interface OpenAIConfig {
  apiKey: string;          // OpenAI API key (required)
  organization?: string;   // Optional org ID for multi-org accounts
  project?: string;        // Optional project ID for scoped access
  baseUrl?: string;        // Default: "https://api.openai.com/v1"
}
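
To make the role of each field concrete, here is a sketch of how such a configuration could be turned into request headers and a base URL. `buildHeaders` and `resolveBaseUrl` are hypothetical helpers, and the interface is a local copy for self-containment. `OpenAI-Organization` and `OpenAI-Project` are the header names the OpenAI REST API documents; whether the provider sets them exactly this way is an assumption.

```typescript
// Local copy of the config shape shown above, so the sketch stands alone.
interface OpenAIConfig {
  apiKey: string;
  organization?: string;
  project?: string;
  baseUrl?: string;
}

// Hypothetical helper: bearer auth plus optional org/project scoping headers.
function buildHeaders(config: OpenAIConfig): Record<string, string> {
  const headers: Record<string, string> = { Authorization: `Bearer ${config.apiKey}` };
  if (config.organization) headers["OpenAI-Organization"] = config.organization;
  if (config.project) headers["OpenAI-Project"] = config.project;
  return headers;
}

// Hypothetical helper: falls back to the documented default and trims trailing slashes.
function resolveBaseUrl(config: OpenAIConfig): string {
  return (config.baseUrl ?? "https://api.openai.com/v1").replace(/\/+$/, "");
}
```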

Factory Function

typescript
import { createOpenAIProvider } from "@reaatech/media-pipeline-mcp-openai";
 
const provider = createOpenAIProvider({ apiKey: process.env.OPENAI_API_KEY! });

Key Methods

| Method | Returns | Description |
| --- | --- | --- |
| healthCheck() | ProviderHealth | Validates API key by listing available models |
| estimateCost(input) | CostEstimate | Estimates cost per operation with size/quality multipliers |
| execute(input) | ProviderOutput | Routes to DALL-E, GPT-4o, TTS-1, or Whisper-1 based on operation |

Non-Retryable Errors

Non-retryable errors are determined by OpenAI HTTP status codes. The provider relies on the base class retry logic for transient failures.
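
The exact status codes are not listed here, so the following is only a sketch of one common convention: treat rate limiting, request timeouts, and server errors as transient, and everything else as non-retryable. `isRetryableStatus` is a hypothetical name, and the split it encodes is an assumption, not the package's documented behavior.

```typescript
// Assumed classification: 408, 429, and 5xx are transient; all other statuses are final.
function isRetryableStatus(status: number): boolean {
  if (status === 408 || status === 429) return true;
  return status >= 500 && status <= 599;
}

// Convenience inverse for callers that want to fail fast on non-retryable errors.
function isNonRetryableStatus(status: number): boolean {
  return status >= 400 && !isRetryableStatus(status);
}
```

Under this convention, an invalid API key (401) or a malformed request (400) fails immediately, while a 429 or 503 is handed back to the retry logic.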

Cost Estimation

DALL-E 3 Image Generation

| Quality | Size | Cost |
| --- | --- | --- |
| standard | 1024×1024 | $0.04 |
| standard | 1024×1792 / 1792×1024 | $0.08 |
| hd | 1024×1024 | $0.08 |
| hd | 1024×1792 / 1792×1024 | $0.12 |
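
The pricing above reduces to a two-level lookup on quality and on whether the image is square. The sketch below encodes the table directly; `estimateImageCost` is a hypothetical function, and treating the listed cost as a per-image price multiplied by the number of outputs is an assumption.

```typescript
type Quality = "standard" | "hd";

// Prices in USD, copied from the table above.
const PRICE_PER_IMAGE: Record<Quality, { square: number; rectangular: number }> = {
  standard: { square: 0.04, rectangular: 0.08 },
  hd: { square: 0.08, rectangular: 0.12 },
};

const RECTANGULAR_SIZES = ["1024x1792", "1792x1024"];

// Hypothetical estimator: unit price for the quality/size pair times the image count.
function estimateImageCost(quality: Quality, size: string, count: number = 1): number {
  const tier = PRICE_PER_IMAGE[quality];
  if (size === "1024x1024") return tier.square * count;
  if (RECTANGULAR_SIZES.includes(size)) return tier.rectangular * count;
  throw new Error(`Unsupported size: ${size}`);
}
```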

GPT-4o Image Description

| Model | Input (per 1K tokens) | Output (per 1K tokens) |
| --- | --- | --- |
| gpt-4o | $0.0025 | $0.01 |
| gpt-4o-mini | $0.00015 | $0.0006 |
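
Token-based pricing is a linear function of input and output token counts. As a worked example using the rates above, a gpt-4o call with 1,000 input tokens and 300 output tokens costs roughly 1.0 × $0.0025 + 0.3 × $0.01 = $0.0055. The sketch below generalizes that arithmetic; `estimateVisionCost` is a hypothetical name, and how the provider actually counts image tokens is not covered here.

```typescript
// USD per 1K tokens, copied from the table above.
const TOKEN_RATES: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 0.0025, output: 0.01 },
  "gpt-4o-mini": { input: 0.00015, output: 0.0006 },
};

// Hypothetical estimator: linear in both token counts.
function estimateVisionCost(model: string, inputTokens: number, outputTokens: number): number {
  const rates = TOKEN_RATES[model];
  if (rates === undefined) throw new Error(`No pricing for model: ${model}`);
  return (inputTokens / 1000) * rates.input + (outputTokens / 1000) * rates.output;
}
```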

TTS-1 Text-to-Speech

| Model | Cost (per 1M chars) |
| --- | --- |
| tts-1 | $15.00 |
| tts-1-hd | $30.00 |

Whisper-1 Speech-to-Text

| Model | Cost (per minute) |
| --- | --- |
| whisper-1 | $0.006 |
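
Both audio prices scale linearly: text-to-speech with the character count, and transcription with the audio duration. The sketch below applies the TTS-1 and Whisper-1 rates from the tables above. The function names are hypothetical, and billing by exact fractional minutes (with no rounding) is an assumption.

```typescript
// USD per 1M characters, from the TTS table above.
const TTS_RATES: Record<string, number> = { "tts-1": 15.0, "tts-1-hd": 30.0 };

// USD per minute of audio, from the Whisper table above.
const WHISPER_RATE_PER_MINUTE = 0.006;

// Hypothetical estimator: character count times the per-character rate.
function estimateTtsCost(model: string, text: string): number {
  const rate = TTS_RATES[model];
  if (rate === undefined) throw new Error(`No pricing for model: ${model}`);
  return (text.length / 1e6) * rate;
}

// Hypothetical estimator: duration in seconds converted to fractional minutes.
function estimateSttCost(durationSeconds: number): number {
  if (durationSeconds < 0) throw new Error("duration must be non-negative");
  return (durationSeconds / 60) * WHISPER_RATE_PER_MINUTE;
}
```

For example, 2,000 characters on tts-1 come to about $0.03, and a 90-second clip on whisper-1 to about $0.009.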

Cache Configuration

The provider exposes static cacheConfig with deterministic and non-deterministic parameters.

Deterministic parameters: prompt, model, size, quality, style, text, voice, speed

Non-deterministic parameters: n, response_format, user, output_format, num_outputs, style_preset, dimensions, artifact_data, mime_type, detail, detail_level, audio_data, language

The normalize() function trims and collapses whitespace in prompt and text, and maps dimensions to size and style_preset to style so that equivalent requests produce the same cache key. Image and audio binary data are deliberately excluded from the deterministic parameters since identical media files will produce equivalent descriptions/transcriptions.
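
The normalization and keying described above might look roughly like the following. This is a sketch under stated assumptions, not the package's implementation: `normalizeParams` and `cacheKey` are hypothetical names, the alias direction (dimensions to size, style_preset to style) is inferred from the parameter lists, and the JSON-based key format is invented for illustration.

```typescript
type Params = Record<string, unknown>;

// Deterministic parameters, as listed above.
const DETERMINISTIC_KEYS = ["prompt", "model", "size", "quality", "style", "text", "voice", "speed"];

// Hypothetical normalizer: whitespace cleanup plus alias mapping.
function normalizeParams(params: Params): Params {
  const out: Params = { ...params };
  for (const key of ["prompt", "text"]) {
    const value = out[key];
    // Trim the ends and collapse internal whitespace runs to a single space.
    if (typeof value === "string") out[key] = value.trim().replace(/\s+/g, " ");
  }
  // Assumed alias direction: the pipeline-level name fills in the OpenAI-level name.
  if (out["size"] === undefined && out["dimensions"] !== undefined) out["size"] = out["dimensions"];
  if (out["style"] === undefined && out["style_preset"] !== undefined) out["style"] = out["style_preset"];
  return out;
}

// Hypothetical key builder: only deterministic parameters, in a fixed order.
function cacheKey(operation: string, params: Params): string {
  const normalized = normalizeParams(params);
  const picked = DETERMINISTIC_KEYS
    .filter((key) => normalized[key] !== undefined)
    .map((key) => [key, normalized[key]]);
  return JSON.stringify([operation, picked]);
}
```

With this scheme, `{ prompt: "  a   cat ", dimensions: "1024x1024" }` and `{ prompt: "a cat", size: "1024x1024" }` yield the same key, because the key is built from normalized deterministic parameters in a fixed order.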

Health Check

The health check sends a GET request to {baseUrl}/models using the configured API key and optional organization/project headers. It returns { healthy: true, latency: <ms> } on a 2xx response, or { healthy: false, error: "HTTP <status>: <message>" } on failure.
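
The behavior described above can be sketched as a small function that times a GET request and maps the response to a health result. Everything here is illustrative: `checkHealth` and `FetchLike` are hypothetical, the fetch implementation is passed in so the sketch stays testable (the global `fetch` in Node 18+ satisfies the type), using the response status text as the message is an assumption, and so is the handling of network errors.

```typescript
interface ProviderHealth {
  healthy: boolean;
  latency?: number; // milliseconds
  error?: string;
}

// Minimal subset of the fetch contract that this sketch relies on.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string> },
) => Promise<{ ok: boolean; status: number; statusText: string }>;

async function checkHealth(
  baseUrl: string,
  headers: Record<string, string>,
  fetchImpl: FetchLike,
): Promise<ProviderHealth> {
  const started = Date.now();
  try {
    const res = await fetchImpl(`${baseUrl}/models`, { method: "GET", headers });
    // res.ok is true for any 2xx status.
    if (res.ok) return { healthy: true, latency: Date.now() - started };
    return { healthy: false, error: `HTTP ${res.status}: ${res.statusText}` };
  } catch (err) {
    // Assumed behavior: network-level failures are reported as unhealthy, not thrown.
    return { healthy: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

In an ES module this could be called as `await checkHealth("https://api.openai.com/v1", { Authorization: "Bearer <key>" }, fetch)`.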

License

MIT