These packages provide a set of MCP tools for generating and processing images, audio, video, documents, and 3D models, and for chaining those operations into pipelines that pass artifacts between steps. They are meant for building AI agents that produce media as output, so that you do not have to wire together multiple provider SDKs, retry logic, cost tracking, and caching yourself. What sets them apart is that every operation is a composable pipeline step with built-in quality gates, budget enforcement, content-addressed caching, and multi-provider routing. You define a multi-step workflow once, and it handles provider fallback, cost caps, and resumable execution automatically.
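As a rough illustration of what content-addressed caching means here, the sketch below derives a cache key from a hash of the operation name and its parameters, so identical requests resolve to the same cached artifact. The names `cacheKey` and `cachedExecute`, the in-memory `Map`, and the `image.generate` operation name are hypothetical and not taken from these packages; the real key derivation may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: NOT the packages' real API.
type Params = Record<string, string | number | boolean>;

// Build a stable key: sort parameter names so that key order does not change the hash.
function cacheKey(operation: string, params: Params): string {
  const canonical = JSON.stringify(
    Object.keys(params).sort().map((k) => [k, params[k]]),
  );
  return createHash("sha256").update(`${operation}:${canonical}`).digest("hex");
}

// In-memory stand-in for an artifact store, keyed by content hash.
const cache = new Map<string, string>();

// Run `produce` only when no artifact exists for this exact operation + params.
function cachedExecute(operation: string, params: Params, produce: () => string): string {
  const key = cacheKey(operation, params);
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const artifact = produce();
  cache.set(key, artifact);
  return artifact;
}
```

Calling `cachedExecute` twice with the same operation and parameters, even with the parameters listed in a different order, runs the producer once and returns the stored artifact the second time.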
An Anthropic provider for the media pipeline framework that wraps Claude Sonnet's vision models to perform image description, OCR, table extraction, structured field extraction, and document summarization. It exports an `AnthropicProvider` class with an `execute()` method that accepts an operation name and parameters, and supports streaming token-by-token responses for all text-shaped operations.
A factory function that creates an `AudioGenOperations` instance providing text-to-speech, speech-to-text, speaker diarization, source separation, music generation, and sound effects, with automatic multi-provider routing to OpenAI, ElevenLabs, Deepgram, or any conformant provider.
A ComfyUI provider for the media pipeline framework that runs image generation, image editing, and video generation on your own GPU via local ComfyUI workflows, with zero API cost. It exports a `ComfyUIProvider` class that accepts a `baseUrl` pointing to a running ComfyUI instance and optionally a `workflowsDir` for custom JSON workflows.
Core framework for media pipeline orchestration, providing a Zod-validated type system, pipeline execution engine with variable interpolation, quality gate evaluation, artifact registry, budget enforcement, persistence-based resume, cost tracking, event bus, and a configurable mock provider for testing.
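To make the idea of variable interpolation between steps concrete, here is a minimal sketch of an engine that runs steps in order and substitutes references to earlier outputs. The `Step` shape, the `{{stepId.output}}` syntax, and the operation names are assumptions made for illustration; the core package's actual schema and interpolation syntax may look different.

```typescript
// Hypothetical sketch: the step shape and the {{stepId.output}} syntax are assumptions.
interface Step {
  id: string;
  operation: string;
  input: string; // may reference earlier outputs as {{stepId.output}}
}

type Handler = (input: string) => string;

// Replace every {{stepId.output}} with the output recorded for that step.
function interpolate(template: string, outputs: Map<string, string>): string {
  return template.replace(/\{\{(\w+)\.output\}\}/g, (_match, stepId: string) => {
    const value = outputs.get(stepId);
    if (value === undefined) throw new Error(`Unknown step reference: ${stepId}`);
    return value;
  });
}

// Execute steps in order, making each step's output available to later steps.
function runPipeline(steps: Step[], handlers: Record<string, Handler>): Map<string, string> {
  const outputs = new Map<string, string>();
  for (const step of steps) {
    const handler = handlers[step.operation];
    if (!handler) throw new Error(`No handler for operation: ${step.operation}`);
    outputs.set(step.id, handler(interpolate(step.input, outputs)));
  }
  return outputs;
}

// Usage: the second step consumes the first step's output.
const outputs = runPipeline(
  [
    { id: "script", operation: "text.upper", input: "hello" },
    { id: "speech", operation: "audio.tts", input: "say: {{script.output}}" },
  ],
  {
    "text.upper": (s) => s.toUpperCase(),
    "audio.tts": (s) => `audio(${s})`,
  },
);
```

Referencing a step that has not run yet throws, which is one simple way to surface ordering mistakes early.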
A typed cost ledger for tracking per-operation expenses in a media pipeline, providing an `InMemoryCostLedger` class with `charge()`, `preflight()`, `totalForRun()`, and `totalForTenant()` methods that support USD micro-precision, run-scoped and tenant-scoped queries, and preflight budget checks.
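The sketch below shows one way a ledger with these four methods could behave. It borrows the method names from the description, but the class name `SketchCostLedger`, the `Charge` shape, and every signature are assumptions and should not be read as the package's real API. Amounts are stored as integer micro-dollars, which is one plausible reading of "micro-precision".

```typescript
// Hypothetical sketch: method names mirror the description, signatures are assumptions.
interface Charge {
  runId: string;
  tenantId: string;
  operation: string;
  amountMicroUsd: number; // integer micro-dollars: 1 USD = 1_000_000
}

class SketchCostLedger {
  private readonly entries: Charge[] = [];

  // Record a charge; integer amounts avoid floating-point drift when summing.
  charge(entry: Charge): void {
    if (!Number.isInteger(entry.amountMicroUsd) || entry.amountMicroUsd < 0) {
      throw new Error("amountMicroUsd must be a non-negative integer");
    }
    this.entries.push(entry);
  }

  // Total spend recorded for a single pipeline run.
  totalForRun(runId: string): number {
    return this.sum((e) => e.runId === runId);
  }

  // Total spend recorded for a tenant across all of its runs.
  totalForTenant(tenantId: string): number {
    return this.sum((e) => e.tenantId === tenantId);
  }

  // Return true if an estimated charge would keep the run within its budget.
  preflight(runId: string, estimateMicroUsd: number, budgetMicroUsd: number): boolean {
    return this.totalForRun(runId) + estimateMicroUsd <= budgetMicroUsd;
  }

  private sum(matches: (e: Charge) => boolean): number {
    return this.entries.filter(matches).reduce((total, e) => total + e.amountMicroUsd, 0);
  }
}
```

A caller would typically run `preflight` with a provider's cost estimate before executing a step and skip or abort the step when it returns `false`.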
A Deepgram provider for the media pipeline framework that exposes `audio.stt` and `audio.diarize` operations via a `DeepgramProvider` class, using Nova-2 for speech-to-text transcription with smart formatting, speaker diarization, and WebSocket streaming support.
A factory function (`createDocumentExtractionOperations`) that returns a `DocumentExtractionOperations` instance providing OCR, table extraction, schema-driven field extraction, and content summarization, delegating each operation to registered LLM providers (e.g., Google, Anthropic, OpenAI) with automatic fallback chains.
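A fallback chain of this kind can be sketched as a loop that tries providers in order and returns the first success. Everything named here (`SketchProvider`, `executeWithFallback`, the `document.ocr` operation name) is hypothetical, and the providers are synchronous only to keep the example short; real provider calls would be asynchronous and the package's own routing logic may differ.

```typescript
// Hypothetical sketch: provider shape and names are assumptions, not the package's API.
// Providers are synchronous here for brevity; real provider calls would be asynchronous.
interface SketchProvider {
  name: string;
  supports(operation: string): boolean;
  execute(operation: string, input: string): string;
}

interface FallbackResult {
  provider: string;
  output: string;
  failures: string[]; // one entry per provider that was tried and failed
}

// Try each provider in order and return the first successful result.
function executeWithFallback(
  chain: SketchProvider[],
  operation: string,
  input: string,
): FallbackResult {
  const failures: string[] = [];
  for (const provider of chain) {
    if (!provider.supports(operation)) continue; // skip providers that lack the operation
    try {
      return { provider: provider.name, output: provider.execute(operation, input), failures };
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      failures.push(`${provider.name}: ${reason}`);
    }
  }
  throw new Error(
    `All providers failed for ${operation}: ${failures.join("; ") || "none available"}`,
  );
}
```

Keeping the list of failures alongside the result lets a caller log why earlier providers were skipped without treating the overall call as an error.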
An ElevenLabs provider for the media pipeline framework that generates text-to-speech audio with configurable voice, speed, model, and output format. It exposes `ElevenLabsProvider`, a `MediaProvider` class with `execute`, `healthCheck`, and `estimateCost` methods.
A Fal.ai provider for the media pipeline framework that exposes a `FalProvider` class supporting image generation, upscaling, background removal, text-to-video, and image-to-video operations via the fal.ai API, with native webhook support and streaming queue events for long-running tasks.
A Google Cloud provider for the media pipeline framework that exposes Document AI (OCR, table extraction, field extraction) and Vertex AI Gemini (image description) as a unified set of operations via an `execute` method on the `GoogleProvider` class.
An image editing operations factory that provides Sharp-based local processing (resize, crop, composite) and provider-delegated operations (upscale, background removal, inpainting, image description) through a multi-provider routing system.
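The split between local and provider-delegated operations could be modeled roughly as follows. The operation names, the `Route` type, and the provider table are assumptions chosen for illustration; this sketch only decides where an operation would run and does not call Sharp or any provider.

```typescript
// Hypothetical sketch: operation names and the routing table are assumptions.
const LOCAL_OPERATIONS = new Set(["resize", "crop", "composite"]);

type Route =
  | { kind: "local"; operation: string }
  | { kind: "provider"; operation: string; provider: string };

// Decide where an operation runs: in-process, or on the first provider that offers it.
function routeOperation(operation: string, providers: Record<string, string[]>): Route {
  if (LOCAL_OPERATIONS.has(operation)) return { kind: "local", operation };
  for (const [provider, operations] of Object.entries(providers)) {
    if (operations.includes(operation)) return { kind: "provider", operation, provider };
  }
  throw new Error(`No route for operation: ${operation}`);
}
```

With a table such as `{ fal: ["upscale", "remove-background"], comfyui: ["inpaint"] }`, a `resize` request stays local while an `inpaint` request is routed to `comfyui`.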