Speechbase: Universal Text-to-Speech Gateway & Voice Management

The catalog of upstream TTS providers that Speechbase routes to, the models each one exposes, and the workspace state attached to each provider.

A provider is an upstream TTS vendor: OpenAI, ElevenLabs, Cartesia, Hume, Google, Deepgram, Inworld, MiniMax, Fish Audio, Murf, Resemble, fal, Mistral, xAI, and others as the catalog grows. Speechbase ships an integration with each provider and exposes their models through one API.

For the end-to-end routing model, including BYOK and Managed Routing, start with Providers and routing.

Browse providers

Pick a provider for its models, voices, output quirks, and per-model capabilities.

Provider	Prefix	Default model
OpenAI	`openai`	`gpt-4o-mini-tts`
ElevenLabs	`elevenlabs`	`eleven_multilingual_v2`
Deepgram	`deepgram`	`aura-2`
Cartesia	`cartesia`	`sonic-3`
Hume	`hume`	`octave-2`
Google	`google`	`gemini-2.5-flash-preview-tts`
Fish Audio	`fish-audio`	`s2-pro`
Inworld	`inworld`	`inworld-tts-1.5-max`
MiniMax	`minimax`	`speech-2.8-hd`
Murf	`murf`	`GEN2`
Resemble	`resemble`	`default`
Smallest AI	`smallest-ai`	`lightning_v3.1`
fal	`fal-ai`	(specify a model)
Mistral	`mistral`	`voxtral-mini-tts-2603`
xAI	`xai`	`grok-tts`

Capability matrix

Provider	Streaming	Audio tags	Voice cloning	Timestamps	Open source
OpenAI	Yes	Yes	—	Gateway-generated	—
ElevenLabs	Yes	Yes (`eleven_v3`)	—	Native	—
Deepgram	Yes	—	—	Gateway-generated	—
Cartesia	Yes	Yes (`sonic-3`)	Yes (`sonic-3`)	Native	—
Hume	Yes	—	Yes (`octave-2`)	Native (`octave-2`)	—
Google	Yes	Yes (`gemini-3.1`)	—	Gateway-generated	—
Fish Audio	Yes	Yes	Yes	Gateway-generated	Yes
Inworld	Yes	—	—	Native	—
MiniMax	—	—	—	Gateway-generated	—
Murf	Yes	—	—	Native (`GEN2`)	—
Resemble	Yes	—	Yes	Native	Yes
Smallest AI	—	—	—	Gateway-generated	—
fal	—	—	Yes (select models)	Gateway-generated	Varies
Mistral	Yes	—	Yes	Gateway-generated	Yes
xAI	Yes	Yes	—	Gateway-generated	—

Support is per-model — check each provider page for the breakdown. "Gateway-generated" timestamps are explained under Fallbacks and timestamps; cloning is configured through saved Voices, not inline.

How a synthesis call gets to a provider

Every inline synthesis request specifies a model string of the form <provider_id>/<model_id>, e.g. openai/gpt-4o-mini-tts or elevenlabs/eleven_v3. Speechbase reads the prefix, looks up the provider integration, resolves provider access for your workspace, and dispatches the call. The string takes exactly one slash — fal-ai/f5-tts, never a doubled prefix.

If you didn't pin a provider in the request — for instance because you passed a voice_id that already encodes one — Speechbase uses the provider that the voice was registered against.
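The prefix rule and the voice fallback described above can be sketched in a few lines. This is an illustration only, not the gateway's actual code: the function names, the `voice_registry` mapping, and the choice to let an explicit model prefix win over a voice are assumptions made for the example.

```python
from typing import Optional


def split_model_string(model: str) -> tuple[str, str]:
    """Split '<provider_id>/<model_id>' into (provider_id, model_id).

    Exactly one slash is accepted, so a doubled prefix such as
    'fal-ai/fal-ai/f5-tts' is rejected.
    """
    parts = model.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected '<provider_id>/<model_id>', got {model!r}")
    return parts[0], parts[1]


def resolve_provider(
    model: Optional[str],
    voice_id: Optional[str],
    voice_registry: dict[str, str],
) -> str:
    """Pick the provider for a request.

    voice_registry is a hypothetical mapping of voice_id -> provider_id,
    standing in for whatever Speechbase stores for saved voices.
    """
    if model is not None:
        # Assumption: an explicit model prefix pins the provider.
        return split_model_string(model)[0]
    if voice_id is not None and voice_id in voice_registry:
        # No provider pinned: use the one the voice was registered against.
        return voice_registry[voice_id]
    raise ValueError("request names neither a model nor a known voice_id")
```

For example, `split_model_string("fal-ai/f5-tts")` yields the pair `("fal-ai", "f5-tts")`, while a string with two slashes raises `ValueError`.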

Listing what's available

GET /v1/audio/providers returns the full catalog with three pieces of state per provider:

enabled: whether the provider is currently switched on for your workspace. You can toggle this in the dashboard at Speechbase → Model Providers.
byok: whether you've stored a key for this provider. In BYOK mode, synthesis calls that target a provider without a stored key fail with no_api_key.
models: the prefixed model IDs you can pass in a synthesis request.
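As a rough sketch of how a client might use these three fields, the helper below filters a provider list down to the model IDs that should be callable. The exact response envelope is not documented here, so the example assumes the payload has already been parsed into a list of dictionaries carrying `enabled`, `byok`, and `models`; the `id` field and the function name are hypothetical.

```python
def usable_models(providers: list[dict], byok_mode: bool = True) -> list[str]:
    """Return the prefixed model IDs that are safe to request.

    Assumed input shape (not confirmed by the docs):
    [{"id": "openai", "enabled": True, "byok": True, "models": [...]}, ...]
    """
    models: list[str] = []
    for provider in providers:
        if not provider["enabled"]:
            continue  # provider is switched off for the workspace
        if byok_mode and not provider["byok"]:
            continue  # a call here would fail with no_api_key
        models.extend(provider["models"])
    return models


# Hand-written sample data for illustration only.
sample = [
    {"id": "openai", "enabled": True, "byok": True,
     "models": ["openai/gpt-4o-mini-tts"]},
    {"id": "deepgram", "enabled": True, "byok": False,
     "models": ["deepgram/aura-2"]},
    {"id": "murf", "enabled": False, "byok": True,
     "models": ["murf/GEN2"]},
]
callable_now = usable_models(sample)
```

With the sample above, only the OpenAI model survives the filter in BYOK mode; passing `byok_mode=False` skips the key check and also admits the Deepgram model.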

Provider access

Speechbase supports BYOK for self-serve provider access and Managed Routing for workspaces where Speechbase manages provider relationships, billing, and quotas.

With BYOK, Speechbase calls the provider with your own key. The provider bills you directly, and those charges never pass through Speechbase.

Mechanically: when you store a key via PUT /v1/api-keys/{providerId}, Speechbase encrypts it into a secure key store and writes a metadata row recording the key's last four characters and key_updated_at. We can't view or recover the full key. The plaintext key is never written to disk and never logged. At request time the gateway decrypts the key in memory, instantiates the provider client, and discards the plaintext when the request completes.

To rotate, just PUT the new value over the old one. To stop using a provider entirely, DELETE /v1/api-keys/{providerId} — both the encrypted key and the metadata row are cleared atomically.
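The store, rotate, and delete calls might be built as below with Python's standard library. Only the method and path come from this page. The base URL, the Bearer authorization header, and the `api_key` field in the JSON body are placeholders assumed for the sketch, so check the API reference for the real request shape before relying on it.

```python
import json
import urllib.request

# Placeholder host; substitute the real Speechbase API base URL.
BASE_URL = "https://speechbase.example"


def store_key_request(provider_id: str, provider_key: str, token: str) -> urllib.request.Request:
    """Build the PUT that stores a provider key, or rotates it if one exists."""
    # Assumption: the key travels in a JSON body under "api_key".
    body = json.dumps({"api_key": provider_key}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/api-keys/{provider_id}",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def delete_key_request(provider_id: str, token: str) -> urllib.request.Request:
    """Build the DELETE that clears the encrypted key and its metadata row."""
    return urllib.request.Request(
        f"{BASE_URL}/v1/api-keys/{provider_id}",
        method="DELETE",
        headers={"Authorization": f"Bearer {token}"},
    )
```

Rotation is the same call as the initial store: build a second `store_key_request` with the new value and send it with `urllib.request.urlopen`.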

For the step-by-step setup flow, see BYOK guide. For the comparison between BYOK and Managed Routing, see Providers and routing.

Fallbacks and timestamps

Not every provider exposes word-level timestamps natively. Speechbase wraps each provider with a timestamp fallback pass for the with-timestamps endpoints, so any provider can produce timestamps even when the upstream API doesn't. For that pass, the gateway aligns the rendered audio with the exact source text. The Timestamps column of the capability matrix above shows which models currently expose native timestamps. The choice is automatic: the gateway uses native timestamps when they are available and the timestamp fallback when they are not.

If both native timestamps and the timestamp fallback pass fail, the /v1/audio/speech/with-timestamps endpoint returns 503 timestamps_unavailable. The conversation variant is more forgiving — it can return audio with an empty timestamps array and a warnings entry instead, so you still get the audio to play. See Word-level timestamps.
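The decision order described in this section (native first, then the fallback pass, then an error or a warning) can be modelled as a small function. This is a simplified sketch rather than the gateway's implementation: the callables, the exception class, and the text of the warning entry are assumptions, and real alignment is replaced by whatever the callables return.

```python
from typing import Callable, Optional

Timestamps = list[dict]


class TimestampsUnavailable(Exception):
    """Stands in for the 503 timestamps_unavailable response."""
    status = 503
    code = "timestamps_unavailable"


def resolve_timestamps(
    native: Optional[Callable[[], Timestamps]],
    fallback: Callable[[], Timestamps],
    conversation: bool = False,
) -> dict:
    """Try native timestamps, then the fallback pass.

    native is None when the model has no native timestamp support.
    """
    for attempt in (native, fallback):
        if attempt is None:
            continue
        try:
            return {"timestamps": attempt(), "warnings": []}
        except Exception:
            continue  # fall through to the next source
    if conversation:
        # The conversation variant still returns audio; timestamps are empty.
        # The warning text here is a placeholder.
        return {"timestamps": [], "warnings": ["timestamps_unavailable"]}
    raise TimestampsUnavailable()
```

A caller of the single-request endpoint would treat `TimestampsUnavailable` as a hard failure, while a conversation client would check `warnings` and play the audio regardless.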

Multi-provider conversations

A single conversation request can dispatch different turns to different providers. Speechbase validates that every provider referenced in the turn list has a stored key and is enabled before any call is made — partial dispatches don't happen. See Conversations.
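The all-or-nothing check can be illustrated with a validation pass that runs over every turn before the first provider call. The turn shape (a `model` field per turn), the provider-state dictionary, and the error wording for a disabled provider are assumptions for this sketch; only `no_api_key` is a documented error name.

```python
from typing import Callable


def validate_turns(turns: list[dict], providers: dict[str, dict]) -> None:
    """Raise before any dispatch if a turn targets an unusable provider.

    providers maps provider_id -> {"enabled": bool, "byok": bool}
    (a hypothetical in-memory view of workspace state).
    """
    for index, turn in enumerate(turns):
        provider_id = turn["model"].split("/", 1)[0]
        state = providers.get(provider_id)
        if state is None or not state["enabled"]:
            raise ValueError(f"turn {index}: provider {provider_id!r} is not enabled")
        if not state["byok"]:
            raise ValueError(f"turn {index}: no_api_key for provider {provider_id!r}")


def dispatch_conversation(
    turns: list[dict],
    providers: dict[str, dict],
    synthesize: Callable[[dict], bytes],
) -> list[bytes]:
    """Validate every turn first, then synthesize each one in order."""
    validate_turns(turns, providers)  # nothing is sent if this raises
    return [synthesize(turn) for turn in turns]
```

Because validation finishes before the list comprehension starts, a bad final turn prevents the earlier turns from being sent at all.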
