Providers

The catalog of upstream TTS providers Speechbase routes to, the models each exposes, and the workspace state attached to each provider.

A provider is an upstream TTS vendor: OpenAI, ElevenLabs, Cartesia, Hume, Google, Deepgram, Inworld, MiniMax, Fish Audio, Murf, Resemble, fal, Mistral, xAI, and others as the catalog grows. Speechbase ships an integration with each provider and exposes their models through one API.

For the end-to-end routing model, including BYOK and Managed Routing, start with Providers and routing.

Browse providers

Pick a provider to see its models, voices, output quirks, and per-model capabilities.

| Provider | Prefix | Default model |
| --- | --- | --- |
| OpenAI | openai | gpt-4o-mini-tts |
| ElevenLabs | elevenlabs | eleven_multilingual_v2 |
| Deepgram | deepgram | aura-2 |
| Cartesia | cartesia | sonic-3 |
| Hume | hume | octave-2 |
| Google | google | gemini-2.5-flash-preview-tts |
| Fish Audio | fish-audio | s2-pro |
| Inworld | inworld | inworld-tts-1.5-max |
| MiniMax | minimax | speech-2.8-hd |
| Murf | murf | GEN2 |
| Resemble | resemble | default |
| Smallest AI | smallest-ai | lightning_v3.1 |
| fal | fal-ai | (specify a model) |
| Mistral | mistral | voxtral-mini-tts-2603 |
| xAI | xai | grok-tts |

Capability matrix

ProviderStreamingAudio tagsVoice cloningTimestampsOpen source
OpenAIYesYesGateway-generated
ElevenLabsYesYes (eleven_v3)Native
DeepgramYesGateway-generated
CartesiaYesYes (sonic-3)Yes (sonic-3)Native
HumeYesYes (octave-2)Native (octave-2)
GoogleYesYes (gemini-3.1)Gateway-generated
Fish AudioYesYesYesGateway-generatedYes
InworldYesNative
MiniMaxGateway-generated
MurfYesNative (GEN2)
ResembleYesYesNativeYes
Smallest AIGateway-generated
falYes (select models)Gateway-generatedVaries
MistralYesYesGateway-generatedYes
xAIYesYesGateway-generated

Support is per-model — check each provider page for the breakdown. "Gateway-generated" timestamps are explained under Fallbacks and timestamps; cloning is configured through saved Voices, not inline.

How a synthesis call gets to a provider

Every inline synthesis request specifies a model string of the form <provider_id>/<model_id>, e.g. openai/gpt-4o-mini-tts or elevenlabs/eleven_v3. Speechbase reads the prefix, looks up the provider integration, resolves provider access for your workspace, and dispatches the call. The string takes exactly one slash — fal-ai/f5-tts, never a doubled prefix.
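The one-slash rule can be sketched as a small client-side check. This is an illustrative helper, not Speechbase's own parser; the function name `split_model_string` and its error messages are assumptions.

```python
from typing import Tuple


def split_model_string(model: str) -> Tuple[str, str]:
    """Split '<provider_id>/<model_id>' into (provider_id, model_id).

    Hypothetical helper: rejects any string without exactly one slash,
    which also catches a doubled provider prefix.
    """
    if model.count("/") != 1:
        raise ValueError(f"expected exactly one '/', got {model!r}")
    provider_id, model_id = model.split("/")
    if not provider_id or not model_id:
        raise ValueError(f"provider and model must both be non-empty: {model!r}")
    return provider_id, model_id
```

Running a check like this before sending a request surfaces a malformed model string locally instead of as a failed API call.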

If you didn't pin a provider in the request — for instance because you passed a voice_id that already encodes one — Speechbase uses the provider that the voice was registered against.

Listing what's available

GET /v1/audio/providers returns the full catalog with three pieces of state per provider:

  • enabled — whether the provider is currently switched on for your org. You can toggle this in the dashboard at Speechbase → Model Providers.
  • byok — whether you've stored a key for this provider yet. In BYOK mode, synthesis calls targeting a provider without a key fail with no_api_key.
  • models — the prefixed model IDs you can pass in a synthesis request.
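As a rough illustration of how the three fields combine in BYOK mode, the sketch below filters a catalog down to the models a workspace could call. The response shape (a list of objects with an `id` field, boolean `enabled` and `byok` values) and the sample data are assumptions; only the field names `enabled`, `byok`, and `models` come from the description above.

```python
from typing import Dict, List


def callable_models(providers: List[Dict]) -> List[str]:
    """Return model IDs from providers that are enabled and have a stored key."""
    models: List[str] = []
    for provider in providers:
        if provider.get("enabled") and provider.get("byok"):
            models.extend(provider.get("models", []))
    return models


# Hypothetical sample payload, not real API output.
sample = [
    {"id": "openai", "enabled": True, "byok": True,
     "models": ["openai/gpt-4o-mini-tts"]},
    {"id": "elevenlabs", "enabled": True, "byok": False,
     "models": ["elevenlabs/eleven_v3"]},
    {"id": "cartesia", "enabled": False, "byok": True,
     "models": ["cartesia/sonic-3"]},
]

print(callable_models(sample))  # ['openai/gpt-4o-mini-tts']
```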

Provider access

Speechbase supports BYOK for self-serve provider access and Managed Routing for workspaces where Speechbase manages provider relationships, billing, and quotas.

With BYOK, requests are made with your own provider key, and the provider bills you directly rather than through Speechbase.

Mechanically: when you store a key via PUT /v1/api-keys/{providerId}, Speechbase stores it encrypted in a secure key store and writes a metadata row recording the last four characters and key_updated_at. We can't view or recover the full key. The plaintext key never lives on disk and is never logged. At request time the gateway decrypts the key in memory, instantiates the provider client, and discards the plaintext when the request completes.

To rotate, just PUT the new value over the old one. To stop using a provider entirely, DELETE /v1/api-keys/{providerId} — both the encrypted key and the metadata row are cleared atomically.
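The store, rotate, and delete calls can be sketched with the standard library as follows. Only the paths and HTTP methods come from the text above; the host name, the bearer-token header, and the `api_key` body field are placeholders chosen for illustration, so check the API reference for the real values.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://api.speechbase.example"  # placeholder host (assumption)


def key_request(provider_id: str, token: str,
                new_key: Optional[str] = None) -> urllib.request.Request:
    """Build a PUT (store or rotate) or DELETE (remove) request without sending it."""
    url = f"{BASE_URL}/v1/api-keys/{provider_id}"
    headers = {"Authorization": f"Bearer {token}"}  # auth scheme is an assumption
    if new_key is None:
        # No key supplied: build the DELETE that clears key and metadata.
        return urllib.request.Request(url, headers=headers, method="DELETE")
    headers["Content-Type"] = "application/json"
    # The body field name "api_key" is an assumption.
    body = json.dumps({"api_key": new_key}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")
```

Passing the returned object to `urllib.request.urlopen` would send it. Rotation is the same PUT call with the new key value.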

For the step-by-step setup flow, see BYOK guide. For the comparison between BYOK and Managed Routing, see Providers and routing.

Fallbacks and timestamps

Not every provider exposes word-level timestamps natively. Speechbase wraps each provider with a timestamp fallback pass for the with-timestamps endpoints, so any provider can produce timestamps even when the upstream API doesn't. The gateway aligns the rendered audio with the exact source text for that pass. Use the capability matrix above to see which models currently expose native timestamps. The choice is transparent: the gateway uses native timestamps when available and the timestamp fallback when needed, with no selection required on your part.

If both native timestamps and the timestamp fallback pass fail, the /v1/audio/speech/with-timestamps endpoint returns 503 timestamps_unavailable. The conversation variant is more forgiving — it can return audio with an empty timestamps array and a warnings entry instead, so you still get the audio to play. See Word-level timestamps.
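The decision order described in this section can be modeled as a short function. This is a sketch of the documented behavior, not the gateway's code: the callables, the exception class, and the exact contents of the `warnings` entry are assumptions.

```python
from typing import Callable, Dict, List, Optional

Timestamps = List[Dict]


class TimestampsUnavailable(Exception):
    """Stands in for the 503 timestamps_unavailable response."""


def resolve_timestamps(
    native: Optional[Callable[[], Timestamps]],
    fallback: Callable[[], Timestamps],
    conversation: bool = False,
) -> Dict:
    """Try native timestamps first, then the fallback pass."""
    for attempt in (native, fallback):
        if attempt is None:
            continue  # provider exposes no native timestamps
        try:
            return {"timestamps": attempt(), "warnings": []}
        except Exception:
            continue  # this source failed; try the next one
    if conversation:
        # Conversation variant: audio is still returned, timestamps are empty.
        return {"timestamps": [], "warnings": ["timestamps_unavailable"]}
    raise TimestampsUnavailable("timestamps_unavailable")
```

A client handling the conversation variant should therefore check `warnings` before assuming the `timestamps` array is populated.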

Multi-provider conversations

A single conversation request can dispatch different turns to different providers. Speechbase validates that every provider referenced in the turn list has a stored key and is enabled before any call is made — partial dispatches don't happen. See Conversations.
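The all-or-nothing check can be pictured as a validation pass that runs before any turn is dispatched. The sketch below is a client-side approximation: the turn shape (a `model` field per turn) and the provider state mapping are assumptions, and the real validation happens inside Speechbase.

```python
from typing import Dict, List


def blocking_problems(turns: List[Dict], providers: Dict[str, Dict]) -> List[str]:
    """Return one problem per provider that would block the conversation.

    An empty list means every referenced provider is enabled and has a key.
    """
    problems: List[str] = []
    seen = set()
    for turn in turns:
        provider_id = turn["model"].split("/", 1)[0]
        if provider_id in seen:
            continue  # each provider is checked once
        seen.add(provider_id)
        state = providers.get(provider_id, {})
        if not state.get("enabled"):
            problems.append(f"{provider_id}: not enabled")
        elif not state.get("byok"):
            problems.append(f"{provider_id}: no_api_key")
    return problems
```

Because the check covers the whole turn list up front, a single missing key rejects the request before any provider is called.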
