Which TTS Provider Should You Use?

Speechbase Team
  • tts
  • providers
  • speech

There is no single best text-to-speech provider. The right one depends on what you are building. A voice agent, an audiobook narrator, a game character, a branded IVR, and a podcast generator all want different things, and the answer shifts again with your languages, your latency budget, your need for word timing or cloning, and how cheaply you can switch later.

So the honest answer to "which TTS provider should I use?" is this: pick by the workload, not the brand, and test two or three candidates against your real script before you commit.

We currently route production traffic to 15 TTS providers through Speechbase: OpenAI, ElevenLabs, Cartesia, Deepgram, Google, Hume, Inworld, MiniMax, Murf, Resemble, Smallest AI, Fish Audio, fal, Mistral, and xAI. More are on the way. This guide comes from wiring up all of them and watching where each one shines and where it bites, so read it as a snapshot of the current routing layer rather than a fixed list.

One warning before the list. The common mistake is choosing on "quality" alone, as if quality were one number. It isn't. A gorgeous voice that returns first audio too slowly will break a voice agent. A fast voice that mangles customer names will break enterprise training. A cheap model with no usable word timing will break your captions and your barge-in handling. Decide which of those failures you cannot live with, then test for it.

The fast shortlist

If you only take one thing from this guide, choose by workload. Here is where to start testing for the most common jobs.

| If your product needs... | Start by testing... | Why |
| --- | --- | --- |
| Realtime voice agents | Cartesia, Inworld, Deepgram, Murf Falcon, Smallest AI, xAI | Built around low-latency streaming, agent workflows, or telephony output. |
| Premium narration or character voice | ElevenLabs, Hume, MiniMax, Google Gemini TTS | Stronger when voice style, emotion, and long-form continuity matter more than raw speed. |
| Owned or cloned voices | ElevenLabs, Resemble, Fish Audio, Hume, Cartesia, MiniMax, Mistral | Cloning workflows vary a lot, so test your exact reference audio. |
| Captions, highlighting, or avatars | ElevenLabs, Cartesia, Inworld, Hume, Resemble, plus Speechbase timing fallback | Native timing is uneven. Some providers return it; others need an alignment pass. |
| Cost-sensitive experimentation | xAI, Smallest AI, Fish Audio, Mistral, OpenAI, fal | Useful for early tests and model bake-offs where the voice doesn't carry the brand. |
| Google Cloud or Gemini-native apps | Google Gemini TTS | Natural-language style prompting and multi-speaker output fit Gemini and Vertex teams. |
| Enterprise governance and provenance | Resemble, Deepgram, Google Cloud, Murf enterprise | Look here when watermarking, deployment posture, or enterprise controls matter (Resemble is the one that adds owned-voice cloning). |
| Open or marketplace exploration | fal, Mistral, Fish Audio, Resemble | Good for trying model families that aren't a single polished vendor stack. |

One caveat worth flagging up front: through Speechbase, most providers stream audio as it is generated, but Smallest AI and fal currently return the finished clip instead. If your agent needs chunk-by-chunk audio streaming for barge-in or the lowest first-byte latency, weigh that when you test them.

How to compare them

Use the same test prompt across every candidate, and never judge a provider from its own demo text. Five things are worth measuring.

  • First audio latency. For agents, measure time to first playable audio, not total generation time.
  • Long-form stability. For narration, run a full chapter or a multi-minute script. Listen for voice drift and strange pauses, not just one clean sentence.
  • Pronunciation. Feed it names, brands, numbers, dates, acronyms, product terms, and code-switched text.
  • Timing. If you need captions, avatars, highlighting, or barge-in, check whether timing is native or reconstructed, and whether it is character-level or word-level.
  • Operational surface. Streaming shape, concurrency, rate limits, logs, pricing units, and how hard it would be to swap the provider out.
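The first measurement on that list can be sketched as a small timer around any chunked audio stream. This is a minimal sketch, not part of any provider SDK: `measureFirstAudio`, the `FirstAudioResult` shape, and the injectable `now` clock are illustrative names, and it assumes your client exposes audio as an `AsyncIterable<Uint8Array>`.

```typescript
// Hypothetical helper: times how long a chunked audio stream takes to yield
// its first non-empty chunk, and how long the whole stream takes.
type FirstAudioResult = {
  firstAudioMs: number | null; // null if the stream never produced audio
  totalMs: number;
  totalBytes: number;
};

async function measureFirstAudio(
  startStream: () => AsyncIterable<Uint8Array>,
  now: () => number = () => Date.now(),
): Promise<FirstAudioResult> {
  const startedAt = now(); // start the clock before the request is issued
  let firstAudioMs: number | null = null;
  let totalBytes = 0;

  for await (const chunk of startStream()) {
    if (chunk.byteLength === 0) continue; // ignore empty keep-alive chunks
    if (firstAudioMs === null) firstAudioMs = now() - startedAt;
    totalBytes += chunk.byteLength;
  }

  return { firstAudioMs, totalMs: now() - startedAt, totalBytes };
}
```

Comparing `firstAudioMs` with `totalMs` makes the difference between time to first playable audio and total generation time visible for each candidate. Run it from the region your agent will actually live in, since the network path is part of the number.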

There are really two questions hiding inside "does this provider support X?" One is whether the upstream provider exposes the feature in its own API. The other is whether your app gets a usable version of it through Speechbase. The table below answers the first question: a check means the provider exposes the feature in its own API or model docs. "Partial" means there's a related control but not a full developer-facing feature. "Varies" means it depends on the model endpoint.

| Provider | Audio tags / reactions | Native word timing | Streaming | Voice cloning | Pronunciation control |
| --- | --- | --- | --- | --- | --- |
| OpenAI | Partial: style instructions | No | ✓ | No | Partial: prompt hints |
| ElevenLabs | ✓ Eleven v3 tags | ✓ character alignment | ✓ | ✓ | ✓ dictionaries |
| Cartesia | ✓ SSML-like tags (Sonic 3) | ✓ word timing | ✓ | ✓ | ✓ dictionaries, inline IPA |
| Deepgram | No | No | ✓ | No | No |
| Google Gemini TTS | ✓ audio tags (Gemini 3.1); style prompting (2.5) | No | ✓ buffered | No | Partial: prompt hints |
| Hume | Partial: emotional steering | ✓ word + phoneme (Octave 2) | ✓ | ✓ | No |
| Inworld | Partial: steering, SSML breaks | ✓ word, character, phoneme, viseme | ✓ | ✓ | ✓ inline IPA |
| MiniMax | ✓ sound + interjection tags (2.8) | Partial: sentence subtitles | ✓ | ✓ | ✓ dictionaries |
| Murf | Partial: styles, pause controls | wordDurations on Gen2 | ✓ | Enterprise | No |
| Resemble | ✓ SSML, paralinguistic tags | ✓ grapheme + phoneme | ✓ | ✓ | ✓ SSML, custom |
| Smallest AI | No | ✓ WebSocket word timing | ✓ | ✓ | ✓ dictionaries |
| Fish Audio | ✓ expression and effect tags | No | ✓ | ✓ | Partial: fine-grained controls |
| fal | Varies by model | Varies by model | Varies by model | ✓ select models | Varies by model |
| Mistral | No | No | ✓ | ✓ | No |
| xAI | ✓ inline and wrapping tags | ✓ character-level | ✓ | ✓ | No |

The second question is where Speechbase earns its keep, and word timing is the clearest example. You can request word timestamps on every route, from a dedicated buffered endpoint. The difference is how they are produced. ElevenLabs, Cartesia, Inworld, and Resemble return timing natively, as do Hume Octave 2 and Murf Gen2. For everyone else, Speechbase reconstructs word boundaries on the gateway by aligning the rendered audio back to your text, so captions and highlighting work even when the provider ships no timing of its own. Two caveats: streaming and timestamps are separate request paths, since alignment needs the full clip, and if both native timing and the fallback fail the endpoint returns a 503 rather than guessing. When a feature is load-bearing, confirm the exact path in the provider docs before you plan around it.
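The order of operations described above (native timing first, alignment second, a 503 if both fail) can be sketched as a small decision function. This is not Speechbase's implementation. `resolveWordTiming`, `TimingSource`, and `TimingUnavailableError` are hypothetical names, and the two timing sources are injected as plain functions so the logic stays testable.

```typescript
type WordTiming = { word: string; startMs: number; endMs: number };
type TimingSource = () => WordTiming[] | null;
type TimingResult = { source: "native" | "reconstructed"; words: WordTiming[] };

// Error carrying the HTTP-style status the text describes for a double failure.
class TimingUnavailableError extends Error {
  readonly status = 503;
  constructor() {
    super("Word timing unavailable: native timing and alignment both failed");
    this.name = "TimingUnavailableError";
  }
}

// Run one timing source; treat exceptions and empty results as "no timing".
function tryTiming(source: TimingSource): WordTiming[] | null {
  try {
    const words = source();
    return words !== null && words.length > 0 ? words : null;
  } catch {
    return null;
  }
}

function resolveWordTiming(native: TimingSource, align: TimingSource): TimingResult {
  const nativeWords = tryTiming(native);
  if (nativeWords) return { source: "native", words: nativeWords };

  const alignedWords = tryTiming(align);
  if (alignedWords) return { source: "reconstructed", words: alignedWords };

  throw new TimingUnavailableError(); // fail loudly instead of guessing
}
```

Keeping the `source` label on the result is a deliberate choice in this sketch: a caption UI may tolerate reconstructed timing that a lip-sync pipeline would want to flag.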

What it costs

Price is usually the second question after "does it sound good," so here is a rough cost ladder. These are approximate published list prices, normalized to US dollars per one million characters as of June 2026.

| Provider (model) | Approx. list price / 1M chars | Notes |
| --- | --- | --- |
| Smallest AI Lightning v3.1 | ~$15 | Pro tier is ~$20. |
| OpenAI TTS-1 | ~$15 | tts-1-hd is ~$30; gpt-4o-mini-tts is token-billed. |
| xAI Grok TTS | ~$15 | |
| Fish Audio S2 Pro | ~$15 | Billed per UTF-8 byte, so non-Latin scripts cost more. |
| Mistral Voxtral | ~$16 | |
| fal Kokoro | ~$20 | F5-TTS and Orpheus are ~$50 each; on fal even a tiny model like Kokoro lands at ~$20. |
| Inworld TTS 2 | ~$25 | Mini is ~$15; 1.5 Max is ~$35. |
| Deepgram Aura-2 | ~$30 | |
| Murf Gen2 | ~$30 | Falcon is ~$10. |
| Resemble | ~$30 | Billed per audio second, so this shifts with speaking rate. |
| Google Gemini TTS | ~$17 to ~$40 | Token-billed; estimate assumes ~900 chars/min. |
| Cartesia | ~$37 to ~$50 | One credit per character; effective rate varies by plan. |
| ElevenLabs | ~$50 to ~$100 | Flash is the cheaper half; v3 and Multilingual the top. |
| MiniMax Speech 2.8 | ~$60 (Turbo) | HD is ~$100. |
| Hume Octave | ~$150 | Drops toward ~$50 at the highest tiers. |

Read it as a ladder, not a quote. Several providers bill in their own units (tokens, audio seconds, UTF-8 bytes, or plan credits), so your real bill depends on language, speaking rate, and tier. The cheap end (xAI, Smallest AI, Fish, Mistral, OpenAI TTS-1) is fine for bake-offs and high-volume utility speech. The premium end (Hume, ElevenLabs v3, MiniMax HD) is where you pay for the voice doing creative work. Verify the current number before you budget.
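To see why per-second billing moves with speaking rate, it helps to write the normalization down. The sketch below is illustrative arithmetic only: the function names are hypothetical, and the 0.0005 USD per audio second rate is a made-up number, not any provider's price.

```typescript
const CHARS_PER_MILLION = 1_000_000;

// Convert a per-audio-second price into dollars per one million characters.
function perSecondToPerMillionChars(usdPerSecond: number, charsPerMinute: number): number {
  if (charsPerMinute <= 0) throw new RangeError("charsPerMinute must be positive");
  const secondsPerChar = 60 / charsPerMinute;
  return usdPerSecond * secondsPerChar * CHARS_PER_MILLION;
}

// Estimate a job's cost from its character count and a normalized rate.
function estimateJobUsd(chars: number, usdPerMillionChars: number): number {
  return (chars / CHARS_PER_MILLION) * usdPerMillionChars;
}

// Illustrative only: 0.0005 USD per audio second is a made-up rate.
const atNarrationPace = perSecondToPerMillionChars(0.0005, 900); // about 33.33
const atSlowerPace = perSecondToPerMillionChars(0.0005, 750); // about 40
const chapterCost = estimateJobUsd(250_000, 30); // 7.5
```

The same made-up rate lands at roughly $33 per million characters at 900 characters per minute and roughly $40 at 750, which is the whole reason to treat the table as a ladder.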

The rest of this guide assumes you are shipping a real product, not rendering a one-off clip.

OpenAI

OpenAI's text-to-speech API is the easiest default when your stack already runs on OpenAI for text, agents, or transcription. The newer gpt-4o-mini-tts is promptable: you can steer accent, emotion, intonation, pace, and tone in plain language, and the API returns MP3, Opus, AAC, FLAC, WAV, and PCM. OpenAI's realtime audio models are a separate product line, so for one-shot synthesis this is still the model to reach for.

The trade-off is depth. The public voice catalog is small next to voice-first vendors, and the TTS docs don't treat word timing as a headline feature. If you need timestamps, Speechbase reconstructs them on the gateway, but test that timing against your actual UI.

Model strings: openai/gpt-4o-mini-tts, openai/tts-1, openai/tts-1-hd

Best for: a simple app voice inside an OpenAI-native product, with fast integration.

ElevenLabs

ElevenLabs is the name most people reach for when the voice is part of the product. It leads on voice quality, a library of thousands of community voices, cloning and voice design, multilingual speech, streaming, pronunciation dictionaries, and expressive audio tags like [laughs] and [whispers] in Eleven v3. Its timestamps endpoint returns character-level alignment, which is finer-grained than most.
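Character-level alignment usually needs one more step before it can drive captions or word highlighting: grouping characters into words. Here is a minimal sketch of that grouping. The `CharAlignment` shape and its field names are assumptions made for illustration, not ElevenLabs' response schema, so map the real response into it first.

```typescript
// Assumed input shape: parallel arrays, one entry per character.
type CharAlignment = { chars: string[]; startsSec: number[]; endsSec: number[] };
type WordTiming = { word: string; startMs: number; endMs: number };

// Group per-character timing into per-word timing, splitting on whitespace.
function charsToWords(alignment: CharAlignment): WordTiming[] {
  const words: WordTiming[] = [];
  let word = "";
  let startMs = 0;
  let endMs = 0;

  for (let i = 0; i < alignment.chars.length; i++) {
    const ch = alignment.chars[i];
    if (/\s/.test(ch)) {
      // Whitespace closes the current word, if there is one.
      if (word) {
        words.push({ word, startMs, endMs });
        word = "";
      }
      continue;
    }
    if (!word) startMs = Math.round(alignment.startsSec[i] * 1000);
    word += ch;
    endMs = Math.round(alignment.endsSec[i] * 1000);
  }
  if (word) words.push({ word, startMs, endMs });
  return words;
}
```

Punctuation stays attached to its word in this sketch, which suits captions; strip it first if you are matching against a transcript.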

The catch is surface area and cost. Commercial rights, latency options, library access, and tag behavior all shift by plan, model, voice, and language. Treat Eleven v3, Multilingual v2, and Flash as three different tools rather than one checkbox: v3 for expression, Flash for cheap, low-latency throughput, Multilingual for breadth.

Model strings: elevenlabs/eleven_v3, elevenlabs/eleven_multilingual_v2, elevenlabs/eleven_flash_v2_5, elevenlabs/eleven_flash_v2

Best for: premium voices, expressive speech, large voice libraries, production narration.

Cartesia

Cartesia Sonic is built for realtime. Sonic 3.5, released in May 2026, advertises sub-90ms latency. The developer surface covers bytes, SSE, and WebSocket endpoints; streaming text input; word-level timing; pronunciation dictionaries with inline IPA; voice cloning from a few seconds of audio; and SSML-like controls for pacing and emotion.

Watch the operational fit. Concurrency is plan-shaped, the emotion tag is still experimental and works best with emotive-tagged voices, and Cartesia has reworked its voice and timing interfaces over time. If you are migrating from an older voice-embedding integration, confirm the current voice ID flow before assuming it's a drop-in swap.

Model strings: cartesia/sonic-3.5, cartesia/sonic-3, cartesia/sonic-2

Best for: low-latency agents, multilingual realtime UX, native timing.

Deepgram

Deepgram is a speech infrastructure company first and a TTS vendor second. Aura-2 sits on the same stack as Deepgram's speech-to-text and voice-agent products, which makes it a natural pick when you want STT, TTS, and realtime agent plumbing from one vendor with clear per-character pricing.

Expressiveness is the limit. There's no creator voice library, no public cloning, and no audio-tag system. The docs emphasize output settings, WebSocket streaming, and voice controls over creator-style performance direction. If you want a voice that acts, look elsewhere; if you want a dependable agent voice next to your STT, it's a strong fit.

Model string: deepgram/aura-2

Best for: enterprise speech stacks, STT-plus-TTS teams, realtime agents where the infrastructure matters.

Google Gemini TTS

Google Gemini TTS brings Gemini-style natural-language control to speech, with single-speaker and multi-speaker output. Gemini 3.1 Flash TTS adds an inline audio-tag system on top of the prompt-directed delivery, and the same models reach you through the Gemini API, AI Studio, Vertex, and Cloud Text-to-Speech.

Two things to know. All three TTS models listed below are still in preview, and the Gemini API and Cloud/Vertex experiences are not identical. Native timing alignment also isn't a focus, so test captions and highlighting through your own pipeline.

Model strings: google/gemini-3.1-flash-tts-preview, google/gemini-2.5-flash-preview-tts, google/gemini-2.5-pro-preview-tts

Best for: Google Cloud teams, prompt-controlled delivery, multi-speaker content.

Hume

Hume Octave is built for expressive, emotionally aware speech. The model reads tone and intent from the text and adapts pronunciation, pitch, and emphasis to match, and Octave 2 adds word and phoneme timestamps, voice cloning from about 15 seconds of audio, and long-form continuity across a script.

Cost and versioning are the things to watch. Hume is not the choice for commodity speech at the lowest price, Octave 2 is currently a preview model, and some performance controls (acting instructions among them) are still rolling out. Validate cloning, timestamps, and language support on the exact version you plan to ship.

Model strings: hume/octave-2, hume/octave-1

Best for: emotional delivery, character speech, avatars, narrative products.

Inworld

Inworld TTS is strongest for realtime, interactive speech, and Realtime TTS-2 is its lowest-latency model yet. The platform leans into low-latency streaming, voice cloning and design, inline IPA pronunciation, on-prem options, and rich alignment data: word and character timing, with phoneme and viseme detail nested inside the word-level output.

The fine print is model-specific behavior. Some controls vary by model, enabling timestamp alignment can add latency on the non-streaming path, and there's a per-request input limit of roughly 2,000 characters to plan around for long passages. Inworld fits interactive speech better than plain batch narration.
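A per-request input limit like this usually means splitting long passages at sentence boundaries before sending them. The helper below is a hypothetical sketch: `chunkText` is our own name, the 2,000 default mirrors the rough figure above, and it counts JavaScript string length, which may not match how a provider counts characters.

```typescript
// Split text into chunks of at most maxChars, preferring sentence boundaries.
function chunkText(text: string, maxChars = 2000): string[] {
  if (maxChars < 1) throw new RangeError("maxChars must be at least 1");

  // Break after sentence-ending punctuation that is followed by whitespace.
  const sentences = text
    .trim()
    .split(/(?<=[.!?])\s+/)
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    let rest = sentence;
    // A single sentence longer than the limit is hard-split as a last resort.
    while (rest.length > maxChars) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    const candidate = current ? current + " " + rest : rest;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      chunks.push(current);
      current = rest;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Chunk boundaries are also where prosody resets tend to be audible, so listen across the seams when you test long passages.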

Model strings: inworld/inworld-tts-2, inworld/inworld-tts-1.5-max, inworld/inworld-tts-1.5-mini

Best for: interactive characters, realtime agents, alignment-heavy experiences.

MiniMax

MiniMax Speech 2.8 is tuned for production content: long-form speech, a large voice catalog, 40-plus languages, sound and interjection tags (laughs, sighs, and the like) on the 2.8 models, pronunciation dictionaries, sentence-level subtitles, and voice cloning from about 10 seconds of reference audio. HD trades cost for broadcast-grade quality; Turbo trades quality for speed.

The surface is broad. MiniMax exposes sync, async, and WebSocket flows with different limits, and pay-as-you-go pricing runs higher than the budget tier, so test both the HD and Turbo paths against your real scripts before you settle.

Model strings: minimax/speech-2.8-hd, minimax/speech-2.8-turbo

Best for: long-form narration, multilingual content, production voice workflows.

Murf

Murf runs two lanes. Falcon is the low-latency lane for voice-agent output, with roughly 130ms time-to-first-audio and high concurrency. Gen2 is the polished lane for business content, e-learning, ads, and dubbing, and it returns wordDurations timing.

Billing and cloning are where expectations trip teams up. Murf publishes multiple pricing units, and voice cloning is an enterprise flow with a multi-week turnaround, not the instant self-serve clone that clone-first vendors offer. Confirm the model, billing basis, region, and concurrency before you promise production behavior.

Model strings: murf/GEN2, murf/FALCON

Best for: business narration, enterprise voice output, Falcon-style agent tests.

Resemble AI

Resemble AI is the pick when voice ownership, security, and provenance come first. Its TTS now runs on the open-source Chatterbox family (Chatterbox Multilingual v3, with a Turbo variant for low latency), and the platform adds custom voices, cloning, speech-to-speech, grapheme and phoneme timestamps, a PerTh watermark on every output, a separate deepfake detector, and on-prem deployment.

It is not the cheapest generic TTS path, and custom pronunciation behavior can be language-specific. For a quick prototype voice, a simpler provider gets you moving faster. Resemble pays off when customers will ask how your voice assets are made, protected, and detected.

Model string: resemble/default

Best for: secure cloning, branded voices, provenance-conscious teams.

Smallest AI

Smallest AI is built around low-latency voice infrastructure. The Lightning models target realtime agents, telephony, and live narration, with sub-100ms latency, instant voice cloning, pronunciation dictionaries, and a strong emphasis on Indic and code-mixed language support.

Capability details need care. Public docs and marketing differ on exact language and voice counts (the site lists around 15 languages), a newer Lightning v3.2 with instruction following has appeared upstream, and some timing or cloning features depend on mode and plan. Through Speechbase, Smallest currently returns the full clip rather than streaming audio chunks, so factor that into agent designs that expect streaming.

Model strings: smallest-ai/lightning_v3.1, smallest-ai/lightning_v3.1_pro

Best for: low-latency agents, Indic language coverage, budget realtime speech.

Fish Audio

Fish Audio is a developer-friendly platform around the S2 Pro model, with streaming, voice cloning from short samples, voice references, expression and effect tags, fine-grained delivery controls, and aggressive pricing. It's a good pick when you want an open, experimental, low-cost voice stack.

Two things to weigh. First, Fish bills per UTF-8 byte, so Chinese, Japanese, Korean, Arabic, and Hindi text costs roughly two to three times as much per character as ASCII Latin text. Second, rate limits are tiered by spend. It's a strong model choice, but it's less suited to enterprise provenance, consent workflows, and governed voice operations than a provider like Resemble.
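You can check the byte math on your own scripts before you budget. This sketch uses the standard `TextEncoder` API. The helper names and the per-million-byte rate parameter are illustrative, and it assumes the bill is strictly proportional to UTF-8 bytes.

```typescript
// Average UTF-8 bytes per character for a text sample.
function utf8BytesPerChar(text: string): number {
  const byteCount = new TextEncoder().encode(text).length;
  const charCount = Array.from(text).length; // code points, not UTF-16 units
  return charCount === 0 ? 0 : byteCount / charCount;
}

// Estimated cost of a text under byte-based billing.
// usdPerMillionBytes is a placeholder: plug in the current published rate.
function estimateByteBilledUsd(text: string, usdPerMillionBytes: number): number {
  const byteCount = new TextEncoder().encode(text).length;
  return (byteCount / 1_000_000) * usdPerMillionBytes;
}
```

Plain ASCII averages 1 byte per character, Arabic letters 2, and Chinese, Japanese, Korean, and Devanagari characters 3. Four-byte characters are mostly emoji and rare ideographs outside the Basic Multilingual Plane.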

Model string: fish-audio/s2-pro

Best for: expressive experiments, low-cost custom voices, open-adjacent workflows.

fal

fal is different in kind: a model marketplace and inference layer rather than one TTS product. Through it you can reach model families like F5-TTS, Kokoro, Orpheus, Chatterbox, and Dia, route jobs through queues and webhooks, and expose model choice to your own users.

Consistency is the price of that range. Streaming, timing, audio tags, cloning, pricing, and licenses all change from one endpoint to the next (F5-TTS clones voices, Kokoro doesn't, and so on). fal is excellent for exploration and routing. It is not a governed voice-management layer on its own.

Model strings: fal-ai/f5-tts, fal-ai/kokoro, fal-ai/orpheus-tts

Best for: model exploration, open model access, async and batch experimentation.

Mistral

Mistral Voxtral TTS is worth evaluating for open-weight access, fast generation (around 70ms latency, light enough to run locally), and zero-shot cloning from a few seconds of audio. It's a clean way to add an open-model path or a European vendor to your provider mix.

The license is the part people get wrong. The weights ship under CC BY-NC 4.0, which is non-commercial, so commercial use means the hosted API or a separate license, not free self-hosting. Language coverage is also narrow (nine languages). If you need timestamps, moderation, or long-form stability, test those explicitly.

Model string: mistral/voxtral-mini-tts-2603

Best for: open-model evaluation, custom voice experiments, vendor diversification.

xAI

xAI Grok TTS pairs expressive voices with inline and wrapping speech tags, streaming, character-level timestamps, custom voices, and multiple output formats (MP3, WAV, PCM, and the telephony codecs mulaw and alaw), plus tight Grok ecosystem integration. It is competitively priced for an expressive model with telephony support.

The limits are voice breadth and access. The built-in set is small (about five voices), and custom voice creation is subject to xAI's account and content policies. If custom voices are central to your product, confirm access before you plan around it.

Model string: xai/grok-tts

Best for: Grok-native apps, expressive tagged speech, realtime and telephony output.

What changes when you route through Speechbase

Going provider-direct, the first integration is easy. The second is where the tax starts: a new SDK, new auth, new model names, new voice IDs, new streaming behavior, new logs, new error shapes, and new pricing math. Speechbase is built for the moment your audio stack outgrows one provider.

Switching providers is a string change:

import { generateSpeech } from "@speech-sdk/core"

// Route the text to Cartesia.
await generateSpeech({
  model: "cartesia/sonic-3.5",
  voice: "your-cartesia-voice-id",
  text: "This same call shape can route to another TTS provider.",
})

// Same call shape, different provider: only the model string and voice ID change.
await generateSpeech({
  model: "elevenlabs/eleven_v3",
  voice: "your-elevenlabs-voice-id",
  text: "This same call shape can route to another TTS provider.",
})

That doesn't make the providers identical. It makes them comparable, behind one API surface, one voice library, one request log, and one timing story, so you can keep moving when your first provider stops being the right fit. The point isn't to hide provider differences. It's to put them in one place where you can measure them. All 15 routes share that surface today, and new providers join it regularly.
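Because the call shape is shared, a bake-off can be described as data before any audio is rendered. The sketch below only builds request objects. The `model`, `voice`, and `text` fields mirror the call shown earlier, while `Candidate`, `buildBakeOff`, and `providerOf` are hypothetical helpers of our own.

```typescript
type Candidate = { model: string; voice: string };
type SpeechRequest = { model: string; voice: string; text: string };

// Pair every test script with every candidate so each provider
// is judged on exactly the same text.
function buildBakeOff(candidates: Candidate[], scripts: string[]): SpeechRequest[] {
  const requests: SpeechRequest[] = [];
  for (const text of scripts) {
    for (const { model, voice } of candidates) {
      requests.push({ model, voice, text });
    }
  }
  return requests;
}

// Provider prefix of a "provider/model" string,
// for example "cartesia" for "cartesia/sonic-3.5".
function providerOf(model: string): string {
  const slash = model.indexOf("/");
  return slash === -1 ? model : model.slice(0, slash);
}
```

Each request can then be handed to whatever synthesis call you use, with results grouped by `providerOf(request.model)` when you compare them.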

How to run your own test

Don't pick from a landing page. Build a small test suite and run every candidate through it.

For voice agents, script a 10-turn conversation with interruption points, numbers, names, and one out-of-domain response. Measure first audio latency and whether the timing data is actually usable.
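A 10-turn script gives you ten latency samples per provider, and the slow turns matter more than the average. One simple way to summarize them is a nearest-rank percentile. The function below is a generic sketch, and the sample values in the usage lines are invented for illustration.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError("no samples");
  if (p < 0 || p > 100) throw new RangeError("p must be between 0 and 100");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Invented first-audio latencies (ms) for ten turns; one turn stalls.
const turnLatenciesMs = [120, 95, 300, 110, 105, 98, 130, 102, 99, 101];
const p50 = percentile(turnLatenciesMs, 50); // 102
const p95 = percentile(turnLatenciesMs, 95); // 300
```

The median here looks healthy while the 95th percentile exposes the stalled turn, which is the one a caller would notice.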

For narration, render 5 to 10 minutes of real script. Listen for voice drift, mispronunciations, odd pauses, repeated words, and how the provider handles chapters, headings, and abbreviations.

For branded voices, clone with the exact consented reference audio you have, then probe the failure cases: noisy audio, emotional text, numbers, brand names, and a passage that should be refused.

For captions or avatars, test timing before you pick the voice. Retrofitting alignment later is possible, but make it a decision, not a surprise.
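A quick way to test timing is to turn it into a caption file and watch it against the audio. The sketch below renders word timings as WebVTT cues. `WordTiming` and both helpers are illustrative names, and the fixed words-per-cue grouping is a simplification that ignores punctuation and line length.

```typescript
type WordTiming = { word: string; startMs: number; endMs: number };

// Format milliseconds as a WebVTT timestamp: hh:mm:ss.ttt
function formatVttTime(ms: number): string {
  const total = Math.max(0, Math.round(ms));
  const hours = Math.floor(total / 3_600_000);
  const minutes = Math.floor((total % 3_600_000) / 60_000);
  const seconds = Math.floor((total % 60_000) / 1000);
  const millis = total % 1000;
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)}.${pad(millis, 3)}`;
}

// Group words into cues of at most maxWordsPerCue and emit a WebVTT document.
function toWebVtt(words: WordTiming[], maxWordsPerCue = 7): string {
  const lines = ["WEBVTT", ""];
  for (let i = 0; i < words.length; i += maxWordsPerCue) {
    const group = words.slice(i, i + maxWordsPerCue);
    const startMs = group[0].startMs;
    const endMs = group[group.length - 1].endMs;
    lines.push(`${formatVttTime(startMs)} --> ${formatVttTime(endMs)}`);
    lines.push(group.map((w) => w.word).join(" "));
    lines.push(""); // blank line ends the cue
  }
  return lines.join("\n");
}
```

If cues drift or overlap when you play the file back, you have learned that before choosing the voice rather than after.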

And when two providers both sound good, pick the stack that makes switching cheapest. Your first TTS provider is rarely your last.

Frequently asked questions

What is the best text-to-speech provider? There isn't one. For expressive, creator-facing voice, ElevenLabs and Hume lead. For low-latency agents, Cartesia, Inworld, and Deepgram are strong. For long-form multilingual content, MiniMax and Google Gemini TTS hold up well. Match the provider to the workload and test against your own script.

What is the cheapest TTS API? At June 2026 list prices, the cheapest options cluster around $15 per million characters: Smallest AI Lightning v3.1, OpenAI's TTS-1, xAI Grok TTS, Fish Audio, and Mistral Voxtral, with fal's Kokoro just behind near $20. Units differ, though: Fish bills per UTF-8 byte, so non-Latin scripts cost more, and OpenAI's newer gpt-4o-mini-tts is token-billed.

Which TTS provider is best for voice agents? Optimize for time to first audio and timing data, not headline quality. Cartesia (sub-90ms), Inworld, Murf Falcon, and Deepgram are built for realtime turn-taking. xAI adds telephony codecs for phone use cases. Always test latency on your own infrastructure, since network path and region change the numbers.

Which TTS providers support voice cloning? Cloning is widely available but very uneven. ElevenLabs, Resemble, Hume, Cartesia, Fish Audio, MiniMax, and Mistral all clone from short reference audio, while Murf treats cloning as an enterprise flow with a multi-week turnaround. Test your exact consented reference clip, because results swing with audio quality and accent.

Do I need word-level timestamps? You need them for captions, word highlighting, avatar lip-sync, and barge-in. ElevenLabs, Cartesia, Inworld, Resemble, Hume Octave 2, and Murf Gen2 return timing natively. For providers that don't, Speechbase reconstructs word boundaries on the gateway, so you can request usable timing on every route, from a separate buffered endpoint. Decide this before you pick the voice.

Sources and notes

This guide draws on official provider docs, product pages, and pricing pages, plus Speechbase's own provider docs, routing docs, and timestamp docs. Capabilities and pricing change quickly, so treat this as a June 2026 snapshot and verify the exact model, plan, and region before committing production traffic.