# Output formats

What audio formats Speechbase returns, how to pick between them, and gotchas around streaming and bit depth.
Every synthesis endpoint accepts an `output` field. This guide is a quick matrix of what to pick when.
## The shorthand

If you don't care about details, pass a string:

```json
{ "output": "mp3" }
```

Available shorthands: `"wav"` (default), `"mp3"`, `"pcm"`.
## The object form

For more control, pass an object. `bitrate` (in kbps) is only valid with `mp3`:

```json
{ "output": { "format": "mp3", "bitrate": 192 } }
```

## What to pick
| Format | Container | Default bitrate | Best for |
|---|---|---|---|
| `wav` | RIFF/WAV | n/a | Highest fidelity. Editing, storage, downstream processing. Default if you don't pass `output`. |
| `mp3` | MP3 | 96 kbps | Web/mobile playback, streaming to end users, podcasts. Pick 192 for music-quality voice, 64 for low-bandwidth. |
| `pcm` | none (raw) | n/a | Headerless, sample-rate-only. Feed straight into a downstream audio pipeline (mixing, voice DSP, custom containers). |
`pcm` is what you want if you're piping audio into something like an LLM voice agent loop or a real-time audio mixer that owns its own framing; otherwise pick `wav` or `mp3`.
## Streaming and format
The plain `/v1/audio/speech` endpoint streams the response when the upstream provider supports streaming. That works for `mp3` and `pcm` (chunked decoding is well-defined). It also works for `wav`, but the RIFF header at the start of the stream reports a length of 0 until the stream finishes; most players handle this fine, some don't.

If you're streaming to a player that's strict about WAV headers, prefer `mp3`.
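To see why strict players object, look at the first eight bytes of a RIFF stream: bytes 4 through 8 declare the total chunk size, and a server streaming WAV has to emit a placeholder there before the real length is known. A stdlib-only sketch, nothing Speechbase-specific:

```python
import struct

def riff_declared_size(header: bytes) -> int:
    """Read the chunk size a RIFF/WAV header claims for the rest of the file."""
    if header[:4] != b"RIFF":
        raise ValueError("not a RIFF stream")
    # Little-endian unsigned 32-bit integer at byte offset 4.
    return struct.unpack("<I", header[4:8])[0]

# A streamed WAV typically opens with a placeholder size of 0 that is
# never fixed up mid-stream, so a strict parser sees an "empty" file.
streamed_header = b"RIFF" + struct.pack("<I", 0) + b"WAVE"
print(riff_declared_size(streamed_header))  # 0
```

Lenient players ignore the declared size and just decode until the connection closes; strict ones trust the header and stop immediately.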
## Sample rate and bit depth
Speechbase returns whatever sample rate and bit depth the provider produced. That's typically:
- OpenAI: 24 kHz, 16-bit PCM.
- ElevenLabs: 22.05 kHz or 44.1 kHz depending on model.
- Cartesia: 22 kHz to 44.1 kHz depending on model.
If your downstream pipeline requires a fixed sample rate, resample on your side after decoding — Speechbase doesn't currently do server-side resampling.
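If you do need to resample on your side, a linear-interpolation pass over the decoded 16-bit samples is often good enough for voice. An illustrative sketch (mono, no anti-aliasing filter — use a proper DSP library for production):

```python
def resample_pcm16(samples: list[int], src_rate: int, dst_rate: int) -> list[int]:
    """Linearly resample decoded 16-bit PCM samples (mono) between rates."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        # Fractional position of this output sample in the source signal.
        pos = i * src_rate / dst_rate
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]  # clamp at the final sample
        out.append(int(round(a + (b - a) * frac)))
    return out
```

For example, upsampling a 2 kHz ramp to 4 kHz doubles the sample count and fills the gaps with midpoints.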
## Conversations
Conversation responses always carry one consistent format end-to-end. Speechbase synthesises each turn at whatever the provider produces, optionally normalises with `volumeDbfs`, and re-encodes to your chosen `output` format during stitching. So a conversation request with `output: "mp3"` returns a single MP3 even if individual turns came back as WAV from the providers.
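Putting that together, a conversation request might look like the following. Only `output` and `volumeDbfs` are described above; the `turns` field name is an illustrative assumption, not a documented part of the API:

```json
{
  "output": { "format": "mp3", "bitrate": 96 },
  "volumeDbfs": -16,
  "turns": [ ... ]
}
```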