Output formats
What audio formats Speechbase returns, how to pick between them, and gotchas around streaming and bit-depth.
Every synthesis endpoint accepts an output field. This guide is a quick
matrix of what to pick when.
The shorthand
If you don't care about details, pass a string:
{ "output": "mp3" }
Available shorthands: "wav" (default), "mp3", "pcm".
The object form
For more control, pass an object. bitrate (in kbps) is only valid with mp3:
{ "output": { "format": "mp3", "bitrate": 192 } }
What to pick
| Format | Container | Default bitrate | Best for |
|---|---|---|---|
| wav | RIFF/WAV | n/a | Highest fidelity. Editing, storage, downstream processing. Default if you don't pass output. |
| mp3 | MP3 | 96 kbps | Web/mobile playback, streaming to end users, podcasts. Pick 192 for music-quality voice, 64 for low-bandwidth. |
| pcm | none (raw) | n/a | Headerless, sample-rate-only. Feed straight into a downstream audio pipeline (mixing, voice DSP, custom containers). |
pcm is what you want if you're piping audio into something like an LLM
voice agent loop or a real-time audio mixer that owns its own framing —
otherwise pick wav or mp3.
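The shorthand and object forms, together with the rule that bitrate is only valid with mp3, can be captured in a small client-side helper. This is a minimal sketch: `build_output` is a hypothetical name, not part of any Speechbase SDK, and the validation only mirrors the rules stated in this guide.

```python
from typing import Optional, Union

# Formats listed in this guide; "wav" is the default.
FORMATS = ("wav", "mp3", "pcm")


def build_output(fmt: str = "wav", bitrate: Optional[int] = None) -> Union[str, dict]:
    """Return a value for the `output` field (hypothetical helper).

    Without a bitrate, the shorthand string is enough. With a bitrate,
    the object form is required, and only mp3 accepts it.
    """
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    if bitrate is None:
        return fmt  # shorthand, e.g. "mp3"
    if fmt != "mp3":
        raise ValueError("bitrate is only valid with mp3")
    return {"format": fmt, "bitrate": bitrate}  # bitrate in kbps


payload = {"output": build_output("mp3", 192)}
```

Rejecting a bitrate on wav or pcm on the client surfaces the mistake before a request is sent.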
Buffered vs. streaming, and format
Converting to the requested output format is a whole-clip operation (decode the
full clip, then re-encode), so output only applies on the buffered
/v1/audio/speech endpoint, which returns the complete audio in a single response.
For low latency, /v1/audio/speech/stream streams the provider's audio
straight through. Because there's no whole-clip step, it does not accept
output (or volumeDbfs) — you get the provider's native format. Chunked
decoding is well-defined for mp3 and pcm; wav also streams, but the
RIFF header reports a 0 length until the stream finishes (most players
handle this; some don't — prefer mp3 for strict players).
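If you save a streamed wav to disk and a strict player rejects the zero-length header, one option is to patch the size fields once the stream has finished. The sketch below is an assumption-laden illustration, not a Speechbase feature: `patch_wav_sizes` is a hypothetical name, and it assumes a standard RIFF/WAVE layout in which the data chunk is the last chunk in the file.

```python
import struct


def patch_wav_sizes(data: bytes) -> bytes:
    """Fill in the RIFF and data chunk sizes of a fully received WAV stream."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    buf = bytearray(data)
    # RIFF size covers everything after the 8-byte "RIFF<size>" prefix.
    struct.pack_into("<I", buf, 4, len(buf) - 8)
    offset = 12  # first chunk starts after "RIFF<size>WAVE"
    while offset + 8 <= len(buf):
        chunk_id = bytes(buf[offset:offset + 4])
        (size,) = struct.unpack_from("<I", buf, offset + 4)
        if chunk_id == b"data":
            # Assumption: the data chunk runs to the end of the stream.
            struct.pack_into("<I", buf, offset + 4, len(buf) - (offset + 8))
            return bytes(buf)
        # Skip this chunk; chunks are padded to an even length.
        offset += 8 + size + (size & 1)
    raise ValueError("no data chunk found")
```

Call it only after the last chunk has arrived, since the sizes cannot be known any earlier.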
Sample rate and bit depth
Speechbase returns whatever sample rate and bit depth the provider produced. That's typically:
- OpenAI: 24 kHz, 16-bit PCM.
- ElevenLabs: 22.05 kHz or 44.1 kHz depending on model.
- Cartesia: 22 kHz to 44.1 kHz depending on model.
If your downstream pipeline requires a fixed sample rate, resample on your side after decoding — Speechbase doesn't currently do server-side resampling.
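As a rough illustration of client-side resampling, the sketch below converts 16-bit little-endian mono PCM between rates using linear interpolation. The function name is hypothetical, and the method is deliberately naive: it applies no anti-aliasing filter, so for production downsampling a dedicated DSP library is likely a better choice.

```python
import struct


def resample_pcm16_mono(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Naive linear-interpolation resampler for 16-bit LE mono PCM."""
    n_src = len(pcm) // 2
    if n_src == 0 or src_rate == dst_rate:
        return pcm
    samples = struct.unpack(f"<{n_src}h", pcm[: n_src * 2])
    n_dst = max(1, round(n_src * dst_rate / src_rate))
    out = []
    for i in range(n_dst):
        # Position of output sample i on the source timeline.
        pos = i * src_rate / dst_rate
        left = int(pos)
        if left >= n_src - 1:
            out.append(samples[-1])  # past the last pair: hold the final sample
            continue
        frac = pos - left
        # Weighted average of the two neighbouring source samples.
        out.append(round(samples[left] * (1 - frac) + samples[left + 1] * frac))
    return struct.pack(f"<{n_dst}h", *out)
```

For example, `resample_pcm16_mono(audio, 24000, 16000)` would take 24 kHz audio down to 16 kHz, where `audio` is raw PCM you have already decoded.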
Conversations
Conversation responses always carry one consistent format end-to-end. Speechbase
synthesises each turn at whatever the provider produces, normalises each turn to
the volumeDbfs peak target (default -16 dBFS), and re-encodes to your chosen
output format during stitching. So a conversation request with output: "mp3" returns a
single MP3 even if individual turns came back as WAV from the providers.
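To make the normalisation step concrete, here is a sketch of generic peak normalisation to a dBFS target. It illustrates the technique in general and is not Speechbase's actual implementation; the function name is hypothetical, and it assumes 16-bit signed samples with full scale at 32768.

```python
import math
from typing import List

FULL_SCALE = 32768  # assumed full scale for 16-bit signed audio


def peak_normalise(samples: List[int], target_dbfs: float = -16.0) -> List[int]:
    """Scale samples so the loudest one sits at target_dbfs."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    # Convert the dBFS target to a linear amplitude, then derive the gain.
    target_peak = FULL_SCALE * 10 ** (target_dbfs / 20)
    gain = target_peak / peak
    # Clamp to the int16 range in case the target is above the current peak.
    return [max(-32768, min(32767, round(s * gain))) for s in samples]


def peak_dbfs(samples: List[int]) -> float:
    """Peak level of non-silent samples in dBFS."""
    return 20 * math.log10(max(abs(s) for s in samples) / FULL_SCALE)
```

Applying the same target to every turn is what keeps speakers from different providers at a comparable level before the clips are joined.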

