Output formats

What audio formats Speechbase returns, how to pick between them, and gotchas around streaming and bit-depth.

Every synthesis endpoint accepts an output field. This guide is a quick matrix of what to pick when.

The shorthand

If you don't care about details, pass a string:

{ "output": "mp3" }

Available shorthands: "wav" (default), "mp3", "pcm".

The object form

For more control, pass an object. bitrate (in kbps) is only valid with mp3:

{ "output": { "format": "mp3", "bitrate": 192 } }
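The two forms can be checked on the client before a request is sent. The following is a minimal sketch, not part of any Speechbase SDK: the helper name `normalize_output` is hypothetical, and it assumes the object form always carries a `format` key. It encodes only the rules stated above (three formats, `wav` as the default, `bitrate` valid only with `mp3`).

```python
from typing import Optional, Union

# Formats listed in this guide.
FORMATS = {"wav", "mp3", "pcm"}


def normalize_output(output: Optional[Union[str, dict]] = None) -> dict:
    """Turn the shorthand or object form into the object form (hypothetical helper)."""
    if output is None:
        # The guide states wav is the default when output is omitted.
        return {"format": "wav"}
    if isinstance(output, str):
        output = {"format": output}
    fmt = output.get("format")
    if fmt not in FORMATS:
        raise ValueError(f"unknown output format: {fmt!r}")
    if output.get("bitrate") is not None and fmt != "mp3":
        # bitrate (in kbps) is only valid with mp3.
        raise ValueError("bitrate is only valid with mp3")
    return dict(output)
```

For example, `normalize_output("mp3")` yields `{"format": "mp3"}`, while `normalize_output({"format": "wav", "bitrate": 192})` raises a `ValueError` locally instead of waiting for the API to reject the request.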

What to pick

Format	Container	Default bitrate	Best for
`wav`	RIFF/WAV	n/a	Highest fidelity. Editing, storage, downstream processing. Default if you don't pass `output`.
`mp3`	MP3	96 kbps	Web/mobile playback, streaming to end users, podcasts. Pick `192` for music-quality voice, `64` for low-bandwidth.
`pcm`	none (raw)	n/a	Headerless, sample-rate-only. Feed straight into a downstream audio pipeline (mixing, voice DSP, custom containers).

pcm is what you want if you're piping audio into something that owns its own framing, such as an LLM voice agent loop or a real-time audio mixer. Otherwise, pick wav or mp3.
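Because pcm output has no header, the consumer has to supply the sample rate, channel count, and sample width itself. As a rough sketch of the "custom containers" case, the snippet below wraps raw samples in a WAV container using Python's standard `wave` module. The function name `pcm_to_wav` is hypothetical, and the defaults (mono, 16-bit little-endian signed samples) are assumptions; check them against what your provider produces.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int, channels: int = 1,
               sample_width: int = 2) -> bytes:
    """Wrap headerless PCM in a RIFF/WAV container.

    Assumes little-endian signed samples; sample_width is in bytes
    (2 means 16-bit).
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sample_width)
        writer.setframerate(sample_rate)
        writer.writeframes(pcm)
    return buf.getvalue()
```

The caller must pass the sample rate explicitly, for example `pcm_to_wav(raw_bytes, 24000)`, because nothing in the raw stream records it.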

Streaming and format

The plain /v1/audio/speech endpoint streams the response when the upstream provider supports streaming. That works for mp3 and pcm, where chunked decoding is well-defined. It also works for wav, but the RIFF header at the start of the stream will report a length of 0 until the stream finishes. Most players handle this fine; some don't.

If you're streaming to a player that's strict about WAV headers, prefer mp3.
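If you need WAV anyway and can buffer the whole response before playback, another option is to repair the header yourself. This is a minimal sketch under several assumptions: the buffered bytes still carry the zero-length header, the `data` chunk is the last chunk in the file, and the chunks before it report correct sizes. `patch_wav_sizes` is a hypothetical helper, not a Speechbase function.

```python
import struct


def patch_wav_sizes(wav: bytes) -> bytes:
    """Rewrite the RIFF and data chunk sizes of a fully buffered WAV stream."""
    if len(wav) < 12 or wav[:4] != b"RIFF" or wav[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    buf = bytearray(wav)
    # The RIFF size field counts everything after the first 8 bytes.
    struct.pack_into("<I", buf, 4, len(buf) - 8)
    pos = 12
    while pos + 8 <= len(buf):
        chunk_id = bytes(buf[pos:pos + 4])
        (size,) = struct.unpack_from("<I", buf, pos + 4)
        if chunk_id == b"data":
            # Assumption: the data chunk runs to the end of the buffer.
            struct.pack_into("<I", buf, pos + 4, len(buf) - (pos + 8))
            return bytes(buf)
        # Skip this chunk; chunks are padded to an even length.
        pos += 8 + size + (size & 1)
    raise ValueError("no data chunk found")
```

This gives up the latency benefit of streaming, so it only suits cases where the audio is stored or replayed later.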

Sample rate and bit depth

Speechbase returns whatever sample rate and bit depth the provider produced. That's typically:

OpenAI: 24 kHz, 16-bit PCM.
ElevenLabs: 22.05 kHz or 44.1 kHz depending on model.
Cartesia: 22 kHz to 44.1 kHz depending on model.

If your downstream pipeline requires a fixed sample rate, resample on your side after decoding — Speechbase doesn't currently do server-side resampling.
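To illustrate what client-side resampling involves, here is a deliberately naive sketch using linear interpolation over decoded samples. It assumes mono, 16-bit little-endian PCM (the format listed for OpenAI above; other providers may differ), and `resample_linear` is a hypothetical name. It applies no anti-aliasing filter, so a dedicated audio library is likely a better choice for production use, especially when downsampling.

```python
import struct


def resample_linear(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample mono 16-bit little-endian PCM by linear interpolation."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    n_in = len(pcm) // 2
    if n_in == 0 or src_rate == dst_rate:
        return pcm[: n_in * 2]
    samples = struct.unpack(f"<{n_in}h", pcm[: n_in * 2])
    n_out = max(1, round(n_in * dst_rate / src_rate))
    step = src_rate / dst_rate  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), n_in - 1)
        right = min(left + 1, n_in - 1)
        frac = pos - left
        # Interpolate between the two neighbouring input samples.
        out.append(round(samples[left] + (samples[right] - samples[left]) * frac))
    return struct.pack(f"<{n_out}h", *out)
```

Going from 24 kHz to 48 kHz doubles the number of samples; going the other way halves it.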

Conversations

Conversation responses always carry one consistent format end-to-end. Speechbase synthesises each turn at whatever the provider produces, optionally normalises with volumeDbfs, and re-encodes to your chosen output format during stitching. So a conversation request with output: "mp3" returns a single MP3 even if individual turns came back as WAV from the providers.
