Speechbase

Output formats

What audio formats Speechbase returns, how to pick between them, and gotchas around streaming and bit-depth.

Synthesis endpoints accept an output field, with the exception of the streaming endpoint (see "Buffered vs. streaming, and format" below). This guide is a quick matrix of what to pick when.

The shorthand

If you don't care about details, pass a string:

{ "output": "mp3" }

Available shorthands: "wav" (default), "mp3", "pcm".

The object form

For more control, pass an object. bitrate (in kbps) is only valid with mp3:

{ "output": { "format": "mp3", "bitrate": 192 } }
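As a rough illustration of the two forms, the sketch below builds a value for the output field and rejects a bitrate on anything other than mp3. The helper name build_output is hypothetical and not part of any Speechbase SDK. It only mirrors the rules stated in this guide, and the server remains the authority on what is valid.

```python
from typing import Optional, Union

# Shorthands listed in this guide.
SHORTHANDS = {"wav", "mp3", "pcm"}


def build_output(fmt: str = "wav", bitrate: Optional[int] = None) -> Union[str, dict]:
    """Hypothetical helper: return a value for the `output` request field."""
    if fmt not in SHORTHANDS:
        raise ValueError(f"unknown format: {fmt!r}")
    if bitrate is None:
        # Shorthand form, e.g. "mp3".
        return fmt
    if fmt != "mp3":
        # Per this guide, bitrate (in kbps) is only valid with mp3.
        raise ValueError("bitrate is only valid with mp3")
    # Object form, e.g. {"format": "mp3", "bitrate": 192}.
    return {"format": fmt, "bitrate": bitrate}
```

Calling build_output("mp3", 192) yields the object form shown above, while build_output() falls back to the "wav" shorthand.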

What to pick

| Format | Container | Default bitrate | Best for |
| --- | --- | --- | --- |
| wav | RIFF/WAV | n/a | Highest fidelity. Editing, storage, downstream processing. Default if you don't pass output. |
| mp3 | MP3 | 96 kbps | Web/mobile playback, streaming to end users, podcasts. Pick 192 for music-quality voice, 64 for low-bandwidth. |
| pcm | none (raw) | n/a | Headerless, sample-rate-only. Feed straight into a downstream audio pipeline (mixing, voice DSP, custom containers). |

pcm is what you want if you're piping audio into something like an LLM voice agent loop or a real-time audio mixer that owns its own framing — otherwise pick wav or mp3.
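Because pcm carries no header, the consumer has to know the sample rate, bit depth, and channel count out of band. The sketch below wraps raw PCM in a WAV container using Python's standard wave module. The defaults (24 kHz, 16-bit, mono, little-endian signed samples) are assumptions for illustration only, so substitute whatever your provider actually produces.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap headerless PCM bytes in a RIFF/WAV container.

    Assumed defaults: 24 kHz, mono, 16-bit (2 bytes per sample).
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    # The wave module does not close file objects it did not open itself,
    # so the buffer is still readable here.
    return buf.getvalue()
```

This is handy for debugging a pcm response in an ordinary audio player; a real-time pipeline would normally consume the raw bytes directly.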

Buffered vs. streaming, and format

output format conversion is a whole-clip operation (decode the full clip, re-encode), so it only applies on the buffered /v1/audio/speech endpoint, which returns the complete audio in a single response.

For low latency, /v1/audio/speech/stream streams the provider's audio straight through. Because there's no whole-clip step, it does not accept output (or volumeDbfs); you get the provider's native format. Chunked decoding is well-defined for mp3 and pcm. wav also streams, but its RIFF header reports a length of 0 until the stream finishes. Most players handle this, but some don't, so prefer mp3 for strict players.
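If you save a streamed wav to disk and a strict player rejects it, one option is to patch the size fields once the stream has finished. The sketch below is a minimal example, assuming the placeholder sits in both the RIFF size field and the data chunk size field, and that the data chunk is the last chunk in the file. The function name fix_wav_sizes is hypothetical.

```python
import struct


def fix_wav_sizes(data: bytes) -> bytes:
    """Rewrite the RIFF and data chunk sizes of a fully received WAV stream."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    buf = bytearray(data)
    # RIFF size covers everything after the first 8 bytes.
    struct.pack_into("<I", buf, 4, len(buf) - 8)
    pos = 12  # first chunk starts after "RIFF", size, "WAVE"
    while pos + 8 <= len(buf):
        chunk_id = bytes(buf[pos:pos + 4])
        (size,) = struct.unpack_from("<I", buf, pos + 4)
        if chunk_id == b"data":
            # Assumption: the data chunk runs to the end of the stream.
            struct.pack_into("<I", buf, pos + 4, len(buf) - pos - 8)
            return bytes(buf)
        # Skip other chunks (e.g. "fmt "); chunks are padded to even sizes.
        pos += 8 + size + (size & 1)
    raise ValueError("no data chunk found")
```

The patch can only happen after the last byte arrives, so it helps with stored files rather than live playback.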

Sample rate and bit depth

Speechbase returns whatever sample rate and bit depth the provider produced. That's typically:

  • OpenAI: 24 kHz, 16-bit PCM.
  • ElevenLabs: 22.05 kHz or 44.1 kHz depending on model.
  • Cartesia: 22 kHz to 44.1 kHz depending on model.

If your downstream pipeline requires a fixed sample rate, resample on your side after decoding — Speechbase doesn't currently do server-side resampling.
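To make "resample on your side" concrete, here is a minimal sketch of linear-interpolation resampling for decoded 16-bit mono PCM, using only the standard library. It is an illustration of the idea, not a recommendation. It applies no anti-aliasing filter, so for downsampling in production a dedicated resampling library is usually the better choice. The function name and the mono, little-endian input format are assumptions.

```python
import struct


def resample_linear(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample 16-bit little-endian mono PCM by linear interpolation."""
    n = len(pcm) // 2
    if n == 0 or src_rate == dst_rate:
        return pcm[: n * 2]
    src = struct.unpack(f"<{n}h", pcm[: n * 2])
    out_len = max(1, n * dst_rate // src_rate)
    out = []
    for i in range(out_len):
        # Position of output sample i on the input timeline.
        pos = i * src_rate / dst_rate
        j = int(pos)
        frac = pos - j
        a = src[j]
        b = src[min(j + 1, n - 1)]  # hold the last sample at the edge
        out.append(int(round(a + (b - a) * frac)))
    return struct.pack(f"<{out_len}h", *out)
```

For example, converting 24 kHz provider output to 48 kHz doubles the number of samples, with each new sample placed halfway between its neighbours.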

Conversations

Conversation responses always carry one consistent format end-to-end. Speechbase synthesises each turn in whatever format the provider produces, normalises each turn to the volumeDbfs peak target (default -16 dBFS), and re-encodes to your chosen output format during stitching. So a conversation request with output: "mp3" returns a single MP3 even if individual turns came back as WAV from the providers.
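For intuition about what a peak target in dBFS means, the sketch below scales 16-bit PCM so that its loudest sample lands at a given level. This is not Speechbase's implementation, only a minimal illustration of peak normalisation. The function name, the mono 16-bit little-endian input, and the choice of 32767 as full scale are all assumptions.

```python
import struct

FULL_SCALE = 32767  # assumed 0 dBFS reference for 16-bit samples


def normalise_peak(pcm: bytes, target_dbfs: float = -16.0) -> bytes:
    """Scale 16-bit little-endian PCM so its peak sits at target_dbfs."""
    n = len(pcm) // 2
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return pcm[: n * 2]  # silence: nothing to scale
    # Convert the dBFS target to a linear amplitude, then to a gain.
    target_peak = FULL_SCALE * 10 ** (target_dbfs / 20)
    gain = target_peak / peak
    # Clamp to the int16 range in case the target is at or above 0 dBFS.
    scaled = [max(-32768, min(32767, round(s * gain))) for s in samples]
    return struct.pack(f"<{n}h", *scaled)
```

At -16 dBFS the target peak works out to roughly 16% of full scale, so quiet and loud turns end up at a comparable level before stitching.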
