
Output formats

What audio formats Speechbase returns, how to pick between them, and gotchas around streaming and bit-depth.

Every synthesis endpoint accepts an output field. This guide is a quick reference for which value to pick, and when.

The shorthand

If you don't care about details, pass a string:

{ "output": "mp3" }

Available shorthands: "wav" (default), "mp3", "pcm".

The object form

For more control, pass an object. bitrate (in kbps) is only valid with mp3:

{ "output": { "format": "mp3", "bitrate": 192 } }
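A small client-side sketch of those rules. The helper name normalize_output and the validation itself are illustrative, not part of Speechbase; it just encodes the constraints above (three known formats, bitrate only with mp3) and folds the string shorthand into the object form:

```python
# Hypothetical helper encoding the rules above: accepts the string
# shorthand or the object form, returns the object form, and rejects
# bitrate on anything other than mp3.
ALLOWED_FORMATS = {"wav", "mp3", "pcm"}

def normalize_output(output):
    if isinstance(output, str):
        output = {"format": output}
    fmt = output.get("format")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    if "bitrate" in output and fmt != "mp3":
        raise ValueError("bitrate is only valid with mp3")
    return output
```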

What to pick

  • wav: RIFF/WAV container, no bitrate setting. Highest fidelity, best for editing, storage, and downstream processing. The default if you don't pass output.
  • mp3: MP3 container, 96 kbps default bitrate. Best for web/mobile playback, streaming to end users, and podcasts. Pick 192 for music-quality voice, 64 for low bandwidth.
  • pcm: no container (raw, headerless, sample-rate-only). Feed straight into a downstream audio pipeline (mixing, voice DSP, custom containers).

pcm is what you want if you're piping audio into something like an LLM voice agent loop or a real-time audio mixer that owns its own framing — otherwise pick wav or mp3.
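If you later need to hand raw pcm output to an ordinary player, you can wrap it in a RIFF/WAV container yourself. A minimal sketch using Python's standard wave module; the 24 kHz / 16-bit / mono parameters are an assumption for illustration and must match what the provider actually produced:

```python
import io
import wave

def pcm_to_wav(pcm_bytes, sample_rate=24_000, sample_width=2, channels=1):
    """Wrap raw PCM bytes in a WAV container. sample_width is in bytes
    per sample (2 = 16-bit); defaults assume 24 kHz 16-bit mono."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)
        w.setframerate(sample_rate)
        w.writeframes(pcm_bytes)
    return buf.getvalue()
```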

Streaming and format

The plain /v1/audio/speech endpoint streams the response when the upstream provider supports streaming. That works for mp3 and pcm (chunked decoding is well-defined). It also works for wav, but the RIFF header at the start of the stream will report a length of 0 until the stream finishes. Most players handle this fine; some don't.

If you're streaming to a player that's strict about WAV headers, prefer mp3.
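If you've already saved a streamed WAV to disk, you can also repair the zero-length header after the fact. A sketch, assuming the canonical 44-byte PCM WAV header (RIFF size at byte offset 4, data-chunk size at offset 40, no extra chunks):

```python
import struct

def fix_wav_sizes(wav_bytes):
    """Patch the two RIFF size fields that a streamed WAV leaves as 0.
    Assumes a canonical 44-byte PCM header with no extra chunks."""
    data = bytearray(wav_bytes)
    total = len(data)
    # RIFF chunk size = file size minus the 8-byte "RIFF" + size preamble.
    struct.pack_into("<I", data, 4, total - 8)
    # data chunk size = everything after the 44-byte header.
    struct.pack_into("<I", data, 40, total - 44)
    return bytes(data)
```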

Sample rate and bit depth

Speechbase returns whatever sample rate and bit depth the provider produced. That's typically:

  • OpenAI: 24 kHz, 16-bit PCM.
  • ElevenLabs: 22.05 kHz or 44.1 kHz depending on model.
  • Cartesia: 22 kHz to 44.1 kHz depending on model.

If your downstream pipeline requires a fixed sample rate, resample on your side after decoding — Speechbase doesn't currently do server-side resampling.
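For illustration only, here is what client-side resampling of 16-bit mono PCM can look like with plain linear interpolation, e.g. taking 24 kHz provider output down to a 16 kHz pipeline. For real workloads use a proper resampler (soxr, libsamplerate, ffmpeg), since linear interpolation aliases on downsampling:

```python
import struct

def resample_pcm16(pcm_bytes, src_rate, dst_rate):
    """Naive linear-interpolation resampler for 16-bit mono PCM."""
    samples = struct.unpack(f"<{len(pcm_bytes) // 2}h", pcm_bytes)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in source samples
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(int(a + (b - a) * frac))    # interpolate between neighbours
    return struct.pack(f"<{n_out}h", *out)
```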

Conversations

Conversation responses always carry one consistent format end-to-end. Speechbase synthesises each turn at whatever the provider produces, optionally normalises with volumeDbfs, and re-encodes to your chosen output format during stitching. So a conversation request with output: "mp3" returns a single MP3 even if individual turns came back as WAV from the providers.
