# Conversations

Multi-turn, multi-speaker synthesis in a single API call, with optional cross-provider routing, gap control, and volume normalisation.
A conversation is a single API call that produces one stitched audio file out of multiple turns of dialogue. Each turn picks its own voice. Speechbase handles the heavy lifting: dispatching each turn (potentially to a different provider), inserting silence between turns, and normalising volume so one speaker doesn't drown out the next.
This is the right primitive for podcasts, narrated dialogues, customer-service demos, training videos, and any other multi-speaker content.
## The shape of a request

```json
{
  "model": "openai/gpt-4o-mini-tts",
  "turns": [
    { "voice": "alloy", "text": "How was your weekend?" },
    { "voice": "shimmer", "text": "Good — I finally caught up on sleep." },
    { "voice": "alloy", "text": "Lucky you." }
  ],
  "gapMs": 500,
  "volumeDbfs": -16,
  "output": "mp3"
}
```

`POST /v1/audio/conversation` returns the stitched audio bytes.
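An end-to-end call might look like the sketch below. The base URL and the bearer-token auth header are assumptions; substitute your real endpoint and API key.

```typescript
// Minimal sketch (Node 18+, ESM). The base URL and the Authorization
// scheme are assumptions; substitute your real endpoint and API key.
import { writeFile } from "node:fs/promises";

const res = await fetch("https://api.speechbase.example/v1/audio/conversation", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/gpt-4o-mini-tts",
    turns: [
      { voice: "alloy", text: "How was your weekend?" },
      { voice: "shimmer", text: "Good, I finally caught up on sleep." },
      { voice: "alloy", text: "Lucky you." },
    ],
    gapMs: 500,
    volumeDbfs: -16,
    output: "mp3",
  }),
});

if (!res.ok) {
  throw new Error(`Conversation failed: ${res.status} ${await res.text()}`);
}

// The endpoint returns raw audio bytes, so buffer them straight to a file.
await writeFile("conversation.mp3", Buffer.from(await res.arrayBuffer()));
```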
## One model or per-turn models: pick one

Each turn needs to know which provider, model, and voice to use. There are two mutually exclusive ways to specify that:

- Shared model. Set `model` at the top level and let every turn inherit it. Each turn just picks a `voice`.
- Per-turn model. Omit the top-level `model` and set `model` and `voice` (or `mode: "voice"` + `voiceId`) on each turn individually.

Mixing the two (a top-level model plus a per-turn model on a flat, inline turn) is rejected: the schema enforces exactly one source of truth per turn.
## Voice references
Each turn can reference its voice in either of two ways, the same as a single-shot speech request:
- Inline (flat): `{ "model": "openai/gpt-4o-mini-tts", "voice": "alloy" }` (or just `"voice"` if `model` is set top-level).
- Voice (modal): `{ "mode": "voice", "voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2" }` (a UUID).
You can mix the styles across turns — turn 0 can use a saved voice, turn 1 can use an inline voice. Speechbase resolves them independently.
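A sketch of mixing the two styles in one request (the `voiceId` here is a placeholder UUID, not a real saved voice):

```typescript
// Turn 0 uses a saved voice (modal reference); turn 1 uses an inline voice.
const body = {
  turns: [
    { mode: "voice", voiceId: "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
      text: "A saved voice speaks first." },
    { model: "openai/gpt-4o-mini-tts", voice: "alloy",
      text: "An inline voice replies." },
  ],
};
```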
## Gap and volume controls

| Field | Default | What it does |
|---|---|---|
| `gapMs` | `0` | Milliseconds of silence inserted between consecutive turns. |
| `volumeDbfs` | `null` | Target peak loudness in dBFS (e.g. `-16`). Each turn is normalised to it. |
| `output` | `wav` | `wav`, `mp3`, or `pcm`. `mp3` also accepts `{ "format": "mp3", "bitrate": 96 }` (kbps; defaults to 96). |
Volume normalisation matters in practice: different providers ship audio at different reference loudness levels, so without normalisation a Cartesia turn followed by an ElevenLabs turn will jump perceptibly. Setting `volumeDbfs: -16` (a common podcast loudness target) or `-23` (the EBU broadcast standard) usually fixes this.
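Putting the three controls together, with the expanded mp3 output object, a request body might look like this sketch (fields are those from the table above):

```typescript
// Gap, volume, and the object form of "output" in one request body.
const body = {
  model: "openai/gpt-4o-mini-tts",
  turns: [
    { voice: "alloy", text: "First line." },
    { voice: "shimmer", text: "Second line." },
  ],
  gapMs: 300,                             // 300 ms of silence between turns
  volumeDbfs: -16,                        // normalise each turn's peak to -16 dBFS
  output: { format: "mp3", bitrate: 96 }, // mp3 at 96 kbps (the default bitrate)
};
```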
## Cross-provider routing

Because each turn carries its own provider/model/voice, you can mix providers in a single conversation:

```json
{
  "turns": [
    { "provider": "openai", "model": "gpt-4o-mini-tts", "voice": "alloy",
      "text": "Some narration goes here." },
    { "provider": "elevenlabs", "model": "eleven_v3", "voice": "EXAV...",
      "text": "And then a different speaker chimes in." }
  ]
}
```

Speechbase validates up front that both providers have BYOK keys and are enabled. If either is missing, the whole request fails; you never end up with a half-stitched conversation.
## Word-level timestamps

`POST /v1/audio/conversation/with-timestamps` returns a JSON envelope with `audio` (base64) plus a flat `timestamps` array. Each word entry includes `turnIndex`, mapping it back to its originating turn. See Word-level timestamps for the full schema.
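Consuming the envelope might look like the sketch below. Only `audio`, `timestamps`, and `turnIndex` are documented here; the base URL, auth header, and the other per-word fields are assumptions, so check the linked schema.

```typescript
// Sketch (Node 18+, ESM). The base URL, auth header, and per-word fields
// other than turnIndex are assumptions.
import { writeFile } from "node:fs/promises";

const res = await fetch(
  "https://api.speechbase.example/v1/audio/conversation/with-timestamps",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "openai/gpt-4o-mini-tts",
      turns: [
        { voice: "alloy", text: "How was your weekend?" },
        { voice: "shimmer", text: "Good, I finally caught up on sleep." },
      ],
    }),
  },
);

const { audio, timestamps } = await res.json();

// The audio arrives base64-encoded inside the JSON envelope.
await writeFile("conversation.wav", Buffer.from(audio, "base64"));

// Each word entry carries a turnIndex pointing back to its originating turn.
for (const word of timestamps) {
  console.log(word.turnIndex, word);
}
```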
## Behaviour and limits

- Audio is buffered, not streamed. Conversations always return after every turn has been synthesised.
- Each turn's text is moderated individually. A single flagged turn fails the whole conversation.
- The whole request shares one moderation evaluation, one provider-keys check, and one log entry; per-turn telemetry (latency, character count) is attached as child events for analytics.