Conversations
Multi-turn, multi-speaker synthesis in a single API call — with optional cross-provider routing, gap control, and volume normalisation.
A conversation is a single API call that produces one stitched audio file out of multiple turns of dialogue. Each turn picks its own voice. Speechbase handles the heavy lifting: dispatching each turn (potentially to a different provider), inserting silence between turns, and normalising volume so one speaker doesn't drown out the next.
This is the right primitive for podcasts, narrated dialogues, customer-service demos, training videos, and any other multi-speaker content.
The shape of a request
{
"model": "openai/gpt-4o-mini-tts",
"turns": [
{ "voice": "alloy", "text": "How was your weekend?" },
{ "voice": "shimmer", "text": "Good — I finally caught up on sleep." },
{ "voice": "alloy", "text": "Lucky you." }
],
"gapMs": 500,
"volumeDbfs": -16,
"output": "mp3"
}
POST /v1/audio/conversation returns the stitched audio bytes.
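As a minimal sketch, the request above could be assembled in Python as follows. The base URL and the Bearer authorization header are assumptions for illustration only; substitute the host and auth scheme from your own Speechbase account. The request is built but not sent, so the snippet runs without network access.

```python
import json
import urllib.request

# Hypothetical values: the base URL and the Bearer auth scheme are
# assumptions for illustration, not taken from the Speechbase reference.
BASE_URL = "https://api.example.com"


def build_conversation_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/audio/conversation."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/conversation",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


payload = {
    "model": "openai/gpt-4o-mini-tts",
    "turns": [
        {"voice": "alloy", "text": "How was your weekend?"},
        {"voice": "shimmer", "text": "Good, I finally caught up on sleep."},
        {"voice": "alloy", "text": "Lucky you."},
    ],
    "gapMs": 500,
    "volumeDbfs": -16,
    "output": "mp3",
}

request = build_conversation_request(payload, api_key="YOUR_API_KEY")

# Sending is left commented out so the sketch has no side effects:
# with urllib.request.urlopen(request) as response:
#     with open("conversation.mp3", "wb") as f:
#         f.write(response.read())
```

Because the endpoint returns the stitched audio bytes directly, the response body can be written straight to a file once the request is actually sent.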
One model or per-turn models — pick one
Each turn needs to know what provider/model/voice to use. There are two mutually-exclusive ways to specify that:
- Shared model. Set model at the top level and let every turn inherit it. Each turn just picks a voice.
- Per-turn model. Omit the top-level model and set model and voice (or voiceId) on each turn individually.
Mixing the two — top-level model plus per-turn model on a flat (inline)
turn — is rejected. The schema enforces "exactly one source of truth per
turn."
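The rule can be checked client-side before sending a request. The sketch below is an illustrative pre-flight check, not the Speechbase schema itself: the function name is hypothetical, and saved-voice turns (those carrying voiceId) are deliberately skipped because this section only describes the rule for inline turns.

```python
def model_source_errors(payload: dict) -> list[str]:
    """Return a list of problems with how inline turns get their model.

    Illustrative client-side check only; the server-side schema is the
    authority on what is accepted.
    """
    errors = []
    has_shared_model = "model" in payload
    for index, turn in enumerate(payload.get("turns", [])):
        if "voiceId" in turn:
            # Saved-voice turns are out of scope for this sketch.
            continue
        has_turn_model = "model" in turn
        if has_shared_model and has_turn_model:
            errors.append(f"turn {index}: top-level and per-turn model both set")
        elif not has_shared_model and not has_turn_model:
            errors.append(f"turn {index}: no model at top level or on the turn")
    return errors


shared = {
    "model": "openai/gpt-4o-mini-tts",
    "turns": [{"voice": "alloy", "text": "Hi."}],
}
mixed = {
    "model": "openai/gpt-4o-mini-tts",
    "turns": [{"model": "openai/gpt-4o-mini-tts", "voice": "alloy", "text": "Hi."}],
}
missing = {"turns": [{"voice": "alloy", "text": "Hi."}]}
```

Here model_source_errors(shared) is empty, while mixed and missing each report one problem for turn 0.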
Voice references
Each turn can reference its voice in either of two ways, the same as a single-shot speech request:
- Inline: { "model": "openai/gpt-4o-mini-tts", "voice": "alloy" } (or just "voice" if model is set top-level).
- Saved voice: { "voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2" }, the UUID of a saved voice.
Speechbase dispatches by the shape of the turn.
You can mix the styles across turns — turn 0 can use a saved voice, turn 1 can use an inline voice. Speechbase resolves them independently.
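To make "dispatch by shape" concrete, here is a small sketch of how a client might classify turns the same way. The function name is hypothetical, and treating a turn that carries both voiceId and voice as an error is an assumption of this sketch; the section above does not say how Speechbase handles that case.

```python
def voice_reference_kind(turn: dict) -> str:
    """Classify a turn as 'saved' (voiceId) or 'inline' (voice)."""
    has_saved = "voiceId" in turn
    has_inline = "voice" in turn
    if has_saved and has_inline:
        # Assumption: an ambiguous turn is treated as a client-side error.
        raise ValueError("turn sets both voiceId and voice")
    if has_saved:
        return "saved"
    if has_inline:
        return "inline"
    raise ValueError("turn sets neither voiceId nor voice")


turns = [
    {"voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2", "text": "Welcome back."},
    {"model": "openai/gpt-4o-mini-tts", "voice": "alloy", "text": "Thanks."},
]
kinds = [voice_reference_kind(turn) for turn in turns]
```

For the two turns above, kinds is ["saved", "inline"], mirroring the mixed-style conversation described in the text.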
Gap and volume controls
| Field | Default | What it does |
|---|---|---|
| gapMs | 0 | Milliseconds of silence inserted between consecutive turns. |
| volumeDbfs | -16 | Target peak loudness in dBFS. Each turn is normalised to it. Pass your own value to override the default. |
| output | wav | wav, mp3, or pcm. mp3 accepts { format: "mp3", bitrate: 96 } (kbps; defaults to 96). |
Volume normalisation matters in practice: different providers ship audio at
different levels, so without it a Cartesia turn followed by an ElevenLabs turn
would jump perceptibly. Speechbase normalises each turn to a peak of -16 dBFS
by default; set your own volumeDbfs to target a different peak, for example
-23 for more headroom. This is a peak-level target in dBFS, not an
integrated-loudness (LUFS) measurement.
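The arithmetic behind peak normalisation is short enough to sketch. The code below is a generic illustration of peak normalisation on floating-point samples in the range [-1.0, 1.0], not Speechbase's actual implementation; the function names and sample values are made up for the example. The gain in dB is the target minus the current peak, and the linear scale factor is 10^(gain/20).

```python
import math


def peak_dbfs(samples: list[float]) -> float:
    """Peak level in dBFS for float samples in [-1.0, 1.0]."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")  # digital silence has no finite peak level
    return 20 * math.log10(peak)


def normalise_to_peak(samples: list[float], target_dbfs: float = -16.0) -> list[float]:
    """Scale samples so that their peak sits at target_dbfs."""
    current = peak_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # nothing to scale
    gain_db = target_dbfs - current
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]


# Two hypothetical turns that arrive at very different levels.
quiet_turn = [0.05, -0.10, 0.08]   # peak at -20 dBFS
loud_turn = [0.60, -0.90, 0.70]    # peak just under 0 dBFS

levelled = [normalise_to_peak(turn) for turn in (quiet_turn, loud_turn)]
```

After normalisation both turns peak at -16 dBFS, so the quiet turn is boosted and the loud one attenuated. Two clips with the same peak can still differ in perceived loudness, which is the distinction from LUFS that the paragraph above draws.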
Cross-provider routing
Because each turn carries its own provider/model/voice, you can mix providers in a single conversation:
{
"turns": [
{ "model": "openai/gpt-4o-mini-tts", "voice": "alloy",
"text": "Some narration goes here." },
{ "model": "elevenlabs/eleven_v3", "voice": "EXAV...",
"text": "And then a different speaker chimes in." }
]
}
model is always the single provider/model-id string (for example
openai/gpt-4o-mini-tts); there is no separate provider field.
Speechbase validates up front that both providers have BYOK keys and are enabled. If either is missing the whole request fails — you never end up with a half-stitched conversation.
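Since the provider is the part of model before the slash, a client can work out ahead of time which BYOK keys a conversation will need. The helper below is a hypothetical sketch; it only inspects model strings present in the request, so turns that use a saved voice (voiceId) are not accounted for.

```python
def providers_needed(payload: dict) -> set[str]:
    """Collect the provider prefixes of every model named in the request."""
    models = []
    if "model" in payload:
        models.append(payload["model"])
    for turn in payload.get("turns", []):
        if "model" in turn:
            models.append(turn["model"])
    # model has the form "provider/model-id"; keep the part before the
    # first slash.
    return {model.split("/", 1)[0] for model in models}


payload = {
    "turns": [
        {"model": "openai/gpt-4o-mini-tts", "voice": "alloy",
         "text": "Some narration goes here."},
        {"model": "elevenlabs/eleven_v3", "voice": "EXAV...",
         "text": "And then a different speaker chimes in."},
    ]
}
needed = providers_needed(payload)
```

For the cross-provider request above, needed contains openai and elevenlabs, which is the set of providers that must have keys configured and be enabled for the request to succeed.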
Per-turn blocks with split
| Field | Default | What it does |
|---|---|---|
| split | false | Return each turn as its own audio block instead of one stitched file. The response becomes a JSON envelope with an audioSegments array, one entry per turn, in order. |
Word-level timestamps
POST /v1/audio/conversation/with-timestamps returns a JSON envelope with
audio (base64) plus a flat timestamps array. Each word entry includes
turnIndex, mapping it back to its originating turn.
This endpoint accepts an optional timestamps field on the request body:
"on" (the default) returns the timestamps array, "off" skips timestamp
generation. The field is only valid on the /with-timestamps endpoint; the
plain /v1/audio/conversation endpoint rejects it. Under split, each entry in
audioSegments carries its own block-relative timestamps array (seconds from
the start of that segment). See
Word-level timestamps for the full
schema.
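Because the stitched response returns one flat timestamps array, a common first step is to regroup the entries per turn. The sketch below relies only on the turnIndex field described above; the word, start, and end fields in the sample data are assumed names for illustration and may not match the real schema.

```python
from collections import defaultdict


def group_by_turn(timestamps: list[dict]) -> dict[int, list[dict]]:
    """Group flat word entries by the turn they came from, preserving order."""
    grouped = defaultdict(list)
    for entry in timestamps:
        grouped[entry["turnIndex"]].append(entry)
    return dict(grouped)


# Hypothetical sample data: only turnIndex is taken from the docs above;
# "word", "start", and "end" are assumed field names.
sample_timestamps = [
    {"word": "How", "start": 0.00, "end": 0.18, "turnIndex": 0},
    {"word": "was", "start": 0.18, "end": 0.35, "turnIndex": 0},
    {"word": "Good", "start": 1.40, "end": 1.72, "turnIndex": 1},
]

by_turn = group_by_turn(sample_timestamps)
```

With the sample above, by_turn[0] holds the two words from the first turn and by_turn[1] the single word from the second.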
Per-turn segments with split
By default a conversation returns one stitched file. Set split: true to get a
JSON envelope with an audioSegments array instead, one entry per turn in order.
Each segment is { audio, mediaType, durationMs, turnIndex }, plus a
block-relative timestamps array (timed from the start of that segment, no
turnIndex) on the with-timestamps endpoint. The audio is identical to what the
turn contributes inside the stitched clip, just handed back in pieces.
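A sketch of consuming a split response follows. It assumes the audio field of each segment is base64-encoded, as the audio field of the with-timestamps envelope is; that encoding is not confirmed for segments in this section. The sample envelope, including its mediaType value, is fabricated placeholder data, and the duration estimate simply adds gapMs between consecutive segments.

```python
import base64


def decode_segments(envelope: dict) -> list[bytes]:
    """Return raw audio bytes per turn, ordered by turnIndex.

    Assumes each segment's "audio" is a base64 string.
    """
    ordered = sorted(envelope["audioSegments"], key=lambda seg: seg["turnIndex"])
    return [base64.b64decode(seg["audio"]) for seg in ordered]


def estimated_stitched_ms(envelope: dict, gap_ms: int = 0) -> int:
    """Sum of segment durations plus one gap between each consecutive pair."""
    durations = [seg["durationMs"] for seg in envelope["audioSegments"]]
    gaps = max(len(durations) - 1, 0)
    return sum(durations) + gap_ms * gaps


# Placeholder envelope: the bytes are not real audio.
sample_envelope = {
    "audioSegments": [
        {"audio": base64.b64encode(b"turn-0").decode("ascii"),
         "mediaType": "audio/mpeg", "durationMs": 1200, "turnIndex": 0},
        {"audio": base64.b64encode(b"turn-1").decode("ascii"),
         "mediaType": "audio/mpeg", "durationMs": 800, "turnIndex": 1},
    ]
}

clips = decode_segments(sample_envelope)
total_ms = estimated_stitched_ms(sample_envelope, gap_ms=500)
```

Each element of clips can then be saved or post-processed independently, which is the main reason to request split output instead of a single stitched file.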
Behaviour and limits
- Audio is buffered, not streamed. Conversations always return after every turn has been synthesised.
- Each turn passes through moderation on its own text. A single bad turn fails the whole conversation.
- The whole request shares one moderation evaluation, one provider-keys check, and one log entry; per-turn telemetry (latency, character count) is attached as child events for analytics.

