# Multi-speaker conversations
Generate stitched, multi-turn dialogue audio in one API call — including across providers.
If you've read Conversations, you know the shape. This guide is the practical recipe: when to use the conversation endpoint, how to choose between shared and per-turn models, and how to manage gaps and volume.
## When to use it
Use `POST /v1/audio/conversation` whenever you need:
- More than one voice in a single piece of audio,
- Server-side stitching (no client-side audio mixing),
- Volume normalisation across mixed providers,
- Per-turn timing control.
If you just need a single voice reading a long passage, use `/v1/audio/speech` instead; it streams, so audio starts playing sooner.
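For comparison, a single-voice request body is much smaller than a conversation. A minimal sketch — the `model`, `voice`, `text`, and `output` fields mirror the conversation examples in this guide, but the exact `/v1/audio/speech` schema may differ:

```typescript
// Sketch: build a single-voice request body for POST /v1/audio/speech.
// Field names are assumed to mirror the conversation endpoint's body.
function buildSpeechRequest(model: string, voice: string, text: string) {
  return { model, voice, text, output: "mp3" };
}

const body = buildSpeechRequest(
  "openai/gpt-4o-mini-tts",
  "alloy",
  "A long passage read by a single voice.",
);
```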
## Shared model recipe (single provider)
The simplest case: pin the provider and model once (top-level in the REST body; repeated per turn in the SDK) and let each turn pick only a voice:
```typescript
import { generateConversation } from "@speech-sdk/core";

const result = await generateConversation({
  apiKey: process.env.SPEECHBASE_API_KEY,
  turns: [
    { model: "openai/gpt-4o-mini-tts", voice: "alloy", text: "Welcome to the show." },
    { model: "openai/gpt-4o-mini-tts", voice: "shimmer", text: "Glad to be here." },
  ],
  gapMs: 400,
  volumeDbfs: -16,
  output: { format: "mp3" },
});
```

The equivalent REST body:

```json
{
  "model": "openai/gpt-4o-mini-tts",
  "turns": [
    { "voice": "alloy", "text": "Welcome to the show." },
    { "voice": "shimmer", "text": "Glad to be here." }
  ],
  "gapMs": 400,
  "volumeDbfs": -16,
  "output": "mp3"
}
```

Use this when one provider has the voices you want and you don't need cross-provider mixing.
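If you start from the shared-model REST shape and later need the SDK's per-turn shape, the expansion is mechanical. A minimal sketch (the `VoiceTurn`/`ModelTurn` names are illustrative, not SDK types):

```typescript
// Sketch: expand a top-level model over turns that only name a voice,
// producing the per-turn shape the SDK examples in this guide use.
interface VoiceTurn {
  voice: string;
  text: string;
}

interface ModelTurn extends VoiceTurn {
  model: string;
}

function expandSharedModel(model: string, turns: VoiceTurn[]): ModelTurn[] {
  return turns.map((turn) => ({ model, ...turn }));
}

const expanded = expandSharedModel("openai/gpt-4o-mini-tts", [
  { voice: "alloy", text: "Welcome to the show." },
  { voice: "shimmer", text: "Glad to be here." },
]);
```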
## Per-turn recipe (mixed providers)
Specify the model per turn and mix providers freely:
```typescript
import { generateConversation } from "@speech-sdk/core";

const result = await generateConversation({
  apiKey: process.env.SPEECHBASE_API_KEY,
  turns: [
    {
      model: "elevenlabs/eleven_v3",
      voice: "EXAV...",
      text: "I'm running narration on ElevenLabs.",
    },
    {
      model: "openai/gpt-4o-mini-tts",
      voice: "alloy",
      text: "And I'm replying on OpenAI.",
    },
  ],
  gapMs: 500,
  volumeDbfs: -16,
  output: { format: "mp3" },
});
```

The equivalent REST body:

```json
{
  "turns": [
    {
      "provider": "elevenlabs",
      "model": "eleven_v3",
      "voice": "EXAV...",
      "text": "I'm running narration on ElevenLabs."
    },
    {
      "provider": "openai",
      "model": "gpt-4o-mini-tts",
      "voice": "alloy",
      "text": "And I'm replying on OpenAI."
    }
  ],
  "gapMs": 500,
  "volumeDbfs": -16,
  "output": "mp3"
}
```

Use this when:
- Different voices live with different providers (custom-cloned voices, niche providers, etc.),
- You want to A/B different providers in the same piece,
- You're cost-optimising — fast/cheap provider for short turns, premium provider for hero turns.
Speechbase validates BYOK availability for every provider you reference before dispatching anything; you won't get a half-rendered conversation.
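You can mirror that check client-side to fail fast before calling the API at all. A sketch that collects the distinct providers a turn list references, assuming the `provider/model` string convention the SDK examples above use:

```typescript
// Sketch: list the distinct providers a conversation references, so you
// can verify a BYOK key is configured for each before dispatching.
function providersUsed(turns: { model: string }[]): string[] {
  const providers = turns.map((turn) => turn.model.split("/")[0]);
  return [...new Set(providers)]; // Set preserves first-seen order
}

const used = providersUsed([
  { model: "elevenlabs/eleven_v3" },
  { model: "openai/gpt-4o-mini-tts" },
  { model: "openai/gpt-4o-mini-tts" },
]);
// used → ["elevenlabs", "openai"]
```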
## Tuning gap and volume
| What you want | Settings |
|---|---|
| Tight, podcast-like back-and-forth | `gapMs: 250–400`, `volumeDbfs: -16` |
| Thoughtful, narrated dialogue | `gapMs: 600–900`, `volumeDbfs: -18` |
| Compliant streaming loudness (LUFS-style) | `volumeDbfs: -23` |
| Audiobook-style pacing | `gapMs: 800+` |
If you skip `volumeDbfs`, Speechbase passes audio through without normalisation. That works when every turn comes from the same provider with the same voice settings; with mixed providers you almost certainly want it set.
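One way to keep the table's recommendations handy in code is a small preset map. The preset names are illustrative, and the table's ranges are collapsed to single example values; the broadcast `gapMs` and audiobook `volumeDbfs` are assumptions the table doesn't specify:

```typescript
// Sketch: pacing presets derived from the tuning table above.
const pacingPresets = {
  podcast: { gapMs: 300, volumeDbfs: -16 },
  narrated: { gapMs: 750, volumeDbfs: -18 },
  broadcast: { gapMs: 400, volumeDbfs: -23 }, // LUFS-style loudness target
  audiobook: { gapMs: 800, volumeDbfs: -18 },
} as const;

type PacingPreset = keyof typeof pacingPresets;

function pacing(preset: PacingPreset) {
  return pacingPresets[preset];
}
```

Spread the chosen preset into your request body alongside `turns` and `output`.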
## Word-level timestamps
`POST /v1/audio/conversation/with-timestamps` returns the same envelope plus a `timestamps` array. Each entry includes a `turnIndex` so you can attribute words back to their turn. See Word-level timestamps.
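A common follow-up is grouping words back under their turns. Assuming each entry looks like `{ word, startMs, endMs, turnIndex }` — only `turnIndex` is documented here, the other field names are assumptions — the grouping is straightforward:

```typescript
// Sketch: bucket timestamp entries by the turn they belong to.
// Only turnIndex is documented in this guide; the rest of the
// entry shape is assumed for illustration.
interface WordTimestamp {
  word: string;
  startMs: number;
  endMs: number;
  turnIndex: number;
}

function wordsByTurn(timestamps: WordTimestamp[]): Map<number, WordTimestamp[]> {
  const byTurn = new Map<number, WordTimestamp[]>();
  for (const entry of timestamps) {
    const bucket = byTurn.get(entry.turnIndex) ?? [];
    bucket.push(entry);
    byTurn.set(entry.turnIndex, bucket);
  }
  return byTurn;
}

const grouped = wordsByTurn([
  { word: "Welcome", startMs: 0, endMs: 350, turnIndex: 0 },
  { word: "Glad", startMs: 750, endMs: 980, turnIndex: 1 },
]);
```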
## Limits and behaviour
- **Buffered, not streamed.** Conversations always return after every turn is rendered.
- **All-or-nothing.** A failure on any turn (moderation, provider error) fails the whole request.
- **Moderation runs per turn.** Each turn's text is checked; one bad turn blocks the conversation.
- **Logged as one request.** The log entry attributes the parent request and records per-turn metadata as children for analytics.
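Because a failed turn fails the whole request, the sensible recovery for transient provider errors is to retry the entire call. A generic sketch — nothing here is Speechbase-specific, and moderation failures are not worth retrying:

```typescript
// Sketch: retry an all-or-nothing async request with a fixed delay.
async function withRetries<T>(
  fn: () => Promise<T>,
  attempts = 3,
  delayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

Wrap your `generateConversation` call in `withRetries` and inspect the error first if you want to skip retries on moderation blocks.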