# Multi-speaker conversations
Generate stitched, multi-turn dialogue audio in one API call — including across providers.
If you've read Conversations, you know the shape. This guide is the practical recipe: when to use the conversation endpoint, how to choose between shared and per-turn models, and how to manage gaps and volume.
## When to use it
Use `POST /v1/audio/conversation` whenever you need:
- More than one voice in a single piece of audio,
- Server-side stitching (no client-side audio mixing),
- Volume normalisation across mixed providers,
- Per-turn timing control.
If you just need a single voice reading a long passage, use `/v1/audio/speech` instead; it streams, so audio starts playing sooner.
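For comparison, a single-voice request body is much smaller than a conversation. A minimal sketch — the `model`, `voice`, `text`, and `output` fields mirror the conversation examples in this guide, but the exact `/v1/audio/speech` schema may differ:

```typescript
// Sketch: build a single-voice request body for POST /v1/audio/speech.
// Field names are assumed to mirror the conversation endpoint's body.
function buildSpeechRequest(model: string, voice: string, text: string) {
  return { model, voice, text, output: "mp3" };
}

const body = buildSpeechRequest(
  "openai/gpt-4o-mini-tts",
  "alloy",
  "A long passage read by a single voice.",
);
```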
## Shared model recipe (single provider)
The simplest case: pin the provider and model once (top-level in the REST body; repeated per turn in the SDK) and let each turn pick only a voice:
```typescript
import { generateConversation } from "@speech-sdk/core";

const result = await generateConversation({
  apiKey: process.env.SPEECHBASE_API_KEY,
  turns: [
    { model: "openai/gpt-4o-mini-tts", voice: "alloy", text: "Welcome to the show." },
    { model: "openai/gpt-4o-mini-tts", voice: "shimmer", text: "Glad to be here." },
  ],
  gapMs: 400,
  volumeDbfs: -16,
  output: { format: "mp3" },
});
```

The equivalent REST body:

```json
{
  "model": "openai/gpt-4o-mini-tts",
  "turns": [
    { "voice": "alloy", "text": "Welcome to the show." },
    { "voice": "shimmer", "text": "Glad to be here." }
  ],
  "gapMs": 400,
  "volumeDbfs": -16,
  "output": "mp3"
}
```

Use this when one provider has the voices you want and you don't need cross-provider mixing.
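If you start from the shared-model REST shape and later need the SDK's per-turn shape, the expansion is mechanical. A minimal sketch (the `VoiceTurn`/`ModelTurn` names are illustrative, not SDK types):

```typescript
// Sketch: expand a top-level model over turns that only name a voice,
// producing the per-turn shape the SDK examples in this guide use.
interface VoiceTurn {
  voice: string;
  text: string;
}

interface ModelTurn extends VoiceTurn {
  model: string;
}

function expandSharedModel(model: string, turns: VoiceTurn[]): ModelTurn[] {
  return turns.map((turn) => ({ model, ...turn }));
}

const expanded = expandSharedModel("openai/gpt-4o-mini-tts", [
  { voice: "alloy", text: "Welcome to the show." },
  { voice: "shimmer", text: "Glad to be here." },
]);
```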
## Per-turn recipe (mixed providers)
Specify the model per turn and mix providers freely:
```typescript
import { generateConversation } from "@speech-sdk/core";

const result = await generateConversation({
  apiKey: process.env.SPEECHBASE_API_KEY,
  turns: [
    {
      model: "elevenlabs/eleven_v3",
      voice: "EXAV...",
      text: "I'm running narration on ElevenLabs.",
    },
    {
      model: "openai/gpt-4o-mini-tts",
      voice: "alloy",
      text: "And I'm replying on OpenAI.",
    },
  ],
  gapMs: 500,
  volumeDbfs: -16,
  output: { format: "mp3" },
});
```

The equivalent REST body:

```json
{
  "turns": [
    {
      "provider": "elevenlabs",
      "model": "eleven_v3",
      "voice": "EXAV...",
      "text": "I'm running narration on ElevenLabs."
    },
    {
      "provider": "openai",
      "model": "gpt-4o-mini-tts",
      "voice": "alloy",
      "text": "And I'm replying on OpenAI."
    }
  ],
  "gapMs": 500,
  "volumeDbfs": -16,
  "output": "mp3"
}
```

Use this when:
- Different voices live with different providers (custom-cloned voices, niche providers, etc.),
- You want to A/B different providers in the same piece,
- You're cost-optimising — fast/cheap provider for short turns, premium provider for hero turns.
Speechbase validates BYOK availability for every provider you reference before dispatching anything; you won't get a half-rendered conversation.
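You can mirror that check client-side to fail fast before calling the API at all. A sketch that collects the distinct providers a turn list references, assuming the `provider/model` string convention the SDK examples above use:

```typescript
// Sketch: list the distinct providers a conversation references, so you
// can verify a BYOK key is configured for each before dispatching.
function providersUsed(turns: { model: string }[]): string[] {
  const providers = turns.map((turn) => turn.model.split("/")[0]);
  return [...new Set(providers)]; // Set preserves first-seen order
}

const used = providersUsed([
  { model: "elevenlabs/eleven_v3" },
  { model: "openai/gpt-4o-mini-tts" },
  { model: "openai/gpt-4o-mini-tts" },
]);
// used → ["elevenlabs", "openai"]
```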
## Tuning gap and volume
| What you want | Settings |
|---|---|
| Tight, podcast-like back-and-forth | `gapMs: 250–400`, `volumeDbfs: -16` |
| Thoughtful, narrated dialogue | `gapMs: 600–900`, `volumeDbfs: -18` |
| Compliant streaming loudness (LUFS-style) | `volumeDbfs: -23` |
| Audiobook-style pacing | `gapMs: 800+` |
If you skip `volumeDbfs`, Speechbase passes audio through without normalisation. That works when every turn comes from the same provider with the same voice settings; with mixed providers you almost certainly want it set.
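One way to keep the table's recommendations handy in code is a small preset map. The preset names are illustrative, and the table's ranges are collapsed to single example values; the broadcast `gapMs` and audiobook `volumeDbfs` are assumptions the table doesn't specify:

```typescript
// Sketch: pacing presets derived from the tuning table above.
const pacingPresets = {
  podcast: { gapMs: 300, volumeDbfs: -16 },
  narrated: { gapMs: 750, volumeDbfs: -18 },
  broadcast: { gapMs: 400, volumeDbfs: -23 }, // LUFS-style loudness target
  audiobook: { gapMs: 800, volumeDbfs: -18 },
} as const;

type PacingPreset = keyof typeof pacingPresets;

function pacing(preset: PacingPreset) {
  return pacingPresets[preset];
}
```

Spread the chosen preset into your request body alongside `turns` and `output`.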
## Word-level timestamps
`POST /v1/audio/conversation/with-timestamps` returns the same envelope plus a `timestamps` array. Each entry includes a `turnIndex` so you can attribute words back to their turn. See Word-level timestamps.
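A common follow-up is grouping words back under their turns. Assuming each entry looks like `{ word, startMs, endMs, turnIndex }` — only `turnIndex` is documented here, the other field names are assumptions — the grouping is straightforward:

```typescript
// Sketch: bucket timestamp entries by the turn they belong to.
// Only turnIndex is documented in this guide; the rest of the
// entry shape is assumed for illustration.
interface WordTimestamp {
  word: string;
  startMs: number;
  endMs: number;
  turnIndex: number;
}

function wordsByTurn(timestamps: WordTimestamp[]): Map<number, WordTimestamp[]> {
  const byTurn = new Map<number, WordTimestamp[]>();
  for (const entry of timestamps) {
    const bucket = byTurn.get(entry.turnIndex) ?? [];
    bucket.push(entry);
    byTurn.set(entry.turnIndex, bucket);
  }
  return byTurn;
}

const grouped = wordsByTurn([
  { word: "Welcome", startMs: 0, endMs: 350, turnIndex: 0 },
  { word: "Glad", startMs: 750, endMs: 980, turnIndex: 1 },
]);
```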
## Limits and behaviour
- **Buffered, not streamed.** Conversations always return after every turn is rendered.
- **All-or-nothing.** A failure on any turn (moderation, provider error) fails the whole request.
- **Moderation runs per turn.** Each turn's text is checked; one bad turn blocks the conversation.
- **Logged as one request.** The log entry attributes the parent request and records per-turn metadata as children for analytics.
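Because a failed turn fails the whole request, the sensible recovery for transient provider errors is to retry the entire call. A generic sketch — nothing here is Speechbase-specific, and moderation failures are not worth retrying:

```typescript
// Sketch: retry an all-or-nothing async request with a fixed delay.
async function withRetries<T>(
  fn: () => Promise<T>,
  attempts = 3,
  delayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

Wrap your `generateConversation` call in `withRetries` and inspect the error first if you want to skip retries on moderation blocks.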