# Build a multi-speaker podcast
Generate scripted dialogue between two or more hosts in a single API call — stitched, normalised, and ready to publish.
You have a script — two hosts, alternating turns, maybe a guest. You want one audio file out: stitched, clean gaps between turns, consistent loudness. With Speechbase that's one HTTP request.
## The shape of it

Use `generateConversation()` from the SDK, or hit the gateway directly with curl. Both accept either an inline model + voice per turn or a saved voice referenced by `voiceId`.
```typescript
import { generateConversation } from "@speech-sdk/core";
import { writeFile } from "node:fs/promises";

const result = await generateConversation({
  apiKey: process.env.SPEECHBASE_API_KEY,
  turns: [
    { model: "elevenlabs/eleven_v3", voice: "EXAV...", text: "Welcome back to the show. I’m Alice." },
    { model: "elevenlabs/eleven_v3", voice: "ZQe5...", text: "And I’m Bob. Today we’re talking about voice infrastructure." },
    { model: "elevenlabs/eleven_v3", voice: "EXAV...", text: "Big topic. Let’s start with where it all goes wrong." },
  ],
  gapMs: 350,
  volumeDbfs: -16,
  output: { format: "mp3" },
});

await writeFile("episode-01.mp3", result.audio.uint8Array);
```

```bash
curl -X POST https://api.speechbase.ai/v1/audio/conversation \
  -H "Authorization: Bearer $SPEECHBASE_API_KEY" \
  -H "Content-Type: application/json" \
  --output episode-01.mp3 \
  -d '{
    "model": "elevenlabs/eleven_v3",
    "turns": [
      { "voice": "EXAV...", "text": "Welcome back to the show. I’m Alice." },
      { "voice": "ZQe5...", "text": "And I’m Bob. Today we’re talking about voice infrastructure." },
      { "voice": "EXAV...", "text": "Big topic. Let’s start with where it all goes wrong." }
    ],
    "gapMs": 350,
    "volumeDbfs": -16,
    "output": "mp3"
  }'
```

Either way, you get one MP3 back: every turn voiced by the right speaker, 350 ms of silence between turns, normalised to -16 dBFS (a podcast-friendly target).
## Why this is hard without Speechbase
If you go direct to a TTS provider, you get a clip per call. Stitching that into a podcast yourself means:
- Calling the API once per turn and tracking ordering.
- Decoding each clip, inserting silence, re-encoding into one container.
- Normalising loudness so a Cartesia turn doesn't blow out an OpenAI turn.
- Handling partial failures mid-script.
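To make the first few bullets concrete, here is a minimal sketch (not Speechbase code) of just the concatenation step you would otherwise own yourself, assuming each turn has already been decoded to mono PCM at a common sample rate:

```typescript
// Concatenate decoded PCM clips with a fixed silence gap between turns.
// Assumes mono Float32Array clips at the same sample rate; real pipelines
// also need decoding, loudness normalisation, and re-encoding on top.
function stitchClips(
  clips: Float32Array[],
  gapMs: number,
  sampleRate: number
): Float32Array {
  const gapSamples = Math.round((gapMs / 1000) * sampleRate);
  const totalSamples =
    clips.reduce((n, c) => n + c.length, 0) +
    gapSamples * Math.max(0, clips.length - 1);
  const out = new Float32Array(totalSamples); // zero-filled, so gaps are silence
  let offset = 0;
  clips.forEach((clip, i) => {
    out.set(clip, offset);
    offset += clip.length + (i < clips.length - 1 ? gapSamples : 0);
  });
  return out;
}
```

Even this toy version carries assumptions (one sample rate, mono, pre-decoded input) that break as soon as you mix providers, which is why the bullets above add up quickly.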
Speechbase's conversation endpoint does all of this server-side and validates BYOK availability for every provider in your turns before any synthesis happens — so you never get a half-rendered episode.
## Set up your hosts as voices

Once, in the dashboard, save the voices you like under Voices: pin the provider, model, and the provider's voice ID. (See Voices.) Application code can then reference each saved voice by its UUID with `mode: "voice"`.
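A saved-voice turn might look like the following sketch. The `mode`/`voiceId` field names follow the `mode: "voice"` form described above, but treat the exact shape as an assumption to verify against the Voices docs; the UUIDs are placeholders for your own saved voices.

```typescript
// Turns referencing saved voices by Speechbase UUID instead of an inline
// model + provider voice ID. Field names (mode, voiceId) are assumed from
// the mode: "voice" form; UUIDs below are hypothetical placeholders.
const ALICE = "00000000-0000-4000-8000-00000000a11c";
const BOB = "00000000-0000-4000-8000-0000000000b0";

const turns = [
  { mode: "voice", voiceId: ALICE, text: "Welcome back to the show." },
  { mode: "voice", voiceId: BOB, text: "Good to be here." },
];
```

Pass `turns` to `generateConversation()` exactly as in the inline-voice examples; only the per-turn shape changes.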
## Mixing providers in a single episode
Some voices live with one provider, some with another. You can mix them turn-by-turn:
```typescript
import { generateConversation } from "@speech-sdk/core";

const result = await generateConversation({
  apiKey: process.env.SPEECHBASE_API_KEY,
  turns: [
    {
      model: "elevenlabs/eleven_v3",
      voice: "EXAV...",
      text: "I'm running narration on ElevenLabs.",
    },
    {
      model: "openai/gpt-4o-mini-tts",
      voice: "alloy",
      text: "And I'm replying on OpenAI's voice.",
    },
  ],
  gapMs: 500,
  volumeDbfs: -16,
  output: { format: "mp3" },
});
```

```json
{
  "turns": [
    {
      "provider": "elevenlabs",
      "model": "eleven_v3",
      "voice": "EXAV...",
      "text": "I'm running narration on ElevenLabs."
    },
    {
      "provider": "openai",
      "model": "gpt-4o-mini-tts",
      "voice": "alloy",
      "text": "And I'm replying on OpenAI's voice."
    }
  ],
  "gapMs": 500,
  "volumeDbfs": -16,
  "output": "mp3"
}
```

`volumeDbfs` matters here: providers ship audio at different loudness levels, and without normalisation a mixed-provider episode will jump perceptibly between turns.
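To see the arithmetic behind a dBFS target, here is a small illustration (not Speechbase internals, and whether the gateway normalises by RMS or some other measure is not specified here): compute a clip's RMS level in dBFS and the linear gain that would move it to a target.

```typescript
// RMS level of a PCM clip in dBFS, with full scale = 1.0 → 0 dBFS.
function rmsDbfs(samples: Float32Array): number {
  const meanSquare = samples.reduce((s, x) => s + x * x, 0) / samples.length;
  return 10 * Math.log10(meanSquare);
}

// Linear gain multiplier that moves a level from currentDbfs to targetDbfs.
function gainTo(targetDbfs: number, currentDbfs: number): number {
  return 10 ** ((targetDbfs - currentDbfs) / 20);
}
```

A provider shipping audio 4 dB hotter than another needs a gain of about 0.63 applied to land on the same target, which is exactly the per-turn correction you'd otherwise compute by hand.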
## Tuning the conversation feel
| What you want | Settings |
|---|---|
| Tight, podcast-like back-and-forth | gapMs: 250–400, volumeDbfs: -16 |
| Thoughtful, narrated dialogue | gapMs: 600–900, volumeDbfs: -18 |
| Streaming-platform LUFS target | volumeDbfs: -23 |
| Audiobook-style pacing | gapMs: 800+ |
See Multi-speaker conversations for the full set of knobs.
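If you reuse these targets across episodes, the table can live in code as presets. These constants are a hypothetical convenience, not part of the SDK; the values follow the table above.

```typescript
// Pacing/loudness presets matching the tuning table; spread one into a
// generateConversation call, e.g. { ...CONVERSATION_PRESETS.podcast }.
const CONVERSATION_PRESETS = {
  podcast: { gapMs: 300, volumeDbfs: -16 },   // tight back-and-forth
  narrated: { gapMs: 750, volumeDbfs: -18 },  // thoughtful dialogue
  broadcast: { gapMs: 300, volumeDbfs: -23 }, // streaming-platform target
  audiobook: { gapMs: 900, volumeDbfs: -18 }, // slow, audiobook-style pacing
} as const;
```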
## Want chapters or show notes?
Use the with-timestamps variant to get word-level timing for every turn, including a `turnIndex` field that maps each word back to its speaker:

```
POST /v1/audio/conversation/with-timestamps
```

That gives you everything you need to build chapter markers, searchable transcripts, or per-host word counts. See Word-level timestamps.
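As one example of what `turnIndex` enables, here is a sketch that derives chapter markers by emitting one marker each time the speaker changes. The word shape used here (`word`, `startMs`, `turnIndex`) is an assumption; check the Word-level timestamps docs for the real response field names.

```typescript
// Assumed shape of one timestamped word from the with-timestamps response.
interface TimedWord {
  word: string;
  startMs: number;
  turnIndex: number;
}

// One chapter marker per speaker change, stamped at the first word
// spoken by the incoming speaker.
function chapterMarkers(
  words: TimedWord[]
): { atMs: number; turnIndex: number }[] {
  const markers: { atMs: number; turnIndex: number }[] = [];
  let lastTurn = -1;
  for (const w of words) {
    if (w.turnIndex !== lastTurn) {
      markers.push({ atMs: w.startMs, turnIndex: w.turnIndex });
      lastTurn = w.turnIndex;
    }
  }
  return markers;
}
```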
## Going further
- Custom hosts. If you already have a custom or cloned voice in a provider dashboard, import its provider voice ID into Voices and reuse it by Speechbase voice ID.
- Pronunciations. Lock in correct pronunciations of your show name, brand terms, and guest names using pronunciation rules.
- Moderation off-rails. If you generate dialogue from an LLM, keep moderation on with `fail_mode: "closed"` to catch off-policy text before it reaches a TTS bill.
- Per-host metrics. The request log attributes per-turn latency and character counts back to the originating voice, which is useful for cost-per-host accounting on long shows.
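Per-host accounting reduces to a grouping pass over request-log rows. The row shape below (`voiceId`, `characters`) is an assumption for illustration; map it to whatever fields your request log actually exposes.

```typescript
// Assumed shape of one request-log row for a single turn.
interface LogRow {
  voiceId: string;
  characters: number;
}

// Total synthesised characters per voice, keyed by voice ID.
function charactersPerHost(rows: LogRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of rows) {
    totals.set(r.voiceId, (totals.get(r.voiceId) ?? 0) + r.characters);
  }
  return totals;
}
```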