# Build a multi-speaker podcast
Generate scripted dialogue between two or more hosts in a single API call — stitched, normalised, and ready to publish.
You have a script — two hosts, alternating turns, maybe a guest. You want one audio file out: stitched, clean gaps between turns, consistent loudness. With Speechbase that's one HTTP request.
## The shape of it

Use `generateConversation()` from the SDK, or hit the gateway directly with curl. Both accept either an inline model + voice per turn or a saved voice referenced by `voiceId`.
```typescript
import { generateConversation } from "@speech-sdk/core";
import { writeFile } from "node:fs/promises";

const result = await generateConversation({
  apiKey: process.env.SPEECHBASE_API_KEY,
  turns: [
    { model: "elevenlabs/eleven_v3", voice: "EXAV...", text: "Welcome back to the show. I’m Alice." },
    { model: "elevenlabs/eleven_v3", voice: "ZQe5...", text: "And I’m Bob. Today we’re talking about voice infrastructure." },
    { model: "elevenlabs/eleven_v3", voice: "EXAV...", text: "Big topic. Let’s start with where it all goes wrong." },
  ],
  gapMs: 350,
  volumeDbfs: -16,
  output: { format: "mp3" },
});

await writeFile("episode-01.mp3", result.audio.uint8Array);
```

```bash
curl -X POST https://api.speechbase.ai/v1/audio/conversation \
  -H "Authorization: Bearer $SPEECHBASE_API_KEY" \
  -H "Content-Type: application/json" \
  --output episode-01.mp3 \
  -d '{
    "model": "elevenlabs/eleven_v3",
    "turns": [
      { "voice": "EXAV...", "text": "Welcome back to the show. I’m Alice." },
      { "voice": "ZQe5...", "text": "And I’m Bob. Today we’re talking about voice infrastructure." },
      { "voice": "EXAV...", "text": "Big topic. Let’s start with where it all goes wrong." }
    ],
    "gapMs": 350,
    "volumeDbfs": -16,
    "output": "mp3"
  }'
```

Either way, you get one MP3 back: every turn voiced by the right speaker, 350 ms of silence between turns, normalised to -16 dBFS (a podcast-friendly target).
## Why this is hard without Speechbase
If you go direct to a TTS provider, you get a clip per call. Stitching that into a podcast yourself means:
- Calling the API once per turn and tracking ordering.
- Decoding each clip, inserting silence, re-encoding into one container.
- Normalising loudness so a Cartesia turn doesn't blow out an OpenAI turn.
- Handling partial failures mid-script.
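To make the first few bullets concrete, here is a minimal sketch (not Speechbase code) of just the concatenation step you would otherwise own yourself, assuming each turn has already been decoded to mono PCM at a common sample rate:

```typescript
// Concatenate decoded PCM clips with a fixed silence gap between turns.
// Assumes mono Float32Array clips at the same sample rate; real pipelines
// also need decoding, loudness normalisation, and re-encoding on top.
function stitchClips(
  clips: Float32Array[],
  gapMs: number,
  sampleRate: number
): Float32Array {
  const gapSamples = Math.round((gapMs / 1000) * sampleRate);
  const totalSamples =
    clips.reduce((n, c) => n + c.length, 0) +
    gapSamples * Math.max(0, clips.length - 1);
  const out = new Float32Array(totalSamples); // zero-filled, so gaps are silence
  let offset = 0;
  clips.forEach((clip, i) => {
    out.set(clip, offset);
    offset += clip.length + (i < clips.length - 1 ? gapSamples : 0);
  });
  return out;
}
```

Even this toy version carries assumptions (one sample rate, mono, pre-decoded input) that break as soon as you mix providers, which is why the bullets above add up quickly.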
Speechbase's conversation endpoint does all of this server-side and validates BYOK availability for every provider in your turns before any synthesis happens — so you never get a half-rendered episode.
## Set up your hosts as voices

Once, in the dashboard, save the voices you like under Voices: pin the provider, model, and the provider's voice ID. (See Voices.) Application code can then reference each saved voice by its UUID with `mode: "voice"`.
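A saved-voice turn might look like the following sketch. The `mode`/`voiceId` field names follow the `mode: "voice"` form described above, but treat the exact shape as an assumption to verify against the Voices docs; the UUIDs are placeholders for your own saved voices.

```typescript
// Turns referencing saved voices by Speechbase UUID instead of an inline
// model + provider voice ID. Field names (mode, voiceId) are assumed from
// the mode: "voice" form; UUIDs below are hypothetical placeholders.
const ALICE = "00000000-0000-4000-8000-00000000a11c";
const BOB = "00000000-0000-4000-8000-0000000000b0";

const turns = [
  { mode: "voice", voiceId: ALICE, text: "Welcome back to the show." },
  { mode: "voice", voiceId: BOB, text: "Good to be here." },
];
```

Pass `turns` to `generateConversation()` exactly as in the inline-voice examples; only the per-turn shape changes.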
## Mixing providers in a single episode
Some voices live with one provider, some with another. You can mix them turn-by-turn:
```typescript
import { generateConversation } from "@speech-sdk/core";

const result = await generateConversation({
  apiKey: process.env.SPEECHBASE_API_KEY,
  turns: [
    {
      model: "elevenlabs/eleven_v3",
      voice: "EXAV...",
      text: "I'm running narration on ElevenLabs.",
    },
    {
      model: "openai/gpt-4o-mini-tts",
      voice: "alloy",
      text: "And I'm replying on OpenAI's voice.",
    },
  ],
  gapMs: 500,
  volumeDbfs: -16,
  output: { format: "mp3" },
});
```

```json
{
  "turns": [
    {
      "provider": "elevenlabs",
      "model": "eleven_v3",
      "voice": "EXAV...",
      "text": "I'm running narration on ElevenLabs."
    },
    {
      "provider": "openai",
      "model": "gpt-4o-mini-tts",
      "voice": "alloy",
      "text": "And I'm replying on OpenAI's voice."
    }
  ],
  "gapMs": 500,
  "volumeDbfs": -16,
  "output": "mp3"
}
```

`volumeDbfs` matters here: providers ship audio at different loudness levels, and without normalisation a mixed-provider episode will jump perceptibly between turns.
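To see the arithmetic behind a dBFS target, here is a small illustration (not Speechbase internals, and whether the gateway normalises by RMS or some other measure is not specified here): compute a clip's RMS level in dBFS and the linear gain that would move it to a target.

```typescript
// RMS level of a PCM clip in dBFS, with full scale = 1.0 → 0 dBFS.
function rmsDbfs(samples: Float32Array): number {
  const meanSquare = samples.reduce((s, x) => s + x * x, 0) / samples.length;
  return 10 * Math.log10(meanSquare);
}

// Linear gain multiplier that moves a level from currentDbfs to targetDbfs.
function gainTo(targetDbfs: number, currentDbfs: number): number {
  return 10 ** ((targetDbfs - currentDbfs) / 20);
}
```

A provider shipping audio 4 dB hotter than another needs a gain of about 0.63 applied to land on the same target, which is exactly the per-turn correction you'd otherwise compute by hand.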
## Tuning the conversation feel
| What you want | Settings |
|---|---|
| Tight, podcast-like back-and-forth | gapMs: 250–400, volumeDbfs: -16 |
| Thoughtful, narrated dialogue | gapMs: 600–900, volumeDbfs: -18 |
| Streaming-platform LUFS target | volumeDbfs: -23 |
| Audiobook-style pacing | gapMs: 800+ |
See Multi-speaker conversations for the full set of knobs.
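If you reuse these targets across episodes, the table can live in code as presets. These constants are a hypothetical convenience, not part of the SDK; the values follow the table above.

```typescript
// Pacing/loudness presets matching the tuning table; spread one into a
// generateConversation call, e.g. { ...CONVERSATION_PRESETS.podcast }.
const CONVERSATION_PRESETS = {
  podcast: { gapMs: 300, volumeDbfs: -16 },   // tight back-and-forth
  narrated: { gapMs: 750, volumeDbfs: -18 },  // thoughtful dialogue
  broadcast: { gapMs: 300, volumeDbfs: -23 }, // streaming-platform target
  audiobook: { gapMs: 900, volumeDbfs: -18 }, // slow, audiobook-style pacing
} as const;
```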
## Want chapters or show notes?
Use the with-timestamps variant to get word-level timing for every turn, including a `turnIndex` field that maps each word back to its speaker:

```
POST /v1/audio/conversation/with-timestamps
```

That gives you everything you need to build chapter markers, searchable transcripts, or per-host word counts. See Word-level timestamps.
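As one example of what `turnIndex` enables, here is a sketch that derives chapter markers by emitting one marker each time the speaker changes. The word shape used here (`word`, `startMs`, `turnIndex`) is an assumption; check the Word-level timestamps docs for the real response field names.

```typescript
// Assumed shape of one timestamped word from the with-timestamps response.
interface TimedWord {
  word: string;
  startMs: number;
  turnIndex: number;
}

// One chapter marker per speaker change, stamped at the first word
// spoken by the incoming speaker.
function chapterMarkers(
  words: TimedWord[]
): { atMs: number; turnIndex: number }[] {
  const markers: { atMs: number; turnIndex: number }[] = [];
  let lastTurn = -1;
  for (const w of words) {
    if (w.turnIndex !== lastTurn) {
      markers.push({ atMs: w.startMs, turnIndex: w.turnIndex });
      lastTurn = w.turnIndex;
    }
  }
  return markers;
}
```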
## Going further
- Custom hosts. If you already have a custom or cloned voice in a provider dashboard, import its provider voice ID into Voices and reuse it by Speechbase voice ID.
- Pronunciations. Lock in correct pronunciations of your show name, brand terms, and guest names using pronunciation rules.
- Moderation off-rails. If you generate dialogue from an LLM, keep moderation on with `fail_mode: "closed"` to catch off-policy text before it reaches a TTS bill.
- Per-host metrics. The request log attributes per-turn latency and character counts back to the originating voice, which is useful for cost-per-host accounting on long shows.
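Per-host accounting reduces to a grouping pass over request-log rows. The row shape below (`voiceId`, `characters`) is an assumption for illustration; map it to whatever fields your request log actually exposes.

```typescript
// Assumed shape of one request-log row for a single turn.
interface LogRow {
  voiceId: string;
  characters: number;
}

// Total synthesised characters per voice, keyed by voice ID.
function charactersPerHost(rows: LogRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of rows) {
    totals.set(r.voiceId, (totals.get(r.voiceId) ?? 0) + r.characters);
  }
  return totals;
}
```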