Conversations

Multi-turn, multi-speaker synthesis in a single API call — with optional cross-provider routing, gap control, and volume normalisation.

A conversation is a single API call that produces one stitched audio file out of multiple turns of dialogue. Each turn picks its own voice. Speechbase handles the heavy lifting: dispatching each turn (potentially to a different provider), inserting silence between turns, and normalising volume so one speaker doesn't drown out the next.

This is the right primitive for podcasts, narrated dialogues, customer-service demos, training videos, and any other multi-speaker content.

The shape of a request

{
  "model": "openai/gpt-4o-mini-tts",
  "turns": [
    { "voice": "alloy",   "text": "How was your weekend?" },
    { "voice": "shimmer", "text": "Good — I finally caught up on sleep." },
    { "voice": "alloy",   "text": "Lucky you." }
  ],
  "gapMs": 500,
  "volumeDbfs": -16,
  "output": "mp3"
}

POST /v1/audio/conversation returns the stitched audio bytes.
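
As a rough illustration, the request above could be assembled in Python like this. The base URL and the Bearer-token header are assumptions made for the sketch, not values taken from this page; only the path and the body fields come from the example above.

```python
import json
import urllib.request

# Assumption: hypothetical host, substitute your real Speechbase base URL.
BASE_URL = "https://api.speechbase.example"


def build_conversation_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/audio/conversation."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/conversation",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "model": "openai/gpt-4o-mini-tts",
    "turns": [
        {"voice": "alloy", "text": "How was your weekend?"},
        {"voice": "shimmer", "text": "Good, I finally caught up on sleep."},
        {"voice": "alloy", "text": "Lucky you."},
    ],
    "gapMs": 500,
    "volumeDbfs": -16,
    "output": "mp3",
}

request = build_conversation_request(payload, api_key="YOUR_API_KEY")

# To send it and save the stitched audio bytes:
# with urllib.request.urlopen(request) as response:
#     with open("conversation.mp3", "wb") as f:
#         f.write(response.read())
```

The network call is left commented out so the snippet can be run without credentials.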

One model or per-turn models — pick one

Each turn needs to know which provider, model, and voice to use. There are two mutually exclusive ways to specify that:

Shared model. Set model at the top level and let every turn inherit it. Each turn just picks a voice.
Per-turn model. Omit the top-level model and set model and voice (or mode: "voice" + voiceId) on each turn individually.

Mixing the two — top-level model plus per-turn model on a flat (inline) turn — is rejected. The schema enforces "exactly one source of truth per turn."
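
A client-side pre-check for this rule might look like the sketch below. The function name and error messages are hypothetical, and the server-side schema remains the authority. Treating a turn with no model at all as an error is an assumption inferred from the requirement that every turn resolves to a model.

```python
def check_model_source(request: dict) -> None:
    """Raise ValueError if a request mixes shared and per-turn models.

    Hypothetical helper: it mirrors the rule described above, not the
    actual Speechbase schema.
    """
    shared = "model" in request
    for index, turn in enumerate(request["turns"]):
        per_turn = "model" in turn
        is_saved_voice = turn.get("mode") == "voice"
        if shared and per_turn:
            raise ValueError(f"turn {index}: top-level and per-turn model both set")
        # Assumption: a flat turn with no model anywhere cannot be resolved.
        if not shared and not per_turn and not is_saved_voice:
            raise ValueError(f"turn {index}: no model specified")


# Shared model: every turn only picks a voice.
check_model_source({
    "model": "openai/gpt-4o-mini-tts",
    "turns": [{"voice": "alloy", "text": "Hi."}],
})

# Per-turn model: no top-level model.
check_model_source({
    "turns": [{"model": "openai/gpt-4o-mini-tts", "voice": "alloy", "text": "Hi."}],
})
```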

Voice references

Each turn can reference its voice in either of two ways, the same as a single-shot speech request:

Inline (flat): { "model": "openai/gpt-4o-mini-tts", "voice": "alloy" } (or just "voice" if model is set top-level).
Voice (modal): { "mode": "voice", "voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2" }, where voiceId is the UUID of a saved voice.

You can mix the styles across turns — turn 0 can use a saved voice, turn 1 can use an inline voice. Speechbase resolves them independently.

Gap and volume controls

Field	Default	What it does
`gapMs`	`0`	Milliseconds of silence inserted between consecutive turns.
`volumeDbfs`	`null`	Target peak loudness in dBFS (e.g. `-16`). Each turn is normalised to it.
`output`	`wav`	`wav`, `mp3`, or `pcm`. For MP3 you can instead pass an object, `{ format: "mp3", bitrate: 96 }`, to set the bitrate (kbps; defaults to 96).
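
Because gapMs is inserted between consecutive turns, a conversation with n turns gains n - 1 gaps. The sketch below estimates the stitched length from per-turn durations. The function name is hypothetical, and it assumes no silence is added before the first turn or after the last one.

```python
def stitched_duration_ms(turn_durations_ms: list[int], gap_ms: int = 0) -> int:
    """Estimate total length: all turns plus one gap between each pair.

    Assumption: no leading or trailing silence is added.
    """
    if not turn_durations_ms:
        return 0
    gaps = len(turn_durations_ms) - 1
    return sum(turn_durations_ms) + gap_ms * gaps


# Three turns with gapMs = 500 means two gaps of 500 ms each.
total = stitched_duration_ms([1200, 2500, 800], gap_ms=500)
```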

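Volume normalisation matters in practice: different providers ship audio at different reference levels, so without normalisation a Cartesia turn followed by an ElevenLabs turn will jump perceptibly. Setting volumeDbfs to -16 or -23 usually fixes this. Those figures echo the common -16 LUFS podcast target and the -23 LUFS EBU R 128 broadcast target, but LUFS measures integrated loudness whereas volumeDbfs is a peak level in dBFS, so treat them as starting points rather than exact equivalents.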
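
To show what normalising to a peak target means in general, the sketch below scales two clips of very different levels to the same peak. It illustrates the arithmetic only. How Speechbase measures and applies the gain internally is not described here, and the function names are hypothetical.

```python
import math


def peak_dbfs(samples: list[float]) -> float:
    """Peak level of float samples in [-1.0, 1.0]; 0 dBFS is full scale."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")  # digital silence has no finite level
    return 20 * math.log10(peak)


def normalise_peak(samples: list[float], target_dbfs: float) -> list[float]:
    """Scale samples so their peak lands on target_dbfs."""
    current = peak_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # nothing to scale
    gain = 10 ** ((target_dbfs - current) / 20)  # dB difference to linear gain
    return [s * gain for s in samples]


loud_turn = [0.0, 0.5, -1.0]    # peaks at 0 dBFS
quiet_turn = [0.0, 0.05, -0.1]  # peaks at about -20 dBFS

loud_out = normalise_peak(loud_turn, -16)
quiet_out = normalise_peak(quiet_turn, -16)
```

After normalisation both clips peak at the same level, which is what removes the jump between turns.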

Cross-provider routing

Because each turn carries its own provider/model/voice, you can mix providers in a single conversation:

{
  "turns": [
    { "provider": "openai",    "model": "gpt-4o-mini-tts", "voice": "alloy",
      "text": "Some narration goes here." },
    { "provider": "elevenlabs","model": "eleven_v3",       "voice": "EXAV...",
      "text": "And then a different speaker chimes in." }
  ]
}

Speechbase validates up front that every provider referenced by a turn has a BYOK key and is enabled. If any of them is missing, the whole request fails, so you never end up with a half-stitched conversation.
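
If you want to catch a missing provider before sending the request, a client-side pre-flight along these lines is one option. The helper names and the configured set are hypothetical. The sketch assumes the provider is either an explicit provider field or the prefix before the slash in a model string such as "openai/gpt-4o-mini-tts", and it cannot see the provider behind a saved voice.

```python
def providers_in(request: dict) -> set[str]:
    """Collect provider names referenced by the turns of a request."""
    providers = set()
    shared_model = request.get("model")
    for turn in request["turns"]:
        if "provider" in turn:
            providers.add(turn["provider"])
            continue
        # Assumption: "provider/model" strings carry the provider as a prefix.
        model = turn.get("model", shared_model)
        if model and "/" in model:
            providers.add(model.split("/", 1)[0])
    return providers


def missing_providers(request: dict, configured: set[str]) -> set[str]:
    """Providers the request needs that are not in the configured set."""
    return providers_in(request) - configured


request = {
    "turns": [
        {"provider": "openai", "model": "gpt-4o-mini-tts", "voice": "alloy",
         "text": "Some narration goes here."},
        {"provider": "elevenlabs", "model": "eleven_v3",
         "voice": "YOUR_ELEVENLABS_VOICE_ID",  # placeholder, not a real voice ID
         "text": "And then a different speaker chimes in."},
    ]
}

missing = missing_providers(request, configured={"openai"})
```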

Word-level timestamps

POST /v1/audio/conversation/with-timestamps returns a JSON envelope with audio (base64) plus a flat timestamps array. Each word entry includes turnIndex, mapping it back to its originating turn. See Word-level timestamps for the full schema.
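
A response of this shape could be unpacked as sketched below. Only audio, timestamps, and turnIndex come from the description above. The word, start, and end fields in the sample are hypothetical placeholders, so check the full schema for the real names.

```python
import base64
from collections import defaultdict


def split_envelope(envelope: dict) -> tuple[bytes, dict[int, list[dict]]]:
    """Decode the audio and group word entries by their originating turn."""
    audio = base64.b64decode(envelope["audio"])
    by_turn: dict[int, list[dict]] = defaultdict(list)
    for entry in envelope["timestamps"]:
        by_turn[entry["turnIndex"]].append(entry)
    return audio, dict(by_turn)


# Sample envelope; "word", "start", and "end" are assumed field names.
sample = {
    "audio": base64.b64encode(b"fake-audio").decode("ascii"),
    "timestamps": [
        {"word": "How", "start": 0.00, "end": 0.18, "turnIndex": 0},
        {"word": "was", "start": 0.18, "end": 0.35, "turnIndex": 0},
        {"word": "Good", "start": 1.40, "end": 1.72, "turnIndex": 1},
    ],
}

audio_bytes, words_by_turn = split_envelope(sample)
```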

Behaviour and limits

Audio is buffered, not streamed: the response is returned only once every turn has been synthesised.
Each turn's text is moderated on its own, and a single rejected turn fails the whole conversation.
Moderation and the provider-keys check each run once per request, covering all turns, and the request produces one log entry; per-turn telemetry (latency, character count) is attached as child events for analytics.