
Conversations

Multi-turn, multi-speaker synthesis in a single API call — with optional cross-provider routing, gap control, and volume normalisation.

A conversation is a single API call that produces one stitched audio file out of multiple turns of dialogue. Each turn picks its own voice. Speechbase handles the heavy lifting: dispatching each turn (potentially to a different provider), inserting silence between turns, and normalising volume so one speaker doesn't drown out the next.

This is the right primitive for podcasts, narrated dialogues, customer-service demos, training videos, and any other multi-speaker content.

The shape of a request

{
  "model": "openai/gpt-4o-mini-tts",
  "turns": [
    { "voice": "alloy",   "text": "How was your weekend?" },
    { "voice": "shimmer", "text": "Good — I finally caught up on sleep." },
    { "voice": "alloy",   "text": "Lucky you." }
  ],
  "gapMs": 500,
  "volumeDbfs": -16,
  "output": "mp3"
}

POST /v1/audio/conversation returns the stitched audio bytes.

One model or per-turn models — pick one

Each turn needs to know which provider, model, and voice to use. There are two mutually exclusive ways to specify that:

  1. Shared model. Set model at the top level and let every turn inherit it. Each turn just picks a voice.
  2. Per-turn model. Omit the top-level model and set model and voice (or mode: "voice" + voiceId) on each turn individually.

Mixing the two — top-level model plus per-turn model on a flat (inline) turn — is rejected. The schema enforces "exactly one source of truth per turn."
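
To make the rule concrete, here's a sketch of a request the schema would reject: the top-level model and the first turn's own model compete as sources of truth. (The model ids and texts are illustrative.)

{
  "model": "openai/gpt-4o-mini-tts",
  "turns": [
    { "model": "openai/tts-1", "voice": "onyx",
      "text": "Rejected: this inline turn sets its own model." },
    { "voice": "shimmer",
      "text": "On its own, this turn would be fine." }
  ]
}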

Voice references

Each turn can reference its voice in either of two ways, the same as a single-shot speech request:

  • Inline (flat): { "model": "openai/gpt-4o-mini-tts", "voice": "alloy" } (or just "voice" if model is set top-level).
  • Voice (modal): { "mode": "voice", "voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2" }, where voiceId is the UUID of a saved voice.

You can mix the styles across turns — turn 0 can use a saved voice, turn 1 can use an inline voice. Speechbase resolves them independently.
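
For instance, a sketch of a two-turn request mixing the styles. The top-level model is omitted, so the inline turn carries its own; the voiceId is the illustrative UUID from above.

{
  "turns": [
    { "mode": "voice", "voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
      "text": "Turn 0 uses a saved voice." },
    { "model": "openai/gpt-4o-mini-tts", "voice": "alloy",
      "text": "Turn 1 uses an inline voice." }
  ]
}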

Gap and volume controls

Field        Default   What it does
gapMs        0         Milliseconds of silence inserted between consecutive turns.
volumeDbfs   null      Target peak loudness in dBFS (e.g. -16). Each turn is normalised to it.
output       wav       One of wav, mp3, or pcm. mp3 also accepts an object form, { "format": "mp3", "bitrate": 96 } (kbps; defaults to 96).

Volume normalisation matters in practice: different providers ship audio at different reference loudness levels, so without normalisation a Cartesia turn followed by an ElevenLabs turn will jump perceptibly in loudness. Setting volumeDbfs: -16 (the common podcast and streaming target) or -23 (the EBU broadcast standard) usually fixes this.
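
Putting the controls together, a sketch that pads turns with half a second of silence, normalises each turn to -16 dBFS, and requests a 128 kbps MP3 via the object form of output from the table above:

{
  "model": "openai/gpt-4o-mini-tts",
  "turns": [
    { "voice": "alloy",   "text": "First speaker." },
    { "voice": "shimmer", "text": "Second speaker." }
  ],
  "gapMs": 500,
  "volumeDbfs": -16,
  "output": { "format": "mp3", "bitrate": 128 }
}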

Cross-provider routing

Because each turn carries its own provider/model/voice, you can mix providers in a single conversation:

{
  "turns": [
    { "provider": "openai",    "model": "gpt-4o-mini-tts", "voice": "alloy",
      "text": "Some narration goes here." },
    { "provider": "elevenlabs","model": "eleven_v3",       "voice": "EXAV...",
      "text": "And then a different speaker chimes in." }
  ]
}

Speechbase validates up front that both providers have BYOK keys and are enabled. If either check fails, the whole request is rejected, so you never end up with a half-stitched conversation.

Word-level timestamps

POST /v1/audio/conversation/with-timestamps returns a JSON envelope with audio (base64) plus a flat timestamps array. Each word entry includes turnIndex, mapping it back to its originating turn. See Word-level timestamps for the full schema.
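
Roughly, the envelope looks like the sketch below. The audio field and turnIndex are described above; the per-word field names (word, startMs, endMs) and values are illustrative, so check the Word-level timestamps page for the exact schema.

{
  "audio": "<base64-encoded audio>",
  "timestamps": [
    { "turnIndex": 0, "word": "How",  "startMs": 0,    "endMs": 180 },
    { "turnIndex": 0, "word": "was",  "startMs": 180,  "endMs": 340 },
    { "turnIndex": 1, "word": "Good", "startMs": 1540, "endMs": 1760 }
  ]
}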

Behaviour and limits

  • Audio is buffered, not streamed. A conversation returns only after every turn has been synthesised.
  • Moderation covers every turn's text; a single flagged turn fails the whole conversation.
  • The whole request shares one moderation evaluation, one provider-keys check, and one log entry; per-turn telemetry (latency, character count) is attached as child events for analytics.
