Speechbase

Conversations

Multi-turn, multi-speaker synthesis in a single API call — with optional cross-provider routing, gap control, and volume normalisation.

A conversation is a single API call that produces one stitched audio file out of multiple turns of dialogue. Each turn picks its own voice. Speechbase handles the heavy lifting: dispatching each turn (potentially to a different provider), inserting silence between turns, and normalising volume so one speaker doesn't drown out the next.

This is the right primitive for podcasts, narrated dialogues, customer-service demos, training videos, and any other multi-speaker content.

The shape of a request

{
  "model": "openai/gpt-4o-mini-tts",
  "turns": [
    { "voice": "alloy",   "text": "How was your weekend?" },
    { "voice": "shimmer", "text": "Good — I finally caught up on sleep." },
    { "voice": "alloy",   "text": "Lucky you." }
  ],
  "gapMs": 500,
  "volumeDbfs": -16,
  "output": "mp3"
}

POST /v1/audio/conversation returns the stitched audio bytes.
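
As a rough illustration, the request above could be assembled from Python as follows. This is a minimal sketch, not an official client: the base URL, the bearer-token header, and the helper name build_conversation_request are assumptions made for the example. Only the path and the body fields come from this page.

```python
import json
import urllib.request

# Hypothetical values: the base URL and the bearer-token header are assumptions.
BASE_URL = "https://api.speechbase.example"


def build_conversation_request(body: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/audio/conversation."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/conversation",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


body = {
    "model": "openai/gpt-4o-mini-tts",
    "turns": [
        {"voice": "alloy", "text": "How was your weekend?"},
        {"voice": "shimmer", "text": "Good, I finally caught up on sleep."},
        {"voice": "alloy", "text": "Lucky you."},
    ],
    "gapMs": 500,
    "volumeDbfs": -16,
    "output": "mp3",
}
request = build_conversation_request(body, api_key="YOUR_API_KEY")

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request) as response:
#     with open("conversation.mp3", "wb") as out:
#         out.write(response.read())
```

Because the endpoint returns the stitched audio bytes directly, the response body can be written to disk as-is.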

One model or per-turn models — pick one

Each turn needs to know which provider, model, and voice to use. There are two mutually exclusive ways to specify that:

  1. Shared model. Set model at the top level and let every turn inherit it. Each turn just picks a voice.
  2. Per-turn model. Omit the top-level model and set model and voice (or voiceId) on each turn individually.

Mixing the two — top-level model plus per-turn model on a flat (inline) turn — is rejected. The schema enforces "exactly one source of truth per turn."
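
The rule can be checked on the client before sending. The sketch below is an illustration of the rule as described here, not the server's actual schema. The function name check_model_source is hypothetical, and turns that use voiceId are skipped because this page does not spell out how they interact with a top-level model.

```python
def check_model_source(body: dict) -> None:
    """Raise ValueError if an inline turn gets its model from two places or none."""
    has_shared_model = "model" in body
    for index, turn in enumerate(body["turns"]):
        if "voiceId" in turn:
            # Saved-voice turns are not covered by this sketch (assumption).
            continue
        has_turn_model = "model" in turn
        if has_shared_model and has_turn_model:
            raise ValueError(f"turn {index}: model set at top level and on the turn")
        if not has_shared_model and not has_turn_model:
            raise ValueError(f"turn {index}: no model at top level or on the turn")


# Shared model: every turn inherits it.
check_model_source({
    "model": "openai/gpt-4o-mini-tts",
    "turns": [{"voice": "alloy", "text": "Hello."}],
})

# Per-turn model: no top-level model.
check_model_source({
    "turns": [{"model": "openai/gpt-4o-mini-tts", "voice": "alloy", "text": "Hello."}],
})
```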

Voice references

Each turn can reference its voice in either of two ways, the same as a single-shot speech request:

  • Inline: { "model": "openai/gpt-4o-mini-tts", "voice": "alloy" } (or just "voice" if model is set top-level).
  • Saved voice: { "voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2" }, where the value is the UUID of a saved voice. Speechbase tells the two styles apart by which fields the turn carries.

You can mix the styles across turns — turn 0 can use a saved voice, turn 1 can use an inline voice. Speechbase resolves them independently.
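
To make the two shapes concrete, the sketch below classifies each turn by the fields it carries. The helper voice_style is hypothetical, and treating a turn that carries both voiceId and voice as an error is an assumption of this example rather than documented behaviour.

```python
def voice_style(turn: dict) -> str:
    """Return "saved" or "inline" depending on which voice field the turn has."""
    has_saved = "voiceId" in turn
    has_inline = "voice" in turn
    if has_saved and has_inline:
        # Assumption: this sketch treats the ambiguous case as an error.
        raise ValueError("turn has both voiceId and voice")
    if has_saved:
        return "saved"
    if has_inline:
        return "inline"
    raise ValueError("turn has neither voiceId nor voice")


# Turn 0 uses a saved voice, turn 1 an inline voice.
turns = [
    {"voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2", "text": "Welcome back."},
    {"model": "openai/gpt-4o-mini-tts", "voice": "alloy", "text": "Thanks for having me."},
]
styles = [voice_style(turn) for turn in turns]
```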

Gap and volume controls

Field        Default   What it does
gapMs        0         Milliseconds of silence inserted between consecutive turns.
volumeDbfs   -16       Target peak loudness in dBFS. Each turn is normalised to it. Pass your own value to override the default.
output       wav       wav, mp3, or pcm. mp3 accepts { format: "mp3", bitrate: 96 } (kbps; defaults to 96).

Volume normalisation matters in practice: different providers ship audio at different levels, so without it a Cartesia turn followed by an ElevenLabs turn would jump perceptibly. Speechbase normalises each turn to a peak of -16 dBFS by default; set your own volumeDbfs to target a different peak, for example -23 for more headroom. This is a peak-level target in dBFS, not an integrated-loudness (LUFS) measurement.
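
The arithmetic behind a peak target is short: the gain in dB is the target minus the current peak, and the linear factor is 10^(gain / 20). The sketch below applies it to lists of float samples in the range -1.0 to 1.0 and inserts a gap of silence between two turns. It illustrates the general technique only and is not Speechbase's implementation. The function names and the 24000 Hz sample rate are assumptions.

```python
import math


def peak_dbfs(samples: list[float]) -> float:
    """Peak level in dBFS for float samples in [-1.0, 1.0]."""
    peak = max((abs(s) for s in samples), default=0.0)
    return 20.0 * math.log10(peak) if peak > 0 else float("-inf")


def normalise_peak(samples: list[float], target_dbfs: float = -16.0) -> list[float]:
    """Scale samples so that their peak sits at target_dbfs."""
    current = peak_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # pure silence: nothing to scale
    gain = 10.0 ** ((target_dbfs - current) / 20.0)
    return [s * gain for s in samples]


def silence(gap_ms: int, sample_rate: int) -> list[float]:
    """gap_ms milliseconds of silence at the given sample rate."""
    return [0.0] * round(sample_rate * gap_ms / 1000)


SAMPLE_RATE = 24000  # assumed for the example

loud_turn = [0.9, -0.45, 0.3]     # made-up samples from a "hot" provider
quiet_turn = [0.05, -0.1, 0.02]   # made-up samples from a quieter provider

stitched = (
    normalise_peak(loud_turn)
    + silence(500, SAMPLE_RATE)
    + normalise_peak(quiet_turn)
)
```

After normalisation both turns peak at the same level, so the jump between them disappears even though the inputs differed by roughly 19 dB.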

Cross-provider routing

Because each turn carries its own provider/model/voice, you can mix providers in a single conversation:

{
  "turns": [
    { "model": "openai/gpt-4o-mini-tts", "voice": "alloy",
      "text": "Some narration goes here." },
    { "model": "elevenlabs/eleven_v3", "voice": "EXAV...",
      "text": "And then a different speaker chimes in." }
  ]
}

model is always the single provider/model-id string (for example openai/gpt-4o-mini-tts); there is no separate provider field.

Speechbase validates up front that both providers have BYOK keys and are enabled. If either is missing the whole request fails — you never end up with a half-stitched conversation.
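
A client can run a similar all-or-nothing check before sending. The sketch below takes the provider to be the part of the model string before the first slash and compares the result against a set of providers the caller believes are configured. The function names and the configured set are hypothetical, and saved-voice turns are ignored because their provider is not visible in the request body.

```python
def providers_in(body: dict) -> set[str]:
    """Collect provider prefixes from the top-level and per-turn model strings."""
    models = []
    if "model" in body:
        models.append(body["model"])
    models.extend(turn["model"] for turn in body["turns"] if "model" in turn)
    return {model.split("/", 1)[0] for model in models}


def missing_providers(body: dict, configured: set[str]) -> set[str]:
    """Providers the request needs that are not in the configured set."""
    return providers_in(body) - configured


body = {
    "turns": [
        {"model": "openai/gpt-4o-mini-tts", "voice": "alloy",
         "text": "Some narration goes here."},
        {"model": "elevenlabs/eleven_v3", "voice": "EXAV...",
         "text": "And then a different speaker chimes in."},
    ]
}

# Hypothetical account state: only an OpenAI key has been added.
missing = missing_providers(body, configured={"openai"})
if missing:
    print(f"Add keys for: {sorted(missing)}")
```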

Per-turn blocks with split

Field   Default   What it does
split   false     Return each turn as its own audio block instead of one stitched file. The response becomes a JSON envelope with an audioSegments array, one entry per turn, in order.

Word-level timestamps

POST /v1/audio/conversation/with-timestamps returns a JSON envelope with audio (base64) plus a flat timestamps array. Each word entry includes turnIndex, mapping it back to its originating turn.

This endpoint accepts an optional timestamps field on the request body: "on" (the default) returns the timestamps array, "off" skips timestamp generation. The field is only valid on the /with-timestamps endpoint; the plain /v1/audio/conversation endpoint rejects it. Under split, each entry in audioSegments carries its own block-relative timestamps array (seconds from the start of that segment). See Word-level timestamps for the full schema.
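
Because the timestamps array is flat, a common first step is to regroup it by turn. The sketch below does that for a non-split response. The envelope here is made-up sample data. Only audio, timestamps, and turnIndex are taken from this page, while the word, start, and end field names are placeholders standing in for whatever the full schema defines.

```python
import base64
from collections import defaultdict


def words_by_turn(envelope: dict) -> dict[int, list[dict]]:
    """Group the flat timestamps array by each entry's turnIndex."""
    grouped = defaultdict(list)
    for entry in envelope["timestamps"]:
        grouped[entry["turnIndex"]].append(entry)
    return dict(grouped)


# Made-up response; "word", "start", and "end" are placeholder field names.
envelope = {
    "audio": base64.b64encode(b"not-real-audio").decode("ascii"),
    "timestamps": [
        {"word": "How", "start": 0.00, "end": 0.18, "turnIndex": 0},
        {"word": "was", "start": 0.18, "end": 0.35, "turnIndex": 0},
        {"word": "Good", "start": 1.60, "end": 1.92, "turnIndex": 1},
    ],
}

audio_bytes = base64.b64decode(envelope["audio"])  # the audio field is base64
grouped = words_by_turn(envelope)
```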

Per-turn segments with split

By default a conversation returns one stitched file. Set split: true to get a JSON envelope with an audioSegments array instead, one entry per turn in order. Each segment is { audio, mediaType, durationMs, turnIndex }, plus a block-relative timestamps array (timed from the start of that segment, no turnIndex) on the with-timestamps endpoint. The audio is identical to what the turn contributes inside the stitched clip, just handed back in pieces.
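
A split response can be unpacked along these lines. This is a sketch under two assumptions: that each segment's audio is base64-encoded, as the audio field is on the with-timestamps envelope, and that adding the gaps to the segment durations approximates the length of the stitched clip. The envelope and its mediaType value are made-up sample data.

```python
import base64


def decode_segments(envelope: dict) -> list[tuple[int, str, bytes, int]]:
    """Return (turnIndex, mediaType, audio bytes, durationMs) per segment, in turn order."""
    ordered = sorted(envelope["audioSegments"], key=lambda seg: seg["turnIndex"])
    return [
        (seg["turnIndex"], seg["mediaType"], base64.b64decode(seg["audio"]), seg["durationMs"])
        for seg in ordered
    ]


def estimated_total_ms(envelope: dict, gap_ms: int = 0) -> int:
    """Segment durations plus one gap between each pair of consecutive turns."""
    segments = envelope["audioSegments"]
    gaps = max(len(segments) - 1, 0) * gap_ms
    return sum(seg["durationMs"] for seg in segments) + gaps


# Made-up response for illustration.
envelope = {
    "audioSegments": [
        {"audio": base64.b64encode(b"turn-0").decode("ascii"),
         "mediaType": "audio/mpeg", "durationMs": 1200, "turnIndex": 0},
        {"audio": base64.b64encode(b"turn-1").decode("ascii"),
         "mediaType": "audio/mpeg", "durationMs": 2100, "turnIndex": 1},
    ]
}

segments = decode_segments(envelope)
total_ms = estimated_total_ms(envelope, gap_ms=500)
```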

Behaviour and limits

  • Audio is buffered, not streamed. A conversation returns only after every turn has been synthesised.
  • Each turn passes through moderation on its own text. A single bad turn fails the whole conversation.
  • The whole request shares one moderation evaluation, one provider-keys check, and one log entry; per-turn telemetry (latency, character count) is attached as child events for analytics.
