Speechbase

Word-level timestamps

How Speechbase produces word-level timestamps for synthesized audio: it uses native provider support when available and a gateway-side timestamp fallback when not.

Captions, lip-sync, and karaoke-style highlight effects all need to know when each word starts and ends in the rendered audio. Speechbase's with-timestamps endpoints give you that.

Two endpoints, one feature

Every synthesis endpoint has a with-timestamps companion:

Without timestamps             With timestamps
POST /v1/audio/speech          POST /v1/audio/speech/with-timestamps
POST /v1/audio/conversation    POST /v1/audio/conversation/with-timestamps

The plain endpoints stream raw audio bytes. The with-timestamps variants always return a JSON envelope:

{
  "audio": "<base64>",
  "mediaType": "audio/mpeg",
  "warnings": [],
  "timestamps": [
    { "text": "Hello", "start": 0.04, "end": 0.41 },
    { "text": "from",  "start": 0.43, "end": 0.62 },
    { "text": "Speechbase.", "start": 0.65, "end": 1.10 }
  ]
}

start and end are seconds from the beginning of the returned audio.
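
As a rough illustration of consuming this envelope, the Python sketch below decodes the base64 audio and turns the timestamps array into one WebVTT cue per word. The helper names (decode_envelope, vtt_time, to_webvtt) are made up for this example and are not part of any Speechbase SDK; only the envelope fields shown above are assumed.

```python
import base64
import json


def decode_envelope(raw_json: str) -> tuple[bytes, list[dict]]:
    """Split a with-timestamps envelope into audio bytes and word timings."""
    envelope = json.loads(raw_json)
    audio = base64.b64decode(envelope["audio"])
    return audio, envelope["timestamps"]


def vtt_time(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(timestamps: list[dict]) -> str:
    """Emit one WebVTT cue per word, using start/end as seconds from audio start."""
    lines = ["WEBVTT", ""]
    for word in timestamps:
        lines.append(f"{vtt_time(word['start'])} --> {vtt_time(word['end'])}")
        lines.append(word["text"])
        lines.append("")
    return "\n".join(lines)
```

Per-word cues suit karaoke-style highlighting; for ordinary captions you would likely merge several words into one cue before writing the file.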

Native vs timestamp fallback

Some providers expose word-level timestamps natively (OpenAI, ElevenLabs, and others publish character or word offsets alongside the audio); some don't. Speechbase abstracts the difference:

  1. Native timestamps first. If the chosen provider supports native timestamps for the chosen model, Speechbase uses the timing data the provider returns.
  2. Timestamp fallback. Otherwise, Speechbase compares the rendered audio with the exact source text and reconstructs word boundaries from that timing pass. This is fully gateway-side and doesn't consume your provider key.

You don't pick. The gateway uses native timestamps when it can and falls back when it has to. The warnings array tells you which path was used if you care.

Choosing whether to return timestamps

The timestamps field on the request controls the behaviour:

  • "timestamps": "on" (default) — Speechbase attempts native timestamps, then timestamp fallback. If both fail, the request returns 503 timestamps_unavailable. The audio is not returned.
  • "timestamps": "off" — Speechbase skips timestamp generation entirely and returns timestamps: []. Useful when you call the with-timestamps endpoint just for the JSON envelope (e.g. to get base64 audio plus warnings) and don't need word offsets.
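
One way a client might use the two modes together is sketched below: request timestamps first and, on a 503, retry with "timestamps": "off" so that audio is still obtained. This is an illustration under assumptions, not a documented Speechbase pattern. The send callable is hypothetical (it stands in for whatever HTTP client you use and returns a status code plus the response body), the other request fields are passed through untouched because they are not described here, and the sketch treats any 503 as timestamps_unavailable since the error body's shape is not specified.

```python
import json
from typing import Callable

# Hypothetical transport: takes the encoded request body,
# returns (HTTP status code, raw response body).
Send = Callable[[bytes], tuple[int, bytes]]


def build_body(fields: dict, want_timestamps: bool) -> bytes:
    """Copy the caller's request fields and set the timestamps mode."""
    body = dict(fields)
    body["timestamps"] = "on" if want_timestamps else "off"
    return json.dumps(body).encode("utf-8")


def synthesize_with_fallback(send: Send, fields: dict) -> dict:
    """Ask for timestamps; if they are unavailable, settle for audio only."""
    status, payload = send(build_body(fields, want_timestamps=True))
    if status == 503:
        # Assumption: any 503 means timestamps_unavailable. No audio came back,
        # so repeat the request with timestamp generation switched off.
        status, payload = send(build_body(fields, want_timestamps=False))
    if status != 200:
        raise RuntimeError(f"synthesis failed with HTTP {status}")
    return json.loads(payload)
```

When the retry path is taken, the returned envelope has an empty timestamps array, so callers should be prepared to render audio without captions.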

Conversations and turnIndex

Conversation timestamps include an extra field:

{ "text": "How", "start": 0.05, "end": 0.18, "turnIndex": 0 }

turnIndex is the zero-based index into the original turns array, so you can attribute each word back to its speaker. When Speechbase falls back to gateway-side timestamps for stitched, mixed conversation audio, it aligns that audio against the exact text of each turn and assigns turnIndex in the original turn order.
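
To show how turnIndex can be used, here is a small Python sketch that groups words into one caption line per turn. The function name caption_lines and the speakers list are assumptions for illustration: the shape of a turn object is not described here, so the caller supplies a list of speaker labels in the same order as the original turns array. It also assumes the words arrive in time order.

```python
from collections import defaultdict


def caption_lines(timestamps: list[dict], speakers: list[str]) -> list[dict]:
    """Group word timings by turnIndex and build one caption line per turn."""
    by_turn: dict[int, list[dict]] = defaultdict(list)
    for word in timestamps:
        by_turn[word["turnIndex"]].append(word)

    lines = []
    for turn_index in sorted(by_turn):
        words = by_turn[turn_index]
        lines.append({
            "turnIndex": turn_index,
            # speakers[i] is the caller's label for turns[i] (hypothetical input).
            "speaker": speakers[turn_index],
            "text": " ".join(w["text"] for w in words),
            "start": words[0]["start"],   # first word of the turn
            "end": words[-1]["end"],      # last word of the turn
        })
    return lines
```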

Streaming caveat

Streaming and timestamps don't combine. The plain /v1/audio/speech endpoint streams; /v1/audio/speech/with-timestamps always buffers because timestamps need the full audio. If you need both low latency and timestamps, run the two calls in parallel: stream the audio for playback while a second call to the with-timestamps endpoint fetches the captions.
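
The parallel pattern could look roughly like the Python sketch below. Both callables are hypothetical placeholders for your own client code: stream_audio is assumed to return an iterator of audio chunks from the plain endpoint, and fetch_timestamps is assumed to call the with-timestamps endpoint and return its timestamps array.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def play_with_captions(
    stream_audio: Callable[[], Iterable[bytes]],
    fetch_timestamps: Callable[[], list[dict]],
    on_chunk: Callable[[bytes], None],
) -> list[dict]:
    """Stream audio for playback while the buffered timestamps call runs."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the buffered with-timestamps request in the background.
        captions = pool.submit(fetch_timestamps)
        # Meanwhile, hand each streamed chunk to the player as it arrives.
        for chunk in stream_audio():
            on_chunk(chunk)
        # Blocks only if the timestamps call is still running after playback.
        return captions.result()
```

Because the with-timestamps call buffers the full audio, the word timings become available only once that request completes, so the first moments of playback may run before any caption data is ready.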
