Speechbase

Word-level timestamps

How Speechbase produces word-level alignment for synthesised audio: native provider support when available, with an STT fallback when not.

Captions, lip-sync, and karaoke-style highlight effects all need to know when each word starts and ends in the rendered audio. Speechbase's with-timestamps endpoints give you that.

Two endpoints, one feature

Every synthesis endpoint has a with-timestamps companion:

Without timestamps             With timestamps
POST /v1/audio/speech          POST /v1/audio/speech/with-timestamps
POST /v1/audio/conversation    POST /v1/audio/conversation/with-timestamps

The plain endpoints stream raw audio bytes. The timestamps variants always return a JSON envelope:

{
  "audio": "<base64>",
  "mediaType": "audio/mpeg",
  "warnings": [],
  "timestamps": [
    { "text": "Hello", "start": 0.04, "end": 0.41 },
    { "text": "from",  "start": 0.43, "end": 0.62 },
    { "text": "Speechbase.", "start": 0.65, "end": 1.10 }
  ]
}

start and end are seconds from the beginning of the returned audio.
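Because start and end are plain second offsets, the timestamps array maps directly onto caption formats. A minimal Python sketch (the helper names are ours, not part of any Speechbase SDK) that turns the envelope above into a WebVTT document with one cue per word:

```python
def fmt(seconds: float) -> str:
    """Format a second offset as a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def timestamps_to_vtt(words: list[dict]) -> str:
    """Build a WebVTT document with one cue per word timestamp."""
    lines = ["WEBVTT", ""]
    for w in words:
        lines.append(f"{fmt(w['start'])} --> {fmt(w['end'])}")
        lines.append(w["text"])
        lines.append("")
    return "\n".join(lines)

# The timestamps array from the response envelope above.
words = [
    {"text": "Hello", "start": 0.04, "end": 0.41},
    {"text": "from", "start": 0.43, "end": 0.62},
    {"text": "Speechbase.", "start": 0.65, "end": 1.10},
]
vtt = timestamps_to_vtt(words)
```

Karaoke-style highlighting works the same way: at playback time t, the active word is the entry with start <= t < end.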

Native vs STT fallback

Some providers expose word-level alignment natively (OpenAI, ElevenLabs, and others publish character or word offsets alongside the audio); some don't. Speechbase abstracts the difference:

  1. Native alignment first. If the chosen provider supports it for the chosen model, Speechbase uses the alignment data the provider returns.
  2. STT fallback. Otherwise, Speechbase runs the rendered audio through speech-to-text (currently OpenAI Whisper) and reconstructs word boundaries that way. This is fully gateway-side and doesn't consume your provider key.

You don't pick. The gateway uses native alignment when it can and falls back when it has to. The warnings array tells you which path was used if you care.

Choosing whether to align

The timestamps field on the request controls the behaviour:

  • "timestamps": "on" (default) — Speechbase attempts native, then STT. If both fail, the request returns 503 timestamps_unavailable. The audio is not returned.
  • "timestamps": "off" — Speechbase skips alignment entirely and returns timestamps: []. Useful when you call the with-timestamps endpoint just for the JSON envelope (e.g. to get base64 audio plus warnings) and don't need word offsets.
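If you want the audio even when alignment fails, one option is to catch the 503 and retry with timestamps off. A sketch of that retry logic, with the HTTP transport injected as a callable so it works with any client; the error-body field name ("error") and the request payload shape are assumptions, not documented Speechbase fields:

```python
def synthesize(post, payload: dict) -> dict:
    """Request word timestamps; on 503 timestamps_unavailable, retry with
    timestamps off so the audio is still returned (with timestamps: []).

    `post` is any callable(payload) -> (status_code, body_dict) wrapping
    your HTTP client and the with-timestamps endpoint.
    """
    status, body = post({**payload, "timestamps": "on"})
    if status == 503 and body.get("error") == "timestamps_unavailable":
        status, body = post({**payload, "timestamps": "off"})
    if status != 200:
        raise RuntimeError(f"synthesis failed with status {status}")
    return body
```

The trade-off: the retry costs a second full synthesis, so only do this when audio without captions is genuinely acceptable to your application.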

Conversations and turnIndex

Conversation timestamps include an extra field:

{ "text": "How", "start": 0.05, "end": 0.18, "turnIndex": 0 }

turnIndex is the zero-based index into the original turns array, so you can attribute each word back to its speaker. When Speechbase falls back to STT across the stitched mixed audio, the gateway uses gap markers to recover turnIndex with high accuracy, but treat it as best-effort if your turn gaps are very short.
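Attributing words back to speakers is then a simple group-by. A small Python sketch (helper name is ours) that reassembles each turn's text from the conversation timestamps:

```python
from collections import defaultdict

def words_by_turn(timestamps: list[dict]) -> dict[int, str]:
    """Group word timestamps by turnIndex and rebuild each turn's text."""
    turns: dict[int, list[str]] = defaultdict(list)
    for w in timestamps:
        turns[w["turnIndex"]].append(w["text"])
    return {i: " ".join(ws) for i, ws in turns.items()}

timestamps = [
    {"text": "How", "start": 0.05, "end": 0.18, "turnIndex": 0},
    {"text": "are", "start": 0.20, "end": 0.31, "turnIndex": 0},
    {"text": "you?", "start": 0.33, "end": 0.55, "turnIndex": 0},
    {"text": "Fine.", "start": 0.90, "end": 1.20, "turnIndex": 1},
]
per_turn = words_by_turn(timestamps)
```

Given the best-effort caveat above, it is worth sanity-checking the rebuilt text against your original turns when turn gaps are short.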

Streaming caveat

Streaming and timestamps don't combine. The plain /v1/audio/speech endpoint streams; /v1/audio/speech/with-timestamps always buffers because alignment needs the full audio. If you need both low latency and timestamps, run them in parallel: stream the audio for playback, then issue a second call for captions.
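The parallel pattern can be sketched with two worker threads; both arguments are plain callables wrapping your HTTP client (neither name comes from a Speechbase SDK), one hitting the streaming endpoint and one hitting the with-timestamps endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

def stream_and_align(stream_call, timestamps_call):
    """Issue both requests at once: start playback from the streaming
    response as soon as it arrives, and attach captions when the
    buffered with-timestamps call completes.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio_future = pool.submit(stream_call)    # /v1/audio/speech
        ts_future = pool.submit(timestamps_call)   # .../with-timestamps
        return audio_future.result(), ts_future.result()
```

Note this renders the speech twice, so you pay for two syntheses; whether the latency win justifies that is an application-level call.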
