# Word-level timestamps

How Speechbase produces word-level alignment for synthesised audio, using native provider support when available and falling back to STT when not.
Captions, lip-sync, and karaoke-style highlight effects all need to know when each word starts and ends in the rendered audio. Speechbase's `with-timestamps` endpoints give you that.
## Two endpoints, one feature

Every synthesis endpoint has a `with-timestamps` companion:
| Without timestamps | With timestamps |
|---|---|
| `POST /v1/audio/speech` | `POST /v1/audio/speech/with-timestamps` |
| `POST /v1/audio/conversation` | `POST /v1/audio/conversation/with-timestamps` |
The plain endpoints stream raw audio bytes. The `with-timestamps` variants always return a JSON envelope:
```json
{
  "audio": "<base64>",
  "mediaType": "audio/mpeg",
  "warnings": [],
  "timestamps": [
    { "text": "Hello", "start": 0.04, "end": 0.41 },
    { "text": "from", "start": 0.43, "end": 0.62 },
    { "text": "Speechbase.", "start": 0.65, "end": 1.10 }
  ]
}
```

`start` and `end` are seconds from the beginning of the returned audio.
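A minimal sketch of consuming that envelope from Node with `fetch`. The base URL, auth header, and `input`/`voice` request fields here are assumptions, not the documented schema; only the response envelope shape comes from above:

```ts
// Sketch: synthesise speech and print word timings.
// Assumptions: hypothetical base URL, bearer auth, and `input`/`voice`
// request fields; adapt them to your actual Speechbase setup.
import { writeFile } from "node:fs/promises";

const res = await fetch(
  "https://api.speechbase.example/v1/audio/speech/with-timestamps",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
    },
    body: JSON.stringify({ input: "Hello from Speechbase.", voice: "alloy" }),
  },
);
if (!res.ok) throw new Error(`synthesis failed: ${res.status}`);

const { audio, timestamps } = await res.json();

// `audio` is base64-encoded; decode it before playback or storage.
await writeFile("speech.mp3", Buffer.from(audio, "base64"));

for (const w of timestamps) {
  console.log(`${w.start.toFixed(2)}s-${w.end.toFixed(2)}s  ${w.text}`);
}
```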
## Native vs STT fallback
Some providers expose word-level alignment natively (OpenAI, ElevenLabs, and others publish character or word offsets alongside the audio); some don't. Speechbase abstracts the difference:
- Native alignment first. If the chosen provider supports it for the chosen model, Speechbase uses the alignment data the provider returns.
- STT fallback. Otherwise, Speechbase runs the rendered audio through speech-to-text (currently OpenAI Whisper) and reconstructs word boundaries from the transcription. This runs entirely gateway-side and doesn't use your provider key.
You don't pick. The gateway uses native alignment when it can and falls back when it has to. The `warnings` array tells you which path was used, if you care.
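For example, a hedged sketch of checking for the fallback. The warning shape and code string below are assumptions; inspect a real response to see what the gateway actually emits:

```ts
// Sketch: detect which alignment path was used.
// Assumption: warnings carry a machine-readable `code` such as
// "timestamps-stt-fallback"; the real shape may differ.
type SpeechWarning = { code: string; message?: string };

function usedSttFallback(warnings: SpeechWarning[]): boolean {
  return warnings.some((w) => w.code.includes("stt"));
}
```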
## Choosing whether to align

The `timestamps` field on the request controls the behaviour:

- `"timestamps": "on"` (default): Speechbase attempts native alignment, then STT. If both fail, the request returns `503 timestamps_unavailable`. The audio is not returned.
- `"timestamps": "off"`: Speechbase skips alignment entirely and returns `timestamps: []`. Useful when you call the `with-timestamps` endpoint just for the JSON envelope (e.g. to get base64 audio plus warnings) and don't need word offsets.
## Conversations and `turnIndex`

Conversation timestamps include an extra field:

```json
{ "text": "How", "start": 0.05, "end": 0.18, "turnIndex": 0 }
```

`turnIndex` is the zero-based index into the original `turns` array, so you can attribute each word back to its speaker. When Speechbase falls back to STT across the stitched mixed audio, the gateway uses gap markers to recover `turnIndex` with high accuracy, but treat it as best-effort if your turn gaps are very short.
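A sketch of grouping words back to their speakers, assuming you still have the `turns` array you sent; the `speaker` field on a turn is an illustrative name, not a documented one:

```ts
// Sketch: bucket word timestamps by the speaker of their turn.
type Word = { text: string; start: number; end: number; turnIndex: number };
type Turn = { speaker: string; text: string }; // illustrative turn shape

function wordsBySpeaker(words: Word[], turns: Turn[]): Map<string, Word[]> {
  const bySpeaker = new Map<string, Word[]>();
  for (const w of words) {
    // Fall back to a synthetic label if turnIndex is out of range.
    const speaker = turns[w.turnIndex]?.speaker ?? `turn-${w.turnIndex}`;
    const list = bySpeaker.get(speaker) ?? [];
    list.push(w);
    bySpeaker.set(speaker, list);
  }
  return bySpeaker;
}
```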
## Streaming caveat

Streaming and timestamps don't combine. The plain `/v1/audio/speech` endpoint streams; `/v1/audio/speech/with-timestamps` always buffers, because alignment needs the full audio. If you need both low latency and timestamps, run them in parallel: stream the audio for playback and issue a second call for captions, as sketched below.
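A sketch of that parallel pattern, reusing the same assumed base URL, auth header, and request fields as the earlier example:

```ts
// Sketch: stream audio for playback while a parallel request
// fetches word timestamps for captions.
const BASE = "https://api.speechbase.example"; // hypothetical base URL
const headers = {
  "Content-Type": "application/json",
  Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
};
const body = JSON.stringify({ input: "Hello from Speechbase.", voice: "alloy" });

// Fire both requests at once; don't await them sequentially.
const streamP = fetch(`${BASE}/v1/audio/speech`, { method: "POST", headers, body });
const alignedP = fetch(`${BASE}/v1/audio/speech/with-timestamps`, { method: "POST", headers, body });

// Start playback as soon as bytes arrive from the streaming endpoint...
const audioStream = (await streamP).body; // ReadableStream of raw audio bytes

// ...then attach captions when the buffered, aligned response lands.
const { timestamps } = await (await alignedP).json();
```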