Word-level timestamps
How Speechbase produces word-level timestamps for synthesized audio — using native provider support when available, and gateway timestamp fallback when not.
Captions, lip-sync, and karaoke-style highlight effects all need to know when
each word starts and ends in the rendered audio. Speechbase's with-timestamps
endpoints give you that.
Two endpoints, one feature
Every synthesis endpoint has a with-timestamps companion:
| Without timestamps | With timestamps |
|---|---|
| POST /v1/audio/speech | POST /v1/audio/speech/with-timestamps |
| POST /v1/audio/conversation | POST /v1/audio/conversation/with-timestamps |
The plain endpoints stream raw audio bytes. The timestamps variants always return a JSON envelope:
{
  "audio": "<base64>",
  "mediaType": "audio/mpeg",
  "warnings": [],
  "timestamps": [
    { "text": "Hello", "start": 0.04, "end": 0.41 },
    { "text": "from", "start": 0.43, "end": 0.62 },
    { "text": "Speechbase.", "start": 0.65, "end": 1.10 }
  ]
}

start and end are seconds from the beginning of the returned audio.
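As a rough illustration of consuming this envelope, the sketch below decodes the base64 audio and turns the word timings into WebVTT cues, one cue per word. It relies only on the audio and timestamps fields shown above; the function names are illustrative and not part of any Speechbase SDK.

```python
import base64
import json


def decode_envelope(raw: str) -> tuple[bytes, list[dict]]:
    """Split a with-timestamps JSON envelope into audio bytes and word timings."""
    envelope = json.loads(raw)
    audio = base64.b64decode(envelope["audio"])
    return audio, envelope["timestamps"]


def to_vtt_time(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def words_to_vtt(timestamps: list[dict]) -> str:
    """Emit one WebVTT cue per word, e.g. for karaoke-style highlighting."""
    lines = ["WEBVTT", ""]
    for word in timestamps:
        lines.append(f"{to_vtt_time(word['start'])} --> {to_vtt_time(word['end'])}")
        lines.append(word["text"])
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)
```

For real captions you would probably merge several words into one cue; per-word cues are used here only because they map directly onto the timestamps array.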
Native vs timestamp fallback
Some providers expose word-level timestamps natively (OpenAI, ElevenLabs, and others publish character or word offsets alongside the audio); some don't. Speechbase abstracts the difference:
- Native timestamps first. If the chosen provider supports word timing for the chosen model, Speechbase uses the timing data the provider returns.
- Timestamp fallback. Otherwise, Speechbase aligns the rendered audio against the exact source text and reconstructs word boundaries from that alignment. This runs entirely on the gateway and doesn't use your provider key.
You don't pick. The gateway uses native timestamps when it can and falls back
when it has to. The warnings array tells you which path was used if you
care.
Choosing whether to return timestamps
The timestamps field on the request controls the behaviour:
- "timestamps": "on" (default). Speechbase attempts native timestamps, then timestamp fallback. If both fail, the request returns 503 timestamps_unavailable. The audio is not returned.
- "timestamps": "off". Speechbase skips timestamp generation entirely and returns timestamps: []. Useful when you call the with-timestamps endpoint just for the JSON envelope (e.g. to get base64 audio plus warnings) and don't need word offsets.
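One way a client might use the two modes is to request timestamps first and retry with "off" if the gateway answers 503 timestamps_unavailable, so that audio is still delivered without captions. The sketch below is a minimal version of that pattern under several assumptions: the "input" request field is hypothetical, the shape of the error body is not documented here (the code only looks for the timestamps_unavailable string), and the HTTP transport is injected as a send callable rather than tied to a particular client library.

```python
import json
from typing import Callable

# send(body) performs the POST and returns (status_code, response_text).
Send = Callable[[dict], tuple[int, str]]


class TimestampsUnavailable(Exception):
    """Both native timestamps and the gateway fallback failed."""


def build_body(text: str, want_timestamps: bool = True) -> dict:
    # "input" is a hypothetical field name for the text to synthesize;
    # only "timestamps" and its "on"/"off" values come from this page.
    return {"input": text, "timestamps": "on" if want_timestamps else "off"}


def parse_response(status: int, body: str) -> dict:
    """Return the JSON envelope, or raise if the request did not succeed."""
    # Assumption: the error code appears somewhere in the 503 response body.
    if status == 503 and "timestamps_unavailable" in body:
        raise TimestampsUnavailable("gateway returned no audio")
    if status != 200:
        raise RuntimeError(f"unexpected status {status}")
    return json.loads(body)


def synthesize(text: str, send: Send) -> dict:
    """Try with timestamps; on failure, retry once with timestamps off."""
    try:
        return parse_response(*send(build_body(text, True)))
    except TimestampsUnavailable:
        return parse_response(*send(build_body(text, False)))
```

Whether retrying without timestamps is acceptable depends on the product: a caption-dependent feature may prefer to surface the error instead.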
Conversations and turnIndex
Conversation timestamps include an extra field:
{ "text": "How", "start": 0.05, "end": 0.18, "turnIndex": 0 }

turnIndex is the zero-based index into the original turns array, so you
can attribute each word back to its speaker. When Speechbase uses timestamp
fallback across stitched mixed audio, the gateway aligns the mixed audio against
the exact turn text and assigns turnIndex sequentially from the original turns.
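To show how turnIndex can be used for speaker attribution, here is a small sketch that groups words by turn and derives one span per turn. The speakers list is a hypothetical caller-side list of labels, parallel to the turns array sent in the request; the code also assumes the words of each turn arrive in chronological order.

```python
from collections import defaultdict


def words_by_turn(timestamps: list[dict]) -> dict[int, list[dict]]:
    """Group word entries by their turnIndex."""
    grouped: dict[int, list[dict]] = defaultdict(list)
    for word in timestamps:
        grouped[word["turnIndex"]].append(word)
    return dict(grouped)


def turn_spans(timestamps: list[dict], speakers: list[str]) -> list[dict]:
    """Build one caption span per turn: speaker label, start, end, and text."""
    spans = []
    for turn_index, words in sorted(words_by_turn(timestamps).items()):
        spans.append({
            "speaker": speakers[turn_index],  # hypothetical label list
            "start": words[0]["start"],       # first word of the turn
            "end": words[-1]["end"],          # last word of the turn
            "text": " ".join(w["text"] for w in words),
        })
    return spans
```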
Streaming caveat
Streaming and timestamps don't combine. The plain /v1/audio/speech endpoint
streams; /v1/audio/speech/with-timestamps always buffers because timestamps
need the full audio. If you need both low latency and timestamps, run them in
parallel: stream the audio for playback while a second call to the
with-timestamps endpoint fetches the captions.
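The parallel pattern could be sketched as follows, using a background thread for the buffered captions call while the current thread consumes the audio stream. The three callables are placeholders for your own HTTP code and audio player, not Speechbase APIs. This page does not say whether two separate calls render identical audio, so treat exact alignment between the streamed audio and the returned timestamps as an assumption to verify.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def play_with_captions(
    stream_audio: Callable[[], Iterable[bytes]],
    fetch_captions: Callable[[], list[dict]],
    on_chunk: Callable[[bytes], None],
) -> list[dict]:
    """Stream audio for playback while the captions request runs concurrently."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the buffered with-timestamps call in the background.
        captions_future = pool.submit(fetch_captions)
        # Meanwhile, hand audio chunks to the player as they arrive.
        for chunk in stream_audio():
            on_chunk(chunk)
        # Blocks only if the captions call is still in flight.
        return captions_future.result()
```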

