Word-level timestamps
Get word-by-word timing for synthesised audio — how to call the with-timestamps endpoints and what to do with the results.
This guide is the practical companion to Word-level timestamps (concept). If you want the what and why, read that first; this page is the how.
Calling the endpoint
import { generateSpeech } from "@speech-sdk/core";
const result = await generateSpeech({
apiKey: process.env.SPEECHBASE_API_KEY,
model: "openai/gpt-4o-mini-tts",
voice: "alloy",
text: "Hello from Speechbase.",
timestamps: true,
});
result.audio.uint8Array; // raw bytes
result.audio.mediaType; // "audio/mpeg"
result.timestamps;
// [
// { text: "Hello", start: 0.04, end: 0.41 },
// { text: "from", start: 0.43, end: 0.62 },
// { text: "Speechbase.", start: 0.65, end: 1.10 },
// ]
The same request with cURL:
curl -X POST https://api.speechbase.ai/v1/audio/speech/with-timestamps \
-H "Authorization: Bearer $SPEECHBASE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"mode": "voice",
"voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
"text": "Hello from Speechbase.",
"output": "mp3"
}' | tee response.json
The cURL form returns a JSON envelope:
{
"audio": "<base64 mp3>",
"mediaType": "audio/mpeg",
"warnings": [],
"timestamps": [
{ "text": "Hello", "start": 0.04, "end": 0.41 },
{ "text": "from", "start": 0.43, "end": 0.62 },
{ "text": "Speechbase.", "start": 0.65, "end": 1.10 }
]
}
Decoding the audio
The SDK exposes ready-to-use bytes; with cURL you have to decode the base64 yourself.
import { writeFile } from "node:fs/promises";
await writeFile("hello.mp3", result.audio.uint8Array);import { writeFile } from "node:fs/promises";
const res = await fetch(
"https://api.speechbase.ai/v1/audio/speech/with-timestamps",
{
method: "POST",
headers: {
Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
mode: "voice",
voiceId: "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
text: "Hello from Speechbase.",
output: "mp3",
}),
}
);
const data = await res.json();
await writeFile("hello.mp3", Buffer.from(data.audio, "base64"));Or, in the browser:
const blob = new Blob([result.audio.uint8Array], { type: result.audio.mediaType });
const url = URL.createObjectURL(blob);
audioElement.src = url;
Painting captions
The timestamps array is in seconds, monotonically increasing. To highlight
the current word during playback:
audioElement.addEventListener("timeupdate", () => {
const t = audioElement.currentTime;
const current = timestamps.find(w => t >= w.start && t < w.end);
if (current) renderHighlightedWord(current.text);
});
For long files, sort the array once and binary-search instead of scanning.
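A minimal sketch of that lookup, assuming the word objects have the text/start/end shape shown above:
// Binary search over a timestamps array that is already sorted by start time
// (as the endpoint returns it). Returns undefined when t falls between words.
type Word = { text: string; start: number; end: number };

function wordAt(timestamps: Word[], t: number): Word | undefined {
  let lo = 0;
  let hi = timestamps.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const w = timestamps[mid];
    if (t < w.start) {
      hi = mid - 1;
    } else if (t >= w.end) {
      lo = mid + 1;
    } else {
      return w; // start <= t < end
    }
  }
  return undefined;
}
Wire it into the same timeupdate handler in place of the linear find.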
Conversations: the turnIndex field
Conversation timestamps include turnIndex, the zero-based index into the
original turns array:
const groupedByTurn = new Map<number, Word[]>();
for (const word of timestamps) {
const arr = groupedByTurn.get(word.turnIndex) ?? [];
arr.push(word);
groupedByTurn.set(word.turnIndex, arr);
}
Use this to colour different speakers, label captions with names, or compute per-turn duration metrics.
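For instance, a minimal sketch of per-turn durations built on that map (turnDurations is a hypothetical helper, not part of the SDK):
// Per-turn duration: last word end minus first word start within each turn.
// Assumes each turn's words are in playback order, as the endpoint returns them.
function turnDurations(
  groupedByTurn: Map<number, { start: number; end: number }[]>
): Map<number, number> {
  const durations = new Map<number, number>();
  for (const [turnIndex, words] of groupedByTurn) {
    if (words.length === 0) continue;
    durations.set(turnIndex, words[words.length - 1].end - words[0].start);
  }
  return durations;
}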
When timestamps fail
Speechbase tries native provider alignment first, then falls back to STT
(Whisper). If both fail you'll get a 503 timestamps_unavailable. Handle it
the same way you'd handle any 5xx:
if (res.status === 503) {
const body = await res.json();
if (body.title === "timestamps_unavailable") {
// Retry once without timestamps to at least get the audio.
return retryWithoutTimestamps();
}
}
You can avoid the failure mode entirely by passing "timestamps": "off" in
the request — the endpoint then never attempts alignment and returns
timestamps: []. That's useful when you want the JSON envelope (base64 audio
plus warnings) but don't need word offsets.
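A sketch of that request body, reusing the values from the earlier fetch example:
// Sketch: same request as before, with alignment disabled.
// The envelope comes back with base64 audio, warnings, and timestamps: [].
const body = JSON.stringify({
  mode: "voice",
  voiceId: "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
  text: "Hello from Speechbase.",
  output: "mp3",
  timestamps: "off",
});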
Streaming + timestamps
You can't have both. Alignment requires the full audio, so the timestamps endpoints always buffer.
If your application needs both low first-byte latency and captions:
- Fire the streaming /v1/audio/speech request and start playback.
- In parallel, fire the same text against /v1/audio/speech/with-timestamps.
- Merge the captions over the audio when the second response arrives (a rough sketch follows this list).
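A rough sketch of that pattern, assuming the streaming endpoint accepts the same request body and that playStream and showCaptions are hypothetical helpers in your own player code:
// Sketch: start streaming playback immediately, fetch word timestamps in parallel.
// playStream and showCaptions are hypothetical app-side helpers, not SDK calls.
const headers = {
  Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
  "Content-Type": "application/json",
};
const body = JSON.stringify({
  mode: "voice",
  voiceId: "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
  text: "Hello from Speechbase.",
  output: "mp3",
});

// 1. Low-latency audio: stream and start playing right away.
const audioPromise = fetch("https://api.speechbase.ai/v1/audio/speech", {
  method: "POST",
  headers,
  body,
}).then((res) => playStream(res.body));

// 2. Captions: buffered request that returns word offsets.
const captionsPromise = fetch(
  "https://api.speechbase.ai/v1/audio/speech/with-timestamps",
  { method: "POST", headers, body }
).then((res) => res.json());

// 3. Overlay captions once the second response arrives.
const [, envelope] = await Promise.all([audioPromise, captionsPromise]);
showCaptions(envelope.timestamps);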
For most user-facing playback flows the latency hit isn't worth this complexity — start with the buffered timestamps endpoint and only optimise if profiling tells you to.
Accuracy notes
- Native alignment from providers is generally word-accurate to a few tens of milliseconds.
- STT-fallback alignment is good but not perfect — accents, fast speech, and uncommon proper nouns can drift by ~50–100ms.
- Punctuation is usually attached to the preceding word ("Speechbase.") rather than emitted as its own entry.
- Trim trailing silence on the audio file before computing offsets if you re-encode — Speechbase's offsets reference the audio it returned.