Generate captions and karaoke timing
Get word-level timing for any synthesised audio — captions, lip-sync animation, highlight effects, accessibility.
If your product plays synthesised speech, sooner or later you need to know when each word lands. Captions for accessibility. Karaoke-style highlighting during playback. Lip-sync metadata for an animated character. Subtitle files for video content.
Speechbase's with-timestamps endpoints return word-level alignment for every
synthesis — natively when the provider supports it, and via STT fallback when
it doesn't.
The smallest possible example
import { generateSpeech } from "@speech-sdk/core";

const result = await generateSpeech({
  apiKey: process.env.SPEECHBASE_API_KEY,
  model: "openai/gpt-4o-mini-tts",
  voice: "alloy",
  text: "Hello and welcome to the show.",
  timestamps: true,
});

result.timestamps;
// [
// { text: "Hello", start: 0.04, end: 0.41 },
// { text: "and", start: 0.43, end: 0.55 },
// { text: "welcome", start: 0.57, end: 0.94 },
// { text: "to", start: 0.96, end: 1.04 },
// { text: "the", start: 1.06, end: 1.18 },
// { text: "show.", start: 1.20, end: 1.62 },
// ]

curl -X POST https://api.speechbase.ai/v1/audio/speech/with-timestamps \
  -H "Authorization: Bearer $SPEECHBASE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "voice",
    "voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
    "text": "Hello and welcome to the show.",
    "output": "mp3"
  }'

Each entry is { text, start, end } with timings in seconds from the start of the returned audio. Punctuation is attached to the preceding word. The cURL form additionally returns base64 audio and mediaType in a JSON envelope; the SDK exposes the bytes via result.audio.uint8Array.
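In a Node script, both artefacts can go straight to disk. A minimal sketch building on the SDK example above (the file names are illustrative):

import { writeFile } from "node:fs/promises";

await writeFile("welcome.mp3", result.audio.uint8Array); // the rendered audio bytes
await writeFile("welcome.words.json", JSON.stringify(result.timestamps, null, 2));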
Native vs STT fallback — automatic
Some providers (OpenAI, ElevenLabs, others) emit alignment alongside the audio. Some don't. Speechbase abstracts the difference:
- Native first. If the chosen provider supports word-level alignment for the chosen model, Speechbase uses the provider's data.
- STT fallback. Otherwise Speechbase runs the rendered audio through Whisper and reconstructs the boundaries.
You don't pick. The warnings array tells you which path was used if you
care. If you have a reason to skip alignment entirely (e.g. you only want the
JSON envelope for buffered audio), pass "timestamps": "off" to get
timestamps: [] back without paying for STT.
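If you do care during development, a quick scan of the warnings array surfaces the path. A minimal sketch building on the first example, assuming the warning entries stringify to something that names the fallback (the exact shape may differ):

// Hypothetical check: the docs guarantee warnings says which path was used,
// not the precise entry format.
const usedFallback = (result.warnings ?? []).some((w) =>
  String(w).toLowerCase().includes("stt"),
);
if (usedFallback) console.log("alignment came from the STT fallback");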
Painting captions during playback
The timestamps array is monotonic in start, so a timeupdate listener can
find the current word in O(log n) with binary search:
type Word = { text: string; start: number; end: number }; // shape documented above

function findCurrentWord(t: number, words: Word[]) {
  let lo = 0, hi = words.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (t < words[mid].start) hi = mid - 1;    // t is before this word
    else if (t > words[mid].end) lo = mid + 1; // t is after this word
    else return words[mid];                    // t falls inside [start, end]
  }
  return null; // t falls in a gap between words
}

audio.addEventListener("timeupdate", () => {
  const word = findCurrentWord(audio.currentTime, words);
  if (word) renderHighlight(word);
});

For 30-second clips a linear scan works fine. For long audiobook passages, the binary search keeps things smooth at 60fps.
Conversations: per-turn attribution
POST /v1/audio/conversation/with-timestamps returns the same envelope plus a
turnIndex on every word — the zero-based index into the original turns
array. Group by turnIndex to colour-code captions by speaker:
// Conversation words carry a turnIndex on top of the base Word shape.
type TurnWord = Word & { turnIndex: number };

const grouped = new Map<number, TurnWord[]>();
for (const word of timestamps as TurnWord[]) {
  const arr = grouped.get(word.turnIndex) ?? [];
  arr.push(word);
  grouped.set(word.turnIndex, arr);
}

When Speechbase falls back to STT across the stitched conversation audio, it uses gap markers to recover turnIndex. That recovery is accurate as long as your gapMs is at least 200–300ms; very short gaps make turn boundaries ambiguous.
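Painting each turn in its own colour is then a few lines of DOM work. A sketch, with the palette and the captionContainer element as placeholder choices:

const palette = ["#4e79a7", "#f28e2b", "#e15759", "#76b7b2"];

for (const [turnIndex, turnWords] of grouped) {
  const p = document.createElement("p");
  p.style.color = palette[turnIndex % palette.length]; // one colour per speaker
  p.textContent = turnWords.map((w) => w.text).join(" ");
  captionContainer.appendChild(p); // captionContainer: wherever your captions render
}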
Generating a subtitle file
The SDK ships a timestampsToCaptions() helper that breaks word-level
timestamps into SRT or WebVTT cues — sentence-aware, with sensible defaults
for line length and cue duration:
import { generateSpeech, timestampsToCaptions } from "@speech-sdk/core";

const { timestamps } = await generateSpeech({
  apiKey: process.env.SPEECHBASE_API_KEY,
  model: "elevenlabs/eleven_v3",
  voice: "JBFqnCBsd6RMkjVDRZzb",
  text: "Hello world. This is a test.",
  timestamps: true,
});

const srt = timestampsToCaptions(timestamps ?? []);
const vtt = timestampsToCaptions(timestamps ?? [], { format: "vtt" });

If you'd rather roll your own — to control cue grouping, line breaks, or speaker prefixes — group words into caption-friendly chunks (5–10 words or ~3 seconds, whichever comes first):
function toVtt(words: Word[]): string {
  const lines = ["WEBVTT", ""];
  let cue: Word[] = [];
  let cueStart = 0;
  let cueIndex = 1;

  function flush(end: number) {
    if (cue.length === 0) return;
    lines.push(String(cueIndex++));
    lines.push(`${vttTime(cueStart)} --> ${vttTime(end)}`);
    lines.push(cue.map((w) => w.text).join(" "));
    lines.push("");
    cue = [];
  }

  for (const word of words) {
    if (cue.length === 0) cueStart = word.start;
    cue.push(word);
    // Close the cue at 8 words or ~3 seconds, whichever comes first.
    if (cue.length >= 8 || word.end - cueStart >= 3) flush(word.end);
  }
  flush(words.at(-1)?.end ?? cueStart); // flush any trailing partial cue
  return lines.join("\n");
}

function vttTime(t: number): string {
  const h = Math.floor(t / 3600);
  const m = Math.floor((t % 3600) / 60);
  const s = (t % 60).toFixed(3);
  return `${String(h).padStart(2, "0")}:${String(m).padStart(2, "0")}:${s.padStart(6, "0")}`;
}

The WebVTT output drops straight onto a <video> element via <track kind="captions" src="...">; browsers only understand WebVTT in <track>, so keep the SRT output for players and editors that expect it.
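If the VTT string lives in memory rather than on disk, a Blob URL does the same job. A browser-side sketch, assuming video is your <video> element:

const track = document.createElement("track");
track.kind = "captions";
track.label = "English";
track.srclang = "en";
track.default = true;
track.src = URL.createObjectURL(new Blob([vtt], { type: "text/vtt" }));
video.appendChild(track);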
Streaming + timestamps don't combine
Alignment requires the full audio. Streaming returns chunks before the full audio exists. Pick one:
- Buffered with timestamps — use the with-timestamps endpoint. Higher first-byte latency, but you get captions for free.
- Streamed without timestamps — use the plain endpoint. Lowest latency, but no captions.
For most caption use cases the latency difference doesn't matter — captions ship with the audio file, not in real time.
Accuracy notes
- Native alignment is typically within a few tens of milliseconds.
- STT-fallback alignment can drift 50–100ms on accented or fast speech.
- Numbers and punctuation are tokenised differently per provider — "$5" may come back as one token or three.
- If you re-encode the audio after Speechbase returns it, recompute offsets from the new file — Speechbase's offsets are relative to the audio it returned.
Going further
- Lip-sync. Each word's start/end is enough to drive a simple talking-head animation (mouth opens during a word, closes between). For phoneme-level lip-sync, look at the per-provider provider_options — some providers expose viseme data through the same alignment channel.
- Search and seek. Word-level timing makes long-form audio searchable — build a transcript index that links every word to its second.
- Per-word click-to-seek. Render each word as a span with a click handler that sets audio.currentTime = word.start (sketched below). Eight lines of code; massively improves the experience of a transcript view.
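For reference, the click-to-seek wiring really is about eight lines. A sketch, assuming words and an audio element are in scope and transcriptEl is your transcript container:

for (const word of words) {
  const span = document.createElement("span");
  span.textContent = word.text + " ";
  span.onclick = () => {
    audio.currentTime = word.start; // jump playback to this word
  };
  transcriptEl.appendChild(span);
}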