Speechbase

Generate captions and karaoke timing

Get word-level timing for any synthesised audio — captions, lip-sync animation, highlight effects, accessibility.

If your product plays synthesised speech, sooner or later you need to know when each word lands. Captions for accessibility. Karaoke-style highlighting during playback. Lip-sync metadata for an animated character. Subtitle files for video content.

Speechbase's with-timestamps endpoints return word-level alignment for every synthesis — natively when the provider supports it, and via STT fallback when it doesn't.

The smallest possible example

import { generateSpeech } from "@speech-sdk/core";

const result = await generateSpeech({
  apiKey: process.env.SPEECHBASE_API_KEY,
  model: "openai/gpt-4o-mini-tts",
  voice: "alloy",
  text: "Hello and welcome to the show.",
  timestamps: true,
});

result.timestamps;
// [
//   { text: "Hello",   start: 0.04, end: 0.41 },
//   { text: "and",     start: 0.43, end: 0.55 },
//   { text: "welcome", start: 0.57, end: 0.94 },
//   { text: "to",      start: 0.96, end: 1.04 },
//   { text: "the",     start: 1.06, end: 1.18 },
//   { text: "show.",   start: 1.20, end: 1.62 },
// ]

Or over the raw HTTP API:

curl -X POST https://api.speechbase.ai/v1/audio/speech/with-timestamps \
  -H "Authorization: Bearer $SPEECHBASE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "voice",
    "voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
    "text": "Hello and welcome to the show.",
    "output": "mp3"
  }'

Each entry is { text, start, end } with timings in seconds from the start of the returned audio. Punctuation is attached to the preceding word. The cURL form additionally returns base64 audio and mediaType in a JSON envelope; the SDK exposes the bytes via result.audio.uint8Array.
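
On the SDK side, a minimal sketch of persisting both halves of that result, assuming a Node runtime (the file names are illustrative):

import { writeFile } from "node:fs/promises";

// Save the rendered audio and its word alignment side by side.
await writeFile("welcome.mp3", result.audio.uint8Array);
await writeFile("welcome.timestamps.json", JSON.stringify(result.timestamps, null, 2));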

Native vs STT fallback — automatic

Some providers (OpenAI, ElevenLabs, others) emit alignment alongside the audio. Some don't. Speechbase abstracts the difference:

  1. Native first. If the chosen provider supports word-level alignment for the chosen model, Speechbase uses the provider's data.
  2. STT fallback. Otherwise Speechbase runs the rendered audio through Whisper and reconstructs the boundaries.

You don't pick. The warnings array tells you which path was used if you care. If you have a reason to skip alignment entirely (e.g. you only want the JSON envelope for buffered audio), pass "timestamps": "off" to get timestamps: [] back without paying for STT.
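
A sketch of both knobs, assuming warnings lives on the synthesis result and that the SDK accepts the same "off" value as the HTTP body:

// See which alignment path was taken; the exact warning contents are provider-dependent.
if (result.warnings?.length) {
  console.info("alignment notes:", result.warnings);
}

// Skip alignment entirely. Assumption: the SDK forwards "off" the same way the HTTP body does.
const audioOnly = await generateSpeech({
  apiKey: process.env.SPEECHBASE_API_KEY,
  model: "openai/gpt-4o-mini-tts",
  voice: "alloy",
  text: "No captions needed for this one.",
  timestamps: "off", // comes back with timestamps: [] and no STT pass
});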

Painting captions during playback

The timestamps array is monotonic in start, so a timeupdate listener can find the current word in O(log n) with binary search:

// Shape of each entry in `timestamps` (see above); conversation responses also carry turnIndex.
type Word = { text: string; start: number; end: number };

// Binary search over the monotonic array: returns the word whose span contains t,
// or null when t falls in the silence between two words.
function findCurrentWord(t: number, words: Word[]) {
  let lo = 0, hi = words.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (t < words[mid].start) hi = mid - 1;
    else if (t > words[mid].end) lo = mid + 1;
    else return words[mid];
  }
  return null;
}

audio.addEventListener("timeupdate", () => {
  const word = findCurrentWord(audio.currentTime, words);
  if (word) renderHighlight(word);
});

For 30-second clips a linear scan works fine. For long audiobook passages, the binary search keeps things smooth at 60fps.

Conversations: per-turn attribution

POST /v1/audio/conversation/with-timestamps returns the same envelope plus a turnIndex on every word — the zero-based index into the original turns array. Group by turnIndex to colour-code captions by speaker:

const grouped = new Map<number, Word[]>();
for (const word of timestamps) {
  const arr = grouped.get(word.turnIndex) ?? [];
  arr.push(word);
  grouped.set(word.turnIndex, arr);
}
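
With the grouping in place, a minimal sketch of colour-coding captions by speaker; the container element and class names are assumptions:

// Render each turn as its own block, tagged with a per-speaker class for styling.
const container = document.querySelector("#captions")!; // assumed container element
for (const [turnIndex, turnWords] of grouped) {
  const p = document.createElement("p");
  p.className = `speaker-${turnIndex % 2}`; // e.g. alternate styling for a two-speaker dialogue
  p.textContent = turnWords.map((w) => w.text).join(" ");
  container.appendChild(p);
}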

When Speechbase falls back to STT across the stitched conversation audio, it uses gap markers to recover turnIndex. Recovery is accurate as long as your gapMs is at least 200–300ms; very short gaps make turn boundaries ambiguous.
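
If you want to sanity-check the recovered attribution client-side, a small sketch that flags turn transitions with less than ~200ms of silence between them:

// Flag turn boundaries where the recovered silence is too short to be unambiguous.
for (let i = 1; i < timestamps.length; i++) {
  const prev = timestamps[i - 1];
  const curr = timestamps[i];
  const gap = curr.start - prev.end;
  if (curr.turnIndex !== prev.turnIndex && gap < 0.2) {
    console.warn(`turn ${prev.turnIndex} -> ${curr.turnIndex}: only ${Math.round(gap * 1000)}ms of silence`);
  }
}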

Generating a subtitle file

The SDK ships a timestampsToCaptions() helper that breaks word-level timestamps into SRT or WebVTT cues — sentence-aware, with sensible defaults for line length and cue duration:

import { generateSpeech, timestampsToCaptions } from "@speech-sdk/core";

const { timestamps } = await generateSpeech({
  apiKey: process.env.SPEECHBASE_API_KEY,
  model: "elevenlabs/eleven_v3",
  voice: "JBFqnCBsd6RMkjVDRZzb",
  text: "Hello world. This is a test.",
  timestamps: true,
});

const srt = timestampsToCaptions(timestamps ?? []);
const vtt = timestampsToCaptions(timestamps ?? [], { format: "vtt" });

If you'd rather roll your own — to control cue grouping, line breaks, or speaker prefixes — group words into caption-friendly chunks (5–10 words or ~3 seconds, whichever comes first):

function toVtt(words: Word[]): string {
  const lines = ["WEBVTT", ""];
  let cue: Word[] = [];
  let cueStart = 0;
  let cueIndex = 1;

  function flush(end: number) {
    if (cue.length === 0) return;
    lines.push(String(cueIndex++));
    lines.push(`${vttTime(cueStart)} --> ${vttTime(end)}`);
    lines.push(cue.map((w) => w.text).join(" "));
    lines.push("");
    cue = [];
  }

  for (const word of words) {
    if (cue.length === 0) cueStart = word.start;
    cue.push(word);
    if (cue.length >= 8 || word.end - cueStart >= 3) flush(word.end);
  }
  flush(words.at(-1)?.end ?? cueStart);
  return lines.join("\n");
}

function vttTime(t: number): string {
  const h = Math.floor(t / 3600);
  const m = Math.floor((t % 3600) / 60);
  const s = (t % 60).toFixed(3);
  return `${String(h).padStart(2, "0")}:${String(m).padStart(2, "0")}:${s.padStart(6, "0")}`;
}

The WebVTT output drops straight onto a <video> element via <track kind="captions" src="...">; browsers only parse WebVTT in <track>, so keep the SRT form for editors and players that expect it.
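
If you'd rather not write a file at all, a small sketch of attaching the generated WebVTT through an object URL (the video lookup, label, and language are assumptions):

// Attach the generated WebVTT to a <video> element without touching the filesystem.
const track = document.createElement("track");
track.kind = "captions";
track.label = "English"; // assumed label
track.srclang = "en";    // assumed language
track.src = URL.createObjectURL(new Blob([vtt], { type: "text/vtt" }));
track.default = true;
document.querySelector("video")!.appendChild(track);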

Streaming + timestamps don't combine

Alignment requires the full audio. Streaming returns chunks before the full audio exists. Pick one:

  • Buffered with timestamps — use the with-timestamps endpoint. Higher first-byte latency but you get captions for free.
  • Streamed without timestamps — use the plain endpoint. Lowest latency, but no captions.

For most caption use cases the latency difference doesn't matter — captions ship with the audio file, not in real time.

Accuracy notes

  • Native alignment is typically within a few tens of milliseconds.
  • STT-fallback alignment can drift 50–100ms on accented or fast speech.
  • Numbers and punctuation are tokenised differently per provider — "$5" may come back as one token or three.
  • If you re-encode the audio after Speechbase returns it, recompute offsets from the new file — Speechbase's offsets are relative to the audio it returned.

Going further

  • Lip-sync. Each word's start/end is enough to drive a simple talking-head animation (mouth opens during a word, closes between). For phoneme-level lip-sync, look at the per-provider provider_options — some providers expose viseme data through the same alignment channel.
  • Search and seek. Word-level timing makes long-form audio searchable — build a transcript index that links every word to its second.
  • Per-word click-to-seek. Render each word as a span with a click handler that sets audio.currentTime = word.start. Eight lines of code (sketched below); massively improves the experience of a transcript view.
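
A minimal sketch of that last bullet, reusing the words array and audio element from the playback section; the container element is an assumption:

// Render each word as a clickable span; clicking seeks the player to that word.
const transcript = document.querySelector("#transcript")!; // assumed container element
for (const word of words) {
  const span = document.createElement("span");
  span.textContent = word.text + " ";
  span.onclick = () => { audio.currentTime = word.start; };
  transcript.appendChild(span);
}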
