Word-level timestamps

Get word-by-word timing for synthesised audio — how to call the with-timestamps endpoints and what to do with the results.

This guide is the practical companion to Word-level timestamps (concept). If you want the what and why, read that first; this page is the how.

Calling the endpoint

import { generateSpeech } from "@speech-sdk/core";

const result = await generateSpeech({
  apiKey: process.env.SPEECHBASE_API_KEY,
  model: "openai/gpt-4o-mini-tts",
  voice: "alloy",
  text: "Hello from Speechbase.",
  timestamps: true,
});

result.audio.uint8Array; // raw bytes
result.audio.mediaType; // "audio/mpeg"
result.timestamps;
// [
//   { text: "Hello",   start: 0.04, end: 0.41 },
//   { text: "from",    start: 0.43, end: 0.62 },
//   { text: "Speechbase.", start: 0.65, end: 1.10 },
// ]

The same call over raw HTTP:

curl -X POST https://api.speechbase.ai/v1/audio/speech/with-timestamps \
  -H "Authorization: Bearer $SPEECHBASE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "voice",
    "voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
    "text": "Hello from Speechbase.",
    "output": "mp3"
  }' | tee response.json

The cURL form returns a JSON envelope:

{
  "audio": "<base64 mp3>",
  "mediaType": "audio/mpeg",
  "warnings": [],
  "timestamps": [
    { "text": "Hello",   "start": 0.04, "end": 0.41 },
    { "text": "from",    "start": 0.43, "end": 0.62 },
    { "text": "Speechbase.", "start": 0.65, "end": 1.10 }
  ]
}

Decoding the audio

The SDK exposes ready-to-use bytes; with cURL you have to decode the base64 yourself.

import { writeFile } from "node:fs/promises";

await writeFile("hello.mp3", result.audio.uint8Array);

With fetch against the raw endpoint:

import { writeFile } from "node:fs/promises";

const res = await fetch(
  "https://api.speechbase.ai/v1/audio/speech/with-timestamps",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      mode: "voice",
      voiceId: "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
      text: "Hello from Speechbase.",
      output: "mp3",
    }),
  }
);
const data = await res.json();
await writeFile("hello.mp3", Buffer.from(data.audio, "base64"));

Or, in the browser:

const blob = new Blob([result.audio.uint8Array], { type: result.audio.mediaType });
const url = URL.createObjectURL(blob);
audioElement.src = url;

Painting captions

The timestamps array is in seconds, monotonically increasing. To highlight the current word during playback:

const timestamps = result.timestamps;

audioElement.addEventListener("timeupdate", () => {
  const t = audioElement.currentTime;
  const current = timestamps.find(w => t >= w.start && t < w.end);
  if (current) renderHighlightedWord(current.text);
});

For long files, the array is already sorted, so binary-search it instead of scanning on every timeupdate event, as in the sketch below.
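
A minimal sketch of that lookup, assuming the word shape shown above (findWordAt is a hypothetical helper, not part of the SDK):

type Word = { text: string; start: number; end: number };

// Binary search over non-overlapping, sorted [start, end) intervals.
function findWordAt(words: Word[], t: number): Word | undefined {
  let lo = 0;
  let hi = words.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const w = words[mid];
    if (t < w.start) hi = mid - 1;
    else if (t >= w.end) lo = mid + 1;
    else return w; // w.start <= t < w.end
  }
  return undefined; // t falls in a gap between words
}

Call findWordAt(timestamps, audioElement.currentTime) inside the timeupdate handler in place of the linear find.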

Conversations: the turnIndex field

Conversation timestamps include turnIndex, the zero-based index into the original turns array:

type Word = { text: string; start: number; end: number; turnIndex: number };

const groupedByTurn = new Map<number, Word[]>();
for (const word of timestamps) {
  const arr = groupedByTurn.get(word.turnIndex) ?? [];
  arr.push(word);
  groupedByTurn.set(word.turnIndex, arr);
}

Use this to colour different speakers, label captions with names, or compute per-turn duration metrics.
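
For instance, per-turn duration falls out of the grouped map (a sketch building on groupedByTurn above; words within a turn are already in time order):

// Approximate each turn's duration as last word end minus first word start.
const turnDurations = new Map<number, number>();
for (const [turnIndex, words] of groupedByTurn) {
  turnDurations.set(turnIndex, words[words.length - 1].end - words[0].start);
}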

When timestamps fail

Speechbase tries native provider alignment first, then falls back to STT (Whisper). If both fail you'll get a 503 timestamps_unavailable. Handle it the same way you'd handle any 5xx:

if (res.status === 503) {
  const body = await res.json();
  if (body.title === "timestamps_unavailable") {
    // Retry once without timestamps to at least get the audio.
    return retryWithoutTimestamps();
  }
}

You can avoid the failure mode entirely by passing "timestamps": "off" in the request — the endpoint then never attempts alignment and returns timestamps: []. That's useful when you want the JSON envelope (base64 audio plus warnings) but don't need word offsets.
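
For example, a sketch of that request with fetch, assuming the same body shape as the earlier example:

const res = await fetch(
  "https://api.speechbase.ai/v1/audio/speech/with-timestamps",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      mode: "voice",
      voiceId: "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
      text: "Hello from Speechbase.",
      output: "mp3",
      timestamps: "off", // never attempts alignment; returns timestamps: []
    }),
  }
);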

Streaming + timestamps

You can't have both. Alignment requires the full audio, so the timestamps endpoints always buffer.

If your application needs both low first-byte latency and captions (see the sketch after this list):

  1. Fire the streaming /v1/audio/speech request and start playback.
  2. In parallel, fire the same text against /v1/audio/speech/with-timestamps.
  3. Merge captions over the audio when the second response arrives.
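
A minimal sketch of that pattern, using the SDK's generateSpeech for the timestamps leg; startStreamingPlayback and attachCaptions are hypothetical application helpers:

import { generateSpeech } from "@speech-sdk/core";

const text = "Hello from Speechbase.";

// 1. Kick off streamed playback against /v1/audio/speech right away
//    (startStreamingPlayback is assumed to handle that request and play
//    bytes as they arrive).
const playback = startStreamingPlayback(text);

// 2. In parallel, request the same text with timestamps (buffered).
const aligned = generateSpeech({
  apiKey: process.env.SPEECHBASE_API_KEY,
  model: "openai/gpt-4o-mini-tts",
  voice: "alloy",
  text,
  timestamps: true,
});

// 3. Overlay captions on the already-playing audio when alignment lands.
aligned.then(({ timestamps }) => attachCaptions(playback, timestamps));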

For most user-facing playback flows the latency hit isn't worth this complexity — start with the buffered timestamps endpoint and only optimise if profiling tells you to.

Accuracy notes

  • Native alignment from providers is generally word-accurate to a few tens of milliseconds.
  • STT-fallback alignment is good but not perfect — accents, fast speech, and uncommon proper nouns can drift by ~50–100ms.
  • Punctuation is usually attached to the preceding word ("Speechbase.") rather than emitted as its own entry.
  • Speechbase's offsets reference the audio it returned, so if you re-encode the file, trim trailing silence before computing your own offsets.
