Speechbase

Synthesise speech

Walk through the two ways to address a synthesis request — inline and by voice — and how to pick between them.

POST /v1/audio/speech is the single-shot speech endpoint. This guide is the deep dive on what you can put in the body.

The two addressing modes

A synthesis request needs to know who should speak. There are two ways to say that. Pick one and set the matching mode discriminator — the schema requires it.

Inline (mode: "inline")

Specify model and voice directly. model is a provider/model-id string. Best for one-off requests where you don't want to set up a Voice row.

{
  "mode": "inline",
  "model": "openai/gpt-4o-mini-tts",
  "voice": "alloy",
  "text": "Hello there.",
  "output": "mp3"
}

By voice (mode: "voice")

Reference a saved Voice by its UUID. The voice already pins provider, model, and the upstream voice ID:

{
  "mode": "voice",
  "voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
  "text": "Hello there.",
  "output": "mp3"
}

This is what most production code should do — voices are the right unit of configuration.
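The two bodies form a discriminated union on mode. A hedged TypeScript sketch of the shapes (type and function names here are illustrative, not published SDK types):

```typescript
// Illustrative types only — names are assumptions, not the official SDK's.
type OutputSpec = string | { format: string; bitrate?: number };

interface BaseRequest {
  text: string;
  output?: OutputSpec;
}

// mode: "inline" — name provider/model and voice directly.
interface InlineRequest extends BaseRequest {
  mode: "inline";
  model: string; // "provider/model-id"
  voice: string;
}

// mode: "voice" — reference a saved Voice row by UUID.
interface VoiceRequest extends BaseRequest {
  mode: "voice";
  voiceId: string;
}

type SpeechRequest = InlineRequest | VoiceRequest;

// The mode discriminator lets the compiler narrow the union:
function describe(req: SpeechRequest): string {
  return req.mode === "inline"
    ? `inline: ${req.model} / ${req.voice}`
    : `voice: ${req.voiceId}`;
}
```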

Output format

The output field controls audio encoding. Pass either a string shorthand:

{ "output": "mp3" }

…or an object for finer control:

{ "output": { "format": "mp3", "bitrate": 192 } }
Format  Default bitrate  Notes
wav     n/a              Default. Uncompressed PCM in a WAV container.
mp3     96 kbps          Compressed. Specify bitrate (in kbps) to override.
pcm     n/a              Headerless raw PCM. Useful for downstream pipelines.
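Client code that accepts either form can normalise the shorthand into the object shape before sending; a minimal sketch (the helper name and the mp3 default are ours, mirroring the table above):

```typescript
type OutputFormat = "wav" | "mp3" | "pcm";
interface OutputObject { format: OutputFormat; bitrate?: number }

// Expand the string shorthand into the object form the API also accepts.
// The 96 kbps default for mp3 mirrors the table above.
function normaliseOutput(output: OutputFormat | OutputObject): OutputObject {
  if (typeof output === "string") {
    return output === "mp3" ? { format: "mp3", bitrate: 96 } : { format: output };
  }
  return output;
}
```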

See Output formats for the full matrix and recommendations.

Response shape

POST /v1/audio/speech returns the raw audio bytes — the response Content-Type matches the chosen format (audio/mpeg, audio/wav, or application/octet-stream for pcm).

Speechbase streams the response when the upstream provider supports streaming. That means you can start playback before the entire file has downloaded. It also means the response carries no Content-Length header, so you can't learn the total size or duration up front.
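If you consume the stream yourself with fetch, you can read chunks as they arrive instead of buffering the whole body; a hedged sketch using the standard ReadableStream API:

```typescript
// Sketch: read streamed audio incrementally from a fetch response body.
async function collectStream(body: ReadableStream<Uint8Array>): Promise<Uint8Array> {
  const reader = body.getReader();
  const chunks: Uint8Array[] = [];
  let total = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done || !value) break;
    chunks.push(value); // a real player would hand each chunk to a decoder here
    total += value.length;
  }
  // Concatenate into one buffer once the stream ends.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) { out.set(c, offset); offset += c.length; }
  return out;
}
```

In a real player you would feed each chunk to a decoder as it lands rather than concatenating at the end.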

If you need word-level timestamps, use /v1/audio/speech/with-timestamps.

Practical recipes

Save the audio to a file (Node)

import { generateSpeech } from "@speech-sdk/core";
import { writeFile } from "node:fs/promises";

const result = await generateSpeech({
  apiKey: process.env.SPEECHBASE_API_KEY,
  model: "openai/gpt-4o-mini-tts",
  voice: "alloy",
  text: "Hello from Speechbase.",
  output: { format: "mp3" },
});

await writeFile("hello.mp3", result.audio.uint8Array);

Or with plain fetch:

import { writeFile } from "node:fs/promises";

const res = await fetch("https://api.speechbase.ai/v1/audio/speech", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    mode: "voice",
    voiceId: "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
    text: "Hello from Speechbase.",
    output: "mp3",
  }),
});

if (!res.ok) {
  throw new Error(`${res.status} ${await res.text()}`);
}

await writeFile("hello.mp3", new Uint8Array(await res.arrayBuffer()));

Stream straight to the browser (Web standard Response)

import { streamSpeech } from "@speech-sdk/core";

const { audio, mediaType } = await streamSpeech({
  apiKey: env.SPEECHBASE_API_KEY,
  model: "cartesia/sonic-3",
  voice: "a0e99841-438c-4a64-b679-ae501e7d6091", // provider voice id
  text,
});

return new Response(audio, { headers: { "Content-Type": mediaType } });

Or with plain fetch, piping the upstream body straight through:

const upstream = await fetch("https://api.speechbase.ai/v1/audio/speech", {
  method: "POST",
  headers: { Authorization: `Bearer ${env.SPEECHBASE_API_KEY}`, "Content-Type": "application/json" },
  body: JSON.stringify({ mode: "voice", voiceId, text, output: "mp3" }),
});

if (!upstream.ok) {
  throw new Error(`${upstream.status} ${await upstream.text()}`);
}

return new Response(upstream.body, {
  headers: { "Content-Type": "audio/mpeg" },
});

The bytes stream end-to-end — your client can start decoding before the upstream finishes generating.

Common pitfalls

  • 403 no_api_key — the provider you targeted has no BYOK credential. Connect one in Speechbase → Provider Keys.
  • 422 content_moderation_blocked — text tripped a moderation rule. See Moderation.
  • Audio sounds different across providers — different vendors ship at different reference loudness. Either pick one provider per use case or use conversations with volumeDbfs to normalise.
  • voiceId doesn't resolve — double-check the row exists with GET /v1/voices/{id}. Voice IDs are scoped to your org.
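The first two pitfalls surface as non-2xx responses when calling the endpoint directly. A hedged handler sketch (assuming the error body is JSON with a code field — verify against your own responses):

```typescript
// Sketch: map Speechbase error responses to readable messages.
// The JSON error body with a `code` field is an assumption, not documented here.
async function explainError(res: { status: number; text(): Promise<string> }): Promise<string> {
  const body = await res.text();
  let code = "";
  try { code = JSON.parse(body).code ?? ""; } catch { /* non-JSON body */ }

  if (res.status === 403 && code === "no_api_key") {
    return "No BYOK credential for this provider — connect one under Provider Keys.";
  }
  if (res.status === 422 && code === "content_moderation_blocked") {
    return "Text was blocked by a moderation rule.";
  }
  return `Unexpected ${res.status}: ${body}`;
}
```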
