Synthesise speech
Walk through the two ways to address a synthesis request — inline and by voice — and how to pick between them.
POST /v1/audio/speech is the single-shot speech endpoint. This guide is a deep dive into what you can put in the request body.
The two addressing modes
A synthesis request needs to know who should speak. There are two ways to
say that. Pick one and set the matching mode discriminator — the schema
requires it.
Inline (mode: "inline")
Specify model and voice directly. model is a provider/model-id string.
Best for one-off requests where you don't want to set up a Voice row.
```json
{
  "mode": "inline",
  "model": "openai/gpt-4o-mini-tts",
  "voice": "alloy",
  "text": "Hello there.",
  "output": "mp3"
}
```
By voice (mode: "voice")
Reference a saved Voice by its UUID. The voice already pins provider, model, and the upstream voice ID:
```json
{
  "mode": "voice",
  "voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
  "text": "Hello there.",
  "output": "mp3"
}
```
This is what most production code should do — voices are the right unit of configuration.
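If you want to confirm an ID resolves before shipping it in a request, the same GET /v1/voices/{id} lookup mentioned under Common pitfalls works as a preflight check. A minimal sketch, assuming the usual Bearer auth; the success response fields are documented in the Voices reference:

```ts
// Preflight check: does this Voice exist and belong to our org?
const voiceId = "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2";

const lookup = await fetch(`https://api.speechbase.ai/v1/voices/${voiceId}`, {
  headers: { Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}` },
});

if (!lookup.ok) {
  // A 404 here usually means a mistyped ID or a voice owned by another org.
  throw new Error(`Voice ${voiceId} did not resolve (${lookup.status})`);
}

// Safe to put voiceId in the speech request body shown above.
```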
Output format
The output field controls audio encoding. Pass either a string shorthand:
{ "output": "mp3" }…or an object for finer control:
{ "output": { "format": "mp3", "bitrate": 192 } }| Format | Default bitrate | Notes |
|---|---|---|
wav | n/a | Default. Uncompressed PCM in a WAV container. |
mp3 | 96 kbps | Compressed. Specify bitrate (in kbps) to override. |
pcm | n/a | Headerless raw PCM. Useful for downstream pipelines. |
See Output formats for the full matrix and recommendations.
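The pcm row deserves one extra note: the response body is bare samples with no header, so whatever consumes it must already know the sample rate and channel layout listed in Output formats. A minimal sketch of requesting it and dumping the bytes to disk; the file name is arbitrary:

```ts
import { writeFile } from "node:fs/promises";

const res = await fetch("https://api.speechbase.ai/v1/audio/speech", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    mode: "voice",
    voiceId: "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
    text: "Raw samples, no container.",
    output: "pcm",
  }),
});

// Content-Type will be application/octet-stream. The bytes are bare samples,
// so the downstream pipeline needs the sample rate and channel count from the
// Output formats page.
await writeFile("speech.raw", new Uint8Array(await res.arrayBuffer()));
```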
Response shape
POST /v1/audio/speech returns the raw audio bytes — the response Content-Type
matches the chosen format (audio/mpeg, audio/wav, or
application/octet-stream for pcm).
Speechbase streams the response when the upstream provider supports streaming. That means you can start playback before the entire file is downloaded — but it also means you can't peek at headers to learn the duration up front.
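As a concrete illustration, here is a sketch (not part of the SDK) that consumes the streamed response in Node and writes chunks to disk as they arrive, using only standard Node stream APIs:

```ts
import { createWriteStream } from "node:fs";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";

const res = await fetch("https://api.speechbase.ai/v1/audio/speech", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    mode: "voice",
    voiceId: "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
    text: "Streaming, not buffering.",
    output: "mp3",
  }),
});

if (!res.ok || !res.body) {
  throw new Error(`${res.status} ${await res.text()}`);
}

console.log(res.headers.get("content-type")); // audio/mpeg for mp3 output

// Chunks are written as they arrive from the stream; nothing waits for the
// full response body to buffer in memory.
await pipeline(Readable.fromWeb(res.body), createWriteStream("speech.mp3"));
```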
If you need word-level timestamps, use
/v1/audio/speech/with-timestamps.
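A minimal sketch of that call, assuming it accepts the same request body and returns JSON rather than raw audio; check the with-timestamps reference for the exact response fields before depending on them:

```ts
const res = await fetch("https://api.speechbase.ai/v1/audio/speech/with-timestamps", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    mode: "voice",
    voiceId: "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
    text: "Every word, timed.",
    output: "mp3",
  }),
});

// The exact shape (audio encoding, per-word timing fields) is documented in
// the with-timestamps endpoint reference; inspect it before wiring anything up.
console.log(await res.json());
```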
Practical recipes
Save the audio to a file (Node)
```ts
import { generateSpeech } from "@speech-sdk/core";
import { writeFile } from "node:fs/promises";

const result = await generateSpeech({
  apiKey: process.env.SPEECHBASE_API_KEY,
  model: "openai/gpt-4o-mini-tts",
  voice: "alloy",
  text: "Hello from Speechbase.",
  output: { format: "mp3" },
});

await writeFile("hello.mp3", result.audio.uint8Array);
```

The raw HTTP equivalent, here addressing a saved voice:

```ts
import { writeFile } from "node:fs/promises";

const res = await fetch("https://api.speechbase.ai/v1/audio/speech", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    mode: "voice",
    voiceId: "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
    text: "Hello from Speechbase.",
    output: "mp3",
  }),
});

if (!res.ok) {
  throw new Error(`${res.status} ${await res.text()}`);
}

await writeFile("hello.mp3", new Uint8Array(await res.arrayBuffer()));
```

Stream straight to the browser (Web standard Response)
```ts
import { streamSpeech } from "@speech-sdk/core";

const { audio, mediaType } = await streamSpeech({
  apiKey: env.SPEECHBASE_API_KEY,
  model: "cartesia/sonic-3",
  voice: "a0e99841-438c-4a64-b679-ae501e7d6091", // provider voice id
  text,
});

return new Response(audio, { headers: { "Content-Type": mediaType } });
```

The raw HTTP version proxies the upstream body straight through:

```ts
const upstream = await fetch("https://api.speechbase.ai/v1/audio/speech", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${env.SPEECHBASE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ mode: "voice", voiceId, text, output: "mp3" }),
});

return new Response(upstream.body, {
  headers: { "Content-Type": "audio/mpeg" },
});
```

The bytes stream end-to-end — your client can start decoding before the upstream finishes generating.
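One client-side note: an Audio element issues a plain GET, so to drive it from the proxy above you would expose a GET route that reads the text from the query string. The sketch below assumes a hypothetical /api/speech path wired that way:

```ts
// Hypothetical client-side usage. The /api/speech path and the text query
// parameter are illustrative; they are not part of the Speechbase API.
const player = new Audio(`/api/speech?text=${encodeURIComponent("Hello from Speechbase.")}`);

// Browsers begin playback once enough of the stream has buffered, so a long
// generation does not block the first audible word.
await player.play();
```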
Common pitfalls
- 403 no_api_key — the provider you targeted has no BYOK credential. Connect one in Speechbase → Provider Keys.
- 422 content_moderation_blocked — text tripped a moderation rule. See Moderation. (Both status codes are easy to branch on in code; see the sketch after this list.)
- Audio sounds different across providers — different vendors ship at different reference loudness. Either pick one provider per use case or use conversations with volumeDbfs to normalise.
- voice_id doesn't resolve — double-check the row exists with GET /v1/voices/{id}. Voice IDs are scoped to your org.
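If you would rather branch on those failures in code than read them out of logs, a minimal sketch follows. It relies only on the HTTP statuses and error names listed above and logs the body instead of assuming a particular error schema:

```ts
const res = await fetch("https://api.speechbase.ai/v1/audio/speech", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SPEECHBASE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    mode: "voice",
    voiceId: "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
    text: "Hello there.",
    output: "mp3",
  }),
});

if (!res.ok) {
  const body = await res.text(); // error payload; look for the code names listed above
  if (res.status === 403) {
    // Likely no_api_key: no BYOK credential connected for the target provider.
    throw new Error(`Provider key missing? ${body}`);
  }
  if (res.status === 422) {
    // Likely content_moderation_blocked: the text tripped a moderation rule.
    throw new Error(`Text rejected by moderation: ${body}`);
  }
  throw new Error(`Speech request failed: ${res.status} ${body}`);
}
```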