Quickstart
Make your first synthesis call against api.speechbase.ai in under five minutes.
By the end of this page you'll have a Speechbase API key, a connected provider,
and a .mp3 on disk generated through the gateway.
1. Create an account
Sign up at speechbase.ai and accept the workspace creation prompt. You'll land on the dashboard.
2. Issue a Speechbase API key
The Speechbase API key authenticates your requests against api.speechbase.ai.
- Open Settings → API Keys in the dashboard.
- Click Create key, name it, and copy the value (it's shown once).
- Export it so the snippets below pick it up:

```shell
export SPEECHBASE_API_KEY=sk_...
```
Every endpoint except /health requires this. See
Authentication for the full story.
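A missing or unexported key is the most common first-run failure. As a minimal sketch, you can fail fast at startup instead of sending requests that will be rejected; `requireSpeechbaseKey` is a local helper for illustration, not part of any SDK:

```typescript
// Local helper (not part of @speech-sdk/core): read the Speechbase key
// from the environment and throw a descriptive error if it is missing.
function requireSpeechbaseKey(env: Record<string, string | undefined>): string {
  const key = env.SPEECHBASE_API_KEY;
  if (!key) {
    throw new Error(
      "SPEECHBASE_API_KEY is not set. Create one under Settings → API Keys and export it first.",
    );
  }
  return key;
}
```

Call it once at startup, e.g. `const apiKey = requireSpeechbaseKey(process.env);`, and pass the result into the SDK calls below.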
3. Connect a provider key (BYOK)
Speechbase routes your request through whichever upstream provider you choose. The self-serve path is BYOK: you bring the credentials.
- Open Speechbase → Provider Keys.
- Pick a provider — OpenAI is the simplest to start with — and paste in your provider API key.
- Save. Speechbase stores the key encrypted in a secure key store. We can't view or recover the full key; only the last four characters are shown back to you.
For the deeper walkthrough see the BYOK guide. If your workspace uses Managed Routing, Speechbase manages the provider relationship and billing instead; the request shape below stays the same.
4. Synthesise speech
The fastest path is the @speech-sdk/core
TypeScript SDK — one generateSpeech() call across every supported provider.
Or call the gateway directly with curl.
TypeScript:

```typescript
import { generateSpeech } from "@speech-sdk/core";
import { writeFile } from "node:fs/promises";

const result = await generateSpeech({
  apiKey: process.env.SPEECHBASE_API_KEY,
  model: "openai/gpt-4o-mini-tts",
  voice: "alloy",
  text: "Hello from Speechbase.",
  output: { format: "mp3" },
});

await writeFile("hello.mp3", result.audio.uint8Array);
```

curl:

```shell
curl -X POST https://api.speechbase.ai/v1/audio/speech \
  -H "Authorization: Bearer $SPEECHBASE_API_KEY" \
  -H "Content-Type: application/json" \
  --output hello.mp3 \
  -d '{
    "mode": "inline",
    "text": "Hello from Speechbase.",
    "model": "openai/gpt-4o-mini-tts",
    "voice": "alloy",
    "output": "mp3"
  }'
```

You should get a playable hello.mp3 in your working directory.
If you get 403 no_api_key, finish step 3 first.
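When calling the gateway directly, it helps to turn error responses into actionable hints. The sketch below assumes a `{ error: string }` response body and that a missing Speechbase key yields a 401; only the `403 no_api_key` case comes from this page, so check the API Reference for the exact schema:

```typescript
// Map a gateway error to a next step. The { error?: string } body shape
// and the 401 case are assumptions for illustration; 403 no_api_key is
// the "no provider key connected" case described above.
function hintForError(status: number, body: { error?: string }): string {
  if (status === 403 && body.error === "no_api_key") {
    return "No provider key connected. Add one under Speechbase → Provider Keys (step 3).";
  }
  if (status === 401) {
    return "SPEECHBASE_API_KEY missing or invalid. Re-check step 2.";
  }
  return `Unexpected ${status}: ${body.error ?? "unknown error"}`;
}
```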
5. Try a multi-speaker conversation
TypeScript:

```typescript
import { generateConversation } from "@speech-sdk/core";
import { writeFile } from "node:fs/promises";

const result = await generateConversation({
  apiKey: process.env.SPEECHBASE_API_KEY,
  turns: [
    { model: "openai/gpt-4o-mini-tts", voice: "alloy", text: "How was your day?" },
    { model: "openai/gpt-4o-mini-tts", voice: "shimmer", text: "Honestly? Long. But better now." },
    { model: "openai/gpt-4o-mini-tts", voice: "alloy", text: "Tell me about it." },
  ],
  output: { format: "wav" },
});

await writeFile("chat.wav", result.audio.uint8Array);
```

curl:

```shell
curl -X POST https://api.speechbase.ai/v1/audio/conversation \
  -H "Authorization: Bearer $SPEECHBASE_API_KEY" \
  -H "Content-Type: application/json" \
  --output chat.wav \
  -d '{
    "model": "openai/gpt-4o-mini-tts",
    "turns": [
      { "voice": "alloy", "text": "How was your day?" },
      { "voice": "shimmer", "text": "Honestly? Long. But better now." },
      { "voice": "alloy", "text": "Tell me about it." }
    ],
    "output": "wav"
  }'
```

One stitched WAV, three turns, two voices. This is the shape of Conversations, and it generalises across providers and voices without changing the endpoint.
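Longer scripts are easier to manage when you keep the dialogue separate from the voice assignment. As a local sketch (the `ScriptLine` type and `toTurns` helper are not part of @speech-sdk/core), you can map speaker labels to the voices used above and generate the `turns` array:

```typescript
type ScriptLine = { speaker: "A" | "B"; text: string };

// Fixed voice per speaker, mirroring the two voices in the example above.
const voices: Record<ScriptLine["speaker"], string> = {
  A: "alloy",
  B: "shimmer",
};

// Build the turns array expected by the conversation request from a script.
function toTurns(script: ScriptLine[]) {
  return script.map(({ speaker, text }) => ({
    model: "openai/gpt-4o-mini-tts",
    voice: voices[speaker],
    text,
  }));
}
```

Swapping a voice or model then happens in one place instead of on every turn.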
6. Try word-level timestamps
TypeScript:

```typescript
import { generateSpeech } from "@speech-sdk/core";

const result = await generateSpeech({
  apiKey: process.env.SPEECHBASE_API_KEY,
  model: "openai/gpt-4o-mini-tts",
  voice: "alloy",
  text: "Word level timing for every synthesis.",
  timestamps: true,
});

result.timestamps;
// [{ text: "Word", start: 0.04, end: 0.31 }, ...]
```

curl:

```shell
curl -X POST https://api.speechbase.ai/v1/audio/speech/with-timestamps \
  -H "Authorization: Bearer $SPEECHBASE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "inline",
    "text": "Word level timing for every synthesis.",
    "model": "openai/gpt-4o-mini-tts",
    "voice": "alloy"
  }'
```

Each entry in timestamps is { text, start, end } in seconds. See
Captions and lip-sync timing.
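The most common use for these entries is captions. As a minimal sketch, assuming only the `{ text, start, end }` entry shape shown above (the `toSrt` helper itself is local, not part of the SDK), you can group words into SRT cues:

```typescript
type WordTimestamp = { text: string; start: number; end: number };

// Convert word-level timestamps into SRT caption cues, a fixed number of
// words per cue. Start/end are in seconds, per the response shape above.
function toSrt(words: WordTimestamp[], wordsPerCue = 6): string {
  const pad = (n: number, w: number) => String(n).padStart(w, "0");
  // SRT timestamps are HH:MM:SS,mmm.
  const stamp = (s: number) => {
    const ms = Math.round(s * 1000);
    return (
      `${pad(Math.floor(ms / 3600000), 2)}:` +
      `${pad(Math.floor(ms / 60000) % 60, 2)}:` +
      `${pad(Math.floor(ms / 1000) % 60, 2)},${pad(ms % 1000, 3)}`
    );
  };
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    cues.push(
      `${cues.length + 1}\n` +
        `${stamp(group[0].start)} --> ${stamp(group[group.length - 1].end)}\n` +
        group.map((w) => w.text).join(" "),
    );
  }
  return cues.join("\n\n") + "\n";
}
```

Feeding `result.timestamps` through `toSrt` yields a .srt file most players accept; tune `wordsPerCue` (or group on pauses between `end` and the next `start`) to taste.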
What's next
Explore the guides that match what you're building, or jump to the full API Reference for exact request/response shapes.