The Prod Stack
for AI Audio

Speechbase is the foundation for AI teams to build faster with generative audio. Speech Gateway, Observability, Pronunciations, and Voice Management, all in one platform.

End-to-end Audio Orchestration and Provider Management

The Stack

Speech Gateway

One API to connect your app to every TTS provider.

Speechbase gives AI teams a single, universal API across 16 text-to-speech providers, plus the observability, voice tooling, and governance that applications running in production actually need.

Your Application

app.tts

Agents · IVR · Podcasts · Audiobooks · Avatars · Voice UX

Speechbase

>Speech Gateway

>Observability

>Pronunciations

>Voice Management

>Moderation

16 TTS Providers

Bring your own keys

Or managed routing

One invoice

The Platform

The foundation of a production audio stack.

06 primitives

A single voice library across multiple platforms.

Save voices once, reference them by name, and reuse them across providers and models. No more juggling opaque voice IDs across 16 dashboards.

Cross-provider voice library
Preview voices in the playground
Reference voices globally by alias or ID
Create voice clones (coming soon)

Read the voice management docs

Create multispeaker conversations.

Generate conversations between multiple voices, even when each speaker uses a different provider or model. Speechbase stitches turns into a single audio file, normalizes volume levels, and keeps word-level timestamps in sync.

Zero manual stitching
Mix voices, models, and providers by speaker
Volume levels normalized across turns
Per-turn timestamps and word captions

Read the generate dialogue docs

Universal Word-Level Timestamps.

Timestamp support is messy across providers: missing on some models, character-level on others, and rarely consistent. Speechbase always returns word-level timestamps, using native timing when available and timestamp fallback when it is not.

Word-level start/end on every synthesis
Native on supported providers; timestamp fallback for the rest
Convert to SRT or WebVTT captions
Per-turn attribution in conversations

Read the timestamps docs
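
As a rough illustration of the caption conversion listed above, the sketch below turns word-level timings into a WebVTT document. The `WordTiming` shape and the `wordsToWebVtt` helper are assumptions made for this example, not the Speechbase API; the real field names may differ.

```typescript
// Hypothetical word-timing shape; the real SDK's field names may differ.
interface WordTiming {
  text: string;
  startMs: number;
  endMs: number;
}

// Format milliseconds as a WebVTT timestamp: HH:MM:SS.mmm
function formatVttTime(ms: number): string {
  const total = Math.max(0, Math.round(ms));
  const hours = Math.floor(total / 3_600_000);
  const minutes = Math.floor((total % 3_600_000) / 60_000);
  const seconds = Math.floor((total % 60_000) / 1000);
  const millis = total % 1000;
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)}.${pad(millis, 3)}`;
}

// Group words into cues of at most `wordsPerCue` words and emit a WebVTT document.
function wordsToWebVtt(words: WordTiming[], wordsPerCue = 7): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    const start = formatVttTime(group[0].startMs);
    const end = formatVttTime(group[group.length - 1].endMs);
    cues.push(`${start} --> ${end}\n${group.map((w) => w.text).join(" ")}`);
  }
  return ["WEBVTT", ...cues].join("\n\n") + "\n";
}
```

SRT output would follow the same grouping; it differs mainly in numbering each cue and using a comma before the milliseconds.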

Open by design

Open SDK. No lock-in.

Swap providers with a string change. Apache 2.0, runs anywhere, and pairs with the hosted Speech Gateway, Observability, Pronunciations, and Voice Management when production catches up.

$ npm install @speech-sdk/core

generate-conversation.ts

import { generateConversation } from "@speech-sdk/core";

const result = await generateConversation({
  turns: [
    {
      model: "elevenlabs/eleven_v3",
      voice: "EXAVITQu4vr4xnSDxMaL",
      text: "Hello from the SDK.",
    },
    {
      model: "google/gemini-3.1-flash-tts-preview",
      voice: "Kore",
      text: "One call. Multiple voices. Auto-leveled.",
    },
  ],
});

result.audio.uint8Array; // Uint8Array
result.audio.mediaType;  // "audio/mpeg"

Multi-speaker dialogue

Conversation

One call returns the full multi-turn script as a single volume-leveled file. Mix providers per turn, get per-turn timestamps, skip the stitching code.

Streaming by default

streamSpeech

Audio streams as it generates via a standard Web ReadableStream. Pipe straight into a Response for low-latency playback in Node, Edge, or browser.
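
A minimal sketch of the "pipe straight into a Response" step, assuming the audio arrives as a `ReadableStream<Uint8Array>`. The `streamSpeech` signature is not shown on this page, so the hypothetical `audioResponse` helper takes the stream as an argument rather than guessing at the call.

```typescript
// Wrap an audio byte stream in a standard Response so playback can begin
// before synthesis finishes. Assumes a runtime with Web Streams and Fetch
// globals (Node 18+, edge runtimes, browsers).
function audioResponse(
  stream: ReadableStream<Uint8Array>,
  mediaType = "audio/mpeg",
): Response {
  return new Response(stream, {
    headers: {
      "Content-Type": mediaType,
      // Generated audio is usually request-specific, so skip caching here.
      "Cache-Control": "no-store",
    },
  });
}
```

In a route handler, the return value of `audioResponse` can be returned directly; the runtime forwards chunks to the client as they arrive.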

Universal audio tags

[laugh]

Write [laugh] once. Depending on what the provider supports, the SDK passes the tag through, translates it to SSML, or strips it with a warning. The syntax stays the same across every provider.
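
The three outcomes can be pictured with a small sketch. Everything here is hypothetical: the `TagStrategy` type, the `applyAudioTags` helper, and the tag-to-SSML map are illustrations of the idea, not how the SDK is implemented.

```typescript
// Hypothetical strategies; the real SDK decides these per provider and model.
type TagStrategy =
  | { kind: "passthrough" }
  | { kind: "ssml"; map: Record<string, string> } // tag name -> SSML snippet
  | { kind: "strip" };

interface TagResult {
  text: string;
  warnings: string[];
}

// Matches bracketed tags such as [laugh].
const TAG_PATTERN = /\[([a-z_]+)\]/g;

function applyAudioTags(input: string, strategy: TagStrategy): TagResult {
  const warnings: string[] = [];
  if (strategy.kind === "passthrough") {
    return { text: input, warnings }; // provider understands the tag natively
  }
  const text = input
    .replace(TAG_PATTERN, (match: string, name: string) => {
      const replacement = strategy.kind === "ssml" ? strategy.map[name] : undefined;
      if (replacement !== undefined) return replacement;
      warnings.push(`Audio tag ${match} is not supported and was removed.`);
      return "";
    })
    .replace(/\s{2,}/g, " ") // collapse the gap a removed tag leaves behind
    .trim();
  return { text, warnings };
}
```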

Speed without pitch shift

0.75 → 1.5×

Pitch-preserving WSOLA time-stretch on mono PCM. Timestamps and audioDurationMs auto-scale by 1/speed so timings stay accurate.
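
The 1/speed rescaling is simple arithmetic: audio played at `speed`× moves every instant from t to t / speed. A sketch, assuming a hypothetical `WordTiming` shape and `scaleTimings` helper (only `audioDurationMs` is named on this page):

```typescript
// Hypothetical word-timing shape; the real SDK's field names may differ.
interface WordTiming {
  text: string;
  startMs: number;
  endMs: number;
}

// Rescale timings after a time-stretch so captions stay aligned with the audio.
function scaleTimings(
  words: WordTiming[],
  audioDurationMs: number,
  speed: number,
): { words: WordTiming[]; audioDurationMs: number } {
  if (!(speed > 0)) throw new RangeError("speed must be positive");
  return {
    words: words.map((w) => ({
      ...w,
      startMs: w.startMs / speed,
      endMs: w.endMs / speed,
    })),
    audioDurationMs: audioDurationMs / speed,
  };
}
```

At 1.25× speed, a word that originally spanned 500 ms to 1000 ms lands at 400 ms to 800 ms, and a 2000 ms clip becomes 1600 ms.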

Unicode-aware auto-chunking

Long-form

Long inputs split on sentence boundaries (ASCII, CJK, Devanagari, Arabic) into evenly sized chunks, then stitch into one file so prosody stays continuous.
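
One way to approximate this kind of chunking is sketched below. It is not the SDK's algorithm: the terminator set, the `splitSentences` and `chunkText` helpers, and the even-size heuristic are all assumptions, and the regex will also split on periods inside numbers or abbreviations.

```typescript
// Sentence terminators: ASCII (. ! ?), CJK (。！？), Devanagari danda (।),
// Arabic question mark (؟) and Arabic full stop (۔).
const SENTENCE_END = /[^.!?。！？।؟۔]+[.!?。！？।؟۔]+\s*|[^.!?。！？।؟۔]+$/g;

function splitSentences(text: string): string[] {
  return text.match(SENTENCE_END) ?? [];
}

// Pack whole sentences into chunks of at most `maxChars`, aiming for roughly
// even chunk sizes. A single sentence longer than `maxChars` is kept whole.
function chunkText(text: string, maxChars: number): string[] {
  const sentences = splitSentences(text);
  const total = sentences.reduce((sum, s) => sum + s.length, 0);
  const chunkCount = Math.max(1, Math.ceil(total / maxChars));
  const target = total / chunkCount; // even share of the text per chunk

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const wouldOverflow = current.length + sentence.length > maxChars;
    if (current !== "" && (wouldOverflow || current.length >= target)) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim() !== "") chunks.push(current.trim());
  return chunks;
}
```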

Reliable by default

Auto-retry

Jittered backoff on 5xx + 429. Retry-After honored. RFC 7807 errors with stable codes. Retry logic stays a one-liner.
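
For readers curious what this policy looks like in code, here is a sketch built on the standard `fetch` API. It illustrates the described behavior (retry on 5xx and 429, honor Retry-After, otherwise jittered exponential backoff); the helper names, the default delays, and the choice of full jitter are assumptions, not the SDK's internals. Network-level failures, where `fetch` itself throws, are not retried here.

```typescript
// Parse Retry-After as delta-seconds or an HTTP date; returns ms, or null if absent/invalid.
function retryAfterMs(header: string | null, now = Date.now()): number | null {
  if (header === null) return null;
  const seconds = Number(header);
  if (header.trim() !== "" && Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const date = Date.parse(header);
  return Number.isNaN(date) ? null : Math.max(0, date - now);
}

function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Full-jitter backoff: a uniform random delay in [0, min(cap, base * 2^attempt)).
function backoffMs(attempt: number, baseMs = 250, capMs = 8000, random = Math.random): number {
  return random() * Math.min(capMs, baseMs * 2 ** attempt);
}

async function fetchWithRetry(url: string, init?: RequestInit, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, init);
    if (!isRetryable(res.status) || attempt >= maxRetries) return res;
    // Prefer the server's hint; fall back to jittered backoff.
    const delay = retryAfterMs(res.headers.get("Retry-After")) ?? backoffMs(attempt);
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```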

Ready when you are

Promote your speech stack to production.

10 million free characters a month. Every TTS provider. No credit card to start.

Every request, in one place.

Get names, brands, and acronyms right.

Keep production output on-brand.
