Speechbase

Pronunciations

Word-substitution rules grouped into dictionaries — fix mispronounced brand names, acronyms, and terms of art before synthesis.

A pronunciation rule rewrites text before it reaches the provider: "Speechbase" → "wave-form", "AGI" → "A G I", "jq" → "jay-cue". Set rules once and they are applied automatically on every synthesis.

Speechbase groups rules into pronunciation dictionaries: named sets you can apply org-wide by default or opt into per request.

Why dictionaries?

A flat list of rules works fine until you have more than one product. The moment you have an investor podcast voice and a customer-support bot voice, you want different shorthand: the support bot says "AWS" ten times an hour and you want it spelled out, but the podcast host says "AWS" once an episode and a flat spell-out sounds wooden.

Dictionaries let you scope rules. The shape of the system:

  • Every org has one auto-created default dictionary that always applies.
  • You can create additional dictionaries (e.g. "Brand terms", "Engineering acronyms", "Spanish names") and apply them per request.
  • A single request can additionally apply up to 20 dictionaries by ID and 200 inline ad-hoc rules.

The data model

Entity | Notes
pronunciation_dictionaries | id, org_id, name, description, is_default. One row per dictionary; exactly one row per org has is_default = true, and it is auto-created on first use.
pronunciations | The rules: id, dictionary_id, word, replacement, case_sensitive. Each rule belongs to exactly one dictionary and is cascade-deleted with it.
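To make the two entities concrete, here is a minimal in-memory sketch. The field names come from the table above; the Python types, the dataclass representation, and the `delete_dictionary` helper are illustrative assumptions, not Speechbase code. The helper encodes two documented invariants: deleting a dictionary cascades to its rules, and the org default dictionary is protected.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PronunciationDictionary:
    id: str
    org_id: str
    name: str
    description: str
    is_default: bool  # exactly one dictionary per org has this set


@dataclass(frozen=True)
class Pronunciation:
    id: str
    dictionary_id: str  # every rule belongs to exactly one dictionary
    word: str
    replacement: str
    case_sensitive: bool


def delete_dictionary(
    dicts: List[PronunciationDictionary],
    rules: List[Pronunciation],
    dictionary_id: str,
) -> Tuple[List[PronunciationDictionary], List[Pronunciation]]:
    """Hypothetical helper: drop a dictionary and cascade-delete its rules."""
    target = next((d for d in dicts if d.id == dictionary_id), None)
    if target is None:
        raise KeyError(dictionary_id)
    if target.is_default:
        raise PermissionError("the org default dictionary cannot be deleted")
    remaining_dicts = [d for d in dicts if d.id != dictionary_id]
    remaining_rules = [r for r in rules if r.dictionary_id != dictionary_id]
    return remaining_dicts, remaining_rules
```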

Resolution order

When a synthesis request lands, Speechbase builds the rule map in this order (later sources overwrite earlier ones on the same word):

  1. Org default dictionary — always applies.
  2. Caller dictionaries: pronunciations.dictionaryIds from the request, in the order you listed them.
  3. Inline rules: pronunciations.rules from the request body.

Inline rules win over caller dictionaries, and caller dictionaries win over the org default. The merged map is what gets substituted into your text.
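The resolution order amounts to writing every source into one map in sequence, so a later write to the same key replaces the earlier one. This is a minimal sketch assuming rules are already loaded in memory; the `Rule` dataclass and `build_rule_map` are hypothetical names, not part of the Speechbase API. Keys are lowercased, matching the lookup-key behaviour described in the next paragraph.

```python
from dataclasses import dataclass
from typing import Dict, List, Mapping, Sequence


@dataclass(frozen=True)
class Rule:
    word: str
    replacement: str
    case_sensitive: bool = False


def build_rule_map(
    org_default: Sequence[Rule],
    dictionaries: Mapping[str, Sequence[Rule]],
    dictionary_ids: Sequence[str],
    inline_rules: Sequence[Rule],
) -> Dict[str, Rule]:
    """Merge rule sources in resolution order; later sources overwrite earlier ones."""
    sources: List[Sequence[Rule]] = [org_default]
    sources.extend(dictionaries[d] for d in dictionary_ids)  # in the order listed
    sources.append(inline_rules)  # inline rules are applied last, so they win
    rule_map: Dict[str, Rule] = {}
    for source in sources:
        for rule in source:
            rule_map[rule.word.lower()] = rule  # lookup key is the lowercased word
    return rule_map
```

One consequence in this sketch: because keys are lowercased, a later rule for "aws" replaces an earlier rule for "AWS", and when two caller dictionaries define the same word, the one listed later in dictionaryIds wins.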

case_sensitive is set per rule. The lookup key is the lowercased word; rules with case_sensitive: true additionally require an exact case match before substituting.
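The case-sensitivity check can be sketched as a single regex pass over the text. This is an illustration under stated assumptions, not Speechbase's implementation: it does plain substring matching (following the "literal substring substitution" wording later on this page; the real service may also respect word boundaries), prefers the longest word when several could match at the same position, and uses a hypothetical `Rule` dataclass and `substitute` function.

```python
import re
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Rule:
    word: str
    replacement: str
    case_sensitive: bool = False


def substitute(text: str, rule_map: Dict[str, Rule]) -> str:
    """Apply a rule map keyed by lowercased word to text."""
    if not rule_map:
        return text
    # Longest keys first so e.g. "aws lambda" beats "aws" at the same position.
    keys = sorted(rule_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys), re.IGNORECASE)

    def replace(match: re.Match) -> str:
        found = match.group(0)
        rule = rule_map.get(found.lower())
        if rule is None:
            return found  # defensive: exotic Unicode case folds
        if rule.case_sensitive and found != rule.word:
            return found  # case mismatch: leave the text untouched
        return rule.replacement

    return pattern.sub(replace, text)
```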

What happens at request time

input text
  └── substitute (rule map applied to text)
        └── moderate (substituted text checked against your policy)
              └── synthesise (substituted text sent to provider)
                    └── inverse-align timestamps (offsets mapped back to original text)

A few details:

  • Substitution runs before moderation. Your moderation policy sees what the model will actually say, not what the user typed. A rule that introduces a banned word will trip moderation; a rule that removes one legitimately rewrites past it.
  • Timestamps reference the original text. The with-timestamps endpoints return offsets aligned to the input text you sent, so substitution is invisible to your caption / karaoke code. For conversations, offsets are mapped back per turnIndex.
  • Inline rule contents are redacted from request logs. Dictionary IDs applied are logged for auditability; the literal word/replacement pairs you sent inline are not.
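Mapping offsets back to the original text only requires remembering where each replacement landed. The sketch below is one simple way to do that, not a description of how Speechbase aligns timestamps: it assumes a simplified rule map of lowercased word to replacement string (case sensitivity ignored), plain substring matching, and snaps any offset that falls inside a replacement to the start of the original word. `substitute_with_spans` and `to_original` are hypothetical names.

```python
import re
from bisect import bisect_right
from typing import Dict, List, Tuple

# (orig_start, orig_end, new_start, new_end) for one replacement
Span = Tuple[int, int, int, int]


def substitute_with_spans(text: str, rule_map: Dict[str, str]) -> Tuple[str, List[Span]]:
    """Literal, case-insensitive substitution that records where each replacement landed."""
    if not rule_map:
        return text, []
    words = sorted(rule_map, key=len, reverse=True)  # prefer the longest match
    pattern = re.compile("|".join(re.escape(w) for w in words), re.IGNORECASE)
    pieces: List[str] = []
    spans: List[Span] = []
    last = 0  # end of the previous match in the original text
    new_pos = 0  # current length of the substituted text
    for m in pattern.finditer(text):
        replacement = rule_map.get(m.group(0).lower(), m.group(0))
        unchanged = text[last:m.start()]
        pieces.append(unchanged)
        new_pos += len(unchanged)
        spans.append((m.start(), m.end(), new_pos, new_pos + len(replacement)))
        pieces.append(replacement)
        new_pos += len(replacement)
        last = m.end()
    pieces.append(text[last:])
    return "".join(pieces), spans


def to_original(offset: int, spans: List[Span]) -> int:
    """Map a character offset in the substituted text back to the original text."""
    i = bisect_right([s[2] for s in spans], offset) - 1
    if i < 0:
        return offset  # before the first replacement nothing has shifted
    orig_start, orig_end, _, new_end = spans[i]
    if offset < new_end:
        return orig_start  # inside a replacement: snap to the original word's start
    return orig_end + (offset - new_end)  # after it: undo the accumulated shift
```

For example, with the rules "jq" → "jay-cue" and "aws" → "A W S", the word "today" starts at offset 21 in "Try jay-cue on A W S today" and maps back to offset 14 in "Try jq on AWS today".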

REST API

The full CRUD API lives at /v1/pronunciation-dictionaries and /v1/pronunciation-dictionaries/{id}/rules. See Pronunciation Dictionaries in the API reference for exact request and response shapes. The main operations:

Method | Path | Purpose
GET | /v1/pronunciation-dictionaries | List dictionaries with rule counts.
POST | /v1/pronunciation-dictionaries | Create a new dictionary.
GET | /v1/pronunciation-dictionaries/{id} | Fetch one dictionary.
PATCH | /v1/pronunciation-dictionaries/{id} | Rename or update the description.
DELETE | /v1/pronunciation-dictionaries/{id} | Delete a dictionary and cascade-delete its rules. The org default dictionary cannot be deleted.
GET | /v1/pronunciation-dictionaries/{id}/rules | List rules in a dictionary.
POST | /v1/pronunciation-dictionaries/{id}/rules | Add a rule.
PATCH | /v1/pronunciation-dictionaries/{id}/rules/{ruleId} | Update a rule.
DELETE | /v1/pronunciation-dictionaries/{id}/rules/{ruleId} | Remove a rule.

The default dictionary is created lazily on first read; you don't need to create it. You also can't DELETE it — is_default = true rows are protected.
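As a rough illustration of calling these endpoints, the sketch below builds (but does not send) requests with Python's standard `urllib.request`. The base URL, the bearer-token `Authorization` header, and the JSON field names for creating dictionaries and rules are assumptions (the rule fields mirror the synthesis request body); check the API reference for the real shapes. All helper names are hypothetical.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://api.speechbase.example"  # hypothetical placeholder


def build_request(
    method: str, path: str, token: str, body: Optional[dict] = None
) -> urllib.request.Request:
    """Build a JSON request; auth scheme is an assumption."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def create_dictionary(token: str, name: str, description: str = "") -> urllib.request.Request:
    body = {"name": name, "description": description}  # assumed field names
    return build_request("POST", "/v1/pronunciation-dictionaries", token, body)


def add_rule(
    token: str, dictionary_id: str, word: str, replacement: str, case_sensitive: bool = False
) -> urllib.request.Request:
    body = {"word": word, "replacement": replacement, "caseSensitive": case_sensitive}
    path = f"/v1/pronunciation-dictionaries/{dictionary_id}/rules"
    return build_request("POST", path, token, body)


def delete_rule(token: str, dictionary_id: str, rule_id: str) -> urllib.request.Request:
    path = f"/v1/pronunciation-dictionaries/{dictionary_id}/rules/{rule_id}"
    return build_request("DELETE", path, token)


# To actually send one: urllib.request.urlopen(req), once BASE_URL and the token are real.
```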

Applying dictionaries in a synthesis call

The request body for POST /v1/audio/speech (and the other synthesis endpoints) accepts an optional pronunciations field:

{
  "mode": "voice",
  "voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
  "text": "Welcome to Speechbase. Today we're talking about kubectl.",
  "pronunciations": {
    "dictionaryIds": [
      "01940f8a-7c11-7000-9000-fc6dd8a0a4d2",
      "01940f8a-9d22-7000-9100-fc6dd8a0a4d2"
    ],
    "rules": [
      { "word": "Saoirse", "replacement": "Seer-shuh", "caseSensitive": false }
    ]
  }
}

Both fields are optional. With neither, only the org default dictionary applies. Limits:

  • dictionaryIds: up to 20.
  • rules: up to 200 inline rules per request.
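A small client-side helper can assemble this body and check the documented limits before the request leaves your process. `build_speech_body` is a hypothetical helper, not part of a Speechbase SDK; it copies `"mode": "voice"` from the example above and takes inline rules as `(word, replacement, case_sensitive)` tuples for brevity.

```python
from typing import Any, Dict, Optional, Sequence, Tuple

MAX_DICTIONARY_IDS = 20  # per-request limit stated on this page
MAX_INLINE_RULES = 200  # per-request limit stated on this page


def build_speech_body(
    voice_id: str,
    text: str,
    dictionary_ids: Optional[Sequence[str]] = None,
    rules: Optional[Sequence[Tuple[str, str, bool]]] = None,
) -> Dict[str, Any]:
    """Assemble a POST /v1/audio/speech body, checking pronunciation limits first."""
    ids = list(dictionary_ids or [])
    inline = list(rules or [])
    if len(ids) > MAX_DICTIONARY_IDS:
        raise ValueError(f"at most {MAX_DICTIONARY_IDS} dictionaryIds per request")
    if len(inline) > MAX_INLINE_RULES:
        raise ValueError(f"at most {MAX_INLINE_RULES} inline rules per request")

    body: Dict[str, Any] = {"mode": "voice", "voiceId": voice_id, "text": text}
    pronunciations: Dict[str, Any] = {}
    if ids:
        pronunciations["dictionaryIds"] = ids
    if inline:
        pronunciations["rules"] = [
            {"word": w, "replacement": r, "caseSensitive": cs} for w, r, cs in inline
        ]
    if pronunciations:  # when omitted, only the org default dictionary applies
        body["pronunciations"] = pronunciations
    return body
```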

For POST /v1/audio/conversation, pronunciations is top-level only — a single block applies to every turn. Per-turn pronunciations are rejected at parse time so the rule set is unambiguous across turns.
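A client can catch the per-turn mistake before sending. This page doesn't show the conversation body, so the `turns` field and the per-turn `text` key in this sketch are assumptions, and `build_conversation_body` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional


def build_conversation_body(
    turns: List[Dict[str, Any]],
    pronunciations: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble a conversation body with a single top-level pronunciations block."""
    for index, turn in enumerate(turns):
        if "pronunciations" in turn:
            # Mirrors the documented parse-time rejection of per-turn pronunciations.
            raise ValueError(f"turn {index}: pronunciations is only allowed at the top level")
    body: Dict[str, Any] = {"turns": turns}  # "turns" is an assumed field name
    if pronunciations:
        body["pronunciations"] = pronunciations  # one block, applied to every turn
    return body
```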

When to reach for what

You want | Use
A rule for every voice, every request, forever | The org default dictionary.
A rule set you toggle on per use case (e.g. "legal" content vs casual) | Create a dictionary and pass its ID in dictionaryIds per request.
One-off fix for a single request (test data, user-supplied terms) | rules (inline).

Rules of thumb:

  • Push fixed knowledge into dictionaries. Brand names, internal jargon, recurring proper nouns. Inline rules per request hide these in caller code.
  • Use request dictionaries for context. If "AWS" should be read differently in support audio than in an investor podcast, create separate dictionaries and pass the right ID for that request.
  • Reach for SSML or provider_options when phoneme control matters. Pronunciation rules are literal substring substitution, not phonetic control. If a provider supports SSML or per-word phoneme tags natively, pass those through provider_options for finer-grained results.

Dashboard

Pronunciations in the dashboard lists every dictionary in your org, including the default. Click a dictionary to manage its rules — add, edit, delete, search.
