# Pronunciations

Word-substitution rules grouped into dictionaries — fix mispronounced brand names, acronyms, and terms of art before synthesis.
A pronunciation rule rewrites text before it reaches the provider. "Speechbase" → "wave-form". "AGI" → "A G I". "jq" → "jay-cue". Set them once; they're applied automatically on every synthesis.
Speechbase groups rules into pronunciation dictionaries: named sets you can apply org-wide by default or opt into per request.
## Why dictionaries?
A flat list of rules works fine until you have more than one product. The
moment you have an investor podcast voice and a customer-support bot voice,
you want different shorthand: the support bot says "AWS" ten times an hour
and you want it spelled out, but the podcast host says "AWS" once an episode
and a flat spell-out sounds wooden.
Dictionaries let you scope rules. The shape of the system:
- Every org has one auto-created default dictionary that always applies.
- You can create additional dictionaries (e.g. "Brand terms", "Engineering acronyms", "Spanish names") and apply them per request.
- A single request can additionally apply up to 20 dictionaries by ID and 200 inline ad-hoc rules.
## The data model
| Entity | Notes |
|---|---|
| `pronunciation_dictionaries` | `id`, `org_id`, `name`, `description`, `is_default`. One row per dictionary; exactly one row per org has `is_default = true` and is auto-created on first use. |
| `pronunciations` | The rules. `id`, `dictionary_id`, `word`, `replacement`, `case_sensitive`. Each rule belongs to exactly one dictionary and cascade-deletes with it. |
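The relationships above can be sketched as an in-memory model. This is illustrative only: the dataclasses mirror the column names in the table, and `delete_dictionary` is a hypothetical helper showing the cascade-delete and default-protection behaviour, not Speechbase code.

```python
from dataclasses import dataclass


@dataclass
class PronunciationDictionary:
    id: str
    org_id: str
    name: str
    description: str = ""
    is_default: bool = False


@dataclass
class Pronunciation:
    id: str
    dictionary_id: str
    word: str
    replacement: str
    case_sensitive: bool = False


def delete_dictionary(dicts, rules, dict_id):
    """Cascade-delete: removing a dictionary removes its rules.

    The org default is protected, mirroring the API's DELETE behaviour.
    """
    target = next(d for d in dicts if d.id == dict_id)
    if target.is_default:
        raise ValueError("the org default dictionary cannot be deleted")
    dicts.remove(target)
    rules[:] = [r for r in rules if r.dictionary_id != dict_id]
```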
## Resolution order
When a synthesis request lands, Speechbase builds the rule map in this order
(later sources overwrite earlier ones on the same word):
- Org default dictionary — always applies.
- Caller dictionaries — `pronunciations.dictionaryIds` from the request, in the order you listed them.
- Inline rules — `pronunciations.rules` from the request body.
Inline rules win over caller dictionaries, and caller dictionaries win over the org default. The merged map is what gets substituted into your text.
`case_sensitive` is per-rule. The lookup key is the word lowercased; `case_sensitive: true` rules also enforce an exact case match before substituting.
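The merge order and the case-sensitivity check can be sketched in a few lines. This is a simplified model, not Speechbase's implementation: rules are plain `(word, replacement, case_sensitive)` tuples and matching is naive whitespace tokenisation, so punctuation handling is ignored.

```python
def build_rule_map(default_rules, caller_dicts, inline_rules):
    """Merge rules in resolution order; later sources overwrite
    earlier ones on the same word. Keys are lowercased words."""
    merged = {}
    for source in [default_rules, *caller_dicts, inline_rules]:
        for word, replacement, case_sensitive in source:
            merged[word.lower()] = (word, replacement, case_sensitive)
    return merged


def substitute(text, rule_map):
    """Apply the merged map token by token; case-sensitive rules
    also require an exact case match before substituting."""
    out = []
    for token in text.split():
        rule = rule_map.get(token.lower())
        if rule is None:
            out.append(token)
            continue
        word, replacement, case_sensitive = rule
        if case_sensitive and token != word:
            out.append(token)  # case mismatch: leave untouched
        else:
            out.append(replacement)
    return " ".join(out)
```

Note how the caller dictionary's "AWS" rule would shadow the default's, and an inline rule would shadow both, matching the precedence described above.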
## What happens at request time
```
input text
 └── substitute (rule map applied to text)
      └── moderate (substituted text checked against your policy)
           └── synthesise (substituted text sent to provider)
                └── inverse-align timestamps (offsets mapped back to original text)
```

A few details:
- Substitution runs before moderation. Your moderation policy sees what the model will actually say, not what the user typed. A rule that introduces a banned word will trip moderation; a rule that removes one legitimately rewrites past it.
- Timestamps reference the original text. `with-timestamps` endpoints return offsets aligned to the input text you sent — substitution is invisible to your caption / karaoke code. For conversations, this happens per `turnIndex`.
- Inline rule contents are redacted from request logs. Dictionary IDs applied are logged for auditability; the literal `word`/`replacement` pairs you sent inline are not.
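The inverse-alignment step can be illustrated with a toy substitution that records output-to-source spans, then maps an offset in the substituted text back to the original. A minimal sketch under those assumptions; the real aligner is not specified on this page.

```python
def substitute_with_map(text, rules):
    """Apply literal substitutions left to right, recording
    (out_start, out_end, src_start, src_end) spans so offsets in the
    substituted text can be mapped back to the original. `rules` maps
    exact source strings to replacements (a simplification)."""
    out, spans = [], []
    i = pos = 0  # i: index into source; pos: length of output so far
    while i < len(text):
        for word, repl in rules.items():
            if text.startswith(word, i):
                spans.append((pos, pos + len(repl), i, i + len(word)))
                out.append(repl)
                pos += len(repl)
                i += len(word)
                break
        else:
            out.append(text[i])
            pos += 1
            i += 1
    return "".join(out), spans


def to_original_offset(offset, spans):
    """Map an offset in the substituted text back to the original."""
    shift = 0
    for out_s, out_e, src_s, src_e in spans:
        if offset >= out_e:
            shift += (src_e - src_s) - (out_e - out_s)
        elif offset >= out_s:
            return src_s  # inside a replacement: snap to the source word
    return offset + shift
```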
## REST API
The full CRUD lives at `/v1/pronunciation-dictionaries` and `/v1/pronunciation-dictionaries/{id}/rules`. See Pronunciation Dictionaries in the API reference for exact shapes. Headlines:
| Method | Path | Purpose |
|---|---|---|
| GET | `/v1/pronunciation-dictionaries` | List dictionaries with rule counts. |
| POST | `/v1/pronunciation-dictionaries` | Create a new dictionary. |
| GET | `/v1/pronunciation-dictionaries/{id}` | Fetch one. |
| PATCH | `/v1/pronunciation-dictionaries/{id}` | Rename / re-describe. |
| DELETE | `/v1/pronunciation-dictionaries/{id}` | Cascade-deletes rules. The org default dictionary cannot be deleted. |
| GET | `/v1/pronunciation-dictionaries/{id}/rules` | List rules in a dictionary. |
| POST | `/v1/pronunciation-dictionaries/{id}/rules` | Add a rule. |
| PATCH | `/v1/pronunciation-dictionaries/{id}/rules/{ruleId}` | Update a rule. |
| DELETE | `/v1/pronunciation-dictionaries/{id}/rules/{ruleId}` | Remove a rule. |
The default dictionary is created lazily on first read; you don't need to create it. You also can't `DELETE` it — `is_default = true` rows are protected.
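Using only the paths in the table, a client call might be built like this. The base URL, bearer-token auth scheme, and request-body field names are assumptions not confirmed on this page (though `caseSensitive` follows the synthesis example below); the helpers only construct the requests and do not send them.

```python
import json
import urllib.request

BASE = "https://api.speechbase.example"  # placeholder base URL (assumption)


def create_dictionary_request(api_key, name, description=""):
    """Build (but don't send) the POST that creates a dictionary."""
    body = json.dumps({"name": name, "description": description}).encode()
    return urllib.request.Request(
        f"{BASE}/v1/pronunciation-dictionaries",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def add_rule_request(api_key, dictionary_id, word, replacement,
                     case_sensitive=False):
    """Build the POST that adds a rule to an existing dictionary."""
    body = json.dumps({
        "word": word,
        "replacement": replacement,
        "caseSensitive": case_sensitive,
    }).encode()
    return urllib.request.Request(
        f"{BASE}/v1/pronunciation-dictionaries/{dictionary_id}/rules",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Pass either request to `urllib.request.urlopen` (or port the shape to your HTTP client of choice) to actually send it.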
## Applying dictionaries in a synthesis call
The request body for `POST /v1/audio/speech` (and friends) accepts an optional `pronunciations` field:
```json
{
  "mode": "voice",
  "voiceId": "01940f8a-2dc1-7000-9b6c-fc6dd8a0a4d2",
  "text": "Welcome to Speechbase. Today we're talking about kubectl.",
  "pronunciations": {
    "dictionaryIds": [
      "01940f8a-7c11-7000-9000-fc6dd8a0a4d2",
      "01940f8a-9d22-7000-9100-fc6dd8a0a4d2"
    ],
    "rules": [
      { "word": "Saoirse", "replacement": "Seer-shuh", "caseSensitive": false }
    ]
  }
}
```

Both fields are optional. With neither, only the org default dictionary applies. Limits:

- `dictionaryIds`: up to 20.
- `rules`: up to 200 inline rules per request.
For `POST /v1/audio/conversation`, `pronunciations` is top-level only — a single block applies to every turn. Per-turn `pronunciations` are rejected at parse time so the rule set is unambiguous across turns.
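A client-side check for the documented limits might look like this hypothetical validator (the server enforces the real rules at parse time; this just catches mistakes before the round trip):

```python
def validate_pronunciations(body, *, is_conversation=False, turns=()):
    """Check the documented limits: up to 20 dictionary IDs and 200
    inline rules per request, and no per-turn pronunciations blocks
    in conversation requests. Purely illustrative."""
    p = body.get("pronunciations", {})
    errors = []
    if len(p.get("dictionaryIds", [])) > 20:
        errors.append("too many dictionaryIds (max 20)")
    if len(p.get("rules", [])) > 200:
        errors.append("too many inline rules (max 200)")
    if is_conversation:
        for i, turn in enumerate(turns):
            if "pronunciations" in turn:
                errors.append(f"turn {i}: per-turn pronunciations are rejected")
    return errors
```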
## When to reach for what
| You want | Use |
|---|---|
| A rule for every voice, every request, forever | The org default dictionary. |
| A rule set you toggle on per use case (e.g. "legal" content vs casual) | Create a dictionary, pass `dictionaryIds` per request. |
| One-off fix for a single request (test data, user-supplied terms) | `rules` (inline). |
Rules of thumb:
- Push fixed knowledge into dictionaries. Brand names, internal jargon, recurring proper nouns. Inline rules per request hide these in caller code.
- Use request dictionaries for context. If "AWS" should be read differently in support audio than in an investor podcast, create separate dictionaries and pass the right ID for that request.
- Reach for SSML or `provider_options` when phoneme control matters. Pronunciation rules are literal substring substitution, not phonetic control. If a provider supports SSML or per-word phoneme tags natively, pass those through `provider_options` for finer-grained results.
## Dashboard
Pronunciations in the dashboard lists every dictionary in your org, including the default. Click a dictionary to manage its rules — add, edit, delete, search.
Word-level timestamps
How Speechbase produces word-level alignment for synthesised audio — using native provider support when available, and falling back to STT when not.
Moderation
How Speechbase screens synthesis requests with category-based moderation and org-defined custom rules, with fail-open and fail-closed semantics.