Writing a WSOLA time-stretcher for the Speech SDK
- engineering
- audio
- sdk
Play 1.25 seconds of audio in 1 second by speeding it up, and every frequency rises by 25%. Voices sound cartoonish. The fix isn't "play it faster" — it's "change the duration without changing the pitch," a different operation entirely.
The textbook algorithm for doing this on speech is WSOLA: Waveform Similarity Overlap-Add. We wrote a pure-TypeScript implementation with zero dependencies that runs identically in the browser and in Node.
Why we wrote it
We needed speed control on generateSpeech in the Speech SDK. That meant pitch-preserving time-stretch on raw Int16 mono PCM, the format we already pass between layers.
Constraints:
- Runs identically in browser and Node. No AudioContext, no native bindings.
- Zero dependencies. The SDK ships with no transitive deps and we wanted to keep it that way.
- Apache-2.0–compatible.
- Operates on raw Int16 mono PCM.
Every option failed at least one. SoundTouch.js is LGPL. Rubberband is GPL. Most browser-side stretchers depend on Web Audio APIs that don't exist in Node. The pure-JS attempts we found were either abandoned, FFT phase-vocoder-based (heavier, and prone to phasiness on speech), or shipped large WASM bundles.
So we wrote our own. The TypeScript implementation came out to a few hundred lines.
What WSOLA does
WSOLA stretches audio by overlap-adding analysis frames at output positions. The trick: instead of grabbing each frame at a fixed input offset, it searches a small window around the expected offset and picks the frame whose overlap region best matches what's already in the output. The match is scored by normalized cross-correlation, so amplitude differences don't dominate.
For each synthesis hop:
- Compute the expected input position from the speed ratio.
- Search ±10ms around it. Score each candidate by normalized cross-correlation against the output so far.
- Hann-window the chosen frame and overlap-add it into the output.
// Defaults at 24kHz:
// windowSize = 30ms (720 samples)
// synthesisHop = windowSize / 4 (75% overlap)
// searchRadius = 10ms (240 samples)
// correlationStep = 8 samples
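Sketched in plain TypeScript, the loop looks roughly like this. This is an illustrative sketch, not the SDK's code: the similarity search is stubbed out as a clamp into a forward-moving range, and the input is assumed long enough to skip the short-input fallback.

```typescript
// Sketch of the WSOLA overlap-add loop on Float32Array mono audio.
function hannWindow(n: number): Float32Array {
  const w = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    w[i] = 0.5 * (1 - Math.cos((2 * Math.PI * i) / (n - 1)));
  }
  return w;
}

function wsolaSketch(
  input: Float32Array,
  speed: number,      // > 1 shortens the audio, < 1 lengthens it
  windowSize = 720,   // 30 ms at 24 kHz
  synthesisHop = 180, // windowSize / 4 => 75% overlap
  searchRadius = 240, // 10 ms at 24 kHz
): Float32Array {
  const outLen = Math.floor(input.length / speed);
  const out = new Float32Array(outLen);
  const norm = new Float32Array(outLen); // sum of window weights per sample
  const win = hannWindow(windowSize);

  let prevIn = -1;
  for (let outPos = 0; outPos + windowSize <= outLen; outPos += synthesisHop) {
    // 1. Expected input position from the cumulative speed ratio.
    const expected = Math.round(outPos * speed);
    // 2. Placeholder for the similarity search: clamp into a
    //    forward-moving, in-bounds range [lo, hi].
    const lo = Math.max(prevIn + 1, expected - searchRadius);
    const hi = Math.max(lo, Math.min(input.length - windowSize, expected + searchRadius));
    const inPos = Math.min(Math.max(expected, lo), hi);
    prevIn = inPos;
    // 3. Hann-window the chosen frame and overlap-add into the output.
    for (let i = 0; i < windowSize; i++) {
      out[outPos + i] += input[inPos + i] * win[i];
      norm[outPos + i] += win[i];
    }
  }
  // Normalize by the accumulated window weight.
  for (let i = 0; i < outLen; i++) {
    if (norm[i] > 1e-6) out[i] /= norm[i];
  }
  return out;
}
```

The normalization pass divides out the Hann-window overlap so output amplitude stays flat regardless of hop size.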
The search is two-stage: a coarse stride of correlationStep samples across the full radius, then a ±3-sample refinement around the best coarse match. That's the difference between "good enough at any sample rate" and "needs hand-tuned constants per rate."
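A minimal sketch of that two-stage search over a range [lo, hi] of candidate frame starts. The function name and signature here are illustrative, not the SDK's actual chooseBestOffset:

```typescript
// Normalized cross-correlation between a candidate frame of `input`
// starting at `offset` and the `target` overlap region.
function normXCorr(input: Float32Array, offset: number, target: Float32Array): number {
  let dot = 0, ea = 0, eb = 0;
  for (let i = 0; i < target.length; i++) {
    const a = input[offset + i];
    const b = target[i];
    dot += a * b;
    ea += a * a;
    eb += b * b;
  }
  return dot / (Math.sqrt(ea * eb) + 1e-12);
}

function chooseBestOffsetSketch(
  input: Float32Array,
  target: Float32Array, // what's already in the output
  lo: number,
  hi: number,
  correlationStep = 8,
): number {
  let best = lo;
  let bestScore = -Infinity;
  // Stage 1: coarse stride of correlationStep samples across the full radius.
  for (let off = lo; off <= hi; off += correlationStep) {
    const s = normXCorr(input, off, target);
    if (s > bestScore) { bestScore = s; best = off; }
  }
  // Stage 2: refine within +-3 samples of the coarse winner.
  for (let off = Math.max(lo, best - 3); off <= Math.min(hi, best + 3); off++) {
    const s = normXCorr(input, off, target);
    if (s > bestScore) { bestScore = s; best = off; }
  }
  return best;
}
```

Because the scoring divides by the energies of both segments, a loud candidate can't outscore a quiet one that actually matches the waveform shape.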
Two things that bit us
The backward jump. Near the end of the input, the search range can collapse — there isn't enough signal left to span the full radius. The first version of chooseBestOffset fell back to the expected input position when this happened. That value is computed from the cumulative speed ratio, and near end-of-input it can sit below the previous frame's input position. The source pointer jumps backward, and you get an audible click on every collapsed-range frame.
The fix is to return the lower bound of the (collapsed) search range, which is constructed to be at least previousInputPosition + 1. Forward progress is preserved by construction. One-line change, several hours to find.
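The shape of the fix, in an illustrative sketch (names are hypothetical, not the SDK's internals):

```typescript
// When the search range collapses near end-of-input, pick the frame start
// from the range's lower bound rather than the expected position.
function collapsedFallback(
  expected: number,     // input position from the cumulative speed ratio
  prevInputPos: number, // frame start chosen for the previous hop
  searchRadius: number,
): number {
  const lo = Math.max(prevInputPos + 1, expected - searchRadius);
  // The buggy version returned `expected`, which near end-of-input can sit
  // below prevInputPos -- a backward jump and an audible click.
  // Returning lo preserves forward progress: lo >= prevInputPos + 1.
  return lo;
}
```
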
The double-encode round-trip. The first integration was embarrassing. The SDK was encoding the provider's PCM output to the user's chosen format (say, MP3), and then the stretch step was decoding the MP3, stretching, and re-encoding it. Lossy → lossy. We now defer output conversion until after stretching, so the pipeline is always provider → PCM → stretch → final encode. Invisible until you A/B the audio.
The boring parts that matter
- Reject encoded audio at the door. The plugin only handles raw mono s16le PCM. If you pass WAV, MP3, Ogg, FLAC, or MP4, we sniff the magic bytes and throw a clear error rather than emit garbage.
- Linear-interpolation fallback. WSOLA needs at least windowSize + 2*searchRadius samples. Below that, we fall through to clamped linear interpolation. Rare in practice, not zero.
- Asymmetric Int16 ↔ Float conversion. Negative samples divide by 32768, positive by 32767, matching the actual Int16 range. Round-trips stay clean.
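Two of these guards are easy to sketch. Illustrative code, not the SDK's internals; real magic-byte sniffing may need more context than shown here:

```typescript
// Magic-byte sniffing: common container/codec signatures that raw PCM
// should never start with.
function sniffEncoded(bytes: Uint8Array): string | null {
  const ascii = (off: number, len: number): string => {
    let s = "";
    for (let i = 0; i < len; i++) s += String.fromCharCode(bytes[off + i]);
    return s;
  };
  if (bytes.length >= 12 && ascii(0, 4) === "RIFF" && ascii(8, 4) === "WAVE") return "WAV";
  if (bytes.length >= 3 && ascii(0, 3) === "ID3") return "MP3";
  if (bytes.length >= 4 && ascii(0, 4) === "OggS") return "Ogg";
  if (bytes.length >= 4 && ascii(0, 4) === "fLaC") return "FLAC";
  if (bytes.length >= 12 && ascii(4, 4) === "ftyp") return "MP4";
  return null; // plausibly raw PCM
}

// Asymmetric Int16 <-> float conversion: negatives span -32768..0 and
// positives 0..32767, so each side gets its own divisor.
function int16ToFloat(s: number): number {
  return s < 0 ? s / 32768 : s / 32767;
}
function floatToInt16(f: number): number {
  const v = f < 0 ? f * 32768 : f * 32767;
  return Math.max(-32768, Math.min(32767, Math.round(v)));
}
```

With a single symmetric divisor, either -32768 maps outside [-1, 1] or 32767 never reaches full scale; the split divisor avoids both.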
Using it
The stretcher is exported from @speech-sdk/core/plugins, so you can apply the same step outside the SDK pipeline:
import { timeStretch } from "@speech-sdk/core/plugins";
const stretched = timeStretch(pcm, { speed: 1.25, sampleRate: 24000 });
pcm is an Int16Array, a Uint8Array of raw s16le bytes, or an ArrayBuffer. Speed range is 0.75–1.5; sample-rate range is 8000–48000 Hz.
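A guess at how the three accepted input types normalize to a single Int16Array view (illustrative; assumes a little-endian host, which matches s16le byte order):

```typescript
// Hypothetical normalization of the three accepted `pcm` types to Int16Array.
function toInt16(pcm: Int16Array | Uint8Array | ArrayBuffer): Int16Array {
  if (pcm instanceof Int16Array) return pcm;
  if (pcm instanceof ArrayBuffer) return new Int16Array(pcm);
  // Uint8Array of raw s16le bytes: honor byteOffset, drop a trailing odd byte.
  // (A view at an odd byteOffset would need a copy instead of a view.)
  return new Int16Array(pcm.buffer, pcm.byteOffset, pcm.byteLength >> 1);
}
```
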
Source: speech-sdk/src/plugins/time-stretch.