Microsoft Open-Sources VibeVoice-1.5B Model: A Breakthrough in 90-Minute Long-Form Speech Synthesis

baoshi.rao

Microsoft Research has open-sourced its latest audio model, VibeVoice-1.5B. The model advances speech synthesis on several fronts, producing synthesized speech that is more natural, longer in duration, and higher in quality.

VibeVoice-1.5B can synthesize up to 90 minutes of speech in a single pass, a rare capability among speech synthesis models. Most earlier models could only handle speech segments under 60 minutes, and often suffered from timbre drift and semantic discontinuity beyond 30 minutes. The model also supports up to four speakers, a significant improvement in multi-speaker synthesis, whereas previous open-source models typically supported only two. In addition, VibeVoice achieves a 3200x compression ratio on 24 kHz raw audio, greatly improving compression efficiency while retaining high-fidelity speech quality.
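To put the compression figure in perspective, the short calculation below is a rough sketch that assumes the 3200x ratio is a temporal downsampling factor, meaning one compressed frame per 3,200 raw samples. The article does not spell out that interpretation, and the function names are illustrative only.

```python
# Back-of-the-envelope arithmetic for the figures quoted above.
# Assumption: "3200x compression" means one acoustic frame per 3,200 raw samples.

SAMPLE_RATE_HZ = 24_000    # raw audio sample rate from the article
COMPRESSION_RATIO = 3_200  # compression ratio from the article


def frames_per_second(sample_rate_hz: int, ratio: int) -> float:
    """Compressed frames produced per second of audio."""
    return sample_rate_hz / ratio


def frames_for_duration(minutes: int,
                        sample_rate_hz: int = SAMPLE_RATE_HZ,
                        ratio: int = COMPRESSION_RATIO) -> int:
    """Compressed frames needed to cover `minutes` of audio."""
    total_samples = minutes * 60 * sample_rate_hz
    return total_samples // ratio


if __name__ == "__main__":
    print(frames_per_second(SAMPLE_RATE_HZ, COMPRESSION_RATIO))  # 7.5
    print(frames_for_duration(90))                               # 40500
```

Under this assumption, a full 90-minute session shrinks from about 129.6 million raw samples to roughly 40,500 frames, which helps explain why such long sequences become tractable for a model to process.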

The core innovation of VibeVoice lies in its unique dual-tokenizer architecture. Unlike traditional TTS models that rely on a single tokenizer for feature extraction, VibeVoice introduces a collaborative mechanism between an acoustic tokenizer and a semantic tokenizer, addressing the mismatch between timbre and semantics. The acoustic tokenizer focuses on preserving voice characteristics and achieving extreme compression, while the semantic tokenizer extracts features aligned with textual semantics, ensuring the synthesized speech's emotional tone matches the content.

For training, VibeVoice employs a curriculum learning strategy, gradually increasing input sequence lengths to avoid training failures caused by processing ultra-long sequences. During training, the parameters of the acoustic and semantic tokenizers remain fixed, ensuring the stability of the feature extraction modules and shortening the training cycle.
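The curriculum idea described above can be illustrated with a minimal sketch. The stage lengths, step counts, and component names below are hypothetical placeholders chosen for illustration, not the values or module names Microsoft used; the only details taken from the article are that sequence length grows gradually and that both tokenizers stay frozen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    max_seq_len: int  # longest input sequence allowed in this stage
    num_steps: int    # number of training steps spent in this stage


# Hypothetical schedule: sequence length grows from stage to stage.
CURRICULUM = [
    Stage(max_seq_len=4_096, num_steps=1_000),
    Stage(max_seq_len=16_384, num_steps=1_000),
    Stage(max_seq_len=32_768, num_steps=1_000),
    Stage(max_seq_len=65_536, num_steps=1_000),
]

# Hypothetical component names; True means the parameters are frozen.
FROZEN = {
    "acoustic_tokenizer": True,
    "semantic_tokenizer": True,
    "rest_of_model": False,
}


def max_len_at_step(step: int, stages=CURRICULUM) -> int:
    """Return the maximum sequence length allowed at a given training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    boundary = 0
    for stage in stages:
        boundary += stage.num_steps
        if step < boundary:
            return stage.max_seq_len
    # Past the last boundary, stay at the final (longest) stage.
    return stages[-1].max_seq_len


def trainable_components(frozen=FROZEN) -> list:
    """Names of the components whose parameters are still updated."""
    return [name for name, is_frozen in frozen.items() if not is_frozen]
```

In a real training loop, `max_len_at_step` would be used to truncate or filter batches, and the frozen components would simply be excluded from the optimizer, so early training runs on short sequences and only later stages see the longest inputs.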

The open-sourcing of VibeVoice-1.5B brings these advances to the wider speech synthesis community and lays the groundwork for future models with more parameters. It is a development worth noting for researchers and developers working in audio processing and speech synthesis.

Open-source address: https://huggingface.co/microsoft/VibeVoice-1.5B

Online demo: https://aka.ms/VibeVoice-Demo

Key Highlights:

VibeVoice-1.5B can synthesize 90-minute ultra-long speech segments in one go and supports up to four speakers.

The model achieves a 3200x audio compression ratio while maintaining high-fidelity speech quality.

It adopts a dual-tokenizer architecture to resolve mismatches between timbre and semantics.