microsoft

VibeVoice

AI · Speech Recognition · Text-to-Speech · Deep Learning · Generative AI · Audio Processing


// summary

VibeVoice is an open-source family of speech AI models covering both Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). It pairs a continuous speech tokenizer running at an ultra-low frame rate of 7.5 Hz with a next-token diffusion framework, which shortens the sequences the model must process for long audio while preserving audio fidelity. The series currently supports speech recognition on recordings of up to 60 minutes as well as real-time streaming speech generation, and is released to encourage collaboration and research in the speech synthesis community.
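To make the efficiency claim concrete, the short sketch below works out how many tokenizer frames the stated 7.5 Hz rate implies for the 60-minute and 90-minute limits mentioned on this page. It is plain arithmetic, not VibeVoice code: `frames_for` is a hypothetical helper, and the 50 Hz figure is an assumed comparison rate chosen for illustration, not a number taken from the project. The counts cover audio frames only and ignore any text or control tokens the models may add.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in the summary


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames covering `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


# Limits quoted on this page.
asr_frames = frames_for(60)  # 60 minutes of audio for recognition
tts_frames = frames_for(90)  # 90 minutes of synthesized speech

# Assumed comparison point (hypothetical, not from the project).
comparison_frames = frames_for(60, frame_rate_hz=50.0)

print(f"60 min at 7.5 Hz: {asr_frames} frames")
print(f"90 min at 7.5 Hz: {tts_frames} frames")
print(f"60 min at 50 Hz:  {comparison_frames} frames")
```

Under these assumptions an hour of audio is 27,000 frames at 7.5 Hz, against 180,000 at the assumed 50 Hz rate, which is why a low frame rate matters for fitting long recordings into a single sequence.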

// use cases

VibeVoice-ASR processes up to 60 minutes of audio in a single pass and outputs speaker identity, timestamps, and text together. It also supports custom keywords.

VibeVoice-TTS synthesizes up to 90 minutes of multi-speaker speech and supports multiple languages and complex dialogue scenarios.

VibeVoice-Realtime-0.5B is a lightweight real-time streaming TTS model with a response latency of approximately 300 milliseconds.
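The ASR use case above describes output that combines speaker identity, timestamps, and text. The sketch below shows one way a downstream script might hold and print such records. The `Segment` class, its field names, and the `render` format are all hypothetical and chosen for illustration; they are not the actual output schema of VibeVoice-ASR.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognized utterance (hypothetical structure, not the model's schema)."""
    speaker: str    # speaker identity label
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds
    text: str       # recognized text


def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS, truncating fractional seconds."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Render segments in chronological order, one line per utterance."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start_s):
        span = f"{format_timestamp(seg.start_s)} - {format_timestamp(seg.end_s)}"
        lines.append(f"[{span}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


# Made-up example data, not real model output.
example = [
    Segment("Speaker 2", 3605.0, 3610.5, "Bye."),
    Segment("Speaker 1", 0.0, 4.2, "Hello."),
]
print(render(example))
```

Keeping the three fields in a single record mirrors the idea of producing them in one pass, so no separate alignment step is needed to match speakers to text.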