microsoft
VibeVoice
AI, Speech Recognition, Text-to-Speech, Deep Learning, Generative AI, Audio Processing
// summary
VibeVoice is an open-source family of speech AI models covering Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). It pairs a continuous speech tokenizer that runs at an ultra-low frame rate of 7.5 Hz with a next-token diffusion framework, which substantially improves long-sequence processing efficiency while preserving audio fidelity. The series currently supports speech recognition on recordings of up to 60 minutes and several real-time streaming speech generation features, and it aims to promote collaboration and research in the speech synthesis community.
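To make the 7.5 Hz figure concrete, the short sketch below works out how many tokenizer frames cover the 60-minute and 90-minute durations mentioned on this page. It is plain arithmetic, not VibeVoice code: the function name `frames_for_duration` is made up for illustration, and the 50 Hz comparison rate is an assumed baseline chosen only to show the scale of the reduction.

```python
FRAME_RATE_HZ = 7.5            # tokenizer frame rate stated in the summary
ASSUMED_BASELINE_HZ = 50.0     # assumption: comparison rate, not from the source


def frames_for_duration(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    for label, minutes in [("60 min (ASR)", 60), ("90 min (TTS)", 90)]:
        low = frames_for_duration(minutes)
        baseline = frames_for_duration(minutes, ASSUMED_BASELINE_HZ)
        print(f"{label}: {low:,} frames at 7.5 Hz vs {baseline:,} at the assumed 50 Hz")
```

At 7.5 Hz an hour of audio is 27,000 frames, which is why a single pass over a 60-minute recording is a plausible sequence length.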
// use cases
01
VibeVoice-ASR transcribes up to 60 minutes of audio in a single pass, outputting speaker identity, timestamps, and text together; it also supports custom keywords.
02
VibeVoice-TTS synthesizes up to 90 minutes of multi-speaker speech, with support for multiple languages and complex dialogue scenarios.
03
VibeVoice-Realtime-0.5B is a lightweight real-time streaming TTS model with a response latency of approximately 300 milliseconds.
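Use case 01 says the ASR model emits speaker identity, timestamps, and text together. The sketch below shows one way a caller might hold and print such records. The `Segment` type, its field names, and both helper functions are hypothetical; this is not VibeVoice's actual output schema or API, only an illustration of the three pieces of information travelling as one record.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Hypothetical record for one recognized utterance."""
    speaker: str    # speaker label, e.g. "Speaker 1" (assumed format)
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds
    text: str       # recognized text


def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS, truncating fractional seconds."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_transcript(segments: Iterable[Segment]) -> str:
    """Render segments in start-time order, one line per utterance."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start_s):
        span = f"{format_timestamp(seg.start_s)} - {format_timestamp(seg.end_s)}"
        lines.append(f"[{span}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        Segment("Speaker 2", 65.0, 70.5, "Thanks, glad to be here."),
        Segment("Speaker 1", 0.0, 4.2, "Welcome to the show."),
    ]
    print(render_transcript(demo))
```

Keeping speaker, time span, and text in a single record means downstream steps such as subtitle export or per-speaker search do not need to realign separate outputs.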