This is the resources page for the papers that I will discuss on May 8, 2025 at the Conversational AI Reading Group at Mila.
Voicebox is a non-autoregressive generative speech model based on flow matching, trained to perform speech infilling given audio context and the corresponding text. The model can be used for zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In this talk, we will first review the Voicebox model. We will then focus on its synthetic speech generation capability and present several use cases for these synthetic signals in applications including automatic speech recognition and spoken language understanding. Drawing on a few early studies that use Voicebox-generated speech, we will discuss the cost-saving benefits of the approach for speech data collection, as well as the potential shortcomings of using synthetic speech in these applications.
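As background for the talk, the conditional flow-matching objective (Lipman et al.) that Voicebox builds on can be sketched in a few lines. This is a toy NumPy illustration, not the Voicebox implementation: the data here are 2-D points rather than speech features, and the vector field `v` is a placeholder where the real model uses a Transformer conditioned on text and audio context.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "data": a batch of 2-D points; Voicebox would use speech feature frames.
batch, dim = 64, 2
x1 = rng.normal(loc=3.0, scale=0.5, size=(batch, dim))  # data samples
x0 = rng.normal(size=(batch, dim))                      # noise samples
t = rng.uniform(size=(batch, 1))                        # per-sample times in [0, 1]

# Linear (optimal-transport) probability path: x_t = (1 - t) * x0 + t * x1.
# Its conditional target velocity is u_t = x1 - x0, independent of t.
xt = (1.0 - t) * x0 + t * x1
target = x1 - x0

# Placeholder vector field v(x_t, t); a real model is a neural network
# trained by gradient descent on the loss below.
def v(xt, t):
    return np.zeros_like(xt)

# Conditional flow-matching loss: E_t,x0,x1 || v(x_t, t) - (x1 - x0) ||^2
loss = np.mean(np.sum((v(xt, t) - target) ** 2, axis=-1))
print(f"flow-matching loss: {loss:.3f}")
```

At inference, one samples noise and integrates the learned vector field from t = 0 to t = 1 with an ODE solver to obtain a data sample.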
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale, Le et al., paper
Audiobox: Unified Audio Generation with Natural Language Prompts, Vyas et al., paper on arXiv
Towards Selection of Text-to-speech Data to Augment ASR Training, Liu et al., paper on arXiv
Using Voicebox-based Synthetic Speech for ASR Adaptation, Dhamyal et al., paper on ISCA Archive
Improving Spoken Semantic Parsing using Synthetic Data from Large Generative Models, Sharma et al., paper on ISCA Archive
Flow Matching for Generative Modeling, Lipman et al., paper on arXiv
Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition, Kim et al., paper on arXiv