Manthan Thakker
CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching
Generating natural-sounding, multi-speaker dialogue is crucial for applications such as podcast creation, virtual agents, and multimedia content generation. However, existing systems struggle to maintain speaker consistency, model overlapp…
Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech
People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotion…
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS fr…
An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS
Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio p…
Total-Duration-Aware Duration Modeling for Text-to-Speech Systems
Accurate control of the total duration of generated speech by adjusting the speech rate is crucial for various text-to-speech (TTS) applications. However, the impact of adjusting the speech rate on speech quality, such as intelligibility a…
Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the ability to produce realistic and appropriate laughter sounds, limit…
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text spee…
Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation
This paper investigates how to improve the runtime speed of personalized speech enhancement (PSE) networks while maintaining the model quality. Our approach includes two aspects: architecture and knowledge distillation (KD). We propose an …