Yanmin Qian
MeanSE: Efficient Generative Speech Enhancement with Mean Flows
Speech enhancement (SE) improves the quality of degraded speech, with generative models like flow matching gaining attention for their outstanding perceptual quality. However, flow-based models require multiple function evaluatio…
SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation
Simultaneous Speech Translation (SimulST) enables real-time cross-lingual communication by jointly optimizing speech recognition and machine translation under strict latency constraints. Existing systems struggle to balance translation qua…
Exploring Self-Supervised Audio Models for Generalized Anomalous Sound Detection
Machine anomalous sound detection (ASD) is a valuable technique across various applications. However, its generalization performance is often limited due to challenges in data collection and the complexity of acoustic environments. Inspire…
Lightweight Front-end Enhancement for Robust ASR via Frame Resampling and Sub-Band Pruning
Recent advancements in automatic speech recognition (ASR) have achieved notable progress, whereas robustness in noisy environments remains challenging. While speech enhancement (SE) front-ends are widely used to mitigate noise as a preproc…
FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation
With the rapid deployment of SCADA systems, how to effectively analyze industrial signals and detect abnormal states is an urgent need for the industry. Due to the significant heterogeneity of these signals, which we summarize as the M5 pr…
URGENT-PK: Perceptually-Aligned Ranking Model Designed for Speech Enhancement Competition
The Mean Opinion Score (MOS) is fundamental to speech quality assessment. However, its acquisition requires significant human annotation. Although deep neural network approaches, such as DNSMOS and UTMOS, have been developed to predict MOS…
Less is More: Data Curation Matters in Scaling Speech Enhancement
The vast majority of modern speech enhancement systems rely on data-driven neural network models. Conventionally, larger datasets are presumed to yield superior model performance, an observation empirically validated across numerous tasks …
Improving Speech Enhancement with Multi-Metric Supervision from Learned Quality Assessment
Speech quality assessment (SQA) aims to predict the perceived quality of speech signals under a wide range of distortions. It is inherently connected to speech enhancement (SE), which seeks to improve speech quality by removing unwanted si…
From Sharpness to Better Generalization for Speech Deepfake Detection
Generalization remains a critical challenge in speech deepfake detection (SDD). While various approaches aim to improve robustness, generalization is typically assessed through performance metrics like equal error rate without a theoretica…
Efficient Multilingual ASR Finetuning via LoRA Language Experts
Recent advancements in deep learning have significantly enhanced multilingual automatic speech recognition (ASR) due to the development of advanced model architectures and available large-scale multilingual datasets. Despite that, multilin…
MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition
Applying large pre-trained speech models like Whisper has shown promise in reducing training costs for various speech tasks. However, integrating these models into streaming systems remains a challenge. This paper presents a novel prefix-t…
Lessons Learned from the URGENT 2024 Speech Enhancement Challenge
The URGENT 2024 Challenge aims to foster speech enhancement (SE) techniques with great universality, robustness, and generalizability, featuring a broader task definition, large-scale multi-domain data, and comprehensive evaluation metrics…
CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching
Generating natural-sounding, multi-speaker dialogue is crucial for applications such as podcast creation, virtual agents, and multimedia content generation. However, existing systems struggle to maintain speaker consistency, model overlapp…
Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling
Zero-shot streaming text-to-speech is an important research topic in human-computer interaction. Existing methods primarily use a lookahead mechanism, relying on future text to achieve natural streaming speech synthesis, which introduces h…
BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM
While speech large language models (SpeechLLMs) have advanced standard automatic speech recognition (ASR), contextual biasing for named entities and rare words remains challenging, especially at scale. To address this, we propose BR-ASR: a…
Contextual understanding with contextual embeddings for multi-talker speech separation and recognition in a cocktail party
Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction
The acoustic background plays a crucial role in natural conversation. It provides context and helps listeners understand the environment, but a strong background makes it difficult for listeners to understand spoken words. The appropriate …
Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation
Advances in speech synthesis technologies, like text-to-speech (TTS) and voice conversion (VC), have made detecting deepfake speech increasingly challenging. Spoofing countermeasures often struggle to generalize effectively, particularly w…
SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods
SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation
Recently, "textless" speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. I…
Establishing identification methods and assessing safety risk of adulterated ingredients of yam in Liuwei Dihuang Pill
Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling
Recent speech enhancement models have shown impressive performance gains by scaling up model complexity and training data. However, the impact of dataset variability (e.g. text, language, speaker, and noise) has been underexplored. Analyzi…
Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification
Recent speaker verification (SV) systems have shown a trend toward adopting deeper speaker embedding extractors. Although deeper and larger neural networks can significantly improve performance, their substantial memory requirements hinder…
Data-Efficient Low-Complexity Acoustic Scene Classification via Distilling and Progressive Pruning
The goal of the acoustic scene classification (ASC) task is to classify recordings into one of the predefined acoustic scene classes. However, in real-world scenarios, ASC systems often encounter challenges such as recording device mismatc…
Prototype and Instance Contrastive Learning for Unsupervised Domain Adaptation in Speaker Verification
A speaker verification system trained on one domain usually suffers performance degradation when applied to another domain. To address this challenge, researchers commonly use feature distribution matching-based methods in unsupervised domai…
WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction
Target speaker extraction (TSE) focuses on isolating the speech of a specific target speaker from overlapped multi-talker speech, a typical setup in the cocktail party problem. In recent years, TSE has drawn increasing attention due t…
Improving Anomalous Sound Detection via Low-Rank Adaptation Fine-Tuning of Pre-Trained Audio Models
Anomalous Sound Detection (ASD) has gained significant interest through the application of various Artificial Intelligence (AI) technologies in industrial settings. Though possessing great potential, ASD systems can hardly be readily deplo…
Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion
Voice conversion (VC) aims to modify the speaker's timbre while retaining speech content. Previous approaches have tokenized the outputs from self-supervised models into semantic tokens, facilitating disentanglement of speech content information.…
Flow-TSVAD: Target-Speaker Voice Activity Detection via Latent Flow Matching
Speaker diarization is typically considered a discriminative task, using discriminative approaches to produce fixed diarization results. In this paper, we explore the use of neural network-based generative methods for speaker diarization f…
Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning
Speaker individuality information is among the most critical elements within speech signals. By thoroughly and accurately modeling this information, it can be utilized in various intelligent speech applications, such as speaker recognition…