Jiangyu Han
BUT System for the MLC-SLM Challenge
We present a two-speaker automatic speech recognition (ASR) system that combines DiCoW -- a diarization-conditioned variant of Whisper -- with DiariZen, a diarization pipeline built on top of Pyannote. We first evaluate both systems in out…
Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization
Self-supervised learning (SSL) models like WavLM can be effectively utilized when building speaker diarization systems but are often large and slow, limiting their use in resource-constrained scenarios. Previous studies have explored compr…
Analysis of ABC Frontend Audio Systems for the NIST-SRE24
We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for …
DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition
Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propos…
DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation and Extraction
In recent years, a number of time-domain speech separation methods have been proposed. However, most of them are highly sensitive to acoustic environments and struggle with wide-domain coverage tasks. In this paper, from the time-frequency domain perspective, …
Leveraging Self-Supervised Learning for Speaker Diarization
End-to-end neural diarization has evolved considerably over the past few years, but data scarcity is still a major obstacle for further improvements. Self-supervised learning methods such as WavLM have shown promising performance on severa…
DiaCorrect: Error Correction Back-end For Speaker Diarization
In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. O…
A novel brain inception neural network model using EEG graphic structure for emotion recognition
Purpose: EEG analysis of emotions is greatly significant for the diagnosis of psychological diseases and brain-computer interface (BCI) applications. However, applications of EEG brain neural networks to emotion classification are rarel…
HYBRIDFORMER: improving SqueezeFormer with hybrid attention and NSR mechanism
SqueezeFormer has recently shown impressive performance in automatic speech recognition (ASR). However, its inference speed suffers from the quadratic complexity of softmax-attention (SA). In addition, limited by the large convolution kern…
Heterogeneous separation consistency training for adaptation of unsupervised speech separation
Recently, supervised speech separation has made great progress. However, limited by the nature of supervised training, most existing separation methods require ground-truth sources and are trained on synthetic datasets. This ground-truth r…
Dynamic Acoustic Compensation and Adaptive Focal Training for Personalized Speech Enhancement
Recently, many personalized speech enhancement (PSE) systems with excellent performance have been proposed. However, two critical issues still limit the performance and generalization ability of such models: 1) Acoustic environment …
DiaCorrect: End-to-end error correction for speaker diarization
In recent years, speaker diarization has attracted widespread attention. To achieve better performance, some studies propose to diarize speech in multiple stages. Although these methods might bring additional benefits, most of them are qui…
PercepNet+: A Phase and SNR Aware PercepNet for Real-Time Speech Enhancement
PercepNet, a recent extension of RNNoise (an efficient, high-quality, real-time full-band speech enhancement technique), has shown promising performance in various public deep noise suppression tasks. This paper proposes a new approa…
Improving Channel Decorrelation for Multi-Channel Target Speech Extraction
Target speech extraction has attracted widespread attention. When microphone arrays are available, the additional spatial information can be helpful in extracting the target speech. We have recently proposed a channel decorrelation (CD) me…
INTERSPEECH 2021 ConferencingSpeech Challenge: Towards Far-field Multi-Channel Speech Enhancement for Video Conferencing
The ConferencingSpeech 2021 challenge is proposed to stimulate research on far-field multi-channel speech enhancement for video conferencing. The challenge consists of two separate tasks: 1) Task 1 is multi-channel speech enhancement with …
Attention-based scaling adaptation for target speech extraction
Target speech extraction has attracted widespread attention in recent years. In this work, we focus on investigating the dynamic interaction between different mixtures and the target speaker to exploit the discriminative target speaker…
Multi-channel target speech extraction with channel decorrelation and target speaker adaptation
The end-to-end approaches for single-channel target speech extraction have attracted widespread attention. However, the studies for end-to-end multi-channel target speech extraction are still relatively limited. In this work, we propose tw…