Lukáš Burget
Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition
We propose a speaker-attributed (SA) Whisper-based model for multi-talker speech recognition that combines target-speaker modeling with serialized output training (SOT). Our approach leverages a Diarization-Conditioned Whisper (DiCoW) enco…
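The abstract above names serialized output training (SOT) without detail; below is a minimal illustrative sketch, under the usual assumption that SOT orders speakers' segments by start time and joins them with a speaker-change token so a single decoder can emit all speakers' text. The token name and ordering rule are assumptions, not taken from the paper.

    # Minimal SOT-style reference serialization (illustrative sketch only)
    SC = "<sc>"  # hypothetical speaker-change token

    def serialize(segments):
        """segments: list of (start_time, speaker_id, text) tuples."""
        ordered = sorted(segments, key=lambda s: s[0])        # order by start time
        return f" {SC} ".join(text for _, _, text in ordered)

    print(serialize([(0.0, "A", "hello there"), (1.3, "B", "hi"), (2.0, "A", "how are you")]))
    # -> "hello there <sc> hi <sc> how are you"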
Unsupervised Speech Enhancement using Data-defined Priors
The majority of deep learning-based speech enhancement methods require paired clean-noisy speech data. Collecting such data at scale in real-world conditions is infeasible, which has led the community to rely on synthetically generated noi…
DeCRED: Decoder-Centric Regularization for Encoder-Decoder Based Speech Recognition
This paper presents a simple yet effective regularization for the internal language model induced by the decoder in encoder-decoder ASR models, thereby improving robustness and generalization in both in- and out-of-domain settings. The pro…
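As a rough illustration of what a decoder-centric regularizer can look like, the sketch below assumes an auxiliary cross-entropy loss attached to an intermediate decoder layer through a shared output head; the paper's exact formulation may differ, and all names here are hypothetical.

    import torch.nn.functional as F

    def decoder_regularized_loss(final_logits, aux_hidden, output_head, targets, alpha=0.3):
        # final_logits: (B, T, V), aux_hidden: (B, T, D), targets: (B, T)
        main = F.cross_entropy(final_logits.transpose(1, 2), targets)
        aux_logits = output_head(aux_hidden)           # reuse the output projection
        aux = F.cross_entropy(aux_logits.transpose(1, 2), targets)
        return (1 - alpha) * main + alpha * aux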
BUT System for the MLC-SLM Challenge
We present a two-speaker automatic speech recognition (ASR) system that combines DiCoW -- a diarization-conditioned variant of Whisper -- with DiariZen, a diarization pipeline built on top of Pyannote. We first evaluate both systems in out…
Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization
Self-supervised learning (SSL) models like WavLM can be effectively utilized when building speaker diarization systems but are often large and slow, limiting their use in resource-constrained scenarios. Previous studies have explored compr…
Analysis of ABC Frontend Audio Systems for the NIST-SRE24
We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for …
DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition
Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propos…
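The abstract is truncated, but the general idea of conditioning an encoder on frame-level diarization output can be sketched as class-specific affine transforms of the encoder features, mixed by the diarization posteriors. The class set (silence / target / non-target / overlap) and the affine form below are illustrative assumptions rather than the paper's exact layers.

    import torch

    class FrameDiarConditioning(torch.nn.Module):
        """Illustrative frame-level diarization conditioning of encoder features."""
        def __init__(self, dim, n_classes=4):   # e.g. silence/target/non-target/overlap
            super().__init__()
            self.scale = torch.nn.Parameter(torch.ones(n_classes, dim))
            self.bias = torch.nn.Parameter(torch.zeros(n_classes, dim))

        def forward(self, feats, diar_post):
            # feats: (B, T, D); diar_post: (B, T, n_classes) frame-level posteriors
            scale = diar_post @ self.scale       # (B, T, D) mixture of class scales
            bias = diar_post @ self.bias
            return feats * scale + bias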
Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization
In spite of the popularity of end-to-end diarization systems nowadays, modular systems comprised of voice activity detection (VAD), speaker embedding extraction plus clustering, and overlapped speech detection (OSD) plus handling still att…
Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models
This paper proposes a simple yet effective way of regularising encoder-decoder-based automatic speech recognition (ASR) models that enhances the robustness of the model and improves the generalisation to out-of-domain scenarios. The prop…
State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data
In this paper, we refine and validate our method for training speaker embedding extractors using weak annotations. More specifically, we use only the audio stream of the source VoxCeleb videos and the names of the celebrities without knowi…
CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification
Self-supervised learning (SSL) models for speaker verification (SV) have gained significant attention in recent years. However, existing SSL-based SV systems often struggle to capture local temporal dependencies and generalize across diffe…
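A stripped-down sketch of multi-head attentive pooling over frame-level SSL features is shown below for orientation; it omits the factorized, context-aware parts that give CA-MHFA its name, so treat it as a generic baseline rather than the proposed method.

    import torch

    class MultiHeadAttentivePooling(torch.nn.Module):
        def __init__(self, dim, heads=8):
            super().__init__()
            self.att = torch.nn.Linear(dim, heads)    # one score per head and frame

        def forward(self, x):                         # x: (B, T, D) frame features
            w = torch.softmax(self.att(x), dim=1)     # (B, T, H) attention over time
            pooled = torch.einsum("bth,btd->bhd", w, x)
            return pooled.flatten(1)                  # (B, H*D) utterance embedding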
Leveraging Self-Supervised Learning for Speaker Diarization
End-to-end neural diarization has evolved considerably over the past few years, but data scarcity is still a major obstacle for further improvements. Self-supervised learning methods such as WavLM have shown promising performance on severa…
Target Speaker ASR with Whisper
We propose a novel approach to enable the use of large, single-speaker ASR models, such as Whisper, for target speaker ASR. The key claim of this method is that it is much easier to model relative differences among speakers by learning to …
BUT Systems and Analyses for the ASVspoof 5 Challenge
This paper describes the BUT submitted systems for the ASVspoof 5 challenge, along with analyses. For the conventional deepfake detection task, we use ResNet18 and self-supervised models for the closed and open conditions, respectively. In…
Challenging margin-based speaker embedding extractors by using the variational information bottleneck
Speaker embedding extractors are typically trained using a classification loss over the training speakers. During the last few years, the standard softmax/cross-entropy loss has been replaced by margin-based losses, yielding significan…
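For reference, the most common margin-based objective for speaker embeddings is additive angular margin (AAM) softmax; the sketch below is the standard formulation, not anything specific to this paper.

    import torch
    import torch.nn.functional as F

    def aam_logits(emb, weights, labels, margin=0.2, scale=30.0):
        """Standard AAM-softmax logits: cos(theta + m) for the target class."""
        cos = F.normalize(emb) @ F.normalize(weights).t()          # (B, C) cosines
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        onehot = F.one_hot(labels, weights.shape[0]).to(cos.dtype)
        return scale * (onehot * torch.cos(theta + margin) + (1 - onehot) * cos)

    # training loss: F.cross_entropy(aam_logits(emb, W, y), y)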
Text-dependent Speaker Verification (TdSV) Challenge 2024: Challenge Evaluation Plan
This document outlines the Text-dependent Speaker Verification (TdSV) Challenge 2024, which centers on analyzing and exploring novel approaches for text-dependent speaker verification. The primary goal of this challenge is to motivate partic…
Impact on diabetes control and patient-reported outcomes of a newer implantable continuous glucose monitoring system (Eversense® CGM System): a single-centre retro- and prospective observational study
AIMS OF THE STUDY: The Eversense® CGM System is the first and only continuous glucose monitoring system (CGMS) that uses a fully subcutaneous implanted sensor. This study aimed to evaluate effectiveness, safety and patient-reported outcome…
Beyond the Labels: Unveiling Text-Dependency in Paralinguistic Speech Recognition Datasets
Paralinguistic traits like cognitive load and emotion are increasingly recognized as pivotal areas in speech recognition research, often examined through specialized datasets like CLSE and IEMOCAP. However, the integrity of these datasets …
Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information?
In this paper, we apply the variational information bottleneck approach to end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). This allows us to investigate what information is essential for the model. EEND-EDA utiliz…
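The variational information bottleneck ingredient itself is standard: a stochastic bottleneck with a KL penalty toward a unit Gaussian prior. The sketch below shows only that generic part; how it attaches to the EEND-EDA attractors follows the paper and is not reproduced here.

    import torch

    def vib_bottleneck(mu, logvar, beta=1e-3):
        """Reparameterized sample z and the weighted KL(q(z|x) || N(0, I)) penalty."""
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        kl = 0.5 * torch.sum(torch.exp(logvar) + mu**2 - 1.0 - logvar, dim=-1).mean()
        return z, beta * kl        # beta * kl is added to the task loss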
DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors
Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the mo…
Discriminative Training of VBx Diarization
Bayesian HMM clustering of x-vector sequences (VBx) has become a widely adopted diarization baseline model in publications and challenges. It uses an HMM to model speaker turns, a generatively trained probabilistic linear discriminant anal…
DiaCorrect: Error Correction Back-end For Speaker Diarization
In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. O…
Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization
Combining end-to-end neural speaker diarization (EEND) with vector clustering (VC), known as EEND-VC, has gained interest for leveraging the strengths of both methods. EEND-VC estimates activities and speaker embeddings for all speakers wi…
Hystoc: Obtaining word confidences for fusion of end-to-end ASR systems
End-to-end (e2e) systems have recently gained wide popularity in automatic speech recognition. However, these systems generally do not provide well-calibrated word-level confidences. In this paper, we propose Hystoc, a simple method for ob…
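A very rough sketch of the general idea (softmax the hypothesis-level scores into posteriors and accumulate, per word, the posterior mass of the hypotheses that contain it) is given below; the positional alignment here is naive and everything about it is an illustrative assumption, not the actual Hystoc algorithm.

    import math
    from collections import defaultdict

    def word_confidences(nbest, temperature=1.0):
        """nbest: list of (score, list_of_words) hypotheses for one utterance."""
        scores = [s / temperature for s, _ in nbest]
        m = max(scores)
        post = [math.exp(s - m) for s in scores]
        total = sum(post)
        post = [p / total for p in post]               # hypothesis posteriors
        conf = defaultdict(float)
        for p, (_, words) in zip(post, nbest):
            for i, w in enumerate(words):              # naive position-based alignment
                conf[(i, w)] += p
        return conf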
Improving Speaker Verification with Self-Pretrained Transformer Models
Recently, fine-tuning large pre-trained Transformer models using downstream datasets has received a rising interest. Despite their success, it is still challenging to disentangle the benefits of large-scale datasets and Transformer structu…
Multi-Speaker and Wide-Band Simulated Conversations as Training Data for End-to-End Neural Diarization
End-to-end diarization presents an attractive alternative to standard cascaded diarization systems because a single system can handle all aspects of the task at once. Many flavors of end-to-end models have been proposed but all of them req…
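A hedged sketch of the general simulation idea (stitching single-speaker segments together with random pauses and overlaps) follows; the statistics, tooling and exact recipe of the paper are not reproduced.

    import random

    def simulate_conversation(speaker_segments, n_turns=20, max_pause=2.0, max_overlap=1.0):
        """speaker_segments: dict speaker -> list of (duration, audio). Illustrative only."""
        speakers = list(speaker_segments)
        t, turns = 0.0, []
        for _ in range(n_turns):
            spk = random.choice(speakers)
            dur, audio = random.choice(speaker_segments[spk])
            # negative offsets create overlapped speech, positive ones create pauses
            t = max(0.0, t + random.uniform(-max_overlap, max_pause))
            turns.append((t, spk, audio))
            t += dur
        return turns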
Stabilized training of joint energy-based models and their practical applications
The recently proposed Joint Energy-based Model (JEM) interprets discriminatively trained classifier $p(y|x)$ as an energy model, which is also trained as a generative model describing the distribution of the input observations $p(x)$. The …
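The energy reinterpretation mentioned in the abstract is the standard JEM construction: classifier logits f(x) define an energy E(x) = -logsumexp_y f(x)[y], so p(x) is proportional to exp(-E(x)). A one-function sketch:

    import torch

    def jem_energy(logits):
        """E(x) = -logsumexp over classes of the classifier logits f(x)."""
        return -torch.logsumexp(logits, dim=-1)

    # Training the generative term then needs samples from p(x), typically drawn
    # with SGLD; the stabilization tricks are the subject of the paper and not shown.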