Matthew Wiesner
Whisper-UT: A Unified Translation Framework for Speech and Text
Encoder-decoder models have achieved remarkable success in speech and text tasks, yet efficiently adapting these models to diverse uni/multi-modal scenarios remains an open challenge. In this paper, we propose Whisper-UT, a unified and eff…
CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset
We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched lan…
Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts
We address the challenge of detecting synthesized speech under distribution shifts -- arising from unseen synthesis methods, speakers, languages, or audio conditions -- relative to the training data. Few-shot learning methods are a promisi…
Scalable Controllable Accented TTS
We tackle the challenge of scaling accented TTS systems, expanding their capabilities to include much larger amounts of training data and a wider variety of accent labels, even for accents that are poorly represented or unlabeled in tradit…
Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse …
The Impact of Automatic Speech Transcription on Speaker Attribution
Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g. deleted) or un…
HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation
Neural transducers (NT) provide an effective framework for speech streaming, demonstrating strong performance in automatic speech recognition (ASR). However, the application of NT to speech translation (ST) remains challenging, as existing…
ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts
The problem of synthetic speech detection has enjoyed considerable attention, with recent methods achieving low error rates across several established benchmarks. However, to what extent can low error rates on academic benchmarks translate…
GenVC: Self-Supervised Zero-Shot Voice Conversion
Most current zero-shot voice conversion methods rely on externally supervised components, particularly speaker encoders, for training. To explore alternatives that eliminate this dependency, this paper introduces GenVC, a novel framework t…
DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition
Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propos…
Early photometric and spectroscopic observations of the extraordinarily bright INTEGRAL-detected GRB 221009A
Context. GRB 221009A, initially detected as an X-ray transient by Swift, was later revealed to have triggered the Fermi satellite about an hour earlier, marking it as a post-peak observation of the event’s emission. This GRB distinguished…
Target Speaker ASR with Whisper
We propose a novel approach to enable the use of large, single-speaker ASR models, such as Whisper, for target speaker ASR. The key claim of this method is that it is much easier to model relative differences among speakers by learning to …
HLTCOE JHU Submission to the Voice Privacy Challenge 2024
We present a number of systems for the Voice Privacy Challenge, including voice-conversion-based systems such as the kNN-VC method and the WavLM voice conversion method, and text-to-speech (TTS) based systems including Whisper-VITS. We fou…
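The kNN-VC method mentioned above converts voices by replacing each frame of self-supervised source features with an average of its nearest neighbors from a target-speaker feature pool, then vocoding the result. Below is a minimal illustrative sketch of that matching step; the function name `knn_vc_convert` and the cosine-similarity choice are assumptions for illustration, not the challenge submission's actual implementation.

```python
import numpy as np

def knn_vc_convert(src_feats, tgt_feats, k=4):
    """For each source frame, average its k most similar target-speaker frames.

    src_feats: (T, D) self-supervised features (e.g. WavLM) of the source speech
    tgt_feats: (N, D) feature pool extracted from target-speaker recordings
    Returns (T, D) converted features, which a vocoder would turn into audio.
    """
    # Cosine similarity: L2-normalize, then take dot products.
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    sims = src @ tgt.T                       # (T, N) frame-pair similarities
    idx = np.argsort(-sims, axis=1)[:, :k]   # indices of k nearest target frames
    return tgt_feats[idx].mean(axis=1)       # average neighbors in feature space
```

Because the output is built entirely from target-speaker frames, the converted features carry the target's voice characteristics while preserving the source frame sequence.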
Privacy versus Emotion Preservation Trade-offs in Emotion-Preserving Speaker Anonymization
Advances in speech technology now allow unprecedented access to personally identifiable information through speech. To protect such information, the differential privacy field has explored ways to anonymize speech while preserving its util…
Less Peaky and More Accurate CTC Forced Alignment by Label Priors
Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer…
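The label-prior idea in this line of work discounts frequent labels (especially the CTC blank) by subtracting a scaled log-prior from the per-frame log-posteriors before running Viterbi forced alignment, which flattens peaky distributions. A minimal sketch of that reweighting step follows; the function name, the self-estimated prior, and the `prior_scale` value are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def reweight_ctc_posteriors(log_probs, prior_scale=0.3, eps=1e-8):
    """Discount frequent labels in CTC posteriors before forced alignment.

    log_probs: (T, V) per-frame log-posteriors over the output vocabulary
    Returns re-normalized (T, V) log-posteriors with a label prior removed.
    """
    # Estimate the label prior from the model's own frame-averaged posteriors.
    prior = np.exp(log_probs).mean(axis=0)                  # (V,)
    adjusted = log_probs - prior_scale * np.log(prior + eps)
    # Re-normalize each frame to a proper log-distribution.
    adjusted -= np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return adjusted
```

Downweighting the dominant blank label spreads probability mass over neighboring frames, which is what yields less peaky, and hence finer-grained, alignments.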
Speech Collage: Code-Switched Audio Generation by Collaging Monolingual Corpora
Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of the transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthes…
On Speaker Attribution with SURT
The Streaming Unmixing and Recognition Transducer (SURT) has recently become a popular framework for continuous, streaming, multi-talker speech recognition (ASR). With advances in architecture, objectives, and mixture simulation methods, i…
Designing an Optimal Kilonova Search Using DECam for Gravitational-wave Events
We address the problem of optimally identifying all kilonovae detected via gravitational-wave emission in the upcoming LIGO/Virgo/KAGRA observing run, O4, which is expected to be sensitive to a factor of ∼7 more binary neutron star (BNS) a…
Constraints on the Physical Properties of GW190814 through Simulations Based on DECam Follow-up Observations by the Dark Energy Survey
On 2019 August 14, the LIGO and Virgo Collaborations detected gravitational waves from a black hole and a 2.6 solar mass compact object, possibly the first neutron star-black hole merger. In search of an optical counterpart, the Dark Energ…
The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios
The CHiME challenges have played a significant role in the development and evaluation of robust automatic speech recognition (ASR) systems. We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge. This task compris…
HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation
We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations, containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented and aligned at the sente…
The impact of human expert visual inspection on the discovery of strong gravitational lenses
We investigate the ability of human ‘expert’ classifiers to identify strong gravitational lens candidates in Dark Energy Survey-like imaging. We recruited a total of 55 people who completed more than 25 per cent of the project. During the…
Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts
This paper presents a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data. Imperfectly transcribed speech is a prevalent issue in human-annotated speech corpora, which degrades the performa…
Towards Zero-Shot Code-Switched Speech Recognition
In this work, we seek to build effective code-switched (CS) automatic speech recognition (ASR) systems under the zero-shot setting where no transcribed CS speech data is available for training. Previously proposed frameworks which conditi…
JHU IWSLT 2023 Multilingual Speech Translation System Description
Henry Li Xinyuan, Neha Verma, Bismarck Bamfo Odoom, Ujvala Pradeep, Matthew Wiesner, Sanjeev Khudanpur. Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). 2023.