Matthew Wiesner
Whisper-UT: A Unified Translation Framework for Speech and Text
Encoder-decoder models have achieved remarkable success in speech and text tasks, yet efficiently adapting these models to diverse uni/multi-modal scenarios remains an open challenge. In this paper, we propose Whisper-UT, a unified and eff…
CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset
We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched lan…
Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts
We address the challenge of detecting synthesized speech under distribution shifts -- arising from unseen synthesis methods, speakers, languages, or audio conditions -- relative to the training data. Few-shot learning methods are a promisi…
Scalable Controllable Accented TTS
We tackle the challenge of scaling accented TTS systems, expanding their capabilities to include much larger amounts of training data and a wider variety of accent labels, even for accents that are poorly represented or unlabeled in tradit…
Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse …
The Impact of Automatic Speech Transcription on Speaker Attribution
Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g. deleted) or un…
HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation
Neural transducers (NT) provide an effective framework for speech streaming, demonstrating strong performance in automatic speech recognition (ASR). However, the application of NT to speech translation (ST) remains challenging, as existing…
ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts
The problem of synthetic speech detection has enjoyed considerable attention, with recent methods achieving low error rates across several established benchmarks. However, to what extent can low error rates on academic benchmarks translate…
GenVC: Self-Supervised Zero-Shot Voice Conversion
Most current zero-shot voice conversion methods rely on externally supervised components, particularly speaker encoders, for training. To explore alternatives that eliminate this dependency, this paper introduces GenVC, a novel framework t…
DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition
Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propos…
Early photometric and spectroscopic observations of the extraordinarily bright INTEGRAL-detected GRB 221009A
Context. GRB 221009A, initially detected as an X-ray transient by Swift, was later revealed to have triggered the Fermi satellite about an hour earlier, marking it as a post-peak observation of the event’s emission. This GRB distinguished…
Target Speaker ASR with Whisper
We propose a novel approach to enable the use of large, single-speaker ASR models, such as Whisper, for target speaker ASR. The key claim of this method is that it is much easier to model relative differences among speakers by learning to …
HLTCOE JHU Submission to the Voice Privacy Challenge 2024
We present a number of systems for the Voice Privacy Challenge, including voice-conversion-based systems such as the kNN-VC method and the WavLM voice conversion method, and text-to-speech (TTS) based systems including Whisper-VITS. We fou…
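The kNN-VC method mentioned above converts voices by replacing each frame of self-supervised source features with an average of its nearest neighbors from a target-speaker feature pool, then vocoding the result. Below is a minimal illustrative sketch of that matching step; the function name `knn_vc_convert` and the cosine-similarity choice are assumptions for illustration, not the challenge submission's actual implementation.

```python
import numpy as np

def knn_vc_convert(src_feats, tgt_feats, k=4):
    """For each source frame, average its k most similar target-speaker frames.

    src_feats: (T, D) self-supervised features (e.g. WavLM) of the source speech
    tgt_feats: (N, D) feature pool extracted from target-speaker recordings
    Returns (T, D) converted features, which a vocoder would turn into audio.
    """
    # Cosine similarity: L2-normalize, then take dot products.
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    sims = src @ tgt.T                       # (T, N) frame-pair similarities
    idx = np.argsort(-sims, axis=1)[:, :k]   # indices of k nearest target frames
    return tgt_feats[idx].mean(axis=1)       # average neighbors in feature space
```

Because the output is built entirely from target-speaker frames, the converted features carry the target's voice characteristics while preserving the source frame sequence.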
Privacy versus Emotion Preservation Trade-offs in Emotion-Preserving Speaker Anonymization
Advances in speech technology now allow unprecedented access to personally identifiable information through speech. To protect such information, the differential privacy field has explored ways to anonymize speech while preserving its util…
Less Peaky and More Accurate CTC Forced Alignment by Label Priors
Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer…
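The label-prior idea in this line of work discounts frequent labels (especially the CTC blank) by subtracting a scaled log-prior from the per-frame log-posteriors before running Viterbi forced alignment, which flattens peaky distributions. A minimal sketch of that reweighting step follows; the function name, the self-estimated prior, and the `prior_scale` value are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def reweight_ctc_posteriors(log_probs, prior_scale=0.3, eps=1e-8):
    """Discount frequent labels in CTC posteriors before forced alignment.

    log_probs: (T, V) per-frame log-posteriors over the output vocabulary
    Returns re-normalized (T, V) log-posteriors with a label prior removed.
    """
    # Estimate the label prior from the model's own frame-averaged posteriors.
    prior = np.exp(log_probs).mean(axis=0)                  # (V,)
    adjusted = log_probs - prior_scale * np.log(prior + eps)
    # Re-normalize each frame to a proper log-distribution.
    adjusted -= np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return adjusted
```

Downweighting the dominant blank label spreads probability mass over neighboring frames, which is what yields less peaky, and hence finer-grained, alignments.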
Speech Collage: Code-Switched Audio Generation by Collaging Monolingual Corpora
Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of the transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthes…
On Speaker Attribution with SURT
The Streaming Unmixing and Recognition Transducer (SURT) has recently become a popular framework for continuous, streaming, multi-talker speech recognition (ASR). With advances in architecture, objectives, and mixture simulation methods, i…
Designing an Optimal Kilonova Search Using DECam for Gravitational-wave Events
We address the problem of optimally identifying all kilonovae detected via gravitational-wave emission in the upcoming LIGO/Virgo/KAGRA observing run, O4, which is expected to be sensitive to a factor of ∼7 more binary neutron star (BNS) a…
Constraints on the Physical Properties of GW190814 through Simulations Based on DECam Follow-up Observations by the Dark Energy Survey
On 2019 August 14, the LIGO and Virgo Collaborations detected gravitational waves from a black hole and a 2.6 solar mass compact object, possibly the first neutron star-black hole merger. In search of an optical counterpart, the Dark Energ…
The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios
The CHiME challenges have played a significant role in the development and evaluation of robust automatic speech recognition (ASR) systems. We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge. This task compris…
HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation
We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations, containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented and aligned at the sente…
The impact of human expert visual inspection on the discovery of strong gravitational lenses
We investigate the ability of human ‘expert’ classifiers to identify strong gravitational lens candidates in Dark Energy Survey-like imaging. We recruited a total of 55 people who completed more than 25 per cent of the project. During the…
Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts
This paper presents a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data. Imperfectly transcribed speech is a prevalent issue in human-annotated speech corpora, which degrades the performa…
Towards Zero-Shot Code-Switched Speech Recognition
In this work, we seek to build effective code-switched (CS) automatic speech recognition (ASR) systems under the zero-shot setting where no transcribed CS speech data is available for training. Previously proposed frameworks which conditi…
JHU IWSLT 2023 Multilingual Speech Translation System Description
Henry Li Xinyuan, Neha Verma, Bismarck Bamfo Odoom, Ujvala Pradeep, Matthew Wiesner, Sanjeev Khudanpur. Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). 2023.