John R. Hershey
Source Separation by Flow Matching
We consider the problem of single-channel audio source separation with the goal of reconstructing $K$ sources from their mixture. We address this ill-posed problem with FLOSS (FLOw matching for Source Separation), a constrained generation …
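As a rough illustration of the flow-matching objective named in this abstract, the sketch below shows a generic conditional flow-matching training loss in PyTorch; the constrained, separation-specific formulation that FLOSS actually uses is not described in the snippet, and `model` is a hypothetical velocity-prediction network.

# Minimal, generic conditional flow-matching loss (illustrative sketch only).
import torch

def flow_matching_loss(model, x1):
    """model(x_t, t) predicts a velocity field; x1 is a batch of target samples."""
    x0 = torch.randn_like(x1)                                     # noise sample
    t = torch.rand(x1.shape[0], *[1] * (x1.dim() - 1),
                   device=x1.device)                              # per-example time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1                                 # linear interpolation path
    v_target = x1 - x0                                            # constant velocity along the path
    v_pred = model(x_t, t)
    return torch.mean((v_pred - v_target) ** 2)                   # regress predicted onto target velocity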
I-Con: A Unifying Framework for Representation Learning
As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of m…
Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement
This paper presents a new challenge that calls for zero-shot text-to-speech (TTS) systems to augment speech data for the downstream task, personalized speech enhancement (PSE), as part of the Generative Data Augmentation workshop at ICASSP…
Generative Data Augmentation Challenge: Synthesis of Room Acoustics for Speaker Distance Estimation
This paper describes the synthesis of the room acoustics challenge as a part of the generative data augmentation workshop at ICASSP 2025. The challenge defines a unique generative task that is designed to improve the quantity and diversity…
Understanding Learning with Sliced-Wasserstein Requires Rethinking Informative Slices
The practical applications of Wasserstein distances (WDs) are constrained by their sample and computational complexities. Sliced-Wasserstein distances (SWDs) provide a workaround by projecting distributions onto one-dimensional subspaces, …
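Since the snippet describes SWDs as projecting distributions onto one-dimensional subspaces, here is a minimal Monte Carlo sketch of that standard computation (random slicing directions, sort-based 1-D Wasserstein distance); it illustrates the usual estimator only, not the paper's analysis of informative slices, and the function name and defaults are made up for this example.

# Monte Carlo estimate of the sliced-Wasserstein distance between two sample sets.
import numpy as np

def sliced_wasserstein(X, Y, n_projections=128, p=2, rng=None):
    """X, Y: (n, d) arrays of samples from the two distributions, with equal n."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    # Random directions on the unit sphere define the one-dimensional slices.
    dirs = rng.standard_normal((n_projections, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project, sort, and compare: the 1-D Wasserstein distance has a closed form.
    proj_X = np.sort(X @ dirs.T, axis=0)
    proj_Y = np.sort(Y @ dirs.T, axis=0)
    return np.mean(np.abs(proj_X - proj_Y) ** p) ** (1.0 / p)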
Towards Sub-millisecond Latency Real-Time Speech Enhancement Models on Hearables
Low latency models are critical for real-time speech enhancement applications, such as hearing aids and hearables. However, the sub-millisecond latency space for resource-constrained hearables remains underexplored. We demonstrate speech e…
Unsupervised Improved MVDR Beamforming for Sound Enhancement
MCFSTD was created with the aim of evaluating multi-channel sound separation and localization. It is introduced in our article titled "Unsupervised Improved MVDR Beamforming for Sound Enhancement". It was recorded in 4 rooms with…
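For reference, the textbook MVDR weight computation that the title refers to is sketched below; the paper's unsupervised improvements and the way spatial statistics are estimated from MCFSTD are not shown here, and the helper name is illustrative.

# Textbook MVDR beamformer weights for one frequency bin (generic sketch).
import numpy as np

def mvdr_weights(noise_cov, steering):
    """noise_cov: (M, M) noise spatial covariance; steering: (M,) steering vector."""
    inv_cov = np.linalg.pinv(noise_cov)    # pseudo-inverse for numerical safety
    num = inv_cov @ steering               # R_n^{-1} d
    denom = steering.conj() @ num          # d^H R_n^{-1} d
    return num / denom                     # w = R_n^{-1} d / (d^H R_n^{-1} d)

# Enhanced signal at this bin: y = w^H x, with x the (M,) vector of microphone observations.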
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of …
TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition
We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on…
The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement
Supervised speech enhancement models are trained using artificially generated mixtures of clean speech and noise signals, which may not match real-world recording conditions at test time. This mismatch can lead to poor performance if the t…
Unsupervised Multi-channel Separation and Adaptation
A key challenge in machine learning is to generalize from training data to an application domain of interest. This work generalizes the recently-proposed mixture invariant training (MixIT) algorithm to perform unsupervised learning in the …
AudioSlots: A slot-centric generative model for audio separation
In a range of recent works, object-centric architectures have been shown to be suitable for unsupervised scene decomposition in the vision domain. Inspired by these methods we present AudioSlots, a slot-centric generative model for blind s…
AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify …
Distance-Based Sound Separation
We propose the novel task of distance-based sound separation, where sounds are separated based only on their distance from a single microphone. In the context of assisted listening devices, proximity provides a simple criterion for sound s…
CycleGAN-Based Unpaired Speech Dereverberation
Typically, neural network-based speech dereverberation models are trained on paired data, composed of a dry utterance and its corresponding reverberant utterance. The main limitation of this approach is that such models can only be trained…
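The unpaired training described here rests on a cycle-consistency constraint; a minimal sketch of that term is shown below, with hypothetical generators G_dry2rev and G_rev2dry standing in for the paper's networks and the adversarial losses omitted.

# Cycle-consistency term for CycleGAN-style unpaired dereverberation (sketch only).
import torch

def cycle_consistency_loss(G_dry2rev, G_rev2dry, dry_batch, rev_batch):
    """G_dry2rev and G_rev2dry are hypothetical generators mapping between the two domains."""
    # Dry -> reverberant -> dry should reconstruct the original dry utterance.
    dry_cycle = G_rev2dry(G_dry2rev(dry_batch))
    # Reverberant -> dry -> reverberant should reconstruct the original reverberant one.
    rev_cycle = G_dry2rev(G_rev2dry(rev_batch))
    return torch.mean(torch.abs(dry_cycle - dry_batch)) + \
           torch.mean(torch.abs(rev_cycle - rev_batch))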
Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training
The recently-proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models in the sense that it does not require ground-truth isolated reference sources. In this paper, we investi…
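As a reminder of how MixIT avoids isolated references, the sketch below scores the separated outputs against the two input mixtures rather than against ground-truth sources, searching over binary assignments; the mean-squared error merely stands in for the SNR-style loss typically used, so treat this as an illustration rather than the paper's exact objective.

# Minimal MixIT-style loss: assign estimated sources to the two reference mixtures.
import itertools
import torch

def mixit_loss(est_sources, mix1, mix2):
    """est_sources: (M, T) model outputs for the mixture of mixtures mix1 + mix2."""
    M = est_sources.shape[0]
    best = None
    # Try every binary assignment of the M estimates to the two reference mixtures.
    for assign in itertools.product([0, 1], repeat=M):
        mask = torch.tensor(assign, dtype=est_sources.dtype).unsqueeze(1)
        est1 = (est_sources * (1 - mask)).sum(dim=0)
        est2 = (est_sources * mask).sum(dim=0)
        loss = torch.mean((est1 - mix1) ** 2) + torch.mean((est2 - mix2) ** 2)
        best = loss if best is None else torch.minimum(best, loss)
    return best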
Improving Bird Classification with Unsupervised Sound Separation
This paper addresses the problem of species classification in bird song recordings. The massive amount of available field recordings of birds presents an opportunity to use machine learning to automatically track bird populations. However,…
DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement
Single-channel speech enhancement (SE) is an important task in speech processing. A widely used framework combines an analysis/synthesis filterbank with a mask prediction network, such as the Conv-TasNet architecture. In such systems, the …
Improving On-Screen Sound Separation for Open-Domain Videos with Audio-Visual Self-Attention
We introduce a state-of-the-art audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify limitations of previous …
Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation
Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on o…
Sound Event Detection and Separation: A Benchmark on Desed Synthetic Soundscapes
We propose a benchmark of state-of-the-art sound event detection systems (SED). We designed synthetic evaluation sets to focus on specific sound event detection challenges. We analyze the performance of the submissions to DCASE 2021 task 4…
End-To-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discr…
What’s all the Fuss about Free Universal Sound Separation Data?
We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio…
Self-Supervised Learning from Automatically Separated Sound Scenes
Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and…
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioSc…
Integration of speech separation, diarization, and recognition for multi-speaker meetings: Separated LibriCSS dataset
Dataset: This data repository contains separated audio streams for the LibriCSS dataset using the following window-based separation methods: 1. Mask-based MVDR: Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, and Fil Alleva, “Multi-microphone ne…