John R. Hershey
Source Separation by Flow Matching
We consider the problem of single-channel audio source separation with the goal of reconstructing $K$ sources from their mixture. We address this ill-posed problem with FLOSS (FLOw matching for Source Separation), a constrained generation …
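As a rough illustration of the flow-matching objective named in this abstract, the sketch below shows a generic conditional flow-matching training loss in PyTorch; the constrained, separation-specific formulation that FLOSS actually uses is not described in the snippet, and `model` is a hypothetical velocity-prediction network.

# Minimal, generic conditional flow-matching loss (illustrative sketch only).
import torch

def flow_matching_loss(model, x1):
    """model(x_t, t) predicts a velocity field; x1 is a batch of target samples."""
    x0 = torch.randn_like(x1)                                     # noise sample
    t = torch.rand(x1.shape[0], *[1] * (x1.dim() - 1),
                   device=x1.device)                              # per-example time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1                                 # linear interpolation path
    v_target = x1 - x0                                            # constant velocity along the path
    v_pred = model(x_t, t)
    return torch.mean((v_pred - v_target) ** 2)                   # regress predicted onto target velocity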
I-Con: A Unifying Framework for Representation Learning
As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of m…
Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement
This paper presents a new challenge that calls for zero-shot text-to-speech (TTS) systems to augment speech data for the downstream task, personalized speech enhancement (PSE), as part of the Generative Data Augmentation workshop at ICASSP…
Generative Data Augmentation Challenge: Synthesis of Room Acoustics for Speaker Distance Estimation
This paper describes the synthesis of the room acoustics challenge as a part of the generative data augmentation workshop at ICASSP 2025. The challenge defines a unique generative task that is designed to improve the quantity and diversity…
Understanding Learning with Sliced-Wasserstein Requires Rethinking Informative Slices
The practical applications of Wasserstein distances (WDs) are constrained by their sample and computational complexities. Sliced-Wasserstein distances (SWDs) provide a workaround by projecting distributions onto one-dimensional subspaces, …
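Since the snippet describes SWDs as projecting distributions onto one-dimensional subspaces, here is a minimal Monte Carlo sketch of that standard computation (random slicing directions, sort-based 1-D Wasserstein distance); it illustrates the usual estimator only, not the paper's analysis of informative slices, and the function name and defaults are made up for this example.

# Monte Carlo estimate of the sliced-Wasserstein distance between two sample sets.
import numpy as np

def sliced_wasserstein(X, Y, n_projections=128, p=2, rng=None):
    """X, Y: (n, d) arrays of samples from the two distributions, with equal n."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    # Random directions on the unit sphere define the one-dimensional slices.
    dirs = rng.standard_normal((n_projections, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project, sort, and compare: the 1-D Wasserstein distance has a closed form.
    proj_X = np.sort(X @ dirs.T, axis=0)
    proj_Y = np.sort(Y @ dirs.T, axis=0)
    return np.mean(np.abs(proj_X - proj_Y) ** p) ** (1.0 / p)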
Towards Sub-millisecond Latency Real-Time Speech Enhancement Models on Hearables
Low latency models are critical for real-time speech enhancement applications, such as hearing aids and hearables. However, the sub-millisecond latency space for resource-constrained hearables remains underexplored. We demonstrate speech e…
Unsupervised Improved MVDR Beamforming for Sound Enhancement
MCFSTD was created with the aim of evaluating multi-channel sound separation and localization. It is introduced in our article titled "Unsupervised Improved MVDR Beamforming for Sound Enhancement". It was recorded in 4 rooms with…
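For reference, the textbook MVDR weight computation that the title refers to is sketched below; the paper's unsupervised improvements and the way spatial statistics are estimated from MCFSTD are not shown here, and the helper name is illustrative.

# Textbook MVDR beamformer weights for one frequency bin (generic sketch).
import numpy as np

def mvdr_weights(noise_cov, steering):
    """noise_cov: (M, M) noise spatial covariance; steering: (M,) steering vector."""
    inv_cov = np.linalg.pinv(noise_cov)    # pseudo-inverse for numerical safety
    num = inv_cov @ steering               # R_n^{-1} d
    denom = steering.conj() @ num          # d^H R_n^{-1} d
    return num / denom                     # w = R_n^{-1} d / (d^H R_n^{-1} d)

# Enhanced signal at this bin: y = w^H x, with x the (M,) vector of microphone observations.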
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of …
TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition
We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on…
The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement
Supervised speech enhancement models are trained using artificially generated mixtures of clean speech and noise signals, which may not match real-world recording conditions at test time. This mismatch can lead to poor performance if the t…
Unsupervised Multi-channel Separation and Adaptation
A key challenge in machine learning is to generalize from training data to an application domain of interest. This work generalizes the recently-proposed mixture invariant training (MixIT) algorithm to perform unsupervised learning in the …
AudioSlots: A slot-centric generative model for audio separation
In a range of recent works, object-centric architectures have been shown to be suitable for unsupervised scene decomposition in the vision domain. Inspired by these methods we present AudioSlots, a slot-centric generative model for blind s…
AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify …
Distance-Based Sound Separation
We propose the novel task of distance-based sound separation, where sounds are separated based only on their distance from a single microphone. In the context of assisted listening devices, proximity provides a simple criterion for sound s…
CycleGAN-Based Unpaired Speech Dereverberation
Typically, neural network-based speech dereverberation models are trained on paired data, composed of a dry utterance and its corresponding reverberant utterance. The main limitation of this approach is that such models can only be trained…
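The unpaired training described here rests on a cycle-consistency constraint; a minimal sketch of that term is shown below, with hypothetical generators G_dry2rev and G_rev2dry standing in for the paper's networks and the adversarial losses omitted.

# Cycle-consistency term for CycleGAN-style unpaired dereverberation (sketch only).
import torch

def cycle_consistency_loss(G_dry2rev, G_rev2dry, dry_batch, rev_batch):
    """G_dry2rev and G_rev2dry are hypothetical generators mapping between the two domains."""
    # Dry -> reverberant -> dry should reconstruct the original dry utterance.
    dry_cycle = G_rev2dry(G_dry2rev(dry_batch))
    # Reverberant -> dry -> reverberant should reconstruct the original reverberant one.
    rev_cycle = G_dry2rev(G_rev2dry(rev_batch))
    return torch.mean(torch.abs(dry_cycle - dry_batch)) + \
           torch.mean(torch.abs(rev_cycle - rev_batch))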
Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training
The recently-proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models in the sense that it does not require ground-truth isolated reference sources. In this paper, we investi…
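As a reminder of how MixIT avoids isolated references, the sketch below scores the separated outputs against the two input mixtures rather than against ground-truth sources, searching over binary assignments; the mean-squared error merely stands in for the SNR-style loss typically used, so treat this as an illustration rather than the paper's exact objective.

# Minimal MixIT-style loss: assign estimated sources to the two reference mixtures.
import itertools
import torch

def mixit_loss(est_sources, mix1, mix2):
    """est_sources: (M, T) model outputs for the mixture of mixtures mix1 + mix2."""
    M = est_sources.shape[0]
    best = None
    # Try every binary assignment of the M estimates to the two reference mixtures.
    for assign in itertools.product([0, 1], repeat=M):
        mask = torch.tensor(assign, dtype=est_sources.dtype).unsqueeze(1)
        est1 = (est_sources * (1 - mask)).sum(dim=0)
        est2 = (est_sources * mask).sum(dim=0)
        loss = torch.mean((est1 - mix1) ** 2) + torch.mean((est2 - mix2) ** 2)
        best = loss if best is None else torch.minimum(best, loss)
    return best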
Improving Bird Classification with Unsupervised Sound Separation
This paper addresses the problem of species classification in bird song recordings. The massive amount of available field recordings of birds presents an opportunity to use machine learning to automatically track bird populations. However,…
DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement
Single-channel speech enhancement (SE) is an important task in speech processing. A widely used framework combines an analysis/synthesis filterbank with a mask prediction network, such as the Conv-TasNet architecture. In such systems, the …
Improving On-Screen Sound Separation for Open-Domain Videos with Audio-Visual Self-Attention
We introduce a state-of-the-art audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify limitations of previous …
Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation
Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on o…
Sound Event Detection and Separation: A Benchmark on Desed Synthetic Soundscapes
We propose a benchmark of state-of-the-art sound event detection systems (SED). We designed synthetic evaluation sets to focus on specific sound event detection challenges. We analyze the performance of the submissions to DCASE 2021 task 4…
End-To-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discr…
What’s all the Fuss about Free Universal Sound Separation Data?
We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio…
Self-Supervised Learning from Automatically Separated Sound Scenes
Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and…
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioSc…
Integration of speech separation, diarization, and recognition for multi-speaker meetings: Separated LibriCSS dataset
Dataset: This data repository contains separated audio streams for the LibriCSS dataset using the following window-based separation methods: 1. Mask-based MVDR: Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, and Fil Alleva, “Multi-microphone ne…