Scott Wisdom
Towards Sub-millisecond Latency Real-Time Speech Enhancement Models on Hearables
Low latency models are critical for real-time speech enhancement applications, such as hearing aids and hearables. However, the sub-millisecond latency space for resource-constrained hearables remains underexplored. We demonstrate speech e…
Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge
TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition
We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on…
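As rough intuition for token-based separation, the task can be framed as sequence-to-sequence modeling: mixture token ids in, source token ids out. The toy model below only sketches that framing, with made-up sizes; TokenSplit's actual audio codec, architecture, and multi-task losses are not reproduced here.

```python
import torch
import torch.nn as nn

class ToyTokenSeparator(nn.Module):
    """Toy seq2seq separator over discrete audio tokens (illustrative
    stand-in only; not the TokenSplit architecture)."""
    def __init__(self, vocab=1024, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.transformer = nn.Transformer(
            d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, mix_tokens, tgt_tokens):
        # mix_tokens: (B, Tm) codec token ids of the mixture.
        # tgt_tokens: (B, Tt) ids of the concatenated per-source token
        # sequences, shifted right for teacher forcing.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1))
        h = self.transformer(self.embed(mix_tokens),
                             self.embed(tgt_tokens), tgt_mask=tgt_mask)
        return self.out(h)  # (B, Tt, vocab) next-token logits
```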
The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement
Supervised speech enhancement models are trained using artificially generated mixtures of clean speech and noise signals, which may not match real-world recording conditions at test time. This mismatch can lead to poor performance if the t…
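As the abstract notes, supervised enhancement data is synthesized by mixing clean speech with noise. The standard recipe scales the noise to hit a target SNR before summing; a generic sketch follows (the names and small epsilon are illustrative, and real challenge pipelines add steps such as reverberation and level normalization):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix clean speech and noise at a target SNR in dB."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(p_speech / p_noise_scaled) == snr_db.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```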
Unsupervised Multi-channel Separation and Adaptation
A key challenge in machine learning is to generalize from training data to an application domain of interest. This work generalizes the recently-proposed mixture invariant training (MixIT) algorithm to perform unsupervised learning in the …
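Since MixIT recurs in several of the papers below, a minimal single-channel sketch may help: the model separates the sum of two reference mixtures into M sources, and the loss searches over all assignments of estimated sources back to the two mixtures, keeping the best. This toy NumPy version uses a simple power-ratio SNR and brute-force enumeration; the papers' exact loss and efficient variants differ.

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mixtures, eps=1e-8):
    """Brute-force MixIT loss.

    est_sources: (M, T) sources estimated from the sum of the mixtures.
    mixtures:    (2, T) the two reference input mixtures.
    """
    num_sources = est_sources.shape[0]
    best = np.inf
    # Each source is assigned to exactly one of the two mixtures,
    # so there are 2**M binary assignment matrices to try.
    for assignment in itertools.product([0, 1], repeat=num_sources):
        remix = np.zeros_like(mixtures)
        for src, mix_idx in enumerate(assignment):
            remix[mix_idx] += est_sources[src]
        loss = 0.0
        for m in range(2):
            err = np.sum((remix[m] - mixtures[m]) ** 2) + eps
            ref = np.sum(mixtures[m] ** 2)
            loss -= 10.0 * np.log10(ref / err + eps)  # negative SNR
        best = min(best, loss)
    return best
```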
AudioSlots: A slot-centric generative model for audio separation
In a range of recent works, object-centric architectures have been shown to be suitable for unsupervised scene decomposition in the vision domain. Inspired by these methods we present AudioSlots, a slot-centric generative model for blind s…
AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify …
Distance-Based Sound Separation
We propose the novel task of distance-based sound separation, where sounds are separated based only on their distance from a single microphone. In the context of assisted listening devices, proximity provides a simple criterion for sound s…
Text-Driven Separation of Arbitrary Sounds
We propose a method of separating a desired sound source from a single-channel mixture, based on either a textual description or a short audio sample of the target source. This is achieved by combining two distinct models. The first model,…
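The conditioning idea described here, using an embedding of the target source (from a text description or an example clip) to steer a separation network, can be sketched with FiLM-style feature modulation. The layer sizes and names below are hypothetical, not the paper's model.

```python
import torch
import torch.nn as nn

class ConditionalMaskNet(nn.Module):
    """Toy mask predictor modulated by a target-source embedding."""
    def __init__(self, n_freq=257, emb_dim=128, hidden=256):
        super().__init__()
        self.enc = nn.Linear(n_freq, hidden)
        self.film = nn.Linear(emb_dim, 2 * hidden)  # per-feature scale/shift
        self.dec = nn.Linear(hidden, n_freq)

    def forward(self, mag, cond_emb):
        # mag: (B, T, F) mixture magnitude spectrogram.
        # cond_emb: (B, emb_dim) embedding of the desired source,
        # produced by a text or audio encoder (not shown).
        h = torch.relu(self.enc(mag))
        scale, shift = self.film(cond_emb).chunk(2, dim=-1)
        h = h * (1.0 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return torch.sigmoid(self.dec(h))  # (B, T, F) soft mask
```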
CycleGAN-Based Unpaired Speech Dereverberation
Typically, neural network-based speech dereverberation models are trained on paired data, composed of a dry utterance and its corresponding reverberant utterance. The main limitation of this approach is that such models can only be trained…
Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training
The recently-proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models in the sense that it does not require ground-truth isolated reference sources. In this paper, we investi…
Self-Supervised Learning from Automatically Separated Sound Scenes
Paper presented at the 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), held October 17–20, 2021, in New Paltz, United States.
Improving Bird Classification with Unsupervised Sound Separation
This paper addresses the problem of species classification in bird song recordings. The massive amount of available field recordings of birds presents an opportunity to use machine learning to automatically track bird populations. However,…
DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement
Single-channel speech enhancement (SE) is an important task in speech processing. A widely used framework combines an analysis/synthesis filterbank with a mask prediction network, such as the Conv-TasNet architecture. In such systems, the …
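The analysis/mask/synthesis pattern the abstract refers to is easy to sketch with an STFT filterbank (Conv-TasNet and DF-Conformer use a learned filterbank instead, and `mask_net` below is a hypothetical stand-in for the mask prediction network):

```python
import torch

def mask_based_enhancement(noisy, mask_net, n_fft=512, hop=128):
    """Enhance a 1-D waveform via STFT analysis, masking, and synthesis."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy, n_fft, hop_length=hop, window=window,
                      return_complex=True)            # (freq, frames)
    mask = mask_net(spec.abs())                       # values in [0, 1]
    return torch.istft(mask * spec, n_fft, hop_length=hop, window=window,
                       length=noisy.shape[-1])
```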
Improving On-Screen Sound Separation for Open-Domain Videos with Audio-Visual Self-Attention
We introduce a state-of-the-art audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify limitations of previous …
Evaluation set DCASE 2021 task 4 (for submissions)
This repo contains the dataset to download to submit results and be evaluated in task 4 of DCASE 2021. It also contains the ground-truth for the public and synthetic evaluation dataset, together with the mapping file between the anonymized…
Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation
Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on o…
Sound Event Detection and Separation: A Benchmark on Desed Synthetic Soundscapes
We propose a benchmark of state-of-the-art sound event detection systems (SED). We designed synthetic evaluation sets to focus on specific sound event detection challenges. We analyze the performance of the submissions to DCASE 2021 task 4…
End-To-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discr…
What’s all the Fuss about Free Universal Sound Separation Data?
We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio…
Self-Supervised Learning from Automatically Separated Sound Scenes
Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and…
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioSc…
Integration of speech separation, diarization, and recognition for multi-speaker meetings: Separated LibriCSS dataset
This data repository contains separated audio streams for the LibriCSS dataset, produced with the following window-based separation methods: 1. Mask-based MVDR: Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, and Fil Alleva, “Multi-microphone ne…
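For reference, mask-based MVDR in its common trace-normalized (Souden) formulation can be sketched as below; this is the textbook version, not necessarily the exact variant used to produce these streams.

```python
import numpy as np

def mask_based_mvdr(stft_mix, speech_mask, ref_ch=0, eps=1e-8):
    """MVDR beamforming from mask-weighted spatial covariances.

    stft_mix:    (C, F, T) complex multichannel STFT of the mixture.
    speech_mask: (F, T) speech-presence mask in [0, 1].
    """
    num_ch, num_freq, _ = stft_mix.shape
    noise_mask = 1.0 - speech_mask
    out = np.zeros(stft_mix.shape[1:], dtype=complex)
    for f in range(num_freq):
        X = stft_mix[:, f, :]  # (C, T)
        # Mask-weighted spatial covariance matrices of speech and noise.
        phi_s = (speech_mask[f] * X) @ X.conj().T / (speech_mask[f].sum() + eps)
        phi_n = (noise_mask[f] * X) @ X.conj().T / (noise_mask[f].sum() + eps)
        phi_n += 1e-6 * np.eye(num_ch)  # diagonal loading for stability
        # w = (Phi_n^{-1} Phi_s / trace(Phi_n^{-1} Phi_s)) e_ref
        num = np.linalg.solve(phi_n, phi_s)
        w = num[:, ref_ch] / (np.trace(num) + eps)
        out[f] = w.conj() @ X
    return out
```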
Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis
Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, an…
Improving Sound Event Detection In Domestic Environments Using Sound Separation
Performing sound event detection on real-world recordings often implies dealing with overlapping target sound events and non-target sounds, also referred to as interference or noise. Until now these problems were mainly tackled at the clas…