Jesse Engel
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
MuChoMusic is a benchmark designed to evaluate music understanding in multimodal language models focused on audio. It includes 1,187 multiple-choice questions v…
Noise2Music: Text-conditioned Music Generation with Diffusion Models
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation cond…
SingSong: Generating musical accompaniments from singing
We present SingSong, a system that generates instrumental music to accompany input vocals, potentially offering musicians and non-musicians alike an intuitive new way to create music featuring their own voice. To accomplish this, we build …
Redefining Relationships in Music
AI tools increasingly shape how we discover, make and experience music. While these tools have the potential to empower creativity, they may fundamentally redefine relationships between stakeholders, to the benefit of some and the detr…
Multi-instrument Music Synthesis with Spectrogram Diffusion
An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-speci…
Scaling Polyphonic Transcription with Mixtures of Monophonic Transcriptions
Automatic Music Transcription (AMT), in particular the problem of automatically extracting notes from audio, has seen much recent progress via the training of neural network models on musical audio recordings paired with aligned ground-tru…
The Chamber Ensemble Generator: Limitless High-Quality MIR Data via Generative Modeling
Data is the lifeblood of modern machine learning systems, including for those in Music Information Retrieval (MIR). However, MIR has long been mired by small datasets and unreliable labels. In this work, we propose to break this bottleneck…
Improving Source Separation by Explicitly Modeling Dependencies Between Sources
We propose a new method for training a supervised source separation system that aims to learn the interdependent relationships between all combinations of sources in a mixture. Rather than independently estimating each source from a mix, w…
Expressive Communication: Evaluating Developments in Generative Models and Steering Interfaces for Music Creation
There is an increasing interest from ML and HCI communities in empowering creators with better generative models and more intuitive interfaces with which to control them. In music, ML researchers have focused on training models capable of …
HEAR: Holistic Evaluation of Audio Representations
What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a…
General-purpose, long-context autoregressive modeling with Perceiver AR
Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively …
MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling
Musical expression requires control of both what notes are played, and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatena…
Expressive Communication: A Common Framework for Evaluating Developments in Generative Models and Steering Interfaces
There is an increasing interest from ML and HCI communities in empowering creators with better generative models and more intuitive interfaces with which to control them. In music, ML researchers have focused on training models capable of …
Symbolic Music Generation with Diffusion Models
Score-based generative models and diffusion probabilistic models have been successful at generating high-quality samples in a variety of continuous domains. However, due to their Langevin-inspired sampling mechanisms, their application to …
Sequence-to-Sequence Piano Transcription with Transformers
Automatic Music Transcription has seen significant progress in recent years by training custom deep neural networks on large datasets. However, these models have required extensive domain-specific design of network architectures, input/out…
MT3: Multi-Task Multitrack Music Transcription
Automatic Music Transcription (AMT), inferring musical notes from raw audio, is a challenging task at the core of music understanding. Unlike Automatic Speech Recognition (ASR), which typically focuses on the words of a single speaker, AMT…
Variable-rate discrete representation learning
Semantically meaningful information content in perceptual signals is usually unevenly distributed. In speech signals for example, there are often many silences, and the speed of pronunciation can vary considerably. In this work, we propose…
MAESTRO
MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) is a dataset composed of about 200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. We partnered wit…
Author Correction: Analog Coding in Emerging Memory Systems
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset
We introduce the Expanded Groove MIDI dataset (E-GMD), an automatic drum transcription (ADT) dataset that contains 444 hours of audio from 43 drum kits, making it an order of magnitude larger than similar datasets, and the first with human…
Expanded Groove MIDI Dataset
The Expanded Groove MIDI Dataset (E-GMD) is a large dataset of human drum performances, with audio recordings annotated in MIDI. E-GMD contains 444 hours of audio from 43 drum kits and is an order of magnitude larger than similar datasets. I…
DDSP: Differentiable Digital Signal Processing
Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is ge…
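The idea of building in knowledge of how sound is generated can be illustrated with a minimal additive harmonic synthesizer, where a network would only need to predict interpretable controls (fundamental frequency and per-harmonic amplitudes) rather than raw samples. This is a hedged NumPy sketch for intuition only; it is not the DDSP library's API, and all names here are illustrative:

```python
import numpy as np

def harmonic_synth(f0_hz, amplitudes, sample_rate=16000):
    """Render audio from per-sample f0 and per-harmonic amplitudes.

    f0_hz:      array of shape [n_samples], fundamental frequency per sample.
    amplitudes: array of shape [n_samples, n_harmonics].
    """
    n_samples, n_harmonics = amplitudes.shape
    harmonic_numbers = np.arange(1, n_harmonics + 1)          # [n_harmonics]
    # Instantaneous frequency of each harmonic at each sample.
    freqs = f0_hz[:, None] * harmonic_numbers[None, :]        # [n_samples, n_harmonics]
    # Silence harmonics above the Nyquist frequency to avoid aliasing.
    amplitudes = np.where(freqs < sample_rate / 2, amplitudes, 0.0)
    # Integrate frequency to phase, then sum the sinusoids.
    phases = 2 * np.pi * np.cumsum(freqs / sample_rate, axis=0)
    return np.sum(amplitudes * np.sin(phases), axis=1)        # [n_samples]

# One second of a 440 Hz tone with three decaying harmonics.
n = 16000
f0 = np.full(n, 440.0)
amps = np.tile(np.array([1.0, 0.5, 0.25]), (n, 1))
audio = harmonic_synth(f0, amps)
```

Every operation above is differentiable with respect to `f0_hz` and `amplitudes`, which is what lets such a synthesizer sit inside a neural network and be trained end to end when written in an autodiff framework.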
Encoding Musical Style with Transformer Autoencoders
We consider the problem of learning high-level controls over the global structure of generated sequences, particularly in the context of symbolic music generation with complex language models. In this work, we present the Transformer autoe…
Fast and Flexible Neural Audio Synthesis
Autoregressive neural networks, such as WaveNet, have opened up new avenues for expressive audio synthesis. High-quality speech synthesis utilizes detailed linguistic features for conditioning, but comparable levels of control have yet to …