Gerhard Widmer
A Study on the Data Distribution Gap in Music Emotion Recognition
Music Emotion Recognition (MER) is a task deeply connected to human perception, relying heavily on subjective annotations collected from contributors. Prior studies tend to focus on specific musical styles rather than incorporating a diver…
Exploring System Adaptations For Minimum Latency Real-Time Piano Transcription
Advances in neural network design and the availability of large-scale labeled datasets have driven major improvements in piano transcription. Existing approaches target either offline applications, with no restrictions on computational dem…
AnalysisGNN: Unified Music Analysis with Graph Neural Networks
Recent years have seen a boom in computational approaches to music analysis, yet each one is typically tailored to a specific analytical domain. In this work, we introduce AnalysisGNN, a novel graph neural network framework that leverages …
Optical Music Recognition of Jazz Lead Sheets
In this paper, we address the challenge of Optical Music Recognition (OMR) for handwritten jazz lead sheets, a widely used musical score type that encodes melody and chords. The task is challenging due to the presence of chords, a score co…
On Temporal Guidance and Iterative Refinement in Audio Source Separation
Spatial semantic segmentation of sound scenes (S5) involves the accurate identification of active sound classes and the precise separation of their sources from complex acoustic mixtures. Conventional systems rely on a two-stage pipeline -…
Music Boomerang: Reusing Diffusion Models for Data Augmentation and Audio Manipulation
Generative models of music audio are typically used to generate output based solely on a text prompt or melody. Boomerang sampling, recently proposed for the image domain, allows generating output close to an existing example, using any pr…
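The key idea of Boomerang sampling is to run the forward diffusion process only part-way before denoising with a pretrained model, so the output stays close to the input. The forward half can be sketched as follows (a minimal numpy illustration using standard DDPM notation; `alpha_bar_t` is the cumulative noise-schedule product at step t, and the pretrained denoiser that would map the noisy sample back is not shown):

```python
import numpy as np

def boomerang_perturb(x, alpha_bar_t, rng):
    """Forward-diffuse an existing example only part-way (to step t):
    x_t = sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * noise.
    A pretrained diffusion model would then denoise x_t back to the
    data manifold, yielding a variation close to the original."""
    noise = rng.normal(size=x.shape)
    return np.sqrt(alpha_bar_t) * x + np.sqrt(1.0 - alpha_bar_t) * noise
```

The closer `alpha_bar_t` is to 1 (i.e., the fewer forward steps taken), the more the denoised result resembles the original example.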
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
TACOS is a collection of 12,358 audio recordings, annotated with 47,748 temporally strong audio captions (i.e., textual descriptions of sound events and their corresponding temporal onsets and offsets). Each audio file is additionally pair…
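A temporally strong caption pairs each textual description with onset and offset times inside the clip. A hypothetical record layout conveys the idea (field names are illustrative only, not the dataset's actual schema):

```python
# Hypothetical record layout for one clip with temporally strong
# captions (field names are illustrative, not TACOS's actual schema).
record = {
    "audio": "clip_0001.wav",
    "global_caption": "street ambience with passing traffic",
    "events": [
        {"caption": "a car horn honks twice", "onset": 3.2, "offset": 4.1},
        {"caption": "footsteps approach on gravel", "onset": 6.0, "offset": 9.5},
    ],
}

# Unlike a single clip-level caption, each event caption is grounded in time.
for ev in record["events"]:
    assert ev["onset"] < ev["offset"]
```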
Pairing Real-Time Piano Transcription with Symbol-level Tracking for Precise and Robust Score Following
Real-time music tracking systems follow a musical performance and at any time report the current position in a corresponding score. Most existing methods approach this problem exclusively in the audio domain, typically using online time wa…
How to Infer Repeat Structures in MIDI Performances
MIDI performances are generally expedient in performance research and music information retrieval, and even more so if they can be connected to a score. This connection is usually established by means of alignment, linking either notes or …
Low-Complexity Acoustic Scene Classification with Device Information in the DCASE 2025 Challenge
This paper presents the Low-Complexity Acoustic Scene Classification with Device Information Task of the DCASE 2025 Challenge and its baseline system. Continuing the focus on low-complexity models, data efficiency, and device mismatch from…
Creating a Good Teacher for Knowledge Distillation in Acoustic Scene Classification
Knowledge Distillation (KD) is a widespread technique for compressing the knowledge of large models into more compact and efficient models. KD has proved to be highly effective in building well-performing low-complexity Acoustic Scene Clas…
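The standard KD objective blends hard-label cross-entropy with the KL divergence to the teacher's temperature-softened outputs. A minimal sketch of this generic recipe (not necessarily the paper's exact loss or hyperparameters):

```python
import numpy as np

def softmax(z, T):
    """Temperature-scaled softmax along the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of hard-label cross-entropy and KL divergence to the
    teacher's softened distribution (T**2 rescales the soft-target
    gradient magnitude, following Hinton et al.)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1).mean()
    p_hard = softmax(student_logits, 1.0)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * ce + (1 - alpha) * (T ** 2) * kl
```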
Exploring Performance-Complexity Trade-Offs in Sound Event Detection Models
We target the problem of developing new low-complexity networks for the sound event detection task. Our goal is to meticulously analyze the performance-complexity trade-off, aiming to be competitive with the large state-of-the-art models, …
Language Models for Music Medicine Generation
Music therapy has been shown in recent years to provide multiple health benefits related to emotional wellness. In turn, maintaining a healthy emotional state has proven to be effective for patients undergoing treatment, such as Parkinson'…
Effective Pre-Training of Audio Transformers for Sound Event Detection
We propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. Th…
Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval
Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two modaliti…
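The contrastive objective common to such dual-encoder systems can be sketched in a few lines. This is a generic symmetric InfoNCE-style loss, not necessarily the paper's exact formulation:

```python
import numpy as np

def info_nce(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of audio/caption
    embedding pairs, where row i of each matrix is a matching pair."""
    # L2-normalise so dot products become cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(a))            # matching pairs lie on the diagonal

    def xent(l):
        # Numerically stable cross-entropy towards the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the audio->text and text->audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimising this pulls matching audio-caption pairs together and pushes mismatching pairs apart in the shared embedding space.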
Improving Query-by-Vocal Imitation with Contrastive Learning and Audio Pretraining
Query-by-Vocal Imitation (QBV) is about searching audio files within databases using vocal imitations created by the user's voice. Since most humans can effectively communicate sound concepts through voice, QBV offers the more intuitive an…
Controlling Surprisal in Music Generation via Information Content Curve Matching
In recent years, the quality and public interest in music generation systems have grown, encouraging research into various ways to control these systems. We propose a novel method for controlling surprisal in music generation using sequenc…
TheGlueNote: Learned Representations for Robust and Flexible Note Alignment
Note alignment refers to the task of matching individual notes of two versions of the same symbolically encoded piece. Methods addressing this task commonly rely on sequence alignment algorithms such as Hidden Markov Models or Dynamic Time…
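For reference, the classic Dynamic Time Warping recurrence that such alignment methods build on looks like this (a plain quadratic-time sketch over generic sequences, independent of the paper's learned representations):

```python
import numpy as np

def dtw_align(seq_a, seq_b, cost=lambda x, y: abs(x - y)):
    """Classic DTW: returns the total alignment cost and the optimal
    warping path as a list of (i, j) index pairs."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(seq_a[i - 1], seq_b[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end to recover the optimal path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]
```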
Beat this! Accurate beat tracking without DBN postprocessing
We propose a system for tracking beats and downbeats with two objectives: generality across a diverse music range, and high accuracy. We achieve generality by training on multiple datasets -- including solo instrument recordings, pieces wi…
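A common lightweight alternative to DBN postprocessing is simple peak picking over the network's frame-wise beat activation curve. A minimal sketch (the threshold, frame rate, and minimum gap here are illustrative, not the paper's settings):

```python
import numpy as np

def pick_beats(activation, fps=50, threshold=0.5, min_gap=0.2):
    """Keep local maxima of a frame-wise beat activation curve that
    exceed a threshold, enforcing a minimum inter-beat gap (seconds).
    Returns beat positions in seconds."""
    beats = []
    gap_frames = int(min_gap * fps)
    for i in range(1, len(activation) - 1):
        is_peak = activation[i] >= activation[i - 1] and activation[i] > activation[i + 1]
        if is_peak and activation[i] >= threshold:
            if not beats or i - beats[-1] >= gap_frames:
                beats.append(i)
    return np.array(beats) / fps  # frame indices -> seconds
```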
Fine-Grained and Efficient Self-Unlearning with Layered Iteration
As machine learning models become widely deployed in data-driven applications, ensuring compliance with the 'right to be forgotten' as required by many privacy regulations is vital for safeguarding user privacy. To forget the given data, e…
DExter: Learning and Controlling Performance Expression with Diffusion Models
In the pursuit of developing expressive music performance models using artificial intelligence, this paper introduces DExter, a new approach leveraging diffusion probabilistic models to render Western classical piano performances. The main…
Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training
This technical report describes the CP-JKU team's submission for Task 4 Sound Event Detection with Heterogeneous Training Datasets and Potentially Missing Labels of the DCASE 24 Challenge. We fine-tune three large Audio Spectrogram Transfo…
Multi-Iteration Multi-Stage Fine-Tuning of Transformers for Sound Event Detection with Heterogeneous Datasets
A central problem in building effective sound event detection systems is the lack of high-quality, strongly annotated sound event datasets. For this reason, Task 4 of the DCASE 2024 challenge proposes learning from two heterogeneous datase…
GraphMuse: A Library for Symbolic Music Graph Processing
Graph Neural Networks (GNNs) have recently gained traction in symbolic music tasks, yet a lack of a unified framework impedes progress. Addressing this gap, we present GraphMuse, a graph processing framework and library that facilitates ef…
Cluster and Separate: a GNN Approach to Voice and Staff Prediction for Score Engraving
This paper approaches the problem of separating the notes from a quantized symbolic music piece (e.g., a MIDI file) into multiple voices and staves. This is a fundamental part of the larger task of music score engraving (or score typesetti…
Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval
Matching raw audio signals with textual descriptions requires understanding the audio's content and the description's semantics and then drawing connections between the two modalities. This paper investigates a hybrid retrieval system that…
DExter: Learning and Controlling Performance Expression with Diffusion Models
In the pursuit of developing expressive music performance models using artificial intelligence, this paper introduces DExter, a new approach leveraging diffusion probabilistic models to render Western classical piano performances. In this …
Towards Musically Informed Evaluation of Piano Transcription Models
Automatic piano transcription models are typically evaluated using simple frame- or note-wise information retrieval (IR) metrics. Such benchmark metrics do not provide insights into the transcription quality of specific musical aspects suc…
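The frame-wise IR metrics in question reduce to precision, recall, and F-measure over binary piano-roll activation matrices. A minimal sketch of this standard computation:

```python
import numpy as np

def frame_metrics(ref, est):
    """Frame-wise precision, recall, and F-measure over binary
    (pitch x time) activation matrices of equal shape."""
    ref, est = ref.astype(bool), est.astype(bool)
    tp = np.logical_and(ref, est).sum()        # frames active in both
    precision = tp / max(est.sum(), 1)         # fraction of predictions that are correct
    recall = tp / max(ref.sum(), 1)            # fraction of reference frames recovered
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```

Such aggregate scores are exactly what the paper argues hide musically meaningful error patterns, since every frame is weighted equally regardless of its musical role.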
Data-Efficient Low-Complexity Acoustic Scene Classification in the DCASE 2024 Challenge
This article describes the Data-Efficient Low-Complexity Acoustic Scene Classification Task in the DCASE 2024 Challenge and the corresponding baseline system. The task setup is a continuation of previous editions (2022 and 2023), which foc…