Zhiyao Duan
Turning Patients’ Open-Ended Narratives of Chronic Pain Into Quantitative Measures: Natural Language Processing Study
Background Subjective report of pain remains the gold standard for assessing symptoms in patients with chronic pain and their response to analgesics. This subjectivity underscores the importance of understanding patients’ personal narrativ…
Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion
Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conver…
Investigating an Overfitting and Degeneration Phenomenon in Self-Supervised Multi-Pitch Estimation
Multi-Pitch Estimation (MPE) continues to be a sought-after capability of Music Information Retrieval (MIR) systems, and is critical for many applications and downstream tasks involving pitch, including music transcription. However, existi…
A Review on Score-based Generative Models for Audio Applications
Diffusion models have emerged as powerful deep generative techniques, producing high-quality and diverse samples in various domains, including audio. These models have many different design choices suitable for different app…
PartialEdit: Identifying Partial Deepfakes in the Era of Neural Speech Editing
Neural speech editing enables seamless partial edits to speech utterances, allowing modifications to selected content while preserving the rest of the audio unchanged. This useful technique, however, also poses new risks of deepfakes. To e…
Twenty-Five Years of MIR Research: Achievements, Practices, Evaluations, and Future Challenges
HARP 2.0: Expanding Hosted, Asynchronous, Remote Processing for Deep Learning in the DAW
HARP 2.0 brings deep learning models to digital audio workstation (DAW) software through hosted, asynchronous, remote processing, allowing users to route audio from a plug-in interface through any compatible Gradio endpoint to perform arbi…
Structural Design and Dynamic Analysis of a Deep Space Exploration Zoom Camera
Space optical cameras serve as vital tools for solar observation, mostly employing fixed-focus systems to reduce moving parts and increase system stability. However, with increasing demands for observation, maintaining consistent image siz…
Audio Visual Segmentation Through Text Embeddings
The goal of Audio-Visual Segmentation (AVS) is to localize and segment the sounding source objects from video frames. Research on AVS suffers from data scarcity due to the high cost of fine-grained manual annotations. Recent works attempt …
Measure by Measure: Measure-Based Automatic Music Composition with Modern Staff Notation
This paper introduces a hierarchical framework for automatic composition of polyphonic music in Western modern staff notation. Central to our framework, a music score is represented as a grid of part-wise measures, where each measure is en…
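The grid-of-measures score representation described in this abstract can be sketched as a simple data structure. This is an illustrative reconstruction under assumptions, not the authors' implementation; the `Score` and `Measure` names and their fields are hypothetical:

```python
from dataclasses import dataclass, field

# Illustrative sketch: a score as a grid of part-wise measures.
# grid[part][measure] holds the content of one part in one measure.

@dataclass
class Measure:
    # Hypothetical note encoding: (pitch name, onset in beats, duration in beats).
    notes: list = field(default_factory=list)

@dataclass
class Score:
    num_parts: int
    num_measures: int
    grid: list = field(default_factory=list)

    def __post_init__(self):
        # Build the parts x measures grid of empty measures.
        self.grid = [[Measure() for _ in range(self.num_measures)]
                     for _ in range(self.num_parts)]

    def measure(self, part: int, idx: int) -> Measure:
        return self.grid[part][idx]

# A two-part score of four measures; add one note to part 0, measure 2.
score = Score(num_parts=2, num_measures=4)
score.measure(0, 2).notes.append(("C4", 0.0, 1.0))
```

Indexing measures by (part, measure) pairs makes it natural to generate or edit the score one measure at a time, which matches the measure-based framing of the title.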
SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge
With the advancements in singing voice generation and the growing presence of AI singers on media platforms, the inaugural Singing Voice Deepfake Detection (SVDD) Challenge aims to advance research in identifying AI-generated singing voice…
Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition
Currently, a common approach in many speech processing tasks is to leverage large-scale pre-trained models by fine-tuning them on in-domain data for a particular application. Yet obtaining even a small amount of such data can be problemati…
A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection
This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms are continually emerging, and these algorithms are not encountered during the development of…
GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis
Expressive speech synthesis aims to generate speech that captures a wide range of para-linguistic features, including emotion and articulation, though current research primarily emphasizes emotional aspects over the nuanced articulatory fe…
CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensin…
SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan
The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing v…
Scoring Time Intervals using Non-Hierarchical Transformer For Automatic Piano Transcription
The neural semi-Markov Conditional Random Field (semi-CRF) framework has demonstrated promise for event-based piano transcription. In this framework, all events (notes or pedals) are represented as closed time intervals tied to specific ev…
MusicHiFi: Fast High-Fidelity Stereo Vocoding
Diffusion-based audio and music generation models commonly perform generation by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vo…
Toward Fully Self-Supervised Multi-Pitch Estimation
Multi-pitch estimation is a decades-long research problem involving the detection of pitch activity associated with concurrent musical events within multi-instrument mixtures. Supervised learning techniques have demonstrated solid performa…
Cacophony: An Improved Contrastive Audio-Text Model
Despite recent advancements, audio-text models still lag behind their image-text counterparts in scale and performance. In this paper, we propose to improve both the data scale and the training procedure of audio-text contrastive models. S…
SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge (CtrSVDD Track, Training/Development Set)
For more information about SVDD Challenge 2024, please refer to https://challenge.singfake.org/. We have released the training and development set here and other relevant scripts on GitHub (https://github.com/SVDDChallenge/SVDD_Utils). For …
BeatNet+: Real-Time Rhythm Analysis for Diverse Music Audio
This paper presents a comprehensive study on real-time music rhythm analysis, covering joint beat and downbeat tracking for diverse kinds of music signals. We introduce BeatNet+, a two-stage approach to real-time rhythm analysis built on a…
Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech
Dimensional representations of speech emotions, such as the arousal-valence (AV) representation, provide a more continuous and fine-grained description and control than their categorical counterparts. They have wide applications in tasks such as …
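The contrast drawn in this abstract, between a discrete label set and a continuous two-dimensional AV space, can be sketched with a simple lookup from category to coordinates. The numeric placements below are hypothetical illustrations loosely inspired by circumplex-style layouts, not values from the paper:

```python
# Illustrative sketch: mapping categorical emotion labels to rough
# (arousal, valence) coordinates, each in [-1, 1]. The specific numbers
# are assumed placeholders, not learned or published values.
CATEGORY_TO_AV = {
    "happy":   (0.6, 0.8),    # high arousal, positive valence
    "angry":   (0.8, -0.7),   # high arousal, negative valence
    "sad":     (-0.5, -0.7),  # low arousal, negative valence
    "neutral": (0.0, 0.0),    # origin of the AV plane
}

def to_av(label: str) -> tuple:
    """Return the (arousal, valence) point assigned to a categorical label."""
    return CATEGORY_TO_AV[label]
```

A learned model, as studied in the paper, would instead regress AV coordinates directly from speech, so that nearby points in the plane capture gradations a fixed label set cannot.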
EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis
Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate on the latent domain with cascaded phase recovery modules to reconstruct the waveform. This poses challenges when generating high-fidelity audio. In …
SynthTab: Leveraging Synthesized Data for Guitar Tablature Transcription
Guitar tablature is a form of music notation widely used among guitarists. It captures not only the musical content of a piece, but also its implementation and ornamentation on the instrument. Guitar Tablature Transcription (GTT) is an imp…
Mitigating Cross-Database Differences for Learning Unified HRTF Representation
Individualized head-related transfer functions (HRTFs) are crucial for accurate sound positioning in virtual auditory displays. As the acoustic measurement of HRTFs is resource-intensive, predicting individualized HRTFs using machine learn…