Ruibin Yuan
UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key chall…
WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation
The development of speech understanding and generation has been significantly accelerated by the availability of large-scale, high-quality speech datasets. Among these, ASR and TTS are regarded as the most established and fundamental tasks…
AudioX: Diffusion Transformer for Anything-to-Audio Generation
Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality,…
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting mu…
Audio-FLAN: A Preliminary Release
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the d…
CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages
CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities, including sheet music, perform…
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models
Current music information retrieval systems face challenges in managing linguistic diversity and integrating various musical modalities. These limitations reduce their effectiveness in a global, multimodal music environment. To add…
Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer
Despite the significant progress in controllable music generation and editing, challenges remain in the quality and length of generated music due to the use of Mel-spectrogram representations and UNet-based model structures. To address the…
HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
As Large Language Models (LLMs) grow increasingly powerful, ensuring their safety and alignment with human values remains a critical challenge. Ideally, LLMs should provide informative responses while avoiding the disclosure of harmful or …
OmniBench: Towards The Future of Universal Omni-Language Models
Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, …
SongTrans: An unified song transcription and alignment method for lyrics and notes
The quantity of processed data is crucial for advancing the field of singing voice synthesis. While there are tools available for lyric or note transcription tasks, they all require pre-processed data, which is relatively time-consuming (e.g.,…
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
Massive multi-modality datasets play a significant role in facilitating the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames, considering audio to be wea…
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 360K video-music pairs, including various genres such as movie trailers, advertisements, and documen…
ComposerX: Multi-Agent Symbolic Music Composition with LLMs
Music composition represents the creative side of humanity, and is itself a complex task that requires the ability to understand and generate information under long-range dependency and harmony constraints. While demonstrating impressive capabiliti…
MuPT: A Generative Symbolic Music Pretrained Transformer
In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible wi…
Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model
In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional metho…
RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation
Large Language Models (LLMs) exhibit remarkable capabilities but are prone to generating inaccurate or hallucinatory responses. This limitation stems from their reliance on vast pretraining datasets, making them susceptible to errors in un…
Modeling Analog Dynamic Range Compressors using Deep Learning and State-space Models
We describe a novel approach for developing realistic digital models of dynamic range compressors for digital audio production by analyzing their analog prototypes. While realistic digital dynamic compressors are potentially useful for man…
ChatMusician: Understanding and Generating Music Intrinsically with LLM
While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that inte…
CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models
The advancement of large language models (LLMs) has enhanced the ability to generalize across a wide range of unseen natural language processing (NLP) tasks through instruction-following. Yet, their effectiveness often diminishes in low-re…
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alter…
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
As the capabilities of large multimodal models (LMMs) continue to advance, evaluating the performance of LMMs emerges as an increasing need. Additionally, there is an even larger gap in evaluating the advanced knowledge and reasoning abili…
Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation
Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While existing methods can generate gestures that follow a single emotion label, they overlo…
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions…
On the Effectiveness of Speech Self-supervised Learning for Music
Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications. However, its efficacy in music information retrieval (MIR) remains largely unexplored. While previous SSL mode…
LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our nov…
MARBLE: Music Audio Representation Benchmark for Universal Evaluation
In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limit…
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its ap…