Ruibin Yuan
UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key chall…
WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation
The development of speech understanding and generation has been significantly accelerated by the availability of large-scale, high-quality speech datasets. Among these, ASR and TTS are regarded as the most established and fundamental tasks…
AudioX: Diffusion Transformer for Anything-to-Audio Generation
Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality,…
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting mu…
Audio-FLAN: A Preliminary Release
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the d…
CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages
CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities, including sheet music, perform…
CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models
Current music information retrieval systems face challenges in managing linguistic diversity and integrating various musical modalities. These limitations reduce their effectiveness in a global, multimodal music environment. To add…
Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer
Despite the significant progress in controllable music generation and editing, challenges remain in the quality and length of generated music due to the use of Mel-spectrogram representations and UNet-based model structures. To address the…
HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
As Large Language Models (LLMs) grow increasingly powerful, ensuring their safety and alignment with human values remains a critical challenge. Ideally, LLMs should provide informative responses while avoiding the disclosure of harmful or …
OmniBench: Towards The Future of Universal Omni-Language Models
Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, …
SongTrans: An unified song transcription and alignment method for lyrics and notes
The quantity of processed data is crucial for advancing the field of singing voice synthesis. While there are tools available for lyric or note transcription tasks, they all require pre-processed data, which is relatively time-consuming (e.g.,…
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
Massive multi-modality datasets play a significant role in facilitating the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames, considering audio to be wea…
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 360K video-music pairs, including various genres such as movie trailers, advertisements, and documen…
ComposerX: Multi-Agent Symbolic Music Composition with LLMs
Music composition represents the creative side of humanity, and is itself a complex task that requires the ability to understand and generate information under long-range dependency and harmony constraints. While demonstrating impressive capabiliti…
MuPT: A Generative Symbolic Music Pretrained Transformer
In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible wi…
Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model
In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional metho…
RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation
Large Language Models (LLMs) exhibit remarkable capabilities but are prone to generating inaccurate or hallucinatory responses. This limitation stems from their reliance on vast pretraining datasets, making them susceptible to errors in un…
Modeling Analog Dynamic Range Compressors using Deep Learning and State-space Models
We describe a novel approach for developing realistic digital models of dynamic range compressors for digital audio production by analyzing their analog prototypes. While realistic digital dynamic compressors are potentially useful for man…
ChatMusician: Understanding and Generating Music Intrinsically with LLM
While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that inte…
CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models
The advancement of large language models (LLMs) has enhanced the ability to generalize across a wide range of unseen natural language processing (NLP) tasks through instruction-following. Yet, their effectiveness often diminishes in low-re…
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alter…
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
As the capabilities of large multimodal models (LMMs) continue to advance, evaluating the performance of LMMs emerges as an increasing need. Additionally, there is an even larger gap in evaluating the advanced knowledge and reasoning abili…
Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation
Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While existing methods can generate gestures that follow a single emotion label, they overlo…
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions…
On the Effectiveness of Speech Self-supervised Learning for Music
Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications. However, its efficacy in music information retrieval (MIR) remains largely unexplored. While previous SSL mode…
LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our nov…
MARBLE: Music Audio Representation Benchmark for Universal Evaluation
In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limit…
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its ap…