Kai Yu
FOXO1-NMNAT3 axis dysregulation promotes doxorubicin cardiotoxicity: NAD+ replenishment as a redox-targeted antioxidant therapy
This study establishes the dysregulation of the FOXO1-NMNAT3 axis as a key mechanism underlying NAD+ depletion in DIC. Targeting this axis through NAD+ replenishment, particularly by activating NMNAT3, offers a novel …
Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
Flow-matching-based text-to-speech (TTS) models have shown high-quality speech synthesis. However, most current flow-matching-based TTS models still rely on reference transcripts corresponding to the audio prompt for synthesis. This depend…
Research on Tennis Match Outcome Prediction Based on Multi-Algorithm Integration and Bayesian Analysis
The intense competition in the men's singles final of the 2023 Wimbledon Championships highlighted the dynamic and unpredictable nature of tennis matches. Inspired by this observation, this study aims to quantify and analyze momentum shift…
Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy
Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models relying on the next-token predict…
CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate
Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR), which allocate the same number of tokens to every equal-duration slice. However, speech is …
Improving estimation of winter wheat biophysical traits using solar-induced fluorescence indices and a multi-task Gaussian process model
Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling
Flow-matching-based text-to-speech (TTS) models, such as Voicebox, E2 TTS, and F5-TTS, have attracted significant attention in recent years. These models require multiple sampling steps to reconstruct speech from noise, making inference sp…
MFA-KWS: Effective Keyword Spotting with Multi-head Frame-asynchronous Decoding
Keyword spotting (KWS) is essential for voice-driven applications, demanding both accuracy and efficiency. Traditional ASR-based KWS methods, such as greedy and beam search, explore the entire search space without explicitly prioritizing k…
VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining
Automatic speech recognition (ASR) has made remarkable progress but heavily relies on large-scale labeled data, which is scarce for low-resource languages like Vietnamese. While existing systems such as Whisper, USM, and MMS achieve promis…
Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate
Most neural speech codecs achieve bitrate adjustment through intra-frame mechanisms, such as codebook dropout, at a Constant Frame Rate (CFR). However, speech segments inherently have time-varying information density (e.g., silent interval…
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically requi…
Developing ChemDFM as a large language foundation model for chemistry
Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness, particularly when processing queries exceeding their knowledge boundaries. While existing mitigation strategies employ uncertainty estimation …
Prediction of Soil Heavy Metal Extraction Efficiency by Leaching Agents and Identification of Key Factors Based on Machine Learning Algorithms
Alignment for Efficient Tool Calling of Large Language Models
DFM: Dialogue foundation model for universal large-scale dialogue-oriented task learning
Building a universal conversational agent has been a long-standing goal of the dialogue research community. Most previous works only focus on a small set of dialogue tasks. In this work, we aim to build a unified dialogue foundation model …
UiO series of MOFs and their composites for photocatalytic CO2 reduction: A review
Photocatalytic reduction of CO2 to produce valuable fuels or chemicals is a promising CO2 utilization technology, which is of great significance for carbon emission reduction. The unique features of the UiO series of metal-organic framewor…
Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation
Research on Evacuation Behavior in Urban Villages Based on Social Networks
Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective
Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent outputs. There are several potential reasons for this…
Streaming Keyword Spotting Boosted by Cross-layer Discrimination Consistency
Connectionist Temporal Classification (CTC), a non-autoregressive training criterion, is widely used in online keyword spotting (KWS). However, existing CTC-based KWS decoding strategies either rely on Automatic Speech Recognition (ASR), w…
NTC-KWS: Noise-aware CTC for Robust Keyword Spotting
In recent years, there has been a growing interest in designing small-footprint yet effective Connectionist Temporal Classification based keyword spotting (CTC-KWS) systems. They are typically deployed on low-resource computing platforms, …
Reducing Tool Hallucination via Reliability Alignment
Large Language Models (LLMs) have expanded their capabilities beyond language generation to interact with external tools, enabling automation and real-world applications. However, tool hallucinations, where models either select inappropria…
Unified Pathological Speech Analysis with Prompt Tuning
Pathological speech analysis has been of interest in the detection of certain diseases like depression and Alzheimer's disease and attracts much interest from researchers. However, previous pathological speech analysis models are commonly …
MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation
Existing Multimodal Large Language Model (MLLM)-based agents face significant challenges in handling complex GUI (Graphical User Interface) interactions on devices. These challenges arise from the dynamic and structured nature of GUI envir…
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text…
vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders
We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task…
Semi-supervised Learning for Code-Switching ASR with Large Language Model Filter
Code-switching (CS) phenomenon occurs when words or phrases from different languages are alternated in a single sentence. Due to data scarcity, building an effective CS Automatic Speech Recognition (ASR) system remains challenging. In this…
Text-aware Speech Separation for Multi-talker Keyword Spotting
For noisy environments, ensuring the robustness of keyword spotting (KWS) systems is essential. While much research has focused on noisy KWS, less attention has been paid to multi-talker mixed speech scenarios. Unlike the usual cocktail pa…