Xinhan Di
Preview WB-DH: Towards Whole Body Digital Human Bench for the Generation of Whole-body Talking Avatar Videos
Creating realistic, fully animatable whole-body avatars from a single portrait is challenging due to limitations in capturing subtle expressions, body movements, and dynamic backgrounds. Current evaluation datasets and metrics fall short i…
JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1
Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Current appro…
DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis
While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) gener…
Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
Generating high-quality piano audio from video requires precise synchronization between visual cues and musical output, ensuring accurate semantic and temporal alignment. However, existing evaluation datasets do not fully capture the intric…
MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing
Current movie dubbing technology can produce the desired speech using a reference voice and input video, maintaining perfect synchronization with the visuals while effectively conveying the intended emotions. However, crucial aspects of mo…
Attentional Triple-Encoder Network in Spatiospectral Domains for Medical Image Segmentation
Retinal Optical Coherence Tomography (OCT) segmentation is essential for diagnosing pathology. Traditional methods focus on either spatial or spectral domains, overlooking their combined dependencies. We propose a triple-encoder network th…
Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks
Movie dubbing has advanced significantly, yet assessing the real-world effectiveness of these models remains challenging. A comprehensive evaluation benchmark is crucial for two key reasons: 1) Existing metrics fail to fully capture the co…
HieraFashDiff: Hierarchical Fashion Design with Multi-stage Diffusion Models
Fashion design is a challenging and complex process. Recent works on fashion generation and editing are all agnostic of the actual fashion design process, which limits their usage in practice. In this paper, we propose a novel hierarchical…
DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation
Currently, high-quality, synchronized audio is synthesized using various multi-modal joint learning frameworks, leveraging video and optional text inputs. In the video-to-audio benchmarks, video-to-audio quality, semantic alignment, and au…
Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization
Creating high-quality sound effects from videos and text prompts requires precise alignment between visual and audio domains, both semantically and temporally, along with step-by-step guidance for professional audio generation. However, cu…
Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search
Large language models (LLMs) have demonstrated their remarkable capacity across a variety of tasks. However, reasoning remains a challenge for LLMs. To improve LLMs' reasoning ability, process supervision has proven to be better than outco…
Multiple Consistency-guided Test-Time Adaptation for Contrastive Audio-Language Models with Unlabeled Audio
One fascinating aspect of pre-trained Audio-Language Models (ALMs) is their impressive zero-shot generalization capability, and test-time adaptation (TTA) methods aim to improve domain performance without annotations. However, p…
Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning
With current state-of-the-art approaches aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through iterative preference learning inspired by AlphaZero, we propose to further enhance the step-wise reasoning capabi…
Low-Rank Adaptation with Task-Relevant Feature Enhancement for Fine-tuning Language Models
Fine-tuning pre-trained large language models in a parameter-efficient manner is widely studied for its effectiveness and efficiency. LoRA is one of the most widely used methods, which assumes that the optimization process is essentially l…
YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
Generating sound effects for product-level videos, where only a small amount of labeled data is available for diverse scenes, requires the production of high-quality sounds in few-shot settings. To tackle the challenge of limited labeled d…
Multi-Stage Graph Learning for fMRI Analysis to Diagnose Neuro-Developmental Disorders
Insufficient supervision limits the performance of deep supervised models for brain disease diagnosis. It is important to develop a learning framework that can capture more information from limited data and insufficient supervision. T…
OCC-MLLM-Alpha: Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning
There is a gap in the understanding of occluded objects in existing large-scale visual language multi-modal models. Current state-of-the-art multi-modal models fail to provide satisfactory results in describing occluded objects through uni…
OCC-MLLM: Empowering Multimodal Large Language Model For the Understanding of Occluded Objects
There is a gap in the understanding of occluded objects in existing large-scale visual language multi-modal models. Current state-of-the-art multimodal models fail to provide satisfactory results in describing occluded objects for visual-l…
Towards Full-parameter and Parameter-efficient Self-learning For Endoscopic Camera Depth Estimation
Adaptation methods have recently been developed to adapt depth foundation models to endoscopic depth estimation. However, such approaches typically under-perform training since they limit the parameter search to a low-rank subspace and alter the…
Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation
Gestures are pivotal in enhancing co-speech communication. While recent works have mostly focused on point-level motion transformation or fully supervised motion representations through data-driven approaches, we explore the representation…
Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation
Large-scale text-to-speech (TTS) models have made significant progress recently. However, they still fall short in the generation of Chinese dialectal speech. To address this, we propose Bailing-TTS, a family of large-scale TTS models capabl…
Hierarchical Reinforcement Learning for Furniture Layout in Virtual Indoor Scenes
In real life, the decoration of 3D indoor scenes through designing furniture layout provides a rich experience for people. In this paper, we explore the furniture layout task as a Markov decision process (MDP) in virtual reality, which is …
LWA-HAND: Lightweight Attention Hand for Interacting Hand Reconstruction
Recent years have witnessed great success for hand reconstruction in real-time applications such as virtual reality and augmented reality, while interacting two-hand reconstruction through efficient transformers is left unexplored. In t…
Multi-Agent Reinforcement Learning of 3D Furniture Layout Simulation in Indoor Graphics Scenes
In the industrial interior design process, professional designers plan the furniture layout to achieve a satisfactory 3D design for selling. In this paper, we explore the interior graphics scenes design task as a Markov decision process (M…
Deep Reinforcement Learning for Producing Furniture Layout in Indoor Scenes
In the industrial interior design process, professional designers plan the size and position of furniture in a room to achieve a satisfactory design for selling. In this paper, we explore the interior scene design task as a Markov decision…
End-to-end Generative Floor-plan and Layout with Attributes and Relation Graph
In this paper, we propose an end-to-end model for producing furniture layouts for interior scene synthesis from a random vector. The proposed model aims to support professional interior designers in producing interior decoration solut…
Deep Layout of Custom-size Furniture through Multiple-domain Learning
In this paper, we propose a multiple-domain model for producing a custom-size furniture layout in the interior scene. This model aims to support professional interior designers in producing interior decoration solutions with custom-size …
Structural Plan of Indoor Scenes with Personalized Preferences
In this paper, we propose an assistive model that helps professional interior designers produce industrial interior decoration solutions and meet the personalized preferences of property owners. The proposed model is able to a…