Wenhan Luo
CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation
Recently, 3D generation methods have shown their powerful ability to automate 3D model creation. However, most 3D generation methods only rely on an input image or a text prompt to generate a 3D model, which lacks the control of each compo…
UNIC: Unified In-Context Video Editing
Recent advances in text-to-video generation have sparked interest in generative video editing tasks. Previous methods often rely on task-specific architectures (e.g., additional adapter modules) or dedicated customizations (e.g., DDIM inve…
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization
Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy O…
Corrigendum to “Porous bead foam from semi-aromatic polycarbonate /polyester blend by supercritical carbon dioxide batch foaming” [Polym. Test. 146 (2025) 108765]
Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion
Generating gestures from human speech has gained tremendous progress in animating virtual avatars. While the existing methods enable synthesizing gestures cooperated by individual self-talking, they overlook the practicality of concurrent …
VideoVista-CulturalLingo: 360$^\circ$ Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
Assessing the video comprehension capabilities of multimodal AI systems can effectively measure their understanding and reasoning abilities. Most video evaluation benchmarks are limited to a single language, typically English, and predomin…
Porous bead foam from semi-aromatic polycarbonate /polyester blend by supercritical carbon dioxide batch foaming
VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer
Crafting magic and illusions is one of the most thrilling aspects of filmmaking, with visual effects (VFX) serving as the powerhouse behind unforgettable cinematic experiences. While recent advances in generative artificial intelligence ha…
MB-TaylorFormer V2: Improved Multi-branch Linear Transformer Expanded by Taylor Formula for Image Restoration
Recently, Transformer networks have demonstrated outstanding performance in the field of image restoration due to the global receptive field and adaptability to input. However, the quadratic computational complexity of Softmax-attention po…
VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model
Recent advancements in generative models have significantly enhanced talking face video generation, yet singing video generation remains underexplored. The differences between human talking and singing limit the performance of existing tal…
Foundation Cures Personalization: Improving Personalized Models' Prompt Consistency via Hidden Foundation Knowledge
Facial personalization faces challenges to maintain identity fidelity without disrupting the foundation model's prompt consistency. The mainstream personalization models employ identity embedding to integrate identity information within th…
EVA: An Embodied World Model for Future Video Anticipation
Video generation models have made significant progress in simulating future states, showcasing their potential as world simulators in embodied scenarios. However, existing models often lack robust understanding, limiting their ability to p…
DREAM: Domain-Agnostic Reverse Engineering Attributes of Black-Box Model
Deep learning models are usually black boxes when deployed on machine learning platforms. Prior works have shown that the attributes (e.g., the number of convolutional layers) of a target black-box model can be exposed through a sequence o…
PSHuman: Photorealistic Single-image 3D Human Reconstruction using Cross-Scale Multiview Diffusion and Explicit Remeshing
Detailed and photorealistic 3D human modeling is essential for various applications and has seen tremendous progress. However, full-body reconstruction from a monocular RGB image remains challenging due to the ill-posed nature of the probl…
HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts
The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with issues of object repetition and structural artifacts especially when scaling to 4K resolution and highe…
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
Massive multi-modality datasets play a significant role in facilitating the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames, considering audio to be wea…
HTNet for micro-expression recognition
MIPI 2024 Challenge on Few-shot RAW Image Denoising: Methods and Results
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-qual…
Towards Multiple Character Image Animation Through Enhancing Implicit Decoupling
Controllable character image animation has a wide range of applications. Although existing studies have consistently improved performance, challenges persist in the field of character image animation, particularly concerning stability in c…
CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild
Deriving co-speech 3D gestures has seen tremendous progress in virtual avatar animation. Yet, the existing methods often produce stiff and unreasonable gestures with unseen human speech inputs due to the limited 3D speech-gesture data. In …
Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention
In this paper, we introduce Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image. Despite significant advancements in multiview generation, existing methods still suffer from ca…
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
Recent advancements in Multimodal Large Language Models (MLLMs) underscore the significance of scalable models and data to boost performance, yet this often incurs substantial computational costs. Although the Mixture of Experts (MoE) arch…
Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost
We aim at exploiting additional auxiliary labels from an independent (auxiliary) task to boost the primary task performance which we focus on, while preserving a single task inference cost of the primary task. While most existing auxiliary…
Homography Guided Temporal Fusion for Road Line and Marking Segmentation
Reliable segmentation of road lines and markings is critical to autonomous driving. Our work is motivated by the observations that road lines and markings are (1) frequently occluded in the presence of moving vehicles, shadow, and glare an…
Context-Aware Integration of Language and Visual References for Natural Language Tracking
Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame. Existing methodologies perform language-based and template-based matching for …
OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models
Personalization is an important topic in text-to-image generation, especially the challenging multi-concept personalization. Current multi-concept methods are struggling with identity preservation, occlusion, and the harmony between foregr…
AS-FIBA: Adaptive Selective Frequency-Injection for Backdoor Attack on Deep Face Restoration
Deep learning-based face restoration models, increasingly prevalent in smart devices, have become targets for sophisticated backdoor attacks. These attacks, through subtle trigger injection into input face images, can lead to unexpected re…
Segmentation Guided Sparse Transformer for Under-Display Camera Image Restoration
Under-Display Camera (UDC) is an emerging technology that achieves full-screen display via hiding the camera under the display panel. However, the current implementation of UDC causes serious degradation. The incident light required for ca…
SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising
Recently, promptable segmentation models, such as the Segment Anything Model (SAM), have demonstrated robust zero-shot generalization capabilities on static images. These promptable models exhibit denoising abilities for imprecise prompt i…