Explanipedia

CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation

Peng Li, S. Ma, Jialiang Chen, Yuan Liu, Congyi Zhang, et al. · 2025

Recently, 3D generation methods have shown their powerful ability to automate 3D model creation. However, most 3D generation methods only rely on an input image or a text prompt to generate a 3D model, which lacks the control of each compo…

UNIC: Unified In-Context Video Editing

Zixuan Ye, Xuanhua He, Quande Liu, Qiulin Wang, Xintao Wang, et al. · 2025

Recent advances in text-to-video generation have sparked interest in generative video editing tasks. Previous methods often rely on task-specific architectures (e.g., additional adapter modules) or dedicated customizations (e.g., DDIM inve…

VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization

Yunxin Li, Xinyu Chen, Zhipeng Li, Zhenyu Liu, Longyue Wang, et al. · 2025

Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy O…

Corrigendum to “Porous bead foam from semi-aromatic polycarbonate/polyester blend by supercritical carbon dioxide batch foaming” [Polym. Test. 146 (2025) 108765]

W. Zhan, Wenhan Luo, Yufei Wang, Liangbin Wang, Naiyu Xiao, et al. · 2025

Co³Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion

Xingqun Qi, Yatian Wang, Hengyuan Zhang, Jiahao Pan, Wei Xue, et al. · 2025

Generating gestures from human speech has gained tremendous progress in animating virtual avatars. While the existing methods enable synthesizing gestures cooperated by individual self-talking, they overlook the practicality of concurrent …

VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension

Xinyu Chen, Yunxin Li, Haoyuan Shi, Baotian Hu, Wenhan Luo, et al. · 2025

Assessing the video comprehension capabilities of multimodal AI systems can effectively measure their understanding and reasoning abilities. Most video evaluation benchmarks are limited to a single language, typically English, and predomin…

Porous bead foam from semi-aromatic polycarbonate/polyester blend by supercritical carbon dioxide batch foaming

W. Zhan, Wenhan Luo, Yufei Wang, Liangbin Wang, Naiyu Xiao, et al. · 2025

VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer

Xinyu Liu, Ailing Zeng, Wei Xue, Harry Yang, Wenhan Luo, et al. · 2025

Crafting magic and illusions is one of the most thrilling aspects of filmmaking, with visual effects (VFX) serving as the powerhouse behind unforgettable cinematic experiences. While recent advances in generative artificial intelligence ha…

MB-TaylorFormer V2: Improved Multi-branch Linear Transformer Expanded by Taylor Formula for Image Restoration

Zhi Jin, Yuwei Qiu, Kaihao Zhang, Hongdong Li, Wenhan Luo · 2025

Recently, Transformer networks have demonstrated outstanding performance in the field of image restoration due to the global receptive field and adaptability to input. However, the quadratic computational complexity of Softmax-attention po…

SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model

Yan Li, Ziya Zhou, Zhiqiang Wang, Wei Xue, Wenhan Luo, et al. · 2024

Recent advancements in generative models have significantly enhanced talking face video generation, yet singing video generation remains underexplored. The differences between human talking and singing limit the performance of existing tal…

Foundation Cures Personalization: Improving Personalized Models' Prompt Consistency via Hidden Foundation Knowledge

Yiyang Cai, Zhengkai Jiang, Yulong Liu, Chengtao Jiang, Wei Xue, et al. · 2024

Facial personalization faces challenges to maintain identity fidelity without disrupting the foundation model's prompt consistency. The mainstream personalization models employ identity embedding to integrate identity information within th…

EVA: An Embodied World Model for Future Video Anticipation

Xiaowei Chi, Hengyuan Zhang, Chun-Kai Fan, Xingqun Qi, Rongyu Zhang, et al. · 2024

Video generation models have made significant progress in simulating future states, showcasing their potential as world simulators in embodied scenarios. However, existing models often lack robust understanding, limiting their ability to p…

DREAM: Domain-Agnostic Reverse Engineering Attributes of Black-Box Model

Rongqing Li, Jiaqi Yu, Changsheng Li, Wenhan Luo, Ye Yuan, et al. · 2024

Deep learning models are usually black boxes when deployed on machine learning platforms. Prior works have shown that the attributes (e.g., the number of convolutional layers) of a target black-box model can be exposed through a sequence o…

PSHuman: Photorealistic Single-image 3D Human Reconstruction using Cross-Scale Multiview Diffusion and Explicit Remeshing

Peng Li, Wangguandong Zheng, Yuan Liu, Tao Yu, Yangguang Li, et al. · 2024

Detailed and photorealistic 3D human modeling is essential for various applications and has seen tremendous progress. However, full-body reconstruction from a monocular RGB image remains challenging due to the ill-posed nature of the probl…

HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts

Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, B.N. Jin, et al. · 2024

The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with issues of object repetition and structural artifacts especially when scaling to 4K resolution and highe…

MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

Xiaowei Chi, Yatian Wang, Aosong Cheng, Pengjun Fang, Zeyue Tian, et al. · 2024

Massive multi-modality datasets play a significant role in facilitating the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames, considering audio to be wea…

HTNet for micro-expression recognition

Zhifeng Wang, Kaihao Zhang, Wenhan Luo, Ramesh Sankaranarayana · 2024

MIPI 2024 Challenge on Few-shot RAW Image Denoising: Methods and Results

Xin Jin, Chunle Guo, Xiaoming Li, Zongsheng Yue, Chongyi Li, et al. · 2024

The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-qual…

Towards Multiple Character Image Animation Through Enhancing Implicit Decoupling

Jingyun Xue, Hongfa Wang, Qi Tian, Yue Ma, Andong Wang, et al. · 2024

Controllable character image animation has a wide range of applications. Although existing studies have consistently improved performance, challenges persist in the field of character image animation, particularly concerning stability in c…

CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild

Xingqun Qi, Hengyuan Zhang, Yatian Wang, Jiahao Pan, Chen Liu, et al. · 2024

Deriving co-speech 3D gestures has seen tremendous progress in virtual avatar animation. Yet, the existing methods often produce stiff and unreasonable gestures with unseen human speech inputs due to the limited 3D speech-gesture data. In …

Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

Peng Li, Yuan Liu, Xiaoxiao Long, Feihu Zhang, Lin Cheng, et al. · 2024

In this paper, we introduce Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image. Despite significant advancements in multiview generation, existing methods still suffer from ca…

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, et al. · 2024

Recent advancements in Multimodal Large Language Models (MLLMs) underscore the significance of scalable models and data to boost performance, yet this often incurs substantial computational costs. Although the Mixture of Experts (MoE) arch…

Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost

Yuan Gao, Weizhong Zhang, Wenhan Luo, Lin Ma, Jin-Gang Yu, et al. · 2024

We aim at exploiting additional auxiliary labels from an independent (auxiliary) task to boost the primary task performance which we focus on, while preserving a single task inference cost of the primary task. While most existing auxiliary…

Homography Guided Temporal Fusion for Road Line and Marking Segmentation

Shan Wang, Chuong Nguyen, Jiawei Liu, Kaihao Zhang, Wenhan Luo, et al. · 2024

Reliable segmentation of road lines and markings is critical to autonomous driving. Our work is motivated by the observations that road lines and markings are (1) frequently occluded in the presence of moving vehicles, shadow, and glare an…

Context-Aware Integration of Language and Visual References for Natural Language Tracking

Yanyan Shao, Shuting He, Qi Ye, Yuchao Feng, Wenhan Luo, et al. · 2024

Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame. Existing methodologies perform language-based and template-based matching for …

OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models

Zhe Kong, Yong Zhang, Tianyu Yang, Tao Wang, Kaihao Zhang, et al. · 2024

Personalization is an important topic in text-to-image generation, especially the challenging multi-concept personalization. Current multi-concept methods are struggling with identity preservation, occlusion, and the harmony between foregr…

AS-FIBA: Adaptive Selective Frequency-Injection for Backdoor Attack on Deep Face Restoration

Zhenbo Song, Wenhao Gao, Kaihao Zhang, Wenhan Luo, Zhaoxin Fan, et al. · 2024

Deep learning-based face restoration models, increasingly prevalent in smart devices, have become targets for sophisticated backdoor attacks. These attacks, through subtle trigger injection into input face images, can lead to unexpected re…

Segmentation Guided Sparse Transformer for Under-Display Camera Image Restoration

Jingyun Xue, Tao Wang, Jun Wang, Kaihao Zhang, Wenhan Luo, et al. · 2024

Under-Display Camera (UDC) is an emerging technology that achieves full-screen display via hiding the camera under the display panel. However, the current implementation of UDC causes serious degradation. The incident light required for ca…

SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising

Tao Zhou, Wenhan Luo, Qi Ye, Zhiguo Shi, Jiming Chen · 2024

Recently, promptable segmentation models, such as the Segment Anything Model (SAM), have demonstrated robust zero-shot generalization capabilities on static images. These promptable models exhibit denoising abilities for imprecise prompt i…

Wenhan Luo