Yanwei Fu
COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability
Large Multimodal Reasoning Models (LMRMs) are moving into real applications, where they must be both useful and safe. Safety is especially challenging in multimodal settings: images and text can be combined to bypass guardrails, and single…
Towards Reliable and Holistic Visual In-Context Learning Prompt Selection
Visual In-Context Learning (VICL) has emerged as a prominent approach for adapting visual foundation models to novel tasks, by effectively exploiting contextual information embedded in in-context examples, which can be formulated as a glob…
UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs
Large Language Models (LLMs) achieve strong performance across diverse tasks but face prohibitive computational and memory costs. Pruning offers a promising path by inducing sparsity while preserving architectural flexibility. However, exi…
SafeCtrl: Region-Based Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress
The widespread deployment of text-to-image models is challenged by their potential to generate harmful content. While existing safety methods, such as prompt rewriting or model fine-tuning, provide valuable interventions, they often introd…
SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment
Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps, which incur substantial computational overhead. While many distillation methods that are solely base…
Diffusion-Based Imaginative Coordination for Bimanual Manipulation
Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements…
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fie…
Spatial-Temporal Aware Visuomotor Diffusion Policy Learning
Visual imitation learning is effective for robots to learn versatile tasks. However, many existing methods rely on behavior cloning with supervised historical trajectories, limiting their 3D spatial and 4D spatiotemporal awareness. Consequ…
Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, th…
Cross-View Multi-Modal Segmentation @ Ego-Exo4D Challenges 2025
In this report, we present a cross-view multi-modal object segmentation approach for the object correspondence task in the Ego-Exo4D Correspondence Challenges 2025. Given object queries from one perspective (e.g., ego view), the goal is to…
You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping
This paper addresses the problem of category-level pose estimation for articulated objects in robotic manipulation tasks. Recent works have shown promising results in estimating part pose and size at the category level. However, these appr…
VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models
While virtual try-on has achieved significant progress, evaluating these models towards real-world scenarios remains a challenge. A comprehensive benchmark is essential for three key reasons: (1) Current metrics inadequately reflect human p…
A Unified and Fast-Sampling Diffusion Bridge Framework via Stochastic Optimal Control
Recent advances in diffusion bridge models leverage Doob's $h$-transform to establish fixed endpoints between distributions, demonstrating promising results in image translation and restoration tasks. However, these approaches often produc…
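For context, a minimal sketch of the Doob $h$-transform construction that diffusion bridge models of this kind build on; this is the generic form, not necessarily the exact parameterization used in the paper. Starting from a reference SDE, conditioning on a fixed endpoint $x_T$ adds a guiding drift term:
$$\mathrm{d}X_t = f(X_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t \;\;\longrightarrow\;\; \mathrm{d}X_t = \big[f(X_t, t) + g(t)^2\,\nabla_x \log h(X_t, t)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}W_t,$$
$$\text{where } h(x, t) = p\big(X_T = x_T \mid X_t = x\big).$$
The added score-like drift steers every trajectory toward the prescribed endpoint, which is what establishes the fixed endpoints between distributions mentioned above.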
Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation
Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker …
Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation
Camera and human motion controls have been extensively studied for video generation, but existing approaches typically address them separately, suffering from limited data with high-quality annotations for both aspects. To overcome this, w…
CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image
This paper tackles category-level pose estimation of articulated objects in robotic manipulation tasks and introduces a new benchmark dataset. While recent methods estimate part poses and sizes at the category level, they often rely on geo…
NTIRE 2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results
Cross-Domain Few-Shot Object Detection (CD-FSOD) poses significant challenges to existing object detection and few-shot detection models when applied across domains. In conjunction with NTIRE 2025, we organized the 1st CD-FSOD Challenge, a…
ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context
Visual storytelling involves generating a sequence of coherent frames from a textual storyline while maintaining consistency in characters and scenes. Existing autoregressive methods, which rely on previous frame-sentence pairs, struggle w…
DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding
Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, these aspects are all essential and are p…
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions, even when they are occluded. This ability is crucial for tasks such as vision-language navigation and autonomous …
EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation
Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still strugg…
Sequential Multi-Object Grasping with One Dexterous Hand
Sequentially grasping multiple objects with multi-fingered hands is common in daily life, where humans can fully leverage the dexterity of their hands to enclose multiple objects. However, the diversity of object geometries and the complex…
HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation
Co-speech gestures are crucial non-verbal cues that enhance speech clarity and expressiveness in human communication, which have attracted increasing attention in multimodal research. While the existing methods have made strides in gesture…
CrossVTON: Mimicking the Logic Reasoning on Cross-category Virtual Try-on guided by Tri-zone Priors
Despite remarkable progress in image-based virtual try-on systems, generating realistic and robust fitting images for cross-category virtual try-on remains a challenging task. The primary difficulty arises from the absence of human-like re…
VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. In content creation workflows, precise and simultaneous control over camera motion, object motion, an…
UniDB: A Unified Diffusion Bridge Framework via Stochastic Optimal Control
Recent advances in diffusion bridge models leverage Doob's $h$-transform to establish fixed endpoints between distributions, demonstrating promising results in image translation and restoration tasks. However, these approaches frequently p…
A New Formulation of Lipschitz Constrained With Functional Gradient Learning for GANs
This paper introduces a promising alternative method for training Generative Adversarial Networks (GANs) on large-scale datasets with clear theoretical guarantees. GANs are typically learned through a minimax game between a generator and a…
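For reference, a minimal statement of the standard GAN minimax game this line of work builds on (the vanilla objective of Goodfellow et al., not the paper's Lipschitz-constrained functional-gradient formulation):
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big].$$
Lipschitz constraints on the discriminator $D$ (for example via gradient penalties or spectral normalization) are a common way to stabilize this game, which is the setting the paper's new formulation addresses.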