Yanwei Fu
COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability
Large Multimodal Reasoning Models (LMRMs) are moving into real applications, where they must be both useful and safe. Safety is especially challenging in multimodal settings: images and text can be combined to bypass guardrails, and single…
Towards Reliable and Holistic Visual In-Context Learning Prompt Selection
Visual In-Context Learning (VICL) has emerged as a prominent approach for adapting visual foundation models to novel tasks, by effectively exploiting contextual information embedded in in-context examples, which can be formulated as a glob…
UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs
Large Language Models (LLMs) achieve strong performance across diverse tasks but face prohibitive computational and memory costs. Pruning offers a promising path by inducing sparsity while preserving architectural flexibility. However, exi…
SafeCtrl: Region-Based Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress
The widespread deployment of text-to-image models is challenged by their potential to generate harmful content. While existing safety methods, such as prompt rewriting or model fine-tuning, provide valuable interventions, they often introd…
SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment
Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps, which incur substantial computational overhead. While many distillation methods that are solely base…
Diffusion-Based Imaginative Coordination for Bimanual Manipulation
Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements…
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fie…
Spatial-Temporal Aware Visuomotor Diffusion Policy Learning
Visual imitation learning is effective for robots to learn versatile tasks. However, many existing methods rely on behavior cloning with supervised historical trajectories, limiting their 3D spatial and 4D spatiotemporal awareness. Consequ…
Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, th…
Cross-View Multi-Modal Segmentation @ Ego-Exo4D Challenges 2025
In this report, we present a cross-view multi-modal object segmentation approach for the object correspondence task in the Ego-Exo4D Correspondence Challenges 2025. Given object queries from one perspective (e.g., ego view), the goal is to…
You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping
This paper addresses the problem of category-level pose estimation for articulated objects in robotic manipulation tasks. Recent works have shown promising results in estimating part pose and size at the category level. However, these appr…
VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models
While virtual try-on has achieved significant progress, evaluating these models towards real-world scenarios remains a challenge. A comprehensive benchmark is essential for three key reasons: (1) Current metrics inadequately reflect human p…
A Unified and Fast-Sampling Diffusion Bridge Framework via Stochastic Optimal Control
Recent advances in diffusion bridge models leverage Doob's $h$-transform to establish fixed endpoints between distributions, demonstrating promising results in image translation and restoration tasks. However, these approaches often produc…
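For context, a minimal sketch of the Doob $h$-transform construction that diffusion bridge models of this kind build on; this is the generic form, not necessarily the exact parameterization used in the paper. Starting from a reference SDE, conditioning on a fixed endpoint $x_T$ adds a guiding drift term:
$$\mathrm{d}X_t = f(X_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t \;\;\longrightarrow\;\; \mathrm{d}X_t = \big[f(X_t, t) + g(t)^2\,\nabla_x \log h(X_t, t)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}W_t,$$
$$\text{where } h(x, t) = p\big(X_T = x_T \mid X_t = x\big).$$
The added score-like drift steers every trajectory toward the prescribed endpoint, which is what establishes the fixed endpoints between distributions mentioned above.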
Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation
Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker …
Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation
Camera and human motion controls have been extensively studied for video generation, but existing approaches typically address them separately, suffering from limited data with high-quality annotations for both aspects. To overcome this, w…
CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image
This paper tackles category-level pose estimation of articulated objects in robotic manipulation tasks and introduces a new benchmark dataset. While recent methods estimate part poses and sizes at the category level, they often rely on geo…
NTIRE 2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results
Cross-Domain Few-Shot Object Detection (CD-FSOD) poses significant challenges to existing object detection and few-shot detection models when applied across domains. In conjunction with NTIRE 2025, we organized the 1st CD-FSOD Challenge, a…
ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context
Visual storytelling involves generating a sequence of coherent frames from a textual storyline while maintaining consistency in characters and scenes. Existing autoregressive methods, which rely on previous frame-sentence pairs, struggle w…
DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding
Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, these aspects are all essential and are p…
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions, even when they are occluded. This ability is crucial for tasks such as vision-language navigation and autonomous …
EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation
Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still strugg…
Sequential Multi-Object Grasping with One Dexterous Hand
Sequentially grasping multiple objects with multi-fingered hands is common in daily life, where humans can fully leverage the dexterity of their hands to enclose multiple objects. However, the diversity of object geometries and the complex…
HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation
Co-speech gestures are crucial non-verbal cues that enhance speech clarity and expressiveness in human communication, which have attracted increasing attention in multimodal research. While the existing methods have made strides in gesture…
CrossVTON: Mimicking the Logic Reasoning on Cross-category Virtual Try-on guided by Tri-zone Priors
Despite remarkable progress in image-based virtual try-on systems, generating realistic and robust fitting images for cross-category virtual try-on remains a challenging task. The primary difficulty arises from the absence of human-like re…
VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. In content creation workflows, precise and simultaneous control over camera motion, object motion, an…
UniDB: A Unified Diffusion Bridge Framework via Stochastic Optimal Control
Recent advances in diffusion bridge models leverage Doob's $h$-transform to establish fixed endpoints between distributions, demonstrating promising results in image translation and restoration tasks. However, these approaches frequently p…
A New Formulation of Lipschitz Constrained With Functional Gradient Learning for GANs
This paper introduces a promising alternative method for training Generative Adversarial Networks (GANs) on large-scale datasets with clear theoretical guarantees. GANs are typically learned through a minimax game between a generator and a…
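For reference, a minimal statement of the standard GAN minimax game this line of work builds on (the vanilla objective of Goodfellow et al., not the paper's Lipschitz-constrained functional-gradient formulation):
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big].$$
Lipschitz constraints on the discriminator $D$ (for example via gradient penalties or spectral normalization) are a common way to stabilize this game, which is the setting the paper's new formulation addresses.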