Wei-Shi Zheng
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training
Vision-Language-Action (VLA) models trained via imitation learning suffer from significant performance degradation in data-scarce scenarios due to their reliance on large-scale demonstration datasets. Although reinforcement learning (RL)-b…
Revisit the Imbalance Optimization in Multi-task Learning: An Experimental Analysis
Multi-task learning (MTL) aims to build general-purpose vision systems by training a single network to perform multiple tasks jointly. While promising, its potential is often hindered by "unbalanced optimization", where task interference l…
A Single-Component Adhesive Sponge Based on Blood-Triggered and Autopenetrative Adhesion for Robust Vascular Closure
Achieving strong adhesion under wet and bleeding conditions remains a major challenge for medical adhesives. Existing strategies that utilize polymer chain penetration to overcome this have shown promise but typically rely on complex exter…
CoopDiff: Anticipating 3D Human-object Interactions via Contact-consistent Decoupled Diffusion
3D human-object interaction (HOI) anticipation aims to predict the future motion of humans and their manipulated objects, conditioned on the historical context. Generally, the articulated humans and rigid objects exhibit different motion p…
Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal
We present a diffusion-based portrait shadow removal approach that can robustly produce high-fidelity results. Unlike previous methods, we cast shadow removal as diffusion-based inpainting. To this end, we first train a shadow-independent …
Domain Generalizable Portrait Style Transfer
This paper presents a portrait style transfer method that generalizes well to various different domains while enabling high-quality semantic-aligned stylization on regions including hair, eyes, eyelashes, skins, lips, and background. To th…
FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection
Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalizati…
Temporal Continual Learning with Prior Compensation for Human Motion Prediction
Human Motion Prediction (HMP) aims to predict future poses at different moments according to past motion sequences. Previous approaches have treated the prediction of various moments equally, resulting in two main limitations: the learning…
DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering
Recent methods have shown that pre-trained diffusion models can be fine-tuned to enable generative inverse rendering by learning image-conditioned noise-to-intrinsic mapping. Despite their remarkable progress, they struggle to robustly pro…
TypeTele: Releasing Dexterity in Teleoperation by Dexterous Manipulation Types
Dexterous teleoperation plays a crucial role in robotic manipulation for real-world data collection and remote robot control. Previous dexterous teleoperation mostly relies on hand retargeting to closely mimic human hand postures. However,…
Chain of Methodologies: Scaling Test Time Computation without Training
Large Language Models (LLMs) often struggle with complex reasoning tasks due to insufficient in-depth insights in their training data, which are typically absent in publicly available documents. This paper introduces the Chain of Methodolo…
Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
Referring video object segmentation (RVOS) aims to identify, track and segment the objects in a video based on language descriptions, and has received great attention in recent years. However, existing datasets remain focused on short vide…
ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
Fine-grained understanding of human actions and poses in videos is essential for human-centric AI applications. In this work, we introduce ActionArt, a fine-grained video-caption dataset designed to advance research in human-centric multim…
MaintaAvatar: A Maintainable Avatar Based on Neural Radiance Fields by Continual Learning
The generation of a virtual digital avatar is a crucial research topic in the field of computer vision. Many existing works utilize Neural Radiance Fields (NeRF) to address this issue and have achieved impressive results. However, previous…
ParGo: Bridging Vision-Language with Partial and Global Views
This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, our ParGo…
CLIP-RestoreX: Restore Image Structure and Perception in Exposure Correction
Exposure correction aims to adjust the exposure of an under- or over-exposed image to enhance its overall visual quality. The core challenge of this task is that it requires faithfully restoring both the structure and perception inf…
Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation
Despite the significant role text-to-motion (T2M) generation plays across various applications, current methods involve a large number of parameters and suffer from slow inference speeds, leading to high usage costs. To address this, we ai…
When Shadow Removal Meets Intrinsic Image Decomposition: A Joint Learning Framework Using Unpaired Data
We present a framework that achieves shadow removal by learning intrinsic image decomposition (IID) from unpaired shadow and shadow-free images. Although it is well-known that intrinsic images, i.e., illumination and reflectance, are highly…
Progressive Human Motion Generation Based on Text and Few Motion Frames
Although existing text-to-motion (T2M) methods can produce realistic human motion from text description, it is still difficult to align the generated motion with the desired postures since using text alone is insufficient for precisely des…
Decoupled Distillation to Erase: A General Unlearning Method for Any Class-centric Tasks
In this work, we present DEcoupLEd Distillation To Erase (DELETE), a general and strong unlearning method for any class-centric tasks. To derive this, we first propose a theoretical framework to analyze the general form of unlearning loss …
Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation
We propose a novel approach for generating text-guided human-object interactions (HOIs) that achieves explicit joint-level interaction modeling in a computationally efficient manner. Previous methods represent the entire human body as a si…
Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks
Error detection in procedural activities is essential for consistent and correct outcomes in AR-assisted and robotic systems. Existing methods often focus on temporal ordering errors or rely on static prototypes to represent normal actions…
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench hav…
Panorama Generation From NFoV Image Done Right
Generating 360-degree panoramas from a narrow field of view (NFoV) image is a promising computer vision task for Virtual Reality (VR) applications. Existing methods mostly assess the generated panoramas with InceptionNet- or CLIP-based metric…
A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection
Open-vocabulary object detection (OVD) aims to detect objects beyond the training annotations, where detectors are usually aligned to a pre-trained vision-language model, e.g., CLIP, to inherit its generalizable recognition ability so that d…
Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework
Bimanual robotic manipulation is an emerging and critical topic in the robotics community. Previous works primarily rely on integrated control models that take the perceptions and states of both arms as inputs to directly predict their act…
TacCap: A Wearable FBG-Based Tactile Sensor for Seamless Human-to-Robot Skill Transfer
Tactile sensing is essential for dexterous manipulation, yet large-scale human demonstration datasets lack tactile feedback, limiting their effectiveness in skill transfer to robots. To address this, we introduce TacCap, a wearable Fiber B…
Task-Oriented 6-DoF Grasp Pose Detection in Clutters
In general, humans would grasp an object differently for different tasks, e.g., "grasping the handle of a knife to cut" vs. "grasping the blade to hand over". In the field of robotic grasp pose detection research, some existing works consi…