Yu-Xiong Wang
Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning
Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision-driven embodied agents open a new attack surface: …
MoReact: Generating Reactive Motion from Textual Descriptions
Modeling and generating human reactions poses a significant challenge with broad applications for computer vision and human-computer interaction. Existing methods either treat multiple individuals as a single entity, directly generating in…
Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception
With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising proces…
Photothermal direct methane conversion to formaldehyde at the gas-solid interface under ambient pressure
Photocatalytic direct oxidation of methane to C1 oxygenates offers a green alternative to conventional energy-intensive and high-carbon-footprint multi-step processes. However, current batch-type gas-liquid-solid reaction system…
Electrocatalytic Nitric Oxide to Ammonia Over Copper-Based Nanosheets: Insights into the Critical Role of Chemical States
Visual Program Distillation with Template-Based Augmentation
Adapting visual programming or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA) for specialized tasks or domains remains challenging due to high annotation and inferen…
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive b…
Transforming the Hybrid Cloud for Emerging AI Workloads
This white paper, developed through close collaboration between IBM Research and UIUC researchers within the IIDAI Institute, envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads through innovative, fu…
Reinforcement Learning Gradients as Vitamin for Online Finetuning Decision Transformers
Decision Transformers have recently emerged as a new and compelling paradigm for offline Reinforcement Learning (RL), completing a trajectory in an autoregressive way. While improvements have been made to overcome initial shortcomings, onl…
ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method leverages the universal visual-language mapping learned by video diffusion models on Internet-scale dat…
Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision
Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we …
Floating No More: Object-Ground Reconstruction from a Single Image
Recent advancements in 3D object reconstruction from single images have primarily focused on improving the accuracy of object shapes. Yet, these techniques often fail to accurately capture the inter-relation between the object, ground, and…
RMem: Restricted Memory Banks Improve Video Object Segmentation
With recent video object segmentation (VOS) benchmarks evolving to challenging scenarios, we revisit a simple but overlooked strategy: restricting the size of memory banks. This diverges from the prevalent practice of expanding memory bank…
SOHES: Self-supervised Open-world Hierarchical Entity Segmentation
Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Desp…
Region-Based Representations Revisited
We investigate whether region-based representations are effective for recognition. Regions were once a mainstay in recognition approaches, but pixel and patch-based features are now used almost exclusively. We show that recent class-agnost…
Oxygen-Deficient WO3 for Stable Visible-Light Photocatalytic Degradation of Acetaldehyde within a Wide Humidity Range
Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models
Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for the multi-object gene…
Offline Imitation from Observation via Primal Wasserstein State Occupancy Matching
In real-world scenarios, arbitrary interactions with the environment can often be costly, and actions of expert demonstrations are not always available. To reduce the need for both, offline Learning from Observations (LfO) is extensively s…
Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models
We propose a conceptually simple and lightweight framework for improving the robustness of vision models through the combination of knowledge distillation and data augmentation. We address the conjecture that larger models do not make for …
A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories
Offline imitation from observations aims to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available. Offline imitation is useful in real-world scenarios where arbitrary interactions a…
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
While language models (LMs) have shown potential across a range of decision-making tasks, their reliance on simple acting processes limits their broad deployment as autonomous agents. In this paper, we introduce Language Agent Tree Search …
Photocatalytic Oxidative Coupling of Methane over Au₁Ag Single-Atom Alloy Modified ZnO with Oxygen and Water Vapor: Synergy of Gold and Silver
C−H dissociation and C−C coupling are two key steps in converting CH₄ into multi-carbon compounds. Here we report a synergy of Au and Ag to greatly promote C₂H₆ formation over Au₁Ag single-atom alloy nanoparticles (Au₁Ag NPs)-modif…
InterDiff: Generating 3D Human-Object Interactions with Physics-Informed Diffusion
This paper addresses a novel task of anticipating 3D human-object interactions (HOIs). Most existing research on HOI synthesis lacks comprehensive whole-body interactions with dynamic objects, e.g., often limited to manipulating small or s…
Is Pre-training Truly Better Than Meta-Learning?
In the context of few-shot learning, it is currently believed that a fixed pre-trained (PT) model, along with fine-tuning the final layer during evaluation, outperforms standard meta-learning algorithms. We re-evaluate these claims under a…
Stochastic Multi-Person 3D Motion Forecasting
This paper aims to deal with the ignored real-world complexities in prior work on human motion forecasting, emphasizing the social properties of multi-person motion, the diversity of motion and social interactions, and the complexity of ar…
MV-Map: Offboard HD-Map Generation with Multi-view Consistency
While bird's-eye-view (BEV) perception models can be useful for building high-definition maps (HD-Maps) with less human labor, their results are often unreliable and demonstrate noticeable inconsistencies in the predicted HD-Maps from diff…
Object Discovery from Motion-Guided Tokens
Object discovery -- separating objects from the background without manual labels -- is a fundamental open challenge in computer vision. Previous methods struggle to go beyond clustering of low-level cues, whether handcrafted (e.g., color, …
Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking
This work proposes an end-to-end multi-camera 3D multi-object tracking (MOT) framework. It emphasizes spatio-temporal continuity and integrates both past and future reasoning for tracked objects. Thus, we name it "Past-and-Future reasoning…
Towards overcoming data scarcity in materials science: unifying models and datasets with a mixture of experts framework