Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning

Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, et al. · 2025

Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision-driven embodied agents open a new attack surface: …

MoReact: Generating Reactive Motion from Textual Descriptions

Xiyan Xu, Sirui Xu, Yu-Xiong Wang, Liang-Yan Gui · 2025

Modeling and generating human reactions poses a significant challenge with broad applications for computer vision and human-computer interaction. Existing methods either treat multiple individuals as a single entity, directly generating in…

Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception

Ziqi Pang, Xin Xu, Yu-Xiong Wang · 2025

With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising proces…

Photothermal direct methane conversion to formaldehyde at the gas-solid interface under ambient pressure

Yu-Xiong Wang, Yaoyu Zhang, Xiaoqiang Wang, Yue Liu, Zhongbiao Wu · 2025

Photocatalytic direct oxidation of methane to C₁ oxygenates offers a green alternative to conventional energy-intensive and high-carbon-footprint multi-step processes. However, current batch-type gas-liquid-solid reaction system…

Electrocatalytic Nitric Oxide to Ammonia Over Copper-Based Nanosheets: Insights into the Critical Role of Chemical States

Yuqin Zhong, Zhongyi Sheng, George Z. Chen, Yanan Gong, Yu-Xiong Wang, et al. · 2025

Visual Program Distillation with Template-Based Augmentation

Michal Shlapentokh-Rothman, Yu-Xiong Wang, Derek Hoiem · 2025

Visual Program Distillation with Template-Based Augmentation

Michal Shlapentokh-Rothman, Yu-Xiong Wang, Derek Hoiem · 2024

Adapting visual programming or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA) for specialized tasks or domains remains challenging due to high annotation and inferen…

RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, et al. · 2024

We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive b…

Transforming the Hybrid Cloud for Emerging AI Workloads

Deming Chen, Alaa Youssef, Ravi Pendse, André Schleife, Bryan K. Clark, et al. · 2024

This white paper, developed through close collaboration between IBM Research and UIUC researchers within the IIDAI Institute, envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads through innovative, fu…

Reinforcement Learning Gradients as Vitamin for Online Finetuning Decision Transformers

Kai Yan, Alexander G. Schwing, Yu-Xiong Wang · 2024

Decision Transformers have recently emerged as a new and compelling paradigm for offline Reinforcement Learning (RL), completing a trajectory in an autoregressive way. While improvements have been made to overcome initial shortcomings, onl…

ReferEverything: Towards Segmenting Everything We Can Speak of in Videos

Anurag Bagchi, Zhipeng Bao, Yu-Xiong Wang, Pavel Tokmakov, Martial Hebert · 2024

We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method leverages the universal visual-language mapping learned by video diffusion models on Internet-scale dat…

Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision

Shixiang Cao, L. Gui, Yu-Xiong Wang · 2024

Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we …

Floating No More: Object-Ground Reconstruction from a Single Image

Yunze Man, Yichen Sheng, Jianming Zhang, Liang-Yan Gui, Yu-Xiong Wang · 2024

Recent advancements in 3D object reconstruction from single images have primarily focused on improving the accuracy of object shapes. Yet, these techniques often fail to accurately capture the inter-relation between the object, ground, and…

RMem: Restricted Memory Banks Improve Video Object Segmentation

Junbao Zhou, Ziqi Pang, Yu-Xiong Wang · 2024

With recent video object segmentation (VOS) benchmarks evolving to challenging scenarios, we revisit a simple but overlooked strategy: restricting the size of memory banks. This diverges from the prevalent practice of expanding memory bank…

SOHES: Self-supervised Open-world Hierarchical Entity Segmentation

Shengcao Cao, Jiuxiang Gu, Jason Kuen, Hao Tan, Ruiyi Zhang, et al. · 2024

Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Desp…

Region-Based Representations Revisited

Michal Shlapentokh-Rothman, Ansel Blume, Yao Xiao, Yuqun Wu, T V Sethuraman, et al. · 2024

We investigate whether region-based representations are effective for recognition. Regions were once a mainstay in recognition approaches, but pixel and patch-based features are now used almost exclusively. We show that recent class-agnost…

Oxygen-Deficient WO₃ for Stable Visible-Light Photocatalytic Degradation of Acetaldehyde within a Wide Humidity Range

Xiangjin Zhu, Yaoyu Zhang, Yu-Xiong Wang, Yue Liu, Zhongbiao Wu · 2024

Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models

Zhipeng Bao, Yijun Li, Krishna Kumar Singh, Yu-Xiong Wang, Martial Hebert · 2023

Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for the multi-object gene…

Offline Imitation from Observation via Primal Wasserstein State Occupancy Matching

Kai Yan, Alexander G. Schwing, Yu-Xiong Wang · 2023

In real-world scenarios, arbitrary interactions with the environment can often be costly, and actions of expert demonstrations are not always available. To reduce the need for both, offline Learning from Observations (LfO) is extensively s…

Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models

Andy Zhou, Jindong Wang, Yu-Xiong Wang, Haohan Wang · 2023

We propose a conceptually simple and lightweight framework for improving the robustness of vision models through the combination of knowledge distillation and data augmentation. We address the conjecture that larger models do not make for …

A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories

Kai Yan, Alexander G. Schwing, Yu-Xiong Wang · 2023

Offline imitation from observations aims to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available. Offline imitation is useful in real-world scenarios where arbitrary interactions a…

Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, Yu-Xiong Wang · 2023

While language models (LMs) have shown potential across a range of decision-making tasks, their reliance on simple acting processes limits their broad deployment as autonomous agents. In this paper, we introduce Language Agent Tree Search …

Photocatalytic Oxidative Coupling of Methane over Au₁Ag Single‐Atom Alloy Modified ZnO with Oxygen and Water Vapor: Synergy of Gold and Silver

Yu-Xiong Wang, Guang Hong, Yaoyu Zhang, Yue Liu, Wanglai Cen, et al. · 2023

C−H dissociation and C−C coupling are two key steps in converting CH₄ into multi‐carbon compounds. Here we report a synergy of Au and Ag to greatly promote C₂H₆ formation over Au₁Ag single‐atom alloy nanoparticles (Au₁Ag NPs)‐modif…

InterDiff: Generating 3D Human-Object Interactions with Physics-Informed Diffusion

Sirui Xu, Zhengyuan Li, Yu-Xiong Wang, Liang-Yan Gui · 2023

This paper addresses a novel task of anticipating 3D human-object interactions (HOIs). Most existing research on HOI synthesis lacks comprehensive whole-body interactions with dynamic objects, e.g., often limited to manipulating small or s…

Is Pre-training Truly Better Than Meta-Learning?

Brando Miranda, Patrick Yu, Saumya Goyal, Yu-Xiong Wang, Sanmi Koyejo · 2023

In the context of few-shot learning, it is currently believed that a fixed pre-trained (PT) model, along with fine-tuning the final layer during evaluation, outperforms standard meta-learning algorithms. We re-evaluate these claims under a…

Stochastic Multi-Person 3D Motion Forecasting

Sirui Xu, Yu-Xiong Wang, Liang-Yan Gui · 2023

This paper aims to deal with the ignored real-world complexities in prior work on human motion forecasting, emphasizing the social properties of multi-person motion, the diversity of motion and social interactions, and the complexity of ar…

MV-Map: Offboard HD-Map Generation with Multi-view Consistency

Ziyang Xie, Ziqi Pang, Yu-Xiong Wang · 2023

While bird's-eye-view (BEV) perception models can be useful for building high-definition maps (HD-Maps) with less human labor, their results are often unreliable and demonstrate noticeable inconsistencies in the predicted HD-Maps from diff…

Object Discovery from Motion-Guided Tokens

Zhipeng Bao, Pavel Tokmakov, Yu-Xiong Wang, Adrien Gaidon, Martial Hebert · 2023

Object discovery -- separating objects from the background without manual labels -- is a fundamental open challenge in computer vision. Previous methods struggle to go beyond clustering of low-level cues, whether handcrafted (e.g., color, …

Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking

Ziqi Pang, Jie Li, Pavel Tokmakov, Dian Chen, Sergey Zagoruyko, et al. · 2023

This work proposes an end-to-end multi-camera 3D multi-object tracking (MOT) framework. It emphasizes spatio-temporal continuity and integrates both past and future reasoning for tracked objects. Thus, we name it "Past-and-Future reasoning…

Towards overcoming data scarcity in materials science: unifying models and datasets with a mixture of experts framework

Rees Chang, Yu-Xiong Wang, Elif Ertekin · 2022

Yu-Xiong Wang