
Layer-Aware Video Composition via Split-then-Merge

Shuicheng Yan, James M. Rehg, Weiyu Chu · 2025

We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM spl…

Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications across Lab and Field Settings

Mithun Saha, Maxwell A. Xu, Wanting Mao, Sameer Neupane, James M. Rehg, et al. · 2025

Photoplethysmography (PPG)-based foundation models are gaining traction due to the widespread use of PPG in biosignal monitoring and their potential to track diverse health indicators. In this paper, we introduce Pulse-PPG, an open-source …

AI for Creative Visual Content Generation, Editing and Understanding

Or Patashnik, Gaurav Parmar, Anyi Rao, Ozgur Kara, Fabian Caba Heilbron, et al. · 2025

LSM-2: Learning from Incomplete Wearable Sensor Data

Maxwell A. Xu, Girish Narayanswamy, Kumar Ayush, Dimitris Spathis, Shun Liao, et al. · 2025

Foundation models, a cornerstone of recent advancements in machine learning, have predominantly thrived on complete and well-structured data. Wearable sensor data frequently suffers from significant missingness, posing a substantial challe…

Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning

Bolin Lai, Sang‐Min Lee, Xu Cao, Xiang Li, James M. Rehg · 2025

Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods typically add visual conditions to text-to-video (T2V) foundation models by fine…

MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models

Anh Thai, Stefan Stojanov, Zixuan Huang, Bikram Boote, James M. Rehg · 2025

This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to …

SocialGesture: Delving into Multi-person Gesture Understanding

Xu Cao, Pranav Virupaksha, Wenqi Jia, Bolin Lai, Fiona Ryan, et al. · 2025

Previous research in human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation in existing datasets presents a signif…

Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium

Amin Adibi, Cao Xu, Zongliang Ji, Jasvinder Kaur, Wei Chen, et al. · 2025

The fourth Machine Learning for Health (ML4H) symposium was held in person on December 15th and 16th, 2024, in the traditional, ancestral, and unceded territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver, British …

Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings

Mithun Saha, Maxwell A. Xu, Wanting Mao, Sameer Neupane, James M. Rehg, et al. · 2025

Photoplethysmography (PPG)-based foundation models are gaining traction due to the widespread use of PPG in biosignal monitoring and their potential to generalize across diverse health applications. In this paper, we introduce Pulse-PPG, t…

SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Zixuan Huang, Mark Boss, Aaryaman Vasishta, James M. Rehg, Varun Jampani · 2025

We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. Regression methods efficiently infer visible surfaces, but struggle with occl…

Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

Fiona Ryan, Ajay Bati, Sang Min Lee, Daniel Bolya, Judy Hoffman, et al. · 2024

We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person's gaze target requires reasoning both about the person's appearance and the contents of the scene. Prior work…

PyPulse: A Python Library for Biosignal Imputation

Kevin Gao, Maxwell A. Xu, James M. Rehg, Alexander Moreno · 2024

We introduce PyPulse, a Python package for imputation of biosignals in both clinical and wearable sensor settings. Missingness is commonplace in these settings and can arise from multiple causes, such as insecure sensor attachment or data …

Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Bolin Lai, Felix Juefei-Xu, Miao Liu, Xiaoliang Dai, Nikhil Mehta, et al. · 2024

Text-guided image manipulation has experienced notable advancement in recent years. In order to mitigate linguistic ambiguity, few-shot learning with visual examples has been applied for instructions that are underrepresented in the traini…

Optimization-Free Image Immunization Against Diffusion-Based Editing

Tarik Can Ozden, Özgür Kara, Oguzhan Akcin, Kerem Zaman, Shashank Srivastava, et al. · 2024

Current image immunization defense techniques against diffusion-based editing embed imperceptible noise in target images to disrupt editing models. However, these methods face scalability challenges, as they require time-consuming re-optim…

RelCon: Relative Contrastive Learning for a Motion Foundation Model for Wearable Data

Maxwell A. Xu, Jaya Narain, Gregory Darnell, Haraldur Tómas Hallgrímsson, Hyewon Jeong, et al. · 2024

We present RelCon, a novel self-supervised Relative Contrastive learning approach for training a motion foundation model from wearable accelerometry sensors. First, a learnable distance measure is trained to capture motif similarity and do…

Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation

Xiang Li, Zixuan Huang, Anh Thai, James M. Rehg · 2024

Symmetry is a ubiquitous and fundamental property in the visual world, serving as a critical cue for perception and structure interpretation. This paper investigates the detection of 3D reflection symmetry from a single RGB image, and reve…

Medical Video Generation for Disease Progression Simulation

Xu Cao, Kaizhao Liang, Kuei-Da Liao, Tianren Gao, Wenqian Ye, et al. · 2024

Modeling disease progression is crucial for improving the quality and efficacy of clinical diagnosis and prognosis, but it is often hindered by a lack of longitudinal medical image monitoring for individual patients. To address this challe…

Human Action Anticipation: A Survey

Bolin Lai, Sam Toyer, Tushar Nagarajan, Rohit Girdhar, Shengxin Zha, et al. · 2024

Predicting future human behavior is an increasingly popular topic in computer vision, driven by the interest in applications such as autonomous vehicles, digital assistants and human-robot interactions. The literature on behavior predictio…

Leveraging Object Priors for Point Tracking

Bikram Boote, Anh Thai, Wenqi Jia, Özgür Kara, Stefan Stojanov, et al. · 2024

Point tracking is a fundamental problem in computer vision with numerous applications in AR and robotics. A common failure mode in long-term point tracking occurs when the predicted point leaves the object it belongs to and lands on the ba…

Towards Social AI: A Survey on Understanding Social Interactions

Sangmin Lee, Minzhi Li, Bolin Lai, Wenqi Jia, Fiona Ryan, et al. · 2024

Social interactions form the foundation of human societies. Artificial intelligence has made significant progress in certain areas, but enabling machines to seamlessly understand social interactions remains an open challenge. It is importa…

Ego4D: Around the World in 3,600 Hours of Egocentric Video

Kristen Grauman, Andrew Westbury, Eugene H. Byrne, Vincent Cartillier, Zachary Chavis, et al. · 2024

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camer…

3x2: 3D Object Part Segmentation by 2D Semantic Correspondences

Anh Thai, Wei‐Yao Wang, Hao Tang, Stefan Stojanov, Matt Feiszli, et al. · 2024

3D object part segmentation is essential in computer vision applications. While substantial progress has been made in 2D object part segmentation, the 3D counterpart has received less attention, in part due to the scarcity of annotated 3D …

Temporally Multi-Scale Sparse Self-Attention for Physical Activity Data Imputation

Hui Wei, Maxwell A. Xu, Colin Samplawski, James M. Rehg, Santosh Kumar, et al. · 2024

Wearable sensors enable health researchers to continuously collect data pertaining to the physiological state of individuals in real-world settings. However, such data can be subject to extensive missingness due to a complex combination of…

MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs

Wenqian Ye, Guangtao Zheng, Yunsheng Ma, Xu Cao, Bolin Lai, et al. · 2024

Spurious bias, a tendency to use spurious correlations between non-essential input attributes and target variables for predictions, has revealed a severe robustness pitfall in deep learning models trained on single modality data. Multimoda…

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

Xu Cao, Bolin Lai, Wenqian Ye, Yunsheng Ma, Joerg Heintz, et al. · 2024

Recently, Multimodal Large Language Models (MLLMs) and Vision Language Models (VLMs) have shown great promise in language-guided perceptual tasks such as recognition, segmentation, and object detection. However, their effectiveness in addr…

PointInfinity: Resolution-Invariant Point Diffusion Models

Zixuan Huang, Justin C. Johnson, Shoubhik Debnath, James M. Rehg, Chao-Yuan Wu · 2024

We present PointInfinity, an efficient family of point cloud diffusion models. Our core idea is to use a transformer-based architecture with a fixed-size, resolution-invariant latent representation. This enables efficient training with low…

Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations

Sangmin Lee, Bolin Lai, Fiona Ryan, Bikram Boote, James M. Rehg · 2024

Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or …

Web-Based Programming Guide for Allen Bradley PLCs

James M. Rehg · 2024

NOTE: The first page of text has been automatically extracted and included below in lieu of an abstract. Session 1647. Web-based Programming Guide for Allen Bradley PLCs. James A. Rehg, Penn State Altoona. Abstract: Programmable logic controller…

ZeroShape: Regression-based Zero-shot Shape Reconstruction

Zixuan Huang, Stefan Stojanov, Anh Thai, Varun Jampani, James M. Rehg · 2023

We study the problem of single-image zero-shot 3D shape reconstruction. Recent works learn zero-shot shape reconstruction through generative modeling of 3D assets, but these models are computationally expensive at train and inference time.…

The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

Wenqi Jia, Miao Liu, Hao Jiang, Ishwarya Ananthabhotla, James M. Rehg, et al. · 2023

In recent years, the thriving development of research related to egocentric videos has provided a unique perspective for the study of conversational interactions, where both visual and audio signals play a crucial role. While most prior wo…

James M. Rehg