Yoichi Sato
Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to "read the room" and assess deception in complex social intera…
Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions
As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of …
Leveraging RGB Images for Pre-Training of Event-Based Hand Pose Estimation
This paper presents RPEP, the first pre-training method for event-based 3D hand pose estimation using labeled RGB images and unpaired, unlabeled event data. Event data offer significant benefits such as high temporal resolution and low lat…
Initial in-orbit operation of the soft X-ray spectrometer Resolve onboard the X-ray imaging and spectroscopy mission satellite
FineBio: A Fine-Grained Video Dataset of Biological Experiments with Hierarchical Annotation
In the development of science, accurate and reproducible documentation of the experimental process is crucial. Automatic recognition of the actions in experiments from videos would help experimenters by complementing the recording of exper…
Audio-visual localization based on spatial relative sound order
Sound localization is one of the essential tasks in audio-visual learning. Especially, stereo sound localization methods have been proposed to handle multiple sound sources. However, existing stereo-sound localization methods treat sound s…
Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision
Perceiving the world from both egocentric (first-person) and exocentric (third-person) perspectives is fundamental to human cognition, enabling rich and complementary understanding of dynamic environments. In recent years, allowing the mac…
Cross-View Correspondence Modeling for Joint Representation Learning Between Egocentric and Exocentric Videos
Joint analysis of human action videos from egocentric and exocentric views enables a more comprehensive understanding of human behavior. While previous works leverage paired videos to align clip-level features across views, they often igno…
Ego4D: Around the World in 3,000 Hours of Egocentric Video
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camer…
Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition
Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion. While these characteristics are intuitively valuable to help …
Learning Multiple Object States from Actions via Large Language Models
Recognizing the states of objects in a video is crucial in understanding the scene beyond actions and objects. For instance, an egg can be raw, cracked, and whisked while cooking an omelet, and these states can coexist simultaneously (an e…
Matching Compound Prototypes for Few-Shot Action Recognition
The task of few-shot action recognition aims to recognize novel action classes using only a small number of labeled training samples. How to better describe the action in each video and how to compare the similarity between videos are two …
Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects
We interact with the world with our hands and see it through our own (egocentric) perspective. A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion g…
Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation
The pursuit of accurate 3D hand pose estimation stands as a keystone for understanding human activity in the realm of egocentric vision. The majority of existing estimation methods still rely on single-view images as input, leading to pote…
Simultaneous control of head pose and expressions in 3D facial keypoint-based GAN
In this work, we present a novel method for simultaneously controlling the head pose and the facial expressions of a given input image using a 3D keypoint-based GAN. Existing methods for controlling head pose and expressions simultaneously…
Direction-of-Arrival Estimation for Mobile Agents Utilizing the Relationship Between Agent’s Trajectory and Binaural Audio
With the development of robotics and wearable devices, there is a need for information processing under the assumption that an agent itself is mobile. Especially, understanding an acoustic environment around an agent is an important issue.…
Image Cropping under Design Constraints
Image cropping is essential in image editing for obtaining a compositionally enhanced image. In display media, image cropping is a prospective technique for automatically creating media content. However, image cropping for media content…
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dan…
Generative Hierarchical Temporal Transformer for Hand Pose and Action Modeling
We present a novel unified framework that concurrently tackles recognition and future prediction for human hand pose and action modeling. Previous works generally provide isolated solutions for either recognition or prediction, which not o…
Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos
We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and…
Seeking Flat Minima with Mean Teacher on Semi- and Weakly-Supervised Domain Generalization for Object Detection
Object detectors do not work well when domains largely differ between training and testing data. To overcome this domain gap in object detection without requiring expensive annotations, we consider two problem settings: semi-supervised dom…
Proposal-based Temporal Action Localization with Point-level Supervision
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos where only a single point (frame) within every action instance is annotated in training data. Without temporal annota…
Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey
In this survey, we present a systematic review of 3D hand pose estimation from the perspective of efficient annotation and learning. 3D hand pose estimation has been an important research area owing to its potential to enable various appli…
Intuitive Surgical SurgToolLoc Challenge Results: 2022-2023
Robotic assisted (RA) surgery promises to transform surgical intervention. Intuitive Surgical is committed to fostering these changes and the machine learning models and algorithms that will enable them. With these goals in mind we have in…
Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction
The Multiplane Image (MPI), containing a set of fronto-parallel RGBA layers, is an effective and efficient representation for view synthesis from sparse inputs. Yet, its fixed structure limits the performance, especially for surfaces image…
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
Image cropping has progressed tremendously under the data-driven paradigm. However, current approaches do not account for the intentions of the user, which is an issue especially when the composition of the input image is complex. Moreover…
A case of infective endocarditis caused by Kocuria rosea in a non-compromised patient
Surgical Skill Assessment via Video Semantic Aggregation
Automated video-based assessment of surgical skills is a promising task in assisting young surgical trainees, especially in poor-resource areas. Existing works often resort to a CNN-LSTM joint framework that models long-term relationships …
CompNVS: Novel View Synthesis with Scene Completion
We introduce a scalable framework for novel view synthesis from RGB-D images with largely incomplete scene coverage. While generative neural approaches have demonstrated spectacular results on 2D images, they have not yet achieved similar …