Yoichi Sato
Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to "read the room" and assess deception in complex social intera…
Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions
As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of …
Leveraging RGB Images for Pre-Training of Event-Based Hand Pose Estimation
This paper presents RPEP, the first pre-training method for event-based 3D hand pose estimation using labeled RGB images and unpaired, unlabeled event data. Event data offer significant benefits such as high temporal resolution and low lat…
Initial in-orbit operation of the soft X-ray spectrometer Resolve onboard the X-ray imaging and spectroscopy mission satellite
FineBio: A Fine-Grained Video Dataset of Biological Experiments with Hierarchical Annotation
In the development of science, accurate and reproducible documentation of the experimental process is crucial. Automatic recognition of the actions in experiments from videos would help experimenters by complementing the recording of exper…
Audio-visual localization based on spatial relative sound order
Sound localization is one of the essential tasks in audio-visual learning. Especially, stereo sound localization methods have been proposed to handle multiple sound sources. However, existing stereo-sound localization methods treat sound s…
Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision
Perceiving the world from both egocentric (first-person) and exocentric (third-person) perspectives is fundamental to human cognition, enabling rich and complementary understanding of dynamic environments. In recent years, allowing the mac…
Cross-View Correspondence Modeling for Joint Representation Learning Between Egocentric and Exocentric Videos
Joint analysis of human action videos from egocentric and exocentric views enables a more comprehensive understanding of human behavior. While previous works leverage paired videos to align clip-level features across views, they often igno…
Ego4D: Around the World in 3,000 Hours of Egocentric Video
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camer…
Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition
Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion. While these characteristics are intuitively valuable to help …
Learning Multiple Object States from Actions via Large Language Models
Recognizing the states of objects in a video is crucial in understanding the scene beyond actions and objects. For instance, an egg can be raw, cracked, and whisked while cooking an omelet, and these states can coexist simultaneously (an e…
Matching Compound Prototypes for Few-Shot Action Recognition
The task of few-shot action recognition aims to recognize novel action classes using only a small number of labeled training samples. How to better describe the action in each video and how to compare the similarity between videos are two …
Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects
We interact with the world with our hands and see it through our own (egocentric) perspective. A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion g…
Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation
The pursuit of accurate 3D hand pose estimation stands as a keystone for understanding human activity in the realm of egocentric vision. The majority of existing estimation methods still rely on single-view images as input, leading to pote…
Simultaneous control of head pose and expressions in 3D facial keypoint-based GAN
In this work, we present a novel method for simultaneously controlling the head pose and the facial expressions of a given input image using a 3D keypoint-based GAN. Existing methods for controlling head pose and expressions simultaneously…
Direction-of-Arrival Estimation for Mobile Agents Utilizing the Relationship Between Agent’s Trajectory and Binaural Audio
With the development of robotics and wearable devices, there is a need for information processing under the assumption that an agent itself is mobile. Especially, understanding an acoustic environment around an agent is an important issue.…
Image Cropping under Design Constraints
Image cropping is essential in image editing for obtaining a compositionally enhanced image. In display media, image cropping is a prospective technique for automatically creating media content. However, image cropping for media content…
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dan…
Generative Hierarchical Temporal Transformer for Hand Pose and Action Modeling
We present a novel unified framework that concurrently tackles recognition and future prediction for human hand pose and action modeling. Previous works generally provide isolated solutions for either recognition or prediction, which not o…
Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos
We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and…
Seeking Flat Minima with Mean Teacher on Semi- and Weakly-Supervised Domain Generalization for Object Detection
Object detectors do not work well when domains largely differ between training and testing data. To overcome this domain gap in object detection without requiring expensive annotations, we consider two problem settings: semi-supervised dom…
Proposal-based Temporal Action Localization with Point-level Supervision
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos where only a single point (frame) within every action instance is annotated in training data. Without temporal annota…
Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey
In this survey, we present a systematic review of 3D hand pose estimation from the perspective of efficient annotation and learning. 3D hand pose estimation has been an important research area owing to its potential to enable various appli…
Intuitive Surgical SurgToolLoc Challenge Results: 2022-2023
Robotic assisted (RA) surgery promises to transform surgical intervention. Intuitive Surgical is committed to fostering these changes and the machine learning models and algorithms that will enable them. With these goals in mind we have in…
Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction
The Multiplane Image (MPI), containing a set of fronto-parallel RGBA layers, is an effective and efficient representation for view synthesis from sparse inputs. Yet, its fixed structure limits the performance, especially for surfaces image…
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
Image cropping has progressed tremendously under the data-driven paradigm. However, current approaches do not account for the intentions of the user, which is an issue especially when the composition of the input image is complex. Moreover…
A case of infective endocarditis caused by Kocuria rosea in a non-compromised patient
Surgical Skill Assessment via Video Semantic Aggregation
Automated video-based assessment of surgical skills is a promising task in assisting young surgical trainees, especially in poor-resource areas. Existing works often resort to a CNN-LSTM joint framework that models long-term relationships …
CompNVS: Novel View Synthesis with Scene Completion
We introduce a scalable framework for novel view synthesis from RGB-D images with largely incomplete scene coverage. While generative neural approaches have demonstrated spectacular results on 2D images, they have not yet achieved similar …