Bugra Tekin
GoTrack: Generic 6DoF Object Pose Refinement and Tracking
We introduce GoTrack, an efficient and accurate CAD-based method for 6DoF object pose refinement and tracking, which can handle diverse objects without any object-specific training. Unlike existing tracking methods that rely solely on an a…
HuMoCon: Concept Discovery for Human Motion Understanding
We present HuMoCon, a novel motion-video understanding framework designed for advanced human behavior analysis. The core of our method is a human motion concept discovery framework that efficiently trains multi-modal encoders to extract se…
CigTime: Corrective Instruction Generation Through Inverse Motion Editing
Recent advancements in models linking natural language with human motions have shown significant promise in motion generation and editing based on instructional text. Motivated by applications in sports coaching and motor skill learning, w…
X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization
Lately, there has been growing interest in adapting vision-language models (VLMs) to image and third-person video classification due to their success in zero-shot recognition. However, the adaptation of these models to egocentric videos ha…
DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions
Generating natural hand-object interactions in 3D is challenging as the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the…
FoundPose: Unseen Object Pose Estimation with Foundation Features
We propose FoundPose, a model-based method for 6D pose estimation of unseen objects from a single RGB image. The method can quickly onboard new objects using their 3D models without requiring any object- or task-specific training. In contr…
HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World
Building an interactive AI assistant that can perceive, reason, and collaborate with humans in the real world has been a long-standing pursuit in the AI community. This work is part of a broader research effort to develop intelligent agent…
Learning to Align Sequential Actions in the Wild
State-of-the-art methods for self-supervised sequential action alignment rely on deep networks that find correspondences across videos in time. They either learn frame-to-frame mapping across sequences, which does not leverage temporal …
Context-Aware Sequence Alignment using 4D Skeletal Augmentation
Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality. State-of-the-art methods directly learn image-based embedding space by leveraging powerful d…
H2O: Two Hands Manipulating Objects for First Person Interaction Recognition
We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects. To this end, we propose a method to create a unified dataset for egocentric 3D interaction recog…
Reconstructing and grounding narrated instructional videos in 3D
Narrated instructional videos often show and describe manipulations of similar objects, e.g., repairing a particular model of a car or laptop. In this work we aim to reconstruct such objects and to localize associated narrations in 3D. Con…
Domain-Specific Priors and Meta Learning for Few-Shot First-Person Action Recognition
The lack of large-scale real datasets with annotations makes transfer learning a necessity for video activity understanding. We aim to develop an effective method for few-shot transfer learning for first-person action classification. We le…
HoloLens 2 Research Mode as a Tool for Computer Vision Research
Mixed reality headsets, such as the Microsoft HoloLens 2, are powerful sensing devices with integrated compute capabilities, which makes them ideal platforms for computer vision research. In this technical report, we present HoloLens 2 Res…
Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction
Modeling hand-object manipulations is essential for understanding how humans interact with their environment. While of practical importance, estimating the pose of hands and objects during interactions is challenging due to the large mutua…
H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions
We present a unified framework for understanding 3D hand and object interactions in raw image sequences from egocentric RGB cameras. Given a single RGB image, our model jointly estimates the 3D hand and object poses, models their interacti…
Real-Time Seamless Single Shot 6D Object Pose Prediction
We propose a single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses. Unlike a recently proposed single-shot techniqu…
Learning Robust Features and Latent Representations for Single View 3D Pose Estimation of Humans and Objects
Estimating the 3D poses of rigid and articulated bodies is one of the fundamental problems of Computer Vision. It has a broad range of applications including augmented reality, surveillance, animation and human-computer interaction. Despit…
Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation
Most recent approaches to monocular 3D human pose estimation rely on Deep Learning. They typically involve regressing from an image to either 3D joint coordinates directly or 2D joint locations from which 3D coordinates are inferred. Both …
Direct Prediction of 3D Body Poses from Motion Compensated Sequences
We propose an efficient approach to exploiting motion information from consecutive frames of a video sequence to recover the 3D pose of people. Previous approaches typically compute candidate poses in individual frames and then link them i…