Yoshimitsu Aoki
BasketLiDAR: The First LiDAR-Camera Multimodal Dataset for Professional Basketball MOT
Real-time 3D trajectory player tracking in sports plays a crucial role in tactical analysis, performance evaluation, and enhancing spectator experience. Traditional systems rely on multi-camera setups, but are constrained by the inherently…
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos
In this paper, we propose Language-Guided Contrastive Audio-Visual Masked Autoencoders (LG-CAV-MAE) to improve audio-visual representation learning. LG-CAV-MAE integrates a pretrained text encoder into contrastive audio-visual masked autoe…
Iterative Event-based Motion Segmentation by Variational Contrast Maximization
Event cameras provide rich signals that are suitable for motion estimation since they respond to changes in the scene. As any visual changes in the scene produce event data, it is paramount to classify the data into different motions (i.e.…
Postoperative Knee Extensor Strength After Medial Patellofemoral Ligament Reconstruction Using Superficial Slip of the Quadriceps Tendon and the Factors That Affect Strength Recovery
Background: Medial patellofemoral ligament reconstruction (MPFLR) using the quadriceps tendon can avoid complications related to the fixation of other graft types to the patella. However, there is concern about postoperative loss of knee e…
Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering
Social intelligence, the ability to interpret emotions, intentions, and behaviors, is essential for effective communication and adaptive responses. As robots and AI systems become more prevalent in caregiving, healthcare, and education, th…
Relationship Between Vertical Ground Reaction Force and Acceleration from Wearable Inertial Measurement Units During Single-Leg Drop Landing After Anterior Cruciate Ligament Reconstruction
The purpose of this study was to clarify the relationship between vertical ground reaction force (VGRF) and acceleration from wearable inertial measurement units (IMUs) during single-leg drop landing after anterior cruciate ligament recons…
Text-guided Synthetic Geometric Augmentation for Zero-shot 3D Understanding
Zero-shot recognition models require extensive training data for generalization. However, in zero-shot 3D classification, collecting 3D data and captions is costly and labor-intensive, posing a significant barrier compared to 2D vision. Rec…
A Comprehensive Analysis of a Social Intelligence Dataset and Response Tendencies Between Large Language Models (LLMs) and Humans
In recent years, advancements in the interaction and collaboration between humans and AI have garnered significant attention. Social intelligence plays a crucial role in facilitating natural interactions and seamless communication between hum…
DynamicVLN: Incorporating Dynamics into Vision-and-Language Navigation Scenarios
Traditional Vision-and-Language Navigation (VLN) tasks require an agent to navigate static environments using natural language instructions. However, real-world road conditions such as vehicle movements, traffic signal fluctuations, pedest…
BoundMatch: Boundary Detection Applied to Semi-Supervised Segmentation
Semi-supervised semantic segmentation (SS-SS) aims to mitigate the heavy annotation burden of dense pixel labeling by leveraging abundant unlabeled images alongside a small labeled set. While current consistency regularization methods achi…
RECA: A Pipeline for Refinement of Compressed Artifacts in Image Super-Resolution Training
Training datasets for image super-resolution (SR) are often constructed from web images. However, these images are typically stored in JPEG format, introducing compression artifacts that degrade SR performance. To ensure data quality, conv…
Relationship Between Quadriceps Strength at 6 Months Postoperatively and Improvement in Patient-Reported Knee Function After Anterior Cruciate Ligament Reconstruction
Background: Understanding the factors associated with poor recovery over time after anterior cruciate ligament reconstruction (ACLR) helps clinicians identify patients who are at risk and targets for an intervention. Purpose: To determine …
Acoustic-based 3D Human Pose Estimation Robust to Human Position
This paper explores the problem of 3D human pose estimation from only low-level acoustic signals. The existing active acoustic sensing-based approach for 3D human pose estimation implicitly assumes that the target user is positioned along …
Pre-training with Synthetic Patterns for Audio
In this paper, we propose to pre-train audio encoders using synthetic patterns instead of real audio data. Our proposed framework consists of two key elements. The first one is Masked Autoencoder (MAE), a self-supervised learning framework…
Data Collection-free Masked Video Modeling
Pre-training video transformers generally requires a large amount of data, presenting significant challenges in terms of data collection costs and concerns related to privacy, licensing, and inherent biases. Synthesizing data is one of the…
Rethinking Image Super-Resolution from Training Data Perspectives
In this work, we investigate the understudied effect of the training data used for image super-resolution (SR). Most commonly, novel SR methods are developed and benchmarked on common training datasets such as DIV2K and DF2K. However, we i…
RetinaViT: Efficient Visual Backbone for Online Video Streams
In online video understanding, which has a wide range of real-world applications, inference speed is crucial. Many approaches involve frame-level visual feature extraction, which often represents the biggest bottleneck. We propose RetinaVi…
Poster 241: Comparison of Short-Term Clinical Outcomes of Medial Patellofemoral Ligament Reconstruction Using Superficial Quadriceps Tendon and Using Hamstring Tendon in Patella Instability
Objectives: Medial patellofemoral ligament reconstruction (MPFLR) is widely acknowledged as a therapeutic approach for patella instability. While hamstring autografts are widely used in MPFLR, there are concerns regarding complications ass…
Proto-Adapter: Efficient Training-Free CLIP-Adapter for Few-Shot Image Classification
Large vision-language models, such as Contrastive Vision-Language Pre-training (CLIP), pre-trained on large-scale image–text datasets, have demonstrated robust zero-shot transfer capabilities across various downstream tasks. To further enh…
Secrets of Event-Based Optical Flow, Depth and Ego-Motion Estimation by Contrast Maximization
Event cameras respond to scene dynamics and provide signals naturally suitable for motion estimation with advantages, such as high dynamic range. The emerging field of event-based vision motivates a revisit of fundamental computer vision t…
3D Human Scan With A Moving Event Camera
Capturing a 3D human body is one of the important tasks in computer vision with a wide range of applications such as virtual reality and sports analysis. However, conventional frame cameras are limited by their temporal resolution and dyna…
PCT: Perspective Cue Training Framework for Multi-Camera BEV Segmentation
Generating annotations for bird's-eye-view (BEV) segmentation presents significant challenges due to the scenes' complexity and the high manual annotation cost. In this work, we address these challenges by leveraging the abundance of unlab…
MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation
Semantic segmentation is essential in computer vision for various applications, yet traditional approaches face significant challenges, including the high cost of annotation and extensive training for supervised learning. Additionally, due…
TAG: Guidance-free Open-Vocabulary Semantic Segmentation
Semantic segmentation is a crucial task in computer vision, where each pixel in an image is classified into a category. However, traditional methods face significant challenges, including the need for pixel-level annotations and extensive …
Improving Perceptual Loss with CLIP for Super-Resolution
Perceptual loss, calculated by a VGG network pre-trained on ImageNet, has been widely employed in the past for super-resolution tasks, enabling the generation of photo-realistic images. However, it has been reported that grid-like artifacts …
Synthetic Document Images with Diverse Shadows for Deep Shadow Removal Networks
Shadow removal for document images is an essential task for digitized document applications. Recent shadow removal models have been trained on pairs of shadow images and shadow-free images. However, obtaining a large, diverse dataset for d…