Xitong Yang
Progress-Aware Video Frame Captioning
While image captioning provides isolated descriptions for individual images, and video captioning offers a single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the fr…
An adaptive fault diagnosis method for rotating machinery based on GCN deep feature extraction and OptGB
Detecting faults in bearings and gears is pivotal for smooth machinery and equipment operation, as well as for preventing potentially catastrophic accidents. However, deep-learning-based fault diagnosis methods are highly dependent on the…
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
Goal-oriented planning, or anticipating a series of actions that transition an agent from its current state to a predefined objective, is crucial for developing intelligent assistants aiding users in daily procedural tasks. The problem pre…
GenRec: Unifying Video Generation and Recognition with Diffusion Models
Video diffusion models are able to generate high-quality videos by learning strong spatial-temporal priors on large-scale datasets. In this paper, we aim to investigate whether such priors derived from a generative process are suitable for…
Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning
We present EMBED (Egocentric Models Built with Exocentric Data), a method designed to transform exocentric video-language data for egocentric video representation learning. Large-scale exocentric data covers diverse activities with signifi…
Video ReCap: Recursive Captioning of Hour-Long Videos
Most video captioning models are designed to process short video clips of a few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours a…
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dan…
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data
Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a…
Towards Scalable Neural Representation for Diverse Videos
Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images, and have been recently applied to encode videos (e.g., NeRV, E-NeRV). While achieving promising results, existing INR-based method…
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences. These tasks differ in the type of inputs (only video, or video-query pair where query is an image reg…
Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization
Contrastive Language-Image Pretraining (CLIP) has demonstrated impressive zero-shot learning abilities for image understanding, yet limited effort has been made to investigate CLIP for zero-shot video recognition. We introduce Open-VCLIP, …
Vision Transformers Are Good Mask Auto-Labelers
We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-la…
ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training. Without the boundary information of action segments, existing methods most…
Tumour mutational burden and immune-cell infiltration in cervical squamous cell carcinoma
Background: Cervical carcinoma is one of the most common gynaecological malignancies worldwide and severely affects the health of women; cervical squamous cell carcinoma is the most prevalent form. The aim of this study was to assess the tu…
Efficient Video Transformers with Spatial-Temporal Token Selection
Video transformers have achieved impressive results on major video recognition benchmarks, but they suffer from high computational cost. In this paper, we present STTS, a token selection framework that dynamically selects a few inform…
Semi-Supervised Vision Transformers
We study the training of Vision Transformers for semi-supervised image classification. Transformers have recently demonstrated impressive performance on a multitude of supervised learning tasks. Surprisingly, we show Vision Transformers pe…
Expression and Clinical Significance of BIRC5 in Low-Grade Gliomas Based on Bioinformatics Analysis
Background: To screen target genes and analyze the expression and mechanism of target genes in low-grade gliomas (LGG). Methods: LGG data were downloaded from the TCGA database. Differentially expressed genes (DEGs) were screened by differential e…
Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories
The standard way of training video models entails sampling at each iteration a single clip from a video and optimizing the clip prediction with respect to the video-level label. We argue that a single clip may not have enough temporal cove…
Long-Term Temporal Modeling for Video Action Understanding
The tremendous growth in video data, both on the internet and in real life, has encouraged the development of intelligent systems that can automatically analyze video contents and understand human actions. Therefore, video understanding ha…
GTA: Global Temporal Attention for Video Action Understanding
Self-attention learns pairwise interactions to model long-range dependencies, yielding great improvements for video action recognition. In this paper, we seek a deeper understanding of self-attention for temporal modeling in videos. We fir…
Hierarchical Contrastive Motion Learning for Video Action Recognition
One central question for video action recognition is how to model motion. In this paper, we present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw vid…
A Generic Visualization Approach for Convolutional Neural Networks
Cross-X Learning for Fine-Grained Visual Categorization
Recognizing objects from subcategories with very subtle differences remains a challenging task due to the large intra-class and small inter-class variation. Recent work tackles this problem in a weakly-supervised manner: object parts are f…
STEP: Spatio-Temporal Progressive Learning for Video Action Detection
In this paper, we propose the Spatio-TEmporal Progressive (STEP) action detector, a progressive learning framework for spatio-temporal action detection in videos. Starting from a handful of coarse-scale proposal cuboids, our approach progress…
Exploring Uncertainty in Conditional Multi-Modal Retrieval Systems
We cast visual retrieval as a regression problem by posing triplet loss as a regression loss. This enables epistemic uncertainty estimation using dropout as a Bayesian approximation framework in retrieval. Accordingly, Monte Carlo (MC) sam…
Learning Density Models via Structured Latent Variables
As one principal approach to machine learning and cognitive science, the probabilistic framework has been continuously developed both theoretically and practically. Learning a probabilistic model can be thought of as inferring plausible mo…
An Interactive Greedy Approach to Group Sparsity in High Dimensions
Sparsity learning with known grouping structure has received considerable attention due to wide modern applications in high-dimensional data analysis. Although advantages of using group information have been well-studied by shrinkage-based…
Two Stream Self-Supervised Learning for Action Recognition
We present a self-supervised approach using spatio-temporal signals between video frames for action recognition. A two-stream architecture is leveraged to tangle spatial and temporal representation learning. Our task is formulated as both …
The Effectiveness of Instance Normalization: a Strong Baseline for Single Image Dehazing
We propose a novel deep neural network architecture for the challenging problem of single image dehazing, which aims to recover the clear image from a degraded hazy image. Instead of relying on hand-crafted image priors or explicitly estim…