Pichao Wang
H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers
Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this …
CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation
Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of reinforcem…
Beyond Speaker Identity: Text Guided Target Speech Extraction
Target Speech Extraction (TSE) traditionally relies on explicit clues about the speaker's identity like enrollment audio, face images, or videos, which may not always be available. In this paper, we propose a text-guided TSE model StyleTSE…
SparseDiT: Token Sparsification for Efficient Diffusion Transformer
Diffusion Transformers (DiT) are renowned for their impressive generative performance; however, they are significantly constrained by considerable computational costs due to the quadratic complexity in self-attention and the extensive samp…
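As general background for the quadratic-complexity point in this snippet (and not SparseDiT's actual method), the sketch below shows why self-attention cost grows with the square of the token count; all names here are hypothetical.

```python
# Illustrative sketch of quadratic self-attention cost (generic background,
# not SparseDiT's method). Learned Q/K/V projections are omitted for brevity.
import numpy as np

def attention_flops(num_tokens: int, dim: int) -> int:
    """Rough FLOP count for one self-attention layer (QK^T and attn @ V)."""
    qk = num_tokens * num_tokens * dim      # scores: N x N, each a dim-length dot product
    av = num_tokens * num_tokens * dim      # weighted sum of values
    return 2 * (qk + av)                    # count a multiply-add as 2 FLOPs

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention over a token matrix x of shape (N, dim)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])            # (N, N): the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ x                                  # (N, dim)

out = self_attention(np.random.default_rng(0).normal(size=(8, 4)))
# Halving the token count cuts attention FLOPs roughly 4x, which is the lever
# token-sparsification methods pull.
print(out.shape, attention_flops(1024, 64) / attention_flops(512, 64))  # (8, 4) 4.0
```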
Factorized Visual Tokenization and Generation
Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant li…
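To make the "discrete tokens" idea in this snippet concrete, here is a minimal sketch of plain VQGAN-style vector quantization (nearest-codebook lookup); it illustrates the baseline the snippet describes, not the factorized tokenizer proposed in the paper, and all names are hypothetical.

```python
# Minimal vector-quantization sketch (plain nearest-codebook lookup,
# not the factorized scheme proposed in the paper).
import numpy as np

def vector_quantize(features: np.ndarray, codebook: np.ndarray):
    """Map each feature vector to the index of its nearest codebook entry.

    features: (num_patches, dim) continuous encoder outputs
    codebook: (codebook_size, dim) learned embedding table
    returns:  (discrete token ids, quantized vectors)
    """
    # Pairwise squared distances between features and codebook entries
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)           # one discrete token per patch
    return ids, codebook[ids]            # token ids + their embeddings

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))         # e.g. 16 image patches, 8-dim features
book = rng.normal(size=(32, 8))          # 32-entry codebook
tokens, quantized = vector_quantize(feats, book)
print(tokens.shape, quantized.shape)     # (16,) (16, 8)
```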
Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning
Despite advancements in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture complex motions described by text. This …
Unraveling Movie Genres through Cross-Attention Fusion of Bi-Modal Synergy of Poster
Movie posters are not just decorative; they are meticulously designed to capture the essence of a movie, such as its genre, storyline, and tone/vibe. For decades, movie posters have graced cinema walls, billboards, and now our digital scre…
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language model…
Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach
As online video content rapidly grows, the task of text-video retrieval (TVR) becomes increasingly important. A key challenge in TVR is the information asymmetry between video and text: videos are inherently richer in information, while th…
Hallucination of Multimodal Large Language Models: A Survey
This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs), which have demonstrated significant advancements and remarkab…
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
The increasing prevalence of video clips has sparked growing interest in text-video retrieval. Recent advances focus on establishing a joint embedding space for text and video, relying on consistent embedding representations to compute sim…
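For context on the joint-embedding similarity the snippet mentions, the sketch below shows generic cosine-similarity scoring between text and video embeddings; it is not the stochastic text-embedding method the paper proposes, and the encoders and names are hypothetical.

```python
# Generic text-video retrieval scoring in a joint embedding space
# (cosine similarity). Not the paper's stochastic embedding approach.
import numpy as np

def cosine_sim_matrix(text_emb: np.ndarray, video_emb: np.ndarray) -> np.ndarray:
    """Similarity between every text query and every video.

    text_emb:  (num_texts, dim)
    video_emb: (num_videos, dim)
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    return t @ v.T                        # (num_texts, num_videos)

rng = np.random.default_rng(0)
texts = rng.normal(size=(3, 512))         # stand-in outputs of a text encoder
videos = rng.normal(size=(5, 512))        # stand-in pooled video features
ranking = cosine_sim_matrix(texts, videos).argsort(axis=1)[:, ::-1]
print(ranking[0])                         # videos ranked for the first query
```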
DFN: A deep fusion network for flexible single and multi-modal action recognition
Multi-modal action recognition methods can be generally classified into two categories: (1) fusing multi-modal features with simple concatenation or fusing the classification scores of individual modalities without considering the interact…
Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation
Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this …
Human Pose-based Estimation, Tracking and Action Recognition with Deep Learning: A Survey
Human pose analysis has garnered significant attention within both the research community and practical applications, owing to its expanding array of uses, including gaming, video surveillance, sports performance analysis, and human-comput…
SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels
Pre-trained vision transformers have strong representation benefits to various downstream tasks. Recently, many parameter-efficient fine-tuning (PEFT) methods have been proposed, and their experiments demonstrate that tuning only 1% extra…
Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition
RGB-D action and gesture recognition remain an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recogniti…
Revisiting Vision Transformer from the View of Path Ensemble
Vision Transformers (ViTs) are normally regarded as a stack of transformer layers. In this work, we propose a novel view of ViTs showing that they can be seen as ensemble networks containing multiple parallel paths with different lengths. …
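To make the "parallel paths" reading concrete, the expansion below is the standard residual-stream argument that motivates ensemble views of residual architectures in general; it is illustrative background, not the formulation used in this paper.

```latex
% Generic residual-stream expansion (illustrative of the "paths" view).
% Treating two residual blocks F_1, F_2 as linear maps:
(\mathrm{I} + F_2)(\mathrm{I} + F_1) \;=\; \mathrm{I} + F_1 + F_2 + F_2 F_1 ,
% i.e. four parallel paths of lengths 0, 1, 1 and 2; a stack of L such blocks
% expands into 2^L paths. With nonlinear blocks the identity holds only as an
% analogy, which is the intuition behind reading a ViT as an ensemble of paths.
```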
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
Text-to-video retrieval systems have recently made significant progress by utilizing pre-trained models trained on large-scale image-text pairs. However, most of the latest methods primarily focus on the video modality while disregarding t…
Frequency Domain Disentanglement for Arbitrary Neural Style Transfer
Arbitrary neural style transfer has been a popular research topic due to its rich application scenarios. Effective disentanglement of content and style is the critical factor for synthesizing an image with arbitrary style. The existing met…
Head-Free Lightweight Semantic Segmentation with Linear Transformer
Existing semantic segmentation works have been mainly focused on designing effective decoders; however, the computational load introduced by the overall structure has long been ignored, which hinders their applications on resource-constrai…
DOAD: Decoupled One Stage Action Detection Network
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding. Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage fo…
PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation
Recently, transformer-based methods have gained significant success in sequential 2D-to-3D lifting human pose estimation. As a pioneering work, PoseFormer captures spatial relations of human joints in each video frame and human dynamics ac…
Selective Structured State-Spaces for Long-Form Video Understanding
Effective modeling of complex spatiotemporal dependencies in long-form videos remains an open problem. The recently proposed Structured State-Space Sequence (S4) model with its linear complexity offers a promising direction in this space. …
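For background on where the linear complexity comes from, the discretized state-space recurrence that S4 builds on is sketched below; this is the generic SSM form, not the selective mechanism this paper adds on top of it.

```latex
% Discretized linear state-space recurrence underlying S4 (generic background):
\begin{aligned}
x_k &= \bar{A}\, x_{k-1} + \bar{B}\, u_k, \\
y_k &= C\, x_k .
\end{aligned}
% Each step costs a fixed amount of work, so processing a length-T sequence is
% O(T), in contrast to the O(T^2) cost of full self-attention.
```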
EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation
Locating 3D objects from a single RGB image via Perspective-n-Point (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, allowing for pa…
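For context, the classical PnP objective the snippet refers to is stated below in its generic least-squares form; EPro-PnP's probabilistic, end-to-end treatment is not reproduced here.

```latex
% Classical PnP objective (generic formulation, not EPro-PnP's layer):
% given 3D points X_i, 2D detections x_i, and intrinsics K, estimate the
% rotation R and translation t that minimize the reprojection error
\min_{R,\,t}\; \sum_i \bigl\| \pi\!\left(K \,(R X_i + t)\right) - x_i \bigr\|^2 ,
% where \pi denotes perspective division. End-to-end approaches differentiate
% through (a relaxation of) this argmin so pose errors can supervise the network.
```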
Making Vision Transformers Efficient from A Token Sparsification View
The quadratic computational complexity to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally suf…
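As a generic illustration of the token-pruning family the snippet refers to (not this paper's specific method), the sketch below keeps the top-k tokens by an importance score such as the attention a token receives from [CLS]; the scores and names here are hypothetical stand-ins.

```python
# Generic score-based token pruning: keep the top-k tokens by importance.
# Illustrates the family of pruning methods mentioned in the snippet,
# not this paper's approach.
import numpy as np

def prune_tokens(tokens: np.ndarray, scores: np.ndarray, keep: int):
    """tokens: (N, dim) patch tokens; scores: (N,) importance; keep: number kept."""
    top = np.argsort(scores)[-keep:]     # indices of the top-k scoring tokens
    kept_idx = np.sort(top)              # restore original token order
    return tokens[kept_idx], kept_idx

rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(196, 768))   # e.g. a 14x14 ViT patch grid
cls_attention = rng.random(196)              # stand-in importance scores
kept, idx = prune_tokens(patch_tokens, cls_attention, keep=98)
print(kept.shape)                            # (98, 768): half the tokens remain
```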
Revisit Parameter-Efficient Transfer Learning: A Two-Stage Paradigm
Parameter-Efficient Transfer Learning (PETL) aims at efficiently adapting large models pre-trained on massive data to downstream tasks with limited task-specific data. In view of the practicality of PETL, previous works focus on tuning a s…
A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion Recognition
Motion recognition is a promising direction in computer vision, but training video classification models is much harder than training image models due to insufficient data and the large number of parameters. To get around this, some works strive to explo…