Zhaofan Qiu
MotionPro: A Precise Motion Controller for Image-to-Video Generation
Animating images with interactive motion control has garnered popularity for image-to-video (I2V) generation. Modern approaches typically rely on large Gaussian kernels to extend motion trajectories as condition without explicitly defining…
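As a rough illustration of the kernel-based conditioning the abstract alludes to, the NumPy sketch below spreads a sparse drag trajectory into a dense 2D map by placing a large Gaussian at each control point; the function name, image size, and sigma are illustrative assumptions, not details taken from the paper.

    import numpy as np

    def gaussian_trajectory_map(points, height, width, sigma=20.0):
        """Spread sparse trajectory points into a dense 2D condition map
        by placing a large Gaussian at every (x, y) control point."""
        ys, xs = np.mgrid[0:height, 0:width]
        cond = np.zeros((height, width), dtype=np.float32)
        for (x, y) in points:
            cond += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        return np.clip(cond, 0.0, 1.0)

    # One drag trajectory sampled at a few control points.
    trajectory = [(64, 64), (80, 70), (96, 78), (112, 88)]
    condition = gaussian_trajectory_map(trajectory, height=256, width=256)
    print(condition.shape, condition.max())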
Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressi…
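The queue bookkeeping behind FIFO-style video diffusion can be sketched without a real model. The toy Python below only tracks per-frame noise levels (the denoiser is a stand-in); the frame names, queue length, and output count are assumptions for illustration, not the paper's setup.

    from collections import deque

    QUEUE_LEN = 4          # frames kept in the queue; the head is the least noisy
    NUM_OUTPUT_FRAMES = 8  # frames to emit for the "long" video

    def denoise_one_step(frame, level):
        """Stand-in for one diffusion denoising step: here we only track the
        noise level; a real model would also update the frame latents."""
        return frame, level - 1

    # Initialize the queue with frames at progressively increasing noise levels:
    # the head is almost clean, the tail is pure noise.
    queue = deque((f"frame_{i}", level) for i, level in enumerate(range(1, QUEUE_LEN + 1)))

    outputs, next_id = [], QUEUE_LEN
    while len(outputs) < NUM_OUTPUT_FRAMES:
        # Every frame in the queue takes one denoising step.
        queue = deque(denoise_one_step(f, l) for f, l in queue)
        # The head is now fully denoised: dequeue it and enqueue fresh noise.
        frame, level = queue.popleft()
        assert level == 0
        outputs.append(frame)
        queue.append((f"frame_{next_id}", QUEUE_LEN))
        next_id += 1

    print(outputs)  # frame_0 ... frame_7, produced one per step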
FreeEnhance: Tuning-Free Image Enhancement via Content-Consistent Noising-and-Denoising Process
The emergence of text-to-image generation models has led to the recognition that image enhancement, performed as post-processing, would significantly improve the visual quality of the generated images. Exploring diffusion models to enhance…
Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution
Diffusion models are just at a tipping point for the image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution, which necessitates not only the preservation of visual appearance fr…
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
Recent advances in text-to-video generation have demonstrated the utility of powerful diffusion models. Nevertheless, the problem is not trivial when shaping diffusion models to animate a static image (i.e., image-to-video generation). The d…
VideoStudio: Generating Consistent-Content and Multi-Scene Videos
The recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos for the given prompts. Most existing works tackle the single-scene scenario with only one video ev…
Selective Volume Mixup for Video Action Recognition
The recent advances in Convolutional Neural Networks (CNNs) and Vision Transformers have convincingly demonstrated high learning capability for video action recognition on large datasets. Nevertheless, deep models often suffer from the ove…
Dynamic Temporal Filtering in Video Models
Video temporal dynamics are conventionally modeled with a 3D spatial-temporal kernel or its factorized version comprised of a 2D spatial kernel and a 1D temporal kernel. The modeling power, nevertheless, is limited by the fixed window size and st…
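For reference, the factorized baseline this abstract describes (a 2D spatial kernel followed by a 1D temporal kernel with a fixed window) can be sketched in PyTorch as below; this is the conventional formulation the paper argues is limited, not the paper's dynamic filtering module, and the layer sizes are illustrative.

    import torch
    import torch.nn as nn

    class Factorized3DConv(nn.Module):
        """3D convolution factorized into a 2D spatial kernel followed by a
        1D temporal kernel with a fixed window size."""
        def __init__(self, in_ch, out_ch, spatial_k=3, temporal_k=3):
            super().__init__()
            self.spatial = nn.Conv3d(in_ch, out_ch,
                                     kernel_size=(1, spatial_k, spatial_k),
                                     padding=(0, spatial_k // 2, spatial_k // 2))
            self.temporal = nn.Conv3d(out_ch, out_ch,
                                      kernel_size=(temporal_k, 1, 1),
                                      padding=(temporal_k // 2, 0, 0))

        def forward(self, x):  # x: (batch, channels, time, height, width)
            return self.temporal(self.spatial(x))

    x = torch.randn(2, 16, 8, 32, 32)           # a small clip of 8 frames
    print(Factorized3DConv(16, 32)(x).shape)    # torch.Size([2, 32, 8, 32, 32])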
Explaining Cross-Domain Recognition with Interpretable Deep Classifier
Recent advances in deep learning predominantly construct models around their internal representations, and it remains opaque to explain the rationale behind their decisions to human users. Such explainability is especially essential for domain ad…
SPE-Net: Boosting Point Cloud Analysis via Rotation Robustness Enhancement
In this paper, we propose a novel deep architecture tailored for 3D point cloud applications, named SPE-Net. The embedded "Selective Position Encoding (SPE)" procedure relies on an attention mechanism that can effectively attend to th…
Lightweight and Progressively-Scalable Networks for Semantic Segmentation
Multi-scale learning frameworks have been regarded as a capable class of models to boost semantic segmentation. The problem is nevertheless not trivial, especially for real-world deployments, which often demand high efficiency in infere…
Bi-Calibration Networks for Weakly-Supervised Video Representation Learning
Leveraging large volumes of web videos paired with search queries or surrounding texts (e.g., titles) offers an economical and extensible alternative to supervised video representation learning. Nevertheless, modeling such weakly v…
Stand-Alone Inter-Frame Attention in Video Models
Motion, as what makes a video unique, has been critical to the development of video understanding models. Modern deep learning models leverage motion by either executing spatio-temporal 3D convolutions, factorizing 3D convolutions into spa…
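A minimal sketch of attention computed across neighboring frames, in the spirit of the title: queries come from the current frame and keys/values from the next frame, using plain scaled dot-product attention over spatial positions. The function and shapes are illustrative assumptions rather than the paper's exact formulation.

    import torch

    def inter_frame_attention(curr, nxt):
        """Each spatial position of the current frame attends to all spatial
        positions of the next frame.
        curr, nxt: (batch, channels, height, width) feature maps."""
        b, c, h, w = curr.shape
        q = curr.flatten(2).transpose(1, 2)            # (b, h*w, c)
        k = nxt.flatten(2).transpose(1, 2)             # (b, h*w, c)
        v = k
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        out = attn @ v                                 # (b, h*w, c)
        return out.transpose(1, 2).reshape(b, c, h, w)

    curr = torch.randn(2, 64, 14, 14)
    nxt = torch.randn(2, 64, 14, 14)
    print(inter_frame_attention(curr, nxt).shape)  # torch.Size([2, 64, 14, 14])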
Silver-Bullet-3D at ManiSkill 2021: Learning-from-Demonstrations and Heuristic Rule-based Methods for Object Manipulation
This paper presents an overview and comparative analysis of our systems designed for the following two tracks in the SAPIEN ManiSkill Challenge 2021. No Interaction Track: this track targets learning policies from pre-collect…
MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing
Convolutional Neural Networks (CNNs) have been regarded as the go-to models for visual recognition. More recently, convolution-free networks, based on multi-head self-attention (MSA) or multi-layer perceptrons (MLPs), are becoming more and more …
Representing Videos as Discriminative Sub-graphs for Action Recognition
Human actions are typically of combinatorial structures or patterns, i.e., subjects, objects, plus spatio-temporal interactions in between. Discovering such structures is therefore a rewarding way to reason about the dynamics of interactio…
Condensing a Sequence to One Informative Frame for Video Recognition
Video is complex due to large variations in motion and rich content in fine-grained visual details. Abstracting useful information from such information-intensive media requires exhaustive computing resources. This paper studies a two-step…
Boosting Video Representation Learning with Multi-Faceted Integration
Video content is multifaceted, consisting of objects, scenes, interactions or actions. The existing datasets mostly label only one of the facets for model training, resulting in video representations that are biased toward only one facet, depend…
Optimization Planning for 3D ConvNets
It is not trivial to optimally learn 3D Convolutional Neural Networks (3D ConvNets) due to the high complexity and various options of the training scheme. The most common hand-tuning process starts from learning 3D ConvNets using short video…
Motion-Focused Contrastive Learning of Video Representations
Motion, as the most distinct phenomenon in a video that involves changes over time, has been unique and critical to the development of video representation learning. In this paper, we ask the question: how important is the motion particul…
SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning
A steady momentum of innovations and breakthroughs has convincingly pushed the limits of unsupervised image representation learning. Compared to static 2D images, video has one more dimension (time). The inherent supervision existing in su…
Learning to Localize Actions from Moments
With the knowledge of action moments (i.e., trimmed video clips that each contains an action instance), humans could routinely localize an action temporally in an untrimmed video. Nevertheless, most practical methods still require all trai…
Transferring and Regularizing Prediction for Semantic Segmentation
Semantic segmentation often requires a large set of images with pixel-level annotations. In view of extremely expensive expert labeling, recent research has shown that the models trained on photo-realistic synthetic data (e.g., compute…
Long Short-Term Relation Networks for Video Action Detection
It has been well recognized that modeling human-object or object-object relations would be helpful for the detection task. Nevertheless, the problem is not trivial, especially when exploring the interactions between human actor, object and scen…
Scheduled Differentiable Architecture Search for Visual Recognition
Convolutional Neural Networks (CNN) have been regarded as a capable class of models for visual recognition problems. Nevertheless, it is not trivial to develop generic and powerful network architectures, which requires significant efforts …
Gaussian Temporal Awareness Networks for Action Localization
Temporally localizing actions in a video is a fundamental challenge in video understanding. Most existing approaches have often drawn inspiration from image object detection and extended the advances, e.g., SSD and Faster R-CNN, to produce…
Customizable Architecture Search for Semantic Segmentation
In this paper, we propose a Customizable Architecture Search (CAS) approach to automatically generate a network architecture for semantic image segmentation. The generated network consists of a sequence of stacked computation cells. A comp…