David Junhao Zhang
DD-Ranking: Rethinking the Evaluation of Dataset Distillation
In recent years, dataset distillation has provided a reliable solution for data compression, where models trained on the resulting smaller synthetic datasets achieve performance comparable to those trained on the original datasets. To furt…
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
Recently, breakthroughs in video modeling have allowed for controllable camera trajectories in generated videos. However, these methods cannot be directly applied to user-provided videos that are not generated by a video model. In this pap…
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outpu…
DragAnything: Motion Control for Anything using Entity Representation
We introduce DragAnything, which utilizes an entity representation to achieve motion control for any object in controllable video generation. Compared to existing motion control methods, DragAnything offers several advantages. Firstly,…
Towards A Better Metric for Text-to-Video Generation
Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. No…
Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions
Most existing video diffusion models (VDMs) are limited to mere text conditions. As a result, they usually lack control over the visual appearance and geometric structure of the generated videos. This work presents Moonshot, a new video g…
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
Current diffusion-based video editing primarily focuses on structure-preserved editing by utilizing various dense correspondences to ensure temporal consistency and motion alignment. However, these approaches are often ineffective when the…
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
Large-scale pre-trained diffusion models have exhibited remarkable capabilities in diverse video generations. Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffus…
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on la…
Dataset Condensation via Generative Model
Dataset condensation aims to condense a large dataset with many training samples into a small set. Previous methods usually condense the dataset into pixel format. However, this suffers from slow optimization speed and a large number …
Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks
Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection, and pose additional challenges due to concerns regarding data privacy. Recent…
Too Large; Data Reduction for Vision-Language Pre-Training
This paper examines the problems of severe image-text misalignment and high redundancy in the widely-used large-scale Vision-Language Pre-Training (VLP) datasets. To address these issues, we propose an efficient and straightforward Vision-…
Making Vision Transformers Efficient from A Token Sparsification View
The quadratic computational complexity with respect to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose pruning redundant tokens to achieve efficient ViTs. However, these methods generally suf…
Label-Efficient Online Continual Object Detection in Streaming Video
Humans can watch a continuous video stream and effortlessly perform continual acquisition and transfer of new knowledge with minimal supervision, while retaining previously learnt experiences. In contrast, existing continual learning (CL) met…
DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes
Modeling dynamic scenes is important for many applications such as virtual reality and telepresence. Despite achieving unprecedented fidelity for novel view synthesis in dynamic scenes, existing methods based on Neural Radiance Fields (NeR…
Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition
Learning spatial-temporal relations among multiple actors is crucial for group activity recognition. Different group activities often show diversified interactions between actors in the video. Hence, it is often difficult to model compl…
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning
Recently, MLP-Like networks have been revived for image recognition. However, whether it is possible to build a generic MLP-Like architecture in the video domain has not been explored, due to complex spatial-temporal modeling with large comput…
MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video
Self-attention has become an integral component of the recent network architectures, e.g., Transformer, that dominate major image and video benchmarks. This is because self-attention can flexibly model long-range information. For the same …