Anyi Rao
Dense Semantic Matching with VGGT Prior
Semantic matching aims to establish pixel-level correspondences between instances of the same category and represents a fundamental task in computer vision. Existing approaches suffer from two limitations: (i) Geometric Ambiguity: Their re…
AI for Creative Visual Content Generation, Editing and Understanding
Generative AI for Film Creation: A Survey of Recent Advances
Generative AI (GenAI) is transforming filmmaking, equipping artists with tools like text-to-image and image-to-video diffusion, neural radiance fields, avatar generation, and 3D synthesis. This paper examines the adoption of these technolo…
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to the excessive train…
Mindalogue: LLM-Powered Nonlinear Interaction for Effective Learning and Task Exploration
Current generative AI models like ChatGPT, Claude, and Gemini are widely used for knowledge dissemination, task decomposition, and creative thinking. However, their linear interaction methods often force users to repeatedly compare and cop…
ScriptViz: A Visualization Tool to Aid Scriptwriting based on a Large Movie Database
Scriptwriters usually rely on their mental visualization to create a vivid story by using their imagination to see, feel, and experience the scenes they are writing. Besides mental visualization, they often refer to existing images or scen…
CinePreGen: Camera Controllable Video Previsualization via Engine-powered Diffusion
With advancements in video generative AI models (e.g., SORA), creators are increasingly using these techniques to enhance video previsualization. However, they face challenges with incomplete and mismatched AI workflows. Existing methods m…
Cinematic Behavior Transfer via NeRF-based Differentiable Filming
In the evolving landscape of digital media and video production, the precise manipulation and reproduction of visual elements like camera movements and character actions are highly desired. Existing SLAM methods face limitations in dynamic…
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial …
Automated Conversion of Music Videos into Lyric Videos
Musicians and fans often produce lyric videos, a form of music videos that showcase the song's lyrics, for their favorite songs. However, making such videos can be challenging and time-consuming as the lyrics need to be added in synchro…
Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization
Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories. The key is to build the connection between visual and semantic space from seen to unseen classes. Previou…
HireVAE: An Online and Adaptive Factor Model Based on Hierarchical and Regime-Switch VAE
Factor model is a fundamental investment tool in quantitative investment, which can be empowered by deep learning to become more flexible and efficient in practical complicated investing situations. However, it is still an open question to…
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable c…
Self-Supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences
Self-supervised learning has demonstrated remarkable capability in representation learning for skeleton-based action recognition. Existing methods mainly focus on applying global data augmentation to generate different views of the skeleto…
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
Recent vision-language models have achieved tremendous advances. However, their computational costs are also escalating dramatically, making model acceleration exceedingly critical. To pursue more efficient vision-language Transformers, th…
Dynamic Storyboard Generation in an Engine-based Virtual Environment for Video Production
Amateurs working on mini-films and short-form videos usually spend lots of time and effort on the multi-round complicated process of setting and adjusting scenes, plots, and cameras to deliver satisfying video shots. We present Virtual Dyn…
Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows
The ability to choose an appropriate camera view among multiple cameras plays a vital role in TV shows delivery. But it is hard to figure out the statistical pattern and apply intelligent processing due to the lack of high-quality training…
A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language
Although artificial intelligence (AI) has made significant progress in understanding molecules in a wide range of fields, existing models generally acquire the single cognitive ability from the single molecular modality. Since the hierarch…
AutoGPart: Intermediate Supervision Search for Generalizable 3D Part Segmentation
Training a generalizable 3D part segmentation network is quite challenging but of great importance in real-world applications. To tackle this problem, some works design task-specific solutions by translating human understanding of the task…
BungeeNeRF: Progressive Neural Radiance Field for Extreme Multi-scale Scene Rendering
Neural radiance fields (NeRF) has achieved outstanding performance in modeling 3D objects and controlled scenes, usually under a single scale. In this work, we focus on multi-scale cases where large changes in imagery are observed at drast…
Online Multi-modal Person Search in Videos
The task of searching certain people in videos has seen increasing potential in real-world applications, such as video organization and editing. Most existing approaches are devised to work in an offline manner, where identities can only b…
A Unified Framework for Shot Type Classification Based on Subject Centric Lens
Shots are key narrative elements of various videos, e.g. movies, TV series, and user-generated videos that are thriving over the Internet. The types of shots greatly influence how the underlying ideas, emotions, and messages are expressed.…
MovieNet: A Holistic Dataset for Movie Understanding
Recent years have seen remarkable advances in visual understanding. However, how to understand a story-based long video with artistic styles, e.g. movie, remains challenging. In this paper, we introduce MovieNet -- a holistic dataset for m…
A Local-to-Global Approach to Multi-modal Movie Scene Segmentation
Scene, as the crucial unit of storytelling in movies, contains complex activities of actors and their interactions in a physical environment. Identifying the composition of scenes serves as a critical step towards semantic understanding of…
Automatic Music Accompanist
Automatic musical accompaniment is where a human musician is accompanied by a computer musician. The computer musician is able to produce musical accompaniment that relates musically to the human performance. The accompaniment should follo…
HotFlip: White-Box Adversarial Examples for Text Classification
We propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. Our method relies on an atomic flip …
HotFlip: White-Box Adversarial Examples for NLP
We propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. Our method relies on an atomic flip …