Bryan Russell
ResidualViT for Efficient Temporally Dense Video Encoding
Several video understanding tasks, such as natural language temporal video grounding, temporal activity localization, and audio description generation, require "temporally dense" reasoning over frames sampled at high temporal resolution. H…
EditDuet: A Multi-Agent System for Video Non-Linear Editing
Automated tools for video editing and assembly have applications ranging from filmmaking and advertisement to content creation for social media. Previous video editing work has mainly focused on either retrieval or user interfaces, leaving…
Improving Personalized Search with Regularized Low-Rank Parameter Updates
Personalized vision-language retrieval seeks to recognize new concepts (e.g. "my dog Fido") from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the …
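The "regularized low-rank parameter updates" of the title can be pictured with a LoRA-style sketch of adapting a frozen layer; the rank, initialization, and L2 penalty below are illustrative assumptions, not the paper's actual formulation.

import torch
import torch.nn as nn

class LowRankUpdate(nn.Module):
    """Adds a regularized low-rank residual B @ A to a frozen linear layer.

    Illustrative sketch only: the rank, scaling, and penalty are
    assumptions, not the configuration used in the paper.
    """

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen prediction plus a trainable low-rank correction
        return self.base(x) + x @ (self.B @ self.A).T

    def regularizer(self) -> torch.Tensor:
        # Penalize the update's magnitude so the few-shot concept does
        # not overwrite the model's general-purpose knowledge.
        return (self.B @ self.A).pow(2).sum()

Training would minimize the task loss plus a weighted regularizer term, so that only a small, rank-constrained correction to the pretrained weights encodes the new concept.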
Video-Guided Foley Sound Generation with Multimodal Controls
Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce MultiFoley, a model design…
Generative Timelines for Instructed Visual Assembly
The objective of this work is to manipulate visual timelines (e.g. a video) through natural language instructions, making complex timeline editing tasks accessible to non-expert or potentially even disabled users. We call this task Instruc…
Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval
In recent years, dual-encoder vision-language models (e.g. CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually result in very different retrievals for a pair of paraphr…
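The failure mode described above is easy to probe. A minimal sketch, assuming the open_clip package as a stand-in for any CLIP implementation: embed two paraphrased queries independently and compare their rankings over an image collection.

import torch
import open_clip  # assumption: using open_clip; any CLIP implementation works

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Two paraphrases of the same intent; ideally they retrieve the same images.
queries = ["a dog catching a frisbee", "a frisbee being caught by a dog"]

with torch.no_grad():
    q = model.encode_text(tokenizer(queries))
    q = q / q.norm(dim=-1, keepdim=True)

# image_feats: stands in for precomputed, L2-normalized CLIP embeddings of
# an image collection (random here purely for illustration).
image_feats = torch.randn(1000, q.shape[-1])
image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)

rank_a = (image_feats @ q[0]).argsort(descending=True)
rank_b = (image_feats @ q[1]).argsort(descending=True)
overlap = len(set(rank_a[:10].tolist()) & set(rank_b[:10].tolist()))
print(f"top-10 overlap between paraphrases: {overlap}/10")

With real embeddings, a low top-k overlap between paraphrases quantifies the inconsistency the paper sets out to fix.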
Koala: Key frame-conditioned long video-LLM
Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solutio…
NewMove: Customizing text-to-video models with novel motions
We introduce an approach for augmenting text-to-video generation models with customized motions, extending their capabilities beyond the motions depicted in the original training data. By leveraging a few video samples demonstrating specif…
FocalPose++: Focal Length and Object Pose Estimation via Render and Compare
We introduce FocalPose++, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object. The contributions of this work are threefold. Fi…
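Joint estimation is needed because focal length and object distance trade off against each other in a pinhole camera. The toy NumPy projection below (not the paper's renderer) illustrates that coupling: for a flat object the ambiguity is exact, and for general 3D objects it is near-exact.

import numpy as np

def project(points_3d, f, t_z):
    """Pinhole projection of object points translated to depth t_z,
    with focal length f (toy model; the paper uses render-and-compare)."""
    z = points_3d[:, 2] + t_z
    return f * points_3d[:, :2] / z[:, None]

# A small flat object: points near the origin with zero depth extent.
pts = np.array([[0.1, 0.0, 0.0], [-0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])

# A long focal length far away produces the same 2D projection as a short
# focal length up close, which is why pose and focal length must be
# estimated jointly rather than independently.
near = project(pts, f=500.0, t_z=2.0)
far = project(pts, f=1000.0, t_z=4.0)
print(np.allclose(near, far))  # True: the projections coincide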
Meta-Personalizing Vision-Language Models to Find Named Instances in Video
Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video wher…
Language-Guided Music Recommendation for Video via Prompt Analogies
We propose a method to recommend music for an input video while allowing a user to guide music selection with free-form natural language. A key challenge of this problem setting is that existing music video datasets provide the needed (vid…
YouTube8M-MusicTextClips
This page includes the YouTube8M-MusicTextClips dataset from our CVPR 2023 paper Language-Guided Music Recommendation for Video via Prompt Analogies, by Daniel McKee, Justin Salamon, Josef Sivic, and Bryan Russell…
Conditional Generation of Audio from Video via Foley Analogies
The sound effects that designers add to videos are meant to convey a particular artistic effect and thus may be quite different from a scene's true sound. Inspired by the challenges of creating a soundtrack for a video that differs fr…
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to ass…
Monocular Dynamic View Synthesis: A Reality Check
We study the recent progress on dynamic view synthesis (DVS) from monocular video. Though existing approaches have demonstrated impressive results, we show a discrepancy between the practical capture process and the existing experimental p…
Contrastive Feature Loss for Image Prediction
Training supervised image synthesis models requires a critic to compare two images: the ground truth to the result. Yet, this basic functionality remains an open problem. A popular line of approaches uses the L1 (mean absolute error) loss,…
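For contrast with the L1 baseline the abstract mentions, here is a minimal InfoNCE-style critic over spatial features. The feature source, temperature, and positive/negative structure are illustrative assumptions, not necessarily the paper's formulation.

import torch
import torch.nn.functional as F

def contrastive_feature_loss(pred_feats, gt_feats, tau: float = 0.07):
    """InfoNCE-style critic over spatial features (illustrative sketch).

    pred_feats, gt_feats: [B, C, H, W] feature maps of the predicted and
    ground-truth images. Each predicted location should match the
    ground-truth feature at the same location (positive) and not the
    features at other locations (negatives).
    """
    B, C, H, W = pred_feats.shape
    p = F.normalize(pred_feats.flatten(2).transpose(1, 2), dim=-1)  # [B, HW, C]
    g = F.normalize(gt_feats.flatten(2).transpose(1, 2), dim=-1)    # [B, HW, C]
    logits = p @ g.transpose(1, 2) / tau                            # [B, HW, HW]
    target = torch.arange(H * W, device=logits.device).expand(B, -1)
    return F.cross_entropy(logits.reshape(B * H * W, H * W),
                           target.reshape(-1))

# The L1 baseline this replaces would simply be:
# loss = F.l1_loss(pred_image, gt_image)

Unlike the per-pixel L1 penalty, this critic only asks that a predicted feature be closer to its own ground-truth feature than to features elsewhere in the image, which is more tolerant of plausible low-level variation.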
It's Time for Artistic Correspondence in Music and Video
We present an approach for recommending a music track for a given video, and vice versa, based on both their temporal alignment and their correspondence at an artistic level. We propose a self-supervised approach that learns this correspon…
Neural Volumetric Object Selection
We introduce an approach for selecting objects in neural volumetric 3D representations, such as multi-plane images (MPI) and neural radiance fields (NeRF). Our approach takes a set of foreground and background 2D user scribbles in one view…
Focal Length and Object Pose Estimation via Render and Compare
We introduce FocalPose, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object. The contributions of this work are twofold. First,…
Code available at http://github.com/ponimatkin/focalpose
Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos
We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed n…
Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions
We introduce the task of weakly supervised learning for detecting human and object interactions in videos. Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the act…
Editing Conditional Radiance Fields
A neural radiance field (NeRF) is a scene model supporting high-quality view synthesis, optimized per scene. In this thesis, we explore enabling user editing of a category-level NeRF – also known as a conditional radiance field – trained o…
3D Reconstruction By Parameterized Surface Mapping
Editing Conditional Radiance Fields
A neural radiance field (NeRF) is a scene model supporting high-quality view synthesis, optimized per scene. In this paper, we explore enabling user editing of a category-level NeRF - also known as a conditional radiance field - trained on…
Contact and Human Dynamics from Monocular Video
Existing deep models predict 2D and 3D kinematic poses from video that are approximately accurate, but contain visible errors that violate physical constraints, such as feet penetrating the ground and bodies leaning at extreme angles. In t…
Telling Left from Right: Learning Spatial Correspondence of Sight and Sound
Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the se…
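One pretext task consistent with the title, sketched here as an assumption rather than the paper's exact training setup: randomly swap the stereo channels and train a model that also sees the video to predict whether audio and video still agree on left versus right.

import torch

def make_flip_batch(stereo_audio: torch.Tensor):
    """Create a self-supervised batch by randomly swapping stereo channels.

    stereo_audio: [B, 2, T] waveforms aligned with video clips.
    Returns the (possibly flipped) audio and a binary label per clip:
    1 if the left/right channels were swapped, else 0. A model that sees
    the paired video frames can only solve this task by localizing the
    sound source spatially (illustrative pretext task, not necessarily
    the paper's exact one).
    """
    B = stereo_audio.shape[0]
    labels = torch.randint(0, 2, (B,))
    flipped = stereo_audio.clone()
    swap = labels.bool()
    flipped[swap] = stereo_audio[swap][:, [1, 0], :]  # swap L and R channels
    return flipped, labels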