Cees G. M. Snoek
Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs
While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that…
Elastic ViTs from Pretrained Models without Retraining
Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approxi…
Visual Odometry with Transformers
Despite the rapid development of large 3D models, classical optimization-based approaches dominate the field of visual odometry (VO). Thus, current approaches to VO heavily rely on camera parameters and many handcrafted components, most of…
Purrception: Variational Flow Matching for Vector-Quantized Image Generation
We introduce Purrception, a variational flow matching approach for vector-quantized image generation that provides explicit categorical supervision while maintaining continuous transport dynamics. Our method adapts Variational Flow Matchin…
NeoBabel: A Multilingual Open Tower for Visual Generation
Text-to-image generation advancements have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. While existing systems rely on translation pipelines, these introduce semantic d…
Segment Any 3D-Part in a Scene from a Sentence
This paper aims to achieve the segmentation of any 3D part in a scene based on natural language descriptions, extending beyond traditional object-level 3D scene understanding and addressing both data and methodological challenges. Due to t…
Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detection
Anomaly Detection involves identifying deviations from normal data distributions and is critical in fields such as medical diagnostics and industrial defect detection. Traditional AD methods typically require the availability of normal tra…
SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning
Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by removing the need for manual annotations. Despite strong performa…
Association Between Social Distancing Compliance and Public Place Crowding During the COVID-19 Pandemic: Cross-Sectional Observational Study Using Computer Vision to Analyze Surveillance Footage
Background: Social distancing behavior has been a critical nonpharmaceutical measure for mitigating the COVID-19 pandemic. For this reason, there has been widespread interest in the factors determining social distancing violations, with a p…
GeneralizeFormer: Layer-Adaptive Model Generation across Test-Time Distribution Shifts
We consider the problem of test-time domain generalization, where a model is trained on several source domains and adjusted on target domains never seen during training. Different from the common methods that fine-tune the model or adjust …
Geometric Neural Process Fields
This paper addresses the challenge of Neural Field (NeF) generalization, where models must efficiently adapt to new signals given only a few observations. To tackle this, we propose Geometric Neural Process Fields (G-NPF), a probabilistic …
DynaPrompt: Dynamic Test-Time Prompt Tuning
Test-time prompt tuning enhances zero-shot generalization of vision-language models but tends to ignore the relatedness among test samples during inference. Online test-time prompt tuning provides a simple way to leverage the information i…
Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning
This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they lear…
Redefining Normal: A Novel Object-Level Approach for Multi-Object Novelty Detection
In the realm of novelty detection, accurately identifying outliers in data without specific class information poses a significant challenge. While current methods excel in single-object scenarios, they struggle with multi-object situations…
One Hundred Neural Networks and Brains Watching Videos: Lessons from Alignment
What can we learn from comparing video models to human brains, arguably the most efficient and effective video processing systems in existence? Our work takes a step towards answering this question by performing the first large-scale bench…
The Sound of Water: Inferring Physical Properties from Pouring Liquids
We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids. Given only the sound of liquid pouring into a container, our objective is to automatically…
CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation
In this work, we address the cooperation problem among large language model (LLM) based embodied agents, where agents must cooperate to achieve a common goal. Previous methods often execute actions extemporaneously and incoherently, withou…
Day2Dark: Pseudo-Supervised Activity Recognition Beyond Silent Daylight
This paper strives to recognize activities in the dark, as well as in the day. We first establish that state-of-the-art activity recognizers are effective during the day, but not trustworthy in the dark. The main causes are the limited ava…
Beyond Model Adaptation at Test Time: A Survey
Machine learning algorithms have achieved remarkable success across various disciplines, use cases and applications, under the prevailing assumption that training and test samples are drawn from the same distribution. Consequently, these a…
Prompt Diffusion Robustifies Any-Modality Prompt Learning
Foundation models enable prompt-based classifiers for zero-shot and few-shot learning. Nonetheless, the conventional method of employing fixed prompts suffers from distributional shifts that negatively impact generalizability to unseen sam…
IPO: Interpretable Prompt Optimization for Vision-Language Models
Pre-trained vision-language models like CLIP have remarkably adapted to various downstream tasks. Nonetheless, their performance heavily depends on the specificity of the input text prompts, which requires skillful prompt template engineer…
Beyond Coarse-Grained Matching in Video-Text Retrieval
Video-text retrieval has seen significant advancements, yet the ability of models to discern subtle differences in captions still requires verification. In this paper, we introduce a new approach for fine-grained evaluation. Our approach c…
TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning
Spatial awareness is key to enable embodied multimodal AI systems. Yet, without vast amounts of spatial supervision, current Multimodal Large Language Models (MLLMs) struggle at this task. In this paper, we introduce TWIST & SCOUT, a frame…
TULIP: Token-length Upgraded CLIP
We address the challenge of representing long captions in vision-language models, such as CLIP. By design these models are limited by fixed, absolute positional encodings, restricting inputs to a maximum of 77 tokens and hindering performa…
Lost in Time: A New Temporal Benchmark for VideoLLMs
Large language models have demonstrated impressive performance when integrated with vision models even enabling video understanding. However, evaluating video models presents its own unique challenges, for which several benchmarks have bee…
SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery
In this paper, we address Generalized Category Discovery, aiming to simultaneously uncover novel categories and accurately classify known ones. Traditional methods, which lean heavily on self-supervision and contrastive learning, often fal…
GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features
In the domain of anomaly detection, methods often excel in either high-level semantic or low-level industrial benchmarks, rarely achieving cross-domain proficiency. Semantic anomalies are novelties that differ in meaning from the training …
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias of locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can…
Object-zoomed training of convolutional neural networks inspired by toddler development improves shape bias
Convolutional Neural Networks (CNNs) surpass human-level performance on visual object recognition and detection, but their behavior still differs from human behavior in important ways. One prominent example is that CNNs trained on ImageNet…
Training-Free Semantic Segmentation via LLM-Supervision
Recent advancements in open vocabulary models, like CLIP, have notably advanced zero-shot classification and segmentation by utilizing natural language for class-specific embeddings. However, most research has focused on improving model ac…