Matthieu Cord
SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization
Tokenizers are a key component of state-of-the-art generative image models, extracting the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variationa…
IPA: An Information-Reconstructive Input Projection Framework for Efficient Foundation Model Adaptation
Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, reduce adaptation cost by injecting low-rank updates into pretrained weights. However, LoRA's down-projection is randomly initialized and data-agnostic, discarding potentially u…
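The low-rank update mechanism this abstract refers to can be sketched in a few lines. This is an illustrative, generic LoRA-style sketch under assumed shapes, not the paper's IPA initialization: a frozen weight W is augmented with a trainable low-rank product B @ A, with B zero-initialized so the adapted model starts identical to the pretrained one.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 8, 8, 2
W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # down-projection (trainable; randomly initialized in vanilla LoRA)
B = np.zeros((d_out, rank))                   # up-projection, zero-init so the update starts at zero

def adapted_forward(x):
    # Adapted layer: x @ (W + B A)^T; only A and B would be trained.
    return x @ (W + B @ A).T

x = rng.standard_normal((1, d_in))
# With B zero-initialized, the adapted output equals the frozen output.
assert np.allclose(adapted_forward(x), x @ W.T)
```

The random, data-agnostic initialization of A is exactly the design choice the abstract criticizes.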
Learning to Steer: Input-dependent Steering for Multimodal LLMs
Steering has emerged as a practical approach to enable post-hoc guidance of LLMs towards enforcing a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such …
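A common form of the steering this abstract builds on is activation steering: add a direction to the model's hidden states at inference time. The toy sketch below (hypothetical names, mean-difference vector, fixed scale; not the paper's input-dependent method) shows the basic mechanic.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # hidden-state dimension (toy size)

# Hidden states collected under the desired behavior vs. otherwise.
pos = rng.standard_normal((32, d)) + 1.0
neg = rng.standard_normal((32, d))

# Steering vector: mean difference between the two activation sets.
steer = pos.mean(axis=0) - neg.mean(axis=0)

def apply_steering(h, alpha=1.0):
    # Add the scaled steering direction to every token's hidden state.
    return h + alpha * steer

h = rng.standard_normal((4, d))          # hidden states for 4 tokens
h_steered = apply_steering(h, alpha=0.5)
```

A fixed `steer` and `alpha` for all inputs is exactly the input-independence the paper identifies as a limitation.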
FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models
Foundation models have exhibited unprecedented capabilities in tackling many domains and tasks. Models such as CLIP are currently widely used to bridge cross-modal representations, and text-to-image diffusion models are arguably the leadin…
JAFAR: Jack up Any Feature at Any Resolution
Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream…
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches ad…
Scaling Laws for Native Multimodal Models
Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders …
GaussRender: Learning 3D Occupancy with Gaussian Rendering
Understanding the 3D geometry and semantics of driving scenes is critical for safe autonomous driving. Recent advances in 3D occupancy prediction have improved scene representation but often suffer from spatial inconsistencies, leading to …
Towards Generalizable Trajectory Prediction Using Dual-Level Representation Learning And Adaptive Prompting
Existing vehicle trajectory prediction models struggle with generalizability, prediction uncertainties, and handling complex interactions. This is often due to limitations such as complex architectures customized for a specific dataset and inef…
Analyzing Finetuning Representation Shift for Multimodal LLMs Steering
Multimodal LLMs (MLLMs) have reached remarkable levels of proficiency in understanding multimodal inputs. However, understanding and interpreting the behavior of such complex models is a challenging task, not to mention the dynamic shifts …
PPT: Pretraining with Pseudo-Labeled Trajectories for Motion Forecasting
Accurately predicting how agents move in dynamic scenes is essential for safe autonomous driving. State-of-the-art motion forecasting models rely on large curated datasets with manually annotated or heavily post-processed trajectories. How…
GIFT: A Framework for Global Interpretable Faithful Textual Explanations of Vision Classifiers
Understanding deep models is crucial for deploying them in safety-critical applications. We introduce GIFT, a framework for deriving post-hoc, global, interpretable, and faithful textual explanations for vision classifiers. GIFT starts fro…
OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models
We consider the problem of text-to-video generation with precise control for various applications such as camera movement control and video-to-video editing. Most methods tackling this problem rely on providing user-defined controls, …
Skipping Computations in Multimodal LLMs
Large Language Models (LLMs) have demonstrated remarkable success in both textual and multimodal domains. However, this success often comes with substantial computational costs, particularly when handling lengthy sequences of multimodal in…
LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension
Vision Language Models (VLMs) have demonstrated remarkable capabilities in various open-vocabulary tasks, yet their zero-shot performance lags behind task-specific fine-tuned models, particularly in complex tasks like Referring Expression …
Annealed Winner-Takes-All for Motion Forecasting
In autonomous driving, motion prediction aims at forecasting the future trajectories of nearby agents, helping the ego vehicle to anticipate behaviors and drive safely. A key challenge is generating a diverse set of future predictions, com…
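The multi-hypothesis setting this abstract describes is typically trained with a winner-takes-all (WTA) loss: of K candidate trajectories, only the one closest to the ground truth receives the loss, which lets the others stay diverse. A toy sketch (illustrative shapes; the paper's annealing schedule is not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 6, 12                              # hypotheses, timesteps (toy sizes)
preds = rng.standard_normal((K, T, 2))    # K candidate 2D trajectories
gt = rng.standard_normal((T, 2))          # ground-truth trajectory

# Average displacement error (ADE) of each hypothesis vs. the ground truth.
ade = np.linalg.norm(preds - gt, axis=-1).mean(axis=-1)  # shape (K,)

# Winner-takes-all: only the best hypothesis contributes to the loss.
winner = int(ade.argmin())
wta_loss = float(ade[winner])
```

Annealing softens this hard argmin early in training so that all hypotheses receive gradient before the winner dominates.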
ReGentS: Real-World Safety-Critical Driving Scenario Generation Made Stable
Machine learning based autonomous driving systems often face challenges with safety-critical scenarios that are rare in real-world data, hindering their large-scale deployment. While increasing real-world training data coverage could addre…
Valeo4Cast: A Modular Approach to End-to-End Forecasting
Motion forecasting is crucial in autonomous driving systems to anticipate the future trajectories of surrounding agents such as pedestrians, vehicles, and traffic signals. In end-to-end forecasting, the model must jointly detect and track …
A Concept-Based Explainability Framework for Large Multimodal Models
Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs…
DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut
Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks. While prior works have addressed unsupervised image segmentation, they significantly lag behind supervised models. In…
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks, without any multimodal finetuning. They are the building block for Large Multimodal Models, yet we still lack a proper understanding of their succe…
Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive?
What matters when building vision-language models?
The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the d…
What Makes Multimodal In-Context Learning Work?
Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we pre…
Mind-to-Image: Projecting Visual Mental Imagination of the Brain from fMRI
The reconstruction of images observed by subjects from fMRI data collected during visual stimuli has made strong progress in the past decade, thanks to the availability of extensive fMRI datasets and advancements in generative models for i…
UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction
Vehicle trajectory prediction has increasingly relied on data-driven solutions, but their ability to scale to different data domains and the impact of larger dataset sizes on their generalization remain under-explored. While these question…
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
The abilities of large language models (LLMs) have recently progressed to unprecedented levels, paving the way to novel applications in a wide variety of areas. In computer vision, LLMs can be used to prime vision-language tasks such as image…
GradPaint: Gradient-guided inpainting with diffusion models
Denoising Diffusion Probabilistic Models (DDPMs) have recently achieved remarkable results in conditional and unconditional image generation. The pre-trained models can be adapted without further training to different downstream tasks, by …
Manipulating Trajectory Prediction with Backdoors
Autonomous vehicles ought to predict the surrounding agents' trajectories to allow safe maneuvers in uncertain and complex traffic situations. As companies increasingly apply trajectory prediction in the real world, security becomes a rele…