Florian Bordes
What's in Common? Multimodal Models Hallucinate When Reasoning Across Scenes
Multimodal language models possess a remarkable ability to handle an open-vocabulary's worth of objects. Yet the best models still suffer from hallucinations when reasoning about scenes in the real world, revealing a gap between their seem…
Object-centric Binding in Contrastive Language-Image Pretraining
Recent advances in vision-language models (VLMs) have been driven by contrastive models such as CLIP, which learn to associate visual information with its corresponding text descriptions. However, these models have limitations in understa…
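The contrastive objective CLIP-style models train with can be sketched in a few lines: paired image and text embeddings are pulled together while mismatched pairs in the batch are pushed apart via a symmetric cross-entropy over cosine similarities. This is a minimal numpy illustration of the standard InfoNCE-style loss, not the implementation from any of the papers above; the function name and temperature value are illustrative.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb are assumed to be a matching pair.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity logits; diagonal entries correspond to matched pairs.
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]

    def cross_entropy_diag(l):
        # Numerically stable log-softmax along rows; targets are the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)) / 2
```

With perfectly aligned pairs (identical, orthogonal embeddings) the loss approaches zero; for random embeddings it stays near log of the batch size.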
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge, constrained by the LLM's context size. To address this limitat…
An Introduction to Vision-Language Modeling
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models t…
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest-quality available curated captions are far too short to capture the rich visual detail in an image. To show the …
Feedback-guided Data Synthesis for Imbalanced Classification
The current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With the recent advances in generative models, researchers have started augmenting these static …
PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning
Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many data samples as needed, (ii) precisely control each scene and yield granular ground truth la…
Stochastic positional embeddings improve masked image modeling
Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images. Despite its recent success, learning good representations through MIM remains challenging because it requires predict…
Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in Self-supervised Learning
Self-supervised learning (SSL) algorithms can produce useful image representations by learning to associate different parts of natural images with one another. However, when taken to the extreme, SSL models can unintentionally memorize specif…
Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations
Joint-embedding based learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their…
A Cookbook of Self-Supervised Learning
Self-supervised learning, dubbed the dark matter of intelligence, is a promising path to advance machine learning. Yet, much like cooking, training SSL methods is a delicate art with a high barrier to entry. While many components are famil…
A surprisingly simple technique to control the pretraining bias for better transfer: Expand or Narrow your representation
Self-Supervised Learning (SSL) models rely on a pretext task to learn representations. Because this pretext task differs from the downstream tasks used to evaluate the performance of these models, there is an inherent misalignment or pretr…
Towards Democratizing Joint-Embedding Self-Supervised Learning
Joint Embedding Self-Supervised Learning (JE-SSL) has seen rapid developments in recent years, due to its promise to effectively leverage large unlabeled data. The development of JE-SSL methods was driven primarily by the search for ever i…
The Hidden Uniform Cluster Prior in Self-Supervised Learning
A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that in the formulation of all these methods is an overlooked …
Guillotine Regularization: Why removing layers is needed to improve generalization in Self-Supervised Learning
One unexpected technique that emerged in recent years consists of training a Deep Network (DN) with a Self-Supervised Learning (SSL) method, and using this network on downstream tasks but with its last few projector layers entirely removed…
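The pattern the abstract describes — pretrain with a projector head, then discard it and transfer the backbone features — can be sketched as follows. This is a toy numpy illustration of the workflow only, with random linear maps standing in for trained networks; the `Backbone` and `Projector` classes and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class Backbone:
    """Stand-in encoder: a random linear map plus ReLU (hypothetical)."""
    def __init__(self, d_in, d_feat):
        self.W = rng.standard_normal((d_in, d_feat)) / np.sqrt(d_in)
    def __call__(self, x):
        return np.maximum(x @ self.W, 0.0)

class Projector:
    """SSL projector head, used only while computing the pretraining loss."""
    def __init__(self, d_feat, d_proj):
        self.W = rng.standard_normal((d_feat, d_proj)) / np.sqrt(d_feat)
    def __call__(self, h):
        return h @ self.W

backbone, projector = Backbone(32, 64), Projector(64, 16)
x = rng.standard_normal((4, 32))

# During pretraining, the SSL objective sees the projected embeddings.
z_pretrain = projector(backbone(x))
# For downstream transfer, the projector is "guillotined": only backbone
# features are kept and fed to a linear probe or fine-tuned head.
h_transfer = backbone(x)
```

The point of the paper is precisely why this removal helps; the sketch just makes the two feature spaces (projected vs. backbone) concrete.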
Masked Siamese Networks for Label-Efficient Learning
We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the ori…
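The core idea — match the representation of a randomly masked view to that of the full view — can be illustrated with a toy numpy sketch. This is not the MSN implementation (which matches views via soft cluster prototypes, not direct cosine similarity); the mean-pooling "encoder" and all shapes here are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches, mask_ratio=0.5):
    """Randomly drop a fraction of patch tokens to form the masked view."""
    n = patches.shape[0]
    keep = rng.permutation(n)[: int(n * (1 - mask_ratio))]
    return patches[keep]

def represent(patches):
    # Toy encoder: mean-pool patch features (stand-in for a ViT encoder).
    return patches.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

patches = rng.standard_normal((16, 8))     # 16 patch tokens, 8-dim features
anchor = represent(mask_patches(patches))  # representation of the masked view
target = represent(patches)                # representation of the full view
# The training signal encourages these two representations to agree.
similarity = cosine(anchor, target)
```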
High Fidelity Visualization of What Your Self-Supervised Representation Knows About
Discovering what is learned by neural networks remains a challenge. In self-supervised learning, classification is the most common task used to evaluate how good a representation is. However, relying only on such a downstream task can limit …
Learning to sample from noise with deep generative models
Machine learning, and especially deep learning, has established itself in recent years as a way to solve a wide variety of tasks. One of the most remarkable applications concerns computer vision. The systems …
Learning to Generate Samples from Noise through Infusion Training
In this work, we investigate a novel training procedure to learn a generative model as the transition operator of a Markov chain, such that, when applied repeatedly on an unstructured random noise sample, it will denoise it into a sample t…
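The mechanism the abstract describes — a learned transition operator that, applied repeatedly to unstructured noise, denoises it toward a data sample — can be shown with a toy one-dimensional analogue. This is only an illustration of repeated transition-operator sampling, not Infusion Training itself; the hand-written contraction toward a fixed `target` stands in for a trained denoising network.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([2.0, -1.0, 0.5])  # hypothetical data point the chain should reach

def transition(x, step=0.3, noise=0.05):
    """Toy transition operator of the Markov chain: contract the state
    toward the target with a little injected noise (stand-in for a
    learned denoising step)."""
    return x + step * (target - x) + noise * rng.standard_normal(x.shape)

# Start from unstructured random noise and apply the operator repeatedly;
# the chain drifts toward (a noisy neighborhood of) the target.
x = rng.standard_normal(3)
for _ in range(50):
    x = transition(x)
```

After enough steps the state fluctuates in a small neighborhood of the target, which is the qualitative behavior the paper trains a network to produce on real data.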