Maxime Oquab
Cluster and Predict Latent Patches for Improved Masked Image Modeling
Masked Image Modeling (MIM) offers a promising approach to self-supervised representation learning; however, existing MIM models still lag behind the state of the art. In this paper, we systematically analyze target representations, loss fu…
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
Self-supervised visual foundation models produce powerful embeddings that achieve remarkable performance on a wide range of downstream tasks. However, unlike vision-language models such as CLIP, self-supervised visual features are not read…
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation require extensive human effort. This manual process has some limi…
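As a rough illustration of the clustering-based curation idea in the title, a hedged sketch: cluster the embedding pool with k-means and sample uniformly across clusters, so frequent concepts do not crowd out rare ones. The function name and budget parameters are illustrative assumptions, not the paper's exact algorithm (which applies the idea hierarchically and at far larger scale):

```python
import numpy as np
from sklearn.cluster import KMeans

def balanced_subset(embeddings, k=100, per_cluster=10, seed=0):
    """Cluster the pool, then sample uniformly across clusters so that
    dominant concepts do not swamp rare ones. k and per_cluster are
    illustrative budgets, not the paper's settings."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        keep.extend(rng.choice(members, min(per_cluster, len(members)), replace=False))
    return np.asarray(keep)  # indices of the curated subset
```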
Advancing human-centric AI for robust X-ray analysis through holistic self-supervised learning
AI Foundation models are gaining traction in various applications, including medical fields like radiology. However, medical foundation models are often tested on limited tasks, leaving their generalisability and biases unexplored. We pres…
You Don’t Need Domain-Specific Data Augmentations When Scaling Self-Supervised Learning
Self-supervised learning (SSL) with joint-embedding architectures (JEA) has led to outstanding performance. All instantiations of this paradigm were trained using strong and well-established hand-crafted data augmentations, leading to the…
Vision Transformers Need Registers
Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond …
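The fix the title refers to is to give the model dedicated slots for such global computation: a few extra learnable "register" tokens appended to the input sequence and discarded at the output. A minimal PyTorch sketch, where the backbone, sizes, and token handling are illustrative assumptions rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Append learnable "register" tokens to the patch sequence so the
    model can store global computations outside the patch tokens, then
    discard the registers at the output."""
    def __init__(self, dim=384, n_patches=196, n_registers=4, depth=4):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.registers = nn.Parameter(torch.zeros(1, n_registers, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        block = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)
        self.n_registers = n_registers

    def forward(self, patches):  # patches: (B, n_patches, dim)
        B = patches.shape[0]
        x = torch.cat([self.cls.expand(B, -1, -1), patches], dim=1) + self.pos
        # Registers get no positional embedding: they are not tied to a location.
        x = torch.cat([x, self.registers.expand(B, -1, -1)], dim=1)
        x = self.encoder(x)[:, : -self.n_registers]  # drop registers at output
        return x[:, 0], x[:, 1:]  # [CLS] embedding, cleaner patch tokens
```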
Dimensionality and Ramping: Signatures of Sentence Integration in the Dynamics of Brains and Deep Language Models
A sentence is more than the sum of its words: its meaning depends on how they combine with one another. The brain mechanisms underlying such semantic composition remain poorly understood. To shed light on the neural vector code underlying …
DINOv2: Learning Robust Visual Features without Supervision
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any sy…
Co-training $2^L$ Submodels for Visual Recognition
We introduce submodel co-training, a regularization method related to co-training, self-distillation and stochastic depth. Given a neural network to be trained, for each sample we implicitly instantiate two altered networks, "submodels",…
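A hedged sketch of one training step in this spirit, assuming the model uses stochastic depth so that each forward pass in train mode samples a different submodel; the loss weighting and the symmetric KL term are illustrative assumptions:

```python
import torch.nn.functional as F

def cosub_loss(model, x, y, lam=0.5):
    """Two forward passes through a stochastic-depth network sample two
    different "submodels"; each is trained on the labels and on the
    other's (detached) predictions."""
    logits_a = model(x)  # first random submodel (train mode)
    logits_b = model(x)  # second random submodel, independent drop mask
    ce = F.cross_entropy(logits_a, y) + F.cross_entropy(logits_b, y)
    # Mutual distillation: each submodel matches the other's softmax.
    kl = (F.kl_div(F.log_softmax(logits_a, -1),
                   F.softmax(logits_b, -1).detach(), reduction="batchmean") +
          F.kl_div(F.log_softmax(logits_b, -1),
                   F.softmax(logits_a, -1).detach(), reduction="batchmean"))
    return (1 - lam) * ce + lam * kl
```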
Efficient conditioned face animation using frontally-viewed embedding
As the quality of few-shot facial animation from landmarks increases, new applications become possible, such as ultra-low-bandwidth video chat compression with a high degree of realism. However, there are some important challenges to tackl…
Self-appearance-aided Differential Evolution for Motion Transfer
Image animation transfers the motion of a driving video to a static object in a source image, while keeping the source identity unchanged. Great progress has been made in unsupervised motion transfer recently, where no labelled data or gro…
Low Bandwidth Video-Chat Compression using Deep Generative Models
To unlock video chat for hundreds of millions of people hindered by poor connectivity or unaffordable data costs, we propose to authentically reconstruct faces on the receiver's device using facial landmarks extracted at the sender's side …
Can RNNs learn Recursive Nested Subject-Verb Agreements?
One of the fundamental principles of contemporary linguistics states that language processing requires the ability to extract recursively nested tree structures. However, it remains unclear whether and how this code could be implemented in…
Discriminating the Influence of Correlated Factors from Multivariate Observations: the Back-to-Back Regression
Identifying causes solely from observations can be particularly challenging when i) potential factors are difficult to manipulate independently and ii) observations are multi-dimensional. To address this issue, we introduce “Back-to-Back” …
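In outline, back-to-back regression runs two regressions in sequence: first decode the putative factors from the observations on one half of the data, then regress those decoded estimates on the true factors on the other half; the diagonal of the second regression isolates each factor's own influence. A minimal scikit-learn sketch, in which the split, the regularization grid, and the toy demo are assumptions:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def b2b(X, Y, seed=0):
    """Back-to-back regression sketch. X: (n, f) candidate factors,
    Y: (n, c) multivariate observations. Returns one influence score
    per factor; a score near 0 suggests no unique effect on Y."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    a, b = np.array_split(idx, 2)
    alphas = np.logspace(-3, 3, 13)
    # Step 1 (decoding): predict each factor from all observations.
    G = RidgeCV(alphas=alphas).fit(Y[a], X[a])
    # Step 2 (back-regression): regress the decoded factors on the true
    # factors; off-diagonal terms absorb shared (correlated) variance,
    # so the diagonal isolates each factor's own contribution.
    H = RidgeCV(alphas=alphas).fit(X[b], G.predict(Y[b]))
    return np.diag(H.coef_)

# Toy demo: three correlated factors, the third has no effect on Y.
n, f, c = 2000, 3, 20
rng = np.random.default_rng(1)
X = rng.normal(size=(n, f)) @ np.array([[1, .8, .8], [0, .6, 0], [0, 0, .6]]).T
Y = X @ np.diag([1., 1., 0.]) @ rng.normal(size=(f, c)) + rng.normal(size=(n, c))
print(b2b(X, Y))  # expect the last score to be near 0
```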
Learning about an exponential amount of conditional distributions
We introduce the Neural Conditioner (NC), a self-supervised machine able to learn about all the conditional distributions of a random vector $X$. The NC is a function $NC(x \cdot a, a, r)$ that leverages adversarial training to match each …
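A hedged sketch of just the NC interface from this description: the network sees the observed coordinates $x \cdot a$ together with the masks $a$ and $r$ plus noise, and fills in the requested coordinates. The MLP architecture is an assumption, and the adversarial training loop that matches the generated conditionals to the data is omitted:

```python
import torch
import torch.nn as nn

class NeuralConditioner(nn.Module):
    """Given observed values x*a, availability mask a, and requested
    mask r, generate values for the requested coordinates from noise."""
    def __init__(self, d, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4 * d, hidden), nn.ReLU(),  # input: [x*a, a, r, noise]
            nn.Linear(hidden, d))

    def forward(self, x, a, r):
        z = torch.randn_like(x)  # latent noise -> stochastic completions
        fill = self.net(torch.cat([x * a, a, r, z], dim=-1))
        return x * a + fill * r  # keep observed values, fill requested ones
```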
Convolutional neural networks: towards less supervision for visual recognition
Convolutional Neural Networks are flexible learning algorithms for computer vision that scale particularly well with the amount of data that is provided for training them. Although these methods had successful applications already in the ’…
Learning and transferring mid-level image representations using convolutional neural networks
Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large-scale visual recognition challenge (ILSVRC2012). The success of CNNs is attributed to their ability to learn rich mid-level …
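The transfer recipe this line of work describes, in modern PyTorch terms: reuse layers pre-trained on ImageNet as a mid-level representation and train only new adaptation layers on the target task. The torchvision backbone and the 20-class head (e.g. for PASCAL VOC) are stand-ins for the paper's original AlexNet-style setup:

```python
import torch.nn as nn
from torchvision import models

# Reuse ImageNet-pre-trained layers as a fixed mid-level representation
# and train only a new task-specific head on the target dataset.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in backbone.parameters():
    p.requires_grad = False  # freeze the transferred representation
backbone.fc = nn.Linear(backbone.fc.in_features, 20)  # e.g. 20 VOC classes
# Optimize only backbone.fc.parameters() on the target task as usual.
```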
Revisiting Classifier Two-Sample Tests
The goal of two-sample tests is to assess whether two samples, $S_P \sim P^n$ and $S_Q \sim Q^m$, are drawn from the same distribution. Perhaps intriguingly, one relatively unexplored method to build two-sample tests is the use of binary c…
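The test itself fits in a few lines: train a binary classifier to distinguish the two samples and use its held-out accuracy as the test statistic; accuracy near chance supports $P = Q$. A minimal scikit-learn sketch, where the classifier choice and split sizes are illustrative rather than the paper's exact protocol:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def c2st(S_P, S_Q, seed=0):
    """Classifier two-sample test: label P-samples 0 and Q-samples 1,
    train a classifier, and return its held-out accuracy. Accuracy
    near 0.5 suggests the samples come from the same distribution."""
    X = np.vstack([S_P, S_Q])
    y = np.r_[np.zeros(len(S_P)), np.ones(len(S_Q))]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                        random_state=seed).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)  # test statistic: held-out accuracy

rng = np.random.default_rng(0)
print(c2st(rng.normal(0, 1, (500, 2)), rng.normal(0, 1, (500, 2))))  # ~0.5
print(c2st(rng.normal(0, 1, (500, 2)), rng.normal(1, 1, (500, 2))))  # >0.5
```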
Revisiting Classifier Two-Sample Tests for GAN Evaluation and Causal Discovery
The goal of two-sample tests is to assess whether two samples, $S_P \sim P^n$ and $S_Q \sim Q^m$, are drawn from the same distribution. Perhaps intriguingly, one relatively unexplored method to build two-sample tests is the use of binary c…
ContextLocNet: Context-Aware Deep Network Models for Weakly Supervised Localization
We aim to localize objects in images using image-level supervision only. Previous approaches to this problem mainly focus on discriminative object regions and often fail to locate precise object boundaries. We address this problem by intro…
Is object localization for free? – Weakly-supervised learning with convolutional neural networks