Pierre Ablin
Learning Unmasking Policies for Diffusion Language Models
Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One particularly successful variant is m…
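Masked diffusion language models of the kind this abstract describes generate text by iteratively revealing masked positions. The sketch below shows one common baseline, confidence-based unmasking, purely for illustration; the toy token probabilities and the number of positions revealed per step are assumptions, not the learned policy the paper proposes.

```python
import numpy as np

# Hypothetical setup: a sequence of 8 positions, some still masked,
# and per-position token probabilities from an (assumed) denoiser.
rng = np.random.default_rng(0)
masked = np.array([True, True, False, True, True, False, True, True])
probs = rng.dirichlet(np.ones(100), size=8)  # fake vocabulary of 100 tokens

# Confidence-based unmasking: reveal the k masked positions whose
# most likely token has the highest probability.
k = 2
confidence = probs.max(axis=1)
confidence[~masked] = -np.inf          # already-revealed positions are ignored
to_reveal = np.argsort(confidence)[-k:]
masked[to_reveal] = False
print("revealed positions:", sorted(to_reveal.tolist()))
```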
The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining
Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to dis…
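As described, CQF trains a binary classifier to separate a high-quality reference set from generic web text and keeps only the top-scoring documents. A minimal sketch with a TF-IDF logistic-regression classifier; the toy corpora, the features, and the retention fraction are all illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for a high-quality reference corpus and raw web text.
reference = ["a carefully written encyclopedia entry", "a peer reviewed article"]
web = ["buy cheap pills now", "click here to win", "notes on linear algebra"]

# Train the quality classifier: reference = 1, web = 0.
vec = TfidfVectorizer()
X = vec.fit_transform(reference + web)
y = np.array([1] * len(reference) + [0] * len(web))
clf = LogisticRegression().fit(X, y)

# Filter a candidate pool: keep the top fraction by predicted quality.
pool = ["an article about convex optimization", "win a free prize today"]
scores = clf.predict_proba(vec.transform(pool))[:, 1]
keep_fraction = 0.5
threshold = np.quantile(scores, 1 - keep_fraction)
kept = [doc for doc, s in zip(pool, scores) if s >= threshold]
print(kept)
```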
Scaling Laws for Optimal Data Mixtures
Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on…
The Geometries of Truth Are Orthogonal Across Tasks
Large Language Models (LLMs) have demonstrated impressive generalization capabilities across various tasks, but their practical relevance is still hampered by concerns about their reliability. Recent works have proposed examining the ac…
Multi-View Causal Discovery without Non-Gaussianity: Identifiability and Algorithms
Causal discovery is a difficult problem that typically relies on strong assumptions on the data-generating model, such as non-Gaussianity. In practice, many modern applications provide multiple related views of the same system, which has r…
Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection
A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: (i)…
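The "pretraining data injection" of the title refers to mixing a fraction of the original pretraining data back into the finetuning stream to limit forgetting. A minimal sketch of that mixing step, where the injection fraction and both datasets are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
target_data = [f"target_doc_{i}" for i in range(1000)]
pretrain_data = [f"pretrain_doc_{i}" for i in range(10000)]

# Inject a fraction p of pretraining documents into each finetuning batch.
p = 0.25
batch_size = 8
n_pre = int(round(p * batch_size))
batch = (list(rng.choice(pretrain_data, size=n_pre, replace=False))
         + list(rng.choice(target_data, size=batch_size - n_pre, replace=False)))
print(batch)
```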
Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can instantiate a model …
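The Soup-of-Experts idea, as summarized here, is to instantiate a specialist model by averaging expert parameters according to domain weights. A minimal sketch of weighted parameter averaging; the single weight matrix per expert and the particular domain weights are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# A bank of "expert" parameter sets, here plain weight matrices.
n_experts, d_in, d_out = 4, 16, 8
experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]

# Given domain weights (e.g. the target data mixture), instantiate one
# model whose parameters are the weighted average of the experts.
domain_weights = np.array([0.5, 0.2, 0.2, 0.1])
souped = sum(w * W for w, W in zip(domain_weights, experts))
print(souped.shape)  # (16, 8): a single specialist weight matrix
```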
A Unified Perspective on the Dynamics of Deep Transformers
Transformers, which are state-of-the-art in most machine learning tasks, represent the data as sequences of vectors called tokens. This representation is then exploited by the attention function, which learns dependencies between tokens an…
Shielded Diffusion: Generating Novel and Diverse Images using Sparse Repellency
The adoption of text-to-image diffusion models raises concerns over reliability, drawing scrutiny under the lens of various metrics like calibration, fairness, or compute efficiency. We focus in this work on two issues that arise when depl…
Dynamic Gradient Alignment for Online Data Mixing
The composition of training data mixtures is critical for effectively training large language models (LLMs), as it directly impacts their performance on downstream tasks. Our goal is to identify an optimal data mixture to specialize an LLM…
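One natural reading of "gradient alignment" for data mixing is to score each domain by how well its gradient aligns with the gradient of the target task. The sketch below does exactly that with random placeholder gradients; the cosine score and the softmax reweighting are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_domains, dim = 3, 100
domain_grads = rng.normal(size=(n_domains, dim))   # per-domain loss gradients
target_grad = rng.normal(size=dim)                 # gradient of the target task

# Alignment score: cosine similarity between each domain gradient and the target.
cos = domain_grads @ target_grad / (
    np.linalg.norm(domain_grads, axis=1) * np.linalg.norm(target_grad))

# Turn alignments into a data mixture (softmax is an illustrative choice).
temperature = 1.0
weights = np.exp(cos / temperature)
weights /= weights.sum()
print(weights)
```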
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
Specialist language models (LMs) focus on a specific task or domain on which they often outperform generalist LMs of the same size. However, the specialist data needed to pretrain these models is only available in limited amount for most t…
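Clustered importance sampling, as described, reweights a large generalist corpus so that its cluster distribution matches that of a small specialist set. A minimal sketch using k-means on toy embeddings; the embedding dimension, cluster count, and weighting rule are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
general = rng.normal(size=(2000, 32))    # embeddings of generalist documents
specialist = rng.normal(size=(100, 32))  # embeddings of the small specialist set

# Cluster the generalist corpus, then measure each cluster's share in both sets.
k = 10
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(general)
gen_share = np.bincount(km.labels_, minlength=k) / len(general)
spec_share = np.bincount(km.predict(specialist), minlength=k) / len(specialist)

# Importance weight of a generalist document = specialist share / generalist share
# of its cluster, so resampling matches the specialist cluster distribution.
doc_weights = (spec_share / np.maximum(gen_share, 1e-12))[km.labels_]
doc_weights /= doc_weights.sum()
print(doc_weights[:5])
```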
Theory, Analysis, and Best Practices for Sigmoid Self-Attention
Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between ke…
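The abstract defines attention as a weighted sum of values, with weights given by the softmax of query-key dot products; the paper studies a sigmoid replacement. The numpy sketch below contrasts the two; the shapes and the bias term b in the sigmoid variant are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 16                       # sequence length, head dimension
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
logits = Q @ K.T / np.sqrt(d)

# Standard softmax attention: each row of weights sums to one.
softmax_w = np.exp(logits - logits.max(axis=-1, keepdims=True))
softmax_w /= softmax_w.sum(axis=-1, keepdims=True)
out_softmax = softmax_w @ V

# Sigmoid attention: elementwise sigmoid of the (shifted) logits,
# with no row normalization. The bias b is an illustrative choice.
b = -np.log(n)
sigmoid_w = 1.0 / (1.0 + np.exp(-(logits + b)))
out_sigmoid = sigmoid_w @ V
print(out_softmax.shape, out_sigmoid.shape)
```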
The AdEMAMix Optimizer: Better, Faster, Older
Momentum-based optimizers are central to a wide range of machine learning applications. These typically rely on an Exponential Moving Average (EMA) of gradients, which exponentially decays the contribution of older gradients to the current update. This …
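The EMA of gradients mentioned here is the familiar momentum buffer. A minimal sketch of that update on a toy quadratic, with an arbitrary decay and step size; the paper's actual contribution, suggested by the title, is not reproduced here.

```python
import numpy as np

# Gradient of a toy quadratic loss f(x) = 0.5 * ||x||^2.
def grad(x):
    return x

x = np.ones(3)
m = np.zeros_like(x)     # EMA of gradients (momentum buffer)
beta, lr = 0.9, 0.1      # illustrative decay and step size

for _ in range(50):
    g = grad(x)
    # Each step, older gradients are further downweighted by a factor beta.
    m = beta * m + (1.0 - beta) * g
    x = x - lr * m

print(np.linalg.norm(x))  # should be close to zero
```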
Optimization without Retraction on the Random Generalized Stiefel Manifold
Optimization over the set of matrices $X$ that satisfy $X^\top B X = I_p$, referred to as the generalized Stiefel manifold, appears in many applications involving sampled covariance matrices such as the canonical correlation analysis (CCA)…
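The constraint $X^\top B X = I_p$ defines the generalized Stiefel manifold. The short check below builds a random positive-definite B, produces a feasible X by B-orthonormalizing a random matrix, and verifies the constraint; this construction is just one convenient way to obtain a feasible point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 3

# A random symmetric positive definite B (e.g. a sampled covariance matrix).
A = rng.normal(size=(n, n))
B = A @ A.T + n * np.eye(n)

# B-orthonormalize a random n x p matrix: X = Y (Y^T B Y)^{-1/2}.
Y = rng.normal(size=(n, p))
M = Y.T @ B @ Y
evals, evecs = np.linalg.eigh(M)
M_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
X = Y @ M_inv_sqrt

# Verify the generalized Stiefel constraint X^T B X = I_p.
print(np.allclose(X.T @ B @ X, np.eye(p)))  # True
```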
Enhancing Hypergradients Estimation: A Study of Preconditioning and Reparameterization
Bilevel optimization aims to optimize an outer objective function that depends on the solution to an inner optimization problem. It is routinely used in Machine Learning, notably for hyperparameter tuning. The conventional method to comput…
Careful with that Scalpel: Improving Gradient Surgery with an EMA
Beyond minimizing a single training loss, many deep learning estimation pipelines rely on an auxiliary objective to quantify and encourage desirable properties of the model (e.g. performance on another dataset, robustness, agreement with a…
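Gradient surgery, as referenced in the title, edits the auxiliary gradient so it stops conflicting with the main training gradient. The sketch below shows one classic surgery step, the PCGrad-style projection that removes the conflicting component; it is a generic illustration, not the paper's EMA-based variant.

```python
import numpy as np

def surgery(g_main, g_aux):
    """Remove from g_aux the component that conflicts with g_main."""
    dot = g_aux @ g_main
    if dot < 0:  # conflict: the auxiliary gradient points against the main one
        g_aux = g_aux - dot / (g_main @ g_main) * g_main
    return g_aux

rng = np.random.default_rng(0)
g_main = rng.normal(size=5)
g_aux = -g_main + 0.1 * rng.normal(size=5)        # strongly conflicting

g_aux_fixed = surgery(g_main, g_aux)
print(g_main @ g_aux, g_main @ g_aux_fixed)       # negative, then ~0
```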
Need a Small Specialized Language Model? Plan Early!
Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a s…
Quelle est la régularité de l'attention ? [How Smooth Is Attention?]
How Smooth Is Attention?
Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties - which are key when it comes to analyzing robust…
MultiView Independent Component Analysis with Delays
Linear Independent Component Analysis (ICA) is a blind source separation technique that has been used in various domains to identify independent latent sources from observed signals. In order to obtain a higher signal-to-noise ratio, the p…
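The abstract describes linear ICA as recovering independent latent sources from observed mixtures. A minimal single-view sketch using scikit-learn's FastICA on synthetic signals; the delays that are the paper's focus are not modeled here.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two independent non-Gaussian sources and a random linear mixing.
S = np.c_[np.sign(np.sin(3 * t)), np.sin(5 * t)]
A = rng.normal(size=(2, 2))
X = S @ A.T                       # observed signals: rows are time points

# Recover the sources up to permutation and scaling.
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)
print(S_hat.shape)                # (2000, 2)
```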
Adaptive Training Distributions with Scalable Online Bilevel Optimization
Large neural networks pretrained on web-scale corpora are central to modern machine learning. In this paradigm, the distribution of the large, heterogeneous pretraining data rarely matches that of the application domain. This work consider…
A Challenge in Reweighting Data with Bilevel Optimization
In many scenarios, one uses a large training set to train a model with the goal of performing well on a smaller testing set with a different distribution. Learning a weight for each data point of the training set is an appealing solution, …
How to Scale Your EMA
Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule, for example, in …
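The scaling rule mentioned in the abstract is, for SGD, to scale the learning rate linearly with the batch size; an analogous rule can be applied to the decay of a model EMA by scaling it in the exponent. The snippet below states both rules with placeholder baseline values; the exponent rule is written here as a common convention, not a quotation of the paper.

```python
# Baseline hyperparameters at a reference batch size (placeholder values).
base_batch, base_lr, base_ema_decay = 256, 0.1, 0.999

def scale_hparams(new_batch):
    kappa = new_batch / base_batch
    lr = base_lr * kappa                   # linear scaling rule for SGD
    ema_decay = base_ema_decay ** kappa    # EMA decay scaled in the exponent
    return lr, ema_decay

for b in (256, 512, 1024):
    print(b, scale_hparams(b))
```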
Learning Elastic Costs to Shape Monge Displacements
Given a source and a target probability measure supported on $\mathbb{R}^d$, the Monge problem asks to find the most efficient way to map one distribution to the other. This efficiency is quantified by defining a \textit{cost} function bet…
Test like you Train in Implicit Deep Learning
Implicit deep learning has recently gained popularity with applications ranging from meta-learning to Deep Equilibrium Networks (DEQs). In its general formulation, it relies on expressing some components of deep learning pipelines implicit…
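Implicit deep learning, as in the DEQs mentioned here, defines a layer's output as a fixed point z* = f(z*, x) rather than through an explicit stack of computations. A minimal sketch of the forward fixed-point iteration for a contractive toy map; the map, its parameters, and the stopping rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = 0.3 * rng.normal(size=(d, d)) / np.sqrt(d)   # small weights -> contraction
b = rng.normal(size=d)

def f(z, x):
    # A toy implicit layer: tanh of an affine map of the state plus the input.
    return np.tanh(W @ z + x + b)

# Forward pass of the implicit layer: iterate to a fixed point z* = f(z*, x).
x = rng.normal(size=d)
z = np.zeros(d)
for _ in range(100):
    z_next = f(z, x)
    if np.linalg.norm(z_next - z) < 1e-8:
        break
    z = z_next

print(np.linalg.norm(f(z, x) - z))   # ~0: z is (numerically) a fixed point
```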
Infeasible Deterministic, Stochastic, and Variance-Reduction Algorithms for Optimization under Orthogonality Constraints
Orthogonality constraints naturally appear in many machine learning problems, from principal component analysis to robust neural network training. They are usually solved using Riemannian optimization algorithms, which minimize the objecti…
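"Infeasible" methods do not keep the iterates exactly orthogonal at every step. A simple way to see the idea is plain gradient descent on the objective plus an orthogonality penalty such as $\frac{1}{4}\|X^\top X - I\|_F^2$, with no retraction. The sketch below applies this to a toy Procrustes-style objective; the penalty weight and step size are arbitrary, and this bare penalty method is a simplification of the algorithms the paper studies.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
C = rng.normal(size=(n, p))

def loss_grad(X):
    return X - C                          # gradient of f(X) = 0.5 * ||X - C||_F^2

def penalty_grad(X):
    return X @ (X.T @ X - np.eye(p))      # gradient of 0.25 * ||X^T X - I||_F^2

# "Infeasible" iteration: plain gradient steps on the objective plus an
# orthogonality penalty; the iterates are never retracted onto the manifold.
X, _ = np.linalg.qr(rng.normal(size=(n, p)))   # start near the constraint set
lam, lr = 100.0, 0.005
for _ in range(2000):
    X = X - lr * (loss_grad(X) + lam * penalty_grad(X))

# A plain penalty only enforces orthogonality approximately.
print(np.linalg.norm(X.T @ X - np.eye(p)))
```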
A Lower Bound and a Near-Optimal Algorithm for Bilevel Empirical Risk Minimization
Bilevel optimization problems, in which two optimization problems are nested, are finding more and more applications in machine learning. In many practical cases, the upper and the lower objectives correspond to empirical risk min…
Monge, Bregman and Occam: Interpretable Optimal Transport in High-Dimensions with Feature-Sparse Maps
Optimal transport (OT) theory focuses, among all maps $T:\mathbb{R}^d\rightarrow \mathbb{R}^d$ that can morph a probability measure onto another, on those that are the "thriftiest", i.e. such that the averaged cost $c(x, T(x))$ between $…
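The averaged cost in the abstract is the Monge objective, the expectation of $c(x, T(x))$ over the source measure. The snippet below evaluates it on samples for the squared Euclidean cost and a simple translation map whose displacement touches only two features; both the map and the cost are placeholders for the feature-sparse maps the paper studies.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000
x = rng.normal(size=(n, d))             # samples from the source measure

def T(x):
    # A toy transport map: translate only the first two coordinates,
    # so the displacement T(x) - x is feature-sparse.
    shift = np.zeros(d)
    shift[:2] = 1.0
    return x + shift

def cost(x, y):
    return np.sum((x - y) ** 2, axis=-1)   # squared Euclidean cost c(x, y)

# Monte-Carlo estimate of the averaged transport cost E[c(x, T(x))].
print(cost(x, T(x)).mean())                # = 2.0 for this translation
```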
Benchopt: Reproducible, efficient and collaborative optimization benchmarks
Numerical validation is at the core of machine learning research, as it allows researchers to assess the actual impact of new methods and to confirm the agreement between theory and practice. Yet, the rapid development of the field poses several chall…