Pierre Ablin
Learning Unmasking Policies for Diffusion Language Models
Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One particularly successful variant is m…
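Masked diffusion language models of the kind this abstract describes generate text by iteratively revealing masked positions. The sketch below shows one common baseline, confidence-based unmasking, purely for illustration; the toy token probabilities and the number of positions revealed per step are assumptions, not the learned policy the paper proposes.

```python
import numpy as np

# Hypothetical setup: a sequence of 8 positions, some still masked,
# and per-position token probabilities from an (assumed) denoiser.
rng = np.random.default_rng(0)
masked = np.array([True, True, False, True, True, False, True, True])
probs = rng.dirichlet(np.ones(100), size=8)  # fake vocabulary of 100 tokens

# Confidence-based unmasking: reveal the k masked positions whose
# most likely token has the highest probability.
k = 2
confidence = probs.max(axis=1)
confidence[~masked] = -np.inf          # already-revealed positions are ignored
to_reveal = np.argsort(confidence)[-k:]
masked[to_reveal] = False
print("revealed positions:", sorted(to_reveal.tolist()))
```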
The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining
Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to dis…
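As described, CQF trains a binary classifier to separate a high-quality reference set from generic web text and keeps only the top-scoring documents. A minimal sketch with a TF-IDF logistic-regression classifier; the toy corpora, the features, and the retention fraction are all illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for a high-quality reference corpus and raw web text.
reference = ["a carefully written encyclopedia entry", "a peer reviewed article"]
web = ["buy cheap pills now", "click here to win", "notes on linear algebra"]

# Train the quality classifier: reference = 1, web = 0.
vec = TfidfVectorizer()
X = vec.fit_transform(reference + web)
y = np.array([1] * len(reference) + [0] * len(web))
clf = LogisticRegression().fit(X, y)

# Filter a candidate pool: keep the top fraction by predicted quality.
pool = ["an article about convex optimization", "win a free prize today"]
scores = clf.predict_proba(vec.transform(pool))[:, 1]
keep_fraction = 0.5
threshold = np.quantile(scores, 1 - keep_fraction)
kept = [doc for doc, s in zip(pool, scores) if s >= threshold]
print(kept)
```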
Scaling Laws for Optimal Data Mixtures
Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on…
The Geometries of Truth Are Orthogonal Across Tasks
Large Language Models (LLMs) have demonstrated impressive generalization capabilities across various tasks, but their practical relevance is still hampered by concerns about their reliability. Recent works have proposed examining the ac…
Multi-View Causal Discovery without Non-Gaussianity: Identifiability and Algorithms
Causal discovery is a difficult problem that typically relies on strong assumptions on the data-generating model, such as non-Gaussianity. In practice, many modern applications provide multiple related views of the same system, which has r…
Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection
A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: (i)…
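The "pretraining data injection" of the title refers to mixing a fraction of the original pretraining data back into the finetuning stream to limit forgetting. A minimal sketch of that mixing step, where the injection fraction and both datasets are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
target_data = [f"target_doc_{i}" for i in range(1000)]
pretrain_data = [f"pretrain_doc_{i}" for i in range(10000)]

# Inject a fraction p of pretraining documents into each finetuning batch.
p = 0.25
batch_size = 8
n_pre = int(round(p * batch_size))
batch = (list(rng.choice(pretrain_data, size=n_pre, replace=False))
         + list(rng.choice(target_data, size=batch_size - n_pre, replace=False)))
print(batch)
```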
Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can instantiate a model …
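The Soup-of-Experts idea, as summarized here, is to instantiate a specialist model by averaging expert parameters according to domain weights. A minimal sketch of weighted parameter averaging; the single weight matrix per expert and the particular domain weights are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# A bank of "expert" parameter sets, here plain weight matrices.
n_experts, d_in, d_out = 4, 16, 8
experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]

# Given domain weights (e.g. the target data mixture), instantiate one
# model whose parameters are the weighted average of the experts.
domain_weights = np.array([0.5, 0.2, 0.2, 0.1])
souped = sum(w * W for w, W in zip(domain_weights, experts))
print(souped.shape)  # (16, 8): a single specialist weight matrix
```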
A Unified Perspective on the Dynamics of Deep Transformers
Transformers, which are state-of-the-art in most machine learning tasks, represent the data as sequences of vectors called tokens. This representation is then exploited by the attention function, which learns dependencies between tokens an…
Shielded Diffusion: Generating Novel and Diverse Images using Sparse Repellency
The adoption of text-to-image diffusion models raises concerns over reliability, drawing scrutiny under the lens of various metrics like calibration, fairness, or compute efficiency. We focus in this work on two issues that arise when depl…
Dynamic Gradient Alignment for Online Data Mixing
The composition of training data mixtures is critical for effectively training large language models (LLMs), as it directly impacts their performance on downstream tasks. Our goal is to identify an optimal data mixture to specialize an LLM…
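One natural reading of "gradient alignment" for data mixing is to score each domain by how well its gradient aligns with the gradient of the target task. The sketch below does exactly that with random placeholder gradients; the cosine score and the softmax reweighting are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_domains, dim = 3, 100
domain_grads = rng.normal(size=(n_domains, dim))   # per-domain loss gradients
target_grad = rng.normal(size=dim)                 # gradient of the target task

# Alignment score: cosine similarity between each domain gradient and the target.
cos = domain_grads @ target_grad / (
    np.linalg.norm(domain_grads, axis=1) * np.linalg.norm(target_grad))

# Turn alignments into a data mixture (softmax is an illustrative choice).
temperature = 1.0
weights = np.exp(cos / temperature)
weights /= weights.sum()
print(weights)
```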
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
Specialist language models (LMs) focus on a specific task or domain on which they often outperform generalist LMs of the same size. However, the specialist data needed to pretrain these models is only available in limited amount for most t…
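Clustered importance sampling, as described, reweights a large generalist corpus so that its cluster distribution matches that of a small specialist set. A minimal sketch using k-means on toy embeddings; the embedding dimension, cluster count, and weighting rule are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
general = rng.normal(size=(2000, 32))    # embeddings of generalist documents
specialist = rng.normal(size=(100, 32))  # embeddings of the small specialist set

# Cluster the generalist corpus, then measure each cluster's share in both sets.
k = 10
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(general)
gen_share = np.bincount(km.labels_, minlength=k) / len(general)
spec_share = np.bincount(km.predict(specialist), minlength=k) / len(specialist)

# Importance weight of a generalist document = specialist share / generalist share
# of its cluster, so resampling matches the specialist cluster distribution.
doc_weights = (spec_share / np.maximum(gen_share, 1e-12))[km.labels_]
doc_weights /= doc_weights.sum()
print(doc_weights[:5])
```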
Theory, Analysis, and Best Practices for Sigmoid Self-Attention
Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between ke…
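The abstract defines attention as a weighted sum of values, with weights given by the softmax of query-key dot products; the paper studies a sigmoid replacement. The numpy sketch below contrasts the two; the shapes and the bias term b in the sigmoid variant are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 16                       # sequence length, head dimension
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
logits = Q @ K.T / np.sqrt(d)

# Standard softmax attention: each row of weights sums to one.
softmax_w = np.exp(logits - logits.max(axis=-1, keepdims=True))
softmax_w /= softmax_w.sum(axis=-1, keepdims=True)
out_softmax = softmax_w @ V

# Sigmoid attention: elementwise sigmoid of the (shifted) logits,
# with no row normalization. The bias b is an illustrative choice.
b = -np.log(n)
sigmoid_w = 1.0 / (1.0 + np.exp(-(logits + b)))
out_sigmoid = sigmoid_w @ V
print(out_softmax.shape, out_sigmoid.shape)
```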
The AdEMAMix Optimizer: Better, Faster, Older
Momentum-based optimizers are central to a wide range of machine learning applications. These typically rely on an Exponential Moving Average (EMA) of gradients, which exponentially decays the contribution of older gradients to the current update. This …
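The EMA of gradients mentioned here is the familiar momentum buffer. A minimal sketch of that update on a toy quadratic, with an arbitrary decay and step size; the paper's actual contribution, suggested by the title, is not reproduced here.

```python
import numpy as np

# Gradient of a toy quadratic loss f(x) = 0.5 * ||x||^2.
def grad(x):
    return x

x = np.ones(3)
m = np.zeros_like(x)     # EMA of gradients (momentum buffer)
beta, lr = 0.9, 0.1      # illustrative decay and step size

for _ in range(50):
    g = grad(x)
    # Each step, older gradients are further downweighted by a factor beta.
    m = beta * m + (1.0 - beta) * g
    x = x - lr * m

print(np.linalg.norm(x))  # should be close to zero
```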
Optimization without Retraction on the Random Generalized Stiefel Manifold
Optimization over the set of matrices $X$ that satisfy $X^\top B X = I_p$, referred to as the generalized Stiefel manifold, appears in many applications involving sampled covariance matrices such as the canonical correlation analysis (CCA)…
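The constraint $X^\top B X = I_p$ defines the generalized Stiefel manifold. The short check below builds a random positive-definite B, produces a feasible X by B-orthonormalizing a random matrix, and verifies the constraint; this construction is just one convenient way to obtain a feasible point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 3

# A random symmetric positive definite B (e.g. a sampled covariance matrix).
A = rng.normal(size=(n, n))
B = A @ A.T + n * np.eye(n)

# B-orthonormalize a random n x p matrix: X = Y (Y^T B Y)^{-1/2}.
Y = rng.normal(size=(n, p))
M = Y.T @ B @ Y
evals, evecs = np.linalg.eigh(M)
M_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
X = Y @ M_inv_sqrt

# Verify the generalized Stiefel constraint X^T B X = I_p.
print(np.allclose(X.T @ B @ X, np.eye(p)))  # True
```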
Enhancing Hypergradients Estimation: A Study of Preconditioning and Reparameterization
Bilevel optimization aims to optimize an outer objective function that depends on the solution to an inner optimization problem. It is routinely used in Machine Learning, notably for hyperparameter tuning. The conventional method to comput…
Careful with that Scalpel: Improving Gradient Surgery with an EMA
Beyond minimizing a single training loss, many deep learning estimation pipelines rely on an auxiliary objective to quantify and encourage desirable properties of the model (e.g. performance on another dataset, robustness, agreement with a…
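Gradient surgery, as referenced in the title, edits the auxiliary gradient so it stops conflicting with the main training gradient. The sketch below shows one classic surgery step, the PCGrad-style projection that removes the conflicting component; it is a generic illustration, not the paper's EMA-based variant.

```python
import numpy as np

def surgery(g_main, g_aux):
    """Remove from g_aux the component that conflicts with g_main."""
    dot = g_aux @ g_main
    if dot < 0:  # conflict: the auxiliary gradient points against the main one
        g_aux = g_aux - dot / (g_main @ g_main) * g_main
    return g_aux

rng = np.random.default_rng(0)
g_main = rng.normal(size=5)
g_aux = -g_main + 0.1 * rng.normal(size=5)        # strongly conflicting

g_aux_fixed = surgery(g_main, g_aux)
print(g_main @ g_aux, g_main @ g_aux_fixed)       # negative, then ~0
```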
Need a Small Specialized Language Model? Plan Early!
Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a s…
Quelle est la régularité de l'attention ? [How Smooth Is Attention?]
How Smooth Is Attention?
Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties - which are key when it comes to analyzing robust…
MultiView Independent Component Analysis with Delays
Linear Independent Component Analysis (ICA) is a blind source separation technique that has been used in various domains to identify independent latent sources from observed signals. In order to obtain a higher signal-to-noise ratio, the p…
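The abstract describes linear ICA as recovering independent latent sources from observed mixtures. A minimal single-view sketch using scikit-learn's FastICA on synthetic signals; the delays that are the paper's focus are not modeled here.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two independent non-Gaussian sources and a random linear mixing.
S = np.c_[np.sign(np.sin(3 * t)), np.sin(5 * t)]
A = rng.normal(size=(2, 2))
X = S @ A.T                       # observed signals: rows are time points

# Recover the sources up to permutation and scaling.
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)
print(S_hat.shape)                # (2000, 2)
```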
Adaptive Training Distributions with Scalable Online Bilevel Optimization
Large neural networks pretrained on web-scale corpora are central to modern machine learning. In this paradigm, the distribution of the large, heterogeneous pretraining data rarely matches that of the application domain. This work consider…
A Challenge in Reweighting Data with Bilevel Optimization
In many scenarios, one uses a large training set to train a model with the goal of performing well on a smaller testing set with a different distribution. Learning a weight for each data point of the training set is an appealing solution, …
How to Scale Your EMA
Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule, for example, in …
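The scaling rule mentioned in the abstract is, for SGD, to scale the learning rate linearly with the batch size; an analogous rule can be applied to the decay of a model EMA by scaling it in the exponent. The snippet below states both rules with placeholder baseline values; the exponent rule is written here as a common convention, not a quotation of the paper.

```python
# Baseline hyperparameters at a reference batch size (placeholder values).
base_batch, base_lr, base_ema_decay = 256, 0.1, 0.999

def scale_hparams(new_batch):
    kappa = new_batch / base_batch
    lr = base_lr * kappa                   # linear scaling rule for SGD
    ema_decay = base_ema_decay ** kappa    # EMA decay scaled in the exponent
    return lr, ema_decay

for b in (256, 512, 1024):
    print(b, scale_hparams(b))
```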
Learning Elastic Costs to Shape Monge Displacements
Given a source and a target probability measure supported on $\mathbb{R}^d$, the Monge problem asks to find the most efficient way to map one distribution to the other. This efficiency is quantified by defining a \textit{cost} function bet…
Test like you Train in Implicit Deep Learning
Implicit deep learning has recently gained popularity with applications ranging from meta-learning to Deep Equilibrium Networks (DEQs). In its general formulation, it relies on expressing some components of deep learning pipelines implicit…
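Implicit deep learning, as in the DEQs mentioned here, defines a layer's output as a fixed point z* = f(z*, x) rather than through an explicit stack of computations. A minimal sketch of the forward fixed-point iteration for a contractive toy map; the map, its parameters, and the stopping rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = 0.3 * rng.normal(size=(d, d)) / np.sqrt(d)   # small weights -> contraction
b = rng.normal(size=d)

def f(z, x):
    # A toy implicit layer: tanh of an affine map of the state plus the input.
    return np.tanh(W @ z + x + b)

# Forward pass of the implicit layer: iterate to a fixed point z* = f(z*, x).
x = rng.normal(size=d)
z = np.zeros(d)
for _ in range(100):
    z_next = f(z, x)
    if np.linalg.norm(z_next - z) < 1e-8:
        break
    z = z_next

print(np.linalg.norm(f(z, x) - z))   # ~0: z is (numerically) a fixed point
```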
Infeasible Deterministic, Stochastic, and Variance-Reduction Algorithms for Optimization under Orthogonality Constraints
Orthogonality constraints naturally appear in many machine learning problems, from principal component analysis to robust neural network training. They are usually solved using Riemannian optimization algorithms, which minimize the objecti…
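"Infeasible" methods do not keep the iterates exactly orthogonal at every step. A simple way to see the idea is plain gradient descent on the objective plus an orthogonality penalty such as $\frac{1}{4}\|X^\top X - I\|_F^2$, with no retraction. The sketch below applies this to a toy Procrustes-style objective; the penalty weight and step size are arbitrary, and this bare penalty method is a simplification of the algorithms the paper studies.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
C = rng.normal(size=(n, p))

def loss_grad(X):
    return X - C                          # gradient of f(X) = 0.5 * ||X - C||_F^2

def penalty_grad(X):
    return X @ (X.T @ X - np.eye(p))      # gradient of 0.25 * ||X^T X - I||_F^2

# "Infeasible" iteration: plain gradient steps on the objective plus an
# orthogonality penalty; the iterates are never retracted onto the manifold.
X, _ = np.linalg.qr(rng.normal(size=(n, p)))   # start near the constraint set
lam, lr = 100.0, 0.005
for _ in range(2000):
    X = X - lr * (loss_grad(X) + lam * penalty_grad(X))

# A plain penalty only enforces orthogonality approximately.
print(np.linalg.norm(X.T @ X - np.eye(p)))
```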
A Lower Bound and a Near-Optimal Algorithm for Bilevel Empirical Risk Minimization
Bilevel optimization problems, in which two optimization problems are nested, are finding more and more applications in machine learning. In many practical cases, the upper and the lower objectives correspond to empirical risk min…
Monge, Bregman and Occam: Interpretable Optimal Transport in High-Dimensions with Feature-Sparse Maps
Optimal transport (OT) theory focuses, among all maps $T:\mathbb{R}^d\rightarrow \mathbb{R}^d$ that can morph a probability measure onto another, on those that are the "thriftiest", i.e. such that the averaged cost $c(x, T(x))$ between $…
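The averaged cost in the abstract is the Monge objective, the expectation of $c(x, T(x))$ over the source measure. The snippet below evaluates it on samples for the squared Euclidean cost and a simple translation map whose displacement touches only two features; both the map and the cost are placeholders for the feature-sparse maps the paper studies.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000
x = rng.normal(size=(n, d))             # samples from the source measure

def T(x):
    # A toy transport map: translate only the first two coordinates,
    # so the displacement T(x) - x is feature-sparse.
    shift = np.zeros(d)
    shift[:2] = 1.0
    return x + shift

def cost(x, y):
    return np.sum((x - y) ** 2, axis=-1)   # squared Euclidean cost c(x, y)

# Monte-Carlo estimate of the averaged transport cost E[c(x, T(x))].
print(cost(x, T(x)).mean())                # = 2.0 for this translation
```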
Benchopt: Reproducible, efficient and collaborative optimization benchmarks
Numerical validation is at the core of machine learning research, as it allows researchers to assess the actual impact of new methods and to confirm the agreement between theory and practice. Yet, the rapid development of the field poses several chall…