David Grangier
The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining
Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to dis…
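To make the CQF recipe above concrete, here is a minimal sketch of training a binary quality classifier and keeping only the top-scoring documents. The classifier, features, and keep_fraction below are illustrative placeholders, not the setup studied in the paper.

```python
# Sketch of classifier-based quality filtering (CQF): train a binary classifier
# to separate "high quality" reference documents from random web documents,
# then keep the web documents it scores highest.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_classifier(high_quality_docs, random_web_docs):
    texts = high_quality_docs + random_web_docs
    labels = [1] * len(high_quality_docs) + [0] * len(random_web_docs)
    vectorizer = TfidfVectorizer(max_features=50_000)
    clf = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)
    return vectorizer, clf

def filter_corpus(docs, vectorizer, clf, keep_fraction=0.1):
    # The predicted probability of the "high quality" class serves as the score.
    scores = clf.predict_proba(vectorizer.transform(docs))[:, 1]
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[: int(len(ranked) * keep_fraction)]]
```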
Pretraining with hierarchical memories: separating long-tail and common knowledge
The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a frac…
Compute-Optimal Quantization-Aware Training
Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior ac…
Scaling Laws for Optimal Data Mixtures
Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on…
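As a concrete reading of "data mixture", the sketch below draws each training document from a domain sampled according to the mixture proportions; the domain names and weights are made up for illustration and are not from the paper.

```python
import random

# Illustrative mixture: the probability of drawing the next training document
# from each domain. These weights are hypothetical, not fitted by any scaling law.
mixture = {"web": 0.6, "code": 0.2, "books": 0.1, "wiki": 0.1}

def sample_document(corpora, mixture, rng=random):
    # corpora maps a domain name to a list of documents from that domain.
    domains = list(mixture)
    domain = rng.choices(domains, weights=[mixture[d] for d in domains], k=1)[0]
    return rng.choice(corpora[domain])
```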
Assessing the Role of Data Quality in Training Bilingual Language Models
Bilingual and multilingual language models offer a promising path toward scaling NLP systems across diverse languages and users. However, their performance often varies wildly between languages as prior works show that adding more language…
Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection
A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: (i)…
Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can instantiate a model …
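A minimal sketch of the parameter-averaging ingredient named in the title, assuming the experts share one architecture and are combined with per-domain coefficients; the plain weighted average below is a generic illustration, not necessarily the paper's exact instantiation procedure.

```python
def average_parameters(expert_state_dicts, coefficients):
    # Instantiate a single model whose weights are a convex combination of the
    # expert weights: theta = sum_k c_k * theta_k. Assumes all experts share the
    # same architecture, so their state dicts have identical keys and shapes.
    total = float(sum(coefficients))
    weights = [c / total for c in coefficients]
    return {
        name: sum(w * sd[name] for w, sd in zip(weights, expert_state_dicts))
        for name in expert_state_dicts[0]
    }
```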
Training Bilingual LMs with Data Constraints in the Targeted Language
Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high qua…
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
Large pretrained vision-language models like CLIP have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery) or fine-grained classification (e.g., car models) where the visual concepts…
No Need to Talk: Asynchronous Mixture of Language Models
We introduce SMALLTALK LM, an innovative method for training a mixture of language models in an almost asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the need for high-bandwid…
Dynamic Gradient Alignment for Online Data Mixing
The composition of training data mixtures is critical for effectively training large language models (LLMs), as it directly impacts their performance on downstream tasks. Our goal is to identify an optimal data mixture to specialize an LLM…
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
Specialist language models (LMs) focus on a specific task or domain on which they often outperform generalist LMs of the same size. However, the specialist data needed to pretrain these models is only available in limited amount for most t…
The AdEMAMix Optimizer: Better, Faster, Older
Momentum-based optimizers are central to a wide range of machine learning applications. These typically rely on an Exponential Moving Average (EMA) of gradients, which exponentially decays the contribution of older gradients. This …
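For reference, the EMA of gradients mentioned above is the standard accumulator

$$ m_t = \beta\, m_{t-1} + (1 - \beta)\, g_t = (1 - \beta) \sum_{k=0}^{t-1} \beta^{k}\, g_{t-k}, $$

so, taking $m_0 = 0$, the gradient from $k$ steps ago is weighted by $(1-\beta)\beta^{k}$ and its contribution shrinks exponentially with age. This is the generic momentum accumulator, not the AdEMAMix update itself.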
Need a Small Specialized Language Model? Plan Early!
Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a s…
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows wi…
Adaptive Training Distributions with Scalable Online Bilevel Optimization
Large neural networks pretrained on web-scale corpora are central to modern machine learning. In this paradigm, the distribution of the large, heterogeneous pretraining data rarely matches that of the application domain. This work consider…
Transfer Learning for Structured Pruning under Limited Task Data
Large, pre-trained models are problematic to use in resource-constrained applications. Fortunately, task-aware structured pruning methods offer a solution. These approaches reduce model size by dropping structural units like layers and att…
High-Resource Methodological Bias in Low-Resource Investigations
The central bottleneck for low-resource NLP is typically taken to be the quantity of accessible data, while the contribution of data quality is overlooked. This is particularly seen in the development and evaluation of low-resource systems via …
AudioLM: a Language Modeling Approach to Audio Generation
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation spa…
Learning strides in convolutional neural networks
Convolutional neural networks typically contain several downsampling operators, such as strided convolutions or pooling layers, that progressively reduce the resolution of intermediate representations. This provides some shift-invariance w…
The Trade-offs of Domain Adaptation for Neural Language Models
This work connects language model adaptation with concepts of machine learning theory. We consider a training setup with a large out-of-domain set and a small in-domain set. We derive how the benefit of training a model on either set depen…
On Systematic Style Differences between Unsupervised and Supervised MT and an Application for High-Resource Machine Translation
Modern unsupervised machine translation (MT) systems reach reasonable translation quality under clean and controlled data conditions. As the performance gap between supervised and unsupervised MT narrows, it is interesting to ask whether t…
A Natural Diet: Towards Improving Naturalness of Machine Translation Output
Machine translation (MT) evaluation often focuses on accuracy and fluency, without paying much attention to translation style. This means that, even when considered accurate and fluent, MT output can still sound less natural than high qual…
High Quality Rather than High Model Probability: Minimum Bayes Risk Decoding with Neural Metrics
In Neural Machine Translation, it is typically assumed that the sentence with the highest estimated probability should also be the translation with the highest quality as measured by humans. In this work, we question this assumption and sh…
Dive: End-to-End Speech Diarization Via Iterative Speaker Embedding
We introduce DIVE, an end-to-end speaker diarization algorithm. Our neural algorithm presents the diarization task as an iterative process: it repeatedly builds a representation for each speaker before predicting the voice activity of each…
Minimum Bayes Risk Decoding with Neural Metrics of Translation Quality.
This work applies Minimum Bayes Risk (MBR) decoding to optimize diverse automated metrics of translation quality. Automatic metrics in machine translation have made tremendous progress recently. In particular, neural metrics, fine-tuned on…
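For reference, sample-based MBR decoding selects, from a set of model-generated candidates, the hypothesis with the highest average utility against the other candidates under a chosen metric. The sketch below is generic: the candidate list and the utility callable (e.g. a neural metric) are placeholders, not the paper's exact configuration.

```python
def mbr_decode(candidates, utility):
    # candidates: translation hypotheses sampled from the model for one source
    # utility(hyp, ref): a quality metric treated as a black box (placeholder)
    best, best_score = None, float("-inf")
    for hyp in candidates:
        others = [c for c in candidates if c is not hyp]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = hyp, score
    return best
```

This replaces the usual argmax over model probability with an argmax over expected metric score, which is what lets the decoder target a specific quality metric.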