Nathan Godey
Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content
We introduce Biomed-Enriched, a biomedical text dataset constructed from PubMed via a two-stage annotation process. In the first stage, a large language model annotates 400K paragraphs from PubMed scientific articles, assigning scores for …
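The two-stage annotation pipeline described above lends itself to simple score-based filtering of the annotated paragraphs. Below is a hypothetical sketch of that step; the field names and score scale are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch of score-based paragraph selection, assuming each
# annotated paragraph carries LLM-assigned domain and quality scores.
# "domain" and "educational_score" are illustrative names, not the
# dataset's actual fields.
from dataclasses import dataclass

@dataclass
class Paragraph:
    text: str
    domain: str              # e.g. "clinical", "biomedical", "other"
    educational_score: int   # e.g. 1 (low) to 5 (high), as judged by an LLM

def select_for_pretraining(paragraphs, min_score=4, domains=("clinical", "biomedical")):
    """Keep only paragraphs the LLM rated highly within the target domains."""
    return [p for p in paragraphs
            if p.domain in domains and p.educational_score >= min_score]
```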
Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression
Autoregressive language models rely on a Key-Value (KV) Cache, which avoids re-computing past hidden states during generation, making it faster. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck,…
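A quick back-of-the-envelope calculation shows why the KV Cache becomes a memory bottleneck at long context lengths. The model dimensions below are illustrative (roughly 7B-scale), not taken from the paper.

```python
# KV cache size: 2 tensors (keys + values) per layer, one vector of size
# n_kv_heads * head_dim per token. Parameters are illustrative only.
def kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                   seq_len=32_768, batch_size=1, bytes_per_elem=2):  # fp16
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

print(f"{kv_cache_bytes() / 2**30:.1f} GiB")  # ~16 GiB for a single 32k-token sequence
```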
Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck
Recent advances in language modeling consist in pretraining highly parameterized neural networks on extremely large web-mined text corpora. Training and inference with such models can be costly in practice, which incentivizes the use of sm…
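The softmax bottleneck referenced in the title can be illustrated numerically: output logits are a linear function of d-dimensional hidden states, so their rank is capped by d regardless of vocabulary size. A minimal sketch with made-up dimensions:

```python
# Illustration of the softmax bottleneck: the logit matrix over any set of
# contexts has rank at most d (the hidden size), however large the vocabulary.
# Dimensions are illustrative, not those studied in the paper.
import numpy as np

d, vocab, n_contexts = 64, 8_000, 500
W = np.random.randn(vocab, d)          # output (unembedding) matrix
H = np.random.randn(n_contexts, d)     # hidden states for n_contexts contexts
logits = H @ W.T                       # shape (n_contexts, vocab)

print(np.linalg.matrix_rank(logits))   # <= d = 64, despite an 8k-word vocabulary
```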
On the Scaling Laws of Geographical Representation in Language Models
Language models have long been shown to embed geographical information in their hidden representations. This line of work has recently been revisited by extending this result to Large Language Models (LLMs). In this paper, we propose to fi…
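A common way to test whether geographical information is embedded in hidden representations is a linear probe mapping a place name's representation to its coordinates. The sketch below uses random stand-in data and is not necessarily the paper's exact protocol.

```python
# Minimal probing sketch: fit a linear map from a model's hidden
# representation of a place name to its (latitude, longitude).
# The data here is a random stand-in, purely for illustration.
import numpy as np
from sklearn.linear_model import Ridge

hidden = np.random.randn(200, 768)             # stand-in for hidden states of 200 place names
coords = np.random.uniform(-90, 90, (200, 2))  # stand-in for (lat, lon) targets

probe = Ridge(alpha=1.0).fit(hidden[:150], coords[:150])
print(probe.score(hidden[150:], coords[150:])) # held-out R^2 of the probe
```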
Anisotropy Is Inherent to Self-Attention in Transformers
The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations which make…
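Anisotropy is typically quantified as the average pairwise cosine similarity of hidden representations; a value far above zero indicates vectors concentrated in a narrow cone. A minimal sketch with stand-in data:

```python
# Average pairwise cosine similarity as an anisotropy measure.
# Random Gaussian data stands in for actual Transformer hidden states.
import numpy as np

def mean_cosine_similarity(h):
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    sims = h @ h.T
    n = len(h)
    return (sims.sum() - n) / (n * (n - 1))   # exclude self-similarities

print(mean_cosine_similarity(np.random.randn(512, 768)))  # ~0 for isotropic vectors
```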
Headless Language Models: Learning without Predicting with Contrastive Weight Tying
Self-supervised pre-training of language models usually consists in predicting probability distributions over extensive token vocabularies. In this study, we propose an innovative method that shifts away from probability prediction and ins…
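The general idea of replacing vocabulary prediction with a contrastive objective over input embeddings can be sketched as follows. This is a rough illustration using in-batch negatives, not necessarily the paper's exact loss.

```python
# Rough sketch: instead of projecting to vocabulary logits, score each output
# hidden state against the input embeddings in the batch and train
# contrastively (the embedding of the true next token is the positive,
# the other in-batch embeddings are negatives). Illustrative only.
import torch
import torch.nn.functional as F

def contrastive_next_token_loss(hidden, target_embeddings):
    # hidden:            (n, d) output states for n positions
    # target_embeddings: (n, d) input embeddings of the true next tokens
    logits = hidden @ target_embeddings.T      # (n, n) similarity matrix
    labels = torch.arange(hidden.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_next_token_loss(torch.randn(8, 256), torch.randn(8, 256))
```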
Is Anisotropy Inherent to Transformers?
The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations which make…
MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling
Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in significant flaws that degrade the models' downstream performance and robustness. In this w…
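Gradient-based tokenization can be pictured as a small module that predicts a soft segmentation over input bytes and pools byte embeddings into block embeddings, so segmentation is learned end-to-end with the language model. The sketch below is a simplified illustration under that assumption, not MANTa's actual architecture.

```python
# Simplified sketch of differentiable tokenization: score each byte's soft
# assignment to a fixed number of blocks, then pool byte embeddings into
# block embeddings. Illustrative only, not the paper's architecture.
import torch
import torch.nn as nn

class SoftPoolingTokenizer(nn.Module):
    def __init__(self, d_byte=64, n_blocks=128):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_byte)
        self.block_scorer = nn.Linear(d_byte, n_blocks)   # byte -> block assignment scores

    def forward(self, byte_ids):                          # (batch, seq_len) byte values
        e = self.byte_emb(byte_ids)                       # (batch, seq_len, d_byte)
        assign = self.block_scorer(e).softmax(dim=1)      # soft byte-to-block assignment
        return assign.transpose(1, 2) @ e                 # (batch, n_blocks, d_byte)

blocks = SoftPoolingTokenizer()(torch.randint(0, 256, (2, 512)))
```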