Michael E. Sander
Loss Functions and Operators Generated by f-Divergences
The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language modeling. It is associated with the Ku…
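As a reminder of the basic object the abstract starts from, here is a minimal NumPy sketch of the multiclass logistic (cross-entropy) loss for a single example; names and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def cross_entropy_loss(logits, target):
    """Multiclass logistic (cross-entropy) loss for one example.

    logits: score vector of shape (num_classes,)
    target: integer index of the correct class
    """
    # Subtract the max before exponentiating for numerical stability.
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    # Negative log-probability assigned to the correct class.
    return -log_probs[target]

# Example: three classes, correct class is index 2.
print(cross_entropy_loss(np.array([1.0, 2.0, 3.0]), 2))  # ~0.4076
```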
Joint Learning of Energy-based Models and their Partition Function
Energy-based models (EBMs) offer a flexible framework for parameterizing probability distributions using neural networks. However, learning EBMs by exact maximum likelihood estimation (MLE) is generally intractable, due to the need to comp…
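For context on why the partition function blocks exact MLE, these are the standard energy-based parameterization and its log-likelihood gradient (textbook identities, not the paper's specific method):

\[
p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z(\theta)},
\qquad
Z(\theta) = \int e^{-E_\theta(x)}\,dx,
\]
\[
\nabla_\theta \log p_\theta(x)
= -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\!\left[\nabla_\theta E_\theta(x')\right],
\]

so exact maximum likelihood requires either the intractable \(\log Z(\theta)\) or samples from \(p_\theta\) itself.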
Vers un apprentissage plus profond : réseaux résiduels, équations différentielles neuronales et transformers, en théorie et en pratique (Towards deeper learning: residual networks, neural differential equations and transformers, in theory and practice)
This PhD thesis presents contributions to the field of deep learning. From convolutional ResNets to Transformers, residual connections are ubiquitous in state-of-the-art deep learning models. The continuous depth analogues of residual netw…
Towards Understanding the Universality of Transformers for Next-Token Prediction
Causal Transformers are trained to predict the next token for a given context. While it is widely accepted that self-attention is crucial for encoding the causal structure of sequences, the precise underlying mechanism behind this in-conte…
How do Transformers perform In-Context Autoregressive Learning?
Transformers have achieved state-of-the-art performance in language modeling tasks. However, the reasons behind their tremendous success are still unclear. In this paper, towards a better understanding, we train a Transformer model on a si…
Implicit regularization of deep residual networks towards neural ODEs
Residual neural networks are state-of-the-art deep learning models. Their continuous-depth analogs, neural ordinary differential equations (ODEs), are also widely used. Despite their success, the link between the discrete and continuous mod…
Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective
The top-k operator returns a sparse vector, where the non-zero values correspond to the k largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate in neural networks trained end-to-…
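To make the object concrete, here is a naive NumPy sketch of the "hard" top-k operator the abstract refers to, including the discontinuity that makes it hard to train through; this is illustrative only, not the paper's differentiable relaxation.

```python
import numpy as np

def hard_top_k(x, k):
    """Return a vector equal to x on its k largest entries and 0 elsewhere."""
    out = np.zeros_like(x)
    idx = np.argsort(x)[-k:]   # indices of the k largest values
    out[idx] = x[idx]
    return out

# Tiny perturbations near a tie flip which entries survive:
# this is the discontinuity that hinders end-to-end training.
print(hard_top_k(np.array([0.1, 2.0, 1.0, 1.0 + 1e-6]), 2))  # keeps 2.0 and 1.000001
print(hard_top_k(np.array([0.1, 2.0, 1.0, 1.0 - 1e-6]), 2))  # keeps 2.0 and 1.0
```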
Vision Transformers provably learn spatial structure
Vision Transformers (ViTs) have achieved performance comparable or superior to that of Convolutional Neural Networks (CNNs) in computer vision. This empirical breakthrough is even more remarkable since, in contrast to CNNs, ViTs do not embed any…
Do Residual Neural Networks discretize Neural Ordinary Differential Equations?
Neural Ordinary Differential Equations (Neural ODEs) are the continuous analog of Residual Neural Networks (ResNets). We investigate whether the discrete dynamics defined by a ResNet are close to the continuous one of a Neural ODE. We firs…
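For intuition on the discrete/continuous correspondence, a toy sketch (illustrative only, not the paper's construction): a residual update with step size 1/depth is exactly an explicit Euler scheme for the ODE dx/dt = f(x) on [0, 1].

```python
import numpy as np

def f(x):
    # Toy residual function; a real ResNet block would be a small neural network.
    return -x

def resnet_forward(x, depth):
    """Residual updates x_{n+1} = x_n + (1/depth) * f(x_n), i.e. Euler steps on [0, 1]."""
    for _ in range(depth):
        x = x + (1.0 / depth) * f(x)
    return x

x0 = np.array([1.0])
print(resnet_forward(x0, 10))    # ~0.3487
print(resnet_forward(x0, 1000))  # ~0.3677, approaching exp(-1) ≈ 0.3679 as depth grows
```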
Sinkformers: Transformers with Doubly Stochastic Attention
Attention-based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise s…
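For reference, a minimal sketch of Sinkhorn normalization, the standard way to turn a positive matrix (e.g. the exponentiated attention scores) into an approximately doubly stochastic one by alternating row and column normalizations; this is illustrative, not the Sinkformer implementation.

```python
import numpy as np

def sinkhorn(scores, n_iters=20):
    """Alternate row/column normalization of exp(scores) until the
    matrix is approximately doubly stochastic."""
    K = np.exp(scores)
    for _ in range(n_iters):
        K = K / K.sum(axis=1, keepdims=True)  # make rows sum to 1
        K = K / K.sum(axis=0, keepdims=True)  # make columns sum to 1
    return K

scores = np.random.randn(4, 4)
A = sinkhorn(scores)
print(A.sum(axis=1), A.sum(axis=0))  # both close to 1 after convergence
```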
Momentum Residual Neural Networks
The training of deep residual neural networks (ResNets) with backpropagation has a memory cost that increases linearly with respect to the depth of the network. A way to circumvent this issue is to use reversible architectures. In this pap…
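As a sketch of the idea (based on the momentum residual update described in the paper; the function f, the momentum value, and the variable names are illustrative): adding a velocity variable makes the forward rule algebraically invertible, so activations can be recomputed during the backward pass instead of being stored, removing the memory cost that grows with depth.

```python
import numpy as np

GAMMA = 0.9  # momentum parameter in (0, 1)

def f(x):
    # Toy residual function; in practice a neural network block.
    return np.tanh(x)

def forward_step(x, v):
    v_next = GAMMA * v + (1.0 - GAMMA) * f(x)
    x_next = x + v_next
    return x_next, v_next

def backward_step(x_next, v_next):
    # Invert the update exactly: no intermediate activations need to be stored.
    x = x_next - v_next
    v = (v_next - (1.0 - GAMMA) * f(x)) / GAMMA
    return x, v

x, v = np.array([0.5]), np.array([0.0])
x1, v1 = forward_step(x, v)
print(backward_step(x1, v1))  # recovers (0.5, 0.0) up to floating-point error
```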