Daniel Soudry
When Diffusion Models Memorize: Inductive Biases in Probability Flow of Minimum-Norm Shallow Neural Nets
While diffusion models generate high-quality images via probability flow, the theoretical understanding of this process remains incomplete. A key question is when probability flow converges to training samples or more general points on the…
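For context, the probability flow mentioned here is, in the standard score-based diffusion formulation (general background, not this paper's result), the deterministic ODE paired with a forward SDE $dx = f(x,t)\,dt + g(t)\,dw$:

    $\frac{dx}{dt} = f(x,t) - \tfrac{1}{2}\, g(t)^{2}\, \nabla_x \log p_t(x)$

integrated backwards in time from noise; the paper's question is when this flow, with the score realized by a minimum-norm shallow network, lands exactly on training samples rather than more general points.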
Optimal Rates in Continual Linear Regression via Increasing Regularization
We study realizable continual linear regression under random task orderings, a common setting for developing continual learning theory. In this setup, the worst-case expected loss after $k$ learning iterations admits a lower bound of $Ω(1/…
Temperature is All You Need for Generalization in Langevin Dynamics and other Markov Processes
We analyze the generalization gap (gap between the training and test errors) when training a potentially over-parametrized model using a Markovian stochastic training algorithm, initialized from some distribution $θ_0 \sim p_0$. We focus o…
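As one concrete instance of such a Markovian training algorithm, here is a minimal Langevin dynamics (SGLD) step at temperature T; this is a generic sketch, not the paper's code, and grad_loss, eta, and temperature are placeholder names.

    import numpy as np

    def langevin_step(theta, grad_loss, eta=1e-3, temperature=1e-4, rng=None):
        # One update: theta - eta * grad L(theta) + sqrt(2 * eta * T) * xi, with xi ~ N(0, I).
        if rng is None:
            rng = np.random.default_rng(0)
        noise = rng.standard_normal(theta.shape)
        return theta - eta * grad_loss(theta) + np.sqrt(2.0 * eta * temperature) * noise

The temperature controls the scale of the injected Gaussian noise.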
FP4 All the Way: Fully Quantized Training of LLMs
We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We ex…
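For illustration only (a sketch under simple assumptions, not the paper's recipe): round-to-nearest quantization onto the FP4 E2M1 value grid with a single per-tensor scale. Practical FQT schemes typically use finer-grained scaling, but this shows the value set that weights, activations, and gradients are mapped to.

    import numpy as np

    # Non-negative FP4 (E2M1) values; the full representable grid is symmetric about zero.
    FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    FP4_GRID = np.concatenate([-FP4_E2M1[:0:-1], FP4_E2M1])

    def quantize_fp4(x):
        scale = np.abs(x).max() / 6.0 + 1e-12            # fit the tensor into [-6, 6]
        idx = np.abs(x[..., None] / scale - FP4_GRID).argmin(axis=-1)
        return FP4_GRID[idx] * scale                     # dequantize back to the original range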
From Continual Learning to SGD and Back: Better Rates for Continual Linear Models
We study the common continual learning setup where an overparameterized model is sequentially fitted to a set of jointly realizable tasks. We analyze forgetting, defined as the loss on previously seen tasks, after $k$ iterations. For conti…
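An illustrative toy of this setup (a sketch, not the paper's analysis): fit an overparameterized linear model to each task in turn with the minimum-norm update, then report forgetting as the average loss over the tasks seen so far.

    import numpy as np

    def continual_least_squares(tasks):
        # tasks: list of (X, y) with X of shape (n_i, d); joint realizability is assumed.
        d = tasks[0][0].shape[1]
        w = np.zeros(d)
        for X, y in tasks:
            # minimum-norm correction: the solution of X w' = y closest to the current w
            w = w + np.linalg.pinv(X) @ (y - X @ w)
        forgetting = np.mean([np.mean((X @ w - y) ** 2) for X, y in tasks])
        return w, forgetting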
The Implicit Bias of Gradient Descent on Separable Multiclass Data
Implicit bias describes the phenomenon where optimization-based training algorithms, without explicit regularization, show a preference for simple estimators even when more complex estimators have equal objective values. Multiple works hav…
Provable Tempered Overfitting of Minimal Nets and Typical Nets
We study the overfitting behavior of fully connected deep Neural Networks (NNs) with binary weights fitted to perfectly classify a noisy training set. We consider interpolation using both the smallest NN (having the minimal number of weigh…
Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks
Recent methods aim to merge neural networks (NNs) with identical architectures trained on different tasks into a single multi-task model. While most works focus on the simpler setup of merging NNs initialized from a common pre-trained netw…
Scaling FP8 training to trillion-token LLMs
We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens -- a 20-fold increase over previous limits. Through these extended training runs, we uncover critical instabilities in FP8 training…
Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes
We study the generalization of two-layer ReLU neural networks in a univariate nonparametric regression problem with noisy labels. This is a problem where kernels (e.g., NTK) are provably sub-optimal and benign overfitting does not ha…
How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers
Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one …
Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators
The majority of the research on the quantization of Deep Neural Networks (DNNs) is focused on reducing the precision of tensors visible to high-level frameworks (e.g., weights, activations, and gradients). However, current hardware still r…
The Joint Effect of Task Similarity and Overparameterization on Catastrophic Forgetting -- An Analytical Model
In continual learning, catastrophic forgetting is affected by multiple aspects of the tasks. Previous works have analyzed separately how forgetting is affected by either task similarity or overparameterization. In contrast, our paper exami…
How do Minimum-Norm Shallow Denoisers Look in Function Space?
Neural network (NN) denoisers are an essential building block in many common tasks, ranging from image reconstruction to image generation. However, the success of these models is not well understood from a theoretical perspective. In this …
The Implicit Bias of Minima Stability in Multivariate Shallow ReLU Networks
We study the type of solutions to which stochastic gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss. Our results are based on a dynamical stability analysis. In the univa…
DropCompute: simple and more robust distributed synchronous training via compute variance reduction
Background: Distributed training is essential for large scale training of deep neural networks (DNNs). The dominant methods for large scale DNN training are synchronous (e.g. All-Reduce), but these require waiting for all workers in each s…
Continual Learning in Linear Classification on Separable Data
We analyze continual learning on a sequence of separable linear classification tasks with binary labels. We show theoretically that learning with weak regularization reduces to solving a sequential max-margin problem, corresponding to a sp…
Explore to Generalize in Zero-Shot RL
We study zero-shot generalization in reinforcement learning -- optimizing a policy on a set of training tasks to perform well on a similar but unseen test task. To mitigate overfitting, previous work explored different notions of invariance t…
Gradient Descent Monotonically Decreases the Sharpness of Gradient Flow Solutions in Scalar Networks and Beyond
Recent research shows that when Gradient Descent (GD) is applied to neural networks, the loss almost never decreases monotonically. Instead, the loss oscillates as gradient descent converges to its "Edge of Stability" (EoS). Here, we fin…
Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations
Although CNNs are believed to be invariant to translations, recent works have shown this is not the case, due to aliasing effects that stem from downsampling layers. The existing architectural solutions to prevent aliasing are partial sinc…
The Role of Codeword-to-Class Assignments in Error-Correcting Codes: An Empirical Study
Error-correcting codes (ECC) are used to reduce multiclass classification tasks to multiple binary classification subproblems. In ECC, classes are represented by the rows of a binary matrix, corresponding to codewords in a codebook. Codebo…
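To make the reduction concrete, a generic ECOC decoding sketch (textbook technique, not this paper's contribution): an assignment maps each class to a codebook row, per-bit binary classifiers produce a predicted codeword, and the nearest codeword in Hamming distance gives the class. All names here are placeholders.

    import numpy as np

    def ecoc_decode(binary_preds, codebook, assignment):
        # binary_preds: (n_samples, n_bits) in {0, 1}; codebook: (n_codewords, n_bits);
        # assignment[c] is the index of the codeword assigned to class c.
        codewords = codebook[assignment]                          # (n_classes, n_bits)
        dists = (binary_preds[:, None, :] != codewords[None]).sum(axis=-1)
        return dists.argmin(axis=1)                               # predicted class per sample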
Regularization Guarantees Generalization in Bayesian Reinforcement Learning through Algorithmic Stability
In the Bayesian reinforcement learning (RL) setting, a prior distribution over the unknown problem parameters -- the rewards and transitions -- is assumed, and a policy that optimizes the (posterior) expected return is sought. A common app…
How catastrophic can catastrophic forgetting be in linear regression?
To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after traini…
Minimum Variance Unbiased N:M Sparsity for the Neural Gradients
In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix multiply (GEMM) by up to 2x, and doubles throughput by skipping computation of zero values. So far, it was mainly used only to prune weig…
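A minimal sketch of the N:M structure itself (magnitude-based selection in numpy, chosen for clarity; the paper is about unbiased, minimum-variance variants applied to the neural gradients):

    import numpy as np

    def nm_sparsify(x, n=2, m=4):
        # Keep the n largest-magnitude entries in every contiguous block of m values.
        # Assumes x.size is divisible by m; padding is omitted for brevity.
        blocks = x.reshape(-1, m)
        keep = np.argsort(np.abs(blocks), axis=1)[:, -n:]         # indices of the n largest
        mask = np.zeros_like(blocks, dtype=bool)
        np.put_along_axis(mask, keep, True, axis=1)
        return (blocks * mask).reshape(x.shape)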
Accurate Neural Training with 4-bit Matrix Multiplications at Standard Formats
Quantization of the weights and activations is one of the main methods to reduce the computational footprint of Deep Neural Networks (DNNs) training. Current methods enable 4-bit quantization of the forward phase. However, this constitutes…
Training of quantized deep neural networks using a magnetic tunnel junction-based synapse
Quantized neural networks (QNNs) are being actively researched as a solution for the computational complexity and memory intensity of deep neural networks. This has sparked efforts to develop algorithms that support both inference and trai…
Task-Agnostic Continual Learning Using Online Variational Bayes with Fixed-Point Updates
Catastrophic forgetting is the notorious vulnerability of neural networks to the changes in the data distribution during learning. This phenomenon has long been considered a major obstacle for using learning agents in realistic continual l…
Physics-Aware Downsampling with Deep Learning for Scalable Flood Modeling
Background: Floods are the most common natural disaster in the world, affecting the lives of hundreds of millions. Flood forecasting is therefore a vitally important endeavor, typically achieved using physical water flow simulations, which…