Daniel Soudry
When Diffusion Models Memorize: Inductive Biases in Probability Flow of Minimum-Norm Shallow Neural Nets
While diffusion models generate high-quality images via probability flow, the theoretical understanding of this process remains incomplete. A key question is when probability flow converges to training samples or more general points on the…
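For context, the probability flow mentioned here is, in the standard score-based diffusion formulation (general background, not this paper's result), the deterministic ODE paired with a forward SDE $dx = f(x,t)\,dt + g(t)\,dw$:

    $\frac{dx}{dt} = f(x,t) - \tfrac{1}{2}\, g(t)^{2}\, \nabla_x \log p_t(x)$

integrated backwards in time from noise; the paper's question is when this flow, with the score realized by a minimum-norm shallow network, lands exactly on training samples rather than more general points.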
Optimal Rates in Continual Linear Regression via Increasing Regularization
We study realizable continual linear regression under random task orderings, a common setting for developing continual learning theory. In this setup, the worst-case expected loss after $k$ learning iterations admits a lower bound of $Ω(1/…
Temperature is All You Need for Generalization in Langevin Dynamics and other Markov Processes
We analyze the generalization gap (gap between the training and test errors) when training a potentially over-parametrized model using a Markovian stochastic training algorithm, initialized from some distribution $θ_0 \sim p_0$. We focus o…
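As one concrete instance of such a Markovian training algorithm, here is a minimal Langevin dynamics (SGLD) step at temperature T; this is a generic sketch, not the paper's code, and grad_loss, eta, and temperature are placeholder names.

    import numpy as np

    def langevin_step(theta, grad_loss, eta=1e-3, temperature=1e-4, rng=None):
        # One update: theta - eta * grad L(theta) + sqrt(2 * eta * T) * xi, with xi ~ N(0, I).
        if rng is None:
            rng = np.random.default_rng(0)
        noise = rng.standard_normal(theta.shape)
        return theta - eta * grad_loss(theta) + np.sqrt(2.0 * eta * temperature) * noise

The temperature controls the scale of the injected Gaussian noise.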
FP4 All the Way: Fully Quantized Training of LLMs
We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We ex…
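For illustration only (a sketch under simple assumptions, not the paper's recipe): round-to-nearest quantization onto the FP4 E2M1 value grid with a single per-tensor scale. Practical FQT schemes typically use finer-grained scaling, but this shows the value set that weights, activations, and gradients are mapped to.

    import numpy as np

    # Non-negative FP4 (E2M1) values; the full representable grid is symmetric about zero.
    FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    FP4_GRID = np.concatenate([-FP4_E2M1[:0:-1], FP4_E2M1])

    def quantize_fp4(x):
        scale = np.abs(x).max() / 6.0 + 1e-12            # fit the tensor into [-6, 6]
        idx = np.abs(x[..., None] / scale - FP4_GRID).argmin(axis=-1)
        return FP4_GRID[idx] * scale                     # dequantize back to the original range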
From Continual Learning to SGD and Back: Better Rates for Continual Linear Models
We study the common continual learning setup where an overparameterized model is sequentially fitted to a set of jointly realizable tasks. We analyze forgetting, defined as the loss on previously seen tasks, after $k$ iterations. For conti…
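An illustrative toy of this setup (a sketch, not the paper's analysis): fit an overparameterized linear model to each task in turn with the minimum-norm update, then report forgetting as the average loss over the tasks seen so far.

    import numpy as np

    def continual_least_squares(tasks):
        # tasks: list of (X, y) with X of shape (n_i, d); joint realizability is assumed.
        d = tasks[0][0].shape[1]
        w = np.zeros(d)
        for X, y in tasks:
            # minimum-norm correction: the solution of X w' = y closest to the current w
            w = w + np.linalg.pinv(X) @ (y - X @ w)
        forgetting = np.mean([np.mean((X @ w - y) ** 2) for X, y in tasks])
        return w, forgetting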
The Implicit Bias of Gradient Descent on Separable Multiclass Data
Implicit bias describes the phenomenon where optimization-based training algorithms, without explicit regularization, show a preference for simple estimators even when more complex estimators have equal objective values. Multiple works hav…
Provable Tempered Overfitting of Minimal Nets and Typical Nets
We study the overfitting behavior of fully connected deep Neural Networks (NNs) with binary weights fitted to perfectly classify a noisy training set. We consider interpolation using both the smallest NN (having the minimal number of weigh…
Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks
Recent methods aim to merge neural networks (NNs) with identical architectures trained on different tasks into a single multi-task model. While most works focus on the simpler setup of merging NNs initialized from a common pre-trained netw…
Scaling FP8 training to trillion-token LLMs
We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens -- a 20-fold increase over previous limits. Through these extended training runs, we uncover critical instabilities in FP8 training…
Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes
We study the generalization of two-layer ReLU neural networks in a univariate nonparametric regression problem with noisy labels. This is a problem where kernels (e.g., NTK) are provably sub-optimal and benign overfitting does not ha…
How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers
Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one …
Towards Cheaper Inference in Deep Networks with Lower Bit-Width Accumulators
The majority of the research on the quantization of Deep Neural Networks (DNNs) is focused on reducing the precision of tensors visible to high-level frameworks (e.g., weights, activations, and gradients). However, current hardware still r…
The Joint Effect of Task Similarity and Overparameterization on Catastrophic Forgetting -- An Analytical Model
In continual learning, catastrophic forgetting is affected by multiple aspects of the tasks. Previous works have analyzed separately how forgetting is affected by either task similarity or overparameterization. In contrast, our paper exami…
How do Minimum-Norm Shallow Denoisers Look in Function Space?
Neural network (NN) denoisers are an essential building block in many common tasks, ranging from image reconstruction to image generation. However, the success of these models is not well understood from a theoretical perspective. In this …
The Implicit Bias of Minima Stability in Multivariate Shallow ReLU Networks
We study the type of solutions to which stochastic gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss. Our results are based on a dynamical stability analysis. In the univa…
DropCompute: simple and more robust distributed synchronous training via compute variance reduction
Background: Distributed training is essential for large scale training of deep neural networks (DNNs). The dominant methods for large scale DNN training are synchronous (e.g. All-Reduce), but these require waiting for all workers in each s…
Continual Learning in Linear Classification on Separable Data
We analyze continual learning on a sequence of separable linear classification tasks with binary labels. We show theoretically that learning with weak regularization reduces to solving a sequential max-margin problem, corresponding to a sp…
Explore to Generalize in Zero-Shot RL
We study zero-shot generalization in reinforcement learning -- optimizing a policy on a set of training tasks to perform well on a similar but unseen test task. To mitigate overfitting, previous work explored different notions of invariance t…
Gradient Descent Monotonically Decreases the Sharpness of Gradient Flow Solutions in Scalar Networks and Beyond
Recent research shows that when Gradient Descent (GD) is applied to neural networks, the loss almost never decreases monotonically. Instead, the loss oscillates as gradient descent converges to its "Edge of Stability" (EoS). Here, we fin…
Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations
Although CNNs are believed to be invariant to translations, recent works have shown this is not the case, due to aliasing effects that stem from downsampling layers. The existing architectural solutions to prevent aliasing are partial sinc…
The Role of Codeword-to-Class Assignments in Error-Correcting Codes: An Empirical Study
Error-correcting codes (ECC) are used to reduce multiclass classification tasks to multiple binary classification subproblems. In ECC, classes are represented by the rows of a binary matrix, corresponding to codewords in a codebook. Codebo…
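To make the reduction concrete, a generic ECOC decoding sketch (textbook technique, not this paper's contribution): an assignment maps each class to a codebook row, per-bit binary classifiers produce a predicted codeword, and the nearest codeword in Hamming distance gives the class. All names here are placeholders.

    import numpy as np

    def ecoc_decode(binary_preds, codebook, assignment):
        # binary_preds: (n_samples, n_bits) in {0, 1}; codebook: (n_codewords, n_bits);
        # assignment[c] is the index of the codeword assigned to class c.
        codewords = codebook[assignment]                          # (n_classes, n_bits)
        dists = (binary_preds[:, None, :] != codewords[None]).sum(axis=-1)
        return dists.argmin(axis=1)                               # predicted class per sample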
Regularization Guarantees Generalization in Bayesian Reinforcement Learning through Algorithmic Stability
In the Bayesian reinforcement learning (RL) setting, a prior distribution over the unknown problem parameters -- the rewards and transitions -- is assumed, and a policy that optimizes the (posterior) expected return is sought. A common app…
How catastrophic can catastrophic forgetting be in linear regression?
To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after traini…
Minimum Variance Unbiased N:M Sparsity for the Neural Gradients
In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix multiply (GEMM) by up to 2x, and doubles throughput by skipping computation of zero values. So far, it was mainly used only to prune weig…
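A minimal sketch of the N:M structure itself (magnitude-based selection in numpy, chosen for clarity; the paper is about unbiased, minimum-variance variants applied to the neural gradients):

    import numpy as np

    def nm_sparsify(x, n=2, m=4):
        # Keep the n largest-magnitude entries in every contiguous block of m values.
        # Assumes x.size is divisible by m; padding is omitted for brevity.
        blocks = x.reshape(-1, m)
        keep = np.argsort(np.abs(blocks), axis=1)[:, -n:]         # indices of the n largest
        mask = np.zeros_like(blocks, dtype=bool)
        np.put_along_axis(mask, keep, True, axis=1)
        return (blocks * mask).reshape(x.shape)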
Accurate Neural Training with 4-bit Matrix Multiplications at Standard Formats
Quantization of the weights and activations is one of the main methods to reduce the computational footprint of Deep Neural Networks (DNNs) training. Current methods enable 4-bit quantization of the forward phase. However, this constitutes…
Training of quantized deep neural networks using a magnetic tunnel junction-based synapse
Quantized neural networks (QNNs) are being actively researched as a solution for the computational complexity and memory intensity of deep neural networks. This has sparked efforts to develop algorithms that support both inference and trai…
Task-Agnostic Continual Learning Using Online Variational Bayes with Fixed-Point Updates
Catastrophic forgetting is the notorious vulnerability of neural networks to the changes in the data distribution during learning. This phenomenon has long been considered a major obstacle for using learning agents in realistic continual l…
Physics-Aware Downsampling with Deep Learning for Scalable Flood Modeling
Background: Floods are the most common natural disaster in the world, affecting the lives of hundreds of millions. Flood forecasting is therefore a vitally important endeavor, typically achieved using physical water flow simulations, which…