Guy Gur-Ari
Towards Understanding Inductive Bias in Transformers: A View From Infinity
We study inductive bias in Transformers in the infinitely over-parameterized Gaussian process limit and argue transformers tend to be biased towards more permutation symmetric functions in sequence space. We show that the representation th…
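The symmetry claim can be made concrete with a small numerical check. The sketch below is an illustration, not the paper's construction: it estimates how much of a toy sequence function's variance lies in its permutation-symmetric component by averaging the function over random reorderings of each input. The toy function f and the variance-ratio estimator are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "sequence function" standing in for a trained Transformer readout:
# a permutation-symmetric part (sum of tokens) plus an order-dependent part.
def f(x):
    return np.sum(x) + 0.5 * x[0] - 0.5 * x[-1]

def symmetric_fraction(f, seq_len=8, n_inputs=200, n_perms=200):
    """Estimate the fraction of f's variance carried by its
    permutation-symmetric component, obtained here by Monte Carlo
    averaging of f over random permutations of each input.
    This estimator is a sketch; the paper works with exact symmetric
    decompositions in the Gaussian-process limit."""
    xs = rng.normal(size=(n_inputs, seq_len))
    f_vals = np.array([f(x) for x in xs])
    f_sym = np.array([
        np.mean([f(rng.permutation(x)) for _ in range(n_perms)]) for x in xs
    ])
    return np.var(f_sym) / np.var(f_vals)

print(f"symmetric variance fraction: {symmetric_fraction(f):.2f}")
```

For the mostly-symmetric toy above the fraction comes out close to 1; a strongly order-dependent function would score much lower.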
PaLM 2 Technical Report
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of obje…
Exploring Length Generalization in Large Language Models
The ability to extrapolate from short problem instances to longer ones is an important form of out-of-distribution generalization in reasoning tasks, and is crucial when learning from datasets where longer problem instances are rare. These…
Solving Quantitative Reasoning Problems with Language Models
Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such …
PaLM: Scaling Language Modeling with Pathways
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model t…
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Large pre-trained language models perform remarkably well on tasks that can be done "in one pass", such as generating realistic text or synthesizing computer programs. However, they struggle with tasks that require unbounded multi-step com…
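The scratchpad idea is easy to picture with a concrete training target. The sketch below builds such a target for multi-digit addition, writing the column-by-column intermediate computation into a scratchpad region before the final answer; the tags and exact formatting are illustrative assumptions, not the paper's verbatim format.

```python
def addition_with_scratchpad(a: int, b: int) -> str:
    """Build a training example where the column-by-column addition is
    written out in a scratchpad before the final answer, instead of
    asking the model to produce the answer 'in one pass'."""
    lines = [f"Input: {a} + {b}", "<scratch>"]
    da, db = str(a)[::-1], str(b)[::-1]   # least-significant digit first
    carry, out_digits = 0, []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        carry_in = carry
        total = x + y + carry_in
        digit, carry = total % 10, total // 10
        out_digits.append(str(digit))
        lines.append(f"{x} + {y} + {carry_in} = {total}; write {digit}, carry {carry}")
    if carry:
        out_digits.append(str(carry))
    lines.append("</scratch>")
    lines.append(f"Target: {''.join(reversed(out_digits))}")
    return "\n".join(lines)

print(addition_with_scratchpad(478, 659))
```

Training or prompting on targets of this shape lets the model spend one decoding step per intermediate operation rather than compressing the whole computation into a single forward pass.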
Are wider nets better given the same number of parameters?
Empirical studies demonstrate that the performance of neural networks improves with increasing number of parameters. In most of these studies, the number of parameters is increased by increasing the network width. This begs the question: I…
On the training dynamics of deep networks with $L_2$ regularization
We study the role of $L_2$ regularization in deep learning, and uncover simple relations between the performance of the model, the $L_2$ coefficient, the learning rate, and the number of training steps. These empirical relations hold when …
On the asymptotics of wide networks with polynomial activations
We consider an existing conjecture addressing the asymptotic behavior of neural networks in the large width limit. The results that follow from this conjecture include tight bounds on the behavior of wide networks during stochastic gradien…
The large learning rate phase of deep learning: the catapult mechanism
The choice of initial learning rate can have a profound effect on the performance of deep networks. We present a class of neural networks with solvable training dynamics, and confirm their predictions empirically in practical deep learning…
Asymptotics of Wide Networks from Feynman Diagrams
Understanding the asymptotic behavior of wide networks is of considerable interest. In this work, we present a general method for analyzing this large width behavior. The method is an adaptation of Feynman diagrams, a standard tool for com…
Wider Networks Learn Better Features
Transferability of learned features between tasks can massively reduce the cost of training a neural network on a novel task. We investigate the effect of network width on learned features using activation atlases --- a visualization techn…
Gradient Descent Happens in a Tiny Subspace
We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by a few top eigenvectors of the Hessian (equal to the n…
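The finding is straightforward to probe numerically: take the top-k eigenvectors of the Hessian and measure the fraction of the gradient's squared norm that falls in their span. The sketch below does this for a small analytic toy loss so the Hessian can be formed exactly (for real networks one would use Hessian-vector products instead); the loss, dimensions, and choice of k are assumptions for illustration, and the overlap need not be large for this toy, whereas the paper reports it becomes large for trained deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)

dim, k = 50, 5

# Toy loss with an explicit gradient and Hessian: a PSD quadratic
# plus a small quartic term so the Hessian depends on the parameters.
A = rng.normal(size=(dim, dim))
H_quad = A @ A.T / dim
b = rng.normal(size=dim)

def grad(theta):
    return H_quad @ theta + b + 0.1 * theta**3

def hessian(theta):
    return H_quad + np.diag(0.3 * theta**2)

theta = rng.normal(size=dim)
g = grad(theta)
H = hessian(theta)

# Top-k eigenvectors of the symmetric Hessian (eigh sorts ascending).
eigvals, eigvecs = np.linalg.eigh(H)
top_k = eigvecs[:, -k:]

# Fraction of the gradient's squared norm captured by that subspace.
overlap = np.linalg.norm(top_k.T @ g) ** 2 / np.linalg.norm(g) ** 2
print(f"fraction of gradient in top-{k} Hessian eigenspace: {overlap:.3f}")
```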