Taiji Suzuki
Mamba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning
Mamba, a recently proposed linear-time sequence model, has attracted significant attention for its computational efficiency and strong empirical performance. However, a rigorous theoretical understanding of its underlying mechanisms remain…
Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression
State-space models (SSMs), particularly Mamba, emerge as an efficient Transformer alternative with linear complexity for long-sequence modeling. Recent empirical works demonstrate Mamba's in-context learning (ICL) capabilities competitive …
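For reference, the online gradient descent recursion that the title refers to, applied to in-context linear-regression demonstrations $(x_1, y_1), \ldots, (x_N, y_N)$ with squared loss, takes the standard form (an illustrative recap, not the paper's exact construction):
$$w_{i+1} = w_i - \eta\,\big(\langle w_i, x_i\rangle - y_i\big)\,x_i, \qquad \widehat{y}_{\mathrm{query}} = \langle w_{N+1}, x_{\mathrm{query}}\rangle.$$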
Hessian-guided Perturbed Wasserstein Gradient Flows for Escaping Saddle Points
Wasserstein gradient flow (WGF) is a common method to perform optimization over the space of probability measures. While WGF is guaranteed to converge to a first-order stationary point, for nonconvex functionals the converged solution does…
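As a generic reminder (not specific to this work), the Wasserstein gradient flow of a functional $F$ over probability measures evolves $\mu_t$ by the continuity equation
$$\partial_t \mu_t = \nabla\cdot\Big(\mu_t\,\nabla\tfrac{\delta F}{\delta \mu}(\mu_t)\Big),$$
so first-order stationarity means $\nabla\tfrac{\delta F}{\delta \mu}(\mu) = 0$ holds $\mu$-almost everywhere.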
Generalization Bound of Gradient Flow through Training Trajectory and Data-dependent Kernel
Gradient-based optimization methods have shown remarkable empirical success, yet their theoretical generalization properties remain only partially understood. In this paper, we establish a generalization bound for gradient flow that aligns…
Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning
Mixture of Experts (MoE), an ensemble of specialized models equipped with a router that dynamically distributes each input to appropriate experts, has achieved successful results in the field of machine learning. However, theoretical under…
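As a generic illustration of the architecture (not the paper's specific model), a softmax-gated mixture of $K$ experts $f_1, \ldots, f_K$ computes
$$y(x) = \sum_{k=1}^{K} \pi_k(x)\, f_k(x), \qquad \pi_k(x) = \frac{\exp(\langle w_k, x\rangle)}{\sum_{j=1}^{K} \exp(\langle w_j, x\rangle)},$$
where the router weights $w_1, \ldots, w_K$ determine how strongly each input is dispatched to each expert.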
On the Role of Label Noise in the Feature Learning Process
Deep learning with noisy labels presents significant challenges. In this work, we theoretically characterize the role of label noise from a feature learning perspective. Specifically, we consider a signal-noise data distribution, where eac…
Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models
Aligning large language models (LLMs) with human preferences is crucial for safe deployment, yet existing methods assume specific preference models like the Bradley-Terry model. This assumption leads to statistical inconsistency, where more da…
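For context, the Bradley-Terry model referenced here assumes that, given a prompt $x$ and a latent reward function $r$, the probability of preferring response $y_1$ over $y_2$ is
$$P(y_1 \succ y_2 \mid x) = \sigma\big(r(x, y_1) - r(x, y_2)\big), \qquad \sigma(t) = \frac{1}{1 + e^{-t}};$$
this is the standard textbook definition, not a detail taken from the paper.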
When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars
The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginnin…
Propagation of Chaos for Mean-Field Langevin Dynamics and its Application to Model Ensemble
Mean-field Langevin dynamics (MFLD) is an optimization method derived by taking the mean-field limit of noisy gradient descent for two-layer neural networks in the mean-field regime. Recently, the propagation of chaos (PoC) for MFLD has ga…
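As a generic recap, MFLD can be written as the McKean-Vlasov stochastic differential equation
$$dX_t = -\nabla\frac{\delta F}{\delta \mu}(\mu_t)(X_t)\,dt + \sqrt{2\lambda}\,dW_t, \qquad \mu_t = \mathrm{Law}(X_t),$$
whose finite-particle discretization is the noisy gradient descent mentioned above; propagation of chaos quantifies the gap between the $N$-particle system and this mean-field limit.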
Direct Distributional Optimization for Provable Alignment of Diffusion Models
We introduce a novel alignment method for diffusion models from a distributional optimization perspective while providing rigorous convergence guarantees. We first formulate the problem as a generic regularized loss minimization over probabil…
Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation
A key paradigm to improve the reasoning capabilities of large language models (LLMs) is to allocate more inference-time compute to search against a verifier or reward model. This process can then be utilized to refine the pretrained model …
On the Comparison between Multi-modal and Single-modal Contrastive Learning
Multi-modal contrastive learning with language supervision has presented a paradigm shift in modern machine learning. By pre-training on a web-scale dataset, multi-modal contrastive learning can learn high-quality representations that exhi…
Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning
Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergence capabilities. Existing empirical studies have revealed a strong connection between these LLMs' impressive emergence abilities and their…
Pretrained transformer efficiently learns low-dimensional target functions in-context
Transformers can efficiently learn in-context from example demonstrations. Most existing theoretical analyses studied the in-context learning (ICL) ability of transformers for linear function classes, where it is typically shown that the m…
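In the standard ICL formulation studied in this line of work (a generic recap), the transformer receives a prompt of demonstrations together with a query and is trained so that its output approximates the query label:
$$P = \big(x_1, f(x_1), \ldots, x_N, f(x_N), x_{\mathrm{query}}\big), \qquad \mathrm{TF}_\theta(P) \approx f(x_{\mathrm{query}}).$$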
Transformers Provably Solve Parity Efficiently with Chain of Thought
This work provides the first theoretical analysis of training transformers to solve complex problems by recursively generating intermediate states, analogous to fine-tuning for chain-of-thought (CoT) reasoning. We consider training a one-l…
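For concreteness (a standard definition rather than the paper's construction), the parity target on $x \in \{\pm 1\}^d$ is $f(x) = \prod_{i=1}^{d} x_i$, and chain-of-thought style intermediate states can be taken as the partial parities
$$s_0 = 1, \qquad s_j = s_{j-1}\, x_j \quad (j = 1, \ldots, d), \qquad s_d = f(x),$$
so each step only combines the previous state with one new coordinate.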
On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent
The Adam optimizer is widely used for transformer optimization in practice, which makes understanding the underlying optimization mechanisms an important problem. However, due to Adam's complexity, theoretical analysis of how it optimi…
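As a reminder, the sign gradient descent update, often studied as a tractable surrogate for Adam, is the coordinate-wise rule
$$\theta_{t+1} = \theta_t - \eta\,\mathrm{sign}\big(\nabla_\theta L(\theta_t)\big).$$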
Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization
Transformers have demonstrated great power in the recent development of large foundational models. In particular, the Vision Transformer (ViT) has brought revolutionary changes to the field of vision, achieving significant accomplishments …
Transformers are Minimax Optimal Nonparametric In-Context Learners
In-context learning (ICL) of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples. In this paper, we study the efficacy of ICL from the viewpoint of statistica…
Convergence error analysis of reflected gradient Langevin dynamics for non-convex constrained optimization
Gradient Langevin dynamics and a variety of its variants have attracted increasing attention owing to their convergence towards the global optimal solution, initially in the unconstrained convex framework while recently even in convex cons…
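As a generic recap, gradient Langevin dynamics for minimizing $F$ follows the stochastic differential equation
$$d\theta_t = -\nabla F(\theta_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t,$$
and the reflected variant adds a boundary reflection term that keeps $\theta_t$ inside the constraint set.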
Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations
We study the computational and sample complexity of learning a target function $f_*:\mathbb{R}^d\to\mathbb{R}$ with additive structure, that is, $f_*(x) = \frac{1}{\sqrt{M}}\sum_{m=1}^M f_m(\langle x, v_m\rangle)$, where $f_1,f_2,...,f_M:\…
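A purely illustrative instance of this target class (not taken from the paper) is $M = 2$ with orthonormal directions $v_1, v_2 \in \mathbb{R}^d$ and
$$f_*(x) = \tfrac{1}{\sqrt{2}}\big(\mathrm{He}_2(\langle x, v_1\rangle) + \mathrm{He}_3(\langle x, v_2\rangle)\big),$$
where $\mathrm{He}_k$ denotes the degree-$k$ Hermite polynomial.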
Provably Neural Active Learning Succeeds via Prioritizing Perplexing Samples
Neural Network-based active learning (NAL) is a cost-effective data selection technique that utilizes neural networks to select and train on a small subset of samples. While existing work successfully develops various effective or theory-j…
Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle\right)$ under isotropic Gaussian data in $\mathbb{R}^d$, where the unkn…
Flow matching achieves almost minimax optimal convergence
Flow matching (FM) has gained significant attention as a simulation-free generative model. Unlike diffusion models, which are based on stochastic differential equations, FM employs a simpler approach by solving an ordinary differential equ…
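As a generic recap of the framework, FM trains a time-dependent velocity field $v_\theta$ and generates samples by integrating the ordinary differential equation
$$\frac{dx_t}{dt} = v_\theta(x_t, t), \qquad x_0 \sim \mathcal{N}(0, I),$$
from the reference distribution at $t = 0$ to the data distribution at $t = 1$, in contrast to the stochastic differential equations underlying diffusion models.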
State Space Models are Comparable to Transformers in Estimating Functions with Dynamic Smoothness
Deep neural networks based on state space models (SSMs) are attracting much attention in sequence modeling since their computational cost is significantly smaller than that of Transformers. While the capabilities of SSMs have been primaril…
State-Free Inference of State-Space Models: The Transfer Function Approach
We approach designing a state-space model for deep learning applications through its dual representation, the transfer function, and uncover a highly efficient sequence parallel inference algorithm that is state-free: unlike other proposed…
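For context, a discrete-time linear state-space model $x_{k+1} = A x_k + B u_k$, $y_k = C x_k + D u_k$ admits the dual transfer-function representation
$$H(z) = C (zI - A)^{-1} B + D,$$
which describes the input-output map without explicit reference to the state $x_k$.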
Weighted Point Set Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric
In typical multimodal contrastive learning, such as CLIP, encoders produce one point in the latent representation space for each input. However, one-point representation has difficulty in capturing the relationship and the similarity struc…
Mechanistic Design and Scaling of Hybrid Architectures
The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and high compute costs associated with at-scale model training and evaluation. We set out to simplify this …
Mean-field Analysis on Two-layer Neural Networks from a Kernel Perspective
In this paper, we study the feature learning ability of two-layer neural networks in the mean-field regime through the lens of kernel methods. To focus on the dynamics of the kernel induced by the first layer, we utilize a two-timescale li…
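As a generic recap, a two-layer network in the mean-field regime is parameterized by a probability measure $\mu$ over neuron weights,
$$f_\mu(x) = \int a\,\sigma(\langle w, x\rangle)\, d\mu(a, w),$$
so training corresponds to optimizing over $\mu$, and the first-layer features $\sigma(\langle w, x\rangle)$ induce the kernel whose dynamics the abstract refers to.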
How do Transformers perform In-Context Autoregressive Learning?
Transformers have achieved state-of-the-art performance in language modeling tasks. However, the reasons behind their tremendous success are still unclear. In this paper, towards a better understanding, we train a Transformer model on a si…
Transformers Learn Nonlinear Features In Context: Nonconvex Mean-field Dynamics on the Attention Landscape
Large language models based on the Transformer architecture have demonstrated impressive capabilities to learn in context. However, existing theoretical studies on how this phenomenon arises are limited to the dynamics of a single layer of…