arXiv (Cornell University)
Hyper-SET: Designing Transformers via Hyperspherical Energy Minimization
February 2025 • Yue Hu, Difan Zou, Dong Xu
Transformer-based models have achieved remarkable success, but their core components, Transformer layers, are largely heuristic-driven and engineered from the bottom up. This calls for a prototypical model with high interpretability and practical competence. To this end, we conceptualize a principled, top-down approach grounded in an energy-based interpretation. Specifically, we formalize token dynamics as a joint maximum likelihood estimation on the hypersphere, featuring two properties: semantic alignment in the high…