arXiv (Cornell University)
TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
February 2022 • Weiyang Wang, Moein Khazraee, Zhizhen Zhong, Manya Ghobadi, Zhihao Jia, Dheevatsa Mudigere, Ying Zhang, Anthony Kewitsch
We propose TopoOpt, a novel direct-connect fabric for deep neural network (DNN) training workloads. TopoOpt co-optimizes the distributed training process across three dimensions: computation, communication, and network topology. We demonstrate the mutability of AllReduce traffic, and leverage this property to construct efficient network topologies for DNN training jobs. TopoOpt then uses an alternating optimization technique and a group theory-inspired algorithm called TotientPerms to find the best network topolog…
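The abstract names two ideas without spelling them out: that AllReduce traffic is "mutable", and that a group theory-inspired algorithm (TotientPerms) helps choose topologies. The sketch below illustrates one plausible reading, not the authors' implementation. It assumes that a ring AllReduce over n nodes produces the same result whatever order the nodes are arranged in, and that candidate rings can be generated by stepping through node IDs with a stride p coprime to n (the strides counted by Euler's totient function). The function names `coprime_strides`, `ring_for_stride`, and `ring_edges` are hypothetical and chosen for illustration only.

```python
from math import gcd


def coprime_strides(n):
    """Strides p in [1, n) with gcd(p, n) == 1 (assumed candidate set)."""
    return [p for p in range(1, n) if gcd(p, n) == 1]


def ring_for_stride(n, p):
    """Visit order of node IDs when stepping by p modulo n, starting at node 0.

    When gcd(p, n) == 1 the walk touches every node exactly once before
    returning to node 0, so it forms a single ring over all n nodes.
    """
    return [(i * p) % n for i in range(n)]


def ring_edges(order):
    """Directed links (src, dst) of the ring, including the wrap-around link."""
    return [(order[i], order[(i + 1) % len(order)]) for i in range(len(order))]


if __name__ == "__main__":
    n = 8  # hypothetical job size
    for p in coprime_strides(n):
        print(p, ring_for_stride(n, p))
```

Each stride yields a different set of links while still covering every node, which is the sense in which the same collective could be mapped onto several physical topologies under these assumptions. How TopoOpt actually selects among such permutations is not stated in the abstract.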
Computer Science
Network Topology
Parallel Computing
Artificial Intelligence
Algorithm
Combinatorics
Structural Engineering
Mathematics
Engineering