FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2406.19580
Distributed Deep Neural Network (DNN) training reduces training time by distributing training tasks across multiple accelerators according to a parallelization strategy. However, high-performance compute and interconnects are needed for maximum speed-up and linear scaling of the system. Wafer-scale systems are a promising technology that allows high-end accelerators to be tightly integrated with high-speed wafer-scale interconnects, making them an attractive platform for distributed training. The wafer-scale interconnect, however, must offer high performance and flexibility across various parallelization strategies to enable maximum optimization of compute and memory usage. In this paper, we propose FRED, a wafer-scale interconnect tailored to the high-bandwidth requirements of wafer-scale networks that can efficiently execute the communication patterns of different parallelization strategies. Furthermore, FRED supports in-switch execution of collective communication, which reduces network traffic by approximately 2X. Our results show that FRED can improve the average end-to-end training time of ResNet-152, Transformer-17B, GPT-3, and Transformer-1T by 1.76X, 1.87X, 1.34X, and 1.4X, respectively, compared to a baseline wafer-scale 2D-mesh fabric.
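The roughly 2X traffic reduction from in-switch collectives can be illustrated with a back-of-envelope model. The sketch below is an assumption-laden illustration, not FRED's actual mechanism: it compares an all-reduce where the reduction must happen at an endpoint (every buffer crosses the fabric to a reducing worker and the result crosses it again) against an idealized reduction-capable switch that combines partial sums in-flight and multicasts the result. Worker count, message size, and the traffic-counting model are all hypothetical.

```python
# Back-of-envelope fabric-traffic model for an all-reduce of `msg_bytes`
# across `n_workers` attached to a single switch. Illustrative only; it is
# not FRED's protocol, just the qualitative effect behind the ~2X claim.

def endpoint_allreduce_traffic(n_workers: int, msg_bytes: int) -> int:
    """Reduction at an endpoint: every other worker's buffer traverses the
    fabric to the reducing worker (up + down), then the reduced result
    traverses the fabric again to every other worker (up + down)."""
    to_reducer = (n_workers - 1) * msg_bytes * 2
    from_reducer = (n_workers - 1) * msg_bytes * 2
    return to_reducer + from_reducer

def in_switch_allreduce_traffic(n_workers: int, msg_bytes: int) -> int:
    """Reduction inside the switch: each worker sends its buffer up once
    (partial sums are combined in-flight) and receives the result once."""
    up = n_workers * msg_bytes
    down = n_workers * msg_bytes
    return up + down

if __name__ == "__main__":
    n, m = 16, 100 * 2**20  # 16 workers, 100 MiB gradient buffer (arbitrary)
    baseline = endpoint_allreduce_traffic(n, m)
    in_switch = in_switch_allreduce_traffic(n, m)
    print(f"endpoint reduction : {baseline / 2**30:.2f} GiB on fabric links")
    print(f"in-switch reduction: {in_switch / 2**30:.2f} GiB on fabric links")
    print(f"traffic reduction  : {baseline / in_switch:.2f}X")
```

With these numbers the model gives about a 1.9X reduction, approaching 2X as the worker count grows, which is consistent with the "approximately 2X" figure stated in the abstract.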
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2406.19580
- PDF: https://arxiv.org/pdf/2406.19580
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4400222482