Hoplite
Related Concepts
Asynchronous communication
Computer science
Distributed computing
Fault tolerance
Reinforcement learning
Task (project management)
Schedule
Computer network
Artificial intelligence
Operating system
Management
Economics
Siyuan Zhuang, Zhuohan Li, Danyang Zhuo, Stephanie Wang, Eric Liang, Robert Nishihara, Philipp Moritz, Ion Stoica
2021 · Open Access · DOI: https://doi.org/10.1145/3452296.3472897 · OA: W3189830509
Task-based distributed frameworks (e.g., Ray, Dask, Hydro) have become increasingly popular for distributed applications that contain asynchronous and dynamic workloads, including asynchronous gradient descent, reinforcement learning, and model serving. As more data-intensive applications move to run on top of task-based systems, collective communication efficiency has become an important problem. Unfortunately, traditional collective communication libraries (e.g., MPI, Horovod, NCCL) are an ill fit, because they require the communication schedule to be known before runtime and they do not provide fault tolerance.