Explanipedia

STrack: A Reliable Multipath Transport for AI/ML Clusters

Yanfang Le, Rong Pan, Peter Newman, Jeremias Blendin, Abdul Kabbani, et al. · 2024

Computer science

Emerging artificial intelligence (AI) and machine learning (ML) workloads present new challenges of managing the collective communication used in distributed training across hundreds or even thousands of GPUs. This paper presents STrack, a…

FASTFLOW: Flexible Adaptive Congestion Control for High-Performance Datacenters

Tommaso Bonato, Abdul Kabbani, Daniele De Sensi, Rong Pan, Yanfang Le, et al. · 2024

Computer science, Business

The increasing demand of machine learning (ML) workloads in datacenters places significant stress on current congestion control (CC) algorithms, many of which struggle to maintain performance at scale. These workloads generate bursty, sync…

Towards Accelerating Data Intensive Application's Shuffle Process Using SmartNICs

Jiaxin Lin, Tao Ji, Xiangpeng Hao, Hokeun Cha, Yanfang Le, et al. · 2023

Computer science, Mathematics, Biology

The wide adoption of the emerging SmartNIC technology creates new opportunities to offload application-level computation into the networking layer, which frees the burden of host CPUs, leading to performance improvement. Shuffle, the all-t…

SFC: Near-Source Congestion Signaling and Flow Control

Yanfang Le, Jeongkeun Lee, Jeremias Blendin, Jiayi Chen, Georgios Nikolaidis, et al. · 2023

Computer science

State-of-the-art congestion control algorithms for data centers alone do not cope well with transient congestion and high traffic bursts. To help with these, we revisit the concept of direct backward feedback from switches and propo…

Efficient Data-Plane Memory Scheduling for In-Network Aggregation

Hao Wang, Yuxuan Qin, ChonLam Lao, Yanfang Le, Wenfei Wu, et al. · 2022

Computer science, Engineering

As the scale of distributed training grows, communication becomes a bottleneck. To accelerate the communication, recent works introduce In-Network Aggregation (INA), which moves the gradients summation into network middle-boxes, e.g., prog…

PL2: Towards Predictable Low Latency in Rack-Scale Networks

Yanfang Le, Radhika Niranjan Mysore, Lalith Suresh, Gerd Zellweger, Sujata Banerjee, et al. · 2021

Computer science, Engineering

High performance rack-scale offerings package disaggregated pools of compute, memory and storage hardware in a single rack to run diverse workloads with varying requirements, including applications that need low and predictable latency. Th…

RoGUE

Yanfang Le, Brent Stephens, Arjun Singhvi, Aditya Akella, Michael M. Swift · 2018

Computer science

RDMA over Converged Ethernet (RoCE) promises low latency and low CPU utilization over commodity networks, and is attractive for cloud infrastructure services. Current implementations require Priority Flow Control (PFC) that uses backpressu…
