Manya Ghobadi
LINC: An In-Network Coding Approach to Tame Packet Loss in Hybrid Wireless-Fiber Backbones
The emergence of ultra-low latency applications, such as financial transactions, has driven the development of hybrid backbone networks that rely on fiber, satellite, and microwave links. Despite providing low latencies, these hybrid netwo…
Checkmate: Zero-Overhead Model Checkpointing via Network Gradient Replication
This paper presents Checkmate, a system that enables per-iteration checkpointing in DNN training without any training slowdown. The traditional approach to checkpointing requires a pause in training to copy model states to a separate locat…
Scalable Routing in a City-Scale Wi-Fi Network for Disaster Recovery
In this paper, we present a new city-scale decentralized mesh network system suited for disaster recovery and emergencies. When wide-area connectivity is unavailable or significantly degraded, our system, MapMesh, enables static access poi…
MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning
NetBlocks: Staging Layouts for High-Performance Custom Host Network Stacks
Modern network applications and environments, ranging from data centers and IoT devices to AR/VR headsets and underwater robotics, present diverse requirements that cannot be satisfied by the all-or-nothing approach of TCP and UDP protocol…
MLTCP: Congestion Control for DNN Training
We present MLTCP, a technique to augment today's congestion control algorithms to accelerate DNN training jobs in shared GPU clusters. MLTCP enables the communication phases of jobs that compete for network bandwidth to interleave with eac…
On-Fiber Photonic Computing
In the 1800s, Charles Babbage envisioned computers as analog devices. However, it was not until 150 years later that a Mechanical Analog Computer was constructed for the US Navy to solve differential equations. With the end of Moore's Law,…
Lightning: A Reconfigurable Photonic-Electronic SmartNIC for Fast and Energy-Efficient Inference
The massive growth of machine learning-based applications and the end of Moore's law have created a pressing need to redesign computing platforms. We propose Lightning, the first reconfigurable photonic-electronic smartNIC to serve real-ti…
CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters. CASSINI introduces a novel geometric abstraction to consider the communication pattern of different jobs while placing them on network links. To do so, C…
Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters
This paper presents a low-cost network architecture for training large language models (LLMs) at hyperscale. We study the optimal parallelization strategy of LLMs and propose a novel datacenter network design tailored to LLM's unique commu…
PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels
Allowing organizations to share their data for training of machine learning (ML) models without unintended information leakage is an open problem in practice. A promising technique for this still-open problem is to train models on the enco…
Congestion control in machine learning clusters
This paper argues that fair-sharing, the holy grail of congestion control algorithms for decades, is not necessarily a desirable property in Machine Learning (ML) training clusters. We demonstrate that for a specific combination of jobs, i…
Workshop Organization
InfoShape: Task-Based Neural Data Shaping via Mutual Information
The use of mutual information as a tool in private data sharing has remained an open challenge due to the difficulty of its estimation in practice. In this paper, we propose InfoShape, a task-based encoder that aims to remove unnecessary s…
Using trio
This paper describes Trio, a programmable chipset used in Juniper Networks' MX-series routers and switches. Trio's architecture is based on a multi-threaded programmable packet processing engine and a hierarchy of high-capacity memory syst…
ABM
Today's network devices share buffer across queues to avoid drops during transient congestion and absorb bursts. As the buffer-per-bandwidth-unit in datacenter decreases, the need for optimal buffer utilization becomes more pressing. Typic…
Performance trade-offs in reconfigurable networks for HPC
Designing efficient interconnects to support high-bandwidth and low-latency communication is critical toward realizing high performance computing (HPC) and data center (DC) systems in the exascale era. At extreme computing scales, providin…
Delocalized Photonic Deep Learning on the Internet's Edge
Advances in deep neural networks (DNNs) are transforming science and technology. However, the increasing computational demands of the most powerful DNNs limit deployment on low-power devices, such as smartphones and sensors – and this tren…
TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
We propose TopoOpt, a novel direct-connect fabric for deep neural network (DNN) training workloads. TopoOpt co-optimizes the distributed training process across three dimensions: computation, communication, and network topology. We demonst…
IOI
We present In-network Optical Inference (IOI), a system providing low-latency machine learning inference by leveraging programmable switches and optical matrix multiplication. IOI consists of a novel transceiver module designed specificall…
SiP-ML
This paper proposes optical network interconnects as a key enabler for building high-bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML, accelerates the training time of popular DNN models using silico…
ARROW
Fiber cut events reduce the capacity of wide-area networks (WANs) by several Tbps. In this paper, we revive the lost capacity by reconfiguring the wavelengths from cut fibers into healthy fibers. We highlight two challenges that made prior…
Edge computing with optical neural networks via WDM weight broadcasting
We introduce an optical neural-network architecture for edge computing that takes advantage of wavelength multiplexing, high-bandwidth modulation, and integration detection. Our protocol consists of a server and a client, which divide the t…
NeuraCrypt: Hiding Private Health Data via Random Neural Networks for Public Training
Balancing the needs of data privacy and predictive utility is a central challenge for machine learning in healthcare. In particular, privacy concerns have led to a dearth of public datasets, complicated the construction of multi-hospital c…
FB: A Flexible Buffer Management Scheme for Data Center Switches
Today, network devices share buffer across priority queues to avoid drops during transient congestion. While cost-effective most of the time, this sharing can cause undesired interference among seemingly independent traffic. As a result, l…
Cerberus: The Power of Choices in Datacenter Topology Design (A Throughput Perspective)
PINE: Photonic Integrated Networked Energy efficient datacenters (ENLITENED Program) [Invited]
We review the motivation, goals, and achievements of the Photonic Integrated Networked Energy efficient datacenter (PINE) project, which is part of the Advanced Research Projects Agency–Energy (ARPA-E) ENergy-efficient Light-wave Integrate…
Performance Analysis of Demand-Oblivious and Demand-Aware Optical Datacenter Network Designs
This paper presents a performance analysis of the design space of optical datacenter networks, including both demand-oblivious (static or dynamic) and demand-aware networks. We formally show that the number of specific optical switch types…