Indranil Gupta
A House United Within Itself: SLO-Awareness for On-Premises Containerized ML Inference Clusters via Faro
This paper tackles the challenge of running multiple ML inference jobs (models) under time-varying workloads, on a constrained on-premises production cluster. Our system Faro takes in latency Service Level Objectives (SLOs) for each job…
Counting How the Seconds Count: Understanding Algorithm-User Interplay in TikTok via ML-driven Analysis of Video Content
Short video streaming systems such as TikTok, YouTube Shorts, Instagram Reels, etc., have reached billions of active users. At the core of such systems is a (proprietary) recommendation algorithm which recommends a sequence of videos to each…
Topology and density control of satellite-defined photonic quantum networks
Creating photonic quantum networks by distributing entangled photon pairs through low Earth orbit satellites is a promising technological advance. A recent work has studied a model of such networks and reported the presence of a heavy-tail…
There is More Control in Egalitarian Edge IoT Meshes
Transmitting, Fast and Slow: Scheduling Satellite Traffic through Space and Time
Artifact for our MobiCom 2023 paper
Dirigo: Self-scaling Stateful Actors For Serverless Real-time Data Processing
We propose Dirigo, a distributed stream processing service built atop virtual actors. Dirigo achieves both a high level of resource efficiency and performance isolation driven by user intent (SLO). To improve resource efficiency, Dirigo ad…
A User-Centric Evaluation of Smart Home Resolution Approaches for Conflicts Between Routines
With the increasing adoption of smart home devices, users rely on device automation to control their homes. This automation commonly comes in the form of smart home routines, an abstraction available via major vendors. Yet, questions remai…
CoMesh: Fully-Decentralized Control for Sense-Trigger-Actuate Routines in Edge Meshes
While mesh networking for edge settings (e.g., smart buildings, farms, battlefields, etc.) has received much attention, the layer of control over such meshes remains largely centralized and cloud-based. This paper focuses on applications w…
Transactional Panorama: A Conceptual Framework for User Perception in Analytical Visual Interfaces
Many tools empower analysts and data scientists to consume analysis results in a visual interface, such as a dashboard. When the underlying data changes, these results need to be updated, but this update can take a long time -- all while t…
Baechi: Fast Device Placement of Machine Learning Graphs
Machine Learning graphs (or models) can be challenging or impossible to train when either devices have limited memory, or models are large. To split the model across devices, learning-based approaches are still popular. While these result …
Medley: A Membership Service for IoT Networks
Efficient and correct operation of an IoT network requires the presence of a failure detector and membership protocol amongst the IoT nodes. This paper presents a new failure detector for IoT settings wherein nodes are connected via a wire…
ZenoPS: A Distributed Learning System Integrating Communication Efficiency and Security
Distributed machine learning is primarily motivated by the promise of increased computation power for accelerating training and mitigating privacy concerns. Unlike machine learning on a single device, distributed machine learning requires …
Banyan: A Scoped Dataflow Engine for Graph Query Service
Graph query services (GQS) are widely used today to interactively answer graph traversal queries on large-scale graph data. Existing graph query engines focus largely on optimizing the latency of a single query. This ignores significant ch…
Move Fast and Meet Deadlines: Fine-grained Real-time Stream Processing with Cameo
Resource provisioning in multi-tenant stream processing systems faces the dual challenges of keeping resource utilization high (without over-provisioning), and ensuring performance isolation. In our common production use cases, where strea…
CSER: Communication-efficient SGD with Error Reset
The scalability of Distributed Stochastic Gradient Descent (SGD) is today limited by communication bottlenecks. We propose a novel SGD variant: Communication-efficient SGD with Error Reset, or CSER. The key idea in CSER is first a new tech…
Home, SafeHome: Smart Home Reliability with Visibility and Atomicity
Smart environments (homes, factories, hospitals, buildings) contain an increasing number of IoT devices, making them complex to manage. Today, in smart homes where users or triggers initiate routines (i.e., a sequence of commands), concurr…
Medley
Efficient and correct operation of an IoT network requires the presence of a failure detector and membership protocol amongst the IoT nodes. This paper presents a new failure detector for IoT settings where nodes are connected via a wirele…
Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates
When scaling distributed training, the communication overhead is often the bottleneck. In this paper, we propose a novel SGD variant with reduced communication and adaptive learning rates. We prove the convergence of the proposed algorithm…
Read atomic transactions with prevention of lost updates: ROLA and its formal analysis
Designers of distributed database systems face the choice between stronger consistency guarantees and better performance. A number of applications only require read atomicity (RA) (either all or none of a transaction’s updates are visible …
Zeno++: Robust Fully Asynchronous SGD
We propose Zeno++, a new robust asynchronous Stochastic Gradient Descent (SGD) procedure which tolerates Byzantine failures of the workers. In contrast to previous work, Zeno++ removes some unrealistic restrictions on worker-server communi…
SLSGD: Secure and Efficient Distributed On-device Machine Learning
We consider distributed on-device learning with limited communication and security requirements. We propose a new robust distributed optimization algorithm with efficient communication and attack tolerance. The proposed algorithm has prova…
Fall of Empires: Breaking Byzantine-tolerant SGD by Inner Product Manipulation
Recently, new defense techniques have been developed to tolerate Byzantine failures for distributed machine learning. The Byzantine model captures workers that behave arbitrarily, including malicious and compromised workers. In this paper,…
Asynchronous Federated Optimization
Federated learning enables training on a massive number of edge devices. To improve flexibility and scalability, we propose a new asynchronous federated optimization algorithm. We prove that the proposed approach has near-linear convergenc…
Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance
We present Zeno, a technique to make distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. Zeno generalizes previous results that assumed a majority of non-faulty n…
Zeno: Byzantine-suspicious stochastic gradient descent.
We propose Zeno, a new robust aggregation rule, for distributed synchronous Stochastic Gradient Descent (SGD) under a general Byzantine failure model. The key idea is to suspect the workers that are potentially malicious, and use a ranking…
Phocas: dimensional Byzantine-resilient stochastic gradient descent
We propose a novel robust aggregation rule for distributed synchronous Stochastic Gradient Descent (SGD) under a general Byzantine failure model. The attackers can arbitrarily manipulate the data transferred between the servers and the wor…
Popular is cheaper
This paper targets the growing area of interactive data analytics engines. We present a system called Getafix that intelligently decides replication levels and replica placement for data segments, in a way that is responsive to changing po…
Service fabric
We describe Service Fabric (SF), Microsoft's distributed platform for building, running, and maintaining microservice applications in the cloud. SF has been running in production for 10+ years, powering many critical services at Microsoft.…