Murali Annavaram
Icy-hot: decoupled compute paradigm towards a general-purpose superconducting CPU design
Single Flux Quantum (SFQ) superconducting technology offers major performance and energy advantages over CMOS as Dennard scaling wanes. Yet, SFQ CPUs face key challenges: the Josephson Junction (JJ) budget limits manufacturabi…
Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding
With the skyrocketing costs of GPUs and their virtual instances in the cloud, there is a significant desire to use CPUs for large language model (LLM) inference. KV cache update, often implemented as allocation, copying, and in-place strid…
DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing
Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: compute-intensive prefill and memory-bound decode phases. Existing approaches either (1) aggregate both phases …
DistilLock: Safeguarding LLMs from Unauthorized Knowledge Distillation on the Edge
Large Language Models (LLMs) have demonstrated strong performance across diverse tasks, but fine-tuning them typically relies on cloud-based, centralized infrastructures. This requires data owners to upload potentially sensitive data to ex…
The Upside of Bias: Personalizing Long-Tail Item Recommendations with Biased Sampling
Recommendation systems drive user engagement across social media, streaming platforms, and e-commerce by learning from past interactions. The relevance of a recommended item depends on the quality of the user and item embeddings learned by…
LEAF: Lightweight, Efficient, Adaptive and Flexible Embedding for Large-Scale Recommendation Models
Meta-Learn to Unlearn: Enhanced Exact Machine Unlearning in Recommendation Systems with Meta-Learning
Recommendation systems are used widely to recommend items such as movies, products, or news to users. The performance of a recommendation model depends on the quality of the embeddings that are associated with users and items, which are ge…
Memory-Efficient Differentially Private Training with Gradient Random Projection
Differential privacy (DP) protects sensitive data during neural network training, but standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping, limiting scalability. We introduce DP-GRAPE (Gradient…
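The per-sample clipping cost the abstract refers to comes from materializing one gradient per example before aggregation. A minimal NumPy sketch of the standard clip-sum-noise step in DP training (illustrative names only; this is the generic mechanism, not DP-GRAPE's projection method):

```python
import numpy as np

def dp_aggregate(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Generic DP gradient step: clip each example's gradient to
    L2 norm <= clip_norm, sum, add Gaussian noise scaled to the
    clip norm, and average over the batch."""
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        # scale down (never up) so every example's influence is bounded
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_sample_grads)
```

Note that `per_sample_grads` holds one full gradient per example, which is exactly the memory overhead the paper targets.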
MARché: Fast Masked Autoregressive Image Generation with Cache-Aware Attention
Masked autoregressive (MAR) models unify the strengths of masked and autoregressive generation by predicting tokens in a fixed order using bidirectional attention for image generation. While effective, MAR models suffer from significant co…
PIGEON: A High Throughput Framework for Private Inference of Neural Networks using Secure Multiparty Computation
Privacy-Preserving Machine Learning (PPML) is one of the most relevant use cases for Secure Multiparty Computation (MPC). While private training of large neural networks such as VGG-16 or ResNet-50 on state-of-the-art datasets such as Imag…
DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed …
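The draft-then-verify loop the abstract describes can be sketched in a few lines. This is a greedy-acceptance toy (callables stand in for the compact draft model and the full target model; names are illustrative, not DEL's API):

```python
def speculative_step(draft_next, target_next, prefix, k):
    """One draft-then-verify step of greedy speculative decoding.

    draft_next / target_next: callables mapping a token sequence to
    the next token (stand-ins for the cheap and expensive models).
    Returns the tokens accepted this step: the longest prefix of the
    draft the target agrees with, plus one target-chosen token.
    """
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))
    accepted = []
    for t in draft:
        expected = target_next(prefix + accepted)
        if t == expected:
            accepted.append(t)          # target agrees: keep the cheap token
        else:
            accepted.append(expected)   # first mismatch: take target's token, stop
            break
    else:
        # every draft token accepted; append one bonus target token
        accepted.append(target_next(prefix + accepted))
    return accepted
```

In a real system the k verification calls are batched into a single target-model forward pass, which is where the speedup comes from.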
Differentially Private In-context Learning via Sampling Few-shot Mixed with Zero-shot Outputs
In-context learning (ICL) has shown promising improvement in downstream task adaptation of LLMs by augmenting prompts with relevant input-output examples (demonstrations). However, the ICL demonstrations can contain privacy-sensitive infor…
Characterization of GPU TEE Overheads in Distributed Data Parallel ML Training
Confidential computing (CC) via trusted execution environments (TEEs) is now the most common approach to enable secure computing in the cloud. The recent introduction of GPU TEEs by NVIDIA enables machine learning (ML) models to be trained with…
On Using Arabic Language Dialects in Recommendation Systems
Mind the Dialect: NLP Advancements Uncover Fairness Disparities for Arabic Users in Recommendation Systems
KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) cache is used to store intermediate activations, which significantly lowers the computational overhead …
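A minimal single-head sketch of the KV-cache idea the abstract starts from: each decode step projects only the newest token and attends over stored keys/values, instead of recomputing projections for the whole prefix (NumPy, illustrative names; not KVPR's recomputation scheme):

```python
import numpy as np

class KVCache:
    """Append-only key/value store for one attention layer, so past
    tokens' K/V projections are computed once and reused."""
    def __init__(self):
        self.k, self.v = None, None

    def append(self, k_t, v_t):
        # k_t, v_t: (1, d) projections of the newest token only
        self.k = k_t if self.k is None else np.vstack([self.k, k_t])
        self.v = v_t if self.v is None else np.vstack([self.v, v_t])
        return self.k, self.v

def decode_step(q_t, cache, k_t, v_t):
    """One decode step: store the new token's K/V, then attend the
    new query over the full cached prefix."""
    K, V = cache.append(k_t, v_t)
    scores = q_t @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V
```

The trade-off the paper's title points at is that this cache grows linearly with sequence length, so moving or rebuilding it becomes an I/O question.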
High-Throughput Secure Multiparty Computation with an Honest Majority in Various Network Settings
In this work, we present novel protocols over rings for semi-honest secure three-party computation (3PC) and malicious four-party computation (4PC) with one corruption. While most existing works focus on improving total communication compl…
Fastrack: Fast IO for Secure ML using GPU TEEs
As cloud-based ML expands, ensuring data security during training and inference is critical. GPU-based Trusted Execution Environments (TEEs) offer secure, high-performance solutions, with CPU TEEs managing data movement and GPU TEEs handli…
Biased User History Synthesis for Personalized Long-Tail Item Recommendation
Estimating Privacy Leakage of Augmented Contextual Knowledge in Language Models
Language models (LMs) rely on their parametric knowledge augmented with relevant contextual knowledge for certain tasks, such as question answering. However, the contextual knowledge can contain private information that may be leaked when …
Adaptively Private Next-Token Prediction of Large Language Models
As Large Language Models (LLMs) proliferate, developing privacy safeguards for these models is crucial. One popular safeguard involves training LLMs in a differentially private manner. However, such solutions are shown to be computationall…
MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data. Given the sensitive nature of …
CADC: Encoding User-Item Interactions for Compressing Recommendation Model Training Data
Deep learning recommendation models (DLRMs) are at the heart of the current e-commerce industry. However, the amount of training data used to train these large models is growing exponentially, leading to substantial training hurdles. The t…
Edge Private Graph Neural Networks with Singular Value Perturbation
Graph neural networks (GNNs) play a key role in learning representations from graph-structured data and are demonstrated to be useful in many applications. However, the GNN training pipeline has been shown to be vulnerable to node feature …
MPC-Pipe: an Efficient Pipeline Scheme for Semi-honest MPC Machine Learning
Differentially Private Next-Token Prediction of Large Language Models
Ensuring the privacy of Large Language Models (LLMs) is becoming increasingly important. The most widely adopted technique to accomplish this is DP-SGD, which trains a model to guarantee Differential Privacy (DP). However, DP-SGD overestim…
Cross-core Data Sharing for Energy-efficient GPUs
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application domains, because they can accelerate massively parallel workloads and can be easily programmed using general-purpose programming frameworks such as …