Murali Annavaram
Icy-hot: decoupled compute paradigm towards a general-purpose superconducting CPU design
Single Flux Quantum (SFQ) superconducting technology offers major performance and energy advantages over CMOS as Dennard scaling wanes. Yet, SFQ CPUs face key challenges: the Josephson Junction (JJ) budget limits manufacturabi…
Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding
With the skyrocketing costs of GPUs and their virtual instances in the cloud, there is a significant desire to use CPUs for large language model (LLM) inference. KV cache update, often implemented as allocation, copying, and in-place strid…
DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing
Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: compute-intensive prefill and memory-bound decode phases. Existing approaches either (1) aggregate both phases …
DistilLock: Safeguarding LLMs from Unauthorized Knowledge Distillation on the Edge
Large Language Models (LLMs) have demonstrated strong performance across diverse tasks, but fine-tuning them typically relies on cloud-based, centralized infrastructures. This requires data owners to upload potentially sensitive data to ex…
The Upside of Bias: Personalizing Long-Tail Item Recommendations with Biased Sampling
Recommendation systems drive user engagement across social media, streaming platforms, and e-commerce by learning from past interactions. The relevance of a recommended item depends on the quality of the user and item embeddings learned by…
LEAF: Lightweight, Efficient, Adaptive and Flexible Embedding for Large-Scale Recommendation Models
Meta-Learn to Unlearn: Enhanced Exact Machine Unlearning in Recommendation Systems with Meta-Learning
Recommendation systems are used widely to recommend items such as movies, products, or news to users. The performance of a recommendation model depends on the quality of the embeddings that are associated with users and items, which are ge…
Memory-Efficient Differentially Private Training with Gradient Random Projection
Differential privacy (DP) protects sensitive data during neural network training, but standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping, limiting scalability. We introduce DP-GRAPE (Gradient…
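The per-sample clipping cost the abstract refers to comes from materializing one gradient per example before aggregation. A minimal NumPy sketch of the standard clip-sum-noise step in DP training (illustrative names only; this is the generic mechanism, not DP-GRAPE's projection method):

```python
import numpy as np

def dp_aggregate(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Generic DP gradient step: clip each example's gradient to
    L2 norm <= clip_norm, sum, add Gaussian noise scaled to the
    clip norm, and average over the batch."""
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        # scale down (never up) so every example's influence is bounded
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_sample_grads)
```

Note that `per_sample_grads` holds one full gradient per example, which is exactly the memory overhead the paper targets.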
MARché: Fast Masked Autoregressive Image Generation with Cache-Aware Attention
Masked autoregressive (MAR) models unify the strengths of masked and autoregressive generation by predicting tokens in a fixed order using bidirectional attention for image generation. While effective, MAR models suffer from significant co…
PIGEON: A High Throughput Framework for Private Inference of Neural Networks using Secure Multiparty Computation
Privacy-Preserving Machine Learning (PPML) is one of the most relevant use cases for Secure Multiparty Computation (MPC). While private training of large neural networks such as VGG-16 or ResNet-50 on state-of-the-art datasets such as Imag…
DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed …
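The draft-then-verify loop the abstract describes can be sketched in a few lines. This is a greedy-acceptance toy (callables stand in for the compact draft model and the full target model; names are illustrative, not DEL's API):

```python
def speculative_step(draft_next, target_next, prefix, k):
    """One draft-then-verify step of greedy speculative decoding.

    draft_next / target_next: callables mapping a token sequence to
    the next token (stand-ins for the cheap and expensive models).
    Returns the tokens accepted this step: the longest prefix of the
    draft the target agrees with, plus one target-chosen token.
    """
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))
    accepted = []
    for t in draft:
        expected = target_next(prefix + accepted)
        if t == expected:
            accepted.append(t)          # target agrees: keep the cheap token
        else:
            accepted.append(expected)   # first mismatch: take target's token, stop
            break
    else:
        # every draft token accepted; append one bonus target token
        accepted.append(target_next(prefix + accepted))
    return accepted
```

In a real system the k verification calls are batched into a single target-model forward pass, which is where the speedup comes from.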
Differentially Private In-context Learning via Sampling Few-shot Mixed with Zero-shot Outputs
In-context learning (ICL) has shown promising improvement in downstream task adaptation of LLMs by augmenting prompts with relevant input-output examples (demonstrations). However, the ICL demonstrations can contain privacy-sensitive infor…
Characterization of GPU TEE Overheads in Distributed Data Parallel ML Training
Confidential computing (CC) via trusted execution environments (TEEs) is now the most common approach to enable secure computing in the cloud. The recent introduction of GPU TEEs by NVIDIA enables machine learning (ML) models to be trained with…
On Using Arabic Language Dialects in Recommendation Systems
Mind the Dialect: NLP Advancements Uncover Fairness Disparities for Arabic Users in Recommendation Systems
KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) cache is used to store intermediate activations, which significantly lowers the computational overhead …
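A minimal single-head sketch of the KV-cache idea the abstract starts from: each decode step projects only the newest token and attends over stored keys/values, instead of recomputing projections for the whole prefix (NumPy, illustrative names; not KVPR's recomputation scheme):

```python
import numpy as np

class KVCache:
    """Append-only key/value store for one attention layer, so past
    tokens' K/V projections are computed once and reused."""
    def __init__(self):
        self.k, self.v = None, None

    def append(self, k_t, v_t):
        # k_t, v_t: (1, d) projections of the newest token only
        self.k = k_t if self.k is None else np.vstack([self.k, k_t])
        self.v = v_t if self.v is None else np.vstack([self.v, v_t])
        return self.k, self.v

def decode_step(q_t, cache, k_t, v_t):
    """One decode step: store the new token's K/V, then attend the
    new query over the full cached prefix."""
    K, V = cache.append(k_t, v_t)
    scores = q_t @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V
```

The trade-off the paper's title points at is that this cache grows linearly with sequence length, so moving or rebuilding it becomes an I/O question.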
High-Throughput Secure Multiparty Computation with an Honest Majority in Various Network Settings
In this work, we present novel protocols over rings for semi-honest secure three-party computation (3PC) and malicious four-party computation (4PC) with one corruption. While most existing works focus on improving total communication compl…
Fastrack: Fast IO for Secure ML using GPU TEEs
As cloud-based ML expands, ensuring data security during training and inference is critical. GPU-based Trusted Execution Environments (TEEs) offer secure, high-performance solutions, with CPU TEEs managing data movement and GPU TEEs handli…
Biased User History Synthesis for Personalized Long-Tail Item Recommendation
Estimating Privacy Leakage of Augmented Contextual Knowledge in Language Models
Language models (LMs) rely on their parametric knowledge augmented with relevant contextual knowledge for certain tasks, such as question answering. However, the contextual knowledge can contain private information that may be leaked when …
Adaptively Private Next-Token Prediction of Large Language Models
As Large Language Models (LLMs) proliferate, developing privacy safeguards for these models is crucial. One popular safeguard involves training LLMs in a differentially private manner. However, such solutions are shown to be computationall…
MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data. Given the sensitive nature of …
CADC: Encoding User-Item Interactions for Compressing Recommendation Model Training Data
Deep learning recommendation models (DLRMs) are at the heart of the current e-commerce industry. However, the amount of training data used to train these large models is growing exponentially, leading to substantial training hurdles. The t…
Edge Private Graph Neural Networks with Singular Value Perturbation
Graph neural networks (GNNs) play a key role in learning representations from graph-structured data and are demonstrated to be useful in many applications. However, the GNN training pipeline has been shown to be vulnerable to node feature …
MPC-Pipe: an Efficient Pipeline Scheme for Semi-honest MPC Machine Learning
Differentially Private Next-Token Prediction of Large Language Models
Ensuring the privacy of Large Language Models (LLMs) is becoming increasingly important. The most widely adopted technique to accomplish this is DP-SGD, which trains a model to guarantee Differential Privacy (DP). However, DP-SGD overestim…
Cross-core Data Sharing for Energy-efficient GPUs
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application domains, because they can accelerate massively parallel workloads and can be easily programmed using general-purpose programming frameworks such as …