Junchen Jiang
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
KV cache has traditionally been stored in GPU memory to accelerate the decoding phase of large language model (LLM) inference. However, it is increasingly necessary to move KV caches outside GPU devices, to enable cache reuse across differ…
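To make the idea of a GPU-external KV cache layer concrete, below is a minimal Python/PyTorch sketch of a store that offloads KV tensors to host memory and reloads them when the same token prefix is reused. The store/fetch interface keyed by a token-prefix hash and the flat in-process CPU dictionary are illustrative assumptions, not LMCache's actual API.

```python
# Minimal sketch of a KV cache layer that keeps KV tensors outside GPU memory
# and reloads them when the same token prefix is seen again. The store/fetch
# interface and the in-process CPU dict are illustrative assumptions.
import hashlib
from typing import Optional, Tuple

import torch


def prefix_key(token_ids: Tuple[int, ...]) -> str:
    # Identical token prefixes hash to the same key, so their KV can be shared.
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()


class KVCacheStore:
    def __init__(self) -> None:
        # Host-side storage; in a real deployment this could be CPU DRAM,
        # local disk, or a remote cache server shared across engines.
        self._entries: dict[str, Tuple[torch.Tensor, torch.Tensor]] = {}

    def store(self, token_ids: Tuple[int, ...],
              k: torch.Tensor, v: torch.Tensor) -> None:
        # Offload KV tensors to CPU so GPU memory is freed for decoding.
        self._entries[prefix_key(token_ids)] = (k.to("cpu"), v.to("cpu"))

    def fetch(self, token_ids: Tuple[int, ...], device: str = "cpu"
              ) -> Optional[Tuple[torch.Tensor, torch.Tensor]]:
        entry = self._entries.get(prefix_key(token_ids))
        if entry is None:
            return None          # miss: the engine must prefill from scratch
        k, v = entry
        return k.to(device), v.to(device)   # hit: skip prefill for this prefix
```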
ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching
Distributed prefix caching accelerates long-context LLM serving by reusing KV cache entries for common context prefixes. However, KV cache fetches can become a bottleneck when network bandwidth is limited. Compression mitigates the bandwid…
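A back-of-the-envelope calculation shows why fetching raw KV caches can become a bottleneck under limited bandwidth. The model dimensions below are those of a Llama-3-8B-style model and the 10 Gbps link is an assumption; both are only illustrative.

```python
# Rough estimate of KV cache fetch time over a constrained link.
# Model dimensions (Llama-3-8B-style) and the 10 Gbps link are assumptions.
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2   # fp16 K and V

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token)                  # 131072 bytes = 128 KiB per token

context_tokens = 32_768
total_bytes = context_tokens * kv_bytes_per_token
print(total_bytes / 2**30)                 # 4.0 GiB for a 32K-token context

link_bytes_per_s = 10e9 / 8                # 10 Gbps link
print(total_bytes / link_bytes_per_s)      # ~3.4 s just to move the cache
```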
AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving
Large language model (LLM) applications often reuse previously processed context, such as chat history and documents, which introduces significant redundant computation. Existing LLM serving systems address such redundant computation by st…
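As a rough illustration of what a KV cache storage hierarchy looks like, the sketch below checks progressively slower tiers in order. The tier names and the flat per-tier dictionaries are assumptions for illustration, not AdaptCache's design.

```python
# Illustrative lookup over a KV cache storage hierarchy: check the fastest tier
# first and fall back to slower, larger ones. Tier contents are plain dicts
# here; a real system would hold GPU/CPU tensors and files or remote objects.
from typing import Any, Optional


class TieredKVCache:
    def __init__(self) -> None:
        # Ordered from lowest to highest access latency (assumed tiers).
        self.tiers = [
            ("gpu_hbm", {}),
            ("cpu_dram", {}),
            ("local_ssd", {}),
        ]

    def get(self, key: str) -> Optional[Any]:
        for name, tier in self.tiers:
            if key in tier:
                # A real system would also promote the entry to faster tiers
                # and account for the transfer/decompression delay of `name`.
                return tier[key]
        return None   # miss in every tier: recompute the KV cache (prefill)

    def put(self, key: str, value: Any, tier_name: str = "cpu_dram") -> None:
        for name, tier in self.tiers:
            if name == tier_name:
                tier[key] = value
                return
```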
PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications
Besides typical generative applications, like ChatGPT, GitHub Copilot, and Cursor, we observe an emerging trend that LLMs are increasingly used in traditional discriminative tasks, such as recommendation, credit verification, and data labe…
Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache
Across large language model (LLM) applications, we observe an emerging trend for reusing KV caches to save the prefill delays of processing repeated input texts in different LLM inputs. This has led to a broad design space, including coloc…
HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) by integrating external knowledge into the generation process. A key component of RAG pipelines is the re…
METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation
RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work…
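To illustrate the kind of quality/delay knob the abstract refers to, here is a hedged sketch that picks how many retrieved chunks a request can afford under a delay budget. The linear cost model, parameter names, and values are invented for illustration and are not METIS's algorithm.

```python
# Illustrative configuration adaptation for RAG: retrieve as many chunks as the
# delay budget allows, given a simple (assumed) linear prefill-cost model.
def choose_num_chunks(delay_budget_s: float,
                      base_delay_s: float = 0.15,      # assumed fixed overhead
                      per_chunk_delay_s: float = 0.04, # assumed cost per chunk
                      max_chunks: int = 20) -> int:
    affordable = int((delay_budget_s - base_delay_s) / per_chunk_delay_s)
    return max(0, min(max_chunks, affordable))


# Example: a 0.5 s budget affords ~8 chunks under these assumed costs.
print(choose_num_chunks(0.5))   # -> 8
```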
LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts
While large language models (LLMs) show impressive performance on complex tasks, they still struggle with long-context understanding and high computational costs. To balance efficiency and quality, we introduce LLMSteer, a fine-tuning-fr…
Loss-tolerant neural video codec aware congestion control for real time video communication
Because of reinforcement learning's (RL) ability to automatically create more adaptive control logic than hand-crafted heuristics, numerous efforts have been made to apply RL to congestion control (CC) design for real time video c…
SwiftQueue: Optimizing Low-Latency Applications with Swift Packet Queuing
Low Latency, Low Loss, and Scalable Throughput (L4S), as an emerging router-queue management technique, has seen steady deployment in the industry. An L4S-enabled router assigns each packet to the queue based on the packet header marking. …
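For readers unfamiliar with L4S queue selection, the sketch below shows the header check an L4S-enabled router performs: the two ECN bits of the IP header select the queue, with ECT(1) identifying L4S traffic (per RFC 9331). The function is a simplified illustration, not SwiftQueue's mechanism.

```python
# Simplified L4S classification: the router inspects the 2-bit ECN field in the
# IP header; ECT(1) identifies L4S traffic (RFC 9331), and CE-marked packets
# are commonly classified with it, while everything else uses the classic queue.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


def select_queue(ecn_bits: int) -> str:
    if ecn_bits in (ECT1, CE):
        return "l4s_queue"
    return "classic_queue"


assert select_queue(ECT1) == "l4s_queue"
assert select_queue(ECT0) == "classic_queue"
```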
Do Large Language Models Need a Content Delivery Network?
As the use of large language models (LLMs) expands rapidly, so does the range of knowledge needed to supplement various LLM queries. Thus, enabling flexible and efficient injection of new knowledge in LLM inference is critical. Three high-…
Lysosomal biogenesis and function in osteoclasts: a comprehensive review
Lysosomes serve as catabolic centers and signaling hubs in cells, regulating a multitude of cellular processes such as intracellular environment homeostasis, macromolecule degradation, intracellular vesicle trafficking and autophagy. Alter…
NetLLM: Adapting Large Language Models for Networking
Many networking tasks now employ deep learning (DL) to solve complex prediction and optimization problems. However, the current design philosophy of DL-based algorithms entails intensive engineering overhead due to the manual design of deep…
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging as nothing can be generated until the whole context is proc…
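As a rough, much-simplified stand-in for the kind of KV cache compression the abstract describes (CacheGen's actual codec is more sophisticated than this), the sketch below uniformly quantizes fp16 KV tensors to 8 bits and compresses them for streaming.

```python
# Much-simplified KV cache compression for streaming: uniform 8-bit quantization
# plus zlib. This is an illustrative stand-in, not CacheGen's actual codec.
import zlib

import numpy as np


def compress_kv(kv: np.ndarray) -> tuple[bytes, float, float]:
    lo, hi = float(kv.min()), float(kv.max())
    scale = (hi - lo) / 255.0 or 1.0                  # avoid divide-by-zero
    codes = np.round((kv.astype(np.float32) - lo) / scale)
    q = np.clip(codes, 0, 255).astype(np.uint8)       # one byte per element
    return zlib.compress(q.tobytes()), lo, scale


def decompress_kv(blob: bytes, lo: float, scale: float, shape) -> np.ndarray:
    q = np.frombuffer(zlib.decompress(blob), dtype=np.uint8).reshape(shape)
    return (q.astype(np.float32) * scale + lo).astype(np.float16)   # lossy


kv = np.random.randn(2, 32, 8, 128).astype(np.float16)   # toy K/V tensor
blob, lo, scale = compress_kv(kv)
print(kv.nbytes, len(blob))   # the compressed bytes are what gets streamed
```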
Eloquent: A More Robust Transmission Scheme for LLM Token Streaming
To render each generated token in real-time for users, the Large Language Model (LLM) server generates tokens one by one and streams each token (or group of a few tokens) through the network to the user right after generation, which we ref…
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary contexts. To speed up the prefill of the long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when …
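As a concrete (and much simplified) illustration of chunk-level KV reuse, the sketch below checks which leading chunks of a new input already have stored KV caches; strictly, only a prefix computed in the same position is exactly reusable, which is the limitation that motivates cached knowledge fusion. The chunk-hash keying and store layout are assumptions for illustration.

```python
# Illustrative prefix reuse over per-chunk KV caches: only the leading chunks
# whose caches are stored (in order) can be reused directly; later chunks must
# be recomputed or fused. Chunk hashing as the cache key is an assumption.
import hashlib
from typing import Dict, List


def chunk_key(chunk_text: str) -> str:
    return hashlib.sha256(chunk_text.encode()).hexdigest()


def reusable_prefix(chunks: List[str], kv_store: Dict[str, object]) -> int:
    """Return how many leading chunks have precomputed KV caches available."""
    reused = 0
    for chunk in chunks:
        if chunk_key(chunk) not in kv_store:
            break               # first miss ends exact prefix reuse
        reused += 1
    return reused


store = {chunk_key("doc A"): "kv_A", chunk_key("doc C"): "kv_C"}
print(reusable_prefix(["doc A", "doc B", "doc C"], store))   # -> 1
```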
Earth+: on-board satellite imagery compression leveraging historical earth observations
With the increasing deployment of earth observation satellite constellations, the downlink (satellite-to-ground) capacity often limits the freshness, quality, and coverage of the imagery data available to applications on the ground. To ove…
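As an illustration of compressing against historical observations, the sketch below encodes only the difference between a new image tile and a previously downlinked reference tile. The residual-plus-zlib scheme is an assumed stand-in, not Earth+'s actual on-board pipeline.

```python
# Illustrative delta compression against a historical reference tile: transmit
# only the (compressed) difference from the last downlinked observation.
import zlib

import numpy as np


def encode_tile(new: np.ndarray, reference: np.ndarray) -> bytes:
    residual = new.astype(np.int16) - reference.astype(np.int16)  # signed diff
    return zlib.compress(residual.tobytes())


def decode_tile(blob: bytes, reference: np.ndarray, shape) -> np.ndarray:
    residual = np.frombuffer(zlib.decompress(blob), dtype=np.int16).reshape(shape)
    return (reference.astype(np.int16) + residual).astype(np.uint8)


ref = np.random.randint(0, 256, (256, 256), dtype=np.uint8)   # historical tile
new = ref.copy()
new[:16, :16] += 1            # only a small region changed since the last pass
blob = encode_tile(new, ref)
print(new.nbytes, len(blob))  # unchanged regions compress to almost nothing
```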
Towards Optimal Preemptive GPU Time-Sharing for Edge Model Serving
With GPUs increasingly shared by DNN models at the edge, a crucial tradeoff arises between high GPU utilization and the ability to preempt quickly when a high-priority request arrives. To reduce inference delay, an inference job can "burst…
VidPlat: A Tool for Fast Crowdsourcing of Quality-of-Experience Measurements
For video or web services, it is crucial to measure user-perceived quality of experience (QoE) at scale under various video quality or page loading delays. However, fast QoE measurements remain challenging as they must elicit subjective as…
OneAdapt
Deep learning inference on streaming media data, such as object detection in video or LiDAR feeds and text extraction from audio waves, is now ubiquitous. To achieve high inference accuracy, these applications typically require signific…
Estimating WebRTC Video QoE Metrics Without Using Application Headers
The increased use of video conferencing applications (VCAs) has made it critical to understand and support end-user quality of experience (QoE) by all stakeholders in the VCA ecosystem, especially network operators, who typically do not…
Run-Time Prevention of Software Integration Failures of Machine Learning APIs
Due to the under-specified interfaces, developers face challenges in correctly integrating machine learning (ML) APIs in software. Even when the ML API and the software are well designed on their own, the resulting application misbehaves w…