Junchen Jiang
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
KV cache has traditionally been stored in GPU memory to accelerate the decoding phase of large language model (LLM) inference. However, it is increasingly necessary to move KV caches outside GPU devices, to enable cache reuse across differ…
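To make the idea of a GPU-external KV cache layer concrete, below is a minimal Python/PyTorch sketch of a store that offloads KV tensors to host memory and reloads them when the same token prefix is reused. The store/fetch interface keyed by a token-prefix hash and the flat in-process CPU dictionary are illustrative assumptions, not LMCache's actual API.

```python
# Minimal sketch of a KV cache layer that keeps KV tensors outside GPU memory
# and reloads them when the same token prefix is seen again. The store/fetch
# interface and the in-process CPU dict are illustrative assumptions.
import hashlib
from typing import Optional, Tuple

import torch


def prefix_key(token_ids: Tuple[int, ...]) -> str:
    # Identical token prefixes hash to the same key, so their KV can be shared.
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()


class KVCacheStore:
    def __init__(self) -> None:
        # Host-side storage; in a real deployment this could be CPU DRAM,
        # local disk, or a remote cache server shared across engines.
        self._entries: dict[str, Tuple[torch.Tensor, torch.Tensor]] = {}

    def store(self, token_ids: Tuple[int, ...],
              k: torch.Tensor, v: torch.Tensor) -> None:
        # Offload KV tensors to CPU so GPU memory is freed for decoding.
        self._entries[prefix_key(token_ids)] = (k.to("cpu"), v.to("cpu"))

    def fetch(self, token_ids: Tuple[int, ...], device: str = "cpu"
              ) -> Optional[Tuple[torch.Tensor, torch.Tensor]]:
        entry = self._entries.get(prefix_key(token_ids))
        if entry is None:
            return None          # miss: the engine must prefill from scratch
        k, v = entry
        return k.to(device), v.to(device)   # hit: skip prefill for this prefix
```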
ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching
Distributed prefix caching accelerates long-context LLM serving by reusing KV cache entries for common context prefixes. However, KV cache fetches can become a bottleneck when network bandwidth is limited. Compression mitigates the bandwid…
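A back-of-the-envelope calculation shows why fetching raw KV caches can become a bottleneck under limited bandwidth. The model dimensions below are those of a Llama-3-8B-style model and the 10 Gbps link is an assumption; both are only illustrative.

```python
# Rough estimate of KV cache fetch time over a constrained link.
# Model dimensions (Llama-3-8B-style) and the 10 Gbps link are assumptions.
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2   # fp16 K and V

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token)                  # 131072 bytes = 128 KiB per token

context_tokens = 32_768
total_bytes = context_tokens * kv_bytes_per_token
print(total_bytes / 2**30)                 # 4.0 GiB for a 32K-token context

link_bytes_per_s = 10e9 / 8                # 10 Gbps link
print(total_bytes / link_bytes_per_s)      # ~3.4 s just to move the cache
```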
AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving
Large language model (LLM) applications often reuse previously processed context, such as chat history and documents, which introduces significant redundant computation. Existing LLM serving systems address such redundant computation by st…
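As a rough illustration of what a KV cache storage hierarchy looks like, the sketch below checks progressively slower tiers in order. The tier names and the flat per-tier dictionaries are assumptions for illustration, not AdaptCache's design.

```python
# Illustrative lookup over a KV cache storage hierarchy: check the fastest tier
# first and fall back to slower, larger ones. Tier contents are plain dicts
# here; a real system would hold GPU/CPU tensors and files or remote objects.
from typing import Any, Optional


class TieredKVCache:
    def __init__(self) -> None:
        # Ordered from lowest to highest access latency (assumed tiers).
        self.tiers = [
            ("gpu_hbm", {}),
            ("cpu_dram", {}),
            ("local_ssd", {}),
        ]

    def get(self, key: str) -> Optional[Any]:
        for name, tier in self.tiers:
            if key in tier:
                # A real system would also promote the entry to faster tiers
                # and account for the transfer/decompression delay of `name`.
                return tier[key]
        return None   # miss in every tier: recompute the KV cache (prefill)

    def put(self, key: str, value: Any, tier_name: str = "cpu_dram") -> None:
        for name, tier in self.tiers:
            if name == tier_name:
                tier[key] = value
                return
```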
PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications
Besides typical generative applications, like ChatGPT, GitHub Copilot, and Cursor, we observe an emerging trend that LLMs are increasingly used in traditional discriminative tasks, such as recommendation, credit verification, and data labe…
Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache
Across large language model (LLM) applications, we observe an emerging trend for reusing KV caches to save the prefill delays of processing repeated input texts in different LLM inputs. This has led to a broad design space, including coloc…
HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) by integrating external knowledge into the generation process. A key component of RAG pipelines is the re…
METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation
RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work…
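To illustrate the kind of quality/delay knob the abstract refers to, here is a hedged sketch that picks how many retrieved chunks a request can afford under a delay budget. The linear cost model, parameter names, and values are invented for illustration and are not METIS's algorithm.

```python
# Illustrative configuration adaptation for RAG: retrieve as many chunks as the
# delay budget allows, given a simple (assumed) linear prefill-cost model.
def choose_num_chunks(delay_budget_s: float,
                      base_delay_s: float = 0.15,      # assumed fixed overhead
                      per_chunk_delay_s: float = 0.04, # assumed cost per chunk
                      max_chunks: int = 20) -> int:
    affordable = int((delay_budget_s - base_delay_s) / per_chunk_delay_s)
    return max(0, min(max_chunks, affordable))


# Example: a 0.5 s budget affords ~8 chunks under these assumed costs.
print(choose_num_chunks(0.5))   # -> 8
```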
LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts
While large language models (LLMs) show impressive performance on complex tasks, they still struggle with long-context understanding and high computational costs. To balance efficiency and quality, we introduce LLMSteer, a fine-tuning-fr…
Loss-tolerant neural video codec aware congestion control for real time video communication
Because of reinforcement learning's (RL) ability to automatically create more adaptive control logic than hand-crafted heuristics, numerous efforts have been made to apply RL to congestion control (CC) design for real time video c…
SwiftQueue: Optimizing Low-Latency Applications with Swift Packet Queuing
Low Latency, Low Loss, and Scalable Throughput (L4S), as an emerging router-queue management technique, has seen steady deployment in the industry. An L4S-enabled router assigns each packet to the queue based on the packet header marking. …
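For readers unfamiliar with L4S queue selection, the sketch below shows the header check an L4S-enabled router performs: the two ECN bits of the IP header select the queue, with ECT(1) identifying L4S traffic (per RFC 9331). The function is a simplified illustration, not SwiftQueue's mechanism.

```python
# Simplified L4S classification: the router inspects the 2-bit ECN field in the
# IP header; ECT(1) identifies L4S traffic (RFC 9331), and CE-marked packets
# are commonly classified with it, while everything else uses the classic queue.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


def select_queue(ecn_bits: int) -> str:
    if ecn_bits in (ECT1, CE):
        return "l4s_queue"
    return "classic_queue"


assert select_queue(ECT1) == "l4s_queue"
assert select_queue(ECT0) == "classic_queue"
```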
Do Large Language Models Need a Content Delivery Network?
As the use of large language models (LLMs) expands rapidly, so does the range of knowledge needed to supplement various LLM queries. Thus, enabling flexible and efficient injection of new knowledge in LLM inference is critical. Three high-…
Lysosomal biogenesis and function in osteoclasts: a comprehensive review
Lysosomes serve as catabolic centers and signaling hubs in cells, regulating a multitude of cellular processes such as intracellular environment homeostasis, macromolecule degradation, intracellular vesicle trafficking and autophagy. Alter…
NetLLM: Adapting Large Language Models for Networking
Many networking tasks now employ deep learning (DL) to solve complex prediction and optimization problems. However, the current design philosophy of DL-based algorithms entails intensive engineering overhead due to the manual design of deep…
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging as nothing can be generated until the whole context is proc…
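As a rough, much-simplified stand-in for the kind of KV cache compression the abstract describes (CacheGen's actual codec is more sophisticated than this), the sketch below uniformly quantizes fp16 KV tensors to 8 bits and compresses them for streaming.

```python
# Much-simplified KV cache compression for streaming: uniform 8-bit quantization
# plus zlib. This is an illustrative stand-in, not CacheGen's actual codec.
import zlib

import numpy as np


def compress_kv(kv: np.ndarray) -> tuple[bytes, float, float]:
    lo, hi = float(kv.min()), float(kv.max())
    scale = (hi - lo) / 255.0 or 1.0                  # avoid divide-by-zero
    codes = np.round((kv.astype(np.float32) - lo) / scale)
    q = np.clip(codes, 0, 255).astype(np.uint8)       # one byte per element
    return zlib.compress(q.tobytes()), lo, scale


def decompress_kv(blob: bytes, lo: float, scale: float, shape) -> np.ndarray:
    q = np.frombuffer(zlib.decompress(blob), dtype=np.uint8).reshape(shape)
    return (q.astype(np.float32) * scale + lo).astype(np.float16)   # lossy


kv = np.random.randn(2, 32, 8, 128).astype(np.float16)   # toy K/V tensor
blob, lo, scale = compress_kv(kv)
print(kv.nbytes, len(blob))   # the compressed bytes are what gets streamed
```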
Eloquent: A More Robust Transmission Scheme for LLM Token Streaming
To render each generated token in real-time for users, the Large Language Model (LLM) server generates tokens one by one and streams each token (or group of a few tokens) through the network to the user right after generation, which we ref…
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary contexts. To speed up the prefill of the long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when …
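As a concrete (and much simplified) illustration of chunk-level KV reuse, the sketch below checks which leading chunks of a new input already have stored KV caches; strictly, only a prefix computed in the same position is exactly reusable, which is the limitation that motivates cached knowledge fusion. The chunk-hash keying and store layout are assumptions for illustration.

```python
# Illustrative prefix reuse over per-chunk KV caches: only the leading chunks
# whose caches are stored (in order) can be reused directly; later chunks must
# be recomputed or fused. Chunk hashing as the cache key is an assumption.
import hashlib
from typing import Dict, List


def chunk_key(chunk_text: str) -> str:
    return hashlib.sha256(chunk_text.encode()).hexdigest()


def reusable_prefix(chunks: List[str], kv_store: Dict[str, object]) -> int:
    """Return how many leading chunks have precomputed KV caches available."""
    reused = 0
    for chunk in chunks:
        if chunk_key(chunk) not in kv_store:
            break               # first miss ends exact prefix reuse
        reused += 1
    return reused


store = {chunk_key("doc A"): "kv_A", chunk_key("doc C"): "kv_C"}
print(reusable_prefix(["doc A", "doc B", "doc C"], store))   # -> 1
```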
Earth+: on-board satellite imagery compression leveraging historical earth observations
With the increasing deployment of earth observation satellite constellations, the downlink (satellite-to-ground) capacity often limits the freshness, quality, and coverage of the imagery data available to applications on the ground. To ove…
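As an illustration of compressing against historical observations, the sketch below encodes only the difference between a new image tile and a previously downlinked reference tile. The residual-plus-zlib scheme is an assumed stand-in, not Earth+'s actual on-board pipeline.

```python
# Illustrative delta compression against a historical reference tile: transmit
# only the (compressed) difference from the last downlinked observation.
import zlib

import numpy as np


def encode_tile(new: np.ndarray, reference: np.ndarray) -> bytes:
    residual = new.astype(np.int16) - reference.astype(np.int16)  # signed diff
    return zlib.compress(residual.tobytes())


def decode_tile(blob: bytes, reference: np.ndarray, shape) -> np.ndarray:
    residual = np.frombuffer(zlib.decompress(blob), dtype=np.int16).reshape(shape)
    return (reference.astype(np.int16) + residual).astype(np.uint8)


ref = np.random.randint(0, 256, (256, 256), dtype=np.uint8)   # historical tile
new = ref.copy()
new[:16, :16] += 1            # only a small region changed since the last pass
blob = encode_tile(new, ref)
print(new.nbytes, len(blob))  # unchanged regions compress to almost nothing
```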
Towards Optimal Preemptive GPU Time-Sharing for Edge Model Serving
With GPUs increasingly shared by DNN models at the edge, a crucial tradeoff arises between high GPU utilization and the ability to preempt quickly when a high-priority request arrives. To reduce inference delay, an inference job can "burst…
VidPlat: A Tool for Fast Crowdsourcing of Quality-of-Experience Measurements
For video or web services, it is crucial to measure user-perceived quality of experience (QoE) at scale under various video quality or page loading delays. However, fast QoE measurements remain challenging as they must elicit subjective as…
OneAdapt
Deep learning inference on streaming media data, such as object detection in video or LiDAR feeds and text extraction from audio waves, is now ubiquitous. To achieve high inference accuracy, these applications typically require signific…
Estimating WebRTC Video QoE Metrics Without Using Application Headers
The increased use of video conferencing applications (VCAs) has made it critical to understand and support end-user quality of experience (QoE) by all stakeholders in the VCA ecosystem, especially network operators, who typically do not…
Run-Time Prevention of Software Integration Failures of Machine Learning APIs
Due to the under-specified interfaces, developers face challenges in correctly integrating machine learning (ML) APIs in software. Even when the ML API and the software are well designed on their own, the resulting application misbehaves w…