KV-Cache Compression via Attention Pattern Pruning for Latency-Constrained LLMs
2025 · Open Access · DOI: https://doi.org/10.5281/zenodo.17817217
Large Language Models (LLMs) have achieved remarkable success across diverse natural language processing tasks. However, their autoregressive inference, particularly with long input sequences, is significantly bottlenecked by the Key-Value (KV) cache. The KV cache, essential for avoiding redundant computations in the self-attention mechanism, grows linearly with sequence length, leading to substantial memory consumption and bandwidth demands. This overhead translates directly into increased inference latency, particularly critical for real-time, latency-constrained applications. Existing KV cache compression techniques often resort to quantization or simple heuristic-based token eviction, which can either sacrifice accuracy through lossy compression or discard vital context. This paper proposes a novel KV-cache compression method rooted in attention pattern pruning. By dynamically analyzing the self-attention weights across transformer layers and heads, our approach identifies and prunes redundant or less important key-value pairs that contribute minimally to the subsequent token generation. This intelligent pruning strategy ensures that only the most semantically relevant context is retained in the cache, thereby reducing its memory footprint and alleviating bandwidth limitations without significant degradation in model performance. We demonstrate that this method leads to a substantial reduction in inference latency and memory usage, enabling more efficient and responsive deployment of LLMs, especially in scenarios demanding low-latency outputs.
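The core idea described above, retaining only the key-value pairs that receive meaningful attention, can be illustrated with a short sketch. This is a minimal, hedged example rather than the paper's implementation: the function name `prune_kv_cache`, the scoring rule (attention mass summed over recent decode steps), the per-head `keep_ratio` budget, and the tensor shapes are all assumptions introduced here for illustration.

```python
# Minimal sketch of attention-score-based KV-cache pruning (illustrative only;
# the paper's exact scoring and budgeting strategy may differ).
import torch


def prune_kv_cache(keys, values, attn_weights, keep_ratio=0.5):
    """Keep only the most-attended key/value pairs for each head.

    keys, values: (batch, heads, seq_len, head_dim) cached projections.
    attn_weights: (batch, heads, query_len, seq_len) softmax attention
                  probabilities from recent decoding steps (assumed available).
    keep_ratio:   fraction of cached positions to retain (assumed budget).
    """
    batch, heads, seq_len, head_dim = keys.shape
    keep = max(1, int(seq_len * keep_ratio))

    # Importance of each cached position = total attention mass it received,
    # summed over the recent query positions.
    scores = attn_weights.sum(dim=2)                      # (batch, heads, seq_len)

    # Top-`keep` positions per head, re-sorted into original order so that
    # relative token ordering in the cache is preserved.
    top_idx = scores.topk(keep, dim=-1).indices           # (batch, heads, keep)
    top_idx, _ = top_idx.sort(dim=-1)

    # Gather the retained keys and values along the sequence dimension.
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)
    pruned_keys = keys.gather(dim=2, index=gather_idx)
    pruned_values = values.gather(dim=2, index=gather_idx)
    return pruned_keys, pruned_values, top_idx


if __name__ == "__main__":
    b, h, s, d = 1, 8, 1024, 64
    k = torch.randn(b, h, s, d)
    v = torch.randn(b, h, s, d)
    attn = torch.softmax(torch.randn(b, h, 16, s), dim=-1)  # last 16 decode steps
    pk, pv, kept = prune_kv_cache(k, v, attn, keep_ratio=0.25)
    print(pk.shape, pv.shape)  # torch.Size([1, 8, 256, 64]) for both
```

In this sketch the cache shrinks by the chosen ratio per head, which is what reduces memory footprint and bandwidth; how the retention budget and scoring are set in the paper itself is not specified in the abstract.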
- Type: article
- Landing Page: https://doi.org/10.5281/zenodo.17817217
- OA Status: green
- OpenAlex ID: https://openalex.org/W7108609355