KV-Cache Compression via Attention Pattern Pruning for Latency-Constrained LLMs
2025 · Open Access · DOI: https://doi.org/10.5281/zenodo.17817217
Large Language Models (LLMs) have achieved remarkable success across diverse natural language processing tasks. However, their autoregressive inference, particularly with long input sequences, is significantly bottlenecked by the Key-Value (KV) cache. The KV cache, essential for avoiding redundant computations in the self-attention mechanism, grows linearly with sequence length, leading to substantial memory consumption and bandwidth demands. This overhead translates directly into increased inference latency, particularly critical for real-time, latency-constrained applications. Existing KV cache compression techniques often resort to quantization or simple heuristic-based token eviction, which can either sacrifice accuracy through lossy compression or discard vital context. This paper proposes a novel KV-cache compression method rooted in attention pattern pruning. By dynamically analyzing the self-attention weights across transformer layers and heads, our approach identifies and prunes redundant or less important key-value pairs that contribute minimally to the subsequent token generation. This intelligent pruning strategy ensures that only the most semantically relevant context is retained in the cache, thereby reducing its memory footprint and alleviating bandwidth limitations without significant degradation in model performance. We demonstrate that this method leads to a substantial reduction in inference latency and memory usage, enabling more efficient and responsive deployment of LLMs, especially in scenarios demanding low-latency outputs.
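The core idea described above, retaining only the key-value pairs that receive meaningful attention, can be illustrated with a short sketch. This is a minimal, hedged example rather than the paper's implementation: the function name `prune_kv_cache`, the scoring rule (attention mass summed over recent decode steps), the per-head `keep_ratio` budget, and the tensor shapes are all assumptions introduced here for illustration.

```python
# Minimal sketch of attention-score-based KV-cache pruning (illustrative only;
# the paper's exact scoring and budgeting strategy may differ).
import torch


def prune_kv_cache(keys, values, attn_weights, keep_ratio=0.5):
    """Keep only the most-attended key/value pairs for each head.

    keys, values: (batch, heads, seq_len, head_dim) cached projections.
    attn_weights: (batch, heads, query_len, seq_len) softmax attention
                  probabilities from recent decoding steps (assumed available).
    keep_ratio:   fraction of cached positions to retain (assumed budget).
    """
    batch, heads, seq_len, head_dim = keys.shape
    keep = max(1, int(seq_len * keep_ratio))

    # Importance of each cached position = total attention mass it received,
    # summed over the recent query positions.
    scores = attn_weights.sum(dim=2)                      # (batch, heads, seq_len)

    # Top-`keep` positions per head, re-sorted into original order so that
    # relative token ordering in the cache is preserved.
    top_idx = scores.topk(keep, dim=-1).indices           # (batch, heads, keep)
    top_idx, _ = top_idx.sort(dim=-1)

    # Gather the retained keys and values along the sequence dimension.
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)
    pruned_keys = keys.gather(dim=2, index=gather_idx)
    pruned_values = values.gather(dim=2, index=gather_idx)
    return pruned_keys, pruned_values, top_idx


if __name__ == "__main__":
    b, h, s, d = 1, 8, 1024, 64
    k = torch.randn(b, h, s, d)
    v = torch.randn(b, h, s, d)
    attn = torch.softmax(torch.randn(b, h, 16, s), dim=-1)  # last 16 decode steps
    pk, pv, kept = prune_kv_cache(k, v, attn, keep_ratio=0.25)
    print(pk.shape, pv.shape)  # torch.Size([1, 8, 256, 64]) for both
```

In this sketch the cache shrinks by the chosen ratio per head, which is what reduces memory footprint and bandwidth; how the retention budget and scoring are set in the paper itself is not specified in the abstract.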
- Type: article
- Landing Page: https://doi.org/10.5281/zenodo.17817217
- OA Status: green
- OpenAlex ID: https://openalex.org/W7108609355