Zenodo (CERN European Organization for Nuclear Research)
KV-Cache Compression via Attention Pattern Pruning for Latency-Constrained LLMs
December 2025
Large Language Models (LLMs) have achieved remarkable success across diverse natural language processing tasks. However, their autoregressive inference, particularly with long input sequences, is significantly bottlenecked by the Key-Value (KV) cache. The KV cache, essential for avoiding redundant computations in the self-attention mechanism, grows linearly with sequence length, leading to substantial memory consumption and bandwidth demands. This overhead translates directly into increased inference latency, part…
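To make the scale of this overhead concrete, the sketch below gives a back-of-the-envelope estimate of per-request KV-cache memory as a function of sequence length, followed by a generic score-based cache-pruning step of the kind the title alludes to. This is an illustrative sketch only: the model dimensions, the `estimate_kv_cache_bytes` and `prune_kv_cache` helpers, and the use of cumulative attention as the pruning score are assumptions for exposition, not the method proposed in this work.

```python
import torch

def estimate_kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8,
                            head_dim=128, bytes_per_elem=2):
    """Per-request KV-cache size: two tensors (K and V) per layer,
    each of shape [seq_len, n_kv_heads, head_dim]."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Linear growth with sequence length (fp16 elements, illustrative dimensions).
for L in (2_048, 32_768, 128_000):
    print(f"{L:>7} tokens -> {estimate_kv_cache_bytes(L) / 2**30:.2f} GiB")

def prune_kv_cache(keys, values, attn_scores, keep_ratio=0.25):
    """Keep only the cached positions with the highest attention scores.
    A generic eviction sketch, not the compression scheme of this paper.
    keys, values: [seq_len, n_heads, head_dim]; attn_scores: [seq_len]."""
    k = max(1, int(keep_ratio * attn_scores.numel()))
    keep = torch.topk(attn_scores, k).indices.sort().values  # keep original order
    return keys[keep], values[keep]

# Toy usage: prune a 1,024-token cache down to its 25% most-attended positions.
S, H, D = 1024, 8, 128
k_cache, v_cache = torch.randn(S, H, D), torch.randn(S, H, D)
scores = torch.rand(S)                       # stand-in for accumulated attention
k_small, v_small = prune_kv_cache(k_cache, v_cache, scores)
print(k_small.shape)                         # torch.Size([256, 8, 128])
```

Under these assumed dimensions the cache grows from roughly 0.25 GiB at 2K tokens to over 15 GiB at 128K tokens per request, which is why pruning rarely attended positions is an attractive lever for latency-constrained serving.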