Zenodo (CERN European Organization for Nuclear Research)
KV-Cache Compression via Attention Pattern Pruning for Latency-Constrained LLMs
December 2025 • Revista, Zen, IA, 10
Large Language Models (LLMs) have achieved remarkable success across diverse natural language processing tasks. However, their autoregressive inference, particularly with long input sequences, is significantly bottlenecked by the Key-Value (KV) cache. The KV cache, essential for avoiding redundant computations in the self-attention mechanism, grows linearly with sequence length, leading to substantial memory consumption and bandwidth demands. This overhead translates directly into increased inference latency, part…
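The abstract's central point, that the KV cache grows linearly with sequence length and comes to dominate memory and bandwidth at long contexts, is easy to make concrete. The sketch below (Python with NumPy) estimates cache size for a typical 32-layer, 32-head, fp16 configuration and shows one simple attention-score-based pruning heuristic. The model dimensions, the `keep_ratio`, and the `prune_kv_by_attention` helper are illustrative assumptions for exposition, not the method described in this record.

```python
# Minimal sketch (not the record's algorithm): (1) why the KV cache grows
# linearly with sequence length, and (2) one simple way to prune cached
# entries using accumulated attention weights. Dimensions are assumptions.
import numpy as np

def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    """Memory for keys + values across all layers and heads (fp16 by default)."""
    return 2 * n_layers * n_heads * head_dim * seq_len * dtype_bytes

for seq_len in (1_024, 8_192, 32_768):
    print(f"{seq_len:>6} tokens -> {kv_cache_bytes(seq_len) / 2**30:.2f} GiB")

def prune_kv_by_attention(keys, values, attn_weights, keep_ratio=0.5):
    """Keep the cached positions that received the most attention so far.

    keys, values: (seq_len, head_dim) cached tensors for one head.
    attn_weights: (n_queries, seq_len) softmax attention rows observed so far.
    """
    scores = attn_weights.sum(axis=0)          # total attention per cached position
    k = max(1, int(keep_ratio * keys.shape[0]))
    keep = np.sort(np.argsort(scores)[-k:])    # top-k positions, original order kept
    return keys[keep], values[keep], keep

# Toy usage: a 16-token cache pruned to 8 entries.
rng = np.random.default_rng(0)
seq_len, head_dim = 16, 64
keys = rng.standard_normal((seq_len, head_dim))
values = rng.standard_normal((seq_len, head_dim))
attn = rng.random((4, seq_len))
attn /= attn.sum(axis=1, keepdims=True)        # each row sums to 1, like softmax
pk, pv, kept = prune_kv_by_attention(keys, values, attn, keep_ratio=0.5)
print("kept positions:", kept)
```

With these assumed dimensions the cache costs roughly 0.5 GiB per 1,024 tokens, so a 32k-token context alone consumes about 16 GiB; pruning by observed attention is one of several heuristics (alongside quantization and other lossy compression) for cutting that footprint.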
Computer Science
Cache (Computing)
Quantization (Signal Processing)
Lossy Compression
Transformer
Artificial Intelligence
Pruning
Security Token
Algorithm
Compression Ratio
Machine Learning