Eugene Kwek
COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens
Making large language models (LLMs) more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a promising technique, but existing pruning me…