Memory bandwidth
Chopper: A Multi-Level GPU Characterization Tool & Derived Insights Into LLM Training Inefficiency
Training large language models (LLMs) efficiently requires a deep understanding of how modern GPU systems behave under real-world distributed training workloads. While prior work has focused primarily on kernel-level performance or single-…

GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory
Modern autoregressive models rely on attention, yet the softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention p…

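The linear-time claim is easiest to see in code: each query attends to at most w keys, so the cost is O(T·w·d) instead of O(T²·d). Below is a minimal NumPy sketch of plain sliding-window attention, not the paper's gated/flash variant; all names are illustrative.

    import numpy as np

    def sliding_window_attention(q, k, v, w):
        # q, k, v: (T, d) arrays; each query t attends only to keys in
        # [t-w+1, t], so total work is O(T*w*d) rather than O(T^2*d).
        T, d = q.shape
        out = np.zeros_like(v)
        for t in range(T):
            lo = max(0, t - w + 1)
            scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)  # at most w scores
            p = np.exp(scores - scores.max())           # stable softmax
            p /= p.sum()
            out[t] = p @ v[lo:t + 1]
        return out

    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
    out = sliding_window_attention(q, k, v, w=4)        # causal, window of 4

Real kernels vectorize the loop and fuse the softmax, but the asymptotics are the same.
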
NysX: An Accurate and Energy-Efficient FPGA Accelerator for Hyperdimensional Graph Classification at the Edge
Real-time, energy-efficient inference on edge devices is essential for graph classification across a range of applications. Hyperdimensional Computing (HDC) is a brain-inspired computing paradigm that encodes input features into low-precis…

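For readers new to HDC, the core encode step takes only a few lines: bind feature and value hypervectors elementwise, bundle the pairs by summation, binarize, and compare by similarity. A toy sketch under those assumptions (the dimensionality and feature/value scheme are illustrative, not NysX's encoder):

    import numpy as np

    D = 10_000                              # hypervector dimensionality
    rng = np.random.default_rng(0)

    def rand_hv():
        # random bipolar {-1, +1} hypervector
        return rng.choice([-1, 1], size=D)

    # item memories: one random hypervector per feature id and value level
    feature_hv = {f: rand_hv() for f in range(5)}
    value_hv = {v: rand_hv() for v in range(3)}

    def encode(sample):
        # bind each feature to its value (elementwise product), bundle the
        # pairs by summation, then binarize back to {-1, +1}
        bundled = sum(feature_hv[f] * value_hv[v] for f, v in sample.items())
        return np.sign(bundled)

    a = encode({0: 1, 1: 2, 2: 0, 3: 1, 4: 2})
    b = encode({0: 1, 1: 2, 2: 0, 3: 1, 4: 0})  # differs in one feature
    sim = (a @ b) / D   # high for similar inputs, near 0 for unrelated ones
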
RAVE: Rate-Adaptive Visual Encoding for 3D Gaussian Splatting
Recent advances in neural scene representations have transformed immersive multimedia, with 3D Gaussian Splatting (3DGS) enabling real-time photorealistic rendering. Despite its efficiency, 3DGS suffers from large memory requirements and c…

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices
Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-b…

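The LUT idea itself is simple to demonstrate: with ternary (1.58-bit) weights, a group of g weights has only 3^g possible patterns, so the dot product of each activation group with every pattern can be precomputed once, turning each weight group into a single table lookup. A toy sketch of that principle only (illustrative; this is not Vec-LUT's vectorized kernel):

    import numpy as np
    from itertools import product

    def lut_dot(activations, ternary_weights, g=4):
        # LUT-style dot product for ternary {-1, 0, +1} weights: per group
        # of g activations, precompute partial sums for all 3**g patterns,
        # then replace g multiply-accumulates with one table lookup.
        assert len(activations) % g == 0
        patterns = np.array(list(product([-1, 0, 1], repeat=g)))  # (3**g, g)
        total = 0.0
        for i in range(0, len(activations), g):
            table = patterns @ activations[i:i + g]    # (3**g,) partial sums
            idx = 0
            for wgt in ternary_weights[i:i + g]:       # base-3 group index
                idx = idx * 3 + (wgt + 1)
            total += table[idx]
        return total

    x = np.random.default_rng(1).normal(size=8)
    w = np.array([1, -1, 0, 1, 0, 0, -1, 1])
    assert np.isclose(lut_dot(x, w), x @ w)            # exact, not approximate

In a real kernel the table built for an activation group is reused across every row of the weight matrix, which is what amortizes the 3^g precompute.
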
Accelerating 3D Seismic Wave Simulations on ARM Using a Hybrid Half-Precision and Scalable Vector Extension Approach
Seismic simulation is fundamental for understanding earthquake physics and mitigating seismic hazards, but accurate seismic modeling requires fine computational grids, imposing severe memory and computational challenges. Traditional modeli…

Performance Evaluation of NEORV32 RISC-V Processor Using FreeRTOS Trace-Driven Cache Simulation
In increasingly complex embedded systems, optimizing the memory hierarchy is crucial to attaining high performance and energy efficiency. This thesis uses execution traces from FreeRTOS workloads that mimic realistic multitasking behavior to a…

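Trace-driven cache simulation needs little machinery: replay recorded memory addresses through a cache model and count hits and misses. A minimal sketch of the idea (the LRU set-associative model and its parameters are illustrative, not the thesis's simulator):

    def simulate_cache(trace, size=1024, block=16, ways=2):
        # Toy set-associative cache with LRU replacement, driven by a list
        # of byte addresses from an execution trace; returns (hits, misses).
        sets = size // (block * ways)
        cache = [[] for _ in range(sets)]    # each set: tags in LRU order
        hits = misses = 0
        for addr in trace:
            index = (addr // block) % sets
            tag = addr // (block * sets)
            tags = cache[index]
            if tag in tags:
                hits += 1
                tags.remove(tag)
                tags.append(tag)             # move to most-recently-used
            else:
                misses += 1
                if len(tags) == ways:
                    tags.pop(0)              # evict least-recently-used
                tags.append(tag)
        return hits, misses

    trace = [0x100, 0x104, 0x2100, 0x108, 0x100, 0x4100]
    print(simulate_cache(trace))             # (3, 3) on this toy trace
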
FlashAttention: Breaking the Memory Wall for Efficient Self-Attention Scaling
The self-attention mechanism is a cornerstone of the Transformer architecture, driving significant advancements across natural language processing, computer vision, and other domains. However, its quadratic computational complexity and, cr…

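The memory-wall point is that naive attention materializes a T×T score matrix (at T = 32k in fp16 that is 2 GiB per head), whereas FlashAttention-style kernels stream over key/value blocks with an online softmax and never store it. A minimal NumPy sketch of the online-softmax recurrence (block size illustrative; the real kernel also tiles queries and runs in GPU SRAM):

    import numpy as np

    def blockwise_attention(q, k, v, block=32):
        # Streams over key/value blocks keeping a running max (m), softmax
        # denominator (l), and value accumulator (acc), so peak memory is
        # O(T*block) instead of the full O(T^2) score matrix.
        T, d = q.shape
        scale = 1.0 / np.sqrt(d)
        m = np.full(T, -np.inf)
        l = np.zeros(T)
        acc = np.zeros_like(q)
        for j in range(0, T, block):
            s = (q @ k[j:j + block].T) * scale      # (T, block) scores only
            m_new = np.maximum(m, s.max(axis=1))
            corr = np.exp(m - m_new)                # rescale old statistics
            p = np.exp(s - m_new[:, None])
            l = l * corr + p.sum(axis=1)
            acc = acc * corr[:, None] + p @ v[j:j + block]
            m = m_new
        return acc / l[:, None]

    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(128, 16)) for _ in range(3))
    s = q @ k.T / np.sqrt(16)
    p = np.exp(s - s.max(axis=1, keepdims=True))
    ref = (p / p.sum(axis=1, keepdims=True)) @ v    # naive O(T^2) reference
    assert np.allclose(blockwise_attention(q, k, v), ref)
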
Flash Attention: Unlocking Bandwidth-Optimal Self-Attention for Trillion-Parameter Models
The Transformer architecture, with its cornerstone self-attention mechanism, has revolutionized deep learning, particularly in natural language processing. However, as models scale towards trillions of parameters and sequence lengths grow,…

High-Throughput, Cost-Effective Billion-Scale Vector Search with a Single GPU
Approximate nearest neighbor search (ANNS) is broadly adopted in numerous scenarios. Real-world applications seek efficient ways to search billion-scale vectors at high throughput. On-SSD graph-based ANNS systems have the opportunity to ac…

KV-Cache Compression via Attention Pattern Pruning for Latency-Constrained LLMs
Large Language Models (LLMs) have achieved remarkable success across diverse natural language processing tasks. However, their autoregressive inference, particularly with long input sequences, is significantly bottlenecked by the Key-Value…

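The size of that bottleneck is easy to estimate: the cache stores one key and one value vector per token, per layer, per head. Back-of-envelope arithmetic for a hypothetical 7B-class configuration (numbers are illustrative, not from this paper):

    # Hypothetical 7B-class config: 32 layers, 32 KV heads, head dim 128,
    # fp16, no grouped-query attention.
    layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
    print(f"{bytes_per_token / 2**20:.1f} MiB per token")             # 0.5 MiB

    for seq_len in (4096, 32768, 131072):
        print(f"{seq_len:>6} tokens -> "
              f"{bytes_per_token * seq_len / 2**30:.0f} GiB")
    # 4096 -> 2 GiB, 32768 -> 16 GiB, 131072 -> 64 GiB

At long contexts the cache dwarfs the activations, which is why pruning or compressing it directly attacks the latency bottleneck.
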
Cohet: A CXL-Driven Coherent Heterogeneous Computing Framework with Hardware-Calibrated Full-System Simulation
Conventional heterogeneous computing systems built on PCIe interconnects suffer from inefficient fine-grained host-device interactions and complex programming models. In recent years, many proprietary and open cache-coherent interconnect s…

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
As efficient alternatives to softmax attention, linear state-space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall-oriented ta…

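The "fading summary" is visible directly in the recurrence: a diagonal linear SSM keeps a fixed-size state h_t = a ⊙ h_{t-1} + B x_t, so with |a| < 1 a token seen s steps ago survives only with weight a^s. A toy sketch of a plain linear SSM (not the paper's gated, ridge-regression layer; all parameters illustrative):

    import numpy as np

    def linear_ssm(x, a, B):
        # Diagonal linear SSM scan: h_t = a * h_{t-1} + B @ x_t.
        # State size is constant (len(a)) and compute is linear in T, but
        # with |a| < 1 token s reaches h_t only with weight a**(t - s):
        # the state is an exponentially fading summary of the past.
        h = np.zeros(B.shape[0])
        states = []
        for x_t in x:
            h = a * h + B @ x_t
            states.append(h.copy())
        return np.stack(states)

    rng = np.random.default_rng(0)
    x = rng.normal(size=(100, 4))
    a = np.full(8, 0.9)          # decay 0.9 -> memory half-life ~ 6.6 steps
    B = rng.normal(size=(8, 4))
    h = linear_ssm(x, a, B)      # x[0]'s weight in h[99] is 0.9**99 ~ 3e-5
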
Compute-in-Memory Based on Emerging Non-Volatile Memories: RRAM, MRAM, and FeRAM
In the era of artificial intelligence, the Internet of Things, and big data, processing massive data places unprecedented demands on the throughput and energy efficiency of computing systems. In traditional von Neumann architectures…

LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling
Large Language Models (LLMs) have achieved unprecedented success across various applications, but their substantial memory requirements pose significant challenges to current memory system designs, especially during inference. Our work tar…

Realizing Fully-Integrated, Low-Power, Event-Based Pupil Tracking with Neuromorphic Hardware
Eye tracking is fundamental to numerous applications, yet achieving robust, high-frequency tracking with ultra-low power consumption remains challenging for wearable platforms. While event-based vision sensors offer microsecond resolution …

AME: An Efficient Heterogeneous Agentic Memory Engine for Smartphones
On-device agents on smartphones increasingly require continuously evolving memory to support personalized, context-aware, and long-term behaviors. To meet both privacy and responsiveness demands, user data is embedded as vectors and stored…

Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration
Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n\times n$). We present Low-Rank GEMM,…

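The algebra behind the speedup: if A factors (exactly or approximately) as U V with rank r ≪ n, then A B = U (V B) costs two thin GEMMs, $\mathcal{O}(n^2 r)$, instead of one $\mathcal{O}(n^3)$ GEMM. A minimal NumPy sketch of that idea only (the paper's FP8 tensor-core path and rank-selection policy are not reproduced here):

    import numpy as np

    def low_rank_gemm(A, B, r):
        # Approximate C = A @ B by compressing A to rank r with a truncated
        # SVD, then multiplying in factored form: U_r @ (V_r @ B) is two
        # thin GEMMs, O(n^2 * r), instead of one O(n^3) GEMM.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        Ur = U[:, :r] * s[:r]        # (n, r), singular values folded in
        Vr = Vt[:r]                  # (r, n)
        return Ur @ (Vr @ B)

    n, r = 512, 32
    rng = np.random.default_rng(0)
    A = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))  # genuinely rank-r
    B = rng.normal(size=(n, n))
    err = (np.linalg.norm(low_rank_gemm(A, B, r) - A @ B)
           / np.linalg.norm(A @ B))
    # err ~ 0 here; for general A the error is set by the discarded
    # singular values
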