Memory bandwidth
Chopper: A Multi-Level GPU Characterization Tool & Derived Insights Into LLM Training Inefficiency
Training large language models (LLMs) efficiently requires a deep understanding of how modern GPU systems behave under real-world distributed training workloads. While prior work has focused primarily on kernel-level performance or single-…

GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory
Modern autoregressive models rely on attention, yet the softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention p…

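The linear-time claim is easiest to see in code: each query attends to at most w keys, so the cost is O(T·w·d) instead of O(T²·d). Below is a minimal NumPy sketch of plain sliding-window attention, not the paper's gated/flash variant; all names are illustrative.

    import numpy as np

    def sliding_window_attention(q, k, v, w):
        # q, k, v: (T, d) arrays; each query t attends only to keys in
        # [t-w+1, t], so total work is O(T*w*d) rather than O(T^2*d).
        T, d = q.shape
        out = np.zeros_like(v)
        for t in range(T):
            lo = max(0, t - w + 1)
            scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)  # at most w scores
            p = np.exp(scores - scores.max())           # stable softmax
            p /= p.sum()
            out[t] = p @ v[lo:t + 1]
        return out

    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
    out = sliding_window_attention(q, k, v, w=4)        # causal, window of 4

Real kernels vectorize the loop and fuse the softmax, but the asymptotics are the same.
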
NysX: An Accurate and Energy-Efficient FPGA Accelerator for Hyperdimensional Graph Classification at the Edge
Real-time, energy-efficient inference on edge devices is essential for graph classification across a range of applications. Hyperdimensional Computing (HDC) is a brain-inspired computing paradigm that encodes input features into low-precis…

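For readers new to HDC, the core encode step takes only a few lines: bind feature and value hypervectors elementwise, bundle the pairs by summation, binarize, and compare by similarity. A toy sketch under those assumptions (the dimensionality and feature/value scheme are illustrative, not NysX's encoder):

    import numpy as np

    D = 10_000                              # hypervector dimensionality
    rng = np.random.default_rng(0)

    def rand_hv():
        # random bipolar {-1, +1} hypervector
        return rng.choice([-1, 1], size=D)

    # item memories: one random hypervector per feature id and value level
    feature_hv = {f: rand_hv() for f in range(5)}
    value_hv = {v: rand_hv() for v in range(3)}

    def encode(sample):
        # bind each feature to its value (elementwise product), bundle the
        # pairs by summation, then binarize back to {-1, +1}
        bundled = sum(feature_hv[f] * value_hv[v] for f, v in sample.items())
        return np.sign(bundled)

    a = encode({0: 1, 1: 2, 2: 0, 3: 1, 4: 2})
    b = encode({0: 1, 1: 2, 2: 0, 3: 1, 4: 0})  # differs in one feature
    sim = (a @ b) / D   # high for similar inputs, near 0 for unrelated ones
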
RAVE: Rate-Adaptive Visual Encoding for 3D Gaussian Splatting
Recent advances in neural scene representations have transformed immersive multimedia, with 3D Gaussian Splatting (3DGS) enabling real-time photorealistic rendering. Despite its efficiency, 3DGS suffers from large memory requirements and c…

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices
Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-b…

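The LUT idea itself is simple to demonstrate: with ternary (1.58-bit) weights, a group of g weights has only 3^g possible patterns, so the dot product of each activation group with every pattern can be precomputed once, turning each weight group into a single table lookup. A toy sketch of that principle only (illustrative; this is not Vec-LUT's vectorized kernel):

    import numpy as np
    from itertools import product

    def lut_dot(activations, ternary_weights, g=4):
        # LUT-style dot product for ternary {-1, 0, +1} weights: per group
        # of g activations, precompute partial sums for all 3**g patterns,
        # then replace g multiply-accumulates with one table lookup.
        assert len(activations) % g == 0
        patterns = np.array(list(product([-1, 0, 1], repeat=g)))  # (3**g, g)
        total = 0.0
        for i in range(0, len(activations), g):
            table = patterns @ activations[i:i + g]    # (3**g,) partial sums
            idx = 0
            for wgt in ternary_weights[i:i + g]:       # base-3 group index
                idx = idx * 3 + (wgt + 1)
            total += table[idx]
        return total

    x = np.random.default_rng(1).normal(size=8)
    w = np.array([1, -1, 0, 1, 0, 0, -1, 1])
    assert np.isclose(lut_dot(x, w), x @ w)            # exact, not approximate

In a real kernel the table built for an activation group is reused across every row of the weight matrix, which is what amortizes the 3^g precompute.
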
Accelerating 3D Seismic Wave Simulations on ARM Using a Hybrid Half-Precision and Scalable Vector Extension Approach
Seismic simulation is fundamental for understanding earthquake physics and mitigating seismic hazards, but accurate seismic modeling requires fine computational grids, imposing severe memory and computational challenges. Traditional modeli…

Performance Evaluation of NEORV32 RISC-V Processor Using FreeRTOS Trace-Driven Cache Simulation
In increasingly complex embedded systems, optimizing the memory hierarchy is crucial to attaining high performance and energy efficiency. This thesis uses execution traces from FreeRTOS workloads that mimic realistic multitasking behavior to a…

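Trace-driven cache simulation needs little machinery: replay recorded memory addresses through a cache model and count hits and misses. A minimal sketch of the idea (the LRU set-associative model and its parameters are illustrative, not the thesis's simulator):

    def simulate_cache(trace, size=1024, block=16, ways=2):
        # Toy set-associative cache with LRU replacement, driven by a list
        # of byte addresses from an execution trace; returns (hits, misses).
        sets = size // (block * ways)
        cache = [[] for _ in range(sets)]    # each set: tags in LRU order
        hits = misses = 0
        for addr in trace:
            index = (addr // block) % sets
            tag = addr // (block * sets)
            tags = cache[index]
            if tag in tags:
                hits += 1
                tags.remove(tag)
                tags.append(tag)             # move to most-recently-used
            else:
                misses += 1
                if len(tags) == ways:
                    tags.pop(0)              # evict least-recently-used
                tags.append(tag)
        return hits, misses

    trace = [0x100, 0x104, 0x2100, 0x108, 0x100, 0x4100]
    print(simulate_cache(trace))             # (3, 3) on this toy trace
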
FlashAttention: Breaking the Memory Wall for Efficient Self-Attention Scaling
The self-attention mechanism is a cornerstone of the Transformer architecture, driving significant advancements across natural language processing, computer vision, and other domains. However, its quadratic computational complexity and, cr…

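The memory-wall point is that naive attention materializes a T×T score matrix (at T = 32k in fp16 that is 2 GiB per head), whereas FlashAttention-style kernels stream over key/value blocks with an online softmax and never store it. A minimal NumPy sketch of the online-softmax recurrence (block size illustrative; the real kernel also tiles queries and runs in GPU SRAM):

    import numpy as np

    def blockwise_attention(q, k, v, block=32):
        # Streams over key/value blocks keeping a running max (m), softmax
        # denominator (l), and value accumulator (acc), so peak memory is
        # O(T*block) instead of the full O(T^2) score matrix.
        T, d = q.shape
        scale = 1.0 / np.sqrt(d)
        m = np.full(T, -np.inf)
        l = np.zeros(T)
        acc = np.zeros_like(q)
        for j in range(0, T, block):
            s = (q @ k[j:j + block].T) * scale      # (T, block) scores only
            m_new = np.maximum(m, s.max(axis=1))
            corr = np.exp(m - m_new)                # rescale old statistics
            p = np.exp(s - m_new[:, None])
            l = l * corr + p.sum(axis=1)
            acc = acc * corr[:, None] + p @ v[j:j + block]
            m = m_new
        return acc / l[:, None]

    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(128, 16)) for _ in range(3))
    s = q @ k.T / np.sqrt(16)
    p = np.exp(s - s.max(axis=1, keepdims=True))
    ref = (p / p.sum(axis=1, keepdims=True)) @ v    # naive O(T^2) reference
    assert np.allclose(blockwise_attention(q, k, v), ref)
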
Flash Attention: Unlocking Bandwidth-Optimal Self-Attention for Trillion-Parameter Models
The Transformer architecture, with its cornerstone self-attention mechanism, has revolutionized deep learning, particularly in natural language processing. However, as models scale towards trillions of parameters and sequence lengths grow,…

High-Throughput, Cost-Effective Billion-Scale Vector Search with a Single GPU
Approximate nearest neighbor search (ANNS) is broadly adopted in numerous scenarios. Real-world applications seek efficient ways to search billion-scale vectors at high throughput. On-SSD graph-based ANNS systems have the opportunity to ac…

KV-Cache Compression via Attention Pattern Pruning for Latency-Constrained LLMs
Large Language Models (LLMs) have achieved remarkable success across diverse natural language processing tasks. However, their autoregressive inference, particularly with long input sequences, is significantly bottlenecked by the Key-Value…

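The size of that bottleneck is easy to estimate: the cache stores one key and one value vector per token, per layer, per head. Back-of-envelope arithmetic for a hypothetical 7B-class configuration (numbers are illustrative, not from this paper):

    # Hypothetical 7B-class config: 32 layers, 32 KV heads, head dim 128,
    # fp16, no grouped-query attention.
    layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
    print(f"{bytes_per_token / 2**20:.1f} MiB per token")             # 0.5 MiB

    for seq_len in (4096, 32768, 131072):
        print(f"{seq_len:>6} tokens -> "
              f"{bytes_per_token * seq_len / 2**30:.0f} GiB")
    # 4096 -> 2 GiB, 32768 -> 16 GiB, 131072 -> 64 GiB

At long contexts the cache dwarfs the activations, which is why pruning or compressing it directly attacks the latency bottleneck.
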
Cohet: A CXL-Driven Coherent Heterogeneous Computing Framework with Hardware-Calibrated Full-System Simulation
Conventional heterogeneous computing systems built on PCIe interconnects suffer from inefficient fine-grained host-device interactions and complex programming models. In recent years, many proprietary and open cache-coherent interconnect s…

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
As efficient alternatives to softmax attention, linear state-space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall-oriented ta…

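The "fading summary" is visible directly in the recurrence: a diagonal linear SSM keeps a fixed-size state h_t = a ⊙ h_{t-1} + B x_t, so with |a| < 1 a token seen s steps ago survives only with weight a^s. A toy sketch of a plain linear SSM (not the paper's gated, ridge-regression layer; all parameters illustrative):

    import numpy as np

    def linear_ssm(x, a, B):
        # Diagonal linear SSM scan: h_t = a * h_{t-1} + B @ x_t.
        # State size is constant (len(a)) and compute is linear in T, but
        # with |a| < 1 token s reaches h_t only with weight a**(t - s):
        # the state is an exponentially fading summary of the past.
        h = np.zeros(B.shape[0])
        states = []
        for x_t in x:
            h = a * h + B @ x_t
            states.append(h.copy())
        return np.stack(states)

    rng = np.random.default_rng(0)
    x = rng.normal(size=(100, 4))
    a = np.full(8, 0.9)          # decay 0.9 -> memory half-life ~ 6.6 steps
    B = rng.normal(size=(8, 4))
    h = linear_ssm(x, a, B)      # x[0]'s weight in h[99] is 0.9**99 ~ 3e-5
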
Compute-in-Memory Based on Emerging Non-Volatile Memories: RRAM, MRAM, and FeRAM
In the era of artificial intelligence, the Internet of Things, and big data, processing massive data places unprecedented demands on the throughput and energy efficiency of computing systems. In traditional von Neumann architectures…

LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling
Large Language Models (LLMs) have achieved unprecedented success across various applications, but their substantial memory requirements pose significant challenges to current memory system designs, especially during inference. Our work tar…

Realizing Fully-Integrated, Low-Power, Event-Based Pupil Tracking with Neuromorphic Hardware
Eye tracking is fundamental to numerous applications, yet achieving robust, high-frequency tracking with ultra-low power consumption remains challenging for wearable platforms. While event-based vision sensors offer microsecond resolution …

AME: An Efficient Heterogeneous Agentic Memory Engine for Smartphones
On-device agents on smartphones increasingly require continuously evolving memory to support personalized, context-aware, and long-term behaviors. To meet both privacy and responsiveness demands, user data is embedded as vectors and stored…

Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration
Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n\times n$). We present Low-Rank GEMM,…

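The algebra behind the speedup: if A factors (exactly or approximately) as U V with rank r ≪ n, then A B = U (V B) costs two thin GEMMs, $\mathcal{O}(n^2 r)$, instead of one $\mathcal{O}(n^3)$ GEMM. A minimal NumPy sketch of that idea only (the paper's FP8 tensor-core path and rank-selection policy are not reproduced here):

    import numpy as np

    def low_rank_gemm(A, B, r):
        # Approximate C = A @ B by compressing A to rank r with a truncated
        # SVD, then multiplying in factored form: U_r @ (V_r @ B) is two
        # thin GEMMs, O(n^2 * r), instead of one O(n^3) GEMM.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        Ur = U[:, :r] * s[:r]        # (n, r), singular values folded in
        Vr = Vt[:r]                  # (r, n)
        return Ur @ (Vr @ B)

    n, r = 512, 32
    rng = np.random.default_rng(0)
    A = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))  # genuinely rank-r
    B = rng.normal(size=(n, n))
    err = (np.linalg.norm(low_rank_gemm(A, B, r) - A @ B)
           / np.linalg.norm(A @ B))
    # err ~ 0 here; for general A the error is set by the discarded
    # singular values
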