Memory bandwidth
Instant neural graphics primitives with a multiresolution hash encoding Open
Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate. We reduce this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing qualit…
A Scalable Multicore Architecture With Heterogeneous Memory Structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs) Open
Neuromorphic computing systems comprise networks of neurons that use asynchronous events for both computation and communication. This type of representation offers several advantages in terms of bandwidth and power consumption in neuromorp…
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness Open
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model q…
DRISA Open
Data movement between the processing units and the memory in traditional von Neumann architecture is creating the "memory wall" problem. To bridge the gap, two approaches, the memory-rich processor (more on-chip memory) and the compute-cap…
Processing data where it makes sense: Enabling in-memory computation Open
Today's systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in systems that cause performance, scalability and energy bottlenecks: (1) data access from memory …
In‐Memory Vector‐Matrix Multiplication in Monolithic Complementary Metal–Oxide–Semiconductor‐Memristor Integrated Circuits: Design Choices, Challenges, and Perspectives Open
The low communication bandwidth between memory and processing units in conventional von Neumann machines does not support the requirements of emerging applications that rely extensively on large sets of data. More recent computing paradigm…
Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0 Open
The best hope for reducing long-standing global climate model biases is by increasing resolution to the kilometer scale. Here we present results from an ultrahigh-resolution non-hydrostatic climate model for a near-global setup running on …
Long Short-Term Memory Recurrent Neural Network for Automatic Speech Recognition Open
Automatic speech recognition (ASR) is one of the most demanding tasks in natural language processing owing to its complexity. Recently, deep learning approaches have been deployed for this task and have been proven to outperform traditiona…
Over 100x Faster Bootstrapping in Fully Homomorphic Encryption through Memory-centric Optimization with GPUs Open
Fully Homomorphic encryption (FHE) has been gaining in popularity as an emerging means of enabling an unlimited number of operations in an encrypted message without decryption. A major drawback of FHE is its high computational cost. Specif…
Simultaneous Multi-Layer Access Open
3D-stacked DRAM alleviates the limited memory bandwidth bottleneck that exists in modern systems by leveraging through silicon vias (TSVs) to deliver higher external memory channel bandwidth. Today’s systems, however, cannot fully utilize …
An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution Open
The Convolutional Neural Network (CNN) has been used in many fields and has achieved remarkable results, such as image classification, face detection, and speech recognition. Compared to GPU (graphics processing unit) and ASIC, an FPGA (fie…
Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices Open
This code release includes testing scripts for some figures, our benchmark Memo, and our tuning scheme Caption.
The Mondrian Data Engine Open
The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling. As the performance density of traditional CPU-centric architecture…
A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics Open
There has been a significant amount of excitement and recent work on GPU-based database systems. Previous work has claimed that these systems can perform orders of magnitude better than CPU-based database systems on analytical workloads such…
Ultra-Efficient Processing In-Memory for Data Intensive Applications Open
Recent years have witnessed rapid growth in the domain of the Internet of Things (IoT). This network of billions of devices generates and exchanges a huge amount of data. The limited cache capacity and memory bandwidth make transferring and pr…
Dissecting the NVidia Turing T4 GPU via Microbenchmarking Open
In 2019, the rapid rate at which GPU manufacturers refresh their designs, coupled with their reluctance to disclose microarchitectural details, is still a hurdle for those software designers who want to extract the highest possible perform…
RFVP Open
This article aims to tackle two fundamental memory bottlenecks: limited off-chip bandwidth (bandwidth wall) and long access latency (memory wall). To achieve this goal, our approach exploits the inherent error resilience of a wide range of…
Classifying Memory Access Patterns for Prefetching Open
Prefetching is a well-studied technique for addressing the memory access stall time of contemporary microprocessors. However, despite a large body of related work, the memory access behavior of applications is not well understood, and it r…
FPGA-Based Near-Memory Acceleration of Modern Data-Intensive Applications Open
Modern data-intensive applications demand high computation capabilities with strict power constraints. Unfortunately, such applications suffer from a significant waste of both execution cycles and energy in current computing systems due to…
SELD-TCN: Sound Event Localization & Detection via Temporal Convolutional Networks Open
The understanding of the surrounding environment plays a critical role in autonomous robotic systems, such as self-driving cars. Extensive research has been carried out concerning visual perception. Yet, to obtain a more complete percep…
Demystifying the characteristics of 3D-stacked memories: A case study for Hybrid Memory Cube Open
Three-dimensional (3D)-stacking technology, which enables the integration of DRAM and logic dies, offers high bandwidth and low energy consumption. This technology also empowers new memory designs for executing tasks not traditionally asso…
Kleio Open
The increasing demand of big data analytics for more main memory capacity in datacenters and exascale computing environments is driving the integration of heterogeneous memory technologies. The new technologies exhibit vastly greater diffe…
APPROX-NoC Open
The trend of unsustainable power consumption and large memory bandwidth demands in massively parallel multicore systems, with the advent of the big data era, has brought upon the onset of alternate computation paradigms utilizing heterogen…
CAIRO Open
Three-dimensional (3D)-stacking technology and the memory-wall problem have popularized processing-in-memory (PIM) concepts again, which offer the benefits of bandwidth and energy savings by offloading computations to functional units ins…
WRPN: Wide Reduced-Precision Networks Open
For computer vision applications, prior works have shown the efficacy of reducing numeric precision of model parameters (network weights) in deep neural networks. Activation maps, however, occupy a large memory footprint during both the tr…
Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture Open
Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latenc…
Breaking High-Resolution CNN Bandwidth Barriers With Enhanced Depth-First Execution Open
Convolutional neural networks (CNNs) now also start to reach impressive performance on non-classification image processing tasks, such as denoising, demosaicing, super-resolution, and super slow motion. Consequently, CNNs are increasingly …
UPC++: A High-Performance Communication Framework for Asynchronous Computation Open
© 2019 IEEE. UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons f…
LAcc Open
PIM (Processing-in-memory)-based CNN (Convolutional neural network) accelerators leverage the characteristics of basic memory cells to enable simple logic and arithmetic operations so that the bandwidth constraint can be effectively allevi…
Potential of a modern vector supercomputer for practical applications: performance evaluation of SX-ACE Open
Achieving a high sustained simulation performance is the most important concern in the HPC community. To this end, many kinds of HPC system architectures have been proposed, and the diversity of the HPC systems grows rapidly. Under this ci…