Marc Casas
A Flexible Instruction Set Architecture for Efficient GEMMs
GEneral Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple Data (SIMD) or vector Instruction Set Archi…
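For readers unfamiliar with the kernel: the operation a GEMM performs is the standard BLAS contract C ← αAB + βC. The triple loop below is only an illustrative sketch (not code from the paper); SIMD or vector ISAs accelerate it by processing many elements of the inner loop per instruction.

```python
def gemm(alpha, A, B, beta, C):
    """Naive GEMM: C <- alpha * A @ B + beta * C (the BLAS xGEMM contract).

    A is m x k, B is k x n, C is m x n, all plain lists of lists.
    """
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):          # inner loop: the vectorization target
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```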
BerryBees: Breadth First Search by Bit-Tensor-Cores
Breadth First Search (BFS) plays a key role in computational science, networking, and artificial intelligence applications. Although the BFS approach has been extensively studied, particularly in its direction-optimized form, existing impl…
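As context, a minimal sketch of the baseline level-synchronous (top-down) BFS that direction-optimized variants improve upon; the function name and graph encoding are illustrative, not from the paper.

```python
from collections import deque

def bfs_levels(adj, source):
    """Level-synchronous top-down BFS. Returns the distance of every
    vertex from `source` (-1 if unreachable). `adj` maps vertex -> neighbors."""
    dist = {v: -1 for v in adj}
    dist[source] = 0
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if dist[w] == -1:          # first visit: one level deeper
                dist[w] = dist[v] + 1
                frontier.append(w)
    return dist
```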
Extending Sparse Patterns to Improve Inverse Preconditioning on GPU Architectures
Graphics Processing Units (GPUs) have become a key component of high-end computing infrastructures due to their massively parallel architecture, which delivers high rates of floating-point operations per cycle. Many scientific workloads benef…
Exploiting Vector Code Semantics for Efficient Data Cache Prefetching
Emerging workloads from domains like high performance computing, data analytics or deep learning consume large amounts of memory bandwidth. To mitigate this problem, computing systems include large and deep memory cache hierarchies that ex…
Practically Tackling Memory Bottlenecks of Graph-Processing Workloads
Graph-processing workloads have become widespread due to their relevance to a wide range of application domains such as network analysis, path-planning, bioinformatics, and machine learning. Graph-processing workloads have massive data fo…
A Two Level Neural Approach Combining Off-Chip Prediction with Adaptive Prefetch Filtering
To alleviate the performance and energy overheads of contemporary applications with large data footprints, we propose the Two Level Perceptron (TLP) predictor, a neural mechanism that effectively combines predicting whether an access will …
Compressed Real Numbers for AI: a case-study using a RISC-V CPU
As recently demonstrated, Deep Neural Networks (DNNs), usually trained using single-precision IEEE 754 floating-point numbers (binary32), can also work using lower precision. Therefore, 16-bit and 8-bit compressed formats have attracted cons…
An Open-Source Framework for Efficient Numerically-Tailored Computations
We present a versatile open-source framework designed to facilitate efficient, numerically-tailored Matrix-Matrix Multiplications (MMMs). The framework offers two primary contributions: first, a fine-tuned, automated pipeline for arithm…
Open-Source GEMM Hardware Kernels Generator: Toward Numerically-Tailored Computations
Many scientific computing problems can be reduced to Matrix-Matrix Multiplications (MMM), making the General Matrix Multiply (GEMM) kernels in the Basic Linear Algebra Subroutine (BLAS) of interest to the high-performance computing communi…
Characterizing the impact of last-level cache replacement policies on big-data workloads
In recent years, graph-processing has become an essential class of workloads with applications in a rapidly growing number of fields. Graph-processing typically uses large input sets, often in multi-gigabyte scale, and data-dependent graph…
Optimization of SpGEMM with RISC-V vector instructions
The Sparse GEneral Matrix-Matrix multiplication (SpGEMM) $C = A \times B$ is a fundamental routine extensively used in domains like machine learning or graph analytics. Despite its relevance, the efficient execution of SpGEMM on vector arc…
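For context, the classic row-wise (Gustavson) formulation of SpGEMM, sketched here with each sparse row as a column-to-value dict; this is the scalar baseline, not the paper's vectorized RISC-V kernels.

```python
def spgemm(A_rows, B_rows):
    """Row-wise (Gustavson) SpGEMM: C = A @ B, with each sparse row stored
    as a {column: value} dict. Only nonzero products are ever touched."""
    C_rows = []
    for a_row in A_rows:
        c_row = {}
        for k, a_val in a_row.items():          # nonzeros of A's row i
            for j, b_val in B_rows[k].items():  # matching row k of B
                c_row[j] = c_row.get(j, 0.0) + a_val * b_val
        C_rows.append(c_row)
    return C_rows
```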
Efficient Direct Convolution Using Long SIMD Instructions
This paper demonstrates that state-of-the-art proposals to compute convolutions on architectures with CPUs supporting SIMD instructions deliver poor performance for long SIMD lengths due to frequent cache conflict misses. We first discuss …
TD-NUCA: Runtime Driven Management of NUCA Caches in Task Dataflow Programming Models
In high performance processors, the design of on-chip memory hierarchies is crucial for performance and energy efficiency. Current processors rely on large shared Non-Uniform Cache Architectures (NUCA) to improve performance and reduce dat…
Page Size Aware Cache Prefetching
The increase in working set sizes of contemporary applications outpaces the growth in cache sizes, resulting in frequent main memory accesses that deteriorate system performance due to the disparity between processor and memory speeds. P…
A BF16 FMA is All You Need for DNN Training
Fused Multiply-Add (FMA) functional units constitute a fundamental hardware component to train Deep Neural Networks (DNNs). Its silicon area grows quadratically with the mantissa bit count of the computer number format, which has motivated…
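A back-of-the-envelope model of the abstract's observation that significand-multiplier area grows roughly quadratically with mantissa width; the function and baseline choice are illustrative assumptions, not figures from the paper. FP32 carries a 24-bit significand (23 stored plus a hidden bit) and BF16 an 8-bit one.

```python
def relative_multiplier_area(significand_bits, baseline_bits=24):
    """Rough model: multiplier area ~ quadratic in significand width.
    Returns area relative to an FP32 (24-bit significand) multiplier."""
    return (significand_bits / baseline_bits) ** 2
```

Under this model a BF16 multiplier (`relative_multiplier_area(8)`) occupies roughly a ninth of the FP32 baseline, which is the kind of saving that motivates BF16 FMA units for DNN training.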
Communication-aware Sparse Patterns for the Factorized Approximate Inverse Preconditioner
The Conjugate Gradient (CG) method is an iterative solver targeting linear systems of equations Ax=b where A is a symmetric and positive definite matrix. CG convergence properties improve when preconditioning is applied to reduce the condi…
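A minimal unpreconditioned CG iteration for Ax = b with A symmetric positive definite, sketched in plain Python for reference; the paper's contribution concerns the preconditioner's sparse pattern, which this baseline omits.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Plain CG for A x = b, A symmetric positive definite (no preconditioner).
    A is a list of lists, b a list; stops when ||r||^2 < tol."""
    n = len(b)
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n
    r = b[:]                 # residual r = b - A x, with x = 0
    p = r[:]                 # initial search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs < tol:         # converged
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```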
A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs
We propose a hardware generator of GEMM accelerators. Our generator produces vendor-agnostic HDL describing highly customizable systolic arrays guided by accuracy and energy efficiency goals. The generated arrays have three main novel aspe…
FASE: A Fast, Accurate and Seamless Emulator for Custom Numerical Formats
Deep Neural Networks (DNNs) have become ubiquitous in a wide range of application domains. Despite their success, training DNNs is an expensive task that has motivated the use of reduced numerical precision formats to improve performance a…
Task-based Acceleration of Bidirectional Recurrent Neural Networks on Multi-core Architectures
This paper proposes a novel parallel execution model for Bidirectional Recurrent Neural Networks (BRNNs), B-Par (Bidirectional-Parallelization), which exploits data and control dependencies for forward and reverse input computations. B-Par…
Optimization of the Sparse Multi-Threaded Cholesky Factorization for A64FX
Sparse linear algebra routines are fundamental building blocks of a large variety of scientific applications. Direct solvers, which are methods for solving linear systems via the factorization of matrices into products of triangular matric…
Autoencoders for Semi-Supervised Water Level Modeling in Sewer Pipes with Sparse Labeled Data
More frequent and thorough inspection of sewer pipes has the potential to save billions in utilities. However, the amount and quality of inspection are impeded by an imprecise and highly subjective manual process. It involves technicians j…
Dynamically Adapting Floating-Point Precision to Accelerate Deep Neural Network Training
Mixed-precision (MP) arithmetic combining both single- and half-precision operands has been successfully applied to train deep neural networks. Despite its advantages in terms of reducing the need for key resources like memory bandwidth or…
Multilevel simulation-based co-design of next generation HPC microprocessors
Morrigan: A Composite Instruction TLB Prefetcher
The effort to reduce address translation overheads has typically targeted data accesses since they constitute the overwhelming portion of the second-level TLB (STLB) misses in desktop and HPC applications. The address translation cost of i…
Compiler-Assisted Compaction/Restoration of SIMD Instructions
All the supercomputers in the world exploit data-level parallelism (DLP), for example by using single instructions to operate over several data elements. Improving vector processing is therefore key for exascale computing. Control flow div…
Cache-aware Sparse Patterns for the Factorized Sparse Approximate Inverse Preconditioner
Conjugate Gradient is a widely used iterative method to solve linear systems Ax=b with matrix A being symmetric and positive definite. Part of its effectiveness relies on finding a suitable preconditioner that accelerates its convergence. …
Exploiting Page Table Locality for Agile TLB Prefetching
Frequent Translation Lookaside Buffer (TLB) misses incur high performance and energy costs due to page walks required for fetching the corresponding address translations. Prefetching page table entries (PTEs) ahead of demand TLB accesses c…