Marc Casas
A Flexible Instruction Set Architecture for Efficient GEMMs
GEneral Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple Data (SIMD) or vector Instruction Set Archi…
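For readers unfamiliar with the kernel: the operation a GEMM performs is the standard BLAS contract C ← αAB + βC. The triple loop below is only an illustrative sketch (not code from the paper); SIMD or vector ISAs accelerate it by processing many elements of the inner loop per instruction.

```python
def gemm(alpha, A, B, beta, C):
    """Naive GEMM: C <- alpha * A @ B + beta * C (the BLAS xGEMM contract).

    A is m x k, B is k x n, C is m x n, all plain lists of lists.
    """
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):          # inner loop: the vectorization target
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```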
BerryBees: Breadth First Search by Bit-Tensor-Cores
Breadth First Search (BFS) plays a key role in computational science, networking, and artificial intelligence applications. Although the BFS approach has been extensively studied, particularly in its direction-optimized form, existing impl…
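As context, a minimal sketch of the baseline level-synchronous (top-down) BFS that direction-optimized variants improve upon; the function name and graph encoding are illustrative, not from the paper.

```python
from collections import deque

def bfs_levels(adj, source):
    """Level-synchronous top-down BFS. Returns the distance of every
    vertex from `source` (-1 if unreachable). `adj` maps vertex -> neighbors."""
    dist = {v: -1 for v in adj}
    dist[source] = 0
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if dist[w] == -1:          # first visit: one level deeper
                dist[w] = dist[v] + 1
                frontier.append(w)
    return dist
```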
Extending Sparse Patterns to Improve Inverse Preconditioning on GPU Architectures
Graphics Processing Units (GPUs) have become a key component of high-end computing infrastructures due to their massively parallel architecture, which delivers high rates of floating-point operations per cycle. Many scientific workloads benef…
Exploiting Vector Code Semantics for Efficient Data Cache Prefetching
Emerging workloads from domains like high performance computing, data analytics or deep learning consume large amounts of memory bandwidth. To mitigate this problem, computing systems include large and deep memory cache hierarchies that ex…
Practically Tackling Memory Bottlenecks of Graph-Processing Workloads
Graph-processing workloads have become widespread due to their relevance to a wide range of application domains such as network analysis, path-planning, bioinformatics, and machine learning. Graph-processing workloads have massive data fo…
A Two Level Neural Approach Combining Off-Chip Prediction with Adaptive Prefetch Filtering
To alleviate the performance and energy overheads of contemporary applications with large data footprints, we propose the Two Level Perceptron (TLP) predictor, a neural mechanism that effectively combines predicting whether an access will …
Compressed Real Numbers for AI: a case-study using a RISC-V CPU
As recently demonstrated, Deep Neural Networks (DNNs), usually trained using single-precision IEEE 754 floating-point numbers (binary32), can also work using lower precision. Therefore, 16-bit and 8-bit compressed formats have attracted cons…
An Open-Source Framework for Efficient Numerically-Tailored Computations
We present a versatile open-source framework designed to facilitate efficient, numerically-tailored Matrix-Matrix Multiplications (MMMs). The framework offers two primary contributions: first, a fine-tuned, automated pipeline for arithm…
Open-Source GEMM Hardware Kernels Generator: Toward Numerically-Tailored Computations
Many scientific computing problems can be reduced to Matrix-Matrix Multiplications (MMM), making the General Matrix Multiply (GEMM) kernels in the Basic Linear Algebra Subroutine (BLAS) of interest to the high-performance computing communi…
Characterizing the impact of last-level cache replacement policies on big-data workloads
In recent years, graph-processing has become an essential class of workloads with applications in a rapidly growing number of fields. Graph-processing typically uses large input sets, often in multi-gigabyte scale, and data-dependent graph…
Optimization of SpGEMM with RISC-V vector instructions
The Sparse GEneral Matrix-Matrix multiplication (SpGEMM) $C = A \times B$ is a fundamental routine extensively used in domains like machine learning or graph analytics. Despite its relevance, the efficient execution of SpGEMM on vector arc…
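For context, the classic row-wise (Gustavson) formulation of SpGEMM, sketched here with each sparse row as a column-to-value dict; this is the scalar baseline, not the paper's vectorized RISC-V kernels.

```python
def spgemm(A_rows, B_rows):
    """Row-wise (Gustavson) SpGEMM: C = A @ B, with each sparse row stored
    as a {column: value} dict. Only nonzero products are ever touched."""
    C_rows = []
    for a_row in A_rows:
        c_row = {}
        for k, a_val in a_row.items():          # nonzeros of A's row i
            for j, b_val in B_rows[k].items():  # matching row k of B
                c_row[j] = c_row.get(j, 0.0) + a_val * b_val
        C_rows.append(c_row)
    return C_rows
```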
Efficient Direct Convolution Using Long SIMD Instructions
This paper demonstrates that state-of-the-art proposals to compute convolutions on architectures with CPUs supporting SIMD instructions deliver poor performance for long SIMD lengths due to frequent cache conflict misses. We first discuss …
TD-NUCA: Runtime Driven Management of NUCA Caches in Task Dataflow Programming Models
In high performance processors, the design of on-chip memory hierarchies is crucial for performance and energy efficiency. Current processors rely on large shared Non-Uniform Cache Architectures (NUCA) to improve performance and reduce dat…
Page Size Aware Cache Prefetching
The increase in working set sizes of contemporary applications outpaces the growth in cache sizes, resulting in frequent main memory accesses that deteriorate system performance due to the disparity between processor and memory speeds. P…
A BF16 FMA is All You Need for DNN Training
Fused Multiply-Add (FMA) functional units constitute a fundamental hardware component to train Deep Neural Networks (DNNs). Its silicon area grows quadratically with the mantissa bit count of the computer number format, which has motivated…
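A back-of-the-envelope model of the abstract's observation that significand-multiplier area grows roughly quadratically with mantissa width; the function and baseline choice are illustrative assumptions, not figures from the paper. FP32 carries a 24-bit significand (23 stored plus a hidden bit) and BF16 an 8-bit one.

```python
def relative_multiplier_area(significand_bits, baseline_bits=24):
    """Rough model: multiplier area ~ quadratic in significand width.
    Returns area relative to an FP32 (24-bit significand) multiplier."""
    return (significand_bits / baseline_bits) ** 2
```

Under this model a BF16 multiplier (`relative_multiplier_area(8)`) occupies roughly a ninth of the FP32 baseline, which is the kind of saving that motivates BF16 FMA units for DNN training.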
Communication-aware Sparse Patterns for the Factorized Approximate Inverse Preconditioner
The Conjugate Gradient (CG) method is an iterative solver targeting linear systems of equations Ax=b where A is a symmetric and positive definite matrix. CG convergence properties improve when preconditioning is applied to reduce the condi…
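A minimal unpreconditioned CG iteration for Ax = b with A symmetric positive definite, sketched in plain Python for reference; the paper's contribution concerns the preconditioner's sparse pattern, which this baseline omits.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Plain CG for A x = b, A symmetric positive definite (no preconditioner).
    A is a list of lists, b a list; stops when ||r||^2 < tol."""
    n = len(b)
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n
    r = b[:]                 # residual r = b - A x, with x = 0
    p = r[:]                 # initial search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs < tol:         # converged
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```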
A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs
We propose a hardware generator of GEMM accelerators. Our generator produces vendor-agnostic HDL describing highly customizable systolic arrays guided by accuracy and energy efficiency goals. The generated arrays have three main novel aspe…
FASE: A Fast, Accurate and Seamless Emulator for Custom Numerical Formats
Deep Neural Networks (DNNs) have become ubiquitous in a wide range of application domains. Despite their success, training DNNs is an expensive task that has motivated the use of reduced numerical precision formats to improve performance a…
Task-based Acceleration of Bidirectional Recurrent Neural Networks on Multi-core Architectures
This paper proposes a novel parallel execution model for Bidirectional Recurrent Neural Networks (BRNNs), B-Par (Bidirectional-Parallelization), which exploits data and control dependencies for forward and reverse input computations. B-Par…
Optimization of the Sparse Multi-Threaded Cholesky Factorization for A64FX
Sparse linear algebra routines are fundamental building blocks of a large variety of scientific applications. Direct solvers, which are methods for solving linear systems via the factorization of matrices into products of triangular matric…
Autoencoders for Semi-Supervised Water Level Modeling in Sewer Pipes with Sparse Labeled Data
More frequent and thorough inspection of sewer pipes has the potential to save billions in utilities. However, the amount and quality of inspection are impeded by an imprecise and highly subjective manual process. It involves technicians j…
Dynamically Adapting Floating-Point Precision to Accelerate Deep Neural Network Training
Mixed-precision (MP) arithmetic combining both single- and half-precision operands has been successfully applied to train deep neural networks. Despite its advantages in terms of reducing the need for key resources like memory bandwidth or…
Multilevel simulation-based co-design of next generation HPC microprocessors
Morrigan: A Composite Instruction TLB Prefetcher
The effort to reduce address translation overheads has typically targeted data accesses since they constitute the overwhelming portion of the second-level TLB (STLB) misses in desktop and HPC applications. The address translation cost of i…
Compiler-Assisted Compaction/Restoration of SIMD Instructions
All the supercomputers in the world exploit data-level parallelism (DLP), for example by using single instructions to operate over several data elements. Improving vector processing is therefore key for exascale computing. Control flow div…
Cache-aware Sparse Patterns for the Factorized Sparse Approximate Inverse Preconditioner
Conjugate Gradient is a widely used iterative method to solve linear systems Ax=b with matrix A being symmetric and positive definite. Part of its effectiveness relies on finding a suitable preconditioner that accelerates its convergence. …
Exploiting Page Table Locality for Agile TLB Prefetching
Frequent Translation Lookaside Buffer (TLB) misses incur high performance and energy costs due to page walks required for fetching the corresponding address translations. Prefetching page table entries (PTEs) ahead of demand TLB accesses c…