Doru Thom Popovici
A Closeness Centrality-based Circuit Partitioner for Quantum Simulations
Simulating quantum circuits (QC) on high-performance computing (HPC) systems has become an essential method to benchmark algorithms and probe the potential of large-scale quantum computation despite the limitations of current quantum hardw…
Towards An Approach to Identify Divergences in Hardware Designs for HPC Workloads
Developing efficient hardware accelerators for mathematical kernels used in scientific applications and machine learning has traditionally been a labor-intensive task. These accelerators typically require low-level programming in Verilog o…
Flexible Multi-Dimensional FFTs for Plane Wave Density Functional Theory Codes
Multi-dimensional Fourier transforms are key mathematical building blocks that appear in a wide range of applications, from materials science, physics, and chemistry to machine learning. Over the past years, a multitude of software packag…
Toward Practical Superconducting Accelerators for Machine Learning Using U-SFQ
Most popular superconducting circuits operate on information carried by ps-wide, μV-tall, single flux quantum (SFQ) pulses. These circuits can operate at frequencies of hundreds of GHz with orders of magnitude lower switching energy than c…
Distributed memory, GPU accelerated Fock construction for hybrid, Gaussian basis density functional theory
With the growing reliance of modern supercomputers on accelerator-based architectures such as graphics processing units (GPUs), the development and optimization of electronic structure methods to exploit these massively parallel resources ha…
SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics
Transformer-based models, such as BERT and ViT, have achieved state-of-the-art results across different natural language processing (NLP) and computer vision (CV) tasks. However, these models are extremely memory intensive during their fin…
Distributed Memory, GPU Accelerated Fock Construction for Hybrid, Gaussian Basis Density Functional Theory
With the growing reliance of modern supercomputers on accelerator-based architectures such as GPUs, the development and optimization of electronic structure methods to exploit these massively parallel resources has become a recent priority.…
A systematic approach to improving data locality across Fourier transforms and linear algebra operations
The performance of most scientific applications depends on efficient mathematical libraries. For example, scientific applications like the plane wave based Density Functional Theory approach for electronic structure calculations use highl…
A High-Throughput Solver for Marginalized Graph Kernels on GPU
We present the design and optimization of a linear solver on General Purpose GPUs for the efficient and high-throughput evaluation of the marginalized graph kernel between pairs of labeled graphs. The solver implements a preconditioned…
SPIRAL: Extreme Performance Portability
In this paper, we address the question of how to automatically map computational kernels to highly efficient code for a wide range of computing platforms and establish the correctness of the synthesized code. More specifically, we focus on…
Compilers, hands-off my hands-on optimizations
Achieving high performance for compute-bound numerical kernels typically requires an expert to hand-select an appropriate set of single-instruction, multiple-data (SIMD) instructions, then statically schedule them in order to hide their…