Rizwan A. Ashraf
Navier: Dataflow Architecture for Computation Chemistry
Navier’s objectives were to evaluate the use of emerging technologies, especially dataflow accelerators, for high-performance computing (HPC) applications, specifically in the domain of chemistry, and to develop a prototype software stack…
Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (V.2.0)
Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest the prevalence of very high fault rates in futur…
Towards Supporting Semiring in MLIR-Based COMET Compiler
Semirings are widely used in large-scale scientific applications of high-dimensional data and graph analytics for linear algebra computations. In this work, we propose a semiring compiler for today's high-performance computing (HPC) system…
ReACT
High-level programming models for tensor computations are becoming increasingly popular in many domains such as machine learning and data science. The index notation is one such model that is widely adopted for expressing a wide range of t…
pnnl/COMET
COMET: Domain Specific Compilation in Multi-level IR. COMET is a compiler for dense and sparse tensor algebra and a domain-specific language (DSL) that facilitates development and implementation of high-performance computing, graph analytic…
GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability
The Cray XK7 Titan was the top supercomputer system in the world for a long time and remained critically important throughout its nearly seven-year life. It was an interesting machine from a reliability viewpoint as most of its power came …
George Ostrouchov, Don Maxwell, Rizwan Ashraf, Mallikarjun Shankar, and James Rogers. 2020. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the International Conference for High Performance Comput…
Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer
Extreme-scale computing systems employ Reliability, Availability and Serviceability (RAS) mechanisms and infrastructure to log events from multiple system components. In this paper, we analyze RAS logs in conjunction with the application p…
A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log
Reliability, availability and serviceability (RAS) logs of high performance computing (HPC) resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status, performance, a…
Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery
Efficient utilization of today's high-performance computing (HPC) systems with complex hardware and software components requires that the HPC applications are designed to tolerate process failures at runtime. With low mean time to failure …
Designing and Evaluating Redundancy-Based Soft-Error Masking on a Continuum of Energy versus Robustness
Near-threshold computing is an effective strategy to reduce the power dissipation of deeply-scaled CMOS logic circuits. However, near-threshold strategies exacerbate the impact of delay variations on device performance and increase the sus…
Exploring the Effect of Compiler Optimizations on the Reliability of HPC Applications
The strict power efficiency constraints required to achieve exascale systems will dramatically increase the number of detected and undetected transient errors in future high performance computing (HPC) systems. Among the various factors th…