Hari Subramoni
Cyberinfrastructure for machine learning applications in agriculture: experiences, analysis, and vision
Introduction Advancements in machine learning (ML) algorithms that make predictions from data without being explicitly programmed and the increased computational speeds of graphics processing units (GPUs) over the last decade have led to r…
Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
Scaling up Large Language Model (LLM) training involves fitting a tremendous number of training parameters across a limited number of workers. However, methods like ZeRO-3 that drastically reduce GPU memory pressure often incur heavy commun…
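A minimal sketch of the ZeRO-3-style partitioning this abstract refers to, written with mpi4py and NumPy: each rank keeps only a shard of the parameters and gathers the full set just before it is needed. The sizes, names, and gather-then-free pattern are illustrative assumptions, not the paper's design.

```python
# ZeRO-3-style parameter sharding sketch (illustrative; assumes the parameter
# count divides evenly across ranks).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

full_len = 1 << 20                                            # total parameter count (illustrative)
shard = np.random.rand(full_len // size).astype(np.float32)   # this rank's shard

# Gather the full parameter set only when a layer actually needs it ...
full_params = np.empty(full_len, dtype=np.float32)
comm.Allgather(shard, full_params)

# ... compute with full_params, then drop it so only the shard stays resident.
del full_params
```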
Accelerating Large Language Model Training with Hybrid GPU-based Compression
Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive commun…
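The data-parallel collective that all three strategies ultimately lean on is a gradient AllReduce; below is a hedged mpi4py sketch of that baseline, not the hybrid compression design proposed here.

```python
# Plain data-parallel gradient AllReduce with mpi4py + NumPy (baseline only).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

local_grad = np.random.rand(1 << 22).astype(np.float32)  # this rank's gradients
avg_grad = np.empty_like(local_grad)

comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)  # sum across all ranks
avg_grad /= comm.Get_size()                       # average to get the global gradient
```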
Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters
With the increasing scale of High-Performance Computing (HPC) and Deep Learning (DL) applications through GPU adaptation, the seamless communication of data stored on GPUs has become a critical factor in enhancing overall application perfo…
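A toy illustration of where compression sits relative to the collective, assuming a simple fp32-to-fp16 cast as the "compressor"; the paper's GPU-based schemes are more sophisticated than this.

```python
# Cast gradients to fp16, exchange the compressed blocks as raw bytes with
# Allgather, then decompress and reduce locally. Illustrative pattern only.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

grads = np.random.rand(1 << 20).astype(np.float32)
compressed = grads.astype(np.float16)                     # 2x smaller payload

recv = np.empty(size * compressed.size, dtype=np.float16)
comm.Allgather(compressed.view(np.uint8), recv.view(np.uint8))

avg = recv.reshape(size, -1).astype(np.float32).mean(axis=0)  # decompress + average
```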
The Case for Co-Designing Model Architectures with Hardware
While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL …
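One concrete instance of the co-design argument is keeping GEMM dimensions (hidden size, vocabulary size, and so on) at hardware-friendly multiples; a small helper sketching that rule of thumb, with the multiple chosen here as an assumption rather than a value from the paper.

```python
# Round a model dimension up to a hardware-friendly multiple so GPU matrix
# engines see well-shaped tiles. The multiple of 64 is illustrative.
def pad_to_multiple(dim: int, multiple: int = 64) -> int:
    return ((dim + multiple - 1) // multiple) * multiple

print(pad_to_multiple(1000))   # -> 1024
```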
Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
In large language models like the Generative Pre-trained Transformer, the Mixture of Experts paradigm has emerged as a powerful technique for enhancing model expressiveness and accuracy. However, deploying GPT MoE models for parallel infer…
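At its core, the MoE dispatch being optimized is top-k gating that routes each token to a few experts; below is a hedged NumPy sketch of top-2 routing, using generic MoE conventions rather than the paper's inter-layer affinity placement.

```python
# Top-2 gating for a Mixture-of-Experts layer, NumPy only (illustrative shapes).
import numpy as np

tokens, hidden, experts = 8, 16, 4
x = np.random.randn(tokens, hidden)
w_gate = np.random.randn(hidden, experts)

logits = x @ w_gate
top2 = np.argsort(logits, axis=1)[:, -2:]        # two experts chosen per token

for e in range(experts):
    idx = np.where((top2 == e).any(axis=1))[0]   # tokens dispatched to expert e
    print(f"expert {e} receives tokens {idx.tolist()}")
```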
Optimizing Amber for Device-to-Device GPU Communication
Although direct GPU-to-GPU communication has been possible in MPI libraries for over a decade, the limited availability of compatible hardware at academic HPC centers has discouraged the development of algorithms in scientific applications…
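With a CUDA-aware MPI build, GPU-resident buffers can be handed to MPI calls directly, which is the capability this work exploits; here is a minimal mpi4py + CuPy sketch under that assumption, not Amber's actual communication code.

```python
# Direct device-to-device Send/Recv of a CuPy buffer. Assumes mpi4py is built
# against a CUDA-aware MPI library; otherwise host staging would be required.
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = cp.arange(1 << 20, dtype=cp.float64)   # buffer lives in GPU memory

if rank == 0:
    comm.Send(buf, dest=1, tag=0)            # GPU pointer passed straight to MPI
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)
```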
Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference
Autoregressive models, despite their commendable performance in a myriad of generative tasks, face challenges stemming from their inherently sequential structure. Inference on these models, by design, harnesses a temporal dependency, where…
MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, and necessitated distribution among processors. Training such massi…
Performance Characterization of using Quantization for DNN Inference on Edge Devices: Extended Version
Quantization is a popular technique used in Deep Neural Networks (DNN) inference to reduce the size of models and improve the overall numerical performance by exploiting native hardware. This paper attempts to conduct an elaborate performa…
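For reference, post-training dynamic quantization of a model's Linear layers looks like the generic PyTorch snippet below; this is standard torch API usage, not the benchmark setup or edge devices characterized in the paper.

```python
# Dynamically quantize Linear layers to int8 weights and run inference.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)   # inference on the quantized model
```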
SAI: AI-Enabled Speech Assistant Interface for Science Gateways in HPC
High-Performance Computing (HPC) is increasingly being used in traditional scientific domains as well as emerging areas like Deep Learning (DL). This has led to a diverse set of professionals who interact with state-of-the-art HPC systems.…
Lightning Talks of EduHPC 2022
The lightning talks at EduHPC provide an opportunity to share early results and insights on parallel and distributed computing (PDC) education and training efforts. The four lightning talks at EduHPC 2022 cover a range of topics in broaden…
Tutorial 1.A Introduction to Networking Technologies for High-Performance Computing
InfiniBand (IB), High-speed Ethernet (HSE), RoCE, Omni-Path, EFA, Tofu, and Slingshot technologies are generating a lot of excitement towards building next generation High-End Computing (HEC) systems including clusters, datacenters, file s…
OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems
Python has become a dominant programming language for emerging areas like Machine Learning (ML), Deep Learning (DL), and Data Science (DS). An attractive feature of Python is that it provides an easy-to-use programming interface while allowin…
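The style of measurement OMB-Py packages (with proper warm-up, message-size sweeps, and multiple buffer backends) can be sketched as a bare-bones mpi4py ping-pong loop; the size and iteration count below are illustrative, not OMB-Py's defaults.

```python
# Minimal two-rank ping-pong latency loop (run with: mpirun -np 2 python ...).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1024, dtype=np.uint8)   # 1 KB message
iters = 1000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
t1 = MPI.Wtime()

if rank == 0:
    print(f"avg one-way latency: {(t1 - t0) / (2 * iters) * 1e6:.2f} us")
```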
Cross-layer Visualization and Profiling of Network and I/O Communication for HPC Clusters
Understanding and visualizing the full-stack performance trade-offs and interplay between HPC applications, MPI libraries, the communication fabric, and the file system is a challenging endeavor. Designing a holistic profiling and visualiz…
INAM: Cross-stack Profiling and Analysis of Communication in MPI-based Applications
Understanding the full-stack performance trade-offs and interplay among HPC applications, MPI libraries, the communication fabric, and the job scheduler is a challenging endeavor. Unfortunately, existing profiling tools are disjoint and on…
Efficient MPI-based Communication for GPU-Accelerated Dask Applications
Dask is a popular parallel and distributed computing framework, which rivals Apache Spark to enable task-based scalable processing of big data. The Dask Distributed library forms the basis of this computing engine and provides support for …
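For context, a Dask Distributed workload launched over MPI looks like the sketch below (using the dask-mpi package to start the scheduler and workers under mpiexec); the paper's contribution, an MPI-based communication backend for the GPU data path, is not shown here.

```python
# Launch with: mpirun -np 4 python this_script.py
# Rank 0 becomes the scheduler, rank 1 runs this client code, the rest are workers.
from dask_mpi import initialize
from dask.distributed import Client
import dask.array as da

initialize()
client = Client()

x = da.random.random((10000, 10000), chunks=(1000, 1000))
print((x @ x.T).mean().compute())   # task graph executed across the MPI-launched workers
```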
27th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2020) Technical program
HiPC 2020 is the 27th edition of the IEEE International Conference on High Performance Computing, Data, and Analytics. The conference focuses not only on HPC but also on Data Science. Due to the COVID-19 pandemic, this year the conferenc…
HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow
To reduce training time of large-scale DNNs, scientists have started to explore parallelization strategies like data-parallelism, model-parallelism, and hybrid-parallelism. While data-parallelism has been extensively studied and developed,…
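Model parallelism, one of the strategies contrasted above, amounts to placing different stages of the network on different ranks and shipping activations between them; here is a hedged mpi4py sketch of a two-stage forward pass, purely illustrative and not HyPar-Flow's Keras-level interface.

```python
# Two-stage model-parallel forward pass: rank 0 holds the first block of layers,
# rank 1 the second; activations travel by point-to-point messages.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
batch, hidden = 32, 256

if rank == 0:
    x = np.random.randn(batch, hidden).astype(np.float32)
    w0 = np.random.randn(hidden, hidden).astype(np.float32)
    act = np.maximum(x @ w0, 0.0)          # stage 1: Linear + ReLU
    comm.Send(act, dest=1, tag=0)          # ship activations to the next stage
elif rank == 1:
    act = np.empty((batch, hidden), dtype=np.float32)
    comm.Recv(act, source=0, tag=0)
    w1 = np.random.randn(hidden, 10).astype(np.float32)
    print((act @ w1).shape)                # stage 2 produces the output
```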
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of lar…
EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications
Scientists from many different fields have been developing Bulk‐Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger‐scale future HPC syst…
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Dense Multi-GPU systems have recently gained a lot of attention in the HPC arena. Traditionally, MPI runtimes have been primarily designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and CUD…
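The broadcast in question is the propagation of one rank's parameters to every other GPU; the MPI side of the comparison can be sketched with mpi4py over a CuPy buffer, assuming a CUDA-aware MPI build (the NCCL side is not shown).

```python
# Broadcast GPU-resident parameters from rank 0 to all ranks via MPI.
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    params = cp.arange(1 << 20, dtype=cp.float32)   # rank 0 owns the initial weights
else:
    params = cp.empty(1 << 20, dtype=cp.float32)

comm.Bcast(params, root=0)   # every rank now holds rank 0's parameters
```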
System-level Scalable Checkpoint-Restart for Petascale Computing
Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for …