Hari Subramoni
Cyberinfrastructure for machine learning applications in agriculture: experiences, analysis, and vision
Introduction Advancements in machine learning (ML) algorithms that make predictions from data without being explicitly programmed and the increased computational speeds of graphics processing units (GPUs) over the last decade have led to r…
Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
Scaling up Large Language Model (LLM) training involves fitting a tremendous number of training parameters across a limited number of workers. However, methods like ZeRO-3 that drastically reduce GPU memory pressure often incur heavy commun…
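A minimal sketch of the ZeRO-3-style partitioning this abstract refers to, written with mpi4py and NumPy: each rank keeps only a shard of the parameters and gathers the full set just before it is needed. The sizes, names, and gather-then-free pattern are illustrative assumptions, not the paper's design.

```python
# ZeRO-3-style parameter sharding sketch (illustrative; assumes the parameter
# count divides evenly across ranks).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

full_len = 1 << 20                                            # total parameter count (illustrative)
shard = np.random.rand(full_len // size).astype(np.float32)   # this rank's shard

# Gather the full parameter set only when a layer actually needs it ...
full_params = np.empty(full_len, dtype=np.float32)
comm.Allgather(shard, full_params)

# ... compute with full_params, then drop it so only the shard stays resident.
del full_params
```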
Accelerating Large Language Model Training with Hybrid GPU-based Compression
Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive commun…
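The data-parallel collective that all three strategies ultimately lean on is a gradient AllReduce; below is a hedged mpi4py sketch of that baseline, not the hybrid compression design proposed here.

```python
# Plain data-parallel gradient AllReduce with mpi4py + NumPy (baseline only).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

local_grad = np.random.rand(1 << 22).astype(np.float32)  # this rank's gradients
avg_grad = np.empty_like(local_grad)

comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)  # sum across all ranks
avg_grad /= comm.Get_size()                       # average to get the global gradient
```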
Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters
With the increasing scale of High-Performance Computing (HPC) and Deep Learning (DL) applications through GPU adaptation, the seamless communication of data stored on GPUs has become a critical factor in enhancing overall application perfo…
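A toy illustration of where compression sits relative to the collective, assuming a simple fp32-to-fp16 cast as the "compressor"; the paper's GPU-based schemes are more sophisticated than this.

```python
# Cast gradients to fp16, exchange the compressed blocks as raw bytes with
# Allgather, then decompress and reduce locally. Illustrative pattern only.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

grads = np.random.rand(1 << 20).astype(np.float32)
compressed = grads.astype(np.float16)                     # 2x smaller payload

recv = np.empty(size * compressed.size, dtype=np.float16)
comm.Allgather(compressed.view(np.uint8), recv.view(np.uint8))

avg = recv.reshape(size, -1).astype(np.float32).mean(axis=0)  # decompress + average
```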
The Case for Co-Designing Model Architectures with Hardware
While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL …
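One concrete instance of the co-design argument is keeping GEMM dimensions (hidden size, vocabulary size, and so on) at hardware-friendly multiples; a small helper sketching that rule of thumb, with the multiple chosen here as an assumption rather than a value from the paper.

```python
# Round a model dimension up to a hardware-friendly multiple so GPU matrix
# engines see well-shaped tiles. The multiple of 64 is illustrative.
def pad_to_multiple(dim: int, multiple: int = 64) -> int:
    return ((dim + multiple - 1) // multiple) * multiple

print(pad_to_multiple(1000))   # -> 1024
```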
Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
In large language models like the Generative Pre-trained Transformer, the Mixture of Experts paradigm has emerged as a powerful technique for enhancing model expressiveness and accuracy. However, deploying GPT MoE models for parallel infer…
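At its core, the MoE dispatch being optimized is top-k gating that routes each token to a few experts; below is a hedged NumPy sketch of top-2 routing, using generic MoE conventions rather than the paper's inter-layer affinity placement.

```python
# Top-2 gating for a Mixture-of-Experts layer, NumPy only (illustrative shapes).
import numpy as np

tokens, hidden, experts = 8, 16, 4
x = np.random.randn(tokens, hidden)
w_gate = np.random.randn(hidden, experts)

logits = x @ w_gate
top2 = np.argsort(logits, axis=1)[:, -2:]        # two experts chosen per token

for e in range(experts):
    idx = np.where((top2 == e).any(axis=1))[0]   # tokens dispatched to expert e
    print(f"expert {e} receives tokens {idx.tolist()}")
```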
Optimizing Amber for Device-to-Device GPU Communication
Although direct GPU-to-GPU communication has been possible in MPI libraries for over a decade, the limited availability of compatible hardware at academic HPC centers has discouraged the development of algorithms in scientific applications…
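With a CUDA-aware MPI build, GPU-resident buffers can be handed to MPI calls directly, which is the capability this work exploits; here is a minimal mpi4py + CuPy sketch under that assumption, not Amber's actual communication code.

```python
# Direct device-to-device Send/Recv of a CuPy buffer. Assumes mpi4py is built
# against a CUDA-aware MPI library; otherwise host staging would be required.
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = cp.arange(1 << 20, dtype=cp.float64)   # buffer lives in GPU memory

if rank == 0:
    comm.Send(buf, dest=1, tag=0)            # GPU pointer passed straight to MPI
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)
```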
Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference
Autoregressive models, despite their commendable performance in a myriad of generative tasks, face challenges stemming from their inherently sequential structure. Inference on these models, by design, harnesses a temporal dependency, where…
MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, and necessitated distribution among processors. Training such massi…
Performance Characterization of using Quantization for DNN Inference on Edge Devices: Extended Version
Quantization is a popular technique used in Deep Neural Networks (DNN) inference to reduce the size of models and improve the overall numerical performance by exploiting native hardware. This paper attempts to conduct an elaborate performa…
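For reference, post-training dynamic quantization of a model's Linear layers looks like the generic PyTorch snippet below; this is standard torch API usage, not the benchmark setup or edge devices characterized in the paper.

```python
# Dynamically quantize Linear layers to int8 weights and run inference.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)   # inference on the quantized model
```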
SAI: AI-Enabled Speech Assistant Interface for Science Gateways in HPC
High-Performance Computing (HPC) is increasingly being used in traditional scientific domains as well as emerging areas like Deep Learning (DL). This has led to a diverse set of professionals who interact with state-of-the-art HPC systems.…
Lightning Talks of EduHPC 2022
The lightning talks at EduHPC provide an opportunity to share early results and insights on parallel and distributed computing (PDC) education and training efforts. The four lightning talks at EduHPC 2022 cover a range of topics in broaden…
Tutorial 1.A Introduction to Networking Technologies for High-Performance Computing
InfiniBand (IB), High-speed Ethernet (HSE), RoCE, Omni-Path, EFA, Tofu, and Slingshot technologies are generating a lot of excitement towards building next generation High-End Computing (HEC) systems including clusters, datacenters, file s…
OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems
Python has become a dominant programming language for emerging areas like Machine Learning (ML), Deep Learning (DL), and Data Science (DS). An attractive feature of Python is that it provides an easy-to-use programming interface while allowin…
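The style of measurement OMB-Py packages (with proper warm-up, message-size sweeps, and multiple buffer backends) can be sketched as a bare-bones mpi4py ping-pong loop; the size and iteration count below are illustrative, not OMB-Py's defaults.

```python
# Minimal two-rank ping-pong latency loop (run with: mpirun -np 2 python ...).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1024, dtype=np.uint8)   # 1 KB message
iters = 1000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
t1 = MPI.Wtime()

if rank == 0:
    print(f"avg one-way latency: {(t1 - t0) / (2 * iters) * 1e6:.2f} us")
```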
Cross-layer Visualization and Profiling of Network and I/O Communication for HPC Clusters
Understanding and visualizing the full-stack performance trade-offs and interplay between HPC applications, MPI libraries, the communication fabric, and the file system is a challenging endeavor. Designing a holistic profiling and visualiz…
INAM: Cross-stack Profiling and Analysis of Communication in MPI-based Applications
Understanding the full-stack performance trade-offs and interplay among HPC applications, MPI libraries, the communication fabric, and the job scheduler is a challenging endeavor. Unfortunately, existing profiling tools are disjoint and on…
Efficient MPI-based Communication for GPU-Accelerated Dask Applications
Dask is a popular parallel and distributed computing framework, which rivals Apache Spark to enable task-based scalable processing of big data. The Dask Distributed library forms the basis of this computing engine and provides support for …
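For context, a Dask Distributed workload launched over MPI looks like the sketch below (using the dask-mpi package to start the scheduler and workers under mpiexec); the paper's contribution, an MPI-based communication backend for the GPU data path, is not shown here.

```python
# Launch with: mpirun -np 4 python this_script.py
# Rank 0 becomes the scheduler, rank 1 runs this client code, the rest are workers.
from dask_mpi import initialize
from dask.distributed import Client
import dask.array as da

initialize()
client = Client()

x = da.random.random((10000, 10000), chunks=(1000, 1000))
print((x @ x.T).mean().compute())   # task graph executed across the MPI-launched workers
```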
27th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2020) Technical program
HiPC 2020 is the 27th edition of the IEEE International Conference on High Performance Computing, Data, and Analytics. The conference focuses not only on HPC but also on Data Science. Due to the COVID-19 pandemic, this year the conferenc…
HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow
To reduce training time of large-scale DNNs, scientists have started to explore parallelization strategies like data-parallelism, model-parallelism, and hybrid-parallelism. While data-parallelism has been extensively studied and developed,…
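Model parallelism, one of the strategies contrasted above, amounts to placing different stages of the network on different ranks and shipping activations between them; here is a hedged mpi4py sketch of a two-stage forward pass, purely illustrative and not HyPar-Flow's Keras-level interface.

```python
# Two-stage model-parallel forward pass: rank 0 holds the first block of layers,
# rank 1 the second; activations travel by point-to-point messages.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
batch, hidden = 32, 256

if rank == 0:
    x = np.random.randn(batch, hidden).astype(np.float32)
    w0 = np.random.randn(hidden, hidden).astype(np.float32)
    act = np.maximum(x @ w0, 0.0)          # stage 1: Linear + ReLU
    comm.Send(act, dest=1, tag=0)          # ship activations to the next stage
elif rank == 1:
    act = np.empty((batch, hidden), dtype=np.float32)
    comm.Recv(act, source=0, tag=0)
    w1 = np.random.randn(hidden, 10).astype(np.float32)
    print((act @ w1).shape)                # stage 2 produces the output
```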
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of lar…
EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications
Scientists from many different fields have been developing Bulk‐Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger‐scale future HPC syst…
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Dense Multi-GPU systems have recently gained a lot of attention in the HPC arena. Traditionally, MPI runtimes have been primarily designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and CUD…
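The broadcast in question is the propagation of one rank's parameters to every other GPU; the MPI side of the comparison can be sketched with mpi4py over a CuPy buffer, assuming a CUDA-aware MPI build (the NCCL side is not shown).

```python
# Broadcast GPU-resident parameters from rank 0 to all ranks via MPI.
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    params = cp.arange(1 << 20, dtype=cp.float32)   # rank 0 owns the initial weights
else:
    params = cp.empty(1 << 20, dtype=cp.float32)

comm.Bcast(params, root=0)   # every rank now holds rank 0's parameters
```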
System-level Scalable Checkpoint-Restart for Petascale Computing
Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for …