Explanipedia

Coordinated Power Management on Heterogeneous Systems Open

Zhong Zheng, Zhiling Lan, Xingfu Wu, Valerie Taylor, Michael E. Papka · 2025

Performance prediction is essential for energy-efficient computing in heterogeneous computing systems that integrate CPUs and GPUs. However, traditional performance modeling methods often rely on exhaustive offline profiling, which becomes…

Extracting Practical, Actionable Energy Insights from Supercomputer Telemetry and Logs Open

Melanie Cornelius, G.E. Cross, Shilpika Shilpika, Matthew T. Dearing, Zhiling Lan · 2025

As supercomputers grow in size and complexity, power efficiency has become a critical challenge, particularly in understanding GPU power consumption within modern HPC workloads. This work addresses this challenge by presenting a data co-an…

Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes Open

Matthew T. Dearing, Yiheng Tao, Xingfu Wu, Zhiling Lan, Valerie Taylor · 2025

While large language models (LLMs) are increasingly used for generating parallel scientific codes, most efforts emphasize functional correctness, often overlooking performance, especially energy efficiency. We propose LASSI-EE, an automate…

Exploring Uncore Frequency Scaling for Heterogeneous Computing Open

Zhong Zheng, S.F. Sultanov, Michael E. Papka, Zhiling Lan · 2025

High-performance computing (HPC) systems are essential for scientific discovery and engineering innovation. However, their growing power demands pose significant challenges, particularly as systems scale to the exascale level. Prior uncore…

More for Less: Integrating Capability-Predominant and Capacity-Predominant Computing Open

Zhong Zheng, Michael E. Papka, Zhiling Lan · 2025

Computer science Business

Capability jobs (e.g., large, long-running tasks) and capacity jobs (e.g., small, short-running tasks) are two common types of workloads in high-performance computing (HPC). Different HPC systems are typically deployed to handle distinct c…

Preventing Workload Interference with Intelligent Routing and Flexible Job Placement Strategy on Dragonfly System Open

Xin Wang, Yao Kang, Zhiling Lan · 2024

Computer science Biology

Dragonfly is an indispensable interconnect topology for exascale high-performance computing (HPC) systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources with the entire system such that…

Hybrid PDES Simulation of HPC Networks Using Zombie Packets Open

Elkin Cruz-Camacho, Kevin A. Brown, Xin Wang, Xiongxiao Xu, Kai Shu, et al. · 2024

Computer science Mathematics

Although high-fidelity network simulations have proven to be reliable and cost-effective tools to peer into architectural questions for high-performance computing (HPC) networks, they incur a high resource cost. The time spent in simulatin…

LASSI: An LLM-based Automated Self-Correcting Pipeline for Translating Parallel Scientific Codes Open

Matthew T. Dearing, Yiheng Tao, Xingfu Wu, Zhiling Lan, Valerie Taylor · 2024

Computer science

This paper addresses the problem of providing a novel approach to sourcing significant training data for LLMs focused on science and engineering. In particular, a crucial challenge is sourcing parallel scientific codes in the ranges of mil…

Modeling and Analysis of Application Interference on Dragonfly+ Open

Yao Kang, Xin Wang, Neil McGlohon, Misbah Mubarak, Sudheer Chunduri, et al. · 2024

Computer science Biology

Dragonfly class of networks are considered as promising interconnects for next-generation supercomputers. While Dragonfly+ networks offer more path diversity than the original Dragonfly design, they are still prone to performance variabili…

Interpretable Modeling of Deep Reinforcement Learning Driven Scheduling Open

Boyang Li, Zhiling Lan, Michael E. Papka · 2023

Computer science Engineering

In the field of high-performance computing (HPC), there has been recent exploration into the use of deep reinforcement learning for cluster scheduling (DRL scheduling), which has demonstrated promising outcomes. However, a significant c…

Hybrid PDES Simulation of HPC Networks Using Zombie Packets Open

Elkin Cruz-Camacho, Kevin A. Brown, Xin Wang, Xiongxiao Xu, Kai Shu, et al. · 2023

Computer science

Workload Interference Prevention with Intelligent Routing and Flexible Job Placement on Dragonfly Open

Yao Kang, Xin Wang, Zhiling Lan · 2023

Computer science

Dragonfly is an indispensable interconnect topology for exascale HPC systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources with the entire system such that network bandwidth is not exc…

Machine Learning for Interconnect Network Traffic Forecasting: Investigation and Exploitation Open

Xiongxiao Xu, Xin Wang, Elkin Cruz-Camacho, Christopher D. Carothers, Kevin A. Brown, et al. · 2023

Computer science

Interconnect networks play a key role in high-performance computing (HPC) systems. Parallel discrete event simulation (PDES) has been a long-standing pillar for studying large-scale networking systems by replicating the real-world behavior…

CODES - Hybrid modeling (high-fidelity simulation and surrogates) for HPC networks Open

Elkin Cruz-Camacho, Kevin Brown, Xin Wan, Xiongxiao Xu, Kai Shu, et al. · 2023

Computer science Engineering

CODES is an HPC (High performance computing) network simulator built to run on HPC systems. We have extended CODES to use historical packet latency data in order to speed up the simulation runtime. The instructions found in this repository…

DNPC: A Dynamic Node-Level Power Capping Library for Scientific Applications Open

Sahil Sharma, Zhiling Lan, Xingfu Wu, Valerie Taylor · 2023

Computer science Engineering Physics

As the race to exa-scale computing accelerates, power consumption continues to be a critical challenge. While several technologies are available for power management, balancing energy efficiency and application performance during executio…

Study of Workload Interference with Intelligent Routing on Dragonfly Open

Yao Kang, Xin Wang, Zhiling Lan · 2022

Computer science

Dragonfly interconnect is a crucial network technology for supercomputers. To support exascale systems, network resources are shared such that links and routers are not dedicated to any node pair. While link utilization is increased, workl…

DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing Open

Yuping Fan, Boyang Li, Dustin Favorite, Naunidh Singh, J. T. Childers, et al. · 2022

Computer science Mathematics

Cluster schedulers are crucial in high-performance computing (HPC). They determine when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on t…

MRSch: Multi-Resource Scheduling for HPC Open

Boyang Li, Yuping Fan, Matthew T. Dearing, Zhiling Lan, Paul M. Rich, et al. · 2022

Computer science Mathematics

Emerging workloads in high-performance computing (HPC) are embracing significant changes, such as having diverse resource requirements instead of being CPU-centric. This advancement forces cluster schedulers to consider multiple schedul…

Performance and power modeling and prediction using MuMMI and 10 machine learning methods Open

Xingfu Wu, Valerie Taylor, Zhiling Lan · 2022

Computer science Engineering Materials science

Energy‐efficient scientific applications require insight into how high performance computing system features impact the applications' power and performance. This insight can result from the development of performance and power mode…

Hybrid Workload Scheduling on HPC Systems Open

Yuping Fan, Paul B. Rich, William Allcock, Michael E. Papka, Zhiling Lan · 2021

Computer science Engineering Economics

Traditionally, on-demand, rigid, and malleable applications have been scheduled and executed on separate systems. The ever-growing workload demands and rapidly developing HPC infrastructure trigger the interest of converging these applicat…

Q-adaptive Open

Yao Kang, Xin Wang, Zhiling Lan · 2021

Computer science

High-radix interconnects such as Dragonfly and its variants rely on adaptive routing to balance network traffic for optimum performance. Ideally, adaptive routing attempts to forward packets between minimal and non-minimal paths with the l…

DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling Open

Yuping Fan, Zhiling Lan · 2021

Computer science Mathematics

For decades, system administrators have been striving to design and tune cluster scheduling policies to improve the performance of high performance computing (HPC) systems. However, the increasingly complex HPC systems combined with highly…

DRAS-CQSim: A reinforcement learning based framework for HPC cluster scheduling Open

Yuping Fan, Zhiling Lan · 2021

Computer science Mathematics

For decades, system administrators have been striving to design and tune cluster scheduling policies to improve the performance of high performance computing (HPC) systems. However, the increasingly complex HPC systems combined with highly…

Deep Reinforcement Agent for Scheduling in HPC Open

Yuping Fan, Zhiling Lan, J. T. Childers, Paul M. Rich, William Allcock, et al. · 2021

Computer science Mathematics

Cluster scheduler is crucial in high-performance computing (HPC). It determines when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on thei…

Performance and Power Modeling and Prediction Using MuMMI and Ten Machine Learning Methods Open

Xingfu Wu, Valerie Taylor, Zhiling Lan · 2020

Computer science Physics

In this paper, we use modeling and prediction tool MuMMI (Multiple Metrics Modeling Infrastructure) and ten machine learning methods to model and predict performance and power and compare their prediction error rates. We use a fault-tolera…

Zhiling Lan