Zhiling Lan
Coordinated Power Management on Heterogeneous Systems
Performance prediction is essential for energy-efficient computing in heterogeneous computing systems that integrate CPUs and GPUs. However, traditional performance modeling methods often rely on exhaustive offline profiling, which becomes…
Extracting Practical, Actionable Energy Insights from Supercomputer Telemetry and Logs
As supercomputers grow in size and complexity, power efficiency has become a critical challenge, particularly in understanding GPU power consumption within modern HPC workloads. This work addresses this challenge by presenting a data co-an…
Leveraging LLMs to Automate Energy-Aware Refactoring of Parallel Scientific Codes
While large language models (LLMs) are increasingly used for generating parallel scientific codes, most efforts emphasize functional correctness, often overlooking performance, especially energy efficiency. We propose LASSI-EE, an automate…
Exploring Uncore Frequency Scaling for Heterogeneous Computing
High-performance computing (HPC) systems are essential for scientific discovery and engineering innovation. However, their growing power demands pose significant challenges, particularly as systems scale to the exascale level. Prior uncore…
More for Less: Integrating Capability-Predominant and Capacity-Predominant Computing
Capability jobs (e.g., large, long-running tasks) and capacity jobs (e.g., small, short-running tasks) are two common types of workloads in high-performance computing (HPC). Different HPC systems are typically deployed to handle distinct c…
Preventing Workload Interference with Intelligent Routing and Flexible Job Placement Strategy on Dragonfly System
Dragonfly is an indispensable interconnect topology for exascale high-performance computing (HPC) systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources with the entire system such that…
Hybrid PDES Simulation of HPC Networks Using Zombie Packets
Although high-fidelity network simulations have proven to be reliable and cost-effective tools to peer into architectural questions for high-performance computing (HPC) networks, they incur a high resource cost. The time spent in simulatin…
LASSI: An LLM-based Automated Self-Correcting Pipeline for Translating Parallel Scientific Codes
This paper addresses the problem of providing a novel approach to sourcing significant training data for LLMs focused on science and engineering. In particular, a crucial challenge is sourcing parallel scientific codes in the ranges of mil…
Modeling and Analysis of Application Interference on Dragonfly+
Dragonfly-class networks are considered promising interconnects for next-generation supercomputers. While Dragonfly+ networks offer more path diversity than the original Dragonfly design, they are still prone to performance variabili…
Interpretable Modeling of Deep Reinforcement Learning Driven Scheduling
In the field of high-performance computing (HPC), there has been recent exploration into the use of deep reinforcement learning for cluster scheduling (DRL scheduling), which has demonstrated promising outcomes. However, a significant c…
Hybrid PDES Simulation of HPC Networks Using Zombie Packets
No description supplied
Workload Interference Prevention with Intelligent Routing and Flexible Job Placement on Dragonfly
Dragonfly is an indispensable interconnect topology for exascale HPC systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources with the entire system such that network bandwidth is not exc…
Machine Learning for Interconnect Network Traffic Forecasting: Investigation and Exploitation
Interconnect networks play a key role in high-performance computing (HPC) systems. Parallel discrete event simulation (PDES) has been a long-standing pillar for studying large-scale networking systems by replicating the real-world behavior…
CODES - Hybrid modeling (high-fidelity simulation and surrogates) for HPC networks
CODES is a high-performance computing (HPC) network simulator built to run on HPC systems. We have extended CODES to use historical packet latency data in order to speed up the simulation runtime. The instructions found in this repository…
DNPC: A Dynamic Node-Level Power Capping Library for Scientific Applications
As the race to exascale computing accelerates, power consumption continues to be a critical challenge. While several technologies are available for power management, balancing energy efficiency and application performance during executio…
Study of Workload Interference with Intelligent Routing on Dragonfly
Dragonfly interconnect is a crucial network technology for supercomputers. To support exascale systems, network resources are shared such that links and routers are not dedicated to any node pair. While link utilization is increased, workl…
DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing
Cluster schedulers are crucial in high-performance computing (HPC). They determine when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on t…
MRSch: Multi-Resource Scheduling for HPC
Emerging workloads in high-performance computing (HPC) are embracing significant changes, such as having diverse resource requirements instead of being CPU-centric. This advancement forces cluster schedulers to consider multiple schedul…
Performance and power modeling and prediction using MuMMI and 10 machine learning methods
Energy‐efficient scientific applications require insight into how high performance computing system features impact the applications' power and performance. This insight can result from the development of performance and power mode…
Hybrid Workload Scheduling on HPC Systems
Traditionally, on-demand, rigid, and malleable applications have been scheduled and executed on separate systems. The ever-growing workload demands and rapidly developing HPC infrastructure trigger the interest of converging these applicat…
Q-adaptive
High-radix interconnects such as Dragonfly and its variants rely on adaptive routing to balance network traffic for optimum performance. Ideally, adaptive routing attempts to forward packets between minimal and non-minimal paths with the l…
DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling
For decades, system administrators have been striving to design and tune cluster scheduling policies to improve the performance of high performance computing (HPC) systems. However, the increasingly complex HPC systems combined with highly…
DRAS-CQSim: A reinforcement learning based framework for HPC cluster scheduling
For decades, system administrators have been striving to design and tune cluster scheduling policies to improve the performance of high performance computing (HPC) systems. However, the increasingly complex HPC systems combined with highly…
Deep Reinforcement Agent for Scheduling in HPC
A cluster scheduler is crucial in high-performance computing (HPC). It determines when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on thei…
Performance and Power Modeling and Prediction Using MuMMI and Ten Machine Learning Methods
In this paper, we use the modeling and prediction tool MuMMI (Multiple Metrics Modeling Infrastructure) and ten machine learning methods to model and predict performance and power and compare their prediction error rates. We use a fault-tolera…