Charith Mendis
ACT: Automatically Generating Compiler Backends from Tensor Accelerator ISA Descriptions
Tensor compilers play a key role in enabling high-performance implementations of deep learning workloads. These compilers rely on existing CPU and GPU code generation backends to generate device-specific code. Recently, many tensor acceler…
GALA: A High Performance Graph Neural Network Acceleration LAnguage and Compiler
Multiple frameworks and optimizations have been proposed for accelerating Graph Neural Network (GNN) workloads over the years, achieving sizable runtime performance improvements. However, we notice that existing systems usually explore opt…
PilotDB: Database-Agnostic Online Approximate Query Processing with A Priori Error Guarantees
After decades of research in approximate query processing (AQP), its adoption in the industry remains limited. Existing methods struggle to simultaneously provide user-specified error guarantees, eliminate maintenance overheads, and avoid …
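As background on the error guarantees AQP aims for: a minimal sketch of sampling-based approximate aggregation with a confidence interval, in plain Python. The data and sampling fraction are hypothetical, and this is the generic AQP idea, not PilotDB's method.

```python
import random, statistics, math

def approx_avg(values, sample_frac=0.01, z=1.96):
    """Estimate AVG(column) from a uniform sample with a ~95% confidence interval."""
    n = max(2, int(len(values) * sample_frac))
    sample = random.sample(values, n)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(n)
    return mean, z * stderr  # estimate and half-width of the confidence interval

# Hypothetical usage: a column of one million order amounts.
orders = [random.uniform(5, 500) for _ in range(1_000_000)]
est, err = approx_avg(orders)
print(f"AVG ~= {est:.2f} +/- {err:.2f}")
```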
MISAAL: Synthesis-Based Automatic Generation of Efficient and Retargetable Semantics-Driven Optimizations
Using program synthesis to select instructions for and optimize input programs is receiving increasing attention. However, existing synthesis-based compilers face two major challenges that prohibit the deployment of program synthes…
PandasBench: A Benchmark for the Pandas API
The Pandas API has been central to the success of pandas and its alternatives. Despite its importance, there is no benchmark for it, and we argue that we cannot repurpose existing benchmarks (from other domains) for the Pandas API. In this…
COGNATE: Acceleration of Sparse Tensor Programs on Emerging Hardware using Transfer Learning
Sparse tensor programs are essential in deep learning and graph analytics, driving the need for optimized processing. To meet this demand, specialized hardware accelerators are being developed. Optimizing these programs for accelerators is…
Automated Verification of Soundness of DNN Certifiers
The uninterpretability of Deep Neural Networks (DNNs) hinders their use in safety-critical applications. Abstract Interpretation-based DNN certifiers provide promising avenues for building trust in DNNs. Unsoundness in the mathematical log…
SPLAT: A Framework for Optimised GPU Code-Generation for SParse reguLar ATtention
Multi-head self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA) performance across natural language processing and vision tasks. However, their quadratic dependence on sequence lengths has bottlenecked inference speeds. To circ…
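The quadratic bottleneck is visible directly in the shapes of standard attention. A minimal NumPy sketch of single-head scaled dot-product attention (illustrative only, not SPLAT's optimized GPU kernel):

```python
import numpy as np

def attention(Q, K, V):
    """Standard attention: O(n^2) time and memory in sequence length n."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) -- the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d)

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = attention(Q, K, V)   # the (n, n) score matrix dominates cost as n grows
```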
TensorRight: Automated Verification of Tensor Graph Rewrites
Tensor compilers, essential for generating efficient code for deep learning models across various applications, employ tensor graph rewrites as one of the key optimizations. These rewrites optimize tensor computational graphs with the expe…
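As a concrete instance of the kind of rewrite such compilers apply: a hedged NumPy spot-check that transposing a product can be rewritten as the reversed product of transposes. This numeric check is illustrative only; a verifier like TensorRight must establish the rewrite for all shapes.

```python
import numpy as np

# Rewrite rule: transpose(matmul(A, B)) -> matmul(transpose(B), transpose(A))
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)

lhs = (A @ B).T
rhs = B.T @ A.T
assert np.allclose(lhs, rhs)  # holds on this instance; verification covers all shapes
```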
Transforming the Hybrid Cloud for Emerging AI Workloads
This white paper, developed through close collaboration between IBM Research and UIUC researchers within the IIDAI Institute, envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads through innovative, fu…
Hydride: A Retargetable and Extensible Synthesis-based Compiler for Modern Hardware Architectures
As modern hardware architectures evolve to support increasingly diverse, complex instruction sets for meeting the performance demands of modern workloads in image processing, deep learning, etc., it has become ever more crucial for compile…
TGLite: A Lightweight Programming Framework for Continuous-Time Temporal Graph Neural Networks
In recent years, Temporal Graph Neural Networks (TGNNs) have achieved great success in learning tasks for graphs that change over time. These dynamic/temporal graphs represent topology changes as either discrete static graph snapshots (cal…
ConstraintFlow: A DSL for Specification and Verification of Neural Network Analyses
We develop a declarative DSL, ConstraintFlow, that can be used to specify Abstract Interpretation-based DNN certifiers. In ConstraintFlow, programmers can easily define various existing and new abstract domains and transformers, all within just a few 10s of l…
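For readers unfamiliar with such transformers, here is a minimal hand-written Python sketch of an interval-domain ReLU transformer, the kind of definition a DSL like ConstraintFlow lets one express declaratively. The class and names are illustrative, not ConstraintFlow syntax.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float

def relu_transformer(x: Interval) -> Interval:
    """Abstract transformer for ReLU over the interval domain.
    Sound: for any concrete v in [lo, hi], relu(v) lies in the output interval."""
    return Interval(max(0.0, x.lo), max(0.0, x.hi))

print(relu_transformer(Interval(-1.0, 2.0)))  # Interval(lo=0.0, hi=2.0)
```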
Dias: Dynamic Rewriting of Pandas Code
In recent years, dataframe libraries such as pandas have exploded in popularity. Due to their flexibility, they are increasingly used in ad-hoc exploratory data analysis (EDA) workloads. These workloads are diverse, including custom funct…
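To make "rewriting" concrete, here is the flavor of semantics-preserving pandas rewrite such a system targets, as a hedged example; the specific rule shown is illustrative and not necessarily one of Dias's rules.

```python
import pandas as pd

df = pd.DataFrame({"a": range(1_000), "b": range(1_000)})

# Original: row-wise apply, executed in Python one row at a time.
slow = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Rewritten: equivalent vectorized expression, executed in C inside pandas.
fast = df["a"] + df["b"]

assert slow.equals(fast)
```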
FLuRKA: Fast and accurate unified Low-Rank & Kernel Attention
Many efficient approximate self-attention techniques have become prevalent since the inception of the transformer architecture. Two popular classes of these techniques are low-rank and kernel methods. Each of these methods has i…
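A minimal NumPy sketch of the kernel-method family the abstract refers to: replacing softmax with a feature map phi makes attention linear in sequence length. This is the generic linearized-attention idea, not FLuRKA's unified method; the feature map here is a stand-in.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Kernelized attention: phi(Q) @ (phi(K).T @ V), roughly O(n * d^2) not O(n^2 * d)."""
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                                   # (d, d) summary, independent of n
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T        # normalizer, shape (n, 1)
    return (Qp @ KV) / Z

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)   # never materializes an (n, n) matrix
```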
SENSEi: Input-Sensitive Compilation for Accelerating GNNs
Over the years, many frameworks and optimization techniques have been proposed to accelerate graph neural networks (GNNs). Compared to the optimizations explored in these systems, we observe that different matrix re-associations of GNN com…
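The re-association the abstract mentions shows up in the core GNN aggregation: for a sparse adjacency A (n by n), features X (n by f), and weights W (f by h), (AX)W and A(XW) are mathematically equal but can differ greatly in cost depending on sparsity and on f versus h. A hedged SciPy sketch:

```python
import numpy as np
import scipy.sparse as sp

n, f, h = 10_000, 256, 32
A = sp.random(n, n, density=1e-4, format="csr")   # sparse adjacency
X = np.random.randn(n, f)
W = np.random.randn(f, h)

out1 = (A @ X) @ W   # SpMM on wide (n x f) features, then a dense (n x f)(f x h)
out2 = A @ (X @ W)   # shrink to (n x h) densely first, then a cheaper SpMM

assert np.allclose(out1, out2)   # same result; which is faster depends on the input
```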
Learning Large Graph Property Prediction via Graph Segment Training
Learning to predict properties of large graphs is challenging because each prediction requires the knowledge of an entire graph, while the amount of memory available during training is bounded. Here we propose Graph Segment Training (GST),…
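A minimal sketch of the segment idea under stated assumptions: partition each large graph into segments, embed only a sampled subset per training step to bound memory, and combine segment embeddings into the graph-level prediction. Function names are illustrative, not the GST API.

```python
import random

def segment_nodes(num_nodes, num_segments):
    """Partition node ids into contiguous segments (a real system also handles cut edges)."""
    size = -(-num_nodes // num_segments)  # ceiling division
    return [range(i, min(i + size, num_nodes)) for i in range(0, num_nodes, size)]

def training_step(segments, embed_segment, k=2):
    """Embed only k sampled segments per step; the rest stay out of memory (illustrative)."""
    sampled = random.sample(segments, k)
    embeddings = [embed_segment(seg) for seg in sampled]
    return sum(embeddings) / len(embeddings)   # combined graph-level embedding

segs = segment_nodes(num_nodes=1_000_000, num_segments=8)
graph_emb = training_step(segs, embed_segment=lambda seg: float(len(seg)))
```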
COMET: Neural Cost Model Explanation Framework
Cost models predict the cost of executing given assembly code basic blocks on a specific microarchitecture. Recently, neural cost models have been shown to be fairly accurate and easy to construct. They can replace heavily engineered analy…
WACO: Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program
Leveraging the large number of zeros in sparse tensors offers a powerful way to solve complex problems efficiently in many applications. However, optimizing the performance of those applications poses a challenge. Sparse te…
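To ground "format": a minimal sketch of compressed sparse row (CSR), one of the storage formats such a co-optimizer chooses among, together with the matrix-vector product it enables. Illustrative only; WACO also co-selects the loop schedule.

```python
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """y = A @ x where A is stored in CSR: only nonzeros are touched."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

# A = [[1, 0, 2],
#      [0, 0, 3]]
indptr, indices, data = [0, 2, 3], [0, 2, 2], [1.0, 2.0, 3.0]
print(csr_spmv(indptr, indices, data, np.ones(3)))   # [3. 3.]
```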
GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation
Analytical hardware performance models yield swift estimation of desired hardware performance metrics. However, developing these analytical models for modern processors with sophisticated microarchitectures is an extremely laborious task a…
All you need is superword-level parallelism: systematic control-flow vectorization with SLP
Superword-level parallelism (SLP) vectorization is a proven technique for vectorizing straight-line code. It works by replacing independent, isomorphic instructions with equivalent vector instructions. Larsen and Amarasinghe originally pro…
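The packing idea, shown as a NumPy analogy rather than compiler IR (SLP itself operates on straight-line scalar instructions, not Python): independent, isomorphic scalar statements become one vector operation.

```python
import numpy as np

b = np.array([1.0, 2.0, 3.0, 4.0])
c = np.array([10.0, 20.0, 30.0, 40.0])

# Scalar form: independent, isomorphic statements -- exactly what SLP packs.
a = np.empty(4)
a[0] = b[0] + c[0]
a[1] = b[1] + c[1]
a[2] = b[2] + c[2]
a[3] = b[3] + c[3]

# Vectorized form: the single vector-instruction equivalent.
a_vec = b + c
assert np.array_equal(a, a_vec)
```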
VeGen: a vectorizer generator for SIMD and beyond
Vector instructions are ubiquitous in modern processors. Traditional compiler auto-vectorization techniques have focused on targeting single instruction multiple data (SIMD) instructions. However, these auto-vectorization techniques are no…
A Learned Performance Model for Tensor Processing Units
Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration fo…
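One of the uses the abstract lists, sketched: an autotuner that queries a learned cost model instead of timing each candidate on hardware. The `predict_cost` model and the config space here are hypothetical stand-ins.

```python
import itertools

def autotune(configs, predict_cost):
    """Pick the config the learned model scores cheapest -- no hardware runs needed."""
    return min(configs, key=predict_cost)

# Hypothetical search space: tile sizes for a matrix multiply.
configs = list(itertools.product([16, 32, 64, 128], repeat=2))

# Stand-in for a learned model; a real one would be a trained network.
predict_cost = lambda cfg: abs(cfg[0] - 64) + abs(cfg[1] - 32)

print(autotune(configs, predict_cost))   # (64, 32)
```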
DiffTune: Optimizing CPU Simulator Parameters with Learned Differentiable Surrogates
CPU simulators are useful tools for modeling CPU execution behavior. However, they suffer from inaccuracies due to the cost and complexity of setting their fine-grained parameters, such as the latencies of individual instructions. This com…
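A minimal PyTorch sketch of the surrogate idea under stated assumptions: once a differentiable surrogate of the simulator exists, simulator parameters can be fit to measured timings by gradient descent. The surrogate network and the data here are stand-ins, not DiffTune's models.

```python
import torch

# Stand-in differentiable surrogate: maps simulator parameters to a predicted timing.
surrogate = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)

params = torch.zeros(4, requires_grad=True)   # e.g., per-class instruction latencies
measured = torch.tensor([[3.5]])              # ground-truth timing from hardware
opt = torch.optim.Adam([params], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(surrogate(params.unsqueeze(0)), measured)
    loss.backward()   # gradients flow through the surrogate, not the simulator itself
    opt.step()
```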
Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks
Predicting the number of clock cycles a processor takes to execute a block of assembly instructions in steady state (the throughput) is important for both compiler designers and performance engineers. Building an analytical model to do so …
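A heavily simplified PyTorch sketch of the general recipe such neural throughput models follow: tokenize the instructions of a basic block, run a recurrent encoder, and regress a single throughput value. Sizes and tokenization are made up; this is not Ithemal's exact hierarchical architecture.

```python
import torch
import torch.nn as nn

class ThroughputModel(nn.Module):
    def __init__(self, vocab_size=256, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, tokens):                 # tokens: (batch, block_len) instruction ids
        x = self.embed(tokens)
        _, (h, _) = self.rnn(x)
        return self.head(h[-1]).squeeze(-1)    # predicted steady-state cycles

model = ThroughputModel()
block = torch.randint(0, 256, (1, 12))         # a toy 12-token basic block
print(model(block))
```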
Making Caches Work for Graph Analytics
Modern hardware systems are heavily underutilized when running large-scale graph applications. While many in-memory graph frameworks have made substantial progress in optimizing these applications, we show that it is still possible to achi…