Amit Sabne
TensorRight: Automated Verification of Tensor Graph Rewrites Open
Tensor compilers, essential for generating efficient code for deep learning models across various applications, employ tensor graph rewrites as one of the key optimizations. These rewrites optimize tensor computational graphs with the expe…
Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models Open
Large deep learning models have shown great potential with state-of-the-art results in many tasks. However, running these large models is quite challenging on an accelerator (GPU or TPU) because the on-device memory is too limited for the …
A Learned Performance Model for Tensor Processing Units Open
Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration fo…
Fast Distributed Bandits for Online Recommendation Systems Open
Contextual bandit algorithms are commonly used in recommender systems, where content popularity can change rapidly. These algorithms continuously learn latent mappings between users and items, based on contexts associated with them both. R…
RegDem: Increasing GPU Performance via Shared Memory Register Spilling Open
GPU utilization, measured as occupancy, is limited by the parallel threads' combined usage of on-chip resources, such as registers and the programmer-managed shared memory. Higher resource demand means lower effective parallel thread count…
Massively parallel 3D image reconstruction Open
Computed Tomographic (CT) image reconstruction is an important technique used in a wide range of applications. Among reconstruction methods, Model-Based Iterative Reconstruction (MBIR) is known to produce much higher quality CT images; how…
Evaluating Performance Portability of OpenACC Open
Accelerator-based heterogeneous computing is gaining momentum in the High Performance Computing arena. However, the increased complexity of the accelerator architectures demands more generic, high-level programming models. OpenACC is one such …
Programming models, compilers, and runtime systems for accelerator computing Open
Accelerators, such as GPUs and Intel Xeon Phis, have become the workhorses of high-performance computing. Typically, the accelerators act as co-processors, with discrete memory spaces. They possess massive parallelism, along with many othe…