L. E. Wright
HadaCore: Tensor Core Accelerated Hadamard Transform Kernel
We present HadaCore, a modified Fast Walsh-Hadamard Transform (FWHT) algorithm optimized for the Tensor Cores present in modern GPU hardware. HadaCore follows the recursive structure of the original FWHT algorithm, achieving the same asymp…
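For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of the textbook, unnormalized Fast Walsh-Hadamard Transform in plain Python. It is not the HadaCore kernel and does not use Tensor Cores; the function name `fwht` and the iterative (rather than recursive) butterfly formulation are assumptions chosen for illustration only.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (textbook version).

    Illustrative only: this is the classic O(n log n) butterfly scheme,
    not the Tensor Core kernel described in the abstract.
    """
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    h = 1
    while h < n:
        # Each stage combines pairs of elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))
```

Applying the transform twice returns the input scaled by its length, which is a convenient sanity check for any alternative implementation.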
Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition
We propose an implementation of an efficient fused matrix multiplication kernel for W4A16 quantized inference, where we perform dequantization and GEMM in a fused kernel using a SplitK work decomposition. Our implementation shows improveme…
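The idea of fusing dequantization with a SplitK matrix multiplication can be sketched in plain Python as follows. This is not the Triton kernel from the paper: the function names, the per-column `scales` and `zeros` layout, and the unpacked storage of 4-bit weights as integers in 0..15 are all assumptions made for illustration. Each pass of the outer loop stands in for one unit of work that a GPU kernel would run in parallel and accumulate into the shared output.

```python
def dequantize(q, scale, zero):
    """Map a 4-bit integer weight (0..15) back to a float (assumed scheme)."""
    return (q - zero) * scale


def fused_dequant_matmul_splitk(a, q_w, scales, zeros, split_k):
    """Compute a @ dequant(q_w), splitting the K dimension into split_k chunks.

    a:      M x K activations (floats)
    q_w:    K x N quantized weights (ints in 0..15, stored unpacked here)
    scales: per-column scale, length N (hypothetical layout)
    zeros:  per-column zero point, length N (hypothetical layout)
    """
    m, k, n = len(a), len(q_w), len(q_w[0])
    out = [[0.0] * n for _ in range(m)]
    chunk = (k + split_k - 1) // split_k  # ceil(K / split_k)

    for s in range(split_k):
        # Each slice of K is an independent piece of work.
        k_lo, k_hi = s * chunk, min((s + 1) * chunk, k)
        for i in range(m):
            for j in range(n):
                partial = 0.0
                for kk in range(k_lo, k_hi):
                    # Dequantize on the fly instead of materializing weights.
                    w = dequantize(q_w[kk][j], scales[j], zeros[j])
                    partial += a[i][kk] * w
                # Partial sums from every slice accumulate into the output.
                out[i][j] += partial
    return out


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0, 4.0]]
    q_w = [[1, 15], [2, 8], [3, 0], [4, 7]]
    print(fused_dequant_matmul_splitk(a, q_w, [0.5, 0.25], [8, 8], split_k=2))
```

Splitting along K exposes more independent work when M and N are small, at the cost of combining partial sums at the end.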