Efficient warp execution in presence of divergence with collaborative context collection Article Swipe
YOU?
·
· 2015
· Open Access
·
· DOI: https://doi.org/10.1145/2830772.2830796
GPU's SIMD architecture is a double-edged sword confronting parallel tasks with control flow divergence. On the one hand, it provides a high performance yet power-efficient platform to accelerate applications via massive parallelism; however, on the other hand, irregularities induce inefficiencies due to the warp's lockstep traversal of all diverging execution paths. In this work, we present a software (compiler) technique named Collaborative Context Collection (CCC) that increases the warp execution efficiency when faced with thread divergence incurred either by different intra-warp task assignment or by intra-warp load imbalance. CCC collects the relevant registers of divergent threads in a warp-specific stack allocated in the fast shared memory, and restores them only when the perfect utilization of warp lanes becomes feasible. We propose code transformations to enable applicability of CCC to variety of program segments with thread divergence. We also introduce optimizations to reduce the cost of CCC and to avoid device occupancy limitation or memory divergence. We have developed a framework that automates application of CCC to CUDA generated intermediate PTX code. We evaluated CCC on real-world applications and multiple scenarios using synthetic programs. CCC improves the warp execution efficiency of real-world benchmarks by up to 56% and achieves an average speedup of 1.69x (maximum 3.08x).
Related Topics To Compare & Contrast
- Type
- article
- Language
- en
- Landing Page
- https://doi.org/10.1145/2830772.2830796
- https://dl.acm.org/doi/pdf/10.1145/2830772.2830796
- OA Status
- gold
- Cited By
- 34
- References
- 51
- Related Works
- 10
- OpenAlex ID
- https://openalex.org/W2236252626