Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture
2021 · Open Access
DOI: https://doi.org/10.48550/arxiv.2103.03330
OA: W4287283724
Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems. Training a GCN requires the minibatch generator to traverse the graph and sample the sparsely located neighboring nodes to obtain their features. Since real-world graphs often exceed the capacity of GPU memory, current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features before sending them to the GPUs. This approach, however, puts tremendous pressure on host memory bandwidth and the CPU: the CPU must (1) read sparse features from memory, (2) write them back into memory in a dense format, and (3) transfer them from memory to the GPUs. In this work, we propose a novel GPU-oriented data communication approach for GCN training, in which GPU threads directly access sparse features in host memory through zero-copy accesses with minimal CPU involvement. By removing the CPU gathering stage, our method significantly reduces host resource consumption and data access latency. We further present two techniques for achieving high host-memory access efficiency from the GPU: (1) automatic alignment of data access addresses to maximize PCIe packet efficiency, and (2) asynchronous zero-copy access and kernel execution to fully overlap data transfer with training. We incorporate our method into PyTorch and evaluate its effectiveness on several graphs with sizes up to 111 million nodes and 1.6 billion edges. In a multi-GPU training setup, our method is 65-92% faster than the conventional data transfer method, and can even match the performance of all-in-GPU-memory training for graphs that fit entirely in GPU memory.
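To see why the address-alignment technique matters, the following is a minimal illustrative sketch (not the paper's implementation) of the arithmetic behind PCIe packet efficiency: a feature-vector fetch whose start address is not aligned to the transaction granularity straddles one extra read transaction. The 128-byte granularity and the 100-dimensional float32 feature size here are assumptions for illustration only.

```python
# Illustrative sketch: count how many fixed-size PCIe read transactions
# a single feature-vector fetch from host memory touches, depending on
# whether its start address is aligned to the transaction boundary.
PCIE_READ_BYTES = 128  # assumed transaction granularity for illustration

def transactions(start_byte: int, length_bytes: int,
                 granularity: int = PCIE_READ_BYTES) -> int:
    """Number of granularity-sized blocks spanned by [start, start+length)."""
    first = start_byte // granularity
    last = (start_byte + length_bytes - 1) // granularity
    return last - first + 1

# A hypothetical 100-dim float32 feature vector occupies 400 bytes.
feat_bytes = 100 * 4

aligned = transactions(0, feat_bytes)       # fetch starts on a 128 B boundary
misaligned = transactions(120, feat_bytes)  # fetch starts 8 B before a boundary

print(aligned, misaligned)  # 4 vs 5: misalignment costs one extra transaction
```

Under these assumptions, shifting each zero-copy fetch so that it begins on a transaction boundary avoids the extra, partially filled read request; this is the intuition behind aligning GPU access addresses to maximize PCIe packet efficiency.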