Efficient Distribution For Deep Learning On Large Graphs
Published in Workshop on Graph Neural Networks and Systems (GNNSys), 2021
Recommended citation: Loc Hoang, Xuhao Chen, Hochan Lee, Roshan Dathathri, and Keshav Pingali. (2021). "Efficient Distribution For Deep Learning On Large Graphs." Workshop on Graph Neural Networks and Systems (GNNSys), April 2021.
Graph neural networks (GNNs) are compute-intensive; thus, they are attractive for acceleration on distributed platforms. We present DeepGalois, an efficient GNN framework targeting distributed CPUs. DeepGalois is designed for efficient communication of the high-dimensional feature vectors used in GNNs. The graph partitioning engine flexibly supports different partitioning policies and helps the user make tradeoffs among task division, memory usage, and communication overhead, leading to fast feature learning without compromising accuracy. The communication engine minimizes communication overhead by exploiting partitioning invariants and the communication bandwidth of modern clusters. Evaluation on a production cluster with the representative Reddit and ogbn-products datasets demonstrates that DeepGalois on 32 machines is 2.5x and 2.3x faster than on 1 machine in average epoch time and time to accuracy, respectively. On 32 machines, DeepGalois outperforms DistDGL by 4x and 8.9x in average epoch time and time to accuracy, respectively.
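To make the tradeoff between partitioning policy and communication overhead concrete, the following is a minimal sketch and not DeepGalois code. It assumes a modulo-based vertex owner, two simple edge-cut policies, and one feature vector exchanged per mirrored (non-owned) vertex per synchronization round. All names here (`owner`, `partition_edges`, `count_mirrors`, `bytes_per_sync`) and the feature dimension of 128 are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def owner(vertex, num_hosts):
    """Hypothetical owner assignment: vertex id modulo the host count."""
    return vertex % num_hosts


def partition_edges(edges, num_hosts, policy):
    """Assign each directed edge (src, dst) to one host under a simple policy."""
    parts = defaultdict(list)
    for src, dst in edges:
        if policy == "outgoing_edge_cut":
            host = owner(src, num_hosts)  # edge lives with its source
        elif policy == "incoming_edge_cut":
            host = owner(dst, num_hosts)  # edge lives with its destination
        else:
            raise ValueError(f"unknown policy: {policy}")
        parts[host].append((src, dst))
    return parts


def count_mirrors(parts, num_hosts):
    """Count vertex copies held by hosts that do not own the vertex."""
    mirrors = 0
    for host, host_edges in parts.items():
        local_vertices = {v for edge in host_edges for v in edge}
        mirrors += sum(1 for v in local_vertices if owner(v, num_hosts) != host)
    return mirrors


def bytes_per_sync(mirrors, feature_dim, bytes_per_value=4):
    """Rough volume if every mirror exchanges one float32 feature vector."""
    return mirrors * feature_dim * bytes_per_value


if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0)]
    for policy in ("outgoing_edge_cut", "incoming_edge_cut"):
        parts = partition_edges(edges, num_hosts=2, policy=policy)
        m = count_mirrors(parts, num_hosts=2)
        print(policy, "mirrors:", m, "bytes/sync:", bytes_per_sync(m, 128))
```

Under these assumptions, the estimated volume grows linearly with both the number of mirrors and the feature dimension, which is why the choice of partitioning policy matters more for GNN workloads with wide feature vectors than for traditional graph analytics with scalar labels.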