Efficient Large-Scale Graph Processing. Article Swipe
The abundance of large graphs and the high potential for insight extraction from them have fueled interest in large-scale graph processing systems. Despite significant enhancement in recent years, state-of-the-art large-scale graph processing systems only yield suboptimal performance in common scenarios. In this thesis, we address three categories of deficiency in graph processing systems. First, in a distributed environment, the performance of graph processing systems is hindered by computation-communication coupling. We present a new programming abstraction called Compute-Sync-Merge that fully decouples computation and inter-host communication, making substantial performance gain. The second and third categories of deficiency both stem from the lack of version awareness in existing graph processing systems. When multiple versions of a large graph--among which there exists a substantial common subgraph--are processed sequentially, fetching each version as a standalone graph from persistent storage leads to inefficiency. We mitigate such inefficiency by introducing a graph caching layer to a graph processing system, enabling the construction of in-memory graph representation based on cached contents and significantly improving overall system performance. When multiple versions of a large graph are processed in parallel, maintaining each version as a standalone graph in memory and processing it in isolation from other versions incur memory and computation overheads. We redesign the representation and the processing workflow of a multi-version graph, consolidating duplicated graph states across multiple versions via dynamic version splitting. Our approach improves the overall efficiency of a graph processing system by reducing memory footprint and eliminating computation related to redundant state transitions.