A Medium-Grained Algorithm for Distributed Sparse Tensor Factorization
Multi-way data can be modeled with tensors, which are data structures indexed along three or more dimensions. Tensors are increasingly used to analyze extremely large and sparse multi-way datasets in the life sciences, engineering, and business. The canonical polyadic decomposition (CPD) is a popular tensor factorization for discovering latent features and is most commonly computed via the method of alternating least squares (CPD-ALS). The computational time and memory required to compute the CPD limit the size and dimensionality of the tensors that can be solved on a typical workstation, making distributed solution approaches the only viable option. Most methods for distributed-memory systems have focused on distributing the tensor in a coarse-grained, one-dimensional fashion that requires the dense matrix factors to be fully replicated on each node, which is prohibitively expensive. Recent work overcomes this limitation by using a fine-grained decomposition of the tensor nonzeros, at the cost of computationally expensive hypergraph partitioning. To that end, we present a medium-grained decomposition that avoids complete factor replication and communication, while eliminating the need for expensive pre-processing steps. We use a hybrid MPI+OpenMP implementation that exploits multi-core architectures with a low memory footprint. We theoretically analyze the scalability of the coarse-, medium-, and fine-grained decompositions and experimentally compare them across a variety of datasets. Experiments show that the medium-grained decomposition reduces communication volume by 36-90% compared to the coarse-grained decomposition, is 41-76x faster than a state-of-the-art MPI code, and is 1.5-5.0x faster than the fine-grained decomposition on 1024 cores.
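As background for readers unfamiliar with the factorization (this is the standard textbook formulation in conventional notation, not drawn from the abstract itself), the CPD approximates a third-order tensor by a sum of R rank-one outer products, and CPD-ALS updates one factor matrix at a time while holding the others fixed:

\[
\mathcal{X} \;\approx\; \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,
\qquad
\mathbf{A} \;\leftarrow\; \mathbf{X}_{(1)} \left( \mathbf{C} \odot \mathbf{B} \right) \left( \mathbf{B}^{\top}\mathbf{B} \ast \mathbf{C}^{\top}\mathbf{C} \right)^{\dagger},
\]

where \(\mathcal{X} \in \mathbb{R}^{I \times J \times K}\), the vectors \(\mathbf{a}_r, \mathbf{b}_r, \mathbf{c}_r\) are the columns of the factor matrices \(\mathbf{A}, \mathbf{B}, \mathbf{C}\), \(\mathbf{X}_{(1)}\) is the mode-1 unfolding of the tensor, \(\odot\) is the Khatri-Rao product, \(\ast\) is the Hadamard product, and \(\dagger\) denotes the Moore-Penrose pseudoinverse. The sparse matricized-tensor times Khatri-Rao product \(\mathbf{X}_{(1)}(\mathbf{C} \odot \mathbf{B})\) dominates the per-iteration cost, which is why the choice of coarse-, medium-, or fine-grained tensor distribution drives both the computation and the communication of a distributed CPD-ALS solver.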