Publication: Block-Cyclic Dense Linear Algebra
Open/View Files
Date
Published Version
Published Version
Journal Title
Journal ISSN
Volume Title
Publisher
Citation
Abstract
Block-cyclic order elimination algorithms for LU and QR factorization and solve routines are described for distributed memory architectures with processing nodes configured as two-dimensional arrays of arbitrary shape. The cyclic order elimination together with a consecutive data allocation yields good load-balance for both the factorization and solution phases for the solution of dense systems of equations by LU and QR decomposition. Blocking may offer a substantial performance enhancement on architectures for which the level-2 or level-3 BLAS are ideal for operations local to a node. High rank updates local to a node may have a performance that is a factor of four or more higher than a rank-1 update. We show that in many parallel implementations, the O(N2) work in the factorization may be of the same significance as the O(N3) work, even for large matrices. The O(N2) work is poorly load-balanced in two-dimensional nodal arrays, which we show are optimal with respect to communication for consecutive data allocation, block-cyclic order elimination, and a simple, but fairly general, communications model. In our Connection Machine system CM{200 implementation, the peak performance for LU factorization is about 9.4 Gflops/s in 64-bit precision and 16 Gflops/s in 32-bit precision. Blocking offers an overall performance enhancement of about a factor of two. The broadcast and reduce operations fully utilize the bandwidth available in the Boolean cube network interconnecting the nodes along each axis of the two{dimensional nodal array embedded in the cube network.