Publication: Matrix Multiplication on Hypercubes Using Full Bandwidth and Constant Storage
Open/View Files
Date
Published Version
Published Version
Journal Title
Journal ISSN
Volume Title
Publisher
Citation
Abstract
For matrix multiplication on hypercube multiprocessors with the product matrix accumulated in place a processor must receive about P^2/√ N elements of each input operand, with operands of size PxP distributed evenly over N processors. With concurrent communication on all ports, the number of element transfers in sequence can be reduced to P^2/√N logN for each input operand. We present a two-level partitioning of the matrices and an algorithm for the matrix multiplication with optimal data motion and constant storage. The algorithm has sequential arithmetic complexity 2P^3, and parallel arithmetic complexity 2P^3/N. The algorithm has been implemented on the Connection Machine model CM-2. For the performance on the 8K CM-2, we measured about 1.6 Gflops, which would scale up to about 13 Gflops for a 64K full machine.