Publication: All-to-all Communication Algorithms for Distributed BLAS
Open/View Files
Date
Authors
Published Version
Published Version
Journal Title
Journal ISSN
Volume Title
Publisher
Citation
Abstract
Dense Distributed Basic Linear Algebra Subroutine (DBLAS) algorithms based on all{to{all broadcast and all-to-all reduce are presented. For DBLAS, at each all-to-all step, it is necessary to know the data values and the indices of the data values as well. This is in contrast to the more traditional applications of all-to-all broadcast (such as a N-body solver) where the identity of the data values is not of much interest. Detailed schedules for all-to-all broadcast and reduction are given for the data motion of arrays mapped to the processing nodes of binary cube networks using binary encoding and binary-reflected Gray encoding. The algorithms compute the indices for the communicated data locally. No communication bandwidth is consumed for data array indices. For the Connection Machine system CM{200, Hamiltonian cycle based all-to-all communication algorithms improve the performance by a factor of two to ten over a combination of tree, butter network, and router based algorithms. The data rate achieved for all-to-all broadcast on a 256 node Connection Machine system CM-200 is 0.3 Gbytes/sec. The data motion rate for all-to-all broadcast, including the time for index computations and local data reordering, is about 2.8 Gbytes/sec for a 2048 node system. Excluding the time for index computation and local memory reordering the measured data motion rate for all-to-all broadcast is 5.6 Gbytes/s. On a Connection Machine system, CM-200, with 2048 processing nodes, the overall performance of the distributed matrix vector multiply (DGEMV) and vector matrix multiply (DGEMV with TRANS) is 10.5 Gflops/s and 13.7 Gflops respectively.