Publication: All-to-all Communication Algorithms for Distributed BLAS
Date
1993
Authors
Published Version
Journal Title
Journal ISSN
Volume Title
Publisher
The Harvard community has made this article openly available.
Citation
Mathur, Kapil K. and S. Lennart Johnsson. 1993. All-to-all Communication Algorithms for Distributed BLAS. Harvard Computer Science Group Technical Report TR-07-93.
Research Data
Abstract
Dense Distributed Basic Linear Algebra Subroutine (DBLAS) algorithms based on all-to-all broadcast and all-to-all reduce are presented. For DBLAS, at each all-to-all step, both the data values and their indices must be known. This is in contrast to more traditional applications of all-to-all broadcast (such as an N-body solver), where the identity of the data values is of little interest. Detailed schedules for all-to-all broadcast and reduction are given for the data motion of arrays mapped to the processing nodes of binary cube networks using binary encoding and binary-reflected Gray code encoding. The algorithms compute the indices for the communicated data locally; no communication bandwidth is consumed for data array indices. For the Connection Machine system CM-200, Hamiltonian cycle based all-to-all communication algorithms improve the performance by a factor of two to ten over a combination of tree, butterfly network, and router based algorithms. The data rate achieved for all-to-all broadcast on a 256 node Connection Machine system CM-200 is 0.3 Gbytes/s. The data motion rate for all-to-all broadcast, including the time for index computations and local data reordering, is about 2.8 Gbytes/s for a 2048 node system. Excluding the time for index computation and local memory reordering, the measured data motion rate for all-to-all broadcast is 5.6 Gbytes/s. On a Connection Machine system CM-200 with 2048 processing nodes, the overall performance of the distributed matrix-vector multiply (DGEMV) and vector-matrix multiply (DGEMV with TRANS) is 10.5 Gflops/s and 13.7 Gflops/s, respectively.
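The abstract's claim that indices are computed locally, so that no bandwidth is spent communicating them, can be illustrated with a small sketch. The following C program assumes a ring-style (Hamiltonian cycle) all-to-all broadcast schedule in which node addresses along the cycle follow the binary-reflected Gray code; each node derives, from its own address and the step number alone, which node's block it holds at every step. The functions gray and gray_inv and the step rule are illustrative assumptions, not the report's actual schedules, which also cover binary encoding and the reduction case.

/* Sketch (assumption): local index computation for a Hamiltonian-cycle
 * all-to-all broadcast on a d-dimensional binary cube, with cycle order
 * given by the binary-reflected Gray code. Illustrative only. */
#include <stdio.h>

/* binary-reflected Gray code of i */
static unsigned gray(unsigned i) { return i ^ (i >> 1); }

/* inverse Gray code: position of node address g on the Hamiltonian cycle */
static unsigned gray_inv(unsigned g) {
    unsigned i = 0;
    while (g) { i ^= g; g >>= 1; }
    return i;
}

int main(void) {
    const unsigned d = 3, P = 1u << d;     /* 8-node binary cube */
    for (unsigned p = 0; p < P; ++p) {     /* p = node address   */
        unsigned pos = gray_inv(p);        /* position on the cycle */
        printf("node %u:", p);
        /* At step t the node holds the block that originated t positions
         * back on the cycle; the origin is computed locally, so no index
         * information is ever communicated. */
        for (unsigned t = 1; t < P; ++t) {
            unsigned origin = gray((pos + P - t) % P);
            printf(" t=%u<-node%u", t, origin);
        }
        printf("\n");
    }
    return 0;
}

Running the sketch prints, for each of the eight nodes, the origin of the block it holds at each step of the broadcast; the same arithmetic would let a DBLAS kernel place each received block at the correct position in the local operand without any index traffic.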
Description
Other Available Sources
Keywords
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service