Publication:
All-to-all Communication Algorithms for Distributed BLAS

Date

1993

Citation

Mathur, Kapil K. and S. Lennart Johnsson. 1993. All-to-all Communication Algorithms for Distributed BLAS. Harvard Computer Science Group Technical Report TR-07-93.

Abstract

Dense Distributed Basic Linear Algebra Subroutine (DBLAS) algorithms based on all-to-all broadcast and all-to-all reduction are presented. For DBLAS, at each all-to-all step, both the data values and the indices of the data values must be known. This is in contrast to more traditional applications of all-to-all broadcast (such as an N-body solver), where the identity of the data values is of little interest. Detailed schedules for all-to-all broadcast and reduction are given for the data motion of arrays mapped to the processing nodes of binary cube networks using binary encoding and binary-reflected Gray encoding. The algorithms compute the indices for the communicated data locally, so no communication bandwidth is consumed for data array indices. For the Connection Machine system CM-200, Hamiltonian cycle based all-to-all communication algorithms improve performance by a factor of two to ten over a combination of tree, butterfly network, and router based algorithms. The data rate achieved for all-to-all broadcast on a 256 node Connection Machine system CM-200 is 0.3 Gbytes/s. The data motion rate for all-to-all broadcast, including the time for index computations and local data reordering, is about 2.8 Gbytes/s for a 2048 node system. Excluding the time for index computation and local memory reordering, the measured data motion rate for all-to-all broadcast is 5.6 Gbytes/s. On a Connection Machine system CM-200 with 2048 processing nodes, the overall performance of the distributed matrix-vector multiply (DGEMV) and vector-matrix multiply (DGEMV with TRANS) is 10.5 Gflops/s and 13.7 Gflops/s, respectively.
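The abstract notes that the nodes compute the indices of communicated data locally, using the binary-reflected Gray code that orders the Hamiltonian cycle through the binary cube. The following C sketch illustrates that idea only; the helper names (gray, gray_inverse) and the simple shift-by-one exchange schedule are illustrative assumptions, not the schedules given in the report.

/* Illustrative sketch: binary-reflected Gray code mapping and local
 * index computation for an all-to-all broadcast on a d-dimensional
 * binary cube.  The exchange schedule below (each node forwards its
 * current block one position along the Gray-code cycle per step) is
 * an assumption for demonstration, not the report's exact schedule. */
#include <stdio.h>

/* Binary-reflected Gray code of i. */
static unsigned gray(unsigned i) { return i ^ (i >> 1); }

/* Inverse Gray code: recover i such that gray(i) == g. */
static unsigned gray_inverse(unsigned g)
{
    unsigned i = g;
    while (g >>= 1)
        i ^= g;
    return i;
}

int main(void)
{
    const unsigned d = 3;            /* cube dimension: 2^d = 8 nodes */
    const unsigned nodes = 1u << d;

    /* The Gray-code ordering gray(0), gray(1), ... forms a Hamiltonian
     * cycle on the cube.  If every node forwards its current block to
     * its cycle successor at each step, then at step t node p holds the
     * block that originated t positions back on the cycle.  Each node
     * can compute that originating node locally, so no index is ever
     * transmitted along with the data. */
    for (unsigned p = 0; p < nodes; ++p) {
        unsigned pos = gray_inverse(p);   /* p's position on the cycle */
        for (unsigned t = 0; t < nodes; ++t) {
            unsigned src = gray((pos + nodes - t) % nodes);
            printf("step %u: node %u holds block from node %u\n", t, p, src);
        }
    }
    return 0;
}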

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
