Publication: Efficient Data Parallel Implementations of Highly Irregular Problems
Date
1997
Citation
Hu, Yu. 1997. Efficient Data Parallel Implementations of Highly Irregular Problems. Harvard Computer Science Group Technical Report TR-14-97.
Abstract
This dissertation presents optimization techniques for efficient data parallel formulation and implementation of highly irregular problems, and applies them to O(N) hierarchical N–body methods for large–scale N–body simulations. It demonstrates that highly irregular scientific and engineering problems, such as nonadaptive and adaptive O(N) hierarchical N–body methods, can be implemented efficiently in high–level data parallel languages such as High Performance Fortran (HPF) on scalable parallel architectures. It also presents an empirical study of the accuracy–cost tradeoffs of O(N) hierarchical N–body methods.

The dissertation first develops optimization techniques for efficient data parallel implementation of irregular problems, focusing on minimizing data movement through careful management of data distribution and data references, both between the memories of different nodes and within the memory hierarchy of each node. For hierarchical N–body methods, our optimizations for improving arithmetic efficiency include recognizing the dominant computations as matrix–vector multiplications and aggregating them into multiple–instance matrix–matrix multiplications. Experimental results with a Connection Machine Fortran implementation of Anderson's hierarchical N–body method demonstrate performance competitive with that of the best message–passing implementations of the same class of methods.

The dissertation also presents a general data parallel formulation for highly irregular applications, and applies it to an adaptive hierarchical N–body method with highly nonuniform particle distributions.
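The aggregation idea described above can be sketched as follows. When many instances of the same small matrix–vector product arise (as in the translation steps of a hierarchical N–body method), they can be batched into one matrix–matrix multiplication. This NumPy sketch is purely illustrative — the dissertation's implementation is in Connection Machine Fortran, and the shapes and names here are hypothetical:

```python
import numpy as np

# Hypothetical setup: n instances apply the same p x p translation
# matrix A to n different coefficient vectors.
rng = np.random.default_rng(0)
p, n = 8, 1000
A = rng.standard_normal((p, p))      # shared translation matrix
vecs = rng.standard_normal((n, p))   # one coefficient vector per instance

# Naive formulation: n separate matrix-vector multiplications.
loop_result = np.stack([A @ v for v in vecs])

# Aggregated formulation: a single matrix-matrix multiplication over
# all instances, which exploits highly optimized GEMM-style kernels.
batched_result = vecs @ A.T

assert np.allclose(loop_result, batched_result)
```

The payoff is that one large multiplication keeps the arithmetic units busy and amortizes data movement, which is exactly the kind of arithmetic-efficiency gain the abstract refers to.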
The formulation consists of (1) a method for linearizing irregular data structures, (2) a data parallel implementation (in HPF) of graph partitioning algorithms applied to the linearized data structure, and (3) techniques for expressing the irregular communications and nonuniform computations associated with the elements of linearized data structures. Experimental results demonstrate that efficient data parallel (HPF) implementations of highly nonuniform problems are feasible with proper language/compiler/runtime support. Our data parallel N–body code provides a much-needed benchmark for evaluating and improving HPF compilers.

This thesis also develops the first data parallel (HPF) implementation of the geometric partitioning algorithm due to Miller, Teng, Thurston and Vavasis – one of only two provably good partitioning schemes. Our data parallel formulation makes extensive use of segmented prefix sums and parallel selections, and provides a data parallel procedure for geometric sampling. Experiments on partitioning particles for load balance and data interactions, as required in hierarchical N–body algorithms, show that the geometric partitioning algorithm admits an efficient data parallel formulation.

Finally, this thesis studies the accuracy–cost tradeoffs of O(N) hierarchical N–body methods using our implementation of the nonadaptive Anderson's method. The parameters that control the degree of approximation of the computational elements, and the separateness of the interacting elements, govern both the arithmetic complexity and the accuracy of the methods. A scheme is developed for choosing optimal parameters that give the best running time for a prescribed error requirement.
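The segmented prefix sum mentioned above is a core primitive of such data parallel formulations: it computes a running sum that restarts at the beginning of each segment, letting one operation act on many irregular-length sublists at once. A minimal sequential sketch in plain Python (illustrative only — the dissertation uses data parallel HPF primitives, and this function name is hypothetical):

```python
def segmented_prefix_sum(values, flags):
    """Inclusive prefix sum that restarts wherever flags[i] is True
    (i.e., at the first element of each segment)."""
    out = []
    running = 0
    for v, f in zip(values, flags):
        # A True flag marks a segment start, so the sum resets to v.
        running = v if f else running + v
        out.append(running)
    return out

# Three segments of lengths 3, 2, 1: [1, 2, 3], [4, 5], [6]
print(segmented_prefix_sum([1, 2, 3, 4, 5, 6],
                           [True, False, False, True, False, True]))
# -> [1, 3, 6, 4, 9, 6]
```

On a parallel machine the same result is computed in logarithmic depth with a scan, which is why linearizing irregular structures into flat arrays with segment flags makes them amenable to data parallel execution.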
Using this scheme, we find that for a prescribed error, a near–field containing only the nearest neighbor boxes, combined with the hierarchy depth that minimizes the number of arithmetic operations, yields the minimum operation count and therefore the best total running time.
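The depth-selection step can be illustrated with a toy cost model. The functional form and constants below are hypothetical stand-ins, not the dissertation's actual cost analysis: they merely capture the generic tradeoff that a deeper hierarchy shrinks near-field (direct) work while growing far-field (per-box) work.

```python
def optimal_depth(n_particles, cost_model, max_depth=20):
    # Pick the hierarchy depth that minimizes the modeled operation count.
    return min(range(1, max_depth + 1),
               key=lambda d: cost_model(n_particles, d))

def toy_cost(n, d):
    # Hypothetical model: near-field work scales with particles per box
    # (n / 8**d per box in an octree), far-field work with the box count.
    boxes = 8 ** d
    return n * (n / boxes) + 200.0 * boxes

print(optimal_depth(100_000, toy_cost))
# -> 4
```

With a calibrated cost model in place of `toy_cost`, the same one-dimensional search gives the depth that minimizes arithmetic operations for a prescribed error, as in the scheme described above.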
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service