Data Parallel Performance Optimizations Using Array Aliasing

. The array aliasing mechanism provided in the Connection Machine Fortran (CMF) language and run{timesystem provides a unique way of identifyingthe memoryaddressspaceslocalto processorswithinthe globaladdressspaceof distributed memory architectures, while staying in the data parallel programming paradigm. We show how thearray aliasingfeaturecan be used e(cid:11)ectivelyin optimizingcommunication and computationperformance. The constructswe present occur frequentlyin many sci-enti(cid:12)c and engineeringapplications,and include various forms of aggregationand array reshaping through array aliasing. The e(cid:11)ectiveness of the optimization techniques is demonstratedon an implementationof Anderson'shierarchicalO(N)N{bodymethod.


Introduction. Data parallel programming provides an e ective
way to write maintainable, portable, and scalable parallel codes.The proprietary data parallel programming language Connection Machine Fortran (CMF) 9], with many characteristics in common with the emerging data parallel programming language, High Performance Fortran (HPF) 3], has been successfully used in solving many structured and unstructured problems in science and engineering.The array aliasing mechanism provided in CMF and its run{time system provides a unique way of identifying the local memory address space as part of the global address space, while staying in the single{threaded control and a global address space.This feature was frequently used in optimizing the performance of many applications by o ering a technique at the language level for managing memory references.This paper presents a set of optimizations for both communication and computation actions using the array aliasing mechanism in CMF.The particular constructs we present aim at minimizing local memory moves in global communication operations, avoiding the need for general communication primitives when specialized, and on most architectures more e cient, communication primitives such as CSHIFT are speci ed, Aiken Computation Lab, Harvard University, 33 Oxford Street, Cambridge, MA 02138.The rst author was supported by the Air Force O ce of Scienti c Research through grants F49620-93-1-0480 and F49620-96-1-0289.
y Department of Computer Science, University of Houston, Houston, Texas 77204-3475.The second author was supported by the Air Force O ce of Scienti c Research through grants F49620-93-1-0480 and F49620-96-1-0289. 1 2 Y. CHARLIE HU AND S. LENNART JOHNSSON aligning and operating on nonconforming arrays without growth in memory requirements and excessive communication, arithmetic performance enhancement through aggregation.The e ectiveness of our techniques is demonstrated on an implementation of Anderson's hierarchical O(N) N{body method 1] for the Connection Machine system CM{5/5E.Of the total execution time, communication accounts for about 10{20% of the total time, with the average eciency for arithmetic operations being about 40% and the total e ciency (including communication) being about 35%.
Section 2 describes the array aliasing feature of CMF.Section 3 presents brie y the computational structure of hierarchical methods and the computational elements in Anderson's method applied to the evaluation of gravitational or Columbic N{body interactions.The optimization techniques using the array aliasing mechanism in CMF are presented in Sections 4 { 6. Section 7 reports some performance results of our implementation of Anderson's method using the optimization techniques, and Section 8 summarizes the results.

High Performance and Connection Machine Fortran. Below
we brie y summarize the new features in HPF.We then present the array aliasing mechanism in CMF, which provides an elegant way to avoid excess data motion, and compare it to the use of extrinsic procedures for the same purpose.
2.1.High Performance Fortran.HPF consists of Fortran 90 with extensions mainly for data management.The main extensions are: 1. Data distribution directives, which describe data aggregation, such as cyclic and block aggregation, and the partitioning of data among memory regions; 2. Parallel FORALL statements and constructs, which allow speci cations of parallel computations on fairly general array sections; 3. Extrinsic procedures (local procedures), which de ne interfaces to procedures written in other programming paradigms, such as explicit message{passing SPMD style; 4. A set of extended intrinsic functions, including mapping inquiry intrinsic subroutines that allow a program to know the exact mapping of an array at run{time.HPF supports data parallel programming with a global address space.Programs can be written without any knowledge of the architecture of the memory system.HPF has inherited the Random Access Memory model used in most programming languages, with the data distribution directives providing a mechanism for the programmer to manage data layout indirectly.In a hierarchical distributed memory system, such as in distributed memory and distributed shared memory architectures, e cient use of the memory system and communications facilities require complex analysis and optimization, largely beyond state{of{the{art compiler and run{time sys-tem technology for such architectures.The consequence is that most compilers for such architectures often generate excess data movement, and frequently fail to invoke the most e cient communication mechanism.For instance, performing a circular shift on an array causes most compilers at best to issue instructions that move every element as speci ed by the shift instruction.At worst, a call to general communication facilities is generated.A more e cient way of dealing with shift instructions is to move data between processors as required, but to eliminate the local memory moves by modifying subsequent local memory references to account for the speci ed move (that was not carried out).One sensible way of avoiding excess data movement is to restructure the program in such a way that even a not{so{sophisticated compiler is able to generate e cient code.This goal can be achieved by exposing the local memory and processor address spaces and giving a programmer explicit control over data allocation and data references.

Connection Machine Fortran.
In CMF, separation of the local and processor address spaces is elegantly achieved through array aliasing within the global programming paradigm.The array aliasing mechanism allows a user to address memory already allocated for an array, as if it were of a di erent type, shape, or layout.No data motion occurs.
Example 1.Let A be an n{dimensional array with shape L 1 ::: L n .
In the mapping of A onto the physical machine, assume that there are p i processors used for axis i, resulting in a subgrid of length s i for axis i within each processor, i.e., L i = s i p i .Using array aliasing, an array alias A alias with shape s 1 ::: s n p 1 ::: p n can be created, with the rst n axes local to each processor, and the last n axes fully distributed.
Example 2. Let A be a two{dimensional array with shape L 1 L 2 .
Assume A is mapped onto the physical machine such that there are p i processors used for axis i, resulting in a subgrid of length s i for axis i within each processor, i.e., L i = s i p i , just as in Example 1. Now, using array aliasing, an array A alias with shape s 1 s 2 p 1 p 2 can be created such that the rst two axes are local to each processor and the last axis fully distributed.
In the above two examples, we have explicitly identi ed the local address space as part of the global address space.The subgrid equivalencing feature in CMF provides a means of managing memory accesses similar to that of the EQUIVALENCE statement in Fortran 77.
In CMF, the array alias variable is declared to have type array descriptor, and the aliasing is created by calling the CMF utility library procedures 11].The aliased array is then passed to a subroutine in which it is declared to be an array with the desired new type, shape, or layout.The following code segment 1  In the current version of HPF, the separation of local and processor address spaces can only be achieved through the use of extrinsic (local) procedures.Within a local procedure, a program can access directly only the memory local to a processor.Access to other parts of the global memory must either be made through explicit message passing, or by returning to the global program.Hence, within HPF, optimizations based on separation of address spaces cannot be achieved within the language itself, but only by mixing programming models (data parallel and message passing).Moreover, mixing programming models and the use of procedure calls increase the di culty of many forms of compiler optimizations.
3. Hierarchical methods.Hierarchical methods are often considered a serious challenge with respect to performance of data parallel programs.In addition to having many of the same challenges as nonhierarchical methods with respect to programming, they also introduce issues associated with poor parallelism close to the root of the hierarchical decomposition as well as low memory and arithmetic e ciencies in operations involving nonconforming arrays.Hierarchical N{body methods encompass yet another issue, namely, the interaction between two di erent data structures; one for the discrete particles, one for the discretized elds.We used an implementation of Anderson's hierarchical N{body method 1] to evaluate the e ectiveness of the techniques we propose.However, only the techniques described for multigrid{embed and multigrid{extract addresses issues unique to hierarchical methods.All other techniques apply to non- hierarchical methods as well.
All code fragments are extracted from our N{body code.Therefore, we here present brie y some essential characteristics of hierarchical N{body methods.Almost all constructs can be understood with no understanding of such methods, however.The hierarchical N{body methods 1,4,12] applied to potential eld evaluation partitions the eld into two parts: total = near field + far field ; (3.1) where near field is the potential due to nearby particles and far field is the potential due to faraway particles.The near{ eld is evaluated through the classical N{body technique of pairwise interactions, while the far{ eld is evaluated hierarchically.The O(N) hierarchical methods di er in the computational elements they use, but share the same computational structure.The hierarchical domain decomposition is illustrated in Figure 3.1.Mesh level 0 represents the entire domain.Mesh level l + 1 is obtained from level l by subdividing each subdomain at level l (parent domain) into four (in two dimensions) or eight (in three dimensions) equally sized subdomains (child domains).Subdomains that are not further subdivided are leaves.
In Anderson's method, Poisson's formula is used to represent solutions of Laplace equation.Let g(x; y; z) denote potential values on a sphere of radius a, and denote the harmonic function external to the sphere with these boundary values.Given a sphere of radius a and a point x with spherical coordinates (r; ; ) outside the sphere, let xp = (cos( )sin( ); sin( )sin( ); cos( )) be the point on the unit sphere along the vector from the origin to the point x.Then, the potential value at x is approximated by (equation (15 where P n is the nth Legendre function, K the number of integration points on the sphere, si their location and w i their weights.This approximation is called an outer{sphere approximation.
The approximation used to represent potentials inside a given region is (equation (16 and is called an inner{sphere approximation.
The outer{sphere and the inner{sphere approximations de ne the computational elements in Anderson's hierarchical method.Outer{sphere approximations are constructed for clusters of particles in the leaf{level subdomains.During the upward pass, outer{sphere approximations of subdomains are combined into a single outer{sphere approximation of the parent subdomain.In the downward pass, the e ects of the subdomains marked i in Figure 3.2 on the subdomain marked X and its descendants are evaluated.Lastly, the leaf{level computations involve the evaluation of the e ects of the hierarchically evaluated far{ eld on the particles, and the direct evaluation of the e ects of particle interaction in the near{ eld. 4. Data structures and data distribution.We start the discussion of our optimization techniques using array aliasing with the type of data structure used for the representation of persistent variables in the hierarchy.
In most hierarchical methods, it is important to assure that data for a subdomain is allocated to the same processor as data for its parent domain, whenever there is a su cient number of parent domains to cover all processors.To accomplish this form of locality we represent persistent data for the hierarchy by ve{dimensional (5{D) arrays that e ectively consists of two 4{D arrays with the same layout as shown in Figure 4.1.Three of the axes represent the organization of the subdomains in the three spatial dimensions, while a fourth axis is used to represent data local to a subdomain.The declaration of the hierarchy of subdomains (for the far{ eld potential) in CMF is: REAL*8 FAR POT(2,K,L,M,N) CMF$LAYOUT FAR POT(:SERIAL,:SERIAL,:,:,:) The compiler directive CMF$LAYOUT speci es that the rightmost three axes of array FAR POT are parallel axes and that the two leftmost are local to each processor (speci ed through the CMF attribute :SERIAL, or * in HPF).The rightmost three axes represent the subdomains at the leaf{level of the hierarchy along the z{, y{, and x{coordinates, respectively.The local axis of extent K is used to store eld values local to a subdomain.The leaf{level subdomains are embedded in FAR POT(1; :; :; :; :), while levels (h i) are embedded in FAR POT(2, :, 2 i 1 : L : 2 i ; 2 i 1 : M : 2 i ; 2 i 1 : N : 2 i ) (see Figure 4.1).The embedding preserves locality between a subdomain and its descendants in the hierarchy.If at some level, there is at least one subdomain per processor, then all descendants for each such subdomain are allocated to the same processor as the subdomain itself.
The 5{D array representation is quite e ective with respect to memory utilization, yet guarantees locality in traversing the hierarchy.The 5{D array representation is easy to use for any depth of the hierarchy; only the extent of the three spatial axes depend on the depth of the hierarchy.Representing each level of the hierarchy as a separate array can clearly be made more memory e cient, but the number of arrays depends on the depth of the hierarchy.Adjusting the number of arrays allocated according to the depth of the hierarchy at run{time is a di cult problem in Fortran and HPF since arrays have to be named at programming time.Using arrays with one of the axis representing the levels of the hierarchy would require support for ragged arrays for space e ciency.But, ragged arrays are neither supported in HPF nor in CMF.The (default) allocation of the array FAR POT to local memory and processor addresses is illustrated in Figure 4.2.First, we note that the extent of the leaf{level subdomain axes, L,M, and N, are powers{of{two.Second, on the Connection Machine systems, the number of processors assigned to a task is also a power{of{two.Moreover, the Connection Machine run{time system as a default attempts to factor the set of processors assigned to a task such that the set of subdomains assigned to a processor has as small a surface area as possible, within the constraint that the number of processors assigned to an array axis is also a power{of{two.Thus, the allocation of parallel array axes can be described entirely in terms of address bits.Local axes, declared as SERIAL, do not allocate local memory in powers{of{two.

Optimizing communication.
In this section we present four constructs using array aliasing to reduce (minimize) communication.The four constructs are comunication in a global address space, illustrated through the CSHIFT intrinsic, assignments between nonconforming arrays in the form of multigrid embed and multigrid{extract, array reshaping to allow nonconforming arrays to conform without communication or memory growth, illustrated through the re-shaping of 1{D particle arrays to conform with arrays for leaf{level subdomain data, exploiting symmetry in all{to{all communication, illustrated for the direct (all pairs) N{body algrorithm.
5.1.Avoiding excessive data motion in global communications.Many compilers, in particular in their early stages of development, assume a xed mapping between the index space and memory addresses, rather than reevaluating the index map according to speci ed communication actions.For computations that exhibit locality of reference in the physical space, properly identifying the part of the address space local to processors combined with the use of \ghost regions" with respect to the speci ed computations on the local address space, can signi cantly reduce both the number of communication actions and the amount of data moved between processors as well as within local memory.The technique described below attempts to minimize the number of shift operations and the amount of data moved in the followed assignments through array sectioning on aliased arrays.
As an example of how data motion can be reduced through the explicit speci cation of local and global address spaces in data parallel programming, we consider the gathering of data from neighboring subdomains through the use of CSHIFT, a common communication primitive in implementing convolution operations.Such operations tend to occur in inner loops in the implementation of nite di erence methods and in signal processing applications.In the nonadaptive N{body methods, the interactions conveniently speci ed through CSHIFT imply a \stencil"that involves up to 875 subdomains in 3{D (the domains marked i in Figure 3.1).
In the context of the three{dimensional arrays of our N{body example, we assume that the shape of the leaf{level subdomains assigned to each processor is S1 S2 S3, with the parallel extents being P1 = L=S1, P2 = M=S2, and P3 = N=S3, respectively.Explicitly identifying the subdomains local to processors through the aliasing mechanism can be made as follows: REAL*8 POT(K,L,M,N) CMF$LAYOUT POT(:SERIAL,:,:,:) REAL*8 POT ALIAS(K,S1,S2,S3,P1,P2,P3) CMF$LAYOUT POT ALIAS(:SERIAL,:SERIAL,:SERIAL,:SERIAL,:,:,:) With a ghost region d subdomains deep on all sides of the subdomain of shape S1 S2 S3, a total of (S1 + 2d) (S2 + 2d) (S3 + 2d) S1 S2 S3 subdomains must be fetched.Fetching them all individually incurs an unnecessarily high communication overhead, in particular when general communication primitives or CSHIFTs with large o sets must be used for the individual fetches.Instead of fetching individual subdomains through independent shift operations, we gather all subdomains required in a certain direction through a single shift of the maximum distance required.
Combining this fairly obvious idea with assignment on aliased arrays yields high e ciency in gathering the subdomains in the ghost region.
The ghost region consists of 26 subregions: six regions adjacent to the six faces of the subdomains local to each processor, 12 regions adjacent to the edges of the local subdomain, and eight regions adjacent to the corners.The regions adjacent to corners are of shape d d d.For the regions adjacent to edges, four are of shape S1 d d, four of shape S2 d d, and four of shape S3 d d.Note that each subregion may cover more than one processor when d > min(S1; S2; S3 processors, with one of the processors being the target processor.Therefore, the total number of additional CSHIFTs equals the number of processors over which the ghost region is distributed after the alignment less one.For 2d < min(S1; S2; S3), the number of CSHIFTs after alignment is seven for a total of 10.This implementation is illustrated below in which all CSHIFTs are performed on the orignal arrays, but all assignments make use of array sectioning on aliased arrays to extract and assign data from the ghost region.The array NBR POT is used to store the local subdomains and its ghost region.Note that, excluding the rst three CSHIFTs used for the alignment, the interprocessor data motion in the above approach is also minimal.
We implemented both the above approaches.We also applied the two approaches naively to individual subdomains without aliasing and blocking with respect to the depth of the ghost region, as shown in Figure 5.1 (a) and (b).Both naive approaches involve excess data motion, since the approach in (a) requires CSHIFTs with the largest o sets while the approach in (b) moves arrays back and forth as shown in Figure 5.1 (c).In both cases, a processor in fact receives some of the data more than once.
For the hierarchical N{body computations in 3{D, all subdomains marked i in Figure 5.2 interact with the subdomain marked b for the so{ called interactive{ eld to local{ eld conversion.Accounting for all 10 relevant planes with respect to subdomain b yields a total of 875 subdomains marked i.For a subdomain b on the boundary of the S1 S2 S3 subdomains local to a processor, the i{marked subdomain furthest away is at distance four.Thus, for our example d = 4. Table 5.1 summarizes the data motion requirements for the four methods for S1 = S2 = S3 = 8.In this Table, K refers to the number of elements fetched for each subdomain.
From Table 5.1, it is clear that our most e ective alternative is more than two orders of magnitude faster than the most straightforward implemention of fetching subdomains in the ghost region.Using array alising, minimizing the number of shift operations using a space lling curve in the ghost region improved the performance on the Connection Machine CM{5 by about 50% compared to direct fetches.We expect the relative perfor- A plane through the subdomains that must be fetched in interactive{ eld to local{ eld conversion in hierarchical N {body methods in three dimensions.mance of the four alternatives for fetching variables for the ghost region to be similar on other architectures.Increased granularity of the local subdomains, i.e., increased values of S1; S2 and S3, does not a ect the number of subregions in the ghost region, only their sizes, except for the corner regions that remain xed for a xed d.Hence, we expect the relative merits of the four alternatives to be preserved for increased granularity of local subdomains.In fact, we expect the relative e ectiveness of the two alternatives using aliasing to increase with increased size of local subdomains.The scalability with respect to machine size is somewhat harder to assess in that even though ghost regions are adjacent in the physical domain they are not necessarily adjacent or even \local" with respect to processors.Hence, global aspects of the communication system may be important.However, for a xed size of the subdomains mapped to processors, i.e., increasing the total number of subdomains along with the working array number of processors, the amount of communication per processor remains xed.Hence, if the average communication distance for communication between adjacent subdomains from the mapping of the physical address space to the processor address space employed by the compiler/run{time system does not increase, then bandwidth requirement per processor does not increase either.Thus, for such communication networks and address space mappings, scalability with respect to machine size should be good, and the relative order between the alternatives preserved.

Multigrid{embed and Multigrid{extract.
In hierarchical methods, interactions between congruent arrays of di erent size occur frequently in proceeding to coarser or ner domain decompositions.In our implementation of the hierarchical N{body methods, we use working arrays of sizes that correspond to the current levels of the hierarchy, while a permanent 5{D array of a shape determined by the leaf{level domain decomposition is used for storage of persistent data, as illustrated in Figure 5.3 and described in Section 4. The computations require frequent interactions between the smaller working arrays, with sizes dependent upon hierarchical level, and the permanent \full resolution" array.
If the elements of the working array and the corresponding elements of the permanent array are assigned to the same processor, then in fact all that is required for exchange of data is a local memory copy.Unfortunately, it is not possible to know at compile time whether or not two corresponding elements in the two arrays are assigned to the same processor, since the number of processors used is a run{time variable, and so are in most cases also the number of leaf{level subdomains, and thus the extents of the parallel axes of the permanent 5{D array.At compile time, it is necessary to assume that assignments between the working and permanent arrays be performed using a general communications mechanism.At run{time, some of these general communications may be converted into local copy operations.The automatic detection and resolution of what type of communication action is needed, requires fairly sophisticated run{time system optimizations, and is beyond the capabilities of any current run{time system for distributed memory architectures.However, if the programmer explicitly identi es that the assignment is between array elements assigned Y. CHARLIE HU AND S. LENNART JOHNSSON to the same processor, then run{time systems with a much lower degree of optimization may still be able to carry out the assignment using local memory copy instead of a general communications mechanism.This is the case for the CM{5 run{time system.The array aliasing mechanism can be used by the programmer to identify that an assignment is in fact between array elements allocated to the same processor.
We now illustrate how to use array aliasing to identify that the assignment of variables from a working array into a larger permament array in fact is a local memory copy in all processors.A straightforward CMF expression for a Multigrid-embed operation at hierarchical level (h i) is This code fragment would cause the compiler to invoke general communication as explained above.However, if the array TMP, which stores the variables for the subdomains at level i of the hierarchy, has at least one subdomain per processor, then the general communication is avoided by ideintifying the assignment as local copying as illustrated in the code fragment below: In the above code, an alias is created for the two arrays to separate their local addresses from their processor addresses.Array sectioning is then performed on the local axes and general communication is avoided.
If the array TMP corresponds to a level of the hierarchy that has fewer subdomains than the number of processors, and therefore the elements of the array TMP cannot be guaranteed to be assigned to the same processors as the corresponding elements of the permanent array, then Multigrid-embed requires interprocessor communication.This lack of alignment is typically the case, since arrays of a size smaller than the number of processors typically are allocated to processor addresses contiguosly.This form of allocation is used on the CM{5.Whenever TMP has fewer subdomains than the number of processors, introducing a second temporary array, TMP2, congruent to TMP and the permanent array, and with one element per processor, may improve the performance.Assignments between TMP2 and the permanent array can be carried out as described above, while the assignments between TMP and TMP2 require interprocessor communication for at least some elements.However, this communication is much more e cient than  the assignment between TMP and the permanent array, since in general, TMP2 is much smaller than the permanent array.Array sectioning can be used to specify the assignment between TMP and TMP2, but any syntactically correct form of assignment can be used as well.
On the CM{5E, the performance improvement achieved through the techniques above for Multigrid-embed was as much as a factor of a hundred or better, as shown in Figure 5.4.

Alignment of nonconforming arrays through reshaping.
Many operations in the data parallel programming model are only valid on conforming arrays.Thus, reshaping interacting arrays such that they are conforming is sometimes necessary, and often very desirable for good performance.If the reshaping can be accomplished with no communication, then clearly there may be a substantial bene t.The example we use from our N{body code is the alignment of 1{D particle arrays with the 5{D arrays for hierarchical data, without any growth in memory requirements for uniform particle dsitributions.Communication is required in organizing the particle data for locality of reference with respect to the leaf{level subdomains to which they belong, but no communication is required for the actual reshaping.
The particle data for N{body codes are frequently given as a collection of 1{D arrays; one array for each particle attribute, such as charge, mass, velocity and coordinates.In the hierarchical N{body methods, the particle data are used in establishing leaf{level subdomain representations prior to the upward pass through the hierarchy (see Figure 3.2).Interaction between the hierarchical domain decomposition and the particle data occurs again after the downward traversal of the hierarchy.In order to maximize the locality in particle and leaf{level subdomain interactions, it is desir-

Morton
Column major Peano-Hilbert able to allocate particles to the same processor as the leaf{level subdomain to which they belong.To accomplish this task, the particles are rst reordered such that they are ordered in the same way across processors as the leaf{level subdomains are ordered across processors.Then, reshaping the 1{D particle arrays to 4{D arrays with three parallel axis of the same extent as the three parallel axis of the permament 5{D array yields the desired result.Using array aliasing the reshaping requires no communication.Below we described a particle reordering and array reshaping that result in collocation of leaf{level subdomains and their particles for uniform distributions.
The leaf{level subdomain to which a particle belongs is determined by the coordinates of the leaf{level subdomain and the particle.Hierarchical subdomain orderings, such as Morton 8] or Peano{Hilbert 5], shown in Figure 5.5, allow one sorted order to be used for any depth of the hierarchical domain decomposition with the ordering properties preserved from level to level.The Morton ordering is achieved by constructing keys for sorting the particles by interleaving the bits of the particle coordinates.The Peano{Hilbert ordering is achieved by constructing keys recursively.Assume the addresses of the particle coordinates are x n :::x 1 x 0 , y n :::y 1 y 0 , and z n :::z 1 z 0 , respectively.First, the Gray code 7] of x n y n z n is generated as the three leading bits of the key k 3n+2 k 3n+1 k 3n .Second, the ith three bits of key k 3i+2 k 3i+1 k 3i determine how the (i 1)th bits of the three axes x i 1 , y i 1 and z i 1 should be encoded to form the (i 1)th three bits of the key k 3i 1 k 3i 2 k 3i 3 .If the ordering of the subdomains is row or column major, as in the typical linearization of array addresses used in common programming languages, then resorting is required for each depth of the hierarchy.
For the row or column major ordering of subdomains, the keys for sorting particles are constructed by partitioning each axis coordinate for a particle into two parts, one corresponding to the axis coordinate of the subdomain to which it belongs, and the other corresponding to its relative location within the subdomain.A row or column major ordering is then created for the particle axes coordinates corresponding to the subdomain coordinates.The part representing the relative location within a subdomain may be in arbitrary order, but could also use a row or column major ordering, as is illustrated in Figure 5.5.
Figure 5.6 illustrates three di erent orderings of 16 2{D subdomains each with four particles.
For the Connection Machine implementation for which performance data are reported in Section 7, the default run{time system ordering of the subdomains is column major order, with an optional row major ordering.We used the default ordering in our benchmarks.We illustrate the steps of our column major sort below: Algorithm: (Column major sort) 1. Find the layout of the parallel axes of the 5{D arrays storing hierarchical subdomain data using the intrinsic mapping functions, 2. For each particle, generate the coordinates of the subdomain to which it belongs, denoted by xx::x, yy::y, and zz:::z; 3. Split the subdomain coordinates into a processor address and a local memory address, written as x::xjx::x, y::yjy::y, z::zjz::z, according to the layout of the 5{D arrays; 4. Form keys for sorting by concatenating the processor addresses with the local memory addresses, written as z::zy::yx::xjz::zy::yx::x; 5. Sort.
If there is at least one subdomain per processor, then for a uniform particle distribution, each particle in the sorted 1{D particle array will be allocated to the same processor as the data of the leaf{level subdomain to which it belongs.The leaf{level subdomain data is contained in the 5{D array representation of hierarchical subdomain data.
The reshaping of the 1{D array to a 4{D array with three parallel axes such that the shape of its parallel axes conform with the shape of the three parallel axes of the 5{D array can be straightforwardly achieved using array aliasing, and no communication will be required.For a near{ uniform particle distribution, it is expected that the coordinate sort will leave most particles in the same processor memory as the leaf{level boxes to which they belong.But the reshaping has to be expressed using gather operations, as shown in the following code fragment: FORALL(I=1:L, J=1:M, K=1:N, MASK(I,J,K)) $ X_4D(II+1,I,J,K) = X(START_PTR(I,J,K)+II) ENDDO 5.4.Exploiting symmetry in all{to{all communication.All{ to{all communication 6] is a critical operation in several applications, such as the naive (direct) N{body algorithm, but also in molecular dynamics computations where a cut{o radius is used.In such computations, all{to{ all interactions take place between all molecules within the cut{o radius of each molecule.All{to{all communication is also required within local regions in the hierarchical N{body methods, as de ned by the so{called near{ eld.This eld contains all particles too close to apply the approximations on which the hierarchical eld evaluation is based.
Using the reshaped 4{D array particle representation described above, the all{to{all communication can be carried out based on the use of space{ lling curves, analogously to the use of such curves in fetching ghost regions.Exploiting symmetry in the all{to{all interaction is fairly straightforward based on the linear ordering o ered by the space{ lling curve.The observation has been made by many others, see for instance 2].The idea of reducing communication by exploiting symmetry is shown in a 2{D example in Figure 5.7.As subdomain 0 traverses subdomains 1{4, the interactions between subdomain 0 and each of the four subdomains will be computed.The contributions from the four subdomains to subdomain 0 are accumulated and communicated along with subdomain 0. Using data parallel programming, subdomains 5{8 will traverse subdomain 0 while subdomain 0 traverses subdomains 1{4.The contributions from subdomains 5 { 8 to subdomain 0 are accumulated and stored in subdomain 0. Finally, the two contributions to subdomain 0 are combined.
6. Optimizing computation.The optimizations presented in this section focus on two issues: carrying out operations on nonconforming arrays without growth in memory requirements and without invoking general communication, improving the performance of large numbers of BLAS type of operations applied to small operands.The rst situation occurs frequently when a given function is applied many times, and at least one of the operands is shared between many applications of the function.An example is a multivalued \constant coe cient" convolution kernel.The second situation is also very common in many scienti c and engineering codes.For instance, the simulation of uid ow using a nite di erence or nite volume approach involves inner products or matrix{vector multiplications in each grid point, or for each nite volume.
For our reference application, the hierarchical N{body method due to Anderson, the rst type of situation occurs frequently in the hierarchical computations.In fact only eight distinct matrices are required for parent{ child domain interactions for all such interactions in 3{D, while a total of 1206 matrices are required for interactive{ eld computations regardless of the number of subdomains, and independent of the level to which the comnputations are applied 2 .With respect to the second case, level{1 BLAS can be aggregated into level{3 BLAS, that can be further aggregated into multiple{instance level{3 BLAS.Multiple{instance BLAS is supported in the Connection Machine Scienti c Software Library, CMSSL 10].The ag-gregation from level{1 to level{2 BLAS is made in the problem formulation, while the aggregation into multiple{instance level{3 BLAS is carried out either through aliasing alone, or aliasing in combination with copying of some local arrays.The copy operation allow for operands of increased size for the level{3 BLAS.
The bene ts of higher level BLAS compared to lower level ones is that performance may be improved through a higher degree of locality of reference.For instance, an inner product allows two oating{point operations to be carried out for each pair of elements fetched from memory, with an additional memory reference required to store the result.A matrix{vector multiplication allows for two oating{point operations for matrix elements fetched.Additional memory references are required for the input vector, and for storing the result vector.A matrix{matrix multiplication performed on b b blocks allow 2b 3 oating{point operations to be performed for each 2b 2 elements fetched with additional memory reference required to store a b b result block for every 2N=b blocks read and 2b 2 N oating{point operations performed.Thus, for these operations, an inner product (level{1 BLAS) approaches one oating{point operation per memory reference as operand size grows, a matrix{vector multiplication (level{2 BLAS) approaches two oating{point operations per memory reference, while a matrix{matrix multiplication (level{3 BLAS) approaches b oating{point operations per memory reference.Since memory bandwidth on most processor architectures is a limiting factor it is highly desirable to use higher level BLAS whenever possible.
Another factor often at least as important for performance is the memory access pattern.Access patterns that lead to frequent or consistent cache misses, TLB misses, or DRAM page faults may have severe performance impact, even when all data is in primary memory.If the array layout in memory was known at compile time, and if all operations were expressed in loop nests or straight{line code or some combination thereof in the source code, then an optimizing compiler may successfully handle the relevant loop ordering and blocking required for peak performance.However, if the array layout is not known at compile time, and if subroutine calls are used for some of the functions, then in addition to the unknown memory layout, the optimization of memory accesses is further complicated by the need for interprocedural analysis.In practice, the situation may be even more di cult through the use of commercial libraries often available only as binaries.In addition to fewer memory references per oating{point operation, the higher{level BLAS have the bene ts to o er more choices for memory accesses to the implementer of the function.For instance, an inner product is de ned through a single loop on two one{dimensional arrays, while matrix{matrix multiplication requires three loops and involves three two{dimensional arrays.
The multiple{instance feature provides at least one loop variable in addition to the loops required for a single{instance BLAS, and thus, o ers ad-ditional freedom in organizing memory accesses.The multiple{instance feature does not a ect the operations count per memory reference.However, looping overhead may be reduced, for instance, if the multiple{instance feature allows for loop fusion.On the CM{5E, aggregating inner products of length 12 into matrix{vector multiplication resulted in a processor eciency of about 36%, while the aggregation of inner products of length 72 resulted in an e ciency of 59%.
Next we present two examples of aggregation, one without copying, the other with copying.Both examples also demonstrates how nonconforming arrays can be used e ectively without memory growth or ine cient memory management.Array aliasing is used to accomplish both tasks.
6.1.Aggregation of Matrix{Vector multiplication into multiple{ instance Matrix{Matrix multiplication.In the example below, we assume that each processor has a collection of subdomains forming a local array of shape S1 S2 S3, with S1; S2; S3 2, as shown in Figure 6.1.Each subdomain stores a vector of length K and must perform a matrix{ vector multiplication with a K K matrix.Moreover, each subdomain uses the same matrix.In order to conserve storage, the K K matrix is not replicated among the subdomains, but stored in a K K P array with P being the number of physical processors.In addition to the challenge of aggregation, the problem as presented also hav the challenge that the two arrays of vectors (one input and one output for each subdomain), and the array of matrices are not conforming.This challenge as well as the aggregation can be handled through the aliasing feature of CMF.The loop structure is shown by the pseudo{code fragment The three loops iterate through every other subdomain along the three axes, which in fact is the requirement in the N{body application from which the example is taken.The above loop nest is embedded in another triple loop nest that implements the use of eight di erent matrices, one for each child of a parent subdomain.The loop body contains a call to the matrix{ vector multiplication subroutine with the matrix having shape K K. Since the same translation matrix is used in the inner three loops, these loops could in principle be combined into a single matrix{matrix multiplication with one matrix of shape K K and the other of shape K S1 2 S2 2 S3 2 .However, such combining is possible through loop fusion only if the stride for the axis of length S1 where we assume that one of the two axes of extent S2 and S3 is used for aggregation of vectors into matrices, and the other for the multiple{instance axis.Explicit looping is still required for the axis of extent S1.
In aggregating matrix{vector operations into matrix{matrix operations, not only is the number of vectors being aggregated of interest, but also the stride between successive vectors, since it a ects the number of cache misses, TLB misses and DRAM page faults in the multiple{instance matrix{matrix multiplication.With cubic, or close to cubic subgrids for minimum communication, either the extents of the subgrid axes are the same or they di er by a factor of two.For relatively small subgrids, the di erence in size of the multiplicand due to the di erence in subgrid axes extents has a larger impact on performance than DRAM page faults and TLB thrashing on CM{5.Hence, for the CM{5 we aggregate vectors into a matrix along the axis with the largest local extent.If two axes, or all three axes are of the same length, the vectors are aggregated along the axis with the largest local extent and with the smallest stride.For relatively large subgrids, vectors are aggregated along the axis with the smallest stride.On the CM{5E, the e ciency in the above operation improved from 36% to 55% for K = 12 and subgrid of extents 32 32 16.The matrices are of shape 12 12 and 12 8 with sixteen such instances handled in a single call.For K = 72 and a subgrid of extents 16 16 8, the e ciency improved only a fraction of a percent and remained 60%.

Aggregating Matrix{Vector Multiplication into Matrix{
Matrix Multiplication Through Copying.We also investigated the performance bene ts of aggregation of matrix{vector multiplication into matrix{matrix multiplication through copying.In the context of the N{ body code we used this technique for the interactive{ eld to local{ eld conversions, i.e., in handling the subdomains marked i in Figure 5.2.Recall that for these computations the local subdomains with their ghost regions are stored as one local subgrid of shape (S1 +8) (S2 +8) (S3 +8).The copying is illustrated in Figure 6.3.
The loop nests before and after copying are as follows, respectively.For S1 = 32; S2 = 32; S3 = 16 and K = 12, the e ciency of the 12 12 by 12 2048 matrix multiplication is 75%.If there are no DRAM page faults, the copying requires 2K cycles for each vector for which the matrix multiplication ideally takes K 2 cycles.Thus, the relative time for copying is about 2=K.This amounts to about 17% for K = 12, and less than 4% for K = 72.With the cost of copying included, the measured e ciency decreased to 53%.For S1 = 16; S2 = 16; S3 = 8 and K = 72, the e ciency of the 72 72 by 72 256 matrix multiplication is 85%.Including the cost of copying, the measured e ciency is 78%.When the copy cost is included, the cost of copying back the result vector is insigni cant since it is amortized over all 875 matrix multiplications that generate the same result vector for each subdomain in the N{body code.If the copy back cost is amortized over fewer matrix{vector multiplications, then copy may actually decrease performance.For instance, if it is amortized over only eight multiplications, as in other parts of the N{body code, then for K = 12 the e ciency was reduced from the 55% mentioned in the previous subsection to 52%, while for K = 72 copying amortized over eight applications in fact increased the e ciency from 60% to 77%.
The copying cost can be reduced by copying a whole column block of (S1 + 8) S2=2 S3=2 subdomains into two linear memory blocks; one for even slices, and one for odd slices, as shown in Figure 6.4.In our application, the cost of copying is reduced to 4 100 875 (S1+8) (S1 K) of that of matrix multiplication, assuming no page faults.Including the cost of copying, the e ciency reaches 60% and 80% for K = 12 and K = 72, respectively.7. The E ectiveness of the Performance optimizations in an hierarchical N{body method.The use of the data parallel performance techniques described here, in an implementation of the hierarchical N{body method due to Anderson resulted in an execution time of 180 seconds on a 256 processor CM{5E for the potential evaluation of 100 million interacting particles.Twelve integration points per sphere (K = 12) were used in this case.The overall e ciency is about 27%, and is fairly independent of machine size.With K = 72 integration points on the sphere, the e ciency improves to 35%.
The timings breakdown for the potential eld calculation of 100 million particles on a 256 processor CM{5E is shown in Table 7.1 for K = 12 and K = 72.The hierarchy depths optimal with respect to oating{point operations, are 8 and 7, respectively.The communication time for K = 12 is 22.3% of the total running time and 10% for K = 72, demonstrating that our techniques for reducing and managing data motion are very e ective.reshaping 1{D particle arrays to 4{D particle arrays, the multigrid functions, the fetching of ghost regions at all levels, replicating matrices at every level, and the CSHIFTs used in the direct evaluation.In the computation time we include the time for forming the far{ eld potential for leaf{level subdomains, all BLAS operations, the copying in the aggregation of BLAS operations for better arithmetic e ciency, the masking in distinguishing boundary subdomains from interior subdomains, the evaluation of the potential due to particles in the far{ eld, and nally the direct evaluation in the near{ eld. Figure 7.1 shows that the speed of our code scales linearly with the number of processors and the number of particles.These timings were collected on CM{5s due to the unavailability of a variety of con gurations of CM{5E systems.All cases use uniform particle distributions in a 3{ D cubic domain, 12 integration points per sphere, and optimal hierarchy depths.It is clear from Figure 7.1 that for a xed number of particles per processor, the e ciency remains independent of the number of processors.The slight uctuation is mainly due to the uctuation in the number of oating{point operations per particle for the optimal hierarchy depth, as shown in Figure 7.2.8. Conclusions.In this paper, we presented a set of techniques for optimizing both communication and computation performance using the array aliasing mechanism in CMF.The array aliasing feature provides a way of separating the local memory address space from the processor address space while staying in the data parallel programming paradigm, and therefore allows the programmer to explicitly manage local and remote memory references at the data parallel language level.
The e ectiveness of our techniques is demonstrated on an implementation in Connection Machine Fortran of Anderson's hierarchical O(N) N{body method.CMF is the only data parallel language that has ever supported array aliasing.The overall e ciency achieved for the evaluation of the potential eld of 100 million uniformly distributed particles and K = 12 integration points on the sphere was 27% on a 256 processor CM{5E.For K = 72 integration points the e ciency is about 35%.

REAL* 8 2 Fig. 5 . 1 .
Fig. 5.1.Optimizing communication in fetching ghost regions through the use of space lling curves.Examples are in 2{D for clarity of presentation.

Fig. 5 . 4 .
Fig. 5.4.Performance comparison of Multigrid-embed implementations with and without array aliasing for a depth{eight 3{D hierarchy on a 256 processor CM-5E.The two{step scheme was used for the rst two cases and the remaining cases used only local copying.
illustrates the programming concept: ).In fact, the entire ghost region s3 e 1) CSHIFTs, each parallel to an axis.For d < min(S1; S2; S3), this amounts to 54 CSHIFTs, since fetching each of the 12 edge regions requires two axis{parallel CSHIFTs, while fetching each of the eight corner regions requires three axis{parallel CSHIFTs.Since the mechanism used for the implementation of CSHIFTs on many architectures is more e cient than general communication, converting CSHIFTs not parallel to an axis into a sequence of axis{parallel shifts may in fact yield better performance than invoking general communication facilities.The conversion is usually made by the run{time system and not under user control.The interprocessor data motion in the above approach is minimal, since only data in ghost region are actually fetched.

Table 5 . 1
Comparison of data motion needs for the fetching of a ghost region four subdomains deep on all sides of an 8 8 8 subgrid assigned to each processor on a 32 processor CM{5E.The local region with its ghost region is stored in a local 16 16 16 aliased array.
Again, the loop nests are inside another triple loop nest with loop indices II, JJ, and KK which e ectively loops through the eight siblings of a parent.

Table 7 . 1
The breakdown of the communication and computation time for potential evaluation for 100 million particles on a 256 processor CM{5E.Fig.7.2.Floating{point operations per particle for optimal hierarchy depth, K=12.