Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System

. The objective of this paper is to analyze the impact of data mapping strategies on the performance of (cid:12)nite element applications. First, we describe a parallel mesh decomposition algorithm based on recursive spectral bisection used to partition the mesh into element blocks. A simple heuristic algorithm then renumbers the mesh nodes. Large three-dimensional meshes demonstrate the e(cid:14)ciency of those mapping strategies and assess the performance of a (cid:12)nite element program for (cid:13)uid dynamics.


Introduction
Data distribution is a crucial issue when implementing nite element techniques on distributed-memory parallel computers.Communication between processing nodes can become a bottleneck if the nite element data structures are not carefully mapped to the processing nodes.In order to minimize this bottleneck, we have developed a set of data mapping strategies and implemented them on the Connection Machine CM-5 system.Special library communication routines taking advantage of data locality to reduce data transfer between processing nodes are used to perform the gather and scatter operations found in nite element applications.Decomposition timings for large tetrahedral meshes are presented, as well as the e ect of data mapping on the performance of a nite element program for computational uid dynamics.

Data Mapping Strategies
Both elements and nodes of an unstructured mesh are mapped onto the vector units of the CM-5 system.We have designed a two-step procedure which performs these mappings: 1. First, the mesh is decomposed into element blocks made of adjacent elements.2. The mesh nodes are then mapped onto the vector units using the mesh partitioning as a criterion for chosing the placement of each node.The objective of these mappings is to achieve as much locality between the nodes and the elements as possible to minimize data transfer through the CM-5 data network.In order to achieve the best computational load balance possible in the nite element program itself, we constrain the elements and the nodes to be uniformly distributed across the vector units, i.e., all vector units hold the same number of elements (resp.nodes) except for the last one which gets whatever elements (resp.nodes) remain.The implementation of both mapping strategies is done on the CM-5 system itself.

Mesh Partitioning
The recursive spectral bisection (RSB) algorithm was chosen as the basis of the data mapping strategies described in this paper.The RSB algorithm was proposed by Pothen, Simon and Liou for reordering sparse matrices 1].Simon then applied it to unstructured mesh partitioning 2].The RSB algorithm has since found wide acceptance in the scienti c community because of the high-quality partitionings it generates.
The RSB algorithm is based on a graph representation of the mesh topology.It is therefore insensitive to regions of highly concentrated elements or to element distorsion.In our implementation, the graph is generated through the dual mesh connectivity, which identi es the elements sharing a face with a given element.* Also a liated with the Division of Applied Sciences, Harvard University In this representation, the mesh elements become the graph vertices and the internal faces correspond to the graph edges.The mesh partitioning is performed using an iterative process which decomposes the whole mesh into two partitions, each of which in turn is decomposed into two partitions, and so on.This process ends when there are as many partitions as vector units in the CM-5 con guration considered.Each iteration of the process just described involves several computational steps: 1. Possible disconnections in a partition are identi ed using a frontal algorithm.
2. The smallest non-zero eigenvalue and its associated eigenvector (also called the Fiedler vector) of the Laplacian matrix L, de ned as L ij = 1; if elements i and j share a face; 0; otherwise. (1) are computed using the Lanczos algorithm.Each Lanczos step includes three dot-product operations, one matrix-vector product and an eigenanalysis of the tridiagonal matrix generated by the Lanczos process.3.After convergence of the Lanczos algorithm, the components of the Fiedler vector are ranked, and this ranking is used to reorder the dual mesh connectivity.4. The graph is then split in two, and this process is repeated on each subgraph.
The RSB algorithm can be computationally intensive since a series of eigenvalue problems have to be solved.In order to keep the partitioning time as small as possible, we have implemented the RSB algorithm on the CM-5 system in a data-parallel fashion.In this implementation, all elements of the mesh are treated in parallel.It implies a two-level parallelization: one level on the partitions generated at a given stage of the decomposition process and the other on the elements in each partition.Most of the resulting code is written in the CM Fortran language 3], except the eigenanalysis of the tridiagonal matrix which is implemented in CDPEAC (a macro-assembler) 4].Details of the implementation can be found in 5].

Node Renumbering
Once the elements have been reordered to obtain element blocks, the mesh nodes are renumbered using the following procedure: 1.Each element is assigned the element block number to which it belongs.2. Each element sends the block number to the nodes it is associated with.Nodes receiving the same block number from their neighboring elements are marked as \interior nodes" and their location code is the block number received.The other nodes are marked as \boundary nodes" and they choose their location code at random from the block numbers they received.3. Nodes are ranked based on their location code with the constraint of having interior nodes ranked before boundary nodes for the same location code.4. Nodes are assigned to the vector units based on their location code in the order obtained at Step 3. Since all nodes may not be assigned during this phase because of the load-balance constraint described at the beginning of Section 2, this strategy forces interior nodes to have a greater probability than boundary nodes of being assigned to the same vector unit as the elements they are associated with. 5. Nodes which have not been assigned during Step 4 are distributed among the vector units which still have room left.This procedure can be easily implemented in a data-parallel fashion, parallelization occuring over the elements for Steps 1 and 2 and over the nodes for Steps 3 through 5.

Partitioning Example
Both examples presented in this paper have been run in 64-bit arithmetic on a 32-processing node CM-5E system.This system is a prototype used internally at Thinking Machines and has the following hardware characteristics: 1.A processing node running at 39 MHz and composed of a SuperSPARC microprocessor controlling four vector units and 32 Mbytes of memory (which can be upgraded to 128 Mbytes), yielding a peak 64-bit oating-point performance of 156 M ops/s/pn.Systems shipped to customers will have processing nodes running at 40 MHz. 2. A new network interface chip which can inject larger packets into the data network at a faster rate.The operating system running on the CM-5E at the time we performed the numerical tests (CMost 7.3 beta 2.9) had the large packet mode disabled.3. A SPARCstation 2 control processor.Systems shipped to customer sites will have SPARCstation 10 control processors.The software and hardware restrictions of this CM-5E prototype should have little impact on the performance of the problems presented in this paper.
A tetrahedral mesh of an assembly part composed of 19;793 nodes and 81;649 elements, courtesy of Mark Shephard (Rensselaer Polytechnic Institute), was used as a partitioning illustration (see Fig. 1).The graph representation of this mesh has 152;878 edges.The decomposition into 32 subdomains depicted in Fig. 2 shows the quality of partitioning.Note that 128 subdomains are actually needed on a 32-node CM-5E, but the resulting picture is too confusing to be shown here.The total cost of partitioning the mesh into 128 subdomains is 24:6 seconds, making the RSB algorithm a competitive strategy for mesh decomposition.At this level of partitioning, there are 10;648 cuts in the graph, which represents 7:0% of the total number of graph edges.Detailed timings based on CM elapsed times (which correspond to the elapsed execution times while the program is not swapped out by the operating system on the processing nodes) are presented in Tables 1 and 2. Figure 3 presents the cost of the RSB algorithm as the bisection procedure progresses.The O(log 2 (no. of partitions)) cost seen in this gure is due to the two-level parallelization of the RSB algorithm.The time spent renumbering the nodes using the algorithm presented in Section 2.2 is 0.2 seconds.To illustrate the performance improvements achieved by proper data mappings, we have computed the inviscid ow around a Falcon Jet ying at Mach 0.85 and at an angle of attack of 1 degree on a 32-node CM-5E system.The mesh, courtesy of Dassault Aviation, has 19;417 nodes and 109;914 tetrahedra (see Fig. 4 for a view of the surface mesh on the airplane).A one-point integration rule was used on the elements.A freestream uniform ow was chosen as initial condition, and the solver was marched 50 time steps at a CFL number of 10, which was su cient to reach steady-state.Two computations were performed successively using the following mapping strategies (referred to as Strategy 1 and Strategy 2, respectively): 1. Random mapping of elements and nodes to the processing nodes.2. Mapping of elements and nodes according to the procedures described in Section 2. Timings for the data mapping and the nite element solver (which is a series of gather/compute/scatter cycles) are given in Table 3.This example shows that proper mapping can improve overall speed of the program by more than a factor of two, even if the partitioning time is included in the total time.In the case of Strategy 2, the computation part of the solver achieves 39:5 M ops/s/pn.The gather and scatter operations yield bandwidths of 24:7 Mbytes/s/pn and 20:0 MBytes/s/pn, respectively.The overall performance of the nite element solver is 1:0 G ops/s, which is 20% of the peak hardware performance.Substantial computational e ciency can therefore be achieved on distributed-memory computers for nite element applications as long as a careful mapping of the data structures is performed.

Table 1 .
Assembly part.CM elapsed times for di erent parts of the RSB algorithm for a partitioning into 128 subdomains on a 32-node CM-5E system.

Table 2 .
Assembly part.Cost analysis for the computation of the Fiedler vector.haveimplemented in CM Fortran a nite element program for solving the compressible Euler and Navier-Stokes equations 6,7].It is based on the Galerkin/least-squares formulation proposed by Hughes et al. 8] and Johnson et al. 9].In the case of steady ow computations, an implicit time-marching scheme is used to reach steady-state.A preconditioned matrix-free algorithm is employed to solve the nonsymmetric systems of equations arising from the nite element discretization.The gather and scatter operations are performed using special communication procedures available from the Connection Machine Scienti c Software Library 10]. We

Table 3 .
Falcon Jet.CM elapsed times for di erent parts of the nite element program run on a 32-node CM-5E system.