Communication primitives for unstructured finite element simulations on data parallel architectures

Efficient data motion is critical for high performance computing on distributed memory architectures. The value of some techniques for efficient data motion is illustrated by identifying generic communication primitives. Further, the efficiency of these primitives is demonstrated on three different applications using the finite element method for unstructured grids and sparse solvers with different communication requirements. For the applications presented, the techniques advocated reduced the communication times by a factor of between 1.5 and 3.


Introduction
The finite element method is a popular technique for solving boundary and initial value problems. Moderate sized engineering problems have been successfully simulated using this technique. The primary bottleneck for the simulation of large problems has been available computational resources. With the advent of massively parallel architectures, simulating significantly larger problems is now possible. The inherent parallelism in the finite element method makes it very attractive for massively parallel architectures. However, the data structures that must be used to exploit the new computing architectures efficiently may be quite different from the data structures that have been used on traditional sequential architectures.
Two important issues that must be addressed on massively parallel architectures are load balance and the efficient use of the communication network that connects the processing nodes. A high overall efficiency can only be achieved with good use of the communication network. The data allocation and the algorithms must be chosen to preserve locality of reference, and to minimize contention in the network. These issues are particularly critical for unstructured discretizations.
The reader is referred to articles by Johnsson and Mathur [1989, 1990], Belytschko et al. [1990], Farhat et al. [1990, 1992], Shapiro [1991], and Johan et al. [1992] for several different data parallel finite element algorithms. This article identifies a common set of communication primitives in the data parallel programming model which express most of the communication needs for unstructured finite element simulations. The need for such primitives is twofold. First, although most of these primitives can be expressed in high level data parallel programming languages, such as Fortran-90 [Metcalf, 1990], Fortran-D [Fox, 1991], and Fortran-Y [Chen, 1992], compilers are currently not capable of optimally exploiting the network interconnecting the processing nodes. High-level communication primitives allow communication operations to be implemented optimally as functions callable from the language of choice or by the compiler. The communication functions, if implemented well, provide optimal selection of paths for moving data, optimal schedules for moving the data along these paths, and highly tuned implementations. Second, it is shown that some of the primitives cannot be expressed in the available high level data parallel programming languages. Availability of the communication primitives suggested here, on different computing platforms, should significantly enhance the portability of application codes both with respect to the programming effort and the achieved performance.
Three different finite element applications are used as examples illustrating the intended use of these communication functions. The Connection Machine systems CM-2/200 have been used as model platforms for these applications. It is emphasized that although this article presents results on the Connection Machine systems CM-2 and CM-200, these applications are easily portable to any architecture with compilers that are based on a global address space (Fortran-90, Fortran-D, and Fortran-Y).
All three example applications are based on three-dimensional unstructured discretizations. The first example application is a three-dimensional, fully dynamic finite element formulation that has been used to study the rate dependent three-dimensional deformation in the Charpy V-notch test [Mathur et al., 1992a; Tvergaard and Needleman, 1986, 1988]. This is an explicit formulation in which the desired solution phase is spread out over several thousand time increments. No solution of sparse linear systems is required. Since the mesh connectivity remains fixed for the entire simulation period, it is beneficial to preprocess the communication patterns and amortize the preprocessing cost over the several thousand time steps.
The second application is a three-dimensional deformation processing simulation where the material response is based on polycrystalline plasticity [Beaudoin et al., 1992; Mathur et al., 1992b; Dawson et al., 1992]. This application involves the solution of a nonlinear sparse system of equations at each time increment. Further, the material behavior at each sampling point is based on a second level of discretization which models a material point on a continuum scale as a collection of crystals. This is an implicit-explicit formulation. The deformation field at a fixed state is computed implicitly. The mechanical state of the material and the geometry is then updated explicitly. There are several interesting levels of data interaction between the two discretizations representing two significantly different length scales.
Finally, the third application is from computational fluid dynamics, where the steady state Euler flow over a complete airplane is computed for a mesh involving almost a million degrees of freedom [Johan et al., 1992]. This is a fully implicit formulation based on a preconditioned, matrix-free, restarted GMRES algorithm [Saad and Schultz, 1986].

Data parallel finite element mappings
The finite element simulations that are described here make no assumptions about the topology of the mesh. Although the results presented here are based on meshes which are composed of only one type of element, the set of communication functions described here can be used for meshes comprising different types of elements. Other equally important issues, such as effective utilization of the processing nodes in the data parallel architecture, must also be taken into account for meshes composed of several different types of elements.
The data parallel implementation for completely arbitrary finite element formulations is best based on two different "views" of the data set. In the first view, one unassembled finite element is mapped onto a processing node of the data parallel architecture. When there are more finite elements than processing nodes, each processing node is assigned a number of finite elements. Further, the mapping of a finite element onto the processing node is done randomly, that is, a finite element labeled i is mapped onto a virtual processing node labeled r(i), where r is a random permutation such that 1 ≤ r(i) ≤ N_e and N_e is the number of finite elements in the mesh [Mathur, 1990]. The random mapping is quite effective in minimizing the contention for the communication channels that occurs in the network connecting the processing nodes of any distributed memory architecture [Valiant and Brebner, 1981; Valiant, 1982; Ranade, 1987; Ranade et al., 1988]. Throughout this article, this view of the data set will be referred to as the "set of finite elements". Finally, the second data representation maps a nodal point (and sometimes a nodal degree of freedom) onto the processing nodes of the architecture. As with the first mapping, a random mapping scheme is used to make the contention for the communication channels uniform. This view of the data set is referred to as a "set of nodal points". The interaction between the two data sets is accomplished by the use of the mesh connectivity array.
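The random mapping described above can be sketched as follows. This is a minimal shared-memory NumPy illustration, not the model architecture's allocator; the function name, the round-robin assignment of permuted labels to nodes, and all sizes are assumptions made for the example.

```python
import numpy as np

def random_mapping(n_items, n_nodes, seed=0):
    """Randomly relabel the items (finite elements or nodal points),
    then deal the relabeled items out to processing nodes round-robin.
    The random permutation spreads mesh neighbors across the machine,
    which makes contention for the communication channels roughly
    uniform (illustrative; not the CM runtime's actual scheme)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_items)      # r(i): a random permutation
    node_of = perm % n_nodes             # virtual node for each item
    return perm, node_of

# 10 finite elements dealt onto 4 processing nodes
perm, node_of = random_mapping(10, 4)
assert sorted(perm.tolist()) == list(range(10))   # r is a permutation
assert node_of.max() < 4
```

Because r is a bijection, every element still lands on exactly one node, and consecutive element labels (which are usually mesh neighbors) are scattered across the node array.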

Applications
This section describes the three applications that were used to identify the set of communication primitives described in this article. The reader is referred to the articles referenced in the appropriate sections for a detailed description of the formulations and their data parallel implementations on the model architecture.

An explicit finite element formulation
The first application involves a data parallel implementation of an unstructured three-dimensional finite element formulation for the rate dependent deformation in the Charpy V-notch test [Mathur et al., 1992a; Tvergaard and Needleman, 1986, 1988]. This is a Lagrangian, fully dynamic and explicit implementation. The governing equations that result from substituting the finite element approximation into the momentum balance are of the form

    M a = F    (1)

where M is the mass matrix, a is the nodal acceleration vector, and F is the nodal force vector that is based on the externally applied traction boundary conditions and the internal forces generated as a consequence of the deformation. The formulation first computes the acceleration of each nodal point.
Next, the equations of motion are integrated by an explicit numerical integration procedure to compute the nodal velocities and the nodal displacements. Finally, the deformation is driven forward in time and the whole procedure is repeated until the solution is known over the entire time period of interest.
A lumped mass matrix based on the scheme proposed by Hinton, Rock, and Zienkiewicz [1976] is used instead of a consistent mass matrix. For explicit formulations, a lumped mass matrix scheme yields better accuracy [Hughes, 1987]. Further, a lumped mass matrix is also preferable from a computational efficiency point of view. For a lumped mass matrix, M in Equation (1) is diagonal, and consequently a sparse linear system solution is no longer necessary to evaluate the nodal accelerations.
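The computational advantage of a diagonal M can be sketched as follows. This is a minimal NumPy sketch with a generic forward integration step; the paper's actual integration procedure and all values below are illustrative assumptions, not the published scheme.

```python
import numpy as np

def explicit_step(M_lumped, F, u, v, dt):
    """One explicit time increment with a lumped (diagonal) mass matrix.
    Because M is diagonal, the nodal accelerations follow from a
    pointwise divide -- no sparse linear solve is needed."""
    a = F / M_lumped                 # M a = F, with M diagonal
    v = v + dt * a                   # integrate velocities
    u = u + dt * v                   # integrate displacements
    return u, v, a

M = np.array([2.0, 2.0, 4.0])        # lumped nodal masses (made up)
F = np.array([1.0, 0.0, -2.0])       # assembled nodal forces (made up)
u, v, a = explicit_step(M, F, np.zeros(3), np.zeros(3), dt=0.1)
```

With a consistent (non-diagonal) mass matrix, the first line would instead require a sparse linear solve at every time increment.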
The mesh is composed of twenty node brick elements with selective integration to prevent locking of these elements due to the nearly incompressible deformations in the plastic range. The material is characterized by a viscoplastic version of an isotropically hardening Mises solid. The constitutive relation is integrated using a rate tangent time stepping procedure. For the class of problems analyzed here, the element sizes are small enough that stability of the explicit integration of the momentum balance, rather than integration of the constitutive relation, controls the maximum permissible time step [Hughes, 1987]. The reader is referred to Mathur et al. [1992a] for a detailed description of the data parallel implementation and a detailed analysis of the three-dimensional simulation results.
The explicit nature of the implementation divides the solution procedure at each time increment into three distinct phases. First, based on the mesh connectivity, all finite elements in the mesh "gather" the nodal data values into local vectors. The nodal values of interest that need to be gathered are the nodal displacements. All other variables of interest already reside on the same processing node as the finite element. Next, the unassembled nodal forces are computed at an element level. The gather operation ensures that the computation of the unassembled nodal forces does not involve any interaction between the two different views of the data set. Consequently, the computation of the unassembled nodal forces can be performed independently for all elements (of the same type) in the mesh. Finally, the nodal forces are assembled into a global force vector by a "scatter" operation where each finite element sends the local unassembled force component to the corresponding nodal point. Since many finite elements share a nodal point, data collisions occur at the destination. In the context of finite element simulations, the colliding values must be added together.
Only two communication primitives are necessary for most explicit finite element formulations: one which gathers the nodal quantities into local elemental vectors and one which assembles the unassembled elemental vectors into a global vector. Both these primitives involve interaction between the set of finite elements and the set of nodal points. The gathering of the nodal displacements is a "one-to-many" mapping between the assembled nodal degrees of freedom and the unassembled nodal degrees of freedom. The assembly phase involves data motion in the reverse direction (a "many-to-one" mapping).
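The three-phase structure (gather, element-level computation, scatter with addition) can be sketched in NumPy as a shared-memory analogue of the two primitives; the tiny 1D mesh, the connectivity, and the element "force" computation below are all made up for illustration.

```python
import numpy as np

# Hypothetical 1D mesh: three 2-node elements over four nodal points.
conn = np.array([[0, 1], [1, 2], [2, 3]])    # mesh connectivity

u = np.array([0.0, 1.0, 3.0, 6.0])           # assembled nodal values

# Phase 1: gather -- a one-to-many mapping; values at shared nodal
# points are fetched by every element that uses them.
u_e = u[conn]                                 # shape (n_elements, 2)

# Phase 2: element-level computation; no inter-element interaction.
f_e = np.stack([u_e[:, 1] - u_e[:, 0],
                u_e[:, 0] - u_e[:, 1]], axis=1)

# Phase 3: scatter with addition -- a many-to-one mapping; the
# colliding contributions at shared nodal points are summed.
f = np.zeros_like(u)
np.add.at(f, conn, f_e)                       # unbuffered accumulation
```

Note that `np.add.at` (rather than plain fancy-index assignment) is what makes the colliding contributions add rather than overwrite one another.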
Figure (1) shows the mesh used to simulate the three-dimensional Charpy V-notch test at the start of deformation and after the first 100 microseconds. The mesh consists of 3200 twenty node bricks. Tables (1) and (2) show the computation time per time increment for the three sections of the formulation described above, as a function of the number of quadrature points. The timings reported in these tables were measured on an 8K CM-200 configuration. The computation time per time increment per finite element per integration point is 11 μs for 2 × 2 × 2 Gauss quadrature and 9 μs for 3 × 3 × 3 Gauss quadrature. If the finite elements and the nodal points in the mesh are not mapped randomly onto the processing elements of the architecture, then for 2 × 2 × 2 quadrature almost 45% of the total computation time is spent in data motion (gather and scatter). When the finite elements and the nodal points are mapped randomly, the reduction in the communication cost is significant: only 28% of the total computation time is then spent in data motion.

Table 1: Computation time (milliseconds) for one time increment for a mesh comprising 3200 20-node brick elements (Figure 1). The integration of the constitutive equations was performed by a 2 × 2 × 2 Gauss quadrature. The timing measurement was performed on an 8K CM-200 configuration.

Table 2: Computation time (milliseconds) for one time increment for a mesh comprising 3200 20-node brick elements (Figure 1). The integration of the constitutive equations was performed by a 3 × 3 × 3 Gauss quadrature. The timing measurement was performed on an 8K CM-200 configuration.

Finite element modeling based on polycrystalline plasticity
The second set of applications described here involves two levels of discretization. The first level of discretization is on the macroscopic scale. On the continuum scale, the discretization is similar to any other finite element simulation. The domain is discretized into a collection of finite elements which model the geometry. The second level of discretization models the microstructure of the material. Underlying each finite element is an aggregate of crystals. The properties of the material at a sampling point within the element are a function of the deformation of the individual crystals within the element. In addition to the two views (a set of nodes and a set of finite elements) described above, there is a third data representation in which a crystal in the aggregate is mapped onto the processing nodes of the model architecture. The reader is referred to [Beaudoin et al., 1992; Mathur et al., 1992b; Dawson et al., 1992] for a detailed description of these simulations. Briefly, such simulations can be viewed as having three distinct sections. First, the macroscopic deformation field is computed at fixed state. In the context of polycrystalline models, the state of the material is characterized by the orientation of the crystals comprising the material and the strength of the slip systems of each crystal. The evaluation of the macroscopic deformation field involves data interaction between all three data representations. The macroscopic deformation field, at fixed state, is computed from the solution of the discretized form of the macroscopic balance of linear momentum

    ∇·σ = 0    (2)

and the conservation of mass

    ∇·u = 0    (3)

where σ is the macroscopic Cauchy stress tensor and u is the velocity. Note that the expression for the conservation of mass assumes that the elastic effects are small compared to the plastic deformation. The nonlinear viscoplastic material behavior results in a nonlinear sparse system of equations, which is solved by a method of successive approximations. Iteration i of the
nonlinear scheme results in a sparse linear system of the form [Zienkiewicz]

    ([K_D] + (1/λ)[G][M]⁻¹[G]ᵀ){U} = {F}    (4)

(λ being the penalty parameter) where [K_D] is the material stiffness matrix based on the fixed state and the velocity field at the i-th iterate, [G] is the matrix that enforces the conservation of mass, and [M] is the pressure mass matrix of the consistent penalty method [Engelman et al., 1982]. The sparse linear system described by Equation (4) is solved by a conjugate gradient method with diagonal scaling. The nonlinear iterations are assumed to have converged when

    ‖{U^(i+1)} − {U^i}‖₂ < ε ‖{U^i}‖₂    (5)

where ε is a prescribed tolerance.
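The relative-change convergence test for the successive-approximation iterations can be written directly. A minimal NumPy sketch; the function name, tolerance, and test vectors are illustrative assumptions.

```python
import numpy as np

def converged(u_new, u_old, tol=1e-4):
    """Convergence test for the nonlinear iterations: the change in
    the velocity field between successive iterates, measured in the
    2-norm, must fall below a prescribed tolerance relative to the
    norm of the current iterate."""
    return np.linalg.norm(u_new - u_old) < tol * np.linalg.norm(u_old)

u_old = np.array([1.0, 2.0, 3.0])
assert not converged(u_old + 0.1, u_old)      # change too large
assert converged(u_old + 1e-6, u_old)         # change below tolerance
```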
The evaluation of the material stiffness matrix [K_D] involves interactions between the three data representations. First, the nodal coordinates and the nodal velocities must be "gathered" from the set of nodal points to the set of finite elements. The macroscopic rate of deformation and the Jacobian at a quadrature point within a finite element are computed in the set of finite elements. Next, the macroscopic rate of deformation at the quadrature point is broadcast to all the crystals underlying the quadrature point. This is a segmented "all-to-all" broadcast [Johnsson and Ho, 1989; Brunet and Johnsson, 1991; Mathur and Johnsson, 1992]. It is assumed that all the crystals within the aggregate experience the macroscopic deformation [Taylor, 1938]. After the broadcast stage, data is readily available in the data set representing the crystals. A Newton-type method with line search is necessary to solve for the microscopic stress deviator that is required to achieve the prescribed rate of deformation. The quasi-Newton scheme is required for all crystals in all aggregates. This nonlinear solution for each crystal is independent of all other crystals. Once the microscopic stress deviator is known, the microscopic crystal stiffness matrices are computed, and the macroscopic stiffness matrix at the quadrature point is computed by averaging the crystal stiffness matrices. This averaging procedure requires a segmented "all-to-all" reduction [Mathur and Johnsson, 1992].
The performance of the conjugate gradient solver is dominated by the sparse matrix-vector multiply. The implementation details of the conjugate gradient method are described in Golub and van Loan [1989]. After the evaluation of the elemental stiffness matrices, the sparse matrix-vector multiply can be expressed as an "element-by-element" operation. This avoids the explicit assembly of the global stiffness matrix. Instead, the assembly is performed implicitly during the sparse matrix-vector multiply operation. First, the nodal quantities are "gathered" from the set of nodal points. After the gather operation, all relevant data is in place for a local matrix-vector multiplication of the unassembled stiffness matrices and the elemental vectors. This operation takes place in the set of finite elements. The matrix-vector multiplication produces the unassembled product vector, which is then "scattered" back to the set of nodal points. Since many elements share the same nodal points, data collisions occur at the destination. These colliding values must be added to ensure that the contributions from all the elements sharing a nodal point are accounted for. The boundary conditions are applied in the set of nodal points.
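The element-by-element matrix-vector multiply just described can be sketched in NumPy; this is a shared-memory analogue, and the two-element mesh and element matrices are made up. The result equals the product with the (never assembled) global matrix.

```python
import numpy as np

def ebe_matvec(K_e, conn, x):
    """Element-by-element sparse matrix-vector multiply: gather the
    nodal vector, multiply by the unassembled element stiffness
    matrices locally, and scatter-add the element product vectors.
    The global stiffness matrix is never assembled explicitly."""
    x_e = x[conn]                                   # gather
    y_e = np.einsum('eij,ej->ei', K_e, x_e)         # local matvecs
    y = np.zeros_like(x)
    np.add.at(y, conn, y_e)                         # scatter with add
    return y

# Two 2-node elements sharing node 1; identical element matrices.
conn = np.array([[0, 1], [1, 2]])
K_e = np.array([[[2.0, -1.0], [-1.0, 2.0]]] * 2)
y = ebe_matvec(K_e, conn, np.array([1.0, 1.0, 1.0]))
```

Inside a conjugate gradient or GMRES iteration, this routine plays the role of the operator application, so only the gather and scatter steps involve irregular data motion.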
Once a converged deformation field corresponding to a fixed mechanical state has been computed, the mechanical state of the material and the geometry of the workpiece must be updated. An explicit Euler integration scheme is used here. The update of the microstructural quantities (the crystal orientations and the strength of the slip systems) must be done in the data representation of the crystals. The geometry of the workpiece is updated in the nodal data representation. Again, data interaction between all three data sets is necessary. The reader is referred to Mathur and Dawson [1989] for a more detailed description of the update procedure. The communication primitives that are necessary for an efficient implementation of the update are identical to the communication primitives that are necessary for computing the macroscopic deformation field at a fixed state.
Figure (2) shows the deformed mesh after the first time increment and the deformed mesh after significant deformation during a hydroforming simulation of an initially textured aluminum sheet [Dawson et al., 1992]. The initial crystal orientation underlying each material point corresponds to a cube texture. The sheet properties were assumed to be initially uniform. As a result, all elements are initialized with identical initial textures. The color contours in Figure (2) represent the magnitude of the radial velocity in the plane of the flat portion of the sheet. The deviation from an axisymmetric solution due to the initial anisotropy is readily apparent. Material is moving inwards faster at 45° relative to 90°. Tables (3) and (4) report the time for the computation intensive portions of the micromechanical finite element method, for an aggregate of 256 crystals. All times correspond to an 8K CM-200 system. As with the first application, randomizing the mapping of the finite elements and the nodal points to the processing nodes of the model architecture reduces the time spent in data motion significantly.

Data parallel computational fluid dynamics
The third application used in identifying an appropriate set of high level communication primitives for unstructured finite element simulations involves a computational fluid dynamics application. Three-dimensional compressible flows on unstructured meshes having almost one million degrees of freedom have been simulated using an implicit iterative solution method [Johan et al., 1990, 1992]. The solution strategy involves an implicit time marching algorithm to converge to a steady state solution. A preconditioned, matrix-free, generalized minimal residual (GMRES) algorithm is used to solve the linear system. The reader is referred to Johan et al. [1991] for a detailed description of the solution strategy. The data parallel implementation is described in detail in Johan et al. [1992].
This data parallel implementation assigns two views to the data set. In the first view, the unassembled finite elements are mapped onto the processing nodes of the model architecture. The second view maps the assembled nodal points to the processing nodes. In this application, the bulk of the simulation time is spent in the evaluation of the sparse matrix-vector multiply used in the GMRES algorithm. First, the nodal values are "gathered" from the set of nodal points to the set of finite elements. All processing nodes perform local matrix-vector operations concurrently to evaluate the unassembled elemental product vectors. These vectors are then assembled into a global vector by a "scatter" operation. Thus, the communication primitives that are needed in the computation intensive part of this application are identical to the primitives needed for the other two applications described here.
Figures (3) and (4) show the surface mesh and the surface pressure distribution for a generic Falcon Jet at an effective angle of attack of six degrees. The complete finite element mesh consists of 150,752 nodal points and 878,544 tetrahedral elements, yielding 753,620 equations. Table (5) shows the overall performance of the application for a one-point integration rule and 250 time steps at a CFL number of five.

    Gather    Computation    Scatter    Total Time
    1331      2055           1490       4860

Table 5: Overall time (s) for simulating Euler flow over a complete Falcon Jet during approach. The timings reported here were measured on a fully configured Connection Machine system CM-200 with 2048 processing nodes. The timings correspond to 250 time steps at a CFL of five with a one-point integration rule.

Similar computations with a four-point integration rule indicate that the fraction of time spent in data motion
is significantly less. The time spent in communication is proportional to the number of degrees of freedom per unassembled finite element. The computation time increases linearly with the number of integration points used for numerical quadrature.

Unstructured communication primitives
The three applications described above illustrate the use of the following communication primitives:

    Global gather
    Global scatter
    All-to-all broadcast
    All-to-all reduce

The first set of communication primitives, needed by almost all unstructured (arbitrary sparse) applications, are the global gather and scatter functions. During the global gather operation, every destination array element accumulates data values from a source array element based on a set of pointer (index) arrays. Since many destination array elements may want to fetch the same source data value, this is a "one-to-many" mapping. For the finite element simulations described above, the gather operation can be expressed as a vector-valued Fortran-90 statement [Metcalf and Reid, 1990] of the form

    D = S(P)    (6)

where D is the destination array, P is the pointer array identifying the source element that needs to be accumulated, and S is the source array. The extent of array P must be identical to the extent of the destination array D. The Fortran-90 expression in (6) is equivalent to an element-wise gather operation. A similar expression can be written for a "vector-gather" operation, which is very useful for applications involving multiple degrees of freedom per nodal point. The scatter operation is the reverse of the gather operation. Many source array elements send their data to the destination array elements. This is a "many-to-one" mapping, which can be formally expressed as

    D(P) = S    (7)

where the pointer array P has the same extent as the source array S.
Expression (7) is not a legal Fortran-90 expression. Data collisions occur at the destination. The data collisions can be handled in a variety of different ways. For finite element simulations, the data collisions at the destination represent the contributions from different finite elements sharing a nodal point (or a nodal degree of freedom). Therefore, the colliding data values must be added. There is no legal way of expressing a scatter operation with addition in the subset of Fortran-90 implemented on the Connection Machine systems. This functionality is achieved by calling a utility library function. The reader is referred to the CM Fortran reference manual [1991] for details.
In the context of the finite element method, the pointer array P represents the mesh connectivity. The three applications described above make use of the same mesh connectivity for the entire duration of the simulation. Note that in the second application the geometry of the workpiece changes significantly; however, the mesh connectivity remains fixed. For such situations, it is highly desirable to preprocess the mesh connectivity and reuse the preprocessed information repeatedly during the course of the simulation. This preprocessing can be performed in several ways. Additional memory is required to store the information that is generated during the preprocessing step. The amount of memory needed is a function of the number of processing nodes on the architecture and the mesh connectivity. The implementations used for the applications reported here transform the one-to-many mapping of the gather operation and the many-to-one mapping of the scatter operation into one-to-one mappings. The data motion related to the one-to-one mapping is achieved by the use of the general purpose communications hardware on the model architecture. The gather operation requires a local copy operation before the permutation operation, while the scatter operation requires an addition step after the permutation. The one-to-one mapping is constructed in such a way that the addition step may be performed locally on each processing node. The algorithms used for the gather and scatter operations are described in detail in Mathur [1991]. The preprocessed gather and scatter functions perform significantly better than the simple Fortran-90 expression of Equation (6). These gather and scatter functions are a part of the Scientific Software Library on the model architecture. The reader is referred to the CMSSL reference manual [1992] for more detail.
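The idea of turning the many-to-one scatter into a permutation followed by local additions can be sketched in NumPy. This is a shared-memory analogue under stated assumptions, not the distributed CMSSL algorithm of Mathur [1991]: here the "permutation" is a stable sort of the source slots by destination, paid once, after which the addition step is a contiguous segmented sum that can be reused at every time increment.

```python
import numpy as np

def preprocess_scatter(P):
    """Precompute a one-to-one routing for a scatter-with-add whose
    pointer array P is fixed: sort the source slots by destination
    (a permutation of the data), so that the later addition step
    reduces contiguous segments.  The sort cost is amortized over
    the many time steps that reuse the same mesh connectivity."""
    order = np.argsort(P, kind='stable')            # the permutation
    dests, seg_starts = np.unique(P[order], return_index=True)
    return order, dests, seg_starts

def scatter_add(S, n_dest, order, dests, seg_starts):
    """Apply the precomputed routing: permute once, then add within
    contiguous segments (the 'local' addition step)."""
    D = np.zeros(n_dest)
    D[dests] = np.add.reduceat(S[order], seg_starts)
    return D

P = np.array([2, 0, 2, 1, 0])                       # many-to-one pointers
S = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
D = scatter_add(S, 3, *preprocess_scatter(P))
```

Destination 0 receives S[1] + S[4], destination 1 receives S[3], and destination 2 receives S[0] + S[2], matching a direct scatter with addition.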
The all-to-all broadcast and reduce functions that are necessary for the second application are part of a general class of communication primitives that are useful for several dense linear algebra algorithms [Mathur and Johnsson, 1992] and for applications based on regular grids. A segmented spread is expressed straightforwardly in Fortran-90 as

    D = SPREAD(d, DIM=1, NCOPIES=N_g)    (8)

where D is a two-dimensional matrix of shape (N_g × N_e) and d is a vector of length N_e. In the context of the micromechanical model, N_g represents the number of crystals underlying each finite element and N_e is the number of finite elements in the mesh representing the continuum length-scale discretization. The parameters N_e and N_g represent the resolution of the macroscopic and the microscopic scale discretizations, respectively. The vector d represents a component of the macroscopic rate of deformation and the matrix D is the equivalent component of the microscopic rate of deformation of a crystal in the aggregate. The Fortran-90 expression of Equation (8) may, in fact, define an all-to-all broadcast. To illustrate this fact, consider the allocation of arrays on the model architecture in some detail. The Connection Machine Run Time System allocates array elements to the processing nodes based on the shape of the array. The run time system attempts to configure the processing nodes as an array with the same number of axes as the data array. The extent of each processing node axis is chosen such that the number of data array elements along each array axis assigned to a processing node is approximately equal. Thus, for a one-dimensional data array the processing nodes are also configured as a one-dimensional array, with the data elements divided as evenly as possible among the processing nodes. For a two-dimensional data array, the processing nodes are configured as a two-dimensional array. The subarray assigned to a processing node is approximately square. Figure (5) shows a simple example which illustrates that the segmented spread of Equation (8)
indeed implies an all{to{all broadcast in a column major ordering of the processing nodes.
A spread of a one-dimensional vector d into a two-dimensional matrix D, as shown in Figure (5), requires that each instance of the vector d, instead of being distributed among all processing nodes (labeled 0 through 7), is now distributed among the processing nodes labeled 0, 2, 4, and 6, and also among the processing nodes labeled 1, 3, 5, and 7. After the spread operation, the two segments d_0 and d_1 of vector d, initially allocated to the processing nodes labeled 0 and 1 respectively, must now reside in both processing nodes 0 and 1. Similarly, the two segments d_2 and d_3 of vector d, initially allocated to the processing nodes labeled 2 and 3 respectively, must now both reside in processing nodes 2 and 3. Thus, for the column major ordering of the processing nodes, the spread operation implies an all-to-all broadcast within the columns of the two-dimensional processing node configuration. For a row major ordering of the processing nodes, a transposition corresponding to a conversion between the column and the row major ordering must precede the all-to-all broadcast.
A segmented reduce function such as

    s = SUM(S, DIM=1)    (9)

where, as before, S is a two-dimensional matrix of shape (N_g × N_e) and s is a vector of length N_e, implies an all-to-all reduce operation. In the micromechanical modeling effort, S represents a component of the microscopic Cauchy stress deviator and s is the corresponding component of the macroscopic Cauchy stress deviator (which is formed by averaging the microscopic stress tensor). Reduction along the columns in Figure (5), performed such that the result from the reduction on the first half of the columns assigned to a column pair of processors is assigned to the top processing node, and the result of the reduction on the second half of the columns is assigned to the bottom processing node, defines an all-to-all reduce within the columns of the processing node array. The all-to-all reduction is easily generalized to more than two processing nodes per column.

Table 6: Comparison of an optimized all-to-all reduce with the standard reduce and send instruction. Elements per node correspond to 64-bit precision. N_r is the number of nodes along the axis of reduction. The timings reported here were measured on an 8K CM-200.
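The data movement expressed by Equations (8) and (9) has a direct NumPy analogue, sketched below; the sizes and values are made up, and the per-crystal perturbation merely stands in for the independent per-crystal solve. On the model architecture these array expressions are realized as a segmented all-to-all broadcast and reduce; in shared memory they are just array operations.

```python
import numpy as np

N_g, N_e = 4, 3                      # crystals per aggregate, elements
d = np.array([10.0, 20.0, 30.0])     # macroscopic quantity per element

# Segmented spread, Fortran-90: D = SPREAD(d, DIM=1, NCOPIES=N_g).
# Every crystal underlying an element receives the element's value.
D = np.broadcast_to(d, (N_g, N_e)).copy()

# Each crystal perturbs its copy (a stand-in for the per-crystal
# stress solve; the perturbation values are arbitrary).
D = D + np.arange(N_g)[:, None]

# Segmented reduce, Fortran-90: s = SUM(S, DIM=1); here an average,
# as used when forming the macroscopic quantity from the crystals.
s = D.mean(axis=0)
```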
As with the gather function, although the all-to-all broadcast and reduce functions can be expressed in Fortran-90, it is very beneficial to identify them as communication primitives. These primitives can then be optimized for the network interconnecting the processing nodes of the architecture. Optimal algorithms for the all-to-all broadcast and the all-to-all reduce for the network of the model architecture are described in [Johnsson and Ho, 1989; Brunet et al., 1990; Brunet and Johnsson, 1991; Mathur and Johnsson, 1992]. Table (6) compares an optimized all-to-all reduce operation with the standard reduce and send instruction on the Connection Machine system, CM-200. Note that the standard reduce and send instruction itself is also optimized.

Summary
This article introduces a set of communication primitives that are very useful for unstructured finite element simulations. The intended use of these primitives is illustrated by three applications that have been implemented on the Connection Machine systems, CM-2 and CM-200. The choice of the applications is such that both fully explicit and fully implicit formulations have been investigated.
The communication primitives are a part of the scientific software library on the model architecture, and have been optimized for it. With the use of the optimized communication primitives, the three applications presented here run very efficiently on the model architecture. These communication functions support the notion of scalable computing: the physical configuration of the computing platform is completely transparent to the application. The intended goal of portability between different architectures supporting compilers with a global address space is achieved. This has been demonstrated by simply compiling these codes for the next generation Connection Machine system, CM-5. The initial performance numbers on the CM-5 are very promising. Simulations with significantly finer resolutions are now possible in a reasonable time.

Figure 1: Three dimensional Charpy V-notch simulation [Mathur et al., 1992]. The middle image shows the x component of the displacement field after the first 100 s. The image on the right shows the magnitude of the displacement field after the first 100 s. The finite element mesh comprises 3,200 twenty-node bricks with 15,083 nodal points and 45,249 nodal degrees of freedom.

Figure 2: Hydroform simulation of an initially textured aluminum sheet [Dawson et al., 1992]. The magnitude of the radial velocity in the plane of the flat portion of the sheet is shown. The deformed mesh corresponds to a time increment after significant deformation has occurred. The finite element mesh comprises 864 eight-node bricks with 1,407 nodal points and 5,221 nodal degrees of freedom. Underlying each finite element is an aggregate of 256 crystals.

Figure 3: Surface mesh for Euler flow simulation on a generic Falcon jet [Johan et al., 1992]. The complete mesh consists of 878,544 tetrahedra with 150,752 nodal points and 753,620 nodal degrees of freedom.

Figure 4: Surface pressure distribution for a generic Falcon jet at an effective angle of attack of six degrees.

Figure 5: One- and two-dimensional column major ordered configurations of eight processing nodes.

Table 3: Computation time (milliseconds) and computation rate for the sparse matrix-vector multiply in the conjugate gradient solver. All timings and floating point rates correspond to an 8K CM-200. The finite element mesh had 864 eight-node brick elements and 1,407 nodal points. The total time per conjugate gradient iteration was 14.4 milliseconds without random mapping and 8.5 milliseconds with random mapping.

Table 6: Execution times in milliseconds and speedups for the two methods of performing an all-to-all reduction on the Connection Machine system CM-200.