Scientific Software Libraries for Scalable Architectures

Massively parallel processors introduce new demands on software systems with respect to performance, scalability, robustness and portability. The increased complexity of the memory systems and the increased range of problem sizes for which a given piece of software is used, poses se-rious challenges to software developers. The Connection Machine Scienti(cid:12)c Software Library, CMSSL, uses several novel techniques to meet these challenges. The CMSSL contains routines for managing the data distribution and provides data distribution independent functionality. High performance is achieved through careful scheduling of arithmetic operations and data motion, and through the automatic selection of algorithms at run{time. We discuss some of the techniques used, and provide evidence that CMSSL has reached the goals of performance and scalability for an important set of applications.


Introduction
The main reason for large scale parallelism is performance. In order for massively parallel architectures to deliver on the promise of extreme performance compared to conventional supercomputer architectures, an e ciency in resource use close to that of conventional supercomputers is necessary. Achieving high e ciency in using a computing system is mostly a question of ecient use of its memory system. This is the premise on which the Connection Machine Scienti c Software Library 28], the CMSSL, is based. Another premise for the CMSSL is the notion of scalability. Software systems must be designed to operate on systems and data sets that may vary in size by as much as four orders of magnitude. This level of scalability with respect to computing system size must be accomplished transparently to the user, i.e., the same program must execute not only correctly but also eciently without change over this range in processing capacity and corresponding problem sizes. Moreover, programs should not have to be recompiled for various system sizes. This requirement will be even more important in the future, since over time the assignment of processing nodes to tasks is expected to become much more dynamic than today. Robustness of software both with respect to performance and numerical properties are becoming increasingly important. The memory system in each node is becoming increasingly complex in order to match the speed of the memory system with that of an individual processor. The distributed nature of the total memory compounds the complexity of the memory system. It is imperative that software systems deliver a large fraction of the available performance over a wide range of problem sizes, transparently to the user. For instance, small changes in array sizes should not impact performance in a signi cant way. Robustness with respect to performance in this sense will increase the demands on the software systems, in particular on the run{time parts of the systems. Robustness with respect to numerical properties is also becoming increasingly important. The same software may be used for problem sizes over a very wide range. Condition numbers for the largest problems are expected to be signi cantly worse than for small problems. As a minimum, condition estimators must be provided to allow users to assess the numerical quality of the results. It will also be increasingly necessary to furnish software for ill{conditioned problems, and whenever possible, automatically choose an appropriate numerical method. Some parallel methods do not have as good a numerical behavior as sequential methods, and this disadvantage is often increasing with the degree of parallelism. The trade{o between performance and numerical stability and accuracy is very complex. Much research is needed before the choice of algorithm with respect to numerical properties and performance can be automated. Portability of codes is clearly highly desirable in order to amortize the software investment over as large a usage as possible. Portability is also critical in a rapid adoption of new technology, thus allowing for early bene ts from the increased memory sizes, increased performance, or decreased cost/performance o ered by new technology. But, not all software is portable when performance is taken into account. New architectures, like MPPs, require new software technology that often lags the hardware technology by several years. Thus, it is important to exploit the architecture of software systems such that architecture dependent, nonportable software is limited to as few functions as possible, while maintaining portability of the vast amount of application software. One of the purposes of software libraries is to enable portability of application codes without loss of performance. The Connection Machine Scienti c Software Library today has about 250 user callable functions covering a wide range of frequent operations in scienti c and engineering computation. In this paper we illustrate how the goals of high performance and scalability have been achieved. The outline of the paper is as follows. In the next few sections we discuss memory systems for scalable architectures and their impact on the sequence to storage association used in mapping arrays to the memory system. We then discuss data representations for dense and sparse arrays. The memory system and the data representation de nes the foundation for the CMSSL. We then present the design goals for the CMSSL and how these goals have been approached and achieved. The multiple{instance capability of the CMSSL is an extension of the functionality of conventional libraries in the spirit of array operations, and critical to the performance in computations on both distributed and local data sets. The multiple{instance feature is discussed in Section 6. Scalability and robustness with respect to performance both depend heavily on the ability to automatically select appropriate schedules for arithmetic/logic operations and data motion, and proper algorithms. These issues are discussed by speci c examples. A summary is given in Section 8.

Architectural model
High performance computing has depended on elaborate memory systems since the early days of computing. The Atlas 23] introduced virtual memory as a means of making the main relatively slow memory appear as fast as a small memory capable of delivering data to the processor at its clock speed. Since the emergence of electronic computers processors have as a rule been faster than memories, regardless of the technology being used. Today, most computers, conventional supercomputers excepted, use MOS technology for both memories and processors. But, the properties of the MOS technology is such that the speed of processors is doubling about every 18 months, while the speed of memories is increasing at a steady rate of about 7%/yr. Since the speed of individual memory units, today primarily built out of MOS memory chips, is very limited, high performance systems require a large number of memory banks (units), even when locality of reference can be exploited. High end systems have thousands to tens of thousands of memory banks. The aggregate memory bandwidth of such systems far exceeds the bandwidth of a bus. A network of some form is used to interconnect memory modules. The nodes in the network are typically of a low degree, and for most networks independent of the size of the network. A large variety of network topologies can be constructed out of nodes with a limited xed degree. Massively parallel architectures have employed two{ and three{dimensional mesh topologies, butter y networks, binary cubes, complete binary trees, and fat{tree topologies. The speed of the memory chips present the most severe restriction with respect to performance. The second weakest technological component is the communication system. Constructing a communication system with the capacity of a full crossbar with a bandwidth equal to the aggregate bandwidth of the memory system is not feasible for systems of extreme performance, and would represent a considerable expense even for systems where it may be technically feasible. Hence, with a constraint of a network with a lower bandwidth than that of the full memory system, in MPPs, processors are placed close to the memory modules such that whenever locality of reference can be exploited the potential negative impact upon performance of the limited network capacity is alleviated. This placement of processors and the limited network capacity has a fundamental impact upon the preferred sequence{to{storage association to be used for programming languages. This di erence in preferred sequence{to{storage association is a major source of ine ciency in porting codes in conventional languages to MPPs. The generic architectural model for MPPs used throughout this paper is shown in Figure 1. The local memory system is shown only schematically. As a minimum, the local memory hierarchy consists of a processor register le and DRAM, but quite often there is at least one level of cache, and sometimes two levels. In systems without a cache, such as the Connection Machine systems, the mode in which the DRAM is operated is important. In addition to the local memory hierarchy, the access time to the memories of other nodes (a processor with associated memory modules and network interface hardware) often is nonuniform. The source of the nonuniformity in access time may be directly related to the distance in the network, which is the case for packet switched communication systems. In circuit switched and wormhole routing systems, the distance in itself is often insigni cant with respect to access time. However, the longer the routing distance the more likely it is that contention in the network will arise, and hence add to the remote access time.

Dimensionality of the address space
The one{dimensional address space used for the conventional languages is not suitable for most applications on MPPs. A linearized address space may also result in poor performance for multidimensional arrays or nonunit stride accesses in banked and interleaved memory systems. So{called bank con icts are well known performance limitations caused by the combination of data allocation strategies and access strides. For MPPs, for computations with a uniform use of the address space, a multi{dimensional address with as many dimensions as are being accessed uniformly is ideal. We discuss this claim through a number of simple, but important, examples. We rst discuss computations dominated by operations on a single array, then consider operations involving two or three arrays.

The Fast Fourier Transform
For the fast Fourier Transform, the FFT, and many other hierarchical or divide{and{conquer methods, an address space with log 2 N dimensions may be ideal even for a one{dimensional array of extent N. All data references are to data within unit distance in such an address space.
This view is particularly useful in the mapping of the arrays to networks of memory units, since it properly models the communication needs. The FFT computations are uniform across the index space and the load{balance is independent of whether cyclic or consecutive allocation is used. However, the cyclic data allocation yields lower communication needs than the consecutive allocation by up to a factor of two for unordered transforms 13,14]. The reason is that the computations of the FFT always proceed from the high to the low order bit in the index space. With the consecutive allocation the high order bits are associated with processor addresses and must be mapped to local memory addresses before local butter y computations can be performed. Conserving memory in this remapping means that another remapping is required when the computations are to be performed on the dimensions that were moved from local memory to processor addresses in order to accommodate the move of the leading dimensions into local memory. In the cyclic allocation the leading dimensions are mapped to local memory from the start.

Direct methods for solution of systems of equations
LU and QR factorization only involves a single array, while the solution of triangular systems involves two or three arrays. Two important distinguishing features of dense matrix factorization are that all data references are \global", and that the computations are performed on a diminishing set of indices. The \global" references consists of pivot selection and the broadcast of the pivot row and column. If a block algorithm is used, sets of rows and columns are treated together, but it does not fundamentally change the reference pattern for the algorithm. We will rst discuss the preferred dimensionality and shape of the address space, then load{ balancing. Since the broadcast operations are performed both along rows and columns a one{ dimensional partitioning makes one of these operations local, while for the other a complete pivot row or column must be broadcast. With a consecutive data allocation and a p N p N nodal array for the factorization of a P P matrix, the broadcast operations require the communication of 2 P p N elements, instead of P elements for a one{dimensional partitioning. This argument is too simplistic in that the communication among the two axes is not the same. But, the conclusion is correct: a two{dimensional address space is desirable, and the shape of the local subgrid shall be close to square. Partial pivoting requires additional communication along one axis. Second, since not all indices are involved in all steps, the number of elements per node participating in the broadcast operation is not necessarily P p N and P, respectively. It depends upon what techniques are used for load{balancing as discussed in 19]. Note that for out{of{core factorization algorithms using panel techniques with entire columns in primary storage, the shape of the panel to be factored may be extremely rectangular. Hence, the shape of the processing array shall also be extremely rectangular to yield almost square subgrids.
A cyclic allocation guarantees good load{balance for computations such as LU and QR factorization, and triangular system solution. But, a good load{balance can be achieved also for consecutive mapping by adjusting the elimination order accordingly 19]. To allow for the use of level{2 LBLAS (Local BLAS 16]), blocking of rows and columns on each node is used. In LU factorization a blocking of the operations on b rows and columns means that b rows are eliminated at a time from all the other rows. The resulting block{cyclic elimination order yields the desired load{balance as well as an opportunity to conserve local memory bandwidth. A block{cyclic elimination order was rst recommended in 7] for load{balanced solution of banded systems. The result of the factorization is not two block triangular matrices, but block{cyclic triangles. A block{cyclic triangle can be permuted to a block triangular matrix. However, it is not necessary to carry out this permutation for the solution of the block{cyclic triangular system of equations. Indeed, it is desirable to use the block{cyclic triangle for the forward and back substitutions, since the substitution process is load{balanced for the block{cyclic triangles. Using block triangular matrices stored in a square data array (A) allocated to nodes with a consecutive data allocation scheme would result in poor load{balance. For details as well as modi cations necessary for rectangular nodal arrays, see 19]. Note further that for triangular solvers the communication is again of the global nature, and the conclusions about the shape of the address space still applies 8, 19].

The Alternating Direction Implicit Method
In the Alternating Direction Implicit (ADI) Methods a multi{dimensional operator is factored into one{dimensional operators that are applied in alternating order. In its most common use, tridiagonal systems of equations are solved along each coordinate direction of a grid. Whether substructured elimination or straight elimination is used, the communication requirements along each coordinate axis is proportional to the area of the surface having the normal aligned with the axis of solution. Hence, regardless of the extent of the axes in the di erent dimensions, it is again desirable with respect to minimizing nonlocal references to minimize the surface area of the subgrids assigned to each node. For a more detailed discussion of ADI on parallel computers and cyclic reduction based methods as well as Gaussian elimination based methods for the solution of tridiagonal systems of equations see 12, 17].

Stencil computations
For stencil computations on three{dimensional arrays with a stencil symmetric with respect to the axis, the well known minimum \surface{to{volume" rule dictates that a three{dimensional address space shall be used for optimum locality of reference. For example, for a 512 512 512 grid distributed evenly across 512 nodes, each node holds 256k grid points. With a 7{point, centered, symmetric stencil in three dimensions, the number of nonlocal grid points that must be referenced is 6 64 64 for cubic subgrids of shape 64 64 64. For the standard linearized array mapping used by Fortran 77 or C, the subgrids will be of shape 1 512 512. References along two of the axis are entirely local, but the references along the third axis require access to 2 512 512 nonlocal grid points. Thus, the linearized address space requires a factor of 1 3 8 8 21 more nonlocal references for the stencil computations. Note that if the data array is of shape 256 1024 512, it is still the case that the ideal local subgrid is of shape 64 64 64. But, the ideal shape of processing nodes have changed from an 8 8 8 array to a 4 16 8 array. This example with simple stencil computations on a three{dimensional array has shown that a multi{dimensional address space is required in order to maximize the locality of reference. Moreover, the example also shows that the shape of the address space, i.e., how the indices for each axis is split between a physical or processor address and a local memory address is very important. We have implicitly assumed a consecutive or block partitioning in the discussion above. A cyclic partitioning would in fact maximize the number of nonlocal references.

Matrix multiplication
We restrict the discussion to the computation C C + A B. In order to minimize the communication, a good strategy is to keep the matrix with the largest number of elements stationary 9, 21]. The other two operands are moved as required for the required indices of the three operands to be present in a given node at the same time. With this underlying strategy, the ideal shape of the address space is such that the stationary matrix has square submatrices in each node 9,21]. This result can be derived from that fact that the required communication is all{to{all broadcast and/or reduction within rows or columns 20]. The ideal shape of the address space has been veri ed on the Connection Machine system CM{ 200 and is illustrated in Figure 2. It con rms that the optimal nodal array shape is square for square matrices. For the matrix shapes used in this experiment, a one{dimensional nodal array aligned with either the row or column axis, requires about a factor of six higher execution time than the ideal two{dimensional nodal array shape. With the proper nodal array shape a superlinear speedup is achieved for matrix multiplication, since the communication requirements increase in proportion to the matrix size, while the computational requirements grow in proportion to the size to the 3 2 power. The superlinear speedup achieved on the CM{5 is shown in Table 1.

Data representation
In the previous section we showed that linearized address spaces as used by the conventional languages is not compatible with the notion of locality of reference for MPP memory systems. We showed that multi{dimensional address spaces are required, and that the optimal shape of the address space can be derived from the data reference pattern. In this section we focus on the data Number Matrix M op/s MFlop/s Size for half of Nodes Size (P) Overall per node peak perf. (P 1=2 )  1  1088  62  62  128  32  5952  2266  71  1088  64  8704  4608  72  1408  128  12416  9933  78  1664  256  17664 19661  77  2816  512  24576 42395  83  3328   Table 1: Performance of matrix multiplication in M op/s on the CM{5.
representation and how, based on the selected representation, the desired data allocation can be realized. With the exception of the FFT, in all our examples a consecutive data allocation is either preferred, or the choice between cyclic and consecutive allocation immaterial with respect to performance. Thus, for simplicity, we will assume a consecutive data allocation in this section.

Dense arrays
For the allocation of dense arrays we have seen that subgrids with equal axes extents are either optimal or close to optimal for several common and important computations. Hence, as a default without sophisticated data reference analysis, an allocation creating subgrids with axes extents of as equal a length as possible is sensible and feasible.

Grid Sparse Matrices
For sparse arrays the data representation is less obvious, even for sparse matrices originating from regular grids. Such matrices typically consists of a number of nonzero diagonals. For instance, consider the case with a 7{point centered di erence stencil in three dimensions. The stencil computation can be represented as a matrix{vector multiplication, y Ax, where x and y are grid point values and the matrix A represents the stencils at all the grid points. With an N 1 N 2 N 3 grid, with stride one along the axis of extent N 3 and stride N 3 along the axis of length N 2 , the matrix is of shape N 1 N 2 N 3 N 1 N 2 N 3 with a nonzero main diagonal, a nonzero diagonal immediately above and below the main diagonal, two nonzero diagonals at distance N 3 above and below the main diagonal, and two nonzero diagonals at distance N 2 N 3 above and below the main diagonal. A common representation in Fortran 77 is either to use a set of one{dimensional arrays, one for each nonzero diagonal or a single one{dimensional array with the nonzero diagonals appended to each other. However, either of these representations are not suitable for MPP memory systems, since preservation of locality of reference for matrix{vector multiplication is likely to be lost. A natural representation for grid{sparse matrices and grid{point vectors is to tie the representation directly to the grid rather than the matrix representation of the grid. Grid{point vectors are represented as multi{dimensional arrays with one axis for each axis of the grid, plus an axis for the grid{point vector. The grid{axes extents are the same as the lengths of the corresponding grid axes. A grid{sparse matrix is represented in an analogous way. The matrix represents interaction between variables in di erent grid points. As an example of the grid based representation of a grid{sparse matrix we consider the common 7{point stencil in three dimensions.
Each of the stencil coe cients are represented as three{dimensional arrays A; ;G, of nodal values of shape (LX,LY,LZ).  Table 2: Partitioning of a tetrahedral mesh between concentric spheres.

Representation and allocation of arbitrary sparse matrices
The representation and allocation of arbitrary sparse matrices is a very di cult topic subject to research.  24]. The recursive spectral bisection technique has been used successfully by Simon 26] for partitioning of nite volume and nite element meshes. A parallel implementation of this technique has been made by Johan 4]. The spectral partitioning technique is based on the eigenvector corresponding to the smallest nonzero eigenvalue of the Laplacian matrix associated with the graph to be partitioned. The Laplacian matrix is constructed such that the smallest eigenvalue is zero and its corresponding eigenvector consists of all ones. The eigenvector associated with the smallest nonzero eigenvalue is called the Fiedler vector 3]. Grid partitioning for nite volume and nite element methods is often based on a dual mesh representing nite volumes or elements and their adjacencies (or some approximation thereof) rather than the graph of nodal points 4]. One advantage of the spectral bisection technique is that it is based on the topology of the graph underlying the sparse matrix. It requires no geometric information. However, it is computationally quite demanding. The results of applying the spectral bisection technique to a model problem is reported in 4] and shown in Table 2. A planar grid of tetrahedra between concentric cylinders. with 20,374 nodes, 107,416 tetrahedra, and 218,807 faces is partitioned using the spectral bisection algorithm. The numbers of shared nodes and edges as a function of the number of partitions are given in the table.
The results of applying the spectral bisection technique on a more realistic nite element application 6] are summarized in Table 3. The spectral bisection technique in this example o ered a reduction in the number of remote references by a factor of 13.2. The speedup for the gather operation was a factor of 13 and of the scatter operation the speedup was a factor of 9.6 (the scatter operation includes the time required for addition which is una ected by the partitioning). Another important aspect of computations with arbitrary sparse matrices is that unlike for dense and grid sparse matrices, address computations cannot be performed by incrementing addresses using xed strides. For arbitrary sparse matrices, indirect addressing is required. It frequently is the most time consuming part on uniprocessors. On a distributed memory machine, the address Operation Standard Spectral  allocation bisection  Partitioning  |  66  Gather  298  23  Scatter  445  46  Computation  180  181  Total time  923  316   Table 3: Gather and scatter times in seconds on a 32{node CM{5 for 50 time steps with a 1{point integration rule for nite element computations on 19,417 nodes and 109,914 elements.
computations do not only involve the computation of local addresses, but routing information as well. In an iterative (explicit) method, the underlying grid may be xed for several or all iterations. For such computations it is important with respect to performance to amortize the cost of computing the addresses over as many iterations as possible. Cacheing this information and reusing it later is important for performance 28].
In an arbitrary sparse matrix, there is no simple way of encoding the global structure. Yet, arbitrary sparse matrices may still have some local structure resulting in a block sparse matrix.
Taking advantage of such a block structure for both economy in data representation, data storage and e ciency of operations, is signi cantly simpli ed by explicitly representing the blocks 28].

Multiple{instance computation
The multiple{instance capability of the CMSSL is consistent with the idea of collective computation inherent in languages with an array syntax. We have already seen how it arises naturally in the ADI method. CMSSL routines are designed to carry out a collection of high level computations on independent sets of operands in a single call, in the same way addition of arrays are carried out through a single statement, or intrinsic functions are applied to each entry in an array in Fortran 90. To accomplish the same task in an F{77 or C library, the call to a library routine would be embedded in a set of nested loops. The multiple{instance capability not only eliminates loop nests, but also allows for parallelization and optimization without a sophisticated interprocedural data dependence analysis. The multiple{instance feature for parallel computation is necessary for the desired degree of optimization, which goes beyond the capabilities of state{of{the{art compiler and run{time systems. We discuss the signi cance of the multiple{instance capability with respect to performance and simplicity of user code by considering the computation of the FFT along one of the axes of a two{dimensional array of shape P Q. We assume a canonical data layout in which the set of processing nodes are con gured as an array of the same rank as the data array and of a shape making the local subarrays approximately square. The nodal array shape is N r N c (= N). With the FFT performed along the P{axis, the computations on the two{dimensional array consist of Q independent FFT computations, each on P data elements. We consider three di erent alternatives for the computation: 1. Maximize the concurrency for each FFT through the use of a canonical data layout for one{dimensional arrays of size P.
2. Compute each FFT without data relocation.
3. Compute all Q FFTs concurrently through multiple{instance routines.
Alternative 1 corresponds to the following code fragments: The concurrency in the computation of the FFT is maximized. The data motion prior to the computation of the FFT on a column is a one{to{all personalized communication 11]. The data redistribution corresponds to a change in data allocation from A(:,:) to TEMP(:) and back to the original allocation, one column at a time. The arithmetic speedup is limited to min(N; P) for transforms on the P{axis.
In alternative 2, the data redistribution is avoided by computing each instance in{place. An obvious disadvantage with this approach is the poor load{balance. The speedup of the arithmetic is proportional to min(N r ; P) for a transform along the P{axis. The third choice is clearly preferable both with respect to communication and arithmetic load{ balance. Note that with a single{instance library routine and canonical layouts, Alternative 1 would be realized. Further, for particular situations, a noncanonical layout will alleviate the communication problem, but in many cases the communication appears somewhere else in the application code. Thus, we claim that our discussion based on canonical layouts re ects the situation in typical computations.

CMSSL
The primary design goal for the Connection Machine Scienti c Software Library, CMSSL, is to provide high level support for most numerical methods, both traditional and recently developed methods, such as hierarchical and multi{scale methods and multipole and other fast so{called N-body algorithms used for large scale scienti c and engineering computations. High level support in this context means functionality that is at a su ciently high level that architectural characteristics are essentially transparent to the user, yet that a high performance can be achieved. Speci c design goals for the CMSSL include consistency with languages with an array syntax, such as Fortran 90, Connection Machine Fortran and C*, functionality that is independent of data distribution, multiple instance capability, support for all four conventional oating{point data types, high performance, scalability across system and problem sizes, robustness, portability, and functionality supporting traditional numerical methods. These goals have had an impact on the architecture of the CMSSL. The rst few goals have also impacted the user interfaces. Version 3.2 of CMSSL has about 250 user callable functions. The library exists on the Connection Machine systems CM{2, CM{200, and CM{5. The CM{5 version consists of about 0.5 million lines of code, and so does the CM{2 and CM{200 version. Table 4 gives a few examples of how the goal of high performance is met by the CMSSL. The table entry for unstructured grid computations actually represent complete applications 27, 5, 22, 1], while the other entries represent library functions by themselves. Table 5 provides excellent data of how the goal of scalability has been met by the CMSSL, as well as the CM{5 architecture over a range of a factor of a thousand in system size. ENSA 1 is an Euler and Navier Stokes nite element code 5], while TeraFrac 2 and MicMac 3 are solid mechanics nite element codes ( 22,1]). To rst order, the performance per node is independent of the system size, thus demonstrating excellent scalability. For some computations, like matrix multiplication, the e ciency actually increases as a function of system size. For the unstructured grid computations the performance decreases by about 5%, an insigni cant amount. With respect to scienti c and engineering computations, the architectural dependence on traditional architectures has mostly been captured in a set of matrix utilities known as the BLAS (Basic Linear Algebra Subprograms) 2, 18]. E cient implementations of this set of routines  1  126  83 62  68  32  126  80 71  61  25  26  30  64  125  74 72  60  25  26  31  128  125  76 78  60  26  24  29  256  125  68 77  59  24  25  32  512  125  68 83  59  24  25  32  1024 58 26 The external architecture of the CMSSL is similar to conventional library systems in that there exists a set of matrix utilities similar to the BLAS, a set of sparse matrix utilities supporting operations on regular and irregular grids, dense and banded direct solvers, and iterative solvers. Fast Fourier transforms are supported for multidimensional transforms. In addition, CMSSL also includes a few statistical routines, and a routine for integration of systems of ordinary di erential equations, and a simplex routine for dense systems. The CMSSL also contains a communications library. Such libraries are unique to distributed memory machines. The CMSSL also contains tools in the form of two special compilers, a stencil compiler and a communications compiler. Novel ideas in the CMSSL can be found at all levels: in the internal architecture, in the algorithms used, in the automatic selection of algorithms at run{time, and in the local operations in each node. The CMSSL is a \global" library. It accepts global, distributed data structures. Internally, the CMSSL consists of a set of library routines executing in each node and a set of communication functions. The communication functions are either part of the Connection Machine Run{Time System, or part of the CMSSL. All communication functions that are part of the CMSSL are directly user accessible, and so are the functions in each node. For the global library, these functions are called directly and are transparent to the user and the distributed nature of the data structures is transparent to the user. The internal structure of the CMSSL supports data distribution independent functionality, automatic algorithm selection for best performance for the BLAS, FFT and a few other functions as well as user speci ed choices for many other functions. The execution is made through calls to local routines and communication functions. It follows from the internal architecture of the CMSSL, that it also has the ability to serve as a nodal library.

Summary
The CMSSL has been designed for performance, scalability, robustness and portability. The architecture with respect to functionality follows the approach in scienti c libraries for sequential architectures. Internally, the CMSSL consists of a nodal library and a set of communication and data distribution functions. CMSSL provides data distribution independent functionality and has logic for automatic algorithm selection based on the data distribution for input and output arrays and a collection of algorithms together with performance models. The performance goals have largely been achieved both for the local and global functions. Particular emphasis has been placed on reducing the problem sizes o ering half of peak performance. Some peak global performance data were given in Section 7. Scalability is excellent. The performance per node has been demonstrated to be largely independent of the number of nodes in the systems over a range of a factor of one thousand ( Table 5, Section 7). Robustness with respect to performance is achieved through the automatic selection of algorithm as a function of data distribution for both low level and high level functions. CMSSL o ers portability of user codes without loss of performance. CMSSL itself has an architecture amenable to portability. It is the same on all Connection Machine platforms. Code for maximum exploitation of the memory hierarchy is in assembly language, and thus has limited portability. Some algorithmic changes were also necessary in porting the library to the CM{5. These changes are largely due to the di erences in the communication systems, but also due to the MIMD nature of the CM-5.