Local Basic Linear Algebra Subroutines (LBLAS) for Distributed Memory Architectures and Languages with Array Syntax

We describe a subset of the level-1, level-2, and level-3 BLAS implemented for each node of the Connection Machine system CM-200. The routines, collectively called LBLAS, have interfaces consistent with languages with an array syntax such as Fortran 90. One novel feature, important for distributed memory architectures, is the capability of performing computations on multiple instances of objects in a single call. The number of instances and their allocation across memory units, and the strides for the different axes within the local memories, are derived from an array descriptor that contains type, shape, and data distribution information. Another novel feature of the LBLAS is a selection of loop order for rank-1 updates and matrix-matrix multiplication based upon array shapes, strides, and DRAM page faults. The peak efficiencies for the routines are in excess of 75%. Matrix-vector multiplication achieves a peak efficiency of 92%. The optimization of loop ordering has a success rate exceeding 99.8% for matrices for which the sum of the lengths of the axes is at most 60. The success rate is even higher over the set of all possible matrix shapes. The performance loss when a nonoptimal choice is made is less than about 15% of peak and typically less than 1% of peak. We also show that the performance gain for high rank updates may be as much as a factor of 6 over rank-1 updates.


Introduction
The Basic Linear Algebra Subroutines [1, 2, 8] (BLAS) are used in many scientific codes, often being critical for the performance of those codes. For many computer architectures, implementations in low-level languages have been made to ensure a desirable efficiency, though improved compiler technologies have recently allowed BLAS for some architectures to be coded in suitably structured high-level languages without significant loss of efficiency [3, 7, 10]. We discuss the issues involved in designing local BLAS for distributed memory architectures, programmed in languages with an array syntax. We report on the techniques used and the performance achieved in a subset of the BLAS for each node of a Connection Machine system CM-200, a distributed memory architecture with up to 2048 processing nodes. The techniques are applicable to many distributed memory architectures and are now being applied in developing optimized BLAS for the Connection Machine system CM-5 [15]. The subset of the BLAS described here, referred to as LBLAS for Local BLAS, forms part of the Connection Machine Scientific Software Library, CMSSL [18]. The LBLAS can be accessed directly, or indirectly through the CMSSL Distributed BLAS (DBLAS) [11]. Indeed, there is only one interface. The distribution of the arrays determines whether calls to the LBLAS suffice to accomplish the desired operation, or whether a distributed algorithm calling both communication routines and the LBLAS must be invoked. The LBLAS are used for both dense and block sparse matrix operations, and are used as well in the CMSSL linear system solvers [9] and eigenanalysis routines. Frequently, the LBLAS are also used directly in solving partial differential equations.

The BLAS have emerged over time, starting with the level-1 BLAS [8] for vector operations such as inner products (_DOT), scaling of a vector (_SCAL), and the addition of a scaled vector to another vector (_AXPY). The underscore represents a symbol denoting the relevant data type (S, D, C, Z). Operations such as matrix-vector multiplication can be expressed in terms of level-1 BLAS, but the number of memory references may be excessive. Scaling a vector requires two memory accesses for a single floating-point operation, assuming that the constant is available in a register. An inner product requires two memory accesses for each pair of floating-point operations (plus a store of the result), while an AXPY operation requires three memory operations per pair of floating-point operations, assuming the constant is available in a register. The level-2 BLAS [2] were defined to allow for increased utilization of arithmetic units in architectures with a higher data motion capacity between registers, or a cache, and the arithmetic units than between the registers or the cache and main memory. For instance, in matrix-vector multiplication y ← Ax + y, the vector x can be loaded into a register file, followed by the sequence of operations z(:) ← x(i)A(:,i) + z(:) for i = 1, 2, ..., Q, for a matrix of shape P × Q. Each iteration is an AXPY operation on vectors of length P.
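This column-oriented formulation can be written directly in Fortran 90 array syntax. The sketch below gives the reference semantics only; the routine name and the explicit accumulator z are illustrative, not the CMSSL implementation, with z standing in for the register-resident accumulation vector discussed next.

    ! Reference semantics of y <- A*x + y expressed as Q column updates
    ! (AXPY-like operations).  The accumulator z plays the role of the
    ! vector register file: it is loaded once, updated Q times, and stored
    ! once, so each pair of floating-point operations needs only one load
    ! of a column element of A.
    subroutine gemv_by_columns(p, q, a, x, y)
      implicit none
      integer, intent(in)    :: p, q
      real,    intent(in)    :: a(p, q), x(q)
      real,    intent(inout) :: y(p)
      real    :: z(p)
      integer :: i

      z = y                          ! one load of the accumulation vector
      do i = 1, q
         z = x(i) * a(:, i) + z      ! one column load per 2*p flops
      end do
      y = z                          ! one store of the result
    end subroutine gemv_by_columns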
The AXPY operation requires three memory references per element when z is read from and stored in memory for each iteration. However, if z is kept in the register file until completion, then each iteration requires only one read from memory for each pair of arithmetic operations. The level-2 BLAS include matrix-vector multiplication as a primitive, allowing extraneous memory references to be avoided without relying on interprocedural analysis by the compiler. Commercial compilers still rarely achieve close to peak performance even on single routines written in a completely architecture-independent way. However, by a suitable partitioning, unrolling, and possibly skewing of loops by the programmer, state-of-the-art compilers for some architectures produce very efficient code [3, 7, 10]. But, on many architectures, assembly level programming is still required to achieve the desired level of efficiency in using registers, caches, memory, and pipelines. The CM-200 belongs to this category of computer systems. In architectures with a single data path to memory, such as the CM-200, the level-2 BLAS offer a potential speedup for matrix-vector multiplication by a factor of two compared to a DOT-based algorithm, and a factor of three compared to an AXPY-based algorithm.

The level-3 BLAS [1] allow computations, such as matrix-matrix multiplication, to be performed with less demand on the memory bandwidth than when level-2 BLAS routines are used in the absence of interprocedural analysis and subsequent optimization of memory references by a compiler. The matrix multiplication C ← A·B + D may be performed as a sequence of multiplications of b × b subblocks. If the blocks required for a block matrix multiplication fit in the registers, then 2b^3 floating-point operations can be carried out using 3b^2 memory references for data input and b^2 memory references for data output. Furthermore, if all contributions to a block of C are accumulated in the registers, i.e., only the final result is stored in memory, then only 2b^2 elements need to be loaded from memory for each set of 2b^3 floating-point operations. Delaying all stores until the computations for a b × b block of C are completed results in b floating-point operations per memory reference, on average. Note that for matrix-matrix multiplication, an algorithm based on the level-1 BLAS is memory bandwidth limited when there is a single floating-point unit for each data path to memory and the data paths internal to the floating-point processor and the paths to memory are of the same width. An algorithm based on level-2 BLAS is balanced with respect to the demand for memory bandwidth and computational capability, and an algorithm based on the level-3 BLAS is limited by the arithmetic processing capability. Table 1 summarizes the operations count per memory reference for some functions in the level-1 BLAS.

Table 1: Floating-point operations per memory reference for a few BLAS functions.
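The register-blocking argument above can be illustrated with a short reference sketch (Fortran 90 intrinsics; the names, and the assumption that the matrix order is a multiple of the block size, are for illustration only and do not reflect the CMSSL implementation):

    ! Level-3 blocking: each b x b block of C is accumulated before it is
    ! stored, so every block product costs 2b^2 element loads for 2b^3
    ! floating-point operations, i.e., about b flops per memory reference.
    subroutine blocked_matmul(n, b, a, bmat, c)
      implicit none
      integer, intent(in)    :: n, b            ! n assumed divisible by b
      real,    intent(in)    :: a(n, n), bmat(n, n)
      real,    intent(inout) :: c(n, n)
      real    :: cblk(b, b)
      integer :: i, j, k

      do j = 1, n, b
         do i = 1, n, b
            cblk = c(i:i+b-1, j:j+b-1)                        ! load the C block once
            do k = 1, n, b
               cblk = cblk + matmul(a(i:i+b-1, k:k+b-1), &    ! 2b^2 loads ...
                                    bmat(k:k+b-1, j:j+b-1))   ! ... per 2b^3 flops
            end do
            c(i:i+b-1, j:j+b-1) = cblk                        ! store the C block once
         end do
      end do
    end subroutine blocked_matmul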
Throughout this paper, in the operation C ← A·B, the matrix A is of shape P × Q, B of shape Q × R, and C of shape P × R. For matrix-vector multiplication, R = 1, the vector x replaces B, and the vector y replaces C. For outer products, Q = 1, x replaces A, and y^T replaces B.
We only discuss BLAS for real data types. No particular optimization is currently performed in the CMSSL LBLAS for complex data types; for complex data types, real routines are called as required. The functionality of the LBLAS we have implemented deviates somewhat from the corresponding subset of the conventional BLAS [1, 2, 8] in that all scaling factors in the LBLAS are assumed to be one. For instance, for matrix-vector multiplication, _GEMV supports the operation y ← αAx + βy, where α and β are scalars, whereas the CMSSL local matrix-vector multiplication performs the operation y ← Ax + y. The reason that, at the moment, the LBLAS in CMSSL assume α = β = 1 is compatibility with the CMSSL DBLAS. The functionality of any routine in the CMSSL is independent of the data distribution, and the BLAS are no exception. For the DBLAS, mixing array arguments of different rank may have severe performance implications, unless the required data motion for aligning the arrays is efficiently implemented. When the data motion issues for mixing arrays of different rank are satisfactorily solved for the DBLAS, the CMSSL LBLAS will be extended to exactly the same functionality as the corresponding routines in the conventional BLAS. For a discussion of the data motion issues in the DBLAS see [4, 5, 6, 11]. In the following, for the convenience of the reader, we discuss the LBLAS using the traditional BLAS names whenever the distinction is either irrelevant or clear from the context. Whenever there is a need to stress that the discussion refers to the LBLAS, we prefix the BLAS names with CMSSL, as in, for instance, CMSSL DDOT for the CMSSL routine computing inner products in double precision. For the actual names of the CMSSL routines, see [18]. The names of the CMSSL routines, and hence the LBLAS, are not limited to six characters, and do not have the data type encoded in the routine name. All CMSSL routines have interfaces consistent with a language with an array syntax such as Fortran 90 [12], Connection Machine Fortran [17], and the emerging High Performance Fortran standard. The conventional BLAS interfaces are consistent with Fortran 77 syntax. The parts of the LBLAS that execute in each floating-point unit of the CM-200 are mostly implemented in assembly code, with some parts in microcode. The parts of the code that execute on the Front-End computer are in higher-level code, such as C, Fortran, or Lisp.

In Section 2, we discuss some of the design issues for high performance library routines for languages with an array syntax and distributed memory multiprocessors. In Section 3 we discuss the relevant architectural features of the CM-200. Section 4 discusses our level-1 LBLAS, while the level-2 LBLAS are presented in Section 5. In Section 6 we discuss how the loop ordering for matrix-matrix multiplication and outer product computation is determined at run-time. A summary follows in Section 7.

Library issues for languages with an array syntax
The LBLAS are designed to support concurrent operations on distributed data structures and languages with an array syntax. These facts have a fundamental impact on the interfaces of the routines with respect to: 1) what information is provided explicitly through arguments to maintain consistency with the languages, 2) what information is provided through arguments in order to realize a desired level of efficiency in execution, and 3) the amount of concurrency handled by the routines. One of the unique features of the CMSSL is the ability of routines to perform computations on multiple instances of objects in the same call, with the instances embedded in a multidimensional array. For example, in the solution of the Navier-Stokes equations by a finite difference technique, a matrix-vector multiplication is required in each grid point [13].
The matrix in a grid point defines one instance of the grid point matrices. For a regular three-dimensional discretization of the domain, three additional array axes may be used to identify the matrices in the different grid points. Similarly, in solving partial differential equations by the finite element technique, there is a matrix associated with each element. For a regular discretization of a three-dimensional domain, three additional array axes may be used to identify each elemental matrix, just as in the finite difference case. For unstructured discretizations, a single axis in addition to the axes defining each elemental matrix can be used to enumerate the elemental matrices. With the traditional BLAS, the looping over instances is made by placing the call inside a loop or a set of nested loops.
In the CMSSL, the entire array is used as an argument in the subroutine call, and the looping over instances is handled inside the subroutine. The multiple-instance capability of the routines allows extraneous data motion as well as unnecessary temporary storage allocation to be avoided. This capability also provides additional opportunities to optimize the scheduling of operations compared to single-instance routines (in the absence of powerful interprocedural analysis).
With respect to performance, it is highly desirable to minimize any unnecessary data motion, in particular that between processing nodes. Depending upon the data distribution rules used by the compiler and the run-time system, and the techniques used for dealing with computations on multiple instances of (sub)arrays, limiting each call to the BLAS to a single instance may incur a substantial performance penalty [4]. Thus, in order to guarantee a minimum amount of data motion, to avoid the creation of temporary arrays representing array sections (in particular, temporary arrays with a data distribution potentially very different from that of the original array from which they are extracted), and to maximize the opportunities for optimizing the scheduling of operations, the LBLAS, like all CMSSL routines, are based on passing entire arrays in place. No data motion is associated with the subroutine call itself. The decision of which instances are treated concurrently, and which ones are treated sequentially, is made within each routine of the CMSSL.
In the LBLAS for the CM-200, instances assigned to different processing nodes are treated concurrently, while instances assigned to the same processing node are treated sequentially, one instance at a time. However, in the LBLAS currently under development for the CM-5, whether the looping over multiple instances in a processing node is made in an outer or inner loop depends upon which loop order is predicted to yield the best performance. A mechanism for optimal loop ordering at run-time, including the instance axes, is particularly important when DRAM page faults may affect the performance significantly. How the multiple-instance capability designed into the LBLAS is exploited for the CM-5 will be described elsewhere. The scheduling order is transparent with respect to the call; the interface is unaffected. Thus, programs calling the LBLAS or the DBLAS are portable between the CM-200 and the CM-5. Though instances in a processing node are treated sequentially on the CM-200, the multiple-instance capability still implies significant performance advantages: data motion is reduced, in particular for computations on many small objects, and overhead is reduced by amortizing over all instances the cost of determining and setting up shared information, such as the strides within an instance and the calling sequence to be used for each instance.
From an implementation point of view, one important consequence of the LBLAS multiple-instance capability is that no implicit assumption can be made about either the absolute values or the relative values of the strides for the different axes defining an instance, the problem axes (axis). The problem axes in the CMSSL can be chosen arbitrarily from a multidimensional array passed in place. Hence, for a given data distribution across the memory units, the stride within a row may be larger (or smaller) than the stride within a column, depending upon the relative ordering of the row and column axes in the array in which the instances are embedded. Moreover, on the Connection Machine systems, the lengths of the axes' segments assigned to a memory unit, and hence the strides, are not known until run-time. Data distribution is a run-time system function, allowing the same user program to be executed on a different number of processing nodes without recompilation.
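The information needed at run-time can be pictured with a much-simplified, hypothetical descriptor; the actual CM run-time descriptor also carries type and data distribution information, and the names below are illustrative only.

    ! A simplified sketch of an array descriptor: one record per axis,
    ! recording the local extent and the local stride.  The strides of the
    ! user-selected problem axes are simply read off at run-time; no
    ! ordering relation between them can be assumed.
    module descriptor_sketch
      implicit none
      type :: axis_info
         integer :: extent   ! local length of this axis
         integer :: stride   ! distance (in elements) between successive entries
      end type axis_info
    contains
      subroutine problem_strides(axes, row_axis, col_axis, row_stride, col_stride)
        type(axis_info), intent(in)  :: axes(:)
        integer,         intent(in)  :: row_axis, col_axis
        integer,         intent(out) :: row_stride, col_stride
        row_stride = axes(row_axis)%stride
        col_stride = axes(col_axis)%stride
      end subroutine problem_strides
    end module descriptor_sketch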
The form of a call to the LBLAS is illustrated by the following example:

    Array y(N,M,K), x(N,K,L), A(M,L,N,K)
    gen_matrix_vect_mult(y, A, x, 2, 1, 2, 3, ier)
In the example, y and x represent either single vectors or (multidimensional) arrays of vectors; A represents a matrix, or a (multidimensional) array of matrices. The rank of the array A must be one higher than the ranks of the arrays y and x, which are of the same rank. The number 2 following x states that the problem axis for y is the second axis of the array y, i.e., the axis of extent M. Similarly, the number 1 states that the problem row axis for A is axis 1 of the array A, and the problem column axis is axis 2. The shape of each instance of A is M × L. The problem axis for x is axis 3 of the array x. Thus, the above call defines multiple matrix-vector multiplications. Each instance consists of the multiplication of an M × L matrix by a vector of length L. There are N × K such instances. The call is independent of the distribution of the arrays. However, in order for the operation to be well-defined, we require in the CMSSL that the arrays y, A, and x have conforming shapes, i.e., that the shapes of all arrays with the problem axes excluded are identical. In the example, it is easily seen that exclusion of the problem axes results in arrays of shape N × K.
Operations requiring the transpose of A can be performed using the same interface. The specifications of the row and column axes of A are simply interchanged.
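The semantics of the example call can be summarized by the following reference loop, a sketch only (the routine name is hypothetical, and the real routine does not loop instance by instance in this naive way): each of the N × K instances is an independent matrix-vector product, and interchanging the two axis arguments for A would yield the transposed operation.

    ! Reference semantics of the multiple-instance call above:
    ! y(n,:,k) <- A(:,:,n,k) * x(n,k,:) + y(n,:,k) for all n, k.
    subroutine gen_matrix_vect_mult_ref(n, m, k, l, y, a, x)
      implicit none
      integer, intent(in)    :: n, m, k, l
      real,    intent(in)    :: a(m, l, n, k), x(n, k, l)
      real,    intent(inout) :: y(n, m, k)
      integer :: jn, jk

      do jk = 1, k
         do jn = 1, n
            y(jn, :, jk) = matmul(a(:, :, jn, jk), x(jn, jk, :)) + y(jn, :, jk)
         end do
      end do
    end subroutine gen_matrix_vect_mult_ref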
The example interface above is consistent with a language with array syntax, such as Fortran 90 [12] or Connection Machine Fortran [17], in that the call contains no explicit information about array data types or shapes. This information is clearly needed for most BLAS. In the case of distributed data structures, it is also necessary to know how arrays are distributed over the different memory units, as well as within the memory units. On the Connection Machine systems, this information is kept in a descriptor. Passing an array to a subroutine, in fact, implies passing a pointer to the descriptor. In the traditional BLAS, the data type is encoded in the name, each call only handles a single instance, and the length of the leading dimension of two-dimensional arrays is passed explicitly as an argument.
The local BLAS we have implemented support all operations of the form C^{op_C} ← C^{op_C} + A^{op_A}·B^{op_B}, where op_A and op_C are of type N or T, for normal or transpose, respectively, and op_B is of type N, T, C, or H, where C stands for complex conjugate and H for Hermitian (complex conjugate transpose). (The selection of the desired combination of operand options for the LBLAS is made through a mode parameter not shown in the subroutine call above.)

The Connection Machine system CM-200
The CM-200 [14] has up to 2048 floating-point processors interconnected via a Boolean cube network. The elements of the architecture are shown in Figure 1. A node consists of a floating-point processor, its local memory, and associated communication circuitry.
Instructions and scalar data are allocated to the memory of the Front-End computer. Arrays are allocated across the memories of the floating-point processors. Instructions that operate on data allocated across a subset of nodes are broadcast to all nodes. Likewise, scalar values that are required in the nodes are broadcast from the Front-End computer. The result of a global reduction operation is a scalar that resides in the memory of the Front-End computer. So-called segmented reduction operations, i.e., concurrent reductions on disjoint segments of the index space, result in arrays that are allocated across some subset of nodes. For a detailed description of the memory management system on the CM-200, see [16, 17]. For a description of BLAS operating on distributed data structures, see [5, 6, 11]. Here we focus on the BLAS in each node and do not further address the issues of distributed data structures and communication.
Each floating-point processor has 64-bit wide data paths, but the path to local memory is only 32 bits wide. Figure 2 illustrates the local node architecture. Each processor has one floating-point multiplier and one adder for 32-bit or 64-bit data types, 32 registers, and 4 Mbytes of local memory. The clock frequency is 10 MHz. The peak floating-point rate for each processor is 20 Mflop/s, both in 32-bit and 64-bit precision. But, for operations on 64-bit data types, the peak rate is limited to 10 Mflop/s for any operation that requires one operand to be loaded from memory. The arithmetic units are pipelined, with a pipeline length of 2 cycles for the adder. The floating-point processors do not have an inner-product instruction. Our vectorized inner product results in two partial sums that are added in scalar mode.
There is a one cycle delay for memory operations. In the Connection Machine Instruction Set (CMIS), a vectorized instruction set, most instructions require a minimum of 6 cycles. For 32-bit data types, one item can be loaded from memory every cycle, while storing data in memory requires slightly less than two cycles per item. Loads and stores of 64-bit data items require 2 and 4 cycles, respectively. DRAM page faults do not extend the time for store operations. For 32-bit data types, matrix-vector multiplication achieves a good balance between the available memory bandwidth and the processing capability of each floating-point unit. However, for 64-bit operands, the difference in width between the data path internal to a processor and the data path to memory suggests a level-3 BLAS for matrix-matrix multiplication. But, because of the limited number of registers and the pipeline delays, matrix-vector multiplication is used for matrix-matrix multiplication also for 64-bit data types. Which lower level BLAS are used, and the loop order, may have a significant impact on performance. Our matrix-matrix multiplication routines estimate the performance expected with different loop orders, accounting for the impact of different vector lengths, various overheads, the number of DRAM page faults, and vector chaining.

Level-1 LBLAS
The estimated peak performance rates for the level-1 LBLAS we have implemented are given in Table 2, together with the measured performance for arrays with sizes covering a large fraction of the range possible in each CM-200 node. The estimated peak rates are computed based on the required number of memory references for the operation. A single load from memory is required for each pair of floating-point operations in computing L2 norms (_NRM2), while two loads are required for a pair of floating-point operations in computing inner products (_DOT). Two loads and one store are required for AXPY operations. Load/store operations in 64-bit precision require twice as many cycles as in 32-bit precision. The estimated peak rate quantifies the suitability of the architecture for the stated operations. We measure the quality of the implementation, i.e., the algorithm, the program structure, and the instruction set, by the achieved efficiency, defined as the realized performance as a percentage of the estimated peak performance. The efficiencies for various array sizes can be determined from Table 2. The peak efficiencies are: 80% for CMSSL DNRM2, 85% for CMSSL DDOT, and 100% for CMSSL DAXPY. The last figure indicates that stores do not require a full two cycles. The efficiency of the CMSSL DAXPY routine is higher than that of the CMSSL DDOT routine largely because of the inefficiency of the final accumulation in our vectorized inner-product computation. CMSSL DNRM2, in addition, requires a square root evaluation, which is equivalent to 14 floating-point operations on the CM-200. The peak efficiencies for CMSSL SNRM2, CMSSL SDOT, and CMSSL SAXPY are 70%, 75%, and 90%, respectively. The higher efficiency for operations in 64-bit precision is due to the fact that the data paths to memory are 32 bits wide, so memory operations require twice the time, thereby reducing the relative importance of looping overhead, pipeline lengths, etc. Figures 3, 4, and 5 show the aggregate floating-point rates for a 2048 processor CM-200.
Both the DOT and AXPY routines first load one operand into the register file, then perform the required operations while reading the second operand from memory. The maximum length of a single vector operation is limited by the size of the register file. With 32 registers, our implementation uses a maximum vector length of 29 for the DOT routines and a maximum of 30 for the AXPY routines (the remaining registers being used for other variables). The pipeline length of the adder is two. For the DOT routines, two partial results are computed by using the output of the adder as one of the inputs and the output from the multiplier as the other input. The two partial results (corresponding to the depth of the adder) are added for the final result after the vectorized accumulation is completed.
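The two-way accumulation can be sketched as follows (a reference sketch in Fortran 90, with the function name and the even/odd interleaving chosen for illustration; the CM-200 routine performs the same accumulation with vector instructions over register-length segments):

    ! Inner product with two interleaved partial sums, matching an adder
    ! pipeline of depth two; the partial sums are combined only after the
    ! vectorized accumulation, as in the scalar-mode final addition above.
    double precision function ddot_two_partials(n, x, y)
      implicit none
      integer,          intent(in) :: n
      double precision, intent(in) :: x(n), y(n)
      double precision :: s0, s1
      integer :: i

      s0 = 0.0d0
      s1 = 0.0d0
      do i = 1, n - 1, 2                 ! even/odd elements feed separate sums
         s0 = s0 + x(i)   * y(i)
         s1 = s1 + x(i+1) * y(i+1)
      end do
      if (mod(n, 2) == 1) s0 = s0 + x(n) * y(n)
      ddot_two_partials = s0 + s1        ! the nonvectorized final accumulation
    end function ddot_two_partials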
The performance of both the DOT and AXPY computations increases with the vector length, but a penalty is incurred each time a new vector must be loaded into the register file. The influence of the vector length, and the penalty for loading vectors into registers, is easily seen in Figures 4 and 5. For the NRM2 computation there is no need to preload a vector into the register file. It is possible to apply the value read from memory to both inputs of the multiplier, hence directly squaring the value read from memory. By accumulating the squared values as they are computed, there is, in effect, no upper limit on the vector length. However, because of limitations in the Connection Machine Instruction Set (CMIS), the vector length is limited to a maximum of 29, just as in the DOT routines. The effect of the limited vector length is clearly visible in Figure 3. The difference in performance between the DNRM2, DDOT, and DAXPY routines is apparent from Figure 6. The difference is mostly due to the difference in the need for memory accesses. The peak performance, R_∞, and the array size for half of the peak performance predicted from memory references, n_{1/2}, are summarized in Table 3. The table also states the maximum vector length in a vector instruction. Note that the stronger the dependence of the peak performance on the memory bandwidth, the shorter the vector length for half of peak performance.

Level-2 LBLAS
Of the level-2 BLAS, we have implemented LBLAS versions of matrix-vector multiplication (_GEMV) and outer products (_GER). As in the BLAS, one routine is used for both matrix-vector and vector-matrix multiplication. One or the other function is obtained by an appropriate specification of strides for the matrix axes. The basic operation in our CM-200 CMSSL GEMV routines is the vector operation z ← αx + z. The accumulation vector z and the coefficient α are kept in registers, as discussed in Section 1. The CMSSL AXPY routines are not used (since z is explicitly kept in registers). In the LBLAS for the CM-5, additional loop orderings are used, as will be described elsewhere. While AXPY routines are not used for matrix-vector multiplication in the LBLAS, they are used for the local outer-product routines. The number of floating-point operations per memory reference for matrix-vector (and vector-matrix) multiplication is twice that of an outer product. With the CM-200 processor architecture, the expected peak performance of matrix-vector multiplication is not twice, but, in fact, three or four times that of the outer-product computation, as explained below.

Estimated peak performance
For the matrix-vector multiplication y ← Ax + y, two floating-point operations are required for each matrix element. The minimum number of loads from memory is P + Q + PQ for a matrix of shape P × Q, and the minimum number of stores is P. For PQ ≫ P, Q, the peak processing rate approaches two floating-point operations per memory load/store operation, or, for the CM-200, 20 Mflop/s per processor in 32-bit precision and 10 Mflop/s per processor in 64-bit precision. Hence, matrix-vector multiplication can achieve close to the peak arithmetic performance in 32-bit precision.
For the outer-product computation C ← xy^T + C, the number of elements that must be loaded from memory is P + Q + PQ, i.e., the same as for matrix-vector multiplication. But, PQ elements must be stored. With each store in 32-bit precision requiring two cycles, the peak rate approaches two floating-point operations for every three cycles, instead of two operations for every cycle as in matrix-vector multiplication. Thus, for PQ ≫ P, Q, the peak estimated performance for the outer product C ← xy^T + C is only one third of the performance for the operation y ← Ax + y. The asymptotic peak performance for outer products can be achieved by preloading x and a (few) component(s) of y into the register file, then reading a column of C from memory while performing the operation z(:) ← y(i)x(:) + C(:,i) with the destination of z(:) being a vector register. The desired result is obtained through the store operation C(:,i) ← z. Two vector registers are required: one for x, one for z.
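A minimal Fortran 90 sketch of this column-oriented update (reference semantics only; the CMSSL routine works on register-length column segments, and the routine name is illustrative):

    ! Rank-1 update C <- x*y^T + C as a sequence of AXPY-like column
    ! operations.  Every column of C is both loaded and stored, which is
    ! why the peak rate is bounded by the cost of stores.
    subroutine ger_by_columns(p, q, x, y, c)
      implicit none
      integer, intent(in)    :: p, q
      real,    intent(in)    :: x(p), y(q)
      real,    intent(inout) :: c(p, q)
      integer :: i

      do i = 1, q
         c(:, i) = y(i) * x + c(:, i)   ! one column load and one column store per 2*p flops
      end do
    end subroutine ger_by_columns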
For the outer-product operation C ← xy^T, only two vectors must be loaded from memory, but there is only one floating-point operation to be performed for each element of C. With stores requiring two cycles, the performance for PQ ≫ P, Q approaches one quarter of the peak performance for matrix-vector multiplication. Because of the limited number of registers in each CM-200 processor, and the pipeline start-up and shut-down times, we choose to use the CMSSL AXPY routine for our outer-product routine. Thus, in the CM-200 LBLAS, the peak floating-point performance per processor is 5 Mflop/s in 32-bit precision and 2.5 Mflop/s in 64-bit precision for both C ← xy^T + C and C ← xy^T. Hence, the peak outer-product performance on the CM-200 is only a quarter of the peak performance of the CMSSL GEMV routines. The estimated peak arithmetic performances for matrix-vector multiplication and outer-product computations are summarized in Table 4.

Function          SGER   DGER   SGEMV   DGEMV
Est. peak perf.   5      2.5    20      10

Table 4: Estimated peak floating-point rates in Mflop/s for the CMSSL GER and CMSSL GEMV routines on each CM-200 processor.

Matrix-vector multiplication
In the CMSSL GEMV routine the accumulation vector z in z(:) ← x(i)A(:,i) + z(:) is allocated in the register file. Elements of the vector x are preloaded into the register file, while the columns A(:,i) are read from memory as needed. With 32 registers, the looping over the rows and the columns of the matrix must, in general, be divided into two loops each. The loop orderings used for matrix-vector multiplication in the CM-200 LBLAS are shown in Figure 7. The innermost, vectorized loop is labeled 0. The outermost loop is labeled 3. Several vector instructions corresponding to successive iterations of loop 1 can be chained together as a macro vector operation. The chaining of individual vector instructions results in one pipeline start-up and shut-down for the entire chain of vector instructions, instead of a pipeline start-up and shut-down for each vector instruction. To allow vector operations to be chained, the necessary updating of pointers and base addresses must be performed concurrently with the execution of a vector instruction. On the CM-200, this form of concurrency can only be achieved in microcode. Moreover, the number of vector instructions chained together cannot be a run-time argument. The performance gain from chaining is typically in the 5-10% range.
The accumulation vector z and the coefficient vector x must share the register set in a processor. With v_1 registers allocated to z and v_2 registers to x, the number of vector loads of x is ⌈Q/v_2⌉⌈P/v_1⌉ for the loop order shown in Figure 7, i.e., x is loaded ⌈P/v_1⌉ times. The total number of stores of v_1 elements each is ⌈P/v_1⌉. Clearly, with respect to load and store operations, it is desirable to maximize v_1. In determining the optimal values of v_1 and v_2, the overheads for vector loads, for the chained operations, and for the vector stores must also be included. In the LBLAS for the CM-200, v_1 = 22 and v_2 = 8. Two of the 32 registers in a processor are used for temporary variables. The timings reported in Tables 5 and 6 are based on chaining sets of 8 vector instructions, when possible. A matrix of arbitrary shape is divided into the maximum number of sets of 22 rows and 8 columns. The remaining columns are treated by independent vector instructions. Remaining rows are treated in a manner analogous to a set of 22 rows, except for the difference in vector length. All submatrices with 8 columns are handled by chained vector instructions. The effects of these restrictions are apparent in Figures 8 and 9, which show the performance of our matrix-vector multiplication routines. Each iteration of loop 2 requires a load of a new segment of the vector x as well as a new chained vector operation. Each iteration of loop 3, in addition, requires a segment of the vector y to be stored, and a new segment of the vector y to be loaded from memory for the operation y ← Ax + y on a new set of rows. The peak efficiency in 32-bit precision is 90% and in 64-bit precision, 92%. The vector length (number of iterations in loop 0) has a very strong influence on performance up to the maximum vector length of 22. For four columns, the performance increases by a factor of about 5.1 when the vector length increases from 2 to 22, while for 500 columns, the increase in performance is approximately a factor of 1.8 for the same range of vector lengths. The overhead associated with loop 2, i.e., successive chained vector operations, is apparent in Figure 10.
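The loop structure of Figure 7 can be sketched as follows (scalar Fortran 90 code for the reference semantics only; the CM-200 routine uses chained vector instructions and handles remainder blocks as described above, and the routine name is illustrative):

    ! Register-blocked y <- A*x + y with v1 rows of the accumulator and
    ! v2 elements of x resident in registers (v1 = 22, v2 = 8 in the
    ! CM-200 LBLAS).  Loop 0 is the vectorized inner loop; one pass of
    ! loop 1 corresponds to one chain of vector instructions.
    subroutine gemv_blocked(p, q, a, x, y, v1, v2)
      implicit none
      integer, intent(in)    :: p, q, v1, v2
      real,    intent(in)    :: a(p, q), x(q)
      real,    intent(inout) :: y(p)
      real    :: z(v1)
      integer :: i0, j0, i, j, ihi

      do i0 = 1, p, v1                       ! loop 3: blocks of v1 rows
         ihi = min(i0 + v1 - 1, p)
         z(1:ihi-i0+1) = y(i0:ihi)           ! load a segment of y
         do j0 = 1, q, v2                    ! loop 2: blocks of v2 columns (x segment reloaded)
            do j = j0, min(j0 + v2 - 1, q)   ! loop 1: the chained vector operations
               do i = i0, ihi                ! loop 0: vectorized AXPY of length <= v1
                  z(i - i0 + 1) = x(j) * a(i, j) + z(i - i0 + 1)
               end do
            end do
         end do
         y(i0:ihi) = z(1:ihi-i0+1)           ! store the segment of y
      end do
    end subroutine gemv_blocked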
For the timings in Figures 8 and 9, the stride for the elements in the inner loop is 1, while the stride for loop 1 is P. With a DRAM page size of 1 kbyte, and arrays aligned with DRAM page boundaries, no page fault occurs in the inner loop for P ≤ 256 in 32-bit precision, or P ≤ 128 in 64-bit precision. With arrays not aligned with DRAM pages, a page fault may occur for at most one instance of loop 0 for these values of P.
One additional DRAM page fault is encountered in loop 0 for every 256 or 128 rows, respectively, added to the matrix. For sufficiently small arrays aligned with DRAM pages, no page faults occur in loop 1. However, for P > 256 in 32-bit precision, or P > 128 in 64-bit precision, a page fault occurs for every iteration in every instance of loop 1. The relatively larger impact of the DRAM page faults on performance is clear from a comparison of Figures 8 and 9. With one page fault per iteration in loop 1 and no page faults in loop 0, the number of cycles per iteration in loop 1 increases from 22 to 23 in 32-bit precision, while it increases from 44 to 45 in 64-bit precision.
Interchanging the order of loops 0 and 1, as required for vector-matrix multiplication using the routine CMSSL GEMV on the CM-200, may significantly increase the number of page faults for a given matrix layout in memory. With the matrix B being of shape Q × R for the operation y^T ← x^T B + y^T, the stride for the inner loop increases from 1 to Q when the ordering of loops 0 and 1 is interchanged. For Q > 256 in 32-bit precision, one page fault occurs for every memory access. For Q > 128 in 64-bit precision, one page fault occurs for every two memory cycles, since a 64-bit word is stored as two contiguous 32-bit words. Thus, for Q > 128 the peak performance of vector-matrix multiplication is correspondingly reduced for the CMSSL GEMV routine with the stride in the inner loop equal to the length of the row axis of the matrix. The performance losses due to page faults are summarized in Table 7.
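A rough estimate of the page-fault count in the inner loop follows directly from the page size. The sketch below is an assumption-laden illustration, not the CMSSL cost model, and the function name and the treatment of alignment are hypothetical.

    ! Approximate number of distinct 1 kbyte DRAM pages touched by an
    ! inner loop of the given length and stride (stride and element width
    ! in bytes).  Unit stride in 32-bit precision touches a new page only
    ! every 256 elements; a stride of a full page or more may fault on
    ! every access.
    integer function pages_touched(length, stride, bytes)
      implicit none
      integer, intent(in) :: length, stride, bytes
      integer, parameter  :: page_bytes = 1024

      if (stride * bytes >= page_bytes) then
         pages_touched = length
      else
         pages_touched = ((length - 1) * stride * bytes) / page_bytes + 1
      end if
    end function pages_touched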
With unit stride in loop 0, the impact of DRAM page faults on performance amounts to only a few percent. With unit stride in loop 1, and a stride in loop 0 equal to the length of the axis for loop 1, the maximum estimated performance loss is 50% in 32-bit precision and 33.3% in 64-bit precision. The measured peak performance with the maximum number of page faults is 62% of the peak measured performance in 32-bit precision, and 77% of the peak measured performance in 64-bit precision, for a fixed loop order. However, a higher peak performance is attained if loop 0 is given unit stride, and loop 1 a stride equal to the length of the axis for loop 0. Measuring the performance loss under page faults relative to this loop ordering yields a loss of 42% (instead of 38%) in 32-bit precision, and a loss of 28% (instead of 23%) in 64-bit precision. Because of the page faults with loop 1 having unit stride, the performance for this loop order never reaches the level achieved with loop 0 having unit stride. Though there is a difference in performance for large matrices depending upon which of the loops 0 and 1 has unit stride, the performance is essentially independent of the stride for loops 0 and 1 for small matrices.

Figure 12: The aggregate performance for vector-matrix multiplication in 64-bit precision on a 2048 processor CM-200. The matrix shape is Q × R.
In the CMSSL, any axis of a multidimensional array can be chosen as the row axis and as the column axis. Thus, whether a row- or column-oriented array layout is used, either the row or the column axis of a matrix may have the smaller stride of the two, and neither of the selected axes need have unit stride. Since it is desirable to avoid as many page faults as possible in the inner loop, choosing loop 0 to enumerate elements along the axis with the smaller stride of the two matrix axes is a plausible strategy. However, this strategy may change the character of the algorithm from being AXPY-like to being DOT-like. Due to the inefficiency on the CM-200 of a single inner product of length at most equal to the maximum vector length, it may be desirable to maintain the ordering of loops 2 and 3. This allows the final, nonvectorized reduction to be deferred until the end, and to be performed once for each row of the matrix. The expense on the CM-200 is that two registers are required for each matrix row in loop 1. Because of the limited number of registers in each CM-200 floating-point processor, and the relative inefficiency of the inner-product computation, a performance gain would occur only in relatively few cases. However, on the CM-5, where each vector unit has 64 64-bit registers (128 32-bit registers), and a page fault amounts to 2.5 cycles, we have also implemented the DOT-like matrix-vector multiplication loop ordering. The CM-5 implementation will be described elsewhere.

Outer products
The outer-product computation C ← xy^T + C is performed using the AXPY routines.
The AXPY operations may be row- or column-oriented, as shown in Figure 13. In the column-oriented scheme, a column C(:,i) is loaded into the register file, together with y(i). Then, the operation z(:) ← y(i)x(:) + C(:,i) is performed with x read from memory, and z overwriting C(:,i) in the register file, followed by a store, C(:,i) ← z(:). With 32 registers, the columns are partitioned into segments of length 30 in our implementation. The index space of the matrix is traversed as shown in Figure 13. The choice between a row- or column-oriented algorithm is based on the strides within rows and columns of C, x, and y, and the shape of C. For a row-oriented algorithm it suffices to consider y and one row of C, while for a column-oriented algorithm x and one column of C are considered. The loop order is determined as described in the next section for matrix-matrix multiplication.
Performance data for rank-1 updates on square matrices are given in Table 10 and Figure 14. The performance data are almost identical to those of the CMSSL AXPY routines in Table 2, as expected. The peak efficiency for 32-bit precision is about 85%, while the peak for 64-bit precision approaches 100%. The latter figure shows that our estimate of a store requiring two cycles is too conservative. The influence of the vector length is apparent in Figure 14.

Figure 14: The aggregate performance for rank-1 updates of square matrices on a 2048 processor CM-200.

Matrix-matrix multiplication
Local matrix-matrix multiplication in the CM-200 LBLAS is based on the level-1 or level-2 LBLAS. Which routine is called depends entirely upon matrix shape, as does the calling sequence when level-2 LBLAS are called. How the calling sequence is determined is addressed in this section. A CMSSL DOT routine is only called when P = R = 1 and Q > 1. (Recall that in the matrix multiplication C ← A·B, A is of shape P × Q and B is of shape Q × R.) A CMSSL AXPY routine is chosen only when P > 1 and Q = R = 1, or P = Q = 1 and R > 1. A CMSSL GER routine is chosen if Q = 1 and P, R > 1. For all other cases the matrix-matrix multiplication is performed through calls to a CMSSL GEMV routine. Making the loop over the P-axis or the R-axis the innermost loop, and the other loop the outermost loop, is a choice made as a trade-off between overhead, DRAM page faults, and pipeline lengths. Thus, the loop order is a function of the shapes of the operands and of the strides along the row and column axes, i.e., the data layout in each node, and must be determined at run-time. If at least one of the operands is a true matrix, i.e., both of its axes have a length greater than 1, then it is clear from Tables 2, 5, 6, 8, and 9 that, for the LBLAS, a rank-1 routine (effectively an AXPY routine) is never competitive with a CMSSL GEMV routine. From the same set of tables it is also clear that the CMSSL GEMV routines offer superior performance compared to the CMSSL DOT routines for most matrix shapes. Indeed, there is only one case (for which we have performance data) in which the CMSSL SDOT routine performs better than the CMSSL SGEMV routine, namely vector-matrix multiplication with a matrix of shape 2048 × 2 laid out in column-major order. In this case, the inner loop is of length two with a stride of 2048. The performance of the CMSSL SDOT routine is 7.38 Mflop/s per processor, compared to 6.94 Mflop/s per processor for the CMSSL SGEMV routine. The reason for the lower performance of the matrix-vector multiplication routine is the excessively short inner loop (vector length 2) and the page faults in the inner loop with a column-major layout. Because of the column-major layout, there would be no page faults in the inner loop if the inner-product routine had been used, and the performance data reported in Table 2 would apply. The performance numbers in Table 2 are all based on a memory stride of 1 for each vector. A DOT-like loop ordering for vector-matrix multiplication would yield somewhat better performance than given in Table 2, as explained in Section 5. The CMSSL DGEMV routine yields higher performance than the CMSSL DDOT routine for all cases and loop orderings we considered.

Table 11 and Figure 15 give performance data for the multiplication of square matrices in each processing node. The dips in the performance are due to the limitation on the vector length (maximum 22 rows) and the limitation on chaining of vector instructions (either 8 or none). The matrices in Table 11 and Figure 15 are sufficiently small that page faults do not have any significant impact. Table 12 and Figures 16 and 17 show the performance for the multiplication of a pair of matrices in which one is square and the other rectangular. The effect of choosing either the loop over the P-axis or the loop over the R-axis as the innermost loop is apparent. In the absence of page faults, the loop should be chosen such that the length of the inner loop is maximized. When there are fewer columns in B than rows in A, i.e., R < P, then the loop for the R-axis should be the outermost loop, and the loop for the P-axis should be the innermost loop. Thus, for a fixed matrix A, as the number of columns of B increases with R < P, the performance is flat until R = P, and then it increases with the number of columns of B. A similar behavior is apparent when B is square and the number of rows of A is increased, as seen from Figure 17.

Figure 17: The aggregate performance for the multiplication of a P × Q matrix and a Q × Q matrix in 32-bit and 64-bit precision on a 2048 processor CM-200.
Choosing the loop order for best performance must take the vector length, various overheads, and DRAM page faults into account. For the matrix multiplication C ← A·B, the stride along the P-axis of A is A_0 and the stride along the Q-axis of A is A_1. The stride along the Q-axis of B is B_0 and the stride along its R-axis is B_1. Finally, the stride along the P-axis of C is C_0 and the stride along its R-axis is C_1. Since all our routines are designed for multiple-instance computation, the strides along the P-axis of A and C need not be the same, nor need the strides along the Q-axis of A and B, or along the R-axis of B and C, be the same. The stride expressed in memory addresses also depends upon the data type. This fact is accounted for in the strides A_0, A_1, B_0, B_1, C_0, and C_1. But the number of 32-bit memory words loaded from and stored to memory must be accounted for explicitly, which is done through a factor S, where S = 1 for 32-bit precision and S = 2 for 64-bit precision. The sum of the pipeline start-up and shut-down costs is denoted by a separate parameter. Table 13 summarizes the cycle estimates for matrix-vector multiplication, with the loop on the P-axis being the innermost loop. In addition to the entries in Table 13, there is also a one-time cost for the function call. The two overhead parameters in the estimate amount to about 6 and 10 cycles, respectively.
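The run-time choice amounts to comparing such cycle estimates for the two candidate loop orders. The sketch below is illustrative only: the cost form, the per-call overhead constant, and the routine name are assumptions standing in for the actual estimate of Table 13, which also models chaining and the exact register blocking.

    ! Pick the inner axis for C <- A*B + C by comparing rough cycle
    ! estimates: P innermost means R matrix-vector calls with inner vector
    ! length P; R innermost means P calls with inner vector length R.
    logical function p_axis_innermost(p, q, r, s, stride_p, stride_r)
      implicit none
      integer, intent(in) :: p, q, r     ! A is P x Q, B is Q x R, C is P x R
      integer, intent(in) :: s           ! 1 for 32-bit, 2 for 64-bit operands
      integer, intent(in) :: stride_p    ! inner-loop stride (32-bit words) if P is inner
      integer, intent(in) :: stride_r    ! inner-loop stride (32-bit words) if R is inner
      integer, parameter :: page_words    = 256   ! 32-bit words per 1 kbyte DRAM page
      integer, parameter :: call_overhead = 100   ! assumed per-call cost in cycles
      real :: cost_p, cost_r

      cost_p = real(r) * (call_overhead + q * (s * p + faults(p, stride_p)))
      cost_r = real(p) * (call_overhead + q * (s * r + faults(r, stride_r)))
      p_axis_innermost = cost_p <= cost_r
    contains
      real function faults(length, stride)
        integer, intent(in) :: length, stride
        if (stride >= page_words) then
           faults = real(length)                               ! a fault on (nearly) every access
        else
           faults = real((length - 1) * stride) / page_words + 1.0
        end if
      end function faults
    end function p_axis_innermost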
For vector-matrix multiplication the roles of A and B are interchanged. More precisely, A takes the role of B, with the stride A_1 replacing B_0; B takes the role of A, with B_1 replacing A_0 and B_0 replacing A_1. C is accessed by rows instead of columns, and C_1 replaces C_0.
The choice of inner and outermost loops for matrix-matrix multiplication is based on the above expressions, acknowledging that with the P-axis innermost, R calls are required, while with the R-axis innermost, P calls are required. The losses due to a nonoptimal choice of loop order are illustrated in Figure 18. The optimal choice is indeed made in more than 99.8% of the cases for which the sum of the lengths of the axes is at most 60. The performance loss due to an incorrect choice is at most 15% and typically less than 1%. The relative performance enhancement achieved by a rank-n update over a rank-1 update is shown in Figure 20. The relative enhancement is greater for small matrices than for large matrices. For instance, the performance enhancement of a rank-32 update over a rank-1 update is a factor of about 5.9 for 32 × 32 matrices. For 256 × 256 matrices the improvement is a factor of 3.9 in 32-bit precision. The performance improvement in 64-bit precision for the same matrix sizes is 4.7 and 3.5, respectively.
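A rank-n update is simply a matrix-matrix multiplication with Q = n, accessed through the ordinary GEMM-style interface. A minimal reference sketch of the operation measured here (Fortran 90 intrinsics, not the CMSSL routine; the name is illustrative):

    ! Rank-n update of a P x P matrix: C <- A*B + C with A of shape P x n
    ! and B of shape n x P.  For n = 1 this degenerates to the AXPY-based
    ! rank-1 update discussed earlier.
    subroutine rank_n_update_ref(p, n, a, b, c)
      implicit none
      integer, intent(in)    :: p, n
      real,    intent(in)    :: a(p, n), b(n, p)
      real,    intent(inout) :: c(p, p)

      c = matmul(a, b) + c
    end subroutine rank_n_update_ref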

Summary
We have presented a local BLAS (LBLAS) for distributed memory architectures and languages with an array syntax. The LBLAS are designed to perform the operations C^{op_C} ← C^{op_C} + A^{op_A}·B^{op_B}, where A, B, and C are real or complex matrices in 32-bit or 64-bit precision and of any shape and memory layout that can be specified for arrays with an arbitrary number of axes. The superscript for A and C designates either the matrix or its transpose, while for B, the superscript designates one of four options: normal, transpose, complex conjugate, or Hermitian. Given the architectural features of the CM-200 processor, the peak performance of the BLAS routines is limited by memory accesses. Thus, the expected performance of the NRM2 routines is twice that of the corresponding DOT routines, which, in turn, is twice that of the AXPY routines. The peak efficiencies achieved for the DNRM2, DDOT, and DAXPY routines are 80%, 85%, and 100%, respectively. These efficiency figures are relative to estimated peak rates that assume stores require two cycles, which is conservative. The peak measured performance is 8.04 Mflop/s, 4.25 Mflop/s, and 2.49 Mflop/s per processor, respectively. Clearly, for algorithms that can make use of either the DDOT routine or the DAXPY routine, the former should be chosen. The efficiencies in 32-bit precision are slightly lower. The vector length for half of peak performance is 35, 20, and 10 for DNRM2, DDOT, and DAXPY, respectively. Matrix-vector and vector-matrix multiplication can fully exploit the arithmetic capability of the processor architecture in 32-bit precision. In 64-bit precision the fact that the path to memory is 32 bits wide limits the peak performance to half of the peak arithmetic capability of the CM-200. The peak measured floating-point rate is 18.0 Mflop/s per processor in 32-bit precision (90% efficiency), and 9.2 Mflop/s per processor in 64-bit precision (92% efficiency). Our rank-1 update routines are based on AXPY routines and, at best, achieve a quarter of the peak processor performance. The measured peak floating-point rate is 4.2 Mflop/s per processor in 32-bit precision and 2.3 Mflop/s per processor in 64-bit precision. The choice between row-wise and column-wise AXPY operations is made in the same way as for matrix-matrix multiplication.
Our LBLAS matrix-matrix multiplication routines are based on the level-1 and level-2 LBLAS. A level-1 LBLAS routine is only used when the matrix shapes are such that a single call to such a routine suffices to carry out the requested operation. When the matrix-matrix multiplication consists of three nested loops with more than one iteration in each loop, the level-2 LBLAS are used and the loop ordering is determined at run-time based on an estimate of page faults, pipeline losses, and looping overhead. The optimum choice is indeed made in more than 99.8% of the cases for which the sum of the lengths of the axes is at most 60. For all possible axes' lengths, the optimum choice is made in an even higher fraction of the cases. The performance loss due to an incorrect choice is at most 15% and typically less than 1%. We also show that the performance gain in using a rank-n update instead of a rank-1 update may be close to a factor of 6, with a factor of 3-4 being more typical.

Figure 20: The performance ratio between rank-n and rank-1 updates of P × P matrices local to a CM-200 processor, 32-bit precision.

Figure 6: The aggregate floating-point rates in Gflop/s for the local CMSSL DNRM2, CMSSL DDOT, and CMSSL DAXPY routines on a 2048 processor CM-200.

Figure 7: The CM-200 LBLAS loop ordering for matrix-vector and vector-matrix multiplication.

Figure 18: The percentage performance loss due to a nonoptimal choice of loop ordering in matrix-matrix multiplication local to each CM-200 processor.

Table 3: Peak performance and vector length for half of peak performance in 64-bit precision for the CMSSL level-1 LBLAS on each node of a CM-200.
Figure 3: The aggregate floating-point rate in Gflop/s for the local CMSSL SNRM2 and CMSSL DNRM2 routines on a 2048 processor CM-200.

Figure 4: The aggregate floating-point rate in Gflop/s for the local CMSSL SDOT and CMSSL DDOT routines on a 2048 processor CM-200.

Figure 5: The aggregate floating-point rate in Gflop/s for the local CMSSL SAXPY and CMSSL DAXPY routines on a 2048 processor CM-200.

Table 5: The floating-point rates in Mflop/s achieved in each CM-200 processor for matrix-vector multiplication.

Table 6: The floating-point rates in Mflop/s achieved in each CM-200 processor for matrix-vector multiplication.

Figure 8: The aggregate performance for matrix-vector multiplication in 32-bit precision on a 2048 processor CM-200. The matrix shape is P × Q.

Figure 9: The aggregate performance for matrix-vector multiplication in 64-bit precision on a 2048 processor CM-200. The matrix shape is P × Q.

Figure 10: The overhead associated with successive chained vector operations is the main cause of the irregularities in the curves in this figure. 64-bit precision.

Table 8: The floating-point rates in Mflop/s achieved in each CM-200 processor for vector-matrix multiplication.

Table 9: The floating-point rates in Mflop/s achieved in each Connection Machine system CM-200 processor for vector-matrix multiplication.

Table 10: The floating-point rate in Mflop/s achieved in each CM-200 processor for rank-1 updates of square matrices.

Table 12: The floating-point rate in Mflop/s achieved in each CM-200 processor for the multiplication of one square and one rectangular matrix.

Figure 16: The aggregate performance for the multiplication of a P × P matrix by a P × R matrix in 32-bit and 64-bit precision on a 2048 processor CM-200.

Table 13: Estimation of the number of cycles required for matrix-vector multiplication.
In the LBLAS there is no special interface provided for rank-n updates; the CMSSL GEMM interface is used. Tables 14 and 15 and Figure 19 give the performance for rank-n updates on square matrices for n in the range 1-256.