Optimal Communication Channel Utilization for Matrix Transposition and Related Permutations on Binary Cubes

We present optimal schedules for permutations in which each node sends one or several unique messages to every other node. With concurrent communication on all channels of every node in binary cube networks, the number of element transfers in sequence for K elements per node is K/2, irrespective of the number of nodes over which the data set is distributed. For a succession of s permutations within disjoint subcubes of d dimensions each, our schedules yield min(K/2 + (s−1)d, (s+3)d, K/2 + 2d) exchanges in sequence. The algorithms can be organized to avoid indirect addressing in the internode data exchanges, a property that increases the performance on some architectures. For message-passing communication libraries, we present a blocking procedure that minimizes the number of block transfers while preserving the utilization of the communication channels. For schedules with optimal channel utilization, the number of block transfers for a binary d-cube is d. The maximum block size for K elements per node is ⌈K/(2d)⌉.


Introduction
We give simple, yet optimal, schedules for a class of permutations on multiprocessors configured as binary cubes with concurrent communication on all channels of every processor, all-port communication. The class of permutations we consider is all-to-all personalized communication (AAPC), in which each node sends a unique message to every other node. The Connection Machine systems CM-2 and CM-200 [20] are examples of binary cube configured multiprocessor architectures allowing concurrent communication on all channels of every node. Our schedules avoid indirect addressing in the data exchanges by performing a local alignment of the data in each processor prior to, and after, the data interchanges between processors. Examples of all-to-all personalized communication are bit-reversal, vector-reversal, matrix transposition and shuffle permutations. Conversion between cyclic and consecutive mapping [9] for array allocations, such that a number of memory address bits are exchanged with the same number of processor bits, also constitutes all-to-all personalized communication. Cyclic and consecutive data layout can be specified in Vienna Fortran [21] and in Fortran D [4]. Both layouts are also included in the emerging High Performance Fortran standard. Any one of the above permutations, combined with code conversion such as conversion between binary code and binary-reflected Gray code [10, 14], also constitutes all-to-all personalized communication. A succession of AAPCs is necessary in some computations. For instance, a Fast Fourier Transform on 4096 data elements distributed evenly over 512 processors can be performed as a sequence of three local transforms, each on eight data elements. Between successive local transforms, an all-to-all personalized communication within 3-cubes must be performed. Successive AAPCs are performed on successive sets of three processor dimensions. With data in cyclic order [9], the first communication exchanges the local memory address bits (maddr) with the bits AAPC-2 within the processor address field (paddr). The second AAPC exchanges the local memory address bits with the bits AAPC-1, etc. Each segment of the processor address field can be viewed as an axis encoded in d bits. The successive AAPCs in the example perform the axis exchanges (l, k, j|i) → (l, k, i|j) → (l, j, i|k) → (k, j, i|l), which constitute a generalized shuffle [5, 6, 11].
For the pipelining of a succession of AAPCs, the time elapsed from the first motion of an element until it reaches its destination affects the pipeline delay. We refer to the maximum elapsed time for any element as the span. We assume that each node in a d-dimensional binary cube, d-cube, can concurrently send and receive one data element on all its ports in one time step, i.e., all-port communication. The all-port communication implies that all communication links are full duplex. The number of elements per processor is K.
The lower bound for the number of time steps for all-port, all-to-all personalized communication can be shown to be K/2 [1, 12]. The optimum span is d. We present five algorithms, of which each of the first three has either an optimum number of element transfers or optimum span. The fourth algorithm has both an optimum number of element transfers and optimum span. The first three algorithms are very simple and provide some of the essential ideas for the fourth algorithm. The fifth algorithm is for multiple, pipelined AAPCs. In many message-passing libraries for multiprocessors, there is a significant overhead for each message. It is often desirable to send few messages with a large amount of data in each message instead of many messages with a small amount of data in each message. We present a blocking procedure that minimizes the number of block transfers and the block size for a given schedule for a single AAPC. The blocking preserves the number of time steps, i.e., the utilization of the communication links is the same with blocking as without blocking. For Algorithm 4, which has the optimum number of time steps and optimum span, the minimal number of block transfers is d, which is optimal. The block size for this number of block transfers is ⌈K/(2d)⌉, which is minimal for d block transfers.
The procedure for minimizing the number of messages in a message-passing system can also be used to achieve optimal utilization of communication links when the width of the communication links is a multiple of the element size, or when there are multiple links between pairs of nodes. The Connection Machine systems CM-2 and CM-200 have two communication links between pairs of nodes forming a binary cube with up to 11 dimensions. For a channel width of b elements, our procedure yields d⌈K/(2db)⌉ time steps for a single AAPC. Thus, the time is reduced in proportion to the width of each link. Saad and Schultz [18] have suggested a recursive AAPC algorithm based on 2^d translated binomial trees. The algorithm requires dK/2 time steps; it is primarily of interest for one-port communication, i.e., communication restricted to one send and one receive per node per time step. One-port communication algorithms have also been presented by Nassimi and Sahni [16, 17] and Flanders [3]. Nassimi and Sahni discuss bit-permute complement (BPC) permutations on mesh or hypercube configured multiprocessor systems with one-port communication. Flanders [3] considers similar permutations on two-dimensional meshes with one-port communication.
In [12] we presented one-port and all-port algorithms based on balanced trees and rotated binomial trees. For all-port communication, the algorithms attain the optimum number of time steps when each node sends a multiple of d elements to every other node, i.e., K = αd2^d for some integer α. For other values of K, the algorithms in [12] are optimal within a small constant factor (at most 24% for d > 4) [7]. Algorithms 2, 3, and 4 presented here all have the optimum number of time steps for any K. Algorithms 1 and 4 have the optimum span. Independent of the work reported in [12], Stout and Wagar [19] also gave an algorithm with the optimum number of time steps for all-port communication. Later, Bertsekas et al. [1] presented yet another algorithm with an optimal number of time steps for any K. A detailed optimal scheduling algorithm for all-port communication has also been presented by Edelman [2], who implemented the algorithm on the Connection Machine system CM-2. The issue of minimum span is not addressed in any of the previous algorithms. Indeed, the schedule used by Edelman has a span that is greater than 2^d/2. The schedule computation time is proportional to O(2^d). Our Algorithm 4 improves upon previous algorithms by offering both the optimum number of time steps and optimum span, and by easily being amenable to blocking for message-passing communication libraries, while preserving the efficiency in using the communication links. Our Algorithms 2 and 3 have very simple control and are easier to implement than the algorithms in [1, 2]. The outline of this paper is as follows. In Section 2 we define the concepts used and ideas common to the algorithms presented in this paper. We also discuss pipelining of several AAPCs on different sets of cube dimensions, but with a shared set of local memory dimensions. The need for efficient pipelining of a sequence of AAPCs was our motivation for devising Algorithm 4.
In Section 3 we present our algorithms for a single AAPC. Section 4 discusses the use of the algorithms for a single AAPC in performing multiple AAPCs. An algorithm performing multiple concurrent AAPCs is presented. Section 5 presents an idea for minimizing the communications overhead in message-passing libraries, and Section 6 presents an idea for the efficient utilization of wide channels. Section 7 gives a summary of our results.

All-to-all personalized communication on binary cubes

2.1 Preliminaries
A binary n-cube has N = 2^n nodes. Each node has n neighbors which, with the conventional binary addressing scheme, correspond to the n different single-bit complementations of the bits in a node address. For two nodes a and b with addresses (a_{n−1} a_{n−2} ... a_0) and (b_{n−1} b_{n−2} ... b_0), the Hamming distance is Σ_{i=0}^{n−1} (a_i ⊕ b_i), where ⊕ denotes modulo-two addition. There are n edge-disjoint paths of length n between any pair of nodes at distance n. The local address space in each processing node, or processor for short, is K = {0, 1, ..., K−1}, and the global address is (j|i), where j is the node address and i the local memory address. The bits required for address encoding are sometimes referred to as dimensions.
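As a small illustration of this addressing scheme (a Python sketch with our own function names, not from the paper), both the Hamming distance and the neighbor set follow directly from bitwise XOR:

```python
def hamming(a: int, b: int) -> int:
    # Hamming distance: the number of 1-bits in the bitwise XOR,
    # i.e., the sum of (a_i XOR b_i) over all address bits.
    return bin(a ^ b).count("1")

def neighbors(a: int, n: int) -> list:
    # The n neighbors of node a in an n-cube differ from a in
    # exactly one address bit (a single-bit complementation).
    return [a ^ (1 << i) for i in range(n)]
```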
Definition 1 Let S_s = {0, 1, ..., d−1} and S_p ⊆ {d, d+1, ..., d+n−1}, where n is the number of cube dimensions. d is the number of cube dimensions involved in the permutation and also the number of local memory dimensions involved in the permutation. Furthermore, let f be a bijection S_s → S_p and g be a bijection S_p → S_s. An AAPC is a permutation defined by a_i → a_{f(i)} for all i ∈ S_s and a_j → a_{g(j)} for all j ∈ S_p.
Note that in our definition of the AAPC, |S_s| = |S_p|. S_s and S_p represent the set of local memory dimensions and processor dimensions involved in the AAPC, respectively. For simplicity in notation, we assume in the definition that all memory dimensions are involved in the AAPC. An example of an AAPC satisfying Definition 1 is the bit-reversal (a_7 a_6 a_5 a_4 a_3 | a_2 a_1 a_0) → (a_7 a_6 a_0 a_1 a_2 | a_3 a_4 a_5), where n = 5, d = 3, S_s = {0, 1, 2} and S_p = {3, 4, 5}. This bit-reversal operation, in fact, consists of four independent bit-reversals in three-dimensional subcubes. The subcubes are identified by the two leading processor address bits. Another example is the axis exchange (a_6 a_5 a_4 a_3 | a_2 a_1 a_0) → (a_2 a_1 a_0 a_3 | a_6 a_5 a_4), for which n = 4, d = 3, S_s = {0, 1, 2} and S_p = {4, 5, 6}. This axis exchange can formally be represented as (k, j|i) → (i, j|k), where i initially is encoded in the three local memory dimensions {0, 1, 2} and k initially is encoded in the processor dimensions {4, 5, 6}. If 2^d < K, then the AAPC is repeated ⌈K/2^d⌉ times. For instance, if K = 256 and the AAPC includes three processor dimensions (d = 3), then the permutation is repeated 2^{8−3} = 32 times. For clarity, we assume that K = 2^d in the remainder of this paper. Conversely, if there are processor dimensions not included in the AAPC, then, in fact, a number of AAPCs in disjoint subcubes is specified, as shown in the examples above. The subcube for each AAPC is identified by the processor address bits not included in the specification of the AAPC.
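The dimension permutation of Definition 1 can be sketched as follows (Python; the function name and the dictionary encoding of f and g are our own illustrative choices). The example is the bit-reversal above, where both f and g map position p to 5 − p, and processor bits 6 and 7 only identify the subcube:

```python
def apply_aapc(addr: int, f: dict, g: dict) -> int:
    # Definition 1: the new bit in position i (i in S_s) is the old
    # bit f(i), and the new bit in position j (j in S_p) is the old
    # bit g(j); bits outside S_s and S_p are left unchanged.
    out = addr
    for dst, src in {**f, **g}.items():
        bit = (addr >> src) & 1
        out = (out & ~(1 << dst)) | (bit << dst)
    return out

# Bit-reversal example: S_s = {0,1,2}, S_p = {3,4,5}.
f = {0: 5, 1: 4, 2: 3}
g = {3: 2, 4: 1, 5: 0}
assert apply_aapc(0b000111, f, g) == 0b111000
```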
The relative address of a local memory address i in processor j with respect to its destination is j ⊕ i.
A homogeneous communication schedule has schedules for each node that only depend on the relative addresses. Thus, in a homogeneous schedule, node i sends data to node i ⊕ j in the same dimension and step as node 0 sends data to node j. The message path from node i to node i ⊕ j is a translation (modulo-two addition) with respect to node i of the path from node 0 to node j. We only consider homogeneous schedules. For such schedules, it is sufficient to consider schedules for node 0. All our schedules are based on minimum-length path routing of all elements in a node and use all ports of every node in every step, except possibly the last step.
Definition 2 Let a data element i leave the source during step t_1 and arrive at the destination during step t_s ≥ t_1. Then span(i) = t_s − t_1 + 1. The span of an AAPC is the maximum span for any data element, i.e., span = max_i span(i).
Thus, the span for element i is the number of exchanges required for it to move from source to destination, including exchanges during which the element may be waiting en route.
A necklace is the set of addresses obtained from a given address by all cyclic rotations. An address is cyclic if some nontrivial rotation leaves it unchanged, and noncyclic otherwise. The necklace of a noncyclic address is full, i.e., it contains d distinct addresses, and one of its members is designated the distinguished address. A q-necklace is a necklace in which each address has q bits equal to 1.

Algorithm organization
We first consider the organization of algorithms for axes exchanges of the form (j|i) → (i|j), then discuss permutations of the form (j|i) → (f(i)|g(j)).
Table 1: Data distribution for AAPC in a 3-cube, after the alignment in phase 1.

Axes exchanges: (j|i) → (i|j)
The permutation (j|i) → (i|j), where j and i have the same number of bits, is equivalent to the transposition of a matrix stored with one row per processor to the storage of the matrix with one column per processor. The permutation can also be viewed as changing the allocation of a one-dimensional array from consecutive storage to cyclic storage [9], two storage forms included in Vienna Fortran [21] and Fortran D [4], and adopted in the emerging High Performance Fortran standard.
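The equivalence with the consecutive-to-cyclic conversion can be checked directly (a Python sketch under the paper's assumption K = 2^d, with N = 2^d processors; variable names are ours):

```python
d = 3
N = K = 1 << d
for x in range(N * K):
    # Consecutive layout: element x resides on processor j = x // K at
    # local address i = x % K, i.e., at global address (j|i).
    j, i = divmod(x, K)
    # Cyclic layout: element x resides on processor x % N at local
    # address x // N, i.e., at global address (i|j) -- exactly the
    # exchange (j|i) -> (i|j).
    assert (x % N, x // N) == (i, j)
```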
The routing algorithms we present use only direct addressing in the data exchanges between processors. All processors access the same local memory address during the same step. To accomplish this property, a local alignment precedes the interprocessor exchanges, and a realignment follows the exchanges. On the Connection Machine systems CM-2 and CM-200, avoiding indirect addressing in the exchange phase results in a significant speedup.
Phase 1: Local alignment. Sort the local data by relative address, i.e., perform the operation (j|i) → (j | j ⊕ i).
Table 1 shows the data distribution in a 3-cube after the alignment.The processor addresses are given in the decimal number system and the local memory addresses in the binary number system.
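The alignment of phase 1 is a purely local permutation. A Python sketch (variable names ours) reproduces the pattern of Table 1: after alignment, local (relative) address r on node j holds the element destined for node j ⊕ r:

```python
d = 3
N = 1 << d
# memory[j][i]: tag each element at global address (j|i) by its
# destination processor, which for (j|i) -> (i|j) is i.
memory = [[i for i in range(N)] for j in range(N)]
# Phase 1: local alignment (j|i) -> (j | j XOR i); new local address r
# receives the element from old local address j XOR r.
aligned = [[memory[j][j ^ r] for r in range(N)] for j in range(N)]
for j in range(N):
    for r in range(N):
        # Relative address r now holds the element destined for j XOR r.
        assert aligned[j][r] == j ^ r
```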
Phase 2: Interprocessor exchanges, scheduled as described in the following sections.

Phase 3: Local realignment. The realignment to restore the local memory order is identical to phase 1.
The alignment is performed on the memory dimensions included in the AAPC.
A memory address is subject to an exchange operation in the processor dimensions that correspond to nonzero bits in its relative address. Thus, local memory address zero retains its content throughout phase 2, whereas local memory address (11...1) needs to send its content across all cube dimensions (in any order). Similarly, local memory address (00...01) only exchanges its content with the neighboring node in the least significant cube dimension, while local memory address (11...10) needs to send its content across all but the least significant cube dimension.
In an AAPC including all local memory dimensions, either local memory address i or its complement ī needs to send its content across a given cube dimension. Thus, each complement pair (i, ī) of local memory addresses must send one element across each cube dimension. The pairing of local memory addresses is shown in Table 2.

Table 2: The initial data distribution viewed as complement pairs of local addresses after alignment in a 3-cube.

Axes exchanges with permutations: (j|i) → (f(i)|g(j))
The algorithm organization for axes exchanges (j|i) → (i|j) is easily generalized to permutations of the form (j|i) → (f(i)|g(j)), where f is a bijection on the local memory addresses and g a bijection on the processor addresses. The operation i → f(i) can be combined with the prealignment operation (phase 1) without a need for additional data motion. Similarly, the permutation j → g(j) can be performed as a local memory operation by combining the permutation with the postalignment operation (phase 3). For instance, if the processor addresses are encoded in a binary-reflected Gray code and the local memory addresses use the standard binary code, then an axes exchange preserving the encoding strategy requires no extra data motion. But the pre- and postalignment phases must be modified to include the code conversion from binary code to binary-reflected Gray code (f), and from binary-reflected Gray code to binary code (g = f^{−1}) [10, 14]. The functions f and g add to the complexity of the address calculation for the local permutations in the pre- and postalignment.

Multiple AAPCs
A sequence of AAPCs can be represented as a sequence of axis exchanges. For instance, in the FFT example given in the introduction, the sequence of AAPCs was represented by the axis exchanges (l, k, j|i) → (l, k, i|j) → (l, j, i|k) → (k, j, i|l).

Lemma 1
The alignments for AAPCs on different processor bits are independent, and can be performed all at once.
The sequence of AAPCs can therefore be organized into three phases as follows:

Phase 1: Local alignment, at once for all axes.
Phase 2: Interprocessor exchanges for the different AAPCs.
Phase 3: Local realignment, at once for all axes.
For the three-axes example, the first step is phase 1, the local prealignment. Phase 2 consists of steps two through four. Each step represents an AAPC within distinct subcubes defined by the different axes (sets of processor dimensions), starting with coordinate axis j. All the communications in the second phase are between elements with the same local memory address. The element initially in location (l, k, j|i) is routed to the final location (k, j, i|l) according to the path

(l, k, j|i) → (l, k, j | l⊕k⊕j⊕i) → (l, k, i | l⊕k⊕j⊕i) → (l, j, i | l⊕k⊕j⊕i) → (k, j, i | l⊕k⊕j⊕i) → (k, j, i|l).

Table 3 shows the data motion in detail for the permutation (k, j|i) → (j, i|k), where each axis is encoded in two bits. The row "mod 2 add" shows the index used for the alignment operations, i.e., k ⊕ j.
Figure 1 shows the motion of one set of data elements. The pairing of data elements is based on the local storage dimensions involved in the AAPCs. The pairing is the same as for a single AAPC. For a sequence of s AAPCs on d dimensions each, and a total of sd processor dimensions in the permutation, there are s coordinate axes (each encoded in d bits) and s steps in phase two. Denoting the source address by (a_s, a_{s−1}, ..., a_1 | a_0), the steps, including the pre- and postalignment, are as follows:

(a_s, a_{s−1}, ..., a_1 | a_0) →(1) (a_s, a_{s−1}, ..., a_1 | a_s⊕a_{s−1}⊕···⊕a_1⊕a_0)
→(2) (a_s, a_{s−1}, ..., a_2, a_0 | a_s⊕a_{s−1}⊕···⊕a_1⊕a_0)
→(3) (a_s, a_{s−1}, ..., a_3, a_1, a_0 | a_s⊕a_{s−1}⊕···⊕a_1⊕a_0)
...
→(s+1) (a_{s−1}, a_{s−2}, ..., a_1, a_0 | a_s⊕a_{s−1}⊕···⊕a_1⊕a_0)
→(s+2) (a_{s−1}, a_{s−2}, ..., a_1, a_0 | a_s)

The communication for successive AAPCs can be pipelined. Once a pair of local memory addresses has completed the communication for one AAPC, it is ready to proceed to the next AAPC. Minimizing the pipeline delay is equivalent to minimizing the span. In Section 4 we also show how the pipeline delay can be reduced by performing multiple AAPCs concurrently.
3 Algorithms for a single AAPC

The first three algorithms in this section have either optimum data transfer time or optimum span, but not both. Algorithm 4, the major contribution of this section, has both optimum data transfer time and optimum span, but is more complex.

Algorithm 1
A straightforward algorithm for AAPC is to schedule d complement pairs of addresses concurrently, by scheduling complement pair u in dimensions u, (u+1) mod d, (u+2) mod d, ..., (u+d−1) mod d. After d exchanges, the d pairs of addresses have performed all necessary communications. In each step exactly one of the two addresses in a pair must exchange its content with another processor. The address sending its data to another processor has the relative address bit equal to one for the dimension being exchanged. Table 4 shows the exchanges for the first three complement pairs of addresses for d = 3. For the first row of complement pairs of addresses in Table 4, the exchange sequence with respect to processor dimensions is 0, 1, and 2. The sequence for the second row of complement pairs of addresses is 1, 2, and 0. The sequence for the third row of complement pairs of addresses is 2, 0, and 1. Every channel is used in every step; each item follows a minimum-length path.
Table 4: The locations of the contents of the first three complement pairs of local addresses after each exchange step in a 3-cube. The position of the entry in the column "Dim" (top or bottom) indicates the address of the data element subject to exchange.
The pseudocode below serves to illustrate the simplicity in implementing the algorithm for phase two of an AAPC performed on the d least significant processor dimensions. The innermost loop performs the concurrent exchanges in all processor dimensions. Index u enumerates different complement pairs of local memory addresses in a concurrent exchange. Index r enumerates the exchanges required for each complement pair of memory addresses. Loop index q enumerates different sets of d complement pairs of addresses, while the forall statement specifies all binary cube processors. The memory accesses over time and the loop ordering are illustrated in Figure 2 for d = 5. By unrolling the loop on q in Algorithm 1, the exchanges corresponding to a few diagonal blocks in Figure 2 can be made in the same exchange step, thereby reducing the number of exchange steps and potentially the overhead in a message-passing communication library. The number of element transfers in sequence is unaffected by such a schedule change. Blocking of element exchanges is discussed further in Section 5. The correctness of the algorithm follows from the fact that each complement pair is scheduled in every dimension exactly once, and that the d pairs of a set are scheduled in distinct dimensions in each step.
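The loop structure described above can be sketched in Python as follows (a simulation with our own names, not the paper's pseudocode); it performs all three phases and checks that the result is the exchange (j|i) → (i|j):

```python
def aapc_algorithm1(d: int):
    # Simulate the exchange (j|i) -> (i|j) on a d-cube with K = 2**d.
    # data[j][m] holds the original global address (j, i) of the element.
    N = 1 << d
    data = [[(j, i) for i in range(N)] for j in range(N)]
    # Phase 1: local alignment (j|i) -> (j | j XOR i).
    for j in range(N):
        data[j] = [data[j][j ^ r] for r in range(N)]
    # Phase 2: complement pairs (m, ~m), scheduled in sets of d;
    # pair u of a set uses dimensions u, u+1, ..., u+d-1 (mod d),
    # one dimension per exchange step r.
    pairs = [(m, m ^ (N - 1)) for m in range(N // 2)]
    for q in range(0, len(pairs), d):
        for r in range(d):
            for u, (m0, m1) in enumerate(pairs[q:q + d]):
                dim = (u + r) % d
                # The pair member whose relative-address bit is set
                # in `dim` is exchanged across that dimension.
                m = m0 if (m0 >> dim) & 1 else m1
                for j in range(N):            # all nodes, concurrently
                    k = j ^ (1 << dim)
                    if j < k:                 # each link exchanged once
                        data[j][m], data[k][m] = data[k][m], data[j][m]
    # Phase 3: local realignment, identical to phase 1.
    for j in range(N):
        data[j] = [data[j][j ^ r] for r in range(N)]
    return data

# After the AAPC, node i holds, at local address j, the element that
# originated at global address (j|i).
result = aapc_algorithm1(3)
assert all(result[i][j] == (j, i) for i in range(8) for j in range(8))
```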

Algorithm 2
Table 5 defines Algorithm 2. Complement pair u of local memory addresses is exchanged in processor dimension r during time step (u + r) mod (K/2), for 0 ≤ r < d and 0 ≤ u < K/2. The span is K/2. The AAPC schedule in [2] has a span of order O(2^d) and optimal channel utilization. The large span is a disadvantage when several AAPCs on different processor address bits must be performed.
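A short Python check of this schedule (our own sketch, names ours): at every time step the d scheduled pairs use d distinct dimensions, and the total number of time steps is K/2:

```python
d = 4
half = 1 << (d - 1)      # K/2 complement pairs, with K = 2**d
steps = {}
for u in range(half):    # complement pair u ...
    for r in range(d):   # ... is exchanged in dimension r
        steps.setdefault((u + r) % half, []).append(r)
# K/2 time steps in total, each using every dimension exactly once.
assert len(steps) == half
for dims in steps.values():
    assert sorted(dims) == list(range(d))
```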

Algorithm 3
Optimal channel utilization with a span of d + (K/2 mod d) is easily obtained by applying Algorithm 2 to d + (K/2 mod d) pairs of memory addresses and Algorithm 1 to all other pairs of memory addresses. We refer to this combined algorithm as Algorithm 3.

The communication complexity of Algorithms 1-3
The characteristics of the above algorithms are summarized in Table 6. Each algorithm has either optimum span or an optimum number of time steps, but not both.

Algorithm 4
Algorithms 1, 2, and 3 schedule complement pairs of local addresses together. In Algorithm 4 we supplement this scheduling strategy with one based on whether or not an address is noncyclic. We first consider a possible scheduling of noncyclic addresses, then a possible scheduling of cyclic addresses. Considering noncyclic and cyclic addresses separately would yield an algorithm with optimum span, but in general nonoptimum data transfer time. By scheduling some cyclic addresses together with some noncyclic addresses, an algorithm with both optimum span and optimum data transfer time can be devised.

Scheduling of noncyclic addresses
The exchanges for the three local memory locations (001), (010), and (100) can be done concurrently. Similarly, locations (011), (110), and (101) can be scheduled concurrently with respect to the least significant bit of (011), the second least significant bit of (110), and the most significant bit of (101). The communication for the remaining bit that is one in each of these addresses can also be scheduled concurrently. In general, for a full q-necklace with bits i_0, i_1, ..., i_{q−1} equal to one, the contents of the distinguished address is exchanged in dimensions i_0, i_1, ..., i_{q−1}. Within the q-necklace, address r, obtained through an r-step left cyclic rotation of the distinguished address, is subject to exchanges in dimensions (i_0 + r) mod d, (i_1 + r) mod d, ..., (i_{q−1} + r) mod d. For the 1-necklace with distinguished address (001), the described schedule yields a single exchange in dimension i_0 = 0 for the distinguished address. For address r in the 1-necklace, the exchange is in dimension r. For the 3-necklace {0111, 1110, 1101, 1011}, the schedule yields the exchanges

Address  Exchange dimensions
0111:    0 1 2
1110:    1 2 3
1101:    2 3 0
1011:    3 0 1

Lemma 2 The set of addresses forming a full q-necklace can be scheduled to complete the required permutation in q exchange steps, which is optimal. The span is q ≤ d, and all communication channels are used in each exchange step.
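This schedule can be generated mechanically. The Python sketch below (function name ours) reproduces the 3-necklace schedule for d = 4 and checks that each exchange step uses all d channels:

```python
def necklace_schedule(distinguished: int, d: int) -> dict:
    # Exchange dimensions for each member of a full necklace: member r
    # (the distinguished address rotated left by r steps) uses
    # dimensions (i_k + r) mod d, where the i_k are the 1-bit
    # positions of the distinguished address.
    mask = (1 << d) - 1
    rot = lambda a, r: ((a << r) | (a >> (d - r))) & mask if r else a
    ones = [i for i in range(d) if (distinguished >> i) & 1]
    return {rot(distinguished, r): [(i + r) % d for i in ones]
            for r in range(d)}

sched = necklace_schedule(0b0111, 4)
assert sched[0b0111] == [0, 1, 2]
assert sched[0b1110] == [1, 2, 3]
assert sched[0b1101] == [2, 3, 0]
assert sched[0b1011] == [3, 0, 1]
# In each of the q = 3 steps, the d = 4 addresses use 4 distinct channels.
for k in range(3):
    assert sorted(dims[k] for dims in sched.values()) == [0, 1, 2, 3]
```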
The presented schedule for noncyclic addresses yields optimum span and full utilization of all communication channels. Cyclic addresses cannot be scheduled optimally using the same idea. In our Algorithm 4, we schedule one suitably chosen full necklace of addresses together with a set of complement pairs of cyclic addresses. All other noncyclic addresses are scheduled as described above. Before considering the scheduling of cyclic addresses, we compare the scheduling of noncyclic addresses based on the idea of necklaces with scheduling based on complement pairs of addresses.

Lemma 3 The q-necklace obtained through bit-complementation of a full (d−q)-necklace is also a full necklace, and distinct from it, if q ≠ d/2.
Corollary 1 The addresses in any full q-necklace, q ≠ d/2, and its complement (d−q)-necklace, can be scheduled together to complete the required data motion in d exchange steps.
Thus, for d odd, schedules of noncyclic addresses based either on necklaces or on complement pairs of addresses both yield optimum utilization of all d communication channels. The maximum span for scheduling based on necklaces is d−1, while the span for schedules based on complement pairs is d. For q = d/2, not every q-necklace has a matching complement necklace. For instance, the address (0011) and its complement (1100) belong to the same necklace. Thus, for d even, scheduling based on complement pairs of addresses may not yield full utilization of all communication channels for all noncyclic addresses.

Scheduling of cyclic addresses
Cyclic addresses are scheduled as complement pairs of addresses. The bit-complement of a cyclic address is also cyclic. Scheduling cyclic addresses as complement pairs of addresses yields full utilization of all d communication channels for all such pairs only when the number of cyclic addresses is a multiple of 2d. Let the number of pairs of cyclic addresses be p and let c = p mod d. Then, all but c pairs of cyclic addresses are scheduled as blocks of d complement pairs in exchange sequences with span d, utilizing all d communication channels of every processor in every exchange step. The remaining c complement pairs of cyclic addresses are scheduled together with the addresses of one full (d−c)-necklace. For example, for d = 3, c = 1 and the complement pair of cyclic addresses (000) and (111) are scheduled together with the 2-necklace (011), (110), and (101). The remaining addresses all belong to the 1-necklace (001), (010), and (100), which is scheduled as a full necklace with all exchanges completed during a single cycle. The schedule for the cyclic addresses and the 2-necklace is shown below.
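The classification into cyclic and noncyclic addresses is easy to state in code (a Python sketch, function name ours); for d = 3, only (000) and (111) are cyclic, matching the example above:

```python
def is_cyclic(a: int, d: int) -> bool:
    # An address is cyclic if some nontrivial left rotation maps it
    # onto itself; otherwise its necklace is full (d distinct members).
    mask = (1 << d) - 1
    return any((((a << r) | (a >> (d - r))) & mask) == a
               for r in range(1, d))

assert [a for a in range(8) if is_cyclic(a, 3)] == [0b000, 0b111]
assert [a for a in range(16) if is_cyclic(a, 4)] == [0b0000, 0b0101,
                                                     0b1010, 0b1111]
```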

The complete algorithm
We are now ready to define Algorithm 4 as follows: schedule blocks of d complement pairs of cyclic addresses together; schedule the remaining c complement pairs of cyclic addresses with the addresses of a full (d−c)-necklace; schedule all remaining noncyclic addresses by scheduling the d members of each remaining full necklace together.
It follows from the previous discussion that this algorithm fully utilizes all communication channels of every processor during every exchange step and has a span of d. We construct the schedule for the c pairs of cyclic addresses that are combined with the addresses in a full (d−c)-necklace from a d × d generating table. The columns of the table correspond to exchange steps, and the table entries correspond to the dimensions scheduled for that time step. The task is to associate d of the d + 2c local memory addresses that must be scheduled in every exchange step with the d table entries in a column. The first row of the table consists of the numbers 0, 1, ..., d−1. Each other row is a one-step left cyclic rotation of the previous row. Thus, every dimension is used in every exchange step. We arbitrarily associate the last c rows of the generating table with the c complement pairs of cyclic addresses. Each complement pair is clearly subject to an exchange in every dimension during d exchange steps, as required. The remaining d−c rows of the table are used to create a schedule for the d addresses of the full (d−c)-necklace. We choose the full (d−c)-necklace to be the necklace with distinguished address (0^c 1^{d−c}), where 1^{d−c} denotes d−c consecutive 1-bits. There exist several schedules with span d and an optimal number of element transfers. Any valid schedule must associate an address of the necklace with at most one entry in a column and with precisely d−c columns. Furthermore, all table entries associated with an address must be unique, and the sets of table entries for different addresses must be rotations of each other. Each table entry can only be assigned to one address. The schedules can be illustrated by drawing lines in the generating table. Each line intersects d column entries and represents all addresses in the full (d−c)-necklace. Each entry on the line represents an address, and the table entry gives the dimension in which that address is exchanged during the time step given by the table column. In total, d−c lines must be drawn, since each address of the (d−c)-necklace must be scheduled in d−c dimensions.

Table 7: Exchange dimensions for type-2 scheduling for d = 5 and q = 4.
Figure 3 gives the generating table for d = 5 with two different schedules, with arrows representing the lines. In the left half of the figure, the lines start on the diagonal and progress to the right within a row, cyclically. The dimension associated with entry j, 0 ≤ j < d, of line i, 0 ≤ i < d−c, is (2i + j) mod d, by construction of the generating table and the lines. Thus, for any j, all d−c entries for 0 ≤ i < d−c are unique for d odd. Member j of the necklace has the address bits set which correspond to the table entries for j on the different lines. Thus, for d = 5, d−c = 4, and for j = 0, j = 1, etc., the addresses are {(10111), (01111), (11110), (11101), (11011)}. For d even, the same scheme can be used for c ≥ d/2. For c < d/2, the first d/2 lines can be drawn as for d odd, while a skew of two columns is used for line i = d/2 in order to preserve uniqueness of all entries for j. The schedule corresponding to the right half of Figure 3 is obtained by drawing the lines vertically in a "greedy" manner. Thus, the first line is drawn from the top of the first column down to the (d−c)th row, then c columns to the right. The second line starts in the second column, goes vertically through d−c−1 rows, then turns right through c columns, followed by a vertical downturn to include one more row. Subsequent lines are drawn in a similar way by proceeding vertically one row less than the previous line, then proceeding horizontally through c columns, then proceeding vertically to the last row reserved for the full necklace. The schedules corresponding to Figure 3 are shown in Table 7. The first row corresponds to j = 0, the second row to j = 1, etc. In summary, in Algorithm 4, local memory addresses are partitioned into cyclic and noncyclic addresses. The number of cyclic addresses is two if d is prime; otherwise it is of order O(2^{d/2}).

Theorem 1 An all-to-all personalized communication on d processor dimensions can be performed in K/2 time steps for all-port communication. The maximum span is d.
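The generating table and the diagonal lines of the left half of Figure 3 are straightforward to construct. A Python sketch (ours) for d = 5 and c = 1 verifies the (2i + j) mod d rule and the uniqueness of the entries for d odd:

```python
d, c = 5, 1
# Generating table: row 0 is 0..d-1; each subsequent row is a one-step
# left cyclic rotation of the previous row, so table[r][col] = (r + col) % d.
table = [[(r + col) % d for col in range(d)] for r in range(d)]
# Diagonal lines: line i starts at column i of row i and proceeds to the
# right, cyclically; its entry j sits at column (i + j) % d.
for i in range(d - c):
    for j in range(d):
        assert table[i][(i + j) % d] == (2 * i + j) % d
# For d odd, the d-c entries for a fixed j are unique, so the necklace
# members can be scheduled without conflicts.
for j in range(d):
    assert len({(2 * i + j) % d for i in range(d - c)}) == d - c
```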

Algorithms for multiple AAPCs
For multiple AAPCs with the same set of local memory dimensions, pipelining can be applied to increase the utilization of the communication channels in all-port communication. If each AAPC fully utilizes the communication channels, then the only source of inefficiency is the pipeline delay. If there are additional storage dimensions, then the pipeline delay may be avoided by performing multiple, concurrent AAPCs. Below we first comment on the pipelined approach, then outline an algorithm for multiple concurrent AAPCs (Algorithm 5).

Pipelining of AAPCs
Pipelining the communications for a succession of AAPCs is conceptually straightforward when all the AAPCs involve the same number of local memory and processor dimensions.The communication complexity for pipelined AAPCs follows from Theorem 1.

Corollary 2
The number of time steps for s all-to-all personalized communications in succession on different sets of d processor dimensions, and the same set of local memory dimensions, is K/2 + (s − 1)d for a local data set of size K, and all-port communication.
If the succession of AAPCs is to be performed on nonoverlapping subcubes of different order, then the situation is more complex. If the subcubes are of nonincreasing order, and the local memory bits for AAPCs on smaller subcubes form a subset of the d_1 bits for the first AAPC, then an extra delay is introduced whenever the number of dimensions in an AAPC does not divide the number of dimensions in the previous AAPC. Assume that the number of dimensions for the ith AAPC is d_i. Then, the extra delay introduced by the ith AAPC in the case of Algorithm 1 is d_i − gcd(d_{i−1}, d_i). No realignment or change of pairing is required for Algorithm 1. For Algorithm 4, a change in the number of local memory dimensions involved in the AAPC affects the classification of addresses as cyclic and noncyclic, and hence their schedules. The regrouping for the ith AAPC may involve addresses from two blocks, each of which requires d_{i−1} exchange steps. Slightly more efficient, and simpler, algorithms for this case are contained in [13].
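The two counts above can be sketched directly (hypothetical helper names; K is the local data set size, and dims holds the nonincreasing subcube orders d_1, d_2, …):

```python
from math import gcd

def pipelined_steps(K, s, d):
    """Corollary 2: s pipelined AAPCs on disjoint sets of d
    processor dimensions take K/2 + (s - 1)d time steps."""
    return K // 2 + (s - 1) * d

def extra_delays(dims):
    """Extra delay of the i-th AAPC under Algorithm 1:
    d_i - gcd(d_{i-1}, d_i); it is zero whenever d_i divides
    the order d_{i-1} of the previous AAPC."""
    return [dims[i] - gcd(dims[i - 1], dims[i])
            for i in range(1, len(dims))]
```

For example, three pipelined AAPCs on 3-cubes with K = 16 take 8 + 2·3 = 14 steps, and for subcube orders 6, 3, 2 only the last transition (3 to 2) introduces a delay.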

Multiple concurrent AAPCs
Consider the exchange sequence (l, k, j | i, v) → (l, v, j | i, k) → (k, v, j | i, l) → (k, v, j | l, i) → (k, v, i | l, j) → (k, j, i | l, v). The resulting permutation is the same as for the exchange sequence (l, k, j | i) → (l, k, i | j) → (l, j, i | k) → (k, j, i | l). Thus, if there are local memory dimensions in addition to the ones included in the AAPCs, then the first AAPC can be performed on a processor axis other than the first, at the expense of one local memory reordering and of one repeated AAPC. The idea is illustrated in Figure 4 for six processor axes and one local memory axis. In Figure 4, each column represents d exchanges (as required for any pair of memory addresses). The number of memory addresses associated with a row in the figure depends upon the algorithm used for each AAPC. For Algorithm 1, 2d addresses are associated with each table row (compare Figure 2). For Algorithm 4, a table row corresponds to 2d memory addresses, except for the noncyclic addresses scheduled together with some of the cyclic addresses; the number of local memory addresses associated with this row is d + 2c. The data sets d_0^{{0,1}} and d_1^{{0,1}} have their first exchanges in dimensions other than the first. With the axes labeled from 0, the data sets d_0^{{0,1}} perform their first exchange with processor axis s − 2, while the data sets d_1^{{0,1}} perform their first exchanges on processor axis s − 4. The marked entry represents the exchange between the two memory axes initially represented by i and v, and corresponds to the exchange (k, v, j | i, l) → (k, v, j | l, i) in the example above.
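The claimed equivalence of the two exchange sequences is easy to verify by simulation (an illustrative encoding, not the paper's machinery): each state is a tuple of axis labels, memory axes first, and each exchange swaps two positions.

```python
def apply_swaps(state, swaps):
    """Apply a sequence of position swaps to a tuple of axis labels;
    memory axes come first, processor axes after them."""
    s = list(state)
    for a, b in swaps:
        s[a], s[b] = s[b], s[a]
    return tuple(s)

# First sequence: positions 0-2 are memory axes, 3-4 processor axes.
long_seq = [(1, 4), (0, 4), (3, 4), (2, 4), (1, 4)]
# Second sequence: positions 0-2 are memory axes, 3 the processor axis.
short_seq = [(2, 3), (1, 3), (0, 3)]
```

Both sequences carry (l, k, j | i) to (k, j, i | l); the longer one merely adds the detour through the extra axis v, which ends where it started.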
Algorithm 5 for multiple AAPCs divides the local data sets into groups scheduled together according to a suitable algorithm for a single AAPC. Then, s − 2 such data sets for s even, and s − 1 data sets for s odd, have their first exchange on an axis other than the first. Pairs of data sets have their first exchange on the same axis, and perform a local exchange prior to the exchange on the first processor axis. Data sets having their first exchange on the first processor axis have their exchanges pipelined. For multiple AAPCs, either performed as multiple concurrent AAPCs as outlined above or by pipelining multiple AAPCs, the following communication complexity can be derived. The first case above corresponds to pipelining of AAPCs; the other two cases apply to multiple, concurrent AAPCs. For a succession of AAPCs of different orders, algorithms are presented in [13].

Minimizing communications overhead
The scheduling algorithms above attempt to maximize the utilization of the communications bandwidth, i.e., the algorithms strive to communicate on all channels in every exchange step. No attention was paid to a possible overhead in communicating elements. In many message-passing communication libraries, the overhead associated with each message is substantial. It is therefore of interest to organize the element transfers into block transfers, with each such block being subject to one communications overhead. Below we present a blocking procedure applicable to any algorithm. The procedure yields the minimal number of block transfers, and the minimal block size for this number of block transfers. Applying the blocking procedure to Algorithm 4 yields optimum span, optimum communication channel utilization, and a minimum number of block transfers in sequence.

We construct the blocks in the form of a table of local memory addresses where all entries in a row can be communicated as one block. Thus, an address can only appear once in a row, and a block cannot be extended beyond a single row. The number of different rows into which an address is entered is equal to the number of exchange steps required for the address. The minimum number of block transfers is equal to the number of rows. Whether or not the minimum number of block transfers can be realized in an implementation depends on the sizes of the communications buffers. Clearly, it is of interest to minimize the maximum block size.

Before defining the procedure, we give an example. For Algorithm 4, a table with d rows (the minimum) can be constructed as shown in Table 8. R(i) denotes the necklace of addresses derived from rotations of i, and C(i) denotes the pair consisting of address i and its bitwise complement. Thus, C(11111) denotes the addresses (00000) and (11111). A group is a set of addresses that are scheduled together during d exchange steps in a nonblocked algorithm. Group 1 consists of the c complement pairs of cyclic addresses and the associated (d − c)-necklace of addresses scheduled together during d exchange steps. The columns Group 2 and Group 3 each contain the addresses of two full necklaces. The necklaces in a column form d complement pairs of addresses scheduled during d exchange steps. The last column contains a single necklace for which a single exchange suffices. The maximum block size is four in this example. In Algorithm 1, d complement pairs of addresses are scheduled together during each of d exchange steps. Each such group of addresses may form one column of our table for blocking of data exchanges. The number of groups and the block size is ⌈K/(2d)⌉.
Applying our blocking strategy to Algorithm 2 yields a table of addresses with K/2 rows, since the span is K/2. Fewer blocks yield bandwidth inefficiency for this algorithm.
In general, we partition the local memory addresses K = {0, 1, …, K − 1} into disjoint subsets K_1, K_2, …, K_σ (for some σ) with ∪_{s=1}^{σ} K_s = K and K_s ∩ K_r = ∅ for s ≠ r. The addresses belonging to different subsets are scheduled independently, while addresses within a subset are scheduled together. In the example above, the subsets can be defined as follows: K_1 = {C(11111), R(01111)}, K_2 = {R(00011)}, K_3 = {R(00111)}, K_4 = {R(00101)}, K_5 = {R(01011)}, and K_6 = {R(00001)}. Let T_s be the time required for subset K_s to complete the data motion. In our example, T_1 = 5, T_2 = T_4 = 2, T_3 = T_5 = 3, and T_6 = 1. By creating a table of addresses with T_max = max_i T_i rows, T_max block transfers are required for a maximum block size of ⌈T_sum/T_max⌉, where T_sum = Σ_i T_i.
The strategy in assigning partitions to table entries is that a partition K_s can appear in a row at most once, but must appear in precisely T_s rows. Note that if each subset fully uses the communications bandwidth, then the number of element transfers in sequence is the same for the blocked and nonblocked algorithms, i.e., T_sum.
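One way to realize this assignment (a greedy sketch under the stated constraints, with an invented helper name, not the paper's construction) is to fill the T_max rows in round-robin order of decreasing T_s:

```python
def block_table(times):
    """Assign partition s to exactly times[s] rows, at most once per
    row, using T_max = max(times) rows.  Filling consecutive rows
    (round-robin) in order of decreasing T_s places each partition's
    occurrences in distinct rows (times[s] <= T_max) and keeps every
    row's size near T_sum / T_max."""
    t_max = max(times)
    rows = [[] for _ in range(t_max)]
    r = 0
    for s in sorted(range(len(times)), key=lambda s: -times[s]):
        for _ in range(times[s]):
            rows[r % t_max].append(s)
            r += 1
    return rows
```

For the example above (T = 5, 2, 3, 2, 3, 1, so T_sum = 16 and T_max = 5), this produces five rows with maximum block size 4 = ⌈16/5⌉, consistent with the maximum block size of four noted above.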
Let |x| be the number of 1-bits in x. Clearly, if the schedule for each subset K_i fully uses all channels, then T_i = (Σ_{x∈K_i} |x|)/d and T_sum = Σ_i T_i = K/2. Thus, we have the following lemma.
Table 9 (the scheduling for d = 5 and b = 3):

  Group 1               Time   Group 2      Time   Group 3      Time   Group 4      Time
  {C(11111), R(01111)}   1     {R(00011)}    0     {R(00101)}    0     {R(00001)}    0
  {C(11111), R(01111)}   2     {R(00011)}    1     {R(00101)}    1
  {C(11111), R(01111)}   3     {R(00111)}    2     {R(01011)}    2
  {C(11111), R(01111)}   4     {R(00111)}    3     {R(01011)}    3
  {C(11111), R(01111)}   5     {R(00111)}    4     {R(01011)}    4

Wide channels

If the channel width is a multiple b of the width of a data item, then b·d addresses can be scheduled concurrently. Determining the optimum channel utilization is related to minimizing the maximum block size in an algorithm that blocks the data transfers. Table 9 shows a scheduling for d = 5 and b = 3. The columns headed by "Time" denote the time step during which an address is scheduled for exchange. The same time step appears precisely b = 3 times, except for the last step. The time step is determined by labeling the table entries row by row and, within each row, from right to left.

Proof: Multiple occurrences of cyclic addresses appear within the same column (group), since T_i = d for all such addresses. The blocking procedure guarantees that multiple occurrences of cyclic addresses appear within blocks of d rows and in the same column (group). Clearly, multiple occurrences in the same group cannot be scheduled during the same time step, since b ≤ ⌊K/(2d)⌋. For noncyclic addresses, T_i < d. If the same address appears in two adjacent groups, say in row i of group j and row i′ of group j + 1, then i′ < i.
When b ≥ ⌈K/(2d)⌉, the upper bound is the same as the lower bound: d.
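The labeling rule for wide channels can be sketched as follows (an illustrative helper, with each table row given by its entry count); for d = 5 and b = 3 it reproduces the time steps of Table 9:

```python
def time_steps(row_sizes, b):
    """Label the table entries row by row and, within each row, from
    right to left; the entry with label n is exchanged during time
    step n // b, so each step serves b entries (except the last)."""
    label = 0
    steps = []
    for size in row_sizes:
        row = []
        for _ in range(size):
            row.append(label // b)
            label += 1
        row.reverse()  # rightmost entry of the row got the smallest label
        steps.append(row)
    return steps
```

Each inner list gives the time steps of one table row from left to right, so `time_steps([4, 3, 3, 3, 3], 3)` matches the "Time" columns shown for d = 5 and b = 3.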

Summary
We have presented four schedules for a single all-to-all personalized communication, three of which are simple to implement. One algorithm (Algorithm 4) requires the optimal number of element exchanges in sequence, K/2, with a span of d. Combining Algorithms 3 and 4 for s all-to-all personalized communications in sequence, each on d dimensions, yields min(K/2 + (s − 1)d, (s + 3)d, K/2 + 2d) exchanges in sequence. The algorithms can be organized such that indirect addressing is not required in interprocessor data exchanges by carrying out pre- and post-alignment steps. By combining these alignments with other local permutations, the presented algorithms can perform permutations of the form (i | j) ↔ (g(j) | f(i)) with no increase in the required data motion. The presented blocking procedure preserves optimality with respect to element transfers for an optimal element-wise schedule. For instance, for Algorithm 4 the number of block transfers is d for a single AAPC, and the block size is ⌈K/(2d)⌉. Applying the same blocking procedure to the scheduling of exchanges when the channel width is a multiple b of the width of a message yields ⌈K/(2b)⌉ element transfers in sequence if b ≤ ⌊K/(2d)⌋, and d otherwise.
Algorithm 1 has been implemented on the Connection Machine model CM-2. The exchanges require 40 µsec per element compared to 62 µsec for the schedule in [2]. The total expense for alignment and realignment is about 0.9 µsec per element (four bytes). Hence, the simplified algorithms presented here may yield a speedup of up to 50% for a single AAPC, as well as a considerably reduced pipeline delay for multiple AAPCs.

Figure 1: Tracing a set of memory locations for an AAPC with explicit alignment.

Figure 2: Phase 2 of Algorithm 1 proceeds in ⌈K/(2d)⌉ subphases with d exchanges per subphase. If d does not divide K, then the last subphase (d communication steps) does not fully use all communication channels.

Cyclic addresses are paired through bit-complementation and divided into blocks of d pairs. The remaining c pairs are scheduled together with d − c noncyclic addresses forming a (d − c)-necklace. All remaining noncyclic local memory addresses can be scheduled as blocks of d complement pairs of addresses.

Figure 4: The exchange sequences for six AAPCs in sequence using arbitrary starting dimensions and pipelining.

Lemma 5
If b ≤ ⌊K/(2d)⌋, then an AAPC of order d with channel bandwidth b can be completed in d⌈K/(2db)⌉ time steps, which is optimal.

Table 3: The global memory state for an AAPC with explicit alignment.

Table 5: A schedule for optimal channel utilization (Algorithm 2) for an AAPC on five processor dimensions.

Table 6: The number of element transfers, the span, and the number of local memory addresses scheduled together in Algorithms 1-3 on a d-cube.
Theorem 2 A sequence of s all-to-all personalized communications, each on d processor dimensions, can be accomplished in a time of, at most, min(K/2 + (s − 1)d, (s + 3)d, K/2 + 2d).

Table 8: The scheduling for block exchanges for d = 5.

Table 9: The scheduling for d = 5 and b = 3.

Lemma 4 Algorithm 3 can be organized into d + (K/2 mod d) ≤ 2d − 1 block exchanges with maximum block size ⌈K/(2(d + K/2 mod d))⌉, which is optimal. Algorithm 4 can be organized into d block exchanges with maximum block size ⌈K/(2d)⌉, which is also optimal.