Optimal All-to-All Personalized Communication with Minimum Span on Boolean Cubes

All-to-all personalized communication is a class of permutations in which each processor sends a unique message to every other processor. We present optimal algorithms for concurrent communication on all channels in Boolean cube networks, both for the case with a single permutation, and the case where multiple permutations shall be performed on the same local data set, but on different sets of processors. For K elements per processor our algorithms give the optimal number of element transfers, K/2. For a succession of all-to-all personalized communications on disjoint subcubes of β dimensions each, our best algorithm yields K/2 + σ − β element exchanges in sequence, where σ is the total number of processor dimensions in the permutation.


Introduction
We give simple, yet optimal, schedules for all-to-all personalized communication on Boolean cubes with concurrent communication on all channels of every processor. An example of an architecture that allows for such communication is the Connection Machine. The schedules avoid indirect addressing in the data exchanges by performing a local alignment of the data in each processor prior to, and after, the data interchanges between processors. In addition to optimal utilization of communication channels we are also concerned with the duration of the orbits of the elements, i.e., the time from the first motion of an element until it reaches its destination. The orbit length is important for pipelining several all-to-all personalized communications (AAPC).

With K elements per processor and AAPC in σ-cubes, our first algorithm requires K/2 + (K/2 mod σ) element transfers in sequence for a single AAPC, except if K/2 mod σ = 0. Then, the number of element transfers is optimal, K/2. The orbit length for all pairs of elements is σ. One or the other element in a pair is always exchanged. Our second algorithm has the minimal number of element exchanges for any K and σ, but the maximum orbit length is σ + (K/2 mod σ). Our third algorithm illustrates how pipelining can be combined with exchange sequences starting in arbitrary dimensions to yield an optimal number of element exchanges in sequence. But, the orbit length is K/2. Our last algorithm requires K/2 element transfers in sequence for any K, and has a maximum orbit length of σ.

All-to-all personalized communication is a frequently used class of permutations in multiprocessor systems. Examples of all-to-all personalized communication are bit-reversal, vector reversal, matrix transposition, shuffle permutations, and conversion between cyclic and consecutive mapping [5] for allocations of the arrays such that a number of storage dimensions are exchanged with the same number of processor dimensions.
All-to-all personalized communication can also be combined with code conversion, such as conversion between binary code and binary-reflected Gray code [6, 10], which is often used for encoding of arrays on Boolean cubes. As an example of the use of a succession of AAPC's, consider the computation of a high-radix FFT (Fast Fourier Transform) on a large multiprocessor with relatively few data points per processor. It can be performed through a sequence of local FFTs. This requires a sequence of all-to-all personalized communications on the same local memory address bits, but on different processor address bits.

Saad and Schultz [12] have suggested a recursive AAPC algorithm based on 2^σ translated binomial trees. The algorithm requires σK/2 element transfers in sequence. In [7] we presented algorithms based on balanced trees and rotated binomial trees, both attaining the minimal number of element transfers in sequence, K/2, if K is a multiple of σ. For other values of K the algorithms are optimal within a small constant factor (24% for σ > 4) [3]. Independently, Stout and Wagar [13] gave an algorithm with the same complexity. Later, Bertsekas et al. [1] presented an algorithm optimal for any K. A detailed optimal scheduling algorithm has been presented by Edelman [2], who also implemented the algorithm on the Connection Machine. The algorithm uses indirect addressing, and has orbit lengths of order O(2^σ). Varvarigos and Bertsekas [14] related AAPC on hypercubes and tori to a matrix decomposition problem.
2 All-to-all personalized communication on Boolean cubes

2.1 Preliminaries

A Boolean n-cube has N = 2^n nodes. Each node has n neighbors, which with the conventional binary addressing scheme correspond to the n different single-bit complementations of the bits in a node address. We use ⊕ to denote the bitwise exclusive-or operation. The local address space in each node is {0, 1, …, 2^k − 1}, and the global address is (i|j), where i denotes the processor address and j the memory location.
Definition 1 Let S_s ⊆ {0, 1, …, k−1}, S_p ⊆ {k, k+1, …, k+n−1}, |S_s| = |S_p|, f be a one-to-one mapping S_s → S_p, and g be a one-to-one mapping S_p → S_s. An AAPC is a permutation defined by a_i → a_f(i) for all i ∈ S_s and a_j → a_g(j) for all j ∈ S_p.
An example of the bit-reversal permutation is (a5 a4 a3 | a2 a1 a0) → (a0 a1 a2 | a3 a4 a5), and that of the matrix transposition is (a5 a4 a3 | a2 a1 a0) → (a2 a1 a0 | a5 a4 a3). Both are examples of AAPC. Let |S_s| = σ ≤ min(k, n). If σ < k then the permutation is repeated 2^(k−σ) times. For instance, if there are eight bits for the local memory, k = 8, and three are part of the permutation, σ = 3, then the permutation is repeated 2^(8−3) = 32 times. Similarly, if there are processor dimensions not included in the permutation, then the permutation consists of a number of permutations in disjoint subcubes. Each such subcube is identified by the address bits not included in the permutation.

The relative address of a local memory address i in processor j with respect to its destination is j ⊕ i, where ⊕ is performed bit-wise. A homogeneous scheduling has communication schedules for each node that only depend on the relative addresses. This implies that node i sends data to node i ⊕ j in the same dimension and step as node 0 sends data to node j. The message path from node i to node i ⊕ j is a translation with respect to node i (i.e., exclusive-or) of the path from node 0 to node j. We only consider homogeneous schedules. For such schedules it is sufficient to consider schedules for node 0.
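The translation property of homogeneous schedules is easy to check directly. The following is a minimal sketch (our own illustration, not code from the paper; the function name relative_address is an assumption):

```python
# Sketch: the "relative address" and translation property of
# homogeneous schedules on a Boolean n-cube.  Node addresses are
# n-bit integers; ^ is the bitwise exclusive-or.

n = 3
N = 1 << n

def relative_address(src, dst):
    """Relative address of dst with respect to src (bitwise xor)."""
    return src ^ dst

# Homogeneity: node i sends to node i ^ j in the same dimension and
# step as node 0 sends to node j, so schedules for node 0 suffice.
for i in range(N):
    for j in range(N):
        assert relative_address(i, i ^ j) == relative_address(0, j)
```

Because xor is its own inverse, the path from node i to node i ⊕ j is exactly the node-0-to-j path translated by i.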
Definition 2 The span for message j, span(j), is the number of exchanges from the step during which the message starts its path through the cube until it arrives at its destination in a single AAPC. Formally, if the message leaves the source during step t1 and arrives at the destination during step ts ≥ t1, then span(j) = ts − t1 + 1. The span of the communication for a single AAPC is the maximum span for any message in the communication, i.e., span = max_j span(j).
In the following, we consider AAPC of the form (j|i) → (i|j). Based on this, algorithms for the general AAPC (j|i) → (g(j)|f(i)) can be derived.

Routing for the permutation
The permutation (j|i) → (i|j) is equivalent to the transposition of a matrix stored with one row per processor into a storage with one column per processor. The permutation can also be viewed as changing the allocation of a one-dimensional array from consecutive storage to cyclic storage [5]. The routing algorithms we present use only direct addressing in the data exchanges between processors. All processors access the same (local) address during the same step. To accomplish this property a local alignment precedes the exchanges, and a realignment follows them. Depending on the relative times for the alignment and realignment phases and the exchanges with direct and indirect addressing, avoiding indirect addressing in the exchanges may be desirable. On the Connection Machine, avoiding indirect addressing in the exchange phase results in a significant speed-up.
Phase 1: Alignment. Sort the local data by relative address, i.e., perform the operation (j|i) → (j|i⊕j), where i is the local memory address and j is the processor address.
Phase 2: Interprocessor exchange. The interprocessor exchange phase implements the operation (j|i) → (j⊕i|i).

Phase 3: Realignment. The realignment to restore the local memory order is identical to phase 1.
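The three phases can be simulated on a small cube to confirm that their composition realizes (j|i) → (i|j). The sketch below is our own (assuming n = k = 3); it represents each node's memory as a Python list and applies the three xor permutations in order:

```python
# Minimal simulation of the three-phase routing (j|i) -> (i|j) on a
# cube with n = k = 3: alignment, exchange, realignment.  Every
# permutation uses only xor, i.e., direct addressing.

n = 3
N = 1 << n

# mem[p][m] holds the element originally stored at processor p, location m.
mem = [[(p, m) for m in range(N)] for p in range(N)]

# Phase 1: alignment, (j|i) -> (j|i^j), a local permutation in each node.
mem = [[mem[p][m ^ p] for m in range(N)] for p in range(N)]

# Phase 2: interprocessor exchange, (j|i) -> (j^i|i); location m of
# processor p receives the element at location m of processor p^m.
mem = [[mem[p ^ m][m] for m in range(N)] for p in range(N)]

# Phase 3: realignment, identical to phase 1.
mem = [[mem[p][m ^ p] for m in range(N)] for p in range(N)]

# The element that started at (processor j, location i) now resides at
# (processor i, location j).
for p in range(N):
    for m in range(N):
        assert mem[p][m] == (m, p)
```

Note that in phase 2 all processors access the same local address m in the same step, which is the direct-addressing property the alignment phases buy.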
The alignment is performed on the storage dimensions involved in the exchange. The treatment of other storage dimensions is arbitrary.

Pairing

A memory location is subject to an exchange operation in processor dimensions that correspond to nonzero bits in its (relative) address. Hence, local memory location zero retains its content throughout phase 2, whereas memory location (11…1) exchanges content with neighboring nodes in every dimension. Similarly, memory location (00…01) only exchanges its content with the neighboring node in the least significant processor dimension, while local memory location (11…10) exchanges content with all neighboring nodes, except the neighbor in the least significant processor dimension. For any local memory address i, either i or its bitwise complement ī is subject to exchange in any step of the algorithm. In the following we will always consider pairs of memory locations defined by (i, ī). If there are local storage dimensions in addition to the dimensions involved in the AAPC (K > 2^σ), then the pairing is simply repeated K/2^σ times. The pairing is always made on the dimensions included in the AAPC. Other dimensions are of no consequence with respect to contention for communication, or correctness.

Permutations for a set of σ pairs are considered together. The partitioning of all pairs into sets can be arbitrary. For instance, in a set of σ contiguous memory locations and their complements, pair s, 0 ≤ s < σ, is exchanged in dimensions s, (s+1) mod σ, (s+2) mod σ, …, (s+σ−1) mod σ. After σ exchanges the pairs have performed the necessary communications. Every channel is used in every step, and each item follows a minimum-length path. The procedure is repeated for the next σ pairs, etc. Algorithm 1 yields the following lemma.
Lemma 1 All-to-all personalized communication in a σ-cube for K local memory locations requires at most σ⌈K/(2σ)⌉ element exchanges in sequence with concurrent exchanges on all channels.
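The pairing schedule of Algorithm 1 can be sketched and checked mechanically. The code below is our own illustration (assuming σ = 3 and K = 2^σ); it schedules one subphase of σ pairs and verifies that every channel is used in every step:

```python
# Sketch of the pairing schedule: in step t, pair s is exchanged in
# dimension (s + t) % sigma, and the member of the pair with that
# address bit set is the one that moves.

sigma = 3
K = 1 << sigma
mask = K - 1

# Pair every memory location i with its bitwise complement.
pairs = sorted({tuple(sorted((i, i ^ mask))) for i in range(K)})
assert len(pairs) == K // 2            # four pairs for sigma = 3

# Schedule sigma pairs at a time; here the first sigma pairs.
group = pairs[:sigma]
used = {loc: [] for p in group for loc in p}
for t in range(sigma):
    dims = set()
    for s, (a, b) in enumerate(group):
        d = (s + t) % sigma
        loc = a if (a >> d) & 1 else b  # member with bit d set moves
        used[loc].append(d)
        dims.add(d)
    assert dims == set(range(sigma))    # every channel busy every step

# Each location is exchanged exactly in the dimensions set in its address.
for loc, ds in used.items():
    assert sorted(ds) == [d for d in range(sigma) if (loc >> d) & 1]
```

The fourth pair, (3, 4), falls into a second subphase that cannot occupy all three channels by itself, which is exactly the slight non-optimality discussed next.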
The correctness of the above algorithm follows from the fact that one member of each pair is exchanged in every dimension exactly once. If K/2 is not a multiple of σ, the last subphase contains fewer than σ pairs and does not fully use all dimensions. This property causes the algorithm to be slightly non-optimal with respect to channel utilization. However, the schedule is easily modified into an optimal schedule by having one subphase include σ + (K/2 mod σ) pairs. Consider the case σ = 3. There are a total of eight memory locations, or four pairs. By scheduling the four pairs as shown in Table 1, the optimal number of element transfers is achieved. The scheme is easily generalized to arbitrary σ. We refer to this algorithm as Algorithm 2.
Lemma 2 All-to-all personalized communication in a σ-cube requires K/2 element transfers in sequence by Algorithm 2, with concurrent exchanges on all channels and K elements per processor.
There are many schedules that yield an optimal utilization of communication channels. For instance, the schedule in Table 2 is also optimal with respect to channel utilization (Algorithm 3). But, the last σ − 1 pairs are in orbit for 2σ − 1 exchanges for an AAPC on σ bits. The schedule in [2] has a similar property. The long orbits are a disadvantage in the case several AAPC's on different processor address bits shall be performed.

The characteristics of the above algorithms are summarized in Table 3. For a single AAPC, Algorithms 2 and 3 have the fewest (optimal) number of element transfers in sequence. If several AAPC's shall be performed on distinct sets of processor dimensions, then it is desirable that the maximum span is σ in order to minimize the pipeline delay. In Section 3 we present an algorithm with a maximum span of σ, and K/2 element transfers in sequence.
3 Pipelining many all-to-all personalized communications

Table 3: The number of element transfers and the span in Algorithms 1-3 on a σ-cube.

Performing τ AAPC's in succession on distinct sets of processor dimensions by pipelined use of Algorithm 2 requires K/2 + (τ−1)(σ + K/2 mod σ) exchanges. Using the schedules below (Algorithm 4), with a span of σ, pipelining yields K/2 + (τ−1)σ exchanges in sequence. Compared to the pipelined usage of Algorithm 1 the gain may be σ − 1 exchanges, and compared to Algorithm 2, (τ−1)(K/2 mod σ) exchanges. This gain in efficiency is obtained at the expense of a more complex scheduling for some of the local elements. The pipeline delay can be further reduced as described in Section 3.3.4.

The schedules in Algorithms 1 through 3 grouped local storage addresses by complement pairs. In Algorithm 4 we will use this grouping strategy as well as one based on necklaces to accomplish a span of σ and an optimal number of element transfers in sequence for a single AAPC. A necklace [11] is a set of addresses derived from each other through rotations of some fixed bit string. A necklace is full if it has σ distinct addresses for a string of length σ. Otherwise, it is degenerate. The period of a bit string is the minimum number of rotations (> 0) required to generate a bit string identical to the unrotated string. An address in a degenerate necklace is cyclic. Addresses in full necklaces are non-cyclic. For instance, the address (0000) is cyclic with period one and the address (0101) is cyclic with period two; the addresses (0001) and (0011) are non-cyclic. The address with smallest value in the necklace is a distinguished address. A q-necklace is a necklace in which each address has q bits equal to 1.
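The necklace machinery can be made concrete with a few helper functions. This is a sketch of our own (the names rotate_left, period, and necklace are assumptions, not the paper's notation):

```python
# Classify sigma-bit addresses into necklaces and compute periods.
# An address is cyclic iff its period is less than sigma, i.e., it
# lies in a degenerate necklace.

sigma = 4

def rotate_left(a, s, width=sigma):
    """Left cyclic rotation of a width-bit address by s positions."""
    s %= width
    return ((a << s) | (a >> (width - s))) & ((1 << width) - 1)

def period(a):
    """Minimum number of rotations (> 0) reproducing the address."""
    return next(s for s in range(1, sigma + 1) if rotate_left(a, s) == a)

def necklace(a):
    """All distinct rotations of a; the necklace is full iff it has sigma members."""
    return {rotate_left(a, s) for s in range(sigma)}

assert period(0b0000) == 1 and period(0b0101) == 2   # cyclic addresses
assert period(0b0001) == 4 and period(0b0011) == 4   # non-cyclic
assert necklace(0b0011) == {0b0011, 0b0110, 0b1100, 0b1001}
```

The last assertion exhibits a full 2-necklace for σ = 4 with distinguished address (0011).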

Non-cyclic addresses
The exchanges for the three local memory locations (001), (010), and (100) obviously can be done concurrently. Similarly, locations (011), (110), and (101) can be scheduled concurrently with respect to the least significant bit of (011), the second least significant bit of (110), and the most significant bit of (101). The communication for the remaining bit that is one in each of these addresses can also be scheduled concurrently. The basic idea should now be clear: sets of memory locations with addresses being cyclic rotations of each other are scheduled concurrently.

Lemma 3 Addresses forming a full q-necklace can be scheduled to complete the permutation in q exchange steps, which is optimal. The span is q.
In any optimal scheduling according to the lemma all communication channels are used in every exchange. For instance, an optimal schedule for {0111, 1110, 1101, 1011} is:

step 1: 0111 in dimension 0, 1110 in dimension 1, 1101 in dimension 2, 1011 in dimension 3
step 2: 0111 in dimension 1, 1110 in dimension 2, 1101 in dimension 3, 1011 in dimension 0
step 3: 0111 in dimension 2, 1110 in dimension 3, 1101 in dimension 0, 1011 in dimension 1

Any full necklace and its complement, if distinct, can be scheduled together by pairing of addresses as in Algorithms 1 through 3. Hence, addresses in full q-necklaces can either be scheduled as blocks of σ addresses during q exchange steps, or as 2σ addresses during σ exchange steps by combining a full q-necklace with a full (σ−q)-necklace. For q = σ/2 not every q-necklace has a matching complement necklace. But, addresses in such σ/2-necklaces can be scheduled in pairs formed through complementation; such complement addresses belong to the same necklace. Hence, with the exception of addresses in at most one σ/2-necklace not having a distinct complement necklace, all addresses in full necklaces can be scheduled optimally by pairing of addresses through complementation.
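A rotation-based schedule of this kind can be generated and checked mechanically. The sketch below is our own (for σ = 4 and q = 3); it exchanges the p-step rotation of the distinguished address in dimension (p + t) mod σ during step t:

```python
# Generate and verify an optimal schedule for a full q-necklace.

sigma, q = 4, 3
dist = 0b0111                        # distinguished address, q low bits set

def rotate_left(a, s):
    s %= sigma
    return ((a << s) | (a >> (sigma - s))) & ((1 << sigma) - 1)

necklace = [rotate_left(dist, p) for p in range(sigma)]
assert sorted(necklace) == [0b0111, 0b1011, 0b1101, 0b1110]

used = {a: [] for a in necklace}
for t in range(q):
    dims = set()
    for p, a in enumerate(necklace):
        d = (p + t) % sigma
        assert (a >> d) & 1            # only exchange in a set dimension
        used[a].append(d)
        dims.add(d)
    assert dims == set(range(sigma))   # all channels busy in every step

for a, ds in used.items():             # each address: exactly its q set bits
    assert sorted(ds) == [d for d in range(sigma) if (a >> d) & 1]
```

Each of the σ addresses is exchanged q times over q steps, with all σ channels occupied in every step, matching Lemma 3.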

Cyclic addresses
The scheduling for non-cyclic addresses cannot be used for cyclic addresses. For instance, location (11…1) has period one, and must be matched with other locations in order to achieve maximum utilization of the communication channels. For every cyclic address the address formed through bit-complementation is also cyclic. Hence, pairs of cyclic addresses can be formed through bit-complementation (as in Algorithms 1 through 3). If the number of such pairs is σ, or a multiple thereof, then all cyclic addresses can also be scheduled to fully use the communication channels by scheduling σ pairs at a time. When the number of pairs of cyclic addresses is not a multiple of σ, then we combine the remaining σ − q pairs of cyclic addresses with a full q-necklace.

The schedule for the cyclic addresses combined with addresses in a full necklace is constructed based on a generating table. The first row of this table consists of the numbers 0, 1, …, σ−1. Each other row is a one-step left cyclic rotation of the previous row. Figure 1 shows the generating table for σ = 5. Columns of the generating table represent time steps. Complement pairs of cyclic addresses share a row. The table entries specify the exchange dimension for the addresses. One or the other address in a complement pair must be exchanged in every dimension. Every row contains all dimensions. Which element in a complement pair is exchanged depends upon which address in the pair has the address bit set to one for the considered dimension. The full q-necklace associated with the cyclic complement pairs consists of all addresses formed by cyclic rotations of a bit string of length σ with q consecutive bits set to one. The schedule for the addresses in the necklace is derived by drawing q lines, each of length σ, through the first q rows of the scheduling table. The rules for drawing the q lines are described as follows.
The first line is drawn from the top of the first column down to the qth row, then σ − q columns to the right. The second line, if any, starts in the second column, goes through q − 1 rows, then turns right through σ − q columns, followed by a vertical turn to include one more row. Thus, it is also of length σ. Subsequent lines are drawn in a similar way by proceeding vertically one row less than the previous line, then proceeding horizontally through σ − q columns, then proceeding vertically to the last row of the generating table reserved for the full necklace. The last line starts out horizontally.
The construction should be apparent from Figure 2. The first entry on each line corresponds to the distinguished address (0^(σ−q) 1^q), where 1^q denotes q consecutive 1-bits. The second entry on each line corresponds to a one-step left cyclic rotation of the distinguished address. The pth entry refers to an address which is the (p−1)-step left cyclic rotation of the distinguished address. For instance, for σ = 5 and q = 4, the first through the fifth entries refer to addresses (01111), (11110), (11101), (11011) and (10111), respectively. The value of the pth line entry defines the dimension in which the contents of the pth address is exchanged during the time step given by the column in which the pth entry occurs. The number of times an address is scheduled is equal to the number of lines, i.e., q. It is easy to see that the first time an address is scheduled, it is scheduled in the dimension that corresponds to dimension zero in the distinguished address. For each address the next higher dimension, modulo the number of dimensions, is scheduled the next time the address is scheduled for an exchange. Table 4 shows the schedule derived from Figure 2.

Lemma 5 There exists a set of addresses forming a full q-necklace such that any σ − q complement pairs of cyclic addresses can be scheduled together with the q-necklace to complete all communication in σ steps, which is optimal.
Proof: The q-necklace defined by addresses with q consecutive one-bits in cyclic order is non-cyclic for 0 < q < σ. Deriving the schedule from the table guarantees that all dimensions are scheduled in every time step, precisely one address is scheduled for a dimension in every time step, each complement pair is scheduled in every dimension, and every address in the full necklace is scheduled in each of the q dimensions with a bit set, and in no other dimension.
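The rotation structure of the generating table itself is simple to reproduce. The following sketch (our own, for σ = 5 as in Figure 1) confirms that every row and every time step (column) contains each dimension exactly once:

```python
# The generating table for sigma = 5: row 0 is 0..sigma-1 and each
# subsequent row is a one-step left cyclic rotation of the previous
# row, so entry (r, c) is (r + c) % sigma.

sigma = 5
table = [[(r + c) % sigma for c in range(sigma)] for r in range(sigma)]

assert table[0] == [0, 1, 2, 3, 4]
assert table[1] == [1, 2, 3, 4, 0]           # one-step left rotation

for row in table:                             # every row: all dimensions
    assert sorted(row) == list(range(sigma))
for c in range(sigma):                        # every column (time step) too
    assert sorted(table[r][c] for r in range(sigma)) == list(range(sigma))
```

This Latin-square property is what guarantees that each channel is used exactly once per step when rows are assigned to complement pairs and necklace lines.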
In summary, memory addresses are first partitioned into cyclic and non-cyclic addresses. The number of cyclic addresses is two if σ is prime; otherwise it is of order O(2^(σ/2)) [4]. Cyclic addresses are paired through bit-complementation and divided into blocks of σ pairs. The remaining σ − q pairs are scheduled together with non-cyclic addresses forming a q-necklace. All non-cyclic addresses can be scheduled individually, except the addresses scheduled with the cyclic addresses. Non-cyclic addresses can also be scheduled as complement pairs, except if there are no complement pairs forming distinct full necklaces. There are at most two such sets: one set forming a σ/2-necklace, if σ is even, and one set of addresses forming the complement of the addresses scheduled with cyclic addresses.
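The count of cyclic addresses can be verified by enumeration for small σ. This is a sketch of our own (helper names are assumptions):

```python
# Count cyclic sigma-bit addresses: an address is cyclic if some
# rotation by 0 < s < sigma reproduces it.

def rotate_left(a, s, width):
    s %= width
    return ((a << s) | (a >> (width - s))) & ((1 << width) - 1)

def is_cyclic(a, width):
    """True if a nontrivial rotation reproduces the address."""
    return any(rotate_left(a, s, width) == a for s in range(1, width))

def count_cyclic(width):
    return sum(is_cyclic(a, width) for a in range(1 << width))

assert count_cyclic(5) == 2          # sigma = 5 is prime: only 00000, 11111
assert count_cyclic(7) == 2
assert count_cyclic(4) == 4          # 0000, 1111, 0101, 1010
```

For prime σ only the all-zeros and all-ones addresses are cyclic, since the period of any bit string must divide σ; for composite σ the count is dominated by strings of period σ/2.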
Theorem 1 An all-to-all personalized communication on σ processor dimensions can be performed in K/2 element exchanges in sequence with concurrent communication on all ports of every processor. The maximum span is σ. The number of element exchanges in sequence for τ all-to-all personalized communications in succession on different sets of σ processor dimensions and the same set of local storage dimensions is K/2 + (τ−1)σ. For the case where the succession of AAPC's shall be performed on non-overlapping subcubes of different order, refer to [8].

Summary and Discussion
We have presented four schedules for a single all-to-all personalized communication, three of which are simple to implement. One algorithm (Algorithm 4) requires the optimal number of element exchanges in sequence, K/2, with a span of σ. The other three algorithms have either minimal maximal orbit lengths, or maximum channel utilization, but not both. Algorithm 1 has been implemented on the Connection Machine model CM-2. The exchanges require 40 μsec per element compared to 62 μsec for the schedule in [2]. The total expense for alignment and realignment is about 0.9 μsec per element (four bytes). Hence, the simplified algorithms presented here may yield a speed-up of up to 50% for a single AAPC, and a considerably reduced pipeline delay for multiple AAPC's. It should be noted that for a single AAPC on a σ-cube, Algorithm 4 can easily be blocked into σ communication steps with the optimal number of element transfers preserved [9]. The blocking procedure can also be used to generate optimal schedules for channel widths greater than the width of the data item [9].