An Efficient Algorithm for Gray-to-Binary Permutation on Hypercubes

Both Gray code and binary code are frequently used in mapping arrays into hypercube architectures. While the former is preferred when communication between adjacent array elements is needed, the latter is preferred for FFT-type communication. When different phases of a computation have different types of communication patterns, the need arises to remap the data. We give a nearly optimal algorithm for permuting data from a Gray code mapping to a binary code mapping on a hypercube with communication restricted to one input and one output channel per node at a time. Our algorithm improves over the best previously known algorithm [6] by nearly a factor of two and is optimal to within a factor of $n/(n-1)$ with respect to data transfer time on an $n$-cube. The expected speedup is confirmed by measurements on an Intel iPSC/2 hypercube.


Introduction
The availability of run-time systems and libraries supporting a shared address space, such as Express (by ParaSoft) [5], or a shared memory programming model, such as Linda [2], for programming distributed memory architectures, makes parallel programming transparent with respect to the physically distributed memory. The utility of these programming systems depends critically on the efficient implementation of the underlying communication primitives [1, 4, 7, 13] on each individual architecture.
Consider a one-dimensional array partitioned into $N$ blocks allocated evenly to the nodes of an $N = 2^n$ node hypercube. For the binary-code mapping of the $N$ blocks, block $i$, where $0 \le i < N$, is allocated to node $i$. For the Gray code mapping, block $i$ is allocated to node $G(i)$, the $i$-th code in the Gray code. For instance, for a binary-reflected Gray code [11] mapping on a 3-cube, blocks 0 through 7 are allocated to nodes 0, 1, 3, 2, 6, 7, 5 and 4, respectively. Thus, block $i$ is adjacent to blocks $i \oplus 2^j$ for all $0 \le j < n$ with the binary-code mapping, and is adjacent to blocks $(i-1) \bmod N$ and $(i+1) \bmod N$ with the Gray code mapping. While the adjacency offered by the binary-code mapping is preferred for FFT-type communications, the adjacency offered by the Gray code mapping is preferred for communications between neighboring blocks. The need to change between these two types of mappings arises when different phases of a computation exhibit different data reference patterns.
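The two mappings can be made concrete with a short Python sketch. The function names `gray` and `hamming` are illustrative and do not come from the paper; the expression `i ^ (i >> 1)` is the standard closed form of the binary-reflected Gray code.

```python
def gray(i):
    """Binary-reflected Gray code G(i) of a non-negative integer i."""
    return i ^ (i >> 1)

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

n = 3
N = 1 << n

# Node that holds block i under the Gray code mapping.
gray_nodes = [gray(i) for i in range(N)]
print(gray_nodes)  # [0, 1, 3, 2, 6, 7, 5, 4]

# Gray code mapping: cyclically consecutive blocks sit on neighboring nodes.
assert all(hamming(gray(i), gray((i + 1) % N)) == 1 for i in range(N))

# Binary code mapping: blocks i and i XOR 2^j sit on neighboring nodes.
assert all(hamming(i, i ^ (1 << j)) == 1 for i in range(N) for j in range(n))
```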
In this paper, we give two practical algorithms for the permutation of a one-dimensional array from a binary-reflected Gray code mapping to a binary code mapping (termed Gray-to-binary permutation). The new algorithms apply to communication systems restricted to communication on one channel per node at a time, known as one-port communication. The faster of the two new algorithms improves upon the data transfer time of the one-port algorithm in [6] by a factor of $2(n-1)/n$ at the expense of one additional communication step. Implementation results on a 64-node iPSC/2 confirm our complexity analysis.

Preliminaries

Notation and definitions
$N = 2^n$ is the size of the hypercube. The least significant bit is the 0th bit. Subcube $0_j$ is the set of nodes whose $j$th bit is zero. Subcube $1_j$ is similarly defined. Let $i = (i_{n-1} i_{n-2} \cdots i_0)$ and $j = (j_{n-1} j_{n-2} \cdots j_0)$; then the Hamming distance between $i$ and $j$ is $\mathrm{Hamming}(i, j) = \sum_{m=0}^{n-1} (i_m \oplus j_m)$, where "$\oplus$" is the bitwise exclusive-or operator. The concatenation symbol is "$\|$".

Communication model
In a distributed memory multiprocessor, nodes communicate with each other by sending and receiving messages. We denote the overhead associated with each internode communication (send or receive) by $\tau$, and the data transfer time per byte by $t_c$. We assume that each communication channel between a pair of nodes can transmit data in both directions at the same time. In an $n$-dimensional hypercube, each node has $n$ output and $n$ input ports. We assume the one-port communication model, in which only one output port and one input port per node can be active at a given time. In the all-port communication model each node can send and receive messages concurrently on all its ports. All-port Gray-to-binary permutation algorithms are given in [6, 10]. The communication time for sending an $m$-byte message to a nearest neighbor is $T = \tau + m t_c$. We refer to the term associated with $t_c$ as the "data transfer time", and to the term associated with $\tau$ as the "startup time". The number of bytes per node subject to the Gray-to-binary permutation is $K$.
We do not make any assumption as to whether the router of the hypercube supports store-and-forward, circuit-switched, wormhole, or virtual cut-through routing. Our algorithms use only nearest-neighbor (in fact, one-dimension-at-a-time) communications. However, the complexity comparisons and the derivation of the lower bound on the time complexity are based on store-and-forward routing.

Previous algorithms
In [6], Johnsson gives an algorithm for Gray-to-binary permutation consisting of $n-1$ exchanges along the sequence of cube dimensions $n-2, n-3, \ldots, 0$. Johnsson also shows that for all-port communication, pipelining of the communication steps can be used to reduce the communication complexity by a factor of $n$. In [8], Johnsson gives algorithms for Gray-to-binary and binary-to-Gray permutation with exchanges proceeding in both ascending and descending order of dimensions. In [9], Johnsson and Ho show that the permutation can be realized by exchanges in cube dimensions $\{0, 1, \ldots, n-2\}$ in arbitrary order. Later, this was also proved by Edelman, Heller and Johnsson using an algebraic framework [3].
Algorithm GB1 [6] performs $n-1$ exchanges in ascending order of cube dimensions [8], i.e., along the sequence of cube dimensions $0, 1, \ldots, n-2$. We refer to the $n-1$ exchange steps as step 0 through step $n-2$, where during step $i$ exchange operations are performed for a subset of edges in dimension $i$. The subset is determined as follows. Let $g^{-1}_i$ be the $i$-th bit of $G^{-1}(pid)$, the inverse Gray code of the node address, $pid$. Then, during step $i$, an exchange is performed for nodes whose $g^{-1}_{i+1}$ value is 1.
Figure 1 shows the three exchange steps of Gray-to-binary permutation on a 4-cube. In the figure, a 4-cube annotated with "step $i$, dimension $i$" is the scenario right before exchange step $i$. The arrows show the subset of edges in dimension $i$ that are subject to an exchange operation across dimension $i$ during the next step. The number at a node is the rank of the block allocated to the node initially (by Gray code mapping).
The communication complexity for Algorithm GB1 [6] is $T_{GB1} = (n-1)(\tau + K t_c)$.

Gray-to-binary permutation
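As a reference point for the new algorithms, the exchange rule of Algorithm GB1 given above can be simulated in a few lines of Python. This is a minimal sketch, not the authors' implementation: the names `inverse_gray` and `gb1` are illustrative, and each node's data is modelled as a single integer (the rank of the block it holds).

```python
def inverse_gray(g):
    """Return G^{-1}(g), i.e. the rank i with i ^ (i >> 1) == g."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i

def gb1(n):
    """Simulate Algorithm GB1 on an n-cube; returns the block held by each node."""
    N = 1 << n
    # Gray code mapping: node p initially holds block G^{-1}(p).
    held = [inverse_gray(p) for p in range(N)]
    for step in range(n - 1):          # steps 0 .. n-2, one dimension per step
        nxt = held[:]
        for p in range(N):
            # Exchange rule: bit (step + 1) of G^{-1}(p) is 1.
            if (inverse_gray(p) >> (step + 1)) & 1:
                nxt[p ^ (1 << step)] = held[p]
        held = nxt
    return held

# After n-1 steps node p holds block p, which is the binary code mapping.
print(gb1(4) == list(range(16)))  # True
```

Both endpoints of an edge in dimension `step` share all address bits above `step`, so they agree on the exchange rule and every selected edge carries a genuine two-way exchange.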

Lower bound
We now derive the lower bound for the Gray-to-binary permutation with respect to store-and-forward routing. We first compute the (Hamming) distance distribution between all $n$-bit binary strings and their corresponding Gray codes. From the distribution, the required communication bandwidth can be computed. Let $S(n, j)$ be the number of $n$-bit strings for which the binary and Gray code mappings are a Hamming distance $j$ apart, i.e., $S(n, j)$ is the cardinality of the set $\{\, i \mid \mathrm{Hamming}(i, G(i)) = j,\ 0 \le i < 2^n \,\}$.
Define $\binom{n}{j} = 0$ if $j < 0$ or $j > n$.
Lemma 2 $S(n, j) = 2\binom{n-1}{j}$ for all $n \ge 2$ and $0 \le j \le n$.
Proof: There are $2^n$ different $n$-bit strings. Thus, for $n \ge 2$, if we show that $S(n, j) = 2\binom{n-1}{j}$ for $0 \le j < n$, then it follows that $S(n, n) = 0$ (because $\sum_{j=0}^{n-1} 2\binom{n-1}{j} = 2^n$). We prove that $S(n, j) = 2\binom{n-1}{j}$ by induction on $n$. For the basis $n = 2$, we have $S(2, 0) = S(2, 1) = 2$. For the induction hypothesis, we assume $S(k, j) = 2\binom{k-1}{j}$ for all $0 \le j \le k-1$. From Definition 1, we have $S(k+1, j) = S(k, j) + S(k, j-1)$, which, by the induction hypothesis, yields $2\binom{k-1}{j} + 2\binom{k-1}{j-1} = 2\binom{k}{j}$. This observation completes the proof.
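Lemma 2 is easy to check by brute force for small cubes. The following sketch enumerates all $n$-bit strings and tabulates their Hamming distance to the corresponding Gray code; the function name `distance_distribution` is illustrative.

```python
from math import comb

def distance_distribution(n):
    """Return [S(n, 0), ..., S(n, n)] by enumerating all n-bit strings."""
    counts = [0] * (n + 1)
    for i in range(1 << n):
        g = i ^ (i >> 1)                      # G(i)
        counts[bin(i ^ g).count("1")] += 1    # Hamming(i, G(i))
    return counts

# Compare with the closed form 2 * C(n-1, j); comb returns 0 when j > n-1.
for n in range(2, 9):
    assert distance_distribution(n) == [2 * comb(n - 1, j) for j in range(n + 1)]

print(distance_distribution(4))  # [2, 6, 6, 2, 0]
```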
From Lemmas 1 and 2, $i = G(i)$ if and only if $i = 0$ or $i = 1$. Also, if $\mathrm{Hamming}(i, G(i)) = n-1$, then $i = (11\cdots1)$ and $G(i) = (100\cdots0)$, or $i = (11\cdots10)$ and $G(i) = (100\cdots01)$. Thus, each of these two nodes must send its data to a node at a distance of $n-1$ in the Gray-to-binary permutation. It is easy to derive a lower bound from the preceding lemma as follows.
Lemma 3 [9] The lower bound for Gray-to-binary permutation with respect to store-and-forward routing and a one-port communication model is $\max\bigl((n-1)\tau,\ (n-1)\frac{K}{2} t_c\bigr)$.
Proof: Clearly, the start-up time is at least $(n-1)\tau$, because the maximum Hamming distance between $i$ and $G(i)$ is $n-1$. The total number of element transfers required by the permutation is
$$\sum_{j=0}^{n} j\, S(n, j)\, K = \sum_{j=0}^{n} 2j \binom{n-1}{j} K = 2(n-1)K \sum_{j=0}^{n-2} \binom{n-2}{j} = (n-1)\, 2^{n-1} K.$$
In each step up to $2^n$ directed links can be used in the one-port communication model. Thus, at least a time of $(n-1)\frac{K}{2} t_c$ is needed for transferring the data.
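The counting argument in the proof can be reproduced numerically. The sketch below sums the distances travelled by all blocks and evaluates the bound of Lemma 3; the function names and the numeric values passed for `tau` and `tc` are placeholders chosen for illustration, not measurements from the paper.

```python
def total_element_transfers(n, K):
    """Sum over all nodes of (distance to destination) * K elements."""
    total = 0
    for i in range(1 << n):
        g = i ^ (i >> 1)                     # G(i)
        total += bin(i ^ g).count("1") * K   # Hamming(i, G(i)) hops of K elements
    return total

def lower_bound(n, K, tau, tc):
    """Lemma 3: max of the startup bound and the data transfer bound."""
    return max((n - 1) * tau, (n - 1) * (K / 2) * tc)

n, K = 5, 8
# Closed form from the proof: (n - 1) * 2^(n-1) * K.
assert total_element_transfers(n, K) == (n - 1) * 2 ** (n - 1) * K

# Placeholder machine parameters in arbitrary time units (assumptions).
print(lower_bound(n=6, K=1000, tau=1.0, tc=0.001))
```

Dividing the total by the $2^n$ directed links usable per step gives the $(n-1)K/2$ elements per link that appear in the transfer term.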

Algorithm GB2
We now introduce two new algorithms for Gray-to-binary permutation for one-port communication. The algorithms exchange half of the local data set per step at the expense of two additional communication steps for Algorithm GB2 compared to Algorithm GB1, and only one additional step for Algorithm GB3. Although Algorithm GB2 is always inferior to Algorithm GB3, the description of GB2 facilitates the understanding of GB3.
Both Algorithm GB2 and GB3 are based on the observation that Algorithm GB1 uses only half of the communication channels in dimensions $0, 1, \ldots, n-2$. In Figure 1, consider any two adjacent nodes $i$ and $j$ in subcube $0_3$, and the corresponding two nodes $i'$ and $j'$ in subcube $1_3$. Let $i$ and $j$ differ in the $d$th bit, where $0 \le d \le 2$. If an exchange is performed between nodes $i$ and $j$ in step $d$, then there is no exchange between nodes $i'$ and $j'$. Similarly, if an exchange is needed between nodes $i'$ and $j'$ during step $d$, then there is no exchange between nodes $i$ and $j$ during the same step. Thus, exactly one of the two pairs $(i, j)$ and $(i', j')$ performs an exchange along dimension $d$ during step $d$ [10]. The property can be stated as follows:
Lemma 4 [10] In Algorithm GB1 for the Gray-to-binary permutation, if two nodes exchange their data in subcube $0_{n-1}$ (respectively, $1_{n-1}$), then the corresponding two nodes in subcube $1_{n-1}$ (respectively, $0_{n-1}$) do not exchange data during the same step.
Proof: In Algorithm GB1, the value of bit $g^{-1}_{j+1}(i)$ determines if node $i$ must exchange its data with node $i \oplus 2^j$ during step $j$, for all $0 \le j \le n-2$. Thus, we only need to show that $g^{-1}_{j+1}(i) \oplus g^{-1}_{j+1}(i \oplus 2^{n-1}) = 1$ for all $0 \le j \le n-2$. Let $g^{-1}_{j+1}(i) = x$ and $g^{-1}_{j+1}(i \oplus 2^{n-1}) = y$. Then, by Lemma 1, $x = i_{n-1} \oplus i_{n-2} \oplus \cdots \oplus i_{j+1}$ and $y = \bar{i}_{n-1} \oplus i_{n-2} \oplus \cdots \oplus i_{j+1}$. Thus, $x \oplus y = 1$.
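The complementarity stated in Lemma 4 can also be confirmed exhaustively for small cubes. In this sketch the helper names `inverse_gray`, `exchanges` and `lemma4_holds` are illustrative; `exchanges(p, j)` encodes the GB1 rule that node `p` exchanges in step `j` when bit `j + 1` of $G^{-1}(p)$ is 1.

```python
def inverse_gray(g):
    """Return G^{-1}(g), i.e. the rank i with i ^ (i >> 1) == g."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i

def exchanges(p, j):
    """True if node p takes part in an exchange during step j of GB1."""
    return (inverse_gray(p) >> (j + 1)) & 1 == 1

def lemma4_holds(n):
    """Check that exactly one of p and its dimension-(n-1) neighbor exchanges."""
    top = 1 << (n - 1)
    return all(
        exchanges(p, j) != exchanges(p ^ top, j)
        for p in range(1 << n)
        for j in range(n - 1)
    )

assert all(lemma4_holds(n) for n in range(2, 9))

# Consequence: in every step exactly half of the nodes exchange data.
n = 4
print([sum(exchanges(p, j) for p in range(1 << n)) for j in range(n - 1)])  # [8, 8, 8]
```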
Lemma 4 implies that only half of the edges in dimensions $0, 1, \ldots, n-2$ are used for the permutation. This lemma was used in [10] to devise an all-port algorithm requiring $\frac{2}{3}K$ element transfers, an improvement over the algorithm in [6] by a factor of $\frac{1}{3}K$.
In the following, we refer to the $n-1$ exchanges in Algorithm GB1 as normal-exchanges. In Algorithm GB2 each node first exchanges half its data along dimension $n-1$. Then, normal-exchanges are performed along dimensions $0, 1, \ldots, n-2$ as in Algorithm GB1, except that when a node in subcube $0_{n-1}$ would not have exchanged data in Algorithm GB1, it exchanges the data it received from its neighbor in subcube $1_{n-1}$, and vice versa. Hence, each subcube not only performs the Gray-to-binary permutation for half of its data, but also performs the permutation for half of the data of the other subcube. Finally, an exchange along dimension $n-1$ undoes the first exchange. We refer to the exchanges before and after the normal-exchanges as pre-exchange and post-exchange, respectively. These exchanges allow all edges of the hypercube in dimensions $0, 1, \ldots, n-2$ to be used in each step. Further, during each step only half of the local data set is exchanged. Thus, the communication complexity is
$$T_{GB2} = (n+1)\left(\tau + \frac{K}{2} t_c\right). \qquad (2)$$
Figure 2 shows an example of Algorithm GB2 on a 4-cube. In the two 4-cubes annotated with pre-exchange and post-exchange, an arrow between subcube $0_3$ and subcube $1_3$ represents exchanges across all edges in dimension 3. Each data block, say of rank $i$, of the global array is partitioned into two subblocks of size $K/2$ each, denoted $i$ and $i'$, respectively, in the figure.
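The description of Algorithm GB2 can be turned into a small simulation. This is a sketch under stated modelling assumptions, not the authors' code: each node keeps two half-blocks in slots named `own` and `guest` (names chosen here), a half-block is a pair `(rank, half)`, and the pre-exchange always ships the second half.

```python
def inverse_gray(g):
    """Return G^{-1}(g), i.e. the rank i with i ^ (i >> 1) == g."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i

def gb2(n):
    """Simulate GB2 on an n-cube; returns (own, guest, number_of_steps)."""
    N, top = 1 << n, 1 << (n - 1)
    # Gray code mapping: node p holds both halves of block G^{-1}(p).
    own = [(inverse_gray(p), 0) for p in range(N)]
    # Pre-exchange in dimension n-1: p now hosts the second half of p ^ top.
    guest = [(inverse_gray(p ^ top), 1) for p in range(N)]
    steps = 1
    for j in range(n - 1):                   # normal-exchanges, dimensions 0 .. n-2
        new_own, new_guest = own[:], guest[:]
        for p in range(N):
            q = p ^ (1 << j)
            if (inverse_gray(p) >> (j + 1)) & 1:
                new_own[q] = own[p]          # GB1 would exchange: move own half
            else:
                new_guest[q] = guest[p]      # otherwise move the hosted half
        own, guest = new_own, new_guest
        steps += 1
    # Post-exchange in dimension n-1 returns the hosted halves.
    guest = [guest[p ^ top] for p in range(N)]
    steps += 1
    return own, guest, steps

own, guest, steps = gb2(4)
assert own == [(p, 0) for p in range(16)]
assert guest == [(p, 1) for p in range(16)]
print(steps)  # 5
```

In every step each node sends exactly one half-block, which is where the $K/2$ term in the complexity comes from.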

Algorithm GB3
Algorithm GB3 is similar to Algorithm GB2, except that the pre-exchange and post-exchange steps are performed in dimension $n-2$, i.e., the exchanges are made in dimensions $n-2, 0, 1, \ldots, n-2$. We show later that, by doing the pre- and post-exchanges in dimension $n-2$, the post-exchange step can be combined with the last step of the normal-exchanges. The size of the combined data set remains $K/2$.
Formally, consider the four subcubes 00, 01, 10 and 11 defined by dimensions $n-1$ and $n-2$. For each of the first $n-2$ normal-exchange steps of GB1 (i.e., on dimensions $0, 1, \ldots, n-3$), if there is an exchange between nodes $i$ and $j$ in subcube 00 (01, 10 and 11, respectively), then there is no exchange between their corresponding nodes in subcube 01 (00, 11 and 10, respectively). By exchanging half of the data across dimension $n-2$ before and after these $n-2$ exchanges, only $K/2$ elements need to be exchanged for each of these $n-2$ normal-exchanges. To complete the Gray-to-binary permutation, the normal-exchange step of Algorithm GB1 in dimension $n-2$ must also be performed.
We now show that the post-exchange and the normal-exchange in dimension $n-2$ can be combined into one exchange of $K/2$ elements. For subcube $0_{n-1}$, there is no normal-exchange needed in dimension $n-2$ for Algorithm GB1 (i.e., if no pre-exchange had been made). For subcube $1_{n-1}$, the post-exchange requires that $K/2$ elements be exchanged in dimension $n-2$, while for Algorithm GB1 (no pre-exchange), $K$ elements must be exchanged in the normal-exchange in dimension $n-2$. Thus, in Algorithm GB3, due to the pre-exchange in dimension $n-2$, only $K/2$ elements are subject to exchange in dimension $n-2$ in both the $0_{n-1}$ and $1_{n-1}$ subcubes, in the last exchange step.
Figure 3 shows an example of the algorithm (GB3) on a 4-cube. The complexity of the algorithm is
$$T_{GB3} = n\left(\tau + \frac{K}{2} t_c\right).$$
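The combined last step is perhaps easiest to see in a simulation. The sketch below follows the description above under the same modelling assumptions as before (two slots per node, here called `own` and `guest`, with half-blocks written as `(rank, half)`); it is an illustration, not the authors' implementation. In the final step, nodes in subcube $0_{n-1}$ return the hosted half, while nodes in subcube $1_{n-1}$ send their own half and keep the hosted half, which is already at its destination.

```python
def inverse_gray(g):
    """Return G^{-1}(g), i.e. the rank i with i ^ (i >> 1) == g."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i

def gb3(n):
    """Simulate GB3 on an n-cube (n >= 2); returns (own, guest, steps)."""
    N, top, pre = 1 << n, 1 << (n - 1), 1 << (n - 2)
    own = [(inverse_gray(p), 0) for p in range(N)]
    # Pre-exchange in dimension n-2: p hosts the second half of p ^ pre.
    guest = [(inverse_gray(p ^ pre), 1) for p in range(N)]
    steps = 1
    for j in range(n - 2):                   # normal-exchanges, dimensions 0 .. n-3
        new_own, new_guest = own[:], guest[:]
        for p in range(N):
            q = p ^ (1 << j)
            if (inverse_gray(p) >> (j + 1)) & 1:
                new_own[q] = own[p]          # GB1 would exchange: move own half
            else:
                new_guest[q] = guest[p]      # otherwise move the hosted half
        own, guest = new_own, new_guest
        steps += 1
    # Combined post-exchange and normal-exchange in dimension n-2.
    new_own, new_guest = own[:], guest[:]
    for p in range(N):
        if p & top:
            new_own[p ^ pre] = own[p]        # subcube 1_{n-1}: send own half
        else:
            new_guest[p ^ pre] = guest[p]    # subcube 0_{n-1}: return hosted half
    steps += 1
    return new_own, new_guest, steps

for n in (2, 4, 5):
    own, guest, steps = gb3(n)
    assert own == [(p, 0) for p in range(1 << n)]
    assert guest == [(p, 1) for p in range(1 << n)]
    assert steps == n
```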

Complexity comparison
From Figure 4-(a) it can be seen that the estimated performance of Algorithm GB3 is better than that of Algorithm GB1 for large data sets. The estimated performance of Algorithm GB3 is always better than that of Algorithm GB2. To demonstrate the feasibility of our algorithms, we implemented all three on a 64-node Intel iPSC/2. Although the Intel iPSC/2 uses circuit-switched routing, we only use nearest-neighbor communication for our implementation. Figure 4-(b) compares the measured times on the Intel iPSC/2 for Algorithms GB1, GB2 and GB3. All measured times on the Intel iPSC/2 are averaged over at least 100 runs. The improvement of Algorithm GB3 over Algorithm GB1 increases as the message size increases and approaches $2(n-1)/n$ asymptotically. For small data sets, however, Algorithm GB1 performs better than Algorithm GB3 because it needs $n-1$ communication steps instead of $n$ steps. The measured break-even point for the 64-node iPSC/2 is at about $K = 485$ bytes, while the complexity estimates predict a break-even around $K = 1.4$ kbytes.
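The three complexity expressions are simple enough to compare directly. The sketch below encodes them as Python functions and solves $T_{GB1} = T_{GB3}$ for $K$, which gives $K = 2\tau / ((n-2)\,t_c)$ for $n > 2$. The function names are illustrative, and the values used for `tau` and `tc` are arbitrary placeholders, not the iPSC/2 parameters, which are not stated in this text.

```python
def t_gb1(n, K, tau, tc):
    return (n - 1) * (tau + K * tc)

def t_gb2(n, K, tau, tc):
    return (n + 1) * (tau + K / 2 * tc)

def t_gb3(n, K, tau, tc):
    return n * (tau + K / 2 * tc)

def break_even_K(n, tau, tc):
    """K at which (n-1)(tau + K tc) == n(tau + K tc / 2); requires n > 2."""
    return 2 * tau / ((n - 2) * tc)

# Placeholder parameters in arbitrary time units (assumptions).
n, tau, tc = 6, 1.0, 0.001

K_star = break_even_K(n, tau, tc)
assert abs(t_gb1(n, K_star, tau, tc) - t_gb3(n, K_star, tau, tc)) < 1e-9

# For large K the ratio T_GB1 / T_GB3 approaches 2(n-1)/n.
ratio = t_gb1(n, 1e9, tau, tc) / t_gb3(n, 1e9, tau, tc)
assert abs(ratio - 2 * (n - 1) / n) < 1e-4

# GB3 is never slower than GB2 in this model: it saves one full step.
assert t_gb3(n, 100, tau, tc) < t_gb2(n, 100, tau, tc)
```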
The speedup of Algorithm GB3 over Algorithm GB1 is expected to increase as the cube dimension increases. Figure 5 compares the measured times of the three algorithms on the iPSC/2 as a function of the cube dimension. The measured speedup of Algorithm GB3 over Algorithm GB1 increases from 1.63 to 1.67 to 1.79 as the dimension increases from 4 to 5 to 6.

Circuit-switched routing
All the Gray-to-binary permutation algorithms discussed so far only use nearest-neighbor communication and thus do not assume a particular routing policy. From Algorithm GB1, one can easily observe that the paths from all nodes $i$ to their corresponding destinations $G^{-1}(i)$ are edge-disjoint (in the directed edge sense). Thus, on a circuit-switched hypercube, such as the Intel iPSC/2, all data can be sent directly to their corresponding destinations with one communication startup without congestion. (In fact, by [9] and [12], the congestion-free property remains true for any routing with any fixed order of hypercube dimensions.)
For circuit-switched routing, where each node can send and receive at least two messages at a time, it is possible to apply our technique to achieve a better complexity than for a direct route. The complexity of sending a $K$-byte message without congestion is $\tau + K t_c$ in circuit-switched routing, regardless of the number of hops that must be traversed. We partition each local data set into 3 blocks of equal size, called blocks 0, 1 and 2. In the first step, each node $i$ sends block 0 to node $G^{-1}(i)$ and concurrently sends block 1 to node $i'$, where $i'$ is the neighbor of $i$ across dimension $n-1$. In the second step, node $i$ sends block 2 to node $G^{-1}(i)$, and concurrently node $i'$ forwards block 1 of node $i$ to node $G^{-1}(i)$. (The description is made from the viewpoint of the data originating in node $i$.) The same idea was used in [10] for an all-port algorithm. It can be easily shown that for both steps there is no congestion between the blocks of size $K/3$ each. Thus, the total time is $2\tau + \frac{2K}{3} t_c$ (compared to $\tau + K t_c$ for the direct-route alternative).
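The edge-disjointness of the direct routes can be checked by enumeration. The sketch below assumes dimension-ordered routing, in which a message corrects the differing address bits in a fixed order of dimensions; the helper names `route` and `edge_disjoint` are illustrative. It covers only the direct routes from $i$ to $G^{-1}(i)$, not the two-step scheme with three blocks.

```python
def inverse_gray(g):
    """Return G^{-1}(g), i.e. the rank i with i ^ (i >> 1) == g."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i

def route(src, dst, dims):
    """Directed edges used when correcting differing bits in the order dims."""
    edges, cur = [], src
    for d in dims:
        if ((cur ^ dst) >> d) & 1:
            nxt = cur ^ (1 << d)
            edges.append((cur, nxt))
            cur = nxt
    return edges

def edge_disjoint(n, dims):
    """True if no directed edge is shared by two routes i -> G^{-1}(i)."""
    seen = set()
    for p in range(1 << n):
        for e in route(p, inverse_gray(p), dims):
            if e in seen:
                return False
            seen.add(e)
    return True

n = 6
assert edge_disjoint(n, range(n))              # ascending dimension order
assert edge_disjoint(n, range(n - 1, -1, -1))  # descending dimension order
```

Comparing the two totals in the text, the two-step scheme is faster than the direct route exactly when $2\tau + \frac{2K}{3} t_c < \tau + K t_c$, i.e. when $K > 3\tau / t_c$.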

Concluding remarks
We have presented two algorithms for the permutation between binary-reflected Gray code mapping and binary code mapping on hypercubes. Algorithm GB3 improves upon the data transfer time of the previous algorithm for one-port communication [6] (Algorithm GB1) by a factor of $2(n-1)/n$ on an $n$-cube, at the expense of one additional startup. The data transfer time of the new algorithm is optimal within a factor of $n/(n-1)$ on an $n$-cube with store-and-forward routing.