Communication efficient multi-processor FFT

Computing the Fast Fourier Transform on a distributed memory architecture by a direct pipelined radix-2 algorithm, a bi-section or multi-section algorithm, all yield the same communications requirement,if communicationfor all FFT stages can be performed concurrently, the input data is in normal order, and the data allocation consecutive. With a cyclic data allocation, or bit-reversed input data and a consecutive allocation, multi-sectioning o(cid:11)ers a reduced communications requirement by approximately a factor of two. For a consecutive data allocation, normal input order, a decimation-in-time FFT requires that P N + d (cid:0) 2 twiddle factors be stored for P elements distributed evenly over N processors, and the axis subject to transformation distributed over 2 d processors. No communication of twiddle factors is required. The same storage requirements hold for a decimation-in-frequency FFT, bit-reversed input order, and consecutive data allocation. The opposite combination of FFT type and data ordering requires a factor of log 2 N more storage for N processors. The peak performance for a Connection Machine system CM-200 implementation is 12.9 G(cid:13)ops/s in 32-bit precision, and 10.7 G(cid:13)ops/s in 64-bit precision for unordered transforms local to each processor. The corresponding execution rates for ordered transforms are 11.1 G(cid:13)ops/s and 8.5 G(cid:13)ops/s, respectively. For distributed one-and two-dimensional transforms the peak performance for unordered transforms exceeds 5 G(cid:13)ops/s in 32-bit precision, and 3 G(cid:13)ops/s in 64-bit precision. Three-dimensional transforms executes at a slightly lower rate. Distributed ordered transforms executes at a rate of about 1 2 to 2 3 of the unordered transforms.


Introduction
The main contributions of this paper are communication e cient multi-processor algorithms for the Cooley-Tukey Fast Fourier Transform 2] (FFT).The impact on performance of different data layouts is evaluated and an implementation on the Connection Machine system CM-200 is described.The algorithms are e cient in their use of the communication system, in particular systems with processors interconnected as Boolean cube networks allowing concurrent communication on all channels of every processor.The algorithms are also e cient in the use of storage for twiddle factors with no communication of twiddles required, when the factors are precomputed.In a distributed memory architecture a poor choice of FFT algorithm may require twiddle factors to be communicated, or the storage requirements may exceed the data storage requirement by a factor of log N for N processors.Finally, the algorithms are also e cient with respect to their use of the bandwidth between each processor and its memory.The distribution of data among the memory modules in a distributed memory architecture has a signi cant impact on performance.We brie y discuss this issue for both onedimensional and multi-dimensional transforms.It is well known that the Cooley-Tukey, in-place FFT reorders the data, such that after the transform the component in location i = (i p 1 i p 2 : : :i 0 ) has index (i 0 i 1 : : :i p 2 i p 1 ).The output index is the bit-reversed value of the input index.An FFT that leaves the output data in this order is unordered.An ordered FFT has the same data order for input and output.The implementations being discussed fully utilize the communication system for the computations of the unordered FFT.All channels of every processor are used concurrently.The reordering required for an ordered transform is made explicitly.Reordering algorithms, and implementations thereof are discussed elsewhere 10,11,3].No gain in communication eciency is possible by interleaving the reordering with the FFT computations, when all channels are used for the unordered transform, unlike the case with communication restricted to one channel at-a-time 20,22].For reference, we include performance measurements both for unordered and ordered transforms.The feasibility of di erent implementations of the Cooley-Tukey algorithm depends critically upon architectural characteristics.In the Connection Machine systems CM-2 and CM-200 the memory is distributed among up to 2048 oating-point processors.The maximum memory per processor is 4 Mbytes.In model CM-200, the oating-point processors support both 32-bit and 64-bit arithmetic.Data paths internal to the oating-point processors are 64-bits wide.Each processor has a single 32-bit wide data path to its local memory.The processors are interconnected as an 11-dimensional Boolean cube, with two communications channels between each pair of processors.Communication can be performed on all channels of every processor concurrently.The primitive communications operation is an exchange.The Discrete Fourier Transform is de ned by X(l) = P 1 X j=0 !lj P x(j); 8l 2 0; P 1]; !P = e 2 i P and the Inverse Discrete Fourier Transform is de ned by X(l) = 1 N P 1 X j=0 !lj P x(j); 8l 2 0; P 1]; !P = e 2 i P : The Cooley-Tukey Fast Fourier Transform 2] evaluates these matrix vector products in log 2 P stages by recursively using a splitting formula of the type X(l) = for a forward decimation-in-frequency (DIF) FFT.The coe cients !lj P are known as twiddle factors.Both these types of FFT are known as radix-2 FFTs.The inverse transform can be computed in the same manner as the forward transform by using conjugate twiddle factors.Figure 1 shows a radix-2, DIT FFT, and Figure 2 shows a radix-2 DIF FFT.The di erence of signi cance with respect to computing an FFT on a distributed data structure is that the DIF and DIT FFT use the twiddle factors in opposite orders.The DIT FFT use all twiddle factors in the last stage, while the DIF FFT use all twiddle factors in the rst stage.Also, the twiddle factors for the DIF FFT are ordered in the same way as the input data, i.e., in normal order for normal order input data, while the twiddle factors for a DIT FFT are in bit-reversed order for normal order input.The consequences of these di erences for FFT computations on data sets distributed throughout the memories of a multi-processor are discussed in Section 4. For P = R m the splitting formulas can be generalized to a radix-R FFT. Figure 3 shows computational kernels corresponding to radix-4 DIT and DIF splitting formulas.Figure 4 shows the computational kernels corresponding to radix-8 DIT and DIF FFT.For details of the derivations see for instance 16,17,18].As the radix of the FFT increases the number of arithmetic operations decreases somewhat.However, the main advantage from an increased radix in architectures with a limited memory bandwidth is a reduced need for memory accesses 5,6].The number of real operations (leading terms only) and memory accesses for radix-2, 4, and 8 kernels are given in Table 1.The number of arithmetic operations for the radix-8 algorithm is approximately 20% less than that of the radix-2 algorithm.The exact number of multiplications and additions can be found for instance in 16].Whereas the reduction in arithmetic operations is modest a radix-8 FFT o ers a reduction in the number of memory operations by a factor of almost three compared to a radix-2 algorithm.These kernel sizes are relevant for exploiting the Index binary x (15)  x(1) 0001 x(0) 0000 Index binary X(15) X( 7) X( 11) X(3) X( 13) X( 5) X( 9) X(1) X( 14) X( 6) X( 10) X(2) X( 12) X(4) X( 8) X(0) Index binary x (15)  x(1) 0001 x(0) 0000 Index binary X(15) X( 7) X(11) X(3) X(13) X(5) X( 9) X(1) X( 14) X( 6) X(10) X(2) X( 12) X(4) X( 8) X(0)  ! 5  register set in the oating-point processors of the Connection Machine systems, as discussed in Section 5.3.At the next level in the memory hierarchy, the local memory, the radix is equal to the size of the local data set.The outline of the paper is as follows.We rst brie y discuss the issues of the data allocation, or layout, among the memory modules.We then discuss the communication requirements of Cooley-Tukey FFT on multi-processors, speci cally on Boolean cube con gured processors.We compare the requirements of a direct pipelined algorithm, and algorithms based on bi-section, or multi-section, assuming concurrent communication on all channels of every processor, which is relevant for the Connection Machine systems.In Section 4 we discuss the computation and storage of twiddle factors for distributed FFT computations, and show how the storage requirements are related to the data layout, and the type of FFT being used.We then present results from our implementation on the Connection Machine system CM-200.All performance data are obtained for complex-to-complex FFT.

Data Allocation
In a distributed memory multi-processor architecture data is typically distributed uniformly across the memory modules at compile time, in order to maximize the potential concurrency in computation.If there are more data items than processors, then several data elements must be allocated to the same memory module.In a consecutive data allocation 7] successive elements are allocated to the same memory module.With n bits assigned to the encoding of processor addresses, the mapping of the array indices to machine addresses can be viewed as follows, where x i denotes a bit in the encoding of the data indices: Consecutive assignment: (x p 1 : : : The eld denoted rp encodes real processor addresses as opposed to memory addresses, vp.
In cyclic assignment the lowest order bits in the encoding of array indices are mapped to the processor address eld.Cyclic assignment: (x p 1 x p 2 : : : All data elements with the same n low order bits of their indices reside in the same processor.
In the consecutive assignment the indices of all elements in a processor have the same n high order bits.The consecutive and cyclic allocations of a 32 element one-dimensional array among 8 processors are illustrated in Figure 5.We consider the impact of these forms of data allocation on the data motion requirements for the FFT.  to partition the processors among the axis of the data array.In the Connection Machine systems the con guration can be controlled through compiler directives.We discuss how to con gure the processors for optimum performance in Section 5.

Communication requirements for the FFT
The data interaction in stage q of a radix-2 FFT is between data elements i and i 2 p q 1 , i 2 f0; 1; : : :; 2 p 1g, where q 2 f0; 1; : : : ; p 1g.The input stage is stage 0. The symbol denotes the bit-wise exclusive-or function.Hence, in a radix-2 FFT the data interaction is between data elements that di er only in one bit in the encoding of their respective indices, starting with the most signi cant bit and progressing towards the least signi cant bit.For a radix-2 r FFT the interaction in stage s is between data that di er in bits fp (s + 1)r; p (s + 1)r + 1; : : : ; p sr 1g, where s 2 f0; 1; : : :; p r 1g and we for simplicity assume that r divides p.For arbitrary p a collection of radices is needed.Since the data interaction proceeds from the most signi cant digit in the encoding of the data indices towards the least signi cant digit, the rst n r radix-2 r stages involve data motion between processors, when the data allocation is of the consecutive type.For the cyclic data allocation, the last n r stages involves communication.The data motion ts multiprocessors with processors interconnected as a Boolean cube network very well.Processors in such a network can be given addresses such that adjacent processors di er in precisely one bit, and conversely, there is an adjacent processor for every bit in the processor address.
Hence, processors j and j 2 m are adjacent for every m 2 f0; 1; : : : ; n 1g, and any j 2 f0; 1; : : : ; 2 n 1g.Clearly, for a radix-2 FFT, stages corresponding to index bits mapped to the processor address eld imply communication between directly connected processors.No other communication is required.A radix-2 r FFT requires communication between processors forming r-dimensional subcubes.For an example of the communication needs consider Figure 5.It is easy to see that the rst three stages of a radix-2 FFT with consecutive data allocation correspond to communication between directly connected processors.A radix-8 stage requires communication between all 8 processors.For the cyclic allocation, the last three stages require communication between directly connected processors.In a direct implementation of the splitting formulas for a radix-2 FFT a pair of processors exchanges a pair of data elements, then one computes the \top", and one the \bottom".Each stage requires that P N elements be exchanged for a transform on P elements distributed evenly over N processors.There are n such stages for the axis subject to transformation distributed across N = 2 n processors.In a multi-processor network with at least n channels per processor, such as in a Boolean n-cube, only one out of n communication channels are used.A radix-2 r algorithm implemented in an analogous manner would require that each processor in r-dimensional subcubes sends one data element to every other processor in the subcube to which it belongs.After this all-to-all broadcast within r-cubes 1, 10] each processor computes one output value for the radix-2 r kernel.For each all-to-all broadcast r channels can be used concurrently.The number of element transfers in sequence for each all-to-all broadcast in r-cubes is 2 r 1 r 10].The required temporary storage is 2 r 1.For large r the increased utilization of the communication system is accomplished at a signi cant expense in temporary storage.The radix-2 implementation presented above su ers from a slight load imbalance in addition to the ine ciency in using the communication system.One of the processors in a pair computes a complex addition, while the other computes a complex multiplication and a complex subtraction.The radix-2 r algorithm yields a better load balance, but this gain is accomplished at the expense of redundant computations.We now consider a few alternative implementation strategies that yield both increased communication and computational eciency.These alternatives were all considered for the Connection Machine implementation.The implementation is described in Section 5.

Direct pipelining
Pipelining the communications and computations for successive stages in the FFT is a straightforward way of increasing the utilization of the communication system.Pipelining allows d communication channels on every processor to be used concurrently in computing an FFT on an array axis distributed over 2 d processors of a Boolean cube network.The idea is illustrated for a radix-2 FFT in Figure 6.In the rst communication data is exchanged in the most signi cant cube dimension.After the splitting formulas have been evaluated for these data items, they are ready for the second stage of the FFT.In the second communication the rst memory locations in all processors are exchanged in the second most signi cant cube dimension, while the second memory locations are exchanged in the most signi cant cube dimension.From the third communication stage all communication channels are used in every exchange, until all local memory locations have been touched, at which point the shut-down of the communications pipeline starts.The idea of pipelining the communications for the FFT computations can also be understood by observing that for a consecutive data allocation over 2 d processors, the rst d stages can be viewed as P N distinct FFTs, each with one data point per processor.Each such FFT requires communication in the dimensions d 1; d 2; : : : ; 1; 0, one for each stage of the FFT.Hence, when the rst stage of the rst FFT is computed, dimension d 1 is free to be Time Memory

Processor
Step location 0 1 2 3 4 5 6 7 used for the computations of the next FFT.The radix-2, pipelined FFT requires P N + d 1 element transfers in sequence for an axis distributed over d cube dimensions, and with P N elements per processor.Note that for multi-dimensional FFTs, d is typically not equal to n, since more than one axis may be distributed over several processors.If P N >> d, then the communication system is fully utilized e ectively all the time.Pipelining o ers an improvement in the communication e ciency by a factor of d over the naive implementation, for P N >> d.It is easily veri ed that the claims are true for both the consecutive and cyclic data allocation.In the following we refer to the above algorithm as the \direct pipelined algorithm".The idea of pipelining the communication and computations for successive FFT stages can be applied to radix-2 r FFT for r > 1, but the pipelined radix-2 FFT o ers better overall e ciency for the reasons given in the previous section.

Bi-section
Even though the direct pipelined algorithm above uses the communication system to about 100%, the algorithm actually requires about twice the communication of implementations based on bi-section 14], or multi-section, or so called i-cycles 4,20,22].The notion of icycles for the computation of FFTs as used in 20,22] is equivalent to our notion of bi-section.We focus on the use of the idea for communication systems with concurrent communication on multiple channels, whereas the development in 20,22] assumes communication on one channel at a time.This di erence in assumption of the capabilities of the communication system a ects the utility of the idea in a fundamental way.The idea of using bi-section to achieve load balance and communication e ciency on Boolean cube networks is not new.It has been used previously for the solution of systems of tridiagonal equations 8].The idea of computing an FFT through bi-section is illustrated in Figure 7 for a cyclic data allocation with two data elements per processor.The table shows the location of the data indices through the course of the algorithm.The rst stage with the cyclic allocation requires no communication.Each processor evaluates one complete splitting formula.In the rst exchange on the most signi cant processor dimension, the rst half of the processors exchange the content of their second memory location with the content of the rst memory location of the second half of the processors.After this exchange each processor can again perform the computations for one splitting formula, this time for the second stage of a radix-2 FFT.The exchange proceeds on successively lower processor dimensions, but use the same memory locations.Processors with the address bit 0 for the dimension subject to exchange, exchange their second memory location, while processors with the address bit 1 exchange their rst memory location.In each exchange a processor sends one out of two data elements identi ed by a local memory address bit.All processors evaluate P 2N complete splitting formulas after each communication of P 2N elements per processor.The load is perfectly balanced, and only half as much data is exchanged for each FFT stage.The idea of pipelining can be used in combination with the bi-section idea to fully utilize the communication system.Figure 8 shows the rst few exchanges for a pipelined bi-section algorithm.The factor of two gain in communication e ciency by using bi-section may not be fully realizable, or realizable at all for a consecutive data allocation.To see this fact we apply the bi-section idea to the consecutive allocation, as shown in Figure 9.Note that communication in the most signi cant dimension must be performed twice.With concurrent communication on all channels a pipelined bi-section algorithm requires 2 P 2N + d 1 element transfers in Time Memory sequence, ignoring a possible overlap between the second exchange in the most signi cant dimension and the pipeline lling time.Hence, for the consecutive data allocation and concurrent communication on all channels, the communication requirements are the same as for the direct pipelined algorithm.The FFT implementation based on bi-section reorders the data, in addition to the bit-reversal due to the FFT itself, unlike the direct pipelined FFT.That a reordering takes place is apparent from Figures 7 and 9, which both show the location of the original data indices.The data motion caused by the sequence of bi-sections implements an unshu e.An unshu e is the inverse of a shu e, which perfectly interleaves the rst and second half of a set of numbers.For instance, a shu e on the set f0; 1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15g produces the set f0; 8; 1; 9; 2; 10; 3; 11; 4; 12; 5; 13; 6; 14; 7; 15g.Changing the last data ordering to the rst corresponds to the reordering in Figure 7.The data reordering for the cyclic data allocation can be represented formally in terms of the encoding of the address space as shown below.The overline marks address bits that have been exchanged in the step indicated on the left.For instance, after the rst exchange step bits x n and x n 1 have been exchanged.In an exchange, only data that di er in the values of the index bits are moved.For instance, in the rst exchange only data for which x n x n 1 = 1 are moved.In this example the same memory dimension is used for all exchanges.Upon completion of the bi-section process the bits in the processor address together with the local memory bit used for all exchanges have been subject to a right cyclic shift, which de nes an unshu e.The local memory bit can be chosen arbitrarily.Using the least signi cant memory bit implies that every other location is subject to exchange, and the stride is two.If the most signi cant memory dimension is used, then a block equal to half of the local memory is exchanged, and the stride is one within the block.The number of memory references are the same, but architectural characteristics such as page faults, communications overhead, etc., may make the strategy for selecting the local memory bit important with respect to performance.
For the consecutive data allocation we use the exchange sequence for the cyclic allocation augmented with one additional exchange, as illustrated below In this case an unshu e permutation has been performed on the processor address eld.The local memory address eld is not reordered, as can be seen in Figure 9. Pairs of local memory locations contain even-odd pairs of successive data indices.

Multi-section
The idea of bi-section can be generalized to multi-section to support high radix FFT.A R = 2 r way splitting implies matrix transposition, or all-to-all personalized communication 10, 9, 11] in r-dimensional subcubes.After each 2 r -sectioning step a radix-2 r FFT can be performed locally in each processor.Figure 10 illustrates multi-sectioning for the interprocessor communication steps for p = 6 and n = 4 and cyclic data allocation.The numbers in the table are the initial indices.The rst partitioning step is a matrix transposition within each subcube of dimension two with respect to the two highest order real processor dimensions.For instance, processors 0, 4, 8 and 12 are in the same subcube.The bit interchanges corresponding to multi-sectioning is illustrated below for cyclic data allocation.
As in the bi-section case there exist many ways in which partitioning of the local data can be performed throughout the algorithm, resulting in di erent nal orderings.For a 4-section algorithm, two bits are involved in every step.For a 2 r -section algorithm, r bits are involved in each permutation.Sectioning steps for successive radix-2 r stages involve consecutive blocks of r dimensions.The communication for the 2 r -sectioning on blocks of r di erent processor dimensions can be pipelined.The number of element transfers in sequence is P 2N + (d n r e 1)2 r 1 for cyclic data allocation.An in-place sectioning requires that 2 r P N , or r p n, since the size of the local data set involved in a 2 r -section is 2 r .For p 2n (P N 2 ) a 2 n -sectioning minimizes the number of element transfers in sequence, since there is no pipeline start-up or shut down in this case.Next to 2 n -sectioning, 4-sectioning is the best choice with respect to the number of element transfers in sequence.Bi-sectioning is insigni cantly inferior with respect to the inter-processor communication requirements.For small values of r the variance in communication e ciency is small, and the choice of r is largely determined by the e ciency in evaluating the splitting formulas.With a consecutive data allocation the r most signi cant processor dimensions must be used twice, and a pipelined multi-section algorithm requires P N + (d n r e 1)2 r 1 element transfers in sequence, essentially the same as for the direct pipelined algorithm.

Discussion and summary of algorithms
In all derivations above the input order was normal.With the input in bit-reversed order the traversal of the address bits proceeds from the lowest to the highest order bit.With respect to the communication issues the roles of the consecutive and cyclic mapping are interchanged.
Forward and inverse transforms only di er in that one is computed using the conjugate values of the twiddle factors of the other.There are no particular issues with respect to multi-processors for one that is not present in the other.A multi-dimensional FFT can be performed as a sequence of one-dimensional FFTs for the di erent axes.Performed in this way the only issue unique to multi-dimensional FFT is how to partition the set of processors among the axes.We discuss this issue in the context of the Connection Machine implementation, see Section 5.

Consecutive allocation
The communication requirements for the consecutive and cyclic data allocations are summarized in Table 2.The communication requirements assume concurrent communication on all channels.With a consecutive data allocation all algorithms yield the same communication requirements for an unordered transform, with data in normal input order.The output ordering is bit-reversed for the direct pipelined algorithm, and shu ed bit-reversed for the pipelined bi-section or multi-section algorithms.With a cyclic data allocation, and normal input order, the bi-section and multi-section type algorithms require approximately half as many element transfers in sequence as the direct pipelined algorithm.With a bit-reversed input order all algorithms essentially have the same communication requirements for the cyclic data allocation, while the bi-section or multi-section algorithms o er a reduction in the communications requirements for the consecutive data allocation.
For an ordered transform an explicit reordering phase is required.Interleaving the reordering with the FFT computation o ers no gain in communications e ciency, when all communications channels are utilized for the unordered transform, unlike the case where only one channel at a time is used 20,22].The reordering requires the same number of element transfers in sequence for a pure bit-reversal operation, or a combined shu e with bit-reversal, assuming concurrent communication on all channels 12,11].The number of element transfers in sequence for the reordering is P 2N .For details see 12,11].

Twiddle Factors
The total number of twiddle factors needed for a radix-R FFT of size P is (R 1) P R .For the computation of an FFT on a distributed memory architecture using precomputed twiddle factors, it is important to minimize the need for either redundant storage of twiddle factors or communication of twiddle factors should they be required in a processor di erent from the one in which they are stored.In 15] we show that a radix-2 DIT FFT with precomputed twiddle factors, and data in normal order allocated consecutively, requires a maximum of P 2N + d 2 twiddle factors in a processor.A DIF FFT with bit-reversed input order and consecutive allocation requires the same twiddle factors.A radix-2 DIT FFT on normal order input allocated cyclically, or a DIF FFT on bit-reversed data allocated cyclically requires a maximum of (n 1) P N twiddle factors in a processor 15].Hence, the data allocation has a signi cant impact on the need for twiddle factor storage.Below we give algorithms for computation of twiddle factor indices based on memory addresses, for high radix FFT.

Decimation-in-frequency FFT
We rst give a formula for the twiddle factor indices for a radix-2 in-place DIF FFT with normal order input, then generalize the formula to radix-R DIF FFT.For the rst radix-2 stage the twiddle factor index is (a p 1 ) (a p 2 a p 2 : : : a 0 ) for the data element in location (a p 1 a p 2 : : : a 0 ).The radix-2 twiddle factors can be derived from the following iterative formulation of the DIF FFT.

Decimation-in-time FFT
As in the DIF case we rst consider the radix-2 case, then generalize to the radix-2 r case.
x 1 (a p 1 ;: : :; a 0 ) = x(a p 1 ; :: : ;a 0 ) x0 (a p 1 ;: : :; a 0 ) = x 1 (0; a p 2 ; : : :; a 0 ) + ! a p 1 2 ! 02 x 1 (1; a p 2 ; : :: ; a 0 ) x2 (a p 1 ;: : :; a 0 ) = x1 (a p 1 ; 0; a p 3 ; : :: ; a 0 ) + ! a p 2 2 !ha p 1 i 4 x1 (a p 1 ;1; a p 3 ;: : : ;a 0 ) . . .xs(ap 1 ;: : :; a 0 ) = xs 1 (a p 1 ;: : :; 0 p s 1 ;: : :; a 0 ) + ! a p s 1 2 !ha p s ;:::;a p 1 i 2 s+1 xs 1 (a p 1 ; :: : ;1 p s 1 ; :: : ;a 0 ) . . .xp 1 (a p 1 ;: : :; a 0 ) = xp 2 (a p 1 ; : :: ; a 1 ;0) + ! a 0 2 !ha 1 ;:::;a p 1 i P xp 2 (a p 1 ; : : :; a 1 ; 1) X(a p 1 ;: : :; a 0 ) = xp 1 (a 0 ; : :: ; a p 1 ) Note that, in the expression for x0 , the value of ! 02 is 1: no twiddle factors are needed in the rst stage.A radix-R, in-place, DIT FFT with input data in normal order can be written as The indices of the twiddle factors are all zero for the rst stage, j d d u 1 2 p 2r for the second radix-R stage, and j ( d d u s : : : d d u 1 )2 p (s+1)r for stage number s. Note, that the address is bit-reversed and shifted for the proper exponent.If the P complex data points are allocated consecutively and are in normal order, then the data in address location (d u 1 d u 2 : : : d 0 ) requires twiddle factors with indices fjg (f d d u s : : : d d u n r 1 g d d u n r : : : d d u 1 )2 p (s+1)r for stage s of an in-place DIT algorithm.With a consecutive data allocation the processor address bits form the high order bits of the element index.The rst n r radix-R butter y stages correspond to P N independent FFTs of size N.All these FFTs require the same set of twiddle factors.The local addresses do not enter into the index computation.Moreover, the rst stage does not require any twiddle factor.The last u n r radix-R stages are local to a processor.The maximum total number of twiddle factors required in a processor is P N + (d n r e 1)(2 r 1) 1, the same as for cyclic data allocation, normal input order, and in-place DIF FFT.The set of twiddle factors required in a processor is identical to those required for consecutive data allocation, bit-reversed input order and a DIF, in-place FFT.The number of twiddle factors required for a DIT FFT with input data in bit-reversed order and a consecutive data allocation is excessive, see 15].every processor, the bi-section and multi-section techniques do not o er any reduction in the communication time compared to the direct pipelined algorithm.Indeed, on the Connection Machine systems the bi-section and multi-section techniques require more time for a data exchange between a pair of processors than the direct pipelined algorithm.The reason for this di erence is that the exchanges in the direct pipelined algorithm take place between memory locations with the same local addresses in di erent processors, while the other algorithms require that the elements in an exchange has di erent local memory addresses.Depending on algorithm 3,11] the increase in the time for an exchange is in the range 30 -100% for the Connection Machine systems CM-2 and CM-200.The increased communication time due to this reduction in communication e ciency is in many cases greater than the reduction in time for evaluating the splitting formulas by radix-4 or 8 kernels instead of by radix-2 kernels.In the direct pipelined algorithm the FFT stages requiring communication are computed using a radix-2 algorithm.Local stages are computed using a mix of radix-2, 4 and 8 kernels.For e ciency, as many stages as possible are performed using the radix-8 kernels.To increase the e ciency of the radix-2 kernels for the inter-processor communication stages, data caching is performed as explained in Section 5.2.Reordering for ordered transforms is performed explicitly.Interleaving the reordering with the computation of the unordered transform would not gain any e ciency with respect to communication, since all communication channels are already used by the unordered FFT.For details of algorithms see 3,10,11].Timings are presented for both the unordered and ordered FFT.All performance data presented below refer to complex-to-complex transforms performed on the Connection Machine system CM-200.The data is assumed to be presented in normal order for the DIT FFT, and bit-reversed order for the DIF FFT.A standard binary encoding of the indices for each axis is assumed.For Boolean cube multi-processors a common encoding of array indices is the binary-re ected Gray code encoding 19].This encoding is also supported on the Connection Machine systems.FFT algorithms for this type of data encoding can be found in 13].Those algorithms have not yet been implemented on the Connection Machine systems.

Organization of the inter-processor communication stages
For the inter-processor communication stages, direct pipelined radix-2 DIT or DIF FFT algorithms are used for normal and bit-reversed input order, respectively.For each interprocessor communication FFT stage, a single twiddle factor is needed for all local data elements.The total number of twiddle factors needed in each processor is equal to the number of processor dimensions d over which the data set is distributed.d is the number of FFT stages requiring communication.For the direct pipelined algorithm d data elements are exchanged in each communication, except during the start-up and shut down of the communications pipeline.d butter y computations can be performed after each communication.The butter y computations belong to di erent stages, and require di erent twiddles.
As is apparent from Figure 6 in Section 3.1 each local data element is updated d times in succession.No further updates are required for the inter-processor communication phase.In order to reduce the number of loads and stores to local memory, the local data items are cached in the register set of the oating-point unit.Twiddle factors are (re)read from memory.The data caching scheme is used for up to 10 dimensions.For 11 dimensions there are insu cient registers in the oating-point unit, resulting in a performance loss, as can be seen from the timings in Section 5.5.Another detail that deserves to be mentioned addresses the SIMD (Single Instruction Multiple Data) nature of the Connection Machine systems CM-2 and CM-200.In the butter y computations one processor in a pair performs a complex addition and the other a complex subtraction.By integrating the negation of one of the operands into the communication both arithmetic operations can be performed concurrently with no measurable loss in e ciency.

Organization of the local FFT stages.
The current oating-point unit of the Connection Machine has a register le of 32 registers, which is su cient to keep all the twiddle factors and the temporary variables for the radix-2 and radix-4 kernels.For the radix-8 kernel, the twiddle factors are brought in from memory as they are needed, and only the temporary variables are kept in the registers.For kernels of higher radices, temporary results would have to be stored in memory.For that reason, we only implemented the radix-2, radix-4 and radix-8 kernels.To handle data sets of any power-of-two (on-processor) size, it is necessary to mix kernels of di erent radices.Our implementation does as much of the computation as possible with radix-8 kernels, using one stage of radix-2 or radix-4 kernels to handle the remainder of the computation when the size is not a power of 8.An FFT algorithm is typically expressed in terms of three nested loops.The outermost loop ranges over the stages of the FFT (\stage loop").The two inner loops range over the groups (a group is a set of kernels which use the same set of twiddle factors) in each stage (\group loop") and over all the kernels in each group (\kernel loop").With this organization, the twiddle factors for all the kernels in each group can be kept in the register le during the extent of the kernel loop; they are loaded at the beginning of the kernel loop and need not be loaded for each kernel.For radix-8 kernels, only part of the twiddle factors can be kept in registers, due to the register le size.The number of groups and the number of kernels in a group change from stage to stage.The product of the number of groups and the number of kernels in a group is the total number of kernels in a stage, which is equal to the local FFT size divided by the size of the current kernel.Table 4 gives the number of groups and the number of kernels of size R = 2 r per group, for a given radix-2 r butter y stage s, when there are P N elements per processor.The loop structure is given by the following pseudocode (it assumes for simplicity that the kernel size is always the same, but it can be easily generalized):  In the last stage, only the rst radix-2 kernel needs to load the twiddle factor from memory to the register le; the other 63 kernels will use the twiddle factor already in the register le.

The Twiddle Factors
In stage s (as de ned above), a radix-R FFT (R = 2 r ) needs R 1 twiddle factors per kernel.Since all the kernels in one group use the same set of twiddle factors, the number of twiddle factors used in one stage is the number of groups multiplied by R 1. Hence, in stage s, with P N elements per processor, the DIT FFT needs (R 1)2 sr di erent twiddle factors, and the DIF FFT needs (R 1) 2 (s+1)r P N twiddle factors.For all the stages, both of these add up to P N 1.There are P N twiddle factors used in total for the local part of the FFT.The DIT FFT is performed on data stored in normal order.For stage s, processor N i needs the twiddle factors !j d N i kg 2 n+(s+1)r ; j 2 1; R 1]; for the kernels in group g, where xky is the concatenation of x and y.For the DIF FFT with cyclic data allocation and the input in bit-reversed order, the twiddle factors used by the kernels in group g in processor N i at stage s are !j d N i kg 2 p sr ; j 2 1; R 1]: If the substitution s p n r s 1 is made in the expression above, we get exactly the expression for DIT twiddle factors.The DIF and the DIT FFT's are thus using the same set of twiddle factors, but are using them in reverse orders.Our implementation only uses one table of twiddle factors for both the DIT and the DIF FFT.
The following pseudo-code re ects exactly how the table of twiddle factors is stored in the processor memory.It generates them in the order used by the DIT FFT.twiddle pointer:=0 for s:=0 to p n r 1 for g:=0 to nb groups(s,r) for j:=1 to 2 r 1 twiddle twiddle pointer] := !j d N i kg 2 n+(s+1)r twiddle pointer := twiddle pointer + 1

Performance measurements
The performance measurements below have been made on a Connection Machine system CM-200 with 2048 64-bit oating-point units.All performance data refer to a complex-to-complex FFT, CCFFT, implemented as described above, and included as part of the Connection Machine Scienti c Software Library version 3.0.Data is provided for both ordered and unordered FFT.Performance of local FFTs for di erent array sizes is given in Table 5 and Figure 11.The peaks in Figure 11 correspond to array sizes for which only radix-8 kernels are used.The performance for 64-bit precision is about 75-80% of the performance for 32-bit precision.
The di erence is due to the fact that the data path between each oating-point unit and its memory is 32-bits wide.Data paths internal to the oating-point unit are 64-bits wide.The performance of the DIT kernels is 90 -95% of the DIF kernel performance for most sizes.The di erence is due to minor di erences in the construction of arithmetic pipelines for the oating-point processor.Table 6 gives performance data for ordered local transforms.Large ordered transforms are about 10% slower than unordered transforms.For transforms of size 1024 the ordered transform is about 20% slower than the unordered transform.The ordering phase requires one traversal of memory regardless of the size of the array, whereas the computation of the FFT requires several traversals.Timings for two-and three-dimensional CCFFT are given in Table 7, and shown in Figure 12.
The signi cant increase in performance for the two-dimensional CCFFT between the 1024 1024 array and the 2048 2048 array is due to one of the axis being local to a processor for the larger array (there are 2048 processors).The subsequent minor decrease in performance for the next larger array is due to the fact that the axis distributed over all processors also has a local component of length two.This part of the axis requires a radix-2 kernel, which is less e cient than the radix-4, and the radix-8 kernels normally used.For reference, performance data for ordered two and three-dimensional transforms are given in       is spread.The number of element transfers in sequence is approximately independent of the number of axis d, except if d = 0 in which case no communication is required.The performance variation once an axis is distributed across processors is minor, as can be seen in Table 9.For a two-dimensional FFT of shape 4096 4096 the worst performance once an axis is distributed across processors is at most 5% below the peak in 32-bit precision, and at most 3.5% below peak in 64-bit precision.The di erence between a distributed axis, and a local axis is about 20% in 32-bit precision and close to 30% in 64-bit precision.

Summary and Discussion
We have shown that for consecutive data allocation, normal order input, and a Boolean cube interconnection network allowing concurrent communication on all channels of every processor, a direct pipelined radix-2 FFT and an FFT based on multi-section or i-cycles 20,22] all yield essentially the same communication requirements.The number of element transfers in sequence is P N +d 1 for a transform on an array of size P distributed evenly over N processors, with the axis subject to transformation distributed over 2 d processors.For a cyclic data allocation and normal input order, or bit-reversed input order and consecutive data allocation, an FFT based on multi-section requires about half as many element transfers in sequence as a direct pipelined FFT.We have also shown that with precomputed twiddle factors a decimation-in-time FFT for consecutive data allocation and normal order input, requires approximately the same total storage on a distributed memory architecture as on a shared memory architecture.No computation, or communication of twiddle factors is necessary with this amount of storage.A decimation-in-frequency FFT requires the same twiddle factors in the same processors if the input is in bit-reversed order and the data allocation consecutive.Hence, a pair of unordered forward and inverse Fourier Transforms computed using a decimation-in-time and a decimation-in-frequency FFT can use the same twiddle factors, stored in exactly the same way in the distributed memory.An implementation of the Cooley-Tukey FFT based on multi-sectioning yields perfect arithmetic load balance, while the direct pipelined FFT does not.Hence, even for data allocations where there is no gain in the communication requirements, an FFT based on multi-section has advantages.However, for our implementation on the Connection Machine systems we concluded that the multi-section approach would be inferior.The reason is that the multisection approach requires data in the processor interchanges to come from di erent memory locations, which incurs a performance penalty of 30-100% on the Connection Machine systems CM-2 and CM-200, compared to the direct pipelined FFT algorithm.The decrease in communication performance is in most cases greater than, or approximately equal to, the gain from an increased computational e ciency in the kernels evaluating splitting formulas.Though a radix-2 FFT was chosen for the FFT stages requiring communication, a mix of radix-2, 4 and 8 kernels are used for stages local to each processor.The peak performance of our implementation of the complex-to-complex FFT on the Connection Machine system CM-200 is 12.9 G ops/s in 32-bit precision, and 10.7 G ops/s in 64-bit precision for unordered transforms.The corresponding data for ordered transforms is 11.1 G ops/s and 8.5 G ops/s, respectively.The peak performance for unordered two-dimensional transforms distributed over all processors is 5.0 G ops/s in 32-bit precision and 3.2 G ops/s in 64-bit precision.The corresponding execution rates for the ordered transforms are 2.7 and 1.5 G ops/s, respectively.The execution rates for large one-dimensional transforms is slightly higher, and for three-dimensional transforms slightly lower.

X
decimation-in-time (DIT) FFT, or of the type

Figure 5 :
Figure 5: Consecutive and cyclic data allocation of 32 elements to 8 processors.

Figure 6 :
Figure 6: The rst four steps of a direct pipelined radix-2 FFT.

Figure 7 :
Figure 7: The data distribution for a radix-2 FFT based on bi-section with cyclic data allocation.

Figure 8 :Figure 9 :
Figure 8: The rst four steps of a pipelined bi-section based radix-2 FFT.

Figure 10 :
Figure 10: The data distribution for a 4-sectioning, radix-4 FFT for 16 processors and 4 elements per processor.

Figure 12 :
Figure 12: The execution rate for two and three-dimensional, unordered, DIT CCFFT on a 2048 processor CM{200.

Figure 13 :
Figure 13: Total execution time for a two-dimensional unordered CCFFT on a 4096 4096 array as a function of the con guration of 2048 CM-200 processors.
For multi-dimensional arrays each axis is often encoded separately, as for instance is the case in the Connection Machine programming systems 21].Still, there is an issue of how

Table 2 :
Communication requirements for unordered transforms with concurrent communication on all channels, and consecutive and cyclic ordering.

Table 4 :
Number of groups and kernels per group.

Table 8 .
The execution time increases by 50 -100% for our examples, considerably more than for entirely local transforms.Optimal e ciency is attained by maximizing the number of axes that have no non-local component.Recall that with the pipelining of communications, the number of element transfers in sequence is P N + d 1, where P N is the number of elements per processor, and d the number of inter-processor dimensions over which an axis subject to transformation

Table 6 :
Performance data for local, ordered, CCFFT on a 2048 processor CM-200.

Table 7 :
Performance data for two and three-dimensional, unordered, CCFFT on a 2048 processor CM-200.

Table 9 :
Performance of a two-dimensional unordered CCFFT on a 4096 4096 array computed on a 2048 processor CM-200.