A combining mechanism for parallel computers

In a multiprocessor computer, communication among the components may be based either on a simple router, which delivers messages point-to-point like a mail service, or on a more elaborate combining network that, in return for a greater investment in hardware, can combine messages to the same address prior to delivery. This paper describes a mechanism for recirculating messages in a simple router so that the added functionality of a combining network, for arbitrary access patterns, can be achieved with provable efficiency. The method brings together the messages with the same destination address in more than one stage, and at a set of components that is determined by a hash function and decreases in number at each stage.


Introduction
A general purpose parallel computer needs to have a mechanism for realizing concurrent memory accesses efficiently. Several or all of possibly thousands of processors may wish to read the same memory address at the same time. Alternatively, several or all may wish to write a value into the same address, in which case some convention needs to be adopted about the desired outcome. In either case, the requests will have to converge from the various parts of the physical system to the one location.
If each request is sent directly to the component containing the relevant address, then this component will require time proportional to the number of requests arriving there in order to handle them. In general, this becomes unacceptable if the number of processors $p$ is large. This overhead, potentially linear in $p$, can be overcome by implementing the requests in more than one stage. In the first stage, for example, the requests to any one ultimate destination will converge in groups at various intermediate components, where each group is combined into a single request to be forwarded in the next stage. In the last stage all extant requests to an address finally converge to the chosen location. Thus the requests can be viewed as flowing from the leaves of a tree to the root. In some instances, as when the concurrent requests implement a read statement, a flow of information in the reverse direction, from the root to the leaves, needs to follow. In these instances, whenever requests are combined at a component, the sources of the requests in that stage are stored at that component, so that a complete record of the structure of the tree is maintained. In a general pattern of requests, accesses to several memory addresses may be present. In that case a combining tree has to be maintained for each address.
What is the most efficacious way of providing a multiprocessor computer with a combining facility that is acceptably efficient for the widest range of concurrent access patterns? In this paper we shall describe one proposed solution, and provide some analytic, experimental and, also, qualitative arguments in its favor.
In [14] we proposed the bulk-synchronous parallel (BSP) model of parallel computation in which the basic medium of inter-processor communication is a router that delivers messages among the components point-to-point, but performs no combining or other computational operations itself. In this context a router means any device that can deliver a set of messages. It may be, for example, an optical device that transmits messages physically point-to-point. It was shown that shared memory with arbitrary concurrent accesses could be simulated on a $p$-processor BSP machine with only constant loss in efficiency asymptotically, if the simulated program had $p^{1+\varepsilon}$-fold parallelism for some positive constant $\varepsilon$. One important advantage of having as the communication medium this simplest option of being merely a message transmitter is that it makes possible a competition for the highest throughput router among the widest possible range of technologies. In contrast, a medium that is required to perform more complex operations, such as combining (e.g. [4], [12], [13]), imposes more constraints on the technology. The crucial question is whether the extra capabilities of more complex hardware can be simulated in practice on the simple router, with acceptable loss of efficiency.
In this paper we lend support to the position that simple routers can indeed implement concurrent accesses efficiently, by describing an algorithm for this that is apparently more efficient and practical than previous solutions [8], [15]. We give analytic results that show that $p^{1+1/m}$ requests to $p$ units, with an arbitrary pattern of combining, can be realized in time $m\,p^{1/m}$ asymptotically as $p \to \infty$. A certain natural charging policy is used here for measuring time, and only minimal assumptions are made on the uniformity of the requests among the components. Perhaps more significantly, the algorithm is of the form of a simple natural heuristic. Experiments suggest, for example, that for $p = 4096$ and with each processor making 32 requests, the cost of realizing arbitrary concurrency patterns, as compared with patterns with no concurrency, is no more than a factor of about 3.5, even if nothing is known about the pattern. If the degree of concurrency is known then this factor can be made smaller.
We conclude that, when building a parallel computer, it may be efficacious to invest the bulk of the resources to be used for communication in a simple router having maximal throughput. Although every general purpose parallel machine needs to have mechanisms for implementing arbitrary patterns of concurrent accesses, if, as it appears, difficult access patterns occur rarely enough, then our proposed mechanism for dealing with them is efficient enough that substantial investments in combining networks are not warranted.

Multi-Phase Combining
We consider a system consisting of $p$ components, each of which has some memory and processing capabilities. A $(q, r)$-pattern among the $p$ components is a set of communication requests in which each component sends at most $q$ requests, each request has a destination that is a memory address in a particular component, and at most $r$ requests are addressed to any one component. In this paper we shall charge $\max\{q, r\}$ units for executing directly a $(q, r)$-pattern on a router (as in the variant of the BSP model considered in [2]). This charging policy is intended to capture the basic idea that the requests made by any one component are processed sequentially as they are injected into the router, as are the requests arriving via the router at any one component. Thus $q$ and $r$ define the maximum cost of these two processes over all the components. Taking $\max\{q, r\}$ to be proportional to the overall time taken by the router is justified if the router has low latency, or if its latency is hidden by pipelining and its message load is high enough.
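The charging policy can be illustrated with a minimal Python sketch. The function name `pattern_charge` and the representation of a request as a (source component, destination component) pair are assumptions made only for this illustration.

```python
from collections import Counter

def pattern_charge(requests):
    """Return (q, r, charge) for a list of (source, destination) component pairs.

    q is the largest number of requests sent by any one component,
    r is the largest number addressed to any one component,
    and the charge is max(q, r).
    """
    sent = Counter(src for src, _ in requests)
    received = Counter(dst for _, dst in requests)
    q = max(sent.values(), default=0)
    r = max(received.values(), default=0)
    return q, r, max(q, r)

p = 8

# Every component sends one request to its neighbour: a (1, 1)-pattern.
ring = [(s, (s + 1) % p) for s in range(p)]
print(pattern_charge(ring))      # (1, 1, 1)

# Every component sends one request to component 0: a (1, p)-pattern.
hotspot = [(s, 0) for s in range(p)]
print(pattern_charge(hotspot))   # (1, 8, 8)
```

The second pattern shows why a hotspot is expensive under this policy: the receiving side alone accounts for the whole charge.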
Distinct components contain disjoint sets of memory addresses. Several requests may share the same address (and therefore by implication also the same component). We call the set of requests sharing the same address a group. The degree of a request is the number of elements in its group, and the degree of the pattern is the maximum degree of its elements. Thus if the pattern consists of $n$ groups, of respective sizes $d_1, \ldots, d_n$ and destinations $t_1, \ldots, t_n$, then the degree of the pattern is $d = \max\{d_1, \ldots, d_n\}$. The proposed multi-phase combining algorithm implements patterns of high degree by decomposing them into a sequence of patterns of low degree. At the end of each phase, the requests having the same destination address that arrive at each component are combined so that they can be transmitted as a single request in the phase to follow. For example, if every processor wishes to read the same word in memory then the requests form a $(1, p)$-pattern consisting of a single group, which has degree $p$. If implemented directly our charging policy would charge it $p$ units of time.
It can be decomposed, however, into a sequence of two $(1, \sqrt{p})$-patterns, each costing $\sqrt{p}$ units. The first allows each of $\sqrt{p}$ sets of $\sqrt{p}$ requests to converge to a separate component. Each of these components combines the $\sqrt{p}$ messages arriving into one message, which is charged as one unit when it is sent on in the second phase. This second phase sends the $\sqrt{p}$ requests that are so formed from the original $p$ requests to their common destination. This decomposition into two phases, therefore, reduces the total charge from $p$ to $2\sqrt{p}$ units. This example illustrates the software combining tree method proposed by Yew et al. [16] for dealing with a single hotspot.
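The arithmetic of this example can be checked with a short Python sketch. The function names are hypothetical. The extension from two phases to $m$ phases assumes a perfectly balanced combining tree with fan-in $p^{1/m}$ at every level, which is an idealization for a single hotspot rather than a statement about arbitrary patterns.

```python
def direct_cost(p):
    """Charge for a (1, p)-pattern sent directly: max(1, p)."""
    return max(1, p)

def m_phase_cost(p, m):
    """Charge for m balanced phases, each a (1, b)-pattern with b = p**(1/m).

    Assumes p is a perfect m-th power so that every phase has the same fan-in.
    """
    b = round(p ** (1.0 / m))
    if b ** m != p:
        raise ValueError("p must be a perfect m-th power")
    return m * max(1, b)

p = 4096
print(direct_cost(p))       # 4096
print(m_phase_cost(p, 2))   # 128  (two phases of sqrt(p) = 64 each)
print(m_phase_cost(p, 3))   # 48
print(m_phase_cost(p, 4))   # 32
```

Under this idealization, adding phases keeps reducing the charge only while the fan-in $p^{1/m}$ shrinks faster than $m$ grows.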
For simplicity we shall assume that any two requests to the same destination address are combinable. This is true if, for example, we are executing at any one time either read or write statements, but not both. Our algorithm and analysis can be easily extended to cases in which more than one species of request cohabit.
We shall now describe our multi-phase algorithm for implementing arbitrary patterns on a simple router efficiently. For concreteness we shall describe the two instances that we analyzed, one of which we implemented. Many variants with comparable performance are possible and we shall discuss some of these in Section 5.
The algorithm has a basis sequence $(b_1, \ldots, b_m)$ of integers such that $\prod_{i=1}^{m} b_i = p$. We give analytic results for two variants of the algorithm. Suppose the space of possible addresses is denoted by $M$ and that the components are numbered $\{0, \ldots, p-1\}$. Suppose also that $\{h_1, \ldots, h_m\}$ are hash functions where $h_i : M \to \{0, \ldots, b_1 \cdots b_i - 1\}$ and $\{k_1, \ldots, k_{m-1}\}$ are random functions where $k_i : \emptyset \to \{0, \ldots, b_{i+1} \cdots b_m - 1\}$. The important distinction is that for each pattern the hash functions $h_i$ are chosen once, randomly from certain sets of hash functions, while the random functions $k_i$ have values that are independently chosen randomly at each invocation, and do not depend on any argument. The $i$-th phase of the algorithm will at each component combine into one all the requests it has to any one destination $t_j$ and send it to component $h_i(t_j)\, b_{i+1} \cdots b_m + k_i$. Note again that the requests to the same $t_j$ will have the same value of $h_i(t_j)$, but will have $k_i$ chosen randomly and independently for each of them at each component. It can be easily verified that, after $i$ phases, among the requests destined for any $t_j$, sets of up to $b_1 \cdots b_i$ may have been combined into one. Also, if $i < m$, the resulting requests have been scattered randomly over $b_{i+1} \cdots b_m$ components, whose identities are determined by the hash function $h_i$. In particular, after the last phase the request destined for $t_j$ has been delivered to the hashed address $h_m(t_j)$. It turns out that the most promising mechanisms known for simulating a shared memory (PRAM) or BSP model on multiprocessor machines use a hashed address space (e.g. [14], [15]). Hence the algorithm as described implements hashing exactly as required in that context. As noted in Section 5, however, the algorithm can be adapted easily to deliver to physical rather than to hashed addresses.
An alternative algorithm with similar properties is obtained by replacing the random function $k_i$ in the above by the deterministic function $k_i : \{0, \ldots, p-1\} \to \{0, \ldots, b_{i+1} \cdots b_m - 1\}$ defined as $k_i(s) = s \bmod b_{i+1} \cdots b_m$, and in the $i$-th phase sending a request originating at component $s$ and destined for address $t_j$ to component $h_i(t_j)\, b_{i+1} \cdots b_m + k_i(s)$. This requires no randomization beyond the hash functions $h_i$, but evens out the load among the processors more slowly if the spread of the original requests is uneven.
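The choice of intermediate component in phase $i$ can be sketched in Python for both variants. The function name `next_component` and its argument names are hypothetical. The hash value $h_i(t_j)$ is passed in as a plain integer, since no particular hash family is fixed here.

```python
import random
from math import prod

def next_component(i, h_i_of_t, basis, source=None, rng=random):
    """Component that receives, in phase i (1-indexed), a request for address t.

    h_i_of_t : the value h_i(t), an integer in range(b_1 * ... * b_i)
    basis    : the basis sequence (b_1, ..., b_m)
    source   : if given, use the deterministic variant k_i(s) = s mod tail;
               otherwise draw k_i uniformly at random.
    """
    tail = prod(basis[i:])            # b_{i+1} * ... * b_m (equals 1 when i == m)
    if source is None:
        k = rng.randrange(tail)       # randomized variant
    else:
        k = source % tail             # deterministic variant
    return h_i_of_t * tail + k

basis = (8, 8, 8)                     # p = 512
print(next_component(1, 5, basis, source=300))   # 5*64 + 300 % 64 = 364
print(next_component(3, 341, basis))             # last phase: always h_3(t) = 341
```

In the last phase `tail` is 1, so both variants deliver to the hashed address itself.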
As an illustration of these two versions of the algorithm, consider the following example consisting of 3 phases and having basis sequence $(8, 8, 8)$, and, therefore, applying to the case $p = 512$. We represent the components as nine-bit binary numbers. The hash functions $h_1$, $h_2$ and $h_3$ applied to a destination address $t$ take on values that are sequences of 3, 6 and 9 bits respectively. The packets destined for address $t$, which could initially start from all of the $8^3 = 512$ processors, will be directed to addresses having a common value of $h_1(t)$ in their first three bits. Hence these packets will have converged to at most $8^2 = 64$ processors after the first phase. The importance of having these 64 processors determined by a hash function that treats distinct destination addresses as independently as possible is that it prevents packets going to distinct addresses from creating hotspots. There remains the separate problem that the packets going to the one destination $t$ must be spread sufficiently uniformly among these 64 processors that they do not create hotspots themselves. Our first solution is to spread them out by determining the last six bits in their addresses randomly and independently of each other, using a random function $k_1$. The second solution is to have as these last six bits the corresponding six bits of the source of the request. The latter is an attractive option for implementation in combination, for example, with $h_1(t)$ and $h_2(t)$ being prefixes of $h_3(t)$. The performance of the former solution is, however, independent of the source addresses. It is, therefore, the more appropriate for deriving experimental results that generalize. In either case the behavior of the second phase of the algorithm is similar to the first. The final phase brings all the packets destined for address $t$ to processor $h_3(t)$.
The effect of the overall algorithm is, therefore, to bring together the requests destined for one address in a tree of depth three, where the nodes have approximately equal degree and the processors that simulate them are chosen by a hash function in order to avoid hotspots.
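The whole procedure, in its randomized form, can be simulated in a few lines of Python. This is a sketch under stated assumptions rather than the implementation used for the experiments: the function name `simulate` is hypothetical, the hash functions are replaced by independent uniform random values per address with $h_i(t)$ taken as a prefix of $h_m(t)$, and each phase is charged $\max\{q, r\}$.

```python
import random
from collections import Counter
from math import prod

def simulate(p, basis, requests, seed=0):
    """Run multi-phase combining on requests given as (source, address) pairs.

    Returns (charges, final), where charges[i-1] is max(q, r) for phase i and
    final is the set of (component, address) pairs held after the last phase.
    """
    assert prod(basis) == p
    rng = random.Random(seed)
    # Stand-in for the hash functions: h_m(t) uniform in range(p), and
    # h_i(t) = h_m(t) // (b_{i+1} * ... * b_m), so h_i is a prefix of h_m.
    h_m = {t: rng.randrange(p) for _, t in requests}
    # Each component first combines its own requests to the same address.
    held = {(s, t) for s, t in requests}
    charges = []
    for i in range(1, len(basis) + 1):
        tail = prod(basis[i:])
        sent = Counter(s for s, _ in held)
        received = Counter()
        arrived = set()
        for _, t in held:
            dest = (h_m[t] // tail) * tail + rng.randrange(tail)
            received[dest] += 1
            arrived.add((dest, t))    # set insertion models combining on arrival
        charges.append(max(max(sent.values()), max(received.values())))
        held = arrived
    return charges, held

# Every one of 512 processors reads the same address.
charges, final = simulate(512, (8, 8, 8), [(s, "x") for s in range(512)])
print(charges, sum(charges), len(final))
```

For this single-hotspot pattern the total charge over the three phases is far below the 512 units a direct implementation would cost, and exactly one combined request remains at the end.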

Asymptotic Analysis
We shall establish the following property of the multi-phase algorithm, which holds asymptotically as the number of processors $p \to \infty$, for all patterns of all degrees.
Theorem. For any constant $\varepsilon > 0$, any integer $m \ge 1 + \lfloor \varepsilon^{-1} \rfloor$, and any constant $\delta > 0$, there is an $m$-phase algorithm that can realize any $(q, r)$-pattern with $q \le p^{\varepsilon}$ in a number of steps that exceeds $(1 + \delta)\, m\, p^{\varepsilon}$ with probability less than $e^{-\Omega(p^{\varepsilon - 1/m})}$.
The result assumes that the address space is hashed, as previously described, and for that reason does not depend on $r$. Also, the proof assumes that when choosing the hash functions, we are choosing from a set of functions that allow the chosen one to behave randomly and independently for the various arguments at which it is evaluated.
The result as stated improves on the constant multiplier in the runtime of the best previously known method based on integer sorting [8], [15]. In particular the experimental results show that small values of $\delta$ can be attained with probability close to unity.
We shall use the following bound on the tail of the sum of independent random variables given in [9] and [10], and also derivable from [11]. Lemma. If $\xi_1, \ldots, \xi_n$ are independent random variables each taking values in the range $[0, 1]$ such that the expectation of their sum is $E$, then for any $\delta > 0$,

Proof of Theorem
In the analysis we shall assume, for simplicity, that $b = p^{1/m}$ is an integer so that we can choose as basis sequence $(b_1, \ldots, b_m)$ where $b_1 = b_2 = \cdots = b_m = b$. Otherwise the $b_i$ would be chosen so as to differ by at most one from each other. We consider how the algorithm behaves on an arbitrary $(q, r)$-pattern with $q \le p^{\varepsilon}$, and with $n$ destination addresses $t_1, \ldots, t_n$ and degrees $d_1, \ldots, d_n$, respectively. If we let $v = \sum_{i=1}^{n} d_i$ then clearly $v \le qp \le p^{1+\varepsilon}$.
At the start of phase $i$ the $j$-th group of requests, namely those destined for $t_j$, will have been combined into at most $\min\{d_j, p/b^{i-1}\}$ requests, since these number at most $d_j$ at the start, and have converged to at most $p/b^{i-1}$ processors by this time. For $i > 1$ they will be distributed randomly among the $p/b^{i-1}$ components numbered $h_{i-1}(t_j)\, b^{m-i+1} + x$ for $0 \le x < b^{m-i+1}$. Now consider some fixed component numbered $y\, b^{m-i} + z$ where $0 \le y < b^{i}$ and $0 \le z < b^{m-i}$. Let $\xi_j$ be the number of requests with destination $t_j$ arriving at this component at the end of phase $i$. Then

$$\mathrm{Prob}(\xi_j \ge u) \le \mathrm{Prob}(h_i(t_j) = y) \cdot B\big(\min\{d_j, p/b^{i-1}\},\; b^{i}/p,\; u\big) \tag{1}$$

where $B(w, P, u)$ denotes the probability that in $w$ independent Bernoulli trials, each with probability $P$ of success, there are at least $u$ successes. The first term gives the probability that the randomly chosen hash function $h_i$ maps $t_j$ to $y$, and clearly equals $b^{-i}$. The second bounds the probability that at least $u$ of the requests are mapped by the invocations of the random function $k_i$ to the chosen value of $z$, and, by the Lemma above, can be upper bounded by

$$e^{-\Omega(b)} \tag{2}$$

if $u = (1 + \delta) b$ and $\delta > 0$, since the mean is at most $(p/b^{i-1})(b^{i}/p) = b$. We shall now define new random variables $\eta_1, \ldots, \eta_n$ where

$$\eta_j = \begin{cases} \xi_j / ((1 + \delta) b) & \text{if } \xi_j \le (1 + \delta) b, \\ 1 & \text{otherwise.} \end{cases}$$

These satisfy the condition of the Lemma that $0 \le \eta_j \le 1$, and model the behavior of $\xi_1, \ldots, \xi_n$ exactly, except for the range $\xi_j > (1 + \delta) b$, which is a very rare event by virtue of (2). Since, by (1), the expected value of $\xi_j$ is at most $b^{-i} \min\{d_j, p/b^{i-1}\}\, b^{i}/p \le d_j/p$, and $\sum_j d_j / p = v/p \le p^{\varepsilon}$, it follows from the definition of $\eta_j$ that the sum of the expectations of $\eta_1, \ldots, \eta_n$ is at most $p^{\varepsilon - 1/m}/(1 + \delta)$. Applying the Lemma to $\eta_1, \ldots, \eta_n$, assumed here to be independent, then gives that

$$\mathrm{Prob}\Big(\sum_{j=1}^{n} \eta_j \ge (1 + \delta)\, \frac{p^{\varepsilon - 1/m}}{1 + \delta}\Big) \le e^{-\Omega(p^{\varepsilon - 1/m})} \tag{3}$$

if $\delta > 0$.
But, by the definition of $\eta_j$, the left-hand side is at least

$$\mathrm{Prob}\Big(\sum_{j=1}^{n} \xi_j \ge (1 + \delta)\, p^{\varepsilon}\Big) - \sum_{j=1}^{n} \rho_j \tag{4}$$

where $\rho_j$ is less than the probability that $\xi_j$ exceeds $(1 + \delta) b$, which by relations (1) and (2) is at most $e^{-\Omega(p^{1/m})}$. This is derived by partitioning the event referred to in the first term of (4) into two events according to whether or not $\xi_j \le (1 + \delta) b$ for every $j$. Hence we deduce from (3) and (4) that the probability that the number of requests $\sum_{j=1}^{n} \xi_j$ arriving at the chosen node at the end of phase $i$ exceeds $(1 + \delta) p^{\varepsilon}$ is still $e^{-\Omega(p^{\varepsilon - 1/m})}$, since the $n \le p^{1+\varepsilon}$ choices of $j$ contribute only a lower order term in the exponent.
As there are $p$ components and $m$ phases, the probability that this charge is exceeded anywhere in the run is therefore at most $pm$ times this same quantity, which is also $e^{-\Omega(p^{\varepsilon - 1/m})}$. Hence the result claimed in the Theorem follows.
For completeness we now prove the same result for the alternative algorithm in which in phase $i$ any request originating at component $s$ and destined for address $t_j$ is sent to $h_i(t_j)\, b^{m-i} + k_i(s)$ where $k_i(s) = s \bmod b^{m-i}$. Here at the start of phase $i$ the group of requests destined for $t_j$ will again have been combined into at most $\min\{d_j, b^{m-i+1}\}$ requests, which are distributed among the $b^{m-i+1}$ components numbered $h_{i-1}(t_j)\, b^{m-i+1} + x$ for $0 \le x < b^{m-i+1}$. For any fixed component numbered $y\, b^{m-i} + z$ where $0 \le y < b^{i}$ and $0 \le z < b^{m-i}$ define $\xi_j$ to be the number of requests with destination $t_j$ arriving at this component at the end of phase $i$. Let $X_j$ be the number of requests at the start of phase $i$ destined for $t_j$ from addresses with $k_i(s) = z$. Then clearly $X_j \le b$, and $\sum_j X_j \le p^{\varepsilon} b^{i}$ since only the $b^{i}$ source components that coincide with $z$ in the last bits contribute. Also $\xi_j = X_j$ with probability $b^{-i}$ (i.e. if $h_i(t_j) = y$) and $\xi_j = 0$ otherwise. If we define $\eta_j = \xi_j / b$, so that $0 \le \eta_j \le 1$, then the sum of the expectations of the $\eta_j$ is

$$\sum_{j=1}^{n} b^{-i} X_j / b \le p^{\varepsilon} / b = p^{\varepsilon - 1/m}.$$

The result then follows from the Lemma as before.

Experimental Results
The multi-phase combining algorithm with the $k_i$ chosen to be random functions was implemented as follows. In the basis sequence we used only powers of 2 (i.e. $b_i = 2^{a_i}$ for integers $a_i$, $1 \le i \le m$). We used a pseudo-random number generator to generate new values of the functions $k_i$ at each invocation. We also used a pseudo-random number generator to generate for each pattern the set of values $\{h_i(t_j) \mid 1 \le i \le m,\ 1 \le j \le n\}$. In particular we had the binary representation of $h_i(t_j)$ be a prefix of the binary representation of $h_{i+1}(t_j)$, so that only $a_{i+1}$ new random bits were chosen when determining the latter. We ran experiments for the case $p = 2^{12}$ and $v = \sum_{j=1}^{n} d_j = 2^{17}$. We implemented $(q, r)$-patterns with $q = 32$, but with degrees varying from $2^{12}$ down to 1. Thus typically we had $n = 2^{17}/d$ groups each of degree $d$, for $d = 2^{12}, 2^{11}, \ldots, 2^{0}$. The reported results are all averages over 500 runs.
At one extreme we had $2^{17}$ groups of degree one and therefore no combining was required. The patterns were $(32, r)$-patterns where $r$ depended on the maximum number of requests that the hashed address space placed into one component. The average value of $r$ was determined experimentally to be 54.4. This is just the expected number of objects in the bucket having the most objects, if $2^{17}$ objects are placed randomly into $2^{12}$ buckets. Since this is the baseline performance of a pure router with a hashed address space, we computed the runtime of all our experiments as multiples of this basic unit, and call this multiple the performance factor.
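The baseline figure can be checked approximately with a small Monte Carlo sketch in Python. The function name `max_bucket_load`, the seed, and the choice of five trials are arbitrary assumptions; this is not the original experimental code, and the average it prints will vary with the generator and the number of trials.

```python
import random
from collections import Counter

def max_bucket_load(balls, buckets, rng):
    """Throw `balls` objects uniformly at random into `buckets` buckets and
    return the number of objects in the fullest bucket."""
    counts = Counter(rng.randrange(buckets) for _ in range(balls))
    return max(counts.values())

rng = random.Random(1)
loads = [max_bucket_load(2**17, 2**12, rng) for _ in range(5)]
print(loads, sum(loads) / len(loads))
```

The mean load per bucket is $2^{17}/2^{12} = 32$, so the fullest bucket is expected to lie well above 32 but far below the worst case of $2^{17}$.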
At the other extreme we had 32 groups of degree $2^{12}$, which corresponds to each of the components sending requests to the same set of 32 addresses. This requires the highest amount of combining.
We note that the $m = 1$ version of our algorithm (i.e. basis sequence $(4096)$) is the solution proposed in [14] for patterns of low degree. From Table 1 we see that if the degree is no more than the slack (i.e. $v/p = 32$) then the performance is indeed quite good, the performance factor being no worse than 3.7. This factor improves rapidly as $d$ decreases. On the other hand, the degree is clearly a lower bound on the runtime of the one-phase algorithm, and for $d = 2^{12}$ gives a performance factor greater than $4096/54.4 > 75$, which is unacceptable.
If the case $d = 1$ is implemented in several phases, then each phase performs hashing with no combining and contributes a factor of about 1 to the overall runtime. (The contribution is actually slightly more since we are charging $\max\{q, r\}$ rather than $r$ for a $(q, r)$-pattern and irregularities in the distribution at the start of a phase contribute also.) This is also the case in early phases of the algorithm if $d$ is small enough that little combining is done in that phase. In phases where much combining is done the performance factor can exceed one considerably. If the necessary combining is achieved in early phases, however, later phases may execute very fast since only few requests remain in the system. These phenomena can be discerned easily in Table 2, where for various values of $d$ we give basis sequences that achieved factors below 3.
We note that the motivation for the charging policy in the BSP model is that some routers may achieve a satisfactory rate of throughput only when they have enough work to do (i.e. $q$ is high enough when implementing a $(q, r)$-pattern) in one superstep. Hence the BSP model has a lower bound on the time for a superstep, determined by some parameters. Where this lower bound is relevant, we may give preference to basis sequences that distribute the time cost evenly among the phases. On the other hand there are circumstances, for example when the phases are implemented asynchronously as discussed below, in which this issue does not arise.
Tables 1 and 2 show that if we have information about the degrees of the patterns then we can find a good basis sequence that brings the performance factor of our algorithm below 3 in the whole range. If $d \le 8$, then use of a single phase brings this factor below 2. The only assumption here is that the requests all have the same degree, which is the case we tested. If we have no knowledge about the degree of concurrency, then, as Table 3 shows, the basis sequences $(32, 8, 4, 4)$ and $(32, 16, 8)$ are good compromises and achieve performance factors of at most about 4 and 3.5 respectively throughout the whole range.

Variants of the Algorithm
The algorithm as described is "bulk-synchronized" in the sense that each phase has to finish before the next one starts. The correctness of the algorithm, however, does not require this. As each request in a phase arrives at a component, a check can be made to determine whether any other to the same address has been previously received, and if none has then the request can be sent on immediately to the next phase, without waiting for the previous phase to complete. Where it is permissible, such an asynchronous implementation can only improve performance. The actual performance in that case depends, however, on the order in which the router delivers the requests. Asynchrony may be introduced also if the requests are transmitted bit-serially.
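The arrival-time check described above can be sketched as a small per-component data structure in Python. The class name `AsyncCombiner` and its method are hypothetical, and the sketch assumes that all requests to one address are combinable, as stated earlier. It also records the sources of the requests it absorbs, so that a reply could later be fanned back out.

```python
class AsyncCombiner:
    """Per-component state for asynchronous combining within one phase."""

    def __init__(self):
        # address -> list of sources whose requests were combined here
        self.sources = {}

    def receive(self, source, address):
        """Record an arriving request.

        Returns True if this is the first request seen for `address`, in
        which case it should be forwarded to the next phase immediately.
        Later requests to the same address are absorbed and not forwarded.
        """
        first = address not in self.sources
        self.sources.setdefault(address, []).append(source)
        return first

c = AsyncCombiner()
print(c.receive(3, "x"))   # True: forward at once
print(c.receive(5, "x"))   # False: combined with the earlier request
print(c.receive(5, "y"))   # True: a different address
print(c.sources["x"])      # [3, 5]
```

Forwarding on first arrival needs no knowledge of when the phase ends, which is why the order of delivery by the router affects performance but not correctness.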
The algorithm can be adapted to models of parallel computation other than the simple router. One candidate is what is called the S*PRAM in [15], which has been suggested as a model of various proposals for optical interconnects [1], [6]. Here at each cycle any component can transmit a message to any other, but only those receive messages that have just one targeted at them in that cycle. The senders find out immediately whether their transmission succeeded. General simulations of the BSP on the S*PRAM with slack $\log p$ or slightly more are known [3], [15], and these imply constant factor optimal implementations of our combining algorithm on the S*PRAM. There are clearly several possibilities for more efficient direct implementations also.
So far we have discussed versions of the algorithm that implement $(q, r)$-patterns in a hashed address space. The performance has been independent of the value of $r$ because of the hashing. Suppose now that we wish to send requests to physical addresses as, for example, when implementing a "direct BSP algorithm" [2]. We can clearly do this by sending the requests to hashed addresses first by the algorithm described, and then in one extra phase sending them to the correct destinations.
This last phase will be a pattern of degree 1, from randomly distributed sources. Also we can expect that the targets are distributed approximately uniformly among the components, since that is the purpose of using a direct algorithm. Hence this extra phase of the algorithm will run fast on the simple router. In particular, if the added last phase is a $(q', r')$-pattern, and the previous one is a $(q'', r'')$-pattern, then clearly $q' \le r''$. Also $r' \le r$ where $r$ is the maximum number of distinct destination addresses targeted in any component. Hence the cost $\max\{q', r'\}$ will be dominated by $r''$, which is controlled by randomization, and by $r$, which is controlled by the programmer.
As an alternative to adding an extra phase to our multi-phase combining algorithm, we can also consider replacing its last phase by one that sends the requests directly to the actual rather than the hashed addresses. This will be efficient if the number of requests destined for each physical component is small enough. When counting this number here, we have to allow for the multiplicity of each request group as defined by its degree in the last phase of the basic algorithm.
When implementing this multi-phase algorithm, provision has to be made by software or hardware or some combination, for storing at each phase the sources of the converging requests, so that this trace can be used for any necessary reverse flow of information. These provisions are also useful for implementing concurrent accesses when the decomposition of the pattern into phases is handcrafted by a programmer. This may be worthwhile for the sake of greater efficiency, for patterns that have a structure well-known to the programmer. Our algorithm, therefore, is also consistent with such direct implementations of concurrent accesses.
Finally, we note that our combining mechanism can be used for applications other than accesses. When requests are combined it is meaningful to perform almost any operation on their contents that is commutative and associative. The method can be used, for example, to find the sum, product, minimum, or Boolean disjunction over arbitrary sets of elements simultaneously.
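The use of combining for such operations can be sketched in Python as a phase-by-phase reduction. The function name `combine_in_phases` and the fixed `fan_in` parameter are assumptions for illustration: the sketch groups values by position rather than by hashed component, so it shows only why associativity and commutativity make the order of combining irrelevant, not the routing itself.

```python
from functools import reduce
import operator

def combine_in_phases(values, fan_in, op):
    """Reduce `values` with the associative, commutative operator `op`,
    combining at most `fan_in` items at a time in each phase."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(values)
    if not level:
        raise ValueError("values must be non-empty")
    while len(level) > 1:
        level = [reduce(op, level[k:k + fan_in])
                 for k in range(0, len(level), fan_in)]
    return level[0]

data = list(range(512))
print(combine_in_phases(data, 8, operator.add))   # 130816
print(combine_in_phases(data, 8, min))            # 0
print(combine_in_phases([False, True, False], 2, operator.or_))  # True
```

With 512 values and fan-in 8 the loop runs three times, mirroring the depth-three tree of the $(8, 8, 8)$ example.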