Towards compressing Web graphs

We consider the problem of compressing graphs of the link structure of the World Wide Web. We provide efficient algorithms for such compression that are motivated by random graph models for describing the Web. The algorithms are based on reducing the compression problem to the problem of finding a minimum spanning tree in a directed graph related to the original link graph. The performance of the algorithms on graphs generated by the random graph models suggests that by taking advantage of the link structure of the Web, one may achieve significantly better compression than natural Huffman-based schemes. We also provide hardness results demonstrating limitations on natural extensions of our approach.


Introduction
A snapshot of the World Wide Web can be thought of as a graph, with Web pages represented by nodes and hyperlinks represented by directed edges. This representation has been used for a wide variety of Web algorithms, including algorithms for ranking pages based on their connectivity [11,3] and finding natural communities of pages on a shared topic [13]. Indeed, at least one major search engine has designed a tool called the connectivity server for storing the Web graph [2,4].
Given this previous work, a natural question to ask is how well the Web graph and Web-like graphs can be compressed, in order to save on the memory required to store or transfer such graphs. Good compression requires using the structural properties of the Web graph, and hence an important first step is understanding this structure. Previous work gives us important insights. It is clear that the Web graph appears to be significantly different from the likely graphs resulting from traditional random graph models. In particular, there appear to be natural clusters of related pages with similar connections. Hence, in [12,14], a new random graph model was introduced with these clustering properties. The basis of this model is that pages and links enter and leave the system dynamically, and new pages may link to other pages by finding one or more reference pages and copying links from these references.
Recent studies of the Web graph suggest that the structure of the Web is actually more complex than this random graph model; see, for example, a study based on a recent snapshot of the Web from Altavista [4]. However, as a first approximation, this model captures important high-level behavior, and it may be especially suitable for the large components of the Web graph, or for specific subdomains, such as all the pages within a given university. Hence, in this paper we focus on variations of this graph model; experiments on complete Web data will be covered in future work. Since achieving good compression is strongly related to finding structure, we expect our compression work eventually to yield further insights into the structure of Web graphs.
Our primary results are the following:
- We provide a compression algorithm based on finding pages with many shared links. This dependence on shared links makes our algorithm particularly well suited to graph models that employ copying links from reference pages, but we also expect that our algorithm will work well with any graph structure that involves significant similarities in shared links.
- The algorithm requires solving a directed minimum spanning tree problem on a graph associated with the original graph. Under appropriate assumptions concerning the degree distribution of the Web graph to be compressed, the running time of our algorithm is O(n log n), where n is the number of nodes in the Web graph.
- We provide hardness results demonstrating that several natural extensions of our algorithm are NP-Hard.
- We demonstrate the effectiveness of our approach on a testbed of random graphs derived from the random graph models that motivate our approach. Our results appear significantly better than natural Huffman-based alternatives.

Framework
When we discuss compressing Web-like graphs, there are actually a variety of distinct situations we may wish to consider:
1. Compressing the underlying graph for storage or transmission, up to isomorphism. This setting would be useful if we want to store just the graph structure itself.
2. Compressing the underlying graph for storage or transmission, maintaining a given ordering of the nodes. As an example of this setting, we might order the nodes according to the sorted order of the URLs (so that the URLs can be compressed by delta encoding, as in [2]).
3. Compressing the underlying graph for use in its compressed form. That is, we desire a compressed form of the graph that still allows for efficient computation on the compressed form.
Our primary focus in the paper is the second setting, where we are given a node ordering and are concerned solely with overall compression.However, the three problems are clearly related, and we will suggest connections between the variations as they arise.

A Web graph model
We reiterate that in this paper, rather than compress actual subgraphs of the Web, our focus is a recently proposed Web graph model that captures certain aspects of Web graphs.We have thus far only tested the algorithm on a single subgraph of the real Web graph (on which, as we show, performance is quite good); we expect to test our compression scheme on more extensive real Web data in future work.We also believe our experiments on the Web graph model we examine are interesting in their own right.
The model, taken primarily from [14], uses the following basic outline. The graph evolves over time by associated node and edge creation and deletion processes. The intuition suggested in [14] is the following: "A new page adds links by picking an existing page, and copying some links from that page to itself." For example, a new page v might examine the outedges from a page w and link to a subset of the pages that w links to; we call this copying outedges. This intuition is based on the idea that a user decides what pages to link a new page to based on a page or pages that the user already likes.
Given this framework, there are a variety of possible variations, depending on the specifics of the edge creation and deletion processes as well as the copy process. We specify the model we use here. We begin with an initial graph of $n_0$ nodes, each having $d_0$ outedges connected to nodes chosen uniformly at random. There is no deletion process, only a node creation process. One new node is created each time step, up to a total of $n$ nodes. The creation process is determined by probability distributions $A, B, C, D, E, F$. The distribution $A$ provides a number $a$, such that the new node $v$ is given $a$ outedges, with each edge pointing to a node chosen uniformly at random from all nodes existing at that time. Similarly, $B$ provides a number $b$ such that $v$ copies outedges from $b$ nodes, again chosen uniformly at random. The distribution $C$ yields a probability; for each of the $b$ nodes $w_1, \ldots, w_b$ chosen to copy from, a probability is independently chosen from $C$, and each outedge from $w_i$ is copied independently with the probability determined from $C$. Distributions $D$, $E$, and $F$ are analogous to $A$, $B$, and $C$ respectively, except that they determine the inedges for a new page.
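To make the model concrete, the following sketch generates the outedge side of such a copy graph. The parameter defaults and helper names are our own illustrative assumptions, not values prescribed by the model; the inedge distributions $D$, $E$, $F$ would be handled analogously.

```python
import random

def generate_copy_graph(n, n0=1024, d0=3,
                        num_random_edges=lambda: 1,   # draws from distribution A (assumed)
                        num_copy_sources=lambda: 1,   # draws from distribution B (assumed)
                        copy_prob=lambda: 0.5):       # draws from distribution C (assumed)
    """Return {node: set of outedge destinations} for a simple copy graph."""
    out = {v: set() for v in range(n0)}
    # Seed graph: n0 nodes, each with d0 outedges to uniformly random nodes.
    for v in range(n0):
        while len(out[v]) < d0:
            out[v].add(random.randrange(n0))
    # One new node per time step.
    for v in range(n0, n):
        out[v] = set()
        # a outedges to uniformly random existing nodes.
        for _ in range(num_random_edges()):
            out[v].add(random.randrange(v))
        # Copy outedges from b uniformly chosen existing nodes.
        for _ in range(num_copy_sources()):
            w = random.randrange(v)
            p = copy_prob()                    # probability drawn from C for this source
            for u in out[w]:
                if random.random() < p:
                    out[v].add(u)
    return out
```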
Graphs based on this copying procedure do not appear to have been given a general name. The resulting graphs may have more specific structure than power-law graphs, such as those suggested in [1]. We suggest the name copy graphs for families of evolving random graphs using copying operations of this type.

A baseline Huffman-based scheme
Experiments have demonstrated that the indegrees and outdegrees of Web pages follow a Zipfian distribution [1, 14, 4]. That is, the fraction of pages with indegree $j$ is roughly proportional to $1/j^{\alpha}$ for some fixed constant $\alpha$, and similarly the fraction of pages with outdegree $j$ is roughly proportional to $1/j^{\beta}$ for some fixed constant $\beta$. One of the features of the copy graph model is that it can yield graphs with such Zipfian distributions [14].
Given the large variance in degrees, it is natural to consider Huffman-based compression schemes. A simple such scheme would go through the nodes in order and list the destination of each outedge directed from that node. Each page would be assigned a Huffman codeword based on its indegree. To separate the outedges of each node we could utilize a special stop symbol, which would appear $n-1$ times if there are $n$ pages. An end-of-file symbol would denote the end of the edge list. In Appendix 1, we give an expression for the number of bits required to represent a graph using this kind of compression, provided that the indegree follows a Zipfian distribution with $\alpha > 2$. The expression demonstrates that this kind of compression requires asymptotically the same number of bits as not using Huffman compression.
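As a rough illustration of this baseline (a simplification under our own assumptions, not the expression from Appendix 1), the sketch below computes the number of bits the scheme would use for a graph given as an outedge adjacency list, charging each symbol its Huffman codeword length and ignoring the size of the code table, as the experiments later in the paper also do.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: codeword length in bits} for the given symbol frequencies."""
    if len(freqs) == 1:
        return {s: 1 for s in freqs}
    tiebreak = count()
    heap = [(weight, next(tiebreak), [(sym, 0)]) for sym, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        merged = [(sym, depth + 1) for sym, depth in group1 + group2]  # one more bit per merge
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return dict(heap[0][2])

def baseline_huffman_bits(outedges):
    """Estimated size in bits of the stop-symbol Huffman scheme (code table not counted)."""
    STOP, EOF = ("stop",), ("eof",)          # sentinels distinct from node identifiers
    freqs = Counter(dst for dsts in outedges.values() for dst in dsts)
    freqs[STOP] = len(outedges) - 1          # one stop symbol between consecutive nodes
    freqs[EOF] = 1
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[sym] * lengths[sym] for sym in freqs)
```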
Many simple variations are possible. The compression scheme could also be based on the edges directed into each node instead of the edges directed out from each node. Whether inedges or outedges is a better choice depends on which has higher variance. In the case where we only need to store an isomorphism of the graph, we might avoid the stop symbol. Instead we can send an implicit or explicit representation of the outdegree distribution, sort the nodes by outdegree, and list the outedges for each node as before without the stop symbol. Also, we note that an advantage of this compressed representation is that it is extremely simple and convenient; it could naturally be used in the framework of the connectivity server [2], or in any system that wanted efficient computation on the compressed form of the graph.
This approach achieves significant compression with little complexity, and thus we shall use it as a baseline for algorithms that compress the Web graph. Note that this Huffman-based scheme ignores the natural clustering structure induced in copy graphs. We next examine how to take advantage of this structure and show via experiments that this does in fact yield substantially greater compression.

The Find-Reference algorithm
Our basic algorithm is based on the following insight: given the random graph model, a natural approach to compress a Web graph is to try to reconstruct how the graph developed according to that model. That is, if we knew the history of how the graph was created (according to the random graph model), we might achieve a great deal of compression by working with this history instead of the actual graph. In practice, we attempt to find nodes that share several common outedges, corresponding to cases where one node might have copied the links of another. Once an appropriate neighbor is identified, the difference, or delta, between the outedges of the two nodes can be identified. When node i is compressed in this way using node j, we say that node j is a reference for node i.
For example, if node j is labeled as a reference for node i, we can include a 0/1 bit vector denoting which outedges of node j are also outedges of node i. Other outedges of i can then be separately identified, say using $\lceil \log n \rceil$ bits each in an n-node graph. Of course we must also identify i, which is another $\lceil \log n \rceil$ bits. Let $N(i)$ and $N(j)$ represent the set of outedge destinations for node i and node j respectively. The cost of compressing node i using node j as a reference with this scheme is then
$$\mathrm{cost}(i,j) = \text{out-deg}(j) + \lceil \log n \rceil \,\big(|N(i) \setminus N(j)| + 1\big).$$
Given a description of a graph in this kind of compressed format, consider how we would determine where a link from node i encoded using node j as a reference actually points. If the corresponding link from node j is encoded using another node k as a reference, then we would need to determine where the corresponding link from node k points. Eventually, we must reach a link that is encoded without using a reference node. In order to satisfy this requirement, we shall not allow any cycles among references. For example, we shall not allow i to be compressed using j as a reference, j to be compressed using k as a reference, and k to be compressed using i as a reference.
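In code, this cost function might be written as follows, where N maps each node to its set of outedge destinations and n is the number of nodes. The cost of encoding a node without a reference is our own natural assumption; the paper only says it is the cost of compressing the node with no reference.

```python
import math

def cost(i, j, N, n):
    """Weight of the affinity-graph edge (i, j): a bit vector over j's outedges,
    plus about log n bits for each outedge of i not shared with j, plus about
    log n bits to identify the node being compressed."""
    log_n = math.ceil(math.log2(n))
    return len(N[j]) + log_n * (len(N[i] - N[j]) + 1)

def cost_without_reference(i, N, n):
    """Weight of the edge (i, r): encode each outedge of i directly, plus identify i.
    (This exact expression is an assumption, chosen by analogy with cost above.)"""
    log_n = math.ceil(math.log2(n))
    return log_n * (len(N[i]) + 1)
```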
An intermediate structure that Find-Reference uses is the affinity graph $G_S$ for the given Web graph $G_W$. Specifically, the nodes of $G_S$ are the same as the nodes of $G_W$. We set $w(i,j)$, the weight of the directed edge from node i to node j, to be the cost of compressing node i using node j as a reference. We add to the affinity graph a root node r to which every other node has a directed edge and from which there are no directed edges. The weight of the edge from i to r is the cost of compressing i without using any other node as a reference. We assume that node i has a directed edge to node j if and only if $w(i,j) < w(i,r)$.
Given a Web graph, the algorithm Find-Reference first computes the corresponding affinity graph for the given cost function, and then finds an optimal set of references under the restrictions that (a) each node has at most one reference, and (b) there are no cycles among references. The problem of finding the globally best mapping from nodes to references (or to the dummy node) is equivalent to finding the minimum weight directed spanning tree with root r on the affinity graph. Thus, a high-level description of the compression algorithm is as follows:

Algorithm Find-Reference
1. Given a Web graph $G_W$, compute the corresponding affinity graph $G_S$.
2. Compute a minimum directed spanning tree $D$ rooted at $r$ for the graph $G_S$.
3. Compress the graph $G_W$, where node $i$ uses node $j$ as a reference if and only if node $i$ points to node $j$ in $D$.
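One way to realize steps 2 and 3 is sketched below using networkx's implementation of Edmonds' algorithm for minimum spanning arborescences. Because the affinity-graph edges point from each node toward its potential reference (and toward r), we reverse the graph before calling the routine, so that the arborescence it returns is necessarily rooted at r. This is an illustrative sketch, not the authors' implementation.

```python
import networkx as nx

def find_references(affinity, root="r"):
    """affinity: nx.DiGraph whose weighted edge (i, j) is cost(i, j), including an
    edge (i, root) of weight cost(i, r) for every node i.
    Returns {node: its reference, or None if the node is encoded directly}."""
    reversed_graph = affinity.reverse(copy=True)
    # In the reversed graph only `root` has no incoming edges, so the minimum
    # spanning arborescence found by Edmonds' algorithm is necessarily rooted there.
    tree = nx.minimum_spanning_arborescence(reversed_graph, attr="weight")
    references = {}
    for parent, child in tree.edges():
        references[child] = None if parent == root else parent
    return references
```

Each node mapped to a reference j would then be emitted as a bit vector against j's outedge list plus its remaining edges, while nodes mapped to None are emitted directly.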
Theorem 1 For a Web graph $G_W$, let $n$ be the number of nodes in $G_W$, and let $t_{G_W}(i)$ be the indegree of node $i$. Algorithm Find-Reference can be realized to run in time $O\big(n \log n + \sum_{i=1}^{n} t_{G_W}(i)^2\big)$.

Proof:
The affinity graph $G_S$ can be computed from the original graph $G_W$ by using a matrix multiplication. When $G_W$ is a Web graph, we expect it to be sparse, and so we describe the algorithm in terms of a sparse matrix multiplication. Let $M$ represent the adjacency matrix of $G_W$. It is easy to verify that $(MM^T)_{ij}$ is the number of nodes that both $i$ and $j$ have outedges to. The matrix $MM^T$ can be computed in time $O\big(\sum_{i=1}^{n} t_{G_W}(i)^2\big)$, assuming that we compute a list of the non-zero entries of $MM^T$. We also compute an array $R$, where $R[j] = \text{out-deg}(j)$. This requires time $O(n+m)$, where $m = \sum_{i=1}^{n} t_{G_W}(i)$ is the number of edges in $G_W$. Given $(MM^T)_{ij}$ and $R[j]$, $\mathrm{cost}(i,j)$ can be computed in constant time.
Note that there will never be an edge from $i$ to $j$ in $G_S$ unless nodes $i$ and $j$ in $G_W$ have an outgoing edge to at least one shared neighbor. Thus, to compute the set of edges in $G_S$, we only need to compute $\mathrm{cost}(i,r)$ for every $i$, and then, for every pair $(i,j)$ such that $(MM^T)_{ij} > 0$, compute $\mathrm{cost}(i,j)$ and compare it to $\mathrm{cost}(i,r)$. The set of edges in $G_S$ is $\{(i,j) : \mathrm{cost}(i,j) < \mathrm{cost}(i,r)\}$ together with the set of edges from every other vertex to $r$. This also gives us the weight of each edge in $G_S$. Since there can be at most $\sum_{i=1}^{n} t_{G_W}(i)^2$ non-zero entries in $MM^T$, the total time required to compute the graph $G_S$ is $O\big(n + \sum_{i=1}^{n} t_{G_W}(i)^2\big)$.
Computing a minimum directed spanning tree with root $r$ in a directed graph is generally referred to in the literature as a branching with root $r$. For information on branchings, see for example [6,8,10,16]. Minimum spanning trees in directed graphs with $x$ nodes and $y$ edges can be found deterministically in time $O(x \log x + y)$ [8]. A simpler algorithm that runs in time $O(y \log x)$ is suitable for the case of sparse graphs [16,6], which will generally be the case in our context. Since the total number of edges in $G_S$ is at most $\sum_{i=1}^{n} t_{G_W}(i)^2 + n$, the total time required to compute the minimum directed spanning tree in $G_S$ is $O\big(n \log n + \sum_{i=1}^{n} t_{G_W}(i)^2\big)$.
All that remains is to perform the compression using the computed directed tree to specify a reference for each node. To do this, we compute for each node $i$ with reference node $j$ a linked list of the outedges that $i$ and $j$ have in common. This set of lists can be computed in time $O\big(\sum_{i=1}^{n} t_{G_W}(i)^2\big)$. With the list of edges that $i$ and $j$ have in common, the compressed version of node $i$ can be computed in time $O(\text{out-deg}(i))$. Thus, the entire algorithm runs in time $O\big(n \log n + \sum_{i=1}^{n} t_{G_W}(i)^2\big)$, which completes the proof.

Note that the performance of this algorithm is particularly good when $G_W$ is sparse, as we expect of Web graphs. For example, if the distribution of indegrees in $G_W$ is Zipfian with $\alpha > 3$, then $\sum_{i=1}^{n} t_{G_W}(i)^2 = O(n)$.
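The shared-destination counts $(MM^T)_{ij}$ used in the proof can be obtained with an off-the-shelf sparse matrix product, as in the following sketch; the experiments later in the paper additionally drop pairs sharing fewer than three destinations, which here would be a simple thresholding of the nonzero entries.

```python
import numpy as np
from scipy.sparse import csr_matrix

def shared_destination_counts(outedges, n):
    """Return the sparse matrix M @ M.T, whose (i, j) entry is the number of
    destinations that nodes i and j both link to in the Web graph."""
    rows, cols = [], []
    for i, dsts in outedges.items():
        for d in dsts:
            rows.append(i)
            cols.append(d)
    data = np.ones(len(rows), dtype=np.int64)
    M = csr_matrix((data, (rows, cols)), shape=(n, n))
    counts = M @ M.T   # stays sparse; cost is proportional to the nonzero entries
    return counts
```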
We point out that an idea similar to the algorithm Find-Reference is alluded to in [5], in the context of compressing tables of data, where one column can be used to compress another. The authors mention that the problem can be reduced to a minimum spanning tree problem (in their case, edges are undirected).

Additional improvements and related problems
In practice, after we have found the references via the directed minimum spanning tree, there are various improvements that can be implemented. First, we may wish to find additional references for greater compression. This can be done by stripping from the original graph the edges handled by the first references, re-calculating the cost function accordingly, and rerunning the algorithm. This staged algorithm is not optimal, however, since we may obtain better compression if we choose the references of the first stage keeping in mind that we have further stages coming. Although it appears that a better approach would be to find multiple references simultaneously instead of in stages, in general finding multiple references in an optimal manner is a hard problem, as we show below in the section on hardness results. Hence the staged approach is likely to be the most efficient and effective in practice.
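A minimal sketch of this staged approach follows, reusing the cost functions and find_references from the sketches above; build_affinity_graph is a quadratic-time illustrative helper, not the sparse-matrix construction used in the running-time analysis.

```python
import networkx as nx

def build_affinity_graph(outedges, n, root="r"):
    """Assemble the affinity graph from the cost functions defined earlier."""
    g = nx.DiGraph()
    for i, dsts in outedges.items():
        g.add_edge(i, root, weight=cost_without_reference(i, outedges, n))
        for j in outedges:
            if j != i and dsts & outedges[j]:
                w = cost(i, j, outedges, n)
                if w < g[i][root]["weight"]:
                    g.add_edge(i, j, weight=w)
    return g

def staged_references(outedges, n, passes=2):
    """Run Find-Reference repeatedly: after each pass, strip the edges already
    covered by that pass's references and rerun on the residual graph."""
    residual = {i: set(dsts) for i, dsts in outedges.items()}
    rounds = []
    for _ in range(passes):
        refs = find_references(build_affinity_graph(residual, n))
        rounds.append(refs)
        for i, j in refs.items():
            if j is not None:
                residual[i] -= set(outedges[j])   # covered by reference j in this round
    return rounds
```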
Once we have found the best references, we may again use a Huffman encoding to handle the edges not covered by references. Note that by doing this, we invalidate the cost function we used to determine the references, so that the set of references may not be optimal. However, until we choose the references, we cannot determine the cost of edges not covered by references, so it seems difficult to take this into account properly in the cost function. One possibility is to attempt to approximate this effect in the cost function. Another is to apply a heuristic approach such as hill-climbing to find the best references. Since this process is likely to be time-consuming, starting with a good solution from our algorithm may prove effective. The gain from this consideration is likely to be small, and so in practice it can probably be ignored.
Other possibilities include using different compressed representations. We have suggested using a bit vector to denote which links a node has copied from its reference. These bit vectors can be Huffman encoded; alternatively, a run-length encoding might be applicable here. Also, if a node only copies a small fraction of the links from its reference, a list of the copied links may be more efficient than a bit vector. As usual with compression schemes, there are a variety of possible enhancements that may slightly improve compression. However, we believe the main concept of using similar pages for compression provides the bulk of the benefit.
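For instance, a per-node encoder might simply pick whichever of the two delta representations is smaller; the sketch below compares a bit vector over the reference's outedges against an explicit list of copied positions (length fields and run-length coding are omitted for brevity, and the names are ours).

```python
import math

def delta_bits(i, j, N):
    """Bits to say which of reference j's outedges node i copies: either a bit
    vector over j's outedge list, or an explicit list of copied positions,
    whichever is smaller."""
    shared = len(N[i] & N[j])
    bit_vector = len(N[j])
    position_bits = math.ceil(math.log2(max(len(N[j]), 2)))
    explicit_list = shared * position_bits
    return min(bit_vector, explicit_list)
```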
While our algorithm is described for the problem of storing Web graphs, we believe these techniques can also be useful when we wish to compute with or use the compressed form. The main potential problem is that in order to find the inedges or outedges of a node, one may have to go through multiple references in the directed minimum spanning tree, which may take more time than is desired for fast computation. To bound the number of references to pass through in our single-reference setting, it is sufficient to bound the depth of the directed minimum spanning tree we find on the affinity graph. Unfortunately, finding the optimal directed minimum spanning tree of bounded depth is NP-hard; for example, if we allow depth at most two, then the problem of finding the optimal directed minimum spanning tree is equivalent to the facility location problem. (Indeed, it is this connection to the facility location problem that was used in the work on compressing tables of data mentioned earlier [5].) In the terminology of [15], each page is a possible facility; a page that is not compressed by a reference corresponds to an opened facility; and a page that is compressed using a reference corresponds to a location receiving shipment from a facility corresponding to the reference page. We believe the depth-bounded directed minimum spanning tree problem is an interesting extension of previous facility location problems. In practice, we expect that using the Find-Reference algorithm to initially find a directed tree and then "chopping the tree" to maintain a depth bound (by changing some nodes to be compressed without a reference and thus linking them to the root r) is a suitable solution.
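A minimal version of this chopping heuristic is sketched below: starting from the nodes encoded without a reference, walk the reference forest and re-attach to the root any node whose chain of references would exceed the depth bound. The function and argument names are ours.

```python
from collections import defaultdict, deque

def chop_to_depth(references, max_depth):
    """references maps each node to its reference, or None if encoded directly.
    Returns a copy in which no node is more than max_depth references away from
    a directly encoded node, by re-attaching offending nodes to the root."""
    children = defaultdict(list)
    for node, ref in references.items():
        children[ref].append(node)
    chopped = dict(references)
    queue = deque((node, 0) for node in children[None])   # directly encoded nodes
    while queue:
        node, depth = queue.popleft()
        if depth > max_depth:
            chopped[node] = None   # drop the reference; this node's subtree restarts at depth 0
            depth = 0
        for child in children[node]:
            queue.append((child, depth + 1))
    return chopped
```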

Hardness results
Since we can find the optimal compression given an appropriate cost metric when we allow a single reference node using branching algorithms, a natural question to ask is whether we can similarly achieve optimality when we allow more than one reference node. We show hardness results related to this question. We focus on the case where up to two nodes can be used as references, but everything described is easily generalized to any number of reference nodes.
For up to two reference nodes, the affinity graph becomes the following kind of structure:

Definition 1 A 2-supergraph is a directed hypergraph where each hyperedge is directed from a single node to two other nodes. These two other nodes can be the same, but must be different from the source node.
Given a Web graph, we shall consider the corresponding weighted 2-supergraph, where $w_{ijk}$, the weight of the hyperedge from $i$ to $j$ and $k$, represents the cost of encoding $i$ using both $j$ and $k$ as references. For a hyperedge $w_{ijj}$, where the two other nodes pointed to are the same node, the weight represents the cost of encoding node $i$ using only node $j$ as a reference node. Note that $w_{ijk}$ will vary depending on the overlap between the set of edges of the Web graph that nodes $i$ and $j$ have in common and the set of edges of the Web graph that nodes $i$ and $k$ have in common. We call the resulting 2-supergraph an affinity 2-supergraph.
Given a Web graph, computing the affinity 2-supergraph for a given link compression scheme can easily be done in polynomial time. Using the affinity 2-supergraph to compute the best compression using up to two reference nodes is equivalent to the following generalization of finding optimal branchings:

Definition 2 Given a 2-supergraph $G$ and a designated root node $r$, a 2-branching is a subset $S$ of the hyperedges of $G$ such that each node except the designated root has exactly one outgoing hyperedge in $S$, and $r$ has no outgoing hyperedges in $S$. In addition, the hypergraph formed by the set of hyperedges in $S$ has no directed cycles. The optimum 2-branching is the 2-branching that minimizes the total weight of the edges in $S$.
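To make the definition concrete, the following sketch represents a 2-supergraph as weighted hyperedges (source, target1, target2, weight) and checks whether a chosen subset is a valid 2-branching; the names are ours and this is purely illustrative.

```python
def is_valid_2_branching(nodes, chosen, root):
    """chosen: iterable of hyperedges (src, t1, t2, weight). Valid iff every node
    except root has exactly one outgoing hyperedge, root has none, and the chosen
    hyperedges induce no directed cycle."""
    out_count = {v: 0 for v in nodes}
    successors = {v: set() for v in nodes}
    for src, t1, t2, _ in chosen:
        out_count[src] += 1
        successors[src] |= {t1, t2}
    if out_count[root] != 0:
        return False
    if any(out_count[v] != 1 for v in nodes if v != root):
        return False
    WHITE, GRAY, BLACK = 0, 1, 2           # depth-first search cycle check
    color = {v: WHITE for v in nodes}

    def has_cycle(v):
        color[v] = GRAY
        for w in successors[v]:
            if color[w] == GRAY or (color[w] == WHITE and has_cycle(w)):
                return True
        color[v] = BLACK
        return False

    return not any(color[v] == WHITE and has_cycle(v) for v in nodes)

def branching_weight(chosen):
    return sum(weight for _, _, _, weight in chosen)
```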
Unfortunately, in general, finding the optimal 2-branching is not only NP-Hard, it is hard to approximate. In particular, we demonstrate an approximation-preserving polynomial time reduction from the problem of finding the optimal directed Steiner tree to the problem of finding the optimal 2-branching. It is known that if $P \neq NP$, then no polynomial time algorithm can find a $\log n$-approximation to the directed Steiner tree problem [7]. Furthermore, this reduction provides evidence that it may be difficult to find a polynomial time algorithm that provides better than an $n^{\epsilon}$-approximation, since the directed Steiner tree problem has thus far resisted efforts to outperform this bound.
Theorem 2 Any polynomial algorithm that provides a k-approximation for the 2-branching problem also provides a k-approximation for the directed Steiner tree problem.
Proof: We use the following reduction: given an input to the directed Steiner tree problem consisting of a directed graph $G = (V, E)$, a set $S \subseteq V$ of required points, a weight function $W : E \rightarrow \mathbb{R}$, and a designated root $r$, we construct an input to the 2-branching problem as follows. The input hypergraph is $G' = (V', E')$, where $V' = V \cup U$, and $U$ is an additional set of nodes we describe next. For each node $v \in V \setminus S$, let $U(v) = \{u_1, \ldots, u_{\mathrm{indeg}(v)}\}$ be the set of nodes $u \in V$ such that $(u,v) \in E$. For each $u_i \in U(v)$, $U$ has a node denoted $v_i$. Let $u(v_i)$ be the node $u_i$ that corresponds to the node $v_i$.
For each edge $e = (u,v)$ in $E$, $E'$ contains one hyperedge $e' = (u,v,v)$, where $w_{uvv}$ is equal to the weight of $e$. The edge set $E'$ also contains a hyperedge $(v, v_1, u(v_1))$ of weight 0, and, for all $i$ with $1 \le i < \mathrm{indeg}(v)$, $E'$ contains a hyperedge $(v_i, v_{i+1}, u(v_{i+1}))$ of weight 0. The edge set $E'$ also has a hyperedge $(v_{\mathrm{indeg}(v)}, r, r)$ of weight 0. Finally, for all $i$ with $1 \le i \le \mathrm{indeg}(v)$, $E'$ has a hyperedge $(v_i, v, r)$ of weight 0. This completes the construction of the 2-branching problem. The theorem now follows from the following two claims:

Claim 1 For any directed Steiner tree of weight $z$, there is a 2-branching with weight $z$.

Proof:
The 2-branching consists of all the edges used by the directed Steiner tree. These will be the only non-zero weight edges used in the 2-branching, and thus the 2-branching has the same weight as the directed Steiner tree. This guarantees that every required node of the Steiner tree problem has exactly one outgoing hyperedge. We divide the optional nodes of the Steiner tree problem into two sets: $S_1$, consisting of the nodes used in the Steiner tree solution, and $S_2$, consisting of the nodes not used in the Steiner tree solution. For each node $v \in S_1$, there already is exactly one outgoing hyperedge for $v$, and thus we only need to specify the outgoing hyperedges for the nodes $v_i$. For each such $v_i$, we include the hyperedge $(v_i, v, r)$. Since there are no hyperedges pointing at the nodes $v_i$, this will not create any cycles.
For each node $v \in S_2$, we include the hyperedge $(v, v_1, u(v_1))$, and, for all $i$ with $1 \le i < \mathrm{indeg}(v)$, we include the hyperedge $(v_i, v_{i+1}, u(v_{i+1}))$. We also include the hyperedge $(v_{\mathrm{indeg}(v)}, r, r)$. Since there are no edges in the Steiner tree solution coming into or going out of $v$, this will not create any cycles. We have specified an outgoing hyperedge for every node without introducing any cycles, and thus we have a valid 2-branching.

Claim 2 For any 2-branching of weight $z$, there is a directed Steiner tree with weight $z$.

Proof:
The 2-branching must consist of exactly one hyperedge from each of the required Steiner nodes. These hyperedges correspond exactly to edges in the Steiner tree problem, and thus these edges are included in the Steiner tree solution. For each of the optional Steiner nodes $v$, the hyperedge from $v$ either points to another node in the original Steiner tree problem, or it uses the hyperedge $(v, v_1, u(v_1))$. In the first case, we include that edge and the node $v$ in the Steiner tree solution, and since the 2-branching has no cycles, this cannot create any cycles in the Steiner tree solution. In the second case, it is easy to show by induction that the 2-branching must also contain all of the hyperedges $(v_i, v_{i+1}, u(v_{i+1}))$. Since there can be no cycles in the 2-branching, the only nodes with hyperedges pointing to $v$ are the nodes $v_i$. Thus, we can leave the node $v$ out of the Steiner tree solution. This gives us a valid solution to the Steiner tree problem where the set of nonzero weight edges of the Steiner tree solution is exactly the set of nonzero weight hyperedges of the 2-branching.

This completes the proof of Claim 2, and with it the proof of Theorem 2.
The inapproximability result above demonstrates the hardness of the general problem of finding an optimal 2-branching. This result does not, however, directly imply that it is even NP-Hard to find the best compression using at most two references, since the graphs that we reduce the directed Steiner tree problem to may not correspond to actual affinity graphs that arise as a result of a Web graph.
We also provide a more direct reduction showing that it is in fact NP-Hard to find the best compression of a Web graph based on using up to two reference nodes. In fact, even if we ignore the additional difficulty imposed by taking into account the asymmetry of the affinity graph, the problem remains NP-Hard. In particular, we demonstrate that the problem of finding the assignment of reference nodes that maximizes the total number of edges in the Web graph that are represented by a corresponding edge in a reference node is NP-Hard. This proof can easily be extended to the case where the objective is to minimize the total cost of the reference nodes used.
Theorem 3 The problem of finding an encoding for a graph $G_W$, with each node encoded using up to two reference nodes, that maximizes the total number of edges that are encoded using a reference node is NP-Hard.
The proof, which is somewhat lengthy, is a reduction from 3SAT, and is given in the Appendix. Note that this does not imply that the problem of maximizing the number of edges encoded using a reference node is hard to approximate. In fact, it is clear that we can find a 2-approximation to this problem by using at most a single reference node to encode each node. Achieving better than a 2-approximation is an interesting open problem.

Experiments
We present the results from a preliminary prototype running on artificial random copy graphs and one subset of a snapshot of the Web graph. We emphasize that these experiments are meant as a preliminary proof of concept. In particular, the prototype does not output a compressed file, but rather the compressed size of the file. Moreover, when using Huffman coding, the compressed size does not include the size of any associated Huffman tables; we chose not to include this as the size of the Huffman table depends on whether one compresses it further.
We first describe the graphs we tested. For the random copy graphs, our tests all had 131072 nodes. (Smaller test graphs had similar performance, so we present results for the largest graphs we tested.) Each graph began with 1024 seed nodes with three outedges, where the end of each outedge was chosen uniformly at random from all nodes. When new nodes were added, they were given only outedges. The outedges were determined by copying edges from some number of nodes and by generating edges with endpoints chosen uniformly at random from all present nodes. We show the parameters for the copy graphs tested in Table 1. The field "# random copies" denotes the number of nodes whose outedges were copied. A range such as [1,2] denotes that an integer value was chosen uniformly over that range. Each edge was copied with a fixed probability, listed as the copy probability. The field "# random edges" gives the number of edges that were generated with random destinations; again, a range denotes that an integer value was chosen uniformly over that range. We note that for the large graphs $G_3$ and $G_4$, we were forced by memory considerations to limit the affinity graph to allow edges between nodes i and j only if their outedges share at least three destinations. This can only hurt our compression efforts. (Further testing suggests that the difference is minimal if two shared destinations can be handled in constructing the affinity graph.)

We also tested our compression scheme on real Web data from the TREC-8 (Text REtrieval Conference 8) Web track [9]. Our data set was the WT2g data set, which was chosen as a small subset of the Web for information retrieval testing by the TREC conference. This data set is larger than our random sets; hence, again, to construct the affinity graph on the TREC database we only created edges between pages with at least three shared links.
Table 2 presents the compression results. Here we have taken the average of ten different trials for the random graphs, where a different random graph is produced for each trial. We note that there is little deviation between the runs. The average number of edges is given; the uncompressed size given is simply the number of edges multiplied by $\log_2(\#\mathrm{nodes})$, which is an underestimate of the uncompressed size. Compression for other methods is given as a percentage of the uncompressed size.
As seen in graph $G_1$, when the amount of copying is low, and thus the average degree is very small, the reference algorithm alone does slightly worse than the Huffman algorithm, although using a Huffman code in conjunction with the reference algorithm leads to better performance. When the amount of copying is larger, as for $G_2$, $G_3$, and $G_4$, our Find-Reference algorithm greatly outperforms Huffman coding. We expect repeated passes might allow even greater compression. The Huffman algorithm compresses the outedges for each node, so the code words are based on the indegree. For the Find-Reference algorithm, we test both the straightforward algorithm as well as one which first determines the references and then uses Huffman coding on the remaining outedges.
Our results are actually best for the TREC database, demonstrating that our approach should be effective on real Web data as well. Our belief is that our good results on the TREC data set arise because links appear to have significant locality, due to the heuristic principles by which the data set was chosen [9]. We believe much greater testing is required to determine our performance on larger scale Web data sets, although this preliminary result is promising.


Future Work
We have initiated a study of how to compress Web graphs. We have considered the copy graph model, introduced elsewhere as a random graph family with properties similar to Web graphs. Using this structure, we have designed a compression algorithm based on finding similarity among the links of the pages and tested it on simple copy graphs. We have also shown that various generalizations of this idea lead to NP-Hard problems.
There are several directions remaining to pursue, including further tests of our algorithm on real Web data. It would also be interesting to learn how our approach works in conjunction with others. For example, another idea that would clearly be useful in compressing Web graphs is locality. If we assume that we store our graph with pages listed alphabetically by URL, we would expect a good percentage of the links to be between pages that are near each other in the list, since pages in the same domain are likely to reference each other. Thus, even before using our reference-based algorithm, it seems likely that a first-pass algorithm to handle local links would be useful. One natural approach would be to split the Web graph into two subgraphs, one with local links (say, between pages within distance 256 of each other in the sorted URL list) and one with non-local links, and to compress them separately. Although designing such a system can be done via experimentation, developing an appropriate model that allows us to understand the tradeoffs would be an interesting problem.
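Such a first pass could be as simple as the following sketch, which assumes node identifiers are positions in the URL-sorted order; this is an illustration of the idea, not something evaluated in the paper.

```python
def split_local_links(outedges, window=256):
    """Split each node's outedges into links whose destination lies within `window`
    positions in the URL-sorted order, and the remaining non-local links."""
    local, far = {}, {}
    for i, dsts in outedges.items():
        local[i] = {d for d in dsts if abs(d - i) <= window}
        far[i] = set(dsts) - local[i]
    return local, far
```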
Another interesting issue is understanding what our compression algorithm tells us about the structure of graphs. For example, a natural technique to test how accurately a proposed random graph model captures the structure of real Web graphs would be to run our compression algorithm (or any other compression algorithm) on both kinds of graphs, and compare the compression obtained. A feature more specific to our algorithm is that it attempts to reconstruct the evolution of a copy graph when choosing a reference node for each node. Preliminary results seem to indicate that the algorithm is able to correctly identify a large fraction of the history, and thus an interesting line of research would be to quantify this, and to understand its implications for real Web graphs.

Appendix: Proof of Theorem 3
The node $X_h^T$ is connected to the clause structures in which the literal $X_h$ appears, and $X_h^F$ is connected to the clause structures in which $\bar{X}_h$ appears. In particular, let $L_i$ be one of the nodes $X_h^T$ or $X_h^F$. The graph $G_S(\phi)$ has a node $L_i^j$ for the $j$th appearance of the literal $L_i$ in a clause. We use $L_i^j$ to denote both the appearance of the literal in the formula and the corresponding node in the graph. The nodes are connected as follows, where $k_i$ is the number of appearances of $L_i$:

There is also a structure for each clause. In particular, a clause containing the three literals $L_{i_1}^{j_1}$, $L_{i_2}^{j_2}$, and $L_{i_3}^{j_3}$ results in the following structure:

The final edge we add to the graph is $(r_1, r_0)$. These are all of the full edges of the graph $G_S(\phi)$. To complete the description, we need to specify which pairs of edges are overlapping. All of the pairs of edges incident to $L_i$ are overlapping, except for the pair of edges $(L_i, \hat{L}_i), (L_i, L_i^1)$, which is non-overlapping, and the pair $(L_i, r_0), (L_i, r_1)$, which is lightly-overlapping. Of the three pairs of edges incident to $\hat{L}_i$, two are overlapping; the non-overlapping pair is $(\hat{L}_i, \bar{L}_i), (\hat{L}_i, r_0)$, where $\bar{L}_i$ is the complement of $L_i$. All other pairs of edges are non-overlapping.
This construction is easily seen to be polynomial time. That it is a valid reduction follows from the following two claims.

Claim 3 If $\phi$ is satisfiable, then $G_S(\phi)$ has a 2-branching of weight $(2N-3)\delta - v$, where $N$ is the total number of nodes in the graph, $v$ is the number of variables in $\phi$, and $\delta$ is the weight of a full edge.

Proof: We describe how to construct such a 2-branching $B$, given a satisfying assignment. For each true literal $L_i$, $B$ contains the (directed) edges $(L_i, r_0)$, $(L_i, r_1)$, $(\hat{L}_i, \bar{L}_i)$, and $(\hat{L}_i, r_0)$. For every false literal $L_i$, $B$ contains the edges $(L_i, L_i^1)$, $(L_i, \hat{L}_i)$, $(\hat{L}_i, \bar{L}_i)$, and $(\hat{L}_i, r_0)$. In addition, $B$ contains the edges $(L_i^1, L_i), (L_i^1, L_i^2), (L_i^2, L_i), (L_i^2, L_i^3), \ldots, (L_i^{k_i}, L_i), (L_i^{k_i}, r_0)$, as well as the edges $(\bar{L}_i^1, \bar{L}_i^2), (\bar{L}_i^2, \bar{L}_i^3), \ldots, (\bar{L}_i^{k_{\bar i}}, r_0)$, plus the edge from each node $\bar{L}_i^j$ to the corresponding clause structure.
It is easy to verify that in the description of $B$ thus far there are no cycles and every node other than the three in each clause structure has exactly two outgoing edges. The 2-branching $B$ also contains an edge from each of the three nodes in each clause structure to the node $r_0$. Finally, we need to specify the second outgoing edge from each of these three nodes. However, each clause has at least one true literal, and thus there is at least one node $L_{i_c}^{j_c}$ such that the edge from $L_{i_c}^{j_c}$ to the clause structure was not included in $B$. We include the edge from the clause structure to such a node $L_{i_c}^{j_c}$. The outgoing edges in $B$ from the other two nodes in the clause structure are the clockwise edges along the triangle of the clause structure. These additional edges mean that every node has exactly two outgoing edges, and we have guaranteed that we do not have a cycle going around the clause structure.
Before adding the edge from the clause structure to the node $L_{i_1}^{j_1}$, it was easy to verify that there were no cycles, since there were no directed paths between the nodes corresponding to different variables. However, this additional edge does introduce such paths. To see that these edges do not introduce any cycles, note that they can only introduce paths from an $L_i$ representing a false literal to an $L_j$ representing a true literal. It is also possible to go from the node $L_i$ to the node $\bar{L}_i$, but only in the case that $L_i$ is the false literal. Thus, any path that goes from a node $L_i$ to a node $L_j$ (which may be $\bar{L}_i$) must go from a false literal to a true literal. Therefore, there can be no cycles in the resulting set of edges $B$. The final edge we add is $(r_1, r_0)$. Since $r_0$ has no outgoing edges, this does not create a cycle.
Thus, $B$ is a valid 2-branching. Every node except $r_0$ and $r_1$ has two outgoing full edges in $B$, and for every node except the nodes $L_i$ corresponding to true literals, these two edges are non-overlapping. The two outgoing edges from those $L_i$ are lightly overlapping. Since we have exactly $v$ true literals, the total weight of this 2-branching is $(2N-3)\delta - v$.

Claim 4 If $G_S(\phi)$ has a 2-branching of weight $(2N-3)\delta - v$, then $\phi$ is satisfiable.

Proof:
For any variable $X_i$, if all four of the nodes in the variable structure for $X_i$ had two full, non-overlapping edges, this would lead to a cycle. Thus, for each of the $v$ variable structures, there is at least one node whose outgoing edges contribute at most $2\delta - 1$ to the total weight of the 2-branching. In order to have a 2-branching of weight $(2N-3)\delta - v$, there must be exactly one such node per variable structure, and it must contribute $2\delta - 1$ to the branching. The only way for this to be possible is if there is one literal $L_i$ that has the edges $(L_i, r_0)$ and $(L_i, r_1)$, and the other literal $\bar{L}_i$ has the edges $(\bar{L}_i, \bar{L}_i^1)$ and $(\bar{L}_i, \hat{\bar{L}}_i)$. In order to satisfy the formula $\phi$, it is sufficient to set the literal $L_i$ to true, and the literal $\bar{L}_i$ to false.

or $(n_2, n_3)$. Furthermore, since the targets of these two edges are distinct, the merger does not affect the weight of $(n_1, n_2)$ or $(n_1, n_3)$, but it does ensure that those two edges become overlapping.
The only edge weight that is changed in $G_S(\phi)$ by the merger is that of $(n_2, n_3)$, which is increased by either 1 or 2, depending on whether the merger resulted from an overlapping pair or a lightly-overlapping pair. If $(n_2, n_3)$ is a full edge, then this increase is offset by removing one or two (depending on the increase) of the nodes from the target of $(n_2, n_3)$ that were not involved in any mergers. Otherwise, no further changes are required. Since the only pairs of edges that can increase the weight of $(n_2, n_3)$ via a merger are pairs of full edges of the form $(n_i, n_2)$ and $(n_i, n_3)$, for some $i$, and since the maximum degree of $G_S(\phi)$ is $d$, we see that the weight of any edge $(n_2, n_3)$ can be increased by at most $d$ mergers. Thus, no edge that was not full increases in weight past $2d = \delta/2$, and each edge that is full has enough nodes left in its target after its mergers to account for the effects of all other mergers, and thus remains at weight $\delta$. Thus, we have constructed a graph $G_W(\phi)$ such that the corresponding graph $G_S(\phi)$ satisfies all three of our sufficient conditions. Determining the maximum number of nodes that can be encoded using reference nodes in an encoding of $G_W(\phi)$ using two reference nodes per node is equivalent to deciding whether the formula $\phi$ is satisfiable. This completes the reduction, and the proof.

Table 1: Parameters of the test graphs.

Table 2: Results from the test graphs.