A Fast Incremental Update Scheme for PageRank
Steve Chien
University of California at Berkeley
We began with the broad goal of exploiting changes in the link structure of the web. For our raw data we used the results of two large crawls of the web made at SRC in May and November 2000, together with their compact representation in the connectivity server, also built at SRC. While we made a number of observations on the evolution of link structure over time, our most substantial result was an improved understanding of Google’s very popular PageRank measure of web page importance, which we applied to obtain a very fast incremental update algorithm for PageRank.
PageRank is Google’s highly successful method for evaluating the importance of web pages, and is based on interpreting the web as a very large graph in which pages are vertices and links are directed edges. It is then natural to consider a Markov chain, or random walk, on this graph: from our current page, we choose a random outgoing link to follow. We then modify this random walk so that at every step there is a certain probability that we instead jump to a uniformly random page. PageRank is the stationary probability distribution of this Markov chain. While computing PageRank is therefore only an eigenvector computation, the size of the web graph limits us to power iteration, a relatively slow approach.
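To make the computation concrete, the following minimal sketch shows PageRank computed by power iteration. The dictionary-of-out-links representation, the jump probability of 0.15, and the handling of dangling pages are illustrative choices for the sketch, not details of our actual implementation.

def pagerank(out_links, jump_prob=0.15, tol=1e-10, max_iter=1000):
    """Stationary distribution of the random walk with random jumps.

    out_links maps each page to a list of pages it links to; every page
    that appears as a link target must also appear as a key.
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Every page receives the jump mass; link mass is added below.
        new_rank = {p: jump_prob / n for p in pages}
        for p in pages:
            targets = out_links[p]
            if targets:
                share = (1.0 - jump_prob) * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its mass uniformly over all pages.
                for q in pages:
                    new_rank[q] += (1.0 - jump_prob) * rank[p] / n
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            break
        rank = new_rank
    return new_rank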
PageRank works very well in practice, and some effort has been spent analyzing its effectiveness. In particular, Andrew Ng, Alice Zheng, and Michael Jordan reported that one nice feature of PageRank is its stability under significant changes to the web graph, and concluded that the random jump is responsible for this stability.
We analyzed PageRank from a similar perspective, asking how changes to the links out of a single page affect the PageRank of other pages. Our analysis showed that the random jump probability has the effect of localizing the impact of such updates to a small subgraph surrounding the modified page. This suggested the following update algorithm: We assume we are given the current web graph and its correct PageRank, and then discover that a new edge has been added. We respond by constructing a small subgraph around the affected page and consolidating the remainder of the web graph into a single superpage that simulates the interaction of the rest of the graph with the subgraph. We then compute the new PageRank on the small subgraph, assuming that the PageRank of the superpage has not changed. Our analysis shows that this should be an excellent approximation to the updated PageRank.
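The sketch below illustrates this update scheme, reusing the pagerank() helper above; it is meant to be called after the new edge has already been added to out_links. The two-hop neighborhood, the OMEGA sentinel for the superpage, and the crude estimate of the superpage's transitions are simplifications for illustration, not our exact construction.

def neighborhood(out_links, source, radius=2):
    """Pages reachable from `source` within `radius` link-follows."""
    frontier, seen = {source}, {source}
    for _ in range(radius):
        frontier = {q for p in frontier for q in out_links.get(p, [])} - seen
        seen |= frontier
    return seen

def update_pagerank(out_links, old_rank, changed_page, jump_prob=0.15):
    """Recompute ranks only on a small subgraph around `changed_page`."""
    OMEGA = "__superpage__"  # assumes no real page uses this name
    sub = neighborhood(out_links, changed_page)
    small = {}
    for p in sub:
        # Links within the subgraph are kept; links leaving it go to OMEGA.
        small[p] = [q if q in sub else OMEGA for q in out_links.get(p, [])]
    # One out-link per external in-link into the subgraph: a crude stand-in
    # for the rank-weighted flow from the rest of the graph. In practice
    # these in-links would be precomputed rather than found by a full scan.
    small[OMEGA] = [q for p, qs in out_links.items() if p not in sub
                    for q in qs if q in sub]
    small[OMEGA].append(OMEGA)  # some of the superpage's mass stays outside
    new_small = pagerank(small, jump_prob)
    # Pages outside the subgraph keep their old ranks; the subgraph pages
    # are rescaled so that they fill exactly the mass they held before.
    inside_mass = sum(old_rank.get(p, 0.0) for p in sub)
    scale = inside_mass / (1.0 - new_small[OMEGA])
    new_rank = dict(old_rank)
    for p in sub:
        new_rank[p] = new_small[p] * scale
    return new_rank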
As an additional result, we showed that PageRank (and a large class of other Markov chains) is monotonic in the sense that if a page receives a new incoming edge, its PageRank increases.
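As a quick empirical illustration of this monotonicity (not a proof), one can add an in-link on a toy graph and watch the target's rank rise, again using the pagerank() sketch above:

graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
before = pagerank(graph)["c"]
graph["d"].append("c")        # page c receives a new incoming edge
after = pagerank(graph)["c"]
assert after > before         # c's rank can only increase
print(f"PageRank of c: {before:.4f} -> {after:.4f}")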
We implemented our algorithm on the raw data from the May and November 2000 web crawls. In a series of small experiments, we started with the 61 million pages from the May 2000 crawl and added single edges. We found that our approximations of the updated PageRank values were extremely accurate. We then tried a much larger experiment in which we incrementally incorporated all of the edge changes between May and November, a total of over 17 million edge additions and deletions. Once again, our algorithm performed extremely well. [We are preparing to publish a more rigorous and detailed description of our experiments.] In each case, our algorithm performed far fewer computations than full-blown power iteration.
Acknowledgements
Throughout the above, the first person plural refers primarily to my summer advisor Cynthia Dwork and me; she made the summer an especially rewarding and enjoyable experience. I also greatly appreciate Janet Wiener’s indispensable help with the connectivity server, as well as conversations with Li Zhang, Yunhong Zhao, Marc Najork, Mike Burrows, Lyle Ramshaw, and many others.