Encyclopedia > PageRank

Article Content

PageRank

PageRank is a concept of assigning numerical values to all World Wide Web pages listed in a search engine. It is a topic much discussed by search engine optimization (SEO) experts. The PageRank system is one of the methods the most successful search engine, Google uses to determine a page's relevance or importance. It was developed by Google's founders Larry Page and Sergey Brin while at Stanford University in 1998.

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."

In other words, page rank is a "vote" by all the other pages on the World Wide Web about how important a page is. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). A page that is linked by many pages with high rank receives a high rank itself. If there are no links to a web page there is no support of this specific page. The Google Toolbar PageRank goes from 0 to 10. It seems to be a logarithmic scale. The exact details of this scale are unknown.

The name PageRank™ is a trademark of Google. Whether or not the pun on the name Larry Page and the word "page" was intentional or accidental remains an open question. The PageRank process has been patented. [more info on patent]

The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the frequency of hits on that page by the random surfer. It can be understood as a Markov process in which the states are pages, and the transitions are all equally probable and are the links between pages. If a page has no links to another pages, it becames a sink and therefore makes this whole thing unusable, because the sink pages will trap the random visitors forever. However, the solution is quite simple. If the random surfer arrives to a sink page, it picks another URL at random and continues surfing again.

To be fair with pages that are not sinks, these random transitions are added to all nodes in the Web, with a residual probability of usually q=0.15, estimated from the frequency that an average surfer uses his or her browser's bookmark feature.

So, the equation is as follows:

<math>{\rm PageRank}(p_i) = \frac{q}{N} + (1 -q) \sum_{p_j \in L(p_i)} {\rm PageRank} (p_j)</math>

where <math>p_1, p_2, ... p_N</math> are the pages under consideration, <math>L(p_i)</math> is the set of pages that link to <math>p_i</math>, and N is the total number of pages.

It's worth noting that the PageRank values are the entries of the dominant eigenvector of the modified adjacency matrix. This makes PageRank a particularly elegant metric:

<math> \begin{bmatrix} {\rm PageRank}(p_1) \\ {\rm PageRank}(p_2) \\ \vdots \\ {\rm PageRank}(p_N) \end{bmatrix} =

\begin{bmatrix} {q / N} \\ {q / N} \\ \vdots \\ {q / N} \end{bmatrix}

+ (1-q)

\begin{bmatrix} l(p_1,p_1) & l(p_1,p_2) & \cdots & l(p_1,p_N) \\ l(p_2,p_1) & \ddots & & \\ \vdots & & l(p_i,p_j) & \\ l(p_N,p_1) & & & l(p_N,p_N) \end{bmatrix}

\begin{bmatrix} {\rm PageRank}(p_1) \\ {\rm PageRank}(p_2) \\ \vdots \\ {\rm PageRank}(p_N) \end{bmatrix}

 </math>

Where the adjacency function <math>l(p_i,p_j)</math> is 0 if page <math>p_i</math> does not link to <math>p_j</math>, and normalised such that, for each i

The values of the PageRank eigenvector are fast to approximate (only a few iterations are needed) and in practice it gives good results.

As a result of Markov theory, it can be shown that the PageRank of a page is the probability of being at that page after lots of clicks. This happens to equal <math>t^{-1}</math> where <math>t</math> is the expectation of the number of clicks (or random jumps) required to get from the page back to itself.

The main disadvantage is that it favors older pages, because a new page, even a very good one, will not have many links unless it is part of an existing site (a site being a densely connected set of pages).

That's why PageRank should be combined with textual analysis or other ranking methods. PageRank seems to like Wikipedia pages, often putting them high or at the top of searches for several encyclopedic topics. A common theory for why this is is because the Wikipedia is very interconnected, with each article having many internal links from other articles, which in turn have links from many other sites on the Web pointing to them. Compared to Wikipedia, and similar high quality content-rich sites, the rest of the World Wide Web is relatively loosely connected.

External references

Google talks about PageRank "the heart of our software" (http://www.google.com/technology/)
"Google's PageRank and how to make the most of it" (http://www.webworkshop.net/pagerank) from www.webworkshop.net (http://www.webworkshop.net)
PowerPoint HTML presentation by Larry Page (http://hci.stanford.edu/~page/papers/pagerank/ppframe.htm)
"PageRank, Google's Original Sin" (http://www.google-watch.org/pagerank) from www.google-watch.org (http://www.google-watch.org)
Online PostScript version (http://www-db.stanford.edu/~backrub/pageranksub.ps) of "The PageRank Citation Ranking: Bringing Order to the Web", by L. Page, S. Brin, R. Motwani, T. Winograd, 1998.
The Handy Dandy Google Page Rank Figurin' Guide (http://www.searchnerd.com/pagerank/)
The Google Pagerank Algorithm and How It Works (http://www.iprcom.com/papers/pagerank/)

All Wikipedia text is available under the terms of the GNU Free Documentation License

Search Encyclopedia

Search over one million articles, find something about almost anything!