-- 作者:admin
-- 发布时间:2008/5/6 5:51:51
-- [推荐]第八部分课堂讲义之三——PageRank
7 Web search and search engine PPT课件下载链接:
7.3 Link analysis Link analysis for web search has intellectual antecedents in the field of academic citation analysis
The idea of Web citations Quantify the influence of scholarly articles by analyzing the pattern of citations amongst them Much as citations represent the conferral of authority from a scholarly article to others, link analysis on the Web treats hyperlinks from a web page to another as a conferral of authority
Diversity of Web Pages Unlike academic papers which are scrupulously reviewed, web pages proliferate free of quality control or publishing costs Web pages vary on a much wider scale than academic papers in quality, usage, citations, and length The average web page quality experienced by a user is higher than the quality of the average web page This is because the simplicity of creating and publishing web pages results in a large fraction of low quality web pages that users are unlikely to read
7.3.1 The basics Some terms about link Forward links Out links Out-edges Backlinks In links In-edges
The assumptions of PageRank The anchor text pointing to page B is a good description of page B More objective Many Web pages do not provide an accurate description of itself Image search IBM, Yahoo!, Google, … Consequently, web searchers need not use the terms in a page to query for it
Image link <html> <head> <title>Image</title> </head> <body> <a href="js.gif">江苏地图"</a> </body> </html>
The hyperlink from A to B represents an endorsement of page B, by the creator of page A This is not always the case Link spam All pages links to copyright notice within a single website So backlinks from high PR-pages count more than links from low PR-pages
The weight of anchor text Anchor text terms are generally weighted based on frequency with a penalty for terms that occur very often (the most common terms in anchor text across the Web are Click and here, using methods very similar to idf)
The extended topics The use of anchor text has some interesting side-effects Searching for big blue often returns the home page of the IBM corporation Creating misleading anchor text pointing to other sites can boost its ranking on selected query terms The window of text surrounding anchor text (sometimes referred to as extended anchor text) is often usable in the same manner as anchor text itself ... company about computer can be referred <a href="herewww.ibm.com">here</a>
7.3.2 PageRank A numerical score between 0 and 1 assigned to every Web page by Google depending on the link structure of the web graph
The composite score in Google Given a query, a web search engine computes a composite score for each web page that combines hundreds of features such as Cosine similarity Term proximity PageRank score This composite score is used to provide a ranked list of results for the query
7.3.2.1 The basic of PageRank Ref by: The PageRank Citation Ranking: Bringing Order to the Web
The random surferring model A random surfer who begins at a web page Executes a random walk on the Web as follows Following the out links With probability 0 < α < 1 Teleport operation With probability 0 < 1 -α < 1
Following the out links From his current page A to a randomly chosen web page that A hyperlinks to Nodes with many links coming in from other pages may be frequently visited The idea behind this is that pages visited more often in this walk are more important
Teleport operation What if the current location of the surfer, the node A, has no out-links? Jumps from a node to any other node in the web graph through typing an address into the URL bar of his browser The destination of a teleport operation is modeled as being chosen uniformly at random from all web pages with probability 1/N ( N is the total number of nodes in the Web )
7.3.2.2 The computation of PageRank A page has high rank if the sum of the ranks of its backlinks is high When a page has many backlinks When a page has a few highly ranked backlinks The importance of a page decides and depends on the importance of other pages
Some Definition Let u be a web page Let Fu be the set of pages u points to Let Bu be the set of pages that point to u Let Nu=|Fu| be the number of links from u Let c be a factor used for normalization so that the total rank of all web pages is constant
A simple PageRank Calculation R which is a slightly simplified version of PageRank
The rank of a page is divided among its forward links evenly to contribute to the ranks of the pages they point to Argument c < 1 because there are a number of pages with no forward links and their weight is lost from the system The equation is recursive but it may be computed by starting with any set of ranks and iterating the computation until it converges
This figure demonstrates the propagation of rank from one pair of pages to another
This figure shows a consistent steady state solution for a set of pages
Stated by matrix notation let A be a square matrix with the rows and column corresponding to web pages Let Au,v = 1/Nu if there is an edge from u to v and Au,v = 0 if not If we treat R as a vector over web pages, then we have R = cATR So R is an eigenvector of A with eigenvalue c In fact, we want the dominant eigenvector of A. It may be computed by repeatedly applying A to any nondegenerate start vector
The problem of simple PageRank Suppose two web pages point to each other but to no other page And there is some web page which points to one of them Then, during iteration, this loop will accumulate rank but never distribute any rank (since there are no out-edges) The loop forms a sort of trap which we call a rank sink
Improved PageRank with rank source Let E(u) be some vector over the Web pages that corresponds to a source of rank c is a decay factor (0<c<=1)
Stated by matrix notation R’ = c( AR’ + E ) R’ = c( A +E×1 ) R’ E×1×R’=E since ||R’||=1 1 is the vector consisting of all ones R’ is an eigenvector of (A + E×1)
How to represent random surferring model PageRank corresponds to the probability distribution of a random walk on the web graphs
The additional factor E can be viewed as a way of modeling this behavior The surfer periodically “gets bored” and jumps to a random page chosen based on the distribution in E In most tests we let E be uniform over all web pages with value 1/n However, values of E can generate “customized" page ranks if we left E as a user defined parameter
How to represent collaborative notion of authority or trust Many pages will receive quite a high PageRank simply because it is mentioned by a very important page This seems to capture a kind of collaborative trust, since if a page was mentioned by a trustworthy or authoritative source, it is more likely to be trustworthy or authoritative Similarly, quality or importance seems to fit within this kind of circular definition.
Dangling links Dangling links are simply links that point to any page with no outgoing links They affect the model because it is not clear where their weight should be distributed There are a large number of them Often these dangling links are simply pages that we have not downloaded yet, since it is hard to sample the entire web (in Larry Page & Sergey Brin’s experiment, 24 million pages currently downloaded, 51 million URLs not downloaded yet
Because dangling links do not affect the ranking of any other page directly, we simply remove them from the system until all the PageRanks are calculated After all the PageRanks are calculated, they can be added back in, without affecting things significantly
Convergence properties PageRank on a large 322 million link database converges to a reasonable tolerance in roughly 52 iterations The convergence on half the data takes roughly 45 iterations This graph suggests that PageRank will scale very well even for extremely large collections as the scaling factor is roughly linear in log n Rates of Convergence for Full Size and Half Size Link Databases
Computing PageRank initialize vector over web pages new ranks sum of backlink ranks compute normalizing factor add escape term control parameter stop when converged
7.3.2.3 Searching with PageRank A major application of PageRank is searching Google utilizes a number of factors to rank search results including standard IR measures, proximity, anchor text (text of links pointing to web pages), and PageRank
An example of the benefit of PageRank A query for “Stanford University” may return any number of web pages which mention Stanford (such as publication lists) on a conventional search engine, But using PageRank, the university home page is listed first
The comparison of Query The left is PageRank based title search engine The search engine on the right is Altavista Altavista returns random looking web pages that match the query “University” and are the root page of the server Altavista seems to be using URL length as a quality heuristic
7.3.2.4 The disadvantages of PageRank Rank Merging Common Case Rich-get-richer
Rank Merging The reason that the title based PageRank system works so well is that the title match ensures high precision, and the PageRank ensures high quality For more specific searches where recall is more important, the traditional information retrieval scores over full-text and the PageRank should be combined
Common Case PageRank can handle the common case for queries well Even if query terms are not contained in the HTML of the page The approach which might simply return a commonly used commercial site is also important A general purpose web search engine should return results which fulfill the needs of both of these tasks automatically
7.3.2.5 Other applications of PageRank Estimating Web Traffic PageRank as Backlink Predictor User Navigation: The PageRank Proxy Other Uses of PageRank
Estimating Web Traffic It is interesting to see how PageRank corresponds to actual usage Experiments used the counts of web page accesses from NLANR proxy cache and compared these to PageRank The NLANR data was from several national proxy caches over the period of several months and consisted of unique URLs
Some interesting trends in the data There seems to be a high usage of pornographic sites in the cache data, but these sites generally had low PageRanks This is because people do not want to link to pornographic sites from their own web pages There are some sites that have a very high usage, but low PageRank such as netscape.yahoo.com There is probably an important backlink which simply is omitted from our database (we only have a partial link structure of the web) It may be possible to use usage data as a start vector for PageRank, and then iterate PageRank a few times
PageRank as Backlink Predictor PageRank is a better predictor of future citation counts than citation counts themselves Because citation counting often gets stuck in local maxima In Web crawling, PageRank can help to judge which page should be crawled preferred Using the incomplete data, PageRank is a more effective way to order the crawling than the number of known citations in crawling process
User Navigation: The PageRank Proxy This proxy application can annotate each link that a user sees with its PageRank The proxy application can help users decide which links in a long listing are more likely to be interesting
Conclusion of PageRank PageRank is a global ranking based on the web\'s graph structure PageRank use backlinks information to bring order to the web PageRank can separate out representative pages as cluster center A great variety of applications
7.3.2.6 Personalized PageRank Such personalized page ranks may have a number of applications, including personal search engines
Issues Users are no random walkers Content based methods Starting point distribution Actual usage data as starting vector Reinforcing effects/bias towards main pages
Uniform E vector Intuitively the E vector corresponds to the distribution of web pages that a random surfer periodically jumps to Common E vector is uniform over all web pages with each element is 0.15 But Web users do not access all Web pages in a uniform probability
Personalized E vector Another extreme is to have E consist entirely of a single web page This is a very democratic choice for E based on own desire
PageRanks for Two Different Views: Netscape vs. John McCarthy Pages related to computer science have a higher McCarthy-rank than Netscape-rank and pages related to computer science at Stanford have a considerably higher McCarthy-rank
[此贴子已经被作者于2010-12-14 09:26:25编辑过]
|