信息检索原理课件-[推荐]第八部分课堂讲义之三——PageRank课外天地李树青

以文本方式查看主题

-  课外天地李树青  (http://njcie.com/bbs/index.asp)
--  信息检索原理课件  (http://njcie.com/bbs/list.asp?boardid=16)
----  [推荐]第八部分课堂讲义之三——PageRank  (http://njcie.com/bbs/dispbbs.asp?boardid=16&id=480)

--  作者：admin
--  发布时间：2008/5/6 5:51:51
--  [推荐]第八部分课堂讲义之三——PageRank

7 Web search and search engine
PPT课件下载链接：
下载信息  [文件大小：   下载次数： ]
点击浏览该文件:

7.3 Link analysis
Link analysis for web search has intellectual antecedents in the field of academic citation analysis

The idea of Web citations
Quantify the influence of scholarly articles by analyzing the pattern of citations amongst them
Much as citations represent the conferral of authority from a scholarly article to others, link analysis on the Web treats hyperlinks from a web page to another as a conferral of authority

Diversity of Web Pages
Unlike academic papers which are scrupulously reviewed, web pages proliferate free of quality control or publishing costs
Web pages vary on a much wider scale than academic papers in quality, usage, citations, and length
The average web page quality experienced by a user is higher than the quality of the average web page
This is because the simplicity of creating and publishing web pages results in a large fraction of low quality web pages that users are unlikely to read

7.3.1 The basics
Some terms about link
Forward links
Out links
Out-edges
Backlinks
In links
In-edges

The assumptions of PageRank
The anchor text pointing to page B is a good description of page B
More objective
Many Web pages do not provide an accurate description of itself
Image search
IBM, Yahoo!, Google, …
Consequently, web searchers need not use the terms in a page to query for it

Image link
<html>
<head>
<title>Image</title>
</head>
<body>
<a href="js.gif">江苏地图"</a>
</body>
</html>

The hyperlink from A to B represents an endorsement of page B, by the creator of page A
This is not always the case
Link spam
All pages links to copyright notice within a single website
So backlinks from high PR-pages count more than links from low PR-pages

The weight of anchor text
Anchor text terms are generally weighted based on frequency with a penalty for terms that occur very often (the most common terms in anchor text across the Web are Click and here, using methods very similar to idf)

The extended topics
The use of anchor text has some interesting side-effects
Searching for big blue often returns the home page of the IBM corporation
Creating misleading anchor text pointing to other sites can boost its ranking on selected query terms
The window of text surrounding anchor text (sometimes referred to as extended anchor text) is often usable in the same manner as anchor text itself
... company about computer can be referred <a href="herewww.ibm.com">here</a>

7.3.2 PageRank
A numerical score between 0 and 1 assigned to every Web page by Google
depending on the link structure of the web graph

The composite score in Google
Given a query, a web search engine computes a composite score for each web page that combines hundreds of features such as
Cosine similarity
Term proximity
PageRank score
This composite score is used to provide a ranked list of results for the query

7.3.2.1 The basic of PageRank
Ref by:
The PageRank Citation Ranking:
Bringing Order to the Web

The random surferring model
A random surfer who begins at a web page
Executes a random walk on the Web as follows
Following the out links
With probability 0 < α < 1
Teleport operation
With probability 0 < 1 -α < 1

Following the out links
From his current page A to a randomly chosen web page that A hyperlinks to
Nodes with many links coming in from other pages may be frequently visited
The idea behind this is that pages visited more often in this walk are more important

Teleport operation
What if the current location of the surfer, the node A, has no out-links?
Jumps from a node to any other node in the web graph through typing an address into the URL bar of his browser
The destination of a teleport operation is modeled as being chosen uniformly at random from all web pages with probability 1/N ( N is the total number of nodes in the Web )

7.3.2.2 The computation of PageRank
A page has high rank if the sum of the ranks of its backlinks is high
When a page has many backlinks
When a page has a few highly ranked backlinks
The importance of a page decides and depends on the importance of other pages

Some Definition
Let u be a web page
Let Fu be the set of pages u points to
Let Bu be the set of pages that point to u
Let Nu=|Fu| be the number of links from u
Let c be a factor used for normalization
so that the total rank of all web pages is constant

A simple PageRank Calculation
R which is a slightly simplified version of PageRank

The rank of a page is divided among its forward links evenly to contribute to the ranks of the pages they point to
Argument c < 1 because there are a number of pages with no forward links and their weight is lost from the system
The equation is recursive but it may be computed by starting with any set of ranks and iterating the computation until it converges

This figure demonstrates the propagation of rank from one pair of pages to another

This figure shows a consistent steady state solution for a set of pages

Stated by matrix notation
let A be a square matrix with the rows and column corresponding to web pages
Let Au,v = 1/Nu if there is an edge from u to v and Au,v = 0 if not
If we treat R as a vector over web pages, then we have R = cATR
So R is an eigenvector of A with eigenvalue c
In fact, we want the dominant eigenvector of A. It may be computed by repeatedly applying A to any nondegenerate start vector

The problem of simple PageRank
Suppose two web pages point to each other but to no other page
And there is some web page which points to one of them
Then, during iteration, this loop will accumulate rank but never distribute any rank (since there are no out-edges)
The loop forms a sort of trap which we call a rank sink

Improved PageRank with rank source
Let E(u) be some vector over the Web pages that corresponds to a source of rank
c is a decay factor (0<c<=1)

Stated by matrix notation
R’ = c( AR’ + E )
R’ = c( A +E×1 ) R’
E×1×R’=E since ||R’||=1
1 is the vector consisting of all ones
R’ is an eigenvector of (A + E×1)

How to represent random surferring model
PageRank corresponds to the probability distribution of a random walk on the web graphs

The additional factor E can be viewed as a way of modeling this behavior
The surfer periodically “gets bored” and jumps to a random page chosen based on the distribution in E
In most tests we let E be uniform over all web pages with value 1/n
However, values of E can generate “customized" page ranks if we left E as a user defined parameter

How to represent collaborative notion of authority or trust
Many pages will receive quite a high PageRank simply because it is mentioned by a very important page
This seems to capture a kind of collaborative trust, since if a page was mentioned by a trustworthy or authoritative source, it is more likely to be trustworthy or authoritative
Similarly, quality or importance seems to fit within this kind of circular definition.

Dangling links
Dangling links are simply links that point to any page with no outgoing links
They affect the model because it is not clear where their weight should be distributed
There are a large number of them
Often these dangling links are simply pages that we have not downloaded yet, since it is hard to sample the entire web (in Larry Page & Sergey Brin’s experiment, 24 million pages currently downloaded, 51 million URLs not downloaded yet

Because dangling links do not affect the ranking of any other page directly, we simply remove them from the system until all the PageRanks are calculated
After all the PageRanks are calculated, they can be added back in, without affecting things significantly

Convergence properties
PageRank on a large 322 million link database converges to a reasonable tolerance in roughly 52 iterations
The convergence on half the data takes roughly 45 iterations
This graph suggests that PageRank will scale very well even for extremely large collections as the scaling factor is roughly linear in log n
Rates of Convergence for Full Size and Half Size Link Databases

Computing PageRank
initialize vector over web pages
new ranks sum of backlink ranks
compute normalizing factor
add escape term
control parameter
stop when converged

7.3.2.3 Searching with PageRank
A major application of PageRank is searching
Google utilizes a number of factors to rank search results including standard IR measures, proximity, anchor text (text of links pointing to web pages), and PageRank

An example of the benefit of PageRank
A query for “Stanford University” may return any number of web pages which mention Stanford (such as publication lists) on a conventional search engine,
But using PageRank, the university home page is listed first

The comparison of Query
The left is PageRank based title search engine
The search engine on the right is Altavista
Altavista returns random looking web pages that match the query “University” and are the root page of the server
Altavista seems to be using URL length as a quality heuristic

7.3.2.4 The disadvantages of PageRank
Rank Merging
Common Case
Rich-get-richer

Rank Merging
The reason that the title based PageRank system works so well is that the title match ensures high precision, and the PageRank ensures high quality
For more specific searches where recall is more important, the traditional information retrieval scores over full-text and the PageRank should be combined

Common Case
PageRank can handle the common case for queries well
Even if query terms are not contained in the HTML of the page
The approach which might simply return a commonly used commercial site is also important
A general purpose web search engine should return results which fulfill the needs of both of these tasks automatically

7.3.2.5 Other applications of PageRank
Estimating Web Traffic
PageRank as Backlink Predictor
User Navigation: The PageRank Proxy
Other Uses of PageRank

Estimating Web Traffic
It is interesting to see how PageRank corresponds to actual usage
Experiments used the counts of web page accesses from NLANR proxy cache and compared these to PageRank
The NLANR data was from several national proxy caches over the period of several months and consisted of unique URLs

Some interesting trends in the data
There seems to be a high usage of pornographic sites in the cache data, but these sites generally had low PageRanks
This is because people do not want to link to pornographic sites from their own web pages
There are some sites that have a very high usage, but low PageRank such as netscape.yahoo.com
There is probably an important backlink which simply is omitted from our database (we only have a partial link structure of the web)
It may be possible to use usage data as a start vector for PageRank, and then iterate PageRank a few times

PageRank as Backlink Predictor
PageRank is a better predictor of future citation counts than citation counts themselves
Because citation counting often gets stuck in local maxima
In Web crawling, PageRank can help to judge which page should be crawled preferred
Using the incomplete data, PageRank is a more effective way to order the crawling than the number of known citations in crawling process

User Navigation: The PageRank Proxy
This proxy application can annotate each link that a user sees with its PageRank
The proxy application can help users decide which links in a long listing are more likely to be interesting

Conclusion of PageRank
PageRank is a global ranking based on the web\'s graph structure
PageRank use backlinks information to bring order to the web
PageRank can separate out representative pages as cluster center
A great variety of applications

7.3.2.6 Personalized PageRank
Such personalized page ranks may have a number of applications, including personal search engines

Issues
Users are no random walkers
Content based methods
Starting point distribution
Actual usage data as starting vector
Reinforcing effects/bias towards main pages

Uniform E vector
Intuitively the E vector corresponds to the distribution of web pages that a random surfer periodically jumps to
Common E vector is uniform over all web pages with each element is 0.15
But Web users do not access all Web pages in a uniform probability

Personalized E vector
Another extreme is to have E consist entirely of a single web page
This is a very democratic choice for E based on own desire

PageRanks for Two Different Views: Netscape vs. John McCarthy
Pages related to computer science have a higher McCarthy-rank than Netscape-rank and pages related to computer science at Stanford have a considerably higher McCarthy-rank

[此贴子已经被作者于2010-12-14 09:26:25编辑过]