-- Author: admin
-- Posted: 2008/5/5 16:03:58
-- [Recommended] Part Eight Lecture Notes II: Search Engines
7 Web search and search engines
PPT courseware download link:
7.2 Search engine
The components of a web search engine
7.2.1 Web crawling

7.2.1.1 Web crawler
- Sometimes referred to as a spider or robot
- The objective of a crawler is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them
Features of a crawler
- Features a crawler must provide:
  - Robustness: resilient to spider traps
  - Politeness
- Features a crawler should provide:
  - Distributed
  - Scalable
  - Performance and efficiency
  - Quality: biased towards fetching "useful" pages first
  - Freshness: crawl web pages with a frequency that approximates the rate of change of pages
  - Extensible: cope with new data formats and new fetch protocols
7.2.1.2 Web crawling
The process of web crawling:
- Begins with one or more URLs that constitute a seed set
- Picks a URL from this seed set, then fetches the web page at that URL
- The fetched page is then parsed, to extract both the text and the links from the page
- The extracted text is fed to a text indexer
- The extracted links (URLs) are then added to a URL frontier
- As pages are fetched, the corresponding URLs are deleted from the URL frontier
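A minimal sketch of this loop in Python, assuming a toy in-memory frontier and a placeholder index_text function standing in for the text indexer (the prioritized, polite frontier is described later in these notes):

from collections import deque
from urllib.parse import urljoin
from html.parser import HTMLParser
import urllib.request


class LinkExtractor(HTMLParser):
    """Collects the href attributes of <a> tags from a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def index_text(url, html):
    pass                                 # placeholder for the text indexer


def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # URLs yet to be fetched
    seen = set(seed_urls)                # simple duplicate elimination
    while frontier and max_pages > 0:
        url = frontier.popleft()         # pick a URL from the frontier
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                     # skip pages that fail to fetch
        max_pages -= 1
        index_text(url, html)            # feed the extracted text to the indexer
        parser = LinkExtractor()
        parser.feed(html)                # parse the page to extract links
        for link in parser.links:
            absolute = urljoin(url, link)            # normalize relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)            # add new URLs to the frontier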
Some spiders of famous search engines
- Mercator: Altavista.com
- ArchitextSpider: excite.com
- Googlebot, googlebot@googlebot.com, google: Google.com
- Slurp.so/1.0, slurp@inktomi.com: Yahoo
- MSNBOT/0.1: MSN
- baiduspider: Baidu
7.2.1.3 Crawler architecture
The basic crawler architecture:
- Multi-threaded
- Iterative process
URL frontier module
- Contains URLs yet to be fetched in the current crawl
- A URL may later be put back into the frontier for re-fetching

DNS resolution module
- Determines the web server (IP address) from the URL

Fetch module
- Uses the HTTP protocol to retrieve the web page

Parsing module
- Extracts the text and the set of links from a fetched web page

Duplicate elimination module
- Checks whether an extracted URL is already in the frontier or has already been crawled
What is the URL frontier?
- Is given URLs by its crawl process
- Maintains the URLs in the frontier and regurgitates them in some order
- The priority of a page should be a function of both its change rate and its quality
- Avoids repeated fetch requests to a host within a short time span; a common heuristic is to insert a gap between successive fetch requests
The sub-modules of the URL frontier
- F front queues (FIFO queues): implement the prioritization; a URL with assigned priority i is appended to the i-th front queue
- B back queues (FIFO queues): implement the politeness; each queue contains URLs from a single host
- A heap: one entry for each back queue, recording the earliest time at which the host corresponding to that queue may be contacted again
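A simplified sketch of such a frontier in Python, assuming a small fixed number of priority levels and a fixed per-host delay; the queue counts, delay value, and class/method names are illustrative, not part of the Mercator design itself:

import heapq
import time
from collections import deque
from urllib.parse import urlparse


class Frontier:
    """Simplified Mercator-style frontier: front queues for prioritization,
    one back queue per host for politeness, and a heap recording the earliest
    time at which each host may be contacted again."""

    def __init__(self, num_priorities=3, host_delay=2.0):
        self.front = [deque() for _ in range(num_priorities)]  # priority 0 = highest here
        self.back = {}            # host -> deque of URLs waiting for that host
        self.heap = []            # entries of the form (next_allowed_time, host)
        self.host_delay = host_delay

    def add(self, url, priority=0):
        self.front[priority].append(url)        # prioritization step

    def _refill_back_queues(self):
        for queue in self.front:                # drain front queues, highest priority first
            while queue:
                url = queue.popleft()
                host = urlparse(url).netloc
                if host not in self.back:       # new host: give it a back queue and a heap entry
                    self.back[host] = deque()
                    heapq.heappush(self.heap, (time.time(), host))
                self.back[host].append(url)

    def next_url(self):
        """Return the next URL whose host may politely be contacted, or None."""
        self._refill_back_queues()
        if not self.heap:
            return None
        next_time, host = heapq.heappop(self.heap)
        delay = next_time - time.time()
        if delay > 0:
            time.sleep(delay)                   # enforce the gap between requests to one host
        url = self.back[host].popleft()
        if self.back[host]:                     # host still has URLs: reschedule it
            heapq.heappush(self.heap, (time.time() + self.host_delay, host))
        else:
            del self.back[host]
        return url

The crawl loop sketched in 7.2.1.2 could then call next_url() in place of its simple queue.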
What is DNS resolution?
- DNS: Domain Name Service
- Translates a URL's host name into an IP address by contacting DNS servers
- DNS resolution is a well-known bottleneck in web crawling: a lookup may recursively call upon other DNS servers, and the standard lookup implementations are generally synchronous
How to remedy this?
- Caching: URLs for which we have recently performed DNS lookups are likely to be found in the DNS cache
- Using our own DNS resolver: most web crawlers implement their own DNS resolver as a component of the crawler
  - Mercator recommends on the order of five attempts
  - The time quantum of the wait increases exponentially with each of these attempts; Mercator started with one second and ended with roughly 90 seconds
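A small sketch of both remedies, built on the standard socket library; the growth factor between retries is an assumption, since the slides only give the endpoints (about one second up to roughly 90 seconds over about five attempts):

import socket
import time

_dns_cache = {}   # host -> IP address resolved earlier in this crawl


def resolve(host, attempts=5, first_wait=1.0):
    """Resolve a host name with a local cache and exponentially growing
    waits between retries."""
    if host in _dns_cache:
        return _dns_cache[host]
    wait = first_wait
    for attempt in range(attempts):
        try:
            ip = socket.gethostbyname(host)     # synchronous OS-level lookup
            _dns_cache[host] = ip
            return ip
        except socket.gaierror:
            if attempt == attempts - 1:
                raise                           # give up after the last attempt
            time.sleep(wait)
            wait *= 3                           # exponential back-off between attempts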
What is Doc FP (document fingerprinting)?
- Tests whether a web page with the same content has already been seen at another URL
- Methods: use a simple fingerprint such as a checksum; a more sophisticated test would use shingles
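A rough sketch of the two methods: a CRC32 checksum for exact duplicates, and hashed k-word shingles compared by Jaccard similarity for near-duplicates; the shingle size k=4 is an illustrative choice:

import hashlib
import zlib


def checksum_fingerprint(text):
    """Whole-document checksum: identical pages get identical fingerprints."""
    return zlib.crc32(text.encode("utf-8"))


def shingle_fingerprints(text, k=4):
    """Hash every k-word shingle of the page; comparing the resulting sets
    catches near-duplicates, not just exact copies."""
    words = text.split()
    grams = {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}
    return {int(hashlib.md5(g.encode("utf-8")).hexdigest(), 16) for g in grams}


def jaccard(a, b):
    """Overlap of two shingle sets; values near 1.0 indicate near-duplicate pages."""
    return len(a & b) / len(a | b) if a | b else 1.0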
What is the URL filter?
- Determines whether an extracted URL should be excluded from the frontier
- Excludes certain domains, or includes only specific domains (vertical search)
- Complies with robots.txt
Robots Exclusion Protocol
- A website declares itself (or parts of itself) off-limits to crawling by placing a file named robots.txt at the root of the URL hierarchy at the site
- We should perform the robots filtering before adding a URL to the frontier, but this file may change over time; in fact, maintaining a cache of robots.txt files is still highly effective
robots.txt, for example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

No robot should visit any URL whose position in the file hierarchy starts with /yoursite/temp/, except for the robot called "searchengine".
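A sketch of robots filtering using Python's standard urllib.robotparser, with a simple per-host cache as the previous slide suggests; the user agent name "searchengine" is taken from the example above:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

_robots_cache = {}   # host -> parsed robots.txt (a real crawler would refresh this periodically)


def allowed(url, user_agent="searchengine"):
    """Check a URL against the site's robots.txt before adding it to the frontier."""
    parts = urlparse(url)
    host = parts.netloc
    if host not in _robots_cache:
        parser = RobotFileParser()
        parser.set_url(f"{parts.scheme}://{host}/robots.txt")
        try:
            parser.read()                   # fetch and parse the site's robots.txt
            _robots_cache[host] = parser
        except Exception:
            _robots_cache[host] = None      # robots.txt could not be fetched
    parser = _robots_cache[host]
    return True if parser is None else parser.can_fetch(user_agent, url)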
URL normalization
- Relative links should be converted into absolute links, e.g. subdir/onefile.html, ../../othersubdir/otherfile.html, /rootfile.html
- Eliminate URLs of other protocols, e.g. mailto:leeshuqing@yahoo.cn
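A short sketch of both steps using urllib.parse.urljoin; the base URL here is a made-up page on which the links might have been found:

from urllib.parse import urljoin, urlparse

base = "http://www.example.edu/dir/page.html"    # hypothetical page the links were found on
links = [
    "subdir/onefile.html",
    "../../othersubdir/otherfile.html",
    "/rootfile.html",
    "mailto:leeshuqing@yahoo.cn",
]

normalized = []
for link in links:
    absolute = urljoin(base, link)               # resolve relative links against the base URL
    if urlparse(absolute).scheme in ("http", "https"):
        normalized.append(absolute)              # keep only web URLs; drop mailto: and the like

print(normalized)
# ['http://www.example.edu/dir/subdir/onefile.html',
#  'http://www.example.edu/othersubdir/otherfile.html',
#  'http://www.example.edu/rootfile.html']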
What is duplicate elimination?
- If the URL is already in the frontier or (in the case of a non-continuous crawl) has already been crawled, we do not add it to the frontier
Other tasks in crawling
Performed by a dedicated thread that wakes up periodically to:
- Log crawl progress statistics
- Decide whether to terminate the crawl
- Checkpoint the crawl for failure recovery
7.2.1.4 Distributed crawling
The basic distributed crawl architecture
How to crawl?
- Each node crawls hosts "near" it
  - By domain: does not always reflect geographic proximity
  - By hash value of the URL
How to communicate and share URLs
- Use a host splitter to dispatch each surviving URL to the crawler node responsible for that URL (see the sketch below)
- The URL set at each node must also receive URLs from other nodes
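A sketch of a host splitter, assuming URLs are assigned to nodes by hashing the host name so that one node stays responsible for all of a host's pages; NUM_NODES and send_to_peer are hypothetical placeholders for the cluster size and the inter-node channel:

import hashlib
from urllib.parse import urlparse

NUM_NODES = 4    # number of crawler nodes; an assumed figure


def node_for_url(url):
    """Host splitter: hash the host name so that all URLs of one host map to
    the same crawler node, keeping per-host politeness local to that node."""
    host = urlparse(url).netloc
    return int(hashlib.md5(host.encode("utf-8")).hexdigest(), 16) % NUM_NODES


def dispatch(url, this_node, send_to_peer):
    """Keep the URL locally if it belongs here, otherwise ship it to its owner."""
    owner = node_for_url(url)
    if owner == this_node:
        return url                       # goes into the local URL frontier
    send_to_peer(owner, url)             # hand it to the responsible node
    return None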
The difficulty in identifying the same content
- Nothing can prevent the same (or highly similar) content from appearing on different web servers
- The set of fingerprints/shingles must therefore be partitioned across the nodes based on some property of the fingerprint/shingle
7.2.2 Indexes
- Based on the inverted index
- Distributed indexes
7.2.2.1 Distributed indexes
- Partitioning by terms: known as global index organization
- Partitioning by documents: known as local index organization; more common
Partitioning by terms (global index organization)
- A query is routed to the nodes corresponding to its query terms
- Allows greater concurrency
- But multi-word queries require the sending of long postings lists between sets of nodes for merging
- The implementation of dynamic indexing is also more difficult
Partitioning by documents (local index organization)
- Each node contains the index for a subset of all documents
- Each query is distributed to all nodes, with the results from the various nodes being merged before presentation to the user
- One difficulty is that global statistics must be computed across the entire document collection, i.e. across all nodes
How to decide the partition of documents to nodes?
- One simple approach would be to assign all pages from a host to a single node: risks an unbalanced load
- Assign pages based on a hash value: gives a more uniform distribution
How to get results?
- At query time, the query is broadcast to each of the nodes, with the top k results from each node being merged to find the top k documents for the query (see the sketch below)
- A common implementation heuristic is to partition the document collection into indexes of documents that are more likely to score highly on most queries, and low-scoring indexes with the remaining documents
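A sketch of the merge step, assuming each node returns its own top-k hits as (score, doc_id) pairs already sorted by descending score; the node names and scores are invented for illustration:

import heapq


def merge_top_k(per_node_results, k=10):
    """Merge the per-node result lists, each sorted by descending score,
    into the overall top-k (score, doc_id) pairs for the query."""
    return list(heapq.merge(*per_node_results, reverse=True))[:k]


# Hypothetical results from three document-partitioned index nodes:
node_a = [(0.91, "doc17"), (0.55, "doc3")]
node_b = [(0.87, "doc42"), (0.40, "doc8")]
node_c = [(0.95, "doc5")]

print(merge_top_k([node_a, node_b, node_c], k=3))
# [(0.95, 'doc5'), (0.91, 'doc17'), (0.87, 'doc42')]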
7.2.2.2 Connectivity indexes
Connectivity queries:
- Which URLs link to a given URL?
- Which URLs does a given URL link to?
Applications of connectivity indexes
- Crawl control
- Web graph analysis
- Sophisticated crawl optimization
- Link analysis
The size of connectivity indexes
- Suppose that the Web had four billion pages, each with ten links to other pages
- We would require 32 bits (4 bytes) to specify each end (source and destination) of each link
- Total cost in space: 4 × 10^9 pages × 10 links × 8 bytes = 3.2 × 10^11 bytes (about 300 GB)
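The same arithmetic as a quick check:

pages = 4 * 10**9                         # four billion pages
links_per_page = 10
bytes_per_link = 2 * 4                    # 4 bytes for the source end + 4 for the destination

total_bytes = pages * links_per_page * bytes_per_link
print(total_bytes)                        # 320000000000, i.e. 3.2 * 10**11 bytes (about 300 GB)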
Methods of decreasing the memory requirement
- Data compression, which must still efficiently support connectivity queries
- Adjacency table
  - It has a row for each web page, with the rows ordered by the corresponding integers
  - The row for any page contains a sorted list of integers, each corresponding to a web page that links to this page (or, in a second table, the pages linked to by this page)
  - Nearly cuts the space taken by the naive representation by 50%
  - With further compression, uses as few as 3 bits per link on average, a dramatic reduction from the 64 bits required in the naive representation
More ideas for compressing the adjacency table
- Similarity between lists: if we explicitly represent a prototype row for several similar rows, the remainder can be succinctly expressed in terms of the prototypical row
- Gap encoding in sorted lists: rather than store the destination of each link, we store the offset from the previous entry in the row (see the sketch after the example below)
- Locality: many links from a page go to "nearby" pages (pages on the same host, for instance), so we can often use small integers in encoding the destination of a link and thereby save space
An example of an adjacency table

URL numbering:
1: www.stanford.edu/alchemy
2: www.stanford.edu/biology
3: www.stanford.edu/biology/plant
4: www.stanford.edu/biology/plant/copyright
5: www.stanford.edu/biology/plant/people
6: www.stanford.edu/chemistry
...

Adjacency rows:
1: 1, 2, 4, 8, 16, 32, 64
2: 1, 4, 9, 16, 25, 36, 49, 64
3: 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
4: 1, 4, 8, 16, 25, 36, 49, 64
...
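A sketch of the gap encoding mentioned on the previous slide, applied to row 3 of this example; the function names are illustrative:

def gap_encode(row):
    """Store each entry as its offset from the previous one; nearby
    destinations become small integers that compress well."""
    gaps, previous = [], 0
    for value in row:                     # row is a sorted list of page numbers
        gaps.append(value - previous)
        previous = value
    return gaps


def gap_decode(gaps):
    """Recover the original sorted adjacency row from its gaps."""
    row, total = [], 0
    for gap in gaps:
        total += gap
        row.append(total)
    return row


row3 = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]   # row 3 of the table above
print(gap_encode(row3))                       # [1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
print(gap_decode(gap_encode(row3)) == row3)   # True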