-- 作者:admin
-- 发布时间:2008/5/5 15:52:46
-- [推荐]第八部分课堂讲义之一——Web搜索
7 Web search and search engine
PPT课件下载链接:
7.1 Web search 7.1.1 Web background and history Web is an access mode in Internet, but not the only one Telnet FTP Usenet …
About Web The invention of hypertext, envisioned by Vannevar Bush in the 1940’s and first realized in working systems in the 1970’s, significantly precedes the formation of World Wide Web in the 1990’s World Wide Web simply refers to as Web Flexible and simple features has contributed enormously to the growth of the Web
Characteristics of Web Large scale No central control of authorship Rapid develop Extensible architecture Robust HTML Lack of coordination in its creation Heterogeneity at a daunting scale Diversity of backgrounds and motives of its participants Simple open client-server architecture
Client-server architecture Server communicates with the client via a protocol (the http or hypertext transfer protocol) Simple, asynchronously carrying a variety of payloads encoded in HTML (HyperText Markup Language) Client (generally a browser) can receive the server’s response Client sends an http request to a web server through URL (Universal Resource Locator) URL includes protocol, domain, path, Web page, …(query string) Ignore what it does not understand
How big is the Web? By the end of 1995, Altavista reported that it had crawled and indexed approximately 30 million static web pages By the begin of 2008, Google declared it can index nearly 8 billion web pages, and Yahoo! declared 19 billion web pages Web search engines have historically engaged in a battle of bragging rights over which one indexes more web pages In fact, the actual totality is difficult to get
How to discover Web resources Full-text index search engines Keyword search interface Inverted indexes Ranking mechanisms Taxonomies populated with web pages in categories Browse through a hierarchical tree of category labels The popularity declined over time
Yahoo! Created by Jerry Yang and David Filo of Stanford University in 1994 April Jerry & David's Guide to the World Wide Web Begin with Web directory surpassing 1000 distinct nodes
The drawbacks of Web directory Require manual editorial process User’s idea should match that of the editors performing the classification
The development of search engine Stage I: Focusing on the challenge of scale Documents, Web servers, … Altavista, … Stage II: Improving the quality and relevance of web search results Need to gauge the authoritativeness of a document Google, … Stage III: Multimedia search, personalized search, …
7.1.2 Web graph The definition of Web graph Web consists of static HTML pages together with the hyperlinks between them as a directed graph in which each web page is a node and each hyperlink a directed edge We refer to the set of all such nodes and directed edges as the web graph
Hyperlink The characteristics of Web graph Hyperlink is similar to document’ reference but not equal to it Web graph is not strongly connected In-links do not means out-links The number of in-links to a page (also known as in-degree) has averaged from roughly 8 to 15
The distribution of links Not follow the Poisson distribution Every web page were not to pick the destinations of its links uniformly at random Follow the power law The total number of web pages with in-degree i is proportional to 1/iα Web graph has a bowtie shape
Bowtie shape of Web graph Referred to as IN, OUT, SCC, Tubes and Tendrils SCC: strongly connected component Tubes are small sets of pages outside SCC that lead directly from IN to OUT Tendrils mean either leading nowhere from IN, or from nowhere to OUT Any page in IN can reach any page in SCC by following hyperlinks Likewise, any page in SCC can reach any page in OUT Any page in SCC can reach any other page in SCC Any page in SCC can not reach any page in IN Any page in OUT can not reach any page in SCC IN and OUT are roughly equal in size SCC is somewhat largest
7.1.3 Economic issues Spam and anti-spam Web search advertising
7.1.3.1 Spam and anti-spam Manipulation of web page content for the purpose of appearing high up in search results for selected keywords Increase term frequencies Hide these repeated terms Different from the fact that a company uses large fonts to list its phone numbers in the yellow pages because the latter needs more cost Even if company names beginning with a long string of A’s to be listed early in a yellow pages category
Paid inclusion Auto Bidding (竞价排名) Pay Per Click (点击付费) Different search engines have different policies on whether to allow paid inclusion
More sophisticated spam Cloaking Doorway page Link spam
Cloaking The relationship between page publishers and web search engines is not completely collaborative
Doorway page It contains text and metadata carefully chosen to rank highly on selected search keywords When a browser requests the doorway page, it is redirected to a page containing content of amore commercial nature
Link spam Manipulation of the links into a web page SEO (Search Engine Optimizers) Link farm Search engines use link analysis to recognize link spam Adversarial information retrieval has sprung up around this battle
7.1.3.2 Web search advertising Sponsored search Search advertising
Two types Priced on a cost per mil (CPM) basis Also known as impressions Based on having its banner advertisement displayed 1000 times Priced on the number of times it was clicked on by the user, known as the cost per click (CPC) model The goal of the advertisement is not so much brand promotion as to induce a transaction
An example: Goto search engine The pioneer in Web advertising based on CPC Changed its name to Overture prior to eventual acquisition by Yahoo!
The market strategy of Goto Every query term q it accepted bids from companies who wanted their web page shown on the query q In response to the query q, Goto would return the pages of all advertisers who bid for q, ordered by their bids When the user clicked on one of the returned results, the corresponding advertiser would make a payment to Goto In the initial implementation, this payment equaled the advertiser’s bid for q
Combined strategy of current search engine The primary response provides pure search results generally known as algorithmic search results The sponsored search results displayed separately and distinctively to the right of the algorithmic results
Search engine marketing (SEM) Understanding how search engines do this ranking and how to allocate marketing campaign budgets to different keywords and to different sponsored search engines has become a profession known as search engine marketing (SEM)
Click spam Clicks on sponsored search results are not from bona fide search users Two main types From other devious advertising competitor From devious search engine
7.1.4 The search user experience The behavior of Web users Web search users are not information scientists Not know (or care) about the heterogeneity of web content, the syntax of query languages and the art of phrasing queries
The attracting strategy of search engine Focus on relevance, specifically precision rather than recall in the first few results Save users time in locating the information they sought Believe that user experience is lightweight, meaning that both the search query page and the search results page are uncluttered and almost entirely textual, with very few graphical elements Not bottlenecked by the time to load the search query or results page
The three types of user query needs Informational queries Navigational queries Transactional queries
Informational queries Seek general information on a broad topic Not a single web page that contains all the information sought User wants more web pages
Navigational queries Seek the website or home page of a single entity that the user has in mind In such cases, the user’s expectation is that the very first search result should be the home page about query term, not interested in documents containing the query term The best measure of user satisfaction is precision at 1
Transactional queries Prelude to the user performing a transaction on the Web – such as purchasing a product, downloading a file or making a reservation The search engine should return results listing services that provide form interfaces for such transactions
How to distinguish? Discerning which of these categories a query falls into can be challenging It should be clear that some queries will fall in more than one of these categories, while others will fall outside them
[此贴子已经被作者于2010-12-14 09:22:17编辑过]
|