《The Principle of Information Retrieval》
Theoretical Teaching Outline
(Formulated in 2001, Revised in 2010)
Course Number:
Course Category: Core Course of the Major
Prerequisite Courses: Principle of Database
Follow-up Courses:
Credits: 3
Class Hours: 51 teaching hours
Teacher: Li Shuqing
Teaching Material: Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press, 2008.
Course Description:
This course provides a theoretical and practical explanation of recent advances in information retrieval and their application to existing systems. It takes a system approach, discussing all aspects of an Information Retrieval System. The growth of the Internet and the availability of enormous volumes of data in digital form have created intense interest in techniques that help users locate data. The importance of the Internet and its hypertext link structure is put into perspective as a new type of information retrieval data structure. The total system approach also includes discussion of the human interface and the importance of information visualization for identifying relevant information. With large quantities of page information available on the Internet, Information Retrieval Systems need to retrieve information efficiently, and the course introduces techniques for doing so. In addition to the theoretical aspects, the course maintains a practical theme that puts into perspective the importance and use of the theory in systems used by anyone on the Internet. Students will gain an understanding of what is achievable with existing technologies and of the deficient areas that warrant additional research. The course covers all major aspects of information retrieval in sufficient detail to allow students to implement a simple Information Retrieval System.
Course Objectives:
1. Students should understand the deep influence of information technology on human activities, and be acquainted with the concepts, principles, techniques, methods and related knowledge of information storage and retrieval.
2. Students should be familiar with current information storage techniques and their development trends.
3. Students should be acquainted with the relevant technologies of information retrieval, such as automatic indexing, user search techniques, and full-text and hypertext retrieval, and especially with search engine tools.
4. Students should know the applications of information storage and retrieval techniques, such as building a database and implementing search queries.
Course Content
Chapter 1 Introduction to Information Retrieval
Teaching hours: 3
Teaching Requirements:
Through the teaching and learning of this chapter, students should master the principles and development of information retrieval. Students should also understand modern applications such as search engines.
Teaching Content:
1. Definition of Information Retrieval
2. Definition of Information Retrieval Systems
3. Overview of current research
4. Overview of current applications
5. Search engines and Web search
6. The future of information retrieval
Exercise:
1. What is Web information retrieval?
2. What are the main characteristics of a search engine?
3. Please give some examples that demonstrate the importance of Web information retrieval.
Chapter 2 Basic information retrieval
Teaching hours: 9
Teaching Requirements:
This chapter introduces inverted indexes and shows how simple Boolean queries can be processed using such indexes. It builds on this introduction by detailing how documents are preprocessed before indexing and by discussing how inverted indexes are augmented in various ways for functionality and speed. Finally, it discusses search structures for dictionaries and how to process queries that contain spelling errors or other imprecise matches to the vocabulary of the document collection being searched.
Teaching Content:
Section One: Boolean retrieval
1. An example information retrieval problem
2. Building an inverted index
3. Processing Boolean queries
4. The extended Boolean model
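Illustration for Section One (a minimal sketch, not taken from the textbook): the code below builds a small inverted index and answers a Boolean AND query by merging two sorted postings lists. The toy documents, the whitespace tokenizer and the function names are assumptions made for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Boolean AND: walk two sorted postings lists and keep the common doc IDs."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Toy collection, invented for illustration.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}
index = build_inverted_index(docs)
result = intersect(index["home"], index["july"])
print(result)  # [2, 3, 4]
```

Because both postings lists are kept sorted, the merge touches each posting at most once, which is why the query cost grows with the lengths of the two lists rather than with the size of the collection.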
Section Two: The term vocabulary and postings lists
1. Document delineation and character sequence decoding
2. Determining the vocabulary of terms
3. Skip pointers
4. Positional postings and phrase queries
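Illustration for skip pointers (a sketch under stated assumptions): the function below intersects two sorted postings lists and jumps ahead whenever a skip target does not overshoot the current posting of the other list. Placing skips evenly, roughly the square root of the list length apart, is a common heuristic assumed here; real indexes store the skips inside the postings structure.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, using evenly spaced skip pointers."""
    # Assumption: a position holds a skip pointer only if it is a multiple of
    # the step, and the pointer leads `step` postings ahead.
    s1 = max(1, math.isqrt(len(p1)))
    s2 = max(1, math.isqrt(len(p2)))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if its target is still <= the other posting,
            # so no possible match is jumped over.
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

p1 = [2, 4, 8, 16, 19, 23, 28, 43]
p2 = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(p1, p2))  # [2, 8]
```

The result is identical to a plain merge; the skips only reduce the number of comparisons when one list contains long runs that cannot match.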
Section Three: Tolerant retrieval
1. Search structures for dictionaries
2. Wildcard queries
3. Spelling correction
4. Phonetic correction
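Illustration for spelling correction (a minimal sketch, not the textbook's code): edit distance between a misspelled query term and each vocabulary term can be computed by dynamic programming, and the closest term offered as the correction. The tiny vocabulary and the helper name `correct` are hypothetical; a real system would first narrow the candidates, for example with a k-gram index.

```python
def edit_distance(s, t):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def correct(word, vocabulary):
    """Return the vocabulary term nearest to `word` (ties broken alphabetically)."""
    return min(sorted(vocabulary), key=lambda term: edit_distance(word, term))

vocabulary = {"retrieval", "retrieve", "relevance", "index"}  # hypothetical
print(edit_distance("kitten", "sitting"))  # 3
print(correct("retreival", vocabulary))    # retrieval
```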
Exercise:
1. What is an inverted index?
2. How can skip pointers be used to speed up query processing?
3. What are the main types of wildcard queries?
4. How is phonetic correction implemented?
Chapter 3 Index
Teaching hours: 3
Teaching Requirements:
This chapter describes a number of algorithms for constructing the inverted index from a text collection with particular attention to highly scalable and distributed algorithms that can be applied to very large collections. It also covers techniques for compressing dictionaries and inverted indexes.
Teaching Content:
Section One: Index construction
1. Blocked sort-based indexing
2. Single-pass in-memory indexing
3. Distributed indexing
4. Dynamic indexing
5. Other types of indexes
Section Two: Index compression
1. Statistical properties of terms in information retrieval
2. Dictionary compression
3. Postings file compression
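Illustration for postings file compression (a sketch, assuming the usual variable byte scheme): document IDs are stored as gaps between consecutive IDs, and each gap is written in 7-bit chunks, with the high bit set on the last byte of a number. The function names are chosen for this example.

```python
def vb_encode_number(n):
    """Encode one non-negative integer; the final byte has its high bit set."""
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)
        n //= 128
    chunks[-1] += 128  # mark the last byte of this number
    return bytes(chunks)

def vb_encode_postings(doc_ids):
    """Encode a sorted postings list as variable-byte-coded gaps."""
    out, previous = bytearray(), 0
    for doc_id in doc_ids:
        out += vb_encode_number(doc_id - previous)
        previous = doc_id
    return bytes(out)

def vb_decode_postings(data):
    """Invert vb_encode_postings: rebuild gaps, then running-sum to doc IDs."""
    doc_ids, n, previous = [], 0, 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            n = 128 * n + (byte - 128)
            previous += n
            doc_ids.append(previous)
            n = 0
    return doc_ids

postings = [824, 829, 215406]  # gaps: 824, 5, 214577
encoded = vb_encode_postings(postings)
print(len(encoded))                 # 6
print(vb_decode_postings(encoded))  # [824, 829, 215406]
```

Small gaps, which dominate the postings of frequent terms, cost a single byte each, so the three postings above take 6 bytes instead of 12 with fixed 4-byte integers.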
Exercise:
1. Please explain the characteristics of Google’s MapReduce.
2. Please give some examples that demonstrate how an index speeds up queries.
Chapter 4 Vector space model
Teaching hours: 6
Teaching Requirements:
Through the teaching and learning of this chapter, students should understand the basic principles of the vector space model (VSM), one of the most important and effective information retrieval techniques. Students should also be able to compute the similarity of document vectors for document similarity detection and document search.
Teaching Content:
1. Parametric and zone indexes
2. Term frequency and weighting
3. The vector space model for scoring
4. Efficient scoring and ranking
5. Reference rank
6. Vector space scoring and query operator interaction
Exercise:
1. Why can we not use absolute term frequency to build a document vector?
2. Please give the common methods for scoring document vectors and judge which is most effective.
3. Compute document similarity using VSM.
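Illustration for Exercise 3 (a minimal sketch under stated assumptions): documents are turned into tf-idf weight vectors and compared with cosine similarity. The weighting used here, (1 + log10 tf) times log10(N / df), is one common variant among several; the three toy documents and the whitespace tokenizer are invented for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using log-tf times idf."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: (1 + math.log10(count)) * math.log10(n / df[term])
                        for term, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "car insurance auto insurance",
    "best car insurance",
    "auto repair shop",
]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 3))
print(cosine(vecs[1], vecs[2]))  # 0.0, the two documents share no terms
```

Dividing by the vector lengths removes the effect of document length, which is one reason raw term counts alone are not used as the score.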
Chapter 5 Evaluation in information retrieval
Teaching hours: 3
Teaching Requirements:
This chapter focuses on the evaluation of an information retrieval system based on the relevance of the documents it retrieves, allowing us to compare the relative performances of different systems on benchmark document collections and queries.
Teaching Content:
1. Standard test collections
2. Evaluation of unranked retrieval sets
3. Evaluation of ranked retrieval results
4. Assessing relevance
5. System quality and user utility
6. Results snippets
Exercise:
1. What is TREC?
2. Why do we use different techniques to evaluate unranked and ranked retrieval results?
3. Do you think evaluation is objective or subjective? Why?
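Illustration for this chapter (a sketch, with made-up document IDs and relevance judgments): precision, recall and F1 treat the result as an unordered set, while average precision rewards rankings that place relevant documents early. This difference is the point behind Exercise 2.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based measures for an unranked result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

def average_precision(ranking, relevant):
    """Average of precision@k over the ranks k that hold a relevant document."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    # Relevant documents never retrieved contribute zero precision.
    return total / len(relevant) if relevant else 0.0

ranking = ["d3", "d7", "d1", "d9", "d4"]   # hypothetical system output
relevant = {"d3", "d1", "d5"}              # hypothetical judgments
p, r, f1 = precision_recall_f1(ranking, relevant)
ap = average_precision(ranking, relevant)
print(round(p, 3), round(r, 3), round(f1, 3), round(ap, 3))  # 0.4 0.667 0.5 0.556
```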
Chapter 6 Relevance feedback and query expansion
Teaching hours: 3
Teaching Requirements:
This chapter discusses methods by which retrieval can be enhanced through techniques such as relevance feedback and query expansion, which aim to increase the likelihood of retrieving relevant documents.
Teaching Content:
1. Relevance feedback
2. Pseudo-relevance feedback
3. Global methods for query reformulation
Exercise:
1. How is relevance feedback implemented?
2. Please explain the types of global methods for query reformulation.
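Illustration for Exercise 1 (a sketch of one well-known approach, the Rocchio algorithm): the query vector is moved toward the centroid of the documents the user marked relevant and away from the centroid of those marked nonrelevant. The weights alpha = 1.0, beta = 0.75 and gamma = 0.15 are commonly quoted starting values, assumed here rather than prescribed; the vectors are invented.

```python
def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector; negative weights are dropped."""
    def centroid_weight(docs, term):
        # Average weight of `term` across a list of sparse document vectors.
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    terms = set(query)
    for doc in relevant_docs + nonrelevant_docs:
        terms |= set(doc)

    new_query = {}
    for term in terms:
        weight = (alpha * query.get(term, 0.0)
                  + beta * centroid_weight(relevant_docs, term)
                  - gamma * centroid_weight(nonrelevant_docs, term))
        if weight > 0:
            new_query[term] = weight
    return new_query

query = {"jaguar": 1.0}
relevant_docs = [{"jaguar": 1.0, "car": 1.0}, {"jaguar": 1.0, "car": 0.5}]
nonrelevant_docs = [{"jaguar": 1.0, "cat": 1.0}]
new_query = rocchio(query, relevant_docs, nonrelevant_docs)
print(sorted(new_query))  # ['car', 'jaguar']
```

The new term "car" enters the query because it is frequent in the relevant documents, while "cat" ends up with a negative weight and is removed.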
Chapter 7 Web search and search engine
Teaching hours: 18
Teaching Requirements:
This chapter treats the problem of web search and summarizes its basic challenges, together with a set of techniques that are pervasive in web information retrieval. It then describes the architecture and requirements of a basic web crawler. Finally, it considers the power of link analysis in web search, drawing on several methods from linear algebra and advanced probability theory.
Teaching Content:
Section One: Web search basics
1. Background and history
2. Web characteristics
3. Advertising as the economic model
4. The search user experience
5. Index size and estimation
6. Near-duplicates and shingling
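Illustration for near-duplicates and shingling (a minimal sketch): each page is reduced to its set of k-word shingles, and two pages are compared by the Jaccard coefficient of those sets. Exact set comparison is used here for clarity; large-scale systems estimate the same quantity with hashed sketches. The choice k = 4 and the sample sentences are assumptions for the example.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (as tuples) of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

d1 = "a rose is a rose is a rose"
d2 = "a rose is a rose is a flower"
print(len(shingles(d1)))                    # 3
print(jaccard(shingles(d1), shingles(d2)))  # 0.75
```

A pair of pages would be flagged as near-duplicates when this coefficient exceeds a chosen threshold.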
Section Two: Web crawling and indexes
1. Crawling
2. Distributing indexes
3. Connectivity servers
Section Three: Link analysis
1. The Web graph
2. PageRank and its extensions
3. HITS and its extensions
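Illustration for PageRank (a sketch on a hypothetical three-page graph, not the graph of any exercise in this outline): each page repeatedly divides its current score equally among the pages it links to. The random jump factor is left out and scores are rounded to 5 decimal places; the sketch assumes every page has at least one out-link.

```python
def pagerank(links, iterations=3):
    """Power iteration without the random jump (teleport) factor."""
    pages = sorted(links)
    rank = {page: 1.0 / len(pages) for page in pages}  # uniform start
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for page, targets in links.items():
            # Assumption: no dangling pages, so len(targets) is never zero.
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return {page: round(score, 5) for page, score in rank.items()}

# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(links)
print(scores)  # {'A': 0.33333, 'B': 0.25, 'C': 0.41667}
```

Starting from 1/3 each, the scores after the three iterations are (1/3, 1/6, 1/2), (1/2, 1/6, 1/3) and (1/3, 1/4, 5/12) for A, B and C, which can be checked by hand.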
Section Four: Search engines
1. The history of search engines
2. The challenges facing search engines
Exercise:
1. What are the main characteristics of the modern Web?
2. How do search engines cope with the difficulties posed by the very large Web?
3. Please give the main differences between PageRank and HITS.
4. Compute the PageRank scores of these pages using 3 iterations (round to 5 decimal places and ignore the random jump factor).
Chapter 8 Advanced information retrieval
Teaching hours: 6
Teaching Requirements:
This chapter introduces several types of advanced information retrieval, including XML retrieval and multimedia retrieval.
Teaching Content:
Section One: XML retrieval
1. Challenges in XML retrieval
2. Vector space model for XML retrieval
3. Evaluation of XML Retrieval
4. Text-centric XML retrieval and data-centric XML retrieval
Section Two: Multimedia retrieval
1. Models and Languages
2. Indexing and Searching
Exercise:
1. What are the main challenges in XML retrieval?
2. How do we compute the similarity of XML documents?
3. Please explain the content-based method in multimedia retrieval.