-- Author: admin
-- Posted: 2008/4/27 7:18:29
-- [Recommended] Part 6 Lecture Notes: Evaluation in Information Retrieval
8 Evaluation in information retrieval
How to measure user happiness? The key utility measure is user happiness. Factors in user happiness:
- Speed of response
- Size of the index
- Relevance of results
- User interface design (independent of the quality of the results)
8.1 Introduction
How to measure information retrieval effectiveness? We need a test collection consisting of three things:
- A document collection
- A test suite of information needs, expressible as queries
- A set of relevance judgments, standardly a binary assessment of either relevant or not relevant for each query-document pair
In this test collection:
- Each document is given a binary classification as either relevant or not relevant
- The collection and the suite of information needs have to be of a reasonable size: results are highly variable over different documents and information needs, so at least 50 information needs are needed
The difficulties:
- The stated information need differs from the query: relevance is assessed relative to an information need, not a query. This is both a system issue and a user issue (for example, the one-word query "python", where "Python" and "python" may correspond to different information needs)
- Relevance decisions are subjective
- Many systems contain various parameters that can be adjusted to tune system performance; the correct procedure is to do such tuning on one or more development test collections
8.2 Standard test collections
- The Cranfield collection
- Text Retrieval Conference (TREC)
- GOV2
- NII Test Collections for IR Systems (NTCIR)
- Reuters-21578 and Reuters-RCV1
The Cranfield collection: collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles and a set of 225 queries, allowing precise quantitative measures, but it is too small.
Text Retrieval Conference (TREC): the U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992, over a range of different test collections. The best known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations from 1992 to 1999, comprising 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles).
Relevance judgments exist for 450 information needs, which are called topics and are specified in detailed text passages. TRECs 6-8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles. Relevance judgments are available only for the documents that were among the top k returned by some system entered in the TREC evaluation.
GOV2: contains 25 million web pages. GOV2 is one of the largest web test collections, but it is still more than 3 orders of magnitude smaller than the current size of the document collections indexed by the large web search companies.
NII Test Collections for IR Systems (NTCIR): of similar sizes to the TREC collections, focusing on East Asian languages and cross-language information retrieval.
Reuters-21578 and Reuters-RCV1: mostly used for text classification. Reuters-21578 is a collection of 21578 newswire articles; Reuters Corpus Volume 1 (RCV1) is much larger, consisting of 806,791 documents.
8.3 Evaluation of unranked retrieval sets
- Precision
- Recall
- Accuracy
- F measure
Precision: precision (P) is the fraction of retrieved documents that are relevant.
Recall: recall (R) is the fraction of relevant documents that are retrieved.
Contingency table:

|               | Relevant             | Not relevant         |
| Retrieved     | true positives (tp)  | false positives (fp) |
| Not retrieved | false negatives (fn) | true negatives (tn)  |

The other way to define P and R:
P = tp/(tp + fp)
R = tp/(tp + fn)
Accuracy: accuracy is the fraction of classifications that are correct. An information retrieval system can be thought of as a two-class classifier, and then
accuracy = (tp + tn)/(tp + fp + fn + tn)
Accuracy is often used for evaluating machine learning classification problems, but it is not an appropriate measure for information retrieval problems: normally over 99.9% of the documents are in the not relevant category, so a system can maximize accuracy by simply deeming all documents nonrelevant to all queries; that is to say, tn may be very large.
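A small illustration of this point, assuming hypothetical counts for a single query over a 100,000-document collection with only 50 relevant documents (the numbers are made up for the example):

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant documents that are retrieved."""
    return tp / (tp + fn)

def accuracy(tp, fp, fn, tn):
    """Fraction of all classification decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical counts: 50 relevant documents in a 100,000-document collection.
tp, fp, fn, tn = 10, 40, 40, 99_910

print(precision(tp, fp))         # 0.2
print(recall(tp, fn))            # 0.2
print(accuracy(tp, fp, fn, tn))  # 0.9992 -- dominated by the huge tn count
```

Even with precision and recall of only 0.2, accuracy looks nearly perfect, which is why it is not used for IR evaluation.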
The comparison of P and R:
- Typical web surfers prefer P to R
- Various professional searchers such as paralegals and intelligence analysts prefer R to P
- Individuals searching their hard disks prefer R to P
The two quantities clearly trade off against one another: recall is a non-decreasing function of the number of documents retrieved, and one can always get a recall of 1 (but very low precision) by retrieving all documents for all queries, whereas precision usually decreases as the number of documents retrieved is increased.
F measure: a single measure that trades off precision versus recall.
F = 1/(α/P + (1 - α)/R) = (β² + 1)PR/(β²P + R), where β² = (1 - α)/α
α ∈ [0, 1] and thus β² ∈ [0, ∞]. The default balanced F measure uses β = 1, i.e., F1 = 2PR/(P + R).
Values of β < 1 emphasize precision, while values of β > 1 emphasize recall. The F measure uses the harmonic mean rather than the simpler arithmetic average: recall that we can always get 100% recall by just returning all documents, and therefore we can always get a 50% arithmetic mean by the same process.
The harmonic mean is always less than or equal to the arithmetic and geometric means, and it is often quite close to the minimum of the two numbers. This strongly suggests that the arithmetic mean is an unsuitable measure to use, because it sits closer to the maximum of the two values than the harmonic mean does.
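A minimal sketch of the weighted harmonic mean defined above; the precision and recall values are illustrative only:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta < 1 emphasizes precision, beta > 1 emphasizes recall;
    beta = 1 gives the balanced F1 measure."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Returning every document: recall = 1.0 but precision is tiny.
p, r = 0.001, 1.0
print((p + r) / 2)      # arithmetic mean: ~0.5, misleadingly high
print(f_measure(p, r))  # F1: ~0.002, close to the minimum of the two
```

The example shows why the harmonic mean is preferred: the "return everything" strategy no longer scores 50%.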
8.4 Evaluation of ranked retrieval results
Precision, recall, and the F measure are set-based measures and are computed using unordered sets of documents. In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents.
The types:
- Precision-recall curve
- Mean Average Precision (MAP)
- Precision at k
- R-precision
- ROC curve
Precision-recall curve: if the (k + 1)th document retrieved is nonrelevant, then recall is the same as for the top k documents, but precision has dropped; if it is relevant, then both precision and recall increase, and the curve jags up and to the right.
Precision-recall curve with interpolated precision: interpolation removes the jiggles in the curve. The interpolated precision at a certain recall level r is defined as the highest precision found for any recall level q ≥ r. The justification is that almost anyone would be prepared to look at more documents if it would increase the percentage of the viewed set that were relevant.
Precision-recall curve with 11-point interpolated average precision: this boils the curve's information down to a few numbers. The 11 points are the recall levels 0.0, 0.1, 0.2, ..., 1.0. For each recall level, we then calculate the arithmetic mean of the interpolated precision at that recall level across the information needs in the test collection.
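A sketch of interpolated precision and the 11-point average for a single ranked result list; the function names, the 0/1 judgment list, and the relevant-document count are illustrative assumptions. Averaging the per-level values over all information needs then gives the composite figure described above.

```python
def precision_recall_points(ranked_relevance, num_relevant):
    """(recall, precision) pairs after each retrieved document."""
    points, hits = [], 0
    for k, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        points.append((hits / num_relevant, hits / k))
    return points

def interpolated_precision(points, r):
    """Highest precision found at any recall level q >= r."""
    candidates = [p for (q, p) in points if q >= r]
    return max(candidates) if candidates else 0.0

def eleven_point_average(points):
    """Mean interpolated precision over recall levels 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(points, r) for r in levels) / 11

ranked = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]  # hypothetical top-10 judgments
pts = precision_recall_points(ranked, num_relevant=5)
print(eleven_point_average(pts))
```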
Mean Average Precision (MAP): the most standard measure in the TREC community. It provides a single-figure measure of quality across recall levels and has especially good discrimination and stability.
For a single information need, Average Precision is the average of the precision values obtained for the set of top k documents existing after each relevant document is retrieved.
If the set of relevant documents for an information need q_j ∈ Q is {d_1, ..., d_mj}, and R_jk is the set of ranked retrieval results from the top result down to document d_k, then
MAP(Q) = (1/|Q|) Σ_{j=1..|Q|} (1/m_j) Σ_{k=1..m_j} Precision(R_jk)
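A minimal sketch of Average Precision and MAP following the definition above; the helper names and the 0/1 judgment lists are hypothetical, and relevant documents that are never retrieved contribute a precision of zero:

```python
def average_precision(ranked_relevance, num_relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    # Relevant documents never retrieved contribute a precision of 0.
    return sum(precisions) / num_relevant

def mean_average_precision(runs):
    """runs: list of (ranked_relevance, num_relevant) pairs, one per query."""
    return sum(average_precision(r, m) for r, m in runs) / len(runs)

runs = [([1, 0, 1, 0, 1], 3),      # hypothetical query 1
        ([0, 1, 0, 0, 0, 1], 4)]   # hypothetical query 2 (2 relevant never found)
print(mean_average_precision(runs))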
The disadvantage of MAP: fixed recall levels are not chosen; the precision is effectively averaged over all recall levels. Calculated scores normally vary widely across information needs (from 0.1 to 0.7), so a set of test information needs must be large and diverse enough to be representative of system effectiveness across different queries.
Precision at k: what matters, especially for a search engine, is rather how many good results there are on the first page or the first few pages. This method measures precision at fixed low levels of retrieved results, such as 10 or 30 documents. It has the advantage of not requiring any estimate of the size of the set of relevant documents, but the disadvantage that it is the least stable of the commonly used measures.
R-precision: it requires having a set of known relevant documents of size Rel, from which we calculate the precision of the top Rel documents returned. Like precision at k, R-precision describes only one point on the precision-recall curve. If r of the top Rel documents are relevant, R-precision and R-recall are both equal to r/Rel, so R-precision is identical to the break-even point of the precision-recall curve.
Unlike precision at k, it adjusts for the size of the relevant set: even a perfect system could only achieve a precision at 20 of 0.4 if there were only 8 relevant documents. Averaging this measure across queries thus makes more sense.
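A short sketch of precision at k and R-precision; the rankings are hypothetical, and the second example mirrors the 8-relevant-documents case above:

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top k results that are relevant."""
    return sum(ranked_relevance[:k]) / k

def r_precision(ranked_relevance, num_relevant):
    """Precision over the top |Rel| results, where Rel is the known relevant set."""
    return precision_at_k(ranked_relevance, num_relevant)

# A hypothetical query with only 8 relevant documents; even a perfect
# ranking cannot exceed 8/20 = 0.4 for precision at 20.
perfect = [1] * 8 + [0] * 12
print(precision_at_k(perfect, 20))  # 0.4
print(r_precision(perfect, 8))      # 1.0 -- R-precision adapts to |Rel|
```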
ROC curve (Receiver Operating Characteristic): the ROC curve plots the true positive rate (sensitivity, i.e., recall) against the false positive rate (1 - specificity). The false positive rate is given by fp/(fp + tn), and specificity is given by tn/(fp + tn).
An ROC curve always goes from the bottom left to the top right of the graph; for a good system, the graph climbs steeply on the left side.
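A sketch of how the ROC points could be computed by sweeping the rank cutoff down a ranked list; the judgment list and collection size are assumptions for illustration:

```python
def roc_points(ranked_relevance, num_relevant, collection_size):
    """(false positive rate, true positive rate) after each rank cutoff."""
    num_nonrelevant = collection_size - num_relevant
    tp = fp = 0
    points = [(0.0, 0.0)]
    for rel in ranked_relevance:
        if rel:
            tp += 1
        else:
            fp += 1
        points.append((fp / num_nonrelevant, tp / num_relevant))
    return points

ranked = [1, 1, 0, 1, 0, 1, 0, 0]  # hypothetical top-8 judgments
print(roc_points(ranked, num_relevant=4, collection_size=20))
```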
8.5 Assessing relevance
About information needs:
- Information needs must be germane to the documents
- Information needs are best designed by domain experts
- Using random combinations of query terms as an information need is generally not a good idea, because typically they will not resemble the actual distribution of information needs
About assessment:
- Assessment is a time-consuming and expensive process involving human beings
- For tiny collections like Cranfield, exhaustive judgments of relevance for each query-document pair were obtained
- For large modern collections, it is usual for relevance to be assessed only for a subset of the documents for each query; the most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by the systems being evaluated
Humans and their relevance judgments are quite variable, but this is not a problem to be solved: in the final analysis, the success of an IR system depends on how good it is at satisfying the needs of these variable humans. We do, however, need to measure the agreement of these assessments.
Kappa statistic: it measures how much agreement there is between relevance judgments, corrected for agreement expected by chance.
κ = (P(A) - P(E)) / (1 - P(E))
P(A) is the proportion of the times the judges agreed; P(E) is the proportion of the times they would be expected to agree by chance, which needs to be estimated.
Relevance judgments by two judges (one judge along the rows, the other along the columns):

|       | Yes | No | Total |
| Yes   | 300 | 20 | 320   |
| No    | 10  | 70 | 80    |
| Total | 310 | 90 | 400   |
P(A) = (300 + 70)/400 = 370/400 = 0.925
P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875
P(E) = P(nonrelevant)² + P(relevant)² = 0.2125² + 0.7875² = 0.665
κ = (P(A) - P(E))/(1 - P(E)) = (0.925 - 0.665)/(1 - 0.665) = 0.776
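The same worked example expressed as a small sketch so the arithmetic can be checked; the function name is hypothetical, and P(E) is estimated from the pooled marginals exactly as in the calculation above:

```python
def kappa(agree_yes, yes_no, no_yes, agree_no):
    """Kappa statistic for two judges from their 2x2 agreement table."""
    total = agree_yes + yes_no + no_yes + agree_no
    p_a = (agree_yes + agree_no) / total
    # Pooled marginals estimate the chance-agreement probability P(E).
    p_rel = (agree_yes + yes_no + agree_yes + no_yes) / (2 * total)
    p_nonrel = 1 - p_rel
    p_e = p_rel ** 2 + p_nonrel ** 2
    return (p_a - p_e) / (1 - p_e)

print(kappa(300, 20, 10, 70))  # ~0.776
```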
Result:
- A kappa value above 0.8 is taken as good agreement
- A kappa value between 0.67 and 0.8 is taken as fair agreement
- A kappa value below 0.67 is seen as a dubious basis for evaluation
If there are more than two judges, it is normal to calculate an average pairwise kappa value.
The level of agreement in TREC normally falls in the range of "fair" (0.67 to 0.8), which suggests not requiring more fine-grained relevance labeling. These deviations have in general been found to have little impact on the relative effectiveness ranking of systems, despite the variation in individual assessors' judgments.
About relevance, some dubious hypotheses are made:
- The relevance of one document is treated as independent of the relevance of other documents in the collection
- Assessments are binary and not nuanced
- An information need is treated as an absolute, objective decision
So any results are heavily skewed by the choice of collection, queries, and relevance judgment set.
Marginal relevance: whether a document still has distinctive usefulness after the user has looked at certain other documents. The most extreme case is duplicate documents, a phenomenon that is actually very common on the web. Maximizing marginal relevance requires returning documents that exhibit diversity and novelty.
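As a hedged sketch only: one common way to act on this idea is a greedy, maximal-marginal-relevance style selection. The scoring functions relevance and similarity, the lambda_ weight, and the helper name are assumptions for illustration, not part of the lecture.

```python
def select_diverse(candidates, relevance, similarity, lambda_=0.7, k=10):
    """Greedily pick documents that balance relevance against redundancy
    with respect to the documents already selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def marginal(doc):
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lambda_ * relevance(doc) - (1 - lambda_) * redundancy
        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected
```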