-- 作者:admin
-- 发布时间:2008/5/27 14:17:59
-- [推荐]第九部分课堂讲义——XML检索
8 XML Retrieval PPT课件下载链接:
Two Retrieval Methods Unstructured text retrieval Handled by IR systems Structured text retrieval Relational data retrieval Handled by DBMS Structured documents retrieval XML Retrieval
Relational data retrieval For example: select * from stu where name like ’李%’ Meaning that an inverted index is created and Boolean or vector space search enabled, effectively combining core database with information retrieval technologies
Semi-structured retrieval The term structured retrieval is used for XML retrieval The term semi-structured retrieval is mainly used for database querying
Structured documents retrieval Applications of structured documents retrieval include: Digital libraries Patent databases Blogs Text in which entities like persons and locations have been tagged (in a process called named entity tagging) Output from office suites like OpenOffice that save documents as marked up text
We want to query that combine textual criteria with structural criteria For example: Give me patents whose claims mention RSA public key encryption and that cite US patent 4,405,829 These three queries are structured queries that cannot be answered well by an unranked retrieval system And these retrieved data is not in DBMS but in raw text documents
The comparison of these methods There is no consensus yet as to which methods work best for structured retrieval although many researchers believe that XQuery will become the standard for structured queries
The comparison of XML retrieval and relational databases retrieval The XML retrieval is characterized by: Long text fields (e.g., sections of a document) Inexact matching Relevance-ranked results Relational databases do not deal well with this use case
XML retrieval XML retrieval is one standard retrieval methods for encoding structured documents These retrieval method is also applicable to markup languages in general such as HTML etc.
The characteristics of XML retrieval Aimed at text-centric XML Not aimed at data-centric XML Most data-centric XML is stored in relational databases in contrast to the inverted index-based methods for text-centric XML
8.1 Basic XML concepts An example of XML document <play> <author>Shakespeare</author> <title>Macbeth</title> <act number="I"> <scene number="vii"> <title>Macbeth’s castle</title> <verse>Will I with wine and wassail ...</verse> </scene> </act> </play>
Document Object Model ( DOM ) This figure shows a simplified DOM object DOM is the standard for accessing and processing XML documents With a DOM API, we can process an XML document by starting at the root element and then descending down the tree from parents to children
XPath Also called as XML contexts Standard for enumerating paths in an XML document collection
The characteristics of XPath The XPath expression node selects all nodes of that name “act/scene” selects all scene elements whose parent is an act element Double slashes indicate that an arbitrary number of elements can intervene on a path “play//scene” selects all scene elements occurring in a play element
An initial slash starts the path at the root element “/play/title” selects the play’s title “/play//title” selects a set with two members The final element of a path to be a vocabulary term and separate it from the element path by the symbol # Only for notational convenience but not conforming to the XPath standard “ title#"Macbeth" ” selects all titles containing the term Macbeth
NEXI Narrowed Extended XPath I A common format for XML queries For example: //article [.//yr = 2001 or .//yr = 2002] //section [about(.,summer holidays)] Specifieing a search for a section about the summer holidays that are part of articles from 2001 or 2002 The dot in a clause in square brackets refers to the element the clause modifies The about clause is a ranking constraint
8.2 Challenges in XML retrieval 8.2.1 How to choose the result unit Users want to return parts of documents (i.e., XML elements), not entire documents as IR systems usually do in unstructured retrieval If users query Shakespeare’s plays for Macbeth’s castle, they are probably looking for the scene, not the act or the entire play
Structured document retrieval principle A system should always retrieve the most specific part of a document answering the query One criterion for selecting the most appropriate part of a document This principle motivates a retrieval strategy that returns the smallest unit that contains the information sought, but does not go below this level
One difficulty However, it is not invariable and can be hard to implement this principle algorithmically Consider the query title#"Macbeth" The title of the tragedy, Macbeth, and the title of Act I, Scene vii, Macbeth’s castle, are both good hits because they contain the matching term Macbeth But in this case, the title of the tragedy, the higher node, is preferred
8.2.2 How to choose the index unit The difficulties also include deciding which parts of a document to index In unstructured retrieval, it is usually clear what the right document unit is: files on your desktop, email messages, web pages on the web etc. In structured retrieval, there are a number of different approaches to defining the indexing unit
In the example, books, chapters and sections have been designated to be indexing units, but without overlap The disadvantage of this approach is that pseudo-documents may not make sense to the user because they are not coherent units
Top-down method We can also use one of the largest elements as the indexing unit, for example, the book element in a collection of books We can then postprocess search results to find for each book the subelement that is the best hit For example, the query Macbeth’s castle may return the play Macbeth, which we can then postprocess to identify act I, scene vii as the best-matching subelement Unfortunately, the relevance of the whole is often not a good predictor of the relevance of small subelements within it
Bottom-up method We can also search all leaves, select the most relevant ones and then extend them to larger units in postprocessing For the query Macbeth’s castle, we would retrieve the title Macbeth’s castle in the first pass and then decide in a postprocessing step whether to return the title, the scene, the act or the play Unfortunately, the relevance of a leaf element is also often not a good predictor of the relevance of elements it is contained in
Indexing-all-elements method The least restrictive approach is to index all elements This is also problematic Many XML elements are not meaningful search results, e.g., typographical elements like <b>definitely</b> or an ISBN number which cannot be interpreted without context Also, indexing all elements means that search results will be highly redundant For the query Macbeth’s castle, we would return all of the play, act, scene and title elements on the path between the root node and Macbeth’s castle
Restriction strategies in Indexing-all-elements method Discard all small elements Discard all element types that users do not look at (this requires a working XML retrieval system that logs this information) Discard all element types that assessors generally do not judge to be relevant (if relevance assessments are available) Only keep element types that a system designer or librarian has deemed to be useful search results
8.2.3 How to reduce redundancy of result In most of these approaches, result sets will still contain nested elements
The solution Removing some elements in a postprocessing step Collapsing several nested elements in the results list and use highlighting of query terms An additional advantage of this approach is that the paragraph is presented together with its context section This context may be helpful in interpreting the paragraph If the user knows the schema of the collection and is able to specify the desired type of element, then the problem of redundancy is alleviated as few nested elements have the same type
8.2.4 How to distinguish different contexts of a term in compute term statistics for ranking For example, the term Gates under the node author is unrelated to an occurrence under a content node like section if used to refer to the plural of gate It makes little sense to compute a single document frequency for Gates in this example
The solution Computing idf for XML-context/term pairs For example, computing different idf weights for author#"Gates" and section#"Gates“ Unfortunately, this scheme will run into sparse data problems Many XML-context pairs occur too rarely to reliably estimate df A compromise is only to consider the parent node x of the term and not the rest of the path from the root to x to distinguish contexts Unfortunately, we do not distinguish names of authors and names of corporations if both have the parent node name
8.2.5 How to handle schema heterogeneity The XML documents in an IR application often come from more than one source This phenomenon is called schema heterogeneity or schema diversity An example of schema heterogeneity Many elements may have different names creator vs. author
The solution Some form of approximate matching of element names in combination with semi-automatic matching of different document structures can help here Human editing of correspondences of elements in different schemas will usually do better than automatic methods
The challenge for interface design Another reason is that users often are not familiar with the element names and the structure of the schemas of collections they search as mentioned before This poses a challenge for interface design in XML retrieval Ideally, the user interface should expose the tree structure of the collection and allow users to specify the elements they are querying Designing the query interface in structured retrieval is more complex than a search box for keyword queries in unstructured retrieval
Extended queries supporting the user by interpreting all parent-child relationships in queries as descendant relationships with any number of intervening nodes allowed Some examples with pseudo-XPath notation book//#"Gates" book//section//#"Gates"
This is a case where we may want to interpret the structural constraints specified in the query Elements that do not meet structural constraints perfectly should be ranked lower, but they should not be omitted from search results
8.3 A vector space model for XML retrieval An example We want a book entitled Julius Caesar to be a match for q1 and no match (or a lower weighted match) for q2
The necessary of VSM in XML retrieval In unstructured retrieval, there would be a single dimension of the vector space for Caesar In XML retrieval, we must separate the title word Caesar from the author name Caesar One way of doing this is to have each dimension of the vector space encode a word together with its position within the XML tree
Lexicalized subtrees The dimensions of the vector space Not vocabulary terms The subtrees that contain at least one vocabulary term We can now represent queries and documents as vectors in this space of lexicalized subtrees and compute matches between them
About space of VSM in XML retrieval There is a tradeoff between the dimensionality of the space and accuracy of query results A compromise is to index all paths that end in a single vocabulary term, in other words, all XML-context/term pairs Such an XML-context/term pair is called by a structural term and denote it by <c, t>: a pair of XML-context c and vocabulary term t This is why only seven structural term are shown in figure
The computation of VSM in XML retrieval Because of users’ habit, we will therefore interpret all queries as extended queries But we still prefer documents that match the query structure closely by inserting fewer additional nodes
A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR
|cq| and |cd| are the number of nodes in the query path and document path, respectively cq matches cd iff we can transform cq into cd by inserting additional nodes
SimNoMerge The final score for a document is computed as a variant of the cosine measure which we call We compute the weights using one of the weightings, e.g., idft·wft,d
8.4 Evaluation of XML Retrieval The premier venue for research on XML retrieval is the INEX (INitiative for the Evaluation of XML retrieval) program INEX is a collaborative effort that has produced reference collections, sets of queries, and relevance judgments
INEX The INEX 2002 collection consisted of about 12,000 articles from IEEE journals The IEEE journal collection was expanded in 2005 Since 2006 INEX uses the much larger English Wikipedia as a test collection
The schema of INEX collection Two types of information needs in INEX Content-Only: CO topics Content-And-Structure: CAS topics Since CAS queries have both structural and content criteria, relevance assessments are more complicated than in unstructured retrieval
How to evaluate INEX 2002 defined two orthogonal dimensions of relevance Component coverage Topical relevance
Component coverage It evaluates whether the element retrieved is “structurally” correct, i.e., neither too low nor too high in the tree We distinguish four cases: Exact coverage (E): The information sought is the main topic of the component and the component is a meaningful unit of information Too small (S): The information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information Too large (L): The information sought is present in the component, but is not the main topic No coverage (N): The information sought is not a topic of the component
Topical relevance The topical relevance dimension also has four levels: highly relevant (3) fairly relevant (2) marginally relevant (1) nonrelevant (0)
The denotation of evaluation result Components are judged on both dimensions and the judgments are then combined into a digit-letter code 2S is a fairly relevant component that is too small 3E is a highly relevant component that has exact coverage In theory, there are 16 combinations of coverage and relevance, but many cannot occur A nonrelevant component cannot have exact coverage, so the combination 3N is not possible
Evaluation scheme The relevance-coverage combinations are then quantized as follows: Binary relevance judgments, which are standard in unstructured information retrieval, are not appropriate for XML retrieval A 2S component provides incomplete information and may be difficult to interpret without more context, but it does answer the query partially The quantization function Q does not impose a binary choice relevant/nonrelevant and instead allows us to grade the component as partially relevant
The number of relevant components in a retrieved set A of components can then be computed as:
As an approximation, the standard definitions of precision, recall and F can be applied to this modified definition of relevant items retrieved
The disadvantage of this evaluation scheme One flaw of measuring relevance this way is that overlap is not accounted for Need to consider marginal relevance Much of the recent focus at INEX has been on developing algorithms and evaluation measures that return non-redundant results lists and evaluate them properly
Why the effectiveness in XML retrieval is often lower? XML retrieval is harder Instead of just finding a document, we have to find the subpart of a document that is most relevant to the query Graded judgments lower measured performance Consider a system that returns a document with graded relevance 0.6 and binary relevance 1 at the top of the retrieved list. Then, interpolated precision at 0.00 recall is 1.0 on a binary evaluation, but can be as low as 0.6 on a graded evaluation
A comparison of content-only and full-structure search In INEX 2003/2004 The evaluation metric is precision at k The discretization function used for the evaluation maps highly relevant elements (roughly corresponding to the 3E elements defined for Q) to 1 and all other elements to 0
These results demonstrate the benefits of structured retrieval The table shows that structure helps increase precision at the top of the results list There is a large increase of precision at k = 5 and at k = 10 There is almost no improvement at k = 30 Structured retrieval imposes additional constraints on what to return and documents that pass the structural filter are more likely to be relevant Recall may suffer at high levels because some relevant documents will be filtered out, but for precision-oriented tasks structured retrieval is superior
8.5 Some complicated XML retrieval 8.5.1 Data-centric XML retrieval Data-centric XML mainly encodes numerical and non-text attribute- value data When querying data-centric XML, we want to impose exact match conditions in most cases An example of data-centric XML retrieval Find employees whose salary is the same this month as it was 12months ago This query requires no ranking exact matching
8.5.2 XML retrieval with joins and ordering constraints. The query for employees with unchanged salary requires a join The following query imposes an ordering constraint: Retrieve the chapter of the book Introduction to algorithms that follows the chapter IR
About relational databases retrieval Relational databases are better equipped to handle many structural constraints, particularly joins although ordering is also difficult in a database framework The tuples of a relation in the relational calculus are not ordered For this reason, most data-centric XML retrieval systems are extensions of relational databases If text fields are short, exact matching meets user needs and retrieval results in form of unordered sets are acceptable, then using a relational database for XML retrieval is appropriate
XQuery A query language proposed for standardization by the W3C A powerful query language for XML that can handle numerical attributes, joins and ordering constraints Due to its complexity, it is challenging to implement an XQuery-based ranked retrieval system This is currently one of the most active areas of research in XML retrieval
[此贴子已经被作者于2010-12-14 09:28:23编辑过]
|