Chapter 5
Query Modeling Using Concepts
In the previous chapter we looked at how to use explicit and pseudo relevance
information to obtain an improved estimate of the query model and, hence, improved
retrieval performance. The documents used there were newswire documents and web
pages. What if the documents are annotated, e.g., using concepts? Can we utilize the
knowledge captured by those annotations to further improve retrieval effectiveness? In
this chapter we introduce and evaluate a model that leverages document-level
annotations for query modeling.
Explicit (and often manually curated) knowledge is routinely added to documents
for a variety of reasons, e.g., to increase their findability or to aid navigation of the
collection to which they belong. It is typically expressed in a meta-language and can be
either formal (e.g., in the form of a thesaurus or ontology [157]) or more informal
(e.g., in the form of user-generated tags [238, 269]). Annotations of the formal kind
may be found in a broad range of domains and a variety of document types. News
articles, for example, can be annotated with concepts from the NewsCodes taxonomies
provided by the International Press Telecommunications Council (IPTC) [319]. Another
example is the annotation of bibliographic records with indexing terms from a
controlled vocabulary. In the biomedical domain, citations in the MEDLINE database
are manually indexed with concepts from the Medical Subject Headings (MeSH)
thesaurus.
As indicated earlier, we refer to the broad range of formal meta-languages as concept
languages and to their vocabulary terms as concepts. Figure 1.2 shows an excerpt
from MeSH. Tables 5.1 and 5.2 show two examples of document-concept
annotations from the two test collections that we use and that were introduced in
Section 3.3.
In order to use concept languages for query modeling, we develop a two-step
translation-based method. In the first step, an information need (as expressed in a
textual query) is translated into a conceptual representation. In a process we call
conceptual query modeling, feedback documents from an initial retrieval run are used to
obtain a conceptual query model; this model represents the user’s information need at
the level of concepts rather than that of the terms entered by the user. The intuition
behind this step is that this conceptual representation provides a less ambiguous
representation of the information need. In contrast to traditional textual relevance
feedback, where query refinement is biased towards terms occurring in the initial query,
this intermediate conceptual representation is less dependent on the original query
words. On its own, this explicit conceptual representation can be used to aid retrieval,
for example by suggesting relevant concepts to the user [165, 209, 285, 323] or by
matching it to a conceptual representation of, or the annotations associated with, the
documents [254, 318].
In the second step, we translate the conceptual query model back into a
contribution to the textual query model. We hypothesize that, since the
textual representation of documents is more detailed than their conceptual
representation (a document is typically represented by far more terms than concepts),
retrieving information with a textual query representation translated from a conceptual
form will result in better retrieval performance than strictly matching with concepts
only. Essential to these two translation steps is the estimation of a query model, both for
terms and for concepts. The textual query should be captured by a small set of specific
concepts and the conceptual query model should be translated to specific textual terms.
To achieve this, we employ an expectation maximization algorithm inspired by
parsimonious language models [136].
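The two translation steps can be illustrated with a minimal Python sketch. It is not the model defined in Section 5.1; it only assumes the general recipe of parsimonious estimation, in which an EM procedure moves probability mass away from events that the collection-wide background model already explains well. The function names, the mixture weight `lam`, the pruning `threshold`, the toy counts, and the term-given-concept table are all hypothetical and chosen for illustration.

```python
def parsimonious_model(counts, background, lam=0.1, iters=50, threshold=1e-4):
    """Estimate a parsimonious distribution over events (terms or concepts).

    counts:     event -> frequency in the feedback documents
    background: event -> probability in the whole collection
    lam, iters, threshold: illustrative settings, not values from the chapter
    """
    total = sum(counts.values())
    model = {e: n / total for e, n in counts.items()}  # maximum-likelihood start
    for _ in range(iters):
        # E-step: expected count of each event attributed to the specific model
        expected = {}
        for e, p in model.items():
            specific = lam * p
            general = (1.0 - lam) * background.get(e, 1e-9)
            expected[e] = counts[e] * specific / (specific + general)
        # M-step: renormalize, prune events with negligible mass, renormalize again
        norm = sum(expected.values())
        model = {e: v / norm for e, v in expected.items() if v / norm >= threshold}
        renorm = sum(model.values())
        model = {e: p / renorm for e, p in model.items()}
    return model


def translate_to_terms(concept_model, term_given_concept):
    """Second step (assumed form): P(t|Q) = sum over c of P(t|c) * P(c|Q)."""
    term_model = {}
    for c, p_c in concept_model.items():
        for t, p_t in term_given_concept.get(c, {}).items():
            term_model[t] = term_model.get(t, 0.0) + p_t * p_c
    return term_model


# Step 1: concept counts from hypothetical feedback documents (toy numbers).
concept_counts = {"HEMOCHROMATOSIS": 5, "IRON": 4, "ANIMALS": 5, "MICE": 3}
concept_background = {"HEMOCHROMATOSIS": 0.001, "IRON": 0.005,
                      "ANIMALS": 0.3, "MICE": 0.1}
concept_model = parsimonious_model(concept_counts, concept_background)

# Step 2: translate the conceptual query model back to terms (toy P(t|c)).
term_given_concept = {
    "HEMOCHROMATOSIS": {"hemochromatosis": 0.6, "iron": 0.3, "hfe": 0.1},
    "IRON": {"iron": 0.7, "absorption": 0.3},
}
term_model = translate_to_terms(concept_model, term_given_concept)
```

In this toy setting, general concepts such as ANIMALS are frequent in the background model and therefore lose mass to the more specific concepts, which is the behavior the paragraph above asks of both the conceptual and the textual query model.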
In this chapter we introduce and investigate our method for using document
annotations for query modeling as formulated in our RQ 2:
RQ 2. What are effective ways of using conceptual information for query
modeling to improve retrieval performance?
To estimate a conceptual query model we propose a method that looks at the top-ranked
documents in an initially retrieved set. In order to assess the effectiveness of this step,
we compare the results of using these concepts with a standard language modeling
approach. Moreover, since this method relies on pseudo relevant documents from an
initial retrieval run, we also compare the results of our conceptual query models to
another, established pseudo relevance feedback algorithm based on relevance models.
We ask:
RQ 2a. What is the relative retrieval effectiveness of this method with respect
to the standard language modeling and conventional pseudo relevance
feedback approaches?
RQ 2b. How portable is our conceptual language model? That is, what are the
results of the model across multiple concept languages and test collections?
RQ 2c. Can we say anything about which evaluation measures are helped most
using our model? Is it mainly a recall or a precision-enhancing device?
| Document text [CSASA-1-EN-9600048] | Concept annotations |
|---|---|
| Immigration and Economic Dependence in the U.S.: Approaches to Presenting Logistic Regression Results. Logistic regression models are found increasingly in the social science literature, but the coefficients can be difficult to interpret for novice users. Strategies are discussed that can enhance the substantive interpretation of logistic regression results. … | UNITED STATES OF AMERICA; IMMIGRANTS; CITIZENS; BENEFITS; SOCIAL SECURITY; REGRESSION ANALYSIS |
Table 5.1: Example of a document (title and part of abstract) from the
Cross-Language Evaluation Forum (CLEF)-DS test collection, annotated with
Sociological Abstracts (SA) concepts.
| Document text [PMID: 10077651] | Concept annotations |
|---|---|
| Mechanism of increased iron absorption in murine model of hereditary hemochromatosis: increased duodenal expression of the iron transporter DMT1. Hereditary hemochromatosis (HH) is a common autosomal recessive disorder characterized by tissue iron deposition secondary to excessive dietary iron absorption. We recently reported that HFE, the protein defective in HH, was physically associated with the transferrin receptor (TfR) in duodenal crypt cells and proposed that mutations in HFE attenuate the uptake of transferrin-bound iron from plasma by duodenal crypt cells, leading to up-regulation of transporters for dietary iron. … | ANIMALS; CARRIER PROTEINS; CATION TRANSPORT PROTEINS; DUODENUM; HEMOCHROMATOSIS; IRON; IRON-BINDING PROTEINS; MICE; MUTATION |
Table 5.2: Example of a document (title and part of abstract) from the
TREC-GEN-04 test collection, annotated with MeSH concepts.
The remainder of this chapter is organized as follows. We introduce conceptual
language models in Section 5.1. We then describe our experimental setup in
Section 5.2 and report on the outcomes of our experimental evaluation and discuss our
findings in Section 5.3. We end with a concluding section.