Chapter 5
Query Modeling Using Concepts

In the previous chapter we looked at how to use explicit and pseudo relevance information to obtain an improved estimate of the query model and, hence, improved retrieval performance. The documents used there were newswire documents and web pages. What if the documents are annotated, e.g., using concepts? Can we utilize the knowledge captured by those annotations to further improve retrieval effectiveness? In this chapter we introduce and evaluate a model that leverages document-level annotations for query modeling.

Explicit (and often manually curated) knowledge is routinely added to documents for a variety of reasons, e.g., to increase their findability or to aid navigation of the collection to which they belong. It is typically expressed in a meta-language and can be either formal (e.g., in the form of a thesaurus or ontology [157]) or more informal (e.g., in the form of user-generated tags [238, 269]). Annotations of the formal kind may be found in a broad range of domains and a variety of document types. News articles, for example, can be annotated with concepts from the NewsCodes taxonomies provided by the International Press Telecommunications Council (IPTC) [319]. Another example is the annotation of bibliographic records with indexing terms from a controlled vocabulary. In the biomedical domain, citations in the MEDLINE database are manually indexed with concepts from the Medical Subject Headings (MeSH) thesaurus. As indicated earlier, we refer to the broad range of formal meta-languages as concept languages and to their vocabulary terms as concepts. Figure 1.2 shows an excerpt from MeSH. Tables 5.1 and 5.2 show two examples of document-concept annotations from the two test collections that we use and that were introduced in Section 3.3.

In order to use concept languages for query modeling, we develop a two-step translation-based method. In the first step, an information need (as expressed in a textual query) is translated into a conceptual representation. In a process we call conceptual query modeling, feedback documents from an initial retrieval run are used to obtain a conceptual query model; this model represents the user’s information need at the level of concepts rather than that of the terms entered by the user. The intuition behind this step is that the conceptual representation is a less ambiguous expression of the information need. In contrast to traditional textual relevance feedback, where query refinement is biased towards terms occurring in the initial query, this intermediate conceptual representation is less dependent on the original query words. On its own, this explicit conceptual representation can be used to aid retrieval, for example by suggesting relevant concepts to the user [165, 209, 285, 323] or by matching it against a conceptual representation of the documents or against the annotations associated with them [254, 318].
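To make the first step concrete, the following is a minimal sketch of how a conceptual query model could be derived from feedback documents. It is an illustration under stated assumptions, not the estimator defined later in this chapter: the function name `conceptual_query_model`, the use of non-negative retrieval scores as document weights, and the uniform spread of a document's weight over its concepts are all hypothetical choices.

```python
from collections import Counter


def conceptual_query_model(ranked_docs, annotations, k=10):
    """Estimate a distribution over concepts from the top-k feedback documents.

    ranked_docs: list of (doc_id, score) pairs sorted by descending score;
                 scores are assumed to be non-negative (an assumption of this sketch).
    annotations: dict mapping doc_id -> list of concept labels.
    """
    top = ranked_docs[:k]
    total_score = sum(score for _, score in top)
    weights = Counter()
    for doc_id, score in top:
        concepts = annotations.get(doc_id, [])
        if not concepts or total_score == 0:
            continue
        # Weight each document by its share of the total retrieval score and
        # spread that weight uniformly over the concepts annotating it.
        doc_weight = score / total_score
        for concept in concepts:
            weights[concept] += doc_weight / len(concepts)
    norm = sum(weights.values())
    # Renormalize so the result sums to one even if some documents had no concepts.
    return {c: w / norm for c, w in weights.items()} if norm else {}


# Toy usage with made-up document ids and concept labels.
ranked = [("d1", 0.6), ("d2", 0.4)]
annotations = {"d1": ["concept_a", "concept_b"], "d2": ["concept_a"]}
print(conceptual_query_model(ranked, annotations, k=2))
```

A concept that annotates several highly ranked documents accumulates more probability mass than one attached to a single document, which is the behavior the intuition above calls for.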

In the second step, we translate the conceptual query model back into a contribution to the textual query model. We hypothesize that, since the textual representation of documents is more detailed than their conceptual representation (a document is typically represented by far more terms than concepts), retrieving information with a textual query representation translated from a conceptual form will result in better retrieval performance than strictly matching with concepts only. Essential to these two translation steps is the estimation of a query model, both for terms and for concepts. The textual query should be captured by a small set of specific concepts and the conceptual query model should be translated to specific textual terms. To achieve this, we employ an expectation maximization algorithm inspired by parsimonious language models [136].
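The second step and the parsimonious estimation can be sketched as follows. This is a hedged illustration rather than the model developed in Section 5.1: `translate_to_terms` assumes that translation amounts to marginalizing over concepts, and `parsimonious_model` follows the commonly used EM formulation of parsimonious language models, in which a mixing weight `lam` balances the model being estimated against a background model. All function names, parameter names, and default values are assumptions made for this example.

```python
def translate_to_terms(concept_model, term_given_concept):
    """Translate P(c|Q) into a term distribution via sum_c P(t|c) * P(c|Q)."""
    term_model = {}
    for concept, p_c in concept_model.items():
        for term, p_t in term_given_concept.get(concept, {}).items():
            term_model[term] = term_model.get(term, 0.0) + p_t * p_c
    return term_model


def parsimonious_model(counts, background, lam=0.1, iterations=50, threshold=1e-4):
    """Estimate a sparse model that keeps only events the background explains poorly.

    counts:     dict mapping term (or concept) -> observed frequency.
    background: dict mapping term (or concept) -> background probability.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    # Start from the maximum likelihood estimate.
    model = {t: n / total for t, n in counts.items()}
    for _ in range(iterations):
        # E-step: expected number of occurrences attributed to the specific model.
        expected = {}
        for t, p in model.items():
            specific = lam * p
            general = (1.0 - lam) * background.get(t, 1e-12)
            expected[t] = counts[t] * specific / (specific + general)
        # M-step: renormalize the expected counts into probabilities.
        norm = sum(expected.values())
        probs = {t: e / norm for t, e in expected.items()}
        # Prune events whose probability fell below the threshold,
        # always keeping at least the most probable one.
        kept = {t: p for t, p in probs.items() if p >= threshold}
        if not kept:
            best = max(probs, key=probs.get)
            kept = {best: probs[best]}
        norm = sum(kept.values())
        model = {t: p / norm for t, p in kept.items()}
    return model


# Toy usage: a frequent but generic word loses mass to the specific terms.
counts = {"the": 10, "iron": 5, "absorption": 5}
background = {"the": 0.5, "iron": 0.001, "absorption": 0.001}
print(parsimonious_model(counts, background))
```

The pruning threshold is what makes the resulting model "small": events that the background model already explains well shrink at every iteration and are eventually dropped, leaving a compact set of specific terms or concepts.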

In this chapter we introduce and investigate our method for using document annotations for query modeling as formulated in our RQ 2:

RQ 2.
What are effective ways of using conceptual information for query modeling to improve retrieval performance?

To estimate a conceptual query model we propose a method that looks at the top-ranked documents in an initially retrieved set. In order to assess the effectiveness of this step, we compare the results of using these concepts with a standard language modeling approach. Moreover, since this method relies on pseudo relevant documents from an initial retrieval run, we also compare the results of our conceptual query models to another, established pseudo relevance feedback algorithm based on relevance models. We ask:

RQ 2a.
What is the relative retrieval effectiveness of this method with respect to the standard language modeling and conventional pseudo relevance feedback approaches?
RQ 2b.
How portable is our conceptual language model? That is, what are the results of the model across multiple concept languages and test collections?
RQ 2c.
Can we say anything about which evaluation measures are helped most using our model? Is it mainly a recall or a precision-enhancing device?

Document text [CSASA-1-EN-9600048] Concept annotations

Immigration and Economic Dependence in the U.S.: Approaches to Presenting Logistic Regression Results. Logistic regression models are found increasingly in the social science literature, but the coefficients can be difficult to interpret for novice users. Strategies are discussed that can enhance the substantive interpretation of logistic regression results.

Table 5.1: Example of a document (title and part of abstract) from the Cross-Language Evaluation Forum (CLEF)-DS test collection, annotated with Sociological Abstracts (SA) concepts.

Document text [PMID: 10077651] Concept annotations

Mechanism of increased iron absorption in murine model of hereditary hemochromatosis: increased duodenal expression of the iron transporter DMT1. Hereditary hemochromatosis (HH) is a common autosomal recessive disorder characterized by tissue iron deposition secondary to excessive dietary iron absorption. We recently reported that HFE, the protein defective in HH, was physically associated with the transferrin receptor (TfR) in duodenal crypt cells and proposed that mutations in HFE attenuate the uptake of transferrin-bound iron from plasma by duodenal crypt cells, leading to up-regulation of transporters for dietary iron.

Table 5.2: Example of a document (title and part of abstract) from the TREC-GEN-04 test collection, annotated with MeSH concepts.

The remainder of this chapter is organized as follows. We introduce conceptual language models in Section 5.1. We then describe our experimental setup in Section 5.2 and report on the outcomes of our experimental evaluation and discuss our findings in Section 5.3. We end with a concluding section.

 5.1 Conceptual Language Models
  5.1.1 Conceptual Query Modeling
  5.1.2 Generative Concept Models
 5.2 Experimental Setup
  5.2.1 Parameter Estimation
  5.2.2 Complexity and Implementation
  5.2.3 Baselines
 5.3 Results and Discussion
  5.3.1 Baselines
  5.3.2 Conceptual Language Models
 5.4 Parameter Sensitivity Analysis
 5.5 Summary and Conclusions