5.1 Conceptual Language Models

Our goal is to use the knowledge captured by the concepts of a concept language to improve the estimation of the query model θ_Q. To this end, we use the concepts as a pivot language in a double translation [169], similar to the method proposed by Berger and Lafferty [31] that was discussed in Section 2.3.1. The approach presented by French et al. [104], who propose a heuristic method for associating terms with concepts, is also related to ours. Our approach, however, uses the concepts associated with a query to find terms related to these concepts, and from those terms estimates the expanded part of the query model, P(t|θ̂_Q) (cf. Eq. 2.10). Figure 5.1 shows a graphical representation of the dependencies in this process.



Figure 5.1: Dependence network for our conceptual language models.


In words, we first translate the query Q into a set of relevant concepts, C = {c_1, …, c_k}. Next, the vocabulary terms associated with these concepts are considered as candidate terms for the query model, and we marginalize out the concepts. More formally, we determine

    P(t \mid \hat{\theta}_Q) = \sum_{c \in C} P(t \mid c) \, P(c \mid Q),    (5.1)

where we assume that, once a concept has been selected for the query, the probability of selecting a term depends only on that concept.
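To make the marginalization in Eq. 5.1 concrete, the following sketch computes the expanded query model from two probability tables. It is a minimal illustration rather than the actual implementation used in this chapter; the names p_t_given_c and p_c_given_q are hypothetical, and both distributions are assumed to be available as nested dictionaries (their estimation is the subject of the remainder of this section).

    def expanded_query_model(p_t_given_c, p_c_given_q):
        """Eq. 5.1: P(t | theta_Q) = sum_c P(t | c) * P(c | Q).

        p_t_given_c: {concept: {term: P(t|c)}}  -- generative concept models
        p_c_given_q: {concept: P(c|Q)}          -- conceptual query model
        """
        p_t = {}
        for c, p_cq in p_c_given_q.items():
            for t, p_tc in p_t_given_c.get(c, {}).items():
                p_t[t] = p_t.get(t, 0.0) + p_tc * p_cq
        return p_t  # a distribution over terms if the inputs are normalized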

Two components need to be estimated: P(t|c), to which we refer as a generative concept model, and P(c|Q), to which we refer as a conceptual query model. As to the former, we need to associate terms with the concepts in the concept language. While the concepts may be directly usable for retrieving documents [128, 302, 318], we associate each concept with a weighted set of its most characteristic terms using a multinomial unigram model. To this end we consider the documents that are annotated with concept c as bridges between the concept and terms, representing each concept as a multinomial distribution over terms, P(t|c). Generative concept models will be detailed further in Section 5.1.2 below.

The second component, the conceptual query model P(c|Q), is a distribution over concepts specific to the query. In some settings, concepts are provided with a query or as part of a query; see, e.g., the PubMed search interface [132], some early Text Retrieval Conference (TREC) ad hoc tracks (6, 7, and 8 in particular), and the Initiative for the Evaluation of XML Retrieval (INEX) Entity Ranking track, where Wikipedia categories are used [87]. If this is not the case, however, we may leverage the document annotations to approximate this step; this is what we do in the next section.

5.1.1 Conceptual Query Modeling

We now turn to defining P(c|Q), the conceptual query model. Contrary to the alternatives mentioned at the end of the previous section, in a typical IR setting concepts are not provided with a query and need to be inferred, estimated, or recognized [339, 358]. In this chapter, we formulate the estimation of concepts relevant to a query in a standard language modeling manner, by determining which concepts are most likely given documents relevant to the query. Alternatively, we could involve the end user and ask which documents, associated concepts, or terms are relevant. Since we do not have access to such assessments, we use pseudo-relevance feedback methods. In recent work using the same framework, different approaches to estimating a conceptual query model have been studied, and it was concluded that using feedback documents is far more effective than, e.g., string matching methods that try to recognize concepts in the query [318]. In the next chapter this finding is confirmed, albeit using a different setting and test collection.

Like Lavrenko and Croft [183], we view the process of obtaining a conceptual query model as a sampling process from a number of representative sources. The user has a notion of documents satisfying her information need, randomly selects one of these, and samples a concept from its representation. Hence, the conceptual query model is defined as follows:

    P(c \mid Q) = \sum_{D \in R} P(c \mid D) \, P(D \mid Q).    (5.2)

Here, R is a set of pseudo-relevant documents returned by an initial retrieval run using the textual query, and P(c|D) is the concept language model of the document, whose estimation is discussed in the next section. We assume that the probability of observing a concept is independent of the query once we have selected a document given the query, i.e., P(c|D,Q) = P(c|D). The term P(D|Q) denotes the probability that document D is chosen given Q, which is obtained from the retrieval scores, viz. Eq. 2.8.
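As an illustration, the sketch below assembles P(c|Q) from the top-ranked documents of an initial retrieval run. It is a minimal sketch under the same assumptions as above: ranked_docs and p_c_given_d are hypothetical names, and the retrieval scores are assumed to be already normalized into P(D|Q).

    def conceptual_query_model(ranked_docs, p_c_given_d, k=10):
        """Eq. 5.2: P(c | Q) = sum_{D in R} P(c | D) * P(D | Q).

        ranked_docs: [(doc_id, P(D|Q)), ...], best document first
        p_c_given_d: {doc_id: {concept: P(c|D)}}
        """
        p_c = {}
        for doc_id, p_dq in ranked_docs[:k]:   # R: top-k feedback documents
            for c, p_cd in p_c_given_d.get(doc_id, {}).items():
                p_c[c] = p_c.get(c, 0.0) + p_cd * p_dq
        return p_c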

We assume that pseudo-relevant documents are a good source from which to sample the conceptual query model. Indeed, manual inspection shows that they are annotated with many relevant concepts, but also that, despite being related to the query, they contain a lot of noise: some concepts occur in many documents and are not very informative. Sampling from the maximum likelihood estimates of these documents would thus result in very general conceptual query models. Therefore, to re-estimate the probability mass of the concepts in the sampling process, we use a parsimonious language model. In the next section we detail how this re-estimation is performed.

Table 5.3 illustrates the difference between a maximum likelihood estimation and a parsimonious estimation. It shows the concepts (in this case MeSH terms) with the highest probability for topic 186 from the TREC Genomics 2006 test collection. The conceptual query model based on the parsimonious document models contains more specific—and thus more useful—concepts, such as PRESENILIN-1 and PRESENILIN-2. The model based on maximum likelihood estimates includes more general concepts such as HUMANS, which are relevant but too general to be useful for searching.


P(c|D) estimated using MLE                 P(c|D) estimated using Eq. 5.10
-----------------------------------------  -----------------------------------------
ALZHEIMER DISEASE                          PRESENILIN-1 *
HUMANS *                                   PRESENILIN-2 *
MEMBRANE PROTEINS                          ALZHEIMER DISEASE
AMYLOID BETA-PROTEIN *                     AMYLOID PRECURSOR PROTEIN SECRETASES *
AMYLOID BETA-PROTEIN PRECURSOR             MEMBRANE PROTEINS
RESEARCH SUPPORT, U.S. GOV'T, P.H.S. *     AMYLOID BETA-PROTEIN PRECURSOR

Table 5.3: A comparison of the concepts with the highest probability P(c|Q) (cf. Eq. 5.2) for the TREC Genomics topic "How do mutations in the Presenilin-1 gene affect Alzheimer's disease." The two columns show the difference between using MLE on the concepts associated with the documents to determine P(c|D) and using the expectation-maximization (EM) algorithm given in Eq. 5.10. Concepts unique to one column are marked with an asterisk (*).

5.1.2 Generative Concept Models

Given Eq. 5.1, our goal is to arrive at a probability distribution P(t|c) over vocabulary terms for each concept in the concept language used for annotating the documents. We determine the strength of the association between a term and a concept by looking at the annotations made by the trained annotators who have labeled the documents. In the end, this method defines the parameters of a generative language model for each concept: a generative concept model.

We determine P(t|c), i.e., the strength of the association between a concept c and a term t, by determining the probability of observing t given c. Concepts that are used to annotate documents may have different characteristics from other parts of a document, such as the title and content: annotations are selected by human indexers from a concept language, while the remaining content consists of free text. Since the terms that make up a document are "generated" by a different process than the concepts, we may assume that t and c are conditionally independent given a document D in (or with) which they occur. So, the probability of observing both t and c is

    P(t, c) = \sum_{D} P(D) \, P(c, t \mid D) = \sum_{D \in D_c} P(D) \, P(t \mid D) \, P(c \mid D),    (5.3)

where D_c denotes the set of documents annotated with concept c. We assume each document to have a uniform prior probability of being selected and obtain

    P(t \mid c) = \frac{P(t, c)}{P(c)}
                = \frac{\sum_{D \in D_c} P(D) \, P(t \mid D) \, P(c \mid D)}{P(c)}
                \propto \frac{1}{P(c)} \sum_{D \in D_c} P(t \mid D) \, P(c \mid D).    (5.4)

Hence, it remains to define three terms: P(c), P(t|D), and P(c|D). First, the term P(c)^{-1} functions as a penalty for frequently occurring, and thus relatively uninformative, concepts. We estimate this term using MLE on the document collection:

    P(c) = \frac{\sum_{D} n(c, D)}{\sum_{c'} \sum_{D} n(c', D)},    (5.5)

where n(c,D) is the number of times document D is labeled with concept c (which is typically 1).
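A compact sketch of Eqs. 5.4 and 5.5 follows. It is an illustration under the assumptions stated above (uniform document prior, annotations available per document); the names annotations, p_t_given_d, and p_c_given_d are hypothetical. Note that for a fixed concept the factor 1/P(c) is constant and cancels when the term distribution is normalized; it matters only when concept models are compared against each other.

    from collections import Counter

    def concept_prior(annotations):
        """Eq. 5.5: MLE of P(c) from the document annotations.

        annotations: {doc_id: [concept, ...]}  -- n(c, D) is typically 1
        """
        counts = Counter(c for concepts in annotations.values() for c in concepts)
        total = sum(counts.values())
        return {c: n / total for c, n in counts.items()}

    def generative_concept_model(c, annotations, p_t_given_d, p_c_given_d):
        """Eq. 5.4: P(t|c) proportional to (1/P(c)) sum_{D in D_c} P(t|D) P(c|D)."""
        scores = {}
        for doc_id, concepts in annotations.items():
            if c not in concepts:               # restrict the sum to D_c
                continue
            p_cd = p_c_given_d[doc_id].get(c, 0.0)
            for t, p_td in p_t_given_d[doc_id].items():
                scores[t] = scores.get(t, 0.0) + p_td * p_cd
        z = sum(scores.values())                # 1/P(c) cancels in this normalization
        return {t: s / z for t, s in scores.items()} if z > 0 else {}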


P(t|D) estimated using MLE       P(t|D) estimated using Eq. 5.9
-------------------------------  -------------------------------
0.061  the                       0.54  indian
0.054  of                        0.46  ethnic
0.045  indian
0.038  ethnic
0.028  in
0.028  american
0.021  a
0.021  renew
0.019  cultur
0.017  ident
Table 5.4: Top 10 stemmed terms for the document model belonging to document CSASA-1-EN-9706464 (entitled “American indian ethnic renewal: red power and the resurgence of identity and culture.”) from the CLEF-DS test collection.

Next we turn to P(x|D), for x ∈ {t, c}. The size of these models (in terms of the number of words or concepts that receive a non-zero probability) may be quite large, e.g., in the case of a large document collection or of frequently occurring concepts. Moreover, as exemplified above, not all of the observed events (where events are either terms or concepts) are equally informative: some may be common, whilst others may describe the general domain of the document. Earlier in the thesis we noted that it is common to consider each document as a mixture of document-specific and more general terms (cf. Eq. 2.5); we now generalize this statement to also include concepts. Given this assumption, we may update each document model by reducing the number and probability mass of non-specific events. We do so by iteratively adjusting the individual probabilities in each document, based on a comparison with a large reference corpus such as the collection. More formally, we maximize the posterior probability of D after observing x:

    P(D \mid x) = \frac{\lambda_x \, P(x \mid D)}{(1 - \lambda_x) \, P(x) + \lambda_x \, P(x \mid D)}.    (5.6)

Note that λ_x may be set differently for terms (Eq. 2.5) and concepts. For these estimations, we fix λ_C = λ_D = 0.15 based on [136, 211, 215]. We then apply the following EM algorithm until the estimates no longer change significantly:

    E-step:    e_x = P(D \mid x) = \frac{\lambda_x \, P(x \mid D)}{(1 - \lambda_x) \, P(x) + \lambda_x \, P(x \mid D)}    (5.7)

    M-step:    P_C(x \mid D) = \frac{n(x, D) \, e_x}{\sum_{x'} n(x', D) \, e_{x'}}.    (5.8)
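The following sketch illustrates one way to implement this EM re-estimation; it is not the implementation used in the thesis. It assumes the raw event counts n(x, D) and the background probabilities P(x) are given as dictionaries (hypothetical names counts and background), and it applies to terms and concepts alike.

    def parsimonious_model(counts, background, lam=0.15, tol=1e-6, max_iter=100):
        """Eqs. 5.7-5.8: parsimonious re-estimation of P_C(x|D).

        counts:     {event: n(x, D)}  -- term or concept counts in document D
        background: {event: P(x)}     -- collection (background) model
        """
        total = sum(counts.values())
        p = {x: n / total for x, n in counts.items()}  # start from the MLE
        for _ in range(max_iter):
            # E-step (Eq. 5.7): posterior that x was drawn from the document model
            e = {x: lam * p[x] / ((1.0 - lam) * background[x] + lam * p[x])
                 for x in counts}
            # M-step (Eq. 5.8): re-estimate from the expected counts
            norm = sum(counts[x] * e[x] for x in counts)
            new_p = {x: counts[x] * e[x] / norm for x in counts}
            if max(abs(new_p[x] - p[x]) for x in counts) < tol:
                return new_p
            p = new_p
        return p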

This updating mechanism enables more specific events, i.e., events that are not well explained by the background model, to receive more probability mass, making the resulting document model more specific. After the EM algorithm has converged, we remove those events with a probability lower than a threshold δ. Thus, the resulting document model for terms, P(t|θ̂_D), to be used as P(t|D) in Eq. 5.4, is given by

    P(t \mid \hat{\theta}_D) = \begin{cases} Z_D^t \, P_C(t \mid D) & \text{if } t \in D \text{ and } P_C(t \mid D) > \delta_t \\ 0 & \text{otherwise,} \end{cases}    (5.9)

where Z_D^t is a document-specific normalization factor, Z_D^t = 1 / \sum_{t} P_C(t \mid D), with the sum ranging over the retained terms. Table 5.4 provides an example of the effect of applying Eq. 5.9 to a document from the CLEF-DS document collection (which will be introduced in Section 5.2). Similarly, the resulting document model for concepts, P(c|θ̂_D), to be used as P(c|D) in Eq. 5.4, is given by

    P(c \mid \hat{\theta}_D) = \begin{cases} Z_D^c \, P_C(c \mid D) & \text{if } c \in D \text{ and } P_C(c \mid D) > \delta_c \\ 0 & \text{otherwise,} \end{cases}    (5.10)

where Z_D^c is a document-specific normalization factor, Z_D^c = 1 / \sum_{c} P_C(c \mid D), again with the sum ranging over the retained concepts. Table 5.3 provides an example of the effect of applying Eq. 5.10 to a topic from the TREC document collection (which will be introduced in Section 5.2). For the experiments in this chapter we fix δ_t = δ_c = 0.01.
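To complete the picture, a small sketch of the thresholding and renormalization step in Eqs. 5.9 and 5.10. The default threshold δ = 0.01 follows the text; the function name and input format are hypothetical.

    def prune_and_normalize(p_parsimonious, delta=0.01):
        """Eqs. 5.9-5.10: drop events with P_C(x|D) <= delta, renormalize the rest."""
        kept = {x: p for x, p in p_parsimonious.items() if p > delta}
        z = sum(kept.values())                  # 1/Z_D in the notation of the text
        return {x: p / z for x, p in kept.items()} if z > 0 else {}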