5.2 Experimental Setup

To answer the research questions specified in the introduction to this chapter, we set up a number of experiments in which we compare our conceptual language models with other retrieval approaches. Below, we describe our parameter estimation method, the complexity and implementation of our models, and the baseline approaches that we use for comparison. In Section 5.3, we turn to the results of our experiments. The test collections we employ in this chapter have been introduced in Section 3.3.

5.2.1 Parameter Estimation

Given the models introduced in the previous sections, we have a number of parameters that need to be set (cf. Section 3.4); Table 5.5 summarizes them.


Parameter                      Description
λQ    (Eq. 2.10)               Interpolation between the initial query and the expanded query part
|R|   (Eq. 2.23 and Eq. 5.2)   The size of the set of pseudo-relevant documents
|VQ|  (Eq. 2.23 and Eq. 5.4)   The number of terms to use, either for the expanded query part or for each concept
|C|   (Eq. 5.1)                The number of concepts to use for the conceptual query representation

Table 5.5: Free parameters in the models described in the previous sections.

There are various approaches that may be used to estimate these parameters. We choose to optimize the parameter values by determining the mean average precision (MAP) for each combination of parameter settings, i.e., a grid search [223, 262], and report the results of the best performing settings. We sweep λQ over the interval [0, 1] in increments of 0.1; the remaining parameters are investigated in the range [1, 10] in increments of 1. We determine the MAP scores on the same topics that we present results for, similar to [173, 189, 224, 235, 356]. While this procedure is computationally expensive (the size of the grid grows exponentially with the number of parameters), it provides us with an upper bound on the retrieval performance attainable using the described models.
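For concreteness, the following sketch (in Python) shows the exhaustive sweep over these four parameters. The evaluate_map callable is a hypothetical placeholder for a function that runs retrieval with the given settings and computes MAP over the topic set; it is not part of our implementation.

    from itertools import product

    def grid_search(evaluate_map):
        """Exhaustively evaluate all parameter combinations and return the best.

        evaluate_map is a caller-supplied function that runs retrieval with the
        given settings and returns the mean average precision over the topics.
        """
        lambda_q_grid = [i / 10 for i in range(11)]   # lambda_Q: 0.0, 0.1, ..., 1.0
        int_grid = list(range(1, 11))                 # |R|, |V_Q|, |C|: 1, 2, ..., 10
        best_map, best_params = -1.0, None
        for lam, r, vq, c in product(lambda_q_grid, int_grid, int_grid, int_grid):
            score = evaluate_map(lam, r, vq, c)
            if score > best_map:
                best_map, best_params = score, (lam, r, vq, c)
        return best_params, best_map

The grid contains 11 x 10 x 10 x 10 = 11,000 settings, which is what makes the sweep expensive even though each individual setting is cheap to evaluate.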

5.2.2 Complexity and Implementation

With respect to the complexity of our methods, we need to calculate two terms in addition to the standard language modeling estimations [173]: the generative concept models (offline) and the conceptual query model (online). The former is the most time-consuming, with a maximum complexity per concept proportional to the number of terms in the vocabulary, the number of documents annotated with the concept, and the number of EM iterations. The advantage of this step, however, is that it can be performed offline. Determining a conceptual query model is, in terms of efficiency, comparable to standard pseudo relevance feedback approaches, except for the additional EM iterations.
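To make the offline step concrete, the sketch below shows a generic EM loop in the style of parsimonious language models, in which concept-specific term probabilities are re-estimated against a fixed background model. It is meant only to illustrate the stated complexity (vocabulary size, number of annotated documents, and number of EM iterations); the exact update rules of our generative concept models may differ, and all names in the sketch are illustrative.

    from collections import Counter

    def concept_term_model(annotated_docs, background, lam=0.5, iterations=30):
        """Estimate p(t | concept) from the documents annotated with the concept.

        annotated_docs: list of tokenized documents (lists of terms);
        background: dict mapping a term to p(t | collection).
        Generic parsimonious-EM sketch, for illustration only.
        """
        # Aggregating term counts is linear in the length of the |D_c| annotated
        # documents; each EM iteration then touches every vocabulary term once.
        counts = Counter(t for doc in annotated_docs for t in doc)
        total = sum(counts.values())
        p_concept = {t: n / total for t, n in counts.items()}  # initial ML estimate

        for _ in range(iterations):
            # E-step: expected share of each term count explained by the concept
            # model rather than by the fixed background model.
            expected = {}
            for t, n in counts.items():
                num = lam * p_concept[t]
                expected[t] = n * num / (num + (1 - lam) * background.get(t, 1e-9))
            # M-step: renormalize the expected counts into a distribution.
            norm = sum(expected.values())
            p_concept = {t: e / norm for t, e in expected.items()}
        return p_concept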


QL              RM              EC                        GC
0.500 citi      0.272 citi      0.250 urban sociology     0.216 citi
0.500 shrink    0.250 shrink    0.250 urban planning      0.200 shrink
                0.024 of        0.250 town planning       0.164 urban
                0.024 develop   0.250 town development    0.090 town
                0.015 popul                               0.089 develop
                0.014 town                                0.083 plan
                0.010 economi                             0.047 hous
                0.009 sociolog                            0.040 sociolog

Table 5.6: Concepts or stemmed terms with the highest probability in the query models for the CLEF Domain-specific topic “Shrinking cities” generated by the query likelihood baseline (QL; Eq. 2.9), relevance model (RM; Eq. 2.23), conceptual query model (EC; Eq. 5.2), and the conceptual language models (GC; Eq. 5.1).

5.2.3 Baselines

We use two baseline retrieval approaches for comparison. Table 5.6 shows an example of the query models generated for the CLEF-DS-08 query “Shrinking cities,” both by these baselines and by our conceptual models. As our first baseline, we employ a run based on the KL divergence retrieval method and set λQ = 1. This uses only the information from the initial, textual query and amounts to performing retrieval using query likelihood, as detailed in Chapter 2. All results reported in this chapter use this baseline run as their initially retrieved document set.
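Assuming that Eq. 2.10 is the usual linear interpolation between the initial and expanded query models, the following sketch illustrates why setting λQ = 1 reduces KL-divergence retrieval to query likelihood. The Dirichlet-smoothed document model is shown only as one common, illustrative choice; it is not necessarily the smoothing used in our experiments.

    import math

    def interpolated_query_model(p_initial, p_expanded, lambda_q):
        """Assumed form of Eq. 2.10: linear interpolation of the initial and
        expanded query models; with lambda_q = 1 only the initial query remains."""
        terms = set(p_initial) | set(p_expanded)
        return {t: lambda_q * p_initial.get(t, 0.0)
                   + (1 - lambda_q) * p_expanded.get(t, 0.0)
                for t in terms}

    def kl_score(query_model, doc_tf, doc_len, background, mu=1000.0):
        """Rank-equivalent KL-divergence score: sum_t p(t|theta_Q) log p(t|theta_D).
        Dirichlet smoothing of the document model is an illustrative choice only."""
        score = 0.0
        for t, p_q in query_model.items():
            p_d = (doc_tf.get(t, 0) + mu * background.get(t, 1e-9)) / (doc_len + mu)
            score += p_q * math.log(p_d)
        return score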

Since our conceptual language models also rely on pseudo relevance feedback (PRF), we use the text-based PRF method introduced in Chapter 2 (RM-2, cf. Eq. 2.23) as our second baseline. The functional form of our conceptual query model is reminiscent of RM-1 (cf. Eq. 2.24), and we therefore also evaluated RM-1 as a text-based PRF baseline. We found its performance to be inferior to that of RM-2 on all test collections, a finding in line with the results obtained by Lavrenko and Croft [183] and by other researchers [23, 197], as well as with our own results on the test collections evaluated in Chapter 4. Consequently, we use RM-2 in our experiments (labeled “RM” in the remainder of this chapter) and refrain from reporting the results of RM-1.
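For reference, the sketch below implements the standard RM-2 estimator of Lavrenko and Croft with uniform document priors. Eq. 2.23 is not reproduced here, so the exact form used in this chapter may differ in its priors and smoothing; the document models passed in are assumed to be already smoothed.

    def rm2_query_model(query_terms, doc_models, num_terms=10):
        """RM-2-style expansion: p(w|R) is proportional to
        p(w) * prod_{q in Q} sum_{D in R} p(q|D) p(D|w), with uniform priors.

        doc_models: list of dicts mapping a term to a smoothed p(term | D)
        for the |R| pseudo-relevant feedback documents.
        """
        vocab = {t for d in doc_models for t in d}
        scores = {}
        for w in vocab:
            denom = sum(d.get(w, 0.0) for d in doc_models)
            if denom == 0.0:
                continue
            # p(D|w) = p(w|D) p(D) / p(w), with p(D) uniform over the feedback set.
            p_d_given_w = [d.get(w, 0.0) / denom for d in doc_models]
            score = denom / len(doc_models)        # p(w) under the same priors
            for q in query_terms:
                score *= sum(d.get(q, 1e-9) * p_dw
                             for d, p_dw in zip(doc_models, p_d_given_w))
            scores[w] = score
        # Keep the |V_Q| highest-scoring terms and renormalize them.
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:num_terms]
        norm = sum(v for _, v in top)
        return {w: v / norm for w, v in top}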