5.2 Experimental Setup

To answer the research questions specified in the introduction to this chapter, we set up a number of experiments in which we compare our conceptual language models with other retrieval approaches. Below, we describe the baseline approaches that we use for comparison, our experimental environment, and estimation methods. In Section 5.3, we turn to the results of our experiments. The test collections we employ in this chapter have been introduced in Section 3.3.

5.2.1 Parameter Estimation

The models introduced in the previous sections have a number of free parameters that need to be set (cf. Section 3.4); Table 5.5 summarizes them.

Parameter	Equation(s)	Description
$λ_{Q}$	Eq. 2.10	Interpolation between initial query and expanded query part
$|R|$	Eq. 2.23 and Eq. 5.2	The size of the set of pseudo relevant documents
$|V_{Q}|$	Eq. 2.23 and Eq. 5.4	The number of terms to use, either for the expanded query part or for each concept
$|C|$	Eq. 5.1	The number of concepts to use for the conceptual query representation

Table 5.5: Free parameters in the models described in the previous sections.

There are various approaches that may be used to estimate these parameters. We choose to optimize the parameter values with a grid search [223, 262]: we determine the mean average precision (MAP) for each combination of parameter values and report the results of the best performing setting. For $λ_{Q}$ we sweep the interval [0,1] in increments of 0.1. The other parameters are investigated in the range [1,10] in increments of 1. We determine the MAP scores on the same topics that we present results for, similar to [173, 189, 224, 235, 356]. While this procedure is computationally expensive (exponential in the number of parameters), it provides us with an upper bound on the retrieval performance attainable with the described models.
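The sweep described above can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments: the function names `grid_search` and `toy_map` are hypothetical, `toy_map` is a made-up stand-in for an actual retrieval run that returns a MAP score, and sweeping all four parameters jointly is an assumption (a given model may only use a subset of them).

```python
from itertools import product


def grid_search(evaluate_map):
    """Exhaustively try every parameter combination and keep the best MAP.

    `evaluate_map` is a hypothetical callable that runs retrieval with the
    given setting and returns the mean average precision over the topics.
    """
    lambdas = [i / 10 for i in range(11)]  # lambda_Q: 0.0, 0.1, ..., 1.0
    counts = range(1, 11)                  # |R|, |V_Q|, |C|: 1, 2, ..., 10

    best_score, best_params = float("-inf"), None
    for lam, n_docs, n_terms, n_concepts in product(lambdas, counts, counts, counts):
        score = evaluate_map(lam, n_docs, n_terms, n_concepts)
        if score > best_score:
            best_score = score
            best_params = (lam, n_docs, n_terms, n_concepts)
    return best_score, best_params


def toy_map(lam, n_docs, n_terms, n_concepts):
    # Made-up objective for illustration only; it peaks at (0.4, 5, 8, 3).
    penalty = abs(n_docs - 5) + abs(n_terms - 8) + abs(n_concepts - 3)
    return 1.0 - abs(lam - 0.4) - 0.01 * penalty


best_score, best_params = grid_search(toy_map)

# Number of retrieval runs needed: 11 values for lambda_Q times 10 values
# for each of the three remaining parameters.
grid_size = 11 * 10 ** 3
```

Under these assumptions a full sweep requires 11,000 retrieval runs per test collection, which illustrates why the cost grows exponentially with the number of parameters.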

5.2.2 Complexity and Implementation

As to the complexity of our methods, we need to calculate two components in addition to the standard language modeling estimations [173]: the generative concept models (offline) and the conceptual query model (online). The former is the more time-consuming of the two, with a maximum complexity per concept proportional to the number of terms in the vocabulary, the number of documents annotated with the concept, and the number of EM iterations. The advantage of this step, however, is that it can be performed offline. In terms of efficiency, determining a conceptual query model is comparable to standard pseudo relevance feedback approaches, except for the additional cost of the EM iterations.

QL		RM		EC		GC
0.500	citi	0.272	citi	0.250	urban sociology	0.216	citi
0.500	shrink	0.250	shrink	0.250	urban planning	0.200	shrink
		0.024	of	0.250	town planning	0.164	urban
		0.024	develop	0.250	town development	0.090	town
		0.015	popul			0.089	develop
		0.014	town			0.083	plan
		0.010	economi			0.047	hous
		0.009	sociolog			0.040	sociolog

Table 5.6: Concepts or stemmed terms with the highest probability in the query models for the CLEF Domain-specific topic “Shrinking cities” generated by the query likelihood baseline (QL; Eq. 2.9), relevance model (RM; Eq. 2.23), conceptual query model (EC; Eq. 5.2), and the conceptual language models (GC; Eq. 5.1).

5.2.3 Baselines

We use two baseline retrieval approaches for comparison purposes. Table 5.6 shows an example of the query models generated for the CLEF-DS-08 query “Shrinking cities,” including those of these baseline approaches. As our first baseline, we employ a run based on the KL divergence retrieval method with $λ_{Q} = 1$. This setting uses only the information from the initial, textual query and amounts to performing retrieval using query likelihood, as detailed in Chapter 2. All results reported in this chapter use this baseline run as their initially retrieved document set.
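The role of $λ_{Q}$ in this baseline can be illustrated with a small sketch. It assumes that Eq. 2.10 has the usual linear form, i.e., $λ_{Q}$ times the initial query model plus $(1 - λ_{Q})$ times the expanded query part; the function name `interpolate` is hypothetical. The initial query model is taken from the QL column of Table 5.6, whereas the expanded part is made up for illustration.

```python
def interpolate(initial, expanded, lam):
    """Mix two term distributions: lam * initial + (1 - lam) * expanded.

    Assumed form of the interpolation; terms missing from a model get
    probability 0.0 in that model.
    """
    terms = set(initial) | set(expanded)
    return {
        t: lam * initial.get(t, 0.0) + (1.0 - lam) * expanded.get(t, 0.0)
        for t in terms
    }


# Initial query model for "Shrinking cities" (QL column of Table 5.6).
initial = {"citi": 0.5, "shrink": 0.5}

# Hypothetical expanded query part; these values are invented.
expanded = {"urban": 0.6, "town": 0.4}

# lambda_Q = 1: the expanded part receives zero weight, so only the
# initial query terms keep probability mass (the query likelihood case).
baseline = interpolate(initial, expanded, lam=1.0)

# lambda_Q = 0.5: both parts contribute equally.
mixed = interpolate(initial, expanded, lam=0.5)
```

With `lam=1.0` every expansion term ends up with probability zero, which is why this setting reduces to retrieval with the original textual query alone.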

Since our conceptual language models also rely on pseudo relevance feedback (PRF), we use the text-based PRF method introduced in Chapter 2 (RM-2, cf. Eq. 2.23) as another baseline. The functional form of our conceptual query model is reminiscent of RM-1 (cf. Eq. 2.24), and we also evaluated RM-1 as a text-based pseudo relevance feedback baseline. We found that its performance was inferior to RM-2 on all test collections, a finding in line with results obtained by Lavrenko and Croft [183] and other researchers [23, 197], as well as with our own results on all the test collections we evaluated in Chapter 4. Consequently, we use RM-2 in our experiments (labeled as “RM” in the remainder of this chapter) and refrain from mentioning the results of RM-1.
