5.3 Results and Discussion

Now that we have detailed our conceptual language modeling approach (Section 5.1) and laid out the experimental environment (Section 5.2), we present the results of the experiments aimed at answering this chapter’s main research questions. First, we look at the performance of the query likelihood model that we use as our baseline. We emphasize that the other models that we evaluate use the initial ranking from the query likelihood model as a set of pseudo-relevant documents. We then look at the results of applying RM. Next, we evaluate the results of using the conceptual language models as described in Section 5.1, using the conceptual query models in conjunction with the generative concept models.

Further, we perform an ablation study by zooming in on the results after removing each component of the conceptual language models. First, we consider the generative concept models that we use to translate the conceptual query model to free-text terms. We look at the results of using MLE, i.e., without applying the EM algorithm described in Section 5.1.2. Second, since each document in our collections has associated concepts, we use the conceptual query model in conjunction with the initial query for retrieval, as detailed in Section 5.3.2. Finally, we look at the sensitivity of our model with respect to the individual parameter settings and zoom out in order to see whether we can relate collection-specific properties to the reported results.



                    QL          RM          Δ
CLEF-DS-07
  RelRet/TotalRel   2289/4530   2430/4530   +6.2%
  P5                0.5120      0.5440      +6.2%
  P10               0.5080      0.5040      -0.8%
  MAP               0.1952      0.2061      +5.6%

CLEF-DS-08
  RelRet/TotalRel   1468/2133   1473/2133   +0.3%
  P5                0.5280      0.5680      +7.6%
  P10               0.4680      0.4800      +2.6%
  MAP               0.2819      0.2856      +1.3%

TREC-GEN-04
  RelRet/TotalRel   3847/8268   4205/8268   +9.3%*
  P5                0.5160      0.5680      +10.1%
  P10               0.4800      0.5340      +11.2%*
  MAP               0.2856      0.3306      +15.8%*

TREC-GEN-05
  RelRet/TotalRel   2825/4584   3031/4584   +7.3%*
  P5                0.4122      0.4163      +1.0%
  P10               0.3776      0.3857      +2.1%
  MAP               0.2153      0.2368      +10.0%

TREC-GEN-06
  RelRet/TotalRel   1078/1449   1160/1449   +7.6%
  P5                0.4154      0.4308      +3.7%
  P10               0.4154      0.4346      +4.6%
  MAP               0.2731      0.2993      +9.6%*

Table 5.7: Results of the baselines: QL and the best performing run using relevance models (RM), model 2. The right-most column indicates the relative difference between the query likelihood and relevance model scores; an asterisk (*) marks a statistically significant difference.

Figure 5.2: Per-topic breakdown of the improvement of conceptual language models over the QL baseline for all test collections, sorted in decreasing order; panels (a)–(o) show MAP, P5, and P10 for CLEF-DS-07, CLEF-DS-08, TREC-GEN-04, TREC-GEN-05, and TREC-GEN-06, respectively. A positive value indicates an improvement over the baseline. The vertical labels indicate the topic identifiers.



                    QL          GC          Δ
CLEF-DS-07
  RelRet/TotalRel   2289/4530   2596/4530   +13.4%*
  P5                0.5120      0.5520      +7.8%
  P10               0.5080      0.4920      -3.1%
  MAP               0.1952      0.2315      +18.6%*

CLEF-DS-08
  RelRet/TotalRel   1468/2133   1602/2133   +9.1%*
  P5                0.5280      0.4880      -7.6%
  P10               0.4680      0.4840      +3.4%
  MAP               0.2819      0.2991      +6.1%

TREC-GEN-04
  RelRet/TotalRel   3847/8268   4022/8268   +4.5%
  P5                0.5160      0.5560      +7.8%
  P10               0.4800      0.5000      +4.2%
  MAP               0.2856      0.3045      +6.6%*

TREC-GEN-05
  RelRet/TotalRel   2825/4584   3330/4584   +17.9%
  P5                0.4122      0.4245      +3.0%
  P10               0.3776      0.3776      0.0%
  MAP               0.2153      0.2338      +8.6%

TREC-GEN-06
  RelRet/TotalRel   1078/1449   1244/1449   +15.4%
  P5                0.4154      0.4538      +9.2%
  P10               0.4154      0.4077      -1.9%
  MAP               0.2731      0.3182      +16.5%*

Table 5.8: Results of the baseline (QL) and the conceptual language model (GC).

5.3.1 Baselines

Table 5.7 shows the results of the query likelihood model as well as the relevance model—both of which were introduced in Section 2.3—on the five test collections that we consider in this chapter.

Query likelihood

This model (abbreviated QL) uses MLE on the initial query to build a query model, by distributing the probability mass evenly among the terms in the topic, cf. Eq. 2.9. First, we note that the results obtained for the query likelihood model are comparable to or better than the mean results of all the participating groups in the respective TREC Genomics [129–131] and CLEF Domain-specific tracks [244, 245]. As for the TREC Genomics test collections, we do not perform any of the elaborate and knowledge-intensive preprocessing of the queries and/or documents that is common in this domain [316]. Even without applying such explicit domain-specific knowledge, our baseline outperforms many systems that do.
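To make the baseline concrete, the sketch below scores a document by query likelihood, with the query model spreading its mass uniformly over the query terms. The function name, the Jelinek-Mercer smoothing choice, and the parameter value are illustrative assumptions; the runs in this chapter may use different smoothing and settings.

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_terms, lam=0.85):
    """Log query likelihood of a document (illustrative sketch).

    The MLE query model weights every query term equally, so scoring
    reduces to summing smoothed log term probabilities. `lam` weights
    the document model against the collection model (Jelinek-Mercer)."""
    doc_tf, coll_tf = Counter(doc_terms), Counter(collection_terms)
    score = 0.0
    for t in query_terms:
        p_doc = doc_tf[t] / len(doc_terms)
        p_coll = coll_tf[t] / len(collection_terms)
        # smoothed term probability; assumes t occurs in the collection
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score
```

Ranking by this score is equivalent to ranking by the probability of the query under each smoothed document language model.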

Relevance Models

The runs based on relevance models (abbreviated by RM) use the retrieved documents from the query likelihood run to construct an improved query model which is subsequently used for retrieval. The optimal parameter settings for the relevance model, with which we obtain these results are determined in the same fashion as for our conceptual language models, i.e., we sweep over all possible values for λQ (cf. Eq. 2.10) and try varying numbers of documents and terms to find the optimal performance in terms of MAP.
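As a rough sketch of how such an expanded query model can be estimated from feedback documents, the following implements the simpler relevance model 1 (the runs in this chapter use model 2); the function name and the unsmoothed document scoring are illustrative simplifications.

```python
from collections import Counter

def relevance_model(query, feedback_docs, n_terms=10):
    """RM1-style query model from pseudo-relevant documents (sketch):
    P(w|R) ~ sum_D P(w|D) * P(Q|D), renormalized over the top terms."""
    weights = Counter()
    for doc in feedback_docs:
        tf, dlen = Counter(doc), len(doc)
        # query likelihood of the document (unsmoothed, for illustration)
        p_q = 1.0
        for q in query:
            p_q *= tf[q] / dlen
        # each document contributes its term distribution, weighted by p_q
        for w, c in tf.items():
            weights[w] += (c / dlen) * p_q
    top = dict(weights.most_common(n_terms))
    z = sum(top.values()) or 1.0
    return {w: v / z for w, v in top.items()}
```

The resulting distribution is then interpolated with the original query model before retrieval, exactly as tuned via λQ in the sweep described above.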

Table 5.7 shows the results of the baseline QL model and the RM model. We observe that, on the CLEF collections, the RM runs show improvements over the baseline in terms of mean average precision (+6% and +1% for the 2007 and 2008 collections, respectively), recall (+6% and +0.3%), and early precision (precision@5 (P5): +6% and +8%). None of these differences is significant, however. Results on the individual CLEF-DS-07 topics show that three topics substantially increase in average precision (a difference of more than 0.05), whereas only one decreases. The number of CLEF-DS-08 topics that improve in terms of average precision is about the same as the number that are hurt, which explains the modest improvement.

The RM runs on the TREC Genomics collections do show significant differences compared to the QL baseline. For the 2004 query set, mean average precision (+16%), recall (+9%), and early precision (P10: +11%) increase significantly. TREC-GEN-06 also shows a significant improvement in mean average precision (+10%); its recall and precision improve as well, although not significantly. Similar to the CLEF collections, TREC-GEN-05 shows a positive difference on average but, apart from recall, none of the changes are significant. The increase in mean average precision on the TREC 2005 topics can mainly be attributed to a single topic which strongly benefits from using relevance models.

These findings, where some topics are helped while others are hurt, are commonly observed when applying pseudo-relevance feedback methods such as relevance models.

5.3.2 Conceptual Language Models

We now turn to the results of the conceptual language model presented in Section 5.1. Recall that this model consists of three steps. First, each query is mapped onto a conceptual query model, i.e., a distribution over concepts relevant to the query, using Eq. 5.2. The concepts found are then translated back to terms using Eq. 5.4 in conjunction with the EM algorithm from Eq. 5.7. Finally, the resulting expanded query model is interpolated with the original query and used for retrieval.

In the first subsection we discuss the results of applying all the steps in our conceptual language model (GC; Section 5.1). Then, in the following subsections, we will perform an ablation study and discuss the results of not applying the EM algorithm (MLGC; Section 5.3.2) and not translating the found concepts using generative concept models (EC; Section 5.3.2). Example query models for GC and EC can be found in Table 5.6 for the CLEF topic “Shrinking cities.”
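The pipeline above can be sketched as follows. This is a toy approximation under stated assumptions: Eq. 5.2 weights documents by their relevance to the query, whereas the sketch uses plain concept frequencies over the feedback documents, and the dictionary `concept_terms` stands in for the generative concept models of Eq. 5.4; all names are hypothetical.

```python
from collections import Counter

def conceptual_query_model(feedback_docs, n_concepts=5):
    """Step 1 (sketch): estimate P(c|Q) from the concepts annotated on
    the pseudo-relevant documents, here by simple relative frequency.
    `feedback_docs` maps a doc id to its list of concept annotations."""
    counts = Counter(c for cs in feedback_docs.values() for c in cs)
    top = dict(counts.most_common(n_concepts))
    z = sum(top.values())
    return {c: n / z for c, n in top.items()}

def expanded_term_model(concept_model, concept_terms, lam=0.5, orig=None):
    """Steps 2-3 (sketch): translate each concept into its associated
    terms and interpolate the result with the original query model."""
    term_model = Counter()
    for c, p_c in concept_model.items():
        terms = concept_terms.get(c, {})
        z = sum(terms.values()) or 1.0
        for t, w in terms.items():
            term_model[t] += lam * p_c * (w / z)  # P(t|c) * P(c|Q)
    for t, p in (orig or {}).items():
        term_model[t] += (1 - lam) * p  # mix in the original query
    return dict(term_model)
```

Given normalized inputs, the output is again a probability distribution over terms, which can be plugged into a standard language-modeling ranker.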



                    RM          GC          Δ
CLEF-DS-07
  RelRet/TotalRel   2430/4530   2596/4530   +6.8%*
  P5                0.5440      0.5520      +1.5%
  P10               0.5040      0.4920      -2.4%
  MAP               0.2061      0.2315      +12.3%

CLEF-DS-08
  RelRet/TotalRel   1473/2133   1602/2133   +8.8%*
  P5                0.5680      0.4880      -14.1%
  P10               0.4800      0.4840      +0.8%
  MAP               0.2856      0.2991      +4.7%

TREC-GEN-04
  RelRet/TotalRel   4205/8268   4022/8268   -4.4%
  P5                0.5680      0.5560      -2.1%
  P10               0.5340      0.5000      -6.4%*
  MAP               0.3306      0.3045      -7.9%*

TREC-GEN-05
  RelRet/TotalRel   3031/4584   3330/4584   +9.9%
  P5                0.4163      0.4245      +2.0%
  P10               0.3857      0.3776      -2.1%
  MAP               0.2368      0.2338      -1.3%

TREC-GEN-06
  RelRet/TotalRel   1160/1449   1244/1449   +7.2%
  P5                0.4308      0.4538      +5.3%
  P10               0.4346      0.4077      -6.2%
  MAP               0.2993      0.3182      +6.3%*

Table 5.9: Results of the relevance model (RM) versus conceptual language models (GC).


In this section we present the results of using every step of the conceptual language model (abbreviated GC) we detailed in Section 5.1. Table 5.8 lists the results of the conceptual language models. The results for the two CLEF collections show that the GC model can result in a significant improvement in recall over the query likelihood approach: 13% and 9% more relevant documents are returned for CLEF-DS-07 and CLEF-DS-08, respectively. Figure 5.3 shows the precision-recall graphs for our conceptual language model versus the query likelihood baseline and relevance models. The precision-recall curve of the CLEF-DS-07 query set shows improved precision over almost the whole recall range. The CLEF-DS-08 run shows improved precision between recall levels 0.7 and 0.8, making up for the loss of initial precision. Overall, both CLEF test collections show improvements in mean average precision (+18.6% and +6.1%, respectively), but only the results on CLEF-DS-07 are significantly different. We note that the RM approach was unable to achieve a significant difference against the query likelihood baseline on these test collections and measures.

Figure 5.3: Precision-recall plots for all evaluated test collections; panels (a)–(e) show CLEF-DS-07, CLEF-DS-08, TREC-GEN-04, TREC-GEN-05, and TREC-GEN-06, respectively.

The three TREC Genomics test collections show a less consistent behavior. In terms of mean average precision, the TREC-GEN-04 and TREC-GEN-06 collections show significant improvements in favor of the GC model (+6.6% and +16.5%, respectively). The TREC-GEN-05 topics also show substantial improvements between the query likelihood and GC model, although these changes are not significant. Figure 5.2 shows a per-topic analysis of the difference of the GC model with respect to the QL baseline; a positive value in these graphs indicates that the GC model outperformed the QL baseline. For TREC-GEN-05, it shows that half of the topics benefit from applying the GC model while the other half is hurt, which is what causes the difference to be non-significant. Measured over all topics, however, the gains in average precision outweigh the losses.

A further look at the per-topic plots shows that, in terms of MAP, more topics are helped than hurt on all the other test collections as well. The early precision plots show a less clear picture. The ratio of topics whose P5 improves to topics whose P5 worsens is about 1.5, averaged over all test collections, while the number of topics for which P10 increases is about the same as the number for which it decreases.

A more in-depth analysis of the terms that are introduced provides more insight into when and where the GC model improves or hurts retrieval. We observe that when the initial textual query is not specific, the resulting set of feedback documents is unfocused. Hence, fairly general and uninformative words are added to the query model, and it fails to achieve higher retrieval performance. Another reason for poor performance is that particular aspects of the original query are overemphasized in the updated query model, resulting in query drift. For example, the CLEF-DS-08 topic #210, entitled “Establishment of new businesses after the reunification,” results in expansion terms related to the aspect “Establishment of new businesses,” such as “entrepreneur” and “entrepreneurship,” but fails to include words related to the “reunification” aspect. When the updated query model is a balanced expansion of the original query, i.e., when it does include expansion terms for all aspects of the query, the GC model shows improved results.



                    MLGC        GC          Δ
CLEF-DS-07
  RelRet/TotalRel   2596/4530   2596/4530   0.0%
  P5                0.5520      0.5520      0.0%
  P10               0.4760      0.4920      +3.4%
  MAP               0.2311      0.2315      +0.2%

CLEF-DS-08
  RelRet/TotalRel   1566/2133   1602/2133   +2.3%*
  P5                0.5120      0.4880      -4.7%
  P10               0.4960      0.4840      -2.4%
  MAP               0.2853      0.2991      +4.8%

TREC-GEN-04
  RelRet/TotalRel   3973/8268   4022/8268   +1.2%
  P5                0.5360      0.5560      +3.7%
  P10               0.4960      0.5000      +0.8%
  MAP               0.2989      0.3045      +1.9%

TREC-GEN-05
  RelRet/TotalRel   2887/4584   3330/4584   +15.3%
  P5                0.4163      0.4245      +2.0%
  P10               0.3571      0.3776      +5.7%
  MAP               0.2174      0.2338      +7.5%

TREC-GEN-06
  RelRet/TotalRel   1118/1449   1244/1449   +11.3%
  P5                0.4231      0.4538      +7.3%
  P10               0.4192      0.4077      -2.7%
  MAP               0.2863      0.3182      +11.1%

Table 5.10: Results of the conceptual language models in conjunction with the EM algorithm (GC) described in Section 5.1 versus without (MLGC).

Overall, we see that our conceptual language model mainly has a recall-enhancing effect, as indicated by the significant increases in MAP for the CLEF-DS-07 and TREC-GEN-06 test collections and the significant increases in recall on both CLEF topic sets.

Table 5.9 shows a comparison between the GC and the RM model. On the CLEF test collections, the GC model yields significant improvements in terms of recall. On the TREC Genomics collections, the differences in MAP are significant for TREC-GEN-04 and TREC-GEN-06: the GC model performs worse than RM on TREC-GEN-04 but better on TREC-GEN-06, while on TREC-GEN-05 the two models perform comparably. We believe the negative result on TREC-GEN-04 is caused by the fixed setting of δt in Eq. 5.9 in conjunction with the rather small average document length and the large number of documents in this particular document collection.

Unlike the relevance model, the GC model provides a weighted set of concepts in the form of a conceptual query model. Besides the possibility of suggesting these concepts to the user, we hypothesize that applying the remaining steps of our conceptual language models after a user has selected the concepts most relevant to their query would further improve retrieval effectiveness. Since we do not have relevant concepts for our current topics, we consider the verification of this hypothesis a topic for future work.

In the following subsections, we look at the results of not using the EM algorithm in the generative concept models and directly using the conceptual query models for retrieval.



                    EC          GC          Δ
CLEF-DS-07
  RelRet/TotalRel   2448/4530   2596/4530   +6.0%
  P5                0.5040      0.5520      +9.5%
  P10               0.5080      0.4920      -3.1%
  MAP               0.2104      0.2315      +10.0%

CLEF-DS-08
  RelRet/TotalRel   1485/2133   1602/2133   +7.9%*
  P5                0.5120      0.4880      -4.7%
  P10               0.4880      0.4840      -0.8%
  MAP               0.2894      0.2991      +3.4%

TREC-GEN-04
  RelRet/TotalRel   4221/8268   4022/8268   -4.7%
  P5                0.5480      0.5560      +1.5%
  P10               0.5240      0.5000      -4.6%
  MAP               0.3146      0.3045      -3.2%

TREC-GEN-05
  RelRet/TotalRel   2916/4584   3330/4584   +14.2%
  P5                0.4082      0.4245      +4.0%
  P10               0.3776      0.3776      0.0%
  MAP               0.2295      0.2338      +1.9%

TREC-GEN-06
  RelRet/TotalRel   1171/1449   1244/1449   +6.2%
  P5                0.4231      0.4538      +7.3%
  P10               0.4000      0.4077      +1.9%
  MAP               0.2927      0.3182      +8.7%

Table 5.11: Results of the conceptual language models (GC) versus using the found concepts directly (EC).

Maximum Likelihood-based Generative Concept Models

In this subsection, we investigate the added value of using the EM algorithm described in Section 5.1.2, by comparing a maximum likelihood based GC model (named MLGC) to the GC model shown in the previous section. Table 5.10 shows the results of this method. We observe that applying the EM algorithm improves overall retrieval effectiveness compared to the MLGC model, although not significantly, and mainly in terms of recall and MAP. Only the number of relevant retrieved documents for CLEF-DS-08 improves significantly when using the EM algorithm.

The topics that are helped most by the application of the EM algorithm—in terms of an absolute gain in MAP—include TREC-GEN-05 topic #146: “Provide information about Mutations of presenilin-1 gene and its/their biological impact in Alzheimer’s disease” (increased MAP by 0.51) and TREC-GEN-06 topic #160: “What is the role of PrnP in mad cow disease?” (increased MAP by 0.52). A closer look at the intermediate results for these topics reveals two things. For topic #160, the GC model introduces the term “PRP,” which is a synonym for “PrnP.” For topic #146, the GC model introduces three new terms which do not seem directly relevant to the query, but are able to boost MAP substantially.
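One common way to realize such an EM re-estimation is the parsimonious-language-model recipe, in which probability mass is shifted away from terms that a background model already explains well. The sketch below follows that recipe under stated assumptions; the exact update of Eq. 5.7 may differ in detail, and all names are illustrative.

```python
from collections import Counter

def parsimonious_model(doc_terms, background, lam=0.5, iters=20):
    """EM sketch: start from the MLE and iteratively concentrate mass
    on terms that distinguish the document from the background model."""
    tf = Counter(doc_terms)
    p = {t: c / len(doc_terms) for t, c in tf.items()}  # MLE start
    for _ in range(iters):
        # E-step: expected term counts attributable to the topical model
        e = {}
        for t, c in tf.items():
            num = lam * p[t]
            e[t] = c * num / (num + (1 - lam) * background.get(t, 1e-9))
        # M-step: renormalize the expected counts
        z = sum(e.values()) or 1.0
        p = {t: v / z for t, v in e.items()}
    return p
```

Compared to the plain MLE (the MLGC setting), frequent but uninformative terms such as stopwords end up with less mass, which is the intuition behind the gains reported above.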

Explicit Conceptual Query Models

In Section 5.1.1 we introduced a method for acquiring a weighted set of concepts for a query, by translating a textual query to a conceptual representation. In this section, we evaluate the results of using the conceptual query model (abbreviated EC) directly, i.e., using it in combination with the original textual representation to estimate the relevance of a document. Since all the documents in the test collections used in this chapter have two representations (terms and concepts), we can use both disjunctively for retrieval [254]. So, instead of interpolating the query model and using the result for retrieval, we interpolate the scores of each individual component as follows:
\[ \mathrm{Score}(Q,D) = (1-\lambda_Q)\,\mathrm{KL}(\tilde{\theta}_Q \,\|\, \theta_D) + \lambda_Q\,\mathrm{KL}(\theta_C \,\|\, \theta_D). \tag{5.11} \]

Here, the first term is the regular query-likelihood score. The second term is the score obtained from matching the conceptual query model with the conceptual representation of each document:
\[ \mathrm{KL}(\theta_C \,\|\, \theta_D) = \sum_{c} P(c|\theta_C) \log \frac{P(c|\theta_C)}{P(c|\theta_D)} \;\propto\; -\sum_{c} P(c|\theta_C) \log P(c|\theta_D), \tag{5.12} \]

where P(c|θC) = P(c|Q) (cf. Eq. 5.2). In effect, this drops the dependence between t and c (see Figure 5.1) and considers the concepts as regular indexing terms.

Thus, the EC model uses an explicit conceptual representation in combination with the textual representation for searching documents and, similar to the approaches described in the previous subsections, the EC approach uses the same feedback documents for improving the query. However, instead of sampling terms from these documents, we now use their associated concepts.
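A minimal sketch of this disjunctive scoring is given below, using the rank-equivalent cross-entropy form of Eq. 5.12 (the query-entropy term is constant per query and can be dropped for ranking); all function names and toy distributions are hypothetical.

```python
import math

def ce_score(query_model, doc_model, eps=1e-10):
    """Cross-entropy score sum_x P(x|Q) log P(x|D): ranking documents by
    this value is equivalent to ranking by negative KL divergence, since
    the query-entropy term is constant per query. Higher is better."""
    return sum(p * math.log(doc_model.get(x, eps))
               for x, p in query_model.items())

def ec_score(term_query, concept_query, doc_terms, doc_concepts, lam=0.5):
    """Interpolated textual + conceptual match in the spirit of Eq. 5.11:
    each representation is scored separately and the scores are mixed."""
    return ((1 - lam) * ce_score(term_query, doc_terms)
            + lam * ce_score(concept_query, doc_concepts))
```

With λQ = 0 this reduces to the plain textual score, so sweeping λQ (as in Figure 5.4) directly controls the weight of the conceptual evidence.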

When we compare the results to those of the GC model, as depicted in Table 5.11, we find marginal differences. Only recall on the CLEF-DS-08 topic set is significantly different from the run based on conceptual language models. In comparison to the query likelihood baseline (cf. Table 5.7 and Table 5.11), the EC model shows improvements similar to those of the relevance models. The runs on the CLEF collections show small, statistically insignificant improvements in mean average precision, recall, and initial precision. When applied to the TREC Genomics collections, the EC model shows significant improvements for the 2004 and 2006 collections with respect to the QL baseline.

Before turning to the answers to our research questions based on the results in this section, we present a brief analysis of the parameter sensitivity of our conceptual language model.

Figure 5.4: Results of varying λQ on retrieval effectiveness, in terms of (a) MAP and (b) P5, on all test collections evaluated in this chapter.