4.3 Pseudo Relevance Feedback

In this section we look at the performance of the relevance feedback models under pseudo relevance feedback, which uses the top-ranked documents (denoted $\hat{R}$) as the set of feedback documents. To obtain these documents, we perform a query likelihood (QL) run (cf. Eq. 2.8), which also serves as our baseline.

As to the parameter settings, we initially consider only a limited number of terms for practical reasons; we use the 10 terms with the highest probability, a number that has been shown to be suitable on a number of test collections [196, 242]. We then perform a grid search over $| \hat{R} |$ and the value of the query interpolation parameter, $λ_{Q}$. Note that we exclude $λ_{Q} = 1.0$ and $| \hat{R} | = 0$ from our grid search; since both settings reduce to the baseline, excluding them makes it possible for the “optimal” performance to be worse than the baseline. After we have obtained the optimal values for these parameters, we fix them and vary the number of terms with the highest probability included in the query model, $| V_{Q} |$. This approach to optimizing parameter values is a combination of a line search and a grid search over the parameter space [108, 223, 262], as introduced in Section 3.4. While computationally expensive, it provides us with an upper bound on the attainable retrieval effectiveness for each model. Note that, because we initially fix the number of terms, we may not find the absolute maximum in terms of performance (there might be cases where a different combination of $λ_{Q}$, $| \hat{R} |$, and the number of terms obtains better results).
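The tuning procedure described above can be sketched as follows. This is a minimal illustration rather than the code used for the experiments: the function `tune`, the callback `evaluate` (assumed to return the target measure, e.g., MAP, for one setting of $λ_{Q}$, $| \hat{R} |$, and $| V_{Q} |$), and the toy objective are all hypothetical names introduced here.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def tune(
    evaluate: Callable[[float, int, int], float],
    lambdas: Iterable[float],
    num_docs: Iterable[int],
    num_terms: Iterable[int],
    initial_terms: int = 10,
) -> Tuple[float, int, int]:
    """Grid search over (lambda_Q, |R|), then a line search over |V_Q|."""
    # Grid search with |V_Q| fixed; settings that reduce to the baseline
    # (lambda_Q = 1.0 or |R| = 0) are excluded.
    grid = [(lam, n) for lam, n in product(lambdas, num_docs) if lam < 1.0 and n > 0]
    best_lam, best_n = max(grid, key=lambda p: evaluate(p[0], p[1], initial_terms))
    # Line search over |V_Q| with the other two parameters fixed.
    best_k = max(num_terms, key=lambda k: evaluate(best_lam, best_n, k))
    return best_lam, best_n, best_k


def toy_score(lam: float, n_docs: int, n_terms: int) -> float:
    # Hypothetical stand-in for a retrieval run; it peaks at (0.4, 10, 20).
    return -((lam - 0.4) ** 2) - ((n_docs - 10) / 100) ** 2 - ((n_terms - 20) / 100) ** 2


lambdas = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
result = tune(toy_score, lambdas, range(0, 21), [5, 10, 20, 50])
print(result)
```

Because the line search over $| V_{Q} |$ only starts from the best grid point found with ten terms, the sketch shares the limitation noted above: a better joint setting may exist elsewhere in the parameter space.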

We continue this section by discussing the experimental results with a fixed number of terms (Section 4.3.1), followed by a per-topic analysis in Section 4.3.2 and a discussion of the influence of varying $| V_{Q} |$ in Section 4.3.3.

4.3.1 Results and Discussion

Before we report on the experimental results on the three test collections, we note that, for all test collections, the performance of the baseline run is on par with results reported in the literature. In particular, for the TREC Robust 2004 track, our baseline run would have been placed at around the tenth position of all participating runs. For TREC Web 2009, the mean performance in terms of statMAP of all participating runs lies around 0.15. For the TREC Relevance Feedback 2008 test collection (using pseudo relevance feedback), this number is not available since we use an aggregation of multiple topic sets, with topics from the TREC Million Query 2007 and the TREC Terabyte 2004–2006 tracks. Furthermore, for this test collection, we use the relevant documents provided to us by the TREC Relevance Feedback 2008 track (which are a combination of relevant documents from (i) the TREC Million Query 2007 track, (ii) the TREC Terabyte 2004–2006 tracks, and (iii) the newly assessed, relevant documents created during the TREC Relevance Feedback 2008 track). We do note, however, that the mean average precision (MAP) score of all systems participating in the TREC Terabyte 2004–2006 tracks is roughly 0.30.

	P5		P10		MAP		RelRet		$λ_{Q}$	$| \hat{R} |$
QL	0.442		0.406		0.221		9099		1.0	–
MLE	0.462	+4.5%	0.412	+1.5%	0.257*	+16.3%	10287*	+13.1%	0.4	10
MBF	0.466	+5.4%	0.422	+3.9%	0.263*	+19.0%	10508*	+15.5%	0.4	9
RM-0	0.459	+3.8%	0.407	+0.2%	0.261*	+18.1%	10390*	+14.2%	0.3	10
RM-1	0.457	+3.4%	0.417	+2.7%	0.253*	+14.5%	9901*	+8.8%	0.5	19
RM-2	0.471*	+6.6%	0.422	+3.9%	0.249*	+12.7%	9844*	+8.2%	0.4	7
PRM	0.446	+0.9%	0.415	+2.2%	0.264*	+19.5%	10543*	+15.9%	0.4	12
MLgen	0.468	+5.9%	0.417	+2.7%	0.264*	+19.5%	10564*	+16.1%	0.3	13
NLLR	0.448	+1.4%	0.410	+1.0%	0.224*	+1.4%	9087	-0.1%	0.8	9

Table 4.2: Best results (optimized for MAP) of the models contrasted in this chapter on the TREC-ROB-04 test collection using $| V_{Q} | = 10$.

	P5		P10		MAP		RelRet		$λ_{Q}$	$| \hat{R} |$
QL	0.442		0.406		0.221		9099		1.0	–
MLE	0.464*	+5.0%	0.428*	+5.4%	0.245*	+10.9%	9824*	+8.0%	0.7	3
MBF	0.459	+3.8%	0.429*	+5.7%	0.248*	+12.2%	9897*	+8.8%	0.7	2
RM-0	0.468*	+5.9%	0.427*	+5.2%	0.246*	+11.3%	9823*	+8.0%	0.7	6
RM-1	0.465*	+5.2%	0.426	+4.9%	0.248*	+12.2%	9820*	+7.9%	0.6	152
RM-2	0.471*	+6.6%	0.428*	+5.4%	0.242*	+9.5%	9567*	+5.1%	0.7	7
PRM	0.465*	+5.2%	0.423	+4.2%	0.247*	+11.8%	9873*	+8.5%	0.7	2
MLgen	0.471*	+6.6%	0.430*	+5.9%	0.255*	+15.4%	10109*	+11.1%	0.6	6
NLLR	0.443	+0.2%	0.412	+1.5%	0.223	+0.9%	9083	-0.2%	0.9	3

Table 4.3: Best results (optimized for P10) of the models contrasted in this chapter on the TREC-ROB-04 test collection using $| V_{Q} | = 10$.

Figure 4.2: Influence of the size of $\hat{R}$ on MAP, using pseudo relevant documents on the TREC-ROB-04 collection with $λ_{Q} = 0.4$ and $| V_{Q} | = 10$.

TREC Robust 2004

The results for this test collection are listed in Table 4.2. We observe that, when compared to the baseline, all models except NLLR significantly improve recall. Moreover, these models also significantly improve MAP. This finding is common for relevance feedback algorithms, which typically improve recall at the cost of precision [202, 272]. MLgen obtains the highest recall of all models. In Table 4.2, the parameter settings were chosen such that maximum MAP was obtained for each model. Because of this, we do not observe any significant improvements in early precision, except for RM-2. When we look at the best performing parameter settings when optimizing for P10 (cf. Table 4.3), we obtain different optimal values. In this case we obtain significant improvements on P10 for all models except NLLR, PRM, and RM-1.
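As a reading aid for the result tables, the percentage next to each score is consistent with the relative change with respect to the QL baseline. The small sketch below recomputes two MLE entries of Table 4.2 under that assumption; the helper name `relative_change` is ours and not part of any evaluation toolkit.

```python
def relative_change(score: float, baseline: float) -> float:
    """Relative change of a score with respect to the baseline, in percent."""
    return 100.0 * (score - baseline) / baseline


# MLE versus QL on TREC-ROB-04 (values taken from Table 4.2).
map_change = relative_change(0.257, 0.221)
relret_change = relative_change(10287, 9099)
print(f"MAP: {map_change:+.1f}%, RelRet: {relret_change:+.1f}%")
```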

When optimizing for MAP, the optimal setting of $λ_{Q}$ lies within the range $0.3$–$0.5$ for all models except NLLR (which has similar results for $λ_{Q} = 0.4$). When optimizing for P10, $λ_{Q}$ lies within the range $0.6$–$0.7$, again with the exception of NLLR (cf. Table 4.3). The optimal number of feedback documents also differs between optimizing for MAP and optimizing for P10.

We now zoom in on the relative performance of each model. Figure 4.2 shows the performance of all models on TREC-ROB-04 when an increasing number of pseudo relevant documents is used to estimate the query model. From this figure, we observe that all models reach their peak when $5 \leq | \hat{R} | \leq 20$. Furthermore, all models except NLLR and RM-1 respond similarly to each newly added document (although there are differences in absolute scores). As seen before, NLLR is the worst performing model and is unable to improve upon the QL baseline for any number of feedback documents. Interestingly, RM-1 behaves quite differently from the other models. It shows the most stable behavior by reaching its peak after about 20 documents and declines only slightly after that. Although it does not obtain the highest scores, it is robust with respect to the number of feedback documents used. We also note from this figure that, in order to identify the best performing relevance feedback model, the number of feedback documents matters. If one were to use a fixed number of documents to compare the various models (as is typically done in earlier work [183, 354]), the choice of this particular parameter setting would determine the ranking of the models in terms of their performance.

The overall results for the TREC Robust 2004 test collection are partly consistent with most related work on pseudo relevance feedback: in general, pseudo relevance feedback helps in terms of recall-oriented measures at the cost of precision. In our case, however, early precision improves as well, although mostly not significantly under the MAP-optimized settings of Table 4.2. When carefully tuned, it is also possible to obtain significant improvements on early precision, as seen in Table 4.3. In that case, however, the improvements on recall-oriented measures are less substantial (although in most cases still significant). Furthermore, most models react similarly to an increasing number of feedback documents on this test collection.

	P5		P10		MAP		RelRet		$λ_{Q}$	$| \hat{R} |$
QL	0.405		0.357		0.289		8921		1.0	–
MLE	0.399	-1.5%	0.358	+0.3%	0.295	+2.1%	9044*	+1.4%	0.9	1
MBF	0.400	-1.2%	0.362	+1.4%	0.297	+2.8%	8951*	+0.3%	0.9	1
RM-0	0.399	-1.5%	0.356	-0.3%	0.295	+2.1%	9122*	+2.3%	0.8	3
RM-1	0.410	+1.2%	0.350	-2.0%	0.300	+3.8%	9182*	+2.9%	0.8	13
RM-2	0.398	-1.7%	0.358	+0.3%	0.296	+2.4%	9053*	+1.5%	0.9	1
PRM	0.410	+1.2%	0.366	+2.5%	0.301*	+4.2%	8596*	-3.6%	0.9	29
MLgen	0.404	-0.2%	0.358	+0.3%	0.299	+3.5%	9133*	+2.4%	0.8	3
NLLR	0.406	+0.2%	0.355	-0.6%	0.292*	+1.0%	10156*	+13.8%	0.9	2

Table 4.4: Best results (optimized for MAP) of the models contrasted in this chapter on the TREC-PRF-08 test collection using $| V_{Q} | = 10$.

	P5		P10		MAP		RelRet		$λ_{Q}$	$| \hat{R} |$
QL	0.405		0.357		0.289		8921		1.0	–
MLE	0.399	-1.5%	0.358	+0.3%	0.295	+2.1%	9044*	+1.4%	0.9	1
MBF	0.403	-0.5%	0.362	+1.4%	0.290	+0.3%	9093*	+1.9%	0.9	5
RM-0	0.488*	+20.5%	0.486*	+36.1%	0.276	-4.5%	6491*	-27.2%	0.8	10
RM-1	0.413	+2.0%	0.362	+1.4%	0.294*	+1.7%	9059*	+1.5%	0.9	166
RM-2	0.398	-1.7%	0.358	+0.3%	0.296	+2.4%	9053*	+1.5%	0.9	1
PRM	0.414	+2.2%	0.375	+5.0%	0.295*	+2.1%	8684*	-2.7%	0.9	96
MLgen	0.399	-1.5%	0.358	+0.3%	0.295	+2.1%	9044*	+1.4%	0.9	1
NLLR	0.402	-0.7%	0.359	+0.6%	0.285	-1.4%	8866	-0.6%	0.9	9

Table 4.5: Best results (optimized for P10) of the models contrasted in this chapter on the TREC-PRF-08 test collection using $| V_{Q} | = 10$.

Figure 4.3: Influence of the size of $\hat{R}$ on MAP, using pseudo relevant documents on the TREC-PRF-08 collection with $λ_{Q} = 0.9$ and $| V_{Q} | = 10$.

TREC Relevance Feedback 2008

Table 4.4 shows the results on the TREC-PRF-08 test collection (optimized for MAP). This test collection contains the largest topic set (with 264 queries, cf. Section 3.3). Here, the story is different from that for TREC Robust. All models except PRM obtain a significant improvement in recall over the baseline; PRM retrieves significantly fewer relevant documents. NLLR and PRM are the only models, however, that also achieve a significant improvement in terms of MAP, albeit a small one. None of the models achieves a significant improvement on the early precision measures. The optimal value for $λ_{Q}$ is again very similar for all models, with a range of $0.8$–$0.9$. This value indicates that a relatively large part of the probability mass is put towards the initial query. This in turn is an explanation for the relatively small differences in absolute retrieval scores as compared to the baseline.
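To make this effect concrete, the following sketch assumes that the final query model is a linear interpolation that places weight $λ_{Q}$ on the original query and $1 - λ_{Q}$ on the model estimated from the feedback documents (which is consistent with $λ_{Q} = 1.0$ corresponding to the QL baseline). The function names and the toy term distributions are hypothetical and serve only as an illustration.

```python
from typing import Dict


def interpolate(original: Dict[str, float], expanded: Dict[str, float],
                lambda_q: float) -> Dict[str, float]:
    """Assumed mixing step: weight lambda_q on the original query model."""
    terms = set(original) | set(expanded)
    return {
        t: lambda_q * original.get(t, 0.0) + (1.0 - lambda_q) * expanded.get(t, 0.0)
        for t in terms
    }


def expansion_mass(model: Dict[str, float], original: Dict[str, float]) -> float:
    """Probability mass assigned to terms that are not in the original query."""
    return sum(p for t, p in model.items() if t not in original)


# Hypothetical distributions; the probabilities are made up for illustration.
original = {"wind": 0.5, "farms": 0.5}
expanded = {"wind": 0.4, "turbin": 0.3, "kilowatt": 0.3}

mass_low = expansion_mass(interpolate(original, expanded, 0.4), original)
mass_high = expansion_mass(interpolate(original, expanded, 0.9), original)
print(mass_low, mass_high)
```

Under these assumptions, at most $1 - λ_{Q} = 0.1$ of the probability mass can move to expansion terms when $λ_{Q} = 0.9$, so the resulting ranking necessarily stays close to that of the baseline.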

When optimized for P10 (cf. Table 4.5), RM-0 is the only model to obtain substantial and significant improvements over the baseline in terms of early precision. It does so at the cost of recall and MAP, however, yielding the only significantly worse performance. This is an interesting finding since RM-0 does not take the query or the set of feedback documents into account; it is therefore quickly computed. The optimal value for $λ_{Q}$ when optimizing for P10 is roughly the same as when optimizing for MAP; only the optimal number of employed feedback documents is different. Furthermore, RM-2, MLE, and MLgen perform very similarly. This is not surprising, since they all base their query model on the same, single feedback document and, in that particular case, RM-0 is equivalent to MLgen. RM-2 also blends in the probability of the query given the document, causing 9 more relevant documents to be retrieved. On the other hand, RM-1 and PRM obtain their highest P10 scores with substantially more feedback documents.

Figure 4.3 again shows the effect of varying the number of pseudo relevant documents, although this time on the TREC-PRF-08 test collection. From this figure, we first note that the models react differently to an increasing number of feedback documents on this test collection. RM-1 is again most robust. It outperforms all other models (except PRM) on almost any number of feedback documents; it is only slightly outperformed by MBF for low numbers of feedback documents. On this collection, re-estimating the document models by applying PRM offers the best performance in terms of MAP when more than 20 feedback documents are used. MBF is the worst performing model on this test collection, whereas RM-0, RM-2, and MLgen perform similarly to, or slightly worse than, the baseline (with MLgen outperforming the other two models).

In sum, despite having a large number of topics and documents, we obtain only minor improvements on the TREC-PRF-08 test collection. In part, this is caused by the type of collection. Unlike TREC Robust, this collection consists of unedited web pages which may contain significant amounts of noise. For example, layout-related terms may erroneously end up in content fields (due to the web crawler or the author of a page). Other examples include typos and grammatical errors. Such noise does not appear in the edited and moderated content that makes up the TREC Robust document collection. RM-1 again proves to be stable, whereas PRM again obtains the highest MAP scores (although its recall is significantly worse than that of the baseline).

	statP10		statMAP		$λ_{Q}$	$| \hat{R} |$
QL	0.328		0.148		1.0	–
MLE	0.312	-4.9%	0.177	+19.6%	0.4	1
MBF	0.335	+2.1%	0.167	+12.8%	0.8	1
RM-0	0.312	-4.9%	0.177	+19.6%	0.4	1
RM-1	0.312	-4.9%	0.177	+19.6%	0.4	1
RM-2	0.341	+4.0%	0.175	+18.2%	0.4	1
PRM	0.386	+17.7%	0.175	+18.2%	0.6	54
MLgen	0.312	-4.9%	0.177	+19.6%	0.4	1
NLLR	0.328	0.0%	0.148	0.0%	0.9	10

Table 4.6: Best results (optimized for statMAP) of the models contrasted in this chapter on the TREC-WEB-09 test collection using $| V_{Q} | = 10$.

	statP10		statMAP		$λ_{Q}$	$| \hat{R} |$
QL	0.328		0.148		1.0	–
MLE	0.346	+5.5%	0.146*	-1.4%	0.1	3
MBF	0.338	+3.0%	0.157	+6.1%	0.7	150
RM-0	0.350	+6.7%	0.159	+7.4%	0.3	53
RM-1	0.364	+11.0%	0.159	+7.4%	0.3	76
RM-2	0.373	+13.7%	0.150	+1.4%	0.1	2
PRM	0.510*	+55.5%	0.157	+6.1%	0.6	80
MLgen	0.393	+19.8%	0.159	+7.4%	0.4	89
NLLR	0.389	+18.6%	0.140	-5.4%	0.6	168

Table 4.7: Best results (optimized for statP10) of the models contrasted in this chapter on the TREC-WEB-09 test collection using $| V_{Q} | = 10$.

Figure 4.4: Influence of the size of $\hat{R}$ on statP10, using pseudo relevant documents on the TREC-WEB-09 collection with $λ_{Q} = 0.6$ and $| V_{Q} | = 10$.

TREC Web 2009

In Table 4.6, we show the best results obtained in terms of statMAP on the TREC-WEB-09 test collection. We observe that pseudo relevance feedback does not perform well on this collection for any of the models; none of them obtains a significant improvement over the baseline on any evaluation metric. PRM is able to obtain a substantial (although not significant) improvement, but requires a large number of feedback documents. Applying NLLR does not make any difference in terms of statMAP compared with the baseline, for any setting of $λ_{Q}$ or any number of feedback documents. All versions of the relevance model again base their estimation on a single document which, in turn, leads to equal scores (and a performance in terms of statP10 that is worse than the baseline). As to the optimal value of $λ_{Q}$, PRM is the odd one out ($0.6$). MBF behaves similarly to NLLR ($0.8$ and $0.9$, respectively), while the relevance model variations, MLE, and MLgen all share the same value ($0.4$).

Table 4.7 shows the results when optimized for statP10. From this table we observe that only PRM is able to obtain a significant improvement over the baseline, again using a large number of documents. In terms of statP10, all other models improve over the baseline as well, although not significantly so. We also note the large variation in the optimal number of feedback documents and in the optimal setting of $λ_{Q}$. As to statMAP in this case, most models improve slightly over the baseline; NLLR and MLE obtain statMAP values worse than the baseline (and, in the case of MLE, even significantly so).

In Figure 4.4 we display the influence of the number of feedback documents on statP10 for TREC-WEB-09 and $λ_{Q} = 0.6$ . First we note the variance as single feedback documents are added. This is in part due to the small number of topics as compared to the TREC-ROB-04 and TREC-PRF-08 test collections. For this setting of $λ_{Q}$ , most models obtain statP10 values that are close to the baseline. As was clear from the results tables, PRM outperforms the other models, followed by MLgen when $| \hat{R} | > 30$ . From this figure it is clear why PRM obtained the substantial improvements indicated above; when using more than 50 feedback documents, this model outperforms all the other models.

The main reason for the retrieval performance obtained on this test collection is that it is a large web collection. Unlike the TREC-PRF-08 collection (which was restricted to web pages from the .gov domain), this document collection is a representative sample of the full Web. Therefore, it contains a considerable amount of noise in the form of spam pages, strange terms, etc. In the case of pseudo relevance feedback, spam pages are treated just like any other page. However, the content of most of these is either extremely focused (e.g., to promote or encourage you to buy some product) or extremely varied (e.g., in order to appear in search engine rankings for many queries). These factors influence the query models that are estimated from such documents.

One can assume that on governmental web pages (such as those found in the TREC-PRF-08 test collection) there exists at least some kind of moderation of the contents. Having a document collection containing any web page, however, means that most of the documents are unmoderated. Hence, uninformative terms originating from such noise might acquire probability mass under some models. Judging by the results, PRM is the only model that is able to correct for this phenomenon. Interestingly, MBF (which uses a similar EM-based update algorithm on the set of feedback documents) performs only on par with the baseline on this test collection.

Upshot

We obtain improvements over the baseline on all test collections using most models with a fixed number of terms and with the right number of feedback documents. This finding confirms those from related work (see e.g., [70, 196]) on a much larger set of test collections. On TREC Robust we observe that all but two models behave similarly when more pseudo relevant documents are used. RM-1 is most robust on this test collection in that respect; its performance does not change much with a varying number of feedback documents. The picture on TREC-PRF-08 is slightly different. Here, PRM obtains the highest absolute scores. RM-1 is still the most robust with respect to the number of feedback documents. All other models only improve slightly over the baseline when using a small number of feedback documents. On the TREC Web 2009 test collection, we obtain only modest, non-significant improvements in terms of statMAP. Early precision (as measured by statP10), on the other hand, does significantly improve in the case of PRM.

Figure 4.5: Histograms of the document lengths on the test collections employed in this chapter. The “all-documents” plots have been cropped to match the dimensions of the “relevant documents” plots. Panels: (a) TREC-ROB-04 – all documents; (b) TREC-ROB-04 – relevant documents; (d) TREC-PRF-08 – relevant documents; (e) TREC-WEB-09 – all documents; (f) TREC-WEB-09 – relevant documents.

So, we can conclude from the results presented so far that the test collection has a definite influence on the level of improvement provided by pseudo relevance feedback.

Furthermore, from the relative results between test collections, we have hypothesized that the level of noise in the documents influences the query models generated from them. Indeed, related work has shown that selecting terms from different document representations (be it, e.g., from structural elements [150], from referring documents [86], or from both [333]) or from contextual factors such as the number of inlinks [45] helps retrieval performance. We conclude that reducing the amount of noise by leveraging such information would help to further improve the performance resulting from relevance feedback.

But these are not the only factors. For query modeling using relevance feedback to be successful, the terms that receive most probability mass should be “coherent,” that is, they should reinforce each other (as opposed to finding a single, excellent term) [242]. In order to find such terms, it helps when the documents have a dedicated interest in a topic [125]. Ideally, one would like to select those feedback documents that are both most coherent and most relevant to the query [124, 235]. Especially on the larger test collections (TREC-PRF-08 and TREC-WEB-09), we see that the models that solely make use of the set of feedback documents (MLE and MBF) perform worse than their counterparts. We conclude that, on these collections, it helps to mix the evidence brought in by each individual feedback document as well as the set thereof to determine which terms are coherent. The notion that the largest benefit from query modeling using relevance feedback is to be obtained when the feedback documents show a dedicated interest in a topic or, consequently, the terms in the query models are cohesive, is something we exploit in the next chapter. There, we use concepts assigned to documents to focus the query model estimations on a subset of coherent documents that are relevant to the query.

Fang et al. [100] observe that “if all the query terms are discriminative words, the KL-divergence method will assign a higher score to a longer document. If there are common terms, however, longer documents are penalized.” This implies two things. First, if a relevance feedback model (such as MBF or PRM) emphasizes discriminative terms, i.e., those that occur infrequently in the collection, then it is more likely to rank longer documents higher. Second, the length of the (relevant) documents influences the retrieval performance. Figure 4.5 shows the distribution of document lengths for all documents as well as only the relevant documents on the different test collections. The histograms first provide a clear indication that the TREC-ROB-04 documents are the shortest of all test collections. They further show that most of the relevant documents for TREC Robust 2004 are relatively short. TREC-PRF-08 and TREC Web 2009, on the other hand, have a much larger spread. This is a partial explanation of why PRM outperforms the other models on TREC-PRF-08 and TREC-WEB-09. It does not explain why the same effect is not visible for MBF, however.

Figure 4.6: Per-topic breakdown of the improvement of the models over the QL baseline on the TREC-ROB-04 test collection on MAP using $| V_{Q} | = 10$ and the parameter settings optimized for MAP (cf. Table 4.2). Panels: (a) MLE; (b) MBF; (d) RM-1; (e) RM-2; (f) PRM; (g) MLgen; (h) NLLR.

Figure 4.7: Per-topic breakdown of the improvement of the models over the QL baseline on the TREC-PRF-08 test collection on MAP using $| V_{Q} | = 10$ and the parameter settings optimized for MAP (cf. Table 4.4). Panels: (a) MLE; (b) MBF; (d) RM-1; (e) RM-2; (f) PRM; (g) MLgen; (h) NLLR.

Figure 4.8: Per-topic breakdown of the improvement of the models over the QL baseline on the TREC-WEB-09 test collection on statMAP using $| V_{Q} | = 10$ and the parameter settings optimized for statMAP (cf. Table 4.6). Panels: (a) MLE; (b) MBF; (d) RM-1; (e) RM-2; (f) PRM; (g) MLgen; (h) NLLR.

4.3.2 Per-topic Results

Relevance feedback models are typically associated with a large variance in performance per topic. For some topics relevance feedback improves results substantially, whereas for others it hurts [70, 272]. In this section, we look at the per-topic performance of the models. We take the values for $λ_{Q}$ and $| \hat{R} |$ that optimize MAP for each model (listed in the tables above) and plot the difference with the baseline in terms of AP (“$Δ$AP”). We sort the topics by decreasing $Δ$AP; a positive value in these plots indicates an improvement over the baseline for that particular topic.
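The per-topic breakdown described here amounts to a subtraction followed by a sort. The sketch below illustrates this under the assumption that per-topic AP values are available as dictionaries keyed by topic identifier; the function name, the topic identifiers, and the AP values are hypothetical.

```python
from typing import Dict, List, Tuple


def delta_ap(model_ap: Dict[str, float],
             baseline_ap: Dict[str, float]) -> List[Tuple[str, float]]:
    """Per-topic AP difference with the baseline, sorted by decreasing delta."""
    deltas = {topic: model_ap[topic] - baseline_ap[topic] for topic in baseline_ap}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Made-up AP values for three made-up topics.
baseline_ap = {"t1": 0.20, "t2": 0.50, "t3": 0.10}
model_ap = {"t1": 0.35, "t2": 0.40, "t3": 0.10}

ranked = delta_ap(model_ap, baseline_ap)
helped = sum(1 for _, d in ranked if d > 0)  # topics improved over the baseline
hurt = sum(1 for _, d in ranked if d < 0)    # topics that perform worse
print(ranked, helped, hurt)
```

Counting the positive and negative entries of the sorted list gives the numbers of topics that are helped and hurt, which is the comparison made in the discussion of the per-topic plots.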

Figure 4.6 shows a per-topic plot of the difference of each model compared to the baseline for the TREC Robust 2004 test collection. From these figures we observe that, for all models except NLLR, the number of topics that are improved over the baseline is larger than the number of topics with a worse performance. Since the documents in this collection are short and focused, all models generally pick up related, relevant terms. There are some topics that are difficult, however. RM-0, RM-2, PRM, and MLgen all have difficulties with topic #308 (“implant dentistry”). The terms that are introduced for all of these models are mostly related to the query term “implant” instead of dental implants. Another difficult topic is #630 (“gulf war syndrome”). Although most terms are related to the Gulf war, there are also terms that are related to war (or wars) in general.

Figure 4.7 shows the results for TREC-PRF-08. On this test collection, we first note that, judging by the area under the curve, the performance of all models is closer to the baseline than for TREC Robust 2004. This is in line with the observation made in the previous section, where we noted that the optimal value lies around $λ_{Q} = 0.9$. This in turn means that the generated query models are close to the original query, i.e., the baseline. Furthermore, most models have difficulty with the same topic. In particular, topic #8218 (“marfan syndrome infants”) yields the worst relative performance for MBF, MLgen, PRM, RM-1, and RM-0. Marfan’s syndrome is a genetic disorder of the connective tissue. Most of the models, however, erroneously focus on the terms “infants” and “syndrome,” causing a decline in retrieval performance. Conversely, most models are helped on topic #3554 (“what specific blood tests test for celiac disease or sprue”) and #2106 (“arizona parkways”). For both topics, almost all models identify related, relevant terms and improve upon the baseline. PRM performs particularly well on topic #6010 (“wind farms in new mexico”). Here, most terms included in the query model are relevant, including such examples as “turbin,” “kilowatt,” and “megawatt.” These terms are infrequent in the collection, causing them to obtain substantial probability mass.

In Figure 4.8 we show the per-topic differences for TREC-WEB-09. The first obvious observation is that this test collection has the smallest number of topics. For all models, the number of topics that are helped roughly equals the number of topics that are not. For most models, however, the absolute improvements are larger. Topic #16 (“arizona game and fish”) in particular is helped. This can be attributed to the fact that all but two models use a single feedback document to obtain optimal retrieval performance (cf. Table 4.6). In the case of this particular topic, the first feedback document is a relevant one.

To summarize, we have observed that for TREC Robust most (but not all) topics are helped using pseudo relevance feedback. On the TREC-PRF-08 test collection, the number of topics helped roughly equals the number of topics that are hurt, mainly due to the nature of the documents in the collection. This phenomenon is also visible on TREC-WEB-09, although in this case there are more topics that are helped substantially than topics that are hurt.

All of the experiments so far have used a fixed number of terms in the query models. Ogilvie et al. [242] show that varying this number can have significant effects on retrieval performance. Therefore we zoom in on this parameter setting in the next section.

Figure 4.9: Influence of the size of $| V_{Q} |$ on (stat)MAP, using pseudo relevant documents. Panels: (a) TREC-ROB-04 ($λ_{Q} = 0.4$ and $| \hat{R} | = 12$); (b) TREC-PRF-08 ($λ_{Q} = 0.9$ and $| \hat{R} | = 3$); (c) TREC-WEB-09 ($λ_{Q} = 0.7$ and $| \hat{R} | = 60$).

4.3.3 Number of Terms in the Query Models

In the previous sections we have fixed the number of terms in the query models, $| V_{Q} |$ , to a maximum, considering only the ten terms with the highest probability for inclusion in the query model. The fewer terms you use, the fewer lookups need to be performed in the index. Reducing or optimizing this number is therefore interesting from an efficiency point of view. Furthermore, the number of terms may also influence the end-to-end retrieval performance. In this section we discuss the influence of varying this parameter setting. In particular, we fix $λ_{Q}$ and $| \hat{R} |$ and report retrieval performance for incremental values of $| V_{Q} |$ , similar to the graphs presented in Section 4.3.1.
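Restricting the query model to its $| V_{Q} |$ most probable terms can be sketched as follows. The renormalization step is an assumption made for this illustration (it keeps the truncated model a proper distribution) and is not prescribed by the text; the function name and the toy model are hypothetical as well.

```python
from typing import Dict


def truncate_query_model(model: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable terms and renormalize their probabilities."""
    top = sorted(model.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    # Assumes at least one kept term has non-zero probability.
    return {term: p / total for term, p in top}


# Made-up query model over four terms.
model = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
top2 = truncate_query_model(model, 2)
print(top2)
```

Each term that survives truncation requires an index lookup at retrieval time, which is why a smaller $| V_{Q} |$ is attractive from an efficiency point of view.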

Figure 4.9(a) shows the results of varying the number of terms on the TREC Robust 2004 test collection. We note that all models except NLLR show similar behavior when more terms are included. The optimal number of terms lies in the interval $10 - 30$ and performance degrades slightly after that. In absolute terms, PRM and MLgen obtain the best MAP scores. NLLR again does not perform well; its performance is below the baseline on all settings. Although it does not obtain the highest MAP scores overall, MBF is most robust to varying the number of terms. Including more than 10 terms does not influence its retrieval performance. For this test collection, the model typically converges at around 15 terms, causing this behavior. In contrast, PRM also re-estimates the language models from feedback documents. In this case, however, the terms from each individual feedback document model are aggregated to obtain the query model.

In Figure 4.9(b) we show the results on TREC-PRF-08. Here, we again observe that the models respond similarly to an increasing number of terms. We also note that the absolute differences between the performance of the baseline and that of the models are small. NLLR improves slightly over the baseline. PRM again obtains the highest scores overall. On this test collection, the ranking of the relevance feedback models in terms of their performance is roughly independent of the number of terms. In other words, selecting the right model is more important than setting the right number of terms to obtain the best retrieval performance.

The results for TREC Web 2009 are shown in Figure 4.9(c). We observe that RM-2 and NLLR both perform worse than the baseline on this test collection. Only when including more than 80 terms does RM-2 perform comparably to the baseline. MLgen and PRM obtain the highest retrieval performance using any number of terms. MLgen reaches its peak at around 30 terms; PRM already does so after 10 terms. RM-0, MLE, and RM-1 perform very similarly and only improve slightly upon the baseline.

In sum, we observe that for all test collections and models, the optimal number of terms to include in the query model lies between 10 and 30. This finding is in line with earlier work (see e.g., [119, 196]) and thus confirms those findings on a much larger and more diverse set of test collections. Varying the number of terms has an effect on the retrieval performance, albeit a limited one. The effects are certainly not as pronounced as when varying the number of feedback documents. Furthermore, the ranking of the various models in terms of their retrieval performance is relatively stable across all values for $| V_{Q} |$ for all test collections.
