4.1 Estimating the Importance of Feedback Documents

In Section 2.3.2 we introduced core relevance feedback models in the language modeling approach to information retrieval (IR). Eq. 2.14 indicated a means of emphasizing the importance of each individual feedback document, P(D|R). In this section, we turn to different ways of estimating this relative importance. When we know (or assume) that a given set of documents, R = {D_1, …, D_{|R|}}, is relevant to a query, we posit that the documents therein that are more similar to R are more topically relevant and should thus receive a higher probability of being picked. We therefore propose two models, introduced below, that base the estimate of P(D|R) on the divergence between D and R.

4.1.1 MLgen: A Generative Model

The first model rewards documents that contain terms that are frequent in the set of feedback documents. Using this model, we take P(D|R) to be the generative probability of D given R, i.e., the probability that the set of relevant documents generated the terms in the current document, similar to the query likelihood approach (cf. Eq. 2.3). More formally:
P(D|R) \propto \prod_{t \in D} P(t|\tilde{\theta}_R)^{n(t,D)}. \qquad (4.1)

Here, P(t|θ̃R) is determined using Eq. 2.13; below, we refer to this model as MLgen.
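
To make this concrete, the following is a minimal Python sketch of the MLgen score, computed in log space; the helper names and toy data are illustrative, and reading Eq. 2.13 as a maximum-likelihood estimate over the concatenated feedback documents is an assumption made for the example.

```python
from collections import Counter
from math import log

def mle(counts):
    """Maximum-likelihood term distribution from raw counts."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def mlgen_log_score(doc_counts, p_t_R):
    """Unnormalized log P(D|R) under MLgen (Eq. 4.1):
    sum over t in D of n(t,D) * log P(t | theta_R)."""
    score = 0.0
    for t, n in doc_counts.items():
        p = p_t_R.get(t, 0.0)
        if p == 0.0:
            # A term unseen in R makes the product zero; in practice
            # theta_R would be smoothed (cf. Eq. 4.3).
            return float("-inf")
        score += n * log(p)
    return score

# Toy feedback set R (illustrative only).
R = [Counter("the cat sat on the mat".split()),
     Counter("a cat and a dog".split())]

# Assumed reading of Eq. 2.13: MLE over the concatenation of R.
p_t_R = mle(sum(R, Counter()))

for i, D in enumerate(R):
    print("document", i, "unnormalized log P(D|R) =", mlgen_log_score(D, p_t_R))
```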

4.1.2 Normalized Log-likelihood Ratio

The second method measures the divergence between R and each D using the log-likelihood ratio, normalized by the collection C. Interpreted loosely, this measure indicates the average surprise of observing document D when we have R in mind, normalized using a background collection, C. That is, terms that are “well-explained” by the collection (i.e., that have a high frequency in the collection) do not contribute as much to the comparison as terms that are not. The measure quantifies how much better one language model is than another at modeling an observed text, relative to modeling it with a collection model. More formally:
P(D|R) \propto H(\theta_D, \theta_C) - H(\theta_D, \theta_R) = Z \sum_{t \in V} P(t|\theta_D) \log \frac{P(t|\theta_R)}{P(t|\theta_C)}. \qquad (4.2)

The measure has the attractive property that it is high for documents for which H(θD,θC) is high and H(θD,θR) is low. So, in order to receive a high score, documents should contain specific terminology, i.e., they should be dissimilar from the collection model but similar to the topical model of relevance. Since we do not know the actual parameters of θR by which we could calculate this, we use R as a surrogate and linearly interpolate it with the collection model (cf. Eq. 2.13). This is similar to the intuitions behind MBF (cf. Eq. 2.16):
P(t|\hat{\theta}_R) = (1 - \lambda_R)\, P(t|\tilde{\theta}_R) + \lambda_R\, P(t|\theta_C). \qquad (4.3)

This interpolation also ensures that zero-frequency issues are avoided and that the sum in Eq. 4.2 is over the same event space for all language models involved. Then, in order to use this discriminative measure as a probability, we define a normalization factor Z = 1 / \sum_{D \in R} P(D|R).
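
As an illustration of Eqs. 4.2 and 4.3 combined, the sketch below scores each feedback document and normalizes the scores with Z; the document models are plain maximum-likelihood estimates, the value of λR is arbitrary, and the toy collection stands in for C, so these are assumptions of the example rather than prescriptions of the model.

```python
from collections import Counter
from math import log

def mle(counts):
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def nllr_p_d_given_r(R, collection_counts, lam=0.5):
    """P(D|R) via Eqs. 4.2-4.3: score each D by
    sum_t P(t|theta_D) * log( P(t|theta_R_hat) / P(t|theta_C) ),
    then normalize the scores so they sum to one (factor Z)."""
    p_C = mle(collection_counts)
    # Eq. 4.3: interpolate the feedback-set estimate with the collection model.
    p_R_tilde = mle(sum(R, Counter()))
    p_R_hat = {t: (1 - lam) * p_R_tilde.get(t, 0.0) + lam * p_C[t] for t in p_C}

    scores = []
    for D in R:
        p_D = mle(D)
        scores.append(sum(p_D[t] * log(p_R_hat[t] / p_C[t]) for t in p_D))
    Z = 1.0 / sum(scores)          # normalization factor Z of Eq. 4.2
    return [Z * s for s in scores]

# Illustrative toy data; in practice C is the full document collection.
R = [Counter("language models for ad hoc retrieval".split()),
     Counter("feedback documents and expanded query models".split())]
C = sum(R, Counter()) + Counter(("background text " * 20).split())

print(nllr_p_d_given_r(R, C, lam=0.5))
```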

Finally, by putting Eq. 2.15 and Eq. 4.2 together, we obtain an estimate of our expanded query model:
P(t_1, \ldots, t_{|V|}|\theta_Q) = \prod_{i=1}^{|V|} \sum_{D \in R} \left[ Z \sum_{t \in V} P(t|\theta_D) \log \frac{P(t|\hat{\theta}_R)}{P(t|\theta_C)} \right] P(t_i|\theta_D). \qquad (4.4)

This model, which we refer to as NLLR, effectively determines the query model using information from each individual relevant document and from the most representative sample we have of θQ, namely R.
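
In practice, Eq. 4.4 amounts to building the expanded query model term by term, weighting each document model by its estimated P(D|R). The sketch below does exactly that; the weights would come from MLgen or NLLR above, and the placeholder values and the truncation to the top-scoring terms are illustrative assumptions, not part of the model.

```python
from collections import Counter, defaultdict

def mle(counts):
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def expanded_query_model(R, p_d_given_r, top_k=10):
    """P(t|theta_Q) = sum_{D in R} P(t|theta_D) * P(D|R)  (cf. Eqs. 2.15 and 4.4)."""
    p_Q = defaultdict(float)
    for D, w in zip(R, p_d_given_r):
        for t, p in mle(D).items():
            p_Q[t] += w * p
    # Keep only the highest-probability expansion terms (a common practical choice).
    return dict(sorted(p_Q.items(), key=lambda kv: kv[1], reverse=True)[:top_k])

R = [Counter("language models for ad hoc retrieval".split()),
     Counter("feedback documents and expanded query models".split())]
# Placeholder importance estimates; in practice these are the MLgen or NLLR weights.
weights = [0.6, 0.4]
print(expanded_query_model(R, weights))
```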

4.1.3 Models Related to MLgen and NLLR

As an aside, other ways of estimating P(D|R) have been proposed. Examples include simply assuming a uniform distribution, using the retrieval score of a document (or its inverse), or using information from clustered documents [24170]. One could also apply machine learning to select documents to use for relevance feedback, and use the machine learner’s confidence level as a substitute for P(D|R) [124].

The surface form of NLLR seems reminiscent of a model introduced in [60]. Carpineto et al. [60] propose to use the KL-divergence between R and the collection to select and weight expansion terms for Rocchio feedback [267]. Their model is also highly similar to the query clarity score, which uses this measure to predict the difficulty of a query [84]. Besides the fact that we do not use a VSM, Carpineto et al. also ignore the individual document models by assuming independence between relevant documents, similar to MLE.

Ponte’s [247] log ratio method is also related to NLLR. He uses the log of the ratio between a term’s probability given each relevant document and its probability given the collection, summed over all the relevant documents. However, Ponte [247] views the query as a set—as opposed to a generative model—and, moreover, he uses the log ratio only for thresholding the terms to be added to the initial query.

MBF is related to NLLR in that it also uses information from both the set of relevant documents and the collection in its estimations, although the estimation method is different. Moreover, NLLR leverages information from each individual relevant document. When we apply the intuition underlying NLLR to MBF, we should drop the full document independence assumption in MBF and change the M-step (cf. Eq. 2.18) to:
P(t|\hat{\theta}_R) = \frac{1}{|R|} \sum_{D \in R} \frac{e_t}{\sum_{t'} e_{t'}}. \qquad (4.5)

If we further exclude the collection estimate, i.e., set λR = 0 (cf. Eq. 2.16), we obtain:
P(t|\hat{\theta}_R) = \frac{1}{|R|} \sum_{D \in R} \frac{n(t,D)}{\sum_{t'} n(t',D)} \qquad (4.6)
                    = \frac{1}{|R|} \sum_{D \in R} P(t|\tilde{\theta}_D),

which is a simplified version of NLLR, using a uniform probability of selecting a document. Moreover, this is in fact the same as the relevance model in situation 1 (when the full set of relevant documents is known, cf. Section 2.3.2): RM-0.
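
For comparison, a minimal sketch of this uniform special case (Eq. 4.6, equivalently RM-0), again on illustrative data:

```python
from collections import Counter, defaultdict

def mle(counts):
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def rm0(R):
    """Eq. 4.6: P(t|theta_R) = (1/|R|) * sum_{D in R} P(t|theta_D)."""
    p = defaultdict(float)
    for D in R:
        for t, v in mle(D).items():
            p[t] += v / len(R)
    return dict(p)

R = [Counter("language models for ad hoc retrieval".split()),
     Counter("feedback documents and expanded query models".split())]
print(rm0(R))
```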

The relevance modeling approach to relevance feedback can be viewed as a simplification of MLgen and NLLR, since it assumes either that each document has an equal probability of being selected (RM-0) or that this probability depends on the query (RM-1 and RM-2). The latter models explicitly consider the initial query by first gathering evidence from each document for a query term and then combining the evidence over all query terms (RM-2), or vice versa (RM-1), as detailed in Section 2.3.2. Using the probability that a document generated the query (as RM-1 and RM-2 do) is a much simpler way of capturing the notion that documents should be weighted according to their “relative” level of relevance; it essentially replaces R in the MLgen and NLLR models with only the query, θ̃Q. And, since the query is quite sparse compared to R, our models are less prone to overfitting and should obtain an improved estimate.