2.2 Generative Language Modeling for IR

The success of using statistical language models (LMs) to improve automatic speech recognition (ASR), as well as the practical challenges associated with using the PRP model, inspired several IR researchers to recast IR in a generative probabilistic framework by representing documents as generative probabilistic models.

The main task of automatic speech recognition is the transcription of spoken utterances. An effective and theoretically well-founded way of approaching this task is by estimating a probabilistic model based on the occurrences of word sequences in a particular language [147, 271]. Such models are distributions over term sequences (or: n-grams, where n indicates the length of each sequence) and can be used to compute the probability of observing a sequence of terms, by computing the product of the probabilities of observing the individual terms. Then, when a new piece of audio material A needs to be transcribed, each possible interpretation of each observation is compared to this probabilistic model (the LM) and the most likely candidate $S^{*}$ is returned:

\[
S^{*} = \arg\max_{S} P(S \mid A) = \arg\max_{S} P(A \mid S)\,P(S). \qquad (2.1)
\]

Here, P(S) is the language model. S is viewed as having been generated according to some probability and transmitted through a noisy channel that transforms S to A with probability P(A|S). Instead of selecting a single S, this source-channel model can also be used to rank a set of candidates; this is exactly what happens in IR, as we will see later.
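As a concrete illustration of this decision rule, consider the following minimal sketch. The names `candidates`, `acoustic_prob` (standing in for P(A|S)), and `lm_prob` (standing in for P(S)) are hypothetical placeholders, not part of any particular ASR system.

```python
def decode(audio, candidates, acoustic_prob, lm_prob):
    """Source-channel decoding (Eq. 2.1): return the candidate transcription
    that maximizes P(A|S) * P(S)."""
    return max(candidates, key=lambda s: acoustic_prob(audio, s) * lm_prob(s))
```

Ranking the full set of candidates by the same key, rather than keeping only the maximum, is the source-channel view of document ranking referred to above.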

It is common in ASR to use higher order n-grams, although deriving trigram or even bigram probabilities is a sparse estimation problem, even with large training corpora. Higher order n-grams have also been tried for IR but these experiments were met with limited success; the mainstream approach is to use n-grams of length 1 (or: unigrams). Ironically, n-gram based language models use very little knowledge of what language really is. They take no advantage of the fact that what is being modeled is language—it may as well be a sequence of arbitrary symbols [271]. Efforts to include syntactic information in n-gram based models have yielded modest improvements at best [63, 137, 290].

The first published application of language modeling for IR was based on the multivariate Bernoulli distribution [248], but the simpler multinomial model became the mainstream model [134, 228]. In the multivariate Bernoulli model, a document is represented as a binary vector over the entire vocabulary, in which each element indicates only whether the corresponding term occurs in the document. The multinomial model, on the other hand, explicitly captures the frequency of occurrence of a term.

McCallum and Nigam [204] find that, for text classification using Naive Bayes, the multivariate Bernoulli model performs well with small vocabulary sizes, but that the multinomial usually performs better at larger vocabulary sizes. Losada and Azzopardi [192] observe that for most retrieval tasks (except sentence retrieval) the multivariate Bernoulli model is significantly outperformed by the multinomial model; their analysis reveals that the multivariate Bernoulli model tends to promote long documents.

However, recent work has addressed some of the shortcomings of using the multinomial distribution for modeling text [198, 256]. A common argument against using a multinomial is that it insufficiently captures the “burstiness” of language. This property of language is derived from the observation that there is a higher chance of observing a term when it has already been observed before. Such burstiness also implies a power law distribution, similar to the Zipfian curve often observed in natural language [201, 360, 361]. Zipf’s law states that, if $F_i$ is the frequency of the i-th most frequent event, then

\[
F_i \propto \frac{1}{i^{\alpha}}, \qquad (2.2)
\]

where $\alpha$ is a constant (as well as the only parameter of the distribution). In practice this means that there are a few words that occur very frequently and a great many words that are rare. Due to this distribution, the number of distinct words in a vocabulary does not grow linearly (but sublinearly) with the size of the collection. Alternative models that try to incorporate this information include the Dirichlet compound multinomial distribution [348] and the related hierarchical Pitman-Yor model [236]. These distributions provide a better model of language use and the authors show significant improvements over the standard multinomial model. Sunehag [308] provides an analysis of such approaches and shows that TF.IDF follows naturally from them.
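To make the rank-frequency relation concrete, the sketch below computes empirical (rank, frequency) pairs for a tokenized collection and the frequency an idealized Zipfian distribution with a given α would predict. The function names and the default α = 1 are illustrative assumptions, not taken from the references above.

```python
from collections import Counter

def rank_frequency(tokens):
    """Empirical (rank, frequency) pairs, sorted by descending frequency."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def zipf_expected(f1, rank, alpha=1.0):
    """Frequency that Eq. 2.2 predicts for a given rank, scaled so that
    rank 1 matches the observed frequency f1 of the most frequent term."""
    return f1 / rank ** alpha
```

Plotting both on a log-log scale is the usual way to check how closely a collection follows Eq. 2.2.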

2.2.1 Query Likelihood

The earliest work in the query likelihood family of approaches can be attributed to Kalt [158]. He suggests that term probabilities for documents related to a single topic can be modeled by a single stochastic process; documents related to different topics would be generated by different stochastic processes. Kalt’s model treats each document as a sample from a topic language model. Since the problem he considered was text classification, “queries” were derived from a training set rather than solicited from actual users. Kalt’s approach was based on the maximum likelihood (ML) estimate (which will be introduced below in Eq. 2.4) and incorporated collection statistics, term frequency, and document length as integral parts of the model. Although later query likelihood approaches are more robust in that they consider each document (vs. a group of documents) as being described by an underlying language model, Kalt’s early work is clearly a precursor to language modeling for information retrieval.

In the multinomial unigram language modeling approach to IR, each document D is represented as a multinomial probability distribution P(t|θD) over all terms t in the vocabulary. At retrieval time, each document is ranked according to the likelihood of having generated the query, which is why this model is commonly referred to as the query likelihood (QL) model. It determines the probability that the query terms (t ∈ Q) are sampled from the document language model [134, 229, 240]:

\[
\text{Score}(Q,D) = P(D \mid Q) = \frac{P(D)\,P(Q \mid D)}{P(Q)} \propto P(D)\,P(Q \mid D) = P(D) \prod_{t \in Q} P(t \mid \theta_D)^{n(t,Q)}, \qquad (2.3)
\]

where n(t,Q) denotes the count of term t in query Q. The term P(Q) is the same for all documents and, since it does not influence the ranking of documents for a given query, it can safely be ignored for ad hoc search. As is clear from Eq. 2.3, independence between terms in the query is assumed. Note that this formulation is exactly the source-channel model described above, only for document ranking. The term P(D) is the prior probability of selecting a document and may be used to model a document’s higher a priori chance of being relevant [229], for example based on its authoritativeness or the number of incoming links or citations [210, 336]. In all the experiments in this thesis we assume this probability to be uniform, however.
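The ranking function of Eq. 2.3 translates directly into code. The sketch below assumes a uniform document prior P(D) and computes the score in log space, which is rank-equivalent because the logarithm is monotonic; `term_prob(t, doc)` is a hypothetical placeholder for a smoothed estimate of P(t|θD), the topic of the remainder of this section.

```python
import math
from collections import Counter

def query_likelihood_score(query_terms, doc, term_prob):
    """Log of Eq. 2.3 with a uniform document prior P(D).
    `term_prob(t, doc)` must return a smoothed, non-zero P(t | theta_D)."""
    return sum(n_tq * math.log(term_prob(t, doc))
               for t, n_tq in Counter(query_terms).items())
```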

A common way of estimating a document’s generative language model is through the use of an ML estimate on the contents of the document,

\[
P(t \mid \tilde{\theta}_D) = \frac{n(t,D)}{|D|}. \qquad (2.4)
\]

Here, |D| indicates the length of D. For retrieval models that are based on measuring the probability of observed data given a reference generative model, it is essential that the reference model is adequately smoothed. Smoothing is applied both to avoid the data sparsity (and, hence, zero-frequency) problems that occur with a maximum likelihood approach (which happens, for example, when one of the query terms does not appear in the document) and to account for general and document-specific language use. So, the goal of smoothing is to account for unseen events (terms) in the documents [65, 356]. Various types of smoothing have been proposed, including discounting techniques such as Laplace, Good-Turing, or leave-one-out smoothing. These methods add (or subtract) small amounts of probability mass with varying levels of sophistication. Another type is interpolation-based smoothing, which adjusts the probabilities of both seen and unseen events. One interpolation method commonly used in IR is Jelinek-Mercer smoothing, which considers each document to be a mixture of a document-specific model and a more general background model. Each document model is estimated using the maximum likelihood estimate of the terms in the document, linearly interpolated with a background language model P(t) [148, 229, 356]:

\[
P(t \mid \theta_D) = \lambda_D\, P(t \mid \tilde{\theta}_D) + (1 - \lambda_D)\, P(t). \qquad (2.5)
\]

Here, P(t) is calculated as the likelihood of observing t in a sufficiently large corpus, such as the document collection C:

\[
P(t) = \frac{n(t,C)}{\sum_{t'} n(t',C)}. \qquad (2.6)
\]
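A possible implementation of Eqs. 2.4-2.6 is sketched below; the class name and the default value of λD are illustrative choices, not prescribed by the model.

```python
from collections import Counter

class JelinekMercerLM:
    """Document model smoothed with a collection-wide background model."""

    def __init__(self, collection_tokens, lambda_d=0.9):
        self.lambda_d = lambda_d
        self.bg_counts = Counter(collection_tokens)   # n(t, C)
        self.bg_total = sum(self.bg_counts.values())  # denominator of Eq. 2.6

    def prob(self, term, doc_tokens):
        p_ml = Counter(doc_tokens)[term] / len(doc_tokens)        # Eq. 2.4
        p_bg = self.bg_counts[term] / self.bg_total               # Eq. 2.6
        return self.lambda_d * p_ml + (1 - self.lambda_d) * p_bg  # Eq. 2.5
```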

In this thesis, we use Bayesian smoothing with a Dirichlet prior, which has been shown to achieve superior performance on a variety of tasks and collections [30, 65, 191, 352, 356], and set:

\[
P(t \mid \theta_D) = \frac{|D|}{|D| + \mu}\, P(t \mid \tilde{\theta}_D) + \frac{\mu}{|D| + \mu}\, P(t), \qquad (2.7)
\]

where μ is a hyperparameter that controls the amount of smoothing and is typically set to the average document length of the documents in the collection.
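Eq. 2.7 differs from Jelinek-Mercer smoothing only in that the interpolation weight depends on the document length. A minimal sketch, with hypothetical argument names and the background statistics passed in explicitly:

```python
from collections import Counter

def dirichlet_prob(term, doc_tokens, bg_counts, bg_total, mu):
    """P(t | theta_D) with Dirichlet prior smoothing (Eq. 2.7); algebraically
    equal to (n(t, D) + mu * P(t)) / (|D| + mu)."""
    d_len = len(doc_tokens)
    p_ml = Counter(doc_tokens)[term] / d_len   # Eq. 2.4
    p_bg = bg_counts[term] / bg_total          # Eq. 2.6
    return (d_len / (d_len + mu)) * p_ml + (mu / (d_len + mu)) * p_bg
```

Long documents thus rely mostly on their own maximum likelihood estimate, while short documents are smoothed more heavily toward the background model.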

Various improvements upon this model have been proposed with varying complexity. For example, Shakery and Zhai [281] use a graph-based method to smooth document models, similar to Mei et al. [206]. Tao et al. [311] use document expansion to improve end-to-end retrieval.

2.2.2 KL Divergence

Soon after its conception, the query likelihood model was generalized by realizing that an information need can also be represented as a language model. This way, a comparison of two language models forms the basis for ranking and, hence, a more general and flexible retrieval model than query likelihood was obtained. Several authors have proposed the use of the Kullback-Leibler (KL) divergence for ranking, since it is a well established measure for the comparison of probability distributions with some intuitive properties—it always has a non-negative value and equal distributions receive a zero divergence value [173, 240, 346]. Using KL divergence, documents are scored by measuring the divergence between a query model θQ and document model θD. Since we want to assign a high score for high similarity and a low score for low similarity, the KL divergence is negated for ranking purposes. More formally, the score for each query-document pair using the KL divergence retrieval model is:

\[
\begin{aligned}
\text{Score}(Q,D) = -\,\mathrm{KL}(\theta_Q \,\|\, \theta_D) &= -\sum_{t \in V} P(t \mid \theta_Q) \log \frac{P(t \mid \theta_Q)}{P(t \mid \theta_D)} \\
&= \sum_{t \in V} P(t \mid \theta_Q) \log P(t \mid \theta_D) - \sum_{t \in V} P(t \mid \theta_Q) \log P(t \mid \theta_Q), \qquad (2.8)
\end{aligned}
\]

where V denotes the set of all terms used in all documents in the collection. KL divergence is also known as the relative entropy, which is defined as the cross-entropy of the observed distribution (in this case the query) as if it were generated by a reference distribution (in this case the document), minus the entropy of the observed distribution. KL divergence can also be measured in the reverse direction (also known as document likelihood), but this leads to poorer results for ad hoc search tasks [180]. The entropy of the query, $-\sum_{t \in V} P(t \mid \theta_Q)\log P(t \mid \theta_Q)$, is a query-specific constant and can thus be ignored for ranking purposes in the case of ad hoc retrieval (cf. Section 3.2.1).

When the query model is estimated using the empirical ML estimate on the original query, i.e.,

\[
P(t \mid \tilde{\theta}_Q) = \frac{n(t,Q)}{|Q|}, \qquad (2.9)
\]

it can be shown that documents are ranked in the same order as under the query likelihood model from Eq. 2.3 [353]. Later in this thesis, we use Eq. 2.8 in conjunction with Eq. 2.9 as a baseline retrieval model.
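The sketch below puts Eqs. 2.8 and 2.9 together. It drops the query entropy term, which is constant for a given query, so the resulting scores are rank-equivalent to the full negated divergence; the function names and the dictionary-based query model are assumptions of this illustration.

```python
import math
from collections import Counter

def ml_query_model(query_terms):
    """Empirical query model of Eq. 2.9."""
    counts = Counter(query_terms)
    return {t: n / len(query_terms) for t, n in counts.items()}

def kl_score(query_model, doc_prob):
    """Rank-equivalent form of Eq. 2.8: sum_t P(t|theta_Q) log P(t|theta_D),
    with the (constant) query entropy omitted. `doc_prob(t)` must return a
    smoothed, non-zero P(t | theta_D), e.g. the Dirichlet estimate above."""
    return sum(p_q * math.log(doc_prob(t)) for t, p_q in query_model.items())
```

With the maximum likelihood query model this score differs from the log query likelihood score only by a positive, query-dependent factor of 1/|Q|, which is the rank equivalence noted above.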

Note that a query is a verbal expression of an underlying information need and the query model (derived from the query) is therefore also only an estimate of this information need. Given that queries are typically short [300], this initial, crude estimate can often be improved upon by adding and reweighting terms. Since the query is modeled in its own fashion using the KL divergence framework, elaborate ways of estimating or updating the query model may be employed—a procedure known as query modeling.

In order to obtain a query model that is a better estimate of the information need, the initial query P(t|θ̃Q) may be interpolated with the expanded part P(t|θ̂Q) [24, 172, 267, 354]. Effectively, this reweights the initial query terms and provides smoothing for the relatively sparse initial sample:

\[
P(t \mid \theta_Q) = \lambda_Q\, P(t \mid \tilde{\theta}_Q) + (1 - \lambda_Q)\, P(t \mid \hat{\theta}_Q). \qquad (2.10)
\]
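A sketch of the interpolation in Eq. 2.10, assuming both models are given as dictionaries mapping terms to probabilities:

```python
def interpolate_query_model(initial_qm, expanded_qm, lambda_q):
    """Eq. 2.10: mix the initial query model with the expanded part."""
    vocab = set(initial_qm) | set(expanded_qm)
    return {t: lambda_q * initial_qm.get(t, 0.0)
               + (1.0 - lambda_q) * expanded_qm.get(t, 0.0)
            for t in vocab}
```

The result remains a proper probability distribution as long as both inputs sum to one.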

Figure 2.1 shows an example of an interpolated query model; query modeling will be further introduced in Section 2.3. In the remainder of this thesis, we will use this mechanism to incorporate relevance feedback information (Chapter 4) or leverage conceptual knowledge in the form of document annotations (Chapter 5) or in the form of Wikipedia articles (Chapter 7). In Section 2.3 we zoom in on ways of estimating P(t|θ̂Q). We discuss the issue of setting the smoothing parameter λQ in Section 3.4.

2.2.3 Relation to Probabilistic Approaches

As indicated above, several researchers have attempted to relate the LM approach to traditional probabilistic approaches, including the PRP model [83]. Sparck-Jones and Robertson [295] examine the notion of relevance in both the PRP and the query likelihood language modeling approach. They identify the following two distinctions.

1. Although in both approaches a match between terms in the query and a document implies relevance, the notion of relevance features explicitly in PRP but is never mentioned in LM.

2. The underlying principle of LM is to identify the ideal document, i.e., the one that generated the query (as exemplified by the argmax in Eq. 2.1).

Sparck-Jones and Robertson emphasize that the last point implies that retrieval stops after the document that generated the query is found. Furthermore, this fact, coupled with simply assuming that query generation and relevance are correlated, implies that it is difficult to describe methods such as relevance feedback (or any method pertaining to relevance) in existing LM approaches.

Hiemstra and de Vries [135] relate LM to traditional approaches by comparing the QL model presented in [134] with the TF.IDF weighting scheme and the combination with relevance weighting as done in Okapi BM25. Lafferty and Zhai [173] and Lavrenko and Croft [183] address the two issues mentioned above by suggesting new forms of LM for retrieval that are more closely related to the PRP model and move away from estimating the query generation probability. Lafferty and Zhai [173] include a binary, latent variable that indicates relevance of a document with respect to a query. They point out that document length normalization is an issue in PRP but not in LM; another difference is that in LM we typically have more data for estimation purposes than in PRP. Greiff [115] observes that the main contribution of LM is the recognition of the importance of parameter estimation in modeling and in the treatment of term frequency as the manifestation of an underlying probability distribution rather than as the probability of word occurrence itself. Lavrenko and Croft [183] take a similar view and explicitly define a latent model of relevance. According to this view, both the query and the relevant documents are samples from this latent model. Hiemstra et al. [136] build upon work presented in [297] and also attempt to bridge the gap between PRP and LM. They posit that LM should not blindly model language use. Instead, LM should model what language use distinguishes a relevant document from the other documents. In Section 2.3.2 we introduce these approaches further. In Chapter 4 we evaluate their performance on three distinct test collections.

Another, more recent spin-off of the discussion centers on the notion of event spaces for probabilistic models [259]. Since LM (and, in particular, the QL approach) is based on the probability of a query given a document, the event space would consist of queries in relation to a single, particular document. These event spaces would therefore be unique to each document. Under this interpretation, the query-likelihood scores of different documents for the same query would not be comparable because they come from different probability distributions in different event spaces. In line with the observations above, this implies that relevance feedback (in the form of documents) for a given query is impossible (although relevant query feedback for a given document would indeed be feasible [239]). Luk [195] responds to Robertson in a fashion similar to [16] and proves that, under certain assumptions, the latent variable indicating relevance introduced by [174] is implicit in the ranking formula. Boscarino and de Vries [43] reply to Luk in turn and argue that this claim is also problematic. Boscarino and de Vries state that Luk attempts to solve the issue at the statistical level, while it should be addressed through a proper selection of priors. All in all, a definitive bridge between PRP and LM is still missing. Even if the LM approach to IR is “misusing” some of its fundamental premises, the theoretical and experimental evidence suggests that the approach does indeed have merit.