The success of using statistical language models (LMs) to improve automatic speech recognition (ASR), together with the practical challenges associated with the PRP model, inspired several IR researchers to re-cast IR in a generative probabilistic framework by representing documents as generative probabilistic models.
The main task of automatic speech recognition is the transcription of spoken utterances. An effective and theoretically well-founded way of approaching this task is by estimating a probabilistic model based on the occurrences of word sequences in a particular language [147, 271]. Such models are distributions over term sequences (or: n-grams, where $n$ indicates the length of each sequence) and can be used to compute the probability of observing a sequence of terms, by computing the product of the probabilities of observing the individual terms. Then, when a new piece of audio material needs to be transcribed, each possible interpretation $W$ of each acoustic observation $A$ is compared to this probabilistic model (the LM) and the most likely candidate is returned:

$\hat{W} = \arg\max_W P(W \mid A) = \arg\max_W P(A \mid W)\, P(W).$
Here, $P(W)$ is the language model. The word sequence $W$ is viewed as having been generated according to some probability $P(W)$ and transmitted through a noisy channel that transforms $W$ into the observation $A$ with probability $P(A \mid W)$. Instead of selecting a single $W$, this source-channel model can also be used to rank a set of candidates; this is exactly what happens in IR, as we will see later.
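The source-channel ranking can be sketched as follows. All names and probabilities below are made up for illustration: the channel model slightly favours an acoustically closer candidate, but the language model overrules it.

```python
def rank_candidates(candidates, lm_prob, channel_prob):
    # Source-channel ranking: score each candidate W by P(A|W) * P(W).
    return sorted(candidates, key=lambda w: channel_prob[w] * lm_prob[w], reverse=True)

# Made-up probabilities: the acoustics slightly favour "recognise beach",
# but the language model strongly prefers "recognise speech".
lm_prob = {"recognise speech": 1e-6, "recognise beach": 1e-9, "wreck a nice beach": 1e-11}
channel_prob = {"recognise speech": 0.30, "recognise beach": 0.35, "wreck a nice beach": 0.40}
ranked = rank_candidates(list(lm_prob), lm_prob, channel_prob)
```

Selecting the single best candidate corresponds to taking `ranked[0]`; keeping the whole list is the ranking view used in IR.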
It is common in ASR to use higher-order n-grams, although deriving trigram or even bigram probabilities is a sparse estimation problem, even with large training corpora. Higher-order n-grams have also been tried for IR, but these experiments met with limited success; the mainstream approach is to use n-grams of length 1 (or: unigrams). Ironically, n-gram based language models use very little knowledge of what language really is. They take no advantage of the fact that what is being modeled is language; it may as well be a sequence of arbitrary symbols. Efforts to include syntactic information in n-gram based models have yielded modest improvements at best [63, 137, 290].
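The sparse estimation problem can be made concrete with a minimal maximum likelihood bigram estimator on a toy corpus (data and names are illustrative):

```python
from collections import Counter

def bigram_mle(tokens):
    # Maximum likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1).
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

corpus = "the cat sat on the mat".split()
probs = bigram_mle(corpus)
# ("the", "cat") was seen once and "the" occurs twice, so P(cat | the) = 0.5;
# the unseen bigram ("cat", "on") is simply absent, i.e. has zero probability.
```

Any word pair not observed in training receives zero probability, which is exactly the sparsity that grows worse for trigrams and beyond.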
The first published application of language modeling for IR was based on the multivariate Bernoulli distribution, but the simpler multinomial model became the mainstream model [134, 228]. In the multivariate Bernoulli model, a document is represented as a binary vector over the entire vocabulary, in which each element indicates whether the corresponding term is present in the document; how often a term occurs is ignored. The multinomial model, on the other hand, explicitly captures the frequency of occurrence of a term.
McCallum and Nigam  find that, for text classification using Naive Bayes, the multivariate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs better at larger vocabulary sizes. Losada and Azzopardi  observe that for most retrieval tasks (except sentence retrieval) the multivariate Bernoulli model is significantly outperformed by the multinomial model; their analysis reveals that the multivariate Bernoulli model tends to promote long documents.
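The difference between the two representations can be sketched as follows (the parameters are illustrative, not estimated from data, and the same numbers are reused for both models purely for comparison): the multivariate Bernoulli log-likelihood depends only on which terms are present, while the multinomial log-likelihood changes with term frequency.

```python
import math

def multinomial_loglik(doc, model):
    # Multinomial: every token occurrence contributes, so frequency matters.
    return sum(math.log(model[t]) for t in doc)

def bernoulli_loglik(doc, model, vocab):
    # Multivariate Bernoulli: one presence/absence trial per vocabulary term.
    present = set(doc)
    return sum(math.log(model[t] if t in present else 1.0 - model[t]) for t in vocab)

# Illustrative parameters (not estimated from any collection).
vocab = ["a", "b", "c"]
model = {"a": 0.5, "b": 0.3, "c": 0.2}
```

Scoring the documents `["a", "a", "b"]` and `["a", "b"]` gives identical Bernoulli log-likelihoods (same set of present terms) but different multinomial log-likelihoods.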
However, recent work has addressed some of the shortcomings of using the multinomial distribution for modeling text [198, 256]. A common argument against using a multinomial is that it insufficiently captures the “burstiness” of language. This property of language is derived from the observation that there is a higher chance of observing a term when it has already been observed before. Such burstiness also implies a power law distribution, similar to the Zipfian curve often observed in natural language [201, 360, 361]. Zipf’s law states that, if $f(r)$ is the frequency of the $r$-th most frequent event, then

$f(r) = \frac{C}{r},$
where $C$ is a constant (as well as the only parameter of the distribution). In practice this means that there are very few words which occur frequently and many words which occur rarely. Due to this distribution, the number of distinct words in a vocabulary does not grow linearly (but sublinearly) with the size of the collection. Alternative models that try to incorporate this information include the Dirichlet compound multinomial distribution  or the related hierarchical Pitman-Yor model . These distributions provide a better model of language use and the authors show significant improvements over the standard multinomial model. Sunehag  provides an analysis of such approaches and shows that TF.IDF follows naturally from them.
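The head-heavy shape implied by Zipf's law can be sketched numerically (the constant $c$ below is an arbitrary illustrative value):

```python
def zipf_freq(rank, c=1000.0):
    # Zipf's law with a single parameter c: f(r) = c / r.
    return c / rank

# Few frequent words dominate: under f(r) = c/r, the ten most frequent words
# are predicted to account for more occurrences than the next ninety combined.
top_10 = sum(zipf_freq(r) for r in range(1, 11))
next_90 = sum(zipf_freq(r) for r in range(11, 101))
```

The most frequent word is predicted to occur twice as often as the second most frequent, ten times as often as the tenth, and so on.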
The earliest work in the query likelihood family of approaches can be attributed to Kalt . He suggests that term probabilities for documents related to a single topic can be modeled by a single stochastic process; documents related to different topics would be generated by different stochastic processes. Kalt’s model treats each document as a sample from a topic language model. Since the problem he considered was text classification, “queries” were derived from a training set rather than solicited from actual users. Kalt’s approach was based on the maximum likelihood (ML) estimate (which will be introduced below in Eq. 2.4) and incorporated collection statistics, term frequency, and document length as integral parts of the model. Although later query likelihood approaches are more robust in that they consider each individual document (rather than a group of documents) as being described by an underlying language model, Kalt’s early work is clearly a precursor to language modeling for information retrieval.
In the multinomial unigram language modeling approach to IR, each document is represented as a multinomial probability distribution over all terms in the vocabulary. At retrieval time, each document is ranked according to the likelihood of having generated the query, which is why this model is commonly referred to as the query likelihood (QL) model. It determines the probability that the query terms $t \in Q$ are sampled from the document language model $\theta_D$ [134, 229, 240]:

$P(D \mid Q) = \frac{P(Q \mid \theta_D)\, P(D)}{P(Q)}, \quad \text{where} \quad P(Q \mid \theta_D) = \prod_{t \in Q} P(t \mid \theta_D)^{n(t, Q)}, \qquad (2.3)$
where $n(t, Q)$ denotes the count of term $t$ in query $Q$. The term $P(Q)$ is the same for all documents and, since it does not influence the ranking of documents for a given query, it can safely be ignored for ad hoc search. As is clear from Eq. 2.3, independence between terms in the query is assumed. Note that this formulation is exactly the source-channel model described above, only applied to document ranking. The term $P(D)$ is the prior probability of selecting document $D$ and may be used to model a document’s higher a priori chance of being relevant , for example based on its authoritativeness or the number of incoming links or citations [210, 336]. In all the experiments in this thesis we assume this probability to be uniform, however.
A common way of estimating a document’s generative language model is through the use of an ML estimate on the contents of the document:

$P(t \mid \tilde{\theta}_D) = \frac{n(t, D)}{|D|}. \qquad (2.4)$
Here, $|D|$ indicates the length of document $D$. It is an essential condition for retrieval models that are based on measuring the probability of observed data given a reference generative model that the reference model is adequately smoothed. Smoothing is applied both to avoid data sparsity (and, hence, zero-frequency) problems that occur with a maximum likelihood approach (for example, when one of the query terms does not appear in the document) and to account for general and document-specific language use. So, the goal of smoothing is to account for unseen events (terms) in the documents [65, 356]. Various types of smoothing have been proposed, including discounting techniques such as Laplace, Good-Turing, or leave-one-out smoothing. These methods add (or subtract) small amounts of probability mass with varying levels of sophistication. Another type is interpolation-based smoothing, which adjusts the probabilities of both seen and unseen events. One interpolation method commonly used in IR is Jelinek-Mercer smoothing, which considers each document to be a mixture of a document-specific model and a more general background model. Each document model is estimated using the maximum likelihood estimate of the terms in the document, linearly interpolated with a background language model [148, 229, 356]:

$P(t \mid \theta_D) = \lambda_D\, P(t \mid \tilde{\theta}_D) + (1 - \lambda_D)\, P(t \mid C).$
Here, $P(t \mid C)$ is calculated as the likelihood of observing $t$ in a sufficiently large corpus, such as the document collection $C$:

$P(t \mid C) = \frac{n(t, C)}{|C|}.$
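Jelinek-Mercer smoothing can be sketched as follows (toy data; the function name is hypothetical). Note how a term that is absent from the document still receives non-zero probability from the background model:

```python
def jm_prob(term, doc, collection, lam=0.5):
    # P(t|theta_D) = lam * n(t,D)/|D| + (1 - lam) * n(t,C)/|C|.
    p_doc = doc.count(term) / len(doc)
    p_coll = collection.count(term) / len(collection)
    return lam * p_doc + (1.0 - lam) * p_coll

doc = ["a", "b"]
collection = ["a", "b", "c", "a"]  # stands in for the full collection C
# "c" does not occur in the document, yet jm_prob("c", ...) > 0.
```

With `lam=0.5`, the unseen term "c" gets probability 0.5 * (1/4) = 0.125 instead of the zero that the ML estimate of Eq. 2.4 would assign.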
In this thesis, we use Bayesian smoothing using a Dirichlet prior, which has been shown to achieve superior performance on a variety of tasks and collections [30, 65, 191, 352, 356], and set:

$\lambda_D = \frac{|D|}{|D| + \mu},$
where $\mu$ is a hyperparameter that controls the level of smoothing and is typically set to the average document length of all documents in the collection.
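Putting the pieces together, a minimal sketch of Dirichlet-smoothed query likelihood scoring might look as follows (toy collection; $\mu$ is given a small illustrative value here rather than the average document length, and all names are hypothetical):

```python
import math
from collections import Counter

def dirichlet_prob(term, doc, coll_counts, coll_len, mu=10.0):
    # P(t|theta_D) = (n(t,D) + mu * P(t|C)) / (|D| + mu),
    # i.e. Jelinek-Mercer smoothing with lambda_D = |D| / (|D| + mu).
    p_coll = coll_counts[term] / coll_len
    return (doc.count(term) + mu * p_coll) / (len(doc) + mu)

def ql_score(query, doc, coll_counts, coll_len, mu=10.0):
    # log P(Q|theta_D): sum the log-probability of every query term occurrence.
    return sum(math.log(dirichlet_prob(t, doc, coll_counts, coll_len, mu)) for t in query)

d1 = ["apple", "pie", "apple"]
d2 = ["banana", "bread"]
coll_counts = Counter(d1 + d2)
coll_len = len(d1) + len(d2)
query = ["apple"]
```

The document containing the query term receives the higher query likelihood score, while the other document still gets a finite (non-zero-probability) score thanks to smoothing.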
Various improvements upon this model have been proposed with varying complexity. For example, Shakery and Zhai  use a graph-based method to smooth document models, similar to Mei et al. . Tao et al.  use document expansion to improve end-to-end retrieval.
Soon after its conception, the query likelihood model was generalized by realizing that an information need can also be represented as a language model. This way, a comparison of two language models forms the basis for ranking and, hence, a more general and flexible retrieval model than query likelihood was obtained. Several authors have proposed the use of the Kullback-Leibler (KL) divergence for ranking, since it is a well-established measure for the comparison of probability distributions with some intuitive properties: it always has a non-negative value and equal distributions receive a zero divergence value [173, 240, 346]. Using KL divergence, documents are scored by measuring the divergence between a query model $\theta_Q$ and a document model $\theta_D$. Since we want to assign a high score for high similarity and a low score for low similarity, the KL divergence is negated for ranking purposes. More formally, the score for each query-document pair using the KL divergence retrieval model is:

$\mathrm{Score}(Q, D) = -\mathrm{KL}(\theta_Q \| \theta_D) = \sum_{t \in V} P(t \mid \theta_Q) \log P(t \mid \theta_D) - \sum_{t \in V} P(t \mid \theta_Q) \log P(t \mid \theta_Q), \qquad (2.8)$
where $V$ denotes the set of all terms used in all documents in the collection. KL divergence is also known as relative entropy: it equals the cross-entropy of the observed distribution (in this case the query) as if it were generated by a reference distribution (in this case the document), minus the entropy of the observed distribution. KL divergence can also be measured in the reverse direction (also known as document likelihood), but this leads to poorer results for ad hoc search tasks . The entropy of the query, $-\sum_{t \in V} P(t \mid \theta_Q) \log P(t \mid \theta_Q)$, is a query-specific constant and can thus be ignored for ranking purposes in the case of ad hoc retrieval (cf. Section 3.2.1).
When the query model is estimated using the empirical ML estimate on the original query, i.e.,

$P(t \mid \tilde{\theta}_Q) = \frac{n(t, Q)}{|Q|}, \qquad (2.9)$
it can be shown that documents are ranked in the same order as under the query likelihood model from Eq. 2.3 . Later in this thesis, we use Eq. 2.8 in conjunction with Eq. 2.9 as a baseline retrieval model.
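This rank equivalence can be checked numerically. The sketch below (illustrative document models, assumed to be already smoothed; function names are hypothetical) scores one query against two document models with both the negated KL divergence and the query likelihood, arriving at the same ordering:

```python
import math
from collections import Counter

def neg_kl_score(query, doc_model):
    # -KL(theta_Q || theta_D), with theta_Q the ML estimate of the query.
    q_model = {t: c / len(query) for t, c in Counter(query).items()}
    return sum(p * (math.log(doc_model[t]) - math.log(p)) for t, p in q_model.items())

def ql_score(query, doc_model):
    # log P(Q | theta_D), the query likelihood score (uniform document prior).
    return sum(math.log(doc_model[t]) for t in query)

query = ["a", "a", "b"]
m1 = {"a": 0.5, "b": 0.5}  # illustrative (already smoothed) document models
m2 = {"a": 0.2, "b": 0.8}
```

The scores themselves differ (the KL score divides by the query length and adds the query entropy), but both functions prefer the same document, which is all that matters for ranking.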
Note that a query is a verbal expression of an underlying information need, and the query model (derived from the query) is therefore also only an estimate of this information need. Given that queries are typically short , this initial, crude estimate can often be improved upon by adding and reweighting terms. Since the KL divergence framework models the query with a language model of its own, elaborate ways of estimating or updating the query model may be employed, a procedure known as query modeling.
In order to obtain a query model that is a better estimate of the information need, the initial query may be interpolated with the expanded part [24, 172, 267, 354]. Effectively, this reweights the initial query terms and provides smoothing for the relatively sparse initial sample:

$P(t \mid \theta_Q) = \lambda_Q\, P(t \mid \tilde{\theta}_Q) + (1 - \lambda_Q)\, P(t \mid \hat{\theta}_Q),$

where $\hat{\theta}_Q$ denotes the expanded query model.
Figure 2.1 shows an example of an interpolated query model; query modeling will be further introduced in Section 2.3. In the remainder of this thesis, we will use this mechanism to incorporate relevance feedback information (Chapter 4) or to leverage conceptual knowledge in the form of document annotations (Chapter 5) or in the form of Wikipedia articles (Chapter 7). In Section 2.3 we zoom in on ways of estimating the expanded query model. We discuss the issue of setting the smoothing parameter in Section 3.4.
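A minimal sketch of such an interpolated query model (the expanded model below is invented for illustration, as if obtained from feedback documents):

```python
def interpolate_query_model(initial, expanded, lam=0.5):
    # P(t|theta_Q) = lam * P(t|initial model) + (1 - lam) * P(t|expanded model).
    terms = set(initial) | set(expanded)
    return {t: lam * initial.get(t, 0.0) + (1.0 - lam) * expanded.get(t, 0.0)
            for t in terms}

# ML model of the one-term query "jaguar", interpolated with a hypothetical
# expanded model that adds disambiguating terms.
initial = {"jaguar": 1.0}
expanded = {"jaguar": 0.4, "car": 0.3, "cat": 0.3}
theta_q = interpolate_query_model(initial, expanded, lam=0.5)
```

The result is again a probability distribution: the original query term keeps most of the mass, while the expansion terms receive smaller, non-zero weights.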
As indicated above, several researchers have attempted to relate the LM approach to traditional probabilistic approaches, including the PRP model . Sparck Jones and Robertson  examine the notion of relevance in both the PRP and the query likelihood language modeling approach. They identify the following two distinctions: first, the LM approach contains no explicit notion of relevance; second, the query is assumed to have been generated from (the language model of) a single relevant document.
Sparck Jones and Robertson emphasize that the last point implies that retrieval stops after the document that generated the query has been found. Furthermore, this fact, coupled with the simple assumption that query generation and relevance are correlated, makes it difficult to describe methods such as relevance feedback (or any method pertaining to relevance) in existing LM approaches.
Hiemstra and de Vries  relate LM to traditional approaches by comparing the QL model presented in  with the TF.IDF weighting scheme and with the combination with relevance weighting as done in Okapi BM25. Lafferty and Zhai  and Lavrenko and Croft  address the two issues mentioned above by suggesting new forms of LM for retrieval that are more closely related to the PRP model and that move away from estimating the query generation probability. Lafferty and Zhai  include a binary, latent variable that indicates the relevance of a document with respect to a query. They point out that document length normalization is an issue in PRP but not in LM; another difference is that in LM we typically have more data for estimation purposes than in PRP. Greiff  observes that the main contribution of LM is the recognition of the importance of parameter estimation in modeling and the treatment of term frequency as the manifestation of an underlying probability distribution rather than as the probability of word occurrence itself. Lavrenko and Croft  take a similar view and explicitly define a latent model of relevance. According to this model, both the query and the relevant documents are samples from this model. Hiemstra et al.  build upon work presented in  and also attempt to bridge the gap between PRP and LM. They posit that LM should not blindly model language use. Instead, LM should model what language use distinguishes a relevant document from the other documents. In Section 2.3.2 we introduce these approaches further. In Chapter 4 we evaluate their performance on three distinct test collections.
Another, more recent spin-off of the discussion centers around the notion of event spaces for probabilistic models . Since LM (and, in particular, the QL approach) is based on the probability of a query given a document, the event space would consist of queries in relation to a single, particular document. These event spaces would therefore be unique to each document. Under this interpretation, the query likelihood scores of different documents for the same query would not be comparable because they come from different probability distributions in different event spaces. In line with the observations above, this implies that relevance feedback (in the form of documents) for a given query is impossible (although relevant query feedback for a given document would indeed be feasible ). Luk  responds to Robertson in a fashion similar to  and proves that, under certain assumptions, the latent variable indicating relevance introduced by  is implicit in the ranking formula. Boscarino and de Vries  reply to Luk in turn and argue that this claim is also problematic; they state that Luk attempts to solve the issue at the statistical level, while it should be addressed through a proper selection of priors. All in all, a definitive bridge between PRP and LM is still missing. Even if the LM approach to IR is “misusing” some of its fundamental premises, the theoretical and experimental evidence suggests that the approach does indeed have merit.