Generative Language Modeling for IR

2.2 Generative Language Modeling for IR

The success of statistical language models (LMs) in improving automatic speech recognition (ASR), together with the practical challenges associated with using the PRP model, inspired several IR researchers to recast IR in a generative probabilistic framework in which documents are represented as generative probabilistic models.

The main task of automatic speech recognition is the transcription of spoken utterances. An effective and theoretically well-founded way of approaching this task is to estimate a probabilistic model based on the occurrences of word sequences in a particular language [147, 271]. Such models are distributions over term sequences (or n-grams, where $n$ indicates the length of each sequence) and can be used to compute the probability of observing a sequence of terms as the product of the probabilities of observing the individual terms, each conditioned on the terms that precede it within the n-gram window. Then, when a new piece of audio material $A$ needs to be transcribed, each possible interpretation of the observed audio is compared to this probabilistic model (the LM) and the most likely candidate $S^{*}$ is returned: $S^{*} = \arg\max_{S} P(S \mid A) = \arg\max_{S} P(A \mid S)\, P(S)$, where $P(S)$ is the probability that the LM assigns to the term sequence $S$ and $P(A \mid S)$ is the probability of observing the audio $A$ given that transcription.
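The selection of the most likely transcription can be illustrated with a minimal sketch. It assumes a bigram LM ($n = 2$) with add-alpha smoothing and assumes that the score of a candidate $S$ is the sum of an acoustic log-score $\log P(A \mid S)$ and the LM log-probability $\log P(S)$. The function names, the toy corpus, and the acoustic scores are hypothetical and serve only as illustration; they are not taken from the cited work.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Collect bigram counts, context counts, and the vocabulary."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()  # <s> marks the sequence start
        vocab.update(tokens[1:])
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, vocab


def lm_log_prob(sequence, model, alpha=1.0):
    """log P(S): sum of add-alpha smoothed bigram log-probabilities."""
    context_counts, bigram_counts, vocab = model
    tokens = ["<s>"] + sequence.split()
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, cur)] + alpha
        denominator = context_counts[prev] + alpha * len(vocab)
        total += math.log(numerator / denominator)
    return total


def best_transcription(acoustic_log_scores, model):
    """Return the candidate S maximizing log P(A | S) + log P(S).

    acoustic_log_scores maps each candidate string to a (hypothetical)
    acoustic log-score log P(A | S).
    """
    return max(
        acoustic_log_scores,
        key=lambda s: acoustic_log_scores[s] + lm_log_prob(s, model),
    )


# Toy corpus and candidates (illustrative values only).
corpus = ["recognize speech", "recognize speech today", "wreck a nice beach"]
model = train_bigram_lm(corpus)
candidates = {"recognize speech": -10.0, "wreck a nice beach": -10.0}
best = best_transcription(candidates, model)
```

In this toy setting both candidates receive the same acoustic score, so the choice is determined entirely by the LM; a sufficiently large difference in acoustic scores would override the LM preference. Working in log space avoids numerical underflow when many small probabilities are multiplied.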

2.2.1 Query Likelihood

2.2.2 KL divergence

2.2.3 Relation to Probabilistic Approaches