2.1 Information Retrieval

An information retrieval system implements a retrieval model that is used to generate a ranking of documents for a given query. A retrieval model is itself a formal representation of the process of matching a query and a document and is often (but not necessarily) based on a statistical view of language.

As described in the previous chapter, Boolean systems were the first popular retrieval models. They did not generate document rankings, but returned sets of documents fulfilling the (Boolean) query. They were superseded by the vector space model (VSM) [274], which would become the mainstream model for many years. It is based on a vector space whose dimensions are defined by the terms in the vocabulary. Queries and documents are represented as vectors and their similarity is defined using a measure in this space. The most commonly used measure is based on the cosine of the angle between vectors in the high-dimensional space (although other measures, such as the Euclidean distance, are also sometimes used). Each component of a vector can take either binary or more complex, real values. Examples of the latter include statistical information such as term frequency (TF) and inverse document frequency (IDF) [155, 258, 264]. TF and IDF are two notions that are not specific to the VSM, but are in common use in most retrieval models. The TF of a term in a document is defined as the relative frequency of occurrence of that term in the document. IDF is defined as the (log of the) inverse of the relative frequency of occurrence of a term in the entire collection. The underlying intuition is that documents with a high TF for a term are more likely to be relevant to queries containing this term. Moreover, terms that are infrequent in the collection are more discriminative and convey more information than frequent ones. Therefore, a common weighting scheme, called TF.IDF, is a simple multiplication of the two.
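
To make this concrete, the following minimal sketch ranks a toy collection by cosine similarity over TF.IDF vectors. The particular TF and IDF variants used here (relative frequency and a plain, unsmoothed logarithm) are illustrative assumptions; practical systems use many alternative formulations.

    import math
    from collections import Counter

    def idf_weights(docs):
        # IDF: log of the inverse fraction of documents containing each term
        n_docs = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        return {term: math.log(n_docs / df[term]) for term in df}

    def tf_idf(doc, idf):
        # Sparse TF.IDF vector: relative term frequency times IDF
        counts = Counter(doc)
        return {t: (c / len(doc)) * idf.get(t, 0.0) for t, c in counts.items()}

    def cosine(u, v):
        # Cosine of the angle between two sparse vectors (term -> weight dicts)
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    docs = [["retrieval", "model", "ranking"], ["language", "model", "probability"]]
    query = ["retrieval", "ranking"]
    idf = idf_weights(docs)
    vectors = [tf_idf(d, idf) for d in docs]
    query_vector = tf_idf(query, idf)  # query terms weighted with collection IDF
    ranking = sorted(range(len(docs)),
                     key=lambda i: cosine(query_vector, vectors[i]), reverse=True)
    print(ranking)  # the document sharing both query terms ranks first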

Other retrieval models exist, some of which are still in popular use today. Maron and Kuhns [203] were the first to explicitly incorporate the notion of relevance in a retrieval model (a broader discussion of “relevance” is given in Section 3.1) by developing a probabilistic indexing model. They moved beyond binary indexing of documents (as was common in Boolean systems, where each indexing term could be either present or absent) and proposed the use of indexing weights that were to be interpreted as probabilities. They viewed retrieval as an inference problem, in which an IR system should predict which documents in the collection would most probably be relevant to a query and then rank those documents in descending order of this computed probability of relevance. This idea is highly similar to the Naive Bayes method, a popular machine learning approach [187]. Given that the output of their system was a ranked result list, Maron and Kuhns [203] have often been credited with being the first to move beyond set-based retrieval and introduce the ranked lists that are still in common use today [313] (although Joyce and Needham [157] had employed a notion of TF to sort the list of matching documents two years earlier).
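
The connection with Naive Bayes can be made explicit with a short derivation; the notation below is ours, introduced only for illustration. Writing R for the event that a document d is relevant to a query q, Bayes' rule gives

\[ P(R \mid d, q) = \frac{P(d \mid R, q)\, P(R \mid q)}{P(d \mid q)}, \]

and under the “naive” assumption that index terms occur independently given relevance, the document likelihood factorizes as \( P(d \mid R, q) = \prod_{t \in d} P(t \mid R, q) \). Documents can then be ranked by a product (or, equivalently, a sum of logarithms) of per-term probabilities, which is precisely the structure of a Naive Bayes classifier.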

Robertson and Jones [264] proposed the RSJ model, which relies solely on IDF combined with relevance feedback. The RSJ model (or: probability ranking principle (PRP)) builds upon the ideas presented in Maron and Kuhns [203] and hinges on two probabilistic models: one for all non-relevant documents and one for all relevant ones [263, 264]. The PRP model is based on measuring the probability that a document will be relevant to a user, given a query (note that it does not measure the degree of relevance [261]). The higher this probability, the more likely the document is to be relevant to the user. Robertson [263] proves that ranking documents using the PRP (in which documents are ranked by decreasing probability of relevance) optimizes retrieval performance, under the condition that these probabilities are properly estimated. As may be clear, estimating these relevant and non-relevant models exactly is insurmountable in practice, and the PRP resorts to various approximation methods.
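
A standard instantiation of these two models is the RSJ term weight; the form below, with the customary 0.5 smoothing, is one common presentation in the literature rather than a verbatim reproduction of [264]:

\[ w_t = \log \frac{(r_t + 0.5)\,(N - n_t - R + r_t + 0.5)}{(n_t - r_t + 0.5)\,(R - r_t + 0.5)}, \]

where N is the number of documents in the collection, n_t the number of documents containing term t, R the number of known relevant documents, and r_t the number of relevant documents containing t. Without relevance information (R = r_t = 0), the weight reduces to an IDF-like quantity, \( \log \frac{N - n_t + 0.5}{n_t + 0.5} \), which illustrates the remark above that the model relies solely on IDF in the absence of feedback.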

The PRP model uses a binary representation of terms in documents, which was soon after generalized to TF information in the 2-Poisson model [122, 266]. Amati and Van Rijsbergen [8] present a generalization of the 2-Poisson model, called the divergence from randomness (DFR) model. It is built around the notion that the amount of information carried by a term in a document is proportional to the divergence of its term frequency within that document from its frequency in the collection. DFR is inspired by the idea that “good” descriptors of documents (terms or concepts from a controlled vocabulary, for example) are those that describe the information content and have discriminatory power [38, 325].
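
In its general form, the DFR score of a term in a document can be sketched as the product of two information measures; the decomposition below follows the usual presentation, with the concrete randomness model (e.g., a Poisson or geometric distribution) left open as a modeling choice:

\[ w(t, d) = \big(-\log_2 \mathrm{Prob}_1(tf)\big) \cdot \big(1 - \mathrm{Prob}_2(tf)\big), \]

where Prob_1(tf) is the probability of observing tf occurrences of term t in document d under a model of randomness estimated from the collection (the less likely under pure chance, the more informative the term), and Prob_2(tf) is the probability of observing one more occurrence of t in d given the occurrences already seen.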

The Okapi team developed another, much-extended version of the PRP model, now commonly known as (Okapi) BM25 [156]. It is a handcrafted approximation of the PRP model that makes effective use of TF and document length (a common form of its scoring function is given below). It also remains a common baseline in the IR literature [265]. A relatively new family of models, known as language modeling, appeared in the late 1990s and will be further introduced in the next section. Lafferty and Zhai [174] note that the PRP model can be considered rank equivalent to the language modeling approach, although this has caused some debate in recent literature [43, 83, 195, 259]. After we have discussed language modeling below, we return to this issue in Section 2.2.3.
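
For reference, the BM25 scoring function in its most commonly cited form is as follows; the parameter names and the typical default values (k_1 ≈ 1.2, b ≈ 0.75) vary slightly across the literature:

\[ \mathrm{BM25}(d, q) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{tf_{t,d}\,(k_1 + 1)}{tf_{t,d} + k_1 \big(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\big)}, \]

where tf_{t,d} is the frequency of term t in document d, |d| the length of d, avgdl the average document length in the collection, and k_1 and b free parameters that control term frequency saturation and document length normalization, respectively.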