2.4 Language Modeling Variations

A number of extensions and variants of language modeling for IR have been developed, most of which aim to address the vocabulary gap between queries and documents. In the previous sections we have seen techniques such as query modeling and relevance feedback. Other extensions include, but are not limited to, leveraging document structure, collection structure, and semantics. Other IR research avenues aim to develop models that use semantic information to improve performance with respect to standard bag-of-words models. Many of these approaches aim at concept-based retrieval, but differ in the nature of the concepts. They range from

latent topics derived from the document contents (as in latent semantic indexing (LSI) or latent Dirichlet allocation (LDA)),
document clusters in the collection, to
concepts (a priori defined, for example, in linguistic resources such as WordNet [21, 57] or structured knowledge sources such as DBpedia [69, 106, 205], as we will see in Chapter 5).

In the following sections we provide an overview of these models.

2.4.1 Topic Models

Building thesauri or other knowledge structures by hand is a very labor-intensive process. It is also difficult to get people to agree on a certain ordering and structuring of things. It is therefore attractive to automate this process by inferring such structures from text in an unsupervised manner, i.e., without any human intervention [151, 273, 291, 293]. For instance, a co-occurrence analysis of the entire collection might be applied to estimate dependencies between vocabulary terms [21, 67, 234]. Turney and Pantel [322] use a similar method, which is commonly referred to as statistical semantics. Alternatively, term dependencies may be determined on a query-dependent subset of the collection, such as a set of initially retrieved documents [224, 235, 345]. These dependencies may then be employed to locate terms related to the initial query. Spiegel and Bennet already suggested that dependency information between terms may be used to choose terms for query expansion [272, 298]. Peat and Willett [243], however, do not find significant improvements in retrieval performance using such methods for query expansion.
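To make the idea of collection-wide co-occurrence analysis concrete, the following is a minimal sketch in Python. It is not the method of any of the cited works: the use of document-level pointwise mutual information (PMI) as the dependency measure, the toy collection, and the function names `cooccurrence_pmi` and `expand_query` are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(docs):
    """Estimate term-term dependencies from document-level co-occurrence.

    Assumption: PMI over document frequencies is used as the dependency
    measure; other association measures could be substituted.
    """
    n_docs = len(docs)
    term_df = Counter()   # number of documents containing each term
    pair_df = Counter()   # number of documents containing both terms
    for doc in docs:
        terms = sorted(set(doc))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    pmi = {}
    for (a, b), n_ab in pair_df.items():
        score = math.log((n_ab * n_docs) / (term_df[a] * term_df[b]))
        pmi[(a, b)] = score
        pmi[(b, a)] = score
    return pmi


def expand_query(query_terms, pmi, k=2):
    """Return up to k non-query terms with the highest positive summed PMI."""
    scores = Counter()
    for (a, b), score in pmi.items():
        if a in query_terms and b not in query_terms:
            scores[b] += score
    return [term for term, score in scores.most_common(k) if score > 0]


# Hypothetical toy collection: each document is a list of tokens.
docs = [
    ["jaguar", "car", "engine"],
    ["jaguar", "car", "speed"],
    ["jaguar", "cat", "jungle"],
    ["cat", "jungle", "prey"],
    ["car", "engine", "speed"],
]
pmi = cooccurrence_pmi(docs)
expansion_terms = expand_query({"cat"}, pmi)
```

In this toy collection, terms that co-occur with "cat" more often than chance would predict are selected as expansion candidates, whereas a term that merely appears alongside it once in a frequent context receives a negative score and is filtered out.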

More recently, various data-driven models based on principal component analysis/singular value decomposition and posterior inference methods have caused a renewed interest in methods for automatically identifying implicit concepts in text. They capture hidden (latent) themes underlying the collection, much in the same way as explicit concepts. Unlike explicit topics (such as document or term annotations, addressed in Section 2.4.2), implicit topics are estimated from the data and group together terms that frequently occur together in the documents. The assumption is that in every document collection there exist a number of such topics and that every document describes some combination of them. The goal, then, is to apply some form of dimensionality reduction in order to represent documents as topic mixtures. In sum, topic models are statistical models of text that assume a hidden space of topics in which the collection is embedded [40]. Topic models are typically used as a way of expressing the “semantic” properties of a piece of text [303] and, at the same time, can address the vocabulary mismatch problem [105].

LSI was an early approach towards extracting term clusters from text [88]. It is based on applying singular value decomposition to a matrix containing document-term counts and effectively “collapses” similar terms into groups. Probabilistic latent semantic indexing (PLSI) evolved from LSI and adds a probabilistic interpretation that is based on a mixture decomposition derived from a latent class model [139]. Its formulation is very similar to the translation model given in Eq. 2.11:

$$P(t \mid \theta_{D}) = \sum_{z} P(t \mid z)\, P(z \mid D), \qquad (2.30)$$

where $z$ is a latent topic (or: aspect). However, they differ in that in the case of PLSI $P(t \mid \theta_{D})$ is given and the objective is to learn the probabilities $P(t \mid z)$ and $P(z \mid D)$, i.e., the probability of a term given a topic and the probability of each topic given a document, respectively. Learning is typically accomplished using an optimization algorithm such as EM [90]. In fact, in Chapter 5, we use a variant of this model to incorporate explicit topics in the form of document annotations to improve retrieval performance. PLSI has some issues, the most important of which is that it is a generative model only of the documents it is estimated on and does not generalize to new documents. This limitation is addressed in the LDA model [40], which is a fully generative approach to language modeling (in fact, Girolami and Kaban [112] show that PLSI is a maximum a posteriori estimated LDA model under a uniform Dirichlet prior).
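The EM procedure for estimating $P(t \mid z)$ and $P(z \mid D)$ in the mixture of Eq. 2.30 can be sketched as follows. This is a plain, unoptimized illustration using the standard EM updates for this mixture, not the implementation used in the cited work; the function name `plsi_em`, the random initialization, the fixed number of iterations, and the toy count matrix are assumptions.

```python
import math
import random


def plsi_em(counts, n_topics, n_iter=50, seed=0):
    """Fit P(t|z) and P(z|D) by EM on a document-term count matrix.

    counts[d][t] is the number of occurrences of term t in document d.
    Returns (p_t_z, p_z_d, log_likelihoods), where log_likelihoods[i] is
    the data log-likelihood under the parameters at the start of iteration i.
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        if total == 0:  # e.g. an empty document: fall back to uniform
            return [1.0 / len(row)] * len(row)
        return [x / total for x in row]

    # Assumed initialization: random positive values, normalized per row.
    p_t_z = [normalize([rng.random() + 1e-3 for _ in range(n_terms)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]

    log_likelihoods = []
    for _ in range(n_iter):
        new_t_z = [[0.0] * n_terms for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        ll = 0.0
        for d in range(n_docs):
            for t in range(n_terms):
                n = counts[d][t]
                if n == 0:
                    continue
                # E-step: P(z | D, t) is proportional to P(t|z) P(z|D).
                joint = [p_t_z[z][t] * p_z_d[d][z] for z in range(n_topics)]
                p_t_d = sum(joint)  # the mixture of Eq. 2.30
                ll += n * math.log(p_t_d)
                for z in range(n_topics):
                    expected = n * joint[z] / p_t_d
                    new_t_z[z][t] += expected
                    new_z_d[d][z] += expected
        # M-step: renormalize the expected counts.
        p_t_z = [normalize(row) for row in new_t_z]
        p_z_d = [normalize(row) for row in new_z_d]
        log_likelihoods.append(ll)
    return p_t_z, p_z_d, log_likelihoods


# Hypothetical toy data: 4 documents over a 4-term vocabulary.
counts = [
    [4, 3, 0, 0],
    [5, 2, 0, 1],
    [0, 0, 3, 4],
    [0, 1, 4, 3],
]
p_t_z, p_z_d, lls = plsi_em(counts, n_topics=2)
```

Note that `p_z_d` holds one topic distribution per training document only. This is the generalization problem mentioned above: for an unseen document there is no row in `p_z_d`, and one has to be estimated separately.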

Topic models have been applied in the context of IR [340] and text classification [40], among others [193]. Wei and Croft [340] use LDA to apply an additional level of language model smoothing. Pu and He [250] use “Independent Component Analysis” (a topic modeling variant) to determine so-called semantic clusters, defined by the learned topics. They sample terms for query modeling using relevance models on these clusters. This intuition is highly similar to our methods presented in Chapters 5 and 7, although we use explicit topics in the form of concepts instead of implicit topics.

2.4.2 Concept Models

In this thesis we define concepts to be cognitive units of meaning that have been formalized in a knowledge structure such as a controlled vocabulary, thesaurus, or ontology. Furthermore, we impose the restriction that such concepts should be agreed upon by a number of people (who typically are domain experts). So, this definition includes concepts taken from thesauri such as MeSH, but also Wikipedia articles (as captured, for example, in the DBpedia knowledge base). At the same time, it excludes machine-generated concepts (such as topics, clusters, or topic hierarchies) as well as personal, user-generated tags. Initially, such concepts (taken from a particular knowledge structure, described in some particular concept language) were used in IR for indexing purposes. The Cranfield experiments established, however, that retrieval performance using “controlled” indexing terms does not outperform using terms as they appear in the documents [74]. Later studies did not unanimously confirm this conclusion [35]. Various researchers continue to look for ways of (automatically) improving retrieval performance, using either manually or automatically identified concepts. In order for IR models and methods to leverage concepts from concept languages, the more general task of (automatically) linking free text to such concepts needs to be addressed. In this section we zoom in on approaches related to language modeling and/or IR. In Section 2.5 we discuss the issue from a more general viewpoint.

One of the first methods for automatically relating concepts to text was introduced in the 1980s. Giger [111] incorporated a mapping between concepts from a thesaurus and words as they appear in the collection. The main motivation was to move beyond text-based retrieval and bridge the semantic gap between the user and the information retrieval system. His algorithm first defines atomic concepts, which are string-based concept-to-term mappings. Then, documents are placed in disjoint groups based on so-called elementary logical conjuncts, which are defined through the atomic concepts. At retrieval time, the query is parsed and the sets of documents with the lowest distance to the requested concepts are returned. His ideas relate to recent work done by Zhou et al. [357, 358], who use so-called topic signatures to index and retrieve documents. These signatures consist of named entities recognized within each document and query; when named entities are not available, term pairs are used. The named entity recognition step in [357, 358] is automated and might not be completely accurate; we suspect that errors in this concept detection process do not strongly affect retrieval performance because pairs of concepts (topic signatures) are used for retrieval. Below, in Chapter 5, we rely on manually curated concept annotations, making the topic signatures superfluous.

Trieschnigg et al. [315] also use named entity recognition to obtain a conceptual representation of queries and documents. They conclude that searching only with an automatically obtained conceptual representation seriously degrades retrieval when searching for short documents. Interestingly, the same approach performs on par with text-only search when larger documents (full-text articles) are retrieved. Guo et al. [117] perform named entity recognition in queries; they recognize a single entity in each query and subsequently classify it into one of a very small set of predefined classes such as “movie” or “video game.” In our concept models (presented in Chapter 5), we do not impose the restriction of having a single concept per query and, furthermore, our list of candidate concepts is much larger. Several other approaches have been proposed that link queries to a limited set of categories. French et al. [104] present an approach that uses mappings between noun phrases and concepts for query expansion; to this end they employ so-called Entry Vocabulary Indexes [109]. These are calculated as a logit-like function, operating on contingency tables with counts of the number of times a noun phrase is and is not associated with a concept. The counts are obtained by looking at the documents that are annotated with a certain concept, much in the same way as the approach we present in Chapter 5. Bendersky and Croft [29] use part-of-speech tagging and a supervised machine learning technique to identify the “key noun phrases” in verbose natural language queries. Key noun phrases are phrases that convey the most information in a query and contribute most to the resulting retrieval performance.

Instead of using part-of-speech tagging, noun phrases, or named entity recognition, Gabrilovich and Markovitch [106] employ document-level annotations, in the form of Wikipedia articles and categories [205]. They perform semantic interpretation of unrestricted natural language texts by representing meaning in a high-dimensional space of concepts derived from Wikipedia. In this way, the strength of association between vocabulary terms and concepts can be quantified, which can subsequently be used to generate vectors of concepts for a piece of text, either a document or a query. In Chapter 7, we take a similar approach, based on machine learning and language modeling techniques, to obtain a query model estimated from Wikipedia articles relevant to the query. This approach is also similar to the intuitions behind the topic modeling approach described by Wei [339], which uses Open Directory Project (ODP) concepts in conjunction with generative language models. Instead of using concept-document associations, however, she uses an ad hoc approach based on the descriptions of the concepts in the concept language (in this case, ODP categories). Interestingly, all of these approaches open up the door to providing conceptual relevance feedback to users. Instead of suggesting vocabulary terms that are related to the query, we can now suggest related concepts that can, for example, be used for navigational purposes [165, 209, 285, 323] or directly for retrieval [254]. Trajkova and Gauch [314] describe another possible application; their system keeps track of a user’s history by classifying visited web pages into concepts from the ODP.
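The step of quantifying term-concept association and then generating a concept vector for a piece of text can be illustrated with a small sketch. It assumes a TF-IDF style weighting computed over the text of each concept's article; the three toy "articles", the whitespace tokenization, and the function names are hypothetical and do not reproduce the cited system.

```python
import math
from collections import Counter, defaultdict


def build_term_concept_weights(concept_texts):
    """Map each term to {concept: weight}.

    Assumption: weight = term frequency in the concept's text times the
    log inverse frequency of the term across concepts (a TF-IDF variant).
    Terms that occur in every concept get zero weight and are dropped.
    """
    n_concepts = len(concept_texts)
    tf = {concept: Counter(tokens) for concept, tokens in concept_texts.items()}
    df = Counter()
    for term_counts in tf.values():
        df.update(term_counts.keys())
    weights = defaultdict(dict)
    for concept, term_counts in tf.items():
        for term, n in term_counts.items():
            idf = math.log(n_concepts / df[term])
            if idf > 0:
                weights[term][concept] = n * idf
    return weights


def concept_vector(tokens, weights):
    """Represent a text as a weighted vector of concepts."""
    vector = Counter()
    for term in tokens:
        for concept, weight in weights.get(term, {}).items():
            vector[concept] += weight
    return vector


# Hypothetical concept descriptions standing in for Wikipedia articles.
concept_texts = {
    "Jaguar (animal)": "jaguar cat jungle prey cat".split(),
    "Jaguar Cars": "jaguar car engine car".split(),
    "Information retrieval": "query document retrieval query".split(),
}
weights = build_term_concept_weights(concept_texts)
query_vector = concept_vector("jaguar engine".split(), weights)
best_concept = query_vector.most_common(1)[0][0]
```

Here the ambiguous term "jaguar" contributes equally to both jaguar-related concepts, and the additional term "engine" tips the balance. The same function can be applied to a document instead of a query, which places both in one concept space.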

Concepts can be recognized at different levels of granularity, either at the term level, by recognizing concepts in the text, or at the document level, by using document-level annotations or categories. While the former can be described as a form of concept-based indexing [178], the latter is more related to text classification. Indeed, the mapping of vocabulary terms to concepts as described above is a text (or concept) classification algorithm [294].

Further examples of mapping queries to conceptual representations can be found in the area of web query classification. Broder et al. [47] use a pseudo relevance feedback technique to classify rare queries into a commercial taxonomy of web queries, with the goal of improving web advertisements. A classifier is used to classify the highest ranked results, and these classifications are subsequently used to classify the query by means of voting. We use a similar method to obtain the conceptual representation of our query, described in Section 5.1.1, with the important difference that all our documents have been manually classified.
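The voting step described above can be sketched in a few lines. The sketch assumes that each top-ranked document already carries one or more class labels (whether assigned by a classifier or manually), that every label counts as one unweighted vote, and that only the top `top_n` results vote; the document identifiers, labels, and function name are hypothetical.

```python
from collections import Counter


def classify_query_by_voting(ranked_doc_ids, doc_labels, top_n=3, n_classes=1):
    """Assign classes to a query from the labels of its top-ranked documents.

    Assumption: unweighted majority voting; rank- or score-weighted votes
    would be an equally plausible design.
    """
    votes = Counter()
    for doc_id in ranked_doc_ids[:top_n]:
        votes.update(doc_labels.get(doc_id, ()))
    return [label for label, _ in votes.most_common(n_classes)]


# Hypothetical ranked list for a query and per-document class labels.
ranked = ["d3", "d1", "d7", "d2"]
doc_labels = {
    "d3": ["Electronics"],
    "d1": ["Electronics", "Cameras"],
    "d7": ["Travel"],
    "d2": ["Travel"],
}
query_classes = classify_query_by_voting(ranked, doc_labels)
```

With `top_n=3`, document "d2" does not vote, so "Electronics" receives two votes against one each for "Cameras" and "Travel".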

Mishne and de Rijke [233] classify queries into taxonomies using category-based web services. Shen et al. [282] improve web query classification by mapping the query to concepts in an intermediate taxonomy, which in turn are linked to concepts in the target taxonomy. Chen et al. [66] use a taxonomy to suggest keywords. After mapping the seed keywords to a concept hierarchy, content phrases related to the found concepts are suggested. In Chapter 5, the concepts are used to update the query model, i.e., to update the probabilities of terms based on the found concepts rather than to add related discrete terms or phrases.

The use of a conceptual representation obtained from pseudo relevance feedback has also been investigated by researchers in the biomedical domain. Srinivasan [302] proposes directly adding concepts to an initial query and reports the largest improvement in retrieval effectiveness when another round of blind relevance feedback on vocabulary terms is applied afterwards. She creates a separate “concept index” in which tokenized concept labels are used as terms. In this way, searching using a concept labeled “Stomach cancer” also matches the related, but clearly different concept “Breast cancer” because they share the word “cancer”. In our opinion, this obfuscates the added value of using clearly defined concepts; searching with a textual representation containing the word “cancer” will already result in matching related concepts. In Section 6.4 we show that this kind of lexical matching does not perform well. Srinivasan concludes that concepts are beneficial for retrieval, but remarks that the OHSUMED collection used for evaluation was quite small. Our evaluation in Chapter 5 uses the larger Text Retrieval Conference (TREC) Genomics test collections and, additionally, investigates the use of document-level annotations in another domain using the Cross-Language Evaluation Forum (CLEF) Domain Specific test collections (cf. Section 3.3). Camous et al. [56] also use the annotations of the top-5 retrieved documents to obtain a conceptual query representation, but incorporate them in a different fashion. The authors use them to create a new ranked list of documents, which is subsequently combined with the initially retrieved documents.
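The effect of indexing tokenized concept labels, as discussed above for the “Stomach cancer” example, can be made explicit with a toy comparison between token-level and identifier-level matching. The label list beyond the two concepts named in the text, the lowercasing and whitespace tokenization, and the function names are assumptions for illustration.

```python
def tokenize(label):
    """Assumed tokenization: lowercase and split on whitespace."""
    return set(label.lower().split())


def match_tokenized(query_label, concept_labels):
    """Concepts whose tokenized label shares at least one word with the query."""
    query_tokens = tokenize(query_label)
    return [c for c in concept_labels if tokenize(c) & query_tokens]


def match_exact(query_label, concept_labels):
    """Concepts matched as atomic identifiers."""
    return [c for c in concept_labels if c == query_label]


# Hypothetical concept labels.
concept_labels = ["Stomach cancer", "Breast cancer", "Stomach ulcer", "Asthma"]
lexical_matches = match_tokenized("Stomach cancer", concept_labels)
exact_matches = match_exact("Stomach cancer", concept_labels)
```

Token-level matching returns every concept that shares a word with the query concept, whereas treating the concept as an atomic unit returns only the concept itself. This is the distinction that, in our view, a tokenized concept index blurs.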

In addition to query expansion, various ways of directly improving text-based retrieval by incorporating concepts or a concept language have been proposed. For example, the entries from a concept language may be used to define the indexing terms employed by the retrieval system [280].

2.4.3 Cluster-based Language Models

Work done on cluster-based retrieval can be viewed as a variation on the concept or topic modeling theme; in those cases, however, the clusters are defined by the concepts (hard clustering) or the latent topics (soft clustering) that are associated with the documents in the collection.

Cluster-based language models use document-document similarity to define coherent subsets of the collection. Document clusters can be construed as semantically coherent segments, each covering one “concept.” Indeed, Trieschnigg et al. [318] have shown that a nearest-neighbor clustering approach yields the best performance when classifying documents into MeSH terms. Kurland and Lee [171] determine overlapping clusters of documents in a collection, which are considered facets of the collection. They use a language modeling framework in which their aspect-x algorithm smooths documents based on the information from the clusters and the strength of the connection between each document and cluster. Liu and Croft [189] evaluate both the direct retrieval of clusters and cluster-based smoothing. Their CBDM model is a mixture of a document model, a collection model, and a model of the cluster each document belongs to; it significantly outperforms a standard query likelihood baseline. Instead of smoothing documents, Minker et al. [231] use cluster-based information for query expansion. The authors evaluate their algorithm on several small test collections, without achieving any improvements over the unexpanded queries. More recently, Lee et al. [185] have shown that detecting clusters in a set of (pseudo-)relevant documents is helpful for identifying dominant documents for a query and, thus, for subsequent query expansion, a finding which was corroborated on different test collections by Kurland [170]. In [126] we show that soft clustering using LDA can help to significantly improve result diversification performance, i.e., identifying and promoting relevant aspects of a query. These approaches all exploit the notion that “associations between documents convey information about the relevance of documents to requests” [145]. Indeed, if we have evidence that a given concept is relevant for a particular query, it is natural to assume that all documents labeled with this concept have a higher prior probability of being relevant to the query [325]. This is the main motivating idea for our model presented in Chapter 5.
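As an illustration of cluster-based smoothing of the kind described above, the sketch below mixes a document model, a cluster model, and a collection model. The text only states that the three are mixed; the two-level linear interpolation with fixed weights `lam` and `beta`, the maximum likelihood component estimates, and the toy data are assumptions rather than the exact formulation of the cited model.

```python
from collections import Counter


def ml_model(tokens):
    """Maximum likelihood unigram model of a token sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def smoothed_prob(term, doc_tokens, cluster_tokens, collection_tokens,
                  lam=0.5, beta=0.5):
    """P(term | D) as a mixture of document, cluster, and collection models.

    Assumed form:
        lam * P_ml(t|D) + (1 - lam) * (beta * P_ml(t|cluster)
                                        + (1 - beta) * P_ml(t|collection))
    """
    p_doc = ml_model(doc_tokens).get(term, 0.0)
    p_cluster = ml_model(cluster_tokens).get(term, 0.0)
    p_coll = ml_model(collection_tokens).get(term, 0.0)
    return lam * p_doc + (1 - lam) * (beta * p_cluster + (1 - beta) * p_coll)


# Hypothetical data: a short document, the pooled text of its cluster,
# and the pooled text of the whole collection.
doc = ["jaguar", "car"]
cluster = ["jaguar", "car", "car", "engine", "engine", "speed"]
collection = cluster + ["cat", "jungle", "prey", "cat", "jungle", "query"]

with_cluster = smoothed_prob("engine", doc, cluster, collection, beta=0.5)
without_cluster = smoothed_prob("engine", doc, cluster, collection, beta=0.0)
```

The document itself never mentions "engine", but its cluster does so relatively often, so the cluster component assigns the term more probability than collection-only smoothing (`beta=0.0`) would. A query likelihood score is then obtained by multiplying these probabilities over the query terms.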
