7.1 Linking queries to Wikipedia

To be able to derive query models based on the concepts meant by the query, we first need to link queries to concepts (in the form of Wikipedia articles or, equivalently, DBpedia concepts). To this end, we follow the approach from Chapter 6, which maps queries to DBpedia concepts; in this case, however, we subsequently apply query modeling. We take the best-performing settings from that chapter, i.e., an SVM with a polynomial kernel using full queries. Instead of the Sound and Vision dataset, we employ two ad hoc TREC test collections in tandem with a dump of the English version of Wikipedia (dump date 20090920).

In order to classify concepts as being relevant to a query, the approach uses manual query-to-concept annotations to train the SVM model. During testing, a retrieval run is performed on Wikipedia for new, unseen queries, and the results of this run are then classified using the learned model. The output of this step is a label for each concept, indicating whether or not it is relevant to the query; this dichotomy constitutes our binary classification problem.
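To make this train-and-classify pipeline concrete, the following is a minimal sketch in Python, assuming scikit-learn; retrieve() and featurize() are hypothetical placeholders for the Wikipedia retrieval run and the features described in Section 6.2.3, and the kernel degree is an assumption rather than a setting taken from this work.

# Minimal sketch of the linking pipeline, assuming scikit-learn.
# retrieve() and featurize() are hypothetical placeholders.
from sklearn.svm import SVC

def train_linker(X, y):
    """Train an SVM with a polynomial kernel on feature vectors X built from
    the manual query-to-concept annotations (y = 1: relevant, 0: not relevant)."""
    clf = SVC(kernel="poly", degree=2)  # polynomial kernel; the degree is an assumption
    clf.fit(X, y)
    return clf

def link_query(clf, query, retrieve, featurize, k=100):
    """Retrieve candidate Wikipedia articles for an unseen query and keep
    only those the learned model classifies as relevant concepts."""
    candidates = retrieve(query, k)  # top-k Wikipedia articles for the query
    return [article for article in candidates
            if clf.predict([featurize(query, article)])[0] == 1]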

Wikipedia and supervised machine learning have previously been used to select optimal terms to include in the query model [347]. We, however, are interested in selecting the concepts that best describe the query and in sampling terms from those. This is similar to the approach used, e.g., in the context of retrieving blogs [337], although such approaches are completely unsupervised in that they consider only a fixed number of pseudo-relevant Wikipedia articles. As we will see below, focusing this set using machine learning improves overall retrieval performance.

The features that we use include those pertaining to the query, the Wikipedia article, and their combination; see Section 6.2.3 for an extensive description of each. Since we are using ad hoc test collections in this case, we do not have session information and omit the history-based features. In order to obtain training data, we asked four annotators to manually identify all relevant Wikipedia articles for queries in the same fashion as presented in the previous chapter. The average number of Wikipedia articles the annotators identified per query is around 2 for both collections. The average number of articles identified as relevant per query by the SVM differs slightly between the test collections: 1.6 for TREC Terabyte and 2.7 for TREC Web 2009. This seems to be due to differences in the queries; the TREC Web queries are shorter and, thus, more prone to ambiguity.
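Purely as an illustration of how the three feature groups can be combined into a single vector, consider the sketch below; the concrete features shown (query length, article length, number of inlinks, retrieval score, title match) are hypothetical stand-ins, not the actual features of Section 6.2.3.

# Hypothetical illustration of assembling query, article, and query-article
# features into one vector; the real feature set is defined in Section 6.2.3.
def featurize(query, article, retrieval_score):
    query_features = [
        len(query.split()),                              # e.g., query length
    ]
    article_features = [
        len(article["text"].split()),                    # e.g., article length
        article["num_inlinks"],                          # e.g., number of inlinks
    ]
    query_article_features = [
        retrieval_score,                                 # score of the article for the query
        int(query.lower() in article["title"].lower()),  # query contained in the title
    ]
    return query_features + article_features + query_article_features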


Topic #  Query                              Concepts
2        french lick resort and casino      FRENCH LICK RESORT CASINO; FRENCH LICK, INDIANA
13       map                                MAP; TOPOGRAPHIC MAP; WORLD MAP; THE NATIONAL MAP
14       dinosaurs                          DINOSAURS; HARRY AND HIS BUCKET FULL OF DINOSAURS; WALKING WITH DINOSAURS
15       espn sports                        ESPN STAR SPORTS; ESPN; ESPN ON ABC
16       arizona game and fish              ARIZONA GAME AND FISH DEPARTMENT; LIST OF LAKES IN ARIZONA
17       poker tournaments                  POKER TOURNAMENT; ULTIMATE POKER CHALLENGE
23       yahoo                              YAHOO!; YAHOO! MUSIC; YAHOO! NEWS
24       diversity                          SPECIES DIVERSITY; GENETIC DIVERSITY; CULTURAL DIVERSITY
26       lower heart rate                   HEART RATE; HEART RATE VARIABILITY; DOPPLER FETAL MONITOR
28       inuyasha                           INUYASHA; LIST OF INUYASHA EPISODES; LIST OF INUYASHA CHARACTERS
39       disneyland hotel                   DISNEYLAND HOTEL (CALIFORNIA); DISNEYLAND HOTEL (PARIS); TOKYO DISNEYLAND HOTEL
41       orange county convention center    ORANGE COUNTY CONVENTION CENTER; ORANGE COUNTY, CALIFORNIA; LIST OF CONVENTION & EXHIBITION CENTERS
42       the music man                      THE MUSIC MAN; THE MUSIC MAN (1962 FILM); MUSIC MAN; THE MUSIC MAN (SONG)
45       solar panels                       PHOTOVOLTAIC MODULE
48       wilson antenna                     ROBERT WOODROW WILSON
49       flame designs                      FLAME OF RECCA; GEORDIE LAMP

Table 7.1: Examples of topics automatically linked to concepts on the TREC Web 2009 test collection.


(a) Topic #42 (“music man”).
(b) Topic #39 (“disneyland hotel”).


Figure 7.1: Example query models. The size of a term is proportional to its probability in the query model.


Let us look at some examples. Table 7.1 shows concepts identified by the SVM model on the TREC Web 2009 test collection. We first observe that, as pointed out above, the queries themselves are short and ambiguous. For the query “wilson antenna” (#48), the model predicts ROBERT WOODROW WILSON as the only relevant concept, classifying concepts such as MOUNT WILSON (CALIFORNIA) as not relevant. For the query “the music man” (#42) it identifies the company, the song, the film, and the musical, which reflects the inherent ambiguity that is typical of many web queries. The same effect can be observed for the query “disneyland hotel” (#39), with the concepts TOKYO DISNEYLAND HOTEL, DISNEYLAND HOTEL (CALIFORNIA), and DISNEYLAND HOTEL (PARIS). There are also mistakes, however, such as predicting the concepts FLAME OF RECCA and GEORDIE LAMP for the query “flame designs” (#49). The former is a Japanese manga series, whereas ‘Geordie’ was the nickname of the designer of a mine lamp intended to prevent firedamp explosions in coal mines.

In the next stage, we take the predicted concepts for each query and estimate query models from the Wikipedia articles associated with them. For this, we adopt the language modeling approach detailed in Section 2.2.2 and, as query model, we use the linear interpolation from Eq. 2.10. Recall that there, P(t|θ̃Q) indicates the empirical estimate on the initial query and P(t|θ̂Q) the expanded part. In Chapter 4, relevance model 1 (RM-1, cf. Eq. 2.24) had the most robust performance. We therefore use this model to obtain P(t|θ̂Q) and estimate it on the contents of the Wikipedia articles associated with the concepts. In essence, this method is similar to the one we presented in Chapter 5, where we used conceptual document annotations to (i) obtain a conceptual representation of each query and (ii) “translate” the found concepts to vocabulary terms. In this chapter, we use the learned SVM model for the first step. Since each concept is now associated with a single document (its Wikipedia article), we use these documents to update the estimate of the query model.
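To illustrate this estimation step, the sketch below builds an RM-1-style expanded model from the Wikipedia articles of the linked concepts and interpolates it with the initial query model. The Dirichlet smoothing parameter, the interpolation weight, and the function names are assumptions made for the example; this is a sketch, not the exact estimator of Chapter 4.

from collections import Counter

def expanded_query_model(query_terms, articles, mu=2500, collection_lm=None):
    """articles: one token list per linked Wikipedia article. Returns an
    RM-1-style estimate of P(t | expanded query model)."""
    collection_lm = collection_lm or {}
    p_hat = Counter()
    for tokens in articles:
        if not tokens:
            continue
        tf = Counter(tokens)
        length = len(tokens)
        # The (Dirichlet-smoothed) query likelihood of the article acts as its weight.
        weight = 1.0
        for q in query_terms:
            weight *= (tf[q] + mu * collection_lm.get(q, 1e-6)) / (length + mu)
        for t, count in tf.items():
            p_hat[t] += weight * count / length
    total = sum(p_hat.values()) or 1.0
    return {t: mass / total for t, mass in p_hat.items()}

def interpolated_query_model(p_initial, p_expanded, lam=0.5):
    """Linear interpolation of the empirical and expanded query models."""
    vocab = set(p_initial) | set(p_expanded)
    return {t: lam * p_initial.get(t, 0.0) + (1.0 - lam) * p_expanded.get(t, 0.0)
            for t in vocab}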

Figure 7.1 shows two example query models for topics #42 and #39 from the TREC Web 2009 test collection. We note that the initial query terms receive the largest probability mass and that the terms that are introduced seem mostly related to the topic.