7.1 Linking queries to Wikipedia

To be able to derive query models based on the concepts meant by the query, we first need to link queries to concepts (in the form of Wikipedia articles or, equivalently, DBpedia concepts). To this end, we follow the approach from Chapter 6, which maps queries to DBpedia concepts; in this case, however, we subsequently apply query modeling. We take the best-performing settings from that chapter, i.e., an SVM with a polynomial kernel using full queries. Instead of the Sound and Vision dataset, we employ two ad hoc TREC test collections in tandem with a dump of the English version of Wikipedia (dump date 20090920).

In order to classify concepts as being relevant to a query, the approach uses manual query-to-concept annotations to train the SVM model. During testing, a retrieval run is performed on Wikipedia for new, unseen queries, and the results of this run are then classified using the learned model. The output of this step is a label for each concept, indicating whether or not it is relevant to the query; this dichotomy constitutes our binary classification problem.
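To make this train-and-classify pipeline concrete, the following is a minimal sketch in Python, assuming scikit-learn; retrieve() and featurize() are hypothetical placeholders for the Wikipedia retrieval run and the features described in Section 6.2.3, and the kernel degree is an assumption rather than a setting taken from this work.

# Minimal sketch of the linking pipeline, assuming scikit-learn.
# retrieve() and featurize() are hypothetical placeholders.
from sklearn.svm import SVC

def train_linker(X, y):
    """Train an SVM with a polynomial kernel on feature vectors X built from
    the manual query-to-concept annotations (y = 1: relevant, 0: not relevant)."""
    clf = SVC(kernel="poly", degree=2)  # polynomial kernel; the degree is an assumption
    clf.fit(X, y)
    return clf

def link_query(clf, query, retrieve, featurize, k=100):
    """Retrieve candidate Wikipedia articles for an unseen query and keep
    only those the learned model classifies as relevant concepts."""
    candidates = retrieve(query, k)  # top-k Wikipedia articles for the query
    return [article for article in candidates
            if clf.predict([featurize(query, article)])[0] == 1]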

Wikipedia and supervised machine learning have previously been used to select optimal terms to include in the query model [347]. We, however, are interested in selecting the concepts that best describe the query and in sampling terms from those. This is similar to the approach used, e.g., in the context of retrieving blogs [337], although such approaches are completely unsupervised in that they consider only a fixed number of pseudo-relevant Wikipedia articles. As we will see below, focusing this set using machine learning improves overall retrieval performance.

The features that we use include those pertaining to the query, the Wikipedia article, and their combination; see Section 6.2.3 for an extensive description of each. Since we are using ad hoc test collections in this case, we do not have session information and omit the history-based features. In order to obtain training data, we asked four annotators to manually identify all relevant Wikipedia articles for queries in the same fashion as presented in the previous chapter. The average number of Wikipedia articles the annotators identified per query is around 2 for both collections. The average number of articles identified as relevant per query by the SVM differs slightly between the test collections: 1.6 for TREC Terabyte and 2.7 for TREC Web 2009. This seems to be due to differences in the queries; the TREC Web queries are shorter and, thus, more prone to ambiguity.
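Purely as an illustration of how the three feature groups can be combined into a single vector, consider the sketch below; the concrete features shown (query length, article length, number of inlinks, retrieval score, title match) are hypothetical stand-ins, not the actual features of Section 6.2.3.

# Hypothetical illustration of assembling query, article, and query-article
# features into one vector; the real feature set is defined in Section 6.2.3.
def featurize(query, article, retrieval_score):
    query_features = [
        len(query.split()),                              # e.g., query length
    ]
    article_features = [
        len(article["text"].split()),                    # e.g., article length
        article["num_inlinks"],                          # e.g., number of inlinks
    ]
    query_article_features = [
        retrieval_score,                                 # score of the article for the query
        int(query.lower() in article["title"].lower()),  # query contained in the title
    ]
    return query_features + article_features + query_article_features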


Topic #  Query                              Concepts
2        french lick resort and casino      FRENCH LICK RESORT CASINO; FRENCH LICK, INDIANA
13       map                                MAP; TOPOGRAPHIC MAP; WORLD MAP; THE NATIONAL MAP
14       dinosaurs                          DINOSAURS; HARRY AND HIS BUCKET FULL OF DINOSAURS; WALKING WITH DINOSAURS
15       espn sports                        ESPN STAR SPORTS; ESPN; ESPN ON ABC
16       arizona game and fish              ARIZONA GAME AND FISH DEPARTMENT; LIST OF LAKES IN ARIZONA
17       poker tournaments                  POKER TOURNAMENT; ULTIMATE POKER CHALLENGE
23       yahoo                              YAHOO!; YAHOO! MUSIC; YAHOO! NEWS
24       diversity                          SPECIES DIVERSITY; GENETIC DIVERSITY; CULTURAL DIVERSITY
26       lower heart rate                   HEART RATE; HEART RATE VARIABILITY; DOPPLER FETAL MONITOR
28       inuyasha                           INUYASHA; LIST OF INUYASHA EPISODES; LIST OF INUYASHA CHARACTERS
39       disneyland hotel                   DISNEYLAND HOTEL (CALIFORNIA); DISNEYLAND HOTEL (PARIS); TOKYO DISNEYLAND HOTEL
41       orange county convention center    ORANGE COUNTY CONVENTION CENTER; ORANGE COUNTY, CALIFORNIA; LIST OF CONVENTION & EXHIBITION CENTERS
42       the music man                      THE MUSIC MAN; THE MUSIC MAN (1962 FILM); MUSIC MAN; THE MUSIC MAN (SONG)
45       solar panels                       PHOTOVOLTAIC MODULE
48       wilson antenna                     ROBERT WOODROW WILSON
49       flame designs                      FLAME OF RECCA; GEORDIE LAMP

Table 7.1: Examples of topics automatically linked to concepts on the TREC Web 2009 test collection.


(a) Topic #42 (“music man”).
(b) Topic #39 (“disneyland hotel”).


Figure 7.1: Example query models. The size of a term is proportional to its probability in the query model.


Let us look at some examples. Table 7.1 shows concepts identified by the SVM model on the TREC Web 2009 test collection. We first observe that, as pointed out above, the queries themselves are short and ambiguous. For the query “wilson antenna” (#48), the model predicts ROBERT WOODROW WILSON as the only relevant concept, classifying concepts such as MOUNT WILSON (CALIFORNIA) as not relevant. For the query “the music man” (#42) it identifies the company, the song, the film, and the musical, which reflects the inherent ambiguity that is typical of many web queries. The same effect can be observed for the query “disneyland hotel” (#39), with the concepts TOKYO DISNEYLAND HOTEL, DISNEYLAND HOTEL (CALIFORNIA), and DISNEYLAND HOTEL (PARIS). There are also mistakes, however, such as predicting the concepts FLAME OF RECCA and GEORDIE LAMP for the query “flame designs” (#49). The former is a Japanese manga series, whereas ‘Geordie’ was the nickname of the designer of a mine lamp intended to prevent firedamp explosions in coal mines.

In the next stage, we take the predicted concepts for each query and estimate query models from the Wikipedia articles associated with them. For this, we adopt the language modeling approach detailed in Section 2.2.2 and, as query model, we use the linear interpolation from Eq. 2.10. Recall that there, P(t|θ̃Q) indicates the empirical estimate on the initial query and P(t|θ̂Q) the expanded part. In Chapter 4, relevance model 1 (RM-1, cf. Eq. 2.24) had the most robust performance. We therefore use this model to obtain P(t|θ̂Q) and estimate it on the contents of the Wikipedia articles associated with the concepts. In essence, this method is similar to the one we presented in Chapter 5, where we used conceptual document annotations to (i) obtain a conceptual representation of each query and (ii) “translate” the found concepts to vocabulary terms. In this chapter, we use the learned SVM model for the first step. Since each concept is now associated with a single document (its Wikipedia article), we use these documents to update the estimate of the query model.
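To illustrate this estimation step, the sketch below builds an RM-1-style expanded model from the Wikipedia articles of the linked concepts and interpolates it with the initial query model. The Dirichlet smoothing parameter, the interpolation weight, and the function names are assumptions made for the example; this is a sketch, not the exact estimator of Chapter 4.

from collections import Counter

def expanded_query_model(query_terms, articles, mu=2500, collection_lm=None):
    """articles: one token list per linked Wikipedia article. Returns an
    RM-1-style estimate of P(t | expanded query model)."""
    collection_lm = collection_lm or {}
    p_hat = Counter()
    for tokens in articles:
        if not tokens:
            continue
        tf = Counter(tokens)
        length = len(tokens)
        # The (Dirichlet-smoothed) query likelihood of the article acts as its weight.
        weight = 1.0
        for q in query_terms:
            weight *= (tf[q] + mu * collection_lm.get(q, 1e-6)) / (length + mu)
        for t, count in tf.items():
            p_hat[t] += weight * count / length
    total = sum(p_hat.values()) or 1.0
    return {t: mass / total for t, mass in p_hat.items()}

def interpolated_query_model(p_initial, p_expanded, lam=0.5):
    """Linear interpolation of the empirical and expanded query models."""
    vocab = set(p_initial) | set(p_expanded)
    return {t: lam * p_initial.get(t, 0.0) + (1.0 - lam) * p_expanded.get(t, 0.0)
            for t in vocab}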

Figure 7.1 shows two example query models for topics #42 and #39 from the TREC Web 2009 test collection. We note that the initial query terms receive the largest probability mass and that the terms that are introduced seem mostly related to the topic.