
### 7.1 Linking queries to Wikipedia

To derive query models based on the concepts meant by a query, we first need to link queries to concepts (in the form of Wikipedia articles or, equivalently, DBpedia concepts). To this end, we follow the approach from Chapter 6, which maps queries to DBpedia concepts. In this case, however, we subsequently apply query modeling. We take the best performing settings from that chapter, i.e., SVM with a polynomial kernel using full queries. Instead of using the Sound and Vision dataset, however, we employ two ad hoc TREC test collections in tandem with a dump of the English version of Wikipedia (dump date 20090920).

In order to classify concepts as being relevant to a query, the approach uses manual query-to-concept annotations to train the SVM model. During testing, a retrieval run is performed on Wikipedia for new, unseen queries; the results are then classified using the learned model. The output of this step is a label for each concept, indicating whether it is relevant or not. This dichotomy represents our binary classification problem.
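The classification step above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the feature vectors and their dimensions here are hypothetical placeholders, not the actual features of Section 6.2.3.

```python
# Sketch of the query-to-concept classification step, assuming scikit-learn.
# Feature values below are invented for illustration only.
from sklearn.svm import SVC

# Each row is a feature vector for one (query, Wikipedia article) pair;
# labels come from the manual query-to-concept annotations:
# 1 = relevant, 0 = not relevant.
X_train = [
    [0.9, 1.0, 0.8],
    [0.2, 0.0, 0.1],
    [0.7, 0.5, 0.9],
    [0.1, 0.2, 0.0],
]
y_train = [1, 0, 1, 0]

# Polynomial kernel: the best-performing setting reported in Chapter 6.
clf = SVC(kernel="poly", degree=3)
clf.fit(X_train, y_train)

# At test time, a retrieval run on Wikipedia yields candidate articles for an
# unseen query; each candidate concept is labeled relevant or not.
X_test = [[0.8, 0.9, 0.7], [0.15, 0.1, 0.05]]
labels = clf.predict(X_test)
```

Each predicted label decides whether the corresponding Wikipedia article enters the set of concepts used for query modeling.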

Wikipedia and supervised machine learning have previously been used to select optimal terms to include in the query model [347]. We, however, are interested in selecting those concepts that best describe the query and using them to sample terms from. This is similar to unsupervised approaches used, e.g., in the context of retrieving blogs [337]. Such approaches are completely unsupervised in that they only consider a fixed number of pseudo-relevant Wikipedia articles. As we will see below, focusing this set using machine learning improves overall retrieval performance.

The features that we use include those pertaining to the query, the Wikipedia article, and their combination. See Section 6.2.3 for an extensive description of each. Since we are using ad hoc test collections in this case, we do not have session information and omit the history-based features. In order to obtain training data, we asked four annotators to manually identify all relevant Wikipedia articles for queries in the same fashion as presented in the previous chapter. The average number of Wikipedia articles the annotators identified per query is around 2 for both collections. The average number of articles identified as relevant per query by SVM differs slightly between the test collections, with 1.6 for TREC Terabyte and 2.7 for TREC Web 2009. This seems to be due to the differences in queries; the TREC Web queries are shorter and, thus, more prone to ambiguity.
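To make the three feature groups concrete, the following sketch computes an illustrative feature vector for one (query, article) pair. The feature names here are hypothetical examples of the query, article, and combined groups; the actual feature set is defined in Section 6.2.3 and differs from these.

```python
# Hypothetical illustration of the three feature groups for one
# (query, Wikipedia article) pair. Real features: see Section 6.2.3.
def feature_vector(query, title, article_text):
    q_terms = query.lower().split()
    title_terms = title.lower().split()
    a_terms = article_text.lower().split()
    # Fraction of query terms appearing in the article title / text.
    overlap_title = len(set(q_terms) & set(title_terms)) / len(q_terms)
    overlap_text = len(set(q_terms) & set(a_terms)) / len(q_terms)
    return [
        len(q_terms),    # query feature: query length
        len(a_terms),    # article feature: article length
        overlap_title,   # combined feature: query/title term overlap
        overlap_text,    # combined feature: query/text term overlap
    ]

fv = feature_vector(
    "disneyland hotel",
    "Disneyland Hotel (California)",
    "the disneyland hotel is a resort hotel in anaheim",
)
```

History-based (session) features would form a fourth group, but are omitted here since the ad hoc collections provide no session information.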

| Topic # | Query | Concepts |
|---|---|---|
| 2 | french lick resort and casino | FRENCH LICK RESORT CASINO; FRENCH LICK, INDIANA |
| 13 | map | MAP; TOPOGRAPHIC MAP; WORLD MAP; THE NATIONAL MAP |
| 14 | dinosaurs | DINOSAURS; HARRY AND HIS BUCKET FULL OF DINOSAURS; WALKING WITH DINOSAURS |
| 15 | espn sports | ESPN STAR SPORTS; ESPN; ESPN ON ABC |
| 16 | arizona game and fish | ARIZONA GAME AND FISH DEPARTMENT; LIST OF LAKES IN ARIZONA |
| 17 | poker tournaments | POKER TOURNAMENT; ULTIMATE POKER CHALLENGE |
| 23 | yahoo | YAHOO!; YAHOO! MUSIC; YAHOO! NEWS |
| 24 | diversity | SPECIES DIVERSITY; GENETIC DIVERSITY; CULTURAL DIVERSITY |
| 26 | lower heart rate | HEART RATE; HEART RATE VARIABILITY; DOPPLER FETAL MONITOR |
| 28 | inuyasha | INUYASHA; LIST OF INUYASHA EPISODES; LIST OF INUYASHA CHARACTERS |
| 39 | disneyland hotel | DISNEYLAND HOTEL (CALIFORNIA); DISNEYLAND HOTEL (PARIS); TOKYO DISNEYLAND HOTEL |
| 41 | orange county convention center | ORANGE COUNTY CONVENTION CENTER; ORANGE COUNTY, CALIFORNIA; LIST OF CONVENTION & EXHIBITION CENTERS |
| 42 | the music man | THE MUSIC MAN; THE MUSIC MAN (1962 FILM); MUSIC MAN; THE MUSIC MAN (SONG) |
| 45 | solar panels | PHOTOVOLTAIC MODULE |
| 48 | wilson antenna | ROBERT WOODROW WILSON |
| 49 | flame designs | FLAME OF RECCA; GEORDIE LAMP |
Table 7.1: Examples of topics automatically linked to concepts on the TREC Web 2009 test collection.

Let’s look at some examples. Table 7.1 shows examples of concepts that are identified by the SVM model on the TREC Web 2009 test collection. We first observe that, as pointed out above, the queries themselves are short and ambiguous. For the query “wilson antenna” (#48), the model predicts ROBERT WOODROW WILSON as the only relevant concept, classifying concepts such as MOUNT WILSON (CALIFORNIA) as not relevant. For the query “the music man” (#42) it identifies the company, song, film, and musical, which indicates the inherent ambiguity that is typical of many web queries. The same effect can be observed for the query “disneyland hotel” (#39), with concepts TOKYO DISNEYLAND HOTEL, DISNEYLAND HOTEL (CALIFORNIA), and DISNEYLAND HOTEL (PARIS). There are also mistakes, however, such as predicting the concepts FLAME OF RECCA and GEORDIE LAMP for the query “flame designs” (#49). The first concept is a Japanese manga series, whereas “Geordie” was the nickname of the designer of the mine lamp that served as a solution to explosions caused by firedamp in coal mines.

In the next stage, we take the predicted concepts for each query and estimate query models from the Wikipedia article associated with each concept. For this, we adopt the language modeling approach detailed in Section 2.2.2 and as query model we use the linear interpolation from Eq. 2.10. Recall that there, $P(t \mid \tilde{\theta}_Q)$ indicates the empirical estimate on the initial query and $P(t \mid \hat{\theta}_Q)$ the expanded part. In Chapter 4, relevance model 1 (RM-1, cf. Eq. 2.24) had the most robust performance. We therefore use this model to obtain $P(t \mid \hat{\theta}_Q)$ and estimate it on the contents of the Wikipedia articles associated with the concepts. In essence, this method is similar to the one we presented in Chapter 5. There, we used conceptual document annotations to (i) obtain a conceptual representation of each query and to (ii) “translate” the found concepts to vocabulary terms. In this chapter, we use the learned SVM model to obtain the first step. Since each concept is now associated with a single document (the Wikipedia article), we use those to update the estimate of the query model.
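The interpolation step can be sketched as follows. This is a simplified illustration: a uniform mixture of maximum-likelihood article models stands in for the RM-1 estimate, and the interpolation weight `lam` is an assumed parameter.

```python
# Sketch of estimating an expanded query model by linear interpolation:
# P(t | theta_Q) = lam * P(t | ~theta_Q) + (1 - lam) * P(t | ^theta_Q).
# A uniform mixture of article language models approximates RM-1 here.
from collections import Counter

def mle(text):
    """Maximum-likelihood term distribution of a text."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def expanded_query_model(query, concept_articles, lam=0.5):
    # Empirical estimate on the initial query: P(t | ~theta_Q).
    p_tilde = mle(query)
    # Expansion part P(t | ^theta_Q): average of the language models of the
    # Wikipedia articles linked to the predicted concepts.
    p_hat = Counter()
    for article in concept_articles:
        for t, p in mle(article).items():
            p_hat[t] += p / len(concept_articles)
    # Linear interpolation over the union of both vocabularies.
    terms = set(p_tilde) | set(p_hat)
    return {t: lam * p_tilde.get(t, 0.0) + (1 - lam) * p_hat.get(t, 0.0)
            for t in terms}

model = expanded_query_model(
    "music man",
    ["the music man is a musical", "music man film 1962 musical"],
)
```

With `lam = 0.5`, the initial query terms keep the largest probability mass while related terms from the concept articles receive the remainder, mirroring the behavior shown in Figure 7.1.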

Figure 7.1 shows two example query models for topics #42 and #39 from the TREC Web 2009 test collection. We note that the initial query terms receive the largest probability mass and that the terms that are introduced seem mostly related to the topic.
