Approach

6.2 Approach

Our approach for mapping search engine queries to concepts consists of two stages. In the first stage, we select a set of candidate concepts. In the second stage, we use supervised machine learning to classify each candidate concept as being intended by the query or not.

In order to find candidate concepts in the first stage, we leverage the textual descriptions (rdfs:comment and/or dbpprop:abstract in the case of DBpedia) of the concepts as each description of a concept may contain related words, synonyms, or alternative terms that refer to the concept. An example is given in Table 6.1, while the Wikipedia article it is extracted from is shown in Figure 6.2. From this example it is clear that the use of such properties for retrieval improves recall (we find BARACK OBAMA using the terms “President of the United States”) at the cost of precision (we also find BARACK OBAMA when searching for “John McCain”). In order to use the concept descriptions, we adopt a language modeling for information retrieval framework to create a ranked list of candidate concepts. This framework will be further introduced in Section 6.2.1.

Since we are dealing with an ontology extracted from Wikipedia, we have several options with respect to which textual representation(s) we use. Natural possibilities include: (i) the title of the article (similar to a lexical matching approach where only the rdfs:label is used), (ii) the first sentence or paragraph of an article (where a definition should be provided according to the Wikipedia guidelines [342]), (iii) the full text of the article, (iv) the anchor texts of the incoming hyperlinks from other articles, and (v) a combination of any of these. For our experiments we aim to maximize recall and use the combination of all available fields with or without the incoming anchor texts. In Section 6.5.2 we discuss the relative performance of each field and of their combinations.

For the first stage, we also vary the way we handle the query. In the simplest case, we take the query as is and retrieve concepts for the query in its entirety. As an alternative, we consider extracting all possible n-grams from the query, generating a ranked list for each, and merging the results. An example of what happens when we vary the query representation is given in Table 6.2 for the query “obama white house.” From this example it is clear why we differentiate between the two ways of representing the query. If we simply use the full query on its own (first row), we miss the relevant concept BARACK OBAMA. However, as can be seen from the last two rows, considering all n-grams also introduces noise.

N-gram ( $Q$ )	Candidate concepts
obama white house	WHITE HOUSE; WHITE HOUSE STATION; PRESIDENT COOLIDGE;
	SENSATION WHITE
obama white	MICHELLE OBAMA; BARACK OBAMA; DEMOCRATIC PRE-ELECTIONS 2008;
	JANUARY 17
white house	WHITE HOUSE; WHITE HOUSE STATION; SENSATION WHITE;
	PRESIDENT COOLIDGE
obama	BARACK OBAMA; MICHELLE OBAMA; PRESIDENTIAL ELECTIONS 2008;
	HILLARY CLINTON
white	COLONEL WHITE; EDWARD WHITE; WHITE COUNTY;
	WHITE PLAINS ROAD LINE
house	HOUSE; ROYAL OPERA HOUSE; SYDNEY OPERA HOUSE; FULL HOUSE

Table 6.2: An example of generating n-grams for the query “obama white house” and retrieved candidate concepts, ranked by retrieval score. Correct concepts in boldface.

In the second stage, a supervised machine learning approach is used to classify each candidate concept as either relevant or non-relevant or, in other words, to decide which of the candidate concepts from the first stage should be kept as viable concepts for the query in question. In order to create training material for the machine learning algorithms, we asked human annotators to assess search engine queries and manually map them to relevant DBpedia concepts. More details about the test collection and manual annotations are provided in Section 6.3. The machine learning algorithms we consider are Naive Bayes, Decision Trees, and Support Vector Machines [326, 344] which are further detailed in Section 6.2.2. As input for the machine learning algorithms we need to extract a number of features. We consider features pertaining to the query, concept, their combination, and the session in which the query appears; these are specified in Section 6.2.3.

6.2.1 Ranking Concepts

We base our concept ranking framework within the language modeling paradigm as introduced in Chapter 2. For the n-gram based scoring method, we extract all n-grams from each query $Q$ (where $1 \leq n \leq | Q |$ ) and create a ranked list of concepts for each individual n-gram, $Q$ . For the full query based reranking approach, we use the same method but add the additional constraint that $n = | Q |$ . The problem of ranking DBpedia concepts given $Q$ can then be formulated as follows. Each concept $c$ should be ranked according to the probability $P (c | Q)$ that it was generated by the n-gram, which can be rewritten using Bayes’ rule as: $\begin{array}{rcl} P (c | Q) & = & \frac{P (Q | c) P (c)}{P (Q)} . & (6.1) \end{array}$

Here, for a fixed n-gram $Q$ , the term $P (Q)$ is the same for all concepts and can be ignored for ranking purposes. The term $P (c)$ indicates the prior probability of selecting a concept, which we assume to be uniform. Assuming independence between the individual terms $q \in Q$ (cf. Eq. 2.3) we obtain $\begin{array}{rcl} P (c | Q) & \propto & P (c) \prod_{q \in Q} P {(q | c)}^{n (q, Q)}, & (6.2) \end{array}$

where the probability $P (q | c)$ is determined by looking at the textual relations as illustrated in Table 6.1. It is smoothed using Bayes smoothing with a Dirichlet prior (cf. Eq. 2.7).

N-gram features
$L E N (Q) = \| Q \|$	Number of terms in the phrase $Q$
$I D F (Q)$	Inverse document frequency of $Q$
$W I G (Q)$	Weighted information gain using top-5 retrieved concepts
$Q E (Q)$	Number of times $Q$ appeared as whole query in the query log
$Q P (Q)$	Number of times $Q$ appeared as partial query in the query log
$Q E Q P (Q)$	Ratio between $Q E$ and $Q P$
$S N I L (Q)$	Does a sub-n-gram of $Q$ fully match with any concept label?
$S N C L (Q)$	Is a sub-n-gram of $Q$ contained in any concept label?
Concept features
$I N L I N K S (c)$	The number of concepts linking to $c$
$O U T L I N K S (c)$	The number of concepts linking from $c$
$G E N (c)$	Function of depth of $c$ in the SKOS category hierarchy [230]
$C A T (c)$	Number of associated categories
$R E D I R E C T (c)$	Number of redirect pages linking to $c$
N-gram + concept features
$T F (c, Q) = \frac{n (Q, c)}{\| c \|}$		Relative phrase frequency of $Q$ in $c$ , normalized by length of $c$
${T F}_{f} (c, Q) = \frac{n (Q, c, f)}{\| f \|}$		$\begin{matrix} Relative phrase frequency of Q in representation f of c, \\ normalized by length of f \end{matrix}$
$P O S_{n} (c, Q) = p o s_{n} (Q) ∕ \| c \|$		Position of $n$ th occurrence of $Q$ in $c$ , normalized by length of $c$
$S P R (c, Q)$		Spread (distance between the last and first occurrences of $Q$ in $c$ )
$T F \cdot I D F (c, Q)$		The importance of $Q$ for $c$
$R I D F (c, Q)$		Residual IDF (difference between expected and observed IDF)
$χ^{2} (c, Q)$		$χ^{2}$ test of independence between $Q$ in $c$ and in collection $C o l l$
$Q C T (c, Q)$		Does $q$ contain the label of $c$ ?
$T C Q (c, Q)$		Does the label of $c$ contain $q$ ?
$T E Q (c, Q)$		Does the label of $c$ equal $q$ ?
$S C O R E (c, Q)$		Retrieval score of $c$ w.r.t. $Q$
$R A N K (c, Q)$		Retrieval rank of $c$ w.r.t. $Q$
History features
$C C I H (c)$	Number of occurrences of label of $c$ appears as query in history
$C C C H (c)$	Number of occurrences of label of $c$ appears in any query in history
$C I H H (c)$	Number of times $c$ is retrieved as result for any query in history
$C C I H H (c)$	Number of times label of $c$ equals title of any result for any query in history
$C C C H H (c)$	Number of times title of any result for any query in history contains label of $c$
$Q C I H H (Q)$	Number of times title of any result for any query in history equals $Q$
$Q C C H H (Q)$	Number of times title of any result for any query in history contains $Q$
$Q C I H (Q)$	Number of times $Q$ appears as query in history
$Q C C H (Q)$	Number of times $Q$ appears in any query in history

Table 6.3: Features used, grouped by type. Detailed descriptions in Section 6.2.3.

6.2.2 Learning to Select Concepts

Once we have obtained a ranked list of possible concepts for each n-gram, we turn to concept selection. In this stage we need to decide which of the candidate concepts are most viable. We use a supervised machine learning approach that takes as input a set of labeled examples (query to concept mappings) and several features of these examples (detailed below). More formally, each query $Q$ is associated with a ranked list of concepts $c$ and a set of associated relevance assessments for the concepts. The latter is created by considering all concepts that any annotator used to map $Q$ to $c$ . If a concept was not selected by any of the annotators, we consider it to be non-relevant for $Q$ . Then, for each query in the set of annotated queries, we consider each combination of n-gram $Q$ and concept $c$ an instance for which we create a feature vector.

The goal of the machine learning algorithm is to learn a function that outputs a relevance status for any new n-gram and concept pair given a feature vector of this new instance. We choose to compare a naive bayes (NB) classifier, with a support vector machine (SVM) classifier and a decision tree classifier (J48)—a set representative of the state-of-the-art in classification. These algorithms will be further introduced in Section 6.3.3.

6.2.3 Features Used

We employ several types of features, each associated with either an n-gram, concept, their combination, or the search history. Unless indicated otherwise, when determining the features, we consider any consecutive terms in $Q$ as a phrase, that is, we do not assume term independence.

N-gram Features

These features are based on information from an n-gram and are listed in Table 6.3 (first group). $I D F (Q)$ indicates the relative number of concepts in which $Q$ occurs, which is defined as $I D F (Q) = log (| C o l l | ∕ d f (Q))$ , where $| C o l l |$ indicates the total number of concepts and $d f (Q)$ the number of concepts in which $Q$ occurs [18]. $W I G (Q)$ indicates the weighted information gain, which was proposed by Zhou and Croft [359] as a predictor of the retrieval performance of a query. It uses the set of all candidate concepts retrieved for this n-gram, $C_{Q}$ , and determines the relative probability of $Q$ occurring in these documents as compared to the collection. Formally:

W I G (Q) = \frac{\frac{1}{| C_{Q} |} \sum_{c \in C_{Q}} log (P (Q | c)) - log (P (Q))}{log P (Q)} .

$Q E (Q)$ and $Q P (Q)$ indicate the number of times the n-gram $Q$ appears in the entire query logs as a complete or partial query respectively.

Concept Features

Table 6.3 (second group) lists the features related to a DBpedia concept. This set of features is related to the knowledge we have of the candidate concept, such as the number of other concepts linking to or from it, the number of associated categories (the count of the DBpedia property skos:subject), and the number of redirect pages pointing to it (the DBpedia property dbpprop:redirect).

N-gram + Concept Features

This set of features considers the combination of an n-gram and a concept (Table 6.3, third group). We consider the relative frequency of occurrence of the n-gram as a phrase in the Wikipedia article corresponding to the concept, in the separate document representations (title, content, anchor texts, first sentence, and first paragraph of the Wikipedia article), the position of the first occurrence of the n-gram, the distance between the first and last occurrence, and various IR-based measures [18]. Of these, $R I D F$ [68] is the difference between expected and observed IDF for a concept, which is defined as

R I D F (c, Q) = log (\frac{| C o l l |}{d f (Q)}) + log (1 - exp (\frac{- n (Q, C o l l)}{| C o l l |})) .

We also consider whether the label of the concept (rdfs:label) matches $Q$ in any way and we include the retrieval score and rank as determined by using Eq. 6.2.

History Features

Finally, we consider features based on the previous queries that were issued in the same session (Table 6.3, fourth group). These features indicate whether the current candidate concept or n-gram occurs (partially) in the previously issued queries or retrieved candidate concepts respectively.

In Section 6.4 we compare the effectiveness of the feature types listed above for our task, whilst in Section 6.5.5 we discuss the relative importance of each individual feature.

PrevTail

Front