Our approach for mapping search engine queries to concepts consists of two stages. In the first stage, we select a set of candidate concepts. In the second stage, we use supervised machine learning to classify each candidate concept as being intended by the query or not.
In order to find candidate concepts in the first stage, we leverage the textual descriptions (rdfs:comment and/or dbpprop:abstract in the case of DBpedia) of the concepts, as each description of a concept may contain related words, synonyms, or alternative terms that refer to the concept. An example is given in Table 6.1, while the Wikipedia article it is extracted from is shown in Figure 6.2. From this example it is clear that the use of such properties for retrieval improves recall (we find BARACK OBAMA using the terms “President of the United States”) at the cost of precision (we also find BARACK OBAMA when searching for “John McCain”). In order to use the concept descriptions, we adopt a language modeling framework for information retrieval to create a ranked list of candidate concepts. This framework will be further introduced in Section 6.2.1.
Since we are dealing with an ontology extracted from Wikipedia, we have several options with respect to which textual representation(s) we use. Natural possibilities include: (i) the title of the article (similar to a lexical matching approach where only the rdfs:label is used), (ii) the first sentence or paragraph of an article (where a definition should be provided according to the Wikipedia guidelines [342]), (iii) the full text of the article, (iv) the anchor texts of the incoming hyperlinks from other articles, and (v) a combination of any of these. For our experiments we aim to maximize recall and use the combination of all available fields, either with or without the incoming anchor texts. In Section 6.5.2 we discuss the relative performance of each field and of their combinations.
For the first stage, we also vary the way we handle the query. In the simplest case, we take the query as is and retrieve concepts for the query in its entirety. As an alternative, we consider extracting all possible n-grams from the query, generating a ranked list for each, and merging the results. An example of what happens when we vary the query representation is given in Table 6.2 for the query “obama white house.” From this example it is clear why we differentiate between the two ways of representing the query. If we simply use the full query on its own (first row), we miss the relevant concept BARACK OBAMA. However, as can be seen from the last two rows, considering all n-grams also introduces noise.
Table 6.2: Candidate concepts retrieved for n-grams of the query “obama white house.”

N-gram ($n$) | Candidate concepts
obama white house | WHITE HOUSE; WHITE HOUSE STATION; PRESIDENT COOLIDGE; SENSATION WHITE
obama white | MICHELLE OBAMA; BARACK OBAMA; DEMOCRATIC PRE-ELECTIONS 2008; JANUARY 17
white house | WHITE HOUSE; WHITE HOUSE STATION; SENSATION WHITE; PRESIDENT COOLIDGE
obama | BARACK OBAMA; MICHELLE OBAMA; PRESIDENTIAL ELECTIONS 2008; HILLARY CLINTON
white | COLONEL WHITE; EDWARD WHITE; WHITE COUNTY; WHITE PLAINS ROAD LINE
house | HOUSE; ROYAL OPERA HOUSE; SYDNEY OPERA HOUSE; FULL HOUSE
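To make the n-gram based query representation concrete, the following sketch enumerates all word n-grams of a query, from single terms up to the full query. It is a minimal illustration rather than the system's actual code; the function name extract_ngrams is ours.

```python
def extract_ngrams(query):
    """Return all word n-grams of a query, from unigrams up to the full query."""
    terms = query.split()
    ngrams = []
    for length in range(1, len(terms) + 1):              # n-gram length: 1 .. |Q|
        for start in range(len(terms) - length + 1):     # all start positions
            ngrams.append(" ".join(terms[start:start + length]))
    return ngrams

print(extract_ngrams("obama white house"))
# ['obama', 'white', 'house', 'obama white', 'white house', 'obama white house']
```

Each of these n-grams is then issued as a query against the concept index, yielding one ranked list per n-gram, as shown in Table 6.2.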
In the second stage, a supervised machine learning approach is used to classify each candidate concept as either relevant or non-relevant, that is, to decide which of the candidate concepts from the first stage should be kept as viable concepts for the query in question. In order to create training material for the machine learning algorithms, we asked human annotators to assess search engine queries and manually map them to relevant DBpedia concepts. More details about the test collection and manual annotations are provided in Section 6.3. The machine learning algorithms we consider are Naive Bayes, Decision Trees, and Support Vector Machines [326, 344], which are further detailed in Section 6.2.2. As input for the machine learning algorithms we need to extract a number of features. We consider features pertaining to the query, the concept, their combination, and the session in which the query appears; these are specified in Section 6.2.3.
We base our concept ranking framework within the language modeling paradigm as introduced in Chapter 2. For the n-gram based scoring method, we extract all n-grams $n$ from each query $Q$ (where $1 \leq |n| \leq |Q|$) and create a ranked list of concepts for each individual n-gram $n$. For the full query based reranking approach, we use the same method but add the additional constraint that $n = Q$. The problem of ranking DBpedia concepts given $n$ can then be formulated as follows. Each concept $c$ should be ranked according to the probability $P(c|n)$ that it was generated by the n-gram, which can be rewritten using Bayes’ rule as:

$$P(c|n) = \frac{P(n|c)\,P(c)}{P(n)}. \qquad (6.1)$$
Here, for a fixed n-gram $n$, the term $P(n)$ is the same for all concepts and can be ignored for ranking purposes. The term $P(c)$ indicates the prior probability of selecting a concept, which we assume to be uniform. Assuming independence between the individual terms $t$ of $n$ (cf. Eq. 2.3) we obtain

$$P(c|n) \propto P(c) \prod_{t \in n} P(t|c), \qquad (6.2)$$
where the probability $P(t|c)$ is determined by looking at the textual relations as illustrated in Table 6.1. It is smoothed using Bayes smoothing with a Dirichlet prior (cf. Eq. 2.7).
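As an illustration of this ranking model, the sketch below scores candidate concepts for an n-gram using a Dirichlet-smoothed unigram language model built from their textual descriptions. It is a minimal re-implementation of the idea under our own assumptions, not the code used in the experiments; the toy descriptions and the value of the smoothing parameter mu are placeholders.

```python
import math
from collections import Counter

def dirichlet_score(ngram_terms, concept_tf, concept_len, coll_tf, coll_len, mu=2000):
    """log P(n | c) under term independence, with Dirichlet (Bayes) smoothing."""
    score = 0.0
    for t in ngram_terms:
        p_coll = coll_tf.get(t, 0) / coll_len                       # collection model P(t | C)
        p_smoothed = (concept_tf.get(t, 0) + mu * p_coll) / (concept_len + mu)
        if p_smoothed == 0.0:                                       # term unseen everywhere
            return float("-inf")
        score += math.log(p_smoothed)
    return score

# Toy concept descriptions standing in for rdfs:comment / dbpprop:abstract text.
descriptions = {
    "BARACK OBAMA": "barack obama is the president of the united states",
    "WHITE HOUSE": "the white house is the official residence of the president",
}
term_counts = {c: Counter(d.split()) for c, d in descriptions.items()}
coll_tf = sum(term_counts.values(), Counter())
coll_len = sum(coll_tf.values())

ngram = "white house".split()
ranking = sorted(
    ((dirichlet_score(ngram, tf, sum(tf.values()), coll_tf, coll_len), c)
     for c, tf in term_counts.items()),
    reverse=True)
for score, concept in ranking:
    print(f"{concept}: {score:.3f}")
```

With a uniform concept prior, ranking by this log-likelihood is equivalent to ranking by Eq. 6.2.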
Table 6.3: Features used to classify candidate concepts, grouped by type.

N-gram features
- Number of terms in the n-gram $n$
- Inverse document frequency of $n$
- Weighted information gain of $n$, using the top-5 retrieved concepts
- Number of times $n$ appeared as a whole query in the query log
- Number of times $n$ appeared as a partial query in the query log
- Ratio between the whole-query and partial-query counts
- Does a sub-n-gram of $n$ fully match any concept label?
- Is a sub-n-gram of $n$ contained in any concept label?

Concept features
- Number of concepts linking to $c$
- Number of concepts linking from $c$
- Number of categories associated with $c$
- Number of redirect pages linking to $c$

N-gram + concept features
- Relative phrase frequency of $n$ in $c$, normalized by the length of $c$
- Relative phrase frequency of $n$ in each separate field of $c$ (title, content, anchor texts, first sentence, first paragraph)
- Position of the first occurrence of $n$ in $c$, normalized by the length of $c$
- Spread (distance between the last and first occurrences of $n$ in $c$)
- The importance of $n$ for $c$
- Residual IDF (difference between expected and observed IDF)
- $\chi^2$ test of independence between the occurrences of $n$ in $c$ and in the collection
- Does $n$ contain the label of $c$?
- Does the label of $c$ contain $n$?
- Does the label of $c$ equal $n$?
- Retrieval score of $c$ with respect to $n$
- Retrieval rank of $c$ with respect to $n$

History features
- Number of times the label of $c$ appears as a query in the history
- Number of times the label of $c$ appears in any query in the history
- Number of times $c$ is retrieved as a result for any query in the history
- Number of times the label of $c$ equals the title of any result for any query in the history
- Number of times the title of any result for any query in the history contains the label of $c$
- Number of times the title of any result for any query in the history equals $n$
- Number of times the title of any result for any query in the history contains $n$
- Number of times $n$ appears as a query in the history
- Number of times $n$ appears in any query in the history
Once we have obtained a ranked list of possible concepts for each n-gram, we turn to concept selection. In this stage we need to decide which of the candidate concepts are most viable. We use a supervised machine learning approach that takes as input a set of labeled examples (query to concept mappings) and several features of these examples (detailed below). More formally, each query $Q$ is associated with a ranked list of concepts and a set of associated relevance assessments for the concepts. The latter is created by considering all concepts that any annotator used to map to $Q$. If a concept was not selected by any of the annotators, we consider it to be non-relevant for $Q$. Then, for each query in the set of annotated queries, we consider each combination of n-gram and concept an instance for which we create a feature vector.
The goal of the machine learning algorithm is to learn a function that outputs a relevance status for any new n-gram and concept pair, given a feature vector of this new instance. We choose to compare a Naive Bayes (NB) classifier, a Support Vector Machine (SVM) classifier, and a decision tree classifier (J48), a set representative of the state of the art in classification. These algorithms will be further introduced in Section 6.3.3.
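A minimal sketch of this classification step using scikit-learn is shown below. The feature matrix and labels are random placeholders for the real (n-gram, concept) instances, the hyperparameters are defaults rather than those used in the experiments, and scikit-learn's DecisionTreeClassifier (CART) stands in for Weka's J48 (C4.5).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Dummy data: one row per (n-gram, concept) instance, one column per feature;
# label 1 if any annotator mapped the query to the concept, 0 otherwise.
rng = np.random.default_rng(42)
X = rng.random((200, 10))
y = rng.integers(0, 2, size=200)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Decision tree (J48-style)": DecisionTreeClassifier(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```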
We employ several types of features, each associated with either an n-gram, a concept, their combination, or the search history. Unless indicated otherwise, when determining the features, we consider any consecutive terms in $n$ as a phrase; that is, we do not assume term independence.
These features are based on information from an n-gram and are listed in Table 6.3 (first group). The IDF of $n$ indicates the relative number of concepts in which $n$ occurs, which is defined as

$$\mathrm{IDF}(n) = \log \frac{|C|}{|\{c : n \in c\}|},$$

where $|C|$ indicates the total number of concepts and $|\{c : n \in c\}|$ the number of concepts in which $n$ occurs [18]. The weighted information gain (WIG) was proposed by Zhou and Croft [359] as a predictor of the retrieval performance of a query. It uses the set of all candidate concepts retrieved for this n-gram, $C_n$, and determines the relative probability of $n$ occurring in these documents as compared to the collection $\mathcal{C}$. Formally:

$$\mathrm{WIG}(n) = \frac{1}{|C_n|} \sum_{c \in C_n} \log \frac{P(n|c)}{P(n|\mathcal{C})}.$$
The query log features indicate the number of times the n-gram $n$ appears in the entire query log as a complete query or as a partial query, respectively.
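To illustrate how such n-gram features could be computed, the sketch below derives IDF and a WIG-style score from toy concept statistics. The data structures, the use of the top-5 candidates, and the absence of any length normalization are illustrative assumptions, not the thesis implementation.

```python
import math

def idf(ngram, concept_texts):
    """log(|C| / df(n)): relative number of concepts containing the n-gram as a phrase."""
    df = sum(1 for text in concept_texts.values() if ngram in text)
    return math.log(len(concept_texts) / df) if df else float("inf")

def wig(candidate_log_probs, coll_log_prob, k=5):
    """Average of log P(n|c) - log P(n|C) over the top-k retrieved candidate concepts."""
    top_k = sorted(candidate_log_probs, reverse=True)[:k]
    return sum(lp - coll_log_prob for lp in top_k) / len(top_k)

# Toy example.
concept_texts = {
    "BARACK OBAMA": "barack obama president united states",
    "WHITE HOUSE": "white house official residence president",
    "FULL HOUSE": "full house television sitcom",
}
print("IDF('president'):", round(idf("president", concept_texts), 3))
print("WIG:", round(wig([-2.1, -2.5, -3.0, -4.2, -4.8], -5.0), 3))
```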
Table 6.3 (second group) lists the features related to a DBpedia concept. This set of features is related to the knowledge we have of the candidate concept, such as the number of other concepts linking to or from it, the number of associated categories (the count of the DBpedia property skos:subject), and the number of redirect pages pointing to it (the DBpedia property dbpprop:redirect).
This set of features considers the combination of an n-gram and a concept (Table 6.3, third group). We consider the relative frequency with which the n-gram occurs as a phrase in the Wikipedia article corresponding to the concept and in the separate document representations (title, content, anchor texts, first sentence, and first paragraph of the Wikipedia article), the position of the first occurrence of the n-gram, the distance between the first and last occurrence, and various IR-based measures [18]. Of these, residual IDF [68] is the difference between expected and observed IDF for a concept, which is defined as

$$\mathrm{RIDF}(n) = \mathrm{IDF}(n) - \widehat{\mathrm{IDF}}(n),$$

where $\widehat{\mathrm{IDF}}(n) = -\log\left(1 - e^{-cf(n)/|C|}\right)$ is the IDF value expected under a Poisson model and $cf(n)$ denotes the total number of occurrences of $n$ in the collection.
We also consider whether the label of the concept (rdfs:label) matches $n$ in any way, and we include the retrieval score and rank of $c$ for $n$ as determined using Eq. 6.2.
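The sketch below computes a few of these n-gram/concept features (normalized phrase frequency, first position, spread, and label matches) on a toy concept text. It is our own illustration of the feature definitions rather than the original feature extractor; the function name and return keys are hypothetical.

```python
def ngram_concept_features(ngram, concept_text, concept_label):
    """A handful of the n-gram + concept features described above (illustrative)."""
    terms = concept_text.split()
    n_terms = ngram.split()
    # Start positions where the n-gram occurs as a consecutive phrase in the concept text.
    positions = [i for i in range(len(terms) - len(n_terms) + 1)
                 if terms[i:i + len(n_terms)] == n_terms]
    length = max(len(terms), 1)
    return {
        "phrase_freq_norm": len(positions) / length,
        "first_pos_norm": positions[0] / length if positions else 1.0,
        "spread": (positions[-1] - positions[0]) if positions else 0,
        "ngram_contains_label": concept_label.lower() in ngram.lower(),
        "label_contains_ngram": ngram.lower() in concept_label.lower(),
        "label_equals_ngram": ngram.lower() == concept_label.lower(),
    }

print(ngram_concept_features(
    "white house",
    "the white house is the official residence and workplace of the president",
    "White House"))
```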
Finally, we consider features based on the previous queries that were issued in the same session (Table 6.3, fourth group). These features indicate whether the current candidate concept or n-gram occurs (partially) in the previously issued queries or in the previously retrieved candidate concepts, respectively.
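One possible way of computing such session-history counts is sketched below; the session representation (a list of earlier queries and the labels of their retrieved concepts) and the feature names are assumptions made for illustration.

```python
def history_features(ngram, concept_label, previous_queries, previous_result_labels):
    """Counts of the n-gram and the concept label in the session history (illustrative).
    previous_queries: earlier queries in the same session;
    previous_result_labels: labels of concepts retrieved for those queries."""
    label = concept_label.lower()
    n = ngram.lower()
    return {
        "label_as_query": sum(q.lower() == label for q in previous_queries),
        "label_in_query": sum(label in q.lower() for q in previous_queries),
        "concept_retrieved": sum(r.lower() == label for r in previous_result_labels),
        "ngram_as_query": sum(q.lower() == n for q in previous_queries),
        "ngram_in_query": sum(n in q.lower() for q in previous_queries),
    }

print(history_features(
    "obama", "Barack Obama",
    previous_queries=["barack obama", "obama biography"],
    previous_result_labels=["Barack Obama", "White House"]))
```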
In Section 6.4 we compare the effectiveness of the feature types listed above for our task, whilst in Section 6.5.5 we discuss the relative importance of each individual feature.