6.4 Results

In the remainder of this section we report on the experimental results and use them to answer the research questions for this chapter. Here, we compare the following approaches for mapping queries to DBpedia:

(i) a baseline that retrieves only those concepts whose label lexically matches the query,
(ii) a retrieval baseline that retrieves concepts based solely on their textual representation in the form of the associated Wikipedia article, with varying textual fields,
(iii) n-gram based reranking, which extracts all n-grams from the query and uses machine learning to identify the best concepts, and
(iv) full query based reranking, which does not extract n-grams but calculates feature vectors based on the full query and uses machine learning to identify the best concepts.

In the next section we further analyze the results along several dimensions, including the effect of varying the number of concepts retrieved in the first stage, the parameter settings of the machine learning models, the most informative individual features and feature types, and the kinds of errors made by the machine learning algorithms.

6.4.1 Lexical Match

As our first baseline we consider a simple heuristic that is commonly used [122894114142200]. For this baseline we select concepts that lexically match the query, subject to various constraints. This returns concepts whose rdfs:label occurs as a sequence of consecutive terms in the query, or vice versa.
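To make these matching constraints concrete, the sketch below (in Python) illustrates the strict variant (QCL, label contained in the query) and the relaxed variant (QCL-LCQ, additionally label containing the query) discussed in the remainder of this section. The concept labels and the phrase-containment helper are illustrative assumptions, not the exact implementation used in the experiments.

    def contains_as_phrase(haystack_terms, needle_terms):
        """True if needle_terms occur as consecutive terms within haystack_terms."""
        n = len(needle_terms)
        if n == 0 or n > len(haystack_terms):
            return False
        return any(haystack_terms[i:i + n] == needle_terms
                   for i in range(len(haystack_terms) - n + 1))

    def lexical_match(query, labels, relax=False):
        """QCL: keep concepts whose label occurs (as consecutive terms) in the query.
        With relax=True (QCL-LCQ): also keep concepts whose label contains the query."""
        q_terms = query.lower().split()
        matched = []
        for label in labels:
            l_terms = label.lower().split()
            if contains_as_phrase(q_terms, l_terms):
                matched.append(label)
            elif relax and contains_as_phrase(l_terms, q_terms):
                matched.append(label)
        return matched

    # Hypothetical concept labels for the query "joseph haydn".
    labels = ["Joseph Haydn", "Joseph", "Joseph Haydn symphonies", "Brooklyn Bridge"]
    print(lexical_match("joseph haydn", labels))               # QCL
    print(lexical_match("joseph haydn", labels, relax=True))   # QCL-LCQ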


QCL           QCL-LCQ                  QCL-LSO
JOSEPH HAYDN  JOSEPH HAYDN             JOSEPH HAYDN
JOSEPH        JOSEPH HAYDN OPERAS
              JOSEPH HAYDN SYMPHONIES

Table 6.5: An example of the concepts obtained using various lexical matching constraints for the query “joseph haydn” (translated to English). In this case, the annotators only linked the concept JOSEPH HAYDN.

An example for the query “joseph haydn” is given in Table 6.5. We then rank the concepts based on the language modeling score of their associated Wikipedia article given the query (cf. Eq. 6.2).
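Eq. 6.2 is not repeated here; as a stand-in, the following sketch ranks candidate concepts by a Dirichlet-smoothed query-likelihood score over their article text. The article snippets, the smoothing parameter mu, and the simple background estimate are illustrative assumptions rather than the exact setup of this chapter.

    import math
    from collections import Counter

    def query_likelihood(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
        """Log-likelihood of the query under a Dirichlet-smoothed unigram model of the document."""
        tf = Counter(doc_terms)
        dlen = len(doc_terms)
        score = 0.0
        for t in query_terms:
            p_coll = (collection_tf.get(t, 0) + 1) / (collection_len + 1)  # crude background estimate
            p = (tf.get(t, 0) + mu * p_coll) / (dlen + mu)
            score += math.log(p)
        return score

    def rank_concepts(query, articles, mu=2000.0):
        """Rank candidate concepts by the LM score of their associated article given the query."""
        q_terms = query.lower().split()
        tokenised = {c: text.lower().split() for c, text in articles.items()}
        coll_tf = Counter(t for terms in tokenised.values() for t in terms)
        coll_len = sum(coll_tf.values())
        scored = sorted(((query_likelihood(q_terms, terms, coll_tf, coll_len, mu), c)
                         for c, terms in tokenised.items()), reverse=True)
        return [c for _, c in scored]

    # Toy example with made-up article snippets for two candidate concepts.
    articles = {
        "JOSEPH HAYDN": "joseph haydn was an austrian composer of the classical period",
        "JOSEPH": "joseph is a given name derived from the hebrew yosef",
    }
    print(rank_concepts("joseph haydn", articles))  # the HAYDN article ranks first here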


         P1      R-prec  Recall  MRR     SR
QCL      0.3956  0.3140  0.4282  0.4117  0.4882
QCL-LCQ  0.4286  0.3485  0.4881  0.4564  0.5479
QCL-LSO  0.4160  0.2747  0.3435  0.3775  0.4160
oracle   0.5808  0.4560  0.5902  0.5380  0.6672

Table 6.6: Lexical match baseline results, using lexical matching between concept labels and the query to select concepts.

Table 6.6 shows the scores when using lexical matching for mapping search engine queries. The results in the first row are obtained by only considering the concepts whose label is contained in the query (QCL). This common but naive approach does not perform well, achieving a P1 score of under 40%. The second row relaxes this constraint and also selects concepts where the query is contained in the concept label (QCL-LCQ). This improves performance somewhat.

One issue with these approaches, however, is that they might match parts of compound terms. For example, the query “brooklyn bridge” might not only match the concept BROOKLYN BRIDGE but also the concepts BROOKLYN and BRIDGE. The approach taken for the third row (QCL-LSO) therefore extracts all n-grams from the query, sorts them by the number of terms in descending order, and checks for each n-gram, longest first, whether a concept label is contained in it. Once a match is found, the remaining, smaller n-grams are skipped.
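A minimal sketch of this longest-span-only strategy, under the simplifying assumption that an n-gram matches when it equals a concept label exactly (the labels are hypothetical):

    def ngrams(terms, n):
        """All n-grams of length n from a list of terms."""
        return [" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)]

    def longest_span_match(query, labels):
        """QCL-LSO sketch: consider n-grams from longest to shortest; as soon as any
        n-gram of a given length matches a concept label, skip all shorter n-grams."""
        label_set = {label.lower() for label in labels}
        q_terms = query.lower().split()
        for n in range(len(q_terms), 0, -1):
            matches = [g for g in ngrams(q_terms, n) if g in label_set]
            if matches:
                return matches
        return []

    labels = ["Brooklyn Bridge", "Brooklyn", "Bridge"]
    print(longest_span_match("brooklyn bridge", labels))  # ['brooklyn bridge'], not the parts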

The last row (“oracle”) shows the results when we initially select all concepts whose label terms match any part of the query and then keep only those concepts that were annotated by the assessors. As such, the performance of this run indicates an upper bound on the performance that lexical matching can obtain. From these scores we conclude that, although lexical matching is a common approach for matching unstructured text with structured data, it does not perform well for our task, and that we need to consider additional kinds of information pertaining to each concept.
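As a small illustration of how the oracle run is obtained, assuming a hypothetical candidate list and gold annotation for the query “joseph haydn”:

    def oracle_run(candidates, annotated):
        """Oracle upper bound: from the lexically matched candidates, keep only those
        concepts that the human assessors actually linked to the query."""
        return [c for c in candidates if c in annotated]

    # Hypothetical candidates and gold annotations.
    print(oracle_run(["Joseph Haydn", "Joseph", "Joseph Haydn symphonies"],
                     {"Joseph Haydn"}))  # ['Joseph Haydn']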

6.4.2 Retrieval Only

As our second baseline, we take the entire query as issued by the user and employ Eq. 6.2 to rank DBpedia concepts based on their textual representation; this is similar to using a search engine to perform a search within Wikipedia. We use either the textual content of the Wikipedia article alone (“content-only”) or a combination of the article’s text, its title, and the anchor texts of incoming links (“full text”).
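The two representations can be viewed as different field selections over the same concept record, as in the sketch below; the dictionary keys and the example concept are assumptions, and the resulting text would be ranked with the same query-likelihood scorer as in the earlier sketch.

    def textual_representation(concept, fields=("content",)):
        """Build the retrieval text for a DBpedia concept from selected Wikipedia fields.
        fields=("content",)                     -> "content-only" run
        fields=("title", "content", "anchors")  -> "full text" run"""
        parts = []
        if "title" in fields:
            parts.append(concept["title"])
        if "content" in fields:
            parts.append(concept["content"])
        if "anchors" in fields:
            parts.extend(concept["anchors"])  # anchor texts of incoming links
        return " ".join(parts)

    # Hypothetical concept record.
    concept = {
        "title": "Wubbo Ockels",
        "content": "Wubbo Ockels was a Dutch physicist and astronaut.",
        "anchors": ["ockels", "astronaut wubbo ockels"],
    }
    print(textual_representation(concept))                                   # content-only
    print(textual_representation(concept, ("title", "content", "anchors")))  # full text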


              P1      R-prec  Recall  MRR     SR
full text     0.5636  0.5216  0.6768  0.6400  0.7535
content-only  0.5510  0.5134  0.6632  0.6252  0.7363

Table 6.7: Results for the retrieval only baseline, which ranks concepts using the entire query Q and either the content of the Wikipedia article or the full text associated with each DBpedia concept (including the title and the anchor texts of incoming hyperlinks).

Table 6.7 shows the results of this method. We note that including the title and anchor texts of the incoming links improves retrieval performance overall. This is a strong baseline: on average, over 65% of the relevant concepts are identified in the top-5 and, furthermore, the concept retrieved at rank 1 is relevant for over 55% of the queries. The success rate indicates that for 75% of the queries at least one relevant concept is retrieved in the top-5. In Section 6.5.2 we further discuss the relative performance of each textual representation as well as various combinations.


N-gram                   Candidate concepts
challenger               SPACE SHUTTLE CHALLENGER; CHALLENGER; BOMBARDIER CHALLENGER; STS-61-A; STS-9
wubbo                    WUBBO OCKELS; SPACELAB; CANON OF GRONINGEN; SUPERBUS; ANDRÉ KUIPERS
ockels                   WUBBO OCKELS; SPACELAB; SUPERBUS; CANON OF GRONINGEN; ANDRÉ KUIPERS
challenger wubbo         WUBBO OCKELS; STS-61-A; SPACE SHUTTLE CHALLENGER; SPACELAB; STS-9
wubbo ockels             WUBBO OCKELS; SPACELAB; SUPERBUS; CANON OF GRONINGEN; ANDRÉ KUIPERS
challenger wubbo ockels  WUBBO OCKELS; STS-61-A; SPACELAB; SPACE SHUTTLE CHALLENGER; STS-9

Table 6.8: An example of the concepts obtained when using retrieval only for the n-grams in the query “challenger wubbo ockels” (translated to English), ranked by retrieval score. Concepts annotated by the human annotators for this query in boldface.

6.4.3 N-gram based Concept Selection

Table 6.8 (last row) shows the concepts obtained for the second baseline and the query “challenger wubbo ockels.” Here, two relevant concepts are retrieved at ranks 1 and 4. When we look at the same results for all possible n-grams in the query, however, one of the relevant concepts is retrieved at the first position for each n-gram. This example and the one given earlier in Table 6.2 suggest that it will be beneficial to consider all possible n-grams in the query. In this section we report on the results of extracting n-grams from the query, generating features for each, and subsequently applying machine learning algorithms to decide which of the suggested concepts to keep. The features used here are described in Section 6.2.2.
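The sketch below outlines this pipeline under several assumptions: scikit-learn's DecisionTreeClassifier is used as a stand-in for Weka's J48, the retrieval index and the three features are toy placeholders rather than the features of Section 6.2.2, and the training vectors are made up purely for illustration.

    from sklearn.tree import DecisionTreeClassifier  # stand-in for Weka's J48

    def query_ngrams(query):
        """All word n-grams of the query, longest first."""
        terms = query.lower().split()
        return [" ".join(terms[i:i + n])
                for n in range(len(terms), 0, -1)
                for i in range(len(terms) - n + 1)]

    def select_concepts(query, retrieve, featurize, classifier, k=5):
        """N-gram based concept selection: retrieve candidate concepts for every n-gram,
        build a feature vector per (n-gram, concept) pair, and keep the concepts the
        classifier predicts to be relevant."""
        selected = []
        for ngram in query_ngrams(query):
            for rank, concept in enumerate(retrieve(ngram, k)):
                if classifier.predict([featurize(query, ngram, concept, rank)])[0] == 1:
                    selected.append(concept)
        return selected

    # Toy demo with a hypothetical retrieval index and made-up features/training data.
    def retrieve(ngram, k):
        index = {"wubbo ockels": ["WUBBO OCKELS", "SPACELAB"],
                 "challenger": ["SPACE SHUTTLE CHALLENGER", "CHALLENGER"]}
        return index.get(ngram, [])[:k]

    def featurize(query, ngram, concept, rank):
        return [rank, len(ngram.split()), int(ngram in concept.lower())]

    clf = DecisionTreeClassifier(random_state=0).fit(
        [[0, 2, 1], [1, 2, 0], [0, 1, 1], [1, 1, 0]],  # hypothetical feature vectors
        [1, 0, 1, 0])                                  # 1 = concept linked by the annotators
    print(select_concepts("challenger wubbo ockels", retrieve, featurize, clf))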


          P1      R-prec  Recall  MRR     SR
baseline  0.5636  0.5216  0.6768  0.6400  0.7535
J48       0.6586  0.5648  0.7253  0.7348  0.7989
NB        0.4494  0.4088  0.6948  0.7278  0.7710
SVM       0.7998  0.6718  0.7556  0.8131  0.8240

Table 6.9: Results for n-gram based concept selection. Statistical significance is determined with respect to the baseline, the J48 run, and the NB run, respectively.

Table 6.9 shows the results of applying the machine learning algorithms to the extracted n-gram features. We note that J48 and SVM improve upon the baseline results from the previous section on all metrics. The Naive Bayes classifier performs worse than the baseline in terms of P1 and R-precision. SVM clearly outperforms the other algorithms and obtains very high scores, significantly better than the baseline on all metrics. Interestingly, we see that n-gram based reranking has both a precision enhancing effect for J48 and SVM (the P1 and MRR scores go up) and a recall enhancing effect (the Recall and SR scores improve as well).

6.4.4 Full Query-based Concept Selection

Next, we turn to a comparison of n-gram based and full-query based concept selection. Using the full-query based concept selection method, we take each query as is (an example is given in the last row of Table 6.8) and generate a single ranking to which we apply the machine learning models.
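For contrast with the n-gram based variant, a sketch of full query-based selection, reusing the hypothetical retrieve, featurize, and classifier components from the previous sketch; the only difference is that a single ranking is generated for the query as a whole.

    def select_concepts_full_query(query, retrieve, featurize, classifier, k=5):
        """Full query-based concept selection: a single candidate ranking is generated
        for the query as a whole, and the classifier decides which concepts to keep."""
        selected = []
        for rank, concept in enumerate(retrieve(query, k)):
            features = featurize(query, query, concept, rank)  # the "n-gram" is the full query
            if classifier.predict([features])[0] == 1:
                selected.append(concept)
        return selected

    # e.g., select_concepts_full_query("challenger wubbo ockels", retrieve, featurize, clf)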


          P1      R-prec  Recall  MRR     SR
baseline  0.5636  0.5216  0.6768  0.6400  0.7535
J48       0.7152  0.5857  0.6597  0.6877  0.7317
NB        0.6925  0.5897  0.6865  0.6989  0.7626
SVM       0.8833  0.8666  0.8975  0.8406  0.9053

Table 6.10: Results for full query-based concept selection.

Table 6.10 shows the results when only the full query is used to generate a ranked list of concepts. We again observe that SVM significantly outperforms J48, NB, and the baseline. For both the J48 and NB classifiers we see a significant increase in precision (P1). Naive Bayes, for which precision was significantly worse than all other methods on n-gram based concept selection, performs significantly better than the other machine learning algorithms when using full query based reranking. For NB, the increase in precision comes at a loss in recall. The MRR scores for J48 are no longer significantly higher than the baseline. Both J48 and NB produce fewer false positives when classifying full query data instead of n-gram based query data, which means that fewer incorrect concepts end up in the ranking and, in turn, precision is higher.

For SVM, interestingly, this increase in precision is not accompanied by a loss in recall: the SVM classifier is able to distinguish between correct and incorrect concepts when applied to the full query data. These are the highest scores obtained so far; this approach returns almost 90% of all relevant concepts. This result is very encouraging and shows that full query-based concept selection handles the mapping of search engine queries to the LOD cloud extremely well.