In the remainder of this section we report on the experimental results and use them to answer the research questions for this chapter. We compare the following approaches for mapping queries to DBpedia: lexical matching of queries against concept labels, retrieval using the full query and the concepts' textual representations (cf. Eq. 6.2), and machine learning applied to either n-gram based or full-query based concept selection.
In the next section we further analyze the results along multiple dimensions, including the effects of varying the number of retrieved concepts in the first stage, varying parameters in the machine learning models, the most informative individual features and types, and the kind of errors that are made by the machine learning algorithms.
As our first baseline we consider a simple heuristic which is commonly used [12, 28, 94, 114, 142, 200]. For this baseline we select concepts that lexically match the query, subject to various constraints. This returns concepts where consecutive terms in the rdfs:label are contained in the query or vice versa.
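As an illustration, the sketch below implements this containment check on lowercased, whitespace-tokenized labels and queries. The concept list, its labels, and the flag name are hypothetical, and the actual implementation applies its constraints over the full set of rdfs:label values in DBpedia; the two directions of the check correspond to the QCL and QCL-LCQ variants discussed below.

```python
def contains_consecutive(needle, haystack):
    """True if the terms in `needle` occur consecutively, in order, in `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))

def lexical_match(query, concepts, label_may_contain_query=False):
    """Select concepts whose rdfs:label is contained in the query; optionally also
    keep concepts whose label contains the query.
    `concepts` is a hypothetical list of (concept identifier, label) pairs."""
    query_terms = query.lower().split()
    selected = []
    for concept, label in concepts:
        label_terms = label.lower().split()
        if contains_consecutive(label_terms, query_terms):
            selected.append(concept)           # label contained in query (QCL)
        elif label_may_contain_query and contains_consecutive(query_terms, label_terms):
            selected.append(concept)           # query contained in label (QCL-LCQ)
    return selected

# Hypothetical candidate concepts and their rdfs:label values.
concepts = [("dbpedia:Joseph_Haydn", "Joseph Haydn"),
            ("dbpedia:Joseph", "Joseph"),
            ("dbpedia:Joseph_Haydn_operas", "Joseph Haydn operas")]

print(lexical_match("joseph haydn", concepts))                               # QCL
print(lexical_match("joseph haydn", concepts, label_may_contain_query=True)) # QCL-LCQ
```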
Table 6.5: Concepts selected by lexical matching for the query “joseph haydn.”
QCL | QCL-LCQ | QCL-LSO |
JOSEPH HAYDN | JOSEPH HAYDN | JOSEPH HAYDN |
JOSEPH | JOSEPH HAYDN OPERAS | |
| JOSEPH HAYDN SYMPHONIES | |
An example for the query “joseph haydn” is given in Table 6.5. We then rank the concepts based on the language modeling score of their associated Wikipedia article given the query (cf. Eq. 6.2).
Table 6.6: Results of lexically matching search engine queries to DBpedia concepts.
| P1 | R-prec | Recall | MRR | SR |
QCL | 0.3956 | 0.3140 | 0.4282 | 0.4117 | 0.4882 |
QCL-LCQ | 0.4286 | 0.3485 | 0.4881 | 0.4564 | 0.5479 |
QCL-LSO | 0.4160 | 0.2747 | 0.3435 | 0.3775 | 0.4160 |
oracle | 0.5808 | 0.4560 | 0.5902 | 0.5380 | 0.6672 |
Table 6.6 shows the scores when using lexical matching for mapping search engine queries. The results in the first row are obtained by only considering the concepts whose label is contained in the query (QCL). This is a frequently taken but naive approach and does not perform well, achieving a P1 score of under 40%. The second row relaxes this constraint and also selects concepts where the query is contained in the concept label (QCL-LCQ). This improves the performance somewhat.
Both of these approaches, however, might match parts of compound terms. For example, the query “brooklyn bridge” might not only match the concept BROOKLYN BRIDGE but also the concepts BROOKLYN and BRIDGE. The approach taken for the third row (QCL-LSO) therefore extracts all n-grams from the query, sorts them by the number of terms in descending order, and checks for each n-gram, starting with the longest, whether a concept label is contained in it. Once a match is found, the remaining, smaller n-grams are skipped.
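A minimal sketch of one possible reading of this heuristic is given below: an n-gram counts as matched when it equals a concept label after case folding, and any smaller n-gram contained in an already matched one is skipped. The concept list is hypothetical, and the actual implementation may handle ties and partial matches differently.

```python
def query_ngrams(query):
    """All n-grams of the query, longest first."""
    terms = query.lower().split()
    return [" ".join(terms[i:i + n])
            for n in range(len(terms), 0, -1)
            for i in range(len(terms) - n + 1)]

def qcl_lso(query, concepts):
    """Possible reading of QCL-LSO: walk over query n-grams from longest to
    shortest, select concepts whose label matches the n-gram, and skip smaller
    n-grams contained in a matched one. `concepts` is a hypothetical list of
    (concept identifier, label) pairs."""
    matched_spans, selected = [], []
    for gram in query_ngrams(query):
        if any(gram in span for span in matched_spans):
            continue                                  # subsumed by a longer match
        hits = [concept for concept, label in concepts if label.lower() == gram]
        if hits:
            selected.extend(hits)
            matched_spans.append(gram)
    return selected

concepts = [("dbpedia:Brooklyn_Bridge", "Brooklyn Bridge"),
            ("dbpedia:Brooklyn", "Brooklyn"),
            ("dbpedia:Bridge", "Bridge")]
print(qcl_lso("brooklyn bridge", concepts))  # only dbpedia:Brooklyn_Bridge
```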
The last row (“oracle”) shows the results when we initially select all concepts whose label terms match any part of the query and then keep only those concepts that were annotated by the assessors. As such, the performance of this run indicates an upper bound on the performance that lexical matching can obtain. From these scores we conclude that, although lexical matching is a common approach for matching unstructured text with structured data, it does not perform well for our task and we need to consider additional kinds of information pertaining to each concept.
As our second baseline, we take the entire query as issued by the user and employ Eq. 6.2 to rank DBpedia concepts based on their textual representation; this is similar to performing a search within Wikipedia using a standard search engine. We use either the textual contents of the Wikipedia article only (“content-only”) or a combination of the article’s text, its title, and the anchor texts of incoming links (“full text”).
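The sketch below illustrates this ranking step under the assumption that Eq. 6.2 is a standard query-likelihood model with Dirichlet smoothing; the exact formulation, smoothing parameter, and field handling used in the experiments may differ, and the article dictionary with text, title, and anchors fields is an assumed structure.

```python
import math
from collections import Counter

def representation(article, full_text=True):
    """Build a concept's textual representation: the article text only
    ("content-only") or text plus title and incoming anchor texts ("full text")."""
    fields = [article["text"]]
    if full_text:
        fields.append(article["title"])
        fields.extend(article["anchors"])
    return " ".join(fields).lower().split()

def query_likelihood(query_terms, doc_terms, coll_tf, coll_len, mu=2500.0):
    """log P(q | d) under a Dirichlet-smoothed document language model.
    A sketch only; Eq. 6.2 may differ in its exact form."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        p_coll = coll_tf.get(t, 0) / coll_len                 # background model
        p = (tf[t] + mu * p_coll) / (len(doc_terms) + mu)     # smoothed document model
        score += math.log(p) if p > 0 else float("-inf")      # unseen term, no background mass
    return score

def rank_concepts(query, articles, coll_tf, coll_len, full_text=True):
    """Rank candidate concepts by the score of their associated Wikipedia article."""
    q = query.lower().split()
    scored = [(query_likelihood(q, representation(a, full_text), coll_tf, coll_len), c)
              for c, a in articles.items()]
    return [c for _, c in sorted(scored, key=lambda x: x[0], reverse=True)]
```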
Table 6.7: Results of ranking DBpedia concepts using the full query, for two textual representations.
| P1 | R-prec | Recall | MRR | SR |
full text | 0.5636 | 0.5216 | 0.6768 | 0.6400 | 0.7535 |
content-only | 0.5510 | 0.5134 | 0.6632 | 0.6252 | 0.7363 |
Table 6.7 shows the results of this method. We note that including the title and anchor texts of the incoming links results in improved retrieval performance overall. This is a strong baseline; on average, over 65% of the relevant concepts are correctly identified in the top-5 and, furthermore, over 55% of the relevant concepts are retrieved at rank 1. The success rate indicates that for 75% of the queries at least one relevant concept is retrieved in the top-5. In Section 6.5.2 we further discuss the relative performance of each textual representation as well as various combinations.
Table 6.8: Candidate concepts retrieved for each n-gram of the query “challenger wubbo ockels.”
N-gram | Candidate concepts |
challenger | SPACE SHUTTLE CHALLENGER; CHALLENGER; BOMBARDIER CHALLENGER; STS-61-A; STS-9 |
wubbo | WUBBO OCKELS; SPACELAB; CANON OF GRONINGEN; SUPERBUS; ANDRÉ KUIPERS |
ockels | WUBBO OCKELS; SPACELAB; SUPERBUS; CANON OF GRONINGEN; ANDRÉ KUIPERS |
challenger wubbo | WUBBO OCKELS; STS-61-A; SPACE SHUTTLE CHALLENGER; SPACELAB; STS-9 |
wubbo ockels | WUBBO OCKELS; SPACELAB; SUPERBUS; CANON OF GRONINGEN; ANDRÉ KUIPERS |
challenger wubbo ockels | WUBBO OCKELS; STS-61-A; SPACELAB; SPACE SHUTTLE CHALLENGER; STS-9 |
Table 6.8 (last row) shows the concepts obtained for the second baseline and the query “challenger wubbo ockels.” Here, two relevant concepts are retrieved at ranks 1 and 4. When we look at the same results for all possible n-grams in the query, however, one of the relevant concepts is retrieved at the first position for each n-gram. This example and the one given earlier in Table 6.2 suggest that it will be beneficial to consider all possible n-grams in the query. Next, we report on the results of extracting n-grams from the query, generating features for each, and subsequently applying machine learning algorithms to decide which of the suggested concepts to keep. The features used here are described in Section 6.2.2.
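To make this pipeline concrete, a simplified sketch is given below. The feature function is a hypothetical placeholder for the features of Section 6.2.2, the candidate dictionaries are assumed structures, and scikit-learn’s SVC merely stands in for the learners compared in the experiments.

```python
import numpy as np
from sklearn.svm import SVC

def features(ngram, candidate):
    """Hypothetical feature vector for an (n-gram, candidate concept) pair;
    the actual features are described in Section 6.2.2."""
    return [
        len(ngram.split()),                                   # n-gram length in terms
        float(ngram.lower() == candidate["label"].lower()),   # exact label match
        candidate["retrieval_score"],                         # LM score of the article
        candidate["rank"],                                    # rank among candidates
    ]

def train(training_examples):
    """Fit a binary classifier on labeled (n-gram, candidate) pairs.
    `training_examples` is a list of ((ngram, candidate), is_relevant) tuples."""
    X = np.array([features(ng, c) for (ng, c), _ in training_examples])
    y = np.array([int(rel) for _, rel in training_examples])
    # SVM shown here; a decision tree or naive Bayes learner could stand in for J48/NB.
    return SVC(kernel="rbf").fit(X, y)

def select(model, candidates):
    """Keep the candidates the classifier accepts, ranked by decision value."""
    X = np.array([features(ng, c) for ng, c in candidates])
    scores = model.decision_function(X)
    kept = [(s, c) for s, (ng, c) in zip(scores, candidates) if s > 0]
    return [c for s, c in sorted(kept, key=lambda x: x[0], reverse=True)]
```

In the n-gram based setting, `select` is applied to the pooled candidates of all query n-grams; in the full-query setting discussed below, it is applied to a single ranking per query.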
Table 6.9: Results of applying machine learning to n-gram based concept selection.
| P1 | R-prec | Recall | MRR | SR |
baseline | 0.5636 | 0.5216 | 0.6768 | 0.6400 | 0.7535 |
J48 | 0.6586 | 0.5648 | 0.7253 | 0.7348 | 0.7989 |
NB | 0.4494 | 0.4088 | 0.6948 | 0.7278 | 0.7710 |
SVM | 0.7998 | 0.6718 | 0.7556 | 0.8131 | 0.8240 |
Table 6.9 shows the results of applying the machine learning algorithms to the extracted n-gram features. Both J48 and SVM improve upon the baseline results from the previous section according to all metrics, whereas the Naive Bayes classifier performs worse than the baseline in terms of P1 and R-precision. SVM clearly outperforms the other algorithms and obtains very high scores, significantly better than the baseline on all metrics. Interestingly, n-gram based reranking has both a precision-enhancing effect for J48 and SVM (the P1 and MRR scores go up) and a recall-enhancing effect.
Next, we turn to a comparison of n-gram based and full-query based concept selection. Using the full-query based concept selection method, we take each query as is (an example is given in the last row of Table 6.8) and generate a single ranking to which we apply the machine learning models.
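The difference between the two selection strategies can be sketched as follows, with `retrieve` a stand-in for the ranking step of Eq. 6.2 and `classify` for a trained model; both names are hypothetical.

```python
def all_ngrams(query):
    """All n-grams of the query, longest first."""
    terms = query.split()
    return [" ".join(terms[i:i + n])
            for n in range(len(terms), 0, -1)
            for i in range(len(terms) - n + 1)]

def ngram_based_selection(query, retrieve, classify):
    """Retrieve candidates for every n-gram of the query, then classify the pool."""
    candidates = [(ng, concept) for ng in all_ngrams(query) for concept in retrieve(ng)]
    return classify(candidates)

def full_query_based_selection(query, retrieve, classify):
    """Retrieve a single ranking for the query as issued, then classify it."""
    return classify([(query, concept) for concept in retrieve(query)])
```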
Table 6.10: Results of applying machine learning to full-query based concept selection.
| P1 | R-prec | Recall | MRR | SR |
baseline | 0.5636 | 0.5216 | 0.6768 | 0.6400 | 0.7535 |
J48 | 0.7152 | 0.5857 | 0.6597 | 0.6877 | 0.7317 |
NB | 0.6925 | 0.5897 | 0.6865 | 0.6989 | 0.7626 |
SVM | 0.8833 | 0.8666 | 0.8975 | 0.8406 | 0.9053 |
Table 6.10 shows the results when only the full query is used to generate a ranked list of concepts. We again observe that SVM significantly outperforms J48, NB, and the baseline. For both the J48 and NB classifiers we see a significant increase in precision (P1). Naive Bayes, whose precision was significantly worse than that of all other methods with n-gram based concept selection, performs considerably better with full-query reranking; for NB, the increase in precision comes at a loss in recall. The MRR scores for J48 are no longer significantly higher than the baseline. Both J48 and NB produce fewer false positives when classifying full-query data instead of n-gram based data; fewer incorrect concepts end up in the ranking, which in turn results in higher precision.
Interestingly, for SVM this increase in precision is not accompanied by a loss in recall: the SVM classifier is able to distinguish between correct and incorrect concepts when applied to the full-query data. These scores are the highest obtained so far, and this approach returns almost 90% of all relevant concepts. This result is very encouraging and shows that the approach taken handles the mapping of search engine queries to the LOD cloud extremely well.