In the previous section we introduced IR-related ways of mapping free text, in the form of queries or documents, to concepts. In this section we focus on more general solutions to this problem. The approaches we discuss here are related to several areas of research, including Semantic Web topics such as ontology learning, population, and matching, as well as semantic annotation, but also natural language interfaces to databases.
The first body of related work that we discuss comes from the field of natural language interfaces to databases. For example, BANKS, DISCOVER, and DBXplorer allow novice users to query large, complex databases using natural language queries. Tata and Lohman propose a similar keyword-based querying mechanism, but with additional aggregation facilities. All of these systems perform some kind of matching between the input query and either the database schema itself, the contents of the database, or the graph of tuples created by the joins defined on the schema. The actual matching function varies per system and ranges from determining lexical matches (optionally using regular expressions or some form of edit distance) to using an inverted index and related IR techniques. These approaches are very similar to the ones we use to rank candidate concepts in Chapter 6; later, we take these two types of matching as baselines against which we compare our own approach. In contrast to our approach, however, none of these systems applies machine learning.
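The two baseline types of matching mentioned above can be illustrated with a small sketch. The concept labels, tokenization, and scoring below are illustrative assumptions, not the actual implementation of any of the cited systems: one baseline ranks candidate concepts by edit distance between the query and a concept label, the other by term overlap computed over an inverted index.

```python
from collections import defaultdict

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Toy candidate concept labels (hypothetical).
concepts = ["barack obama", "michelle obama", "oban scotland"]

def rank_by_edit_distance(query, concepts):
    # Lexical-match baseline: closest label first.
    return sorted(concepts, key=lambda c: edit_distance(query, c))

def build_index(concepts):
    # Inverted index: term -> set of concepts whose label contains it.
    index = defaultdict(set)
    for c in concepts:
        for term in c.split():
            index[term].add(c)
    return index

def rank_by_term_overlap(query, index):
    # IR-style baseline: score by number of matching query terms.
    scores = defaultdict(int)
    for term in query.split():
        for c in index.get(term, ()):
            scores[c] += 1
    return sorted(scores, key=scores.get, reverse=True)

index = build_index(concepts)
best_lexical = rank_by_edit_distance("barack obma", concepts)[0]
best_ir = rank_by_term_overlap("obama barack", index)[0]
```

Note how the edit-distance baseline tolerates the misspelling "obma", while the term-overlap baseline is robust to word reordering; real systems combine such signals with ranking functions such as TF-IDF.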
NAGA is a similar system that is more closely tied to the Semantic Web [99, 160]. It uses language modeling intuitions to determine a ranking of possible answer graphs, based on the frequency of occurrence of terms in the knowledge base. This scoring mechanism has been shown to outperform that of BANKS on various test collections. NAGA does not, however, support approximate matching or keyword-augmented queries. Our method presented in Chapter 6, on the other hand, takes any unstructured search engine query as input.
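The language-modeling intuition behind such ranking, scoring a candidate by how likely the query terms are under a model estimated from the knowledge base, can be sketched as follows. This is a generic query-likelihood sketch with Jelinek-Mercer smoothing, not NAGA's actual scoring function; the toy knowledge base and the smoothing constant are assumptions.

```python
import math
from collections import Counter

# Toy "knowledge base": each concept has an associated textual description.
kb = {
    "Jaguar (animal)": "jaguar big cat animal south america predator",
    "Jaguar Cars":     "jaguar car british automobile manufacturer luxury",
}

# Collection statistics over all descriptions, used for smoothing.
collection = Counter()
for text in kb.values():
    collection.update(text.split())
collection_size = sum(collection.values())

def score(query: str, concept: str, lam: float = 0.5) -> float:
    """Log query likelihood under a Jelinek-Mercer-smoothed concept model."""
    doc = Counter(kb[concept].split())
    doc_len = sum(doc.values())
    s = 0.0
    for term in query.split():
        p_doc = doc[term] / doc_len                 # concept model
        p_col = collection[term] / collection_size  # background model
        s += math.log(lam * p_doc + (1 - lam) * p_col + 1e-12)
    return s

best = max(kb, key=lambda c: score("jaguar predator", c))
```

The background model keeps the score finite for terms absent from a concept's description, while frequent co-occurring terms pull the matching concept to the top.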
Demidova et al. present the evaluation of a system that maps keyword queries to structured query templates. The query terms are mapped to specific slots in each template and the templates are subsequently ranked, explicitly taking diversity into account. They find that applying diversification to query template ranking significantly reduces result redundancy. Kaufmann and Bernstein perform a user study in which they evaluate various natural language interfaces to structured knowledge bases. Each interface has a different level of complexity, and the task they ask their users to accomplish is to rewrite a set of factoid and list queries for each interface, with the goal of answering each question using the contents of the knowledge base. They find that for this task the optimal strategy is a combination of structure (in the form of a fixed set of question beginnings, such as “How many ...” and “Which ...”) and free text. The task we present in Chapter 6 is more general than the task evaluated by Kaufmann and Bernstein, in that we do not investigate if, how well, or how easily users’ queries are answered, but whether they are mapped to the right concepts. We postulate various benefits of these mappings beyond answering questions, such as providing contextual suggestions and serving as a starting point for exploring the knowledge base.
In ontology matching, relations between concepts from different ontologies are identified. The Ontology Alignment Evaluation Initiative has addressed this task since 2008; here, participants link a largely unstructured thesaurus to DBpedia. The relations to be obtained are based on a comparison of instances, concept labels, semantic structure, or ontological features such as constraints or properties, sometimes exploiting auxiliary external resources such as WordNet or an upper ontology. For example, Wang et al. develop a machine learning technique to learn the relationship between the similarity of instances and the validity of mappings between concepts. Other approaches are designed for lexical comparison of concept labels in the source and target ontology and use neither semantic structure nor instances. Aleksovski et al. use a lexical comparison of labels to map both the source and the target ontology to a semantically rich external source of background knowledge. This type of matching is referred to as “lexical matching” and is used in cases where the ontologies have neither instances nor structure. Lexical matching is very similar to the task presented in Chapter 6, as we do not have explicit semantic structure in any of our queries. Indeed, the queries that we use are free-text utterances instead of standardized concept labels, which makes our task intrinsically harder.
In the field of ontology learning and population, concepts and/or their instances are learned from unstructured or semi-structured documents, together with links between concepts. Well-known examples of ontology learning tools are OntoGen and TextToOnto. More closely related to our task is the work on semantic annotation, the process of mapping text from unstructured data resources to concepts from ontologies or other sources of structured knowledge. In the simplest case, this is performed using a lexical match between the labels of each candidate concept and the contents of the text [94, 142, 200, 217]. A well-known example of a more elaborate approach is Ontotext’s KIM platform. The KIM platform builds on GATE to detect named entities and to link them to concepts in an ontology. Entities unknown to the ontology are fed back into it, thus populating it further. OpenCalais provides semantic annotations of textual documents by automatically identifying entities, events, and facts. Each annotation is given a URI that is linked to concepts from the Linked Open Data (LOD) cloud where possible.
Chemudugunta et al. do not restrict themselves to named entities, but instead use topic models to link all words in a document to ontological concepts. Other sub-problems of semantic annotation include sense tagging and word sense disambiguation. Some of the techniques developed there have fed into automatic link generation between full-text documents and Wikipedia. For example, Milne and Witten, building on the work of Mihalcea and Csomai, depend heavily on contextual information from terms and phrases surrounding the source text to determine the best Wikipedia articles to link to. The authors apply part-of-speech tagging and develop several ranking procedures for candidate Wikipedia articles. A key difference from the approach of linking queries to concepts that we present in Chapter 6 is that we work with much sparser data in the form of short keyword queries, as opposed to verbose queries or full-text documents. Hence, as we will see in Chapter 6, we cannot easily use techniques such as part-of-speech tagging, nor lean too heavily on context words for disambiguation.
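The effect of context sparsity on disambiguation can be made concrete with a simplified sketch in the spirit of, but not identical to, the context-based linking approaches above. The senses, prior probabilities, context profiles, and weighting constant below are all fabricated for illustration: with ample surrounding text, context overlap decides the sense; with a bare keyword query, the score degenerates to the sense prior alone.

```python
# Hypothetical sense inventory for the ambiguous mention "apple":
# each candidate sense has a prior (how often the mention links to it)
# and a context profile (words indicative of that sense).
senses = {
    "apple": {
        "Apple Inc.":    {"prior": 0.7, "context": {"iphone", "mac", "company"}},
        "Apple (fruit)": {"prior": 0.3, "context": {"fruit", "tree", "pie"}},
    }
}

def disambiguate(mention: str, context_words: set, alpha: float = 0.5) -> str:
    """Pick the sense maximizing a blend of prior and context overlap."""
    candidates = senses[mention]
    def score(sense):
        info = candidates[sense]
        overlap = len(info["context"] & context_words)
        relatedness = overlap / max(len(context_words), 1)
        return alpha * info["prior"] + (1 - alpha) * relatedness
    return max(candidates, key=score)

# A full-text document supplies plenty of context words...
doc_sense = disambiguate("apple", {"tree", "fruit", "pie", "baking"})
# ...but a short keyword query supplies almost none.
query_sense = disambiguate("apple", set())
```

With the empty context the relatedness term is zero for every candidate, so the dominant sense wins regardless of the user's intent, which is exactly the failure mode that makes disambiguation of short keyword queries hard.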