Keyword Query Expansion on Linked Data Using Linguistic and Semantic Features

Saeedeh Shekarpour, Konrad Höffner, Jens Lehmann and Sören Auer
University of Leipzig, Department of Computer Science, Augustusplatz 10, 04109 Leipzig
{lastname}@informatik.uni-leipzig.de

Abstract—Effective search in structured information based on textual user input is of high importance in thousands of applications. Query expansion methods augment the original query of a user with alternative query elements of similar meaning to increase the chance of retrieving appropriate resources. In this work, we introduce a number of new query expansion features based on semantic and linguistic inferencing over Linked Open Data. We evaluate the effectiveness of each feature individually as well as of their combinations, employing several machine learning approaches. The evaluation is carried out on a training dataset extracted from the QALD question answering benchmark. Furthermore, we propose an optimized linear combination of linguistic and lightweight semantic features in order to predict the usefulness of each expansion candidate. Our experimental study shows a considerable improvement in precision and recall over baseline approaches.

I. INTRODUCTION

With the growth of the Linked Open Data cloud, a vast amount of structured information has been made publicly available. Querying that huge amount of data in an intuitive way is challenging. SPARQL, the standard query language of the Semantic Web, requires exact knowledge of the vocabulary and is not accessible to laypersons. Several tools and algorithms have been developed that make use of semantic technologies and background knowledge [10], such as TBSL [17], SINA [16] and Wolfram|Alpha¹. These tools suffer from a mismatch between the query formulation and the underlying knowledge base structure that is known as the vocabulary problem [8]. For instance, using DBpedia as knowledge base, the query "Who is married to Barack Obama?" could fail, because the desired property in DBpedia is labeled "spouse" and there is no property labeled "married to".

Automatic query expansion (AQE) is a tried and tested method in web search for tackling the vocabulary problem: related words are added to the search query, which increases the likelihood that appropriate documents are contained in the result. The query is expanded with features such as synonyms, e.g. "spouse" and "married to" in the example above, or hyponym–hypernym relations, e.g. "red" and "color". We investigate which methods semantic search engines can use to overcome the vocabulary problem and how effective AQE with the traditional linguistic features is in this regard. Semantic search engines can use the graph structure of RDF and follow interlinks between datasets. We employ this to generate additional expansion features, such as the labels of sub- and superclasses of resources. The underlying research question is whether interlinked data and vocabularies provide features which can be taken into account for query expansion, and how effective those new semantic query expansion features are in comparison to traditional linguistic ones. We do this by using machine learning methods to generate a linear combination of linguistic and semantic query expansion features with the aim of maximizing the F1-score and efficiency on different benchmark datasets. Our results allow developers of new search engines to integrate AQE with good results without spending much time on its design. This is important, since query expansion is usually not considered in isolation, but rather as one step in a pipeline of question answering or keyword search systems.

Our core contributions are as follows:
• Definition of several semantic features for query expansion.
• Creation of benchmark test sets for query expansion.
• Combination of semantic and linguistic features in a linear classifier.
• An analysis of the effect of each feature on the classifier, as well as of further benefits and drawbacks of employing semantic features for query expansion.

The paper is structured as follows: In section II, the overall approach is described, in particular the definition of features and the construction of the linear classifier. Section III describes the experiments on the QALD-1, QALD-2 and QALD-3 test sets and presents the results we obtained. In the related work in section IV, we discuss in which settings AQE is used. Finally, we conclude and give pointers to future work.

II. APPROACH

In document retrieval, many query expansion techniques are based on information contained in the top-ranked documents retrieved in response to the original user query, e.g. [6]. Similarly, our approach is based on performing an initial retrieval of resources according to the original keyword query. Thereafter, further resources are derived by leveraging the initially retrieved ones. Overall, the proposed process, depicted in Figure 1, is divided into three main steps. In the first step, all words closely related to the original keyword are extracted based on two types of features – linguistic and semantic. In the second step, the various introduced linguistic and semantic features are weighted using learning approaches. In the third step, we assign a relevance score to the set of related words. Using this score, we prune the related word set to achieve a balance between precision and recall.

Fig. 1. AQE pipeline: the input keyword query passes through data extraction and preparation (drawing on WordNet and the LOD cloud), feature weighting, and query reformulation, yielding the reformulated query.

A. Extracting and Preprocessing of Data using Semantic and Linguistic Features

For the given input keyword k, we define the set of all words related to k as X_k = {x_1, x_2, ..., x_n}. The set X_k is defined as the union of the two sets LE_k and SE_k. LE_k (resp. SE_k) is constructed as the collection of all words obtained through linguistic (resp. semantic) features. The linguistic features extracted from WordNet are:
• synonyms: words having a meaning similar to the input keyword k.
• hyponyms: words representing a specialization of the input keyword k.
• hypernyms: words representing a generalization of the input keyword k.

The set SE_k comprises all words semantically derived from the input keyword k using Linked Data. To form this set, we match the input keyword k against the rdfs:label property of all resources available as Linked Open Data². This returns the set AP_k = {r_1, r_2, ..., r_n} with AP_k ⊂ (C ∪ I ∪ P), where C, I and P are the sets of classes, instances and properties contained in the knowledge base, respectively, whose labels contain k as a substring or are equivalent to k. For each r_i ∈ AP_k, we derive the resources semantically related to r_i by employing a number of semantic features. These semantic features are defined as the following semantic relations:
• sameAs: deriving resources having the same identity as the input resource using owl:sameAs.
• seeAlso: deriving resources that provide more information about the input resource using rdfs:seeAlso.
• class/property equivalence: deriving classes or properties providing related descriptions of the input resource using owl:equivalentClass and owl:equivalentProperty.
• superclass/-property: deriving all superclasses/-properties of the input resource by following rdfs:subClassOf or rdfs:subPropertyOf property paths originating from the input resource.
• subclass/-property: deriving all sub-resources of the input resource r_i by following rdfs:subClassOf or rdfs:subPropertyOf property paths ending with the input resource.
• broader concepts: deriving broader concepts related to the input resource r_i using the SKOS vocabulary properties skos:broader and skos:broadMatch.
• narrower concepts: deriving narrower concepts related to the input resource r_i using skos:narrower and skos:narrowMatch.
• related concepts: deriving concepts related to the input resource r_i using skos:closeMatch, skos:mappingRelation and skos:exactMatch.

Note that for a given r_i, only those semantic features are applicable which are consistent with its associated type. For instance, super/sub class/property relations are solely applicable to resources of type class or property.

For each r_i ∈ AP_k, we derive all related resources employing the above semantic features. Then, for each derived resource r′, we add all English labels of that resource to the set SE_k. Therefore, SE_k contains the labels of all semantically derived resources. As mentioned before, the set of all related words of the input keyword k is defined as X_k = LE_k ∪ SE_k. After extracting the set X_k of related words, we run the following preprocessing methods for each x_i ∈ X_k:
1) Tokenization: extraction of individual words, ignoring punctuation and case.
2) Stop word removal: removal of common words such as articles and prepositions.
3) Word lemmatisation: determining the lemma of each word.

Vector space of a word: A single word x_i ∈ X_k may be derived via different features. For example, as can be observed in Figure 2, the words "motion picture" and "film" are derived by synonym, sameAs and equivalence relations. Thus, for each derived word x_i, we define a vector space representing the features whose results include that word. Suppose that in total we have n linguistic and semantic features. Each x_i ∈ X_k is then associated with a vector of size n, V_{x_i} = [α_1, α_2, ..., α_n], where each α_j represents the presence or absence of the word x_i in the list of words derived via the feature f_j. For instance, if we assume that f_1 is dedicated to the synonym feature, the value 1 for α_1 in V_{x_i} means that x_i is included in the list of words derived by the synonym feature. Together, the features generate the set of expansion words.

¹ http://www.wolframalpha.com
² via http://lod.openlinksw.com/sparql
Fig. 2. Exemplary expansion graph of the word "movie" using semantic features (nodes include "home movie", "show", "video", "production", "telefilm", "motion picture" and "film", connected via hypernym, hyponym, synonym, sameAs, equivalent and super-resource relations).

B. Feature Selection and Feature Weighting

In order to determine how effective each feature is, and to remove ineffective features, we employ a weighting schema ws for computing the weights of the features as ws : f_i ∈ F → w_i. Note that F is the set of all features taken into account. There are numerous methods to assign weights to features, such as information gain [7], weights from a linear classifier [15], odds ratio, etc. Herein, we consider two well-known weighting schemas.

1) Information Gain (IG): Information gain is often used (see section IV) to decide which of the features are the most relevant. We define the information gain (IG) of a feature as:

    IG(f_i) = Σ_{c ∈ {+,−}, f_i ∈ {present, absent}} Pr(c, f_i) · ln( Pr(c, f_i) / (Pr(f_i) · Pr(c)) )

In our approach, we used the ID3 decision tree algorithm with information gain.

2) Feature weights from linear classifiers: Linear classifiers, such as for example SVMs, calculate predictions by associating a weight w_i with each feature f_i. Features whose w_i is close to 0 have a small influence on the predictions. Therefore, we can assume that they are not very important for query expansion.

C. Setting the Classifier Threshold

As a last step, we set the threshold for the classifiers above. To do this, we compute the relevance score value score(x_i) for each word x_i ∈ X_k. Naturally, this is done by combining the feature vector V_{x_i} = [α_1, α_2, ..., α_n] and the feature weight vector W = [w_1, w_2, ..., w_n] as follows:

    score(x_i) = Σ_{j=1..n} α_j · w_j

We can then compute a subset Y_k of X_k by selecting all words which are above a threshold θ:

    Y_k = {y | y ∈ X_k ∧ score(y) ≥ θ}

This reduces the set of query expansion candidates X_k to a set Y_k containing only the words in X_k above the threshold θ. Since we use a linear weighted combination of features, words which exhibit one or more highly relevant features are more likely to be included in the query expansion set Y_k. Thus, the threshold can be used to control the trade-off between precision and recall: a high threshold means higher precision but lower recall, whereas a low threshold improves recall at the cost of precision.

To sum up, Y_k is the output of the AQE system and provides words semantically related to the input keyword k. Retrieval of resources from the underlying knowledge base is based on matching a given keyword against the rdfs:label of resources. The expansion therefore increases the likelihood of recognizing more relevant resources: in addition to resources whose rdfs:label matches the input keyword k, we take into account resources whose rdfs:label matches one of the words contained in Y_k. This causes an increase in recall. On the other hand, it may result in a loss of precision by including resources which are irrelevant, and a severe loss in precision significantly hurts retrieval performance. Therefore, we aim at a moderate trade-off, although the final setting highly depends on the requirements of the search service (whether precision or recall is more important). Herein, the set Y_k includes all x_i ∈ X_k with a high relevance likelihood and excludes those with a low likelihood.

III. EXPERIMENT AND RESULTS

A. Experimental Setup

The goal of our evaluation is to determine (1) how effective linguistic as well as semantic query expansion features perform and (2) how well a linear weighted combination of features performs. To the best of our knowledge, no benchmark has been created so far for query expansion tasks over Linked Data. Thus, we created a benchmark dataset. This dataset, called the QALD benchmark, contains 37 keyword queries obtained from the QALD-1 and QALD-2 benchmarks³. The QALD-1 and QALD-2 benchmarks are essentially tailored towards comparing question answering systems based on natural language queries. We extracted all those keywords contained in the natural language queries which require expansion for matching the target knowledge base resource. An example are the keywords "wife" and "husband", which should be matched to dbpedia-owl:spouse. Note that in the following, all experiments except the last one are carried out on this dataset.

B. Results

Generally, when applying expansion methods, there is a risk of yielding a large set of irrelevant words, which can have a negative impact on further processing steps. For this reason, we were interested in measuring the effectiveness of all linguistic as well as semantic features individually, in order to decide which of those are effective in query expansion. Table I shows statistics over the number of derived words and the number of matches per feature. Interestingly, sameAs has even more matches than synonym, but also leads to a higher number of derived words. The hyponym and sub class/property features return a huge number of derived words, while their numbers of matches are very low. The features hypernym and super class/property, in comparison to the rest of the features, result in a considerable number of matches. The features broader concepts, narrower concepts and related concepts provide a very small number of derived words and no matches at all. Thus, we exclude the SKOS features in the later experiments.

Feature                derived words   matches
synonym                      503          23
hyponym                     2703          10
hypernym                     657          14
sameAs                      2332          12
seeAlso                       49           2
equivalent                     2           0
super class/property         267           4
sub class/property          2166           4
broader concepts               0           0
narrower concepts              0           0
related concepts               0           0

TABLE I: Number of the words derived by each feature and their associated matches.

Thereafter, we scored all the derived words according to the previously computed weights. Table II shows the weights computed by the weighting schemas. The feature super class/property is ranked as the highest distinguishing feature. The computed value of hyponym is relatively high, but this feature has a negative influence on all examples. Interestingly, in the SVM schema, sameAs and seeAlso as well as synonym were assigned similar weights. This may indicate that sameAs and seeAlso can be a comparable alternative for synonym.

Feature                GR      SVM     IG
synonym                0.920   0.400   0.300
hyponym                0.400   0.810   0.490
hypernym               1       0.400   0.670
sameAs                 0.400   0.800   0.490
seeAlso                0.400   1.360   0.300
equivalent             0.300   0.490   0.300
super class/property   0.450   1.450   0.700
sub class/property     1.500   0.670   1.120

TABLE II: Computed weights of the features using the schemas Gain Ratio (GR), Support Vector Machines (SVM) and Information Gain (IG).

³ http://www.sc.cit-ec.uni-bielefeld.de/qald-n for n = 1, 2.
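The weighting and pruning machinery of sections II-B and II-C can be sketched as follows. This is a minimal Python illustration, not the Weka-based setup used in the experiments: the function names are ours, the weight vector is taken from the SVM column of Table II, and the candidate words with their feature vectors are toy data.

```python
import math

# Information gain of a single feature (formula of section II-B):
# IG(f) = sum over c in {+,-} and v in {present, absent} of
#         Pr(c, v) * ln(Pr(c, v) / (Pr(v) * Pr(c)))
def information_gain(examples):
    """examples: list of (present, positive) boolean pairs for one feature."""
    n = len(examples)
    ig = 0.0
    for v in (True, False):          # feature present / absent
        for c in (True, False):      # positive / negative example
            p_cv = sum(1 for pv, pc in examples if pv == v and pc == c) / n
            p_v = sum(1 for pv, _ in examples if pv == v) / n
            p_c = sum(1 for _, pc in examples if pc == c) / n
            if p_cv > 0:
                ig += p_cv * math.log(p_cv / (p_v * p_c))
    return ig

# Linear scoring and thresholding (section II-C):
# score(x_i) = sum_j alpha_j * w_j,  Y_k = {y in X_k | score(y) >= theta}
def score(alpha, w):
    return sum(a * wj for a, wj in zip(alpha, w))

def prune(vectors, w, theta):
    return {x for x, alpha in vectors.items() if score(alpha, w) >= theta}

# Weights in feature order synonym, hyponym, hypernym, sameAs, seeAlso,
# equivalent, super class/property, sub class/property (Table II, SVM column).
W = [0.400, 0.810, 0.400, 0.800, 1.360, 0.490, 1.450, 0.670]

# Toy candidates: "film" derived via synonym and sameAs, "show" via hypernym.
vectors = {
    "film": [1, 0, 0, 1, 0, 0, 0, 0],   # score 0.4 + 0.8 = 1.2
    "show": [0, 0, 1, 0, 0, 0, 0, 0],   # score 0.4
}
print(prune(vectors, W, theta=1.0))     # only "film" passes the threshold
```

As a sanity check on the IG definition: a feature perfectly correlated with the class attains IG = ln 2 on a balanced sample, while a feature independent of the class attains IG = 0.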
Furthermore, subproperty and subclass are excluded in this schema.

We set up two different settings: in the first, only linguistic features are taken into account, and in the second, only semantic features. A separate evaluation is carried out for each setting with respect to the computed SVM weights. Table III shows the results of this study. Interestingly, the setting with only semantic features results in an accuracy at least as high as the setting with only linguistic features. This observation is an important finding of this paper: semantic features appear to be competitive with linguistic features.

Features     Weighting            P       Recall   F-Score
Linguistic   SVM                  0.730   0.650    0.620
Semantic     SVM                  0.680   0.630    0.600
Linguistic   Decision Tree / IG   0.588   0.579    0.568
Semantic     Decision Tree / IG   0.755   0.684    0.661

TABLE III: Accuracy results of predicting the relevant derived words.

In continuation, we investigated the accuracy of each individual feature. To this end, we employed an SVM classifier and evaluated the accuracy of each feature individually. This experiment was carried out on the dataset with 10-fold cross validation. Table IV presents the results of this study. In this table, the precision, recall and F-measure for the positive (+), negative (−) and all (total) examples are shown. In addition, six separate evaluations were carried out over subsets of features which are semantically close to each other, e.g. hyponym + sub class/property.

In the following, we only mention the most prominent observations of this study. The features hyponym, super class/property and sub class/property have the highest F-measure values. The precision of sameAs is the same as for synonym. The feature equivalent has a high precision, although its recall is very low. The precision of sub class/property and hyponym is high for negative examples. Lastly, the combined features always behave better than the individual features.

Feature(s)                             Examples   P       Recall   F-Score
synonym                                −          0.440   0.680    0.540
                                       +          0.330   0.150    0.210
                                       total      0.390   0.420    0.370
hyponym                                −          0.875   0.368    0.519
                                       +          0.600   0.947    0.735
                                       total      0.737   0.658    0.627
hypernym                               −          0.545   0.947    0.692
                                       +          0.800   0.211    0.333
                                       total      0.673   0.579    0.513
sameAs                                 −          0.524   0.579    0.550
                                       +          0.529   0.474    0.500
                                       total      0.527   0.526    0.525
seeAlso                                −          0.471   0.842    0.604
                                       +          0.250   0.053    0.087
                                       total      0.360   0.447    0.345
equivalent                             −          0.480   0.890    0.630
                                       +          0.330   0.053    0.091
                                       total      0.410   0.470    0.360
super class/property                   −          0.594   1        0.745
                                       +          1       0.316    0.480
                                       total      0.797   0.658    0.613
sub class/property                     −          0.480   0.890    0.630
                                       +          0.330   0.050    0.090
                                       total      0.520   0.410    0.470
sameAs, seeAlso, equivalent            −          0.500   0.579    0.530
                                       +          0.500   0.420    0.450
                                       total      0.500   0.500    0.490
synonym, sameAs, seeAlso, equivalent   −          0.470   0.579    0.520
                                       +          0.460   0.360    0.410
                                       total      0.470   0.470    0.460
hyponym, subresource                   −          0.875   0.368    0.519
                                       +          0.600   0.947    0.735
                                       total      0.737   0.658    0.627
hypernym, superresource                −          0.594   1        0.745
                                       +          1       0.316    0.480
                                       total      0.797   0.658    0.613

TABLE IV: Separate evaluations of the precision, recall and F-score of each individual feature.

The second goal of our experimental study is to determine how well a linear weighted combination of features can predict the relevant words. To this end, we firstly employed the two weighting schemas described in the approach, i.e. information gain (IG) and the weights of the linear SVM classifier. Secondly, these computed weights are used for reformulating the input keyword.

IV. RELATED WORK

Automatic Query Expansion (AQE) is a vibrant research field and there are many approaches that differ in the choice and combination of techniques.

A. Design choices for query expansion

In the following, we describe the most important choices of data sources, candidate feature extraction methods and representations, feature selection methods, and expanded query representations (cf. [5] for a detailed survey).

a) Data sources: Data sources are organized collections of words or documents. The choice of the data source is crucial, because it influences the size of the vocabulary as well as the available relations, and thus the possible expansion features. Furthermore, a data source is often restricted to a certain domain and thus constitutes a certain context for the search terms. For example, corpora extracted from newspapers yield good results when expanding search queries related to news, but are generally less suited for scientific topics. Thus, a search query with the keyword "fields" intended for electromagnetic fields could be expanded to "football fields" and yield even worse results than the original query. Popular choices for data sources are text corpora, WordNet (synsets, hyponyms and hypernyms), anchor texts, query logs or top-ranked documents. Our approach uses both WordNet, as a source for synonyms, hyponyms and hypernyms, and the LOD cloud, to obtain labels of related classes or properties, such as equivalent, sub- and super-resources (cf. section II). WordNet is used frequently, but it suffers from two main problems [5]:
1) There is a lack of proper nouns, which we tackle by using the LOD cloud including DBpedia, which contains mainly instances.
2) The large amount of ambiguous terms leads to a disambiguation problem. However, this does not manifest itself in the relatively small models of the benchmarks we used (cf. section III). Disambiguation is not the focus of this work, but may need to be addressed separately when our approach is used in larger domains.

b) Feature selection: Feature selection [13] consists of two parts: (1) feature weighting assigns a score to each feature and (2) the feature selection criterion determines which of the features to retain based on the weights. Some common feature selection methods are mutual information, information gain, divergence from randomness and relevance models. A framework for feature weighting and selection methods is presented in [13]. The authors compare different feature ranking schemes and (although not the primary focus of the article) show that SVMs achieve the highest F1-score of the examined methods. We use information gain and SVMs separately and compare the results. To learn the feature rankings, we use the data mining software Weka [12].

c) Expanded query representation: The expanded query can take the form of a Boolean or structured query, vectors of concept types or unweighted terms, and many others. Because our query is a set of keywords, our extended query is an extended set of keywords and thus consists of unweighted terms.

B. Semantic Search and Question Answering Engines

While AQE is prevalent in traditional web search engines, the semantic search engines we examine in the following either do not address the vocabulary problem or tackle it in a different way. A prerequisite of feature-based query expansion is that the data about the features used, i.e. the pairs contained in the relations, is available. Relation equivalence links (owl:equivalentProperty) in particular are often not, or not completely, defined for a knowledge base. There is, however, an approach for mining equivalent relations from Linked Data that relies on three measures of equivalency: triple overlap, subject agreement and cardinality ratio [19].

Table V shows how the participants of the QALD/ILD 2012 workshop and selected other approaches tackle the vocabulary problem.
Interestingly, two of the three considered approaches did not use any kind of query expansion, relying instead only on exact string matching between the query keyword and the label of the associated resource. Alexandria [18] uses Freebase to include synonyms and different surface forms.

MHE⁴ combines query expansion and entity recognition by using textual references to a concept and extracting Wikipedia anchor texts of links. For example, in a wiki link such as [[United Nations|UN]], the word "UN" is mapped to the concept United_Nations. This approach takes advantage of a large amount of hand-made mappings that emerge as a byproduct. However, it is only viable for Wikipedia–DBpedia or other text corpora whose links are mapped to resources.

Eager [11] expands a set of resources with resources of the same type, using DBpedia and Wikipedia categories (instead of the linguistic and semantic features used in our case). Eager extracts implicit category and synonym information from abstracts and redirect information, determines additional categories from the DBpedia category hierarchy and then extracts additional resources which have the same categories in common.

PowerAqua [14] is an ontology-based system that answers natural language queries and uses WordNet synonyms and hypernyms as well as resources related via the owl:sameAs property.

An approach similar to ours is [2]; however, it relies on supervised learning and uses only semantic expansion features instead of a combination of both semantic and linguistic ones.

Engine            Method
TBSL [17]         WordNet synonyms and BOA pattern library [9]
PowerAqua [14]    WordNet synonyms and hypernyms, owl:sameAs
Eager [11]        resources of the same type using Wikipedia categories
Alexandria [18]   alternative names (synonyms and different surface forms) from Freebase [3]
SemSeK [1]        no AQE
QAKiS [4]         no AQE
MHE               Wikipedia anchor texts from links pointing to the concept

TABLE V: Prevalence of AQE in RDF-based search or question answering engines.

V. CONCLUSIONS

Semantic search is one of the most important applications for demonstrating the value of the Semantic Web to real users. In the last years, a number of approaches for semantic search have been introduced. However, other than in traditional web search, query expansion has so far played only a minor role in these approaches. With semantically structured knowledge, we can also employ additional semantic query expansion features. In this work, we performed a comprehensive study of semantic query expansion. We compared the effectiveness of linguistic and semantic query expansion as well as their combination. Based on a query expansion benchmark we created, our results suggest that semantic features are at least as effective as linguistic ones and that an intelligent combination of both brings a boost in precision and recall.

⁴ http://ups.savba.sk/~marek

REFERENCES

[1] N. Aggarwal and P. Buitelaar. A system description of natural language query over DBpedia. In C. Unger, P. Cimiano, V. Lopez, E. Motta, P. Buitelaar, and R. Cyganiak, editors, Proceedings of Interacting with Linked Data (ILD 2012), workshop co-located with the 9th Extended Semantic Web Conference, May 28, 2012, Heraklion, Greece, pages 97–100, 2012.
[2] I. Augenstein, A. L. Gentile, B. Norton, Z. Zhang, and F. Ciravegna. Mapping keywords to linked data resources for automatic query expansion. In The Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Montpellier, France, May 26-30, 2013. Proceedings, Lecture Notes in Computer Science. Springer, 2013.
[3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM, 2008.
[4] E. Cabrio, A. P. Aprosio, J. Cojan, B. Magnini, F. Gandon, and A. Lavelli. QAKiS @ QALD-2. In C. Unger, P. Cimiano, V. Lopez, E. Motta, P. Buitelaar, and R. Cyganiak, editors, Proceedings of Interacting with Linked Data (ILD 2012), workshop co-located with the 9th Extended Semantic Web Conference, May 28, 2012, Heraklion, Greece, pages 88–96, 2012.
[5] C. Carpineto and G. Romano. A survey of automatic query expansion in information retrieval. ACM Comput. Surv., 44(1):1:1–1:50, Jan. 2012.
[6] K. Collins-Thompson. Reducing the risk of query expansion via robust constrained optimization. In CIKM. ACM, 2009.
[7] H. Deng, G. C. Runger, and E. Tuv. Bias of importance measures for multi-valued attributes and solutions. In ICANN (2), volume 6792, pages 293–300. Springer, 2011.
[8] G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais. The vocabulary problem in human-system communication. Communications of the ACM, 30(11):964–971, 1987.
[9] D. Gerber and A.-C. Ngonga Ngomo. Bootstrapping the linked data web. In 1st Workshop on Web Scale Knowledge Extraction @ ISWC 2011, 2011.
[10] R. Guha, R. McCool, and E. Miller. Semantic search. In Proceedings of the 12th international conference on World Wide Web, WWW '03, pages 700–709, New York, NY, USA, 2003. ACM.
[11] O. Gunes, C. Schallhart, T. Furche, J. Lehmann, and A.-C. N. Ngomo. Eager: extending automatically gazetteers for entity recognition. In Proceedings of the 3rd Workshop on the People's Web Meets NLP: Collaboratively Constructed Semantic Resources and their Applications to NLP, 2012.
[12] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18, Nov. 2009.
[13] S. Li, R. Xia, C. Zong, and C.-R. Huang. A framework of feature selection methods for text categorization. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL '09, pages 692–700, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
[14] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: supporting users in querying and exploring the semantic web content. Semantic Web Journal. IOS Press, March 2011.
[15] D. Mladenic, J. Brank, M. Grobelnik, and N. Milic-Frayling. Feature selection using linear classifier weights: interaction with classification models. In Proceedings of the 27th Annual International ACM SIGIR Conference (SIGIR 2004). ACM, 2004.
[16] S. Shekarpour, S. Auer, and A.-C. Ngonga Ngomo. Question answering on interlinked data. In Proceedings of WWW, 2013.
[17] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, WWW '12, pages 639–648, 2012.
[18] M. Wendt, M. Gerlach, and H. Duwiger. Linguistic modeling of linked open data for question answering. In C. Unger, P. Cimiano, V. Lopez, E. Motta, P. Buitelaar, and R. Cyganiak, editors, Proceedings of Interacting with Linked Data (ILD 2012), workshop co-located with the 9th Extended Semantic Web Conference, May 28, 2012, Heraklion, Greece, pages 75–87, 2012.
[19] Z. Zhang, A. L. Gentile, I. Augenstein, E. Blomqvist, and F. Ciravegna. Mining equivalent relations from linked data. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), Sofia, Bulgaria, August 2013.