Semantic Search
Total Page:16
File Type:pdf, Size:1020Kb
Semantic Search Philippe Cudre-Mauroux Definition ter grasp the semantics (i.e., meaning) and the context of the user query and/or Semantic Search regroups a set of of the indexed content in order to re- techniques designed to improve tra- trieve more meaningful results. ditional document or knowledge base Semantic Search techniques can search. Semantic Search aims at better be broadly categorized into two main grasping the context and the semantics groups depending on the target content: of the user query and/or of the indexed • techniques improving the relevance content by leveraging Natural Language of classical search engines where the Processing, Semantic Web and Machine query consists of natural language Learning techniques to retrieve more text (e.g., a list of keywords) and relevant results from a search engine. results are a ranked list of documents (e.g., webpages); • techniques retrieving semi-structured Overview data (e.g., entities or RDF triples) from a knowledge base (e.g., a knowledge graph or an ontology) Semantic Search is an umbrella term re- given a user query formulated either grouping various techniques for retriev- as natural language text or using ing more relevant content from a search a declarative query language like engine. Traditional search techniques fo- SPARQL. cus on ranking documents based on a set of keywords appearing both in the user’s Those two groups are described in query and in the indexed content. Se- more detail in the following section. For mantic Search, instead, attempts to bet- each group, a wide variety of techniques 1 2 Philippe Cudre-Mauroux have been proposed, ranging from Natu- matical tags (such as noun, conjunction ral Language Processing (to better grasp or verb) to individual words. Such the contents of the query and data) to assignment is highly accurate for well- Semantic Web (to guide the search pro- formed sentences (Manning 2011), but cess leveraging declarative artifacts like much more challenging for short texts ontologies) and Machine Learning (typ- such as queries (Hua et al 2015). POS ically to learn models from large quanti- tags can then by used to better discrim- ties of data). inate textual keywords, for example for Named-Entity Recognition (NER), where the task is to identify which key- words correspond to real-world entities, Main Approaches or for co-reference resolution, where the task is to identify all keywords referring We give below an overview of the vari- to the same entity in text. Sentence ous techniques that have been proposed parsing takes NLP analyses to the next in the context of Semantic Search for level by aiming at capturing the overall improving document search as well structure of sentences, typically through as knolwedge base search. A number a dependency parse tree. of surveys delve into more detail in NLP methods are often combined this context: Mangold (2007) focuses with lexical resources or third-party on natural language queries on RDF sources to retrieve more relevant results. knowledge bases or ontologies. Madhu The main idea in this context is to et al (2011) and Makel¨ a¨ (2005) are identify entities in the textual query two brief surveys covering both topics. or content and to match them to their Bast et al (2016) is an extensive survey counterpart in a third-party resource to covering Semantic Search in its broadest improve the search results. Voorhees sense. (1993) proposed an early approach in that sense that leverages WordNet to disambiguate word senses and hence improve search results. Pehcevski Document Search et al (2008) analyze the structure of Wikipedia to better rank relevant entities Classical search engines take as input in response to a search request. Kaptein a user query formulated as a list of et al (2010) use Wikipedia to better keywords and return as output a ranked characterize and identify entities when list of documents relevant to those key- searching for entities in document words. A number of Semantic Search collections, while Schuhmacher et al methods have been suggested in that (2015) combine different features from context. the documents, the entity mentions Natural Language Processing (NLP) and Wikipedia using a learning-to-rank techniques have long been applied to approach to improve the search results. better grasp the semantics of the query Conceptually similar approaches or documents. Often, Part-Of-Speech have been proposed in the context of the (POS) tagging is first applied on the Semantic Web, by leveraging the struc- textual content in order to assign gram- ture or contents of a knowledge base to Semantic Search 3 better grasp the context of queries or pearing within a certain context window entities appearing in textual documents. using a relatively simple neural network. Tran et al (2007), for instance, propose This opened the door to numerous an ontology-based interpretation of applications, by efficiently generating natural language queries for Semantic vector representations of words from Search. The authors translate a keyword large text corpora and feeding them into query into a description logic, conjunc- subsequent machine learning models. tive query that can then be evaluated Approaches to improve Entity Recogni- with respect to an underlying knowledge tion (Siencnik 2015), web search query base. Schuhmacher and Ponzetto (2013) expansion (Grbovic et al 2015), or web exploit entities and semantic relations search ranking (Nalisnick et al 2016) from the DBpedia knowledge base to have for example been explored in the cluster the results of a search engine into context of Semantic Search. more meaningful groups. Prokofyev et al (2015) leverage an ontology to better resolve co-references in textual documents for Semantic Search tasks. Knowledge Base Search Machine Learning techniques are often used for Semantic Search, to A number of Semantic Search ap- power some of the approaches described proaches target large and declarative above, but also to capture the context knowledge bases (a.k.a ontologies or and semantics of the words or entities knowledge graphs) instead of document appearing in documents. One of the collections. Such knowledge bases can main intuitions in this context is that be expressed in many different ways words that occur in similar contexts that are typically derived from Semantic are likely to be semantically similar. Web standards such as RDF or OWL. Early approaches leveraging this obser- Google’s Knowledge Graph, DBpedia vation built high-dimensional matrices (Bizer et al 2009), Yago (Rebele et al capturing the co-occurrence of words 2016) or Wikidata (Vrandecic and in a certain context (e.g., within a Krotzsch¨ 2014) are well-known exam- window of a few words), and hence the ples of that trend. Users can express similarity between words (Lund and their queries through two main modal- Burgess 1996). Each word is in that case ities in this case: either as structured represented by a sparse vector in a high- (e.g., SPARQL) queries, or as natural dimensional space. Lower-dimensional language (e.g., keyword) queries. embeddings can then be created by Horrocks and Tessaris (2002) in- applying standard matrix factorization troduced an early formal approach to techniques like Principal Component answer structured queries posed against Analysis. ontologies. Their algorithm return sound More recently, Mikolov et al (2013) a complete results to conjunctive queries suggested a new word embedding leveraging reasoning techniques and technique to generate dense vector Description Logics. Stojanovic et al representations of words efficiently. (2003) present a method to rank results The method works by maximizing the in ontology-based search. The authors co-occurrence probability of words ap- consider conjunctive queries, and com- 4 Philippe Cudre-Mauroux bine logical inference with an analysis resentations of entities in RDF graphs. of the contents of the knowledge base to (Wang et al 2017) provide a survey of retrieve more relevant results. Maedche embedding approaches for knowledge et al (2003) introduce a meta-ontology bases and classify the models into and a registry to improve search queries two main families: translation-based targeting ontologies. Their solution techniques, which interpret relations leverages WordNet to match entities in the knowledge base as a transla- appearing in the ontology to lexical tion vector between the two entities entries and to guide the search process. connected by the relation, and seman- Pound et al (2010) introduce the ad- tic matching models, which exploit hoc object retrieval task for searching similarity-based scoring functions to for resources (e.g., entities, types or create the embeddings. relations) over knowledge bases using natural language queries. The authors also propose a baseline technique for answering such queries based on term Systems frequencies as well as an evaluation methodology. Tonon et al (2012) pro- Leading industrial search engines, pose an improved search technique such as Bing, Yandex or Google, all for ad-hoc object retrieval exploiting implement Semantic Search in one way sequentially an inverted index to answer or another, but typically do not describe keyword queries and a graph database in detail the techniques they leverage. to improve the search effectiveness by A number of Semantic Search systems automatically generating declarative have been described or open-sourced, queries over an RDF graph. however, and are summarized below. Hybrid approaches leveraging both TAP (Guha et al 2003) is an early Se- textual and structured contents have also mantic Search framework. TAP focuses been suggested. Rocha et al (2004), for on entity search queries expressed as instance, combine a traditional search keywords, and augments traditional re- engine with graph exploration tech- sults (documents) with semi-structured niques to answer keyword queries on an data returned from a knowledge base. ontology. Zhang et al (2005) suggest a The knowledge base is also used to new model to search semantic portals, better filter and sort the list or returned where both documents and structured documents in that context.