Semantic Search

Philippe Cudre-Mauroux

Definition ter grasp the semantics (i.e., meaning) and the context of the user query and/or Semantic Search regroups a set of of the indexed content in order to re- techniques designed to improve tra- trieve more meaningful results. ditional document or Semantic Search techniques can search. Semantic Search aims at better be broadly categorized into two main grasping the context and the semantics groups depending on the target content: of the user query and/or of the indexed • techniques improving the relevance content by leveraging Natural Language of classical search engines where the Processing, and Machine query consists of natural language Learning techniques to retrieve more text (e.g., a list of keywords) and relevant results from a search engine. results are a ranked list of documents (e.g., webpages); • techniques retrieving semi-structured Overview data (e.g., entities or RDF triples) from a knowledge base (e.g., a or an ontology) Semantic Search is an umbrella term re- given a user query formulated either grouping various techniques for retriev- as natural language text or using ing more relevant content from a search a declarative query language like engine. Traditional search techniques fo- SPARQL. cus on ranking documents based on a set of keywords appearing both in the user’s Those two groups are described in query and in the indexed content. Se- more detail in the following section. For mantic Search, instead, attempts to bet- each group, a wide variety of techniques

1 2 Philippe Cudre-Mauroux have been proposed, ranging from Natu- matical tags (such as noun, conjunction ral Language Processing (to better grasp or verb) to individual words. Such the contents of the query and data) to assignment is highly accurate for well- Semantic Web (to guide the search pro- formed sentences (Manning 2011), but cess leveraging declarative artifacts like much more challenging for short texts ontologies) and Machine Learning (typ- such as queries (Hua et al 2015). POS ically to learn models from large quanti- tags can then by used to better discrim- ties of data). inate textual keywords, for example for Named-Entity Recognition (NER), where the task is to identify which key- words correspond to real-world entities, Main Approaches or for co-reference resolution, where the task is to identify all keywords referring We give below an overview of the vari- to the same entity in text. Sentence ous techniques that have been proposed parsing takes NLP analyses to the next in the context of Semantic Search for level by aiming at capturing the overall improving document search as well structure of sentences, typically through as knolwedge base search. A number a dependency parse tree. of surveys delve into more detail in NLP methods are often combined this context: Mangold (2007) focuses with lexical resources or third-party on natural language queries on RDF sources to retrieve more relevant results. knowledge bases or ontologies. Madhu The main idea in this context is to et al (2011) and Makel¨ a¨ (2005) are identify entities in the textual query two brief surveys covering both topics. or content and to match them to their Bast et al (2016) is an extensive survey counterpart in a third-party resource to covering Semantic Search in its broadest improve the search results. Voorhees sense. (1993) proposed an early approach in that sense that leverages WordNet to disambiguate word senses and hence improve search results. Pehcevski Document Search et al (2008) analyze the structure of to better rank relevant entities Classical search engines take as input in response to a search request. Kaptein a user query formulated as a list of et al (2010) use Wikipedia to better keywords and return as output a ranked characterize and identify entities when list of documents relevant to those key- searching for entities in document words. A number of Semantic Search collections, while Schuhmacher et al methods have been suggested in that (2015) combine different features from context. the documents, the entity mentions Natural Language Processing (NLP) and Wikipedia using a learning-to-rank techniques have long been applied to approach to improve the search results. better grasp the semantics of the query Conceptually similar approaches or documents. Often, Part-Of-Speech have been proposed in the context of the (POS) tagging is first applied on the Semantic Web, by leveraging the struc- textual content in order to assign gram- ture or contents of a knowledge base to Semantic Search 3 better grasp the context of queries or pearing within a certain context window entities appearing in textual documents. using a relatively simple neural network. Tran et al (2007), for instance, propose This opened the door to numerous an ontology-based interpretation of applications, by efficiently generating natural language queries for Semantic vector representations of words from Search. The authors translate a keyword large text corpora and feeding them into query into a description logic, conjunc- subsequent machine learning models. tive query that can then be evaluated Approaches to improve Entity Recogni- with respect to an underlying knowledge tion (Siencnik 2015), web search query base. Schuhmacher and Ponzetto (2013) expansion (Grbovic et al 2015), or web exploit entities and semantic relations search ranking (Nalisnick et al 2016) from the DBpedia knowledge base to have for example been explored in the cluster the results of a search engine into context of Semantic Search. more meaningful groups. Prokofyev et al (2015) leverage an ontology to better resolve co-references in textual documents for Semantic Search tasks. Knowledge Base Search Machine Learning techniques are often used for Semantic Search, to A number of Semantic Search ap- power some of the approaches described proaches target large and declarative above, but also to capture the context knowledge bases (a.k.a ontologies or and semantics of the words or entities knowledge graphs) instead of document appearing in documents. One of the collections. Such knowledge bases can main intuitions in this context is that be expressed in many different ways words that occur in similar contexts that are typically derived from Semantic are likely to be semantically similar. Web standards such as RDF or OWL. Early approaches leveraging this obser- ’s Knowledge Graph, DBpedia vation built high-dimensional matrices (Bizer et al 2009), Yago (Rebele et al capturing the co-occurrence of words 2016) or (Vrandecic and in a certain context (e.g., within a Krotzsch¨ 2014) are well-known exam- window of a few words), and hence the ples of that trend. Users can express similarity between words (Lund and their queries through two main modal- Burgess 1996). Each word is in that case ities in this case: either as structured represented by a sparse vector in a high- (e.g., SPARQL) queries, or as natural dimensional space. Lower-dimensional language (e.g., keyword) queries. embeddings can then be created by Horrocks and Tessaris (2002) in- applying standard matrix factorization troduced an early formal approach to techniques like Principal Component answer structured queries posed against Analysis. ontologies. Their algorithm return sound More recently, Mikolov et al (2013) a complete results to conjunctive queries suggested a new word embedding leveraging reasoning techniques and technique to generate dense vector Description Logics. Stojanovic et al representations of words efficiently. (2003) present a method to rank results The method works by maximizing the in ontology-based search. The authors co-occurrence probability of words ap- consider conjunctive queries, and com- 4 Philippe Cudre-Mauroux bine logical inference with an analysis resentations of entities in RDF graphs. of the contents of the knowledge base to (Wang et al 2017) provide a survey of retrieve more relevant results. Maedche embedding approaches for knowledge et al (2003) introduce a meta-ontology bases and classify the models into and a registry to improve search queries two main families: translation-based targeting ontologies. Their solution techniques, which interpret relations leverages WordNet to match entities in the knowledge base as a transla- appearing in the ontology to lexical tion vector between the two entities entries and to guide the search process. connected by the relation, and seman- Pound et al (2010) introduce the ad- tic matching models, which exploit hoc object retrieval task for searching similarity-based scoring functions to for resources (e.g., entities, types or create the embeddings. relations) over knowledge bases using natural language queries. The authors also propose a baseline technique for answering such queries based on term Systems frequencies as well as an evaluation methodology. Tonon et al (2012) pro- Leading industrial search engines, pose an improved search technique such as Bing, Yandex or Google, all for ad-hoc object retrieval exploiting implement Semantic Search in one way sequentially an inverted index to answer or another, but typically do not describe keyword queries and a graph database in detail the techniques they leverage. to improve the search effectiveness by A number of Semantic Search systems automatically generating declarative have been described or open-sourced, queries over an RDF graph. however, and are summarized below. Hybrid approaches leveraging both TAP (Guha et al 2003) is an early Se- textual and structured contents have also mantic Search framework. TAP focuses been suggested. Rocha et al (2004), for on entity search queries expressed as instance, combine a traditional search keywords, and augments traditional re- engine with graph exploration tech- sults (documents) with semi-structured niques to answer keyword queries on an data returned from a knowledge base. ontology. Zhang et al (2005) suggest a The knowledge base is also used to new model to search semantic portals, better filter and sort the list or returned where both documents and structured documents in that context. data are available. Their method is based A number of Semantic Search sys- on creating textual representations for tems focusing on RDF and ontologies all entities in the structured repository have been proposed. Swoogle (Ding such that they can also be indexed and et al 2004) offers semantic search searched through classical Information over a knowledge base represented in Retrieval techniques. RDF thanks to an inverted index and Word Embeddings techniques (see a database storing metadata about all above) have also been adapted to power entities. SWSE (Hogan et al 2011) fol- Semantic Search on knowledge bases. lows the typical architecture of a search RDF2Vec (Ristoski and Paulheim engine but operates on RDF data also. 2016), for instance, learns vectorial rep- SWSE results are ranked by running a Semantic Search 5 classical PageRank algorithm on a graph default DBpedia), interlinking entities, connecting the URIs appearing in the and retrieving entities. RDF triple to their source on the Web. Schema.org (Mika 2015), finally, is SemSearch (Lei et al 2006) is a search not a system per se, but rather a stan- engine for the Semantic Web that hides dardization effort founded by leading the complexity of the underlying RDF search engines (including Google, Mi- data to the user. SemSearch accepts key- crosoft, Yahoo and Yandex). Its mission word queries from the user, translates the is to create and promote schemas to user queries into formal queries by ex- embed semi-structured data in web ploiting the labels of the entities in the documents (and beyond) in order to knowledge base, runs the resulting query facilitate Semantic Search capabilities in the knowledge base and finally ranks online. the results by taking into account the number of keywords the search results satisfy. Sindice (Oren et al 2008) is a Se- mantic Search engine and look-up ser- Cross-References vice that focuses on scaling to very large quantities of semi-structured data. It sup- Knowledge Graph Embeddings, Rea- ports keyword and URI-based search as soning at Scale, Semantic Interlinking. well as structured queries. SHOE (Heflin and Hendler 2000) is an early Semantic Search system collecting semi-structured annotations References from the web and storing them in a knowledge base. It offers a GUI to Bast H, Buchhold B, Haussmann E (2016) formulate ontology-based structured Semantic search on text and knowledge bases. Foundations and Trends in Infor- queries to find webpages. The mation Retrieval 10(2-3):119–271, DOI system (d’Aquin and Motta 2011) works 10.1561/1500000032, URL http://dx. similarly by collecting, analyzing and doi.org/10.1561/1500000032 giving access to ontologies and semantic Bizer C, Lehmann J, Kobilarov G, Auer S, data available online. It supports both Becker C, Cyganiak R, Hellmann S (2009) Dbpedia - A crystallization point for the keyword and SPARQL search queries. web of data. J Web Sem 7(3):154–165, SCORE (Sheth et al 2002) is a plat- DOI 10.1016/j.websem.2009.07.002, URL form supporting the creation and mainte- https://doi.org/10.1016/j. nance of large knowledge bases. It sup- websem.2009.07.002 ports ontology-driven Semantic Search d’Aquin M, Motta E (2011) Watson, more than a semantic web search en- capabilities by extracting facts and meta- gine. Semant web 2(1):55–63, URL data from web sources via text mining http://dl.acm.org/citation. techniques. cfm?id=2019470.2019476 Nordlys (Hasibi et al 2017) is Ding L, Finin T, Joshi A, Pan R, Cost RS, an open-source toolkit for Semantic Peng Y,Reddivari P, Doshi V,Sachs J (2004) Swoogle: A search and metadata engine for Search. The toolkit supports a number the semantic web. In: Proceedings of the of features including detecting entities Thirteenth ACM International Conference in natural language queries, cataloging on Information and Knowledge Manage- entities from a knowledge base (by ment, ACM, New York, NY, USA, CIKM 6 Philippe Cudre-Mauroux

’04, pp 652–659, DOI 10.1145/1031171. gineering, pp 495–506, DOI 10.1109/ICDE. 1031289, URL http://doi.acm.org/ 2015.7113309 10.1145/1031171.1031289 Kaptein R, Serdyukov P, de Vries AP, Grbovic M, Djuric N, Radosavljevic V, Sil- Kamps J (2010) Entity ranking using vestri F, Bhamidipati N (2015) Context- wikipedia as a pivot. In: Proceedings of and content-aware embeddings for query the 19th ACM Conference on Information rewriting in sponsored search. In: Proceed- and Knowledge Management, CIKM 2010, ings of the 38th International ACM SIGIR Toronto, Ontario, Canada, October 26-30, Conference on Research and Development 2010, pp 69–78, DOI 10.1145/1871437. in Information Retrieval, ACM, New 1871451, URL http://doi.acm.org/ York, NY, USA, SIGIR ’15, pp 383–392, 10.1145/1871437.1871451 DOI 10.1145/2766462.2767709, URL Lei Y, Uren VS, Motta E (2006) Sem- http://doi.acm.org/10.1145/ search: A search engine for the se- 2766462.2767709 mantic web. In: Managing Knowledge Guha R, McCool R, Miller E (2003) Se- in a World of Networks, 15th Interna- mantic search. In: Proceedings of the 12th tional Conference, EKAW 2006, Pode- International Conference on World Wide brady, Czech Republic, October 2-6, Web, ACM, New York, NY, USA, WWW 2006, Proceedings, pp 238–245, DOI ’03, pp 700–709, DOI 10.1145/775152. 10.1007/11891451 22, URL https: 775250, URL http://doi.acm.org/ //doi.org/10.1007/11891451_22 10.1145/775152.775250 Lund K, Burgess C (1996) Producing high- Hasibi F, Balog K, Garigliotti D, Zhang S dimensional semantic spaces from lexical (2017) Nordlys: A toolkit for entity-oriented co-occurrence. Behavior Research Methods, and semantic search. In: Proceedings of Instruments, & Computers 28(2):203–208, the 40th International ACM SIGIR Con- DOI 10.3758/BF03204766, URL https: ference on Research and Development in //doi.org/10.3758/BF03204766 Information Retrieval, ACM, New York, Madhu G, Govardhan A, Rajinikanth NY, USA, SIGIR ’17, pp 1289–1292, TV (2011) Intelligent semantic web DOI 10.1145/3077136.3084149, URL search engines: A brief survey. http://doi.acm.org/10.1145/ CoRR abs/1102.0831, URL http: 3077136.3084149 //arxiv.org/abs/1102.0831, Heflin J, Hendler J (2000) Searching the web 1102.0831 with shoe. In: AAAI Workshop on Artificial Maedche A, Motik B, Stojanovic L, Studer Intelligence for Web Search, pp 35–40 R, Volz R (2003) An infrastructure for Hogan A, Harth A, Umbrich J, Kinsella S, searching, reusing and evolving distributed Polleres A, Decker S (2011) Searching and ontologies. In: Proceedings of the 12th browsing with swse: The se- International Conference on World Wide mantic web search engine. Web Semantics: Web, ACM, New York, NY, USA, WWW Science, Services and Agents on the World ’03, pp 439–448, DOI 10.1145/775152. Wide Web 9(4):365 – 401, DOI https: 775215, URL http://doi.acm.org/ //doi.org/10.1016/j.websem.2011.06.004, 10.1145/775152.775215 URL http://www.sciencedirect. Makel¨ a¨ E (2005) Survey of semantic search com/science/article/pii/ research. URL https://seco.cs. S1570826811000473, jWS special aalto.fi/publications/2005/ issue on Semantic Search makela-semantic-search-2005. Horrocks I, Tessaris S (2002) Querying the se- pdf mantic web: A formal approach. In: Hor- Mangold C (2007) A survey and classifica- rocks I, Hendler J (eds) The Semantic Web tion of semantic search approaches. Int J — ISWC 2002, Springer Berlin Heidelberg, Metadata Semant Ontologies 2(1):23–34, Berlin, Heidelberg, pp 177–191 DOI 10.1504/IJMSO.2007.015073, URL Hua W, Wang Z, Wang H, Zheng K, Zhou http://dx.doi.org/10.1504/ X (2015) Short text understanding through IJMSO.2007.015073 lexical-semantic analysis. In: 2015 IEEE Manning CD (2011) Part-of-speech tagging 31st International Conference on Data En- from 97% to 100%: Is it time for some Semantic Search 7

linguistics? In: Gelbukh AF (ed) Computa- URL https://doi.org/10.1007/ tional Linguistics and Intelligent Text Pro- 978-3-319-25007-6_27 cessing, Springer Berlin Heidelberg, Berlin, Rebele T, Suchanek FM, Hoffart J, Biega Heidelberg, pp 171–189 J, Kuzey E, Weikum G (2016) YAGO: Mika P (2015) On schema.org and why it mat- A multilingual knowledge base from ters for the web. IEEE Internet Computing wikipedia, wordnet, and geonames. In: 19(4):52–55, DOI 10.1109/MIC.2015.81 The Semantic Web - ISWC 2016 - Mikolov T, Chen K, Corrado G, Dean 15th International Semantic Web Con- J (2013) Efficient estimation of word ference, Kobe, Japan, October 17-21, representations in vector space. CoRR 2016, Proceedings, Part II, pp 177–185, abs/1301.3781, URL http://arxiv. DOI 10.1007/978-3-319-46547-0 19, org/abs/1301.3781, 1301.3781 URL https://doi.org/10.1007/ Nalisnick E, Mitra B, Craswell N, Caru- 978-3-319-46547-0_19 ana R (2016) Improving document rank- Ristoski P, Paulheim H (2016) Rdf2vec: Rdf ing with dual word embeddings. In: Pro- graph embeddings for data mining. In: ceedings of the 25th International Confer- Groth P, Simperl E, Gray A, Sabou M, ence Companion on World Wide Web, In- Krotzsch¨ M, Lecue F, Flock¨ F, Gil Y (eds) ternational World Wide Web Conferences The Semantic Web – ISWC 2016, Springer Steering Committee, Republic and Canton International Publishing, Cham, pp 498–514 of Geneva, Switzerland, WWW ’16 Com- Rocha C, Schwabe D, Aragao MP (2004) A panion, pp 83–84, DOI 10.1145/2872518. hybrid approach for searching in the se- 2889361, URL https://doi.org/10. mantic web. In: Proceedings of the 13th 1145/2872518.2889361 International Conference on World Wide Oren E, Delbru R, Catasta M, Cyganiak Web, ACM, New York, NY, USA, WWW R, Stenzhorn H, Tummarello G (2008) ’04, pp 374–383, DOI 10.1145/988672. Sindice.com: a document-oriented lookup 988723, URL http://doi.acm.org/ index for open linked data. IJMSO 3(1):37– 10.1145/988672.988723 52, DOI 10.1504/IJMSO.2008.021204, Schuhmacher M, Ponzetto SP (2013) Exploit- URL https://doi.org/10.1504/ ing for web search results cluster- IJMSO.2008.021204 ing. In: Proceedings of the 2013 Workshop Pehcevski J, Vercoustre AM, Thom JA (2008) on Automated Knowledge Base Construc- Exploiting locality of wikipedia links in en- tion, ACM, New York, NY, USA, AKBC tity ranking. In: Macdonald C, Ounis I, Pla- ’13, pp 91–96, DOI 10.1145/2509558. chouras V, Ruthven I, White RW (eds) Ad- 2509574, URL http://doi.acm.org/ vances in Information Retrieval, Springer 10.1145/2509558.2509574 Berlin Heidelberg, Berlin, Heidelberg, pp Schuhmacher M, Dietz L, Paolo Ponzetto S 258–269 (2015) Ranking entities for web queries Pound J, Mika P, Zaragoza H (2010) Ad-hoc through text and knowledge. In: Proceed- object retrieval in the web of data. In: ings of the 24th ACM International on Proceedings of the 19th International Con- Conference on Information and Knowl- ference on World Wide Web, ACM, New edge Management, ACM, New York, York, NY, USA, WWW ’10, pp 771–780, NY, USA, CIKM ’15, pp 1461–1470, DOI 10.1145/1772690.1772769, URL DOI 10.1145/2806416.2806480, URL http://doi.acm.org/10.1145/ http://doi.acm.org/10.1145/ 1772690.1772769 2806416.2806480 Prokofyev R, Tonon A, Luggen M, Vouilloz Sheth A, Bertram C, Avant D, Hammond B, L, Difallah DE, Cudre-Mauroux´ P (2015) Kochut K, Warke Y (2002) Managing se- Sanaphor: Ontology-based coreference mantic content for the web. IEEE Internet resolution. In: Proceedings of the 14th Computing 6(4):80–87, DOI 10.1109/MIC. International Conference on The Semantic 2002.1020330 Web - ISWC 2015 - Volume 9366, Springer- Siencnik SK (2015) Adapting word2vec to Verlag, Berlin, Heidelberg, pp 458–473, named entity recognition. In: Proceedings DOI 10.1007/978-3-319-25007-6 27, of the 20th Nordic Conference of Com- putational Linguistics, NODALIDA 2015, 8 Philippe Cudre-Mauroux

May 11-13, 2015, Institute of the Lithua- Web, ACM, New York, NY, USA, WWW nian Language, Vilnius, Lithuania, pp ’05, pp 453–462, DOI 10.1145/1060745. 239–243, URL http://aclweb.org/ 1060812, URL http://doi.acm.org/ anthology/W/W15/W15-1830.pdf 10.1145/1060745.1060812 Stojanovic N, Studer R, Stojanovic L (2003) An approach for the ranking of query results in the semantic web. In: Fensel D, Sycara K, Mylopoulos J (eds) The Semantic Web - ISWC 2003, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 500–516 Tonon A, Demartini G, Cudre-Mauroux´ P (2012) Combining inverted indices and structured search for ad-hoc object re- trieval. In: Proceedings of the 35th Inter- national ACM SIGIR Conference on Re- search and Development in Information Re- trieval, ACM, New York, NY, USA, SIGIR ’12, pp 125–134, DOI 10.1145/2348283. 2348304, URL http://doi.acm.org/ 10.1145/2348283.2348304 Tran T, Cimiano P, Rudolph S, Studer R (2007) Ontology-based interpretation of keywords for semantic search. In: Pro- ceedings of the 6th International The Semantic Web and 2Nd Asian Confer- ence on Asian Semantic Web Conference, Springer-Verlag, Berlin, Heidelberg, ISWC’07/ASWC’07, pp 523–536, URL http://dl.acm.org/citation. cfm?id=1785162.1785201 Voorhees EM (1993) Using wordnet to dis- ambiguate word senses for text retrieval. In: Proceedings of the 16th Annual Inter- national ACM SIGIR Conference on Re- search and Development in Information Re- trieval, ACM, New York, NY, USA, SI- GIR ’93, pp 171–180, DOI 10.1145/160688. 160715, URL http://doi.acm.org/ 10.1145/160688.160715 Vrandecic D, Krotzsch¨ M (2014) Wikidata: a free collaborative knowledgebase. Com- mun ACM 57(10):78–85, DOI 10.1145/ 2629489, URL http://doi.acm.org/ 10.1145/2629489 Wang Q, Mao Z, Wang B, Guo L (2017) Knowledge graph embedding: A survey of approaches and applications. IEEE Trans- actions on Knowledge and Data Engi- neering 29(12):2724–2743, DOI 10.1109/ TKDE.2017.2754499 Zhang L, Yu Y, Zhou J, Lin C, Yang Y (2005) An enhanced model for searching in se- mantic portals. In: Proceedings of the 14th International Conference on World Wide