Making Sense of schema.org with WordNet
Csaba Veres
Department of Information Science and Media Studies
The University of Bergen, Norway
Abstract

The schema.org initiative was designed to introduce machine readable metadata into the World Wide Web. This paper investigates conceptual biases in the schema through a mapping exercise between schema.org types and WordNet synsets. We create a mapping ontology which establishes the relationship between schema metadata types and the corresponding everyday concepts. This in turn can be used to enhance metadata annotation to include a more complete description of knowledge on the Web of data.

1 Introduction

Schema.org is an initiative to introduce machine readable metadata into HTML Web pages. It was launched on June 2, 2011, under the auspices of a consortium consisting of Google, Bing, and Yahoo!. The schema.org web site initially described the project as one that "provides a collection of schemas, i.e., html tags, that webmasters can use to markup their pages in ways recognized by major search providers . . . . making it easier for people to find the right web pages." (schema.org web site, 2011). The incentive for using the schema was that web sites containing markup would appear with informative details in search results, which in turn would enable people to judge the relevance of the site more accurately. This could lead to higher user engagement and higher search ranking, which is the ultimate incentive for webmasters.

The initial release contained 297 classes and 187 relations, but by 2016 the schema had grown to 638 classes and 965 relations (Guha et al., 2016). It is important to note, however, that the expansion of the schema consists entirely in adding subclasses and properties to the core classes through the allowed extension mechanism (https://is.gd/HdnHkp). From the outset the immediate subclasses of Thing were stipulated as Action, CreativeWork, Event, Intangible, Organization, Person, Place and Product. These high level conceptual divisions, with their implicit ontological commitments, are not and never were open to discussion.

Guha et al. (2016) explain that the primary driving force behind the design of the schema, and ultimately the reason for its success, was its simplicity. Previous efforts to introduce large scale metadata failed, in part because each standard was too narrow in terms of domain coverage. The result was too many standards for too few applications. The schema, on the other hand, offered a single, unified and broad vocabulary that could be used across several verticals, and promised a benefit for perhaps the most important driving force, search rankings. As a part of this simplicity, the schema taxonomy and classes were intended more as an "organisational tool to help browse the vocabulary" than a definitive ontology of the world (Guha et al., 2016). In other words, the schema was designed as an intuitive set of metadata classes that could be used to describe the majority of items people would search for on the Web.

Together these factors ensured that the schema has enjoyed a significant amount of success. Guha et al. (2016) report that in a sample of 10 billion web pages, 31.3% of the pages had schema.org markup, a growth of 22% from a year earlier. The markup is used by many different data consumers for various tasks involving enhanced search results (rich snippets), populating the Google Knowledge Graph, exchange of transaction details in email, support for automatic formatting of recipes, reviews, etc., and advanced search features in Apple’s Siri. The fifteen most popular implemented classes were WebSite, SearchAction, WebPage, Product, ImageObject, Person, Offer, BlogPosting, Organization, Article, PostalAddress, Blog, LocalBusiness, AggregateRating, and WPFooter. Many of these refer to elements of the web page itself rather than the content. The top fifteen content bearing classes were Product, ImageObject, Person, Offer, Organization, PostalAddress, LocalBusiness, AggregateRating, CreativeWork, Review, Place, Rating, Event, GeoCoordinates, and Thing. These are subtypes of Product, CreativeWork, Person, Intangible, Organization, Place, and Event. Although the coverage was intended to be broad, it is clear that the use of the schema covers its range of types well, but that the types favour a particular view of web content, in the interests of the search providers.

The motivation for this paper was to try and characterize the conceptual biases of the schema top level categories, by mapping the types to their corresponding meanings in WordNet. To the extent that we believe WordNet captures the ontological commitments inherent in human language, it should provide insights about where the two conceptualisations diverge. The further aim, however, is to use the mappings to enrich the valuable human provided metadata, with the goal of providing general but rich meaning annotations to a large portion of Web content.

It is important to note that we are not advocating WordNet as a gold standard for ontologies and knowledge representation. On the contrary, we agree with Hirst (2004), who argues that WordNet contains modeling decisions which differentiate it from formal ontologies. As an example, there are cases where synsets have overlapping hyponyms whereas ontologies have disjoint subclasses. Consider the first noun sense of mistake: {mistake, error, fault}, which includes the following hyponyms (among others): {slip, slip-up, miscue, parapraxis}, {oversight, lapse}, {faux pas, gaffe, solecism, slip, gaucherie}, and {failure}. A single act can be both a slip and a faux pas. The first implies the act was inadvertent, and the second that it possibly had a social component such as a mistake in etiquette. A lapse is also a slip, but it involves some sort of forgetfulness or inattention on top of the mere slip. A lapse can also be a faux pas, of course. If the faux pas is sufficiently severe, it can become a complete failure. These hyponyms contain more information than that they are a kind-of mistake; they also contain information about likely causes and implications, and these can be overlapping. Nevertheless, our interest is that people do consider these as kinds of mistake in everyday discourse. For the same reason we think it is beside the point to try and restructure WordNet by some formal methodology such as DOLCE (Gangemi et al., 2003a). We are interested here in intuitive relations, not formal ones.

2 The WordNet Mappings

The mapping involved two stages. First the schema.org types were aligned with WordNet synsets, while retaining the structure of the schema. This stage can be seen as adding information to the schema, namely, the corresponding WordNet synsets. Then, a new hierarchy of concepts was constructed from the synsets involved in the mapping. That is, by promoting the mapped synsets to be the central classes, we could get a better idea of what sorts of concepts are in the schema, in relation to the WordNet taxonomy.

In order to distinguish between the concepts in the two taxonomies, WordNet names will be prefixed with wn: and schema names with schema:. In addition, when necessary the WordNet name will be qualified with part of speech and sense tag, as in wn:dog#n#1.

To summarize, we constructed two artefacts at the end of the process:

• The WordNet to schema.org mapping ontology. This retains the schema class structure. The mappings were manually constructed and are available on GitHub (https://is.gd/XF0bJe).

• The WordNet taxonomy for the synsets that have been mapped to the schema. This shows an alternative taxonomy of the words in the schema.

2.1 The Mapping Ontology

In this ontology the original schema.org taxonomy was retained, and the WordNet synsets were simply inserted into this taxonomy. In Figure 1 we see some example mappings, showing schema:Beach mapped to wn:beach. Since schema:Beach is a subtype of schema:CivicStructure, by implication so too is wn:beach. Similarly, the other WordNet synsets in the example become subclasses of schema:CivicStructure through their respective alignments. The mapping provides the immediate benefit that web sites which contain any of the WordNet synsets in the alignment could automatically be connected to their corresponding schema types. This suggests a method for automatic metadata creation, which will be discussed subsequently.

Notice that the mapping is not straightforward, and in this example synsets of quite distinct types are grouped under the one schema type. For example wn:bus_terminal [...]

[...] in turn is an equivalent class of schema:Landform, it could be inferred that schema:Beach could also be a schema:Landform. The benefit of the alignment is that a new and sensible schema type could be added to any markup involving beach. Figure 3 shows how the WordNet hierarchy connects wn:beach to schema:Landform and potentially other subclasses. A web site about a geographical area with mountains and beaches could then be appropriately annotated.
Figure 2: Part of the WordNet taxonomy
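The subclass and equivalence reasoning behind the mapping ontology (wn:beach aligned with schema:Beach, and therefore falling under schema:CivicStructure and possibly schema:Landform) can be illustrated with a small sketch. This is not the published ontology or the tooling used for it: the function names are hypothetical, equivalence is simply treated as subclassing in both directions, and wn:geological_formation is an assumed stand-in for the intermediate synset that links wn:beach to schema:Landform.

```python
from collections import defaultdict

# Illustrative axioms only; the real mappings live in the published ontology.
SUBCLASS_AXIOMS = [
    ("schema:Beach", "schema:CivicStructure"),
    ("schema:CivicStructure", "schema:Place"),
    ("schema:Place", "schema:Thing"),
    # Assumption: WordNet hypernym of beach used as the linking synset.
    ("wn:beach", "wn:geological_formation"),
]
EQUIVALENCE_AXIOMS = [
    ("wn:beach", "schema:Beach"),
    # Assumption: the synset aligned with schema:Landform.
    ("wn:geological_formation", "schema:Landform"),
]


def build_graph(subclass_axioms, equivalence_axioms):
    """Return a child -> parents map; equivalence adds an edge both ways."""
    graph = defaultdict(set)
    for child, parent in subclass_axioms:
        graph[child].add(parent)
    for left, right in equivalence_axioms:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def superclasses(graph, start):
    """Collect every class reachable from `start`, excluding `start` itself."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, ()):
            if parent != start and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def schema_types(graph, start):
    """Keep only the inferred classes that belong to the schema: namespace."""
    return {c for c in superclasses(graph, start) if c.startswith("schema:")}


GRAPH = build_graph(SUBCLASS_AXIOMS, EQUIVALENCE_AXIOMS)

if __name__ == "__main__":
    print(sorted(schema_types(GRAPH, "wn:beach")))
    print(sorted(schema_types(GRAPH, "schema:Beach")))
```

Under these assumptions, a page tagged only with wn:beach picks up schema:Beach and schema:CivicStructure, and schema:Beach additionally picks up schema:Landform through the WordNet side of the graph. A production version would more plausibly use an OWL reasoner over the actual ontology than a hand-rolled traversal.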
Figure 3: wn:beach inherits schema:Landform

Figure 4: Part of the WordNet taxonomy from SynsetTagger

[...] role that they are "made available for sale". One can sell a sewing needle or a Saturn V rocket. Actually the situation is even more complicated because Products don’t even have to be individuated "things". The documentation of schema:Product reads: "a pair of shoes; a concert ticket; the rental of a car; a haircut; or an episode of a TV show streamed online".

The fact that there are in fact a number of functional categories at the highest level helps explain the strange tangle of types at the lower levels of the hierarchy, where many different kinds of things (in [...]

[...] (in (Fellbaum, 1998)) explains that this was perhaps an unfortunate problem that might have been avoided had the importance of Wierzbicka’s work been realized earlier. However, the structure of WordNet ensures that, whenever such a confusion exists, the formal properties of the word are still recorded. One mechanism is that words can appear in more than one hierarchy. For example anthrax bacillus is both a wn:microorganism, and a wn:weapon. Another possibility is that words with both roles are listed twice. For example wn:chicken#n#1
Figure 6: Mapping "cave" to schema.org types

[...] the schema, microdata is not able to express multiple types. The recommendation therefore is to use RDFa (https://rdfa.info/) or JSON-LD (https://json-ld.org/), which are inherently built to express multiple types from any vocabulary.

5 Conclusion

We proposed a method for evaluating the conceptual bias of schema.org by comparing the type terms against their usage in everyday language as stipulated in WordNet. The observation is that schema.org favours the markup of web sites promoting goods, services, and locations fulfilling some human centred need. This in turn explains the observation that the majority of web sites which contain schema.org markup are about products, goods and services. If search rankings favour sites with markup, and if most markup is about goods and services, then search results will come to favour goods and services. Anecdotally, this could be one reason why it is sometimes easier to find where to buy something than to find information about the thing itself. The bias diminishes the potential for providing a rich source of general semantic metadata on the web, for use in diverse use cases.

We argued that the schema needs types that describe a more neutral view of the world, for example artefacts, to describe things independently of the roles they can play. A metadata specification should be able to annotate a chicken as a kind of bird as well as a kind of food.

Our suggestion to include WordNet mappings in the markup effort is one way to sneak more general markup into the annotation process. The requirement is that multiple types must be a standard feature of the annotation, with different types describing different aspects of the item. A car is an artefact designed for locomotion, but can also acquire its role as a product if it is put up for sale. This addition would not compromise people who want to advertise their products. In fact, it would give them more freedom to express physical properties of their products like size, construction material, origin, and so on.

In summary, we used WordNet as a standard representation of everyday word use, to provide clarity to the types proposed in schema.org. We proposed a method to help people mark up Web sites that do not fit neatly into the service oriented world view, by enabling them to annotate their contribution to world knowledge as broadly as possible. This is clearly of benefit to all users who see the web as a vehicle for disseminating informative structured data as freely as possible.

References

Anna Wierzbicka. 1984. Apples are not a "kind of fruit": the semantics of human categorization. American Ethnologist, 11(2):313–328.
Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. Language, Speech, and Communication. MIT Press, Cambridge, MA.
Aldo Gangemi, Nicola Guarino, Claudio Masolo, and Alessandro Oltramari. 2003a. Sweetening WordNet with DOLCE. AI Magazine, 24:13–24.
Aldo Gangemi, Nicola Guarino, Claudio Masolo, and Alessandro Oltramari. 2003b. Sweetening WordNet with DOLCE. AI Mag., 24(3):13–24, September.
RV Guha, Dan Brickley, and Steve Macbeth. 2016. Schema.org: evolution of structured data on the web. Communications of the ACM, 59:44–51.
Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, and Henry S. Thompson. 2010. When owl:sameAs isn’t the same: An analysis of identity in linked data. In Peter F. Patel-Schneider, Yue Pan, Pascal Hitzler, Peter Mika, Lei Zhang, Jeff Z. Pan, Ian Horrocks, and Birte Glimm, editors, The Semantic Web – ISWC 2010, pages 305–320, Berlin, Heidelberg. Springer Berlin Heidelberg.
Graeme Hirst. 2004. Ontology and the Lexicon, pages 209–229. Springer Berlin Heidelberg, Berlin, Heidelberg.
Rubén Izquierdo, Armando Suárez, German Rigau, and Ixa Nlp. 2006. Exploring the automatic selection of basic level concepts. In International Conference Recent advance in Natural Language Processing.
Ian Niles and Adam Pease. 2001. Towards a standard upper ontology. In Proceedings of the International Conference on Formal Ontology in Information Systems - Volume 2001, FOIS ’01, pages 2–9, New York, NY, USA. ACM.
James Pustejovsky. 1991. The generative lexicon. Comput. Linguist., 17(4):409–441, December.
Emilia Stoica and Marti A. Hearst. 2004. Nearly-automated metadata hierarchy creation. In Proceedings of HLT-NAACL 2004: Short Papers, HLT-NAACL-Short ’04, pages 117–120, Stroudsburg, PA, USA. Association for Computational Linguistics.
Csaba Veres and Eivind Elseth. 2013. Schema.org for the semantic web with madame. In Proceedings of the I-SEMANTICS 2013 Posters & Demonstrations Track, Graz, Austria, September 4-6, 2013, pages 11–15.
Csaba Veres, Kristian Johansen, and Andreas Opdahl. 2013. Synsettagger: A tool for generating ontologies from semantic tags. In Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics, WIMS ’13, pages 16:1–16:10, New York, NY, USA. ACM.