
Making Sense of schema.org with WordNet

Csaba Veres
Department of Information Science and Media Studies
The University of Bergen, Norway

Abstract

The schema.org initiative was designed to introduce machine readable metadata into the World Wide Web. This paper investigates conceptual biases in the schema through a mapping exercise between schema.org types and WordNet synsets. We create a mapping ontology which establishes the relationship between schema metadata types and the corresponding everyday concepts. This in turn can be used to enhance metadata annotation to include a more complete description of knowledge on the Web of data.

1 Introduction

Schema.org is an initiative to introduce machine readable metadata into HTML Web pages. It was launched on June 2, 2011, under the auspices of a consortium consisting of Google, Bing, and Yahoo!. The schema.org web site initially described the project as one that "provides a collection of schemas, i.e., html tags, that webmasters can use to markup their pages in ways recognized by major search providers ... making it easier for people to find the right web pages." (schema.org web site, 2011). The incentive for using the schema was that web sites containing markup would appear with informative details in search results, which in turn enables people to judge the relevance of the site more accurately. This could lead to higher user engagement and higher search ranking, which is the ultimate incentive for web masters.

The initial release contained 297 classes and 187 relations, but by 2016 the schema had grown to 638 classes and 965 relations (Guha et al., 2016). It is important to note, however, that the expansion of the schema consists entirely in adding subclasses and properties to the core classes through the allowed extension mechanism (https://is.gd/HdnHkp). From the outset the immediate subclasses of Thing were stipulated as Action, CreativeWork, Event, Intangible, Organization, Person, Place and Product. These high level conceptual divisions, with their implicit ontological commitments, are not, and never were, open to discussion.

Guha et al. (2016) explain that the primary driving force behind the design of the schema, and ultimately the reason for its success, was its simplicity. Previous efforts to introduce large scale metadata failed, in part because each standard was too narrow in terms of domain coverage. The result was too many standards for too few applications. The schema, on the other hand, offered a single, unified and broad vocabulary that could be used across several verticals, and it promised a benefit for perhaps the most important driving force, search rankings. As a part of this simplicity, the schema taxonomy and classes were intended more as an "organisational tool to help browse the vocabulary" than a definitive ontology of the world (Guha et al., 2016). In other words, the schema was designed as an intuitive set of metadata classes that could be used to describe the majority of items people would search for on the Web.

Together these factors ensured that the schema has enjoyed a significant amount of success. Guha et al. (2016) report that in a sample of 10 billion web pages, 31.3% of the pages had schema.org markup, a growth of 22% from a year earlier. The markup is used by many different data consumers for various tasks involving enhanced search results (rich snippets), populating the Google Knowledge Graph, exchange of transaction details in email, support for automatic formatting of recipes, reviews, etc., and advanced search features in Apple’s Siri. The fifteen most popular implemented classes were WebSite, SearchAction, WebPage, Product, ImageObject, Person, Offer, BlogPosting, Organization, Article, PostalAddress, Blog, LocalBusiness, AggregateRating, and WPFooter. Many of these refer to elements of the web page itself rather than the content. The top fifteen content bearing classes were Product, ImageObject, Person, Offer, Organization, PostalAddress, LocalBusiness, AggregateRating, CreativeWork, Review, Place, Rating, Event, GeoCoordinates, and Thing. These are subtypes of Product, CreativeWork, Person, Intangible, Organization, Place, and Event. The coverage was intended to be broad, and actual use does span the range of top-level types. It is also clear, however, that the types favour a particular view of web content, one that serves the interests of the search providers.

The motivation for this paper was to characterize the conceptual biases of the schema top level categories, by mapping the types to their corresponding meanings in WordNet. To the extent that we believe WordNet captures the ontological commitments inherent in human language, it should provide insights about where the two conceptualisations diverge. The further aim, however, is to use the mappings to enrich the valuable human provided metadata, with the goal of providing general but rich meaning annotations for a large portion of Web content.

It is important to note that we are not advocating WordNet as a gold standard for ontologies and knowledge representation. On the contrary, we agree with Hirst (2004), who argues that WordNet contains modeling decisions which differentiate it from formal ontologies. As an example, there are cases where synsets have overlapping hyponyms, whereas ontologies have disjoint subclasses. Consider the first noun sense of mistake: {mistake, error, fault}, which includes the following hyponyms (among others): {slip, slip-up, miscue, parapraxis}, {oversight, lapse}, {faux pas, gaffe, solecism, slip, gaucherie}, and {failure}. A single act can be both a slip and a faux pas. The first implies the act was inadvertent, and the second that it possibly had a social component such as a mistake in etiquette. A lapse is also a slip, but it involves some sort of forgetfulness or inattention on top of the mere slip. A lapse can also be a faux pas, of course. If the faux pas is sufficiently severe, it can become a complete failure. These hyponyms contain more information than that they are a kind of mistake; they also contain information about likely causes and implications, and these can be overlapping. Nevertheless, our interest is that people do consider these as kinds of mistake in everyday discourse. For the same reason we think it is beside the point to try to restructure WordNet by some formal methodology such as DOLCE (Gangemi et al., 2003a). We are interested here in intuitive relations, not formal ones.

2 The WordNet Mappings

The mapping involved two stages. First the schema.org types were aligned with WordNet synsets, while retaining the structure of the schema. This stage can be seen as adding information to the schema, namely, the corresponding WordNet synsets. Then, a new hierarchy of concepts was constructed from the synsets involved in the mapping. That is, by promoting the mapped synsets to be the central classes, we could get a better idea of what sorts of concepts are in the schema, in relation to the WordNet taxonomy.

In order to distinguish between the concepts in the two taxonomies, WordNet names will be prefixed with wn: and schema names with the prefix schema:. In addition, when necessary the WordNet name will be qualified with part of speech and sense tag, as in wn:dog#n#1.

To summarize, we constructed two artefacts at the end of the process:

• The WordNet to schema.org mapping ontology. This retains the schema class structure. The mappings were manually constructed and are available on GitHub (https://is.gd/XF0bJe).

• The WordNet taxonomy for the synsets that have been mapped to the schema. This shows an alternative taxonomy of the words in the schema.

2.1 The Mapping Ontology

In this ontology the original schema.org taxonomy was retained, and the WordNet synsets were simply inserted into this taxonomy. In Figure 1 we see some example mappings, showing schema:Beach mapped to wn:beach. Since schema:Beach is a subtype of schema:CivicStructure, by implication so too is wn:beach. Similarly, the other WordNet synsets in the example become subclasses of schema:CivicStructure through their respective alignments. The mapping provides the immediate benefit that web sites which contain any of the WordNet synsets in the alignment can automatically be connected to their corresponding schema types. This suggests a method for automatic metadata creation, which will be discussed subsequently.

Notice that the mapping is not straightforward, and in this example synsets of quite distinct types are grouped under the one schema type. For example wn:bus_terminal is a wn:facility, wn:cinema is a wn:theater which is a wn:building, and wn:parking_lot is a wn:tract,piece_of_land. Yet they all map to subclasses of schema:CivicStructure.

The second taxonomy was created precisely to reveal the schema conceptualisation in terms of the WordNet hierarchy. In other words, "what IS a schema:CivicStructure in everyday language?"

2.2 The WordNet Ontology

The full WordNet hypernym tree is quite deep, and quickly leads to a very complex taxonomy. For this reason we made use of a simple tool which uses an algorithm to eliminate low information nodes from a taxonomy (Veres et al., 2013). The algorithm prunes the tree by counting the number of outward links at each node, and eliminating any node that has fewer than a certain (user specified) number of hyponyms. When this is performed on every node in the graph, what remains is a number of intermediate synsets which are the maximally informative hypernyms of any leaf node. In the graphs reported here, the lower threshold was set at 3. The tool essentially implements the algorithm used by Stoica and Hearst (2004), but our interface has the advantage that the parameters can be dynamically adjusted and visually inspected to give the most intuitively pleasing result. A similar procedure was followed by Izquierdo et al. (2006) to identify basic level concepts. Our work differs in that we do not distinguish between nodes above the basic threshold.

A part of the inferred hierarchy involving wn:beach is shown in Figure 2. Note that wn:beach is a sibling of wn:mountain, whereas the schema choice to model the civic structure aspect of beach puts them in different subclasses; schema:Beach is a schema:CivicStructure while schema:Mountain is a schema:Landform. However, since wn:beach is a hyponym of wn:geological_formation, which in turn is an equivalent class of schema:Landform, it could be inferred that schema:Beach could also be a schema:Landform. The benefit of the alignment is that a new and sensible schema type could be added to any markup involving beach. Figure 3 shows how the WordNet hierarchy connects wn:beach to schema:Landform and potentially other subclasses. A web site about a geographical area with mountains and beaches could then be appropriately annotated.

Looking at the taxonomy itself, we can see [...] schema. The major division in Figure 4 is between wn:physical_entity and wn:abstraction, which is an ontological distinction that is typically considered fundamental (e.g. Niles and Pease, 2001; Gangemi et al., 2003b). On this view the schema describes the world as populated by physical entities and abstractions, where the physical entities are predominantly objects, and abstractions are diverse sorts of events or roles which the entities engage in. For example wn:measure is how much there is of something you can quantify, and wn:state is the way something is with respect to its attributes. Other subtypes of wn:abstraction, like wn:organization and wn:tourist_attraction, apply to concepts that are typically human centered, functional collections of objects (Wierzbicka, 1984). Wierzbicka argues that putatively taxonomic concept hierarchies are in fact, the majority of the time, made up of a mixture of supercategory types, with the most prominent two being taxonomic and functional. Pustejovsky (1991) draws a similar distinction with the mechanism of formal and telic roles in his lexical structures.

The ontological commitment adopted by schema.org becomes clear if we compare the two taxonomies. The schema divides schema:Thing into: schema:CreativeWork, schema:Event, schema:Intangible, schema:Organization, schema:Person, schema:Place, and schema:Product. The focus is immediately on the functional categories: telic roles dominate the top level categories of the schema, and physical entities are subtypes of these abstractions.

The most obvious example of a top-level purely functional type is schema:Product. Almost anything can be a product, and there is no property which products have in common except the telic

Figure 1: Example mappings between WordNet and schema.org, for the corresponding concepts beach. The ovals in darker shading represent concepts which have equivalent classes in the two namespaces.

Figure 2: Part of the WordNet taxonomy
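
The pruning step described in Section 2.2, which yields hierarchies like the one in Figure 2, can be illustrated with a minimal sketch. It is not the SynsetTagger implementation. It assumes a single-parent tree, counts hyponyms on the original (unpruned) tree, always keeps the root and the leaves, and re-attaches the children of a dropped node to the nearest surviving ancestor. The function name `prune` and the toy hypernym chain are made up for the example.

```python
def prune(children, root, threshold=3):
    """Drop internal nodes with fewer than `threshold` direct hyponyms.

    `children` maps each node to its list of direct hyponyms (a tree).
    Leaves and the root are always kept; children of a dropped node are
    re-attached to the nearest surviving ancestor.
    """
    pruned = {}

    def visit(node, kept_ancestor):
        kids = children.get(node, [])
        is_leaf = not kids
        keep = node == root or is_leaf or len(kids) >= threshold
        if keep:
            if node != root:
                pruned.setdefault(kept_ancestor, []).append(node)
            if not is_leaf:
                pruned.setdefault(node, [])
            kept_ancestor = node
        for kid in kids:
            visit(kid, kept_ancestor)

    visit(root, None)
    return pruned

# Toy hierarchy (illustrative only, not the real WordNet hypernym chains).
toy = {
    "entity": ["physical_entity"],
    "physical_entity": ["object"],
    "object": ["geological_formation", "structure"],
    "geological_formation": ["beach", "mountain", "cave"],
    "structure": ["building"],
    "building": ["cinema"],
}
result = prune(toy, "entity", threshold=3)
print(result)
```

With the threshold of 3 used in the paper, the chain of single-child nodes collapses, while geological_formation survives as the informative hypernym of beach, mountain and cave.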

Figure 3: wn:beach inherits schema:Landform

Figure 4: Part of the WordNet taxonomy from SynsetTagger

role that they are "made available for sale". One can sell a sewing needle or a Saturn V rocket. In fact the situation is even more complicated, because Products don’t even have to be individuated "things". The documentation of schema:Product reads: "a pair of shoes; a concert ticket; the rental of a car; a haircut; or an episode of a TV show streamed online".

The fact that there are a number of functional categories at the highest level helps explain the strange tangle of types at the lower levels of the hierarchy, where many different kinds of things (in the formal, taxonomic sense) can appear if they serve particular functions. To see how this becomes problematic, consider the common functional category weapon, which can include items such as crossbow, flamethrower, gun, knife, poison gas, anthrax bacillus, novichok, boomerang, and hydrogen bomb. Clearly as individual objects these would have quite different sets of properties. The problem for the schema is that different formal objects are forced to coexist as siblings in a taxonomy dominated by telic roles. This results in examples such as schema:Beach having opening hours, schema:Continent with a telephone number and review, and other strange and wonderful things. One is forced to assume that schema:Beach was designated as a schema:CivicStructure, for example, because the emphasis is on the facilities available at the beach, not the beach itself.

The inclusion of telic roles such as schema:Product at such a high level of generality has the additional consequence that the schema does not contain a type which corresponds to the simple notion of a physical object. There is no option in schema.org for the structured markup of cars, boats, computer chips, barbells, antiques, or any of the other hundred million human artefacts ancient and modern, except as a "Product", because the schema lumps these into the class of "sellable things". Neither does there seem to be any proper place for natural objects like cats or dogs (the search facility suggests schema:AnimalShelter) or trees and forests.

Finally it should be noted that the hierarchy in WordNet does also include purely functional types among its hypernyms. For example in the weapon example above we see that wn:gun is-a wn:weapon is-a wn:object. George Miller (in Fellbaum, 1998) explains that this was perhaps an unfortunate problem that might have been avoided had the importance of Wierzbicka’s work been realized earlier. However, the structure of WordNet ensures that, whenever such a confusion exists, the formal properties of the word are still recorded. One mechanism is that words can appear in more than one hierarchy. For example anthrax bacillus is both a wn:microorganism and a wn:weapon. Another possibility is that words with both roles are listed twice. For example wn:chicken#n#1 is a wn:meat#n#1, and wn:chicken#n#2 is a wn:bird#n#1. The schema only offers one choice for the poor chicken, schema:MenuSection.

3 Finding correct mappings

There are a number of potential pitfalls in defining appropriate mappings between the two taxonomies. One of the most important is to avoid introducing unwanted inferences from the semantics of the mapping axioms. A prevalent example of this is the use of owl:sameAs to represent equivalence between individuals, or classes in OWL-Full. owl:sameAs asserts full equivalence between the individuals such that all of their properties are automatically shared, even though most commonly this is not the desired consequence (Halpin et al., 2010). To avoid this problem we used the weaker owl:equivalentClass axiom, which does not imply complete equality. What is required instead is the weaker condition that every instance of one class must also be an instance of the other.

Even with a weaker semantics we found that equivalent classes could not always be found. One reason is that schema.org includes concepts which involve various sorts of compounding of simple concepts, and WordNet contains only common, lexicalized compounds. For example LandmarksOrHistoricalBuildings is a compound concept that includes any kind of general landmark as well as the specific concept of buildings with historical significance. There is no such lexical entry in English. Most likely there is no such compound in any language, because the concept is unnatural, mixing different levels of generalization. It is analogous to a concept for toys or teddy bears.

There are also more acceptable compounds like schema:CivicStructure, which is "a public structure such as a town hall or concert hall". This is of course a perfectly acceptable compound, which happens not to be in WordNet. In every case where an acceptable WordNet compound could not be found, we decided to make the schema.org concept a subclass of one or more WordNet synsets that captured part of the compound. For the above example of schema:CivicStructure, the obvious superclass is wn:structure#n#1.

Sometimes the compound nature of the schema terms is hidden. For example the terms that are subclasses of schema:LocalBusiness are a mixed group of explicit compounds (e.g., schema:MovingCompany, schema:IceCreamShop) and implicit compounds (e.g., schema:Electrician, schema:Locksmith, schema:HousePainter). That is, schema:Electrician is really meant to be something like "ElectricianBusiness" and not just "Electrician". The compound schema:HousePainter is even more complicated because it has an exact match in wn:house_painter#n#1, but in fact schema:HousePainter is really meant to be a HousePainterBusiness, so the exact match is illusory. The important modelling decision is whether or not to reintroduce the hidden compound in mapping to WordNet. That is, should schema:Electrician be regarded in its ordinary word sense as "a person who is an electrician", or should it be modelled as an "electrician business"? Under the second reading, these concepts could simply be declared as subclasses of wn:place_of_business to maintain the intended interpretation in the schema. The most flexible solution was to declare an equivalent class relation between schema:Electrician and the person interpretation in WordNet, wn:electrician#n#1. This choice captures the notion that electricians are people. However it is also possible to infer that wn:electrician is a wn:place_of_business, as shown in Figure 5.

There is a small set of schema.org types for which we did not establish mappings. One group involved technical compounds describing the structure of web pages, with terms like schema:AboutPage and schema:CheckoutPage. These are all subtypes of schema:WebPage, for which we did define a mapping. The second group was the primitive data types, schema:DataType, which are not part of the main taxonomy subsumed by schema:Thing.

4 Using the WordNet Mappings

The practical motivation for mapping the schema to WordNet was to enrich the metadata that can be assigned to concepts in a web page. We have already seen this in examples such as beach. A secondary motivation was to make it easier for web masters to find the schema types without knowing anything about the structure of the schema. We have already developed a prototype of a tool in which the user can highlight any word in text, nominate its corresponding synset, and the application will attempt to guess the correct schema type. Consider the following example scenario.

There is a geological landmark called the Jenolan Caves in the Blue Mountains, Australia. Suppose a web master wanted to mark up the web site for Jenolan Caves. A quick search will reveal that there is no matching type in the schema for caves. Using the WordNet mappings it is possible for the designer to find the most appropriate types, without any knowledge of the schema. The synset wn:cave is a wn:geological_formation, which in turn maps to schema:Landform. However, the mapping ontology can also suggest additional useful classifications. The coordinate terms of wn:cave contain some terms which are defined in the schema, including our old friend beach. Recall that wn:beach is mapped to schema:CivicStructure through schema:Beach (see Figure 6). Thus Jenolan Caves could be marked with both schema types, and the properties of the facilities at the premises could be specified. Of course the annotation effort does not have to stop there. Since the WordNet synset is available, it can also be included in the markup, which in turn enables the markup to be used with a huge number of mappings to other resources (https://wordnet.princeton.edu/related-projects).

While this process is currently being performed through our prototype tool where users specify the disambiguated sense (Veres and Elseth, 2013), it does not necessarily have to be performed manually. With sufficiently accurate disambiguation methods, any web page could be automatically annotated with schema and WordNet metadata. This would be useful for any downstream task including the construction of knowledge graphs, as previously mentioned.

The Jenolan Caves example requires the ability to declare multiple types. The original syntax for

Figure 5: Electrician as both a person (wn:electrician#n#1) and place of business (wn:place_of_business#n#1).
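
The Jenolan Caves scenario of Section 4 can be sketched as follows. This is a hedged illustration, not the prototype tool. The two lookup tables hand-code only the relations named in the text, the function names `suggest_types` and `to_jsonld` are hypothetical, and the WordNet IRI is a placeholder rather than a real identifier. The output is a JSON-LD object whose `@type` is a list, with the synset attached through the schema.org `additionalType` property.

```python
import json

# Toy data, hand-coded from the relations named in the text.
HYPERNYM = {
    "wn:cave": "wn:geological_formation",
    "wn:beach": "wn:geological_formation",
}
WN_TO_SCHEMA = {
    "wn:geological_formation": ["schema:Landform"],
    "wn:beach": ["schema:CivicStructure"],  # via schema:Beach
}

def suggest_types(synset):
    """Schema types via the hypernym chain, plus types of coordinate terms."""
    direct, node = [], synset
    while node is not None:
        direct += WN_TO_SCHEMA.get(node, [])
        node = HYPERNYM.get(node)
    parent = HYPERNYM.get(synset)
    siblings = [s for s, p in HYPERNYM.items() if p == parent and s != synset]
    related = [t for s in siblings for t in WN_TO_SCHEMA.get(s, [])]
    return direct, related

def to_jsonld(name, types, synset_iri):
    """Build a JSON-LD description that declares several types at once."""
    return {
        "@context": "https://schema.org",
        "@type": [t.split(":", 1)[1] for t in types],
        "name": name,
        "additionalType": synset_iri,
    }

direct, related = suggest_types("wn:cave")
doc = to_jsonld("Jenolan Caves", direct + related,
                "https://example.org/wordnet/cave-n-1")  # placeholder IRI
print(json.dumps(doc, indent=2))
```

The types suggested through coordinate terms are only candidates. In the scenario above it is the web master who decides that the civic-structure reading applies to the caves.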

Figure 6: Mapping "cave" to schema.org types

the schema, microdata, is not able to express multiple types. The recommendation therefore is to use RDFa (https://rdfa.info/) or JSON-LD (https://json-ld.org/), which are inherently built to express multiple types from any vocabulary.

5 Conclusion

We proposed a method for evaluating the conceptual bias of schema.org by comparing the type terms against their usage in everyday language as stipulated in WordNet. The observation is that schema.org favours the markup of web sites promoting goods, services, and locations fulfilling some human centred need. This in turn results in the observed data that the majority of web sites which contain schema.org markup are about products, goods and services. If search rankings favour sites with markup, and if most markup is about goods and services, then search results will come to favour goods and services. Anecdotally, this could be one reason why it is sometimes easier to find where to buy something than to find information about the thing itself. The bias diminishes the potential for providing a rich source of general semantic metadata on the web, for use in diverse use cases.

We argued that the schema needs types that describe a more neutral view of the world, for example artefacts, to describe things independently of the roles they can play. A metadata specification should be able to annotate a chicken as a kind of bird as well as a kind of food.

Our suggestion to include WordNet mappings in the markup effort is one way to sneak more general markup into the annotation process. The requirement is that multiple types must be a standard feature of the annotation, with different types describing different aspects of the item. A car is an artefact designed for locomotion, but can also acquire its role as a product if it is put up for sale. This addition would not compromise people who want to advertise their products. In fact, it would give them more freedom to express physical properties of their products like size, construction material, origin, and so on.

In summary, we used WordNet as a standard representation of everyday word use, to provide clarity to the types proposed in schema.org. We proposed a method to help people mark up Web sites that do not fit neatly into the service oriented world view, by enabling them to annotate their contribution to world knowledge as broadly as possible. This is clearly of benefit to all users who see the web as a vehicle for disseminating informative structured data as freely as possible.

References

Anna Wierzbicka. 1984. Apples are not a "kind of fruit": the semantics of human categorization. American Ethnologist, 11(2):313–328.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. Language, Speech, and Communication. MIT Press, Cambridge, MA.

Aldo Gangemi, Nicola Guarino, Claudio Masolo, and Alessandro Oltramari. 2003a. Sweetening WordNet with DOLCE. AI Magazine, 24(3):13–24.

Aldo Gangemi, Nicola Guarino, Claudio Masolo, and Alessandro Oltramari. 2003b. Sweetening WordNet with DOLCE. AI Magazine, 24(3):13–24.

RV Guha, Dan Brickley, and Steve Macbeth. 2016. Schema.org: evolution of structured data on the web. Communications of the ACM, 59:44–51.

Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, and Henry S. Thompson. 2010. When owl:sameAs isn’t the same: An analysis of identity in linked data. In Peter F. Patel-Schneider, Yue Pan, Pascal Hitzler, Peter Mika, Lei Zhang, Jeff Z. Pan, Ian Horrocks, and Birte Glimm, editors, The Semantic Web – ISWC 2010, pages 305–320, Berlin, Heidelberg. Springer Berlin Heidelberg.

Graeme Hirst. 2004. Ontology and the Lexicon, pages 209–229. Springer Berlin Heidelberg, Berlin, Heidelberg.

Rubén Izquierdo, Armando Suárez, German Rigau, and Ixa Nlp. 2006. Exploring the automatic selection of basic level concepts. In International Conference Recent Advances in Natural Language Processing.

Ian Niles and Adam Pease. 2001. Towards a standard upper ontology. In Proceedings of the International Conference on Formal Ontology in Information Systems - Volume 2001, FOIS ’01, pages 2–9, New York, NY, USA. ACM.

James Pustejovsky. 1991. The generative lexicon. Comput. Linguist., 17(4):409–441, December.

Emilia Stoica and Marti A. Hearst. 2004. Nearly-automated metadata hierarchy creation. In Proceedings of HLT-NAACL 2004: Short Papers, HLT-NAACL-Short ’04, pages 117–120, Stroudsburg, PA, USA. Association for Computational Linguistics.

Csaba Veres and Eivind Elseth. 2013. Schema.org for the semantic web with madame. In Proceedings of the I-SEMANTICS 2013 Posters & Demonstrations Track, Graz, Austria, September 4-6, 2013, pages 11–15.

Csaba Veres, Kristian Johansen, and Andreas Opdahl. 2013. SynsetTagger: A tool for generating ontologies from semantic tags. In Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics, WIMS ’13, pages 16:1–16:10, New York, NY, USA. ACM.