
NECKAr: A Named Entity Classifier for Wikidata

Johanna Geiß and Andreas Spitz and Michael Gertz Institute of Computer Science, Heidelberg University Im Neuenheimer Feld 205, 69120 Heidelberg {geiss,spitz,gertz}@informatik.uni-heidelberg.de

Abstract

Many Information Extraction tasks such as Named Entity Recognition or Event Detection require background repositories that provide a classification of entities into the basic, predominantly used classes LOCATION, PERSON, and ORGANIZATION. Several available knowledge bases offer a very detailed and specific ontology of entities that can be used as a repository. However, due to the mechanisms behind their construction, they are relatively static and of limited use to IE approaches that require up-to-date information. In contrast, Wikidata is a community-edited knowledge base that is kept current by its userbase, but has a constantly evolving and less rigid ontology structure that does not correspond to these basic classes. In this paper we present the tool NECKAr, which assigns Wikidata entities to the three main classes of named entities, as well as the resulting Wikidata NE dataset that consists of over 8 million classified entities. Both are available at http://event.ifi.uni-heidelberg.de/?page_id=532.

1 Introduction

The classification of entities is an important task in information extraction (IE) from textual sources that requires the support of a comprehensive knowledge base. In a standard workflow, a Named Entity Recognition (NER) tool is used to discover the surface forms of entity mentions in some input text. These surface forms then have to be disambiguated and linked to a specific entity in a knowledge base (KB) to be useful in subsequent IE tasks. For the latter step of entity linking, suitable entity candidates have to be selected from the underlying knowledge base. In the general case of linking arbitrary entities, information about the classes of entity candidates is advantageous for the disambiguation process and for pruning candidates. In more specialized cases, only a subset of entity mentions may be of interest, such as toponyms or person mentions, which requires the classification of entity mentions. As a result, the classification of entities in the underlying knowledge base serves to support the linking procedure and directly translates into a classification of the entities that are mentioned in the text, which is a necessary precondition for many subsequent tasks such as event detection (Kumaran and Allan, 2004) or document geolocation (Ding et al., 2000).

There are a number of knowledge bases that provide such a background repository for entity classification, predominantly DBpedia, YAGO, and Wikidata (Färber et al., 2017). While these knowledge bases provide semantically rich and fine-granular classes and relationship types, the task of entity classification often requires associating coarse-grained classes with discovered surface forms of entities. This problem is best illustrated by an IE task that has recently gained significant interest, in particular in the context of processing streams of news articles and postings in social media, namely event detection and tracking, e.g., (Aggarwal and Subbian, 2012; Sakaki et al., 2010; Brants and Chen, 2003). Considering an event as something that happens at a given place and time between a group of actors (Allan, 2012), the entity classes PERSON, ORGANIZATION, LOCATION, and TIME are of particular interest. While surface forms of temporal expressions are typically normalized by using a temporal tagger (Strötgen and Gertz, 2016), dealing with the classification of the other types of entities often is much more subtle. This is especially true if one recalls that almost all available NER tools tag named entities only at a very coarse-grained level, e.g., Stanford NER (Finkel et al., 2005), which predominantly uses the classes LOCATION, PERSON, and ORGANIZATION.

This is the authors' version of the work. It is posted here for your personal use, not for redistribution. The definitive version is published in: Proceedings of the Int. Conference of the German Society for Computational Linguistics and Language Technology (GSCL ’17), September 13-14, 2017, Berlin, Germany

The objective of this paper is to provide the community with a dataset and API for entity classification in Wikidata, which is tailored towards entities of the classes LOCATION, PERSON, and ORGANIZATION. Like knowledge bases with inherently more coarse or hybrid class hierarchies such as YAGO and DBpedia, this version of Wikidata then supports entity linking tasks at state-of-the-art level (Geiß and Gertz, 2016; Spitz et al., 2016b), but links entities to the continuously evolving Wikidata instead of traditional KBs. As we outline in the following, extracting respective sets of entities from a KB for each such class is by no means trivial (Spitz et al., 2016a), especially given the complexity of simultaneously dealing with multi-level class and instance structures inherent to existing KBs, an aspect that is also pointed out by Brasileiro et al. (2016).

However, there are several reasons to choose Wikidata over other KBs. First, especially when dealing with news articles and social media data streams, it is crucial to have an up-to-date repository of persons and organizations. To the best of our knowledge, at the time of writing this paper, the most recent version of DBpedia was published in April 2016, and the latest evaluated version of YAGO in September 2015, whereas Wikidata provides a weekly data dump. Even though all three KBs (Wikidata, DBpedia, and YAGO3) are based on Wikipedia, Wikidata also contains information about entities and relationships that have not been simply extracted from Wikipedia (YAGO and DBpedia extract data predominantly from infoboxes) but collaboratively added by users (Müller-Birn et al., 2015). Although the latter feature might raise concerns regarding the quality of the data in Wikidata, for example due to vandalism (Heindorf et al., 2015), we find that the currentness of information far outweighs these concerns when using Wikidata as basis for a named entity classifying framework and as a knowledge base in particular.

While Wikidata provides a SPARQL interface for direct query access in addition to the weekly dumps, this method of accessing the data has several downsides. First, the interface is not designed for speed and is thus ill suited for entity extraction or linking tasks in large corpora, where many lookups are necessary. Second, and more importantly, the continually evolving content of Wikidata prevents reproducibility of scientific results if the online SPARQL access is used, as the versioning is unclear and it is impossible to recreate experimental conditions. Third, we find that the hierarchy and structure in Wikidata is (necessarily) complicated and does not lend itself easily to creating coarse class hierarchies on the fly without substantial prior investigation into the existing hierarchies. Here, NECKAr provides a stable, easy to use view of classified Wikidata entities that is based on a selected Wikidata dump and allows reproducible results of subsequent IE tasks.

In summary, we make the following contributions: We provide an easy to use tool for assigning Wikidata items to commonly used NE classes by exclusively utilizing Wikidata. Furthermore, we make one such resulting Wikidata NE dataset available as a resource, including basic statistics and a thorough comparison to YAGO3.

The remainder of the paper is structured as follows. After a brief discussion of related work in the following section, we describe our named entity classifier in detail in Section 3 and present the resulting Wikidata NE dataset in Section 4. Section 5 gives a comparison of our results to YAGO3.

2 Related Work

The DBpedia project (http://wiki.dbpedia.org/) extracts structured information from Wikipedia (Auer et al., 2007). The 2016-04 version includes 28.6M entities, of which 28M are classified in the DBpedia Ontology. This DBpedia 2016-04 ontology is a directed acyclic graph that consists of 754 classes. It was manually created and is based on the most frequently used infoboxes in Wikipedia. For each Wikipedia language version, there are mappings available between the infoboxes and the DBpedia ontology. In the current version, there are 3.2M persons, 3.1M places and 515,480 organizations. To be available in DBpedia, an entity needs to have a Wikipedia page (in at least one language version that is included in the extraction) that contains an infobox for which a valid mapping is available.

YAGO3 (Mahdisoltani et al., 2015), the multilingual extension of YAGO, combines information from 10 different Wikipedia language versions and fuses it with the English WordNet. YAGO (http://www.yago-knowledge.org/) concentrates on extracting facts about entities, using Wikipedia categories, infoboxes, and Wikidata. The YAGO taxonomy is constructed by making use of the Wikipedia categories. However, instead of using the category hierarchy that is “barely useful for ontological purposes” (Suchanek et al., 2007), the Wikipedia categories are extracted, filtered and parsed for noun phrases in order to map them to WordNet classes. To include the multilingual categories, Wikidata is used to find corresponding English category names. As a result, the entities are assigned to more than 350K classes. YAGO3, which is extracted from Wikipedia dumps of 2013-2014, includes about 4.6M entities.

Both KBs solely depend on Wikipedia.
Since it takes some time to update or create the KBs, they do not include up-to-date information. In contrast, the current version of Wikidata can be directly queried and a fresh Wikidata dump is available every week. Another advantage is that Wikidata does not rely on the existence of an infobox or Wikipedia page. Entities and information about the entities can be extracted from Wikipedia or manually entered by any user, meaning that less significant entities that do not warrant their own Wikipedia page are also represented. Since Wikipedia infoboxes are partially populated through templates from Wikidata entries, extracting data from infoboxes instead of Wikidata itself adds an additional source of errors. Furthermore, unless all language versions of Wikipedia are used as a source, such an approach would even limit the amount of retrieved information due to Wikidata’s inherent multi-lingual design as the knowledge base behind all Wikipedias (Vrandečić and Krötzsch, 2014).

For completeness, Freebase (https://developers.google.com/freebase/) should be mentioned as a fourth available knowledge base that has historically been used as a popular alternative to YAGO and DBpedia. However, efforts have recently been taken to merge it entirely into Wikidata (Tanon et al., 2016). Given the need for current, up-to-date entity information in many event-related applications, the fact that Freebase is no longer actively maintained and updated means that it is increasingly ill-suited for such tasks.

3 The NECKAr Tool

The Named Entity Classifier for Wikidata (NECKAr) assigns Wikidata entities to the NE classes PERSON, LOCATION, and ORGANIZATION. The tool, which is available as open source code (see the URL in the Abstract), is easy to use and only requires a minimum setup of Python3 packages as well as an instance of a MongoDB.

3.1 Wikidata data model

Wikidata (http://www.wikidata.org/) is a free and open knowledge base that is intended to serve as central storage for all structured data of Wikimedia projects. The data model of Wikidata consists primarily of two major components: items and properties. Items represent all things in human knowledge. Each item corresponds to a clearly identifiable concept or object, or to an instance of a concept or object. For example, there is one item for the concept river and one item Neckar, which is an instance of a river. In this context, a concept is the same as a class. All items are denoted by numerical identifiers prefixed with a Q, while properties have numerical identifiers prefixed with P. Properties (P) connect items (Q) to values (V). A pair (P,V) is called a statement. A property classifies the value of a statement. Figure 1 shows a simplified entry for the item Neckar. Here P2043 describes that the value 367 km has to be interpreted as the length of the river. Both items and properties have a label, a description, and (multilingual) aliases. Property entries in Wikidata do not have statements, but data types that restrict what can be given as a property's value. These data types include items, external identifiers (e.g. ISBN codes), URLs, geographic coordinates, strings, or quantities, to name a few.

Figure 1: Wikidata data model

When we are interested in the classification of items, we need to know which item is an instance of which class. Class membership of an item is predominantly modelled by the property instance of (P31). For example, consider the statement Q1673:P31:Q4022 (Neckar is an instance of river), in which Q4022 can be seen as a class. Classes can be subclasses of other classes, e.g., river is a subclass of watercourse, which is a subclass of land water. Figure 2 shows the subclass graph for river.

Figure 2: Class hierarchy for river, generated with the Wikidata Graph Builder

The property subclass of (P279) is transitive, meaning that since Neckar is an instance of river, which is a subclass of watercourse, Neckar is implicitly also an instance of watercourse. Due to this transitivity rule, in Wikidata there is no need to specify more than the most specific statement (http://www.wikidata.org/wiki/Help:Basic_membership_properties). In other words, there is no statement that directly specifies Neckar to be a geographic location. Thus, we cannot simply extract items that are instances of the general classes. There are, for example, only 1,733 items that are direct instances of geographic location. Instead, we need to extract the transitive hull, that is, all items that are an instance of any subclass of the general class (henceforward root classes). There are several tools available to show and query the class structure of Wikidata. For Figure 2 we used the Wikidata Graph Builder (WGB, http://angryloki.github.io/wikidata-graph-builder/) to visualize the class tree. For NECKAr, we make use of the SPARQL based Wikidata Query Service (https://query.wikidata.org/) to extract all subclasses of a root class, e.g., geographic location. Once the subclasses of a root class are identified, we can extract all items that are instances of these subclasses.

The task is then to find root classes that, together with their subclasses, best represent the predominantly used NE classes LOCATION, ORGANIZATION, and PERSON. In the following we describe how items of these classes are extracted and what kind of information we store for each item. For all items, we store the Wikidata ID, the label (the most common name for an item), the links to the English and German Wikipedia, and the description.

3.2 Location extraction

To extract all locations from Wikidata, we use the root class geographic location (Q2221906). This class is very large and includes 23,383 subclasses (here and in the following, all class and subclass sizes are as of February 22, 2017). For each location item, we extract the following statements: coordinate location (P625), population (P1082), country (P17), and continent (P30). Additionally, we assign a location type if an item is an instance of a subclass of the root class for that location type (see Table 1).

location type        root classes                   # sub
continent            Q5107                              1
country              Q6256, Q1763527, Q3624078         50
state                Q7275                            173
settlement           Q486972                        1,224
city                 Q515                             126
sea                  Q165                              12
river                Q4022                             34
mountain             Q8502                             81
mountain range       Q1437459                          18
territorial entity   Q15642541                      3,657

Table 1: Location types with corresponding Wikidata root classes and number of subclasses

In this large set of subclasses of geographic location, we encounter several problems. For example, Food is a subclass of geographic location. Food is connected to geographic location by a path of length 3 (food → energy storage → storage → geographic location). We cannot simply limit the allowed path length since there are other subclasses with a greater path length that we consider a valid location. For example, the shortest path for village of Japan has a length of 4 (village of Japan → municipality of Japan → municipality → human settlement → geographic location). In this case we decided to exclude the subtree for Food, which reduces the number of subclasses considerably to 13,445. However, there might be other subclasses that are not considered a proper location (e.g., Arcade Video Game with the path: arcade video game → arcade game machine → computing platform → computing infrastructure → infrastructure → construction → geographical object → geographic location). For the time being we only exclude the Food subclasses. The identified location items can be filtered for a certain application by using the location type or by only using items for which a coordinate location is given.

Fields shared by all three classes:

neClass        LOCATION                    ORGANIZATION               PERSON
id             Q1796771                    Q81230                     Q76658
norm name      Köthen                      Siemens                    Frank-Walter Steinmeier
description    capital of the district     Engineering and            politician
               of Anhalt-Bitterfeld        electronics
               Saxony-Anhalt               conglomerate
en Wikipedia   Köthen (Anhalt)             Siemens                    Frank-Walter Steinmeier

Class-specific fields:

LOCATION       location type: city, settlement; population: 26,384; continent: Europe; country: Germany; coordinate: 51.75, 11.916666666667; GeoNames: 2885237
ORGANIZATION   instance of: concern, bus. enterprise; CEO: Joe Kaeser, Klaus Kleinfeld; founder: Ernst Werner von Siemens; inception: 1847-10-01; HQ: Munich; country: Germany; website: www.siemens.com
PERSON         occupation: politician, jurist, lawyer; gender: male; dob: 1956-01-05; dod: none; alias: Steinmeier

Table 2: Example of classified entities
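The transitive hull described in Section 3.1 and the exclusion of the Food subtree from Section 3.2 can be illustrated with a small sketch. This is not NECKAr's code: NECKAr obtains the subclass list from the Wikidata Query Service, whereas the sketch below walks a toy in-memory edge list. The function names, the label-based identifiers, the edge from land water to geographic location, and the items dish and sushi are assumptions made for illustration only.

```python
from collections import defaultdict, deque

# Toy (child, parent) pairs standing in for "subclass of" (P279) statements.
# Labels are used instead of Q-identifiers to keep the example readable.
SUBCLASS_OF = [
    ("river", "watercourse"),
    ("watercourse", "land water"),
    ("land water", "geographic location"),  # assumed link, illustration only
    ("food", "energy storage"),
    ("energy storage", "storage"),
    ("storage", "geographic location"),
    ("dish", "food"),                        # hypothetical subclass of food
    ("village of Japan", "municipality of Japan"),
    ("municipality of Japan", "municipality"),
    ("municipality", "human settlement"),
    ("human settlement", "geographic location"),
]

# Toy (item, class) pairs standing in for "instance of" (P31) statements.
INSTANCE_OF = [("Neckar", "river"), ("sushi", "dish")]


def subclass_closure(subclass_of_edges, root, excluded_roots=()):
    """Collect the root class and all of its direct and indirect subclasses.

    Classes listed in excluded_roots are never entered, so a class below
    them is dropped unless it is also reachable along another path.
    """
    children = defaultdict(set)
    for child, parent in subclass_of_edges:
        children[parent].add(child)

    excluded = set(excluded_roots)
    seen = {root}
    queue = deque([root])
    while queue:  # breadth-first walk down the subclass graph
        current = queue.popleft()
        for child in children[current]:
            if child in seen or child in excluded:
                continue
            seen.add(child)
            queue.append(child)
    return seen


def instances_of_any(instance_of_edges, classes):
    """Return all items that are a direct instance of one of the classes."""
    return {item for item, cls in instance_of_edges if cls in classes}


location_classes = subclass_closure(
    SUBCLASS_OF, "geographic location", excluded_roots={"food"}
)
locations = instances_of_any(INSTANCE_OF, location_classes)
```

No path-length limit is needed in this form: village of Japan is kept although its path is longer than the path for food, while food and everything reachable only through it are left out.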

3.3 Organization extraction

The root class organization (Q43229) includes 4,811 subclasses, such as nonprofit organization, political organization, team, musical ensemble, newspaper, or state. For each item in this category, we extract additional information such as country (sovereign state of this item, P17), founder (P112), CEO (P169), inception (P571), headquarter location (P159), instance of (P31), official website (P856), and official language (P37).

3.4 Person extraction

To extract all real world persons from Wikidata, we only use the class human (Q5) instead of a list of subclasses. In Wikidata, a more specific classification of a person is usually given by the occupation property or by having several instance of statements. All items with the statement is instance of human are classified as person. Fictional characters, such as Homer Simpson or Harry Potter, and deities that are not also classified as human are not extracted. For each person, we gather some basic information: date of birth (dob) (P569), date of death (dod) (P570), gender (P21), occupation (P106), and alternative names.

3.5 Extracting links to other knowledge bases

In addition to the above information, we also record identifiers for the items in other publicly available databases (Wikipedia, DBpedia, Integrated Authority File of the German National Library, Internet Movie Database, MusicBrainz, GeoNames, OpenStreetMap). This information is represented in Wikidata as statements and can be extracted analogously to the examples above.

3.6 Extraction algorithm

The named entity classes to be extracted can be specified in a configuration file. For each chosen class of named entities, the process then works as described in the following. First, the subclasses of the root class are extracted using the Wikidata SPARQL API. The output of this step is a list of subclasses, from which the invalid subclasses are excluded. For locations, we also generate lists of subclasses for the specific location types. The tool then searches the Wikidata dump (stored in a local MongoDB) for all items that are instances of one of the subclasses in the list and extracts the common features (id, label, description, Wikipedia links). Depending on the named entity class, additional information (see above) is extracted, and for locations, the list of location type subclasses is used to assign a location type. This data is then stored in a new, intermediary MongoDB collection. In a subsequent step, we extract for each item the identifiers that link them to the other databases as described in Section 3.5 and store them in a separate collection. In the last step, the data is exported to CSV and JSON files for ease of use.

4 Wikidata NE Dataset

The Wikidata NE dataset (http://event.ifi.uni-heidelberg.de/?page_id=532) was extracted using the NECKAr tool. For the version that we discuss in this paper, we extracted entities from the Wikidata dump from December 5, 2016, which includes 24,580,112 items and 2,910 distinct properties. In total, we extracted and classified 8,842,103 items, of which 51.8% are locations, 37.6% persons, and 10.6% organizations. Table 2 shows examples for each named entity class, including the class specific additional information.

4.1 Location entities

Of the 4,582,947 identified locations, 51% have geographic coordinates. Location types are extracted for 93% of the location items (see Table 3). Most of the classified locations are settlements and territorial entities. We find over 2,400 countries: although there currently are only 206 countries (https://en.wikipedia.org/wiki/List_of_sovereign_states), Wikidata also includes former countries like the Roman Empire, Ancient Greece, or Prussia.

type                 count
continent               10
country              2,496
state                4,330
city                25,470
territorial entity  1,805,213
settlement          1,983,860
sea                    183
river              199,991
mountain range       7,814
mountain           229,853

Table 3: Number of entities for location types

4.2 Person entities

We extracted 3,322,217 persons, of which 78% are male, 15% female, while for 7% of the persons another gender or no gender is specified. Occupations are given for 66% of the person items, where the largest group are politicians, followed by football players (see Table 4). Wikidata covers mostly persons from recent history, so 70% of the persons for whom a birth date is given (over 2.5M persons) were born in the 20th century, while around 20% were born in the 19th century.

#  occupation               count
1  politician             312,571
2  assoc. football player 206,142
3  actor                  170,291
4  writer                 127,837
5  painter                 99,060

Table 4: The five most frequent occupations

4.3 Organization entities

936,939 items were classified as organizations, of which 11% are business enterprises. Table 5 shows the top 5 organization types. Where possible, we also extracted the country in which the organization is based. Figure 3 shows a heatmap of the number of organizations per country. Most organizations are based in the U.S.A., followed by France and Germany. This is partially due to the fact that commune of France and municipality of Germany are subclasses of organization.

#  type                        count
1  business enterprise       102,129
2  band                       58,996
3  commune of France          38,387
4  primary school             36,078
5  association football club  31,257

Table 5: The five most frequent organization types

Figure 3: Heatmap of organization frequency by country

4.4 Assignment to more than one class

400,856 Wikidata items are assigned to more than one NE class by NECKAr. The vast majority of this subset (over 99%) are members of the two classes location and organization. This is mainly caused by a subclass overlap between the root classes geographic location and organization. In total, they share 1,310 subclasses, e.g., hospital, state or library and their respective subclasses. We do not favour one class over the other, because both interpretations are possible, depending on the context. There are also items that have several instance of statements, which in six cases leads to an assignment to all three classes, e.g., Jean Leon is described as instance of human and instance of winery, which is a subclass of both organization and geographic location. There are 116 items that are classified as person and location or person and organization, which is again caused by multiple instance of statements. In contrast to the subclass overlap between root classes, these cases are caused by incorrect user input into Wikidata.
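The way one item can end up in several NE classes, through a shared subclass or through several instance of statements, can be sketched as follows. This is a simplified illustration and not the NECKAr implementation: the item dictionaries flatten the real dump format to a plain list of instance of (P31) values, labels stand in for Q-identifiers, and the class sets, the identifiers item-2 to item-4, and the name classify_item are hypothetical.

```python
# Per NE class: the root class plus its (already extracted) subclasses.
# The sets are tiny stand-ins; hospital and winery appear in two of them
# to mimic the subclass overlap between geographic location and organization.
NE_CLASS_SETS = {
    "LOCATION": {"geographic location", "river", "hospital", "winery"},
    "ORGANIZATION": {"organization", "business enterprise", "hospital", "winery"},
    "PERSON": {"human"},
}

# Simplified stand-ins for dump entries (assumed structure, not the real JSON).
ITEMS = [
    {"id": "Q1673", "label": "Neckar", "P31": ["river"]},
    {"id": "item-2", "label": "some hospital", "P31": ["hospital"]},
    {"id": "item-3", "label": "Jean Leon", "P31": ["human", "winery"]},
    {"id": "item-4", "label": "Homer Simpson", "P31": ["fictional character"]},
]


def classify_item(item, class_sets=NE_CLASS_SETS):
    """Return every NE class whose class set contains one of the item's P31 values."""
    instance_of = set(item.get("P31", []))
    return sorted(
        ne_class
        for ne_class, classes in class_sets.items()
        if instance_of & classes  # non-empty intersection means membership
    )


assignments = {item["label"]: classify_item(item) for item in ITEMS}
multi_class = {label for label, classes in assignments.items() if len(classes) > 1}
```

In this toy setup the hospital is assigned to LOCATION and ORGANIZATION because of the shared subclass, the Jean Leon entry receives all three classes because of its two instance of values, and the fictional character receives none.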

5 Comparison to YAGO3

In order to get an estimate of the quality of the NECKAr tool, we compare the resulting Wikidata NE dataset to the currently available version of YAGO (Version 3.0.2). When using the YAGO3 hierarchy to classify YAGO3 entities, we find 1,745,219 distinct persons (member of YAGO3 class wordnet_person_100007846), 1,267,402 distinct locations (member of YAGO class yagoGeoEntity) and 481,001 distinct organizations (member of YAGO class wordnet_social_group_107950920) for a total of 3,493,622 entities in comparison to the 8.8M entities in the Wikidata NE dataset (see Table 6).

NE    NECKAr      YAGO3       YAGO3 ∩ WD
LOC   4,582,947   1,267,402   1,250,409
PER   3,322,217   1,745,219   1,715,305
ORG     936,939     481,001     464,351

Table 6: Number of entities per class in the Wikidata NE dataset created by NECKAr, YAGO3 and the intersection of YAGO3 and Wikidata

YAGO3 entities can be linked to Wikidata entries via their subject id, which corresponds to Wikipedia page names. If a YAGO3 entity is derived from a non-English Wikipedia, the subject id is prefixed with the language code. For 3,430,065 YAGO3 entities we find a corresponding entry in Wikidata (1,715,305 persons, 1,250,409 locations and 464,351 organizations). This subset is the basis for our comparison in the following.

To assess the quality of NECKAr, the well-known IR measures F1-score, precision (P) and recall (R) are used. Precision is a measure for exactness, that is, how many of the classified entities are classified correctly. Recall measures completeness and gives the fraction of correctly classified entities of all given entities. F1 is the harmonic mean of P and R. The measures are defined as:

\[
F_1 = 2 \cdot \frac{P \cdot R}{P + R}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}
\tag{1}
\]

Here, TP (true positives) is the number of YAGO3 entities that NECKAr assigns to the same class, while FP (false positives) is the number of YAGO3 entities that are falsely assigned to that class. TP + FP represents the number of entities assigned to that class by NECKAr. FN (false negatives) is the number of YAGO3 entities in a given class that NECKAr does not assign to that class (these entities might be assigned to a different class or to no class). Thus, TP + FN is the number of YAGO3 entities in a given class. Using these standard metrics, we receive an overall F1-score of 0.88 with P = 0.90 and R = 0.86 (see Table 7). The lower recall is due to the fact that NECKAr does not classify all entries that are a person, location, or organization entity in YAGO3. Only about 88% of the YAGO3 entities that correspond to Wikidata entries are classified. For example, NECKAr does not find Pearson, a town in Victoria, Australia, because the Wikidata entry does not include any is instance of relation. This is true for 290,905 of the 387,259 entities (75.12%) that are not classified by NECKAr. Some entities are missed by NECKAr entirely for a couple of reasons. In some cases, the correct is instance of relation is not given in Wikidata. In others, a relevant subclass or property may not have been included. Finally, since YAGO3 was automatically extracted and not every fact was checked for correctness, it contains some erroneous claims or classifications. For example, some overview articles in Wikipedia are classified as entities in YAGO, such as Listed buildings in or Index of. The original evaluation of YAGO3 lists the fraction of incorrect facts that it contains as 2% (Mahdisoltani et al., 2015).

NE    F1     P      R
LOC   0.88   0.93   0.84
PER   0.97   0.99   0.95
ORG   0.57   0.54   0.60
all   0.88   0.90   0.86

Table 7: Evaluation results (F1 score, Precision (P) and Recall (R)) for the Wikidata NE dataset created by NECKAr in comparison to YAGO3

5.1 Location comparison

For LOCATION, NECKAr achieves an F1-score of 0.88 (P = 0.93 and R = 0.84). 170,869 YAGO3 locations were not classified, of which 81% have no entry in Wikidata for the instance of property. Of the entities that are assigned to a different class, NECKAr classified 97.6% as ORGANIZATION instead of LOCATION. Most of these entities (85%) are radio or television stations for which a classification into either class is a matter of debate. These items are described in Wikidata as instance of radio station or television station, which are subclasses of ORGANIZATION. The majority of the FPs for locations are assigned by NECKAr to two classes (ORGANIZATION and LOCATION), whereas in YAGO3 these are only organizations.

5.2 Person comparison

For entities of class PERSON, we receive the highest F1-score of all classes with 0.97 (P = 0.99 and R = 0.95). Most of the entities that NECKAr assigned to a different class (90% to ORGANIZATION, 10% to LOCATION) are bands or musical ensembles which are classified as ORGANIZATION.

5.3 Organization comparison

The class ORGANIZATION shows the lowest F1-score of 0.57 (P = 0.54 and R = 0.60). The low precision is caused by the high number of false positives. As discussed in the previous sections, many entities that are classified as persons or locations by YAGO3 are classified as organizations by NECKAr. The low recall is due to the fact that 156,926 YAGO3 organizations were not identified by NECKAr. Again, the majority of these items (82%) has no is instance of relation in Wikidata, so NECKAr was not able to classify them. The reason for the missing 18% warrants future investigation in more detail, as it is possible that an important subclass or property was excluded. 29,625 YAGO3 organizations were assigned to another class, 96% to LOCATION and 4% to PERSON. Many of these items are constituencies or administrative units, which could be seen as organizations and/or locations.

In summary, we find that the application of NECKAr to Wikidata produces a set of classified entities that is comparable in quality to a well known and widely used knowledge base. However, in contrast to existing knowledge bases, which are not updated regularly, NECKAr can be used to extract substantially more entities and up-to-date lists of persons, locations and organizations. Since NECKAr can be applied to weekly dumps of Wikidata, it can be used to extract a reproducible resource for subsequent IE tasks.

6 Conclusion and Future Work

In this paper, we introduced the NECKAr tool for assigning NE classes to Wikidata items. We discussed the data model of Wikidata and its class hierarchy. The resulting NE dataset offers the simple classification of entities into locations, organizations and persons that is often used in IE tasks. The dataset includes basic, class specific information on each item and links the items to other linked data sets. The clear and lightweight structure makes the dataset a valuable and easy to use resource. Much of the original more fine grained classification is preserved and can be used to create application-specific subsets. A comparison to YAGO3 showed that NECKAr is able to create state-of-the-art lists of entities with the added advantage of providing larger and more recent data.

Based on these results, we are further investigating the Wikidata class hierarchy in order to reduce the number of incorrect or multiple assignments. We are also working on an automated process to provide the Wikidata NE dataset on a monthly basis. In future releases of NECKAr, we plan to include the option of choosing between a Wikidata dump and the SPARQL API as source for obtaining the entity data.

References

Charu C. Aggarwal and Karthik Subbian. 2012. Event Detection in Social Streams. In Proceedings of the Twelfth SIAM International Conference on Data Mining, ICDM, pages 624–635.

James Allan. 2012. Topic Detection and Tracking: Event-based Information Organization, volume 12. Springer Science & Business Media.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. 2007. DBpedia: A Nucleus for a Web of Open Data. In The Semantic Web, 6th International Semantic Web Conference, ISWC, pages 722–735.

Thorsten Brants and Francine Chen. 2003. A System for New Event Detection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 330–337.

Freddy Brasileiro, João Paulo A. Almeida, Victorio Albani de Carvalho, and Giancarlo Guizzardi. 2016. Applying a Multi-Level Modeling Theory to Assess Taxonomic Hierarchies in Wikidata. In Proceedings of the 25th International Conference on World Wide Web, WWW Companion Volume, pages 975–980.

Junyan Ding, Luis Gravano, and Narayanan Shivakumar. 2000. Computing Geographical Scopes of Web Resources. In Proceedings of 26th International Conference on Very Large Data Bases, VLDB, pages 545–556.

Michael Färber, Frederic Bartscherer, Carsten Menne, and Achim Rettinger. 2017. Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal.

Jenny Rose Finkel, Trond Grenager, and Christopher D. Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL, pages 363–370.

Johanna Geiß and Michael Gertz. 2016. With a Little Help from my Neighbors: Person Name Linking Using the Wikipedia Social Network. In Proceedings of the 25th International Conference on World Wide Web, WWW Companion Volume, pages 985–990.

Stefan Heindorf, Martin Potthast, Benno Stein, and Gregor Engels. 2015. Towards Vandalism Detection in Knowledge Bases: Corpus Construction and Analysis. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 831–834.

Farzaneh Mahdisoltani, Joanna Biega, and Fabian M. Suchanek. 2015. YAGO3: A Knowledge Base from Multilingual Wikipedias. In Seventh Biennial Conference on Innovative Data Systems Research, CIDR.

Claudia Müller-Birn, Benjamin Karran, Janette Lehmann, and Markus Luczak-Rösch. 2015. Peer-production System or Collaborative Ontology Engineering Effort: What is Wikidata? In Proceedings of the 11th International Symposium on Open Collaboration, pages 20:1–20:10.

Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors. In Proceedings of the 19th International Conference on World Wide Web, WWW, pages 851–860.

Andreas Spitz, Vaibhav Dixit, Ludwig Richter, Michael Gertz, and Johanna Geiß. 2016a. State of the Union: A Data Consumer’s Perspective on Wikidata and Its Properties for the Classification and Resolution of Entities. In Wiki, Papers from the 2016 ICWSM Workshop.

Andreas Spitz, Johanna Geiß, and Michael Gertz. 2016b. So Far Away and Yet so Close: Augmenting Toponym Disambiguation and Similarity with Text-based Networks. In Proceedings of the Third International ACM SIGMOD Workshop on Managing and Mining Enriched Geo-Spatial Data, GeoRich@SIGMOD, pages 2:1–2:6.

Jannik Strötgen and Michael Gertz. 2016. Domain-Sensitive Temporal Tagging. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: A Core of Semantic Knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW, pages 697–706.

Thomas Pellissier Tanon, Denny Vrandečić, Sebastian Schaffert, Thomas Steiner, and Lydia Pintscher. 2016. From Freebase to Wikidata: The Great Migration. In Proceedings of the 25th International Conference on World Wide Web, WWW, pages 1419–1428.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A Free Collaborative Knowledgebase. Commun. ACM, 57(10):78–85.

Giridhar Kumaran and James Allan. 2004. Text Classi- fication and Named Entities for New Event Detec- tion. In Proceedings of the 27th Annual Interna- tional ACM SIGIR Conference on Research and De- velopment in Information Retrieval, pages 297–304.