Semantic Web 0 (0) 1–35 1 IOS Press

Lessons Learnt from the Named Entity rEcognition and Linking (NEEL) Challenge Series Emerging Trends in Mining Semantics from Tweets

Editor(s): Andreas Hotho, Julius-Maximilians-Universität Würzburg, Germany; Robert Jäschke, The University of Sheffield, UK; Kristina Lerman, University of Southern California, USA Solicited review(s): Tim Baldwin, The University of Melbourne, Australia; Hao Li, SmartFinance, USA; One anonymous reviewer

Giuseppe Rizzo a,∗, Bianca Pereira b, Andrea Varga c, Marieke van Erp d and Amparo Elizabeth Cano Basave e a ISMB, Turin, Italy. E-mail: [email protected] b Insight Centre for Data Analytics, NUI Galway, Ireland. E-mail: [email protected] c The Content Group, Godalming, UK. E-mail: [email protected] d Vrije Universiteit Amsterdam, Netherlands. E-mail: [email protected] e Cube Global, United Kingdom. E-mail: [email protected]

Abstract. The large number of tweets generated daily is providing decision makers with means to obtain insights into recent events around the globe in near real-time. The main barrier for extracting such insights is the impossibility of manual inspection of a diverse and dynamic amount of information. This problem has attracted the attention of industry and research communities, resulting in algorithms for the automatic extraction of semantics in tweets and linking them to machine readable resources. While a tweet is shallowly comparable to any other textual content, it hides a complex and challenging structure that requires domain- specific computational approaches for mining semantics from it. The NEEL challenge series, established in 2013, has contributed to the collection of emerging trends in the field and definition of standardised benchmark corpora for entity recognition and linking in tweets, ensuring high quality labelled data that facilitates comparisons between different approaches. This article reports the findings and lessons learnt through an analysis of specific characteristics of the created corpora, limitations, lessons learnt from the different participants and pointers for furthering the field of entity recognition and linking in tweets.

Keywords: Microposts, Named Entity Recognition, Named Entity Linking, Disambiguation, Knowledge Base, Evaluation, Challenge

1. Introduction cient contextual information necessary for understand- ing its content. A commonly used approach is to ex- Tweets have been proven to be useful in different tract named entities, which are information units such applications and contexts such as music recommen- as the names of a Person or an Organisation, a Loca- dation, spam detection, emergency response, market tion, a Brand, a Product, a numerical expression in- analysis, and decision making. The limited number cluding Time, Date, Money and Percent found in a sen- of tokens in a tweet however implies a lack of suffi- tence [41], and enrich the content of the tweet with such information. In the context of the NEEL chal- *Corresponding author. E-mail: [email protected] lenge series, we extended this definition of named en-

1570-0844/0-1900/$27.50 c 0 – IOS Press and the authors. All rights reserved 2 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series tity as being a phrase representing the name, excluding the computing time was not taken into account in the the preceding definite article (i.e. “the”) and any other final evaluation. pre-posed (e.g. “Dr”,“Mr”) or post-posed modifiers, The four challenges have published four incremen- that belongs to a class of the NEEL Taxonomy (Ap- tal manually labeled benchmark corpora. The creation pendix A) and is linked to a DBpedia resource. Seman- of the corpora followed rigid designations and proto- tically enriched tweets have been shown to help ad- cols, resulting in high quality labeled data that can be dressing complex information seeking behaviour in so- used as seeds for reasoning and supervised approaches. cial media, such as semantic search [1], deriving user Despite these protocols, the corpora have strengths and interests [2], and disaster detection [77]. weaknesses that we have discovered over the years and The automated identification, classification and they are discussed in this article. linking of named entities has proven to be challeng- The purpose of each challenge was to set up an open ing due to, among other things, the inherent character- and competitive environment that would encourage istics tweets: i) the restricted length and ii) the noisy participants to deliver novel approaches or improve on lexical nature, i.e. terminology differs between users existing ones for recognising and linking entities from when referring to the same thing, and non-standard tweets to either a referent knowledge base entry or abbreviations are common. Numerous initiatives have NIL where such an entry does not exist. From the first contributed to the progress in the field broadly cov- (in 2013) to the 2016 NEEL challenge, thirty research ering different types of textual content (and thus go- teams have submitted at least one entry to the com- ing beyond the boundaries of tweets). For example petitions proposing state-of-the-art approaches. More TAC-KBP [55] has established a yearly challenge in than three hundred teams have explicitly acquired the the field covering newswire, websites, discussion fo- corpora in the four years, underlining the importance 1 rum posts, ERD [17] with search queries content, and of the challenges in the field. The NEEL challenges SemEval [58] with technical manuals and reports. have also attracted from strong interest from the indus- The NEEL challenge series, established first in 2013 try as both participants and funding agencies. For ex- and since then running yearly, has captured a com- ample, in 2013 and 2015 the best performing systems munity need for making sense from tweets through a were proposed by industrial participants. The prizes 2 wealth of high quality annotated corpora and to mon- were sponsored by industry (ebay in 2013 and Spazio- 3 4 itor the emerging trends in the field. The first edition Dati in 2015) and research projects (LinkedTV in 5 of the challenge named Concept Extraction (CE) Chal- 2014, and FREME in 2016). The NEEL challenge lenge [15] focused on entity identification and classi- also triggered the interest of localised challenges such fication. A step further into this task is to ground enti- as NEEL-IT, the NEEL challenge for tweets written in ties in tweets by linking them to knowledge base refer- Italian [8] that brings the multilinguality aspect in the ents. This prompted the Named Entity Extraction and NEEL contest. Linking (NEEL) Challenge the following year [16]. This paper reports on the findings and lessons learnt These two research avenues, which add to the intrin- from the last four years of NEEL challenges, analysing sic complexity of the tasks proposed in 2013 and 2014, the corpora in detail, highlighting their limitations, and prompted the Named Entity rEcognition and Linking providing guidance to implement top performing ap- (NEEL) Challenge in 2015 [70]. In 2015, the role of proaches in the field from the different participants. the named entity type in the grounding process was in- The resulting body of work has implications for re- vestigated, as well as the identification of named enti- searchers, application designers and social media en- ties that cannot be grounded because they do not have gineers who wish to harvest information from tweets a knowledge base referent (defined as NIL). The En- for their own objectives. The remainder of this paper glish DBpedia 2014 dataset was the designated refer- is structured as follows: in Section 2 we introduce a ent knowledge base for the 2015 NEEL challenge, and comparison with recent shared tasks in entity recog- the evaluation was performed through live querying the Web APIs participants prepared, in an automatic 1This number does not account for the teams who experimented fashion to measure the computing time. The 2016 edi- with the corpora out of the challenges’ timeline. 2http://www.ebay.com tion [71] consolidated the 2015 edition, using the En- 3http://www.spaziodati.eu glish DBpedia 2015-04 version as referent knowledge 4http://www.linkedtv.eu base. This edition proposed an offline evaluation where 5http://freme-project.eu/ Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 3 nition and linking and underline the reason that has This task became known in the research community as prompted the need to establish the NEEL challenge se- Entity Disambiguation. ries. Next, in Section 3, the decisions regarding dif- The Entity Disambiguation task popularised after ferent versions of the NEEL challenge are introduced Bunescu and Pasca [13] in 2006 explored the use of an and the initiative is compared against the other shared encyclopaedia as a source for entities. In particular, af- tasks. We then detail the steps followed in generating ter [23] demonstrated the benefit of using ,6 the four different corpora in Section 4, followed by a a free crowd-sourced encyclopaedia, for such purpose. quantitative and qualitative analysis of the corpora in The reason why encyclopedic knowledge is important Section 5. We then list the different approaches pre- is that an encyclopaedia contains representation of en- sented and narrow down the emerging trends in Sec- tities in a variety of domains, and, moreover, contains a tion 6, grounding the trends according to the evalua- single representation for each entity along with a sym- tion strategies presented in Section 7. Section 8 reports bolic or textual description. Therefore from 2006 un- the participants’ results and provides an error analy- til 2009, there were two main areas of research: En- sis. We conclude and list our future activities in Sec- tity Recognition, as a legacy of the work started during tion 9. We then provide in appendix the NEEL Taxon- the MUC-7 challenge, and Entity Disambiguation, ex- omy (Appendix A) and the NEEL Challenge Annota- ploring encyclopedic knowledge bases as catalogs of tion Guidelines (Appendix B). entities. In 2009, the TAC-KBP challenge [55] introduced a new problem to both the Entity Recognition and Entity 2. Entity Linking Background Disambiguation communities. In Entity Recognition, the mention is recognised in text without information The first research challenge to identify the impor- about the exact entity that is referred to by the men- tance of the recognition of entities in textual docu- tion. On the other hand, Entity Disambiguation focuses ments was held in 1997 during the 7th Message Un- only on the resolution of entities that have a referent in derstanding Conference (MUC-7) [20]. In this chal- a provided knowledge base. The TAC-KBP challenge lenge, the term named entity was used for the first illustrated the problem that a mention identified in text time to represent terms in text that refer to instances may not have a referent entity in the knowledge base. of classes such as Person, Location, and Organisa- In this case, the suggestion was to link such a mention tion. Since then, named entities have become a key to a NIL entity in order to indicate that it is not present aspect in different research domains, such as Infor- in the knowledge base. This problem was referred as mation Extraction, Natural Language Processing, Ma- Named Entity Linking and it is still a hard and current chine Learning, and the Semantic Web. research problem. Nowadays, however, the terms En- Recognising an entity in a textual document was tity Disambiguation and Entity Linking have become the first big challenge, but after overcoming this obsta- interchangeable. cle, the research community moved into a second and Since the TAC-KBP challenge, there has been an challenging task: disambiguating entities. This prob- explosion in the number of algorithms proposed for lem appears when a mention in text may refer to more Entity Linking using a variety of textual documents, than one entity. For instance, the mention Paul appear- Knowledge Bases, and even using different definitions ing in text may refer to the singer Paul McCartney, of entities. This variety, whilst beneficial, also extends to the actor Paul Walker, or to any of the millions of to how approaches are evaluated, regarding metrics people called Paul around the world. In the same man- and gold standard datasets used. Such diversity makes ner, Copacabana can be a mention of the beach in Rio it difficult to perform comparisons between various de Janeiro, Brazil, or the beach in Dubrovnik, Croa- Entity Linking algorithms and creates the need for tia. The problem of ambiguity is translated into the benchmark initiatives. question “which is the exact entity that the mention In this section, we first introduce the main com- in text refers to?”. To solve this problem, recognising ponents of the Entity Linking task and their possible the mention to an entity in text is only the first step variations, followed by a typical workflow used to ap- for the semantic processing of textual documents, the proach to the task, the expected output of each step and next one is to ground the mention to an unambiguous representation of the same entity in a knowledge base. 6http://en.wikipedia.org 4 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series three strategies for evaluation of Entity Linking sys- mention. For instance, let us assume the knowledge tems. We conclude with an overview of benchmark ini- base has two candidate entries to be linked with the tiatives and their decisions regarding the use of Entity mention Michael Jordan. One of these entries refer to Linking components and evaluation strategies. the professor at University of California, Berkeley and the other to the basketball player. In order to decide 2.1. Entity Linking components which is the correct entry to be linked with the text, some context needs to be provided such as: “played a Entity Linking is defined as the task of grounding game yesterday”, or “won the championship”, and so entity mentions (i.e. phrases) in textual documents with on. More context makes the task easier, and little or no knowledge base entries, in which both mention and context makes the task more challenging. knowledge base entry are recognised as references to Short text documents are considered more challeng- the same entity. If there is no knowledge base entry ing than long ones because they have the exact oppo- to ground a specific entity mention then the mention site features such as: the presence of few entity men- should be linked to a NIL reference instead. tions in a single document (due to the limited size In this definition, Entity Linking contains three main of the text); the presence of misspellings or phonetic components: text, knowledge base, and entity. The fea- spelling (e.g. “I call u 2morrow” rather than “I call tures of each component may vary, and consequently, you tomorrow”); and the low availability of contex- have an impact on the results of algorithms used to per- tual information within the text. It is important to note form the task. For instance, a state-of-the-art solution though that even within the short text category there based on long textual documents may have a poor per- are still important distinctions between microposts and formance when evaluated over short documents with search queries that may impact the performance of En- little contextual information within the text. In a sim- tity Linking algorithms. ilar manner, a solution developed to link entities of In performing a search, it is expected that the search types Person, Location, and Organisation may not be query will be composed by a mention to the entity of able to link entities of type Movie. Therefore, the interest being searched and, sometimes, by additional choice of each component defines which type of solu- contextual information. Therefore, despite the chal- tions are being evaluated by each specific benchmark lenge of being a short text document, search queries initiative. are assumed to contain at least one mention to an en- tity and likely to contain additional contextual infor- 2.1.1. Textual Document mation. However, for microposts this assumption does In Entity Linking, textual documents are usually di- not hold. vided in two main categories: long text, and short text. Microposts do not necessarily have an entity as tar- Long textual documents usually contain more than 400 get. For instance, a document with the content “So words, such as news articles and web sites. Short doc- happy today!!!” does not explicitly cite any entity men- 7 uments (such as microposts or tweets) may have as tion. Also, microposts may be used to talk about enti- few as 200 characters, or even contain a single word, ties without providing any context within the text, as in as in search queries. Different types of texts have their “Adele, you rock!!”. In this aspect, Entity Linking for own characteristics that may influence Entity Linking microposts is more challenging than for search queries algorithms. because it is unclear if a micropost will contain an en- Long textual documents provide a series of document- tity and context for the linking. Furthermore, microp- level features that can be explored for Entity Link- osts are also more likely to contain misspellings and ing such as: the presence of multiple entity mentions phonetic writing than search queries. If a search en- in a single document; well-written text (expressed by gine user misspells a term then it is very likely that the lack or relative absence of misspellings); and the she will not find the desired information. In this case, availability of contextual information that supports the it is safe to assume that users will try grounding of each mention. Contextual information to avoid misspellings and phonetic writing as much entails the supporting facts that help in deciding the as possible. On the other hand, in micropost commu- best knowledge base entry to be linked with a given nities, misspellings and phonetic writing are used as strategies to shorten words, thus enabling the commu- 7Microposts is the term used in the social media field to refer to nication of more information within a single micro- tweets and social media posts in general. post. Therefore, misspelling and phonetic writing are Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 5 common features of microposts and need to be taken more likely to have a faster change on their entities of into consideration when performing Entity Linking. interest than manufacturing reports, for instance. Another characteristic of a knowledge base which 2.1.2. Knowledge Base may impact Entity Linking is the amount of entities it The second component of Entity Linking we con- covers. Two knowledge bases with the same charac- sider is the knowledge base used. Knowledge bases teristics may still vary on the amount of entities they differ from each other regarding the domains of knowl- cover. When applied to Entity Linking, the more enti- edge they cover (e.g. domain-specific or encyclopedic ties a knowledge base covers the more likely there will knowledge), the features used to describe entries (e.g. be a matching between text and knowledge base. Of long textual descriptions, attribute-value pairs, or rela- course in this case we should assume that both knowl- tionship links between entities), their ratio of updates, edge bases are focused on representing key entities in and the amount of entities they cover. their domain rather than long tail ones. As with textual documents, different characteristics will impact in the Entity Linking task. The domain 2.1.3. Entity covered by the knowledge base will influence which The third component of interest for Entity Linking entity mentions will possibly have a link. If there is a is the definition of entity. Despite its importance for the mismatch between the domain of the text (e.g. biomed- Entity Linking task, entities are not formally defined. ical text) and the domain of the knowledge base (e.g. Instead, entities are defined either through example or politics) then all, or most, entity mentions found in through the data available. Named entities are the most common case of definition by example. Named entities text will not have a reference in the knowledge base. were introduced in 1997 as part of the Message Under- In the extreme case of complete mismatch, the Entity standing Conference as instances of Person, Organisa- Linking process will be reduced to Entity Recognition. tion, and Geo-political types. An extension of named Therefore, in order to perform linking, the knowledge entities is usually performed through the inclusion of base should at least be partially related to the domain additional types such as Locations, Facilities, Movies. of the text being linked. In these cases there is no formal definition of entities, Furthermore, the features used to describe entities rather they are exemplars of a set of categories. in the knowledge base influence which algorithms can An alternative definition of entities assumes that en- make use of it. For instance, if entities are represented tities are anything represented by the knowledge base. only through textual descriptions, a text-based algo- In other words, the definition of entity is given by the rithm needs to be used to find the best mention-entry data available (in this case, data from the knowledge link. If, however, knowledge base entries are only de- base). Whereas this definition makes the Entity Link- scribed through relationship links with other entities ing task easier by not requiring any refined “human- then a graph-based algorithm may be more suitable. like” reasoning about types, it makes it impossible to The third characteristic of a knowledge base which identify NIL links. If entity is anything in the knowl- impacts Entity Linking is its ratio of updates. Static edge base, how could we ever possibly have, by defi- knowledge bases (i.e. knowledge bases that are not nition, an entity which is not in the knowledge base? or infrequently updated) represent only the status of a The choice of entity will depend on the Entity Link- given domain at the moment it was generated. Any en- ing desired. If the goal is to consider links to NIL then tity which becomes relevant to that domain, after that the definition based on types is the most suitable, oth- point in time will not be represented in the knowledge erwise the definition based on the knowledge base may base. Therefore, in a textual document, only mentions be used. to entities prior to the creation of the knowledge base will have a link, all others would be linked to NIL. The 2.2. Typical Entity Linking Workflow and Evaluation faster entities change in a given domain the more likely Strategies it is for the knowledge base to become outdated. In the likelihood that there is a complete disjoint between text Regardless of the different Entity Linking compo- and knowledge base, all links from text would invari- nents, most proposed systems for Entity Linking fol- ably be linked to NIL. Depending on the textual doc- low a workflow similar to the one presented in Fig- ument to be linked, the ratio of updates may or may ure 1. This workflow is composed of the following not be an important feature. Social and news media are steps: Mention Detection, Entity Typing, Candidate 6 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series

The evaluation of Entity Linking systems is based on this typical workflow and can be of three types: end- to-end, step-by-step, or partial end-to-end. An end-to-end strategy evaluates a system based only on the aggregated result of all its steps. It means that if one step in the workflow does not perform well and its error propagates through all subsequent steps, this type of evaluation will judge the system based only on the aggregated error. In this case, a system that per- forms excellent Candidate Selection but poor Mention Detection can be considered as good as a system that performs a poor Candidate Selection but an excellent Mention Detection. The end-to-end strategy is very useful for application benchmark in which the goal is to maximise the results that will be consumed by an- other application based on Entity Linking (e.g. entity- based search). However, for research benchmark, it is important to know which algorithms are the best fit for each of the steps in the Entity Linking workflow. The opposite to an end-to-end evaluation is a step- by-step strategy. The goal of this evaluation is to pro- vide a robust benchmark of algorithms for each step of the Entity Linking workflow. Each step is provided Fig. 1. Typical Entity Linking workflow with expected output of with the gold standard input (i.e. the correct input data each step. for that specific step) in order to eliminate propagation of errors from previous steps. The output of each step Detection, Candidate Selection, and NIL Clustering. is then evaluated separately. Despite the robustness of Note that, although it is usually a sequential workflow, this approach, this type of evaluation does not account for systems that do not follow the typical Entity Link- there are approaches that create a feedback loop be- ing workflow, e.g. systems that merge two steps into a tween different steps, or merge two or more steps into single one or that create feedback loops; and it is also a single one. a highly time and labour consuming task to set up. The Mention Recognition step receives textual doc- Finally, the partial end-to-end evaluation aims at uments as input and recognises all terms in text that evaluating the output of each Entity Linking step but refer to entities. The goal of this step is to perform typ- by analysing the final result of the whole system. Par- ical Named Entity Recognition. Next, the Entity Typ- tial end-to-end evaluation uses different metrics that ing step detects the type of each mention previously are influenced only by specific parts of the Entity Link- recognised. This task is usually framed as a categori- ing workflow. For instance, one metric evaluates only sation problem. Candidate Detection next receives the the link between mentions and entities in the knowl- detected mentions and produces a list with all entries edge base, another metric evaluates only links with in the knowledge base that are candidates to be linked NIL, yet another one evaluates only the correct men- with each mention. In the Candidate Selection step, tions recognized, whereas another metric measures the these candidate lists are processed and, by making use performance of the NIL Clustering. of available contextual information, the correct link for each mention, either an entry from the knowledge base 2.3. Entity Linking Benchmark Initiatives or a NIL reference, is provided. Last, the NIL Clus- tering step receives a series of mentions linked to NIL The number of variations in Entity Linking makes as input and generates clusters of mentions referring to it hard to benchmark Entity Linking systems. Dif- the same entity, i.e. each cluster contains all NIL men- ferent research communities focus on different types tions representing one, and only one, entity, and there of text and knowledge base, and different algorithms are no two clusters representing the same entity. will perform better or worse on any specific step. In Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 7 this section, we present the Entity Linking benchmark 2.3.2. ERD initiatives to date, the Entity Linking specifications The Entity Recognition and Disambiguation (ERD) used, and the communities involved. The challenges challenge [17] was a benchmark initiative organised are summarised in Table 2.3. in 2015 as part of the SIGIR conference9 with the fo- cus of enabling entity-based . For 2.3.1. TAC-KBP such a task, a system needs to be able to index docu- Entity Linking was first introduced in 2009 as a 8 ments based on entities, rather than words, and to iden- challenge for the Text Analysis Conference. This con- tify which entities satisfy a given query. Therefore, the ference was aimed at a community focused on the ERD challenge proposed two Entity Linking tracks, analysis of textual documents and the challenge itself a long text track based on web sites (e.g. the docu- was part of the Knowledge Base Population track (also ments to be indexed), and a short text track based on called TAC-KBP) [55]. The goal of this track was to search queries. For both tracks, entities identified in explore algorithms for automatic knowledge base pop- text should be linked with a subset of Freebase,10 a ulation from textual sources. In this track, Entity Link- large collaborative encyclopedic knowledge base con- ing was perceived as a fundamental step, in which en- taining structured data. tities are extracted from text and evaluated if they al- The Information Retrieval community, and conse- ready exist in a knowledge base to be populated (i.e. quently the ERD challenge, focuses on the process- link to a knowledge base entry) or if they should be ing of large amounts of information. Therefore, the included in the knowledge base (i.e. link to NIL). The systems evaluated should provide not only the correct results of Entity Linking could be used either for direct results but also fulfill basic standards for large scale population of knowledge bases or used in conjunction web systems, i.e. they should be available through Web with other TAC-KBP tasks such as Slot Filling. APIs for public use, they should accept a minimum As of 2009, the TAC-KBP benchmark was not con- number of requests without timeout, and they should cerned about recognition of entities in text, in partic- ensure a minimum uptime availability. All these stan- ular considering that their entities of interest were in- dards were translated into the evaluation method of the stances of types Organisation, Geo-political, and Per- ERD challenge that required systems to have a given son, and the recognition of these types of entities in Web API available for querying during the time of the text was already a well-established task in the commu- evaluation. Also, large scale web systems are evalu- nity. The challenge was then mainly concerned with ated regarding how useful their output is for the task at correct Entity Typing and Candidate Selection. In later hand regardless of the internal algorithms used, so the years, Mention Detection and NIL Clustering were evaluation used by ERD was an end-to-end evaluation also included in the TAC-KBP pipeline [48]. Also, using standard information retrieval evaluation metrics more entity types are now considered such as Location (i.e. precision, recall, and f-measure). and Facility, as well as, multiple languages [49]. Characteristics that have been constant in TAC-KBP 2.3.3. W-NUT The community of natural language processing and are the use of long textual documents, entities given by 11 Type, and the use of encyclopedic knowledge bases. A computational linguistics within the ACL-IJCNLP reason for long textual documents would be that this conferences have always been interested in the study type of text is more likely to contain contextual in- of long textual documents. One of the main character- istics of these documents is that they are usually writ- formation to populate a knowledge base, in particular ten using standard English writing. However, with the news articles and web sites. The use of entities given advent of Twitter and other forms of microblogging, by Type is a direct consequence of the availability of short documents started to receive increased attention named entity recognition algorithms based on types from the academic community of computational lin- and the need for NIL detection. The use of an encyclo- guists in special because of their non-standard writing. pedic knowledge base was because Person, Organisa- In 2015, the Workshop on Noisy User-generated tion, and Geo-political entities are not domain-specific Text (W-NUT) [4] promoted the study of documents and due to the availability of Wikipedia as a free avail- able knowledge base on the Web. 9http://sigir.org/sigir2014 10https://developers.google.com/freebase 8http://www.nist.gov/tac 11https://www.aclweb.org 8 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series

TAC-KBP ERD SemEval W-NUT NEEL Characteristic 2014 2015 2016 2014 2015 2015 2013 2014 2015 2016 newswire web sites technical manual Text web sites search queries reports tweets tweets discussion forum posts formal discussions Knowledge Base Wikipedia Freebase Freebase Babelnet none none DBpedia Entity given by Type given by KB given by KB given by Type given by Type file API file file file API file Evaluation partial partial end-to-end end-to-end end-to-end end-to-end end-to-end end-to-end Target Conference TAC SIGIR NAACL-HLT ACL-IJCNLP WWW

Table 1 Named Entity Recognition and Linking challenges since 2013 that are not written in standard English, with tweets Therefore, in order to make the integration of the two as the focus of its two shared tasks. One of these tasks easier, it followed that an entity is anything that tasks was targeted at the normalisation of text. In other is available in the knowledge base of entities. Also, words, expressions such as “r u coming 2” should be given the complexity involved in joining the two tasks, normalised into standard English on the form of “are the SemEval shared task focused on technical manuals, you coming to”. The second task proposed named en- reports, and formal discussions that tend to follow a tity recognition within tweets in which systems were more rigid written structure than tweets or other forms required to detect mentions to entities corresponding of informal natural language text. The use of such to a list of ten entity types. This proposed task cor- well-written texts makes the task easier at the mention responds to the first two steps of the Entity Linking recognition level (i.e. Mention Detection), and leaves workflow: Mention Detection and Entity Typing. the challenge at the disambiguation level (i.e. Candi- date Selection). 2.3.4. SemEval Word Sense Disambiguation and Entity Linking are two tasks that perform disambiguation of textual doc- 3. The NEEL Challenge Series uments through links with a knowledge base. Their main difference is that the former disambiguates the Named Entity Recognition and Entity Linking have meaning of words with respect to a dictionary of word been active research topics since their introduction senses, whereas the latter disambiguates words with by MUC-7 in 1997 and TAC-KBP in 2009, respec- respect to a list of entity referents. These two tasks tively. The main focus of these initiatives had been on have been historically treated as different tasks since long textual documents, such as news articles, or web they require knowledge bases of a dissimilar nature. sites. Meanwhile, microposts emerged as a new type However, with the development of Babelnet, a knowl- of communication on the Social Web and have been edge base containing both entities and word senses, a widespread format to express opinions, sentiments, Word Sense Disambiguation and Entity Linking could and facts about entities. The popularisation of micro- be finally performed using a single knowledge base. posts through the use of Twitter,13 an established plat- In 2015, a shared task for Multilingual All-Words form for publication of microposts, reinforced a gap in Sense Disambiguation and Entity Linking [58] using the research of the Named Entity Recognition and En- Babelnet was proposed as part of the International tity Linking communities. The NEEL series was pro- 12 Workshop on Semantic Evaluation (SemEval). In posed as a benchmark initiative to fill this gap. this task, the goal was to create a system that could The evolution of the NEEL challenge followed the perform both Word Sense Disambiguation and En- evolution of Entity Linking. The challenge was first tity Linking. In word sense disambiguation, senses are held in 2013 under the name of Concept Extraction anything that is available in the dictionary of senses. (CE) and was concerned with the detection of men-

12http://alt.qcri.org/semeval2015 13http://www.twitter.com Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 9 tions to entities in microposts and the specification of Definition of entities. Due to the dynamic nature their types. In the next year, already under the acronym of microposts, the recognition of NILs was recognised of NEEL, the challenge also included linking mentions as an important feature since the introduction of En- to an encyclopedic Knowledge Base or to NIL. In 2015 tity Linking in the NEEL challenge in 2014. Due to and 2016, NEEL was expanded to also include cluster- that, but also to accommodate the participants from the ing of NIL mentions. Concept Extraction challenge, the definition of entities To propose a fair benchmark for approaches to En- is given by type. tity Linking with microposts, the organisation of the In 2013, the list of entity types was based on the NEEL challenge had to make certain decisions con- taxonomy used in CoNLL 2003 [73]. From 2014 on- cerning different Entity Linking components and the wards, the NEEL Taxonomy (Appendix A) was cre- available strategies for evaluation, always taking into ated with the goal of providing a more fine-grained consideration the trends and needs of the research classification of entities. This would represent a vast community focused on the Web and microposts. In this amount of entities of interest in the context of the Web. section, we provide the motivation for these decisions. The types of entities used and how the NEEL Taxon- A discussion on their impact will be provided in later omy was built is described in Section 4. sections. Evaluation. The evaluation is the main component Text. The first decision that had to be taken regards of a benchmark initiative because, after all, the goal of the text used for the challenge. Twitter was chosen benchmarking is to compare different systems applied as the source of textual documents since it is a well- to the same data by using the same evaluation met- known platform for microposts on the Web, and it pro- rics. There are two main decisions regarding the evalu- vides a public API that makes it easy to extract microp- ation process. The first decision is about the format in which the results of each system are gathered (i.e. via osts both for generation of the benchmark corpora and file transfer or call to a Web API). The second decision for future use of the evaluated Entity Linking systems. regards how the results will be evaluated and which More information on how Twitter was used to build the evaluation metrics will be applied. NEEL corpora is presented in Section 4. The NEEL challenge has used different evaluation Knowledge Base. Despite the type of text used, it settings in different versions of the challenge. Each is important for an Entity Linking challenge to pro- change has its own motivation, but the main focus for vide a balance among mentions linked to the Knowl- each of them was to provide a fair and comprehensive edge Base and mentions linked to NIL. A better bal- evaluation of the submitted systems. ance enables a fairer evaluation, otherwise the chal- The first decision regards the submission of a file lenge would be biased towards algorithms that perform or the evaluation through Web APIs. Both approaches one task better than the other. In the case of tweets, the have their advantages and disadvantages. The use of a frequency of knowledge base updates is an important file lowers the bar to new participants in the challenge factor for the balance among knowledge base links and because they do not need to develop a Web API in ad- NIL. Microposts are a dynamic form of communica- dition to the usual Entity Linking steps, nor to have a tion usually dealing with recent events. If the collec- Web server available during the whole evaluation pro- tion of tweets is more recent than the entities in the cess. This was the proposed model for 2013, 2014, and knowledge base, the amount of NIL links is likely to 2016. However, during NEEL 2014, some participants be much higher than the links to entries in the Knowl- suggested that the challenge should apply a blind eval- edge Base. Therefore, the rate in which the knowledge uation, i.e. the participants should know the input data base is updated is an important factor for the NEEL just at the time of the query in order to avoid com- challenge. mon mistakes of tuning the system based on evalua- Taking this into account, we chose to use DB- tion data. Therefore, in 2015 the submission of evalua- pedia [50], a structured knowledge base based on tion results was changed to Web API calls. The impact Wikipedia, mainly because it is frequently updated of this change was that a few teams could not partici- with entities appearing in events covered in social me- pate in the challenge, mainly because their Web server dia. Another motivation for using DBpedia is that its was not available during evaluation or their API did format lends itself better to the task than Wikipedia it- not have the correct signature. This format of evalua- self. Each NEEL version used the latest available ver- tion also required extra effort on the part of the organ- sion of DBpedia. isers that had to advise participant teams that their web 10 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series servers were not available. Given the amount of prob- junction with consecutive World Wide Web confer- lems generated and no real benefit experienced, the or- ences. ganisation opted for going back to the transfer of files In the next sections we will explain in detail how the with the results of the systems as in previous years. NEEL challenges were organised, how the benchmark The second decision concerns the evaluation strat- corpora were generated semi-manually, details of par- egy, which impacts the metrics used and on the overall ticipant systems in each year, and the impact of each benchmark ranking. In this step, we either have the op- change in the participation in subsequent years. tion for an end-to-end, a partial end-to-end, or a step- by-step evaluation. Borrowing from the named entity recognition community, the first two versions of the 4. Corpus Creation challenge (i.e. 2013 and 2014) were based on an end- to-end evaluation. In this evaluation, standard evalua- The organisation of the NEEL challenges led to the tion metrics (i.e. precision, recall, and f-measure) were yearly release of datasets of high value for the research applied on top of the aggregated results of the system. community. Over the years, the datasets increased in size and coverage. A drawback of end-to-end evaluation is that in Entity Linking, if one step in the typical workflow does not 4.1. Collection procedure and statistics perform well, its error will propagate until the last step. Therefore, an end-to-end evaluation will only evaluate The initial 2013 challenge dataset contains 4,265 based on the aggregated errors from all steps. This was tweets collected from the end of 2010 to the begin- not a problem when the systems were required to per- ning of 2011 using the Twitter firehose with no explicit form one or two simple steps, but when the challenge hashtag search. These tweets cover a variety of topics, starts requiring a larger number of steps then a more including comments on news and politics. The dataset fine-grained evaluation is required. was split into 66% training and 33% test. A partial end-to-end strategy evaluates the output of The second 2014 challenge dataset contains 3,505 each Entity Linking step by analysing only the final re- event-annotated tweets, where each entity was linked sult of the system. This evaluation uses different met- to its corresponding DBpedia URI. This dataset was rics for each part of the workflow and had been suc- collected as part of the Redites project14 from 15th cessfully performed by multiple TAC-KBP versions. July 2011 to 15th August 2011 (31 days) comprising a Therefore, due to its benefits for the research commu- set of over 18 million tweets obtained from the Twit- nity, the partial end-to-end evaluation has also been ap- ter firehose. The 2014 dataset includes both event and plied in the NEEL challenge in 2015 and 2016. Fur- non-event related tweets. The collection of event re- thermore, the NEEL challenge applied this strategy us- lated tweets did not rely on the use of hashtags but ing the same evaluation tool as TAC-KBP [45], which on applying the First Story Detection (FSD) algorithm aimed to enable an easier interchange of participants of [62] and [63]. This algorithm relies on locality- between both communities. sensitive hashing, which processes each tweet as it ar- The step-by-step evaluation has never being applied rives in time. The hashing dynamically builds up tweet within the NEEL series. Despite its robustness by elim- clusters representing events. Notice that hashing in this inating error propagation, it is very time consuming, in context refers to a compression methodology not to a particular if participant systems do not implement the Twitter hashtag. Within this collection, the FSD algo- typical workflow. The evaluation process for each year rithm identified a series of events (stories) including as well as the specific metrics used will be discussed the death of Amy Winehouse, the London Riots and in Section 7. the Oslo bombing. Since the challenge task was to au- Target Conference. The NEEL challenge keeps in tomatically recognise and link named entities (to DB- mind that microposts are of interest of a broader com- pedia referents), we built the challenge dataset con- munity, composed of researchers in Natural Language sidering both event and non-event tweets. While event Processing, Information Retrieval, Computational Lin- tweets are more likely to contain entities, non-event guistics, and also from a community interested on the tweets enabled us to evaluate the performance of the World Wide Web. Given this, the NEEL Challenges system in avoiding false positives in the entity extrac- were proposed as part of the International Workshop on Making Sense of Microposts that was held in con- 14http://demeter.inf.ed.ac.uk/redites Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 11

dataset tweets words tokens tokens/tweet entities NILs total entities entities/tweet NILs/tweet 2013 training 2,815 10,439 51,969 18.46 2,107 - 3,195 1.88 - 2013 test 1,450 6,669 29,154 20.10 1,140 - 1,557 1.79 - 2014 training 2,340 12,758 41,037 17.54 1,862 - 3,819 3.26 - 2014 test 1,165 6,858 20,224 17.36 834 - 1,458 2.50 - 2015 training 3,498 13,752 67,393 19.27 2,058 451 4,016 1.99 0.22 2015 dev 500 3,281 7,845 15.69 564 362 790 2.04 0.94 2015 test 2,027 10,274 35,558 - 17.54 2,122 1,478 3,860 2.32 0.89 2016 training 6,025 26,247 100,071 16.61 3,833 2,291 8,665 1.43 0.38 2016 dev 100 841 1,406 14.06 174 85 338 3.38 0.85 2016 test 3,164 13,728 45,164 14.27 430∗ 284∗ 1,022∗ 3.412+ 0.95+ Table 2 General statistics of the training, dev, and test data sets. tweets refers to the number of tweets in the set; words to the unique number of words, thus without repetition; tokens refers to the total number of words; tokens/tweet represents the average number of tokens per tweet, entities refers to the unique number of named entities including NILs; NILs refers to the number of entities not yet available in the knowledge base; total entities corresponds to the number of entities with repetition in the set; entities/tweet refers to the average of entities per tweet; NILs/tweet corresponds to the average of NILs per tweet. ∗ only 300 tweets have been randomly selected to be manually annotated and included in the gold standard. + these figures refer to the 300 tweets of the gold standard. tion phase. This dataset was split into a training (70%) velopment set, and 82.05% in the test set for the 2015 and testing (30%) sets. Given the task of identifying dataset; and 67.60% in the training set, 100% in the mentions and linking to the referent knowledge base development set, and 9.35% in the test set for the 2016 entities in 2014, the class information was removed dataset. The overlap of entities between the training from the final release. and test data is 8.09% for the 2013 dataset, 13.27% for The 2015 challenge dataset extends the 2014 dataset. the 2014 dataset, 4.6% for the 2015 dataset, and 6.59% This dataset consists of tweets published over a longer for the 2016 dataset. Following the Terms of Use of period, between 2011 and 2013. In addition to this, we Twitter, for all the four challenge datasets, participants also collected tweets from the Twitter firehose from were only provided the tweet IDs and the annotations, November 2014 covering both event (such as the UCI the tweet text had to be downloaded from Twitter. Cyclo-cross World Cup) and non-event tweets. The dataset was split into training (58%), consisting of the 4.2. Annotation taxonomy and class distribution entire 2014 dataset, development (8%), which enabled The taxonomy for annotating the entities changed participants to tune their systems, and test (34%) from from a four-class taxonomy, based on the taxonomy the newly added 2015 tweets. used in CoNLL 2003 [73], in 2013 to an extended ver- The 2016 challenge dataset builds on the 2014 and sion seven-type taxonomy, namely the NEEL Taxon- 2015 datasets, and consists of tweets extracted from omy (Appendix A), which is derived from the NERD the Twitter firehose from 2011 to 2013 and from Ontology [68]. This new taxonomy was introduced to 2014 to 2015 via a selection of popular hashtags. This provide a more fine-grained classification of the enti- dataset was split into training (65%) consisting of the ties, covering also names of characters, products and entire 2015 dataset, development (1%), and test (34%) events. Furthermore, it is deemed to better answer the sets from the newly collected tweets for the 2016 chal- need to cope with the semantic diversity of named lenge. entities in textual documents as shown in [67]. Ta- Statistics describing the training, development and ble 3 shows the mapping between the two classifica- test sets are provided in Table 2. In all, but not in tion schemes. Summary statistics of the entity types the 2015 challenge, the training datasets presented a are provided in Tables 4, 5, and 6 for the 2013, 2015, higher rate of named entities linked to DBpedia than and 2016 corpora respectively.15 The most frequent en- the development and test datasets. The percentage of tity type across all datasets is Person. This is followed tweets that mention at least one entity is 74.42% in the training, 72.96% in the test set for the 2013 dataset; 15The statistics cover the observable data in the corpora. Thus, the 32% in the training, and 40% in the test set for the 2014 distributions of implicit classes in the 2014 corpus are not reported. dataset; 57.83% in the training set, 77.4% in the de- The choice of removing the class information from the release was 12 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series

CE 2013 NEEL Taxonomy Type Training Dev Test MISC Thing Character 63 (0.73%) 19 (5.62%) 57 (5.58%) PER Person Event 482 (5.56%) 7 (2.07%) 24 (2.35%) LOC Location Location 1,868 (21.56%) 17 (5.03%) 43 (4.21%) ORG Organization Organization 1,641 (18.94%) 33 (9.76%) 158 (15.46%) - Character Person 2,846 (32.84%) 120 (35.50%) 337 (32.97%) - Product Product 1,199 (13.84%) 128 (37.87%) 355 (34.74%) - Event Thing 570 (6.58%) 14 (4.14%) 49 (4.79%) Table 3 Table 6 Mapping between the taxonomy used in the first challenge of the NEEL challenge series (left column), and the taxonomy used since Entity type statistics for the three data sets from 2016. The statistics the 2014 on (right column). of the Test set refer to the manually annotated set of tweets selected to generate the gold standard. by Organisation and Location in the 2013 and 2015 computer scientists, social scientists, social web ex- datasets. In the 2016 dataset the second and third most perts, semantic web experts and natural language pro- frequent types are Product and Organisation. The dis- cessing experts; in the 2015 challenge, 3 annotators tributional differences between the entity types in the generated the annotations; in the 2016 challenge, 2 ex- three sets are quite apparent, making the NEEL task perts took on the manual annotation campaign. challenging, particularly when addressed with super- The annotation process for the 2013 dataset started vised learning approaches. with the unannotated corpus and consists of the fol- lowing steps: Type Training Test Person 1,722 (53.89%) 1,128 (72.44%) Phase 1. Manual annotation: the corpus was split into Location 621 (19.44%) 100 (6.42%) four quarters, each was annotated by a different Organisation 618 (19.34%) 236(15.16%) human annotator. Misceleneous 233 (7.29%) 95(6.10%) Phase 2. Consistency: for consistency checking, each Table 4 annotator further checked the annotations that the Entity type statistics for the two data sets from 2013. other three performed to verify correctness. Phase 3. Consensus: for the annotations without con- sensus, discussions among the four annotators Type Training Dev Test was used to come to a final conclusion. This pro- Character 43 (1.07%) 5 (0.63%) 15 (0.39%) cess resulted in resolving annotation inconsisten- Event 182 (4.53%) 81 (10.25%) 219 (5.67%) cies. Location 786 (19.57%) 132 (16.71%) 957 (24.79%) Phase 4. Adjudication: a very small number of errors Organization 968 (24.10%) 125(15.82%) 541 (14.02%) was also reported by the participants, which was Person 1102 (27.44%) 342 (43.29%) 1402 (36.32%) taken into account in the final version of the Product 541 (13.47%) 80 (10.13%) 575 (14.9%) dataset. Thing 394 (9.81%) 25 (3.16%) 151 (3.92%) With the inclusion of entity links, the annotation Table 5 process for the 2014, 2015 datasets was amended to Entity type statistics for the three data sets from 2015. consist of the following phases: Phase 1. Unsupervised automated annotation: the cor- pus was initially annotated using the NERD 4.3. Annotation procedure framework [69], which extracted potential entity mentions, candidate links to DBpedia, and entity In the 2013 challenge, 4 annotators created the gold types. The NERD framework was used as an off- standard; in the 2014 challenge a total of 14 annotators the-shelf annotation tool, i.e. it was used without were used who had different backgrounds, including training it properly. Phase 2. Manual annotation: the labeled data set was made on purpose because of the final objective of the task of having divided into batches, with different annotators - end-to-end solutions. three annotators in the 2014 challenge, and two Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 13

annotators in the 2015 challenge - to each batch. lenge Annotation Guidelines provided in 2015. Due to In this phase manual annotations were performed the intensity of the annotation task, 10% of the test using an annotation tool (e.g. CrowdFlower for set was annotated manually.18 A random selection was the 2014 challenge dataset,16 and GATE [25] performed while preserving the original distributions for the 2015 challenge dataset17). The annota- of types in the corpus by the law of large numbers [79]. tors were asked to analyse the annotations gen- The annotation process for the 2016 test set consisted erated in Phase 1 by adding or removing entity of the following steps: annotations as required. The annotators were also Phase 1. Manual annotation: the data set was divided asked to mark any ambiguous cases encountered. Along with the batches, the annotators also re- into 2 batches, one for each annotator. In this ceived the NEEL Challenge Annotation Guide- phase, annotations were performed using GATE. lines (Appendix B). The annotators were asked to analyse the annota- Phase 3. Consistency: the annotators - three experts in tions generated in Phase 1 by adding or remov- the 2014 challenge, and a forth annotator in the ing entity annotations as required. The annotators 2015 challenge - double-checked the annotations were also asked to mark any ambiguous cases en- and generated the gold standard (for the training, countered. Along with the batches, the annotators development and test sets). Three main tasks were received the NEEL Challenge Annotation Guide- carried out here: i) consistency checking of entity lines. types; ii) consistency checking of URIs; iii) res- Phase 2. Consistency: the annotators checked each olution of ambiguous cases raised by the annota- other annotations and generated the gold stan- tors. The annotators looped through Phase 2 and dard (for the training, development and test sets). 3 of the process until the problematic cases were Three main tasks were carried out here: i) con- resolved. sistency checking of entity types; ii) consistency Phase 4. NIL Clustering: particular to the 2015 chal- checking of URIs; iii) resolution of ambiguous lenge, a seed cluster generation algorithm through cases raised by the annotators. The annotators it- merging of string- and type- identical named en- erated between Phases 1 and 2 until the problem- tity mentions was used to generate an initial NIL atic cases were resolved. Clustering. Phase 3. NIL Clustering: an unsupervised NIL Clus- Phase 5. Consensus: also in the 2015 challenge, based tering generation was performed, using a seed on the results of the seed algorithm, the third an- cluster generation algorithm based on exact string notator manually verified all NIL clusters in order matching of mention strings and their types. to remove links asserted to the wrong cluster, and Phase 4. Consensus: one of the two expert annotators merge clusters referring to the same entity. Spe- went through all NIL clusters in order to, where cial attention was paid to name variations such as appropriate, include or exclude them from a given acronyms, misspellings, and similar names. cluster. Phase 6. Adjudication: where the challenge partici- Phase 5. Adjudication: where the challenge partici- pants reported incorrect or missing annotations. pants reported incorrect or missing annotations. Each reported mention was evaluated by one of Each reported mention was evaluated by one of the challenge chairs to check compliance with the the challenge chairs to check compliance with the NEEL Challenge Annotation Guidelines, and ad- Challenge Annotation Guidelines, and additions ditions and corrections made as required. and corrections were made as required. In the 2016 challenge, the training set was built on The inter-annotator agreement (IAA) for the chal- top of the 2014 and 2015 datasets in order to provide lenge datasets (2014, 2015 and 2016) is presented in continuity with previous years and to build upon exist- Table 7.19 We computed these values using the annota- ing findings. The 2016 challenge used the NEEL Chal- tion diff tool in GATE. As the annotators are not only

16For annotating the 2014 challenge dataset, we used Crowd- 18The participants were asked to annotate the entire corpus of flower with selected expert annotators rather than the crowd. tweets. 17For the 2015 challenge we chose GATE instead of Crowdflower, 19The inter-annotator agreement for the 2013 dataset could not be because GATE allows for the annotation of entities according to an computed, as the challenge settings and intermediate data were lost ontology, and to compute inter-annotator agreement on the dataset. due to a lack of organisation of the challenge. 14 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series classifying predefined mentions but can also define dif- 5.1. Entity Overlap ferent mentions, traditional IAA measures such as Co- hen’s Kappa are less suited to this task. Therefore, we Table 8 presents the entity overlap between the dif- measured the IAA in terms of precision (P), recall (R) ferent datasets. Each row in the table represents the and F-measure (F1) using exact matches, where the an- percentage of unique entities present in that dataset notations must be correct for both the entity mentions that are also represented in the other datasets. and the entity types over the full spans (entity men- tion and label) [26]. Annotations which were partially 5.2. Confusability correct were considered as incorrect.

Dataset P R F1 We define the true confusability of a surface form s NEEL 2014 49.49% 73.10% 59.02% as the number of meanings that this surface form can 20 NEEL 2015 97.00% 98.5% 98.00% have. Because new organisations, people and places NEEL 2016 90.31% 92.27% 91.28% are named every day, there is no exhaustive collection Table 7 of all named entities in the world. Therefore, the true Inter-Annotator Agreement on the challenge datasets confusability of a surface form is unknown, but we can estimate the confusability of a surface form through The lessons learnt from building high quality gold the function A(s): S ⇒ N that maps a surface form to standards are that the annotation process must be an estimate of the size of its candidate mapping, such guided with annotation guidelines, at least two an- that A(s) = |C(s)|. notators must be involved in the annotation process The confusability of a location name offers only a to ensure consistency, and the feedback from the rough a priori estimate of the difficulty in linking that participants is valuable in improving the quality of surface form. Observing the annotated occurrences of the datasets, providing complementary annotations to this surface form in a text collection allows us to make the cases found by experts. The annotation guide- more informed estimates. We show the average num- lines, written by experts, must describe the annotation ber of meanings denoted by a surface form, indicat- task (for instance, entity types and NEEL taxonomy) ing the confusability, as well as complementary statis- through examples, and must be regularly updated dur- tical measures on the datasets in Table 9. In this table, ing the manual annotation stage, describing special we observe that most datasets have a low number of cases, issues encountered. In order to speed up the an- average meanings per surface form, but there is a fair notation process, it is a good practice to employ an an- amount of variation, i.e. number of surface forms that notation tool. We used GATE because the annotation can refer to a meaning. process was guided by a taxonomy-centric view. The annotation task took less time if the annotators shared 5.3. Dominance the same background (e.g. all annotators were seman- tic web and natural language processing experts with We define the true dominance of an entity resource experience in ). 21 ri for a given surface form si to be a measure of how commonly ri is meant with regard to other pos- 5. Corpus Analysis sible meanings when si is used in a sentence. Let the dominance estimate D(ri, si) be the relative frequency While the main goals of the 2013-2016 challenges with which the resource ri appears in Wikipedia links were the same, and the 2014-2016 corpora are largely where si appears as the anchor text. Formally: built on top of each other, there are some differences among the datasets. In this section, we will analyse the |W ikiLinks(si, ri)| D(ri, si) = different datasets according to the characteristics of the ∀r∈R|W ikiLinks(si, r)| entities and events annotated in them. We hereby reuse measures and scripts from [31] and add a readability 20 measure analysis of the corpora. Note that for the En- As surface form we refer to the lexical value of the mention. 21An entity resource is an entry in a knowledge base that describes tity Linking analyses, we can only compare the 2014- that entity, for example http://dbpedia.org/resource/ 2016 NEEL corpora since the 2013 corpus (CE2013) Hillary_Clinton is the DBpedia entry that describes the Amer- does not contain entity links. ican politician Hillary Rodham Clinton. Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 15

NEEL 2014 NEEL 2015 NEEL 2016 NEEL 2014 (2,380) - 1,630 (68.49%) 1,633 (68.61%) NEEL 2015 (2,800) 1,630 (58.21%) - 2,800 (100%) NEEL 2016 (2,992) 1,633 (54.58%) 2,800 (93.58%) - Table 8 Entity overlap in the analysed datasets. Behind the dataset name in each row the number of unique entities present in that dataset is given. For each dataset pair the overlap is given as the number of entities and percentage (in parentheses).

Corpus Average Min. Max. σ We experimented with various readability measures NEEL 2014 1.02 1 3 0.16 to assess the reading difficulty of the various tweet cor- NEEL 2015 1.05 1 4 0.25 pora. These measures would indicate that tweets are NEEL 2016 1.04 1 3 0.22 generally not very difficult in terms of word and sen- Table 9 tence length, but the abbreviations and slang present in Confusability stats for analysed datasets. Average stands for average tweets proves them to be more difficult to interpret for number of meanings per surface form, Min. and Max. stand for the readers outside the target community. To the best of minimum and maximum number of meanings per surface form found in the corpus respectively, and σ denotes the standard deviation. our knowledge, there is no readability metric that takes this into account. Therefore we chose not to include those experimental results in this article. The dominance statistics for the analysed datasets are presented in Table 10. The dominance scores for all corpora are quite high and the standard deviation is 6. Emerging Trends and Systems Overview low, meaning that in the vast majority of cases, a sin- gle resource is associated with a certain surface form In the remainder of this analysis, we focus on two in the annotations, creating a low of variance for an main tasks, namely Mention Detection and Candidate automatic disambiguation system. Selection. Thirty different approaches were applied in four editions of the challenge since 2013. Table 11 lists Corpus Dominance Max Min σ all ranked teams. NEEL 2014 0.99 47 1 0.06 NEEL 2015 0.98 88 1 0.09 6.1. Emerging Trends NEEL 2016 0.98 88 1 0.08 Table 10 Whilst there are substantial differences between the Dominance stats for analysed datasets. proposed approaches, a number of trends can be ob- served in the top-performing named entity recognition and linking approaches to tweets. Firstly, we observe 5.4. Summary the large adoption of data-driven approaches: while in the first and second year of the challenge there was In this section, we have analysed the corpora in an extensive use of off-the-shelf approaches, the top terms of their variance in named entities and readabil- ranking systems from 2013 - 2016 show a high de- ity. pendence on the training data. This is not surprising, As the datasets are built on top of each other, they since these approaches are supervised, but this clearly show a fair amount of overlap in entities between each suggests that, to reach top performance, labeled data other. This need not to be a problem, if there is enough is necessary. Additionally, the extensive use of knowl- variation among the entities, but the confusability and edge bases as dictionaries of typed entities and en- dominance statistics show that there are very few en- tity relation holder have dramatically affected the per- tities in our datasets with many different referents formance over the years. This strategy overcomes the (“John Smiths”) and if such an entity is present, often lexical limitations of a tweet and performs well on only one of its referents is meant. To remedy this, fu- the identification of entities available in the knowledge ture entity linking corpora should take care to balance base used as referent. A common phase in all sub- the entity distribution and include more variety. mitted approaches is normalisation, meaning smooth- 16 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series

APPROACH AUTHORS NO.OF RUNS entities with mentions in text, they cannot support the identification of novel and emerging entities. Ad-hoc 2013 Entries solutions for tweets for the generation of NILs have 1 Habib, M. et al. [43] 1 been proposed, ranging from edit distance-based solu- 2 Dlugolinsky, S . et al. [29] 3 tions to the use of Brown clustering. 3 van Erp, M. et al. [30] 3 Between the first NEEL challenge on Concept Ex- 4 Cortis, K. [22] 1 traction (CE) and the 2016 edition we observe the fol- 5 Godin, F. et al. [39] 1 lowing: 6 van Den Bosch, M. et al. [10] 3 – tweet normalisation as first step of any approach. 7 Munoz-Garcia, O. et al. [60] 1 8 Genc, Y. et al. [36] 1 This is generally defined as preprocessing and it 9 Hossein, A. [47] 1 increases the expressiveness of the tweets, e.g. via 10 Mendes, P. et al. [57] 3 the expansion of Twitter accounts and hashtags 11 Das, A. et al. [28] 3 with the actual names of entities they represent, 12 Sachidanandan, S. et al. [72] 1 or with conversion of non-ASCII characters, and, 13 de Oliveira, D. et al. [61] 1 generally, noise filtering; – the contribution of knowledge bases in the men- 2014 Entries tion detection and typing task. This leads to 14 Chang, M. et al. [19] 1 higher coverage, which, along with the linguistic 15 Habib, M. et al. [44] 2 analysis and type prediction, better fits this par- 16 Scaiella, U. et al. [74] 2 ticular domain; 17 Amir, M. et al. [81] 3 – the use of high performing end-to-end approaches 18 Bansal, R. et al. [5] 1 for the candidate selection. Such a methodology 19 Dahlmeier, D. et al. [27] 1 was further developed with the addition of fuzzy

2015 Entries distance functions operating over n-grams and acronyms; 20 Yamada, I. et al. [80] 10 – the inclusion of a pruning stage to filter out can- 21 Gârbacea, C. et al. [35] 10 didate entities. This was presented in various ap- 22 Basile, P. et al. [7] 2 proaches ranging from Learning-to-Rank to re- 23 Guo, Z. et al. [42] 1 casting the problem as a classification task. We 24 Barathi Ganesh, H. B. et al. [3] 1 observed that the approach based on a classi- 25 Sinha, P. et al. [75] 3 fier reached better performance (in particular, the 2016 Entries classifier that performed best for this task was im- plemented using a SVM based on a radial basis 26 Waitelonis, J. et. al. [78] 1 function kernel), however it required an exten- 27 Torres-Tramon, P. et al. [76] 1 sive feature engineering of the feature set used as 28 Greenfield, K. et al. [40] 2 training; 29 Ghosh, S. et al. [37] 3 – utilising hierarchical clustering of mentions to ag- 30 Caliano, D. et al. [14] 2 gregate exact mentions of the same entity in the Table 11 text and thus complementing the knowledge base Per year submissions and number of runs for each team. entity directory in case of absence of an entity; – a considerable decrease in off-the-shelf systems. These were popular in the first editions of NEEL, ing the lexical variations of the tweets and translating but in later editions their performance grew in- them to language structures that can be better parsed creasingly limited as the task became more con- by state-of-the-art approaches that expect more formal strained. and well-formed text. Whilst the linguistic workflow favours the use of sequential solutions, Entity Recog- Table 12 provides an overview of the methods and nition and Linking for tweets is proposed as joint step features used in these four years, grouped according to using large knowledge bases as referent entity direc- the step involved in the workflow. In addition to the list tories. While knowledge bases support the linking of of the steps listed in Figure 1. Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 17

Step Method Features Knowledge Base Off-the-Shelf Systems

stop words, spelling dictionary, acronyms, hashtags, Twitter Preprocessing Cleaning, Expansion, Extraction accounts, tweet -- timestamps, punctuation, capitalisation, token positions POS tags, tokens and Approximate String Matching, adjacent tokens, Exact String Matching, Fuzzy contextual features, tweet String Matching, Acronym Search, timestamps, string Perfect String Matching, Mention similarity, n-grams, proper Wikipedia, Levenshtein Matching, Context Semanticizer22 Detection nouns, mention similarity DBpedia Similarity Matching, Conditional score, Wikipedia titles, Random Fields, Random Forest, Wikipedia redirects, Jaccard String Matching, Prior Wikipedia anchors, word Probability Matching embeddings tokens, linguistic features, DBpedia Type, Logistic word embeddings, entity AlchemyAPI,23 DBpedia, Entity Typing Regression, Random Forest, mentions, NIL mentions, OpenCalais,24 Freebase Conditional Random Fields DBpedia and Freebase Zemanta25 types

DBpedia Distributional Semantic Model, Spotlight [56], Candidate Random Forest, RankSVM, gloss, contextual features, Wikipedia, AlchemyAPI, Selection Random Walk with Restart, graph distance DBpedia Zemanta, Learning to Rank Babelfy [59] POS tags, contextual Conditional Random Fields, words, n-grams length, Random Forest, Brown Clustering, predicted entity types, NIL Clustering Lack of candidate, Score capitalization ratio, entity Threshold, Surface Form mention label, entity Aggregation, Type Aggregation mention type Table 12 Map of the approaches per sub-task applied in the NEEL series of challenges from 2013 until 2016.

6.2. Systems overview Calais, Stanford NER [33], WikiMiner,28 NERD [67], TwitterNLP [66], AlchemyAPI, DBpedia Spotlight, Table 13 presents a description of the approaches Zemanta) for both the identification of the boundaries used for Mention Detection combined with Typing. of the entity (mention detection) and the assignment Participants approached the task using lexical similar- of a semantic type (entity typing). The top perform- ity matchers, machine learning algorithms, and hybrid ing system proved to be System 1, which proposed methods that combine the two. For 2013, the strategies an ensemble learning approach composed of a Condi- yielding the best results where hybrid, where mod- tional Random Fields (CRF) and a Support Vector Ma- els relied on the application of off-the-shelf systems chine (SVM) with a radial basis function kernel specif- 26 (e.g., AIDA [46], ANNIE [24], OpenNLP, Illinois ically trained with the challenge dataset. The ensem- 27 NET [64], Illinois Wikifier [65], LingPipe, Open- ble is performed via a union of the extraction results,

26https://opennlp.apache.org 27http://alias-i.com/lingpipe 28http://wikipedia-miner.cms.waikato.ac.nz 18 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series

MENTION DETECTION EXTERNAL SYSTEM MAIN FEATURES LANGUAGE RESOURCE TEAM STRATEGY

2013 Entries

YAGO, Microsoft n-grams, 1 AIDA IsCap, AllCap, TwPOS2011 CRF and SVM (RBF) WordNet ANNIE, OpenNLP, Illinois NET, IsCap, AllCap, LowerCase, 2 Illinois Wikifier, LingPipe, OpenCalais, C4.5 decision tree Google Gazetteer isNP, isVP, Token length StanfordNER, WikiMiner IsCap, AllCap, Prefix, suffix, 3 StanfordNER, NERD, TwitterNLP TwPOS2011, First word, last SVM SMO - word DBpedia and ANNIE 4 ANNIE IsCap, ANNIE Pos ANNIE Gazetteer Alchemy, DBpedia Spotlight, 5 - Random Forest - OpenCalais, Zemanta Geonames.org Gazetteer, JRC 6 - PosTreebank, lowercasing IGTree memory-based taggers names corpus n-gram, PosFreeling 2012, 7 Freeling Lexical Similarity Wiki and DBpedia Gazetteers isNP, Token Length 8 NLTK [53] n-grams, NLTKPos Lexical Similarity Wikipedia 9 Babelfy API [59] - Lexical Similarity DBpedia and BabelNet n-grams, IsCap, AllCap, 10 DBpedia Spotlight CRF DBpedia, BALIE Gazetteers lower case Country names, City names Stem, IsCap, TwPos2011, 11 - CRF Gazetteers, Samsad and Follows NICTA dictionaries, IsOOV 12 - IsCap, prefix, suffix CRF Wiki and Freebase Gazetteers 13 - n-gram PageRank, CRF YAGO, Wikipedia, WordNet

2014 Entries

n-grams, stop words removal, Wikipedia and Freebase 14 - Lexical Similarity punctuation as tokens lexicons Regular Expression, Entity 15 TwiNER [51] TwiNER and CRF DBpedia Gazetteer, Wikipedia phrases, N-gram Wikipedia anchor texts, Collective agreement and 16 TAGME [32] Wikipedia N-grams Wikipedia statistics 17 StanfordNER - - NER Dictionary proper nouns sequence, 18 TwitterNLP - Wikipedia n-grams Unigram, POS, lower, title and upper case, stripped DBpedia Gazetteer, Brown 19 DBpedia Spotlight, TwitterNLP CRF words, isNumber, word Clustering [52] cluster, DBpedia 2015 Entries

Lexical Similarity joint with 20 - n-grams Wikipedia CRF, Random Forest 21 Semanticizer - CRF DBpedia 22 POS Tagger n-grams Maximun Entropy DBpedia 23 TwitIE [9] - - DBpedia 24 TwitIE tokens - DBpedia 25 - tokens CRF joint with POS Tagger - 2016 Entries

26 - unigrams Lexical Similarity DBpedia 27 GATE NLP tokens CRF - 28 - n-grams Lexical Similarity DBpedia Stanford NER and ARK Twitter POS 29 tokens and POS CRF - tagger [38] Lexical Similarity and Lexical 30 - tokens - Similarity

Table 13 Submissions and number of runs for each team for the Mention Detection phase. Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 19 while the typing is assigned via the class computed by base. This resulted in a set of candidate links, which the CRF. have been ranked according to the similarity of the The 2014 systems approached the Mention Detec- link with respect to the mention, and the surrounded tion task adding lexicons and features computed from text. However, the best performing system (System DBpedia resources. System 14, the best performing 14), approached the Candidate Selection as a joint system, used a matcher from n-grams computed from stage with the Mention Detection and link assignment, the text and the lexicon entries taken from DBpedia. proposing a so-called end-to-end system. As opposed From the 2014 challenge on, we observe more ap- to most of the participants which used off-the-shelf proaches favouring recall in the Mention Detection, tools, System 14 proposed a SMART gradient boosting while focusing less on using linguistic features for algorithm [34], specifically trained with the challenge mention detection. System 15, proposed by the same dataset where the features are textual features (such as authors of the best performing system in 2014, ad- textual similarity, contextual similarity), graph-based dressed the Mention Detection task with a large set features (such as semantic cohesiveness between the of linguistic and lexicon-related features (such as the entity-entity and entity-mention pairs), and statistical probability of the candidate obtained from the Mi- features (such as mention popularity using the Web as crosoft Web N-Gram services, or its appearance in archive). The majority of the systems, including Sys- WordNet) and using a SVM classifier with a radial ba- tem 14, applied name normalisation for feature extrac- sis function kernel specifically trained with the chal- tion, which was useful for identifying entities orig- lenge data. Such an approach resulted in high preci- inally appearing as hashtags, or username mentions. sion, but it slightly penalised recall. Among the most commonly used external knowledge The 2015 best performing approach for Mention sources are: NER dictionaries (e.g., Google Cross- Detection, System 20, was largely inspired by the 2014 Wiki), Knowledge Base Gazetteers (e.g., Yago, DB- winning approach: the use of n-grams used to look pedia), weighted lexicons (e.g., Freebase, Wikipedia), up resources in DBpedia and a set of lexical features other sources (e.g., Microsoft Web N-gram).29 A wide such as POS tags and position in tweets. The type range of features were investigated for Candidate Se- was assigned by a Random Forest classifier specifi- lection strategies: n-grams, by capturing jointly the cally trained with the challenge dataset and using DB- local (within a tweet) and global (within the knowl- pedia related features (such as PageRank [11]), word edge base) contextual information of an entity via embeddings (contextual features), temporal popular- graph-based features (e.g., entity semantic cohesive- ity knowledge of an entity extracted from Wikipedia ness). Other novel features included the use of Twitter page view data, string similarity measures to measure account metadata and popularity-based statistical fea- the similarity between the title of the entity and the tures for mentions and entity characterisation respec- mention (such as edit distance), and linguistic features tively. (such as POS tags, position in tweets, capitalization). In the 2015 challenge, System 20 (ranked first) pro- The 2016 best performing system, System 26, im- posed an enhanced version of the 2014 challenge win- plements a lexicon matcher to match the entity in the ner, combined with a pruning stage meant to increase knowledge base to the unigrams computed from the the precision of the Candidate Selection while con- text. The approach proposed a preliminary stage of sidering the role of the entity type being assigned tweet normalisation resolving acronyms, hashtags to mentions written in natural language. by a Conditional Random Field (CRF) classifier. In From 2014 on, the challenge task required partici- particular, System 20 is a five-sequential stage ap- pants to produce systems that were also able to link proach: preprocessing, generation of potential entity the detected mentions to their corresponding DBpedia mentions, candidate selection, NIL detection, and en- link (if existing). Table 14 describes the approaches tity mention typing. In the preprocessing stage, a to- taken by the 2014, 2015, 2016 participants for the Can- kenisation and Part-of-Speech (POS) tagging approach didate Detection and Selection, and NIL Clustering based on [38] was used, along with the extraction of stages. In 2014, most of the systems proposed a Candi- tweet timestamps. They address the generation of po- date Selection step as subsequent of the Mention De- tential entity mentions by computing n-grams (with tection stage, implementing the conventional linguistic pipeline detecting first the mention, and then looking 29http://research.microsoft.com/apps/pubs/ for referents of the mention in the external knowledge default.aspx?id=130762 20 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series n = 1..10 words) and matching them to Wikipedia face form that matches best. The context-related fea- titles, Wikipedia titles of the redirect pages, and an- tures assess the relation of a candidate entity to the chor text using exact, fuzzy, and approximate match other candidates within the given context. They in- functions. An in-house dictionary of acronyms is built clude graph distance measurements, connected com- by splitting the mention surface into different n-grams ponent analysis, or centrality and density observations (where one n-gram corresponds to one character). At using as pivot the DBpedia graph. The candidate selec- this stage all entity mentions are linked to their candi- tion is sorted according to the confidence score, which dates, i.e., the Wikipedia counterparts. The candidate is used as a means to understand whether the entity selection is approached as a learning-to-rank problem: actually describes the mention. In case the confidence each mention is assigned a confidence score computed score is lower than an empirically-determined thresh- as the output of a approach using a old, the mention is annotated with a NIL. random forest as the classifier. An empirically defined The other approaches implement linguistic pipelines threshold is used to select the relevant mentions; in where the Candidate Selection is performed by look- case of mention overlap the span with the highest score ing up entities according to the exact lexical value of is selected. System 20 implemented a Random Forest the mentions with DBpedia titles, redirect pages, and classifier working with a feature set composed of the disambiguation pages. While we observed a reduc- predicted entity types, contextual features such as sur- tion in complexity for the NIL clustering, resulting in rounding words, POS, length of the n-gram and capi- only considering the lexical distance of the mentions as talization features to predict the NIL reference linked for System 27 with the Monge-Elkan similarity mea- to the mention, thus turning the NIL Clustering to a sure [21], or System 28, that experimented the nor- supervised learning task that uses as training data the malised Damerau-Levenshtein, performing better than training challenge dataset. The mention entity typing Brown clustering [12]. stage is treated as a supervised learning task where two independent classifiers are built: a Logistic Regression classifier for typing entity mentions and a Random 7. Evaluation Strategies Forest for typing NIL entries. The other approaches can be classified as sequential, where the complexity In this section, the evaluation metrics used in the dif- is moved to only performing the right matching of the ferent challenges are described. n-gram from the text and the (candidate) entity in the knowledge base. Most of these approaches exploit the 7.1. 2013 Evaluation Measures popularity of the entities and apply distance similarity functions to better rank entities. From the analysis, the In 2013, the submitted systems were evaluated move to fully supervised in-house developed pipelines based on performance in extracting a mention and as- emerges while the use of external systems is signifi- signing its correct class as assigned in the Gold Stan- cantly reduced. The 2015 challenge introduced the task dard (GS). Thus a system was requested to provide a of linking mentions to novel entities, i.e. not present in set of tuples of the form: (m, t), where m is the men- the knowledge base. All approaches in this challenge tion and t is the type, which are then compared against exploit lexical similarity distance functions and class the tuples of the gold standard (GS). A type is any information of the mentions. valid materialisation of the class defined in Table 3 and In 2016, the top performing system, System 26, pro- defined as Person-type, Organisation-type, Location- posed a lexicon-based joint Mention Extraction and type, Misc-type. The precision (P ), recall (R) and F- Candidate Selection approach, where unigrams from measure (F1) metrics were computed for each entity tweets are mapped to DBpedia entities. A preprocess- type. The final result for each system was reported as ing stage cleans and classifies the part-of-speech tags, the average performance across the four entity types and normalises the initial tweets converting alphabetic, considered in the task. The evaluation was based on numeric, and symbolic Unicode characters to ASCII macro-averages across annotation types and tweets. equivalents. For each entity candidate, it considers lo- We performed a strict match between the tuples sub- cal and context-related features. Local features include mitted and those in the GS. A strict match refers to an the edit distance between the candidate labels and the exact match, with conversion to lowercase, between a n-gram, the candidates link graph popularity, its DB- system value and the GS value for a given entity type pedia type, the provenance of the label and the sur- t. Let (m, t) ∈ St denote the set of tuples extracted for Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 21

CANDIDATE SELECTION EXTERNAL SYSTEM MAIN FEATURES LINGUISTIC KNOWLEDGE TEAM STRATEGY

2014 Approaches

n-grams, lower case, entity graph features (entity semantic cohesiveness), DCD-SSVM[18] and SMART 14 - Wikipedia, Freebase popularity-based statistical gradient boosting features (clicks and visiting information from the Web) Wikipedia, DBpedia, n-grams, DBpedia and 15 Google Search SVM WordNet, Web N-Gram, Wikipedia links, capitalisation YAGO link probability, mention-link 16 TAGME C4.5 (for taxonomy-filter) Wikipedia, DBpedia commonness distance prefix, POS, suffix, Twitter Entity Aggregate Prior, 17 - account metadata, normalised Prefix-tree Data Structure Wikipedia, DBpedia, YAGO mentions, trigrams Classifier, Lexical Similarity wikipedia context-based Wikipedia Gazetteer, Google 18 - measure, anchor text measure, LambdaMART Cross Wiki Dictionary Twitter entity popularity Wikipedia Search API, DBpedia Lexical Similarity and 19 mentions Wikipedia, DBpedia Spotlight, Google Search Rule-based 2015 Approaches

word embeddings, entity popularity, commonness Random Forest, Logistic 20 - DBpedia distance, string similarity Regression distance 21 Semanticizer - Learning to Rank DBpedia 22 - mentions Lesk [6] DBpedia 23 - mentions, PageRank Random Walks DBpedia 24 - mentions Lexical Similarity DBpedia 25 DBpedia Spotlight mentions Lexical Similarity - 2016 Approaches

graph distances, connected component analysis, or 26 - Learning to Rank DBpedia centrality and density observations 27 - mentions, graph distances Lexical Similarity DBpedia commonness, inverse document frequency anchor, term entity frequency, TCN, 28 - SVM DBpedia term entity frequency, term frequency paragraph, and redirect 29 Bebelfy mentions - - Lexical Similarity, context 30 - mentions Wikipedia similarity

Table 14 Submissions and number of runs for each team for the Candidate Selection phase. an entity type t by system S; (m, t) ∈ GS denotes the set of tuples for entity type t in the gold standard. Then the set of true positives (TP ), false positives (FP ) and false negatives (FN) for a system is defined as: FNt = {(m, t)GS ∈ GS|¬ ∈ (m, t)S ∈ S} (3)

TPt = {(m, t)S ∈ S|∃(m, t)GS ∈ GS} (1) Since we require strict matches, a system must both detect the correct mention (m) and extract the correct entity type (t) from a tweet. Then for a given entity

FPt = {(m, t)S ∈ S|¬ ∈ (m, t)GS ∈ GS} (2) type we define: 22 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series

pedia links lead to redirect pages that point to final resources, we implemented a resolve mechanism for |TPt| Pt = (4) links that was uniformly applied to all participants. TPt ∪ FPt The correctness of tuples provided by a system S as the exact-match of the mention and the link was assessed. Here the tuple order was also taken into account. We |TPt| define (m, l)S ∈ S as the set of pairs extracted by the Rt = (5) TPt ∪ FNt system S, (m, l)GS ∈ GS denotes the set of pairs in the gold standard. We define the set of true positives Then it is computed the precision and recall on a (TP ), false positives (FP ), and false negatives (FN) per-entity-type basis as: for a given system as:

P + P + P + P P = PER ORG LOC MISC (6) 4 TPl = {(m, l)S ∈ S|∃(m, l)GS ∈ GS} (9)

R + R + R + R R = PER ORG LOC MISC (7) 4 FPl = {(m, l)S ∈ S|¬ ∈ (m, l)GS ∈ GS} (10)

P × R F1 = 2 × (8) P + R FNl = {(m, l)GS ∈ GS|¬ ∈ (m, l)S ∈ S} (11)

Submissions were evaluated offline as participants TPl defines the set of relevant pairs in S, in other were asked to annotate in a short time window a test words, the set of pairs in S that match the ones in GS. set of the GS and to send the results in a TSV30 file. FPl is the set of irrelevant pairs in S, in other words the pairs in S that do not match the pairs in GS. FN 7.2. 2014 Evaluation Measures l is the set of false negatives denoting the pairs that are not recognised by S, yet appear in GS. As our evalua- In 2014, a system S was evaluated in terms of tion is based on a micro-average analysis, we sum the its performance in extracting both mentions and links individual true positives, false positives, and false neg- from a set of tweets. For each tweet of this set, a atives. As we require an exact-match for pairs (m, l) system S provided a tuple of the form: (m, l) where we are looking for strict entity recognition and linking m is the mention and l is the link. A link is any valid DBpedia URI31 that points to an existing re- matches; each system has to link each recognised en- source (e.g. http://dbpedia.org/resource/ tity to the correct resource l. Precision, Recall, F1 are Barack_Obama). The evaluation consisted of com- defined in Equation 12, Equation 13, and Equation 8 paring the submission entry pairs against those in GS. respectively. The measures used to evaluate each pair are precision (P ), recall (R), and f-measure (F1). The evaluation P 32 l |TPl| was based on micro-averages. P = P (12) The evaluation procedure involved an a priori nor- l TPl ∪ FPl malisation stage for each submission. Since some DB-

P 30 |TP | TSV stands for tab separated value. R = l l (13) 31We considered all DBpedia v3.9 resources valid. P l TPl ∪ FNl 32Since the 2014 NEEL Challenge, we opted to weigh all in- stances of TP , FP , FN for each tweet in the scoring, instead of Submissions were evaluated offline, where partici- weighing arithmetically by entity classes. This gives a better and de- tailed effectiveness of the system performances across different tar- pants were asked to annotate in a short time window gets (typed mention, links) and tweets. the TS and to send the results in a TSV file. Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 23

7.3. 2015 and 2016 Evaluation Measures P In the 2015 and 2016 editions of the NEEL chal- t |TPt| R = P (16) lenge, systems were evaluated according to the num- t TPt ∪ FNt ber of mentions correctly detected, their type correctly asserted (i.e. output of Mention Detection and Entity The strong_link_match metric measures the correct Typing), the links correctly assigned between a men- link between a correctly recognised mention and a tion in a tweet and a knowledge base entry, and a NIL knowledge base entry. For a link to be considered cor- assigned when no knowledge base entry disambiguates rect, a system must detect a mention (m) and its type the mention. correctly (t) as well as the correct Knowledge Base The required outputs were measured using a set of entry (l). Note also that this metric does not evaluate three evaluation metrics: strong_typed_mention_match, links to NIL. The detection of mentions is still based strong_link_match, and mention_ceaf. These metrics on strict matching as in previous versions of the chal- were combined into a final score (Equation 14). lenge. Therefore true positive (Equation 9), false pos- itive (Equation 10), and false negative (Equation 11) counts are still calculated in the same manner. This score = 0.4 ∗ mention_ceaf (14) metric is also based on micro-averaged precision and recall as defined in Equation 12 and Equation 13 and + 0.3 ∗ strong_typed_mention_match the F1 as in Equation 8. + 0.3 ∗ strong_link_match, The last metric in our evaluation score is the Con- strained Entity-Alignment F-measure (CEAF) [54]. where the weights are empirically assigned to favour This is a metric that measures coreference chains and more the role of the mention_ceaf, i.e. the ability of a is used to jointly evaluate Candidate Selection and system S to link the mention either to an existing entry NIL Clustering steps. Let E = {m1, . . . mn} de- in DBpedia or to a NIL entry generated by S and iden- note the set of all mentions linked to e, where e is ei- tified uniquely and consistently across different NILs. ther a knowledge base entry or a NIL identifier. men- The strong_typed_mention_match measures the per- tion_ceaf finds the optimal alignment between the sets formance of the system regarding the correct identi- provided by the system and the gold standard and then fication of mentions and their correct type assertion. performs the micro-averaged precision and recall over The detection of mentions is still based on strict match- each mention. ing as in previous versions of the challenge. There- In 2015, submissions were evaluated through an on- fore true positive (Equation 9), false positive (Equa- line process as participants were required to imple- tion 2), and false negative (Equation 3) counts are ment their systems as a publicly accessible web service still calculated in the same manner. However, the mea- following a REST-based protocol, where they could surement of precision and recall changed slightly. In submit up to 10 contending entries to a registry of 2013, we used macro-averaged precision and recall. In the NEEL challenge services. Each endpoint had a this case, first the precision, recall and F-measure are Web address (URI) and a name, which was referred computed over the classes which are then averaged. as runID. Upon receiving the registration of the REST Here, a more popular class can dominate the metrics. endpoint, calls to the contending entry were scheduled In 2015 we used micro-averaged metrics. In a micro- for two different time windows, namely, D − T ime - averaged precision and recall setup, each mention has to test the APIs, and T − T ime - for the final evalu- an equal impact on the final result regardless of its ation and metric computations. To ensure the correct- class size. Therefore, precision (P ) is calculated ac- ness of the results and avoid any loss we triggered N cording to the Equation 15 and recall (R) according to (with N=100) calls to each entry. We then applied a Equation 16. Finally, strong_typed_mention_match is majority voting approach over the set of annotations per tweet and statistically evaluated the latency by ap- the micro-averaged (F1) as given by Equation 8. plying the law of large numbers [79]. The algorithm is detailed in Algorithm 1. This offered the opportunity P |TP | to measure the computing time systems spent in pro- P = t t P (15) viding the answer. The computing time was proposed t TPt ∪ FPt 24 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series to solve potential draws from Equation 14. Where E is 2013 Entries the set of entities, and T weet is the set of tweets. TEAM P R F1 1 0.764 0.604 0.67 Algorithm 1 EVALUATE(E, T weet, N = 100,M = 2 0.724 0.613 0.662 30) 3 0.735 0.611 0.658

1: for all ei ∈ E do 4 0.734 0.595 0.61 2: AS = ∅, LS = ∅ 5 0.688 0.546 0.589 3: for all tj ∈ T weet do 4: for all nk ∈ N do 6 0.774 0.548 0.589 5: (A, L) = annotate(tj , ei) 6: end for 7 0.683 0.483 0.561 7: 8 0.685 0.5 0.54 8: // Majority Voting Selection of a from A 9 0.662 0.482 0.518 9: for all ak ∈ A do 10: hash(ak) 10 0.627 0.383 0.494 11: end for S 11 0.564 0.43 0.491 12: Aj = Majority Voting on the exact same hash(ak) 13: 12 0.501 0.468 0.489 14: // Random Selection of l from L 13 0.53 0.402 0.399 15: generate LT from the uniformly random selection of M l from L 16: (µ, σ) = computeMuAndSigma(LT ) S Table 15 17: Lj = (µ, σ) 18: end for Scores achieved for the NEEL 2013 submissions. 19: end for

8. Results As setting up a REST API increased the system implementation load on the participants, we reverted This section presents a compilation of the NEEL back to an offline evaluation setup in 2016. As in pre- challenge results across the years. As the NEEL task vious challenges, participants were asked to annotate evolved, the results among these years are not entirely the TS during a short time window and to send the re- comparable. Table 15 shows results for the 2013 chal- sults in a TSV file which was then evaluated by the lenge task, where we report scores averaged for the challenge chairs. four entity types analysed on this task. The 2013 task consisted of building systems that 7.4. Summary could identify four entity types (i.e., Person, Loca- tion, Organisation and Miscellaneous) in a tweet. This Three editions out of four followed an offline eval- task proved to be challenging, with some approaches uation procedure. A discontinuity was introduced in favouring precision over recall. The best rank in preci- 2015 with the introduction of the online evaluation sion was obtained by Team 1, which used a combina- procedure. Two issues were noted by the participants tion of rule types and data-driven approaches achiev- of the 2015 edition: i) increasing complexity of the ing a 76.4% precision. For recall, results varied across task, going beyond the pure NEEL objectives; ii) un- the four entity types with results for the miscellaneous fair comparison of the computing time with respect to and organisation types ranking the lowest. Averaging big players that can afford better computing resources over entity types, the best approach was obtained by than small research teams. These motivations caused Team 2, whose solution relied on gazetteers. All top the use of a conventional offline procedure for the 2016 3 teams ranked by F-measure followed a hybrid ap- edition. The emerging trend sees a consolidation of a proach combining rules and gazetteers. standard de-facto scorer that was proposed in TAC- The 2014 challenge task extended the concept ex- KBP and also now successfully adopted and widely traction challenge by not only considering the entity used in our community. This scorer supports the mea- type recognition but also the linking of entities to the surement of the performance of the approaches in the DBpedia v3.9 knowledge base. Table 16 presents the entire annotation pipeline, ranging from the Mention results for this task, which follow the evaluation de- Extraction, Candidate Selection, Typing, and detection scribed in Section 7. There was a clear winner that out- of novel and emerging entities from highly dynamic performed all other systems on all three metrics and contexts such as tweets. it was proposed by the Microsoft Research Lab Red- Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 25

2014 Entries 9. Conclusion

TEAM P R F1 The NEEL challenge series was established in 2013 14 77.1 64.2 70.06 to foster the development of novel automated ap- 15 57.3 52.74 54.93 proaches for mining semantics from tweets and pro- 16 60.93 42.25 49.9 viding standardised benchmark corpora enabling the 17 53.28 39.51 45.37 community to compare systems. 18 50.95 40.67 45.23 This paper describes the decisions and procedures 19 49.58 32.17 39.02 followed in setting up and running the task. We first Table 16 described the annotation procedures used to create the Scores achieved for the NEEL 2014 submissions. NEEL corpora over the years. The procedures were in- crementally adjusted over time to provide continuity

33 and ensure reusability of the approaches over the dif- mond. Most of the 2014 submissions followed a se- ferent editions. While the consolidation has provided quential approach doing first the recognition and then consistent labeled data, it has also showed the robust- the linking. The winning system (Team 14) introduced ness of the community. a novel approach, namely joint learning of recognition We also described the different approaches proposed and linking from the training data. This approach out- by the NEEL challenge participants. Over the years, performed the second best team in F-measure by over we witnessed to a convergence of the approaches to- 15%. wards data-driven solutions supported by knowledge The 2015 task extended the 2014 recognition and bases. Knowledge bases are prominently used as a linking tasks with a clustering task. For this task par- source to discover known entities, relations among ticipants had to provide clusters where each cluster data, and labelled data for selecting candidates and contained only mentions to the real world entity. For suggesting novel entities. Data-driven approaches have 2015 we also computed the latency of each system. become, with variations, the leading solution. Despite Table 17 presents a ranked list of results for the 2015 the consolidated number of options for addressing the submissions. The last column shows the final score for challenge task, the participants’ results show that the each participant following Equation 14. Here the win- NEEL task remains challenging in the microposts do- ner (Team 20) outperformed the second best with a main. boost in tagging F1 of 41.9%, in clustering F1 of 28%, Furthermore, we explained the different evaluation and linking F1 of 23.9%. Team 20 improved upon the strategies used in different challenges. These changes second best team on the general score with 33.1%. For were driven by a desire to ensure fairness of the eval- 2015, the winner team proposed an end-to-end sys- uation, transparency, and correctness. These adapta- tem for both candidate selection and mention typing, tions involve the use of in-house scoring tools in 2013 along with a linguistic pipeline to perform entity typ- and 2014, which were made publicly available and dis- ing and filtering. As in 2014, the best ranked system cussed in the community. Since 2015 the TAC-KBP was proposed by a company, namely Studio Ousia,34 challenge scorer was adopted to both leverage the wide that focuses on and artificial in- experience developed in the TAC-KBP community and telligence. break down the analysis to account for the clustering. Finally, the 2016 challenge followed the same task Thanks to the yearly releases of the annotations and as 2015. Team 26 outperformed all other participants, tweet IDs with a public license, the NEEL corpus has with an overall F1 score of 0.5486 and an absolute started to become widely adopted. Beyond the thirty improvement of 16.58% compared to the second-best teams who completed the evaluations in four years, approach. Team 26 used a learning-to-rank approach more than three hundred participants have contacted for the candidate selection task along with a series of the NEEL organisers with a request to acquire the cor- graph-based metrics making use of DBpedia as their pora. The teams come from more than twenty differ- main linguistic knowledge source. ent countries and are both from academia and industry. The 2014 and 2015 winners were companies operat- 33https://www.microsoft.com/en-us/research/lab/microsoft- ing in the field, respectively Microsoft and Studio Ou- research-redmond/ sia. The 2013 and 2016 winners were academic teams. 34http://www.ousia.jp/en The success of the NEEL challenges is also illustrated 26 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series

2015 Entries

TEAM TAGGING F1 CLUSTERING F1 LINKING F1 LATENCY[S] SCORE 20 0.807 0.84 0.762 8.5±3.62 0.8067 25 0.388 0.506 0.523 0.13±0.02 0.4757 21 0.412 0.643 0.316 0.19±0.09 0.4756 22 0.367 0.459 0.464 2.03±2.35 0.4329 23 0.329 0.394 0.415 3.41±7.62 0.3808 24 0 0.001 0 12.89±27.6 0.004

Table 17 Scores achieved for the NEEL 2015 submissions. Tagging refers to strong_typed_mention_match, clustering refers to mention_ceaf, and linking to strong_link_match.

2016 Entries

TEAM TAGGING F1 CLUSTERING F1 LINKING F1 SCORE 26 0.473 0.641 0.501 0.5486 27 0.246 0.621 0.202 0.3828 28 0.319 0.366 0.396 0.3609 29 0.312 0.467 0.248 0.3548 30 0.246 0.203 0.162 0.3353

Table 18 Scores achieved for the NEEL 2016 submissions. Tagging refers to strong_typed_mention_match, clustering refers to mention_ceaf, and linking to strong_link_match. by the sponsorships of the challenges offered by com- ter sphere. The 2015 enhancements in the evaluation 35 36 panies (ebay in 2013 and SpazioDati in 2015) and strategy, which accounts for computational time, high- research projects (LinkedTV37 in 2014, and FREME38 lighted new challenges when focusing on an algo- in 2016). The NEEL challenges also triggered the interest of rithm’s efficiency vs efficacy. Since more efforts on local communities such as NEEL-IT. This community handling large scale data mining involve distributed is pushing the NEEL Challenge Annotation Guidelines computing and optimisation, we aim to develop new (with minor variations due to the intra-language de- evaluation strategies. These strategies will ensure the pendencies) and know-how to create a benchmark for fairness of the results when asking participants to pro- sharing the algorithms and results of mining semantics from Italian tweets. In 2015, we also built bridges with duce large scale annotations in a small window of time. the TAC community. We plan to strengthen these and Among the future efforts, we aim to identify the dif- to involve a larger audience of potential participants ferences in performance among the disparate systems ranging from Linguistics, Machine Learning, Knowl- and their approaches, first characterising what can be edge Extraction, Data and Web Science. considered an error by the systems in the context of the Future work involves the generation of corpora that account for the low variance of entity-type semantics. challenge, and then deriving insightful conclusions on We aim to create larger datasets covering a broader the building blocks needed to build an optimal system range of entity types and domains within the Twit- automatically. Finally, given the increasing interest in adopting the 35http://www.ebay.com NEEL guidelines in creating corpora for other lan- 36http://www.spaziodati.eu 37http://www.linkedtv.eu guages, we aim to develop a multilingual NEEL chal- 38http://freme-project.eu/ lenge as a future activity. Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 27

Acknowledgments through a distributional semantic model. In Jan Hajic and Ju- nichi Tsujii, editors, COLING 2014, 25th International Con- This work was supported by the H2020 FREME ference on Computational Linguistics, Proceedings of the Con- ference: Technical Papers, August 23-29, 2014, Dublin, Ire- project (GA no. 644771), by the research grant from land, pages 1591–1600. ACL, 2014. URL http://aclweb.org/ Science Foundation Ireland (SFI) under Grant Number anthology/C/C14/C14-1151.pdf. SFI/12/RC/2289, and by the CLARIAH-CORE project [7] Pierpaolo Basile, Annalina Caputo, Giovanni Semeraro, and financed by the Netherlands Organisation for Scientific Fedelucio Narducci. UNIBA: Exploiting a distributional se- Research (NWO). mantic model for disambiguating and linking entities in tweets. In Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie, editors, Proceedings of the the 5th Workshop on Making Sense of Microposts co-located with the 24th International References World Wide Web Conference (WWW 2015), Florence, Italy, May 18th, 2015., volume 1395 of CEUR Workshop Proceed- [1] Fabian Abel, Ilknur Celik, Geert-Jan Houben, and Patrick ings, page 62. CEUR-WS.org, 2015. URL http://ceur-ws.org/ Siehndel. Leveraging the semantics of tweets for adaptive Vol-1395/paper_15.pdf. faceted search on Twitter. In Lora Aroyo, Chris Welty, [8] Pierpaolo Basile, Annalina Caputo, Anna Lisa Gentile, and Harith Alani, Jamie Taylor, Abraham Bernstein, Lalana Ka- Giuseppe Rizzo. Overview of the EVALITA 2016 Named gal, Natasha Fridman Noy, and Eva Blomqvist, editors, The Se- Entity rEcognition and Linking in Italian Tweets (NEEL-IT) mantic Web - ISWC 2011 - 10th International Semantic Web Task. In Pierpaolo Basile, Anna Corazza, Francesco Cu- Conference, Bonn, Germany, October 23-27, 2011, Proceed- tugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, ings, Part I, volume 7031 of Lecture Notes in Computer Sci- Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceed- ence, pages 1–17. Springer, 2011. DOI https://doi.org/10.1007/ ings of Third Italian Conference on Computational Linguistics 978-3-642-25073-6_1. (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Lan- [2] Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Se- guage Processing and Speech Tools for Italian. Final Work- mantic enrichment of Twitter posts for user profile construc- shop (EVALITA 2016), Napoli, Italy, December 5-7, 2016., vol- tion on the social web. In Grigoris Antoniou, Marko Grobel- ume 1749 of CEUR Workshop Proceedings. CEUR-WS.org, nik, Elena Paslaru Bontas Simperl, Bijan Parsia, Dimitris Plex- 2016. URL http://ceur-ws.org/Vol-1749/paper_007.pdf. ousakis, Pieter De Leenheer, and Jeff Z. Pan, editors, The Se- [9] Kalina Bontcheva, Leon Derczynski, Adam Funk, Mark A. manic Web: Research and Applications - 8th Extended Seman- Greenwood, Diana Maynard, and Niraj Aswani. TwitIE: tic Web Conference, ESWC 2011, Heraklion, Crete, Greece, An open-source information extraction pipeline for microblog May 29 - June 2, 2011, Proceedings, Part II, volume 6644 of text. In Galia Angelova, Kalina Bontcheva, and Ruslan Mitkov, Lecture Notes in Computer Science, pages 375–389. Springer, editors, Recent Advances in Natural Language Processing, 2011. DOI https://doi.org/10.1007/978-3-642-21064-8_26. RANLP 2013, 9-11 September, 2013, Hissar, Bulgaria, pages [3] Barathi Ganesh H. B., Abinaya N., M. Anand Kumar, Vinayku- 83–90. RANLP 2013 Organising Committee / ACL, 2013. mar R., and K. P. Soman. AMRITA - CEN@NEEL: Identifi- URL http://aclweb.org/anthology/R/R13/R13-1011.pdf. cation and linking of Twitter entities. In Matthew Rowe, Milan [10] Antal van den Bosch and Toine Bogers. Memory-based Stankovic, and Aba-Sah Dadzie, editors, Proceedings of the the named entity recognition in tweets. In Amparo Elizabeth 5th Workshop on Making Sense of Microposts co-located with Cano Basave, Matthew Rowe, Milan Stankovic, and Aba-Sah the 24th International World Wide Web Conference (WWW Dadzie, editors, Proceedings of the Concept Extraction Chal- 2015), Florence, Italy, May 18th, 2015., volume 1395 of CEUR lenge at the Workshop on ’Making Sense of Microposts’, Rio de Workshop Proceedings, pages 64–65. CEUR-WS.org, 2015. Janeiro, Brazil, May 13, 2013, volume 1019 of CEUR Work- URL http://ceur-ws.org/Vol-1395/paper_20.pdf. shop Proceedings, pages 40–43. CEUR-WS.org, 2013. URL [4] Timothy Baldwin, Marie-Catherine de Marneffe, Bo Han, http://ceur-ws.org/Vol-1019/paper_03.pdf. Young-Bum Kim, Alan Ritter, and Wei Xu. Shared tasks of [11] Sergey Brin and Lawrence Page. The anatomy of a large-scale the 2015 workshop on noisy user-generated text: Twitter lexi- hypertextual web search engine. Computer Networks, 30(1-7): cal normalization and named entity recognition. In ACL 2015 107–117, 1998. DOI https://doi.org/10.1016/S0169-7552(98) Workshop on Noisy User-generated Text, Proceedings of the 00110-X. Workshop, July 31, 2015 Beijing, China, pages 126–135. ACL, [12] Peter F. Brown, Vincent J. Della Pietra, Peter V. de Souza, Jen- 2015. URL http://www.aclweb.org/anthology/W15-4319. nifer C. Lai, and Robert L. Mercer. Class-based n-gram mod- [5] Romil Bansal, Sandeep Panem, Priya Radhakrishnan, Man- els of natural language. Computational Linguistics, 18(4):467– ish Gupta, and Vasudeva Varma. Linking entities in #mi- 479, 1992. croposts. In Matthew Rowe, Milan Stankovic, and Aba-Sah [13] Razvan C. Bunescu and Marius Pasca. Using encyclopedic Dadzie, editors, Proceedings of the the 4th Workshop on Mak- knowledge for named entity disambiguation. In Diana Mc- ing Sense of Microposts co-located with the 23rd International Carthy and Shuly Wintner, editors, EACL 2006, 11st Confer- World Wide Web Conference (WWW 2014), Seoul, Korea, April ence of the European Chapter of the Association for Compu- 7th, 2014., volume 1141 of CEUR Workshop Proceedings, tational Linguistics, Proceedings of the Conference, April 3- pages 71–72. CEUR-WS.org, 2014. URL http://ceur-ws.org/ 7, 2006, Trento, Italy. ACL, 2006. URL http://aclweb.org/ Vol-1141/paper_16.pdf. anthology/E/E06/E06-1002.pdf. [6] Pierpaolo Basile, Annalina Caputo, and Giovanni Semer- [14] Davide Caliano, Elisabetta Fersini, Pikakshi Manchanda, Mat- aro. An enhanced lesk word sense disambiguation algorithm teo Palmonari, and Enza Messina. UniMiB: Entity link- 28 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series

ing in tweets using Jaro-Winkler distance, popularity and co- Rowe, Milan Stankovic, and Aba-Sah Dadzie, editors, Pro- herence. In Aba-Sah Dadzie, Daniel Preotiuc-Pietro, Dan- ceedings of the Concept Extraction Challenge at the Work- ica Radovanovic, Amparo Elizabeth Cano Basave, and Katrin shop on ’Making Sense of Microposts’, Rio de Janeiro, Brazil, Weller, editors, Proceedings of the 6th Workshop on ’Mak- May 13, 2013, volume 1019 of CEUR Workshop Proceedings, ing Sense of Microposts’ co-located with the 25th Interna- pages 31–35. CEUR-WS.org, 2013. URL http://ceur-ws.org/ tional World Wide Web Conference (WWW 2016), Montréal, Vol-1019/paper_20.pdf. Canada, April 11, 2016., volume 1691 of CEUR Workshop [23] Silviu Cucerzan. Large-scale named entity disambiguation Proceedings, pages 70–72. CEUR-WS.org, 2016. URL http: based on Wikipedia data. In Jason Eisner, editor, EMNLP- //ceur-ws.org/Vol-1691/paper_12.pdf. CoNLL 2007, Proceedings of the 2007 Joint Conference on [15] Amparo Elizabeth Cano Basave, Andrea Varga, Matthew Empirical Methods in Natural Language Processing and Com- Rowe, Milan Stankovic, and Aba-Sah Dadzie. Making Sense putational Natural Language Learning, June 28-30, 2007, of Microposts (#MSM2013) Concept Extraction Challenge. Prague, Czech Republic, pages 708–716. ACL, 2007. URL In Amparo Elizabeth Cano Basave, Matthew Rowe, Milan http://www.aclweb.org/anthology/D07-1074. Stankovic, and Aba-Sah Dadzie, editors, Proceedings of the [24] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Concept Extraction Challenge at the Workshop on ’Mak- Valentin Tablan. GATE: An architecture for development of ing Sense of Microposts’, Rio de Janeiro, Brazil, May 13, robust HLT applications. In Pierre Isabelle, Eugene Charniak, 2013, volume 1019 of CEUR Workshop Proceedings, pages 1– and Dekang Lin, editors, ACL-02, 40th Annual Meeting of the 15. CEUR-WS.org, 2013. URL http://ceur-ws.org/Vol-1019/ Association for Computational Linguistics, Proceedings of the msm2013-challenge-report.pdf. Conference, 7 âA¸S12July˘ 2002, Philadelphia, Pennsylvania, [16] Amparo Elizabeth Cano Basave, Giuseppe Rizzo, Andrea USA, pages 168–175. ACL, 2002. DOI https://doi.org/10.3115/ Varga, Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie. 1073083.1073112. Making Sense of Microposts (#Microposts2014) Named En- [25] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, tity Extraction & Linking Challenge. In Matthew Rowe, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Milan Stankovic, and Aba-Sah Dadzie, editors, Proceed- Adam Funk, Angus Roberts, Danica Damljanovic, Thomas ings of the the 4th Workshop on Making Sense of Microp- Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, osts co-located with the 23rd International World Wide Web Yaoyong Li, Wim Peters, et al. Text Processing with GATE Conference (WWW 2014), Seoul, Korea, April 7th, 2014., (Version 6). Gateway Press, 2011. volume 1141 of CEUR Workshop Proceedings, pages 54– [26] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, 60. CEUR-WS.org, 2014. URL http://ceur-ws.org/Vol-1141/ Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, microposts2014_neel-challenge-report.pdf. Adam Funk, Angus Roberts, Danica Damljanovic, Thomas [17] David Carmel, Ming-Wei Chang, Evgeniy Gabrilovich, Bo- Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, June Paul Hsu, and Kuansan Wang. ERD’14: Entity Recog- Yaoyong Li, Wim Peters, Leon Derczynski, et al. Developing nition and Disambiguation Challenge. SIGIR Forum, 48(2): language processing components with GATE version 8 (a user 63–77, 2014. DOI https://doi.org/10.1145/2701583.2701591. guide). Technical report, The University of Sheffield, 2017. [18] Ming-Wei Chang and Wen-tau Yih. Dual coordinate de- URL https://gate.ac.uk/sale/tao/split.html. scent algorithms for efficient large margin structured predic- [27] Daniel Dahlmeier, Naveen Nandan, and Wang Ting. Part-of- tion. Transactions of the Association for Computational Lin- speech is (almost) enough: SAP research & innovation at the guistics, 1:207–218, 2013. URL https://tacl2013.cs.columbia. #Microposts2014 NEEL Challenge. In Matthew Rowe, Milan edu/ojs/index.php/tacl/article/view/120. Stankovic, and Aba-Sah Dadzie, editors, Proceedings of the the [19] Ming-Wei Chang, Bo-June Paul Hsu, Hao Ma, Ricky Loynd, 4th Workshop on Making Sense of Microposts co-located with and Kuansan Wang. E2E: An end-to-end entity linking system the 23rd International World Wide Web Conference (WWW for short and noisy text. In Matthew Rowe, Milan Stankovic, 2014), Seoul, Korea, April 7th, 2014., volume 1141 of CEUR and Aba-Sah Dadzie, editors, Proceedings of the the 4th Work- Workshop Proceedings, pages 73–74. CEUR-WS.org, 2014. shop on Making Sense of Microposts co-located with the URL http://ceur-ws.org/Vol-1141/paper_20.pdf. 23rd International World Wide Web Conference (WWW 2014), [28] Amitava Das, Utsab Burman, Balamurali A. R., and Sivaji Seoul, Korea, April 7th, 2014., volume 1141 of CEUR Work- Bandyopadhyay. NER from tweets: SRI-JU System @MSM shop Proceedings, pages 62–63. CEUR-WS.org, 2014. URL 2013. In Amparo Elizabeth Cano Basave, Matthew Rowe, http://ceur-ws.org/Vol-1141/paper_18.pdf. Milan Stankovic, and Aba-Sah Dadzie, editors, Proceedings [20] Nancy Chinchor and P. Robinson. Muc-7 named entity task of the Concept Extraction Challenge at the Workshop on definition. In Message Understanding Conference Proceed- ’Making Sense of Microposts’, Rio de Janeiro, Brazil, May ings, MUC-7, 1997. URL http://www-nlpir.nist.gov/related_ 13, 2013, volume 1019 of CEUR Workshop Proceedings, projects/muc/proceedings/ne_task.html. page 62. CEUR-WS.org, 2013. URL http://ceur-ws.org/ [21] William W. Cohen, Pradeep Ravikumar, and Stephen E. Fien- Vol-1019/paper_33.pdf. berg. A comparison of string distance metrics for name- [29] Stefan Dlugolinsky, Peter Krammer, Marek Ciglan, and Michal matching tasks. In Subbarao Kambhampati and Craig A. Laclavik. MSM2013 IE Challenge: Annotowatch. In Am- Knoblock, editors, Proceedings of IJCAI-03 Workshop on In- paro Elizabeth Cano Basave, Matthew Rowe, Milan Stankovic, formation Integration on the Web (IIWeb-03), August 9-10, and Aba-Sah Dadzie, editors, Proceedings of the Concept 2003, Acapulco, Mexico, pages 73–78, 2003. URL http://www. Extraction Challenge at the Workshop on ’Making Sense of isi.edu/info-agents/workshops/ijcai03/papers/Cohen-p.pdf. Microposts’, Rio de Janeiro, Brazil, May 13, 2013, volume [22] Keith Cortis. ACE: A concept extraction approach using 1019 of CEUR Workshop Proceedings, pages 21–26. CEUR- linked open data. In Amparo Elizabeth Cano Basave, Matthew WS.org, 2013. URL http://ceur-ws.org/Vol-1019/paper_21. Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 29

pdf. [37] Souvick Ghosh, Promita Maitra, and Dipankar Das. Fea- [30] Marieke van Erp, Giuseppe Rizzo, and Raphaël Troncy. Learn- ture based approach to named entity recognition and linking ing with the Web: Spotting named entities on the intersec- for tweets. In Aba-Sah Dadzie, Daniel Preotiuc-Pietro, Dan- tion of NERD and machine learning. In Amparo Elizabeth ica Radovanovic, Amparo Elizabeth Cano Basave, and Katrin Cano Basave, Matthew Rowe, Milan Stankovic, and Aba-Sah Weller, editors, Proceedings of the 6th Workshop on ’Mak- Dadzie, editors, Proceedings of the Concept Extraction Chal- ing Sense of Microposts’ co-located with the 25th Interna- lenge at the Workshop on ’Making Sense of Microposts’, Rio de tional World Wide Web Conference (WWW 2016), Montréal, Janeiro, Brazil, May 13, 2013, volume 1019 of CEUR Work- Canada, April 11, 2016., volume 1691 of CEUR Workshop shop Proceedings, pages 27–30. CEUR-WS.org, 2013. URL Proceedings, pages 74–76. CEUR-WS.org, 2016. URL http: http://ceur-ws.org/Vol-1019/paper_15.pdf. //ceur-ws.org/Vol-1691/paper_10.pdf. [31] Marieke van Erp, Pablo N. Mendes, Heiko Paulheim, Filip [38] Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipan- Ilievski, Julien Plu, Giuseppe Rizzo, and Jörg Waitelonis. jan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Evaluating entity linking: An analysis of current benchmark Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. Part-of- datasets and a roadmap for doing a better job. In Nico- speech tagging for Twitter: Annotation, features, and experi- letta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, ments. In The 49th Annual Meeting of the Association for Com- Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène putational Linguistics: Human Language Technologies, Pro- Mazo, Asunción Moreno, Jan Odijk, and Stelios Piperidis, ed- ceedings of the Conference, 19-24 June, 2011, Portland, Ore- itors, Proceedings of the Tenth International Conference on gon, USA - Short Papers, pages 42–47. ACL, 2011. URL Language Resources and Evaluation LREC 2016, Portorož, http://www.aclweb.org/anthology/P11-2008. Slovenia, May 23-28, 2016. European Language Resources [39] Fréderic Godin, Pedro Debevere, Erik Mannens, Wesley De Association (ELRA), 2016. URL http://www.lrec-conf.org/ Neve, and Rik Van de Walle. Leveraging existing tools for proceedings/lrec2016/summaries/926.html. named entity recognition in microposts. In Amparo Elizabeth [32] Paolo Ferragina and Ugo Scaiella. TAGME: On-the-fly an- Cano Basave, Matthew Rowe, Milan Stankovic, and Aba-Sah notation of short text fragments (by Wikipedia entities). In Dadzie, editors, Proceedings of the Concept Extraction Chal- Jimmy Huang, Nick Koudas, Gareth J. F. Jones, Xindong Wu, lenge at the Workshop on ’Making Sense of Microposts’, Rio de Kevyn Collins-Thompson, and Aijun An, editors, Proceed- Janeiro, Brazil, May 13, 2013, volume 1019 of CEUR Work- ings of the 19th ACM Conference on Information and Knowl- shop Proceedings, pages 36–39. CEUR-WS.org, 2013. URL edge Management, CIKM 2010, Toronto, Ontario, Canada, http://ceur-ws.org/Vol-1019/paper_25.pdf. October 26-30, 2010, pages 1625–1628. ACM, 2010. DOI [40] Kara Greenfield, Rajmonda S. Caceres, Michael Coury, Kelly https://doi.org/10.1145/1871437.1871689. Geyer, Youngjune Gwon, Jason Matterer, Alyssa Mensch, [33] Jenny Rose Finkel, Trond Grenager, and Christopher D. Man- Cem Safak Sahin, and Olga Simek. A reverse approach to ning. Incorporating non-local information into information named entity extraction and linking in microposts. In Aba-Sah extraction systems by Gibbs sampling. In Kevin Knight, Dadzie, Daniel Preotiuc-Pietro, Danica Radovanovic, Amparo Hwee Tou Ng, and Kemal Oflazer, editors, ACL 2005, 43rd Elizabeth Cano Basave, and Katrin Weller, editors, Proceed- Annual Meeting of the Association for Computational Linguis- ings of the 6th Workshop on ’Making Sense of Microposts’ co- tics, Proceedings of the Conference, 25-30 June 2005, Uni- located with the 25th International World Wide Web Confer- versity of Michigan, USA, pages 363–370. ACL, 2005. DOI ence (WWW 2016), Montréal, Canada, April 11, 2016., volume https://doi.org/10.3115/1219840.1219885. 1691 of CEUR Workshop Proceedings, pages 67–69. CEUR- [34] Jerome H. Friedman. Greedy function approximation: A gra- WS.org, 2016. URL http://ceur-ws.org/Vol-1691/paper_11. dient boosting machine. The Annals of Statistics, 29(5):1189– pdf. 1232, 10 2001. DOI https://doi.org/10.1214/aos/1013203451. [41] Ralph Grishman and Beth Sundheim. Message understanding [35] Cristina Gârbacea, Daan Odijk, David Graus, Isaac Sijarana- conference–6: A brief history. In 16th International Confer- mual, and Maarten de Rijke. Combining multiple signals ence on Computational Linguistics, Proceedings of the Confer- for semanticizing tweets: University of Amsterdam at #Mi- ence, COLING 1996, Center for Sprogteknologi, Copenhagen, croposts2015. In Matthew Rowe, Milan Stankovic, and Aba- Denmark, August 5-9, 1996, pages 466–471. ACL, 1996. URL Sah Dadzie, editors, Proceedings of the the 5th Workshop http://aclweb.org/anthology/C96-1079. on Making Sense of Microposts co-located with the 24th In- [42] Zhaochen Guo and Denilson Barbosa. Entity recognition and ternational World Wide Web Conference (WWW 2015), Flo- linking on tweets with random walks. In Matthew Rowe, Milan rence, Italy, May 18th, 2015., volume 1395 of CEUR Work- Stankovic, and Aba-Sah Dadzie, editors, Proceedings of the the shop Proceedings, pages 59–60. CEUR-WS.org, 2015. URL 5th Workshop on Making Sense of Microposts co-located with http://ceur-ws.org/Vol-1395/paper_17.pdf. the 24th International World Wide Web Conference (WWW [36] Yegin Genc, Winter A. Mason, and Jeffrey V. Nickerson. Clas- 2015), Florence, Italy, May 18th, 2015., volume 1395 of CEUR sifying short messages using collaborative knowledge bases: Workshop Proceedings, pages 57–58. CEUR-WS.org, 2015. Reading Wikipedia to understand Twitter. In Amparo Elizabeth URL http://ceur-ws.org/Vol-1395/paper_21.pdf. Cano Basave, Matthew Rowe, Milan Stankovic, and Aba-Sah [43] Mena B. Habib, Maurice van Keulen, and Zhemin Zhu. Dadzie, editors, Proceedings of the Concept Extraction Chal- Concept Extraction Challenge: University of Twente at lenge at the Workshop on ’Making Sense of Microposts’, Rio de #MSM2013. In Amparo Elizabeth Cano Basave, Matthew Janeiro, Brazil, May 13, 2013, volume 1019 of CEUR Work- Rowe, Milan Stankovic, and Aba-Sah Dadzie, editors, Pro- shop Proceedings, pages 50–53. CEUR-WS.org, 2013. URL ceedings of the Concept Extraction Challenge at the Work- http://ceur-ws.org/Vol-1019/paper_28.pdf. shop on ’Making Sense of Microposts’, Rio de Janeiro, Brazil, May 13, 2013, volume 1019 of CEUR Workshop Proceedings, 30 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series

pages 17–20. CEUR-WS.org, 2013. URL http://ceur-ws.org/ OR, USA, August 12-16, 2012, pages 721–730. ACM, 2012. Vol-1019/paper_14.pdf. DOI https://doi.org/10.1145/2348283.2348380. [44] Mena B. Habib, Maurice van Keulen, and Zhemin Zhu. Named [52] Percy Liang. Semi-supervised learning for natural language. Entity Extraction and Linking Challenge: University of Twente Master’s thesis, Massachusetts Institute of Technology, 2005. at #Microposts2014. In Matthew Rowe, Milan Stankovic, and URL http://hdl.handle.net/1721.1/33296. Aba-Sah Dadzie, editors, Proceedings of the the 4th Workshop [53] Edward Loper and Steven Bird. NLTK: the natural language on Making Sense of Microposts co-located with the 23rd In- toolkit. In Chris Brew, Michael Rosner, and Dragomir Radev, ternational World Wide Web Conference (WWW 2014), Seoul, editors, ETMTNLP ’02, Proceedings of the ACL-02 Workshop Korea, April 7th, 2014., volume 1141 of CEUR Workshop on Effective Tools and Methodologies for Teaching Natural Proceedings, pages 64–65. CEUR-WS.org, 2014. URL http: Language Processing and Computational Linguistics - Volume //ceur-ws.org/Vol-1141/paper_13.pdf. 1, Philadelphia, Pennsylvania âA˘Tˇ July 07 - 07, 2002, pages [45] Ben Hachey, Joel Nothman, and Will Radford. Cheap and 63–70. ACL, 2002. DOI https://doi.org/10.3115/1118108. easy entity evaluation. In Proceedings of the 52nd Annual 1118117. Meeting of the Association for Computational Linguistics, ACL [54] Xiaoqiang Luo. On coreference resolution performance met- 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short rics. In HLT/EMNLP 2005, Human Language Technology Papers, pages 464–469. ACL, 2014. URL http://aclweb.org/ Conference and Conference on Empirical Methods in Natu- anthology/P/P14/P14-2076.pdf. ral Language Processing, Proceedings of the Conference, 6- [46] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Ha- 8 October 2005, Vancouver, British Columbia, Canada, pages gen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, 25–32. ACL, 2005. DOI https://doi.org/10.3115/1220575. Stefan Thater, and Gerhard Weikum. Robust disambiguation 1220579. of named entities in text. In Proceedings of the 2011 Con- [55] Paul McNamee, Heather Simpson, and Hoa Trang Dang. ference on Empirical Methods in Natural Language Process- Overview of the TAC 2009 Knowledge Base Population ing, EMNLP 2011, 27-31 July 2011, John McIntyre Confer- Track. Presentation at the Second Text Analysis Confer- ence Centre, Edinburgh, UK, A meeting of SIGDAT, a Special ence (TAC 2009), November 16-17, 2009, National Institute Interest Group of the ACL, pages 782–792. ACL, 2011. URL of Standards and Technology, Gaithersburg, Maryland, USA, http://www.aclweb.org/anthology/D11-1072. 2009. URL http://tac.nist.gov/publications/2009/presentations/ [47] Amir Hossein Jadidinejad. Unsupervised information extrac- TAC2009_KBP_overview.pdf. tion using BabelNet and DBpedia. In Amparo Elizabeth [56] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Chris- Cano Basave, Matthew Rowe, Milan Stankovic, and Aba-Sah tian Bizer. Dbpedia spotlight: shedding light on the web of doc- Dadzie, editors, Proceedings of the Concept Extraction Chal- uments. In Chiara Ghidini, Axel-Cyrille Ngonga Ngomo, Ste- lenge at the Workshop on ’Making Sense of Microposts’, Rio de fanie N. Lindstaedt, and Tassilo Pellegrini, editors, Proceed- Janeiro, Brazil, May 13, 2013, volume 1019 of CEUR Work- ings the 7th International Conference on Semantic Systems, I- shop Proceedings, pages 54–56. CEUR-WS.org, 2013. URL SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM http://ceur-ws.org/Vol-1019/paper_32.pdf. International Conference Proceeding Series, pages 1–8. ACM, [48] Heng Ji, Ralph Grishman, and Hoa Trang Dang. Overview 2011. DOI https://doi.org/10.1145/2063518.2063519. of the TAC2011 Knowledge Base Population (KBP) Track. [57] Pablo N. Mendes, Dirk Weissenborn, and Chris Hokamp. DB- Presentation at the Fourth Text Analysis Conference (TAC pedia Spotlight at the MSM2013 Challenge. In Amparo Eliz- 2011), November 14-15, 2011, National Institute of Standards abeth Cano, Matthew Rowe, Milan Stankovic, and Aba-Sah and Technology, Gaithersburg, Maryland, USA, 2009. URL Dadzie, editors, Proceedings of the Concept Extraction Chal- http://tac.nist.gov/publications/2011/presentations/KBP2011_ lenge at the Workshop on ’Making Sense of Microposts’, Rio de overview.presentation.pdf. Janeiro, Brazil, May 13, 2013, volume 1019 of CEUR Work- [49] Heng Ji, Joel Nothman, Ben Hachey, and Radu Florian. shop Proceedings, pages 57–61. CEUR-WS.org, 2013. URL Overview of TAC-KBP2015 tri-lingual entity discovery and http://ceur-ws.org/Vol-1019/paper_30.pdf. linking. In Proceedings of the Eighth Text Analysis Conference [58] Andrea Moro and Roberto Navigli. SemEval-2015 Task 13: (TAC 2015), November 16-17, 2015, National Institute of Multilingual all-words sense disambiguation and entity link- Standards and Technology, Gaithersburg, Maryland, USA, ing. In Daniel M. Cer, David Jurgens, Preslav Nakov, and 2015. URL http://tac.nist.gov/publications/2015/additional. Torsten Zesch, editors, Proceedings of the 9th International papers/TAC2015.KBP_Trilingual_Entity_Discovery_and_ Workshop on Semantic Evaluation, SemEval@NAACL-HLT Linking_overview.proceedings.pdf. 2015, Denver, Colorado, USA, June 4-5, 2015, pages 288– [50] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dim- 297. ACL, 2015. URL http://aclweb.org/anthology/S/S15/ itris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mo- S15-2049.pdf. hamed Morsey, Patrick van Kleef, Sören Auer, and Christian [59] Andrea Moro, Alessandro Raganato, and Roberto Navigli. Bizer. DBpedia - A large-scale, multilingual knowledge base Entity linking meets word sense disambiguation: a unified extracted from wikipedia. Semantic Web, 6(2):167–195, 2015. approach. Transactions of the Association for Computa- DOI https://doi.org/10.3233/SW-140134. tional Linguistics, 2:231–244, 2014. URL https://tacl2013.cs. [51] Chenliang Li, Jianshu Weng, Qi He, Yuxia Yao, Anwitaman columbia.edu/ojs/index.php/tacl/article/view/291. Datta, Aixin Sun, and Bu-Sung Lee. TwiNER: Named entity [60] Óscar Muñoz-García, Andrés García-Silva, and Óscar Corcho. recognition in targeted Twitter stream. In William R. Hersh, Towards concept identification using a knowledge-intensive Jamie Callan, Yoelle Maarek, and Mark Sanderson, editors, approach. In Amparo Elizabeth Cano Basave, Matthew Rowe, The 35th International ACM SIGIR conference on research and Milan Stankovic, and Aba-Sah Dadzie, editors, Proceedings of development in Information Retrieval, SIGIR ’12, Portland, the Concept Extraction Challenge at the Workshop on ’Making Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 31

Sense of Microposts’, Rio de Janeiro, Brazil, May 13, 2013, tion results to the Cloud. In Christian Bizer, volume 1019 of CEUR Workshop Proceedings, pages 45– Tom Heath, Tim Berners-Lee, and Michael Hausenblas, edi- 49. CEUR-WS.org, 2013. URL http://ceur-ws.org/Vol-1019/ tors, WWW2012 Workshop on Linked Data on the Web, Lyon, paper_29.pdf. France, 16 April, 2012, volume 937 of CEUR Workshop [61] Diego Marinho de Oliveira, Alberto H. F. Laender, Adriano Proceedings. CEUR-WS.org, 2012. URL http://ceur-ws.org/ Veloso, and Altigran Soares da Silva. Filter-stream named Vol-937/ldow2012-paper-02.pdf. entity recognition: A case study at the #MSM2013 Concept [69] Giuseppe Rizzo, Marieke van Erp, and Raphaël Troncy. Extraction Challenge. In Amparo Elizabeth Cano Basave, Benchmarking the extraction and disambiguation of named en- Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie, edi- tities on the Semantic Web. In Nicoletta Calzolari, Khalid tors, Proceedings of the Concept Extraction Challenge at the Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Workshop on ’Making Sense of Microposts’, Rio de Janeiro, Joseph Mariani, Asunción Moreno, Jan Odijk, and Stelios Brazil, May 13, 2013, volume 1019 of CEUR Workshop Pro- Piperidis, editors, Proceedings of the Ninth International ceedings, pages 71–75. CEUR-WS.org, 2013. URL http:// Conference on Language Resources and Evaluation, LREC ceur-ws.org/Vol-1019/paper_34.pdf. 2014, Reykjavik, Iceland, May 26-31, 2014., pages 4593– [62] Miles Osborne, Sasa Petrovic, Richard McCreadie, Craig Mac- 4600. European Language Resources Association (ELRA), donald, and Iadh Ounis. Bieber no more: First story detection 2014. URL http://www.lrec-conf.org/proceedings/lrec2014/ using Twitter and Wikipedia. In Fernando Diaz, Susan Du- summaries/176.html. mais, Kira Radinsky, Maarten de Rijke, and Milad Shokouhi, [70] Giuseppe Rizzo, Amparo Elizabeth Cano Basave, Bianca editors, SIGIR 2012 Workshop on Time-aware Information Ac- Pereira, and Andrea Varga. Making Sense of Microposts (#Mi- cess (#TAIA2012), Portland, Oregon, USA, August 16, 2012, croposts2015) Named Entity rEcognition and Linking (NEEL) 2012. Challenge. In Matthew Rowe, Milan Stankovic, and Aba-Sah [63] Sasa Petrovic, Miles Osborne, and Victor Lavrenko. Stream- Dadzie, editors, Proceedings of the the 5th Workshop on Mak- ing first story detection with application to Twitter. In Human ing Sense of Microposts co-located with the 24th International Language Technologies: Conference of the North American World Wide Web Conference (WWW 2015), Florence, Italy, Chapter of the Association of Computational Linguistics, Pro- May 18th, 2015., volume 1395 of CEUR Workshop Proceed- ceedings, June 2-4, 2010, Los Angeles, California, USA, pages ings, pages 44–53. CEUR-WS.org, 2015. URL http://ceur-ws. 181–189. ACL, 2010. URL http://www.aclweb.org/anthology/ org/Vol-1395/microposts2015_neel-challenge-report/. N10-1021. [71] Giuseppe Rizzo, Marieke van Erp, Julien Plu, and Raphaël [64] Lev-Arie Ratinov and Dan Roth. Design challenges and mis- Troncy. Making Sense of Microposts (#Microposts2016) conceptions in named entity recognition. In Suzanne Steven- Named Entity rEcognition and Linking (NEEL) Challenge. In son and Xavier Carreras, editors, Proceedings of the Thirteenth Aba-Sah Dadzie, Daniel Preotiuc-Pietro, Danica Radovanovic, Conference on Computational Natural Language Learning, Amparo Elizabeth Cano Basave, and Katrin Weller, editors, CoNLL 2009, Boulder, Colorado, USA, June 4-5, 2009, pages Proceedings of the 6th Workshop on ’Making Sense of Micro- 147–155. ACL, 2009. DOI https://doi.org/10.3115/1596374. posts’ co-located with the 25th International World Wide Web 159639. Conference (WWW 2016), Montréal, Canada, April 11, 2016., [65] Lev-Arie Ratinov, Dan Roth, Doug Downey, and Mike An- volume 1691 of CEUR Workshop Proceedings, pages 50– derson. Local and global algorithms for disambiguation to 59. CEUR-WS.org, 2016. URL http://ceur-ws.org/Vol-1691/ wikipedia. In Dekang Lin, Yuji Matsumoto, and Rada Mi- microposts2016_neel-challenge-report/. halcea, editors, The 49th Annual Meeting of the Association [72] Sandhya Sachidanandan, Prathyush Sambaturu, and Ka- for Computational Linguistics: Human Language Technolo- malakar Karlapalem. NERTUW: named entity recognition on gies, Proceedings of the Conference, 19-24 June, 2011, Port- tweets using Wikipedia. In Amparo Elizabeth Cano Basave, land, Oregon, USA, pages 1375–1384. ACL, 2011. URL Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie, edi- http://www.aclweb.org/anthology/P11-1138. tors, Proceedings of the Concept Extraction Challenge at the [66] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. Named Workshop on ’Making Sense of Microposts’, Rio de Janeiro, entity recognition in tweets: An experimental study. In Pro- Brazil, May 13, 2013, volume 1019 of CEUR Workshop Pro- ceedings of the 2011 Conference on Empirical Methods in Nat- ceedings, pages 67–70. CEUR-WS.org, 2013. URL http:// ural Language Processing, EMNLP 2011, 27-31 July 2011, ceur-ws.org/Vol-1019/paper_35.pdf. John McIntyre Conference Centre, Edinburgh, UK, A meeting [73] Erik F. Tjong Kim Sang and Fien De Meulder. Introduc- of SIGDAT, a Special Interest Group of the ACL, pages 1524– tion to the CoNLL-2003 shared task: Language-independent 1534. ACL, 2011. URL http://www.aclweb.org/anthology/ named entity recognition. In Walter Daelemans and Miles Os- D11-1141. borne, editors, Proceedings of the Seventh Conference on Nat- [67] Giuseppe Rizzo and Raphaël Troncy. NERD: A frame- ural Language Learning, CoNLL 2003, Held in cooperation work for unifying named entity recognition and disambigua- with HLT-NAACL 2003, Edmonton, Canada, May 31 - June tion extraction tools. In Walter Daelemans, Mirella Lapata, 1, 2003, pages 142–147. ACL, 2003. URL http://aclweb.org/ and Lluís Màrquez, editors, EACL 2012, 13th Conference of anthology/W/W03/W03-0419.pdf. the European Chapter of the Association for Computational [74] Ugo Scaiella, Michele Barbera, Stefano Parmesan, Gaetano Linguistics, Avignon, France, April 23-27, 2012, pages 73– Prestia, Emilio Del Tessandoro, and Mario Verí. DataTXT 76. ACL, 2012. URL http://aclweb.org/anthology-new/E/E12/ at #Microposts2014 Challenge. In Matthew Rowe, Milan E12-2015.pdf. Stankovic, and Aba-Sah Dadzie, editors, Proceedings of the the [68] Giuseppe Rizzo, Raphaël Troncy, Sebastian Hellmann, and 4th Workshop on Making Sense of Microposts co-located with Martin Brümmer. NERD meets NIF: Lifting NLP extrac- the 23rd International World Wide Web Conference (WWW 32 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series

2014), Seoul, Korea, April 7th, 2014., volume 1141 of CEUR Appendix Workshop Proceedings, pages 66–67. CEUR-WS.org, 2014. URL http://ceur-ws.org/Vol-1141/paper_19.pdf. A. NEEL Taxonomy [75] Priyanka Sinha and Biswanath Barik. Named entity extrac- tion and linking in #microposts. In Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie, editors, Proceedings of the the Thing 5th Workshop on Making Sense of Microposts co-located with languages the 24th International World Wide Web Conference (WWW ethnic groups 2015), Florence, Italy, May 18th, 2015., volume 1395 of CEUR Workshop Proceedings, pages 66–67. CEUR-WS.org, 2015. nationalities URL http://ceur-ws.org/Vol-1395/paper_19.pdf. religions [76] Pablo Torres-Tramón, Hugo Hromic, Brian Walsh, diseases Bahareh Rahmanzadeh Heravi, and Conor Hayes. sports Kanopy4Tweets: Entity extraction and linking for Twit- astronomical objects ter. In Aba-Sah Dadzie, Daniel Preotiuc-Pietro, Danica Radovanovic, Amparo Elizabeth Cano Basave, and Katrin Weller, editors, Proceedings of the 6th Workshop on ’Mak- Examples: ing Sense of Microposts’ co-located with the 25th Interna- If all the #[Sagittarius] in the tional World Wide Web Conference (WWW 2016), Montréal, world Canada, April 11, 2016., volume 1691 of CEUR Workshop Jon Hamm is an [American] actor Proceedings, pages 64–66. CEUR-WS.org, 2016. URL http://ceur-ws.org/Vol-1691/paper_13.pdf. Event [77] Andrea Varga, Amparo Elizabeth Cano Basave, Matthew Rowe, Fabio Ciravegna, and Yulan He. Linked knowledge holidays sources for topic classification of microposts: A semantic sport events graph-based approach. Journal of Web Semantics, 26:36–57, political events 2014. DOI https://doi.org/10.1016/j.websem.2014.04.001. social events [78] Jörg Waitelonis and Harald Sack. Named entity linking in #tweets with KEA. In Aba-Sah Dadzie, Daniel Preotiuc- Pietro, Danica Radovanovic, Amparo Elizabeth Cano Basave, Examples: and Katrin Weller, editors, Proceedings of the 6th Workshop [London Riots] on ’Making Sense of Microposts’ co-located with the 25th In- [2nd World War] ternational World Wide Web Conference (WWW 2016), Mon- [Tour de France] tréal, Canada, April 11, 2016. CEUR Work- , volume 1691 of [Christmas] shop Proceedings, pages 61–63. CEUR-WS.org, 2016. URL http://ceur-ws.org/Vol-1691/paper_14.pdf. [Thanksgiving] occurs the ... [79] Ronald E. Walpole and Raymond H. Myers. Probability and Statistics for Engineers & Scientists. Pearson Education Inter- Character national, eight edition, 2007. fictional characters [80] Ikuya Yamada, Hideaki Takeda, and Yoshiyasu Takefuji. An end-to-end entity linking approach for tweets. In Matthew comic characters Rowe, Milan Stankovic, and Aba-Sah Dadzie, editors, Pro- title characters ceedings of the the 5th Workshop on Making Sense of Micro- posts co-located with the 24th International World Wide Web Examples: Conference (WWW 2015), Florence, Italy, May 18th, 2015., [Batman] volume 1395 of CEUR Workshop Proceedings, pages 55– [Wolverine] 56. CEUR-WS.org, 2015. URL http://ceur-ws.org/Vol-1395/ paper_16.pdf. [Donald Draper] [81] Mohamed Amir Yosef, Johannes Hoffart, Yusra Ibrahim, [Harry Potter] is the strongest Artem Boldyrev, and Gerhard Weikum. Adapting AIDA for wizard in the school tweets. In Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie, editors, Proceedings of the the 4th Workshop on Mak- Location ing Sense of Microposts co-located with the 23rd International World Wide Web Conference (WWW 2014), Seoul, Korea, April public places (squares, opera houses, museums, 7th, 2014., volume 1141 of CEUR Workshop Proceedings, schools, markets, airports, stations, swimming pools, pages 68–69. CEUR-WS.org, 2014. URL http://ceur-ws.org/ hospitals, sports facilities, youth centers, parks, town Vol-1141/paper_15.pdf. halls, theatres, cinemas, galleries, universities, churches, medical centers, parking lots, cemeteries) regions (villages, towns, cities, provinces, coun- tries, continents, dioceses, parishes) commercial places Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 33

(pubs, restaurants, depots, hostels, hotels, industrial books, blogs) parks, nightclubs, music venues, bike shops) devices (cars, vehicles, electronic devices) buildings (houses, monasteries, creches, mills, operating systems army barracks, castles, retirement homes, towers, programming languages halls, rooms, vicarages, courtyards) Examples: Examples: Apple has updated [Mac Os X] [Miami] Big crowd at the [Today Show] Paul McCartney at [Yankee Stadium] [Harry Potter] has beaten any president of [united states] records Five New [Apple Retail Store] Washington’s program [Prism] Opening Around

Organization B. NEEL Challenge Annotation Guidelines companies (press agencies, studios, banks, stock markets, manufacturers, cooperatives) The challenge task consists of three consecutive subdivisions of companies stages: 1) extraction and typing of entity mentions brands within a tweet; 2) link of each mention to an entry political parties in the English DBpedia39 representing the same real government bodies (ministries, councils, courts, world entity, or NIL in case such entry does not exist; political unions) press names (magazines, newspapers, journals) and 3) clustering of all mentions linked to NIL. Thus, public organizations (schools, universities, chari- the same entity, which does not have a corresponding ties) entry in DBpedia, will be referenced with the same collections of people (sport teams, associations, the- NIL identifier. This section introduces various defini- ater companies, religious orders, youth organizations, tions relevant to this task. musical bands) B.1. Named Entity Examples: [Apple] has updated Mac Os X A named entity, in the context of the NEEL Chal- [Celtics] won against lenge series, is a phrase representing the name, exclud- [Police] intervene after ing the preceding definite article (i.e. “the”) and any disturbances other pre-posed (e.g. “Dr”, “Mr”) or post-posed mod- [Prism] performed in Washington ifiers, that belongs to a class of the NEEL Taxonomy [US] has beaten the Japanese team (Appendix A) and is linked to a DBpedia resource. Compound entities should be annotated in isolation. Person people’s names (titles and roles are not included, B.2. Mentions and Typification such as Dr. or President) In this task we consider that an entity may be ref- Examples: erenced in a tweet as a proper noun or acronym if: [Barack Obama] is the current [Jon Hamm] is an American actor 1. it belongs to one of the categories specified in the 40 [Paul McCartney] at Yankee Stadium NEEL Taxonomy (Appendix A); call it [Lady Gaga] 39In the 2016 NEEL Challenge we used as referent Product knowledge base DBpedia 2015-04 http://wiki.dbpedia. movies org/dbpedia-data-set-2015-04, in 2015 we used DBpedia 2014 http://wiki.dbpedia.org/data-set-2014, while for the 2014 tv series edition we used DBpedia v3.9 http://wiki.dbpedia.org/data-set-39. music albums 40The typification was introduced since the 2015 NEEL Chal- press products (journals, newspapers, magazines, lenge. 34 Rizzo et al. / Lessons Learnt from the NEEL Challenge Series

2. it can be linked to a DBpedia entry or to a NIL ple entities therefore we do not consider it reference given the context of the tweet. as an entity mention and it should not be an- Notice: notated. – Pronouns (e.g., he, him): they are not considered ∗ Noun phrases referring to an entity: While mentions to entities in the context of this chal- noun phrases can be dereferenced to an ex- lenge. isting entity, we do not consider them as en- – Misspellings: lower cased words and compressed tity mentions. In such cases we only keep words (e.g. “c u 2night” rather than “see you “embedded” entity mentions. tonight”) are common in tweets. Thus, they are Tweet: head of sharm el sheikh still considered mentions if they can be directly hospital is DENYING ... mapped to proper nouns. – Complete Mentions: entity extents that are sub- In this case head of sharm el sheikh hos- strings of a complete named entity mention are pital refers to a Person-type however since not identified. In: it is not a proper noun we do not consider Tweet: Barack Obama gives a it as an entity mention. For that reason in speech at NATO this case the annotation should only contain you could not select either of the words Barack the embedded entity [sharm el sheikh hos- or Obama by themselves. This is because they pital]: [sharm el sheikh hospital, Organiza- constitute a substring of the full mention [Barack tion, dbr:Sharm_International_Hospital] Obama]. However, in the text: “Barack was born in the city, at which time his parents named him ∗ Noun phrases completing the definition of Obama” both of the terms [Barack] and [Obama] an entity. should be selected as entity mentions. Tweet: The best Panasonic – Overlapping mentions: nested entities with qual- LUMIX digital camera from a ifiers should be considered as independent enti- wide range of models ties. Tweet: Alabama CF Taylor Dugas While digital camera describes the entity has decided to end negotiations Panasonic LUMIX it is not considered with the Cubs and will return within the entity annotation. In this case the to Alabama for his senior annotation should be [Panasonic, ORG, dbr: season. #bamabaseball Panasonic] [LUMIX, Product, dbr:Lumix] In this case the [Alabama CF] entity quali- fies [Taylor Dugas], the annotation for such a – Entity Typing by Context: entity mentions in a case should be: [Alabama CF, Organization, dbr: tweet can be typified based on the context in Alabama_Crimson_Tide41] [Taylor Dugas, Per- which they are used. son, NIL] Tweet: Five New Apple Retail – Noun phrases: these are not considered as entity Stores Opening Around the mentions. World: As we reported, Apple is opening 5 new retail stores on ∗ Ambiguous noun phrases: ... Tweet: I am happy that an #asian team have won the In this case [Apple Retail Stores] refers to a womens world cup! After just Location-type while the second [Apple] mention returning from #asia i have refers to an Organisation-type. seen how special you all are! Congrats B.3. Special Cases in Social Media (#s and @s) While [asian team] could be considered as an Organisation-type it can refer to multi- Entities may be referenced in a tweet preceded by hashtags and @s or composed by hashtagged and @- 41dbr is the prefix of http://dbpedia.org/resource/ nouns: Rizzo et al. / Lessons Learnt from the NEEL Challenge Series 35

Tweets: in Social Media. For example the use of “SFGiants” #[Obama] is proud to support the for referring to “the San Francisco Giants”. For these Respect for Marriage Act cases we coreferenced the nickname to the entity it #[Barack Obama] is proud to support refers to in the context of the tweet. the Respect for Marriage Act Tweet: #[Panda] with 3 straight hits @[BarackObama] is proud to support to give #[SFGiants] 6-1 lead in the Respect for Marriage Act 12th Hashtags can refer to entities, but it does not mean [Panda] is linked to http://dbpedia.org/page/Pablo_ that all hashtags will be considered entities. We con- Sandoval and [SFGiants] is linked to http://dbpedia. sider the following cases: org/page/San_Francisco_Giants. – Hashtagged nouns and noun-phrases. B.5. Links Tweet: I burned the cake again. #fail the hashtag #fail does not represent an entity as The NEEL Challenge series use the English DBpe- defined in the Guidelines. Thus, it should not be dia as Knowledge Base for linking. given as an entity mention. DBpedia is a widely available Linked Data dataset – Partially tagged entities. and is composed by a series of RDF (Resource De- Tweet: Congrats to Wayne Gretzky, scription Framework) resources. Each resource is his son Trevor has officially uniquely identified by a URI (Uniform Resource Iden- signed with the Chicago @Cubs tifier). A single RDF resource can be represented by a today series of triples of the type < s, p, o > where s con- Here Chicago @Cubs refers to the proper noun tains the identifier of the resource (to be linked with a characterising the Chicago Cubs entity (Notice mention), p contains the identifier for a property and that in this case that Chicago is not a qualifier o may contain a literal value or a reference to another but rather part of the entity mention). In this case resource. the annotation should be [Chicago,Organization, In this challenge, a mention in a tweet should be dbr:Chicago_Cubs] [Cubs, Organization, dbr: linked to the identifier of a resource (i.e. the s in a Chicago_Cub]. triple) if available. Note that in DBpedia there are cases – Tagged entities: If a proper noun is splitted and where one resource does not represent an entity, in- tagged with two hashtags, the entity mention stead it represents an ambiguity case (disambiguation should be splitted into two separate mentions. resource), a category, or it just redirects to another re- Tweet: #Amy #Winehouse source. In this challenge, only the final IRI properly In this case the annotation should be [Amy, Per- describing a real world entity (i.e. containing their de- son, dbr:Amy_Winehouse] [Winehouse, Person, scriptive attributes as well as relations to other enti- dbr:Amy_Winehouse] ties) is considered for linking. Thus, if there is a redi- rection chain given by the property wikiPageRedirects, Note that the characters # and @ should not be in- the correct IRI is the one at the end of this redirection cluded in the annotation string. chain. B.4. Use of Nicknames If the resource is not available, it is asked to pro- vide a “NIL” plus an alphanumeric code (such as NIL1 The use of nicknames (i.e. descriptive names re- or NILA) that identifies uniquely the entity in the an- placing the actual name of an entity) are commonplace notated set.