Using Embeddings for Both Entity Recognition and Linking in Tweets
Total Page:16
File Type:pdf, Size:1020Kb
Using Embeddings for Both Entity Recognition and Linking in Tweets Giuseppe Attardi, Daniele Sartiano, Maria Simi, Irene Sucameli Dipartimento di Informatica Università di Pisa Largo B. Pontecorvo, 3 I-56127 Pisa, Italy {attardi, sartiano, simi}@di.unipi.it [email protected] correspond to entities in the domain of interest; Abstract entity disambiguation is the task of linking an extracted mention to a specific instance of an English. The paper describes our sub- entity in a knowledge base. missions to the task on Named Entity Most approaches to mention detection rely on rEcognition and Linking in Italian some sort of fuzzy matching between n-grams in Tweets (NEEL-IT) at Evalita 2016. Our the source text and a list of known entities (Rizzo approach relies on a technique of Named et al., 2015). These solutions suffer severe limita- Entity tagging that exploits both charac- tions when dealing with Twitter posts, since the ter-level and word-level embeddings. posts’ vocabulary is quite varied, the writing is Character-based embeddings allow learn- irregular, with variants and misspellings and en- ing the idiosyncrasies of the language tities are often not present in official resources used in tweets. Using a full-blown like DBpedia or Wikipedia. Named Entity tagger allows recognizing Detecting the correct entity mention is howev- a wider range of entities than those well er crucial: Ritter et al. (2011) for example report known by their presence in a Knowledge a 0.67 F1 score on named entity segmentation, Base or gazetteer. Our submissions but an 85% accuracy, once the correct entity achieved first, second and fourth top offi- mention is detected, just by a trivial disambigua- cial scores. tion that maps to the most popular entity. We explored an innovative approach to men- Italiano. L’articolo descrive la nostra tion detection, which relies on a technique of partecipazione al task di Named Entity Named Entity tagging that exploits both charac- rEcognition and Linking in Italian ter-level and word-level embeddings. Character- Tweets (NEEL-IT) a Evalita 2016. Il no- level embeddings allow learning the idiosyncra- stro approccio si basa sull’utilizzo di un sies of the language used in tweets. Using a full- Named Entity tagger che sfrutta embed- blown Named Entity tagger allows recognizing a dings sia character-level che word-level. wider range of entities than those well known by I primi consentono di apprendere le idio- their presence in a Knowledge Base or gazetteer. sincrasie della scrittura nei tweet. L’uso Another advantage of the approach is that no di un tagger completo consente di rico- pre-built resource is required in order to perform noscere uno spettro più ampio di entità the task, minimal preprocessing is required on rispetto a quelle conosciute per la loro the input text and no manual feature extraction presenza in Knowledge Base o gazetteer. nor feature engineering is required. Le prove sottomesse hanno ottenuto il We exploit embeddings also for disambigua- primo, secondo e quarto dei punteggi uf- tion and entity linking, proposing the first ap- ficiali. proach that, to the best of our knowledge, uses only embeddings for both entity recognition and 1 Introduction linking. We report the results of our experiments with Most approaches to entity linking in the current this approach on the task Evalita 2016 NEEL-IT. literature split the task into two equally important Our submissions achieved first, second and but distinct problems: mention detection is the fourth top official scores. task of extracting surface form candidates that 2 Task Description Train a bidirectional LSTM character- level Named Entity tagger, using the pre- The NEEL-IT task consists of annotating named trained word embeddings entity mentions in tweets and disambiguating them by linking them to their corresponding en- Build a dictionary mapping titles of the try in a knowledge base (DBpedia). Italian DBpedia to pairs consisting of the According to the task Annotation Guidelines corresponding title in the English DBpedia (NEEL-IT Guidelines, 2016), a mention is a 2011 release and its NEEL-IT category. string in the tweet representing a proper noun or This helps translating the Italian titles into an acronym that represents an entity belonging to the requested English titles. An example one of seven given categories (Thing, Event, of the entries in this dictionary are: Character, Location, Organization, Person and Cristoforo_Colombo Product). Concepts that belong to one of the cat- (http://dbpedia.org/resource/Christopher_Colu egories but miss from DBpedia are to be tagged mbus, Person) Milano (http://dbpedia.org/resource/Milan, Lo- as NIL. Moreover “The extent of an entity is the cation) entire string representing the name, excluding the preceding definite article”. From all the anchor texts from articles of The Knowledge Base onto which to link enti- the Italian Wikipedia, select those that link ties is the Italian DBpedia 2015-10, however the to a page that is present in the above dic- concepts must be annotated with the canonical- tionary. For example, this dictionary con- ized dataset of DBpedia 2015, which is an Eng- tains: lish one. Therefore, despite the tweets are in Ital- Person Cristoforo_Colombo Colombo ian, for unexplained reasons the links must refer Create word embeddings from the Italian to English entities. Wikipedia 3 Building a larger resource For each page whose title is present in the above dictionary, we extract its abstract The training set provided by the organizers con- and compute the average of the word em- sists of just 1629 tweets, which are insufficient beddings of its tokens and store it into a for properly training a NER on the 7 given cate- table that associates it to the URL of the gories. same dictionary We thus decided to exploit also the training set of the Evalita 2016 PoSTWITA task, which con- Perform Named Entity tagging on the test sists of 6439 Italian tweets tokenized and gold set annotated with PoS tags. This allowed us to con- For each extracted entity, compute the av- centrate on proper nouns and well defined entity erage of the word embeddings for a con- boundaries in the manual annotation process of text of words of size c before and after the named entities. entity. We used the combination of these two sets to train a first version of the NER. Annotate the mention with the DBpedia We then performed a sort of active learning entity whose lc2 distance is smallest step, applying the trained NER tagger to a set of among those of the abstracts computed be- over 10 thousands tweets and manually correct- fore. ing 7100 of these by a team of two annotators. For the Twitter mentions, invoke the Twit- These tweets were then added to the training ter API to obtain the real name from the set of the task and to the PoSTWITA annotated screen name, and set the category to Per- training set, obtaining our final training corpus of son if the real name is present in a gazet- 13,945 tweets. teer of names. 4 Description of the system The last step is somewhat in contrast with the task guidelines (NEEL-IT Guidelines, 2016), Our approach to Named Entity Extraction and which only contemplate annotating a Twitter Linking consists of the following steps: mention as Person if it is recognizable on the Train word embeddings on a large corpus spot as a known person name. More precisely “If of Italian tweets the mention contains the name and surname of a person, the name of a place, or an event, etc., it should be considered as a named entity”, but without resorting to any language-specific “Should not be considered as named entity those knowledge or resources such as gazetteers. aliases not universally recognizable or traceable In order to take into account the fact that named back to a named entity, but should be tagged as entities often consist of multiple tokens, the algo- entity those mentions that contains well known rithms exploits a bidirectional LSTM with a se- aliases. Then, @ValeYellow46 should not be quential conditional random layer above it. tagged as is not an alias for Valentino Rossi”. Character-level features are learned while train- We decided instead that it would have been ing, instead of hand-engineering prefix and suffix more useful, for practical uses of the tool, to pro- information about words. Learning character-level duce a more general tagger, capable of detecting embeddings has the advantage of learning repre- mentions recognizable not only from syntactic sentations specific to the task and domain at hand. features. Since this has affected our final score, They have been found useful for morphologically we will present a comparison with results ob- rich languages and to handle the out-of- tained by skipping this last step. vocabulary problem for tasks like POS tagging and language modeling (Ling et al., 2015) or de- 4.1 Word Embeddings pendency parsing (Ballesteros et al., 2015). The word embeddings for tweets have been created The character-level embedding are given to bi- using the fastText utility1 by Bojanowski et al. directional LSTMs and then concatenated with the (2016) on a collection of 141 million Italian tweets embedding of the whole word to obtain the final retrieved over the period from May to September word representation as described in Figure 1: 2016 using the Twitter API. Selection of Italian r e l tweets was achieved by using a query containing a Mario Mario Mario list of the 200 most common Italian words. r r r r r The text of tweets was split into sentences and Mar M a- Mar Ma M tokenized using the sentence splitter and the tweet tokenizer from the linguistic pipeline Tanl l l l l l (Attardi et al., 2010), replacing emoticons and M Ma Mar Mari Mari emojis with a symbolic name starting with EMO_ and normalizing URLs.