Using Embeddings for Both Entity Recognition and Linking in Tweets

Giuseppe Attardi, Daniele Sartiano, Maria Simi, Irene Sucameli
Dipartimento di Informatica, Università di Pisa
Largo B. Pontecorvo, 3 I-56127 Pisa, Italy
{attardi, sartiano, simi}@di.unipi.it, [email protected]

Abstract

English. The paper describes our submissions to the task on Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) at Evalita 2016. Our approach relies on a technique of Named Entity tagging that exploits both character-level and word-level embeddings. Character-based embeddings allow learning the idiosyncrasies of the language used in tweets. Using a full-blown Named Entity tagger allows recognizing a wider range of entities than those well known by their presence in a Knowledge Base or gazetteer. Our submissions achieved the first, second and fourth top official scores.

Italiano. The paper describes our participation in the Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) task at Evalita 2016. Our approach is based on the use of a Named Entity tagger that exploits both character-level and word-level embeddings. The former make it possible to learn the idiosyncrasies of tweet writing. The use of a full tagger makes it possible to recognize a wider range of entities than those known through their presence in a Knowledge Base or gazetteer. The submitted runs obtained the first, second and fourth of the official scores.

1 Introduction

Most approaches to entity linking in the current literature split the task into two equally important but distinct problems: mention detection is the task of extracting surface form candidates that correspond to entities in the domain of interest; entity disambiguation is the task of linking an extracted mention to a specific instance of an entity in a knowledge base.

Most approaches to mention detection rely on some sort of fuzzy matching between n-grams in the source text and a list of known entities (Rizzo et al., 2015). These solutions suffer severe limitations when dealing with Twitter posts, since the posts' vocabulary is quite varied, the writing is irregular, with variants and misspellings, and entities are often not present in official resources like DBpedia.

Detecting the correct entity mention is however crucial: Ritter et al. (2011), for example, report a 0.67 F1 score on named entity segmentation, but an 85% accuracy once the correct entity mention is detected, just by a trivial disambiguation that maps it to the most popular entity.

We explored an innovative approach to mention detection, which relies on a technique of Named Entity tagging that exploits both character-level and word-level embeddings. Character-level embeddings allow learning the idiosyncrasies of the language used in tweets. Using a full-blown Named Entity tagger allows recognizing a wider range of entities than those well known by their presence in a Knowledge Base or gazetteer. Another advantage of the approach is that no pre-built resource is required in order to perform the task, minimal preprocessing is required on the input text, and no manual feature extraction nor feature engineering is required.

We also exploit embeddings for disambiguation and entity linking, proposing the first approach that, to the best of our knowledge, uses only embeddings for both entity recognition and linking.

We report the results of our experiments with this approach on the Evalita 2016 NEEL-IT task. Our submissions achieved the first, second and fourth top official scores.

2 Task Description

The NEEL-IT task consists of annotating named entity mentions in tweets and disambiguating them by linking them to their corresponding entry in a knowledge base (DBpedia).

According to the task Annotation Guidelines (NEEL-IT Guidelines, 2016), a mention is a string in the tweet representing a proper noun or an acronym that represents an entity belonging to one of seven given categories (Thing, Event, Character, Location, Organization, Person and Product). Concepts that belong to one of the categories but are missing from DBpedia are to be tagged as NIL. Moreover, "The extent of an entity is the entire string representing the name, excluding the preceding definite article".

The Knowledge Base onto which to link entities is the Italian DBpedia 2015-10; however, the concepts must be annotated with the canonicalized dataset of DBpedia 2015, which is an English one. Therefore, although the tweets are in Italian, for unexplained reasons the links must refer to English entities.

3 Building a larger resource

The training set provided by the organizers consists of just 1629 tweets, which are insufficient for properly training a NER on the 7 given categories.

We thus decided to exploit also the training set of the Evalita 2016 PoSTWITA task, which consists of 6439 Italian tweets tokenized and gold annotated with PoS tags. This allowed us to concentrate on proper nouns and well defined entity boundaries in the manual annotation process of named entities. We used the combination of these two sets to train a first version of the NER.

We then performed a sort of active learning step, applying the trained NER tagger to a set of over 10 thousand tweets and manually correcting 7100 of these by a team of two annotators.

These tweets were then added to the training set of the task and to the PoSTWITA annotated training set, obtaining our final training corpus of 13,945 tweets.

4 Description of the system

Our approach to Named Entity Extraction and Linking consists of the following steps:

- Train word embeddings on a large corpus of Italian tweets.

- Train a bidirectional LSTM character-level Named Entity tagger, using the pre-trained word embeddings.

- Build a dictionary mapping titles of the Italian DBpedia to pairs consisting of the corresponding title in the English DBpedia 2011 release and its NEEL-IT category. This helps translating the Italian titles into the requested English titles. Examples of entries in this dictionary are:
  Cristoforo_Colombo (http://dbpedia.org/resource/Christopher_Columbus, Person)
  Milano (http://dbpedia.org/resource/Milan, Location)

- From all the anchor texts from articles of the Italian Wikipedia, select those that link to a page that is present in the above dictionary. For example, this dictionary contains:
  Person Cristoforo_Colombo Colombo

- Create word embeddings from the Italian Wikipedia.

- For each page whose title is present in the above dictionary, extract its abstract, compute the average of the word embeddings of its tokens, and store it in a table that associates it to the URL of the same dictionary.

- Perform Named Entity tagging on the test set.

- For each extracted entity, compute the average of the word embeddings for a context of words of size c before and after the entity.

- Annotate the mention with the DBpedia entity whose l2 distance is smallest among those of the abstracts computed before.

- For Twitter mentions, invoke the Twitter API to obtain the real name from the screen name, and set the category to Person if the real name is present in a gazetteer of names.

The last step is somewhat in contrast with the task guidelines (NEEL-IT Guidelines, 2016), which only contemplate annotating a Twitter mention as Person if it is recognizable on the spot as a known person name. More precisely: "If the mention contains the name and surname of a person, the name of a place, or an event, etc., it should be considered as a named entity", but "Should not be considered as named entity those aliases not universally recognizable or traceable back to a named entity, but should be tagged as entity those mentions that contain well known aliases. Then, @ValeYellow46 should not be tagged as it is not an alias for Valentino Rossi".

We decided instead that it would be more useful, for practical uses of the tool, to produce a more general tagger, capable of detecting mentions recognizable not only from syntactic features. Since this has affected our final score, we will present a comparison with results obtained by skipping this last step.
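The linking steps above can be sketched as follows. The averaged embedding of each entity abstract is compared, by l2 distance, with the averaged embedding of the mention's context, and the closest entity wins. The two-dimensional vectors, tokens and entity URLs below are toy stand-ins, not the actual Wikipedia embeddings (which are 100-dimensional) or DBpedia entries:

```python
import math

# Toy embeddings standing in for the Wikipedia word vectors.
EMB = {
    "città": [0.9, 0.1], "metropoli": [0.8, 0.2],
    "squadra": [0.1, 0.9], "calcio": [0.2, 0.8],
}

def avg_embedding(tokens, emb, dim=2):
    """Average the embeddings of the known tokens (zero vector if none)."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def link(context_tokens, abstracts, emb):
    """Pick the entity whose averaged abstract embedding is closest (l2)
    to the averaged embedding of the mention's context."""
    ctx = avg_embedding(context_tokens, emb)
    table = {url: avg_embedding(toks, emb) for url, toks in abstracts.items()}
    return min(table, key=lambda url: l2(table[url], ctx))

# Hypothetical abstracts for two candidate senses of "Lazio".
abstracts = {
    "dbpedia.org/resource/Lazio": ["città", "metropoli"],
    "dbpedia.org/resource/S.S._Lazio": ["squadra", "calcio"],
}
print(link(["squadra", "calcio"], abstracts, EMB))
# → dbpedia.org/resource/S.S._Lazio
```

In the submitted runs the context passed to such a procedure is a window of c = 8 tokens on each side of the extracted mention.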
4.1 Word Embeddings

The word embeddings for tweets have been created using the fastText utility1 by Bojanowski et al. (2016) on a collection of 141 million Italian tweets retrieved over the period from May to September 2016 using the Twitter API. Selection of Italian tweets was achieved by using a query containing a list of the 200 most common Italian words.

The text of tweets was split into sentences and tokenized using the sentence splitter and the tweet tokenizer from the linguistic pipeline Tanl (Attardi et al., 2010), replacing emoticons and emojis with a symbolic name starting with EMO_ and normalizing URLs. This preprocessing was performed by MapReduce on the large source corpora.

We produced two versions of the embeddings, one with dimension 100 and a second of dimension 200. Both used a window of 5 and retained words with a minimum count of 100, for a total of 245 thousand words.

Word embeddings for the Italian Wikipedia were created from text extracted from the Wikipedia dump of August 2016, using the WikiExtractor utility by Attardi (2009). The vectors were produced by means of the word2vec utility2, using the skipgram model, a dimension of 100, a window of 5, and a minimum occurrence of 50, retaining a total of 214,000 words.
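The tweet normalization described above can be sketched as follows; the emoticon table, the EMO_* names and the URL placeholder are illustrative stand-ins, not the actual Tanl pipeline:

```python
import re

# Illustrative emoticon-to-symbol table; the real pipeline covers many more.
EMOTICONS = {":)": "EMO_SMILE", ":(": "EMO_SAD", "<3": "EMO_HEART"}

URL_RE = re.compile(r"https?://\S+")

def normalize(tweet):
    tweet = URL_RE.sub("URL", tweet)        # normalize URLs to a placeholder
    for emo, name in EMOTICONS.items():     # replace emoticons with EMO_* symbols
        tweet = tweet.replace(emo, name)
    return tweet.split()                    # naive whitespace tokenization

print(normalize("Grande Mario :) guarda http://t.co/xyz"))
# → ['Grande', 'Mario', 'EMO_SMILE', 'guarda', 'URL']
```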

4.2 Bi-LSTM Character-level NER

Lample et al. (2016) propose a Named Entity Recognizer that obtains state-of-the-art performance in NER on the 4 CoNLL 2003 datasets without resorting to any language-specific knowledge or resources such as gazetteers. In order to take into account the fact that named entities often consist of multiple tokens, the algorithm exploits a bidirectional LSTM with a sequential conditional random field layer above it. Character-level features are learned while training, instead of hand-engineering prefix and suffix information about words. Learning character-level embeddings has the advantage of producing representations specific to the task and domain at hand. They have been found useful for morphologically rich languages and to handle the out-of-vocabulary problem for tasks like POS tagging and language modeling (Ling et al., 2015) or dependency parsing (Ballesteros et al., 2015).

The character-level embeddings are given to bidirectional LSTMs and then concatenated with the embedding of the whole word to obtain the final word representation, as described in Figure 1.

Figure 1. The embeddings for the word "Mario" are obtained by concatenating the two bidirectional LSTM character-level embeddings with the whole word embeddings.

The architecture of the NER tagger is described in Figure 2.
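The role of the conditional random field layer, which scores whole tag sequences rather than individual tags so that, for instance, an entity-internal tag cannot follow O, can be illustrated with a minimal Viterbi decoder. The tag set, emission scores and transition scores below are made up for illustration and are not the trained model's parameters:

```python
TAGS = ["O", "B-PER", "I-PER"]

# Uniform transition scores, except that I-PER after O is forbidden.
TRANS = {(a, b): 0.0 for a in TAGS for b in TAGS}
TRANS[("O", "I-PER")] = -100.0

def viterbi(emissions):
    """emissions: one {tag: score} dict per token; returns the best tag path."""
    best = {t: (emissions[0][t], [t]) for t in TAGS}
    for em in emissions[1:]:
        new = {}
        for t in TAGS:
            # Best previous tag given the transition into t.
            prev = max(TAGS, key=lambda p: best[p][0] + TRANS[(p, t)])
            score = best[prev][0] + TRANS[(prev, t)] + em[t]
            new[t] = (score, best[prev][1] + [t])
        best = new
    return max(best.values(), key=lambda sp: sp[0])[1]

ems = [{"O": 0.1, "B-PER": 2.0, "I-PER": 0.0},
       {"O": 0.5, "B-PER": 0.2, "I-PER": 1.0}]
print(viterbi(ems))  # ['B-PER', 'I-PER']
```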

Figure 2. Architecture of the NER
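The concatenation shown in Figure 1 can be sketched structurally as follows. The character "encoder" here is a trivial deterministic fold standing in for the learned LSTM, chosen only so that the forward and backward passes produce different values:

```python
def char_encode(chars):
    """Order-sensitive fold standing in for one LSTM pass over characters."""
    state = 0.0
    for c in chars:
        state = state * 0.5 + ord(c) / 1000.0
    return [state]

def word_representation(word, word_emb):
    forward = char_encode(word)              # left-to-right character pass
    backward = char_encode(reversed(word))   # right-to-left character pass
    return forward + backward + word_emb     # concatenation, as in Figure 1

rep = word_representation("Mario", [0.3, 0.7])
print(len(rep))  # 1 + 1 + 2 = 4 components
```

In the actual tagger each of the three parts is a learned vector of much higher dimension (the character embeddings have dimension 25, see Table 3), but the representation is assembled in the same way.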

1 https://github.com/facebookresearch/fastText.git
2 https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip

5 Experiments

Since Named Entity tagging is the first step of our technique and hence its accuracy affects the overall results, we present separately the evaluation of the NER tagger.

Here are the results of the NER tagger on a development set of 1523 tweets, randomly extracted from the full training set:

Category      Precision  Recall  F1
Character         50.00   16.67  25.00
Event             92.48   87.45  89.89
Location          77.51   75.00  76.24
Organization      88.30   78.13  82.91
Person            73.71   88.26  88.33
Product           65.48   60.77  63.04
Thing             50.00   36.84  42.42

Table 1. NER accuracy on devel set.

On the subset of the test set used in the evaluation, which consists of 301 tweets, the NER performs as follows:

Category      Precision  Recall  F1
Character          0.00    0.00   0.00
Event              0.00    0.00   0.00
Location          72.73   61.54  66.67
Organization      63.46   44.59  52.38
Person            76.07   67.42  71.49
Product           32.26   27.78  29.85
Thing              0.00    0.00   0.00

Table 2. NER accuracy on gold test set.

In the disambiguation and linking process, we experimented with several values of the context size c of words around the mentions (4, 8 and 10) and eventually settled on a value of 8 in the submitted runs.

6 Results

We submitted three runs. The three runs have in common the following parameters for training the NER:

Character embeddings dimension  25
Dropout                         0.5
Learning rate                   0.001
Training set size               12,188

Table 3. Training parameters for NER.

Specific parameters of the individual runs are:

- UniPI.1: twitter embeddings with dimension 100, disambiguation by frequency of mention in Wikipedia anchors.

- UniPI.2: twitter embeddings with dimension 100, disambiguation with Wikipedia embeddings.

- UniPI.3: twitter embeddings with dimension 200, disambiguation with Wikipedia embeddings, training set with geographical entities more properly annotated as Location (e.g. Italy).

The runs achieved the scores listed in the following table:

Run         Mention ceaf  Strong typed mention match  Strong link match  Final score
UniPI.3     0.561         0.474                       0.456              0.5034
UniPI.1     0.561         0.466                       0.443              0.4971
Team2.base  0.530         0.472                       0.477              0.4967
UniPI.2     0.561         0.463                       0.443              0.4962
Team3.3     0.585         0.516                       0.348              0.4932

Table 4. Top official task results.

The final score is computed as follows:

  final score = 0.4 * mention_ceaf
              + 0.3 * strong_typed_mention_match
              + 0.3 * strong_link_match

As mentioned, our tagger performs an extra effort in trying to determine whether Twitter mentions indeed represent Person or Organization entities. In order to check how this influences our result, we also evaluate a version of the UniPI.3 run without the extra step of mention type identification. The results are reported in the following table:

Run                            Mention ceaf  Strong typed mention match  Strong link match  Final score
UniPI.3 without mention check  0.616         0.531                       0.451              0.541

Table 5. Results of run without Twitter mention analysis.

On the other hand, if we manually correct the test set by annotating the Twitter mentions that indeed refer to Twitter users or organizations, the score for strong typed mention match increases to 0.645.
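The weighted combination above can be checked directly; the values below reproduce the UniPI.3 and Team3.3 rows of Table 4:

```python
# Weights from the task's final score definition.
def final_score(mention_ceaf, strong_typed_mention, strong_link):
    return (0.4 * mention_ceaf
            + 0.3 * strong_typed_mention
            + 0.3 * strong_link)

print(round(final_score(0.561, 0.474, 0.456), 4))  # 0.5034 (UniPI.3)
print(round(final_score(0.585, 0.516, 0.348), 4))  # 0.4932 (Team3.3)
```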

7 Discussion

The effectiveness of the use of embeddings in disambiguation can be seen in the improvement in the strong link match score between runs UniPI.2 and UniPI.3. Examples where embeddings lead to better disambiguation are:

  Liverpool_F.C. vs Liverpool
  Italy_national_football_team vs Italy
  S.S._Lazio vs Lazio
  Diego_Della_Valle vs Pietro_Della_Valle
  Nobel_Prize vs Alfred_Nobel

There are many cases where the NER recognizes a Person, but the linker associates the name to a famous character, for example:

  Maria_II_of_Portugal for Maria
  Luke_the_Evangelist for Luca

The approach of using embeddings for disambiguation looks promising, but the abstract of an article sometimes does not provide appropriate evidence, since the style of Wikipedia typically involves providing meta-level information, such as the category of the concept. For example, disambiguation for "Coppa Italia" leads to "Italian_Basketball_Cup" rather than to "Italian_Football_Cup", since both are described as sport competitions. Selecting or collecting phrases that mention the concept, rather than define it, might lead to improved accuracy.

Using character-based embeddings and a large training corpus requires significant computational resources. We exploited a server equipped with nVidia Tesla K80 GPU accelerators. Nevertheless, training the LSTM NER tagger still required about 19 hours: without the GPU accelerator the training would have been impossible.

8 Related Work

Several papers discuss approaches to end-to-end entity linking (Cucerzan, 2007; Milne and Witten, 2008; Kulkarni et al., 2009; Ferragina and Scaiella, 2010; Han and Sun, 2011; Meij et al., 2012), but many heavily depend on Wikipedia text and might not work well on short and noisy tweets.

Most approaches to mention detection rely on some sort of fuzzy matching between n-grams in the source and a list of known entities (Rizzo et al., 2015).

Yamada et al. (2015) propose an end-to-end approach to entity linking that exploits word embeddings as features in a random-forest algorithm used for assigning a score to mention candidates, which are however identified by either exact or fuzzy matching on a mention-entity dictionary built from Wikipedia titles and anchor texts.

Guo et al. (2016) propose a structural SVM algorithm for entity linking that jointly optimizes mention detection and entity disambiguation as a single end-to-end task.

9 Conclusions

We presented an innovative approach to mention extraction and entity linking of tweets that relies on Deep Learning techniques. In particular, we use a Named Entity tagger for mention detection that exploits character-level embeddings in order to deal with the noise in the writing of tweet posts. We also exploit word embeddings as a measure of semantic relatedness for the task of entity linking.

As a side product, we produced a new gold resource of 13,609 tweets (242,453 tokens) annotated with NE categories, leveraging the resource distributed for the Evalita PoSTWITA task.

The approach achieved the top score in the Evalita 2016 NEEL-IT Challenge and looks promising for further enhancements.

Acknowledgments

We gratefully acknowledge the support of the University of Pisa through project PRA 2016 and of NVIDIA Corporation through the donation of a Tesla K40 GPU accelerator used in the experiments.

We thank Guillaume Lample for making available his implementation of the Named Entity Recognizer.

References

Giuseppe Attardi. 2009. WikiExtractor: A tool for extracting plain text from Wikipedia dumps. https://github.com/attardi/wikiextractor

Giuseppe Attardi, Stefano Dei Rossi, and Maria Simi. 2010. The Tanl Pipeline. In Proc. of LREC Workshop on WSPP, Malta.

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based dependency parsing by modeling characters instead of words with LSTMs. In Proceedings of EMNLP 2015.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. https://arxiv.org/abs/1607.04606

Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference of EMNLP-CoNLL, pages 708–716.

Evalita. 2016. NEEL-IT. http://neel-it.github.io

Paolo Ferragina and Ugo Scaiella. 2010. Tagme: on-the-fly annotation of short text fragments (by wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 1625–1628, New York, NY, USA. ACM.

Stephen Guo, Ming-Wei Chang, and Emre Kıcıman. 2016. To Link or Not to Link? A Study on End-to-End Tweet Entity Linking.

Xianpei Han and Le Sun. 2011. A generative entity-mention model for linking entities with knowledge base. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 945–954, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. 2009. Collective annotation of wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 457–466, New York, NY, USA. ACM.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of NAACL-HLT (NAACL 2016).

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Edgar Meij, Wouter Weerkamp, and Maarten de Rijke. 2012. Adding semantics to microblog posts. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pages 563–572, New York, NY, USA. ACM.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. arXiv:1310.4546

David Milne and Ian H. Witten. 2008. Learning to link with wikipedia. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), pages 509–518, New York, NY, USA. ACM.

NEEL-IT Annotation Guidelines. 2016. https://drive.google.com/open?id=1saUb2NSxml67pcrz3m_bMcibe1nST2CTedeKOBklKaI

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1524–1534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Giuseppe Rizzo, Amparo Elizabeth Cano Basave, Bianca Pereira, and Andrea Varga. 2015. Making sense of microposts (#Microposts2015) named entity recognition and linking (NEEL) challenge. In Proceedings of the 5th Workshop on Making Sense of Microposts, pages 44–53.

Ikuya Yamada, Hideaki Takeda, and Yoshiyasu Takefuji. 2015. An End-to-End Entity Linking Approach for Tweets. In Proceedings of the 5th Workshop on Making Sense of Microposts, Firenze, Italy.