Using Embeddings for Both Entity Recognition and Linking in Tweets

Total Page:16

File Type:pdf, Size:1020Kb

Using Embeddings for Both Entity Recognition and Linking in Tweets Using Embeddings for Both Entity Recognition and Linking in Tweets Giuseppe Attardi, Daniele Sartiano, Maria Simi, Irene Sucameli Dipartimento di Informatica Università di Pisa Largo B. Pontecorvo, 3 I-56127 Pisa, Italy {attardi, sartiano, simi}@di.unipi.it [email protected] correspond to entities in the domain of interest; Abstract entity disambiguation is the task of linking an extracted mention to a specific instance of an English. The paper describes our sub- entity in a knowledge base. missions to the task on Named Entity Most approaches to mention detection rely on rEcognition and Linking in Italian some sort of fuzzy matching between n-grams in Tweets (NEEL-IT) at Evalita 2016. Our the source text and a list of known entities (Rizzo approach relies on a technique of Named et al., 2015). These solutions suffer severe limita- Entity tagging that exploits both charac- tions when dealing with Twitter posts, since the ter-level and word-level embeddings. posts’ vocabulary is quite varied, the writing is Character-based embeddings allow learn- irregular, with variants and misspellings and en- ing the idiosyncrasies of the language tities are often not present in official resources used in tweets. Using a full-blown like DBpedia or Wikipedia. Named Entity tagger allows recognizing Detecting the correct entity mention is howev- a wider range of entities than those well er crucial: Ritter et al. (2011) for example report known by their presence in a Knowledge a 0.67 F1 score on named entity segmentation, Base or gazetteer. Our submissions but an 85% accuracy, once the correct entity achieved first, second and fourth top offi- mention is detected, just by a trivial disambigua- cial scores. tion that maps to the most popular entity. We explored an innovative approach to men- Italiano. L’articolo descrive la nostra tion detection, which relies on a technique of partecipazione al task di Named Entity Named Entity tagging that exploits both charac- rEcognition and Linking in Italian ter-level and word-level embeddings. Character- Tweets (NEEL-IT) a Evalita 2016. Il no- level embeddings allow learning the idiosyncra- stro approccio si basa sull’utilizzo di un sies of the language used in tweets. Using a full- Named Entity tagger che sfrutta embed- blown Named Entity tagger allows recognizing a dings sia character-level che word-level. wider range of entities than those well known by I primi consentono di apprendere le idio- their presence in a Knowledge Base or gazetteer. sincrasie della scrittura nei tweet. L’uso Another advantage of the approach is that no di un tagger completo consente di rico- pre-built resource is required in order to perform noscere uno spettro più ampio di entità the task, minimal preprocessing is required on rispetto a quelle conosciute per la loro the input text and no manual feature extraction presenza in Knowledge Base o gazetteer. nor feature engineering is required. Le prove sottomesse hanno ottenuto il We exploit embeddings also for disambigua- primo, secondo e quarto dei punteggi uf- tion and entity linking, proposing the first ap- ficiali. proach that, to the best of our knowledge, uses only embeddings for both entity recognition and 1 Introduction linking. We report the results of our experiments with Most approaches to entity linking in the current this approach on the task Evalita 2016 NEEL-IT. literature split the task into two equally important Our submissions achieved first, second and but distinct problems: mention detection is the fourth top official scores. task of extracting surface form candidates that 2 Task Description Train a bidirectional LSTM character- level Named Entity tagger, using the pre- The NEEL-IT task consists of annotating named trained word embeddings entity mentions in tweets and disambiguating them by linking them to their corresponding en- Build a dictionary mapping titles of the try in a knowledge base (DBpedia). Italian DBpedia to pairs consisting of the According to the task Annotation Guidelines corresponding title in the English DBpedia (NEEL-IT Guidelines, 2016), a mention is a 2011 release and its NEEL-IT category. string in the tweet representing a proper noun or This helps translating the Italian titles into an acronym that represents an entity belonging to the requested English titles. An example one of seven given categories (Thing, Event, of the entries in this dictionary are: Character, Location, Organization, Person and Cristoforo_Colombo Product). Concepts that belong to one of the cat- (http://dbpedia.org/resource/Christopher_Colu egories but miss from DBpedia are to be tagged mbus, Person) Milano (http://dbpedia.org/resource/Milan, Lo- as NIL. Moreover “The extent of an entity is the cation) entire string representing the name, excluding the preceding definite article”. From all the anchor texts from articles of The Knowledge Base onto which to link enti- the Italian Wikipedia, select those that link ties is the Italian DBpedia 2015-10, however the to a page that is present in the above dic- concepts must be annotated with the canonical- tionary. For example, this dictionary con- ized dataset of DBpedia 2015, which is an Eng- tains: lish one. Therefore, despite the tweets are in Ital- Person Cristoforo_Colombo Colombo ian, for unexplained reasons the links must refer Create word embeddings from the Italian to English entities. Wikipedia 3 Building a larger resource For each page whose title is present in the above dictionary, we extract its abstract The training set provided by the organizers con- and compute the average of the word em- sists of just 1629 tweets, which are insufficient beddings of its tokens and store it into a for properly training a NER on the 7 given cate- table that associates it to the URL of the gories. same dictionary We thus decided to exploit also the training set of the Evalita 2016 PoSTWITA task, which con- Perform Named Entity tagging on the test sists of 6439 Italian tweets tokenized and gold set annotated with PoS tags. This allowed us to con- For each extracted entity, compute the av- centrate on proper nouns and well defined entity erage of the word embeddings for a con- boundaries in the manual annotation process of text of words of size c before and after the named entities. entity. We used the combination of these two sets to train a first version of the NER. Annotate the mention with the DBpedia We then performed a sort of active learning entity whose lc2 distance is smallest step, applying the trained NER tagger to a set of among those of the abstracts computed be- over 10 thousands tweets and manually correct- fore. ing 7100 of these by a team of two annotators. For the Twitter mentions, invoke the Twit- These tweets were then added to the training ter API to obtain the real name from the set of the task and to the PoSTWITA annotated screen name, and set the category to Per- training set, obtaining our final training corpus of son if the real name is present in a gazet- 13,945 tweets. teer of names. 4 Description of the system The last step is somewhat in contrast with the task guidelines (NEEL-IT Guidelines, 2016), Our approach to Named Entity Extraction and which only contemplate annotating a Twitter Linking consists of the following steps: mention as Person if it is recognizable on the Train word embeddings on a large corpus spot as a known person name. More precisely “If of Italian tweets the mention contains the name and surname of a person, the name of a place, or an event, etc., it should be considered as a named entity”, but without resorting to any language-specific “Should not be considered as named entity those knowledge or resources such as gazetteers. aliases not universally recognizable or traceable In order to take into account the fact that named back to a named entity, but should be tagged as entities often consist of multiple tokens, the algo- entity those mentions that contains well known rithms exploits a bidirectional LSTM with a se- aliases. Then, @ValeYellow46 should not be quential conditional random layer above it. tagged as is not an alias for Valentino Rossi”. Character-level features are learned while train- We decided instead that it would have been ing, instead of hand-engineering prefix and suffix more useful, for practical uses of the tool, to pro- information about words. Learning character-level duce a more general tagger, capable of detecting embeddings has the advantage of learning repre- mentions recognizable not only from syntactic sentations specific to the task and domain at hand. features. Since this has affected our final score, They have been found useful for morphologically we will present a comparison with results ob- rich languages and to handle the out-of- tained by skipping this last step. vocabulary problem for tasks like POS tagging and language modeling (Ling et al., 2015) or de- 4.1 Word Embeddings pendency parsing (Ballesteros et al., 2015). The word embeddings for tweets have been created The character-level embedding are given to bi- using the fastText utility1 by Bojanowski et al. directional LSTMs and then concatenated with the (2016) on a collection of 141 million Italian tweets embedding of the whole word to obtain the final retrieved over the period from May to September word representation as described in Figure 1: 2016 using the Twitter API. Selection of Italian r e l tweets was achieved by using a query containing a Mario Mario Mario list of the 200 most common Italian words. r r r r r The text of tweets was split into sentences and Mar M a- Mar Ma M tokenized using the sentence splitter and the tweet tokenizer from the linguistic pipeline Tanl l l l l l (Attardi et al., 2010), replacing emoticons and M Ma Mar Mari Mari emojis with a symbolic name starting with EMO_ and normalizing URLs.
Recommended publications
  • Predicting the Correct Entities in Named-Entity Linking
    Separating the Signal from the Noise: Predicting the Correct Entities in Named-Entity Linking Drew Perkins Uppsala University Department of Linguistics and Philology Master Programme in Language Technology Master’s Thesis in Language Technology, 30 ects credits June 9, 2020 Supervisors: Gongbo Tang, Uppsala University Thorsten Jacobs, Seavus Abstract In this study, I constructed a named-entity linking system that maps between contextual word embeddings and knowledge graph embeddings to predict correct entities. To establish a named-entity linking system, I rst applied named-entity recognition to identify the entities of interest. I then performed candidate gener- ation via locality sensitivity hashing (LSH), where a candidate group of potential entities were created for each identied entity. Afterwards, my named-entity dis- ambiguation component was performed to select the most probable candidate. By concatenating contextual word embeddings and knowledge graph embeddings in my disambiguation component, I present a novel approach to named-entity link- ing. I conducted the experiments with the Kensho-Derived Wikimedia Dataset and the AIDA CoNLL-YAGO Dataset; the former dataset was used for deployment and the later is a benchmark dataset for entity linking tasks. Three deep learning models were evaluated on the named-entity disambiguation component with dierent context embeddings. The evaluation was treated as a classication task, where I trained my models to select the correct entity from a list of candidates. By optimizing the named-entity linking through this methodology, this entire system can be used in recommendation engines with high F1 of 86% using the former dataset. With the benchmark dataset, the proposed method is able to achieve F1 of 79%.
    [Show full text]
  • Three Birds (In the LLOD Cloud) with One Stone: Babelnet, Babelfy and the Wikipedia Bitaxonomy
    Three Birds (in the LLOD Cloud) with One Stone: BabelNet, Babelfy and the Wikipedia Bitaxonomy Tiziano Flati and Roberto Navigli Dipartimento di Informatica Sapienza Universit`adi Roma Abstract. In this paper we present the current status of linguistic re- sources published as linked data and linguistic services in the LLOD cloud in our research group, namely BabelNet, Babelfy and the Wikipedia Bitaxonomy. We describe them in terms of their salient aspects and ob- jectives and discuss the benefits that each of these potentially brings to the world of LLOD NLP-aware services. We also present public Web- based services which enable querying, exploring and exporting data into RDF format. 1 Introduction Recent years have witnessed an upsurge in the amount of semantic information published on the Web. Indeed, the Web of Data has been increasing steadily in both volume and variety, transforming the Web into a global database in which resources are linked across sites. It is becoming increasingly critical that existing lexical resources be published as Linked Open Data (LOD), so as to foster integration, interoperability and reuse on the Semantic Web [5]. Thus, lexical resources provided in RDF format can contribute to the creation of the so-called Linguistic Linked Open Data (LLOD), a vision fostered by the Open Linguistic Working Group (OWLG), in which part of the Linked Open Data cloud is made up of interlinked linguistic resources [2]. The multilinguality aspect is key to this vision, in that it enables Natural Language Processing tasks which are not only cross-lingual, but also independent both of the language of the user input and of the linked data exploited to perform the task.
    [Show full text]
  • Cross Lingual Entity Linking with Bilingual Topic Model
    Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Cross Lingual Entity Linking with Bilingual Topic Model Tao Zhang, Kang Liu and Jun Zhao Institute of Automation, Chinese Academy of Sciences HaiDian District, Beijing, China {tzhang, kliu, jzhao}@nlpr.ia.ac.cn Abstract billion web pages on Internet and 31% of them are written in other languages. This report tells us that the percentage of Cross lingual entity linking means linking an entity web pages written in English are decreasing, which indicates mention in a background source document in one that English pages are becoming less dominant compared language with the corresponding real world entity in with before. Therefore, it becomes more and more important a knowledge base written in the other language. The to maintain the growing knowledge base by discovering and key problem is to measure the similarity score extracting information from all web contents written in between the context of the entity mention and the different languages. Also, inserting new extracted knowledge document of the candidate entity. This paper derived from the information extraction system into a presents a general framework for doing cross knowledge base inevitably needs a system to map the entity lingual entity linking by leveraging a large scale and mentions in one language to their entries in a knowledge base bilingual knowledge base, Wikipedia. We introduce which may be another language. This task is known as a bilingual topic model that mining bilingual topic cross-lingual entity linking. from this knowledge base with the assumption that Intuitively, to resolve the cross-lingual entity linking the same Wikipedia concept documents of two problem, the key is how to define the similarity score different languages share the same semantic topic between the context of entity mention and the document of distribution.
    [Show full text]
  • Babelfying the Multilingual Web: State-Of-The-Art Disambiguation and Entity Linking of Web Pages in 50 Languages
    Babelfying the Multilingual Web: state-of-the-art Disambiguation and Entity Linking of Web pages in 50 languages Roberto Navigli [email protected] http://lcl.uniroma1.it! ERC StG: MultilingualERC Joint Word Starting Sense Disambiguation Grant (MultiJEDI) MultiJEDI No. 259234 1! Roberto Navigli! LIDER EU Project No. 610782! Joint work with Andrea Moro Alessandro Raganato A. Moro, A. Raganato, R. Navigli. Entity Linking meets Word Sense Disambiguation: a Unified Approach. Transactions of the Association for Computational Linguistics (TACL), 2, 2014. Babelfying the Multilingual Web: state-of-the-art Disambiguation and Entity Linking Andrea Moro and Roberto Navigli 2014.05.08 BabelNet & friends3 Roberto Navigli 2014.05.08 BabelNet & friends4 Roberto Navigli Motivation • Domain-specific Web content is available in many languages • Information should be extracted and processed independently of the source/target language • This could be done automatically by means of high- performance multilingual text understanding Babelfying the Multilingual Web: state-of-the-art Disambiguation and Entity Linking Andrea Moro, Alessandro Raganato and Roberto Navigli BabelNet (http://babelnet.org) A wide-coverage multilingual semantic network and encyclopedic dictionary in 50 languages! Named Entities and specialized concepts Concepts from WordNet from Wikipedia Concepts integrated from both resources Babelfying the Multilingual Web: state-of-the-art Disambiguation and Entity Linking Andrea Moro, Alessandro Raganato and Roberto Navigli BabelNet (http://babelnet.org)
    [Show full text]
  • Named Entity Extraction for Knowledge Graphs: a Literature Overview
    Received December 12, 2019, accepted January 7, 2020, date of publication February 14, 2020, date of current version February 25, 2020. Digital Object Identifier 10.1109/ACCESS.2020.2973928 Named Entity Extraction for Knowledge Graphs: A Literature Overview TAREQ AL-MOSLMI , MARC GALLOFRÉ OCAÑA , ANDREAS L. OPDAHL, AND CSABA VERES Department of Information Science and Media Studies, University of Bergen, 5007 Bergen, Norway Corresponding author: Tareq Al-Moslmi ([email protected]) This work was supported by the News Angler Project through the Norwegian Research Council under Project 275872. ABSTRACT An enormous amount of digital information is expressed as natural-language (NL) text that is not easily processable by computers. Knowledge Graphs (KG) offer a widely used format for representing information in computer-processable form. Natural Language Processing (NLP) is therefore needed for mining (or lifting) knowledge graphs from NL texts. A central part of the problem is to extract the named entities in the text. The paper presents an overview of recent advances in this area, covering: Named Entity Recognition (NER), Named Entity Disambiguation (NED), and Named Entity Linking (NEL). We comment that many approaches to NED and NEL are based on older approaches to NER and need to leverage the outputs of state-of-the-art NER systems. There is also a need for standard methods to evaluate and compare named-entity extraction approaches. We observe that NEL has recently moved from being stepwise and isolated into an integrated process along two dimensions: the first is that previously sequential steps are now being integrated into end-to-end processes, and the second is that entities that were previously analysed in isolation are now being lifted in each other's context.
    [Show full text]
  • Entity Linking
    Entity Linking Kjetil Møkkelgjerd Master of Science in Computer Science Submission date: June 2017 Supervisor: Herindrasana Ramampiaro, IDI Norwegian University of Science and Technology Department of Computer Science Abstract Entity linking may be of help to quickly supplement a reader with further information about entities within a text by providing links to a knowledge base containing more information about each entity, and may thus potentially enhance the reading experience. In this thesis, we look at existing solutions, and implement our own deterministic entity linking system based on our research, using our own approach to the problem. Our system extracts all entities within the input text, and then disam- biguates each entity considering the context. The extraction step is han- dled by an external framework, while the disambiguation step focuses on entity-similarities, where similarity is defined by how similar the entities’ categories are, which we measure by using data from a structured knowledge base called DBpedia. We base this approach on the assumption that simi- lar entities usually occur close to each other in written text, thus we select entities that appears to be similar to other nearby entities. Experiments show that our implementation is not as effective as some of the existing systems we use for reference, and that our approach has some weaknesses which should be addressed. For example, DBpedia is not an as consistent knowledge base as we would like, and the entity extraction framework often fail to reproduce the same entity set as the dataset we use for evaluation. However, our solution show promising results in many cases.
    [Show full text]
  • Named-Entity Recognition and Linking with a Wikipedia-Based Knowledge Base Bachelor Thesis Presentation Niklas Baumert [[email protected]]
    Named-Entity Recognition and Linking with a Wikipedia-based Knowledge Base Bachelor Thesis Presentation Niklas Baumert [[email protected]] 2018-11-08 Contents 1 Why and What?—Motivation. 1.1 What is named-entity recognition? 1.2 What is entity linking? 2 How?—Implementation. 2.1 Wikipedia-backed Knowledge Base. 2.2 Entity Recognizer & Linker. 3 Performance Evaluation. 4 Conclusion. 2 1 Why and What—Motivation. 3 Built to provide input files for QLever. ● Specialized use-case. ○ General application as an extra feature. ● Providing a wordsfile and a docsfile as output. ● Wordsfile is input for QLever. ● Support for Wikipedia page titles, Freebase IDs and Wikidata IDs. ○ Convert between those 3 freely afterwards. 4 The docsfile is a list of all sentences with IDs. record id text 1 Anarchism is a political philosophy that advocates [...] 2 These are often described as [...] 5 The wordsfile contains each word of each sentence and determines entities and their likeliness. word Entity? recordID score Anarchism 0 1 1 <Anarchism> 1 1 0.9727 is 0 1 1 a 0 1 1 political 0 1 1 philosophy 0 1 1 <Philosophy> 1 1 0.955 6 Named-entity recognition (NER) identifies nouns in a given text. ● Finding and categorizing entities in a given text. [1][2] ● (Proper) nouns refer to entities. ● Problem: Words are ambiguous. [1] ○ Same word, different entities within a category—”John F. Kennedy”, the president or his son? ○ Same word, different entities in different categories—”JFK”, person or airport? 7 Entity linking (EL) links entities to a database. ● Resolve ambiguity by associating mentions to entities in a knowledge base.
    [Show full text]
  • KORE 50^DYWC: an Evaluation Data Set for Entity Linking Based On
    Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2389–2395 Marseille, 11–16 May 2020 c European Language Resources Association (ELRA), licensed under CC-BY-NC KORE 50DYWC: An Evaluation Data Set for Entity Linking Based on DBpedia, YAGO, Wikidata, and Crunchbase Kristian Noullet, Rico Mix, Michael Farber¨ Karlsruhe Institute of Technology (KIT) [email protected], [email protected], [email protected] Abstract A major domain of research in natural language processing is named entity recognition and disambiguation (NERD). One of the main ways of attempting to achieve this goal is through use of Semantic Web technologies and its structured data formats. Due to the nature of structured data, information can be extracted more easily, therewith allowing for the creation of knowledge graphs. In order to properly evaluate a NERD system, gold standard data sets are required. A plethora of different evaluation data sets exists, mostly relying on either Wikipedia or DBpedia. Therefore, we have extended a widely-used gold standard data set, KORE 50, to not only accommodate NERD tasks for DBpedia, but also for YAGO, Wikidata and Crunchbase. As such, our data set, KORE 50DYWC, allows for a broader spectrum of evaluation. Among others, the knowledge graph agnosticity of NERD systems may be evaluated which, to the best of our knowledge, was not possible until now for this number of knowledge graphs. Keywords: Data Sets, Entity Linking, Knowledge Graph, Text Annotation, NLP Interchange Format 1. Introduction Therefore, annotation methods can only be evaluated Automatic extraction of semantic information from texts is on a small part of the KG concerning a specific do- an open problem in natural language processing (NLP).
    [Show full text]
  • Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions Wei Shen, Jianyong Wang, Senior Member, IEEE, and Jiawei Han, Fellow, IEEE
    1 Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions Wei Shen, Jianyong Wang, Senior Member, IEEE, and Jiawei Han, Fellow, IEEE Abstract—The large number of potential applications from bridging Web data with knowledge bases have led to an increase in the entity linking research. Entity linking is the task to link entity mentions in text with their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge base population. However, this task is challenging due to name variations and entity ambiguity. In this survey, we present a thorough overview and analysis of the main approaches to entity linking, and discuss various applications, the evaluation of entity linking systems, and future directions. Index Terms—Entity linking, entity disambiguation, knowledge base F 1 INTRODUCTION Entity linking can facilitate many different tasks such as knowledge base population, question an- 1.1 Motivation swering, and information integration. As the world HE amount of Web data has increased exponen- evolves, new facts are generated and digitally ex- T tially and the Web has become one of the largest pressed on the Web. Therefore, enriching existing data repositories in the world in recent years. Plenty knowledge bases using new facts becomes increas- of data on the Web is in the form of natural language. ingly important. However, inserting newly extracted However, natural language is highly ambiguous, es- knowledge derived from the information extraction pecially with respect to the frequent occurrences of system into an existing knowledge base inevitably named entities. A named entity may have multiple needs a system to map an entity mention associated names and a name could denote several different with the extracted knowledge to the corresponding named entities.
    [Show full text]
  • Capturing Semantic Similarity for Entity Linking with Convolutional Neural Networks
    Capturing Semantic Similarity for Entity Linking with Convolutional Neural Networks Matthew Francis-Landau, Greg Durrett and Dan Klein Computer Science Division University of California, Berkeley mfl,gdurrett,klein @cs.berkeley.edu { } Abstract al., 2011), but such heuristics are hard to calibrate and they capture structure in a coarser way than A key challenge in entity linking is making ef- learning-based methods. fective use of contextual information to dis- In this work, we model semantic similarity be- ambiguate mentions that might refer to differ- tween a mention’s source document context and its ent entities in different contexts. We present a model that uses convolutional neural net- potential entity targets using convolutional neural works to capture semantic correspondence be- networks (CNNs). CNNs have been shown to be ef- tween a mention’s context and a proposed tar- fective for sentence classification tasks (Kalchbren- get entity. These convolutional networks oper- ner et al., 2014; Kim, 2014; Iyyer et al., 2015) and ate at multiple granularities to exploit various for capturing similarity in models for entity linking kinds of topic information, and their rich pa- (Sun et al., 2015) and other related tasks (Dong et rameterization gives them the capacity to learn al., 2015; Shen et al., 2014), so we expect them to be which n-grams characterize different topics. effective at isolating the relevant topic semantics for We combine these networks with a sparse lin- ear model to achieve state-of-the-art perfor- entity linking. We show that convolutions over mul- mance on multiple entity linking datasets, out- tiple granularities of the input document are useful performing the prior systems of Durrett and for providing different notions of semantic context.
    [Show full text]
  • Satellitener: an Effective Named Entity Recognition Model for the Satellite Domain
    SatelliteNER: An Effective Named Entity Recognition Model for the Satellite Domain Omid Jafari1 a, Parth Nagarkar1 b, Bhagwan Thatte2 and Carl Ingram3 1Computer Science Department, New Mexico State University, Las Cruces, NM, U.S.A. 2Protos Software, Tempe, AZ, U.S.A. 3Vigilant Technologies, Tempe, AZ, U.S.A. Keywords: Natural Language Processing, Named Entity Recognition, Spacy. Abstract: Named Entity Recognition (NER) is an important task that detects special type of entities in a given text. Existing NER techniques are optimized to find commonly used entities such as person or organization names. They are not specifically designed to find custom entities. In this paper, we present an end-to-end framework, called SatelliteNER, that its objective is to specifically find entities in the Satellite domain. The workflow of our proposed framework can be further generalized to different domains. The design of SatelliteNER in- cludes effective modules for preprocessing, auto-labeling and collection of training data. We present a detailed analysis and show that the performance of SatelliteNER is superior to the state-of-the-art NER techniques for detecting entities in the Satellite domain. 1 INTRODUCTION with having NER tools for different human languages (e.g., English, French, Chinese, etc.), domain-specific Nowadays, large amounts of data is generated daily. NER software and tools have been created with the Textual data is generated by news articles, social me- consideration that all sciences and applications have a dia such as Twitter, Wikipedia, etc. Managing these growing need for NLP. For instance, NER is used in large data and extracting useful information from the medical field (Abacha and Zweigenbaum, 2011), them is an important task that can be achieved using for the legal documents (Dozier et al., 2010), for Natural Language Processing (NLP).
    [Show full text]
  • BENNERD: a Neural Named Entity Linking System for COVID-19
    BENNERD: A Neural Named Entity Linking System for COVID-19 Mohammad Golam Sohraby,∗, Khoa N. A. Duongy,∗, Makoto Miway, z , Goran Topic´y, Masami Ikeday, and Hiroya Takamuray yArtificial Intelligence Research Center (AIRC) National Institute of Advanced Industrial Science and Technology (AIST), Japan zToyota Technological Institute, Japan fsohrab.mohammad, khoa.duong, [email protected], fikeda-masami, [email protected], [email protected] Abstract in text mining system, Xuan et al.(2020b) created CORD-NER dataset with comprehensive NE an- We present a biomedical entity linking (EL) system BENNERD that detects named enti- notations. The annotations are based on distant ties in text and links them to the unified or weak supervision. The dataset includes 29,500 medical language system (UMLS) knowledge documents from the CORD-19 corpus. The CORD- base (KB) entries to facilitate the corona virus NER dataset gives a shed on NER, but they do not disease 2019 (COVID-19) research. BEN- address linking task which is important to address NERD mainly covers biomedical domain, es- COVID-19 research. For example, in the example pecially new entity types (e.g., coronavirus, vi- sentence in Figure1, the mention SARS-CoV-2 ral proteins, immune responses) by address- ing CORD-NER dataset. It includes several needs to be disambiguated. Since the term SARS- NLP tools to process biomedical texts includ- CoV-2 in this sentence refers to a virus, it should ing tokenization, flat and nested entity recog- be linked to an entry of a virus in the knowledge nition, and candidate generation and rank- base, not to an entry of ‘SARS-CoV-2 vaccination’, ing for EL that have been pre-trained using which corresponds to therapeutic or preventive pro- the CORD-NER corpus.
    [Show full text]