Assessing the Effectiveness of Multilingual Transformer-Based Text Embeddings for Named Entity Recognition in Portuguese

Assessing the Effectiveness of Multilingual Transformer-based Text Embeddings for Named Entity Recognition in Portuguese Diego Bernardes de Lima Santos1, Frederico Giffoni de Carvalho Dutra2 a, Fernando Silva Parreiras3 b and Wladmir Cardoso Brandao˜ 1 c 1Department of Computer Science, Pontifical Catholic University of Minas Gerais (PUC Minas), Belo Horizonte, Brazil 2Companhia Energetica´ de Minas Gerais (CEMIG), Belo Horizonte, Brazil 3Laboratory for Advanced Information Systems, FUMEC University, Belo Horizonte, Brazil Keywords: Named Entity Recognition, Text Embedding, Neural Network, Transformer, Multilingual, Portuguese. Abstract: Recent state of the art named entity recognition approaches are based on deep neural networks that use an attention mechanism to learn how to perform the extraction of named entities from relevant fragments of text. Usually, training models in a specific language leads to effective recognition, but it requires a lot of time and computational resources. However, fine-tuning a pre-trained multilingual model can be simpler and faster, but there is a question on how effective that recognition model can be. This article exploits multilingual models for named entity recognition by adapting and training tranformer-based architectures for Portuguese, a challenging complex language. Experimental results show that multilingual trasformer-based text embeddings approaches fine tuned with a large dataset outperforms state of the art trasformer-based models trained specifically for Portuguese. In particular, we build a comprehensive dataset from different versions of HAREM to train our multilingual transformer-based text embedding approach, which achieves 88.0% of precision and 87.8% in F1 in named entity recognition for Portuguese, with gains of up to 9.89% of precision and 11.60% in F1 compared to the state of the art single-lingual approach trained specifically for Portuguese. 1 INTRODUCTION ing with similar entities or with entities with the same semantic value. Natural Language Processing (NLP) is a computer NER is strongly dependent on the context, i.e., science research field with several practical applica- words or expressions can be recognized as different tions, such as automatic text reading and question an- types of entity in different contexts. For instance, in swering, audio content interpretation, document clas- the sentence “Mary prays to Saint Paul for health”, the sification, and predictive text analysis. Usually, NLP expression “Saint Paul” refers to a person (religious systems perform a set of basic preprocessing tasks on entity), but in the sentence “We will move to Saint input text, such as parsing, tokenization, stop-words Paul next year”, the expression “Saint Paul” refers removal, stemming and tagging. Particularly, Named to a place (location entity). Even if the spelling of Entity Recognition (NER) is a NLP tagging task that a word or expression cited in different sentences is extracts important information by marking up it on identical, the meaning can be distinct given differ- text, such as names of people, places and currency ent contexts. Additionally, sentences are formulated values (Borthwick, 1999). The extracted elements are in distinct ways in different languages, and the lan- relevant entities in the textual content that make sense guages differ from each other in structure, form and within a context. For instance, the recognition of the complexity, which impose even more challenging is- entity “New York” as a location in a sentence can be sues for NER. important to detect where a particular event occurred Traditional NER approaches use hand-crafted lin- or even to relate that location to other locations, deal- guistic grammar-based strategies or statistic models that requires a large amount of manually annotated a https://orcid.org/0000-0002-8666-0354 training data to recognize entities in text (Marsh and b https://orcid.org/0000-0002-9832-1501 Perzanowski, 1998). For years, Conditional Random c https://orcid.org/0000-0002-1523-1616 Fields (CRF) has been the state of the art strategy for 473 Santos, D., Dutra, F., Parreiras, F. and Brandão, W. Assessing the Effectiveness of Multilingual Transformer-based Text Embeddings for Named Entity Recognition in Portuguese. DOI: 10.5220/0010443204730483 In Proceedings of the 23rd International Conference on Enterprise Information Systems (ICEIS 2021) - Volume 1, pages 473-483 ISBN: 978-989-758-509-8 Copyright c 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved ICEIS 2021 - 23rd International Conference on Enterprise Information Systems NER, taking context into account in a learning model recognition, word embeddings and transformer-based that support sequential dependencies between predic- architectures of neural networks. Section 3 presents tions (Lafferty et al., 2001). Recently, deep neural related work reported in literature for NER, including networks based approaches have achieved even more the state-of-the-art approach for NER in Portuguese. effective results than CRF for NER (Goldberg, 2016). Section 4 presents our multilingual NER approach for They learn distributed text representations (text em- Portuguese, as well as the comprehensive dataset we beddings) from a huge amount of text to build a lan- create to improve the training of our approach. Sec- guage model that can be effectively used in several tion 5 presents the experimental setup and the results NLP tasks, including NER. of the experiments we carry out to evaluate our pro- Deep neural single-lingual models (training NLP posed approach. Finally, Section 6 concludes this ar- models in a specific language) usually leads to effec- ticle, suggesting directions for future work. tive entity recognition, requiring a lot of time and computational resources for training. In addition, such single-lingual approaches require a large amount 2 BACKGROUND of data in each specific language for training, some- times not available or easily obtained for certain lan- Named Entity Recognition (NER) is a NLP task that guages. However, fine-tuning a pre-trained multilin- identifies people, location, currency, and other rel- gual model can be cheaper, simpler and faster, re- evant information within a text (Borthwick, 1999). quiring no specific single-language training dataset While traditional NER approaches use hand-crafted and less time and computational resources for train- linguistic grammar-based strategies or statistic mod- ing. But how effective multilingual NER models can els that require a large amount of manually anno- be compared to single-lingual models, particularly for tated training data to recognize entities in text (Marsh complex languages, such as Portuguese? and Perzanowski, 1998), recent NER approaches use In this article, we exploit multilingual models deep neural networks to learn an effective recognition for NER by adapting and training transformer-based model (Goldberg, 2016). In particular, they learn text text embeddings for named entity recognition in Por- embeddings from a huge amount of text to build a lan- tuguese. Particularly, we propose a NER approach guage model that can be effectively used for NER. by training and fine tuning a multilingual transformer- based NLP model using a comprehensive dataset we created by combining different versions of HAREM. 2.1 Word Embeddings Additionally, we evaluate our proposed approach by contrasting it with the state-of-the-art (SOTA) single- Recently, different ways to represent text have lingual approach for NER in Portuguese. emerged, allowing more accurate analyzes of textual Experimental results show that our multilingual information, e.g., the analysis of similarity between approach for NER in Portuguese outperforms the two words. A distributed text representation, or text SOTA single-lingual approach with gains of up 9.89% embeddings, can be generated by deep neural net- of precision and 11.60% in F1, achieving 88.00% of work (NN) approaches that learn language models precision and 87.80% in F1 in named entity recogni- from a huge amount of natural language corpus. In tion. The main contributions of this article are: particular, word embeddings take the form of a continuous vector representation describing the meaning • We propose a comprehensive dataset to improve of terms (Levy and Goldberg, 2014). Usually, this the training of NER models for Portuguese by distributed representation is a not mutually exclusive combining different versions of the HAREM continuous real-valued vector of fixed length learned dataset. by a NN, typically much smaller than the size of the • We propose a multilingual NER approach for vocabulary (Bengio et al., 2003). Portuguese by adapting and training different The continuous vectors representation are capable transformer-based neural networks for multilin- of syntactically representing words, but also allow the gual NER in English. learning of semantic values of terms, that is, word embeddings can capture similarity between words with • We provide a throughout evaluation of our pro- similar meaning, even if their spelling is quite differ- posed approach by contrasting them with the ent among them (Mikolov et al., 2013b). Figure 1 SOTA single-lingual approach for NER in Por- presents groups of words with similar context mea- tuguese reported in literature. sured by cosine similarity between word embeddings. The present article is organized as follows: Section 2 In recent years, different frameworks and algo- presents the theoretical background in named

Assessing the Effectiveness of Multilingual Transformer-Based Text Embeddings for Named Entity Recognition in Portuguese

Powering Inclusion: Artificial Intelligence and Assistive Technology

An Improved Text Sentiment Classification Model Using TF-IDF and Next Word Negation

Extraction of Predictive Document Spans with Neural Attention

Alternate Keypad Designs and Novel Predictive Disambiguation Methods

Graph Neural Networks for Natural Language Processing: a Survey

AI in HR: the New Frontier INTRODUCTION the Robots Have Arrived

Linguistics 1

Corpus-Based Language Technology for Indigenous and Minority Languages

Predictive Text System for Bahasa with Frequency, N-Gram, Probability Table and Syntactic Using Grammar

Voice Typing: a New Speech Interaction Model for Dictation on Touchscreen Devices

Proceedings of the 55Th Annual Meeting of the Association For

Experts Doubt Ethical AI Design Will Be Broadly Adopted As the Norm Within the Next Decade