Entity Linking Aleksander Smywinski-Pohl´
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the PolEval 2019 Workshop Maciej Ogrodniczuk, Łukasz Kobyliński (eds.) j Institute of Computer Science, Polish Academy of Sciences Warszawa, 2019 Contents PolEval 2019: Introduction Maciej Ogrodniczuk, Łukasz Kobylinski´ . .5 Results of the PolEval 2019 Shared Task 1: Recognition and Normalization of Temporal Expressions Jan Kocon,´ Marcin Oleksy, Tomasz Berna´s, Michał Marcinczuk´ . .9 Results of the PolEval 2019 Shared Task 2: Lemmatization of Proper Names and Multi-word Phrases Michał Marcinczuk,´ Tomasz Berna´s................................. 15 Results of the PolEval 2019 Shared Task 3: Entity Linking Aleksander Smywinski-Pohl´ . 23 PolEval 2019: Entity Linking Szymon Roziewski, Marek Kozłowski, Łukasz Podlodowski . 37 Results of the PolEval 2019 Shared Task 4: Machine Translation Krzysztof Wołk . 47 The Samsung’s Submission to PolEval 2019 Machine Translation Task Marcin Chochowski, Paweł Przybysz . 55 Results of the PolEval 2019 Shared Task 5: Automatic Speech Recognition Task Danijel Koržinek . 73 Automatic Speech Recognition Engine Jerzy Jamro˙zy, Marek Lange, Mariusz Owsianny, Marcin Szymanski´ . 79 Results of the PolEval 2019 Shared Task 6: First Dataset and Open Shared Task for Automatic Cyberbullying Detection in Polish Twitter Michał Ptaszynski,´ Agata Pieciukiewicz, Paweł Dybała . 89 Simple Bidirectional LSTM Solution for Text Classification Rafał Pronko´ .............................................. 111 4 Comparison of Traditional Machine Learning Approach and Deep Learning Models in Automatic Cyberbullying Detection for Polish Language Maciej Biesek . 121 Przetak: Fewer Weeds on the Web Marcin Ciura . 127 Approaching Automatic Cyberbullying Detection for Polish Tweets Krzysztof Wróbel . 135 Exploiting Unsupervised Pre-Training and Automated Feature Engineering for Low-Resource Hate Speech Detection in Polish Renard Korzeniowski, Rafał Rolczynski,´ Przemysław Sadownik, Tomasz Korbak, Marcin Mo˙zejko . 141 Universal Language Model Fine-Tuning for Polish Hate Speech Detection Piotr Czapla, Sylvain Gugger, Jeremy Howard, Marcin Kardas . 149 A Simple Neural Network for Cyberbullying Detection (abstract) Katarzyna Krasnowska-Kiera´s, Alina Wróblewska . 161 PolEval 2019: Introduction Maciej Ogrodniczuk, Łukasz Kobylinski´ (Institute of Computer Science, Polish Academy of Sciences) This volume consists of proceedings of PolEval session, organized during the AI & NLP Day 2019 (https://nlpday.pl), which took place on May 31st, 2019 at the Institute of Computer Science, Polish Academy of Sciences in Warsaw. PolEval (http://poleval.pl) is an evaluation campaign organized since 2017, which focuses on Natural Language Processing tasks for Polish, promoting research on language and speech technologies. As a consequence of the global growth of interest in Natural Language Processing and rising number of published papers, frameworks for an objective evaluation and comparison of NLP-related methods become crucial to support effective knowledge dissemination in this field. Following the initiatives of ACE Evaluation1, SemEval2, or Evalita3, we have observed the need to provide a similar platform for comparing NLP methods, which would focus on Polish language tools and resources. Since 2017 we observe a steady growth of interest in PolEval participation: the first edition has attracted 20 submissions, while in 2018 we have received 24 systems for evaluation. During the 2019 edition of PolEval six different tasks have been announced and teams from both academia and business submitted 34 systems in total (see Figure 1). In 2019 the systems competed in the following tasks: — Task 1: Recognition and normalization of temporal expressions — Task 2: Lemmatization of proper names and multi-word phrases — Task 3: Entity linking — Task 4: Machine translation (EN-PL, PL-RU, RU-PL) — Task 5: Automatic speech recognition — Task 6: Automatic cyberbullying detection (harmful vs non-harmful and detecting type of harmfulness). 1https://en.wikipedia.org/wiki/Automatic_content_extraction 2https://en.wikipedia.org/wiki/SemEval 3http://www.evalita.it/ 6 Maciej Ogrodniczuk, Łukasz Kobylinski´ 30 20 10 2017 2018 2019 Submissions per task (avg) Submissions Figure 1: Number of PolEval submissions and average submissions per task in 2017–2019 The number of submissions per each task has varied greatly (see Figure 2): the subject of hate speech and cyberbullying has attracted the most submissions during this edition of the campaign. 16 15 10 6 Submissions 5 4 4 3 1 0 Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Figure 2: Number of PolEval submissions per task in 2019 We hope you will join us next year for PolEval 2020! Please feel free to share your ideas for improving this competition or willingness to help in organizing your own NLP tasks. Organizers Concept and administrative issues Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences) Łukasz Kobylinski´ (Institute of Computer Science, Polish Academy of Sciences and Sages) Task 1: Recognition and normalization of temporal expressions Jan Kocon´ (Wrocław University of Science and Technology) Task 2: Lemmatization of proper names and multi-word phrases Michał Marcinczuk´ (Wrocław University of Science and Technology) Task 3: Entity linking Aleksander Smywinski-Pohl´ (AGH University of Science and Technology) Task 4: Machine translation Krzysztof Wołk (Polish-Japanese Academy of Information Technology) Task 5: Automatic speech recognition Danijel Koržinek (Polish-Japanese Academy of Information Technology) Task 6: Automatic cyberbullying detection Michał Ptaszynski´ (Kitami Institute of Technology, Japan) Agata Pieciukiewicz (Polish-Japanese Academy of Information Technology) Paweł Dybała (Jagiellonian University in Kraków) Results of the PolEval 2019 Shared Task 1: Recognition and Normalization of Temporal Expressions Jan Kocon,´ Marcin Oleksy, Tomasz Berna´s, Michał Marcinczuk´ (Department of Computational Intelligence, Wrocław University of Science and Technology) Abstract This article presents the research in the recognition and normalization of Polish temporal expressions as the result of the first PolEval 2019 shared task. Temporal information extracted from the text plays a significant role in many information extraction systems, like question answering, event recognition or text summarization. A specification for annotating Polish temporal expressions (PLIMEX) was used to prepare a completely new test dataset for the competition. PLIMEX is based on state-of-the-art solutions for English, mostly TimeML. The training data provided for the task is Polish Corpus of Wrocław University of Science and Technology (KPWr) fully annotated using PLIMEX guidelines. Keywords natural language processing, information extraction, temporal expressions, recognition, normalization, Polish 1. Introduction Temporal expressions (henceforth timexes) tell us when something happens, how long some- thing lasts, or how often something occurs. The correct interpretation of a timex often involves knowing the context. Usually, people are aware of their location in time, i.e., they know what day, month and year it is, and whether it is the beginning or the end of week or month. Therefore, they refer to specific dates, using incomplete expressions such as 12 November, Thursday, the following week, after three days. The temporal context is often necessary to determine to which specific date and time timexes refer. These examples do not exhaust the complexity of the problem of recognizing timexes. 10 Jan Kocon,´ Marcin Oleksy, Tomasz Berna´s, Michał Marcinczuk´ The possibility of sharing information about recognized timexes between different information systems is very important. For example, a Polish programmer may create a method that will recognize the expression the ninth of December and normalize it (knowing the context of the whole document) to the form 09.12.2015. A programmer from the United States could create a similar method that normalizes the same expression to a form 12/9/2015. A serious problem is the need to use the information from two such methods, e.g. for the analysis of multilingual text sources, where the expected effect is a certain metadata unification and the application of international standards. Normalization allows to determine the machine- readable form of timexes and requires the analysis of each expression in a broad context (even the whole document), due to the need to make some calculations on relative timexes, such as five minutes earlier, three hours later. TimeML (Saurí et al. 2006) is a markup language for describing timexes that has been adapted to many languages. The specification was created as part of the TERQAS1 workshop, as part of the AQUAINT project2, aimed at improving the quality of methods for question answering(Pustejovsky et al. 2005). The aim of this study was to improve access to information in the text with the focus on in-depth analysis of the content, not only through keywords. A key problem was the recognition of events and their location in time. 2. Task Description The aim of this task is to advance research on the processing of timexes, which are used in other NLP applications like question answering, textual entailment, document classification, summarization, etc. This task follows on from previous TempEval events organized for evaluating time expressions for English and Spanish like SemEval-2013 (UzZaman et al. 2013). This time a corpus of Polish documents fully annotated with temporal expressions was provided. The annotation consists of