SC-UPB at the VarDial 2019 Evaluation Campaign: Moldavian vs. Romanian Cross-Dialect Topic Identification

Cristian Onose, Dumitru-Clementin Cercel and Stefan Trausan-Matu
Faculty of Automatic Control and Computers
University Politehnica of Bucharest, Romania
{onose.cristian, [email protected], [email protected]

Abstract

This paper describes our models for the Moldavian vs. Romanian Cross-Topic Identification (MRC) evaluation campaign, part of the VarDial 2019 workshop. We focus on the three subtasks for MRC: binary classification between the Moldavian (MD) and the Romanian (RO) dialects, and two cross-dialect multi-class classification tasks over six news topics, MD to RO and RO to MD. We propose several deep learning models based on Long Short-Term Memory cells (LSTM), Bidirectional Gated Recurrent Units (BiGRU) and Hierarchical Attention Networks (HAN). We also employ three word embedding models to represent the text as a low dimensional vector. Our official submission includes two runs of the BiGRU and HAN models for each of the three subtasks. The best submitted model obtained the following macro-averaged F1 scores: 0.708 for subtask 1, 0.481 for subtask 2 and 0.480 for the last one. Due to a read error caused by the quoting behaviour over the test file, our final submissions contained a smaller number of items than expected; more than 50% of the submission files were corrupted. Thus, we also present the results obtained with the corrected labels, for which the HAN model achieves the following results: 0.930 for subtask 1, 0.590 for subtask 2 and 0.687 for the third one.

1 Introduction

The task of discriminating between two dialects or different languages is a popular research topic which has attracted a lot of interest from the research community. Specifically, the VarDial competition has proposed in recent years a number of shared tasks on different languages, such as dialect identification for Arabic or German, Indo-Aryan language identification, distinguishing between Mainland and Taiwan Mandarin, or discriminating between Dutch and Flemish (Zampieri et al., 2017, 2018). This year (Zampieri et al., 2019), the problem of discriminating between Romanian and Moldavian dialects was introduced as a series of three subtasks. It involves the processing of the MOROCO dataset (Butnaru and Ionescu, 2019) to construct several language classification models. The dataset contains text samples from online news outlets in the Romanian (RO) language or the Moldavian (MD) dialect. All the subtasks are closed, meaning that the use of external datasets is not allowed; additionally, internal data available for the MRC subtasks must not be used across subtasks. The first subtask is a binary classification between the two dialects. The second subtask involves a cross-dialect multi-class classification between six topics; more precisely, the classifier is trained on the Moldavian dialect in order to classify samples from the Romanian dialect. The third subtask is similar to the second one, but with the roles of the dialects reversed.

Generally, such tasks are approached using traditional machine learning algorithms, which unfortunately require handcrafted features. Recently, deep learning methods, where features are learned from the data, have been proposed (Ali, 2018). To address the MRC shared task, we propose the use of three state-of-the-art deep learning architectures for text classification: Long Short-Term Memory cells (LSTM) (Hochreiter and Schmidhuber, 1997), Bidirectional Gated Recurrent Units (BiGRU) (Graves and Schmidhuber, 2005) and Hierarchical Attention Networks (HAN) (Yang et al., 2016). The submission results are based only on the BiGRU and HAN models for each of the three subtasks. After the competition deadline, an error caused by the quoting behaviour over the test file was discovered. As a result, our final submissions contained a smaller number of labels than expected, with approximately 50% of the files being corrupted. Thus, we present both the official submissions and later work that includes the correction of this problem.

The study of Romanian dialects was first approached by Ciobanu and Dinu (2016). They construct binary classifiers to distinguish between Romanian and three dialects (Macedo-Romanian, Megleno-Romanian and Istro-Romanian) by exploring information provided by a set of 108 word pairs. Subsequently, Butnaru and Ionescu (2019) proposed the first Moldavian and Romanian Dialectal Corpus (MOROCO), assembled from multiple news websites. On top of this dataset they construct several deep learning models for dialect identification: a character-level Convolutional Neural Network (CharCNN) and an improved CNN model using squeeze-and-excitation blocks (Hu et al., 2018). Additionally, they also investigate shallow string kernel methods (Ionescu et al., 2016). They conclude that string kernels achieve the best performance among the studied methods.

The remainder of this paper is organized as follows. In Section 2 we briefly discuss the dataset for the three tasks. Section 3 describes the methodology behind our solution, while the experimental setup and the results are presented in Section 4. Finally, Section 5 contains our conclusions.
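All scores in this paper are macro-averaged F1 values, i.e., the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its frequency. As a reference point only (the shared task used its own scoring script, which we do not reproduce here), the metric can be computed with scikit-learn as follows; the label lists are invented for illustration:

    from sklearn.metrics import f1_score

    # Illustrative gold and predicted labels for subtask 1 (MD vs. RO);
    # these values are invented for demonstration purposes.
    y_true = ["MD", "RO", "RO", "MD", "RO", "RO"]
    y_pred = ["MD", "RO", "MD", "MD", "RO", "RO"]

    # average="macro" takes the unweighted mean of per-class F1 scores,
    # so both dialect classes weigh equally regardless of class size.
    print(f1_score(y_true, y_pred, average="macro"))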
2 Dataset

The MOROCO dataset contains Moldavian and Romanian samples collected from one of the following news categories: culture, finance, politics, science, sports and technology. It is divided between training, validation and test sets for each of the three tasks, as described in Table 1. The test set is combined for all the subtasks such that the labels for the first task cannot be inferred. This is necessary because the second and the third subtasks are based entirely on just one of the dialects.

    Subtask   Training   Validation   Test
    1         21719      11845        5923
    2         9968       5435         5923
    3         11751      6410         5923

Table 1: Dataset sample distribution between training, validation and test for each of the three subtasks.

The data samples are provided preprocessed by replacing named entities, which could act as biases for the classifiers, with a special identifier: $NE$. For instance, city names or important public figures from both countries, Romania or the Republic of Moldova, are anonymized.

Besides the default preprocessing, we also took extra steps to clean up the dataset. Text usually contains expressions which carry little to no meaning; thus, we choose to remove the following: stop words, special characters, and all punctuation marks except end-of-sentence markers. Additionally, we remove the named entity identifiers, as they interfere with the text representations. Another important aspect is how we handle diacritics: during our experiments we analyze their impact on performance.
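A minimal sketch of this cleanup, assuming a simple regular-expression approach, is shown below; the abbreviated Romanian stop word list and the exact character classes are our own illustrative choices, not specified in the paper:

    import re
    import unicodedata

    # Abbreviated, illustrative Romanian stop word list; the paper does
    # not specify which stop word inventory was used.
    STOP_WORDS = {"si", "sa", "la", "cu", "de", "un", "o", "este", "pentru"}

    def strip_diacritics(text):
        # Decompose accented characters and drop the combining marks.
        nfkd = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in nfkd if not unicodedata.combining(ch))

    def clean_sample(text, keep_diacritics=True):
        text = text.replace("$NE$", " ")         # drop the named entity identifier
        text = re.sub(r"[^\w\s.!?]", " ", text)  # remove special characters and
                                                 # punctuation, keeping only
                                                 # end-of-sentence marks
        if not keep_diacritics:                  # optionally strip diacritics to
            text = strip_diacritics(text)        # measure their impact
        tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
        return " ".join(tokens)

The keep_diacritics flag mirrors the two experimental conditions described above, so the same pipeline can produce both variants of the corpus.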
3 Deep learning models

In recent years, with the increasing availability of computational resources, deep neural networks became successful for classification and regression problems (LeCun et al., 2015). At first, simple feedforward networks were used. These networks lack loops or cycles, and information moves only forward, from the input to the output nodes. The switch to other types of representations, namely Recurrent Neural Networks (RNNs), was made because of the need to map input and output nodes of varying types and sizes.

Recurrent neural networks. RNNs are neural networks that form connections between nodes along a sequence. This allows the network to exhibit internal memory with respect to the inputs, which in turn enables the prediction of future steps. Due to this memory, RNNs are the preferred method for processing sequential data such as time series, text or video. Unfortunately, RNNs can suffer from training instability and from exploding and vanishing gradients.

Long short-term memory. Long Short-Term Memory (LSTM) units, introduced by Hochreiter and Schmidhuber (1997), are used in recurrent neural networks as a way to prevent vanishing or exploding gradients. The units allow the errors to flow backwards through endless virtual layers which are unfolded in space. Besides the usual input and output gates, LSTM units are augmented with recurrent gates, called forget gates, which regulate the movement of information through the cell (Gers et al., 2000).

Gated recurrent unit. Similar to the LSTM, the Gated Recurrent Unit (GRU) was introduced by Cho et al. (2014) as a method to solve the vanishing gradient problem that occurs when using standard RNNs. These units are closely related to LSTMs, having similar performance and design. GRU layers are popular due to their simpler structure, which results in faster training time. A bidirectional extension for such recurrent layers was proposed by Graves and Schmidhuber (2005). It connects two hidden layers of opposite directions, one forward and one backward, to the same output. This is useful for text processing since it can encode the context present in such structures: characters and words.

Hierarchical Attention Networks. Hierarchical Attention Networks (HAN) were introduced for document classification by Yang et al. (2016).

Word embeddings. We employ word embeddings to represent the text as low dimensional vectors for our classifiers. In order to achieve this, we rely on a number of pretrained word vector models: the Romanian Language Corpus (CoRoLa) introduced by Mititelu et al. (2018), the Nordic Language Processing Laboratory (NLPL) word embedding repository (Kutuzov et al., 2017) and the Common Crawl (CC) word vectors (Grave et al., 2018). The relevant details for each word vector representation model can be viewed in Table 2.

    Name     Vector size   Min. word count   Unique tokens   Diacritics   Training algorithm
    CoRoLa   300           20                250942          Yes          FastText
    NLPL     100           10                2153518         No           Word2Vec Skipgram
    CC       300           -                 2000000         Yes          FastText

Table 2: Word embeddings: statistics regarding training methods and dataset/parameter details.

LSTM and BiGRU Models. The input for the RNN-based models is computed by taking the mean of all word embeddings present in the text. Missing words are considered zero-valued vectors. The result is a representation of the whole news
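As a concrete illustration of this input construction, the following minimal sketch is written under our own assumptions: embeddings stands for a plain token-to-vector dictionary loaded from one of the pretrained models in Table 2, and the helper name is hypothetical.

    import numpy as np

    def document_vector(tokens, embeddings, dim=300):
        """Mean-pool the word vectors of a news sample. Words missing
        from the embedding vocabulary contribute zero-valued vectors,
        as in the description above."""
        vectors = [embeddings.get(tok, np.zeros(dim)) for tok in tokens]
        if not vectors:  # guard against empty documents
            return np.zeros(dim)
        return np.mean(np.stack(vectors), axis=0)

    # Example with a toy two-word vocabulary: the two out-of-vocabulary
    # tokens pull the mean towards zero, since they count as zero vectors.
    toy = {"exemplu": np.ones(300), "text": np.full(300, 2.0)}
    doc = document_vector("un exemplu de text".split(), toy)

Note that out-of-vocabulary tokens are included in the denominator of the mean rather than skipped, which matches the statement that missing words are considered zero-valued vectors.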