Spoken dialect identification in Twitter using a multi-filter architecture

Mohammadreza Banaei, Rémi Lebret, Karl Aberer
EPFL, Switzerland

arXiv:2006.03564v1 [cs.CL] 5 Jun 2020

Abstract

This paper presents our approach for SwissText & KONVENS 2020 shared task 2, which is a multi-stage neural model for Swiss German (GSW) identification on Twitter. Our model outputs either GSW or non-GSW and is not meant to be used as a generic language identifier. Our architecture consists of two independent filters, where the first one favors recall and the second one favors precision (both towards GSW). Moreover, we do not use binary models (GSW vs. not-GSW) in our filters but rather a multi-class classifier with GSW being one of the possible labels. Our model reaches an F1-score of 0.982 on the test set of the shared task.

1 Introduction

Out of over 8000 languages in the world (Hammarström et al., 2020), the Twitter language identifier (LID) only supports around 30 of the most used languages¹, which is not enough for the needs of the NLP community. Furthermore, it has been shown that even for these frequently used languages, Twitter LID is not highly accurate, especially when the tweet is relatively short (Zubiaga et al., 2016).

¹ https://dev.twitter.com/docs/developer-utilities/supported-languages/api-reference

However, Twitter data is linguistically diverse and especially includes tweets in many low-resource languages/dialects. Having a better performing Twitter LID can help us to gather large amounts of (unlabeled) text in these low-resource languages that can be used to enrich models in many downstream NLP tasks, such as sentiment analysis (Volkova et al., 2013) and named entity recognition (Ritter et al., 2011).

However, the generalization of state-of-the-art NLP models to low-resource languages is generally hard due to the lack of corpora with good coverage in these languages. The extreme case is the spoken dialects, where there might be no standard spelling at all. In this paper, we especially focus on Swiss German as our low-resource dialect. As Swiss German is a spoken dialect, people might spell a certain word differently, and even a single author might use different spellings for a word between two sentences. There also exists a dialect continuum across the German-speaking part of Switzerland, which makes NLP for Swiss German even more challenging. Swiss German has its own pronunciation and grammar, and many of its words are different from German.

There exist some previous efforts for discriminating similar languages with the help of tweet metadata such as geo-location (Williams and Dagli, 2017), but in this paper, we do not use tweet metadata and restrict our model to only use tweet content. Therefore, this model can also be used for language identification in sources other than Twitter.

LIDs that support GSW, like the fastText (Joulin et al., 2016) LID model, are often trained using Alemannic Wikipedia, which also contains other German dialects such as Swabian, German, and Alsatian German; hence, these models are not able to discriminate dialects that are close to GSW. Moreover, fastText LID also has a low recall (0.362) for Swiss German tweets, as it identifies many of them as German.

In this paper, we use two independently trained filters to remove non-GSW tweets. The first filter is a classifier that favors recall (towards GSW), and the second one favors precision. The exact same idea can be extended to N consecutive filters (with N ≥ 2), with the first N − 1 favoring recall and the last filter favoring precision. In this way, we make sure that GSW samples are not filtered out (with high probability) in the first N − 1 iterations, and the GSW precision of the whole pipeline can be improved by having a filter that favors precision at the end (N-th filter). The reason that we use only two filters is that adding more filters improved the performance (measured by GSW F1-score) negligibly on our validation set.

We demonstrate that by using this architecture, we can achieve an F1-score of 0.982 on the test set, even with a small amount of available data in the target domain (Twitter data). Section 2 presents the architecture of each of our filters and the rationale behind the chosen training data for each of them. In Section 3, we discuss our LID implementation details and give a detailed description of the datasets used. Section 4 presents the performance of our filters on the held-out test dataset. Moreover, we demonstrate the contribution of each of the filters to removing non-GSW tweets, to see their individual importance in the whole pipeline (for this specific test dataset).

2 Multi-filter language identification

In this paper, we follow the combination of N − 1 filters favoring recall, followed by a final filter that favors precision. We choose N = 2 in this paper to demonstrate the effectiveness of the approach. As discussed before, adding more filters improved the performance of the pipeline negligibly for this specific dataset. However, for more challenging datasets, N > 2 might be needed to improve the LID precision.

Both of our filters are multi-class classifiers with GSW being one of the possible labels. We found it empirically better to use roughly balanced classes for training the multi-class classifier, rather than turning the same training data into a highly imbalanced GSW vs. non-GSW dataset for a binary classifier, especially for the first filter (Section 2.1), which has many more parameters than the second filter (Section 2.2).

2.1 First filter: fine-tuned BERT model

The first filter should be designed to favor GSW recall, either by tuning inference thresholds or by using training data that implicitly enforces this bias towards GSW. Here we follow the second approach by using different domains for training different labels, which is further discussed below. Moreover, we use a more complex model (in terms of the number of parameters) for the first filter, so that it does the main job of removing non-GSW inputs while having reasonable GSW precision (further detail in Section 4). The second filter is later used to improve the pipeline precision by removing a relatively smaller number of non-GSW tweets.

Our first filter is a fine-tuned BERT (Devlin et al., 2018) model for the LID downstream task. As we do not have a large amount of unsupervised GSW data, it would be hard to train the BERT language model (LM) from scratch on GSW itself. Hence, we use the German pre-trained LM (BERT-base-cased model²), as German is the closest high-resource language to GSW.

² Training details available at https://huggingface.co/bert-base-german-cased (Wolf et al., 2019)

However, this LM has been trained using sentences (e.g., German Wikipedia) that are quite different from the Twitter domain. Moreover, the lack of standard spelling in GSW introduces many new words (unseen in the German LM training data) whose respective subword embeddings should be updated in order to improve the downstream task performance. In addition, there are even syntactic differences between German and GSW (and even among different variations of GSW in different regions (Honnet et al., 2017)). For these three reasons, we can conclude that freezing the BERT body (and just training the classifier layer) might not be optimal for this transfer learning between German and our target language. Hence, we also let the whole BERT body be trained during the downstream task, which of course needs a large amount of supervised data to avoid quick overfitting in the fine-tuning phase.

For this filter, we choose the same eight classes for training LID as Linder et al. (2019) (the dataset classes and their respective sizes can be found in Section 3.1). These languages are similar in structure to GSW (such as German, Dutch, etc.), and we try to train a model that can distinguish GSW from similar languages to decrease GSW false positives. For all classes except GSW, we use sentences (mostly Wikipedia and NewsCrawl) from the Leipzig text corpora (Goldhahn et al., 2012). We also use the SwissCrawl (Linder et al., 2019) dataset for GSW sentences.

Most GSW training samples (SwissCrawl data) come from forums and social media, which are less formal (in structure and also in the phrases used) than the samples of the other (non-GSW) classes (mostly from Wikipedia and NewsCrawl). Moreover, as our target dataset consists of tweets (mostly informal sentences), this could give this filter a high GSW recall during the inference phase. Additionally, our main reason for using a cased tokenizer for this filter is to let the model also use irregularities in writing, such as improper capitalization. As these irregularities mostly occur in informal writing, this will again bias the model towards GSW (improving GSW recall) when tweets are passed to it, as most of the GSW training samples are informal.

2.2 Second filter: fastText classifier

For this filter, we also train a multi-class classifier with GSW being one of the labels. The other classes are again languages close (in structure) to GSW, such as German, Dutch and Spanish (further detail in Section 3.1). Additionally, as mentioned before, our second filter should have a reasonably high precision to enhance the full pipeline precision. Hence, unlike the first filter, we choose the whole training data to be sampled from a domain similar to the target test set. Non-GSW samples are tweets from the SEPLN 2014 (Zubiaga et al., 2014) and Carter et al. (2013) datasets. GSW samples consist of the GSW tweets provided by this shared task and also part of the GSW samples of the Swiss SMS corpus (Stark et al., 2015).

As the described training data is rather small compared to the first filter's training data, we should also train a simpler architecture with significantly fewer parameters. We take advantage of fastText (Joulin et al., 2016) for training this model, which is based on a bag of character n-grams in our case. Moreover, unlike the first filter, this model is not a cased model, and we lower-case the input sentences to reduce the vocabulary size. The hyper-parameters used for this model can be found in Section 3.

3 Experimental Setup

In this section, we describe the datasets and the hyper-parameters for both filters in the pipeline. We also describe our preprocessing method, which is specifically designed to handle inputs from social media.

3.1 Datasets

For both filters, we use 80% of the data for training, 5% for the validation set and 15% for the test set.

3.1.1 First filter

The sentences are from the Leipzig corpora (Goldhahn et al., 2012) and the SwissCrawl (Linder et al., 2019) dataset. The classes and the number of samples in each class are shown in Table 1. We pick the classes proposed by Linder et al. (2019) for training GSW LID. The main differences between our first filter and their LID are the GSW sentences and the fact that our fine-tuning dataset is about three times larger than theirs. Each of the “other”³ and “GSW-like”⁴ classes is a group of languages whose members cannot be represented as separate classes due to having a small number of samples. The GSW-like class is included to make sure that the model can distinguish other German dialects from GSW (hence, reducing GSW false positives).

Language         Number of samples
Afrikaans        250000
German           100000
English          250000
Swiss-German     250000
GSW-like         250000
Luxembourgian    250000
Dutch            250000
Other            250000

Table 1: Distribution of samples in the first filter dataset

3.1.2 Second filter

The sentences are mostly from Twitter (except for some GSW samples from the Swiss SMS corpus (Stark et al., 2015)). Table 2 shows the distribution of the different classes. The GSW samples consist of 1971 tweets (provided by the shared task organizers) and 3000 GSW samples from the Swiss SMS corpus.

Language         Number of samples
Catalan          2000
Dutch            560
English          2533
French           639
German           3608
Spanish          2707
Swiss German     4971

Table 2: Distribution of samples in the second filter dataset

3.2 Preprocessing

As the dataset sentences are mostly from social media, we used a custom tokenizer that removes common social media tokens (emoticons, emojis, URLs, hashtags, Twitter mentions) that are not useful for LID. We also normalize word elongation, as it might be misleading for LID. In the second filter, we also lower-case the input sentences before passing them to the model.

3.3 Implementation details

3.3.1 BERT filter

We train this filter by fine-tuning a German pre-trained BERT-cased model on our LID task. As mentioned before, we do not freeze the BERT body in the fine-tuning phase. We train it for two epochs, with a batch size of 64 and a max-seq-length of 64. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 2e-5.

3.3.2 fastText filter

We train this filter using the fastText (Joulin et al., 2016) classifier for 30 epochs, using character n-grams as features (where 2 ≤ n ≤ 5) and the embedding dimension set to 50. To favor precision during inference, we label a tweet as GSW if the model probability for GSW is greater than 64% (this threshold is treated as a hyper-parameter and was optimized on the validation set).

4 Results
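For reference, the two-stage inference procedure evaluated in this section (Section 2, with the 64% threshold of Section 3.3.2) can be sketched as follows. This is a minimal illustration rather than the original implementation: the names `Classifier`, `first_filter_accepts`, `second_filter_accepts` and `is_gsw` are hypothetical, each filter is assumed to be a function returning one probability per language label, and the first filter is assumed to accept a tweet when GSW is its most probable label.

```python
from typing import Callable, Dict

# A filter maps a tweet to a probability for each language label.
Classifier = Callable[[str], Dict[str, float]]

GSW = "GSW"
GSW_THRESHOLD = 0.64  # second-filter threshold reported in Section 3.3.2


def first_filter_accepts(probs: Dict[str, float]) -> bool:
    # Assumption of this sketch: the recall-oriented first filter keeps a
    # tweet when GSW is its most probable label (plain argmax).
    return max(probs, key=lambda label: probs[label]) == GSW


def second_filter_accepts(probs: Dict[str, float]) -> bool:
    # Precision-oriented decision: the GSW probability must exceed the threshold.
    return probs.get(GSW, 0.0) > GSW_THRESHOLD


def is_gsw(tweet: str, first: Classifier, second: Classifier) -> bool:
    # A tweet is labeled GSW only if both filters accept it, in order.
    if not first_filter_accepts(first(tweet)):
        return False
    # The second filter receives lower-cased input (Section 3.2).
    return second_filter_accepts(second(tweet.lower()))


if __name__ == "__main__":
    # Stub classifiers standing in for the BERT and fastText models.
    first = lambda t: {"GSW": 0.7, "German": 0.3}
    second = lambda t: {"GSW": 0.6, "German": 0.4}
    print(is_gsw("Hoi zäme", first, second))  # False: 0.6 does not exceed 0.64
```

Extending the sketch to N > 2 filters would amount to chaining further recall-oriented stages before the final thresholded one.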

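The tables in this section report precision, recall and F1-score per label or per model. As a reminder, F1 is the harmonic mean of precision and recall; the short sketch below (the function name `f1_score` is ours, not taken from the original implementation) recomputes it for the fastText filter row of Table 4.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# fastText filter row of Table 4: precision 0.9076, recall 0.9892.
print(round(f1_score(0.9076, 0.9892), 4))  # 0.9466
```

Because the published precision and recall values are rounded to four decimals, recomputed F1-scores can differ from the reported ones in the last digit.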
In this section, we evaluate the performance of our two filters (either in isolation or when present in the full pipeline) on the held-out test dataset of the shared task. We also evaluate the BERT filter on its own test data (Leipzig and SwissCrawl samples).

4.1 BERT filter performance on Leipzig + SwissCrawl corpora

We first evaluate our BERT filter on the test set of the first filter (Leipzig corpora + SwissCrawl). Table 3 shows the filter performance on the different labels. The filter has an F1-score of 99.8% on the GSW test set. However, when this model is applied to Twitter data, we expect a decrease in performance due to the short and informal messages.

Language         Precision   Recall   F1-score
Afrikaans        0.9982      0.9981   0.9982
German           0.9976      0.9949   0.9962
English          0.9994      0.9992   0.9993
Swiss-German     0.9974      0.9994   0.9984
GSW-like         0.9968      0.9950   0.9959
Luxembourgian    0.9994      0.9989   0.9992
Dutch            0.9956      0.9965   0.9960
Other            0.9983      0.9989   0.9986

Table 3: First filter performance on Leipzig + SwissCrawl corpora

4.2 Performance on the shared-task test set

Table 4 shows the performance of both filters, either in isolation or when they are used together. As shown in this table, the model improvement from adding the second filter is rather small. The main reason can be seen in Table 5, as the majority of non-GSW filtering is done by the first filter for the shared-task test set (Table 6).

Model               Precision   Recall   F1-score
BERT filter         0.9742      0.9896   0.9817
fastText Filter     0.9076      0.9892   0.9466
BERT + fastText     0.9811      0.9834   0.9823
fastText Baseline   0.9915      0.3619   0.5303

Table 4: Filters performance on the shared task test-set compared to fastText (Joulin et al., 2016) LID baseline

Model             Number of filtered samples
BERT filter       2741
fastText Filter   35

Table 5: Number of non-GSW removals by each filter

Label     Number of samples
not-GSW   2782
GSW       2592

Table 6: Distribution of labels in test set

4.3 Discussion

Our designed LID outperforms the baseline significantly (Table 4), which underlines the importance of having a domain-specific LID. Additionally, although the positive effect of the second filter is quite small on the test set, when we applied the same architecture to randomly sampled tweets (German tweets according to the Twitter API), we observed that having the second filter could reduce the number of GSW false positives significantly. Hence, the number of filters needed depends on the complexity of the target dataset.

5 Conclusion

In this work, we propose an architecture for spoken dialect (Swiss German) identification by introducing a multi-filter architecture that is able to effectively filter out non-GSW tweets during the inference phase. We evaluated our model on the GSW LID shared task test-set and reached an F1-score of 0.982.

However, there are other useful features that can be used during training, such as orthographic conventions in GSW writing, as observed by Honnet et al. (2017), whose presence might not be easily captured even by a complex model like BERT. Moreover, in this paper, we did not use tweet metadata as a feature and only focused on tweet content, although metadata can improve LID classification for dialects considerably (Williams and Dagli, 2017). These two, among others, are directions for future work that need to be further studied to see their usefulness for low-resource language identification.

³ Catalan, Croatian, Danish, Esperanto, Estonian, Finnish, French, Irish, Galician, Icelandic, Italian, Javanese, Konkani, Papiamento, Portuguese, Romanian, Slovenian, Spanish, Swahili, Swedish

⁴ Bavarian, Kölsch, Limburgan, Northern Frisian, Palatine German

References

Simon Carter, Wouter Weerkamp, and Manos Tsagkias. 2013. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1):195–215.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building large monolingual dictionaries at the leipzig corpora collection: From 100 to 200 languages. In LREC, volume 29, pages 31–43.

Harald Hammarström, Sebastian Bank, Robert Forkel, and Martin Haspelmath. 2020. Glottolog 4.2.1. Max Planck Institute for the Science of Human History, Jena. Available online at http://glottolog.org, Accessed on 2020-04-18.

Pierre-Edouard Honnet, Andrei Popescu-Belis, Claudiu Musat, and Michael Baeriswyl. 2017. Machine translation of low-resource spoken dialects: Strategies for normalizing swiss german. arXiv preprint arXiv:1710.11035.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Lucy Linder, Michael Jungo, Jean Hennebert, Claudiu Musat, and Andreas Fischer. 2019. Automatic creation of text corpora for low-resource languages from the internet: The case of swiss german. arXiv preprint arXiv:1912.00159.

Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of the conference on empirical methods in natural language processing, pages 1524–1534. Association for Computational Linguistics.

Elisabeth Stark, Simon Ueberwasser, and Beni Ruef. 2015. Swiss sms corpus. www.sms4science.ch.

Svitlana Volkova, Theresa Wilson, and David Yarowsky. 2013. Exploring demographic language variations to improve multilingual sentiment analysis in social media. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1815–1827, Seattle, Washington, USA. Association for Computational Linguistics.

Jennifer Williams and Charlie Dagli. 2017. Twitter language identification of similar languages and dialects without ground truth. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 73–83.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Arkaitz Zubiaga, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel Campos, Iñaki Alegría Loinaz, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno-Fernández. 2014. Overview of tweetlid: Tweet language identification at sepln 2014. In TweetLID@SEPLN, pages 1–11.

Arkaitz Zubiaga, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel, Iñaki Alegria, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2016. Tweetlid: a benchmark for tweet language identification. Language Resources and Evaluation, 50(4):729–766.