UPTEC F 20043 Degree project 30 hp September 2020

Automatic language identification of short texts

Anna Avenberg

Abstract

The world is growing more connected through the use of online communication, exposing software and humans to all the world's languages. While devices are able to understand and share raw data between themselves and with humans, the information itself is not expressed in a monolithic format. This causes issues both in human-to-computer interaction and in human-to-human communication. Automatic language identification (LID) is a field within artificial intelligence and natural language processing that strives to solve a part of these issues by identifying languages from text, sign language and speech. One of the challenges is to identify the short pieces of text that can be found online, such as messages, comments and posts on social media. This is due to the small amount of information they carry.

The goal of this thesis has been to build a model that can identify the language of these short pieces of text. A long short-term memory (LSTM) machine learning model was built and benchmarked against Facebook's fastText model. The results show that the LSTM model reached an accuracy of around 95%, while the fastText model used for comparison reached an accuracy of 97%. The LSTM model struggled more when identifying texts shorter than 50 characters than with longer texts. The classification performance of the LSTM model was also relatively poor in cases where languages were similar, such as Croatian and Serbian. Both the LSTM model and the fastText model reached accuracies above 94%, which can be considered high, depending on how it is evaluated. There are however many improvements and possible directions for future work to consider: looking further into texts shorter than 50 characters, evaluating the model's softmax output vector values, and handling similar languages.

Supervisor: Prashant Singh
Subject reader: Prashant Singh
Examiner: Tomas Nyberg
UPTEC F 20043

Acknowledgement

I would like to express my gratitude to the people who have given me input, helped me and supported me throughout the thesis. This thesis would not have been possible without your knowledge, guidance and advice. Many thanks to Prashant Singh for sharing his knowledge and always answering my questions. I also wish to thank the group of people who enriched my lunches with good company and discussions. Lastly, I want to thank my family and friends for staying positive and cheering me on.

Popular science summary (Populärvetenskaplig sammanfattning)

Today we communicate to a large extent through our mobile phones and computers. Communication is fast; we react, comment and reply to friends and family as well as strangers online. Much of what we write about and to each other is in the form of short texts, such as a text message, a chat message, a comment or a post on social media. Since the internet reaches people all over the world, much of what appears online is written in several hundred different languages. For websites and programs to be able to handle all languages, it is important that they can identify the languages automatically.

Artificial intelligence, or AI, is a subject many have heard of in recent years. AI is largely about letting computers perform tasks that previously required human intelligence. One subfield of AI is natural language processing, part of which concerns automatic language identification. In this work, methods for building language identification models for short texts have been studied and tested. Machine learning is a part of AI where it is possible to teach a computer different tasks without having to program for every exact data point, in this case text. As a result, it is possible to design models for the computer where all that is needed is data in which different languages are represented.

In this work, posts in different languages were retrieved from Twitter, and a machine learning model was constructed from the collected information. Twitter is a source of short texts, and posts in 9 different languages were retrieved: English, Swedish, Spanish, Portuguese, Russian, German, Polish, Serbian and Croatian. The constructed model managed to correctly identify 95% of the test data used. The model was compared with the fastText model, developed by Facebook, which correctly identified 97% of the test data. It was surprising that the models achieved an accuracy above 90%, even though they were not trained on a large amount of data. This suggests that the machine learning models freely available online today have great potential to be used in applications with good results. There is much left to explore in the field of language identification of short texts and in how models of this kind should be interpreted. Remaining problems include the model's uncertainty, how the shortest texts can be identified, and how several similar languages should be handled by the model.

Contents

Popular science summary (Populärvetenskaplig sammanfattning) 4

1 Introduction 7
  1.1 Background 7
  1.2 Objectives 7
    1.2.1 Sub-objectives 8

2 Theory 9
  2.1 Natural language processing 9
    2.1.1 Automatic language identification 9
  2.2 Artificial intelligence 9
  2.3 Machine learning 9
  2.4 Artificial neural networks 11
    2.4.1 Activation functions 12
    2.4.2 Output activation function 14
    2.4.3 Training of neural networks 16
  2.5 Recurrent neural networks 17
    2.5.1 Long short-term memory 19
  2.6 Pre-training of models 20
  2.7 Feature extraction of text 21
    2.7.1 Bag of words 21
    2.7.2 N-grams 22
    2.7.3 Word and character embeddings 22
  2.8 Data extraction 23
    2.8.1 Twitter's API 23
    2.8.2 Web crawling 23

3 Method 24
  3.1 Dataset 24
  3.2 Preprocessing 25
  3.3 Machine learning model 26
  3.4 Software setup 28
    3.4.1 TensorFlow and Keras 28
  3.5 Experiments 29
    3.5.1 Hyperparameter tuning 29
    3.5.2 Word versus character embeddings 30
    3.5.3 Preprocessing variations 30
    3.5.4 fastText 30
    3.5.5 Two similar languages 30

4 Results 30
  4.1 Hyperparameter tuning 30
  4.2 Word versus character embeddings 32
  4.3 Preprocessing variations 32
  4.4 Final model 33
  4.5 fastText 35
  4.6 Two similar languages 36

5 Discussion 38
  5.1 Results 38
  5.2 Errors 41
  5.3 Ethical reflection 41
  5.4 Future work 42

6 Conclusion 42

References 43

1 Introduction

More than 4 billion people use the internet today, which corresponds to over 50% of the global population [1]. This means that hundreds of different languages are being used online daily. A subfield within artificial intelligence called natural language processing has made it possible for all of these languages to be identified, translated and communicated between each other. The past decade has provided new technology, like big data and parallel computing, that has made these artificial intelligence ideas possible to implement. Natural language processing covers how to handle language data and gain more knowledge about a language. Automatic language identification is important for many types of natural language processing tasks today. As online communication often consists of short pieces of text, the challenge of how to identify the language of these texts has arisen. The thesis covers parts of this field and how machine learning can recognise languages from short texts. Machine learning can be used to create computer models that train on known data and are later able to classify new, unknown data.

1.1 Background

In recent years it has become more common to come across websites where automatic language identification is used, like Facebook and Google [2]. It is impressive how translation and identification of languages often is accurate, although sometimes the translation of an expression or saying does not quite match the real meaning. Another example of where language identification (LID) models are being used is in the search bar of websites. When the user types in a couple of words or a sentence, the search bar first identifies what language the text is written in, before it gives the most relevant results in the same language as typed by the user.

Languages continuously develop and change, which can become an issue when building models that automatically identify languages. In a time where people communicate online, the changes happen quickly; from one day to another an online community can start to use a completely new word or slang. Badly written text and misspelled words are common online because of the quick communication on online platforms. The possibility to identify the language of "dirty" and tricky short pieces of text still needs to be explored further. Is it possible for machine learning models that identify languages to keep up with the constant change of languages? How easy is it to create a machine learning model to automatically identify different languages from shorter expressions and words that contain slang or misspellings, or are badly written? Can a model be created to distinguish between possible languages for a short online expression? These questions will be explored in this master's thesis.

1.2 Objectives

The objective of this master's thesis is to explore how a neural network can be created to identify the language of short texts from internet platforms, with the goal of quickly and automatically going from a newly trained network to implementation.

This master's thesis will include a study of how language identification is set up and used today. A specific and well represented dataset of short texts, for example comments on posts on internet platforms like Twitter, is to be found and structured. The dataset could be from any type of specific environment where short pieces of text occur. A machine learning model is to be designed, trained and tested for automatic language identification.

1.2.1 Sub-objectives

• Perform a literature study of the state-of-the-art concepts of Automatic Language Identification (LID) and Natural Language Processing (NLP).
• Find and extract data from the internet.
• Design, construct and test a machine learning model to identify languages from short texts.

A flow chart of the work covered in this thesis report is shown in figure 1.

[Flow chart: Assemble knowledge -> Model decision -> Data collection -> Preprocessing -> Machine learning models (fastText, LSTM) -> Evaluation -> Conclusion]

Figure 1: Flow chart of thesis work.

2 Theory

2.1 Natural language processing

Natural language processing (NLP) is a field within computer science and linguistics, which covers the area of how languages can be described, represented, used and constructed in a computational manner. NLP, also known as computational linguistics, has been around since the 1980s [3] and includes several different fields. Some examples of NLP topics are natural language modeling, text identification, text translation, question and sentence answering, and summarization [3]. As computational power and parallelization have increased at a rapid speed in the last few years, machine learning and deep learning can now be applied to the field of NLP [3]. Some examples of machine learning problems within NLP are machine translation and automatic language identification.

2.1.1 Automatic language identification

Automatic language identification (LID) aims to identify languages without human intervention [4]. LID is a process present in many web services today. When searching the web, many websites apply LID to the text written in the search bar, so that the most relevant search results are exposed first. Another example is the translation tools that automatically recognise what language is written and then translate it to the desired language. In machine translation, which is the automatic translation of documents, the first step before automatic translation is having a well working LID model. Language identification is a crucial part of natural language processing. In general, artificial intelligence and machine learning try to mimic the way the human brain works [5], which will be explained further in the sections below. Several different approaches have been studied since the 60s in the field of automatic language identification [4], but just like AI, it has seen a rapid increase in performance in the last 10 years. LID can be used on data from speech, sign language or text.

2.2 Artificial intelligence

Artificial intelligence (AI) is a part of computer science that has sprung up to be one of the most current and progressive topics in technology today. AI has been around much longer than many expect and was introduced already back in the 1950s. The early idea of AI still holds today: the intelligence of machines, trying to mimic the intelligence found in humans and animals. AI is widely used today for many different applications and can include many different things. There are however different definitions of AI, and it is used as a broad term for several fields. The main idea of AI is for machines to do the tasks that usually require human intelligence.

2.3 Machine learning

Similar to AI, machine learning (ML) lacks an explicit definition. One common definition was phrased by Arthur Samuel in 1959 as: "Machine learning is giving computers the ability to learn without being explicitly programmed." [6]. Machine learning is AI, or could be defined as a sub-field of AI. The general idea of ML is to learn or recognise a pattern within an existing dataset. This creates a model that can be used in future applications where new data is introduced and run through the model and, depending on the task, gives a result based on the previously seen data. ML can be described as generalizing a model from one set of data for a specific task, to be used in the future for another similar set of data for the same task. Depending on what type of task is to be performed with a model, different types of models and algorithms can be used.

There are two distinct types of problems in ML: regression problems and classification problems. Regression problems handle numerical desired outputs, like the pricing of houses or temperatures, where the data is quantitative. Classification problems on the other hand handle categorical outputs, being specific labels like true or false, or one label of several classes like a specific language, where the data is qualitative [9]. The mathematical algorithms behind many of the most frequently used machine learning models have been around for over 50 years, and the probability theory is based on old knowledge, like Bayes' theorem [10].

There are three major approaches within machine learning: supervised machine learning, unsupervised machine learning and reinforcement learning [11]. Supervised learning is the most widely studied and used type of ML. In supervised machine learning, the dataset used for training the machine learning model has a label for each data sample. For example, if the dataset consists of pictures of animals, each picture also includes the correct label of what animal is in the picture, like "dog" or "cat". Both regression and classification problems can be performed through supervised learning. Supervised learning requires a labelled dataset for training and testing.

In unsupervised learning and reinforcement learning, the dataset does not have a label for each sample. This means that the learning tasks differ from what is done in supervised learning; neither regression nor classification problems can be solved here. Unsupervised learning only has data samples as inputs and knows nothing else; the model can from there learn patterns within the dataset and cluster similar data together. Unsupervised learning can be used as a pre-model to supervised models or to give an idea of how samples within a dataset relate to each other. Reinforcement learning does not contain a label for every data sample in the dataset; instead the data contains other information about how different outputs are scored depending on what tasks are being investigated. Typically the input to the model contains a described action, a few different outputs of the action and a grade for each output [11]. Reinforcement learning can be of good use to train a model to play a game, like chess, or to simulate animals' behaviour. Reinforcement learning can be described as partly unsupervised and partly supervised learning.

The challenge for this thesis is a classification problem that will be approached by using supervised machine learning models.

2.4 Artificial neural networks

Artificial neural networks, or just neural networks (NN), mimic the structure of the human brain. The main idea of a NN is that it consists of many neurons that receive knowledge about a task through training, just like the human brain is trained to learn new things during a lifetime. The idea of AI, and specifically of NN, has been around for a long time, has developed over the years and is today sometimes referred to as deep learning. Deep learning is when the NN becomes deeper, that is, when more so-called hidden layers are added to the network. The past decade has provided a large upswing and improvement of NN due to improved computational power but also due to the possibility of gathering data. Big data, data mining, databases, open sources and cloud computing have made it possible for data scientists to build more powerful AI algorithms and machine learning models. Deep learning is possible because, as of today, computers can handle the large amount of data and the large architecture of the networks. Deep learning can be implemented in systems for several different fields, for example disease diagnosis, image recognition and language identification.

The most basic NN are feed forward neural networks. A feed forward NN can be used for all three types of machine learning: supervised, unsupervised and reinforcement learning. They are all based on the same basic structure. Neural networks are built up by several layers of neurons, also called nodes, which are "hidden" units being trained when data is run through them. The nodes are connected together in something called a hidden layer, which acts like a black box of the learning part of a neural network model. The connections between the nodes can be mathematically expressed as weights, and the weights are updated when new data is used as input. A neural network usually consists of several hidden layers, with many weights that are updated during the neural network model training [12]. The feed forward NN is fully connected, meaning that each node in each layer is connected to every node in the next hidden layer, see figure 2. A basic feed forward neural network consists of one input layer, hidden layers, and finally one output layer.

Figure 2: A simple feed forward neural network.

NN are nonlinear models, designed to describe and handle nonlinear relationships. The network gives an output z given an input x, where z is produced by several hidden layers of linear functions and nonlinear activation functions. The basic linear regression problem is given by equation 1.

z(x) = \beta_0 \cdot 1 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \qquad (1)

Here the output z is given by the summation of all the input values of x = [1 \; x_1 \; x_2 \; \ldots \; x_p]^T, each multiplied with a parameter \beta_i. To describe nonlinear relationships between z and x, these factors are put through a nonlinear function, called an activation function, often denoted \sigma, see equation 2.

z(x) = \sigma(\beta_0 \cdot 1 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p) \qquad (2)

2.4.1 Activation functions

Two common activation functions used in NN are the logistic (sigmoid) function and the rectified linear unit (ReLU). They are described by equations 3 and 4 and are depicted in figure 3. The idea of an activation function is to mimic, as mentioned earlier, the functioning of the human brain and act like a "switch", where an input contribution either gives a value of 1 or is kept at 0. The sigmoid function was the first choice of activation function for many years, but in recent years the simpler ReLU function has become popular. The main reasons are the ReLU's simplicity and the faster learning of the network compared to the sigmoid function [13]. A couple of downsides with the ReLU are that negative neurons will be kept at zero and have a hard time recovering, and that if the learning rate is too high the model can stop updating the weights. The ReLU function is only used for the hidden layers, not the output layer.

\mathrm{ReLU}(x) = \max(0, x) \qquad (3)

\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (4)
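As a concrete illustration, here is a small NumPy sketch of equations 3 and 4; the function names and test values are purely illustrative.

import numpy as np

def relu(x):
    # Equation 3: pass positive inputs through, clamp negatives to zero.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Equation 4: squash any real input into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))     # [0. 0. 3.]
print(sigmoid(x))  # roughly [0.119 0.5 0.953]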

(a) Rectified, ReLU, function. (b) Logistic, sigmoid, function.

Figure 3: Two common activation functions used in neural networks: (a) ReLU and (b) sigmoid.

For neural networks, and further for deep neural networks, several activation units are put together; these are then described as the hidden units h_i.

h_i = \sigma(\beta_{0i} \cdot 1 + \beta_{1i} x_1 + \beta_{2i} x_2 + \cdots + \beta_{pi} x_p), \quad i = 1, 2, 3, \ldots, M \qquad (5)

[Figure: a two-layer feed forward network with inputs x_1 ... x_p, one hidden layer of units h_1 ... h_M and outputs z_1 ... z_k]

Figure 4: A two-layer neural network built up by one hidden layer, consisting of M hidden units h_i.

In figure 4 a basic neural network can be seen, also known as the "two-layer neural network". The full model can be expressed mathematically in matrix form by describing weight matrices W and offset vectors b, see equation 6.

b^{(1)} = \begin{bmatrix} \beta^{(1)}_{01} & \beta^{(1)}_{02} & \ldots & \beta^{(1)}_{0M} \end{bmatrix}^T, \quad
W^{(1)} = \begin{bmatrix} \beta^{(1)}_{11} & \ldots & \beta^{(1)}_{1M} \\ \vdots & \ddots & \vdots \\ \beta^{(1)}_{p1} & \ldots & \beta^{(1)}_{pM} \end{bmatrix}, \quad
b^{(2)} = \beta^{(2)}_0, \quad
W^{(2)} = \begin{bmatrix} \beta^{(2)}_1 \\ \vdots \\ \beta^{(2)}_M \end{bmatrix} \qquad (6)

The model can therefore be described by equations 7 and 8.

h = \sigma(W^{(1)T} x + b^{(1)T}) \qquad (7)

z = W^{(2)T} h + b^{(2)T} \qquad (8)

Further on, deep neural networks are described by adding more hidden layers, as seen in figure 5, where L hidden layers give (M \cdot (L-1)) hidden units [14].

[Figure: a deep feed forward network with inputs x_1 ... x_p, L hidden layers of units h_1^1 ... h_M^L and outputs z_1 ... z_k]

Figure 5: A deep neural network with L hidden layers.

2.4.2 Output activation function

So far, the neural network that has been presented covers the regression problem. When handling a classification problem, a layer with another activation function is added for the output layer. This turns the hidden units into probabilities for each class, since a classification problem handles qualitative data instead of quantitative data (as in regression). To the output layer in figure 5, an extra softmax output layer is added. Figure 6 shows the idea of a softmax output layer, which yields a probability for every class, in this case what object the input x is representing.

Figure 6: The softmax output layer in a classification problem, yielding a probability for each class for one input sample.

Equation 9 gives the softmax function, which can be seen in figure 7. The softmax function gives a probability for each class, and the probabilities sum up to 1. The sigmoid activation function on the other hand produces values between 0 and 1 that are independent of each other and do not sum to 1.

S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}} \qquad (9)

Figure 7: The softmax function.

The softmax outputs a vector of the probability for each class. This could give a hint towards how certain the model is on predicting the correct class, or how confused the model is. Either the highest value of the softmax output vector is chosen as the predicted class, or a threshold value can be chosen for this purpose.
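To make this behaviour concrete, the sketch below implements equation 9 in plain NumPy; subtracting the maximum is an added numerical-stability trick and does not change the result.

import numpy as np

def softmax(y):
    # Shift by the max for numerical stability; the constant cancels
    # in the ratio of equation 9, so the output is unchanged.
    e = np.exp(y - np.max(y))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw outputs for three classes
probs = softmax(scores)
print(probs)           # roughly [0.66 0.24 0.10]
print(probs.sum())     # 1.0, as required of a probability vector
print(probs.argmax())  # 0: the class with the highest probability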

2.4.3 Training of neural networks

The general architecture of a neural network is now established; moving on to how the neural network is trained on new input samples x. When training a neural network, the main goal is to minimize a loss function, that is, a function that describes the distance between the correct label of the input data and the predicted label given by the neural network model. Figure 8 shows the idea of what the model iterates to find, which is the minimal loss. Where the minimal loss is found, those are the values of the weights the model needs to keep to perform well. The loss function gives the distance between the true label z of an input x and the predicted output z'. Through backpropagation and optimization, the weights of the network are updated. Backpropagation computes the gradient of the loss function by going backwards in the neural network to update the weights. A common optimization technique is stochastic gradient descent, which will find the minimum loss as seen in figure 8. One variant of this is an optimization method called ADAM (adaptive moment estimation) [15], which is used in this thesis.

Figure 8: Finding the minimal loss function value is the goal when training a neural network.

The dataset is split into different parts: one dataset for training, one dataset for validation and one dataset for testing (hold-out dataset). The training dataset is used for training the model weights during the backpropagation. The validation and test datasets on the other hand are only used to evaluate the model, to see how the model acts on new data. The validation dataset is used during training as an unbiased evaluation. It is important to keep the training dataset completely separated from the validation and test sets, as the model is supposed to be general. If the validation dataset were also used for training, the model would be biased and not give trustworthy results when used on new, unseen data. The validation dataset is used to validate the model during the training. An additional dataset, the test dataset, is used to have an extra evaluation of the model and how well it predicts new data. All of the data samples are labeled, and the predictions can therefore be compared with the actual labels. The split of the complete dataset is 80% training data, 10% validation data and 10% test data. The split can be divided differently between the datasets, and something like cross-validation can also be used to make use of all data without risking overfitting or underfitting the model. Overfitting means that the model is well fitted to the training data, but does not generalize well to new data samples. Underfitting a model means that the model does not capture the patterns of the data well, and usually means that the model is too simple.

When evaluating the three datasets, the loss and the accuracy are of interest. The loss is, as described above, the distance between the predicted label and the actual label of a dataset. The choice of loss function often depends on the model's task. For this thesis a sparse categorical cross-entropy loss function [16] is chosen, because the problem has one correct class for each data sample out of several classes. The cross-entropy loss function measures the distance between two probability distributions, the two distributions being the true and the predicted classes. The function is given by equation 10, where y_i is the true label and \hat{y}_i is the predicted label. A categorical cross-entropy loss function is used when a model has more than two classes.

\mathrm{Loss} = -\sum_{i=1}^{\text{output size}} y_i \cdot \log(\hat{y}_i) \qquad (10)

The accuracy is the number of correctly classified data samples out of all the data samples in each dataset. Metrics such as precision, recall and the F1-score can give a more objective view of the results of a model: they describe the false positive rate, the sensitivity to false negatives, and a ratio between false positives and false negatives.
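As a concrete illustration of equation 10, the sketch below computes the sparse categorical cross-entropy for a single sample in NumPy; the example probabilities are made up.

import numpy as np

def sparse_cross_entropy(true_class, predicted_probs):
    # Sparse variant: the label is a class index rather than a one-hot
    # vector, so the sum in equation 10 collapses to a single log term.
    return -np.log(predicted_probs[true_class])

probs = np.array([0.05, 0.85, 0.10])   # softmax output for one sample
print(sparse_cross_entropy(1, probs))  # ~0.16: confident and correct
print(sparse_cross_entropy(0, probs))  # ~3.0: the true class got only 0.05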

2.5 Recurrent neural networks

Recurrent neural networks (RNN) are a type of neural network designed to handle sequential data samples. Examples of problems that can benefit from a recurrent neural network instead of a regular feed forward neural network are speech recognition, human activity recognition and language detection/identification. These problems include data samples that depend on the previous time steps of the sample. To identify a language, the previous word or character makes it easier to see what language the text is in. The same applies to human motion: to know if the movement was sitting or running, the previous time instances from a sensor can reveal what activity is currently being performed.

Figure 9: An RNN has a feed-back connection inside the hidden layer.

Compared to a regular feed forward artificial neural network, the recurrent neural network has a feedback connection within the hidden layer, see figures 9 and 10. This is why recurrent neural networks can, in theory, handle sequential input data better than feed forward neural networks.

Figure 10: Unfolded RNN architecture.

The hidden units in RNN are described, as before, by a weight matrix W, but also by a hidden-state-to-hidden-state matrix U multiplied with the hidden state of the previous time step, (t-1), see equation 11.

h_t = \sigma(W x_t + U h_{t-1}) \qquad (11)

As equation 11 shows, each hidden unit contains information not only about the previous sample, but about all previously passed samples. The hidden units are trained with new data through an extension of backpropagation (used in feed forward networks) called backpropagation through time, BPTT [17]. Recurrent neural networks have certain problems when handling time-serial data samples, where a vanishing gradient problem can occur. The gradients can also blow up and give an unreliable model [18].

2.5.1 Long short-term memory

Long short-term memory (LSTM) is a type of recurrent neural network that handles the vanishing gradient problem that RNN have. LSTM units keep a more constant error for the model and can therefore keep training over many steps. The LSTM cell keeps more information, and can both memorize information, reject information and decide when and where to open gates for the information to flow through. The LSTM cell blocks or passes information depending on the values of the weights of the unit, just like in feed forward networks. Through backpropagation and stochastic gradient descent, these cells iteratively optimize their weight values, learning when to pass, delete or store information, see figure 11. In figure 11 an LSTM cell can be seen. The LSTM cell has several gates that interpret and act on the input and the recurrent inputs, compared to the RNN cell, which only pushes the input and the recurrent input through the cell.

Figure 11: A detailed sketch of the inside of an LSTM cell. Source: [19].

An additional feature that can be added to a recurrent network cell is to use a bidirectional layer of the recurrent neural network. A unidirectional RNN is what has been described so far, and is similar to a feed forward NN in that it pushes the input data, and the recurrent input data, through each cell. In a bidirectional RNN the input data is pushed through the NN both forwards and backwards. This means that if an input sample represents a string of text, the text trains the weight values in both directions of the text. This can be beneficial for text inputs, as it may help the NN learn more about the word order of a language. Figure 12 shows a sketch of the unidirectional and bidirectional RNN layers.

Figure 12: The difference between a unidirectional RNN and a bidirectional RNN. The green circles represent the input layer, the pink circles represent the hidden layer and the yellow circles represent the next layer or output layer.

2.6 Pre-training of models

A ML model can be pre-trained before being trained for its final purpose. This means that the first pre-training task could, for example, be to cluster data together, or to let the model see a lot of textual data samples. After the pre-training, a new dataset for the final task is used to train the model. There are several pre-trained language models out on the web. Many of these models are developed, and used, by large companies like Amazon, Google and Facebook. For this thesis, Facebook's language identification model fastText has been studied and used as a benchmark for the neural network.

FastText is both an open Python library for training language models on specific data and a collection of pre-trained language models. FastText provides trained word vectors for 157 languages, a pre-trained language identification model that can identify up to 176 languages, and a library to train models on specific data (both supervised and unsupervised) [22]. FastText can train a language identification model on thousands of data samples in just a couple of minutes. Compared to the neural network architecture, fastText has a simpler design: a linear classifier with a low rank matrix constraint [23]. The model architecture includes a hierarchical softmax to reduce running time. The model also includes a combination of a bag-of-words model and an N-gram approach (both explained further below) to increase performance and reduce running time even more. The N-gram model gets information about the characters around the current time instance but takes more memory capacity, while the bag-of-words model does not capture the text's features as well. A combination of these two gives the bag-of-n-grams model that is used in fastText [24].
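As an illustration of how such a pre-trained model can be used, the hedged sketch below loads fastText's published language identification model; lid.176.bin refers to the downloadable pre-trained model file, and the example sentence is arbitrary.

import fasttext

# lid.176.bin is fastText's pre-trained language identification model,
# downloadable from the fastText website (covers 176 languages).
model = fasttext.load_model("lid.176.bin")

# Ask for the top 3 candidate languages and their probabilities.
labels, probs = model.predict("vilket språk är det här?", k=3)
print(labels, probs)  # e.g. ('__label__sv', ...) with confidence values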

2.7 Feature extraction of text

It is not possible for a LID model to use raw text data as input; the text needs to be handled in some way for the computer to understand it. When handling text as input to a machine learning model, it needs to be established how to represent the text. In today's digital world, texts can include many different scripts, letters, characters, signs and also such things as emojis. Luckily there is an encoding for most characters in computers today called Unicode. Unicode is a universal character encoding that is constantly updated with new characters and, as of version 13.0, it includes almost 150,000 characters [25]. This is one way to represent all the characters found in texts. The size and form of the input samples also need to be handled. There are many different approaches to this, which include representing sentences as vectors of words or characters. How to map these words or characters needs to be established. A few different approaches to computationally create these text representations are discussed below.

2.7.1 Bag of words

The bag of words (BoW) method can be described as gathering a vocabulary of known words from data and measuring the presence of known words in every data sample. Every word found in the training dataset can be added to a vocabulary of words to keep track of. One text sample, a sentence of a couple of words, is split into its words. Each word is added to a vector, which keeps track of all the words found in all of the data samples used for training the model. Every sample of text is then mapped from this large word vector, "the bag of words", with a hashing encoding. An example is given in figure 13.

Figure 13: An example of how the bag-of-words method works.

A couple of drawbacks with the BoW method are that it assumes all words in the text are correctly written, in order to map the same words to the same place in the bag-of-words vector. If the word "hello" were also found as "helloo" in two different data samples, the two would be represented as different words in the vector, even though it can be assumed that they are the same word, one of them misspelled with an extra "o". Another drawback with the BoW method is that it requires a lot of training data to represent many different languages well, along with all the possible words in each language, for a model to be general. This also causes the bag-of-words vectors for all the languages in the model to become large, which leads to long computational times and the need for a lot of memory.
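To make the mapping concrete, here is a minimal sketch using scikit-learn's CountVectorizer, one of many possible BoW implementations; the toy corpus is made up.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["My new cat", "my cat sleeps", "a new day"]  # toy data samples

vectorizer = CountVectorizer()        # word-level bag of words
X = vectorizer.fit_transform(corpus)  # one count vector per sample

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # word counts per sample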

2.7.2 N-grams

N-grams are another approach to representing text. The idea of N-grams is that instead of representing a piece of text by every word, an N-gram method takes N grams at a time into account (where a gram can be one word, part of a word or a set of characters). It is easier described by an example. The sentence "My new cat" can be vectorized as ["My" "new" "cat"]; this is what it would look like when applying a BoW method to the sentence. An N-gram model can vectorize either per N words or per N characters. A 2-gram model would vectorize it either as ["My new" "new cat"] or as ["My" "y " " n" "ne" "ew" "w " " c" "ca" "at"]. The sentence is divided into objects of two words or two characters, sequentially. N-gram models can be powerful, since the new vectorization of the text gives an idea of what word or character comes before another; it gives each represented token a context. An N-gram model takes words or characters that come before and after each other into account and can therefore help a machine learning model to identify which words belong together. Such models are often used when the task is to automatically finish sentences, using probability theory to predict the next word in an unfinished sentence [26]. The N-gram models are probabilistic models, but the N-gram idea can also be used to extract the features of a text to be used as the input for a neural network. A minimal sketch of character 2-grams is shown below.
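This sketch reproduces the character 2-gram example above in plain Python; the function name is arbitrary.

def char_ngrams(text, n=2):
    # Slide a window of n characters over the text, one step at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("My new cat", 2))
# ['My', 'y ', ' n', 'ne', 'ew', 'w ', ' c', 'ca', 'at']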

2.7.3 Word and character embeddings

There are many different approaches to word and character embeddings, and many companies have developed toolkits for this purpose. Some examples are Google's word2vec, the NLTK library, skip-grams, N-grams and BoW, to mention a few. One approach, used specifically for neural networks, is to decide how many words or characters the model should know about, in contrast to the BoW methods that keep all known words. This word or character embedding approach only saves the most common words appearing in the training dataset. Embeddings keep an extra word or character category, the "out of vocabulary" token, which represents the rest of the words without specifying them further. If, for example, the size of the vocabulary is chosen to be 50, the 50 most common words or characters (depending on what is chosen) are saved to a vector and all of the leftover words or characters are put in the same category. The vector of the 50 most common tokens, plus 1 for "out of vocabulary" words or characters, is numbered from 0 to 50. All data samples are mapped from that vocabulary. The dataset is thereby transformed from vectors of strings to vectors of integers that can be decoded with the vocabulary of 51 tokens. The dataset can be chosen to be embedded as words or as characters.
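A hedged sketch of this index mapping in plain Python, kept library-neutral (Keras offers equivalent utilities); the sizes and sample strings are illustrative.

from collections import Counter

samples = ["hej hej världen", "hello world", "hola mundo"]

# Count character frequencies over the training data.
counts = Counter("".join(samples))

# Keep the most common characters; index 0 is reserved for out-of-vocabulary.
vocab_size = 10
vocab = {ch: i + 1 for i, (ch, _) in enumerate(counts.most_common(vocab_size))}

def encode(text):
    # Map each character to its index, or 0 if it is out of vocabulary.
    return [vocab.get(ch, 0) for ch in text]

print(encode("hej world"))  # a vector of integers, one per character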

2.8 Data extraction

When dealing with AI and machine learning, one of the key components for it to work is data. The dataset that trains and evaluates the model determines how generalized the model can be. Often, large datasets are needed, and for supervised machine learning the dataset also needs to be correctly annotated. Machine learning has risen in the last decade not only because of the development of computational power but also because of the developments in data structuring and databases. There are millions of data points gathered online, by companies and authorities. The cloud has made it possible to store, retrieve and use data in a whole new way, which makes machine learning possible. Datasets can be private or include sensitive information and, in that sense, be difficult to get access to. Finding a suitable dataset is an important part of building machine learning projects. There are however many open datasets available online. For this thesis it has been important that all data has been found through open sources, which has led to the investigation of how to extract certain types of data online.

The thesis objective is to look into short pieces of text in a specific environment. One source of short pieces of text in many different languages is text messages. These types of texts may be hard to find, since they tend to be private and only the people communicating can see them. Public short pieces of text include, for example, comments on social media posts, and social media posts in general. These texts are usually not that long and can be publicly available. This brought the thesis to use posts on Twitter; it could have been any other specific environment.

2.8.1 Twitter's API

When looking into how to retrieve Twitter data, it quickly becomes clear that Twitter already provides a service for accessing its data. Twitter's API (application programming interface) offers developers and users the possibility to get tweets, replies, direct messages (where the user has granted access), ads, publisher tools and an SDK (software development kit) [27]. Many of these features are of interest when developers embed Twitter in their own applications. For this thesis, only the tweet retrieval function is of interest.

2.8.2 Web crawling

Another possible way to find suitable datasets, not only from Twitter but from the entire web, is through web crawling services. One commonly mentioned and used web crawler service is Common Crawl [28]. Common Crawl frequently (almost every month) crawls the web and makes all of the crawled data open to anyone. Common Crawl provides web crawls as raw web page data, extracted metadata and text extractions.

3 Method

3.1 Dataset

For this thesis, a representative dataset of short online texts was required. The dataset needs to contain natural mistakes made by humans, like misspellings, typos, grammatically incorrect text, informal words and slang. Considering it is both time consuming and inaccurate to create a dataset of this type by oneself, it was ideal to find already scraped data. Twitter is a good place to find many different people's writing, and the length of the writing is restricted to 280 characters. The dataset from [29] contains the tweet identification numbers of 1.6 million tweets in 15 different languages. This dataset was scraped before 2017, which was before the maximum tweet length was expanded to 280 characters [30]. Therefore the Twitter corpus from [29] contains only tweets up to about 140 characters, which was the limit before 2017. The corpus has been annotated manually by native speakers of every language and has been considered to be correctly annotated in this thesis. From the annotated dataset, 9 languages were chosen for the thesis. The 9 languages were chosen so the model had a wide spread of different languages.

Figure 14: Distribution of tweets of different character lengths in the dataset of 8 languages.

To be able to make use of this dataset, Twitter's API is needed. The user needs to create a developer's account through a Twitter account to get the credentials to reach and unlock the API. Once this is set up there are several ways to reach the desired data. This thesis made use of the open library Tweepy, where several ready-to-use functions exist for extracting tweets from just a tweet identification number. The dataset was therefore created by using the dataset [29] of tweet identification numbers for the wanted languages, running them through code that extracted the exact tweet for each number and saving it. The thesis started off getting 2000 samples (tweets) for five different languages; later on this was expanded to 4000 samples per language. The final model used in total 4000 samples for 8 different languages, resulting in around 32,000 samples; the distribution can be seen in figure 14. The languages that were considered were English, Swedish, Spanish, Portuguese and Russian; later German, Polish and Serbian were added. As a final language, 4000 Croatian tweets were added to the dataset.
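A hedged sketch of this extraction step, assuming Tweepy 3.x and placeholder credentials; get_status is Tweepy's call for fetching a single tweet by its identification number, and the file handling around it is illustrative.

import tweepy

# Placeholder credentials from a Twitter developer account.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

tweet_ids = [1234567890, 9876543210]  # IDs from the annotated corpus [29]

with open("tweets.txt", "w", encoding="utf-8") as out:
    for tweet_id in tweet_ids:
        try:
            status = api.get_status(tweet_id, tweet_mode="extended")
            out.write(status.full_text.replace("\n", " ") + "\n")
        except tweepy.TweepError:
            pass  # tweet deleted or protected; skip it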

Table 1: Abbreviations for all 9 languages used in the thesis.

Language     Abbreviation
English      Eng
Swedish      Swe
Spanish      Spa
Portuguese   Por
Russian      Rus
German       Ger
Polish       Pol
Serbian      Ser
Croatian     Cro

3.2 Preprocessing

Preprocessing of the dataset includes structuring the data into suitable files for training data, validation data and test data. For this project the split between these has been kept as 80% training data, 10% validation data and 10% test data. The data was first kept completely raw and later also cleaned from web addresses and emojis. It is optional whether to keep the blank spaces between words; considering that the blank spaces give information about the structure of the text, they have been kept in this thesis. Other characters to consider removing or keeping are hashtags, user names, and symbols like question marks, exclamation marks, commas, diacritical marks etc. These symbols have also been kept, as they can possibly give more information on what language the strings are in. No metadata was used for training, meaning no information on where the tweet was sent from or by whom. Names or retweeted accounts mentioned in the tweets have not been published or used in any other way than as input for the model, as names can differ depending on language.

Word and character embeddings have been used on the collected dataset, and the text has been mapped to a vocabulary as described in section 2.7.3. Considering that the data consists of tweets with a maximum character length of around 140, the input vector length for the ML model was chosen to be at most 150. To make all data samples the same length, zero-padding was used. Zero-padding means that if a data sample, in this case a string, is shorter than 150 characters, the vector is filled with zeroes in the "empty" places. See figure 15 for an example of the word and character embedding and zero-padding of a data sample. The zero-padding was not needed for all of the tests during the thesis but was kept to make the testing easier throughout the thesis work.

Figure 15: Three examples of the test data samples and how they are represented, where the 10 most frequently occurring characters of the vocabulary are shown.

Zero padding is not necessary for all types of layers in a NN. Considering that this thesis experimented with different approaches, the zero-padding was kept throughout the whole project.
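A minimal sketch of the encoding and padding step using Keras utilities, assuming the vocabulary mapping from section 2.7.3; the variable names are illustrative.

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Integer-encoded samples of different lengths (see section 2.7.3).
encoded = [[5, 2, 9, 1], [7, 3]]

# Pad with zeros at the end so every sample has length 150.
padded = pad_sequences(encoded, maxlen=150, padding="post", value=0)
print(padded.shape)  # (2, 150)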

3.3 Machine learning model

The machine learning model designed for this thesis is a network consisting of an embedding layer, a bidirectional long short-term memory layer, a dropout layer, a dense layer and finally a dense output layer. A simple sketch of the final model can be seen in figure 16.

Figure 16: Simple sketch of the final model.

As seen in section 3.2 above, the input data sample to the model is a vector of length 150. This vector is pushed through an embedding layer. From the embedding layer, an output of size 128 is pushed through the bidirectional LSTM layer, which also has 128 units; as the layer is bidirectional it results in 256 outputs. After the bidirectional LSTM layer, these 256 values are pushed through a dense layer, which gives 128 outputs to the final dense layer, which in turn gives eight outputs, one for each of the languages of the model.

The final dense layer, which outputs 8 values, represents a probability for each of the eight languages. The vector of length eight sums to 1, and the highest of these probabilities decides which language the model predicts for the specific input. Ideally, one language has a probability close to 1 and the other seven languages have probabilities close to 0. This could mean that the model is certain of one language. If the values in the vector instead were around 0.5 for two of the eight languages, it could be assumed that the model has a hard time knowing which of the two languages the input sample is. It is not completely clear what the values of this last output vector mean. They could be a measure of how certain the model is of one language, or the probability of how much resemblance the input sample has with each language. In figure 17 an example of a softmax output vector for a Russian tweet can be seen. A hedged code sketch of the architecture is given below.
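A sketch of the described architecture in Keras, assuming the preprocessing of section 3.2; the dropout rate is an assumption, since the text does not state it.

import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 300 + 1   # vocabulary of 300 plus the out-of-vocabulary token
num_languages = 8

model = tf.keras.Sequential([
    layers.Embedding(vocab_size, 128, input_length=150),
    layers.Bidirectional(layers.LSTM(128)),   # 2 x 128 = 256 outputs
    layers.Dropout(0.5),                      # rate assumed, not stated
    layers.Dense(128, activation="relu"),
    layers.Dense(num_languages, activation="softmax"),
])

model.compile(
    optimizer="adam",                        # ADAM, see section 2.4.3
    loss="sparse_categorical_crossentropy",  # equation 10
    metrics=["accuracy"],
)
model.summary()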

Figure 17: Example of what the output from the model looks like, where the bold blue vector value will be the model's predicted label. In this case the predicted label is correct, as the index of the vector place is 4, just like the correct label.

3.4 Software setup

All coding for the thesis has been done in Python 3. Python is a high-level programming language that has become one of the most common languages for data scientists worldwide [31]. Python suits machine learning tasks well and has many well developed libraries for that specific purpose. There are many helpful guides and tutorials available online, which makes Python easy to use. A few to mention are Susan Li's notebook for building an LSTM model on GitHub [32] and Alex-Just's GitHub code and comments for stripping text of emojis and characters [33].

3.4.1 TensorFlow and Keras

It is essential to have a library that can handle specific types of computations when building machine learning models, especially when building deep neural networks. There are several free, open-source libraries today; the most common libraries used in Python include PyTorch and TensorFlow. These are mathematical libraries that have been created for users to easily design and construct machine learning models for different purposes. TensorFlow was developed by Google and PyTorch was developed by Facebook. These libraries were introduced in 2015 and 2016 respectively, and have made a huge impact on the accessibility for data scientists to create machine learning models in a broader sense [34]. Today there are many more libraries that have been created to make models perform well, be visualised and be user friendly. Keras [35] is an API running on top of TensorFlow that makes it easy to implement machine learning models in Python; for this project this is the main reason for using TensorFlow and Keras together.

Apart from these Python machine learning libraries, several other free open-source libraries have been used in the thesis to create the model, visualise the data and evaluate the performance of the different models that have been studied. A few of these libraries are numpy, pandas, sklearn, matplotlib and seaborn [36][37][38][39][40].

3.5 Experiments

Several different aspects of the neural networks have been studied and tested in this thesis, to see how the models handle the data and how well they perform. The important aspects that have been studied are how well the model handles the shortest texts in the dataset, how fast the model predicts test data, and the accuracy.

3.5.1 Hyperparameter tuning

The model has several hyperparameters that affect the model's behaviour and performance in different ways. There are several ways to tune these hyperparameters; different automatic optimization techniques like Bayesian optimization are possible. A common approach is also to try out different parameter values and see what happens. This has been done for the thesis' model. The parameters chosen to be tested and alternated are the vocabulary size, the embedding dimension, the number of epochs and the batch size. The different values for all tests can be seen in tables 2-5. All tests' original settings were kept as a vocabulary size of 300, an embedding dimension of 128, a batch size of 128 and 15 epochs. Depending on which test was performed, one of the parameters was altered.

Table 2: Alternations of the vocabulary size.

Test   Vocabulary size
1      50
2      100
3      300
4      500
5      1000

Table 3: Alternations of the embedding dimension.

Test   Embedding dimension
1      32
2      64
3      128
4      256

Table 4: Alternations of the number of epochs.

Test   Number of epochs
1      5
2      15
3      30
4      50
5      100

Table 5: Alternations of the batch size.

Test   Batch size
1      32
2      64
3      128

The four tests for the hyperparameter tuning were performed using 5 languages with 2000 samples each as the dataset. The five languages were English, Swedish, Spanish, Portuguese and Russian. For all the above tests, the character embedding was used.

3.5.2 Word versus character embeddings

The model was trained with both word embeddings and character embeddings in the preprocessing of the data. The character embedding can also be described as a 1-gram: each character in the text sample is treated as a token. With a word embedding, every word is considered one token. The model accuracy was evaluated. This test was also performed using 5 languages with 2000 samples each. The settings for the word versus character embeddings were kept at a vocabulary size of 300, an embedding dimension of 128 and a batch size of 128 during 30 epochs.

3.5.3 Preprocessing variations

It was desired to have "raw" text for this thesis, meaning not much cleaning of the texts was done. A few tests were made to see what difference it makes if the text contains URLs and emojis. The preprocessing test was performed using eight languages with 4000 samples each. The eight languages were English, Swedish, Spanish, Portuguese, Russian, German, Polish and Serbian. The settings for the preprocessing variations were kept as earlier, with a vocabulary size of 300, an embedding dimension of 128 and a batch size of 128 during 20 epochs.

3.5.4 fastText

The fastText supervised model training, see section 2.6, was used on the dataset and compared to the LSTM model. A speed test for both models was also conducted.
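A hedged sketch of this benchmark setup, assuming the data has been written to text files in fastText's expected format (one sample per line, label prefixed with __label__); the file names are illustrative.

import fasttext

# train.txt lines look like: "__label__swe jag gillar katter"
model = fasttext.train_supervised(input="train.txt")

# Evaluate on the held-out test file: returns (N, precision@1, recall@1).
print(model.test("test.txt"))

# Predict the language of a single short text.
print(model.predict("qué bonito día"))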

3.5.5 Two similar languages

The last experiment was to see what happens when the model is to recognise two similar languages. For this part, Croatian data was added to the model, giving the model 9 languages to recognise. Croatian and Serbian are similar, arguably even the same language with small differences.

4 Results

4.1 Hyperparameter tuning

In tables 6 and 7 the results from the tests of alternating the vocabulary size and embedding dimension can be seen.

Table 6: Results of different vocabulary sizes.

Vocabulary size   Validation accuracy   Validation loss   Test accuracy   Test loss
50                93.80%                0.18              91.81%          0.21
100               95.00%                0.15              93.61%          0.18
300               96.70%                0.11              95.70%          0.11
500               93.80%                0.20              93.71%          0.20
1000              91.70%                0.23              91.91%          0.22

Table 7: Results of different embedding dimensions.

Embedding dimension   Validation accuracy   Validation loss   Test accuracy   Test loss
32                    94.20%                0.17              92.51%          0.21
64                    94.20%                0.18              93.01%          0.22
128                   93.90%                0.18              93.81%          0.20
256                   95.50%                0.15              94.61%          0.18

In figure 18 the accuracy and the loss of training over 100 epochs can be seen.

(a) Accuracy (b) Loss

Figure 18: Accuracy and loss over 100 epochs for a model of 5 languages.

In table 8 the results of tests for alternating the batch size can be seen.

Table 8: Results of different batch sizes.

Batch size   Validation accuracy   Validation loss   Test accuracy   Test loss
32           94.50%                0.16              92.81%          0.24
64           91.60%                0.28              89.91%          0.31
128          95.20%                0.15              93.11%          0.21

4.2 Word versus character embeddings

In figures 19a and 19b the accuracy for different tweet sample lengths is displayed for word embeddings and character embeddings respectively. The embedding setting is decided when mapping the vocabulary onto the training dataset, before starting the model training.

(a) Word embeddings. (b) Character embeddings.

Figure 19: Accuracy per tweet character length for word embeddings and character embeddings.

4.3 Preprocessing variations

The accuracy for different tweet sample lengths when addresses and emojis have been removed can be seen in figure 20a. The corresponding result when no cleaning was done during the preprocessing can be seen in figure 20b.

(a) Accuracy with cleaning. (b) Accuracy without cleaning.

Figure 20: Accuracy for different data sample lengths with and without preprocessing cleaning.

4.4 Final model

The results for the final model can be seen in figures 21 and 22. Although one more language was added later (see section 4.6), most time was spent on evaluating the 8 language model. Figure 21 shows the confusion matrix for the 8 languages, where the true labels of the test samples are on the y-axis and the model's predicted labels for the test samples are on the x-axis.

Figure 21: Confusion matrix for test data of the 8 language LSTM model.

(a) Accuracy (b) Loss

Figure 22: Accuracy and loss for training and validation data.

Figure 23 shows the correctly classified test samples, where the probabilities on the y-axis represent the highest probability from the softmax output vector. The tweet length is represented on the x-axis. The points in the figure are the mean values of the probabilities for the tweet sample lengths present.

Figure 23: The mean probability value for each tweet length among the correctly classified test data.

Similar to figure 23, the probabilities from the softmax output vector for each language have been sorted in ascending order and plotted in figures 24a-24d. Here, every test sample's probability value for the language in question has been plotted, including the test samples whose true language was any other language in the test dataset.

Figure 24: The probability value from the softmax output vector for four languages, (a) English, (b) Portuguese, (c) Russian and (d) Swedish; the values come from all of the test data samples.

4.5 fastText

Table 9 shows the results of the comparison with fastText. FastText only returns the training loss of the trained model, while the LSTM model gives a test loss. A speed test was also performed for both models, timing how long it took for each model to predict 4000 samples. Both models used data with 8 languages and 4000 samples per language.

Figure 25: Confusion matrix for test data of the 8-language fastText model.

Table 9: Results of test accuracy, loss and the time to predict 4000 samples for each model.

Model      Test accuracy   Loss               Prediction speed
LSTM       94.9%           0.16 (test)        4.67 sec
fastText   97.0%           0.060 (training)   0.960 sec
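A prediction speed test of this kind can be timed along the following lines; here, model and samples are placeholders for a trained model and the 4000 preprocessed test samples.

import time

start = time.perf_counter()
predictions = model.predict(samples)          # predict 4000 samples in one call
elapsed = time.perf_counter() - start
print(f"Predicted {len(samples)} samples in {elapsed:.2f} s")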

4.6 Two similar languages

The confusion matrix when adding a language to the LSTM model is shown in figure 26. Below, in figure 27, the confusion matrix for the fastText model when adding one more language is shown. The language that has been added is Croatian, and again a plot of all the corresponding probabilities from the softmax output vector for Croatian and Serbian is shown in figures 28a and 28b.

Figure 26: Confusion matrix for test data of the 9-language LSTM model.

Figure 27: Confusion matrix for test data of the 9-language fastText model.

Figure 28: The probability value from the softmax output vector for (a) Serbian and (b) Croatian, over all test data samples for the LSTM model.

5 Discussion

5.1 Results

From the results of the hyperparameter tuning it was decided to keep the parameters according to table 10. Tables 6, 7 and 8 show the parameter values for which the accuracy was maximised and the loss minimised. The embedding dimension was chosen as 128, even though the accuracy was higher at 256, for two reasons: firstly, it takes longer to train a model with large embedding dimensions; secondly, a large embedding dimension yields a large total number of parameters. Considering the dataset does not have more than around 30,000 data samples, having millions of parameters puts the model at risk of overfitting. Figure 18 shows how the model overfits the training data around epochs 35-40. The training accuracy reaches 100% while the validation accuracy does not increase. The gap between the training and validation loss is also seen to grow in figure 18b, which suggests the model is overfitting. It is therefore better to stop the training before this happens, and the number of epochs was therefore set to 30.

Table 10: The chosen hyperparameters for the model chosen from the test results.

Parameter             Value
Vocabulary size       300
Embedding dimension   128
Number of epochs      30
Batch size            128
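To make the chosen configuration concrete, the sketch below shows how a model with these hyperparameters could be assembled in Keras [35]. The number of LSTM units is an assumption for illustration and is not given by table 10.

import tensorflow as tf

VOCAB_SIZE = 300        # from table 10
EMBEDDING_DIM = 128     # from table 10
NUM_LANGUAGES = 8

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, mask_zero=True),
    tf.keras.layers.LSTM(64),               # 64 units is an illustrative choice
    tf.keras.layers.Dense(NUM_LANGUAGES, activation="softmax"),
])

model.compile(optimizer="adam",                          # Adam, cf. [15]
              loss="sparse_categorical_crossentropy",    # cf. [16]
              metrics=["accuracy"])

# Training with the chosen batch size and number of epochs:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=128, epochs=30)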

Figure 19 suggests that the model more often correctly classifies the shorter pieces of text, less than 60 characters, when character embedding was used. A similar result is seen in figure 20, where the preprocessing was done with and without cleaning of specific characters. Keeping the text raw and "dirty" seems to be beneficial for shorter text, as the accuracy for the shortest character length bin is around 80% without cleaning and closer to 70% with cleaning. It is also clear from both figures 19 and 20 that the accuracy is lower for tweets with less than 60 characters.

Using the hyperparameters in table 10, the final model gives the results in figures 21 and 22. The confusion matrix in figure 21 shows how the accuracy differs between the languages. Russian is 100% accurate on the test dataset, which can probably be explained by the fact that it is the only language using another script, the Cyrillic alphabet instead of the Latin alphabet. This makes it easy for the model to learn to recognise Russian text. German, in contrast, had 8% of its tweets predicted to be English, and the same holds for several other languages like Spanish, Swedish, Polish and Portuguese. There are probably several reasons for this, including the fact that many non-English languages have incorporated English words and hashtags, and the fact that many of these languages share many similarities or come from the same language family group, like the Germanic, Romance or Slavic languages. It could therefore be expected that Spanish text would be misclassified as Portuguese; interestingly enough, Portuguese text is not predicted as Spanish as often. This could imply that Spanish has much in common with Portuguese while Portuguese has some unique attributes compared to Spanish, making Portuguese easier to distinguish.

Figure 22 shows the training history of the model: both the training and validation accuracy and loss lie close to each other, meaning the model is not overfitted. This yields a more trustworthy result. The accuracy of the model lies around 95%, which means it predicts a large part of new data samples correctly. The loss for both the training and validation datasets is around 0.20, which needs to be compared to another loss value for a conclusion to be made.

The softmax output produces a vector of probability values for each input sample; each vector includes the probabilities for all eight languages of the model for that specific sample. In figure 23 the mean probability of the predicted language, when correctly predicted, is plotted for every existing tweet length. The figure suggests a slight trend towards an increasing probability value for longer tweets, which could mean that the model is more certain of the language for the longer tweets. Since the model predicts the class with the highest value from the softmax output vector, the value of the predicted language must be > 1/8 = 0.125, or 12.5%. If the probability were this low, the model would have assigned similar probabilities to all indexes of the softmax output vector. Looking at figure 23, the probability values of tweets shorter than 20 characters are more spread out, from 0.60 to 1.0. The probability values close to or equal to 1.0 in that area happened to be Russian tweets. Comparing this with figure 24c, it looks like all Russian softmax output vector probability values are either almost 1.00 or zero. Considering the model gave 100% accurate test data results for Russian, the probability values show that the model is certain of tweets being in Russian when they are.

Figures 24a and 24b show how the softmax output vector probabilities for English and Portuguese are spread out. This can mean many different things: the model can be predicting similarities to other languages, falsely predicting one of the other languages, or be uncertain about the correct language. In other words, the model is more confused and not as certain as when the language is Russian. Neither English nor Portuguese has as long and horizontal a line at the top, where most of the correctly predicted data samples are probably located, as Russian does. The Swedish curve, figure 24d, has a sparser spread of softmax output vector probabilities than both English and Portuguese. This can suggest that the model is more "certain" when predicting a data sample as Swedish, possibly because Swedish has three unique letters in its alphabet (å, ä and ö). This argument can probably also hold for other languages with distinctive letters, like ñ in Spanish.

In table 9, the comparison between the LSTM model and the fastText model shows that fastText has a higher accuracy, a lower loss and predicts new samples faster than the LSTM model. The prediction speed is of interest if a model is to be used in an application or website, especially in real time. FastText is simpler in its architecture, which clearly shows in the prediction speed. It is important to note that the LSTM model has not been written for an application purpose; the performance would probably be better if the code were rewritten. FastText also makes use of N-grams of both words and characters, which seems to increase the accuracy. When tested with the same type of character embedding as the LSTM model, fastText had a lower accuracy than the one presented. It is therefore possible that if the LSTM model also made use of N-grams of both words and characters, the evaluation would yield a higher accuracy and lower loss, just like for fastText.

When adding a ninth language, Croatian, that is similar to one of the already existing languages, Serbian, the results in section 4.6 show some interesting behaviour. The accuracy is still fairly high (above 92%) for the entire model. Looking at each individual language in the confusion matrix in figure 26, the accuracy for the Croatian and Serbian test data samples has dropped. The accuracy for the Croatian tweets is 68.9%, with a large portion of the test data, 26.6%, being misclassified as Serbian. The Serbian tweets have gone from a classification accuracy of 97.2% (figure 21) to 82.6%, and 14% of the Serbian test data was misclassified as Croatian. The model was retrained without any knowledge from the former 8-language model. It seems like the LSTM model has a hard time establishing the difference between the two similar languages. When Croatian was added to a fastText model, the confusion matrix in figure 27 shows that this model also has a lower accuracy for the Croatian and Serbian test data, but compared to the LSTM model in figure 26, fastText seems to handle the two similar languages a little better. Figure 28 shows how the curves of the softmax output vector probabilities for Croatian and Serbian do not have a proper peak around 1.0, like the curves in figure 24. This possibly means that the model is confused by both Croatian and Serbian tweets.

The softmax output vector gives probabilities for each language of the data sample, as mentioned earlier. This probability reflects the model's "belief" about the input data sample; it does not reflect a true probability and should not be interpreted as the real probability of the presence of a language, or of similarities between languages.
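A small numeric sketch of how these softmax values behave (the logits below are made up for illustration and are not model output):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # shift by the max for numerical stability
    return e / e.sum()

print(softmax(np.zeros(8)))       # uniform: every language gets 1/8 = 0.125

confident = softmax(np.array([8.0, 0, 0, 0, 0, 0, 0, 0]))
print(confident.max())            # close to 1.0, like the Russian predictions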

5.2 Errors

A couple of error sources need to be mentioned for the thesis. The dataset used in the thesis work is based on a collection of tweets that has been annotated by hand. This means there is a risk of human error in whether a tweet is correctly labelled with a certain language. As this is difficult to keep track of and check, it has been assumed for the thesis work that the annotation is correct. Another concern with the dataset is when the tweets were collected: between 2013 and 2015 [29]. As the dataset is a couple of years old, the topics of the tweets can be directed towards certain time-dependent events that are not of interest today. This can make the model unbalanced if new Twitter data from today were used as input to the model. Only 9 languages were used for the thesis' model, which does not represent all languages. It was decided to only look at a few languages to get a good overview of how the model works. It is important to keep in mind that if several more languages were added, the accuracy and performance of the model would probably change.

5.3 Ethical reflection

Building ML models can be exciting and give interesting insights into how data samples relate to each other. It is important, and an obligation for engineers, to think about how ML and AI are used, for what purpose, and how the data is being handled. For this thesis, short pieces of text from Twitter are used, which are written by individual people and can be viewed as private thoughts and ideas. An individual tweet may not be representative of an entire group, in this case a language, so no conclusions should be drawn about in what way, or with what words, specific native speakers write. How the model is used and generalised can become an issue: if the purpose of using the model is to filter out or censor certain languages on a specific website or in a program, it needs to be established what this can cause. When discussing ethical AI and ML, the idea of a model being biased often comes up. Merely by choosing a specific dataset to train an ML model, a bias has arisen in the model. As it is impossible to collect and train models on all data there is within an area, the model will always be slightly biased. This is why it is important to spend time evaluating the dataset being used and making it as general and well represented as possible for the task. A result of not building models thoughtfully can be that they discriminate against one group of people, or generalise a behaviour to a larger group. One example is Google's hate speech detection AI tool that was meant to identify "toxic" language written online, but instead ended up being racially biased against African-Americans [41]. It needs to be made clear by the engineer, or model creator, what the model actually says and what it does not say. It is sometimes easier and more appealing to draw quick conclusions from data samples, data behaviour and ML model behaviour. The reporting of the results needs to be transparent and clearly state what the results actually say.

5.4 Future work

It would be of interest to keep exploring ML models that handle similar languages. In this thesis the chosen languages were largely distinct from each other, possibly making it easier for the model to distinguish them. Future work could include romanisation of Russian, meaning Russian written in the Latin alphabet, or other languages written in the Cyrillic alphabet, like Ukrainian. Another possibility is to look into how emojis are used in different languages, and whether it would be possible to identify languages through which emojis are used when writing short texts. Future work could also dive deeper into the shortest pieces of text, less than 50 characters, and how to automatically identify the language, or languages, of such text. It is also possible to further explore how an ML model for short texts should be implemented in an application, and how the softmax output vector values should be interpreted. Lastly, newer ML model approaches, like the ones seen in fastText, could be studied further.

6 Conclusion

How the text feature extraction is chosen is crucial for short texts. Whether to use character or word embeddings, N-grams or bag-of-words can affect the results of the model considerably. As shown in the results, the fastText model, which uses both N-grams and word embeddings, has a better accuracy than the LSTM model, which only uses character embeddings. A combination of these seems to improve the accuracy, and a simpler model architecture improves the prediction speed.

The results show that the LSTM model often classifies the language correctly; it is however not clear how certain the model is of its predictions. The softmax output vector probability values can give a hint of how confused the model is by the input data sample. One improvement would be to add a threshold for the highest value of the softmax output vector: if the highest probability out of all the languages is below the threshold, the model would classify the data sample as "unknown" or "uncertain", which would probably yield a much lower accuracy than what has been seen. Depending on the purpose of the automatic language identification model, it could instead be designed to suggest a few possible languages. The uncertainty comes from the fact that the tweets do not consist of many characters, making them harder to identify [42].

When dealing with similar languages, like Croatian and Serbian, the results show that it could be a good idea to either create a separate language identifier for only these similar languages or merge the two languages into the same class. This would make the model more certain of the differences between languages that are not similar. A future model could perhaps include several steps, where the first step is to identify the language family group of a text, before moving on to a specific model for that language family group. In this way a model can narrow down to only a couple of languages as possible correct classifications.
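As a sketch of the thresholding idea discussed above (the threshold value 0.5 and the language list are arbitrary assumptions for illustration):

import numpy as np

def classify_with_threshold(softmax_vector, languages, threshold=0.5):
    # Pick the language with the highest softmax probability, but return
    # "unknown" when the model is not confident enough.
    best = int(np.argmax(softmax_vector))
    if softmax_vector[best] < threshold:
        return "unknown"
    return languages[best]

# Example: an uncertain prediction split between two similar languages.
probs = np.array([0.40, 0.35, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03])
langs = ["hr", "sr", "en", "sv", "es", "pt", "ru", "de"]
print(classify_with_threshold(probs, langs))   # -> "unknown"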

It is clear that there is more to discover within the field of LID systems for short texts. As languages change all the time and large, up-to-date datasets are required, the ML models need to be continually updated and improved.

References

[1] ITU Telecommunication Development Bureau. Measuring digital development: Facts and figures 2019. ITU Publications. https://www.itu.int/en/ITU-D/Statistics/Documents/facts/FactsFigures2019.pdf. Retrieved 2020-07-07.
[2] M.A. Nejla Qafmolla. Automatic Language Identification. European Journal of Language and Literature Studies, Volume 3, Issue 1. 2017. ISSN 2411-4103.
[3] D. W. Otter, J. R. Medina, J. K. Kalita. Survey of the Usages of Deep Learning for Natural Language Processing. IEEE Transactions on Neural Networks and Learning Systems. April 21, 2020.
[4] T. Jauhiainen, M. Lui, M. Zampieri, T. Baldwin, K. Lindén. Automatic language identification: A survey. Journal of Artificial Intelligence Research, Vol 65. August 25, 2019.
[5] S. Russell, P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, New Jersey. Third edition, 2010.
[6] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 210-229. 1959.
[7] J. Wehrmann, W. E. Becker, R. C. Barros. A Multi-Task Neural Network for Multilingual Sentiment Classification and Language Detection on Twitter. In Proceedings of ACM SAC Conference, Pau, France, April 9-13, 2018 (SAC'18), 8 pages.
[8] Input & Intelligence - Natural Language Processing Team. Language Identification from Very Short Texts. Apple Machine Learning Journal, Vol. 1, Issue 14. July 2019.
[9] G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning with Applications in R. 7th ed. New York: Springer Science+Business Media. 2013.
[10] Wikipedia. Bayes' theorem. https://en.wikipedia.org/wiki/Bayes%27_theorem. Fetched 2020-06-05.
[11] Y. S. Abu-Mostafa, M. Magdon-Ismail, H-T. Lin. Learning From Data: A Short Course. AMLbook.com, 2012.
[12] S. Haykin. Neural Networks and Learning Machines. Pearson Education, third edition. 2009.
[13] A. Krizhevsky, I. Sutskever, G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Neural Information Processing Systems, 25. 2012. 10.1145/3065386.
[14] A. Lindholm, N. Wahlström, F. Lindsten, T. B. Schön. Supervised Machine Learning. Department of Information Technology, Uppsala University. Available at: http://www.it.uu.se/edu/course/homepage/sml/literature/lecture_notes.pdf. March 2019.
[15] D. P. Kingma, J. Ba. Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR). December 22, 2014. arXiv:1412.6980.
[16] TensorFlow documentation. https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy. Fetched 2020-08-10.
[17] P. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78, 1550-1560. 10.1109/5.58337. 1990.
[18] N. M. Rezk, M. Purnaprajna, T. Nordström, Z. Ul-Abdin. Recurrent Neural Networks: An Embedded Computing Perspective. IEEE Access. March 23, 2020.
[19] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, J. Schmidhuber. LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222-2232, Oct. 2017. doi: 10.1109/TNNLS.2016.2582924.
[20] A. Sherstinsky. Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network. Elsevier journal "Physica D: Nonlinear Phenomena", Volume 404. March 2020.
[21] M. Sundermeyer, R. Schlüter, H. Ney. LSTM Neural Networks for Language Modeling. Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany.
[22] fastText's website. Resources. https://fasttext.cc/docs/en/language-identification.html. Fetched 2020-06-12.
[23] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov. Bag of Tricks for Efficient Text Classification. Facebook AI Research. August 9, 2016.
[24] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov. Enriching Word Vectors with Subword Information. Facebook AI Research. July 16, 2016.
[25] Unicode's home website. https://home.unicode.org. Fetched 2020-06-10.
[26] D. Jurafsky, J. H. Martin. N-gram language models. Speech and Language Processing, chapter 3. Draft 2019.
[27] About Twitter's API. https://help.twitter.com/en/rules-and-policies/twitter-api. Fetched 2020-06-11.
[28] CommonCrawl's website. https://commoncrawl.org/. Fetched 2020-06-12.
[29] I. Mozetič, M. Grčar, J. Smailović. Multilingual Twitter Sentiment Classification: The Role of Human Annotators. PLOS ONE. May 5, 2016.
[30] S. Perez. Techcrunch blog. https://techcrunch.com/2017/11/07/twitter-officially-expands-its-character-count-to-280-starting-today/. Published November 7, 2017. Fetched 2020-06-12.

[31] B. Hayes. Programming languages most used and recommended by data scientists. https://businessoverbroadway.com/2019/01/13/programming-languages-most-used-and-recommended-by-data-scientists/. Published 2019-01-13. Fetched 2020-06-11.
[32] S. Li. GitHub. https://github.com/susanli2016/PyCon-Canada-2019-NLP-Tutorial/blob/master/BBC%20News_LSTM.ipynb. Last commit 2019-10-15.
[33] Alex-Just, commented by mgaitan. GitHub. https://gist.github.com/Alex-Just/e86110836f3f93fe7932290526529cd1#gistcomment-3208085. Comment committed 2020-03-11.
[34] H. He. The State of Machine Learning Frameworks in 2019. The Gradient. https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-research-tensorflow-dominates-industry/. Published October 10, 2019. Fetched 2020-06-12.
[35] Keras website. https://keras.io/
[36] T. E. Oliphant. A Guide to NumPy. Vol. 1. Trelgol Publishing, USA. 2006.
[37] W. McKinney et al. Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference, p. 51-56. 2010.
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011;12(Oct):2825-2830.
[39] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering. 2007;9(3):90-95.
[40] M. Waskom, et al. mwaskom/seaborn: v0.8.1 (September 2017). Zenodo. Available at: https://doi.org/10.5281/zenodo.883859. 2017.
[41] N. Martin. Google's Artificial Intelligence Hate Speech Detector Is 'Racially Biased,' Study Finds. Forbes Media. https://www.forbes.com/sites/nicolemartin1/2019/08/13/googles-artificial-intelligence-hate-speech-detector-is-racially-biased/#d46ab7b326c4. Published 2019-08-12. Fetched 2020-07-02.
[42] H. Boström, H. Linusson, T. Löfström, U. Johansson. Accelerating difficulty estimation for conformal regression forests. Annals of Mathematics and Artificial Intelligence. 2017. 10.1007/s10472-017-9539-9.
