UPTEC F 20043 Degree project 30 hp September 2020

Automatic language identification of short texts

Anna Avenberg

Abstract

The world is growing more connected through the use of online communication, exposing software and humans to all the world's languages. While devices are able to understand and share raw data between themselves and with humans, the information itself is not expressed in a monolithic format. This causes issues both in human-to-computer interaction and in human-to-human communication. Automatic language identification (LID) is a field within artificial intelligence and natural language processing that strives to solve a part of these issues by identifying languages from text, sign language and speech. One of the challenges is to identify the short pieces of text that can be found online, such as messages, comments and posts on social media. This is due to the small amount of information they carry.

The goal of this thesis has been to build a model that can identify the language of these short pieces of text. A long short-term memory (LSTM) machine learning model was built and benchmarked against Facebook's fastText model. The results show that the LSTM model reached an accuracy of around 95%, while the fastText model used for comparison reached an accuracy of 97%. The LSTM model struggled more when identifying texts shorter than 50 characters than with longer texts. The classification performance of the LSTM model was also relatively poor in cases where languages were similar, such as Croatian and Serbian. Both the LSTM model and the fastText model reached accuracies above 94%, which can be considered high, depending on how it is evaluated. There are however many improvements and possible directions for future work to consider: looking further into texts shorter than 50 characters, evaluating the model's softmax output vector values, and handling similar languages.

Supervisor: Prashant Singh
Subject reader: Prashant Singh
Examiner: Tomas Nyberg
UPTEC F 20043

Acknowledgement

I would like to express my gratitude to the people who have given me input, helped me and supported me throughout the thesis. This thesis would not have been possible without your knowledge, guidance and advice. Many thanks to Prashant Singh for sharing his knowledge and always answering my questions. I also wish to thank the group of people who enriched my lunches with good company and discussions. Lastly, I want to thank my family and friends for staying positive and cheering me on.

Popular science summary (Populärvetenskaplig sammanfattning)

Today we communicate to a large extent through our mobile phones and computers. Communication is fast; we react, comment and reply to friends and family as well as strangers online. Much of what we write about and to each other is in the form of short texts, such as a text message, a chat message, a comment or a post on social media. Since the internet reaches people all over the world, much of what appears online is written in several hundred different languages. For websites and programs to be able to handle all languages, it is important that they can identify the languages automatically.

Artificial intelligence, or AI, is a subject many have heard of in recent years. AI is largely about letting computers perform tasks that previously required human intelligence. One subfield of AI is natural language processing, part of which concerns automatic language identification. In this work, methods for building language identification models for short texts have been studied and tested. Machine learning is a part of AI where it is possible to teach a computer different tasks without having to program for every exact data point, in this case text. As a result, it is possible to design models for the computer where all that is needed is data in which different languages are represented.

In this work, posts in different languages were retrieved from Twitter, and a machine learning model was constructed from the collected information. Twitter is a source of short texts, and posts in 9 different languages were retrieved: English, Swedish, Spanish, Portuguese, Russian, German, Polish, Serbian and Croatian. The constructed model managed to correctly identify 95% of the test data used. The model was compared with the fastText model, developed by Facebook, which correctly identified 97% of the test data. It was surprising that the models achieved an accuracy above 90%, even though they were not trained on a large amount of data. This suggests that the machine learning models freely available online today have great potential to be used in applications with good results. There is much left to explore in the field of language identification of short texts and in how models of this kind should be interpreted. Remaining problems include the model's uncertainty, how the shortest texts can be identified, and how several similar languages should be handled by the model.

Contents

Popular science summary (Populärvetenskaplig sammanfattning) 4

1 Introduction 7
  1.1 Background 7
  1.2 Objectives 7
    1.2.1 Sub-objectives 8

2 Theory 9
  2.1 Natural language processing 9
    2.1.1 Automatic language identification 9
  2.2 Artificial intelligence 9
  2.3 Machine learning 9
  2.4 Artificial neural networks 11
    2.4.1 Activation functions 12
    2.4.2 Output activation function 14
    2.4.3 Training of neural networks 16
  2.5 Recurrent neural networks 17
    2.5.1 Long short-term memory 19
  2.6 Pre-training of models 20
  2.7 Feature extraction of text 21
    2.7.1 Bag of words 21
    2.7.2 N-grams 22
    2.7.3 Word and character embeddings 22
  2.8 Data extraction 23
    2.8.1 Twitter's API 23
    2.8.2 Web crawling 23

3 Method 24
  3.1 Dataset 24
  3.2 Preprocessing 25
  3.3 Machine learning model 26
  3.4 Software setup 28
    3.4.1 TensorFlow and Keras 28
  3.5 Experiments 29
    3.5.1 Hyperparameter tuning 29
    3.5.2 Word versus character embeddings 30
    3.5.3 Preprocessing variations 30
    3.5.4 fastText 30
    3.5.5 Two similar languages 30

4 Results 30
  4.1 Hyperparameter tuning 30
  4.2 Word versus character embeddings 32
  4.3 Preprocessing variations 32
  4.4 Final model 33
  4.5 fastText 35
  4.6 Two similar languages 36

5 Discussion 38
  5.1 Results 38
  5.2 Errors 41
  5.3 Ethical reflection 41
  5.4 Future work 42

6 Conclusion 42

References 43

1 Introduction

More than 4 billion people use the internet today, which corresponds to over 50% of the global population [1]. This means that hundreds of different languages are being used online daily. A subfield within artificial intelligence called natural language processing has made it possible for all of these languages to be identified, translated and communicated between each other. The past decade has provided new technology, like big data and parallel computing, that has made these artificial intelligence ideas possible to implement. Natural language processing covers how to handle language data and gain more knowledge about a language. Automatic language identification is important for many types of natural language processing tasks today. As online communication often consists of short pieces of text, the challenge of how to identify the language of these texts has arisen. The thesis covers parts of this field and how machine learning can recognise languages from short texts. Machine learning can be used to create computer models that train on known data and are later able to classify new, unknown data.

1.1 Background

In recent years it has become more common to come across websites where automatic language identification is used, like Facebook and Google [2]. It is impressive how translation and identification of languages often is accurate, although sometimes the translation of an expression or saying does not quite match the real meaning. Another example of where language identification (LID) models are being used is in the search bar of websites. When the user types in a couple of words or a sentence, the search bar first identifies what language the text is written in, before it gives the most relevant results in the same language as typed by the user.

Languages continuously develop and change, which can become an issue when building models that automatically identify languages. In a time where people communicate online, the changes happen quickly; from one day to another an online community can start to use a completely new word or slang. Badly written text and misspelled words are common online because of the quick communication on online platforms. The possibility to identify the language of "dirty" and tricky short pieces of text still needs to be explored further. Is it possible for machine learning models that identify languages to keep up with the constant change of languages? How easy is it to create a machine learning model to automatically identify different languages from shorter expressions and words that contain slang or misspellings, or are badly written? Can a model be created to distinguish between possible languages for a short online expression? These questions will be explored in this master's thesis.

1.2 Objectives

The objective of this master's thesis is to explore how a neural network can be created to identify the language of short texts from internet platforms, with the goal of quickly and automatically going from a newly trained network to implementation.

This master's thesis will include a study of how language identification is set up and used today. A specific and well represented dataset of short texts, for example comments on posts on internet platforms like Twitter, is to be found and structured. The dataset could be from any type of specific environment where short pieces of text occur. A machine learning model is to be designed, trained and tested for automatic language identification.

1.2.1 Sub-objectives

• Perform a literature study of the state-of-the-art concepts of Automatic Language Identification (LID) and Natural Language Processing (NLP).
• Find and extract data from the internet.
• Design, construct and test a machine learning model to identify languages from short texts.

A flow chart of the work covered in this thesis report is shown in figure 1.

[Flow chart: Assemble knowledge -> Model decision -> Data collection -> Preprocessing -> Machine learning models (fastText, LSTM) -> Evaluation -> Conclusion]

Figure 1: Flow chart of thesis work.

2 Theory

2.1 Natural language processing

Natural language processing (NLP) is a field within computer science and linguistics, which covers the area of how languages can be described, represented, used and constructed in a computational manner. NLP, also known as computational linguistics, has been around since the 1980s [3] and includes several different fields. Some examples of NLP topics are natural language modeling, text identification, text translation, question and sentence answering, and summarization [3]. As computational power and parallelization have increased at a rapid speed in the last few years, machine learning and deep learning can now be applied to the field of NLP [3]. Some examples of machine learning problems within NLP are machine translation and automatic language identification.

2.1.1 Automatic language identification

Automatic language identification (LID) aims to identify languages without human intervention [4]. LID is a process present in many web services today. When searching the web, many websites apply LID to the text written in the search bar, so that the most relevant search results are exposed first. Another example is the translation tools that automatically recognise what language is written and then translate it to the desired language. In machine translation, which is the automatic translation of documents, the first step before automatic translation is having a well working LID model. Language identification is a crucial part of natural language processing. In general, artificial intelligence and machine learning try to mimic the way the human brain works [5], which will be explained further in the sections below. Several different approaches have been studied since the 60s in the field of automatic language identification [4], but just like AI, it has seen a rapid increase in performance in the last 10 years. LID can be used on data from speech, sign language or text.

2.2 Artificial intelligence

Artificial intelligence (AI) is a part of computer science that has sprung up to be one of the most current and progressive topics in technology today. AI has been around much longer than many expect and was introduced already back in the 1950s. The early idea of AI still holds today: the intelligence of machines, trying to mimic the intelligence found in humans and animals. AI is widely used today for many different applications and can include many different things. There are however different definitions of AI, and it is used as a broad term for several fields. The main idea of AI is for machines to do the tasks that usually require human intelligence.

2.3 Machine learning

Similar to AI, machine learning (ML) lacks an explicit definition. One common definition was phrased by Arthur Samuel in 1959 as: "Machine learning is giving computers the ability to learn without being explicitly programmed." [6]. Machine learning is AI, or could be defined as a sub-field of AI. The general idea of ML is to learn or recognise a pattern within an existing dataset. This creates a model that can be used in future applications where new data is introduced and run through the model and, depending on the task, gives a result based on the previously seen data. ML can be described as generalizing a model from one set of data for a specific task, to be used in the future for another similar set of data for the same task. Depending on what type of task is to be performed with a model, different types of models and algorithms can be used.

There are two distinct types of problems in ML: regression problems and classification problems. Regression problems handle numerical desired outputs, like the pricing of houses or temperatures, where the data is quantitative. Classification problems on the other hand handle categorical outputs, being specific labels like true or false, or one label of several classes like a specific language, where the data is qualitative [9]. The mathematical algorithms behind many of the most frequently used machine learning models have been around for over 50 years, and the probability theory is based on old knowledge, like Bayes' theorem [10].

There are three major approaches within machine learning: supervised machine learning, unsupervised machine learning and reinforcement learning [11]. Supervised learning is the most widely studied and used type of ML. In supervised machine learning, the dataset used for training the machine learning model has a label for each data sample. For example, if the dataset consists of pictures of animals, each picture also includes the correct label of what animal is in the picture, like "dog" or "cat". Both regression and classification problems can be performed through supervised learning. Supervised learning requires a labelled dataset for training and testing.

In unsupervised learning and reinforcement learning, the dataset does not have a label for each sample. This means that the learning tasks differ from what is done in supervised learning; neither regression nor classification problems can be solved here. Unsupervised learning only has data samples as inputs and knows nothing else; the model can from there learn patterns within the dataset and cluster similar data together. Unsupervised learning can be used as a pre-model to supervised models or to give an idea of how samples within a dataset relate to each other. Reinforcement learning does not contain a label for every data sample in the dataset; instead the data contains other information about how different outputs are scored depending on what tasks are being investigated. Typically the input to the model contains a described action, a few different outputs of the action and a grade for each output [11]. Reinforcement learning can be of good use to train a model to play a game, like chess, or to simulate animals' behaviour. Reinforcement learning can be described as partly unsupervised and partly supervised learning.

The challenge for this thesis is a classification problem that will be approached by using supervised machine learning models.

2.4 Artificial neural networks

Artificial neural networks, or just neural networks (NN), mimic the structure of the human brain. The main idea of a NN is that it consists of many neurons that receive knowledge about a task through training, just like the human brain is trained to learn new things during a lifetime. The idea of AI, and specifically of NN, has been around for a long time, has developed over the years and is today sometimes referred to as deep learning. Deep learning is when the NN becomes deeper, that is, when more so-called hidden layers are added to the network. The past decade has provided a large upswing and improvement of NN due to improved computational power but also due to the possibility of gathering data. Big data, data mining, databases, open sources and cloud computing have made it possible for data scientists to build more powerful AI algorithms and machine learning models. Deep learning is possible because, as of today, computers can handle the large amount of data and the large architecture of the networks. Deep learning can be implemented in systems for several different fields, for example disease diagnosis, image recognition and language identification.

The most basic NN are feed forward neural networks. A feed forward NN can be used for all three types of machine learning: supervised, unsupervised and reinforcement learning. They are all based on the same basic structure. Neural networks are built up by several layers of neurons, also called nodes, which are "hidden" units being trained when data is run through them. The nodes are connected together in something called a hidden layer, which acts like a black box of the learning part of a neural network model. The connections between the nodes can be mathematically expressed as weights, and the weights are updated when new data is used as input. A neural network usually consists of several hidden layers, with many weights that are updated during the neural network model training [12]. The feed forward NN is fully connected, meaning that each node in each layer is connected to every node in the next hidden layer, see figure 2. A basic feed forward neural network consists of one input layer, hidden layers, and finally one output layer.

Figure 2: A simple feed forward neural network.

NN are nonlinear models, designed to describe and handle nonlinear relationships. The network gives an output z given an input x, where z is produced by several hidden layers of linear functions and nonlinear activation functions. The basic linear regression problem is given by equation 1.

z(x) = \beta_0 \cdot 1 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \qquad (1)

Here the output z is given by the summation of all the input values of x = [1 \; x_1 \; x_2 \; \ldots \; x_p]^T, each multiplied with a parameter \beta_i. To describe nonlinear relationships between z and x, these factors are put through a nonlinear function, called an activation function, often denoted \sigma, see equation 2.

z(x) = \sigma(\beta_0 \cdot 1 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p) \qquad (2)

2.4.1 Activation functions

Two common activation functions used in NN are the logistic (sigmoid) function and the rectified linear unit (ReLU). They are described by equations 3 and 4 and are depicted in figure 3. The idea of an activation function is to mimic, as mentioned earlier, the functioning of the human brain and act like a "switch", where an input contribution either gives a value of 1 or is kept at 0. The sigmoid function was the first choice of activation function for many years, but in recent years the simpler ReLU function has become popular. The main reasons are the ReLU's simplicity and the faster learning of the network compared to the sigmoid function [13]. A couple of downsides with the ReLU are that negative neurons will be kept at zero and have a hard time recovering, and that if the learning rate is too high the model can stop updating the weights. The ReLU function is only used for the hidden layers, not the output layer.

\mathrm{ReLU}(x) = \max(0, x) \qquad (3)

\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (4)
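As a concrete illustration, here is a small NumPy sketch of equations 3 and 4; the function names and test values are purely illustrative.

import numpy as np

def relu(x):
    # Equation 3: pass positive inputs through, clamp negatives to zero.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Equation 4: squash any real input into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))     # [0. 0. 3.]
print(sigmoid(x))  # roughly [0.119 0.5 0.953]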

(a) Rectified, ReLU, function. (b) Logistic, sigmoid, function.

Figure 3: Two common activation functions used in neural networks: (a) ReLU and (b) sigmoid.

For neural networks, and further for deep neural networks, several activation units are put together; these are then described as the hidden units h_i.

h_i = \sigma(\beta_{0i} \cdot 1 + \beta_{1i} x_1 + \beta_{2i} x_2 + \cdots + \beta_{pi} x_p), \quad i = 1, 2, 3, \ldots, M \qquad (5)

[Figure: a two-layer feed forward network with inputs x_1 ... x_p, one hidden layer of units h_1 ... h_M and outputs z_1 ... z_k]

Figure 4: A two-layer neural network built up by one hidden layer, consisting of M hidden units h_i.

In figure 4 a basic neural network can be seen, also known as the "two-layer neural network". The full model can be expressed mathematically in matrix form by describing weight matrices W and offset vectors b, see equation 6.

b^{(1)} = \begin{bmatrix} \beta^{(1)}_{01} & \beta^{(1)}_{02} & \ldots & \beta^{(1)}_{0M} \end{bmatrix}^T, \quad
W^{(1)} = \begin{bmatrix} \beta^{(1)}_{11} & \ldots & \beta^{(1)}_{1M} \\ \vdots & \ddots & \vdots \\ \beta^{(1)}_{p1} & \ldots & \beta^{(1)}_{pM} \end{bmatrix}, \quad
b^{(2)} = \beta^{(2)}_0, \quad
W^{(2)} = \begin{bmatrix} \beta^{(2)}_1 \\ \vdots \\ \beta^{(2)}_M \end{bmatrix} \qquad (6)

The model can therefore be described by equations 7 and 8.

h = \sigma(W^{(1)T} x + b^{(1)T}) \qquad (7)

z = W^{(2)T} h + b^{(2)T} \qquad (8)

Further on, deep neural networks are described by adding more hidden layers, as seen in figure 5, where L hidden layers give (M \cdot (L-1)) hidden units [14].

[Figure: a deep feed forward network with inputs x_1 ... x_p, L hidden layers of units h_1^1 ... h_M^L and outputs z_1 ... z_k]

Figure 5: A deep neural network with L hidden layers.

2.4.2 Output activation function

So far, the neural network that has been presented covers the regression problem. When handling a classification problem, a layer with another activation function is added for the output layer. This turns the hidden units into probabilities for each class, since a classification problem handles qualitative data instead of quantitative data (as in regression). To the output layer in figure 5, an extra softmax output layer is added. Figure 6 shows the idea of a softmax output layer, which yields a probability for every class, in this case what object the input x is representing.

Figure 6: The softmax output layer in a classification problem, yielding a probability for each class for one input sample.

Equation 9 gives the softmax function, which can be seen in figure 7. The softmax function gives a probability for each class, and the probabilities sum up to 1. The sigmoid activation function on the other hand produces values between 0 and 1 that are independent of each other and do not sum to 1.

S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}} \qquad (9)

Figure 7: The softmax function.

The softmax outputs a vector of the probability for each class. This could give a hint towards how certain the model is on predicting the correct class, or how confused the model is. Either the highest value of the softmax output vector is chosen as the predicted class, or a threshold value can be chosen for this purpose.
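To make this behaviour concrete, the sketch below implements equation 9 in plain NumPy; subtracting the maximum is an added numerical-stability trick and does not change the result.

import numpy as np

def softmax(y):
    # Shift by the max for numerical stability; the constant cancels
    # in the ratio of equation 9, so the output is unchanged.
    e = np.exp(y - np.max(y))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw outputs for three classes
probs = softmax(scores)
print(probs)           # roughly [0.66 0.24 0.10]
print(probs.sum())     # 1.0, as required of a probability vector
print(probs.argmax())  # 0: the class with the highest probability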

2.4.3 Training of neural networks

The general architecture of a neural network is now established; moving on to how the neural network is trained on new input samples x. When training a neural network, the main goal is to minimize a loss function, that is, a function that describes the distance between the correct label of the input data and the predicted label given by the neural network model. Figure 8 shows the idea of what the model iterates to find, which is the minimal loss. Where the minimal loss is found, those are the values of the weights the model needs to keep to perform well. The loss function gives the distance between the true label z of an input x and the predicted output z'. Through backpropagation and optimization, the weights of the network are updated. Backpropagation computes the gradient of the loss function by going backwards in the neural network to update the weights. A common optimization technique is stochastic gradient descent, which will find the minimum loss as seen in figure 8. One variant of this is an optimization method called ADAM (adaptive moment estimation) [15], which is used in this thesis.

Figure 8: Finding the minimal loss function value is the goal when training a neural network.

The dataset is split into different parts: one dataset for training, one dataset for validation and one dataset for testing (hold-out dataset). The training dataset is used for training the model weights during the backpropagation. The validation and test datasets on the other hand are only used to evaluate the model, to see how the model acts on new data. The validation dataset is used during training as an unbiased evaluation. It is important to keep the training dataset completely separated from the validation and test sets, as the model is supposed to be general. If the validation dataset were also used for training, the model would be biased and not give trustworthy results when used on new, unseen data. The validation dataset is used to validate the model during the training. An additional dataset, the test dataset, is used to have an extra evaluation of the model and how well it predicts new data. All of the data samples are labeled, and the predictions can therefore be compared with the actual labels. The split of the complete dataset is 80% training data, 10% validation data and 10% test data. The split can be divided differently between the datasets, and something like cross-validation can also be used to make use of all data without risking overfitting or underfitting the model. Overfitting means that the model is well fitted to the training data, but does not generalize well to new data samples. Underfitting a model means that the model does not capture the patterns of the data well, and usually means that the model is too simple.

When evaluating the three datasets, the loss and the accuracy are of interest. The loss is, as described above, the distance between the predicted label and the actual label of a dataset. The choice of loss function often depends on the model's task. For this thesis a sparse categorical cross-entropy loss function [16] is chosen, because the problem has one correct class for each data sample out of several classes. The cross-entropy loss function measures the distance between two probability distributions, the two distributions being the true and the predicted classes. The function is given by equation 10, where y_i is the true label and \hat{y}_i is the predicted label. A categorical cross-entropy loss function is used when a model has more than two classes.

\mathrm{Loss} = -\sum_{i=1}^{\text{output size}} y_i \cdot \log(\hat{y}_i) \qquad (10)

The accuracy is the number of correctly classified data samples out of all the data samples in each dataset. Metrics such as precision, recall and the F1-score can give a more objective view of the results of a model: they describe the false positive rate, the sensitivity to false negatives, and a ratio between false positives and false negatives.
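As a concrete illustration of equation 10, the sketch below computes the sparse categorical cross-entropy for a single sample in NumPy; the example probabilities are made up.

import numpy as np

def sparse_cross_entropy(true_class, predicted_probs):
    # Sparse variant: the label is a class index rather than a one-hot
    # vector, so the sum in equation 10 collapses to a single log term.
    return -np.log(predicted_probs[true_class])

probs = np.array([0.05, 0.85, 0.10])   # softmax output for one sample
print(sparse_cross_entropy(1, probs))  # ~0.16: confident and correct
print(sparse_cross_entropy(0, probs))  # ~3.0: the true class got only 0.05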

2.5 Recurrent neural networks

Recurrent neural networks (RNN) are a type of neural network designed to handle sequential data samples. Examples of problems that can benefit from a recurrent neural network instead of a regular feed forward neural network are speech recognition, human activity recognition and language detection/identification. These problems include data samples that depend on the previous time steps of the sample. To identify a language, the previous word or character makes it easier to see what language the text is in. The same applies to human motion: to know if the movement was sitting or running, the previous time instances from a sensor can reveal what activity is currently being performed.

Figure 9: An RNN has a feed-back connection inside the hidden layer.

Compared to a regular feed forward artificial neural network, the recurrent neural network has a feedback connection within the hidden layer, see figures 9 and 10. This is why recurrent neural networks can, in theory, handle sequential input data better than feed forward neural networks.

Figure 10: Unfolded RNN architecture.

The hidden units in RNN are described, as before, by a weight matrix W, but also by a hidden-state-to-hidden-state matrix U multiplied with the hidden state of the previous time step, (t-1), see equation 11.

h_t = \sigma(W x_t + U h_{t-1}) \qquad (11)

As equation 11 shows, each hidden unit contains information not only about the previous sample, but about all previously passed samples. The hidden units are trained with new data through an extension of backpropagation (used in feed forward networks) called backpropagation through time, BPTT [17]. Recurrent neural networks have certain problems when handling time-serial data samples, where a vanishing gradient problem can occur. The gradients can also blow up and give an unreliable model [18].

2.5.1 Long short-term memory

Long short-term memory (LSTM) is a type of recurrent neural network that handles the vanishing gradient problem that RNN have. LSTM units keep a more constant error for the model and can therefore keep training over many steps. The LSTM cell keeps more information, and can both memorize information, reject information and decide when and where to open gates for the information to flow through. The LSTM cell blocks or passes information depending on the values of the weights of the unit, just like in feed forward networks. Through backpropagation and stochastic gradient descent, these cells iteratively optimize their weight values, learning when to pass, delete or store information, see figure 11. In figure 11 an LSTM cell can be seen. The LSTM cell has several gates that interpret and act on the input and the recurrent inputs, compared to the RNN cell, which only pushes the input and the recurrent input through the cell.

Figure 11: A detailed sketch of the inside of an LSTM cell. Source: [19].

An additional feature that can be added to a recurrent network cell is to use a bidirectional layer of the recurrent neural network. A unidirectional RNN is what has been described so far, and is similar to a feed forward NN in that it pushes the input data, and the recurrent input data, through each cell. In a bidirectional RNN the input data is pushed through the NN both forwards and backwards. This means that if an input sample represents a string of text, the text trains the weight values in both directions of the text. This can be beneficial for text inputs, as it may help the NN learn more about the word order of a language. Figure 12 shows a sketch of the unidirectional and bidirectional RNN layers.

Figure 12: The difference between a unidirectional RNN and a bidirectional RNN. The green circles represent the input layer, the pink circles represent the hidden layer and the yellow circles represent the next layer or output layer.

2.6 Pre-training of models

A ML model can be pre-trained before being trained for its final purpose. This means that the first pre-training task could, for example, be to cluster data together, or to let the model see a lot of textual data samples. After the pre-training, a new dataset for the final task is used to train the model. There are several pre-trained language models out on the web. Many of these models are developed, and used, by large companies like Amazon, Google and Facebook. For this thesis, Facebook's language identification model fastText has been studied and used as a benchmark for the neural network.

FastText is both an open Python library for training language models on specific data and a collection of pre-trained language models. FastText provides trained word vectors for 157 languages, a pre-trained language identification model that can identify up to 176 languages, and a library to train models on specific data (both supervised and unsupervised) [22]. FastText can train a language identification model on thousands of data samples in just a couple of minutes. Compared to the neural network architecture, fastText has a simpler design: a linear classifier with a low rank matrix constraint [23]. The model architecture includes a hierarchical softmax to reduce running time. The model also includes a combination of a bag-of-words model and an N-gram approach (both explained further below) to increase performance and reduce running time even more. The N-gram model gets information about the characters around the current time instance but takes more memory capacity, while the bag-of-words model does not capture the text's features as well. A combination of these two gives the bag-of-n-grams model that is used in fastText [24].
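As an illustration of how such a pre-trained model can be used, the hedged sketch below loads fastText's published language identification model; lid.176.bin refers to the downloadable pre-trained model file, and the example sentence is arbitrary.

import fasttext

# lid.176.bin is fastText's pre-trained language identification model,
# downloadable from the fastText website (covers 176 languages).
model = fasttext.load_model("lid.176.bin")

# Ask for the top 3 candidate languages and their probabilities.
labels, probs = model.predict("vilket språk är det här?", k=3)
print(labels, probs)  # e.g. ('__label__sv', ...) with confidence values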

2.7 Feature extraction of text

It is not possible for a LID model to use raw text data as input; the text needs to be handled in some way for the computer to understand it. When handling text as input to a machine learning model, it needs to be established how to represent the text. In today's digital world, texts can include many different scripts, letters, characters, signs and also such things as emojis. Luckily there is an encoding for most characters in computers today called Unicode. Unicode is a universal character encoding that is constantly updated with new characters and, as of version 13.0, it includes almost 150,000 characters [25]. This is one way to represent all the characters found in texts. The size and form of the input samples also need to be handled. There are many different approaches to this, which include representing sentences as vectors of words or characters. How to map these words or characters needs to be established. A few different approaches to computationally create these text representations are discussed below.

2.7.1 Bag of words

The bag of words (BoW) method can be described as gathering a vocabulary of known words from data and measuring the presence of known words in every data sample. Every word found in the training dataset can be added to a vocabulary of words to keep track of. One text sample, a sentence of a couple of words, is split into its words. Each word is added to a vector, which keeps track of all the words found in all of the data samples used for training the model. Every sample of text is then mapped from this large word vector, "the bag of words", with a hashing encoding. An example is given in figure 13.

Figure 13: An example of how the bag-of-words method works.

A couple of drawbacks with the BoW method are that it assumes all words in the text are correctly written, in order to map the same words to the same place in the bag-of-words vector. If the word "hello" were also found as "helloo" in two different data samples, the two would be represented as different words in the vector, even though it can be assumed that they are the same word, one of them misspelled with an extra "o". Another drawback with the BoW method is that it requires a lot of training data to represent many different languages well, along with all the possible words in each language, for a model to be general. This also causes the bag-of-words vectors for all the languages in the model to become large, which leads to long computational times and the need for a lot of memory.
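To make the mapping concrete, here is a minimal sketch using scikit-learn's CountVectorizer, one of many possible BoW implementations; the toy corpus is made up.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["My new cat", "my cat sleeps", "a new day"]  # toy data samples

vectorizer = CountVectorizer()        # word-level bag of words
X = vectorizer.fit_transform(corpus)  # one count vector per sample

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # word counts per sample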

2.7.2 N-grams

N-grams are another approach to representing text. The idea of N-grams is that instead of representing a piece of text by every word, an N-gram method takes N grams at a time into account (where a gram can be one word, part of a word or a set of characters). It is easier described by an example. The sentence "My new cat" can be vectorized as ["My" "new" "cat"]; this is what it would look like when applying a BoW method to the sentence. An N-gram model can vectorize either per N words or per N characters. A 2-gram model would vectorize it either as ["My new" "new cat"] or as ["My" "y " " n" "ne" "ew" "w " " c" "ca" "at"]. The sentence is divided into objects of two words or two characters, sequentially. N-gram models can be powerful, since the new vectorization of the text gives an idea of what word or character comes before another; it gives each represented token a context. An N-gram model takes words or characters that come before and after each other into account and can therefore help a machine learning model to identify which words belong together. Such models are often used when the task is to automatically finish sentences, using probability theory to predict the next word in an unfinished sentence [26]. The N-gram models are probabilistic models, but the N-gram idea can also be used to extract the features of a text to be used as the input for a neural network. A minimal sketch of character 2-grams is shown below.
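This sketch reproduces the character 2-gram example above in plain Python; the function name is arbitrary.

def char_ngrams(text, n=2):
    # Slide a window of n characters over the text, one step at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("My new cat", 2))
# ['My', 'y ', ' n', 'ne', 'ew', 'w ', ' c', 'ca', 'at']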

2.7.3 Word and character embeddings

There are many different approaches to word and character embeddings, and many companies have developed toolkits for this purpose. Some examples are Google's word2vec, the NLTK library, skip-grams, N-grams and BoW, to mention a few. One approach, used specifically for neural networks, is to decide how many words or characters the model should know about, in contrast to the BoW methods that keep all known words. This word or character embedding approach only saves the most common words appearing in the training dataset. Embeddings keep an extra word or character category, the "out of vocabulary" token, which represents the rest of the words without specifying them further. If, for example, the size of the vocabulary is chosen to be 50, the 50 most common words or characters (depending on what is chosen) are saved to a vector and all of the leftover words or characters are put in the same category. The vector of the 50 most common tokens, plus 1 for "out of vocabulary" words or characters, is numbered from 0 to 50. All data samples are mapped from that vocabulary. The dataset is thereby transformed from vectors of strings to vectors of integers that can be decoded with the vocabulary of 51 tokens. The dataset can be chosen to be embedded as words or as characters.
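A hedged sketch of this index mapping in plain Python, kept library-neutral (Keras offers equivalent utilities); the sizes and sample strings are illustrative.

from collections import Counter

samples = ["hej hej världen", "hello world", "hola mundo"]

# Count character frequencies over the training data.
counts = Counter("".join(samples))

# Keep the most common characters; index 0 is reserved for out-of-vocabulary.
vocab_size = 10
vocab = {ch: i + 1 for i, (ch, _) in enumerate(counts.most_common(vocab_size))}

def encode(text):
    # Map each character to its index, or 0 if it is out of vocabulary.
    return [vocab.get(ch, 0) for ch in text]

print(encode("hej world"))  # a vector of integers, one per character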

2.8 Data extraction

When dealing with AI and machine learning, one of the key components for it to work is data. The dataset that trains and evaluates the model determines how generalized the model can be. Often, large datasets are needed, and for supervised machine learning the dataset also needs to be correctly annotated. Machine learning has risen in the last decade not only because of the development of computational power but also because of the developments in data structuring and databases. There are millions of data points gathered online, by companies and authorities. The cloud has made it possible to store, retrieve and use data in a whole new way, which makes machine learning possible. Datasets can be private or include sensitive information and, in that sense, be difficult to get access to. Finding a suitable dataset is an important part of building machine learning projects. There are however many open datasets available online. For this thesis it has been important that all data has been found through open sources, which has led to the investigation of how to extract certain types of data online.

The thesis objective is to look into short pieces of text in a specific environment. One source of short pieces of text in many different languages is text messages. These types of texts may be hard to find, since they tend to be private and only the people communicating can see them. Public short pieces of text include, for example, comments on social media posts, and social media posts in general. These texts are usually not that long and can be publicly available. This brought the thesis to use posts on Twitter; it could have been any other specific environment.

2.8.1 Twitter's API

When looking into how to retrieve Twitter data, it quickly becomes clear that Twitter already provides a service for accessing its data. Twitter's API (application programming interface) offers developers and users the possibility to get tweets, replies, direct messages (where the user has granted access), ads, publisher tools and an SDK (software development kit) [27]. Many of these features are of interest when developers embed Twitter in their own applications. For this thesis, only the tweet retrieval function is of interest.

2.8.2 Web crawling

Another possible way to find suitable datasets, not only from Twitter but from the entire web, is through web crawling services. One commonly mentioned and used web crawler service is Common Crawl [28]. Common Crawl frequently (almost every month) crawls the web and makes all of the crawled data open to anyone. Common Crawl provides web crawls as raw web page data, extracted metadata and text extractions.

3 Method

3.1 Dataset

For this thesis, a representative dataset of short online texts was required. The dataset needs to contain natural mistakes made by humans, like misspellings, typos, grammatically incorrect text, informal words and slang. Considering it is both time consuming and inaccurate to create a dataset of this type by oneself, it was ideal to find already scraped data. Twitter is a good place to find many different people's writing, and the length of the writing is restricted to 280 characters. The dataset from [29] contains the tweet identification numbers of 1.6 million tweets in 15 different languages. This dataset was scraped before 2017, which was before the maximum tweet length was expanded to 280 characters [30]. Therefore the Twitter corpus from [29] contains only tweets up to about 140 characters, which was the limit before 2017. The corpus has been annotated manually by native speakers of every language and has been considered to be correctly annotated in this thesis. From the annotated dataset, 9 languages were chosen for the thesis. The 9 languages were chosen so the model had a wide spread of different languages.

Figure 14: Distribution of tweets of different character lengths in the dataset of 8 languages.

To be able to make use of this dataset, Twitter's API is needed. The user needs to create a developer's account through a Twitter account to get the credentials to reach and unlock the API. Once this is set up there are several ways to reach the desired data. This thesis made use of the open library Tweepy, where several ready-to-use functions exist for extracting tweets from just a tweet identification number. The dataset was therefore created by using the dataset [29] of tweet identification numbers for the wanted languages, running them through code that extracted the exact tweet for each number and saving it. The thesis started off getting 2000 samples (tweets) for five different languages; later on this was expanded to 4000 samples per language. The final model used in total 4000 samples for 8 different languages, resulting in around 32,000 samples; the distribution can be seen in figure 14. The languages that were considered were English, Swedish, Spanish, Portuguese and Russian; later German, Polish and Serbian were added. As a final language, 4000 Croatian tweets were added to the dataset.
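A hedged sketch of this extraction step, assuming Tweepy 3.x and placeholder credentials; get_status is Tweepy's call for fetching a single tweet by its identification number, and the file handling around it is illustrative.

import tweepy

# Placeholder credentials from a Twitter developer account.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

tweet_ids = [1234567890, 9876543210]  # IDs from the annotated corpus [29]

with open("tweets.txt", "w", encoding="utf-8") as out:
    for tweet_id in tweet_ids:
        try:
            status = api.get_status(tweet_id, tweet_mode="extended")
            out.write(status.full_text.replace("\n", " ") + "\n")
        except tweepy.TweepError:
            pass  # tweet deleted or protected; skip it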

Table 1: Abbreviations for all 9 languages used in the thesis.

Language     Abbreviation
English      Eng
Swedish      Swe
Spanish      Spa
Portuguese   Por
Russian      Rus
German       Ger
Polish       Pol
Serbian      Ser
Croatian     Cro

3.2 Preprocessing

Preprocessing of the dataset includes structuring the data into suitable files for training data, validation data and test data. For this project the split between these has been kept as 80% training data, 10% validation data and 10% test data. The data was first kept completely raw and later also cleaned from web addresses and emojis. It is optional whether to keep the blank spaces between words; considering that the blank spaces give information about the structure of the text, they have been kept in this thesis. Other characters to consider removing or keeping are hashtags, user names, and symbols like question marks, exclamation marks, commas, diacritical marks etc. These symbols have also been kept, as they can possibly give more information on what language the strings are in. No metadata was used for training, meaning no information on where the tweet was sent from or by whom. Names or retweeted accounts mentioned in the tweets have not been published or used in any other way than as input for the model, as names can differ depending on language.

Word and character embeddings have been used on the collected dataset, and the text has been mapped to a vocabulary as described in section 2.7.3. Considering that the data consists of tweets with a maximum character length of around 140, the input vector length for the ML model was chosen to be at most 150. To make all data samples the same length, zero-padding was used. Zero-padding means that if a data sample, in this case a string, is shorter than 150 characters, the vector is filled with zeroes in the "empty" places. See figure 15 for an example of the word and character embedding and zero-padding of a data sample. The zero-padding was not needed for all of the tests during the thesis but was kept to make the testing easier throughout the thesis work.

Figure 15: Three examples of the test data samples and how they are represented, where the 10 most frequently occurring characters of the vocabulary are shown.

Zero padding is not necessary for all types of layers in a NN. Considering that this thesis experimented with different approaches, the zero-padding was kept throughout the whole project.
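A minimal sketch of the encoding and padding step using Keras utilities, assuming the vocabulary mapping from section 2.7.3; the variable names are illustrative.

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Integer-encoded samples of different lengths (see section 2.7.3).
encoded = [[5, 2, 9, 1], [7, 3]]

# Pad with zeros at the end so every sample has length 150.
padded = pad_sequences(encoded, maxlen=150, padding="post", value=0)
print(padded.shape)  # (2, 150)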

3.3 Machine learning model

The machine learning model designed for this thesis is a network consisting of an embedding layer, a bidirectional long short-term memory layer, a dropout layer, a dense layer and finally a dense output layer. A simple sketch of the final model can be seen in figure 16.

Figure 16: Simple sketch of the final model.

As seen in section 3.2 above, the input data sample to the model is a vector of length 150. This vector is pushed through an embedding layer. From the embedding layer, an output of size 128 is pushed through the bidirectional LSTM layer, which also has 128 units; as the layer is bidirectional it results in 256 outputs. After the bidirectional LSTM layer, these 256 values are pushed through a dense layer, which gives 128 outputs to the final dense layer, which in turn gives eight outputs, one for each of the languages of the model.

The final dense layer, which outputs 8 values, represents a probability for each of the eight languages. The vector of length eight sums to 1, and the highest of these probabilities decides which language the model predicts for the specific input. Ideally, one language has a probability close to 1 and the other seven languages have probabilities close to 0. This could mean that the model is certain of one language. If the values in the vector instead were around 0.5 for two of the eight languages, it could be assumed that the model has a hard time knowing which of the two languages the input sample is. It is not completely clear what the values of this last output vector mean. They could be a measure of how certain the model is of one language, or the probability of how much resemblance the input sample has with each language. In figure 17 an example of a softmax output vector for a Russian tweet can be seen. A hedged code sketch of the architecture is given below.
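A sketch of the described architecture in Keras, assuming the preprocessing of section 3.2; the dropout rate is an assumption, since the text does not state it.

import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 300 + 1   # vocabulary of 300 plus the out-of-vocabulary token
num_languages = 8

model = tf.keras.Sequential([
    layers.Embedding(vocab_size, 128, input_length=150),
    layers.Bidirectional(layers.LSTM(128)),   # 2 x 128 = 256 outputs
    layers.Dropout(0.5),                      # rate assumed, not stated
    layers.Dense(128, activation="relu"),
    layers.Dense(num_languages, activation="softmax"),
])

model.compile(
    optimizer="adam",                        # ADAM, see section 2.4.3
    loss="sparse_categorical_crossentropy",  # equation 10
    metrics=["accuracy"],
)
model.summary()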

Figure 17: Example of what the output from the model looks like, where the bold blue vector value will be the model's predicted label. In this case the predicted label is correct, as the index of the vector place is 4, just like the correct label.

3.4 Software setup

All coding for the thesis has been done in Python 3. Python is a high-level programming language that has become one of the most common languages for data scientists worldwide [31]. Python suits machine learning tasks well and has many well developed libraries for that specific purpose. There are many helpful guides and tutorials available online, which makes Python easy to use. A few to mention are Susan Li's notebook for building an LSTM model on GitHub [32] and Alex-Just's GitHub code and comments for stripping text of emojis and characters [33].

3.4.1 TensorFlow and Keras

It is essential to have a library that can handle specific types of computations when building machine learning models, especially when building deep neural networks. There are several free, open-source libraries today; the most common libraries used in Python include PyTorch and TensorFlow. These are mathematical libraries that have been created for users to easily design and construct machine learning models for different purposes. TensorFlow was developed by Google and PyTorch was developed by Facebook. These libraries were introduced in 2015 and 2016 respectively, and have made a huge impact on the accessibility for data scientists to create machine learning models in a broader sense [34]. Today there are many more libraries that have been created to make models perform well, be visualised and be user friendly. Keras [35] is an API running on top of TensorFlow that makes it easy to implement machine learning models in Python; for this project this is the main reason for using TensorFlow and Keras together.

Apart from these Python machine learning libraries, several other free open-source libraries have been used in the thesis to create the model, visualise the data and evaluate the performance of the different models that have been studied. A few of these libraries are numpy, pandas, sklearn, matplotlib and seaborn [36][37][38][39][40].

3.5 Experiments

Several different aspects of the neural networks have been studied and tested in this thesis, to see how the models handle the data and how well they perform. The important aspects that have been studied are how well the model handles the shortest texts in the dataset, how fast the model predicts test data, and the accuracy.

3.5.1 Hyperparameter tuning

The model has several hyperparameters that affect the model's behaviour and performance in different ways. There are several ways to tune these hyperparameters; different automatic optimization techniques like Bayesian optimization are possible. A common approach is also to try out different parameter values and see what happens. This has been done for the thesis' model. The parameters chosen to be tested and alternated are the vocabulary size, the embedding dimension, the number of epochs and the batch size. The different values for all tests can be seen in tables 2-5. All tests' original settings were kept as a vocabulary size of 300, an embedding dimension of 128, a batch size of 128 and 15 epochs. Depending on which test was performed, one of the parameters was altered.

Table 2: Alternations of the vocabulary size.

Test   Vocabulary size
1      50
2      100
3      300
4      500
5      1000

Table 3: Alternations of the embedding dimension.

Test   Embedding dimension
1      32
2      64
3      128
4      256

Table 4: Alternations of the number of epochs.

Test   Number of epochs
1      5
2      15
3      30
4      50
5      100

Table 5: Alternations of the batch size.

Test   Batch size
1      32
2      64
3      128

The four tests for the hyperparameter tuning were performed using 5 languages with 2000 samples each as the dataset. The five languages were English, Swedish, Spanish, Portuguese and Russian. For all the above tests, the character embedding was used.

3.5.2 Word versus character embeddings

The model was trained with both word embeddings and character embeddings in the preprocessing of the data. The character embedding can also be described as a 1-gram: each character in the text sample is treated as a token. With a word embedding, every word is considered one token. The model accuracy was evaluated. This test was also performed using 5 languages with 2000 samples each. The settings for the word versus character embeddings were kept at a vocabulary size of 300, an embedding dimension of 128 and a batch size of 128 during 30 epochs.

3.5.3 Preprocessing variations

It was desired to have "raw" text for this thesis, meaning not much cleaning of the texts was done. A few tests were made to see what difference it makes if the text contains URLs and emojis. The preprocessing test was performed using eight languages with 4000 samples each. The eight languages were English, Swedish, Spanish, Portuguese, Russian, German, Polish and Serbian. The settings for the preprocessing variations were kept as earlier, with a vocabulary size of 300, an embedding dimension of 128 and a batch size of 128 during 20 epochs.

3.5.4 fastText

The fastText supervised model training, see section 2.6, was used on the dataset and compared to the LSTM model. A speed test for both models was also conducted.
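A hedged sketch of this benchmark setup, assuming the data has been written to text files in fastText's expected format (one sample per line, label prefixed with __label__); the file names are illustrative.

import fasttext

# train.txt lines look like: "__label__swe jag gillar katter"
model = fasttext.train_supervised(input="train.txt")

# Evaluate on the held-out test file: returns (N, precision@1, recall@1).
print(model.test("test.txt"))

# Predict the language of a single short text.
print(model.predict("qué bonito día"))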

3.5.5 Two similar languages

The last experiment was to see what happens when the model is to recognise two similar languages. For this part, Croatian data was added to the model, giving the model 9 languages to recognise. Croatian and Serbian are similar, arguably even the same language with small differences.

4 Results

4.1 Hyperparameter tuning

In tables 6 and 7 the results from the tests of alternating the vocabulary size and embedding dimension can be seen.

Table 6: Results of different vocabulary sizes.

Vocabulary size   Validation accuracy   Validation loss   Test accuracy   Test loss
50                93.80%                0.18              91.81%          0.21
100               95.00%                0.15              93.61%          0.18
300               96.70%                0.11              95.70%          0.11
500               93.80%                0.20              93.71%          0.20
1000              91.70%                0.23              91.91%          0.22

Table 7: Results of different embedding dimensions.

Embedding dimension   Validation accuracy   Validation loss   Test accuracy   Test loss
32                    94.20%                0.17              92.51%          0.21
64                    94.20%                0.18              93.01%          0.22
128                   93.90%                0.18              93.81%          0.20
256                   95.50%                0.15              94.61%          0.18

In figure 18 the accuracy and the loss of training over 100 epochs can be seen.

(a) Accuracy (b) Loss

Figure 18: Accuracy and loss over 100 epochs for a model of 5 languages.

In table 8 the results of tests for alternating the batch size can be seen.

Table 8: Results of different batch sizes.

Batch size   Validation accuracy   Validation loss   Test accuracy   Test loss
32           94.50%                0.16              92.81%          0.24
64           91.60%                0.28              89.91%          0.31
128          95.20%                0.15              93.11%          0.21

4.2 Word versus character embeddings

In figures 19a and 19b the accuracy for different tweet sample lengths is displayed for word embeddings and character embeddings respectively. The embedding setting is decided when mapping the vocabulary onto the training dataset, before starting the model training.

(a) Word embeddings. (b) Character embeddings.

Figure 19: Accuracy per tweet character length for word embeddings and character embeddings.

4.3 Preprocessing variations

The accuracy for different tweet sample lengths when addresses and emojis have been removed can be seen in figure 20a. The corresponding result when no cleaning was done during the preprocessing can be seen in figure 20b.

(a) Accuracy with cleaning. (b) Accuracy without cleaning.

Figure 20: Accuracy for different data sample lengths with and without preprocessing cleaning.

4.4 Final model

The results for the final model can be seen in figures 21 and 22. Although one more language was added later (see section 4.6), most time was spent on evaluating the 8 language model. Figure 21 shows the confusion matrix for the 8 languages, where the true labels of the test samples are on the y-axis and the model's predicted labels for the test samples are on the x-axis.

Figure 21: Confusion matrix for test data of the 8 language LSTM model.

(a) Accuracy (b) Loss

Figure 22: Accuracy and loss for training and validation data.

Figure 23 shows the correctly classified test samples, where the probabilities on the y-axis represent the highest probability from the softmax output vector. The tweet length is represented on the x-axis. The points in the figure are the mean values of the probabilities for the tweet sample lengths present.

Figure 23: The mean probability value for each tweet length among the correctly classified test data.

Similar to figure 23, the probabilities from the softmax output vector for each language have been sorted in ascending order and plotted in figures 24a-24d. Here, every test sample's probability value for the language in question has been plotted, including the test samples whose true language was any other language in the test dataset.

Figure 24: The probability value from the softmax output vector for four languages, (a) English, (b) Portuguese, (c) Russian and (d) Swedish; the values come from all of the test data samples.

4.5 fastText

Table 9 shows the results of the comparison with fastText. FastText only returns the training loss of the trained model, while the LSTM model gives a test loss. A speed test was also performed for both models, timing how long it took for each model to predict 4000 samples. Both models used data with 8 languages and 4000 samples per language.

Figure 25: Confusion matrix for test data of the 8-language fastText model.

Table 9: Results of test accuracy, loss and the time to predict 4000 samples for each model.

Model      Test accuracy   Loss               Prediction speed
LSTM       94.9%           0.16 (test)        4.67 sec
fastText   97.0%           0.060 (training)   0.960 sec
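A prediction speed test of this kind can be timed along the following lines; here, model and samples are placeholders for a trained model and the 4000 preprocessed test samples.

import time

start = time.perf_counter()
predictions = model.predict(samples)          # predict 4000 samples in one call
elapsed = time.perf_counter() - start
print(f"Predicted {len(samples)} samples in {elapsed:.2f} s")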

4.6 Two similar languages

The confusion matrix when adding a language to the LSTM model is shown in figure 26. Below, in figure 27, the confusion matrix for the fastText model when adding one more language is shown. The language that has been added is Croatian, and again a plot of all the corresponding probabilities from the softmax output vector for Croatian and Serbian is shown in figures 28a and 28b.

Figure 26: Confusion matrix for test data of the 9-language LSTM model.

Figure 27: Confusion matrix for test data of the 9-language fastText model.

Figure 28: The probability value from the softmax output vector for (a) Serbian and (b) Croatian, over all test data samples for the LSTM model.

5 Discussion

5.1 Results

From the results of the hyperparameter tuning it was decided to keep the parameters according to table 10. Tables 6, 7 and 8 show the parameter values for which the accuracy was maximised and the loss minimised. The embedding dimension was chosen as 128, even though the accuracy was higher at 256, for two reasons: firstly, it takes longer to train a model with large embedding dimensions; secondly, a large embedding dimension yields a large total number of parameters. Considering the dataset does not have more than around 30,000 data samples, having millions of parameters puts the model at risk of overfitting. Figure 18 shows how the model overfits the training data around epochs 35-40. The training accuracy reaches 100% while the validation accuracy does not increase. The gap between the training and validation loss is also seen to grow in figure 18b, which suggests the model is overfitting. It is therefore better to stop the training before this happens, and the number of epochs was therefore set to 30.

Table 10: The chosen hyperparameters for the model chosen from the test results.

Parameter             Value
Vocabulary size       300
Embedding dimension   128
Number of epochs      30
Batch size            128
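To make the chosen configuration concrete, the sketch below shows how a model with these hyperparameters could be assembled in Keras [35]. The number of LSTM units is an assumption for illustration and is not given by table 10.

import tensorflow as tf

VOCAB_SIZE = 300        # from table 10
EMBEDDING_DIM = 128     # from table 10
NUM_LANGUAGES = 8

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, mask_zero=True),
    tf.keras.layers.LSTM(64),               # 64 units is an illustrative choice
    tf.keras.layers.Dense(NUM_LANGUAGES, activation="softmax"),
])

model.compile(optimizer="adam",                          # Adam, cf. [15]
              loss="sparse_categorical_crossentropy",    # cf. [16]
              metrics=["accuracy"])

# Training with the chosen batch size and number of epochs:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=128, epochs=30)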

Figure 19 suggests that the model more often correctly classifies the shorter pieces of text, less than 60 characters, when character embedding was used. A similar result is seen in figure 20, where the preprocessing was done with and without cleaning of specific characters. Keeping the text raw and "dirty" seems to be beneficial for shorter text, as the accuracy for the shortest character length bin is around 80% without cleaning and closer to 70% with cleaning. It is also clear from both figures 19 and 20 that the accuracy is lower for tweets with less than 60 characters.

Using the hyperparameters in table 10, the final model gives the results in figures 21 and 22. The confusion matrix in figure 21 shows how the accuracy differs between the languages. Russian is 100% accurate on the test dataset, which can probably be explained by the fact that it is the only language using another script, the Cyrillic alphabet instead of the Latin alphabet. This makes it easy for the model to learn to recognise Russian text. German, in contrast, had 8% of its tweets predicted to be English, and the same holds for several other languages like Spanish, Swedish, Polish and Portuguese. There are probably several reasons for this, including the fact that many non-English languages have incorporated English words and hashtags, and the fact that many of these languages share many similarities or come from the same language family group, like the Germanic, Romance or Slavic languages. It could therefore be expected that Spanish text would be misclassified as Portuguese; interestingly enough, Portuguese text is not predicted as Spanish as often. This could imply that Spanish has much in common with Portuguese while Portuguese has some unique attributes compared to Spanish, making Portuguese easier to distinguish.

Figure 22 shows the training history of the model: both the training and validation accuracy and loss lie close to each other, meaning the model is not overfitted. This yields a more trustworthy result. The accuracy of the model lies around 95%, which means it predicts a large part of new data samples correctly. The loss for both the training and validation datasets is around 0.20, which needs to be compared to another loss value for a conclusion to be made.

The softmax output produces a vector of probability values for each input sample; each vector includes the probabilities for all eight languages of the model for that specific sample. In figure 23 the mean probability of the predicted language, when correctly predicted, is plotted for every existing tweet length. The figure suggests a slight trend towards an increasing probability value for longer tweets, which could mean that the model is more certain of the language for the longer tweets. Since the model predicts the class with the highest value from the softmax output vector, the value of the predicted language must be > 1/8 = 0.125, or 12.5%. If the probability were this low, the model would have assigned similar probabilities to all indexes of the softmax output vector. Looking at figure 23, the probability values of tweets shorter than 20 characters are more spread out, from 0.60 to 1.0. The probability values close to or equal to 1.0 in that area happened to be Russian tweets. Comparing this with figure 24c, it looks like all Russian softmax output vector probability values are either almost 1.00 or zero. Considering the model gave 100% accurate test data results for Russian, the probability values show that the model is certain of tweets being in Russian when they are.

Figures 24a and 24b show how the softmax output vector probabilities for English and Portuguese are spread out. This can mean many different things: the model can be predicting similarities to other languages, falsely predicting one of the other languages, or be uncertain about the correct language. In other words, the model is more confused and not as certain as when the language is Russian. Neither English nor Portuguese has as long and horizontal a line at the top, where most of the correctly predicted data samples are probably located, as Russian does. The Swedish curve, figure 24d, has a sparser spread of softmax output vector probabilities than both English and Portuguese. This can suggest that the model is more "certain" when predicting a data sample as Swedish, possibly because Swedish has three unique letters in its alphabet (å, ä and ö). This argument can probably also hold for other languages with distinctive letters, like ñ in Spanish.

In table 9, the comparison between the LSTM model and the fastText model shows that fastText has a higher accuracy, a lower loss and predicts new samples faster than the LSTM model. The prediction speed is of interest if a model is to be used in an application or website, especially in real time. FastText is simpler in its architecture, which clearly shows in the prediction speed. It is important to note that the LSTM model has not been written for an application purpose; the performance would probably be better if the code were rewritten. FastText also makes use of N-grams of both words and characters, which seems to increase the accuracy. When tested with the same type of character embedding as the LSTM model, fastText had a lower accuracy than the one presented. It is therefore possible that if the LSTM model also made use of N-grams of both words and characters, the evaluation would yield a higher accuracy and lower loss, just like for fastText.

When adding a ninth language, Croatian, that is similar to one of the already existing languages, Serbian, the results in section 4.6 show some interesting behaviour. The accuracy is still fairly high (above 92%) for the entire model. Looking at each individual language in the confusion matrix in figure 26, the accuracy for the Croatian and Serbian test data samples has dropped. The accuracy for the Croatian tweets is 68.9%, with a large portion of the test data, 26.6%, being misclassified as Serbian. The Serbian tweets have gone from a classification accuracy of 97.2% (figure 21) to 82.6%, and 14% of the Serbian test data was misclassified as Croatian. The model was retrained without any knowledge from the former 8-language model. It seems like the LSTM model has a hard time establishing the difference between the two similar languages. When Croatian was added to a fastText model, the confusion matrix in figure 27 shows that this model also has a lower accuracy for the Croatian and Serbian test data, but compared to the LSTM model in figure 26, fastText seems to handle the two similar languages a little better. Figure 28 shows how the curves of the softmax output vector probabilities for Croatian and Serbian do not have a proper peak around 1.0, like the curves in figure 24. This possibly means that the model is confused by both Croatian and Serbian tweets.

The softmax output vector gives probabilities for each language of the data sample, as mentioned earlier. This probability reflects the model's "belief" about the input data sample; it does not reflect a true probability and should not be interpreted as the real probability of the presence of a language, or of similarities between languages.
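A small numeric sketch of how these softmax values behave (the logits below are made up for illustration and are not model output):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # shift by the max for numerical stability
    return e / e.sum()

print(softmax(np.zeros(8)))       # uniform: every language gets 1/8 = 0.125

confident = softmax(np.array([8.0, 0, 0, 0, 0, 0, 0, 0]))
print(confident.max())            # close to 1.0, like the Russian predictions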

5.2 Errors

A couple of error sources need to be mentioned for the thesis. The dataset used in the thesis work is based on a collection of tweets that has been annotated by hand. This means there is a risk of human error in whether a tweet is correctly labelled with a certain language. As this is difficult to keep track of and check, it has been assumed for the thesis work that the annotation is correct. Another concern with the dataset is when the tweets were collected: between 2013 and 2015 [29]. As the dataset is a couple of years old, the topics of the tweets can be directed towards certain time-dependent events that are not of interest today. This can make the model unbalanced if new Twitter data from today were used as input to the model. Only 9 languages were used for the thesis' model, which does not represent all languages. It was decided to only look at a few languages to get a good overview of how the model works. It is important to keep in mind that if several more languages were added, the accuracy and performance of the model would probably change.

5.3 Ethical reflection

Building ML models can be exciting and give interesting insights into how data samples relate to each other. It is important, and an obligation for engineers, to think about how ML and AI are used, for what purpose, and how the data is being handled. For this thesis, short pieces of text from Twitter are used, which are written by individual people and can be viewed as private thoughts and ideas. An individual tweet may not be representative of an entire group, in this case a language, so no conclusions should be drawn about in what way, or with what words, specific native speakers write. How the model is used and generalised can become an issue: if the purpose of using the model is to filter out or censor certain languages on a specific website or in a program, it needs to be established what this can cause. When discussing ethical AI and ML, the idea of a model being biased often comes up. Merely by choosing a specific dataset to train an ML model, a bias has arisen in the model. As it is impossible to collect and train models on all data there is within an area, the model will always be slightly biased. This is why it is important to spend time evaluating the dataset being used and making it as general and well represented as possible for the task. A result of not building models thoughtfully can be that they discriminate against one group of people, or generalise a behaviour to a larger group. One example is Google's hate speech detection AI tool that was meant to identify "toxic" language written online, but instead ended up being racially biased against African-Americans [41]. It needs to be made clear by the engineer, or model creator, what the model actually says and what it does not say. It is sometimes easier and more appealing to draw quick conclusions from data samples, data behaviour and ML model behaviour. The reporting of the results needs to be transparent and clearly state what the results actually say.

5.4 Future work

It would be of interest to keep exploring ML models that handle similar languages. In this thesis the chosen languages were largely distinct from each other, possibly making it easier for the model to distinguish them. Future work could include romanisation of Russian, meaning Russian written in the Latin alphabet, or other languages written in the Cyrillic alphabet, like Ukrainian. Another possibility is to look into how emojis are used in different languages, and whether it would be possible to identify languages through which emojis are used when writing short texts. Future work could also dive deeper into the shortest pieces of text, less than 50 characters, and how to automatically identify the language, or languages, of such text. It is also possible to further explore how an ML model for short texts should be implemented in an application, and how the softmax output vector values should be interpreted. Lastly, newer ML model approaches, like the ones seen in fastText, could be studied further.

6 Conclusion

How the text feature extraction is chosen is crucial for short texts. Whether to use character or word embeddings, N-grams or bag-of-words can affect the results of the model considerably. As shown in the results, the fastText model, which uses both N-grams and word embeddings, has a better accuracy than the LSTM model, which only uses character embeddings. A combination of these seems to improve the accuracy, and a simpler model architecture improves the prediction speed.

The results show that the LSTM model often classifies the language correctly; it is however not clear how certain the model is of its predictions. The softmax output vector probability values can give a hint of how confused the model is by the input data sample. One improvement would be to add a threshold for the highest value of the softmax output vector: if the highest probability out of all the languages is below the threshold, the model would classify the data sample as "unknown" or "uncertain", which would probably yield a much lower accuracy than what has been seen. Depending on the purpose of the automatic language identification model, it could instead be designed to suggest a few possible languages. The uncertainty comes from the fact that the tweets do not consist of many characters, making them harder to identify [42].

When dealing with similar languages, like Croatian and Serbian, the results show that it could be a good idea to either create a separate language identifier for only these similar languages or merge the two languages into the same class. This would make the model more certain of the differences between languages that are not similar. A future model could perhaps include several steps, where the first step is to identify the language family group of a text, before moving on to a specific model for that language family group. In this way a model can narrow down to only a couple of languages as possible correct classifications.
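As a sketch of the thresholding idea discussed above (the threshold value 0.5 and the language list are arbitrary assumptions for illustration):

import numpy as np

def classify_with_threshold(softmax_vector, languages, threshold=0.5):
    # Pick the language with the highest softmax probability, but return
    # "unknown" when the model is not confident enough.
    best = int(np.argmax(softmax_vector))
    if softmax_vector[best] < threshold:
        return "unknown"
    return languages[best]

# Example: an uncertain prediction split between two similar languages.
probs = np.array([0.40, 0.35, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03])
langs = ["hr", "sr", "en", "sv", "es", "pt", "ru", "de"]
print(classify_with_threshold(probs, langs))   # -> "unknown"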

It is clear that there is more to discover within the field of LID systems for short texts. As languages change all the time and large, up-to-date datasets are required, the ML models need to be continually updated and improved.

References

[1] ITU Telecommunication Development Bureau. Measuring digital development: Facts and figures 2019. ITU Publications. https://www.itu.int/en/ITU-D/Statistics/Documents/facts/FactsFigures2019.pdf. Retrieved 2020-07-07.
[2] M.A. Nejla Qafmolla. Automatic Language Identification. European Journal of Language and Literature Studies, Volume 3, Issue 1. 2017. ISSN 2411-4103.
[3] D. W. Otter, J. R. Medina, J. K. Kalita. Survey of the Usages of Deep Learning for Natural Language Processing. IEEE Transactions on Neural Networks and Learning Systems. April 21, 2020.
[4] T. Jauhiainen, M. Lui, M. Zampieri, T. Baldwin, K. Lindén. Automatic language identification: A survey. Journal of Artificial Intelligence Research, Vol 65. August 25, 2019.
[5] S. Russell, P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, New Jersey. Third edition, 2010.
[6] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 210-229. 1959.
[7] J. Wehrmann, W. E. Becker, R. C. Barros. A Multi-Task Neural Network for Multilingual Sentiment Classification and Language Detection on Twitter. In Proceedings of ACM SAC Conference, Pau, France, April 9-13, 2018 (SAC'18), 8 pages.
[8] Input & Intelligence - Natural Language Processing Team. Language Identification from Very Short Texts. Apple Machine Learning Journal, Vol. 1, Issue 14. July 2019.
[9] G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning with Applications in R. 7th ed. New York: Springer Science+Business Media. 2013.
[10] Wikipedia. Bayes' theorem. https://en.wikipedia.org/wiki/Bayes%27_theorem. Fetched 2020-06-05.
[11] Y. S. Abu-Mostafa, M. Magdon-Ismail, H-T. Lin. Learning From Data: A Short Course. AMLbook.com, 2012.
[12] S. Haykin. Neural Networks and Learning Machines. Pearson Education, third edition. 2009.
[13] A. Krizhevsky, I. Sutskever, G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Neural Information Processing Systems, 25. 2012. 10.1145/3065386.
[14] A. Lindholm, N. Wahlström, F. Lindsten, T. B. Schön. Supervised Machine Learning. Department of Information Technology, Uppsala University. Available at: http://www.it.uu.se/edu/course/homepage/sml/literature/lecture_notes.pdf. March 2019.
[15] D. P. Kingma, J. Ba. Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR). December 22, 2014. arXiv:1412.6980.
[16] TensorFlow documentation. https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy. Fetched 2020-08-10.
[17] P. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78, 1550-1560. 10.1109/5.58337. 1990.
[18] N. M. Rezk, M. Purnaprajna, T. Nordström, Z. Ul-Abdin. Recurrent Neural Networks: An Embedded Computing Perspective. IEEE Access. March 23, 2020.
[19] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, J. Schmidhuber. LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222-2232, Oct. 2017. doi: 10.1109/TNNLS.2016.2582924.
[20] A. Sherstinsky. Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network. Elsevier journal "Physica D: Nonlinear Phenomena", Volume 404. March 2020.
[21] M. Sundermeyer, R. Schlüter, H. Ney. LSTM Neural Networks for Language Modeling. Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany.
[22] fastText's website. Resources. https://fasttext.cc/docs/en/language-identification.html. Fetched 2020-06-12.
[23] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov. Bag of Tricks for Efficient Text Classification. Facebook AI Research. August 9, 2016.
[24] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov. Enriching Word Vectors with Subword Information. Facebook AI Research. July 16, 2016.
[25] Unicode's home website. https://home.unicode.org. Fetched 2020-06-10.
[26] D. Jurafsky, J. H. Martin. N-gram language models. Speech and Language Processing, chapter 3. Draft 2019.
[27] About Twitter's API. https://help.twitter.com/en/rules-and-policies/twitter-api. Fetched 2020-06-11.
[28] CommonCrawl's website. https://commoncrawl.org/. Fetched 2020-06-12.
[29] I. Mozetič, M. Grčar, J. Smailović. Multilingual Twitter Sentiment Classification: The Role of Human Annotators. PLOS ONE. May 5, 2016.
[30] S. Perez. Techcrunch blog. https://techcrunch.com/2017/11/07/twitter-officially-expands-its-character-count-to-280-starting-today/. Published November 7, 2017. Fetched 2020-06-12.

[31] B. Hayes. Programming languages most used and recommended by data scientists. https://businessoverbroadway.com/2019/01/13/programming-languages-most-used-and-recommended-by-data-scientists/. Published 2019-01-13. Fetched 2020-06-11.
[32] S. Li. GitHub. https://github.com/susanli2016/PyCon-Canada-2019-NLP-Tutorial/blob/master/BBC%20News_LSTM.ipynb. Last commit 2019-10-15.
[33] Alex-Just, commented by mgaitan. GitHub. https://gist.github.com/Alex-Just/e86110836f3f93fe7932290526529cd1#gistcomment-3208085. Comment committed 2020-03-11.
[34] H. He. The State of Machine Learning Frameworks in 2019. The Gradient. https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-research-tensorflow-dominates-industry/. Published October 10, 2019. Fetched 2020-06-12.
[35] Keras website. https://keras.io/
[36] T. E. Oliphant. A Guide to NumPy. Vol. 1. Trelgol Publishing, USA. 2006.
[37] W. McKinney et al. Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference, p. 51-56. 2010.
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011;12(Oct):2825-2830.
[39] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering. 2007;9(3):90-95.
[40] M. Waskom, et al. mwaskom/seaborn: v0.8.1 (September 2017). Zenodo. Available at: https://doi.org/10.5281/zenodo.883859. 2017.
[41] N. Martin. Google's Artificial Intelligence Hate Speech Detector Is 'Racially Biased,' Study Finds. Forbes Media. https://www.forbes.com/sites/nicolemartin1/2019/08/13/googles-artificial-intelligence-hate-speech-detector-is-racially-biased/#d46ab7b326c4. Published 2019-08-12. Fetched 2020-07-02.
[42] H. Boström, H. Linusson, T. Löfström, U. Johansson. Accelerating difficulty estimation for conformal regression forests. Annals of Mathematics and Artificial Intelligence. 2017. 10.1007/s10472-017-9539-9.
