Student Thesis Master’s Level (Second Cycle) Preprocessing Method Comparison and Model Tuning for Natural Language Data

Author: Peter Tempfli
Supervisors: William Wei Song and Serena Barakat
Examiner: Moudud Alam
Subject/main field of study: Microdata Analysis
Course code: MI4002
Higher education credits: 15 ECTS credits
Date of examination: 02/06/2020

At Dalarna University it is possible to publish the student thesis in full text in DiVA. The publishing is open access, which means the work will be freely accessible to read and download on the internet. This will significantly increase the dissemination and visibility of the student thesis. Open access is becoming the standard route for spreading scientific and academic information on the internet. Dalarna University recommends that both researchers and students publish their work open access.

I give my/we give our consent for full text publishing (freely accessible on the internet, open access): Yes X No ☐

Abstract

Twitter and other microblogging services are a valuable source for near-real-time mining of marketing, public opinion and brand-related consumer information. As such, the collection and analysis of user-generated natural language content is a focus of research on automated sentiment analysis. The most successful approach in the field is supervised machine learning, where the three key problems are data cleaning and transformation, feature generation, and model choice and training parameter selection. Papers in recent years have thoroughly examined the field, and there is agreement that relatively simple techniques, such as a bag-of-words transformation of the text and a Naive Bayes model, can generate acceptable results (F1-scores between 75% and 85% for an average dataset), while fine-tuning can be difficult and yields relatively small improvements.
However, even a few percentage points of performance on a medium-sized dataset can mean thousands of better-classified documents, which can correspond to thousands of missed sales or angry customers in any business domain. This work therefore presents and demonstrates a framework for better-tailored, fine-tuned models for analysing Twitter data. The experiments show that Naive Bayes classifiers with domain-specific stopword selection work best (up to 88% F1-score); however, performance decreases dramatically if the data is unbalanced or the classes are not binary. Filtering stopwords is crucial to increasing prediction performance, and the experiment shows that a stopword set should be domain-specific. The conclusion is that there is no single best way to train models and select stopwords in sentiment analysis. The work therefore suggests that there is room for a comparison framework to fine-tune prediction models to a given problem: such a framework should compare different training settings on the same dataset, so that the best trained models can be found for a given real-life problem.

Keywords

Natural language processing, sentiment analysis, machine learning

Table of contents

1. Introduction
1.1 The aim of this work
2. Previous work in the field
3. Machine learning approach
3.1. Supervised Machine learning
3.2. Bag of words methods
3.3. TF-IDF weighting
4. The dataset
4.1. Word frequency in the dataset
4.2. POS distribution in classes
4.3. Comparison of the predefined classes
5. Building the right training dataset
5.1. Comparing different dataset sizes with model performance metrics
6. Pre-processing
6.1. String Normalization
6.2. Tokenization
6.3. Stopwords
6.4. Stemming / Lemmatisation
6.5. N-gram converting
Infrequent word filtering
6.6. Synonyms
6.7. Part of Speech tagging
7. The experiment
7.1. Classifiers
7.2. Pre-processing datasets
7.3. Comparison matrices
8. Discussion
9. Conclusions and future work
References

1. Introduction

Sentiment analysis is a document classification problem in the domain of natural language processing. In simple terms, sentiment analysis aims to detect the sentiment of a 'subject' of the communication towards an 'object'. As an example, the sentiment of product reviews can be analysed, so that an automated system can classify whether a product review is positive, negative or neutral. In more advanced classification systems the sentiment itself is not a list of classes (such as positive, negative or neutral) or a scale, but rather a multi-dimensional system (such as anger, joy, interest...) in which every dimension can have a value (Snyder and Barzilay, 2007). Also, sentiment analysis is not strictly a classification problem: advanced sentiment analysis problems are often about detecting subjectivity, polarity and subject/object relations in a natural language document. The last problem (subject/object relations) also belongs to the domain of entity detection.

The most challenging problems in automated sentiment analysis are mostly connected with linguistic features of the text that lie above the vocabulary level, for example negation, specific word orders which change meaning, modal verbs and sarcasm.

This work focuses on the classification problem in the domain of sentiment analysis. The classification of natural language documents is a broader problem that appears not only in sentiment analysis. In simple terms, it can be described as automatically adding one or more labels to a document based on an analysis of its content. For example, a classification engine can add 'economy', 'politics' or 'culture' tags to newspaper articles, or an email filtering engine can classify emails as 'spam' or 'important'. This problem is very similar to classifying a product review as 'positive' or 'negative'.
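To make the classification idea concrete, the following is a minimal sketch of a bag-of-words Naive Bayes classifier of the kind mentioned in the abstract. It is an illustration only, not the implementation used in this thesis: the function names (tokenize, train_naive_bayes, predict), the add-one (Laplace) smoothing and the four toy training reviews are assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(documents, labels):
    """Fit a multinomial Naive Bayes model on bag-of-words counts."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocabulary = set()
    for text, label in zip(documents, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {
        "priors": {c: n / len(labels) for c, n in class_doc_counts.items()},
        "word_counts": word_counts,
        "totals": {c: sum(word_counts[c].values()) for c in class_doc_counts},
        "vocabulary": vocabulary,
    }


def predict(model, text):
    """Return the label with the highest log posterior for the text."""
    vocab_size = len(model["vocabulary"])
    scores = {}
    for label, prior in model["priors"].items():
        score = math.log(prior)
        for token in tokenize(text):
            if token not in model["vocabulary"]:
                continue  # ignore words never seen in training
            count = model["word_counts"][label][token]
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((count + 1) / (model["totals"][label] + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy, made-up training reviews (hypothetical, not from the thesis dataset).
train_texts = [
    "great flight and friendly crew",
    "loved the service, great seats",
    "terrible delay and rude staff",
    "lost my luggage, terrible service",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(train_texts, train_labels)
prediction = predict(model, "friendly crew and great seats")
print(prediction)
```

On this toy data the unseen review is labelled 'positive', because words such as 'great' and 'friendly' occur only in the positive training examples. The same structure applies to topic tags or spam filtering: only the labels and the training documents change.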
It is important to mention that there are many ways to classify the sentiment of a document: with a simple binary (positive/negative) system, a 3-class system (positive-neutral-negative), a scale (one-to-ten) or a many-dimensional system. Because classifier algorithms have their own limitations, not every algorithm can be used with every class system. Before selecting a class system, it is therefore important to consider that the choice can restrict which classifiers are available.

The application of sentiment analysis techniques is very wide, and new areas might evolve in the future. Some current domains are:

● Marketing and monitoring brand reputation (the dataset of the current work is a typical example of this). The typical process of collecting data is to set up automated keyword-monitoring processes on the critical channels, and then to apply pre-trained sentiment analysis classifiers to the collected data. The process can help to point out the critical areas in order to increase brand reputation.
● Public Relations management and prediction. Automated techniques can help to find critical media messages early and help to manage brand criticism. Similar to the previous use-case, automated systems can help to find individual critical messages, so the organization can address them early.
● Automated political surveys. Using sentiment analysis tools, high volumes of tweets can be analysed automatically to show the effects of individual public messages.
● Sentiment-based stock price predictions. There are some attempts to build trading systems for stocks (or other goods) based on media and social media message analysis. In this area speed is critical and frequent model re-training can be crucial.
● Customer support. An integrated sentiment analysis engine can help to prioritize messages from 'angry' customers, so help-desk agents can solve their cases first. Advanced customer-support and CRM software already have integrated automated text analysis tools.
This work focuses on a specific kind of natural-language document: short 'tweets'. Twitter is a microblogging service started in 2006 and is currently the most popular of its type. Users share messages of up to 140 characters, so from the sentiment-analysis perspective these observations are rather short and very subjective. This makes it a well-suited platform for sentiment analysis, and a large amount of research works with data gathered from Twitter. According to Wikipedia, 37% of the content is conversational and 40% of the content is 'pointless babble', which also falls under the subjective communication category (Wikipedia, 'Twitter', 2016.05.25.).

1.1 The aim of this work

The analysis of data gathered from Twitter is a relatively well-known area in natural language processing and sentiment analysis, and it has many commercial implementations as well, as was demonstrated in the previous section. For data gathered from Twitter, there seems to be a consensus that machine learning algorithms using even relatively simple feature generation methods can produce results which are usable not only for research problems, but also in production environments with real-life use-cases. Such applications create very valuable information for organizations, so a correct and well-performing implementation is critical.

The amount of data gathered from microblogging services such as Twitter is growing exponentially. Because data gathered using sentiment analysis can have very high business value, even a small improvement in the performance of implementations can create very tangible value. For this reason, this work compares model tuning approaches, as tailoring models to different business use-cases and specific domains is a real need.

In this work, after reviewing the previous work on the topic and pointing out the most effective approaches, the proposed approaches are implemented using the ‘Airlines’ dataset (which is described in The Dataset section).
Another dataset is used in order to make sure the findings are generic enough and not specific to only one dataset. As the contribution of this work, a comparison framework is introduced which demonstrates how to implement and fine-tune models. That said, the research question of this thesis can be formulated as: How can features be selected and the performance of several models be compared in an automated, flexible and quick way for analysing Twitter data?

2. Previous work in the field

Twitter data is well-researched, as Twitter is a very approachable data source for valuable information.
