A comparative study of algorithms for Document Classification

Hugo Moritz

Subject: Information Systems
Corresponds to: 30 hp

Presented: VT2020
Supervisor: David Johnson
Examiner: Andreas Hamfeldt

Department of Informatics and Media

Abstract

In an increasingly digitalised world, companies with e-archive solutions want to use modern methods to develop their business. One such method is to automatically classify the content of documents, commonly by applying machine learning, an approach known as document classification. There is a lack of updated research comparing different machine learning algorithms, in particular on whether more modern methods such as neural networks are better than traditional statistical machine learning methods. The document classification process goes through pre-processing, feature selection, document representation, and training and testing of the classifiers. Five different machine learning methods were implemented, with different stemming and feature selection settings, and the results are presented based on various classification metrics and time consumption. The results show that the neural network classifier has as high accuracy as one of the traditional statistical classifiers, SVM, but at a higher computational time cost. Further studies in the document classification area with other programming languages and libraries may give interesting insight into whether the differences can be determined even more precisely.


Acknowledgements

After studying two years at the Master's programme in Information Systems at Uppsala University, it is with experience and joy I can present this thesis. I would like to thank my supervisor from Uppsala University, David Johnson. He has supported me, provided me with vital knowledge in the machine learning and text classification area during this period, and guaranteed the quality of the thesis. I would also like to thank the people at Ida Infront, who have given me the chance to work at their office and take part in the development. Especially, I would like to thank Richard Johansson and Johnny Hensegård, who have supported me on-site at the company and always been there to help me with the development of the thesis work.

Hugo Moritz Uppsala, 2020-06-01


Content

Abstract
Acknowledgements
List of Figures
Terms & Abbreviations
Introduction
Ida Infront
Motivation
Research problem
Research question
Scope
Theory
Automatic Document Classification
Feature Extraction
Feature Selection
Document representation (Vector representation)
Classification Algorithms
Overfitting
Decision Classifiers
Linear Classifiers (Discriminative Classifiers)
Proximity-based Classifier
Probabilistic Classifiers
Artificial Neural Network Classifiers
Classification performance metrics
Methodology
Chosen text data
Document classification process
Pre-processing
Feature Selection
Document representation
Classifiers
Classification metrics
Result
Stemming
Feature Selection
Discussion
Methodology
Result
Conclusions
Future work
References
Appendix
Stemming – confusion matrixes
Feature Selection – confusion matrixes
Code


List of Figures

Figure 1. Venn diagram of the Text Mining area.
Figure 2. Example of the fundamentals of stemming for Swedish grammar.
Figure 3. Example of a decision tree based on continuous or discrete ordinal data.
Figure 4. The margin of separation for the SVM.
Figure 5. The kNN measurement differences between classes and data points.
Figure 6. Model of a neuron unit for a neural network.
Figure 7. Example of a feed-forward neural network architecture.
Figure 8. Fundamentals of a confusion matrix for multi-class labels.
Figure 9. Predictive Document Classification process.
Figure 10. Predictive Document Classification with Feature Selection task included in the model training phase.
Figure 11. Pre-processing tasks.
Figure 12. Feature Selection.
Figure 13. Bar chart with test accuracy comparison between without stemming and with stemming.
Figure 14. Bar chart with training time without stemming and with stemming.
Figure 15. Confusion matrix for NN Classifier with stemming.
Figure 16. Confusion matrix for NB Classifier with stemming.
Figure 17. Bar chart with test accuracy for filter and embedded method.
Figure 18. Bar chart with training time for filter and embedded method.
Figure 19. Confusion matrix for KNN Classifier with DF filter method feature selection.
Figure 20. Confusion matrix for NB Classifier with DF filter method feature selection.
Figure 21. Confusion matrix for KNN Classifier with SVM embedded method feature selection.
Figure 22. Confusion matrix for NB Classifier with SVM embedded method feature selection.
Figure 23. Confusion matrix for DT Classifier without stemming.
Figure 24. Confusion matrix for NN Classifier without stemming.
Figure 25. Confusion matrix for KNN Classifier without stemming.
Figure 26. Confusion matrix for SVM Classifier without stemming.
Figure 27. Confusion matrix for NB Classifier without stemming.
Figure 28. Confusion matrix for DT Classifier with stemming.
Figure 29. Confusion matrix for KNN Classifier with stemming.
Figure 30. Confusion matrix for SVM Classifier with stemming.
Figure 31. Confusion matrix for DT Classifier with DF filter method feature selection.
Figure 32. Confusion matrix for NN Classifier with DF filter method feature selection.
Figure 33. Confusion matrix for SVM Classifier with DF filter method feature selection.
Figure 34. Confusion matrix for DT Classifier with SVM embedded method feature selection.
Figure 35. Confusion matrix for NN Classifier with SVM embedded method feature selection.
Figure 36. Confusion matrix for SVM Classifier with SVM embedded method feature selection.


Terms & Abbreviations

Document – A media information element that can be extracted from text information.

Over-stemming – Reducing too much of the word in the stemming process.

Tokenisation – Dividing the document unit into components referred to as tokens, which give actual meaning for the process.

Data space – Data that is mapped to a space of its context.

Dot-product – An algebraic operation that computes the sum of the products of two sequences of numbers.

Cosine similarity – A metric that computes the inner product space of two vectors by using the dot-product.

Corpus – A large structured collection of text in a text mining context.

Naturvårdsverket – The Swedish Environmental Protection Agency.

NLTK – The Natural Language Toolkit; libraries and programs for natural language processing.

Hyper-parameters – Parameters that are used to configure a machine learning model and that are not updated during the learning phase.

Lemmatizer – The process of extracting the lemma from a word, identifying its part of speech and its meaning in a sentence or context.

Receiver Operating Characteristic Curve – Plotting the true-positive rate against the false-positive rate for different thresholds.


Introduction

Companies and organisations are striving for digitisation today, to keep up with customers' and users' need for fast and digital information. One area where digitisation has been of great importance is archives: from manually handling documents to storing documents in a digital solution. The benefits of digital storage are easy access and information utilisation on a larger scale. Companies and organisations often have access to some specific e-archive, to access different kinds of documents relevant to the business and organisational function. To make it easier and more efficient to access and read documents in big e-archives, these are usually sorted by metadata. This metadata usually contains the document type and technical data about the specific document. As documents can differ depending on their content, there is a demand for an easy way to categorise the specific content. To be able to categorise the content, it needs to be classified. The classification can be information retrieval from the metadata, manual classification, or an automatic classifier retrieving information from the content. As manually classifying documents can be a time-consuming and inconsistent task, it is usually not beneficial on a larger scale. Instead, Automatic Document Classification is suggested for this kind of task, as it is an automatic process that can be used for larger systems (Goller, Löning, Will, & Wolff, 2000).

Figure 1. Venn diagram of the Text Mining area.

In Automatic Document Classification, the primary approach is to use text classification on the specific data, which refers to the content of the documents. The text classification approach was introduced by Maron (1961) as a way of categorising text into different classes. The approach is well-known when


applied to electronic documents, but also to homepages, natural language documents, etc. Automatic Document Classification, also referred to as document classification, is a software-based approach that falls within the text mining category¹. How it is related is illustrated in Figure 1. The actual classification process is a supervised machine learning and statistical approach which can include a variety of methods². As machine learning and other methods within artificial intelligence have become more popular, the combination of including machine learning methods to solve document classification tasks has grown in popularity. With the evolution from statistical methods to more modern techniques within machine learning, research and application within text mining have also developed, focusing on more complex and in-depth solutions like neural network models (Sebastiani, 2002).

Ida Infront

Ida Infront is an IT company which focuses on simplifying and streamlining business for authorities and administrations within the Nordics. It was founded in 1984 as a part of Linköping University and has since grown within the digitisation area. One of its products is the e-archive iipax archive. It is the most popular product Ida Infront offers, mostly focused towards the public sector. It is an in-house product that offers companies a modern solution to store and handle documents and data for their customised purposes (Ida Infront, n.d.).

The iipax archive does not offer any automatic document classification for categorisation of content, but the company aims to investigate whether it would be useful for their business. With an automatic document classifier, the company's customers would be able to access the documents more easily and structure the e-archive more efficiently. The documents come from government agencies, municipalities, city councils and larger companies. As the intention is to give value to Ida Infront's information systems, the main approach is to develop an automatic classifier.

Motivation

To be able to measure the performance of the classifiers, multiple classifiers focused on the same problem domain are selected for comparison. The performance of the classifiers is mainly determined by the machine learning approach conducted, which will give a result difference between the methods used. This is meant to give an interesting view of how the methods differ with regard to performance in the result. Previous studies within the area include comparisons between different machine learning

¹ A Brief Survey of Text Mining – https://www.researchgate.net/publication/215514577_A_Brief_Survey_of_Text_Mining
² Statistical classification, Wikipedia – https://en.wikipedia.org/wiki/Statistical_classification

methods, usually aimed at investigating whether a new implementation of an approach is suitable. It is therefore interesting and beneficial for the information systems area to provide an update on how the different document classification approaches compare in terms of the algorithmic learning approach. The thesis also aims to give an updated account of the different phases of how document classification is conducted and how they can affect the different approaches.

Research problem

The thesis investigates and compares the different machine learning methods that can be used for document classification. The specific focus is on comparing the differences between statistical and artificial neural network machine learning methods in a document classification process: specifically, artificial neural network algorithms versus conventional machine learning algorithms that do not use artificial neural networks. The study also aims to investigate the document classification process and how it can affect how the machine learning algorithms perform in this context. The performance measurements used are determined based on previous research and on applying it to the context of the classification process.

Research question

• Do artificial neural network algorithms perform better than classical/statistical methods in a document classification context?

Scope

The research includes a comparison of machine learning classifiers with the same pre-processing phases but with different algorithms, so that they can be compared for the intended purpose of investigating performance.

1.5.1. Limitation

Specific influences that are outside the author's control concerning the methodology and how the thesis is conducted.

• Only Swedish text documents will be used as input data for the model that is part of the evaluation and comparison between the algorithms.
• All stages focused on software development are conducted on an HP laptop with 16GB RAM, an i7-6600U CPU (2 cores, 4 logical processors) and Microsoft Windows 10, which can result in restricted computational power. This may affect the comparability of different theories and methods.


• Single-label text categorisation is chosen since multi-label text categorisation is not of specific interest for the company.

1.5.2. Delimitation

Boundaries for the thesis, chosen by the author for convenience or for the specific aim of the research.

• Python libraries will mainly be used as software development frameworks, since Python is well-known for its machine learning applications, and focusing on one programming language will make the work easier to integrate at Ida Infront.

• The Python library scikit-learn will be used for the machine learning computations, since it is well-known and well-documented for classification.

• The machine learning methods used for the classifiers will not be combinations of methods from different areas (ensembles). This was chosen since the aim is to compare the differences between the algorithms based on their individual method characteristics.

1.5.3. Assumption

Assumptions that are explanatory for the intended purpose of the thesis work and its scope.

• The original class labelling of the text data is reliable and correct.


Theory

This section describes the literature for the thesis and the theory that is selected or considered for the method. It includes an explanation of the automatic document classification process, different methods that give a stable and understandable process for developing a classifier, and a collection of possible and recognised machine learning methods for developing and comparing classifiers.

Automatic Document Classification

Automatic Document Classification (also referred to as document categorisation, text categorisation, document classification or text classification) is an inductive learning approach, which is divided into two phases: the learning phase and the classification phase. In the learning phase, the documents chosen for the process are assigned defining topic categories. In the classification phase, documents can be given to a topic classifier that returns a classification topic for the respective document (Goller, Löning, Will, & Wolff, 2000). The classification approach is a machine learning method, as the learning process is considered supervised by the knowledge of the categories (Sebastiani, 2002). Classification techniques suggested for document classification purposes include statistical machine learning and neural networks (Goller, Löning, Will, & Wolff, 2000).

Feature Extraction

The process of document classification consists of different steps that should be followed to achieve a good text classification result. Document extraction is a pre-processing task overlapping with natural language processing. The document extraction process is a feature extraction process which consists of segmenting features (words) from the document, referred to as tokenisation. This process helps to extract the terms and words from the actual original document. It may also be necessary to remove noise from the text data; special characters within the documents are vital to remove, since the format of the text data can differ a lot (Montañés, Fernández, Díaz, Combarro, & Ranilla, 2003).

Another critical task for the feature extraction process is to keep the data relevant by removing unnecessary words or characters (Song, Liu, & Yang, 2005; Aggarwal & Zhai, 2012). This type of extraction in a text classification setting is mostly focused on stopword removal and stemming, which are described in several cross-sectional studies (Forman, 2003; Joachims, 1998; Lai, Xu, Liu, & Zhao, 2015; Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins, 2002; Nigam, Lafferty, & McCallum, 1999; Tong & Koller, 2002; Aggarwal & Zhai, 2012). Stopword removal and removing noisy data can be seen as a straightforward process, as it is not algorithmic, compared to stemming, which is implemented with an algorithmic approach. Stopwords are common or short function words that are not relevant in the text data with regard to classifying the differences between the documents. In fact, including stopwords may give faulty predictions, as stopwords usually have a higher occurrence and may affect the document representation.

Within the text classification research area, the use of and differences between stemming algorithms are mostly discussed in the representational processes. Stemming refers to truncating words to their root stem, which makes it possible to map together words sharing the same root stem. A fundamental example is illustrated in Figure 2. Stemming was first introduced by Lovins (1968), and significant advancements have been achieved since then as stemming algorithms have been further developed. Porter (1997) developed a more aggressive stemming algorithm, which creates more classes. Another approach was developed by Krovetz (2000): a stemming algorithm which is considered a softer approach and does not give as many correlated classes as Porter (1997). In terms of the actual performance of stemming algorithms, Hull (1996) explains that accuracy improves by no more than 1-3 percentage points for larger documents; these conclusions, however, concern English documents. Carlberger, Dalianis, Duneld, & Knutsson (2001) developed a Swedish stemming algorithm which shows an accuracy improvement of up to 15-18 percentage points compared to not using any stemming algorithm, supporting the use of stemming algorithms for languages other than English. Recognising that over-stemming can be an issue with more aggressive approaches, they use a stemmer with only 150 stemming rules. The process runs in four steps, with a maximum of one rule from the rule set applied in each step, resulting in 0-4 rules being applied to each word in the document.

Figure 2. Example of the fundamentals of stemming for Swedish grammar.
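To illustrate the idea, the following is a minimal sketch of Swedish stemming using NLTK's SnowballStemmer (the stemmer later used in the methodology of this thesis); the example words are taken from Table 2 and are illustrative only.

```python
# A minimal sketch of Swedish stemming with NLTK's SnowballStemmer.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("swedish")

# Print each word next to its root stem.
for word in ["avfallsdeponier", "infiltrationsanläggningar", "marken"]:
    print(word, "->", stemmer.stem(word))
```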

Feature Selection

Since vectors of features (words) from documents usually contain features and data that are not always desirable or necessary for the text classification task, it is beneficial to reduce the feature space (Joachims, 1998; Dasgupta, Drineas, Harb, Josifovski, & Mahoney, 2007; Sebastiani, 2002). The process of reducing the dimensionality is called feature selection. It was first touched on by Salton, Wong, & Yang (1975), who reduced the document space into vectors, represented each document as one point in the total document space, and thereby enabled comparison of document similarities; their approach combined feature selection and vector representation. The challenges that make feature selection necessary are suggested by Sebastiani (2002): since the feature space of a document set is high-dimensional, reducing it lowers the heavy computational consumption correlated with classification. Another aspect is that features that are non-relevant for the classification model can misguide the prediction. Noisy or redundant features are also desirable to remove, as they can crowd out relevant variables or simply add computational consumption.

Feature selection is usually categorised into three different types of methods: the filter method, the wrapper method, and the embedded method (Saeys, Inza, & Larrañaga, 2007; Ikonomakis, Kotsiantis, & Tampakas, 2005). Filter methods use statistical measures to achieve a relevant feature selection for the task at hand. Common methods are Document Frequency thresholding (DF), Information Gain (IG), Mutual Information (MI), the Chi-Square Statistic (CHI) and Term Strength (TS), which have been shown to give good term elimination. In terms of performance, the IG, DF and CHI algorithms outperform TS and MI, and the three are equal in terms of accuracy. However, DF is recommended for larger text computations, since it is not as computationally heavy as the other two algorithms (Yang & Pedersen, 1997).

The wrapper method uses classifiers to achieve a reduced feature selection and is part of the inductive learning process for the classifier. It has proven more effective in handling low-dimensional documents with fewer features, whereas the filter method has shown that it can handle more features per attribute with a higher evaluation performance (Forman, 2003; Kohavi & John, 1997). The embedded method, in similarity to the wrapper method, also includes the feature selection process in the learning algorithm. The difference is that the embedded method uses an intrinsic model-building metric in the learning process as well. It uses machine learning methods for selection and can learn the context of the features from their content. Embedded methods usually have the potential to result in better feature selection, since the method applies the whole document corpus in its context. Lal, Chapelle, Weston, & Elisseeff (2006) state that there are advantages and disadvantages to all three methods, and choosing the right feature selection method must be done in the context of the problem domain. Further research has suggested that both the wrapper and the embedded feature selection methods are in general slower and more complex than the filter methods, which makes the filter method more prevalent in academic and commercial domains. The complexity of the other two methods does, however, allow them to achieve a higher performance than the filter methods (Sarkar, Goswami, Agarwal, & Aktar, 2014).
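As a concrete illustration of a filter method, the following hypothetical sketch uses scikit-learn's chi-square (CHI) scoring to keep only the highest-scoring features; the corpus and labels are invented placeholders, not the thesis data.

```python
# Filter-method feature selection with the chi-square statistic.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ["grundvatten och avfall", "beslut om tillstånd", "avfall och deponi"]
labels = [0, 1, 0]

X = TfidfVectorizer().fit_transform(corpus)
# Keep only the k features that score highest against the class labels.
X_reduced = SelectKBest(chi2, k=2).fit_transform(X, labels)
print(X.shape, "->", X_reduced.shape)
```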

Document representation (Vector representation)

For the machine learning algorithms to be able to interpret the documents, the extracted features need to be vectorised, with some interpretation of how they should be prioritised. The representation consists of a distribution of vectors over the features. By adding representational weighting to the features, the words in the documents are given actual meaning in their context. There are different ways of approaching document representation with different weights; weighting schemes make use of how the ordering of the words in the documents is constructed (Agarwal & Mittal, 2014). Ringuette & Lewis (1994) and Manning, Raghavan, & Schütze (2008) explain that a standard method is simply to ignore the ordering of the words, which is referred to as Bag of Words (BoW), focusing only on the absence and presence of words. The representation extracts unigrams (one term) and does not include semantics, POS tags, syntax, bigrams (two terms) or trigrams (three terms). Another method, which can be based on the use of n-grams, is the TF-IDF method. It is the most common document representation method and has mostly been used in its original form (Kim, Seo, Cho, & Kang, 2019). The algorithm weights the term frequency of the words in the documents combined with the inverse document frequency, which determines how infrequently a word occurs across documents; the purpose is to single out common words that are non-relevant (Salton & Buckley, 1988; Manning, Raghavan, & Schütze, 2008).
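As a worked toy example of the weighting (the counts below are invented for illustration; actual implementations such as scikit-learn add smoothing terms and normalisation):

```latex
% A common form of the TF-IDF weighting, where N is the number of
% documents and df(t) is the number of documents containing term t:
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}

% Invented example: N = 100 documents, term t occurs 3 times in
% document d and appears in 10 of the 100 documents:
\mathrm{tfidf}(t, d) = 3 \times \log \frac{100}{10} = 3 \times \log 10
```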

Classification Algorithms

As many different techniques can be used for text classification, there are multiple machine learning algorithms available. A document classifier is a function that maps an input attribute vector of words from the document to a predicted class (category) that correlates to the input. To be able to determine the class, the classifier needs training to learn from the target variable (also referred to as the labelled input in text classification). By using machine learning to train the model, and testing the model in the process, a classifier is created that can predict new data. The most important techniques and commonly used classifiers to study for text classification are Decision Tree Classifiers, Pattern Rule-based Classifiers, Support Vector Machine Classifiers, Regression-based Classifiers, Neural Network Classifiers, the Naïve Bayes Classifier, the Nearest Neighbour Classifier, and the Rocchio Method. Other classifiers that are not as commonly used but still well-known in the area include the Genetic Algorithm-Based Classifier, the Ensemble Classifier, and the Maximum Entropy Model (Aggarwal & Zhai, 2012; Dumais, Platt, Heckerman, & Sahami, 1998; Sebastiani, 2002).


Overfitting

One common problem when applying machine learning methods is overfitting. It describes the development of a model that corresponds too closely or exactly to the input data. Common reasons are that the dimensionality of the data is high or that the model is fitted too closely to the training data. Insufficient data can also make the model overfit, as the trained classifier is not generalised enough: the training set contains few variations of the data, so the model is applied too specifically to it, while the test data contains more information (Elite Data Science, 2020). There are numerous ways to reduce overfitting. Commonly used methods are tuning the machine learning model's hyperparameters, adding data, and reducing dimensionality. There are also other model-specific measures depending on the machine learning method.

Decision Classifiers

Decision classifiers are based on rule-sets that can determine the classification of a set of words. The most common approaches are Decision Tree Classifiers and Pattern Rule-based Classifiers.

Decision trees use a hierarchical approach to the data space of the text data, splitting it depending on conditions and predicates, usually the presence or absence of one or more words. This creates a tree structure of nodes and leaves over the data space. The splitting is done recursively until a set minimum number of records, or a certain class purity, has been reached. To reduce overfitting, some nodes may be pruned from the tree if they would fit the classification too closely to the data (Quinlan, 1986). A simple version of a decision tree is illustrated in Figure 3.


Figure 3. Example of a decision tree based on continuous or discrete ordinal data.

Single-attribute splitting is a method that uses the presence or absence of words or phrases at the nodes to determine the split. The C5 algorithm (successor of C4.5) and the DT-min10 algorithm are two commonly used decision tree algorithms for single-attribute splits (Quinlan, 1986).

Similarity-based and discriminant-based multi-attribute splits use linear splitting over combinations of attributes, splitting the data by word clusters in the documents. The documents are ranked on similar values or discriminants and split further into groups of clusters (Triantaphyllou, 2009; Quinlan, 1986). Multi-attribute splitting is typically more accurate and produces smaller decision trees than single-attribute splitting. However, Triantaphyllou (2009) describes that the linear combinations at the nodes can be difficult to interpret and understand, since multi-attribute splitting is usually made by linear splits. Algorithms usually implemented with the decision tree model are the C5 algorithm and ID3, as they are easily adapted to text classification (Jensen, Neville, & Gallagher, 2004).

A pattern rule-based classifier uses a rule set to determine the class of the words. The most commonly used pattern rule-based classifier is the decision rule classifier. It is similar to the decision tree model, as both are based on rule decisions. However, decision trees use a hierarchical approach, while decision rule classifiers can overlap in the decision space. Through the training phase, a set of rules is created in Disjunctive Normal Form, and each rule is then compared with the document and the presence of the words in the rule. The goal is that each document is covered by at least one rule in the extraction of the data space. The disadvantage is that the overlapping can make it difficult to determine which rule should decide the outcome, which can lead to inconsistency. Rule-based decision classifiers are despite this used in practical scenarios, as it is easy to manually add rules and maintain interpretability (Johnson, Oles, Zhang, & Goetz, 2002; Apté, Damerau, & Weiss, 1994).

Linear Classifiers (Discriminative Classifiers)

Linear classifiers are discriminative classifiers, which means that they classify the data based on the differences between the classes and discriminate between them, rather than modelling examples of the classes (Jurafsky & Martin, 2019). The class is determined by the characteristics of a linear combination of the feature values. Linear classifiers are known for solving document classification problems, as they can easily handle problems that include many features (Yuan, Ho, & Lin, 2012). The mathematical definition of a linear classifier is given in (1).

$$p = \bar{A} \cdot \bar{X} + b \tag{1}$$

Here $p$ is the prediction, $\bar{X} = (x_1 \ldots x_n)$ is the normalised document word frequency vector, $\bar{A} = (a_1 \ldots a_n)$ is the vector of linear coefficients, and $b$ is a scalar (Jurafsky & Martin, 2019). The basic idea of the Support Vector Machine (SVM) is to separate the search space using the separation that provides the largest distance to the data points. The largest distance determines the clearest separation between the classes, called the margin of separation, which is fundamentally illustrated in Figure 4 (Cortes & Vapnik, 1995).

Figure 4. The margin of separation for the SVM.

Since the SVM method takes advantage of determining the combination of features for the data space, it is well-known to be suitable and effective for working with text data. Moreover, the high dimensionality involved in the text classification process makes the SVM classifier suitable, as it can organise the categories with a simple linear method (Joachims, 1997; Raghavan & Allan, 2007). A disadvantage of the SVM method is that it performs worse with many parameters. One approach to making the SVM optimal for this problem is to reduce the parameters, for example by only using the lower and upper bounds of the word occurrence set (Joachims, 2001).

An implementation of an e-mail spam detector with different statistical machine learning methods has shown that the SVM method can outperform a decision tree, a rule-based classifier and the Rocchio method in terms of classification performance (Drucker, Wu, & Vapnik, 1999). Moreover, a non-linear SVM classifier has been shown to outperform the linear SVM classifier for text classification purposes; however, the training process of the non-linear SVM classifier is challenging to distribute (Pranckevičius & Marcinkevičius, 2017).

Another linear machine learning approach is regression. Regression methods are mainly used for continuous quantitative values rather than qualitative ones, but for classification purposes it is possible to make use of binary values, which can be seen as a direct and traditional use of regression for text classification. One robust regression technique that can be used for text classification is the Linear Least Squares Fit (LLSF), although linear regression is not the most natural choice, since it is typically used for numerical attributes (Yang & Chute, 1994). Logistic regression is considered a generalised linear model and produces a binary classifier model. It uses a sigmoid function in its calculation to predict the outcome, mapping the sum of the weighted features from the data to a [0,1] interval for each of the categories, so that the total sums to a probability of 1 (Jurafsky & Martin, 2019). Pranckevičius & Marcinkevičius (2017) note that SVM and logistic regression classifiers are the most popular and accurate classifiers for multi-class classification tasks.
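As a toy illustration of the two linear classifiers discussed above (not the thesis implementation; the corpus and labels are invented), the following sketch fits both on the same TF-IDF features.

```python
# Linear classifiers on toy TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

corpus = ["ansökan om tillstånd", "beslut om avslag",
          "tillstånd beviljas", "avslag på ansökan"]
labels = ["permit", "rejection", "permit", "rejection"]

X = TfidfVectorizer().fit_transform(corpus)

# LinearSVC maximises the margin of separation; LogisticRegression maps
# the weighted feature sum through a sigmoid to a [0, 1] probability.
svm = LinearSVC().fit(X, labels)
logreg = LogisticRegression().fit(X, labels)
print(svm.predict(X[:1]), logreg.predict_proba(X[:1]))
```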

Proximity-based Classifier

Proximity-based classifiers base their classification techniques on distance measures over the data space. In a text classification context, the technique is based on measuring the dot-product and the cosine similarity. These measurements can determine document similarities and can therefore be used to measure the distance to other documents. One of the most common proximity-based classifiers is the k-Nearest Neighbour (kNN) classifier. It is most commonly used for clustering but can also be used in a text classification domain (Sebastiani, 2002).

The algorithm was first introduced by Cover & Hart (1967), who describe it as a lazy learning algorithm that applies a non-parametric approach. The kNN algorithm assigns categories by comparing the input's similarity to that of its neighbours (Yang & Pedersen, 1997). The idea is that it determines the distance from a centred point, as illustrated in Figure 5, where the green and blue symbols are different data points and the red border marks the determined boundary between classes.


Figure 5. The kNN measurement differences between classes and data points.

The three most common distance functions to use with kNN for continuous values are the Euclidean distance (2), the Manhattan distance (3), and the Minkowski distance (4). With a set value of $k$, the distances between $x$ and $y$ are measured (Zhang & Zhou, 2007).

$$\sqrt{\sum_{i=1}^{k} (x_i - y_i)^2} \tag{2}$$

$$\sum_{i=1}^{k} |x_i - y_i| \tag{3}$$

$$\left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q} \tag{4}$$
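The three functions translate directly into code; a minimal NumPy sketch (illustrative, not from the thesis) is shown below.

```python
# Direct NumPy translations of the distance functions (2)-(4),
# where k is the dimensionality of the vectors x and y.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))          # equation (2)

def manhattan(x, y):
    return np.sum(np.abs(x - y))                  # equation (3)

def minkowski(x, y, q):
    return np.sum(np.abs(x - y) ** q) ** (1 / q)  # equation (4)

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(euclidean(x, y), manhattan(x, y), minkowski(x, y, 3))
```

Note that with $q = 1$ and $q = 2$, the Minkowski distance reduces to the Manhattan and Euclidean distances, respectively.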

The kNN classifier has shown high performance compared to other proximity-based classifiers such as the Rocchio method. However, the lack of research on domain problems is considered a disadvantage of the classifier; one area where it has not been researched enough is the evaluation of scalability (Yang, 1999). Further disadvantages are described by Jiang, Pang, Wu, & Kuang (2012), who point to its huge text similarity computation, which may affect its implementation in business applications.

Probabilistic Classifiers

The most commonly referenced probabilistic classifier is the Naïve Bayes classifier, which is well-known for classification purposes (Nigam, McCallum, Thrun, & Mitchell, 2000). It is based on the interpretation of Bayes' theorem with strict (naïve) independence assumptions between features. It tries to predict the class that is most likely to have generated a specific example using Bayes' rule, the calculation of conditional probabilities (5), where A refers to a class and B to a document.

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)} \tag{5}$$

There are three different types of models for the Naïve Bayes classifier; the most common ones are Multi-variate Bernoulli Naïve Bayes, Multinomial Naïve Bayes, and Gaussian Naïve Bayes (Manning, Raghavan, & Schütze, 2008). According to McCallum & Nigam (1998), the models most commonly used for document classification purposes are the Multi-variate Bernoulli and the Multinomial model. The Multi-variate Bernoulli model approaches the domain with the bag-of-words distribution, as it focuses on the presence and absence of features, while the Multinomial model also includes frequency in its feature distribution. McCallum & Nigam (1998) suggest that the Multinomial model performs better than the Bernoulli model both in academic research tests and in real-world application problems.

Naïve Bayes classifiers have previously shown good results in terms of classification accuracy, combined with a fast feature training phase and the ability to scale over larger datasets. However, they are often outperformed by the SVM classifier, even though newer implementations of the Naïve Bayes classifier can achieve significant accuracy levels (Narayanan, Arora, & Bhatia, 2013).
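A small sketch of the two Naïve Bayes variants discussed above, on an invented toy corpus: BernoulliNB binarises the counts internally and thereby models presence/absence, while MultinomialNB uses the frequencies.

```python
# Multinomial vs. Bernoulli Naive Bayes on toy word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

corpus = ["avfall deponi avfall", "tillstånd beslut",
          "deponi avfall", "beslut tillstånd beslut"]
labels = ["waste", "permit", "waste", "permit"]

X = CountVectorizer().fit_transform(corpus)
print(MultinomialNB().fit(X, labels).predict(X[:1]))
print(BernoulliNB().fit(X, labels).predict(X[:1]))  # binarises counts internally
```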

Artificial Neural Network Classifiers

The first research proposing artificial neural networks (ANN) in an information retrieval setting was Belew (1989), who suggested a learning algorithm referred to as a connectionist network with weighted links. The concept of an ANN is a network consisting of neurons that are computed from weighted variables and the input variables, as illustrated in Figure 6. The sum of the weighted inputs is passed through an activation function, and the output is the input to the next layer or, in the last layer, the output value of the model.


Figure 6. Model of a neuron unit for a neural network.

The simplest form of an ANN is the perceptron model, which is considered a linear model of a neural network. However, the more general concept in the text classification area is a non-linear model architecture, which does not depend on a single linear separation between the classes. The architectural model for non-linear classification tasks is the multi-layer neural network. The complexity of these networks enables back-propagation, the process of propagating backwards through the layers of the network model to set the optimal weights by updating an error function (Aggarwal & Zhai, 2012). Document classification using a back-propagation neural network was investigated by Ruiz & Srinivasan (1997), who compared it to a counter-propagation neural network. They use a feed-forward neural network, a straightforward architecture consisting of an input layer, hidden layers and an output layer, illustrated in Figure 7. The authors also suggest that ANNs could be an essential tool in automatic text classification.

Figure 7. Example of a feed-forward neural network architecture.


Within sentiment classification, Moraes, Valiati, & Neto (2013) have shown that ANN classifiers give slightly better document classification results than the SVM, and outperform an NB classifier. However, the comparison between ANN and SVM also showed that the ANN was more sensitive to noisy terms. The research suggests that, with a focus on running time, the ANN is worth considering, since a large number of support vectors is used by the SVM; in terms of computational training cost, however, the ANN has a higher time consumption than the SVM. The research also states that the feature selection method may have affected the results between the ANN and SVM slightly, since the number of terms given as input can make a difference to the time and computational cost. Classifiers used in a text classification context, like logistic regression, Naïve Bayes and SVM, have also previously shown data sparsity problems, which is not the case for neural network classifiers (Lai, Xu, Liu, & Zhao, 2015). However, the data sparsity problem has been investigated by Allison, Guthrie, & Guthrie (2006), who have shown that it can often be reduced by adding training data. Bengio, Courville, & Vincent (2013) point out that addressing the data sparsity problem is also a matter of choosing a relevant feature selection and vector representation method, as this can reduce the vector sizing problem prominently.

One of the challenges with feed-forward neural networks is the vanishing gradient problem: when the gradient becomes too small, the values of the weights are no longer updated and the learning phase may stall. One approach to solving this is to use an LSTM model architecture, which defines memory cells that can cope with the input from the previous cell of the prior layer; the memory can then act as a safeguard for the network (Hochreiter & Bengio, 2001; Hochreiter & Schmidhuber, 1997). Further research from Adhikari, Ram, Tang, & Lin (2019) shows evidence that a variant of the LSTM, a BiLSTM classifier, can outperform various CNN and HAN classifiers with more complex neural network architectures. However, they point out that, beyond their own study, there is a lack of research on whether LSTM classifiers for document classification have a higher performance than other classifiers in the same domain. Also, Lipton & Steinhardt (2019) explain that one problem area within machine learning research is the proposal of complex architecture models: such models often require extensive training to function, and can be difficult to apply to real-world problems beyond the academic setting. Since there are multiple models within the neural network context, the area quickly becomes complex (Veen, 2016). Mohammed, Shi, & Lin (2018) and Lipton & Steinhardt (2019) show evidence that simple neural network models can provide results just as high, within small margins, as complex and more computationally heavy neural network models.
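In line with that last observation, a simple feed-forward network is easy to sketch; the following uses scikit-learn's MLPClassifier with one hidden layer on an invented toy corpus, not the configuration used in this thesis.

```python
# A minimal feed-forward (multi-layer perceptron) network sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

corpus = ["ansökan om tillstånd", "beslut om avslag",
          "tillstånd beviljas", "avslag på ansökan"]
labels = ["permit", "rejection", "permit", "rejection"]

X = TfidfVectorizer().fit_transform(corpus)
# Back-propagation updates the hidden-layer weights during fit().
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500).fit(X, labels)
print(clf.predict(X[:1]))
```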


Classification performance metrics

Within prediction analysis, multiple classification metrics are useful for showing how well the machine learning models have interpreted the data and created a classifier. Hossin & Sulaiman (2015) explain several commonly used metrics that have often been used with excellent results. For the metrics, $T$ = true, $F$ = false, $P$ = positive and $N$ = negative are used to denote the binary values for the different class labels. The combinations of class labels used in the metrics are $TN$ = true negative, $TP$ = true positive, $FN$ = false negative and $FP$ = false positive.

• Accuracy. The most used metric, regardless of whether the task is binary or multiclass classification. It is the percentage of correct predictions out of the total instances on the test data. The opposite of accuracy is the error rate. Sunasra (2017) suggests that accuracy should not be used to a great extent in cases where most of the target variable data is from one class (6).

$$Accuracy = \frac{T}{Total} \tag{6}$$

• Precision. The proportion of predicted positive values that are true positives (7).

$$Precision = \frac{TP}{TP + FP} \tag{7}$$

• Recall. The proportion of actual positives that are correctly classified (8).

$$Recall = \frac{TP}{TP + FN} \tag{8}$$

• F1-score. Represents a mean value of precision and recall (9). Sunasra (2017) argues that the F1-score is a good way to capture precision and recall in one value. However, it can also give an unjustified value if one of the precision or recall values is an outlier.

$$F1\text{-}score = 2 \times \frac{precision \times recall}{precision + recall} \tag{9}$$


The accuracy, precision, recall and F1-score are computed as weighted values, returning the average weighted by the proportion of each class label in the data set.

• Confusion matrix. Another useful way of displaying the classification data, used for binary or multiclass classification tasks where the labels give binary predictions. It produces a table which visualises the predicted positive and negative results of the classes on one axis and the actual positive or negative results on the other. This gives a good overview of the specific result and is useful for analysing particular cases where the classifier has predicted in a certain way. The fundamentals of a confusion matrix for a multi-class labelling task are described in Figure 8.

Figure 8. Fundamentals of a confusion matrix for multi-class labels.

Another way to look at how a classifier model performs is computational and time cost. He & Sun (2015) and Frank, Drikakis, & Charissis (2020) explain the difference a machine learning model with a low computational time makes when scaling up the amount of data. With lower time and computational cost, a machine learning model with a lower learning rate can still be beneficial in a real-world problem.
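The weighted metrics and the confusion matrix above map directly onto scikit-learn; a small sketch with invented predictions:

```python
# Weighted classification metrics and a confusion matrix for a toy
# multi-class prediction (labels are invented).
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]

print(accuracy_score(y_true, y_pred))
# 'weighted' averages each class metric by its proportion in the data set.
print(precision_recall_fscore_support(y_true, y_pred, average="weighted"))
print(confusion_matrix(y_true, y_pred, labels=["A", "B", "C"]))
```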


Methodology

This section explains the method of choice, informed by the theory literature but also by the practical development of the classifiers, in order to enable a quality comparison. The methodological differences between the classifiers are explained through the process of development, as well as through the evaluation of the classifiers.

Chosen text data

The text data chosen as the sample for the task came from Naturvårdsverket, an organisation that is one of the users of Ida Infront's iipax archive. The content included was ensured to provide enough text data to enable a good classification result. The data provided consisted of the content of the documents and class labels for the documents; in total there were 1938 documents. Table 1 presents information about the document text data. The table measures the data as the length of each document in tokens, using the Python string datatype; this refers to features that have not yet been tokenised and pre-processed. Max describes the document with the largest number of features, min the document with the fewest features, mean the average number of features over all documents, and sum the total number of features in all documents.

Table 1. Data about the documents included.

Number of tokens
Max | 1 441 668
Min | 91
Mean | 148 732
Sum | 288 243 143

Document classification process

The process of developing a classifier that can be evaluated in its context is displayed in Figure 9. The process starts with text data that is labelled through some sort of clustering phase or manual process; the labels are used to train the predictive algorithmic model at a later stage. The first step is to pre-process the text data content to give structure to the corpus. Feature selection then structures the relevant features and reduces the dimensionality of the structured corpus; the feature selection process can also be included in the training phase of the machine learning model, as mentioned by Lal, Chapelle, Weston, & Elisseeff (2006) and illustrated in Figure 10. Document representation is applied to the selected features and documents to give meaning to their differences, presence, absence and occurrence by setting a value weighting. The predictive algorithm is the machine learning method that trains on the features of the documents and their applied labels, and develops a classifier based on the machine learning method of choice. The classifier is then tested on a test data set, and the data from the training and testing process is evaluated.

Figure 9. Predictive Document Classification process

Figure 10. Predictive Document Classification with Feature Selection task included in the model training phase.

Pre-processing

To understand the process of automatic classification in the context of using Swedish documents and comparing the machine learning methods, the first step was to develop a general pre-processing stage for the classifiers. In automatic document classification, as Goller, Löning, Will, & Wolff (2000) mention, the first process of topic learning is assigning the documents. For the different classification process tasks (feature selection, document representation, and training and testing of the machine learning application), scikit-learn was used (Buitinck, 2013). For handling the text data, containers from the Python libraries pandas³ and numpy⁴ were used; they were chosen because they are easy to use and provide efficient data handling.

³ pandas version 1.0.3 – https://pandas.pydata.org/
⁴ numpy version 1.18.0 – https://numpy.org/

3.3.1. Feature Extraction

Figure 11. Pre-processing tasks

A general function was developed that tokenised the text data to extract the word and term features. The difference between the actual tokenisation of the features and the cleaning of the data is illustrated in Figure 11. The tokenisation function was set as the tokenizer parameter of the document representation input, so that the text data for each document was extracted in the same process. Each document was split in the function using the NLTK corpus library: sent_tokenize split the document into sentences, and wordpunct_tokenize then split these into features (NLTK-Project, 2020). The tokenisation function looped through the tokens and applied different steps to each token:

• Set all characters to lowercase to be able to match the features for representation purposes, using Python's built-in lower().
• Strip tokens of special characters at the beginning and end, using Python's built-in regular expression module re; the regular expression removed non-alphabetic characters.
• If a token still contains special characters, do not add the token; the regular expression checked whether it contained non-alphabetic characters.
• If the token is a stopword, do not add it to the extracted text data container; this checked whether the token was in the combined set of stopwords.

The basic stopwords were collected from the NLTK corpus library (NLTK-Project, 2020), which includes Swedish stopwords. Stopwords from appropriate GitHub repositories⁵ ⁶ were also added to cover the corpus. Since the documents include many words that do not add meaning for the categorisation itself but could affect the result, further stopword lists were implemented: stopwords for cities in Sweden and for common Swedish first and last names were added to the stopword check. Cities were added because the documents often contain information about regions, so city and municipality names could mislead the classification. Since the documents contain author names, in some cases on every page, it was determined that names should also be included in the stopword list.

• Check that the token length is larger than 1.
• Stem the token.

The stemming functionality that was added was chosen based on the stemming model developed by Carlberger, Dalianis, Duneld, & Knutsson (2001). The chosen stemmer is part of NLTK⁷ and is named SnowballStemmer. It follows the suggestion from Carlberger, Dalianis, Duneld, & Knutsson (2001) of a softer stemmer with a small set of suffix rules to avoid over-stemming. To investigate whether the stemmer boosts the performance of the classifiers, testing was done both with and without the stemmer. The method that gave the best performance in the results was then used for the further parameter and functionality testing.
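Putting the steps above together, a sketch of what the described tokenisation function could look like is shown below; the stopword set is a placeholder for the combined lists, and the exact regular expressions in the thesis code may differ.

```python
# Sketch of the described pre-processing steps: lowercasing, stripping
# special characters, stopword filtering, length check, and stemming.
# Requires the NLTK 'punkt' models (nltk.download('punkt')).
import re
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, wordpunct_tokenize

stemmer = SnowballStemmer("swedish")
stopwords = {"och", "om", "se"}  # placeholder for the combined stopword sets

def tokenize(document):
    tokens = []
    for sentence in sent_tokenize(document, language="swedish"):
        for token in wordpunct_tokenize(sentence):
            token = token.lower()
            # Strip non-alphabetic characters at the beginning and end.
            token = re.sub(r"^[^a-zåäö]+|[^a-zåäö]+$", "", token)
            # Drop tokens that still contain non-alphabetic characters.
            if not token or re.search(r"[^a-zåäö]", token):
                continue
            # Drop stopwords and single-character tokens.
            if token in stopwords or len(token) <= 1:
                continue
            tokens.append(stemmer.stem(token))
    return tokens
```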

After the feature extraction, 520 176 features were extracted from the content, out of the total 288 243 143 elements in the original text data from the documents. The extracted features are approximately 0.18% of the total text data content. Table 2 below shows examples of the different feature extraction functionality used to clean the features. An empty cell means that the token was ignored.

Table 2. Example of before and after tokenization

Before data cleaning | After data cleaning
. |
). |
2 |
avfallsdeponier | avfallsdeponi
avskiljs | avskilj
dagvatteninfiltration | dagvatteninfiltration
Förordningen | förordning
föroreningsrisk | föroreningsrisk
grundvattensynpunkt | grundvattensynpunk
infiltrationsanläggningar | infiltrationsanläggning
marken | mark
nedan |
om |
parasiterna | parasit
se |
snabbt | snabbt
stallgödsel | stallgödsel
strandbeteszoner | strandbeteszon

⁵ https://github.com/peterdalle/svensktext
⁶ https://github.com/stopwords-iso/stopwords-sv
⁷ https://www.nltk.org/_modules/nltk/stem/snowball.html

Feature Selection

The feature selection process is a dimensionality reduction process that compresses the features to extract valuable data for the classifier. Features and data that are not desirable or necessary for the classifier should be discarded, as described by Joachims (1998), Dasgupta, Drineas, Harb, Josifovski, & Mahoney (2007) and Sebastiani (2002). The importance of deciding on a relevant feature selection and vector representation method is also motivated by Bengio, Courville, & Vincent (2013), who add that it can reduce the vector sizing problem prominently.

Figure 12 illustrates the three feature selection categories described by Saeys, Inza, & Larrañaga (2007) and Ikonomakis, Kotsiantis, & Tampakas (2005). According to Forman (2003) and Kohavi & John (1997), the wrapper method is more effective for text documents with fewer features and less effective for documents with many features. The wrapper method would in that sense not be a suitable choice for the feature selection in this task, since the vast majority of the documents contain many features. Yang & Pedersen (1997) argue that the filter method can achieve high performance with the Document Frequency (DF) feature selection technique for text data of high dimensionality; although it does not have the same performance potential as the wrapper and embedded methods according to Sarkar, Goswami, Agarwal, & Aktar (2014), it also does not have the same high computational cost. Furthermore, as Lal, Chapelle, Weston, & Elisseeff (2006) add, the embedded method can potentially achieve high performance for the task, but it has shown an undesirable computational cost and time, which is not the case for the filter method. Therefore, development was done with both an embedded method and a filter method. From the filter methods, DF was chosen. Since the embedded method is part of the training of the classifier (see Figure 10), a commonly used machine learning method that can handle text data well was considered. Both neural networks and SVM have been commonly used in previous research and could be suitable for the task; SVM was chosen since it has shown lower computational cost and time, and is well-known for its good approaches to handling text data, as mentioned by Joachims (1997).

The DF filter method was applied by setting a parameter of the document representation functionality: by setting the minimum document frequency via the min_df parameter of sklearn.feature_extraction.text.TfidfVectorizer, the dimensionality is reduced based on document frequency (Buitinck, 2013).

The SVM embedded method was applied using sklearn.feature_selection.SelectFromModel with the default parameters threshold=None, prefit=False, norm_order=1 and max_features=None. The sklearn.svm.LinearSVC model (SVM) was initialised as the estimator parameter, with the penalty parameter set to l2 and dual set to False. This process was used in a pipeline via sklearn.pipeline.Pipeline (Buitinck, 2013).
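A sketch consistent with the settings described above is shown below: LinearSVC as the estimator inside SelectFromModel, chained in a Pipeline. The final classifier here is an arbitrary placeholder, not the thesis choice.

```python
# Embedded feature selection with an SVM, inside a pipeline.
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([
    ("select", SelectFromModel(LinearSVC(penalty="l2", dual=False))),
    ("classify", MultinomialNB()),  # placeholder downstream classifier
])
# pipe.fit(X_train, y_train) would select features using the SVM weights
# before training the downstream classifier on the reduced feature set.
```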

Figure 12. Feature Selection

Document representation

Representing the features in a way that lets the machine learning process differentiate their importance is critical for the document classification process, as described by Agarwal & Mittal (2014). The most straightforward approach, BOW, was considered for the task. However, as Ringuette & Lewis (1994) and Manning, Raghavan, & Schütze (2008) describe, it does not represent the differences between the documents as well as other document representation alternatives do. Since TF-IDF is regarded as the most common approach for text classification purposes and can be considered reliable, as stated by Kim, Seo, Cho, & Kang (2019), that method was used for the feature representation of the documents.

The TF-IDF method was implemented with sklearn.feature_extraction.text.TfidfVectorizer. The parameters set were the tokenizer (the input for the tokenization functionality) and ngram_range, which was set to unigrams (1, 1); depending on the feature selection, the min_df parameter was set to either 1 or 3. The other parameters were left at their defaults: input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', max_features=None, vocabulary=None, binary=False, dtype=numpy.float64, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False. Some of the default parameters became irrelevant because of the separate tokenization functionality. The weightings were fitted to the corpus by using the fit method of the TfidfVectorizer (Buitinck, 2013).
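A minimal sketch of this configuration; the tokenize function is a hypothetical stand-in for the thesis's separate tokenization functionality:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    # Hypothetical placeholder for the separate tokenization functionality
    # (pre-processing, stopword removal and optional stemming).
    return text.split()

# Unigram TF-IDF; min_df=1 means no DF filtering, min_df=3 applies the DF filter.
vectorizer = TfidfVectorizer(tokenizer=tokenize, ngram_range=(1, 1), min_df=1)
# vectorizer.fit(corpus)  # learns the vocabulary and IDF weights from the corpus
```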

Since the classifier needs to be tested at a later stage, a test set needs to be produced. Therefore, one key step is to split the data into training and test data. The sklearn.model_selection.train_test_split function splits the data into random train and test subsets. The default value gives a 75/25 split, meaning that 75% of the data is used for training and 25% for testing (Buitinck, 2013).
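A minimal sketch of the split; documents and labels are placeholders for the corpus texts and their classes:

```python
from sklearn.model_selection import train_test_split

documents = ["text one", "text two", "text three", "text four"]  # placeholder corpus
labels = ["A", "B", "A", "B"]                                    # placeholder classes

# test_size=0.25 reproduces the default 75/25 train/test split described above.
X_train, X_test, y_train, y_test = train_test_split(
    documents, labels, test_size=0.25, random_state=42
)
```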

To return a feature matrix from the generated process, the transform method was used. It is applied to the training data and returns a feature matrix containing the document id, the token id and the weighting for that specific token in the document. An example sample set is illustrated in Table 3, and a sketch of producing such triples follows the table.

Table 3. Example set of the TF-IDF Document Representation

(Document ID, Token ID)   Weighting
(1, 68)                   0.006687185127
(1, 65)                   0.001671796282
(1, 64)                   0.001671796282
(1, 62)                   0.001348794714
(1, 61)                   0.003343592564
(1, 43)                   0.001671796282
(1, 34)                   0.006743973568
(1, 26)                   0.003343592564
(1, 23)                   0.001348794714
(1, 7)                    0.001348794714
(1, 1)                    0.005576339201
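A minimal sketch of how such triples can be listed, continuing the vectorizer and split sketched above:

```python
# Fit on the training texts, then transform them into a sparse TF-IDF matrix.
vectorizer.fit(X_train)
X_train_tfidf = vectorizer.transform(X_train)

# COO format exposes parallel row/column/value arrays, which map directly to
# the (document id, token id) -> weighting triples of Table 3.
coo = X_train_tfidf.tocoo()
for doc_id, token_id, weight in zip(coo.row, coo.col, coo.data):
    print(f"({doc_id}, {token_id})\t{weight:.12f}")
```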


Classifiers

When selecting and implementing the machine learning classifier models, five different classifiers were chosen for comparison. Aggarwal & Zhai (2012), Dumais, Platt, Heckerman, & Sahami (1998) and Sebastiani (2002) mention the variety of classifiers that can be implemented. The five classifiers chosen for the classification task were intended to serve the investigative purpose of the research. To give a broad comparison, they were chosen based on diversity and on suggestions in the literature regarding performance. The machine learning models chosen and developed for the study were Decision Tree, SVM, KNN, Naïve Bayes, and ANN.

3.6.1. Applying Machine Learning methods

The training and test process itself was done by developing a training functionality that took as a parameter the machine learning model to be used for training the classifier. Since the feature selection with the embedded method was executed in the same phase, the sklearn.pipeline.Pipeline functionality was used. The pipelining functionality helps structure the machine learning workflow and can improve computational efficiency (Brownlee, 2016). By using the fit method of the respective machine learning model, the classifier was trained with the training data by passing the content and the labels. A sketch of such a training functionality follows below.
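A minimal sketch of such a training functionality; the function name and signature are assumptions, not the thesis code:

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def train(model, X_train, y_train):
    """Train `model`, here behind the SVM embedded feature selection step."""
    pipeline = Pipeline([
        ("select", SelectFromModel(LinearSVC(penalty="l2", dual=False))),
        ("clf", model),
    ])
    pipeline.fit(X_train, y_train)  # X_train: TF-IDF matrix, y_train: class labels
    return pipeline
```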

To tune each model towards its most optimised settings in the training phase, a hyperparameter optimiser was used. This ensured that the performance of the different classifiers would be as optimised as possible for the document classification task and helped prevent the overfitting that an untuned model could result in. Convergence and computational cost and time were reviewed as well. The parameters were chosen so that the tuning would not be unequal between methods and give skewed results. The hyperparameter optimisation was done with the sklearn.model_selection.GridSearchCV functionality, with the default parameters scoring=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False. The parameter n_jobs=-1 was set so that the Python kernels work with all available processors, param_grid was set to the parameter grid for the specific model, and the estimator parameter was assigned the specific model to test (Buitinck, 2013). A generic sketch of this tuning step is given below; the per-classifier grids follow in the next subsections.
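A minimal generic sketch of the tuning step; the tune helper is a hypothetical name:

```python
from sklearn.model_selection import GridSearchCV

def tune(estimator, param_grid, X_train, y_train):
    """Exhaustively search param_grid; n_jobs=-1 uses all available processors."""
    search = GridSearchCV(estimator=estimator, param_grid=param_grid, n_jobs=-1)
    search.fit(X_train, y_train)
    return search.best_params_, search.best_estimator_
```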

3.6.2. Decision Tree (DT)

The decision tree classifier was chosen from the decision-based classifiers. The other suggested decision-based classifier that could have been used was the decision rule-based classifier. Since Decision Tree classifiers have been used more in the relevant text classification literature (Drucker,


Wu, & Vapnik, 1999), there is more reason to choose the Decision Tree. The overlapping rule sets of decision rule-based classifiers can lead to inconsistency in a text classification context, as Johnson, Oles, Zhang, & Goetz (2002) and Apté, Damerau, & Weiss (1994) describe. Hence, it is more reliable to use the Decision Tree, which has also proven suitable for text classification tasks according to Jensen, Neville, & Gallagher (2004).

To determine the optimised hyperparameters for the decision tree, Jordan (2017) suggests that the most essential variables to look at when approaching decision tree models are the minimum sample split and the maximum allowed depth. The parameters passed to the sklearn.model_selection.GridSearchCV functionality were min_samples_split=range(10,500,20) and max_depth=range(1,20,2); both ranges indicate the start value, the end value and the step interval. The estimator algorithm used as a decision tree was sklearn.tree.DecisionTreeClassifier. The default parameters were criterion='gini', splitter='best', min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort='deprecated', ccp_alpha=0.0. After running the optimisation for the decision tree, its parameters were accordingly (a sketch of the search follows the list):

• min_samples_split=60
• max_depth=11
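A minimal sketch of this search, assuming X_train and y_train from the earlier split:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "min_samples_split": range(10, 500, 20),
    "max_depth": range(1, 20, 2),
}
search = GridSearchCV(DecisionTreeClassifier(), param_grid, n_jobs=-1)
# search.fit(X_train, y_train)  # reported optimum: min_samples_split=60, max_depth=11
```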

3.6.3. Support Vector Machine (SVM)

SVM classifiers have previously shown consistently high performance in classification tasks. Drucker, Wu, & Vapnik (1999) mention how the SVM has outperformed other classifiers and that it is an established method for text classification tasks. Pranckevičius & Marcinkevičius (2017) note that the linear classifiers logistic regression and SVM have proven to have the highest performance in text classification tasks and are accordingly valid classifiers to use. Since the logistic regression classifier is mostly focused on numerical classification tasks rather than text classification tasks (Yang & Chute, 1994), the SVM classifier seemed more reliable for the comparison. Furthermore, as Jurafsky & Martin (2019) describe, logistic regression is most suitable for binary classification tasks, and using it for a multi-class classification task may introduce uncertainty.

To determine the optimised hyperparameters for the SVM, Dawson (2019) suggests that the most important variables to look at when approaching SVM models are the kernel, the regularisation (C) and gamma. Since the SVM was already approached with a linear kernel, that parameter was fixed. The parameters passed to the sklearn.model_selection.GridSearchCV functionality were C=range(1,20) and gamma=[1,0.1,0.01,0.001]. The estimator algorithm used as the SVM was

sklearn.svm.SVC with the kernel parameter set to 'linear'. The default parameters were degree=3, coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None. After running the optimisation for the SVM, its parameters were accordingly (a sketch of the search follows the list):

• C=4
• gamma=0.1
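A minimal sketch of this search; note that gamma has no effect with a linear kernel, so that part of the grid mainly adds search time:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": range(1, 20),
    "gamma": [1, 0.1, 0.01, 0.001],  # ignored by the linear kernel
}
search = GridSearchCV(SVC(kernel="linear"), param_grid, n_jobs=-1)
# search.fit(X_train, y_train)  # reported optimum: C=4, gamma=0.1
```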

3.6.4. k-Nearest Neighbour (kNN)

The extensive use of proximity-based classifiers in machine learning makes it reasonable to include the kNN classifier to add diversity to the comparison of the classification models. Since kNN often outperforms other proximity-based classifiers according to Yang (1999), kNN was chosen as a classifier model for the text classification purpose.

To determine the optimised hyperparameters for kNN, Martulandi (2019) suggests that the most essential variables to look at when approaching the kNN model are the Minkowski value (p), which selects either the Manhattan distance (p=1) or the Euclidean distance (p=2), and the weighting scheme, which was therefore included in the search. Another vital variable mentioned is the n_neighbors parameter, which is the number of neighbours used in the distance-based vote. The parameters passed to the sklearn.model_selection.GridSearchCV functionality were n_neighbors=range(1,31) and weights=['uniform','distance']. The estimator algorithm used as kNN was sklearn.neighbors.KNeighborsClassifier. The default parameters were algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None. After running the optimisation for the kNN, its parameters were accordingly (a sketch of the search follows the list):


• n_neighbors=4
• p=2
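A minimal sketch of this search:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "n_neighbors": range(1, 31),
    "weights": ["uniform", "distance"],
}
# The default p=2 gives Euclidean distance; p=1 would give Manhattan distance.
search = GridSearchCV(KNeighborsClassifier(), param_grid, n_jobs=-1)
# search.fit(X_train, y_train)  # reported optimum: n_neighbors=4, p=2
```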

3.6.5. Naïve Bayes (NB)

The Naïve Bayes machine learning method for classification has previously been shown to be suitable for this kind of task and is widely used, as suggested by Nigam, McCallum, Thrun, & Mitchell (2000). Research has shown that the SVM classification method can outperform Naïve Bayes for text classification purposes, as Narayanan, Arora, & Bhatia (2013) describe. However, for the diversity needed for a better comparison of the classifiers, and since Narayanan, Arora, & Bhatia (2013) explain that Naïve Bayes has a fast training phase, the NB classifier was chosen for the classification task. Since McCallum & Nigam (1998) outline that the Multinomial model is more suitable for text classification tasks than the other well-known Naïve Bayes variants, the Multinomial model was specifically chosen as the NB classifier model.

Since the Naïve Bayes model has few hyperparameters, which are not considered vital to tune according to Yiu (2019), no hyperparameter optimisation was performed for the Naïve Bayes model in this classification task. The multinomial NB model used was sklearn.naive_bayes.MultinomialNB with the default parameters alpha=1.0, fit_prior=True, class_prior=None.
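A minimal sketch of the model with its defaults written out:

```python
from sklearn.naive_bayes import MultinomialNB

# No hyperparameter tuning; the default Laplace smoothing (alpha=1.0) is kept.
nb = MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
# nb.fit(X_train, y_train)
```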

3.6.6. Artificial Neural Network (ANN)

The artificial neural network machine learning method for classification has shown promising results in previous research, and neural networks are often used in modern text classification applications, as explained by Ruiz & Srinivasan (1997). There are many different neural network architectures to choose from when developing a machine learning model, as Veen (2016) visualises and describes, so the selection of a network structure can be complicated. Furthermore, Lipton & Steinhardt (2019) argue that complex models need a lot of training to achieve good results, which should be weighed against the computational cost. Hence, the choice of a simple model also supports the arguments of Mohammed, Shi, & Lin (2018) and Lipton & Steinhardt (2019) that simple neural network architectures can achieve performance as good as complex architectures. Since the classification task is not considered especially complex, the chosen architecture was a feed-forward network, which Ruiz & Srinivasan (1997) also describe as a straightforward network. The specific feed-forward type chosen is a multilayer perceptron network.

To determine the optimised hyperparameters for the neural network, Agrawal (2019) suggests multiple parameters, supported by previous research and by articles related to the scikit-learn library. To strike a balance where the network is optimised without resorting to excessive optimisation runs, the neural network hyperparameters can be divided into two categories: the optimiser hyperparameters and the model hyperparameters. For the optimiser hyperparameters, there are multiple suggestions, such as the learning rate, batch size and dropout, which are commonly used parameters. However, Buitinck (2013) states that scikit-learn does not support dropout, and since Dawson (2019) mentions the regularisation value as an important optimiser parameter against overfitting, regularisation was chosen. For the model hyperparameters, the hidden layer is the parameter with the most effect on the model according to Agrawal (2019). The two parameters suggested as most important to tune are therefore the regularisation value (alpha) and the hidden layer size. The parameters passed to the


sklearn.model_selection.GridSearchCV functionality were therefore hidden_layer_sizes=[(1, ),(50, ),(100, )] in a first run, with (10, ) as the result. A second search was therefore run with hidden_layer_sizes=[(2, ),(3, ),(4, ),(5, ),(6, ),(7, ),(8, ),(9, ),(10, ),(11, ),(12, ),(13, )] and alpha=[0.01,0.001,0.0001,0.00001]. The estimator algorithm used as the NN was sklearn.neural_network.MLPClassifier. The default parameters were activation='relu', solver='adam', batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10, max_fun=15000. After running the optimisation for the neural network, its parameters were accordingly (a sketch of the second search follows the list):

• alpha=0.01
• hidden_layer_sizes=(8, )
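A minimal sketch of the second search:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "hidden_layer_sizes": [(n,) for n in range(2, 14)],  # (2,) up to (13,)
    "alpha": [0.01, 0.001, 0.0001, 0.00001],             # L2 regularisation term
}
search = GridSearchCV(MLPClassifier(), param_grid, n_jobs=-1)
# search.fit(X_train, y_train)  # reported optimum: alpha=0.01, hidden_layer_sizes=(8,)
```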

Classification metrics

To be able to evaluate and measure the performance of the classifiers, different classification metrics were used for the task. The most critical metrics according to Hossin & Sulaiman (2015) were:

• Confusion matrix
• Accuracy
• Precision
• Recall
• F1-score

Some of the metrics have complementary counterparts that were not included in the result, as they were not considered necessary for interpreting the outcome. The computational time for the training phase was also added as a metric. Since the testing time did not differ much between the machine learning methods, it was not included in the result metrics.

The classification metrics and the time consumption were computed after the training of the classifier. When the classifier was trained, a prediction for the test data was produced using the predict method of the specific machine learning model. The predictions were used, together with the true labels, as input to sklearn.metrics.classification_report, which produces a text report with multiple classification metrics. To include the computation time, the built-in timeit timer was used: by taking the start time before the fit method and the end time after it, the training time was calculated. To be able to detect skewed data in the true versus predicted labels, a functionality for plotting a confusion matrix was implemented with sklearn.metrics.plot_confusion_matrix. A sketch of this evaluation step follows below.
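A minimal sketch of the evaluation step; the evaluate helper is a hypothetical name, and plot_confusion_matrix is the function named in the text (deprecated in newer scikit-learn versions in favour of ConfusionMatrixDisplay):

```python
import timeit

from sklearn.metrics import classification_report, plot_confusion_matrix

def evaluate(model, X_train, y_train, X_test, y_test):
    # Time only the training phase, as in the Train time column of the tables.
    start = timeit.default_timer()
    model.fit(X_train, y_train)
    train_time = timeit.default_timer() - start

    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))  # precision, recall, F1, accuracy
    plot_confusion_matrix(model, X_test, y_test)  # true versus predicted labels
    return train_time
```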


Result

This section describes the result of the classification process and the results of the different machine learning methods in terms of performance and computational time. The section includes the results with and without stemming, and with the different feature selection methods.

Stemming

This section presents the document classification process with only stemming differences in the feature extraction process, without feature selection. The purpose of the section is to show whether the classifiers perform better with or without stemming.

The table below presents the classifiers' detailed classification metrics for the execution of the classification process: test accuracy, train accuracy, precision, recall, F1-score and training time.

Table 4. Overview table of the performance metrics with and without stemming

Result from classification with and without stemming functionality

                   Classifier   Test accuracy   Train accuracy   Precision   Recall   F1-score   Train time [MM:SS.SSS]
Without stemming   DT           0.85            0.93             0.85        0.85     0.85       00:06.222
                   NN           0.95            1.00             0.95        0.95     0.95       11:44.899
                   KNN          0.88            1.00             0.90        0.88     0.88       00:00.030
                   SVM          0.94            1.00             0.95        0.94     0.94       00:40.062
                   NB           0.51            0.60             0.68        0.51     0.40       00:00.185
Stemming           DT           0.82            0.90             0.83        0.82     0.81       00:06.367
                   NN           0.97            1.00             0.97        0.97     0.97       11:44.152
                   KNN          0.88            1.00             0.90        0.88     0.88       00:00.036
                   SVM          0.97            1.00             0.97        0.97     0.97       00:41.008
                   NB           0.52            0.59             0.46        0.52     0.42       00:00.188



Figure 13. Bar chart comparing test accuracy with and without stemming


Figure 14. Bar chart comparing training time with and without stemming

4.1.1. Stemming

The classifiers' true labels versus predicted labels are presented in confusion matrices for the test with stemming. They show which class was predicted for each document and whether it matches the true class label.


Figure 15. Confusion matrix for NN Classifier (MLPClassifier) with stemming
Figure 16. Confusion matrix for NB Classifier (MultinomialNB) with stemming

Feature Selection

This section presents the document classification process with different feature selection processes, with stemming applied in the feature extraction process. The feature selection methods compared are DF (filter method) and SVM (embedded method). The purpose of the section is to present the classification performance for the different feature selection methods, and also to compare the performance of the classifiers. The computational time for the feature selection methods is not compared, since they are implemented in different parts of the classification process (Figure 9 and Figure 10).

The table below presents the classifiers' detailed classification metrics for the execution of the classification process: test accuracy, train accuracy, precision, recall, F1-score and training time.

Table 5. Overview table of the performance metrics for different feature selections

Result from classification with different feature selection methods

Feature Selection   Classifier   Test accuracy   Train accuracy   Precision   Recall   F1-score   Train time [MM:SS.SSS]
DF                  DT           0.87            0.93             0.87        0.87     0.87       00:04.148
                    NN           0.95            1.00             0.96        0.95     0.95       04:38.093
                    KNN          0.90            1.00             0.91        0.90     0.90       00:00.028
                    SVM          0.97            1.00             0.97        0.97     0.97       00:34.763
                    NB           0.61            0.63             0.74        0.61     0.51       00:00.118
SVM                 DT           0.86            0.91             0.87        0.86     0.87       00:11.357
                    NN           0.96            1.00             0.96        0.96     0.96       02:55.393
                    KNN          0.60            1.00             0.89        0.60     0.65       00:07.513
                    SVM          0.97            1.00             0.97        0.97     0.97       00:40.617
                    NB           0.70            0.74             0.72        0.70     0.65       00:07.567


Figure 17. Bar chart with test accuracy for filter and embedded method.


Figure 18. Bar chart with training time for filter and embedded method.


4.2.1. Filter method (DF)

The classifiers' true labels versus predicted labels are presented in confusion matrices for the test with the DF filter method feature selection. They show which class was predicted for each document and whether it matches the true class label. After feature selection (with stemming), 171 123 features were selected from the feature set with the DF filter method. The minimum document frequency value was set to 3.

Figure 19. Confusion matrix for KNN Classifier (KNeighborsClassifier) with DF filter method feature selection.
Figure 20. Confusion matrix for NB Classifier (MultinomialNB) with DF filter method feature selection.

4.2.2. Embedded method (SVM)

The classifiers' true labels versus predicted labels are presented in confusion matrices for the test with the SVM embedded method feature selection. They show which class was predicted for each document and whether it matches the true class label. Since the method is implemented in the learning process, the number of selected features is not predetermined.

Figure 21. Confusion matrix for KNN Classifier (KNeighborsClassifier) with SVM embedded method feature selection.
Figure 22. Confusion matrix for NB Classifier (MultinomialNB) with SVM embedded method feature selection.


Discussion

This section analyses the result of the classification process, the performance of the classifiers and how the results reflect on the applied document classification process. It also discusses the strengths and weaknesses of the chosen methodology and how to interpret it.

Methodology

The methodology used for the thesis followed the practice of a document classification process, and the methods used aimed to follow approaches from previous research. The stemming method described by Carlberger, Dalianis, Duneld, & Knutsson (2001), which corresponds to the stemmer provided by NLTK, is not the only stemmer available for the purpose. Whether there are better stemmers was not part of the initial research, but it is an interesting aspect that could yield other results. Using a lemmatizer instead of, or in addition to, a stemmer is one change to the methodology that could have produced further results; open-source code on GitHub provides a lemmatizer for Swedish.8 Given the risk that stemming does not provide more accuracy, as mentioned by Hull (1996), the question is whether a lemmatizer could produce something more beneficial for the process. It could plausibly achieve better or worse accuracy for some of the classifiers individually, not just for the overall accuracy result.

The feature selection methods provide a decent comparison between two fundamentally different techniques. However, the differences also make it difficult to interpret the individual differences between the methods. Moreover, the methods cannot be analysed by comparing the classification metrics in a fair measurement, since they are applied in different phases of the process. One such aspect is the computational cost and time of the feature selection methods, and how these would differ is not presented.

Furthermore, the document representation, which is an essential part of the process, was covered by only one method, the TF-IDF approach. As no comparison was made with other techniques, it is difficult to say whether it is the highest-performing approach. Since it can be considered the most common approach and has previously been reliable for document classification tasks, as mentioned by Kim, Seo, Cho, & Kang (2019), it should at least provide a good representation. However, an interesting aspect would be to compare it to a more modern approach. One that has increased in popularity in recent years is doc2vec (based on word2vec), which Kim, Seo, Cho, & Kang (2019) also mention as a high-performance method.

8 Lemmatizer for Swedish and Danish - https://github.com/sorenlind/lemmy

The classifiers chosen for the classification tasks provided a broad result. It is unlikely that some high-performance classifier was left out, since established research has consistently used the methods presented in the theory section. The logistic regression classifier has also proven good for classification tasks, as Pranckevičius & Marcinkevičius (2017) suggest; however, since the evidence also favours the SVM, using multiple linear classifiers would not add diversity. An aspect of the methodology that can be discussed is whether the frameworks and libraries used for the implementation provided the most optimised result. For example, grid search (GridSearchCV) is not the only parameter optimisation tool; as it simply iterates through the provided parameter settings, there may be better optimisation libraries for the task. Another aspect is whether the scikit-learn library provided the most optimal settings. For neural networks, for example, there are multiple libraries where more settings and parameters could be tweaked and interpreted in a more advanced way, such as TensorFlow9 or Keras10.

The classification metrics used have been common in previous research and documentation. However, a metric that could be useful but was not included, since it is not mentioned as much in previous research, is the area under the Receiver Operating Characteristic curve (ROC AUC). Considering that ROC is usually computed for binary class problems rather than multiclass or multilabel tasks, this may be one of the reasons it is not mentioned as widely as the other classification metrics. Scikit-learn provides a metric for the area under the ROC curve.11 An interesting aspect would be to see whether it would have given any new information about the classifiers; a sketch of such a computation follows below.
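A minimal sketch of such a computation on synthetic stand-in data (the thesis corpus is not reproduced); the one-vs-rest reduction requires class-membership scores, which for SVC means setting probability=True:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic three-class stand-in data.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SVC(kernel="linear", probability=True).fit(X_train, y_train)
y_score = model.predict_proba(X_test)  # shape: (n_samples, n_classes)
print(roc_auc_score(y_test, y_score, multi_class="ovr"))  # macro-averaged OvR AUC
```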

Result

5.2.1. Document classification process

Many different aspects of the thesis results can be analysed. The document classification process was tested with different configurations and methods. Section 4.1 shows the results of using stemming: in Table 4, Figure 13, and Figure 14, stemming improves the highest-performing machine learning methods, SVM and NN, in terms of the test accuracy

9 TensorFlow, an open-source Python platform for machine learning - https://www.tensorflow.org/
10 Keras, a Python library built on top of TensorFlow - https://keras.io/
11 ROC AUC score metric - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html


metrics. Slight to no changes for NB and KNN can also be observed; DT is the only method performing worse with stemming. The training accuracy shows signs of overfitting both with and without stemming, although with stemming the overfitting is slightly smaller. Although the result does not show improvements of 15-18% as mentioned by Carlberger, Dalianis, Duneld, & Knutsson (2001), the stemmer still shows improvements for the majority of the classifiers. Hull (1996) instead reports that the improvements with a stemmer are about 1-3%, and an improvement of that magnitude can be seen in Table 4 for this thesis as well.

Despite the low test accuracy scores, the comparison of train and test accuracy with stemming also shows signs of reduced overfitting. This indicates an improvement, and the reason is that the stemmer generalises the features; the classifier therefore learns more by generalisation instead of learning from specific cases, which is part of a stemmer's intended purpose. One reason the stemmer does not reach 15-18% accuracy improvements could be the noisy data the documents contain. Even though stopword removal and noise reduction are applied in the feature extraction process, there is a risk that irrelevant words pass through. However, combining stemming with feature selection does not provide any significant improvement (comparing Table 4 and Table 5) and in some cases gives slightly worse results. Hence, it is less likely that the feature words are irrelevant, since one purpose of the feature selection process is to remove irrelevant features. Also, comparing the result with previous research can be difficult, because previous work usually uses datasets with more distinct and clean data, since the purpose is to demonstrate new developments within the text mining research area; commonly, newsgroup datasets are used, for example in Labani, Moradi, Ahmadizar, & Jalili (2018).

The interpretation of the feature selection process aims to understand whether it improves the result of the document classification process in performance and time consumption. The feature selection process is, as Joachims (1998), Dasgupta, Drineas, Harb, Josifovski, & Mahoney (2007) and Sebastiani (2002) describe, an essential process for making sure the data is relevant and not of higher dimensionality than necessary. However, the feature selection process does not provide much improvement in the result. Comparing the stemming results in Table 4 with the SVM and DF methods in Table 5, there are only slight improvements for the classifiers, except for the NB classifier, which shows a clear improvement in the classification metrics. Comparing Figure 16, Figure 20 and Figure 22, it is possible to distinguish a pattern: the NB classifier usually predicts the same class for the majority of its predictions. The higher accuracy of the NB classifier with the feature selection methods could be because it is skewed towards another class with a larger number of label instances in the dataset. Since it is still possible to see from Figure 20 and Figure 22


that the classifier predicts the same class, the classifier does not exhibit correct classification behaviour even with feature selection. Another interesting aspect is that KNN with the embedded feature selection method performs much worse than with the filter method. Even though it has a high precision score, the recall and F1-score are much lower, which means that it predicts some specific classes entirely wrong. It can be observed in the confusion matrix in Figure 21 that it wrongly predicts multiple categories as class 1. Since the KNN classifier is a proximity-based method most popular within clustering, it has disadvantages, as mentioned by Yang (1999). The lack of research on domain-specific problems for KNN could, in this case, indicate that these faulty predictions are caused by the combination of the SVM embedded method and this particular dataset.

A possible explanation for the limited effect of the feature selection process could be the lack of sufficient text data, since the improvements in test accuracy are not distinct while the training accuracy remains at the same high levels. It is also possible that the feature selection does not exclude any new irrelevant features, because they have already been removed as stopwords, or that the classifiers can easily distinguish the information anyway. The documents contain a large amount of text, which may make it easier for the classifiers to identify the relevant terms. Since the categorically important terms occur frequently in a document, it may be easier for the classifier to distinguish them, compared to using smaller news articles as text data, which contain fewer relevant words.

5.2.2. Machine learning methods performance

The differences in the classifiers' findings are quite distinct and significant. The considerable differences in the classifiers' performance metrics and in the computational time for training helped clarify which classifiers are the best performing for a document classification problem in this context. Comparing the classification metrics in Table 4, Table 5, Figure 13, and Figure 17 shows that the SVM and NN outperform the rest of the classifiers by a significant margin. The test accuracy ranges from 95-97% for NN and 94-97% for SVM. This result confirms the suggestions of Pranckevičius & Marcinkevičius (2017) and Moraes, Valiati, & Neto (2013) that the SVM and NN can outperform other machine learning methods in a document classification setting. Whether more data would enable the neural network to achieve even better classification results is difficult to estimate for this problem. The DT and KNN classifiers perform slightly worse, and previous research gives no confirmation that they could reach the same performance as the SVM and NN. The NB classifier's performance, which is the worst of the tested classifiers, shows in the confusion matrices that it does not provide convincing results for a reliable classifier, as it predicts the same class for every label.

The findings from the comparison study of an NN and an SVM classifier by Moraes, Valiati, & Neto (2013) are similar to the thesis result: the neural network model has a much higher computational time cost than the SVM classifier and the rest of the classical/statistical methods. This could be because the NN classifier uses backpropagation, as described by Ruiz & Srinivasan (1997), so training the classifier usually takes longer, but it could also be due to the use of the scikit-learn library. The other classifiers have a low computational time cost compared to the neural network, and in all cases even lower than the SVM. An interesting aspect of the findings is how the classifiers would cope with more documents and data, and how a larger scale would affect the training time.


Conclusions

This section concludes the thesis work on the document classification process and the analysis of the machine learning methods' performance. It also adds suggestions for future work and perspectives on the text mining and document classification areas for upcoming research, based on the thesis and reflections in general.

The intention and purpose of the thesis was to investigate how the neural network method would perform against traditional/classic statistical machine learning methods in a document classification context applied to Swedish documents. Compared with the studied methods in the traditional/classic statistical area, the neural network outperforms most of the classifiers, as is clearly shown by the test accuracy results. However, the SVM classifier performs at the same high level of test accuracy and on the other performance metrics as well, and it even slightly outperforms the NN when feature selection is applied to the document classification process. With reference to performance, the NN and the SVM show equally high results within the margin of error. Another question in this domain is whether the NN could achieve better results with more input text documents, as it is known for achieving better accuracy when provided with more data because of its backpropagation. Concerning time consumption, the NN classifier shows significantly higher time consumption than the SVM. In the context of the thesis, the SVM therefore performs better when the classification metrics and the time consumption are considered together. It is difficult to state whether the use of plain text data affected the interpretation and result; however, since the feature extraction discarded a lot of the original text data, the corpus seems reliable.

The work gives a better frame of reference for how traditional/classic methods for document classification are still relevant for classifying new problem domains, and shows that they can provide strong results in a new language context. Additionally, it shows that neural network machine learning methods achieve high accuracy and have the potential to provide strong results for future document classification tasks.

Future work

There are different methods to vary within the research context of document classification, and investigating them could contribute essential research comparing the machine learning algorithms. Another aspect is the document classification process itself. Research on stemming focusing on Swedish text data would be desirable, since the lack of updated research in this domain for Swedish documents is quite significant, as referred to in Carlberger, Dalianis, Duneld, & Knutsson (2001); such research would also be useful for lemmatization. Since the data used in the thesis does

not necessarily provide enough documents, it would be useful to see how the methods in the thesis would perform on a larger scale. Text classification tasks comparing the machine learning methods using other libraries and programming languages than the ones used in the thesis are also a research aspect that could provide new discoveries and interesting results for the document classification area.


References

Adhikari, A., Ram, A., Tang, R., & Lin, J. (2019). Rethinking complex neural network architectures for document classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4046-4051).

Agarwal, B., & Mittal, N. (2014). Text classification using machine learning methods-a survey. In Proceedings of the Second International Conference on Soft Computing for Problem Solving (SocProS 2012), December 28-30, 2012 (pp. 701-709). Springer, New Delhi.

Aggarwal, C. C., & Zhai, C. (Eds.). (2012). Mining text data. Springer Science & Business Media.

Agrawal, S. (2019). Hyperparameters in deep learning. Retrieved from Towards Data Science: https://towardsdatascience.com/hyperparameters-in-deep-learning-927f7b2084dd

Allison, B., Guthrie, D., & Guthrie, L. (2006, September). Another look at the data sparsity problem. In International Conference on Text, Speech and Dialogue (pp. 327-334). Springer, Berlin, Heidelberg.

Apté, C., Damerau, F., & Weiss, S. M. (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems (TOIS), 12(3), 233-251.

Khan, A., Baharudin, B., Lee, L. H., & Khan, K. (2010). A review of machine learning algorithms for text-documents classification. Journal of advances in information technology, 1(1), 4-20.

Belew, R. K. (1989). Adaptive information retrieval: Using a connectionist representation to retrieve and learn about documents. In Proceedings of the 12th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 11-20).

Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8), 1798-1828.

Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2), 245-271.

Brownlee, J. (2016). Automate Machine Learning Workflows with Pipelines in Python and scikit-learn. Retrieved from Machine Learning Mastery: https://machinelearningmastery.com/automate-machine- learning-workflows-pipelines-python-scikit-learn/


Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., & Layton, R. (2013). API design for machine learning software: experiences from the scikit-learn project. arXiv preprint arXiv:1309.0238.

Carlberger, J., Dalianis, H., Duneld, M., & Knutsson, O. (2001). Improving precision in information retrieval for Swedish using stemming. In Proceedings of the 13th Nordic Conference of Computational Linguistics (NODALIDA 2001).

Chatterjee, S. (2019). A Comprehensive Study of Linear vs Logistic Regression to refresh the Basics. Retrieved from Medium - Towards Data Science: https://towardsdatascience.com/a-comprehensive- study-of-linear-vs-logistic-regression-to-refresh-the-basics-7e526c1d3ebe

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3), 273-297.

Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1), 21-27.

Dasgupta, A., Drineas, P., Harb, B., Josifovski, V., & Mahoney, M. W. (2007). Feature selection methods for text classification. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 230-239).

Dawson, C. (2019). SVM Parameter Tuning. Retrieved from Towards Data Science: https://towardsdatascience.com/svm-hyper-parameter-tuning-using-gridsearchcv-49c0bc55ce29

Drucker, H., Wu, D., & Vapnik, V. N. (1999). Support vector machines for spam categorization. IEEE Transactions on Neural networks, 10(5), 1048-1054.

Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of the seventh international conference on Information and knowledge management (pp. 148-155).

Fallgren, P., Segeblad, J., & Kuhlmann, M. (2016). Towards a standard dataset of swedish word vectors. In Sixth Swedish Language Technology Conference (SLTC), Umeå 17-18 nov 2016.

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of machine learning research, 3(Mar), 1289-1305.

Frank, M., Drikakis, D., & Charissis, V. (2020). Machine-learning methods for computational science and engineering. Computation, 8(1), 15.


Gao, W., Hu, L., Zhang, P., & Wang, F. (2018). Feature selection by integrating two groups of feature evaluation criteria. Expert Systems with Applications, 110, 11-19.

Goller, C., Löning, J., Will, T., & Wolff, W. (2000). Automatic Document Classification-A thorough Evaluation of various Methods. ISI, 2000(2), 145-162.

Guo, Y., Chung, F., & Li, G. (2016). An ensemble embedded feature selection method for multi-label clinical text classification. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 823-826). IEEE.

He, K., & Sun, J. (2015). Convolutional neural networks at constrained time cost. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5353-5360).

Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735- 1780.

Hossin, M., & Sulaiman, M. N. (2015). A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process, 5(2), 1.

Hull, D. A. (1996). Stemming algorithms: A case study for detailed evaluation. Journal of the American Society for Information Science, 47(1), 70-84.

Ida Infront. (n.d.). Retrieved February 9, 2020, from E-arkiv, e-arkivering för myndigheter & företag: https://www.idainfront.se/losningar/e-arkiv/

Ikonomakis, M., Kotsiantis, S., & Tampakas, V. (2005). Text classification using machine learning techniques. WSEAS transactions on computers, 4(8), 966-974.

Jensen, D., Neville, J., & Gallagher, B. (2004). Why collective inference improves relational classification. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 593-598).

Jiang, S., Pang, G., Wu, M., & Kuang, L. (2012). An improved K-nearest-neighbor algorithm for text categorization. Expert Systems with Applications, 39(1), 1503-1509.

Joachims, T. (1997). A probabilistic analysis of the rocchio algorithm with tfidf for text categorization. Dans les actes de ICML’97: Proceedings of the Fourteenth International Conference on Machine Learning, San Francisco, CA, USA, 143–151.


Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In European conference on machine learning (pp. 137-142). Springer, Berlin, Heidelberg.

Joachims, T. (2001). A statistical learning learning model of text classification for support vector machines. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 128-136).

Johnson, D. E., Oles, F. J., Zhang, T., & Goetz, T. (2002). A decision-tree-based symbolic rule induction system for text categorization. IBM Systems Journal, 41(3), 428-437.

Jordan, J. (2017). Hyperparameter tuning for machine learning models. Retrieved from Jeremy Jordan: https://www.jeremyjordan.me/hyperparameter-tuning/

Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition. Prentice Hall, New Jersey.

Kim, D., Seo, D., Cho, S., & Kang, P. (2019). Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec. Information Sciences, 477, 15-29.

Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial intelligence, 97(1-2), 273-324.

Korde, V., & Mahender, C. N. (2012). Text classification and classifiers: A survey. International Journal of Artificial Intelligence & Applications, 3(2), 85.

Krovetz, R. (2000). Viewing morphology as an inference process. Artificial intelligence, 118(1-2), 277- 294.

Labani, M., Moradi, P., Ahmadizar, F., & Jalili, M. (2018). A novel multivariate filter method for feature selection in text classification problems. Engineering Applications of Artificial Intelligence, 70, 25-37.

Lai, S., Xu, L., Liu, K., & Zhao, J. (2015). Recurrent convolutional neural networks for text classification. In Twenty-ninth AAAI conference on artificial intelligence.

Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (Eds.). (2006). Feature extraction: foundations and applications.

Lau, J. H., & Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368.

Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In International conference on machine learning (pp. 1188-1196).

Lipton, Z. C., & Steinhardt, J. (2019). Troubling trends in machine learning scholarship. Queue, 17(1), 45-77.

Lodhi, H., Shawe-Taylor, J., Cristianini, N., & Watkins, C. J. (2001). Text classification using string kernels. In Advances in neural information processing systems (pp. 563-569).

Lovins, J. B. (1968). Development of a stemming algorithm. Mech. Transl. Comput. Linguistics, 11(1- 2), 22-31.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Scoring, term weighting and the vector space model. Introduction to information retrieval, 100, 2-4.

Maron, M. E. (1961). Automatic indexing: an experimental inquiry. Journal of the ACM (JACM), 8(3), 404-417.

Martulandi, A. (2019). K-Nearest Neighbors in Python + Hyperparameters Tuning. Retrieved from Medium - Data Driven Investor: https://medium.com/datadriveninvestor/k-nearest-neighbors-in- python-hyperparameters-tuning-716734bc557f

McCallum, A., & Nigam, K. (1998). A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization (Vol. 752, No. 1, pp. 41-48).

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Yih, W. T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 746-751).

Mohammed, S., Shi, P., & Lin, J. (2017). Strong baselines for simple question answering over knowledge graphs with and without neural networks. arXiv preprint arXiv:1712.01969.

Montañés Roces, E., Fernández, J., Díaz Rodríguez, S. I., Fernández-Combarro Álvarez, E., & Ranilla Pastor, J. (2003). Measures of rule quality for feature selection in text categorization. Lecture Notes in Computer Science, 2810.

Moraes, R., Valiati, J. F., & Neto, W. P. G. (2013). Document-level sentiment classification: An empirical comparison between SVM and ANN. Expert Systems with Applications, 40(2), 621-633.


Narayanan, V., Arora, I., & Bhatia, A. (2013). Fast and accurate sentiment classification using an enhanced Naive Bayes model. In International Conference on Intelligent Data Engineering and Automated Learning (pp. 194-201). Springer, Berlin, Heidelberg.

Nigam, K., Lafferty, J., & McCallum, A. (1999). Using maximum entropy for text classification. In IJCAI-99 workshop on machine learning for information filtering (Vol. 1, No. 1, pp. 61-67).

Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine learning, 39(2-3), 103-134.

NLTK-Project. (2020). nltk.corpus package. Retrieved from NLTK 3.5 Documentation: https://www.nltk.org/api/nltk.corpus.html?highlight=corpus#module-nltk.corpus

Elite Data Science. (2020). Overfitting in Machine Learning: What It Is and How to Prevent It. Retrieved from Elite Data Science: https://elitedatascience.com/overfitting-in-machine-learning

Porter, M. F. (1997). An algorithm for suffix stripping program. Editors JS Karen, and P. Willet, Readings in Information Retrieval, San Francisco, Morgan Kaufmann.

Pranckevičius, T., & Marcinkevičius, V. (2017). Comparison of naive bayes, random forest, decision tree, support vector machines, and logistic regression classifiers for text reviews classification. Baltic Journal of Modern Computing, 5(2), 221.

Quinlan, J. R. (1986). Induction of decision trees. Machine learning, 1(1), 81-106.

Raghavan, H., & Allan, J. (2007). An interactive algorithm for asking and incorporating feature feedback into support vector machines. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 79-86).

Rehman, A., Javed, K., Babri, H. A., & Saeed, M. (2015). Relative discrimination criterion–A novel feature ranking method for text data. Expert Systems with Applications, 42(7), 3670-3681.

Lewis, D. D., & Ringuette, M. (1994). A comparison of two learning algorithms for text categorization. In Third annual symposium on document analysis and information retrieval (Vol. 33, pp. 81-93).

Rogati, M., & Yang, Y. (2002). High-performing feature selection for text classification. In Proceedings of the eleventh international conference on Information and knowledge management (pp. 659-661).

Ruiz, M. E., & Srinivasan, P. (1998). Automatic text categorization using neural networks. In Proceedings of the 8th ASIS SIG/CR Workshop on Classification Research (pp. 59-72).


Saeys, Y., Inza, I., & Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. bioinformatics, 23(19), 2507-2517.

Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information processing & management, 24(5), 513-523.

Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.

Dey Sarkar, S., Goswami, S., Agarwal, A., & Aktar, J. (2014). A novel feature selection technique for text classification using Naive Bayes. International scholarly research notices, 2014.

Sculley, D., Snoek, J., Wiltschko, A., & Rahimi, A. (2018). Winner's curse? On pace, progress, and empirical rigor.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM computing surveys (CSUR), 34(1), 1-47.

Song, F., Liu, S., & Yang, J. (2005). A comparative study on text representation schemes in text categorization. Pattern analysis and applications, 8(1-2), 199-209.

Sunasra, M. (2017). Performance Metrics for Classification problems in Machine Learning. Retrieved from Medium: https://medium.com/thalus-ai/performance-metrics-for-classification-problems-in- machine-learning-part-i-b085d432082b

Talib, R., Hanif, M. K., Ayesha, S., & Fatima, F. (2016). Text mining: techniques, applications and issues. International Journal of Advanced Computer Science and Applications, 7(11), 414-418.

Tong, S., & Koller, D. (2001). Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov), 45-66.

Triantaphyllou, E., & Felici, G. (Eds.). (2006). Data mining and knowledge discovery approaches based on rule induction techniques (Vol. 6). Springer Science & Business Media.

Uysal, A. K., & Gunal, S. (2012). A novel probabilistic feature selection method for text classification. Knowledge-Based Systems, 36, 226-235.

Veen, F. (2016). The Neural Network Zoo. Retrieved from Asimov Institute: https://www.asimovinstitute.org/neural-network-zoo/


de Vries, A. P., Mamoulis, N., Nes, N., & Kersten, M. (2002). Efficient k-NN search on vertically decomposed data. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data (pp. 322-333).

Xing, C., Wang, D., Zhang, X., & Liu, C. (2014). Document classification with distributions of word vectors. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific (pp. 1-5). IEEE.

Xu, J., & Croft, W. B. (1998). Corpus-based stemming using cooccurrence of word variants. ACM Transactions on Information Systems (TOIS), 16(1), 61-81.

Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information retrieval, 1(1-2), 69-90.

Yang, Y., & Chute, C. G. (1994). An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems (TOIS), 12(3), 252-277.

Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Icml (Vol. 97, No. 412-420, p. 35).

Yiu, T. (2019). Understanding The Naive Bayes Classifier. Retrieved from Towards Data Science: https://towardsdatascience.com/understanding-the-naive-bayes-classifier-16b6ee03ff7b

You, M., Liu, J., Li, G. Z., & Chen, Y. (2012). Embedded feature selection for multi-label classification of music emotions. International Journal of Computational Intelligence Systems, 5(4), 668-678.

Yuan, G. X., Ho, C. H., & Lin, C. J. (2012). Recent advances of large-scale linear classification. Proceedings of the IEEE, 100(9), 2584-2603.

Zhang, M. L., & Zhou, Z. H. (2007). ML-KNN: A lazy learning approach to multi-label learning. Pattern recognition, 40(7), 2038-2048.


Appendix

Stemming – confusion matrixes

7.1.1. Without stemming

Figure 23. Confusion matrix for DT Classifier (DecisionTreeClassifier) without stemming
Figure 25. Confusion matrix for KNN Classifier (KNeighborsClassifier) without stemming

Figure 24. Confusion matrix for NN Classifier (MLPClassifier) without stemming
Figure 26. Confusion matrix for SVM Classifier (SVC) without stemming


Figure 27. Confusion matrix for NB Classifier (MultinomialNB) without stemming

7.1.2. Stemming

Figure 28. Confusion matrix for DT Classifier (DecisionTreeClassifier) with stemming
Figure 29. Confusion matrix for KNN Classifier (KNeighborsClassifier) with stemming


Figure 30. Confusion matrix for SVM Classifier (SVC) with stemming

Feature Selection – confusion matrixes

7.2.1. Filter method (DF)

Figure 31. Confusion matrix for DT Classifier (DecisionTreeClassifier) with DF filter method feature selection.
Figure 32. Confusion matrix for NN Classifier (MLPClassifier) with DF filter method feature selection.


Figure 33. Confusion matrix for SVM Classifier (SVC) with DF filter method feature selection.

7.2.2. Embedded method (SVM)

Figure 34. Confusion matrix for DT Classifier (DecisionTreeClassifier) with SVM embedded method feature selection.
Figure 35. Confusion matrix for NN Classifier (MLPClassifier) with SVM embedded method feature selection.


Figure 36. Confusion matrix for SVM Classifier (SVC) with SVM embedded method feature selection.

Code

The code that was implemented for the thesis can be found at https://github.com/hugomoritz/document_classification_study.
