A Comparative Study of Machine Learning Algorithms for Document Classification

Hugo Moritz
Subject: Information Systems
Corresponds to: 30 hp
Presented: VT2020
Supervisor: David Johnson
Examiner: Andreas Hamfeldt
Department of Informatics and Media

Abstract

In an increasingly digitalized world, companies with e-archive solutions want to use modern methods to develop their business. One such method is to classify the content of documents automatically. A common approach is to apply machine learning to this task, which is known as document classification. There is a lack of up-to-date research comparing different machine learning algorithms, in particular on whether more modern methods such as neural networks perform better than traditional statistical machine learning methods. The document classification process consists of pre-processing, feature selection, document representation, and training and testing of the classifiers. Five machine learning methods are implemented with different stemming and feature selection settings, and the results are reported in terms of various classification metrics and time consumption. The results show that the neural network classifier achieves as high an accuracy as the SVM, one of the traditional statistical classifiers, but at a higher computational time cost. Further studies of document classification using other programming languages and libraries may show whether these differences can be established more firmly.

Acknowledgements

After two years of study in the Master's program in Information Systems at Uppsala University, it is with experience and joy that I present this thesis. I would like to thank my supervisor at Uppsala University, David Johnson. He has supported me, provided me with vital knowledge of machine learning and text classification during this period, and helped guarantee the quality of the thesis.
I would also like to thank the people at Ida Infront, who gave me the chance to work at their office and take part in the development. In particular, I would like to thank Richard Johansson and Johnny Hensegård, who supported me on-site at the company and were always there to help me with the thesis work.

Hugo Moritz
Uppsala, 2020-06-01

Content

Abstract
Acknowledgements
List of Figures
Terms & Abbreviations
Introduction
    Ida Infront
    Motivation
    Research problem
    Research question
    Scope
Theory
    Automatic Document Classification
    Feature Extraction
    Feature Selection
    Document representation (Vector representation)
    Classification Algorithms
    Overfitting
    Decision Classifiers
    Linear Classifiers (Discriminative Classifiers)
    Proximity-based Classifier
    Probabilistic Classifiers
    Artificial Neural Network Classifiers
    Classification performance metrics
Methodology
    Chosen text data
    Document classification process
    Pre-processing
    Feature Selection
    Document representation
    Classifiers
    Classification metrics
Result
    Stemming
    Feature Selection
Discussion
    Methodology
    Result
Conclusions
    Future work
References
Appendix
    Stemming – confusion matrixes
    Feature Selection – confusion matrixes
    Code

List of Figures

Figure 1. Venn diagram of the Text Mining area.
Figure 2. Example of the fundamentals of stemming for Swedish grammar.
Figure 3. Example of a decision tree based on continuous or discrete ordinal data.
Figure 4. The margin of separation for the SVM.
Figure 5. The kNN measurement differences between classes and data points.
Figure 7. Model of a neuron unit for a neural network.
Figure 6. Example of a feed-forward neural network architecture.
Figure 8. Fundamentals of a confusion matrix for multi-class labels.
Figure 9. Predictive Document Classification process
Figure 10. Predictive Document Classification with Feature Selection task
