A Review of Machine Learning Algorithms for Text-Documents Classification
Total Page:16
File Type:pdf, Size:1020Kb
4 JOURNAL OF ADVANCES IN INFORMATION TECHNOLOGY, VOL. 1, NO. 1, FEBRUARY 2010 A Review of Machine Learning Algorithms for Text-Documents Classification Aurangzeb Khan, Baharum Baharudin, Lam Hong Lee*, Khairullah khan Department of Computer and Information Science, Universiti Teknologi PETRONAS, Tronoh, Malaysia. *Faculty of Science, Engineering and Technology, Universiti Tunku Abdul Rahman, Perak Campus, Kampar, Malaysia. (E-mail: [email protected], [email protected],[email protected],[email protected]) (supervised, unsupervised and semi supervised) and Abstract— With the increasing availability of electronic documents and the rapid growth of the World Wide Web, summarization. However how these documented can be the task of automatic categorization of documents became properly annotated, presented and classified. So it con- the key method for organizing the information and know- sists of several challenges, like proper annotation to the ledge discovery. Proper classification of e-documents, online documents, appropriate document representation, dimen- news, blogs, e-mails and digital libraries need text mining, sionality reduction to handle algorithmic issues [1], and machine learning and natural language processing tech- an appropriate classifier function to obtain good generali- niques to get meaningful knowledge. The aim of this paper zation and avoid over-fitting. Extraction, Integration and is to highlight the important techniques and methodologies classification of electronic documents from different that are employed in text documents classification, while at sources and knowledge discovery from these documents the same time making awareness of some of the interesting challenges that remain to be solved, focused mainly on text are important for the research communities. representation and machine learning techniques. This paper Today the web is the main source for the text documents, provides a review of the theory and methods of document the amount of textual data available to us is consistently classification and text mining, focusing on the existing litera- increasing, and approximately 80% of the information of ture . an organization is stored in unstructured textual format Index Terms— Text mining, Web mining, Documents [2], in the form of reports, email, views and news etc. The classification, Information retrieval. [3] shows that approximately 90% of the world’s data is held in unstructured formats, so Information intensive I. INTRODUCTION business processes demand that we transcend from simple document retrieval to knowledge discovery. The need of The text mining studies are gaining more importance re- automatically retrieval of useful knowledge from the cently because of the availability of the increasing num- huge amount of textual data in order to assist the human ber of the electronic documents from a variety of sources. analysis is fully apparent [4]. The resources of unstructured and semi structured infor- mation include the word wide web, governmental elec- Market trend based on the content of the online news ar- tronic repositories, news articles, biological databases, ticles, sentiments, and events is an emerging topic for chat rooms, digital libraries, online forums, electronic research in data mining and text mining community [5]. mail and blog repositories. Therefore, proper classifica- For these purpose state-of-the-art approaches to text clas- tion and knowledge discovery from these resources is an sifications are presented in [6], in which three problems important area for research. were discussed: documents representation, classifier con- struction and classifier evaluation. So constructing a data Natural Language Processing (NLP), Data Mining, and structure that can represent the documents, and construct- Machine Learning techniques work together to automati- ing a classifier that can be used to predicate the class la- cally classify and discover patterns from the electronic bel of a document with high accuracy, are the key points documents. The main goal of text mining is to enable in text classification. users to extract information from textual resources and deals with the operations like, retrieval, classification One of the purposes of research is to review the available and known work, so an attempt is made to collect what’s ———————————————— known about the documents classification and representa- Aurangzeb khan and Khairullah Khan are PhD Students, Department tion. This paper covers the overview of syntactic and se- of Computer and Information Science at Universiti Teknologi PETRO- mantic matters, domain ontology, tokenization concern NAS, Tronoh, Malaysia. Baharum Baharudin is an Assistant Professor at the Department of and focused on the different machine learning techniques Computer and Information Science at Universiti Teknologi PETRO- for text classification using the existing literature. The NAS, Tronoh, Malaysia. motivated perspective of the related research areas of text Lam Hong Lee is an Assistant Professor at the Faculty of Science, mining are: Engineering and Technology of Universiti Tunku Abdul Rahman, Perak Campus, located in Kampar, Malaysia. Information Extraction (IE) methods is aim to extract (E-mail:[email protected],[email protected] [email protected])., [email protected]) specific information from text documents. This is the first Manuscript received May 28, 2009; revised September 7, 2009. © 2010 ACADEMY PUBLISHER doi:10.4304/jait.1.1.4-20 JOURNAL OF ADVANCES IN INFORMATION TECHNOLOGY, VOL. 1, NO. 1, FEBRUARY 2010 5 approach assumes that text mining essentially corres- more strong and effective. We have tried to get some re- ponds to information extraction. ports drawn using tables and graphs on the basis of exist- ing studies. Information Retrieval (IR) is the finding of documents which contain answers to questions. In order to achieve The rest of the paper is organized as follows. In Section 2 this goal statistical measures and methods are used for an overview of documents representation approaches, automatic processing of text data and comparison to the Section 3 presents document classification models, in given question. Information retrieval in the broader sense Section 4 new and hybrid techniques were presented. deals with the entire range of information processing, Section 5 consists of comparative study of different me- from data retrieval to knowledge retrieval [7]. thods and finally in Section 6, some discussions and con- clusion were made. Natural Language Processing (NLP) is to achieve a better understanding of natural language by use of computers II DOCUMENTS REPRESENTATION and represent the documents semantically to improve the classification and informational retrieval process. Seman- The documents representation is one of the pre- tic analysis is the process of linguistically parsing sen- processing technique that is used to reduce the complexi- tences and paragraphs into key concepts, verbs and prop- ty of the documents and make them easier to handle, the er nouns. Using statistics-backed technology, these words document have to be transformed from the full text ver- are then compared to the taxonomy. sion to a document vector. Text representation is the im- portant aspect in documents classification, denotes the Ontology is the explicit and abstract model representation mapping of a documents into a compact form of its con- of already defined finite sets of terms and concepts, in- tents. A text document is typically represented as a vector volved in knowledge management, knowledge engineer- of term weights (word features) from a set of terms (dic- ing, and intelligent information integration [23]. tionary), where each term occurs at least once in a certain In this paper we have used system literature review minimum number of document. A major characteristic of process and followed standard steps for searching, screen- the text classification problem is the extremely high di- ing, data-extraction, and reporting. mensionality of text data. The number of potential fea- tures often exceeds the number of training documents. A First of all we tried to search for relevant papers, presen- definition of a document is that it is made of a joint mem- tations, research reports and policy documents that were bership of terms which have various patterns of occur- broadly concerned with documents classification or text rence. Text classification is an important component in mining. We identified appropriate electronic databases many informational management tasks, however with the and websites. Potentially relevant papers were identified explosive growth of the web data, algorithms that can using the electronic databases and websites, Such as improve the classification efficiency while maintaining IEEE Explore, Springer Linker, Science Direct, ACM accuracy, are highly desired [8]. Portal and Googol Search Engine. For best and consistent search a systematic search strategy was adopted. Proper Documents pre-processing or dimensionality reduction keywords, queries, and phrases were derived from the (DR) allows an efficient data manipulation and represen- desired research question. These keywords were arranged tation. Lot of discussions on the pre-processing and DR into categories and related keywords were arranged. are there in the current literature and many models and Some facilities of digital libraries like sort by year etc techniques have been proposed. DR is a very important were also used. The search keywords were