Text Segmentation Techniques: a Critical Review

View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Sunway Institutional Repository Text Segmentation Techniques: A Critical Review Irina Pak and Phoey Lee Teh Department of Computing and Information Systems, Sunway University, Bandar Sunway, Malaysia [email protected], [email protected] Abstract Text segmentation is widely used for processing text. It is a method of splitting a document into smaller parts, which is usually called segments. Each segment has its relevant meaning. Those segments categorized as word, sentence, topic, phrase or any information unit depending on the task of the text analysis. This study presents various reasons of usage of text segmentation for different analyzing approaches. We categorized the types of documents and languages used. The main contribution of this study includes a summarization of 50 research papers and an illustration of past decade (January 2007- January 2017)’s of research that applied text segmentation as their main approach for analysing text. Results revealed the popularity of using text segmentation in different languages. Besides that, the “word” seems to be the most practical and usable segment, as it is the smaller unit than the phrase, sentence or line. 1 Introduction Text segmentation is process of extracting coherent blocks of text [1]. The segment referred as “segment boundary” [2] or passage [3]. Another two studies referred segment as subtopic [4] and region of interest [5]. There are many reasons why the splitting document can be useful for text analysis. One of the main reasons is because they are smaller and more coherent than whole documents [3]. Another reason is each segment is used as units of analysis and access [3]. Text segmentation was used to process text in emotion extraction [6], sentiment mining [7][8], opinion mining [9][10], topic identification [11][12], language detection [13] and information retrieval [14]. Sentiment analyzing within the text covers wide range of techniques, but most of them include segmentation stage in text process. For instance, Zhu et al. [15] used segmentation in his model to identify multiple polarities and aspects within one sentence. There are studies that applied tokenization in the semantic analysis to increase the probability of obtaining the useful information by processing tokens. For example, the study of Gan at al. [16] applied tokenization in their proposed method where semantics were used to improve search results for obtaining more relevant and clear content from the search. Later, Gan and Teh [17] used technique similar to segmentation where information is organized into segments called facet and values in order to improve search algorithm. Another study of Duan at al. [18] applied text segmentation and then tokenization to determine aspects and associated features. In other words, tokenization is also text segmentation because they are apparently a similar process. That is splitting text into words, symbols, phrases, or any meaningful units named as a token. This paper reviews different methods and reasons of applying text segmentation in opinion and sentiment mining, language detection and information retrieval. The target is to overview of text segmentation techniques with brief details. The contribution of this paper includes the categorizations of recent articles and visualization of the recent trend of research in the opinion mining and related areas, such as sentiment analysis and emotion detection. Also in pattern recognition, language processing, and information retrieval. Next section of this paper explains the scope and method used to review the past studies. Section 3 discusses the results of summarized articles. Section 4 contains a discussion on results. Lastly, section 5 concludes this paper. 2 Review Method The review process of the articles includes publications from the past ten years. Fifty journals and conferences in total were evaluated. These articles implemented text segmentation in their main approaches. Fifty articles are summarized in Table 1 in next section. In order to make it clear content of the Table 1, here is breakdown the column by column. The first column of the table refers to the year of the article. Next column includes the references of study. In the following column, there is a brief description of the study. There are different types of segments used in their study, it includes: 1) topic, 2) word, 3) sentence etc. Type of segment commonly selected based on their analyzing targets and specifications. The fourth column listed the type of segment used in the assessed study. The evaluation of each study has applied different sets of data or documentations. They can be categorised as: 1) corpus, 2) news, 3) articles, 4) reviews, 5) datasets etc. We listed it in the fifth column. The sixth column describes the reason for applying the text segmentation study. The last column specifies those language(s) used in those sets of documentations(s). 3 Results Following, we present the summarized review in Table 1. Table 1. Summarisation of the fifty articles applied text segmentation in their studies. N Year R Description Segme Data/ Reason Langu o ef nt Document age 1 200 [1 Opinion search in web Topic Web blog To identify Englis 7 0] blogs (logs) opinion in web h blogs(logs) 2 200 [1 Unsupervised methods Topic Dataset of To detect Polish 7 2] of topical text e-mail boundaries segmentation for Polish newsletter, between news artificial items or documents, consecutive streams, and stories individual document s 3 200 [1 Word segmentation of Word Dataset To identify Englis 7 9] handwritten text using words within h supervised classification handwritten techniques text 4 200 [2 Clustering based text Sente Corpus of To Englis 7 0] segmentation nce articles understand the h sentences relations with consideration the similarities in a group rather than individually 5 200 [6 Comprehensive Word Comment To identify Chines 7 ] information based s polarity e semantic orientation orientation in identification text 6 200 [2 Unconventional word Word Handwritt To define Englis 7 1] segmentation in Brazilian en texts word h boundaries children’s early text production 7 200 [2 Comparative Analysis Sente Dataset of To identify Arabic 7 2] of Different Text nce news boundaries Segmentation Algorithms within the text on Arabic News Stories and measure lexical cohesion between textural units 8 200 [2 Automatic Story Topic News To identify Chines 8 3] Segmentation in Chinese story boundary e Broadcast News 9 200 [1 Aspect-based sentence Sente Reviews To extract Chines 9 5] segmentation for nce aspect-based e sentiment summarization sentiment summary 1 200 [2 Chinese text sentiment Word Corpus To classify Chines 0 9 4] classification based on text based on e Granule Network Zhang sentiment value 1 200 [2 Automatic extraction Word Corpus of To identify Chines 1 9 5] of new words based on Google News word and e Google News corpora sentence from Chinese language texts for supporting lexicon- in real-word based Chinese word applications segmentation systems 1 201 [2 An information- Word Blogs, To identify Urdu 2 0 6] extraction system for comments, social and Urdu-a resource-poor news articles human language behavior within text 1 201 [2 Chinese text Word Corpus To build Chines 3 0 7] segmentation: A hybrid system for e approach using cross-language transductive learning information moreover, statistical association measures 1 201 [7 Sentiment Word Chinese To classify Chines 4 0 ] classification for stock stock news news by e news sentiment orientation 1 201 [8 Sentiment text Word Comment To improve Chines 5 0 ] classification of customers s accuracy of e reviews on the web based sentiment on SVM classification 1 201 [2 The application of text Word Web To analyze Chines 6 0 8] mining technology in documents sentiment in e monitoring the network text to monitor education public public network sentiment 1 201 [2 Using text mining and Word Forums To design a Chines 7 0 9] sentiment analysis for text sentiment e online forums hotspot approach detection and forecast 1 201 [3 A topic modeling Topic Corpus To design an Englis 8 1 0] perspective for text enhance topic h segmentation. extraction approach 1 201 [4 Iterative approach to Topic Articles To identify Englis 9 1 ] text segmentation topic and h Subto subtopic pic boundaries within content 2 201 [3 Text segmentation of Text Articles in To process Englis 0 1 1] consumer magazines in blocks PDF PDF documents h PDF format documents 2 201 [3 Rule-based Malay text Sente Articles To design Malay 1 1 2] segmentation tool nce Malay sentence splitter and Englis tokenizer h 2 201 [3 A novel evaluation Word Corpus To improve Englis 2 1 3] method for morphological word h segmentation segmentation with consideration true morphological ambiguity 2 201 [3 Usage of text Passag Articles To improve Englis 3 1 4] segmentation and inter-e text document h passage similarities clustering 2 201 [3 Using a boosted tree Word Dataset of To convert Englis 4 2 5] classifier for text handwritten handwritten h segmentation in hand- documents document into annotated documents digital text 2 201 [1 Two-part Word Corpus To process Englis 5 2 ] segmentation of text problem and h documents Sente solution nce documents 2 201 [3 Enhancing lexical Topic Corpus of To split into Englis 6 2 6] cohesion measure with news segments that h confidence measures, deal with a semantic relations and single topic language model interpolation for multimedia

Text Segmentation Techniques: a Critical Review

Sentence Boundary Detection for Handwritten Text Recognition Matthias Zimmermann

Multiple Segmentations of Thai Sentences for Neural Machine Translation

A Clustering-Based Algorithm for Automatic Document Separation

An Incremental Text Segmentation by Clustering Cohesion

A Text Denormalization Algorithm Producing Training Data for Text Segmentation

Topic Segmentation: Algorithms and Applications

A Generic Neural Text Segmentation Model with Pointer Network

Text Segmentation Based on Semantic Word Embeddings

Steps Involved in Text Recognition and Recent Research in OCR; a Study

Word Sense Disambiguation and Text Segmentation Based on Lexical

Multiple Segmentations of Thai Sentences for Neural Machine

A Journey Through Existing Procedures on Ocr Recognition and Text