Single-Document Myanmar Text Summarization using Latent Semantic Analysis (LSA)

Soe Soe Lwin*, Dr. Khin Thandar Nwet**
*University of Computer Studies, Yangon (UCSY), **University of Information Technology, Yangon, Myanmar
[email protected], [email protected]

Abstract

Due to the exponential growth in the generation of textual data, tools and mechanisms for summarizing documents are needed. Text summarization is currently a major research topic in Natural Language Processing. There are various approaches to generating a text summary. Among them, we propose Myanmar text summarization using latent semantic analysis (LSA). LSA is a technique in natural language processing for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. It is a retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. There is no LSA-based sentence extraction for the Myanmar language; this is the first LSA-based text summarizer for Myanmar. We summarize Myanmar news from Myanmar news websites such as 7Day Daily, News Eleven, ThithtooLwin, etc.

Keywords: text summarization, LSA

1. Introduction

With the explosion of data available on the Web in the form of unstructured text, efficient methods of summarizing text are becoming more important. Text summarization is the creation of a shortened version of a document by a computer program; it extracts the most important points of the original text. A text document generally consists of a set of topics. Topics are explained by some sentences, and other sentences in the document are used to make the topics more readable and complete. The summary of a document should cover all the topics presented in the original text while keeping redundancy to a minimum level [1]. A summary should include the important sentences that explain the topics covered in the original text.

Summarization systems can be divided into two categories: extractive and abstractive. Extractive summarization extracts salient words/sentences from documents and groups them to produce a summary without changing the original text [10]. Abstractive summarization examines the source text and generates a concise summary that may contain novel sentences not present in the original document.

Summarization methods can also be categorized by what they generate and how they generate it. A summary can be extracted from a single document or from multiple documents. If a summary is generated from a single document, it is known as single-document summarization. On the other hand, if a single summary is generated from multiple documents on the same subject, it is known as multi-document summarization.

Summaries are further divided into generic summaries and query-based summaries. Generic summarization systems generate summaries containing the main topics of documents. In query-based summarization, the generated summaries contain the sentences that are related to the given queries.

There are several research projects concerned with automatic text summarization for English and European languages and for Asian languages such as Chinese and Japanese. However, there is little ongoing research in Myanmar text summarization. In this paper, a generic extractive Myanmar text summarization system based on LSA is proposed.

2. Related Works

Researchers have been working actively on text summarization within the Natural Language


Processing (NLP) field to create better and more efficient summaries. In particular, researchers have applied the latent semantic analysis method to obtain better summaries.

W. T. Kyaw [17] used CRF (Conditional Random Fields) to summarize Myanmar disaster news. That paper proposed a multi-document, query-focused, extractive and informative Myanmar text summarization framework. It is not sentence extraction; it is a word-level summary concerned with natural-disaster information such as date, time and place. Our system, in contrast, extracts important sentences based on LSA.

M. T. Naing [18] proposed a summary generation system based on semantic roles. Automatic pronominal anaphora resolution in Myanmar text is used for summary generation. In semantic role labeling, argument identification and argument classification are developed using the Myanmar Verb Frame Resource. The system cannot identify verbs in sentences with two main verbs.

Gong and Liu [5] proposed summarization of news using LSA to identify the significant topics in documents. SVD is applied to the input matrix A to decompose it into three matrices: A = UΣV^T. They proposed that the rows of the matrix V^T can be considered the various topics covered in the original text. They process each row of V^T in turn and extract the sentence with the maximum value in that row, selecting one sentence for each topic according to topic importance.

M. G. Ozsoy, I. Cicekli and F. N. Alpaslan [2] explained LSA-based summarization algorithms and evaluated them on Turkish and English documents. While creating summaries using LSA, there are various approaches to sentence selection. They compared five algorithms: Gong and Liu; Steinberger and Jezek; Murray, Renals and Carletta; the cross method; and the topic method. The cross method and the topic method were proposed by the authors of that paper. They showed that, among LSA-based approaches, the cross method performs better than the others, though they observed that it does not perform well on shorter documents.

J. Steinberger and K. Jezek [6] described a generic text summarization method that uses the latent semantic analysis technique to identify semantically important sentences, with a summarization ratio of 20%. They suggested two new evaluation methods based on LSA, which measure content similarity between an original document and its summary: similarity of the main topic and similarity of the term significance.

O. Foong, S. Yong and F. Jaid [7] investigated LSA-based extractive text summarization on the mobile Android platform. The summary was compressed to 20% of the original document, and the LSA summarizer produced output with an average F-score of 0.36.

3. Latent Semantic Analysis (LSA)

LSA is an algebraic method which can analyze relations between the terms and sentences of a given set of documents. LSA uses the context of the input document and extracts information such as which words are used together and which common words are seen in different sentences. A high number of common words among sentences indicates that the sentences are semantically related [2]. LSA uses SVD (Singular Value Decomposition) to decompose matrices. SVD is a numerical process which is often used for data reduction, but also for classification, searching in documents and text summarization [3].

There are three main steps in Latent Semantic Analysis:
 Input Matrix Creation
 Singular Value Decomposition
 Sentence Selection

3.1. Input Matrix Creation

The input document is represented in matrix form to perform the calculations. A matrix is created to represent the input text: the rows of the matrix represent the words in the sentences and the columns represent the sentences of the input document. The cells of the matrix represent the importance of words in sentences. There are various methods to fill these cells:
o Number of Occurrences: the cell is filled with the frequency of the word in the sentence.
o Binary Representation: if a word occurs in the sentence the cell is filled with 1, otherwise the cell value is 0.
o Root Type: if the root type of the word is noun, the cell value is the frequency of the word, otherwise the cell value is 0.
o Term Frequency-Inverse Document Frequency (tf-idf):
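The weighting schemes above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's implementation; the function and variable names are our own.

```python
import math
from collections import Counter

def term_sentence_matrix(sentences, weighting="count"):
    """Build a term-by-sentence matrix from pre-segmented sentences.

    sentences: list of lists of words (already word-segmented).
    weighting: "count" (number of occurrences), "binary", or "tfidf".
    Returns (terms, matrix) where matrix[i][j] is the weight of
    terms[i] in sentence j.
    """
    terms = sorted({w for sent in sentences for w in sent})
    n_sents = len(sentences)
    counts = [Counter(sent) for sent in sentences]
    # Number of sentences containing each term (for the IDF factor).
    sent_freq = {t: sum(1 for c in counts if t in c) for t in terms}

    matrix = []
    for t in terms:
        row = []
        for c in counts:
            f = c[t]
            if weighting == "binary":
                row.append(1 if f else 0)
            elif weighting == "tfidf":
                tf = f / sum(c.values())            # TF as defined in Sec. 3.1
                idf = math.log(n_sents / sent_freq[t])
                row.append(tf * idf)
            else:                                   # raw number of occurrences
                row.append(f)
        matrix.append(row)
    return terms, matrix

# Toy example with three already-segmented "sentences".
sents = [["india", "army", "helicopter", "crash"],
         ["india", "airforce", "helicopter", "crash"],
         ["helicopter", "crash", "base"]]
terms, A = term_sentence_matrix(sents, weighting="count")
```

Note that under tf-idf, a word occurring in every sentence gets weight 0 in all cells, since its IDF is log(1) = 0; the system described later in this paper fills cells with raw frequencies.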


TF = (frequency of word i in sentence j) / (sum of frequencies of all words in sentence j)

IDF = log( (number of sentences in the input text) / (number of sentences containing word i) )

o Modified tf-idf: cell values are first calculated with tf-idf. If a cell value is less than or equal to the average tf-idf value in its row, it is set to zero.

3.2. Singular Value Decomposition

LSA processes the term-sentence matrix with the algorithm called Singular Value Decomposition (SVD) [8]. SVD is a method of word co-occurrence analysis, based on a theorem from linear algebra. The input matrix A is decomposed into three matrices U, Σ and V, as shown in equation (1):

A = UΣV^T (1)

U = word × concept matrix
Σ = scaling values, a diagonal matrix in descending order
V = sentence × concept matrix

The algorithm of SVD:
1. Compute AA^T.
2. Calculate the eigenvalues and eigenvectors of AA^T.
3. Compute A^T A.
4. Calculate the eigenvalues and eigenvectors of A^T A.
5. Calculate the square roots of the common positive eigenvalues of AA^T and A^T A.
6. Finally, assign the computed values to U, Σ and V^T.

3.3. Sentence Selection

Using the results of SVD, different algorithms take different approaches to selecting important sentences from the document for the summary [2]. These algorithms are summarized in table 1.

Table 1. Sentence Selection Approaches with LSA

Algorithm: Features for sentence selection
Gong and Liu's method: 1. Matrix V^T
Steinberger and Jezek's method: 1. Matrix V^T; 2. the length of the sentence vector
Murray, Renals and Carletta's method: 1. Matrix V^T; 2. the Σ matrix
Ozsoy's cross method: 1. Matrix V^T; 2. the average value of each sentence; 3. the total length of each sentence vector
Ozsoy's topic method: 1. Matrix V^T; 2. the creation of a concept × concept matrix; 3. the strength values of each concept; 4. discovering the main concepts and sub-concepts

4. Proposed System

This section presents our proposed Myanmar text summarization system. Figure 1 describes the overall architecture of the proposed system: an input document passes through preprocessing (word segmentation and stopwords removal), a word-sentence matrix is created, latent semantic analysis (singular value decomposition) is applied, and sentence selection produces the summary.

[Figure 1. Proposed system for text summarization: Input Document → Preprocessing (Word Segmentation, Stopwords Removal) → Create Word-Sentence Matrix → Latent Semantic Analysis (Singular Value Decomposition) → Sentence Selection → Summary]
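The decomposition in equation (1) is available off the shelf; the following sketch uses NumPy's SVD routine rather than the eigen-decomposition recipe above, and shows the shapes and properties of the three matrices for a small term-by-sentence matrix.

```python
import numpy as np

# Small term-by-sentence matrix A (rows: terms, columns: sentences).
A = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

# A = U Σ V^T, equation (1); s holds the diagonal of Σ.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, Vt.shape)   # U: word x concept, Vt: concept x sentence
print(s)                   # singular values, in descending order
```

Because the singular values come back in descending order, the first rows of V^T correspond to the strongest concepts, which is exactly what the selection methods in table 1 exploit.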


4.1. Word Segmentation

Word segmentation is the process of determining word boundaries in a text. In English, word boundaries can easily be determined by spaces; the Myanmar language does not have white space between words. The news data are first converted to Unicode using a Burmese font converter [11], and word segmentation is then performed with the Burmese word segmentation program based on Foma-generated finite state automata [9]. That program uses a Foma regular expression and a readily available Burmese word list to create finite state automata that segment text into words by longest match against the dictionary. The word list originally contained 41,343 words. Local news contains words such as ေဒေအာင်ဆန်းစုကည်, ဘတ်ဂျတ်, ေဖ့ဘွတ်ခ် and ဘက်တီးရီးယား, and international news contains words such as အီရတ်, သီရိလ ကာ and ဗုံးကဲ, but these words were not included in the word list. Such words were therefore added, and the word list now contains 41,543 words.

The performance and accuracy of word segmentation are very important because text summarization accuracy depends heavily on preprocessing. The accuracy of the Burmese word segmentation program using Foma-generated finite state automata is 86.92%. A sample document after word segmentation is shown in table 2.

Table 2. Input Document after Word Segmentation

S1 အိႏၵိယ|စစ္တပ္| ၏ |ရဟတ္ယာဥ္ |တစ္စီး |ပ်က္က် |ခဲ |့ၿပီး |လိုက္ပါသြားသူ |၇ |ဦ |းစလုံး |ေသဆုံး| ခဲ့ |ေၾကာင္း| သိရွိရသည္
S2 အိႏၵိယ |ေလတပ္ |က |ႏုိင္ငံ |အေ႐ွ႕| ေျမာက္ပိုင္း၊| တ႐ုတ္ |ႏွင့္ |နယ္နိမိတ္ |အနီး |လူသူ |အေရာက္အေပါက္| နည္းပါး ရာ |ေဒသ |တြင္ |ရဟတ္ယာဉ္ |ပ်က္က်| ခဲ့| ျခင္းဟု |အတည္ျပဳခဲ့သည္။
S3 ရဟတ္ယာဥ္ |သည္ |အဆိုပါ |ေဒသ ရိွ| အေျခစိုက္ |စခန္း |တစ္ခု |မွ| ပ်ံတက္ |လာ| ၿပီး| မၾကာမီ|တြင္| ပ်က္က်| ခဲ့ျခင္း| ျဖစ္သည္|

4.2. Stopwords Removal

Stopwords do not represent the main idea of the input text, so stopwords in the input document are removed. Examples of Myanmar stopwords are shown in table 3.

Table 3. Stopwords in Myanmar

ေၾကာင့္, က, ႏွင့္, ၏, ျဖင့္, သို႔, ရန္, ထို, ယင္း, သည္, မည္သည္, မွာ, မွ, မည္, အား, တြင္, ကို, ၿပီး, ဖို႔, လည္း, ထိ, တိုင္, စြာ, ကား, လုံး, ေသာ, လၽွင္, မ်ား, ခဲ့, ေၾကာင္း, ဟု

4.3. Term-Sentence Matrix Creation

The input matrix is created from the source document: columns represent sentences and rows represent terms, and each cell is filled with the frequency of the word in the sentence. The input matrix for the sample document is shown in figure 2.

Figure 2. Term-sentence matrix of the sample document

                 S1  S2  S3
အိႏၵိယ            1   1   0
စစ္တပ္            1   0   0
ရဟတ္ယာဥ္        1   1   1
တစ္စီး            1   0   0
ပ်က္က်            1   1   1
လိုက္ပါသြားသူ      1   0   0
၇ဦး              1   0   0
ေသဆုံး           1   0   0
သိရွိ              1   0   0
ေလတပ္           0   1   0
ႏိုင္ငံ              0   1   0
အေရွ႕             0   1   0
ေျမာက္            0   1   0
တ႐ုတ္            0   1   0
နယ္နိမိတ္          0   1   0
လူသူ             0   1   0
အေရာက္အေပါက္    0   1   0
အတည္ျပဳ          0   1   0
ေဒသ             0   1   1
အေျခစိုက္စခန္း      0   0   1
ပ်ံတတ္            0   0   1
မၾကာမီ           0   0   1

4.4. Singular Value Decomposition

SVD is used to reduce the dimension of the term-by-sentence matrix. Applying the SVD algorithm to the input matrix yields the three matrices U, Σ and V^T shown in figures 3, 4 and 5.
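The preprocessing of Sections 4.1 and 4.2 can be approximated with a naive greedy longest-match segmenter. The real system uses the Foma-generated finite state automata of [9] with a 41,543-word Burmese list; the toy ASCII lexicon below is purely illustrative.

```python
def longest_match_segment(text, lexicon):
    """Greedy longest-match segmentation against a word list.

    A naive stand-in for the Foma-based FSA segmenter [9]: at each
    position, take the longest lexicon entry that matches; fall back
    to a single character for out-of-vocabulary material.
    """
    words, i = [], 0
    max_len = max(len(w) for w in lexicon)
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + l] in lexicon:
                words.append(text[i:i + l])
                i += l
                break
        else:
            words.append(text[i])   # unknown character
            i += 1
    return words

def remove_stopwords(words, stopwords):
    return [w for w in words if w not in stopwords]

# Toy ASCII example; the real word list is Burmese.
lexicon = {"hel", "helicopter", "crash", "ed", "the"}
stopwords = {"the"}
tokens = longest_match_segment("thehelicoptercrashed", lexicon)
# tokens == ["the", "helicopter", "crash", "ed"]
content = remove_stopwords(tokens, stopwords)
```

The example also shows why out-of-vocabulary words hurt: a missing dictionary entry degrades into character-level fragments, which is the failure mode Section 5.2 reports for international news.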

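Sections 4.4 and 4.5 below walk through the SVD output and the cross-method selection numerically. As a compact sketch, the same selection rule (zero out cells at or below their concept-row average, weight each concept row by its singular value, then score each sentence column) can be written as follows, using the V^T and Σ values of the running example. The function name is ours, and the intermediate numbers need not match tables 4 and 5 digit for digit.

```python
import numpy as np

def cross_method_scores(Vt, s):
    """Cross-method sentence scores, following Ozsoy et al. [2]:
    Vt is the concept-by-sentence matrix, s the singular values."""
    Vt = Vt.copy()
    row_avg = Vt.mean(axis=1, keepdims=True)
    Vt[Vt <= row_avg] = 0.0        # drop sentences weakly related to a concept
    scored = Vt * s[:, None]       # weight concepts by their strength
    return scored.sum(axis=0)      # total "length" per sentence column

# V^T and Σ diagonal from the worked example (figures 4 and 5).
Vt = np.array([[0.456, -0.822, -0.341],
               [0.850, -0.516,  0.107],
               [0.264,  0.241, -0.934]])
s = np.array([3.988, 2.726, 2.159])
scores = cross_method_scores(Vt, s)
best = int(np.argmax(scores))      # index of the chosen summary sentence
```

On these values the highest score goes to the first sentence, matching the selection made in Section 4.5.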

Figure 3. U (word × concept) matrix

U =
-0.320  0.123  0.234
-0.114  0.312  0.122
-0.406  0.162 -0.199
-0.114  0.312  0.122
-0.406  0.162 -0.199
-0.114  0.312  0.122
-0.114  0.312  0.122
-0.114  0.312  0.122
-0.114  0.312  0.122
-0.206 -0.189  0.112
-0.206 -0.189  0.112
-0.206 -0.189  0.112
-0.206 -0.189  0.112
-0.206 -0.189  0.112
-0.206 -0.189  0.112
-0.206 -0.189  0.112
-0.206 -0.189  0.112
-0.206 -0.189  0.112
-0.292 -0.150 -0.321
-0.085  0.039 -0.433
-0.085  0.039 -0.433
-0.085  0.039 -0.433

Figure 4. Diagonal descending matrix Σ

Σ =
3.988  0      0
0      2.726  0
0      0      2.159

Figure 5. V^T (concept × sentence) matrix

V^T =
0.456 -0.822 -0.341
0.850 -0.516  0.107
0.264  0.241 -0.934

To obtain the U, Σ and V^T matrices, the online SVD calculator [4] is used. We use the cross method to select the sentences included in the summary.

4.5. Sentence Selection

After creating the input matrix and computing its singular value decomposition, sentence selection is performed to generate the output summary. In this paper, the cross method is used for sentence selection. In the cross method, the V^T matrix and Σ are used to select sentences. First, V^T is preprocessed to remove sentences that are not related to the concepts: the average value of each concept row is calculated, and any cell whose value is less than or equal to that average is set to 0. The V^T matrix after this preprocessing is shown in table 4, where rows represent concepts and columns represent sentences.

Table 4. V^T matrix after preprocessing

       sen1    sen2   sen3   Average
con1  -0.456   0     -0.341  -0.539
con2   0.85    0      0       0.147
con3   0.264   0.241  0      -0.143

The remaining cell values are then multiplied by the corresponding values in the Σ matrix, and the length of each sentence vector is calculated by summing all the concept values in its column of V^T. The length score calculation for each sentence is shown in table 5.

Table 5. Length score calculation for each sentence

        Sen1      Sen2      Sen3
Con1   -1.81853   0        -0.736219
Con2    3.3898    0         0
Con3    1.05283   0.656966  0
length  2.6241    0.656966 -0.736219

In table 5, Sen1 has the highest length score. Therefore, sentence 1, "အိႏၵိယစစ္တပ္ ၏ ရဟတ္ယာဉ္တစ္စီးပ်က္က် ခဲ ့ၿပီး လုိက္ပါသြားသူ ၇ ဦ းစလံုး ေသဆံုး ခဲ့ေၾကာင္း သိရွိရသည္။", is chosen as a summary sentence. In this paper, the number of sentences in the summarized document is on average 40% of the original document.

5. Experimental Work

Local news and international news are used for the experiments. Summarization is done by latent semantic analysis, and the performance measures are shown in this section.

5.1. Data Used for Experiment

We collected Myanmar local news and international news from Myanmar news websites [12], [13], [14], [15], [16]. So far we have collected 55 local news and 76 international news documents from these websites, and we will collect more data in the future. Table 6 shows the data set used for the experiment.


Table 6. Data Set

                        Local News   International News
Number of documents         55              76
Sentences per document     275             456
Words per document        2860            4100

5.2. Performance Measure and Result

In this evaluation, a human-generated summary is used as the reference against which the machine-generated summary of the same document is compared. Accuracy, precision and recall are used as evaluation parameters. Summarization accuracy for international news is lower than for local news because many of the words in international news are not contained in the word list and therefore cannot be segmented correctly. Table 7 gives the evaluation measure formulas, and the performance measured for LSA is shown in table 8.

Table 7. Evaluation Measures

Accuracy = (tp + tn) / (tp + fp + fn + tn)
Precision = tp / (tp + fp)
Recall = tp / (tp + fn)

where
tp = true positives, i.e. the number of correct sentences selected;
fp = false positives, i.e. the number of selected sentences which are not correct;
fn = false negatives, i.e. the number of sentences which are correct but not selected;
tn = true negatives, i.e. the number of sentences which are not correct and also not selected.

Table 8. Performance Measure by LSA

Category             Accuracy   Precision   Recall
Local News             87%        60%        66%
International News     66%        50%        50%

6. Conclusion

Understanding a large document without any abstract or summary is difficult and takes a lot of time. This problem is addressed by an automatic text summarizer. We proposed a Myanmar text summarizer using the latent semantic analysis model. This paper mainly concentrates on single documents. In future work, we will combine a graph-based method to obtain better performance.

References

[1] P. V. Reddy, "Text Summarization Using Latent Semantic Analysis", International Journal of Technology and Engineering Science (IJTES), Vol. 1(8), pp. 1309-1313, November 2013.
[2] M. G. Ozsoy, I. Cicekli, F. N. Alpaslan, "Text Summarization of Turkish Texts using Latent Semantic Analysis", Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 869-876, Beijing, August 2010.
[3] M. Campr, K. Ježek, "Comparative Summarization via Latent Semantic Analysis", Department of Computer Science and Engineering, University of West Bohemia.
[4] http://calculator.vhex.net/calculator/linear-algebra/singular-value-decomposition
[5] Y. Gong, X. Liu, "Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis", Proceedings of SIGIR '01, 2001.
[6] J. Steinberger, K. Jezek, "Using Latent Semantic Analysis in Text Summarization and Summary Evaluation", Department of Computer Science and Engineering.
[7] O. Foong, S. Yong, F. Jaid, "Text Summarization Using Latent Semantic Analysis Model in Mobile Android Platform", 2015 9th Asia Modelling Symposium.
[8] S. A. Babar, S. A. Thorat, "Improving Text Summarization Using Fuzzy Logic & Latent Semantic Analysis", International Journal of Innovative Research in Advanced Engineering (IJIRAE), Vol. 1, Issue 4, May 2014.
[9] https://github.com/lwinmoe/segment
[10] J. Kamala Geetha, N. Deepamala, "Kannada Text Summarization Using Latent Semantic Analysis", International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2015.
[11] http://www.myanmarengineer.org/converter/fontconv.htm
[12] http://www.moi.gov.mm/
[13] http://www.7daydaily.com/
[14] http://news-eleven.com/
[15] http://www.thithtoolwin.com/
[16] http://thevoicemyanmar.com/
[17] W. T. Kyaw, N. L. Thein, H. H. Htay, "Automatic Myanmar Text Summarization System", Proceedings of the 12th International Conference on Computer Applications (ICCA 2014), Yangon, Myanmar.
[18] M. T. Naing, A. Thida, "Automatic Myanmar Text Summarization System with Semantic Roles", Proceedings of the 12th International Conference on Computer Applications (ICCA 2014), Yangon, Myanmar, February 2014, pp. 217-223.
