A Novel Application of Fuzzy Set Theory and Topic Model in Sentence Extraction for Vietnamese Text
Total Page:16
File Type:pdf, Size:1020Kb
IJCSNS International Journal of Computer Science and Network Security, VOL.10 No.8, August 2010 41 A Novel Application of Fuzzy Set Theory and Topic Model in Sentence Extraction for Vietnamese Text Ha Nguyen Thi Thu 1† and Nguyen Thien Luan2††, †Department of Computer Science, Electric Power University ,235 Hoang Quoc Viet, Ha noi, Viet Nam ††Faculty of Information Technology, Le Qui Don Technical University,100 Hoang Quoc Viet, Ha Noi, Viet Nam Summary on the method of calculating the frequency of words in Vietnamese language has common characteristics with some sentences is not appropriate . On the other hand, study on Asian languages such as Chinese, Japanese, Korean ... They do Vietnamese text summary is now something new, corpus not define words based on spaces. In this article, we present a for Vietnamese text summarization or word segmentation method that application of Fuzzy set theory and topic model to is still limited. extract sentences in Vietnamese texts which have been categorized by topic. This method based on identification of important features as : length of sentence, weight of terms in We studied Vietnamese text that categorized by topic and sentences, position of sentences ..., then extracting important found that with each different theme, weight of words sentences according to the ratio, this ratio indicate which (hereafter called terms) would depend on the different sentences in original text will be extracted. We also built a topics. Therefore, we have built sets of terms for each system based on this method and experiments have obtained topic and applied Fuzzy theory in determining degree of good results, satisfying the given requirements. membership of each term in text which is easier than Key words: identifying important sentences for extracting. Vietnamese text, Sentence extraction, topic model, Fuzzy set theory Structure of this article is as follows: in section 2, we introduce structure of automatic Vietnamese text sentence 1. Introduction extraction system. In section 3, we present methodology using topic model to build sets of terms and Fuzzy set " A text that is produced from one or more texts, that theory to determine weight of sentence in text. Section 4 conveys important information in the original text(s), and shown the result of our system and evaluation with some that is no longer than half of the original text(s) and others method and section 5 is conclusion. usually significantly less than that" (Radev et all 2002) [1]. Lin - 2000 proposed construction of an automatic text 2. Modeling of sentence extraction summary system including sentence- extracting process [3]. For a long time, most of automatic text summariztion The method applying Fuzzy theory is quite effective in systems based on sentence extraction in one or more texts approximately solving the problems of classification, and then combined them to make the summary. The identification of texts, scripts, images, sounds, ... An earliest research on text summary used sentence extraction Fuzzy set A on a classic set Χ is defined as follows: method based on specific features of words and phrases frequency (Luhn, 1958), position of the sentence in the text (Baxendale, 1958) and important phrases (Edmundson, 1969) [1]. Appearance frequency of a word mainly based Membership function μA (x) quantifies the degree which on the tf * idf method. elements x belong to basic set Χ. If the function gives result 0 for an element, then that element is not in the Most of these methods are effectively applied for English. given set, the result 1 describes a full member of the set. However, for Asian languages like Vietnamese, Chinese, The values in range of 0 to 1 show fuzzy elements. Japanese or Korean with single syllable, it is very difficult to separate words. Unlike English, words of these We propse a model for automatic Vietnamese text languages can not be determined based only on a space [4]. sentence calculation as follows Thus, determination of the weight of sentences based Manuscript received August 5, 2010 Manuscript revised August 20, 2010 42 IJCSNS International Journal of Computer Science and Network Security, VOL.10 No.8, August 2010 Fig 1: Calculated weight of sentence n d (t ) 3. Method of sentence extraction ∑ i j μ (t ) = i=1 (3) D j n 3.1 Topic Model Where: Vietnamese has not a normal word segmentation, thus in : is degree of membership of t on D this research, we have applied the Topic Model [4] to μD (t j ) j build a set of terms corresponding to the different topics. First we build sets of documents for training that has been n: is number of documents in D classified by difference topic. With each topic we have a ⎪⎧1 if tdji∈ training set: dtij()= ⎨ ⎩⎪0 if tdji∉ D = {d1, d2 ,..dn} (1) Where : D is the training set, collecting set of documents 3.3 Degree of membership of terms in test document. di are the same of topic. If the degree of membership of terms in the training set D For each topic, we build a corresponding set of terms. indicates importance of the term to the training set, the degree of membership of terms in test document shows the (2) T = {t1,t2 ,..tm} importance of the term to the test document and it is calculated by the number of appearance of term tj over the 3.2 Degree of membership of terms in training set D total terms appearing in test document. Nt() j (4) Degree of membership of terms in the training set μCj()t = m indicates importance of the term to the topic [6]. It is ∑ Nt()i calculated by the ratio of total number of texts containing i=1 the term divide total texts in the training set. IJCSNS International Journal of Computer Science and Network Security, VOL.10 No.8, August 2010 43 After the sentences have been calculated, they will be Where: sorted in descending order and extracted according to the μ ()t indicates grade of membership of tj in test Cj ratio r which is length of set of extract sentences divided document C . by length of set of sentences in the original text. The algorithm of sentence extraction is as follows Nt()j is number of t j appear in C. 3.4 Calculating the importance of sentences SENTENCE EXTRACTION ALGORITHM We propose a method for calculate weight of sentences μ (t ) μ (t ) ∑ c j ∑ D j t j ∈si length(si ) Fi = a1 + a2 + a3 + a4 position Input : max{}∑ μ D (t j ) max{}∑ μC (t j ) length(S) C: Test Document (5) Ratio r Where : is the score are the linear aaaa1234,,, Output: : coefficients indicate grade of membership of term t in j R : Set of Sentence training set D, grade of membership of term t j in test . Begin document, length of s and position of s in document. R=””; p=1; i i // select sentence to extract ⎧1 if si in the head of document or foot of document Position = ⎨ Do While ⎩0 if otherwise length (R ) + length(s[ ip ]<= length(C) *r Begin The algorithm is described as a follows: R =R + s[ i ] p t[ip ]= true; ALGORITHM CALCULATING THE p=p+1; IMPORTANCE OF SENTENCES End // arrange sentences in order Input C: Test Document R=” ” Output R: Set of important sentences For i=1 to n do if t[i]=true then R=R+s[i] 1. Initialization End R←Ø; S←Ø; 2.Sentence segmentation Fig. 3 Procedure for selected sentence S←s1,s2,..,sk 3. Calculating weight of each sentence in Test 4. Result and evaluation Document We evaluate this approach by apply it to summarize For each sk Vietnamese news and comparing with other current For i=1 to length(sk) approaches, our research shows interesting and satisfactory results. Based on the proposed method above, we have For each term in s calculate k built a system to calculate importance of sentences and automatic extraction of sentences for Vietnamese texts. L(si) ←L(si)+ μD (t j ) + μC (t j ) We tested with four different topics: sport, technology, length(si) political, and economy. For each topic, we collected about F(s ) ←⎯⎯ L(s ) + + position i i length(S) 300 different texts for training from the Vietnamese website http://www.vnexpress.net. And here is the architecture of our automatic Vietnamese text sentence Fig. 2 Procedure for calculated weight of sentence extraction. 3.5 Method for extracting important sentences 44 IJCSNS International Journal of Computer Science and Network Security, VOL.10 No.8, August 2010 Fig. 4 Architecture for extraction http://smmry.com, and result from extract by human. In At the present, Vietnamese does not have any standard evaluate our method with an online summary system. We assessment method, therefore, we compare the result of downloaded english documents from extract according to the method proposed by Ha Thanh Le http://news.bbc.co.uk and used for exprimentation. We [Le Thanh Ha, Quyet Thang Huynh, Mai Luong , 2005] translated to Vietnamese text and apply our system to and our result. Comparison shows that our extract result is extract sentences, then translated to English and compared better and more stable than theirs. In addition, we also it with result of smmry.com. compared our result with an online summary system Real Madrid are set to complete the signing of Portugal centre-back Ricardo Carvalho from Chelsea. The Spanish giants have agreed a £6.7m fee to sign the 32-year-old, who will link up once again with former Chelsea boss Jose Mourinho. "Chelsea can confirm it has agreed terms with Real Madrid for the transfer of Carvalho," said a club statement.