2017 International Conference on Advanced Education and Management Science (AEMS 2017) ISBN: 978-1-60595-438-7

Contrastive Domain Term Extraction from Chinese Texts without Segmentation

Chu-qiao YU and I.A. BESSMERTNY

ITMO University, Kronverksky Pr., 49, St. Petersburg, 197101, Russian Federation

Keywords: Natural language, Shallow syntactic analysis, Fact extraction, Thesaurus, Search tree.

Abstract. Subject of Research: The paper considers the problem of automatic term extraction from natural language texts. One of the first-priority problems in this area is the creation of a domain thesaurus. Well-proven term extraction methods exist for alphabetic languages, for instance, the TF-IDF approach. Applying these methods to hieroglyphic texts is difficult because of the missing blanks between words. The sentence segmentation task in hieroglyphic languages is usually solved with dictionaries or with statistical methods, in particular, the mutual information approach. Neither sentence segmentation methods nor term extraction methods, taken separately, reach 100 percent precision and recall, and applying them in sequence only multiplies the errors. The aim of this work is to improve the precision and recall of domain term extraction from hieroglyphic texts. Method: The proposed method detects repeated two-, three-, and four-symbol sequences in each sentence and correlates the occurrence frequencies of these sequences in the domain and contrast document collections. The research carried out showed that trivial ranking of all possible symbol sequences extracts only frequently used terms satisfactorily. Filtering symbol sequences by the ratio of their frequencies in the domain and contrast collections made it possible to extract frequently used terms reliably and to find rare domain terms satisfactorily. Some results of term extraction for the "Network technologies" domain from a Chinese text are presented in this paper. A set of articles from the newspaper "Renmin Ribao" was used as the contrast collection, and satisfactory results were obtained.

Introduction

Text mining is a pressing issue whose solution will considerably enhance the efficiency of using information resources stored as text documents [1]. The urgency of this issue grows many times over when it concerns documents in non-English languages. Whereas for alphabetic languages there are methods that solve the term extraction task to one degree or another, for hieroglyphic languages there are currently no satisfactory solutions. One of the priority tasks in this sphere is automatic compilation of a domain thesaurus. The most popular approach is based on the Bag-of-words model [1, 2], which ignores the links between words and sentences and for which there are well-tested methods of domain term extraction, among which latent semantic analysis should be mentioned [3].

Status of the Problem and Current Research

A specific feature of hieroglyphic writing, and of Chinese in particular, is the absence of blanks between words, which gives rise to the problem of sentence segmentation. Thus, the task of term extraction breaks down into segmentation of the text into words and subsequent term extraction. Although there are segmentation rules based on dictionaries [4], the segmentation of hieroglyphic texts has no unequivocal solution even when dictionaries are used [5]. Statistical methods of sentence segmentation, among which the mutual information approach should be noted [6], allow dictionaries to be dispensed with but do not ensure definitive segmentation: in [6] the precision and recall of segmentation into words do not exceed 0.9. Methods of extracting Chinese terms from segmented texts are most frequently based on the well-proven TF-IDF (Term Frequency-Inverted Document Frequency) algorithms [7, 8]. In the paper [7] the

TF-IDF algorithm is modified by adding an information measure DI (Distribution Information); the precision does not exceed 0.68, and the recall 0.77. Taking sentence segmentation errors into account, these figures should be adjusted down to about 0.6 and 0.7 respectively. The paper [8] proposes using a corpus of not less than 3010 words and ignoring words occurring less than three times when extracting rare terms. The authors maintain that in this case no keywords are lost at all when the TF-IDF algorithm is used, which looks doubtful. Thus, the known research is based on separating the phases of text segmentation and term extraction, and the reported results show a wide spread in precision and recall, which indicates that this topic is still underdeveloped. This paper proposes rejecting the sentence segmentation phase and treating all possible symbol sequences between punctuation marks as prospective terms. The basic idea of the proposed approach consists in detecting frequently used symbol sequences and filtering them according to their occurrence frequencies in a domain and a contrast collection.
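For reference, the mutual information approach to segmentation mentioned above can be illustrated with a short sketch. The following Python fragment is our own minimal illustration, not the algorithm of [6]: it scores each pair of adjacent characters by pointwise mutual information, so that low-scoring boundaries become candidate segmentation points. The function name and the smoothing-free formulation are our assumptions.

```python
from collections import Counter
from math import log2

def pmi_scores(text: str) -> dict:
    """Score adjacent character pairs by pointwise mutual information.

    A high PMI suggests the two characters tend to co-occur and may
    form a word; a low PMI marks a plausible segmentation point.
    """
    if len(text) < 2:
        return {}
    chars = Counter(text)                 # unigram counts
    pairs = Counter(zip(text, text[1:]))  # adjacent-pair counts
    n, m = len(text), len(text) - 1
    return {
        (a, b): log2((c / m) / ((chars[a] / n) * (chars[b] / n)))
        for (a, b), c in pairs.items()
    }
```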

Specifics of Hieroglyphic Texts

The hieroglyph is the minimal lexical unit of hieroglyphic languages, including Chinese. Hieroglyphs differ from the words of alphabetic languages in that a hieroglyph denotes a rather broad concept incorporating dozens of meanings. The union of two or more hieroglyphs narrows and concretizes the meaning they convey. For instance, the hieroglyph 出 denotes movement from inside to outside, 口 denotes a mouth or any orifice, and 出口 means an exit, whether leaving or driving out. Despite the presence of the more precise hieroglyphs 門 and 门, denoting door and gate respectively, the combination 出口 is used to denote both leaving a building and driving away from a parking lot. Thus, unlike alphabetic letters, each hieroglyph bears a meaning, and almost any combination of hieroglyphs can be interpreted in some manner. This gravely complicates the stated objective. On the other hand, as shown above, hieroglyphic combinations are standardized, which enables the detection of statistically significant sequences. The Chinese language is distinguished by a rigid word order and extremely simple grammar: there are no declensions, conjugations, tenses, or grammatical number. Consequently, no lemmatization stage is required in processing Chinese texts. Interrogative sentences have the same word order as declarative ones but use interrogative words at the end of the sentence. All this makes Chinese rather appealing as an object of automatic term extraction. Finally, hieroglyphic languages are distinguished by the absence of blanks between words, since their absence poses no problem for native speakers. Just as we do not read letter by letter, the Chinese do not interpret separate hieroglyphs but recognize their stable combinations. Consequently, a similar anthropomorphic approach can be applied to automatic processing of hieroglyphic texts as well.

Frequency Analysis of Symbol Sequences

Let us represent a text as a sequence of symbols, such as abcdefghijk, located between terminal symbols, which are not only punctuation marks but also any non-hieroglyphic symbols. Taking into account that domain terms may consist of two, three, or four symbols, the following interpretations of the said sequence are possible: four-symbol: abcd, bcde, cdef, defg, efgh, fghi, ghij, hijk; three-symbol: abc, bcd, cde, def, efg, fgh, ghi, hij, ijk; two-symbol: ab, bc, cd, de, ef, fg, gh, hi, ij, jk. Some of them are domain terms, some are general technical terms, some are common words, and the rest are meaningless combinations. Let computer network technologies be the domain, and let the reference text be chapter 3, Local Area Networks (LAN), of the book “Basics of Computer Networks” (http://ebook.qq.com/hvread.html?bid=637747&cid=3). The volume of this text is 19,000 symbols, from which 10,978 four-symbol, 12,563 three-symbol, and 14,383 two-symbol strings were obtained. After extracting all symbol strings and ranking them by their number of occurrences N in the text, we obtain the results a fragment of which is given in Table 1.
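The extraction step just described can be sketched as follows. This is a minimal illustration under our own assumptions: the Unicode range used to detect runs of hieroglyphs covers only the basic CJK Unified Ideographs block, and the function name is ours.

```python
import re
from collections import Counter

# Runs of hieroglyphs: any non-CJK symbol (punctuation, Latin letters,
# digits) acts as a terminal symbol. The range below is a simplification
# covering the basic CJK Unified Ideographs block.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')

def candidate_sequences(text: str, lengths=(2, 3, 4)) -> Counter:
    """Count every overlapping two-, three- and four-symbol window
    inside each run of hieroglyphs between terminal symbols."""
    counts = Counter()
    for run in CJK_RUN.findall(text):
        for k in lengths:
            for i in range(len(run) - k + 1):
                counts[run[i:i + k]] += 1
    return counts

# Ranking candidates by occurrence count N, as in Table 1:
# for seq, n in candidate_sequences(domain_text).most_common(30):
#     print(n, seq)
```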

Table 1 shows the most frequently encountered four-symbol combinations of hieroglyphs, among which there are domain terms (“wireless access”, “radio signal”, “Fresnel region”, and the like), their derivatives (possessive expressions: “a switch’s”, “a local network’s”), and combinations of symbols that carry the same meaning as terms but are not terms themselves (“network switchboard”, “Ethernet technology”). No meaningless combinations of hieroglyphs occurred among the frequently used symbol strings. Naturally, pure frequency analysis only detects frequently used domain terms and common words, without differentiating between them.

Table 1. Results of frequency analysis of symbol sequences.

N  | Sequence | Translation                       | Type of sequence
54 | 线局域网 | LAN line                          |
52 | 无线局域 | wireless internet access          | term
22 | 交换机的 | a switch’s (genitive case)        |
20 | 局域网的 | a local network’s (genitive case) |
20 | 兆以太网 | Gigabit network                   |
19 | 无线信号 | radio signal                      | term
19 | 千兆以太 | Gigabit Ethernet                  |
18 | 数据传输 | data transmission                 | term
16 | 菲涅耳区 | Fresnel region                    | term
16 | 传输速率 | data transmission rate            | term
16 | 以太网的 | Ethernet’s (genitive case)        |
15 | 网交换机 | network switchboard               |
13 | 线路由器 | router line                       |
13 | 无线路由 | wireless route                    |
13 | 无线网络 | wireless network                  | term
13 | 传输距离 | transmission range                | term
12 | 质访问控 | medium access control             |
12 | 访问控制 | access control                    | term
12 | 覆盖范围 | coverage                          | term
12 | 的计算机 | a computer’s (genitive case)      |
12 | 的局域网 | a local network’s (genitive case) |
12 | 的以太网 | Ethernet (genitive case)          |
12 | 波束成形 | beam forming                      |
12 | 无线接入 | wireless access                   | term
11 | 以太网技 | Ethernet technology               |
10 | 载波信号 | carrier signal                    | term
10 | 无线设备 | wireless equipment                | term
10 | 无线电波 | radio waves                       | term
10 | 换式局域 | local formula                     |
10 | 式局域网 | LAN (imperative)                  |
10 | 局域网络 | LAN                               | term
10 | 台计算机 | one computer                      |

Term Extraction by Using a Contrast Collection

A traditional approach to improving the quality of domain term extraction is to use a contrast collection belonging to another domain, or a general collection belonging to no particular domain [9]. All existing methods in one way or another reward the presence of a prospective term in the domain collection and penalize its presence in the contrast one. One of the first papers in this direction

was the paper [10], which calculates Weirdness as the ratio of word frequencies in the domain and contrast collections. It should be noted that for Chinese texts that are not broken into words, this approach often leads to a singularity, when the denominator turns to zero. This is because sequences that include fragments of terms and common words may well be unique. Other modifications of the TF-IDF approach [11-15] are likewise empirical and rely on various hypotheses about the distribution of terms in the domain and contrast collections. This diversity, as well as the absence of a clearly expressed preference for any of the existing methods of measuring the termhood of words, gives evidence that no solution to this problem has yet been found. Careful examination of the results of pure frequency analysis of a Chinese text revealed a regularity: untranslatable four-hieroglyph sequences are most often composed of common words with attached prepositions or other fragments. Consequently, if common words are detected using a contrast collection, meaningless combinations of symbols can be filtered out. Thus, the following approach is proposed for filtering the list obtained by frequency analysis. Let the sequence abcd, which includes the fragments abc, bcd, ab, and cd, occur in the domain collection g. This sequence is included into the term list only when the probability of occurrence of not just the sequence itself but of each of its fragments in the domain collection g is higher than in the contrast collection c:

$$p(abcd_g) > p(abcd_c),\quad p(abc_g) > p(abc_c),\quad p(bcd_g) > p(bcd_c),\quad p(ab_g) > p(ab_c),\quad p(cd_g) > p(cd_c).$$
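A minimal sketch of this filtering rule is given below. It assumes the frequency counters from the earlier sketch and uses collection-wide sequence totals as the denominators of the probabilities; the paper does not fix the normalization, so that choice, like the function names, is our assumption.

```python
from collections import Counter

def rel_freq(seq: str, counts: Counter, total: int) -> float:
    """Relative occurrence frequency of a sequence in a collection."""
    return counts[seq] / total if total else 0.0

def passes_contrast_filter(seq: str,
                           domain: Counter, contrast: Counter,
                           n_dom: int, n_con: int) -> bool:
    """Accept a four-symbol candidate abcd only if the sequence itself
    and the fragments named in the text (abc, bcd, ab, cd) are all
    relatively more frequent in the domain collection than in the
    contrast collection."""
    fragments = (seq, seq[:3], seq[1:], seq[:2], seq[-2:])
    return all(rel_freq(f, domain, n_dom) > rel_freq(f, contrast, n_con)
               for f in fragments)
```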

Results of Experimental Research

Table 2 shows the results of filtering four-symbol sequences from the said text on local computer networks. This text contains |Drel| = 25 terms occurring four or more times. As the contrast corpus, a set of articles from the Chinese newspaper "Renmin Ribao" on the topics “politics”, “culture”, “sports”, and “events”, with a total volume of about 480 thousand hieroglyphs, was used. From the chosen text, |Dretr| = 46 words were extracted by the proposed filtration algorithm, of which |Drel ∩ Dretr| = 20 were domain terms.

Table 2. Results of term extraction using a contrast collection.

N  | Sequence | Translation                | Type of sequence
52 | 无线局域 | wireless Internet access   | term
19 | 无线信号 | radio signal               | term
19 | 千兆以太 | gigabit internet           |
16 | 菲涅耳区 | Fresnel region             | term
16 | 传输速率 | data transmission rate     | term
13 | 线路由器 | router line                |
13 | 无线路由 | wireless route             |
13 | 传输距离 | transmission range         | term
12 | 波束成形 | beam forming               |
12 | 无线接入 | wireless access            | term
12 | 太网交换 | Internet switching         |
12 | 多模光纤 | multimode fiber            | term
11 | 内置电源 | built-in power supply      | term
10 | 无线电波 | radio waves                | term
10 | 换式局域 | local formula              |
10 | 交换式局 | switching center           | term
9  | 线接入点 | access point line          |
9  | 全向天线 | omni-directional antenna   | term
9  | 个发射机 | one transmitter            |
8  | 要供电的 | to the power source        |
8  | 换机端口 | 8-port switch              |
8  | 双工端口 | duplex port                | term
7  | 无线介质 | wireless medium            | term
7  | 据链路层 | to capture a data channel  |
7  | 功率类别 | power class                | term
6  | 线的增益 | amplification line         |
6  | 端口宽带 | port throughput            | term
6  | 的覆盖范 | fan cover                  |
6  | 工端口带 | no translation             |
6  | 屏蔽双绞 | screened twisted pair      | term
6  | 太网链路 | Ethernet reference         |
6  | 口带宽为 | no translation             |
5  | 电缆衰减 | cable attenuation          | term
5  | 同轴电缆 | coaxial cable              | term
5  | 发射功率 | transmission power         | term
5  | 享式以太 | no translation             |
5  | 不重叠的 | not to overlap             |
4  | 送方和接 | sender, and then           |
4  | 送或接受 | send or receive            |
4  | 这意味着 | this means that            |
4  | 过无线介 | through wireless medium    |
4  | 笔记本电 | notebook                   |
4  | 相移键控 | phase-shift keying         |
4  | 的灵敏度 | sensitivity                |
4  | 天线增益 | antenna gain               | term
4  | 发送或接 | send or receive            |

It can be seen from Table 2 that the proposed filtration algorithm successfully eliminated words with a possessive marker (an analog of the genitive case), although nothing explicitly marks such combinations as non-terms. At the same time, it failed on combinations of hieroglyphs that might be interpreted as «gigabit internet», «router line», «wireless route», «beam forming», «access point line» and others, which occur in the domain text and are missing in the contrast collection but are not terms of this domain. Thus, the precision of extraction is P = |Drel ∩ Dretr| / |Dretr| = 0.48, and the recall is R = |Drel ∩ Dretr| / |Drel| = 0.8. The F-measure determined by the formula F = 2PR/(P+R) equals 0.6. Compared with the existing methods of term extraction for English texts [16], where the F-measure varies across texts from 0.35 to 0.85, the proposed approach shows fully comparable results. The above results show that excluding from prospective domain terms those combinations of hieroglyphs whose parts are present in the contrast collection enables the extraction of not only common but also rather rare domain terms.
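For completeness, here is a minimal sketch of the quality metrics used above, computed from the sets Drel (relevant terms) and Dretr (retrieved candidates); the function name is ours.

```python
def precision_recall_f(relevant: set, retrieved: set) -> tuple:
    """Precision P = |Drel ∩ Dretr| / |Dretr|,
    recall R = |Drel ∩ Dretr| / |Drel|,
    and F-measure F = 2PR / (P + R)."""
    hits = len(relevant & retrieved)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```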

Conclusion

As a result of the conducted research, the feasibility of the method of domain term extraction from hieroglyphic texts without preliminary segmentation of phrases has been confirmed experimentally. The method has shown precision and recall comparable to the two-phase procedure consisting of breaking sentences into words and subsequently extracting terms. Further research will be aimed at refining the methods of detecting meaningless combinations of hieroglyphs and at determining the termhood of words more precisely using the methods laid down in [17], where the co-occurrence of terms in sentences is taken into account in order to improve the quality of term selection.

References

[1] Joachims T. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, 2002, 205 p.

[2] Wallach H.M. Topic modeling: beyond bag-of-words. Proc. 23rd Int. Conf. on Machine Learning. Pittsburgh, USA, 2006, pp. 977–984.

[3] Nugumanova A., Bessmertnyi I. Applying the latent semantic analysis to the issue of automatic extraction of collocations from the domain texts. Communications in Computer and Information Science, 2013, vol. 394, pp. 92–101. doi: 10.1007/978-3-642-41360-5_8

[4] Taiwanese Principles of Text Segmentation. Available at: http://ip194097.ntcu.edu.tw/TG/CompLing/hunsu/hunsu.htm (accessed 28.10.2016).

[5] Xue N. Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, 2003, vol. 8, no. 1, pp. 29–48.

[6] Zeng D., Wei D., Chau M., Wang F. Domain-specific Chinese word segmentation using suffix tree and mutual information. Information Systems Frontiers, 2011, vol. 13, no. 1, pp. 115–125. doi: 10.1007/s10796-010-9278-5

[7] Huang Lei, Wu Yan-Peng, Zhu Qun-Feng. Research and improvement of TFIDF feature weighting method. Computer Science, 2014, vol. 41, no. 6, pp. 204–208.

[8] Li Xiaochao, Zhao Shang, Lao Yan, Chen Min, Liu Mengmeng. Statistics law of same frequency words in Chinese texts and its application to keywords extraction. Application Research of Computers, vol. 33, no. 4, pp. 1007–1012.

[9] Conrado M.S., Pardo T.A.S., Rezende S.O. A machine learning approach to automatic term extraction using a rich feature set. Proc. NAACL HLT Student Research Workshop. Atlanta, USA, 2013, pp. 16–23.

[10] Ahmad K., Gillam L., Tostevin L. University of Surrey participation in TREC8: weirdness indexing for logical document extrapolation and retrieval (WILDER). Proc. 8th Text Retrieval Conference TREC. Gaithersburg, USA, 1999, p. 717.

[11] Penas A., Verdejo F., Gonzalo J. Corpus-based terminology extraction applied to information access. Proc. Corpus Linguistics 2001, vol. 2001, pp. 458–465.

[12] Kim S.N., Baldwin T., Kan M.-Y. An unsupervised approach to domain-specific term extraction. Proc. Australasian Language Technology Association Workshop, 2009, pp. 94–98.

[13] Basili R. A contrastive approach to term extraction. Proc. 4th Terminological and Artificial Intelligence Conference, TIA2001. Nancy, France, 2001.

[14] Wong W., Liu W., Bennamoun M. Determining termhood for learning domain ontologies using domain prevalence and tendency. Proc. 6th Australasian Conference on Data Mining and Analytics. Gold Coast, Australia, 2007, vol. 70, pp. 47–54.

[15] Yang Y., Pedersen J.O. A comparative study on feature selection in text categorization. Proc. 14th Int. Conf. on Machine Learning ICML, 1997, vol. 97, pp. 412–420.

[16] Astrakhantsev N.A. Automatic term acquisition from domain-specific text collection by using Wikipedia. Trudy ISP RAN, 2014, vol. 26, no. 4, pp. 7–20. (In Russian)

[17] Nugumanova A.B., Bessmertnyi I.A., Petsina P., Baiburin E.M. Semantic relations in text classification based on Bag-of-words model. Programmnye Produkty i Sistemy, 2016, no. 2, pp. 89–99.
