Chinese Open Information Extraction Based on DBMCSS in the Field of National Information Resources

Open Phys. 2018; 16:568–573
Research Article, Open Access

Jianhou Gan, Peng Huang, Juxiang Zhou, and Bin Wen*

Chinese open information extraction based on DBMCSS in the field of national information resources

https://doi.org/10.1515/phys-2018-0074
Received April 19, 2018; accepted June 12, 2018

Jianhou Gan: Key Laboratory of Educational Informatization for Nationalities, Ministry of Education, Yunnan Normal University, Kunming, China
Peng Huang: Key Laboratory of Educational Informatization for Nationalities, Ministry of Education, Yunnan Normal University, Kunming, China
Juxiang Zhou: Key Laboratory of Educational Informatization for Nationalities, Ministry of Education, Yunnan Normal University, Kunming, China
*Corresponding Author: Bin Wen: College of Computer Science and Technology, Yunnan Normal University, Kunming, Yunnan Province, China

Open Access. © 2018 Jianhou Gan et al., published by De Gruyter. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 License.

Abstract: Binary entity relationship tuples can be applied in many fields, such as knowledge base construction, data mining, and pattern extraction. The purpose of entity relationship mining is to discover and identify the semantic relationship between entities. Because the relationships between entities in the field of ethnicity differ from those in the general domain, using supervised learning methods to extract entity relationships in this field is difficult. We find that some words in the context of a sentence can be used to describe the semantic relationship. To solve the existing difficulties of building a tagged corpus and predefining an entity-relationship model, this paper proposes a method of density-based multi-clustering of semantic similarity (DBMCSS) to mine binary entity relationship tuples from the Chinese national information corpus. The method can extract entity relationships without a training corpus.

Keywords: Open information extraction, DBSCAN, Entity relationship

PACS: 07.05.Kf, 07.05.Mh

1 Introduction

Research on information extraction (IE) is developing into open information extraction, which extracts open categories of entity relations and events from open-domain text resources [1, 2]. Wu [3] uses Wikipedia infobox attribute values to construct training data, uses the CRF method to choose relational words, and then uses matching templates to handle an unbounded set of semantic relations. This helps the IE system get better results, but it also takes more time. Shan [4] uses shallow parsing theory to improve the quantity of information the IE system handles, chooses POS (part of speech) tagging and nominal phrase (NP) chunks instead of full parse trees as the features, and chooses a logistic regression classifier. This brings some new errors: for "He lent me a book", the result is (He, lent, me), but as a matter of fact this is a ternary relation task. Cafarella [5] uses an open information extraction system to extract entities from the web [6]. Yates [7] proposes a system named TextRunner for extraction tasks. Del and Gemulla [8] treat open information extraction as a clustering problem, and Xavier [9] extracts open information based on semantics.

Relation extraction is an important task in text mining [10]. It can reflect the relationship between named entities and help to find implicit knowledge in substantial data and text [11]. We find that some words in the context of a sentence can be used to describe the semantic relationship. To solve the known difficulties in setting up a tagged corpus and predefining the entity-relationship model, this paper proposes a method of density-based multi-clustering of semantic similarity (DBMCSS) to extract binary entity relationship tuples.

In recent years, supervised learning from relation-specific examples has been widely adopted in information extraction systems, but the training data is very hard to obtain, especially in the field of ethnicity [12], where no training corpus has been built and no entity-relationship model has been predefined. To solve this problem, we give an approach that can extract entity relation triples without a tagged corpus or predefined relation types. In the experiments, the method achieves better precision on some relation types than on others.

2 The method

2.1 Named entity pair extraction

To extract binary entity relationship tuples from texts, we first need to extract named entity pairs. Since named entity pairs may occur within a sentence or across a group of sentences, we use a hidden Markov model (HMM) to recognize the named entities [13]. Our task is to extract the named entity pair from a sentence. How the named entity pair is found in a sentence that includes many entities influences the precision, so we define the entity set as follows:

Entity = {(entity1, type1), (entity2, type2), (entity3, type3), ...}

By analyzing the distribution of entity pairs in a sentence, we find that most entity pairs are neighboring. The entity pairs are selected according to this principle. We define the candidate entity pair set as follows:

Candidate Entity = {<entity1, entity2>, <entity2, entity3>, <entity3, entity4>, ...}

There is too much noise in the candidate entity pair set, so we use the following rules to filter it out:

Rule 1: We observed that the two entities of most entity pairs are not too far apart, so we set a distance threshold and filter out the noisy candidate pairs that exceed it.
Rule 2: If a relationship exists between an entity pair, its two entities are not the same, so pairs of identical entities can be filtered out.
Rule 3: We count the frequency of the relationship demonstrative for each entity pair type, and filter out candidates whose frequency is below a threshold value.

2.2 The contextual feature extraction of named entity pairs

By counting the distribution of entity pairs in the Chinese corpus, we find several recurring conditions. Because Chinese syntax is changeable, we need to simplify these conditions, so we divide them into three main types:

(1) The relationship demonstrative is between the entity pair. For example, in "the ancient of Hmong firstly lived in the middle and lower reaches of Yellow River", the relationship demonstrative "lived in" is between the entities "the ancient of Hmong" and "the middle and lower reaches of Yellow River".
(2) The relationship demonstrative is on the right of the entity pair. For example, in "the swinging dance is the most influential large dance of Tujia ethnic minority", the relationship demonstrative "the large dance" is on the right of the entity pair "the swinging dance" and "Tujia ethnic minority".
(3) The relationship demonstrative is on the left of the entity pair. For example, in "As a leader, Mrs. Washi led the Zhuang ethnic minority people to resist the Japanese", the relationship demonstrative "leader" is on the left of the entity pair "Mrs. Washi" and "the Zhuang ethnic minority people".

As features for binary entity relationship tuple extraction, we choose the entity pair, the verb or noun between the entity pair, the verb or noun on the right of the entity pair, the verb or noun on the left of the entity pair, and the POS of all of these words.

2.3 Semantic similarity computation of feature vector

During cluster analysis, calculating the similarity of feature vectors is important. We suppose that if entity pairs have the same relationship, their feature vectors are more similar. For the n-dimensional feature vectors that describe the binary entity relationship tuples, we use a distance to measure similarity, and we choose the Manhattan distance. For two vectors $V_1 = (v_{11}, v_{12}, \ldots, v_{1n})$ and $V_2 = (v_{21}, v_{22}, \ldots, v_{2n})$, and in general for any two feature vectors $V_t$ and $V_j$, the Manhattan distance is:

$$D(V_t, V_j) = \sum_{k=1}^{n} \left| v_{tk} - v_{jk} \right| \tag{1}$$

To calculate the similarity of two words, we choose the TongYiCi CiLin Extended Edition dictionary compiled by HIT (Harbin Institute of Technology), which records 70,000 entries [14]. The TongYiCi CiLin Extended Edition uses a five-layer classification system to describe the hierarchical relation of entries. Fig. 1 shows its hierarchical structure. For example, the code of the word "Hmong" is Di04B10#: "D" is level 1, "i" is level 2, "04" is level 3, "B" is level 4, "10" is level 5, and the trailing "#" is a label with a separate use. Table 1 gives the word encoding rules.

The eighth position of the code takes one of 3 labels: "#", "=" and "@". The "#" marks unequal entries, the "=" marks equal entries, and the "@" marks an independent entry.

Figure 1: The hierarchical structure of TongYiCi CiLin Extended Edition.

Table 1: Word encoding rules of TongYiCi CiLin Extended Edition.
Encode position   1   2   3   4   5   6   7   8
Example           D   i   0   4   B   1   0   = / # / @

– Eps is the radius, defined over the spatial attributes (latitude and longitude), that delimits the neighborhood area of a point.
– MinPts is the minimum number of points that must exist in the Eps-neighborhood.

Some concepts and definitions that are directly or indirectly related to DBSCAN (Density-Based Spatial Clustering of Applications with Noise) are explained here [15]:

(1) Cluster: in a database D with given data objects D = {O1, O2, ..., On}, the procedure of partitioning D into smaller parts that are similar by certain standards, C = {C1, C2, ..., Ci}, is called clustering.
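To make the roles of Eps and MinPts concrete, the following is a minimal Python sketch of the Eps-neighborhood test, using the Manhattan distance of Eq. (1) as the metric. It is an illustration under stated assumptions, not the authors' implementation: the function names (`manhattan`, `eps_neighborhood`, `is_core_point`), the numeric toy vectors, and the convention that a point counts toward its own neighborhood are all hypothetical choices made here. How the paper encodes word features into numbers is not shown in this section.

```python
from typing import List, Sequence


def manhattan(v_t: Sequence[float], v_j: Sequence[float]) -> float:
    """Manhattan distance of Eq. (1): sum of absolute coordinate differences."""
    if len(v_t) != len(v_j):
        raise ValueError("feature vectors must have the same dimension")
    return sum(abs(a - b) for a, b in zip(v_t, v_j))


def eps_neighborhood(points: List[Sequence[float]], index: int, eps: float) -> List[int]:
    """Indices of all points within distance eps of points[index].

    Assumption: the point itself is included in its own neighborhood.
    """
    center = points[index]
    return [i for i, p in enumerate(points) if manhattan(center, p) <= eps]


def is_core_point(points: List[Sequence[float]], index: int, eps: float, min_pts: int) -> bool:
    """A point is a core point if its Eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, index, eps)) >= min_pts


# Hypothetical toy feature vectors, for illustration only.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

print(manhattan(vectors[0], vectors[3]))     # 10.0
print(eps_neighborhood(vectors, 0, 1.0))     # [0, 1, 2]
print(is_core_point(vectors, 0, 1.0, 3))     # True
print(is_core_point(vectors, 3, 1.0, 3))     # False
```

In this toy data the first three vectors lie within Eps = 1.0 of the first one, so it qualifies as a core point for MinPts = 3, while the isolated fourth vector does not and would be treated as noise unless it falls inside some other core point's neighborhood.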
