Transfer Learning Based Cross-lingual Knowledge Extraction for Wikipedia

Zhigang Wang†, Zhixing Li†, Juanzi Li†, Jie Tang†, and Jeff Z. Pan‡
† Tsinghua National Laboratory for Information Science and Technology, DCST, Tsinghua University
{wzhigang,zhxli,ljz,tangjie}@keg.cs.tsinghua.edu.cn
‡ Department of Computing Science, University of Aberdeen, Aberdeen, UK
[email protected]

Abstract

Wikipedia infoboxes are a valuable source of structured knowledge for global knowledge sharing. However, infobox information is very incomplete and imbalanced among the Wikipedias in different languages. It is a promising but challenging problem to utilize the rich structured knowledge from a source language Wikipedia to help complete the missing infoboxes for a target language.

In this paper, we formulate the problem of cross-lingual knowledge extraction from multilingual Wikipedia sources, and present a novel framework, called WikiCiKE, to solve this problem. An instance-based transfer learning method is utilized to overcome the problems of topic drift and translation errors. Our experimental results demonstrate that WikiCiKE outperforms the monolingual knowledge extraction method and the translation-based method.

1 Introduction

In recent years, automatic knowledge extraction using Wikipedia has attracted significant interest in research fields such as the semantic web. As a valuable source of structured knowledge, Wikipedia infoboxes have been utilized to build linked open data (Suchanek et al., 2007; Bollacker et al., 2008; Bizer et al., 2008; Bizer et al., 2009), support next-generation information retrieval (Hotho et al., 2006), improve question answering (Bouma et al., 2008; Ferrández et al., 2009), and support other aspects of data exploitation (McIlraith et al., 2001; Volkel et al., 2006; Hogan et al., 2011) using semantic web standards such as RDF (Pan and Horrocks, 2007; Heino and Pan, 2012) and OWL (Pan and Horrocks, 2006; Pan and Thomas, 2007; Fokoue et al., 2012), and their reasoning services.

However, most infoboxes in the different Wikipedia language versions are missing. Figure 1 shows the statistics of article numbers and infobox information for six major Wikipedias. Only 32.82% of the articles have infoboxes on average, and the numbers of infoboxes for these Wikipedias vary significantly. For instance, the English Wikipedia has 13 times more infoboxes than the Chinese Wikipedia and 3.5 times more infoboxes than the second largest Wikipedia, the German one.

[Figure 1: Statistics for Six Major Wikipedias. The figure plots the numbers of articles and infoboxes (on the order of 10^6 instances) for English, German, French, Dutch, Spanish and Chinese.]

To solve this problem, KYLIN has been proposed to extract the missing infoboxes from unstructured article texts for the English Wikipedia (Wu and Weld, 2007). KYLIN performs well when sufficient training data are available, and such techniques as shrinkage and retraining have been used to increase recall from English Wikipedia's long tail of sparse infobox classes (Weld et al., 2008; Wu et al., 2008). The extraction performance of KYLIN is limited by the number of available training samples.

Due to the great imbalance between different Wikipedia language versions, it is difficult to gather sufficient training data from a single Wikipedia. Some translation-based cross-lingual knowledge

[Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 641-650, Sofia, Bulgaria, August 4-9 2013. © 2013 Association for Computational Linguistics]

extraction methods have been proposed (Adar et al., 2009; Bouma et al., 2009; Adafre and de Rijke, 2006). These methods concentrate on translating existing infoboxes from a richer source language version of Wikipedia into the target language. The recall of new target infoboxes is highly limited by the number of equivalent cross-lingual articles and the number of existing source infoboxes. Take Chinese-English1 Wikipedias as an example: current translation-based methods only work for 87,603 Chinese Wikipedia articles, 20.43% of the total 428,777 articles. Hence, the challenge remains: how can we supplement the missing infoboxes for the remaining 79.57% of articles?

(Footnote 1: Chinese-English denotes the task of Chinese Wikipedia infobox completion using English Wikipedia.)

On the other hand, the numbers of existing infobox attributes in different languages are highly imbalanced. Table 1 compares the numbers of articles for the attributes in template PERSON between English and Chinese Wikipedia. Extracting the missing values of attributes such as awards, weight, influences and style inside the single Chinese Wikipedia is intractable due to the rarity of existing Chinese attribute-value pairs.

Table 1: The Numbers of Articles in TEMPLATE PERSON between English (en) and Chinese (zh).

Attribute     en      zh     | Attribute    en     zh
name          82,099  1,486  | awards       2,310  38
birth date    77,850  1,481  | weight       480    12
occupation    66,768  1,279  | influences   450    6
nationality   20,048  730    | style        127    1

In this paper, we have the following hypothesis: one can use the rich English (auxiliary) information to assist the Chinese (target) infobox extraction. In general, we address the problem of cross-lingual knowledge extraction by using the imbalance between Wikipedias of different languages. For each attribute, we aim to learn an extractor to find the missing value from the unstructured article texts in the target Wikipedia by using the rich information in the source language. Specifically, we treat this cross-lingual information extraction task as a transfer learning-based binary classification problem.

The contributions of this paper are as follows:

1. We propose a transfer learning-based cross-lingual knowledge extraction framework called WikiCiKE. The extraction performance for the target Wikipedia is improved by using rich infoboxes and textual information in the source language.

2. We propose the TrAdaBoost-based extractor training method to avoid the problems of topic drift and translation errors of the source Wikipedia. Meanwhile, some language-independent features are introduced to make WikiCiKE as general as possible.

3. Chinese-English experiments on four typical attributes demonstrate that WikiCiKE outperforms both the monolingual extraction method and the current translation-based method. Increases of 12.65% in precision and 12.47% in recall on the template named person are achieved when only 30 target training articles are available.

The rest of this paper is organized as follows. Section 2 presents some basic concepts, the problem formalization and an overview of WikiCiKE. In Section 3, we propose our detailed approaches. We present our experiments in Section 4. Some related work is described in Section 5. We conclude our work and the future work in Section 6.

2 Preliminaries

In this section, we introduce some basic concepts regarding Wikipedia, formally define the key problem of cross-lingual knowledge extraction, and provide an overview of the WikiCiKE framework.

2.1 Wiki Knowledge Base and Wiki Article

We consider each language version of Wikipedia as a wiki knowledge base, which can be represented as K = {a_i}_{i=1}^p, where a_i is a disambiguated article in K and p is the size of K.

Formally, we define a wiki article a ∈ K as a 5-tuple a = (title, text, ib, tp, C), where

- title denotes the title of the article a,
- text denotes the unstructured text description of the article a,
- ib is the infobox associated with a; specifically, ib = {(attr_i, value_i)}_{i=1}^q represents the list of attribute-value pairs for the article a,
- tp = {attr_i}_{i=1}^r is the infobox template associated with ib, where r is the number of attributes for one specific template, and
- C denotes the set of categories to which the article a belongs.

Figure 2 gives an example of these five important elements concerning the article named "Bill Gates".

[Figure 2: Simplified Article of "Bill Gates".]

In what follows, we will use named subscripts, such as a_{Bill Gates}, or index subscripts, such as a_i, to refer to one particular instance interchangeably. We will use "name in TEMPLATE PERSON" to refer to the attribute attr_name in the template tp_PERSON. In this cross-lingual task, we use the source (S) and target (T) languages to denote the languages of the auxiliary and target Wikipedias, respectively. For example, K_S indicates the source wiki knowledge base, and K_T denotes the target wiki knowledge base.

2.2 Problem Formulation

Mining new infobox information from unstructured article texts is actually a multi-template, multi-slot information extraction problem. In our task, each template represents an infobox template and each slot denotes an attribute. In the WikiCiKE framework, for each attribute attr_T in an infobox template tp_T, we treat the task of missing value extraction as a binary classification problem: it predicts whether a particular word (token) from the article text is the extraction target (Finn and Kushmerick, 2004; Lafferty et al., 2001).

Given an attribute attr_T and an instance (word/token) x_i, X_S = {x_i}_{i=1}^n and X_T = {x_i}_{i=n+1}^{n+m} are the sets of instances (words/tokens) in the source and the target language, respectively. x_i can be represented as a feature vector according to its context. Usually, we have n ≫ m in our setting, with many more attributes in the source than in the target. The function g: X → Y maps an instance from X = X_S ∪ X_T to its true label in Y = {0, 1}, where 1 represents the extraction target (positive) and 0 denotes background information (negative). Because the number of target instances m is inadequate to train a good classifier, we combine the source and target instances to construct the training data set TD = TD_S ∪ TD_T, where TD_S = {x_i, g(x_i)}_{i=1}^n and TD_T = {x_i, g(x_i)}_{i=n+1}^{n+m} represent the source and target training data, respectively.

Given the combined training data set TD, our objective is to estimate a hypothesis f: X → Y that minimizes the prediction error on testing data in the target language. Our idea is to determine the useful part of TD_S to improve the classification performance on TD_T. We view this as a transfer learning problem.

2.3 WikiCiKE Framework

WikiCiKE learns an extractor for a given attribute attr_T in the target Wikipedia. As shown in Figure 3, WikiCiKE contains four key components: (1) Automatic Training Data Generation: given the target attribute attr_T and two wiki knowledge bases K_S and K_T, WikiCiKE first generates the training data set TD = TD_S ∪ TD_T automatically. (2) WikiCiKE Training: WikiCiKE uses a transfer learning-based classification method to train the classifier (extractor) f: X → Y by using TD_S ∪ TD_T. (3) Template Classification: WikiCiKE then determines proper candidate articles which are suitable to generate the missing value of attr_T. (4) WikiCiKE Extraction: given a candidate article a, WikiCiKE uses the learned extractor f to label each word in the text of a, and generates the extraction result in the end.

3 Our Approach

In this section, we present the detailed approaches used in WikiCiKE.

[Figure 3: WikiCiKE Framework.]

3.1 Automatic Training Data Generation

To generate the training data for the target attribute attr_T, we first determine the equivalent cross-lingual attribute attr_S. Fortunately, some templates in non-English Wikipedia (e.g. Chinese Wikipedia) explicitly match their attributes with their counterparts in English Wikipedia. Therefore, it is convenient to align the cross-lingual attributes using English Wikipedia as a bridge. For attributes that can not be aligned in this way, we currently align them manually. The manual alignment is worthwhile because thousands of articles belonging to the same template may benefit from it, and at the same time it is not very costly. In Chinese Wikipedia, the top 100 templates cover nearly 80% of the articles which have been assigned a template.

Once the aligned attribute mapping attr_T ↔ attr_S is obtained, we collect the articles from both K_S and K_T containing the corresponding attr. The collected articles from K_S are translated into the target language. Then, we use a uniform automatic method, which primarily consists of word labeling and feature vector generation, to generate the training data set TD = {(x, g(x))} from these collected articles.

For each collected article a = (title, text, ib, tp, C) and its value of attr, we can automatically label each word x in text according to whether x and its neighbors are contained by the value. The text and value are processed as bags of words {x}_text and {x}_value. Then for each x_i ∈ {x}_text we have:

g(x_i) = 1, if x_i ∈ {x}_value and |{x}_value| = 1;
         1, if (x_{i-1}, x_i ∈ {x}_value or x_i, x_{i+1} ∈ {x}_value) and |{x}_value| > 1;
         0, otherwise.    (1)

After the word labeling, each instance (word/token) is represented as a feature vector. In this paper, we propose a general feature space that is suitable for most target languages. As shown in Table 2, we classify the features used in WikiCiKE into three categories: format features, POS tag features and token features.

Table 2: Feature Definition.

Category           Feature                                          Example
Format features    First token of sentence                          "Hello World!"
                   In first half of sentence                        "Hello World!"
                   Starts with two digits                           "31th Dec."
                   Starts with four digits                          "1999's summer"
                   Contains a cash sign                             "10$"
                   Contains a percentage symbol                     "10%"
                   Stop words                                       "of, the, a, an"
                   Pure number                                      "365"
                   Part of an anchor text                           "Movie Director"
                   Begin of an anchor text                          "Game Designer"
POS tag features   POS tag of current token
                   POS tags of previous 5 tokens
                   POS tags of next 5 tokens
Token features     Current token
                   Previous 5 tokens
                   Next 5 tokens
                   Is current token contained by title
                   Is one of previous 5 tokens contained by title

The target training data TD_T is directly generated from articles in the target language Wikipedia. Articles from the source language Wikipedia are translated into the target language in advance and then transformed into the training data TD_S. In the next section, we discuss how to train an extractor from TD = TD_S ∪ TD_T.

3.2 WikiCiKE Training

Given the attribute attr_T, we want to train a classifier f: X → Y that can minimize the prediction

error for the testing data in the target language. Traditional machine learning approaches attempt to determine f by minimizing some loss function L on the prediction f(x) for a training instance x and its real label g(x):

f̂ = argmin_{f ∈ Θ} Σ_{(x, g(x)) ∈ TD_T} L(f(x), g(x))    (2)

In this paper, we use TrAdaBoost (Dai et al., 2007), an instance-based transfer learning algorithm first proposed by Dai, to find f̂. TrAdaBoost requires that the source training instances X_S and the target training instances X_T be drawn from the same feature space. In WikiCiKE, the source articles are translated into the target language in advance to satisfy this requirement. Due to the topic drift problem and translation errors, the joint probability distribution P_S(x, g(x)) is not identical to P_T(x, g(x)). We must adjust the source training data TD_S so that they fit the distribution of TD_T. TrAdaBoost iteratively updates the weights of all training instances to optimize the prediction error. Specifically, the weight-updating strategy for the source instances is decided by the loss on the target instances.

For each iteration t = 1 ∼ T, given a weight vector p^t normalized from w^t (w^t is the weight vector before normalization), we call a basic classifier F that can handle weighted instances and then find a hypothesis f̂_t that satisfies

f̂_t = argmin_{f ∈ Θ_F} Σ_{(x, g(x)) ∈ TD_S ∪ TD_T} L(p^t, f(x), g(x))    (3)

Let ε_t be the prediction error of f̂_t at the t-th iteration on the target training instances TD_T, which is

ε_t = (1 / Σ_{k=n+1}^{n+m} w_k^t) × Σ_{k=n+1}^{n+m} (w_k^t × |f̂_t(x_k) − y_k|)    (4)

With ε_t, the weight vector w^t is updated by the function

w^{t+1} = h(w^t, ε_t)    (5)

The weight-updating strategy h is illustrated in Table 3. Finally, a final classifier f̂ is obtained by combining f̂_{T/2} ∼ f̂_T. TrAdaBoost has a convergence rate of O(√(ln n / N)), where n and N are the number of source samples and the maximum number of iterations, respectively.

Table 3: Weight-updating Strategy of TrAdaBoost.

                  TrAdaBoost                        AdaBoost
Target samples    +: w^t    −: w^t × β_t^{-1}       +: w^t    −: w^t × β_t^{-1}
Source samples    +: w^t    −: w^t × β              no source training sample available

+: correctly labelled; −: mis-labelled
w^t: weight of an instance at the t-th iteration
β_t = ε_t / (1 − ε_t); β = 1 / (1 + √(2 ln n / N))

3.3 Template Classification

Before using the learned classifier f to extract the missing infobox value for the target attribute attr_T, we must select the correct articles to be processed. For example, the article a_{New York} is not a proper article for extracting the missing value of the attribute attr_{birth_day}.

If a already has an incomplete infobox, it is clear that the correct tp is the template of its own infobox ib. For those articles that have no infoboxes, we use the classical 5-nearest neighbor algorithm to determine their templates (Roussopoulos et al., 1995) using their category labels, outlinks and inlinks as features (Wang et al., 2012). Our classifier achieves an average precision of 76.96% with an average recall of 63.29%, and can be improved further. In this paper, we concentrate on the WikiCiKE training and extraction components.

3.4 WikiCiKE Extraction

Given an article a determined by template classification, we generate the missing value of attr from the corresponding text. First, we turn the text into a word sequence and compute the feature vector for each word based on the feature definition in Section 3.1. Next, we use f to label each word, obtaining a labeled sequence

text^l = {x_1^{f(x_1)} ... x_{i-1}^{f(x_{i-1})} x_i^{f(x_i)} x_{i+1}^{f(x_{i+1})} ... x_n^{f(x_n)}}

where the superscript f(x_i) ∈ {0, 1} represents the positive or negative label given by f. After that, we extract the adjacent positive tokens in text as the predicted value. In particular, the longest positive token sequence and the one that contains other positive token sequences are preferred in extraction. E.g., a positive sequence "comedy movie director" is preferred to a shorter sequence "movie director".
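The weight-updating strategy of Eqs. (4)-(5) and Table 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions (0/1 error indicators, target error ε_t below 0.5, and the β formula following Dai et al. (2007)); it is not the full training loop, which also requires the weighted basic learner F and the final ensemble over f̂_{T/2} ∼ f̂_T.

```python
import math

def target_error(weights, miss, n_source):
    """Eq. (4): weighted prediction error of the current hypothesis on
    the target instances, where miss[k] = |f_t(x_k) - y_k| in {0, 1}."""
    w_t = weights[n_source:]
    m_t = miss[n_source:]
    return sum(w * e for w, e in zip(w_t, m_t)) / sum(w_t)

def update_weights(weights, miss, n_source, n_iters):
    """Eq. (5) / Table 3: mis-labelled source instances are shrunk by
    beta (they disagree with the target distribution); mis-labelled
    target instances are boosted by beta_t^{-1}, as in AdaBoost.
    Assumes eps_t < 0.5 so that beta_t < 1."""
    eps_t = target_error(weights, miss, n_source)
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_source) / n_iters))
    beta_t = eps_t / (1.0 - eps_t)
    new_w = []
    for k, (w, e) in enumerate(zip(weights, miss)):
        if k < n_source:                 # source instance: shrink if wrong
            new_w.append(w * beta ** e)
        else:                            # target instance: grow if wrong
            new_w.append(w * beta_t ** -e)
    return new_w
```

For example, with two source and three target instances all weighted 1.0 and one mistake on each side, ε_t = 1/3, so the mis-labelled target instance's weight doubles while the mis-labelled source instance's weight shrinks below 1; correctly labelled weights are unchanged.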

4 Experiments

In this section, we present our experiments to evaluate the effectiveness of WikiCiKE, where we focus on the Chinese-English case; in other words, the target language is Chinese and the source language is English. It is part of our future work to try other language pairs in which the two Wikipedias are imbalanced in infobox information, such as English-Dutch.

4.1 Experimental Setup

4.1.1 Data Sets

Our data sets are from Wikipedia dumps2 generated on April 3, 2012. For each attribute, we collect both labeled articles (articles that contain the corresponding attribute attr) and unlabeled articles in Chinese. We split the labeled articles into two subsets A_T and A_test (A_T ∩ A_test = ∅), in which A_T is used as the target training articles and A_test is used as the first testing set. For the unlabeled articles, represented as A'_test, we manually label their infoboxes from their texts and use them as the second testing set. For each attribute, we also collect a set of labeled articles A_S in English as the source training data. Our experiments are performed on four attributes: occupation, nationality and alma mater in TEMPLATE PERSON, and country in TEMPLATE FILM. In particular, we extract values from the first two paragraphs of the texts because they usually contain most of the valuable information. The details of the data sets for these attributes are given in Table 4.

Table 4: Data Sets.

Attribute     |A_S|   |A_T|   |A_test|   |A'_test|
occupation    1,000    500      779        208
alma mater    1,000    200      215        208
nationality   1,000    300      430        208
country       1,000    500    1,000         −

|A|: the number of articles in A.

4.1.2 Comparison Methods

We compare our WikiCiKE method with two different kinds of methods, the monolingual knowledge extraction method and the translation-based method. They are implemented as follows:

1. KE-Mon is the monolingual knowledge extractor. The difference between WikiCiKE and KE-Mon is that KE-Mon only uses the Chinese training data.

2. KE-Tr is the translation-based extractor. It obtains the values in two steps: finding their counterparts (if available) in English using Wikipedia cross-lingual links and attribute alignments, and translating them into Chinese.

We conduct two series of evaluations to compare WikiCiKE with KE-Mon and KE-Tr, respectively.

1. We compare WikiCiKE with KE-Mon on the first testing data set A_test, where most values can be found in the texts of those labeled articles, in order to demonstrate the performance improvement gained from cross-lingual knowledge transfer.

2. We compare WikiCiKE with KE-Tr on the second testing data set A'_test, where the existence of values is not guaranteed in those randomly selected articles, in order to demonstrate the better recall of WikiCiKE.

For implementation details, the weighted-SVM is used as the basic learner f in both WikiCiKE and KE-Mon (Zhang et al., 2009), and the Translation API3 is used as the translator in both WikiCiKE and KE-Tr. The Chinese texts are preprocessed using ICTCLAS4 for word segmentation.

4.1.3 Evaluation Metrics

Following Lavelli's research on the evaluation of information extraction (Lavelli et al., 2008), we perform evaluation as follows.

1. We evaluate each attr separately.
2. For each attr, there is exactly one value extracted.
3. No alternative occurrence of the real value is available.
4. The overlap ratio is used in this paper rather than "exact matching" and "containing".

Given an extracted value v' = {w'} and its corresponding real value v = {w}, two measurements for evaluating the overlap ratio are defined.

recall: the rate of matched tokens w.r.t. the real value, calculated as

R(v', v) = |v ∩ v'| / |v|

precision: the rate of matched tokens w.r.t. the extracted value, calculated as

P(v', v) = |v ∩ v'| / |v'|

We use the averages of these two measures to evaluate the performance of our extractor:

R = avg(R_i(v', v)) for a_i ∈ A_test
P = avg(P_i(v', v)) for a_i ∈ A_test and v'_i ≠ ∅

The recall and precision range from 0 to 1 and are first calculated on a single instance and then averaged over the testing instances.

(Footnotes: 2 http://dumps.wikimedia.org/  3 http://openapi.baidu.com/service  4 http://www.ictclas.org/)

4.2 Comparison with KE-Mon

In these experiments, WikiCiKE trains extractors on A_S ∪ A_T, and KE-Mon trains extractors just on A_T. We incrementally increase the number of target training articles from 10 to 500 (if available) to compare WikiCiKE with KE-Mon in different situations. We use the first testing data set A_test to evaluate the results.

Figure 4 and Table 5 show the experimental results on TEMPLATE PERSON and FILM. We can see that WikiCiKE outperforms KE-Mon on all three attributes, especially when the number of target training samples is small. Although the recall for alma mater and the precision for nationality of WikiCiKE are lower than those of KE-Mon when only 10 target training articles are available, WikiCiKE performs better than KE-Mon if we take both precision and recall into consideration.

[Figure 4: Results for TEMPLATE PERSON. Panels (a) occupation, (b) alma mater and (c) nationality plot the precision and recall of KE-Mon and WikiCiKE against the number of target training articles; panel (d) shows the average performance gain in percent.]

Figure 4(d) shows the average improvements yielded by WikiCiKE w.r.t. KE-Mon on TEMPLATE PERSON. We can see that WikiCiKE yields significant improvements when only a few articles are available in the target language, and the improvements tend to decrease as the number of target articles increases. In that case, the articles in the target language are sufficient to train the extractors alone.

Table 5: Results for country in TEMPLATE FILM.

        KE-Mon           WikiCiKE
#       P       R        P       R
10      81.1%   63.8%    90.7%   66.3%
30      78.8%   64.5%    87.5%   69.4%
50      80.7%   66.6%    87.7%   72.3%
100     82.8%   68.2%    87.8%   72.1%
200     83.6%   70.5%    87.1%   73.2%
300     85.2%   72.0%    89.1%   76.2%
500     86.2%   73.4%    88.7%   75.6%

#: number of target training articles.

4.3 Comparison with KE-Tr

We compare WikiCiKE with KE-Tr on the second testing data set A'_test.

From Table 6 it can be clearly observed that WikiCiKE significantly outperforms KE-Tr in both precision and recall. The reasons why the recall of KE-Tr is extremely low are two-fold. First, because of the limits of cross-lingual links and infoboxes in English Wikipedia, only a very small set of values is found by KE-Tr. Furthermore, many values obtained using the translator are incorrect because of translation errors. WikiCiKE uses translators too, but it has better tolerance to translation errors because the extracted value is taken from the target article texts instead of the output of translators.

Table 6: Results of WikiCiKE vs. KE-Tr.

              KE-Tr            WikiCiKE
Attribute     P       R        P       R
occupation    27.4%   3.40%    64.8%   26.4%
nationality   66.3%   4.60%    70.0%   55.0%
alma mater    66.7%   0.70%    76.3%   8.20%

4.4 Significance Test

We conducted a significance test to demonstrate that the difference between WikiCiKE and KE-Mon is significant rather than caused by statistical errors. As for the comparison between WikiCiKE and KE-Tr, the significant improvements brought by

647 WikiCiKE can be clearly observed from Table 6 5 Related Work so there is no need for further significance test. In this paper, we use McNemar’s significance test Some approaches of knowledge extraction from (Dietterich and Thomas, 1998). the open Web have been proposed (Wu et al., 2012; Yates et al., 2007). Here we focus on the Table 7 shows the results of significance test extraction inside Wikipedia. calculated for the average on all tested attributes. When the number of target training articles is less 5.1 Monolingual Infobox Extraction than 100, the χ is much less than 10.83 that cor- KYLIN is the first system to autonomously ex- responds to a significance level 0.001. It suggests tract the missing infoboxes from the correspond- that the chance that WikiCiKE is not better than ing article texts by using a self-supervised learn- KE-Mon is less than 0.001. ing method (Wu and Weld, 2007). KYLIN per- # 10 30 50 100 200 300 500 forms well when enough training data are avail- χ 179.5 107.3 51.8 32.8 4.1 4.3 0.3 able. Such techniques as shrinkage and retraining # Number of the target training articles. are proposed to increase the recall from English Wikipedia’s long tail of sparse classes (Wu et al., Table 7: Results of Significance Test. 2008; Wu and Weld, 2010). Different from Wu’s research, WikiCiKE is a cross-lingual knowledge extraction framework, which leverags rich knowl- 4.5 Overall Analysis edge in the other language to improve extraction performance in the target Wikipedia. As shown in above experiments, we can see that WikiCiKE outperforms both KE-Mon and KE-Tr. 5.2 Cross-lingual Infobox Completion When only 30 target training samples are avail- Current translation based methods usually con- able, WikiCiKE reaches comparable performance tain two steps: cross-lingual attribute alignmen- of KE-Mon using 300-500 target training samples. t and value translation. 
The attribute alignmen- Among all of the 72 attributes in TEMPLATE t strategies can be grouped into two categories: PERSON of Chinese Wikipedia, 39 (54.17%) and cross-lingual link based methods (Bouma et al., 55 (76.39%) attributes have less than 30 and 200 2009) and classification based methods (Adar et labeled articles respectively. We can see that Wi- al., 2009; Nguyen et al., 2011; Aumueller et al., kiCiKE can save considerable human labor when 2005; Adafre and de Rijke, 2006; Li et al., 2009). no sufficient target training samples are available. After the first step, the value in the source lan- We also examined the errors by WikiCiKE and guage is translated into the target language. E. they can be categorized into three classes. For at- Adar’s approach gives the overall precision of tribute occupation when 30 target training sam- 54% and recall of 40% (Adar et al., 2009). How- ples are used, there are 71 errors. The first cat- ever, recall of these methods is limited by the egory is caused by incorrect word segmentation number of equivalent cross-lingual articles and the (40.85%). In Chinese, there is no space between number of infoboxes in the source language. It is words so we need to segment them before extrac- also limited by the quality of the translators. Wi- tion. The result of word segmentation directly kiCiKE attempts to mine the missing infoboxes decide the performance of extraction so it caus- directly from the article texts and thus achieves es most of the errors. The second category is be- a higher recall compared with these methods as cause of the incomplete infoboxes (36.62%). In shown in Section 4.3. evaluation of KE-Mon, we directly use the val- ues in infoboxex as golden values, some of them 5.3 Transfer Learning are incomplete so the correct predicted values will Transfer learning can be grouped into four cate- be automatically judged as the incorrect in these gories: instance-transfer, feature-representation- cases. 
The last category is mismatched words transfer, parameter-transfer and relational- (22.54%). The predicted value does not match the knowledge-transfer (Pan and Yang, 2010). golden value or a part of it. In the future, we can TrAdaBoost, the instance-transfer approach, is improve the performance of WikiCiKE by polish- an extension of the AdaBoost algorithm, and ing the word segmentation result. demonstrates better transfer ability than tradition-

648 al learning techniques (Dai et al., 2007). Transfer References learning have been widely studied for classifica- S. Fissaha Adafre and M. de Rijke. 2006. Find- tion, regression, and cluster problems. However, ing Similar Sentences across Multiple Languages few efforts have been spent in the information in Wikipedia. EACL 2006 Workshop on New Text: extraction tasks with knowledge transfer. and Blogs and Other Dynamic Text Sources. Sisay Fissaha Adafre and Maarten de Rijke. 2005. 6 Conclusion and Future Work Discovering Missing Links in Wikipedia. Proceed- In this paper we proposed a general cross-lingual ings of the 3rd International Workshop on Link Dis- covery. knowledge extraction framework called Wiki- CiKE, in which extraction performance in the tar- Eytan Adar, Michael Skinner and Daniel S. Weld. get Wikipedia is improved by using rich infobox- 2009. Information Arbitrage across Multi-lingual Wikipedia. WSDM’09. es in the source language. The problems of topic drift and translation error were handled by using David Aumueller, Hong Hai Do, Sabine Massmann and the TrAdaBoost model. Chinese-English exper- Erhard Rahm”. 2005. Schema and ontology match- imental results on four typical attributes showed ing with COMA++. SIGMOD Conference’05. that WikiCiKE significantly outperforms both the Christian Bizer, Jens Lehmann, Georgi Kobilarov, current translation based methods and the mono- S¨oren Auer, Christian Becker, Richard Cyganiak lingual extraction methods. In theory, WikiCiKE and Sebastian Hellmann. 2009. DBpedia - A crys- tallization Point for the Web of Data. J. Web Sem.. can be applied to any two wiki knowledge based of different languages. Christian Bizer, Tom Heath, Kingsley Idehen and Tim We have been considering some future work. Berners-Lee. 2008. on the web (L- Firstly, more attributes in more infobox templates DOW2008). WWW’08. should be explored to make our results much Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim S- stronger. 
6 Conclusion and Future Work

In this paper we proposed a general cross-lingual knowledge extraction framework called WikiCiKE, in which extraction performance in the target Wikipedia is improved by using the rich infoboxes of the source language. The problems of topic drift and translation errors were handled by using the TrAdaBoost model. Chinese-English experimental results on four typical attributes showed that WikiCiKE significantly outperforms both the current translation-based methods and the monolingual extraction methods. In theory, WikiCiKE can be applied to any two wiki knowledge bases of different languages.

We have been considering some future work. Firstly, more attributes in more infobox templates should be explored to strengthen our results. Secondly, knowledge in a minor language may also help improve extraction performance for a major language due to cultural and religious differences, so a bidirectional cross-lingual extraction approach will also be studied. Last but not least, we will try to extract multiple attribute-value pairs at the same time for each article.

Furthermore, our work is part of a more ambitious agenda on the exploitation of linked data. On the one hand, being able to extract data and knowledge from multilingual sources such as Wikipedia could help improve the coverage of linked data for applications. On the other hand, we are also investigating how to integrate information, including subjective information (Sensoy et al., 2013), from multiple sources, so as to better support data exploitation in context-dependent applications.

Acknowledgement

The work is supported by NSFC (No. 61035004), NSFC-ANR (No. 61261130588), the 863 High Technology Program (2011AA01A207), FP7-288342, the FP7 K-Drive project (286348), the EPSRC WhatIf project (EP/J014354/1) and the THU-NUS NExT Co-Lab. We also gratefully acknowledge the assistance of Haixun Wang (MSRA) in improving this paper.

References

S. Fissaha Adafre and M. de Rijke. 2006. Finding Similar Sentences across Multiple Languages in Wikipedia. EACL 2006 Workshop on New Text: Wikis and Blogs and Other Dynamic Text Sources.

Sisay Fissaha Adafre and Maarten de Rijke. 2005. Discovering Missing Links in Wikipedia. Proceedings of the 3rd International Workshop on Link Discovery.

Eytan Adar, Michael Skinner and Daniel S. Weld. 2009. Information Arbitrage across Multi-lingual Wikipedia. WSDM'09.

David Aumueller, Hong Hai Do, Sabine Massmann and Erhard Rahm. 2005. Schema and Ontology Matching with COMA++. SIGMOD Conference'05.

Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak and Sebastian Hellmann. 2009. DBpedia - A Crystallization Point for the Web of Data. J. Web Sem.

Christian Bizer, Tom Heath, Kingsley Idehen and Tim Berners-Lee. 2008. Linked Data on the Web (LDOW2008). WWW'08.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge and Jamie Taylor. 2008. Freebase: a Collaboratively Created Graph Database for Structuring Human Knowledge. SIGMOD'08.

Gosse Bouma, Geert Kloosterman, Jori Mur, Gertjan van Noord, Lonneke van der Plas and Jörg Tiedemann. 2008. Question Answering with Joost at CLEF 2007. Working Notes for the CLEF 2008 Workshop.

Gosse Bouma, Sergio Duarte and Zahurul Islam. 2009. Cross-lingual Alignment and Completion of Wikipedia Templates. CLIAWS3'09.

Wenyuan Dai, Qiang Yang, Gui-Rong Xue and Yong Yu. 2007. Boosting for Transfer Learning. ICML'07.

Thomas G. Dietterich. 1998. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Comput.

Sergio Ferrández, Antonio Toral, Óscar Ferrández, Antonio Ferrández and Rafael Muñoz. 2009. Exploiting Wikipedia and EuroWordNet to Solve Cross-Lingual Question Answering. Inf. Sci.

Aidan Finn and Nicholas Kushmerick. 2004. Multi-level Boundary Classification for Information Extraction. ECML.

Achille Fokoue, Felipe Meneguzzi, Murat Sensoy and Jeff Z. Pan. 2012. Querying Linked Ontological Data through Distributed Summarization. Proc. of the 26th AAAI Conference on Artificial Intelligence (AAAI2012).

Yoav Freund and Robert E. Schapire. 1997. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci.

Norman Heino and Jeff Z. Pan. 2012. RDFS Reasoning on Massively Parallel Hardware. Proc. of the 11th International Semantic Web Conference (ISWC2012).

Aidan Hogan, Jeff Z. Pan, Axel Polleres and Yuan Ren. 2011. Scalable OWL 2 Reasoning for Linked Data. Reasoning Web. Semantic Technologies for the Web of Data.

Andreas Hotho, Robert Jäschke, Christoph Schmitz and Gerd Stumme. 2006. Information Retrieval in Folksonomies: Search and Ranking. ESWC'06.

John D. Lafferty, Andrew McCallum and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML'01.

Alberto Lavelli, MaryElaine Califf, Fabio Ciravegna, Dayne Freitag, Claudio Giuliano, Nicholas Kushmerick, Lorenza Romano and Neil Ireson. 2008. Evaluation of Machine Learning-based Information Extraction Algorithms: Criticisms and Recommendations. Language Resources and Evaluation.

Juanzi Li, Jie Tang, Yi Li and Qiong Luo. 2009. RiMOM: A Dynamic Multistrategy Ontology Alignment Framework. IEEE Trans. Knowl. Data Eng.

Xiao Ling, Gui-Rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang and Yong Yu. 2008. Can Chinese Web Pages be Classified with English Data Source?. WWW'08.

Sheila A. McIlraith, Tran Cao Son and Honglei Zeng. 2001. Semantic Web Services. IEEE Intelligent Systems.

Thanh Hoang Nguyen, Viviane Moreira, Huong Nguyen, Hoa Nguyen and Juliana Freire. 2011. Multilingual Schema Matching for Wikipedia Infoboxes. CoRR.

Jeff Z. Pan and Edward Thomas. 2007. Approximating OWL-DL Ontologies. 22nd AAAI Conference on Artificial Intelligence (AAAI-07).

Jeff Z. Pan and Ian Horrocks. 2007. RDFS(FA): Connecting RDF(S) and OWL DL. IEEE Transactions on Knowledge and Data Engineering, 19(2): 192-206.

Jeff Z. Pan and Ian Horrocks. 2006. OWL-Eu: Adding Customised Datatypes into OWL. Journal of Web Semantics.

Sinno Jialin Pan and Qiang Yang. 2010. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng.

Nick Roussopoulos, Stephen Kelley and Frédéric Vincent. 1995. Nearest Neighbor Queries. SIGMOD Conference'95.

Murat Sensoy, Achille Fokoue, Jeff Z. Pan, Timothy Norman, Yuqing Tang, Nir Oren and Katia Sycara. 2013. Reasoning about Uncertain Information and Conflict Resolution through Trust Revision. Proc. of the 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS2013).

Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum. 2007. Yago: a Core of Semantic Knowledge. WWW'07.

Max Völkel, Markus Krötzsch, Denny Vrandecic, Heiko Haller and Rudi Studer. 2006. Semantic Wikipedia. WWW'06.

Zhichun Wang, Juanzi Li, Zhigang Wang and Jie Tang. 2012. Cross-lingual Knowledge Linking across Wiki Knowledge Bases. 21st International World Wide Web Conference.

Daniel S. Weld, Fei Wu, Eytan Adar, Saleema Amershi, James Fogarty, Raphael Hoffmann, Kayur Patel and Michael Skinner. 2008. Intelligence in Wikipedia. AAAI'08.

Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM'07.

Fei Wu and Daniel S. Weld. 2010. Open Information Extraction Using Wikipedia. ACL'10.

Fei Wu, Raphael Hoffmann and Daniel S. Weld. 2008. Information Extraction from Wikipedia: Moving down the Long Tail. KDD'08.

Wentao Wu, Hongsong Li, Haixun Wang and Kenny Qili Zhu. 2012. Probase: a Probabilistic Taxonomy for Text Understanding. SIGMOD Conference'12.

Alexander Yates, Michael Cafarella, Michele Banko, Oren Etzioni, Matthew Broadhead and Stephen Soderland. 2007. TextRunner: Open Information Extraction on the Web. NAACL-Demonstrations'07.

Xinfeng Zhang, Xiaozhao Xu, Yiheng Cai and Yaowei Liu. 2009. A Weighted Hyper-Sphere SVM. ICNC(3)'09.
