Cross-lingual Alignment and Completion of Templates

Gosse Bouma, Sergio Duarte, Zahurul Islam
Information Science, University of Groningen
[email protected] [email protected] [email protected]

Abstract

For many languages, the size of Wikipedia is an order of magnitude smaller than the English Wikipedia. We present a method for cross-lingual alignment of template and infobox attributes in Wikipedia. The alignment is used to add and complete templates and infoboxes in one language with information derived from Wikipedia in another language. We show that alignment between English and Dutch Wikipedia is accurate and that the result can be used to expand the number of template attribute-value pairs in Dutch Wikipedia by 50%. Furthermore, the alignment provides valuable information for normalization of template and attribute names and can be used to detect potential inconsistencies.

1 Introduction

One of the more interesting aspects of Wikipedia is that it has grown into a multilingual resource, with Wikipedia's for many languages, and systematic (cross-language) links between the information in different language versions. Even though English has the largest Wikipedia, for any given language the amount of information present in Wikipedia exceeds that of any single Wikipedia. One of the reasons for this is that each language version of Wikipedia has its own cultural and regional bias. It is likely, for instance, that information about the Netherlands is better represented in Dutch Wikipedia than in other Wikipedia's. Some indication that this is indeed the case comes from the fact that a Google search for 'Pim Fortuyn' in the Dutch Wikipedia gives 498 hits, whereas the English Wikipedia gives only 292 hits. Also, 21,697 pages in Dutch Wikipedia fall in a category matching 'Nederlands(e)', whereas only 9,494 pages in English Wikipedia fall in a category matching 'Dutch'. This indicates that, apart from the obvious fact that smaller Wikipedia's can be expanded with information found in the larger Wikipedia's, it is also true that even the larger Wikipedia's can be supplemented with information harvested from smaller Wikipedia's.

Wikipedia infoboxes are tabular summaries of the most relevant facts contained in an article. They represent an important source of information for general users of the encyclopedia. Infoboxes (see figure 1) encode facts using attributes and values, and therefore are easy to collect and process automatically. For this reason, they are extremely valuable for systems that harvest information from Wikipedia automatically, such as DbPedia (Auer et al., 2008). However, as Wu and Weld (2007) note, infoboxes are missing for many pages, and not all infoboxes are complete. This is particularly true for Wikipedia's in languages other than English.

Infoboxes are a subclass of Wikipedia templates, which are used by authors of Wikipedia pages to express information in a systematic way, and to ensure that formatting of this information is consistent across Wikipedia. Templates exist for referring to multimedia content, external websites, news stories, scientific sources, other on-line repositories (such as the Internet Movie Database (IMDB), medical classification systems (ICD9 and ICD10), coordinates on Google Maps, etc.). Although we are primarily interested in infoboxes, in the experiments below we take all templates into account.

Proceedings of CLIAWS3, Third International Cross Lingual Information Access Workshop, pages 21–29, Boulder, Colorado, June 2009. © 2009 Association for Computational Linguistics

{{Infobox Philosopher
| region = Western Philosophy
| era = [[19th-century philosophy]]
| image_name = George Boole.jpg
| image_caption = George Boole
| name = George Lawlor Boole
| birth = November 2, 1815
| death = December 8, 1864
| school = Mathematical foundations of [[computer science]]
| main_interests = [[Mathematics]], [[Logic]]
| ideas = [[Boolean algebra]]
}}

Figure 1: Infobox and (simplified) Wikimedia source

We plan to use information mined from Wikipedia for Question Answering and related tasks. In 2007 and 2008, the CLEF question answering track1 used Wikipedia as text collection. While the usual approach to open domain question answering relies on information retrieval for selecting relevant text snippets, and natural language processing techniques for answer extraction, an alternative stream of research has focussed on the potential of on-line data-sets for question answering (Lita et al., 2004; Katz et al., 2005). In Bouma et al. (2008) it is suggested that information harvested from infoboxes can be used for question answering in CLEF. For instance, the answer to questions such as How high is the Matterhorn?, Where was Piet Mondriaan born?, and What is the area of the country Suriname? can in principle be found in infoboxes. However, in practice the number of questions that is answered by their Dutch QA-system by means of information from infoboxes is small. One reason for this is the lack of coverage of infoboxes in Dutch Wikipedia.

In the recent GIKICLEF task2 systems have to find Wikipedia pages in a number of languages which match descriptions such as Which Australian mountains are higher than 2000 m?, French bridges which were in construction between 1980 and 1990, and African capitals with a population of two million inhabitants or more. The emphasis in this task is less on answer extraction from text (as in QA) and more on accurate interpretation of (geographical) facts known about an entity. GIKICLEF is closely related to the entity ranking task for Wikipedia, as organized by INEX.3 We believe systems participating in tasks like this could profit from large collections of ⟨entity, attribute, value⟩ triples harvested from Wikipedia templates.

In this paper, we propose a method for automatically expanding the amount of information present in the form of templates. In our experiments, we used English and Dutch Wikipedia as sources. Given a page in English, and a matching page in Dutch, we first find all English-Dutch attribute-value tuples which have a matching value.

1 http://clef-qa.itc.it
2 http://www.linguateca.pt/GikiCLEF
3 http://inex.is.informatik.uni-duisburg.de

Based on the frequency with which attributes match, we create a bidirectional, intersective alignment of English-Dutch attribute pairs. Finally, we use the set of aligned attributes to expand the number of attribute-value pairs in Dutch Wikipedia with information obtained from matching English pages. We also show that aligned attributes can be used to normalize attribute names and to detect formatting issues and potential inconsistencies in attribute values.

2 Previous Work

DbPedia (Auer et al., 2008) is a large, on-going project which concentrates on harvesting information from Wikipedia automatically, on normalization of the extracted information, on linking the information with other on-line data repositories, and on interactive access. It contains 274M facts about 2.6M entities (November, 2008). An important component of DbPedia is harvesting of the information present in infoboxes. However, as Wu and Weld (2007) note, not all relevant pages have (complete) infoboxes. The information present in infoboxes is typically also present in the running text of a page. One line of research has concentrated on using the information obtained from infoboxes as seeds for systems that learn relation extraction patterns (Nguyen et al., 2007). Wu and Weld (2007) go one step further, and concentrate on learning to complete the infoboxes themselves. They present a system which first learns to predict the appropriate infobox for a page (using text classification). Next, they learn relation extraction patterns using the information obtained from existing infoboxes as seeds. Finally, the learned patterns are applied to the text of pages for which a new infobox has been predicted, to assign values to infobox attributes. A recent paper by Adar et al. (2009) (which only came to our attention at the time of writing) starts from the same observation as we do. It presents a system for completing infoboxes for English, German, French, and Spanish, which is based on learning a mapping between infoboxes and attributes in multiple languages. A more detailed comparison between their approach and ours is given in section 6.

The potential of the multilingual nature of Wikipedia has been explored previously by several researchers. Adafre and de Rijke (2006) explore machine translation and (cross-lingual) link structure to find sentences in English and Dutch Wikipedia which express the same content. Bouma et al. (2006) discuss a system for the English-Dutch QA task of CLEF. They basically use a Dutch QA-system, which takes questions automatically translated from English (by the on-line Babelfish translation service). To improve the quality of the translation of named entities, they use, among others, cross-language links obtained from Wikipedia. Erdmann et al. (2008) explore the potential of Wikipedia for the extraction of bilingual terminology. They note that apart from the cross-language links, page redirects and anchor texts (i.e. the text that is used to label a hypertext reference to another (Wikipedia) page) can be used to obtain large and accurate bilingual term lists.

3 Data collection and Preparation

We used a dump of Dutch Wikipedia (June 2008) and English Wikipedia (August 2007) made available by the University of Amsterdam4 and converted to an XML-format particularly suitable for information extraction tasks.

From these two collections, for each page, we extracted all attribute-value pairs found in all templates. Results were stored as quadruples of the form ⟨Page, TemplateName, Attribute, Value⟩. Each TemplateName∼Attribute pair expresses a specific semantic relation between the entity or concept described by the Page and a Value. Values can be anything, but often refer to another Wikipedia page (i.e. ⟨George Boole, Philosopher∼notable ideas, Boolean algebra⟩, where Boolean algebra is a link to another page) or to numeric values, amounts, and dates. Note that attributes by themselves tend to be highly ambiguous, and often can only be interpreted in the context of a given template. The attribute period, for instance, is used to describe chemical elements, royal dynasties, countries, and (historical) means of transportation. Another source of attribute ambiguity is the fact that many templates simply number their attributes. As we are interested in finding an alignment between semantically meaningful relations in the two collections, we will therefore concentrate on the problem of finding an alignment between TemplateName∼Attribute pairs in English and Dutch Wikipedia.

4 http://ilps.science.uva.nl/WikiXML
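In outline, the extraction into quadruples can be sketched as a small function over raw template source. This is a simplified illustration only: the actual extraction ran over the XML version of the dumps, and real wikitext allows nested templates, which this regex does not handle.

```python
import re

def extract_quadruples(page, wikitext):
    """Return (Page, TemplateName, Attribute, Value) tuples for every
    {{Template | attr = value | ... }} occurrence in a page's source.
    Simplified sketch: assumes templates are not nested."""
    quadruples = []
    for match in re.finditer(r"\{\{([^|{}]+)((?:\|[^|{}]*)*)\}\}", wikitext):
        template = match.group(1).strip()
        # Fields after the template name are separated by '|'; keep only
        # named attributes of the form 'attr = value'.
        for field in match.group(2).split("|")[1:]:
            if "=" in field:
                attr, value = field.split("=", 1)
                quadruples.append((page, template, attr.strip(), value.strip()))
    return quadruples
```

Applied to the source of figure 1, this yields quadruples such as ⟨George Boole, Infobox Philosopher, ideas, [[Boolean algebra]]⟩.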

                      Dutch       English
date                  June 2008   August 2007
pages                 715,992     3,840,950
pages with template   290,964     757,379
cross-language links  126,555
templates             550,548     1,074,935
tuples                4,357,653   5,436,033
template names        2,350       7,783
attribute names       7,510       19,378
templ∼attr pairs      23,399      81,671

Table 1: Statistics for the version of Dutch and English Wikipedia used in the experiment.

Some statistics for the two collections are given in table 1. The number of pages is the count for all pages in the collection. It should be noted that these contain a fair number of administrative pages, pages for multimedia content, redirect pages, page stubs, etc. The number of pages which contain content that is useful for our purposes is therefore probably a good deal lower, and is maybe closer to the number of pages containing at least one template. Cross-language links (i.e. links from an English page to the corresponding Dutch page) were extracted from English Wikipedia. The fact that 0.5M templates in Dutch give rise to 4.3M tuples, whereas 1.0M templates in English give rise to only 5.4M tuples, is perhaps a consequence of the fact that the two collections are not from the same date, and thus may reflect different stages of the development of the template system.

We did spend some time on normalization of the values found in extracted tuples. Our alignment method relies on the fact that for a sufficient number of matching pages, tuples can be found with matching values. Apart from identity and Wikipedia cross-language links, we rely on the fact that dates, amounts, and numerical values can often be recognized and normalized easily, thus increasing the number of tuples which can be used for alignment. Normalization addresses the fact that the use of commas and periods (and spaces) in numbers is different in English and Dutch Wikipedia, and that dates need to be converted to a standard. English Wikipedia expresses distances and heights in miles and feet, weights in pounds, etc., whereas Dutch Wikipedia uses kilometres, metres, and kilograms. Where English Wikipedia mentions both miles and kilometres we preserve only the kilometres. In other situations we convert miles to kilometres. In spite of this effort, we noted that there are still quite a few situations which are not covered by our normalization patterns. Sometimes numbers are followed or preceded by additional text (approx. 14.5 MB, 44 minutes per episode), sometimes there is irregular formatting (October 101988), and some units simply are not handled by our normalization yet (i.e. converting square miles to square kilometres). We come back to this issue in section 6.

Krötzsch et al. (2007) have also observed that there is little structure in the way numeric values, units, dates, etc. are represented in Wikipedia. They suggest a tagging system similar to the way links to other Wikipedia pages are annotated, but now with the aim of representing numeric and temporal values systematically. If such a system was to be adopted by the Wikipedia community, it would greatly facilitate the processing of such values found in infoboxes.

4 Alignment

In this section we present our method for aligning English Template∼Attribute pairs with corresponding pairs in Dutch. The first step is creating a list of matching tuples.

Step 1. Extract all matching template tuples. An English ⟨Page_e, Templ_e∼Attr_e, Val_e⟩ tuple matches a Dutch ⟨Page_d, Templ_d∼Attr_d, Val_d⟩ tuple if Page_e matches Page_d and Val_e matches Val_d and there is no other tuple for either Page_e or Page_d with value Val_e or Val_d. Two pages or values E and D match if there exists a cross-language link which links E and D, or if E=D.

We only take into account tuples for which there is a unique (non-ambiguous) match between English and Dutch.
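The matching predicate of step 1, together with the number and unit normalization of section 3, can be sketched as follows. The rules shown are illustrative simplifications, not the exact pattern set used in the experiment:

```python
import re

def normalize(value):
    """Simplified value normalization: strip thousand separators from
    numbers ('23,400', '23 400' and '23.400' all become '23400') and
    convert '<n> miles' to kilometres."""
    value = value.strip()
    # English-style '23,400' and Dutch-style '23 400' / '23.400'.
    if re.fullmatch(r"\d{1,3}(?:[ .,]\d{3})+", value):
        return re.sub(r"[ .,]", "", value)
    # Convert miles to kilometres (rounding here is an assumption).
    m = re.fullmatch(r"([\d.]+)\s*miles?", value)
    if m:
        return "%.1f km" % (float(m.group(1)) * 1.609344)
    return value

def values_match(val_e, val_d, cross_links):
    """Two values match if a cross-language link connects them, or if
    they are identical after normalization (step 1)."""
    return cross_links.get(val_e) == val_d or normalize(val_e) == normalize(val_d)
```

Note that such rules are heuristic: a separator pattern like '23.400' is ambiguous between a Dutch thousand separator and an English decimal point, which is one reason normalization does not cover all cases.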

Many infoboxes contain attributes which often take the same value (i.e. title and imdb title for movies). Other cases of ambiguity are caused by numerical values which incidentally may take on identical values. Such ambiguous cases are ignored. Step 1 gives rise to 149,825 matching tuples.5 It might seem that we find matching tuples for only about 3-4% of the tuples present in Dutch Wikipedia. Note, however, that while there are 290K pages with a template in Dutch Wikipedia, there are only 126K cross-language links. The total number of tuples on Dutch pages for which a cross-language link to English exists is 837K. If all these tuples have a counterpart in English Wikipedia (which is highly unlikely), our method finds a match for 18% of the relevant template tuples.6

The second step consists of extracting matching English-Dutch Template∼Attribute pairs from the list of matching tuples constructed in step 1.

Step 2. For each matching pair of tuples ⟨Page_e, Templ_e∼Attr_e, Val_e⟩ and ⟨Page_d, Templ_d∼Attr_d, Val_d⟩, extract the English-Dutch pair of Template∼Attributes ⟨Templ_e∼Attr_e, Templ_d∼Attr_d⟩.

In total, we extracted 7,772 different English-Dutch ⟨Templ_e∼Attr_e, Templ_d∼Attr_d⟩ tuples. In 547 cases Templ_e∼Attr_e = Templ_d∼Attr_d. In 915 cases, Attr_e = Attr_d. In the remaining 6,310 cases, Attr_e ≠ Attr_d. The matches are mostly accurate. We evaluated 5% of the matching template∼attribute pairs that had been found at least 2 times. For 27% of these (55 out of 205), it was not immediately clear whether the match was correct, because one of the attributes was a number. Among the remaining 150 cases, the only clear error seemed to be a match between the attributes trainer and manager (for soccer club templates). Other cases which are perhaps not always correct were mappings between successor, successor1, successor2 on the one hand and after/next on the other hand. The attributes with a number suffix probably refer to the nth successor, whereas the attributes without suffix probably refer to the immediate successor.

On the other hand, for some frequent template∼attribute pairs, ambiguity is clearly an issue. For the English pair cite web∼title, for instance, 51 different mappings are found. The most frequent cases are shown in table 2. Note that it would be incorrect to conclude from this that, for every English page which contains a cite web∼title pair, the corresponding Dutch page should include, for instance, a taxobox∼w-naam tuple.

101  cite web∼title
 27  voetnoot web∼titel
 12  film∼titel
 10  commons∼1
  7  acteur∼naam
  6  game∼naam
  5  ster∼naam
  4  taxobox zoogdier∼w-naam
  4  plaats∼naam
  4  band∼band naam
  3  taxobox∼w-naam

Table 2: Dutch template∼attribute pairs matching English cite web∼title. Counts refer to the number of pages with a matching value.

In the third and final step, the actual alignment between English-Dutch template∼attribute pairs is established, and ambiguity is eliminated.

Step 3. Given the list of matching template∼attribute pairs computed in step 2 with a frequency ≥ 5, find for each English Templ_e∼Attr_e pair the most frequent matching Dutch pair Templ_d∼Attr_d. Similarly, for each Dutch pair Templ_d∼Attr_d, find the most frequent English pair Templ_e∼Attr_e. Return the intersection of both lists.

2,070 matching template∼attribute tuples are seen at least 5 times. Preference for the most frequent bidirectional match leaves 1,305 template∼attribute tuples. Examples of aligned tuples are given in table 3. We evaluated 10% of the tuples containing meaningful attributes (i.e. not numbers or single letters).

5 It is interesting to note that 51K of these matching tuples are for pages that have an identical name in English and Dutch, but were absent in the table of cross-language links. As a result, we find 32K pages with an identical name in English and Dutch, and at least one pair of matching tuples. We suspect that these newly discovered cross-language links are highly accurate.
6 If we also include English pages with a name identical to a Dutch page, the maximum number of matching tuples is 1.1M, and we find a match for 14% of the data.
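Steps 2 and 3 reduce to counting how often each English-Dutch pair of template∼attribute names co-occurs in the matching tuples, and then intersecting the two directional most-frequent mappings. A compact sketch (the data structures, and the pair names in the usage example, are illustrative; the example counts are invented):

```python
from collections import Counter

def align_pairs(matches, min_freq=5):
    """Bidirectional, intersective alignment (step 3).
    `matches` is a list of (english_pair, dutch_pair) tuples, one entry
    per matching template tuple produced by step 1."""
    counts = Counter(matches)                     # step 2: count co-occurrences
    en_best, du_best = {}, {}
    for (en, du), c in counts.items():
        if c < min_freq:                          # frequency threshold (>= 5)
            continue
        # Most frequent Dutch pair for each English pair, and vice versa.
        if c > counts.get((en, en_best.get(en)), 0):
            en_best[en] = du
        if c > counts.get((du_best.get(du), du), 0):
            du_best[du] = en
    # Keep only pairs on which both directions agree.
    return {(en, du) for en, du in en_best.items() if du_best.get(du) == en}
```

For instance, if "cite web∼title" co-occurs 6 times with "voetnoot web∼titel" and 5 times with "film∼titel", only the former survives the bidirectional intersection.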

English                    Dutch
actor∼spouse               acteur∼partner
book∼series                boek∼reeks
casino∼owner               casino∼eigenaar
csi character∼portrayed    csi personage∼acteur
dogbreed∼country           hond∼land
football club∼ground       voetbal club∼stadion
film∼writer                film∼schrijver
mountain∼range             berg∼gebergte
radio station∼airdate      radiozender∼lancering

Table 3: Aligned template∼attribute pairs

                  pages     triples
existing pages    50,099    253,829
new cross-links   11,526    43,449
new dutch pages   321,069   1,931,277
total             382,694   2,228,555

Table 4: Newly inferred template tuples

In 117 tuples, we discovered two errors: ⟨aircraft specification∼number of props, gevechtsvliegtuig∼bemanning⟩ aligns the number of engines with the number of crew members (based on 10 matching tuples), and ⟨book∼country, film∼land⟩ involves a mismatch of templates, as it links the country attribute for a book to the country attribute for a movie.

Note that step 3 is similar to bidirectional intersective word alignment as used in statistical machine translation (see Ma et al. (2008), for instance). This method is known for giving highly precise results.

5 Expansion

We can use the output of step 3 of the alignment method to check for each English tuple whether a corresponding Dutch tuple can be predicted. If the tuple does not exist yet, we add it. In total, this gives rise to 2.2M new tuples for 382K pages for Dutch Wikipedia (see table 4). We generate almost 300K new tuples for existing Dutch pages (250K for pages for which a cross-language link already existed). This means we expand the total number of tuples for existing pages by 27%. Most tuples, however, are generated for pages which do not yet exist in Dutch Wikipedia. These are perhaps less useful, although one could use the results as knowledge for a QA-system, or to generate stubs for new Wikipedia pages which already contain an infobox and other relevant templates.

The 100 most frequently added template∼attribute pairs (ranging from music album∼genre (added 31,392 times) to single∼producer (added 5,605 times)) are dominated by templates for music albums, geographical places, actors, movies, and taxonomy infoboxes.

We evaluated the accuracy of the newly generated tuples for 100 random existing Dutch Wikipedia pages, to which at least one new tuple was added. The pages contained 802 existing tuples. 876 tuples were added by our automatic expansion method. Of these newly added tuples, 62 contained a value which was identical to the value of an already existing tuple (i.e. we add the tuple ⟨Reuzenhaai, taxobox∼naam, Reuzenhaai⟩ where there was already an existing tuple ⟨Reuzenhaai, taxobox begin∼name, Reuzenhaai⟩ – note that we add a properly translated attribute name, where the original tuple contains a name copied from English!). The newly added tuples contained 60 tuples of the form ⟨Aegna, plaats∼lat dir, N (letter)⟩, where the value should have been N (the symbol for latitude on the Northern hemisphere in geographical coordinates), and not the letter N. One page (Akira) was expanded with an incoherent set of tuples, based on tuples for the manga, anime, and music producer with the same name. Apart from this failure, there were only 5 other clearly incorrect tuples (adding, among others, place∼name to Albinism, adding name in Dutch with an English value to Macedonian, and adding community∼name to Christopher Columbus). In many cases, added tuples are based on a different template for the same entity, often leading to almost identical values (i.e. adding geographical coordinates using slightly different notation). In one case, Battle of Dogger Bank (1915), the system added new tuples based on a template that was already in use for the Dutch page as well, thus automatically updating and expanding an existing template.
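Given the alignment of step 3 and the cross-language page links, the expansion itself is essentially a join over the English tuples. A sketch under simplifying assumptions (the Dutch template and attribute names in the usage example are hypothetical, and values that are themselves page names would additionally need to be mapped through the cross-language links, which this sketch omits):

```python
def expand(english_tuples, alignment, page_links, dutch_tuples):
    """Predict new Dutch tuples from English ones.
    alignment:    {(templ_e, attr_e): (templ_d, attr_d)} from step 3
    page_links:   {english page: dutch page} cross-language links
    dutch_tuples: set of existing (page, template, attr, value) tuples."""
    new = set()
    for page_e, templ_e, attr_e, val in english_tuples:
        pair_d = alignment.get((templ_e, attr_e))
        page_d = page_links.get(page_e)
        if pair_d is None or page_d is None:
            continue
        tup = (page_d, pair_d[0], pair_d[1], val)
        if tup not in dutch_tuples:   # only add tuples that do not exist yet
            new.add(tup)
    return new
```

English tuples whose template∼attribute pair was not aligned, or whose page has no Dutch counterpart, are simply skipped.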
6 Discussion

6.1 Detecting Irregularities

Instead of adding new information, one may also search for attribute-value pairs in two Wikipedia's that are expected to have the same value, but do not. Given an English page with attribute-value pair ⟨Attr_e, Val_e⟩, and a matching Dutch page with ⟨Attr_d, Val_d⟩, where Attr_e and Attr_d have been aligned, one expects Val_e and Val_d to match as well. If this is not the case, something irregular is observed. We have applied the above rule to our dataset, and detected 79K irregularities. An overview of the various types of irregularities is given in table 5. Most of the non-matching values are the result of formatting issues, lack of translations, one value being more specific than the other, and finally, inconsistencies. Note that inconsistencies may also mean that one value is more recent than the other (population, (stadium) capacity, latest release date, spouse, etc.). A number of formatting issues (of numbers, dates, periods, amounts, etc.) can be fixed easily, using the current list of irregularities as starting point.

6.2 Normalizing Templates

It is interesting to note that alignment can also be used to normalize template attribute names. Table 6 illustrates this for the Dutch attribute geboren and the English attribute population. Both are aligned with a range of attribute names in the other language. Such information is extremely valuable for applications that attempt to harvest knowledge from Wikipedia, and merge the result in an ontology, or attempt to use the harvested information in an application. For instance, a QA-system that has to answer questions about birth dates or populations has to know which attributes are used to express this information. Alternatively, one can also use this information to normalize attribute names. In that case, all attributes which express the birth date property could be replaced by birth date (the most frequent attribute currently in use for this relation). This type of normalization can greatly reduce the noisy character of the current infoboxes. For instance, there are many infoboxes in use for geographic locations, people, works of art, etc. These infoboxes often contain information about the same properties, but, as illustrated above, there is no guarantee that these are always expressed by the same attribute.

geboren             population
23 birth date       49 inwoners
16 date of birth     9 population
 8 date of birth     5 bevolking
 8 dateofbirth       4 inwonersaantal
 2 born              3 inwoneraantal
 2 birth             2 town pop
 1 date birth        2 population total
 1 birthdate         1 townpop
                     1 inw.
                     1 einwohner

Table 6: One-to-many aligned attribute names. Counts are for the number of (aligned) infoboxes that contain the attribute.

6.3 Alignment by means of translation

Template and attribute names in Dutch often are straightforward translations of the English name, e.g. luchtvaartmaatschappij/airline, voetbalclub/football club, hoofdstad/capital, naam/name, postcode/postalcode, netnummer/area code and opgericht/founded. One might use this information as an alternative for determining whether two template∼attribute pairs express the same relation. We performed a small experiment on infoboxes expressing geographical information, using Wikipedia cross-language links and an on-line dictionary as multilingual dictionaries. We found that 10 to 15% of the attribute names (depending on the exact subset of infoboxes taken into consideration) could be connected using dictionaries. When combined with the attributes found by means of alignment, coverage went up to maximally 38%.

6.4 Comparison

It is hard to compare our results with those of Adar et al. (2009). Their method uses a Boolean classifier which is trained using a range of features to determine whether two values are likely to be equivalent (including identity, string overlap, link relations, translation features, and correlation of numeric values). Training data is collected automatically by

English attr    Dutch attr       English value                  Dutch value              Type
capacity        capaciteit       23,400                         23 400                   formatting
nm              lat min          04                             4                        formatting
date            date             1775-1783                      1775–1783                formatting
name            naam             African baobab                 Afrikaanse baobab        translation
artist          artiest          Various Artists                Verschillende artiesten  translation
regnum          rijk             Plantae                        Plantae (Planten)        specificity
city            naam             Comune di Adrara San Martino   Adrara San Martino       specificity
birth date      geboren          1934                           1934-8-25                specificity
imagepath coa   wapenafbeelding  coa missing.jpg                Alvaneu wappen.svg       specificity
population      inwonersaantal   5345                           5369                     inconsistent
capacity        capaciteit       13,152                         14 400                   inconsistent
dateofbirth     geboortedatum    2 February 1978                1978-1-2                 inconsistent
elevation       hoogte           300                            228                      inconsistent

Table 5: Irregular values in aligned attributes on matching pages
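The categories of table 5 can be approximated automatically. A rough sketch that separates pure formatting differences from numeric inconsistencies; the labels and the digits-only heuristic are our illustration, not the procedure used to produce the table:

```python
import re

def classify_irregularity(val_e, val_d):
    """Aligned attributes on matching pages should have matching values;
    label the pairs that do not. Comparing digits only separates pure
    formatting differences from real numeric inconsistencies."""
    if val_e == val_d:
        return "ok"
    digits_e = re.sub(r"\D", "", val_e)
    digits_d = re.sub(r"\D", "", val_d)
    if digits_e and digits_e == digits_d:
        return "formatting"       # e.g. '23,400' vs '23 400'
    if digits_e and digits_d:
        return "inconsistent"     # e.g. capacity '13,152' vs '14 400'
    return "translation/other"    # e.g. 'African baobab' vs 'Afrikaanse baobab'
```

Such a heuristic cannot distinguish translations from specificity differences, which is why a manual inspection step remains necessary.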

selecting highly similar tuples (i.e. with identical template and attribute names) as positive data, and a random tuple from the same page as negative data. The accuracy of the classifier is 90.7%. Next, for each potential pairing of template∼attribute pairs from two languages, random tuples are presented to the classifier. If the ratio of positively classified tuples exceeds a certain threshold, the two template∼attribute pairs are assumed to express the same relation. The accuracy of the result varies, with matchings of template∼attribute pairs that are based on the most frequent tuple matches having an accuracy score of 60%. They also evaluate their system by determining how well the system is able to predict known tuples. Here, recall is 40% and precision is 54%. The recall figure could be compared to the 18% tuples (for pages related by means of a cross-language link) for which we find a match. If we use only properly aligned template∼attribute pairs, however, coverage will certainly go down somewhat. Precision could be compared to our observation that we find 149K matching tuples, and, after alignment, predict an equivalence for 79K tuples which in the data collection do not have a matching value. Thus, for 228K tuples we predict equivalent values, whereas this is only the case for 149K tuples. We would not like to conclude from this, however, that the precision of our method is 65%, as we observed in section 5 that most of the conflicting values are not inconsistencies, but more often the consequence of formatting irregularities, translations, variation in specificity, etc. It is clear that the system of Adar et al. (2009) has a higher recall than ours. This appears to be mainly due to the fact that their feature-based approach to determining matching values considers much more data to be equivalent than our approach, which normalizes values and then requires identity or a matching cross-language link. In future work, we would like to explore more rigorous normalization (taking the data discussed in section 5 as starting point) and inclusion of features to determine approximate matching to increase recall.

7 Conclusions

We have presented a method for automatically completing Wikipedia templates which relies on the multilingual nature of Wikipedia and on the fact that systematic links exist between pages in various languages. We have shown that matching template tuples can be found automatically, and that an accurate set of matching template∼attribute pairs can be derived from this by using intersective bidirectional alignment. The method extends the number of tuples by 51% (27% for existing Dutch pages).

In future work, we hope to include more languages, investigate the value of (automatic) translation for template and attribute alignment, investigate alternative alignment methods (using more features and other weighting schemes), and incorporate the expanded data set in our QA-system for Dutch.

References

S.F. Adafre and M. de Rijke. 2006. Finding similar sentences across multiple languages in Wikipedia. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 62–69.

E. Adar, M. Skinner, and D.S. Weld. 2009. Information arbitrage across multi-lingual Wikipedia. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 94–103. ACM, New York, NY, USA.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2008. DBpedia: A nucleus for a web of open data. The Semantic Web, pages 722–735.

Gosse Bouma, Ismail Fahmi, Jori Mur, Gertjan van Noord, Lonneke van der Plas, and Jörg Tiedemann. 2006. The University of Groningen at QA@CLEF 2006: Using syntactic knowledge for QA. In Working Notes for the CLEF 2006 Workshop, Alicante.

Gosse Bouma, Jori Mur, Gertjan van Noord, Lonneke van der Plas, and Jörg Tiedemann. 2008. Question answering with Joost at QA@CLEF 2008. In Working Notes for the CLEF 2008 Workshop, Aarhus.

M. Erdmann, K. Nakayama, T. Hara, and S. Nishio. 2008. An approach for extracting bilingual terminology from Wikipedia. Lecture Notes in Computer Science, 4947:380.

B. Katz, G. Marton, G. Borchardt, A. Brownell, S. Felshin, D. Loreto, J. Louis-Rosenberg, B. Lu, F. Mora, S. Stiller, et al. 2005. External knowledge sources for question answering. In Proceedings of the 14th Annual Text REtrieval Conference (TREC 2005), November.

M. Krötzsch, D. Vrandečić, M. Völkel, H. Haller, and R. Studer. 2007. Semantic Wikipedia. Web Semantics: Science, Services and Agents on the World Wide Web.

L.V. Lita, W.A. Hunt, and E. Nyberg. 2004. Resource analysis for question answering. In Association for Computational Linguistics Conference (ACL).

Y. Ma, S. Ozdowska, Y. Sun, and A. Way. 2008. Improving word alignment using syntactic dependencies. In Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2), pages 69–77.

D.P.T. Nguyen, Y. Matsuo, and M. Ishizuka. 2007. Relation extraction from Wikipedia using subtree mining. In Proceedings of the National Conference on Artificial Intelligence, page 1414. AAAI Press.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying Wikipedia. In CIKM '07: Proceedings of the sixteenth ACM conference on Conference on Information and Knowledge Management, pages 41–50, New York, NY, USA. ACM.
