ANALYSING LANGUAGE-SPECIFIC DIFFERENCES IN

MULTILINGUAL WIKIPEDIA

Faculty of Electrical Engineering and Computer Science of the Gottfried Wilhelm Leibniz Universität Hannover, for the attainment of the degree of

Master of Science

M. Sc.

Thesis by

Simon Gottschalk

First examiner: Prof. Dr. techn. Wolfgang Nejdl
Second examiner: Prof. Dr. Robert Jäschke
Supervisor: Dr. Elena Demidova

2015

ABSTRACT

Wikipedia is a free encyclopedia that has editions in more than 280 languages. While Wikipedia articles referring to the same entity often co-exist in many Wikipedia language editions, such articles evolve independently and often contain complementary information or represent a community-specific point of view on the entity under consideration. In this thesis we analyse features that enable uncovering such edition-specific aspects within Wikipedia articles in order to provide users with an overview of the overlapping and complementary information available for an entity in different language editions. We compare Wikipedia articles at different levels of granularity: First, we identify similar sentences. Then, these sentences are merged to align similar paragraphs. Finally, a similarity score at the article level is computed. To align sentences, we employ syntactic and semantic features including cosine similarity, links to other Wikipedia articles and time expressions. We evaluated the sentence alignment function on a dataset containing 1155 sentence pairs extracted from 59 articles in German and English that had been annotated during a user study. Our evaluation results demonstrate that the inclusion of semantic features can lead to an improvement of the break-even point from 70.95% to 77.52% on this dataset. Given the sentence alignment function, we developed an algorithm to build similar paragraphs starting from the sentences that have been aligned before. We implemented a visualization of the algorithm results that enables users to obtain an overview of the similarities and differences in the articles by looking at the paragraphs aligned by the proposed algorithm and at the other paragraphs, whose contents are unique to an article in a specific language edition. To further support this comparison, we defined an overall article similarity score and applied it to illustrate temporal differences between article editions. Finally, we created a Web-based application presenting our results and visualising all the aspects described above. In future work, the algorithms developed in this thesis can be directly applied to help Wikipedia authors by providing an overview of the entity representation across Wikipedia language editions. These algorithms can also build a basis for cultural research towards a better understanding of the language-specific similarities and differences in multilingual Wikipedia.

Contents

Table of Contents

List of Figures

List of Tables

List of Algorithms

1 Introduction
  1.1 Motivation
  1.2 Problem Definition
  1.3 Overview

2 Background on Multilingual Wikipedia
  2.1 Overview
  2.2 Wikipedia Guidelines
    2.2.1 Translations
    2.2.2 Neutrality
  2.3 Linguistic Point of View
  2.4 Reasons for Multilingual Differences
  2.5 Wikipedia Studies

3 Background on Multilingual Text Processing
  3.1 NLP for Multilingual Text
    3.1.1 Machine Translation
    3.1.2 Textual Features
    3.1.3 Topic Extraction
    3.1.4 Sentence Splitting
    3.1.5 Other NLP techniques
  3.2 Aligning Multilingual Text
    3.2.1 Comparable Corpora
    3.2.2 Plagiarism Detection in Multilingual Text

4 Approach Overview

5 Feature Selection and Extraction
  5.1 Syntactic Features
  5.2 Evaluation on Sentence Similarity of Parallel Corpus
  5.3 Semantic Features
  5.4 Evaluation of Entity Extraction Tools
    5.4.1 Aim and NER tools
    5.4.2 Data
    5.4.3 Entity Extraction and Comparison
    5.4.4 Comparison
    5.4.5 Results

6 Sentence Alignment and Evaluation
  6.1 Data
  6.2 Pre-Selection of Sentence Pairs
  6.3 Selection of Sentence Pairs for Evaluation
  6.4 User Study
  6.5 Judgement of Similarity Measures
  6.6 Second Dataset
  6.7 Pre-Selection and Creation of Similarity Function
  6.8 Results

7 Paragraph Alignment and Article Comparison
  7.1 Finding Similar Paragraphs
    7.1.1 Aggregation of Neighboured Sentences
    7.1.2 Aggregation of Proximate Sentence Pairs
    7.1.3 Paragraph Aligning Algorithm
  7.2 Similarity on Article Level
    7.2.1 Text Similarity
    7.2.2 Feature Similarity
    7.2.3 Overall Similarity
  7.3 Visualisation

8 Implementation
  8.1 Data Model
  8.2 Comparison Extracting
  8.3 Preprocessing Pipeline
  8.4 Text Parsing
  8.5 Resources

9 Discussion and Future Work
  9.1 Discussion
  9.2 Future Research Directions

Bibliography

List of Figures

1.1 Text Comparison Example

2.1 English Wikipedia Article "Großer Wannsee"
2.2 Interlanguage Links for the English Article "Pfaueninsel"

3.1 First Paragraphs of the Wikipedia Article "Berlin"

4.1 Process of Article Comparison

5.1 Precision Recall Graphs for Textual Features with Break-Even Points
5.2 Box Plots for Textual Features

6.1 Screenshot of User Study on Similar Sentences
6.2 Correlation of Syntactic Features for First Data Set
6.3 Correlation of Text Length Similarity
6.4 Correlation of External Links Similarity
6.5 Correlation of Time and Entity Similarity for the First Dataset
6.6 Iteration to Create Similarity Functions
6.7 Precision-Recall Diagram of Sentences with Overlapping Facts
6.8 Precision-Recall Diagram of Sentences with the Same Facts
6.9 Precision-Recall Diagram of Sentences with the Same Facts (Adjusted Similarity Functions)

7.1 Paragraph Construction Example (Step 1)
7.2 Paragraph Construction Example (Steps 2 and 3)
7.3 Paragraph Construction Example (Steps 4 and 5)
7.4 Comparison of the English and German Article on "Knipp"
7.5 Website Example: Text
7.6 Website Example: Links
7.7 Website Example: Images
7.8 Website Example: Authors
7.9 Website Example: Overall Similarity

8.1 Data Model
8.2 Preprocessing Pipeline

List of Tables

2.1 Statistics on Wikipedias in Different Languages

3.1 Machine Translation Example

5.1 Example Sentence Pairs for Time Similarity
5.2 Statistics of the N3 Dataset
5.3 Number of Entities Extracted from English Texts
5.4 Number of Entities Extracted from German Texts
5.5 Results of Entity Extraction

6.1 Wikipedia Articles Used in the User Study
6.2 Feature Combination Distribution in 14 Wikipedia Articles
6.3 Weights of Similarity Functions for Pre-Selection
6.4 Feature Combination Distribution in Pre-Selected Sentence Pairs
6.5 Feature Distribution in the Dataset for the First Round of Evaluation
6.6 Correlation Coefficients for Similarity Measures
6.7 Dataset Evaluated in the Second Round
6.8 Retrieved Sentence Pairs per Article Pair

7.1 Composition of Overall Similarity
7.2 60 Wikipedia Article Pairs Ordered by Overall Similarity

8.1 Example of Revisions of an Article in Different Languages
8.2 Example of Revision Triples

List of Algorithms

5.1 Computation of TP, FP and FN for the Evaluation of Entity Extraction
6.1 Identification of Candidates for Similar Sentences
7.1 Extension of Sentence Pairs with Neighbours
7.2 Extension of a Sentence with its Neighbours
7.3 Aggregation of Sentence Pairs
7.4 Paragraph Alignment

1 Introduction

Wikipedia1 is a user-generated online encyclopaedia that is available in more than 280 languages and is widely used: the English Wikipedia alone currently counts more than 24 million registered users, and each of the 12 largest language editions contains more than a million articles2. Wikipedia articles describing real-world entities, topics, events and concepts evolve independently in different language editions. Up to now, there are only insufficient possibilities to benefit from the knowledge that can be gained from these differences, although this could be useful for social research or for extending Wikipedia articles with content from other language versions. Therefore, in this thesis we propose methods to automate a detailed comparison of Wikipedia articles that describe the same entities in different languages and create an example application that presents the findings to human users. Wikipedia articles can be compared at different levels of granularity. In this work we focus on three levels: the sentence level, the paragraph level and the article level. They are processed in a bottom-up order: similar sentences are identified and merged to find similar paragraphs. The fraction of overlapping paragraphs is then used as an important component of the similarity score at the article level. At first, we develop methods to identify and align similar sentences in the articles. To do so, we analyse the effectiveness of several syntactic and semantic features extracted from the texts. Moreover, we go further than related studies in this field by aligning not only sentences with the same facts, but also sentences with partly overlapping contents. As this step builds the foundation for the paragraph alignment and the article comparison, we perform an extensive user study to evaluate and fine-tune our proposed similarity functions. In the second step, we use the resulting sentence alignment to develop algorithms for the alignment of similar paragraphs. This paragraph alignment method contributes to an improved visualisation of the textual comparison by creating bigger paragraphs from the sentence pairs that were aligned in the previous step.

1http://www.wikipedia.org/
2http://meta.wikimedia.org/wiki/List_of_Wikipedias

Finally, as Wikipedia articles contain much more information than the raw texts (images, authors, links, etc.), we define further similarity measures that are applied at the article level to compute an overall similarity value for two articles in different languages. With these approaches to find similarities and differences across article pairs that describe the same entity in different languages, many investigations of cross-lingual differences become possible: amongst others, we implement applications that illustrate the development of article similarity over time, rank article pairs by their similarity and juxtapose the article texts in different languages to visualise common paragraphs. These applications can support Wikipedia editors and researchers by providing an overview of the similarities and differences of the articles and their temporal development.

1.1 Motivation

While collaboration is an indispensable part of Wikipedia editing within one language edition, it becomes a problem across languages: apart from the language links interlinking articles on the same entities, multilingual coordination is difficult across Wikipedia – each Wikipedia even has a separate set of user accounts3. Therefore, a tool that compares articles across languages can help to bridge this gap. Further aspects that our research aims at are listed below:

• Social and cultural research: As Wikipedia articles are continuously written over a long period of time by a large number of editors, a study of Wikipedia articles can always be seen as an investigation of its users as well.

• Help for Wikipedia authors: When Wikipedia authors want to add something to an article, it is very probable that they will find additional information in an article in another language. If we provide a means to visualise the text passages or concepts that do not occur in the version in the author's language, they can quickly get an idea of which information is worth adding to the article.

• Trustworthiness of Wikipedia: Wikipedia is part of many investigations and programs – both for direct human interaction and for indirect information collection by automated systems. Given this importance of Wikipedia as an information resource, there have been many discussions on the reliability of Wikipedia4.

3http://en.wikipedia.org/wiki/Wikipedia:Multilingual_coordination
4http://en.wikipedia.org/wiki/Reliability_of_Wikipedia

Taking into account not just one Wikipedia edition, but extracting the information of more than one language version, it becomes possible to collect information from independent groups of authors5 and either to further expand the knowledge with language-exclusive content or to discover language-specific differences. This allows for a better estimation of how reliable the texts are.

• Statistics: Many different statistics and tools about Wikipedia are accessible, mostly about the development of page views and edits6. This shows that there is a big interest in automatically derived information about Wikipedia. For multilingual comparisons, there is the website www.manypedia.com, which is similar to our approach but does not go deeper into textual similarity.

• Existence of neutrality across languages: Finally, the question arises whether it is possible to maintain the idea of a neutral point of view across languages, which also means across cultures. However, this question is out of the scope of this thesis and rather touches other fields of research.

1.2 Problem Definition

The comparison of Wikipedia articles that describe the same entity in different languages can be split into two tasks: The first task solely refers to the texts of the articles and takes place on the sentence and paragraph level. Here, the goal is to link similar text parts. The second task is done on the article level and takes additional information into account, for example the authors and the external links mentioned in footnotes.

Text Comparison

The text comparison is done to get precise information about how similar the texts are and where their similarities and differences lie. Figure 1.1 shows what the text comparison should result in (with shortened versions of the English and German abstracts of the Wikipedia article about the General Post Office): The English text is shown on the left and the German one on the right. The parts that are identified as similar are linked by green lines. In this example, two subtopics are found that occur similarly in both languages: The first is a general description of the General Post Office, its founding and its establishment as the state postal system and telecommunications carrier. The second common fact is about the office of Postmaster General created in 1961. The black parts without links contain information that is unique to the respective language.

5As shown in [Digital Methods], this does not hold completely, as some authors contribute to multiple Wikipedia editions.
6http://en.wikipedia.org/wiki/Wikipedia:Statistics

Figure 1.1 Text Comparison Example

This kind of comparison offers several possibilities to a human reader: At first glance, you can see which text parts are similar across languages. In contrast, unmarked text parts contain content exclusive to the respective language version. So, if you are interested in discussing the linguistic point of view (as defined in [17]) of the articles, you can investigate which and how many parts are similar and also compare the text structures that way. As a Wikipedia author, you can look at the text in the other language and search for unmarked passages that represent facts not appearing in the article of your language. For the paragraph alignment, there are two premises that differ from related studies:
• Sentences can be aligned if they share only some of their facts.
• A sentence in one language can be assigned to more than one sentence in the other language.
This is mainly done to support the paragraph construction that follows the identification of the (partially) overlapping sentences. On the article level, the paragraph alignment allows deriving numerical values to judge the semantic similarity of cross-lingual texts: the fraction of text whose contents are found in the other text as well and a simple comparison of the text lengths are examples of such measures.

Article Comparison

Another way of comparing articles also considers aspects surrounding the articles that are not directly part of the texts, such as external links and mentioned entities. Although not directly connected to the articles' texts, this also includes a comparison of the authors and their locations. The paragraph alignment is included in the article comparison by computing the fraction of paragraphs that have counterparts in the other article.

Revision History

In the context of investigating the history of the Web7, we also aim at inspecting the similarity of articles over the course of time. Two things are needed for this goal: For an article, several revisions (the states of the article at specific points in time) have to be collected, and for each of the revision pairs found this way an overall similarity has to be defined that is derived from both the text and the revision comparison values.

1.3 Overview

In the next two chapters, background information and related work on the special characteristics of multilingual Wikipedia (Chapter 2) and on multilingual text processing (Chapter 3) – including related approaches such as plagiarism detection – are described to give a first overview of the challenges and possible solutions. Chapter 4 gives a clear idea of how we tackle our research aim and contains a sketch of the procedure that is applied to reach our goals. To identify similar sentences across articles, it is necessary to extract additional information from sentences. Its collection and usage is explained in Chapter 5; in Chapter 6 its effectiveness with regard to sentence alignment is investigated and evaluated by a user study. Having constructed a sentence alignment function, similar sentences can then be joined by the algorithms given in Chapter 7. Aside from pure text comparison, this chapter also contains information about similarity measures on the article level and screenshots of our example application. The realisation of the information extraction process (including a preprocessing pipeline) is described in Chapter 8. Finally, we discuss our results in Chapter 9.

7as it is for example done in https://www.l3s.de/en/projects/iai/∼/alexandria/

2 Background on Multilingual Wikipedia

Wikipedia describes itself as follows1:

"Wikipedia is a free-access, free content Internet encyclopedia, supported and hosted by the non-profit Wikimedia Foundation. Anyone who can access the site can edit almost any of its articles. Wikipedia is the sixth-most popular website2 and constitutes the Internet's largest and most popular general reference work. Jimmy Wales and Larry Sanger launched Wikipedia on January 15, 2001. . . . Initially only in English, Wikipedia quickly became multilingual as it developed similar versions in other languages, which differ in content and in editing practices."

A Wikipedia article describes one real-world entity, topic, event or concept, for example "Barack Obama", "Politics" or "United States elections, 2012". Each article has once been created by a user, and users can extend and edit it afterwards. Due to this, a Wikipedia article never reaches a final state but rather develops over time. The state of an article at a specified point of time is called a revision3. Wikipedia articles are written in a markup language called Wiki markup, allowing for image inclusion, hyperlinking and more. There is even structured data available for some Wikipedia articles in the form of an infobox, which can be accessed through the DBpedia dataset4. Figure 2.1 shows an example of an English Wikipedia article about the German lake "Großer Wannsee", which, for example, has an infobox on the right, a photo gallery and external links at the bottom.

1http://en.wikipedia.org/wiki/Wikipedia
2http://www.alexa.com/siteinfo/wikipedia.org
3However, when speaking about an article in this thesis, we often mean the most current revision or the represented object.
4http://dbpedia.org/


Figure 2.1 English Wikipedia Article ”Großer Wannsee”

The following kinds of information are given by the Wikipedia markup and used in our research:

• Images: Throughout the article, images can be displayed. These are stored on the Wikipedia servers and are mostly available both as a smaller thumbnail and as the original file.

• Internal links: Words that are hyperlinked within the text and refer to other Wikipedia pages.

• External links: Some words or sentences are assigned to one or more footnotes. The footnotes may contain links to external web sites.

2.1 Overview

In March 2015, there were 288 languages for which a Wikipedia edition existed5. This includes languages with more than a million articles like English and German, but also Greek, Afrikaans and Greenlandic with fewer articles. Among these language versions, there are even some non-official languages like Simple English

5http://meta.wikimedia.org/wiki/List_of_Wikipedias

and Esperanto or regional varieties like Bavarian. Table 2.1 shows the number of articles and authors for a few Wikipedias6.

       Language    Articles    Edits          Users        Active Users
  1    English     4,636,933   741,415,363    22,986,467   133,327
  2    Swedish     1,946,828   28,362,530     403,378      2,884
  3    Dutch       1,794,646   43,378,141     639,436      4,136
  4    German      1,771,852   141,065,828    1,996,778    19,583
  ...
  79   Afrikaans   33,412      1,357,691      62,903       126
  ...
  125  Bavarian    10,467      435,892        26,447       71
  ...

Table 2.1 Statistics on Wikipedias in Different Languages

Interlanguage links

Articles that represent the same real-world entity are interlinked by interlanguage links that are collected in a central database by the Wikidata project7.8 As an example, for the Wikipedia article about "Berlin", there are 221 interlanguage links. For the "Pfaueninsel" – an island in Berlin – there are 11 links, as seen in Figure 2.2 (ten outgoing interlanguage links plus the English article itself).

2.2 Wikipedia Guidelines

To support its authors and to reach the aim of being a "free, reliable encyclopedia"9, Wikipedia has introduced different policies and guidelines. Two of them are presented in the following, because they are of major relevance for our investigation of multilingual articles.

2.2.1 Translations

6as of November 2, 2014
7http://en.wikipedia.org/wiki/Wikipedia:Wikidata
8There have been recent changes in the handling of interlanguage links: In Wikidata, all language links for one article are stored compactly together. Before that, each language version of an article had its own list of language links.
9http://en.wikipedia.org/wiki/Wikipedia:Policies_and_guidelines

Figure 2.2 Interlanguage links for the English article "Pfaueninsel"

The contents of different Wikipedias evolve independently of each other, which results in different numbers of articles and varying levels of detail (as shown in Section 2.4). Obviously, this leads to the question whether it is reasonable to translate Wikipedia articles (or parts of them) from one language into another10. The Wikipedia policies regarding this question are stated as follows11:

"Articles on the same subject in different languages can be edited independently; they do not have to be translations of one another or correspond closely in form, style or content. Still, translation is often useful to spread information between articles in different languages. [...] Wikipedia consensus is that an unedited machine translation, left as a Wikipedia article, is worse than nothing."

According to this quote, (parts of) articles in different languages can be divided into three groups:

• Independently evolved text: In the natural case, Wikipedia authors edit articles on their own without adopting every piece of information from the article in other languages.

• Human translations: Especially when an article does not exist in an author's language, they may want to copy e.g. the English one. To do so, they adopt all the information and translate it manually. This approach is useful to spread information without additional research. To indicate that an article has been translated from another one or is still in the process of being translated, Wikipedia gives some advice, including Wiki markup techniques.

10As seen in Section 2.1, the English Wikipedia has by far the most articles, so this is a frequent case where translations can be used to spread information.
11http://en.wikipedia.org/wiki/Wikipedia:Translation

• Machine translation: Instead of translating an article manually, machine translation techniques could be used to save effort. However, this is not allowed because of the low quality of the results and the fact that each user can access machine translation tools on their own.

For our research, we consider both of the first two approaches. As texts may contain the same information even though there was no human translation, and as different parts of an article can behave differently, the borders between the two cases become blurred anyway.

2.2.2 Neutrality

A big concern for Wikipedia is the neutrality of its articles. To emphasize this, the "core content policy" neutral point of view (NPOV) was introduced and defined as follows12:

”Editing from a neutral point of view (NPOV) means representing fairly, proportionately, and, as far as possible, without bias, all of the signifi- cant views that have been published by reliable sources on a topic. All Wikipedia articles and other encyclopedic content must be written from a neutral point of view. NPOV is a fundamental principle of Wikipedia and of other Wikimedia projects. This policy is nonnegotiable and all editors and articles must follow it.”

As the following section will show, studies on multilingual Wikipedias make it reasonable to state that the NPOV is not fulfilled across different languages.

2.3 Linguistic Point of View

There has been some research aiming at judging whether the NPOV can hold in the cross-lingual context. In [12], the authors describe the global consensus hypothesis which says "that encyclopedic world knowledge is largely consistent across cultures and languages". As they find that "knowledge diversity across Wikipedias is large", they discard this hypothesis and emphasize that this has an important impact on the many technologies using Wikipedia data that are said to rely on that hypothesis. To distinguish this phenomenon from the NPOV, the concept of a Linguistic Point of View (LPOV) is introduced in [17], motivated by the question "will relatively isolated language communities of Wikipedia develop their divergent representations for topics?".

12http://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view

2.4 Reasons for Multilingual Differences

Assuming the NPOV and an equal workload of Wikipedia authors, the translated content of Wikipedia articles would be the same in different languages. This is not the case for many reasons, which are described in the following.

Different number of authors and different interests

As shown in Section 2.1, the different Wikipedias vary in the number of authors – and therefore in the number of articles as well13. In [8] there is a more detailed investigation not only of the number of pages in different languages, but also of their lengths. The authors analyzed Wikipedia entries about 48 persons and found out that all of them have an English version, but only 26 persons have articles in more than 20 languages – including the former Secretary-General of the United Nations, Kofi Annan, who had the most language links (86) at the time of the study. This might lead one to believe that the English Wikipedia is some kind of superset of the other Wikipedias, which is shown to be wrong in that investigation. For the 43 remaining persons with at least one non-English entry, they compare the number of sentences in the different languages' articles. The result shows that for 17 of the persons there exists at least one language version with an article that has more – for some persons more than twice as many – sentences than the English one. Given these results, the authors conclude that "despite the fact that English has descriptions for the most number of Wikipedia entries across all languages, English descriptions can not always be considered as the most detailed descriptions" and that "Multilingual Wikipedia is full of information asymmetries". As one of the main sources for this asymmetry, they describe that many persons, locations and events are only important within smaller communities that speak the same language. To illustrate this, they give examples of a Mexican singer with a Spanish entry only and a Greek mountain with four entries in different languages.

Cultural Reasons

There are some topics which are of interest to many people, but the perception of them varies across different cultural communities. As a result, there are articles that exist in many languages and may even have a similar number of sentences, but whose contents differ a lot.

13Looking at the Swedish Wikipedia, the number of authors and the number of articles need not be directly related: In 2013, nearly half of the Swedish articles were automatically created by bots, which led to various debates (http://blog.wikimedia.org/2013/06/17/swedish-wikipedia-1-million-articles/).

[26] describes this aspect in detail with a very extensive comparison of the Wikipedia entries for the Srebrenica massacre in the language versions of countries that were directly affected by the events of 1995, namely the Serbian, Bosnian, Dutch, Croatian, Serbo-Croatian and English versions. They find a lot of differences that can easily be explained by cultural biases. Among these differences are the following three:
• There have been many discussions whether to title the article "Srebrenica Massacre" or "Srebrenica Genocide".
• The victim counts differ across the language versions.
• The people who are blamed for the events are not named in a consistent way: The Serbian article avoids calling them "Bosnian Serb forces" and prefers to say "Army of the Republika Srpska".
Beyond this inspection of the texts themselves, the authors use some other methods that are described in Section 2.5 – including the comparison of images and links used in the articles. Moreover, they take a look at the locations of the Wikipedia authors that contributed to the pages14 and emphasize the role of power editors who are responsible for a major part of an article. At the end of their study, they conclude that the NPOV does not hold, as it is not possible to find any "universality" between the different articles about the same topic.

Advertisement

Probably the most critical aspect concerning the NPOV occurs when Wikipedia pages are edited or created by people who believe that they will profit from those changes. This kind of advertisement is – of course – in stark contrast to the idea of neutrality. There have been many investigations to find such non-neutral editing, which led to the banning of more than 250 user accounts in October 2013 whose owners "have been paid to write articles on Wikipedia promoting organizations or products, and have been violating numerous site policies and guidelines"15. Furthermore, there are more detailed reports of companies that have directly influenced articles which affect them. [16] states that every third German company listed on the stock exchange has behaved in that way. This includes small changes, as for example replacing the text "one of the leading companies" with "the leading company", and bigger changes where the content of a company's press release was inserted into an article without further adjustments.

14For non-registered users, the IP address is visible and allows conclusions about the user's location.
15http://blog.wikimedia.org/2013/10/21/sue-gardner-response-paid-advocacy-editing/

Political Reasons

Similar to the manipulation for advertisement, which is done by the respective companies, there is also manipulation with a political background. For example, there have been reports claiming that the Russian government has edited Wikipedia pages on flight MH17 to blame Ukraine for having shot down that flight16. Although this was quite a naive approach to manipulation and maybe just done by an individual without any official instruction, it gives an insight into how powerful political manipulation may be. Wikipedia edits made by people connected to governmental institutions can be found by exploiting the circumstance that edits of unregistered Wikipedia users are annotated with the IP address of their author. Given a list of IP addresses used in state institutions, it becomes a rather easy task to generate lists of suspicious edits. This has been done (the edit on the MH17 page was found this way) and resulted in a collection of Twitter accounts that send messages as soon as such an edit has been made and recognized17. To spread the manipulated texts, it is especially interesting for governmental sources to change not only the articles of the Wikipedia of their own language, but also those of other languages. For example, there are more than 3000 edits18 from the German parliament and government within the English Wikipedia.

2.5 Wikipedia Studies

As already mentioned, there exist some tools that make use of multilingual Wikipedia. We will present three of them in this section.

Concept Similarity

In [17], the website www.manypedia.com is presented. On that interface, you can enter an article name and two languages. After a short calculation time, both articles are shown side by side and some information is highlighted: The images of each article are shown compactly together at the top, there are some statistics about the edit history (amongst others, the number of edits and the top authors) and – as the most important aspect – a concept similarity, which is the overlap of the Wikipedia links mentioned in both articles. Additionally, you can translate non-English articles into English by machine translation.

16http://www.wired.co.uk/news/archive/2014-07/18/russia-edits-mh17-wikipedia-article
17https://jarib.github.io/anon-history/
18Taking a closer look at the edits, the biggest part of them can be identified as small and non-manipulative corrections, though.

Social Research

In [26], many aspects were manually investigated and visualised to compare the articles about the massacre of Srebrenica, including the following:

• Authors: The locations of anonymous editors per language version are shown by pie charts.

• Table of contents: The table of contents is manually aligned to mark passages that appear in more than one language version.

• External link hosts: From the external URLs mentioned in an article, only the first part (the host name) is considered; the host names are compared in a table where common ones are marked. The host name is used instead of the complete URL because many multilingual web sites – and these are the ones that are probably mentioned in articles in different languages – can be matched this way.

• Images: There is also a table for common images. Here, even similar images are aligned.

Edit History

There are some tools to visualise the number of edits over time19 20. Although these tools do not focus on the comparison of multilingual articles, it is easy to do this kind of visualisation for more than one language at once. The first tool even contains a world map where the locations of editors are marked.

19http://sonetlab.fbk.eu/wikitrip/
20http://sergionunes.com/p/wikichanges/

3 Background on Multilingual Text Processing

Our aim of analyzing Wikipedia articles written in different languages touches several research topics that will be addressed in this chapter. On the one hand, there are topics that cover the inspection of multilingual text documents of different kinds – for instance scientific papers or accurate translations, but not necessarily Wikipedia or even web-specific texts. On the other hand, there are a lot of Wikipedia-specific investigations. To structure this, we will give an overview of related work on the following topics:

• Multilingual Natural Language Processing: To find similarities between texts using automatic procedures, it is mandatory to extract information from them. To further allow the comparison of multilingual texts, the extraction has to be applicable to different languages.

• Text Aligning / Parallel Corpora: Many machine translation programs require a collection of aligned texts in multiple languages. To build such a multilingual corpus, it is necessary to find similar passages in texts.

• Plagiarism Detection: Plagiarism can be committed not only by copying texts that are written in the same language, but also by adopting texts from another language. Identifying such plagiarism touches our research topic and will be discussed in Section 3.2.2.

3.1 NLP for Multilingual Text

3.1.1 Machine Translation

To conduct a comparison of texts that is solely based on the texts themselves and does not have any further extracted information available, it is necessary to have both texts

written in the same language. Because of their size, number and frequent changes, it is not possible to translate Wikipedia articles manually. However, an evaluation will later show that it is not possible to get good results for text comparison without using translations (see the Wikipedia baseline described in Section 6). That is the reason why it is essential for this study to use machine translation. There exists a large number of machine translators that can be queried using web applications, such as the Bing Translator1 or the Google Translator2. For some translators, there exists a web API to simplify the access for programmers (for example, the Bing Translator via the Microsoft Translator API3). There are different approaches to machine translation, including rule-based, dictionary-based and statistical translation4. The statistical approach (which is used by the Microsoft Translator5) is based on so-called multilingual parallel corpora. These are texts in at least two languages where corresponding parts are connected to each other. From these corpora, statistical features can be extracted which allow the translations to be made. As already shown in Section 2.2.1 and by the example in Table 3.1, machine-translated texts do not reach the quality of human translations. Nevertheless, they can be used for the multilingual text comparison, as the translated text is not presented to a human reader but used, for example, to compare single words.

German sentence:                Maschinelle Übersetzung, auch automatische Übersetzung, bezeichnet die Übersetzung von Texten aus der Quellsprache in eine Zielsprache mit Hilfe eines Computerprogramms.
Human translation (English):    Machine translation, also automatic translation, is the translation of texts from a source language into a target language with the help of a computer program.
Machine translation (English):  Machine translation, automatic translation means the translation of texts from the source language into a target language with the help of a computer program.

Table 3.1 Example of Machine Translation: Human versus Machine Translation

1http://www.bing.com/translator/
2https://translate.google.de/
3http://www.microsoft.com/translator/default.aspx
4http://en.wikipedia.org/wiki/Machine_translation
5http://www.microsoft.com/translator/automatic-translation.aspx

3.1.2 Textual Features

To support the comparison of texts, there exist methods that are well-known in the field of Information Retrieval:

N-grams

An n-gram is a sequence of tokens of length n in a text. For example, the sentence "An n-gram is a sequence" can be split up into the character 5-grams (n = 5) "An n-", "n n-g", " n-gr" and so on. On the word level, this sentence consists of the following word bigrams (n = 2): "An n-gram", "n-gram is", "is a" and "a sequence".
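The running example can be reproduced with a few lines of Python; this is a minimal sketch rather than the implementation used later in this thesis:

```python
def char_ngrams(text, n):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """All overlapping word n-grams of a string."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "An n-gram is a sequence"
print(char_ngrams(sentence, 5)[:3])  # ['An n-', 'n n-g', ' n-gr']
print(word_ngrams(sentence, 2))      # ['An n-gram', 'n-gram is', 'is a', 'a sequence']
```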

Stop Word Removal

Natural-language texts often contain words that are just part of the text structure but add nothing when comparing texts. These words are removed from texts using a black list of so-called stop words. For example, the sentence "Stop words are the words that are filtered out" becomes "Stop words words filtered" after stop word removal with a common black list.

Stemming

For text comparison, it does not matter how words are declined or conjugated. Therefore, it helps to reduce each word to its word stem. This is done in a process called stemming, which can, for example, be realised by cutting off word endings according to specific rules. In the example sentence from before, the resulting sentence is "stop word filter" after stop word removal followed by stemming.
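A minimal sketch of both preprocessing steps, assuming NLTK's English stop word list and the Porter stemmer (the exact output depends on the stop word list used):

```python
# Requires NLTK and its stop word corpus, e.g. nltk.download('stopwords').
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

sentence = "Stop words are the words that are filtered out"
tokens = sentence.lower().split()

filtered = [t for t in tokens if t not in stop_words]  # stop word removal
stemmed = [stemmer.stem(t) for t in filtered]          # stemming

print(filtered)  # e.g. ['stop', 'words', 'words', 'filtered']
print(stemmed)   # e.g. ['stop', 'word', 'word', 'filter']
```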

3.1.3 Topic Extraction

One of the tasks in NLP that focuses on the semantic features of a text is the extraction of elements that are useful for information collection and belong to the same concepts. Below, two types of such elements and their extraction will be presented: named entities and dates. Both types can be used to match elements with each other for texts in different languages.

Named Entity Recognition

Named entities are often divided into three categories: people, organisations and locations. The following line gives an example of these categories:

Since the United States [Location] presidential election of 2008, held on Tuesday, November 4, 2008, Barack Obama [Person] from the Democratic Party [Organisation] is the 44th U.S. [Location] president.

[22] gives an overview of techniques for Named Entity Recognition (NER); [9] gives a more detailed idea of the implementation of the Stanford Named Entity Recognizer6, which does not only use local information but profits from building long-distance dependency models. NER programs usually use language-specific models that are trained with text collections where each named entity is annotated. Although named entities seem to be a good means for comparing texts, they suffer from the disadvantage of not being unique. For example, the location entity "U.S." from the example sentence above can be named "United States" as well. The differences can be even bigger for comparisons with texts in other languages: "United States" is "Vereinigte Staaten" in German and "Соединённые Штаты" in Russian. It is possible to overcome this problem by defining specific similarity measures [20] or by doing transliteration and normalisation [28]. The problem of finding canonical unambiguous references for the same entities is known as named entity normalization (NEN) [13] and can be addressed by using another kind of NER called entity linking: Each entity is not identified by its name, but by a unique resource identifier. Given its large number of articles – and therefore unique entities – Wikipedia (and DBpedia respectively, as they share their entities) is often used as a resource for such entities [18][21][7]. The following sentence illustrates Wikipedia entity linking for the example from above:

Since the United States presidential election of 2008 [http://en.wikipedia.org/wiki/United_States_presidential_election,_2008], held on Tuesday, November 4, 2008, Barack Obama [http://en.wikipedia.org/wiki/Barack_Obama] from the Democratic Party [http://en.wikipedia.org/wiki/Democratic_Party_(United_States)] is the 44th U.S. president.

Cross-lingual comparisons of entities can be done using the language links provided by Wikipedia (see Section 2.1). In Section 5.4, there is a detailed evaluation of Wikify and DBpedia Spotlight for entity extraction. In this thesis, NER is only performed on Wikipedia texts. Given their internal links, the question arises whether an additional entity extraction step adds any information. The Wikipedia Manual of Style/Linking7 lists the rule "Generally, a link should appear only once in an article" (followed by some exceptions: for example, links can be attached to the same entity name in both the infobox and the article). Instead of the entity linking approach, [1] first builds a bilingual dictionary of the whole Wikipedia article collection and then searches for occurrences of the link text in an n-gram based approach. While this method promises a higher confidence for the correctness of found words, it fails to detect inflected entities8.

6http://nlp.stanford.edu/software/CRF-NER.shtml
7http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style
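To illustrate how language-independent identifiers support such a cross-lingual comparison, the following sketch maps entity titles to shared IDs through an interlanguage-link dictionary and computes their Jaccard overlap; the dictionary and all identifiers here are hypothetical stand-ins, not data structures used in this thesis:

```python
# All names below are illustrative; the mapping would be pre-built from Wikipedia's
# interlanguage links (Section 2.1), here it is hard-coded for the example.
interlanguage = {
    "United States": "ID:United_States",
    "Vereinigte Staaten": "ID:United_States",  # German title of the same article
    "Barack Obama": "ID:Barack_Obama",
}

def to_shared_ids(entity_titles):
    """Map language-specific article titles to language-independent identifiers."""
    return {interlanguage[t] for t in entity_titles if t in interlanguage}

def entity_overlap(titles_a, titles_b):
    """Jaccard overlap of the entities mentioned in two sentences."""
    ids_a, ids_b = to_shared_ids(titles_a), to_shared_ids(titles_b)
    if not ids_a or not ids_b:
        return 0.0
    return len(ids_a & ids_b) / len(ids_a | ids_b)

print(entity_overlap({"United States", "Barack Obama"}, {"Vereinigte Staaten"}))  # 0.5
```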

Time Extraction

Another type of named entity are time annotations that represent dates, time ranges or even sets of them. They are extracted using language-specific rules such as string matching for month names or date expression matching (e.g. by searching for substrings of the type "d m yyyy", with "d" representing a day number, "m" a month number and "yyyy" the four digits of a year). In contrast to other entities, the comparison of time annotations should not rely on simple string equality. For example, a day may be matched with the week containing it.
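A minimal sketch of such rule-based date extraction and a tolerant comparison, restricted to simple numeric day.month.year expressions (real extractors use far richer rule sets):

```python
import re
from datetime import date

# Only numeric "d.m.yyyy" expressions are handled here; month-name rules etc. are omitted.
DATE_PATTERN = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")

def extract_dates(text):
    return [date(int(y), int(m), int(d)) for d, m, y in DATE_PATTERN.findall(text)]

def same_week(a, b):
    """Tolerant comparison: two dates match if they fall into the same ISO week."""
    return a.isocalendar()[:2] == b.isocalendar()[:2]

dates = extract_dates("Die Mauer fiel am 9.11.1989 in Berlin.")
print(dates)                                    # [datetime.date(1989, 11, 9)]
print(same_week(dates[0], date(1989, 11, 10)))  # True: both days lie in the same week
```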

3.1.4 Sentence Splitting

With the aim of not only calculating a single similarity value for two articles, but also demonstrating similarities and differences in smaller parts of the articles, the text must somehow be split into smaller parts. As suggested in [3], sentence-based splitting is a good approach and is part of many multilingual NLP tools9 10. When splitting a text into sentences, several difficulties make it obvious that it does not suffice to split the text at every punctuation mark. [4] contains two example sentences for this behaviour:
• Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
• Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Instead of splitting at every punctuation mark, a model must be trained beforehand by learning from a language-specific corpus with isolated sentences.
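A minimal sketch using NLTK's pre-trained Punkt sentence splitter, which is one example of such a model learned from a language-specific corpus:

```python
# Requires NLTK and its Punkt models, e.g. nltk.download('punkt').
from nltk.tokenize import sent_tokenize

text = ("Pierre Vinken, 61 years old, will join the board as a nonexecutive "
        "director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch "
        "publishing group.")

for sentence in sent_tokenize(text, language="english"):
    print(sentence)
# Expected: two sentences; the trained model keeps "Nov." and "Mr." from causing splits.
```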

3.1.5 Other NLP techniques

Multilingual NLP covers many other topics, of which two important ones will be described in this section – together with reasons why they are not of primary interest for this research.

8DBpedia Spotlight (http://dbpedia-spotlight.github.io/demo/) detects the correct "Australia" entity in the phrase "an Australian woman".
9http://nlp.stanford.edu/software/corenlp.shtml
10http://opennlp.apache.org/index.html

Figure 3.1 First Paragraphs of the Wikipedia Article ”Berlin”

Text chunk extraction

Until now, the identification of text parts that belong together was limited to the sentence level. To split the text into bigger parts, sliding window techniques were developed that subdivide a text into paragraphs covering the same subtopics. One of them is the TextTiling algorithm [11], which makes use of lexical co-occurrence and the distribution of words to find gaps between sentences that indicate a change in topic. Wikipedia articles already consist of many rather small paragraphs11 which can easily be extracted12 and connected in a bottom-up manner to create bigger paragraphs. Figure 3.1 gives an example of the paragraphs (black framed) in the top section of the English "Berlin" article.
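A minimal sketch of extracting these paragraphs from the rendered HTML of an article, assuming BeautifulSoup and the fact that the article body marks paragraphs with <p> tags; the URL is only an example:

```python
# Assumes the 'requests' and 'beautifulsoup4' packages are installed.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/Berlin").text
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <p> element and drop empty placeholder paragraphs.
paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
paragraphs = [p for p in paragraphs if p]

print(len(paragraphs), "paragraphs;", paragraphs[0][:80], "...")
```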

Language identification

There exist NLP tools like NGramJ13 to identify the language of a text by building an n-gram profile of the text and comparing it with the statistical n-gram profiles of different languages. For this thesis, it is assumed that every article from the Wikipedia version in language L only consists of sentences written in L14. As every sentence can thus be assigned to the language L of the Wikipedia version, language identification is not needed.

11In a test set of 14 long articles, there were approximately 3.606 sentences per paragraph.
12This is done during the HTML parsing: Paragraphs are within <p> tags.
13http://ngramj.sourceforge.net/
14This assumption does not hold for quotes in the original language.

3.2 Aligning Multilingual Text

The alignment of text is the task of identifying corresponding sentences within two texts in different languages. Typically, this is done to train machine translators, and the input text is given as a parallel corpus. As Wikipedia cannot be seen as a parallel corpus [1], this section describes approaches to text alignment for comparable corpora.

3.2.1 Comparable Corpora

In [1], a first approach to align similar sentences in multilingual Wikipedia was presented. The authors describe the difficulty of applying sentence alignment functions to Wikipedia texts: for some article pairs, one article may be a translation of the other, while for others the articles may be very different. They propose two methods to compute similarity values for sentence pairs:

• Machine translation based approach: One of the two texts (the Dutch one in their study) is translated into English, such that both articles are available in the same language. Then, the texts are split into sentences. After stop word removal, the sentence similarity is computed by Jaccard word overlap (which will be explained in Section 5.1).

• Link based approach: In the beginning, a bilingual lexicon is created that maps article names to a unique language-independent representation of the article (using the language links described in Section 2.1). Then, a set of n-grams of the sizes n = 1, n = 2, n = 3 and n = 4 is created for each sentence. For each of the n-grams created this way, the lexicon is queried. If it returns a Wikipedia article, the Wikipedia term is added to the n-gram. Finally, the similarity is computed by Jaccard overlap as well. This approach is mainly intended to find out whether sentence alignment can be done without translations. This would be appropriate, as the main goal is to use the findings for machine translation; otherwise, there would be some kind of causality dilemma.

To do the sentence alignment with the help of these similarity measures, the following method is applied:

1. All the sentence pairs are ranked by their similarity score (computed with one of the two methods above).

2. The sentence pair with the highest similarity is chosen and put into the list of similar sentences.

3. All sentence pairs where one of the sentences is contained in the chosen sentence pair are removed from the ranked list.

4. The previous two steps are repeated until the list is empty.

The filtering in the third step ensures compliance with the following assumption: each sentence may be aligned with only one other sentence in the other article. We call this the 1:1 assumption. In [19], the second, link-based approach and a similar alignment method are used, but results are improved by including thresholds for length similarity and sentence similarity. Referring to [10], they find that sentences that differ a lot in their lengths are mostly not similar. Therefore, only sentences that reach a character length correlation of at least 0.5 and only sentence pairs that exceed a similarity score of 0.02 are added to the ranked list. For a data set of 30 article pairs, they reach a precision of 21% compared to 10% when using the original methods of [1].
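The greedy procedure and the 1:1 assumption can be summarised in a short sketch; Jaccard word overlap stands in for the similarity score, the 0.02 threshold follows [19], and the length-correlation filter is omitted for brevity:

```python
def jaccard(a, b):
    """Jaccard word overlap of two (already translated, stop-word-filtered) sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta or tb else 0.0

def align(sentences_a, sentences_b, min_score=0.02):
    # Rank all cross-lingual sentence pairs by similarity ...
    ranked = sorted(
        ((jaccard(a, b), i, j)
         for i, a in enumerate(sentences_a)
         for j, b in enumerate(sentences_b)),
        reverse=True)
    aligned, used_a, used_b = [], set(), set()
    # ... and greedily keep the best pair whose sentences are still unused (1:1 assumption).
    for score, i, j in ranked:
        if score < min_score:
            break
        if i not in used_a and j not in used_b:
            aligned.append((i, j, score))
            used_a.add(i)
            used_b.add(j)
    return aligned

print(align(["berlin is the capital of germany", "the city has a famous zoo"],
            ["the capital of germany is berlin", "there are many museums"]))
# -> [(0, 0, 1.0)]
```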

3.2.2 Plagiarism Detection in Multilingual Text

[3] gives an overview of how to define different types of plagiarism – the adoption of text without reference to the source – and how to detect them. For this purpose, they classify plagiarism detection into monolingual and cross-lingual. The latter class is the one of interest for this thesis' research, as it refers to the identification of similar subsections within texts in different languages. The requirements are not the same as the ones given in the problem definition of this thesis, as the following review of plagiarism detection methods will show. Besides the actual procedure of text processing, we will review the techniques that are used for the – syntactic and semantic – comparison of text passages. [3] provides an abstract design for cross-lingual plagiarism detection, of which [2] shows a concrete example implementation. The procedure consists of the following steps, given an input document d_q in a language L_q and a document collection D:

1. Reduce the complexity of d_q and then translate it using machine translation. This results in a new document representation d'_q.

2. Using d'_q, reduce D to a smaller set of documents D_x that are good candidates to have been plagiarised by d_q.

3. For each document d_x ∈ D_x: Perform a pair-wise comparison with d_q to find parts s_q from d_q and s_x from d_x that are similar.

4. Merge found pairs if they are within a short distance.

For the first step, [2] uses a summarization strategy that extracts the n most important sentences from d_q. The pairwise comparison in the third step is performed by computing the longest common subsequence similarity15 between translated sentences. In this thesis, the plagiarism detection process does not have to be executed completely in this way: The first two steps are irrelevant, because the document collection D initially only consists of a single article (for each language of interest), namely the Wikipedia article in a different language found by following the language link. Nevertheless, the second step gives some ideas with regard to the comparison of articles on the "document level", without splitting the text into smaller passages. For that case, it is worthwhile to take a look at different features that are used to calculate similarity values between smaller text fragments. [3] gives an overview of textual features for cross-lingual plagiarism detection. They distinguish between syntactic, semantic and statistical features. The syntactic features aim at splitting the text into smaller passages: words, sentences or chunks. This can be done by the methods described in Sections 3.1.4 and 3.1.5. Due to the inaccuracy of purely lexical (e.g. character n-grams) or syntactic features on cross-lingual texts, it is important to combine those measures with semantic and statistical features. For the extraction of semantic features, synonyms of the words occurring in the text are collected to improve the identification of corresponding words. In [2], a single feature is used, which is the longest common subsequence of the compared sentences. The similarity value derived from this feature (described later in Section 5.1) must exceed 0.65 for a pair to be classified as plagiarised. As the identification of similar sentences is not enough to detect plagiarism, a post-processing step is applied to the detected sentence pairs in which sentence pairs with no more than 10 characters in between are merged. This can be used for our aim of paragraph alignment as well.
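A sketch of a longest-common-subsequence similarity between two translated sentences, normalised here by the length of the longer sentence (one common choice; the exact definition used in this thesis follows in Section 5.1):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_similarity(sent_a, sent_b):
    a, b = sent_a.lower().split(), sent_b.lower().split()
    return lcs_length(a, b) / max(len(a), len(b)) if a and b else 0.0

pair = ("the general post office was established in 1660",
        "the post office was officially established in 1660")
sim = lcs_similarity(*pair)
print(round(sim, 3), sim > 0.65)  # 0.875 True, i.e. above the threshold used in [2]
```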

15http://www.cs.umd.edu/∼meesh/351/mount/lectures/lect25-longest-common-subseq.pdf

4 Approach Overview

With the knowledge of the previous chapters (mainly the characteristics of multilingual Wikipedia and several techniques for multilingual text processing), the comparison of Wikipedia articles becomes possible. This chapter will give an overview of the steps that are needed to do so. This includes the data extraction, the comparison of features in this data and finally the identification of similar sentences – and later paragraphs – within the texts. To reason about the results, some aspects will be evaluated, e.g. by a user study. Figure 4.1 gives a very rough overview of what is done during the process of article comparison for a single article pair. That means, there is one object for which an English article and an article in another language exist.

Figure 4.1 Process of Article Comparison (find revisions → extract sentences and features with the pre-processing pipeline → store in database → find similar sentences → find similar paragraphs → compute revision similarity)

The first three steps are part of the data collection and are described in more detail later in Section 8 when talking about the implementation: To observe the similarity development, we choose several revision pairs of the chosen article pair for different points of time, as equidistant as possible. In the preprocessing step, each revision runs through a pipeline where the following steps are performed by HTML parsing and by using the Wikipedia API and other external tools:

• Images, internal and external links (items) that are mentioned in the revision are collected.

• The raw text that is the essential part of the article (no tables, external link lists etc.) is extracted and split into sentences.

• When possible, the extracted items are assigned to sentences.

• For the non-English revision, the sentences are translated by using a machine translation API.

• For each sentence, more entity links are extracted using NER tools.

• For each sentence, time annotations are extracted using NLP tools.

As this preprocessing consists of many extensive steps and has to be done for every revision, an ad hoc approach where the user can enter an arbitrary article and immediately gets the results is not feasible. Therefore, the collected data is stored in a database. The similarity calculations are done ad hoc1. That means, if an article comparison is requested, the respective data for both revisions is loaded from the database. From now on, things are done in the bottom-up manner that was already described in Section 1.2: From similar sentences, we construct similar paragraphs and finally define an overall revision similarity. The problem of finding similar sentences is called sentence alignment. To do this, we must define a set of similarity measures that are given as computation rules returning values in the range [0, 1] by comparing syntactic and semantic features. This is described in the following Chapter 5. In that chapter, there are two steps of evaluation: In the first evaluation, we compute the textual similarity values for sentence pairs of a parallel English/German corpus and try to find out which of them performs best. Secondly, we compare NER tools to choose the best performing one for our preprocessing. In order to align sentences, we need a sentence alignment function. Given two sentences in different languages as input, this function calculates a similarity value that grows with their similarity, from "different" over "partially overlapping" to sentences that contain the same facts. To create and evaluate this similarity function, a user study is conducted that consists of two steps: In the first, the set of similarity measures is reduced to those that have the biggest impact on the similarity. In the second step, the similarity function is composed of them. The study is described in Chapter 6.

In Chapter 7, an algorithm is explained that forms paragraphs from the previously identified matching sentence pairs. To compute the similarity of a paragraph, we use the alignment function from Chapter 6 and add small penalties for unfitting sentences. The fraction of overlapping paragraphs is a part of the revision similarity, which is a single score for a whole revision pair. Other components are very similar to those used on the sentence level (text overlap, common entities, . . . ). Nevertheless, we give a concrete overview of all its similarity measures in Section 7.2 to distinguish it from the sentence similarity. There are even some measures, like the author similarity, that can only be used on the article level.

1 apart from the overall similarity values for all revision pairs that are part of the history chart

5 Feature Selection and Extraction

To estimate the similarity of text parts, it is essential to extract comparable information from them. Especially for multi-lingual comparisons, it is important not to rely only on purely syntactic textual information, but also on the semantics. In this chapter we address the task of extracting such information – features – from text and finally review how big the influence of these different kinds of features is. There are different types of features: Some can be used on the sentence level (e.g. textual features), others are used for the revision comparison (e.g. authors). At this stage, we only focus on the comparison of two sentences in different languages (sentence pairs). The similarity between two sentences will later help us to realise a bottom-up approach to find similar paragraphs by combining proximate sentences. Similarity measures on the revision level are partially similar and will be described in Section 7.2. Textual features were already introduced in Section 3.1.2; other features needed for the revision similarity are motivated by the findings in Section 2.4. In the following, each feature is described together with calculation rules to derive similarity values with respect to the feature.

5.1 Syntactic Features

The syntactic features of a sentence are based on its textual content. For the textual comparison, non-English sentences are machine-translated into English. Furthermore, basic NLP techniques like n-grams are used. In the following, five syntactic features and their application for the computation of similarity values in the range of [0, 1] are described.


Text Overlap

The text overlap similarity (TO) is a simple application of the Jaccard coefficient to the words appearing in both sentences after stemming and stop word removal. The computation is given in the following equation, with $W_{s_i}$ being the set of stemmed and stop-word-free words of sentence i:

\[
\mathrm{sim}_{TO}(s_1, s_2) = \frac{|W_{s_1} \cap W_{s_2}|}{|W_{s_1} \cup W_{s_2}|}
\tag{5.1}
\]
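A minimal sketch of this measure, assuming stemming and stop word removal have already been applied to both (translated) sentences; the function and variable names are illustrative, not taken from the thesis implementation:

```python
def sim_text_overlap(words_s1, words_s2):
    """Jaccard coefficient over the stemmed, stop-word-free word sets (Eq. 5.1)."""
    w1, w2 = set(words_s1), set(words_s2)
    if not w1 and not w2:
        return 0.0
    return len(w1 & w2) / len(w1 | w2)


print(sim_text_overlap(["berlin", "largest", "citi"],
                       ["berlin", "capital", "citi"]))   # 2 / 4 = 0.5
```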

Bigram Overlap

A method proposed in [24] uses the Jaccard coefficient as well, but not on single words (unigrams). Instead, bigrams of the text are identified. By computing the overlap this way, the order of the terms is taken into account, which is not the case for unigrams.
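A short sketch of this variant, assuming tokenised input; it only swaps the word sets of the previous measure for bigram sets:

```python
def bigrams(tokens):
    """Set of consecutive word pairs of a tokenised sentence."""
    return set(zip(tokens, tokens[1:]))


def sim_bigram_overlap(tokens_s1, tokens_s2):
    """Jaccard coefficient over word bigrams, so word order influences the score."""
    b1, b2 = bigrams(tokens_s1), bigrams(tokens_s2)
    if not b1 and not b2:
        return 0.0
    return len(b1 & b2) / len(b1 | b2)
```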

Word Cosine

The word cosine is often used in the context of information retrieval and takes into account the selectivity of terms: The more frequent a term is, the less important it is for the similarity computation. This is done by defining tf-idf weights that indicate the selectivity of terms. The tf-idf concept is transferred to our requirements by treating sentences as documents and thus taking all the sentences from both articles as the document collection D. A term t corresponds to a word of the (translated) sentence. To compare two sentences, the following two term vectors are created beforehand:

• term frequency (tf) for each sentence: the number of occurrences of each term in this sentence

• inverse document frequency (idf): the inverted and logarithmised number of sentences in which each term appears, computed as $\mathrm{idf}_t = \log \frac{|D|}{|\{d' \in D \mid t \in d'\}|}$.

The cosine similarity for two sentences is then computed as follows, with $w_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t$ (tf-idf weight of the term t in the sentence d):

\[
\mathrm{sim}_{Co}(s_1, s_2) = \frac{\sum_{i=1}^{N} w_{i,s_1} \cdot w_{i,s_2}}{\sqrt{\sum_{i=1}^{N} w_{i,s_1}^2} \cdot \sqrt{\sum_{i=1}^{N} w_{i,s_2}^2}}
\tag{5.2}
\]
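A minimal sketch of this computation, assuming tokenised sentences and treating all sentences of both articles as the document collection; names are illustrative:

```python
import math
from collections import Counter


def idf(term, sentences):
    """Inverse document frequency with sentences acting as documents."""
    df = sum(1 for s in sentences if term in s)
    return math.log(len(sentences) / df) if df else 0.0


def sim_cosine(s1, s2, sentences):
    """tf-idf weighted cosine similarity of two tokenised sentences (Eq. 5.2).

    `sentences` is the collection D, i.e. all tokenised sentences of both articles.
    """
    tf1, tf2 = Counter(s1), Counter(s2)
    terms = set(tf1) | set(tf2)
    w1 = {t: tf1[t] * idf(t, sentences) for t in terms}
    w2 = {t: tf2[t] * idf(t, sentences) for t in terms}
    dot = sum(w1[t] * w2[t] for t in terms)
    norm1 = math.sqrt(sum(v * v for v in w1.values()))
    norm2 = math.sqrt(sum(v * v for v in w2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```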

Longest Common Subsequence

In the context of cross-lingual plagiarism detection, texts are compared using their longest common subsequence (LCS) [2]. This is the longest sequence of characters that appears in that order in both text strings. For example, the LCS of the words "sentence" and "subsequence" is "seence". Similarly to the bigram overlap, the order of words matters in this case as well. To obtain a similarity value between 0 and 1 from the LCS, the similarity measure is defined as follows, with |s_1| denoting the number of characters in the first sentence:

\[
\mathrm{sim}_{LCS}(s_1, s_2) = \frac{|LCS(s_1, s_2)|}{\max(|s_1|, |s_2|)}
\tag{5.3}
\]
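A minimal sketch of this measure; the LCS length is computed with the standard dynamic-programming recurrence:

```python
def lcs_length(a, b):
    """Length of the longest common (character) subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sim_lcs(s1, s2):
    """LCS length normalised by the longer sentence (Eq. 5.3)."""
    if not s1 or not s2:
        return 0.0
    return lcs_length(s1, s2) / max(len(s1), len(s2))


print(lcs_length("sentence", "subsequence"))   # 6, i.e. "seence"
```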

Text Length Similarity

In contrast to the previous similarity measures, a comparison of the sentence lengths (in terms of characters of the original, untranslated text) obviously does not suffice as the only syntactic feature, because sentences with the same length can have completely different contents. To take into account that the same content has varying length in different languages, the text length is normalised such that the average number of characters of texts in the respective language is considered. To do so, we computed the ratio of characters in the Europarl corpus [14] (described in more detail in the following Section 5.2) for the respective language pairs and use this value for normalization.
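The thesis does not spell out the exact formula at this point; one plausible sketch, assuming the similarity is the ratio of the shorter to the longer length after normalising the German length with the Europarl character ratio (both the formula and the numeric ratio below are assumptions, not the thesis values):

```python
# Assumed English/German character ratio derived from the Europarl corpus;
# the concrete value below is only a placeholder, not the measured ratio.
EN_DE_CHAR_RATIO = 0.9


def sim_text_length(len_en, len_de, ratio=EN_DE_CHAR_RATIO):
    """Ratio of the shorter to the longer length after projecting the German
    character count onto the English scale."""
    norm_de = len_de * ratio
    longer = max(len_en, norm_de)
    return min(len_en, norm_de) / longer if longer else 0.0
```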

5.2 Evaluation on Sentence Similarity of Parallel Corpus

To be able to compare the textual features (except text length similarity) and finally decide on the best one, we applied them to a parallel corpus with aligned German and English sentences. The German texts were machine-translated into English, which allows us to compare the sentences using the different textual features. The first goal is to judge the features' quality, which is done by plotting their precision and recall values. The second goal is to find good threshold values that can be used to classify unknown sentence pairs as parallel or not. As our final goal is not just to find identical sentences, but also partially overlapping sentence pairs, this evaluation is only reasonable under the assumption that the textual similarity of sentences with the same contents highly correlates with the similarity of partially overlapping ones. For the Jaccard overlap, this seems correct, as identical sentences reach an overlap of 100%, while this drops to 50% if the first sentence only contains half of the content of the second one.

Data

The Europarl corpus [14] is a sentence-aligned parallel corpus extracted from the proceedings of the European Parliament. It contains 20 versions, each consisting of a document of English sentences and one in another language like German, Dutch or Portuguese. For example, the German corpus has 2,176,537 sentences. The primary goal of this corpus is to support machine translation systems1. As we are mainly focussing on German and English Wikipedia articles, we chose the first 500 lines of the English/German corpus and translated the German sentences into English using the Bing Translator2. This leads to a data set of 500 human-written English sentences and 500 German sentences machine-translated into English.

Approach

To judge the similarity measures, it does not suffice to compare only the correctly aligned sentence pairs, because it is also important to know how the measures behave on wrongly aligned sentence pairs. Therefore, all four measures are applied to each possible combination of sentences, which results in 500 · 500 = 250,000 sentence pairs in total. In the first part of this evaluation, the features are compared in terms of precision and recall. That means, for each feature, the set of 250,000 sentence pairs is sorted in descending order of their similarity values. After that, the top k sentence pairs are taken from the sorted set to compute precision (the fraction of correctly aligned sentence pairs among the k pairs) and recall (the fraction of the 500 correctly aligned sentence pairs that are returned). This is done stepwise from k = 0 to k = 250,000. In the second part, the meaning of the similarity value is investigated. When later applying the features, it is necessary to choose a threshold value. If it is exceeded by the similarity of a sentence pair, its sentences will be aligned. To find out how this threshold should be set for the various measures, a box plot3 is created for each of them. To do so, the similarity between the sentences of each of the 500 correctly aligned pairs is computed. Having a set of 500 similarity values, it then becomes possible to identify their median and their quartiles.
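A minimal sketch of the first part of this evaluation, assuming the similarity values for all pairs have already been computed; names are illustrative:

```python
def precision_recall_curve(scored_pairs, gold_pairs):
    """Precision and recall after each rank when all pairs are sorted by similarity.

    `scored_pairs` maps a sentence pair to its similarity value, `gold_pairs`
    is the set of correctly aligned pairs (here: the 500 Europarl pairs).
    """
    ranked = sorted(scored_pairs, key=scored_pairs.get, reverse=True)
    curve, true_positives = [], 0
    for k, pair in enumerate(ranked, start=1):
        if pair in gold_pairs:
            true_positives += 1
        curve.append((true_positives / k, true_positives / len(gold_pairs)))
    return curve


def break_even_point(curve):
    """Point on the curve where precision and recall are (approximately) equal."""
    return min(curve, key=lambda pr: abs(pr[0] - pr[1]))
```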

Results

Figure 5.1 shows the result of the computation of precision and recall for the four features. For an easy comparison, the break-even points (BEPs) are marked. These are the points where precision and recall equalise.

1 http://www.statmt.org/europarl/
2 http://www.bing.com/translator/
3 http://en.wikipedia.org/wiki/Box_plot

Obviously, the bigram similarity performs worst for our goals: While all measures can be used in the area of high precision and low recall, the recall for bigram similarity is much lower for a precision below 90%. One reason for this is that machine translation cannot be guaranteed to preserve the order of word pairs. The other three features perform similarly well and are well-suited to our goals. For a precision of 80%, they still return just slightly less than 80% of all parallel sentence pairs. While it is not possible to say whether Cosine or Text Overlap similarity performs better on this data (their BEPs are within one percent of each other), the LCS similarity is a bit worse compared to them for a recall bigger than 80%. As a result, we will only consider Text Overlap and Cosine similarity in the following sections.

Figure 5.1 Precision-Recall Graphs for Textual Features with Break-Even Points (Text Overlap: 78.6, Cosine: 77.9, LCS: 75.8, Bigram: 58.8; axes: Recall (%) vs. Precision (%))

Figure 5.2 contains four box plots. The results make clear why an optimal precision cannot be reached: Some sentence pairs are assigned very low similarity values; there are even 7 pairs whose terms do not overlap at all. On the other hand, there are only 21 pairs (4.2%) that overlap completely. For the longest common subsequence, the similarity is always above 0, because all sentence pairs share at least some letters. As a starting point for further investigation in Section 6, we will choose the similarity values of the first quartile as thresholds (meaning that 75% of the correctly aligned pairs in this data set exceed them), which is 0.39 for cosine and 0.3 for text overlap.

Figure 5.2 Box Plots for Textual Features (distribution of the similarity values of the 500 correctly aligned sentence pairs for Cosine, Text Overlap, Bigram and LCS)

5.3 Semantic Features

Semantic features represent additional information that is either extracted from the texts with external tools or given by the Wikipedia formatting. Due to their special characteristics, it is often more difficult to derive similarity measures from them.

Time Similarity

Time annotations can be extracted from sentences in forms such as "2013-04-17", "2014" or even "19XX" (representing a whole century). That means, each time annotation represents a time range that can be a single day or even a whole century. If two sentences contain overlapping time annotations, this is a sign that they describe events that happened at the same time and hence probably represent the same facts. An important aspect is that the presence of the same time ranges is not always a good hint for a high probability that the sentences represent the same contents: The wider the range of a time period, the less important it is for the time similarity. This is demonstrated by the examples in Table 5.1, where the first sentence pair talks about different facts that happened in the same century, while in the second sentence pair both sentences contain the same facts and mention the same concrete date. Another issue is that some sentences contain more than one time annotation, which makes a trivial computation of overlapping days difficult.

1. English sentence: Emigration from Europe began with Spanish and Portuguese settlers in the 16th century [...].
   German sentence: Die Reformation im 16. Jahrhundert spaltete die westliche Kirche [...] in einen katholischen und evangelischen Teil.
   German sentence (manually translated): The reformation in the 16th century split the Western church into a catholic and an evangelic part.
   Time annotations: 16th century (ca. 36,524 days)

2. English sentence: On 1 January 2007, Romania and Bulgaria became EU members.
   German sentence: Am 1. Januar 2007 wurden als 26. und 27. Mitgliedstaat Rumänien und Bulgarien in die Union aufgenommen.
   German sentence (manually translated): On 1 January 2007, Romania and Bulgaria were included into the Union as 26th and 27th member.
   Time annotations: 1 January 2007 (1 day)

Table 5.1 Example Sentence Pairs for Time Similarity

To overcome the first issue, we assign relevance values to the time intervals according to their length: the longer the time interval, the smaller the relevance value. We set the weight of a time interval to $w(t_i) = 1$ if $t_i$ represents a particular date, $w(t_i) = 0.85$ for a month and $w(t_i) = 0.6$ for a year.

To compute the time-based similarity $\mathrm{Sim}_{Time}$ between two sentences $s_1, s_2$, we align each time annotation $t_i$ with its best matching counterpart $t_j$ (if any) in these sentences and sum up the minimum relevance weights of the aligned annotations to get a time overlap value tovl:

\[
\mathrm{tovl}(s_1, s_2) = \sum_{t_i \in s_1} \sum_{t_j \in s_2}
\begin{cases}
\min(w(t_i), w(t_j)) & \text{if } (*)\\
0 & \text{otherwise}
\end{cases}
\tag{5.4}
\]

where $(*)$ holds if $t_i$ and $t_j$ refer to overlapping time intervals and there is no other overlapping $t_{j'} \in s_2$ with a higher weight for $\min(w(t_i), w(t_{j'}))$. If, for example, the annotations "2011/03/20" and "2011/03" are aligned, the relevance weight for a month is taken. The time overlap is computed for both directions, summed up and then normalized by the total number of time annotations in the sentences, $|s_1|$ and $|s_2|$:

\[
\mathrm{Sim}_{Time}(s_1, s_2) = \frac{\mathrm{tovl}(s_1, s_2) + \mathrm{tovl}(s_2, s_1)}{|s_1| + |s_2|}
\tag{5.5}
\]

For example, if sentence $s_1$ contains the time annotations "2011/03/20" and "2011" and sentence $s_2$ contains "2011/03", the similarity is calculated as

\[
\mathrm{Sim}_{Time}(s_1, s_2) = \frac{(0.85 + 0.6) + (0.85)}{2 + 1} \approx 0.767.
\]
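A minimal sketch of this computation, assuming each time annotation is represented by its date range and a granularity weight (day = 1, month = 0.85, year = 0.6); the data model is illustrative:

```python
from datetime import date


def overlaps(range_a, range_b):
    """True if the two (start, end) date ranges intersect."""
    return range_a[0] <= range_b[1] and range_b[0] <= range_a[1]


def time_overlap(times_a, times_b):
    """tovl from Eq. 5.4: each annotation of the first sentence is matched with its
    best overlapping counterpart in the second one and contributes the smaller weight."""
    total = 0.0
    for range_a, weight_a in times_a:
        matches = [min(weight_a, weight_b)
                   for range_b, weight_b in times_b if overlaps(range_a, range_b)]
        if matches:
            total += max(matches)
    return total


def sim_time(times_s1, times_s2):
    """Symmetric, normalised time similarity (Eq. 5.5)."""
    if not times_s1 and not times_s2:
        return 0.0
    return (time_overlap(times_s1, times_s2) + time_overlap(times_s2, times_s1)) \
        / (len(times_s1) + len(times_s2))


# The worked example from above:
s1 = [((date(2011, 3, 20), date(2011, 3, 20)), 1.0),
      ((date(2011, 1, 1), date(2011, 12, 31)), 0.6)]
s2 = [((date(2011, 3, 1), date(2011, 3, 31)), 0.85)]
print(round(sim_time(s1, s2), 3))   # 0.767
```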

Entity Similarity

Sentences describing the same facts are likely to mention the same entities. Therefore, we define an entity similarity measure. Entities are found by uniting the given internal Wikipedia links and further links extracted by DBpedia Spotlight. As entities always refer to Wikipedia pages, we also call this feature Wikipedia annotations. It is important to account for the selectivity of entities. However, we observed that due to the sparsity and the distribution of the Wikipedia annotations in sentences, cosine similarity measures directly applied to the Wikipedia annotations do not lead to a very precise sentence alignment. This can be shown by the following example with sentences from the Wikipedia articles about the German capital Berlin:

• English sentence: Berlin is Germany's largest city.

• German sentence: Zudem ist Berlin der bedeutendste Verlagsstandort Deutschlands.

• German sentence (human-translated): Besides, Berlin is the major publishing center in Germany.

Both sentences contain the single entity "Germany", which can be found more than 50 times in both articles. Therefore, it should add nearly nothing to the similarity of the sentences. However, both the text overlap and the cosine similarity fail in this case: The Jaccard overlap obviously returns a similarity value of 1, but so does the cosine similarity because of its normalization factor in the denominator. This is illustrated in the following cosine calculation (with w defined as in Section 5.1), where terms other than the entity "Germany" are not shown:

\[
\mathrm{Sim}_{EntityCosine}(s_1, s_2) = \frac{w_{Germany,s_1} \cdot w_{Germany,s_2}}{\sqrt{w_{Germany,s_1}^2} \cdot \sqrt{w_{Germany,s_2}^2}} = 1.
\tag{5.6}
\]

Due to the sparsity and the distribution of the Wikipedia annotations in sentences, it is necessary to change this behaviour by adding a smoothing factor $\vec{n}$ to the cosine similarity computation. We create a vector $\vec{n}$, where $n_i$ is the weight of the annotation i computed as:

\[
n_i = \max\left(0.1,\ 1 - \frac{\mathrm{df}_i - 2}{\alpha}\right)^{\beta}
\tag{5.7}
\]

where $\mathrm{df}_i$ denotes the number of sentences with an aligned annotation i. The weights computed by Equation 5.7 are in the interval $[0.1^{\beta}, 1]$, with the lower weights corresponding to the more common annotations. For the most selective annotations that appear in two sentences ($\mathrm{df}_i = 2$)4, the cosine sentence similarity computation remains unchanged. For less selective annotations the similarity is reduced faster compared to plain cosine, using two factors: β, which controls the degree of similarity decrease, and α, which limits the maximal number of occurrences up to which an annotation is considered as relevant.

4 If the annotation appears in just one sentence, it can never occur in both sentences of a sentence pair.

When calculating the similarity $\mathrm{Sim}_{Entity}$ of two sentences $s_1$ and $s_2$ based on the Wikipedia annotations, the tf-idf weights of the annotations are adjusted by $\vec{n}$:

\[
\mathrm{Sim}_{Entity}(s_1, s_2) = \frac{\sum_{i=1}^{N} w_{i,s_1} \cdot w_{i,s_2} \cdot n_i}{\sqrt{\sum_{i=1}^{N} w_{i,s_1}^2} \cdot \sqrt{\sum_{i=1}^{N} w_{i,s_2}^2}}
\tag{5.8}
\]

where $w_{i,s_j}$ is the tf-idf weight of the annotation i in the sentence $s_j$ and N is the number of distinct aligned annotations in the sentences of both articles. We experimentally set α = 25 and β = 5.
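A minimal sketch of the smoothed annotation cosine, assuming the tf-idf weights and sentence frequencies of the annotations have already been computed; names are illustrative:

```python
import math


def annotation_weight(df_i, alpha=25, beta=5):
    """Smoothing weight n_i from Eq. 5.7; df_i is the number of sentences that
    contain annotation i."""
    return max(0.1, 1 - (df_i - 2) / alpha) ** beta


def sim_entity(w_s1, w_s2, df):
    """Smoothed annotation cosine (Eq. 5.8).

    w_s1 and w_s2 map annotation URIs to their tf-idf weights in each sentence,
    df maps annotation URIs to the number of sentences they appear in."""
    dot = sum(w_s1[a] * w_s2[a] * annotation_weight(df[a])
              for a in set(w_s1) & set(w_s2))
    norm1 = math.sqrt(sum(v * v for v in w_s1.values()))
    norm2 = math.sqrt(sum(v * v for v in w_s2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```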

External Link Similarity

Common external links can lead to a higher similarity of sentences as well. Similarly to the entity similarity, the selectivity of external links has to be taken into account. Therefore, the same calculation is used (with α = 5 and β = 2, because links are much rarer).

External Link Hosts Similarity

The external link similarity can be split up into two parts: the comparison of the full URLs and of the host names only (both with smoothing). We weight the external link similarity with 25% and the host similarity with 75% to compute a single similarity score from both measures.
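A short sketch of this combination, assuming `smoothed_cosine` is a callable implementing the smoothed cosine of Eq. 5.8 over two lists of strings (here applied with α = 5 and β = 2); the helper names are assumptions:

```python
from urllib.parse import urlparse


def sim_external_links(links_s1, links_s2, smoothed_cosine):
    """Combine full-URL similarity (25%) with host-only similarity (75%)."""
    hosts_s1 = [urlparse(u).netloc for u in links_s1]
    hosts_s2 = [urlparse(u).netloc for u in links_s2]
    return 0.25 * smoothed_cosine(links_s1, links_s2) \
         + 0.75 * smoothed_cosine(hosts_s1, hosts_s2)
```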

5.4 Evaluation of Entity Extraction Tools

As described in Section 3.1.3, one of the semantic features used to identify similar sentences is the recognition of named entities (NER). To get an idea of the quality of NER tools and to rate their performance on texts in different languages, we ran an evaluation of two NER tools that were applied to manually annotated texts in English and German.

5.4.1 Aim and NER tools

To apply NER to Wikipedia, it is important to know how different NER tools perform with regard to our goal of extracting meaningful and comparable entities within sentences. For our tests, we use the following three configurations of NER tools:

• Wikify [18]

• DBpedia Spotlight [7] with a confidence of 0.6. As we are primarily interested in persons, locations and organisations, extracted entities that are not assigned a type are ignored.

• DBpedia Spotlight [7] with a confidence of 0.8: This is supposed to return fewer, but more precise entities.

5.4.2 Data

For this evaluation, the NER tools have to be applied to texts for which all named entities have already been extracted correctly. These manual annotations provide the ground truth for the evaluation of the NER tools. The N3 data set, presented in [25], meets all criteria for the evaluation: It contains both English and German texts which are annotated with unique DBpedia links. The N3 data is formatted in XML. Listing 5.1 gives an example of a single document in the English corpus with one named entity.

Listing 5.1 Extract of an example document of the N3 data set

http://www.research.att.com/~lewis/Reuters-21578/15021
Reuters-21578
... , Nassau Branch is issuing a 40 mln Australian dlr eurobond due May 15, 1990 paying 14-1/2 pct and priced at 101-3/8 pct, lead manager Hambros Bank Ltd said. The non-callable bond is available in denominations of 1,000 Australian dlrs and will be listed in ...

To manage this data, the following preprocessing steps are necessary for the English and the German document sets:

• Parse the XML file and build a set of documents $D_L$ (L is the language), where each document $d \in D_L$ is identified by its document id5. Besides, d is assigned both a text string and the entity links together with their positions.

• Use a sentence splitting algorithm6 to divide the text string of each document $d \in D_L$ into a set of sentences S.

• For each sentence $s \in S$, collect the belonging entity links as follows: An entity link belongs to a sentence if its start and end position (with regard to the text string of the document) lie between the start and end position of the sentence. Ignore entities without a valid DBpedia link7. (A small sketch of this assignment step follows the list.)
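A minimal sketch of the position-based assignment of entity links to sentences, assuming sentences and links carry character offsets relative to the document text; the field names are illustrative:

```python
def assign_entities_to_sentences(sentences, entity_links):
    """Attach each entity link to the sentence whose character span contains it.

    `sentences` is a list of dicts with 'start', 'end' and an 'entities' list;
    `entity_links` is a list of dicts with 'start', 'end' and a DBpedia 'uri'
    (links without a valid URI, e.g. the N3 "not in wiki" placeholder, are skipped).
    """
    for link in entity_links:
        if not link.get("uri"):
            continue
        for sentence in sentences:
            if sentence["start"] <= link["start"] and link["end"] <= sentence["end"]:
                sentence["entities"].append(link["uri"])
                break
    return sentences
```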

These preprocessing steps result in an English collection $D_{en}$ with 700 sentences and a German collection $D_{de}$ with 1,627 sentences. Table 5.2 shows some statistics for both document sets.

Document Collection       English    German
Source                    Reuters    news.de
Documents                 128        99
Sentences                 700        1627
Entities                  878        1618
Entities per Sentence     1.2543     0.9945

Table 5.2 Statistics of the N3 Dataset

5.4.3 Entity Extraction and Comparison

Having collected all sentences and their manually annotated "ground truth" entity links, the entity extraction must now be applied to every sentence for every investigated NER tool. As stated above, we are evaluating Wikify and DBpedia Spotlight, with the latter being applied both with a confidence of 0.6 and of 0.8. Both NER tools operate using language-dependent training data, which means that six entity extraction runs have to be executed (three NER tool configurations in two languages), whereby for each sentence only the three extraction runs for its language are needed. The results of these extractions are shown in Table 5.3 for the English and in Table 5.4 for the German data set.

5 In the English data set, this is the content of the DocumentURI element.
6 In our study, the Apache OpenNLP library (https://opennlp.apache.org/) is used.
7 In the N3 data set, named entities that are not found in DBpedia are annotated with a default "not in wiki" link.

                                          Manual   Wikify   DBpedia Spotlight
                                                            c = 0.6   c = 0.8
Entities                                  649      910      508       331
Entities per Sentence                     0.927    1.300    0.726     0.473
Entities with English links               482      697      462       296
Fraction of entities with English links   0.743    0.766    0.909     0.894

Table 5.3 Number of Entities Extracted from English Texts

                                          Manual   Wikify   DBpedia Spotlight
                                                            c = 0.6   c = 0.8
Entities                                  1513     1151     1144      864
Entities per Sentence                     0.930    0.707    0.703     0.531
Entities with English links               1473     1136     1131      853
Fraction of entities with English links   0.974    0.987    0.989     0.987

Table 5.4 Number of Entities Extracted from German Texts

5.4.4 Comparison

The big advantage of using entity links instead of entity names is the simplicity of matching the entities. Given the same reference – which is DBpedia for all three entity sources investigated in this study – two entities are the same if their URIs are identical. To compare the results, we compute precision, recall and F1 score values8 using TP, FP and FN counts, which are interpreted as follows9:

• True Positive (TP): Number of entities that are both in the ground truth and in the extracted entities (correctly found entities).

• False Positive (FP): Number of entities that are in the extracted entities, but not part of the ground truth entities (wrongly found entities).

• False Negative (FN): Number of entities that are in the ground truth, but not part of the extracted entities (not found entities).

8 http://en.wikipedia.org/wiki/Precision_and_recall
9 By using these definitions for TP, FP and FN, the exact positions of the extracted entities are ignored. However, this should not be a big concern when working on the sentence level (which allows for only small ranges of positions).

Algorithm 5.1 shows the computation steps to retrieve the needed values for the evaluation of one entity source like DBpedia Spotlight with c = 0.6 (with getEntityLinks(L, e) returning the entity links in language L and entity source e found in the given sentence).

Algorithm 5.1 Computation of TP, FP and FN

1: procedure evaluate(D_L, L, EntitySource e)
2:     tp = 0; fp = 0; fn = 0
3:     for each d ∈ D_L do
4:         for each Sentence s ∈ d.S do
5:             List A = s.getEntityLinks(L, 'ground truth')
6:             List B = s.getEntityLinks(L, e)
7:             tp = tp + |A ∩ B|
8:             fp = fp + |B − A|
9:             fn = fn + |A − B|
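A minimal Python rendering of Algorithm 5.1 together with the derived scores, assuming the entity links have already been collected per sentence; the data model is illustrative:

```python
def evaluate(documents, ground_truth, extracted):
    """Count TP, FP and FN over all sentences and derive precision, recall and F1.

    Both `ground_truth` and `extracted` map a sentence id to the set of entity
    URIs annotated in that sentence (manually and by the NER tool, respectively).
    """
    tp = fp = fn = 0
    for document in documents:
        for sentence_id in document:
            a, b = ground_truth[sentence_id], extracted[sentence_id]
            tp += len(a & b)
            fp += len(b - a)
            fn += len(a - b)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```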

So far, the identification of identical entities has been just a simple string comparison. However, as this study aims at comparing Wikipedia articles written in different languages, the sentences to be compared contain entity links from different DBpedia language versions. That means, an intermediate processing step is needed: For example, to compare an English and a German entity, one of the entity links has to be transferred into the other language, which is done using the underlying DBpedia information described in Section 2.1. If there is no counterpart of the English entity link in the German DBpedia, or vice versa, this entity link cannot be used for the comparison and is useless for our purposes. The evaluation so far relied on the entity links in the original language of the texts. To make sure that the extracted entities also exist in other languages and can be used for a comparison, we ran another evaluation in which all entity links that are not available in the other specified language are ignored. To do so, we add the following preprocessing step:

• For all entity links extracted (from any source) in the data set, collect the corresponding entity links in a predefined language L2 ≠ L.

In this evaluation, L2 = de for the English data set and L2 = en for the German data set. Algorithm 5.1 is modified such that all entity links without counterparts in L2 are removed from the ground truth and extracted entity sets A and B (Lines 5 and 6).

5.4.5 Results

The results of this evaluation are shown in Tables 5.5a (English data set) and 5.5b (German data set) for the evaluation using the original language entity links, and in Tables 5.5c (English data set) and 5.5d (German data set) for the evaluation using the links in language L2.

(a) English texts, English links
             Wikify    DBpedia Spotlight
                       c = 0.6   c = 0.8
TP           246       245       190
FN           387       388       443
FP           664       263       141
Precision    0.270     0.482     0.574
Recall       0.389     0.387     0.300
F1 score     0.159     0.215     0.197

(b) German texts, German links
             Wikify    DBpedia Spotlight
                       c = 0.6   c = 0.8
TP           627       803       628
FN           862       686       861
FP           506       341       236
Precision    0.553     0.702     0.727
Recall       0.421     0.305     0.422
F1 score     0.239     0.305     0.267

(c) English texts, German links
             Wikify    DBpedia Spotlight
                       c = 0.6   c = 0.8
TP           212       212       164
FN           270       270       318
FP           485       250       132
Precision    0.304     0.459     0.554
Recall       0.440     0.440     0.340
F1 score     0.180     0.225     0.211

(d) German texts, English links
             Wikify    DBpedia Spotlight
                       c = 0.6   c = 0.8
TP           621       795       620
FN           852       678       853
FP           515       336       233
Precision    0.547     0.703     0.727
Recall       0.422     0.540     0.421
F1 score     0.238     0.305     0.267

Table 5.5 Results of Entity Extraction

The results make clear that DBpedia Spotlight is a better choice for our requirements than Wikify (for example, the F1 score on English texts with English links is 0.159 for Wikify, which is lower than 0.215 for DBpedia Spotlight with c = 0.6). For DBpedia Spotlight, the precision and recall values behave as expected: A higher confidence threshold leads to a smaller number of extracted entities (i.e. lower recall), but a higher precision. The F1 score indicates that the choice of a lower confidence can be reasonable as long as a high precision is not crucial.

To judge the suitability of the NER tools for the comparison of texts in different languages, two aspects are important: A large percentage of the extracted entity links must be available in the other language as well, and the precision and recall values should not suffer much from using the links from the other language. According to Tables 5.3 and 5.4, much more than half of the links exist in both German and English10. Given the results in this section, the precision and recall values vary only a little.

There is a large number of FPs, which was already suspected because the following assumption does not hold: Named entities – mainly in the sense of persons, locations and organisations – and entities with links are not the same. For example, there exist Wikipedia articles for every day of the year11. Besides, the NER tools used in this evaluation can find more than one entity link where the N3 data set only contains one. As an example, Wikify finds the entities "Gulf of Aden" and "Aden" within the words "Gulf of Aden", but only the latter was extracted manually. These problems occur for every NER tool, so the precision and recall values are – although being low – still comparable. For the use in the text comparison, this is not a problem, as the same NER tool will be used on both texts.

10 Because of the reasons described in Section 2.4, there are more English links for "German entities" than the other way around.

11 e.g. http://en.wikipedia.org/wiki/March_1

6 Sentence Alignment and Evaluation

We now have a set of syntactic and semantic similarity measures and need to combine them to derive similarity values for sentence pairs, such that an alignment function can use them to automatically identify similar sentences in an article pair. This function takes two sentences as input1 and outputs a similarity value that is 0 for the most dissimilar sentences2. Each revision pair is investigated independently of the others. This means that we only search for sentences occurring similarly in the English and German version of the same revision. To be more precise, we search for German counterparts of the English sentences.

To specify the similarity of sentences, human evaluators need to annotate given pairs of sentences with one of the three options "same content (i.e. facts)", "partly same content" and "different content". This is quite a rough classification, but it allows users to distinguish similarity values without much background knowledge. In fact, it seems to be impossible to make any finer subdivision without stating many vague and controversial definitions.

There are two aims of the study: The first one is to get an idea of the importance of the features used for measuring sentence similarity – the texts of the sentences together with their machine translation, external links, Wikipedia and time annotations. Having gained this information, we want to derive numerical weights for the similarity measures to get an optimal configuration for our alignment function. Using the evaluated data, this also makes it possible to specify precision and recall values.

This evaluation is done in two steps: In the first round, a rough pre-selection of sentence pairs from 14 long Wikipedia articles is done. These sentence pairs are then evaluated by human users to get an idea of which similarity values are important for the sentence similarity. In the second round, these values are used to make a more precise pre-selection of sentence pairs in another dataset – also followed by a manual similarity classification. Finally, these results are used to build up a similarity function.

1 Later, we will also use paragraphs as input.
2 In contrast to the feature similarity measures, we do not require that the maximum similarity is 1.

Classification versus Regression

To analyse the results, machine learning techniques – namely linear regression – will be used to build a kind of classifier from the data collected in the user study. This classifier will be used to predict the similarity of unknown sentence pairs (and of known sentence pairs for evaluation). As the users can label a sentence pair as having the same, partly the same or different content, an analysis of the results in the form of a ternary classifier seems reasonable. However, we aim at building a similarity function that assigns continuous similarity values to sentence pairs (e.g. Sim(s1, s2) = 0.62 for two sentences s1 and s2) rather than discrete classes. This allows for a better construction of similar paragraphs (as described in Section 7.1): for a paragraph containing five same and two different sentence pairs, it is no problem to calculate an average similarity value using continuous values. Nevertheless, the assignment to the three classes will be used for the evaluation of the results, as this makes it possible to calculate exact precision and recall values (in contrast to measures like the mean squared error). To transform a similarity value into a class, the thresholds t_partly and t_same are defined, with t_same > t_partly. If Sim(s1, s2) ≥ t_partly, the sentences s1 and s2 are considered at least partly overlapping. If Sim(s1, s2) ≥ t_same, their contents are supposed to be the same.

6.1 Data

For the first step of our evaluation, we chose 14 Wikipedia articles that were available both in English and German from the list of controversial articles provided in [29]. From each article, the revisions that were current at extraction time (28th of November, 2014) were taken and their sentences were extracted as described later in Section 8.3 – ignoring text parts that are not within the main paragraphs (text in info boxes, tables, . . . ) or that consist of less than 10 characters. Being controversial, the chosen articles do not only add more interest towards our aim of finding language-specific differences, but such articles also tend to be longer, which seems to make them a good choice for the study3. Table 6.1 gives an overview of the chosen articles and their sentence counts. The number of sentences is much more balanced between the German and English Wikipedia than expected from the findings in Section 2.4. This is because some of the articles were taken from the German list of controversial articles.

3 As noticed later in Section 6.8, this is debatable.

Title                    English Wikipedia          German Wikipedia
                         Revision ID    Sentences   Revision ID    Sentences
Berlin                   635429067      440         136234983      797
Esotericism              635229541      86          135476259      363
European Union           635761078      439         136109478      687
Global warming           635667279      261         136180469      385
Hanover                  633698142      278         136254194      824
Japan                    634787979      348         136105720      387
Libertarianism           635380344      315         133958070      174
Minimum wage             634528784      246         136237192      315
Nicolaus Copernicus      634443003      376         134393612      203
                         633641386      414         136227798      248
Srebrenica massacre      635701542      939         135952331      328
Studentenverbindung      635237484      127         135587947      380
Truth                    635761335      275         134999700      479
Wii                      634756716      299         136003701      240
Total                    -              4843        -              5810

Table 6.1 Wikipedia Articles Used in the User Study

6.2 Pre-Selection of Sentence Pairs

As described in the previous Chapter 5, sentences are compared using different similarity measures, each based on one feature. The features are grouped into syntactic and semantic features:

• Syntactic features
  – Cosine
  – Text overlap
  – Text length

• Semantic features
  – External links
  – Wikipedia annotations
  – Time annotations

It is not reasonable to use the same similarity computation for every pair of sentences, because not all similarity measures may be applicable. For example, less than 13 percent of the investigated German sentences contain external links.

Level    Sentences (English)    Sentences (German)
0        674                    725
1        210                    93
2        94                     143
3        1768                   2981
4        848                    28
5        57                     381
6        730                    1210
7        462                    249

Table 6.2 Feature Combination Distribution in 14 Wikipedia Articles

Syntactic similarity measures can be applied to any sentence, but the other features break down into eight combinations, which are shown in Table 6.2. To refer to these combinations, they are labelled with a number (level). Given these numbers, it becomes clear that it is almost impossible to evaluate every sentence pair. For example, the English 'Japan' article has 348 sentences and the German one has 387 sentences. This makes a total of 348 · 387 = 134,676 sentence pairs whose similarities would have to be calculated and which could possibly belong together. Taking a random subset of all sentences does not suffice either:

Given a revision $r_{L_1}$ with n sentences and another revision $r_{L_2}$ with m sentences, there are n·m possible sentence pairs. Assuming that for 20 percent of the sentences in the first revision there exists a similar sentence in $r_{L_2}$4, only $\frac{0.2 \cdot n}{n \cdot m} \cdot 100 = \frac{20}{m}$ percent of the sentence pairs are similar. This is a very small number that would mean a large number of "not similar" instances in the test set (for the Japan example: $100 - \frac{20}{348} \approx 99.94$ percent of the sentence pairs are not similar). Therefore, the dataset undergoes a pre-selection resulting in a smaller set of sentence pairs such that the fraction of at least partially overlapping sentences becomes bigger. Hence, the evaluation data is not an exact representation of the actual data, which may cause some unavoidable bias.

To filter the sentence pairs and to get a set of sentences that are more likely to be similar than randomly chosen sentence pairs, a set W of similarity functions is manually created, with the similarity functions defined in the following manner: $w = (\{(m_i, w_i)\}, t)$ with $\sum_i w_i = 1$ and $0 \le t \le 1$, where $m_i$ is one of the similarity measures defined above. The similarity of a sentence pair $(s_1, s_2)$ is calculated as $\mathrm{Sim}_w(s_1, s_2) = \sum_i w_i \cdot m_i(s_1, s_2)$. If $\mathrm{Sim}_w(s_1, s_2) \ge t$, the sentence pair is added to the pre-selection.

4 Of course, this number can only be guessed at this point of the evaluation.

Table 6.3 shows all the similarity functions used in the pre-selection. For simplicity, the table only shows the similarity functions that use the text overlap similarity, but cosine and bigram similarity are used as well. Following the findings in Section 5.2, the threshold for the bigram similarity becomes 0.3 and for the cosine similarity 0.4 (marked with "*"). The similarity function in the last row combines all three text similarities, each with a weight of 1/7 (summing up to 3/7). This finally leads to a set of 4 · 3 + 1 = 13 similarity functions used to find candidates for similar sentences.

Weight                                                              Threshold
Text    Text Length    Entity Links    External Links    Times
0.9     0.1            0               0                 0         0.35*
0.45    0.1            0.45            0                 0         0.4
0.45    0.1            0               0.45              0         0.4
0.45    0.1            0               0                 0.45      0.4
3/7     1/7            1/7             1/7               1/7       0.4

Table 6.3 Weights of Similarity Functions for Pre-Selection

Sentence pairs are now found by the procedure given in Algorithm 6.1. By iterating over the sentence sets of both articles, sentence pairs are created. A sentence pair is added to the candidates if its similarity for at least one of the similarity functions w ∈ W is above that configuration's threshold (Line 7).

Algorithm 6.1 Identification of Candidates for Similar Sentences

1: procedure findSimilarSentences(d_L1, d_L2, W)
2:     C = ∅
3:     for each s1 ∈ d_L1 do
4:         for each s2 ∈ d_L2 do
5:             sp = (s1, s2)
6:             for each w ∈ W do
7:                 if sp.calculateSimilarity(w) ≥ w.threshold then
8:                     C = C ∪ {sp}
9:                     break
10:    return C
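A minimal Python rendering of the pre-selection, assuming the individual similarity measures are available as callables; names and the data model are illustrative:

```python
def weighted_similarity(pair, measures, weights):
    """Sim_w(s1, s2) = sum_i w_i * m_i(s1, s2) for one pre-selection function."""
    return sum(weight * measures[name](pair) for name, weight in weights.items())


def find_candidate_pairs(sentences_en, sentences_de, functions, measures):
    """Keep every sentence pair whose similarity exceeds the threshold of at
    least one pre-selection function (cf. Algorithm 6.1).

    `functions` is a list of (weights, threshold) tuples as in Table 6.3;
    `measures` maps a measure name to a callable taking a sentence pair.
    """
    candidates = []
    for s1 in sentences_en:
        for s2 in sentences_de:
            pair = (s1, s2)
            if any(weighted_similarity(pair, measures, weights) >= threshold
                   for weights, threshold in functions):
                candidates.append(pair)
    return candidates
```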

The set of candidate sentence pairs C is stored, together with the similarity values for each similarity function. This results in the number of candidates shown in Table 6.4.

Level    Sentence pairs
0        162
1        27
2        49
3        467
4        6
5        64
6        196
7        18

Table 6.4 Feature Combination Distribution in Pre-Selected Sentence Pairs

6.3 Selection of Sentence Pairs for Evaluation

The pre-selection has resulted in t = 989 sentence pairs that need to be evaluated. As this number is too big for a manual evaluation, a "random" subset of the candidates is extracted that has the following properties: There are l levels of feature applicability and k articles with t sentence pairs in total. The aim is to extract n datasets with about m sentence pairs each. Furthermore, the extracted datasets have to fulfil the following two conditions:

• The sentence pairs have to be evenly distributed among the eight levels of feature applicability. That means, given l levels, there should be n·m/l sentence pairs per level. Given the candidate counts in Table 6.4, this is not always possible, as there are, for example, too few instances for level 7. In such a case, each of these sentence pairs is added to the evaluation set. Conversely, the number of sentence pairs becomes bigger than n·m/l for the other levels.

• The sentence pairs have to be evenly distributed among the articles. That means, there should be n·m/k sentence pairs per article. Special cases of too few sentence pairs per article are treated as in the first condition.

The extraction is done by first iterating over all levels. For each level, a random article with a random sentence pair from that article is chosen m times per dataset and added to the current dataset. A special case is the combination of sentence pairs sharing the same sentence in language L1 (which is English in the study). These sentence pairs will be presented to the user at the same time. When adding a sentence pair to the subset, the procedure therefore immediately searches for other sentence pairs with this commonality and adds them to the current dataset. This is the reason why the datasets will eventually contain more than m sentence pairs. In our evaluation we used the following parameters: k = 14, l = 8, n = 4, m = 70, which lead to the sentence pair subset shown in Table 6.5.

Level    Sentence Pairs
0        30
1        24
2        25
3        57
4        6
5        47
6        51
7        18

Table 6.5 Feature Distribution in the Dataset for the First Round of Evaluation

6.4 User Study

Given the sentence pairs, they were evaluated by users via a web-based interface. Each user is randomly assigned one of the n datasets and is shown one English sentence and the corresponding German sentences in each step. For each sentence pair, the user needs to choose one of the following options, which were explained beforehand (together with examples) in the following manner:

• same content (i.e. facts): The German and the English sentences cover (nearly) the same content (i.e. facts). This is for example the case with translations. Small variations in the fact descriptions are okay (see examples).

• partially overlapping: The German and the English sentences partly cover the same content (facts) or are otherwise very similar.

• different content: The German and the English sentences have nothing or very little in common (apart from the fact that both sentences are somehow related to the same Wikipedia article).

• don't know: If you are not sure, please take this option.

• corrupted sentence: If the German or the English sentence is somehow corrupted (not a correct sentence), take this option.

The option of declaring corrupted sentences is necessary, as these errors do not originate from our procedures, but from the sentence splitting algorithm. Affected sentences are therefore ignored in the study. As a help – for example to find out what word a pronoun refers to – links to the Wikipedia pages of the article versions of the current sentence pairs are provided to the user.

Figure 6.1 Screenshot of the user study: On the top, there is an English sentence from the Wikipedia article "Wii". Below, there are three German sentences. For each of them, the user has to check one of the options. On the bottom, links to the respective Wikipedia articles are provided.

Figure 6.1 gives an example of one evaluation step for an English sentence with more than one possible German counterpart. To control the quality of the users' ratings and to exclude users that just click the options at random, a "honeypot" is included in each dataset. This is a set of ground truth objects built by manually creating some sentence pairs that have nothing in common and some that share the same content5. Of course, these pairs are excluded when evaluating the results. As mentioned before, the final evaluation will not only focus on the ternary classification itself, but also on the calculation of continuous similarity values. For that purpose, the three classes have to be mapped to numbers – namely 0 for "different content", 0.5 for "partially overlapping" and 1.0 for "same content".

6.5 Judgement of Similarity Measures

In total, 11 users (mainly graduate CS students with good knowledge of both languages) participated in the user study. Each user performed at least 50 tasks (the average evaluation time for a set of 50 tasks was 30 minutes). As a result of the user study, we obtained a benchmark containing 229 aligned sentence pairs (corrupted sentences and those in the honeypot are ignored): 16 in the "same content", 104 in the "partially overlapping" and 109 in the "different" categories.

A simple method to analyse the impact of different features on the similarity of the sentence pairs is to determine the correlation between each single similarity measure's value and the similarity assigned by the users. A low correlation denotes that the specific measure is not a good indicator for similarity. To compute the correlation, the Pearson product-moment correlation coefficient (PCC) is used6. Table 6.6 contains an overview of the correlation coefficient for each investigated similarity measure.

5 This was actually done by manually translating an arbitrarily chosen English sentence into German.

Similarity Type          PCC
Text Overlap             0.678
Cosine                   0.653
Text Length              0.109
External Links           0.012
Time annotations         0.253
Internal Links           0.103
DBpedia Entities         0.208
Wikipedia annotations    0.210

Table 6.6 Correlation Coefficients for Similarity Measures
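A minimal sketch of this correlation computation, assuming the measure values and the numeric user ratings are given as parallel lists; names are illustrative:

```python
import statistics


def pearson_correlation(measure_values, user_ratings):
    """Pearson product-moment correlation between one similarity measure and the
    user ratings (0, 0.5 or 1) of the evaluated sentence pairs."""
    mean_x = statistics.fmean(measure_values)
    mean_y = statistics.fmean(user_ratings)
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(measure_values, user_ratings))
    var_x = sum((x - mean_x) ** 2 for x in measure_values)
    var_y = sum((y - mean_y) ** 2 for y in user_ratings)
    return cov / (var_x ** 0.5 * var_y ** 0.5)
```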

For the non-textual measures, it is important to distinguish between two cases. Taking external links as an example, one could include the external link similarity in every case and not distinguish between sentence pairs where both sentences have external links and those where they don't. This means that a sentence without external links always gets an external link similarity of 0 for every partner sentence, but two sentences with completely different external links get the same similarity value. As this behaviour is not desired, the set of sentence pairs is filtered and the external link similarity is only applied to sentence pairs with external links. Later, when creating concrete similarity functions, this means that a different similarity function is needed for each level of applicability.

Syntactic Similarity

In a first step, we compare the syntactic measures: text overlap and cosine similarity are shown in Figures 6.2a and 6.2b (a regression line is added to the graphs). As already predicted in Section 5.2, both similarities highly correlate with the sentence similarity: for TO, PCC ≈ 0.678, for Co, PCC ≈ 0.653. As these values are very similar, it is not possible to say which measure will perform better in general. Therefore, they will both be evaluated in the second round of evaluation.

Text Length Similarity

A comparison between the lengths of the investigated sentences has a problematic characteristic: While a low text length similarity may indicate that the sentences don't share the same contents, it is counter-intuitive for the identification of partly same sentences: As one of the sentences does not contain all the facts of the second one, it seems logical to assume that a low text length similarity raises the probability of having partly identical sentences. But this assumption does not hold for all examples, as the following will explain: Assume a sentence s1 with the facts a and b and a sentence s2 with the fact a. This means the sentences partly cover the same facts. Their text length similarity may be Sim_text_length(s1, s2) = 0.5. On the other hand, given another sentence s3 with facts a and c, the respective similarity is much higher, for example Sim_text_length(s1, s3) = 1.0, although the sentences have not become any more similar.

This example illustrates the difficulty of using the text length similarity measure. It can only be used for the identification of same sentences, but not for finding partially overlapping sentences (in this case there is, if any, a negative correlation). This is the reason why it becomes impossible to improve a single similarity function that is supposed to find and distinguish both types of similar sentences by adding the text length similarity. Our results in Figure 6.3 clearly demonstrate this problem: While very similar sentences have a very high text length similarity (top right corner), the values differ a lot for sentence pairs with a user rating indicating partly same content. Besides, dissimilar sentences were often assigned high text length similarities.

6 http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

Figure 6.2 Correlation of Syntactic Features for the First Data Set: (a) Correlation of Text Overlap Similarity, (b) Correlation of Cosine Similarity (scatter plots of similarity against user rating)

External Link Similarity

For external links, there is a very low correlation, which makes this feature unusable for the similarity computation. This can be explained as follows: A single external link refers to a whole web page that may contain a lot of information. That means, the same external link may refer to different text passages of the referenced web page, and therefore external links are not a unique representation of some information or an entity, as internal links are. For example, there is an external link to "The World Factbook" on Japan7 in 11 sentences of the English and 5 sentences of the German article on Japan. That web site contains many statistics on Japan, for example on Japan's gross domestic product, but also on the number of airports in Japan. Because of this, the occurrence of this link in two sentences is no indication of any similarity. While this problem can be resolved in this case by ignoring links that occur many times or by using the extended cosine measure defined in Section 5.1, the problem remains: No one can guarantee that a web page that covers much diverse information occurs multiple times.

Moreover, the non-equality of external links does not indicate any dissimilarity. For example, references to news articles are very likely to vary between (and even within) languages, because news information is often not unique to one website (especially when considering texts from press agencies) and authors prefer to refer to websites written in their own language. Another problem with external links is their rarity: Other similarity measures easily outperform them when using linear regression.

Figure 6.3 Correlation of Text Length Similarity (scatter plot of text length similarity against user rating)

Figure 6.4 Correlation of External Links Similarity (scatter plot of external link similarity against user rating)

7 https://www.cia.gov/library/publications/the-world-factbook/geos/ja.html

Time Similarity

For time similarity, the highest correlation among the semantic measures is reached (Figure 6.5a). Therefore, it is seen as a major component of the similarity functions that are created in the second round and will be further evaluated in that context.

Entity Similarity

There are three approaches to using Wikipedia annotations for the entity similarity: one can either use the internal links only, the links extracted by NER tools, or a combination of both. With regard to the entity extraction process, the simplest approach is to rely solely on the internal Wikipedia links. Unfortunately, the results in Table 6.6 show that there is nearly no correlation between the internal link similarity and the overall similarity of a sentence pair. This can be explained by two reasons: On the one hand, internal links are usually only assigned to the text when the mentioned entity appears for the first time. This means that they are of no help when classifying sentences other than the first one containing that entity. Even for the first sentence, there is no guarantee that the sentences occur in the same order in both language versions. In the end, the only fully reliable application of this feature appears to be for entities appearing only once in each article. Moreover, internal links highly depend on the human authors, as they are set manually. In contrast, the automatic extraction can be expected to behave similarly on different sentences. Given the results, the most promising approach is to use as much knowledge as possible: By including all internal links and every extracted DBpedia link, the highest PCC is reached. Its correlation can be seen in Figure 6.5b.

Figure 6.5 Correlation of Time and Entity Similarity for the first Dataset: (a) Correlation of Time Similarity, (b) Correlation of Entity Similarity (scatter plots against user rating)

6.6 Second Dataset

From the results of the first round of evaluation, enough information is available to do a more precise pre-selection and to construct a similarity function. To take into account the different cases of feature applicability, the definition of a similarity function is extended and now called constrained similarity function: A sentence alignment function c contains a set of similarity functions that are still defined as in Section 6.2 – with two major differences: The sum of the weights does not need to equal 1 any more, and an intercept i is introduced that is added to the computed similarity value. Both changes come from the characteristics of the linear regression, whose output behaves like that8. Each of these similarity functions is mapped to a set of conditions (like "sentence has time annotations"). To compute the similarity value of a sentence pair, one similarity function w is chosen by checking these conditions. Given the findings from the first round (Section 6.5), three similarity measures are seen as important: textual similarity, entity similarity and time similarity. As text similarity is applicable to every sentence pair, the following kind of alignment function is used:

$$
Sim_c(s_1, s_2) =
\begin{cases}
Sim_{w_1}(s_1, s_2) & \text{if entity and time similarity are applicable} \\
Sim_{w_2}(s_1, s_2) & \text{else if entity similarity is applicable} \\
Sim_{w_3}(s_1, s_2) & \text{else if time similarity is applicable} \\
Sim_{w_4}(s_1, s_2) & \text{else}
\end{cases}
\tag{6.1}
$$

To compute the weights of w1, w2, w3 and w4, linear regression9 is used on each of the four pre-filtered sets. The regression is done on a different dataset than in the first round of evaluation, for two reasons: the findings from the first round can be tested on unknown data, and the first dataset was not completely satisfying: Only 18 of 229 sentence pairs reached a similarity of at least 0.75 from the user entries, which means very few "same content" instances. A reason may be that especially long and controversial articles tend to differ between language versions, so that simple translations don't occur.
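To make the case distinction in Equation 6.1 concrete, the following is a minimal Java sketch of a constrained similarity function. The type and method names (SentencePair, measure, hasTimeAnnotations, hasEntityAnnotations, WeightedSimilarity) are hypothetical; only the selection logic and the weighted sum with intercept follow the definition above.

```java
import java.util.Map;

/** Sketch of a constrained similarity function as defined in Equation 6.1. */
public class ConstrainedSimilarity {

    /** Hypothetical view of a sentence pair exposing the single similarity measures. */
    interface SentencePair {
        double measure(String name);      // e.g. "cosine", "time" or "entity" similarity
        boolean hasTimeAnnotations();
        boolean hasEntityAnnotations();
    }

    /** One weighted similarity function with an intercept, as produced by linear regression. */
    static class WeightedSimilarity {
        final Map<String, Double> weights; // e.g. {"cosine": 0.99, "time": 0.22, "entity": 0.25}
        final double intercept;

        WeightedSimilarity(Map<String, Double> weights, double intercept) {
            this.weights = weights;
            this.intercept = intercept;
        }

        double compute(SentencePair pair) {
            double sim = intercept;
            for (Map.Entry<String, Double> w : weights.entrySet()) {
                sim += w.getValue() * pair.measure(w.getKey());
            }
            return sim; // not normalised to [0, 1], see footnote 8
        }
    }

    private final WeightedSimilarity w1, w2, w3, w4;

    ConstrainedSimilarity(WeightedSimilarity w1, WeightedSimilarity w2,
                          WeightedSimilarity w3, WeightedSimilarity w4) {
        this.w1 = w1; this.w2 = w2; this.w3 = w3; this.w4 = w4;
    }

    /** Chooses one of the four weighted functions depending on feature applicability. */
    double similarity(SentencePair pair) {
        boolean time = pair.hasTimeAnnotations();
        boolean entity = pair.hasEntityAnnotations();
        if (entity && time) return w1.compute(pair);
        if (entity)         return w2.compute(pair);
        if (time)           return w3.compute(pair);
        return w4.compute(pair);
    }
}
```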

6.7 Pre-Selection and Creation of Similarity Functions

In this second round, the following seven alignment functions are used:

• Co: Similarity function that only takes into account the cosine similarity

8As we don't demand similarity values to be in the interval [0, 1], this is alright and a post-normalization is not needed.
9For this purpose, the Weka software is used: http://www.cs.waikato.ac.nz/ml/weka/.

Figure 6.6 Iteration to Create Similarity Functions (cycle: compute similarity function → find similar sentence pairs → manually classify sentences)

• CoT: Constrained similarity function where time similarity is added to the cosine similarity if one of the sentences has a time annotation.

• CoW: Constrained similarity function where entity similarity is added to the cosine similarity if one of the sentences has a Wikipedia annotation.

• CoWT: Constrained similarity function as defined in 6.1

• TO Baseline: Similarity function that only takes into account the textual overlap similarity

• LCS Baseline: Similarity function from plagiarism detection that uses the longest common subsequence (see Section 3.2.2)

• Wikipedia Baseline: Baseline described in 3.2.1: n-gram overlap similarity of n-grams enriched with Wikipedia links when they are found by querying the bilingual lexicon10. The text length threshold and the algorithm that assures 1:1 assignments only are included as well. Hence, this baseline does not fit our definition of constrained similarity functions.

The last two baseline methods use a threshold to filter out less relevant pairs (threshold for LCS Baseline: 0.65, for Wikipedia Baseline: 0.02). To enable a fair comparison of the performance of the different methods, in this experiment we omit the thresholds and retrieve the whole ranked sets of results for each method. The weights for the first five alignment functions still have to be determined. Initially, we take the weights coming from the first evaluation round. Then we apply an iterative process, depicted in Figure 6.6, to improve the functions by using the second dataset.
10To save the effort of creating such a lexicon, we queried the Wikipedia API to get language links for the n-grams. In terms of runtime, this is inefficient, but it suffices for evaluation purposes.

By using more precise similarity functions, the number of false positives is reduced. This makes it possible to evaluate all sentence pairs that reach some similarity value (tpartly). So, in the first step, each alignment function (including the baselines) is used to collect all sentence pairs with Simf(s1, s2) ≥ tpartly for any alignment function f. The lower tpartly, the more sentence pairs will be returned that have to be annotated in the next step. Subsequently, user input is needed: For each sentence pair found by the current alignment functions, its class must be determined by using the same interface as before11. This results in an extended training set that is used to build a new linear classifier to modify the weights of Co, CoT, CoW, CoWT and the TO Baseline. Now, the initial step of the cycle is reached again, which means that the newly built alignment functions can be used to collect new sentence pairs, and so on. This cycle stops when no new sentence pairs are found. While the need for user evaluation is obvious for the first iteration step, as no user data is available before that point, the cycle still has to go through several iterations: As the evaluated training set is not complete, a similarity function may – after a new execution on the whole corpus – return more sentence pairs than it was trained on. These pairs have to be taken into account, as the following example shows (this case occurred during our iteration cycle): In an early step, CoWT may ignore time similarity (weighttime = 0). When applying this alignment function to search for at least partially overlapping sentences, many sentences are found that contain times that don't fit together. Most of these pairs will be annotated as "different content", leading to a bigger value for weighttime in the next training step to avoid false positives.
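A compact sketch of this iteration cycle could look as follows. The annotateManually and retrainWeights hooks and all type names are hypothetical placeholders for the user-study interface and the Weka-based linear regression described above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of the iterative annotation/retraining cycle (Figure 6.6). */
public class TrainingCycle {

    interface AlignmentFunction {
        double similarity(String s1, String s2);
    }

    static final double T_PARTLY = 0.3;

    // Hypothetical hooks: user annotation and linear-regression retraining.
    static double annotateManually(String s1, String s2) { /* user study interface */ return 0.0; }
    static void retrainWeights(Map<List<String>, Double> trainingSet,
                               List<AlignmentFunction> functions) { /* e.g. via Weka */ }

    static void iterate(List<String> sentencesEn, List<String> sentencesDe,
                        List<AlignmentFunction> functions) {
        Map<List<String>, Double> annotated = new HashMap<>();
        boolean foundNewPairs = true;
        while (foundNewPairs) {
            foundNewPairs = false;
            Set<List<String>> candidates = new HashSet<>();
            // 1. Collect all pairs that reach t_partly for any alignment function.
            for (String s1 : sentencesEn) {
                for (String s2 : sentencesDe) {
                    for (AlignmentFunction f : functions) {
                        if (f.similarity(s1, s2) >= T_PARTLY) {
                            candidates.add(List.of(s1, s2));
                            break;
                        }
                    }
                }
            }
            // 2. Annotate pairs that have not been evaluated yet.
            for (List<String> pair : candidates) {
                if (!annotated.containsKey(pair)) {
                    annotated.put(pair, annotateManually(pair.get(0), pair.get(1)));
                    foundNewPairs = true;
                }
            }
            // 3. Retrain the weights of the alignment functions on the extended training set.
            if (foundNewPairs) {
                retrainWeights(annotated, functions);
            }
        }
    }
}
```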

6.8 Results

At the end of the iterations, 926 sentence pairs had been annotated for the second dataset (see Table 6.7) and weights for the five alignment functions were created. For example, the weights for CoWT are 0.99 for cosine similarity, 0.22 (Time) and 0.25 (Entities). The intercept is −0.05.

For all functions, each similarity measure is used if it is applicable (weight_{m_i} > 0). This means that there is always a positive correlation that is big enough to add information to the estimation of the similarity. To compare the alignment functions and to evaluate their quality, there are two approaches, one for the classification and one for the regression.

11Because of a lack of resources and the complexity of the iteration cycle, the evaluation was done by one human user only.

                      Same Facts   Partially Overlapping   Different   Total
Entities and Times        32                134                93        259
Entities only             34                131               228        393
Times only                13                 26                18         57
Else                      36                 54               127        217
Total                    115                345               466        926

Table 6.7 Dataset Evaluated in the Second Round

For the evaluation of the output of the similarity functions as continuous values, the mean squared error

$$ MSE_c = \frac{1}{n}\sum_{i=1}^{n}\bigl(Sim_c(s_1,s_2)_i - Sim_{\text{user rating}}(s_1,s_2)_i\bigr)^2 $$

is computed for each of the constrained similarity functions, where n is the number of annotated sentence pairs. From our results, we get MSE_Co ≈ 0.088, MSE_CoW ≈ 0.076, MSE_CoT ≈ 0.068 and MSE_CoWT ≈ 0.061. These values indicate that the alignment functions perform better when including Wikipedia and time annotations.

For a more detailed evaluation and to find out good values for tpartly and tsame, the sentence pairs are assigned similarity values for each alignment function. For each function, they are then classified into the three categories by using the threshold values. Given the user input, each sentence pair is given a ground truth class as well. Now, it is possible to calculate precision and recall values in the known manner12. f is the current alignment function, S the set of all sentence pairs and Sk the first k sentence pairs from S when S is sorted by the similarity values:

$$ Precision_{f,k} = \frac{|\{(s_1,s_2) \in S_k : Sim_c(s_1,s_2) \geq t_{partly} \wedge Sim_{\text{user rating}}(s_1,s_2) \geq 0.5\}|}{|\{(s_1,s_2) \in S_k : Sim_c(s_1,s_2) \geq t_{partly}\}|} $$

(fraction of the returned sentence pairs that indeed have at least partly the same content)

$$ Recall_{f,k} = \frac{|\{(s_1,s_2) \in S_k : Sim_c(s_1,s_2) \geq t_{partly} \wedge Sim_{\text{user rating}}(s_1,s_2) \geq 0.5\}|}{|\{(s_1,s_2) \in S : Sim_{\text{user rating}}(s_1,s_2) \geq 0.5\}|} $$

(fraction of correctly returned sentence pairs among all pairs with at least partly the same content)

By defining precision and recall in this way, there is no distinction between partly and same content, which is acceptable, as we will combine any kind of partly or same sentences later when extracting similar paragraphs and visualising them.
12http://en.wikipedia.org/wiki/Precision_and_recall

For each alignment function f, the following procedure is applied:

1. Build all possible sentence pairs from both texts.

2. For each sentence pair (s1, s2), calculate the similarity value Simf(s1, s2).

3. Rank the list of sentence pairs by that similarity value to get S. For the Wikipedia Baseline, exclude the forbidden pairs from that list.

4. For each k ∈ {1, . . . , |S|}, calculate Precision_{f,k} and Recall_{f,k} (a sketch of this procedure is given below).
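The following is a minimal Java sketch of this ranking-and-evaluation procedure, under the assumption that the user annotations are available as a map from sentence pairs to ratings; all type and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Sketch of the precision/recall computation over the ranked list of sentence pairs. */
public class PrecisionRecallCurve {

    record Pair(String s1, String s2, double similarity) {}

    /** Builds and sorts the ranked list S from raw similarity scores. */
    static List<Pair> rank(List<Pair> pairs) {
        List<Pair> ranked = new ArrayList<>(pairs);
        ranked.sort(Comparator.comparingDouble(Pair::similarity).reversed());
        return ranked;
    }

    /**
     * @param ranked     all sentence pairs, sorted by descending similarity (the list S)
     * @param userRating user annotations: 1.0 = same, 0.5 = partially overlapping, 0.0 = different;
     *                   pairs that were never annotated are assumed to be non-overlapping
     * @param tPartly    threshold above which a pair counts as returned
     */
    static void printCurve(List<Pair> ranked, Map<List<String>, Double> userRating, double tPartly) {
        long totalRelevant = ranked.stream()
                .filter(p -> userRating.getOrDefault(List.of(p.s1(), p.s2()), 0.0) >= 0.5)
                .count();
        int returned = 0, correct = 0;
        for (int k = 1; k <= ranked.size(); k++) {
            Pair p = ranked.get(k - 1);
            if (p.similarity() >= tPartly) {
                returned++;
                if (userRating.getOrDefault(List.of(p.s1(), p.s2()), 0.0) >= 0.5) {
                    correct++;
                }
            }
            double precision = returned == 0 ? 1.0 : (double) correct / returned;
            double recall = totalRelevant == 0 ? 0.0 : (double) correct / totalRelevant;
            System.out.printf("k=%d precision=%.3f recall=%.3f%n", k, precision, recall);
        }
    }
}
```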

A very important note is that the recall value is just an approximation: As stated in the beginning of this chapter, it is not possible to manually annotate all sentence pairs. There are two possibilities to overcome this problem: One can exclude all sentence pairs from S that were not annotated by a user. Instead of this, we assume that the sentence pairs that have not been evaluated do not have overlapping facts (as we use a big set of similarity functions and therefore retrieve many sentences, this assumption is reasonable). With this assumption, we can easily compare precision and recall in the whole range from 0% to 100%. For both approaches, the calculated recall values are higher than the real, unknown recall values. Figure 6.7 shows the result when applying the above procedure for all seven alignment functions. For each of them, the graph also shows the break-even point where precision = recall. Among the baselines, the Wikipedia baseline performs worst by far. This is clearly because no machine translation is used, which makes a comparison with our measures difficult. For LCS similarity, we can – similarly to what we saw in Section 5.2 – observe that it lacks precision for a recall > 20%. Our methods significantly outperform all the baselines. Our function Co (BEP = 70.95%), which only uses cosine similarity for the terms, already leads to a better precision than the TO Baseline (BEP = 66.16%) (which shows the best performance among the three baselines), with a 4.8% improvement in BEP. Additional semantic features such as time annotations used in CoT (BEP = 73.75%) and Wikipedia annotations in CoW (BEP = 72.91%), and in particular the combination of these annotations in CoWT (BEP = 77.52%), enable us to further improve precision and achieve up to 11.3% improvement in the BEP compared to the TO Baseline. This result confirms the high effectiveness of the proposed semantic features for the sentence alignment. Table 6.8 gives a detailed overview of the dataset for predefined threshold values tpartly: this is 0.3111 for CoWT, 0.367 for TO Baseline and 0.521 for LCS Baseline to reach a precision of 80%. For the Wikipedia baseline, we took the original threshold (0.02). For each article, the "total" number of (partially) overlapping sentences and the (correctly) found sentence pairs is given (in a similar way as in [1]). "F." stands for "Found" (number of sentence pairs returned by the alignment function) and "M." for "Match" (number of sentence pairs returned by the alignment function that have also been user-evaluated as at least partially overlapping).

Figure 6.7 Precision-recall diagram for the alignment of sentences containing overlapping facts. The x-axis represents recall in %, the y-axis represents precision in %. Each line corresponds to an alignment function. The break-even points are marked on each line and specified in brackets in the legend: CoWT (77.52), CoT (73.75), CoW (72.91), Co (70.95), TO Baseline (66.16), LCS Baseline (35.77), Wikipedia Baseline (24.95).

Until now, we did not distinguish between the two classes "partially overlapping" and "same facts" (except for the comparison of the mean squared errors, which takes the different similarity values into account). To do so, we left the alignment functions unchanged and applied them to find same sentences only. Practically, this means that a higher threshold has to be used: tsame. For the evaluation, the definitions of precision and recall change in the following way:

$$ Precision_{f,k} = \frac{|\{(s_1,s_2) \in S_k : Sim_c(s_1,s_2) \geq t_{same} \wedge Sim_{\text{user rating}}(s_1,s_2) = 1\}|}{|\{(s_1,s_2) \in S_k : Sim_c(s_1,s_2) \geq t_{same}\}|} $$

$$ Recall_{f,k} = \frac{|\{(s_1,s_2) \in S_k : Sim_c(s_1,s_2) \geq t_{same} \wedge Sim_{\text{user rating}}(s_1,s_2) = 1\}|}{|\{(s_1,s_2) \in S : Sim_{\text{user rating}}(s_1,s_2) = 1\}|} $$

The procedure stays the same, which results in the precision and recall values shown in Figure 6.8. Obviously, these results are not as convincing as before: There is no significant difference between the curves of the three similarity functions, which means that the existence of similar times and entities is no unique indicator for same contents. Furthermore, high recall values come at the cost of very low precision values and vice versa.

Figure 6.8 Precision-recall Diagram of Sentences with the Same Facts. Break-even points: CoWT (63.48), CoT (62.01), CoW (64.94), Co (62.61), TO Baseline (63.76), LCS Baseline (67.53), Wikipedia Baseline (34.52).

Similarity Function for Same Sentences only

As this is not completely satisfying if one aims at finding sentences with same facts only, we ran an additional annotation cycle with two classes only: "different or partially overlapping" and "same". In the resulting alignment functions, the text length similarity is included – in contrast to the previous results (the reasons were described in Section 6.5). As a result, the linear regressor set the weight of the time similarity to 0. This means that we no longer have similarity functions for CoT and CoWT. Again, Figure 6.9 gives an overview of the results. The TO Baseline outperforms CoW in terms of BEP (69.37 > 68.4). However, the inclusion of entity similarity leads to a higher BEP than Co (68.4 > 66.96). So, the text overlap measure works better than cosine similarity when searching for the same sentences. This may be due to the fact that sentences with the same content need to have all the same terms – no matter how selective they are.

Figure 6.9 Precision-recall Diagram of Sentences with the Same Facts (Adjusted Similarity Functions). Break-even points: CoW (68.4), Co (66.96), TO Baseline (69.37), LCS Baseline (67.53), Wikipedia Baseline (34.21).

Article                                      Total  Wikipedia Baseline  LCS Baseline  TO Baseline  CoWT
                                                    F.    M.            F.    M.      F.    M.     F.    M.
249                                            0    0     0             0     0       0     0      0     0
A930 road                                      4    1     0             0     0       2     2      0     0
Aliso Viejo, California                        2    1     0             3     1       2     1      2     1
Antonio Arenas                                 1    0     0             1     1       1     1      1     1
Banded bellowsfish                             2    0     0             1     1       0     0      1     1
Bellview, Florida                              9    3     2             5     4       9     7     13     7
Calañas                                        1    0     0             1     1       1     1      1     1
Champagne Showers                              6    7     4             2     1       4     3      3     3
Codex Aureus of St. Emmeram                   13   10     8             6     6       8     8     10    10
Commuter rail                                  1    3     0             1     0       0     0      0     0
Endarterectomy                                 0    2     0             0     0       0     0      0     0
Far point                                      1    1     0             0     0       0     0      0     0
Fort Sumter                                   27   35     6             3     3      15     7     21    19
General Post Office                           49   23    11            21    19      38    27     46    39
George William Gray                           17    5     3             4     4      11    10     12    12
Gohi Bi Zoro Cyriac                            5    4     3             1     1       3     3      3     3
Hammond Peek                                   3    0     0             0     0       1     1      0     0
Hemisphaeriodon                                6    7     2             1     0       2     2      6     3
History of Serbia                             85  238    20            12     3      37    22     95    62
It girl                                        1    1     0             0     0       0     0      0     0
Kettwig station                               14   13     6             4     4       8     7     10     8
Knipp                                          7    6     5             5     5       7     7      7     7
Lawrence Eagleburger                          22   16    10             7     6      16    15     22    19
Mercedes MGP W01                               3    3     3             0     0       3     3      3     3
Michelle Monaghan                             13    9     7             1     1       2     2      7     7
Muggsy Bogues                                 26   18     9             5     3      13    11     21    16
Museum of Old and New Art                     36   11    11            20    20      27    27     35    34
Omega (navigation system)                     18   26     2             5     3       8     7     15    10
Prince Moritz of Anhalt-Dessau                 4   10     4             3     3       3     3      3     3
Pseudomugilidae                                3    3     2             0     0       2     2      2     2
Samšín                                         1    2     0             0     0       0     0      1     0
Sandro Cortese                                10    9     1             6     3       9     6      8     6
Santo Antônio do Amparo                        1    1     1             0     0       2     1      0     0
Schoten                                        1    1     1             1     1       1     1      1     1
Sikorsky S-333                                 3    4     0             0     0       0     0      1     1
St. Germain (musician)                         3    1     1             0     0       2     2      3     3
Suwałki                                        3    8     0             0     0       0     0      1     0
The Sundays                                   13    4     1            11     8      14     9     16    12
Tomte                                          1    3     0             0     0       0     0      2     1
Travenbrück                                    1    0     0             1     1       2     1      1     1
United Nations Security Council Res. 1753      7    4     2             0     0       1     1      2     2
W. Clement Stone                              24    7     7            10    10      14    13     18    17
White-headed langur                            6    3     0             2     2       4     2      5     4
Wiesenbach, Bavaria                            2    0     0             2     1       1     1      1     1
Wilanów Palace                                 5   16     3             0     0       2     1      4     3
Total                                        460  519   135           145   116     275   217    403   323

Table 6.8 Retrieved Sentence Pairs per Article Pair

7 Paragraph Alignment and Article Comparison

7.1 Finding Similar Paragraphs

As already noted in Section 1.2, the visualization of similar text passages can be improved a lot when not only similar sentences are joined, but the size of aligned text parts is increased as much as possible to give a human user a better overview of the comparison. So far, we have only considered the identification of at least partly overlapping sentences, so the natural next step is to merge these sentences in a bottom-up manner to finally align bigger parts of the articles. In contrast to the sentence alignment, we use the 1:1 assumption for paragraph pairs. This process consists of two major steps that are described in the following sections:

• Aggregation of neighboured sentences (Section 7.1.1): E.g. in a situation where the content of an English sentence was split into two sentences in the German article, it is necessary to connect single sentences and align them with another sentence.

• Aggregation of sentence pairs within short distance (Section 7.1.2): To increase the size of paragraphs, it can be admissible to build a paragraph of two paragraph pairs1 (with similar sentences) that are not directly neighboured.

Starting with the situation in Figure 7.1, we will explain a detailed example of the construction of a paragraph from 12 sentences. On the left, there are 7 sentences from the first article, each with exactly one fact represented by a single letter. On the right, there are 5 sentences from the second article. The second and the fourth sentence contain two facts each. After applying the sentence aligning algorithm, the sentences with overlapping facts are aligned (joined). Except for the small discrepancy regarding facts "d" and "g", the first 6 sentences on the left make a good paragraph with the first four sentences of the second article.

1As each sentence pair can be seen as a paragraph pair as well, this includes sentence pairs.

Figure 7.1 Paragraph Construction Example (Step 1): Article 1 sentences with facts a | b | c | d | e | f | g; Article 2 sentences with facts a | b,c | h | e,f | i

In the first step of the algorithm, neighboured sentences are combined, if they can be aligned to the same sentence from the other article. At first, the sentence with the fact ”b” and the one with the fact ”c” are combined and together aligned with the sentence on the right that has both facts (see figure 7.2a). Analogously, the sentences with ”e” and ”f” are merged (figure 7.2b).

Figure 7.2 Paragraph Construction Example (Steps 2 and 3): (a) Step 2, (b) Step 3

Now, the second step of the algorithm begins: Proximate sentence pairs that are marked as (partially) overlapping are combined to increase the paragraph size. For this example, there is no gap (distance is 0) between the first two paragraphs and they are merged (Figure 7.3a). At this point, there are two paragraphs remaining. Between them there is a gap of distance 1 on each side: On the left, this is the sentence with the fact "d", on the right, it is the sentence with "h". If the algorithm is configured so that a summed gap size of 2 is allowed, the two paragraphs are merged, which results in the final constellation depicted in Figure 7.3b. The two steps and their composition are explained in more detail in the following.

Figure 7.3 Paragraph Construction Example (Steps 4 and 5): (a) Step 4, (b) Step 5

7.1.1 Aggregation of Neighboured Sentences

The most illustrative use case of the first part of the paragraph finding algorithm can be explained by the following example from the English and German Wikipedia article about the General Post Office: The English article contains the following sentence (marked as [en1]):

In 1840 the Uniform Penny Post was introduced, which incorporated the two key innovations of a uniform postal rate, which cut administrative costs and encouraged use of the system, and adhesive pre-paid stamp. [en1]

The content of this sentence can be found in the German article as well, but split into two sentences:

Im Jahr 1840 wurde die Penny Post eingeführt. [de1] Dies bedeutete die Einführung der Briefmarke und die Reduzierung der administrativen Kosten des Postdienstes. [de2] Human-translated: In the year 1840, the Penny Post was introduced. This meant the introduction of the stamp and the reduction of administrative costs of the post office.

As a result of our sentence alignment algorithm, there are two sentence pairs found that are each marked as partially overlapping: [en1]/[de1] and [en1]/[de2]. This results from giving up the assumption of pure 1:1 sentence alignments (as stated in 3.2.1). As we don't allow for the subsequent splitting of sentences2, the desired alignment result is [en1]/[de1]-[de2], where [de1]-[de2] denotes an aggregation of the German sentences.
2This could be used for another approach where you start with aligned paragraphs and then do a top-down procedure.

An aggregation needs to be an exact representation of its referred sentences: The texts are strung together in the right order and semantic features are joined. Similarity computations applied on such an aggregation must treat it as if it were a single sentence. For example, the text length similarity for the sentence pair [en1]/[de1] has to be lower than for [en1]/[de1]-[de2] in the example. For the aggregation algorithm, we make use of a procedure coming from a very different domain: the matching of streets that are represented as geometric objects [27]. Small segments of streets (that were for example split up because they cross other streets) are aggregated to be matched with a longer street that represents the same real-world object, but is part of another data set. While this is an example of a 1:n matching, n:m matchings are also considered by that algorithm. Algorithm 7.1 is called for each sentence pair that was at least classified as partly overlapping in the prior process. By calling the procedure extend, it is tried to extend one of the sentences in a sentence pair. From all of the sentence pairs found this way, the one with the highest similarity value is returned.

Algorithm 7.1 Extension of Sentence Pairs with Neighbours

1: procedure extend((s1, s2))
2:   aggregations = aggregateWithNeighbour(s1, s2) ∪ aggregateWithNeighbour(s2, s1)
3:   sortBySimilarityDescending(aggregations)
4:   return aggregations.pop()

The procedure aggregateWithNeighbour, shown in Algorithm 7.2, iterates over the neighbours of one of the sentences in the pair and builds an aggregation of these two components (line 4). If the similarity of such an aggregation is higher than the similarity of the original sentence pair, it is added to the set of sentence pairs (line 6) that is returned in the end.

Algorithm 7.2 Extension of a Sentence with its Neighbours

1: procedure aggregateWithNeighbour(s1, s2)
2:   aggregations = {}
3:   for n ∈ s1.neighbours do
4:     s1-n = aggregateSentences(s1, n)
5:     if Sim(s1-n, s2) > Sim(s1, s2) then
6:       aggregations = aggregations ∪ {(s1-n, s2)}
7:   return aggregations
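The helper aggregateSentences only has to concatenate the sentence texts in document order and merge the extracted features, so that all similarity measures can treat the aggregation like a single sentence. A possible Java sketch (the Sentence class and its fields are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sentence representation carrying the extracted semantic features. */
class Sentence {
    int position;                 // position within the article, used to keep document order
    String text;
    List<String> entityLinks = new ArrayList<>();    // internal links and DBpedia annotations
    List<String> timeExpressions = new ArrayList<>();

    /** Builds an aggregation that behaves like a single sentence (order kept, features joined). */
    static Sentence aggregateSentences(Sentence a, Sentence b) {
        Sentence first = a.position <= b.position ? a : b;
        Sentence second = a.position <= b.position ? b : a;
        Sentence merged = new Sentence();
        merged.position = first.position;
        merged.text = first.text + " " + second.text;   // e.g. text length similarity now sees both parts
        merged.entityLinks.addAll(first.entityLinks);
        merged.entityLinks.addAll(second.entityLinks);
        merged.timeExpressions.addAll(first.timeExpressions);
        merged.timeExpressions.addAll(second.timeExpressions);
        return merged;
    }
}
```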

7.1.2 Aggregation of Proximate Sentence Pairs

Until now, the paragraph aligning procedure does not allow for gaps within similar sentences. For example, a paragraph consisting of the sentences (each sentence contains one fact that is attached in brackets) en1(a), en2(b), en3(c) and en4(d), in that order, cannot be aligned with the sentences de1(a), de2(b), de3(e) and de4(d), although they are very similar, and it is well possible that situations like this occur in comparable corpora. Even for originally exact translations it is possible that one article is changed after the translation process, possibly by adding an additional sentence with a missing fact into the text paragraph. Therefore, another step of paragraph extension is done. This one is similar to what is done in cross-lingual plagiarism detection [2]. Algorithm 7.3, called combine, shows the steps to combine a paragraph pair with another one. For the given sentence pair, a set of sentence pairs is searched in the proximity (line 3). In line 6, the sentences are aggregated by combining the sentence from the original sentence pair, the one from the sentence pair in its proximity and the sentences in between (the gap). From these sentence pairs, the one with the highest similarity is chosen – unless there is one without a gap; in that case, that one is taken in any case. There are two parameters whose values have to be set beforehand:

• maxDistance: The number of sentences that are allowed to be between two sentences that are part of different sentence pairs.

• penalty: Naturally, the similarity value of a paragraph decreases when adding unfitting sentences – independently of the single similarity values. Therefore, a penalty, multiplied with the gap size, is subtracted from the similarity value.

For our visualization, we set the values as follows: maxDistance = 3, penalty = 0.03.

Algorithm 7.3 Aggregation of Sentence Pairs

1: procedure combine((s1, s2))
2:   highestSimilarity = 0
3:   closeSentencePairs = findAnnotationPairInDistance(s1, s2, maxDistance)
4:   sort closeSentencePairs by distance
5:   for (s1′, s2′) ∈ closeSentencePairs do
6:     (s1″, s2″) = (aggregate(s1, gap(s1, s1′), s1′), aggregate(s2, gap(s2, s2′), s2′))
7:     distance = dist(s1, s1′) + dist(s2, s2′)
8:     similarity = Sim((s1″, s2″)) − distance · penalty
9:     if distance == 0 or similarity > highestSimilarity then
10:      highestSimilarity = similarity
11:      (s1‴, s2‴) = (s1″, s2″)
12:  return (s1‴, s2‴)

7.1.3 Paragraph Aligning Algorithm

The functions extend and combine are now composed to create the paragraph aligning algorithm that is shown in 7.4. The two sub-procedures are not applied one after another. Instead, one sentence pair is chosen in every step (line 7) and this is either aggregated with a neighboured sentence (line 9) or merged with another sentence pair (line 11) – depending on whether the first has already been done. If a new sentence pair was found with one of these methods, it is added to the result set and the old one is removed (line 14).

Algorithm 7.4 Paragraph Alignment
1: procedure buildParagraphs(similarSentences)
2:   similarParagraphs = similarSentences
3:   foundChanges = true
4:   while foundChanges do
5:     foundChanges = false
6:     sort(similarParagraphs)
7:     for (s1, s2) ∈ similarParagraphs do
8:       if not (s1, s2).extended then
9:         (s1′, s2′) = extend((s1, s2))
10:      else
11:        (s1′, s2′) = combine((s1, s2))
12:      if not (s1′, s2′) == null then
13:        foundChanges = true
14:        similarParagraphs = (similarParagraphs ∪ {(s1′, s2′)}) \ {(s1, s2)}
15:        break
16:  return similarParagraphs

To improve efficiency and to focus on the extension of the most similar sentences, the set of sentences and paragraphs is kept in an order where those sentence pairs that have not been aggregated are on top. Within all those sentence pairs that are (not) aggregated, the sentence pairs with a higher similarity are preferred.

Notes

The paragraph construction algorithm has the following characteristics:
• As our similarity measures (e.g. cosine text similarity) operate independently of the order of the text, it even becomes possible that a paragraph en1(a)-en2(b) is aligned to the paragraph de1(b)-de2(a).
• If you want to point out the missing parts of particular facts in the context of a subtopic covered in both articles, it is possible to additionally present those ignored sentences (the gaps) to the user.

• To limit the size of paragraphs and to improve readability, you can forbid aggregating sentences that are not in the same Wikipedia paragraph (see Section 3.1.5).

Evaluation

While the evaluation of sentence similarity was done in a rather easy manner by letting users put sentence pairs into a small number of categories, evaluation proves to be more difficult for paragraph similarity. The difficulty is that for n sentences, there are (n² + n)/2 possible paragraphs if you have no restrictions3. Another problem is that the longer a paragraph is, the bigger the reading effort is for the users. To solve these issues, the evaluation could be done in a bottom-up manner: At first, the user is shown a single sentence. Stepwise, more sentences are added to build paragraphs. The user then for example has to evaluate these paragraphs in terms of relevancy and novelty as described in [6].

7.2 Similarity on Article Level

Until now, all the similarity values were applied on the sentence or paragraph level only and integrated into the confrontation of both articles’ texts. However, there are more aspects in the surroundings of Wikipedia articles that can be presented to the user. Moreover, it is a goal of this thesis to form an overall similarity value for the whole comparison of two articles. To do so, we define a set of similarity values on the article level and combine them.

7.2.1 Text Similarity

Of course, the similarity of the texts themselves can not be ignored when comparing the articles. But instead of taking the single similarity values between sentences and paragraphs, there has to be some general similarity value for the texts which is composed of three text similarities that are shown next:

Text Length Similarity

As already described in Section 2.4, the text length gives evidence of how important the topic of the article is in the corresponding language. Therefore, the text length similarity introduced in 5.1 is applied on the article level as well.

3 1 paragraph with n sentences, 2 paragraphs with n − 1 sentences, . . .

Text Overlap Similarity

Like the text length similarity, the text overlap similarity can be used for the whole articles as well.

Text Coverage Similarity

When looking at the visualization of similar text paragraphs (as in 1.2), another textual similarity measure emerges by asking how much of the text in one article can be found in the other one as well. Visually, this is the portion of text marked green in the article comparison (cf. the text comparison view in Section 7.3). More formally, this similarity is calculated as in 7.1, with ai denoting the articles and common1 representing an aggregation of all text parts in the first article that are aligned to text parts in the other article.

$$ sim_{Coverage}(a_1, a_2) = \frac{|common_1| + |common_2|}{|a_1| + |a_2|} \tag{7.1} $$
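A direct translation of Equation 7.1 into Java, assuming the total and aligned character counts per article are already known; the ArticleCoverage type is hypothetical.

```java
/** Sketch of the text coverage similarity from Equation 7.1. */
public class CoverageSimilarity {

    /** Character counts of one article: total length and length of all aligned passages. */
    record ArticleCoverage(int totalLength, int alignedLength) {}

    static double simCoverage(ArticleCoverage a1, ArticleCoverage a2) {
        return (double) (a1.alignedLength() + a2.alignedLength())
                / (a1.totalLength() + a2.totalLength());
    }
}
```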

7.2.2 Feature Similarity

We have extracted several other features from the articles that are used to define six other similarity measures. Except for the author location similarity, they are all working the same way: A set of features is extracted from both articles and the similarity value is computed using Jaccard overlap. All of these features can be presented to the user by e.g. visualising tables where the common features are emphasized. For some of the features, this is an implementation of the tables given in [26] (see also Section 2.4).

Image Similarity

When looking at a Wikipedia article, images are possibly the most noticeable aspect. Aside from the visual aspect, they often give some hint of the covered topics. Therefore, a comparison of the images can be used as an important similarity on the article level. As for entities and external links, a set of images is extracted during the pre-processing for each revision, and some images are ignored via a list of forbidden images (mainly Wikipedia-specific images like an icon linking to Wikipedia's sister projects). While [26] also aligns images that are very similar (e.g. maps) or show the same locations or events, we limit the question of image equality to whether the image URLs are the same. Of course, this is done because of several difficulties

when doing such comparisons automatically4.

Entities

An overlap between the internal and extracted entity links of both articles can easily be computed, as all the information is already present.

External Links

While they did not prove to be a good feature for the sentence similarity, external links surely add specific information on an article. Therefore, we add this measure to the article level similarities. However, it is further split into a comparison of the URLs themselves and the host names (external link hosts similarity).

Authors

As it is very improbable that the same author spreads contrary text on two articles, the authors are compared as well. Because of the linguistic point of view, we also include an author location similarity. It does not suffice to calculate a simple overlap, e.g. of the distinct countries of the authors: The author location similarity should not aim at penalizing a different number of authors. For example, if the first article was written by 5 German authors and the other one by 25 German ones, their similarity should be 1 for this measure. This is implemented by the formula given in 7.2. In the case of this formula, we assume that the first article has more authors than the second one. authors_{a1} denotes the total number of authors of a1, whereas authors_{a1,c} is the number of authors in that article for the particular country c. The number of authors per country of a2 is normalized as if both articles had equally many authors. This makes it possible to find the number of authors for each country that the articles have in common and to divide these numbers by the total number of authors in a1.

$$ sim_{Author\,Location}(a_1, a_2) = \frac{\sum_{c \in Countries} \min\left(authors_{a_1,c},\ authors_{a_2,c} \cdot \frac{authors_{a_1}}{authors_{a_2}}\right)}{authors_{a_1}} \tag{7.2} $$

For example, if the first article has 5 German, 4 French and 1 Mexican authors (authors_{a1} = 10) and the other article has 3 German and 2 French authors (authors_{a2} = 5), the similarity is sim_{Author Location}(a1, a2) = (min(5, 2·3) + min(4, 2·2) + min(1, 2·0)) / 10 = (5 + 4 + 0) / 10 = 0.9.
4[23] gives a possibility for image comparisons, but also states that this problem "is one of the most critical step in many Computer Vision tasks".
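A small Java sketch of Equation 7.2; the per-country author counts are assumed to be given as maps, and the method normalises the smaller article's counts as described above.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch of the author location similarity (Equation 7.2). */
public class AuthorLocationSimilarity {

    /**
     * @param countsA country -> number of authors of the first article
     * @param countsB country -> number of authors of the second article
     */
    static double similarity(Map<String, Integer> countsA, Map<String, Integer> countsB) {
        double totalA = countsA.values().stream().mapToInt(Integer::intValue).sum();
        double totalB = countsB.values().stream().mapToInt(Integer::intValue).sum();
        if (totalA == 0 || totalB == 0) {
            return 0.0;
        }
        // Equation 7.2 assumes a1 is the article with more authors; swap if necessary.
        if (totalA < totalB) {
            return similarity(countsB, countsA);
        }
        double scale = totalA / totalB;   // normalise a2's counts to a1's total
        Set<String> countries = new HashSet<>(countsA.keySet());
        countries.addAll(countsB.keySet());
        double common = 0.0;
        for (String c : countries) {
            common += Math.min(countsA.getOrDefault(c, 0), countsB.getOrDefault(c, 0) * scale);
        }
        return common / totalA;
    }
}
```

With the counts from the example above ({DE: 5, FR: 4, MX: 1} vs. {DE: 3, FR: 2}), the method returns 0.9, as expected.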

7.2.3 Overall Similarity

Given these nine different similarity measures on the article level (each between 0 and 1), a linear combination of them produces some kind of an overall similarity of the two investigated articles. Obviously, it is a very demanding task to create some abstract numeric value from two possibly very long articles that represents their similarity. Because of this problem, we have not done any user evaluation and experimentally chose the coefficients shown in table 7.1 for the article similarity (example for understanding: the external link host similarity makes up 50% · 25% · 75% = 9.375% of the overall similarity).

50% Text
    33.33% Text Coverage
    33.33% Text Length
    33.33% Text Overlap
50% Feature
    25% Images
    25% Entities
    25% External Links
        25% External Links
        75% External Link Hosts
    25% Authors
        25% Authors
        75% Author Locations

Table 7.1 Composition of Overall Similarity
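The weighted combination from Table 7.1 can be written down directly. In the sketch below, the measure names are only shorthand for the nine article-level similarities; the effective weights are derived from the nested percentages in the table (e.g. external link hosts: 0.5 · 0.25 · 0.75 ≈ 0.094).

```java
import java.util.Map;

/** Sketch of the overall article similarity as the weighted sum from Table 7.1. */
public class OverallSimilarity {

    // Effective weights per measure, derived from the nested percentages in Table 7.1.
    private static final Map<String, Double> WEIGHTS = Map.of(
            "textCoverage",      0.5 * (1.0 / 3),
            "textLength",        0.5 * (1.0 / 3),
            "textOverlap",       0.5 * (1.0 / 3),
            "images",            0.5 * 0.25,
            "entities",          0.5 * 0.25,
            "externalLinks",     0.5 * 0.25 * 0.25,
            "externalLinkHosts", 0.5 * 0.25 * 0.75,
            "authors",           0.5 * 0.25 * 0.25,
            "authorLocations",   0.5 * 0.25 * 0.75
    );

    /** @param measures the nine article-level similarity values, each in [0, 1] */
    static double overallSimilarity(Map<String, Double> measures) {
        return WEIGHTS.entrySet().stream()
                .mapToDouble(e -> e.getValue() * measures.getOrDefault(e.getKey(), 0.0))
                .sum();
    }
}
```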

For all 60 article pairs extracted for the user evaluation, we took the newest revision pair each and computed the similarity values. The results are shown in Table 7.2. The similarity values vary a lot, within the range [0.038, 0.637]. When looking deeper at the respective articles, we can observe that the four most similar article pairs are about entities that are prominent in only one of the language versions ("Knipp" and "Kettwig station" in Germany, "Bellview, Florida" and "Muggsy Bogues" in the USA). This is a sign that the users of the other language version do not have much domain knowledge on that language-specific topic and therefore just take over many parts of the other article. However, if nobody has taken the effort to do this step, similarity stays at a low level (e.g. the least similar article pair on "Aliso Viejo, California"). On the other hand, more general or controversial articles like "Scientology", "Commuter rail" or "It girl" are unlikely to reach very high similarity values, as there are many aspects whose importance varies among countries. For example, the German article on "Commuter rail" contains a section that is all about commuter rail in all federal states of Germany. Of course, these do not appear in the English article.

Evaluation

To evaluate the overall article similarity, a classification into three classes, similar to the sentence alignment user study, could be an option. However, it is very difficult to define those classes, as e.g. two completely overlapping articles are very rare. In a different approach, a user could be given a small set of article pairs and asked to rank them. An aggregated ranked list over all article pairs evaluated by several users could then be compared with the ranking given by our similarity value (as in Table 7.2) with common ranking measures. A manual investigation of some of the revision pairs shows that the ranking seems to be reasonable: For example, Figures 7.4a and 7.4b oppose the two articles about the German type of sausage "Knipp". According to our ranking, this is the most similar article pair. Indeed, they are nearly literal translations and also share the same images. Another interesting observation is that none of the first ten article pairs is from the list of controversial articles.

(a) English Article ”Knipp” (b) German Article ”Knipp (Speise)”

Figure 7.4 Comparison of the English and German article on ”Knipp”

7.3 Visualisation

In our web interface, you can choose from a predefined set of articles in one of the three languages German, Dutch and Portuguese. Each of these article comparisons consists of five sub-pages that are shown in the following for the comparison of the articles about the General Post Office.

• Text (Figure 7.5): As in all the following screenshots, the currently chosen article's English name, the language of the compared article and the date of the revision pair are shown in the top left corner. The text field (with a drop-down list) in the top right can be used to choose another article. For the "General Post Office" example, the two articles are shown side by side. On the left you can see the English article, on the right there is the German one. Aligned paragraphs are marked green and joined by green lines. On the top, the textual similarity scores are listed.

• Links (Figure 7.6): The table contains all the Wikipedia annotations and external links together with their number of appearances in both articles. If a link appears in both articles, it is ranked at the top. For example, the "BT Group" is mentioned 12 times in the English and 8 times in the German article, which makes this the entity that is mentioned the most times in both articles. Further below (not visible in the screenshot), there are also tables for external links and their hosts and the values for the respective similarity measures.

• Images (Figure 7.7): Similar to the links, there is a table for images as well. For some images, the file names are still available and comparable, but the images themselves are not online any more.

• Authors (Figure 7.8): On the top, there are two maps: The left one shows the locations of all anonymous authors that contributed to the English article. The right map is for the German article. These maps are interactive, as you can inspect the number of authors for a focussed country. Below the maps, there is a table with the author names.

• Overall similarity (Figure 7.9): On the top, there is the overall similarity value. Below, there is the history chart that plots the number of edits and the overall similarity values over time. It is followed by two tables: The left one shows which values the overall similarity is composed of, the right one provides the list of revision pairs, so that the user can choose one of them to see comparisons from other dates.

Figure 7.5 Website Example: Text

Figure 7.6 Website Example: Links

Figure 7.7 Website Example: Images

Figure 7.8 Website Example: Authors

Figure 7.9 Website Example: Overall Similarity

English Name                                       German Name                                Similarity
Knipp                                              Knipp (Speise)                             0.637
Bellview, Florida                                  Bellview (Florida)                         0.410
Kettwig station                                    Bahnhof Kettwig                            0.352
Muggsy Bogues                                      Muggsy Bogues                              0.344
Codex Aureus of St. Emmeram                        Codex aureus von St. Emmeram               0.344
Sandro Cortese                                     Sandro Cortese                             0.336
Calañas                                            Calañas                                    0.329
Michelle Monaghan                                  Michelle Monaghan                          0.319
Lawrence Eagleburger                               Lawrence Eagleburger                       0.317
Stefán Jóhann Stefánsson                           Stefán Jóhann Stefánsson                   0.316
Berlin                                             Berlin                                     0.306
United Nations Security Council Resolution 1753   Resolution 1753 des UN-Sicherheitsrates    0.303
Wii                                                Wii                                        0.300
Japan                                              Japan                                      0.283
Dimitri De Fauw                                    Dimitri De Fauw                            0.282
Omega (navigation system)                          Omega-Navigationsverfahren                 0.271
European Union                                     Europäische Union                          0.268
Travenbrück                                        Travenbrück                                0.267
Prince Moritz of Anhalt-Dessau                     Moritz von Anhalt-Dessau                   0.263
George William Gray                                George William Gray                        0.259
White-headed langur                                Hellköpfiger Schwarzlangur                 0.257
General Post Office                                General Post Office                        0.244
Hanover                                            Hannover                                   0.232
Global warming                                     Globale Erwärmung                          0.228
Scientology                                        Scientology                                0.218
Nicolaus Copernicus                                Nikolaus Kopernikus                        0.218
Mercedes MGP W01                                   Mercedes MGP W01                           0.217
A930 road                                          A930 road                                  0.213
Fort Sumter                                        Fort Sumter                                0.211
Gohi Bi Zoro Cyriac                                Gohi Bi Cyriac                             0.210
Minimum wage                                       Mindestlohn                                0.210
Samšín                                             Samšín                                     0.202
Truth                                              Wahrheit                                   0.202
Studentenverbindung                                Studentenverbindung                        0.199
History of Serbia                                  Geschichte Serbiens                        0.191
Far point                                          Fernpunkt (Optik)                          0.191
Srebrenica massacre                                Massaker von Srebrenica                    0.179
Wiesenbach, Bavaria                                Wiesenbach (Schwaben)                      0.166
249                                                249                                        0.166
W. Clement Stone                                   W. Clement Stone                           0.165
Pseudomugilidae                                    Blauaugen                                  0.165
Sikorsky S-333                                     Schweizer S-333                            0.161
Banded bellowsfish                                 Gebänderter Blasebalgfisch                 0.154
Schoten                                            Schoten                                    0.152
Hemisphaeriodon                                    Schneckenskink                             0.148
Wilanów Palace                                     Wilanów-Palast                             0.148
Libertarianism                                     Libertarismus                              0.135
Antonio Arenas                                     Antonio Arenas                             0.126
Champagne Showers                                  Champagne Showers                          0.120
Hammond Peek                                       Hammond Peek                               0.119
St. Germain (musician)                             St Germain                                 0.118
Tomte                                              Nisse                                      0.115
Suwałki                                            Suwałki                                    0.114
Endarterectomy                                     Endarteriektomie                           0.092
Esotericism                                        Esoterik                                   0.091
Santo Antônio do Amparo                            Santo Antônio do Amparo                    0.081
Commuter rail                                      Nahverkehr                                 0.077
The Sundays                                        The Sundays                                0.069
It girl                                            It-Girl                                    0.066
Aliso Viejo, California                            Aliso Viejo                                0.038

Table 7.2 60 Wikipedia Article Pairs Ordered by Overall Similarity

8 Implementation

In order to collect as much information as possible for the comparison of articles and presentation of the results, each investigated article has to run through several preprocessing steps. These steps include three major phases that are discussed in this chapter: The selection of articles and revisions to be investigated, the parsing of pure Wikipedia information like the structure and content of the article’s text and finally the use of external tools like a machine translator to extract further information. The aim is to set up a preprocessing pipeline that makes clear which steps each single article has to run through. To get an idea which data is needed for each article and how it is organised, this chapter will start with a depiction of the data model for our article collection that will be used in our implementation.

8.1 Data Model

Our document collection is organised as follows: The article collection A consists of a set of articles a, each representing one Wikipedia entity like 'Berlin' for a Wikipedia language version l. To observe similarity changes over time, it is not enough to store just one single (the newest) revision of each article. Instead, each article a is assigned a set of revisions R. This means that information about a specific revision, such as the text itself or external links, is stored for each revision r ∈ R. In Figure 8.1, a part of the data model is shown as an entity-relationship model. It has the important tables Article and Revision, but some other entities and attributes for the comparisons and extracted information are left out.

To compare articles, comparisons are stored: Each comparison (a1, a2) ∈ C consists of two articles written in different languages1. For each comparison, there is a set of revision pairs whose similarity values have to be computed. The process

1One of these languages has to be English.

Figure 8.1 Data Model

of finding revisions that belong together is shown in Section 8.2. When extracting the article texts, their inner structure is stored using a hierarchical approach: On the lowest level, there are sentences (table Sentence). Each sentence belongs to a paragraph (Paragraph) which is given by the inner Wikipedia structure described in Section 3.1.5². These paragraphs can be bundled in a bottom-up manner to build bigger paragraphs – each together with a title given by Wikipedia. On the top level, this results in a paragraph that contains all the other paragraphs (and hence all sentences) of the whole article. Its title is the article's title. This makes it a representation of the whole article. Apart from the sentences and the lowest paragraph level, this paragraph structure is also given by the table of contents, which is apparent on the page of each Wikipedia article as well. To store this structure, it is sufficient to store the text of the whole article and the start and end positions of each sentence and paragraph. Besides, each sentence and paragraph – except for the top one – is assigned to its upper paragraph (relation contains for the paragraphs and consists of for sentences). As the comparisons initially are sentence-based, the sentence text is stored for each sentence redundantly for performance reasons. For the extraction and storage of additional information (not shown in the diagram), we distinguish between two levels (taking external links as an example):

2These paragraphs must not be confused with those that were algorithmically created in Section 7.1, but they can help to have upper bounds for the size of algorithmically created paragraphs.

• Revision level: For each revision, there is a list of external links and how often they occur in that revision.

• Sentence level: For each sentence, each occurring external link is stored together with its position in that sentence.

While most of the information on the revision level could be extracted from the sentence level, there are two reasons for this redundancy: First, there is no need to implement and use such an extraction procedure. Moreover, this allows for a quite robust information storage with two data sets that are independent of each other. For example, the list of "see also" Wikipedia pages at the end of an article may be ignored during the sentence extraction, but for the comparison on the revision level, these links may still be of interest to keep it complete. For performance reasons, there is a table Text. If a sentence is stored, its text is stored in this table, if it has not already been stored. This happens quite often, as sentences may remain unchanged across different revisions over time. Every operation on sentence texts is run against the data in this Text table. To avoid database joins and thereby accelerate the loading of sentences with all their information, most attributes of Text are stored for the sentence pairs redundantly (not shown in the diagram).
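For illustration, a highly simplified Java sketch of the entities just described; the field names are illustrative and the actual schema in Figure 8.1 contains more attributes and relations.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Map;

/** Simplified sketch of the data model from Figure 8.1. */
class Article {
    String entityName;            // e.g. "Berlin"
    String language;              // Wikipedia language version l
    List<Revision> revisions = new ArrayList<>();
}

class Revision {
    long revisionId;
    Date date;
    Paragraph topParagraph;       // top-level paragraph representing the whole article text
    Map<String, Integer> externalLinks;   // revision-level feature: link -> number of occurrences
}

class Paragraph {
    String title;                 // section title given by Wikipedia
    int start, end;               // character positions within the article text
    List<Paragraph> subParagraphs = new ArrayList<>();
    List<Sentence> sentences = new ArrayList<>();
}

class Sentence {
    int start, end;
    String text;                  // stored redundantly for performance reasons
    Map<String, Integer> externalLinks;   // sentence-level feature: link -> position in the sentence
}
```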

8.2 Comparison Extracting

Table 8.1 gives a fictional example3 of the revisions of an article in the three languages English, German and Dutch4. Similar to what is described in Section 2.4, this is one of the cases where the English article has the largest number of revisions. To create comparisons between these articles, there is one condition that must hold in any case: If two or more revisions are compared, they must have been online at the same time. For example, the English revision with the number 301 (online between 14/2/13 and 15/6/13) may be compared with the German revision with the number 190 (online between 5/1/13 and 21/7/14), but not with any other German one. With this restriction, it is assured that time-dependent events are not the reason for text differences. Another condition is that each comparison needs to contain one revision of each given language. These two conditions already lead to the fact that every revision from before 7/4/11 (the latest of the earliest dates per language) can be ignored in the example. A naive approach for the extraction of revisions is to take every possible combination of revisions that is valid with respect to the time condition.

3The data is fictional, revision numbers only consist of three digits and exact times are ignored (only the dates are considered).
4Until now, we only considered binary comparisons, but in future work ternary comparisons could be possible as well.

(a) English revisions
Number   Date
53       5/10/09
78       10/10/09
142      23/6/10
148      24/6/10
187      1/3/11
203      7/9/11
234      12/7/12
237      14/7/12
257      22/9/12
301      14/2/13
334      15/6/13
345      19/8/14
401      6/12/14

(b) German revisions
Number   Date
20       5/5/10
42       11/3/10
73       12/3/11
90       24/7/11
175      18/11/12
190      5/1/13
251      21/7/14
267      4/8/14
289      27/12/14

(c) Dutch revisions
Number   Date
22       7/4/11
49       14/8/13
58       30/10/13

Table 8.1 Example of Revisions of an Article in Different Languages

However, there are two problems with this idea: On the one hand, it would make it necessary to store every single revision, which means a lot of time and storage consumption: The – according to [29] – most controversial English Wikipedia article, "George W. Bush", consists of 45,638 revisions5, which makes it nearly impossible to store and process all revisions, considering the preprocessing pipeline that will be shown in the next sections. Furthermore, some of the revisions should be categorised as invalid and be ignored, because they are the result of vandalism or simply too small (for example article stubs6 that just consist of a single sentence) to get reasonable comparison results. To overcome these problems, the revision finding consists of the following steps:

1. Per language version: Find all revisions (together with their dates and additional information)7.

2. Mark those revisions that don't fulfil certain criteria as "invalid".

3. For the remaining revisions: Find a limited number of tuples/triples with one revision for each language that fulfil the time condition.

At the moment, the last step is done by discarding every second revision as long as the number of revisions is above a predefined number. This produces a set of revision pairs that are spaced equidistantly in terms of edits.

5http://tools.wmflabs.org/xtools-articleinfo/index.php?article=George W. Bush&lang=en&wiki=wikipedia
6Wikipedia defines stubs as "A stub is an article deemed too short to provide encyclopedic coverage of a subject" (http://en.wikipedia.org/wiki/Wikipedia:Stub).
7This can be done using the Wikipedia API. Example: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=George W. Bush&rvprop=timestamp|user&rvlimit=max

However, this procedure could be improved by automatically detecting points with many edits that indicate important changes. In our example, the procedure could result in the revision triples shown in Table 8.2 (after discarding revisions before 7/4/11 and then discarding every second revision triple), with a maximum number of revisions of 3:

Number (English)   Number (German)   Number (Dutch)   Date
203                90                22               7/9/11
237                90                22               14/7/12
301                190               22               14/2/13
345                267               58               19/8/14

Table 8.2 Example of Revision Triples
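A minimal sketch of this selection step; the Revision type and the overlap test are simplified, and the halving loop mirrors the "discard every second revision" heuristic described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the revision selection: time-overlap check plus the halving heuristic. */
public class RevisionSelection {

    /** A revision is considered online from its own date until the date of its successor. */
    record Revision(long number, long onlineFrom, long onlineUntil) {}

    /** Two (or more) revisions may only be compared if they were online at the same time. */
    static boolean overlap(Revision a, Revision b) {
        return a.onlineFrom() < b.onlineUntil() && b.onlineFrom() < a.onlineUntil();
    }

    /** Discards every second entry until at most maxRevisions remain. */
    static <T> List<T> thinOut(List<T> revisions, int maxRevisions) {
        List<T> result = new ArrayList<>(revisions);
        while (result.size() > maxRevisions) {
            List<T> halved = new ArrayList<>();
            for (int i = 0; i < result.size(); i += 2) {
                halved.add(result.get(i));
            }
            result = halved;
        }
        return result;
    }
}
```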

8.3 Preprocessing Pipeline

To extract the features needed for the sentence alignment and the revision similarity measures, each revision has to run through the preprocessing pipeline depicted in Figure 8.2.

Figure 8.2 Preprocessing Pipeline: Wikipedia Revision → HTML Parsing → Sentence Extraction (OpenNLP) → Wikipedia Annotation (DBpedia Spotlight) → Sentence Translation (Bing Translator) → Time Annotation (SUTime) → Stemming & Stop Words (Lucene)

This easily-reproducible pipeline contains the following steps for processing a sin- gle revision:

• HTML Parsing: To retrieve the textual content, the structure and additional information on the revision level, the HTML code of the revision is processed and the Wikipedia API is queried. This step is described in more detail in Section 8.4.

• Sentence Extraction: Given the parts of the revision with the essential text, we split it into sentences using the OpenNLP Sentence Detector8. Each sentence is associated with the internal Wikipedia links it contains (identified using Wikipedia mark-up).

• Wikipedia Annotation: As reasoned in Section 5.4, we use the DBpedia Spotlight service [7] with a confidence value of 0.6.

• Sentence Translation: To facilitate term-based cosine similarity computation, each non-English sentence is translated into English using the Bing Translator9. This is not done for the whole article at once, to keep the sentence detection robust.

• Time Annotations: To enable time-based similarity computation, we extract time expressions from the (translated) English sentences using the Stanford Temporal Tagger (SUTime) [5] and reduce the extracted set to the time expressions that can be mapped to non-recurring time intervals.

• Stemming and Stop Word Removal: For the cosine similarity measures, the (translated) English texts are further modified by applying stemming and stop word removal (using the Apache Lucene library).

Another preprocessing step that is independent of the actual revisions is the collection of language links of all the entities that are mentioned as internal links or extracted via DBpedia Spotlight. This is done by using the Wikipedia API. An alternative approach is to use a Wikipedia data dump that already contains all language links.
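The following sketch strings the pipeline steps together. All class and method names (HtmlParser, SentenceSplitter, and so on) are hypothetical wrappers around the tools named above (OpenNLP, DBpedia Spotlight, Bing Translator, SUTime, Lucene), not their real APIs.

```java
import java.util.List;

/** Sketch of the per-revision preprocessing pipeline (Figure 8.2); all helpers are hypothetical. */
public class PreprocessingPipeline {

    interface HtmlParser       { ParsedRevision parse(String html); }
    interface SentenceSplitter { List<Sentence> split(ParsedRevision revision); }           // OpenNLP
    interface EntityAnnotator  { void annotate(Sentence s, double confidence); }            // DBpedia Spotlight
    interface Translator       { String toEnglish(Sentence s); }                            // Bing Translator
    interface TimeTagger       { void tagTimes(Sentence s, String englishText); }           // SUTime
    interface TermNormalizer   { List<String> stemAndRemoveStopWords(String englishText); } // Lucene

    record ParsedRevision(String text) {}
    static class Sentence { String text; boolean isEnglish; List<String> terms; }

    void process(String revisionHtml, HtmlParser parser, SentenceSplitter splitter,
                 EntityAnnotator annotator, Translator translator,
                 TimeTagger timeTagger, TermNormalizer normalizer) {
        ParsedRevision revision = parser.parse(revisionHtml);          // structure, links, etc.
        for (Sentence s : splitter.split(revision)) {
            annotator.annotate(s, 0.6);                                // Wikipedia/DBpedia annotations
            String english = s.isEnglish ? s.text : translator.toEnglish(s);
            timeTagger.tagTimes(s, english);                           // non-recurring time intervals
            s.terms = normalizer.stemAndRemoveStopWords(english);      // terms for cosine similarity
        }
    }
}
```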

8.4 Text Parsing

Before extracting the text of a revision, splitting the sentences and building the hierarchical paragraph structure, we had to decide which text format to use. This decision must consider the extraction process itself, but also the final visualisation of the results. Three approaches can be considered: (a) pure text, (b) HTML format, and (c) Wiki markup.

8https://opennlp.apache.org/ 9http://www.microsoft.com/translator/ 8.5 Resources 91

For approach (a), there exist algorithms like [15] that are used to extract the "main content" of a website, ignoring parts like navigational elements and advertisements. While this preprocessing of the text could simplify the use of textual features a lot and rather unimportant parts such as the "see also" notes would automatically be ignored, this approach is too imprecise: It is difficult to control which text parts are finally removed, the hierarchical structuring and link extraction get complicated, and in a final visualisation the article layout information is missing. The decision between using HTML or Wiki markup as text format is based on the aspects given in the following list, with "+" indicating advantages of parsing HTML.

+ There is no need for an extra parsing step for visualisation¹⁰.
+ For extraction and structuring, well-known HTML parsers can be used.
+ External tools like machine translators often offer options to work on HTML texts.
− Wiki markup is more robust and will probably remain unchanged for many years.
− Wiki markup carries a more consistent and semantic meaning.

In our implementation, we decided to use HTML texts, although the use of Wiki markup can be a reasonable choice as well. The most important elements in the HTML parsing are the <p> tags, which contain the essential textual paragraphs. For paragraph structuring, it is necessary to retrace the hierarchical structure by identifying the nested titles given by the <h2>, <h3>, . . . tags. Two types of semantic features are already extracted in this step and associated with each sentence: internal and external links. By storing the start and end positions of the sentences with respect to the whole article, these features can be assigned to their sentences. External links can be found at the bottom of an article; they are linked to the text by footnotes. Their extraction works similarly to that of the internal links, but each footnote must additionally be mapped to its external link. Especially for the revision similarity measures, several calls to the Wikipedia API are necessary as well, for example to obtain the revision history and the authors.
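
The thesis does not prescribe a specific HTML parser; the following sketch uses jsoup as an example of a well-known parser to walk over the paragraph and heading elements of a revision. The extraction of link and footnote features is only hinted at in the comments.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ArticleHtmlParser {

    // Walks over the heading and paragraph elements of a revision's HTML and
    // prints the section hierarchy together with the paragraph texts. In the
    // actual pipeline, internal links (<a href="/wiki/...">) and footnote
    // references would be collected from the same elements and stored together
    // with the sentence offsets.
    public static void parse(String html) {
        Document doc = Jsoup.parse(html);
        for (Element element : doc.body().select("h2, h3, h4, p")) {
            switch (element.tagName()) {
                case "h2":
                    System.out.println("= " + element.text());
                    break;
                case "h3":
                case "h4":
                    System.out.println("== " + element.text());
                    break;
                default: // a <p> element containing a textual paragraph
                    System.out.println(element.text());
            }
        }
    }
}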

8.5 Resources

The program for extraction, evaluation, alignment etc. and the user web interface are implemented in Java. For the user study, the JavaServer Faces framework¹¹ was used. To create the user interface shown in Section 7.3 that can be used to browse through the comparisons and to inspect the results, a website was created using the Vaadin Java web framework¹². The following components were implemented:

• Author locations map: Map of the world where the countries of the Wikipedia authors who contributed to the article are marked. This was done using the HighMaps JavaScript API¹³.

• Revision similarity over time graph: Time line that shows the number of edits per day and the development of the overall similarity. This was done using the HighCharts JavaScript API¹⁴.

• Feature tables: Several tables that contrast the features found in the two articles.

• Textual article comparison: Side-by-side view of both articles in which similar text passages are connected using the jsPlumb JavaScript library¹⁵.

¹⁰ When using Wiki markup, one of the parsers listed here can be used: http://www.mediawiki.org/wiki/Alternative_parsers
¹¹ https://javaserverfaces.java.net/
¹² https://vaadin.com/
¹³ http://www.highcharts.com/products/highmaps
¹⁴ http://www.highcharts.com/products/highcharts
¹⁵ https://jsplumbtoolkit.com/demo/flowchart/dom.html

9 Discussion and Future Work

Our research on cross-lingual differences in Wikipedia revealed a lot of useful information that is finally brought together into two important means to better understand article similarity: the identification and visualisation of common sentences and paragraphs in the texts, as well as the definition of an overall similarity for an article pair.

9.1 Discussion

Approach Overview

Our research aim was to develop algorithms and similarity measures that enable comparing Wikipedia articles describing the same entity in different language editions. This was done in several steps: First, the texts of the articles were split into sentences from which syntactic and semantic features were extracted. This approach allowed us to create two types of alignment functions: At first, we identified sentence pairs containing at least partially overlapping facts. Then, we merged these sentences to align similar paragraphs. Second, we defined a similarity score for an article pair. This score does not only take the textual similarity into account, but also other features like the fraction of common images and authors. Finally, all this information was used to create an example application that presents the similarity values and the aligned paragraph pairs to human users.

Discussion of the Results

To create and evaluate the proposed sentence alignment function, we have conducted a user study. This study showed that for the syntactic similarity, the cosine measure beats the text overlap similarity because it emphasises selective terms. As additional semantic features, we extracted Wikipedia and time annotations. For the Wikipedia annotations (i.e. the links within Wikipedia), it did not suffice to take only the given internal Wikipedia links, due to their sparsity. Therefore, we extended these links with Wikipedia links identified by an external state-of-the-art NER tool. By including these semantic features, we improved the number of correctly found sentences and the precision. In numbers, the break-even point increased from 70.95% to 77.52%. This BEP also outperforms the three baselines that we used for comparison.

For the paragraph alignment, we created an algorithm that consists of two steps: (1) neighbouring sentences are aggregated if the similarity with regard to the sentences in the other article increases; and (2) paragraphs that are within a short distance are merged, as long as they are within the same paragraphs as given by the Wikipedia structure. Using this approach, many paragraph pairs were created that consist of more than three sentences each.

The similarity score for an article pair was composed of nine similarity values, of which three came from the texts and the similar paragraphs. For temporal research, we have not only compared the articles at extraction time, but also older revisions, and plotted the development of the revisions' similarity over time to obtain a history chart. As it is difficult to evaluate an abstract number that represents the similarity of an article pair, we have ranked our list of Wikipedia article pairs by their similarity score and took a deeper look at the top and the lowest ranked articles. For these examples, the similarity score proved to be plausible.

Our example application for visualisation offers various possibilities to compare articles at different granularity levels: On the one hand, there are more "technical" measures like the overlap of common entities or the number of similar external links that indicate an overall similarity of the article pair. These scores allow us to rank, e.g., article revisions across languages according to their similarity. However, such technical measures do not suffice to illustrate important differences and similarities in the text and may be difficult to interpret for a human user, especially when the number of entities and links grows. On the other hand, our paragraph alignment approach is directly visualised on the text. The features mentioned above automatically flow into the paragraph similarity measure in the form of semantic features, and the users do not have to interpret them any further. Using this interface, the users can immediately see where they can find similar and different information in which article.
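
To recall why the cosine measure emphasises selective terms, the following sketch computes a TF-IDF-weighted cosine similarity between two stemmed, stop-word-filtered term lists. It is a simplified illustration; the exact weighting scheme defined earlier in this thesis may differ, and the inverse document frequency map is assumed to be precomputed.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CosineSimilarity {

    // Cosine similarity between two stemmed, stop-word-filtered term lists.
    // Each term is weighted by tf * idf, so rare ("selective") terms contribute
    // more to the score than terms that occur in almost every article.
    public static double cosine(List<String> a, List<String> b, Map<String, Double> idf) {
        Map<String, Double> va = weight(a, idf);
        Map<String, Double> vb = weight(b, idf);
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : va.entrySet()) {
            dot += entry.getValue() * vb.getOrDefault(entry.getKey(), 0.0);
        }
        double norms = norm(va) * norm(vb);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    private static Map<String, Double> weight(List<String> terms, Map<String, Double> idf) {
        Map<String, Double> vector = new HashMap<>();
        for (String term : terms) {
            // term frequency accumulated as a sum of idf weights
            vector.merge(term, idf.getOrDefault(term, 1.0), Double::sum);
        }
        return vector;
    }

    private static double norm(Map<String, Double> vector) {
        double sum = 0.0;
        for (double value : vector.values()) {
            sum += value * value;
        }
        return Math.sqrt(sum);
    }
}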

9.2 Future Research Directions

The findings offer several aspects that can be tackled in future work: For example, the article comparisons were applied to each language pair independently. To gain more insights into language differences, it would be interesting to see the similarity development for more than one language in the same history chart. To implement the ideas from [26], the tables that show common features like authors and images could also be extended to more than two columns (i.e. more than two languages).

On the social research side, our findings lead to many questions that can be investigated more closely: By examining the history charts for a larger number of articles, it could be possible to find typical ways in which articles are created. For example, a German article may start as a small stub that is later extended with information from the English article. After a while, the editing may continue independently in each language, such that the similarity drops again.

In [17], the idea of a linguistic point of view is presented that is contrary to the demand for neutrality. We can confirm that the information selection for a specific article can differ a lot among languages, as only few article pairs reached high similarity scores. To investigate neutrality, it is also important to look at the Wikipedia authors and their locations, which we have done by introducing not only an author similarity, but also an author location similarity. To test the idea of the linguistic point of view, the author location similarity may be an aspect to look at in more detail.

Another interesting aspect is the question of whether the articles in some domains (e.g. locations) tend to be more similar than those in other domains (e.g. political topics). By looking at the similarity values for 45 randomly chosen articles and 15 articles that were taken from a list of controversial articles, we already made the assumption that controversial articles and those describing very general topics like "commuter rail" tend to be rather dissimilar. A systematic study of the articles in different Wikipedia categories using the similarity measures proposed in this thesis is an interesting direction for future research.

Bibliography

[1] S. F. Adafre and Maarten de Rijke. Finding Similar Sentences across Multiple Languages in Wikipedia. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, EACL '06, pages 62–69. Association for Computational Linguistics, 2006.

[2] S. Alzahrani, N. Salim, C. K. Kent, M. S. Binwahlan, and L. Suanmali. The development of cross-language plagiarism detection tool utilising fuzzy swarm-based summarisation. In Proceedings of the 10th International Conference on Intelligent Systems Design and Applications (ISDA 2010), pages 86–90, November 2010.

[3] Salha M. Alzahrani, Naomie Salim, and Ajith Abraham. Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 42(2):133–149, March 2012.

[4] The Apache Software Foundation. Apache OpenNLP Developer Documentation, 1.5.3 edition.

[5] Angel X. Chang and Christopher D. Manning. SUTime: A library for recognizing and normalizing time expressions. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), 2012.

[6] Charles L. A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, pages 659–666, New York, NY, USA, 2008. ACM.

[7] Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, pages 121–124, New York, NY, USA, 2013. ACM.

[8] Elena Filatova. Directions for exploiting asymmetries in multilingual Wikipedia. In Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies, CLIAWS3 '09, pages 30–37, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[9] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 363–370, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.

[10] William A. Gale and Kenneth W. Church. A program for aligning sentences in bilingual corpora. In Proceedings of the 29th Annual Meeting on Association for Computational Linguistics, ACL ’91, pages 177–184, Stroudsburg, PA, USA, 1991. Association for Computational Linguistics.

[11] Marti A. Hearst. Texttiling: Segmenting text into multi-paragraph subtopic passages. Comput. Linguist., 23(1):33–64, March 1997.

[12] Brent Hecht and Darren Gergle. The Tower of Babel meets Web 2.0: User-generated content and its applications in a multilingual context. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10, pages 291–300, New York, NY, USA, 2010. ACM.

[13] Mahboob Alam Khalid, Valentin Jijkoun, and Maarten de Rijke. The impact of named entity normalization on information retrieval for question answering. In Proceedings of the IR Research, 30th European Conference on Advances in Information Retrieval, ECIR '08, pages 705–710, Berlin, Heidelberg, 2008. Springer-Verlag.

[14] Philipp Koehn. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the Tenth Machine Translation Summit, pages 79–86, Phuket, Thailand, 2005. AAMT.

[15] Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. Boilerplate detection using shallow text features. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 441–450, New York, NY, USA, 2010. ACM.

[16] Malte Landwehr. Jeder 3. DAX Konzern manipuliert bei Wikipedia. http://www.lorm.de/2008/03/11/jeder-3-dax-konzern-manipuliert-bei-wikipedia/, 2008. [Online; accessed 23-October-2014].

[17] Paolo Massa and Federico Scrinzi. Manypedia: Comparing language points of view of Wikipedia communities. In Proceedings of the Eighth Annual International Symposium on Wikis and Open Collaboration, WikiSym '12, pages 21:1–21:9, New York, NY, USA, 2012. ACM.

[18] Rada Mihalcea and Andras Csomai. Wikify!: Linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM ’07, pages 233–242, New York, NY, USA, 2007. ACM.

[19] Mehdi Mohammadi and Nasser GhasemAghaee. Building bilingual parallel corpora based on Wikipedia. In Proceedings of the 2010 Second International Conference on Computer Engineering and Applications - Volume 02, ICCEA '10, pages 264–268, Washington, DC, USA, 2010. IEEE Computer Society.

[20] Erwan Moreau, François Yvon, and Olivier Cappé. Robust similarity measures for named entities matching. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING '08, pages 593–600, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.

[21] Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity linking meets word sense disambiguation: a unified approach. TACL, 2:231–244, 2014.

[22] David Nadeau and Satoshi Sekine. A survey of named entity recognition and classification. Linguisticae Investigationes, 30(1):3–26, January 2007. Publisher: John Benjamins Publishing Company.

[23] Mustafa Ozuysal, Pascal Fua, and Vincent Lepetit. Fast keypoint recognition in ten lines of code. In 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.

[24] C. Ramisch. Multiword Expressions Acquisition: A Generic and Open Frame- work. Theory and Applications of Natural Language Processing. Springer Inter- national Publishing, 2014.

[25] Michael Röder, Ricardo Usbeck, Sebastian Hellmann, Daniel Gerber, and Andreas Both. N³ – A collection of datasets for named entity recognition and disambiguation in the NLP Interchange Format. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC '14), Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA).

[26] Richard Rogers. Digital Methods. The MIT Press, 2013.

[27] Michael Schäfers. Ähnlichkeitsbasiertes Matching linienförmiger räumlicher Objekte. Dissertation in progress, 2015.

[28] Ralf Steinberger. Cross-lingual similarity calculation for plagiarism detection and more - tools and resources. In Pamela Forner, Jussi Karlgren, and Christa Womser-Hacker, editors, CLEF (Online Working Notes/Labs/Workshop), 2012.

[29] Taha Yasseri, Anselm Spoerri, Mark Graham, and János Kertész. The most controversial topics in Wikipedia: A multilingual and geographical analysis. CoRR, abs/1305.5566, 2013.

Declaration

I hereby declare that I have written this thesis independently and have not used any sources or aids other than those indicated, that all passages of the thesis taken verbatim or in substance from other sources are marked as such, and that the thesis has not been submitted in the same or a similar form to any other examination board.

Hannover, 20 March 2015

Simon Gottschalk
