ANALYSING LANGUAGE-SPECIFIC DIFFERENCES IN
MULTILINGUAL WIKIPEDIA
Faculty of Electrical Engineering and Computer Science of the Gottfried Wilhelm Leibniz Universität Hannover, submitted for the degree of
Master of Science
M. Sc.
Thesis by
Simon Gottschalk
First examiner: Prof. Dr. techn. Wolfgang Nejdl
Second examiner: Prof. Dr. Robert Jäschke
Supervisor: Dr. Elena Demidova
2015

ABSTRACT
Wikipedia is a free encyclopedia that has editions in more than 280 languages. While Wikipedia articles referring to the same entity often co-exist in many Wikipedia language editions, such articles evolve independently and often contain complementary information or represent a community-specific point of view on the entity under consideration. In this thesis we analyse features that enable uncovering such edition-specific aspects within Wikipedia articles, in order to provide users with an overview of the overlapping and complementary information available for an entity in different language editions.

In this thesis we compare Wikipedia articles at different levels of granularity: first, we identify similar sentences. Then, these sentences are merged to align similar paragraphs. Finally, a similarity score at the article level is computed. To align sentences, we employ syntactic and semantic features including cosine similarity, links to other Wikipedia articles and time expressions. We evaluated the sentence alignment function on a dataset containing 1155 sentence pairs extracted from 59 articles in the German and English Wikipedia that had been annotated during a user study. Our evaluation results demonstrated that the inclusion of semantic features can lead to an improvement of the break-even point from 70.95% to 77.52% on this dataset.

Given the sentence alignment function, we developed an algorithm to build similar paragraphs starting from the sentences that have been aligned before. We implemented a visualisation of the algorithm results that enables users to obtain an overview of the similarities and differences in the articles by looking at the paragraphs aligned using the proposed algorithm, as well as the remaining paragraphs, whose contents are unique to an article in a specific language edition. To further support this comparison, we defined an overall article similarity score and applied this score to illustrate temporal differences between article editions.
Finally, we created a Web-based application presenting our results and visualising all the aspects described above.

In future work, the algorithms developed in this thesis can be directly applied to help Wikipedia authors by providing an overview of the entity representation across Wikipedia language editions. These algorithms can also form a basis for cultural research towards a better understanding of the language-specific similarities and differences in multilingual Wikipedia.

Contents
Table of Contents
List of Figures
List of Tables
List of Algorithms

1 Introduction
1.1 Motivation
1.2 Problem Definition
1.3 Overview

2 Background on Multilingual Wikipedia
2.1 Overview
2.2 Wikipedia Guidelines
2.2.1 Translations
2.2.2 Neutrality
2.3 Linguistic Point of View
2.4 Reasons for Multilingual Differences
2.5 Wikipedia Studies

3 Background on Multilingual Text Processing
3.1 NLP for Multilingual Text
3.1.1 Machine Translation
3.1.2 Textual Features
3.1.3 Topic Extraction
3.1.4 Sentence Splitting
3.1.5 Other NLP Techniques
3.2 Aligning Multilingual Text
3.2.1 Comparable Corpora
3.2.2 Plagiarism Detection in Multilingual Text

4 Approach Overview

5 Feature Selection and Extraction
5.1 Syntactic Features
5.2 Evaluation on Sentence Similarity of Parallel Corpus
5.3 Semantic Features
5.4 Evaluation of Entity Extraction Tools
5.4.1 Aim and NER Tools
5.4.2 Data
5.4.3 Entity Extraction and Comparison
5.4.4 Comparison
5.4.5 Results

6 Sentence Alignment and Evaluation
6.1 Data
6.2 Pre-Selection of Sentence Pairs
6.3 Selection of Sentence Pairs for Evaluation
6.4 User Study
6.5 Judgement of Similarity Measures
6.6 Second Dataset
6.7 Pre-Selection and Creation of Similarity Function
6.8 Results

7 Paragraph Alignment and Article Comparison
7.1 Finding Similar Paragraphs
7.1.1 Aggregation of Neighboured Sentences
7.1.2 Aggregation of Proximate Sentence Pairs
7.1.3 Paragraph Aligning Algorithm
7.2 Similarity on Article Level
7.2.1 Text Similarity
7.2.2 Feature Similarity
7.2.3 Overall Similarity
7.3 Visualisation

8 Implementation
8.1 Data Model
8.2 Comparison Extracting
8.3 Preprocessing Pipeline
8.4 Text Parsing
8.5 Resources

9 Discussion and Future Work
9.1 Discussion
9.2 Future Research Directions

Bibliography
List of Figures
1.1 Text Comparison Example

2.1 English Wikipedia Article "Großer Wannsee"
2.2 Interlanguage Links for the English Article "Pfaueninsel"

3.1 First Paragraphs of the Wikipedia Article "Berlin"

4.1 Process of Article Comparison

5.1 Precision-Recall Graphs for Textual Features with Break-Even Points
5.2 Box Plots for Textual Features

6.1 Screenshot of User Study on Similar Sentences
6.2 Correlation of Syntactic Features for the First Dataset
6.3 Correlation of Text Length Similarity
6.4 Correlation of External Links Similarity
6.5 Correlation of Time and Entity Similarity for the First Dataset
6.6 Iteration to Create Similarity Functions
6.7 Precision-Recall Diagram of Sentences with Overlapping Facts
6.8 Precision-Recall Diagram of Sentences with the Same Facts
6.9 Precision-Recall Diagram of Sentences with the Same Facts (Adjusted Similarity Functions)

7.1 Paragraph Construction Example (Step 1)
7.2 Paragraph Construction Example (Steps 2 and 3)
7.3 Paragraph Construction Example (Steps 4 and 5)
7.4 Comparison of the English and German Article on "Knipp"
7.5 Website Example: Text
7.6 Website Example: Links
7.7 Website Example: Images
7.8 Website Example: Authors
7.9 Website Example: Overall Similarity

8.1 Data Model
8.2 Preprocessing Pipeline

List of Tables
2.1 Statistics on Wikipedias in Different Languages

3.1 Machine Translation Example

5.1 Example Sentence Pairs for Time Similarity
5.2 Statistics of the N3 Dataset
5.3 Number of Entities Extracted from English Texts
5.4 Number of Entities Extracted from German Texts
5.5 Results of Entity Extraction

6.1 Wikipedia Articles Used in the User Study
6.2 Feature Combination Distribution in 14 Wikipedia Articles
6.3 Weights of Similarity Functions for Pre-Selection
6.4 Feature Combination Distribution in Pre-Selected Sentence Pairs
6.5 Feature Distribution in the Dataset for the First Round of Evaluation
6.6 Correlation Coefficients for Similarity Measures
6.7 Dataset Evaluated in the Second Round
6.8 Retrieved Sentence Pairs per Article Pair

7.1 Composition of Overall Similarity
7.2 60 Wikipedia Article Pairs Ordered by Overall Similarity

8.1 Example of Revisions of an Article in Different Languages
8.2 Example of Revision Triples
List of Algorithms
5.1 Computation of TP, FP and FN for the Evaluation of Entity Extraction
6.1 Identification of Candidates for Similar Sentences
7.1 Extension of Sentence Pairs with Neighbours
7.2 Extension of a Sentence with its Neighbours
7.3 Aggregation of Sentence Pairs
7.4 Paragraph Alignment
1 Introduction
Wikipedia1 is a user-generated online encyclopaedia that is available in more than 280 languages and is widely used: the English Wikipedia alone currently counts more than 24 million registered users, and each of the 12 largest language editions contains more than a million articles2. Wikipedia articles describing real-world entities, topics, events and concepts evolve independently in different language editions. So far, there are only insufficient possibilities to benefit from the knowledge that can be gained from these differences, although it could be useful for social research or for extending Wikipedia articles with content from other language versions. Therefore, in this thesis we propose methods to automate a detailed comparison of Wikipedia articles that describe the same entities in different languages, and we create an example application that presents the findings to human users.

Wikipedia articles can be compared at different levels of granularity. In this work we focus on three levels: the sentence level, the paragraph level and the article level. They are processed in a bottom-up order: similar sentences are identified and merged to find similar paragraphs. The fraction of overlapping paragraphs is then used as an important component of the similarity score at the article level.

First, we develop methods to identify and align similar sentences in the articles. To do so, we analyse the effectiveness of several syntactic and semantic features extracted from the texts. Moreover, we go further than related studies in this field by aligning not only sentences that state the same facts, but also sentences with partly overlapping contents. As this step builds the foundation for the paragraph alignment and the article comparison, we perform an extensive user study to evaluate and fine-tune our proposed similarity functions.
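The bottom-up comparison starts from a sentence alignment function built on features such as the cosine similarity between sentence term vectors. The following is a minimal illustrative sketch of this feature, not the thesis's actual implementation: the whitespace tokenisation, the threshold value and the function names are assumptions made for the example.

```python
import math
from collections import Counter

def cosine_similarity(sentence_a: str, sentence_b: str) -> float:
    """Cosine similarity between term-frequency vectors of two sentences."""
    va, vb = Counter(sentence_a.lower().split()), Counter(sentence_b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def align_sentences(sents_a, sents_b, threshold=0.5):
    """Return index pairs of sentences whose similarity exceeds the threshold."""
    return [(i, j)
            for i, sa in enumerate(sents_a)
            for j, sb in enumerate(sents_b)
            if cosine_similarity(sa, sb) >= threshold]
```

In practice, the thesis combines this syntactic feature with semantic ones (links, time expressions and named entities) rather than relying on term overlap alone, since translated sentence pairs rarely share surface vocabulary.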
In the second step, we use the resulting sentence alignment to develop algorithms for the alignment of similar paragraphs. This paragraph alignment method contributes to an improved visualisation of the textual comparison by creating larger paragraphs from the sentence pairs aligned in the previous step. Finally, as Wikipedia articles contain much more information than the raw texts (images, authors, links, etc.), we define further similarity measures that are applied at the article level to compute an overall similarity value for two articles in different languages.

These approaches to finding similarities and differences across article pairs that describe the same entity in different languages open up many possibilities for investigating cross-lingual differences: amongst others, we implement applications that illustrate the development of the article similarity over time, rank article pairs by their similarity, and juxtapose the article texts in different languages to visualise common paragraphs. These applications can support Wikipedia editors and researchers by providing an overview of the similarities and differences of the articles and their temporal development.

1http://www.wikipedia.org/
2http://meta.wikimedia.org/wiki/List_of_Wikipedias
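The overall article similarity combines the text-based score with further feature similarities such as images, authors and links. A minimal sketch of such a weighted combination; the feature names and weight values here are illustrative assumptions, not the weights derived later in the thesis:

```python
def overall_similarity(feature_scores: dict, weights: dict) -> float:
    """Weighted average of per-feature similarity scores, each in [0, 1]."""
    total = sum(weights[f] for f in feature_scores)
    if total == 0:
        return 0.0
    return sum(feature_scores[f] * weights[f] for f in feature_scores) / total

# Hypothetical scores for one article pair (not measured values).
scores = {"text": 0.6, "links": 0.4, "images": 0.8, "authors": 0.1}
weights = {"text": 0.5, "links": 0.2, "images": 0.2, "authors": 0.1}
```

Normalising by the sum of the weights keeps the result in [0, 1] even when some features are unavailable for a given article pair.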
1.1 Motivation
While collaboration is an indispensable part of Wikipedia editing within one language edition, it becomes a problem across languages: apart from the interlanguage links connecting articles on the same entities, multilingual coordination across Wikipedia is difficult; each Wikipedia edition even has a separate set of user accounts3. Therefore, a tool that compares articles across languages can help to bridge this gap. Further aspects that our research aims at are listed below:
Social and cultural research: As Wikipedia articles are continuously written over a long period of time by a large number of editors, a study of Wikipedia articles can always be seen as an investigation of their users as well.
Help for Wikipedia authors: When a Wikipedia author wants to add something to an article, it is very probable that they will find additional information in an article in another language. If we provide a means to visualise the text passages or concepts that do not occur in the version in the author's language, they can quickly get an idea of which information is worth adding to the article.
Trustworthiness of Wikipedia: Wikipedia is part of many investigations and programs, both for direct human interaction and for indirect information collection by automated systems. Given this importance of Wikipedia as an information resource, there have been many discussions on the reliability of Wikipedia4. By taking into account not just one Wikipedia edition, but extracting the information of more than one language version, it becomes possible to collect information from independent groups of authors5 and either to further expand the knowledge with language-exclusive content or to discover language-specific differences. This allows for a better estimation of how reliable the texts are.

3http://en.wikipedia.org/wiki/Wikipedia:Multilingual_coordination
4http://en.wikipedia.org/wiki/Reliability_of_Wikipedia
Statistics: Many different statistics and tools about Wikipedia are accessible, mostly concerning the development of page views and edits6. This shows that there is great interest in automatically deriving interesting information from Wikipedia. For multilingual comparisons, the website www.manypedia.com follows an approach similar to ours, but does not go deeper into textual similarity.
Existence of neutrality across languages: Finally, the question arises whether it is possible to uphold the idea of a neutral point of view across languages, which also means across cultures. However, this question is out of the scope of this thesis and rather touches on topics of sociology.
1.2 Problem Definition
The comparison of Wikipedia articles that describe the same entity in different languages can be split into two tasks: the first task refers solely to the texts of the articles and takes place at the sentence and paragraph level. Here, the goal is to link similar text parts. The second task is carried out at the article level and takes additional information into account, for example the authors and the external links mentioned in footnotes.
Text Comparison
The text comparison is carried out to obtain precise information about how similar the texts are and where their similarities and differences lie. Figure 1.1 shows what the text comparison should result in (with shortened versions of the English and German abstracts of the Wikipedia article about the General Post Office): the English text is shown on the left and the German one on the right. The parts that are identified as similar are linked by green lines. In this example, two subtopics are found that occur similarly in both languages: the first is a general description of the General Post Office, its founding and its establishment as the state postal system and telecommunications carrier. The second common fact concerns the office of Postmaster General created in 1961. The black parts without links contain information that is unique to the respective language.
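The linked text parts in Figure 1.1 arise from aggregating individually aligned sentence pairs into larger units. A strongly simplified sketch of this aggregation idea follows; the greedy adjacency rule and the function name are illustrative assumptions, while the thesis's actual paragraph alignment algorithm (Chapter 7) also handles proximate, non-adjacent sentence pairs.

```python
def merge_adjacent_pairs(pairs):
    """Greedily merge aligned sentence pairs (i, j) into paragraph-level
    alignments whenever consecutive pairs are adjacent in both articles."""
    paragraphs = []  # each entry: (sentence indices in article A, in article B)
    for i, j in sorted(pairs):
        if (paragraphs
                and i - paragraphs[-1][0][-1] == 1
                and j - paragraphs[-1][1][-1] == 1):
            paragraphs[-1][0].append(i)
            paragraphs[-1][1].append(j)
        else:
            paragraphs.append(([i], [j]))
    return paragraphs
```

For example, the aligned pairs (0, 0), (1, 1) and (3, 5) would yield one two-sentence paragraph alignment plus one single-sentence alignment, while unaligned sentences remain as the language-specific (black) parts of the comparison.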
5As shown in [Digital Methods], this does not hold completely, as some authors contribute to multiple Wikipedia editions.
6http://en.wikipedia.org/wiki/Wikipedia:Statistics