Analysing Language-Specific Differences in Multilingual Wikipedia

ANALYSING LANGUAGE-SPECIFIC DIFFERENCES IN MULTILINGUAL WIKIPEDIA

Faculty of Electrical Engineering and Computer Science, Gottfried Wilhelm Leibniz Universität Hannover
Thesis submitted for the degree of Master of Science (M. Sc.)
by Simon Gottschalk

First examiner: Prof. Dr. techn. Wolfgang Nejdl
Second examiner: Prof. Dr. Robert Jäschke
Supervisor: Dr. Elena Demidova

2015

ABSTRACT

Wikipedia is a free encyclopedia with editions in more than 280 languages. Although articles referring to the same entity often co-exist in many Wikipedia language editions, they evolve independently and often contain complementary information or represent a community-specific point of view on the entity under consideration. In this thesis we analyse features that make it possible to uncover such edition-specific aspects within Wikipedia articles, in order to provide users with an overview of the overlapping and complementary information available for an entity in different language editions.

We compare Wikipedia articles at different levels of granularity. First, we identify similar sentences. Then, these sentences are merged to align similar paragraphs. Finally, a similarity score at the article level is computed. To align sentences, we employ syntactic and semantic features including cosine similarity, links to other Wikipedia articles and time expressions. We evaluated the sentence alignment function on a dataset containing 1155 sentence pairs, extracted from 59 articles in the German and English Wikipedia, that had been annotated during a user study. Our evaluation results demonstrated that the inclusion of semantic features can improve the break-even point from 70.95% to 77.52% on this dataset.

Given the sentence alignment function, we developed an algorithm that builds similar paragraphs starting from the sentences that have been aligned before.
We implemented a visualisation of the algorithm results that gives users an overview of the similarities and differences between the articles. It shows the paragraphs aligned by the proposed algorithm alongside the remaining paragraphs, whose contents are unique to an article in a specific language edition. To further support this comparison, we defined an overall article similarity score and applied it to illustrate temporal differences between article editions. Finally, we created a Web-based application that presents our results and visualises all the aspects described above.

In future work, the algorithms developed in this thesis can be applied directly to help Wikipedia authors by providing an overview of how an entity is represented across Wikipedia language editions. They can also form a basis for cultural research aimed at a better understanding of the language-specific similarities and differences in multilingual Wikipedia.

Contents

Table of Contents
List of Figures
List of Tables
List of Algorithms

1 Introduction
    1.1 Motivation
    1.2 Problem Definition
    1.3 Overview
2 Background on Multilingual Wikipedia
    2.1 Overview
    2.2 Wikipedia Guidelines
        2.2.1 Translations
        2.2.2 Neutrality
    2.3 Linguistic Point of View
    2.4 Reasons for Multilingual Differences
    2.5 Wikipedia Studies
3 Background on Multilingual Text Processing
    3.1 NLP for Multilingual Text
        3.1.1 Machine Translation
        3.1.2 Textual Features
        3.1.3 Topic Extraction
        3.1.4 Sentence Splitting
        3.1.5 Other NLP techniques
    3.2 Aligning Multilingual Text
        3.2.1 Comparable Corpora
        3.2.2 Plagiarism Detection in Multilingual Text
4 Approach Overview
5 Feature Selection and Extraction
    5.1 Syntactic Features
    5.2 Evaluation on Sentence Similarity of Parallel Corpus
    5.3 Semantic Features
    5.4 Evaluation of Entity Extraction Tools
        5.4.1 Aim and NER tools
        5.4.2 Data
        5.4.3 Entity Extraction and Comparison
        5.4.4 Comparison
        5.4.5 Results
6 Sentence Alignment and Evaluation
    6.1 Data
    6.2 Pre-Selection of Sentence Pairs
    6.3 Selection of Sentence Pairs for Evaluation
    6.4 User Study
    6.5 Judgement of Similarity Measures
    6.6 Second Dataset
    6.7 Pre-Selection and Creation of Similarity Function
    6.8 Results
7 Paragraph Alignment and Article Comparison
    7.1 Finding Similar Paragraphs
        7.1.1 Aggregation of Neighboured Sentences
        7.1.2 Aggregation of Proximate Sentence Pairs
        7.1.3 Paragraph Aligning Algorithm
    7.2 Similarity on Article Level
        7.2.1 Text Similarity
        7.2.2 Feature Similarity
        7.2.3 Overall Similarity
    7.3 Visualisation
8 Implementation
    8.1 Data Model
    8.2 Comparison Extracting
    8.3 Preprocessing Pipeline
    8.4 Text Parsing
    8.5 Resources
9 Discussion and Future Work
    9.1 Discussion
    9.2 Future Research Directions
Bibliography

List of Figures

1.1 Text Comparison Example
2.1 English Wikipedia Article "Großer Wannsee"
2.2 Interlanguage links for the English article "Pfaueninsel"
3.1 First Paragraphs of the Wikipedia Article "Berlin"
4.1 Process of Article Comparison
5.1 Precision Recall Graphs for Textual Features with Break-Even Points
5.2 Box Plots for Textual Features
6.1 Screenshot of User Study on Similar Sentences
6.2 Correlation of Syntactic Features for First Data Set
6.3 Correlation of Text Length Similarity
6.4 Correlation of External Links Similarity
6.5 Correlation of Time and Entity Similarity for the first Dataset
6.6 Iteration to Create Similarity Functions
6.7 Precision-recall Diagram of Sentences with Overlapping Facts
6.8 Precision-recall Diagram of Sentences with the Same Facts
6.9 Precision-recall Diagram of Sentences with the Same Facts (Adjusted Similarity Functions)
7.1 Paragraph Construction Example (Step 1)
7.2 Paragraph Construction Example (Steps 2 and 3)
7.3 Paragraph Construction Example (Steps 4 and 5)
7.4 Comparison of the English and German article on "Knipp"
7.5 Website Example: Text
7.6 Website Example: Links
7.7 Website Example: Images
7.8 Website Example: Authors
7.9 Website Example: Overall Similarity
8.1 Data Model
8.2 Preprocessing Pipeline

List of Tables

2.1 Statistics on Wikipedias in Different Languages
3.1 Machine Translation Example
5.1 Example Sentence Pairs for Time Similarity
5.2 Statistics of the N3 Dataset
5.3 Number of Entities Extracted from English Texts
5.4 Number of Entities Extracted from German Texts
5.5 Results of Entity Extraction
6.1 Wikipedia Articles Used in the User Study
6.2 Feature Combination Distribution in 14 Wikipedia Articles
6.3 Weights of Similarity Functions for Pre-Selection
6.4 Feature combination distribution in pre-selected Sentence Pairs
6.5 Feature Distribution in the Dataset for the First Round of Evaluation
6.6 Correlation Coefficients for Similarity Measures
6.7 Dataset Evaluated in the Second Round
6.8 Retrieved Sentence Pairs per Article Pair
7.1 Composition of Overall Similarity
7.2 60 Wikipedia Article Pairs Ordered by Overall Similarity
8.1 Example of Revisions of an Article in Different Languages
8.2 Example of Revision Triples

List of Algorithms

5.1 Computation of TP, FP and FN for the Evaluation of Entity Extraction
6.1 Identification of Candidates for Similar Sentences
7.1 Extension of Sentence Pairs with Neighbours
7.2 Extension of a Sentence with its Neighbours
7.3 Aggregation of Sentence Pairs
7.4 Paragraph Alignment

1 Introduction

Wikipedia is a user-generated
