Language-Independent Methods for Identifying Cross-Lingual Similarity in Wikipedia

Language-Independent Methods for Identifying Cross-Lingual Similarity in Wikipedia Monica Lestari Paramita Information School University of Sheffield A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy February 2019 Declaration I hereby declare that this thesis is a presentation of my original research work. The contents of this thesis have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other university. Wherever contributions of others are involved, every effort is made to indicate this clearly, with due reference to the literature, and acknowledgement of collaborative research and discussions. A number of the chapters in this thesis have been published in academic papers or journals; these publications are specifically indicated at the end of each relevant chapter. Monica Lestari Paramita February 2019 Abstract The diversity and richness of multilingual information available in Wikipedia have in- creased its significance as a language resource. The information extracted from Wikipedia has been utilised for many tasks, such as Statistical Machine Translation (SMT) and sup- porting multilingual information access. These tasks often rely on gathering data from articles that describe the same topic in different languages with the assumption that the contents are equivalent to each other. However, studies have shown that this might not be the case. Given the scale and use of Wikipedia, there is a need to develop an approach to measure cross-lingual similarity across Wikipedia. Many existing similarity measures, however, require the availability of ‘language-dependent’ resources, such as dictionaries or Machine Translation (MT) systems, to translate documents into the same language prior to comparison. This presents some challenges for some language pairs, particu- larly those involving ‘under-resourced’ languages where the required linguistic resources are not widely available. This study aims to present a solution to this problem by first, investigating cross-lingual similarity in Wikipedia, and secondly, developing ‘language- independent’ approaches to measure cross-lingual similarity in Wikipedia. Two main contributions were provided in this work to identify cross-lingual similarity in Wikipedia. The first key contribution of this work is the development of a Wikipedia similarity corpus to understand the similarity characteristics of Wikipedia articles and to evaluate and compare various approaches for measuring cross-lingual similarity. The author elicited manual judgments from people with the appropriate language skills to assess similarities between a set of 800 pairs of interlanguage-linked articles. This corpus vi contains Wikipedia articles for eight language pairs (all pairs involving English and in- cluding well-resourced and under-resourced languages) of varying degrees of similarity. The second contribution of this work is the development of language-independent approaches to measure cross-lingual similarity in Wikipedia. The author investigated the utility of a number of “lightweight” language-independent features in four different experiments. The first experiment investigated the use of Wikipedia links to identify and align similar sentences, prior to aggregating the scores of the aligned sentences to rep- resent the similarity of the document pair. The second experiment investigated the use- fulness of content similarity features (such as char-n-gram overlap, links overlap, word overlap and word length ratio). The third experiment focused on analysing the use of structure similarity features (such as the ratio of section length, and similarity between the section headings). And finally, the fourth experiment investigates a combination of these features in a classification and a regression approach. Most of these features are language-independent whilst others utilised freely available resources (Wikipedia and Wiktionary) to assist in identifying overlapping information across languages. The approaches proposed are lightweight and can be applied to any languages written in Latin script; non-Latin script languages need to be transliter- ated prior to using these approaches. The performances of these approaches were eval- uated against the human judgments in the similarity corpus. Overall, the proposed language-independent approaches achieved promising results. The best performance is achieved with the combination of all features in a classification and a regression approach. The results show that the Random Forest classifier was able to classify 81.38% document pairs correctly (F1 score=0.79) in a binary classification problem, 50.88% document pairs correctly (F1 score=0.71) in a 5-class classification problem, and RMSE of 0.73 in a regression approach. These results are significantly higher com- pared to a classifier utilising machine translation and cosine similarity of the tf-idf scores. These findings showed that language-independent approaches can be used to measure cross-lingual similarity between Wikipedia articles. Future work is needed to evaluate these approaches in more languages and to incorporate more features. Acknowledgements Firstly, I would like to thank my first supervisor, Professor Paul Clough, for his expertise and continuous support throughout my entire study. Thank you for always finding time to provide valuable feedback on my research and for encouraging me to be a better re- searcher. I would also like to thank my second supervisor, Professor Rob Gaizauskas, for providing insightful comments on my study. My PhD was partially funded by the ACCURAT project supported by the European Commission and I gratefully acknowledge this support. Thank you also to my colleagues in the ACCURAT project who have participated in the evaluation task. To my colleagues, Ahmet, Emina, Emma, Mengdie, Paula, Sophie, Alberto, Nikos, Si- mon, and all the members of the IR and NLP group in the University of Sheffield, and to my wonderful friends from the writing club, Akanimo and Anna, thank you very much for the motivation, encouragement, peer support and all the feedback you have given me. I would also like to thank Kay Guccione, Sarah Barnes (my career mentor) and, especially Hannah Roddie (my thesis mentor), whose advice and encouragement have built up my confidence to finish my thesis. Finally, my profound gratitude goes to my family and friends for their endless support and motivation during my years of study. To my wonderful husband, Paul, who has never stopped believing in me. To my son, Oliver, who always gives me a reason to laugh and be grateful every day, and to my baby bump, whose kicks have kept me accompanied (and awake) during the late night writing sessions. Finally, to my wonderful parents and parents in law who have loved, supported and encouraged me throughout this process. This thesis would not have been possible without their support. Table of contents Table of contents ix List of figures xvii List of tables xxi 1 Introduction1 1.1 Motivation . .1 1.2 Research questions and objectives . .5 1.2.1 Scope of the research . .6 1.3 Main contributions to the research field . .7 1.4 Thesis outline . .8 1.5 Structure of the work . 10 2 Related Work 13 2.1 Measuring similarity . 14 2.2 Defining the concept of similarity . 15 2.3 Identifying monolingual similarity . 23 2.3.1 Lexical similarity . 24 2.3.2 Semantic similarity . 28 2.4 Methods to measure cross-lingual similarity . 34 2.4.1 Language-dependent methods . 35 2.4.2 Language-independent approaches . 43 x Table of contents 2.5 Wikipedia as a linguistic resource . 52 2.5.1 Extraction of similar texts . 52 2.5.2 Assistance for other tasks . 54 2.6 Identifying similarity (or dissimilarity) in Wikipedia . 56 2.6.1 Evaluating similarity in interlanguage-linked articles . 57 2.6.2 Improving similarity in Wikipedia . 59 2.7 Summary........................................ 60 3 Methodology 61 3.1 Overview of methodology . 61 3.2 Initial study on investigating similarity in Wikipedia . 64 3.3 Gathering human judgments on similarity . 65 3.3.1 Selection of languages . 65 3.3.2 Selecting evaluation documents . 66 3.3.3 Assessors . 68 3.3.4 Evaluation tasks . 69 3.4 Development of approaches to measure similarity in Wikipedia . 71 3.4.1 Resources . 71 3.4.2 Language-dependent baseline . 72 3.5 Evaluation . 73 3.6 Wikipedia corpus . 74 3.6.1 Pre-processing of Wikipedia corpus . 74 3.6.2 Corpus statistics . 80 4 Identifying Similarity Features in Wikipedia 85 4.1 Background . 85 4.2 Similarity analysis method . 87 4.3 Dataset . 88 4.4 Initial findings . 89 4.4.1 Similarity between structures . 89 Table of contents xi 4.4.2 Similarity between paragraphs . 93 4.4.3 Similarity between sentences . 93 4.4.4 Relations between quality of articles, word length, and the similarity degree . 99 4.5 Conclusion . 100 5 Evaluation Corpus 103 5.1 Background . 103 5.2 Creating evaluation tasks . 104 5.2.1 Pilot evaluation task . 105 5.2.2 Main evaluation task . 108 5.3 Selecting evaluation documents . 109 5.4 Assessors . 112 5.5 Annotation results . 112 5.5.1 Q1 Results . 113 5.5.2 Q1-Reasons Results . 116 5.5.3 Q2 Results . 122 5.5.4 Q3 Results . 125 5.5.5 Q4 Results . 129 5.5.6 Comparison between evaluation questions . 132 5.6 Discussion . 135 5.6.1 Analysis of disagreements . 135 5.6.2 Relations between ‘similarity’ and ‘comparability’ . 136 5.6.3 Limitations . 137 5.7 Conclusion . 138 6 Anchor Text and Word Overlap Method 141 6.1 Background . 141 6.2 Method . 143 6.2.1 Pre-process documents . 145 xii Table of contents 6.2.2 Create a bilingual lexicon . 145 6.2.3 Anchor text translation . 146 6.2.4 Similarity calculation . 147 6.3 Experiments...................................... 151 6.3.1 Language selection . 151 6.3.2 Corpus . 151 6.3.3 Evaluation . 152 6.4 Results ......................................... 153 6.4.1 Correlation between the automatic methods .

Language-Independent Methods for Identifying Cross-Lingual Similarity in Wikipedia

Latvian Sportspeople Representation in English and Latvian Wikipedias

Universality, Similarity, and Translation in the Wikipedia Inter-Language Link Network

A Topic-Aligned Multilingual Corpus of Wikipedia Articles for Studying Information Asymmetry in Low Resource Languages

D2.8 GLAM-Wiki Collaboration Progress Report 2

Modeling Communities in Information Systems: Informal Learning Communities in Social Media"

Same Gap, Different Experiences

Genre Analysis of Online Encyclopedias. the Case of Wikipedia

Critical Point of View: a Wikipedia Reader

The Case of 13 Wikipedia Instances

A Wikipedia Reader

34 Understanding Wikipedia Practices Through Hindi, Urdu, and English

LASE JOURNAL of SPORT SCIENCE Is a Scientific Journal Published Two Times Per Year in Sport Science LASE Journal for Sport Scientists and Sport Experts/Specialists