Is Simple Wikipedia Simple? – a Study of Readability and Guidelines

Linköping University | Department of Computer and Information Science
Bachelor thesis, 18 ECTS | Cognitive science
2018 | LIU-IDA/KOGVET-G--18/029--SE

Fabian Isaksson
Supervisor: Arne Jönsson
Examiner: Henrik Danielsson

Linköpings universitet, SE–581 83 Linköping, +46 13 28 10 00, www.liu.se

Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years starting from the date of publication, barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission.
All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its home page: http://www.ep.liu.se/.

© Fabian Isaksson

Abstract

Creating easy-to-read text is a problem that has traditionally been solved with manual work. With advancing research in natural language processing, however, automatic systems for text simplification are being developed. These systems often need training data that is parallel aligned. For several years, simple Wikipedia has been the main source of this data. In the current study, several readability measures have been tested on a popular simplification corpus. A selection of guidelines from simple Wikipedia has also been operationalized and tested. The results imply that the guidelines are not followed to a greater extent in simple Wikipedia than in standard Wikipedia. There are, however, differences in the readability measures. The syntactic structures of simple Wikipedia seem to be less complex than those of standard Wikipedia. A continuation of this study would be to examine other readability measures and evaluate the guidelines not covered within the current work.

Keywords: corpus, readability, Wikipedia, automatic text simplification

Acknowledgments

There are no words that can describe my gratitude for the endless support from my best friend and partner Nora, and for the selfless dedication to my happiness from my mother Elisabeth. Without you this paper would not exist.
Contents

Abstract
Acknowledgments
Contents
List of Tables
1 Introduction
    1.1 Aim
    1.2 Research questions
    1.3 Delimitations
    1.4 Structure
2 Theory
    2.1 Concerning the authorship of simple Wikipedia
    2.2 The basis for text simplification
        2.2.1 Measuring Readability
    2.3 Material
        2.3.1 Dataset
        2.3.2 Preprocessing tool
        2.3.3 The architecture of spaCy
3 Method
    3.1 Surface features
    3.2 Dependency based Features
    3.3 Operationalization of Simple Wikipedia Guidelines
    3.4 Statistical analysis
4 Results
    4.1 Surface features and word list comparison
    4.2 Dependency based Features
    4.3 Features from Simple Wikipedia Guidelines
5 Discussion
    5.1 Results
        5.1.1 Sentence complexity
        5.1.2 Were the guidelines followed?
    5.2 Method
        5.2.1 Parallelism matters
        5.2.2 Concerning the parser
        5.2.3 Studying the guidelines
        5.2.4 Social aspects
6 Conclusion
Bibliography

List of Tables

4.1 Surface features and Word list comparison
4.2 Dependency based features
4.3 Statistics pertaining to simple wiki Guidelines

1 Introduction

Linguists were working on ways to concretely assess the readability of text well before the dawn of computer science. Early approaches were aimed at determining the years of formal education needed to understand a text, with the practical motivation of providing children with reading material appropriate to their language abilities. In contemporary readability research, both the motivations and the ambitions have expanded. Natural language processing has allowed for more sophisticated ways of measuring text complexity. With statistical machine learning we can now create tools that automatically distinguish easy-to-read language from other language. Using that knowledge to let a computer simplify texts is known as automatic text simplification.
Several groups of our population are in need of easy-to-read material, including second language learners, aphasic patients and children. Since manual simplification is a tedious and time-consuming task, interest in automating the process is growing. With data-driven machine learning, new systems can automatically learn how to simplify texts. But in order to develop such systems, a large amount of simple language data is needed.

Simple Wikipedia is a large collection of encyclopaedic articles aimed at people learning English. There is research suggesting that these articles can be said to be lower in reading difficulty (Yasseri, Kornai, and Kertész 2012) (Kauchak 2013) (Hwang, Hajishirzi, Ostendorf, and Wu 2015). For several years the Simple English Wikipedia corpus (Kauchak and Coster 2011) has been a primary source of data in automatic text simplification (ATS) (Napoles and Dredze 2010). It has been used as a resource for creating language models, developing classifiers and studying readability. The corpus consists of parallel aligned sentences from the standard part of Wikipedia (standard Wikipedia) and simple English Wikipedia (simple Wikipedia).

In creating parallel aligned corpora from unaligned texts, a lot of text from the source material is left out. The results of readability studies on the whole of standard and simple Wikipedia therefore do not necessarily apply to any subsection of the data. It would be useful to see whether the aligned corpus follows the same patterns as have been found in the whole of simple Wikipedia. At this point in time there is also a lack of descriptive information about the readability differences in the corpus, and no studies have been made of whether authors follow the simple Wikipedia guidelines for writing. These points of concern are what will be addressed in this study.

1.1 Aim

The aim of this study is to evaluate the language complexity differences between simple and standard Wikipedia.
This study also aims to test the extent to which the guidelines for simple English Wikipedia have been followed. The analysis will specifically pertain to the corpus prepared by Kauchak (Kauchak and Coster 2011). This means that only a subset of all available articles, namely those that have been sentence-aligned with standard Wikipedia, will be the subject of this study. The results found here will hopefully provide insights that can be used in developing new text simplification systems based on the corpus.

1.2 Research questions

There are two research questions that I intend to answer in this thesis:

1. Are there surface level or dependency based features that indicate differences in readability between parallel aligned sentences from simple Wikipedia and standard Wikipedia?
2. Do these parallel aligned sentences follow the guidelines for simple English Wikipedia to the same extent?

1.3 Delimitations

Not all Simple Wikipedia guidelines have been operationalized and tested. Constituency based readability features have not been used.

1.4 Structure

This thesis consists of six chapters. Chapter 2 gives an account of previous research on text simplification and presents the guidelines, the creation of the corpus and the NLP tools used in this project. Chapter 3 contains motivations for and descriptions of the readability measures and guidelines that have been used to analyse the corpus. In chapter 4, the results of the measures are presented. Chapter 5 discusses how to interpret the results and considers methodological alternatives. Chapter 6 presents the conclusions drawn from the study and propositions for future research.

2 Theory

Simple English Wikipedia is a collection of simplified encyclopaedic articles available for free online. The database contains 133,641 articles at the time of writing (Wikipedia 2018a), most of which have corresponding standard articles on the same subjects. This is a resource created to share knowledge.
More specifically, it is meant to offer an easy-to-read alternative to standard Wikipedia for children and adults who are learning English. Even though this resource is unique in many ways, addressing the need for simple language is not a novel concept. Before Wikipedia was launched, guidelines for writing plain English were already being developed. These guidelines specify how to limit the vocabulary, change the syntactic structure of sentences, structure a text in logical order and prioritize the core information needed for comprehension (PLAIN 2011).
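
As a minimal sketch of how a guideline such as limiting the vocabulary could be turned into a measurable feature, the Python example below computes the share of words in a sentence that occur in a basic word list, together with two simple surface features. It is not the operationalization used in this study: the word list BASIC_WORDS, the function names and the two example sentences are hypothetical illustrations and are not taken from the corpus or from any official word list.

```python
import re

# Hypothetical stand-in for a basic vocabulary list (an assumption made
# for illustration only; not a word list used in the thesis).
BASIC_WORDS = {"the", "a", "is", "was", "in", "of", "and",
               "cat", "sat", "on", "mat"}


def tokenize(sentence):
    """Lowercase the sentence and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def surface_features(sentence, basic_words=BASIC_WORDS):
    """Return word count, mean word length and basic-word ratio."""
    words = tokenize(sentence)
    if not words:
        # Avoid division by zero for empty or non-alphabetic input.
        return {"n_words": 0, "mean_word_length": 0.0,
                "basic_word_ratio": 0.0}
    n_basic = sum(1 for w in words if w in basic_words)
    return {
        "n_words": len(words),
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "basic_word_ratio": n_basic / len(words),
    }


# Hypothetical sentence pair standing in for one aligned pair.
standard = "The feline reposed upon the mat."
simple = "The cat sat on the mat."

for label, sentence in (("standard", standard), ("simple", simple)):
    print(label, surface_features(sentence))
```

A higher basic-word ratio and a shorter mean word length would be read as signs of a more restricted vocabulary, although a regular-expression tokenizer like this one is much cruder than a full NLP pipeline.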
