Helper Scripts for Wikipedia Editors
Masaryk University
Faculty of Informatics

Helper scripts for Wikipedia editors

Bachelor's Thesis

Šimon Baláž

Brno, Spring 2017

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Šimon Baláž

Advisor: Mgr. et Mgr. Vít Baisa, Ph.D.

Acknowledgement

I would like to thank my advisor, Mgr. et Mgr. Vít Baisa, Ph.D., for his guidance, patience and helpful advice during the work on this thesis.

Abstract

This thesis focuses on the creation of several scripts that aim to support the development of Wikipedia and to facilitate the work of its editors. The scripts are designed to search for articles based on specified conditions, create various statistics, compare articles across language versions, validate articles and check the hierarchy of categories. An important characteristic of these scripts is that they do not change the articles or categories they work with; instead, depending on pre-defined conditions, they process the necessary data, which is then displayed to the user. Due to the size of Wikipedia, it is possible to specify which articles are to be processed and to limit the number of articles and results. All scripts should work on any language version of Wikipedia, even though their design has focused mainly on the English version.

Keywords

Python, Wikipedia, Pywikibot, script

Contents

1 Introduction
2 Wikipedia, Scripts and Pywikibot
  2.1 Wikipedia
    2.1.1 Namespaces
    2.1.2 Wikipedia articles
    2.1.3 Multilingual Wikipedia
  2.2 Automated tools
    2.2.1 MediaWiki API
  2.3 Pywikibot
    2.3.1 Page module
    2.3.2 Page generators
    2.3.3 Other modules
3 Created scripts
  3.1 Similarities between scripts
  3.2 Multilingual scripts
    3.2.1 proposetranslation.py
    3.2.2 translations.py
    3.2.3 interlangimages.py
  3.3 Counting scripts
    3.3.1 proposepages.py
    3.3.2 editvolume.py
  3.4 Category scripts
    3.4.1 categorytreedepth.py
    3.4.2 categorytreepages.py
    3.4.3 categorytreeloops.py
  3.5 Remaining scripts
    3.5.1 proposeimages.py
    3.5.2 lastedits.py
4 Further improvements and ideas
  4.1 Improvements and changes
    4.1.1 Multilingual scripts
    4.1.2 Category scripts
    4.1.3 Other scripts
  4.2 Script suggestions
    4.2.1 Scripts to detect incorrect information
    4.2.2 Script to check references
5 Conclusion

List of Tables

3.1 A proposetranslation.py example
3.2 A translations.py example
3.3 An interlangimages.py example
3.4 A proposepages.py example
3.5 An editvolume.py example
3.6 A categorytreedepth.py example
3.7 A categorytreepages.py example
3.8 A categorytreeloops.py example
3.9 A proposeimages.py example
3.10 A lastedits.py example

1 Introduction

Since its creation, Wikipedia has greatly expanded in size and popularity. Its largest edition, the English one, now consists of over 5 million articles [1], which makes it impossible to maintain manually, even though it has tens of thousands of editors. This led to the development of tools that simplify or automate various tasks, allowing editors to focus on more complicated problems.

As Wikipedia developed, so did the tools that work on it, from simple tools that solved only trivial problems to robots capable of almost completely replacing humans.
Various libraries for programming languages were also created to make the development of tools for Wikipedia easier [2]. A large part of these tools consists of scripts, simple programs for simple tasks, written in various programming languages. This thesis deals with the creation of such scripts, developed in the Python programming language, whose purpose is to provide useful information and statistics without modifying the pages they work with.

A Python library called Pywikibot simplifies the implementation of the scripts. Although this library was created to facilitate the development of automated tools for Wikipedia, it also has some disadvantages that limit its use.

The first part of the thesis is dedicated to Wikipedia and some of its relevant characteristics. It explains how automated tools can make working on Wikipedia easier and describes the different types of these tools. Pywikibot is also described in detail: its advantages and disadvantages are mentioned and some of its parts are explained.

The creation of the scripts and their role is described in the second part of the thesis. It explains in detail how the scripts work, which Wikipedia features they use, and the most common examples of their usage. Because some scripts are similar to each other, they are divided into several groups for clarity. Each group contains a general summary of the shared characteristics and a description of the individual scripts. Because the scripts differ in complexity, not all of them are described in the same level of detail.

The final part of the thesis summarizes the created scripts, particularly in terms of their practical usability, and evaluates the possibilities for their further improvement. It also addresses problems that became apparent during their implementation.
Some other options for modifying the existing scripts, as well as ideas for new scripts, are explained in more detail.

Wikipedia is constantly changing, and with it the tasks that need to be carried out. Some of these tasks are handled by people, others by various programs, but the continuous growth of the encyclopedia requires their cooperation. Just as new people join the development while others leave, the tools that work on Wikipedia also change regularly. As a result, this topic remains current, and newer and better tools that improve Wikipedia continue to be developed, contributing to the spread of human knowledge.

2 Wikipedia, Scripts and Pywikibot

2.1 Wikipedia

Wikipedia is a free online encyclopedia that serves as a source of a large amount of information for anyone with an Internet connection. It was launched in 2001 and has since grown to be the biggest encyclopedia on the Internet, with over 40 million articles in more than 250 different languages.

Wikipedia allows anonymous editing of almost all of its content, with the exception of a few protected pages, which can be edited only by certain users. This has led to the creation of an entire community of people who work on Wikipedia, and because of its non-profit nature, all of them are doing it for their own enjoyment. However, even though there are no firm rules and anyone can work with articles, there are still some editorial principles, basic guidelines and policies that editors should adhere to. The key principles are embodied in the five pillars [3], which describe how to write articles, respect copyright law and communicate with other editors. It should be noted that Wikipedia strives to be as objective and truthful as possible, and any unconfirmed, subjective or false information is regarded as vandalism, which may decrease the respectability of Wikipedia as a source of knowledge.
Due to this, pages on Wikipedia ought to be regularly checked by multiple people to detect vandalism.

2.1.1 Namespaces

All pages on Wikipedia are divided into several groups. These groups, called namespaces, divide pages according to their purpose and allow better management of Wikipedia [4]. Each namespace, except for the main one, contains pages whose names begin with a prefix that indicates their purpose. For example, a page that holds media content, such as an image, video or audio, is called a File page; it belongs to the File namespace and its name begins with the prefix File:.

Almost every namespace containing pages has an equivalent namespace containing talk pages. These pages are used by editors to discuss improvements to the normal pages. The discussion can be shown by clicking on the Talk tab at the top of the page. The two namespaces without talk pages are virtual namespaces, which do not contain pages stored in the database. The first of these is called Special, and it consists of various reports, lists, logs and tools. The second virtual namespace is called Media, and it allows linking directly to a file instead of to its file page.

The remaining 16 namespaces, called subject namespaces, contain pages, and their 16 equivalents, called talk namespaces, contain talk pages. Each of these namespaces has its own function, reflected in its pages. For example, the User namespace holds pages created by users for their own personal use, the Template namespace contains templates, the Category namespace contains categories, etc. Each namespace is also represented by a specific number. Subject namespaces are represented by even numbers, talk namespaces by odd numbers and virtual namespaces by negative numbers.

2.1.2 Wikipedia articles

The most important part of Wikipedia is its articles [5]. They are often the first source of information people on the Internet use when they need to look something up.
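
As a brief aside on Section 2.1.1, the numbering convention for namespaces (even numbers for subject namespaces, odd numbers for talk namespaces, negative numbers for virtual ones) can be illustrated with a minimal Python sketch. The concrete numbers below follow MediaWiki's default namespace numbering and are not listed in the text above; the names NAMESPACES, namespace_kind, talk_namespace and full_title are hypothetical helpers introduced only for this illustration and are not part of Pywikibot or of the thesis scripts.

```python
# Illustrative subset of MediaWiki's default namespace numbers.
# Assumption: these values come from the standard MediaWiki numbering,
# not from the thesis text itself.
NAMESPACES = {
    -2: "Media",
    -1: "Special",
    0: "",  # main namespace: article titles carry no prefix
    1: "Talk",
    2: "User",
    3: "User talk",
    6: "File",
    7: "File talk",
    10: "Template",
    11: "Template talk",
    14: "Category",
    15: "Category talk",
}


def namespace_kind(number):
    """Classify a namespace number as 'virtual', 'subject' or 'talk'."""
    if number < 0:
        return "virtual"
    return "subject" if number % 2 == 0 else "talk"


def talk_namespace(number):
    """Return the number of the talk namespace paired with a subject namespace."""
    if namespace_kind(number) != "subject":
        raise ValueError("only subject namespaces have a talk counterpart")
    return number + 1


def full_title(number, title):
    """Build a full page title by prepending the namespace prefix, if any."""
    prefix = NAMESPACES[number]
    return f"{prefix}:{title}" if prefix else title


if __name__ == "__main__":
    print(namespace_kind(14))            # kind of the Category namespace
    print(talk_namespace(6))             # talk counterpart of the File namespace
    print(full_title(6, "Example.jpg"))  # title of a File page
```

Under these assumptions, the paired talk namespace is always the subject namespace's number plus one, which is why a simple parity check is enough to tell the two kinds apart.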