Masaryk University Faculty of Informatics

Helper scripts for Wikipedia editors

Bachelor’s Thesis

Šimon Baláž

Brno, Spring 2017


Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Šimon Baláž

Advisor: Mgr. et Mgr. Vít Baisa, Ph.D.


Acknowledgement

I would like to thank my advisor, Mgr. et Mgr. Vít Baisa, Ph.D. for his guidance, patience and helpful advice during the solution of this thesis.

Abstract

This thesis is focused on the creation of several scripts that aim to enhance the development of Wikipedia and to facilitate the work of its editors. These scripts are designed to search for articles based on specified conditions, create various statistics, compare articles in different language versions, validate articles and check the hierarchy of categories. An important characteristic of scripts on Wikipedia is that they do not change the articles or categories they work with, but, depending on the pre-defined conditions, they process the necessary data that is later displayed to the user. Due to the size of Wikipedia, it is possible to specify the articles to be worked on and to limit the number of articles and results. All scripts should work on any language version of Wikipedia, even though their design has been focused especially on the English version.

Keywords

Python, Wikipedia, Pywikibot, script


Contents

1 Introduction

2 Wikipedia, Scripts and Pywikibot
   2.1 Wikipedia
      2.1.1 Namespaces
      2.1.2 Wikipedia articles
      2.1.3 Multilingual Wikipedia
   2.2 Automated tools
      2.2.1 MediaWiki API
   2.3 Pywikibot
      2.3.1 Page module
      2.3.2 Page generators
      2.3.3 Other modules

3 Created scripts
   3.1 Similarities between scripts
   3.2 Multilingual scripts
      3.2.1 proposetranslation.py
      3.2.2 translations.py
      3.2.3 interlangimages.py
   3.3 Counting scripts
      3.3.1 proposepages.py
      3.3.2 editvolume.py
   3.4 Category scripts
      3.4.1 categorytreedepth.py
      3.4.2 categorytreepages.py
      3.4.3 categorytreeloops.py
   3.5 Remaining scripts
      3.5.1 proposeimages.py
      3.5.2 lastedits.py

4 Further improvements and ideas
   4.1 Improvements and changes
      4.1.1 Multilingual scripts
      4.1.2 Category scripts
      4.1.3 Other scripts
   4.2 Script suggestions
      4.2.1 Scripts to detect incorrect information
      4.2.2 Script to check references

5 Conclusion

List of Tables

3.1 A proposetranslation.py example
3.2 A translations.py example
3.3 An interlangimages.py example
3.4 A proposepages.py example
3.5 An editvolume.py example
3.6 A categorytreedepth.py example
3.7 A categorytreepages.py example
3.8 A categorytreeloops.py example
3.9 A proposeimages.py example
3.10 A lastedits.py example


1 Introduction

Since its creation, Wikipedia has greatly expanded in size and popularity. Nowadays its biggest, English, version consists of over 5 million articles [1], which makes it impossible to maintain Wikipedia manually, even though it has tens of thousands of editors. This has resulted in the development of tools that make various tasks easier or automate them, allowing editors to focus on more complicated problems.

As Wikipedia developed, so did the tools that work on it: from simple tools that only solved trivial problems to robots that are capable of almost completely replacing humans. Various libraries for programming languages were also created to make the development of tools for Wikipedia easier [2]. A large part of these tools consists of scripts, simple programs for simple tasks, written in various programming languages.

This thesis deals with the creation of such scripts, developed in the Python programming language, whose purpose is to provide useful information and statistics without modifying the pages with which they work. A Python library called Pywikibot simplifies the implementation of the scripts. Although this library was created to facilitate the development of automated tools on Wikipedia, it also has some disadvantages that limit its use.

The first part of the thesis is dedicated to an overview of Wikipedia and some of its relevant characteristics. It includes an explanation of how automated tools can make working on Wikipedia easier and a description of the different types of these tools. Pywikibot is also described in detail, its advantages and disadvantages are mentioned, and some of its parts are explained.

The creation of the scripts and their role is described in the second part of the thesis. There is a detailed explanation of how these scripts work, what Wikipedia features they use, and the most common examples of their usage. Because some scripts are similar to each other, they were divided into several groups for better clarity. Each group contains a general summary of the shared characteristics and a description of the individual scripts. Due to the different complexity of the various scripts, it is important to note that not all of them are described in the same detail.


The final part of the thesis summarizes the created scripts, particularly in terms of their usability in practice, and evaluates the possibilities for their further improvement. It also addresses problems that became apparent during their implementation. Some options for modifying the existing scripts, as well as ideas for new scripts, are explained in more detail.

Wikipedia is constantly changing, and with it change the various tasks that need to be fulfilled. Some of these tasks are addressed by people, others by various programs, but the continuous growth of the encyclopedia requires their cooperation. And just as new people join the development while others leave, so do the tools that work on Wikipedia change regularly. This keeps the issue current and drives the continuous development of newer and better tools that improve Wikipedia, thus contributing to the spreading of human knowledge.

2 Wikipedia, Scripts and Pywikibot

2.1 Wikipedia

Wikipedia is a free online encyclopedia that is the source of a large amount of information for anyone with an Internet connection. It was launched in 2001 and has since grown to be the biggest encyclopedia on the Internet, with over 40 million articles in more than 250 different languages. Wikipedia allows anonymous editing of almost all of its content, with the exception of a few protected pages, which can be edited only by certain users. This has led to the creation of an entire community of people that work on Wikipedia, and because of its non-profit nature, all of them are doing it for their own enjoyment.

However, even though there are no firm rules and anyone can work with articles, there are still some editorial principles, basic guidelines and policies editors should adhere to. The key principles are embodied in five pillars [3] and describe how to write articles, respect copyright laws and communicate with other editors. It should be noted that Wikipedia strives to be as objective and truthful as possible, and any unconfirmed, subjective or false information is vandalism which may decrease the respectability of Wikipedia as a source of knowledge. Due to this, pages on Wikipedia ought to be regularly checked by multiple people to detect vandalism.

2.1.1 Namespaces

All pages on Wikipedia are divided into several groups. These groups, named namespaces, divide pages according to their purpose and allow better management of Wikipedia [4]. Each namespace, except for the main one, contains pages whose names begin with a prefix which indicates their purpose. For example, a page holding some media content, such as an image, video or audio, is called a file page; it belongs to the File namespace and its name begins with the prefix File:.

Almost every namespace containing pages has an equivalent namespace containing talk pages. These pages are used by editors to discuss improvements to normal pages. The discussion can be shown by clicking on the Talk tab at the top of the page. Two namespaces without talk pages are virtual namespaces that do not contain pages from a database. The first of these namespaces is called Special, and it consists of various reports, lists, logs and tools. The second virtual namespace is called Media, and it allows linking directly to a file instead of a file page.

The remaining 16 namespaces, called subject namespaces, contain pages, and their 16 equivalents, called talk namespaces, contain talk pages. Each of these namespaces has its own function, expressed by its pages. For example, the User namespace deals with pages created by users for their own personal use, the Template namespace contains templates, the Category namespace contains categories, etc. Each namespace is also represented by a specific number: subject namespaces by even numbers, talk namespaces by odd numbers and virtual namespaces by negative numbers.
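The numbering convention described above can be sketched in a few lines. The following is an illustrative helper, not part of the thesis's scripts, using a handful of well-known MediaWiki namespace numbers (0 = main, 1 = Talk, 2 = User, 14 = Category, -1 = Special):

```python
def talk_namespace(ns: int) -> int:
    """Return the talk namespace paired with the given namespace."""
    if ns < 0:
        # virtual namespaces (Special, Media) have no talk pages
        raise ValueError("virtual namespaces have no talk pages")
    if ns % 2 == 1:
        return ns       # already a talk namespace
    return ns + 1       # the talk namespace follows its subject namespace

print(talk_namespace(0))   # main -> Talk (1)
print(talk_namespace(2))   # User -> User talk (3)
print(talk_namespace(14))  # Category -> Category talk (15)
```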

2.1.2 Wikipedia articles

The most important aspect of the entire Wikipedia lies in its articles [5]. They are often the first source of information people on the Internet use when they need to look something up. Because of their importance, Wikipedia keeps an overview of every change that has been made to them and saves all of their versions. These changes are documented in a page history, which can be viewed by clicking on the View history tab at the top of the page. This allows editors to see what changes were made to articles and to reverse them in case of vandalism.

MediaWiki has its own markup language, known as wikitext, which is used to format pages [6]. It allows changing the layout of articles, modifying the format of text, creating links and URLs (Uniform Resource Locator), using mathematical or musical notations and so on. This language is converted to HTML (Hypertext Markup Language) and served to web browsers. To see the marked-up text instead of the plain text that normally shows, it is necessary to click on the View source or the Edit tab at the top of the page. This text can be used to calculate the size of articles in bytes.


2.1.3 Multilingual Wikipedia

To this day, official Wikipedia versions have been created for 295 languages, of which the oldest and largest is the English version with more than 5 million articles. Most pages on Wikipedia have links, called interlanguage links, to equivalent pages in other languages [7]. These links are listed in the Languages sidebar, and clicking on one of them will display the page in that language.

Not all articles are of equal importance in all languages, and their content depends on the editors who work with them. For example, Czech Wikipedia does not need a detailed article about some small town in China, even though the article about it might have thousands of words. Each language version has multiple lists1 that contain articles from other versions which need to be translated into it. Articles that are translated from other languages must credit their source and should be checked by a proofreader [8]. Various automated tools are usually used to determine articles that should be translated, compare articles in different languages or detect articles that could be improved with additional information from other languages. However, using unprocessed machine translation in articles is frowned upon, because such translation is of poor quality and Wikipedia considers it worse than nothing.

2.2 Automated tools

A script is a computer program written in a scripting language that automates some task and is easy to use and write [9]. Scripts should follow the KISS (Keep it simple, stupid) principle, which means that a single script should do only one thing, and do it well. For numerous different tasks, multiple scripts should be created instead of one that would carry out all of them. On Wikipedia, scripts are used to enhance or personalize already existing functionality, and due to this, they generally do not require any approval.

Programs that interact with Wikipedia similarly to human editors are called robots, or bots for short [10]. Currently, there are several hundred bots, and more can be created if the need arises. These bots are capable of making edits to pages and changing their content far faster than humans, and after launch they no longer require human attention. Because of this, bots that are incorrectly designed or misused may cause severe damage. For this reason, a policy that covers the operation of all bots on Wikipedia has been created. This policy deals with the usage of bots, the requirements they need to fulfill, their approval and measures in case they malfunction.

Assisted editing deals with the use of tools that assist editors with repetitive tasks, but do not make any changes without human input. These vary in complexity and usability, and may or may not require the creation of a bot. Therefore, some of them may need to be approved before they can be used, while others can be created and used freely. Examples of such tools include those that reverse vandalism, fix broken links and correct typographical or grammatical errors.

1. Articles needing translation can be found here https://en.wikipedia.org/wiki/Category:Articles_needing_translation_from_foreign-language_

2.2.1 MediaWiki API

The MediaWiki API (Application programming interface) is a tool that provides a user with data and meta-data from MediaWiki installations [11], including, but not limited to, Wikipedia, Wiktionary and Wikiquote. This tool can be used for the creation of bots and scripts that can get data and post changes with HTTP (Hypertext Transfer Protocol) requests. It allows, among other things, searching a wiki, changing wiki content and creating queries. The query module is often used by various tools to get information about data stored in a wiki, such as the wikitext of a specified page or a list of pages which meet some condition2. Many libraries for different programming languages, like Pywikibot for Python or .api.js for JavaScript, make use of this API.
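As a rough illustration of the query module mentioned above, the following sketch builds the parameters of an action=query request that fetches the wikitext of a single page. The parameter names follow the documented revisions submodule; the actual HTTP call is left commented out, since it requires network access (e.g. via the requests library):

```python
# Illustrative sketch of a MediaWiki API query; the endpoint is the
# standard api.php of the English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_wikitext_query(title: str) -> dict:
    """Parameters for an action=query request returning page wikitext."""
    return {
        "action": "query",
        "prop": "revisions",   # ask for revision data
        "rvprop": "content",   # ...specifically the page content (wikitext)
        "titles": title,
        "format": "json",
    }

params = build_wikitext_query("Python (programming language)")
# import requests
# data = requests.get(API_ENDPOINT, params=params).json()
```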

2.3 Pywikibot

Pywikibot is a Python library that streamlines work on Wikipedia and other MediaWiki sites [12]. When used with a compatible version of Python, it supports all the current popular operating systems, such as Microsoft Windows or Linux. The original version of Pywikibot, called compat, was created in 2002. In 2007 a new version, called core, was created utilizing the MediaWiki API, and it eventually surpassed the original. Because Pywikibot core makes use of a wrapper function pwb.py, it is possible to launch scripts just by installing Pywikibot and calling this function together with the name of the script and its various parameters3.

Pywikibot can only be used on Wikipedia when it has access to a file named user-config.py, which specifies the name of one or more Wikimedia sites on which scripts will work, the language version of these sites, the username of the bot and, optionally, various encodings [13]. The aforementioned file can be either generated by a special script or created manually and saved in the correct location. It is possible to override the configuration in this file with global arguments, which are by default available to all programs that use Pywikibot.

2. For more detailed information about API queries, see https://www.mediawiki.org/wiki/API:Query
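Based on the description above, a minimal user-config.py might look like the following sketch. The variable names follow the Pywikibot manual; the username is a placeholder:

```python
# user-config.py -- minimal Pywikibot configuration sketch
family = 'wikipedia'   # work on Wikipedia (as opposed to e.g. Wiktionary)
mylang = 'en'          # English language version
usernames['wikipedia']['en'] = 'ExampleBot'   # placeholder bot account
```

With this file in place, a script can then be launched through the wrapper, for example `python pwb.py scriptname -start:pyth`.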

2.3.1 Page module

The pywikibot.page module is an essential part of Pywikibot that contains features and classes for working with pages, categories, users and files [14]. Pages are represented as a class with multiple methods that allow scripts to find out almost all of the basic page information, including the interlanguage links of a page, its wikitext, revision history or a list of contributors. However, these methods are incapable of providing more complicated functionality, such as parsing of wikitext, by themselves. Because of this, it is often necessary to further process any information gained by them, or to combine multiple methods to get a specific result.

Users, categories and files are similarly represented as their own classes with various methods. There are also classes that hold more detailed information about revisions of pages, links to other pages and sites of MediaWiki, etc. All of this gives programs a way to find almost any information on Wikipedia without a complicated usage of the API.

3. For more details about running scripts, see https://www.mediawiki.org/wiki/Manual:Pywikibot/Installation#Running_a_script


2.3.2 Page generators

The pywikibot.pagegenerators module provides a collection of iterable objects, called page generators [15]. These generators return pages in accordance with predefined conditions and filters. There are multiple pre-existing parameters to specify which pages are to be yielded by a particular generator, including parameters that restrict the work to pages whose title matches a regular expression or that limit the number of pages to be returned.

To make use of the already existing parameters, it is necessary to use a class called GeneratorFactory, which handles command line arguments one by one and gathers the corresponding generators. After all arguments are processed, the accumulated generators are combined into one. If these parameters are not needed or additional functionality is desired, generators can be used in code independently. Various generators have different uses and return different results, such as those that yield new pages, interlanguage pages or unused files.

2.3.3 Other modules

In addition to the modules described above, Pywikibot contains dozens more [16]. Each of them is useful, but not all are necessary. Modules for working with dates, logs or some extensions are only used for specific functionality. The pywikibot.bot module is almost always necessary because it contains functionality related to the user interface. It contains classes and methods for input and output, methods for handling command line arguments, and makes the global parameters available to the user. Another frequently used module is pywikibot.site, which represents MediaWiki sites. It makes it easier to work with namespaces, languages and links to other wikis.

3 Created scripts

The goal of this thesis was the creation of several scripts in the Python 3 programming language that would help editors on Wikipedia and thereby contribute to its development. The final result of the work is ten scripts with different purposes and complexity, but also with certain similarities between some of them. These scripts are designed to create various statistics, compare the content of articles in different language versions, search for articles based on specific conditions, etc. Their main feature is that they do not edit or otherwise change the pages they work with. Scripts in the scope of this work only get data which can be processed further and shown to a user. Even though they were created and tested on the English Wikipedia, they should work correctly on all language versions.

3.1 Similarities between scripts

Although all of the scripts developed for this thesis are different, they share some characteristics. One such characteristic is an option to save all their results into an HTML file in the form of a simple table. Another is that they were all created using Pywikibot. This resulted in the use of page generators by all scripts whose task is to navigate through multiple pages. These scripts make use of the GeneratorFactory class, which gives them the option to specify the pages they want to work on with a set of pre-existing parameters. However, only some of these parameters are supposed to be used, namely:

∙ -start, which specifies that the script should go alphabetically through pages, starting at the stated page;

∙ -newpages, which restricts the script to work on the newest pages;

∙ -uncat, which causes the script to work on uncategorized pages.

It is possible to utilize other parameters, though it should be noted that the scripts which use page generators were created to work with articles. Therefore, it is recommended to use the parameter -namespace:0, which limits the results of generators to the main namespace. It is also worth noting that the scripts do not have a default beginning. The starting point has to be specified, otherwise the scripts will not launch.

A further important characteristic is the automatic exclusion of disambiguation pages in almost every script. These non-article pages resolve cases where one expression can refer to multiple distinct things by listing the various meanings of such an expression and linking to all of them [17]. In other words, the scripts do not work with disambiguation pages because these pages are not articles, even though they are connected to articles and share the main namespace with them. Similarly, redirect pages are also excluded because they only re-route to other pages and generally do not contain any content other than the redirect link itself.
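The shared filtering step can be sketched as follows. FakePage is an illustrative stand-in for Pywikibot's BasePage, which offers the same isRedirectPage() and isDisambig() methods; the filter itself mirrors the exclusion described above:

```python
class FakePage:
    """Minimal stand-in for pywikibot.page.BasePage (illustration only)."""
    def __init__(self, title, redirect=False, disambig=False):
        self.title = title
        self._redirect = redirect
        self._disambig = disambig

    def isRedirectPage(self):
        return self._redirect

    def isDisambig(self):
        return self._disambig

def filter_articles(pages):
    """Yield only real articles, skipping redirects and disambiguations."""
    for page in pages:
        if page.isRedirectPage() or page.isDisambig():
            continue
        yield page

pages = [FakePage("Python"),
         FakePage("Python (disambiguation)", disambig=True),
         FakePage("Pythons", redirect=True)]
print([p.title for p in filter_articles(pages)])  # ['Python']
```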

3.2 Multilingual scripts

Among the scripts created for this thesis are three that work on an interlingual level. Each of them goes through pages on the specified Wikipedia and then tries to load the same pages from other language versions and process them. To do that, they use the Pywikibot method interlanglinks(), located in the pywikibot.page module and belonging to the class BasePage, which represents a single page. This method returns all interlanguage links located on that page.

During the creation of these scripts, it has become apparent that certain features of Wikipedia can cause complications. Pywikibot contains a list of all languages that have a version of Wikipedia, which allows it to recognize and work with pages from these languages. However, the creation of a new Wikipedia version or a change of a language abbreviation causes Pywikibot not to recognize it, and thus fail. To avoid this, the list of languages should be regularly checked and updated.

3.2.1 proposetranslation.py

The main purpose of this script is to browse articles on Wikipedia, compare them to the articles in the language given as a parameter, and detect cases where the difference in size between these articles is too large. This enables editors to find articles whose translation could be used for further improvement of the same articles in a different language. A user of the script can also allow it to list articles which have no translation or articles whose translation is considered sufficient.

For each page from the page generator larger than a certain number of bytes (default value is 10), the script obtains all interlanguage links. The languages of the pages these links point to are compared to the one given as a parameter, and if a match is found, the matching page is loaded. If this page is smaller than the maximum set amount (default 5000 bytes), the script compares the wikitext of the original page with the wikitext of the translated page and, when the difference between them is higher than the specified threshold (default 10000 bytes), returns it. All restrictions on the number of bytes can be changed using the parameters designed for that. An example of use with the parameters -translanguage:cs -limit:50 -start:pyth -save -sufficient can be seen below in the table 3.1.

Original page | Translated page | Description
Pythagorean comma | Pythagorejské koma | Pythagorean comma is significantly larger than Pythagorejské koma
Pythagorean interval | Ditón | Pythagorean interval is significantly larger than Ditón
Pythagorean triple | Pythagorejská trojice | Pythagorean triple is significantly larger than Pythagorejská trojice
Pythia | Pýthia | Pythia is significantly larger than Pýthia

Table 3.1: A proposetranslation.py example
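The size comparison above can be condensed into a single decision function. The following is an illustrative sketch, not the script itself, using the default thresholds named in the text (all values in bytes):

```python
MIN_ORIGINAL = 10        # ignore near-empty original pages
MAX_TRANSLATED = 5000    # translated page must be smaller than this
MIN_DIFFERENCE = 10000   # size gap that triggers a proposal

def propose_translation(original_size: int, translated_size: int) -> bool:
    """True if the original is so much larger that translating it would
    likely improve the smaller article."""
    return (original_size > MIN_ORIGINAL
            and translated_size < MAX_TRANSLATED
            and original_size - translated_size > MIN_DIFFERENCE)

print(propose_translation(20000, 3000))   # True: gap of 17000 bytes
print(propose_translation(20000, 15000))  # False: translation too large
```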


3.2.2 translations.py

This script counts pages that meet language requirements. It can be used in two different variants, or in a combination of them. The first one allows the user to enter a number of languages, and the script will move through pages and count those which have at least that many interlanguage links. The other gives the user an option to specify any number of languages and compels the script to return pages containing interlanguage links to pages in those languages. When combined, the script will check both the number of links and whether they link to the specified languages, and count those pages that satisfy both of these requirements.

Although the script normally only counts pages, it is possible to use the parameter -list, which will make the script also list the pages that are counted. This parameter is necessary in case the user of the script requires the titles of pages to be saved in an HTML file. Another parameter, named -reverse, reverses the checking of pages. In that case, the script will count pages that do not have a translation in the specified languages. It does not reverse the examination of pages according to the number of their links. An example of use with the parameters -save -start:pyth -limit:150 -language:cs -language:sk -list can be seen below in the table 3.2.

Pages that have the specified translations: cs, sk
Pythagoras
Pythagorean theorem
Pythagoreanism
Pythagoreion
Pytheas
Pythia
Pythian Games
Python (programming language)

Table 3.2: A translations.py example
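The two checks described above can be sketched with pages modelled as (title, set-of-language-codes) pairs. This is an illustrative simplification, not the script's actual data model:

```python
def count_translated(pages, min_langs=0, required=()):
    """Titles of pages with at least min_langs interlanguage links and
    links to every language code in `required`."""
    matching = []
    for title, langs in pages:
        if len(langs) < min_langs:
            continue  # too few interlanguage links
        if any(lang not in langs for lang in required):
            continue  # missing one of the required languages
        matching.append(title)
    return matching

pages = [("Pythagoras", {"cs", "sk", "de", "fr"}),
         ("Pythia", {"cs", "de"}),
         ("Pythian Games", {"sk"})]
print(count_translated(pages, required=("cs", "sk")))  # ['Pythagoras']
print(count_translated(pages, min_langs=2))            # ['Pythagoras', 'Pythia']
```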


3.2.3 interlangimages.py

Similarly to the scripts described above, this one also goes through pages yielded by the page generator and checks the language versions of each one. The user has an option to specify the languages whose versions of the page are to be checked. If the user does not specify any language, all versions of the page will be examined.

The purpose of this script is to find pages whose versions in other languages have no images. The script loads, one by one, all versions of the page and for each of these versions gets links to all images used on it. If the number of these links is zero, the page has no images. It is possible to use the parameter -original, which will make the script check the original pages too. An example of use with the parameters -save -language:cs -limit:50 -start:pyth -language:sk can be seen below in the table 3.3.

Site code | Page title
cs | PythagoraSwitch
sk | Pythagoras (boxer)
sk | Pythagoras zo Sparty
cs | Pythagorejské koma
cs | Ditón
cs | Pythagorejské ladění
sk | Pytagoreizmus

Table 3.3: An interlangimages.py example
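The core check of this script can be sketched with the language versions of one page modelled as a mapping from language code to the number of images found on that version (an illustrative simplification):

```python
def versions_without_images(versions, languages=None):
    """Return codes of language versions with zero images; restrict the
    check to the given languages when any are specified."""
    checked = versions if languages is None else {
        code: n for code, n in versions.items() if code in languages}
    return sorted(code for code, n in checked.items() if n == 0)

versions = {"cs": 0, "sk": 0, "de": 3, "fr": 1}
print(versions_without_images(versions))          # ['cs', 'sk']
print(versions_without_images(versions, {"cs"}))  # ['cs']
```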

3.3 Counting scripts

Both of these scripts resemble the rest in their base functionality. They use the page generators to get the pages they work with, which are represented by the BasePage class from the pywikibot.page module. However, for each page they work with, they also get a numeric value that allows them to sort their results according to that value. As a result of this addition, these scripts do not just gradually list the pages through which they pass, but save them as results that are sorted later. Of course, the results may not only be sorted by their value but also by page title, or they can remain unsorted and be left in the order in which they were obtained.

Representing results as a variable made it possible to create a parameter, named -resultlimit, that indicates the minimum number of results needed for the scripts to stop processing pages. Of course, depending on other parameters specified by the user of these scripts, it is possible that the scripts will finish their work even before reaching this limit. Another useful parameter is -progress, which makes the scripts show the pages that are being processed.

3.3.1 proposepages.py

As the name suggests, this script proposes the creation of non-existing pages and the expansion of pages that are too small. It can do both of these activities at once or only one of them. Additionally, for each page that satisfies the conditions, the script finds out how many times it is linked to by other pages. The parameter -minlinks makes it possible to set the minimum number of links necessary for a page to be added to the results.

To propose the expansion of small pages, the user needs to run this script with the parameter -smallpages, which adds all pages smaller than a certain number of bytes (default is 2000) to the results. Similarly, to propose the creation of entirely new pages, the user needs to use the parameter -redlinks, which adds all the red links from the pages being processed to the results. Red links represent pages that do not exist yet but are linked to by other pages. This allows the script to find out the number of links to these non-existing pages and, based on that, suggest their creation. It is worth pointing out that, unlike pages yielded by a generator, red links obtained in this manner are more or less random. They are mostly related to the topic of the page from which they were acquired, but they cannot be narrowed down any further, only limited by the number of links.

After processing the pages, the script allows the user to sort the results. Other than remaining unsorted or being sorted by page titles, the results can be sorted by the number of links and shown in descending order.


Consequently, such sorting can also represent the importance of the resulting proposals. An example of use with the parameters -minlinks:10 -resultlimit:15 -save -start:xen -sortbylinks -smallpages can be seen below in the table 3.4.

Page title | Number of links
Xenien | 113
Xenia, Missouri | 53
Xenia Smits | 38
Xenies | 32
Xeni Gwet’in | 23
Xenakis | 18
Xenarchus of Seleucia | 17
Xenia Ivanov | 15
Xenicus | 14
Xenacanthidae | 14
Xenasmataceae | 13
Xena pipe | 12
Xenesthis | 11
Xenia (automobile) | 10
Xenerpestes | 10

Table 3.4: A proposepages.py example
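The result handling described above can be sketched as follows. Candidates are modelled as (title, size, links) triples, where a size of None stands for a red link; the data and threshold names are illustrative, with the 2000-byte default taken from the text:

```python
SMALL_PAGE = 2000  # default size threshold in bytes

def propose_pages(candidates, min_links=0):
    """Keep red links (size None) and small pages with enough incoming
    links, sorted by the number of links in descending order."""
    results = [(title, links) for title, size, links in candidates
               if (size is None or size < SMALL_PAGE) and links >= min_links]
    return sorted(results, key=lambda r: r[1], reverse=True)

candidates = [("Xenien", None, 113),      # red link
              ("Xenia Smits", 1500, 38),  # small page
              ("Xenakis", 9000, 18),      # large enough, skipped
              ("Xena pipe", 800, 2)]      # too few links for min_links=10
print(propose_pages(candidates, min_links=10))
# [('Xenien', 113), ('Xenia Smits', 38)]
```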

3.3.2 editvolume.py

The purpose of this script is to get information about revisions of pages that have been made by non-bot users during a certain period of time. It allows the user of the script to set a minimum date, which results in the addition of pages and all their edits since that date to the results. Similarly, it allows the user to set a maximum date, which adds pages and their edits up to this date to the results. The combination of these two dates restricts edits to the period of time between them. Pages without any edits in the given time span are ignored. These results, containing pages and the necessary information about their revisions, are only temporary and can be processed further.

Unlike the script mentioned above, this one allows the user to specify what to count. The first, simpler, option is to count the number of changes for each page according to the specified dates. This can be achieved with the parameter -editcount. The outcome of this option is a list of pairs consisting of page titles and numbers of edits. Another option is to calculate the size of the change between the oldest and the latest revision available. This change is represented as the difference between the numbers of bytes of these revisions. To do this, it is necessary to use the parameter -bytecount. If this parameter is specified, the script processes the temporary results and for each of them finds the oldest and the latest revision. It then subtracts the size of the smaller revision from the bigger one and, together with the page title, adds the difference to the final results. Pages with no difference between the revisions are also included in the results; therefore, it is a good idea to sort the results by the number of changed bytes, which shows them in descending order.

Only one of these options can be used; their combination is not possible. In the event that the user does not specify what to count, the default option is the calculation of changed bytes. An example of use with the parameters -datefrom:2015-01-01 -save -start:xen -sortbycount -dateto:2016-12-31 -resultlimit:15 -bytecount can be seen in the table 3.5 below.
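The -bytecount calculation can be sketched with revisions of one page modelled as (date, size-in-bytes) pairs; the dates and sizes below are invented for illustration:

```python
from datetime import date

def byte_count(revisions, date_from=None, date_to=None):
    """Absolute size difference between the oldest and the latest
    revision inside the [date_from, date_to] period, or None when no
    revision falls into it."""
    inside = [r for r in revisions
              if (date_from is None or r[0] >= date_from)
              and (date_to is None or r[0] <= date_to)]
    if not inside:
        return None
    inside.sort()  # chronological order
    return abs(inside[-1][1] - inside[0][1])

revs = [(date(2014, 5, 1), 1000), (date(2015, 3, 1), 2500),
        (date(2016, 8, 1), 4016)]
print(byte_count(revs, date(2015, 1, 1), date(2016, 12, 31)))  # 1516
```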

3.4 Category scripts

This is the only group of scripts that does not move through pages yielded by page generators. Instead, each of these three scripts progressively passes through a category tree whose root is the category specified by the -root parameter. Categories may contain subcategories, files or pages. None of the scripts works with files, but one of them counts pages, and all of them use subcategories to represent the tree. The tree is processed recursively, that is, the category-processing method calls


Page title              Changed bytes
XenForo                          9178
XenDesktop                       4838
Xen’drik                         4328
Xen (album)                      4016
Xena                             3016
Xen                              2954
XenClient                        1515
Xen C. Scott                     1202
XenMobile                        1045
Xena, Saskatchewan                158
Xen (disambiguation)              107
XenApp                             76
Xen Cuts                            0
Xen Coffee                          0
Xen Balaskas                        0

Table 3.5: An editvolume.py example

the same method for all subcategories until it reaches a category that has no subcategories. In reality, however, Wikipedia categories are better represented by a graph: the categories are rather chaotically structured, and one category can appear in multiple positions in the tree. As a result, the scripts often have to process the same part of the tree several times. A more serious problem occurs when a category refers to a subcategory that is located on the path between the root of the tree and this category. This causes the script to process this part of the tree over and over and never terminate. For this reason, it is recommended to make sure that no such loops appear in the tree; one of the scripts in this group was created for this purpose. The capability of these scripts to save results to an HTML file is limited. They can only save various information about the tree, such

as the number of pages for each category or the level of each category in the tree. All scripts have an option to show the tree traversal using the -progress parameter.

3.4.1 categorytreedepth.py

This script monitors the depth of the category tree. The user can use the parameter -maxdepth to set a limit on the tree level, and the script will list any category whose level exceeds this limit. Similarly, it is possible to use the parameter -mindepth to list all leaves (categories which have no subcategories) whose level is smaller than the entered number. The combination of these two parameters can also be used. The main function of this script is to check differences between the leaves of the tree. The parameter -depthdiff enables the user to set the maximum allowed difference between the leaf with the smallest level and the leaf with the highest level. After the script processes the entire tree and discovers these leaves, it subtracts the smallest leaf level from the highest leaf level and compares the result with the number given by the parameter. If this number is exceeded, the difference between the leaves is bigger than allowed. An example of use with the parameters -root:Apollo -save can be seen in Table 3.6.
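The leaf-level bookkeeping behind the -depthdiff check can be sketched as follows. This is a pure-Python illustration, not the script's actual implementation: the tree is a hypothetical, pre-fetched dictionary mapping a category name to its subcategories, whereas the real script walks live Wikipedia categories.

```python
def leaf_levels(tree, root, level=0):
    """Return {leaf_category: level} for every leaf reachable
    from `root` in the pre-fetched category tree."""
    subcats = tree.get(root, [])
    if not subcats:
        return {root: level}  # a leaf: no subcategories
    levels = {}
    for sub in subcats:
        levels.update(leaf_levels(tree, sub, level + 1))
    return levels

def depth_difference(tree, root):
    """Difference between the deepest and the shallowest leaf,
    mirroring the -depthdiff comparison."""
    levels = leaf_levels(tree, root)
    return max(levels.values()) - min(levels.values())

# A tiny tree shaped like part of the Apollo example in Table 3.6:
tree = {
    "Apollo": ["Apollo in art", "Cult of Apollo", "Temples of Apollo"],
    "Apollo in art": ["Paintings of Apollo", "Sculptures of Apollo"],
    "Cult of Apollo": ["Festivals of Apollo"],
}
print(depth_difference(tree, "Apollo"))  # 1 (leaves on levels 1 and 2)
```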

3.4.2 categorytreepages.py

This script checks page counts between categories at the same hierarchy level. Categories are on the same level if they have a common supercategory. The script gets the subcategories of a category and finds the number of pages for each subcategory. If the user has entered the maximum allowed difference in page counts with the parameter -pagebalance, the script compares the page count of each subcategory with the page counts of the other subcategories and lists those cases where the difference between the page counts is bigger than allowed. Another useful parameter is -subcatpages. With this option, the script does not just get the page counts of the subcategories but counts the number of pages of the entire subtree for each subcategory. The disadvantage is that this counting can take a lot of time, so it is not recommended for large trees. Similarly, it is possible to count the number


Category                Tree level
Apollo                  0
Apollo in art           1
Paintings of Apollo     2
Sculptures of Apollo    2
Cult of Apollo          1
Festivals of Apollo     2
Epithets of Apollo      1
Offspring of Apollo     1
Orpheus                 2
Operas about Orpheus    3
Temples of Apollo       1

Table 3.6: A categorytreedepth.py example

of pages in the entire tree with the use of the parameter -treepagecount. Such counting also takes some time, especially in large trees. An example of use with the parameters -root:Apollo -save can be seen in Table 3.7.
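The pairwise -pagebalance comparison can be sketched in pure Python. The sketch is illustrative: the sibling page counts are given as a hypothetical, pre-fetched dictionary (the real script queries Wikipedia for them), and the function name is invented.

```python
from itertools import combinations

def unbalanced_pairs(page_counts, page_balance):
    """Sketch of the -pagebalance check: `page_counts` maps sibling
    subcategory names (categories under a common supercategory) to
    their page counts; every pair whose counts differ by more than
    `page_balance` is listed."""
    return [
        (a, b) for a, b in combinations(sorted(page_counts), 2)
        if abs(page_counts[a] - page_counts[b]) > page_balance
    ]

# Sibling subcategories of "Apollo", with page counts as in Table 3.7:
siblings = {"Apollo in art": 0, "Cult of Apollo": 0,
            "Epithets of Apollo": 63, "Offspring of Apollo": 33,
            "Temples of Apollo": 23}
for a, b in unbalanced_pairs(siblings, 40):
    print(a, "<->", b)  # the two pairs differing by more than 40 pages
```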

3.4.3 categorytreeloops.py

As previously explained, loops in the category tree can cause the scripts above to never terminate. For this reason, this script was created to detect such loops and alert the user to them. It recursively moves through the tree and saves each category it processes into a dictionary as a key whose value consists of all available subcategories of this category. If any subcategory already exists in the dictionary as a key, the script finds the path between the root of the tree and the category that is being processed. This path consists of keys, and if it contains a key identical to the subcategory being checked, adding the subcategory would create a loop. In that case, the subcategory is removed from the values, the user is alerted, and the script continues without obstruction.
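The core of the loop detection can be sketched as a recursive walk that carries the current root-to-category path. This is a simplified illustration of the idea, not the script's exact bookkeeping (the sketch checks the path directly instead of maintaining the full dictionary of visited categories), and the mapping of categories to subcategories is hypothetical, pre-fetched data.

```python
def find_loops(subcats_of, root):
    """Report every subcategory that already lies on the path from the
    root to the category being processed, i.e. every edge that would
    close a cycle in the category 'tree'."""
    loops = []

    def walk(category, path):
        for sub in subcats_of.get(category, []):
            if sub in path:
                # Adding `sub` would create a loop; record it, skip it,
                # and continue without obstruction.
                loops.append(path + [sub])
            else:
                walk(sub, path + [sub])

    walk(root, [root])
    return loops

# The loop from Table 3.8, as a pre-fetched mapping:
subcats_of = {
    "Twelve Olympians": ["Demeter"],
    "Demeter": ["Festivals of Demeter"],
    "Festivals of Demeter": ["Eleusinian Mysteries"],
    "Eleusinian Mysteries": ["Demeter"],  # points back up the path
}
for loop in find_loops(subcats_of, "Twelve Olympians"):
    print(" -> ".join(loop))
```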


Category                Number of pages
Apollo in art           0
Cult of Apollo          0
Epithets of Apollo      63
Offspring of Apollo     33
Temples of Apollo       23
Paintings of Apollo     2
Sculptures of Apollo    13
Festivals of Apollo     8
Orpheus                 42
Operas about Orpheus    15

Table 3.7: A categorytreepages.py example

An example of use with the parameters -root:Twelve_Olympians -save can be seen in Table 3.8.

Category    Tree path
Demeter     Twelve Olympians -> Demeter -> Festivals of Demeter -> Eleusinian Mysteries -> Demeter

Table 3.8: A categorytreeloops.py example

3.5 Remaining scripts

This group contains two simple scripts that do not share enough similarities with the others to be included elsewhere. They are not even similar to each other, apart from this characteristic.

3.5.1 proposeimages.py

A script to find pages with an insufficient number of images. The boundary that determines a satisfactory number of images is given by the parameter -images, whose default value is 1. If the


user does not change this value, the script will find all pages without images. However, it should be noted that not all images on a page are necessarily related to its content. This can cause problems, because the script cannot recognize images by their importance. It uses the page method imagelinks() to get a list of all images on the page, represented by the class FilePage. This list contains not only images related to the topic but also images such as state flags, template pictures and various logos. Because most small pages do not contain images, it is often unnecessary to check them. For this reason, it is possible to set a minimum size of the pages to be processed (the default value is 5000 bytes). When increasing this limit, it is recommended to also increase the minimum number of images needed for a page to be satisfactory, because larger pages may contain multiple unrelated images. An example of use with the parameters -save -images:3 -start:pyth -limit:50 can be seen in Table 3.9.

Page title                          Number of images
Pythagoras (crater)                 1
Pythagorean astronomical system     2
Pythagorean field                   0
Pythagorean hammers                 2
Pythagorean prime                   2
Pythagorean quadruple               1
Pythia (band)                       2

Table 3.9: A proposeimages.py example
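The selection rule of proposeimages.py can be sketched in pure Python. The sketch is illustrative: the page sizes and image counts are given as hypothetical, pre-fetched data (the image count stands in for the length of the list that imagelinks() would return), and the function name is invented.

```python
def propose_pages(pages, images=1, minsize=5000):
    """Sketch of the proposeimages.py selection: `pages` maps a page
    title to a (size_in_bytes, image_count) pair. Pages that are large
    enough but have fewer images than required are reported."""
    return {
        title: image_count
        for title, (size, image_count) in pages.items()
        if size >= minsize and image_count < images
    }

pages = {
    "Pythagorean field": (9000, 0),    # large page, no images: reported
    "Pythagorean prime": (12000, 2),   # reported when -images:3 is used
    "Some stub": (800, 0),             # below the size limit: skipped
}
print(propose_pages(pages, images=3))
# {'Pythagorean field': 0, 'Pythagorean prime': 2}
```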

3.5.2 lastedits.py

This script is used to find pages whose most recent non-bot revision was created before a certain date. It goes through the pages yielded by the page generator and, for each of them, gets its revisions. After that, the script finds the latest revision made by a user that is not a bot and checks whether the date of the revision is older than the specified date. If

the date is older, the script lists the page title and all relevant information about the revision. An example of use with the parameters -save -start:pyth -limit:10 -date:2015-12-31 can be seen in Table 3.10.

Page title              Revision
Pythagenpat             {'revid': 499091696, 'text': None, 'timestamp': Timestamp(2012, 6, 24, 5, 58), 'user': 'Lockley', 'anon': False, 'comment': 'remove context tag', 'minor': True, 'rollbacktoken': None}
Pythagoras (crater)     {'revid': 677845619, 'text': None, 'timestamp': Timestamp(2015, 8, 25, 21, 56, 51), 'user': '217.189.252.0', 'anon': True, 'comment': '', 'minor': False, 'rollbacktoken': None}
Pythagoras ABM          {'revid': 506750248, 'text': None, 'timestamp': Timestamp(2012, 8, 10, 16, 49, 11), 'user': 'Rwalker', 'anon': False, 'comment': 'Disambiguated: [[MOE#Acronyms]] → [[Measure of effectiveness]]', 'minor': False, 'rollbacktoken': None}

Table 3.10: A lastedits.py example
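The check performed by lastedits.py can be sketched in pure Python. The sketch is illustrative: the revisions are given as hypothetical (date, user) pairs ordered newest first, the set of bot accounts is invented, and the real script obtains all of this through Pywikibot.

```python
from datetime import date

def last_nonbot_edit(revisions, cutoff, bots):
    """Sketch of the lastedits.py check: `revisions` lists one page's
    (timestamp, user) pairs, newest first. The latest non-bot revision
    is returned only when it is older than `cutoff`."""
    for timestamp, user in revisions:
        if user in bots:
            continue  # skip bot edits
        # The first non-bot revision found is the most recent one.
        return (timestamp, user) if timestamp < cutoff else None
    return None

revs = [
    (date(2016, 11, 2), "ExampleBot"),    # newest edit, but by a bot
    (date(2015, 8, 25), "217.189.252.0"),
    (date(2012, 6, 24), "Lockley"),
]
print(last_nonbot_edit(revs, date(2015, 12, 31), {"ExampleBot"}))
# (datetime.date(2015, 8, 25), '217.189.252.0')
```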

4 Further improvements and ideas

4.1 Improvements and changes

Although all the scripts created for this thesis work as they should, there are improvements and alternative solutions that could be applied to them. Ideas for additional scripts that could not be implemented due to time requirements or lack of experience will be described later in this chapter. It is also worth mentioning some of the problems that emerged during the creation of the scripts.

One of the biggest problems, especially in terms of time, is the testing of the scripts. Even though all scripts work correctly on a small number of pages or categories, it is exceedingly difficult to verify that they always work properly on a larger scale. The reason is the size of Wikipedia. Processing a single page is not immediate, and since some language versions contain millions of pages, scripts running without any limitations can take a substantial amount of time to finish. Because all scripts use the API, processing large amounts of pages also places a long-term burden on the network and on the Wikipedia servers.

Another problem is the need to maintain the simplicity of the scripts. The goal of this thesis is to facilitate the work of Wikipedia editors, not programmers. Most editors have little or no experience with programming languages, and therefore there is a need to explain how the scripts work and how to use them without requiring an understanding of the details. This is aided by documentation, explanations of the various parameters and a clear formulation of the results. The only exception consists of the scripts working with category trees, where the user should have at least a basic knowledge of data structures in computer science, more specifically an understanding of trees.

4.1.1 Multilingual scripts

Although this has already been explained in the section that describes this group of scripts, one of the most important things to be aware of is that the scripts have to be able to recognize all the language versions referenced from the processed page. The scripts will crash if they can

not recognize one of the languages on the page. Because this problem occurs when Pywikibot is outdated, implementing these scripts with other tools, or simply with slightly more complex SQL (Structured Query Language) queries, could solve it. Another option would be to implement a custom method, similar to the one Pywikibot uses to obtain language links, that would tolerate unknown languages. An automatic choice of the language versions of Wikipedia to work with could also be a good improvement. These versions could be selected based on statistics such as the number of pages, the amount of files, or the average page size of the various versions.

4.1.2 Category scripts

The biggest disadvantage of these scripts is their dependence on the correct structure of the categories. While it is possible to check the category trees manually or to use the script that was created for this purpose, this shortcoming restricts the usefulness of these scripts too much; realistically, they can be used only for small trees. Because the category structure on Wikipedia is very complicated, it is best to leave these scripts as they are and use them only for small or checked trees. For large numbers of categories, new scripts should be created that would work with graphs instead of trees. These scripts could use well-known graph algorithms from computer science to detect loops and other problems.

4.1.3 Other scripts

The remaining scripts mostly work as they should and usually require only minor improvements and changes. Solving the image problem in the proposeimages.py script would possibly be the most important refinement. Because people working with Wikipedia are aware of this problem, different solutions have emerged, of which the most useful is the extension PageImages. Its purpose is to find the most appropriate image associated with an article [18]. The selection of this image is based on various requirements, and the first image satisfying them is returned.


Other possible changes are not very important and would just make the scripts more effective, simpler or clearer. For example, limiting the randomness of the red links in the proposepages.py script would increase the accuracy and reliability of this script. Also, in the lastedits.py script, the revision information could be processed and made more comprehensible. The script editfrequency.py does not have any clear problems, and the only possible improvement would be to extend its functionality or add additional counting options. The last potential issue, which concerns only some scripts, is the date format. All date-based scripts use only the yyyy-mm-dd format, which is mainly used in English-speaking countries¹. However, this format is not universal, so it might not be a bad idea to increase the number of date formats with which the scripts can work.

4.2 Script suggestions

All the scripts created for this thesis represent only a minuscule fraction of the potential number of scripts that could be created. Because scripts are mostly created to improve an already existing functionality or to perform simple tasks, the size and complexity of Wikipedia allow the creation of a large number of such programs. Although many potential scripts have not been implemented for this thesis, it is good to mention, and possibly explain, some of them.

4.2.1 Scripts to detect incorrect information

Even though automatic programs are helping considerably with the development of Wikipedia, the creation and maintenance of its most important part, the articles, remains generally in the hands of editors. And because editors are humans, they can make mistakes. Whether the incorrect information in the articles is deliberate or accidental, it must be found and corrected as soon as possible. While programs cannot detect every possible inaccuracy, they are able to distinguish flaws such as wrong dates, grammatical errors,

1. For additional information about date formats on English Wikipedia, see https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Dates_and_numbers#Chronological_items

typographical errors, missing parts of articles and vulgarisms. In the case of robots, such mistakes could be corrected automatically; in the case of scripts, the user should be alerted and can then decide how the mistakes should be corrected. A useful tool for detecting errors could be a comparison of the article with its versions in multiple languages. Since machine translation is unreliable, such a comparison could only be used to detect errors in general information such as dates, locations, relatives, ages and timelines. Distinguishing more detailed inaccuracies might be possible, but it would considerably increase the complexity of the program.

4.2.2 Script to check references

Because Wikipedia is trying to be as objective and credible as possible, each article must contain references to its sources². The primary role of the potential script should therefore be to discover whether an article contains any references to its sources. References have their own section in articles, so it should be enough to check whether this section exists and contains at least one reference. If an article does not contain a single reference, it is untrustworthy. The secondary role of the script could be the verification of the existence of the individual references. It would not need to verify the source itself, just whether the source exists at all. This could be done with various tools, the simplest of which would be an Internet search engine.
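The primary check suggested above could be sketched as a simple inspection of the article's wikitext. The sketch below is only one possible heuristic: the section names and patterns it looks for are assumptions, not an exhaustive rule, and a real script would obtain the wikitext through Pywikibot.

```python
import re

def has_references(wikitext):
    """Heuristic sketch of the suggested reference check: the article
    is considered sourced when its wikitext contains a References-like
    section and at least one <ref> tag or citation template."""
    has_section = re.search(r"==\s*(References|Notes|Sources)\s*==",
                            wikitext, re.IGNORECASE) is not None
    has_citation = re.search(r"<ref[\s>]|\{\{cite", wikitext,
                             re.IGNORECASE) is not None
    return has_section and has_citation

sourced = ("Some text.<ref>{{cite web|url=http://example.org}}</ref>\n"
           "== References ==\n{{reflist}}")
unsourced = "Just unsupported claims.\n== See also ==\n"
print(has_references(sourced), has_references(unsourced))  # True False
```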

2. For more detailed information about references, see https://en.wikipedia.org/wiki/Wikipedia:Citing_sources

5 Conclusion

The main task of this thesis was to create several scripts to ease the work of Wikipedia editors. All scripts had to be created in the programming language Python using the Pywikibot library, and each of them had to be unique, useful and effective. In addition to the scripts themselves, the concepts relevant to the subject were explained in the first part: what Wikipedia is and what its important parts are, the tools that are used on Wikipedia, and Pywikibot, the Python library that was used for the implementation of all scripts.

In the course of this thesis, ten scripts with different functionalities were implemented and several more were suggested for implementation. Each of the implemented scripts was described in the second chapter of the thesis, and examples of their use were also shown. Although all the scripts were unique, some of them shared features with others, which allowed them to be divided into several groups.

During the creation of the scripts, some flaws in their design and in Pywikibot itself were revealed. Wikipedia categories have been shown not to be structured as trees; their hierarchy instead resembles a graph. It also turned out that Pywikibot needs to be updated regularly, otherwise it will not work properly. These and other complications are described in the third part, together with recommendations for solving the problems, suggestions for improvements of the already existing scripts and ideas for completely new scripts.

All of the created scripts are unique, useful and simple. Their usefulness varies, depending on what editors need, but every script has its specific role. Since Wikipedia changes regularly, some or all of the scripts may become outdated. In that case, new scripts will need to be designed and created to work with the modified Wikipedia. This keeps the topic relevant for as long as Wikipedia exists.


Bibliography

1. WIKIPEDIA. Wikipedia. 2001. Available also from: https://en.wikipedia.org/wiki/Wikipedia. [Online; accessed 9-April-2017].
2. WIKIPEDIA. Wikipedia:Creating a bot. 2006. Available also from: https://en.wikipedia.org/wiki/Wikipedia:Creating_a_bot#Programming_languages_and_libraries. [Online; accessed 15-April-2017].
3. WIKIPEDIA. Wikipedia:Five pillars. 2005. Available also from: https://en.wikipedia.org/wiki/Wikipedia:Five_pillars. [Online; accessed 15-April-2017].
4. WIKIPEDIA. Wikipedia:Namespace. 2002. Available also from: https://en.wikipedia.org/wiki/Wikipedia:Namespace. [Online; accessed 14-May-2017].
5. WIKIPEDIA. Wikipedia:What is an article? 2002. Available also from: https://en.wikipedia.org/wiki/Wikipedia:What_is_an_article%3F. [Online; accessed 14-May-2017].
6. WIKIPEDIA. Wiki markup. 2003. Available also from: https://en.wikipedia.org/wiki/Wiki_markup. [Online; accessed 10-April-2017].
7. WIKIPEDIA. Help:Interlanguage links. 2002. Available also from: https://en.wikipedia.org/wiki/Help:Interlanguage_links. [Online; accessed 17-April-2017].
8. WIKIPEDIA. Wikipedia:Translation. 2004. Available also from: https://en.wikipedia.org/wiki/Wikipedia:Translation. [Online; accessed 17-April-2017].
9. WIKIPEDIA. Scripting language. 2001. Available also from: https://en.wikipedia.org/wiki/Scripting_language. [Online; accessed 15-April-2017].
10. WIKIPEDIA. Wikipedia:Bot policy. 2002. Available also from: https://en.wikipedia.org/wiki/Wikipedia:Bot_policy. [Online; accessed 15-April-2017].
11. MEDIAWIKI. API:Main page. 2015. Available also from: https://www.mediawiki.org/wiki/API:Main_page. [Online; accessed 15-April-2017].


12. MEDIAWIKI. Manual:Pywikibot/Overview. 2009. Available also from: https://www.mediawiki.org/wiki/Manual:Pywikibot/Overview. [Online; accessed 17-April-2017].
13. MEDIAWIKI. Manual:Pywikibot/user-config.py. 2009. Available also from: https://www.mediawiki.org/wiki/Manual:Pywikibot/user-config.py. [Online; accessed 17-April-2017].
14. WIKIMEDIA. pywikibot.page module. 2016. Available also from: https://doc.wikimedia.org/pywikibot/api_ref/pywikibot.html#module-pywikibot.page. [Online; accessed 21-April-2017].
15. WIKIMEDIA. pywikibot.pagegenerators module. 2016. Available also from: https://doc.wikimedia.org/pywikibot/api_ref/pywikibot.html#module-pywikibot.pagegenerators. [Online; accessed 21-April-2017].
16. WIKIMEDIA. pywikibot package. 2016. Available also from: https://doc.wikimedia.org/pywikibot/api_ref/pywikibot.html. [Online; accessed 14-May-2017].
17. WIKIPEDIA. Wikipedia:Disambiguation. 2002. Available also from: https://en.wikipedia.org/wiki/Wikipedia:Disambiguation. [Online; accessed 25-April-2017].
18. MEDIAWIKI. Extension:PageImages. 2012. Available also from: https://www.mediawiki.org/wiki/Extension:PageImages. [Online; accessed 7-May-2017].
