
Masaryk University Faculty of Informatics

Assessing the Data Quality of Wikipedia

Master’s Thesis

Rajivv

Brno, Fall 2020


Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Rajivv

Advisor: doc. Mouzhi Ge, Ph.D.

Acknowledgements

I would first like to thank my thesis advisor, doc. Mouzhi Ge, Ph.D., for his support and advice, which helped in writing this thesis. He also steered me in the right direction whenever it was needed. I would also like to thank my family, partner, friends, and colleagues for their unfailing support and continuous encouragement throughout this research.

Abstract

Wikipedia has become one of the largest encyclopedias and one of the most popular websites. The most significant growth occurred in the English-language version of Wikipedia, which is accessed by more than 300 million users on a daily basis and has more than 6 million articles. Most of the time, internet users access Wikipedia and use it as a reference. Although the growth of Wikipedia has positive impacts, the quality of Wikipedia may suffer. Many English-language Wikipedia articles are of poor quality, and some popular Wikipedia topics are not covered adequately in good quality. Data quality remains one of the main issues faced by Wikipedia. Wikipedia addresses these issues with an article quality grading scheme, which is able to classify Wikipedia articles into several quality grades. However, this grading scheme is also considered inadequate to guarantee the quality of the articles. Even featured articles (articles at the highest level of Wikipedia's quality grading scheme, serving as role models for other articles) can still contain quality flaws. Quality flaws affect not only readers' understanding of the articles but also the data in infoboxes. In this research, the author refines nine data quality dimensions as a response to poor data quality, helping users score the quality of Wikipedia. Those dimensions are used in the automatic quality assessment of data in Wikipedia's structural parts (articles and infoboxes). The author constructs the quality measuring methodologies using metrics extracted from Wikipedia. To validate those methodologies, the author conducted experiments on a representative sample of Wikipedia (articles and infoboxes) with the selected categories: city, musician, and film. The results are then employed as the optimal value for each dimension and are used to determine the final quality score of a given Wikipedia article.

Keywords

Data analysis, Data quality, Data quality dimension, Verifiability, Credibility, Format compliance, Popularity, Reputation, Semantic stability, Completeness, Semantic connection, Web scraping, Wikipedia.

Contents

1 Introduction
  1.1 Motivation of the Study
  1.2 Goals of the Study and Research Question
  1.3 Thesis Organization

2 Literature Review
  2.1 Wikipedia
    2.1.1 Wikipedia Architecture
    2.1.2 Wikipedia Data & API
    2.1.3 Wikipedia Article
    2.1.4 Wikipedia Infobox
  2.2 Data Quality
    2.2.1 General Concept of Data Quality Dimensions
    2.2.2 Data Quality Issues in Wikipedia
    2.2.3 Data Quality Dimensions on Wikipedia Article
    2.2.4 Data Quality Dimensions on Wikipedia Infobox

3 Model of Assessing the Data Quality of Wikipedia
  3.1 Measurement of Data Quality Dimensions of Wikipedia Article
    3.1.1 Dimension of Credibility
    3.1.2 Dimension of Format Compliance
    3.1.3 Dimension of Popularity
    3.1.4 Dimension of Reputation
    3.1.5 Dimension of Semantic Stability
    3.1.6 Dimension of Verifiability
  3.2 Measurement of Data Quality Dimensions of Wikipedia Infobox
    3.2.1 Dimension of Completeness
    3.2.2 Dimension of Credibility
    3.2.3 Dimension of Semantic Connection
  3.3 Scoring Data Quality of Wikipedia Article and Infobox

4 Experiment and Implementation
  4.1 Experiment
    4.1.1 Datasets
    4.1.2 Development Design and Environment
  4.2 Implementation
  4.3 Results

5 Conclusion
  5.1 Research Findings
  5.2 Practical Implications
  5.3 Limitations and Future Research

Bibliography

List of Tables

2.1 MediaWiki architecture layers.
2.2 Definition of data quality dimensions from a different approach.
2.3 English version Wikipedia article quality grading scheme [52].
3.1 Sample of seeds from Wikipedia featured articles.
3.2 Sample of seeds of the format compliance dimension optimal value.
3.3 An example of set A for calculating the optimal value of P1.
3.4 An example of set A for calculating the optimal value of P2.
3.5 An example of set A for calculating the optimal value of P3.
3.6 Set A for determining the optimal value of the reputation dimension.
3.7 Set A for determining the threshold value of the semantic stability dimension.
3.8 An example of set A from Wikipedia featured articles for determining the optimal value of the verifiability dimension.
3.9 Weight of the parameters. Source: own calculation; the numbers are just an example to express the calculation.
3.10 An example of seeds for the optimal value of the credibility dimension for the infobox.
3.11 An example of seeds of infoboxes from the Wikipedia featured articles.
3.12 List of weights of the article and infobox dimensions.
3.13 An example of scores of article and infobox dimensions.
4.1 The Python library and its description.
4.2 The Python library based on their classifications (Frontend).
4.3 The Python library based on their classifications (Backend).
4.4 The experiment results of the City category.
4.5 The experiment results of the Musician category.
4.6 The experiment results of the Film category.
4.7 The average score of each dimension on each category.

List of Figures

1.1 Examples of Wikipedia issues.
1.2 Thesis structure.
2.1 MediaWiki Architecture [9].
2.2 A fragment of Wikipedia data in the XML format.
2.3 System architecture of the Wikipedia API [13].
2.4 Wikipedia article elements and its additional elements.
2.5 An overview of Wikipedia article with template category: Infobox book [25].
2.6 Example of code behind the given infobox [25].
2.7 The common data quality definitions [29].
2.8 Data quality dimensions proposal and the researchers.
2.9 Wikipedia quality flaws types.
2.10 Article footnote tags (from top to bottom: citation needed, verification needed, and unreliable source).
3.1 An illustration of Wikipedia article organization and elements.
3.2 An illustration of the defined metrics of the format compliance dimension on Wikipedia article structure.
3.3 An illustration of the English Wikipedia article on Pokémon. Circled in red are the "View source" tab (instead of "Edit") and the padlock icon which signal the page is protected.
3.4 Set A for determining the maximum value of the semantic stability dimension.
3.5 Set A for determining the maximum value of the semantic stability dimension.
3.6 An illustration about the information loop.
3.7 Circular references in Wikipedia article about Poznań.
3.8 An illustration of infobox structure and elements.
3.9 An illustration of the infobox about Anne of Green Gables and some of its metrics.
3.10 An illustration of the final score in the traffic light scorecard.
4.1 A Wikipedia page: the list of the FA and GA articles on cities.
4.2 The differences between Wikipedia infobox parameters on FA: Altrincham and GA: Atlanta.
4.3 A Wikipedia page: the list of the most revised articles on cities.
4.4 The predefined data application processes.
4.5 The GUI application processes.
4.6 The three categories of the infobox for the predefined data in HTML script.
4.7 The pageviews metric is presented as views in JSON format of the Wikipedia API.
4.8 The page creation date and the first edit date metrics are presented as timestamps in JSON format of the Wikipedia API.
4.9 The latest edit date metric is presented as a timestamp in JSON format of the Wikipedia API.
4.10 The Xtools API of page prose.
4.11 The Xtools API of page links.
4.12 The Xtools API of page article info.
4.13 The Xtools API of top editors.
4.14 Main interface of the application.
4.15 The main interface of the application with the list of categories.
4.16 The warning popup box to notify the error.
4.17 Result interface of the application.

1 Introduction

Wikipedia has become one of the largest encyclopedias and one of the most popular websites, with (as of September 2020) more than 6 million articles1. It is accessed by millions of users from all kinds of communities worldwide, and the English edition is the most visited project on the website, with monthly averages of about 9 billion pageviews2. Even Google adopted Wikipedia as a trusted knowledge source for web metadata by pulling information "snippets" when people search for relevant information on Google. On the contrary, Wikipedia states on its website that Wikipedia is not a reliable source. It is open to anonymous and collaborative editing, which means everyone can create, edit, or even delete content on Wikipedia, since it is free and easy to access. This means that any information, at any particular time, could be vandalism, a work in progress, or just plain wrong. Therefore, internet users need to understand how to choose high-quality data and information from the Web. To assess article quality, Wikipedia uses a system based on a letter scheme, which principally reflects how factually complete the article is, though language quality and layout are also factors. These quality assessments are mainly performed by members of WikiProjects, who tag the talk pages of the articles, and these tags are collected by a bot that generates output such as a log and statistics. On the other hand, independent editors are involved in assessing two levels of article quality. For example, in English Wikipedia, the best articles are called "Featured Article" (when all criteria are met) and "Good Article" (when almost all criteria are met). Their limited number, confronted with the articles' growth rate, does not allow overall and constant control. Furthermore, the subjectivity of human assessors results in different quality evaluations for articles belonging to distinct topic areas. Because Wikipedia articles are manually updated and maintained by contributors, plenty of data quality issues arise.

1. See https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
2. See https://stats.wikimedia.org/#/en.wikipedia.org/reading/total-page-views/normal|bar|1-month|~total|daily


In this context, the proposed approach aims at automating the data quality assessment of Wikipedia, specifically by assessing the articles and the infoboxes of Wikipedia. Why articles and infoboxes? Articles and infoboxes are the two most important elements of Wikipedia. An article is a center of data with encyclopedic information on it, developed through collaboration between editors. Meanwhile, the infobox is a summary of that encyclopedic information, displayed as a structured table containing a set of attributes and their values (attribute-value pairs) that allows readers to rapidly gather the most important information about the article's subject on Wikipedia. These infoboxes are also used to enrich other public knowledge bases such as DBpedia3, Freebase4, and Yago5. Data from such bases have been successfully applied in several domains: Life Sciences, Web Search, Digital Libraries, Maritime Domain, Art Market, and others [1]. To improve the quality of Wikipedia, we need to understand how to identify the data quality problems in Wikipedia articles and infoboxes, design the measures, and implement the metrics related to data quality dimensions in order to assess Wikipedia's data quality.

1.1 Motivation of the Study

Since its launch in 2001,6 the Wikipedia project has become the largest and most prominent collaboratively created reference work on the Web. Wikipedia is a dominant source of online information for millions of people. As a free online encyclopedia, Wikipedia must face the fact that there is no guarantee that all users comply with the rules when editing and creating articles. Wikipedia may contain information of shoddy or insufficient quality, because anyone with internet access, even anonymous users, can contribute to Wikipedia [2]. The number of users collaborating on an article does have certain benefits, but it can also have a substantial negative impact, especially on the article's quality, which may suffer [3]. Wikipedia is often criticized for containing

3. See https://wiki.dbpedia.org/
4. See https://en.wikipedia.org/wiki/Freebase_(database)
5. See https://en.wikipedia.org/wiki/YAGO_(database)
6. See https://en.wikipedia.org/wiki/Wikipedia


low-quality information; as mentioned by [4], at least one in four English Wikipedia articles contains at least one quality flaw. The author collects various examples of data quality problems on Wikipedia, as shown in figure 1.1 below.

Figure 1.1: Examples of Wikipedia issues.

The quality problems of Wikipedia are a vast subject in scientific works. Many researchers have investigated this issue, and various kinds of studies have been done on the automatic prediction of the quality rank of Wikipedia articles [1]. Each study typically uses its own set of measures with distinct algorithms to construct a model for solving this task. In this thesis, the author proposes known and new measures related to the different data quality dimensions of Wikipedia articles. By implementing metrics on Wikipedia, grounded in related research, the author determines an assessment of the data quality of Wikipedia. It is hoped that this can help the Wikipedia community select high-quality articles on Wikipedia.

1.2 Goals of the Study and Research Question

This thesis focuses on data quality in Wikipedia for both elements: articles and infoboxes. The thesis's main goal is to analyze, measure, and assess the data quality of Wikipedia. The author develops new methods to analyze the flaws in Wikipedia articles and infoboxes, formulates the flaws into data quality dimensions, and then develops a new measurement for each dimension based on other researchers' existing methods. Furthermore, the data quality of Wikipedia is assessed based on these measurements. The author develops a prototype to indicate the best possible level of Wikipedia's data quality and visualizes the result in scorecards. The main question which the author seeks to examine profoundly in this study is: » "How to assess the data quality and determine the level of data quality in Wikipedia?" The assessments are not only related to one dimension of data quality but involve multiple dimensions of both Wikipedia elements: article and infobox. Therefore, this multi-dimensional approach is expected to improve the quality of the assessment of data quality on Wikipedia. The author expects this study to contribute to improving the data quality of Wikipedia.

1.3 Thesis Organization

The following figure 1.2 represents the structure of this thesis. Below are summaries of the chapters:

Introduction: The author discusses the background and the underlying reasons for conducting this study, followed by an illustration of the problems, the goals of this study, and the research question.


Figure 1.2: Thesis structure.

Literature Review: This thesis focuses on identifying the data quality flaws in Wikipedia for both articles and infoboxes, classifying them into data quality dimensions, and developing the measurements for assessing the data quality of Wikipedia. Therefore, this chapter reviews related work on Wikipedia, clarifies the data quality dimensions in current research, and lists several relevant technologies and concepts.

Model of Assessing the Data Quality of Wikipedia: In this chapter, the author describes each data quality problem in depth, formulates each problem into a corresponding dimension, reviews related research on each dimension including its measurement metrics, and proposes the measurement metrics and the methodology for assessing the data quality of each dimension, together with artificial examples.

Experiment and Implementation: The author designs and builds a prototype using a programming language, APIs, and web scraping technologies to demonstrate the implementation of the model proposed in chapter 3.

Conclusion: The author's conclusions based on this research, answers to the research question, and suggestions for future related studies.

2 Literature Review

The author divides this chapter into two main subjects: Wikipedia and data quality. Both subjects are examined with a focus on Wikipedia data quality, based on studies conducted by many researchers. The author describes and illustrates Wikipedia's architecture, data, API, article structure, and infobox. On the other hand, the author describes the relevant aspects of data quality, namely the dimensions of data quality and the data quality issues in Wikipedia.

2.1 Wikipedia

Jimmy Wales and Larry Sanger created Wikipedia on 15 January 2001 using the wiki concept and technology under a free license on the web [5]. Wikipedia is based on the MediaWiki1 software, a project developed by the Wikipedia community to complement Nupedia [6]. Wikipedia allows everyone to edit and extend its article content. Today, Wikipedia is one of the largest encyclopedias, with a worldwide monthly readership of approximately 495 million. The English Wikipedia is the first edition of Wikipedia. Wikipedia then developed other language editions and gradually became a multilingual site. Nowadays, there are 304 active language editions, and each language edition uses the same technology but has different content and rules based on its community. The English Wikipedia has more than 6 million articles, and the compressed size of all articles is approximately 19 GB2. Wikipedia may contain information of shoddy or insufficient quality, because anyone with internet access, even anonymous users, can contribute to Wikipedia [2]. What is contributed is more valuable than who contributes it [7]. To assess the data quality of Wikipedia, it is necessary to understand the essential parts of Wikipedia. There are several parts of Wikipedia scrutinized by the author: architecture, data, article, and infobox. Each

1. See https://en.wikipedia.org/wiki/History_of_Wikipedia
2. See https://en.wikipedia.org/wiki/Wikipedia:Database_download#Where_do_I_get_it?

of the parts will be discussed in more detail through the subsections below.

2.1.1 Wikipedia Architecture

MediaWiki is a free and open-source wiki engine developed by the Wikimedia Foundation for use on Wikipedia since 2002, and it is still used today [6]. The Wikipedia architecture3 is based on MediaWiki, written in PHP, and uses MySQL as the core database combined with other DBMS technologies [8]. MediaWiki's general architecture contains four layers: the user layer, network layer, logic layer, and data layer, as shown in Table 2.1 below. Each layer's elements are illustrated in figure 2.1 below.

Table 2.1: MediaWiki architecture layers.

Layer          Element
User layer     Web browser
Network layer  Varnish/Squid; Apache web server
Logic layer    MediaWiki's PHP scripts; PHP
Data layer     File system; MySQL database (program and content); caching system

2.1.2 Wikipedia Data & API

Wikipedia provides backup files of all its available content as dump files. For example, for the English-language version of Wikipedia, users can download these dump files for free from https://dumps.wikimedia.org/enwiki/latest/ under the file name enwiki-latest-pages-articles.xml.bz2 [10].

3. See https://www.mediawiki.org/wiki/Manual:MediaWiki_architecture#General_architecture


Figure 2.1: MediaWiki Architecture [9].

This file is compressed and has a size of around 16.2 GB. Wikipedia also provides an easy web tool to export a particular page wrapped in XML format. This tool can be accessed via https://en.wikipedia.org/wiki/Special:Export. To export a particular article, the user needs to enter the Wikipedia article's title. The user can then export all categories or delete the unnecessary categories before exporting the XML file. A single page of the categories is put inside a <page> tag, the article's title is given by the <title> element, and the source code (wikitext) is given by the <text> element [11], as shown in figure 2.2.

Figure 2.2: A fragment of Wikipedia data in the XML format.

Besides the web interface, the MediaWiki software also provides an extensible web API (Application Programming Interface). The Wikipedia API is a RESTful web service that supports a wide variety of formats, including XML, JSON, PHP, YAML, and others [12]. The API enables client programs to interact with the server through high-level direct access. It provides Wikipedia mining capabilities such as logging in, retrieving data, and submitting edits to the MediaWiki database. Figure 2.3 shows an overview of the Wikipedia API system architecture.

Figure 2.3: System architecture of the Wikipedia API [13].

2.1.3 Wikipedia Article

A Wikipedia article is a page on Wikipedia that provides a readable summary of the knowledge within its scope based on reliable sources [14, 15] or, in a nutshell, a page that has encyclopedic information on it. In most cases, articles consist of paragraphs and some visualizations such as images, tables, or audio-visual material. Some articles are well-written, quite lengthy with appropriate amounts of words and references, of high quality, and deep in content, for example, featured articles. Others are shorter or of lesser quality, possibly stubs [16].

Wikipedia presents a guideline to help its editors write and maintain articles with precise and consistent language, layout, style, and formatting [17]. The structure of Wikipedia articles follows a specific layout, including the sections an article usually has, the ordering of sections, and formatting styles for the various elements of an article. On the other hand, the guidelines do not regulate how content is structured, and this decision rests with Wikipedia editors [18]. The following figure 2.4 shows the article elements, including the additional standardized sections in an article.

Figure 2.4: Wikipedia article elements and its additional elements.

2.1.4 Wikipedia Infobox

An infobox is a summary of information from a Wikipedia article, transformed into a fixed-format table. It emits structured metadata containing a set of attribute-value pairs [19]. In this way, the structure of the infobox benefits the readers, who can rapidly accumulate the most important information about the article's subject [20]. If readers do not want to read the full Wikipedia page, the infobox delivers the essence of the page for them [21]. It also becomes easier for machines to retrieve the various types of data fields of the template from those infoboxes at a comparatively low complexity cost. This kind of method is often applied by several related works, one of which is DBpedia [19].
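As a small illustration of this kind of machine access, the following sketch fetches an article's wikitext through the public MediaWiki Action API and pulls out the top-level attribute-value pairs of its first infobox. It is a minimal example under stated assumptions: the endpoint and query parameters are the standard Action API ones, but the helper names and the crude line-based parsing heuristic are illustrative only and are not the extraction pipeline used later in this thesis.

```python
import re
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_wikitext(title):
    """Fetch the current wikitext of an article through the MediaWiki Action API."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "formatversion": "2",
        "titles": title,
    }
    data = requests.get(API_URL, params=params, timeout=30).json()
    page = data["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]

def extract_infobox_pairs(wikitext):
    """Very rough extraction of the top-level '| name = value' lines of the first infobox."""
    match = re.search(r"\{\{Infobox", wikitext, flags=re.IGNORECASE)
    if not match:
        return {}
    pairs = {}
    depth = 0
    # Scan line by line until the '{{' / '}}' nesting level drops back to zero.
    for line in wikitext[match.start():].splitlines():
        depth += line.count("{{") - line.count("}}")
        if depth <= 0:
            break
        stripped = line.strip()
        if depth == 1 and stripped.startswith("|") and "=" in stripped:
            name, _, value = stripped.lstrip("|").partition("=")
            pairs[name.strip()] = value.strip()
    return pairs

if __name__ == "__main__":
    text = fetch_wikitext("Anne of Green Gables")
    for name, value in extract_infobox_pairs(text).items():
        print(f"{name}: {value}")
```

A dedicated wikitext parser would handle nested templates more robustly; the point here is only that the infobox's attribute-value pairs are easy to reach programmatically once the wikitext has been retrieved.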
The infobox type/category of a Wikipedia article is determined based on the collaboration between the editors of the content and the contributors of the Wikipedia article [22]. An editor must conform to the template's name, its parameters, and how those parameters are intended to be used, because each infobox template is different and unique [23]. The infobox templates are stored in their own Wikipedia namespace, which is detached from the article, so it takes some effort to spot an appropriate infobox template by name [23]. There are several ways in which an editor can choose which infobox template he or she wants to use:

∙ by browsing the set of all standardized infoboxes via Wikipedia:List of infoboxes [23].

∙ by specifying the name of a particular infobox used in a similar article [23].

∙ by manually creating a new infobox template [24].

Figure 2.5 is an overview of the infobox from the Wikipedia article "Anne of Green Gables", containing a set of attribute-value pairs. Figure 2.6 shows the code behind the infobox table from the Wikipedia article "Anne of Green Gables". This template is provided by participants who designed the infobox template in a collaborative project called a WikiProject. This template is one example of the many categories of infobox templates, such as geographic entities, education, plants, organizations, people, and so on, produced by WikiProjects.

Figure 2.5: An overview of Wikipedia article with template category: Infobox book [25].

Figure 2.6: Example of code behind the given infobox [25].

2.2 Data Quality

"Is the quality of data important?" The answer is, "yes, data quality is important!" [26]. To understand why data quality is important, we should be attentive to the definition of data quality itself. Data quality refers to the level of quality of data available to the business user [27]. Data quality has many definitions, but data is generally considered high quality if "they are fit for their intended uses in operations, decision making, and planning" [28]. Figure 2.7 below summarizes the common data quality definitions from data quality research.

Figure 2.7: The common data quality definitions [29].

Among many other definitions of data quality, we can categorize the definitions into three kinds of perspectives. From a consumer perspective, data quality is [29]:

∙ data that are "fit for use by data consumers."

∙ data "meeting or exceeding consumer expectations."

∙ data that "satisfies the requirements of its intended use."

From a business perspective, data quality is:

∙ data that is "'fit for use' in their intended operational, decision-making and other roles" or that exhibits "'conformance to standards' that have been set so that fitness for use is achieved" [30].

∙ data that "are fit for their intended uses in operations, decision making, and planning" [28].

∙ "the capability of data to satisfy the stated business, system, and technical requirements of an enterprise" [26].
From a standards-based perspective, data quality is:

∙ the "degree to which a group of inherent characteristics of an object fulfills requirements" [29].

∙ "the usefulness, accuracy, and correctness of data for its application" [31].

Data is deemed to be of top quality if it correctly represents the real-world construct to which it refers. Furthermore, beyond these definitions, as data volume increases, the question of internal consistency within data becomes significant, regardless of fitness for use for any particular external purpose. People's views on data quality can often be divided, even when discussing the identical set of data used for the identical purpose. In all these cases, "data quality" is a comparison of the actual state of a specific collection of data to a desired state, with the desired state typically stated as "fit to be used", "to specification", "meeting consumer expectations", "free of defect", or "meeting requirements". These expectations, specifications, and requirements are usually defined by one or more individuals or groups, standards organizations, laws and regulations, business policies, or software development policies.

A data quality dimension is needed for defining and understanding the level of data quality. The data quality dimension identifies and classifies data quality based on measurement metrics related to the business processes to be measured, such as measuring the data quality related to data values, data models, data presentation, and conformance with governance policies [32].

2.2.1 General Concept of Data Quality Dimensions

How can we measure how good the data quality is? Poor data quality can be measured from several perspectives. For instance, does the data reflect real-world conditions? Is it easily used and understood by the data user? Is it interpretable and accessible by the user? Most people think of data quality in terms of data accuracy, but another critical role belongs to the consumer who examines the use of the data. Based on the concept of "fitness for use", which is used widely in the quality literature, the consumer takes an important part in measuring the data quality [33].

Table 2.2: Definition of data quality dimensions from a different approach.

No. | Approach type | Definition
1 | A scientifically grounded approach [41] | An approach to separate the dimensions into those intrinsic to an information system from those external to the system.
2 | An ontological based approach [38][41][42][43][44][45] | Identifies the data deficiencies that exist when mapping the real world to an information system.
3 | Marketing research [41][46][47][45] | To empirically derive and define dimensions of importance to data consumers, then develop a framework that captures the aspects of data quality that are important to data consumers.
4 | Pragmatic approach [41] | To have data quality defined by the user, depending on the context of the task at hand.

Thus, back to the question: to measure how good the data quality is, data quality metrics or data quality dimensions are required, where the measurement is made not only in terms of the quality itself but also in terms of user requirements [34, 35, 36]. From the literature review, there is no agreement on a specific definition of data quality dimension [37].
The data quality dimensions frequently mentioned in many studies are accuracy, completeness, consistency, and timeliness [38], and many researchers acknowledge accuracy as a key dimension in most data quality work [38]. Based on Wang & Strong (1996), from a consumer perspective, a data quality dimension is a set of data quality attributes that represent a single aspect or construct of data quality [33, 29]. For design purposes, to define the data quality dimension it is sufficient to consider the information system in its role as a representation of known aspects of the real world [38, 39]. Different approaches are taken by researchers to define and establish data quality dimensions based on their areas [40], as shown in Table 2.2 above. Mostly the researchers work in the areas of: 1) data quality, 2) information systems success and user satisfaction, and 3) accounting and auditing [41]. Thus, there are various lists of data quality dimensions from different approaches and researchers [48]. The following figure 2.8 summarizes the conceptual framework of data quality dimension proposals and their researchers [29, 41, 45].

Figure 2.8: Data quality dimensions proposal and the researchers.

The categories are defined as follows:

1. Intrinsic: indicates data quality that contains attributes "that data has on its own" [33, 29].

2. Contextual: reveals that data quality can only be perceived when using data in task contexts [33]. For example, completeness can only be fulfilled when all necessary values of a set of data are incorporated [29, 38, 41, 49].

3. Representational: includes dimensions related to the format and meaning of data, such as the consistent representation of data or the ease of understanding the data at hand [33]. The system must present information in such a way that data are well represented, not only concise and consistently represented (across repositories, applications, tables, and fields, and across internal and external sources), but also interpretable (the user understands the syntax and semantics of the data) [29, 38, 28, 41].

4. Accessibility: together with the representational category, emphasizes the importance of the role of systems, where the system not only must present data in such a way that they are represented concisely and consistently but also must be accessible (the user has the means and privilege to get the data) and secure (data access security) [33, 29, 28, 41].

From the aforementioned figure, not all of those data quality dimensions can be applied to measure the data quality of Wikipedia, because each Wikipedia page does not consist only of articles or content but also contains references and data in the form of an infobox. Some data quality dimensions are often used to measure the data quality of Wikipedia articles and infoboxes. For Wikipedia article assessment, Ofer Arazy, Wayne Morgan, and Raymond A. Patterson suggest using the "Wisdom of Crowds" (WoC) principle according to Surowiecki (2005): "If you put together a big enough and diverse enough group of people and ask them to 'make decisions affecting matters of general interest,' that group's decision will, over time, be 'intellectually superior to the isolated individual,' no matter how smart or how well-informed he is" [50].
Thus, in order to achieve high-quality content, it is essential that many users participate in authoring wiki pages and that participation levels are high (i.e., each page has many edits). Besides, the Wikipedia community has created its own grading scheme to assess the quality of articles based on specific criteria [51]. One of the critical criteria is the availability of sources/references in the articles, which must be reliable, independent, reputable, and accurate.

2.2.2 Data Quality Issues in Wikipedia

Wikipedia is a community-based encyclopedia that is hugely popular with the public and has become one of the most important sources of knowledge throughout the world. As open-source software where anyone can edit almost any Wikipedia content with minimal effort, people raise questions regarding the quality of its content [52]. To cope with this issue, Wikipedia has a different set of quality classes based on the quality of article content for particular languages. For example, the quality classes below are defined for English, French, and Russian:

∙ English: FA, GA, B, C, Start, Stub.

∙ French: ADQ, BA, A, B, BD, E.

∙ Russian: FA, GA, SA, I, II, III, IV.

The descriptions of the English Wikipedia quality classes4 are provided in Table 2.3 below. This classification provides an important guideline for readers in selecting high-quality articles and at the same time informs Wikipedia editors so they can improve low-quality articles. However, the quality classes of Wikipedia articles are manually assigned by reviewers, and a quality class needs to be reassigned after each individual modification of a given article. Due to Wikipedia's size and dynamic nature, human resources are not enough to review every revision of Wikipedia, so it is likely that many flawed articles have not yet been identified. In fact, as reported in [4], less than 0.1% of the English Wikipedia articles are labeled as FA (featured); these are considered to be well-written, comprehensive, well-researched, and neutral. From a quantitative point of view, in [4, 53] the authors identified the most important quality flaws in Wikipedia and grouped them into a set of 11 general flaw types, as shown in figure 2.9 below.

4. See https://en.wikipedia.org/wiki/Wikipedia:Content_assessment

Table 2.3: English version Wikipedia article quality grading scheme [52].

FA: Professional, outstanding, and thorough; a definitive source for encyclopedic information.
GA: Useful to nearly all readers, with no obvious problems; approaching (but not equalling) the quality of a professional encyclopedia.
B: Readers are not left wanting, although the content may not be complete enough to satisfy a serious student or researcher.
C: Useful to a casual reader, but would not provide a complete picture for even a moderately detailed study.
Start: Provides some meaningful content, but most readers will need more.
Stub: Provides very little meaningful content; may be little more than a dictionary definition. Readers probably see insufficiently developed features of the topic and may not see how the features of the topic are significant.

Figure 2.9: Wikipedia quality flaws types.

2.2.3 Data Quality Dimensions on Wikipedia Article

Although there are various lists of data quality dimensions from various approaches and studies, not all data quality dimensions are capable of and fit for being applied to assess the level of data quality of a Wikipedia article.
Below are the most commonly used dimensions for Wikipedia articles, which are explained as follows.

Credibility

Wikipedia readers must be able to check the source of information presented by Wikipedia [54, 55]. Wikipedia motivates its contributors to iterate continually on the articles' contents by providing a set of tagging systems that contributors can place as footnotes on articles' contents to improve the article quality, as shown in figure 2.10 below.

Figure 2.10: Article footnote tags (from top to bottom: citation needed, verification needed, and unreliable source).

There are three steps for improving the credibility dimension on Wikipedia from its contributors' perspective [56]:

(1) a contributor places an appropriate tag on some part of the article's content;

(2) this action triggers a discussion between the contributors, which encourages the contributors to research appropriate references in order to improve the article quality;

(3) those references are added to the article's content.

There is a tight correlation between an article's contents and its references [57]. The credibility dimension refers to the degree to which information can be trusted and uses reliable sources. Therefore the number of references is used to measure the credibility dimension [1, 58, 56, 54].

Format Compliance

Each Wikipedia article displayed to the readers must be presented in the same format structure in terms of layout, according to its language edition. Each Wikipedia language edition has a different set of rules and guidelines. For example, English-edition Wikipedia articles must conform to the Wikipedia Manual of Style [1, 59]. According to [60, 59, 61, 17], the format compliance dimension refers to how information is presented in accordance with format guidelines or conformity to governance policies. The format compliance dimension assesses each of the structural elements of a Wikipedia article to determine the quality level of the Wikipedia article.

Popularity

According to [62, 63, 64, 65], the popularity dimension refers to how popular or significant the article chosen by the reader is. According to [1], the popularity dimension is similar to the relevance dimension. On Wikipedia, an article's popularity can be determined by its number of viewers [66, 67]. The number of viewers of an article cannot blindly be used as a warranty of the article's quality, because many factors can affect the number of viewers. For example, an article covering a widely discussed topic, like a war, a deadly virus, etc., will directly increase the article's number of viewers [57]. The popularity dimension of articles is based not only on the number of viewers but also on the article's age, the number of page watchers, the number of incoming internal links, etc. [7, 68, 63, 69, 1, 67, 64, 57].

Reputation

According to [51, 45], the reputation dimension refers to the degree to which information is trusted and comes from reliable sources. An article with a good reputation must be of high quality, trusted, and come from a reliable source. In [70, 71], the authors concluded that reputation is based on the survival of a contributor's edits and depends on the subsequent contributors. According to [72], an article's quality can be assessed based on its contributors' reputation.
In [73], reputation quality is focused on the contributors' contributions. According to [72, 74, 75], the authors focus on editing behavior to achieve the reputation dimension's quality.

Semantic Stability

According to [76], the semantic stability dimension refers to the degree to which the content of the information is constantly changing. In [77, 78, 79, 80], an article becomes unstable if there is massive fluctuation in the number of edits and the number of reverted edits. Normally, the number of edits and reverted edits increases if there are edit wars between the contributors and vandalism.

Verifiability

According to [53], the verifiability dimension refers to the degree to which information can be tested and verified to be in the correct state. Verifiability issues emerge due to a lack of citations and references to the original sources [3, 81, 82, 61]. Ensuring the reference quality (no incomplete or missing references, and no circular references) means the verifiability dimension is accomplished. According to the experiment in [81], at least one quality flaw related to the verifiability dimension was found in the articles. Verifiability of online information is one of the most vital fundamentals in the open-source-based encyclopedia Wikipedia [4, 82, 14, 58].

2.2.4 Data Quality Dimensions on Wikipedia Infobox

The infobox contains a lot of important information about the subject of the article. The experiment in [83] proved that there is a correlation between the overall quality of the Wikipedia article and the quality of infobox attributes. However, several data quality issues arise in the infobox [84]:

∙ Incompleteness and inconsistency: occur because the manual creation and editing processes are separated between the article and the infobox, which leads to contradictions between the article context and the infobox summary.

∙ Schema drift: the case where the schema for a class of articles tends to evolve during the creation or editing process.

∙ Type-free system: there is no standard type system for infobox attributes, which increases inconsistency and complexity during extraction.

∙ Irregular lists: the inconsistent layout of list pages; some separate information into items, while others use tables with different schemas.

∙ Flattened categories: low usage of Wikipedia's category tag system because it is too flat and unconventional.

Completeness

According to [53, 51, 45, 1, 41, 85, 49], the completeness dimension refers to the degree to which information includes all necessary and relevant values and has sufficient breadth and depth for its context. In [86], completeness is defined in terms of the data elements that must have values; data can be complete even if optional data is missing. In [38, 76, 87], completeness refers to the level of missing or unusable values of certain attributes in a data set that should represent every meaningful state of the represented real-world system. For the Wikipedia infobox, the completeness dimension refers to the ratio of the number of attribute-value pairs present to the number of all attribute-value pairs defined for the given infobox type [1].
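As a small numerical illustration of this ratio, the following sketch computes infobox completeness from a set of extracted attribute-value pairs. The parameter names are made up for the example; in practice the list of defined parameters would come from the documentation of the infobox template in question, and this is not the implementation used in the prototype.

```python
def infobox_completeness(filled_pairs, defined_params):
    """Ratio of non-empty attribute-value pairs to all parameters defined by the template."""
    filled = {name for name, value in filled_pairs.items() if value}
    defined = set(defined_params)
    return len(filled & defined) / len(defined) if defined else 0.0

# Hypothetical example: 3 of 5 defined parameters are filled -> completeness = 0.6
pairs = {"name": "Anne of Green Gables", "author": "L. M. Montgomery", "pub_date": "1908"}
params = ["name", "author", "pub_date", "illustrator", "pages"]
print(infobox_completeness(pairs, params))  # 0.6
```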
Credibility

The credibility dimension of the Wikipedia infobox is more or less similar to that of Wikipedia articles, being related to the analysis of its references. The credibility dimension refers to the ratio of the number of filled parameters to the number of references of the filled parameters [1].

Semantic Connection

Each Wikipedia article contains a collection of hyperlinks associating crucial terms with other wiki pages, termed wikilinks [88]. Wikilinks are signals of semantic connection, which focuses on the connection between entities in the same domain [89]. The semantic connection dimension refers to the ratio of the number of filled parameters to the number of wikilinks of the filled parameters [90, 91, 92, 93].

3 Model of Assessing the Data Quality of Wikipedia

3.1 Measurement of Data Quality Dimensions of Wikipedia Article

Figure 3.1: An illustration of Wikipedia article organization and elements.

A Wikipedia article consists of several vital elements that support the article's level of quality, as seen in figure 3.1. Those elements are the layout structure, content/entry, references, wikilinks, contributors, edit activity, viewers, age, and the infobox, where the infobox will be discussed further in the next section.

In assessing the Wikipedia article, the author implements several dimensions that are considered capable of interpreting those vital elements. Some previously published literature [3, 32, 26, 33, 34, 49, 41, 28, 94, 60, 29, 1, 80] presents a comprehensive grouping of data quality dimensions. By analyzing these groupings, the author is able to establish a core set of dimensions, namely credibility, format compliance, popularity, reputation, semantic stability, and verifiability.

The author develops the definition of each of these dimensions by analyzing it based on several journals and related research conducted by other researchers. These journals and research works discuss data quality dimensions and how to determine the measurement for assessing the level of data quality in Wikipedia. Those six dimensions are related to each other. Each of them has its own characteristics, so that together they are expected to provide a maximal assessment in achieving a high-quality article. Here is a brief explanation of each dimension.

- Credibility: focuses on the trustworthiness and expertise of the shared content. The number of references and the article length are used to measure these two elements. Why? Because both parameters are the most important part of Wikipedia's core content policies. If examined further, these two parameters are closely related, because the article length is directly proportional to the number of references. Typically, a high-quality article has a significant length (bytes, characters, words, and others) and contains an appropriate number of references.

- Format compliance: a high-quality article must be in a structured format that complies with defined rules or guidelines. Its format consists of several important elements used as a measuring tool for this dimension (as illustrated in figure 3.1). This dimension is a fundamental dimension in assessing the quality of the Wikipedia article and ensures the interrelationship of the assessment process with the other dimensions.
This dimension is utilized to ensure that each article's element structure refers to the defined format.

- Popularity: this dimension has an important role in assessing the quality of an article, because the assessment includes three different but related aspects of the Wikipedia article, namely article statistics, article edit history, and article age. Article statistics include pageview activity, page views, and incoming links reflecting the popularity of an article. The article edit history describes the role of the contributors in terms of editing and reverting activities that indicate the article's popularity. Meanwhile, the article age reflects the lifetime of the article, which can affect its popularity.

- Reputation: this dimension focuses on the reputation of the article. Articles that have a good reputation are certainly of high quality. In general, the reputation of a work can be judged by the contributor's reputation, because the contributor's reputation has a significant role in the work produced. Correspondingly, if an article is written by the top 10% of editors on Wikipedia, it will have a good reputation. However, the author concludes that the editor's reputation alone is not enough to judge an article's reputation. Therefore, the author adds the page network (external links and outgoing links) as an element to measure the article's reputation.

- Semantic stability: focusing on the article's stability, one of the criteria for achieving a high-quality article is that the article needs to be stable in its current form; in other words, no massive number of edits and no vast number of revert operations. This dimension is related to the popularity dimension, because popularity has "two sides of the same coin": whether the article has become popular due to edit wars, vandalism, etc., or it has become popular due to its high quality.

- Verifiability: focuses on the references themselves. Incomplete parameters or even circular information in a reference can influence the user in judging the validity of the reference. So it is crucial to determine the quality of a reference, which is directly proportional to the article's quality.

The following paragraphs explain each dimension in detail.

3.1.1 Dimension of Credibility

Credibility is becoming an increasingly significant area to understand. Likewise, one of the largest issues on the internet nowadays is information credibility; this issue also occurs on Wikipedia [54, 95]. But what does credibility mean, and how can it be achieved? According to [95], credibility is defined as believability. The author agrees with this term. On the internet, information is credible if it provides trustworthiness, comes from a reliable source, and is accurate, expert, and trusted. In achieving credibility, it is necessary to assess both trustworthiness and expertise to achieve an overall assessment of credibility. In line with [95], trustworthiness and expertise are the core components of credibility. Wikipedia's core content policies state that "Wikipedia does not publish original research"1 and "If no reliable sources can be found on a topic, Wikipedia should not have an article on it."2 Consequently, the author concludes that this particular dimension aims to ensure the article has adequate references commensurate with the content of the Wikipedia article.
Thus, the readers can easily access the broader knowledge referenced in Wikipedia and judge the trustworthiness of the knowledge from the article itself [55, 96]. In writing a high-quality article, editors must include inline references or citations, and the citations must come from reliable sources, such as academic sources [55]. The reference is one of the primary keys to achieving the credibility dimension [1, 56, 58]. This metric is commonly used by many researchers to measure Wikipedia article quality [1, 56, 58, 57, 54, 55, 97, 61, 98].

In this work, the author combines two metrics to ensure that this dimension's measurement captures high-quality Wikipedia article content. Those metrics are the number of unique references and the article length (in words) [2, 99, 16, 100, 101, 102, 103]. The author chooses these metrics because they represent the core components of credibility: the unique references represent trustworthiness, and the article length represents expertise.

∙ Measure: the ratio between the number of unique references and the article length.

∙ Unit of measure: percentage.

∙ Variables of measure: unique references, article length.

∙ Equation:

CR = R / L,    (3.1)

where R is the number of unique references and L is the article length (in words).

1. See https://en.wikipedia.org/wiki/Wikipedia:Verifiability
2. See https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources

The author prepared a procedure to determine the optimal value of the ratio between the two metrics above and to ensure the veracity of the above equation. The steps are:

- A set of (n) seed featured articles from Wikipedia3 is denoted as A = {A_{i,j}, ..., A_{n,m}}, for 0 < i ≤ n and 0 < j ≤ m;

- Each seed article A_i is analyzed to extract the (j) values, namely the number of unique references and the article length (in words). A new set of unique references denoted R = {r_i, ..., r_n} and a new set of article lengths denoted L = {l_i, ..., l_n}, for 0 < i ≤ n, are constructed in this step;

- The sets R and L are substituted into equation 3.1:

CR_i = R_i / L_i,

and a new set CR = {CR_i, ..., CR_n}, for 0 < i ≤ n, is constructed;

- The optimal value is determined by the mean of the set CR = {CR_i, ..., CR_n}:

OptimalValue = (1/n) Σ_{i=1}^{n} CR_i,    (3.2)

- Then, the score of the credibility dimension of article (x) is defined as:

CR_score = CR(x) / OptimalValue.    (3.3)

3. See https://en.wikipedia.org/wiki/Wikipedia:Featured_articles/By_length
4. See https://en.wikipedia.org/wiki/Brno

Table 3.1: Sample of seeds from Wikipedia featured articles.

A               R     L       CR
Canada          434   9960    0.04357430
Ronald Reagan   469   16195   0.02895956
Metalloid       592   8885    0.06662915
Evolution       348   9877    0.03523337
Japan           327   9348    0.03498074

For instance, the author has taken the Wikipedia article on Brno4 as an example, with the number of unique references = 154 and the article length = 5780 words. If (n) = 5, as shown in Table 3.1 above, the optimal value is calculated as:

OptimalValue = (1/5) Σ_{i=1}^{5} CR_i = 0.20938 / 5 = 0.04188.

By applying equation 3.1 and the optimal value above, the credibility dimension score of the Wikipedia article on Brno is:

CR_score = 0.02664 / 0.04188 = 0.63626 ≈ 64%.
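To make the arithmetic above concrete, the following sketch recomputes the optimal value from the seeds in Table 3.1 and the credibility score for the Brno example. It is a worked illustration of equations 3.1 to 3.3, not the thesis prototype; the reference and word counts are taken directly from the text.

```python
# Seeds from Table 3.1: article -> (unique references R, article length L in words)
seeds = {
    "Canada": (434, 9960),
    "Ronald Reagan": (469, 16195),
    "Metalloid": (592, 8885),
    "Evolution": (348, 9877),
    "Japan": (327, 9348),
}

def credibility_ratio(references, length):
    """Equation 3.1: CR = R / L."""
    return references / length

# Equation 3.2: the optimal value is the mean CR over the seed articles.
ratios = [credibility_ratio(r, l) for r, l in seeds.values()]
optimal_value = sum(ratios) / len(ratios)      # ~0.04188

# Equation 3.3: the score of a given article relative to the optimal value.
brno_cr = credibility_ratio(154, 5780)         # ~0.02664
brno_score = brno_cr / optimal_value           # ~0.636, i.e. about 64 %

print(f"optimal value = {optimal_value:.5f}")
print(f"Brno CR       = {brno_cr:.5f}")
print(f"Brno score    = {brno_score:.2%}")
```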
3.1.2 Dimension of Format Compliance

Is each article on Wikipedia presented in the same format structure? Every encyclopedia has a set of rules binding its representation. Conformance of the format structure refers to its standardization and to presentation in an appropriate format that is consistent across the entire content. The format structure contains several elements that can be used to assess the quality of the content. Format compliance can be used as a rule to validate missing elements of the structure and also to monitor adherence to the format specifications [60].

Wikipedia developed a Manual of Style (MoS)5 to assist Wikipedians in writing and maintaining the consistency of language, layout, and formatting for all English Wikipedia articles [104, 105]. High-quality articles must have an appropriate article structure and organization in the form or layout6 displayed to the readers, in accordance with the MoS [1, 59].

According to Wikipedia, "a simple article should have at least a lead section and references"; in other words, those two elements are mandatory for every English article. Related research such as [61] introduced structure features to measure the quantity of organization of an article by implementing several variables: nesting of sections, sections, lead section, images, tables, files, templates, categories, references, "See also", "Further reading", etc. The work in [1] used sections, tables, and templates as the core measurement of the article structure. In [18], "references" were used; a similar approach was employed by [101], which also analyzed "references" and "links to other articles"; these are the elements that bolster an appropriate structure. Other research in [17] states that "articles should generally start with a lead section" and "articles can optionally contain an infobox"; that study focuses more on the sections, infoboxes, and links in terms of article structure. In [59], sections, infobox, references, template, categories, "See also", wikilinks, and images are the elements used to assess the article structure or organization.

This dimension aims to assess the article structure by checking the presence of elements that are either compulsory according to Wikipedia or suggested by research on the article's layout in Wikipedia [59, 61, 17]. Each of the elements is illustrated in figure 3.2, as shown below. Therefore, the author concluded that those elements are essential in estimating the value of the format compliance dimension and formulated them into binary metrics (0/1) with the equation shown below.

∙ Measure: analysis of the existence of the elements of article structure and organization in the form or layout.

∙ Unit of measure: percentage.

5. See https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style
6. See https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Layout

∙ Variables of measure: lead section/intro, section, section nesting/sub-section, infobox, image, table, category, template, reference, link to the page, external link, link from the page, "Further reading", "See also".
∙ Equation: E FC = , (3.4) V where E – number of exist variables, V = {Lead section/intro, section, section nesting/sub section, infobox, image, table, cat- egory, template, reference, link to the page, external link, link from the page, "Further reading", "See also"}, V – number of all defined variables, by default is |V| = 14.</p><p>Figure 3.2: An illustration of the defined metrics of the format compli- ance dimension on Wikipedia article structure.</p><p>The author prepared a contrive to determine the optimal value for ensuring the veracity of the above equation, and in order to fit with the real-world situation, here are the steps:</p><p>33 3. Model of Assessing the Data Quality of Wikipedia</p><p>- Set of (n) seed featured articles from Wikipedia denoted as A = {Ai,..., An}, for 0 < i ≤ n;</p><p>- Each seed of the featured articles A(i) was analyzed to extract the (j) value. New nested sets of A = {Ai,j,..., An,m}, for 0 < j ≤ m, were constructed;</p><p>- New sets were constructed from each of {Ai,j}, E = {ei,..., en}, V = {vi,..., vn}, for 0 < i ≤ n, where E is a collection of number of exist variables, and V is a collection of number of all defined variables; - Implementing the set of (E, V) into the equation 3.4, new set of FC = {FCi,..., FCn}, for 0 < i ≤ n, was constructed; - The optimal value is determined by the mean of the set of FC = {FCi,..., FCn}: 1 n OptimalValue = ∑ FCi, (3.5) n i=1 - Then, to define the score of format compliance dimension for the article (x) is:</p><p>FC(x) FC = . (3.6) score OptimalValue As an example, if the article A has a set of variables |E| = {section, table, infobox, template, category, link to the page, reference}, then the following result is obtained: 7 = = 0.5, 14 By using the elements of set A from Table 3.2 below to calculate the optimal value with equation 3.5: 1 5 4.785714 = ∑ FCi = = 0.957143, 5 i=1 5 The format compliance dimension score for the article A is: FC 0.5 = A = = 0.522388 = 52%. OptimalValue 0.957143</p><p>34 3. Model of Assessing the Data Quality of Wikipedia</p><p>Table 3.2: Sample of seeds of the format compliance dimension optimal value. A E V FC Barack Obama 14 14 1.000000 Hillary Clinton 13 14 0.928571 Canada 13 14 0.928571 Taylor Swift 13 14 0.928571 Dinosaur 14 14 1.000000</p><p>3.1.3 Dimension of Popularity Most internet users use Wikipedia as the first option while seeking information, especially when they want to discover and study the most popular or important topics in the given periods [66]. Many users are interested in Wikipedia, especially for articles that are widely accessed and become popular [1, 65]. The popularity dimension is also called the relevance dimension according to [1] because this dimension focuses on how popular or significant the article, which is chosen by readers. In Wikipedia statistics, to identify an article that’s significantly increased popularity in periods, one of the approaches by analyzes the pageview statistics development trends7 [66, 67]. If the number of pageviews increases, it means the article is popular or important to the readers. But the main problem is whether the popularity of the article can ensure the quality of the article? The above question has been answered in [57], they concluded that there is a correlation between high quality and popularity within a topic, and the strength of the correlation depends on the topic and the language version of Wikipedia. 
The correlation does not solely depend on the statistical development of pageviews or not only determined by the number of readers, but also associated with the article age and the edit history(contributors activities) [68, 7]. In most of the studies [62, 63, 64, 65] of dimensions, articles with aging content appeared to be superior and high quality with longer entries, more references, more links, more viewers, more contribu-</p><p>7. See https://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics</p><p>35 3. Model of Assessing the Data Quality of Wikipedia</p><p> tors, more updates, more page watchers8, etc. In addition, there is an interesting connection between the article’s age, popularity, and the number of edits to the quality of an article [106, 107]. The article’s age could affect the number of edits, i.e., the average older articles have more edits [107, 3]. Similarly, the article’s popularity also could affect the number of edits, i.e., the more eyes on the article, the more edits will occur [57]. Thus, the author divided the measurement into three components: - Page statistic: for measuring the popularity based on the num- ber of pageviews, number of page watchers, and the number of links to the article [7, 68, 63, 69, 1, 67, 64, 57]. - Page edit history: measuring the relevancy between the num- ber of total edits and the number of reverted edits [107, 62, 108, 109, 66, 3]. - Page age: calculate the quality of the article based on its age [63, 1, 62, 106, 107]. These components are keys to the popularity dimension to achieve the high data quality of Wikipedia. Because these components are bound to each other, it is very relevant to use these components’ value as the point of reference in achieving the popularity dimension on Wikipedia. The components are calculated by its metrics and equations, normalized for determining the final score of this dimension. ∙ Measure: analysis of page statistic, page edit history, and com- bine with the page age. ∙ Unit of measure: percentage. ∙ Variables of measure: links to the page (incoming links), pageviews, page watchers, total edits, reverted edits, actual date, date of page creation. ∙ Equation: (P1 + P2 + P3) PO = , (3.7) 3</p><p>8. See https://en.wikipedia.org/wiki/Help:Watchlist</p><p>36 3. Model of Assessing the Data Quality of Wikipedia</p><p> where, P1 = L + V + W, (3.8) P1 – representing the page statistics and links, L – number of links to the page (incoming links), V – number of pageviews (in 30 days), W – number of page watchers,</p><p>E P2 = , (3.9) T P2 – representing the page edits, E – number of reverted edits, T – number of page total edits,</p><p>P3 = (D − C), (3.10)</p><p>P3 – representing the page age, D – actual date, C – Date of page creation.</p><p>In order to verify the correctness of the equations P1, P2, and P3, and for achieving its respective optimal values, the author applies the mean value theorem for each equation. As below:</p><p>- A = {Ai,..., An}, for 0 < i ≤ n, where A is a set of (n) seeds of featured articles from Wikipedia;</p><p>- Each seed of articles, Ai, was analyzed to extract (j) values. 
New nested sets of A = {A_{i,j}, ..., A_{n,m}} were constructed based on this step, for 0 < j ≤ m;

- New sets were constructed from each of {A_{i,j}}: L = {l_i, ..., l_n}, V = {v_i, ..., v_n}, W = {w_i, ..., w_n}, E = {e_i, ..., e_n}, T = {t_i, ..., t_n}, D = {d_i, ..., d_n}, and C = {c_i, ..., c_n}, for 0 < i ≤ n, where L is the collection of numbers of incoming links, V the collection of numbers of pageviews, W the collection of numbers of page watchers, E the set of numbers of reverted edits, T the set of numbers of page total edits, D the collection of actual dates, and C the collection of page creation dates;

- Implementing the sets (L, V, W) in equation 3.8, a new set P1 = {P1_i, ..., P1_n}, for 0 < i ≤ n, was constructed;

- The optimal value is determined by the mean of the set P1 = {P1_i, ..., P1_n}:

    OV1 = (1/n) Σ_{i=1}^{n} P1_i,    (3.11)

- Implementing the sets (E, T) in equation 3.9, a new set P2 = {P2_i, ..., P2_n}, for 0 < i ≤ n, was constructed;

- Then, the optimal value of P2 = {P2_i, ..., P2_n} is:

    OV2 = (1/n) Σ_{i=1}^{n} P2_i,    (3.12)

- Next, implementing the sets (D, C) in equation 3.10, a new set P3 = {P3_i, ..., P3_n}, for 0 < i ≤ n, was constructed;

- The optimal value of P3 = {P3_i, ..., P3_n} is:

    OV3 = (1/n) Σ_{i=1}^{n} P3_i,    (3.13)

- Then, the score of the popularity dimension for an article (x) is formulated as:

    PO_score = (1/3) (P1_(x)/OV1 + P2_(x)/OV2 + P3_(x)/OV3).    (3.14)

For example, the article about Vienna9 is employed for calculating the popularity dimension:

    P1 = 32037 + 82557.5 + 518 = 115112.5,
    P2 = 668 / 5536 = 0.120665,
    P3 = (10/10/2020 − 01/28/2001) = 7195 days,

9. See https://en.wikipedia.org/wiki/Vienna

The next step is to calculate the optimal value of P1 with equation 3.11, using Table 3.3 as the sample of seeds from Wikipedia featured articles:

Table 3.3: An example of set A for calculating the optimal value of P1.

A               L       V          W     P1
Canada          364723  499542.50  2264  866529.50
Ronald Reagan   16450   681069     1556  699075.00
Metalloid       1041    34847.5    170   36058.50
Evolution       13242   60665.5    2145  76052.50
Japan           272633  4244075    2356  4519064.00

    OV1 = (1/5) Σ_{i=1}^{5} P1_i = 6196779.50 / 5 = 1239355.90,

Then, the optimal value of P2 is calculated with equation 3.12 using Table 3.4:

Table 3.4: An example of set A for calculating the optimal value of P2.

A               E     T      P2
Canada          4365  21568  0.2024
Ronald Reagan   3119  19342  0.1613
Metalloid       333   2710   0.1229
Evolution       3514  14933  0.2353
Japan           3318  18307  0.1812

    OV2 = (1/5) Σ_{i=1}^{5} P2_i = 0.9031 / 5 = 0.1806,

The last optimal value to be calculated is P3, using Table 3.5 and equation 3.13:

Table 3.5: An example of set A for calculating the optimal value of P3.

A               D           C           P3
Canada          10/10/2020  10/31/2001  6919
Ronald Reagan   10/10/2020  9/28/2001   6952
Metalloid       10/10/2020  9/14/2002   6601
Evolution       10/10/2020  11/6/2001   6913
Japan           10/10/2020  10/31/2001  6919

    OV3 = (1/5) Σ_{i=1}^{5} P3_i = 34304 / 5 = 6860.8,

Then, the score of the popularity dimension for the article about Vienna is:

    = (1/3) (P1_(Vienna)/OV1 + P2_(Vienna)/OV2 + P3_(Vienna)/OV3)
    = (1/3) (115112.5/1239355.90 + 0.120665/0.1806 + 7195/6860.8)
    = 1.809668321 / 3 = 0.603223 = 60%
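The three-component calculation can be illustrated with a short Python sketch of equations 3.8–3.14. It assumes the metric values (incoming links, pageviews, watchers, reverted and total edits, creation dates) have already been retrieved, for instance through the Wikipedia and XTools APIs described in Chapter 4; the names used here are illustrative, not the thesis implementation.

```python
from datetime import date


def popularity_components(links_in, pageviews, watchers,
                          reverted, total_edits, today, created):
    """Equations 3.8-3.10: page statistics, edit history and page age."""
    p1 = links_in + pageviews + watchers          # equation 3.8
    p2 = reverted / total_edits                   # equation 3.9
    p3 = (today - created).days                   # equation 3.10
    return p1, p2, p3


def popularity_score(article, seeds):
    """Equation 3.14: components normalized by their optimal values, then averaged."""
    p1, p2, p3 = popularity_components(*article)
    ov1 = sum(popularity_components(*s)[0] for s in seeds) / len(seeds)  # eq. 3.11
    ov2 = sum(popularity_components(*s)[1] for s in seeds) / len(seeds)  # eq. 3.12
    ov3 = sum(popularity_components(*s)[2] for s in seeds) / len(seeds)  # eq. 3.13
    return (p1 / ov1 + p2 / ov2 + p3 / ov3) / 3


# Vienna example and the seeds from Tables 3.3-3.5
today = date(2020, 10, 10)
vienna = (32037, 82557.5, 518, 668, 5536, today, date(2001, 1, 28))
seeds = [
    (364723, 499542.5, 2264, 4365, 21568, today, date(2001, 10, 31)),  # Canada
    (16450, 681069, 1556, 3119, 19342, today, date(2001, 9, 28)),      # Ronald Reagan
    (1041, 34847.5, 170, 333, 2710, today, date(2002, 9, 14)),         # Metalloid
    (13242, 60665.5, 2145, 3514, 14933, today, date(2001, 11, 6)),     # Evolution
    (272633, 4244075, 2356, 3318, 18307, today, date(2001, 10, 31)),   # Japan
]
print(round(popularity_score(vienna, seeds), 4))  # ~0.6032, i.e. 60%
```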
3.1.4 Dimension of Reputation

What makes an article have a good reputation? An article with a good reputation must be of high quality, trusted, and come from a reliable source. "Reliable source" can also be interpreted in terms of who contributed to the article. As we know, Wikipedia articles are created through massive collaboration between registered and unregistered Wikipedians10. According to [72], "the quality of whole articles can be assessed on the basis of the reputation of its authors," and likewise in [74], "people naturally ascribe reputations to other individuals and will accept knowledge without verification from sources that they feel have a high enough reputation." In accordance with [73], "Wikipedia encourages contributors to become 'registered users' by outlining the benefits of having a user account, including building a reputation in the community." Thus, it is difficult to guarantee the reputation quality of an article without distinguishing the type of Wikipedians involved, i.e., whether the article was contributed by registered Wikipedians only, unregistered Wikipedians only, or a combination of both [110, 111].

10. See https://en.wikipedia.org/wiki/Wikipedia:Wikipedians

Related work in [70, 71] bases reputation on the survival of a contributor's edits, which depends on the subsequent contributors. In [73], reputation quality is focused on contributor contributions measured in two ways: the number of times a contributor made an edit, and the number of characters added per edit. Furthermore, [72, 74, 75] focus on editing behavior to capture reputation quality, and according to [112], reputation quality is based on user contributions and the seriousness of the user's rating. Therefore, the author concludes that the reputation of an article can be represented by how many edits were made by the top 10% of editors (in this case, registered Wikipedians) and by combining this with metrics from the network features in [61, 3], as shown below.

∙ Measure: analysis of the page edits made by the top 10% of editors, combined with the page network.

∙ Unit of measure: percentage.

∙ Variables of measure: total edits, edits made by the top 10% of editors, outgoing links, and external links.

∙ Equation:

    RE = (T / E) + L + P,    (3.15)

  where T – edits made by the top 10% of editors, E – number of total edits, L – number of external links, P – number of links from the page (outgoing Wikipedia links).

Why are these metrics employed in this dimension? Consider the following: if an article is created by an anonymous user, then no matter how good the article is, it tends to be regarded as lower quality, because the user who created it has no established reputation in the community to vouch for it. Conversely, if the article is created by a top-10% editor with a good reputation in the community, that reputation supports the article's quality. The page network metrics are a complement that brings the equation closer to the real-world case.

In the real-world case, it is nearly impossible to determine the reputation dimension score using only equation 3.15.
Therefore, the author uses the mean value theorem to obtain the optimal value of this dimension, as follows:

- The author constructed a set denoted as A and collected (n) seeds of featured articles from Wikipedia, A = {A_i, ..., A_n}, for 0 < i ≤ n;

- Each seed of featured articles, A_i, was analyzed to extract the value of (j). New nested sets of A = {A_{i,j}, ..., A_{n,m}}, for 0 < j ≤ m, were constructed based on this step;

- New sets were constructed from each of {A_{i,j}}: T = {t_i, ..., t_n}, E = {e_i, ..., e_n}, L = {l_i, ..., l_n}, and P = {p_i, ..., p_n}, for 0 < i ≤ n, where T is the collection of numbers of edits made by the top 10% of editors, E is the set of numbers of total edits, L is the collection of numbers of external links, and P is the collection of numbers of links from the page;

- Implementing the sets (T, E, L, P) in equation 3.15, a new set RE = {RE_i, ..., RE_n}, for 0 < i ≤ n, was constructed;

- The optimal value is determined by the mean value of the seeds in RE = {RE_i, ..., RE_n}:

    OptimalValue = (1/n) Σ_{i=1}^{n} RE_i,    (3.16)

- Then, the final score of the reputation dimension for an article (x) is formulated as:

    RE_score = RE_(x) / OptimalValue.    (3.17)

For instance, what is the score of the reputation dimension of the article about Prague11? It is computed as follows:

    RE_(Prague) = (727 / 5863) + 190 + 1076 = 1266.12,

11. See https://en.wikipedia.org/wiki/Prague

Table 3.6: Set A for determining the optimal value of the reputation dimension.

A               T     E      L     P     RE
Canada          4719  21568  741   1190  1931.22
Ronald Reagan   3669  19342  592   1918  2510.19
Metalloid       1522  2710   217   625   842.56
Evolution       2888  14933  2063  1052  3115.19
Japan           2826  18301  429   1347  1776.15

Then, to calculate the score of the reputation dimension, the author first needs to compute the optimal value by implementing equation 3.16. In this case, the author uses Table 3.6, containing set A with five seeds, as an example:

    OptimalValue = (1/5) Σ_{i=1}^{5} RE_i = 10175.32 / 5 = 2035.06,

The score of the reputation dimension is:

    RE_(Prague) / OptimalValue = 1266.12 / 2035.06 = 0.622155 = 62%
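A minimal Python sketch of equations 3.15–3.17 is shown below, assuming the metrics (edits by the top 10% of editors, total edits, external and outgoing links) are already available, e.g., from the XTools APIs described in Chapter 4; the names are illustrative only.

```python
def reputation_score(top10_edits, total_edits, external_links, outgoing_links, seeds):
    """Reputation dimension (equations 3.15-3.17).

    seeds -- list of (top10_edits, total_edits, external_links, outgoing_links)
             tuples taken from featured-article seeds.
    """
    def re_value(t, e, l, p):
        return t / e + l + p                                  # equation 3.15

    re_x = re_value(top10_edits, total_edits, external_links, outgoing_links)
    optimal = sum(re_value(*s) for s in seeds) / len(seeds)   # equation 3.16
    return re_x / optimal                                     # equation 3.17


# Seeds from Table 3.6 and the Prague example from the text
seeds = [
    (4719, 21568, 741, 1190),   # Canada
    (3669, 19342, 592, 1918),   # Ronald Reagan
    (1522, 2710, 217, 625),     # Metalloid
    (2888, 14933, 2063, 1052),  # Evolution
    (2826, 18301, 429, 1347),   # Japan
]
print(round(reputation_score(727, 5863, 190, 1076, seeds), 4))  # ~0.6222, i.e. 62%
```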
3.1.5 Dimension of Semantic Stability

One of the criteria of a high-quality article is the article's stability, but why? To answer this question, we need to understand the basic term of stability. In the context of Wikipedia, stability means there is no massive fluctuation in the number of edits and the number of reverted edits [77, 78, 79, 80]. However, given the open collaborative nature of Wikipedia, whenever an article starts growing it encourages more and more editors to contribute to its contents, especially if the article covers a controversial topic [113]. This leaves a "two sides of the same coin" effect on the article. On one side, the article grows towards the high-quality criteria and becomes semantically stable [114]. On the other side, the article's content is exposed to edit wars, vandalism, etc. and becomes unstable.

Edit wars are situations in which editors repeatedly overwrite the article. This often happens while editors discuss a particular topic, since they do not always share the same perspective [115, 77, 78]. Edit wars involve two basic actions: (1) a commit operation, i.e., adding, editing, or deleting content in an article, and (2) a revert operation, i.e., restoring the article to a previous version. Meanwhile, vandalism is editing content in a way that is intentionally disruptive to the article's quality.

Additionally, to prevent edit wars between contributors with divergent points of view, Wikipedia administrators protect the pages12 [116]. This action also helps to reduce vandalism on the article. The protection is marked by the disappearance of the "Edit" tab and the appearance of the "Padlock" icon in the right corner of the article below the search box, as shown in figure 3.3 below. Protection is used to ensure stability for articles that the Wikipedia community has determined should provide consistent content, such as policy pages [116].

Figure 3.3: An illustration of the English Wikipedia article on Pokémon. Circled in red are the "View source" tab (instead of "Edit") and the padlock icon which signal the page is protected.

12. See https://en.wikipedia.org/wiki/Wikipedia:Protection_policy

Thus, the author concludes that the two basic actions explained above are the main indicators for achieving an article's semantic stability. Therefore, the number of edits and the number of reverted edits are employed as the metrics for measuring this dimension.

∙ Measure: analysis of the edit frequency and the reverted edit frequency.

∙ Unit of measure: percentage.

∙ Limitation: if a metric's value is larger than the maximum value, it is counted as the maximum value.

∙ Variables of measure: total edits, reverted edits, date of first edit, date of latest edit.

∙ Equation:

    SS = (1/2) (EF + RF),    (3.18)

  where

    EF = TE / (DL − DF),    (3.19)

  EF – edit frequency, TE – number of total edits, DL – date of latest edit, DF – date of first edit,

    RF = RE / (DL − DF),    (3.20)

  RF – reverted edit frequency, RE – number of reverted edits, DL – date of latest edit, DF – date of first edit.

Equation 3.18 aims to capture the daily fluctuation of the metrics in order to express the notion of stability. If the metrics' values increase beyond the defined threshold value or the defined maximum value, the article is not stable.

Determining the threshold value and the maximum value

In the real-world case, the semantic stability dimension value is not directly proportional to the metrics' values. If the metrics' values are close to zero, the article can be rated as a dead article (from an edit history perspective). Conversely, the article's stability level is determined by the distance between the metrics' values and the threshold value: the closer the metrics' values are to the threshold value, the more stable the article. Therefore, the author uses Wikipedia's featured articles as a point of reference to determine the threshold value for this dimension. Moreover, the maximum value is employed to determine the upper limit of this dimension. If the metrics' values approach or even exceed the maximum value, the article is unstable. The author uses Wikipedia's most-revised articles13 as benchmark data for determining the maximum value. The author uses the mean value theorem to obtain both the threshold value and the maximum value of this dimension, as follows:

- The author constructed a set denoted as A and collected (n) seeds of featured articles from Wikipedia, A = {A_i, ..., A_n}, for 0 < i ≤ n;

- Each seed of featured articles, A_i, was analyzed to extract the value of (j).
New nested sets of A = {Ai,j,..., An,m}, for 0 < j ≤ m, were constructed based on this step;</p><p>- New sets were constructed from each of {Ai,j}, T = {ti,..., tn}, R = {ri,..., rn}, D = {di,..., dn}, and F = { fi,..., fn}, for 0 < i ≤ n, where T is the collection of number of total edits, R is the collection of number of total reverted edits, D denotes for the collection of the date of latest edit, and F is the collection date of first edit; - Implementing the sets of (T, D, F) into the equation 3.19, new set of EF = {EFi,..., EFn}, for 0 < i ≤ n, was constructed; - The optimal value is determined by the mean value of the seeds in the EF = {EFi,..., EFn}: 1 n TV1 = ∑ EFi, (3.21) n i=1</p><p>13. See https://en.wikipedia.org/w/index.php?title=Special: MostRevisions, retrieved at 4:20 PM, 28 October 2020 (CET)</p><p>46 3. Model of Assessing the Data Quality of Wikipedia</p><p>- Implementing the sets of (R, D, F) into the equation 3.20, new set of RF = {RFi,..., RFn}, for 0 < i ≤ n, was constructed;</p><p>- The optimal value is determined by the mean value of the seeds in the RF = {RFi,..., RFn}:</p><p>1 n TV2 = ∑ RFi, (3.22) n i=1</p><p>Below are the steps to collect the maximum value data from the list of pages with the most revisions in Wikipedia.</p><p>- The author constructed a set denoted as B, and collects (n) seeds of articles from the list mentioned above, B = {Bi,..., Bn}, for 0 < i ≤ n;</p><p>- Each seed of the articles, Bi, was analyzed to extract the value of (j). New nested sets of B = {Bi,j,..., Bn,m}, for 0 < j ≤ m, were constructed based on this step;</p><p>- New sets were constructed from each of {Bi,j}, T = {ti,..., tn}, R = {ri,..., rn}, D = {di,..., dn}, and F = { fi,..., fn}, for 0 < i ≤ n, where T is the collection of number of total edits, R is the collection of number of total reverted edits, D denotes for the collection of the date of latest edit, and F is the collection date of first edit;</p><p>- Implementing the sets of (T, D, F) into the equation 3.19, new set of EF = {EFi,..., EFn}, for 0 < i ≤ n, was constructed;</p><p>- The maximum value is determined by the mean value of the seeds in the EF = {EFi,..., EFn}:</p><p>1 n MV1 = ∑ EFi, (3.23) n i=1</p><p>- Implementing the sets of (R, D, F) into the equation 3.20, new set of RF = {RFi,..., RFn}, for 0 < i ≤ n, was constructed;</p><p>47 3. Model of Assessing the Data Quality of Wikipedia</p><p>- The maximum value is determined by the mean value of the seeds in the RF = {RFi,..., RFn}: 1 n MV2 = ∑ RFi, (3.24) n i=1</p><p>Based on the above descriptions about semantic stability, there are two possibilities in the way determining the score of this dimension, as formulated below:</p><p>- If the value of EF(x) or RF(x) is less than or equal to the thresh- old value, then:    ! 1 (TV1 − EF(x)) (TV2 − RF(x)) SS = + , score 2 TV1 TV2 (3.25)</p><p>- If the value of EF(x) or RF(x) is greater than the threshold value, then:    ! 1 (EF(x) − TV1) (RF(x) − TV2) SS = + . score 2 (MV1 − TV1) (MV2 − TV2) (3.26) where EF(x) – represents the number of edit frequency from the article (x), RF(x) – represents the number of reverted edit frequency from the article (x), TV1 or TV2 – represents the threshold values, MV1 or MV2 – represents the maximum values. For instance, what is the score of the semantic stability dimension from an article about Prague in Wikipedia? If the number of edits = 5869, the number of reverted edits = 765, date of first edit = 11/6/2001, and the date of latest edit = 10/26/2020. 
It is computed as below by implementing equations 3.19 and 3.20:

    EF = 5869 / 6929 = 0.85,
    RF = 765 / 6929 = 0.11,

Table 3.7: Set A for determining the threshold value of the semantic stability dimension.

A               TE     RE    DL−DF (days)  EF    RF
Canada          21568  4365  6919          3.12  0.63
Ronald Reagan   19342  3119  6952          2.78  0.45
Metalloid       2710   333   6601          0.41  0.05
Evolution       14933  3514  6913          2.16  0.51
Japan           18307  3318  6919          2.65  0.48

The author then uses the five samples of seeds from Table 3.7 to calculate the threshold values by implementing equations 3.21 and 3.22:

    TV1 = (1/5) Σ_{i=1}^{5} EF_i = 11.12 / 5 = 2.22,
    TV2 = (1/5) Σ_{i=1}^{5} RF_i = 2.12 / 5 = 0.42,

The next step is determining the maximum values. As shown in figures 3.4 and 3.5, the author collects fifty seeds and substitutes them into equations 3.23 and 3.24:

    MV1 = (1/50) Σ_{i=1}^{50} EF_i = 306.10 / 50 = 6.122,
    MV2 = (1/50) Σ_{i=1}^{50} RF_i = 31.01 / 50 = 0.620,

Figure 3.4: Set of A for determining the maximum value of the semantic stability dimension.

Figure 3.5: Set of A for determining the maximum value of the semantic stability dimension.

Because the values of EF(x) and RF(x) are less than the threshold values, the score of the semantic stability dimension is calculated by implementing equation 3.25:

    = (1/2) ((2.22 − 0.85)/2.22 + (0.42 − 0.11)/0.42),
    = (1/2) (1.376183/2.22 + 0.313161/0.42),
    = (1/2) (0.619009 + 0.739343),
    = (1/2) (1.358352) = 0.679176,

Then, the score of the semantic stability dimension of the article about Prague in Wikipedia is 68%.
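The two-branch scoring above can be illustrated with the following Python sketch of equations 3.18–3.26. It assumes the threshold and maximum values have already been derived from the seed articles; since the text leaves the mixed case (one metric below its threshold, the other above) open, the sketch applies equation 3.25 only when both metrics are at or below their thresholds. Function names are illustrative.

```python
from datetime import date


def semantic_stability_score(total_edits, reverted_edits, first_edit, latest_edit,
                             tv1, tv2, mv1, mv2):
    """Semantic stability dimension (equations 3.18-3.26).

    tv1, tv2 -- threshold values from featured-article seeds (eqs. 3.21-3.22)
    mv1, mv2 -- maximum values from the most-revised articles (eqs. 3.23-3.24)
    """
    days = (latest_edit - first_edit).days
    ef = total_edits / days          # equation 3.19
    rf = reverted_edits / days       # equation 3.20
    # Limitation from the text: values above the maximum count as the maximum
    ef, rf = min(ef, mv1), min(rf, mv2)
    if ef <= tv1 and rf <= tv2:
        # Equation 3.25: relative distance below the threshold
        return ((tv1 - ef) / tv1 + (tv2 - rf) / tv2) / 2
    # Equation 3.26: relative position between threshold and maximum
    return ((ef - tv1) / (mv1 - tv1) + (rf - tv2) / (mv2 - tv2)) / 2


# Prague example with the threshold and maximum values derived in the text
score = semantic_stability_score(5869, 765, date(2001, 11, 6), date(2020, 10, 26),
                                 tv1=2.22, tv2=0.42, mv1=6.122, mv2=0.620)
print(round(score, 3))  # ~0.68, i.e. 68%
```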
"Technical verifiability" is focusing on the existence of supporting information on the referenced material. "Practical verifiability" is focus- ing on the accessibility of the referenced material. In this work, the author applies the "technical verifiability" with supplementary metrics to improve the quality assessment of the article. Those supplementary metrics are circular reference and broken link.</p><p>Figure 3.6: An illustration about information loop.</p><p>Circular reference arose when editors cite other Wikipedia arti- cles as a references, this might cause an information loop/circular reference. From the article page, circular reference is denoted by a special template {{Circular reference|}}15. Editors use this template to</p><p>14. See https://en.wikipedia.org/wiki/Wikipedia:Verifiability 15. See https://en.wikipedia.org/wiki/Template:Circular_reference</p><p>52 3. Model of Assessing the Data Quality of Wikipedia</p><p> flag the references that citing other Wikipedia article as the source of information. From the Wiki-code, the information loop is placed between special tags <ref> [[...]] </ref>, as shown in figure 3.7 [122].</p><p>Figure 3.7: Circular references in Wikipedia article about Poznań.</p><p>Broken link aim for flagging articles that have incomplete or miss- ing supporting information16 of the referenced material, or there are no references completely17 [122]. These templates are divide into several types based on these variables: type of cite, title, author(s) information, URL, URL status, access date, publisher, type of cite, ISBN, JSTOR, PMC,</p><p>16. See https://en.wikipedia.org/wiki/Template:Citation_needed 17. See https://en.wikipedia.org/wiki/Template:Unreferenced</p><p>53 3. Model of Assessing the Data Quality of Wikipedia</p><p>PMID, arXiv, ISBN, ISSN, OCLC, and others18. In some cases, those variables are missing from the references. ∙ Measure: analysis on the existence of supporting information on the referenced material (broken link) and the circular refer- ences. ∙ Unit of measure: percentage. ∙ Variable of measure: references. ∙ Equation: VB = R − B − C. (3.27) where R – number of references, B – number of broken links (incomplete or missing information of the referenced material), C – number of circular references. Why focusing on reference to achieve the verifiability dimension? If taking into account the word verifiable, it means able to be checked, thus needing to use something as a tool for the checking action. In Wikipedia, reference is the tool for the user to verify the editors’ works on Wikipedia. Thus, ensuring the reference quality (no incomplete or missing information, and no circular reference) means the verifi- ability dimension is accomplished. The author prepared guidelines to determine the optimal value of this dimension, the steps are listed below:</p><p>- A = {Ai,..., An}, for 0 < i ≤ n, where A is a set of (n) seeds of featured articles from Wikipedia;</p><p>- Each seed of articles, Ai, was analyzed to extract (j) values. New nested set of A = {Ai,j,..., An,m} were constructed based on this step, for 0 < j ≤ m;</p><p>- New sets were constructed from each of {Ai,j}, R = {ri,..., rn}, B = {bi,..., bn}, and C = {ci,..., cn}, for 0 < i ≤ n, where R is collection of number of references, B is collection of number of broken links, and C denotes the collection of number of circular references;</p><p>18. See https://en.wikipedia.org/wiki/Help:Citation_Style_1</p><p>54 3. 
Model of Assessing the Data Quality of Wikipedia</p><p>- Implementing the sets of (R, B, C) into the equation 3.27, new set of VB = {VBi,..., VBn}, for 0 < i ≤ n, was constructed; - The optimal value is determined by the mean of the set of VB = {VBi,..., VBn}: 1 n OptimalValue = ∑ VBi, (3.28) n i=1 - Then, the score of verifiability dimension for article (x) is for- mulated into: VB(x) VB = . (3.29) score OptimalValue For example, the author accumulates the number of references, the number of broken links, and the number of circular references to calculate the verifiability dimension from the Wikipedia article of Poznań19, the values are R = 63, B = 4, and C = 2 by equation 3.27:</p><p>VBpozna ´n = 63 − 4 − 2 = 57, By using the elements of set A from Table 3.8 below to calculate the optimal value with equation 3.28: 1 5 2475 = ∑ VBi = = 495, 5 i=1 5 Then, the score of the verifiability dimension Wikipedia article of Poznań is: VB 57 = Poznan´ = = 0.115152 = 12%. OptimalValue 495</p><p>3.2 Measurement of Data Quality Dimensions of Wikipedia Infobox</p><p>As discussed in chapter two above, there are three dimensions selected to assess the quality of the Wikipedia infobox: completeness, cred- ibility, and semantic connection. But what is the reason the author employed those dimensions?</p><p>19. See https://en.wikipedia.org/wiki/Pozna%C5%84</p><p>55 3. Model of Assessing the Data Quality of Wikipedia</p><p>Table 3.8: An example of set A from Wikipedia featured articles for determining the optimal value of the verifiability dimension.</p><p>A R B C VB Canada 482 0 2 480 Ronald Reagan 561 3 1 557 Metalloid 662 5 0 657 Evolution 394 0 0 394 Japan 400 6 7 387</p><p>From figure 3.8 below, infobox consists of several elements. The author formulated the infobox’s key elements by analyzing those elements based on research and some previously published literature [1, 93, 123, 20, 21, 124, 125]: parameters, references, and wikilinks. The data quality dimensions assess each of these key elements. The parameters element is assessed by the completeness dimension to ensure the infobox has a complete set of parameters. The references element is assessed by the credibility dimension to assure the users can validate the parameters’ values’ correctness. The wikilinks element are assessed by semantic connection to ascertain the connectedness between resources of information in Wikipedia. The author will elaborate clearly on how to calculate each of the dimensions in the subsections below.</p><p>3.2.1 Dimension of Completeness</p><p>Are there any missing parameters in Wikipedia infobox? How to de- termine the complete set of the parameters for the given type infobox? Before answering those questions, first, we need to understand the basic definition of the completeness dimension. The completeness dimension is described as an expected comprehensiveness [94, 60]. Data can be complete as long as the data meets the expectations or matches with the recommended set [126, 53]. In this work, to verify the completeness dimensions in the Wikipedia infobox is by checking the absence of the parameter in the given in- fobox [124, 1]. This can be done by determining in advance a complete</p><p>56 3. Model of Assessing the Data Quality of Wikipedia</p><p>Figure 3.8: An illustration of infobox structure and elements. set of parameters that will be used as the benchmark. For more details, it will be discussed in the subsection below. 
∙ Measure: a measure of the absence of parameter in the infobox against to all defined parameters in the given type of the in- fobox.</p><p>∙ Unit of measure: percentage.</p><p>∙ Variables of measure: parameters infobox.</p><p>∙ Equation: |FP| ∑ WPi CO = i=1 , (3.30) |DP| ∑i=1 WPi</p><p>57 3. Model of Assessing the Data Quality of Wikipedia</p><p> where |FP| – number of filled parameters, WPi – weight of the parameters Pi, |DP| – number of all defined parameters for the given type of the infobox.</p><p>The calculation of the completeness weight will be explained in the subsection below as well as the example.</p><p>Determining the collection of benchmark parameters In order to create a set of benchmark parameters, the very first thing to do is to determine which category of infoboxes that want to be used to populate those parameters. Those infoboxes have to be in the same category and similar types/conditions. Otherwise, it will be difficult to generalize and normalize each of the parameters because there are too many discrepancies from each of the infoboxes. The formula below can be applied to collect the number of all elements from the set of A = {Ai,j,..., An,m}, for 0 < i ≤ n, and for 0 < j ≤ m.</p><p>NI [ |DP| = A(i,j), (3.31) i=1</p><p>Where NI – the number of the infoboxes, A(i,j) – the set of parameters for every infobox (i) and parameter (j).</p><p>Determining the weight of the parameters After the benchmark parameters are set, the next step is to count the weight for each of the parameters. Weight is based on the frequency of filling the same parameters from the different entities of the infoboxes compare to the benchmark parameters from all entities for the given type of the infoboxes [124].</p><p>∑1≤i≤NI FF(i,j) 1≤j≤FP WP = , (3.32) |DP|</p><p>Where FF(i,j) – the frequency of filling the same parameters (j) for each of the infoboxes (i), NI – the number of the infoboxes, FP – the</p><p>58 3. Model of Assessing the Data Quality of Wikipedia</p><p> number of filled parameters, DP – number of all defined parameters for the given type of the infoboxes. In this example, the author selected a list of cities in Italy to collect the defined parameters of the infoboxes20. There are a hundred forty- four cities in the list and based on the equation 3.31 the number of all defined parameter is: 144 [ |DP| = A(i,j), i=1</p><p>|DP| = A(1,j) ∪ A(2,j) ∪ A(3,j) ∪ ... ∪ A(144,m),</p><p>Where A1 = Rome = {Country, Region, Founded, Founded by, Gov- ernment Type, Government Body, Government Mayor, Area Total, Elevation, Population Rank, Population Density, Population Comune, Population Metropolitan City, Demonym, Time Zone, CAP code, Area code, Website}, A2 = Milan = {Country, Region, Metro, Government Type, Govern- ment Body, Government Mayor, Area Comune, Elevation, Popula- tion Comune, Population Density, Population Metropolitan City, De- monym, Area code, Website }, and so on until A144 = Scandicci. If for the example, only two sets of unions from the list of infoboxes are used in the calculation which is Rome and Milan, then:</p><p>|DP| = A(1,j) ∪ A(2,j)∪,</p><p>|DP| = {Country, Region, Metro, Founded, Founded by, Government Type, Government Body, Government Mayor, Area Comune, Area Total, Elevation, Population Rank, Population Density, Population Comune, Population Metropolitan City, Demonym, Time Zone, CAP code, Area code, Website},</p><p>|DP| = 20.</p><p>The next work calculates the weight for each of the parameters by implementing equation 3.32, as shown in Table 3.9 below. 
Then, final work of the completeness dimension is to calculate the score of the given type of infobox by implementing the equation 3.30. If for the</p><p>20. See https://en.wikipedia.org/wiki/List_of_cities_in_Italy</p><p>59 3. Model of Assessing the Data Quality of Wikipedia</p><p>Table 3.9: Weight of the parameters. Source: own calculation and the numbers are just an example to express the calculation.</p><p>Weight of Frequency Defined Parameters (DP) Parameters of Filling (FF) (WP)= FF/|DP| Country 143 7.15 Region 142 7.1 Metro 110 5.5 Founded 120 6 Founded by 121 6.05 Government Type 2 0.1 Government Body 3 0.15 Government Mayor 142 7.1 Area Comune 2 0.1 Area Total 142 7.1 Elevation 138 6.9 Population Rank 1 0.05 Population Density 135 6.75 Population Comune 3 0.15 Population Metropolitan City 3 0.15 Demonym 144 7.2 Time Zone 141 7.05 CAP Area 1 0.05 Area code 123 6.15 Website 141 7.05 Total WP 87.85</p><p>60 3. Model of Assessing the Data Quality of Wikipedia</p><p> example, A3 ={Country, Region, Demonym, Time Zone, CAP code, Area code, Website}, where DP = 20, and FP = |A3| = 7, then: 7 (7.15 + 7.1 + 7.2 + 7.05 + 0.05 + 6.15 + 7.05) = ∑i=1 20 , ∑i=1 WPi 20 where ∑i=1 WPi is equal to the total of the WP from Table 3.9 which is 87.85 and substitute this number into the equation as shown below. 41.75 = * 100% = 0.47524 = 48%. 87.85</p><p>The score of the completeness dimension for Wikipedia infobox A3 is 48%.</p><p>3.2.2 Dimension of Credibility Similarly, as of the Wikipedia article, one way to assess the credibility of Wikipedia infobox is its number of references [1]. One of the primary purposes of the infobox is to summarize key facts of an article and present them to the reader, allowing the reader to identify those key facts at a glance. Those key facts should be equipped with an adequate reference so that it able to support the value of each filled parameter. Therefore readers can easily find out the correctness of the information. In this work, the author compares the number of references to the number of filled parameters of the infobox. As they represent the core components of credibility: the number of references can be defined as trustworthiness, and the number of filled parameters can be defined as expertise. The calculation of the credibility dimension score is described below. ∙ Measure: analysis of the number of references in the infobox compared to the number of filled parameters in the infobox. ∙ Unit of measure: percentage. ∙ Variables of measure: references, parameters infobox. ∙ Equation: ∑P R CI = i=1 i , (3.33) P</p><p>61 3. Model of Assessing the Data Quality of Wikipedia</p><p> where Ri – number of references in the filled parameter Pi of the infobox, P – number of all filled parameters.</p><p>For equation 3.33 fits into the real-world case, the author uses the mean value theorem from the selected seeds to gain this dimension’s optimal value, the steps are:</p><p>- A = {Ai,..., An}, for 0 < i ≤ n, where A is a set of (n) seeds of a given type infoboxes from Wikipedia;</p><p>- Each seed of infoboxes, Ai, was analyzed to extract (j) values. 
New nested set of A = {Ai,j,..., An,m} were constructed based on this step, for 0 < j ≤ m;</p><p>- New sets were constructed from each of {Ai,j}, R = {ri,..., rn}, and P = {pi,..., pn}, for 0 < i ≤ n, where R is collection of number of references in the infobox, and P is collection of number of all filled parameters;</p><p>- Implementing the sets of (R, P) into the equation 3.33, new set of CI = {CIi,..., CIn}, for 0 < i ≤ n, was constructed;</p><p>- The optimal value is determined by the mean of the set of CI = {CIi,..., CIn}: 1 n OV = ∑ CIi, (3.34) n i=1</p><p>- Then, the score of credibility dimension for infobox (x) is:</p><p>CI(x) CI = . (3.35) score OV</p><p>As an example, infobox of Anne of Green Gables as shown in figure 3.9 denoted as (x), substitute the number of filled parameters and the number of references into equation 3.33, the following calculation will be obtained: 3 CI = = 0.25, (x) 12</p><p>62 3. Model of Assessing the Data Quality of Wikipedia</p><p>After that, calculate the optimal value by using equation 3.34 and seeds from Table 3.10 below: 1 5 2.6466 = ∑ CIi = = 0.529316, 5 i=1 5 The score of the credibility dimension for Wikipedia infobox Anne of Table 3.10: An example of seeds for the optimal value of credibility dimension for the infobox. A R P CI Canada 5 12 0.4167 Ronald Reagan 7 10 0.7000 Metalloid 4 13 0.3077 Evolution 5 9 0.5556 Japan 8 12 0.6667</p><p>Green Gables is:</p><p>CI(x) 0.25 = = = 0.472307 = 47%. OptimalValue 0.529316</p><p>3.2.3 Dimension of Semantic Connection An infobox template not only consists of a collection of attribute-value pairs and references to support their information but also contains internal links or what is often named as Wikilink21 [93, 123]. Wikilink is one of Wikipedia’s important attributes, connecting one entity with many other related entities within Wikipedia [92]. But why is wikilink important for infobox? Wikilink aims to bind the project together into an interconnected whole and help users navigate one resource to another by clicking on hyperlinks. Wikilinks are one of the resources used by DBpedia or YAGO to enrich their datasets [91, 127, 128, 89]. In accordance with it, viewed from the definition of the dimension of semantic connection, which relates to the relationship between</p><p>21. See https://en.wikipedia.org/wiki/Help:Link</p><p>63 3. Model of Assessing the Data Quality of Wikipedia</p><p>Figure 3.9: An illustration of the infobox about Anne of Green Gables and some of its metrics. entities connected via wikilinks [89], it can be concluded that this dimension is one among the many keys in measuring data quality. The equation below is employed to live the scale of semantic connection by implementing wikilinks jointly of the measuring variables.</p><p>∙ Measure: an analysis of the connection between data (infobox attributes) in the given infobox with other Wikipedia data. By compares the ratio number of wikilinks and the number of filled parameters.</p><p>∙ Unit of measure: percentage.</p><p>∙ Variables of measure: number of wikilinks, parameters in- fobox.</p><p>∙ Equation: W SC = , (3.36) F</p><p>64 3. Model of Assessing the Data Quality of Wikipedia</p><p> where W – number of wikilinks in the given infobox, F – num- ber of filled parameters.</p><p>Since not all parameters within the infobox are required to contain wikilinks, no definite rules are requiring how many wikilinks must be on an infobox. 
Therefore, the author uses an average value from the optimal value calculation to determine the score of this dimension.</p><p>- A = {Ai,..., An}, for 0 < i ≤ n, where A is a set of (n) seeds of infoboxes from Wikipedia;</p><p>- Each seed of infoboxes Ai was analyzed to extract (j) values. New nested set of A = {Ai,j,..., An,m} were constructed based on this step, for 0 < j ≤ m;</p><p>- New sets were constructed from each of {Ai,j}, W = {wi,..., wn}, and F = { fi,..., fn}, for 0 < i ≤ n, where W is collection of number of internal links Wikipedia/ Wikilinks, and F is collection of number of filled parameters;</p><p>- Implementing the sets of (W, F) into the equation 3.36, new set of SC = {SCi,..., SCn}, for 0 < i ≤ n, was constructed;</p><p>- The optimal value of SC = {SCi,..., SCn} is:</p><p>1 n OptimalValue = ∑ SCi, (3.37) n i=1</p><p>- After that, the score of semantic connection dimension for infobox (x) is formulated into:</p><p>SC(x) SC = . (3.38) score OptimalValue</p><p>For instance, figure 3.9, denoted as (x) can be used as an example to calculate quality metrics of this dimension, the following calculation will be obtained: 6 SC = = 0.5, (x) 12</p><p>65 3. Model of Assessing the Data Quality of Wikipedia</p><p>The optimal value is calculated by implementing the seeds from Table 3.11, as below:</p><p>1 5 6.8574 = ∑ SCi = = 1.371484, 5 i=1 5 The score of the semantic connection dimension for Wikipedia infobox</p><p>Table 3.11: An example of seeds of infoboxes from the Wikipedia featured articles. A W F SC Canada 50 38 1.3158 Ronald Reagan 41 31 1.3226 Istanbul 33 35 0.9429 Kolkata 46 28 1.6429 Japan 49 30 1.6333</p><p> of figure 3.9 is:</p><p>SC(x) 0.5 = = = 0.364569 = 36%. OptimalValue 1.371484</p><p>3.3 Scoring Data Quality of Wikipedia Article and Infobox</p><p>The author develops several stages in determining the final score of data quality dimensions of Wikipedia. In the first step, the author classified data quality dimensions measurement into two main types based on Wikipedia elements: article and infobox, then distributed each type’s weight based on their number of dimensions, and divided by the total number of dimensions from both types. In the second step, the author formulates a rank system for each data quality dimension measurement types. This rank system is based on research results in [53, 126, 3]. The researchers in [53, 126, 3] had conducted experiments using seeds from featured and random articles to rank the most frequently occurring dimensions. According to that,</p><p>66 3. Model of Assessing the Data Quality of Wikipedia</p><p> the author classifies and ranks each of the dimensions based onthe similarity of definition and metrics used.</p><p>Table 3.12: List of weights of the article and infobox dimensions.</p><p>#Rank Article Dimension Point Weight #1 Credibility 6 29% #2 Verifiability 5 24% #3 Format Compliance 4 19% #4 Reputation 3 14% #5 Semantic Stability 2 10% #6 Popularity 1 5% #Rank Infobox Dimension Point Weight #1 Completeness 3 50% #2 Credibility 2 33% #3 Semantic Connection 1 17%</p><p>In the third step, the author formulates the weight for each dimen- sion measurement type according to its ranking. Each rank has its point. The highest rank gets the biggest points, as well as the lowest grade gets the lowest points. The weight is calculated based on each dimension’s points divided by the total number of points, as shown in equation 3.39 below. 
The amount of the weight value is determined by the number of points obtained from the ranking system, as shown in Table 3.12 above.

    W = P / T,    (3.39)

where W – the weight of the dimension, P – the dimension's points, T – the total number of points for each type of data quality dimensions.

As the final step, after determining each dimension's weight, the author formulates the final score over both types, as shown in equation 3.40 below.

    Final_Score = (n/(n+m)) Σ_{i=1}^{n} (WAD_(i) * AS_(i)) + (m/(n+m)) Σ_{j=1}^{m} (WID_(j) * IS_(j)),    (3.40)

where n – number of dimensions on the article, for 0 < i ≤ n, m – number of dimensions on the infobox, for 0 < j ≤ m, WAD_(i) – weight of dimension (i) on the article, AS_(i) – score of dimension (i) on the article, WID_(j) – weight of dimension (j) on the infobox, IS_(j) – score of dimension (j) on the infobox.

For instance, Table 3.13 below lists the scores from the examples of each dimension above; they are used to calculate the final score for article (x) by implementing equation 3.40:

Table 3.13: An example of scores of article and infobox dimensions.

Article Dimension     Score
Credibility           0.636260
Format Compliance     0.522388
Popularity            0.603223
Reputation            0.622155
Semantic Stability    0.679176
Verifiability         0.115152

Infobox Dimension     Score
Completeness          0.475240
Credibility           0.472307
Semantic Connection   0.364569

    = (6/9) Σ_{i=1}^{6} (WAD_(i) * AS_(i)) + (3/9) Σ_{j=1}^{3} (WID_(j) * IS_(j)),

    = (6/9) ((0.285714 * 0.63626) + (0.190476 * 0.522388) + (0.047619 * 0.603223) + (0.142857 * 0.622155) + (0.095238 * 0.679176) + (0.238095 * 0.115152))
      + (3/9) ((0.5 * 0.47524) + (0.333333 * 0.472307) + (0.166667 * 0.364569)),

    = (6/9) (0.490996) + (3/9) (0.455817),

    = 0.327331 + 0.151939,

    = 0.479270 = 48%.

The author uses a traffic light visualization system to describe the Wikipedia article's final score. As in the example above, the final score of Wikipedia article (x) is 48%, so the red light lights up, meaning the article (x) is in a poor quality state, as shown in figure 3.10 below. The yellow light lights up if the article's final score lies in the range between 50% and 75%, meaning the article is in a good quality state. Likewise, the green light lights up if the final score reaches at least 76%, meaning the article is in a high-quality state.

Figure 3.10: An illustration of the final score in the traffic light scorecard.
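To tie the pieces together, the following Python sketch combines equation 3.39 (weights derived from the points in Table 3.12) with equation 3.40 and the traffic-light thresholds described above, using the example scores of Table 3.13. It is an illustration under these assumptions, not the thesis application itself; names and the handling of the boundary between 75% and 76% follow the text as stated.

```python
# Points from Table 3.12; weights follow equation 3.39 (point / total points)
ARTICLE_POINTS = {"credibility": 6, "verifiability": 5, "format_compliance": 4,
                  "reputation": 3, "semantic_stability": 2, "popularity": 1}
INFOBOX_POINTS = {"completeness": 3, "credibility": 2, "semantic_connection": 1}


def weights(points):
    total = sum(points.values())
    return {name: p / total for name, p in points.items()}


def final_score(article_scores, infobox_scores):
    """Equation 3.40: weighted article and infobox parts, mixed by dimension counts."""
    n, m = len(article_scores), len(infobox_scores)
    wa, wi = weights(ARTICLE_POINTS), weights(INFOBOX_POINTS)
    article_part = sum(wa[d] * s for d, s in article_scores.items())
    infobox_part = sum(wi[d] * s for d, s in infobox_scores.items())
    return (n / (n + m)) * article_part + (m / (n + m)) * infobox_part


def traffic_light(score):
    # Thresholds as described in the text: green >= 76%, yellow 50-75%, red otherwise
    if score >= 0.76:
        return "green"
    return "yellow" if score >= 0.50 else "red"


# Example scores from Table 3.13
article_scores = {"credibility": 0.636260, "format_compliance": 0.522388,
                  "popularity": 0.603223, "reputation": 0.622155,
                  "semantic_stability": 0.679176, "verifiability": 0.115152}
infobox_scores = {"completeness": 0.475240, "credibility": 0.472307,
                  "semantic_connection": 0.364569}

score = final_score(article_scores, infobox_scores)
print(round(score, 4), traffic_light(score))  # ~0.4793 red
```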
4 Experiment and Implementation

4.1 Experiment

In this section, the proposed Model of Assessing the Data Quality of Wikipedia, as presented in Chapter 3, will be applied in practice. Each step of the experiment will be explained, starting from the dataset, the design of the processes, the APIs, and the Python libraries, up to the machine configuration used in this experiment. This will help readers and other researchers better understand how the proposed methodology is applied; the experiment's outcome is explained in the Result section below.

4.1.1 Datasets

The author employed three Wikipedia article categories to determine each dimension's optimal values and the semantic stability dimension's maximum values, namely: City, Musician, and Film. Each category is described below:

City

In the category of City, the author selected 50 Wikipedia articles on cities1 to populate the optimal values for all dimensions and the predefined infobox parameters and their values for the infobox completeness dimension, as shown in figure 4.1 below. The selected 50 Wikipedia articles span two Wikipedia quality assessment levels: Featured Articles (FA) and Good Articles (GA). The author applies the Featured Articles because these are high-quality Wikipedia articles. However, the number of articles about cities at this quality level is very limited. Moreover, some of these FA articles do not have adequate infobox parameters, or the types of infobox parameters may differ from article to article, as shown in figure 4.2. Therefore, the author considers the GA articles to be of comparable quality to the FA articles and uses them to cover the shortage of FA articles when determining the optimal values for all the data quality dimensions proposed in Chapter 3.

1. See https://en.wikipedia.org/wiki/Draft:List_of_featured_and_good_articles_about_cities

Figure 4.1: A Wikipedia page: the list of the FA and GA articles on cities.

The maximum value of the semantic stability data quality dimension is determined based on the 50 articles with the most revisions on cities. Figure 4.3 shows the Wikipedia page "Drafts:List of most revisions articles by cities2," which lists the articles used as samples in the dataset. All those 50 articles were selected from Wikipedia's special page "Pages with the most revisions3."

Musician

In the category of Musician, the author selected 50 Wikipedia articles on musicians' biographies4 in order to populate the optimal values for all dimensions and the predefined infobox parameters and their values for the infobox completeness dimension. The selected 50 Wikipedia articles span two Wikipedia quality assessment levels: Featured Articles (FA) and Good Articles (GA). The author applies both quality assessment levels for the same reasons as explained for the category of City.

2. See: https://en.wikipedia.org/wiki/Draft:List_of_the_most_revisions_articles_by_cities
3. See: https://en.wikipedia.org/w/index.php?title=Special:MostRevisions
4. See: https://en.wikipedia.org/wiki/Draft:List_of_featured_and_good_articles_about_musicians

Figure 4.2: The differences between Wikipedia infobox parameters on FA: Altrincham and GA: Atlanta.

The maximum value of the semantic stability data quality dimension is determined based on the 50 articles with the most revisions on musicians' biographies. The author created a Wikipedia page "Draft:List of the most revisions articles by musicians5," which lists the articles used as samples in the dataset. All the 50 articles were selected from Wikipedia's special page.

5. See: https://en.wikipedia.org/wiki/Draft:List_of_the_most_revisions_articles_by_musicians

Figure 4.3: A Wikipedia page: the list of the most revisions articles on cities.

Film

In the category of Film, the author selected 50 Wikipedia articles on films6 in order to populate the optimal values for all dimensions and the predefined infobox parameters and their values for the infobox completeness dimension. The selected 50 Wikipedia articles consist of Featured Articles (FA).
The author applies the Featured Ar- ticles because these articles are high-quality Wikipedia articles. In this category, the FA articles have enough articles with adequate infobox parameters. The maximum value of the semantic stability data quality dimen- sion is determined based on the 50 articles with the most revisions on films. The author created a Wikipedia page "Draft:List of the most revisions articles about films7," which is a list of the articles used as sam- ples in the dataset. All the 50 articles were selected from Wikipedia’s special page.</p><p>6. See: https://en.wikipedia.org/wiki/Draft:List_of_50_featured_ articles_about_films 7. See: https://en.wikipedia.org/wiki/Draft:List_of_the_most_ revision_articles_about_films</p><p>73 4. Experiment and Implementation</p><p>4.1.2 Development Design and Environment The author designs this application based on the measurement method- ology and data on each of the dimensions proposed in chapter 3. The author divided the application into two parts based on its function: the Predefined Data application and the GUI application. The predefined data application is used to collect and generate the dimensions’ optimal values from the dataset described in the previous subsection. Later, it will be used in the GUI application processes. This application consists of 5 processes, as shown in figure 4.4 below.</p><p>Figure 4.4: The predefined data application processes</p><p>1) This process will break down the Wikipedia articles in each list into a chunk of data in HTML format. It starts to find all the required metrics using a web scraping function or by making an API call on each article in the lists.</p><p>2) The process of extracting the required metrics values for each dimension and store them.</p><p>3) Insert each article’s metrics values into the equations for each dimension. Then, it stores the result as the optimal value.</p><p>4) This process will collect all the optimal values of the dimen- sions in order to generate the predefined data.</p><p>5) The store process of the predefined data into a configuration file used for the GUI application. The GUI application is used as the interface for the user in or- der to assess the data quality of Wikipedia. It consists of 6 processes, as shown in figure 4.5 below. The author limits this application in- put into three Wikipedia articles categories only. The author makes limitation because of the methodology measurement of the complete- ness dimension of the infobox. As elaborated earlier in Chapter 3, the</p><p>74 4. Experiment and Implementation</p><p> completeness dimension aims to measure the incomplete or missing attributes in the Wikipedia infobox and employed the attributes as the equation’s metrics. As a matter of fact, each infobox in Wikipedia has different attributes according to its types. For instance, the infobox about cities has different attributes from the infobox about musicians. Furthermore, the infobox about cities in one country has different attributes from other countries’ infobox about cities, even though they have the same infobox type.</p><p>Figure 4.5: The GUI application processes</p><p>To maintain the validity of the application and due to time con- straints, the author develops this application based on three categories of Wikipedia articles: city, musician, and film. 
The GUI application processes are described below.

1) The user inserts a Wikipedia article URL as the input for the next process and chooses one category from the list.

2) This process breaks the input URL down into chunks of HTML data to find the infobox type, as shown in Figure 4.6 below, and then checks whether the infobox type matches any of the categories on the given list.

3) If the input URL's infobox does not match, a warning is shown and the process returns to the initial (input) process.

4) If the input URL's infobox matches, the predefined data is loaded and the next process continues.

5) The required metric values for each dimension are extracted, and the predefined data is applied to calculate each dimension's score.

6) The final score is calculated and the result is visualized in the interface as a traffic-light scorecard.

Figure 4.6: The three categories of the infobox for the predefined data in HTML script.

Wikipedia API

The author employed three Wikipedia APIs to retrieve the metric values for the measurement of the popularity and semantic stability dimensions. The application accesses the API URLs in JSON format and then scrapes the returned data to find the metrics and their values. Each API is described below (a minimal code sketch of these calls follows this list).

∙ https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/{article}/monthly/{start}/{end}
where {article} is the article title, {start} is the start date (yyyyMMdd), and {end} is the end date (yyyyMMdd). This API is employed to get the monthly number of pageviews for the popularity dimension, as shown in Figure 4.7 below.

Figure 4.7: The pageviews metric is presented as views in the JSON output of the Wikipedia API.

∙ https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvlimit=1&rvprop=timestamp&rvdir=newer&format=json&titles={article}
where {article} is the article title. This API is employed to get the page creation date for the popularity dimension and the first edit date for the semantic stability dimension, as shown in Figure 4.8 below.

Figure 4.8: The page creation date and first edit date metrics are presented as timestamp in the JSON output of the Wikipedia API.

∙ https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvlimit=1&rvprop=timestamp&rvdir=older&format=json&titles={article}
where {article} is the article title. This API is employed to get the latest edit date for the semantic stability dimension, as shown in Figure 4.9 below.

Figure 4.9: The latest edit date metric is presented as timestamp in the JSON output of the Wikipedia API.
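As a minimal illustration of how these two endpoints can be queried, the following sketch fetches the monthly pageviews and the first-revision timestamp for a placeholder article title and date range; it is not the author's implementation code (the full functions appear in the Appendix).

# A minimal sketch of querying the two Wikipedia APIs described above.
# The article title and the date range are placeholders.
import requests

article = "Prague"

# Monthly pageviews by ordinary users (popularity dimension).
pv_url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
          "en.wikipedia/all-access/user/" + article + "/monthly/2020100100/2020103100")
views = requests.get(pv_url, headers={"User-Agent": "dq-thesis-example"}).json()["items"][0]["views"]

# Timestamp of the oldest revision, i.e. the page creation / first edit date.
api_url = ("https://en.wikipedia.org/w/api.php?action=query&prop=revisions"
           "&rvlimit=1&rvprop=timestamp&rvdir=newer&format=json&titles=" + article)
pages = requests.get(api_url).json()["query"]["pages"]
created = next(iter(pages.values()))["revisions"][0]["timestamp"]

print(views, created)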
XTools API

The author employed four APIs from XTools (footnote 8), which is built on the Symfony framework, to retrieve metric values in JSON format.

8. See https://xtools.readthedocs.io/en/stable/index.html

∙ https://xtools.wmflabs.org/api/page/prose/{project}/{article}
where {project} is the project domain or database name and {article} is the article title. This API is employed to get statistics about the prose (characters, word count, references, etc.). The 'words' and 'unique_references' values are used as metrics of the credibility dimension, and 'references' is used in the verifiability dimension, as shown in Figure 4.10 below.

Figure 4.10: The XTools API of page prose.

∙ https://xtools.wmflabs.org/api/page/links/{project}/{article}
where {project} is the project domain or database name and {article} is the article title. This API is employed to get the number of incoming and outgoing links and redirects for the given Wikipedia article. The 'links_in_count' value is used as a metric in the popularity and format compliance dimensions, 'links_ext_count' is used as the number of external links in the reputation dimension, and 'links_out_count' is used as a metric in the reputation and format compliance dimensions, as shown in Figure 4.11 below.

Figure 4.11: The XTools API of page links.

∙ https://xtools.wmflabs.org/api/page/articleinfo/{project}/{article}
where {project} is the project domain or database name and {article} is the article title. This API is employed to get basic information about the history of the given Wikipedia article. The 'watchers' value is used as a metric in the popularity dimension, and 'revisions' is used as the number of total edits in the popularity, reputation, and semantic stability dimensions, as shown in Figure 4.12 below.

Figure 4.12: The XTools API of page article info.

∙ https://xtools.wmflabs.org/api/page/top_editors/{project}/{article}/{start}/{end}/{limit}?nobots=1
where {project} is the project domain or database name, {article} is the article title, {start} is the start date (yyyyMMdd), {end} is the end date (yyyyMMdd), {limit} is the number of results to return (20 by default, 1000 at most), and ?nobots=1 excludes bots from the results. This API is employed to get the top editors of a page by edit count and is used in the reputation dimension as the metric of edits made by the top 10% of editors, as shown in Figure 4.13 below.

Figure 4.13: The XTools API of top editors.

Python Libraries

Table 4.1 below shows the Python libraries and modules, with their descriptions, that the author used to develop the application. Some of the libraries and modules are built into Python.

Table 4.1: The Python libraries and modules and their descriptions.
Python Library | Description
EasyTkinter | a Python GUI framework used to build the application interface.
Bs4 | used for extracting and processing data out of HTML and XML files.
Numpy | an open-source numerical Python library for working with multidimensional arrays.
Pandas | used for data manipulation and analysis; its key data structure is called the DataFrame.
Requests | used to send HTTP requests; a request returns a response object with all the response data.
Urllib3 | used for fetching URLs, i.e., as an HTTP client.
Collections | a built-in Python module for storing collections of data, for example list, dict, and set.
Datetime | a built-in Python module that supplies classes for working with dates and times.
Functools | a built-in Python module for passing a function as an argument to another function and/or returning other functions.
Re | a built-in Python module that handles regular expressions.

Development Environment

All the experiments are implemented on a notebook computer with an Intel(R) Core(TM) i3-7100U CPU running at 2.40 GHz (4 CPUs), 8 GB of 2400 MHz DDR4 RAM, and Windows 10 Pro 64-bit. The application is developed in Python 3.7.4 with PyCharm 2020.2.3 as the IDE, using the Python libraries classified as Frontend and Backend in Table 4.2 below; the Backend is further divided into Data Processing and Web Scraping, as shown in Table 4.3 below.

Table 4.2: The Python libraries by classification (Frontend).
Frontend: EasyTkinter 1.1.0

Table 4.3: The Python libraries and modules by classification (Backend).
Data Processing: Pandas v1.1.4, Numpy v1.19.4, Re, Collections, Functools, Datetime
Web Scraping: Bs4 v0.0.1, Urllib3 v1.26.2, Requests v2.25.0

4.2 Implementation

There are two interfaces: Main and Result. Interface 1 is called Main and interface 2 is called Result. The Main interface contains three elements, as shown in Figure 4.14:

∙ 1.1 is a dropdown/combo box for the category selection.
∙ 1.2 is a textbox for entering the Wikipedia article URL.
∙ 1.3 is a button to start the program.

Figure 4.14: Main interface of the application.

The author divides the Main interface process into two parts, the input process and the background process, each of which is explained step by step below. The simple instructions for the input process are as follows (a minimal interface sketch is given after these steps):

1. The user selects one of the categories in the dropdown box, as shown in Figure 4.15 below.
2. The user then inputs the Wikipedia article URL based on the chosen category.
3. The user presses the Start button to run the application.
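A minimal sketch of such a Main interface in plain Tkinter follows. The thesis application is built with the EasyTkinter wrapper, and the widget layout and the callback body below are illustrative assumptions, not the author's code; the real Start callback runs the full back-end process described next.

# A minimal Tkinter sketch of the Main interface: a category dropdown (1.1),
# a URL textbox (1.2) and a Start button (1.3).
import tkinter as tk
from tkinter import ttk, messagebox

CATEGORIES = ["City", "Musician", "Film"]

def start_assessment():
    url = url_entry.get()
    category = category_box.get()
    if not url or category not in CATEGORIES:
        messagebox.showwarning("Warning", "Please choose a category and enter a Wikipedia article URL.")
        return
    # Here the back-end process would scrape the URL, match the infobox type,
    # load the predefined data and calculate the dimension scores.
    print("Assessing", url, "as", category)

root = tk.Tk()
root.title("Wikipedia Data Quality Assessment")

category_box = ttk.Combobox(root, values=CATEGORIES, state="readonly")   # element 1.1
category_box.pack(padx=10, pady=5)

url_entry = tk.Entry(root, width=60)                                      # element 1.2
url_entry.pack(padx=10, pady=5)

tk.Button(root, text="Start", command=start_assessment).pack(pady=10)     # element 1.3

root.mainloop()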
After the user presses the Start button, the application immediately starts the back-end process, which proceeds step by step as follows:

1. The application scrapes the input URL and inspects the Wikipedia article page.
2. The application finds the infobox's name and matches it against the predefined infobox name of each category, as explained in the subsection above.
3. If the input URL's infobox name matches the predefined infobox name of the chosen category, the application continues to extract all the metrics and starts the calculation for each dimension.
4. The score for each dimension and the final score are shown in the Result interface, which is explained in detail below.
5. If the input URL's infobox name does not match the predefined infobox name of the chosen category, the application stops and pops up a warning notification, as shown in Figure 4.16 below.

Figure 4.15: The Main interface of the application with the list of categories.

Figure 4.16: The warning popup box to notify the error.

The Warning popup box is a popup dialog box whose function is to show any warning or error that occurs while the application is running. The Result interface shows the given Wikipedia input URL, each dimension's score in table form, and the final score as a traffic-light scorecard, as shown in Figure 4.17 below.

∙ 2.1 is a textbox showing the given Wikipedia article URL.
∙ 2.2 is a table that shows each dimension's score and the final score.
∙ 2.3 is a traffic-light scorecard that visualizes the final score. The red light is lit if the final score is below 50%, which means low data quality. The yellow light is lit if the final score is between 50% and 75%, which means medium data quality. If the final score is greater than 75%, the green light is lit, which means high data quality.

Figure 4.17: Result interface of the application.

4.3 Results

The author conducted the experiments for the three selected categories: City, Musician, and Film. Each category consists of 10 articles, so in total the author tested 30 articles. Looking at the final scores per category, only around 16% of the articles can be considered to contain high-quality data. The following paragraphs give the details of the experiments for each category; the legend for all tables is as follows:

∙ D1 = Article credibility dimension.
∙ D2 = Article format compliance dimension.
∙ D3 = Article popularity dimension.
∙ D4 = Article reputation dimension.
∙ D5 = Article semantic stability dimension.
∙ D6 = Article verifiability dimension.
∙ D7 = Infobox completeness dimension.
∙ D8 = Infobox credibility dimension.
∙ D9 = Infobox semantic connection dimension.
∙ G = Green (high data quality; final score > 75%).
∙ Y = Yellow (medium data quality; 50% ≤ final score ≤ 75%).
∙ R = Red (low data quality; final score < 50%).

Table 4.4 below shows the experiment results for the City category. The author employed 10 Wikipedia articles about cities in the Czech Republic. The verifiability dimension has the lowest average score, which means that almost all the selected articles have low-quality references. The highest average score is achieved by the format compliance dimension, which means that almost all the selected articles have good quality in terms of their article structure.

Table 4.4: The experiment results of the City category (articles on cities in the Czech Republic; article dimensions D1-D6 and infobox dimensions D7-D9 in %, final score in %, and traffic light).
Article Title | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | Final Score | Light
Prague | 61 | 105 | 135 | 110 | 9 | 56 | 32 | 175 | 85 | 79 | G
Brno | 90 | 97 | 71 | 57 | 69 | 71 | 45 | 61 | 86 | 72 | Y
Ostrava | 15 | 89 | 55 | 38 | 84 | 6 | 32 | 63 | 91 | 43 | R
Zlín | 4 | 81 | 41 | 22 | 94 | 1 | 30 | 31 | 79 | 33 | R
Olomouc | 20 | 89 | 50 | 41 | 89 | 5 | 31 | 27 | 77 | 39 | R
Pardubice | 37 | 89 | 40 | 24 | 95 | 4 | 30 | 29 | 67 | 40 | R
Plzeň | 23 | 89 | 60 | 31 | 86 | 7 | 30 | 29 | 72 | 39 | R
Liberec | 16 | 73 | 45 | 23 | 92 | 2 | 25 | 31 | 48 | 33 | R
Ústí nad Labem | 18 | 89 | 53 | 18 | 93 | 10 | 30 | 29 | 61 | 37 | R
Hradec Králové | 22 | 89 | 44 | 23 | 94 | 1 | 31 | 29 | 61 | 37 | R
AVG | 30.6 | 89 | 59.4 | 38.7 | 80.5 | 16.3 | 31.6 | 50.4 | 72.7 | 45.2 | R
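The Light column of the result tables follows the scorecard thresholds described for the Result interface above; as a minimal sketch (the function name is illustrative, not taken from the author's code):

# Traffic-light classification of the final score, using the thresholds above.
def traffic_light(final_score):
    if final_score > 75:
        return "G"   # green: high data quality
    if final_score >= 50:
        return "Y"   # yellow: medium data quality
    return "R"       # red: low data quality

# Example: the City category average of 45.2% is classified as R.
print(traffic_light(45.2))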
Table 4.5 below shows the experiment results for the Musician category. The author employed 10 Wikipedia articles about musicians from the Czech Republic. The infobox credibility dimension has the lowest average score, which means that almost all the selected articles have an insufficient proportion of references to filled infobox parameters. The article credibility dimension, however, has the highest average score, which means that almost all the selected articles have a good proportion of references to article length (in words).

Table 4.5: The experiment results of the Musician category (articles on musicians from the Czech Republic; article dimensions D1-D6 and infobox dimensions D7-D9 in %, final score in %, and traffic light).
Article Title | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | Final Score | Light
Mikolas Josef | 133 | 94 | 30 | 23 | 85 | 13 | 44 | 0 | 55 | 58 | Y
Hana Hegerová | 289 | 103 | 30 | 15 | 98 | 16 | 43 | 0 | 37 | 88 | G
Ivan Mládek | 36 | 77 | 42 | 5 | 98 | 1 | 34 | 0 | 33 | 32 | R
Lucie Vondráčková | 18 | 85 | 31 | 8 | 98 | 0 | 55 | 0 | 33 | 33 | R
Petr Hapka | 299 | 85 | 17 | 5 | 99 | 1 | 18 | 0 | 154 | 87 | G
Jiří Helekal | 0 | 68 | 27 | 4 | 99 | 0 | 16 | 0 | 99 | 24 | R
Fritz Weiss | 48 | 85 | 40 | 4 | 98 | 5 | 18 | 0 | 82 | 36 | R
Zuzana Marková (soprano) | 76 | 77 | 3 | 5 | 93 | 5 | 17 | 0 | 22 | 35 | R
Vladimír Franz | 77 | 85 | 27 | 9 | 98 | 3 | 14 | 0 | 66 | 40 | R
Gabriela Eibenová | 90 | 68 | 17 | 6 | 98 | 3 | 17 | 0 | 99 | 42 | R
AVG | 106.6 | 82.7 | 26.4 | 8.4 | 96.4 | 4.7 | 27.6 | 0 | 68 | 48 | R

Table 4.6 below shows the experiment results for the Film category. The author employed 10 Wikipedia articles about films produced in the Czech Republic. The verifiability dimension has the lowest average score, which means that almost all the selected articles have low-quality references. The highest average score is achieved by the semantic stability dimension, which means that almost all the selected articles have a good balance between the frequency of edits and reverted edits.

Table 4.6: The experiment results of the Film category (articles on films produced in the Czech Republic; article dimensions D1-D6 and infobox dimensions D7-D9 in %, final score in %, and traffic light).
Article Title | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | Final Score | Light
Kolya | 137 | 112 | 54 | 62 | 96 | 5 | 92 | 62 | 90 | 82 | G
The Ear | 63 | 93 | 45 | 5 | 98 | 0 | 81 | 0 | 62 | 49 | R
Marketa Lazarová | 113 | 112 | 32 | 28 | 95 | 35 | 95 | 116 | 85 | 84 | G
The Firemen's Ball | 33 | 93 | 40 | 22 | 96 | 3 | 73 | 0 | 40 | 43 | R
Daisies (film) | 50 | 93 | 55 | 21 | 95 | 7 | 76 | 35 | 37 | 51 | Y
A Report on the Party and the Guests | 91 | 74 | 25 | 6 | 99 | 0 | 73 | 0 | 32 | 48 | R
Closely Watched Trains | 37 | 112 | 47 | 48 | 94 | 6 | 89 | 31 | 51 | 55 | Y
The Shop on Main Street | 82 | 112 | 43 | 42 | 97 | 5 | 73 | 38 | 89 | 63 | Y
Intimate Lighting | 111 | 84 | 31 | 13 | 98 | 3 | 60 | 0 | 29 | 52 | Y
The Sun in a Net | 32 | 74 | 35 | 6 | 98 | 3 | 61 | 0 | 9 | 35 | R
AVG | 74.9 | 95.9 | 40.7 | 25.3 | 96.6 | 6.7 | 77.3 | 28.2 | 52.4 | 56.2 | Y

Table 4.7 below shows that the Musician category obtained the highest average score in the credibility dimension. A possible reason is that almost all the selected articles in this category have high-quality data in terms of the proportion of article references to article length (in words).

Table 4.7: The average score of each dimension in each category.
Average Value of Article Dimension | City (%) | Musician (%) | Film (%)
Credibility | 30.6 | 106.6 | 74.9
Format Compliance | 89 | 82.7 | 95.9
Popularity | 59.4 | 26.4 | 40.7
Reputation | 38.7 | 8.4 | 25.3
Semantic Stability | 80.5 | 96.4 | 96.6
Verifiability | 16.3 | 4.7 | 6.7
Average Value of Infobox Dimension | City (%) | Musician (%) | Film (%)
Completeness | 31.6 | 27.6 | 77.3
Credibility | 50.4 | 0 | 28.2
Semantic Connection | 72.7 | 68 | 52.4

5 Conclusion

This chapter summarizes the thesis and provides further insight with respect to the research question. It covers the research findings of the author's work, their practical implications, and the limitations linked with them.
Finally, the chapter outlines future research and how it can enhance the discussion of the research question.

5.1 Research Findings

The main purpose of this study is to find a prevalent methodology that helps users determine the level of data quality on Wikipedia. The contribution is a prototype for a multi-dimensional assessment of data quality that answers the research question "How to assess the data quality and determine the level of data quality in Wikipedia?" The proposed methodologies, together with the literature review, the experiments, and the prototype, help answer this question.

In the beginning, the author defined the data quality issues on Wikipedia and identified the importance of scoring the quality of Wikipedia. In the literature review, the author analyzed multiple data quality dimensions related to Wikipedia's data quality flaws. Based on this research, the author mapped the quality flaws to data quality dimensions and refined 9 data quality dimensions that users need in order to score the quality of Wikipedia: credibility, format compliance, popularity, reputation, semantic stability, and verifiability for the Wikipedia article, and completeness, credibility, and semantic connection for the Wikipedia infobox. Proposing these 9 methodologies helps answer the research question by applying all the steps in each methodology. As shown in the experiment results in Chapter 4, the author experimented on three categories, namely City, Musician, and Film; a total of 30 articles were assessed based on the 9 proposed dimensions. In general, the Film category has higher quality than the other two categories: it prevails in the article format compliance dimension and in the infobox completeness dimension, which means that almost all the selected articles in this category have high quality in terms of article structure and have more of the infobox's required parameters filled in.

This result can help project researchers or data scientists consider the proposed 9 dimensions when determining the data quality level of Wikipedia articles. Another contribution is to help the Wikipedia community identify which articles or infoboxes need to be improved based on these data quality dimensions, for each of which the author has provided a methodology.

5.2 Practical Implications

The primary implication of this research is that it offers one solution to poor data quality issues on Wikipedia and can assist Wikipedians or the Wikipedia community in determining the data quality score of Wikipedia articles by implementing the proposed methodologies. Furthermore, the proposed methodologies are built into a desktop application that can be installed on any desktop PC with a configuration similar to the one described previously.
In addition, this study focuses on the 9 data quality dimensions, describes them in detail, including their metrics, and explains how each of these dimensions affects the assessment result in the end.

5.3 Limitations and Future Research

This study is limited in the variety of data quality issues on Wikipedia that it covers: the author has focused on the 9 data quality dimensions, on three Wikipedia article types, and on 50 seed articles from each list as the datasets. Further researchers can focus on these deficiencies, identify more data quality issues on Wikipedia across other types of articles, and add more seeds to the datasets.

The author's future work is closely connected with this research. The future study aims to achieve a complete set of multi-dimensional assessments of data quality that helps Wikipedia and the Wikipedians improve the level of data quality both in Wikipedia articles and in infoboxes.

Appendix: Key Implementation Code

The listings below assume the following imports:

import datetime
import re
import requests
from urllib import request

Listing 1: Function for calculating the optimal value of the article credibility dimension

# Function for calculating the optimal value of the article credibility dimension
def Optimal_Credibility(url=""):
    # Create a request call with the base API and the wiki URL to fetch the API data in dictionary format
    API_1_url = requests.get("https://xtools.wmflabs.org/api/page/prose/" + url)
    # Fetch the API data into dictionary format
    json_data1 = API_1_url.json()
    # Fetch the value of unique_references from the dictionary
    unique_references = json_data1['unique_references']
    # Fetch the value of words from the dictionary
    words = json_data1['words']
    # Return the optimal value
    return (unique_references / words)

Listing 2: Function for calculating the optimal values of the article popularity dimension

# Function for calculating the optimal values of the article popularity dimension
def Optimal_Popularity(url=""):
    # Split the url into the project domain and the article title
    url_header, url_title = list(map(str, url.split("/")))
    # Set the article title into the API urls
    API_0_url = "https://xtools.wmflabs.org/articleinfo/" + url
    API_2_url = "https://xtools.wmflabs.org/api/page/links/" + url
    API_3_url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
                 + str(url_header) + "/all-access/user/" + str(url_title)
                 + "/monthly/2020100100/2020103100")
    API_4_url = "https://xtools.wmflabs.org/api/page/articleinfo/" + url
    API_5_url = ("https://en.wikipedia.org/w/api.php?action=query&prop=revisions"
                 "&rvlimit=1&rvprop=timestamp&rvdir=newer&titles="
                 + str(url_title) + "&format=json")

    # Fetch links_in_count from the page links API
    json_data2 = requests.get(API_2_url).json()
    links_in_count = json_data2['links_in_count']

    # Fetch the monthly number of pageviews (views) from the Wikimedia REST API
    json_data3 = requests.get(API_3_url).json()
    views = json_data3['items'][0]['views']
    # Fetch watchers and revisions from the article info API
    json_data4 = requests.get(API_4_url).json()
    watchers = json_data4['watchers']
    revisions = json_data4['revisions']

    # Fetch the first edit date (oldest revision timestamp, yyyyMMdd) from the MediaWiki action API
    json_data5 = requests.get(API_5_url).json()
    date_first_edit = json_data5['query']['pages'][str(*json_data5['query']['pages'].keys())]['revisions'][0]['timestamp'].replace("-", "")[0:8]

    # Scrape the number of reverted edits from the XTools article info page
    # (findRevertedEdits is a helper defined elsewhere in the application code)
    Reverted_edits = findRevertedEdits(API_0_url)

    # Optimal c1: incoming links + pageviews + watchers
    c1 = links_in_count + views + watchers
    # Optimal c2: proportion of reverted edits among all revisions
    c2 = Reverted_edits / revisions
    # Optimal c3: article age in days (actual date minus the first edit date)
    today = datetime.date.today()
    someday = datetime.date(int(date_first_edit[0:4]), int(date_first_edit[4:6]), int(date_first_edit[6:]))
    diff = today - someday
    c3 = diff.days
    # Return the optimal values
    return (c1, c2, c3)

Listing 3: Function for calculating the optimal value of the article reputation dimension

# Function for calculating the optimal value of the article reputation dimension
def Optimal_Reputation(url=""):
    # Set the article title into the API urls
    API_2_url = "https://xtools.wmflabs.org/api/page/links/" + url
    API_4_url = "https://xtools.wmflabs.org/api/page/articleinfo/" + url
    API_6_url = ("https://xtools.wmflabs.org/api/page/top_editors/" + str(url)
                 + "/2001-01-01/2020-10-31/10?nobots=1")

    # Fetch the outgoing and external link counts from the page links API
    json_data2 = requests.get(API_2_url).json()
    links_out_count = json_data2['links_out_count']
    links_ext_count = json_data2['links_ext_count']

    # Fetch the total number of revisions from the article info API
    json_data4 = requests.get(API_4_url).json()
    revisions = json_data4['revisions']

    # Fetch the edit counts of the top editors (bots excluded) and sum them
    json_data6 = requests.get(API_6_url).json()
    Sum_10_edits = 0
    for each in json_data6['top_editors']:
        Sum_10_edits += each['count']

    # Optimal d: share of edits made by the top editors plus the external and outgoing link counts
    d = (Sum_10_edits / revisions) + links_ext_count + links_out_count
    # Return the optimal value
    return (d)
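Listing 2 above calls a helper findRevertedEdits that is not reproduced in this appendix; it obtains the number of reverted edits from the XTools article info web page. The following is only an illustrative sketch of such a helper, assuming the figure appears on the page as a labeled value; the label text, the page structure, and the parsing logic are assumptions, and the author's actual implementation may differ.

# Illustrative sketch only: scrape a "Reverted edits" figure from the XTools
# article info page. The label text and page layout are assumed, not verified.
import re
import requests
from bs4 import BeautifulSoup

def findRevertedEdits(articleinfo_url):
    page = requests.get(articleinfo_url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Look for a text node mentioning reverted edits and take the first number near it.
    label = soup.find(string=re.compile("[Rr]everted"))
    if label is None:
        return 0
    match = re.search(r"([\d,]+)", label.find_parent().get_text())
    return int(match.group(1).replace(",", "")) if match else 0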
Listing 4: Function for calculating the optimal value of the article verifiability dimension

# Function for calculating the optimal value of the article verifiability dimension
def Optimal_Verifiability(url="", urlx=""):
    # Call the XTools prose API for the given article
    API_1_url = "https://xtools.wmflabs.org/api/page/prose/" + url
    # Create a request call and fetch the API data in dictionary format
    json_data1 = requests.get(API_1_url).json()
    # Fetch the number of references from the dictionary
    references = json_data1['references']
    # Download the HTML source of the given Wikipedia article URL
    response = request.urlopen(urlx)
    page_source = response.read().decode('utf-8')
    # Pattern for references marked as permanent dead links
    dead = r'<a href="#cite.*permanent dead link'
    # Pattern for wikilinks inside the reference class of the given Wikipedia article
    wiki = r'class="reference-text"><a href="https://.*[/wiki/]{1}.*'
    # Total broken/permanent dead links in the source code of the article
    broken_link = len(re.findall(dead, page_source))
    # Total wikilinks used as references in the source code of the article
    wiki_link = len(re.findall(wiki, page_source))
    # Optimal f: references minus broken links and wikilink-only references
    f = references - broken_link - wiki_link
    # Return the optimal value
    return (f)

Listing 5: Function for calculating the optimal value of the infobox semantic connection dimension

# Function for counting the wikilinks for the infobox semantic connection dimension (musician category)
# (soup is the BeautifulSoup object of the article page, created elsewhere in the application)
def getNumberLink(url):
    # Empty list to store the wikilinks
    links = []
    try:
        # Find the infobox of the biography/vcard type
        soupx = soup.find(attrs={"class": "infobox biography vcard"})
        # Walk through the rows of the selected infobox
        for tr in soupx.select('.vcard tr'):
            td = tr.find('td')
            th = tr.find('th')
            # Check the wikilinks in each filled parameter
            if td is not None and th is not None:
                links += th.findAll("a", href=re.compile("(/wiki/)+([A-Za-z0-9_:()])+"))
                links += td.findAll("a", href=re.compile("(/wiki/)+([A-Za-z0-9_:()])+"))
        # Return the wikilink values
        return links
    except Exception as error:
        print('Error occurred: {}'.format(error))
        return []
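The listings above can be driven over the seed articles of a category to obtain the predefined optimal values; a minimal usage sketch follows. The seed list here is a placeholder, not the author's 50-article lists, and the simple averaging is an illustrative assumption about how the per-article values are aggregated.

# A minimal usage sketch for the listings above: compute the optimal values over
# a placeholder seed list and average them.
import statistics

seed_articles = ["en.wikipedia.org/Prague", "en.wikipedia.org/Brno"]  # placeholder seeds

credibility = statistics.mean(Optimal_Credibility(a) for a in seed_articles)
reputation = statistics.mean(Optimal_Reputation(a) for a in seed_articles)
print("Optimal credibility:", credibility)
print("Optimal reputation:", reputation)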
In: Pro- ceedings of the 15th international conference on World Wide Web - WWW ’06. ACM Press, 2006, pp. 585–594. ISBN 1595933239. Available from DOI: 10.1145/1135777.1135863. 6. VOSS, Jakob. Measuring Wikipedia. Proceedings of ISSI 2005: 10th International Conference of the International Society for Scientometrics and Informetrics. 2005, vol. 1. Available also from: https : / / www . researchgate . net / profile / Jakob _ Voss/publication/28803354_Measuring_Wikipedia/links/ 0deec52053a48777d5000000/Measuring-Wikipedia.pdf.</p><p>99 BIBLIOGRAPHY</p><p>7. SINGER, Philipp; LEMMERICH, Florian; WEST, Robert; ZIA, Leila; WULCZYN, Ellery; STROHMAIER, Markus; LESKOVEC, Jure. Why We Read Wikipedia. In: Proceedings of the 26th In- ternational Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017. Available from DOI: 10.1145/3038912.3052716. 8. KATTOUW, Roan. Wikimedia infrastructure. Retrieved from. 2011. Available also from: https : / / www . lugod . org / presentations / Wikimedia _ Infrastructure _ Utrecht _ LUG _ 2011.pdf. 9. CURINO, Carlo A; TANCA, Letizia; MOON, Hyun J; ZANIOLO, Carlo. Schema evolution in wikipedia: toward a web information system benchmark. In: In International Conference on Enterprise In- formation Systems (ICEIS. 2008, pp. 323–332. Available also from: http://www.cs.ucla.edu/~zaniolo/papers/ICEIS2008.pdf. 10. WAHYUDI; KHODRA, M. L.; WIBISONO, Y. Construction of Encyclopedic Knowledge Base from Infobox of Indonesian Wikipedia. In: 2018 International Conference on Information Technology Systems and Innovation (ICITSI). 2018, pp. 542–546. 11. CHIU, Jimmy K; LEE, Thomas Y; LEE, S; ZHU, Hailey H; CHE- UNG, David W. Extraction of RDF Dataset from Wikipedia Infobox Data. 2010. Technical report. Tech. rep., Department of Com- puter Science, The University of Hong Kong. 12. NAKAYAMA, Kotaro; PEI, Minghua; ERDMANN, Maike; ITO, Masahiro; SHIRAKAWA, Masumi; HARA, Takahiro; NISHIO, Shojiro. Wikipedia Mining Wikipedia as a Corpus for Knowledge Extraction. 2010. Available also from: http://www. academia.edu/download/33974452/Wikimania2008.pdf. 13. ZESCH, Torsten; GUREVYCH, Iryna; MÜHLHÄUSER, Max. Analyzing and accessing Wikipedia as a lexical semantic re- source. Data Structures for Linguistic Resources and Applications. 2007, vol. 197205. Available also from: http://citeseerx.ist. psu.edu/viewdoc/download?doi=10.1.1.411.2044&rep= rep1&type=pdf.</p><p>100 BIBLIOGRAPHY</p><p>14. REDI, Miriam; FETAHU, Besnik; MORGAN, Jonathan; TARA- BORELLI, Dario. Citation Needed: A Taxonomy and Algorith- mic Assessment of Wikipedia’s Verifiability. In: The World Wide Web Conference on - WWW ’19. ACM Press, 2019. Available from DOI: 10.1145/3308558.3313618. 15. FORD, Heather; SEN, Shilad; MUSICANT, David R.; MILLER, Nathaniel. Getting to the source: where does Wikipedia get its information from? In: ACM Press, 2013. Available from DOI: 10.1145/2491055.2491064. 16. HU, Meiqun; LIM, Ee-Peng; SUN, Aixin; LAUW, Hady Wirawan; VUONG, Ba-Quy. Measuring article quality in wikipedia:models and evaluation. In: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management - CIKM ’07. ACM Press, 2007. Available from DOI: 10.1145/1321440.1321476. 17. LAMPRECHT, Daniel; LERMAN, Kristina; HELIC, Denis; STROHMAIER, Markus. How the structure of Wikipedia articles influences user navigation. New Review of Hypermedia and Multimedia. 2016, vol. 23, no. 1, pp. 29–50. Available from DOI: 10.1080/13614568.2016.1179798. 18. 
LIU, Peter J.; SALEH, Mohammad; POT, Etienne; GOODRICH, Ben; SEPASSI, Ryan; KAISER, Lukasz; SHAZEER, Noam. Gener- ating Wikipedia by Summarizing Long Sequences. 2018. Available from arXiv: 1801.10198 [cs.CL]. 19. KING, Irwin; BAEZA-YATES, Ricardo (eds.). Weaving Services and People on the World Wide Web. Springer Berlin Heidelberg, 2009. Available from DOI: 10.1007/978-3-642-00570-1. 20. LANGE, Dustin; BÖHM, Christoph; NAUMANN, Felix. Proceedings of the 19th ACM International Conference on Information and Knowledge Management. Extracting struc- tured information from Wikipedia articles to populate infoboxes. Toronto, ON, Canada: Universitätsverlag Potsdam, 2010. CIKM ’10. ISBN 9781450300995. Available from DOI: 10.1145/1871437.1871698.</p><p>101 BIBLIOGRAPHY</p><p>21. YU, Liyang. DBpedia. In: A Developer’s Guide to the Semantic Web. Springer Berlin Heidelberg, 2010, pp. 379–408. Available from DOI: 10.1007/978-3-642-15970-1_10. 22. SACK, HARALD, BISWAS, RUSSA, KOUTRAKI, Maria. Pre- dicting wikipedia infobox type information using word embed- dings on categories. 2018. Available from DOI: 10.5445/IR/ 1000089255. 23. Help:Infobox [online]. Wikimedia Foundation, Inc. [visited on 2020-06-24]. Available from: https : / / en . wikipedia . org / wiki/Help:Infobox. 24. Wikipedia:Manual of Style/Infoboxes [online]. Wikimedia Foun- dation, Inc. [visited on 2020-06-24]. Available from: https : //en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/ Infoboxes. 25. Template:Infobox book [online]. Wikimedia Foundation, Inc. [vis- ited on 2020-06-25]. Available from: https://en.wikipedia. org/wiki/Anne_of_Green_Gables. 26. MAHANTI, R. Data Quality: Dimensions, Measurement, Strategy, Management, and Governance. ASQ Quality Press, 2019. ISBN 9780873899772. Available also from: https://books.google. cz/books?id=THeSDwAAQBAJ. 27. GE, Mouzhi; HELFERT, Markus. Cost and Value Management for Data Quality. In: SADIQ, Shazia W. (ed.). Handbook of Data Quality, Research and Practice. Springer, 2013, pp. 75–92. 28. FLECKENSTEIN, Mike; FELLOWS, Lorraine. Data Quality. In: Modern Data Strategy. Springer International Publishing, 2018, pp. 101–119. Available from DOI: 10.1007/978-3-319-68993- 7_11. 29. FÜRBER, Christian. Data Quality Management with Semantic Tech- nologies. Data Quality. Springer Fachmedien Wiesbaden, 2015. Available from DOI: 10.1007/978-3-658-12225-6_3. 30. HERZOG, T.N.; SCHEUREN, F.J.; WINKLER, W.E. Data Quality and <a href="/tags/Record_linkage/" rel="tag">Record Linkage</a> Techniques. Springer New York, 2007. ISBN 9780387695051. Available also from: https://books.google. cz/books?id=iofCetdcJSoC.</p><p>102 BIBLIOGRAPHY</p><p>31. NIST <a href="/tags/Big_data/" rel="tag">Big Data</a> Interoperability Framework:Volume 4, Security and Privacy. National Institute of Standards and Technology, 2019. Available from DOI: 10.6028/nist.sp.1500-4r2. Technical report. 32. LOSHIN, D. Monitoring Data Quality Performance Using Data Quality Metrics. 2006. Available also from: https://it.ojp. gov/documents/Informatica_Whitepaper_Monitoring_DQ_ Using_Metrics.pdf. 33. WANG, Richard Y.; STRONG, Diane M. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems. 1996, vol. 12, no. 4, pp. 5–33. Available from DOI: 10.1080/07421222.1996.11518099. 34. WANG, R.Y.; KON, H.B.; MADNICK, S.E. Data quality require- ments analysis and modeling. In: Proceedings of IEEE 9th Interna- tional Conference on Data Engineering. IEEE Comput. Soc. Press, 1993, pp. 670–677. 
Available from DOI: 10.1109/icde.1993. 344012. 35. ISHIKAWA, Kaoru. What is total quality control? The Japanese way. Prentice Hall, 1985. 36. HELFERT, Markus; FOLEY, Owen; GE, Mouzhi; CAPPIELLO, Cinzia. Limitations of Weighted Sum Measures for Information Quality. In: NICKERSON, Robert C.; SHARDA, Ramesh (eds.). Proceedings of the 15th Americas Conference on Information Systems, AMCIS 2009, San Francisco, California, USA, August 6-9, 2009. Association for Information Systems, 2009, p. 277. 37. GE, Mouzhi; DOHNAL, Vlastislav. Quality Management in Big Data. Informatics. 2018, vol. 5, no. 2, pp. 19. 38. WAND, Yair; WANG, Richard Y. Anchoring data quality dimen- sions in ontological foundations. Communications of the ACM. 1996, vol. 39, no. 11, pp. 86–95. Available from DOI: 10.1145/ 240455.240479. 39. GE, Mouzhi; LEWONIEWSKI, Wlodzimierz. Developing the Quality Model for Collaborative <a href="/tags/Open_data/" rel="tag">Open Data</a>. In: Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 24th International Conference KES-2020, Virtual Event, 16-18</p><p>103 BIBLIOGRAPHY</p><p>September 2020. Elsevier, 2020, vol. 176, pp. 1883–1892. Procedia Computer Science. 40. GE, Mouzhi; HELFERT, Markus. Impact of Information Quality on Supply Chain Decisions. J. Comput. Inf. Syst. 2013, vol. 53, no. 4, pp. 59–67. 41. WANG, R.Y.; STOREY, V.C.; FIRTH, C.P. A framework for anal- ysis of data quality research. IEEE Transactions on Knowledge and Data Engineering. 1995, vol. 7, no. 4, pp. 623–640. Available from DOI: 10.1109/69.404034. 42. WAND, Y.; WEBER, R. An ontological model of an information system. IEEE Transactions on Software Engineering. 1990, vol. 16, no. 11, pp. 1282–1292. Available from DOI: 10.1109/32.60316. 43. BUNGE, Mario. Treatise on Basic Philosophy. Ontology I: The Furniture of the World. Springer Netherlands, 1977. Available from DOI: 10.1007/978-94-010-9924-0. 44. BUNGE, Mario. Treatise on Basic Philosophy. Ontology II: A World of Systems. Springer Netherlands, 1979. Available from DOI: 10.1007/978-94-009-9392-1. 45. LEE, Yang W.; STRONG, Diane M.; KAHN, Beverly K.; WANG, Richard Y. AIMQ: a methodology for information quality assess- ment. Information & Management. 2002, vol. 40, no. 2, pp. 133–146. Available from DOI: 10.1016/s0378-7206(02)00043-5. 46. BALLOU, Donald P.; PAZER, Harold L. Designing Information Systems to Optimize the Accuracy-Timeliness Tradeoff. Infor- mation Systems Research. 1995, vol. 6, no. 1, pp. 51–72. Available from DOI: 10.1287/isre.6.1.51. 47. MENDELSON, Haim; SAHARIA, Aditya N. Incomplete infor- mation costs and database design. ACM Transactions on Database Systems. 1986, vol. 11, no. 2, pp. 159–185. ISSN 0362-5915. Avail- able from DOI: 10.1145/5922.5678. 48. HELFERT, Markus; GE, Mouzhi. Big Data Quality - Towards an Explanation Model. In: CARRETERO, Ana G.; CABALLERO, Ismael; PIATTINI, Mario (eds.). Proceedings of the 21st Interna- tional Conference on Information Quality, ICIQ 2016, Ciudad Real,</p><p>104 BIBLIOGRAPHY</p><p>Spain, June 22-23, 2016. Alarcos Research Group (UCLM), 2016, pp. 16–23. 49. WANG, Richard Y.; REDDY, M.P.; KON, Henry B. Toward qual- ity data: An attribute-based approach. Decision Support Sys- tems. 1995, vol. 13, no. 3-4, pp. 349–372. Available from DOI: 10.1016/0167-9236(93)e0050-n. 50. ARAZY, Ofer; MORGAN, Wayne; PATTERSON, Raymond. Wis- dom of the Crowds: Decentralized Knowledge Construction in Wikipedia. SSRN Electronic Journal. 2006. Available from DOI: 10.2139/ssrn.1025624. 51. 
AZEROUAL, Otmane; LEWONIEWSKI, Włodzimierz. How to Inspect and Measure Data Quality about Scientific Publications: Use Case of Wikipedia and CRIS Databases. Algorithms. 2020, vol. 13, no. 5, pp. 107. Available from DOI: 10.3390/a13050107. 52. DANG, Quang-Vinh; IGNAT, Claudia-Lavinia. An end-to-end learning solution for assessing the quality of Wikipedia articles. In: Proceedings of the 13th International Symposium on Open Col- laboration - OpenSym ’17. ACM Press, 2017. Available from DOI: 10.1145/3125433.3125448. 53. STVILIA, Besiki; TWIDALE, Michael B; GASSER, Les; SMITH, Linda C. Information quality discussions in Wikipedia. In: Proceedings of the 2005 international conference on knowledge man- agement. 2005, pp. 101–113. Available also from: https://www. researchgate.net/profile/Besiki_Stvilia/publication/ 200773232 _ Information _ Quality _ Discussions _ in _ Wikipedia/links/541cabe70cf2218008cd60a9/Information- Quality-Discussions-in-Wikipedia.pdf. 54. CHOI, Wonchan; STVILIA, Besiki. Web credibility assessment: Conceptualization, operationalization, variability, and models. Journal of the Association for Information Science and Technology. 2015, vol. 66, no. 12, pp. 2399–2414. Available from DOI: 10. 1002/asi.23543. 55. PICCARDI, Tiziano; REDI, Miriam; COLAVIZZA, Giovanni; WEST, Robert. Quantifying Engagement with Citations on Wikipedia. In: Proceedings of The Web Conference 2020. ACM, 2020. Available from DOI: 10.1145/3366423.3380300.</p><p>105 BIBLIOGRAPHY</p><p>56. LOPES, Rui; CARRIÇO, Luis. On the credibility of wikipedia:an accessibility perspective. In: Proceeding of the 2nd ACM workshop on Information credibility on the web - WICOW ’08. ACM Press, 2008. Available from DOI: 10.1145/1458527.1458536. 57. LEWONIEWSKI, Włodzimierz; WĘCEL, Krzysztof; ABRAMOW- ICZ, Witold. Relative Quality and Popularity Evaluation of Multilingual Wikipedia Articles. Informatics. 2017, vol. 4, no. 4, pp. 43. Available from DOI: 10.3390/informatics4040043. 58. LUYT, Brendan; TAN, Daniel. Improving Wikipedia’s credi- bility: References and citations in a sample of history articles. Journal of the American Society for Information Science and Technol- ogy. 2010. Available from DOI: 10.1002/asi.21304. 59. WARNCKE-WANG, Morten; COSLEY, Dan; RIEDL, John. Tell me more: an actionable quality model for Wikipedia. In: Pro- ceedings of the 9th International Symposium on Open Collaboration - WikiSym ’13. ACM Press, 2013. Available from DOI: 10.1145/ 2491055.2491063. 60. LOSHIN, David. Data Quality and MDM. In: Master Data Man- agement. Elsevier, 2009, pp. 87–103. Available from DOI: 10 . 1016/b978-0-12-374225-4.00005-9. 61. ANDERKA, Maik; STEIN, Benno; LIPKA, Nedim. Predicting quality flaws in user-generated content: the case of wikipedia. In: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’12. ACM Press, 2012. Available from DOI: 10.1145/2348283.2348413. 62. LUYT, Brendan; AARON, Tay Chee Hsien; THIAN, Lim Hai; HONG, Cheng Kian. Improving Wikipedia’s accuracy: Is edit age a solution? Journal of the American Society for Information Science and Technology. 2007, vol. 59, no. 2, pp. 318–330. Available from DOI: 10.1002/asi.20755. 63. INFELD, Donna Lind; ADAMS, William C. Using the Inter- net for Gerontology Education: Assessing and Improving Wikipedia. Educational Gerontology. 2013, vol. 39, no. 10, pp. 707–716. Available from DOI: 10.1080/03601277.2012.734266.</p><p>106 BIBLIOGRAPHY</p><p>64. KATZ, Gilad; ROKACH, Lior. 
Wikiometrics: a Wikipedia based ranking system. World Wide Web. 2017, vol. 20, no. 6, pp. 1153– 1177. Available from DOI: 10.1007/s11280-016-0427-8. 65. HANADA, Raıza; CRISTO, Marco; GRAÇA CAMPOS PI- MENTEL, Maria da. How do metrics of link analysis correlate to quality, relevance and popularity in wikipedia? In: Proceedings of the 19th Brazilian symposium on Multimedia and the web - Web- Media ’13. ACM Press, 2013, pp. 105–112. ISBN 9781450325592. Available from DOI: 10.1145/2526188.2526198. 66. LEHMANN, Janette; MÜLLER-BIRN, Claudia; LANIADO, David; LALMAS, Mounia; KALTENBRUNNER, Andreas. Reader preferences and behavior on Wikipedia. In: Proceedings of the 25th ACM conference on Hypertext and social media - HT ’14. ACM Press, 2014, pp. 88–97. ISBN 9781450329545. Available from DOI: 10.1145/2631775.2631805. 67. CIGLAN, Marek; NØRVÅG, Kjetil. WikiPop: Personalized Event Detection System Based on Wikipedia Page View Statistics. In: Proceedings of the 19th ACM international conference on Information and knowledge management - CIKM ’10. ACM Press, 2010, pp. 1931–1932. ISBN 9781450300995. Available from DOI: 10.1145/1871437.1871769. 68. MOYER, Daniel; CARSON, Samuel L; DYE, Thayne Keegan; CARSON, Richard T; GOLDBAUM, David. Determining the influence of Reddit posts on Wikipedia pageviews. In: Ninth International AAAI Conference on Web and So- cial Media. 2015, pp. 75–82. Available also from: https : //opus.lib.uts.edu.au/bitstream/10453/43941/1/Moyer% 2615InfluenceRedditOnWiki.pdf. 69. KIM, Juram; KIM, Seungho; LEE, Changyong. Anticipating tech- nological convergence: Link prediction using Wikipedia hyper- links. Technovation. 2019, vol. 79, pp. 25–34. Available from DOI: 10.1016/j.technovation.2018.06.008. 70. ADLER, B. Thomas; ALFARO, Luca de. A content-driven repu- tation system for the wikipedia. In: Proceedings of the 16th inter- national conference on World Wide Web - WWW ’07. ACM Press,</p><p>107 BIBLIOGRAPHY</p><p>2007, pp. 261–270. ISBN 9781595936547. Available from DOI: 10.1145/1242572.1242608. 71. JAVANMARDI, Sara; LOPES, Cristina. Statistical measure of quality in Wikipedia. In: Proceedings of the First Workshop on Social Media Analytics - SOMA ’10. ACM Press, 2010, pp. 132–138. ISBN 9781450302173. Available from DOI: 10.1145/1964858. 1964876. 72. WÖHNER, Thomas; KÖHLER, Sebastian; PETERS, Ralf. Au- tomatic Reputation Assessment in Wikipedia. In: 2011, vol. 4. Available also from: https://aisel.aisnet.org/icis2011/ proceedings/onlinecommunity/5. 73. ANTHONY, Denise; SMITH, Sean W.; WILLIAMSON, Timothy. Reputation and Reliability in Collective Goods. Rationality and Society. 2009, vol. 21, no. 3, pp. 283–306. Available from DOI: 10.1177/1043463109336804. 74. ADLER, B Thomas. WikiTrust: content-driven reputation for the Wikipedia. 2012. Available also from: https://escholarship. org/uc/item/7rv812n5. PhD thesis. UC Santa Cruz. 75. JAVANMARDI, Sara; LOPES, Cristina; BALDI, Pierre. Modeling User Reputation in Wikis. Statistical Analysis and <a href="/tags/Data_mining/" rel="tag">Data Mining</a>: The ASA <a href="/tags/Data_science/" rel="tag">Data Science</a> Journal. 2010, vol. 3, no. 2, pp. 126–139. Available from DOI: 10.1002/sam.10070. 76. LOSHIN, David. Dimensions of Data Quality. In: The Practi- tioner’s Guide to Data Quality Improvement. Elsevier, 2011, pp. 129– 146. Available from DOI: 10 . 1016 / b978 - 0 - 12 - 373717 - 5 . 00008-7. 77. KALYANASUNDARAM, Arun; WEI, Wei; CARLEY, Kathleen M.; HERBSLEB, James D. 
An agent-based model of edit wars in Wikipedia: How and when is consensus reached. In: 2015 Winter Simulation Conference (WSC). IEEE, 2015, pp. 276–287. Available from DOI: 10.1109/wsc.2015.7408171. 78. SUMI, Robert; YASSERI, Taha; RUNG, Andrs; KORNAI, Andrs; KERTESZ, Jnos. Edit Wars in Wikipedia. In: 2011 IEEE Third Int’l Conference on Privacy, Security, Risk and Trust and 2011 IEEE</p><p>108 BIBLIOGRAPHY</p><p>Third Int’l Conference on Social Computing. IEEE, 2011. Available from DOI: 10.1109/passat/socialcom.2011.47. 79. KITTUR, Aniket; SUH, Bongwon; CHI, Ed H. Can you ever trust a wiki? Impacting perceived trustworthiness in wikipedia. In: Proceedings of the ACM 2008 conference on Computer supported cooperative work - CSCW ’08. ACM Press, 2008. Available from DOI: 10.1145/1460563.1460639. 80. SIDI, Fatimah; PANAHY, Payam Hassany Shariat; AFFENDEY, Lilly Suriani; JABAR, Marzanah A.; IBRAHIM, Hamidah; MUSTAPHA, Aida. Data quality: A survey of data quality dimensions. In: 2012 International Conference on Information Retrieval & Knowledge Management. IEEE, 2012. Available from DOI: 10.1109/infrkm.2012.6204995. 81. FERRETTI, Edgardo; CAGNINA, Leticia; PAIZ, Viviana; DONNE, Sebastián Delle; ZACAGNINI, Rodrigo; ERRECALDE, Marcelo. Quality flaw prediction in Spanish Wikipedia: A case of study with verifiability flaws. Information Processing & Management. 2018, vol. 54, no. 6, pp. 1169–1181. Available from DOI: 10.1016/j.ipm.2018.08.003. 82. HARDER, Reed H.; VELASCO, Alfredo J.; EVANS, Michael S.; ROCKMORE, Daniel N. Measuring Verifiability in Online Infor- mation. 2015. Available from arXiv: 1509.05631 [cs.SI]. 83. WĘCEL, Krzysztof; LEWONIEWSKI, Włodzimierz. Mod- elling the Quality of Attributes in Wikipedia Infoboxes. In: ABRAMOWICZ, Witold (ed.). Business Information Systems Workshops. Springer International Publishing, 2015, pp. 308–320. Available from DOI: 10.1007/978-3-319-26762-3_27. 84. WU, Fei; WELD, Daniel S. Autonomously semantifying wikipedia. In: Proceedings of the sixteenth ACM conference on Con- ference on information and knowledge management - CIKM ’07. ACM Press, 2007. Available from DOI: 10.1145/1321440.1321449. 85. TAYI, Giri Kumar; BALLOU, Donald P. Examining Data Quality. Commun. ACM. 1998, vol. 41, no. 2, pp. 54–57. ISSN 0001-0782. Available from DOI: 10.1145/269012.269021.</p><p>109 BIBLIOGRAPHY</p><p>86. LOSHIN, David. <a href="/tags/Data_governance/" rel="tag">Data Governance</a> for Big Data Analytics. In: Big Data Analytics. Elsevier, 2013, pp. 39–48. Available from DOI: 10.1016/b978-0-12-417319-4.00005-3. 87. PLOTKIN, David. Important Roles of Data Stewards. In: Data Stewardship. Elsevier, 2014, pp. 127–162. Available from DOI: 10.1016/b978-0-12-410389-4.00007-6. 88. JANA, Abhik; KANOJIYA, Pranjal; GOYAL, Pawan; MUKHER- JEE, Animesh. WikiRef: Wikilinks as a route to recommending ap- propriate references for scientific Wikipedia pages. 2018. Available from arXiv: 1806.04092 [cs.CL]. 89. GALÁRRAGA, Luis; SYMEONIDOU, Danai; MOISSINAC, Jean- Claude. Rule Mining for Semantifying Wikilinks. In: LDOW@ WWW. 2015. Available also from: https://hal.telecom-paris. fr/hal-02412475. 90. SINGH, Sameer; SUBRAMANYA, Amarnag; PEREIRA, Fer- nando; MCCALLUM, Andrew. Wikilinks: A large-scale cross- document coreference corpus labeled via links to Wikipedia. University of Massachusetts, Amherst, Tech. Rep. UM-CS-2012. 2012, vol. 15. 91. BIZER, Christian; LEHMANN, Jens; KOBILAROV, Georgi; AUER, Sören; BECKER, Christian; CYGANIAK, Richard; HELLMANN, Sebastian. 