Masaryk University Faculty of Informatics

Finding evidence for unsubstantiated claims on Wikipedia

Bachelor’s Thesis

Vojtěch Krajňanský

Brno, Fall 2017

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Vojtěch Krajňanský

Advisor: Mgr. et Mgr. Vít Baisa, Ph.D.


Acknowledgements

I would like to thank my advisor, Mgr. et Mgr. Vít Baisa, Ph.D., for his patient guidance which proved invaluable in helping me finish this work. My thanks also belong to Mgr. Michal Krajňanský, who first sparked my interest in the field of computer science, and who provided me with helpful insights in our various discussions about computational linguistics and natural language processing. Finally, I would like to express my gratitude to Lorcan Cook, for proofreading this work.

Abstract

The goal of this thesis is the creation of a Python script to help Wikipedia annotators semi-automate the process of source searching for claims which have been marked as requiring an external source. The result is a language independent tool which parses a given set of Wikipedia articles, retrieves parts of text which have been deemed to be unsubstantiated claims, formulates a free text search query for a given claim and retrieves relevant responses via the Bing Web Search API. It then generates an HTML report summarizing the results and presenting the user with an overview of suggested sources for them to assess and choose the best sources for the claims.

Keywords

information retrieval, evidence finding, Wikipedia, query formulation, free text search, Python


Contents

1 Introduction
1.1 Motivation
1.2 Thesis Outline
1.3 Results

2 State-of-the-art and Related Work

3 Thesis Foundations
3.1 Natural Language Processing and Information Retrieval
3.1.1 Key Concepts
3.2 Task Definition

4 Implementation
4.1 Unsubstantiated Claim Identification
4.1.1 Wiki markup
4.1.2 Citation Needed Templates
4.1.3 Claim Text Identification
4.1.4 Quote Templates
4.2 Query Formulation
4.2.1 Assumptions
4.2.2 Term Scoring
4.2.3 Keyword Extraction
4.3 Web Search Engine
4.4 Website Relevance Assessment
4.4.1 Copycat Websites
4.5 Report Generation
4.6 Tool Usage
4.6.1 Configuration File

5 Performance Evaluation
5.1 Evaluation Metric
5.2 Results

6 Summary

Bibliography

A Project Structure

B Blacklisted Sites

C Term Score Cutoff Estimation

List of Figures

4.1 A citation needed template after claim text
4.2 A citation needed template before a list


1 Introduction

1.1 Motivation

Today, Wikipedia (www.wikipedia.org) is one of the most commonly accessed sources of factual information for people around the world. As of April 18, 2017, Alexa.com, a website that aggregates information about internet traffic, ranks Wikipedia as the 5th most popular website globally [1]. This makes Wikipedia one of the most influential websites in the world, possibly even capable of affecting public opinion on certain issues. As of April 18, 2017, the English-language edition of Wikipedia contained 5,386,775 articles, with an increase of over 20,000 new articles each month [2], and over 30 million users have registered a username. Out of these users, only a fraction contributes to the site’s maintenance regularly. In February 2015, only about 12,000 editors had an edit count of at least 600 overall and 50 in the last six months [3]. These relatively few volunteers’ responsibilities range from fixing typographic errors and policing articles for vandalism to resolving disputes and perfecting content. They are the ones making sure that Wikipedia articles abide by its core sourcing policy, which has been previously described as “verifiability, not truth” [4]. That is, they ought to present only material that has been previously published by a reliable source and describe leading opinions on disputed issues in a neutral way.

Verifiability of information available on the Internet has recently become a much discussed topic in various social circles, with mainstream western media often calling our era the post-truth age. The term post-truth has in fact been named the 2016 international word of the year by Oxford Dictionaries [5]. Wikipedia editors have a duty to ensure that the information presented is as up to date as possible, while referencing credible sources.

Various software robots have been created to help with the more mundane tasks of the site’s maintenance, such as vandalism identification, spell checking, etc. [6] However, finding reliable evidence for disputed claims still remains a human-dominated manual task.


The goal of this work is to create and present a language independent tool to help semi-automate the process of evidence finding for unsubstantiated claims in order to ease and accelerate it, saving time and energy of the human party involved. The tool should parse a given set of articles, find claims that have been identified as requiring an external source, and subsequently search the Internet for electronic material which could serve as a credible reference. A collection of candidate sources is then presented to the user, ranked by estimated relevance, from which they are able to choose the best result as they see fit. Different lightweight, language independent approaches to query formulation from natural language text are explored in an effort to maximize the precision of retrieved source candidates for a given claim.

1.2 Thesis Outline

Chapter 2 describes recent work related to this thesis. Chapter 3 introduces the key concepts used throughout the thesis, and precisely defines the task of source searching for unsubstantiated claims found on Wikipedia. In Chapter 4, the proposed solution for the task and its implementation are described. Chapter 5 introduces a quality metric against which the presented tool’s success rate is measured and provides an overview of its experimental evaluation. Finally, in Chapter 6, the results of the work are summarized and various methods for improvement are suggested.

1.3 Results

The result of this work is a script written in the Python programming language, which provides the user with a simple means to automatically search for a set of web documents presumed to contain factual information supporting claims which have been marked by Wikipedia editors as requiring an additional external source. The user is first required to create a dictionary of word–document frequency mappings for a text corpus created from the whole Wikipedia article dump of a language of their choice. Afterwards, the user provides a dump of


articles of their choice to search for unsubstantiated claims as marked by editors, and find relevant sources for these claims using the Bing Web Search API. Both the dictionary creation and source search steps are performed by running a simple command in the command line. The tool is able to formulate free text queries from natural language text for which the Google search engine returns a response containing a factual source for a given claim in more than 60% of cases, tested on a set of 50 English and 20 Czech articles. The Bing API shows a worse performance for the queries, returning a factual source for 46% of the English claims, and a mere 15% of the Czech ones. I try to explain the reason for such a large difference between the Bing API success rates for the two languages in Section 5.2. The script is divided into 6 Python modules, each of which handles one of the responsibilities necessitated by the task. All code and related files are publicly available in a GitHub repository at https://github.com/Vocco/wikifinder.


2 State-of-the-art and Related Work

No tool currently exists to help Wikipedia annotators automate the process of finding evidence in the form of electronic documents/webpages for unsubstantiated claims. The means to creating one, however, lie in techniques and theory established by the scientific community, especially in the field of information retrieval. The tasks such a tool must handle can be summarized as follows: Wiki markup and plain text parsing, semantic and topical analysis, query formulation from natural language text, and relevance assessment for found sources. Wiki markup is the syntax and keywords used by the MediaWiki software to format a page [7]. Although not meant primarily as a Wiki markup parser, the gensim Python library has been chosen as the base framework for Wikipedia processing and query formulation, as it has been shown to efficiently compute bag-of-words (BOW) representations of documents on Wikipedia dumps [8] and can be easily modified for detecting citation needed templates used by Wiki markup to identify claims that need to be verified. It also provides several utility functions such as computing the cosine similarity of two documents, which is used during the website relevance assessment process.

Once a portion of text has been identified as a claim that needs to be supported by an outside source, it needs to be analyzed and reformulated into a query for a web search engine. This can be viewed as a feature identification task for document similarity classification, where tokenized words serve as features. Since the tool in question should be language independent, the feature identification process should not depend on any language constraints, such as grammatical tags or lexical databases. We define language independence as meaning that in order for the tool to be modified to process a language that is not yet supported, as little overhead work as possible needs to be done. There is currently no language independent natural language processor that produces grammatical tags such as part of speech (POS) and predicate argument structure (PAS) tags based on syntactic analysis alone [9]. Common feature models which do not depend on any language constraints other than tokenization are the BOW and vector space models, which are used to calculate document similarity using the number of word occurrences and their relative importance to the

document in question. This approach has the advantage of treating both a document and a query as vectors in a shared space [10].

In order to retrieve the most relevant sources possible, the query passed to the search engine has to approximate the topic of the information need well. Information retrieval methods based on exact word matching may be inaccurate due to linguistic problems, such as multiple words commonly being used to express the same concept or a single word conveying multiple different meanings [11]. Latent semantic indexing (LSI) is a method that addresses these problems by taking advantage of implicit higher-order structure in the association of terms in documents by using singular value decomposition (SVD), where a large term-by-document matrix is decomposed into a set of orthogonal factors, from which the original matrix can be approximated by linear combination [12]. LSI may be used as a way of query expansion during the query formulation process [10].

The information retrieval task at hand is probably most similar to the Evidence Finding (EF) task introduced by Cartright et al. [13]; however, EF restricts itself to a fixed collection from which returned evidence is sought. In their work, Cartright et al. use a collection of approximately 50,000 scanned books with section markup provided by Microsoft and the Galago [14] information retrieval system. For query formulation, they use the Sequential Dependence Model (SDM) developed by Metzler and Croft [15] to make use of phrase information in the given statement and its immediate context. They explore various modifications of their system, with the best results achieving a mean reciprocal rank (MRR) of 0.6245, success at rank 5 of 0.8750, and success at rank 10 of 0.8889, where success at rank n for a given query is defined as 1 if at least one relevant result is found in the first n ranks, and 0 otherwise [13]. My exact approach to query formulation is described in detail in the second section of Chapter 4.

Once a query has been formulated, it has to be presented to a web search engine. Numerous efficient engines exist, the most popular of which is Google, with a global market share of more than 80% as of November 2017 [16]. These engines employ their own ranking systems to determine the relevance of found results for a given query; however, the credibility of retrieved results needs to be verified subsequently, as this task is beyond the scope of such systems.


A number of solutions have been proposed by various authors for automating credibility assessment; some generic, others related to a specific set of websites [17, 18, 19, 20]. Sonal Aggarwal et al. have recently created a proof of concept support tool in Python for website credibility assessment, which uses existing real-time databases like Alexa for popularity ranking, and Web of Trust for user ratings and review data [21]. I leave the verification of credibility of a found source up to the user of the tool, focusing instead on filtering out results which are deemed to be simple copies of the article from which a given claim came.


3 Thesis Foundations

3.1 Natural Language Processing and Information Retrieval

The goal of linguistic science is to characterize and explain the great variety of linguistic observations in human interaction, in spoken word or written text [22]. Humans have an inherent ability to create, learn and understand languages as a means of communication between one another. However, natural languages have evolved in a way that, while natural for humans, is intrinsically difficult for computers to process and analyze.

Perhaps the main source of this difficulty is the ambiguity of language [22]. Ambiguity occurs on a multitude of levels in natural languages. For example, the word “band” may have a different meaning depending on the surrounding context, a phenomenon known as polysemy. Often, multiple words may describe the same concept, such as “car” and “automobile”. This is known as synonymy. Ambiguity does not occur only on the level of individual words, however. Consider the following sentence:

∙ “The policeman is chasing the man with a gun.”

This is an example of an amphiboly. Without any further context, not even a human reader is able to determine with absolute certainty which party is actually holding the weapon. However, even without additional context, the human reader is able to infer that a crime has probably been committed, and that the man being chased is suspected of committing it or being related to it in some way. Reaching the same conclusion requires an amount of prerequisite knowledge of the world and of the meanings of words, as well as the ability to form judgments based on presented facts, all of which a computer algorithm would need to possess. Known as inferential analysis, this is a general challenge in the field of artificial intelligence.

Natural language processing (NLP) is an interdisciplinary field of computer science and computational linguistics studying interactions between computers and natural (human) languages, such as terminology extraction, syntactic analysis, or machine translation. For a

long time, the majority of scientific research in NLP was focused on analysis of very limited linguistic domains and individual sentences using complex sets of hand-written rules. In the past decades, however, with the rise of the World Wide Web and the vast amount of text data being produced every day resulting in the availability of large text corpora, along with the increase in raw machine computing power, we have seen a shift in the scientific community toward statistical and machine-learning based language analysis [22].

Information retrieval (IR) is an academic field of study loosely connected to NLP, which is concerned with retrieving material of an unstructured nature (usually text) from a large collection to satisfy an information need of an individual. In recent years, the task of IR has become a daily activity for hundreds of millions of people across the globe. IR systems can be distinguished based on the scale on which they operate, from web search, where the system has to cover billions of documents distributed on computers worldwide, to personal information retrieval such as an email or an integrated OS search system for use on a personal computer [10].

Numerous NLP techniques have been used in IR, with simple methods like stopwording and Porter-style stemming usually resulting in significant improvements to IR systems, and higher-level processing like chunking or word sense disambiguation yielding small improvements and considerably increasing computation and storage cost [23]. This thesis is only concerned with web text retrieval, as its goal is the creation of a tool which supports search for candidate sources for a given claim in the form of web documents. As the tool has to be language independent, meaning as little overhead work as possible should have to be performed in order for it to be applicable to a previously unused language, I avoid the usage of language dependent NLP techniques, even stemming and stopwording, and focus mostly on keyword extraction and document similarity comparison.

3.1.1 Key Concepts

This section introduces the key terms and concepts used throughout this thesis. The definitions presented come from Introduction to Information Retrieval by Christopher D. Manning et al. [10]


Today, commercial web search engines make use almost exclusively of free text queries, i.e. having the user type just one or more words rather than using a precisely defined language with operators to build a query (e.g. querying an SQL database). Some engines allow the use of special operators to further specify the information need, along with the free text query.

For the purposes of this work, we think of documents, i.e. Wikipedia articles and web pages, mainly in terms of the BOW and vector space models. In a BOW model, a document is represented as a set of term–weight pairs, where weights are determined by a function mapping the number of occurrences of a given term in the document to a positive real value. The exact ordering of the terms is ignored. In this view, the documents “This house is available for rent, not for sale.” and “This house is available for sale, not for rent.” are seen as identical, even though their semantics differ. Still, the intuition holds that documents with similar BOW representations are close to each other in content. In a vector space model, documents are represented by document vectors, that is, vectors of term weights in a shared vector space.

The number of occurrences of a term t in a document d is known as term frequency, and denoted as tf_{t,d}. Document frequency df_t of a term t refers to the number of documents in a given collection which contain t. We introduce the concept of document frequency because raw term frequency treats all terms as equally important when assessing the similarity of documents. However, certain terms have very little value when determining document relevance to a query. For example, in any collection of English documents the term “the” can be expected to be found in every single document. Document frequency allows us to take into account that certain terms naturally occur more frequently in a collection than others. We also define the inverse document frequency idf_t of a term t as follows:

idf_t = log(N / df_t),

where N is the total number of documents in a collection. Defined in this way, the idf_t of a term t is high if t is rare in the collection, and low for common terms. Term frequency and inverse document frequency can be combined into a weighting scheme known as tf–idf. Tf–idf assigns to each term t in document d a weight given by

tf_{t,d} × idf_t,

that is, the tf–idf weight is highest for terms that occur frequently in a small number of documents, and lowest for terms that are found in a vast majority of documents in the collection.

A text corpus is a structured collection of texts used to perform statistical analysis and hypothesis testing.

Cosine similarity is a way of quantifying the similarity between two documents d1 and d2, computed by:

sim(d1, d2) = (d1 · d2) / (|d1| * |d2|),

where d1 and d2 are the respective document vectors, the numerator is their inner product, and the denominator is the product of their Euclidean lengths. Cosine similarity naturally compensates for different document lengths by taking into account the relative distribution of terms in the vectors. Since term weights cannot be negative, the value of cosine similarity for IR purposes is bounded in the interval [0, 1].
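To make these definitions concrete, the following minimal Python sketch (not part of the tool itself) computes tf–idf weights and the cosine similarity of two short documents over a tiny collection; the whitespace tokenization and the example documents are illustrative assumptions.

    import math
    from collections import Counter

    docs = ["this house is available for rent not for sale",
            "this house is available for sale not for rent",
            "the band played a song about a car"]

    tokenized = [d.split() for d in docs]          # naive whitespace tokenization
    N = len(tokenized)
    vocabulary = sorted({t for doc in tokenized for t in doc})

    # document frequency df_t and inverse document frequency idf_t = log(N / df_t)
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocabulary}
    idf = {t: math.log(N / df[t]) for t in vocabulary}

    def tfidf_vector(doc):
        # BOW representation weighted by tf-idf, as a vector over the vocabulary
        tf = Counter(doc)
        return [tf[t] * idf[t] for t in vocabulary]

    def cosine_similarity(v1, v2):
        dot = sum(a * b for a, b in zip(v1, v2))
        norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
        return dot / norm if norm else 0.0

    v1, v2 = tfidf_vector(tokenized[0]), tfidf_vector(tokenized[1])
    print(cosine_similarity(v1, v2))   # ~1.0: identical bag-of-words representations

The first two documents receive a cosine similarity of (approximately) 1 despite their different meaning, which illustrates the limitation of the BOW view discussed above.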

3.2 Task Definition

Let us define the task at hand as follows: Given a Wikipedia article A, identify the sequences of text c_1, c_2, ..., c_n that have been marked as requiring an external source with the citation needed Wikipedia template. For each c_i, 1 ≤ i ≤ n, create a vector t^{c_i} of tokens t_1^{c_i}, ..., t_m^{c_i}, where a token is defined as an individual word (a non-stemmed sequence of alphabetical and certain special characters as it appears in c_i). Using t^{c_i} as a query for a web search engine, retrieve a response R of web documents that the web search engine has deemed relevant to the query. From R, determine a subset R′ of at most


20 articles that are most likely to be good candidate sources. Finally, generate a report of the found claims and their respective retrieved candidate sources for the user to be able to effectively browse through and determine the best source for a given claim. The definition of a good candidate source is given in Section 4.4.


4 Implementation

This chapter describes how the tool is implemented to handle the task defined in the previous section. The task is organized into several mostly independent parts. Sections 4.1, 4.2, 4.4, and 4.5 each describe one of these parts and explain the reasoning behind the implementation. Section 4.3 discusses the choice of the web search engine used to retrieve web documents as possible candidates for a claim’s source. The final section describes how the tool should be used.

4.1 Unsubstantiated Claim Identification

4.1.1 Wiki markup

Wikipedia articles are written in a format known as the Wiki markup. Its syntax consists of various layout and text formatting elements, and expressions to include links and references, images, or to automatically copy (transclude) other segments of Wiki markup into the page. A Wikipedia page meant to be included in other pages is called a template. A template is transcluded in the Wiki markup by including a string of the format {{template_name}} in the document. Some templates may include parameters to further extend their use. A template with parameters included has the syntax {{template_name|parameter|...}}, where template_name is the name of the template and parameter contains either only a value (it is then called an unnamed parameter), or is of the form name=value (referred to as a named parameter), where name is the name of the parameter [7]. Various templates can also be nested, which poses a problem when attempting to parse a Wiki markup text into plain text.

The gensim library offers a function to retrieve plain text from an article written in Wiki markup. During the parsing, this function attempts to remove all template markup from the text. In certain cases, this leads to a loss of textual information found in the article. For this reason, the function has been modified to accommodate the needs of claim text identification.


Figure 4.1: A citation needed template after claim text

4.1.2 Citation Needed Templates

Claims which have been identified by Wikipedia editors as unsubstantiated or requiring an external source are identified by the transclusion of a {{citation needed}} template in the text. The template name has several possible aliases in English as well as other languages. The aliases for a given language are usually listed on the template documentation Wikipedia page. To search for occurrences of citation needed templates in the Wiki markup, the gensim function to remove Wiki markup from a given string is modified to check the template name for any of the possible aliases, the set of which is provided by the user via a configuration file (see Section 4.6.1), and if a match is found, the template is replaced with a $$CNMARK$$ sequence of characters that is included in the parsed plain text.
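The following simplified Python sketch illustrates the idea of replacing recognized citation needed templates with the $$CNMARK$$ marker; the alias set and the regular expression are illustrative assumptions and do not reproduce the modified gensim function, which also has to deal with nested templates.

    import re

    # example aliases, as they would be declared in the [citation-needed] configuration section
    CITATION_ALIASES = {"citation needed", "cn", "fact"}
    CN_MARK = "$$CNMARK$$"

    # matches non-nested templates of the form {{name}} or {{name|parameters}}
    TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)(\|[^{}]*)?\}\}")

    def mark_citation_needed(wiki_markup):
        # replace citation needed templates with the marker, drop other templates
        def replace(match):
            name = match.group(1).strip().lower()
            return CN_MARK if name in CITATION_ALIASES else ""
        return TEMPLATE_RE.sub(replace, wiki_markup)

    print(mark_citation_needed("The sky is sometimes green.{{Citation needed|date=July 2017}}"))
    # -> The sky is sometimes green.$$CNMARK$$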

4.1.3 Claim Text Identification

The greatest challenge in identifying an unsubstantiated claim arises from the fact that the citation needed template is included in the markup on its own; it is not directly tied to a portion of the text, and there are no exact guidelines as to where it should be placed. Most of the templates seem to be put at the end of the portion of text to which they refer, as seen in Figure 4.1. However, during my research I have encountered templates that were actually found before the text to which they referred. These were exclusively found after an occurrence of a colon punctuation mark and what followed was usually a list of examples or a quote. See Figure 4.2 for an example.


Figure 4.2: A citation needed template before a list

The tool handles this distinction by using a simple heuristic which searches for the occurrence of a colon before an encountered citation needed template and, if it is found, includes the text after it in the claim as well. Such a claim is then classified to be of type “F”, whereas claims which are deemed to be found exclusively before the template are assigned a class “B”. This classification is then used when computing the vector t^{c_i} of query tokens for the claim, as described in Section 4.2.2. “B” type claims include the text of the whole paragraph before the template occurrence, or following the last citation needed template if found in the paragraph. If there were any other citation needed templates found before in the paragraph, the text of up to three previously seen claims is included in the current claim text as well, as the individual claims are likely semantically related. “F” type claims include only the last sentence before the template occurrence and:

1. Either the list of items after it, if encountered,

2. the rest of the paragraph after it, if the paragraph does not end after the template,

3. the whole paragraph after it.

This heuristic has proven to work well for claim text identification during the experimental process.
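A condensed Python sketch of this placement heuristic is given below; the paragraph splitting, the sentence handling, and the function name are illustrative simplifications of the actual implementation.

    def classify_and_extract(paragraph_before, text_after_template):
        # Return (claim_type, claim_text) for a single citation needed marker.
        # paragraph_before: plain text of the paragraph up to the marker
        # text_after_template: plain text following the marker (possibly a list or a quote)
        if paragraph_before.rstrip().endswith(":"):
            # a colon right before the template suggests the claim text follows it ("F" type)
            sentences = paragraph_before.rstrip(": ").split(". ")
            intro = sentences[-1] if sentences else paragraph_before
            return "F", intro + " " + text_after_template
        # otherwise the claim is assumed to precede the template ("B" type)
        return "B", paragraph_before

    claim_type, claim_text = classify_and_extract(
        "The following bands cite the album as an influence:",
        "Band A, Band B, Band C")
    print(claim_type, "->", claim_text)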


4.1.4 Quote Templates

In contrast to the citation needed template, the quote template wraps the text to be displayed on the page in its own text parameter. As noted in the previous section, some identified claims can hold most of the semantic information in a quote which follows an introductory sentence. In order to deal with this problem, quote template content needs to be taken into account when parsing Wiki markup using the modified gensim function.

When a quote template is identified by its template name, it is first searched for a text parameter. If found, its value is used to replace the original template in the plain text. Some quote templates use an unnamed parameter to define their text value. When no text parameter is found in the template, the part of the template which comes after the template name is included in the text as a whole. This can have the effect of some template metadata being included in the plain text, which may result in some tokens parsed from the metadata being included in the query vector t^{c_i}. In practice, however, this has not resulted in significant problems during the query formulation process, as this metadata usually consists of attributes like the name of the author of the quote, the title of the work it appears in, and other data which is relevant to the claim. In such a case, if any named parameters appear in the quote, they are included in the claim as well, but since they are mostly generic terms, such as “author” or “source”, their score is generally lower than that of terms found in the actual claim.
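As an illustration, extracting the displayed text from a simple, non-nested quote template body might look as follows; the parameter layout is an assumption based on the description above and ignores the complications of nested markup.

    def quote_template_text(template_body, text_param_name="text"):
        # template_body example: "quote|text=Brevity is the soul of wit.|author=Shakespeare"
        parts = template_body.split("|")[1:]                    # drop the template name
        named = dict(p.split("=", 1) for p in parts if "=" in p)
        if text_param_name in named:
            return named[text_param_name]                       # use the text parameter if present
        return " ".join(parts)                                  # otherwise keep everything after the name

    print(quote_template_text("quote|text=Brevity is the soul of wit.|author=Shakespeare"))
    # -> Brevity is the soul of wit.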

4.2 Query Formulation

4.2.1 Assumptions

I make several assumptions during the query formulation process:

1. To find a relevant candidate source for a claim with a search engine, it is sufficient that the query contain only tokens found in the claim text.

2. The order in which terms appear in the query has an effect on the retrieved response from a web search engine.


3. The web search engine has been optimized with the expectation that the user is an average user with little knowledge of IR techniques, and it is assumed the common query very closely resembles a natural language construct.

To verify the first assumption, I created a test set of 19 claims found on the English Wikipedia that are already properly sourced. In order to include a claim in the test set, the following criteria had to have been fulfilled:

∙ The article from which the claim came must not have been marked by Wikipedia as a stub, i.e. a short or incomplete article.

∙ The source for the claim must have come from a web page.

∙ The source must have still been accessible at the time of test set creation.

∙ Only one claim per article could be included in the test set.

Although the claims were chosen by hand, I have tried to randomize the process as much as possible in order to avoid human bias skewing the results. The articles were chosen with the help of the Wikipedia Special:Random page, which automatically redirects to a random Wikipedia article [24]. The claim to choose from an article was identified with the help of RANDOM.ORG, a service created by Dr. Mads Haahr of the School of Computer Science and Statistics at Trinity College in Dublin, which utilizes atmospheric noise to generate a random number from a user specified range [25]. I used this number to find an existing reference in the given article, and if all of the previously defined criteria were fulfilled, analyzed which part of the text the reference sourced. For each of the claims I manually formulated a query I would then use with the Bing and Google web search services, where each query contained only words found in the claim with no modifications. Using this approach, I was able to retrieve a relevant web page that could be used as a source for the original claim in the top 10 of returned results in 94.74% of test cases using Google, and 73.68% using Bing. In 47.37% of cases, the relevant page found was actually of

the same domain as the original source used in the Wikipedia article for both Bing and Google. Based on these results, I concluded that assumption 1 holds true.

Assumption 2 is easily verified by creating multiple queries consisting of the same tokens (words) in a different ordering. As of November 2017, both Bing and Google show differences even in the top 10 retrieved documents for the queries “fellowship of evangelical bible churches founded desire” and “of bible churches founded desire fellowship evangelical”. Even for simple search terms like “tuesday afternoon” and “afternoon tuesday”, the assumption holds true for both search engines.

The final assumption may be impossible to verify, as the inner workings of commercial web search engines represent an invaluable part of the business know-how of their respective providers, and as such, are a closely protected secret.

4.2.2 Term Scoring

For the tool to be language independent and robust, the extraction of keywords from the claim has to make use of simple language processing tools. I utilize a modification of the tf–idf weighting scheme to determine the semantic importance of a parsed token to a given claim. The text of the claim retrieved in the previous step is first parsed into sentences. The parsing is done by a very simple heuristic, sketched after the list below. The whole text is first split on each occurrence of a period mark into presumed sentences. A period is deemed as not ending a sentence if any of the following conditions holds:

∙ The candidate sentence after the period consists only of one word.

∙ The word after the period does not begin with an uppercase letter.

∙ The last word before the period is not longer than 4 characters and begins with an uppercase letter.
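A minimal Python sketch of this splitting heuristic, under the assumption that the text has already been stripped of markup, might look as follows (the function names are illustrative):

    def split_into_sentences(text):
        # split on periods, then re-join splits that the heuristic deems not to end a sentence
        pieces = [p.strip() for p in text.split(".") if p.strip()]
        sentences = []
        for piece in pieces:
            if sentences and period_not_sentence_end(sentences[-1], piece):
                sentences[-1] += ". " + piece
            else:
                sentences.append(piece)
        return sentences

    def period_not_sentence_end(before, after):
        words_after = after.split()
        words_before = before.split()
        if len(words_after) == 1:                            # one-word candidate sentence
            return True
        if words_after and not words_after[0][0].isupper():  # no uppercase letter after the period
            return True
        last = words_before[-1] if words_before else ""
        if len(last) <= 4 and last[:1].isupper():            # short capitalized word, e.g. "Dr."
            return True
        return False

    print(split_into_sentences("Dr. Haahr works at Trinity College. He created RANDOM.ORG."))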


The heuristic will not parse any given string perfectly, but has nonetheless proven to work very well for the query formulation purpose. For claims of type “B”, I assume the closer a sentence is found to the end of the claim, the more semantic relevance it holds to the claim, as it seems natural for an editor to place the citation needed template right at the end of the disputed information presented. However, sometimes the sentence (or part of the sentence in case of a template included in the middle) closest to the template does not contain enough information on its own. This is often a result of the use of anaphora, an expression whose interpretation is dependent upon another in the context. For this reason, terms found in up to three sentences before the end of the claim are considered. Precisely, I define the term claim weight cw_{t,c_i} of term t in claim c_i by the following formula:

cw_{t,c_i} = x + 0.36 * y + 0.216 * z,

where x is equal to 1 if t is found in the sentence closest to the end of c_i, and 0 otherwise, y is the number of occurrences of t in the second to last sentence, and z is the number of occurrences of t in the third to last sentence. The coefficient values were obtained experimentally, by assessing the keywords extracted from claims from the set of 19 claims introduced in the previous section and their similarity to a human-defined query. The reader may notice that the values 0.36 and 0.216 are the number 0.6 raised to the second and third power, respectively. This is because I assume that the semantic value of a term to a claim decreases exponentially with respect to the distance in terms of sentences to the end of the claim. For claims of type “F”, the formula is modified in the following way:

cw_{t,c_i} = x + 0.36 * y + 0.216 * z + 0.7 * w,

where w is equal to 1 if the claim has been split into more than three sentences and x + y + z = 0, and 0 otherwise. That is, I take into consideration only terms closest to the end of the claim, and also assign a weight of 0.7 to terms not found among these, but found in the introductory sentence of the “F” claim. This reflects the intuition that the introductory sentence contains information important to c_i, yet the actual content is probably more

important. The reason for including only the last part of the quote or a list present in the claim is to limit the domain from which the terms are taken, with the assumption that the keywords extracted from this domain will suffice to formulate the query.

The term claim weight is then used to modify the actual frequency of the terms found in the article, and the resulting vector of term weights is scaled to the range between 0 and 1 by dividing each weight by the sum of all weights. During experimentation, this weighting metric has been deemed to favor terms that occur very frequently in the article too significantly, so term frequency has been further modified to mitigate this effect. In summary, I have arrived at the following mapping to determine the term weight tw_{t,A,c_i} of a term t found in claim c_i of article A, to be used in place of the commonly used term frequency in the tf–idf scheme:

tw_{t,A,c_i} = cw_{t,c_i} * log_15(tf_{t,A} + 14)

The vector of these term weights is again scaled to the interval [0, 1] by sum division.

The inverse document frequency idf_t value is then computed for each term. Document frequency comes from a precomputed dictionary mapping between terms and their document frequencies across all Wikipedia articles in a given language. This dictionary is obtained using the gensim library’s WikiCorpus class to create a text corpus from a Wikipedia article dump and determine its vocabulary. Each article is stripped of Wiki markup, tokenized, and its BOW representation is created. Articles shorter than 50 words (after preprocessing) are ignored. A file containing the word id–word–document frequency mapping for each word in the vocabulary is then saved on disk. If a document frequency value of t is not found in the mapping, idf_t is set to 0 by default. The final score of a term t in an article A and claim c_i is set to the following value:

Score_{t,A,c_i} = tw_{t,A,c_i} × idf_t

The vector of scores for each term is scaled to the range between 0 and 1, so that the query formulation process may be completed on the same scale for every claim.
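The scoring described above can be sketched in Python as follows; the per-sentence token lists, the document frequency dictionary, and the helper names are illustrative assumptions rather than the tool’s actual interface.

    import math

    def claim_weights(sentences, claim_type="B"):
        # cw_{t,c_i} from the last three sentences of the claim (lists of tokens,
        # ordered from first to last), plus the introductory sentence for "F" claims
        cw = {}
        if sentences:
            for t in set(sentences[-1]):
                cw[t] = 1.0                                   # x component (binary)
        for sentence, coeff in ((sentences[-2:-1], 0.36), (sentences[-3:-2], 0.216)):
            for t in (sentence[0] if sentence else []):
                cw[t] = cw.get(t, 0.0) + coeff                # y and z components (occurrence counts)
        if claim_type == "F" and len(sentences) > 3:
            for t in sentences[0]:
                cw.setdefault(t, 0.7)                         # w component for intro-only terms
        return cw

    def term_scores(cw, article_tf, df, n_articles):
        # Score_{t,A,c_i} = tw_{t,A,c_i} * idf_t, with both vectors scaled by sum division
        tw = {t: w * math.log(article_tf.get(t, 1) + 14, 15) for t, w in cw.items()}
        total = sum(tw.values()) or 1.0
        tw = {t: v / total for t, v in tw.items()}
        scores = {t: (tw[t] * math.log(n_articles / df[t]) if df.get(t) else 0.0) for t in tw}
        total = sum(scores.values()) or 1.0
        return {t: v / total for t, v in scores.items()}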


4.2.3 Keyword Extraction

Once the domain of terms and their respective scores in c_i have been acquired, the question becomes how to utilize the scoring to determine which terms should be used as elements of the query. Perhaps the simplest idea is to define a threshold value and include only those terms whose score is greater than this threshold. However, precisely defining an optimal threshold value in this scenario is difficult. In my attempts to find the optimal value experimentally, modifying the threshold often resulted in better output queries for certain claims and worsened the outputs for others, with little effect on the algorithm’s overall performance. This approach also does not take into account the distribution of term score values in the vector.

Since the score vectors as described in the previous section are scaled to the interval [0, 1] by dividing each score in the vector by the overall sum of the scores, the resulting vector has the property that the sum of all its scores equals 1. This means that if the original score vector contained many similar values with no high-valued outliers, the scaled vector’s values will generally be low. On the other hand, if few high values were found in the original vector with many low-valued scores, the low-valued scores will be lowered even more significantly in the resulting vector due to the outliers’ presence. The count of terms in the original vector also affects the resulting vector, such that vectors with more scores included will generally result in smaller score values and vice versa. This property of the score vector implies it is impossible to provide a strictly defined threshold value which would work well for any score vector.

To address this, I decided to implement a method which analyzes the distribution of values in the score vector and sets the threshold based on this distribution. The method computes the value of each decile of the set of scores, i.e. the 0th, 10th, ..., 100th percentile. It then analyzes the differences of score values between the closest deciles. This provides a rough idea of the distribution of the values. Let us define the breaking point bp of a set of score values as 0.1 * n if the difference between the (n + 1)th and nth decile is the maximum of

the differences among the closest deciles, i.e.:

bp = 0.1 * n, where n ∈ {0, ..., 9} maximizes decile_{n+1} − decile_n

This bp serves as the basis to determine the threshold value, following the intuition that the point at which the values of term scores decrease most significantly represents the value which distinguishes between the most relevant and least relevant terms. Experiments have shown that this value actually leaves out terms that might make queries more similar to human defined ones. For this reason, bp is lowered by a value which is dependent on the overall spread of scores in the set and the ratio of this spread and the maximum difference among closest deciles. The exact algorithm, consisting of 4 methods of a Claim class, which serves as a wrapper for a given claim during the query formulation process, can be found in Appendix C; a simplified sketch of the breaking point computation is shown after the list below. Both the overall spread and the ratio of the maximum difference of values at two closest deciles and the spread are assigned one of seven labels based on their actual value. The computation attempts to capture the following intuition:

∙ Score sets with a large overall spread, where there are few outliers responsible for this spread, should take into consideration more of the lower-valued terms, so that the outliers do not dominate the resulting query. Sets with a large overall spread in which outliers are hard to determine should take fewer of the lower-valued terms into account, since there are already enough highly scored terms available. Generally, the smaller the range of values in the set, the closer the breaking point bp is to the actual threshold value, so it should not be decreased too much.
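As an illustration, a minimal version of the breaking point computation, without the spread-based lowering performed by the Claim class in Appendix C, could look like this (the decile computation by linear indexing is an assumption):

    def breaking_point(scores):
        # bp = 0.1 * n for the n with the largest difference between consecutive deciles
        ordered = sorted(scores)
        deciles = [ordered[min(int(round(n / 10 * (len(ordered) - 1))), len(ordered) - 1)]
                   for n in range(11)]
        diffs = [deciles[n + 1] - deciles[n] for n in range(10)]
        n_max = max(range(10), key=lambda n: diffs[n])
        return 0.1 * n_max

    scores = [0.01, 0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.20, 0.30, 0.34]
    print(breaking_point(scores))   # the largest jump lies between the 7th and 8th deciles, giving bp = 0.1 * 7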

After obtaining the actual threshold value for a given score vector, the keywords with values higher than the threshold are included in the final query. The keywords are included in the order in which their score decreases. At most 8 of the terms with the highest score are included. As the order in which the terms appear in the query matters, they are then sorted in the ordering in which they appear


in the original claim to more closely resemble a natural language construct. While one of the assumptions I make during the query formulation process, which has proven to be true for human-defined queries, is that it is sufficient that the resulting query contain only terms found in the original claim, it has proven useful to include the original Wikipedia article title in the query generated by the tool to provide additional topical context. Only terms that are not already included in the query as constructed are added, to avoid duplication of terms. The terms found in the article title are put at the beginning of the query, as they serve as the introduction to the topic, and as such, seem natural to be input first.
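A compact sketch of this selection and ordering step, assuming the score dictionary from the previous section (keyed by tokens of the claim) and simple whitespace tokenization of the claim and the article title:

    def build_query(claim_text, title, scores, threshold, max_terms=8):
        # keep at most max_terms keywords above the threshold, ordered as in the claim,
        # and prefix them with article title terms not already present
        selected = sorted((t for t, s in scores.items() if s > threshold),
                          key=lambda t: scores[t], reverse=True)[:max_terms]
        claim_tokens = claim_text.split()
        ordered = sorted(selected, key=lambda t: claim_tokens.index(t))   # claim order
        title_terms = [t for t in title.split() if t.lower() not in ordered]
        return " ".join(title_terms + ordered)

    print(build_query("the fellowship was founded out of a desire to remain evangelical",
                      "Fellowship of Evangelical Bible Churches",
                      {"fellowship": 0.30, "founded": 0.25, "desire": 0.20, "remain": 0.05},
                      threshold=0.1))
    # -> of Evangelical Bible Churches fellowship founded desire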

4.3 Web Search Engine

When a query is formulated, a request containing it is sent to the Bing Web Search API. Although, as of November 2017, Google is by far the leading web search engine provider [16], and, as explained in Chapter 5 of this work, often returns more relevant results for the queries formulated by the previously described process, it discontinued its Web Search API in 2014, and scraping of its sites is prohibited, as it falls into the category of misuse of their services as described in their Terms of Service [26].

Bing is the second most popular search engine as of November 2017 [16], and provides an extensive API for users to access their web search services. The API is a commercial product and users are required to provide an authentication key within their requests. This API key must be provided by the user of the tool via its configuration file. It is possible that the tool will be modified in the future to make use of the FAROO Web Search API, which provides 1 million free queries per month, when the user does not provide a valid Bing Web Search API key, but further testing and evaluation of the results will have to be performed.
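A minimal request to the API could look like the sketch below; the endpoint URL, header name, and response fields correspond to version 7 of the Bing Web Search API as documented at the time of writing and should be treated as assumptions to verify against the current documentation.

    import requests

    BING_ENDPOINT = "https://api.cognitive.microsoft.com/bing/v7.0/search"

    def bing_search(query, api_key, count=20):
        # send a free text query to the Bing Web Search API and return the web page results
        headers = {"Ocp-Apim-Subscription-Key": api_key}
        params = {"q": query, "count": count}
        response = requests.get(BING_ENDPOINT, headers=headers, params=params)
        response.raise_for_status()
        return response.json().get("webPages", {}).get("value", [])

    # for result in bing_search("fellowship evangelical bible churches founded", "YOUR_API_KEY"):
    #     print(result["name"], result["url"])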

4.4 Website Relevance Assessment

When a response is acquired from the Bing Web Search API, the first 20 retrieved web pages are evaluated further and any retrieved website which is deemed not to be a good candidate source is ignored. A good candidate source is defined as a web page whose textual content is neither too similar nor too dissimilar to the text of the Wikipedia article in which the corresponding claim was found. Furthermore, if the paragraph most similar to the claim text is deemed to be too similar in its content to the text of the claim, the page is evaluated as a bad source, as it is probable that the text was actually copied from Wikipedia and thus does not represent a useful reference. Precisely, a page p is considered to be a good candidate source if the cosine similarity sim(A, p) of its textual content and the parsed Wikipedia article satisfies 0.4 < sim(A, p) < 0.95, and the cosine similarity of the claim text and the paragraph deemed most similar to it is less than 0.9, i.e.:

max_{p∈P} sim(c_i, p) < 0.9,

where P is the set of paragraphs found in the retrieved page. These conditions were again arrived upon experimentally, evaluating the precision of the retrieved responses for a given query, and the similarity values for pages containing text obviously copied from Wikipedia. The textual content of the retrieved pages is extracted using the jusText library developed by the Natural Language Processing Centre of Masaryk University in Brno [27], and the cosine similarities are computed by a utility function provided by gensim.
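Expressed as a small Python predicate, with the thresholds given above and a cosine similarity function assumed to wrap the gensim utility:

    ARTICLE_SIM_MIN, ARTICLE_SIM_MAX = 0.4, 0.95
    PARAGRAPH_SIM_MAX = 0.9

    def is_good_candidate(article_vec, claim_vec, page_vec, paragraph_vecs, cosine_sim):
        # article_vec, claim_vec, page_vec, paragraph_vecs: BOW vectors of the parsed texts
        # cosine_sim: function computing the cosine similarity of two vectors
        page_sim = cosine_sim(article_vec, page_vec)
        if not (ARTICLE_SIM_MIN < page_sim < ARTICLE_SIM_MAX):
            return False    # too dissimilar to the article, or likely a copy of it
        most_similar = max((cosine_sim(claim_vec, p) for p in paragraph_vecs), default=0.0)
        return most_similar < PARAGRAPH_SIM_MAX   # closest paragraph must not copy the claim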

4.4.1 Copycat Websites

It comes as no surprise that the document ranked most relevant to queries formulated by the tool by both the Bing and Google search engines is almost exclusively the original Wikipedia article from which the claim was retrieved. More surprising is the number of websites I have encountered during the tool’s development which seem to create most of their content by copying Wikipedia articles and posting them on the web. I refer to them as “copycat websites”, “blacklisted sites”, or


“skipsites” in the code. Often, they represent most of the top ten pages retrieved by both the Bing and Google services. A list of some of the sites most frequently identified as too similar in content to a Wikipedia article can be found in Appendix B.

In order to account for this, I have implemented a system which attempts to detect these sites and puts them on a blacklist, which is then used to expand the query. Both Bing and Google offer operators that allow the user to exclude certain domains from the retrieved results. In the case of Bing, such a domain is defined by including a phrase of the format NOT site:domain, where domain is the domain of the site to be excluded. Google users can include a phrase of the format -site:domain, where domain has the same meaning as in the previous case. Sites that are deemed to be too similar in the website relevance assessment phase are put on the blacklist by being included in the tool’s configuration file in the [skipsites] section, which is parsed for domain names to be skipped before each request sent to the Bing API. Therefore, if such a site is found in a response retrieved for one query, it will not be included in the results of the queries that come after it. Since this approach sometimes puts a website on the blacklist even if the site may actually contain content that is not copied from Wikipedia, yet was wrongfully assessed as being a copycat site, the user can define sites that should always be included in the search even if they are found to be too similar in a certain case by setting the boolean value of the domain name, which serves as a key in the configuration file, to false.

The Bing Web Search API imposes a limitation of 2,048 characters on the length of the GET requests sent to it. If the length surpasses this limit, the API returns a response with a status code of 404 - Not Found [28]. In order to prevent this from happening, the query sent in the request is limited to the length of 1,400 characters. This means that if the inclusion of a blacklisted site were to cause the query to become longer than this, no further sites will be included in it. As blacklisted sites are included in the queries in the order in which they appear in the configuration file, it is up to the user to make sure that the sites which should definitely be ignored are placed at the beginning of the [skipsites] section.
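A sketch of how the exclusion operators could be appended to a query while respecting the 1,400-character limit; the skipsites dictionary mirrors the [skipsites] configuration section and the function name is illustrative:

    MAX_QUERY_LENGTH = 1400

    def expand_with_blacklist(query, skipsites):
        # append "NOT site:domain" clauses (Bing syntax) for blacklisted domains,
        # in configuration file order, while the query stays under the length limit
        for domain, blacklisted in skipsites.items():
            if not blacklisted:
                continue                      # the user forced this domain to stay included
            clause = " NOT site:" + domain
            if len(query) + len(clause) > MAX_QUERY_LENGTH:
                break                         # adding further sites would exceed the limit
            query += clause
        return query

    print(expand_with_blacklist("fellowship evangelical bible churches founded",
                                {"copycat.example.org": True, "trusted.example.com": False}))
    # -> fellowship evangelical bible churches founded NOT site:copycat.example.org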

4.5 Report Generation

The report is generated from the retrieved responses in the format of an HTML document. The document presents the user with a list of articles searched for unsubstantiated claims. Upon clicking an article, a summary of unsubstantiated claims found within it is shown, and the user may further expand each to show the retrieved pages that were evaluated to be good source candidates. To give the user an opportunity to assess at a glance which page may represent the best option for a source, the page title, the summary of relevant text provided by Bing, and the paragraph deemed by the tool to be the most similar to the claim are presented to them. The user may navigate to the original Wikipedia article in which the claims were found and to any of the good candidate sources retrieved. As Google search often offers better results than the Bing API, a link to a Google search results page for the given query is constructed as well, including the list of sites to be ignored, so that the user has an opportunity to easily browse these results too. The HTML report is generated with the help of the Jinja2 library, which provides a templating language for Python modelled after Django templates [29].

4.6 Tool Usage

Before the user is able to search for unsubstantiated claim sources, they are required to first create a dictionary mapping the terms of a Wikipedia article dump in the language of their choice to their respective document frequencies across the dump. This is handled by running the preparefinder.py script. The script expects two arguments for its input: the location of the Wikipedia dump in bzip2-compressed format, and the path prefix for the output files. The script creates two files on disk, prefix_wordids.txt.bz2 containing the dictionary mapping, and prefix_wikifinder.cfg, the tool’s configuration file described in detail in Section 4.6.1, where prefix is the path prefix passed to the script as the second argument.

This process may take several hours to complete depending on the size of the article dump. In the case of the full English language


Wikipedia dump, it takes approximately 3.5 hours on a Lenovo IdeaPad Y500 laptop. For the Czech Wikipedia, only approximately 10 minutes are needed on the same machine.

To search for claims and their sources in a given set of Wikipedia articles, the user then runs the runfinder.py script with two arguments: the location of a bzip2-compressed dump of the articles of their choice, and the location of the configuration file created by the preparefinder.py script. The dump of user-defined Wikipedia articles can be easily obtained using the Special:Export Wikipedia page, which provides the means to create a dump of articles based on a list of entered article titles [30].

The running time of the process depends on the number of articles included in the user provided dump, and more importantly on the number of claims found in the articles. As the first 20 retrieved results for a given claim’s query are accessed to determine the similarity of the page to the article and claim, a higher number of claims results in more network communication and the delay associated with it. Thus, the running time also depends on the speed of the internet connection available. On average, the time it takes to evaluate a single query is approximately 1.8 minutes. About a minute is additionally required to load the whole word to document frequency mapping into memory.

4.6.1 Configuration File

The tool requires a configuration file to define the properties needed to run a search. The Python Standard Library provides a basic configuration file parser language with a structure similar to Microsoft Windows INI files. I use this default structure and parser for the definition of the file. The configuration file consists of 4 sections, general, citation-needed, quote, and skipsites, each led by a [section] header, where section is the respective name of a section. Each section contains multiple name=value entries (also called options), where name is the key of the entry and value is the corresponding value. The general section contains three options:

∙ articlecount, holding the overall number of articles that were processed during the term–document frequency mapping creation


∙ wordids_path, holding the absolute path to the file containing the dictionary mapping

∙ bing_api_key, which holds the authentication key to the Bing Web Service API the user is required to provide

The first two options are included automatically upon the creation of the file by the preparefinder.py script. The last option is assigned the value of “none” by default, and the user needs to edit it with the value of their API key.

The citation-needed section contains the allowed aliases for the citation needed template. The template and template parameter names depend on the corresponding Wikipedia language domain. Even in the English Wikipedia, there are 5 shortcut aliases in addition to citation needed for the definition of the template. As such, the user is required to declare all aliases that should be understood as defining a citation needed template in this section. The options have the format of name = bool, where name is the alias string, and bool is either true, if the option should be treated as a template alias, or false otherwise. By default, the created configuration file contains all the English aliases for the template. If the user wishes to use the tool on a different language Wikipedia dump, they should edit this section to declare the names used in the language of their need.

The quote section holds two options: quote, whose value should be the language dependent name of the quote template, and text, which defines the language dependent name of the text parameter of the template. By default, these are set to the English name values for the template and parameter.

The skipsites section contains the domains determined to contain copied articles from Wikipedia, in the same format as in the citation-needed section. This section gets automatically updated during individual runs of the tool, but may also be edited manually by the user to predefine sites which should be ignored during the search (with a value of true), or should not be included in the list of ignored sites (by defining their value to be false). The user-defined values will not be modified by the tool. This section is pre-populated with values of known problematic sites when the file is first created.
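For illustration, a configuration file with the structure described above could be created and read with the standard configparser module as follows; the concrete option values are examples only, not the defaults written by preparefinder.py.

    import configparser

    config = configparser.ConfigParser()
    config["general"] = {
        "articlecount": "4250000",                       # example value
        "wordids_path": "/home/user/enwiki_wordids.txt.bz2",
        "bing_api_key": "none",                          # replace with a valid Bing Web Search API key
    }
    config["citation-needed"] = {"citation needed": "true", "cn": "true"}
    config["quote"] = {"quote": "quote", "text": "text"}
    config["skipsites"] = {
        "copycat.example.org": "true",                   # always excluded from the search
        "trusted.example.com": "false",                  # never excluded, even if flagged as too similar
    }

    with open("example_wikifinder.cfg", "w") as f:
        config.write(f)

    config.read("example_wikifinder.cfg")
    print(config.getboolean("skipsites", "copycat.example.org"))   # True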

5 Performance Evaluation

To evaluate the tool’s performance, two testing sets of Wikipedia articles have been created in the same manner as the one described in Section 4.2. The first set was created from English Wikipedia articles, totaling 50 entries. The word–document frequency mapping was created from an English Wikipedia article dump created on August 8, 2017. The second set consists of 20 Czech Wikipedia articles, and the dictionary mapping was created from a Czech article dump created on November 3, 2017.

5.1 Evaluation Metric

The tool’s success rate was computed separately for the Bing API responses pruned and returned by the tool as well as the Google response for the same query, which was not subject to good candidate source evaluation based on page to article similarity. In order for a source search to be deemed successful, at least one of the retrieved web documents presented to the user must have either contained factual information supporting or refuting the claim, which was not obviously copied from the original article, or presented a quick and straightforward way to obtain this information, in the form of a search field or a link. In the case of the results retrieved by Google, such a document must have been found within the first 10 results.

5.2 Results

On the English set, the tool was deemed to be successful in retrieving such a document in 23 cases, resulting in a success rate of 46%. On the same set, Google was found to perform much better, with 31 successful cases, that is, a 62% success rate. On the Czech test set, however, the difference between the two search engines is much more significant. While Google almost retains its success rate, returning a successful response in 60% of the cases, the Bing response with pruning of similar and dissimilar pages does not perform very well, with only 15% of successful responses. Since the

31 5. Performance Evaluation ideas formulated in this work were tested solely on English texts, a decrease in performance is not unexpected, but a margin this large requires an explanation. There may be several reasons for this difference. First, the approach to the query formulation may not generalize well to the Czech lan- guage. This seems unlikely however, as the Google engine success rate is very similar on both test sets. Second, the Bing web search engine does not perform well on Czech language queries. To test this hypothesis, I have tried to formulate queries by hand for several claims from the Czech test set. I have found it particularly hard to create a query which would return relevant documents other than the original Wikipedia articles. This may be only a part of the problem. I have also noticed that even some queries which would return a relevant response on their own failed to do so when operators to not include certain sites were present in the query, even though the relevant site was not included in the list of the sites to ignore. It is also possible that the cosine similarity based pruning of retrieved pages actually has a detrimental effect on the result, decreasing recall in the process. While this is a possibility, skipping the pruning did not result in a large increase of the success rate, with only a single response retriev- ing a relevant response in addition to the already retrieved responses, increasing the success rate to 20%. Based on these findings I believe the major problem comes from the way the Bing search engine handles Czech queries, and possibly how much of the Czech language web it has indexed. It should also be noted that the evaluation of the tool’s performance is further made more difficult by the fact that both the Google and Bing search engines’ responses may vary in time for the same query. This means that the tool shows a slightly different success rate depending on the time at which the responses were retrieved. To obtain the results described, I have run the tool on a single day for both test sets, and evaluated the response immediately after completion.

6 Summary

It is true that the success rate of the tool using the Bing Web Search API with the cosine similarity based pruning strategy applied is far from satisfactory. Even for English, the tool is successful in less than 50% of cases. The responses retrieved from the Google search engine for the same queries are slightly more promising, with a relevant response retrieved in almost two thirds of cases on both the English and Czech test sets. Given that the domain of documents searched, i.e. the World Wide Web, is extremely large and diverse, I consider the Google results a mild success.

There is still a lot of room for improvement, however. Currently, I believe the running time of the source search represents the major issue. As mentioned in Section 4.6, it takes approximately 1.8 minutes to process a single claim. Since claims are processed sequentially, a run over a set of 20 articles may take anywhere from half an hour to several hours, depending on the number of unsubstantiated claims found within them. This means the tool needs to be run long before the user intends to go through the resulting report and search for sources. The running time could be improved significantly by parallelizing the processing of individual claims; a sketch of this idea follows at the end of this chapter.

I believe that this work may serve as a proof of concept for the idea that free text queries can be created from natural language text using simple language independent tools. Furthermore, I believe that the results could be improved by using a machine learning approach to train models for keyword extraction, similar to the work of Cartright et al. mentioned in Chapter 2. Using LSI to create a topical representation of a given article, and using this information to modify the query creation process, might also have a positive effect on the success rate. A different choice of features during keyword extraction, such as n-grams (contiguous sequences of n items found in the text), should also be explored. Introducing stemming or lemmatization could improve the tool's success rate as well. While most existing stemmers and lemmatizers rely on language dependent rules, attempts have recently been made at the creation of automated language independent stemmers [31, 32].
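As an illustration, the sequential loop over claims could be replaced by a worker pool. The sketch below assumes a hypothetical process_claim function wrapping query formulation, the web search request and the relevance assessment for a single claim; it is not part of the current tool.

from concurrent.futures import ThreadPoolExecutor

def process_claims_in_parallel(claims, process_claim, max_workers=8):
    # process_claim is a hypothetical callable handling one claim end to end.
    # Threads are a reasonable choice here because most of the running time is
    # spent waiting for network responses from the search API, not on CPU work.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_claim, claims))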


Bibliography

1. Wikipedia.org Traffic, Demographics and Competitors — Alexa [online]. Alexa Internet, 2017 [visited on 2017-04-18]. Available from: http://www.alexa.com/siteinfo/wikipedia.org.
2. Wikipedia:Size of Wikipedia — Wikipedia, The Free Encyclopedia [online]. Wikipedia, 2017 [visited on 2017-04-18]. Available from: https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia.
3. Wikipedia:Wikipedians — Wikipedia, The Free Encyclopedia [online]. Wikipedia, 2017 [visited on 2017-04-18]. Available from: https://en.wikipedia.org/wiki/Wikipedia:Wikipedians.
4. Wikipedia:Verifiability, not truth — Wikipedia, The Free Encyclopedia [online]. Wikipedia, 2017 [visited on 2017-04-18]. Available from: https://en.wikipedia.org/wiki/Wikipedia:Verifiability,_not_truth.
5. Word of the Year 2016 is... | Oxford Dictionaries [online]. Oxford University Press, 2017 [visited on 2017-04-18]. Available from: https://en.oxforddictionaries.com/word-of-the-year/word-of-the-year-2016.
6. Wikipedia:Bots — Wikipedia, The Free Encyclopedia [online]. Wikipedia, 2017 [visited on 2017-04-18]. Available from: https://en.wikipedia.org/wiki/Wikipedia:Bots.
7. Help:Wiki markup — Wikipedia, The Free Encyclopedia [online]. Wikipedia, 2017 [visited on 2017-04-04]. Available from: https://en.wikipedia.org/wiki/Help:Wiki_markup.
8. gensim: Topic modelling for humans [online]. Radim Řehůřek, 2017 [visited on 2017-04-04]. Available from: https://radimrehurek.com/gensim/.
9. KIM, Kwanho; CHUNG, Beom-Suk; CHOI, Yerim; LEE, Seungjun; JUNG, Jae-Yoon; PARK, Jonghun. Language Independent Semantic Kernels for Short-text Classification. Expert Systems with Applications. 2014, vol. 41, no. 2, pp. 735–743. ISSN 0957-4174. Available from DOI: 10.1016/j.eswa.2013.07.097.


10. MANNING, Christopher D.; RAGHAVAN, Prabhakar; SCHÜTZE, Hinrich. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008. ISBN 9780521865715.
11. KOKIOPOULOU, Effrosyni; SAAD, Yousef. Polynomial Filtering in Latent Semantic Indexing for Information Retrieval. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Sheffield, United Kingdom: ACM, 2004, pp. 104–111. SIGIR ’04. ISBN 1-58113-881-4. Available from DOI: 10.1145/1008992.1009013.
12. DEERWESTER, Scott; DUMAIS, Susan T.; FURNAS, George W.; LANDAUER, Thomas K.; HARSHMAN, Richard. Indexing by latent semantic analysis. Journal of the American Society for Information Science. 1990, vol. 41, no. 6, pp. 391–407. ISSN 1097-4571. Available from DOI: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.
13. CARTRIGHT, Marc-Allen; FEILD, Henry A.; ALLAN, James. Evidence Finding Using a Collection of Books. In: Proceedings of the 4th ACM Workshop on Online Books, Complementary Social Media and Crowdsourcing. Glasgow, Scotland, UK: ACM, 2011, pp. 11–18. BooksOnline ’11. ISBN 978-1-4503-0961-5. Available from DOI: 10.1145/2064058.2064063.
14. Lemur Project Components: Galago [online]. Lemur Project, 2016 [visited on 2017-05-27]. Available from: https://www.lemurproject.org/galago.php.
15. METZLER, Donald; CROFT, W. Bruce. A Markov Random Field Model for Term Dependencies. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Salvador, Brazil: ACM, 2005, pp. 472–479. SIGIR ’05. ISBN 1-59593-034-5. Available from DOI: 10.1145/1076034.1076115.
16. Search engine market share [online]. Net Applications, 2017 [visited on 2017-11-26]. Available from: https://www.netmarketshare.com/search-engine-market-share.aspx?qprid=4&qpcustomd=0.


17. AMIN, Alia; ZHANG, Junte; CRAMER, Henriette; HARDMAN, Lynda; EVERS, Vanessa. The Effects of Source Credibility Ratings in a Cultural Heritage Information Aggregator. In: Proceedings of the 3rd Workshop on Information Credibility on the Web. Madrid, Spain: ACM, 2009, pp. 35–42. WICOW ’09. ISBN 978-1-60558-488-1. Available from DOI: 10.1145/1526993.1527003.
18. SONDHI, Parikshit; VYDISWARAN, V. G. Vinod; ZHAI, ChengXiang. Reliability Prediction of Webpages in the Medical Domain. In: Proceedings of the 34th European Conference on Advances in Information Retrieval. Barcelona, Spain: Springer-Verlag, 2012, pp. 219–231. ECIR ’12. ISBN 978-3-642-28996-5. Available from DOI: 10.1007/978-3-642-28997-2_19.
19. PATTANAPHANCHAI, Jarutas; O’HARA, Kieron; HALL, Wendy. Trustworthiness Criteria for Supporting Users to Assess the Credibility of Web Information. In: Proceedings of the 22nd International Conference on World Wide Web. Rio de Janeiro, Brazil: ACM, 2013, pp. 1123–1130. WWW ’13 Companion. ISBN 978-1-4503-2038-2. Available from DOI: 10.1145/2487788.2488132.
20. AGGARWAL, Sonal; VAN OOSTENDORP, Herre. An attempt to automate the process of source evaluation. ACEEE International Journal on Communication. 2011, vol. 2, no. 2, pp. 18–20. ISSN 2158-7558.
21. AGGARWAL, Sonal; VAN OOSTENDORP, Herre; REDDY, Y. Raghu; INDURKHYA, Bipin. Providing Web Credibility Assessment Support. In: Proceedings of the 2014 European Conference on Cognitive Ergonomics. Vienna, Austria: ACM, 2014, 29:1–29:8. ECCE ’14. ISBN 978-1-4503-2874-6. Available from DOI: 10.1145/2637248.2637260.
22. MANNING, Christopher D.; SCHÜTZE, Hinrich. Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press, 1999. ISBN 0-262-13360-1.
23. BRANTS, Thorsten. Natural Language Processing in Information Retrieval. Inf. Process. Manage. 1986, vol. 26, pp. 19–20.
24. Wikipedia:Special:Random - Wikipedia [online]. Wikipedia, 2017 [visited on 2017-11-25]. Available from: https://en.wikipedia.org/wiki/Wikipedia:Special:Random.


25. RANDOM.ORG - True Random Number Service [online]. Randomness and Integrity Services Ltd., 2017 [visited on 2017-11-25]. Available from: https://www.random.org/.
26. Google Terms of Service - Privacy & Terms - Google [online]. Google, 2017 [visited on 2017-11-26]. Available from: https://www.google.com/policies/terms/.
27. Justext - Corpus tools [online]. Masaryk University and Lexical Computing CZ, s.r.o., 2017 [visited on 2017-11-26]. Available from: http://corpus.tools/wiki/Justext.
28. Bing Web Search API v5 Reference | Microsoft Docs [online]. Microsoft, 2017 [visited on 2017-11-26]. Available from: https://docs.microsoft.com/en-us/rest/api/cognitiveservices/bing-web-api-v5-reference.
29. Welcome to Jinja2 [online]. Armin Ronacher, 2017 [visited on 2017-11-26]. Available from: http://jinja.pocoo.org/docs/2.10/.
30. Help:Export - Wikipedia [online]. Wikipedia, 2017 [visited on 2017-11-25]. Available from: https://en.wikipedia.org/wiki/Help:Export.
31. KASTHURI, M.; KUMAR, S. B. R.; KHADDAJ, S. PLIS: Proposed Language Independent Stemmer for Information Retrieval Systems Using Dynamic Programming. In: 2017 World Congress on Computing and Communication Technologies (WCCCT). 2017, pp. 132–135. Available from DOI: 10.1109/WCCCT.2016.39.
32. BOYER, Célia; DOLAMIC, Ljiljana; FALQUET, Gilles. Language Independent Tokenization vs. Stemming in Automated Detection of Health Websites’ HONcode Conformity: An Evaluation. Procedia Computer Science. 2015, vol. 64, no. Supplement C, pp. 224–231. ISSN 1877-0509. Available from DOI: 10.1016/j.procs.2015.08.484. Conference on ENTERprise Information Systems / International Conference on Project MANagement / Conference on Health and Social Care Information Systems and Technologies, CENTERIS/ProjMAN/HCist 2015, October 7–9, 2015.

A Project Structure

The following files constitute the tool developed in this thesis. They may be found in the Information System of Masaryk University in Brno, or at https://github.com/Vocco/wikifinder.

∙ bingapi.py - used to connect to the Bing Web Search API

∙ claims.py - a collection of utilities to handle claims found in a Wikipedia article

∙ finderwikicorpus.py - used to parse Wiki markup, based on gensim’s wikicorpus.py

∙ preparefinder.py - run by the user to create a word–document frequency mapping for terms found in a Wikipedia article dump

∙ report_generator.py - used to generate the final report for a single run

∙ runfinder.py - run by the user to search for claims and their sources in a given set of Wikipedia articles

∙ templates/template.html - an HTML template for the final report generation


B Blacklisted Sites

A list of second-level domains of web pages whose content has frequently been deemed too similar to the source Wikipedia article follows (a sketch of how such a blacklist might be applied is given after the list):

∙ revolvy.com

– Appears under many aliases which redirect to the same site, e.g. rvlvy.com, revolvy.co, revolvy.net

∙ wow.com

∙ wikivisually.com

∙ digplanet.com

∙ everipedia.com

∙ wikia.com

∙ explained.today

∙ infogalactic.com

∙ wikiomni.com

∙ jsonpedia.org

∙ sensagent.com

∙ my-definitions.com

∙ thefullwiki.org

∙ pediaview.com
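The sketch below illustrates how a retrieved URL might be matched against such a blacklist by its second-level domain. It is an illustration under that assumption, not the tool's exact implementation, and the BLACKLIST set shown holds only a few of the domains listed above.

from urllib.parse import urlparse

# Illustrative subset of the blacklist; the full list is given above.
BLACKLIST = {"revolvy.com", "wow.com", "wikivisually.com", "explained.today"}

def is_blacklisted(url):
    host = urlparse(url).hostname or ""
    # Keep only the last two labels of the host name,
    # e.g. "www.revolvy.com" -> "revolvy.com".
    second_level = ".".join(host.split(".")[-2:])
    return second_level in BLACKLIST

print(is_blacklisted("https://www.revolvy.com/page/Example"))  # True
print(is_blacklisted("https://www.example.org/"))              # False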


C Term Score Cutoff Estimation

The following algorithm determines the cutoff value against which the score of a term extracted from a claim is compared when deciding whether the term should be included in the corresponding query.

import numpy

def __get_cutoff(self, tfidfs):
    """Returns the TFIDF percentile at which to cut off the addition of keywords.

    The function analyzes the rate at which the TFIDF values of different
    tokens decrease between the 90th and 100th percentile, the 80th and
    90th percentile, and so on down to the difference between the 0th and
    10th percentile.

    It evaluates the overall spread of the values, finds the percentile
    interval at which the rate of decrease is highest and determines the
    ratio of this rate and the overall spread. Based on these values, it
    returns a percentile at which the addition of new keywords to the
    query should be stopped.

    Args:
        tfidfs: A list of TFIDF values for the tokens to be included in
            the query.

    Returns:
        The percentile at which the addition of more keywords to the
        query should be stopped.
    """
    perc_value = 90
    cumulative = 0
    cur_max = 0
    position = 0

    while perc_value >= 0:
        diff = (numpy.percentile(tfidfs, perc_value + 10)
                - numpy.percentile(tfidfs, perc_value))
        cumulative += diff
        if diff > cur_max:
            position = perc_value
            cur_max = diff

        perc_value -= 10

    # get the amount of spread between the TFIDF values
    spread = self.__get_tag(cumulative)

    # get the ratio of the rate at the point of highest descent of the
    # TFIDF values and the overall spread
    break_rate = self.__get_tag(cur_max / cumulative)

    # get the percentile at which to cut off the addition of keywords with
    # respect to the point with the highest rate of descent, the rate, and
    # the overall spread of the TFIDF values
    cutoff = self.__get_strategy(spread, position, break_rate)

    return cutoff

def __get_tag(self, value):
    """Returns the tag value of a given spread/descent rate of TFIDF values.

    There are seven categories defined.

    Args:
        value: The rate of spread/descent of TFIDF values.

    Returns:
        The tag of the category to which the rate belongs.
    """
    if value > 0.85:
        tag = self.TAGS[0]
    elif value > 0.7:
        tag = self.TAGS[1]
    elif value > 0.6:
        tag = self.TAGS[2]
    elif value > 0.4:
        tag = self.TAGS[3]
    elif value > 0.3:
        tag = self.TAGS[4]
    elif value > 0.15:
        tag = self.TAGS[5]
    else:
        tag = self.TAGS[6]

    return tag

def __get_strategy(self, spread, breaking_point, break_rate):
    """Returns the percentile at which the adding of keywords should stop.

    Determines the percentile based on the point with the highest rate of
    decrease in the TFIDF values and the ratio of this rate of decrease
    and the overall spread.

    Note that the percentile computation was acquired empirically.

    Args:
        spread: The overall spread of TFIDF values of the potential
            keywords.
        breaking_point: An approximation of the percentile at which the
            TFIDF values decrease the most rapidly. Takes on values from
            {0, 10, 20, ..., 90}.
        break_rate: The ratio of the decrease slope at the point of
            highest decrease and the overall spread.
    """
    strategy = breaking_point

    coeff = [6, 5, 4, 4, 4, 3, 2]
    rates = [3, 3, 2, 2, 2, 1, 1]

    for i in range(0, 7):
        if spread == self.TAGS[i]:
            for j in range(0, 7):
                if break_rate == self.TAGS[j]:
                    strategy = self.__lower_by_max(
                        breaking_point, coeff[i] * (j + 1), rates[i])
                    break

    return strategy

@staticmethod
def __lower_by_max(breaking_point, maximum, change):
    """Lowers 'breaking_point' by at most 'maximum'.

    If the value of 'breaking_point' - 'maximum' is negative, tries
    decreasing 'maximum' by 'change' while 'maximum' remains
    non-negative.

    Args:
        breaking_point: The value to be decreased.
        maximum: The maximum value to attempt to subtract.
        change: The step by which the subtracted value is decreased on
            each attempt.

    Returns:
        The value of 'breaking_point' lowered by the largest possible
        amount that keeps 'breaking_point' non-negative. If this is not
        possible, returns the original value of 'breaking_point'.
    """
    while maximum >= 0:
        if breaking_point - maximum >= 0:
            return breaking_point - maximum

        maximum -= change

    return breaking_point
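The decile scan at the core of __get_cutoff can be tried in isolation. The standalone snippet below uses made-up TFIDF values and prints the size of each decile step, which is the quantity the method compares to find the steepest drop; it is an illustration only and not part of the tool.

import numpy

# Made-up TFIDF values for illustration only.
tfidfs = [0.42, 0.31, 0.30, 0.12, 0.11, 0.10, 0.09, 0.05]

for perc_value in range(90, -1, -10):
    diff = (numpy.percentile(tfidfs, perc_value + 10)
            - numpy.percentile(tfidfs, perc_value))
    print(perc_value, round(float(diff), 3))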
