Masaryk University Faculty of Informatics

Automatizované metody určování důležitosti pojmů pro procvičování

Bachelor’s Thesis

Norbert Slivka

Brno, Fall 2017

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Norbert Slivka

Advisor: doc. Mgr. Radek Pelánek, Ph.D.


Acknowledgement

I would like to thank my thesis advisor, doc. Mgr. Radek Pelánek, Ph.D., for his help and guidance, and the whole Adaptive Learning team for their valuable suggestions for improvements.

Abstract

This thesis looks for new ways to semi-automatically build an expert model for adaptive learning. The output is a list of recommendations on how to utilize different algorithms to find the expert model.

Keywords

Adaptive learning, term importance


Contents

1 Introduction 1

2 Background 3
2.1 Prior work ...... 3
2.1.1 Adaptive learning ...... 3
2.1.2 Adaptive Learning group at Masaryk University ...... 3
2.2 Equal term importance distribution hypothesis ...... 4
2.3 Knowledge discovery in databases ...... 4
2.3.1 Selection ...... 5
2.3.2 Pre-processing ...... 5
2.3.3 Transformation ...... 5
2.3.4 Data mining ...... 5
2.3.5 Evaluation ...... 6
2.4 Programming language choice ...... 6

3 Goals 7
3.1 Partial goals ...... 7

4 Adaptive learning data preprocessing 9
4.1 Anatomy terms ...... 9
4.1.1 Terminologia Anatomica ...... 10
4.2 Geography terms ...... 11
4.2.1 Knowledge Graph ...... 12

5 Term importance 15
5.1 Google Search ...... 15
5.2 Wikipedia ...... 16
5.2.1 Pagerank ...... 16
5.2.2 Pageviews ...... 18
5.2.3 Wordcounts ...... 19
5.3 GDELT ...... 20
5.4 PubMed Central ...... 22

6 Measurements and results 23
6.1 Term importance ...... 23
6.2 Final evaluation ...... 24

7 Conclusion 25

A Attachments 27

Bibliography 29

List of Tables

4.1 Anatom CSV table header explanation. 10
4.2 Geography CSV table header explanation. 12
4.3 Country terms that have to be manually created. 13


List of Figures

5.1 The number of books with full text-indexing over years, shown per country. The drop in 1920 is caused by copyright limiting the number of available data sources. 21


List of Listings

2.1 GCC C compiler flags ...... 6
4.1 The script to generate Anatom JSON file ...... 9
4.2 The script to convert the geography JSON into processable files. Only selects countries and cities ...... 11
4.3 API URL of Knowledge Graph ...... 13
4.4 Full format of the Wikipedia URL ...... 13
5.1 Example of a SQL row tuple in the page file ...... 17
5.2 Example of a SQL row tuple in the pagelinks file ...... 18
5.3 Pages-articles Wiki dump file ...... 19


1 Introduction

Adaptive learning is a modern teaching method which incorporates information technologies into the learning process in order to increase student motivation [1]. This technique is based on changing the difficulty of questions according to the individual level of the student. Modelling of the difficulty of the questions is done using a student and an expert model. The student model is based on the success rate of other students. The advantage of this approach is fast addition of new questions, but at the cost of inaccurate difficulty in the beginning. The expert model is created by an expert in the field assigning difficulty to questions. This approach is accurate from the roll out, but requires the input of an expert before new questions are added into the system.

However, a computer program can be used to replace the expert to determine the difficulty or importance of questions from external sources. Determining the importance of terms from large data collections is commonly utilized in other industry areas. When using such algorithms, the selection of datasets corresponding to the target group of students is crucial. The area interested in collecting large (>1 GB) datasets and performing queries on them is called data mining. Queries can be performed based on full-text or context aware search, depending on the nature of the data provided and the algorithms used. These techniques can be utilized to retrieve the importance of terms to be used in an expert model.

The Adaptive Learning group at the Faculty of Informatics of Masaryk University operates seven systems for adaptive learning that mainly use a student model to assign term difficulty. The two systems relevant for this work are Outline maps (outlinemaps.org¹) and Practice Anatomy (practiceanatomy.com²).

This work is split into seven chapters. The second chapter describes the background of this work: prior work, what importance is in the context of this paper and which terms from the adaptive systems are going to be used. It also contains a description of both the adaptive learning data sources and the processes used in this work.

1. Also available in Slovak and Czech at slepemapy.cz
2. Also available in Czech at anatom.cz

The third chapter outlines the goals of this work. All targets are described there and are addressed in the later chapters.

The fourth chapter describes the data cleaning process. As most data is not formatted or uses a different representation, it is necessary to perform this step to achieve better results.

The fifth chapter describes different sources to extract importance from. Every data source also comes with a description of the bias it introduces and for which target group of students it is more relevant. This chapter also describes how to extract the data from these datasets and what their limitations are.

Results are shown and discussed in the sixth chapter. Every hypothesis of a dataset bias is examined in the results and all relevant graphs with their respective descriptions are added.

The final, seventh chapter concludes all the work and suggests further usage in adaptive learning systems.

2 Background

2.1 Prior work

2.1.1 Adaptive learning

Adaptive learning is a modern education technique that teaches students based on their individual performance to increase motivation. Adaptive learning is used in many applications where the user interacts with the system in a non-trivial way, such as in computer games or distance studies. This approach is enabled by the introduction of information technology to education.

Computerized Adaptive Testing (CAT) is a system for revision that changes the difficulty of questions based on the student's ability to answer previous questions correctly. The system tries to achieve a preset success rate by changing the difficulty of the questions asked. This motivates the student and makes learning more entertaining [1].

2.1.2 Adaptive Learning group at Masaryk University

The Adaptive Learning group at Masaryk University is a small lab researching student and domain modeling, instructional policies, evaluation of educational systems and difficulty of problems [2]. The main systems operated by this lab are:

∙ outlinemaps.org — geography;

∙ practiceanatomy.com — anatomy for biology students;

∙ matmat.cz — mathematics;

∙ poznavackaprirody.cz — fauna and flora;

∙ umimecesky.cz — Czech language for native speakers;

∙ autoskolachytre.cz — tests for driving license;

∙ tutor.fi.muni.cz — logical puzzles.


The projects relevant for this work are Outline maps and Practice anatomy. Both use student modeling to adapt the difficulty of questions to the student's level.

Outline maps (also available as Slepé mapy for Czech and Slovak users) is a system where users are shown a map without labels with one area highlighted, and their task is to select a name for that area from a list, or, given a name, to select an area on the map. Areas can have different resolutions, such as countries, cities, rivers or mountains.

Practice anatomy (available as Anatom for Czech users) is an adaptive learning system that has two question modes. The first one is selecting the correct body part on a diagram of the human body, or selecting the correct name for a labeled part of a diagram. In the second mode, the user is presented with questions regarding body part relationships.

2.2 Equal term importance distribution hypothesis

The hypothesis proposed by Radek Pelánek (TODO) is formulated as follows: The resulting value for a term is the same independently of the chosen method.

The context of a method is expected to be the only cause of the differences between values generated by different methods. This implies that if a single method produces a value assignment different from the others, the results generated by that method should be ignored as they do not follow the context-less importance.

2.3 Knowledge discovery in databases

Data mining is an area of computer science whose overall goal is to manage raw data and extract useful information from it [3]. This process involves several steps:

∙ Selection

∙ Pre-processing

∙ Transformation

∙ Data Mining


∙ Evaluation [4]

2.3.1 Selection

Selection of the data sources is an essential first step. It is concerned with the selection of data sets that contain the relevant information and are large enough to ensure the variance in them will be negligible.

2.3.2 Pre-processing

Pre-processing the data is important to ensure every part of the data is normalised. This includes making sure any excess irrelevant data and noise is removed, and that the handling of incomplete data is decided. In text files, this step involves removing any markup language and normalising the encoding, and in graph files, it involves removing any excess data about the nodes and edges.

2.3.3 Transformation

The transformation step creates new data from the pre-processed data to ensure that only the relevant information is passed to the next step. In the case of text files, this involves replacing the text with tokens in a normalised format (e.g. removed capitalisation and conversion of nouns to their singular form and verbs to their infinitive form, or creating a bag of words). Graphs have to be converted to the format used by the data mining step (e.g. an adjacency matrix or a set of edges).
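As an illustration of this step for text data, the following minimal sketch lowercases a text, splits it into word tokens with a regular expression and builds a bag of words; the helper name and the simple tokenisation rule are only illustrative and the sketch skips lemmatisation.

import re
from collections import Counter

def to_bag_of_words(text):
    # Normalise capitalisation and split into word tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # The bag of words maps each token to its number of occurrences.
    return Counter(tokens)

print to_bag_of_words("Masaryk University, Faculty of Informatics")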

2.3.4 Data mining

Data mining is the step that finds relationships, clustering or any other previously unknown underlying property in the transformed data. These results have to be useful to draw conclusions from, as the following step interprets them. Examples of this step are the computation of a graph centrality measure or counting the number of occurrences of certain words.


2.3.5 Evaluation

The evaluation is the step that interprets the results of the mining step. These conclusions are the result of the research and are expected to be understandable and actionable. To correctly evaluate the results, good domain understanding is necessary. The human performing this step needs to understand the biases introduced by the previous steps and revise all the previous steps, if necessary, to remove any bias.

2.4 Programming language choice

Choice of a suitable programming language is important for this project. Because the programs are going to be run manually and are not interactive, a long runtime does not cause a problem. Therefore, Python 2.7 was chosen as the main language for this project. Other factors include good library support and common usage in data mining. For performance-critical parts, C was chosen because of its speed and ease of use compared to other lower-level languages (C++, Golang). Since the execution pipelines use different scripts and binaries, bash scripts are used to control the execution order. Usage of several files simplifies writing and maintenance and allows for catching errors and resuming computation after a script in a pipeline fails. Python scripts are executed directly using the Python 2.7 interpreter. C programs can be compiled using the GCC (GNU Compiler Collection) compiler with flags as in Listing 2.1.

gcc -std=c99 -g -o filename filename.c
Listing 2.1: GCC C compiler flags

3 Goals

The primary goal of this work is to utilise semi-automatic methods to generate an expert model used by an adaptive learning application. This means identifying possible sources of data, and processing and interpreting them. The requirements for the sources are that they have an academic or educational background, because the results are to be used for learning. They should be easily obtainable and free to use. Their processing should require minimal manual intervention, but some corrections can be made. Another goal of this work is to verify the hypothesis in Section 2.2.

Because the size of the datasets used should be large, the algorithms have to be efficient enough to be run manually. Their inputs have to be readily available online, but because of their sizes, downloading them before working with them and reusing the downloaded files for new terms would be the preferred approach.

3.1 Partial goals

The automatic character of processing data makes it possible to find incorrect values in the adaptive learning systems. These incorrect values should then be replaced in the source with the corrected values. Examples of such values will be included in the following chapters 1. Some of the incorrect values may also be no longer used in the system and should be removed or marked as removed to stop them from diluting precision of the dataset.

1. Namely Adaptive learning data preprocessing and Measurements and results


4 Adaptive learning data preprocessing

This chapter presents preprocessing of data from the adaptive learning systems. Detection of incorrect terms is also partially covered in this chapter.

4.1 Anatomy terms

Anatomy terms are in a single CSV file available only as an internal dataset. The columns in this file are described in Table 4.1. The anatom_id column can either be a 40 character long random hexadecimal number or a 12 character long TA98 code (see Section 4.1.1). Name columns (anatom_name_* and fma_name) can be used in full text search, but because they can contain the same values, they should be treated as a set to count every occurrence only once.

Preparing the CSV file for this work first requires conversion to a JSON using the csv2json.py script with a ',' delimiter and afterwards converting the result file with the script in Listing 4.1. The output file contains a list of id and name pairs, where all terms have a TA98 code as their ID and the name is a semicolon separated list of all name options.

import json

json.dump(
    {'data': [
        {'id': i['anatom_id'],
         'name': ';'.join(set(j.lower().strip()
             for j in i['anatom_name_la'].split(';')
                    + i['anatom_name_en'].split(';')
                    + i['fma_name'].split(';')))}
        for i in json.load(open('anatomy.json'))['data']
        if len(i['anatom_id']) == 12]},
    open('anatomy_all.json', 'w'))
Listing 4.1: The script to generate the Anatom JSON file.
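The csv2json.py script itself is internal to the project and is not reproduced in this text. A minimal sketch of an equivalent conversion could look as follows, assuming that the first CSV row holds the column names and that Listing 4.1 expects an object of the form {'data': [<row dict>, ...]}; the file names are only examples.

import csv
import json

def csv_to_json(csv_path, json_path, delimiter=','):
    # DictReader keys each row by the column names from the CSV header.
    with open(csv_path, 'rb') as csv_file:
        rows = list(csv.DictReader(csv_file, delimiter=delimiter))
    # Wrap the rows in the {'data': [...]} structure consumed by Listing 4.1.
    with open(json_path, 'w') as json_file:
        json.dump({'data': rows}, json_file)

csv_to_json('anatomy.csv', 'anatomy.json', delimiter=',')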


anatom_id                      ID, further described in the text
anatom_name_cc                 Czech (some Latin) names
anatom_name_cs                 Latin (some Czech) names
anatom_name_en                 English (some Latin) names
anatom_name_la                 Latin (some English) names
fma_id                         FMA¹ ID
fma_name                       FMA name
first_time_success_prob_avg    Success rate

Table 4.1: Anatom CSV table header explanation.

4.1.1 Terminologia Anatomica

Terminologia Anatomica (TA98) is the international standard of human anatomy, first published as a book in 1998 by the Federative Committee on Anatomical Terminology [5, 6, 7]. It contains 7475 entities [8] representing body parts. Each body part has an assigned identification code formatted as Axx.x.xx.xxx, where x represents a digit. For example, the human body has the code A01.0.00.000. These codes precisely and easily map Latin and English names to unique identifiers used in many other systems.

The database from the practiceanatomy.com system uses these TA98 codes as IDs for most terms. However, some terms have differently formatted IDs assigned to them, because TA98 only tracks body parts and the database also contains some other terms. Moreover, some of the terms do not have a TA98 identifier in the database even though they represent a body part with an assigned TA98 code (for example, the Pectineus muscle does not have a TA98 ID in the database, but its TA98 catalogue identifier is A04.7.02.025). Since the number of non-TA98 terms is 282 (out of 1828 in total), the task of manually assigning them is not reasonable. Therefore, they are going to be skipped in this work, but the values should be corrected in the adaptive learning data if this work is implemented.

As terms without a TA98 code are often not anatomical, it is hard to compare their importance to body part terms. Therefore, only the terms that have their identifiers originating from the TA98 codes are used further in this paper to compute their importance. Another factor that leads to the removal


of such terms is that some of them do not have any human readable name but just represent some part of a diagram. The problem with such terms is that removing them requires human input, which this work is meant to eliminate.
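A stricter way to separate TA98 identifiers from the random hexadecimal IDs than the simple length check in Listing 4.1 is to match the Axx.x.xx.xxx pattern directly; the helper below is only a hypothetical sketch of such a check.

import re

# Matches TA98 codes of the form Axx.x.xx.xxx, where x is a digit.
TA98_PATTERN = re.compile(r'^A\d{2}\.\d\.\d{2}\.\d{3}$')

def is_ta98_id(anatom_id):
    return bool(TA98_PATTERN.match(anatom_id))

print is_ta98_id('A01.0.00.000')                              # True, the human body
print is_ta98_id('3f786850e387550fdab836ed7e6dc881de23001b')  # False, a random hexadecimal ID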

4.2 Geography terms

Geography terms are in a single CSV file called place.csv. The id column contains integer identifiers for the terms². The code is a human readable unique identifier, and its format varies depending on the type of the term. name is the name used to describe the term, and it is expected to appear in this form in the sources. type contains the type of the term as per the place_type.csv file in the attachment. The description of the types is listed in Table 4.2. The last column, maps, contains identifiers of the maps that contain the term.

To start processing this file, first convert it into a JSON using the csv2json.py script with the delimiter set to ';' and then process the output file with the script in Listing 4.2 to split it into several files according to the category each term belongs to, while removing data columns that are not needed³. The only selected types are countries and cities.

import json

types = {1: [], 2: []}
for row in json.load(open('geo.json'))['data']:
    if int(row['type']) in types:
        types[int(row['type'])].append({'id': int(row['id']),
                                        'name': row['name']})
for typeid, typefile in ((1, 'country.json'), (2, 'city.json')):
    json.dump({'data': types[typeid]}, open(typefile, 'w'))
Listing 4.2: The script to convert the geography JSON into processable files. Only selects countries and cities.

2. Starting from 1.
3. code, type and maps are not needed after running the script in Listing 4.2.


Countries                                            1    299
Cities                                               2    731
World (a singular entity representing the world)     3      1
Continent                                            4      6
River                                                5     84
Lake                                                 6     21
Region                                               7    197
Mountains                                            8     71
Island                                               9     56

Table 4.2: Geography CSV table header explanation.

4.2.1 Knowledge Graph

Knowledge Graph is a knowledge base created by Google that contains many different entities and was created to aid the Google Search engine with finding the best possible responses, not only based on a pure text search but also by understanding the meaning of the entities [9]. This graph of entities is exposed to the public through the read-only Google Knowledge Graph Search API [10], available at the URL in Listing 4.3. Using this API requires a private key generated by the Google Cloud platform⁴.

Entities in the Knowledge Graph contain types describing what they represent. Countries are represented by the type Country. Other types present with countries are AdministrativeArea, representing a legal area, and Place, representing any geographical location. Cities have AdministrativeArea, Place and City as their types. Most entities also contain a link to Wikipedia in detailedDescription.url. When processing countries, the ones that failed to be resolved automatically are listed in Table 4.3 with their names, identifiers and Wikipedia page titles. These values were entered into the script manually after the main loading phase finished.

All the values created by this script are then merged with the original file to support more names. This provides alternative or correct names

4. Available at https://cloud.google.com


Ukraine                 216    Ukraine
Andamany a Nikobary     287    Andaman and Nicobar Islands
Damán a Díjú            309    Daman and Diu
Puttuččéri              302    Puducherry
New South Wales         459    New South Wales
Fiji                     91    Fiji
Lakadivy / Lakšadvíp    282    Lakshadweep

Table 4.3: Country terms that have to be manually created.

of many places encountered in this dataset. Another problem is that the city of 'Rejdice' does not have an entry in the English version of Wikipedia and has to be removed from the dataset.

https://kgsearch.googleapis.com/v1/entities:search
Listing 4.3: API URL of Knowledge Graph

https://en.wikipedia.org/wiki/Ukraine
Listing 4.4: Full format of the Wikipedia URL
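A minimal sketch of querying the API from Listing 4.3 for one term and reading the Wikipedia link from detailedDescription.url could look as follows; the key is a placeholder for the private key mentioned above and the response layout follows the public API documentation.

import json
import urllib
import urllib2

API_KEY = 'REPLACE_WITH_PRIVATE_KEY'   # placeholder, generated by the Google Cloud platform

def knowledge_graph_lookup(name, types='Country'):
    params = urllib.urlencode({'query': name, 'types': types,
                               'limit': 1, 'key': API_KEY})
    url = 'https://kgsearch.googleapis.com/v1/entities:search?' + params
    response = json.load(urllib2.urlopen(url))
    elements = response.get('itemListElement', [])
    if not elements:
        return None
    result = elements[0]['result']
    # Most entities link to Wikipedia through the detailedDescription field.
    return result.get('detailedDescription', {}).get('url')

print knowledge_graph_lookup('Ukraine')   # expected to return a URL as in Listing 4.4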


5 Term importance

5.1 Google Search

Google Search is a leading web search engine with an index size of over 100 PB (over 1 trillion links), created in 1995 by Sergey Brin and Larry Page at Stanford University [11, 12, 13, 14]. This index is available through the Custom Search API. Setting up the API to search the entire web requires setting the Site to search to any website (the website used in this work is example.com) and enabling Search the entire web but emphasize included sites.

This technique was already tried by the Adaptive learning group to generate an expert model, but the results were unsatisfactory¹. The only metric provided by this system is the number of pages returned as search results. These values are not precise but can be used for comparisons of terms with a large difference in the number of pages returned. Other problems of this approach are homographs and the engine slightly changing the queries to get more results. This may cause a lot of noise in the results.

The search engine can be accessed, after setting it up as described in the previous paragraph, by calling the following URL:

https://www.googleapis.com/customsearch/v1?key=<key>&cx=<cx>&q=<query>&fields=queries

where <key> and <cx> are values provided by Google Cloud and <query> is the requested query, while setting fields to queries decreases the amount of irrelevant data sent from the server. The response is a JSON file containing the whole response. The following snippet retrieves the field with the total number of results after parsing the response with the json library:

int(response["queries"]["request"][0]["totalResults"])

where response is a variable containing the response JSON.
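Putting the URL and the snippet together, a minimal sketch of the whole request could look as follows; the key and engine ID are placeholders for the values provided by Google Cloud.

import json
import urllib
import urllib2

API_KEY = 'REPLACE_WITH_KEY'        # placeholder for <key>
ENGINE_ID = 'REPLACE_WITH_CX'       # placeholder for <cx>

def result_count(term):
    params = urllib.urlencode({'key': API_KEY, 'cx': ENGINE_ID,
                               'q': term, 'fields': 'queries'})
    url = 'https://www.googleapis.com/customsearch/v1?' + params
    response = json.load(urllib2.urlopen(url))
    # The estimated number of pages returned for the query.
    return int(response["queries"]["request"][0]["totalResults"])

print result_count('Pectineus muscle')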

1. As mentioned in the prior work

5.2 Wikipedia

Wikipedia is a free online encyclopedia owned by the Wikimedia Foundation Inc. [15]. The English version of Wikipedia has over 5.3 million articles as of December 2016. All articles can be collaboratively created and edited by its users. The file size of all Wikipedia articles and their metadata is 55.6 GB [16]. The articles are formatted text with links to other articles and external sources. They also have a lot of metadata attached to them to represent categories, references to external identification systems and other associated information.

The text of the articles is written in a mark-up language called wikitext (or wiki-markup). Wikipedia then automatically transforms this markup into the HTML sent to the website users. The formatting can be used to represent both visual (such as headings) and functional parts (links) of an article.

Instead of using the web interface to retrieve the pages one by one, Wikipedia also offers downloads of a snapshot of all articles in a single XML file. Other available files are, for example, pageviews, edit history and links [17].

5.2.1 Pagerank

Pagerank is a method developed by Google to rank web pages objectively and mechanically by computing their importance in relation to other pages. Since web pages are not just plain text but hypertext files, it is possible to follow the links between them to form a directed graph. Outgoing links are called forward links (outedges) and incoming links are called backlinks (inedges). However, simply counting the number of backlinks may result in an unintuitive ranking that can be easily manipulated. The more advanced approach proposed by Page et al. is to weight the incoming links according to the importance of the source website [18]. This algorithm simulates random walks on the graph with random jumps to emulate user behavior and to avoid rank sinks². The computation of Pagerank is iterative. The formula for the (n + 1)th

2. Pages with inedges but no outedges


iteration of PageRank for a page x is

PR_{n+1}(x) = 1 - d + d \sum_{v \in B_x} \frac{PR_n(v)}{N_v}        (5.1)

PR_0(x) = 1        (5.2)

where PR_n(x) is the value of PageRank of page x after n iterations, B_x is the set of pages that link to x, N_v is the number of outgoing links of vertex v and d is a scaling factor. The 1 - d expression guarantees that the average value of Pagerank over all pages is 1. It can be replaced by (1 - d)/N, where N is the total number of pages, to set the maximal value of Pagerank of any page to 1 (Equation 5.2 also has to be changed to PR_0(x) = 1/N). This approach is, however, less efficient as it does not use the full range of floating point variables. The computation is repeated until the difference between the sums of Pageranks in two consecutive iterations is less than a predetermined value.

Wikipedia provides its internal link structure as a separate SQL file. It contains the originating page ID and namespace and the target page title and namespace (see Listing 5.2). To convert the target titles into IDs, another file has to be used. The pages SQL file contains page IDs, namespaces, titles and further columns irrelevant for this work (see Listing 5.1). This step is performed by the [TODO]. The titles from the pagelinks file can then be converted to their respective IDs for further processing. This is necessary because not all links lead to pages that exist. The file could be loaded as a SQL script into a database to build the graph, but that approach is slow. A much faster parser processes the file using states and only stores the values necessary for further computations. The resulting pairs of page ID and title are then saved into a file. The links are stored in memory and build up a graph. Since the page IDs are not a continuous sequence of integers, they can be mapped to a continuous sequence of integers to improve storage performance.

(10, 0, 'AccessibleComputing', '', 0, 1, 0, 0.33167112649574004, '20171002144257', '20171003005845', 767284433, 124, 'wikitext', NULL)
Listing 5.1: Example of a SQL row tuple in the page file.


(18612601, 0, '\"Awesome\"', 0)
Listing 5.2: Example of a SQL row tuple in the pagelinks file.
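A small, self-contained sketch of the iterative computation from Equations 5.1 and 5.2 is shown below; the toy graph given as a dictionary of out-links only stands in for the graph that would be built from the parsed page and pagelinks files.

def pagerank(outlinks, d=0.85, epsilon=1e-6):
    # Collect all pages, including those that only appear as link targets.
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)
    rank = dict.fromkeys(pages, 1.0)            # PR_0(x) = 1 (Equation 5.2)
    while True:
        new_rank = dict.fromkeys(pages, 1.0 - d)
        for v, targets in outlinks.items():
            if not targets:                     # pages without outedges contribute nothing
                continue
            share = d * rank[v] / len(targets)  # d * PR_n(v) / N_v from Equation 5.1
            for x in targets:
                new_rank[x] += share
        # Stop when the ranking changes less than the predetermined value.
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < epsilon:
            return new_rank
        rank = new_rank

toy_graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}
print pagerank(toy_graph)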

5.2.2 Pageviews

Wikipedia publishes page view counts of every article. These counts are aggregated in hourly intervals into separate files. Pageviews are only counted when an HTML page is requested; CSS and JavaScript files are not counted. These values are cleaned of what Wikimedia considers to be artificial views. The files are available at:

https://dumps.wikimedia.org/other/pageviews/<YYYY>/<YYYY>-<MM>/pageviews-<YYYY><MM><DD>-<HH>0000.gz

where YYYY is the four digit year (2015 - ), MM is the two digit month (1 - 12), DD is the two digit day (1 - 31) and HH is the two digit hour in 24 hour format (00 - 23). After decompression, the data is a single plain text file where every row represents a single record of an article in a language. The format of every row is:

<domain code> <page title> <view count> <response size>

Some rows are not formatted this way and so are discarded. All the irregular values can be viewed with the following Python script:

import os, gzip

for filename in [f for f in os.listdir('.')
                 if os.path.isfile(os.path.join('.', f))]:
    with gzip.open(filename, 'rb') as file:
        for line in file:
            if len(line.split(' ')) not in (0, 4):
                print filename, line.replace('\n', '')

Pageviews can be used to determine the popularity of a web page [19]. This method does not represent importance in an academic context but rather in more general terms.
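A minimal sketch of summing the hourly view counts of selected English Wikipedia pages over a set of downloaded pageview files could look as follows; the file name pattern and the restriction to the 'en' domain code are assumptions.

import glob
import gzip
from collections import defaultdict

def count_views(titles, pattern='pageviews-*.gz'):
    totals = defaultdict(int)
    wanted = set(titles)
    for filename in glob.glob(pattern):
        with gzip.open(filename, 'rb') as dump:
            for line in dump:
                fields = line.split(' ')
                # Skip malformed rows and projects other than the English Wikipedia.
                if len(fields) != 4 or fields[0] != 'en':
                    continue
                if fields[1] in wanted:
                    totals[fields[1]] += int(fields[2])
    return dict(totals)

print count_views(['Ukraine', 'Fiji'])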


5.2.3 Wordcounts

Since the Wikipedia dumps make the full texts of all Wikipedia articles available, it is possible to count the number of occurrences of each term in all articles. The file that contains the texts is pages-articles. It is a compressed XML (Extensible Markup Language) file which contains all articles in the structure described in Listing 5.3. The size of the file³ limits the ability of the parser to process the file all at once and requires the usage of iterative XML parsing. The Python package xml.etree.ElementTree contains an XML parser with such capability, exposed through its iterparse method. After encountering the end of a page tag, it is possible to read its children and find the text tag inside. The provided code also checks for the proper format and only accepts the text/x-wiki format of the text. This check is necessary to only process pages containing text articles.

Since the content of the text elements is formatted with the wiki markup, conversion to plain text is performed using the function filter_wiki from the Python package gensim.corpora.wikicorpus. This plain text is then further tokenised using the function tokenize from the same package.

The task is now to find the number of occurrences of each term. The naive implementation is inefficient, and so the Aho-Corasick algorithm is used. Aho-Corasick is an algorithm developed in 1975 to efficiently find all occurrences of several substrings in a long text in time O(n_terms + n_text + n_matches) by building a finite state machine [20]. An implementation of this algorithm in Python is available in the ahocorasick package. Since the Aho-Corasick algorithm uses a text as an input, not tokens, we convert the list of tokens to a string separated by ";".

3. The 2016-11-20 file has size of 60 GB.


<mediawiki>
  <page>
    <title>...</title>
    <id>...</id>
    <revision>
      ...
      <format>text/x-wiki</format>
      <text>...</text>
    </revision>
  </page>
  ...
</mediawiki>
Listing 5.3: Pages-articles Wiki dump file.
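A minimal sketch of the whole word counting pipeline described above could look as follows; it assumes a decompressed dump (the file name is only an example), the pyahocorasick package imported as ahocorasick, and the gensim functions mentioned above, and it keeps the terms and texts as plain ASCII strings for simplicity.

import xml.etree.ElementTree as ElementTree
from collections import defaultdict

import ahocorasick
from gensim.corpora.wikicorpus import filter_wiki, tokenize

def count_terms(dump_path, terms):
    # Build the Aho-Corasick automaton over the lower-cased terms.
    automaton = ahocorasick.Automaton()
    for term in terms:
        automaton.add_word(term.lower(), term)
    automaton.make_automaton()

    counts = defaultdict(int)
    # Stream the decompressed pages-articles dump page by page.
    for event, element in ElementTree.iterparse(dump_path, events=('end',)):
        if element.tag.split('}')[-1] != 'page':
            continue
        text, fmt = None, None
        for child in element.iter():
            local_name = child.tag.split('}')[-1]
            if local_name == 'format':
                fmt = child.text
            elif local_name == 'text':
                text = child.text
        if text and fmt == 'text/x-wiki':
            # Strip the wiki markup, tokenise and join the tokens with ';'.
            plain = ';'.join(tokenize(filter_wiki(text)))
            for _end, term in automaton.iter(plain):
                counts[term] += 1
        element.clear()   # free the processed subtree to keep memory usage bounded
    return dict(counts)

print count_terms('enwiki-pages-articles.xml', ['heart', 'femur'])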

5.3 GDELT

GDELT is a collection of the world's print and web news, collected in over 100 languages, with identification of many features such as emotions or mentions of a location in the source documents [21]. The location data is extracted using the Leetaru (2012) algorithm [22, 23]. These locations are stored in the V1LOCATIONS column as location tuples separated by semicolons. Location tuples contain seven fields separated by the pound symbol (#); a parsing sketch follows the field list. Their respective fields are:

∙ Location type – indicates the resolution of the location, from country to city level;

∙ Location full name – location name with the same spelling as was in the document;

∙ Location CountryCode – 2 character FIPS⁴ 10-4 country code⁵;

∙ Location ADM1 Code – 2 character FIPS 10-4 country code followed by a 2 character FIPS 10-4 ADM1⁶;

∙ Latitude;

∙ Longitude;

∙ FeatureID – GNS⁷ or GNIS⁸ FeatureID, or ADM1 code, or text.

4. Federal Information Processing Standards
5. Standard for country codes different from the ISO 3166-1 alpha-2 codes
6. Administrative division 1
7. Geographic Names Database
8. Geographic Names Information System
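A minimal sketch of parsing one such location tuple into the seven fields listed above could look as follows; the sample value is made up for illustration, and real rows contain several such tuples separated by semicolons.

FIELDS = ('type', 'full_name', 'country_code', 'adm1_code',
          'latitude', 'longitude', 'feature_id')

def parse_locations(value):
    locations = []
    for tuple_text in value.split(';'):
        parts = tuple_text.split('#')
        # Keep only well-formed tuples with all seven fields.
        if len(parts) == len(FIELDS):
            locations.append(dict(zip(FIELDS, parts)))
    return locations

sample = '1#Ukraine#UP#UP#49#32#UP'   # made-up example tuple
print parse_locations(sample)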


Figure 5.1: The number of books with full text-indexing over years, shown per country. The drop in 1920 is caused by copyright limiting the number of available data sources.

To detect countries, the simplest field to use is the CountryCode field. Values contained in it are FIPS 10-4 country codes, which can be translated to the ISO⁹ Alpha-2¹⁰ codes used in the slepemapy.cz system. The database is available through Google BigQuery. To get the number of mentions of a country in the database from all sources from 1800 to 2000, the following non-legacy SQL syntax¹¹ query can be used:

SELECT
  REGEXP_EXTRACT(locations,
    r'^[1-5]#.*?#(.*?)#.*?#.*?#.*?#.*') AS location,
  COUNT(1) AS count
FROM (
  SELECT docid, locations, suffix
  FROM (
    SELECT
      DocumentIdentifier AS docid,
      SPLIT(V2Locations, ';') AS locations,
      REGEXP_REPLACE(_TABLE_SUFFIX, r'notxt', r'') AS suffix
    FROM `gdelt-bq.internetarchivebooks.*`
    WHERE _TABLE_SUFFIX BETWEEN "1800" AND "2000") AS mapping
  CROSS JOIN mapping.locations)
GROUP BY location

Results can be downloaded as a CSV¹² file and then converted to the Alpha-2 country codes. This is done by a Python script with a mapping between FIPS 10-4 and Alpha-2 codes from the table on https://www.cia.gov/library/publications/the-world-factbook/appendix/appendix-d.

9. International Organization for Standardization
10. Short for the ISO 3166-1 alpha-2 country code
11. Google BigQuery uses legacy SQL syntax by default
12. Comma separated values
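A sketch of converting the downloaded CSV to ISO 3166-1 alpha-2 codes could look as follows; only a few FIPS 10-4 entries are shown as an illustration of the mapping, the full mapping has to be taken from the table mentioned above, and the column names follow the query above.

import csv

FIPS_TO_ALPHA2 = {   # illustrative subset of the FIPS 10-4 to Alpha-2 mapping
    'UP': 'UA',      # Ukraine
    'FJ': 'FJ',      # Fiji
    'EZ': 'CZ',      # Czechia
}

def convert_counts(csv_path):
    counts = {}
    with open(csv_path, 'rb') as result_file:
        for row in csv.DictReader(result_file):
            alpha2 = FIPS_TO_ALPHA2.get(row['location'])
            if alpha2 is not None:
                counts[alpha2] = counts.get(alpha2, 0) + int(row['count'])
    return counts

print convert_counts('gdelt_country_counts.csv')   # hypothetical file name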

5.4 PubMed Central

PubMed Central (PMC) is a free archive of medical science journals at the U.S. National Institutes of Health's National Library of Medicine [24]. PMC was launched in 2000 and contains 4.3 million articles from over 2000 journals. The dataset is split into two parts: data available for commercial use and data limited to non-commercial use only. This work only uses the commercial dataset to better reflect usage in commercial systems. Both datasets are available at ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/ for bulk download as tar GZip files. The archives contain a folder structure where every folder represents a journal and contains several nxml files, which are just XML files. All these files have a simplified structure as follows:

<article>
  <front>[Metadata]</front>
  <body>[HTML text of the article]</body>
  <back>[Citations]</back>
</article>

The body tag contains the whole text of the article formatted as an HTML file. This can then be parsed by an HTML parser, all text nodes merged together and the resulting plain text string tokenized for search. The problem with the tokenization of body parts is that one body part can have several names, which are also commonly shortened. These problems largely limit any text mining techniques on anatomy data. The program is therefore expected to have a lower hit rate compared to other techniques, even zero occurrences of a term.
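A minimal sketch of extracting and tokenising the text of one nxml article could look as follows; ElementTree and a simple regular expression are used here instead of a full HTML parser, which is a simplification, and the file name is only an example.

import re
import xml.etree.ElementTree as ElementTree

def article_tokens(nxml_path):
    root = ElementTree.parse(nxml_path).getroot()
    body = None
    for element in root.iter():
        if element.tag.split('}')[-1] == 'body':   # the article text lives in the body element
            body = element
            break
    if body is None:
        return []
    plain_text = ' '.join(body.itertext())         # merge all text nodes
    return re.findall(r"[a-z]+", plain_text.lower())

print article_tokens('example_article.nxml')[:20]  # hypothetical file name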

6 Measurements and results

This chapter describes the results obtained from running the programs from the previous chapter.

6.1 Term importance

The outputs of the methods presented in the previous chapter are floating point numbers representing the importance of each term. A comparison function is necessary to compare a pair of methods. The most straightforward implementation is to normalise the values, calculate the differences of the respective pairs of values produced by the pair of methods and sum these differences.

This approach introduces two problems. The first problem is that the actual values returned by the methods are not significant, only their order, because they signify how important the words are expected to be in the specific context. Another problem stems from the fact that the difference between the first and the tenth most important terms is much more significant than the difference between the 100th and the 110th terms.

Based on Zipf's law [25], the distribution of the values of the importance of terms is expected to be exponential, but the parameters of the exponential function can vary between any two compared methods. This means that using the importance value or the rank of a term yields similar results when comparing two methods. Since the values are in an exponential distribution, with very few terms having large values and most terms having small values, the second problem is also addressed if the values follow Zipf's law.

Because both problems mentioned in the first paragraphs are addressed by Zipf's law, the naive approach can be used to compare methods. A considered alternative was to assign terms to different buckets representing a stepped function of importance. This approach would, however, cause less precision for terms that would be close to the border between two buckets on the continuous scale.
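A minimal sketch of the naive comparison described above could look as follows; it assumes that both methods assign a value to the same set of terms, and the sample values are made up for illustration. A smaller result means more similar importance assignments.

def compare_methods(values_a, values_b):
    # Normalise each method's values so that they sum to one, then sum the
    # absolute differences of the respective pairs of values.
    def normalise(values):
        total = float(sum(values.itervalues()))
        return {term: value / total for term, value in values.iteritems()}
    norm_a, norm_b = normalise(values_a), normalise(values_b)
    return sum(abs(norm_a[term] - norm_b[term]) for term in norm_a)

pagerank_scores = {'Ukraine': 3.2, 'Fiji': 0.4, 'Rejdice': 0.1}
pageview_scores = {'Ukraine': 52000.0, 'Fiji': 9000.0, 'Rejdice': 150.0}
print compare_methods(pagerank_scores, pageview_scores)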

6.2 Final evaluation

The results presented in this chapter clearly show that the initial hypothesis is correct.

7 Conclusion

The results presented in this work show that the initial hypothesis is correct, and the importance of a term does not depend on a narrow context but is uniform across many sources and mining methods. However, the terms are difficult to match with their actual meaning and can sometimes have unexpected values assigned to them even after the initial sanitation of the inputs. This means that further human processing of all terms is necessary, which is contrary to the initial task of simplifying the process by removing the human task of manually processing every value.

In both the geography and the anatomical dataset, terms were incorrectly assigned, which resulted in unexpected rankings compared to the expected values. Some of the values were matched with different real-world entities that are much more important than the intended ones, while some were not recognised at all. The terms that are not recognised and require manual input do not pose a big problem for this task, as minimal human input is expected. However, the ones that have been matched incorrectly are difficult to identify and require manually processing every single term. Since the elimination of manually processing every single term was the main target of this work, the results are negative and cannot be incorporated into the adaptive learning system.

However, the preprocessing step was still successful in identifying many mistakes or uncommon notations of the terms. This opens a possibility of using a similar system to validate new terms when they are added in large quantities that are difficult to verify manually.


A Attachments


Bibliography

1. PAPOUŠEK, Jan; PELÁNEK, Radek. Impact of adaptive educational system behaviour on student motivation. In: International Conference on Artificial Intelligence in Education. 2015, pp. 348–357.
2. Adaptive Learning @ FI MU: Our lab [online] [visited on 2017-05-14]. Available from: https://www.fi.muni.cz/adaptivelearning/.
3. DATA MINING CURRICULUM: A PROPOSAL [online] [visited on 2017-12-02]. Available from: http://www.kdd.org/curriculum/index.html.
4. FAYYAD, Usama; PIATETSKY-SHAPIRO, Gregory; SMYTH, Padhraic. From data mining to knowledge discovery in databases. AI magazine. 1996, vol. 17, no. 3, pp. 37.
5. ANATOMICAL TERMINOLOGY, Federative Committee on. Terminologia Anatomica: International Anatomical Terminology. 1st ed. Stuttgart: Thieme Stuttgart, 1998. ISBN 3-13-114361-4.
6. WE, Allen. Terminologia anatomica: international anatomical terminology and Terminologia Histologica: International Terms for Human Cytology and Histology [online] [visited on 2017-05-14]. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2740970/.
7. About the TA98 on-line version [online]. University of Fribourg [visited on 2017-05-14]. Available from: http://www.unifr.ch/ifaa/Public/EntryPage/AboutTA98.html.
8. human body [online]. University of Fribourg [visited on 2017-05-14]. Available from: http://www.unifr.ch/ifaa/Public/EntryPage/TA98%20Tree/Alpha/All%20KWIC%20EN.htm.
9. Introducing the Knowledge Graph: things, not strings [online] [visited on 2017-05-14]. Available from: https://googleblog.blogspot.cz/2012/05/introducing-knowledge-graph-things-not.html.
10. Google Knowledge Graph Search API [online]. Google [visited on 2017-05-14]. Available from: https://developers.google.com/knowledge-graph/.


11. The top 500 sites on the web [online]. Alexa [visited on 2017-05-14]. Available from: http://www.alexa.com/topsites/category/Computers/Internet/Searching/Search_Engines.
12. How Search organizes information [online]. Google [visited on 2017-05-14]. Available from: https://www.google.com/search/howsearchworks/crawling-indexing/.
13. GOOGLE. We knew the web was big... [online] [visited on 2017-05-14]. Available from: https://googleblog.blogspot.cz/2008/07/we-knew-web-was-big.html.
14. GOOGLE. Our story [online] [visited on 2017-05-14]. Available from: https://www.google.com/about/our-story/.
15. Wikipedia [online] [visited on 2017-05-14]. Available from: https://en.wikipedia.org/wiki/Wikipedia.
16. Wikipedia Statistics English [online] [visited on 2017-05-14]. Available from: https://stats.wikimedia.org/EN/TablesWikipediaEN.htm.
17. [online] [visited on 2017-05-14]. Available from: https://dumps.wikimedia.org/enwiki/latest/.
18. PAGE, Lawrence; BRIN, Sergey; MOTWANI, Rajeev; WINOGRAD, Terry. The PageRank citation ranking: Bringing order to the web. 1999. Technical report. Stanford InfoLab.
19. Page Views [online] [visited on 2017-05-14]. Available from: https://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics.
20. AHO, Alfred V; CORASICK, Margaret J. Efficient string matching: an aid to bibliographic search. Communications of the ACM. 1975, vol. 18, no. 6, pp. 333–340.
21. The GDELT Project [online] [visited on 2017-05-14]. Available from: http://gdeltproject.org/.
22. LEETARU, Kalev. Fulltext geocoding versus spatial metadata for large text archives: Towards a geographically enriched Wikipedia. D-Lib Magazine. 2012, vol. 18, no. 9, pp. 5.


23. THE GDELT GLOBAL KNOWLEDGE GRAPH (GKG): DATA FORMAT CODEBOOK V2.1 [online] [visited on 2017-05-14]. Available from: http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.
24. PMC Overview [online] [visited on 2017-05-14]. Available from: https://www.ncbi.nlm.nih.gov/pmc/about/intro/.
25. NEWMAN, Mark EJ. Power laws, Pareto distributions and Zipf's law. Contemporary physics. 2005, vol. 46, no. 5, pp. 323–351.
