Workshop on the use of data from the SciELO database
Report slides and Python/R notebooks with the performed analyses

Danilo J. S. Bellini

Abstract

Knowledge from data science or contemporary statistics can be used to perform analyses and inferences on large datasets with hundreds of thousands of entries. The exploratory data analysis of a dataset, in research that aims to extract information from it, might include steps like data acquisition, cleaning, normalization, interpretation, grouping, description and visualization. The goal of this work is to share techniques, methodologies and tools for accessing and exploring data from the SciELO database through its own open access interfaces, like SciELO Analytics’ reports, SciELO Ratchet, and SciELO ArticleMeta (JSON API and Python software package), as well as from 4 external sources: Web of Science (SciELO Citation Index), Dimensions, SCImagoJR and Scopus. Several analyses were performed using either Python (IPython/Jupyter, Numpy, Pandas, Matplotlib, Seaborn, Scipy, NetworkX) or R (RStudio, dplyr) as the programming language, and their open source code is included to make the results reproducible.

Keywords

Python, R, Data science, Statistics, SciELO, H5, FCR, SJR, Citations, Open access, Open source, Exploratory data analysis

Source code repository

https://github.com/scieloorg/scielo20gt6/

WG6 Report SciELO 20 Years

WG6 presentation title: Workshop on the use of data from the SciELO database
Lecturer/speaker/rapporteur: Danilo J. S. Bellini
Group coordinator: Gustavo Fonseca
Executive secretary: Carolina Tanigushi
WG6 date: 2018-09-24
Report date: 2018-09-25
Venue: Tivoli Mofarrej São Paulo Hotel

Summary

The workshop covered:

- A brief introduction to the data analysis processes, emphasizing data access, data cleaning and exploratory data analysis
- Hands-on examples using Python and R, with emphasis on the Pandas library, showing how data munging, normalization, data visualization and other data analysis processes can be performed
- An explanation of data analyses previously performed on data coming from SciELO and external sources

Most of the time was spent on exploratory data analysis, interpretation of descriptive statistics and visualization. The topics studied in more depth include:

- Hirsch index
  - Google Scholar’s h5-index and h5-median
  - Calculation from raw Dimensions’ data
  - SCImagoJR’s H index
- Field Citation Ratio (FCR) from Dimensions

Tools

Two programming languages were used, along with several libraries and tools:

- Python
  - IPython / Jupyter Notebook
  - Python built-in modules (csv, statistics, urllib, json, glob, os, re, collections, itertools, pprint)
  - numpy
  - pandas
  - matplotlib
  - seaborn
  - openpyxl, to open XLSX files
  - scipy.stats, to calculate Pearson’s correlation coefficient
  - NetworkX, a graph manipulation library including an API to draw graphs with matplotlib
- R
  - R built-in packages (base, utils, stats, graphics)
  - RStudio, an IDE for R, for creating R Markdown notebooks
  - dplyr, to perform grouping operations similar to SQL’s GROUP BY and Pandas’ DataFrame.groupby

Data sources

Several analyses were performed before the workshop, and their processes and results were part of it.
The data studied in every analysis performed during and before the workshop came from these sources:

- SciELO’s JSON APIs (RESTful) from:
  - ArticleMeta, to get journal metadata
  - Ratchet, to get access data
- SciELO’s articlemeta Python library, an alternative way to access the ArticleMeta API
- Reports from SciELO Analytics
- SciELO Citation Index entries from the Web of Science
- Dimensions data regarding two journals: Nauplius and Brazilian Journal of Plant Physiology
- SCImagoJR’s CSV with all SJR and H indices for 2017
- Scopus’ XLSX with all the data they make available

Introduction to the analysis

The introduction covered these data preparation steps:

- Identifying which collections have data in SciELO Analytics (all certified and development collections, besides the active independent collections)
- Downloading all SciELO Analytics reports
- Evaluating whether the network reports contain everything from the remaining reports
- Simplifying the column names
- Normalizing/cleaning the ISSN when dealing with multiple collections
- Normalizing the thematic area (dealing with unfilled data)

It also showed how to plot data, with a strong emphasis on data interpretation and multiple types of plots (bar plots, line plots, box-and-whisker plots, heat maps, scatterplots, etc.), as well as subplot splitting/grouping.
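Two of the cleaning steps listed above, simplifying the column names and normalizing the ISSN, can be sketched with the built-in re module. This is only an illustration, not the code used in the workshop notebooks: the function names, the rule of dropping parenthesized hints, and the example header are all assumptions.

```python
import re


def simplify_column_name(name):
    """Turn a verbose report header into a short snake_case identifier."""
    # Drop parenthesized hints such as "(YYYY-MM-DD)"; this rule is an assumption
    name = re.sub(r"\([^)]*\)", " ", name)
    # Collapse every run of non-alphanumeric characters into one underscore
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()


def normalize_issn(raw):
    """Return the ISSN as 'NNNN-NNNC' (C is a digit or 'X'), or None."""
    if raw is None:
        return None
    # Remove whitespace and hyphens so that differently formatted entries match
    compact = re.sub(r"[\s-]", "", str(raw)).upper()
    if re.fullmatch(r"\d{7}[\dX]", compact):
        return f"{compact[:4]}-{compact[4:]}"
    return None  # Unfilled or malformed entries are treated as missing


# Hypothetical header and raw ISSN values, for illustration only
print(simplify_column_name("document publishing date (YYYY-MM-DD)"))
print(normalize_issn(" 0103-6440 "), normalize_issn("1678457x"))
```

With Pandas, helpers like these could be applied through `DataFrame.rename(columns=simplify_column_name)` and `Series.map(normalize_issn)`, so that journals appearing in multiple collections end up sharing a single ISSN spelling before any grouping is done.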
Previously prepared analyses

- Number of indexed journals in the SciELO network
- Deindexing reason in the SciELO Brazil collection
- Evaluating the daily access in the SciELO Brazil collection
- Three indices in Scopus 2017: CiteScore, SNIP and SJR
- SCImago Journal Rank in 2017, including SJR and H index
- FCR and H index in Dimensions
- Google Scholar indices
- Languages of research articles in SciELO Brazil, by thematic area, document publication year and journal indexing year
- Citations in the SciELO CI
- Proportion of Brazilian institutions in the affiliations of research articles from journals in the SciELO Brazil collection

Results

- The proportion of Brazilian affiliations of research articles in the SciELO Brazil collection is decreasing
- Most citations of research articles in the SciELO network come from documents/journals that aren’t in the SciELO network. For research articles written in English, 76% of the received citations come from documents external to the SciELO network
- The normalization step when calculating the FCR and its non-standard average calculation can easily push down the result (e.g. a journal with 10 documents receiving 15 citations each and 3 documents with zero citations would have an average citation count below 7.5, before this number gets normalized by the year and field of research), making it an index best suited to evaluating older journals that are no longer publishing
- We should always look for the mathematics that defines an index, as that evaluation can already give us some insights regarding its bias towards some documents/journals
- Scopus indices should be interpreted with care: mixing the data from all countries makes it hard to compare data from SciELO and from other journals
- In SciELO Brazil, 95% of the journals marked as deceased were actually just renamed to a new journal title/ISSN
- Matching data with external sources is difficult without a common standardized identifier such as the ISSN or DOI
- 0.8% of SciELO journals are marked in Scopus as not open, which seems to be an issue with the Scopus data

1 Introduction to Pandas with ArticleMeta

Note: This notebook was written during the presentation, but some text (like this note) was added afterwards to help in understanding its contents.

The ArticleMeta API is public, so we can get a JSON with some information about the collections:

http://articlemeta.scielo.org/api/v1/collection/identifiers/

Instead of just viewing the raw JSON by opening that address in a web browser, can we load and analyze it with Python? Can we plot some information from it? Yes! And that’s the goal of this notebook.

1.1 Loading a table-like JSON from ArticleMeta with Pandas

Let’s import Pandas as pd, following its convention, and load that JSON directly with pd.read_json.

Note: Data collected on 2018-09-24.

In [1]: import pandas as pd

In [2]: url = "http://articlemeta.scielo.org/api/v1/collection/identifiers/"
        dataset = pd.read_json(url)

In [3]: dataset

Out [3]: The table is shown below.
    acron  acron2  code  document_count  domain                     has_analytics  is_active
0   arg    ar      arg   37438.0         www.scielo.org.ar          True           True
1   chl    cl      chl   61760.0         www.scielo.cl              True           True
2   col    co      col   66973.0         www.scielo.org.co          True           True
3   cub    cu      cub   33492.0         scielo.sld.cu              True           True
4   esp    es      esp   37223.0         scielo.isciii.es           True           True
5   mex    mx      mex   61167.0         www.scielo.org.mx          True           True
6   prt    pt      prt   17237.0         www.scielo.mec.pt          True           True
7   NaN    NaN     NaN   NaN             books.scielo.org           False          True
8   scl    br      scl   370296.0        www.scielo.br              True           True
9   spa    sp      spa   40996.0         www.scielosp.org           True           True
10  sss    ss      sss   665.0           socialsciences.scielo.org  True           False

    journal_count                                      name                                               original_name    status     type
0   {'current': 120, 'deceased': 23}                   {'es': 'Argentina', 'pt': 'Argentina', 'en': '...  Argentina        certified  journals
1   {'current': 103, 'deceased': 13, 'suspended': 1}   {'es': 'Chile', 'pt': 'Chile', 'en': 'Chile'}      Chile            certified  journals
2   {'current': 224, 'suspended': 4}                   {'es': 'Colombia', 'pt': 'Colombia', 'en': 'Co...  Colombia         certified  journals
3   {'current': 61, 'deceased': 2, 'suspended': 4}     {'es': 'Cuba', 'pt': 'Cuba', 'en': 'Cuba'}         Cuba             certified  journals
4   {'current': 43, 'deceased': 6, 'suspended': 11}    {'es': 'España', 'pt': 'Espanha', 'en': 'Spain'}   España           certified  journals
5   {'current': 156, 'deceased': 12, 'suspended': 47}  {'es': 'Mexico', 'pt': 'Mexico', 'en': 'Mexico'}   Mexico           certified  journals
6   {'current': 47, 'deceased': 5, 'suspended': 17}    {'es': 'Portugal', 'pt': 'Portugal', 'en': 'Po...  Portugal         certified  journals
7   NaN                                                {'es': 'SciELO Libros', 'pt': 'SciELO Livros',...  NaN              NaN        books
8   {'current': 291, 'deceased': 40, 'suspended': 35}  {'es': 'Brasil', 'pt': 'Brasil', 'en': 'Brazil'}   Brasil           certified  journals
9   {'current': 18, 'suspended': 2}                    {'es': 'Salud Publica', 'pt': 'Saúde Pública',...  Saúde Pública    certified  journals
10  {'current': 33}                                    {'es': 'Social Sciences', 'pt': 'Social Scienc...  Social Sciences  certified  journals