Workshop on the use of data from the SciELO database

Report slides and Python/R notebooks with the performed analyses

Danilo J. S. Bellini

Abstract

Knowledge from data science or contemporary statistics can be used to perform analyses and inferences on large datasets including hundreds of thousands of entries. The exploratory data analysis of a dataset, in research aiming to get information from it, might include steps like data acquisition, cleaning, normalization, interpretation, grouping, description and visualization. The goal of this work is to share techniques, methodologies and tools for accessing and exploring data from the SciELO database through its own interfaces, like SciELO Analytics’ reports, SciELO Ratchet, and SciELO ArticleMeta (JSON API and Python software package), as well as from 4 external sources: the Web of Science (SciELO Citation Index), Dimensions, SCImagoJR and Scopus. Using either Python (IPython/Jupyter, Numpy, Pandas, Matplotlib, Seaborn, Scipy, NetworkX) or R (R Studio, dplyr) as the programming languages, several analyses were performed with their open source code included, aiming at the reproducibility of the results.

Keywords

Python, R, Data science, Statistics, SciELO, H5, FCR, SJR, Citations, Open access, Open source, Exploratory data analysis

Source code repository: https://github.com/scieloorg/scielo20gt6/

WG6 Report SciELO 20 Years

WG6 presentation title: Workshop on the use of data from the SciELO database

Lecturer/speaker/rapporteur: Danilo J. S. Bellini
Group coordinator: Gustavo Fonseca
Executive secretary: Carolina Tanigushi

WG6 date: 2018-09-24
Report date: 2018-09-25

Venue: Tivoli Mofarrej São Paulo Hotel

Summary

During the workshop, the following was done:

• Brief introduction to the data analysis process, emphasizing data access, data cleaning and exploratory data analysis
• Hands-on examples using Python and R, with emphasis on the Pandas library resources, showing how data munging, normalization, data visualization and other data analysis processes can be performed
• Explanation of data analyses previously performed on data coming from SciELO and external sources

Most of the time was spent on exploratory data analysis, interpretation of descriptive statistics and visualization. Some highlights of what was studied in more depth (a minimal sketch of the Hirsch index computation follows this list):

• Hirsch index
  • Google Scholar’s h5-index and h5-median
  • Calculation from raw Dimensions’ data
  • SCImagoJR’s H index
• Field Citation Ratio (FCR) from Dimensions
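The Hirsch index of a set of documents is the largest h such that h of the documents have at least h citations each; Google Scholar’s h5-index is this computation restricted to the documents published in the last five complete years, and the h5-median is the median citation count of the documents in that h5 core. A minimal sketch of the computation (illustrative; not the workshop’s exact code):

    def h_index(citations):
        """Largest h such that h documents have at least h citations each."""
        ranked = sorted(citations, reverse=True)
        return sum(1 for position, count in enumerate(ranked, start=1)
                   if count >= position)

    h_index([10, 8, 5, 4, 3])  # -> 4 (four documents with at least 4 citations each)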

Tools

Two programming languages were used, along with several libraries and tools:

• Python
  • IPython / Jupyter Notebook
  • Python built-in modules (csv, statistics, urllib, json, glob, os, re, collections, itertools, pprint)
  • numpy
  • pandas
  • matplotlib
  • seaborn
  • openpyxl, to open XLSX files
  • scipy.stats, to calculate the Pearson’s correlation coefficient (see the sketch after this list)
  • NetworkX, a graph manipulation library including an API to draw graphs with matplotlib
• R
  • R built-in modules (base, utils, stats, graphics)
  • R Studio, an IDE for R, for creating R Markdown notebooks
  • dplyr, to perform grouping operations similar to SQL’s GROUP BY and Pandas’ DataFrame.groupby
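A minimal illustration of the scipy.stats call mentioned above, with made-up data (not from the workshop):

    from scipy.stats import pearsonr

    # Pearson's correlation coefficient and its two-tailed p-value
    # for two small dummy sequences:
    r, p_value = pearsonr([1, 2, 3, 4, 5], [2, 4, 5, 4, 9])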

Data sources

Several analyses were performed before the workshop, and their processes and results were part of it. The data studied in every analysis performed during and before the workshop came from these sources:

• SciELO’s JSON APIs (RESTful) from:
  • ArticleMeta, to get journal metadata
  • Ratchet, to get access data
• SciELO’s articlemeta Python library, an alternative way to access the ArticleMeta API
• Reports from SciELO Analytics
• SciELO Citation Index entries from the Web of Science
• Dimensions data regarding two journals: Nauplius and Brazilian Journal of Plant Physiology
• SCImagoJR’s CSV with all SJR and H indices for 2017
• Scopus’ XLSX with all the data they make available

Introduction to the analysis

The workshop covered these steps:

• Identifying which collections have data in SciELO Analytics (all certified and development collections, besides the active independent collections)
• Downloading all SciELO Analytics reports
• Evaluating whether the network reports have everything from the remaining reports
• Simplifying the column names
• Normalizing/cleaning the ISSN when dealing with multiple collections
• Normalizing the thematic area (dealing with unfilled data)

Besides these, it was shown how to plot data, with a strong emphasis on data interpretation and multiple types of plots (bar plots, line plots, box-and-whisker plots, heat maps, scatterplots, etc.), as well as subplot splitting/grouping.

Previously prepared analyses

• Number of indexed journals in the SciELO network
• Deindexing reason in the SciELO Brazil collection
• Evaluating the daily access in the SciELO Brazil collection
• Three indices in Scopus 2017: CiteScore, SNIP and SJR
• SCImago Journal Rank in 2017, including SJR and H index
• FCR and H index in Dimensions
• Google Scholar indices
• Languages of research articles in SciELO Brazil, by thematic area, document publication year and journal indexing year
• Citations in the SciELO CI
• Proportion of Brazil in affiliation institutions in research articles from journals in the SciELO Brazil collection

Results

• The proportion of Brazilian affiliations in research articles in the SciELO Brazil collection is decreasing
• Most citations of research articles in the SciELO network come from documents/journals that aren’t in the SciELO network; for research articles written in English, 76% of the received citations come from documents external to the SciELO network
• The normalization step when calculating the FCR and its non-standard average calculation can easily push down the result (e.g. a journal with 10 documents receiving 15 citations each and 3 documents with zero citations would have an average citation count below 7.5, before this number gets normalized by the year and field of research; see the sketch at the end of this section), making it an index best fit to evaluate older journals that are no longer publishing
• We should always look for the mathematics that defines an index, as that evaluation can already give us some insights regarding its bias towards some documents/journals

• Scopus indices should be taken with care: mixing the data from all countries makes it hard to compare data from SciELO and from other journals
• In SciELO Brazil, 95% of the journals marked as deceased were actually just renamed to a new journal title/ISSN
• Matching data with external sources is difficult without a common standardized index such as the ISSN and DOI
• 0.8% of SciELO journals are marked in Scopus as not open, which seems to be an issue regarding Scopus data
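A sketch of the arithmetic behind the FCR claim above, assuming the “non-standard average” is the geometric mean of citations plus one, minus one (our reading, which reproduces the numbers cited; it is not Dimensions’ published code):

    from math import exp, log

    # Hypothetical journal: 10 documents with 15 citations each,
    # plus 3 documents with no citations at all.
    citations = [15] * 10 + [0] * 3

    arithmetic_mean = sum(citations) / len(citations)  # 150 / 13, about 11.5

    # Geometric-mean-style average: exp(mean(log(c + 1))) - 1, about 7.44,
    # i.e. below 7.5 even before the year/field normalization.
    geometric_mean = exp(sum(log(c + 1) for c in citations) / len(citations)) - 1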

1 Introduction to Pandas with ArticleMeta

Note: This notebook was written during the presentation, but some text (like this) was added afterwards to help in understanding its contents. The ArticleMeta API is public, so we can get a JSON document with some information about the collections: http://articlemeta.scielo.org/api/v1/collection/identifiers/

Instead of just looking at the raw JSON in a web browser, can we load/analyze it with Python? Can we plot some information from it? Yes! And that’s the goal of this notebook.

1.1 Loading a table-like JSON from ArticleMeta with Pandas

Let’s import Pandas as pd following its convention, and load that JSON directly with pd.read_json. Note: Data collected on 2018-09-24.

In [1]: import pandas as pd

In [2]: url = "http://articlemeta.scielo.org/api/v1/collection/identifiers/"
        dataset = pd.read_json(url)

In [3]: dataset

Out [3]: [DataFrame with 34 rows × 12 columns: acron, acron2, code, document_count, domain, has_analytics, is_active, journal_count, name, original_name, status, type; one row per collection: Argentina (arg), Chile (chl), Colombia (col), Cuba (cub), España (esp), Mexico (mex), Portugal (prt), SciELO Livros (books.scielo.org), Brasil (scl), Saúde Pública (spa), Social Sciences (sss), South Africa (sza), Venezuela (ven), Biodiversidade (bio), Bolivia (bol), Costa Rica (cri), Peru (per), Proceedings (pro), Paraguay (pry), Uruguay (ury), West Indians (wid), ComCiência (cci), Ciência e Cultura (cic), Conhecimento e Inovação (inv), Pesquisa Fapesp (pef), Revista Virtual de Química (rvq), Educa (edc), PPEGEO (ppg), PEPSIC (psi), REVENF (rve), RevOdonto (rvo), SES (ses), Ecuador (ecu) and RevTur (rvt)]


Horizontal scrolling or many-paged landscape printing might make it difficult to see the big picture, but all information is there. That’s a DataFrame, one of the two main Pandas data structures. The other is a Series, which one can think of as a column from some dataframe. The first column we can see is the index, which isn’t part of the JSON returned by the ArticleMeta API. The remaining columns have the downloaded data. The journal_count and name columns are nested in the downloaded JSON, so they’re dictionaries in the resulting dataframe. In [4]: type(dataset)

Out [4]: pandas.core.frame.DataFrame

In [5]: type(dataset["has_analytics"])

Out [5]: pandas.core.series.Series

1.2 How many collections are in SciELO Analytics?

SciELO Analytics[1] has reports for several collections, but not for all of them. The has_analytics column tells us whether a collection has such a report, and the Network report only has the data from collections where has_analytics is True.

In [6]: dataset["has_analytics"]

Out [6]:
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7    False
8     True
9     True
10    True
11    True
12    True
13   False
14    True
15    True
16    True
17   False
18    True
19    True
20   False
21   False
22   False
23   False
24   False
25   False
26   False
27   False
28    True
29    True
30   False
31   False
32    True
33    True
Name: has_analytics, dtype: bool

[1]https://analytics.scielo.org/

Filtering a single column, we get a Series instance, which has some specific methods like value_counts. Most collections are in SciELO Analytics! In [7]: dataset["has_analytics"].value_counts()

Out [7]:
True     21
False    13
Name: has_analytics, dtype: int64

1.3 Why does/doesn’t a collection have reports in SciELO Analytics?

Most collections have analytics, but can we recover that same information from the other columns? That amounts to manually building and interpreting a classifier.

1.3.1 Active/discontinued collections

The first column we may check is is_active. A discontinued collection should have is_active equal to False, and most collections are active:

In [8]: dataset["is_active"].value_counts()

Out [8]:
True     31
False     3
Name: is_active, dtype: int64

The discontinued entries (it’s a selection, filtering the entries that have is_active equal to False): In [9]: dataset[~dataset["is_active"]].T

Out [9]:
                10                         17                         20
acron           sss                        pro                        wid
acron2          ss                         pro                        wi
code            sss                        pro                        wid
document_count  665                        NaN                        NaN
domain          socialsciences.scielo.org  www.proceedings.scielo.br  caribbean.scielo.org
has_analytics   True                       False                      False
is_active       False                      False                      False
journal_count   {'current': 33}            {}                         {}
name            {'es': 'Social Scienc...   {'es': 'Proceedings'...    {'es': 'West Indians'...
original_name   Social Sciences            Proceedings                West Indians
status          certified                  diffusion                  diffusion
type            journals                   journals                   journals

The Social Sciences collection (sss) has analytics and isn’t active, while the other two discontinued collections don’t have analytics. That’s important information, but we can’t use it alone to separate the entries by the has_analytics column.


1.3.2 Collection status

There are 4 possible collection statuses. Let’s count the number of collections in each of them.

In [10]: dataset["status"].value_counts()

Out [10]:
certified      16
diffusion       8
independent     7
development     2
Name: status, dtype: int64

It’s worth mentioning that independent collections aren’t managed by SciELO, but they comply with the SciELO model. Can we use this information to classify the entries?

In [11]: dataset.groupby(["has_analytics", "status"]).size()

Out [11]:
has_analytics  status
False          diffusion       8
               independent     4
True           certified      16
               development     2
               independent     3
dtype: int64

Almost!

• The diffusion collections never have analytics
• The certified and development collections always have analytics
• Only 3 independent collections (out of 7) have analytics

We just need to split the collections.

In [12]: dataset.groupby(["status", "has_analytics"]).size()

Out [12]:
status       has_analytics
certified    True             16
development  True              2
diffusion    False             8
independent  False             4
             True              3
dtype: int64

Technical note: There’s no value_counts method for pd.DataFrame; we should use the size method of a groupby result instead. It creates a series with a multi-level index, whose level order follows the order of the groupby argument, and we can turn the index levels into columns by using the reset_index method.

In [13]: dataset.groupby(["status", "has_analytics"]).size().reset_index()

Out [13]:
        status  has_analytics   0
0    certified           True  16
1  development           True   2
2    diffusion          False   8
3  independent          False   4
4  independent           True   3

A cleaner approach to the same result, renaming the pd.Series column:


In [14]: status_has_analytics = \
             dataset.groupby(["status", "has_analytics"]) \
                    .size() \
                    .rename("count") \
                    .reset_index()
         status_has_analytics

Out [14]:
        status  has_analytics  count
0    certified           True     16
1  development           True      2
2    diffusion          False      8
3  independent          False      4
4  independent           True      3

We can save that table as a CSV to load it afterwards with pd.read_csv: In [15]: status_has_analytics.to_csv("articlemeta_pandas.csv", index=False)
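A minimal round-trip check (ours, not in the original notebook):

    reloaded = pd.read_csv("articlemeta_pandas.csv")
    # For this small table, read_csv should infer the bool and int
    # columns back, so the reloaded dataframe equals the original one.
    assert reloaded.equals(status_has_analytics)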

1.3.3 Selection and projection

Roughly speaking, selection means filtering rows by some criteria, whereas projection means choosing certain columns. We’ve already seen a selection while trying to understand the is_active column. Now let’s perform a simple projection (a minimal standalone illustration of both operations follows the column listing below). The available columns are:

In [16]: dataset.columns

Out [16]: Index(['acron', 'acron2', 'code', 'document_count', 'domain', 'has_analytics', 'is_active', 'journal_count', 'name', 'original_name', 'status', 'type'], dtype='object')
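A minimal standalone illustration of both operations (the variable names are ours, not from the original notebook):

    # Selection: keep only the rows matching a condition
    # (here, the active collections).
    active = dataset[dataset["is_active"]]

    # Projection: keep only some of the columns.
    codes_and_status = dataset[["code", "status"]]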

We’ll sort by the number of documents (as of 2018-09-24) before doing the projection. The double square brackets mean that this indexing takes a single argument: a list of column names.

In [17]: dataset.sort_values(by="document_count") \
             [["code", "domain", "document_count", "status", "has_analytics"]]

Out [17]:
    code  domain                     document_count  status       has_analytics
32  ecu   www.scielo.ec              15.0            development  True
33  rvt   www.revtur.org             136.0           independent  True
10  sss   socialsciences.scielo.org  665.0           certified    True
22  cic   cienciaecultura.bvs.br     1784.0          diffusion    False
19  ury   www.scielo.edu.uy          4360.0          certified    True
14  bol   www.scielo.org.bo          4758.0          certified    True
15  cri   www.scielo.sa.cr           9158.0          certified    True
16  per   www.scielo.org.pe          9618.0          certified    True
6   prt   www.scielo.mec.pt          17237.0         certified    True
12  ven   www.scielo.org.ve          18971.0         certified    True
29  rve   www.revenf.bvs.br          22733.0         independent  True
28  psi   pepsic.bvsalud.org         23841.0         independent  True
11  sza   www.scielo.org.za          25617.0         certified    True
3   cub   scielo.sld.cu              33492.0         certified    True
4   esp   scielo.isciii.es           37223.0         certified    True
0   arg   www.scielo.org.ar          37438.0         certified    True
9   spa   www.scielosp.org           40996.0         certified    True
5   mex   www.scielo.org.mx          61167.0         certified    True
1   chl   www.scielo.cl              61760.0         certified    True
2   col   www.scielo.org.co          66973.0         certified    True
8   scl   www.scielo.br              370296.0        certified    True
7   NaN   books.scielo.org           NaN             NaN          False
13  bio   biodiversidade.scielo.br   NaN             diffusion    False
17  pro   www.proceedings.scielo.br  NaN             diffusion    False
18  pry   scielo.iics.una.py         NaN             development  True
20  wid   caribbean.scielo.org       NaN             diffusion    False
21  cci   comciencia.scielo.br       NaN             diffusion    False
23  inv   inovacao.scielo.br         NaN             diffusion    False
24  pef   revistapesquisa.fapesp.br  NaN             diffusion    False
25  rvq   www.uff.br/RVQ             NaN             diffusion    False
26  edc   educa.fcc.org.br           NaN             independent  False
27  ppg   ppegeo.igc.usp.br          NaN             independent  False
30  rvo   revodonto.bvsalud.org      NaN             independent  False
31  ses   periodicos.ses.sp.bvs.br   NaN             independent  False

Almost all rows with an invalid document count (NaN, which stands for Not a Number) don’t have analytics, while almost all rows with a valid document count have analytics. The only exceptions are:

• cic, a diffusion collection (it doesn’t have analytics)
• pry, a development collection (it has analytics)

Both can be classified using the status column alone. Therefore we’ve found our classifier: analytics are available for all certified and development collections, as well as non-empty independent collections.
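A minimal sketch of that classifier as a pandas expression (ours, not in the original notebook; “non-empty” is read here as having a valid document_count in the snapshot above):

    predicted = (
        dataset["status"].isin(["certified", "development"])
        | ((dataset["status"] == "independent")
           & dataset["document_count"].notna())
    )
    # Should print True for the 2018-09-24 snapshot shown above.
    print((predicted == dataset["has_analytics"]).all())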

1.4 Plotting

We’ll use the Pandas integration with Matplotlib to plot. To display the plots in a notebook like this, we need: In [18]: %matplotlib inline

All pd.Series and pd.DataFrame instances have a plot object with methods for plotting, like barh for horizontal bars: In [19]: status_has_analytics.plot.barh()


Out [19]: [horizontal bar plot of the count values, with bars labeled only by the numeric index and a text line with the plot object representation above it]

There are two issues with that plot:

• The text line (the printed plot object representation), which can be removed with a trailing ; in the cell;
• The lack of meaningful labels for the bars.

Since the labels are just the index, we can just set the index in the order we want:

In [20]: status_has_analytics_indexed = \
             status_has_analytics.set_index(["status", "has_analytics"])
         status_has_analytics_indexed

Out [20]:
                            count
status       has_analytics
certified    True              16
development  True               2
diffusion    False              8
independent  False              4
             True               3

And plot it again:


In [21]: status_has_analytics_indexed.plot.barh();

1.4.1 Seaborn

Seaborn is a library for plotting data stored in Pandas dataframes. Like pd for Pandas, it has a conventional name on importing.

In [22]: import seaborn as sns

Let’s create a bar plot like the one above, grouping by both indices (mainly status, but using a distinct color/hue when it does/doesn’t have analytics). In [23]: sns.barplot(data=status_has_analytics, x="count", y="status", hue="has_analytics");


A summary to create the above plot from scratch:

import pandas as pd
import seaborn as sns
%matplotlib notebook

url = "http://articlemeta.scielo.org/api/v1/collection/identifiers/"
dataset = pd.read_json(url)
status_has_analytics = \
    dataset.groupby(["status", "has_analytics"]) \
           .size() \
           .rename("count") \
           .reset_index()
sns.barplot(data=status_has_analytics, x="count", y="status", hue="has_analytics");

2 Downloading the reports/spreadsheets from SciELO Analytics

The SciELO Project provides spreadsheets in CSV format with the metadata of the articles stored/accessed in its database. These reports can be found at https://analytics.scielo.org/w/reports as ZIP packages that are updated monthly. There’s an easy way to download all the ZIP packages from that link on Linux or any environment with wget available: we just need to download that web page with a single “crawl” step, i.e., download all files that are referenced on that page. That can be done with:

wget -rcHl1 https://analytics.scielo.org/w/reports

It should create several directories, one for each host. The ZIP packages are in the static.scielo.org/tabs directory. As of today, there are 22 links (a Python alternative to the wget crawl is sketched after this list):

• https://static.scielo.org/tabs/tabs_arg.zip
• https://static.scielo.org/tabs/tabs_bol.zip
• https://static.scielo.org/tabs/tabs_bra.zip
• https://static.scielo.org/tabs/tabs_chl.zip
• https://static.scielo.org/tabs/tabs_col.zip
• https://static.scielo.org/tabs/tabs_cri.zip
• https://static.scielo.org/tabs/tabs_cub.zip
• https://static.scielo.org/tabs/tabs_ecu.zip
• https://static.scielo.org/tabs/tabs_esp.zip
• https://static.scielo.org/tabs/tabs_mex.zip
• https://static.scielo.org/tabs/tabs_network.zip
• https://static.scielo.org/tabs/tabs_per.zip
• https://static.scielo.org/tabs/tabs_prt.zip
• https://static.scielo.org/tabs/tabs_pry.zip
• https://static.scielo.org/tabs/tabs_psi.zip
• https://static.scielo.org/tabs/tabs_rve.zip
• https://static.scielo.org/tabs/tabs_rvt.zip
• https://static.scielo.org/tabs/tabs_spa.zip
• https://static.scielo.org/tabs/tabs_sss.zip
• https://static.scielo.org/tabs/tabs_sza.zip
• https://static.scielo.org/tabs/tabs_ury.zip
• https://static.scielo.org/tabs/tabs_ven.zip

The file names follow a tabs_COLLECTION.zip structure, that is, the file name suffix before the extension is usually the collection code. There are only 2 exceptions to this rule:

• tabs_bra.zip: Brazil collection, the first SciELO collection, which has the scl (legacy) code.
• tabs_network.zip: all entries from all collection-specific reports together.

These ZIP packages have files with the following names:

• accesses_by_journals.csv
• documents_affiliations.csv
• documents_altmetrics.csv
• documents_authors.csv
• documents_counts.csv
• documents_dates.csv
• documents_languages.csv
• documents_licenses.csv
• journals.csv
• journals_kbart.csv
• journals_status_changes.csv

The CSV type is the name of the file without its extension, e.g. documents_counts.
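A Python alternative to the wget crawl, using only the standard library (illustrative sketch; the collection codes are taken from the 22 links listed above):

    from urllib.request import urlretrieve

    codes = ["arg", "bol", "bra", "chl", "col", "cri", "cub", "ecu", "esp",
             "mex", "network", "per", "prt", "pry", "psi", "rve", "rvt",
             "spa", "sss", "sza", "ury", "ven"]
    for code in codes:
        # Saves each package as tabs_CODE.zip in the current directory.
        urlretrieve(f"https://static.scielo.org/tabs/tabs_{code}.zip",
                    f"tabs_{code}.zip")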

The specs (in Portuguese) for all the CSV files can be found at http://docs.scielo.org/projects/scielo-processing/pt/latest/public_reports.html but, as of today, English-only readers should rely on these notebooks. For the remaining notebooks, the contents of every tabs_COLLECTION.zip file had been extracted into the tabs_COLLECTION/ directory. On a Linux shell, that could be done with this command:

for f in tabs_*.zip ; do unzip -d $(basename ${f%%.zip}) $f ; done
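A pure-Python equivalent of that shell loop (illustrative sketch using the standard zipfile module):

    import zipfile
    from glob import glob
    from os.path import splitext

    for fname in glob("tabs_*.zip"):
        with zipfile.ZipFile(fname) as zf:
            # e.g. tabs_arg.zip gets extracted into the tabs_arg/ directory.
            zf.extractall(splitext(fname)[0])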

3 Simplifying the column names (CSV header)

The CSV types and their columns have a Brazilian Portuguese description in SciELO’s public reports documentation page[1]. However, the column names have some issues:

• Sometimes they’re way too long;
• Almost always they have some whitespace or other non-alphanumeric character;
• Some names have trailing whitespace;
• There are multiple languages in journals_kbart.csv, following the Brazilian Portuguese Name (english_name) format;
• They might include redundant/ambiguous/misleading parts.

In summary, they’re difficult to deal with when performing exploratory data analysis or otherwise using them in the middle of some source code, in almost any language. Our goal is to simplify them towards a snake_case format.

In [1]: import csv
        from glob import glob

In [2]: import numpy as np
        import pandas as pd

        pd.options.display.max_rows = 200  # Default is 60

3.1 Current rows

Among the column titles from every CSV file, there are a lot of names that appear more than once. The first 5 columns of every CSV type are:

In [3]: for fname in glob("tabs_network/*.csv"):
            with open(fname) as f:
                cr = csv.reader(f)
                print(next(cr)[:5])

['extraction date', 'study unit', 'collection', 'ISSN SciELO', "ISSN's"]
['extraction date', 'study unit', 'collection', 'ISSN SciELO', "ISSN's"]
['extraction date', 'study unit', 'collection', 'ISSN SciELO', "ISSN's"]
['extraction date', 'study unit', 'collection', 'ISSN SciELO', "ISSN's"]
['extraction date', 'study unit', 'collection', 'ISSN SciELO', "ISSN's"]
['extraction date', 'study unit', 'collection', 'ISSN SciELO', "ISSN's"]
['extraction date', 'study unit', 'collection', 'ISSN SciELO', "ISSN's"]
['Título do Periódico (publication_title)', 'ISSN impresso (print_identifier)', 'ISSN online (online_identifier)', 'Data do primeiro fascículo (date_first_issue_online)', 'volume do primeiro fascículo (num_first_vol_online)']
['extraction date', 'study unit', 'collection', 'ISSN SciELO', "ISSN's"]
['extraction date', 'study unit', 'collection', 'ISSN SciELO', "ISSN's"]
['extraction date', 'study unit', 'collection', 'ISSN SciELO', "ISSN's"]

Joining the column headers from every CSV, the full set of names we obtain is:

In [4]: names = set()
        for fname in glob("tabs_network/*.csv"):
            with open(fname) as f:
                cr = csv.reader(f)
                names.update(next(cr))
        np.array(sorted(names))

[1]http://docs.scielo.org/projects/scielo-processing/pt/latest/public_reports.html


Out [4]: array(['+6 authors', '0 authors', '1 author', '2 authors', '3 authors', '4 authors', '5 authors', 'Data do primeiro fascículo (date_first_issue_online)', 'Data do último fascículo publicado (date_last_issue_online)', 'ID de publicação pai (parent_publication_title_id)', 'ID de publicação prévia (preceding_publication_title_id)', 'ID do periódico no SciELO (title_id)', 'ISSN SciELO', 'ISSN impresso (print_identifier)', 'ISSN online (online_identifier)', "ISSN's", 'Título do Periódico (publication_title)', 'accesses to abstract', 'accesses to epdf', 'accesses to html', 'accesses to pdf', 'accesses year', 'alpha frequency', 'altmetrics url', 'authors', 'citable documents', 'citable documents at 2013', 'citable documents at 2014', 'citable documents at 2015', 'citable documents at 2016', 'citable documents at 2017', 'citable documents at 2018', 'cobertura (coverage_depth)', 'collection', 'data de publicação monográfica impressa (date_monograph_published_print)', 'data de publicação monográfica online (date_monograph_published_online)', 'date of the first document', 'date of the last document', 'document accepted at', 'document accepted at day', 'document accepted at month', 'document accepted at year', 'document affiliation city', 'document affiliation country', 'document affiliation country ISO 3166', 'document affiliation instituition', 'document affiliation state', 'document author', 'document author affiliation city', 'document author affiliation country', 'document author affiliation state', 'document author institution', 'document en', 'document es', 'document is citable', 'document languages', 'document license', 'document other languages', 'document pt', 'document published as ahead of print at', 'document published as ahead of print at day', 'document published as ahead of print at month', 'document published as ahead of print at year', 'document published at', 'document published at day', 'document published at month', 'document published at year', 'document published in SciELO at', 'document published in SciELO at day', 'document published in SciELO at month', 'document published in SciELO at year', 'document publishing ID (PID SciELO)', 'document publishing year', 'document reviewed at', 'document reviewed at day', 'document reviewed at month', 'document reviewed at year', 'document submitted at', 'document submitted at day', 'document submitted at month', 'document submitted at year', 'document type', 'document updated in SciELO at', 'document updated in SciELO at day', 'document updated in SciELO at month', 'document updated in SciELO at year', 'documents at 2013', 'documents at 2014', 'documents at 2015', 'documents at 2016', 'documents at 2017', 'documents at 2018', 'edição de monografia (monograph_edition)', 'english documents at 2013 ', 'english documents at 2014 ', 'english documents at 2015 ', 'english documents at 2016 ', 'english documents at 2017 ', 'english documents at 2018 ', 'extraction date', 'google scholar h5 2013 ', 'google scholar h5 2014 ', 'google scholar h5 2015 ',


'google scholar h5 2016 ', 'google scholar h5 2017 ', 'google scholar h5 2018 ', 'google scholar m5 2013 ', 'google scholar m5 2014 ', 'google scholar m5 2015 ', 'google scholar m5 2016 ', 'google scholar m5 2017 ', 'google scholar m5 2018 ', 'inclusion year at SciELO', 'informação de embargo (embargo_info)', 'informação sobre cobertura (coverage_notes)', 'issue of the first document', 'issue of the last document', 'issues at 2013', 'issues at 2014', 'issues at 2015', 'issues at 2016', 'issues at 2017', 'issues at 2018', 'nome do publicador (publisher_name)', 'numeric frequency (in months)', 'número do primeiro fascículo (num_first_issue_online)', 'número do último fascículo publicado (num_last_issue_online)', 'other language documents at 2013 ', 'other language documents at 2014 ', 'other language documents at 2015 ', 'other language documents at 2016 ', 'other language documents at 2017 ', 'other language documents at 2018 ', 'pages', 'portuguese documents at 2013 ', 'portuguese documents at 2014 ', 'portuguese documents at 2015 ', 'portuguese documents at 2016 ', 'portuguese documents at 2017 ', 'portuguese documents at 2018 ', 'primeiro autor (first_author)', 'primeiro editor (first_editor)', 'publisher name', 'publishing year', 'references', 'regular issues at 2013', 'regular issues at 2014', 'regular issues at 2015', 'regular issues at 2016', 'regular issues at 2017', 'regular issues at 2018', 'score', 'short title ISO', 'short title SciELO', 'spanish documents at 2013 ', 'spanish documents at 2014 ', 'spanish documents at 2015 ', 'spanish documents at 2016 ', 'spanish documents at 2017 ', 'spanish documents at 2018 ', 'status change date', 'status change day', 'status change month', 'status change reason', 'status change year', 'status changed to', 'stopping reason', 'stopping year at SciELO', 'study unit', 'tipo de acesso (access_type)', 'tipo de publicação (publication_type)', 'title + subtitle SciELO', 'title PubMed', 'title at SciELO', 'title current status', 'title is agricultural sciences', 'title is applied social sciences', 'title is biological sciences', 'title is engineering', 'title is exact and earth sciences', 'title is health sciences', 'title is human sciences', 'title is linguistics, letters and arts', 'title is multidisciplinary', 'title thematic areas', 'total accesses', 'total of documents', 'total of issues', 'total of regular issues', 'url de fascículos (title_url)', 'use license', 'volume de monografia (monograph_volume)', 'volume do primeiro fascículo (num_first_vol_online)', 'volume do último fascículo publicado (num_last_vol_online)', 'volume of the first document', 'volume of the last document'], dtype='

Sometimes the column names have some misleading stuff we can fix, like:

• Extra trailing whitespace
• Distinct/mixed letter cases
• A redundant parentheses structure like Plain text column description in Portuguese (snake_case_descr_in_english)
• Symbols like ' and +


We can fix that by keeping only the parenthesized code, removing some less meaningful common words, shortening some lengthy words, and performing some replacements:

In [5]: def normalize_column_title(name):
            import re
            name_unbracketed = re.sub(r".*\((.*)\)", r"\1",
                                      name.replace("(in months)", "in_months"))
            words = re.sub("[^a-z0-9+_ ]", "", name_unbracketed.lower()).split()
            ignored_words = ("at", "the", "of", "and", "google", "scholar", "+")
            replacements = {
                "document": "doc",
                "documents": "docs",
                "frequency": "freq",
                "language": "lang",
            }
            return "_".join(replacements.get(word, word)
                            for word in words
                            if word not in ignored_words) \
                   .replace("title_is", "is")
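A few illustrative sanity checks of normalize_column_title (ours, not in the original notebook; the expected values can be read off the mapping table below):

    assert normalize_column_title("google scholar h5 2013 ") == "h5_2013"
    assert normalize_column_title("Título do Periódico (publication_title)") \
        == "publication_title"
    assert normalize_column_title("document affiliation country ISO 3166") \
        == "doc_affiliation_country_iso_3166"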

With Pandas, its use should be straightforward.

In [6]: network_journals = pd.read_csv("tabs_network/journals.csv") \
                             .rename(columns=normalize_column_title)
        network_journals.columns

Out [6]: Index(['extraction_date', 'study_unit', 'collection', 'issn_scielo', 'issns', 'title_scielo', 'title_thematic_areas', 'is_agricultural_sciences', 'is_applied_social_sciences', 'is_biological_sciences', 'is_engineering', 'is_exact_earth_sciences', 'is_health_sciences', 'is_human_sciences', 'is_linguistics_letters_arts', 'is_multidisciplinary', 'title_current_status', 'title_subtitle_scielo', 'short_title_scielo', 'short_iso', 'title_pubmed', 'publisher_name', 'use_license', 'alpha_freq', 'numeric_freq_in_months', 'inclusion_year_scielo', 'stopping_year_scielo', 'stopping_reason', 'date_first_doc', 'volume_first_doc', 'issue_first_doc', 'date_last_doc', 'volume_last_doc', 'issue_last_doc', 'total_issues', 'issues_2018', 'issues_2017', 'issues_2016', 'issues_2015', 'issues_2014', 'issues_2013', 'total_regular_issues', 'regular_issues_2018', 'regular_issues_2017', 'regular_issues_2016', 'regular_issues_2015', 'regular_issues_2014', 'regular_issues_2013', 'total_docs', 'docs_2018', 'docs_2017', 'docs_2016', 'docs_2015', 'docs_2014', 'docs_2013', 'citable_docs', 'citable_docs_2018', 'citable_docs_2017', 'citable_docs_2016', 'citable_docs_2015', 'citable_docs_2014', 'citable_docs_2013', 'portuguese_docs_2018', 'portuguese_docs_2017', 'portuguese_docs_2016', 'portuguese_docs_2015', 'portuguese_docs_2014', 'portuguese_docs_2013', 'spanish_docs_2018', 'spanish_docs_2017', 'spanish_docs_2016', 'spanish_docs_2015', 'spanish_docs_2014', 'spanish_docs_2013', 'english_docs_2018', 'english_docs_2017', 'english_docs_2016', 'english_docs_2015', 'english_docs_2014', 'english_docs_2013', 'other_lang_docs_2018', 'other_lang_docs_2017', 'other_lang_docs_2016', 'other_lang_docs_2015', 'other_lang_docs_2014', 'other_lang_docs_2013', 'h5_2018', 'h5_2017', 'h5_2016', 'h5_2015', 'h5_2014', 'h5_2013', 'm5_2018', 'm5_2017', 'm5_2016', 'm5_2015', 'm5_2014', 'm5_2013'], dtype='object')

The map of names is:


In [7]: name_map = pd.DataFrame(
            pd.Series({name: normalize_column_title(name) for name in names})
              .rename("simple_name"))
        name_map.sort_values("simple_name")

Out [7]: (original column name → simple_name)
+6 authors → +6_authors
0 authors → 0_authors
1 author → 1_author
2 authors → 2_authors
3 authors → 3_authors
4 authors → 4_authors
5 authors → 5_authors
tipo de acesso (access_type) → access_type
accesses to abstract → accesses_to_abstract
accesses to epdf → accesses_to_epdf
accesses to html → accesses_to_html
accesses to pdf → accesses_to_pdf
accesses year → accesses_year
alpha frequency → alpha_freq
altmetrics url → altmetrics_url
authors → authors
citable documents → citable_docs
citable documents at 2013 → citable_docs_2013
citable documents at 2014 → citable_docs_2014
citable documents at 2015 → citable_docs_2015
citable documents at 2016 → citable_docs_2016
citable documents at 2017 → citable_docs_2017
citable documents at 2018 → citable_docs_2018
collection → collection
cobertura (coverage_depth) → coverage_depth
informação sobre cobertura (coverage_notes) → coverage_notes
date of the first document → date_first_doc
Data do primeiro fascículo (date_first_issue_online) → date_first_issue_online
date of the last document → date_last_doc
Data do último fascículo publicado (date_last_issue_online) → date_last_issue_online
data de publicação monográfica online (date_monograph_published_online) → date_monograph_published_online
data de publicação monográfica impressa (date_monograph_published_print) → date_monograph_published_print
document accepted at → doc_accepted
document accepted at day → doc_accepted_day
document accepted at month → doc_accepted_month
document accepted at year → doc_accepted_year
document affiliation city → doc_affiliation_city
document affiliation country → doc_affiliation_country
document affiliation country ISO 3166 → doc_affiliation_country_iso_3166
document affiliation instituition → doc_affiliation_instituition
document affiliation state → doc_affiliation_state
document author → doc_author
document author affiliation city → doc_author_affiliation_city
document author affiliation country → doc_author_affiliation_country
document author affiliation state → doc_author_affiliation_state
document author institution → doc_author_institution
document en → doc_en
document es → doc_es
document is citable → doc_is_citable
document languages → doc_languages
document license → doc_license
document other languages → doc_other_languages
document pt → doc_pt
document published at → doc_published
document published as ahead of print at → doc_published_as_ahead_print
document published as ahead of print at day → doc_published_as_ahead_print_day
document published as ahead of print at month → doc_published_as_ahead_print_month
document published as ahead of print at year → doc_published_as_ahead_print_year
document published at day → doc_published_day
document published in SciELO at → doc_published_in_scielo
document published in SciELO at day → doc_published_in_scielo_day
document published in SciELO at month → doc_published_in_scielo_month
document published in SciELO at year → doc_published_in_scielo_year
document published at month → doc_published_month
document published at year → doc_published_year
document publishing year → doc_publishing_year
document reviewed at → doc_reviewed
document reviewed at day → doc_reviewed_day
document reviewed at month → doc_reviewed_month
document reviewed at year → doc_reviewed_year
document submitted at → doc_submitted
document submitted at day → doc_submitted_day
document submitted at month → doc_submitted_month
document submitted at year → doc_submitted_year
document type → doc_type
document updated in SciELO at → doc_updated_in_scielo
document updated in SciELO at day → doc_updated_in_scielo_day
document updated in SciELO at month → doc_updated_in_scielo_month
document updated in SciELO at year → doc_updated_in_scielo_year
documents at 2013 → docs_2013
documents at 2014 → docs_2014
documents at 2015 → docs_2015
documents at 2016 → docs_2016
documents at 2017 → docs_2017
documents at 2018 → docs_2018
informação de embargo (embargo_info) → embargo_info
english documents at 2013 → english_docs_2013
english documents at 2014 → english_docs_2014
english documents at 2015 → english_docs_2015
english documents at 2016 → english_docs_2016
english documents at 2017 → english_docs_2017
english documents at 2018 → english_docs_2018
extraction date → extraction_date
primeiro autor (first_author) → first_author
primeiro editor (first_editor) → first_editor
google scholar h5 2013 → h5_2013
google scholar h5 2014 → h5_2014
google scholar h5 2015 → h5_2015
google scholar h5 2016 → h5_2016
google scholar h5 2017 → h5_2017
google scholar h5 2018 → h5_2018
inclusion year at SciELO → inclusion_year_scielo
title is agricultural sciences → is_agricultural_sciences
title is applied social sciences → is_applied_social_sciences
title is biological sciences → is_biological_sciences
title is engineering → is_engineering
title is exact and earth sciences → is_exact_earth_sciences
title is health sciences → is_health_sciences
title is human sciences → is_human_sciences
title is linguistics, letters and arts → is_linguistics_letters_arts
title is multidisciplinary → is_multidisciplinary
ISSN SciELO → issn_scielo
ISSN’s → issns
issue of the first document → issue_first_doc
issue of the last document → issue_last_doc
issues at 2013 → issues_2013
issues at 2014 → issues_2014
issues at 2015 → issues_2015
issues at 2016 → issues_2016
issues at 2017 → issues_2017
issues at 2018 → issues_2018
google scholar m5 2013 → m5_2013
google scholar m5 2014 → m5_2014
google scholar m5 2015 → m5_2015
google scholar m5 2016 → m5_2016
google scholar m5 2017 → m5_2017
google scholar m5 2018 → m5_2018
edição de monografia (monograph_edition) → monograph_edition
volume de monografia (monograph_volume) → monograph_volume
número do primeiro fascículo (num_first_issue_online) → num_first_issue_online
volume do primeiro fascículo (num_first_vol_online) → num_first_vol_online
número do último fascículo publicado (num_last_issue_online) → num_last_issue_online
volume do último fascículo publicado (num_last_vol_online) → num_last_vol_online
numeric frequency (in months) → numeric_freq_in_months
ISSN online (online_identifier) → online_identifier
other language documents at 2013 → other_lang_docs_2013
other language documents at 2014 → other_lang_docs_2014
other language documents at 2015 → other_lang_docs_2015
other language documents at 2016 → other_lang_docs_2016
other language documents at 2017 → other_lang_docs_2017
other language documents at 2018 → other_lang_docs_2018
pages → pages
ID de publicação pai (parent_publication_title_id) → parent_publication_title_id
document publishing ID (PID SciELO) → pid_scielo
portuguese documents at 2013 → portuguese_docs_2013
portuguese documents at 2014 → portuguese_docs_2014
portuguese documents at 2015 → portuguese_docs_2015
portuguese documents at 2016 → portuguese_docs_2016
portuguese documents at 2017 → portuguese_docs_2017
portuguese documents at 2018 → portuguese_docs_2018
ID de publicação prévia (preceding_publication_title_id) → preceding_publication_title_id
ISSN impresso (print_identifier) → print_identifier
Título do Periódico (publication_title) → publication_title
tipo de publicação (publication_type) → publication_type
publisher name → publisher_name
nome do publicador (publisher_name) → publisher_name
publishing year → publishing_year
references → references
regular issues at 2013 → regular_issues_2013
regular issues at 2014 → regular_issues_2014
regular issues at 2015 → regular_issues_2015
regular issues at 2016 → regular_issues_2016
regular issues at 2017 → regular_issues_2017
regular issues at 2018 → regular_issues_2018
score → score
short title ISO → short_iso
short title SciELO → short_title_scielo
spanish documents at 2013 → spanish_docs_2013
spanish documents at 2014 → spanish_docs_2014
spanish documents at 2015 → spanish_docs_2015
spanish documents at 2016 → spanish_docs_2016
spanish documents at 2017 → spanish_docs_2017
spanish documents at 2018 → spanish_docs_2018
status change date → status_change_date
status change day → status_change_day
status change month → status_change_month
status change reason → status_change_reason
status change year → status_change_year
status changed to → status_changed_to
stopping reason → stopping_reason
stopping year at SciELO → stopping_year_scielo
study unit → study_unit
title current status → title_current_status
ID do periódico no SciELO (title_id) → title_id
title PubMed → title_pubmed
title at SciELO → title_scielo
title + subtitle SciELO → title_subtitle_scielo
title thematic areas → title_thematic_areas
url de fascículos (title_url) → title_url
total accesses → total_accesses
total of documents → total_docs
total of issues → total_issues
total of regular issues → total_regular_issues
use license → use_license
volume of the first document → volume_first_doc
volume of the last document → volume_last_doc

There’s no overlap in these new names: In [8]: name_map.shape[0] == len(names)

Out [8]: True

4 Do the network reports have everything from the remaining reports?

Note: This notebook requires a machine with 16GB of RAM + swap, since it loads all reports from SciELO Analytics at once, and the performed calculations require some extra memory. Actually, the rows in each CSV from the network reports are just the rows of the respective CSVs from the collection-specific reports joined together (only the network package isn’t collection-specific). Also, the network reports are the only ones with “rows of intersection” between files (the joined collection-specific data). Below is an empirical justification for that.

In [1]: import collections, glob, os

In [2]: import numpy as np
        import pandas as pd

4.1 CSV Types

The files available in the ZIP packages are:

• accesses_by_journals.csv
• documents_affiliations.csv
• documents_altmetrics.csv
• documents_authors.csv
• documents_counts.csv
• documents_dates.csv
• documents_languages.csv
• documents_licenses.csv
• journals.csv
• journals_kbart.csv
• journals_status_changes.csv

The CSV type is the name of the file without its extension, e.g. documents_counts.

4.2 Loading all CSV files at once

Each package has been unzipped into a directory named like tabs_spa, where spa is a collection code (i.e., there’s just a tabs_ leading prefix). Let’s load it all in a nested dictionary structure, to have a dataframe with the CSV contents in dfs["documents_authors"]["spa"].

In [3]: dfs = collections.defaultdict(lambda: collections.defaultdict(dict))
        for fname in glob.glob("tabs_*/*.csv"):
            dname, csvname = os.path.split(fname)
            dfs[os.path.splitext(csvname)[0]][dname[5:]] = \
                pd.read_csv(fname, dtype=str, keep_default_na=False)

Therefore, the tabs_network/journals.csv file is in: In [4]: network_journals = dfs["journals"]["network"]

4.3 Are the rows from all journals.csv in tabs_network/journals.csv?

Yes, and every row from tabs_network/journals.csv is in another journals.csv file. To prove that, let’s join the rows from every journals.csv source but the one from the network:


In [5]: all_journals = pd.concat([df for k, df in dfs["journals"].items()
                                  if k != "network"])

This joined dataframe has the same shape/size as the network journals dataframe, and no row is duplicated in these two dataframes:

In [6]: {
            "all_journals": all_journals.shape,
            "all_journals (unique)": all_journals.drop_duplicates().shape,
            "network_journals": network_journals.shape,
            "network_journals (unique)": network_journals.drop_duplicates().shape,
        }

Out [6]: {'all_journals': (1732, 98), 'all_journals (unique)': (1732, 98), 'network_journals': (1732, 98), 'network_journals (unique)': (1732, 98)}

The column names are all the same: In [7]: np.all(network_journals.columns.sort_values() == all_journals.columns.sort_values())

Out [7]: True

Every row is in the intersection: In [8]: pd.merge(network_journals, all_journals).shape

Out [8]: (1732, 98)

And the symmetric difference is empty: In [9]: pd.concat([network_journals, all_journals]).drop_duplicates(keep=False)

Out [9]: Empty DataFrame (98 columns: extraction date, study unit, collection, ISSN SciELO, ISSN’s, title at SciELO, title thematic areas, title is agricultural sciences, title is applied social sciences, title is biological sciences, ..., google scholar h5 2016, google scholar h5 2015, google scholar h5 2014, google scholar h5 2013, google scholar m5 2018, google scholar m5 2017, google scholar m5 2016, google scholar m5 2015, google scholar m5 2014, google scholar m5 2013)

Therefore, we can say the journals.csv in tabs_network has exactly the same rows as the remaining journals.csv files joined together.

4.4 Does tabs_network have all CSV types?

Yes. Every CSV type has a network entry: In [10]: {k: "network" in v for k, v in dfs.items()}

Out [10]: {'documents_languages': True,
           'accesses_by_journals': True,
           'documents_dates': True,
           'documents_counts': True,
           'documents_altmetrics': True,
           'documents_authors': True,
           'journals': True,
           'journals_kbart': True,
           'documents_affiliations': True,
           'documents_licenses': True,
           'journals_status_changes': True}

4.5 Comparing the network reports with the remaining reports for all CSV types

Let’s perform on every CSV type the same verification we did on journals:

In [11]: for csv_type, datasets in dfs.items():
             print(f"Evaluating {csv_type} ...")
             network = datasets["network"]
             network_dd = network.drop_duplicates()
             remaining = pd.concat([df for k, df in datasets.items()
                                    if k != "network"])
             remaining_dd = remaining.drop_duplicates()
             shapes = [remaining.shape, remaining_dd.shape,
                       network.shape, network_dd.shape]
             if len(set(shapes)) != 1:
                 print(f"  There are duplicated rows or distinct sizes on {csv_type}:")
                 print(f"    {shapes}")
             if np.any(network.columns.sort_values() !=
                       remaining.columns.sort_values()):
                 print(f"  The columns of {csv_type} aren't the same!")
                 continue
             intersection = pd.merge(network_dd, remaining_dd)
             symmetric_difference = pd.concat([network_dd, remaining_dd]) \
                                      .drop_duplicates(keep=False)
             if intersection.shape != shapes[1]:
                 print(f"  The intersection of {csv_type} "
                       "doesn't have the same number of rows!")
             if symmetric_difference.shape[0] != 0:
                 print(f"  Symmetric difference of {csv_type} isn't empty!")

Evaluating documents_languages ...
Evaluating accesses_by_journals ...


Evaluating documents_dates ...
Evaluating documents_counts ...
Evaluating documents_altmetrics ...
Evaluating documents_authors ...
  There are duplicated rows or distinct sizes on documents_authors:
    [(2872098, 26), (2844742, 26), (2872098, 26), (2844742, 26)]
Evaluating journals ...
Evaluating journals_kbart ...
Evaluating documents_affiliations ...
  There are duplicated rows or distinct sizes on documents_affiliations:
    [(1690988, 26), (1415499, 26), (1690988, 26), (1415499, 26)]
Evaluating documents_licenses ...
Evaluating journals_status_changes ...

There are a lot of duplications going on in both documents_affiliations and documents_authors. Apart from these, the rows are unique. The set of [distinct] rows from every network CSV is always the set of [distinct] rows from the remaining CSVs joined together.

4.6 Count matching

Are the duplication counts also matching?

In [12]: for csv_type in ["documents_affiliations", "documents_authors"]:
             datasets = dfs[csv_type]
             print(f"Evaluating {csv_type} ...")
             network = datasets["network"]
             network_dd = network.drop_duplicates()
             network_gs = network.groupby(network.columns.tolist()).size() \
                                 .rename("duplication_count") \
                                 .reset_index()
             remaining = pd.concat([df for k, df in datasets.items()
                                    if k != "network"])
             remaining_gs = remaining.groupby(remaining.columns.tolist()).size() \
                                     .rename("duplication_count") \
                                     .reset_index()
             shapes = [(network_dd.shape[0], network_dd.shape[1] + 1),
                       network_gs.shape, remaining_gs.shape]
             if len(set(shapes)) != 1:
                 print(f"  The duplicated rows don't count the same on {csv_type}:")
                 print(f"    {shapes}")
             intersection = pd.merge(network_gs, remaining_gs)
             symmetric_difference = pd.concat([network_gs, remaining_gs]) \
                                      .drop_duplicates(keep=False)
             if intersection.shape != shapes[0]:
                 print(f"  The intersection of {csv_type} "
                       "w/ a duplication_count column "
                       "doesn't have the expected number of rows!")
             if symmetric_difference.shape[0] != 0:
                 print(f"  Symmetric difference of {csv_type} "
                       "w/ a duplication_count column isn't empty!")

Evaluating documents_affiliations ...
Evaluating documents_authors ...

Yes, they are! =)


4.7 Duplication in CSV files besides network

Do any of the CSV files, individually, have duplicates?

In [13]: nrowsdf = pd.DataFrame({"filename": f"tabs_{collection}/{csv_type}.csv",
                                 "csv_type": csv_type,
                                 "collection": collection,
                                 "total_rows": dataset.shape[0],
                                 "unique_rows": dataset.drop_duplicates().shape[0],
                                 } for csv_type, datasets in dfs.items()
                                   for collection, dataset in datasets.items()) \
                     .reindex(columns=["filename", "csv_type", "collection",
                                       "total_rows", "unique_rows"])
         nrowsdf[nrowsdf["total_rows"] != nrowsdf["unique_rows"]]

Out [13]:      filename                                 csv_type                collection  total_rows  unique_rows
          110  tabs_ury/documents_authors.csv           documents_authors       ury              14279        14260
          111  tabs_per/documents_authors.csv           documents_authors       per              35037        34374
          113  tabs_col/documents_authors.csv           documents_authors       col             170355       169618
          114  tabs_sza/documents_authors.csv           documents_authors       sza              63613        63194
          115  tabs_bol/documents_authors.csv           documents_authors       bol              10069        10062
          116  tabs_ven/documents_authors.csv           documents_authors       ven              56059        56056
          117  tabs_cri/documents_authors.csv           documents_authors       cri              22856        22765
          118  tabs_cub/documents_authors.csv           documents_authors       cub             115951       115937
          119  tabs_bra/documents_authors.csv           documents_authors       bra            1413752      1397115
          120  tabs_mex/documents_authors.csv           documents_authors       mex             150356       150000
          122  tabs_arg/documents_authors.csv           documents_authors       arg             114048       113076
          123  tabs_esp/documents_authors.csv           documents_authors       esp             170066       168916
          125  tabs_chl/documents_authors.csv           documents_authors       chl             197798       194644
          126  tabs_network/documents_authors.csv       documents_authors       network        2872098      2844742
          127  tabs_psi/documents_authors.csv           documents_authors       psi              53616        53354
          128  tabs_rve/documents_authors.csv           documents_authors       rve              83395        82757
          129  tabs_spa/documents_authors.csv           documents_authors       spa             145604       144402
          130  tabs_sss/documents_authors.csv           documents_authors       sss               1584         1564
          131  tabs_prt/documents_authors.csv           documents_authors       prt              53264        52252
          176  tabs_ury/documents_affiliations.csv      documents_affiliations  ury               8189         6267
          177  tabs_per/documents_affiliations.csv      documents_affiliations  per              20382        17189
          178  tabs_ecu/documents_affiliations.csv      documents_affiliations  ecu                 20           18
          179  tabs_col/documents_affiliations.csv      documents_affiliations  col             132502       100039
          180  tabs_sza/documents_affiliations.csv      documents_affiliations  sza              45553        39017
          181  tabs_bol/documents_affiliations.csv      documents_affiliations  bol               6491         5993
          182  tabs_ven/documents_affiliations.csv      documents_affiliations  ven              31500        27619
          183  tabs_cri/documents_affiliations.csv      documents_affiliations  cri              17301        13879
          184  tabs_cub/documents_affiliations.csv      documents_affiliations  cub              51397        50058
          185  tabs_bra/documents_affiliations.csv      documents_affiliations  bra             804928       653808
          186  tabs_mex/documents_affiliations.csv      documents_affiliations  mex              97770        85514
          188  tabs_arg/documents_affiliations.csv      documents_affiliations  arg              64825        60439
          189  tabs_esp/documents_affiliations.csv      documents_affiliations  esp              79803        72114
          190  tabs_rvt/documents_affiliations.csv      documents_affiliations  rvt                345          211
          191  tabs_chl/documents_affiliations.csv      documents_affiliations  chl             112992        97975
          192  tabs_network/documents_affiliations.csv  documents_affiliations  network        1690988      1415499
          193  tabs_psi/documents_affiliations.csv      documents_affiliations  psi              36800        34049
          194  tabs_rve/documents_affiliations.csv      documents_affiliations  rve              58789        43740
          195  tabs_spa/documents_affiliations.csv      documents_affiliations  spa              90207        79197
          196  tabs_sss/documents_affiliations.csv      documents_affiliations  sss                963          898
          197  tabs_prt/documents_affiliations.csv      documents_affiliations  prt              30231        27475

We already knew these two Network spreadsheets had duplicates, but it’s clear that they aren’t the only ones.

4.8 Does any duplication happen between files (besides network)?

No, since the sum of the number of unique rows from each CSV file matches the number of unique rows in the network file:

In [14]: nrowsdf[(nrowsdf["collection"] == "network")
                 & nrowsdf["csv_type"].isin(["documents_affiliations",
                                             "documents_authors"])]

Out [14]:      filename                                 csv_type                collection  total_rows  unique_rows
          126  tabs_network/documents_authors.csv       documents_authors       network        2872098      2844742
          192  tabs_network/documents_affiliations.csv  documents_affiliations  network        1690988      1415499

In [15]: nrowsdf[(nrowsdf["collection"] != "network")
                 & nrowsdf["csv_type"].isin(["documents_affiliations",
                                             "documents_authors"])] \
             .groupby("csv_type").sum()


Out [15]:                          total_rows  unique_rows
          csv_type
          documents_affiliations      1690988      1415499
          documents_authors           2872098      2844742

4.9 Collection in documents_affiliations and documents_authors

There’s a column named collection in both these CSV types. The tabs_network/documents_affiliations.csv and tabs_network/documents_authors.csv files contain entries from several collections.

In [16]: network_coll = pd.concat([
             dfs["documents_affiliations"]["network"]
                 .groupby("collection")
                 .size()
                 .rename("documents_affiliations.csv"),
             dfs["documents_authors"]["network"]
                 .groupby("collection")
                 .size()
                 .rename("documents_authors.csv"),
         ], axis=1).sort_index()
         network_coll

Out [16]:             documents_affiliations.csv  documents_authors.csv
          collection
          arg                                64825                 114048
          bol                                 6491                  10069
          chl                               112992                 197798
          col                               132502                 170355
          cri                                17301                  22856
          cub                                51397                 115951
          ecu                                   20                     45
          esp                                79803                 170066
          mex                                97770                 150356
          per                                20382                  35037
          prt                                30231                  53264
          psi                                36800                  53616
          rve                                58789                  83395
          rvt                                  345                    351
          scl                               804928                1413752
          spa                                90207                 145604
          sss                                  963                   1584
          sza                                45553                  63613
          ury                                 8189                  14279
          ven                                31500                  56059

However, there’s at most a single collection in the remaining reports. Actually, we should call each remaining report a collection-specific report:

In [17]: doc_coll_dict_sized = pd.merge(*[
             pd.DataFrame([
                 {"collection": collection,
                  csv_type + ".csv": dataset.groupby("collection")
                                            .size().to_dict()}
                 for collection, dataset in dfs[csv_type].items()
                 if collection != "network"
             ])
             for csv_type in ["documents_affiliations", "documents_authors"]
         ]).set_index("collection").sort_index()
         doc_coll_dict_sized

Out [17]:             documents_affiliations.csv  documents_authors.csv
          collection
          arg           {'arg': 64825}              {'arg': 114048}
          bol           {'bol': 6491}               {'bol': 10069}
          bra           {'scl': 804928}             {'scl': 1413752}
          chl           {'chl': 112992}             {'chl': 197798}
          col           {'col': 132502}             {'col': 170355}
          cri           {'cri': 17301}              {'cri': 22856}
          cub           {'cub': 51397}              {'cub': 115951}
          ecu           {'ecu': 20}                 {'ecu': 45}
          esp           {'esp': 79803}              {'esp': 170066}
          mex           {'mex': 97770}              {'mex': 150356}
          per           {'per': 20382}              {'per': 35037}
          prt           {'prt': 30231}              {'prt': 53264}
          pry           {}                          {}
          psi           {'psi': 36800}              {'psi': 53616}
          rve           {'rve': 58789}              {'rve': 83395}
          rvt           {'rvt': 345}                {'rvt': 351}
          spa           {'spa': 90207}              {'spa': 145604}
          sss           {'sss': 963}                {'sss': 1584}
          sza           {'sza': 45553}              {'sza': 63613}
          ury           {'ury': 8189}               {'ury': 14279}
          ven           {'ven': 31500}              {'ven': 56059}

The only collection identifier different from the report filename suffix is scl for the tabs_bra.zip (Brazil), named differently due to its history of being the first collection. Getting the values from the collection-specific reports:

In [18]: doc_coll = doc_coll_dict_sized.rename({"bra": "scl"}).drop("pry") \
             .T.apply(lambda row: row.apply(lambda cell: cell[row.name])).T \
             .sort_index()
         doc_coll

Out [18]:             documents_affiliations.csv  documents_authors.csv
          collection
          arg                                64825                 114048
          bol                                 6491                  10069
          chl                               112992                 197798
          col                               132502                 170355
          cri                                17301                  22856
          cub                                51397                 115951
          ecu                                   20                     45
          esp                                79803                 170066
          mex                                97770                 150356
          per                                20382                  35037
          prt                                30231                  53264
          psi                                36800                  53616
          rve                                58789                  83395
          rvt                                  345                    351
          scl                               804928                1413752
          spa                                90207                 145604
          sss                                  963                   1584
          sza                                45553                  63613
          ury                                 8189                  14279
          ven                                31500                  56059

And, as expected, that’s the same in the network reports: In [19]: (doc_coll == network_coll).all()

Out [19]: documents_affiliations.csv    True
          documents_authors.csv         True
          dtype: bool

5 Cleaning / Normalizing the ISSN

This is the analysis of the full SciELO’s network journals.csv report/spreadsheet/dataset as it was on its 2018-09-14 release; future versions will hopefully have pre-normalized ISSN fields. Some journals might have more than one ISSN, since every medium (electronic/print/CD/etc.) has at least its own ISSN. However, in different collections, the ISSN might be different. We should find a way to normalize them, in order to know when two entries refer to the same journal.

In [1]: import pandas as pd
        pd.options.display.max_colwidth = 400

In [2]: journals = pd.read_csv("tabs_network/journals.csv")

There are two columns regarding ISSN:

In [3]: [col for col in journals.columns if "ISSN" in col.upper()]

Out [3]: ['ISSN SciELO', "ISSN's"]

The first, ISSN SciELO, holds a single ISSN selected to act as something akin to a primary key, whereas ISSN's has a list of other ISSNs regarding the same journal content, written as a single string where the ISSNs are separated by a ; (semicolon) symbol.
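As an aside, a minimal sketch (toy values, not from the original notebook) of how such a semicolon-separated field can be turned into per-row lists with pandas:

import pandas as pd

issns = pd.Series(["0103-6564;1678-5177", "2077-2161", None])
print(issns.fillna("").str.split(";").tolist())
# [['0103-6564', '1678-5177'], ['2077-2161'], ['']]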

5.1 Detecting grossly invalid ISSNs

The format of an ISSN is NNNN-NNNC, where N is a digit (from 0 to 9) and C is a check “digit” (from 0 to 9 or X). Is there any ISSN in the ISSN SciELO column that doesn’t conform to that?

In [4]: single_issn_regex = r"^\d{4}-\d{3}[\dX]$"
        journals[["ISSN SciELO"]][~journals["ISSN SciELO"]
                                      .str.contains(single_issn_regex)]

Out [4]:        ISSN SciELO
          1416    0719-448x

It’s not invalid, but we should always use the same letter case in order to work with the ISSN as a matching index or primary key. A proper normalization would use something like journals["ISSN SciELO"].str.upper(). How about the ISSN's column?

In [5]: multi_issn_regex = r"^(?:\d{4}-\d{3}[\dX])(?:;\d{4}-\d{3}[\dX])*$"
        journals[["ISSN's"]][~journals["ISSN's"].fillna("")
                                 .str.contains(multi_issn_regex)]

Out [5]:        ISSN’s
          98    NaN
          99    NaN
          502   24516600
          665   ISSN;0252-8584
          1416  0719-448x;0718-0446
          1707  20030507;1315-6411


Besides the x case issue and the empty ISSN's fields, the date-like 20030507 and the ISSN text are invalid ISSN values, the latter being the only grossly invalid entry found. The date-like one was caught here because of the lack of -, but it’s invalid due to its last digit, which should have been 9 in order to get a valid ISSN value, as discussed in the next section. We can clean these issues by filling the NaN with the ISSN SciELO value from the same row, by taking the uppercase to get rid of the single small x, and by using a mapping to remove the undesired values. Before normalizing it all, let’s check whether there’s any other invalid check digit.
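A minimal sketch (toy rows, not from the original notebook) of the first two cleaning steps just described; the mapping-based removal is performed later on, in the In [23] and In [25] cells:

import pandas as pd

df = pd.DataFrame({"ISSN SciELO": ["0719-448x", "1517-3151"],
                   "ISSN's": ["0719-448x;0718-0446", None]})
# Fill the empty field from the same row, then normalize the letter case
cleaned = df["ISSN's"].fillna(df["ISSN SciELO"]).str.upper()
print(cleaned.tolist())  # ['0719-448X;0718-0446', '1517-3151']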

5.2 ISSN check digit

5.2.1 Equation

The check digit is computed modulo 11, and the equation to get it from the first 7 ISSN digits is (where X means this equation yields 10):

$$S = \mathrm{ISSN}_7 \cdot [8, 7, 6, 5, 4, 3, 2]$$

$$\text{check digit} = 11 \left\lceil \frac{S}{11} \right\rceil - S$$

The check digit can be obtained from the remainder of the $S/11$ division: if it’s zero, the check digit is zero, else the check digit is $11 - \text{remainder}$. Proof:

$$S = 11 \times \text{integer quotient} + \text{remainder}$$

$$S = 11 \left\lfloor \frac{S}{11} \right\rfloor + \text{remainder}$$

$$S = 11 \left\lceil \frac{S}{11} \right\rceil - \text{check digit}$$

$$\therefore \text{check digit} = 11 \left( \left\lceil \frac{S}{11} \right\rceil - \left\lfloor \frac{S}{11} \right\rfloor \right) - \text{remainder}$$
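A quick numeric check of these identities (toy value, not from the original notebook):

S = 73
remainder = S % 11                    # 7
check = 0 if remainder == 0 else 11 - remainder
ceil_term = -(-S // 11)               # ceil(S / 11) == 7
assert check == 11 * ceil_term - S == 4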

5.2.2 Example

For example, 0103-6564 (regarding the Psicologia USP journal) is a valid ISSN, since the dot product S between its first 7 digits and [8, 7, 6, 5, 4, 3, 2] is:

ISSN digits:  0  1  0  3  6  5  6  (check digit 4)
Weights:      8  7  6  5  4  3  2

$$S = \sum \{0, 7, 0, 15, 24, 15, 12\} = 73$$

The remainder is 7, and $11 - 7 = 4$ is the check digit:

$$73 = 11 \cdot 6 + 7 = 11 \cdot 7 - 4$$

where 7 is the remainder and 4 is the check digit.

5.2.3 ISSN digit checker function

In [6]: def issn_digit(issn7):
            issn7_int = map(int, issn7)
            dp_pairs = zip(issn7_int, [8, 7, 6, 5, 4, 3, 2])
            dot_product = sum(a * b for a, b in dp_pairs)
            rem_compl = (-dot_product) % 11
            return "X" if rem_compl == 10 else str(rem_compl)
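A note on the (-dot_product) % 11 expression (an observation, not from the original notebook): the check digit d is precisely the value that makes S + d ≡ 0 (mod 11), so it can be computed directly as (-S) % 11, with no explicit ceiling needed:

print((-73) % 11)  # 4, matching the worked example above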


In [7]: def check_issn_digit(issn):
            issn_clean = issn.replace("-", "").strip().upper()
            return len(issn_clean) == 8 \
                and issn_clean[-1] == issn_digit(issn_clean[:7])

In [8]: def issn_full2digit(issn):
            return issn_digit(issn.replace("-", "").strip()[:7])

In [9]: issn_digit("0103656") # The "ISSN7" input shouldn't include the "-"

Out [9]: '4'

In [10]: check_issn_digit("0103-6564") # But here "-" is optional

Out [10]: True

In [11]: issn_full2digit("2003-0507") # And here, for convenience!

Out [11]: '9'

In [12]: check_issn_digit("20030507") # That's the invalid ISSN previously obtained

Out [12]: False

In [13]: issn_digit("2003050") # Its digit should had been 9 (as we've already seen)

Out [13]: '9'

In [14]: check_issn_digit("24516600") # The other ISSN without "-" seen previously

Out [14]: True

5.2.4 Validating the ISSN digits in the tabs_network/journals.csv dataset

The ISSNs with invalid digits from the ISSN SciELO column are:

In [15]: icd_issn_scielo = journals[~journals["ISSN SciELO"].apply(check_issn_digit)]
         icd_issn_scielo[["title at SciELO", "ISSN's", "ISSN SciELO"]] \
             .assign(digit=icd_issn_scielo["ISSN SciELO"].apply(issn_full2digit))

Out [15]:       title at SciELO                                     ISSN’s               ISSN SciELO  digit
          509   Ajayu Órgano de Difusión Científica del Depart...  2077-2161            2077-2161    5
          520   Acta Nova                                          1683-0789            1683-0789    4
          961   Acta Médica Costarricense                          0001-6012;0001-6002  0001-6002    4
          1293  Revista Diacrítica                                 0807-8967            0807-8967    3
          1705  Utopìa y Praxis Latinoamericana                    1315-5216            1315-5216    0

The only one we can easily fix is the 0001-6002, since its alternative in the ISSN's list is valid and quite explicit in the Acta Médica Costarricense web site[1], besides being the only one there.

In [16]: issn_full2digit("2077-2161")

Out [16]: '5'

[1]http://www.actamedica.medicos.cr


In [17]: check_issn_digit("0001-6012")

Out [17]: True

Fixing the remaining ones might be way more difficult than it seems. Ajayu’s web site[2] gives us that very same ISSN: 2077-2161. It seems that either the digit checking algorithm isn’t taken into account for every assigned/granted ISSN, or there’s some specific historical issue, like an assignment happening before that calculation was standardized, or some human mistake when performing the assignment. Or it’s simply a mistake in the journal home page that had been copied to the database. Whatever the reason, we should keep some inconsistent data as is for the time being, at least until someone fixes or confirms that information. A similar analysis of the entries from the ISSN's column:

In [18]: journals[["title at SciELO", "ISSN's", "ISSN SciELO"]] \
             [journals["ISSN's"].fillna("").str.split(";")
                  .apply(lambda issns: not all(check_issn_digit(issn)
                                               for issn in issns))]

Out [18]:       title at SciELO                                     ISSN’s               ISSN SciELO
          98    Revista Brasileira de Engenharia Biomédica         NaN                  1517-3151
          99    Revista Brasileira de Coloproctologia              NaN                  0101-9880
          402   SaberEs                                            1852-4418;1852-4222  1852-4222
          488   Salud(i)ciencia                                    1667-8682;1667-8990  1667-8990
          509   Ajayu Órgano de Difusión Científica del Depart...  2077-2161            2077-2161
          520   Acta Nova                                          1683-0789            1683-0789
          665   Economía y Desarrollo                              ISSN;0252-8584       0252-8584
          961   Acta Médica Costarricense                          0001-6012;0001-6002  0001-6002
          962   Actualidades en Psicología                         0858-6444;2215-3535  2215-3535
          1293  Revista Diacrítica                                 0807-8967            0807-8967
          1391  A Peste : Revista de Psicanálise e Sociedade       1775-1851;2175-6104  2175-6104
          1446  Liberabit                                          1729-4827;2233-7666  1729-4827
          1705  Utopìa y Praxis Latinoamericana                    1315-5216            1315-5216
          1707  Revista Venezolana de Economía y Ciencias Soci...  20030507;1315-6411   1315-6411

Some ISSNs there are valid:

In [19]: all(check_issn_digit(issn)
             for issn in ["0252-8584", "1315-6411", "1667-8990", "1729-4827",
                          "1852-4222", "2175-6104", "2215-3535"])

Out [19]: True

5.2.5 Finding the correct ISSN for these few journals

From SaberEs’s web page[3], we find 1852-4418 should have been 1852-4184. Likewise, from Liberabit’s web page[4], we find 2233-7666 has a typo: it’s 2223-7666. A similar typo is 0858-6444, which should have been 0258-6444, as written in the Actualidades en Psicología web page[5]. The 1667-8682 should have been 1667-8982, as this PDF of a Salud(i)ciencia article[6] suggests and its SJR entry[7] seems to confirm. Utopia y Praxis Latinoamericana appears on SJR[8] with two ISSNs:

[2]http://www.ucb.edu.bo/publicaciones/ajayu [3]http://saberes.fcecon.unr.edu.ar/index.php/revista [4]http://revistaliberabit.com [5]https://revistas.ucr.ac.cr/index.php/actualidades [6]https://www.ris.uu.nl/ws/files/41145926/sic_176_1.pdf [7]https://www.scimagojr.com/journalsearch.php?q=4100151617&tip=sid [8]https://www.scimagojr.com/journalsearch.php?q=5700164382&tip=sid


1316-5216 and 2477-9555. Acta Nova[9]’s printed version ISSN is 1683-0768, not 1683-0789. Revista Diacrítica[10] on 26/2-2012[11] wrote 0807-8967 as its ISSN, but that seems like a typo, as in its page the ISSN is explicitly written as 0870-8967 (printed version); 2183-9174 (electronic version). There’s no information in A Peste’s web page[12] regarding a printed version ISSN, but that 1775-1851 appeared in the description of the cover image: The Fifth Plague of Egypt by Joseph Mallord William Turner (1775-1851); his Wikipedia page[13] states that’s the year range of his life, it’s not an ISSN. Revista Uruguaya de Medicina Interna[14] on No.3/Nov2017[15] tells us the ISSN is 2393-6797, not 2993-6797 as it used to be in the 2018-06-10 reports version, but it had already been corrected in the 2018-09-14 version. All these new ISSNs found have a valid check digit:

In [20]: all(check_issn_digit(issn)
             for issn in ["0258-6444", "0870-8967", "1316-5216", "1667-8982",
                          "1683-0768", "1852-4184", "2183-9174", "2223-7666",
                          "2477-9555"])

Out [20]: True

In [21]: journals[["title at SciELO", "ISSN's", "ISSN SciELO"]][
             journals["ISSN's"].str.contains("2393-6797")
             | (journals["ISSN SciELO"] == "2393-6797")
         ].drop_duplicates()

Out [21]:       title at SciELO                       ISSN’s               ISSN SciELO
          1672  Revista Uruguaya de Medicina Interna  2393-6797;2393-6797  2393-6797

From the remaining entries, the only invalid ISSN we couldn’t fix was the one belonging to Ajayu. There’s no evidence that its ISSN could be different besides the inconsistency regarding the check digit, apart from a single article[16] that had written 2011-2161 as the ISSN; but that alternative would still need to have 5 as its check digit (i.e., it’s also invalid), and that’s not a trusted source of information.

In [22]: issn_full2digit("2011-2161")

Out [22]: '5'

A summary of what should be done regarding these selected ISSNs:

In [23]: issns_fix = {  # To replace all entries in ISSN SciELO and ISSN's
             "0001-6002": "0001-6012",  # Acta Médica Costarricense
             "0858-6444": "0258-6444",  # Actualidades en Psicología
             "1667-8682": "1667-8982",  # Salud(i)ciencia
             "1852-4418": "1852-4184",  # SaberEs
             "2233-7666": "2223-7666",  # Liberabit
             "0807-8967": "0870-8967",  # Revista Diacrítica
             "2993-6797": "2393-6797",  # Revista Uruguaya de Medicina Interna
             "1315-5216": "1316-5216",  # Utopia y Praxis Latinoamericana
             "1683-0789": "1683-0768",  # Acta Nova
             "24516600": "2451-6600",
             "0719-448x": "0719-448X",
         }
         extra_issns = {  # To add as alternative ISSN's
             "0870-8967": "2183-9174",  # Revista Diacrítica
             "1316-5216": "2477-9555",  # Utopia y Praxis Latinoamericana
         }
         invalid_issns = [  # To remove from ISSN's
             "ISSN",        # Economía y Desarrollo
             "20030507",    # Revista Venezolana de Economía y Ciencias Sociales
             "1775-1851",   # A Peste : Revista de Psicanálise e Sociedade
         ]

[9]https://www.ucbcba.edu.bo/universidad/publicaciones/revistas-2/acta-nova
[10]http://diacritica.ilch.uminho.pt
[11]http://ceh.ilch.uminho.pt/publicacoes/Diacritica_26-2.pdf
[12]http://revistas.pucsp.br/apeste
[13]https://pt.wikipedia.org/wiki/William_Turner
[14]http://www.medicinainterna.org.uy/revista-medicina-interna
[15]http://www.medicinainterna.org.uy/wp-content/uploads/2016/06/RumiNo3_Nov_2017Ch.pdf
[16]https://www.scribd.com/document/152839301/Ruptura-Amorosa-y-Terapia-Narrativa

And the ISSN's should always include the ISSN SciELO value. Let’s do that! In [24]: issn_scielo = journals["ISSN SciELO"].str.upper().replace(issns_fix) issn_scielo.tail() # `ISSN SciELO` solving every issue found so far

Out [24]: 1727    1012-2508
          1728    0254-0770
          1729    1316-0087
          1730    1317-5815
          1731    0367-4762
          Name: ISSN SciELO, dtype: object

In [25]: digitfix_issns = {k: {v, extra_issns[v]} if v in extra_issns else {v}
                           for k, v in issns_fix.items()}
         issns_set = journals["ISSN's"] \
             .fillna(issn_scielo) \
             .str.upper() \
             .str.split(";") \
             .apply(lambda items: set.union(*[digitfix_issns.get(item, {item})
                                              for item in items
                                              if item not in invalid_issns]))
         issns_set.tail()  # `ISSN's` as a set, solving every issue found so far

Out [25]: 1727    {2443-468X, 1012-2508}
          1728    {0254-0770}
          1729    {1316-0087}
          1730    {1317-5815}
          1731    {0367-4762}
          Name: ISSN's, dtype: object
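The digitfix_issns.get(item, {item}) pattern above maps each ISSN to itself unless it has a digit fix (possibly carrying an extra alternative ISSN). A minimal sketch with a hypothetical two-item row (not from the original notebook):

digitfix = {"0807-8967": {"0870-8967", "2183-9174"}}  # subset of digitfix_issns
items = ["0807-8967", "0101-9880"]
print(set.union(*[digitfix.get(i, {i}) for i in items]))
# {'0870-8967', '2183-9174', '0101-9880'} (set order may vary)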

5.3 Mixed ISSN in the ISSN SciELO field

The ISSN SciELO should have a primary ISSN, in the primary key sense from databases, somewhat arbitrary but still required in order to avoid errors in analysis. Crossing the data with other tables should ideally not require any other ISSN, and that’s the main goal: keep everything simple after this normalization. There is at most one mixed ISSN for every ISSN list (that is, there’s a single ISSN in the ISSN's field different from the ISSN SciELO of the same row that appears in the ISSN SciELO field of another row):

In [26]: other_mixed_issns = (issns_set - issn_scielo.apply(lambda issn: {issn})) \
             .apply(lambda issn_set: {issn for issn in issn_set
                                      if issn in issn_scielo.values})
         how_many_mixed_issns = other_mixed_issns.apply(len)
         how_many_mixed_issns.max()


Out [26]: 1

If that number were greater than 1, the technique below wouldn’t work. Actually, our goal is just to find a mapping that fixes the mixed ISSNs, i.e., for a set of ISSN values belonging to a single journal, the ISSN SciELO should always have the same ISSN in every entry of that journal. Below is the mapping between what appears in both the ISSN's and ISSN SciELO columns and a distinct value that appears in the ISSN SciELO.

In [27]: has_mixed_issn = how_many_mixed_issns > 0
         mixed_issn_df = pd.DataFrame([
             other_mixed_issns[has_mixed_issn]
                 .apply(lambda x: set(x).pop())
                 .rename("mixed_issn"),
             issn_scielo[has_mixed_issn],
         ]).T
         mixed_issn_df

Out [27]:       mixed_issn  ISSN SciELO
          60    1980-5438   0103-5665
          79    1518-3319   2237-101X
          263   1678-5177   0103-6564
          515   2077-3323   1817-7433
          962   0258-6444   2215-3535
          1443  1668-7027   0325-8203
          1461  1980-5438   0103-5665
          1492  2175-3598   0104-1282
          1656  0797-9789   1688-499X
          1661  1688-4094   1688-4221

That small table above is exhaustive. We can select any of the columns to be the normalized ISSN, taking care of duplicated entries. The rows with the issues above are:

In [28]: journals[["collection", "title at SciELO", "title thematic areas",
                   "publisher name"]] \
             .assign(issn_scielo=issn_scielo, issns=issns_set) \
             [issn_scielo.isin(mixed_issn_df.values.ravel())]

Out [28]:       collection  title at SciELO                              title thematic areas                                publisher name                                      issn_scielo  issns
          60    scl         Psicologia Clínica                           Human Sciences                                      Departamento de Psicologia da Pontifícia Unive...  0103-5665    {0103-5665, 1980-5438}
          79    scl         Topoi (Rio de Janeiro)                       Human Sciences                                      Programa de Pós-Graduação em História Social d...  2237-101X    {2237-101X, 1518-3319}
          263   scl         Psicologia USP                               Human Sciences                                      Instituto de Psicologia da Universidade de São...  0103-6564    {0103-6564, 1678-5177}
          376   arg         Interdisciplinaria                           Human Sciences                                      Centro Interamericano de Investigaciones Psico...  1668-7027    {1668-7027}
          513   bol         Revista Ciencia y Cultura                    Applied Social Sciences;Human Sciences;Linguis...   Universidad Católica Boliviana                      2077-3323    {2077-3323}
          515   bol         Revista Científica Ciencia Médica            Health Sciences                                     Facultad de Medicina, Universidad Mayor de San...  1817-7433    {2077-3323}
          962   cri         Actualidades en Psicología                   Applied Social Sciences;Health Sciences             Instituto de Investigaciones Psicológicas, Uni...  2215-3535    {0258-6444, 2215-3535}
          1373  psi         Psicologia USP                               Human Sciences                                      Instituto de Psicologia da Universidade de São...  1678-5177    {1678-5177}
          1427  psi         Ciencias Psicológicas                        Human Sciences                                      Facultad de Psicología de la Universidad Catól...  1688-4094    {1688-4094}
          1428  psi         Actualidades en psicología                   Applied Social Sciences                             Universidad de Costa Rica. Facultad de Ciencia...  0258-6444    {0258-6444}
          1442  psi         Psicologia clínica (Rio de Janeiro. Online)  Applied Social Sciences                             Pontifícia Universidade Católica do Rio de Jan...  1980-5438    {1980-5438}
          1443  psi         Interdisciplinaria                           Human Sciences                                      Centro Interamericano de Investigaciones Psico...  0325-8203    {0325-8203, 1668-7027}
          1455  psi         Journal of Human Growth and Development      Applied Social Sciences                             Centro de Estudos do Crescimento e do Desenvol...  2175-3598    {2175-3598}
          1461  psi         Psicologia Clínica                           Applied Social Sciences                             Departamento de Psicologia da Pontifícia Unive...  0103-5665    {0103-5665, 1980-5438}
          1492  psi         Journal of Human Growth and Development      Applied Social Sciences                             Centro de Estudos de Crescimento e Desenvolvim...  0104-1282    {2175-3598, 0104-1282}
          1564  sss         Revista Uruguaya de Ciencia Política         Applied Social Sciences                             Instituto de Ciência Política                       0797-9789    {0797-9789}
          1570  sss         Topoi: Revista de História                   Applied Social Sciences                             Universidade Federal do Rio de Janeiro              1518-3319    {1518-3319}
          1656  ury         Revista Uruguaya de Ciencia Política         Applied Social Sciences;Human Sciences              Universidad de la República. Facultad de Cienc...  1688-499X    {1688-499X, 0797-9789}
          1661  ury         Ciencias Psicológicas                        Applied Social Sciences;Human Sciences              Universidad Católica del Uruguay. Facultad de ...   1688-4221    {1688-4221, 1688-4094}

The 1817-7433 entry in the bol collection has an incorrect secondary 2077-3323 ISSN (the entries are from distinct thematic areas), which won’t give us any trouble as long as we don’t use the ISSN's column afterwards; but for this normalization, our goal is to fix that as well. The resulting mapping is:


In [29]: issns_select = {
             "1980-5438": "0103-5665",  # psi -> scl/psi
             "2237-101X": "1518-3319",  # sss -> scl
             "1678-5177": "0103-6564",  # psi -> scl
             "0325-8203": "1668-7027",  # psi -> arg
             "2175-3598": "0104-1282",  # psi -> psi
             "0797-9789": "1688-499X",  # sss -> ury
             "1688-4094": "1688-4221",  # psi -> ury
             "0258-6444": "2215-3535",  # psi -> cri
         }

Full normalization of the ISSN SciELO in a single step can be achieved with:

In [30]: issn_scielo_n = journals["ISSN SciELO"].replace({**issns_fix,
                                                          **issns_select})

5.4 Distinct sets in ISSN's

With the ISSN SciELO column normalized, two rows with the same ISSN should have the same ISSN's. Is that what we’ve found?

In [31]: distinct_frozen_issns = \
             pd.DataFrame([issn_scielo_n, issns_set.apply(frozenset)]).T \
                 .groupby("ISSN SciELO") \
                 .apply(lambda df: df["ISSN's"].unique())
         distinct_frozen_issns[distinct_frozen_issns.apply(len) > 1]

Out [31]: ISSN SciELO
          0011-5258    [(1678-4588, 0011-5258), (0011-5258)]
          0100-512X    [(0100-512X, 1981-5336), (0100-512X)]
          0100-8587    [(1984-0438, 0100-8587), (0100-8587)]
          0101-3300    [(1980-5403, 0101-3300), (0101-3300)]
          0102-6909    [(0102-6909, 1806-9053), (0102-6909)]
          0102-7182    [(1807-0310, 0102-7182), (0102-7182)]
          0102-7972    [(1678-7153, 0102-7972), (0102-7972)]
          0103-166X    [(0103-166X, 1982-0275), (0103-166X)]
          0103-2070    [(1809-4554, 0103-2070), (0103-2070)]
          0103-5665    [(0103-5665, 1980-5438), (1980-5438)]
          0103-6564    [(0103-6564, 1678-5177), (1678-5177)]
          0103-863X    [(0103-863X, 1982-4327), (0103-863X)]
          0104-026X    [(0104-026X, 1806-9584), (0104-026X)]
          0104-1169    [(1518-8345), (1518-8345, 0104-1169)]
          0104-1282    [(2175-3598), (2175-3598, 0104-1282)]
          0104-4478    [(0104-4478, 1678-9873), (0104-4478)]
          0104-7183    [(1806-9983, 0104-7183), (0104-7183)]
          0104-8333    [(1809-4449, 0104-8333), (0104-8333)]
          0104-9313    [(0104-9313, 1678-4944), (0104-9313)]
          0123-417X    [(0123-417X, 2011-7485), (0123-417X)]
          1413-294X    [(1678-4669), (1413-294X, 1678-4669)]
          1413-8271    [(2175-3563), (1413-8271)]
          1413-8557    [(2175-3539), (1413-8557)]
          1414-3283    [(1807-5762, 1414-3283), (1414-3283)]
          1414-9893    [(1982-3703, 1414-9893), (1414-9893)]
          1415-4714    [(1984-0381, 1415-4714), (1415-4714)]
          1415-790X    [(1980-5497, 1415-790X), (1415-790X)]
          1516-1498    [(1809-4414, 1516-1498), (1516-1498)]
          1517-4522    [(1517-4522, 1807-0337), (1517-4522)]
          1518-3319    [(2237-101X, 1518-3319), (1518-3319)]
          1668-7027    [(1668-7027), (0325-8203, 1668-7027)]
          1688-4221    [(1688-4094), (1688-4221, 1688-4094)]
          1688-499X    [(0797-9789), (1688-499X, 0797-9789)]
          1726-4634    [(1726-4634), (1726-4642, 1726-4634)]
          1729-4827    [(1729-4827), (2223-7666, 1729-4827)]
          1806-6445    [(1983-3342, 1806-6445), (1806-6445)]
          1983-3288    [(1983-3288, 1984-3054), (1983-3288)]
          2215-3535    [(0258-6444, 2215-3535), (0258-6444)]
          2216-0973    [(2216-0973), (2216-0973, 2346-3414)]
          dtype: object

No, it’s not. We’ve found more than one set, and some sets still don’t include the ISSN SciELO value. Perhaps the easiest way to fix this is by creating a mapping from an ISSN to the set union of these frozensets, and then re-creating the ISSN's column.

In [32]: issns_mapping = \
             pd.DataFrame([issn_scielo_n,
                           journals["ISSN's"].fillna(issn_scielo_n)]).T \
                 .groupby("ISSN SciELO") \
                 .apply(lambda df: ";".join(df.values.ravel())) \
                 .str.split(";") \
                 .apply(lambda items: set.union(*[digitfix_issns.get(item, {item})
                                                  for item in items
                                                  if item not in invalid_issns]))
         # It's an exception to the rule seen before, from
         # Revista Científica Ciencia Médica (bol)
         issns_mapping.loc["1817-7433"] -= {"2077-3323"}

In [33]: issns_set_n = issn_scielo_n.map(issns_mapping).rename("ISSN's")
         issns_set_n.tail()

Out [33]: 1727    {2443-468X, 1012-2508}
          1728    {0254-0770}
          1729    {1316-0087}
          1730    {1317-5815}
          1731    {0367-4762}
          Name: ISSN's, dtype: object

Applying the same check as before:

In [34]: distinct_frozen_issns = \
             pd.DataFrame([issn_scielo_n, issns_set_n.apply(frozenset)]).T \
                 .groupby("ISSN SciELO") \
                 .apply(lambda df: df["ISSN's"].unique())
         distinct_frozen_issns[distinct_frozen_issns.apply(len) > 1]

Out [34]: Series([], dtype: object)

Normalized! =)

5.5 Summary

5 Cleaning / Normalizing the ISSN — Page 10 / 12 5.5 Summary

In [35]: from pprint import pprint

5.5.1 Only normalizing the ISSN SciELO

We can apply all the normalization from the issns_fix and issns_select dictionaries by updating the dataframe with:

journals["ISSN SciELO"].replace(issn_scielo_fix, inplace=True)

Where issn_scielo_fix should be the joined dictionary, as follows:

In [36]: pprint({**issns_fix, **issns_select})

{'0001-6002': '0001-6012',
 '0258-6444': '2215-3535',
 '0325-8203': '1668-7027',
 '0719-448x': '0719-448X',
 '0797-9789': '1688-499X',
 '0807-8967': '0870-8967',
 '0858-6444': '0258-6444',
 '1315-5216': '1316-5216',
 '1667-8682': '1667-8982',
 '1678-5177': '0103-6564',
 '1683-0789': '1683-0768',
 '1688-4094': '1688-4221',
 '1852-4418': '1852-4184',
 '1980-5438': '0103-5665',
 '2175-3598': '0104-1282',
 '2233-7666': '2223-7666',
 '2237-101X': '1518-3319',
 '24516600': '2451-6600',
 '2993-6797': '2393-6797'}

5.5.2 Normalizing the ISSN’s

It’s not that simple, and it won’t work the same way in a collection-specific report. If you’re working on a single collection but you need the ISSN's column to include any secondary ISSN that might be available just on an entry from another collection, you should perform this normalization in the network report and filter the desired collection afterwards. Given this dictionary of sets in the digitfix_issns variable:

In [37]: pprint(digitfix_issns)

{'0001-6002': {'0001-6012'},
 '0719-448x': {'0719-448X'},
 '0807-8967': {'0870-8967', '2183-9174'},
 '0858-6444': {'0258-6444'},
 '1315-5216': {'2477-9555', '1316-5216'},
 '1667-8682': {'1667-8982'},
 '1683-0789': {'1683-0768'},
 '1852-4418': {'1852-4184'},
 '2233-7666': {'2223-7666'},
 '24516600': {'2451-6600'},
 '2993-6797': {'2393-6797'}}


You can get the sets in an issns_set_n variable by copying and pasting this not-so-simple snippet (from a previous cell in this notebook):

issn_scielo_n = journals["ISSN SciELO"].replace(issn_scielo_fix)
invalid_issns = ["ISSN", "20030507", "1775-1851"]
issns_mapping = \
    pd.DataFrame([issn_scielo_n,
                  journals["ISSN's"].fillna(issn_scielo_n)]).T \
      .groupby("ISSN SciELO").apply(lambda df: ";".join(df.values.ravel())) \
      .str.split(";") \
      .apply(lambda items: set.union(*[digitfix_issns.get(item, {item})
                                       for item in items
                                       if item not in invalid_issns]))
issns_mapping.loc["1817-7433"] -= {"2077-3323"}
issns_set_n = issn_scielo_n.map(issns_mapping).rename("ISSN's")

There, issn_scielo_n is the normalized ISSN SciELO column, and issns_set_n is a normalized ISSN's column where the entries are set objects instead of ;-separated strings. To put the ISSN's back in place, sorted and ;-separated, you just need to:

journals["ISSN's"] = issns_set_n.apply(lambda s: ";".join(sorted(s)))

5.5.3 Beyond normalization

The goal of this normalization is to analyze the data from journals.csv. For some contexts, you can keep the old values of your data, e.g. by adding new columns instead of replacing the raw ones:

journals["issn"] = issn_scielo_n
journals["issns"] = issns_set_n
journals["issns_str"] = issns_set_n.apply(lambda s: ";".join(sorted(s)))

Or:

# Usually this syntax is more helpful when "assign" is used
# as an expression, not as part of an assignment statement
journals = journals.assign(
    issn=issn_scielo_n,
    issns=issns_set_n,
    issns_str=issns_set_n.apply(lambda s: ";".join(sorted(s))),
)

The reason for keeping the raw data is that some external reference or some user input might be looking for an invalid/inconsistent entry that no longer exists because of this normalization.

6 Number of indexed/deindexed/active journals in the SciELO network

This analysis shows the annual evolution of the number of indexed, deindexed and active journals of each collection, and of the whole network.

In [1]: import matplotlib.pyplot as plt
        import numpy as np
        import pandas as pd
        import seaborn as sns

In [2]: %matplotlib inline

6.1 Loading the dataset

We’re going to use the network journals.csv for that.

In [3]: journals = pd.read_csv("tabs_network/journals.csv")

The following ISSN cleaning/normalization step is fully documented in a normalization-specific notebook that can be found together with this one in the same repository.

In [4]: issn_scielo_fix = {
            "0001-6002": "0001-6012", "0258-6444": "2215-3535",
            "0325-8203": "1668-7027", "0719-448x": "0719-448X",
            "0797-9789": "1688-499X", "0807-8967": "0870-8967",
            "0858-6444": "0258-6444", "1315-5216": "1316-5216",
            "1667-8682": "1667-8982", "1678-5177": "0103-6564",
            "1683-0789": "1683-0768", "1688-4094": "1688-4221",
            "1852-4418": "1852-4184", "1980-5438": "0103-5665",
            "2175-3598": "0104-1282", "2233-7666": "2223-7666",
            "2237-101X": "1518-3319", "24516600": "2451-6600",
            "2993-6797": "2393-6797",
        }
        journals["ISSN SciELO"].replace(issn_scielo_fix, inplace=True)

We don’t need all the columns from the journals.csv.

In [5]: journals.columns

Out [5]: Index(['extraction date', 'study unit', 'collection', 'ISSN SciELO',
                "ISSN's", 'title at SciELO', 'title thematic areas',
                'title is agricultural sciences', 'title is applied social sciences',
                'title is biological sciences', 'title is engineering',
                'title is exact and earth sciences', 'title is health sciences',
                'title is human sciences', 'title is linguistics, letters and arts',
                'title is multidisciplinary', 'title current status',
                'title + subtitle SciELO', 'short title SciELO', 'short title ISO',
                'title PubMed', 'publisher name', 'use license', 'alpha frequency',
                'numeric frequency (in months)', 'inclusion year at SciELO',
                'stopping year at SciELO', 'stopping reason',
                'date of the first document', 'volume of the first document',
                'issue of the first document', 'date of the last document',
                'volume of the last document', 'issue of the last document',
                'total of issues', 'issues at 2018', 'issues at 2017',
                'issues at 2016', 'issues at 2015', 'issues at 2014',
                'issues at 2013', 'total of regular issues',
                'regular issues at 2018', 'regular issues at 2017',
                'regular issues at 2016', 'regular issues at 2015',
                'regular issues at 2014', 'regular issues at 2013',
                'total of documents', 'documents at 2018', 'documents at 2017',
                'documents at 2016', 'documents at 2015', 'documents at 2014',
                'documents at 2013', 'citable documents',
                'citable documents at 2018', 'citable documents at 2017',
                'citable documents at 2016', 'citable documents at 2015',
                'citable documents at 2014', 'citable documents at 2013',
                'portuguese documents at 2018 ', 'portuguese documents at 2017 ',
                'portuguese documents at 2016 ', 'portuguese documents at 2015 ',
                'portuguese documents at 2014 ', 'portuguese documents at 2013 ',
                'spanish documents at 2018 ', 'spanish documents at 2017 ',
                'spanish documents at 2016 ', 'spanish documents at 2015 ',
                'spanish documents at 2014 ', 'spanish documents at 2013 ',
                'english documents at 2018 ', 'english documents at 2017 ',
                'english documents at 2016 ', 'english documents at 2015 ',
                'english documents at 2014 ', 'english documents at 2013 ',
                'other language documents at 2018 ', 'other language documents at 2017 ',
                'other language documents at 2016 ', 'other language documents at 2015 ',
                'other language documents at 2014 ', 'other language documents at 2013 ',
                'google scholar h5 2018 ', 'google scholar h5 2017 ',
                'google scholar h5 2016 ', 'google scholar h5 2015 ',
                'google scholar h5 2014 ', 'google scholar h5 2013 ',
                'google scholar m5 2018 ', 'google scholar m5 2017 ',
                'google scholar m5 2016 ', 'google scholar m5 2015 ',
                'google scholar m5 2014 ', 'google scholar m5 2013 '],
               dtype='object')

These are the columns we need:

In [6]: columns = ["collection", "ISSN SciELO",
                   "inclusion year at SciELO", "stopping year at SciELO"]
        journals[columns].head()

Out [6]:    collection  ISSN SciELO  inclusion year at SciELO  stopping year at SciELO
         0  scl         1676-5648                        2006                   2010.0
         1  scl         0101-8108                        2004                   2012.0
         2  scl         0034-7701                        2000                   2008.0
         3  scl         0102-261X                        1999                   2012.0
         4  scl         1516-9332                        2005                   2009.0

In [7]: journals[columns].shape


Out [7]: (1732, 4)

6.2 Collections

Are there any inactive collections in the analytics?

In [8]: url = "http://articlemeta.scielo.org/api/v1/collection/identifiers/"
        collections_info = pd.read_json(url)
        collections_info[collections_info["has_analytics"]
                         & ~collections_info["is_active"]].code

Out [8]: 9    sss
         Name: code, dtype: object

Yes! The sss (Social Sciences) collection is discontinued. The easiest way to collect information from this dataset is by removing its entries, but in order to get the full network information and the information about this collection, we shouldn’t do that. How can we classify these collections?

In [9]: collections_info[collections_info["has_analytics"]] \
            .groupby("status").aggregate({"code": set})

Out [9]:              code
         status
         certified    {sss, mex, per, arg, cri, sza, prt, esp, cub, ...
         development  {pry, ecu}
         independent  {rvt, psi, rve}

The independent collections follow the SciELO model, but they aren’t managed by SciELO. We have 5 collections in analytics that are thematic:

In [10]: acolinfo = collections_info[collections_info["has_analytics"]] \
             [["code", "document_count", "domain",
               "original_name", "status", "is_active"]] \
             .set_index("code")
         acolinfo.loc[["psi", "rve", "rvt", "spa", "sss"]]

Out [10]:       document_count  domain                     original_name    status       is_active
          code
          psi          23841.0  pepsic.bvsalud.org         PEPSIC           independent  True
          rve          22733.0  www.revenf.bvs.br          REVENF           independent  True
          rvt            136.0  www.revtur.org             RevTur           independent  True
          spa          40996.0  www.scielosp.org           Saúde Pública    certified    True
          sss            665.0  socialsciences.scielo.org  Social Sciences  certified    False

And 16 collections that are national:

In [11]: acolinfo.drop(["psi", "rve", "rvt", "spa", "sss"])

Out [11]:


      document_count  domain              original_name  status       is_active
code
arg          36555.0  www.scielo.org.ar   Argentina      certified    True
chl          61760.0  www.scielo.cl       Chile          certified    True
col          66973.0  www.scielo.org.co   Colombia       certified    True
cub          33492.0  scielo.sld.cu       Cuba           certified    True
esp          37200.0  scielo.isciii.es    España         certified    True
mex          56905.0  www.scielo.org.mx   Mexico         certified    True
prt          17127.0  www.scielo.mec.pt   Portugal       certified    True
scl         370150.0  www.scielo.br       Brasil         certified    True
sza          25617.0  www.scielo.org.za   South Africa   certified    True
ven          18971.0  www.scielo.org.ve   Venezuela      certified    True
bol           4758.0  www.scielo.org.bo   Bolivia        certified    True
cri           9158.0  www.scielo.sa.cr    Costa Rica     certified    True
per           9618.0  www.scielo.org.pe   Peru           certified    True
pry              NaN  scielo.iics.una.py  Paraguay       development  True
ury           4360.0  www.scielo.edu.uy   Uruguay        certified    True
ecu             15.0  www.scielo.ec       Ecuador        development  True

6.3 Data de-duplication & inf instead of NaN

Each ISSN may appear in more than one collection and perhaps more than once in a collection, as the ISSN SciELO column is normalized, not the rows. As an example of that:

In [12]: journals[columns][journals["ISSN SciELO"].isin(["0103-5665",
                                                         "0104-1282"])]

Out [12]:       collection  ISSN SciELO  inclusion year at SciELO  stopping year at SciELO
          60    scl         0103-5665                        2006                   2015.0
          1442  psi         0103-5665                        2015                   2015.0
          1455  psi         0104-1282                        2012                      NaN
          1461  psi         0103-5665                        2008                      NaN
          1492  psi         0104-1282                        2008                      NaN

These rows are inconsistent, as the inclusion year is different for the same ISSN. At least, these are the only inconsistent entries in the dataset:

In [13]: is_consistent = (
             journals[columns]
                 .groupby(["collection", "ISSN SciELO"])
                 .apply(lambda df: df.apply(set).apply(len).sum() == df.size)
                 .rename("is_consistent")
         )
         pd.DataFrame(is_consistent[~is_consistent])

Out [13]:                            is_consistent
          collection  ISSN SciELO
          psi         0103-5665      False
          psi         0104-1282      False

Then let’s get the consistent dataset while keeping the index in “sync” with the full journals dataframe. We’ll replace the NaN by inf (infinity) in the stopping year column in order to make it the greatest possible value (which will also be required later on).
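A one-liner sketch (not from the original notebook) of why inf is handy here: any concrete stopping year loses to it under the "max" aggregation used below, so a journal still active in any of its duplicated rows remains active after the de-duplication.

import numpy as np

print(max(2015.0, np.inf))  # inf, i.e. "still active" wins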


In [14]: dataset = (journals
             .reset_index()
             .fillna({"stopping year at SciELO": np.inf})
             .groupby(["collection", "ISSN SciELO"])
             .aggregate({
                 "inclusion year at SciELO": "min",
                 "stopping year at SciELO": "max",
                 "index": "max",
             })
             .reset_index()
             .set_index("index")
         )
         dataset.iloc[::312]

Out [14]:        collection  ISSN SciELO  inclusion year at SciELO  stopping year at SciELO
          index
          373     arg         0002-7014                        2003              2014.000000
          725     col         0120-3592                        2006                      inf
          1006    esp         0211-5735                        2007                      inf
          1284    prt         0873-2159                        2006              2012.000000
          277     scl         0102-6909                        1998                      inf
          1571    sss         1414-3283                        2006                      inf

The previous cell has quite generic code that should work on any input. But in our case it could be simpler, since only two rows had been removed (as expected):

In [15]: dataset.shape

Out [15]: (1730, 4)

The duplicated entries were fixed:

In [16]: dataset[dataset["ISSN SciELO"].isin(["0103-5665", "0104-1282"])]

Out [16]:        collection  ISSN SciELO  inclusion year at SciELO  stopping year at SciELO
          index
          1461    psi         0103-5665                        2008                      inf
          1492    psi         0104-1282                        2008                      inf
          60      scl         0103-5665                        2006              2015.000000

A simpler (but not recommended) approach would be the removal of the two duplicated rows: 1442 and 1455, in the 2018-09-14 package version (1436 and 1449 in the 2018-06-10 package version).

In [17]: journals.drop([1442, 1455]).shape

Out [17]: (1730, 98)

Is it really the same?

In [18]: journals.drop([1442, 1455])[columns].fillna(np.inf).eq(dataset).all()

Out [18]: collection                  True
          ISSN SciELO                 True
          inclusion year at SciELO    True
          stopping year at SciELO     True
          dtype: bool

6.4 Number of active journals in the network

How many active journals does the network have?

In [19]: dataset[dataset["stopping year at SciELO"] == np.inf]["ISSN SciELO"] \
             .drop_duplicates().count()

Out [19]: 1360

Actually, that number isn’t clean, since a journal can’t be said to be active just because it was never deindexed from a now discontinued collection.

6.4.1 Social Sciences (sss) collection normalization

There are 10 journals from the sss collection that could be regarded as deindexed from the year when sss was discontinued.

In [20]: dataset[(dataset["stopping year at SciELO"] == np.inf)
                 & (dataset["collection"] != "sss")]["ISSN SciELO"] \
             .drop_duplicates().count()

Out [20]: 1350

That’s another kind of normalization step: enforcing that the sss collection had all its entries deindexed in 2010.

In [21]: sss_discontinuation_year = 2010
         sss_selector = dataset[dataset["collection"] == "sss"].index
         # The "max" step is just for consistency
         # regarding an extraordinary entry included in 2017
         dataset.loc[sss_selector, "stopping year at SciELO"] = \
             dataset[dataset["collection"] == "sss"].T.apply(
                 lambda row: max(min(sss_discontinuation_year,
                                     row["stopping year at SciELO"]),
                                 row["inclusion year at SciELO"])
             )
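A sketch of the clamping expression above (hypothetical standalone function, not from the original notebook): the stopping year of each sss entry is forced into the interval [inclusion year, 2010].

def clamp_stop(stop, inclusion, discontinuation=2010):
    # min caps open-ended entries at the discontinuation year,
    # max keeps entries included after 2010 consistent
    return max(min(discontinuation, stop), inclusion)

print(clamp_stop(float("inf"), 2006))  # 2010
print(clamp_stop(float("inf"), 2017))  # 2017, the extraordinary entry case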

6.4.2 Yearly totals

How many journals had been active in the past years? In order to answer this question, we’ll need to count how many journals had been indexed and deindexed on each year, grouping them to de-duplicate when any journal appears in more than one collection.

In [22]: network_years = dataset.groupby(["ISSN SciELO"]).aggregate({
             "inclusion year at SciELO": "min",
             "stopping year at SciELO": "max",
         })
         network_years.head()

Out [22]:              inclusion year at SciELO  stopping year at SciELO
          ISSN SciELO
          0001-3714                        1999              2000.000000
          0001-3765                        2000                      inf
          0001-6012                        2002                      inf
          0001-6365                        2001              2012.000000
          0002-0591                        2013              2015.000000

In [23]: network_index = pd.DataFrame({
             "indexed": network_years.groupby("inclusion year at SciELO").size(),
             "deindexed": network_years.groupby("stopping year at SciELO").size(),
         }).fillna(0) \
           .assign(total=lambda df: (df["indexed"] - df["deindexed"]).cumsum()) \
           .drop(np.inf)
         network_index

Out [23]:          indexed  deindexed   total
          1997.0       9.0        1.0     8.0
          1998.0      29.0        0.0    37.0
          1999.0      15.0        1.0    51.0
          2000.0      28.0        3.0    76.0
          2001.0      26.0        2.0   100.0
          2002.0      67.0        5.0   162.0
          2003.0      43.0        3.0   202.0
          2004.0      57.0        5.0   254.0
          2005.0      83.0       10.0   327.0
          2006.0     132.0        6.0   453.0
          2007.0     121.0        9.0   565.0
          2008.0     100.0       14.0   651.0
          2009.0     100.0        8.0   743.0
          2010.0     134.0       15.0   862.0
          2011.0     103.0       10.0   955.0
          2012.0     106.0       56.0  1005.0
          2013.0     124.0       23.0  1106.0
          2014.0      98.0       23.0  1181.0
          2015.0      89.0       15.0  1255.0
          2016.0      64.0       16.0  1303.0
          2017.0      90.0       53.0  1340.0
          2018.0      35.0       25.0  1350.0

The totals are the cumulative sum of the difference between the indexed and deindexed columns. Numbers might be difficult to grasp as raw tables; let’s plot this data.
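A tiny sketch of that accumulation, using the first two rows of the table above (not from the original notebook):

import pandas as pd

df = pd.DataFrame({"indexed": [9, 29], "deindexed": [1, 0]},
                  index=[1997, 1998])
print((df["indexed"] - df["deindexed"]).cumsum().tolist())  # [8, 37]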


In [24]: network_index[["indexed", "deindexed"]].plot(
             figsize=(12, 8),
             title="Number of newly indexed/deindexed journals by year in the network",
             xticks=network_index.index,
             grid=True,
         );

In [25]: network_index["total"].plot(
             figsize=(12, 8),
             title="Number of active journals in the network",
             xticks=network_index.index,
             grid=True,
         );


6.5 Why are the journals deindexed?

There are 3 possible reasons for that:

• Suspended, a journal that hadn’t been satisfying some quality/requirement criteria (e.g. data access is no longer open, or it has a one-year delay)
• Deceased, a journal that stopped publishing at all
• Renamed, it became another journal entry (the old entry is regarded as deceased)

Can we find this information in this dataset? There are two columns/fields that might help here:

In [26]: journals["title current status"].unique()

Out [26]: array(['deceased', 'suspended', 'current', 'inprogress'], dtype=object)

In [27]: journals["stopping reason"].unique()

Out [27]: array([nan, 'susp', 'not-'], dtype=object)

These mean:

• NaN: not deindexed, deceased or renamed; it’s an empty field in the CSV;
• "susp": suspended by either the editor or the committee;
• "not-": suspended since the access is no longer open.

6.5.1 How many had been deindexed by these reasons?

Regarding the stopping reason column, that evaluation only makes sense for deindexed journals, since every indexed journal has NaN as the reason:


In [28]: # Number of reasons different than NaN for not deindexed journals
         # (in the raw journals dataframe the stopping year is still NaN,
         #  not inf, so the active entries are selected with isna)
         journals[journals["stopping year at SciELO"].isna()] \
             ["stopping reason"].count()

Out [28]: 0

Let’s summarize the information from this column and the title current status column:

In [29]: reasons = (journals
             .assign(active=journals["stopping year at SciELO"].isna())
             .fillna({"stopping year at SciELO": np.inf,
                      "stopping reason": ""})
             .groupby(["ISSN SciELO"])
             .aggregate({
                 "inclusion year at SciELO": "min",
                 "stopping year at SciELO": "max",
                 "stopping reason": frozenset,
                 "title current status": frozenset,
                 "active": "max",
             })
             .groupby(["stopping reason", "title current status", "active"])
             .size()
             .rename("count")
             .reorder_levels(["active", "title current status",
                              "stopping reason"])
             .sort_index()
         )
         pd.DataFrame(reasons)

Out [29]:                                                           count
          active  title current status            stopping reason
          False   (deceased)                      ()                 124
          False   (suspended)                     (susp)             163
          False   (suspended)                     (not-)               6
          True    (current)                       ()                1337
          True    (deceased, current)             ()                   3
          True    (current, suspended)            (, susp)            15
          True    (current, suspended)            (, not-)             1
          True    (deceased, current, suspended)  (, susp)             1
          True    (inprogress)                    ()                   1
          True    (current, inprogress)           ()                   2

We don’t need to care about the several alternatives under True, since there are journals that are no longer active in one collection while still active in another collection. The only consistency check we can perform there is the sum: it should match the total of active journals.

In [30]: pd.DataFrame(reasons).unstack(0).sum()

Out [30]:        active
          count  False      293.0
                 True      1360.0
          dtype: float64

The proportion of deindexed reasons can be shown in a bar plot:

In [31]: reasons.loc[False].plot.barh(
             title="Deindexing (status, reason) pair",
             figsize=(10, 6),
         );

6.6 Collection-specific analysis

In the data de-duplication section, we had already normalized the rows to ensure every entry has a distinct (collection, ISSN) pair. If we hadn’t, that would need to be done now in order to properly count the entries of each collection. Yearly, how many journals had been indexed for each collection? Let’s see the cumulative number of journals indexed until a certain year for every collection.

In [32]: indexed_on = (
             dataset
                 .groupby(["inclusion year at SciELO", "collection"])
                 .size()
                 .unstack()
                 .fillna(0)
                 .sort_index(axis=1)
                 .sort_index()
         )
         indexed_on.cumsum().astype(int)

Out [32]: [Rotated full-page table: cumulative number of indexed journals per collection (arg, bol, chl, col, cri, cub, ecu, esp, mex, per, prt, psi, rve, rvt, scl, spa, sss, sza, ury, ven) for each year from 1997 to 2018.]

6 Number of indexed/deindexed/active journals in the SciELO network — Page 12 / 24 6.6 Collection-specific analysis

The number of indexed journals by year, instead of the cumulative values, can be seen in a heat map:

In [33]: sns.heatmap(indexed_on, cmap="rainbow", fmt="g", annot=True,
                     vmin=0, vmax=30,
                     ax=plt.subplots(figsize=(14, 10))[1]
         ).set(title="Number of indexed journals");

And the same regarding the deindexed entries:

In [34]: deindexed_on = (dataset
             .groupby(["stopping year at SciELO", "collection"])
             .size()
             .drop(np.inf)
             .unstack()
             .fillna(0)
             .sort_index(axis=1)
             .sort_index()
         )
         deindexed_on.cumsum().astype(int)

Out [34]:


collection               arg  bol  chl  col  cri  cub  esp  mex  prt  psi  rve  scl  spa  sss  sza  ury  ven
stopping year at SciELO
1997.0                     0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0
1999.0                     0    0    0    0    0    0    0    0    0    0    0    2    0    0    0    0    0
2000.0                     0    0    0    0    0    0    0    0    0    0    0    5    0    0    0    0    0
2001.0                     0    0    1    0    0    0    0    0    0    0    0    6    0    0    0    0    0
2002.0                     0    0    4    0    0    0    0    0    0    0    0    8    0    0    0    0    0
2003.0                     0    0    4    0    0    0    2    0    0    0    0    9    0    0    0    0    0
2004.0                     0    0    4    0    0    0    3    0    0    1    0   12    0    0    0    0    0
2005.0                     0    0    4    0    1    0    4    1    0    4    0   16    0    0    0    0    0
2006.0                     1    0    5    0    2    0    5    2    0    5    0   16    0    0    0    0    0
2007.0                     1    0    6    0    4    0    6    5    0    6    0   17    0    0    0    0    0
2008.0                     1    0    9    0    4    0    9    5    0   16    0   19    0    0    0    0    0
2009.0                     1    0    9    0    5    1    9    6    0   19    0   23    0    0    0    0    0
2010.0                     1    0   11    0    5    1    9    6    2   20    0   26    0   32    0    0    0
2011.0                     2    0   11    0    5    1   11    7    4   20    0   30    0   32    0    0    0
2012.0                     2    0   11    0    5    3   13    7    9   33    0   46    0   32    0    0   20
2013.0                     2    0   11    0    5    3   16    9   12   44    0   50    0   32    0    0   23
2014.0                     5    0   11    0    5    3   16   18   13   53    0   54    0   32    0    0   23
2015.0                     7    0   11    0    5    4   16   20   16   53    0   61    0   32    0    2   23
2016.0                     7    2   11    0    5    6   16   25   16   53    3   66    0   32    0    2   23
2017.0                    16    2   14    3    5    6   17   44   22   53    4   73    2   33    1    4   23
2018.0                    23    2   14    4    5    6   17   57   22   54    4   75    2   33    1    5   23

In [35]: sns.heatmap(deindexed_on, cmap="rainbow", fmt="g", annot=True,
                     vmin=0, vmax=20,
                     ax=plt.subplots(figsize=(14, 10))[1]
         ).set(title="Number of deindexed journals");


We can join these in a single table by stacking the collection as a secondary row index:

In [36]: indexed_deindexed_df = pd.DataFrame([
             indexed_on.stack().rename("indexed"),
             deindexed_on.stack().rename("deindexed"),
         ]).T.fillna(0)
         indexed_deindexed_df.iloc[::31].astype(int)

Out [36]:            indexed  deindexed
          1997 arg         0          0
          1998 psi         0          0
          2000 chl         7          0
          2001 rvt         0          0
          2003 cri         0          0
          2004 spa         1          0
          2006 ecu         0          0
          2007 sza         0          0
          2009 mex        14          1
          2010 ven         1          0
          2012 prt         3          5
          2014 bol         2          0
          2015 rve         0          0
          2017 col         8          3
          2018 scl         4          2

That makes it easier to plot both series at once for a single collection, for example:


In [37]: indexed_deindexed_df.reorder_levels([1, 0]).loc["scl"].plot(
             figsize=(12, 8),
             title="Number of newly indexed/deindexed journals "
                   "by year in the scl collection",
             xticks=range(1997, 2019),
             grid=True,
         );

We can get the active journals for all the collections directly from the two dataframes with the indexed and deindexed counts:

In [38]: active_on = indexed_on.sub(deindexed_on, fill_value=0).cumsum()
         active_on.astype(int)

Out [38]: [Rotated full-page table: number of active journals per collection (arg, bol, chl, col, cri, cub, ecu, esp, mex, per, prt, psi, rve, rvt, scl, spa, sss, sza, ury, ven) for each year from 1997 to 2018.]
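A note on fill_value=0 in the subtraction above (a sketch with toy series, not from the original notebook): a (year, collection) cell present on only one side is treated as zero instead of producing NaN.

import pandas as pd

a = pd.Series([1.0], index=[1997])
b = pd.Series([2.0], index=[1998])
print(a.sub(b, fill_value=0).to_dict())  # {1997: 1.0, 1998: -2.0}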


In [39]: def add_markers_and_annotation(ax):
             # Apply a distinct marker to each line
             for line, marker in zip(ax.get_lines(), "ov^<>spP*hHxXDd8234+1.,"):
                 line.set_marker(marker)
                 col = line.get_label()
                 x, y = line.get_data()
                 # Trim the leading/trailing zeros so each line only spans
                 # the years in which the collection had active journals
                 x, y = x[:len(np.trim_zeros(y, "b"))], np.trim_zeros(y)
                 x = x[-len(y):]
                 line.set_data(x, y)
                 # Label the last point of each line with the collection name
                 ax.annotate(col, xy=(x[-1], y[-1]), xytext=(10, 0),
                             textcoords="offset points")
             ax.legend()

In [40]: active_on.plot(figsize=(14, 14), xticks=active_on.index,
                        title="Active journals")
         add_markers_and_annotation(plt.gca())

The total number of active journals in 2018 for each collection is:

In [41]: active_on.loc[2018].sort_values()


Out [41]: collection
          sss      0.0
          ecu      2.0
          rvt      3.0
          rve     14.0
          spa     18.0
          bol     20.0
          ury     20.0
          per     29.0
          ven     35.0
          cri     37.0
          esp     43.0
          prt     47.0
          cub     61.0
          sza     75.0
          psi     93.0
          chl    103.0
          arg    118.0
          mex    150.0
          col    224.0
          scl    291.0
          Name: 2018, dtype: float64

Seeing just parts of the data:

In [42]: active_on[["scl", "col", "mex", "arg", "chl", "sza", "cub"]] \
             .plot.line(figsize=(14, 10), xticks=active_on.index,
                        title="Active journals in $7$ national collections")
         add_markers_and_annotation(plt.gca())


In [43]: active_on.drop(columns=["scl", "col", "mex", "arg", "chl", "sza",
                                 "cub", "psi", "rve", "rvt", "spa", "sss"]) \
             .plot.line(figsize=(14, 10), xticks=active_on.index,
                        title="Active journals in $8$ national collections")
         add_markers_and_annotation(plt.gca())

In [44]: active_on[["psi", "rve", "rvt", "spa", "sss"]] \
             .plot.line(figsize=(14, 8), xticks=active_on.index,
                        title="Active journals in thematic collections")
         add_markers_and_annotation(plt.gca())


Or, a subplots visualization of it all (without the custom markers and annotations):

In [45]: # Replacing 0 by NaN only works because these have no zero in-between
         active_on.replace(0, np.nan).plot(
             subplots=True,
             figsize=(16, 32),
             xticks=active_on.index,
             grid=True,
             marker="o",
         );



6.7 Summing the collection-specific entries

Usually we shouldn’t sum the counts of the collection-specific entries, because they have some intersection (the same ISSN appearing in more than one collection), and that would add some residual to our results. To give some sense of how large that residual is, let’s calculate it!

In [46]: collections_sum = active_on.T.sum()
         collections_sum

Out [46]: 1997       8.0
          1998      38.0
          1999      55.0
          2000      80.0
          2001     104.0
          2002     169.0
          2003     209.0
          2004     262.0
          2005     340.0
          2006     477.0
          2007     595.0
          2008     689.0
          2009     784.0
          2010     884.0
          2011     980.0
          2012    1030.0
          2013    1130.0
          2014    1205.0
          2015    1279.0
          2016    1328.0
          2017    1372.0
          2018    1383.0
          dtype: float64

In [47]: pd.DataFrame({
             "Active total in network": network_index["total"],
             "Sum of the collection-specific active totals": collections_sum,
         }).plot(
             figsize=(12, 8),
             title="Number of journals",
             xticks=network_index.index,
             grid=True,
         );


The shape is almost the same, but there is a 2.44% error in 2018 (using the 2018-09-14 data).

In [48]: (collections_sum[2018] - network_index["total"][2018]) / network_index["total"][2018]

Out [48]: 0.024444444444444446
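The overlap itself can also be counted directly: grouping a network-wide journals report by ISSN and counting distinct collections shows how many journals are indexed in more than one collection. A sketch, assuming the tabs_network/journals.csv report used in the last notebook below:

    journals_network = pd.read_csv("tabs_network/journals.csv")
    # Journals whose ISSN appears in more than one collection
    multi = journals_network.groupby("ISSN SciELO")["collection"].nunique()
    (multi > 1).sum()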

7 Deindexing reason in the SciELO Brazil collection

In the notebook regarding the number of indexed/deindexed/active journals, we weren’t able to distinguish between a renamed and a deceased journal. Here the goal is to distinguish between these and to get more information about the suspension reason, limited to a single collection: scl (Brazil).

7.1 Using the ArticleMeta API

Using the ArticleMeta API[1], we can get the current status of every journal in any collection, as well as its history. In order to get the renamed entries, it has a piece of information that the reports don’t have: the previous/next titles of a journal. In PyPI, it’s the articlemetaapi package, not articlemeta, and it should be upgraded/installed with something like pip install 'articlemetaapi>=1.26.4', since earlier versions aren’t prepared for Python 3.7. Let’s use it:

In [1]: from articlemeta.client import RestfulClient

To get all the journals from the scl collection:

In [2]: %%time
        journals = list(RestfulClient().journals(collection="scl"))

CPU times: user 3.47 s, sys: 655 ms, total: 4.13 s Wall time: 3min 16s

That’s it! It’s really slow, as it’s making a request for each journal. Without the collection parameter, it would grab every journal in the SciELO network.

In [3]: len(journals)  # scl only

Out [3]: 366

7.1.1 Grabbing the history including the titles

We can get the full history from it as a dataframe, including some extra fields from properties of the journal objects:

In [4]: import pandas as pd

In [5]: histories = pd.DataFrame([history + (journal.current_status,
                                             journal.scielo_issn,
                                             journal.title,
                                             journal.previous_title,
                                             journal.data.get("v710", [{}])[0]
                                                         .get("_", None))
                                  for journal in journals
                                  for history in journal.status_history],
                                 columns=["date", "status", "reason",
                                          "current_status", "issn", "title",
                                          "previous_title", "next_title"])
        print(histories.shape)  # There can be more than one row per journal
        histories.head(15)

[1]https://pypi.org/project/articlemetaapi


(457, 8)

Out [5]: [First 15 rows of the 457x8 histories dataframe, with the columns date,
         status, reason, current_status, issn, title, previous_title and
         next_title, for journals such as RAE eletrônica, Revista de Psiquiatria
         do Rio Grande do Sul and Revista de Antropologia.]


That’s based on the Data dictionary of the SciELO’s model[2], which states in its section “4 - TITLE database” that v610 means previous title and v710 means new title. Most names have an alias to avoid direct access to the raw data, but the new title field didn’t have an alias as of the time of writing. From the above dataset, we can get the pair of ISSNs regarding each title change. We’ll get both the previous_title to title matching pairs and the title to next_title matching pairs, removing the duplicates. That will get all changes, even the ones that aren’t completely described in both entries of a pair.

In [6]: issn_changes_raw = pd.concat([
            pd.merge(histories[histories["next_title"].notna()], histories,
                     how="left", left_on="next_title", right_on="title"),
            pd.merge(histories, histories[histories["previous_title"].notna()],
                     how="right", left_on="title", right_on="previous_title"),
        ])[["current_status_x", "issn_x", "issn_y"]
          ].dropna().drop_duplicates().rename(
            columns={"issn_x": "from", "issn_y": "to"},
        )
        print(issn_changes_raw.shape)
        issn_changes_raw

(39, 3)

Out [6]:    current_status_x       from         to
         0          deceased  0101-8108  2237-6089
         2          deceased  1516-9332  1984-8250
         4          deceased  0104-5687  2179-6491
         8          deceased  0101-8175  1984-4670
         10         deceased  0101-3122  2317-1537
         12         deceased  2179-6491  2317-1782
         14         deceased  0373-5524  1413-7739
         18         deceased  0100-4239  0373-5524
         22         deceased  0100-7386  1678-5878
         26         deceased  0034-7108  1519-6984
         28         deceased  0102-2555  1517-9702
         32         deceased  1517-7491  1806-8324
         34         deceased  0103-3131  1677-0420
         38         deceased  0103-0663  1517-7491
         42         deceased  1413-7739  1679-8759
         44         deceased  1516-8034  2317-6431
         46         deceased  0001-3714  1517-8382
         48         deceased  0071-1276  0103-9016
         50         deceased  0102-3586  1806-3713
         52         deceased  0100-4158  1982-5676
         56         deceased  0100-8455  1415-4757
         58         deceased  0104-7930  1678-9199
         60         deceased  1413-9251  1519-7077
         62         deceased  0301-8059  1519-566X
         66         deceased  0041-8781  1807-5932
         68         deceased  1809-4872  1809-4864
         70         deceased  0004-2730  2359-3997
         72         deceased  1517-3151  2446-4740
         74         deceased  0101-9880  2237-9363
         76         deceased  1415-5419  2176-9451
         78         deceased  1677-0420  2197-0025
         82         deceased  0034-7299  1808-8694
         84         deceased  1983-3083  2448-2455
         86         deceased  0370-4467  2448-167X
         88          current  2448-2455  2448-2455
         89         deceased  1516-8484  2531-1379
         91         deceased  0080-2107  2531-0488
         30         deceased  0104-8023  1984-0292
         90         deceased  1806-0013  2595-3192

[2]https://docplayer.com.br/939603-Bireme-opas-oms-centro-latino-americano-e-do-caribe-de-informacao-em-ciencias-da-saude-metodologia-scielo-dicionario-de-dados-do-modelo-scielo.html

There’s a single title that appears as next_title but is still the current title:

In [7]: histories[histories["issn"] == "2448-2455"]

Out [7]:      date   status  reason current_status       issn                          title                     previous_title                     next_title
         264  2016  current                 current  2448-2455  Journal of Physical Education  Revista da Educação Física / UEM  Journal of Physical Education

Cleaning this is easy: when a journal is renamed and another ISSN is issued, the current status of the old entry is deceased.

In [8]: issn_changes = issn_changes_raw \
            [issn_changes_raw["current_status_x"] == "deceased"]\
            [["from", "to"]]
        issn_changes.shape

Out [8]: (38, 2)

7.1.2 Analysing the directed graph of renamed journals’ ISSNs with NetworkX

In [9]: import matplotlib.pyplot as plt
        import networkx as nx
        %matplotlib inline

We can convert the last dataframe to a directed graph object:

In [10]: issn_changes_graph = nx.DiGraph(issn_changes.values.tolist())

How many nodes (ISSNs) are there?

In [11]: len(issn_changes_graph.nodes)

Out [11]: 71

Can we partition it as disjoint sets of connected nodes? How many partitions are there?

In [12]: connected_nodes = \
             list(nx.connected_components(issn_changes_graph.to_undirected()))
         print(len(connected_nodes))
         connected_nodes

33


Out [12]: [{'0101-8108', '2237-6089'},
           {'1516-9332', '1984-8250'},
           {'0104-5687', '2179-6491', '2317-1782'},
           {'0101-8175', '1984-4670'},
           {'0101-3122', '2317-1537'},
           {'0100-4239', '0373-5524', '1413-7739', '1679-8759'},
           {'0100-7386', '1678-5878'},
           {'0034-7108', '1519-6984'},
           {'0102-2555', '1517-9702'},
           {'0103-0663', '1517-7491', '1806-8324'},
           {'0103-3131', '1677-0420', '2197-0025'},
           {'1516-8034', '2317-6431'},
           {'0001-3714', '1517-8382'},
           {'0071-1276', '0103-9016'},
           {'0102-3586', '1806-3713'},
           {'0100-4158', '1982-5676'},
           {'0100-8455', '1415-4757'},
           {'0104-7930', '1678-9199'},
           {'1413-9251', '1519-7077'},
           {'0301-8059', '1519-566X'},
           {'0041-8781', '1807-5932'},
           {'1809-4864', '1809-4872'},
           {'0004-2730', '2359-3997'},
           {'1517-3151', '2446-4740'},
           {'0101-9880', '2237-9363'},
           {'1415-5419', '2176-9451'},
           {'0034-7299', '1808-8694'},
           {'1983-3083', '2448-2455'},
           {'0370-4467', '2448-167X'},
           {'1516-8484', '2531-1379'},
           {'0080-2107', '2531-0488'},
           {'0104-8023', '1984-0292'},
           {'1806-0013', '2595-3192'}]

In [13]: len(nx.dfs_tree(issn_changes_graph, "1679-8759"))  # a chain's last ISSN reaches only itself

Out [13]: 1
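As that DFS tree hints, each renaming chain can be walked from any of its ISSNs. Below is a minimal sketch of a helper that follows a chain to its last ISSN; renaming_chain is hypothetical, not part of the original notebook:

    def renaming_chain(graph, issn):
        # Each node has at most one outgoing edge here (a journal is
        # renamed to a single new ISSN), so we just walk the successors
        chain = [issn]
        while True:
            successors = list(graph.successors(chain[-1]))
            if not successors:
                return chain
            chain.append(successors[0])

    renaming_chain(issn_changes_graph, "0100-4239")
    # -> ['0100-4239', '0373-5524', '1413-7739', '1679-8759']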

Each set represents a journal that had been renamed:

In [14]: plt.figure(figsize=(16, 12))
         nx.draw(issn_changes_graph,
                 pos={node: (nidx, pidx)
                      for pidx, partition in enumerate(connected_nodes)
                      for nidx, node in enumerate(sorted(
                          partition,
                          reverse=True,
                          key=lambda n: len(nx.dfs_tree(issn_changes_graph, n)),
                      ))},
                 )


7.1.3 Deindexing statistics

This finishes our analysis of the deindexing reason in the scl collection:

In [15]: full_status_stats = pd.DataFrame(histories
             .sort_values("date")
             .assign(renamed=lambda df: df["issn"].isin(issn_changes["from"]))
             .groupby("issn")
             .agg("last")
             .fillna("")
             .groupby(["current_status", "reason", "renamed"])
             .size()
             .rename("count")
         )
         full_status_stats

Out [15]:                                                  count
          current_status  reason                  renamed
          current                                 False      291
          deceased                                False        2
          deceased                                True        38
          suspended       not-open-access         False        7
          suspended       suspended-by-committee  False       26
          suspended       suspended-by-editor     False        2
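The sort_values + groupby + agg("last") idiom above keeps the most recent history entry of each ISSN. On toy data (hypothetical rows, not from the database), it behaves like this:

    toy = pd.DataFrame({"issn": ["A", "A", "B"],
                        "date": ["2001", "2009", "2005"],
                        "status": ["current", "deceased", "current"]})
    toy.sort_values("date").groupby("issn").agg("last")
    #       date    status
    # issn
    # A     2009  deceased
    # B     2005   current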

Or, joining the three-layered index into a single string:


In [16]: status_stats = full_status_stats.assign(
             status=["renamed" if renamed else
                     reason.replace("not-", "suspended-by-not-")
                           .replace("-", " ")
                     or current_status
                     for current_status, reason, renamed
                     in full_status_stats.index.values
                    ]
         ).set_index("status")
         status_stats.drop("current").plot.barh(figsize=(12, 4),
                                                title="Deindexing status in scl")
         status_stats

Out [16]:                               count
          status
          current                         291
          deceased                          2
          renamed                          38
          suspended by not open access      7
          suspended by committee           26
          suspended by editor               2

7.2 Using the status changes report

We can’t get the information regarding the renamed entries using just the CSV reports, but the remaining information is there. We could use the journals.csv, but an analysis of it had already been performed in the notebook that analyzed the number of indexed journals. As an alternative, let’s open the journals_status_changes.csv:

In [17]: journals_status_changes = pd.read_csv("tabs_bra/journals_status_changes.csv")
         print(journals_status_changes.shape)
         journals_status_changes.columns

(457, 23)

Out [17]: Index(['extraction date', 'study unit', 'collection', 'ISSN SciELO',
                 'ISSN's', 'title at SciELO', 'title thematic areas',
                 'title is agricultural sciences', 'title is applied social sciences',
                 'title is biological sciences', 'title is engineering',
                 'title is exact and earth sciences', 'title is health sciences',
                 'title is human sciences', 'title is linguistics, letters and arts',
                 'title is multidisciplinary', 'title current status',
                 'status change date', 'status change year', 'status change month',
                 'status change day', 'status changed to', 'status change reason'],
                dtype='object')

It has fewer columns than the journals.csv. Let’s see the first few entries.

In [18]: journals_status_changes.head().T

Out [18]:                                                        0                        1                                            2                                            3                        4
          extraction date                               2018-09-13               2018-09-13                                   2018-09-13                                   2018-09-13               2018-09-13
          study unit                                       journal                  journal                                      journal                                      journal                  journal
          collection                                           scl                      scl                                          scl                                          scl                      scl
          ISSN SciELO                                    1676-5648                1676-5648                                    0101-8108                                    0101-8108                0034-7701
          ISSN's                                         1676-5648                1676-5648                                    0101-8108                                    0101-8108                0034-7701
          title at SciELO                           RAE eletrônica           RAE eletrônica  Revista de Psiquiatria do Rio Grande do Sul  Revista de Psiquiatria do Rio Grande do Sul  Revista de Antropologia
          title thematic areas             Applied Social Sciences  Applied Social Sciences                              Health Sciences                              Health Sciences           Human Sciences
          title is agricultural sciences                         0                        0                                            0                                            0                        0
          title is applied social sciences                       1                        1                                            0                                            0                        0
          title is biological sciences                           0                        0                                            0                                            0                        0
          title is engineering                                   0                        0                                            0                                            0                        0
          title is exact and earth sciences                      0                        0                                            0                                            0                        0
          title is health sciences                               0                        0                                            1                                            1                        0
          title is human sciences                                0                        0                                            0                                            0                        1
          title is linguistics, letters and arts                 0                        0                                            0                                            0                        0
          title is multidisciplinary                             0                        0                                            0                                            0                        0
          title current status                            deceased                 deceased                                     deceased                                     deceased                suspended
          status change date                            2006-02-08                  2010-12                                   2004-01-08                                      2012-01               2000-02-13
          status change year                                  2006                     2010                                         2004                                         2012                     2000
          status change month                                    2                       12                                            1                                            1                        2
          status change day                                      8                      NaN                                            8                                          NaN                       13
          status changed to                                current                 deceased                                      current                                     deceased                  current
          status change reason                                 NaN                      NaN                                          NaN                                          NaN                      NaN

We need to see the title current status and the status change reason regarding the last status change entry.

In [19]: last_in_history = (journals_status_changes
             .sort_values("status change date")
             .fillna("")
             .groupby("ISSN SciELO")
             .agg("last")
         )

In [20]: reasons_from_csv = pd.DataFrame(last_in_history
             .groupby(["title current status", "status change reason"])
             .size()
             .rename("count")
         )
         reasons_from_csv

Out [20]:                                               count
          title current status  status change reason
          current                                         291
          deceased                                         40
          suspended             not-open-access              7
          suspended             suspended-by-committee      26
          suspended             suspended-by-editor          2

This can be misleading, because deceased here means deceased or renamed. A visualization alternative:

In [21]: status_from_csv = reasons_from_csv.assign(
             status=[reason.replace("not-", "suspended-by-not-")
                           .replace("-", " ")
                     or current_status.replace("ed", "ed or renamed")
                     for current_status, reason
                     in reasons_from_csv.index.values
                    ]
         ).set_index("status")
         status_from_csv.drop("current").plot.barh(
             figsize=(12, 4),
             title="Deindexing status in scl from its journals_status_changes.csv",
         )
         status_from_csv

Out [21]:                               count
          status
          current                         291
          deceased or renamed              40
          suspended by not open access      7
          suspended by committee           26
          suspended by editor               2

To segregate the deceased from the renamed journals, we need the information from ArticleMeta, as was done earlier in this notebook.

8 Collecting the daily access in the SciELO Brazil collection

The Ratchet API[1] has the daily access count for every journal. Our goal is to find a way to obtain and use this data.

In [1]: import json, re
        from urllib.request import urlopen

In [2]: import matplotlib.pyplot as plt
        import pandas as pd
        import seaborn as sns

In [3]: %matplotlib inline

8.1 Single journal access data in a DataFrame

The entry point to get the access data of a single ISSN is:

http://ratchet.scielo.org/api/v1/journals/<ISSN>/

replacing <ISSN> by the actual journal ISSN. That API returns a nested JSON structure, which doesn’t fit directly when we’re reading the data with Pandas. The function below reads the dictionary of a single journal entry (i.e., a single journal from a properly loaded Ratchet JSON) and converts the access data to a Pandas DataFrame:

In [4]: def build_dataframe_from_dict(ratchet_dict):
            columns = ["total", "abstract", "html", "pdf",
                       "pdfsite", "toc", "issues", "journal"]
            dicts = {
                "total": ratchet_dict,
                "pdfsite": ratchet_dict.get("other", {}).get("pdfsite", {}),
                **{key: ratchet_dict.get(key, {})
                   for key in columns if key not in ["total", "pdfsite"]},
            }
            series = []
            result = pd.DataFrame()
            for key, jdata in dicts.items():
                # Keys look like "y2013" (year), "m09" (month) and "d10" (day)
                pairs = [(pd.Timestamp(f"{pyear[1:]}-{pmonth[1:]}-{pday[1:]}"),
                          count)
                         for pyear, ydata in jdata.items()
                         if re.match(r"y\d\d\d\d", pyear)
                         for pmonth, mdata in ydata.items()
                         if re.match(r"m\d\d", pmonth)
                         for pday, count in mdata.items()
                         if re.match(r"d\d\d", pday)]
                if pairs:
                    dates, counts = zip(*pairs)
                    series.append(pd.Series(counts, index=dates).rename(key))
            result = pd.DataFrame(series, dtype=object).reindex(columns).T
            result[result.isna()] = 0
            return pd.DataFrame(result, dtype=int).rename_axis("date")
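To make the y/m/d key convention concrete, here is an illustrative fragment of the nested structure (a hypothetical dictionary; the counts were chosen to match the first two Nauplius rows shown below):

    # Hypothetical fragment: totals live at the top level, per-type data
    # under keys like "html"
    sample = {"y2013": {"m09": {"d10": 65, "d11": 17}},
              "html": {"y2013": {"m09": {"d10": 4, "d11": 1}}}}
    build_dataframe_from_dict(sample)
    # -> rows 2013-09-10 and 2013-09-11, with total 65/17, html 4/1
    #    and 0 in the remaining columns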

[1]http://docs.scielo.org/projects/ratchet/en/latest/api.html


8.1.1 Nauplius example

For example, let’s get the access totals for Nauplius[2], whose ISSN is 0104-6497.

In [5]: nauplius_url = "http://ratchet.scielo.org/api/v1/journals/0104-6497/"
        nauplius_dict = json.load(urlopen(nauplius_url))

In [6]: nauplius_df = build_dataframe_from_dict(nauplius_dict)
        nauplius_df.head()

Out [6]:             total  abstract  html  pdf  pdfsite  toc  issues  journal
         date
         2013-09-10      65         0     4    0        0    3      26       32
         2013-09-11      17         0     1    0        0    1       6        9
         2013-09-12     184         3    11    5        5   24      23      113
         2013-09-13      44         3     7    0        0    9       4       21
         2013-09-14      15         1     1    0        0    3       2        8

In [7]: nauplius_df.loc["2017-09":"2018-08", ["abstract", "html", "pdf"]].plot(
            figsize=(16, 8),
            title="Daily access to Nauplius in the last year",
            linewidth=1,
            grid=True,
        )
        legend = plt.gca().legend_
        legend.set_frame_on(False)
        for legend_line in legend.legendHandles:
            legend_line.set_linewidth(10)
            legend_line.set_solid_capstyle("round")

8.2 Getting all collection journals

The entry point to get the access data of all the journals from the scl collection is:

[2]http://www.scielo.br/scielo.php?script=sci_serial&pid=0104-6497&lng=en&nrm=iso


http://ratchet.scielo.org/api/v1/general/?type=journal&collection=scl

However, that link is effectively broken, given that it always returns a 504 - Gateway Time-out error. Therefore, we have no alternative but to request the journals one by one. At first, let’s create a façade to directly grab the daily access DataFrame.

In [8]: def get_daily_accesses(issn):
            url = f"http://ratchet.scielo.org/api/v1/journals/{issn}/"
            return build_dataframe_from_dict(json.load(urlopen(url)))

We can get the ISSNs either from the journals.csv report file or from the ArticleMeta API. Getting them from the former is faster, so we’ll stick with it.

In [9]: journals = pd.read_csv("tabs_bra/journals.csv")
        issns = journals["ISSN SciELO"].unique()
        len(issns)

Out [9]: 366

Then we can build a tidy matrix with all the downloaded access data.

In [10]: %%time
         accesses = pd.concat([get_daily_accesses(issn).assign(issn=issn)
                               for issn in issns])

CPU times: user 1min 12s, sys: 801 ms, total: 1min 12s Wall time: 1min 58s
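Since the per-ISSN requests are independent, they could be parallelized if the ~2 minutes of wall time become a problem. A sketch, assuming the get_daily_accesses façade above (the original run was sequential):

    from concurrent.futures import ThreadPoolExecutor

    # Download the per-journal dataframes concurrently (order is preserved)
    with ThreadPoolExecutor(max_workers=8) as pool:
        frames = list(pool.map(get_daily_accesses, issns))
    accesses_parallel = pd.concat([frame.assign(issn=issn)
                                   for frame, issn in zip(frames, issns)])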

In [11]: # Save the accesses to avoid re-downloading it all
         # (as of 2018-09-20, it had 38MB)
         accesses.to_csv("scl_daily_access.csv")

In [12]: accesses.head()

Out [12]:             total  abstract  html  pdf  pdfsite  toc  issues  journal       issn
          date
          2011-12-31    151         4    46   72        9    1       1       18  1676-5648
          2012-01-01    180         3    54   96        8    3       2       14  1676-5648
          2012-01-02    615         8   160  374       35   19       2       17  1676-5648
          2012-01-03    688        14   196  384       46   19       6       23  1676-5648
          2012-01-04    647        12   177  409       26    1       4       18  1676-5648

8.3 Evaluating the access count for the entire collection

In [13]: collection_accesses = accesses.reset_index().groupby("date").sum()
         collection_accesses

Out [13]:                total  abstract    html     pdf  pdfsite    toc  issues  journal
          date
          2011-12-30         5         0       2       3        0      0       0        0
          2011-12-31    225441      5989  156060   43226     8221   4145    1830     5970
          2012-01-01    152307      7400   74829   47314     8582   4522    2126     7534
          2012-01-02    343623      7573  162905  119296    24075  10337    5088    14349
          2012-01-03    383105      8967  180467  134192    27812  11660    5622    14385
          2012-01-04    398325      8705  190284  139069    28324  10898    5705    15340
          2012-01-05    396488      8457  192430  135552    27202  12441    5868    14538
          2012-01-06    345391      8641  173659  114398    22035   9708    4528    12422
          2012-01-07    333875      7174  207567   85053    16846   6057    2713     8465
          2012-01-08    285171     16077  138494   88690    18048   9171    3540    11151
          2012-01-09    494699     33372  228460  148660    32131  22580    7759    21737
          2012-01-10    593908     69470  278209  151764    33066  29901    8644    22854
          2012-01-11    507330     35529  232274  147398    31736  29530    8373    22490
          2012-01-12    445707     18640  211135  141340    31054  17254    6664    19620
          2012-01-13    378003     13546  181275  120512    26643  14732    6018    15277
          2012-01-14    361575     20001  220618   84960    17658   7212    3036     8090
          2012-01-15    273460     11428  134305   89064    18718   7808    2957     9180
          2012-01-16    468103     23041  217976  148539    33452  18413    7020    19662
          2012-01-17    498450     31363  226533  151281    35432  23395    8375    22071
          2012-01-18    506176     31585  231393  155216    35368  22012    8548    22054
          2012-01-19    439913     13640  211981  144641    33279  14515    6177    15680
          2012-01-20    381625     17520  181394  119554    27381  15657    5259    14860
          2012-01-21    368721     18281  216394   86868    19509  12605    4005    11059
          2012-01-22    288466     12584  138208   92713    20645   9619    3550    11147
          2012-01-23    455417     11364  211992  155175    35421  15156    6928    19381
          2012-01-24    507045     25447  237568  155939    34286  22306    8631    22868
          2012-01-25    500954     27177  229582  155120    33926  23194    8664    23291
          2012-01-26    454778     16857  215447  148717    32590  16417    6279    18471
          2012-01-27    433941     25181  204735  127125    28190  22850    6994    18866
          2012-01-28    359657     10908  217473   88921    18797   9200    3787    10571
          ...              ...       ...     ...     ...      ...    ...     ...      ...
          2018-08-21   1548417    125385  653653  610258    10635  17549    9301   121636
          2018-08-22   1352492     69731  549938  586141     9061  13445    8436   115740
          2018-08-23   1349190    100246  549611  559453     7872  12443    7669   111896
          2018-08-24   1208159    107687  486725  475430     6433  12337    7572   111975
          2018-08-25    856643     34563  312899  383789     5812   6969    5386   107225
          2018-08-26   1075708     92784  426287  426855     4534   9062    6143   110043
          2018-08-27   1432692     68008  599682  619275     4733  14027    8412   118555
          2018-08-28   1411581     68844  573298  623172     4371  13407    8249   120240
          2018-08-29   1492594    118631  616353  611515     3550  14948    8260   119337
          2018-08-30   1414441    109344  579403  583262     3192  13578    7660   118002
          2018-08-31       895        62     425     319        5      4       5       75
          2018-09-01    866528     38467  324313  377309     3917   6815    6199   109508
          2018-09-02   1095764     91967  434509  441412     3137   7208    6293   111238
          2018-09-03   1545886    129406  649031  598335     3660  35346    9105   121003
          2018-09-04   1434437     85846  587234  614996     3785  13558    8052   120966
          2018-09-05   1799870    264639  764140  559685     3471  56109   10039   141787
          2018-09-06   1111364     50661  428293  473339     1982  11662    7419   138008
          2018-09-07   1012653     80890  376236  387011     2317  31416    6719   128064
          2018-09-08    919296     58981  327936  383358     3614   9742    6380   129285
          2018-09-09   1115466    105025  430741  435628     2244   9315    5604   126909
          2018-09-10   1450453     68993  623706  595691     2787  13911    8602   136763
          2018-09-11   1660814    112587  749664  610700     3151  36688    8854   139170
          2018-09-12   1563269    137942  654267  613354     3215  16643    9829   128019
          2018-09-13   1334611     67464  548346  570163     3099  12932    8478   124129
          2018-09-14   1347717    142803  522307  494605     2909  47186   10227   127680
          2018-09-15   1634685    394595  640161  418208     4147  53105    6799   117670
          2018-09-16   1074175     54021  430230  452933     2903   8758    6048   119282
          2018-09-17   1585446    133221  675839  617419     3555  16602    8888   129922
          2018-09-18   1603385    131722  670879  634299     3471  20976    8862   133176
          2018-09-19      1147       165     523     365        2      7       1       84

In [14]: collection_accesses[["abstract", "html", "pdf"]].plot(
             figsize=(16, 8),
             title="Daily access to documents in the scl collection",
             linewidth=1,
             grid=True,
         )
         legend = plt.gca().legend_
         legend.set_frame_on(False)
         for legend_line in legend.legendHandles:
             legend_line.set_linewidth(10)
             legend_line.set_solid_capstyle("round")

In [15]: collection_accesses.loc["2017-09":"2018-08", ["abstract", "html", "pdf"]].plot(
             figsize=(16, 8),
             title="Daily access to documents in the scl collection "
                   "from 2017-09 to 2018-08",
             linewidth=1,
             grid=True,
         )
         legend = plt.gca().legend_
         legend.set_frame_on(False)
         for legend_line in legend.legendHandles:
             legend_line.set_linewidth(10)
             legend_line.set_solid_capstyle("round")
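The legend styling above is always the same five lines; a small helper could encapsulate it (a sketch, not part of the original notebooks):

    def style_legend(ax):
        # Remove the legend frame and thicken its line samples
        legend = ax.legend_
        legend.set_frame_on(False)
        for legend_line in legend.legendHandles:
            legend_line.set_linewidth(10)
            legend_line.set_solid_capstyle("round")

Calling style_legend(plt.gca()) after each plot would then be enough.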


8.4 Reproducing the SciELO Analytics Charts

In https://analytics.scielo.org we have a home/dashboard like:

[Screenshot of the SciELO Analytics home/dashboard.]

Let’s reproduce the left-hand side of it, except for the number of issues. The journal count we got from the journals.csv, but we could have done the same with the ArticleMeta API, which would get the current data (the notebook regarding the deindexing reason in scl has more information on how to do that). The total accesses are shown by month. Pandas allows us to group by the year and month (using collection_accesses.index.year and collection_accesses.index.month in the groupby), but it also has a monthly grouper, so let’s use it.

In [16]: monthly_data = (collection_accesses
             .groupby(pd.Grouper(freq="M"))
             .sum()
         )
         monthly_data[["abstract", "html", "pdf"]].plot(
             figsize=(16, 8),
             title="Monthly access to documents in the scl collection",
             linewidth=2.5,
             grid=True,
         )
         legend = plt.gca().legend_
         legend.set_frame_on(False)
         for legend_line in legend.legendHandles:
             legend_line.set_linewidth(10)
             legend_line.set_solid_capstyle("round")
         monthly_data.tail()

Out [16]:                total  abstract      html       pdf  pdfsite     toc  issues  journal
          date
          2018-05-31  36611020   1944622  15263095  15155345    95756  409505  231829  3510868
          2018-06-30  34278168   1991711  13936406  14491219    67733  413571  214659  3162869
          2018-07-31  26470272   1882551   9918115  10792576    81643  480489  187428  3127470
          2018-08-31  33371045   2171366  13414688  13694333   180712  405378  212873  3291695
          2018-09-30  24156966   2149395   9838355   9278810    57366  407979  142398  2282663
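The year/month grouping mentioned above would be written like this (a sketch; unlike the monthly grouper, it yields a two-level integer index instead of end-of-month timestamps):

    monthly_alt = collection_accesses.groupby(
        [collection_accesses.index.year, collection_accesses.index.month]
    ).sum()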

In [17]: monthly_data.loc["2015-09":"2018-08", ["abstract", "html", "pdf"]].plot(
             figsize=(16, 8),
             title="Access to documents in the scl collection, by month, "
                   "in the last 3 years",
             linewidth=2.5,
             grid=True,
         )
         legend = plt.gca().legend_
         legend.set_frame_on(False)
         for legend_line in legend.legendHandles:
             legend_line.set_linewidth(10)
             legend_line.set_solid_capstyle("round")


8.4.1 Week day analysis

On which day of the week do SciELO Brazil articles get more accesses?

In [18]: collection_accesses_weekday = (
             collection_accesses.groupby([collection_accesses.index.weekday,
                                          collection_accesses.index.weekday_name])
                                .mean()
                                .reset_index(0, drop=True)
         ).iloc[range(-1, 6)]  # put Sunday (the last weekday number) first
         collection_accesses_weekday[["abstract", "html", "pdf"]].plot(
             figsize=(16, 8),
             title="Mean daily access to documents in the scl collection by week day",
             xticks=range(7),
             linewidth=2.5,
             grid=True,
         )
         legend = plt.gca().legend_
         legend.set_frame_on(False)
         for legend_line in legend.legendHandles:
             legend_line.set_linewidth(10)
             legend_line.set_solid_capstyle("round")
         collection_accesses_weekday

Out [18]:                      total      abstract           html            pdf       pdfsite           toc       issues       journal
          date
          Sunday     667275.151862  30531.512894  321898.217765  221847.630372  29626.472779  15768.830946  4335.234957  43267.252149
          Monday     929982.943020  39583.225071  455172.199430  310183.421652  41974.880342  22807.145299  7481.689459  52780.381766
          Tuesday    938702.829060  38615.227920  461691.746439  313432.242165  42595.868946  23383.911681  7319.373219  51664.458689
          Wednesday  934482.729345  36531.743590  473964.897436  302749.301994  42004.840456  22194.301994  7123.737892  49913.905983
          Thursday   866744.737143  35349.480000  428322.605714  285194.960000  39702.020000  22451.720000  6708.942857  49015.008571
          Friday     760144.531609  32603.298851  371228.370690  247018.795977  34579.117816  21581.054598  6332.298851  46801.594828
          Saturday   581277.145714  29985.014286  273333.277143  191460.714286  26959.111429  15213.351429  4160.202857  40165.474286

8.5 Daily access by thematic area

As we have the data labeled for each ISSN, we can apply some filter using information about the journal. For example, we can draw the plots for each thematic area.

In [19]: areas = [field for field in journals.columns
                  if field.startswith("title is")]
         areas

Out [19]: ['title is agricultural sciences',
           'title is applied social sciences',
           'title is biological sciences',
           'title is engineering',
           'title is exact and earth sciences',
           'title is health sciences',
           'title is human sciences',
           'title is linguistics, letters and arts',
           'title is multidisciplinary']

At first we need an edge list of all thematic areas that an ISSN is assigned to.

In [20]: stacked_areas = journals.set_index("ISSN SciELO")[areas].stack()
         stacked_areas = (stacked_areas[stacked_areas == 1]
             .reset_index(1)["level_1"]
             .rename("area")
             .apply(lambda x: x[len("title is "):].capitalize())
             .rename_axis("issn")
             .reset_index()
         )
         stacked_areas.head(15)

Out [20]:          issn                      area
          0   1676-5648   Applied social sciences
          1   0101-8108           Health sciences
          2   0034-7701            Human sciences
          3   0102-261X  Exact and earth sciences
          4   1516-9332           Health sciences
          5   0104-5687           Health sciences
          6   0103-1759  Exact and earth sciences
          7   0101-8175       Biological sciences
          8   0101-3122     Agricultural sciences
          9   2179-6491           Health sciences
          10  1980-6523           Health sciences
          11  0373-5524       Biological sciences
          12  0373-5524  Exact and earth sciences
          13  0100-4239       Biological sciences
          14  0100-4239  Exact and earth sciences

That can be joined with the Ratchet data, giving us what we need to plot the access counts by thematic area.

In [21]: accesses_by_area = (pd.merge(accesses.reset_index(), stacked_areas)
             .groupby(["date", "area"])
             .sum()
             .stack()
             .rename("count")
             .reset_index()
             .rename(columns={"level_2": "type"})
         )
         accesses_by_area

Out [21]:                 date                           area      type  count
          0         2011-12-30            Biological sciences     total      1
          1         2011-12-30            Biological sciences  abstract      0
          2         2011-12-30            Biological sciences      html      0
          3         2011-12-30            Biological sciences       pdf      1
          4         2011-12-30            Biological sciences   pdfsite      0
          5         2011-12-30            Biological sciences       toc      0
          6         2011-12-30            Biological sciences    issues      0
          7         2011-12-30            Biological sciences   journal      0
          8         2011-12-30                Health sciences     total      3
          9         2011-12-30                Health sciences  abstract      0
          10        2011-12-30                Health sciences      html      2
          11        2011-12-30                Health sciences       pdf      1
          12        2011-12-30                Health sciences   pdfsite      0
          13        2011-12-30                Health sciences       toc      0
          14        2011-12-30                Health sciences    issues      0
          15        2011-12-30                Health sciences   journal      0
          16        2011-12-30                 Human sciences     total      2
          17        2011-12-30                 Human sciences  abstract      0
          18        2011-12-30                 Human sciences      html      1
          19        2011-12-30                 Human sciences       pdf      1
          20        2011-12-30                 Human sciences   pdfsite      0
          21        2011-12-30                 Human sciences       toc      0
          22        2011-12-30                 Human sciences    issues      0
          23        2011-12-30                 Human sciences   journal      0
          24        2011-12-31          Agricultural sciences     total  28408
          25        2011-12-31          Agricultural sciences  abstract   1330
          26        2011-12-31          Agricultural sciences      html  18101
          27        2011-12-31          Agricultural sciences       pdf   6296
          28        2011-12-31          Agricultural sciences   pdfsite    982
          29        2011-12-31          Agricultural sciences       toc    551
          ...              ...                            ...       ...    ...
          176002    2018-09-19                Health sciences      html    274
          176003    2018-09-19                Health sciences       pdf    212
          176004    2018-09-19                Health sciences   pdfsite      0
          176005    2018-09-19                Health sciences       toc      2
          176006    2018-09-19                Health sciences    issues      0
          176007    2018-09-19                Health sciences   journal     79
          176008    2018-09-19                 Human sciences     total    189
          176009    2018-09-19                 Human sciences  abstract     12
          176010    2018-09-19                 Human sciences      html     89
          176011    2018-09-19                 Human sciences       pdf     81
          176012    2018-09-19                 Human sciences   pdfsite      0
          176013    2018-09-19                 Human sciences       toc      4
          176014    2018-09-19                 Human sciences    issues      1
          176015    2018-09-19                 Human sciences   journal      2
          176016    2018-09-19  Linguistics, letters and arts     total     11
          176017    2018-09-19  Linguistics, letters and arts  abstract      0
          176018    2018-09-19  Linguistics, letters and arts      html     10
          176019    2018-09-19  Linguistics, letters and arts       pdf      1
          176020    2018-09-19  Linguistics, letters and arts   pdfsite      0
          176021    2018-09-19  Linguistics, letters and arts       toc      0
          176022    2018-09-19  Linguistics, letters and arts    issues      0
          176023    2018-09-19  Linguistics, letters and arts   journal      0
          176024    2018-09-19              Multidisciplinary     total     18
          176025    2018-09-19              Multidisciplinary  abstract      1
          176026    2018-09-19              Multidisciplinary      html     10
          176027    2018-09-19              Multidisciplinary       pdf      7
          176028    2018-09-19              Multidisciplinary   pdfsite      0
          176029    2018-09-19              Multidisciplinary       toc      0
          176030    2018-09-19              Multidisciplinary    issues      0
          176031    2018-09-19              Multidisciplinary   journal      0

In [22]: areas_order = sorted(accesses_by_area["area"].unique())
         with sns.axes_style("whitegrid"):
             sns.FacetGrid(accesses_by_area, row="type", hue="area",
                           hue_order=areas_order[::-1], aspect=4, sharey=False)\
                 .map(sns.lineplot, "date", "count", linewidth=1)\
                 .add_legend(label_order=areas_order)
         for legend_line in plt.gcf().legends[0].legendHandles:
             legend_line.set_linewidth(10)



In [23]: with sns.axes_style("whitegrid"):
             sns.FacetGrid(accesses_by_area[accesses_by_area["date"] >= "2017-09"],
                           row="type", hue="area", hue_order=areas_order[::-1],
                           aspect=4, sharey=False)\
                 .map(sns.lineplot, "date", "count", linewidth=1)\
                 .add_legend(label_order=areas_order)
         for legend_line in plt.gcf().legends[0].legendHandles:
             legend_line.set_linewidth(10)



In [24]: with sns.axes_style("whitegrid"):
             sns.FacetGrid(accesses_by_area[
                               (accesses_by_area["date"] >= "2018-06")
                               & accesses_by_area["type"].isin(["pdf", "pdfsite"])
                           ],
                           row="type", hue="area", hue_order=areas_order[::-1],
                           aspect=4, sharey=False)\
                 .map(sns.lineplot, "date", "count", linewidth=1)\
                 .add_legend(label_order=areas_order)
         for legend_line in plt.gcf().legends[0].legendHandles:
             legend_line.set_linewidth(10)

8.6 Accesses report

A faster approach to get access data would be the accesses_by_journals.csv report file. It doesn’t have the daily access, but a yearly access summary.

In [25]: accesses_by_journals = pd.read_csv("tabs_bra/accesses_by_journals.csv")
         accesses_by_journals.columns

Out [25]: Index(['extraction date', 'study unit', 'collection', 'ISSN SciELO',
                 'ISSN's', 'title at SciELO', 'title thematic areas',
                 'title is agricultural sciences', 'title is applied social sciences',
                 'title is biological sciences', 'title is engineering',
                 'title is exact and earth sciences', 'title is health sciences',
                 'title is human sciences', 'title is linguistics, letters and arts',
                 'title is multidisciplinary', 'title current status',
                 'publishing year', 'accesses year', 'accesses to html',
                 'accesses to abstract', 'accesses to pdf', 'accesses to epdf',
                 'total accesses'],
                dtype='object')

In [26]: accesses_fields = [field for field in accesses_by_journals
                            if "accesses" in field and "to" in field]
         accesses_fields

Out [26]: ['accesses to html',
           'accesses to abstract',
           'accesses to pdf',
           'accesses to epdf',
           'total accesses']

In [27]: ajdata = (accesses_by_journals
             .set_index(["accesses year"] + accesses_fields)
             [areas]
             .rename_axis("area", axis="columns")
             .stack()
             .rename("temp")
         )
         ajdata = ajdata[ajdata == 1].reset_index().drop(columns=["temp"])
         ajdata[::5000]

Out [27]:        accesses  accesses  accesses to  accesses  accesses     total                              area
                     year   to html     abstract    to pdf   to epdf  accesses
          0          2011         2            0        16         0        18  title is applied social sciences
          5000       2015     29792         1331     13466       117     44706           title is human sciences
          10000      2018     47970        18586     32856        53     99465              title is engineering
          15000      2011        12            1        12         0        25              title is engineering
          20000      2013     27117         2954         0         0     30071      title is biological sciences
          25000      2011        22            3         9         0        34    title is agricultural sciences
          30000      2015    208075         6624     91073      1584    307356          title is health sciences
          35000      2017     28125          761     13861        11     42758  title is applied social sciences

In [28]: ajdata_tidy = (
             ajdata
             .groupby(["accesses year", "area"])
             .sum()
             .rename_axis("type", axis="columns")
             .stack()
             .rename("count")
             .reset_index()
             .assign(area=lambda df: df["area"].str.replace("title is ", ""),
                     type=lambda df: df["type"].str.replace("accesses to ", "")
                                               .str.replace(" accesses", ""))
         )
         ajdata_tidy

Out [28]:      accesses year                           area      type      count
          0             2011          agricultural sciences      html      18048
          1             2011          agricultural sciences  abstract       1325
          2             2011          agricultural sciences       pdf       6231
          3             2011          agricultural sciences      epdf          0
          4             2011          agricultural sciences     total      25604
          5             2011        applied social sciences      html       3827
          6             2011        applied social sciences  abstract        180
          7             2011        applied social sciences       pdf       1835
          8             2011        applied social sciences      epdf          0
          9             2011        applied social sciences     total       5842
          10            2011            biological sciences      html      24371
          11            2011            biological sciences  abstract       1074
          12            2011            biological sciences       pdf       4769
          13            2011            biological sciences      epdf          0
          14            2011            biological sciences     total      30214
          15            2011                    engineering      html       5869
          16            2011                    engineering  abstract        477
          17            2011                    engineering       pdf       2174
          18            2011                    engineering      epdf          0
          19            2011                    engineering     total       8520
          20            2011       exact and earth sciences      html       5394
          21            2011       exact and earth sciences  abstract        553
          22            2011       exact and earth sciences       pdf       1678
          23            2011       exact and earth sciences      epdf          0
          24            2011       exact and earth sciences     total       7625
          25            2011                health sciences      html     105118
          26            2011                health sciences  abstract       3279
          27            2011                health sciences       pdf      22651
          28            2011                health sciences      epdf          0
          29            2011                health sciences     total     131048
          ...            ...                            ...       ...        ...
          330           2018                    engineering      html    5342225
          331           2018                    engineering  abstract     893117
          332           2018                    engineering       pdf    5309616
          333           2018                    engineering      epdf       9630
          334           2018                    engineering     total   11554588
          335           2018       exact and earth sciences      html    4215254
          336           2018       exact and earth sciences  abstract     856885
          337           2018       exact and earth sciences       pdf    2865870
          338           2018       exact and earth sciences      epdf       8433
          339           2018       exact and earth sciences     total    7946442
          340           2018                health sciences      html   50244334
          341           2018                health sciences  abstract    6822204
          342           2018                health sciences       pdf   47506109
          343           2018                health sciences      epdf     130127
          344           2018                health sciences     total  104702774
          345           2018                 human sciences      html   25342659
          346           2018                 human sciences  abstract    3333890
          347           2018                 human sciences       pdf   20685944
          348           2018                 human sciences      epdf      44039
          349           2018                 human sciences     total   49406532
          350           2018  linguistics, letters and arts      html    1022854
          351           2018  linguistics, letters and arts  abstract     125274
          352           2018  linguistics, letters and arts       pdf     958338
          353           2018  linguistics, letters and arts      epdf       1834
          354           2018  linguistics, letters and arts     total    2108300
          355           2018              multidisciplinary      html    1807694
          356           2018              multidisciplinary  abstract     283093
          357           2018              multidisciplinary       pdf    1710317
          358           2018              multidisciplinary      epdf       4832
          359           2018              multidisciplinary     total    3805936

In [29]: label_order = sorted(ajdata_tidy["area"].unique())
         with sns.axes_style("darkgrid"):
             sns.FacetGrid(ajdata_tidy, row="type", hue="area",
                           hue_order=label_order[::-1], aspect=4, sharey=False)\
                 .map(sns.lineplot, "accesses year", "count", linewidth=2)\
                 .add_legend(label_order=label_order)
         for legend_line in plt.gcf().legends[0].legendHandles:
             legend_line.set_linewidth(10)


Compared with the Ratchet API data, here we have the epdf information and a different total (the sum of these 4 types).

9 Cleaning / Normalizing the thematic area

In [1]: import pandas as pd
        pd.options.display.max_colwidth = 400

In [2]: %matplotlib inline

9.1 Loading the dataset

In [3]: journals = pd.read_csv("tabs_network/journals.csv")
        journals.columns

Out [3]: Index(['extraction date', 'study unit', 'collection', 'ISSN SciELO',
                'ISSN's', 'title at SciELO', 'title thematic areas',
                'title is agricultural sciences', 'title is applied social sciences',
                'title is biological sciences', 'title is engineering',
                'title is exact and earth sciences', 'title is health sciences',
                'title is human sciences', 'title is linguistics, letters and arts',
                'title is multidisciplinary', 'title current status',
                'title + subtitle SciELO', 'short title SciELO', 'short title ISO',
                'title PubMed', 'publisher name', 'use license', 'alpha frequency',
                'numeric frequency (in months)', 'inclusion year at SciELO',
                'stopping year at SciELO', 'stopping reason',
                'date of the first document', 'volume of the first document',
                'issue of the first document', 'date of the last document',
                'volume of the last document', 'issue of the last document',
                'total of issues', 'issues at 2018', 'issues at 2017',
                'issues at 2016', 'issues at 2015', 'issues at 2014',
                'issues at 2013', 'total of regular issues',
                'regular issues at 2018', 'regular issues at 2017',
                'regular issues at 2016', 'regular issues at 2015',
                'regular issues at 2014', 'regular issues at 2013',
                'total of documents', 'documents at 2018', 'documents at 2017',
                'documents at 2016', 'documents at 2015', 'documents at 2014',
                'documents at 2013', 'citable documents',
                'citable documents at 2018', 'citable documents at 2017',
                'citable documents at 2016', 'citable documents at 2015',
                'citable documents at 2014', 'citable documents at 2013',
                'portuguese documents at 2018 ', 'portuguese documents at 2017 ',
                'portuguese documents at 2016 ', 'portuguese documents at 2015 ',
                'portuguese documents at 2014 ', 'portuguese documents at 2013 ',
                'spanish documents at 2018 ', 'spanish documents at 2017 ',
                'spanish documents at 2016 ', 'spanish documents at 2015 ',
                'spanish documents at 2014 ', 'spanish documents at 2013 ',
                'english documents at 2018 ', 'english documents at 2017 ',
                'english documents at 2016 ', 'english documents at 2015 ',
                'english documents at 2014 ', 'english documents at 2013 ',
                'other language documents at 2018 ',
                'other language documents at 2017 ',
                'other language documents at 2016 ',
                'other language documents at 2015 ',
                'other language documents at 2014 ',
                'other language documents at 2013 ',
                'google scholar h5 2018 ', 'google scholar h5 2017 ',
                'google scholar h5 2016 ', 'google scholar h5 2015 ',
                'google scholar h5 2014 ', 'google scholar h5 2013 ',
                'google scholar m5 2018 ', 'google scholar m5 2017 ',
                'google scholar m5 2016 ', 'google scholar m5 2015 ',
                'google scholar m5 2014 ', 'google scholar m5 2013 '],
               dtype='object')

The column names aren’t helping us with all the small details, like the trailing whitespace in the latter fields. The easiest approach to deal with them is to run this normalization function from the column names simplification notebook. Applying it is straightforward, and the order of the columns is kept as is.

In [4]: def normalize_column_title(name):
            import re
            name_unbracketed = re.sub(r".*\((.*)\)", r"\1",
                                      name.replace("(in months)", "in_months"))
            words = re.sub("[^a-z0-9+_ ]", "", name_unbracketed.lower()).split()
            ignored_words = ("at", "the", "of", "and", "google", "scholar", "+")
            replacements = {
                "document": "doc",
                "documents": "docs",
                "frequency": "freq",
                "language": "lang",
            }
            return "_".join(replacements.get(word, word)
                            for word in words
                            if word not in ignored_words) \
                   .replace("title_is", "is")
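A few usage examples of that function (the expected outputs match the renamed columns shown below):

    normalize_column_title("google scholar h5 2018 ")        # -> 'h5_2018'
    normalize_column_title("numeric frequency (in months)")  # -> 'numeric_freq_in_months'
    normalize_column_title("title is health sciences")       # -> 'is_health_sciences'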

In [5]: journals.rename(columns=normalize_column_title, inplace=True)
        journals.columns

Out [5]: Index(['extraction_date', 'study_unit', 'collection', 'issn_scielo',
                'issns', 'title_scielo', 'title_thematic_areas',
                'is_agricultural_sciences', 'is_applied_social_sciences',
                'is_biological_sciences', 'is_engineering',
                'is_exact_earth_sciences', 'is_health_sciences',
                'is_human_sciences', 'is_linguistics_letters_arts',
                'is_multidisciplinary', 'title_current_status',
                'title_subtitle_scielo', 'short_title_scielo', 'short_iso',
                'title_pubmed', 'publisher_name', 'use_license', 'alpha_freq',
                'numeric_freq_in_months', 'inclusion_year_scielo',
                'stopping_year_scielo', 'stopping_reason', 'date_first_doc',
                'volume_first_doc', 'issue_first_doc', 'date_last_doc',
                'volume_last_doc', 'issue_last_doc', 'total_issues',
                'issues_2018', 'issues_2017', 'issues_2016', 'issues_2015',
                'issues_2014', 'issues_2013', 'total_regular_issues',
                'regular_issues_2018', 'regular_issues_2017',
                'regular_issues_2016', 'regular_issues_2015',
                'regular_issues_2014', 'regular_issues_2013', 'total_docs',
                'docs_2018', 'docs_2017', 'docs_2016', 'docs_2015', 'docs_2014',
                'docs_2013', 'citable_docs', 'citable_docs_2018',
                'citable_docs_2017', 'citable_docs_2016', 'citable_docs_2015',
                'citable_docs_2014', 'citable_docs_2013', 'portuguese_docs_2018',
                'portuguese_docs_2017', 'portuguese_docs_2016',
                'portuguese_docs_2015', 'portuguese_docs_2014',
                'portuguese_docs_2013', 'spanish_docs_2018', 'spanish_docs_2017',
                'spanish_docs_2016', 'spanish_docs_2015', 'spanish_docs_2014',
                'spanish_docs_2013', 'english_docs_2018', 'english_docs_2017',
                'english_docs_2016', 'english_docs_2015', 'english_docs_2014',
                'english_docs_2013', 'other_lang_docs_2018',
                'other_lang_docs_2017', 'other_lang_docs_2016',
                'other_lang_docs_2015', 'other_lang_docs_2014',
                'other_lang_docs_2013', 'h5_2018', 'h5_2017', 'h5_2016',
                'h5_2015', 'h5_2014', 'h5_2013', 'm5_2018', 'm5_2017', 'm5_2016',
                'm5_2015', 'm5_2014', 'm5_2013'],
               dtype='object')

9.2 Thematic areas

At first, it might seem that there are way too many thematic areas:

In [6]: journals["title_thematic_areas"].unique()

Out [6]: array(['Applied Social Sciences', 'Health Sciences', 'Human Sciences',
                'Exact and Earth Sciences', 'Biological Sciences',
                'Agricultural Sciences',
                'Biological Sciences;Exact and Earth Sciences',
                'Engineering;Exact and Earth Sciences',
                'Agricultural Sciences;Biological Sciences',
                'Applied Social Sciences;Human Sciences', 'Engineering',
                'Health Sciences;Human Sciences',
                'Agricultural Sciences;Biological Sciences;Exact and Earth Sciences;Health Sciences',
                'Linguistics, Letters and Arts',
                'Biological Sciences;Health Sciences',
                'Agricultural Sciences;Biological Sciences;Health Sciences',
                'Agricultural Sciences;Biological Sciences;Engineering;Exact and Earth Sciences;Health Sciences;Human Sciences',
                'Agricultural Sciences;Biological Sciences;Engineering;Exact and Earth Sciences;Human Sciences',
                'Agricultural Sciences;Biological Sciences;Engineering;Health Sciences',
                'Applied Social Sciences;Biological Sciences;Human Sciences',
                'Human Sciences;Linguistics, Letters and Arts',
                'Applied Social Sciences;Linguistics, Letters and Arts',
                'Biological Sciences;Human Sciences',
                'Agricultural Sciences;Engineering',
                'Applied Social Sciences;Exact and Earth Sciences',
                'Applied Social Sciences;Human Sciences;Linguistics, Letters and Arts',
                'Agricultural Sciences;Biological Sciences;Engineering',
                'Agricultural Sciences;Biological Sciences;Engineering;Exact and Earth Sciences',
                'Applied Social Sciences;Engineering',
                'Applied Social Sciences;Biological Sciences;Health Sciences;Human Sciences',
                'Applied Social Sciences;Exact and Earth Sciences;Human Sciences',
                'Applied Social Sciences;Biological Sciences;Engineering;Exact and Earth Sciences',
                'Applied Social Sciences;Health Sciences',
                'Biological Sciences;Engineering;Exact and Earth Sciences',
                'Agricultural Sciences;Applied Social Sciences',
                'Agricultural Sciences;Applied Social Sciences;Biological Sciences;Health Sciences;Human Sciences',
                'Agricultural Sciences;Biological Sciences;Engineering;Exact and Earth Sciences;Health Sciences;Human Sciences;Linguistics, Letters and Arts',
                'Agricultural Sciences;Applied Social Sciences;Health Sciences',
                'Biological Sciences;Engineering;Health Sciences',
                'Agricultural Sciences;Applied Social Sciences;Exact and Earth Sciences;Health Sciences;Human Sciences;Linguistics, Letters and Arts',
                'Applied Social Sciences;Health Sciences;Human Sciences',
                'Biological Sciences;Human Sciences;Linguistics, Letters and Arts',
                'Linguistics, Letters and Arts;Applied Social Sciences;Human Sciences',
                'Linguistics, Letters and Arts;Human Sciences',
                'Agricultural Sciences;Exact and Earth Sciences',
                'Agricultural Sciences;Applied Social Sciences;Human Sciences',
                'Agricultural Sciences;Biological Sciences;Exact and Earth Sciences',
                'Linguistics, Letters and Arts;Applied Social Sciences',
                'Agricultural Sciences;Applied Social Sciences;Biological Sciences;Engineering;Exact and Earth Sciences;Health Sciences;Human Sciences',
                'Applied Social Sciences;Biological Sciences;Engineering',
                'Applied Social Sciences;Biological Sciences;Exact and Earth Sciences;Health Sciences',
                nan, 'Psicanalise', 'Human Sciences;Applied Social Sciences',
                'Applied Social Sciences;Engineering;Linguistics, Letters and Arts',
                'Agricultural Sciences;Biological Sciences;Engineering;Exact and Earth Sciences;Health Sciences',
                'Biological Sciences;Engineering;Exact and Earth Sciences;Health Sciences',
                'Exact and Earth Sciences;Human Sciences',
                'Agricultural Sciences;Applied Social Sciences;Biological Sciences;Engineering;Exact and Earth Sciences;Health Sciences;Human Sciences;Linguistics, Letters and Arts'],
               dtype=object)

But, actually, there are just 8 of them; what we’re seeing are their several combinations:

In [7]: set.union(*journals["title_thematic_areas"].str.split(";")
                  .dropna().apply(set).values)

Out [7]: {'Agricultural Sciences',
          'Applied Social Sciences',
          'Biological Sciences',
          'Engineering',
          'Exact and Earth Sciences',
          'Health Sciences',
          'Human Sciences',
          'Linguistics, Letters and Arts',
          'Psicanalise'}

The Psicanalise value isn’t a thematic area; it appears in the psi collection, which is independent (i.e., it’s in the SciELO network but it’s not maintained by SciELO, and its requirements regarding some fields aren’t the same as those of the other collections). Actually, we don’t need to worry so much about this column in this normalization step, since this information is split in the several title is ... columns, which had been renamed here to:

In [8]: areas_map = {
            "Agricultural Sciences": "is_agricultural_sciences",
            "Applied Social Sciences": "is_applied_social_sciences",
            "Biological Sciences": "is_biological_sciences",
            "Engineering": "is_engineering",
            "Exact and Earth Sciences": "is_exact_earth_sciences",
            "Health Sciences": "is_health_sciences",
            "Human Sciences": "is_human_sciences",
            "Linguistics, Letters and Arts": "is_linguistics_letters_arts",
        }
        areas = list(areas_map.values())

9.3 Multidisciplinary

Actually, is_multidisciplinary isn’t a thematic area by itself, but it might be useful, and its meaning can be promptly checked:

9 Cleaning / Normalizing the thematic area — Page 4 / 14 9.4 Consistency between text and flags

In [9]: (
            (journals[areas].sum(axis=1) >= 3)
            != journals["is_multidisciplinary"].apply(bool)
        ).sum()

Out [9]: 0

We have is_multidisciplinary == 1 if and only if the journal has at least 3 areas.

9.4 Consistency between text and flags

Does the title_thematic_areas text match the data in the single-area is_* columns?

In [10]: tta_sets = (
             journals["title_thematic_areas"]
             .fillna("")
             .str.split(";")
             .apply(lambda x: {areas_map[area] for area in x
                               if area in areas_map})
         )
         pd.concat([
             journals[area] != tta_sets.apply((lambda a: lambda x: int(a in x))(area))
             for area in areas
         ], axis=1).any()

Out [10]: 0    False
          1    False
          2    False
          3    False
          4    False
          5    False
          6    False
          7    False
          dtype: bool

Yes, it does, as long as we’re ignoring the already seen Psicanalise value.

9.5 Emptiness

Are there entries without any thematic area?

In [11]: journals[journals[areas].sum(axis=1) == 0]\
             [["issn_scielo", "collection", "title_scielo",
               "title_thematic_areas"]]

Out [11]:      issn_scielo collection                                        title_scielo title_thematic_areas
          1350   0104-3269        psi                                            Mudanças                  NaN
          1351   1516-1854        psi                                           Interação                  NaN
          1352   1679-074X        psi                                       Psicanalítica                  NaN
          1353   1809-8894        psi                                           Mnemosine                  NaN
          1354   1413-0556        psi                          Psicanálise e Universidade          Psicanalise
          1355   1413-4063        psi                                  Psicologia Revista                  NaN
          1356   1806-6631        psi                                Família e Comunidade                  NaN
          1357   0102-7182        psi                              Psicologia & Sociedade                  NaN
          1358   1982-5471        psi                                             Mosaico                  NaN
          1359   0103-863X        psi                            Paidéia (Ribeirão Preto)                  NaN
          1360   0124-4906        psi                               Informes Psicológicos                  NaN
          1362   0104-8023        psi          Revista do Departamento de Psicologia. UFF          Psicanalise
          1363   1516-1498        psi              Ágora: Estudos em Teoria Psicanalítica          Psicanalise
          1364   1415-4714        psi   Revista Latinoamericana de Psicopatologia Fund...                  NaN
          1365   1516-2567        psi                                      Revista Kairós                  NaN
          1366   1676-5478        psi                                            Encontro          Psicanalise
          1367   1516-8530        psi                  Revista Brasileira de Psicoterapia          Psicanalise
          1368   1983-3288        psi                           Psychology & Neuroscience                  NaN
          1371   0102-7972        psi                      Psicologia: Reflexão e Crítica                  NaN
          1396   1981-9145        psi         Revista Brasileira de Psicologia do Esporte                  NaN
          1400   0257-4322        psi                        Revista Cubana de Psicología                  NaN
          1406   1983-0769        psi                          Revista Estudos Lacanianos                  NaN
          1407   1657-9267        psi                            Universitas Psychologica                  NaN
          1408   0121-4381        psi                                    Suma Psicológica                  NaN
          1425   0102-762X        psi                           Distúrbios da Comunicação                  NaN

That includes every entry with the unnormalized Psicanalise value as the thematic area, which will be regarded here as invalid. In [12]: journals[journals["title_thematic_areas"] == "Psicanalise"].shape

Out [12]: (5, 98)

That’s consistent, and all empty/invalid entries are from the psi collection. We could try to fix these entries, but there’s more than a single valid thematic area in that collection: In [13]: psi_areas = journals[journals["collection"] == "psi"].fillna("")\ .groupby("title_thematic_areas").size() psi_areas.plot.barh(figsize=(10, 8), title="Thematic areas in the psi collection") psi_areas

Out [13]: title_thematic_areas
                                                              20
Applied Social Sciences                                       64
Applied Social Sciences;Biological Sciences;Human Sciences     1
Applied Social Sciences;Human Sciences                         4
Biological Sciences                                           15
Health Sciences                                                2
Health Sciences;Human Sciences                                 1
Human Sciences                                                37
Psicanalise                                                    5
dtype: int64

(The unlabeled first row is the empty-string group created by the fillna("") call.)


The most common classification is Applied Social Sciences, but psychology is rooted under Human Sciences in the Lattes knowledge tree[1] (there’s also a full PDF version of it[2] at CNPq, though both are in Brazilian Portuguese), so Human Sciences should be seen as a default/fallback for these empty/invalid entries.

9.6 Consistency within the ISSN

We’ll need the ISSN, so let’s normalize it by applying the snippet from the ISSN normalization notebook: In [14]: issn_scielo_fix = {"0001-6002": "0001-6012", "0258-6444": "2215-3535", "0325-8203": "1668-7027", "0719-448x": "0719-448X", "0797-9789": "1688-499X", "0807-8967": "0870-8967", "0858-6444": "0258-6444", "1315-5216": "1316-5216", "1667-8682": "1667-8982", "1678-5177": "0103-6564", "1683-0789": "1683-0768", "1688-4094": "1688-4221", "1852-4418": "1852-4184", "1980-5438": "0103-5665", "2175-3598": "0104-1282", "2233-7666": "2223-7666", "2237-101X": "1518-3319", "24516600": "2451-6600", "2993-6797": "2393-6797"} journals["issn_scielo"].replace(issn_scielo_fix, inplace=True)

Each journal might have more than one row, since it might appear in more than one collection, but there might be some inconsistency going on, as well. Repeated rows aren’t a big issue, but every inconsistent duplication needs to be fixed. Which ISSNs are inconsistent? That is, which ISSNs are assigned to distinct thematic areas in distinct rows? [1]http://lattes.cnpq.br/web/dgp/arvore-do-conhecimento [2]http://www.cnpq.br/documents/10157/186158/TabeladeAreasdoConhecimento.pdf


In [15]: areas_inconsistency = journals[journals[areas].sum(axis=1) != 0]\ [["issn_scielo"] + areas] \ .groupby("issn_scielo")\ .apply(lambda df: df.apply(lambda col: set(col.dropna())) .apply(len).max() > 1) areas_inconsistency_index = areas_inconsistency[areas_inconsistency].index areas_inconsistency_index

Out [15]: Index(['0011-5258', '0100-512X', '0100-8587', '0101-3300', '0101-9074', '0102-6909', '0103-2070', '0103-5665', '0104-026X', '0104-4478', '0104-7183', '0104-8333', '0104-9313', '0120-0534', '0254-9247', '0717-7194', '0718-6924', '1012-1587', '1413-294X', '1413-8271', '1414-3283', '1414-753X', '1414-9893', '1517-4522', '1518-3319', '1688-4221', '1688-499X', '1794-9998', '1806-6445', '1806-6976', '1981-3821', '2215-3535'], dtype='object', name='issn_scielo')

In [16]: pd.DataFrame( journals[journals["issn_scielo"].isin(areas_inconsistency_index)] .groupby("issn_scielo") .apply(lambda df: {k: v for k, v in df[areas].apply(set) .to_dict().items() if len(v) > 1}) .apply(sorted) # Casts from dictionary (keys) to list .rename("inconsistency") )

Out [16]:                                            inconsistency
issn_scielo
0011-5258     [is_applied_social_sciences, is_human_sciences]
0100-512X     [is_applied_social_sciences]
0100-8587     [is_applied_social_sciences]
0101-3300     [is_applied_social_sciences]
0101-9074     [is_applied_social_sciences, is_human_sciences]
0102-6909     [is_applied_social_sciences, is_human_sciences]
0103-2070     [is_applied_social_sciences, is_human_sciences]
0103-5665     [is_applied_social_sciences, is_human_sciences]
0104-026X     [is_applied_social_sciences]
0104-4478     [is_applied_social_sciences, is_human_sciences]
0104-7183     [is_applied_social_sciences, is_human_sciences]
0104-8333     [is_applied_social_sciences]
0104-9313     [is_applied_social_sciences]
0120-0534     [is_biological_sciences, is_human_sciences]
0254-9247     [is_applied_social_sciences, is_human_sciences]
0717-7194     [is_human_sciences]
0718-6924     [is_applied_social_sciences, is_human_sciences]
1012-1587     [is_linguistics_letters_arts]
1413-294X     [is_applied_social_sciences]
1413-8271     [is_applied_social_sciences, is_human_sciences]
1414-3283     [is_applied_social_sciences, is_health_science...
1414-753X     [is_biological_sciences]
1414-9893     [is_applied_social_sciences, is_human_sciences]
1517-4522     [is_applied_social_sciences, is_human_sciences]
1518-3319     [is_applied_social_sciences, is_human_sciences]
1688-4221     [is_applied_social_sciences]
1688-499X     [is_human_sciences]
1794-9998     [is_applied_social_sciences, is_human_sciences]
1806-6445     [is_human_sciences]
1806-6976     [is_applied_social_sciences, is_health_sciences]
1981-3821     [is_applied_social_sciences, is_human_sciences]
2215-3535     [is_health_sciences]

There seem to be way too many inconsistencies, but let’s simply remove the empty entries before checking again. In [17]: inconsistencies_df = pd.DataFrame( journals[journals["issn_scielo"].isin(areas_inconsistency_index) & journals[areas].sum(axis=1)] .groupby("issn_scielo") .apply(lambda df: sorted(k for k, v in df[areas].apply(set) .to_dict().items() if len(v) > 1) or None) .dropna() .rename("inconsistency") ) inconsistencies_df

Out [17]: inconsistency issn_scielo 0011-5258 [is_applied_social_sciences, is_human_sciences] 0101-9074 [is_applied_social_sciences, is_human_sciences] 0102-6909 [is_applied_social_sciences, is_human_sciences] 0103-2070 [is_applied_social_sciences, is_human_sciences] 0103-5665 [is_applied_social_sciences, is_human_sciences] 0104-4478 [is_applied_social_sciences, is_human_sciences] 0104-7183 [is_applied_social_sciences, is_human_sciences] 0120-0534 [is_biological_sciences, is_human_sciences] 0254-9247 [is_applied_social_sciences, is_human_sciences] 0718-6924 [is_applied_social_sciences, is_human_sciences] 1413-8271 [is_applied_social_sciences, is_human_sciences] 1414-9893 [is_applied_social_sciences, is_human_sciences] 1517-4522 [is_applied_social_sciences, is_human_sciences] 1518-3319 [is_applied_social_sciences, is_human_sciences] 1794-9998 [is_applied_social_sciences, is_human_sciences] 1806-6976 [is_applied_social_sciences, is_health_sciences] 1981-3821 [is_applied_social_sciences, is_human_sciences]

In [18]: inconsistent_rows = ( journals [journals["issn_scielo"].isin(inconsistencies_df.index)] [["issn_scielo", "collection", "title_thematic_areas", "title_current_status"]] .sort_values(by=["issn_scielo", "collection"]) ) inconsistent_rows.set_index(["issn_scielo", "collection"])


Out [18]: title_thematic_areas title_current_status issn_scielo collection 0011-5258 scl Human Sciences current 0011-5258 sss Applied Social Sciences current 0101-9074 scl Human Sciences current 0101-9074 sss Applied Social Sciences current 0102-6909 scl Human Sciences current 0102-6909 sss Applied Social Sciences current 0103-2070 scl Human Sciences current 0103-2070 sss Applied Social Sciences current 0103-5665 psi Applied Social Sciences deceased 0103-5665 psi Applied Social Sciences current 0103-5665 scl Human Sciences suspended 0104-4478 scl Human Sciences current 0104-4478 sss Applied Social Sciences current 0104-7183 scl Human Sciences current 0104-7183 sss Applied Social Sciences current 0120-0534 col Human Sciences current 0120-0534 psi Biological Sciences suspended 0254-9247 per Human Sciences current 0254-9247 psi Applied Social Sciences current 0718-6924 chl Human Sciences current 0718-6924 psi Applied Social Sciences suspended 1413-8271 psi Applied Social Sciences suspended 1413-8271 scl Human Sciences current 1414-9893 psi Applied Social Sciences suspended 1414-9893 scl Human Sciences current 1517-4522 scl Human Sciences current 1517-4522 sss Applied Social Sciences current 1518-3319 scl Human Sciences current 1518-3319 sss Applied Social Sciences current 1794-9998 col Human Sciences current 1794-9998 psi Applied Social Sciences suspended 1806-6976 psi Applied Social Sciences current 1806-6976 rve Health Sciences suspended 1981-3821 scl Human Sciences current 1981-3821 sss Applied Social Sciences current

In [19]: inconsistent_rows.groupby("issn_scielo")["collection"]\ .apply(set).value_counts()

Out [19]: {sss, scl} 9 {scl, psi} 3 {col, psi} 2 {psi, per} 1 {chl, psi} 1 {rve, psi} 1 Name: collection, dtype: int64

The above shows that, internal to each collection, the thematic area is always consistent in the 2018-09-14 reports. However, distinct collections sometimes classify some journals differently. Most entries regarding this issue are in both the now discontinued sss collection (Social Sciences) and the scl collection (Brazil); in these cases we should stick with the value given by the scl collection, since it’s probably the updated value. The entries with both psi and scl have the journal either suspended or deceased in psi, so we should also use the value in the scl entry. The same happens in the col-psi and chl-psi pairs.


There’s a single entry active in both psi and per, but since psychology belongs to the Human Sciences area (as seen in the Emptiness section of this notebook), we should take care when a psychology collection entry is regarded as Applied Social Sciences. Actually, we should use the thematic area classification from the per collection, as the journal is clearly about psychology: In [20]: journals[journals["issn_scielo"] == "0254-9247"][[ "collection", "title_thematic_areas", "title_current_status", "title_scielo", "title_subtitle_scielo", "short_title_scielo", "title_pubmed", "publisher_name", "short_iso"]].T

Out [20]:
                                                   1262                                               1378
collection                                          per                                                psi
title_thematic_areas                     Human Sciences                            Applied Social Sciences
title_current_status                            current                                            current
title_scielo               Revista de Psicología (PUCP)                       Revista de Psicología (Lima)
title_subtitle_scielo      Revista de Psicología (PUCP)                       Revista de Psicología (Lima)
short_title_scielo                Revista de Psicología                                Rev. psicol. (Lima)
title_pubmed                                        NaN                                                NaN
publisher_name         Pontificia Universidad Católica   Pontificia Universidad Católica del Perú. Depa...
                                               del Perú
short_iso                         Revista de Psicología                                Rev. psicol. (Lima)

The only pair missing is the one regarding two thematic collections: In [21]: journals[journals["issn_scielo"] == "1806-6976"][[ "collection", "title_thematic_areas", "title_current_status", "title_scielo", "title_subtitle_scielo", "short_title_scielo", "title_pubmed", "publisher_name", "short_iso"]].T

Out [21]:
                                                                     1453                                                1499
collection                                                            psi                                                 rve
title_thematic_areas                              Applied Social Sciences                                     Health Sciences
title_current_status                                              current                                           suspended
title_scielo           SMAD. Revista eletrônica saúde mental álcool e...  SMAD. Revista eletrônica saúde mental álcool e...
title_subtitle_scielo  SMAD. Revista eletrônica saúde mental álcool e...  SMAD. Revista eletrônica saúde mental álcool e...
short_title_scielo     SMAD, Rev. Eletrônica Saúde Mental Álcool Drog...  SMAD, Rev. Eletrônica Saúde Mental Álcool Drog...
title_pubmed                                                          NaN                                                 NaN
publisher_name         Universidade de São Paulo, Escola de Enfermage...                                            USP/EERP
short_iso              SMAD, Rev. Eletrônica Saúde Mental Álcool Drog...  SMAD, Rev. Eletrônica Saúde Mental Álcool Drog...

As the journal title translated to English means something like Mental health, alcohol and drugs e-journal, it’s hard to tell whether it’s more about psychology or some health science; the name might be misleading, it might be both, and neither alternative includes Human Sciences. The easiest approach for this normalization is: if a journal has distinct thematic areas in different collections, stick with the entry in the certified and currently maintained collection, or in rve. That suffices in our case, and it chooses exactly the entries discriminated above.


9.7 Normalizing

The goal is to fill the empty data with Human Sciences, and to use the information from a single row when there’s more than one row with distinct areas, giving the sss and psi collections lower priority when there’s a conflict. That can be done on the title_thematic_areas column: In [22]: tta_map = journals.groupby("issn_scielo").apply( lambda df: df.assign(title_thematic_areas=df["title_thematic_areas"] .replace("Psicanalise", "Human Sciences") .fillna("Human Sciences"), order=df["collection"].isin(["sss", "psi"]) | (df["title_thematic_areas"] == "Psicanalise") | df["title_thematic_areas"].isna()) .sort_values("order")["title_thematic_areas"].iloc[0] ) tta_text_n = journals["issn_scielo"].map(tta_map) \ .rename("title_thematic_areas") tta_text_n.head()

Out [22]: 0 Applied Social Sciences 1 Health Sciences 2 Human Sciences 3 Exact and Earth Sciences 4 Health Sciences Name: title_thematic_areas, dtype: object
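Why sorting on order works: that column is False for the preferred rows and True for the rows from low-priority collections (or with empty/Psicanalise areas); False sorts before True, so .iloc[0] picks the preferred row. A toy version of the same trick (the toy frame below is ours, for illustration only):

toy = pd.DataFrame({"collection": ["sss", "scl"],
                    "title_thematic_areas": ["Applied Social Sciences", "Human Sciences"]})
toy["order"] = toy["collection"].isin(["sss", "psi"])
print(toy.sort_values("order")["title_thematic_areas"].iloc[0])  # Human Sciences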

This normalized text column can be used to re-build the several flag columns: In [23]: tta_list_n = tta_text_n.str.split(";") tta_n = pd.DataFrame(tta_text_n).assign(**{ area: tta_list_n.apply((lambda n: lambda entries: int(n in entries))(name)) for name, area in areas_map.items() }).assign( is_multidisciplinary=lambda df: (df[areas].sum(axis=1) >= 3).map(int) ) tta_n.head().T

Out [23]:
                                                   0                1               2                         3                4
title_thematic_areas         Applied Social Sciences  Health Sciences  Human Sciences  Exact and Earth Sciences  Health Sciences
is_agricultural_sciences                           0                0               0                         0                0
is_applied_social_sciences                         1                0               0                         0                0
is_biological_sciences                             0                0               0                         0                0
is_engineering                                     0                0               0                         0                0
is_exact_earth_sciences                            0                0               0                         1                0
is_health_sciences                                 0                1               0                         0                1
is_human_sciences                                  0                0               1                         0                0
is_linguistics_letters_arts                        0                0               0                         0                0
is_multidisciplinary                               0                0               0                         0                0

Which can be used to directly normalize the dataset: In [24]: journals_n = journals.assign(**tta_n) journals_n.shape


Out [24]: (1732, 98)

How many empty thematic area entries are there? In [25]: journals_n[journals_n[areas].sum(axis=1) == 0].shape[0]

Out [25]: 0

Are there any ISSN with inconsistent thematic areas? In [26]: journals["issn_scielo"].drop_duplicates().shape

Out [26]: (1653,)

In [27]: journals_n.groupby("issn_scielo")[areas].apply( lambda df: len(df.drop_duplicates()) ).value_counts()

Out [27]: 1 1653 dtype: int64

All distinct ISSNs in this new journals_n have only one set of thematic areas, so it’s consistent.
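For the record, the same verification can be condensed into a single assertion (an equivalent formulation, not a cell from the report): each flag column must have at most one distinct value per ISSN.

assert journals_n.groupby("issn_scielo")[areas].nunique().le(1).all().all()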

9.8 Summary

A full snippet for thematic area normalization is: areas_map = { "Agricultural Sciences": "is_agricultural_sciences", "Applied Social Sciences": "is_applied_social_sciences", "Biological Sciences": "is_biological_sciences", "Engineering": "is_engineering", "Exact and Earth Sciences": "is_exact_earth_sciences", "Health Sciences": "is_health_sciences", "Human Sciences": "is_human_sciences", "Linguistics, Letters and Arts": "is_linguistics_letters_arts", } areas = list(areas_map.values())

issn_scielo_fix = {"0001-6002": "0001-6012", "0258-6444": "2215-3535", "0325-8203": "1668-7027", "0719-448x": "0719-448X", "0797-9789": "1688-499X", "0807-8967": "0870-8967", "0858-6444": "0258-6444", "1315-5216": "1316-5216", "1667-8682": "1667-8982", "1678-5177": "0103-6564", "1683-0789": "1683-0768", "1688-4094": "1688-4221", "1852-4418": "1852-4184", "1980-5438": "0103-5665", "2175-3598": "0104-1282", "2233-7666": "2223-7666", "2237-101X": "1518-3319", "24516600": "2451-6600", "2993-6797": "2393-6797"}

def normalize_column_title(name): import re name_unbracketed = re.sub(r".*\((.*)\)", r"\1", name.replace("(in months)", "in_months")) words = re.sub("[^a-z0-9+_ ]", "", name_unbracketed.lower()).split() ignored_words = ("at", "the", "of", "and", "google", "scholar", "+") replacements = { "document": "doc", "documents": "docs", "frequency": "freq", "language": "lang", } return "_".join(replacements.get(word, word) for word in words if word not in ignored_words) \ .replace("title_is", "is")

# Load the data journals = pd.read_csv("tabs_network/journals.csv")

# Column names and ISSN normalization journals.rename(columns=normalize_column_title, inplace=True) journals["issn_scielo"].replace(issn_scielo_fix, inplace=True)

# Thematic area normalization tta_map = journals.groupby("issn_scielo").apply( lambda df: df.assign(title_thematic_areas=df["title_thematic_areas"] .replace("Psicanalise", "Human Sciences") .fillna("Human Sciences"), order=df["collection"].isin(["sss", "psi"]) | (df["title_thematic_areas"] == "Psicanalise") | df["title_thematic_areas"].isna()) .sort_values("order")["title_thematic_areas"].iloc[0] ) tta_text_n = journals["issn_scielo"].map(tta_map) \ .rename("title_thematic_areas") tta_list_n = tta_text_n.str.split(";") tta_n = pd.DataFrame(tta_text_n).assign(**{ area: tta_list_n.apply((lambda n: lambda entries: int(n in entries))(name)) for name, area in areas_map.items() }).assign( is_multidisciplinary=lambda df: (df[areas].sum(axis=1) >= 3).map(int) ) journals = journals.assign(**tta_n)

It also normalizes the column names and the issn_scielo column (former ISSN SciELO), as these are required in order to normalize the thematic areas.
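If the normalized table is to be reused by the other notebooks, it might be persisted right after the snippet runs (the file name below is our choice, not something prescribed by the report):

journals.to_csv("journals_normalized.csv", index=False)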

10 Hirsch indices from Google Scholar

What’s going on with the Google Scholar indices (h5-index or h5, h5-median or m5) of the journals in the SciELO network for each subject area? In [1]: from collections import Counter from itertools import accumulate, chain from statistics import median

In [2]: import matplotlib.pyplot as plt import networkx as nx import numpy as np import pandas as pd import seaborn as sns

In [3]: pd.options.display.max_colwidth = 400 pd.options.display.max_rows = 150 sns.set() # Plot style %matplotlib inline

In [4]: # As of 2018-09-18, Seaborn 0.9.0 uses Scipy 1.1.0, # and "scipy.stats" uses a deprecated indexing style. # It's just to avoid an annoying warning message, # and has nothing to do with the code of this notebook. import warnings warnings.filterwarnings("ignore", category=FutureWarning)

10.1 Loading the dataset

Let’s load the dataset with all journals in the SciELO network. In [5]: journals = pd.read_csv("tabs_network/journals.csv")

10.1.1 Column normalization

These are the column names in the raw CSV file: In [6]: journals.columns

Out [6]: Index(['extraction date', 'study unit', 'collection', 'ISSN SciELO', 'ISSN's', 'title at SciELO', 'title thematic areas', 'title is agricultural sciences', 'title is applied social sciences', 'title is biological sciences', 'title is engineering', 'title is exact and earth sciences', 'title is health sciences', 'title is human sciences', 'title is linguistics, letters and arts', 'title is multidisciplinary', 'title current status', 'title + subtitle SciELO', 'short title SciELO', 'short title ISO', 'title PubMed', 'publisher name', 'use license', 'alpha frequency', 'numeric frequency (in months)', 'inclusion year at SciELO', 'stopping year at SciELO', 'stopping reason', 'date of the first document', 'volume of the first document', 'issue of the first document', 'date of the last document', 'volume of the last document', 'issue of the last document', 'total of issues', 'issues at 2018', 'issues at 2017', 'issues at 2016', 'issues at 2015', 'issues at 2014', 'issues at 2013', 'total of regular issues', 'regular issues at 2018',


'regular issues at 2017', 'regular issues at 2016', 'regular issues at 2015', 'regular issues at 2014', 'regular issues at 2013', 'total of documents', 'documents at 2018', 'documents at 2017', 'documents at 2016', 'documents at 2015', 'documents at 2014', 'documents at 2013', 'citable documents', 'citable documents at 2018', 'citable documents at 2017', 'citable documents at 2016', 'citable documents at 2015', 'citable documents at 2014', 'citable documents at 2013', 'portuguese documents at 2018 ', 'portuguese documents at 2017 ', 'portuguese documents at 2016 ', 'portuguese documents at 2015 ', 'portuguese documents at 2014 ', 'portuguese documents at 2013 ', 'spanish documents at 2018 ', 'spanish documents at 2017 ', 'spanish documents at 2016 ', 'spanish documents at 2015 ', 'spanish documents at 2014 ', 'spanish documents at 2013 ', 'english documents at 2018 ', 'english documents at 2017 ', 'english documents at 2016 ', 'english documents at 2015 ', 'english documents at 2014 ', 'english documents at 2013 ', 'other language documents at 2018 ', 'other language documents at 2017 ', 'other language documents at 2016 ', 'other language documents at 2015 ', 'other language documents at 2014 ', 'other language documents at 2013 ', 'google scholar h5 2018 ', 'google scholar h5 2017 ', 'google scholar h5 2016 ', 'google scholar h5 2015 ', 'google scholar h5 2014 ', 'google scholar h5 2013 ', 'google scholar m5 2018 ', 'google scholar m5 2017 ', 'google scholar m5 2016 ', 'google scholar m5 2015 ', 'google scholar m5 2014 ', 'google scholar m5 2013 '], dtype='object')

The raw column names are unwieldy, with small details like the trailing whitespace in the fields whose names start with google scholar. The easiest approach is to run this normalization function from the column names simplification notebook: In [7]: def normalize_column_title(name): import re name_unbracketed = re.sub(r".*\((.*)\)", r"\1", name.replace("(in months)", "in_months")) words = re.sub("[^a-z0-9+_ ]", "", name_unbracketed.lower()).split() ignored_words = ("at", "the", "of", "and", "google", "scholar", "+") replacements = { "document": "doc", "documents": "docs", "frequency": "freq", "language": "lang", } return "_".join(replacements.get(word, word) for word in words if word not in ignored_words) \ .replace("title_is", "is")
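A few illustrative calls show what the function does (inputs taken from the raw column list above; this demo cell is our addition, with the expected outputs following from the rules in the function body):

print(normalize_column_title("google scholar h5 2017 "))        # h5_2017
print(normalize_column_title("numeric frequency (in months)"))  # numeric_freq_in_months
print(normalize_column_title("title is health sciences"))       # is_health_sciences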

Applying it is straightforward, and the order of the columns is kept as is. In [8]: journals.rename(columns=normalize_column_title, inplace=True) journals.columns

Out [8]: Index(['extraction_date', 'study_unit', 'collection', 'issn_scielo', 'issns', 'title_scielo', 'title_thematic_areas', 'is_agricultural_sciences',


'is_applied_social_sciences', 'is_biological_sciences', 'is_engineering', 'is_exact_earth_sciences', 'is_health_sciences', 'is_human_sciences', 'is_linguistics_letters_arts', 'is_multidisciplinary', 'title_current_status', 'title_subtitle_scielo', 'short_title_scielo', 'short_iso', 'title_pubmed', 'publisher_name', 'use_license', 'alpha_freq', 'numeric_freq_in_months', 'inclusion_year_scielo', 'stopping_year_scielo', 'stopping_reason', 'date_first_doc', 'volume_first_doc', 'issue_first_doc', 'date_last_doc', 'volume_last_doc', 'issue_last_doc', 'total_issues', 'issues_2018', 'issues_2017', 'issues_2016', 'issues_2015', 'issues_2014', 'issues_2013', 'total_regular_issues', 'regular_issues_2018', 'regular_issues_2017', 'regular_issues_2016', 'regular_issues_2015', 'regular_issues_2014', 'regular_issues_2013', 'total_docs', 'docs_2018', 'docs_2017', 'docs_2016', 'docs_2015', 'docs_2014', 'docs_2013', 'citable_docs', 'citable_docs_2018', 'citable_docs_2017', 'citable_docs_2016', 'citable_docs_2015', 'citable_docs_2014', 'citable_docs_2013', 'portuguese_docs_2018', 'portuguese_docs_2017', 'portuguese_docs_2016', 'portuguese_docs_2015', 'portuguese_docs_2014', 'portuguese_docs_2013', 'spanish_docs_2018', 'spanish_docs_2017', 'spanish_docs_2016', 'spanish_docs_2015', 'spanish_docs_2014', 'spanish_docs_2013', 'english_docs_2018', 'english_docs_2017', 'english_docs_2016', 'english_docs_2015', 'english_docs_2014', 'english_docs_2013', 'other_lang_docs_2018', 'other_lang_docs_2017', 'other_lang_docs_2016', 'other_lang_docs_2015', 'other_lang_docs_2014', 'other_lang_docs_2013', 'h5_2018', 'h5_2017', 'h5_2016', 'h5_2015', 'h5_2014', 'h5_2013', 'm5_2018', 'm5_2017', 'm5_2016', 'm5_2015', 'm5_2014', 'm5_2013'], dtype='object')

10.1.2 Thematic areas

Based on the thematic area normalization notebook, these are the columns for each thematic area after column name normalization: In [9]: areas_map = { "Agricultural Sciences": "is_agricultural_sciences", "Applied Social Sciences": "is_applied_social_sciences", "Biological Sciences": "is_biological_sciences", "Engineering": "is_engineering", "Exact and Earth Sciences": "is_exact_earth_sciences", "Health Sciences": "is_health_sciences", "Human Sciences": "is_human_sciences", "Linguistics, Letters and Arts": "is_linguistics_letters_arts", } areas = list(areas_map.values()) areas

Out [9]: ['is_agricultural_sciences', 'is_applied_social_sciences', 'is_biological_sciences', 'is_engineering', 'is_exact_earth_sciences', 'is_health_sciences', 'is_human_sciences', 'is_linguistics_letters_arts']

One missing is is_multidisciplinary, which is 1 if the journal has at least 3 of the distinct thematic areas above, otherwise it’s 0. In [10]: areaswm = areas + ["is_multidisciplinary"]

10.1.3 Normalization

We’ll need the ISSN and the thematic areas, so these should be normalized. The code below follows the normalization notebooks for these fields. In [11]: # ISSN normalization issn_scielo_fix = {"0001-6002": "0001-6012", "0258-6444": "2215-3535", "0325-8203": "1668-7027", "0719-448x": "0719-448X", "0797-9789": "1688-499X", "0807-8967": "0870-8967", "0858-6444": "0258-6444", "1315-5216": "1316-5216", "1667-8682": "1667-8982", "1678-5177": "0103-6564", "1683-0789": "1683-0768", "1688-4094": "1688-4221", "1852-4418": "1852-4184", "1980-5438": "0103-5665", "2175-3598": "0104-1282", "2233-7666": "2223-7666", "2237-101X": "1518-3319", "24516600": "2451-6600", "2993-6797": "2393-6797"} journals["issn_scielo"].replace(issn_scielo_fix, inplace=True)

In [12]: # Thematic area normalization tta_map = journals.groupby("issn_scielo").apply( lambda df: df.assign(title_thematic_areas=df["title_thematic_areas"] .replace("Psicanalise", "Human Sciences") .fillna("Human Sciences"), order=df["collection"].isin(["sss", "psi"]) | (df["title_thematic_areas"] == "Psicanalise") | df["title_thematic_areas"].isna()) .sort_values("order")["title_thematic_areas"].iloc[0] ) tta_text_n = journals["issn_scielo"].map(tta_map) \ .rename("title_thematic_areas") tta_list_n = tta_text_n.str.split(";") tta_n = pd.DataFrame(tta_text_n).assign(**{ area: tta_list_n.apply((lambda n: lambda entries: int(n in entries))(name)) for name, area in areas_map.items() }).assign( is_multidisciplinary=lambda df: (df[areas].sum(axis=1) >= 3).map(int) ) journals = journals.assign(**tta_n)

10.1.4 Selecting the Google Scholar fields

The fields regarding the h5/m5 indices from Google Scholar are:


In [13]: h5_fields = sorted(k for k in journals.columns if k.startswith("h5")) m5_fields = sorted(k for k in journals.columns if k.startswith("m5")) h5_fields, m5_fields

Out [13]: (['h5_2013', 'h5_2014', 'h5_2015', 'h5_2016', 'h5_2017', 'h5_2018'], ['m5_2013', 'm5_2014', 'm5_2015', 'm5_2016', 'm5_2017', 'm5_2018'])

Are the h5/m5 fields consistent? In [14]: h5m5_inconsistency = journals.groupby("issn_scielo")[h5_fields + m5_fields] \ .apply(lambda df: df.apply(lambda col: set(col.dropna())) .apply(len).max() > 1) h5m5_inconsistency[h5m5_inconsistency]

Out [14]: Series([], dtype: bool)

Yes, they are! The only requirement we have regarding these fields is to ignore any possible NaN values. Does every column have information? In [15]: h5m5_is_empty = journals[h5_fields + m5_fields].count() == 0 h5m5_empty_list = list(h5m5_is_empty[h5m5_is_empty].index) h5m5_empty_list

Out [15]: ['h5_2018', 'm5_2018']

No! These two empty columns shown above have no use for us. In [16]: h5_fields = [k for k in h5_fields if k not in h5m5_empty_list] m5_fields = [k for k in m5_fields if k not in h5m5_empty_list] gs_fields = list(chain(*zip(h5_fields, m5_fields))) gs_fields

Out [16]: ['h5_2013', 'm5_2013', 'h5_2014', 'm5_2014', 'h5_2015', 'm5_2015', 'h5_2016', 'm5_2016', 'h5_2017', 'm5_2017']

10.1.5 Data de-duplication (building the dataset)

A missing normalization step addresses this issue: there should be no more than a single entry for each journal. Since there are many journals that appear in more than one collection, we have data duplication. The normalization step already enforced that all of these rows have the same thematic areas, besides a common issn_scielo that can be regarded as the primary key. As we’ve seen in the Google Scholar indices section, the indices are consistent, as long as we coalesce the NaN entries. Now that we’re aware of that, let’s remove the duplicate entries. The columns we’ll need are (these names are the ones after the renaming step): In [17]: columns = ["issn_scielo"] + gs_fields + areaswm columns


Out [17]: ['issn_scielo', 'h5_2013', 'm5_2013', 'h5_2014', 'm5_2014', 'h5_2015', 'm5_2015', 'h5_2016', 'm5_2016', 'h5_2017', 'm5_2017', 'is_agricultural_sciences', 'is_applied_social_sciences', 'is_biological_sciences', 'is_engineering', 'is_exact_earth_sciences', 'is_health_sciences', 'is_human_sciences', 'is_linguistics_letters_arts', 'is_multidisciplinary']

Since everything is already normalized, getting the dataset is straightforward: In [18]: dataset = journals[columns].groupby("issn_scielo").agg("max") dataset.head().T

Out [18]: issn_scielo 0001-3714 0001-3765 0001-6012 0001-6365 0002-0591 h5_2013 NaN 19.0 NaN NaN 3.0 m5_2013 NaN 28.0 NaN NaN 5.0 h5_2014 NaN 19.0 NaN NaN 3.0 m5_2014 NaN 31.0 NaN NaN 5.0 h5_2015 NaN 19.0 NaN NaN 4.0 m5_2015 NaN 24.0 NaN NaN 4.0 h5_2016 NaN 18.0 6.0 NaN 5.0 m5_2016 NaN 25.0 7.0 NaN 5.0 h5_2017 NaN 16.0 7.0 NaN 5.0 m5_2017 NaN 19.0 11.0 NaN 6.0 is_agricultural_sciences 1.0 1.0 0.0 0.0 0.0 is_applied_social_sciences 0.0 0.0 0.0 0.0 0.0 is_biological_sciences 1.0 1.0 0.0 0.0 0.0 is_engineering 0.0 1.0 0.0 0.0 0.0 is_exact_earth_sciences 0.0 1.0 0.0 0.0 0.0 is_health_sciences 0.0 1.0 1.0 1.0 0.0 is_human_sciences 0.0 1.0 1.0 0.0 1.0 is_linguistics_letters_arts 0.0 0.0 0.0 0.0 0.0 is_multidisciplinary 0.0 1.0 0.0 0.0 0.0
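A note on why agg("max") is a safe row coalescer here: within each issn_scielo group the non-NaN values of a column are all equal (as checked before), and max skips NaNs, so it simply picks whichever value is available. A made-up two-row example:

toy = pd.DataFrame({"issn": ["X", "X"], "h5_2017": [np.nan, 7.0]})
print(toy.groupby("issn").agg("max"))  # h5_2017 is 7.0 for X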

There are a lot of NaNs! Let’s see what we can do with that information.

10.2 Understanding the Hirsch index

The dataset is ready, but before we start exploring the data, it makes sense to know what the Google Scholar indices columns mean.


10.2.1 Definition

In the Google Scholar top publications web page[1] we can see the definition of the h5-index and h5-median (the h5_* and m5_* columns of the dataset, respectively). It defines:
h5-index is the h-index for articles published in the last 5 complete years. It is the largest number h such that h articles published in 2013-2017 have at least h citations each.
h5-median for a publication is the median number of citations for the articles that make up its h5-index.
The Hirsch index for a graph is:

$$h_{\mathrm{journal}} = \max_{\mathrm{node}_i \,\in\, \mathrm{journal}} \min\Bigg(\text{in-degree}(\mathrm{node}_i),\ \sum_{\substack{\mathrm{node}_j \,\in\, \mathrm{journal} \\ \text{in-degree}(\mathrm{node}_j) \,\geq\, \text{in-degree}(\mathrm{node}_i)}} 1\Bigg)$$

The Google Scholar h5-index is then just the Hirsch index for a journal in the directed graph of articles (nodes) connected by citations (edges) in the last 5 complete years, including the edges coming from other journals. The in-degree of an article is the number of citations it received in the network. And, by the above definition, the Google Scholar h5-median is:

$$\operatorname*{median}_{\substack{\mathrm{node}_i \,\in\, \mathrm{journal} \\ \text{in-degree}(\mathrm{node}_i) \,\geq\, h_{\mathrm{journal}}}} \big[\text{in-degree}(\mathrm{node}_i)\big]$$

Which means that h5-median ≥ h5-index.

10.2.2 Simplified explanation of the Hirsch index definition

The mathematical definition of the index might look complicated, but thinking about it iteratively and avoiding the graph theory jargon (degree and node) makes its logic quite straightforward (a plain-Python version follows the list):
• The context is the citation network of all articles from 2013 to 2017 in Google Scholar;
• Let’s select the nodes of a specific journal J;
• The Hirsch index couldn’t be less than 0, so let’s say it’s at least 0;
• Is there at least 1 article from J receiving at least 1 citation in the network? If yes, then the Hirsch index is at least 1;
• Are there at least 2 articles from J receiving at least 2 citations in the network? If yes, then the Hirsch index is at least 2;
• ...
• Are there at least h articles from J receiving at least h citations in the network? If yes, then the Hirsch index is at least h;
• Are there at least h + 1 articles from J receiving at least h + 1 citations in the network? If not, then the Hirsch index is smaller than h + 1, and since it’s a whole number, it’s the last ceiling we’ve found: h.
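To make that concrete, here is the same iterative logic as a minimal plain-Python sketch over a list of per-article citation counts (this helper and its name are ours, not part of the report’s notebooks):

def h_index_from_citations(citations):
    h = 0
    for count in sorted(citations, reverse=True):
        if count <= h:  # fewer than h + 1 articles have at least h + 1 citations
            break
        h += 1
    return h

assert h_index_from_citations([4, 1, 0, 0, 0, 0]) == 1     # first example in 10.2.4
assert h_index_from_citations([5, 4, 2, 1, 1, 1, 0]) == 2  # second example in 10.2.4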

10.2.3 Implementation using a NetworkX directed graph

Given the citation graph as an instance of nx.DiGraph and a selected set of nodes (the journal’s articles in the graph), the following functions calculate the h5-index and h5-median. In [19]: def h_index(graph, nodes=None): if nodes is None: nodes = graph.nodes degree_counts = Counter(degree for node, degree in graph.in_degree if node in nodes) degrees, counts = zip(*sorted(degree_counts.items(), reverse=True)) cum_counts = accumulate(counts) return max(min(degree, cc) for degree, cc in zip(degrees, cum_counts))

[1]https://scholar.google.com/citations?view_op=top_venues

In [20]: def h_median(graph, nodes=None, h=None): if nodes is None: nodes = graph.nodes if h is None: h = h_index(graph, nodes) return median(degree for node, degree in graph.in_degree if node in nodes and degree >= h)

By default, these functions use all nodes in the graph, and since the median calculation depends on the Hirsch index, it’s also an optional input (by default, it’ll calculate the index).
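A hypothetical usage of both parameters, computing the indices of a single journal inside a shared citation graph (the graph and node set below are made up for illustration; nx is the networkx import from the cells above):

g = nx.DiGraph([(1, 0), (2, 0), (2, 1), (3, 1)])  # 1 and 2 cite 0; 2 and 3 cite 1
journal_a = {0, 2}  # articles belonging to a hypothetical journal A
h = h_index(g, journal_a)  # only article 0 (2 citations) counts toward the core
print(h, h_median(g, journal_a, h))  # 1 2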

10.2.4 Example

This dataset doesn’t have a full citation graph including citations from external journals, so we will just create arbitrary/random graphs as examples for the calculation of these indices. Suppose we have a citation network of a single journal, like this: In [21]: arbitrary_citation_graph = nx.DiGraph([(1, 0), (2, 1), (3, 0), (4, 0), (5, 0)]) nx.draw(arbitrary_citation_graph, pos=nx.spring_layout(arbitrary_citation_graph, seed=42), # "random_state" in older NetworkX font_color="w", with_labels=True) print(f"h5-index: {h_index(arbitrary_citation_graph)}") print(f"h5-median: {h_median(arbitrary_citation_graph)}")

h5-index: 1 h5-median: 2.5


The number of citations for each numbered node from 0 to 5 is: In [22]: [degree for node, degree in sorted(arbitrary_citation_graph.in_degree)]

Out [22]: [4, 1, 0, 0, 0, 0]

The Hirsch index couldn’t be 2 as there’s only a single article with more than a single citation. Another example: In [23]: random_citation_graph = nx.gn_graph(15, seed=0) nx.draw(random_citation_graph, pos=nx.spring_layout(random_citation_graph, seed=42), font_color="w", with_labels=True) print(f"h5-index: {h_index(random_citation_graph)}") print(f"h5-median: {h_median(random_citation_graph)}")

h5-index: 2 h5-median: 4


For each numbered node sorted in ascending order, the number of citations is: In [24]: [degree for node, degree in sorted(random_citation_graph.in_degree)]

Out [24]: [1, 5, 4, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

The only entries greater than or equal to 2 (the Hirsch index) are [5, 4, 2], whose median is 4. If the entry with 2 citations had received 3 citations instead, this graph would have a Hirsch index of 3. Let the even and odd indexed nodes be from two distinct journals; then this analysis for each journal would be (verified in the cell below):
• [1, 4, 1, 0, 0, 0, 0, 0] citations for each article, h5-index = 1 and h5-median = 1;
• [5, 2, 0, 0, 1, 0, 0] citations for each article, h5-index = 2 and h5-median = 3.5.
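That split-journal analysis can be checked with the functions defined earlier by selecting each half of the nodes (this verification cell is our addition):

evens = {node for node in random_citation_graph if node % 2 == 0}
odds = set(random_citation_graph) - evens
for nodes in (evens, odds):
    h = h_index(random_citation_graph, nodes)
    print(h, h_median(random_citation_graph, nodes, h))
# prints: 1 1
#         2 3.5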

10.3 Exploratory data analysis

Though we have the thematic area information for every journal, we don’t have the Google Scholar indices for all of them: In [25]: print(dataset.shape) dataset.count()

(1653, 19)

Out [25]: h5_2013 265 m5_2013 265 h5_2014 266 m5_2014 266 h5_2015 803 m5_2015 803 h5_2016 772


m5_2016 772 h5_2017 923 m5_2017 923 is_agricultural_sciences 1653 is_applied_social_sciences 1653 is_biological_sciences 1653 is_engineering 1653 is_exact_earth_sciences 1653 is_health_sciences 1653 is_human_sciences 1653 is_linguistics_letters_arts 1653 is_multidisciplinary 1653 dtype: int64

The h5 and m5 fields come in pairs: having one means we have the other. We have that index for ≈ 55.8% of the data in 2017. In [26]: dataset["h5_2017"].count() / dataset.shape[0]

Out [26]: 0.55837870538415

We should emphasize the 2017 data, as it has way more information than the rest of the data.

10.3.1 Data summary

Let’s get the most usual descriptive statistics for single areas in this dataset. The per-area counts sum to more than the total number of journals, as several journals belong to more than a single area. In [27]: ddata = pd.concat([ dataset[dataset[area] == 1][gs_fields].describe().assign(area=area[3:]) for area in areaswm ]) ddata

Out [27]: [ddata: the describe() output (count, mean, std, min, 25%, 50%, 75%, max) of each h5/m5 field, for every thematic area. The table is too wide to be reproduced here; each of its statistics is shown as a separate heat map and table in the subsections below.]


10.3.2 Heat map plotting

This function is a helper for the heat map plots of the subsections that follow. In [28]: def hmap(data, title): fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 6), sharey=True) sns.heatmap(data[h5_fields], ax=ax1, cmap="rainbow", annot=True, fmt="g", cbar=False) ax1.set(title=f"h5-index ({title})") sns.heatmap(data[m5_fields], ax=ax2, cmap="rainbow", annot=True, fmt="g", cbar=False) ax2.set(title=f"h5-median ({title})") fig.tight_layout()

Each subsection that follows plots one of the usual descriptive statistics, with a variability measure coming just after its corresponding central tendency measure (standard deviation after the mean, IQR after the median). The main goal is just to make sense of the data.

Count

How many entries with the Google Scholar indices data are there for each thematic area? In [29]: dataset_counts = ddata.loc["count"].set_index("area") hmap(dataset_counts, "count") dataset_counts.astype(int)

Out [29]:
                          h5_2013  m5_2013  h5_2014  m5_2014  h5_2015  m5_2015  h5_2016  m5_2016  h5_2017  m5_2017
area
agricultural_sciences          35       35       36       36       76       76       82       82       92       92
applied_social_sciences        31       31       27       27      162      162      174      174      192      192
biological_sciences            29       29       30       30       93       93       91       91      102      102
engineering                    16       16       21       21       60       60       60       60       71       71
exact_earth_sciences           14       14       16       16       58       58       60       60       64       64
health_sciences                97       97       91       91      266      266      210      210      286      286
human_sciences                 73       73       76       76      222      222      238      238      272      272
linguistics_letters_arts        6        6        8        8       37       37       34       34       45       45
multidisciplinary               5        5        6        6       27       27       28       28       32       32


This can be compared with the overall counts: In [30]: overall_counts = pd.DataFrame(dataset[areaswm].sum().rename("overall_count")) sns.heatmap(overall_counts, cmap="rainbow", annot=True, fmt="g") overall_counts

Out [30]: overall_count is_agricultural_sciences 147 is_applied_social_sciences 424 is_biological_sciences 185 is_engineering 120 is_exact_earth_sciences 115 is_health_sciences 457 is_human_sciences 474 is_linguistics_letters_arts 74 is_multidisciplinary 52
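These two tables can also be combined into the per-area coverage of the 2017 index (this cell is our addition; the positional division works because both tables follow the areaswm ordering):

coverage_2017 = dataset_counts["h5_2017"].values / overall_counts["overall_count"].values
print(dict(zip(dataset_counts.index, coverage_2017.round(2))))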


Mean

Central tendency In [31]: dataset_means = ddata.loc["mean"].set_index("area") hmap(dataset_means, "mean") dataset_means

Out [31]:
                            h5_2013    m5_2013    h5_2014    m5_2014    h5_2015    m5_2015    h5_2016    m5_2016    h5_2017    m5_2017
area
agricultural_sciences     12.285714  15.742857  11.111111  14.138889   9.644737  12.328947   9.817073  12.975610  10.043478  13.304348
applied_social_sciences    8.032258  12.129032   6.814815  10.407407   6.895062  10.055556   7.063218  10.235632   7.734375  11.052083
biological_sciences       12.793103  17.137931  13.733333  18.133333   9.946237  12.924731  10.670330  14.384615  10.784314  14.333333
engineering                9.312500  12.125000   8.666667  12.285714   7.583333  10.100000   8.016667  10.750000   8.366197  11.591549
exact_earth_sciences      10.857143  14.428571  10.875000  15.500000   8.724138  11.637931   8.866667  12.066667   9.406250  12.562500
health_sciences           14.886598  20.041237  14.274725  19.274725  11.127820  15.060150  12.147619  16.414286  11.741259  16.000000
human_sciences             9.328767  13.191781   8.434211  11.855263   7.675676  10.801802   7.991597  11.432773   8.180147  11.591912
linguistics_letters_arts   5.000000   8.666667   3.125000   4.750000   4.594595   6.324324   5.500000   8.029412   5.577778   7.977778
multidisciplinary         12.600000  17.400000  11.833333  16.833333   8.518519  11.444444   9.500000  12.821429   9.687500  13.562500


Standard deviation

Variability In [32]: dataset_stds = ddata.loc["std"].set_index("area") hmap(dataset_stds, "standard deviation") dataset_stds

Out [32]:
                           h5_2013    m5_2013   h5_2014    m5_2014   h5_2015    m5_2015   h5_2016    m5_2016   h5_2017    m5_2017
area
agricultural_sciences     5.159636   6.386304  4.833087   6.564636  4.813049   5.888720  4.343657   5.989653  4.185042   5.704710
applied_social_sciences   3.610022   5.925886  4.123451   5.534919  3.169356   4.961028  3.528828   5.728611  3.774711   5.720121
biological_sciences       6.281374   8.617773  5.771113   8.881338  5.061289   6.369487  5.450902   7.282810  5.285188   7.188182
engineering               4.757012   6.064926  4.963198   6.671903  4.334691   5.601755  4.831014   6.374273  4.645242   6.370866
exact_earth_sciences      5.815856   8.149698  5.806605   9.040649  4.644533   5.981546  4.670378   6.655867  4.655493   6.334273
health_sciences           7.908166  10.997081  9.056693  13.851647  7.747344  10.407733  8.262564  11.484326  7.460035  10.201479
human_sciences            4.564521   6.428361  5.342498   7.346117  4.385625   6.513223  4.493776   6.645459  4.593503   6.621474
linguistics_letters_arts  2.449490   5.006662  1.356203   3.370036  2.140346   3.520212  2.402650   3.857188  2.996125   4.303745
multidisciplinary         4.335897   6.804410  4.622409   7.704977  4.660558   5.535434  5.245810   7.237012  5.012484   7.237615


Median

Central tendency (robust) In [33]: dataset_medians = ddata.loc["50%"].set_index("area") hmap(dataset_medians, "median") dataset_medians

Out [33]:
                          h5_2013  m5_2013  h5_2014  m5_2014  h5_2015  m5_2015  h5_2016  m5_2016  h5_2017  m5_2017
area
agricultural_sciences        11.0     14.0     11.0     13.0      8.0     11.0      9.0     11.5      9.0     13.0
applied_social_sciences       8.0     11.0      7.0      9.0      6.0      9.0      6.0      9.0      7.0     10.0
biological_sciences          11.0     17.0     13.5     16.5      8.0     11.0      9.0     13.0      9.5     12.0
engineering                   8.5     11.5      7.0     11.0      7.0      8.5      7.0      9.0      7.0     10.0
exact_earth_sciences          9.5     11.5     10.5     13.5      7.0     10.0      7.0     10.0      9.0     11.0
health_sciences              14.0     19.0     12.0     16.0      9.0     12.0     10.0     13.0      9.0     13.5
human_sciences               10.0     13.0      7.5     11.0      7.0      9.0      7.0     10.0      7.0     10.0
linguistics_letters_arts      4.0      6.0      3.0      3.0      5.0      6.0      5.0      7.5      5.0      7.0
multidisciplinary            11.0     17.0     10.0     14.0      7.0     10.0      8.0     10.0      8.0     11.5


IQR (Inter-Quartile Range)

Variability (robust) In [34]: dataset_iqrs = ddata.loc["75%"].set_index("area") - \ ddata.loc["25%"].set_index("area") hmap(dataset_iqrs, "IQR") dataset_iqrs

Out [34]:
                          h5_2013  m5_2013  h5_2014  m5_2014  h5_2015  m5_2015  h5_2016  m5_2016  h5_2017  m5_2017
area
agricultural_sciences        6.00     7.50     4.50     5.00     6.25     7.00     6.75     8.75     6.00     8.00
applied_social_sciences      5.00     6.50     6.50     8.50     4.00     5.00     4.00     7.00     5.00     7.00
biological_sciences          8.00    10.00     7.00    10.50     6.00     8.00     7.00     9.50     7.00     8.00
engineering                  5.50     6.50     8.00     8.00     5.25     7.00     4.50     6.00     5.50     7.00
exact_earth_sciences         6.00     8.75     6.00     6.75     5.00     5.75     6.00     8.00     5.25     7.25
health_sciences             10.00    14.00    13.00    16.50     8.00    11.00    10.00    13.00     7.75    10.00
human_sciences               7.00    11.00     8.00    11.25     6.00     8.00     6.00    10.00     6.00     8.00
linguistics_letters_arts     3.00     6.00     1.25     5.50     2.00     5.00     2.75     5.00     3.00     4.00
multidisciplinary            6.00     7.00     6.50     6.00     5.00     6.00     7.50    10.25     8.00     9.75


Maximum

In [35]: dataset_maxs = ddata.loc["max"].set_index("area") hmap(dataset_maxs, "max") dataset_maxs.astype(int)

Out [35]:
                          h5_2013  m5_2013  h5_2014  m5_2014  h5_2015  m5_2015  h5_2016  m5_2016  h5_2017  m5_2017
area
agricultural_sciences          25       31       23       31       21       28       21       31       20       31
applied_social_sciences        19       32       17       24       20       32       20       32       22       38
biological_sciences            32       46       33       52       25       34       29       38       29       39
engineering                    19       28       19       31       19       24       21       31       21       31
exact_earth_sciences           24       31       25       38       24       31       23       31       25       31
health_sciences                38       56       39       76       53       72       53       81       54       75
human_sciences                 21       28       23       31       22       35       23       33       23       35
linguistics_letters_arts        9       16        6       10        9       16       11       16       16       23
multidisciplinary              19       28       19       31       19       24       21       31       20       31


Minimum

In [36]: dataset_mins = ddata.loc["min"].set_index("area") hmap(dataset_mins, "min") dataset_mins.astype(int)

Out [36]:
                          h5_2013  m5_2013  h5_2014  m5_2014  h5_2015  m5_2015  h5_2016  m5_2016  h5_2017  m5_2017
area
agricultural_sciences           3        4        2        2        3        4        3        3        2        2
applied_social_sciences         3        5        1        3        2        3        1        1        1        3
biological_sciences             3        4        6        8        4        5        3        4        4        5
engineering                     2        4        2        4        2        3        2        2        3        3
exact_earth_sciences            2        3        3        5        3        4        1        1        2        3
health_sciences                 4        5        2        2        1        2        2        2        1        1
human_sciences                  2        3        1        2        1        1        1        1        2        2
linguistics_letters_arts        3        4        2        2        1        1        2        3        1        2
multidisciplinary               9       11        8       10        4        5        3        4        4        5


Range

In [37]: dataset_ranges = ddata.loc["max"].set_index("area") - \
             ddata.loc["min"].set_index("area")
         hmap(dataset_ranges, "range")
         dataset_ranges.astype(int)

Out [37]:
                           h5_2013  m5_2013  h5_2014  m5_2014  h5_2015  m5_2015  h5_2016  m5_2016  h5_2017  m5_2017
 area
 agricultural_sciences          22       27       21       29       18       24       18       28       18       29
 applied_social_sciences        16       27       16       21       18       29       19       31       21       35
 biological_sciences            29       42       27       44       21       29       26       34       25       34
 engineering                    17       24       17       27       17       21       19       29       18       28
 exact_earth_sciences           22       28       22       33       21       27       22       30       23       28
 health_sciences                34       51       37       74       52       70       51       79       53       74
 human_sciences                 19       25       22       29       21       34       22       32       21       33
 linguistics_letters_arts        6       12        4        8        8       15        9       13       15       21
 multidisciplinary              10       17       11       21       15       19       18       27       16       26


10.3.3 Summary

It’s hard to know how great the number of citations of an article is without knowing its area. The above heat maps give us some reference on how we should regard the number of citations in each thematic area. It’s clear that the whole dataset has more health sciences and human sciences entries, but just a few multidisciplinary ones. There are a lot of applied social sciences entries too, but most of them don’t have the index. Linguistics, letters and arts seems to get fewer citations, which might be a characteristic of this thematic area: maybe the time interval (the past 5 years) is too short, or perhaps the whole network of articles in this area is smaller. The latter could be checked against the quantity of articles in this dataset, but that’s not information we have. The variability of indices in health sciences is huge compared with the other areas, while multidisciplinary entries have a small variation. Biological sciences have a high maximum and minimum, as well as the highest median h5-index, but this area has neither the highest h5-median nor the highest mean of h5-index. This suggests its h5-index is more stable than in the areas with higher means of h5-index.

10.4 Full distributions in 2017

The above heat maps might be difficult to understand, mainly the ones regarding dispersion/variability. They have too much information scattered across distinct plots, which are mostly useful to understand the evolution of these indices from Google Scholar, yet the older columns (2013 to 2016) aren’t that representative of the whole. For now, let’s stick with the 2017 data and plot the distributions of these indices for both the entire SciELO network and the distinct thematic areas. The full data with a single-area column, repeating an ISSN row for each area it’s assigned to, is:

In [38]: fdata = pd.concat([
             dataset[dataset[area] == 1][gs_fields].assign(area=area[3:])
             for area in areaswm
         ])
         fdata.iloc[11::500]

Out [38]:
              h5_2013  m5_2013  h5_2014  m5_2014  h5_2015  m5_2015  h5_2016  m5_2016  h5_2017  m5_2017                     area
 issn_scielo
 0100-2945       13.0     17.0      8.0     13.0     14.0     19.0     10.0     13.0      9.0     12.0    agricultural_sciences
 2145-9444        NaN      NaN      NaN      NaN      NaN      NaN      4.0      5.0      6.0      7.0  applied_social_sciences
 0034-7167       26.0     36.0     30.0     36.0     28.0     33.0     27.0     33.0     29.0     37.0          health_sciences
 0104-4036        3.0      5.0      3.0      4.0      9.0     12.0     13.0     17.0     13.0     17.0           human_sciences
 0717-3458        NaN      NaN      NaN      NaN     17.0     20.0     18.0     22.0     18.0     23.0        multidisciplinary


From which we can get just the 2017 data:

In [39]: fdata2017 = fdata[["h5_2017", "m5_2017", "area"]].dropna().rename(columns={
             "h5_2017": "h5-index",
             "m5_2017": "h5-median",
         })
         fdata2017.head()

Out [39]:
              h5-index  h5-median                   area
 issn_scielo
 0001-3765        16.0       19.0  agricultural_sciences
 0006-8705        12.0       15.0  agricultural_sciences
 0030-2465        15.0       18.0  agricultural_sciences
 0034-737X        13.0       16.0  agricultural_sciences
 0038-2353        20.0       31.0  agricultural_sciences

In [40]: sns.pairplot(fdata2017, hue="area", height=2, aspect=2);

Most KDEs (kernel density estimates) seem alike, yet there’s a single huge index. Seaborn’s FacetGrid can plot all the distribution histograms and KDEs. Forcing the bin step size to 1 turns each histogram into a bar plot counting the frequency of each index value. However, it requires a stacked representation of the same data:

In [41]: fdata2017_stack = (fdata2017
             .set_index("area", append=True)
             .stack()
             .reset_index()
             .rename(columns={"level_2": "type", 0: "value"})
             .set_index("issn_scielo")
         )
         fdata2017_stack.head()

Out [41]:


                               area       type  value
 issn_scielo
 0001-3765    agricultural_sciences   h5-index   16.0
 0001-3765    agricultural_sciences  h5-median   19.0
 0006-8705    agricultural_sciences   h5-index   12.0
 0006-8705    agricultural_sciences  h5-median   15.0
 0030-2465    agricultural_sciences   h5-index   15.0
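As a side note, the same stacked layout could probably also be built with pd.melt; a hedged alternative sketch (the fdata2017_melted name is hypothetical, not part of the original notebook):

    fdata2017_melted = (fdata2017
        .reset_index()  # Keep issn_scielo as a column for the id_vars
        .melt(id_vars=["issn_scielo", "area"], var_name="type", value_name="value")
        .set_index("issn_scielo")
    )
    # Should hold the same rows as fdata2017_stack, up to ordering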

We can plot that with a sns.FacetGrid (be careful when interpreting: the titles are above, not below!):

In [42]: sns.FacetGrid(fdata2017_stack, row="area", col="type", aspect=2.7, height=2)\
             .map(sns.distplot, "value",
                  bins=np.arange(fdata2017_stack["value"].max()),
                  kde=False);



The same, normalized as a probability density and with its KDE:

In [43]: sns.FacetGrid(fdata2017_stack, row="area", col="type", aspect=2.7, height=2)\
             .map(sns.distplot, "value",
                  bins=np.arange(fdata2017_stack["value"].max()));



There is a lot of information there, but it’s hard to compare. Perhaps a simple boxplot of it all would be simpler.

In [44]: sns.catplot(kind="box", aspect=2.5, height=4, data=fdata2017_stack,
                     row="type", y="area", x="value");

Now several pieces of information are together in a single plot: quartiles, median, IQR, minimum, maximum, outlier thresholds and the outliers themselves. This one is probably the most informative plot so far in this notebook. We can see the standard statistics in a barplot:

In [45]: sns.barplot(data=fdata2017_stack, y="area", x="value", hue="type",
                     ax=plt.subplots(figsize=(10, 8))[1]) \
             .set(title="Mean with 95% confidence interval");


Let’s perform fdata2017.groupby("area").describe() in a stacked table style:

In [46]: fdata2017_descr = (fdata2017
             .groupby("area")
             .describe()
             .stack(0)
             .rename_axis(["area", "type"])
             .reset_index()
         )
         fdata2017_descr

Out [46]:
                         area       type    25%   50%    75%  count   max       mean  min        std
 0      agricultural_sciences   h5-index   7.00   9.0  13.00   92.0  20.0  10.043478  2.0   4.185042
 1      agricultural_sciences  h5-median   9.00  13.0  17.00   92.0  31.0  13.304348  2.0   5.704710
 2    applied_social_sciences   h5-index   5.00   7.0  10.00  192.0  22.0   7.734375  1.0   3.774711
 3    applied_social_sciences  h5-median   7.00  10.0  14.00  192.0  38.0  11.052083  3.0   5.720121
 4        biological_sciences   h5-index   7.00   9.5  14.00  102.0  29.0  10.784314  4.0   5.285188
 5        biological_sciences  h5-median  10.00  12.0  18.00  102.0  39.0  14.333333  5.0   7.188182
 6                engineering   h5-index   5.00   7.0  10.50   71.0  21.0   8.366197  3.0   4.645242
 7                engineering  h5-median   7.00  10.0  14.00   71.0  31.0  11.591549  3.0   6.370866
 8       exact_earth_sciences   h5-index   6.00   9.0  11.25   64.0  25.0   9.406250  2.0   4.655493
 9       exact_earth_sciences  h5-median   8.00  11.0  15.25   64.0  31.0  12.562500  3.0   6.334273
 10           health_sciences   h5-index   7.00   9.0  14.75  286.0  54.0  11.741259  1.0   7.460035
 11           health_sciences  h5-median   9.00  13.5  19.00  286.0  75.0  16.000000  1.0  10.201479
 12            human_sciences   h5-index   5.00   7.0  11.00  272.0  23.0   8.180147  2.0   4.593503
 13            human_sciences  h5-median   7.00  10.0  15.00  272.0  35.0  11.591912  2.0   6.621474
 14  linguistics_letters_arts   h5-index   4.00   5.0   7.00   45.0  16.0   5.577778  1.0   2.996125
 15  linguistics_letters_arts  h5-median   5.00   7.0   9.00   45.0  23.0   7.977778  2.0   4.303745
 16         multidisciplinary   h5-index   5.75   8.0  13.75   32.0  20.0   9.687500  4.0   5.012484
 17         multidisciplinary  h5-median   8.50  11.5  18.25   32.0  31.0  13.562500  5.0   7.237615

The information the above plots miss is the total/overall count (already seen in a heat map, but easier to visualize in a bar plot).

In [47]: sns.barplot(data=fdata2017_descr, y="area", x="count",
                     ax=plt.subplots(figsize=(10, 8))[1]) \
             .set(title="Count");

11 FCR in Dimensions

The goal of this notebook is to analyze the FCR index data of a single journal based on the CSV reports from Dimensions[1]. We can get at most 500 entries from its search system, which suffices in our case without an extra step of search splitting and CSV joining (sadly, we couldn’t download the full data from some journals due to that limit on the number of rows and due to the constrained ways of manually splitting the data in the interface). Note: Be careful when downloading stuff from a single journal! As the Dimensions web interface uses the journal name for filtering instead of its ISSN, sometimes we can grab a CSV with more than one journal because they happen to have the same name. As an example, Topoi can be either a Brazilian journal of history[2] or a Dutch journal of philosophy[3], but these journals are mixed together in that interface. There are two CSV formats available: a spreadsheet and a bibliometric mapping. In both formats, the first line is always a comment with some metadata regarding the downloaded CSV, including the date. The second line is the table header.

In [1]: import csv

In [2]: import matplotlib.pyplot as plt import numpy as np import pandas as pd

In [3]: %matplotlib inline

11.1 Analyzing the FCR from the spreadsheet file

There are 2 bibliometric indices: RCR and FCR. The former is probably useless for anything that isn’t a health science publication regarding the research going on in the U.S.A., since its calculation is based on the number of citations from publications funded by the NIH[4] (National Institutes of Health in the U.S.A.). The latter is the Field Citation Ratio, which is normalized by both year and Field of Research[5], and is the only one we’re going to analyze here.

11.1.1 Loading the data

We should be explicit about the header line since the first line of the CSV files is a comment:

In [4]: for fname in ["nauplius.csv", "plant_physiology.csv"]:
            with open(fname) as f:
                cr = csv.reader(f)
                print(next(cr)[0])

About the data: Exported on Sep 21, 2018. Criteria: Source title is Nauplius. About the data: Exported on Sep 22, 2018. Criteria: Source title is Brazilian Journal of Plant Physiology.

We’re going to analyze the content from these two journals.

[1] https://app.dimensions.ai
[2] http://revistatopoi.org
[3] https://link.springer.com/journal/11245
[4] https://support.dimensions.ai/support/solutions/articles/13000045404-what-is-the-rcr-how-is-the-rcr-score-calculated-
[5] https://support.dimensions.ai/support/solutions/articles/13000045409-what-is-the-fcr-how-is-it-calculated-


In [5]: nauplius = pd.read_csv("nauplius.csv", header=1)
        plantp = pd.read_csv("plant_physiology.csv", header=1)
        print("Nauplius:", nauplius.shape)
        print("Brazilian Journal of Plant Physiology:", plantp.shape)
        nauplius.columns

Nauplius: (177, 27) Brazilian Journal of Plant Physiology: (361, 27)

Out [5]: Index(['Rank', 'Publication ID', 'DOI', 'PMID', 'PMCID', 'Title', 'Source title', 'Anthology title', 'PubYear', 'Volume', 'Issue', 'Pagination', 'Publication Type', 'Authors', 'Authors Affiliations', 'Times cited', 'Recent citations', 'RCR', 'FCR', 'Source Linkout', 'Dimensions URL', 'FOR (ANZSRC) Categories', 'FOR 1', 'FOR 2', 'FOR 3', 'FOR 4', 'FOR 5'], dtype='object')
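Given the journal name ambiguity mentioned before (e.g. Topoi), it might be worth confirming that each CSV really holds a single source title. A minimal sanity-check sketch, not in the original notebook:

    for name, df in [("Nauplius", nauplius),
                     ("Brazilian Journal of Plant Physiology", plantp)]:
        # A single unique value means no homonymous journal got mixed in
        print(name, "->", df["Source title"].unique())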

We can see the number of publications by year:

In [6]: year_counts = pd.DataFrame([
            plantp.groupby("PubYear").size()
                  .rename("Brazilian Journal of Plant Physiology"),
            nauplius.groupby("PubYear").size()
                    .rename("Nauplius"),
        ], dtype=int)
        year_counts.T.plot(title="Number of publications by year", figsize=(12, 6))
        year_counts.fillna("").T

Out [6]:          Brazilian Journal of Plant Physiology  Nauplius
         PubYear
         2002                                        24
         2003                                        23
         2004                                        24
         2005                                        41
         2006                                        46
         2007                                        38
         2008                                        32
         2009                                        33
         2010                                        32
         2011                                        35        19
         2012                                        32        21
         2013                                         1        22
         2014                                                  16
         2015                                                  18
         2016                                                  30
         2017                                                  33
         2018                                                  18


Which shows that Nauplius is a quite new and currently active journal, whereas the Brazilian Journal of Plant Physiology is no longer publishing anything (actually, it has been renamed to Theoretical and Experimental Plant Physiology, with a new ISSN: 2197-0025). Due to the way the FCR is calculated and normalized (by publication year, at the document level), its behavior is quite different in these two distinct journal contexts.

11.1.2 Proportion where the FCR is zero

From the FCR explanation[6] page in the Dimensions support, we know that: The FCR is calculated for all publications in Dimensions which are at least 2 years old and were published in 2000 or later. FCR is zero when a document hasn’t received any citations or when it was published in the last 2 years. That’s a quite common case, surely for new papers, and it deserves its own analysis.

In [7]: fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 3))
        (nauplius["FCR"] == 0).value_counts().sort_index().plot.bar(ax=ax1,
            title="FCR is zero (Nauplius)",
        )
        (plantp["FCR"] == 0).value_counts().sort_index().plot.bar(ax=ax2,
            title="FCR is zero (Brazilian Journal of Plant Physiology)",
        )
        for ax in [ax1, ax2]:
            for rect in ax.patches:
                value = rect.get_height()
                ax.annotate(value,
                            (rect.get_x() + rect.get_width() / 2,
                             value / 1.05 if value > 50 else value * 1.05),
                            ha="center",
                            va="bottom" if value < 50 else "top",
                            fontsize=20)

[6] https://support.dimensions.ai/support/solutions/articles/13000045409-what-is-the-fcr-how-is-it-calculated-


For a recent journal, more than half of the entries have zero as their FCR. The FCR mean for this journal as a whole would be tainted if we counted these entries like the others. For a journal that is no longer publishing, the FCR can only increase. Most entries with zeroed FCR values are from 2017 and 2018, years where all entries are zero.

In [8]: nauplius_counts = pd.DataFrame(
            nauplius
            .assign(fcr_is_zero=nauplius["FCR"] == 0)
            .groupby(["fcr_is_zero", "PubYear"])
            .size()
            .rename("count")
            .unstack("fcr_is_zero")
            .fillna(0),
            dtype=int,
        )
        nauplius_counts.plot.barh(subplots=True,
            title=["FCR isn't zero (Nauplius)", "FCR is zero (Nauplius)"],
            legend=False, figsize=(12, 8))
        nauplius_counts

Out [8]: fcr_is_zero  False  True
         PubYear
         2011            17      2
         2012            14      7
         2013            18      4
         2014            11      5
         2015            12      6
         2016            14     16
         2017             0     33
         2018             0     18


As 2017 was the publication peak for Nauplius, averaging the raw FCR numbers over all years would be unfair. For the Brazilian Journal of Plant Physiology, the entries from 2002 have something different going on.

In [9]: plantp_counts = pd.DataFrame(
            plantp
            .assign(fcr_is_zero=plantp["FCR"] == 0)
            .groupby(["fcr_is_zero", "PubYear"])
            .size()
            .rename("count")
            .unstack("fcr_is_zero")
            .fillna(0),
            dtype=int,
        )
        plantp_counts.plot.barh(
            subplots=True,
            title=["FCR isn't zero (Brazilian Journal of Plant Physiology)",
                   "FCR is zero (Brazilian Journal of Plant Physiology)"],
            legend=False, figsize=(12, 8)
        )
        plantp_counts

Out [9]: fcr_is_zero  False  True
         PubYear
         2002            11     13
         2003            23      0
         2004            23      1
         2005            40      1
         2006            44      2
         2007            37      1
         2008            31      1
         2009            31      2
         2010            30      2
         2011            33      2
         2012            26      6
         2013             1      0

The reason is that the FCR can only be calculated if we know the Field of Research of every single document. However, from the same link as before: Assigning FoR codes to publications in Dimensions is done automatically using machine learning emulations of the categorisation processes. The FCR can’t be negative, but when a document lacks its category, the FCR is zero.

In [10]: pd.DataFrame([
             nauplius["FOR (ANZSRC) Categories"].isna().sum(),
             plantp["FOR (ANZSRC) Categories"].isna().sum(),
         ], index=["Nauplius", "Brazilian Journal of Plant Physiology"],
            columns=["Entries lacking a field of research category"])

Out [10]:


                                       Entries lacking a field of research category
Nauplius                                                                         14
Brazilian Journal of Plant Physiology                                            26

Since each FCRi ≥ 0 and the sum ∑ FCRi over these uncategorized entries is zero, every one of them must have FCR = 0.

In [11]: plantp[plantp["FOR (ANZSRC) Categories"].isna()]["FCR"].sum() + \
         nauplius[nauplius["FOR (ANZSRC) Categories"].isna()]["FCR"].sum()

Out [11]: 0.0

For the Brazilian Journal of Plant Physiology, it’s clearly biased towards 2002 (half of the non-classified entries are from that year):

In [12]: plantp[plantp["FOR (ANZSRC) Categories"].isna()].groupby("PubYear").size()

Out [12]: PubYear
          2002    13
          2004     1
          2005     1
          2006     2
          2008     1
          2009     2
          2010     1
          2011     2
          2012     3
          dtype: int64

For Nauplius, it’s less skewed.

In [13]: nauplius[nauplius["FOR (ANZSRC) Categories"].isna()].groupby("PubYear").size()

Out [13]: PubYear
          2011    1
          2012    2
          2013    3
          2015    2
          2016    2
          2017    1
          2018    3
          dtype: int64

11.1.3 Data cleaning

The cleaning we’re sure must be done, in order to have meaningful entries for further FCR analysis, is to use only data up to 2016 and to get rid of entries without a main field of research.

In [14]: nauplius_valid = nauplius[nauplius["FOR (ANZSRC) Categories"].notna() &
                                   (nauplius["PubYear"] <= 2016)]
         plantp_valid = plantp[plantp["FOR (ANZSRC) Categories"].notna()]

In the CSV files, everything is an article, so we shouldn’t remove any other row:

In [15]: pd.concat([nauplius["Publication Type"],
                    plantp["Publication Type"]]).drop_duplicates()


Out [15]: 0 article Name: Publication Type, dtype: object

11.1.4 Hirsch index

As the total number of citations each document received is given, we can calculate the Hirsch index for these two journals based on the entire data.

In [16]: nauplius["Times cited"].plot.hist(
             bins=nauplius["Times cited"].max(),
             title="Number of citations (Nauplius)",
             figsize=(12, 4),
         );

The Hirsch index for Nauplius is:

In [17]: (nauplius["Times cited"]
             .value_counts()
             .sort_index(ascending=False)
             .cumsum()
             .reset_index()
             .values.min(axis=1)
             .max()
         )

Out [17]: 8
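The chained expression above is quite terse. As a hedged, more explicit sketch of the same idea (not part of the original notebook): the Hirsch index is the largest rank h such that the h-th most cited document has at least h citations.

    def h_index(citations):
        # Sort citation counts in descending order and count the ranks
        # at which the citation count still reaches the rank itself
        values = sorted(citations, reverse=True)
        return sum(1 for rank, cites in enumerate(values, start=1)
                   if cites >= rank)

    h_index(nauplius["Times cited"])  # Expected to match the result above: 8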


In [18]: plantp["Times cited"].plot.hist(
             bins=plantp["Times cited"].max(),
             title="Number of citations (Brazilian Journal of Plant Physiology)",
             figsize=(12, 4),
         );

The Hirsch index for the Brazilian Journal of Plant Physiology is:

In [19]: (plantp["Times cited"]
             .value_counts()
             .sort_index(ascending=False)
             .cumsum()
             .reset_index()
             .values.min(axis=1)
             .max()
         )

Out [19]: 37

The above is based on all citations since the publication, which is something that might lean toward older publications.

11.1.5 Proportion where FCR isn’t zero

In [20]: nauplius[nauplius["FCR"] != 0]["FCR"].plot.hist(
             bins=100,
             title="Histogram of not-zero FCR values (Nauplius)",
             figsize=(12, 4),
             xticks=range(int(nauplius["FCR"].max()) + 2),
         );


In [21]: plantp[plantp["FCR"] != 0]["FCR"].plot.hist(
             bins=200,
             title="Histogram of not-zero FCR values "
                   "(Brazilian Journal of Plant Physiology)",
             figsize=(12, 4),
             xticks=range(int(plantp["FCR"].max()) + 2),
         );

All FCR values above 1 mean the document received more citations than the average for its year of publication and field of research. That normalization makes the FCR a better fit than the raw number of citations when we’re looking at summary data without knowing when each publication was made.
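Under that reading, the share of documents cited above their field/year average is just the proportion of FCR values greater than 1. A quick hedged check, not in the original notebook:

    # Fraction of valid entries cited above the average of their field and year
    print("Nauplius:", (nauplius_valid["FCR"] > 1).mean())
    print("Brazilian Journal of Plant Physiology:", (plantp_valid["FCR"] > 1).mean())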

11.1.6 Average

Dimensions tells us the average FCR for Nauplius is 0.49. That means that the number of citations it received is, on average, about half of the average number of citations of its field of research. For the Brazilian Journal of Plant Physiology, the average FCR is 1.07. Each of these is called FCR Mean on the Dimensions web site. The idea of average might be misleading here, as the mean value is highly influenced by extreme values. In the same link as before, Dimensions tells us they’re using a shifted geometric mean with a logarithmic formulation for everything regarding the FCR. From their description, we know that:

geometric mean of FCR = exp[ (1/N) · ∑_{i=1..N} ln(FCR_i + 1) ] − 1


And the same idea applies to the average number of citations (by year and field of research) that is used to calculate the FCR. We can easily implement the above formula using Numpy:

In [22]: def sgm_average(series):
             return np.exp(np.log(series.values + 1).mean()) - 1
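As a quick sanity check of this implementation (not in the original notebook): a balanced mix of zeros and 15s should give exp(ln(16) / 2) − 1 = √16 − 1 = 3.

    sgm_average(pd.Series([0] * 10 + [15] * 10))  # ~3.0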

11.2 Yearly mean and shifted geometric mean/average

Here we’ll calculate a “cumulative” mean and shifted geometric average, meaning the mean or shifted geometric average of all publications up to the year in analysis, in contrast with the year-by-year statistics.

In [23]: def get_yearly_stats(clean_dataset):
             # FCR values of all publications up to each year
             cum_groups = [
                 (year, clean_dataset[clean_dataset["PubYear"] <= year]["FCR"])
                 for year in clean_dataset["PubYear"].unique()
             ]
             return pd.DataFrame(
                 [(year, sgm_average(group), group.mean())
                  for year, group in cum_groups],
                 columns=["year", "cum_shifted_geom_average", "cum_mean"]
             ).set_index("year").assign(  # Year-by-year (isolated) statistics
                 shifted_geom_average=clean_dataset.groupby("PubYear")["FCR"]
                                                   .apply(sgm_average),
                 mean=clean_dataset.groupby("PubYear")["FCR"].mean(),
             )[["cum_mean", "mean",
                "cum_shifted_geom_average", "shifted_geom_average"]]

In [24]: def plot_yearly_stats(stats, journal_name):
             stats.plot(
                 figsize=(12, 6),
                 title="Mean and shifted geometric average of FCR, "
                       f"cumulative up to a year and isolated ({journal_name})",
             )
             for line, marker in zip(plt.gca().get_lines(), "xoDs"):
                 line.set_marker(marker)
             plt.gca().legend()

In [25]: nauplius_fcr_year = get_yearly_stats(nauplius_valid)
         plot_yearly_stats(nauplius_fcr_year, "Nauplius")
         nauplius_fcr_year

Out [25]:       cum_mean      mean  cum_shifted_geom_average  shifted_geom_average
          year
          2016  0.647931  0.863571                  0.484481              0.440366
          2015  0.579318  0.562500                  0.498798              0.473841
          2014  0.583056  0.373750                  0.504402              0.325564
          2013  0.642857  0.826316                  0.559795              0.721414
          2012  0.548649  0.551579                  0.482791              0.475847
          2011  0.545556  0.545556                  0.490157              0.490157


In [26]: plantp_fcr_year = get_yearly_stats(plantp_valid)
         plot_yearly_stats(plantp_fcr_year, "Brazilian Journal of Plant Physiology")
         plantp_fcr_year

Out [26]:       cum_mean      mean  cum_shifted_geom_average  shifted_geom_average
          year
          2013  1.663134  1.480000                  1.072867              1.480000
          2012  1.663683  0.821034                  1.071754              0.718440
          2011  1.743803  0.728182                  1.108916              0.658116
          2010  1.867022  1.048387                  1.171355              0.923672
          2009  1.972324  0.691613                  1.205448              0.626536
          2008  2.161381  1.404194                  1.306837              1.111041
          2007  2.292514  2.034211                  1.342545              1.533903
          2006  2.362128  2.202273                  1.293493              1.333378
          2005  2.434639  4.189500                  1.275626              1.934488
          2004  1.203158  1.445652                  0.903729              1.088004
          2003  1.039118  0.991739                  0.788385              0.811272
          2002  1.138182  1.138182                  0.741460              0.741460


11.2.1 Let’s talk about the FCR...

These values should be recalculated by Dimensions from time to time to include the citations from new publications; we are just seeing a snapshot of this index. As a side effect of the normalization, an article that is no longer cited will get a lower FCR whenever another article from the same year and field of research gets a citation. Probably the strangest part of the FCR is the “field of research” itself, which comes from some unknown “machine learning” algorithm and data. Do the fields make sense, at least?

In [27]: # Nauplius
         nauplius.groupby("FOR (ANZSRC) Categories").size().sort_values()

Out [27]: FOR (ANZSRC) Categories
          0604 Genetics; 0602 Ecology; 0502 Environmental Science and Management                                         1
          1701 Psychology                                                                                                1
          0502 Environmental Science and Management; 0403 Geology; 0602 Ecology                                          1
          1117 Public Health and Health Services                                                                         1
          0502 Environmental Science and Management; 0602 Ecology; 0604 Genetics                                         1
          0502 Environmental Science and Management; 0604 Genetics                                                       1
          0502 Environmental Science and Management; 0907 Environmental Engineering; 0602 Ecology; 0405 Oceanography     1
          1108 Medical Microbiology                                                                                      1
          1103 Clinical Sciences                                                                                         1
          0602 Ecology; 0608 Zoology                                                                                     1
          1102 Cardiorespiratory Medicine and Haematology                                                                1
          1005 Communications Technologies                                                                               1
          2102 Curatorial and Related Studies                                                                            1
          0607 Plant Biology                                                                                             1
          0608 Zoology                                                                                                   1
          0704 Fisheries Sciences                                                                                        1
          0403 Geology                                                                                                   2
          2103 Historical Studies                                                                                        2
          0603 Evolutionary Biology                                                                                      4
          1114 Paediatrics and Reproductive Medicine                                                                     4
          0602 Ecology; 0502 Environmental Science and Management                                                        5
          0502 Environmental Science and Management; 0602 Ecology                                                        5
          0604 Genetics                                                                                                 17
          0502 Environmental Science and Management                                                                     51
          0602 Ecology                                                                                                  57
          dtype: int64

In [28]: # Brazilian Journal of Plant Physiology
         plantp.groupby("FOR (ANZSRC) Categories").size().sort_values()

Out [28]: FOR (ANZSRC) Categories
          0102 Applied Mathematics                                                                                       1
          0912 Materials Engineering                                                                                     1
          0904 Chemical Engineering                                                                                      1
          0699 Other Biological Sciences; 0607 Plant Biology                                                             1
          0607 Plant Biology; 0699 Other Biological Sciences; 0602 Ecology                                               1
          0607 Plant Biology; 0605 Microbiology                                                                          1
          0607 Plant Biology; 0602 Ecology; 0705 Forestry Sciences                                                       1
          0607 Plant Biology; 0602 Ecology; 0699 Other Biological Sciences; 0705 Forestry Sciences                       1
          0607 Plant Biology; 0601 Biochemistry and Cell Biology; 0604 Genetics; 0703 Crop and Pasture Production        1
          0607 Plant Biology; 0601 Biochemistry and Cell Biology; 0604 Genetics                                          1
          0607 Plant Biology; 0503 Soil Sciences; 0602 Ecology; 0703 Crop and Pasture Production                         1
          0605 Microbiology                                                                                              1
          0604 Genetics; 0601 Biochemistry and Cell Biology                                                              1
          0604 Genetics; 0502 Environmental Science and Management; 0602 Ecology; 0603 Evolutionary Biology              1
          0602 Ecology; 0705 Forestry Sciences; 0607 Plant Biology                                                       1
          1117 Public Health and Health Services; 0399 Other Chemical Sciences                                           1
          0306 Physical Chemistry (incl. Structural)                                                                     1
          0602 Ecology; 0607 Plant Biology; 0503 Soil Sciences; 0703 Crop and Pasture Production                         1
          0602 Ecology; 0607 Plant Biology                                                                               1
          0104 Statistics                                                                                                1
          0202 Atomic, Molecular, Nuclear, Particle and Plasma Physics                                                   1
          0206 Quantum Physics                                                                                           1
          0601 Biochemistry and Cell Biology; 0607 Plant Biology                                                         1
          0601 Biochemistry and Cell Biology; 0604 Genetics; 0607 Plant Biology                                          1
          0305 Organic Chemistry                                                                                         1
          0602 Ecology; 0607 Plant Biology; 0603 Evolutionary Biology                                                    1
          0907 Environmental Engineering                                                                                 2
          0703 Crop and Pasture Production; 0607 Plant Biology                                                           2
          0607 Plant Biology; 0703 Crop and Pasture Production; 0602 Ecology                                             2
          0607 Plant Biology; 0703 Crop and Pasture Production; 0601 Biochemistry and Cell Biology                       2
          0607 Plant Biology; 0604 Genetics                                                                              2
          0607 Plant Biology; 0299 Other Physical Sciences                                                               2
          1103 Clinical Sciences                                                                                         2
          0604 Genetics; 0607 Plant Biology; 0601 Biochemistry and Cell Biology                                          2
          0604 Genetics; 0607 Plant Biology                                                                              2
          0607 Plant Biology; 0602 Ecology; 0699 Other Biological Sciences                                               2
          0301 Analytical Chemistry                                                                                      3
          0607 Plant Biology; 0503 Soil Sciences                                                                         3
          0703 Crop and Pasture Production                                                                               4
          0602 Ecology                                                                                                   4
          0302 Inorganic Chemistry                                                                                       5
          0607 Plant Biology; 0601 Biochemistry and Cell Biology                                                         7
          0607 Plant Biology; 0602 Ecology                                                                              10
          0607 Plant Biology; 0703 Crop and Pasture Production                                                          13
          0604 Genetics                                                                                                 24
          0601 Biochemistry and Cell Biology                                                                            40
          0607 Plant Biology                                                                                           176
          dtype: int64

Multidisciplinary entries and misclassification regarding publications from this and other journals might be biasing the whole picture. The trustworthiness and meaningfulness of the normalization procedure is rooted in Dimensions’ machine learning system. The citation count for a document, as used in the FCR calculation, doesn’t take into account which other document had cited it. Though the normalization helps on comparing publications from different years, the magnitude of this index might be influenced by stuff like self-citations and scattered never-cited [perhaps auto-generated] publications. There’s one important difference between the Hirsch index and the FCR average: uncited publications are meaningless for the Hirsch index, yet they push down the FCR average, and faster than the common mean calculation would, because a geometric mean is used instead. The geometric mean is less influenced by extreme values near the maximum, but it’s MORE influenced by extreme values near the minimum. A single zero would weigh hard on a journal with lots of highly cited documents. As a synthetic example:

In [29]: synth_data = pd.DataFrame([range(11)], index=["zero_count"]).T
         synth_data["fifteen_count"] = 10
         synth_data["shifted_geom_average"] = [
             sgm_average(pd.Series([0] * a + [15] * b))
             for a, b in synth_data.values.tolist()
         ]
         synth_data["mean"] = synth_data["fifteen_count"] * 15 \
             / (synth_data["fifteen_count"] + synth_data["zero_count"])
         synth_data.set_index(["fifteen_count", "zero_count"], inplace=True)
         synth_data.reset_index(0, drop=True).plot(figsize=(8, 4))
         synth_data

Out [29]:                            shifted_geom_average       mean
          fifteen_count zero_count
          10            0                      15.000000  15.000000
          10            1                      11.435250  13.636364
          10            2                       9.079368  12.500000
          10            3                       7.438129  11.538462
          10            4                       6.245789  10.714286
          10            5                       5.349604  10.000000
          10            6                       4.656854   9.375000
          10            7                       4.108647   8.823529
          10            8                       3.666116   8.333333
          10            9                       3.302762   7.894737
          10            10                      3.000000   7.500000


The shifted geometric mean of a balanced mix of zeros and 15s is 3. With just 3 zeros, even 10 entries of 15 wouldn’t be enough to reach 7.5 as the average result. This effectively means that the FCR average favors journals with a small number of uniformly cited publications, heavily pushing down a journal with even a single publication with zero citations. Due to that pushing-down behavior of this alternative averaging calculation, the FCR should be taken with a grain of salt when used to evaluate journals with recent publications.
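To see why, note that the shifted geometric mean of a zeros and b entries equal to 15 has a closed form (derived here from the sgm_average definition above, not taken from Dimensions’ documentation):

sgm(a zeros, b fifteens) = exp[ (a·ln(1) + b·ln(16)) / (a + b) ] − 1 = 16^(b/(a+b)) − 1

For a = b this gives 16^(1/2) − 1 = 3, and for a = 3, b = 10 it gives 16^(10/13) − 1 ≈ 7.44, matching the table above.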

11.3 Bibliometric mapping

Another file that can be downloaded from Dimensions is a bibliometric mapping CSV.

11.3.1 Loading the data

It follows a structure similar to the spreadsheet CSV file, like the metadata information in the first line.

In [30]: for fname in ["nauplius_bibmap.csv", "plant_physiology_bibmap.csv"]:
             with open(fname) as f:
                 cr = csv.reader(f)
                 print(next(cr)[0])

About the data: Exported on Sep 21, 2018. Criteria: Source title is Nauplius. About the data: Exported on Sep 22, 2018. Criteria: Source title is Brazilian Journal of Plant Physiology.

But it has fewer columns:

In [31]: nauplius_bibmap = pd.read_csv("nauplius_bibmap.csv", header=1)
         plantp_bibmap = pd.read_csv("plant_physiology_bibmap.csv", header=1)
         print("Nauplius:", nauplius_bibmap.shape)
         print("Brazilian Journal of Plant Physiology:", plantp_bibmap.shape)
         plantp_bibmap.columns


Nauplius: (177, 14) Brazilian Journal of Plant Physiology: (361, 14)

Out [31]: Index(['Publication ID', 'DOI', 'Title', 'Source title/Anthology title', 'PubYear', 'Volume', 'Issue', 'Pagination', 'Authors', 'Authors Affiliations - Name of Research organization', 'Authors Affiliations - Country of Research organization', 'Dimensions URL', 'Times cited', 'Publication IDs of cited references'], dtype='object')

Besides having fewer columns, the differences are: • Authors Affiliations had been split into two fields; • There’s an extra Publication IDs of cited references field. We don’t have a field like “Publication IDs citing this entry”, just the other way around. That is, from the directed graph of citations, these bibliometric mapping files have a partition of the graph including the nodes/articles from a journal and their outgoing edges, but not the incoming edges. The number of incoming edges is in the Times cited field, but the citations themselves aren’t there. The citations in a document are joined by a semicolon and a space:

In [32]: nauplius_bibmap["Publication IDs of cited references"].iloc[0].split("; ")

Out [32]: ['pub.1029197063', 'pub.1002374214', 'pub.1018747876', 'pub.1051616960', 'pub.1017805771', 'pub.1014161638', 'pub.1047379736', 'pub.1029350483', 'pub.1040183585', 'pub.1084353864', 'pub.1057031403', 'pub.1000284258', 'pub.1052431052', 'pub.1046335329', 'pub.1010113839', 'pub.1090336053', 'pub.1048696478', 'pub.1005549826', 'pub.1020510950', 'pub.1035956564', 'pub.1032912886', 'pub.1049807368', 'pub.1015284561', 'pub.1093107594', 'pub.1084353730']
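NetworkX is among the tools used in this workshop, so, as a hedged sketch (not part of the original analysis), this outgoing partition of the citation graph can be materialized like this:

    import networkx as nx

    # Directed graph: an edge X -> Y means document X cites document Y
    cite_graph = nx.DiGraph()
    refs = (nauplius_bibmap["Publication IDs of cited references"]
            .fillna("").str.split("; "))
    for pub_id, cited in zip(nauplius_bibmap["Publication ID"], refs):
        for target in cited:
            if target:  # Skip empty strings from rows without references
                cite_graph.add_edge(pub_id, target)
    print(cite_graph.number_of_nodes(), "nodes,",
          cite_graph.number_of_edges(), "edges")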

From these IDs, we can see a histogram of the overall number of citations made by the papers:

In [33]: nauplius_out_cite = (nauplius_bibmap
             ["Publication IDs of cited references"]
             .fillna("")      # Unknown / empty citation list
             .str.split("; ")
             .apply(set)      # Remove duplicated IDs, if any
         )
         nauplius_out_cite.apply(len).plot.hist(
             figsize=(8, 4),
             bins=100,
             title="Histogram of citations in documents (Nauplius)",
         );

In [34]: plantp_out_cite = (plantp_bibmap
             ["Publication IDs of cited references"]
             .fillna("")      # Unknown / empty citation list
             .str.split("; ")
             .apply(set)      # Remove duplicated IDs, if any
         )
         plantp_out_cite.apply(len).plot.hist(
             figsize=(8, 4),
             bins=100,
             title="Histogram of citations in documents "
                   "(Brazilian Journal of Plant Physiology)",
         );


As we have all the publication IDs, we can also see how many journal self-citations there are in these publications.

In [35]: nauplius_ids = nauplius_bibmap["Publication ID"].values
         nauplius_self_cites = \
             nauplius_out_cite.apply(lambda cites: sum(cite in nauplius_ids
                                                       for cite in cites))
         nauplius_self_cites[nauplius_self_cites > 0].value_counts().plot.bar(
             figsize=(8, 4),
         )
         plt.gca().set(
             title="Self-citations in Nauplius",
             xlabel="Number of self-citations from a single document",
             ylabel="Documents",
         );

In [36]: plantp_ids = plantp_bibmap["Publication ID"].values
         plantp_self_cites = \
             plantp_out_cite.apply(lambda cites: sum(cite in plantp_ids
                                                     for cite in cites))
         plantp_self_cites[plantp_self_cites > 0].value_counts().plot.bar(
             figsize=(8, 4),
         )
         plt.gca().set(
             title="Self-citations in the "  # Trailing space added to fix the title
                   "Brazilian Journal of Plant Physiology",
             xlabel="Number of self-citations from a single document",
             ylabel="Documents",
         );


12 Citations in the SciELO Citation Index

In the Web of Science, the SciELO Citation Index can be found for data coming from SciELO. Since it’s a closed source of information, there won’t be much description on how to grab data from there. It’s known that the Web of Science system doesn’t allow downloading entries after the 100,000th (a hundred thousand) position on any search, and we can download at most 500 (five hundred) entries in a single CSV. A concatenation can be performed by just appending the results from the different CSVs, skipping their header lines and removing some duplications. The further analysis was performed on the CSV data manually extracted from the Web of Science (henceforth WoS) by Ednilson Gesseff and concatenated into a single huge CSV. The selected data is from 2008 to 2017, including all entries from SciELO Brazil, as downloaded in August 2018.

In [1]: import pandas as pd
        from scipy import stats
        %matplotlib inline
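The concatenation step itself isn’t part of this notebook; a minimal hedged sketch of it (the chunk file names are hypothetical) could be:

    import glob
    import pandas as pd

    # Each downloaded chunk carries its own header line, parsed by read_csv
    chunks = [pd.read_csv(fname) for fname in sorted(glob.glob("wos_chunk_*.csv"))]
    merged = pd.concat(chunks, ignore_index=True).drop_duplicates()
    merged.to_csv("wos_2008to2017_scielo_brazil.csv")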

12.1 Making sense of the data

There are hundreds of thousands of lines (downloading it all wasn’t easy!).

In [2]: wos = pd.read_csv("wos_2008to2017_scielo_brazil.csv",
                          index_col=0, low_memory=False)
        print(wos.shape)
        wos.columns

(210016, 51)

Out [2]: Index(['Publication Type', 'Authors', 'Editors', 'English Document Title An English Document Title may have a field tag of X1, Y1, or Z1', 'Spanish Document Title A Spanish Document Title may have a field tag of T1, Y1, or Z1', 'Portuguese Document Title A Portuguese Document Title may have a field tag of T1, X1, or Z1', 'Other Languages Document Title A document title may have a field tag of T1, X1, or Y1', 'Source', 'Language', 'Document Type', 'English Author Keywords', 'Spanish Author Keywords', 'Portuguese Author Keywords', 'Author Keywords Other Languages', 'English Abstract', 'Spanish Abstract', 'Portuguese Abstract', 'Abstract Other Languages', 'Addresses', 'E-mail Address', 'ResearcherID Number', 'ORCID Identifier Open Researcher and Contributor ID', 'Cited References', 'Cited Reference Count', 'scieloci_cited', 'scieloci_wos_cited', 'Usage Count Last 180 Days', 'Usage Count Since 2013', 'Publisher', 'Publisher City', 'Publisher Address', 'issn', 'Publication Date', 'Year Published', 'Volume', 'Issue', 'Beginning Page', 'Ending Page', 'Digital Object Identifier DOI', 'SciELO Categories', 'collection', 'Research Areas', 'pid', 'Open Access Indicator', 'ESI Highly Cited Paper. Note that this field is valued only for ESI subscribers.', 'ESI Hot Paper. Note that this field is valued only for ESI subscribers.', 'Date this report was generated.', 'issn_scielo', 'is_citable', 'one_o_more_scielo_cited', 'one_o_more_wos_cited'],


dtype='object')

Each row in this data regards a single document:

In [3]: wos.head(3).T

Out [3]:
Publication Type:                     J | J | J
Authors:                              Kellner, Alexander W.A.; Meneghini, Rogerio | Shi, Shuguo | Alves, Célia A.
Editors:                              NaN | NaN | NaN
English Document Title ...:           A Special year for the AABC | The hypersurfaces with conformal normal Gauss ... | Characterisation of solvent extractable organi...
Spanish Document Title ...:           NaN | NaN | NaN
Portuguese Document Title ...:        NaN | NaN | NaN
Other Languages Document Title ...:   NaN | NaN | NaN
Source:                               Anais da Academia Brasileira de Ciências (all three)
Language:                             English | English | English
Document Type:                        editorial | research-article | research-article
English Author Keywords:              NaN | fourth fundamental form; conformal normal Gaus... | atmospheric aerosol; gas chromatography-masss ...
Spanish Author Keywords:              NaN | NaN | NaN
Portuguese Author Keywords:           NaN | quarta forma fundamental; aplicação normal de ... | aerossol atmosférico; cromatografia gasosa-esp...
Author Keywords Other Languages:      NaN | NaN | NaN
English Abstract:                     NaN | In this paper, we introduce the fourth fundame... | In spite of accounting for 10-70% of the atmos...
Spanish Abstract:                     NaN | NaN | NaN
Portuguese Abstract:                  NaN | Neste artigo, introduzimos a quarta forma fund... | Apesar de constituirem 10-70% da massa do aero...
Abstract Other Languages:             NaN | NaN | NaN
Addresses:                            [Meneghini, Rogerio] BIREME | [Shi, Shuguo] Shandong University, China | [Alves, Célia A.] Universidade de Aveiro, Port...
E-mail Address:                       NaN | NaN | NaN
ResearcherID Number:                  Meneghini, Rogério/I-2961-2015; Kellner, Alexa... | NaN | Alves, Célia/E-7583-2013
ORCID Identifier ...:                 Kellner, Alexander/0000-0001-7174-9447 | NaN | Alves, Célia/0000-0003-3231-3186
Cited References:                     NaN | NaN | NaN
Cited Reference Count:                0 | 18 | 431
scieloci_cited:                       1 | 0 | 1
scieloci_wos_cited:                   1 | 0 | 62
Usage Count Last 180 Days:            0 | 0 | 3
Usage Count Since 2013:               1 | 1 | 48
Publisher:                            Academia Brasileira de Ciências (all three)
Publisher City:                       NaN | Rio de Janeiro | Rio de Janeiro
Publisher Address:                    NaN | Rio de Janeiro | Rio de Janeiro
issn:                                 1678-2690 (all three)
Publication Date:                     3 | 3 | 3
Year Published:                       2008 | 2008 | 2008
Volume:                               80 | 80 | 80
Issue:                                1 | 1 | 1
Beginning Page:                       1 | 3 | 21
Ending Page:                          1 | 19 | 82
Digital Object Identifier DOI:        10.1590/S0001-37652008000100001 | 10.1590/S0001-37652008000100002 | 10.1590/S0001-37652008000100003
SciELO Categories:                    Multidisciplinary Sciences (all three)
collection:                           SciELO Brazil (all three)
Research Areas:                       Science & Technology - Other Topics (all three)
pid:                                  SCIELO:S0001-37652008000100001 | SCIELO:S0001-37652008000100002 | SCIELO:S0001-37652008000100003
Open Access Indicator:                gold | gold | gold
ESI Highly Cited Paper ...:           NaN | NaN | NaN
ESI Hot Paper ...:                    NaN | NaN | NaN
Date this report was generated.:      NaN | NaN | NaN
issn_scielo:                          0001-3765 | 0001-3765 | 0001-3765
is_citable:                           0 | 1 | 1
one_o_more_scielo_cited:              1 | 0 | 1
one_o_more_wos_cited:                 1 | 0 | 1

And all data comes from the SciELO Brazil collection:

In [4]: wos["collection"].value_counts()

Out [4]: SciELO Brazil 210016 Name: collection, dtype: int64

The rows are from these two days:

In [5]: wos["Date this report was generated."].dropna().unique()

Out [5]: array(['2018-08-10', '2018-08-12'], dtype=object)

Are there empty columns?

In [6]: wos_is_empty = wos.isna().all()
        wos_is_empty[wos_is_empty].index.tolist()

Out [6]: ['Editors', 'E-mail Address', 'Cited References']

Yes! We no longer need these columns.
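Dropping them is a one-liner (an optional step, not in the original notebook):

    wos = wos.drop(columns=wos_is_empty[wos_is_empty].index)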

12.2 Joining with the documents reports

We can join the CSV files from the Web of Science coming from the SciELO CI with the public SciELO analytics reports by the documents’ PID. For example, we have that in the documents counts CSV:


In [7]: documents_counts = pd.read_csv("tabs_bra/documents_counts.csv")\
            .set_index("document publishing ID (PID SciELO)")
        documents_counts.head().T

Out [7]:
document publishing ID (PID SciELO):     S0100-879X1998000800010 | S0100-879X1998000800009 | S0100-879X1998000800005 | S0100-879X1998000800011 | S0100-879X1998000800006
extraction date:                         2018-09-13 (all five)
study unit:                              document (all five)
collection:                              scl (all five)
ISSN SciELO:                             0100-879X (all five)
ISSN's:                                  0100-879X;1414-431X (all five)
title at SciELO:                         Brazilian Journal of Medical and Biological Re... (all five)
title thematic areas:                    Biological Sciences; Health Sciences (all five)
title is agricultural sciences:          0 (all five)
title is applied social sciences:        0 (all five)
title is biological sciences:            1 (all five)
title is engineering:                    0 (all five)
title is exact and earth sciences:       0 (all five)
title is health sciences:                1 (all five)
title is human sciences:                 0 (all five)
title is linguistics, letters and arts:  0 (all five)
title is multidisciplinary:              0 (all five)
title current status:                    current (all five)
document publishing year:                1998 (all five)
document type:                           rapid-communication | rapid-communication | research-article | rapid-communication | research-article
document is citable:                     1 | 1 | 1 | 1 | 1
authors:                                 7 | 3 | 3 | 1 | 2
0 authors:                               0 | 0 | 0 | 0 | 0
1 author:                                0 | 0 | 0 | 1 | 0
2 authors:                               0 | 0 | 0 | 0 | 1
3 authors:                               0 | 1 | 1 | 0 | 0
4 authors:                               0 | 0 | 0 | 0 | 0
5 authors:                               0 | 0 | 0 | 0 | 0
+6 authors:                              1 | 0 | 0 | 0 | 0
pages:                                   3 | 3 | 8 | 2 | 4
references:                              23 | 13 | 35 | 20 | 28


However, the PID in the WoS data always starts with a SCIELO: prefix:

In [8]: wos["pid"].str.startswith("SCIELO:").all()

Out [8]: True

We can just remove that prefix:

In [9]: wos_pids = wos["pid"].str.slice(len("SCIELO:"))
        wos_pids.head()

Out [9]: 0    S0001-37652008000100001
         1    S0001-37652008000100002
         2    S0001-37652008000100003
         3    S0001-37652008000100004
         4    S0001-37652008000100005
         Name: pid, dtype: object
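With the prefix stripped, a row-by-row join between the two sources becomes possible. A minimal hedged sketch (wos_joined is a hypothetical name, not used below):

    # Enrich each WoS row with the SciELO analytics report columns
    wos_joined = wos.assign(pid=wos_pids).join(documents_counts, on="pid")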

12.2.1 Reference count

As an example of such a match, there’s a Cited Reference Count column in the WoS data. It’s not the number of citations a document received, but the number of references it has made. This should be equivalent to the references column in the documents_counts.csv SciELO analytics report. That matches for about 91% of the data.

In [10]: ref_counts = documents_counts["references"]
         wos_ref_counts = wos_pids.map(ref_counts)
         wos_ref_matches = wos_ref_counts == wos["Cited Reference Count"]
         wos_ref_matches.value_counts().sort_index(ascending=False)\
             .plot.barh(figsize=(12, 1),
                        title="Documents matching the reference counts")
         wos_ref_matches.value_counts()

Out [10]: True 191371 False 18645 dtype: int64

It’s not clear why the number of references differs but, as we can see, in most of the cases that don’t match, the reference count in the SciELO analytics report spreadsheet is the higher one; only 31 of these values are lower.

In [11]: different_ref_counts = (wos
             .assign(ref_counts=wos_ref_counts)
             [~wos_ref_matches]
             [["Cited Reference Count", "ref_counts"]]
         )
         print(different_ref_counts.shape)
         different_ref_counts.head()

(18645, 2)


Out [11]:      Cited Reference Count  ref_counts
          2                      431       432.0
          46                      48        49.0
          110                     12        13.0
          115                     50        51.0
          194                     41        42.0

In [12]: less_counts = different_ref_counts[different_ref_counts.iloc[:, 1] <
                                            different_ref_counts.iloc[:, 0]]
         print(less_counts.shape)
         less_counts.head()

(31, 2)

Out [12]:        Cited Reference Count  ref_counts
          30167                    101        24.0
          30221                     14         3.0
          49369                     39         0.0
          58588                     30        28.0
          58600                     21        15.0

12.3 Incoming citation fields

There are two fields regarding the number of citations each document received:
• scieloci_cited: number of citations coming from journals in the SciELO network;
• scieloci_wos_cited: number of citations coming from journals in the whole Web of Science platform.
These are respectively the TC and Z9 fields in the field description documentation[1], which states: TC: SciELO Citation Index Times Cited Count Z9: Total Times Cited Count (Web of Science Core Collection, BIOSIS Citation Index, Chinese Science Citation Database, Data Citation Index, Russian Science Citation Index, SciELO Citation Index) Since the SciELO data is in the Web of Science, every citation in the former should be counted in the latter, but there are some inconsistent rows:

In [13]: inconsistent_cites = wos[wos["scieloci_wos_cited"] < wos["scieloci_cited"]]
         inconsistent_cites.shape[0]

Out [13]: 48

Most (45) of these inconsistent entries have a small difference of 1:

In [14]: (inconsistent_cites["scieloci_cited"]
          - inconsistent_cites["scieloci_wos_cited"]).value_counts()

[1]https://images.webofknowledge.com/images/help/SCIELO/hs_selo_fieldtags.html


Out [14]: 1    45
          2     2
          3     1
          dtype: int64

To fix this, let’s simply use the maximum of these two fields as the number of WoS citations. Perhaps that’s not really the case, but it’s enough to avoid “negative” citations when subtracting the values, and that’s enough to build our analysis dataset:

In [15]: wos_cited = wos[["scieloci_wos_cited", "scieloci_cited"]].max(axis=1)
         dataset = wos.assign(
             pid=wos_pids,
             total=wos_cited,
             not_sci=wos_cited - wos["scieloci_cited"],
         ).set_index("pid").rename(columns={
             "Language": "lang",
             "Document Type": "type",
             "scieloci_cited": "sci",
         })[["lang", "type", "sci", "not_sci", "total"]]
         print(dataset.shape)
         dataset.head()

(210016, 5)

Out [15]:                             lang              type  sci  not_sci  total
          pid
          S0001-37652008000100001  English         editorial    1        0      1
          S0001-37652008000100002  English  research-article    0        0      0
          S0001-37652008000100003  English  research-article    1       61     62
          S0001-37652008000100004  English  research-article    0        0      0
          S0001-37652008000100005  English  research-article    0        9      9

There’s no empty entry in this dataset:

In [16]: dataset.dropna().shape

Out [16]: (210016, 5)

12.4 Overall statistics

As the median is zero (i.e., half of the data have no citations), it’s quite an odd dataset to see in a boxplot.

In [17]: dataset.describe()

Out [17]:                  sci        not_sci          total
          count  210016.000000  210016.000000  210016.000000
          mean        1.492482       2.098231       3.590712
          std         3.834625       5.293388       7.800272
          min         0.000000       0.000000       0.000000
          25%         0.000000       0.000000       0.000000
          50%         0.000000       0.000000       1.000000
          75%         2.000000       2.000000       4.000000
          max       767.000000     581.000000    1348.000000

In [18]: dataset.plot.box(vert=False, figsize=(12, 2));

Zooming in:

In [19]: dataset.plot.box(vert=False, figsize=(12, 2), xlim=[0, 20]);

The difference remains small even when filtering by research articles:

In [20]: research_articles = dataset[dataset["type"] == "research-article"]
         research_articles.describe()

Out [20]:                  sci        not_sci          total
          count  180780.000000  180780.000000  180780.000000
          mean        1.640004       2.224914       3.864919
          std         4.037042       5.388988       8.064238
          min         0.000000       0.000000       0.000000
          25%         0.000000       0.000000       0.000000
          50%         0.000000       1.000000       2.000000
          75%         2.000000       2.000000       5.000000
          max       767.000000     581.000000    1348.000000

In [21]: research_articles.plot.box(vert=False, figsize=(12, 2), xlim=[0, 20]);


Fact is, there are a lot of zeros. If we analyze only the research articles that have at least one citation, we get:

In [22]: cited_research_articles = research_articles[research_articles["total"] > 0]
         cited_research_articles.describe()

Out [22]:                  sci        not_sci          total
          count  121011.000000  121011.000000  121011.000000
          mean        2.450025       3.323830       5.773855
          std         4.728938       6.303373       9.280644
          min         0.000000       0.000000       1.000000
          25%         0.000000       1.000000       2.000000
          50%         1.000000       2.000000       3.000000
          75%         3.000000       4.000000       7.000000
          max       767.000000     581.000000    1348.000000

In [23]: cited_research_articles.plot.box(vert=False, figsize=(12, 2), xlim=[0, 20]);

This analysis shows that, for a publication in the SciELO network, we should expect more citations from outside the SciELO network than from inside it.

12.5 Correlation between the number of citations from SciELO and out from SciELO

The distribution of the difference of the number of citations is:

In [24]: citation_diff = dataset["sci"] - dataset["not_sci"]
         citation_diff.plot.hist(bins=500, figsize=(12, 4), density=True);

Zooming in:


In [25]: citation_diff.plot.hist(bins=500, figsize=(12, 4), density=True, xlim=[-30, 30]);

The mean is almost zero, but it’s negative, which means the number of citations in WoS not coming from SciELO is higher than the number of citations coming from SciELO:

In [26]: cd_mean = citation_diff.mean()
         cd_mean

Out [26]: -0.6057490857839403

Is the correlation between the SciELO citations and non-SciELO citations high? Note: In Jupyter Notebook, the text enclosed by $ in the table is rendered as math.

In [27]: r, pvalue = stats.pearsonr(dataset["sci"], dataset["not_sci"])
         pd.DataFrame([r, r*r, pvalue],
                      index=["$r$", "$r^2$", "$p$-value"],
                      columns=[""])

Out [27]:

$r$        0.446344
$r^2$      0.199223
$p$-value  0.000000

It’s a quite low value (r² ≈ 0.2), but significant (the p-value isn’t exactly zero, but it’s really low). As expected, r is positive: the more citations a publication has from one source (sci), the more we can expect from the other (not_sci). Technical note: Another way to perform the same calculation, without the p-value, is: dataset["sci"].corr(dataset["not_sci"]) As the correlation is low, we shouldn’t expect a linear relationship between these numbers of citations. Nevertheless, it’s worth seeing the number of citations from each source in a scatterplot.


In [28]: dataset.plot.scatter(x="sci", y="not_sci", figsize=(14, 14), title="Number of citations");

With the previously calculated mean, we know there are more citations coming from elsewhere than internal to the SciELO network, though the numbers aren’t much different.

12.6 Analysis of the categorical fields

Most data are in Portuguese:

In [29]: lang_counts = dataset["lang"].value_counts()
         lang_counts.plot.barh(figsize=(12, 3))
         pd.DataFrame(lang_counts.iloc[::-1])

Out [29]:


              lang
German          45
Italian         62
French         222
Spanish       4283
English      87995
Portuguese  117409

And most entries are research articles:

In [30]: type_counts = dataset["type"].value_counts()
         type_counts.plot.barh(figsize=(12, 6))
         pd.DataFrame(type_counts.iloc[::-1])

Out [30]:                        type
          news                     32
          addendum                 63
          undefined               191
          abstract                467
          correction              556
          press-release           623
          article-commentary      697
          brief-report           1763
          letter                 2199
          rapid-communication    2975
          book-review            3173
          review-article         4545
          case-report            4578
          editorial              7374
          research-article     180780


Splitting by both categories, we get:

In [31]: counts = (dataset
             .groupby(["lang", "type"])
             .size()
             .rename("count")
             .unstack("lang")
             .fillna(0)
             .astype(int)
         )
         pd.DataFrame(counts)

Out [31]: lang                 English  French  German  Italian  Portuguese  Spanish
          type
          abstract                  94       0       0        0         372        1
          addendum                  15       0       0        0          48        0
          article-commentary       300       0       0        0         384       13
          book-review              141       3       2        3        2910      114
          brief-report             814       1       1        2         930       15
          case-report             2791       0       0        1        1779        7
          correction               259       0       0        0         290        7
          editorial               2300       1       2        2        5004       65
          letter                  1488       0       0        0         700       11
          news                      16       0       0        0          16        0
          press-release             83       0       0        0         537        3
          rapid-communication     1727       2       0        2        1211       33
          research-article       75273     215      40       52      101225     3975
          review-article          2666       0       0        0        1841       38
          undefined                 28       0       0        0         162        1
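As a design note, the same contingency table could be obtained more directly with pd.crosstab:

    pd.crosstab(dataset["type"], dataset["lang"])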

These are far too many categories to analyze individually, and many of them are quite small. But we can see whether the language makes some difference in the citation statistics of research articles.

In [32]: ra_stats = research_articles.groupby("lang").describe().T
         ra_stats

Out [32]:


lang                  English      French     German    Italian     Portuguese      Spanish
not_sci count   75273.000000  215.000000  40.000000  52.000000  101225.000000  3975.000000
not_sci mean        3.409217    0.186047   0.100000   0.250000       1.404791     0.840503
not_sci std         7.316331    0.643291   0.378932   0.904780       3.181294     1.855559
not_sci min         0.000000    0.000000   0.000000   0.000000       0.000000     0.000000
not_sci 25%         0.000000    0.000000   0.000000   0.000000       0.000000     0.000000
not_sci 50%         1.000000    0.000000   0.000000   0.000000       0.000000     0.000000
not_sci 75%         4.000000    0.000000   0.000000   0.000000       2.000000     1.000000
not_sci max       581.000000    6.000000   2.000000   6.000000     192.000000    29.000000
sci     count   75273.000000  215.000000  40.000000  52.000000  101225.000000  3975.000000
sci     mean        1.059623    0.162791   0.025000   0.057692       2.105834     0.884780
sci     std         3.865018    0.577225   0.158114   0.235435       4.165023     1.975009
sci     min         0.000000    0.000000   0.000000   0.000000       0.000000     0.000000
sci     25%         0.000000    0.000000   0.000000   0.000000       0.000000     0.000000
sci     50%         0.000000    0.000000   0.000000   0.000000       1.000000     0.000000
sci     75%         1.000000    0.000000   0.000000   0.000000       3.000000     1.000000
sci     max       767.000000    4.000000   1.000000   1.000000     352.000000    39.000000
total   count   75273.000000  215.000000  40.000000  52.000000  101225.000000  3975.000000
total   mean        4.468840    0.348837   0.125000   0.307692       3.510625     1.725283
total   std         9.857795    0.997279   0.404304   1.057901       6.542915     3.397948
total   min         0.000000    0.000000   0.000000   0.000000       0.000000     0.000000
total   25%         0.000000    0.000000   0.000000   0.000000       0.000000     0.000000
total   50%         2.000000    0.000000   0.000000   0.000000       1.000000     0.000000
total   75%         6.000000    0.000000   0.000000   0.000000       4.000000     2.000000
total   max      1348.000000    9.000000   2.000000   7.000000     544.000000    62.000000

In [33]: ra_stats.xs("mean", level=1).drop("total").T \
             .plot.barh(figsize=(12, 6),
                        title="Average number of citations "
                              "for research articles (mean)");
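The .xs("mean", level=1) call above selects a cross-section of the MultiIndex that describe().T creates, keeping only the "mean" rows. A toy sketch of that behavior, with made-up values that are not in the notebook:

    import pandas as pd

    # describe().T yields a (column, statistic) MultiIndex on the rows;
    # xs("mean", level=1) keeps only the rows whose statistic is "mean".
    df = pd.DataFrame({"v": [1, 2, 3], "g": ["a", "a", "b"]})
    stats = df.groupby("g").describe().T
    stats.xs("mean", level=1)  # one row per described column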

13 Scopus 2017 - CiteScore, SNIP and SJR

In Scopus[1], we can download a single spreadsheet workbook with all the data they have (titles and metrics) regarding their free journal rankings and metrics, provided you're signed in. As of 2018-09-21, it's a 38MB XLSX file with a spreadsheet of metrics for each year.

In [1]: import openpyxl
        import pandas as pd
        import seaborn as sns
        pd.options.display.max_colwidth = 200  # Default is 50
        pd.options.display.max_rows = 200      # Default is 60
        %matplotlib inline

13.1 Opening the Excel File in Pandas

Pandas has a read_excel function that can use xlrd to read a spreadsheet from an old XLS file, loading its data into a Pandas DataFrame. However, we're not going to use it. In order to open an OOXML file containing spreadsheets from Microsoft Excel (a.k.a. XLSX) in Python, we'll need another library. There's a web page[2] listing which packages were created to deal with MS Excel files, stating we should use the openpyxl[3] library to load the data we've got. Which spreadsheets are in the Scopus spreadsheet workbook?

In [2]: workbook_filename = "CiteScore_Metrics_2011-2017_Download_25May2018.xlsx"
        wb = openpyxl.load_workbook(workbook_filename)
        wb.sheetnames

Out [2]: ['About CiteScore', '2017 All', 'Sheet1', '2016 All', '2015 All', '2014 All', '2013 All', '2012 All', '2011 All', 'ASJC Codes']
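As an aside, depending on the pandas version installed, read_excel can also open XLSX files by delegating to openpyxl internally. A minimal sketch of that alternative, not used in this notebook (the header=1 value is an assumption based on the worksheet layout seen below, where the first row holds an informative string rather than the column names):

    scopus2017_alt = pd.read_excel(
        "CiteScore_Metrics_2011-2017_Download_25May2018.xlsx",
        sheet_name="2017 All",   # same worksheet used below
        header=1,                # assumed: row 0 is an info line, row 1 holds the headers
        engine="openpyxl",       # requires a pandas release that accepts this engine
    )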

For now, we’re mainly interested in the 2017 worksheet. Let’s see it. In [3]: ws2017 = wb["2017 All"]

There’s a documentation[4] on how to convert such a worksheet object to a Pandas DataFrame instance (as well as the other way around). In [4]: data_gen = ws2017.values info = next(data_gen) header, *data = data_gen scopus2017 = pd.DataFrame(data, columns=header).dropna(how="all")

In [5]: print(info[0])
        print(scopus2017.shape)
        scopus2017.head().T

[1] https://www.scopus.com/sources
[2] http://www.python-excel.org
[3] https://openpyxl.readthedocs.io
[4] https://openpyxl.readthedocs.io/en/2.6/pandas.html


CiteScore metrics calculated using data from 30 April, 2018. SNIP and SJR calculated using data from 30 April, 2018
(50182, 21)

Out [5]: (columns 0 to 4)

Scopus SourceID                      28773 | 28773 | 19434 | 19434 | 19434
Title                                Ca-A Cancer Journal for Clinicians | Ca-A Cancer Journal for Clinicians | MMWR. Recommendations and reports : Morbidity ... | MMWR. Recommendations and reports : Morbidity ... | MMWR. Recommendations and reports : Morbidity ...
CiteScore                            130.47 | 130.47 | 63.12 | 63.12 | 63.12
Percentile                           99 | 99 | 99 | 99 | 99
Citation Count                       16961 | 16961 | 1010 | 1010 | 1010
Scholarly Output                     130 | 130 | 16 | 16 | 16
Percent Cited                        70 | 70 | 100 | 100 | 100
SNIP                                 88.164 | 88.164 | 32.534 | 32.534 | 32.534
SJR                                  61.786 | 61.786 | 34.638 | 34.638 | 34.638
RANK                                 1 | 1 | 1 | 1 | 1
Rank Out Of                          120 | 323 | 87 | 241 | 106
Publisher                            Wiley-Blackwell | Wiley-Blackwell | Centers for Disease Control and Prevention (CDC) | Centers for Disease Control and Prevention (CDC) | Centers for Disease Control and Prevention (CDC)
Type                                 Journal | Journal | Journal | Journal | Journal
OpenAccess                           NO | NO | YES | YES | YES
Scopus ASJC Code (Sub-subject Area)  2720 | 2730 | 2713 | 3306 | 2307
Scopus Sub-Subject Area              Hematology | Oncology | Epidemiology | Health (social science) | Health, Toxicology and Mutagenesis
Quartile                             Quartile 1 | Quartile 1 | Quartile 1 | Quartile 1 | Quartile 1
Top 10% (CiteScore Percentile)       Top 10% | Top 10% | Top 10% | Top 10% | Top 10%
Scopus SourceID                      https://www.scopus.com/sourceid/28773 | https://www.scopus.com/sourceid/28773 | https://www.scopus.com/sourceid/19434 | https://www.scopus.com/sourceid/19434 | https://www.scopus.com/sourceid/19434
Print-ISSN                           79235 | 79235 | 10575987 | 10575987 | 10575987
E-ISSN                               15424863 | 15424863 | 15458601 | 15458601 | 15458601

The first five entries regard just two journals; this duplication makes it clear we'll need some cleaning before we can use this data.

13.2 Splitting the data based on SciELO ISSNs

Our goal is to create a dataset based on the Scopus 2017 data with an extra boolean SciELO column, telling whether the journal belongs to the SciELO network or not.


13.2.1 Set of SciELO ISSNs

Based on the ISSN normalization notebook, we can get a full list of ISSNs in the SciELO network that are also in the analytics reports (including the independent and development collections) with:

In [6]: network_journals = pd.read_csv("tabs_network/journals.csv")
        issns_scielo = set(network_journals["ISSN SciELO"].str.upper().values) \
            .union(*network_journals["ISSN's"].dropna().str.split(";")
                   .apply(set).values) \
            .union({"0719-448X", "0870-8967", "1316-5216", "1667-8982",
                    "1683-0768", "1852-4184", "2183-9174", "2223-7666",
                    "2477-9555", "2993-6797"})
        len(issns_scielo)

Out [6]: 2303

That’s not the number of journals, but the number of distinct ISSNs. We’ve got the set of SciELO ISSNs, including the extra values that regards to ISSN normalization (for the 2018-09-14 reports version).

13.2.2 Normalizing the Scopus ISSN

We have two columns for the ISSN in the imported Scopus data; most values should be cast from integer to string, and there are several empty values out there:

In [7]: scopus2017_issns = pd.concat([scopus2017["Print-ISSN"],
                                      scopus2017["E-ISSN"]])
        scopus2017_issns_types = scopus2017_issns.apply(type)
        scopus2017_issns_types.value_counts()

Out [7]: <class 'float'>    63400
         <class 'int'>      30197
         <class 'str'>       6767
         dtype: int64

Regarding the ISSNs that are written as strings (mostly because of some letter, which should be X), not even the letter case is normalized:

In [8]: scopus2017_issns_str = scopus2017_issns[scopus2017_issns_types == str]
        print("Not equal to the lower (count): ",
              scopus2017_issns_str[scopus2017_issns_str
                                   != scopus2017_issns_str.str.lower()].size)
        print("Not equal to the upper (entries): ")
        scopus2017_issns_str[scopus2017_issns_str
                             != scopus2017_issns_str.str.upper()]

Not equal to the lower (count):  6522
Not equal to the upper (entries):

Out [8]: 37969    0322788x
         37970    0322788x
         25769    1558688x
         25770    1558688x
         26107    1558691x
         dtype: object

A single string entry has some noise, and no entry has the - separator:

In [9]: scopus2017_issns_str[~scopus2017_issns_str.str.contains(r"^[\dXx]{8}$")]


Out [9]: 48755    00304565;
         dtype: object

The integer entries might have fewer digits; they're probably just missing some leading zeros. There's no integer with more than 8 digits.

In [10]: scopus2017_issns_int = scopus2017_issns[scopus2017_issns_types == int]
         scopus2017_issns_int.min(), scopus2017_issns_int.max()

Out [10]: (10782, 87569728)

Then this function should be enough to normalize a single ISSN:

In [11]: def normalize_issn(issn):
             if isinstance(issn, int):
                 before, after = divmod(issn, 10000)
                 return f"{before:04d}-{after:04d}"
             if isinstance(issn, str):
                 return f"{issn[:4]}-{issn[4:8]}".upper()
             return ""
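As a sanity check, here's what the function does to each of the three value shapes we've just seen; this cell isn't in the original notebook, and the sample values are taken from the outputs above:

    normalize_issn(79235)         # '0007-9235' (integer missing leading zeros)
    normalize_issn("0322788x")    # '0322-788X' (lowercase string)
    normalize_issn(float("nan"))  # ''          (empty cell loaded as NaN)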

Let’s apply this normalization function and add the SciELO column: In [12]: scopus2017n = scopus2017.assign(**{ "Print-ISSN": scopus2017["Print-ISSN"].apply(normalize_issn), "E-ISSN": scopus2017["E-ISSN"].apply(normalize_issn), }).assign(SciELO=lambda df: df["Print-ISSN"].isin(issns_scielo) | df["E-ISSN"].isin(issns_scielo)) print(scopus2017n.shape) scopus2017n.loc[4095:20000:1570,["Print-ISSN", "E-ISSN", "SciELO"]]

(50182, 22)

Out [12]:        Print-ISSN     E-ISSN  SciELO
          4095    1330-0962              False
          5665    1932-6254              False
          7235    0742-0528              False
          8805    0074-0276  1678-8060    True
          10375   0716-9760  0717-6287    True
          11945   1941-9899  1941-9902   False
          13515   1542-0752  1542-0760   False
          15085   0167-2681              False
          16655   1413-8670               True
          18225   1094-6136              False
          19795   1364-985X  1467-8489   False

13.2.3 Data de-duplication

The same pair of ISSNs might appear more than once.

In [13]: issn_repeat_count = scopus2017n.groupby(["Print-ISSN", "E-ISSN"]) \
             .size().value_counts()
         issn_repeat_count.plot.bar(
             title="Number of lines with a print/electronic ISSN pair"
         )
         issn_repeat_count


Out [13]: 1      8434
          2      7538
          3      4261
          4      1957
          5       680
          6       277
          7        90
          8        20
          9         2
          13        1
          178       1
          dtype: int64

The 178 entries are the empty ones (they have data, but no ISSN). Such entries aren't in SciELO, since none of them has open access:

In [14]: scopus2017empty = scopus2017n[(scopus2017n["Print-ISSN"] == "") &
                                       (scopus2017n["E-ISSN"] == "")]
         print(scopus2017empty.shape)
         scopus2017empty.groupby("OpenAccess").size()

(178, 22)

Out [14]: OpenAccess
          NO    178
          dtype: int64

There are duplications in the Scopus SourceID, as well.

In [15]: scopus2017n.columns


Out [15]: Index(['Scopus SourceID', 'Title', 'CiteScore', 'Percentile', 'Citation Count', 'Scholarly Output', 'Percent Cited', 'SNIP', 'SJR', 'RANK', 'Rank Out Of', 'Publisher', 'Type', 'OpenAccess', 'Scopus ASJC Code (Sub-subject Area)', 'Scopus Sub-Subject Area', 'Quartile', 'Top 10% (CiteScore Percentile)', 'Scopus SourceID', 'Print-ISSN', 'E-ISSN', 'SciELO'], dtype='object')

In [16]: # The column name is duplicated, so selecting it yields a
         # two-column DataFrame; .iloc[:, -1] picks the last occurrence
         sid_repeat_count = scopus2017n["Scopus SourceID"].iloc[:, -1] \
             .value_counts().value_counts()
         sid_repeat_count.plot.bar(title="Number of lines with a Scopus SourceID")
         sid_repeat_count

Out [16]: 1     8473
          2     7574
          3     4284
          4     1961
          5      680
          6      274
          7       90
          8       20
          9        2
          13       1
          Name: Scopus SourceID, dtype: int64

That duplication happens mostly because multiple subject areas are stored as multiple lines for the same journal, and some features are specific to the subject area. We'll use just some selected columns, whose projection is enough to get rid of most duplicated entries.

In [17]: id_columns = ["Scopus SourceID", "Title", "Print-ISSN", "E-ISSN"]
         columns = ["CiteScore", "SNIP", "SJR", "OpenAccess", "SciELO"]
         dataset_with_ids = scopus2017n[id_columns + columns].drop_duplicates()
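To make the projection step concrete, here's a toy sketch (the sample values are made up, not taken from the Scopus file): once the per-subject-area columns are left out, drop_duplicates collapses the repeated lines of a journal into a single one.

    import pandas as pd

    rows = pd.DataFrame({"SourceID": [28773, 28773],
                         "Area": ["Hematology", "Oncology"],  # differs per line
                         "CiteScore": [130.47, 130.47]})      # same per journal
    rows[["SourceID", "CiteScore"]].drop_duplicates()  # a single line remains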


Actually, the Scopus SourceID becomes unique:

In [18]: dataset_with_ids["Scopus SourceID"].iloc[:, -1] \
             .value_counts().value_counts()

Out [18]: 1 23359 Name: Scopus SourceID, dtype: int64

But not the ISSNs. Disregarding the entries without any ISSN, these are the ISSN duplications:

In [19]: dpi_issns_sizes = dataset_with_ids.groupby(["Print-ISSN", "E-ISSN"]).size()
         dpi_issns_duplicated = dpi_issns_sizes[dpi_issns_sizes > 1].drop(("", ""))
         dataset_with_ids.reset_index() \
             .set_index(["Print-ISSN", "E-ISSN"]) \
             .loc[dpi_issns_duplicated.index.tolist()]

Out [19]:

Print-ISSN  E-ISSN     index  Scopus SourceID  Title                                               CiteScore   SNIP    SJR  OpenAccess  SciELO
            2036-5438  30252  21100790340      Oxford Medical Case Reports                              0.62  0.434  0.178  YES         False
            2036-5438  41080  21100786380      Perspectives on Federalism                               0.20  0.395  0.107  YES         False
0021-4922              19594  130262           Japanese Journal of Applied Physics, Part 1: R...        1.28  0.668  0.497  NO          False
0021-4922              21457  28117            Japanese Journal of Applied Physics                      1.13  0.865  0.371  NO          False
0584-8555              22571  20500195421      Chemical Modelling                                       1.06  0.624  0.464  NO          False
0584-8555              32768  21100201539      Spectroscopic Properties of Inorganic and Orga...        0.50  0.163  0.172  NO          False
1672-5123              20473  130135           International Journal of Ophthalmology                   1.21  0.696  0.576  YES         False
1672-5123              47991  21100391400      International Eye Science                                0.03  0.021  0.109  YES         False
1875-3507              40297  21100203922      IUTAM Bookseries                                         0.22  0.312  0.144  NO          False
1875-3507              43902  21100201921      Solid Mechanics and its Applications                     0.12  0.176  0.114  NO          False
2186-7275   2423-8686  34267  21100778849      Southeast Asian Studies                                  0.44  0.868  0.162  YES         False
2186-7275   2423-8686  48882  26510            Japanese Journal of Southeast Asian Studies              0.00  0      0.101  YES         False

(The second Scopus SourceID column, with the https://www.scopus.com/sourceid/<id> URLs, is omitted above; the <id> part matches the numeric Scopus SourceID.)


The 2036-5438 and 1672-5123 pairs had been seen in the SCImagoJR analysis notebook: the former is probably two distinct sources, while the latter seems to be two distinct translations of the same source title in Chinese, perhaps regarding distinct moments of the journal. The Japanese Journal of Applied Physics appears twice, as does the Japanese Journal of Southeast Asian Studies. Some normalization is still required here. However, these are no more than 5 entries in 23359 rows, and it's quite difficult to know what's going on with these duplications or which value should be regarded as correct for each column. For now, we can live with this noise, but we could have removed some rows based on index with something like:

dataset_with_ids.drop([47991, 48882], inplace=True)

Where the numbers are the set of index values to be removed. We no longer need the ID columns, so this is our dataset:

In [20]: dataset = dataset_with_ids[columns]
         print(dataset.shape)
         dataset.head()

(23359, 5)

Out [20]:    CiteScore    SNIP     SJR  OpenAccess  SciELO
          0     130.47  88.164  61.786          NO   False
          2      63.12  32.534  34.638         YES   False
          6      51.08   11.97  23.414          NO   False
          7      39.42   7.967  17.633          NO   False
          8      36.13   19.73  33.557          NO   False

A description of the CiteScore, SNIP and SJR columns can be found in the Scopus support/help web page[5]. There's no empty field in this dataset:

In [21]: dataset.dropna().shape

Out [21]: (23359, 5)

13.2.4 Consistency in the SciELO and OpenAccess columns

All SciELO entries should be open, since open access is a criterion for membership in the SciELO network. Yet, some rows in the Scopus data are inconsistent regarding this constraint:

In [22]: dataset_oscounts = dataset.groupby(["OpenAccess", "SciELO"]).size() \
             .rename("count")
         dataset_oscounts.unstack("OpenAccess") \
             .plot.barh(figsize=(12, 2), title="Number of journals")
         pd.DataFrame(dataset_oscounts)

Out [22]:                     count
          OpenAccess  SciELO
          NO          False   19152
          NO          True      160
          YES         False    3552
          YES         True      495

[5]https://service.elsevier.com/app/answers/detail/a_id/14834/supporthub/scopus/


That is, there are journals marked as without open access in Scopus, but whose ISSN is in the SciELO network. As it seems, most titles match the ones in the SciELO data (the empty rows need further normalization to be properly matched).

In [23]: dataset_ids = dataset_with_ids[(dataset["OpenAccess"] == "NO") &
                                        dataset["SciELO"]][id_columns]
         dataset_ids_with_scielo_titles = \
             dataset_ids.join(network_journals.set_index("ISSN SciELO")
                              ["title at SciELO"].rename("P-SciELO"),
                              on="Print-ISSN") \
                        .join(network_journals.set_index("ISSN SciELO")
                              ["title at SciELO"].rename("E-SciELO"),
                              on="E-ISSN")
         pd.concat([
             dataset_ids_with_scielo_titles["Title"],
             (dataset_ids_with_scielo_titles["P-SciELO"].fillna("") +
              dataset_ids_with_scielo_titles["E-SciELO"].fillna("")
             ).rename("Title in SciELO"),
         ], axis=1).drop_duplicates()

Out [23]:        Title                                              | Title in SciELO
          4915   Bulletin of the World Health Organization         | Bulletin of the World Health Organization
          17404  Journal of the Brazilian Society of Mechanical... | Journal of the Brazilian Society of Mechanical...
          18118  Annals of Hepatology                               | Annals of Hepatology
          19624  Journal of Applied Research and Technology        | Journal of applied research and technology
          20659  Atmosfera                                          | Atmósfera
          20972  Revista Latinoamericana de Psicologia              | Revista Latinoamericana de Psicología
          21328  Ameghiniana                                        | Ameghiniana
          21565  Theoretical and Experimental Plant Physiology      | Theoretical and Experimental Plant Physiology
          22740  Revista Mexicana de Ingeniera Qumica               | Revista mexicana de ingeniería química
          23269  South African Journal of Animal Sciences           | South African Journal of Animal Science
          23646  Actas Urologicas Espanolas                         | Actas Urológicas Españolas
          24376  Neotropical Entomology                             | Neotropical Entomology
          24751  Revista de Investigacion Clinica                   | Revista de investigación clínica
          24962  Journal of the Mexican Chemical Society            | Journal of the Mexican Chemical Society
          24966  Cuadernos de Psicologia del Deporte                | Cuadernos de Psicología del Deporte
          25117  Acta Scientiarum - Agronomy                        |
          25618  Revista Brasileira de Botanica                     | Brazilian Journal of Botany
          25722  African Journal of Laboratory Medicine             | African Journal of Laboratory Medicine
          26643  Revista Mexicana de Astronomia y Astrofisica       | Revista mexicana de astronomía y astrofísica
          27407  Medicina Intensiva                                 | Medicina Intensiva
          28808  European Journal of Psychiatry                     | The European Journal of Psychiatry
          28890  Revista Colombiana de Estadistica                  | Revista Colombiana de Estadística


          29540  Revista Mexicana de Analisis de la Conducta        | Revista mexicana de análisis de la conducta
          29591  Geofisica International                            | Geofísica internacional
          29717  Madera Bosques                                     | Madera y bosques
          29737  Revista de la Union Matematica Argentina           | Revista de la Unión Matemática Argentina
          30566  Computacion y Sistemas                             | Computación y Sistemas
          30884  Revista de la Asociacion Geologica Argentina       | Revista de la Asociación Geológica Argentina
          30957  Mastozoologia Neotropical                          | Mastozoología neotropical
          31690  Journal of Integrated Coastal Zone Management      | Revista de Gestão Costeira Integrada
          32068  Ciencia e Tecnica Vitivinicola                     | Ciência e Técnica Vitivinícola
          32189  Salud Mental                                       | Salud mental
          32746  Politica y Gobierno                                | Política y gobierno
          32747  Acta Scientiarum - Animal Sciences                 |
          33470  Acta Botanica Mexicana                             | Acta botánica mexicana
          33507  Revista Chapingo, Serie Horticultura               | Revista Chapingo. Serie horticultura
          33514  Comunicacion y Sociedad (Mexico)                   | Comunicación y sociedad
          33755  Revista Mexicana de Trastornos Alimentarios        | Revista mexicana de trastornos alimentarios
          34230  Ciencia e Tecnologia dos Materiais                 | Ciência & Tecnologia dos Materiais
          34304  Revista Mexicana de Fisica                         | Revista mexicana de física
          34318  Informacion Tecnologica                            | Información tecnológica
          34586  Agrociencia                                        | Agrociencia
          35312  Revista Colombiana de Cancerologia                 | Revista Colombiana de Cancerología
          35341  Journal of the South African Institution of Ci...  | Journal of the South African Institution of Ci...
          35365  Archivos Latinoamericanos de Nutricion             | Archivos Latinoamericanos de Nutrición
          35393  CT y F - Ciencia, Tecnologia y Futuro              | CT&F - Ciencia, Tecnología y Futuro
          35631  Neurocirugia                                       | Neurocirugía
          35869  Dynamis                                            | Dynamis
          35988  Revista Enfermagem                                 | Revista Enfermagem UERJ
          36086  Gaceta Medica de Mexico                            | Gaceta médica de México
          36176  Revista Chilena de Infectologia                    | Revista chilena de infectología
          36215  Revista Fitotecnia Mexicana                        | Revista fitotecnia mexicana
          36223  Revista Colombiana de Anestesiologia               | Revista Colombiana de Anestesiología
          36253  Cuadernos de Desarrollo Rural                      | Cuadernos de Desarrollo Rural
          36383  Anales del Sistema Sanitario de Navarra            | Anales del Sistema Sanitario de Navarra
          36613  Archivos de Cardiologia de Mexico                  | Archivos de cardiología de México
          36673  Revista Cubana de Educacion Medica Superior        | Educación Médica Superior
          37092  Revista Iberoamericana de Educacion Superior       | Revista iberoamericana de educación superior
          37177  Revista Mexicana de Sociologia                     | Revista mexicana de sociología
          37322  Ensayos Sobre Politica Economica                   | Ensayos sobre POLÍTICA ECONÓMICA
          37501  Revista Colombiana de Psiquiatria                  | Revista Colombiana de Psiquiatría
          37585  Revista Colombiana de Entomologia                  | Revista Colombiana de Entomología
          37719  Investigacion Clinica                              | Investigación Clínica
          37749  Interciencia                                       | Interciencia
          37760  Archivos de la Sociedad Espanola de Oftalmologia   | Archivos de la Sociedad Española de Oftalmología
          37956  Revista Portuguesa de Saude Publica                | Revista Portuguesa de Saúde Pública
          38048  Archivos Espanoles de Urologia                     | Archivos Españoles de Urología (Ed. impresa)
          38327  Revista Internacional de Contaminacion Ambiental   | Revista internacional de contaminación ambiental
          38357  Revista de Salud Publica                           | Revista de Salud Pública
          38513  Hidrobiologica                                     | Hidrobiológica
          38675  Revista Mexicana de Investigacion Educativa        | Revista mexicana de investigación educativa
          38716  Perfiles Educativos                                | Perfiles educativos
          38773  Educacion Quimica                                  | Educación química


          38797  Infectio                                           | Infectio
          38968  Investigacion Economica                            | Investigación económica
          39085  Temas em Psicologia                                | Temas em Psicologia
          39246  Acta Colombiana de Psicologia                      | Acta Colombiana de Psicología
          39304  Cuadernos de Administracion                        | Cuadernos de Administración
          39385  Boletin Cientifico del Centro de Museos            | Boletín Científico. Centro de Museos. Museo de...
          39589  Perspectivas em Ciencia da Informacao              | Perspectivas em Ciência da Informação
          39798  Ginecologia y Obstetricia de Mexico                | Ginecología y obstetricia de México
          39917  Bioagro                                            | Bioagro
          40004  Signos Historicos                                  | Signos históricos
          40173  Revista Cubana de Salud Publica                    | Revista Cubana de Salud Pública
          40196  Tydskrift vir Geesteswetenskappe                   | Tydskrif vir Geesteswetenskappe
          40460  Desarrollo y Sociedad                              | Desarrollo y Sociedad
          40727  Revista Latinoamericana de Derecho Social          | Revista latinoamericana de derecho social
          40855  Revista Escola de Minas                            | Rem: Revista Escola de Minas
          41001  Comunicacoes Geologicas                            | Comunicações Geológicas
          41369  Revista Latinoamericana de Investigacion en Ma...  | Revista latinoamericana de investigación en ma...
          41444  Revista Brasileira de Geofisica                    | Revista Brasileira de Geofísica
          41665  Analise Psicologica                                | Análise Psicológica
          41812  Transactions of the South African Institute of...  |
          41830  Biocell                                            | Biocell
          41863  Online Brazilian Journal of Nursing                | Online Brazilian Journal of Nursing
          41984  Revista Latinoamericana de Metalurgia y Materi...  | Revista Latinoamericana de Metalurgia y Materi...
          42017  Revista Mexicana de Ingenieria Biomedica           | Revista mexicana de ingeniería biomédica
          42343  Salud Uninorte                                     | Revista Salud Uninorte
          42376  Revista Brasileira de Orientacao Profissional      | Revista Brasileira de Orientação Profissional
          42665  Revista de Pedagogia                               | Revista de Pedagogía
          42751  Revista Gerencia y Politicas de Salud              | Revista Gerencia y Políticas de Salud
          42762  Revista Lasallista de Investigacion                | Revista Lasallista de Investigación
          42807  Boletin de Malariologia y Salud Ambiental          | Boletín de Malariología y Salud Ambiental
          42901  Gestion y Politica Publica                         | Gestión y política pública
          42970  Analisis Politico                                  | Análisis Político
          43182  Anuario Mexicano de Derecho Internacional          | Anuario mexicano de derecho internacional
          43264  Revista de la Facultad de Ingenieria               | Revista de la Facultad de Ingeniería Universid...
          43266  Revista Tecnica de la Facultad de Ingenieria U...  | Revista Técnica de la Facultad de Ingeniería U...
          43461  Revista Portuguesa de Imunoalergologia             | Revista Portuguesa de Imunoalergologia
          43639  Revista Venezolana de Gerencia                     | Revista Venezolana de Gerencia
          43726  Revista Cubana de Obstetricia y Ginecologia        | Revista Cubana de Obstetricia y Ginecología
          43998  Avaliacao Psicologica                              | Avaliação Psicológica
          44034  Revista Colombiana de Reumatologia                 | Revista Colombiana de Reumatología
          44051  Revista Colombiana de Obstetricia y Ginecologia    | Revista Colombiana de Obstetricia y Ginecología
          44170  Revista Colombiana de Gastroenterologia            | Revista Colombiana de Gastroenterologia
          44435  Revista Cientifica de la Facultad de Ciencias ...  | Revista Científica
          44459  Avances en Odontoestomatologia                     | Avances en Odontoestomatología
          44543  Agroalimentaria                                    | Agroalimentaria
          44584  Revista de la Facultad de Agronomia                | Revista de la Facultad de Agronomía
          44632  E-Journal of Portuguese History                    | e-Journal of Portuguese History
          44710  Problema                                           | Problema anuario de filosofía y teoría del der...
          44792  Revista de la Sociedad Espanola del Dolor          | Revista de la Sociedad Española del Dolor


          44926  Literatura y Linguistica                           | Literatura y lingüística
          44970  Acta Theologica                                    | Acta Theologica
          45305  Salus                                              | Salus
          45333  Cuadernos del Cendes                               | Cuadernos del Cendes
          45431  Acta Botanica Venezuelica                          | Acta Botánica Venezuelica
          45473  Revista Mexicana de Cardiologia                    | Revista mexicana de cardiología
          45669  Medicina Interna de Mexico                         | Medicina interna de México
          45678  Revista de Obstetricia y Ginecologia de Venezuela  | Revista de Obstetricia y Ginecología de Venezuela
          45795  Revista de Estudios Historico-Juridicos            | Revista de estudios histórico-jurídicos
          45802  Boletin Mexicano de Derecho Comparado              | Boletín mexicano de derecho comparado
          45805  Revista de Antropologia                            | Revista de Antropologia
          45837  Andamios: Revista de Investigacion Social          | Andamios
          45977  Revista da Abordagem Gestaltica                    | Revista da Abordagem Gestáltica
          46001  Vniversitas                                        | Vniversitas
          46125  PSICOLOGIA                                         | Psicologia
          46335  Revista Brasileira de Cardiologia Invasiva         |
          46356  Opcion                                             | Opción (Maracaibo)
          46356  Opcion                                             | Opción
          46624  Revista del Instituto Nacional de Enfermedades...  | Revista del Instituto Nacional de Enfermedades...
          46745  Arete                                              | Areté
          46821  Revista Venezolana de Oncologia                    | Revista Venezolana de Oncología
          46822  Boletin de Linguistica                             | Boletin de Linguistica
          46874  Kasmera                                            | Kasmera
          46902  Vitae                                              | Vitae
          46907  Bitacora Urbano Territorial                        | Bitácora Urbano Territorial
          47108  Revista de la Asociacion Espanola de Especiali...  | Revista de la Asociación Española de Especiali...
          47229  Salud (i) Ciencia                                  | Salud(i)ciencia
          47728  Revista Cubana de Ortopedia y Traumatologia        | Revista Cubana de Ortopedia y Traumatología
          47821  Revista de Filosofia (Venzuela)                    | Revista de Filosofía
          47892  Tempo Psicanalitico                                | Tempo psicanalitico
          47913  Tzintzun                                           | Tzintzun. Revista de estudios históricos
          47915  Signos Filosoficos                                 | Signos filosóficos
          47963  Discusiones Filosoficas                            | Discusiones Filosóficas
          48182  Cuadernos de Medicina Forense                      | Cuadernos de Medicina Forense
          48649  Ciencia da Informacao                              | Ciência da Informação
          48888  Desarrollo Economico: Revista de Ciencias Soci...  | Desarrollo Económico (Buenos Aires)
          48965  Boletin Tecnico/Technical Bulletin                 | Boletín Técnico
          49415  Archivos Venezolanos de Farmacologia y Terapeu...  | Archivos Venezolanos de Farmacología y Terapéu...
          50144  Cogitare Enfermagem                                | Cogitare Enfermagem

We should regard these as open access journals. We can create a Type column with the SciELO, Not SciELO (but open) and Closed types, which should fix this issue.

In [24]: datasetf = dataset.assign(
             Type=dataset.apply(lambda row: "SciELO" if row["SciELO"] else (
                 "Not SciELO" if row["OpenAccess"] == "YES" else "Closed"
             ), axis=1)
         ).drop(columns=["OpenAccess", "SciELO"])
         print(datasetf.shape)
         datasetf.head()



(23359, 4)

Out [24]:    CiteScore    SNIP     SJR        Type
          0     130.47  88.164  61.786      Closed
          2      63.12  32.534  34.638  Not SciELO
          6      51.08   11.97  23.414      Closed
          7      39.42   7.967  17.633      Closed
          8      36.13   19.73  33.557      Closed

And now the total count makes more sense.

In [25]: dataset_tcounts = datasetf["Type"].value_counts()
         dataset_tcounts.plot.barh(figsize=(12, 2), title="Number of journals")
         pd.DataFrame(dataset_tcounts)

Out [25]:              Type
          Closed      19152
          Not SciELO   3552
          SciELO        655

13.2.5 CiteScore, SNIP and SJR

In a tidy format, our data becomes:

In [26]: datasetf_tidy = (
             datasetf
             .set_index("Type")
             .rename_axis("Measure", axis="columns")
             .stack()
             .rename("Value")
             .replace("-", None)  # Empty entries are marked with "-"
             .dropna()
             .astype(float)  # Required to avoid breaking Seaborn
             .reset_index()
         )
         print(datasetf_tidy.shape)
         datasetf_tidy.head()


(70077, 3)

Out [26]:          Type    Measure    Value
          0      Closed  CiteScore  130.470
          1      Closed       SNIP   88.164
          2      Closed        SJR   61.786
          3  Not SciELO  CiteScore   63.120
          4  Not SciELO       SNIP   32.534

Now we can have a boxplot of this data.

In [27]: sns.catplot(
             kind="box",
             data=datasetf_tidy,
             row="Measure",
             x="Value",
             y="Type",
             height=2,
             aspect=5,
         );

The huge outliers make it difficult to understand what's going on. Let's impose some limits to [0, 40] (we won't see these huge outliers).

In [28]: sns.catplot(
             kind="box",
             data=datasetf_tidy,
             row="Measure",
             x="Value",
             y="Type",
             height=2,
             aspect=5,
         ).set(xlim=[0, 40]);


It’s still too high. Seeing just [0, 5]: In [29]: sns.catplot( kind="box", data=datasetf_tidy, row="Measure", x="Value", y="Type", height=2, aspect=5, ).set(xlim=[0, 5]);


SciELO data seem to be either not properly referenced in the Scopus network (the ISSN normalization is an issue, and we saw lots of open access journals not marked as open), or there's some other reason for the smaller values of the SciELO-matching entries in Scopus. In the SCImagoJR analysis notebook, where the SJR field had been analyzed, SJR is higher for SciELO entries in most countries where SciELO has data, but mixing all the countries makes a huge difference.

14 Analyzing the SCImago Journal Rank in 2017

In [1]: import pandas as pd
        import seaborn as sns
        pd.options.display.max_colwidth = 200  # Default is 50
        %matplotlib inline

In the SCImago Journal Rank’s web site[1] we can get the journal rank in a format based on CSV (CSV stands for comma separated values, but the CSV-like files we can download from the SJR use commas as thousands separators and semi-colons as value separators), which can be directly loaded by the Pandas CSV reader function, requiring some extra parameters: In [2]: sjr2017scielo = pd.read_csv("scimagojr_2017_scielo.csv", sep=";", thousands=",", index_col="Rank") sjr2017open = pd.read_csv("scimagojr_2017_open.csv", sep=";", thousands=",", index_col="Rank")

The first few entries:

In [3]: sjr2017scielo.head().T

Out [3]: (Rank 1 to 5)

Sourceid                21100853560 | 15205 | 21100200421 | 21807 | 22596
Title                   African Journal of Disability | Memorias do Instituto Oswaldo Cruz | Journal of Soil Science and Plant Nutrition | Brazilian Journal of Infectious Diseases | Revista de Saude Publica
Type                    journal | journal | journal | journal | journal
Issn                    22267220, 22239170 | 00740276, 16788060 | 07189516 | 14138670 | 00348910
SJR                     1463 | 1172 | 823 | 817 | 807
SJR Best Quartile       Q1 | Q1 | Q1 | Q2 | Q2
H index                 4 | 76 | 24 | 37 | 65
Total Docs. (2017)      0 | 124 | 80 | 124 | 164
Total Docs. (3years)    7 | 438 | 235 | 401 | 351
Total Refs.             0 | 3682 | 2860 | 2919 | 1015
Total Cites (3years)    32 | 1082 | 521 | 616 | 570
Citable Docs. (3years)  6 | 425 | 235 | 315 | 334
Cites / Doc. (2years)   533 | 281 | 235 | 202 | 148
Ref. / Doc.             0 | 2969 | 3575 | 2354 | 619
Country                 South Africa | Brazil | Chile | Brazil | Brazil
Publisher               OpenJournals Publishing AOSIS (Pty) Ltd | Fundacao Oswaldo Cruz | Sociedad Chilena de la Ciencia del Suelo | Elsevier Editora Ltda | Universidade de Sao Paulo
Categories              Physical Therapy, Sports Therapy and Rehabilit... | Medicine (miscellaneous) (Q1); Microbiology (m... | Agronomy and Crop Science (Q1); Plant Science ... | Infectious Diseases (Q2); Microbiology (medica... | Medicine (miscellaneous) (Q2); Public Health, ...

[1]https://www.scimagojr.com/journalrank.php


They have a dedicated web page for help that includes the description of each field: https://www.scimagojr.com/help.php


The SJR column has the SCImago Journal Rank index we're here to analyze. The same web page has a PDF explaining the mathematics that defines it[2].

[2]https://www.scimagojr.com/SCImagoJournalRank.pdf


14.1 Do all SciELO entries have open access?

Yes! We can see this by comparing the numbers of distinct entries in the union of the dataframes.

In [5]: pd.DataFrame([
            ("Open", "all", sjr2017open.shape[0]),
            ("Open", "distinct ISSNs",
             sjr2017open["Issn"].drop_duplicates().size),
            ("Open", "distinct titles",
             sjr2017open["Title"].drop_duplicates().size),
            ("Open", "distinct title-ISSN pairs",
             sjr2017open[["Title", "Issn"]].drop_duplicates().shape[0]),
            ("SciELO", "all", sjr2017scielo.shape[0]),
            ("SciELO", "distinct ISSNs",
             sjr2017scielo["Issn"].drop_duplicates().size),
            ("SciELO", "distinct titles",
             sjr2017scielo["Title"].drop_duplicates().size),
            ("SciELO", "distinct title-ISSN pairs",
             sjr2017scielo[["Title", "Issn"]].drop_duplicates().shape[0]),
            ("Union of Open and SciELO", "all",
             pd.concat([sjr2017open.drop_duplicates(),
                        sjr2017scielo.drop_duplicates()])
               .drop_duplicates().shape[0]),
            ("Union of Open and SciELO", "distinct ISSNs",
             pd.concat([sjr2017open.drop_duplicates(),
                        sjr2017scielo.drop_duplicates()])["Issn"]
               .drop_duplicates().size),
            ("Union of Open and SciELO", "distinct titles",
             pd.concat([sjr2017open.drop_duplicates(),
                        sjr2017scielo.drop_duplicates()])["Title"]
               .drop_duplicates().size),
            ("Union of Open and SciELO", "distinct title-ISSN pairs",
             pd.concat([sjr2017open.drop_duplicates(),
                        sjr2017scielo.drop_duplicates()])[["Title", "Issn"]]
               .drop_duplicates().shape[0]),
        ], columns=["source", "selection", "count"]) \
          .set_index(["source", "selection"]) \
          .unstack("source")

Out [5]:


                                   count
source                      Open  SciELO  Union of Open and SciELO
selection
all                         4503     628                      4503
distinct ISSNs              4501     628                      4501
distinct title-ISSN pairs   4503     628                      4503
distinct titles             4502     628                      4502

Looking at the title-ISSN pairs, the dataframe with open access entries and the union of both dataframes have the same number of distinct entries, as expected. But we can see some ISSN duplication and title duplication in the Open dataframe.

14.2 Understanding the duplicates in the open access dataframe

These are the duplicated ISSNs:

In [6]: sjr2017open_size_gt1 = sjr2017open.groupby("Issn").size() > 1
        dupl_issns = sjr2017open_size_gt1[sjr2017open_size_gt1].index.tolist()
        dupl_issns

Out [6]: ['16725123', '20365438']

In [7]: sjr2017open[sjr2017open["Issn"].isin(dupl_issns)].T

Out [7]: (Rank 1154, 3223, 4153 and 4171)

Sourceid                130135 | 21100790340 | 21100391400 | 21100786380
Title                   International Journal of Ophthalmology | Oxford Medical Case Reports | International Eye Science | Perspectives on Federalism
Type                    journal | journal | journal | journal
Issn                    16725123 | 20365438 | 16725123 | 20365438
SJR                     576 | 178 | 109 | 107
SJR Best Quartile       Q2 | Q4 | Q4 | Q4
H index                 18 | 4 | 5 | 1
Total Docs. (2017)      336 | 90 | 626 | 26
Total Docs. (3years)    619 | 137 | 1990 | 20
Total Refs.             9780 | 838 | 8260 | 1254
Total Cites (3years)    745 | 85 | 69 | 3
Citable Docs. (3years)  545 | 122 | 1989 | 20
Cites / Doc. (2years)   134 | 71 | 3 | 15
Ref. / Doc.             2911 | 931 | 1319 | 4823
Country                 China | United States | China | Germany
Publisher               Press of International Journal of Ophthalmology | Oxford University Press | Press of International Journal of Ophthalmology | Walter De Gruyter
Categories              Ophthalmology (Q2) | Infectious Diseases (Q4); Microbiology (Q4); P... | Ophthalmology (Q4) | Law (Q4); Political Science and International ...


The 2036-5438 regards to Perspectives on Federalism, whereas Oxford Medical Case Reports should probably have been 2053-8855. The 1672-5123 entries look like the same journal: the titles are probably distinct translations of the same source title in Chinese, and the entries perhaps regard two timings of the same journal, but the different numbers for everything else make it really hard to normalize anything. For now, let's simply accept these as different journals. How about the duplicate title?

In [8]: sjr2017open_title_gt1 = sjr2017open.groupby("Title").size() > 1
        dupl_titles = sjr2017open_title_gt1[sjr2017open_title_gt1].index.tolist()
        dupl_titles

Out [8]: ['Alea']

In [9]: sjr2017open[sjr2017open["Title"] == "Alea"].T

Out [9]: (Rank 640 and 4448)

Sourceid                21100231200 | 12100157116
Title                   Alea | Alea
Type                    journal | journal
Issn                    19800436 | 1517106X
SJR                     934 | 100
SJR Best Quartile       Q2 | Q4
H index                 10 | 3
Total Docs. (2017)      10 | 42
Total Docs. (3years)    101 | 98
Total Refs.             272 | 718
Total Cites (3years)    54 | 1
Citable Docs. (3years)  101 | 82
Cites / Doc. (2years)   44 | 0
Ref. / Doc.             2720 | 1710
Country                 Brazil | Brazil
Publisher               Instituto Nacional de Matematica Pura e Aplicada | Universidade Federal do Rio de Janeiro
Categories              Statistics and Probability (Q2) | Language and Linguistics (Q4); Linguistics and...

It’s just a coincidence.

14.3 Getting the open access entries that aren’t in the SciELO dataframe

Since every field matches (except the Rank index) and every SciELO entry is among the open access entries, we can just get the symmetric difference.

In [10]: sjr2017openns = pd.concat([sjr2017open, sjr2017scielo], sort=False) \
             .drop_duplicates(keep=False)
         sjr2017openns.shape

Out [10]: (3875, 17)
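To see why drop_duplicates(keep=False) behaves as a set difference here, consider this toy sketch (made-up values): rows present in both frames get dropped entirely, so when one frame is a subset of the other, only the rows exclusive to the bigger frame survive.

    a = pd.DataFrame({"x": [1, 2, 3]})
    b = pd.DataFrame({"x": [2, 3]})
    pd.concat([a, b]).drop_duplicates(keep=False)  # only the x == 1 row remains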

We can build a full dataset from the CSV of open access entries, just including a new boolean SciELO column.

In [11]: dataset = pd.concat([
             sjr2017openns.assign(SciELO=False),
             sjr2017scielo.assign(SciELO=True),
         ])



14.4 Data from countries not in SciELO

The SCImago Journal Rank data for entries coming from SciELO regard just a few countries:

In [12]: scielo_countries = sjr2017scielo["Country"].unique()
         scielo_countries

Out [12]: array(['South Africa', 'Brazil', 'Chile', 'Spain', 'Mexico', 'United States', 'Argentina', 'Costa Rica', 'Netherlands', 'Colombia', 'Portugal', 'Cuba', 'Peru', 'Venezuela', 'Uruguay'], dtype=object)

This open dataset has a lot of other countries we won't be able to compare.

In [13]: dataset["Country"].unique()

Out [13]: array(['United States', 'Austria', 'United Kingdom', 'Germany', 'Sweden', 'Netherlands', 'France', 'Italy', 'New Zealand', 'Switzerland', 'Japan', 'Bulgaria', 'Canada', 'China', 'South Korea', 'Egypt', 'Finland', 'Spain', 'Australia', 'Belgium', 'Qatar', 'India', 'Turkey', 'Taiwan', 'Greece', 'Czech Republic', 'Hong Kong', 'Brazil', 'Poland', 'Denmark', 'Bangladesh', 'United Arab Emirates', 'Russian Federation', 'Hungary', 'Singapore', 'Saudi Arabia', 'Iran', 'Ukraine', 'Slovenia', 'Estonia', 'South Africa', 'Croatia', 'Ireland', 'Slovakia', 'Malaysia', 'Norway', 'Philippines', 'Lithuania', 'Argentina', 'Israel', 'Serbia', 'Oman', 'Bosnia and Herzegovina', 'Romania', 'Ethiopia', 'Azerbaijan', 'Portugal', 'Pakistan', 'Puerto Rico', 'Kazakhstan', 'Mexico', 'Bahrain', 'Tanzania', 'Malawi', 'Kuwait', 'Latvia', 'Montenegro', 'Indonesia', 'Nigeria', 'Thailand', 'Kenya', 'Chile', 'Iceland', 'Moldova', 'Venezuela', 'Macedonia', 'Libya', 'Colombia', 'Iraq', 'Jordan', 'Belarus', 'Jamaica', 'Nepal', 'Ghana', 'Rwanda', 'Morocco', 'Cuba', 'Sri Lanka', 'Malta', 'Brunei Darussalam', 'Fiji', 'Ecuador', 'Costa Rica', 'Peru', 'Uruguay'], dtype=object)

About a quarter of the open data not from SciELO come from a country that has SciELO data:

In [14]: openns_cf_count = (sjr2017openns["Country"]
             .isin(scielo_countries)
             .value_counts()
             .sort_index(ascending=False)
         )
         openns_cf_count.plot.barh(
             title="Number of entries not from SciELO with open access "
                   "split by the country belongingness in SciELO data",
             figsize=(12, 2),
         )
         openns_cf_count

Out [14]: True     1013
          False    2862
          Name: Country, dtype: int64


The H index and SJR aren’t much different in this data split: In [15]: sns.catplot( kind="box", data=sjr2017openns.assign( SciELO_Country=sjr2017openns["Country"].isin(scielo_countries) ), y="SciELO_Country", orient="h", sharey=False, x="H index", aspect=5, height=2, ).set(title="H index of entries not from SciELO with open access " "splitten by the country belongingness in SciELO data", );

In [16]: sns.catplot(
             kind="box",
             data=sjr2017openns.assign(
                 SciELO_Country=sjr2017openns["Country"].isin(scielo_countries)
             ),
             y="SciELO_Country", orient="h", sharey=False,
             x="SJR", aspect=5, height=2,
         ).set(title="SJR of entries not from SciELO with open access "
                     "split by the country belongingness in SciELO data");


14.5 Countries that have SciELO data in the SCImago Journal Rank

A proper comparison is difficult for some countries, since either almost all data is from SciELO, or almost all data isn't from it:

In [17]: # "cf" stands for Country-filtered
         dataset_cf = dataset[dataset["Country"].isin(scielo_countries)]
         dataset_cf_count = (dataset_cf
             .groupby(["Country", "SciELO"])
             .size()
             .unstack()
             .fillna(0)
             .astype(int)
         )
         dataset_cf_count.iloc[::-1].plot.barh(figsize=(12, 6), title="Count")
         dataset_cf_count

Out [17]: SciELO         False  True
          Country
          Argentina          6    35
          Brazil            91   209
          Chile             11    77
          Colombia           7    79
          Costa Rica         0     2
          Cuba               2    19
          Mexico            10    76
          Netherlands      191     4
          Peru               0     7
          Portugal          18    15
          South Africa      12    43
          Spain            216    38
          United States    444     6
          Uruguay            0     1
          Venezuela          5    17

The proportions are quite different: only Portugal and Brazil have a reasonably balanced split, so we should perhaps analyze just the data from these two countries. Venezuela has just 22 entries in total, and only 5 of them aren't from SciELO.


In [18]: proportions = (
             pd.concat([dataset_cf_count[True] / dataset_cf_count[False],
                        dataset_cf_count[False] / dataset_cf_count[True]],
                       axis=1)
             .min(axis=1)
             .sort_values(ascending=False)
             .rename("proportion")
         )
         proportions.iloc[::-1].plot.barh(
             figsize=(12, 6),
             title="Proportion $1:x$ for SciELO and non-SciELO data "
                   "(or the other way around)",
         )
         pd.DataFrame(proportions)

Out [18]:                proportion
          Country
          Portugal         0.833333
          Brazil           0.435407
          Venezuela        0.294118
          South Africa     0.279070
          Spain            0.175926
          Argentina        0.171429
          Chile            0.142857
          Mexico           0.131579
          Cuba             0.105263
          Colombia         0.088608
          Netherlands      0.020942
          United States    0.013514
          Uruguay          0.000000
          Peru             0.000000
          Costa Rica       0.000000

The H index and SJR for the overall data are greater in the open access content not from SciELO:


In [19]: sns.catplot(
             kind="box",
             data=dataset_cf,
             y="SciELO", orient="h", sharey=False,
             x="H index", aspect=5, height=2,
         ).set(title="H index of entries from countries that have SciELO data "
                     "split by the belongingness in SciELO data");

In [20]: sns.catplot(
             kind="box",
             data=dataset_cf,
             y="SciELO", orient="h", sharey=False,
             x="SJR", aspect=5, height=2,
         ).set(title="SJR of entries from countries that have SciELO data "
                     "split by the belongingness in SciELO data");

This difference can be mainly explained by the data from the United States and the Netherlands, countries which together have only 10 entries from SciELO. Most of the data from other countries behave the other way around (except Mexico and, for SJR, Portugal):

In [21]: dataset_cfne = dataset_cf[~dataset_cf["Country"]  # Not empty country
             .isin(proportions[proportions == 0].index)]

In [22]: sns.catplot(
             kind="box",
             data=dataset_cfne,
             row="Country",
             y="SciELO", orient="h", sharey=False,
             x="H index", aspect=7.4, height=1.2, sharex=False,
         ).set_titles("H Index in {row_name}") \
          .fig.tight_layout();




In [23]: sns.catplot(
             kind="box",
             data=dataset_cfne,
             row="Country",
             y="SciELO", orient="h", sharey=False,
             x="SJR", aspect=7.4, height=1.2, sharex=False,
         ).set_titles("SJR in {row_name}") \
          .fig.tight_layout();


15 Languages of research articles in SciELO Brazil

In [1]: import matplotlib.pyplot as plt
        import numpy as np
        import pandas as pd
        import seaborn as sns
        %matplotlib inline

15.1 Loading the data

In the column names simplification notebook we can find this function:

In [2]: def normalize_column_title(name):
            import re
            name_unbracketed = re.sub(r".*\((.*)\)", r"\1",
                                      name.replace("(in months)", "in_months"))
            words = re.sub("[^a-z0-9+_ ]", "",
                           name_unbracketed.lower()).split()
            ignored_words = ("at", "the", "of", "and",
                             "google", "scholar", "+")
            replacements = {
                "document": "doc",
                "documents": "docs",
                "frequency": "freq",
                "language": "lang",
            }
            return "_".join(replacements.get(word, word)
                            for word in words
                            if word not in ignored_words) \
                   .replace("title_is", "is")
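For example, these are plausible calls (the sample headers are assumptions based on the resulting column names, not copied from the CSV files):

    normalize_column_title("ISSN SciELO")                     # 'issn_scielo'
    normalize_column_title("Title is Agricultural Sciences")  # 'is_agricultural_sciences'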

Loading the documents_languages.csv regarding the SciELO Brazil collection, and applying the column names simplification function:

In [3]: dataset = pd.read_csv("tabs_bra/documents_languages.csv") \
            .rename(columns=normalize_column_title)
        print(dataset.shape)
        dataset.columns

(368491, 26)

Out [3]: Index(['extraction_date', 'study_unit', 'collection', 'issn_scielo', 'issns', 'title_scielo', 'title_thematic_areas', 'is_agricultural_sciences', 'is_applied_social_sciences', 'is_biological_sciences', 'is_engineering', 'is_exact_earth_sciences', 'is_health_sciences', 'is_human_sciences', 'is_linguistics_letters_arts', 'is_multidisciplinary', 'title_current_status', 'pid_scielo', 'doc_publishing_year', 'doc_is_citable', 'doc_type', 'doc_languages', 'doc_pt', 'doc_es', 'doc_en', 'doc_other_languages'], dtype='object')

In [4]: dataset.head(3).T

Out [4]:


(columns 0 to 2)

extraction_date              2018-09-13 | 2018-09-13 | 2018-09-13
study_unit                   document | document | document
collection                   scl | scl | scl
issn_scielo                  0100-879X | 0100-879X | 0100-879X
issns                        0100-879X;1414-431X | 0100-879X;1414-431X | 0100-879X;1414-431X
title_scielo                 Brazilian Journal of Medical and Biological Re... | Brazilian Journal of Medical and Biological Re... | Brazilian Journal of Medical and Biological Re...
title_thematic_areas         Biological Sciences; Health Sciences | Biological Sciences; Health Sciences | Biological Sciences; Health Sciences
is_agricultural_sciences     0 | 0 | 0
is_applied_social_sciences   0 | 0 | 0
is_biological_sciences       1 | 1 | 1
is_engineering               0 | 0 | 0
is_exact_earth_sciences      0 | 0 | 0
is_health_sciences           1 | 1 | 1
is_human_sciences            0 | 0 | 0
is_linguistics_letters_arts  0 | 0 | 0
is_multidisciplinary         0 | 0 | 0
title_current_status         current | current | current
pid_scielo                   S0100-879X1998000800006 | S0100-879X1998000800011 | S0100-879X1998000800005
doc_publishing_year          1998 | 1998 | 1998
doc_is_citable               1 | 1 | 1
doc_type                     research-article | rapid-communication | research-article
doc_languages                en | en | en
doc_pt                       0 | 0 | 0
doc_es                       0 | 0 | 0
doc_en                       1 | 1 | 1
doc_other_languages          0 | 0 | 0

15.2 Types of documents

Most documents are research articles; we'll continue by looking just at this subset of the data:

In [5]: doc_types_counts = dataset["doc_type"].value_counts()
        doc_types_counts.plot.barh(figsize=(12, 8),
                                   title="Number of documents by its type "
                                         "in the SciELO Brazil collection")
        pd.DataFrame(doc_types_counts)

Out [5]: doc_type
         research-article       308006
         editorial               13114
         case-report              7505
         book-review              6940
         review-article           6738
         rapid-communication      6627
         undefined                4908
         brief-report             3906
         letter                   3435
         abstract                 2930
         article-commentary       2613
         correction                785
         press-release             727
         addendum                  164
         news                       93



In [6]: dataset_ra = dataset[dataset["doc_type"] == "research-article"]

15.3 Set of languages

Each article is written in some set of languages, stored as ;-separated entries:

In [7]: dataset_ra["doc_languages"].unique()

Out [7]: array(['en', 'pt', 'es', 'fr', 'en;pt', 'pt;es', 'es;pt', 'fr;pt', 'en;es;pt', 'en;es', 'it', 'en;pt;es', 'it;pt', 'de;al', 'pt;la', 'de', 'fr;en;pt', 'de;pt', 'de;es', 'fr;en', 'en;it;pt'], dtype=object)

The distribution of research articles over the [disjoint] sets of languages they're written in is:

In [8]: langs_sets = dataset_ra["doc_languages"].str.lower().str.split(";").apply(set)
        doc_langs_counts = langs_sets.value_counts()
        doc_langs_counts.plot.barh(figsize=(12, 8),
                                   title="Number of documents by its set of languages "
                                         "in the SciELO Brazil collection")
        pd.DataFrame(doc_langs_counts)

Out [8]:


doc_languages
{pt}            163858
{en}            103199
{en, pt}         31065
{es}              5841
{es, en, pt}      2913
{es, en}           484
{fr}               346
{fr, pt}           106
{de}                64
{it}                61
{es, pt}            39
{it, pt}            11
{fr, en, pt}         6
{de, pt}             6
{la, pt}             3
{es, de}             1
{it, en, pt}         1
{de, al}             1
{fr, en}             1

We can say an article is multi-language if it's available in at least 2 languages.

15.4 Multiple languages in time

The quantity of articles with multiple languages seems to get higher when we see them by the publication year.

In [9]: dataset_ramf = dataset_ra.assign(
            multi_language=dataset_ra["doc_languages"].str.contains(";"),
        )
        np.trim_zeros(dataset_ramf.groupby("doc_publishing_year")["multi_language"]
                      .sum()
        ).plot.line(
            figsize=(8, 4),
            title="Count of research articles in multiple languages",
        );



In [10]: np.trim_zeros(dataset_ramf.groupby("doc_publishing_year")["multi_language"]
                       .mean()
         ).plot.line(
             figsize=(8, 4),
             title="Proportion of research articles in multiple languages",
         );
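The np.trim_zeros calls above just remove the all-zero years at both ends of the yearly series, so the line plots don't span decades without any multi-language document. A toy sketch with made-up values:

    counts = pd.Series([0, 0, 3, 0, 5, 0], index=range(1994, 2000))
    np.trim_zeros(counts)  # keeps 1996 to 1998, including the inner zero year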


Can we split by both the publishing and indexing years?

15.4.1 Getting the indexing year

The indexing year can only be found in the journal spreadsheet, in the inclusion_year_scielo column.

In [11]: journals = pd.read_csv("tabs_bra/journals.csv") \
             .rename(columns=normalize_column_title)
         print(journals.shape)
         journals.columns

(366, 98)

Out [11]: Index(['extraction_date', 'study_unit', 'collection', 'issn_scielo', 'issns', 'title_scielo', 'title_thematic_areas', 'is_agricultural_sciences', 'is_applied_social_sciences', 'is_biological_sciences', 'is_engineering', 'is_exact_earth_sciences', 'is_health_sciences', 'is_human_sciences', 'is_linguistics_letters_arts', 'is_multidisciplinary', 'title_current_status', 'title_subtitle_scielo', 'short_title_scielo', 'short_iso', 'title_pubmed', 'publisher_name', 'use_license', 'alpha_freq', 'numeric_freq_in_months', 'inclusion_year_scielo', 'stopping_year_scielo', 'stopping_reason', 'date_first_doc', 'volume_first_doc', 'issue_first_doc', 'date_last_doc', 'volume_last_doc', 'issue_last_doc', 'total_issues', 'issues_2018', 'issues_2017', 'issues_2016', 'issues_2015', 'issues_2014', 'issues_2013', 'total_regular_issues', 'regular_issues_2018', 'regular_issues_2017', 'regular_issues_2016', 'regular_issues_2015', 'regular_issues_2014', 'regular_issues_2013', 'total_docs', 'docs_2018', 'docs_2017', 'docs_2016', 'docs_2015', 'docs_2014', 'docs_2013', 'citable_docs', 'citable_docs_2018', 'citable_docs_2017', 'citable_docs_2016', 'citable_docs_2015', 'citable_docs_2014', 'citable_docs_2013', 'portuguese_docs_2018', 'portuguese_docs_2017', 'portuguese_docs_2016', 'portuguese_docs_2015', 'portuguese_docs_2014', 'portuguese_docs_2013', 'spanish_docs_2018', 'spanish_docs_2017', 'spanish_docs_2016', 'spanish_docs_2015', 'spanish_docs_2014', 'spanish_docs_2013', 'english_docs_2018', 'english_docs_2017', 'english_docs_2016', 'english_docs_2015', 'english_docs_2014', 'english_docs_2013', 'other_lang_docs_2018', 'other_lang_docs_2017', 'other_lang_docs_2016', 'other_lang_docs_2015', 'other_lang_docs_2014', 'other_lang_docs_2013', 'h5_2018', 'h5_2017', 'h5_2016', 'h5_2015', 'h5_2014', 'h5_2013', 'm5_2018', 'm5_2017', 'm5_2016', 'm5_2015', 'm5_2014', 'm5_2013'], dtype='object')

This is the joined dataset:

In [12]: mdataset = pd.merge(dataset, journals, on="issn_scielo", how="left")
         mdataset.shape

Out [12]: (368491, 123)

Fields with an _x suffix regard the document, whereas fields with _y regard the journal. Fields that aren't in both dataframes appear without any extra suffix.

In [13]: mdataset.columns

Out [13]:


Index(['extraction_date_x', 'study_unit_x', 'collection_x', 'issn_scielo', 'issns_x', 'title_scielo_x', 'title_thematic_areas_x', 'is_agricultural_sciences_x', 'is_applied_social_sciences_x', 'is_biological_sciences_x', ... 'h5_2016', 'h5_2015', 'h5_2014', 'h5_2013', 'm5_2018', 'm5_2017', 'm5_2016', 'm5_2015', 'm5_2014', 'm5_2013'], dtype='object', length=123)
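A toy sketch of that suffix behavior (hypothetical columns, not the real spreadsheets): overlapping names get _x for the left frame and _y for the right one, while exclusive names stay unchanged.

    left = pd.DataFrame({"issn_scielo": ["0100-879X"], "collection": ["doc"]})
    right = pd.DataFrame({"issn_scielo": ["0100-879X"],
                          "collection": ["journal"], "h5_2018": [7]})
    pd.merge(left, right, on="issn_scielo", how="left").columns.tolist()
    # ['issn_scielo', 'collection_x', 'collection_y', 'h5_2018']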

15.4.2 Document count by indexing year and publication year

We can see the quantity of documents by the year of journal indexing and the year of document publication.

In [14]: years_mdataset = (mdataset
             .groupby(["inclusion_year_scielo", "doc_publishing_year"])
             .size()
             .unstack("doc_publishing_year")
             .fillna(0)
             .astype(int)
         )
         plt.figure(figsize=(12, 6))
         sns.heatmap(years_mdataset, cmap="magma") \
             .set(title="Document count by journal indexing year "
                        "and document publication year");

The same map, but only for 2007 onwards:

In [15]: plt.figure(figsize=(14, 6))
         sns.heatmap(years_mdataset.loc[2007:, 2007:], cmap="magma",
                     annot=True, fmt="g") \
             .set(title="Document count by journal indexing year "
                        "and document publication year");


Filtering by research articles, we get almost the same:

In [16]: mdataset_ra = mdataset[mdataset["doc_type"] == "research-article"]
         years_mdataset_ra = (mdataset_ra
             .groupby(["inclusion_year_scielo", "doc_publishing_year"])
             .size()
             .unstack("doc_publishing_year")
             .fillna(0)
             .astype(int)
         )
         plt.figure(figsize=(14, 6))
         sns.heatmap(years_mdataset_ra.loc[2007:, 2007:], cmap="magma",
                     annot=True, fmt="g") \
             .set(title="Research articles count by journal indexing year "
                        "and document publication year");


15.4.3 Multiple languages by indexing year and publication year

In [17]: mdataset_ramf = mdataset_ra.assign(
             # mdataset_ra, not dataset_ra: the flag should come from the merged
             # dataframe itself (their indices happen to align, but being
             # explicit avoids relying on that)
             multi_language=mdataset_ra["doc_languages"].str.contains(";"),
         )
         years_mdataset_ramf_sum = (mdataset_ramf
             .groupby(["inclusion_year_scielo", "doc_publishing_year"])
             ["multi_language"]
             .sum()
             .unstack("doc_publishing_year")
             .fillna(0)
             .astype(int)
         )
         plt.figure(figsize=(14, 6))
         sns.heatmap(years_mdataset_ramf_sum.loc[2007:, 2007:], cmap="magma",
                     annot=True, fmt="g") \
             .set(title="Count of multi language research articles "
                        "by journal indexing year "
                        "and document publication year");

Zooming out:

In [18]: plt.figure(figsize=(12, 6))
         sns.heatmap(years_mdataset_ramf_sum.loc[:, 1996:], cmap="magma") \
             .set(title="Count of multi language research articles "
                        "by journal indexing year "
                        "and document publication year");


The raw count is probably not enough for understanding what's going on. Let's see the proportion.

In [19]: years_mdataset_ramf_mean = (mdataset_ramf
             .groupby(["inclusion_year_scielo", "doc_publishing_year"])
             ["multi_language"]
             .mean()
             .unstack("doc_publishing_year")
             .fillna(0.)
         )
         plt.figure(figsize=(14, 6))
         sns.heatmap(years_mdataset_ramf_mean.loc[2007:, 2007:], cmap="magma",
                     annot=True) \
             .set(title="Proportion of multi language research articles "
                        "by journal indexing year "
                        "and document publication year");


Zooming out:

In [20]: plt.figure(figsize=(12, 6))
         sns.heatmap(years_mdataset_ramf_mean.loc[:, 1996:], cmap="magma")\
             .set(title="Proportion of multi language research articles "
                        "by journal indexing year "
                        "and document publication year");

The same as above, but as line plots:

In [21]: def add_markers(ax):
             # Assign a distinct marker to each line, then rebuild the
             # legend so the markers appear in it
             for line, marker in zip(ax.get_lines(), "v^<>spP*hHoxXDd8234+1.,"):
                 line.set_marker(marker)
             ax.legend()

In [22]: fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(12, 16))

         years_mdataset_ramf_sum.loc[2007:, 2007:].T.plot(ax=ax1)
         ax1.set(title="Count of multi language research articles")
         add_markers(ax1)

         years_mdataset_ramf_mean.loc[2007:, 2007:].T.plot(ax=ax2)
         ax2.set(title="Proportion of multi language research articles")
         add_markers(ax2)


15.5 Thematic area

These are the fields for each area, besides the _x or _y suffix:

In [23]: areas = ["is_agricultural_sciences", "is_applied_social_sciences",
                  "is_biological_sciences", "is_engineering",
                  "is_exact_earth_sciences", "is_health_sciences",
                  "is_human_sciences", "is_linguistics_letters_arts"]
         areaswm = areas + ["is_multidisciplinary"]

This new trm dataset:

• Has one copy of each document's entry per thematic area the document belongs to;
• Is filtered by research articles, having no other document type;
• Includes a multi_language field, besides specific flag fields for the pt, es and en languages.

In [24]: trm = pd.concat([
             mdataset_ramf[mdataset_ramf[area + "_x"] == 1]
                 [["inclusion_year_scielo", "doc_publishing_year",
                   "multi_language", "doc_pt", "doc_es", "doc_en"]]
                 .assign(area=area[3:])
             for area in areaswm
         ]).reset_index(drop=True)
         print(trm.shape)
         trm[::50_000]

(372208, 7)

Out [24]:

        inclusion_year_scielo  doc_publishing_year  multi_language  doc_pt  doc_es  doc_en                   area
0                        1998                 1998           False       1       0       0  agricultural_sciences
50000                    2012                 2011           False       0       0       1  agricultural_sciences
100000                   2006                 2007           False       1       0       0    biological_sciences
150000                   2011                 2016           False       1       0       0            engineering
200000                   2000                 2006            True       1       0       1        health_sciences
250000                   1998                 2012           False       0       0       1        health_sciences
300000                   2008                 2017            True       1       0       1        health_sciences
350000                   2012                 2016            True       1       0       1         human_sciences
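The pd.concat list comprehension above duplicates each entry once per matching area flag. A minimal sketch of the pattern on toy data (hypothetical flags and years):

    import pandas as pd

    toy = pd.DataFrame({
        "is_health_sciences_x": [1, 0, 1],
        "is_human_sciences_x":  [0, 1, 1],
        "doc_publishing_year":  [2010, 2011, 2012],
    })

    # One copy of each row per thematic area flag that is set,
    # so the last document (in two areas) appears twice
    expanded = pd.concat([
        toy[toy[flag + "_x"] == 1][["doc_publishing_year"]]
           .assign(area=flag[3:])
        for flag in ["is_health_sciences", "is_human_sciences"]
    ]).reset_index(drop=True)
    print(expanded)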


With that data, we can see some language statistics for each area. But, first, what's the number of research articles on each thematic area?

Note: the proportions below sum to more than 100%, since an article can belong to more than one thematic area.

In [25]: trm_area_counts = trm["area"].value_counts().rename("count")
         trm_area_counts.plot.barh(
             figsize=(12, 5),
             title="Count of research articles by thematic area",
         )
         pd.DataFrame(trm_area_counts).assign(
             proportion=trm_area_counts / mdataset_ramf.shape[0],
         )

Out [25]:

                           count  proportion
health_sciences           129204    0.419485
agricultural_sciences      69143    0.224486
human_sciences             54581    0.177208
biological_sciences        47412    0.153932
engineering                21148    0.068661
exact_earth_sciences       20288    0.065869
applied_social_sciences    17736    0.057583
multidisciplinary           8355    0.027126
linguistics_letters_arts    4341    0.014094

Now let's see, for each thematic area, the multi-language document count by both the journal indexing year and the document publishing year, besides a proportion based on the total document count for the specific thematic area.

In [26]: years_trm = (trm
             .groupby(["area", "inclusion_year_scielo", "doc_publishing_year"])
             ["multi_language"]
             .agg(["sum", "mean"])
             .unstack("inclusion_year_scielo")
             .fillna(0)
         ).T
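After the transposition, years_trm has the statistic and the indexing year in its row MultiIndex, and the area and the publishing year in its column MultiIndex; the .xs calls below take cross-sections of it. A minimal sketch of .xs on a toy frame (hypothetical values):

    import pandas as pd

    rows = pd.MultiIndex.from_product([["sum", "mean"], [2007, 2008]])
    cols = pd.MultiIndex.from_product([["health_sciences", "human_sciences"],
                                       [2007, 2008]])
    toy = pd.DataFrame([[5, 7, 2, 3],
                        [6, 8, 1, 4],
                        [.1, .2, .3, .4],
                        [.2, .3, .1, .2]], index=rows, columns=cols)

    # .xs(area, 1) picks one area from the columns (axis=1); .xs("sum")
    # then picks one statistic from the rows, leaving an
    # indexing-year by publishing-year matrix
    print(toy.xs("health_sciences", 1).xs("sum"))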

That's the full matrix of counts and proportions by area. Let's see it with some heatmaps.

In [27]: for field in areaswm:
             fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(13, 4))
             area = field[3:]
             sns.heatmap(years_trm.xs(area, 1).xs("sum").loc[2007:, 2007:],
                         cmap="magma", annot=True, fmt="g", ax=ax1) \
                 .set(title=f"Multilanguage research articles count in {area}")
             sns.heatmap(years_trm.xs(area, 1).xs("mean").loc[2007:, 2007:],
                         cmap="magma", annot=True, fmt=".02f", ax=ax2) \
                 .set(title=f"Multilanguage research articles proportion "
                            f"in {area}")
             fig.tight_layout()


Zooming out to see the big picture:

In [28]: for field in areaswm:
             fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(13, 4))
             area = field[3:]
             sns.heatmap(years_trm.xs(area, 1).xs("sum").loc[:, 1996:],
                         cmap="magma", ax=ax1) \
                 .set(title=f"Multilanguage research articles count in {area}")
             sns.heatmap(years_trm.xs(area, 1).xs("mean").loc[:, 1996:],
                         cmap="magma", ax=ax2) \
                 .set(title=f"Multilanguage research articles proportion "
                            f"in {area}")
             fig.tight_layout()



15.6 Number of published articles by thematic area in en, es and pt

Using the same technique from when we created the trm dataframe, we can see the number of published articles by the 3 languages that have their own column:

• en: English;
• es: Spanish;
• pt: Portuguese.

In [29]: langs = ["en", "es", "pt"]
         trlangsum = pd.concat([
             trm[trm["doc_" + lang] == 1]
                 [["area", "doc_publishing_year"]]
                 .assign(lang=lang)
             for lang in langs
         ]).groupby(["area", "lang", "doc_publishing_year"]) \
           .size().rename("count").reset_index()
         print(trlangsum.shape)
         trlangsum[::200]

(1274, 4)

Out [29]:

                         area lang  doc_publishing_year  count
0       agricultural_sciences   en                 1942      1
200   applied_social_sciences   es                 2011     44
400       biological_sciences   pt                 1911     16
600      exact_earth_sciences   en                 1971      6
800           health_sciences   en                 1983     46
1000          health_sciences   pt                 2010   5409
1200        multidisciplinary   en                 2012    297

This data is what we wish to plot.

In [30]: sns.FacetGrid(trlangsum, hue="lang", row="area",
                       aspect=6, height=1.8)\
             .map(sns.lineplot, "doc_publishing_year", "count")\
             .add_legend()
         for legend_line in plt.gcf().legends[0].legendHandles:
             legend_line.set_linewidth(10)


The same, from 1990 and without a shared y axis:

In [31]: sns.FacetGrid(trlangsum, hue="lang", row="area",
                       aspect=6, height=1.8, sharey=False)\
             .map(sns.lineplot, "doc_publishing_year", "count")\
             .add_legend() \
             .set(xlim=[1990, 2018]);
         for legend_line in plt.gcf().legends[0].legendHandles:
             legend_line.set_linewidth(10)

Instead, we might want to see how the research articles in one specific language are distributed among the thematic areas. We can plot heat maps of the counts to see this.
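The pivot call in the next cell does the same reshaping as the earlier groupby/size/unstack pattern, but starting from an already-aggregated long table. A minimal sketch on toy data (hypothetical counts):

    import pandas as pd

    long = pd.DataFrame({
        "area": ["health_sciences", "health_sciences", "human_sciences"],
        "doc_publishing_year": [2010, 2011, 2010],
        "count": [5, 7, 2],
    })

    # Rows become areas, columns become years, cells hold the counts;
    # missing (area, year) pairs turn into NaN, hence the fillna(0)
    wide = long.pivot(index="area", columns="doc_publishing_year",
                      values="count").fillna(0)
    print(wide)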


In [32]: fig, axes = plt.subplots(nrows=len(langs), figsize=(12, 12))
         for lang, ax in zip(langs, axes):
             data = trlangsum[trlangsum["lang"] == lang] \
                 .pivot(index="area",
                        columns="doc_publishing_year",
                        values="count")\
                 .fillna(0)
             sns.heatmap(data, cmap="magma", ax=ax) \
                 .set(title=f"Count of research articles in {lang}")
         fig.tight_layout()

The same, from 1990:

In [33]: fig, axes = plt.subplots(nrows=len(langs), figsize=(12, 12))
         for lang, ax in zip(langs, axes):
             data = trlangsum[(trlangsum["lang"] == lang) &
                              (trlangsum["doc_publishing_year"] >= 1990)] \
                 .pivot(index="area",
                        columns="doc_publishing_year",
                        values="count")\
                 .fillna(0)
             sns.heatmap(data, cmap="magma", ax=ax) \
                 .set(title=f"Count of research articles in {lang}")
         fig.tight_layout()

Proportion of Brazil as the affiliation of documents in SciELO Brazil

Our goal is to find the proportion of Brazil in the affiliations of documents belonging to the SciELO Brazil collection. Let d be a document; the proportion we're looking for is:

p(d) = (number of affiliations of d in Brazil) / (total number of affiliations of d)

We're going to study p(d) on a yearly basis, counting only the affiliations whose country we know. Let's load from SciELO Analytics the CSV of document affiliations in SciELO Brazil:

# We shouldn't interpret Namibia (NA) as "not available"
doc_aff <- read.csv("tabs_bra/documents_affiliations.csv", na.strings = c())
dim(doc_aff)  # Number of rows and columns

## [1] 804928     26

as.data.frame(t(head(doc_aff, 1)))  # First entry

                                                                           1
extraction.date                                                   2018-09-13
study.unit                                                          document
collection                                                               scl
ISSN.SciELO                                                        0100-879X
ISSN.s                                                   0100-879X;1414-431X
title.at.SciELO         Brazilian Journal of Medical and Biological Research
title.thematic.areas                     Biological Sciences;Health Sciences
title.is.agricultural.sciences                                             0
title.is.applied.social.sciences                                           0
title.is.biological.sciences                                               1
title.is.engineering                                                       0
title.is.exact.and.earth.sciences                                          0
title.is.health.sciences                                                   1
title.is.human.sciences                                                    0
title.is.linguistics..letters.and.arts                                     0
title.is.multidisciplinary                                                 0
title.current.status                                                 current
document.publishing.ID..PID.SciELO.                  S0100-879X1998000800006
document.publishing.year                                                1998
document.type                                               research-article
document.is.citable                                                        1
document.affiliation.instituition                    University of Gorakhpur
document.affiliation.country
document.affiliation.country.ISO.3166
document.affiliation.state
document.affiliation.city
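The na.strings = c() detail above matters in pandas as well: read_csv would, by default, also turn the string "NA" into a missing value. A minimal sketch of the pitfall and its fix (toy CSV content):

    import pandas as pd
    from io import StringIO

    csv = StringIO("country\nBR\nNA\n")

    # By default, "NA" (Namibia) becomes a missing value...
    print(pd.read_csv(csv)["country"].isna().tolist())  # [False, True]

    csv.seek(0)
    # ...keep_default_na=False plays the role of na.strings = c() in R
    print(pd.read_csv(csv, keep_default_na=False)["country"].tolist())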

R already simplifies the column names in some sense, replacing whitespace and special characters with a dot. We can see the names with names(doc_aff).

Categorical fields are known as factors.

class(doc_aff$document.type)

## [1] "factor" class(doc_aff$document.affiliation.country.ISO.3166)

## [1] "factor" The levels of a factor are the values one factor vector can have. levels(doc_aff$document.type)

## [1] "abstract" "addendum" "article-commentary" ## [4] "book-review" "brief-report" "case-report" ## [7] "correction" "editorial" "letter" ## [10] "news" "press-release" "rapid-communication" ## [13] "research-article" "review-article" "undefined" levels(doc_aff$document.affiliation.country.ISO.3166)

## [1] "" "AE" "AG" "AL" "AM" "AN" "AO" "AR" "AS" "AT" "AU" "AZ" "BA" "BB" ## [15] "BD" "BE" "BF" "BG" "BH" "BI" "BJ" "BO" "BR" "BS" "BT" "BW" "BY" "CA" ## [29] "CD" "CF" "CH" "CI" "CL" "CM" "CN" "CO" "CR" "CS" "CU" "CV" "CY" "CZ" ## [43] "DE" "DK" "DO" "DZ" "EC" "EE" "EG" "ES" "ET" "FI" "FJ" "FR" "GA" "GB" ## [57] "GD" "GE" "GF" "GH" "GN" "GP" "GR" "GT" "GW" "GY" "HK" "HN" "HR" "HT" ## [71] "HU" "ID" "IE" "IL" "IN" "IQ" "IR" "IS" "IT" "JM" "JO" "JP" "KE" "KG" ## [85] "KN" "KR" "KW" "KY" "KZ" "LA" "LB" "LK" "LR" "LT" "LU" "LV" "LY" "MA" ## [99] "ME" "MG" "MI" "MK" "ML" "MM" "MN" "MT" "MU" "MW" "MX" "MY" "MZ" "NA" ## [113] "NE" "NG" "NI" "NL" "NO" "NP" "NZ" "OM" "PA" "PE" "PG" "PH" "PK" "PL" ## [127] "PR" "PS" "PT" "PY" "QA" "RO" "RS" "RU" "RW" "SA" "SC" "SD" "SE" "SG" ## [141] "SI" "SK" "SL" "SN" "SR" "SS" "SU" "SV" "SY" "TG" "TH" "TL" "TN" "TR" ## [155] "TT" "TW" "TZ" "UA" "UG" "US" "UY" "VE" "VN" "YE" "YU" "ZA" "ZM" "ZW" Most entries are research articles, we’ll work only with this document type: options(scipen = 6) # Avoid scientific notation in plots

as.data.frame(summary(doc_aff$document.type))

                    summary(doc_aff$document.type)
abstract                                      3215
addendum                                       192
article-commentary                            3396
book-review                                   7948
brief-report                                  7732
case-report                                  17602
correction                                     875
editorial                                    17827
letter                                        6122
news                                           111
press-release                                  839
rapid-communication                          12995
research-article                            702149
review-article                               15902
undefined                                     8023

par(mar = c(3, 9, 2, 2) + .1)
barplot(summary(doc_aff$document.type),
        horiz = TRUE, las = 1,  # Horizontal labels
        main = "Count of document types")

[Figure: "Count of document types", a horizontal bar plot of the document type counts above]

articles <- doc_aff[doc_aff$document.type == "research-article",]
nrow(articles)

## [1] 702149

Most affiliation entries are from Brazil (that's somewhat expected for a Brazilian collection).

aff_country_summary <- summary(articles$document.affiliation.country.ISO.3166,
                               maxsum = 10)
aff_country_summary_names <- replace(names(aff_country_summary),
                                     names(aff_country_summary) == "",
                                     "(Empty)")
acs_xmidpoints <- barplot(aff_country_summary, axisnames = FALSE,
                          main = "Count of affiliations by country")
axis(1, at = acs_xmidpoints, las = 2,
     labels = aff_country_summary_names, xpd = TRUE, tick = FALSE)

[Figure: "Count of affiliations by country", a bar plot over the codes IR, PT, TR, ES, BR, US, AR, CN, (Other) and (Empty)]

Let's build a dataset with just four columns:

• One regarding the document publication year;
• One regarding the PID, a way to identify an article;
• One logical, TRUE if an article has a Brazilian affiliation;
• One logical, TRUE if an article has a non-Brazilian affiliation.

We should remove the empty country entries, since they might belong to any country (Brazil or another one). Using two columns should be cleaner to understand than merging the Brazilian/non-Brazilian affiliation information into a single column.

dataset <- data.frame(
  articles$document.publishing.year,
  articles$document.publishing.ID..PID.SciELO.,
  articles$document.affiliation.country.ISO.3166 == "BR",
  # Matches any code that isn't "BR"; the empty string matches neither pattern
  grepl("[^B].|.[^R]", articles$document.affiliation.country.ISO.3166)
)
names(dataset) <- c("year", "pid", "br", "not_br")
dataset <- dataset[dataset$br | dataset$not_br,]
head(dataset)

     year                      pid   br not_br
624  1998 S0074-02761998000300014 TRUE  FALSE
2319 1998 S0102-76381998000400005 TRUE  FALSE
2321 1998 S0102-76381998000400003 TRUE  FALSE
2323 1998 S0102-76381998000400009 TRUE  FALSE
2333 1998 S0102-76381998000400004 TRUE  FALSE
2334 1998 S0102-76381998000400010 TRUE  FALSE
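The same pair of flags could be built in pandas; a minimal sketch with hypothetical country codes, where the grepl trick is replaced by an explicit comparison:

    import pandas as pd

    countries = pd.Series(["BR", "", "US", "BR", "AR"])

    flags = pd.DataFrame({
        "br": countries == "BR",
        # Non-empty and different from "BR": the same role played by
        # grepl("[^B].|.[^R]", ...) in the R code above
        "not_br": (countries != "BR") & (countries != ""),
    })
    # Keep only the rows whose country we know
    print(flags[flags["br"] | flags["not_br"]])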

nrow(dataset)

## [1] 626660

As all entries are either br or not_br, we just need to calculate the mean of br for each PID: since br is logical, its mean over a document's known-country affiliations is exactly p(d). We'll use dplyr to group the rows by PID.

library(dplyr)  # Masks intersect, setdiff, setequal, union, filter, lag

proportions <- dataset %>%
  group_by(pid) %>%
  summarize(mean(br), max(year))
proportions <- proportions[c(2, 3)]
names(proportions) <- c("prop", "year")
nrow(proportions)

## [1] 284274

head(proportions)

 prop year
    1 1998
    1 1998
    0 1998
    1 1998
    1 1998
    1 1998
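As a bridge to the Pandas part of this workshop, the same grouping can be written with DataFrame.groupby; a minimal sketch with hypothetical PIDs:

    import pandas as pd

    toy = pd.DataFrame({
        "pid":  ["S01", "S01", "S02"],
        "year": [1998, 1998, 1999],
        "br":   [True, False, True],
    })

    # Same as: dataset %>% group_by(pid) %>% summarize(mean(br), max(year))
    proportions = (toy.groupby("pid")
                      .agg(prop=("br", "mean"), year=("year", "max"))
                      .reset_index())
    print(proportions)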

Let's see the evolution of the mean of these proportions:

mprops <- proportions %>%
  group_by(year) %>%
  summarize(mean(prop))
min(mprops$year, na.rm = TRUE)  # Oldest document publication year

## [1] 1909

plot(
  mprops, type = "l", cex.main = 1,
  main = paste("Mean proportion of BR affiliation",
               "in research articles (SciELO Brazil)",
               sep = "\n")  # Two-line title; sep = "" would glue the words
)

[Figure: "Mean proportion of BR affiliation in research articles (SciELO Brazil)", a line plot of mean(prop) by year, roughly between 0.80 and 1.00]

The raw data:

library(kableExtra)
mprops_all_years <- merge(data.frame(year = 1909:2018), mprops, all.x = TRUE)
mprops_all_years$year = as.character(mprops_all_years$year)
kable(
  cbind(mprops_all_years[seq(from = 1, length = 22),],
        mprops_all_years[seq(from = 23, length = 22),],
        mprops_all_years[seq(from = 45, length = 22),],
        mprops_all_years[seq(from = 67, length = 22),],
        mprops_all_years[seq(from = 89, length = 22),]),
  digits = 5, format.args = list(nsmall = 5),
) %>% kable_styling(latex_options = "striped") %>%
  add_header_above(c("1909-1930" = 2, "1931-1952" = 2, "1953-1974" = 2,
                     "1975-1996" = 2, "1997-2018" = 2))

1909-1930          1931-1952          1953-1974          1975-1996          1997-2018
year  mean(prop)   year  mean(prop)   year  mean(prop)   year  mean(prop)   year  mean(prop)
1909  1.00000      1931  1.00000      1953  0.90000      1975  0.98935      1997  0.90395
1910       NA      1932  1.00000      1954  0.88889      1976  0.97136      1998  0.92757
1911       NA      1933  1.00000      1955  0.90833      1977  0.94796      1999  0.90184
1912       NA      1934  1.00000      1956  0.95775      1978  0.86443      2000  0.90748
1913       NA      1935       NA      1957  1.00000      1979  0.95635      2001  0.92294
1914       NA      1936  1.00000      1958  0.97581      1980  0.92127      2002  0.91769
1915       NA      1937  1.00000      1959  1.00000      1981  0.97504      2003  0.91756
1916       NA      1938  0.83333      1960  1.00000      1982  0.93280      2004  0.91753
1917  1.00000      1939  0.93750      1961  0.99528      1983  0.96676      2005  0.91481
1918  1.00000      1940  1.00000      1962  0.97710      1984  0.93607      2006  0.91021
1919       NA      1941  0.91667      1963  1.00000      1985  0.91731      2007  0.91356
1920       NA      1942  1.00000      1964  0.96277      1986  0.90533      2008  0.90882
1921       NA      1943  1.00000      1965  0.97436      1987  0.85090      2009  0.89872
1922  1.00000      1944  1.00000      1966  0.93137      1988  0.88082      2010  0.88754
1923  1.00000      1945  0.96154      1967  0.95570      1989  0.87085      2011  0.87702
1924  1.00000      1946  0.96875      1968  0.95833      1990  0.90380      2012  0.87200
1925  1.00000      1947  0.97368      1969  0.98944      1991  0.89194      2013  0.85422
1926  1.00000      1948  0.88710      1970  0.98824      1992  0.80630      2014  0.83702
1927  1.00000      1949  0.94340      1971  0.97264      1993  0.90240      2015  0.81992
1928  1.00000      1950  0.90323      1972  0.97877      1994  0.87933      2016  0.81231
1929  1.00000      1951  0.97887      1973  0.96792      1995  0.90300      2017  0.80760
1930       NA      1952  0.97115      1974  0.97958      1996  0.89737      2018  0.78823

Is that significantly decreasing? To answer that, let's look at the linear regression slope, which should be negative; that is, the mean proportion should get lower as the year gets higher.

regr <- lm(mean.prop. ~ year, data.frame(mprops))
summary(regr)

##
## Call:
## lm(formula = mean.prop. ~ year, data = data.frame(mprops))
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.149550 -0.011874  0.005626  0.025461  0.058664
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.7169339  0.2630088   14.13   <2e-16 ***
## year        -0.0014108  0.0001336  -10.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03833 on 96 degrees of freedom
## Multiple R-squared:  0.5375, Adjusted R-squared:  0.5327
## F-statistic: 111.6 on 1 and 96 DF,  p-value: < 2.2e-16

The slope (the year estimate) is negative. But is it negative over the whole 95% confidence interval?

confint(regr, level = .95)

##                     2.5 %        97.5 %
## (Intercept)  3.194865642  4.239002162
## year        -0.001675865 -0.001145654

Yes, it's decreasing! The slope (year, the last row of the confint result) is negative over the entire 95% confidence interval.
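For readers following along in Python, the same slope check can be sketched with scipy.stats; the arrays below are hypothetical stand-ins for the year/mean-proportion pairs in mprops:

    import numpy as np
    from scipy import stats

    years = np.array([1997., 1998., 1999., 2000., 2001., 2002.])
    props = np.array([0.904, 0.928, 0.902, 0.907, 0.923, 0.918])

    result = stats.linregress(years, props)

    # 95% CI for the slope: estimate +/- t(0.975, n - 2) * standard error
    tval = stats.t.ppf(0.975, len(years) - 2)
    low = result.slope - tval * result.stderr
    high = result.slope + tval * result.stderr
    print(f"slope = {result.slope:.6f}, 95% CI = [{low:.6f}, {high:.6f}]")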
