Workshop on the Use of Data from the SciELO Database


Report slides and Python/R notebooks with the performed analyses

Danilo J. S. Bellini

Abstract

Knowledge from data science and contemporary statistics can be used to perform analyses and inferences on large datasets with hundreds of thousands of entries. The exploratory analysis of a dataset, in research that aims to extract information from it, might include steps like data acquisition, cleaning, normalization, interpretation, grouping, description and visualization. The goal of this work is to share techniques, methodologies and tools for accessing and exploring data from the SciELO database through its own open access interfaces, namely the SciELO Analytics reports, SciELO Ratchet and SciELO ArticleMeta (JSON API and Python software package), as well as from 4 external sources: Web of Science (SciELO Citation Index), Dimensions, SCImagoJR and Scopus. Several analyses were performed using either Python (IPython/Jupyter, Numpy, Pandas, Matplotlib, Seaborn, Scipy, NetworkX) or R (R Studio, dplyr) as the programming language, and their open source code is included so that the results can be reproduced.

Keywords

Python, R, Data science, Statistics, SciELO, H5, FCR, SJR, Citations, Open access, Open source, Exploratory data analysis

Source code repository

https://github.com/scieloorg/scielo20gt6/

WG6 Report SciELO 20 Years

WG6 presentation title: Workshop on the use of data from the SciELO database
Lecturer/speaker/rapporteur: Danilo J. S. Bellini
Group coordinator: Gustavo Fonseca
Executive secretary: Carolina Tanigushi
WG6 date: 2018-09-24
Report date: 2018-09-25
Venue: Tivoli Mofarrej São Paulo Hotel

Summary

During the workshop, the following was done:

- A brief introduction to the data analysis process, emphasizing data access, data cleaning and exploratory data analysis
- Hands-on examples using Python and R, with emphasis on the resources of the Pandas library, showing how data munging, normalization, data visualization and other data analysis processes can be performed
- An explanation of data analyses previously performed on data coming from SciELO and external sources

Most of the time was spent on exploratory data analysis, interpretation of descriptive statistics and visualization. Some highlights of what was studied in more depth include:

- Hirsch index
  - Google Scholar’s h5-index and h5-median
  - Calculation from raw Dimensions data
  - SCImagoJR’s H index
- Field Citation Ratio (FCR) from Dimensions

Tools

Two programming languages were used, along with several libraries and tools:

- Python
  - IPython / Jupyter Notebook
  - Python built-in modules (csv, statistics, urllib, json, glob, os, re, collections, itertools, pprint)
  - numpy
  - pandas
  - matplotlib
  - seaborn
  - openpyxl, to open XLSX files
  - scipy.stats, to calculate Pearson’s correlation coefficient
  - NetworkX, a graph manipulation library including an API to draw graphs with matplotlib
- R
  - R built-in modules (base, utils, stats, graphics)
  - R Studio, an IDE for R, for creating R Markdown notebooks
  - dplyr, to perform grouping operations similar to SQL’s GROUP BY and Pandas’ DataFrame.groupby

Data sources

Several analyses were performed before the workshop, and their processes and results were part of it.
The data studied in the analyses performed during and before the workshop came from these sources:

- SciELO’s JSON APIs (RESTful):
  - ArticleMeta, to get journal metadata
  - Ratchet, to get access data
- SciELO’s articlemeta Python library, an alternative way to access the ArticleMeta API
- Reports from SciELO Analytics
- SciELO Citation Index entries from the Web of Science
- Dimensions data regarding two journals: Nauplius and Brazilian Journal of Plant Physiology
- SCImagoJR’s CSV with all SJR and H indices for 2017
- Scopus’ XLSX with all the data they make available

Introduction to the analysis

The following steps were covered:

- Identifying which collections have data in SciELO Analytics (all certified and development collections, plus the active independent collections)
- Downloading all SciELO Analytics reports
- Evaluating whether the network reports have everything from the remaining reports
- Simplifying the column names
- Normalizing/cleaning the ISSN when dealing with multiple collections
- Normalizing the thematic area (dealing with unfilled data)

Beyond these steps, the workshop showed how to plot data, with a strong emphasis on data interpretation and on multiple types of plots (bar plots, line plots, box-and-whisker plots, heat maps, scatter plots, etc.), as well as subplot splitting/grouping.
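The column simplification and the ISSN and thematic area normalization steps listed above can be sketched with Pandas as follows. This is a minimal sketch under stated assumptions, not the code from the workshop notebooks: the column names, the sample rows, the "unfilled" placeholder and the helper functions simplify_column and normalize_issn are all hypothetical.

```python
import re

import pandas as pd

# Hypothetical rows mimicking a report that spans more than one collection;
# the column names and values are illustrative, not taken from SciELO Analytics.
raw = pd.DataFrame({
    "ISSN SciELO": [" 0001-3765", "0001-3765 ", "1234-567x"],
    "Title Thematic Areas": ["Biological Sciences", None, ""],
    "Collection": ["scl", "arg", "scl"],
})


def simplify_column(name):
    """Lowercase a column name and replace non-alphanumeric runs with "_"."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def normalize_issn(value):
    """Strip whitespace and uppercase the check character ("x" -> "X")."""
    return value.strip().upper()


clean = raw.rename(columns=simplify_column)
clean["issn_scielo"] = clean["issn_scielo"].map(normalize_issn)

# Treat both missing values and empty strings as unfilled thematic areas.
clean["title_thematic_areas"] = (
    clean["title_thematic_areas"].fillna("").replace("", "unfilled")
)

# After normalization, the same ISSN can be matched across collections.
collections_per_issn = clean.groupby("issn_scielo")["collection"].nunique()
print(clean)
print(collections_per_issn)
```

Normalizing before grouping matters here: without the whitespace cleanup, the first two rows would be counted as two different journals.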
Previously prepared analyses

- Number of indexed journals in the SciELO network
- Deindexing reason in the SciELO Brazil collection
- Evaluating the daily access in the SciELO Brazil collection
- Three indices in Scopus 2017: CiteScore, SNIP and SJR
- SCImago Journal Rank in 2017, including SJR and H index
- FCR and H index in Dimensions
- Google Scholar indices
- Languages of research articles in SciELO Brazil, by thematic area, document publication year and journal indexing year
- Citations in the SciELO CI
- Proportion of Brazil in affiliation institutions in research articles from journals in the SciELO Brazil collection

Results

- The proportion of Brazilian affiliations in research articles of the SciELO Brazil collection is decreasing
- Most citations of research articles in the SciELO network come from documents/journals that aren’t in the SciELO network. For research articles written in English, 76% of the received citations come from documents external to the SciELO network
- The normalization step when calculating the FCR and its non-standard average calculation can easily push down the result (e.g. a journal with 10 documents receiving 15 citations each and 3 documents with zero citations would have an average number of citations of less than 7.5, before this number gets normalized by the year and field of research), making it an index best suited to evaluating older journals that are no longer publishing
- We should always look at the mathematics that defines an index, as that evaluation can already give us some insights regarding its bias towards some documents/journals
- Scopus indices should be taken with care: mixing the data from all countries makes it hard to compare data from SciELO and from other journals
- In SciELO Brazil, 95% of the journals marked as deceased were actually just renamed to a new journal title/ISSN
- Matching data with external sources is difficult without a common standardized identifier such as the ISSN or the DOI
- 0.8% of SciELO journals are marked in Scopus as not open, which seems to be an issue regarding Scopus data

1 Introduction to Pandas with ArticleMeta

Note: This notebook was written during the presentation, but some text (like this) was added afterwards to help in understanding its contents.

The ArticleMeta API is public, and we can get a JSON with some information about the collections:

http://articlemeta.scielo.org/api/v1/collection/identifiers/

Instead of the raw JSON we see when opening that address in a web browser, can we load/analyze it with Python? Can we plot some information from it? Yes! And that’s the goal of this notebook.

1.1 Loading a table-like JSON from ArticleMeta with Pandas

Let’s import Pandas as pd, following its convention, and load that JSON directly with pd.read_json.

Note: Data collected on 2018-09-24.

In [1]: import pandas as pd

In [2]: url = "http://articlemeta.scielo.org/api/v1/collection/identifiers/"
        dataset = pd.read_json(url)

In [3]: dataset

Out [3]: The table follows.
   | acron | acron2 | code | document_count | domain | has_analytics | is_active | journal_count | name | original_name | status | type
0  | arg | ar | arg | 37438.0 | www.scielo.org.ar | True | True | {'current': 120, 'deceased': 23} | {'es': 'Argentina', 'pt': 'Argentina', 'en': '... | Argentina | certified | journals
1  | chl | cl | chl | 61760.0 | www.scielo.cl | True | True | {'current': 103, 'deceased': 13, 'suspended': 1} | {'es': 'Chile', 'pt': 'Chile', 'en': 'Chile'} | Chile | certified | journals
2  | col | co | col | 66973.0 | www.scielo.org.co | True | True | {'current': 224, 'suspended': 4} | {'es': 'Colombia', 'pt': 'Colombia', 'en': 'Co... | Colombia | certified | journals
3  | cub | cu | cub | 33492.0 | scielo.sld.cu | True | True | {'current': 61, 'deceased': 2, 'suspended': 4} | {'es': 'Cuba', 'pt': 'Cuba', 'en': 'Cuba'} | Cuba | certified | journals
4  | esp | es | esp | 37223.0 | scielo.isciii.es | True | True | {'current': 43, 'deceased': 6, 'suspended': 11} | {'es': 'España', 'pt': 'Espanha', 'en': 'Spain'} | España | certified | journals
5  | mex | mx | mex | 61167.0 | www.scielo.org.mx | True | True | {'current': 156, 'deceased': 12, 'suspended': 47} | {'es': 'Mexico', 'pt': 'Mexico', 'en': 'Mexico'} | Mexico | certified | journals
6  | prt | pt | prt | 17237.0 | www.scielo.mec.pt | True | True | {'current': 47, 'deceased': 5, 'suspended': 17} | {'es': 'Portugal', 'pt': 'Portugal', 'en': 'Po... | Portugal | certified | journals
7  | NaN | NaN | NaN | NaN | books.scielo.org | False | True | NaN | {'es': 'SciELO Libros', 'pt': 'SciELO Livros',... | NaN | NaN | books
8  | scl | br | scl | 370296.0 | www.scielo.br | True | True | {'current': 291, 'deceased': 40, 'suspended': 35} | {'es': 'Brasil', 'pt': 'Brasil', 'en': 'Brazil'} | Brasil | certified | journals
9  | spa | sp | spa | 40996.0 | www.scielosp.org | True | True | {'current': 18, 'suspended': 2} | {'es': 'Salud Publica', 'pt': 'Saúde Pública',... | Saúde Pública | certified | journals
10 | sss | ss | sss | 665.0 | socialsciences.scielo.org | True | False | {'current': 33} | {'es': 'Social Sciences', 'pt': 'Social Scienc.. | Social Sciences | certified | journals
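The journal_count column in the table above holds one dictionary per collection, which is hard to compare or plot directly. As a minimal sketch (not part of the original notebook), the dictionaries can be expanded into one column per journal status. The three rows below are copied from the table so that the example runs without network access, and the names counts and total are only illustrative.

```python
import pandas as pd

# Three rows copied from the table above, so no network access is needed.
dataset = pd.DataFrame({
    "code": ["arg", "chl", "col"],
    "journal_count": [
        {"current": 120, "deceased": 23},
        {"current": 103, "deceased": 13, "suspended": 1},
        {"current": 224, "suspended": 4},
    ],
})

# Build one column per status from the dictionaries; a status missing
# from a dictionary becomes NaN, which is replaced by zero here.
counts = (
    pd.DataFrame(dataset["journal_count"].tolist(), index=dataset["code"])
    .fillna(0)
    .astype(int)
)
counts["total"] = counts.sum(axis=1)
print(counts)
```

With the full dataset, the books row has NaN in journal_count, so it would have to be dropped first, for example with dataset.dropna(subset=["journal_count"]), before this expansion is applied.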