Publication Profiles Through Knowledge Areas – Diverse Analyses Over Google Scholar


TECHNICAL REPORT

Gabriela Duarte Lanza, Mariana de Oliveira Santos Silva, Mirella M. Moro Universidade Federal de Minas Gerais {mirella,gabrieladl,marianaos}@dcc.ufmg.br

José Palazzo M. de Oliveira Universidade Federal do Rio Grande do Sul [email protected]

April 1, 2019

ABSTRACT

Knowledge areas (e.g., Biology and Physics) have different publication profiles, but the interpretation of the metrics used to assess the relevance of papers and researchers does not vary from one field to another. This scenario results in inadequate comparisons between areas. Focusing on such a variety of profiles, this study seeks to characterize Engineering and Computer Science sub-areas using topological and non-topological metrics. After characterizing the areas, we also correlate social network metrics with the researchers’ h-index in each sub-area. Our analysis qualifies their influence, and we conclude there is no explicit correlation between the topological metrics and the h-index. Overall, there are external factors influencing each author’s h-index that are not easily identified by algorithms.

Keywords H-index · Publication profiles · Knowledge areas · Collaboration Social Networks · Google Scholar.

1 Introduction

The pressure to publish more, and in qualified venues, has been on for some time now. It has even changed our research and innovation culture with services (such as Publish or Perish) and widely used metrics (such as the h-index). Over the years, the scientific community has proposed a diverse set of parameters to evaluate the relevance of researchers, events, and publications [4, 12, 16, 17]. These metrics (including citations and h-index) depend on the area or sub-area with which a publication or researcher is involved, due to the different publication profiles of the knowledge areas. This variation exists due to many factors, such as the time needed to conduct research, the relevance of studies in a given field, or even incentives from funding agencies and national or international politics.

Since 2005, the h-index has been one of the few author-based metrics available to evaluate the impact of individual scientists, researchers or academics [1, 19, 16]. Jorge E. Hirsch [9] notes that the use of the proposed index depends on the area and sub-area to which the researcher belongs. He shows that researchers in Biological and Biomedical Sciences tend to receive a higher h-index than researchers in Physics, reflecting different publication profiles. Moreover, such distinctions may persist when comparing sub-areas within the same field. For example, Electromagnetism and Astronomy are sub-areas of Physics, but they have distinct profiles. In a former paper [18], we demonstrated that the h-index should be used to classify groups of researchers with different profiles from a quantitative point of view, which helps to eliminate obscure qualitative methods. However, that classification is strongly affected by area-specific characteristics. In addition, many publications fall into more than one knowledge area.
For example, Biophysics is a sub-area that belongs to both Biology and Physics and can inherit their characteristics, constituting a new publication profile. Furthermore, these publications may be the outcome of collaboration between researchers from different sub-areas. Overall, collaboration has beneficial effects on the quality of results and productivity. That is, with additional information about academic co-author networks, it is possible to identify the impact of collaborations on the evaluation of researchers [7, 10, 14].

The goal of this work is to identify and comprehend the similarities and differences between the profiles of Knowledge Areas. Moreover, we aim to understand how multidisciplinary publications relate to the areas to which they belong. Thus, we analyze publications from three different perspectives. First, we analyze the singularities and similarities between the sub-areas of Computer Science and Engineering collected from Google Scholar, according to non-topological metrics. Then, we characterize the same sub-areas according to topological metrics. Finally, we use social network metrics in co-authorship graphs of all the sub-areas and correlate them with the researchers’ h-index. At the end, we conclude there are external factors that influence the h-index received by each author which are not easily identified by current algorithms. We also corroborate previous studies that make clear that one single index cannot reflect all existing area profiles; i.e., claiming that a researcher with a high h-index in area X is more influential or “better” than another researcher from area Y is unreasonable.

The paper is organized as follows. Section 2 presents the related work. The methodology is described in Section 3. Section 4 goes over our analyses and results. Finally, Section 5 concludes the paper and points out some future work.

2 Related Work

This section goes over key publications related to ours. Much more has been published on h-index alone, but a full survey on the topic is beyond the scope of our work.

2.1 H-Index Variations

The h-index, or Hirsch index, was introduced in 2005 and is defined as “A scientist has index h if h of his or her Np papers have at least h citations each and the other (Np − h) papers have ≤ h citations each” [9]. The main advantage of the h-index is that it combines the effect of two dimensions: the number of publications, a measure of quantity, and the number of citations, a measure of impact [1, 5]. Although it is a popular index, due to its simplicity, ease of calculation and availability on different platforms, the h-index has several limitations. Since 2005, the h-index has been widely discussed and criticized [6]. Many authors focus on understanding the real relevance of the h-index and its impact on scientific production. For example, Henzinger et al. [8] investigated whether the h-index ranking is stable in relation to numerous aspects, such as different citation database options. Their experiments show that, while the h-index ranking is stable under most of the changes analyzed, it is unstable when different databases are used. A closed formula for the h-index distribution that can be applied to distinct databases is not yet known. In fact, obtaining such a distribution requires knowledge of the citation distribution of the authors and its specificities [18]. Benevenuto et al. [2] investigated whether the friendship paradox also occurs in terms of a researcher’s scientific productivity, as measured by their h-index. According to the authors, a researcher may use their co-authors’ h-indexes as a way of inferring whether their own h-index is suitable in their respective research areas. However, they show that the average h-index of a researcher’s co-authors is generally higher than the researcher’s own h-index. Therefore, as in the friendship paradox, the h-index paradox induces authors to believe that they are below average compared to their co-authors.
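To make Hirsch's definition concrete, the following minimal sketch computes the h-index from a list of per-paper citation counts. The function name `h_index` and the sample counts are illustrative assumptions and are not taken from the data used in this report.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the paper at this rank still supports an index of `rank`
        else:
            break         # every later paper has even fewer citations
    return h


# Hypothetical citation counts for one researcher (made up for illustration).
example = [10, 8, 5, 4, 3]
print(h_index(example))
```

Sorting once and scanning keeps the computation at O(n log n), which is sufficient for profiles with a few hundred papers.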

2.2 Social Influence

Menezes et al. [15] study the scientific production in Brazil, North America and Europe through the characteristics of collaboration social networks in Computer Science communities. These co-authorship social networks are formed by researchers and their publications collected from the DBLP¹. The results for different social network metrics indicate the process of knowledge production has changed individually for each region. Another study that aims to collect information on scientific production through social metrics was performed by Fu et al. [7]. The authors analyze a variety of classification metrics in distinct entities, going beyond traditional metrics such as paper counts, citations and h-index. Specifically, they define new metrics such as influence, connections and exposure for authors. With additional information from author-institution relationships, they can analyze institution rankings based on the corresponding author rankings for each type of metric, as well as different domains.

¹ DBLP – Computer Science Bibliography: http://dblp2.uni-trier.de/


[Figure 1 diagram: Data Collection → Data Cleaning & Structuring → Data Analysis]

Figure 1: Stages of methodology applied.

2.3 Social Influence and H-index

In [14], McCarty et al. test the relationship between characteristics of co-authorship networks and improvements in the h-index. Specifically, they focus on understanding which collaborative behaviors contributed most to the scientific impact of a Web of Science author. Contrary to the hypothesis proposed in the article, the network structure measured by components was not predictive. Instead, reaching a high h-index requires collaborating with many authors who have, for the most part, high h-index values themselves. Hence, by structuring a co-authorship network to keep separate research communities, a small improvement in h-index can be accomplished. The discussion about the validity of the h-index is still a hot topic, as presented in Costas [6]. In that paper, the authors stress some group characteristics, showing that “the h-index has a preference towards scholars who produce many moderately cited publications over those who prefer to produce a few high impact papers”. This is a clear demonstration that some topological considerations are required.

2.4 Sub-areas Influence

Different studies analyze publication profiles in their different knowledge areas. Lima et al. [13] analyze the publication profiles of Computer Science sub-areas and propose a new approach to rank the researchers. According to the authors, most existing indexes ignore the specificity of different research areas and evaluate researchers based on global statistics. Thus, they propose a new index, called ca-index, which considers the differences between the publication profiles of the Computer Science sub-areas.

2.5 Our Contributions

Similar to such papers, our study aims to understand and compare publication profiles in Engineering and Computer Science sub-areas. However, unlike them, we characterize the sub-areas collected from Google Scholar Metrics according to topological and non-topological metrics. In addition, we identify correlations between the social network metrics and the researchers’ h-index in each area. Our final goal is to shed more light on the discussions about the h-index and its variance over the particular features of areas. We do so from different perspectives and help to corroborate the previous results that point to the unfair use of the h-index to judge researchers from different areas of expertise in one single ranking system.

3 Methodology

Our methodology, illustrated in Figure 1, is based on three main steps. First, we collect data from Google Scholar Metrics (GSM)² (data collected in April 2017). The GSM online tool allows searching for venues and publications from a particular area of interest. We chose this tool for the consistency and homogeneity of its data, and for its clear classification of publications over different knowledge areas. Here, we follow such classification (instead of proposing yet another one). Specifically, we consider all 58 subcategories (here called sub-areas) present in the ‘Engineering and Computer Science’ area (only papers in English). Then, the process of collecting the papers is as follows. Let A be the set of sub-areas. Table 1 presents all sub-areas a ∈ A and their respective identifiers.³ Let V be the whole set of publication venues,

² Google Scholar Metrics, https://scholar.google.com/citations?view_op=metrics_intro (Accessed 23 January 2018).
³ For simplicity, we present our results considering just the ids, instead of the whole name of each sub-area.


Table 1: Areas of Knowledge - Computer Science and Engineering

ID  Area                                        ID  Area
 1  Architecture                                30  Manufacturing & Machinery
 2  Artificial Intelligence                     31  Materials Engineering
 3  Automation & Control Theory                 32  Mechanical Engineering
 4  Aviation & Aerospace Engineering            33  Medical Informatics
 5  Bioinformatics & Computational Biology      34  Metallurgy
 6  Biomedical Technology                       35  Microelectronics & Electronic Packaging
 7  Biotechnology                               36  Mining & Mineral Resources
 8  Ceramic Engineering                         37  Molecular Modeling
 9  Civil Engineering                           38  Multimedia
10  Combustion & Propulsion                     39  Nanotechnology
11  Computational Linguistics                   40  Ocean & Marine Engineering
12  Computer Graphics                           41  Oil, Petroleum & Natural Gas
13  Computer Hardware Design                    42  Operations Research
14  Computer Networks & Wireless Communication  43  Plasma & Fusion
15  Computer Security & Cryptography            44  Power Engineering
16  Computer Vision & Pattern Recognition       45  Quality & Reliability
17  Computing Systems                           46  Radar, Positioning & Navigation
18  Data Mining & Analysis                      47  Remote Sensing
19  Databases & Information Systems             48  Robotics
20  Educational Technology                      49  Signal Processing
21  Engineering & Computer Science (general)    50  Software Systems
22  Environmental & Geological Engineering      51  Structural Engineering
23  Evolutionary Computation                    52  Sustainable Energy
24  Food Science & Technology                   53  Technology Law
25  Fuzzy Systems                               54  Textile Engineering
26  Game Theory and Decision Science            55  Theoretical Computer Science
27  Human Computer Interaction                  56  Transportation
28  Information Theory                          57  Water Supply & Treatment
29  Library & Information Science               58  Wood Science & Technology

and Va the set of venues in each sub-area a. For each Va, we consider only the 20 venues that received the most citations between 2011 and 2015. At the end, ∀v20 ∈ Va, we collect the following information:

• name, the publication venue name;
• h5-index, the largest number h such that h publications in that venue have had h or more citations in the last five years (that is, between 2011 and 2015);
• h5-median, the median number of citations of the articles that form the h5-index;
• year, the year of its first edition.
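As an illustration of how the h5-index and h5-median relate, the sketch below derives both from the citation counts of a venue's articles. It assumes the input list already contains only articles from the five-year window; the function name and the sample numbers are hypothetical, and Google Scholar Metrics itself is not queried.

```python
from statistics import median


def h5_index_and_median(citations):
    """Return (h5-index, h5-median) for one venue.

    `citations` holds the citation counts of the articles the venue
    published in the five-year window (assumed to be pre-filtered).
    """
    ranked = sorted(citations, reverse=True)
    # With counts sorted in descending order, the test c >= rank holds for a
    # prefix of the list only, so counting the hits yields the h5-index.
    h5 = sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)
    core = ranked[:h5]  # the articles that form the h5-index
    h5_median = median(core) if core else 0
    return h5, h5_median


# Made-up citation counts for a single venue.
print(h5_index_and_median([50, 40, 30, 4, 3, 1]))
```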

Now, let Pv be the set of papers (and articles) published in a venue v ∈ Va. Then, ∀v20 ∈ Va, we collect information on the publications p ∈ Pv that received the highest number of citations. In other words, we consider the publications that influence the h5-index of each venue. Overall, ∀p ∈ Pv, we collect the following information:

• name, the publication title;
• #citations, the total number of citations;
• year, the year in which it was published.

Finally, let Rp ⊆ R be the set of researchers who have authored a paper p ∈ Pv. Then, ∀p ∈ Pv, we collect data on all researchers who have an active profile in Google Scholar. Therefore, ∀r ∈ R, we collect the following information:

• name, the researcher name;
• institution, the researcher’s affiliation;
• areas, the related research areas;
• #citations, the total number of citations received (total and since 2012);
• h-index, the h-index (total and since 2012);


[Box plot legend: maximum value, 75th percentile, median, 25th percentile, minimum value, outlier]

Figure 2: Age of Areas (sub-areas closer to Engineering are highlighted)

• i10-index, the i10-index (total and since 2012);
• year, the year of the first article the author published;
• #citperyear, the number of citations received per year between 2009 and 2016.

After data collection, the second step is to perform data cleaning to remove duplicates, errors and other problems. The final step is to analyze the data from different perspectives. All analyses are based on three different graphs: (i) a graph with vertices representing areas and edges connecting those areas that have publications in common (i.e., a publication that belongs to more than one area connects those areas with an edge); (ii) a similar graph but with edges connecting areas that have researchers in common; and (iii) a co-authorship graph with vertices representing authors and edges connecting those who have co-authored at least one paper.
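The three graphs can be sketched as weighted edge lists built from publication records. The record layout below (a dict with `areas` and `authors` keys), the function names and the toy data are assumptions made for illustration; the report does not describe its actual data structures.

```python
from collections import Counter
from itertools import combinations


def area_graph_by_publications(publications):
    """Graph (i): edge weight = number of publications classified in both areas."""
    edges = Counter()
    for pub in publications:
        for a, b in combinations(sorted(set(pub["areas"])), 2):
            edges[(a, b)] += 1
    return edges


def area_graph_by_researchers(publications):
    """Graph (ii): edge weight = number of researchers associated with both areas."""
    areas_of = {}
    for pub in publications:
        for author in pub["authors"]:
            areas_of.setdefault(author, set()).update(pub["areas"])
    edges = Counter()
    for areas in areas_of.values():
        for a, b in combinations(sorted(areas), 2):
            edges[(a, b)] += 1
    return edges


def coauthorship_graph(publications):
    """Graph (iii): edge weight = number of papers the two authors share."""
    edges = Counter()
    for pub in publications:
        for u, v in combinations(sorted(set(pub["authors"])), 2):
            edges[(u, v)] += 1
    return edges


# Toy records: area ids follow Table 1, author names are invented.
pubs = [
    {"areas": {31, 39}, "authors": ["ana", "bo"]},
    {"areas": {18}, "authors": ["bo", "cy"]},
    {"areas": {19}, "authors": ["cy"]},
]
print(area_graph_by_publications(pubs))
print(area_graph_by_researchers(pubs))
print(coauthorship_graph(pubs))
```

Storing each undirected edge under a sorted pair keeps `(a, b)` and `(b, a)` from being counted separately.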

4 Results and Discussion

This section is divided into three parts. First, we analyze the singularities and similarities between Computer Science and Engineering sub-areas, in order to comprehend the differences between them through non-topological metrics (Section 4.1). Second, we analyze the same sub-areas through topological (or social) network metrics (Section 4.2). Finally, we correlate the topological metrics with the h-index of the researchers, using three standard correlation methods: Kendall, Pearson and Spearman (Section 4.3).
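For readers less familiar with the three correlation methods named above, the following dependency-free sketch shows how each coefficient can be computed. It is an illustration only: the Kendall variant shown is tau-b (which adjusts for ties), and the report does not state which variant or software was used. All function names and sample values are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Linear correlation (undefined if either input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation applied to the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Concordant minus discordant pairs, normalised with a tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue            # tied in both variables: ignored
            elif dx == 0:
                ties_x += 1         # tied only in x
            elif dy == 0:
                ties_y += 1         # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + ties_x) * (pairs + ties_y))


# Hypothetical values: a topological metric and the h-index of four researchers.
metric = [1, 2, 3, 4]
h = [1, 3, 2, 4]
print(pearson(metric, h), spearman(metric, h), kendall_tau_b(metric, h))
```

Pearson captures linear association, whereas Spearman and Kendall depend only on the ordering of the values, which is why the three are often reported together.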

4.1 Analyses using Non-Topological Metrics

In this section, we discuss the results of comparing the 58 sub-areas of Computer Science and Engineering, seeking to identify the profile of each area individually and to compare them.

4.1.1 Age of Areas

The first analysis identifies the age of each sub-area by considering the year of the first edition of each venue. In this analysis (and the next three), the charts distinguish 27 areas close to Engineering (highlighted in dark green) from 31 close to Computing (plain white). Figure 2 shows the results, in which Engineering areas have older venues, with most appearing before those of Computing areas.

4.1.2 Researchers’ h-index by sub-area

The h-index is a metric commonly used to measure the productivity and citation impact of a researcher’s publications, based on their most cited papers. In order to understand how this index varies between Computing and Engineering sub-areas, we plot the chart of Figure 3 considering the h-index of all researchers in each sub-area. The areas have different h-index profiles, but the variation is relatively small. This “dilution” could be explained by the inclusion of many beginner researchers in the calculation. Also, there is a clear equilibrium between Engineering and Computing areas, as the bottom half of the graph contains 14 Engineering sub-areas and 15 Computing sub-areas, whereas the top half contains the remaining 13 Engineering and 16 Computing sub-areas.

We take a closer look at such results by investigating how the h-index varies considering the main researchers in each area. Hence, Figure 4 shows the h-index of the top 20% researchers in each area (classified according to Google metrics). Now, there is a more visible variation of the h-index. Specifically, the Engineering-Architecture area remains with the lowest median, whereas the largest belongs to Materials Engineering. Also, the bottom half of the graph contains 17 Engineering and 12 Computing areas, whereas the top half contains 10 Engineering and 19 Computing areas. In other words, more of the top researchers with the largest h-index are in Computing-related areas.

Figure 3: H-index of All Researchers (sub-areas closer to Engineering are highlighted)

Figure 4: H-index of the Top 20% Researchers (sub-areas closer to Engineering are highlighted)


Figure 5: H5-index of Publication Venues (sub-areas closer to Engineering are highlighted)

Figure 6: Number of Publications vs Number of Areas

As already discussed in the Related Work, we are aware that h-index values may favor authors who have longer careers, with more years of publishing and thus more time to accumulate citations. At a higher level, the same could apply to the areas and their publication venues. However, the trend is not clear because either of the following could hold: (i) older venues get more citations because they are well known; or (ii) younger venues get more citations because they (usually) tackle new, hot topics for which there is a high demand for work and its published results. Nonetheless, there are some clear distinctions between pairs of sub-areas within Engineering and Computing. For example, comparing researchers from Robotics* with those from Nanotechnology** favors the nanotechnologists, even though both are in the Computing group. Overall, these results per se would be enough to support our claim that evaluating sub-areas may provide a better, more truthful understanding of a field. Finally, to complete our discussion on age and h-index, we sought a correlation between them considering the top 20% researchers in each area. However, our results show there is limited or no relationship between those aspects, indicating the researchers’ h-index grows in different proportions.

4.1.3 Venues’ H5-index by sub-area

A variation of the h-index, called h5-index, tries to amortize the impact of older publications and citations on the overall h-index value. The h5-index is the largest number h such that h publications have obtained h or more citations in the last 5 years (here, between 2011 and 2015). Thus, we analyze the distribution of the h5-index of the venues by sub-area, as illustrated in Figure 5. Since the h5-index is limited to citations from a recent period of time, one way to read this chart is: areas at the top (right side) are currently more popular, or getting more citations, than those at the bottom (left side).

Figure 7: Publications Graph (nodes connected if the same publication appears in two different sub-areas; red nodes for those closer to Engineering)

4.1.4 Relationship between Sub-areas

There are publications that fit into more than one area, just as there are researchers working in more than one area. In other words, the sub-areas are not isolated from each other. In this section, we discuss how these relationships between areas occur and their impact on the researchers’ h-index.

Relationship between Sub-areas based on Publications. Due to interdisciplinary research, publications may be associated with more than one sub-area. The distribution of the number of publications by the number of associated areas is in Figure 6. From a universe of 38,827 publications, 35,632 (92%) are classified in only one area. Now, considering the others (8%), we investigate how their sub-areas relate to each other. Figure 7 presents the resulting graph, in which the edge weights represent the number of publications associated with both sub-areas. There are two clearly distinguished groups in the graph: one more focused on Engineering sub-areas (leftmost region), and the other on Computing sub-areas (rightmost region). The pairs of areas with the largest number of publications in common are: Nanotechnology and Materials Engineering; Civil Engineering and Structural Engineering; Data Mining and Database.

Relationship between Sub-areas based on Researchers. Just as there are publications associated with more than one sub-area, there are also researchers associated with more than one sub-area. A researcher is associated with multiple sub-areas in two ways: when the researcher publishes one paper classified in more than one sub-area, or has at least two publications in different sub-areas. The distribution of the researchers by the number of sub-areas in which they work is in Figure 8. Figure 9 allows a closer look at such results, with vertices representing the sub-areas and edge weights the number of researchers who work in both connected areas. The complete graph is too dense; hence, we restrict the edges to those with a weight larger than 50. That is, an edge indicates that more than 50 researchers are associated with both of its vertices. The resulting graph is divided into three connected components. The pairs of areas with the most researchers in common are: Nanotechnology and Materials Engineering; Database and Data Mining.
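The edge filtering and the component count described in this paragraph can be sketched as follows. The edge weights in the example are invented for illustration and do not reproduce the values behind Figure 9; the function name is likewise an assumption.

```python
def connected_components(edges, min_weight):
    """Keep edges heavier than `min_weight` and return the connected components.

    `edges` maps (area_a, area_b) pairs to the number of shared researchers.
    Areas left without any surviving edge are omitted from the result.
    """
    adjacency = {}
    for (a, b), weight in edges.items():
        if weight > min_weight:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical weights (not the report's data).
toy_edges = {
    ("Nanotechnology", "Materials Engineering"): 120,
    ("Databases", "Data Mining"): 90,
    ("Data Mining", "Artificial Intelligence"): 60,
    ("Robotics", "Architecture"): 3,
}
for component in connected_components(toy_edges, min_weight=50):
    print(sorted(component))
```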


Figure 8: Number of Researchers vs Number of Areas

Figure 9: Researchers Graph (nodes connected if a researcher’s paper is classified in two areas or if a researcher has published papers in different areas; red nodes for those closer to Engineering)

Comparison of H-index in Related Areas. We also analyze how the h-index varies in related areas, comparing researchers who publish in only one of the areas with those who work in both at the same time. Specifically, we study the pairs of areas with the highest number of researchers in common: Nanotechnology and Materials Engineering; Database and Data Mining. For Nanotechnology and Materials Engineering, we examine the researchers’ h-index in all three cases: researchers working only in the first sub-area, researchers working only in the second sub-area, and researchers working in both sub-areas. Figure 10 shows that researchers working in both areas tend to have a higher h-index than those who publish in only one. The same study for Database and Data Mining confirms such results in Figure 11.
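The three-way split used in this comparison amounts to simple set operations. The sketch below summarises each group by its median h-index; the choice of the median as summary, the function name and the toy values are assumptions, since the report presents the groups graphically in Figures 10 and 11.

```python
from statistics import median


def compare_groups(h_index_of, in_first, in_second):
    """Median h-index of researchers in only the first area, only the second, or both."""
    groups = {
        "only_first": in_first - in_second,
        "only_second": in_second - in_first,
        "both": in_first & in_second,
    }
    return {
        name: (median(h_index_of[r] for r in members) if members else None)
        for name, members in groups.items()
    }


# Invented researchers and h-index values.
h_values = {"r1": 10, "r2": 14, "r3": 30, "r4": 8, "r5": 40}
first_area = {"r1", "r2", "r3", "r5"}
second_area = {"r3", "r4", "r5"}
print(compare_groups(h_values, first_area, second_area))
```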


Figure 10: H-index: Nanotechnology vs Materials Engineering

Figure 11: H-index: Database vs Data Mining

4.2 Analyses using Topological Metrics

In this section, we characterize the 58 sub-areas of Computer Science and Engineering according to each of the following topological (or social network) metrics: Modularity, Neighborhood Overlap, Clustering Coefficient, Eigenvector Centrality, Eccentricity Centrality, Closeness Centrality, Betweenness Centrality, Degree, Weighted Degree and PageRank.

4.2.1 Modularity

Modularity divides the graph into modules based on the density of their connections. Hence, it helps to identify which groups of areas share the largest number of publications and authors. In this analysis, we use the algorithm proposed by Blondel et al. [3], and Figure 12 shows the resulting graph. There are 10 clear groups (and one of disconnected areas) with the highest amount of shared publications; each is circled and numbered to facilitate identification. The algorithm forms groups containing areas considered similar. For example, Group 1 is more focused on hardware and communication, including Computer Networks, Signal Processing, Computer Security, Computer Hardware Design, among others. Although some areas do not fit with the rest of their group, as is the case of Biomedical Technology (in Group 1), most areas are grouped with others that tackle similar topics. Therefore, each group can be associated with a segment of research.

Figure 12: Modularity - Publications

Figure 13: Modularity - Researchers


[Bar chart: y-axis “Neighborhood Overlap” (0 to 0.7); x-axis “Edges between Areas” (pairs of area ids)]

Figure 14: Neighborhood Overlap - Publications

Figure 15: Neighborhood Overlap - Researchers (x-axis: Edges between Areas; y-axis: Neighborhood Overlap)

Next, we calculate the modularity of the graph in which the edges correspond to the researchers. This means that if an author has published in distinct areas, an edge is formed between those areas. The result is illustrated in Figure 13 with six larger groups. This second configuration indicates not only the areas with common research topics but also the areas a researcher tends to work in. For example, a researcher working with Computer Vision has a greater chance of working with Signal Processing or Remote Sensing than with Sustainable Energy.
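As a rough illustration of how such a researcher-based area graph could be assembled, the sketch below links two areas whenever one researcher has published in both and counts the shared researchers as the edge weight. The function name `build_area_graph`, the input layout and the toy data are assumptions made for this example only; they are not the collection pipeline used in the study.

```python
from collections import Counter
from itertools import combinations


def build_area_graph(author_areas):
    """Return {(area_a, area_b): weight}, where weight counts shared researchers."""
    edges = Counter()
    for areas in author_areas.values():
        # Each unordered pair of distinct areas of one researcher adds one unit
        # of weight to the edge between those two areas.
        for a, b in combinations(sorted(set(areas)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical toy data: researcher id -> areas in which they published.
author_areas = {
    "r1": ["Computer Vision", "Signal Processing"],
    "r2": ["Computer Vision", "Signal Processing", "Remote Sensing"],
    "r3": ["Sustainable Energy"],
}

area_edges = build_area_graph(author_areas)
for (a, b), weight in sorted(area_edges.items()):
    print(a, "<->", b, "shared researchers:", weight)
```

A community detection method such as the one in [3] could then be run on the resulting weighted edge list to obtain groups like those in Figure 13.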

4.2.2 Neighborhood Overlap

The Neighborhood Overlap metric varies from one edge to another. Given two nodes A and B, this metric is defined as the ratio:

\[
\frac{\text{number of nodes that are neighbors of both } A \text{ and } B}{\text{number of nodes that are neighbors of at least one of } A \text{ or } B}
\]

In other words, the greater the number of nodes that are simultaneously neighbors of both and the smaller the number of neighbors they do not share, the higher the neighborhood overlap value of an edge (A, B). Figure 14 shows the neighborhood overlap in the graph that connects distinct areas with common publications. An edge with a neighborhood overlap equal to zero (such as the one between Radar, Positioning & Navigation, #46, and Remote Sensing, #47) indicates that no other area in the graph has publications in common with both areas simultaneously. In contrast, the edge between Robotics (#48) and Evolutionary Computation (#23) has a high overlap, that is, several other areas share publications with both.
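The ratio defined in this section can be sketched in a few lines. The example below assumes the graph is stored as a dictionary mapping each node to the set of its neighbors, and it excludes A and B themselves from both counts, which is a common convention but an assumption here; the toy graph is purely illustrative.

```python
def neighborhood_overlap(adj, a, b):
    """Neighborhood overlap of edge (a, b) in an undirected graph.

    adj maps each node to the set of its neighbors. The endpoints a and b
    are excluded from both neighbor sets (assumed convention).
    """
    neighbors_a = adj[a] - {a, b}
    neighbors_b = adj[b] - {a, b}
    union = neighbors_a | neighbors_b
    if not union:
        # Neither endpoint has any other neighbor: define the overlap as 0.
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)


# Hypothetical toy graph: A-B, A-C, A-D, B-C.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

overlap_ab = neighborhood_overlap(adj, "A", "B")  # C is shared, D is not
overlap_ad = neighborhood_overlap(adj, "A", "D")  # no shared neighbor
```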


Figure 16: Clustering Coefficient - Publications (x-axis: Areas; y-axis: Clustering Coefficient)

Figure 17: Clustering Coefficient - Researchers (x-axis: Areas; y-axis: Clustering Coefficient)

At the level of common researchers, Figure 15 illustrates the neighborhood overlap values calculated over the graph that connects areas with common researchers. Now, for every two areas, there is always at least one other area that has authors in common with both areas.

4.2.3 Clustering Coefficient

The Clustering Coefficient reveals the tendency of a node to form groups. The larger and denser the connections around a node, the greater the probability of this node forming groups. We used the algorithm proposed by Latapy [11], where the clustering coefficient is calculated for each node by checking whether triangles are formed between this node and the others connected to it. Considering the graph of Figure 12 and the results in Figure 16, the areas with the highest clustering coefficient are Medical Informatics (Id = 33) and Nanotechnology (Id = 39). For the graph of Figure 13, whose results are in Figure 17, the clustering coefficients differ from the previous analysis: this graph is denser, making the coefficient high for all areas.
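To make the triangle-based idea concrete, the sketch below computes the local clustering coefficient of a node as the fraction of pairs of its neighbors that are themselves connected. This is a naive pairwise check, not the optimized algorithm of Latapy [11]; the adjacency layout and the toy graph are assumptions for illustration.

```python
from itertools import combinations


def clustering_coefficient(adj, node):
    """Fraction of neighbor pairs of `node` that are directly connected.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    neighbors = adj[node] - {node}
    k = len(neighbors)
    if k < 2:
        # A node with fewer than two neighbors cannot close any triangle.
        return 0.0
    # Each connected neighbor pair closes one triangle through `node`.
    triangles = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * triangles / (k * (k - 1))


# Hypothetical toy graph: A-B, A-C, A-D, B-C.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

coefficients = {node: clustering_coefficient(adj, node) for node in adj}
```

In a dense graph, such as the Researchers graph, most neighbor pairs are connected, which is consistent with the uniformly high coefficients reported for Figure 17.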

4.2.4 Eigenvector Centrality

Eigenvector Centrality measures the centrality of each node in the graph, but not every connection counts equally: nodes that attach to other central nodes get more weight than those that do not. Figures 18 and 19 present the results for the graphs of Figures 12 and 13, respectively. It is important to note that Artificial Intelligence (Id = 2) appears as central or almost central in both graphs. In contrast, some areas barely affect others and belong to no group, such as Architecture (Id = 1), Technology Law (Id = 53) and Textile Engineering (Id = 54).

Figure 18: Eigenvector Centrality - Publications (x-axis: Areas; y-axis: Eigenvector Centrality)

Figure 19: Eigenvector Centrality - Researchers (x-axis: Areas; y-axis: Eigenvector Centrality)
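The eigenvector centrality of Section 4.2.4 can be approximated by power iteration, as in the minimal sketch below. It iterates with the adjacency matrix plus the identity (which has the same eigenvectors and avoids oscillation on bipartite graphs) and rescales so that the most central node has value 1; the scaling, the stopping rule and the toy graph are assumptions of this example, and the graph is assumed to be connected.

```python
def eigenvector_centrality(adj, iterations=100, tol=1e-10):
    """Power iteration on (A + I) for an undirected, connected graph.

    adj maps each node to the set of its neighbors. Scores are rescaled so
    that the largest one equals 1.0.
    """
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # A node's new score is its own score plus the scores of its neighbors.
        new = {v: score[v] + sum(score[u] for u in adj[v]) for v in adj}
        top = max(new.values())
        new = {v: value / top for v, value in new.items()}
        diff = max(abs(new[v] - score[v]) for v in adj)
        score = new
        if diff < tol:
            break
    return score


# Hypothetical toy graph: a star with one hub and three leaves.
adj = {
    "hub": {"x", "y", "z"},
    "x": {"hub"},
    "y": {"hub"},
    "z": {"hub"},
}

centrality = eigenvector_centrality(adj)
```

In this toy star, the hub receives the maximum score because every leaf points to it, while the leaves share a lower, equal score.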

4.2.5 Metrics Using Shortest Path

This section groups the metrics based on shortest path calculations.

Eccentricity Centrality. The Eccentricity is a node centrality index. For each vertex, the shortest paths from (or to) every other vertex in the graph are measured and the maximum is taken. Thus, a lower eccentricity indicates that the node is closer to the other nodes in the network and, in other words, more exposed to their influence. Figure 20 shows that eccentricity is not an ideal metric to measure centrality in these types of graphs because there is no considerable variation between the areas. In the graph of Figure 13, the longest shortest path from one area to another is equal to 2. That is, it takes a maximum of 2 researchers to connect one area to another.

Closeness Centrality. This metric is another way of measuring node centrality using shortest paths. In this case, the value given to each node is based on the average of the shortest paths from that node to every other node in the graph, usually taken as its inverse. That is, the smaller the average distance, the higher the closeness value and the more central the node. Nodes with high values receive greater influence or obtain more access to information from the other nodes.
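Both shortest-path metrics above can be derived from a breadth-first search, as the sketch below suggests. It assumes an unweighted, undirected and connected graph stored as a neighbor dictionary, and it takes closeness as the inverse of the average distance, which is an assumption about the exact variant; the function names and the toy graph are illustrative only.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def eccentricity(adj, node):
    """Largest shortest-path distance from node to any other node."""
    return max(bfs_distances(adj, node).values())


def closeness(adj, node):
    """Inverse of the average shortest-path distance from node."""
    dist = bfs_distances(adj, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


# Hypothetical toy graph: a path A - B - C.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}

ecc = {v: eccentricity(adj, v) for v in adj}
clo = {v: closeness(adj, v) for v in adj}
```

In the toy path, the middle node has the lowest eccentricity and the highest closeness, which matches the intuition that it is the most central of the three.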


Figure 20: Eccentricity Centrality (series: Researchers, Publications; x-axis: Areas; y-axis: Eccentricity Centrality)

Figure 21: Closeness Centrality (series: Researchers, Publications; x-axis: Areas; y-axis: Closeness Centrality)

Figure 21 shows that the Closeness Centrality of the two graphs differs. The central elements of the Publications graph are Evolutionary Computation (Id = 23), Robotics (Id = 48), Artificial Intelligence (Id = 2) and Automation & Control Theory (Id = 3), whereas those of the Researchers graph are Engineering & Computer Science (general) (Id = 21), Artificial Intelligence, Computer Networks (Id = 14) and Automation & Control Theory. Although the two sets are distinct, some elements are present in both, like Artificial Intelligence and Automation & Control Theory. In addition, the chart is ordered by the Researchers graph to show that areas such as Architecture (Id = 1), Technology Law (Id = 53) and Textile Engineering (Id = 54) have a low closeness centrality in both cases.

Betweenness Centrality. Another measure of centrality based on shortest paths is Betweenness Centrality. This metric counts how many times a given node appears on the shortest paths between the other nodes in the graph. The more it appears, the higher the influence of this node and the greater its access to information from other areas. Figure 22 presents the results for the Betweenness Centrality metric. The chart is ordered by the value in the Researchers graph. Although the central nodes of the two graphs are distinct, there is a trend: the highest values of betweenness centrality are concentrated to the left of the figure, while the lowest values are clustered on the right. Even without an exact pattern, certain nodes tend to appear as central across the centrality metrics. This is the case of Artificial Intelligence (Id = 2), Automation & Control Theory (Id = 3) and Signal Processing (Id = 49). Others tend to appear as peripheral, which is the case of Architecture (Id = 1), Technology Law (Id = 53) and Textile Engineering (Id = 54).
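One standard way to compute betweenness on an unweighted graph is Brandes' algorithm, sketched below; whether the tool used in the study applies this exact procedure is an assumption. The sketch returns unnormalized values for an undirected graph stored as a neighbor dictionary, and the toy graph is hypothetical.

```python
from collections import deque


def betweenness_centrality(adj):
    """Unnormalized betweenness for an unweighted, undirected graph."""
    result = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search that also counts shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0.0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[source] = 1.0
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                result[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: value / 2.0 for v, value in result.items()}


# Hypothetical toy graph: a path A - B - C.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
betweenness = betweenness_centrality(adj)
```

Here only the middle node lies on a shortest path between two other nodes, so it is the only one with a nonzero value.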


Figure 22: Betweenness Centrality (series: Researchers, Publications; x-axis: Areas; y-axis: Betweenness Centrality)

Figure 23: Degree and Weighted Degree - Publications (series: degree, weighted degree; x-axis: Areas; y-axis: Degree)

4.2.6 Degree and Weighted Degree

The Degree indicates how many other nodes a node is connected to. In this study, the degree of a node represents how many other areas each area is connected to. The Weighted Degree works like the Degree, but it considers the weight of each edge. In this second case, a node has a higher value if it shares more publications or researchers with other areas. Figures 23 and 24 show the results for the Publications and the Researchers graphs, respectively. In Figure 23, there are areas connected with few other areas, but the relationship between them is very strong. Examples are Nanotechnology (Id = 39) and Materials Engineering (Id = 31): despite having a low degree, their weighted degrees are extremely high. However, there are also areas related to many other areas with strong relationships, such as Artificial Intelligence (Id = 2). Figure 24 is more uniform. That is, although there is no exact correlation, areas that are related to more areas also tend to have stronger relationships. Artificial Intelligence also appears in this figure as one of the areas with a higher degree and higher weighted degree.
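The difference between the two measures can be seen in a small sketch. It assumes the graph is given as a dictionary of undirected weighted edges, where the weight stands for the number of shared publications or researchers; the node names and weights below are made up for illustration and are not values from the study.

```python
def degree_and_weighted_degree(weighted_edges):
    """Return (degree, weighted_degree) dictionaries for an undirected graph.

    weighted_edges maps an unordered node pair (a, b) to the edge weight.
    """
    degree = {}
    weighted = {}
    for (a, b), weight in weighted_edges.items():
        for node in (a, b):
            # Degree counts incident edges; weighted degree sums their weights.
            degree[node] = degree.get(node, 0) + 1
            weighted[node] = weighted.get(node, 0) + weight
    return degree, weighted


# Hypothetical toy edges: "a" has few but heavy links, "c" has many light ones.
toy_edges = {
    ("a", "b"): 120,
    ("c", "d"): 10,
    ("c", "e"): 15,
    ("c", "a"): 1,
}

degree, weighted_degree = degree_and_weighted_degree(toy_edges)
```

Node "a" mirrors the situation described for Figure 23: a low degree combined with a high weighted degree.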

4.2.7 PageRank

PageRank is an algorithm used to determine the relevance or importance of a node in a graph. This relevance is determined by the connections established with other nodes. According to this metric, both Artificial Intelligence (Id = 2) and Automation & Control Theory (Id = 3) appear as the most influential areas in the two types of graphs. This can be seen in Figures 25 and 26.
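A minimal power-iteration sketch of PageRank is given below. It treats each undirected edge as a link in both directions and uses a damping factor of 0.85; both choices, like the function name and the toy graph, are assumptions of this example rather than settings reported in the study.

```python
def pagerank(adj, damping=0.85, iterations=100, tol=1e-12):
    """PageRank by power iteration; adj maps each node to its neighbor set."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        # Rank held by nodes without neighbors is spread uniformly.
        dangling = sum(rank[v] for v in adj if not adj[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in adj}
        for v in adj:
            if adj[v]:
                # Each node splits its damped rank equally among its neighbors.
                share = damping * rank[v] / len(adj[v])
                for w in adj[v]:
                    new[w] += share
        diff = sum(abs(new[v] - rank[v]) for v in adj)
        rank = new
        if diff < tol:
            break
    return rank


# Hypothetical toy graph: a star with one hub and three leaves.
adj = {
    "hub": {"x", "y", "z"},
    "x": {"hub"},
    "y": {"hub"},
    "z": {"hub"},
}

ranks = pagerank(adj)
```

The scores sum to 1, and the hub collects the largest share because all leaves link to it.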


Figure 24: Degree and Weighted Degree - Researchers (series: degree, weighted degree; x-axis: Areas; y-axis: Degree)

Figure 25: PageRank - Publications (x-axis: Areas; y-axis: PageRank)

4.2.8 Results of Analysis with Topological Metrics

The topological metrics used in this work indicate the influence of a node in the graph. If an area (node) has a high degree of importance, the knowledge produced in that area will be quickly noticed by other areas. During the analyses, we observed that some areas appear, most of the time, as more influential, for example Artificial Intelligence and Automation & Control Theory. In contrast, Architecture, Textile Engineering and Technology Law are less influential. These observations led us to assume that the main areas have researchers with the highest h-index and vice versa. For this analysis, considering a mean or median of the researchers' h-index is not enough; ideally, every researcher should be considered. Thus, the study proceeds to the next section, in which we correlate the topological metrics and the h-index.


Figure 26: PageRank - Researchers (x-axis: Areas; y-axis: PageRank)

Table 2: Kendall Correlation

| ID | Eigenvector Centrality | Eccentricity | Closeness Centrality | Betweenness Centrality | Degree | Weighted Degree | PageRank | Clustering Coefficient |
|---|---|---|---|---|---|---|---|---|
| 53 | 0.140880193 | 0.214879232 | 0.214879232 | NA | 0.140880193 | 0.140880193 | 0.139203933 | -0.109277359 |
| 54 | -0.003002747 | -0.071320481 | -0.071678983 | -0.079244252 | -0.000348649 | 0.004132 | -0.116675082 | 0.130218429 |

4.3 Correlation between Topological Metrics and H-index

The next step is to identify whether there is any correlation between the h-index of the researcher and the results of the topological metrics. For this, we create 58 co-authorship graphs (one for each area). That is, an edge is traced between every two researchers who worked together on a publication. We consider the following topological metrics: Eigenvector Centrality, Eccentricity, Closeness Centrality, Betweenness Centrality, Degree, Weighted Degree, PageRank and Clustering Coefficient. Thus, we seek to identify whether researchers who are more central or have more connections in the co-authorship network tend to obtain a higher h-index than the other researchers. To do this, we choose three standard correlation methods: Kendall, Pearson and Spearman. These coefficients show whether two metrics are correlated: Pearson measures linear correlation, while Kendall and Spearman measure monotonic (rank) correlation. If the coefficient value is closer to 1, then there is a correlation in which the two metrics grow together; if the value is closer to -1, then there is a correlation in which, as one metric grows, the other decreases. If the coefficient is close to 0, then the correlation between the two metrics is low or zero.

The correlation results are compiled in three tables (one for each type of correlation coefficient) with the correlation coefficient value between the researcher h-index and a topological metric. Tables 2, 3 and 4 show only the most expressive outcomes. The darker cells indicate correlations above 0.3 and the lighter cells indicate correlations below -0.1. The complete results can be found in the technical report. According to the tables, no correlation between the h-index of the researcher and any of the metrics is under -0.2 or above 0.4. This indicates that there is no explicit correlation between the h-index and any of the metrics in any area. The outcome is counterintuitive, since it is expected that the more central the researcher is in the network, the greater their h-index. In this way, we conclude that the h-index is determined by factors external to these metrics.

Table 3: Pearson Correlation

| ID | Eigenvector Centrality | Eccentricity | Closeness Centrality | Betweenness Centrality | Degree | Weighted Degree | PageRank | Clustering Coefficient |
|---|---|---|---|---|---|---|---|---|
| 1 | -0.110203843 | -0.067543684 | -0.042713825 | -0.059766489 | -0.044163531 | -0.044163531 | -0.007117905 | -0.033433911 |
| 3 | 0.26993069 | 0.093675774 | 0.06515788 | 0.279954145 | 0.297849161 | 0.317430205 | 0.289929446 | -0.062664262 |
| 16 | 0.151423196 | 0.030769881 | 0.018195911 | 0.140481375 | 0.308013192 | 0.285814884 | 0.359866542 | -0.083030848 |
| 18 | 0.213932394 | 0.002985135 | -0.018399192 | 0.274756298 | 0.303716717 | 0.290269137 | 0.315508231 | -0.061087002 |
| 25 | 0.099326211 | 0.0777369 | 0.065763806 | 0.247481293 | 0.153554141 | 0.196230284 | 0.152506833 | -0.11299234 |
| 34 | 0.071134107 | 0.115674132 | 0.100531939 | 0.33594391 | 0.147362518 | 0.168094165 | 0.119714314 | 0.079668846 |
| 40 | 0.005725564 | -0.104640398 | -0.108724755 | 0.075979978 | 0.06154051 | 0.048482622 | 0.071092985 | -0.00848774 |
| 53 | -0.118931574 | 0.148094892 | 0.148094892 | NA | -0.011919076 | -0.011919076 | 0.148094892 | -0.154889981 |
| 54 | 0.060274514 | -0.182832017 | -0.190417565 | -0.070799935 | -0.060523634 | -0.066321372 | -0.183620155 | 0.032858404 |
| 57 | 0.047022957 | -0.039859313 | -0.041232813 | 0.16430608 | 0.071948155 | 0.064415194 | 0.08295604 | -0.109657006 |

Table 4: Spearman Correlation

| ID | Eigenvector Centrality | Eccentricity | Closeness Centrality | Betweenness Centrality | Degree | Weighted Degree | PageRank | Clustering Coefficient |
|---|---|---|---|---|---|---|---|---|
| 8 | -0.056856702 | -0.103851241 | -0.102106271 | 0.030937674 | -0.055732856 | -0.054027526 | -0.07671004 | -0.001891296 |
| 11 | 0.034904151 | -0.020872666 | -0.034600809 | 0.297835252 | 0.150020415 | 0.147515744 | 0.249551228 | -0.105481261 |
| 19 | 0.075809908 | 0.034025478 | 0.028544745 | 0.24810278 | 0.147540501 | 0.148172039 | 0.190252898 | -0.111796313 |
| 53 | 0.182804147 | 0.258241743 | 0.258241743 | NA | 0.182804147 | 0.182804147 | 0.180864914 | -0.13132947 |
| 54 | -0.014168015 | -0.086409404 | -0.086668916 | -0.095931758 | -0.011547538 | -0.007507842 | -0.138540233 | 0.156153531 |
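For readers who want to reproduce this kind of comparison, the three coefficients can be sketched with the standard library alone, as below. The Kendall variant shown is tau-b (with tie correction) and Spearman uses average ranks for ties; which exact variants were used in the study is an assumption, and the two input lists are hypothetical values, not data from Tables 2 to 4.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b: concordant versus discordant pairs, corrected for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: contributes to no term
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom


# Hypothetical values: a topological metric and the h-index of five researchers.
metric = [1, 2, 3, 4, 5]
h_index = [1, 4, 9, 16, 25]

r_pearson = pearson(metric, h_index)
r_spearman = spearman(metric, h_index)
r_kendall = kendall_tau_b(metric, h_index)
```

Because the toy relationship is monotonic but not linear, the two rank-based coefficients reach 1 while Pearson stays slightly below it, which illustrates why the three methods can disagree on the same data.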

4.3.1 External Factors for Topological Metrics

This section aims to identify external factors that lead some central researchers to have a low h-index, while others that are not so central reach high values. For this, Database was chosen as the representative area because it has a large connected component. Initially, the graph is analyzed using the PageRank metric. In the graph of Figure 27, the size of a node represents its degree of centrality. The colors indicate the researcher h-index: the lighter the color, the higher the h-index; the darker the color, the lower the h-index. We select six vertices to be analyzed: three have a high h-index and a low PageRank, and the other three have a low h-index and a high PageRank. The external factors analyzed here are the number of publications collected from the researchers, the number of authors in each publication, the citations received by each publication and the quality of the publication venue following the Qualis⁴ 2016 parameters.

The first three researchers are the ones with a high h-index. Researcher 1 has an h-index equal to 88, and we collected three of his publications: the first one with 4 co-authors and the other two with 3 co-authors. The publications received 238, 171 and 38 citations, respectively. Two of them were published in the International Conference on Weblogs and Social Media and the other was published in the International Conference on Social Computing, both of which are qualified as A1 by Qualis. The h-index of researchers 2 and 3 is 64. From researcher 2, we collected five publications: the co-authorship varies between 2 and 4 authors and the citations between 44 and 152. For researcher 3, five publications were also collected: each has 4 co-authors and the number of citations is between 44 and 152.

Researchers 4, 5 and 6 have a low h-index, but a high PageRank value. Researcher 4 has an h-index equal to 9, and three publications of this researcher were collected, with co-authorship between 4 and 8. They received 184, 88 and 64 citations, respectively. The publication venues were the International World Wide Web Conferences (WWW) and the International Conference on Weblogs and Social Media, both with A1 qualification. The h-index of researcher 5 is equal to 16, and we also collected three publications, of which two have 4 co-authors and one has 5 co-authors. They received 991, 662 and 64 citations, respectively. The publication venues were the ACM International Conference on Web Search and Data Mining, the International World Wide Web Conferences (WWW) and the International Conference on Weblogs and Social Media, all with A1 qualification. Finally, the h-index of researcher 6 is equal to 13. Of the five publications collected, one has 3 authors and the others have 4. The citations vary between 64 and 991, and all publication venues have an A1 qualification.

After analyzing the external factors considered, we verify that none of them has a direct correlation with the researchers' h-index. The venue quality is the same for all researchers, considering that only the main publication venues were analyzed. The citations received also do not have a significant impact since, among the publications collected, those with the most citations belong to the researchers with the lower h-index. Likewise, the number of publications collected for each researcher is very similar, so there is no direct influence on the h-index. Finally, the number of co-authors of the publications is proportional to the number of co-authors collected, and it has already been shown that the degree and the weighted degree have no correlation with the researchers' h-index.

Figure 27: Database Co-authoring Graph

⁴ http://qualis.capes.gov.br/ (Accessed 08 March 2018)

5 Concluding Remarks

In this work, we present analyses of the publication profiles of the sub-areas of Engineering and Computer Science, through topological and non-topological metrics. Comparing the researchers' h-index in each area through social metrics (non-topological), we observed that each area presents its own profile with a particular proportion of h-index growth. Therefore, comparing researchers from distinct areas (or sub-areas) based on the h-index alone is insufficient. Moreover, the areas are not isolated: publications and researchers may be related to multiple areas and sub-areas. From the analysis, we realized that researchers working in more than one sub-area tend to reach a higher h-index than those who work in only one.

Through the topological metrics, we found that there are very influential areas, as is the case of Artificial Intelligence, and areas with less impact, like Architecture. In addition, we used the Kendall, Pearson and Spearman correlations to look for possible associations between the researchers' h-index and their level of significance in the co-authorship graph of their area. In this last analysis, we confirmed that there is no correlation, which indicates that factors external to these metrics contribute to a researcher having a high or low h-index. Finally, some profiles were analyzed based on four external factors (such as the number of publications collected from the researchers and the quality of the publication venue following the parameters of Qualis 2016) in order to identify whether any of them directly affects the h-index of researchers.

The next steps in this work consist of two points: enlarging the co-authorship network and checking other possible external factors that may be affecting the h-index value. The expansion of the co-authorship network is necessary because the Google Scholar website does not provide every researcher profile. That is, we analyzed only the researchers available in the Google Scholar system during the collection. A complete network, with all authors, may produce different results. The second point is necessary because the external factors considered in this study were analyzed individually and may be insufficient. This means that not just one, but a set of aspects may be altering the outcomes. In addition, other determinants may not have been taken into consideration.

Acknowledgements. Work partially funded by CNPq, CAPES and FAPEMIG.

References

[1] S. Ayaz, N. Masood, and M. A. Islam. Predicting scientific impact based on h-index. Scientometrics, 114(3):993–1010, 2018.
[2] F. Benevenuto, A. H. Laender, and B. L. Alves. The h-index paradox: your coauthors have a higher h-index than you do. Scientometrics, 106(1):469–474, 2016.
[3] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] L. Bornmann and R. Haunschild. Citation score normalized by cited references (CSNCR): The introduction of a new citation impact indicator. Journal of Informetrics, 10(3):875–887, 2016.
[5] R. Costas and M. Bordons. The h-index: Advantages, limitations and its relation with other bibliometric indicators at the micro level. Journal of Informetrics, 1(3):193–203, 2007. The Hirsch Index.
[6] R. Costas and T. Franssen. Reflections around 'the cautionary use' of the h-index: response to Teixeira da Silva and Dobránszki. Scientometrics, 115(2):1125–1130, May 2018.
[7] T. Z. Fu, Q. Song, and D. M. Chiu. The academic social network. Scientometrics, 101(1):203–239, 2014.
[8] M. Henzinger, J. Suñol, and I. Weber. The stability of the h-index. Scientometrics, 84(2):465–479, 2010.
[9] J. E. Hirsch. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences, 102(46):16569–16572, 2005.
[10] S. Klink, P. Reuther, A. Weber, B. Walter, and M. Ley. Analysing social networks within bibliographical data. In 17th International Conference Database and Expert Systems Applications, pages 234–243, 2006.
[11] M. Latapy. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theoretical Computer Science, 407(1-3):458–473, 2008.
[12] M. Liao, X. Gao, X. Peng, and G. Chen. CROP: an efficient cross-platform event popularity prediction model for online media. In 29th International Conference Database and Expert Systems Applications, pages 35–49, 2018.
[13] H. Lima, T. H. Silva, M. M. Moro, R. L. Santos, W. Meira Jr, and A. H. Laender. Aggregating productivity indices for ranking researchers across multiple areas. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, pages 97–106, New York, NY, USA, 2013. ACM.
[14] C. McCarty, J. W. Jawitz, A. Hopkins, and A. Goldman. Predicting author h-index using characteristics of the co-author network. Scientometrics, 96(2):467–483, 2013.
[15] G. V. Menezes, N. Ziviani, A. H. Laender, and V. Almeida. A geographical analysis of knowledge production in computer science. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 1041–1050, New York, NY, USA, 2009. ACM.
[16] J. Mingers and M. Meyer. Normalizing Google Scholar data for use in research evaluation. Scientometrics, 112(2):1111–1121, Aug 2017.
[17] M. Raheel, S. Ayaz, and M. T. Afzal. Evaluation of h-index, its variants and extensions based on publication age & citation intensity in civil engineering. Scientometrics, 114(3):1107–1127, 2018.
[18] A. S. M. Roberto da Silva, Fahad Kalil and J. P. M. de Oliveira. Universality in Bibliometrics. Physica A: Statistical Mechanics and its Applications, 391(5):2119–2128, 2011. https://doi.org/10.1016/j.physa.2011.11.021.
[19] J. A. Teixeira da Silva and J. Dobránszki. Multiple versions of the h-index: cautionary use for formal academic purposes. Scientometrics, 115(2):1107–1113, May 2018.