Codd’s World: Topics and their Evolution in the Community Publication Graph

Rutuja Shivraj Pawar, Sepideh Sobhgol, Gabriel Campero Durand, Marcus Pinnecke, David Broneske and Gunter Saake

Otto-von-Guericke University Magdeburg Germany [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

ABSTRACT ly network, helping the research community to implement data- Scholarly network analysis is the study of a scientic research net- driven decisions. SNA compromises of at least seven points of inte- work aiming to discover meaningful insights and making data- rest: (i) authors collaborating on papers (co-authorship networks), driven research decisions. Analyzing such networks has become (ii) documents referencing each other (citation networks), (iii) do- increasingly challenging, due to the amount of scientic research cuments cited together (co-citation networks), (iv) documents that that is added every day. Furthermore, online resources often in- cite other documents in a similar manner (bibliographical coup- clude information from other online sources (e.g., academic social ling), (v) content clusters in document sets (topic networks), (vi) platforms), enabling to study networks on a larger and more com- word clusters occurring together (co-words), and (vii) any diver- plex scope. In this paper, we present a study on a specic research sity of types and relationships (heterogeneous networks)[2]. Some network: The (relational) database community publication graph, example SNA studies to mine knowledge from these scholarly net- that we call Codd’s World; a transitive closure over citations from works include analyzing the citation relationships to evaluate the the foundational work of E.F. Codd. We specically analyze the impact of a given paper or an author [3] or studying co-author be- topics of the published papers, the relevance of authors and pa- havior to identify the scientic community distribution [4]. Fur- pers, and how this relates to raw publication counts. Among our thermore, topic network analysis (TNA) can be used to extract the ndings, we show that topic modeling can be a useful entry point underlying topics from a corpus in terms of word distribution and for scholarly network analysis. also the anity of each document to a topic. Discovering the to- pics unveils the underlying structure of the data and thus better serves as a rst step towards SNA. Furthermore, the visualization Keywords of the extracted topics can lead to discovering topic evolution over Data Analysis, Topic Modeling, Database Publication Network, time [5–7]. Science of Science Main Contributions: In this paper we undertake an SNA stu- dy over a sub-network from the DBLP dataset of 1. INTRODUCTION publications [8] with TNA as our starting point towards under- standing this complex network structure. Specically, we select Rapid advancements in science and research leads to enormous a network that corresponds to the (relational) database inuence amounts of digital scholarly data being produced and collected community. We call this specic network Codd’s World. Our main every day [1]. This scholarly data can be in the form of scientic contributions can be summarized as follows: publications, books, teaching materials, and many other scholarly sources of information made available on the Web. Apart from the (i) Unlike previous studies about the database community, which volume of scholarly data, there is a meaningful variety of connecti- have restricted the community to encompass papers appea- ons in this data. For instance, papers are connected through citati- ring in top database venues, like VLDB, ICDE, EDBT and ons, authors are connected in networks of collaboration, and there SIGMOD [9]; in this paper we identify the community by are many other embedded relationships. As a consequence, in the using the foundational work of E.F. Codd [10] as a starting world of research, understanding relevant work and its scientic point, and collect all papers transitively related to this work impact has become more and more challenging. One approach that through citation relationships. As a result, we expect our work allows modeling the relevance of papers by leveraging the net- to show a more diverse view of the database community, works in which they are embedded is scholarly network analysis spanning inuences beyond the top database venues. (SNA). (ii) Unlike previous studies about the database community, in SNA proposes to study the underlying structure of a scholar- our work, we propose to consider topics as an important di- mension to understand the underlying structure of the pu- blication network; and how it can be a rst step for visuali- zing the content and discovering meaningful trends. Hence, this paper presents a detailed description of how unveiling the research topics was utilized as the rst step towards SNA on the database publication graph. We complement this ap- proach by analyzing relevance and the role of self-citations. 31st GI-Workshop on Foundations of (Grundlagen von Daten- banken), 11.06.2019 - 14.06.2019, Saarburg, Germany. The rest of the paper is structured as follows: The relevant basic Copyright is held by the author/owner(s). background is presented in Sec.2. The SNA carried out with the

1 Entire Research Citation Graph

Codd's World ......

A Relational Model of Data for CITES AUTHORED E.F. Codd Large Shared Data Banks ... Are Databases Fit ... for Hybrid Workloads on ... GPUs?

M. Pinnecke G.D. Campero D. Broneske G. Saake

...

on certain probabilities and helps towards assigning each docu- ment with the identied topics. Non-negative Matrix Factorization (NMF) NMF is a linear- algebra optimization algorithm used for dimensionality reducti- Co-Authorship RAW Graph Citations Co-Citations (Collaboration) on and data analysis [18]. NMF factorizes a document-term ma- trix (i.e., a matrix representing the frequency of terms in dierent documents) into two matrices namely the term-feature and the feature-document, with the property that all the three matrices will have non-negative elements [19].

Bibliographical Heterogeneous Topics Co-Words Coupling Networks 2.3 Relevance Ranking A paper is considered to be most inuential if it is cited mo- Figure 1: Overview of SNA on raw graphs: co author- re often by other inuential papers in the scholarly network [20]. ship/collaboration, citations, co-citations, bibliographical Papers are ranked similar to search engines ranking of web pages, coupling, topics, co-words, and heterogeneity. with the dierence that instead of using the hyperlink network, the citation network formed by the publications is utilized. Conse- quently, this ranking mechanism also helps to determine the most results answering the formulated research questions is detailed in important authors. In SNA, determining the most inuential aut- Section 3. Section 4 concludes the paper discussing the next steps hors based on a ranking mechanism on a collaboration network is in this research direction. also possible. Ranking mechanisms based on the page rank algo- rithm [21], considering collaboration and citation networks were 2. BACKGROUND utilized for our analysis. We used 20 iterations, with a damping factor of 0.85. In this section we provide background on concepts and terms used in this work. We start with SNA (Section 2.1), continue with 2.4 Self-Citation Detection topic models (Section 2.2), relevance ranking (Section 2.3) and end A self-citation occurs when a cited publication shares at least with detection of self citations (Section 2.4). one common author with the publication that cites it [22]. Self- 2.1 Scholarly Network Analysis citations boost a paper’s citation count for a paper. Self-citations that are introduced in a new research paper to indicate an extension Figure1 shows various aspects of SNA depicting the seven types to the author’s previous work is considered valid. However, unvei- of networks usually considered for SNA on RAW graphs. Further ling these semantic self-citations would require exhaustive proces- elaborating on their functionality, Citation and co-citation net- sing. In SNA, understanding these self-citation counts for a paper work analysis is used to nd relationships between cited papers uncovers information on whether the authors have attempted to and a set of papers which cite those papers. Moreover, citation have a global information coverage on the topic. Considering this analysis can be employed in community detection which is one aspect, in our work we study if self-citations have an impact on of the fundamental tasks of network analysis [11]. Bibliographic the relevance and citation counts of the most relevant papers. coupling also employs citation analysis to link documents which reference a common third cited work in their bibliography [12]. Co-authorship network analysis can be used to nd scientic col- 3. SNA ON CODD’S WORLD laboration between authors [13] depicting how individual scienti- In this section, we highlight important aspects of the SNA per- c ideas of authors can get together through collaboration to cause formed on Codd’s World. We start with the illustration of how an explosion of scientic ndings. Through co-word analysis [14], Codd’s World was created, further describe the formulated rese- it is possible to identify the relationships between subjects in the arch questions, highlight the tooling framework and then present specic eld of research through nding the co-occurrence of key- the detailed output evaluation. words which helps to examine the development of science in spe- cic areas. Discovering citation evolution over time or future ci- 3.1 Creation of Codd’s World tation through coupling co-authorship and citation networks can We choose the bibliographic database for computer science be considered as a task of heterogeneous network analysis [15]. DBLP [8] as the main source of information with the related Pa- per Abstracts curated from the Microsoft Academic Graph [23]. We 2.2 Topic Models used the DBLP release of April 2018. To narrow down information Topic Models (TMs) are unsupervised learning models that, when in DBLP specic to the database community, we take the node of given a set of documents as input, learn the underlying topics in the foundational paper of Edgar Codd on relational databases [10]. terms of word distribution and also the anity of each document Furthermore, all papers with transitive reference relationships to towards a topic [5]. In SNA, TMs are useful to discover the current the main paper node were considered (i.e., papers that cite tran- research topics as well as its relevant publications and to identify sitively the work of Codd), with no limitations on path length. topic trends through time. Among the many prevalent topic mo- This leads to the formation of our database community citation deling techniques, the following have seen a growing utilization network graph (Codd’s World). Based on this, further sub-graphs in various applications. were constructed to form the base of the scholarly network. The Latent Dirichlet Allocation (LDA) LDA is a probabilistic ge- resulting heterogeneous network consisted of four nodes, namely, nerative topic model used for modeling of discrete data collections (a) authors, (b) papers, (c) venues and (d) journals (cf. Figure2). through learning the relationships between words, topics and do- The ve types of relationships in the network consisted of author- cuments [16,17]. LDA views each document as a mixture of topics ship, collaboration, belonging to venue, belonging to journal and and through its processing, LDA assigns documents with the mo- citation. Taken together, these nodes and relationships formed the deled topics. The word distribution in the modeled topics is based database of our scholarly network1, which was then further used

2 2 The Science Python 3.0 . The framework utilizes the Python library TOM (TO- Publication Graph pic Modeling) [24] for topic extraction using LDA or NMF. TOM library was selected after a careful examination of its oered func- tionality and the relevant research study utilizing it [25]. This de-

Codd's World ... veloped framework was mainly used to visualize and answer RQ1. ...

A Relational Topic Model Selection As presented in Section 2.2, there are Model of Data for CITES AUTHORED E.F. Codd Large Shared two approaches that can be used for topic modeling. We applied Data Banks ... Are both techniques, NMF and LDA, and inspected the output of both Databases Fit ... for Hybrid the TMs on our data set. We found out that the topics given by Workloads on ... GPUs? NMF provided better understanding and were more interpretable when compared to the topics given by LDA. Hence, we use NMF

M. Pinnecke G.D. Campero D. Broneske G. Saake throughout this work for topic modeling.

... Optimal Number of Topics Estimation Estimating the op- timal number of topics is a crucial part of topic modeling. If the number is too small the topics will be vague and if the number is Papers Authors Venues Journals too large the topics will be redundant and overlap too much. The estimation of the ideal number of topics as 30 was based on the size of the input dataset, following the Greene metric over top 10 Figure 2: The heterogeneous network Codd’s World, a subset words. of the science publication graph: papers (transitively) citing Further Analysis The further analysis on the dataset to answer the relational database foundation paper by E.F. Codd, aut- 3 Co-Authorship RQ2 - RQ4, used the graph database Neo4j to store and retrieve horsRA ofW Graph these papers, related venuesCitations and journals.Co-Citations (Collaboration) connected data using its query language, Cypher. The nodes and their relationships were loaded into Neo4j and queries written in for our SNA. Overall, our scholarly network contains 3,122,404 Cypher were utilized to obtain results for the respective research papers, 1,338,357 authors, 25,166,959 citations and 2,815,781 co- questions. For the Page Rank Algorithm required for RQ3 and RQ5, Bibliographical Heterogeneous to identify the most inuential paper and most inuential author authorship edges. Topics Co-Words Coupling Networks respectively, we employed the default provided by Neo4j. The sco- 3.2 Research Questions res were calculated with a Cypher query which implements the page rank algorithm in Neo4j. In order to nd the most inuential Through our SNA, we answer ve carefully formulated research paper and most inuential author, the graph of Codd’s World was questions to understand the scholarly network data. The research enriched with the calculated page rank scores and fed back into questions answered throughout our analysis of the network data Neo4j. are: Design and Assumptions The following assumptions were made prior to the analysis: • RQ1: From topic evolution through time, are there stand-out topics? (i) The input dataset used for analysis only consists of papers that cite transitively the work of Codd and is assumed to have • RQ2: Does the use of self-citations have an impact on the most cited papers per topic per year? no duplicate authors. (ii) The tools and techniques used for analysis are assumed to • RQ3: Does the use of self-citations have an impact on the most inuential papers per topic per year? serve their purpose eciently. Accordingly, the analysis re- sults were formulated and validated based on the best of the • RQ4: Does the use of self-citations have an impact on the knowledge of the authors. citations per topic per year? Considering these design and assumptions we further analyzed • RQ5: Is their a dierence among the most important authors the scholarly network. per topic, looking at collaboration only, citation only, and mixed, while considering self-citations or not? 3.4 Evaluation In this part, we analyze the results of our SNA in order to answer We further describe the utilized tooling and analysis framework our research questions RQ1 - RQ5. However, due to space limitati- and then present a detailed evaluation of the research questions. ons, we have included the detailed output results of the SNA along with the related Cypher queries as a separate technical analysis re- 3.3 Tools port which is available online4. We now describe the tools utilized towards answering the for- mulated research questions through the description of the topic 3.4.1 RQ1: Evolution of Topics Through Time modeling framework with its various aspects concerning topic mo- Relevance Visualizing topic evolution depicts popular research del selection and the optimal number of topics estimation. Addi- topics, and measures topic change over time, increase or decrease tionally, further analysis carried out is also detailed. of importance for a topic and other topic evolutionary characteri- Topic Modeling Framework Since the main aim of the analysis stics, thus helping to better understand the research trend in the was the extraction of meaningful and frequent topics from the pa- 2 per abstract, a framework for Topic Modeling was implemented in Code available online: https://github.com/Rspawar/Scholarly-Network-Analysis.git 1Datasets available online: 3https://neo4j.com/developer/ ftp://itidbftppublic:[email protected]/ 4https://github.com/Rspawar/Scholarly-Network-Analysis/blob/ upload/protolabs/datasets/as-is/codds-world/2019-02 master/Technical_Analysis_Report.pdf

3 Title TopicName Count A Survey of Analytical Time-Sharing NumericalAnalysis 3 Models A relational model of data for large DataMining 3 shared data banks Optimizing the Performance of a Drum- TemporalAnalysis 2 Like Storage

Table 2: Most cited papers in 1970 without self-citation

3.4.3 RQ3: Top Influence per Topic per Year Relevance Measuring the most inuential paper based on its ranking in the network is an indicator of high acceptance of the research work by the scientic community. Visualizing the top in- uential papers per topic helps to understand the trend of topic acceptance over the years. Figure 3: Evolution of topics through time Results and Discussion Cypher queries were executed to re- turn the most inuential papers (based on Page Rank score) for a particular year with and without self-citation. Tables3 and4 database eld. summarize the query output for a particular year 1970 with and Results and Discussion Figure3 shows the evolution of the without self-citation. Similarly, the network was queried for all identied 30 topics over the years, indicating the percentage of the years and also on specic years like 2017. Observation of the papers from those published in a year that are assigned to a given tables suggests that the highest Page Rank is indeed associated topic. It is seen from the gure that only on starting years and end with the old papers but is not necessarily always true. As expec- years, does there seem to be a disproportion towards some topics. ted, the foundational paper of Edgar Codd on relational databases This is due to the data collection and can be dismissed. Some to- remains the most inuential over all the years (with and without pics, specially Topics 0 (named Numerical Analysis) (concerning self-citation). The results of this research question cannot be com- estimations and cost models) and Topics 5, 11, 19, 26, 28 (Hardwa- pared with the results of RQ as self-citation makes a dierence re, Algorithms, Temporal Analysis, Distributed Systems, Informati- 2 on the network dynamics (given that Page Rank scores depend on Retrieval) have seen a steady presence in research throughout on the complete network structure) but not on the citation count the years. Other topics, like 2, 14, 18, 21, 22 (Networking, Image of the most cited papers. Furthermore, we observe that removing Processing, Network Analysis, Machine Learning, Video Processing) self-citations leads to a higher range for the scores of the most have seen an increase in relevance, with Machine Learning’s incre- inuential paper, showing that self-citation does indeed make a ase being remarkable. Among the topics that have seen a relative dierence in the scoring, though it does not change the top items decrease interest is Topic 8 (Operating Systems), which played a ranked in their inuence. larger role in the community during the rst decades shown, but seems to do less so in recent years. Title Topic Name Score A relational model of data for large DataMining 814.42 3.4.2 RQ2: Top Citations per Topic per Year shared data banks Relevance Understanding most cited papers helps to measure Virtual memory NumericalAnalysis 151.17 Toward an understanding of data the overall scientic impact made by a paper. Recognizing the most NetworkAnalysis 27.37 cited papers per topic per year facilitates deep analysis through structures A schema for describing a relational NumericalAnalysis 18.15 measuring the trends in the scientic impact along the years. data base Results and Discussion Cypher queries were executed to re- Introduction to storage structure NumericalAnalysis 3.31 turn the most cited papers for a particular year with and without denition self-citation. Tables1 and2 summarize the query output for the ye- Time-sharing for OS TemporalAnalysis 1.64 TICKETRON: a successfully operating ar 1970 with and without self-citation. Similarly, the network was DistributedSystems 0.23 queried for the year 2017. Comparing the tables, it is observed that system without an Swap-Time Considerations in Time- the majority of the returned papers with their topics are the same TemporalAnalysis 0.18 Shared Systems in both the queries. This suggests that the top papers returned do A contiuum of time-sharing scheduling Applications 0.15 not achieve their most cited criteria through self-citation. algorithms

Title TopicName Count Table 3: Most inuential papers (based on Page Rank score) A Survey of Analytical Time-Sharing NumericalAnalysis 3 with self-citation 1970 Models A relational model of data for large DataMining 3 shared data banks Optimizing the Performance of a 3.4.4 RQ4: Citations per Topic Through Time TemporalAnalysis 2 Drum-Like Storage Relevance Measuring citation count for a topic helps to un- Principles of Optimal Page Replacement Optimization 1 derstand its research popularity among the scientic community. Analyzing citation count per topic per year helps to measure the Table 1: Most cited papers in 1970 with self-citation relevant trends of research on a topic over the years. Results and Discussion Cypher queries were executed to re- turn the citation count for a topic for all the years. Figure4 visua- lizes the output for topic Machine Learning. Observing Figure4

4 Title Topic Name Score that removing self-citation has a large inuence on the combined A relational model of data for large DataMining 13669.49 score, aecting the order of top items in the list. This shows that, shared data banks Toward an understanding of data though these kind of citations do not aect the ranking of top pa- NetworkAnalysis 5092.02 structures pers, the sum of these changes as it is aggregated into the score of Virtual memory NumericalAnalysis 3988.57 an author, creates a dierence. Hence self-citation does inuence A schema for describing a relational NumericalAnalysis 264.00 the ranking of the top authors, therefore scoring measures that dis- data base miss these kind of links, could be a good choice for understanding Introduction to storage structure NumericalAnalysis 18.38 authors relevance in the network we study. denition TICKETRON: a successfully operating DistributedSystems 12.22 system without an operating system Author Name Score Time-sharing for OS TemporalAnalysis 5.36 Scott Shenker 2323.37 Swap-Time Considerations in Time- 1693.30 TemporalAnalysis 0.21 Shared Systems Georey E. Hinton 1563.39 A contiuum of time-sharing scheduling Hari Balakrishnan 1534.55 Applications 0.15 algorithms Rakesh Agrawal 1505.22

Table 4: Most inuential papers (based on Page Rank score) Table 5: Combination of Author Rank and Page Rank with without self-citation 1970 self-citation for Topic Data Mining

Citation count for Topic Machine Learning over all the years Author Name Score CitationCount

250000 E. F. Codd 18399.13 Daniel G. Bobrow 14275.87 Carl Hewitt 12347.67 200000 Ben Wegbreit 9271.63 Peter J. Denning 7198.57 150000

CitationCount Table 6: Combination of Author Rank and Page Rank wi- 100000 thout self-citation for Topic Data Mining

50000

0

1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 4. RELATED WORK Years SNA towards an understanding of the scientic research com- munity can be seen as a fast growing research area. This analytical Figure 4: Citation count for topic Machine Learning over all study carried out and presented in our paper is mainly inspired by the years Prof. Dr. Erhard Rahm and Prof. Dr. Andreas Thor work on Ci- tation analysis of database publications [9] and Prof. Dr. Erhard Rahm and David Aumüller work on Aliation analysis of databa- indicates an increasing trend for the selected topic Machine Lear- se publications [27]. Additionally, Topic Modeling for SNA helps ning over the years. Additionally, running the query for other to- to extract meaning out of the scholarly information in the form pics identied no signicant downtrend for any topic. This could of research topics. The work which is closely related to our rese- be given the fact that we have not included information regarding arch [7] is based on a hybrid topic model. Contrary to our research, the distribution of the papers having high citation counts. Power where it is required to set a xed number of topics, the hybrid to- Law analysis [26] can be used to solve this problem in the future pic model incorporates a dynamic model which is not based on by drilling down into papers. a xed number of topics. This dynamic consideration can further prove benecial to accurately model and understand the evolving 3.4.5 RQ5: Top Influential Author per Topic and ever-changing nature of the scientic community. Relevance Measuring the top inuential author per topic com- bined and ranked over all the years, helps to understand the popu- lar acceptance of the author’s research on a particular topic among 5. CONCLUSION AND FUTURE WORK the scientic community. It is also an indicator of the valuable con- Summarizing, in this paper we introduce the Codd’s World da- tribution made by the author towards the research topic. taset, and we presenting some early analytical results obtained Results and Discussion Cypher queries were executed to re- through SNA on this dataset. The modeling of the 30 topics pre- turn the most inuential authors through combination of Author sented the prominent areas of research in the database communi- Rank and Page Rank on collaboration/co-authorship network for a ty. Interestingly it is observed that the database community, when particular topic with and without self-citation. We found that Au- considered as an inuence network in Codd’s World, includes ma- thor Rank alone did not provide insightful information on the top ny topics that go beyond databases, such as Machine Learning, inuential authors, with possible errors introduced by lack of au- Energy and a few others. This shows the large inuence of data- thor name disambiguation. We also found little dierences when base research on overall computing research. From our evaluation combining Page Rank and Author Rank, with respect to Page Rank we can also report a relative downtrend in publications for some alone, since the scores on the collaboration network are in a smal- topics like Operating Systems, and uptrends in Machine Learning ler scale when compared to the scores on the citation network. and Image Processing. Similarly, we report that no topic has a do- Table5 and6 summarize the partial output of the top authors ac- minance in their presence throughout the years, though some to- cording to a combined AuthorRank and PageRank score, for the pics like NumericalAnalysis (concerning cost models in databases) topic Data Mining with and without self-citation. Results show and Hardware show a constant presence through time. Similarly

5 we show that self-citation plays no big role on the most cited pa- Journal of the American Society for Information Science and pers, though it aects the network structure and thus changes the Technology, 61(12):2389–2404, 2010. scoring, having impact on the long tail of cited papers. We also nd [13] Tze-Haw Huang and Mao Lin Huang. Analysis and results suggesting that to date, the collaboration network that we visualization of co-authorship networks for understanding study is not sucient to evaluate an authors relevance, but that academic collaboration and knowledge domain of instead a ranking based on citation networks while removing self- individual researchers. In Computer Graphics, Imaging and citations could be a good choice. Visualisation, 2006 International Conference on, pages 18–23. Future work could improve limitations in this dataset, observed IEEE, 2006. from authors sharing the same name but being dierent, which [14] Qin He. Knowledge discovery through co-word analysis. might have introduced errors in the ranking of authors in the col- 1999. laboration network. Future work can also consider other dimensi- [15] Erjia Yan and Ying Ding. Scholarly network similarities: ons for SNA on this dataset. How bibliographic coupling networks, citation networks, cocitation networks, topical networks, coauthorship 6. ACKNOWLEDGMENTS networks, and coword networks relate to each other. The authors would like to thank the reviewers for their insight- Journal of the American Society for Information Science and ful feedback. This work was partially funded by the DFG (grant: Technology, 63(7):1313–1326, 2012. SA 465/50-1). [16] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 7. REFERENCES 3(Jan):993–1022, 2003. [1] Feng Xia, Wei Wang, Teshome Megersa Bekele, and Huan [17] Thomas L Griths and Mark Steyvers. Finding scientic Liu. Big scholarly data: A survey. IEEE Transactions on Big topics. Proceedings of the National academy of Sciences, Data, 3(1):18–35, 2017. 101(suppl 1):5228–5235, 2004. [2] Erjia Yan and Ying Ding. Scholarly networks analysis. In [18] Michael W Berry and Murray Browne. Email surveillance Encyclopedia of Social Network Analysis and Mining, pages using non-negative matrix factorization. Computational & 1643–1651. Springer, 2014. Mathematical Organization Theory, 11(3):249–264, 2005. [3] Mark Newman. Networks. Oxford university press, 2018. [19] Daniel D Lee and H Sebastian Seung. Learning the parts of [4] Wei Wang, Jiaying Liu, Shuo Yu, Chenxin Zhang, Zhenzhen objects by non-negative matrix factorization. Nature, Xu, and Feng Xia. Mining advisor-advisee relationships in 401(6755):788, 1999. scholarly big data: A deep learning approach. In Proceedings [20] Carl Bergstrom. Eigenfactor: Measuring the value and of the 16th ACM/IEEE-CS on joint conference on digital prestige of scholarly journals. College & Research Libraries libraries, pages 209–210. ACM, 2016. News, 68(5):314–316, 2007. [5] Sergey I Nikolenko, Sergei Koltcov, and Olessia Koltsova. [21] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Topic modelling for qualitative studies. Journal of Winograd. The pagerank citation ranking: Bringing order to Information Science, 43(1):88–102, 2017. the web. Technical report, Stanford InfoLab, 1999. [6] Abram Hindle, Michael W Godfrey, and Richard C Holt. [22] Yuxiao Dong, Hao Ma, Zhihong Shen, and Kuansan Wang. What’s hot and what’s not: Windowed developer topic A century of science: Globalization of scientic analysis. In Maintenance, 2009. ICSM 2009. IEEE collaborations, citations, and innovations. In Proceedings of International Conference on, pages 339–348. IEEE, 2009. the 23rd ACM SIGKDD International Conference on [7] Christin Katharina Kreutz. A hybrid approach for dynamic Knowledge Discovery and Data Mining, pages 1437–1446. topic models with uctuating number of topics. In ACM, 2017. Grundlagen von Datenbanken, pages 35–40, 2018. [23] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin [8] Michael Ley. The computer science bibliography: Eide, Bo-june Paul Hsu, and Kuansan Wang. An overview of Evolution, research issues, perspectives. In International microsoft academic service (mas) and applications. In symposium on string processing and information retrieval, Proceedings of the 24th international conference on world pages 1–10. Springer, 2002. wide web, pages 243–246. ACM, 2015. [9] Erhard Rahm and Andreas Thor. Citation analysis of [24] Adrien Guille and Edmundo-Pavel Soriano-Morales. Tom: A database publications. ACM Sigmod Record, 34(4):48–53, library for topic modeling and browsing. In EGC, pages 2005. 451–456, 2016. [10] Edgar F Codd. A relational model of data for large shared [25] Adrien Guille, Edmundo-Pavel Soriano-Morales, and data banks. Communications of the ACM, 13(6):377–387, Ciprian-Octavian Truica. Topic modeling and hypergraph 1970. mining to analyze the egc conference history. In EGC, pages 383–394, 2016. [11] Satrio Baskoro Yudhoatmojo and Muhammad Arvin Samuar. Community detection on citation network of dblp [26] Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ data sample set using linkrank algorithm. Procedia Newman. Power-law distributions in empirical data. SIAM Computer Science, 124:29–37, 2017. review, 51(4):661–703, 2009. [12] Kevin W Boyack and Richard Klavans. Co-citation analysis, [27] David Aumüller and Erhard Rahm. Aliation analysis of bibliographic coupling, and direct citation: Which citation database publications. ACM SIGMOD Record, 40(1):26–31, approach represents the research front most accurately? 2011.

6