Codd's World: Topics and Their Evolution in the Database

Codd’s World: Topics and their Evolution in the Database Community Publication Graph Rutuja Shivraj Pawar, Sepideh Sobhgol, Gabriel Campero Durand, Marcus Pinnecke, David Broneske and Gunter Saake Otto-von-Guericke University Magdeburg Germany [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] ABSTRACT ly network, helping the research community to implement data- Scholarly network analysis is the study of a scientic research net- driven decisions. SNA compromises of at least seven points of inte- work aiming to discover meaningful insights and making data- rest: (i) authors collaborating on papers (co-authorship networks), driven research decisions. Analyzing such networks has become (ii) documents referencing each other (citation networks), (iii) do- increasingly challenging, due to the amount of scientic research cuments cited together (co-citation networks), (iv) documents that that is added every day. Furthermore, online resources often in- cite other documents in a similar manner (bibliographical coup- clude information from other online sources (e.g., academic social ling), (v) content clusters in document sets (topic networks), (vi) platforms), enabling to study networks on a larger and more com- word clusters occurring together (co-words), and (vii) any diver- plex scope. In this paper, we present a study on a specic research sity of types and relationships (heterogeneous networks)[2]. Some network: The (relational) database community publication graph, example SNA studies to mine knowledge from these scholarly net- that we call Codd’s World; a transitive closure over citations from works include analyzing the citation relationships to evaluate the the foundational work of E.F. Codd. We specically analyze the impact of a given paper or an author [3] or studying co-author be- topics of the published papers, the relevance of authors and pa- havior to identify the scientic community distribution [4]. Fur- pers, and how this relates to raw publication counts. Among our thermore, topic network analysis (TNA) can be used to extract the ndings, we show that topic modeling can be a useful entry point underlying topics from a corpus in terms of word distribution and for scholarly network analysis. also the anity of each document to a topic. Discovering the topics unveils the underlying structure of the data and thus better serves as a rst step towards SNA. Furthermore, the visualization Keywords of the extracted topics can lead to discovering topic evolution over Data Analysis, Topic Modeling, Database Publication Network, time [5–7]. Science of Science Main Contributions: In this paper we undertake an SNA study over a sub-network from the DBLP dataset of computer science 1. INTRODUCTION publications [8] with TNA as our starting point towards understanding this complex network structure. Specically, we select Rapid advancements in science and research leads to enormous a network that corresponds to the (relational) database inuence amounts of digital scholarly data being produced and collected community. We call this specic network Codd’s World. Our main every day [1]. This scholarly data can be in the form of scientic contributions can be summarized as follows: publications, books, teaching materials, and many other scholarly sources of information made available on the Web. Apart from the (i) Unlike previous studies about the database community, which volume of scholarly data, there is a meaningful variety of connecti- have restricted the community to encompass papers appea- ons in this data. For instance, papers are connected through citati- ring in top database venues, like VLDB, ICDE, EDBT and ons, authors are connected in networks of collaboration, and there SIGMOD [9]; in this paper we identify the community by are many other embedded relationships. As a consequence, in the using the foundational work of E.F. Codd [10] as a starting world of research, understanding relevant work and its scientic point, and collect all papers transitively related to this work impact has become more and more challenging. One approach that through citation relationships. As a result, we expect our work allows modeling the relevance of papers by leveraging the net- to show a more diverse view of the database community, works in which they are embedded is scholarly network analysis spanning inuences beyond the top database venues. (SNA). (ii) Unlike previous studies about the database community, in SNA proposes to study the underlying structure of a scholar- our work, we propose to consider topics as an important di- mension to understand the underlying structure of the publication network; and how it can be a rst step for visuali- zing the content and discovering meaningful trends. Hence, this paper presents a detailed description of how unveiling the research topics was utilized as the rst step towards SNA on the database publication graph. We complement this approach by analyzing relevance and the role of self-citations. 31st GI-Workshop on Foundations of Databases (Grundlagen von Daten- banken), 11.06.2019 - 14.06.2019, Saarburg, Germany. The rest of the paper is structured as follows: The relevant basic Copyright is held by the author/owner(s). background is presented in Sec.2. The SNA carried out with the 1 Entire Research Citation Graph Codd's World ... ... A Relational Model of Data for CITES AUTHORED E.F. Codd Large Shared Data Banks ... Are Databases Fit ... for Hybrid Workloads on ... GPUs? M. Pinnecke G.D. Campero D. Broneske G. Saake ... on certain probabilities and helps towards assigning each document with the identied topics. Non-negative Matrix Factorization (NMF) NMF is a linear- algebra optimization algorithm used for dimensionality reducti- Co-Authorship RAW Graph Citations Co-Citations (Collaboration) on and data analysis [18]. NMF factorizes a document-term matrix (i.e., a matrix representing the frequency of terms in dierent documents) into two matrices namely the term-feature and the feature-document, with the property that all the three matrices will have non-negative elements [19]. Bibliographical Heterogeneous Topics Co-Words Coupling Networks 2.3 Relevance Ranking A paper is considered to be most inuential if it is cited mo- Figure 1: Overview of SNA on raw graphs: co author- re often by other inuential papers in the scholarly network [20]. ship/collaboration, citations, co-citations, bibliographical Papers are ranked similar to search engines ranking of web pages, coupling, topics, co-words, and heterogeneity. with the dierence that instead of using the hyperlink network, the citation network formed by the publications is utilized. Conse- quently, this ranking mechanism also helps to determine the most results answering the formulated research questions is detailed in important authors. In SNA, determining the most inuential aut- Section 3. Section 4 concludes the paper discussing the next steps hors based on a ranking mechanism on a collaboration network is in this research direction. also possible. Ranking mechanisms based on the page rank algorithm [21], considering collaboration and citation networks were 2. BACKGROUND utilized for our analysis. We used 20 iterations, with a damping factor of 0.85. In this section we provide background on concepts and terms used in this work. We start with SNA (Section 2.1), continue with 2.4 Self-Citation Detection topic models (Section 2.2), relevance ranking (Section 2.3) and end A self-citation occurs when a cited publication shares at least with detection of self citations (Section 2.4). one common author with the publication that cites it [22]. Self- 2.1 Scholarly Network Analysis citations boost a paper’s citation count for a paper. Self-citations that are introduced in a new research paper to indicate an extension Figure1 shows various aspects of SNA depicting the seven types to the author’s previous work is considered valid. However, unvei- of networks usually considered for SNA on RAW graphs. Further ling these semantic self-citations would require exhaustive proces- elaborating on their functionality, Citation and co-citation net- sing. In SNA, understanding these self-citation counts for a paper work analysis is used to nd relationships between cited papers uncovers information on whether the authors have attempted to and a set of papers which cite those papers. Moreover, citation have a global information coverage on the topic. Considering this analysis can be employed in community detection which is one aspect, in our work we study if self-citations have an impact on of the fundamental tasks of network analysis [11]. Bibliographic the relevance and citation counts of the most relevant papers. coupling also employs citation analysis to link documents which reference a common third cited work in their bibliography [12]. Co-authorship network analysis can be used to nd scientic col- 3. SNA ON CODD’S WORLD laboration between authors [13] depicting how individual scienti- In this section, we highlight important aspects of the SNA per- c ideas of authors can get together through collaboration to cause formed on Codd’s World. We start with the illustration of how an explosion of scientic ndings. Through co-word analysis [14], Codd’s World was created, further describe the formulated rese- it is possible to identify the relationships between subjects in the arch questions, highlight the tooling framework and then present specic eld of research through nding the co-occurrence of key- the detailed output evaluation. words which helps to examine the development of science in specic areas. Discovering citation evolution over time or future ci- 3.1 Creation of Codd’s World tation through coupling co-authorship and citation networks can We choose the bibliographic database for computer science be considered as a task of heterogeneous network analysis [15]. DBLP [8] as the main source of information with the related Pa- per Abstracts curated from the Microsoft Academic Graph [23].

Load more