applied sciences

Article Co-Occurrence Network of High-Frequency Words in the Literature: Structural Characteristics and Evolution

Taoying Li *, Jie Bai, Xue Yang, Qianyu Liu and Yan Chen School of Maritime Economics and Management, Dalian Maritime University, Dalian 116026, China; [email protected] (J.B.); [email protected] (X.Y.); [email protected] (Q.L.); [email protected] (Y.C.) * Correspondence: [email protected]; Tel.: +86-155-6680-2152

 Received: 11 September 2018; Accepted: 19 October 2018; Published: 20 October 2018 

Featured Application: Mining valuable knowledge in the literature helps to save time for readers to understand the content and direction of the literature quickly.

Abstract: The subjects of literature are the direct expression of the author’s research results. Mining valuable knowledge helps to save time for the readers to understand the content and direction of the literature quickly. Therefore, the co-occurrence network of high-frequency words in the bioinformatics literature and its structural characteristics and evolution were analysed in this paper. First, 242,891 articles from 47 top bioinformatics periodicals were chosen as the object of the study. Second, the co-occurrence relationship among high-frequency words of these articles was analysed by word segmentation and high-frequency word selection. Then, a co-occurrence network of high-frequency words in bioinformatics literature was built. Finally, the conclusions were drawn by analysing its structural characteristics and evolution. The results showed that the co-occurrence network of high-frequency words in the bioinformatics literature was a small-world network with scale-free distribution, rich-club phenomenon and disassortative characteristics. At the same time, the high-frequency words used by authors changed little in 2–3 years but varied greatly in four years because of the influence of the state-of-the-art technology.

Keywords: bioinformatics literature; ; high-frequency word; structural characteristics

1. Introduction Biotechnology science has led to biological information research in twenty-first century. Biological information improves the level of biological intelligence with the help of information technology. A large number of related articles have been published in this field as it is one of the most concerned research areas. Biological research results and findings are recorded in various forms of literature [1]. More than 1 million articles are included in PubMed each year since 2011, which can be accessed by searching on the website (https://www.ncbi.nlm.nih.gov/pubmed). Consequently, how to quickly collect the literature and use knowledge discovery and methods to identify future research hotspots has become an urgent issue in scientific investigations. For example, many countries carry a high cancer burden, and comprehensive cancer nursing has become increasingly complicated and difficult [1]. Segregating the vast number of existing articles will help to identify the cause and effect of cancer and achieve the goal of preventing cancer. The theory of complex networks has been widely applied in many fields [2–20]. First, a is its most classical application. For instance, the relationship among people or networks

Appl. Sci. 2018, 8, 1994; doi:10.3390/app8101994 www.mdpi.com/journal/applsci Appl. Sci. 2018, 8, 1994 2 of 14 by on-line detection can be found [2–5], the impact of disaster on social networks [6], the trade-off between social utility and economic performance [7], the intention to migrate [8,9], and unifying aging and frailty [10]. Second, it is also used in other networks [11–13], for example, traffic networks [11,12] and metro networks [13]. Co-word analysis is a direction for applying complex networks in text mining [14]. Many articles have reported on the application of co-word analysis to the biological literature. For example, a text and co-word matrix composed of 40 high-frequency words and 2945 articles was constructed and the proposed cancer immunotherapy made use of the immune system for treating cancer [14]. At the same time, complex networks, such as knowledge network, social networks, shipping network, traffic network, have been widely used in daily life, analysing the structural characteristics of the networks, and providing inspiration for work and life. Zarandi [15] proposed a novel Community Detection based on Structural Similarity executed in two consecutive phases. Ilhan˙ [16] proposed a novel framework that examined various structural features of the network and detected the most prominent subset of community features to predict the future direction of community evolution. Zhao [17] featured several significant findings. The evolution of the knowledge flow network of a strategic alliance could produce a bifurcation phenomenon composed of saddle-node bifurcation and trans-critical bifurcation. Knowledge-embedded resource allocation was the most effective in improving the knowledge flow rate of networks and could further supply ample impetus for evolution. The aforementioned findings were beneficial for understanding the key problems of each resource allocation model and the evolution of strategic alliance in knowledge flow networks. Li [18] investigated that the evolution of the network structure in TFP-glass with increasing temperature to explore the plausible mechanism. The dissociation of the network structure in TFP-glass, which could be observed in all TFP-glass samples synthesized under different conditions, was believed to be the reason for the temperature-dependency of the interfacial interaction. The mechanism underlying the dissociation was thoroughly investigated using two-dimensional infrared spectroscopy, dynamic rheological analysis and XRD. Yang [19] mainly explored the discussion network and its structural evolution based on an empirical study of a famous online discussion that happened in China in 2008 and found that the scale growth of the network had an S shape, the represented the power law in the first halfway, and the network showed a degree of disassortative characteristic. Wang [20] referred to the analysis ideas and methods of complex networks, used the standardized Laplace matrix and K-means clustering method to divide the gene regulatory network into multiple communities, and demonstrated the gene interactions within each community and among communities. The map of scientific knowledge is one of the hottest research methods in the field of international scientific metrology. It combines the citation analysis and technology in scientific metrology to realize the effective use of information and further generate new knowledge. The relationship of co-occurrence is one of the most important aspects in the map of scientific knowledge. It has already been widely used in text mining [21], [22], environmental analysis [23], and so forth. It is also applied in the field of biology to solve all kinds of related problems. For example, Kamneva [24] predicted co-occurrence between reference genomes from two 16S-based ecological datasets. Wang, et al. [25,26] analysed the Protein Domain Co-occurrence Network for predicting protein and domain functions. Li, et al. [27] showed that the signatures of ARG and MRG co-occurrence were much more frequent and the co-occurrence structures in the habitat divisions were significantly different, which could be attributed to their distinct gene transfer potentials. However, a few studies focused on the hot or high-frequency words in the biological literature. In this study, the co-occurrence word method (one of the most popular methods for the scientific knowledge map) was mainly adopted to analyse 242,891 articles from 47 top bioinformatics journals during 2013–2018. The hot words of the related subjects and contents in the field of biological information could be accurately expressed and identified, the relationship among the subjects could be analysed, and then the co-occurrence network of high-frequency words could be constructed using the words and terms appearing in the same-subject articles. Subsequently, the structural characteristics and Appl. Sci. 2018, 8, 1994 3 of 14 evolution rules of the network were analysed, and the hot research trends in the field of bioinformatics were concluded. Python was used to accomplish the experimental process and Gephi (an open-source and free software on https://gephi.org/)to achieve the visualization of results.

2. Methods

2.1. Datasets Journals on bioinformatics are so vast in number that analysing all of them was difficult. Therefore, only those journals that included Mathematical and , Biochemistry and Molecular Biology, Biotechnology and Applied Microbiology, and Multidisciplinary Science were chosen in this study from PubMed and Letpub. Finally, 47 journals were selected, and their names, IF, rank, areas and publishers are given in Table1. Appl. Sci. 2018, 8, 1994 4 of 14

Table 1. Periodical list.

No Journal Name IF (5 Year) Rank Area(s) Press 1 for Molecular Biology 1.617 JCR4 Mathematical and Computational Biology Biomed Central Ltd. 2 Analytical Biochemistry 2.160 JCR3 Biochemistry and Molecular Biology Elsevier Sci Ltd. 3 Bioinformatics 8.561 JCR2 Mathematical and Computational Biology Oxford Univ Press 4 Biosystems 1.460 JCR4 Mathematical and Computational Biology Elsevier Sci Ltd. 5 BMC Bioinformatics 3.114 JCR3 Mathematical and Computational Biology Biomed Central Ltd. 6 BMC Biology 7.436 JCR1 Biology Biomed Central Ltd. 7 BMC Genomics 4.257 JCR2 Biotechnology and Applied Microbiology Biomed Central Ltd. 8 BMC Systems Biology 2.505 JCR3 Mathematical and Computational Biology Biomed Central Ltd. 9 Briefings in Bioinformatics 7.065 JCR1 Biochemical Research Methods Oxford Univ Press 10 Bulletin of Mathematical Biology 1.536 JCR4 Biology Springer 11 Computational Biology and Chemistry 1.345 JCR4 Biology Elsevier Sci Ltd. 12 Computers in Biology and Medicine 2.168 JCR3 Engineering, Biomedical Elsevier Science Bv Science Ltd. 13 EURASIP Journal on Bioinformatics and Systems Biology Mathematical and Computational Biology Springer Heidelberg 14 Journal of Biomedical Semantics 1.883 JCR3 Mathematical and Computational Biology Springer Nature 15 Gene 3.286 JCR3 Genetics and Heredity Elsevier Science Bv 16 Genome Biology 16.497 JCR1 Biotechnology and Applied Microbiology Biomed Central Ltd. 17 IEEE/ACM Transactions on Computational Biology and Bioinformatics 2.064 JCR3 Engineering IEEE Computer Soc 18 IET Systems Biology 0.972 JCR4 Mathematical and Computational Biology Inst Engineering Technology 19 In Silico Biology Biochemistry IOS Press 20 International Journal of Data Mining and Bioinformatics 0.585 JCR4 Mathematical and Computational Biology Inderscience Enterprises Ltd. 21 Chemical Biology and Drug Design 2.404 JCR3 Biochemistry and Molecular Biology Wiley-Blackwell Publishing 22 Acta Biotheoretica 0.907 JCR4 Mathematical and Computational Biology Springer 23 International Journal of Functional Informatics and Personalized Medicine Biomedical Sciences Inderscience Enterprises Ltd. 24 International Journal of Molecular Sciences 3.878 JCR3 Biochemistry and Molecular Biology Mdpi 25 Journal of Bioinformatics and Computational Biology 0.959 JCR4 Mathematical and Computational Biology World Scientific Publishing Co Pte Ltd. 26 Journal of Biological Systems 0.686 JCR4 Mathematical and Computational Biology World Scientific Publishing Co Pte Ltd. 27 Journal of Biomedical Informatics 3.120 JCR3 Medical Informatics Academic Press Inc Elsevier Science 28 Journal of Biomolecular Structure and Dynamics 2.443 JCR3 Biochemistry and Molecular Biology Adenine Press 29 Journal of Computational Biology 3.118 JCR4 Mathematical and Computational Biology Mary Ann Liebert Inc 30 Journal of Computational Neuroscience 1.763 JCR4 Mathematical and Computational Biology Springer 31 Journal of Integrative Bioinformatics Biomedicine And Biotechnology Imbio Association 32 Journal of Theoretical Biology 1.980 JCR3 Mathematical and Computational Biology Elsevier Science Ltd. 33 Mathematical Biosciences 1.617 JCR4 Mathematical and Computational Biology Elsevier Science Inc 34 Mathematical Biosciences and Engineering 1.260 JCR4 Mathematical and Computational Biology Amer Inst Mathematical Sciences 35 Methods 3.936 JCR2 Biochemistry and Molecular Biology Academic Press Inc Elsevier Science 36 Molecular Biosystems 2.838 JCR3 Biochemistry and Molecular Biology Royal Soc Chemistry 37 Nature Communications 13.691 JCR1 Multidisciplinary Sciences Nature Publishing Group 38 Nucleic Acids Research 10.235 JCR1 Biochemistry and Molecular Biology Oxford Univ Press 39 Online Journal of Bioinformatics Computational Biology Online Journal Of Bioinformatics 40 PeerJ 2.469 JCR3 Multidisciplinary Sciences Peerj, Inc. 41 PLoS Computational Biology 4.834 JCR2 Mathematical and Computational Biology Public Science 42 Plos One 3.352 JCR3 Multidisciplinary Sciences Public Library Science 43 Protein and peptide letters 1.052 JCR4 Biochemistry and Molecular Biology Bentham Science Publ Ltd. 44 Scientific Reports 4.609 JCR3 Multidisciplinary Sciences Springer Nature 45 Source Code for Biology and Medicine Bioinformatics Springer Nature 46 StaProteins: Structure, Function and Bioinformatics 2.328 JCR3 Biochemistry and Molecular Biology Wiley-Liss 47 Statistical Applications in Genetics and Molecular Biology 1.104 JCR4 Mathematical and Computational Biology De Gruyter Appl. Sci. 2018, 8, 1994 5 of 14

Appl. Sci.A 2018 total, 8, ofx FOR 242,891 PEER articlesREVIEW were chosen from these journals during the last five years (2013–2018)5 of 14 to identify the recent research hotspots in the bioinformatics literature. A co-occurrence network of bioinformaticsbioinformatics high high-frequency-frequency words basedbased on on these these 242,891 242,891 articles articles was wasconstructed, constructed and,its and process its processis given is ingiven Figure in 1Figure. 1.

Document.csv Segment Words Obtain the top N words Count word frequency Function words.txt Remove function words with highest frequency

Calculate the co-occurrence obtain the top E edges with Input E edges into Network matrix of top N words highest frequency of top N words Gephi graph

FigureFigure 1. 1.ProcessProcess of ofconstructing constructing the the co co-occurrence-occurrence network network of of high high-frequency-frequency words words..

2.2. Word Segmentation and High-Frequency Words 2.2. Word Segmentation and High-Frequency Words AnAn article article usually usually comprises comprises several several thousand thousand words words (as (as in in this this study study),), and and the the efficiency efficiency of of analysinganalysing words words of of all all 242,891 242,891 articles articles was was quite quite slow. slow. Therefore, Therefore, the the subjects subjects,, including including the the title, title, summary,summary, and and key key words words of of the the article article,, were were analysed analysed instead instead of of the the full full text text of of the the article article.. Then Then,, subjectssubjects were were split split into into words words by by space space and and the the sequence sequence of of words words was was recorded recorded to to express express the the articlearticle.. Next, Next, the the function function words, words, such such as asarticles, articles, prepositions, prepositions, conjunctions conjunctions and other and words other wordswithout withoutpractical practical significance significance were removed. were removed The pseudocode. The pseudocode of this sectionof this issection shown is in shown Figures in A1Figure–A3 sin AppendixA. ci presents the ith word and ni is its frequency. Then, the of ci appearing A1–A3 in Appendix A. ci presents the ith word and ni is its frequency. Then, the probability of ci in the subjects of the whole N articles was calculated using p = n /N. Next, words were sorted by appearing in the subjects of the whole N articles was calculatedi usingi pi = ni/N. Next, words were frequency (or probability) in a descending order, which ensured n ≥ n for ∀i < j (equivalent to sorted by frequency (or probability) in a descending order, whichi ensurej d ni ≥ nj for i < j p ≥ p ). The number of high-frequency word K was set, which meant that the top K words were K (equivalenti j to pi ≥ pj). The number of high-frequency word K was set, which meant that the top K wordshigh-frequency were K high words.-frequency words. 2.3. High-Frequency Word Co-Occurrence Matrix and Co-Occurrence Network 2.3. High-Frequency Word Co-Occurrence Matrix and Co-Occurrence Network Let eij be the number of any two high-frequency words ci and cj in the same subject among all N Let eij be the number of any two high-frequency words ci and cj in the same subject among all N article subjects. The co-occurrence relationship between any two high-frequency words was expressed article subjects. The co-occurrence relationship between any two high-frequency words was by mutual information in (pseudocode shown in Figure A4), describing the degree expressed by mutual information in information theory (pseudocode shown in Figure A4), of association between these two words. The following formula was used for calculation (1). describing the degree of association between these two words. The following formula was used for calculation (1). Pi,j Ii,j= log2 (1) PPiiP,jj Ii,j = log2 (1) PiPj where Pi,j represents the probability of co-occurrence of ci and cj, Pi indicates the probability of whoccurrenceere Pi,j represents of ci, and thePj probabilityindicates the of probability co-occurrence of occurrence of ci and cj of, Pci j.indicates The larger the the probability value of ofIi ,j, occurrencethe greater of the ci, co-occurrence and Pj indicates degree the of probabilityci and cj. The of matrixoccurrence (Ii,j)K of× cKj. isThe a high-frequency larger the value co-occurrence of Ii,j, the greatermatrix the (considering co-occurrence the symmetricdegree of c relation,i and cj. TheIi,j =matrixIj,i, and (Ii the,j)K matrix× K is a ( Ihighi,j)K-×frequencyK can also co- beoccurrence expressed as a triangular or lower triangular matrix). matrix (considering the symmetric relation, Ii,j = Ij,i, and the matrix (Ii,j)K × K can also be expressed as a triangularThe main or reason lower fortriangular choosing matrix). mutual information instead of selecting the number of frequent wordsThe couldmain bereason analysed for choosing by the followingmutual information process, assuming instead of 10,000 selecting articles, the number and the frequencyof frequent of wordshigh-frequency could be analysed words c1 andby thec2 asfollowingn1 = 8000 process (p1 = 0.8),, assumingn2 = 7000 10 (p,0002 = 0.7),articles respectively., and the Thefrequency number of of co-occurrence of c and c was 5000 (p = 0.5), and mutual information was I = −0.36. The frequency high-frequency words1 c1 and2 c2 as n1 = 1,28000 (p1 = 0.8), n2 = 7000 (p2 = 0.7), respectively.1,2 The number of of high-frequency words c and c was n = 5000 (p = 0.5), n = 5000 (p = 0.5), respectively, the number co-occurrence of c1 and c2 was3 50004 (p1,2 = 30.5), and mutual3 information4 4was I1,2 = −0.36. The frequency c c p I ofof high co-occurrence-frequency ofwords3 and c3 and4 was c4 was 4500 n (3 =3,4 5000= 0.45), (p3 =and 0.5), the n4 = mutual 5000 (p information4 = 0.5), respectively, is 3,4 = 0.85. the Althoughnumber c c c c ofthe co- co-occurrenceoccurrence of ofc3 and3 and c4 was4 was 4500 relatively (p3,4 = 0.45), small, and3 andthe mutual4 almost information always appeared is I3,4 = at0.85. the Although same time. theTherefore, co-occurrence they wereof c3 and considered c4 was relatively to be in a small, co-occurrence c3 and c4 almost relationship. always Additionally, appeared at the the same association time. Therefore,rules [28] they were were algo usedconsidered to obtain to be the in co-occurrence a co-occurrence relationship relationship among. Additionally high-frequency, the association words. rules [28] were algo used to obtain the co-occurrence relationship among high-frequency words.

2.4. Average Path Length The average path length of a network is the average value of the shortest path length between any two nodes in the network, which was calculated using Equation (2): Appl. Sci. 2018, 8, 1994 6 of 14

2.4. Average Path Length The average path length of a network is the average value of the shortest path length between any two nodes in the network, which was calculated using Equation (2):

∑i6=j dij L = (2) N(N − 1) where dij is the number of edges between high-frequency word nodes i and j. The clustering coefficient of the network is the average of clustering coefficient of all nodes in the network defined as follows:

1 Ni C = Σi (3) N ki(k i − 1)/2 where ki is the degree of node i, and Ni is the number of edges among ki neighbour nodes.

2.4.1. Rich-Club Coefficient The rich-club coefficient is defined as follows:

2E>k ϕ(k) =  (4) N>k(N >k − 1  where N>k(N >k − 1 /2 represents the maximum possible number of edges among nodes with degree more than k.

2.4.2. Neighbour Average Degree The neighbour average degree of node i was calculated using Equation (5):

1 ( ) = knn i ΣjeNi kj (5) ki where Ni is the set of neighbours of node i. The average degree of neighbours of these nodes with same degree k was statistically averaged, which was the number of nodes in the network with moderate k. If the value of knn(k) increased with the increase of k, high-connectivity nodes were easy to connect with other high-connectivity nodes and the network was assortative network. Vice versa, if the value of knn(k) increased with the decrease of k, the network performance was the disassortative network.

3. Results and Discussion

3.1. Co-Occurrence Network of High-Frequency Words in the Bioinformatics Literature According to the co-occurrence matrix and the arrangement of mutual information in a descending order, the threshold of the co-occurrence of high-frequency words was the value of E, and the top E co-occurrence of high-frequency words was considered as the number of edges. The high-frequency words related to these E edges were nodes, and the network formed by these high-frequency words and edges was the co-occurrence network of high-frequency words. In this study, without loss of generality, K and E were chosen to be 500 and 200, respectively. The reasons were as follows: (1) If the value of K was increased and the value of E (e.g., K = 1000, E = 200) was fixed, the co-occurrence network remained unchanged; (2) If the value of E was increased and the value of K (e.g., K = 500, E = 500) was fixed, the newly added nodes had little influence on the structural characteristics of the network; (3) If the values of both K and E (e.g., K = 1000, E = 500) were increased, the nodes and their edges were too many to clearly display in the network (shown in Figure A5). Therefore, the 500 high-frequency words with the most frequent occurrences were chosen. The maximum 200 mutual information among these 500 high-frequency words were 200 edges, and the high-frequency words Appl. Sci. 2018, 8, 1994 7 of 14 related to these 200 edges were nodes. Finally, the co-occurrence network of high-frequency words Appl.was Sci. obtained, 2018, 8, x as FOR shown PEER REVIEW in Figure 2. 7 of 14

FigureFigure 2 2.. CCo-occurrenceo-occurrence networknetwork ofof high-frequency high-frequency words words within within the the bioinformatics bioinformatics literature literature(K = ( 500,K = 500,E = 200).E = 200).

In Figure2, the size of the node indicates the value of the node degree, which means the number In Figure 2, the size of the node indicates the value of the node degree, which means the of neighbor nodes connected to this node directly; only 50 nodes were from the top 200 edges among number of neighbor nodes connected to this node directly; only 50 nodes were from the top 200 edgesthe top among 500 high-frequency the top 500 high words.-frequency words. The topological structure of the co-occurrence network of the high-frequency words in The topological structure of the co-occurrence network of the high-frequency words in bioinformaticsbioinformatics articles articles is is shown shown in in Fig Figureure 22.. In In this this network, network, a a node node represents represents a a high-frequencyhigh-frequency word in all bioinformatics articles articles.. An An edge edge represents the co co-occurrence-occurrence relationship between two two highhigh-frequency-frequency words appearing inin thethe subjectsubject ofof thethe same same article article simultaneously. simultaneously. The The node node degree degree is isone one of of the the key key indicators indicators to measure to measure the node’sthe node’s importance importance in the in network. the network. Nodes Nodes with a with large a degree large degreeare often are considered often considered as high-connectivity as high-connectivity nodes ornodes hub nodes.or hub nodes. N Generally,Generally, Nis assumedis assumed to be to the be number the number of nodes of in the nodes co-occurrence in the co network-occurrence of high-frequency network of highwords,-frequency and the co-occurrencewords, and the relationship co-occurrence of the relationship high-frequency of the wordshigh-frequency as a binary word adjacencys as a binary matrix N N i j adjacencyA( , ). If matrix a co-occurrence A(N,N). If relationship a co-occurrence exists relationship between two exists high-frequency between two words high-frequencyand , the valuewords of i element aij is 1, otherwise its value is 0. A(N,N) is a symmetric matrix and can be used to calculate the and j, the value of element aij is 1, otherwise its value is 0. A(N,N) is a symmetric matrix and can be usedstructural to calculate characteristics the structural of the network, characteristics such as theof the shortest network path,, such network as the density, shortest degree path, distribution, network density,clustering degree coefficient, distribution, community clustering structure, coefficient, rich club, community matching structure, form, and rich so forth. club, matching form, and3.2. so Small-World forth. Network Characteristics

3.2. SmallMany-World real-world Network networksCharacteristics exhibit the structural characteristics of small-world network. Compared with the same-scale random network, it has a similar average path length and higher Many real-world networks exhibit the structural characteristics of small-world network. clustering coefficient [29]. According to the aforementioned, the number N of nodes in the Compared with the same-scale random network, it has a similar average path length and higher co-occurrence network of high-frequency words in bioinformatics literature in Figure2 was 50. clustering coefficient [29]. According to the aforementioned, the number N of nodes in the The average path length of the network was 1.9 and the clustering coefficient was 0.363. Compared co-occurrence network of high-frequency words in bioinformatics literature in Figure 2 was 50. The with the corresponding random networks, the co-occurrence network of the high-frequency words of average path length of the network was 1.9 and the clustering coefficient was 0.363. Compared with the bioinformatics literature had the same level of average path length and higher level of clustering the corresponding random networks, the co-occurrence network of the high-frequency words of the coefficient, implying a clear small-world phenomenon. The results showed that any two high-frequency bioinformatics literature had the same level of average path length and higher level of clustering words of bioinformatics literature were connected at most by another high-frequency word. More than coefficient, implying a clear small-world phenomenon. The results showed that any two high-frequency words of bioinformatics literature were connected at most by another high-frequency word. More than half of high-frequency words had a direct co-occurrence relationship with each other, indicating a clear co-occurrence relationship among high-frequency words of the bioinformatics literature. Appl. Sci. 2018, 8, 1994 8 of 14

Appl. Sci. 2018, 8, x FOR PEER REVIEW 8 of 14 half of high-frequency words had a direct co-occurrence relationship with each other, indicating a clear co-occurrenceAccording relationship to the small among-world high-frequency characteristics wordsof the ofco the-occurrence bioinformatics network literature. of high-frequency wordsAccording in the bioinformatics to the small-world literature characteristics, any two high of the-frequency co-occurrence words network have a of direct high-frequency or indirect cowords-occurrence in the bioinformatics relationship. The literature, information any twoin Fig high-frequencyure 2 can provide words readers have or a direct researchers or indirect with certainco-occurrence searching relationship. suggestions. The For information example, in if Figure a researcher2 can provide wants readersto query or the researchers literature with related certain to “searchinggenetics” suggestions., the platform For example, should also if a researcherautomatically wants recommend to query the the literature literature related related to “genetics”, to the highthe platform-frequency should words also “protein automatically” and “ recommendmetabolism” the which literature have relateda direct to co the-occurrence high-frequency relationship words “protein”with “genetics and “metabolism””. which have a direct co-occurrence relationship with “genetics”.

3.33.3.. Degree Degree Distribution Distribution Characteristics The degree distribution isis oneone ofof thethe mostmost important important indicators indicators for for describing describing the the characteristics characteristics of ofthe the complex complex network network structure. structure. In the In existingthe existing literature, literatureP(k), (the P(k) distribution (the distribution function function of node of degree) node degreeor P(≥)k )or (that P(≥ ofk) the(that cumulative of the cumulative degree) was degree used) to was describe used the to degree describe distribution the degree characteristics distribution characteristicsof nodes. The formerof nodes.P(k The) is theformer ratio P of(k the) is numberthe ratio of of nodes the number with degree of nodesk in thewith complex degree networkk in the complexto the number network of to total the nodes.number The of total latter nodes.P(≥k )The is the latter ratio P(≥ ofk) theis the number ratio of of the nodes number with of degrees nodes withgreater degrees than or greater equal tothank in or the equal complex to k networkin the complex to the number network of to total the nodes. number Empirical of total studiesnodes. Empiricalshow that studies a large show number that ofa large real-world number complex of real- networksworld complex are characterized networks are by characterized three types by of threedegree types distribution of degree of nodes: distribution scale-free of properties, nodes: scale wide-scale-free properties, properties, wide and- single-scalescale properties, properties. and singleThe cumulative-scale properties. distribution The cumulative was used in distribution this study to was describe used the in thisdegree study distribution to describe characteristics the degree distributionof the co-occurrence characteristics network of of high-frequency the co-occurrence words innetwork the bioinformatics of high-frequency literature. Theword cumulatives in the bioinformaticsdistribution of literature network in. The Figure cumulative2 is shown distribution in Figure3 of. network in Figure 2 is shown in Figure 3.

FigureFigure 3 3.. CumulativeCumulative distribution distribution of of the the co co-occurrence-occurrence network of high high-frequency-frequency words words..

Figure 33 showsshows thatthat thethe cumulativecumulative degreedegree distributiondistribution curvecurve seemsseems toto declinedecline fasterfaster atat thethe firstfirst stage andand slowerslower at at the the second second stage stage with with the the increasing increasing of the of degreethe degreek, indicating k, indicating that the that node the degree node degreeof the network of the network was scale-free. was scale The-free. scale-free The scale characteristics-free characteristics showed show thated the that connectivity the connectivity of a small of a smallnumber number of nodes of nodes in the in network the network were were quite quite large large (with (with a large a large number number of connections), of connections), which which had hada leading a leading role inrole the in operation the operation of the of network the network while while most ofmost the nodesof the hadnodes small had connections small connections (only a (onlysmall a number small number of connections). of connections).

3.43.4.. Rich Rich-Club-Club Phenomenon Phenomenon Characteristics The rich rich-club-club phenomenon refers refers to the close connection between the more connected nodes ((hubhub nodes)nodes) in in the the network network and and the the formation formation of a core of a team core in team the network, in the network, which can which be measured can be using the rich-club coefficient ϕ(k) [30]. E denotes the number of connections among nodes whose measured using the rich-club coefficient >k(k) [30]. E>k denotes the number of connections among nodesdegrees whose are larger degree thans arek inlarger the network. than k in the network. The rich-clubrich-club coefficient coefficient of the of co-occurrence the co-occurrence network networkof high-frequency of high words-frequency in bioinformatics words in bioinformaticsliterature is shown literature in Figure is shown4. The in coefficient Figure 4. increasedThe coefficient with theincreased increase with in the the nodeincrease degree in thek, nodeimplying degree that k, the implying connection that degreethe connection among hub degree nodes among was largerhub nodes than thatwas amonglarger than other that nodes, among and otherformed nodes, a rich and club. formed At the a samerich club. time, At it showedthe same that time, the it nodesshowed with that degree the nodes greater with than degree 10 formed greater a thanfully 10 connected formed a graph. fully connected The rich-club graph. phenomenon The rich-club of the phenomenon network showed of the thatnetwork the words showed in thethat club the words in the club ere the core of the network, which controlled the composition of the high-frequency word nodes in the whole network. Appl. Sci. 2018, 8, 1994 9 of 14 ere the core of the network, which controlled the composition of the high-frequency word nodes in the Appl.whole Sci. network. 2018,, 8,, xx FORFOR PEERPEER REVIEWREVIEW 9 of 14

Figure 4. Rich-club coefficient. Figure 4 4.. RichRich-club-club coefficient. coefficient.

3.53.5... Matching form Characteristics The matching form described thethe relationship relationship between between thethe node degree and the the neighbour node degree degree of of thethe the network network [31 [31–33–33]. ].Figure Figure 5 5shows shows the the relationship relationship between between thethe thenode node degree degree and theandthe neighbour the neighbour node node degree degree of ofthe the co- co-occurrenceoccurrence network network of of high high-frequency-frequency words words in in the the bioinformatics literatureliterature literature... The The network network was was a mixed a mixed network. network. The Theresult result showed showed thatthat thethe that nodenode the nodewithwith highwith high connectivity connectivity was waseasy easy to connect to connect with with thethe the node node node with with with low low low connectivity connectivity connectivity in in in the the the co coco-occurrence-occurrence network of of high high-frequency-frequency words words in in thethe the bioinformatics bioinformatics literatureliterature literature... Meanwhile, Meanwhile, this thisshowed showed thatthat new that wordsnew words tended tended toto connectconnect to connect wordswords words withwith withhighhigh highconnectivityconnectivity connectivity inin thethe in processprocess the process ofof networknetwork of network generationgeneration generation andand evolution.and evolution.

FigureFigure 5 5... AverageAverage degree of the neighbours neighbours... 3.6. Evolution of the Co-Occurrence Network of High-Frequency Words in the Bioinformatics Literature 3.6.. Evolution of the Co-Occurrence Network of High-Frequency Words inin the Bioinformatics Literature The use of high-frequency words in the existing 242,891 articles divided by year, including 41,457 The use of high-frequency words in the existing 242,891 articles divided by year, including articles in 2013, 42,049 articles in 2014, 43,114 articles in 2015, 45,257 articles in 2016, 60,404 articles in 41,,457 articles in 2013, 42,,049 articles in 2014, 43,,114 articles in 2015, 45,,257 articles in 2016, 60,,404 2017, and 10,610 articles in 2018, was analysed to trace the changing trend of high-frequency words in articles in 2017,, and 10,,610 articles in 2018, was analysed toto trtrace thethe changing changing trend trend of of the bioinformatics literature. Setting K = 500, and E = 200, the co-occurrence network of high-frequency high-frequency words in thethe bioinformatics literatureliterature.. Settingting K = 500, and E = 200, the words in the bioinformatics literature from 2013 to 2018 was obtained, as shown in Figure6. co-occurrence network of high-frequency words inin thethe bioinformatics literatureliterature from 2013 to 2018 was obtained,, as shown in Figure 6.. Figure 6 showed thatthat thethe changeschanges inin twotwo consecutiveconsecutive yearsyears were relatively small.. ForFor example, a littlelittle change change was observed inin terms terms of of high-frequency words in thethe bioinformatics literatureliterature during 2013 and 2014,, andand hubhub nodesnodes inin bothboth yearsyears were “metabolism”, “study”, “genetic”,, and so forth.. However,, from from 2015 toto 2016, the word “analysis” became increasinglyly importanimportantt.. The analysis of the reasons indicatedindicated thatthat the the application application of of big big data data to to solve solve practical practical problems problems has become more common, which was well reflected in thethe bioinformatics research and literatureliterature.. FromFrom 2013 to 2015, threethree most important nodes were identified.. However, However, therethere was was only only one one most importantimportant node node during 2016–2018, implyingimplying thatthat,, recently,, more attention was paid to academic research and analysis in thethe bioinformatics literatureliterature insteadinstead of of focusing focusing on on metabolism metabolism and and genetics. Appl. Sci. 2018, 8, 1994 10 of 14 Appl. Sci. 2018, 8, x FOR PEER REVIEW 10 of 14

(a) (b)

(c) (d)

(e) (f)

FigureFigure 6. 6Changing. Changing trends trends of the of co-occurrence the co-occurrence network network of high-frequency of high- wordsfrequency in the word bioinformaticss in the literaturebioinformatics from 2013 literature to 2018 from with 2013K = 500, to 2018E = 200. withFigure K = 500,6 contained E = 200. sixFig graphs,ure 6 contained which were six listedgraphs by, year:which (a )were 2013, listed (b) 2014, by year: (c) 2015, (a) 2013, (d) 2016,(b) 2014, (e) 2017, (c) 2015, and ( (df)) 2016, 2018. (e) 2017, and (f) 2018. Appl. Sci. 2018, 8, 1994 11 of 14

Figure6 showed that the changes in two consecutive years were relatively small. For example, a little change was observed in terms of high-frequency words in the bioinformatics literature during 2013 and 2014, and hub nodes in both years were “metabolism”, “study”, “genetic”, and so forth. However, from 2015 to 2016, the word “analysis” became increasingly important. The analysis of the reasons indicated that the application of big data to solve practical problems has become more common, which was well reflected in the bioinformatics research and literature. From 2013 to 2015, three most important nodes were identified. However, there was only one most important node during 2016–2018, implying that, recently, more attention was paid to academic research and analysis in the bioinformatics literature instead of focusing on metabolism and genetics.

4. Conclusions Based on the complex , the association of high-frequency words in the bioinformatics literature was abstracted into a network, indicating that the high-frequency words of subjects in bioinformatics literature were taken as nodes, and the co-occurrence relationships of high-frequency words were selected as edges. The co-occurrence network of high-frequency words in the bioinformatics literature was constructed. Additionally, the structural characteristics and evolution laws of the bioinformatics literature were analysed. The main conclusions were summarized as follows:

• The co-occurrence network of high-frequency words in bioinformatics literature is a small world network. The co-occurrence relationship between any two high-frequency words needed to be transferred at most once, and more than half of the high-frequency words in the bioinformatics literature had direct co-occurrence relationships. • The degree distribution of the co-occurrence network of high-frequency words in the bioinformatics literature was scale-free, and the connectivity of a small number of nodes in the network was large, which had a leading role in the network. On the contrary, the connectivity of most nodes was small, indicating that the factors explored by the authors of the bioinformatics literature were more concentrated. • The co-occurrence network of high-frequency words in the bioinformatics literature had the rich-club phenomenon. The high-frequency words in the club were the core words in the bioinformatics literature and they expressed the author’s attention to the bioinformatics literature. • The co-occurrence network of high-frequency words in the bioinformatics literature had the characteristics of disassortative network. High-connectivity nodes were easily connected to nodes with low connectivity. • The analysis on the evolution of the co-occurrence network of high-frequency words in the bioinformatics literature revealed that the high-frequency words in the bioinformatics literature changed little in 2–3 years. However, the state-of-the-art technology was introduced gradually with time. Consequently, the authors’ wording also changed, such as passion for big data and data analysis.

Author Contributions: Conceptualization: T.L. and J.B.; methodology: T.L.; validation: J.B., X.Y. and Q.L.; formal analysis: T.L.; investigation: Q.L.; resources: J.B.; data curation: T.L.; writing—original draft preparation: T.L.; writing—review and editing: T.L.; visualization: X.Y.; supervision: Y.C.; project administration: T.L.; funding acquisition: T.L. Funding: This research was funded by the National Natural Science Foundation of China, grant number 71271034, the National Social Science Foundation of China, grant number 15CGL031, the Fundamental Research Funds for the Central Universities, grant numbers 3132016306 and 3132018160, the Program for Dalian High Level Talent Innovation Support, grant number 2015R063, the National Natural Science Foundation of Liaoning Province, grant number 20180550307, and the National Scholarship Fund of China for Studying Abroad. Acknowledgments: We are deeply grateful to the anonymous referees and to the editors for their suggestion and help in significantly improving the article. Conflicts of Interest: The authors declare no conflict of interest. Appl. Sci. 2018, 8, 1994 12 of 14

Appl. Sci. 2018, 8, x FOR PEER REVIEW 12 of 14 Appl. Sci. 2018, 8, x FOR PEER REVIEW 12 of 14 AppendixAppl. Sci. 2018 A., 8, x FOR PEER REVIEW 12 of 14 Appl. Sci. 2018, 8, x FOR PEER REVIEW 12 of 14

Figure A1. Pseudocode of segmenting words. Figure A1. Pseudocode of segmenting words. Figure A1. Pseudocode of segmenting words. FigureFigure A1.A1. Pseudocode ofof segmentingsegmenting words.words.

Figure A2. Pseudocode of counting word frequency. Figure A2. Pseudocode of counting word frequency. Figure A A2.2. Pseudocode of counting word frequency frequency.. Figure A2. Pseudocode of counting word frequency.

Figure A3. Pseudocode of obtaining high-frequency words. FigureFigure A3.A3. PseudocodePseudocode ofof obtainingobtaining high-frequencyhigh-frequency words.words. Figure A3. Pseudocode of obtaining high-frequency words. Figure A3. Pseudocode of obtaining high-frequency words.

Figure A4.A4. Pseudocode of obtaining high high-frequency-frequency words.words. Figure A4. Pseudocode of obtaining high-frequency words. Figure A4. Pseudocode of obtaining high-frequency words. Figure A4. Pseudocode of obtaining high-frequency words. Appl. Sci. 2018, 8, 1994 13 of 14 Appl. Sci. 2018, 8, x FOR PEER REVIEW 13 of 14

Figure A5. Co-occurrence network of high-frequency words within the bioinformatics literature Figure A5. Co-occurrence network of high-frequency words within the bioinformatics literature (K = (K = 1000, E = 500). 1000, E = 500).. References References 1. Zhang, X.C.; Huang, D.S.; Li, F. Cancer nursing research output and topics in the first decade of the 21st

1. century:Zhang Results, X.C.; Huang of a bibliometric, D.S.; Li, F and. Cancer co-word nursing cluster research analysis. outputAsian and Pac. topics J. Cancer in the Prev. first2011 decade, 12, 2055–2058. of the 21st [PubMedcentury:] Results of a bibliometric and co-word cluster analysis. Asian Pac. J. Cancer Prev. 2011, 12, 2055– 2. Kendrick,2058. L.; Musial, K.; Gabrys, B. Change point detection in social networks—Critical review with

2. experiments.Kendrick, LComput.; Musial Sci., K Rev..; Gabrys2018, 29, B,. 1–13. Change [CrossRef point] detection in social networks—Critical review with 3. Bahri,experiments. L.; Carminati, Compu B.;t Sci Ferrari,. Rev. 2018 E. Decentralized, 29, 1–13. privacy preserving services for online social networks.

3. OnlineBahri Soc., L.;Netw. Carminati Media, B2018.; Ferrari, 6, 18–25., E. Decentralized [CrossRef] privacy preserving services for online social networks. 4. Bidarta,Online C.;Soc Degenne,. Netw. Media A.; 2018 Grossetti,, 6, 18– M.25. Personal networks typologies: A structural approach. Soc. Netw.

4. 2018Bidarta, 54, 1–11., C.; Degenne [CrossRef, ]A.; Grossetti, M. Personal networks typologies: A structural approach. Soc. Netw. 5. Houston,2018, 54, J.F.; 1–11. Lee, J.; Suntheim, F. Social networks in the global banking sector. J. Account. Econ. 5. 2018Houston, 65, 237–269., J.F.; Lee [CrossRef, J.; Suntheim] , F. Social networks in the global banking sector. J. Account. Econ. 2018, 65, 6. Kim,237 J.;–269. Hastak, M. Social network analysis: Characteristics of online social networks after a disaster. Int. J. 6. Inf.Kim Manag., J.; Hastak2018, 38, ,M 86–96.. Social [CrossRef network] analysis: Characteristics of online social networks after a disaster. 7. Growieca,Intern. J. K.;Inf. Growiec, Manage. J.;2018 Kaminski,, 38, 86–96. B. Social network structure and the trade-off between social utility and 7. economicGrowieca performance., K.; GrowiecSoc., J.; Netw.Kaminski2018,, B55. ,Social 31–46. network [CrossRef structure] and the trade-off between social utility 8. Manchin,and economic M.; Orazbayev, performance. S. Social Soc. Netw networks. 2018 and, 55, the31–46. intention to migrate. World Dev. 2018, 109, 360–374. 8. [CrossRefManchin] , M.; Orazbayev, S. Social networks and the intention to migrate. World Dev. 2018, 109, 360–374. 9. 9. Ye,Ye W., W. The The rich-club rich-club phenomenon phenomenon of China’s of China population’s population flow network flow network during the during country’s the countryspring festival.’s spring Appl.festival. Geogr. Appl2018. Geogr, 96, 77–85.. 2018, 96, 77–85. 10.10. Rutenberg,Rutenberg A.D.;, A.D Mitnitski,.; Mitnitski A.B.;, A.B Farrell,.; Farrell S.G.;, S.G Rockwood,.; Rockwood K., K. Unifying Unifying aging aging and and frailty frailty through through complex complex dynamicaldynamical networks. networks.Exp. Exp Gerontol.. Gerontol2018. 2018, 107, 107, 126–129., 126–129. [CrossRef ][PubMed] 11.11. Shanmukhappa,Shanmukhappa T.;, T Iwh,.; Iwh H.;, H Chi,.; Chi K.T., K.T. Spatial Spatial analysis analysis of of bus bus transport transport networks networks using using network network theory. theory. Phys.Phys A. 2018A 2018, 502, 502, 295–314., 295–314. [CrossRef ] 12.12. Xu,Xu X.;, X Chen,.; Chen A.;, A Jansuwan,.; Jansuwan S.;, S Yang,.; Yang C.;, C Ryu,.; Ryu S., TransportationS. Transportation network network redundancy: redundancy: Complementary Complementary measuresmeasures and and computational computational methods. methods.Transp. Transp Res.. Res Part. Part B 2018B 2018, 114, 114, 68–85., 68–85. [CrossRef ] 13.13. Zhang,Zhang J.;, J Wang,.; Wang S.;, S Wang,.; Wang X., X Comparison. Comparison analysis analysis on on vulnerability vulnerability of of metro metro networks networks based based on on complex complex network.network.Phys. Phys A. A2018 2018, 496, 496, 72–78., 72–78. [CrossRef ] 14.14. Lu,Lu K.;, K.; Yu, Yu, S.; S.; Yu,Yu, M.M. Bibliometric a analysisnalysis of of tumor tumor immunotherapy immunotherapy studies studies.. MedMed.. Sci. Monit Sci.. 2018 Monit., 24, 20183405, 24–,3414 3405–3414.. [CrossRef][PubMed] 15.15. Zarandi,Zarandi F.D.;, F.D Rafsanjani,.; Rafsanjani M.K., M.K. Community Community detection detection in in complex complex networks networks using using structural structural similarity. similarity. Phys.Phys A. 2018A 2018, 503, 503, 882–891., 882–891. [CrossRef ] 16. İlhan, N.; Öğüdücü, Ş.G. Feature identification for predicting community evolution in dynamic social networks. Eng. Eng. Appl. Artif. Intell. 2016, 55, 202–218. Appl. Sci. 2018, 8, 1994 14 of 14

16. Ilhan,˙ N.; Ö˘güdücü,¸S.G.Feature identification for predicting community evolution in dynamic social networks. Eng. Eng. Appl. Artif. Intell. 2016, 55, 202–218. [CrossRef] 17. Zhao, J. Research on the characteristics of evolution in knowledge flow networks of strategic alliance under different resource allocation. Expert Syst. Appl. 2018, 98, 242–256. 18. Li, H. The evolution of the network structure in tin-fluoro-phosphate glass with increasing temperature. J. Non-Cryst. Solids 2018, 492, 84–93. [CrossRef] 19. Yang, Y.; Chen, Q.; Liu, W. The structural evolution of an online discussion network. Phys. A 2010, 389, 5871–5877. [CrossRef] 20. Wang, X.S.; Yang-Yang, G.U.; Cheng, Y.H. Construction of delay gene regulatory network based on complex network. Acta Electron. Sin. 2010, 38, 2518–2522. 21. Van Rijsbergen, C.J. A theoretical basis for the use of co-occurrence data in . J. Doc. 1977, 33, 106–119. [CrossRef] 22. Mika, P. Ontologies are us: A unified model of social networks and semantics. Web Semant: Sci. Serv. Agents Word Wide Web 2007, 5, 522–536. [CrossRef] 23. Barberán, A.; Bates, S.T.; Casamayor, E.O.; Fierer, N. Using network analysis to explore co-occurrence patterns in soil microbial communities. ISME J. 2012, 6, 343–351. [CrossRef][PubMed] 24. Kamneva, O.K. Genome composition and phylogeny of microbes predict their co-occurrence in the environment. PLoS Comput. Biol. 2017, 13, e1005366. [CrossRef][PubMed] 25. Wang, Z.; Cao, R.Z.; Cheng, J.L. Three-level prediction of protein function by combining profile-sequence search, profile-profile search, and domain co-occurrence networks. BMC Bioinf. 2013, 14 (Suppl. 3), S3. [CrossRef] 26. Wang, Z.; Zhang, X.C.; Le, M.H.; Xu, D.; Stacey, G.; Cheng, J.L. A protein domain co-occurrence network approach for predicting protein function and inferring species phylogeny. PLoS ONE 2011, 6, e17906. [CrossRef][PubMed] 27. Li, L.G.; Xia, Y.; Zhang, T. Co-occurrence of antibiotic and metal resistance genes revealed in complete genome collection. ISME J. 2017, 11, 651–662. [CrossRef][PubMed] 28. Li, T.Y.; Li, F.; Chen, Y.; Lv, X.N. Fast clustering for sparse network of retail products associated big data. Control Decis. 2018, 33, 1117–1122. 29. Watts, D.J.; Strogatz, S.H. Collective dynamics of ‘small-world’ networks. Nature 1998, 393, 440–442. [CrossRef][PubMed] 30. Zhou, S.; Mondragón, R.J. The rich-club phenomenon in the internet topology. IEEE Commun. Lett. 2004, 8, 180–182. [CrossRef] 31. Newman, M.E.J. The structure and function of complex networks. SIAM Rev. 2003, 45, 167–256. [CrossRef] 32. Liu, Y.S.; Zeng, X.; He, Z.; Zou, Q. Inferring microRNA-disease associations by random walk on a heterogeneous network with multiple data sources. IEEE/ACM Trans. Comput. Biol. Bioinf. 2016, 14, 905–915. [CrossRef][PubMed] 33. Li, P.; Guo, M.; Wang, C.; Liu, X.; Zou, Q. An overview of SNP interactions in genome-wide association studies. Briefings Funct. Genomics 2015, 14, 143–155. [CrossRef][PubMed]

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).