“Publish or Perish” in the Internet Age

A study of publication statistics in computer networking research

Dah Ming Chiu and Tom Z. J. Fu
Department of Information Engineering, CUHK
{dmchiu, zjfu6}@ie.cuhk.edu.hk

This article is an editorial note submitted to CCR. It has NOT been peer reviewed. The authors take full responsibility for this article’s technical content. Comments can be posted through CCR Online.

ABSTRACT

This study takes papers from a selected set of computer networking conferences and journals spanning the past twenty years (1989-2008) and produces various statistics to show how our community publishes papers, and how this process has been changing over the years. We observe rapid growth in the rate of publications, venues, citations, authors, and number of co-authors. We explain how these quantities are related, and in particular explore how they are related over time and the reasons behind the changes. The widely accepted model to explain the power-law distribution of paper citations is preferential attachment. We propose an extension and refinement which suggests that elapsed time is also a factor in determining which papers get cited. We try to compare the selected venues based on citation count, and discuss how we might think about these comparisons, in terms of the roles played by different venues and the ability of venues and citation counts to predict impact. The treatment of these issues is general and can be applied to study publication patterns in other research communities. The larger goal of this study is to generate discussion about our publication system, and to work towards a vision of transforming it for better scalability and effectiveness.

Categories and Subject Descriptors
C.2.0 [Computer-Communication Networks]: General

General Terms
Documentation, Performance

Keywords
citation, h-index, g-index, co-authorship, academic publishing

1. INTRODUCTION

The academic publication machinery, taken as a whole, provides an archive for peer-reviewed academic papers. In the process, the meta-information associated with the publications, such as date and venue of publication, authorship and citations, can also be readily gathered from various sources [1–5]. Such publication records can be very useful for research, though they are perhaps most often used in performance evaluation for hiring, promotion and tenure cases in the academic world. In this paper, we study the publication records of a selected set of conferences and journals in the networking field over the past 10-20 years at an aggregate level, and summarize and discuss the statistics. Through these statistics, we hope to (1) share some interesting observations about the way our community publishes, and the characteristics of some familiar conferences and journals; (2) ask questions and generate interest for future studies; and (3) get feedback on methodologies and possible collaboration for this kind of study.

For the rest of the paper, we begin by describing (a) the paperset we use in the context of existing on-line sources, and (b) the most important previous works and our approach. Subsequently, we discuss the various results and observations one by one, ending with a conclusion section. For more detailed statistics and a detailed explanation of the methodology, see our technical report [9].

2. DATA MODEL

We begin with a brief definition of terminology and an explanation of the data we analyze.

A paper is published by a venue at a date. Venues refer to conferences (and workshops) or journals. Most of them publish papers on a periodic basis (e.g. every year, or every month). A paperset is a collection of papers we use to calculate various statistics. Each paper has one or more authors; each author may publish one or more papers. As a rule, each paper cites related papers published earlier. The citation count of a paper, which changes as time goes on, is the total number of citations received by the paper by a given time. This data model is used by various providers of academic publication statistics, e.g. Google Scholar [1], DBLP [2], IEEE [3], ACM [4], ISI [5], Citeseer [6], or Microsoft Libra [7]. These providers, however, tend to use different papersets. For example, Citeseer, DBLP and Libra focus mostly on computer science and related literature, but each has its own rules about which conferences/papers to include. The various statistics, e.g. citation counts of papers, are derived from the respective closed papersets, leaving them not cross-comparable. The providers also tend to use different metrics, e.g. ISI and Citeseer may have different definitions of impact factor for venues.

For our study, we decided to focus on a particular research field, in this case computer networking. We also decided to use the Google Scholar citation count (for each paper we consider) because Google Scholar’s paperset is arguably the largest. This decision has a couple of consequences: (a) The

ACM SIGCOMM Computer Communication Review 34 Volume 40, Number 1, January 2010

process of gathering citation counts from Google Scholar cannot be completely automated. For a given set of papers, it takes some time to gather the citation counts with some level of manual verification. (b) The citation count of a paper in Google Scholar is continuously updated and changing. This means we need to focus on a limited set of conferences. In this study, we tried to choose venues that we are more familiar with, and tried to pick different types to make them reasonably representative.

The paperset we consider comes from the set of venues listed in Table 1. We illustrate its relationship to the other papersets in Fig 1. The citation counts were gathered from Google Scholar in late summer of 2009, and can be considered a snapshot.

[Figure 1 appears here: a Venn-style diagram relating Google Scholar (size unknown), Microsoft Libra (3,314,754 papers), DBLP (1,293,019), CiteSeer (1,440,519), ISI (87+ million), and our data set (19,907 papers).]

Figure 1: Illustration of several existing data sets.

Venue name            #years  Total # papers
Sigcomm (88-08)           21             612
Infocom (89-08)           20            4069
Sigmetrics (89-08)        20             795
Elsevier CN (89-08)       20            2953
IEEE/ACM ToN (93-08)      16            1331
ICNP (93-08)              16             559
MobiCom (95-08)           14             385
ICC (98-08)               11            7721
WWW (01-08)                8            1063
IMC (01-08)                8             281
NSDI (04-08)               5             138

Table 1: Paper set.

The study of academic publication statistics is by no means new. Previous attention focused mostly on different areas of science, especially physics. In fact, this field of study is referred to as bibliometrics. The most influential work was published in 1965 by Derek de Solla Price [10], in which he considered papers and citations as a network, and noticed that the citation distribution (degree distribution) follows a power law. He tried to explain this phenomenon using a simple model, later referred to as the model of preferential attachment (i.e. a paper is more likely to cite another paper with more existing citations). Other authors also turned their attention to the network formed by authors who wrote papers together [13–17]. The difference in our study, other than the focus on the computer networking field, is mainly in the consideration of how things change over time. In the discussion of our results, we will make passing reference to the prior works as the opportunities arise.

3. PAPER INFLATION

It should not be surprising to observe that we are seeing a rapid increase in the rate at which papers are published in our field. We can account for the increase in terms of:

1. Many conferences increase the number of papers they publish, usually by increasing the number of tracks of presentation. Fig 2 shows the total rate of publication of the list of venues we consider.

2. Most successful conferences gradually add workshops before/after the main event, which allow more papers to be published. Fig 2 also shows the number of workshops associated with the conferences in our list.

3. The number of conferences and journals has increased significantly over the last 10-20 years. Fig 3 shows the number of venues in CiteSeerX’s venue impact factor report. CiteSeerX tracks a much larger field than networking; based on our experience, we are assuming the growth of the networking field is strongly correlated to that of the superset tracked by CiteSeerX.

[Figure 2 appears here: per-year counts of published papers (left axis, 0-2000) and associated workshops (right axis, 0-80), 1988-2010.]

Figure 2: Total number of papers and number of associated workshops changing by year.

Discussion - Likely reasons for paper inflation: We think the following are important reasons for paper inflation.

1. The number of authors is increasing. We will show some statistics in a later section on authorship. There may be various reasons for the increase, but two come to mind: (a) The success of the Internet and WWW has drastically raised people’s awareness of and interest in working on computer networking. This is sometimes referred to as the dotcom effect. (b) Due to the economic development and opening-up of many developing countries, most notably China, a large number of authors are joining the research community overall.

2. The rate of publication per author is edging up. In order to raise the level of measurable output, many academic units or individuals themselves strive to publish


[Figures 3 and 4 appear here: Fig 3 plots the number of listed venues per year (1992-2008) with a fitting curve over 1993-2005, R² = 0.9952; Fig 4 is a log-log plot of the number of papers with X citations against the citation count X.]

Figure 3: Number of listed venues by CiteSeerX in each year.

Figure 4: Distribution of number of citations (our data set).
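A distribution like the one in Figure 4 can be examined by plotting its complementary cumulative distribution function (CCDF): on log-log axes, a power law appears as a straight line. A minimal sketch, with invented citation counts (not the actual data):

```python
# Sketch: the empirical CCDF of a list of citation counts.
# The sample counts below are invented, for illustration only.
from collections import Counter

def ccdf(counts):
    """Return sorted (x, P[X >= x]) pairs for a list of citation counts."""
    n = len(counts)
    freq = Counter(counts)
    pairs = []
    tail = n                      # number of papers with >= x citations
    for x in sorted(freq):
        pairs.append((x, tail / n))
        tail -= freq[x]
    return pairs

citations = [0, 0, 1, 1, 2, 3, 5, 8, 13, 120]   # hypothetical paperset
for x, p in ccdf(citations):
    print(x, p)   # roughly straight on log-log axes would suggest a power law
```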

more. Sometimes a minimum number of publications is imposed before a student can obtain a graduate degree.

3. The Internet and WWW make publications more accessible. As a result, publications are shared more globally.

Discussion - Consequences of paper inflation: A rapid increase in publication puts great stress on the peer-review system. Given the decentralized way publication venues are run, the relevant papers for any research topic may become more spread out, making it harder for researchers to follow the literature, hence increasing the chances of re-inventing the wheel. There have been various proposals to revamp the whole publication system. It is indeed very exciting to think about how to design a global and scalable on-line system in which all researchers can share their research results, with all the accounting (of who did what) kept, and which remains highly usable (in terms of ease of finding useful and relevant information quickly).

4. CITATION INFLATION

First, we plot the distribution of the citation counts earned by papers in our paperset on a log-log scale, as shown in Fig 4.¹ The result is consistent with prior works based on larger data sets [17].

It is more interesting to observe what happens when we plot the total citation count over time, shown in Fig 5. Preferential attachment alone cannot explain this, as there seems to be a pronounced dependence on time.

[Figure 5 appears here: total citation count by year of publication (on the order of 10⁴), 1988-2008, for our data set.]

Figure 5: Total number of citations by year (our data set).

The total citation count seems to be an increasing function up to some point (7-8 years from the current time), then it starts decreasing. This curve can be explained intuitively. First, the fall in total citation count in more recent years is due to the fact that citation count takes time to build up. The more recent the year, the less the time for this build-up. The rise of the curve during the early years also has a good reason: it is a (subtle) consequence of paper inflation. By having fewer papers in earlier years, we lower the rate at which citations are collected in earlier years as well.

What is even more interesting is that we believe we discovered evidence that there is fashion in research. In other words, researchers may favor citing papers published in the recent past. This observation comes from our effort to create a model to explain the citation curve.

Let us try to explain the situation with a simple model. Let the number of papers published in year t be denoted by x_t, and the number of citations a paper makes be γ (a constant). The total number of citations made in year t is thus γ x_t. The distribution of these citations to the papers published previously is assumed to follow a probability distribution α(n), where n is the number of years separating the citing paper and the receiving paper. The sum of citations received by papers published n years before the current year t is denoted c(n, t). Assuming the time horizon for t is [s, f], then

    c(n, t) = \frac{\alpha(n) x_{t-n}}{\sum_{m=1}^{t-s} \alpha(m) x_{t-m}} \, \gamma x_t.

¹In order to ascertain that the distribution indeed follows a power law, we can plot it as a Complementary Cumulative Distribution Function (CCDF). The result shows the data is closer to a lognormal distribution, which is also approximately heavy-tailed. In any case, the paperset in our study is a relatively small and biased data set, so the exact citation distribution is not the focus of this study.


The total citations received by the papers published in year t, denoted c(t), is then

    c(t) = \sum_{i=t+1}^{f} c(i - t, i).

We then use different plausible distributions for α(n) and try to fit the curve we have. We tried the uniform distribution and the Poisson distribution with various parameters, and the Poisson distribution worked quite well. The paper-count statistics are plotted in Fig 6; we used a smoothened version to model x_t, with the dotted line a projection into the next few years. The resulting citation curve predicted by the model (with a mean for α of 4 years²) is shown in Fig 7. The fashion in research, based on this data set, is thus for research topics first published 3-4 years earlier. Finally, the dotted line in Fig 7 is a prediction of inflation down the road - the expected citation counts in the coming years based on the trend.

²Actually, the mean is three years if we allow citations to be made to other papers published in the same year.

[Figure 6 appears here: papers published per year, 1988-2012: real data, a quadratic fitting curve, and a four-year prediction (dotted).]

Figure 6: Total number of papers in each year versus quadratic fitting curve with four-year prediction.

[Figure 7 appears here: total citation count by year (up to about 12×10⁴): real data, the Poisson model, and a four-year prediction (dotted).]

Figure 7: Total citation number of published papers versus Poisson model with four-year prediction.

Discussion - model assumption: Needless to say, our model assumes a closed paperset whose papers cite papers internal to the set. In our analysis, our paperset is but a (small) subset of all the networking papers of the last 10-20 years, so there are inaccuracies due to papers in our paperset citing papers outside it, and outside papers citing ours. We are overlooking these issues in this simple analysis.

Discussion - adjusting to inflation: Citation count (and paper count) inflation is a phenomenon relevant to people making evaluations based on citations year after year. In year t, you may consider a paper (or a person) receiving a citation count of at least c as adequate/good. In year t+1, you would need to adjust your threshold up due to citation inflation. We have demonstrated how you might compute the inflation rate from a simple model, once you know the paper inflation rate and have some idea of the citation distribution function α. Similarly, when comparing two papers published at different times by citation count, you may consider using our model as a way to calibrate the comparison.

Discussion - research fashion? The model of preferential attachment alone does not seem rich enough to explain the aggregate behavior of citation counts. The model of research fashion seems very interesting to us. To further validate the model, we would need to use a paper database with more detailed information, e.g. including the times at which citations are made.

5. COMPARISON OF VENUES

Perhaps the most interesting question to readers is how the publication venues compare to each other in terms of citation count. This corresponds to what is known as the impact factor in academic circles. There is no standard definition of impact factor, so we will use a number of different metrics to compute the citation statistics for different venues over a period of time: (i) the top twentieth paper; (ii) the average; (iii) the percentiles; (iv) the h-index; (v) the g-index. The results are tallied in Table 2.

H-index [8,12] and g-index [11] are recently proposed metrics for computing an impact factor as a single number. To derive these numbers, you first sort all papers in descending order of citation count; the h-index is the highest index (h) of a paper whose citation count is at least h, and the g-index is the highest index (g) of a paper such that the sum of the citation counts from paper 1 to g is at least g². The h-index was originally proposed for evaluating the impact factor of a person; the g-index is a variation of it. Both can be used to evaluate the impact factor of any aggregate of papers.

Comparing publication venues is controversial. There are several issues: (a) there is no well-established metric; (b) the venues publish papers at different rates; (c) in our paperset, the data cover different time periods, and we know from the citation inflation discussion that it is not fair to compare citation statistics for papers published at different times. For (a), what we try to do is apply many different metrics and let the readers make their own judgements. For (b) and (c), we re-computed all the statistics for a window of three years common to all venues. Indeed, NSDI, the venue with the most negative bias, moved up in ranking under all metrics. The other conferences with only 8 years of papers (WWW and IMC) also moved up slightly. The detailed results can be found in the technical report [9].

To remove the time-dependent effects, we can also try to do the comparison in a year-by-year fashion. Instead of doing this for all metrics, we do it using only one metric: plot


Venue name   # papers  top 20th   Avg.  90th  80th  70th  60th  50th  40th  30th  20th  10th  h-index  g-index
Sigcomm           612      1155  233.5   513   281   181   114    79    58    34    16     6      179      362
MobiCom           385       962  201.3   484   230   142    93    62    39    22    12     5      124      276
ToN              1331      1026   99.4   206   102    61    39    24    16    10     5     2      167      333
NSDI              138       108   53.2   141    85    61    39    28    20    16    11     7       50       82
IMC               281       144   52.9   126    82    57    42    28    18    11     6     3       68      111
Infocom          4069       727   52.3   132    69    40    25    16    10     6     3     1      207      341
Sigmetrics        795       233   47.9   113    66    42    27    18    11     7     3     1       97      167
WWW              1063       293   40.8   120    51    32    20    11     7     3     2     0      110      176
ICNP              559       156   30.2    76    43    24    14    10     6     3     1     0       66      113
Elsevier CN      2953       453   29.2    54    25    15     9     6     4     2     1     0      127      241
ICC              7721       270    9.3    20    10     6     4     3     1     1     0     0      100      161

Table 2: Percentile- and index-based analysis on 11 selected venues.

the median citation count for all venues (in Fig 8). In this case, a couple of conferences consistently stand out - these are the ACM single-track conferences Sigcomm and MobiCom, with very low annual acceptance rates (about 10%).

[Figure 8 appears here: per-year median citation counts (0-350) for the 11 venues (Sigcomm, Infocom, ToN, MobiCom, Sigmetrics, ICNP, WWW, NSDI, IMC, Computer Networks, ICC), 1988-2008.]

Figure 8: Median citation number of 11 selected venues in the computer networking field, with all papers counted in each year.

Discussion - different needs: First, it is expected that there are many levels of publications. At one extreme, we have papers based on a new discovery or a new idea, or written by experienced researchers who exercise a certain amount of self-discipline when submitting papers. At the other extreme, we have many papers written mostly to fulfill graduation or job requirements. Different venues position themselves to satisfy different needs, in the process providing different kinds of service to the community.

Discussion - use of venue to judge impact: If the venue of a paper is used to predict the potential impact of that paper, the different venues have very different predictive power. For example, if we consider a (Google Scholar) citation count of 20 or higher to be of impact, then a Sigcomm paper is close to 80% likely to satisfy that threshold, whereas an ICC paper has only a 10% likelihood of doing so. For venues with low predictive power, the citation count should be considered as well. In some academic institutions, only papers from journals are counted. This may be a reasonable simplification for conferences like ICC, but it is not wise to exclude those conferences that can indicate high citation count reliably. For example, in our paperset, Infocom published a comparable number of papers as the journal Computer Networks, and at each percentile, the papers from the former had more citations than those of the latter.

Discussion - use of citation count to judge impact: Of course, citation count is not always a reliable indicator of impact. The citation counts produced by Google Scholar can have a lot of noise, since Google Scholar does not remove self-citations and includes some on-line articles that are not peer-reviewed; this is especially true when the value is lower than a certain threshold (of 10, for example). Finally, the correlation of the citation count of a paper with the quality and research value of the paper is complicated. Whether a paper gets cited seems to depend a lot on whether it gets introduced to and read by other researchers interested in the same topic, which can no longer be guaranteed these days, since the rate of publication is higher than the ability of a researcher to follow it. Many other factors, such as presenting the paper to more people on different occasions, and the prestige, social connections and activeness of the authors, all seem to bias the citation count. To remove the effect of the noise, perhaps a condensed scale should be established (e.g. 0-10, 10-20, 20-50, 50-100, 100-200 etc. count as 0, 1, 2, 3, 4 and so on)³. The other major problem with the use of citation counts is the initial time lag. For this, our model of deriving the effect of this lag on an average basis can be used as a rough predictor of the citation count of a young paper. This is a topic worthy of further study. In our study, we do not have the citation history of individual papers. Some repositories provide this, such as ISI [5], but Google Scholar’s output format does not make it easy to collect this information. It would be interesting to study what the history (the rate of citation over time, and perhaps where the citations come from) can predict. For example, what type of impact? Is it due to a controversy, or a short-lived hot topic? How broad and lasting is the impact?

³This method is used to score matches in contract bridge.

6. CONFERENCE-THEN-JOURNAL

It seems we often publish a paper in a conference, and then journalize it. How often do people try, and successfully manage, to journalize a paper? It is said that computer science people prefer to only look at conference papers, and usually do not bother to journalize papers. We compiled some statistics to compare papers that were published in a conference first and journalized subsequently, versus papers that were not journalized. The method of determining a journalized paper is somewhat


Conf.        Total  Journalized  Conf Cite  Jnl Cite  Total Cite  Non-Jnl Cite
Sigcomm        541        108       276.5     298.9       575.4        193.1
Infocom       3415        597        34.9      82.1       117.0         56.0
Sigmetrics     691         88        37.3      84.3       121.6         47.7
ICNP           414         58        41.2      55.4        96.6         31.9
MobiCom        314         91        72.9     205.0       277.9        245.9
ICC           5547        336         7.5      17.8        25.3         12.3
WWW            598         65        71.9      97.6       169.5         58.1
IMC            209         17        82.6      74.2       145.8         60.4

Table 3: Comparison of average citation numbers between journalized papers and non-journalized ones.

ad hoc, so you need to take that into account when considering the results. For each paper in our paperset, we use Google Scholar to search for papers with approximately the same title (based on a threshold percentage match of regular expressions) and the same authors (at least two common authors). The results are summarized in Table 3.

In this table, the second column is the total number of papers for each conference; the third column is the total number of journalized papers; the fourth column is the average citation count of those conference papers that are later journalized; the fifth column is the average citation count of the journal versions of those conference papers; the sixth column is the sum of both conference and journal versions; and the last column is the average citation count of those papers that are not journalized.

Discussion - percentage of papers journalized: Across the board, less than 20% of the papers are journalized, and the overall average is significantly lower than 10%. The relatively low percentage may be due to different reasons. For ACM conferences, some authors may consider that there is no need to journalize, so the limit may be caused by the authors. For other conferences, it can be limited by the quality of the papers.

Discussion - journalizing and citation count: First, again across the board, the total citation count (for the conference version plus the journal version) is, on average, higher than that of the non-journalized papers (last column). Apparently, two rounds of peer review can better sort out the work with more impact - this is reasonable and expected. Note that this applies to the case of the ACM conferences as well⁴. Second, the journal version, on average, receives more citations than the conference version. The exception is IMC. Our guess is that for a measurement conference, timeliness of a paper is more important than for other types of papers, possibly leading to the poorer citation count for the journalized versions. This can also be due to a glitch, since the sample size for IMC is quite small. For this topic, we have done more analysis, for example, to find out the distribution of the number of years it takes to journalize (most often 1-2 years, but there are cases from 0 to 8 years), and to find out which journals the conference papers go to. The readers are referred to our technical report for more details.

⁴Although for MobiCom, the non-journalized papers scored quite high compared to the journalized cases.

Discussion - archival versus quality differentiation: Originally, an important reason for journalizing was that it serves an archival need. Nowadays, it can be argued that conference papers are equally well archived. So the gain in journalization is mostly in allowing more time and more serious reviewing to achieve higher quality. But in reality, the review process for journals is not that much more extensive than for the better-quality conferences, and neither has consistent quality control. A scalable and consistent mechanism for quality differentiation of publications is a very interesting open problem.

7. AUTHORSHIP STATISTICS

We now turn to authorship statistics. Let us define a few quantities, each for a given period of time:

n: number of papers published;
m: number of distinct authors;
r: average number of papers published by each author;
q: average number of co-authors for each paper.

And there exists an invariant relationship connecting these quantities:

    nq = mr

We can call this the balancing equation of authorship. It can be easily verified that the equation is valid. There is one challenging problem - how to separate out distinct authors. We adopted the simplest solution - take the name from the paper as is. This method has two problems:

1. Name collision - more than one real person shares the same name. They will be treated as the same author by us.

2. Multiple name representations - for example, Dah Ming Chiu is sometimes represented as D. M. Chiu. In this case, one real person is considered as two separate authors by us.

We will come back to this issue later in the section.

We can consider 1989-2008, the twenty years of our paperset, as one period. But then the time aspect is lost. We are interested to see how the above statistics (n, m, r and q) changed over the twenty years. If we consider each year as a separate period, the sample size is small (hence variance would be large). So we divided the twenty years into four windows of five-year periods. The resulting quantities are tabulated in Table 4.

From the column for n, we see the number of papers published in successive 5-year windows increased very rapidly, as we discussed before. The number of distinct authors in the corresponding periods had equally rapid increases. Many of these authors (perhaps most students) wrote only a single paper, as shown in Fig 9.


Venue name    89-90  91-92  93-94  95-96  97-98  99-00  01-02  03-04  05-06  07-08
Sigcomm        1.90   2.27   2.45   2.47   2.78   3.01   3.12   3.51   3.63   4.35
Infocom        2.16   2.29   2.41   2.39   2.52   2.78   2.83   2.93   3.07   3.20
Sigmetrics     2.19   2.36   2.55   2.54   2.87   2.90   2.95   3.20   3.36   3.51
Elsevier CN    1.62   1.86   2.16   2.39   2.52   2.93   2.89   2.88   3.00   3.12
ToN               -      -   2.25   2.40   2.49   2.59   2.95   2.70   2.90   2.90
ICNP              -      -   2.53   2.82   2.63   2.62   3.00   3.05   3.26   3.32
MobiCom           -      -      -   2.66   2.87   3.07   3.21   3.34   3.36   3.41
ICC               -      -      -      -   2.57   2.69   2.66   2.80   2.92   3.05
WWW               -      -      -      -      -      -   3.17   3.50   3.07   3.35
IMC               -      -      -      -      -      -   3.19   3.25   3.81   3.57
NSDI              -      -      -      -      -      -      -   3.70   4.02   4.26

Table 5: Average number of co-authors per paper for each conference.
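Entries like those in Table 5 can be produced mechanically by grouping papers into two-year windows and averaging author counts per venue. A minimal sketch (with invented records, not the actual data pipeline):

```python
# Sketch: average co-authors per paper, per venue and two-year window,
# as in Table 5.  The records below are invented; real input would be
# (venue, year, author-list) rows from the paperset.
from collections import defaultdict

papers = [
    ("Sigcomm", 2007, ["A", "B", "C", "D"]),
    ("Sigcomm", 2008, ["E", "F", "G", "H", "I"]),
    ("Infocom", 2007, ["J", "K", "L"]),
    ("Infocom", 2008, ["M", "N", "O"]),
]

def window(year):
    """Map a year to its two-year window label, e.g. 2007 -> '07-08'."""
    start = year if year % 2 == 1 else year - 1
    return "%02d-%02d" % (start % 100, (start + 1) % 100)

totals = defaultdict(lambda: [0, 0])  # (venue, window) -> [author slots, papers]
for venue, year, authors in papers:
    slot = totals[(venue, window(year))]
    slot[0] += len(authors)
    slot[1] += 1

for key, (slots, count) in sorted(totals.items()):
    print(key, slots / count)         # average co-authors per paper
```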

[Figure 9 appears here: log-log distribution of the number of papers per author, one curve per 5-year window (1989-1993, 1994-1998, 1999-2003, 2004-2008).]

Figure 9: Distribution of number of papers per author (our data, in 5-year windows).

Window       n      m      q      r
1989-1993   1699   2678   2.161  1.371
1994-1998   2834   4843   2.489  1.457
1999-2003   5763   9873   2.817  1.644
2004-2008   9441  15731   3.074  1.845

Table 4: Window-based authorship analysis on our data.

It is interesting to note that the number of co-authors per paper has increased nearly 50%. At the same time, the number of papers per author also increased about 20-30%. These are all average numbers. If a large group of people maintain the same behavior (e.g. write only one paper), then the rest of the people must have undergone a much more significant change. In Table 5, we separately account for the number of co-authors per paper for each conference. We see that the number of co-authors stayed relatively more constant for some large conferences like Infocom. At the same time, some smaller conferences such as Sigcomm underwent a much more significant increase in the number of co-authors.

We found a way to double-check our method using another paperset. The DBLP [2] project opens its paperset data for the public to use. Their paperset is much larger, containing approximately 1.1 million papers (1989-2008) covering a broader set of disciplines. We extracted those published in computer networking venues in 1989-2008, a total of 105441 papers. For these papers, we performed the same (aggregate) authorship analysis. The results are shown in Table 6.

If we plot the distribution of papers per author for each of the five-year intervals, the result is basically the same as that shown in Fig 9: we see an increasing number of authors over time, but the distribution remains the same. This means a set of curves with the same (negative) slope, moved out slightly corresponding to the increased author numbers. In Fig 10, we plot the distribution for the entire period (1989-2008) for both papersets, for comparison.

Window       n       m      q      r
1989-1993   13075  14657   2.066  1.843
1994-1998   17643  20446   2.234  1.928
1999-2003   24372  30252   2.458  1.980
2004-2008   50351  64781   2.840  2.207

Table 6: Window-based authorship analysis on DBLP data.

[Figure 10 appears here: log-log distribution of the number of papers per author over the entire period, comparing our data set (1989-2008) and the DBLP data (1989-2008).]

Figure 10: Distribution of number of papers per author (20 years of two data sets).

For the DBLP paperset, they made an effort to solve the name collision problem by trying to distinguish authors based on their co-authorship patterns. The assumption is that two different real persons will have totally disjoint co-author groups. This does tend to eliminate name collisions, but it may also introduce another problem. When a person changes jobs, he/she is likely to build up totally disjoint

ACM SIGCOMM Computer Communication Review 40 Volume 40, Number 1, January 2010

authorship relationships, and will be treated as two separate authors. Despite a different (and much larger) paperset, and a somewhat different way of identifying distinct authors, the resultant statistics are quite similar to those from our smaller paperset.

Discussion - authorship observations: These results on authorship confirm our earlier speculation on the reasons for paper inflation. Indeed, we have all these causes: more authors, and authors writing more papers on average. It behooves us to further study the types of authors and their contribution to the load on the publication system. Such information can be very helpful in the discussion of how to transform the current system into a more scalable one.

Figure 11: Distribution of number of distinct collaborators (two data sets).

Discussion - co-authorship implications: The co-authorship patterns and statistics can help us understand the prevailing collaboration trends, inter-institution and intra-institution. For intra-institution collaboration, the traditional mode may be dominated by student-supervisor collaboration. The increased co-authorship numbers may indicate that group supervision, or hierarchical research groups, are becoming more common. As the pressure for more publications increases, there is also suspicion by some that some co-authors are free-riding. We are not able to ascertain this one way or the other.

Overall, the number of co-authors for computer networking (from our statistics) is similar to that for the fields of physics (average 2.53) and biology (average 3.75), but higher than that for mathematics (average 1.45), as reported by a 2004 study by Newman [14].

Another interesting co-authorship analysis is to look at the collaborator distribution - how many collaborators an author has - in both data sets (Fig 11). As shown in Fig 11, the collaborator distribution results match Newman's results (the physics and mathematics curves in [14]).

Discussion - relationship to other studies: Prior to this section, the analysis was based only on the paper citation network (node=paper, link=citation). For the authorship analysis, the network is extended to authors. There are two types of relationships: author-to-paper, and author-to-author (co-authorship). The author network is a special case of a social network. There is considerable related work on this, including the more recent interest in on-line social networks. A proper survey of this area is beyond the scope of our current study.

8. AUTHOR PRODUCTIVITY

Suppose we agree to take the number of published papers as a measure of author productivity. We are interested in finding out which factors have a statistical correlation with productivity. We were able to study a few factors, and report them here.

The first factor we consider is the number of distinct collaborators. The correlation (of number of papers to number of collaborators) is plotted in Fig 12. There is clearly some correlation.

Discussion - supervision of students: As we discussed earlier, there may be many patterns of collaboration. The collaboration between a supervisor and his/her students (or other hierarchical relationships) would lead to a clear-cut correlation: a straight line, with the slope corresponding to the average number of papers published involving a student. If other, lateral collaboration relationships can be separated out, it would be interesting to see how the behaviour differs.

The next factor we consider is the number of active years. There are at least two ways to define active years: (a) the number of years from a given author's first publication to his/her last; (b) the number of years in which a given author published at least one paper. We used the latter definition, and plot the result in Fig 13. Again, there is clearly a positive correlation, as expected. But the diversity is quite high - from authors who publish one paper per year to authors who publish about ten papers per year, on average.

Next we consider the correlation with the number of co-authors. The result is shown in Fig 14. In this case, there does not seem to be a noticeable correlation.

Finally, we show how the percentage of the most productive authors relates to the percentage of all papers published, in our paperset. In other words, we are interested in the minimum number of authors covering any given percentage of the papers published. This number can be computed approximately using a greedy algorithm, described as follows:

Initialization: Sort all the authors in descending order of their productivity (number of published papers);
Repeat: Remove the author with the most papers from the author list, and all his/her papers from the paper list; and plot (% authors removed, % papers removed).

The plot is shown in Fig 15, for both papersets (ours and the one from DBLP). Furthermore, we also plot (% authors, % citations), derived from the same greedy algorithm with papers replaced by citations. This is done only for our paperset, since its citation information is readily available. It is interesting to note that both paper coverage curves satisfy the 20/80 rule: twenty percent of the most active authors can take credit for eighty percent of the papers. The top 80% of the citations, however, can be credited to the top 5% of the authors.

Discussion - Partitioning authors: From Fig 15, it is perhaps reasonable to separate the authors into two broad classes: professional authors and occasional authors. For the purpose of studying author productivity, we may see certain properties for the class of professional authors which may



Figure 12: Productivity VS number of distinct collaborators (a) our data set, (b) DBLP data.
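The strength of the relationship in Fig 12 could be quantified with an ordinary Pearson correlation coefficient over per-author (number of papers, number of distinct collaborators) pairs. A minimal sketch, with purely hypothetical toy values (the paper reports only the scatter plots, not a coefficient):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy per-author pairs (papers, distinct collaborators) -- hypothetical:
papers        = [1, 2, 3, 5, 8]
collaborators = [2, 3, 5, 9, 14]
print(round(pearson_r(papers, collaborators), 3))   # 0.998
```

Note that for the heavy-tailed distributions seen here, the coefficient is best computed on log-transformed counts or replaced by a rank correlation.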


not be easily visible when considering all authors together. This will be considered in future work.

Figure 13: Productivity VS number of active years (a) our data set, (b) DBLP data.

9. RELATING CITATION COUNT TO CO-AUTHORSHIP

Finally, we study the relationship between the number of co-authors and citation count. In this case, we only used our own paperset: for the DBLP paperset, although the co-authorship information is available, the citation count information (based on Google Scholar) is not easily available. The result is plotted in Fig 16. The main figure is a scatter plot. In the inset, we also plot the average citation count for each number of co-authors. For higher numbers of co-authors, the sample size is very small, so the variance can be high. Between 1 and 7 co-authors, the average citation count seems almost flat, except for 5 authors. It is not clear if this is statistically significant, and if so, why.

10. CONCLUDING REMARKS

What drove us to this work is a strong feeling that our publication system is running into a huge scalability problem. It seems we have an endless number of deadlines for paper reviews and paper submissions, and yet we do not have enough time to read all the papers being published on the research problems we are working on. We also noticed how the numbers of publications and citations of researchers seem to be going through an inflation process, causing a lot of confusion about how to evaluate researchers for degrees and jobs.

We tried to pick a relatively small but representative set of conferences and journals, and all the papers published at these venues in the past twenty years. We relied on available data and tools as much as possible, and tried to produce some statistics meaningful to our community, with discussion based on our perspective. We believe the model for citation history and the balancing equation for authorship are useful tools, and the time-based viewpoint is worthy of further study. Our goal is to provoke more thinking by the whole community about how to improve our own system of publications, starting from how it is working now.

Acknowledgement

The authors wish to thank the student helpers, Kevin Lee and Ivy Ting, for their hard work and carefulness in data



Figure 14: Productivity VS average number of co-authors (a) our data set, (b) DBLP data.
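The coverage curves of Fig 15 come from the greedy procedure described in Section 8: sort authors by productivity, then repeatedly remove the most productive remaining author together with all of his/her papers, plotting the running (% authors, % papers) pairs. A minimal Python sketch; `papers_by_author`, mapping each author to the set of his/her paper ids, is a hypothetical input format:

```python
def greedy_coverage(papers_by_author):
    """Approximate the minimum number of authors needed to cover any
    fraction of the papers: repeatedly pick the author with the most
    still-uncovered papers and mark those papers covered.
    Returns (fraction of authors used, fraction of papers covered)
    points, one per removed author."""
    remaining = {a: set(ps) for a, ps in papers_by_author.items()}
    total_authors = len(remaining)
    total_papers = len(set().union(*remaining.values()))
    covered = set()
    curve = []
    while remaining and len(covered) < total_papers:
        # Greedy step: the author with the most uncovered papers.
        top = max(remaining, key=lambda a: len(remaining[a]))
        covered |= remaining.pop(top)
        for papers in remaining.values():
            papers -= covered   # discount papers already covered
        curve.append(((total_authors - len(remaining)) / total_authors,
                      len(covered) / total_papers))
    return curve

# Toy example with hypothetical authors A, B, C and paper ids 1-5:
print(greedy_coverage({"A": {1, 2, 3}, "B": {3, 4}, "C": {5}}))
```

On this toy input, the single most productive author already covers 60% of the papers, mirroring the 20/80 behavior observed in Fig 15. Substituting per-paper citation counts for paper counts gives the citation-coverage curve.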

Figure 15: Approximate minimum number of authors for maximum paper and citation coverage.

Figure 16: Correlation of citation count and number of coauthors.

collection; and thank Don Towsley, Vishal Misra and John Lui for various suggestions. The work is partially supported by a CUHK direct grant 2050411.

11. REFERENCES
[1] "Google Scholar", http://scholar.google.com/.
[2] "DBLP", http://www.informatik.uni-trier.de/~ley/db/.
[3] "IEEE Digital Library", http://dl.comsoc.org/comsocdl/.
[4] "ACM Digital Library", http://portal.acm.org/dl.cfm/.
[5] "ISI Web of Knowledge", http://www.isiknowledge.com/.
[6] "CiteSeerX", http://citeseerx.ist.psu.edu/.
[7] " Search", http://libra.msra.cn/.
[8] P. Ball. Index aims for fair ranking of scientists. Nature, 436:900, 2005.
[9] D. M. Chiu and T. Z. J. Fu. A study of publication statistics in computer networking research. Technical report, 2009.
[10] D. J. de Solla Price. Networks of scientific papers. Science, 149(3683):510–515, 1965.
[11] L. Egghe. An improvement of the h-index: The g-index. ISSI Newsletter, 2(1):8–9, 2006.
[12] J. E. Hirsch. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102:16569–16572, 2005.
[13] R. K. Merton. The Matthew effect in science. Science, 159:56–63, 1968.
[14] M. E. J. Newman. Coauthorship networks and patterns of scientific collaboration. Proc. Natl. Acad. Sci. USA, 101:5200–5205, 2004.
[15] M. E. J. Newman. Who is the best connected scientist? A study of scientific coauthorship networks. Springer, pages 337–370, 2004.
[16] M. E. J. Newman. The first-mover advantage in scientific publication. Europhys. Lett., 86:68001, 2009.
[17] S. Redner. How popular is your paper? An empirical study of the citation distribution. Eur. Phys. J. B, 4:131–134, 1998.
