Assessing Italian Research in Statistics: Interdisciplinary Or Multidisciplinary?
Sandra De Francisci Epifani*, Maria Gabriella Grassia**, Nicole Triunfo**, Emma Zavarrone*

Abstract

In this paper, we assess the cross-disciplinarity of the research produced by Italian Academic Statisticians (IAS) by combining text mining and bibliometric techniques. Textual and bibliometric approaches each have advantages and disadvantages, and they provide different views on the same interlinked corpus of scientific publications. In addition to the textual information in such documents, citations jointly constitute huge networks that yield additional information. We incorporate both points of view and show how to improve on existing text-based and bibliometric methods. In particular, we propose a hybrid clustering procedure based on Fisher's inverse chi-square method as the preferred method for integrating textual content and citation information. Given the clustered papers, it is possible to evaluate ISI subject categories (SCs) as descriptive labels for statistical documents, and to address the interdisciplinarity of individual researchers.

Keywords: Bibliometrics, Text mining, Social network analysis, Hybrid clustering

1 Introduction

The increasing dissemination of scientific and technological publications via websites, and their availability in large-scale bibliographic databases, has opened up massive opportunities for improving classification and bibliometric cartography of science and technology. This metascience benefits from the continuous rise in computing power and the development of new algorithms. The purpose of mapping, charting or cartography of scientific fields is to understand, on the basis of scientific publications, the structure and evolution of different areas of research and their links to other fields. Research fields can be profiled along different dimensions, i.e.
in terms of prolific authors, major concepts, important publications and journals, institutions, regions and countries, etc. Knowledge about the amount of activity in various fields and about new, emerging and converging fields is important to organizations, research institutions and nations. Quantitative information can be used to evaluate research performance, interdisciplinarity, collaboration and internationalization, and to support innovation management and science and technology policies (for example, which fields should be supported through funding?). Such policies are crucial for the competitive position of universities.

We focus on cross-disciplinarity within the scientific research areas of Italian universities, using clustering algorithms together with bibliometric and text mining techniques. The multidisciplinary context of statistics affords an excellent opportunity to examine the methods used to study interdisciplinarity and integration.

2 Background

Research that occurs at the intersection between disciplines is thought to lead to great advances in science (Porter and Rafols, 2009). Interdisciplinary research should be supported and encouraged in order to address new statistical challenges. A cynical view of this problem is eloquently stated in Brewer (1999): "The world has problems, but universities have departments." The term interdisciplinarity tends to be tacitly understood by researchers, without a shared definition. We adopt the definition given by the National Academies (2005) and suggested by Porter et al. (2007): interdisciplinary research requires an integration of concepts, theories, techniques and/or data from two or more bodies of specialized knowledge.
Multidisciplinary research may incorporate elements of other bodies of specialized knowledge, but without the interdisciplinary synthesis (Wagner et al., 2011) that amounts to more than its single parts. The analysis of cross-disciplinarity improves on traditional indicators for assessing and quantifying interdisciplinary research (Morillo et al., 2001) (Fig. 1).

* Università IULM, Via Carlo Bo 1, Milano
** Dipartimento di Matematica e Statistica, Università degli Studi Federico II Napoli, via Cintia, Napoli

Fig. 1: Interdisciplinary and multidisciplinary

Indicators of disciplinary diversity describe the heterogeneity of a bibliometric set starting from predefined categories, i.e. using a top-down approach that locates the set on the global map of science. Network coherence indicators are constructed to measure the intensity of similarity relations within a bibliometric set, i.e. using a bottom-up approach, which reveals the structural consistency of the publication network (Rafols and Meyer, 2010). Rather than exploring large-scale trends in publications with a top-down approach, a bottom-up approach requires a large amount of data representing the research track of each statistician. We suggest measuring at the level of one or more individuals versed in statistics. An unsupervised approach is therefore well suited, as such methods can find trends in data without prior knowledge of its structure.

The substantial distinction between the text world and the graph world corresponds to different points of view on a collection of interlinked publications. In addition to the textual information kept in documents, citations form large networks that yield additional information. To group publications into clusters of documents, we consider these two complementary approaches.
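As an illustration of a top-down indicator of the kind described above, the following minimal sketch computes the Shannon entropy of the subject-category shares in a set of publications. The choice of Shannon entropy, the function name and the category labels are assumptions made for illustration; the source does not state which diversity indicator is used.

```python
import math
from collections import Counter


def shannon_diversity(categories):
    """Shannon entropy (natural log) of the category shares in a publication set.

    `categories` holds one predefined subject-category label per publication.
    A value of 0 means every publication falls in a single category.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum p * log(1/p) over the observed categories.
    return sum((n / total) * math.log(total / n) for n in counts.values())


# Hypothetical category labels for the papers of two researchers.
focused = ["Statistics & Probability"] * 4
mixed = ["Statistics & Probability", "Economics", "Sociology", "Computer Science"]

print(shannon_diversity(focused))
print(shannon_diversity(mixed))
```

A higher value only indicates that the publications are spread over more predefined categories; it says nothing about whether the knowledge is actually integrated, which is why such indicators are complemented by bottom-up coherence measures.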
In the integrated, or hybrid, analysis we show how to improve on existing text-based and graph-analytic (bibliometric) methods by deeply merging textual content with the structure of the citation graph. The distinction between the text world and the graph world applies to any interlinked data collection, such as the World Wide Web or bibliographic databases containing written scientific communications. These documents contain textual information that can be mined for knowledge using text mining techniques. Moreover, each document refers to other documents that are related to it in some way. Most scientific papers cite previous research on which they are based or which is considered relevant to the subject. These citations are collected in the bibliography of a publication. Although various reasons for citing other works are conceivable, citations usually imply endorsement or recommendation of previous work.

All citations among publications, or hyperlinks among Web pages, constitute extremely large networks, of which the World Wide Web is the biggest example. Unlike the Web, where each page can have hyperlinks to any other page, a citation network (or literature network) is, or closely resembles, a directed acyclic graph (DAG). Both citations and hyperlinks have a direction (they point from one entity to another), but citations are not reciprocal and no directed cycles occur in the citation graph: usually, a scientific paper only cites documents that have already been published.

Both textual and graph-based approaches can be applied to the same dataset. For example, different perceptions of similarity between documents or groups of documents can be described using the different methods. In addition, the dynamics of evolving databases can be observed.
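The acyclicity property of a citation network can be checked directly. The sketch below is a minimal illustration using Kahn's topological-sort algorithm; the data layout (a dictionary mapping each paper to the papers it cites) and the paper identifiers are hypothetical.

```python
from collections import deque


def is_acyclic(citations):
    """Return True if the directed citation graph contains no directed cycle.

    `citations` maps each paper to the list of papers it cites
    (an edge points from the citing paper to the cited paper).
    """
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)

    # Number of incoming citations for every paper.
    indegree = {n: 0 for n in nodes}
    for cited in citations.values():
        for c in cited:
            indegree[c] += 1

    # Kahn's algorithm: repeatedly remove papers that nobody (still) cites.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        n = queue.popleft()
        removed += 1
        for c in citations.get(n, []):
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)

    # Papers on a directed cycle never reach indegree 0 and are never removed.
    return removed == len(nodes)


# Hypothetical example: each paper cites only earlier papers.
chronological = {"paper_2010": ["paper_2005", "paper_1999"],
                 "paper_2005": ["paper_1999"],
                 "paper_1999": []}
# Hypothetical example: two papers citing each other (e.g. via preprints).
reciprocal = {"paper_a": ["paper_b"], "paper_b": ["paper_a"]}

print(is_acyclic(chronological))
print(is_acyclic(reciprocal))
```

In real bibliographic data a few cycles can appear, for instance when two papers in press cite each other, which is one reason to describe the citation network as only similar to a DAG.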
We incorporate both viewpoints and claim that, used jointly, they improve on existing text-based and graph-analytic (bibliometric) methods for mapping science and statistics. Indeed, textual information can indicate similarities that are invisible to bibliometric techniques. Conversely, when based only on text, true document similarity can be overshadowed by differences in vocabulary use, and spurious similarities might be introduced by textual pre-processing, by polysemous words (words with several meanings) or by words with little semantic value.

The widely used method of co-citation clustering was introduced independently by Small (1973, 1978) and Marshakova (1973). Cross-citation-based cluster analysis for science mapping is different: while the former is usually based on links connecting individual documents, the latter requires the aggregation of documents into units such as journals or subject fields, among which cross-citation links are established. Some advantages of this method (for instance, the ability to analyze directed information flows) are undermined by possible biases. For example, bias could be caused by the use of predefined units (journals, subject categories, etc.), since this in some way implies an initial level of structural classification. Journal cross-citation clustering has been used by Leydesdorff (2006), Leydesdorff and Rafols (2009), and Boyack, Börner, and Klavans (2005), while Moya-Anegón et al. (2007) applied subject co-citation analysis to visualize the structure of science and its dynamics.

The integration of lexical similarities and citation links is also attractive in other fields, such as search engine design (e.g., Google combines text and links; Brin & Page, 1998). In the early 1990s, the combination of link-based clustering with a textual approach was suggested to improve the efficiency and applicability of co-citation and co-word analysis.
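One way to combine evidence from the text view and the citation view is Fisher's inverse chi-square method, which merges k independent p-values through the statistic X = -2 Σ ln p_i, distributed as chi-square with 2k degrees of freedom under the null hypothesis. The sketch below implements only this combination step. It assumes that the textual and citation-based evidence for a pair of documents has already been expressed as p-values; how those p-values are obtained is not shown, and the function names and input values are hypothetical.

```python
import math


def chi2_sf_even(x, k):
    """Survival function of a chi-square variable with 2k degrees of freedom.

    For even degrees of freedom the tail has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's inverse chi-square method."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even(statistic, len(p_values))


# Hypothetical p-values for one document pair: text view and citation view.
p_text, p_citation = 0.05, 0.05
print(fisher_combine([p_text, p_citation]))
```

Two moderately small p-values combine into a smaller one, so agreement between the two views strengthens the evidence that a document pair is related, while a single p-value is returned unchanged.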
A new weighted hybrid clustering framework was proposed by Liu, Yu, Janssens, Glänzel, Moreau & De Moor (2010), with a focus on combining text mining with bibliometrics in journal set analysis. This framework integrates two different approaches: clustering ensemble and kernel-fusion clustering.

3 Aims, methods and data collection

In order to verify the hypothesis of accuracy of clustering and