Mining Scientific Papers for Bibliometrics: a (Very) Brief Survey of Methods and Tools

Iana Atanassova1, Marc Bertin2 and Philipp Mayr3

1 [email protected] Centre Tesnière, University of Franche-Comté, France

2 [email protected] Centre Interuniversitaire de Recherche sur la Science et la Technologie (CIRST), Université du Québec à Montréal (UQAM), Canada

3 [email protected] GESIS, Leibniz Institute for the Social Sciences, Germany

Introduction

The Open Access movement in scientific publishing and search engines like Google Scholar have made scientific articles more broadly accessible. During the last decade, the availability of scientific papers in full text has become more and more widespread thanks to the growing number of publications on online platforms such as arXiv and CiteSeer (Wu, 2014). The efforts to provide articles in machine-readable formats and the rise of Open Access publishing have resulted in a number of standardized formats for scientific papers (such as NLM-JATS, TEI, DocBook).

Corpora

Several projects have responded to the need for full-text datasets and corpora for research experiments (PubMed, JSTOR, etc.). For example, the iSearch dataset was designed to facilitate research and experimentation in information retrieval, and specifically in aspects of task-based and integrated (a.k.a. aggregated) search. It comprises about 46 GB (compressed) of English-language documents from the physics domain, collected from public libraries and open archive resources.

Semantic Web and Information Retrieval

Scientific papers are highly structured texts and display specific properties related not only to their references but also to their argumentative and rhetorical structure. Recent research in this field has concentrated on the construction of ontologies for citations and scientific articles.

CiTO (Shotton, 2010), the Citation Typing Ontology, is an ontology for the characterization of citations, both factually and rhetorically. It is part of SPAR, a suite of Semantic Publishing and Referencing Ontologies. Other SPAR ontologies are described at http://purl.org/spar/.

Statistical Analysis of Textual Data

Text Mining in R

Temis, an R Commander plugin (Bastin, 2013), provides integrated tools for text mining. Corpora can be imported as raw text. Another package is IRaMuTeQ (Ratinaud, 2009), a Python application which uses the R libraries.

Correspondence Analysis

Correspondence analysis is a technique for describing contingency tables and is mainly used in the field of text mining (Morin, 2006). These tools could prove very useful for developing new text analytics approaches for bibliometrics.

Natural Language Processing Tools

Research in the field of Natural Language Processing (NLP) has provided a number of open source tools for versatile text processing.

The Apache OpenNLP library (Baldridge, 2005) is a machine learning based toolkit for the processing of natural language text. Written in Java, it is open source and platform-independent.

Stanford CoreNLP (Manning, 2014) integrates many NLP tools, including a part-of-speech (POS) tagger, a named entity recognizer (NER), a parser, a coreference resolution system, a sentiment analysis tool, and bootstrapped pattern learning tools. Stanford CoreNLP is written in Java and licensed under the GNU General Public License.

MALLET (McCallum, 2002) is a Java-based package for statistical NLP, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. It includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several common metrics.

GATE (Cunningham, 2002) is open source free software for all types of computational tasks involving human language. It includes components for diverse NLP tasks, e.g. parsers, morphology, tagging, Information Retrieval tools, and Information Extraction components for various languages.

CiteSpace (Chen, 2006) is a freely available Java application for visualizing and analyzing trends and patterns in scientific literature. It is designed to answer questions about a knowledge domain, which is a broadly defined concept that covers a scientific field, a research area, or a scientific discipline.

What is next?

Several studies examine the distribution of references in papers (Bertin, 2013). However, up to now, full-text mining efforts have rarely been used to provide data for bibliometric analyses. An example is the special issue on Combining Bibliometrics and Information Retrieval (Mayr, 2015). Novel approaches to full-text processing of scientific papers and linguistic analyses for Bibliometrics can provide insights into scientific writing and bring new perspectives to understand both the nature of citations and the nature of scientific articles. The possibility to enrich metadata by the full-text processing of papers offers new fields of application for bibliometric studies, such as text reuse patterns in specific disciplines.

Working with full text allows us to go beyond the metadata used in Bibliometrics. Full text offers a new field of investigation, where the major problems arise around the organization and structure of text, the extraction of information and its representation at the level of metadata. Unlike text mining from titles and abstracts, full-text processing allows the extraction of rhetorical elements of scientific discourse, such as results, methodological descriptions, negative citations, discussions, etc. Scientific abstracts, by summarizing the text, provide only short, synthetic and thematic information.

Furthermore, the study of contexts around in-text citations offers new perspectives related to the semantic dimension of citations. The analyses of citation contexts and the semantic categorization of publications will allow us to rethink co-citation networks, bibliographic coupling and other bibliometric techniques.

Our aim is to stimulate research at the intersection of Bibliometrics and Computational Linguistics in order to study the ways Bibliometrics can benefit from large-scale text analytics and sense mining of scientific papers, thus exploring the interdisciplinarity of Bibliometrics and Natural Language Processing. Typical questions of this emerging field are: How can we enhance author network analysis and Bibliometrics using data obtained by text analytics? What insights can NLP provide on the structure of scientific writing, on citation networks, and on in-text citation analysis?

References

Baldridge, J. (2005). The Apache OpenNLP library. https://opennlp.apache.org/

Bastin, G., Bouchet-Valat, M. (2013). RcmdrPlugin.temis, a Graphical Integrated Text Mining Solution in R. The R Journal, 5(1): 188–196.

Bertin, M., Atanassova, I., Larivière, V., Gingras, Y. (2013). The Distribution of References in Scientific Papers: an Analysis of the IMRaD Structure. In Proceedings of the 14th ISSI Conference (ISSI-2013), Vienna.

Chen, C. (2006). CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science and Technology (JASIST), 57(3): 359–377.

Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V. (2002). GATE: an architecture for development of robust HLT applications. In Proceedings of the 40th Annual Meeting of the ACL, pp. 168–175.

Lykke, M., Larsen, B., Lund, H., Ingwersen, P. (2010). Developing a Test Collection for the Evaluation of Integrated Search. Advances in Information Retrieval: 32nd European Conference on IR Research, UK.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., McClosky, D. (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the ACL: System Demonstrations, pp. 55–60.

Mayr, P., Scharnhorst, A. (2015). Scientometrics and Information Retrieval - weak-links revitalized. Scientometrics, 102(3): 2193–2199.

McCallum, A.-K. (2002). MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu

Morin, A. (2006). Intensive use of factorial correspondence analysis for text mining: application with statistical education publications. Statistics Education Research Journal (SERJ).

Ratinaud, P. (2009). IRaMuTeQ: Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires. http://www.iramuteq.org

Shotton, D. (2010). CiTO, the Citation Typing Ontology. Journal of Biomedical Semantics, 1(Suppl 1), S6.

Wu, J., Williams, K., Chen, H.-H., Khabsa, M., Caragea, C., Ororbia, A., Jordan, D., Giles, C. L. (2014). CiteSeerX: AI in a Digital Library Search Engine. Innovative Applications of AI, Proceedings of the 28th AAAI Conference, pp. 2930–2937.
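
The study of contexts around in-text citations, raised in the "What is next?" discussion, can be illustrated with a minimal Python sketch. It is not a method from any of the surveyed tools. It assumes author-year citations of the form "(Bertin, 2013)" and a naive sentence splitter. The names `CITATION` and `citation_contexts` are hypothetical, and the sample text is assembled from sentences of this survey.

```python
import re

# Assumed citation style: "(Author, Year)". Real papers use many other
# styles (numeric, multiple authors, "et al."), which this does not cover.
CITATION = re.compile(r"\(([A-Z][A-Za-z\-]+),\s*(\d{4})\)")


def citation_contexts(text):
    """Return (author, year, sentence) triples for each in-text citation."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    contexts = []
    for sentence in sentences:
        for author, year in CITATION.findall(sentence):
            contexts.append((author, int(year), sentence))
    return contexts


sample = (
    "Several studies examine the distribution of references in papers "
    "(Bertin, 2013). An example is the special issue on Combining "
    "Bibliometrics and Information Retrieval (Mayr, 2015)."
)

for author, year, sentence in citation_contexts(sample):
    print(author, year, "->", sentence)
```

Each extracted sentence could then be passed to a tagger or classifier, for example to assign a rhetorical category to the citation. A realistic pipeline would more likely start from structured full text such as NLM-JATS or TEI than from raw strings.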