Proceedings of the 8Th International Workshop on Mining Scientific Publications

WOSP 2020 Proceedings of the 8th International Workshop on Mining Scientific Publications 05 August, 2020 Wuhan, China c 2020 The Association for Computational Linguistics Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL) 209 N. Eighth Street Stroudsburg, PA 18360 USA Tel: +1-570-476-8006 Fax: +1-570-476-0860 [email protected] ISBN 978-1-952148-88-0 Introduction The entire body of research literature is currently estimated at 100-150 million publications, with an an- nual increase of around 1.5 million. Research literature constitutes the complete representation of knowledge we have assembled as human species. It enables us to develop cures to diseases, solve challenging engineering problems and answer many of the world’s challenges we are facing today. Systematically reading and analysing the full body of knowledge is now beyond the capacities of any human being. Consequently, it is essential to understand better how we can leverage Natural Language Processing/Text Mining techniques to aid knowledge creation and improve the process by which research is being done. This workshop aims to bring together people from different backgrounds who: 1. have experience with analysing and mining databases of scientific publications, 2. develop systems that enable such analysis and mining of scientific databases or 3. who develop novel technologies that improve the way research is being done. The topics of the workshop were organised around the following themes: 1. The whole ecosystem of infrastructures including repositories, aggregators, text-and data-mining facilities, impact monitoring tools, datasets, services and APIs that enable analysis of large volumes of scientific publications. 2. Semantic enrichment of scientific publications utilising text and data mining. 3. analysis of large databases of scientific publications to identify research trends and improve access to research content. This year, we hosted a new shared task: 3C Citation Context Classification Shared Task Recent years have witnessed a massive increase in the amount of scientific literature and research data being published online, providing revelation about the advancements in the field of different domains. The introduction of aggregator services like CORE has enabled unprecedented levels of open access to scholarly publications. The availability of full text of the research documents facilitates the possibility of extending the bibliometric studies by identifying the context of the citations. The shared task organized as part of the WOSP 2020 focused on classifying citation context in research publications based on their influence and purpose. Subtask A: Multiclass classification of citations into one of six classes: Background, Uses, Com- • pare_Contrast, Motivation, Extension and Future. Subtask B: Binary classification of citations based on the classes Incidental and Influential, a task • for identifying the importance of a citation. Given a citation context, the participants were required to predict the intent of the citations. The participants were provided with a labelled dataset of 3000 training instances annotated using the ACT platform. iii We are grateful to the program committee for their careful and thoughtful reviews of the submitted papers. Likewise, we are thankful to the keynote speakers for sharing their research and their vision for the field, and to the workshop attendees for a lively and productive discussion. Petr Knoth Christopher Stahl Bikash Gyawali David Pride Suchetha N. Kunnath Drahomira Herrmannova iv Committees Organisers: Petr Knoth, Knowledge Media institute, The Open University, UK Christopher Stahl, Oak Ridge National Laboratory, USA Bikash Gyawali, Knowledge Media institute, The Open University, UK David Pride, Knowledge Media institute, The Open University, UK Suchetha N. Kunnath, Knowledge Media institute, The Open University, UK Drahomira Herrmannova, Oak Ridge National Laboratory, USA Program Committee: Akiko Aizawa, National Insutitute of Informatics, Japan Iana Atanassova , Université de Bourgogne Franche-Comté, France Marc Bertin, Université Claude Bernard Lyon 1, France José Borbinha, Universidade de Lisboa, Portugal Pravallika Devineni, Oak Ridge National Laboratory, USA Tirthankar Ghosal, Indian Institute of Technology Patna, India Saeed-Ul Hassan, Information Technology University, Pakistan Radim Hladik, Czech Academy of Sciences, Czech Republic Monica Ihli, University of Tennessee, USA Roman Kern, Graz University of Technology, Austria Martin Klein, Los Alamos National Laboratory, USA Birger Larsen, Aalborg University, Denmark Paolo Manghi, ISTI-CNR, Italy Sepideh Mesbah, Delft University of Technology, Netherlands Peter Mutschke, GESIS Leibniz Institute for the Social Sciences, Germany Federico Nanni, Alan Turing Institute, UK Francesco Osborne, The Open University, UK Robert M. Patton, Oak Ridge National Laboratory, USA Eloy Rodrigues, Universidade do Minho, Portugal Wojtek Sylwestrzak, ICM Univeristy of Warsaw, Poland Vetle Torvik, University of Illinois, USA Jian Wu, Old Dominion University, USA Invited Speakers: Anne Lauscher, University of Mannheim Allan Hanbury, Vienna University of Technology David Jurgens, University of Michigan Neil Smalheiser, University of Illinois at Chicago Kuansang Wang, MSR Outreach Academic Services v Table of Contents Virtual Citation Proximity (VCP): Empowering Document Recommender Systems by Learning a Hypothetical In-Text Citation-Proximity Metric for Uncited Documents.........1 Paul Molloy, Joeran Beel and Akiko Aizawa Citations Beyond Self Citations: Identifying Authors, Affiliations, and Nationalities in Scientific Papers............................................9 Yoshitomo Matsubara and Sameer Singh SmartCiteCon: Implicit Citation Context Extraction from Academic Literature Using Supervised Learning........................................... 21 Chenrui Guo, Haoran Cui, Li Zhang, Jiamin Wang, Wei Lu and Jian Wu Synthetic vs. Real Reference Strings for Citation Parsing, and the Importance of Re-training and Out-Of-Sample Data for Meaningful Evaluations: Experiments with GROBID, GIANT and CORA......................................... 27 Mark Grennan and Joeran Beel Term-Recency for TF-IDF, BM25 and USE Term Weighting.................... 36 Divyanshu Marwah and Joeran Beel The Normalized Impact Index for Keywords in Scholarly Papers to Detect Subtle Research Topics 42 Daisuke Ikeda, Yuta Taniguchi and Kazunori Koga Representing and Reconstructing PhySH: Which Embedding Competent?............. 48 Xiaoli Chen and Zhixiong Zhang Combining Representations For Effective Citation Classification.................. 54 Claudio Moisés Valiense de Andrade and Marcos André Gonçalves Scubed at 3C task A - A simple baseline for citation context purpose classification......... 59 Shubhanshu Mishra and Sudhanshu Mishra Scubed at 3C task B - A simple baseline for citation context influence classification........ 65 Shubhanshu Mishra and Sudhanshu Mishra Amrita_CEN_NLP @ WOSP 3C Citation Context Classification Task............... 71 Premjith B and Soman Kp Overview of the 2020 WOSP 3C Citation Context Classification Task............... 75 Suchetha N. Kunnath, David Pride, Bikash Gyawali and Petr Knoth vi Virtual Citation Proximity (VCP): Empowering Document Recommender Systems by Learning a Hypothetical In-Text Citation-Proximity Metric for Uncited Documents Paul Molley Joeran Beel ∗ Akiko Aizawa Trinity College Dublin University of Siegen National Institute of School of Computer Science Dept. of Electrical Engr. Informatics (NII), Digital and Statistics, Ireland & Computer Science Content and Media Sciences [email protected] Germany Tokyo, Japan [email protected] [email protected] Abstract (bibliographic coupling), that are co-cited by other The relatedness of research articles, patents, documents or that are otherwise connected in the court rulings, web pages, and other docu- citation graph (Beel et al., 2016). ment types is often calculated with citation Citation-based approaches may recommend or hyperlink-based approaches like co-citation more diverse items than content-based filtering, as (proximity) analysis. The main limitation of citations can be made for various reasons (Willett, citation-based approaches is that they cannot 2013;F arber¨ and Sampath, 2019; Erikson and Er- be used for documents that receive little or no landson, 2014). For instance, two documents can citations. We propose Virtual Citation Prox- imity (VCP), a Siamese Neural Network ar- be co-cited because they address the same research chitecture, which combines the advantages of problem; use the same methodology (to solve dif- co-citation proximity analysis (diverse notions ferent problems); or two documents may be co- of relatedness / high recommendation perfor- cited for less predictable reasons. Today’s text- mance), with the advantage of content-based based methods can hardly distinguish such diverse filtering (high coverage). VCP is trained on a types of relatedness. Instead, text-based methods corpus of documents with textual features, and generally consider two documents as related the with real citation proximity as ground truth. 1 VCP then predicts for any two documents, more terms they have in common . based on their title and abstract, in what prox- A particularly promising citation-based ap- imity the two documents would be co-cited, if proach is Citation Proximity Analysis (CPA) (Gipp they were indeed co-cited. The prediction can and Beel, 2009), which is illustrated

Proceedings of the 8Th International Workshop on Mining Scientific Publications

Arxiv:1809.06223V1 [Cs.CL] 17 Sep 2018

Roberto Navigli Curriculum Vitæ

Ontology Learning and Its Application to Automated Terminology Translation

Incorporating Trustiness and Collective Synonym/Contrastive Evidence Into Taxonomy Construction

Stefano Faralli Curriculum Vitæ

Clustering Enterprise Networks by Patent Analysis

Word Sense Disambiguation Based on Word Similarity Calculation Using Word Vector Representation from a Knowledge-Based Graph

Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites

What Women Like: a Gendered Analysis of Twitter Users' Interests

Word Sense Disambiguation: a Survey

End-To-End Reinforcement Learning for Automatic Taxonomy Induction

Syntactically Aware Neural Architectures for Definition Extraction