Unsupervised Extraction and Clustering of Key Phrases from Scientific Publications
Xiajing Li
Uppsala University
Department of Linguistics and Philology
Master Programme in Language Technology
Master’s Thesis in Language Technology, 30 ECTS credits
September 25, 2020
Supervisors:
Dr. Fredrik Wahlberg, Uppsala University
Dr. Marios Daoutis, Ericsson AB

Abstract

Mapping a research domain can be of great significance for understanding and structuring the state of the art of a research area. Standard techniques for systematically reviewing scientific literature entail extensive selection and intensive reading of manuscripts, a laborious and time-consuming process performed by human experts. Researchers have worked on automating one or more sub-tasks of the reviewing process. The main challenge of this work lies in the gap in semantic understanding of text and background domain knowledge.

In this thesis we investigate the possibility of extracting keywords from scientific abstracts in an automated way. We intend to use the categories of these keywords as the basis of a classification scheme in the context of systematic mapping studies. We propose a framework that combines unsupervised keyphrase extraction with semantic keyphrase clustering. Specifically, we (1) explore the effect of domain relevance and phrase quality measures in keyphrase extraction; (2) explore the effect of knowledge-graph-based word embeddings on the embedding representation of phrase semantics; (3) explore the effect of clustering for grouping semantically related keyphrases.

Experiments are conducted on a dataset of publications pertaining to the domain of "Explainable Artificial Intelligence (XAI)". We further test the performance of clustering using terms and labels from publicly available academic taxonomies and keyword databases. Experimental results show that: (1) The extended ranking score does improve keyphrase extraction performance.
Adapting the pre-processing and candidate selection methods to the target document type would be more important. (2) Semantic-network-based word embeddings (ConceptNet) perform fairly well, with lower computational complexity. (3) Term-level semantic keyphrase clustering does not generate ideal categories for terms; however, clustering is shown to group semantically similar terms together. Finally, we conclude that it is particularly challenging to find terms that are semantically related but not morphologically similar.

Contents

Acknowledgements
1. Introduction
   1.1. Challenges
   1.2. Aim
   1.3. Structure of the Thesis
2. Background
   2.1. Systematic Mapping Studies
      2.1.1. Background of Systematic Mapping Studies
      2.1.2. Previous Work of Systematic Mapping Studies
   2.2. NLP Methods for Automation Support
      2.2.1. NLP Techniques for Conducting Search
      2.2.2. NLP Techniques for Screening of Papers
      2.2.3. NLP Techniques for Keywording and Generation of Classification Scheme
   2.3. Word Embedding Representations
      2.3.1. Word2Vec
      2.3.2. Contextual Word Embedding
      2.3.3. Knowledge Graph based Embedding
   2.4. Automatic Keyword Extraction
   2.5. Terms Clustering and Taxonomy
3. Methodologies
   3.1. Architecture Overview
   3.2. Embedding Representation
   3.3. Keyphrases Extraction
      3.3.1. Document Relevance Score
      3.3.2. Domain Relevance Score
      3.3.3. Phrase Quality Score
   3.4. Semantic Keyphrase Clustering
      3.4.1. Spherical k-means
      3.4.2. Hierarchical Agglomerative Clustering
4. Experimental Evaluation
   4.1. Dataset
      4.1.1. Scientific Publications Dataset
      4.1.2. Synthetic Dataset for Term Clustering
   4.2. Implementation and Tools
   4.3. Evaluation Metrics
      4.3.1. Extraction Evaluation
      4.3.2. Clustering Evaluation
5. Results
   5.1. Keyphrase Extraction
      5.1.1. Candidate Selection
      5.1.2. Candidate Ranking
   5.2. Word Embedding
   5.3. Clustering
6. Conclusion
   6.1. Future Work
A. Clustering Results

Acknowledgements

I would like to thank my academic supervisor, Fredrik Wahlberg, for his encouragement and suggestions in this thesis. I am deeply indebted to the Ericsson AI research team, particularly my supervisor, Marios Daoutis, for providing this interesting thesis topic. I am grateful for his valuable guidance and support during the long phase of thesis work. I am also grateful for having had the chance to work with my colleagues in the project. Finally, I would like to thank all the teachers in the Language Technology Programme, who guided me into the world of NLP. I would like to thank my family, my friends and my boyfriend, who have been encouraging and supporting me all the time.

List of Figures

2.1. The systematic mapping process (Petersen et al., 2008)
2.2. Survey from Carver et al. (2013) showing that the two most difficult and time-consuming steps are paper selection and data extraction
2.3. Word2vec (CBOW and skip-gram); figure from Bilgin and Şentürk (2017)
2.4. Example of the "conceptnet" node from the website https://conceptnet.io/
2.5. Automatic keyword extraction process stages (figure from Merrouni et al., 2019)
2.6. The taxonomy development method proposed by Nickerson et al. (2013)
3.1. Overall framework in this thesis
3.2. The framework of the SIFRank model (Sun et al., 2020)
3.3. Example of a Hierarchical Agglomerative Clustering dendrogram
4.1. An example of a scientific paper with INSPEC Controlled Indexing and Non-Controlled Indexing. Phrases in bold are present in the text.
4.2. Numbers of tokens of phrases in "Controlled indexing terms", "Non-controlled indexing terms" and "candidate keyphrases". Here candidate selection applies noun phrase chunking.
5.1. Example from top-15 extracted keyphrases
5.2. Results of silhouette coefficient score with n clusters
5.3. Results of Calinski-Harabasz Score with n clusters
5.4. Results of Davies-Bouldin Index with n clusters

List of Tables

3.1. Conditions and score calculation for phrase quality
4.1. Comparative analysis of Non-Controlled indexing terms and Controlled indexing terms
4.2. Term frequency analysis of Non-Controlled indexing terms
4.3. Example of two synthetic clustering datasets
5.1. Analysis of candidate selection in base models. "- preprocess" means without dash tag removal and "+ preprocess" means with dash tag removal.
5.2. Comparison of three extraction methods with their title-weighted ranking (e.g. TextRankCF uses title-weighted score for ranking)
5.3. Comparison of keyphrase extraction results from ensemble methods with three base models
5.4. Comparison of top candidates in domain relevance score using ELMo embedding and ConceptNet Numberbatch embedding. Domain glossaries from AI and Machine Learning.
5.5. Running time (in seconds) of two keyphrase extraction methods. Note: execution time of phrase quality is not included, because it is computed at corpus level and applies equally to both methods.
5.6. Clustering analysis on XAI publications dataset
5.7. Example of cluster-wise results on spherical k-means clustering of XAI publications dataset
5.8. Clustering analysis on DM dataset and KG dataset
A.1. (Parts of) cluster-wise results on semantic term clustering using spherical k-means on XAI publications dataset

1. Introduction

Understanding and structuring state-of-the-art research provides a significant foundation of knowledge around that research area. Methods such as systematic mapping studies (SM) and systematic review studies (SR) have been widely applied to information mining and conceptualization of research articles. The typical procedure is performed by human experts and researchers, and includes selecting (filtering) the relevant manuscripts from a large pool, reading them, extracting and organising key information, and finally categorizing papers based on the extracted information. In general, the output can be analyzed and presented through survey reports or graph-based visualizations that illustrate the mapping and structuring of the research domain.

Research areas such as Artificial Intelligence and Machine Learning have grown in popularity in recent years. Consequently, the increasing number of publications makes the manual reviewing process of such domains quite challenging and time-consuming. Various studies have investigated techniques that aim to automate one