
Unsupervised Extraction and Clustering of Key Phrases from Scientific Publications

Xiajing Li

Uppsala University, Department of Linguistics and Philology, Master Programme in Language Technology. Master's Thesis in Language Technology, 30 ECTS credits. September 25, 2020

Supervisors: Dr. Fredrik Wahlberg, Uppsala University; Dr. Marios Daoutis, Ericsson AB

Abstract

Mapping a research domain can be of great significance for understanding and structuring the state-of-the-art of a research area. Standard techniques for systematically reviewing scientific literature entail extensive selection and intensive reading of manuscripts, a laborious and time-consuming process performed by human experts. Researchers have spent efforts on automating methods in one or more sub-tasks of the reviewing process. The main challenge of this work lies in the gap in semantic understanding of text and background domain knowledge. In this thesis we investigate the possibility of extracting keywords from scientific abstracts in an automated way. We intended to use the categories of these keywords to form the basis of a classification scheme in the context of systematic mapping studies. We propose a framework of joint unsupervised keyphrase extraction and semantic keyphrase clustering. Specifically, we (1) explore the effect of domain relevance and phrase quality measures in keyphrase extraction; (2) explore the effect of knowledge graph based word embeddings in the embedding representation of phrase semantics; (3) explore the effect of clustering for grouping semantically related keyphrases. Experiments are conducted on a dataset of publications pertaining to the domain of "Explainable Artificial Intelligence (XAI)". We further test the performance of clustering using terms and labels from publicly available academic taxonomies and keyword databases. Experimental results show that: (1) the extended ranking score does improve keyphrase extraction performance, while adapting the pre-processing and candidate selection method to the target document type would be even more important; (2) knowledge graph based word embeddings (ConceptNet) have fairly good performance, with less computational complexity; (3) term-level semantic keyphrase clustering does not generate ideal categories for terms; however, it is shown that clustering can group semantically similar terms together.
Finally, we conclude that finding semantically related, but not morphologically similar, terms is considered particularly challenging.

Contents

Acknowledgements

1. Introduction
1.1. Challenges
1.2. Aim
1.3. Structure of the Thesis

2. Background
2.1. Systematic Mapping Studies
2.1.1. Background of Systematic Mapping Studies
2.1.2. Previous Work of Systematic Mapping Studies
2.2. NLP Methods for Automation Support
2.2.1. NLP Techniques for Conducting Search
2.2.2. NLP Techniques for Screening of Papers
2.2.3. NLP Techniques for Keywording and Generation of Classification Scheme
2.3. Word Embedding Representations
2.3.1. Word2Vec
2.3.2. Contextual Word Embedding
2.3.3. Knowledge Graph based Embedding
2.4. Automatic Keyword Extraction
2.5. Terms Clustering and Taxonomy

3. Methodologies
3.1. Architecture Overview
3.2. Embedding Representation
3.3. Keyphrases Extraction
3.3.1. Document Relevance Score
3.3.2. Domain Relevance Score
3.3.3. Phrase Quality Score
3.4. Semantic Keyphrase Clustering
3.4.1. Spherical k-means
3.4.2. Hierarchical Agglomerative Clustering

4. Experimental Evaluation
4.1. Dataset
4.1.1. Scientific Publications Dataset
4.1.2. Synthetic Dataset for Term Clustering
4.2. Implementation and Tools
4.3. Evaluation Metrics
4.3.1. Extraction Evaluation
4.3.2. Clustering Evaluation

5. Results
5.1. Keyphrase Extraction
5.1.1. Candidate Selection

5.1.2. Candidate Ranking
5.2. Word Embedding
5.3. Clustering

6. Conclusion
6.1. Future Work

A. Clustering Results

Acknowledgements

I would like to thank my academic supervisor, Fredrik Wahlberg, for his encouragement and suggestions in this thesis. I am deeply indebted to Ericsson AI research team, particularly my supervisor, Marios Daoutis, for providing this interesting thesis topic. I am grateful for his valuable guidance and support during the long phase of thesis work. Also, I am grateful for having had the chance to work with my colleagues in the project. Finally, I would like to thank all the teachers in Language Technology Program, who guided me to the world of NLP. I would like to thank my family, my friends and my boyfriend, who have been encouraging me and supporting me all the time.

List of Figures

2.1. The systematic mapping process. (Petersen et al., 2008)
2.2. Survey from Carver et al. (2013) shows the two most difficult and time-consuming steps are paper selection and data extraction.
2.3. Word2vec (CBOW and skip-gram). Figure from Bilgin and Şentürk (2017).
2.4. Example of a "conceptnet" node from the website https://conceptnet.io/
2.5. Automatic keyword extraction process stages. (Figure from Merrouni et al. (2019))
2.6. The taxonomy development method proposed by Nickerson et al. (2013)

3.1. Overall framework in this thesis
3.2. The framework of the SIFRank model (Sun et al., 2020)
3.3. Example of a Hierarchical Agglomerative Clustering dendrogram

4.1. An example of scientific papers with INSPEC Controlled Indexing and Non-Controlled Indexing. Phrases in bold are present in the text.
4.2. Numbers of tokens of phrases in "Controlled indexing terms", "Non-controlled indexing terms" and "candidate keyphrases". Here candidate selection applies noun phrase chunking.

5.1. Example from top-15 extracted keyphrases
5.2. Results of silhouette coefficient score with n clusters
5.3. Results of Calinski-Harabasz score with n clusters
5.4. Results of Davies-Bouldin index with n clusters

List of Tables

3.1. Conditions and score calculation for phrase quality...... 24

4.1. Comparative analysis of Non-Controlled indexing terms and Controlled indexing terms
4.2. Term frequency analysis of Non-Controlled indexing terms
4.3. Example of two synthetic clustering datasets

5.1. Analysis of candidate selection in base models. "- preprocess" means without dash tag removal and "+ preprocess" means with dash tag removal.
5.2. Comparison of three extraction methods with their title-weighted ranking (e.g. TextRankCF uses the title-weighted score for ranking).
5.3. Comparison of keyphrase extraction results from ensemble methods with three base models
5.4. Comparison of top candidates in domain relevance score using ELMo embedding and ConceptNet Numberbatch embedding. Domain glossaries from AI and Machine Learning.
5.5. Running time (in seconds) of two keyphrase extraction methods. Note: execution time of phrase quality is not included, because it is computed at corpus level and applies equally to both methods.
5.6. Clustering analysis on the XAI publications dataset
5.7. Example of cluster-wise results of spherical k-means clustering on the XAI publications dataset
5.8. Clustering analysis on the DM dataset and KG dataset

A.1. (Parts of) cluster-wise results of semantic term clustering using spherical k-means on the XAI publications dataset

1. Introduction

Understanding and structuring state-of-the-art research provides a significant foundation of knowledge around a research area. Methods such as systematic mapping studies (SM) and systematic review studies (SR) have been widely applied to information mining and conceptualization of research articles. The typical procedure is commonly performed by human experts and researchers, and includes selecting (filtering) the relevant among a large amount of manuscripts, reading, extracting and organising key information, and finally categorizing papers based on the extracted information. In general, the output can be analyzed and presented via reporting surveys, or graph-based visualizations that illustrate the mapping and structuring of the research domain. Research areas, such as those of Artificial Intelligence and Machine Learning, have grown in popularity in recent years. Consequently, the increasing number of publications makes the manual reviewing process of such domains quite challenging and time-consuming. Various studies have investigated techniques that aim to automate one or more sub-steps of the process, such as paper selection and visualization (Marshall and Wallace, 2019). A few examples in the literature focus on other aspects such as the keywording and categorization steps, which usually require human involvement and, most importantly, background knowledge from domain experts. This thesis intends to address this specific research challenge of devising keywords and a classification scheme, where we investigate several unsupervised methods that could achieve reasonable results. The overall problem is decomposed into two sub-tasks: (i) domain-specific keyphrase extraction from scientific documents and (ii) clustering of the extracted semantic keyphrases. The two main components have been developed and evaluated on an in-house dataset, composed of publications pertaining to the research domain of "Explainable Artificial Intelligence (XAI)".

1.1. Challenges

Mapping a research domain is very significant in order to gain a deep understanding of that domain; however, it still requires a great effort from human experts. The main challenge of such reviewing and mapping lies in the gap that exists between what a machine can understand from natural language text and what a human can comprehend from the same text using their background knowledge. In the original systematic mapping procedure (explained in detail in Section 2.1.1), keyword extraction and the classification scheme are two essential steps which help to classify papers from different perspectives and produce a group of categories from, typically, manual keywording and grouping of descriptive terms. First, terms extracted by intensive reading of papers should be representative not only of the source document but also of the research domain. However, existing keyword and keyphrase extraction systems are usually independent of downstream tasks and document types. For document types such as web pages and social media documents, short and concise keywords are required, while keywords of scientific publications are mostly multi-word expressions. For downstream tasks, keywords used for document classification and summarisation should encode salient (important and relevant) text features, while keywords used for document labeling and tagging should also consider the aspect of human readability. Then, when grouping sets of keywords into different categories,

human experts have an inherent ability to understand the definition, background knowledge, and semantic relatedness of keywords. Traditional information extraction and clustering algorithms typically lack such ability. This challenge can be addressed with machine-processable, structured semantic representations, e.g. word embeddings and knowledge databases, which have seen great adoption and scientific interest in recent years. However, existing knowledge databases do not seem to have the capacity to capture all different domains. In addition, as we will see in the following section, only a few studies seem to explore the concept of semantic clustering at phrase level.

1.2. Aim

The focus of this thesis is two-fold. On the one hand, we aim to investigate potential methods suitable for extracting keywords that are representative, precise, yet informative, from the summary text of scientific publications. On the other hand, we want to explore clustering methods that can leverage the extracted keywords to identify categories for the research domain of interest. We try to address the question below:

Can automated methods for keyphrase extraction and term clustering extract and identify useful information for systematic mapping studies?

Specically, we:

1. Explore the effect of ensemble score measures in keyphrase extraction.
2. Explore the effect of semantic network based word embedding techniques in the embedding representation of phrase semantics.
3. Explore the effect of clustering for grouping semantically related keyphrases.

1.3. Structure of the Thesis

The remaining structure of the thesis is organised as follows:

• Chapter 2 introduces the general concepts behind systematic mapping studies, the related work on state-of-the-art NLP techniques, as well as some theoretical background.
• Chapter 3 describes the overall framework and details of our methodology, from keyphrase extraction to clustering.
• Chapter 4 describes details regarding the dataset, other implementation details and the evaluation design of the two sub-tasks.
• Chapter 5 presents the evaluation of results and analysis that aim to address the research questions.
• Chapter 6 concludes the overall research work done in this thesis, with a short discussion on limitations and future work.

2. Background

2.1. Systematic Mapping Studies

2.1.1. Background of Systematic Mapping Studies

Both Systematic Mapping studies (SM) and Systematic Review studies (SR) originally come from evidence-based medicine research. In recent years they have been routinely applied to several research areas, in order to keep track of updates to existing techniques and methodologies in a particular research area. The generic methodology of systematic mapping studies aims at structuring the information of a certain research area while giving an overview of the state-of-the-art methodologies pertaining to that research area. With a classification scheme (step 4 in Figure 2.1), either generated or manually defined, the final process maps the extracted data, or the articles themselves, into different categories, to find answers to the research questions. Petersen et al. (2008) summarize the systematic mapping study procedure in the following steps:

Figure 2.1.: The systematic mapping process. (Petersen et al., 2008)

• Definition of a research question
• Conducting the search
• Screening of papers: conducting paper selection to find relevant papers
• Keywording using abstracts: generating a classification scheme for the selected papers
• Data extraction and mapping process: extracting key data to represent a certain research area

Systematic literature review studies were originally introduced for evidence-based practice, to perform unbiased aggregation of empirical studies. Although systematic mapping studies and systematic literature reviews share some basic common methodologies, e.g. with respect to the initial search and study selection, they accomplish different objectives. The procedure of a systematic review is further extended to critical appraisal or data synthesis, while systematic mapping studies tend to capture a broad subject area and map the selected data into a certain scheme. The mapping process specifically states the importance of visualizing data extraction results (Petersen et al., 2008, 2015). Therefore, researchers often suggest a two-stage review, initially with a systematic map of a wide scope, which is then followed by review studies that focus on a specific area of interest.

Keywording and Classification Scheme

"Classification scheme" refers to the process of coming up with a set of categories for classification comprised of labeled terms. As Petersen et al. (2015) propose, one of the main goals of systematic mapping studies is to give an overview of an area using the classification of articles. In their research, the classification schemes used among different systematic mapping studies can be divided into two kinds: fixed (topic-independent) classification and topic-specific classification. Five commonly used topic-independent classification schemes are: (1) venue (the type of publication venue); (2) research type; (3) research methods; (4) study focus; (5) contribution type. Venue can show a derivation of actual publication activity as well as the inclusion and exclusion criteria in paper selection. The research types and methods proposed by Wieringa et al. (2006) are commonly used, although distinguishing between different research types can sometimes be confusing. A topic-specific classification scheme can either emerge from each study per se, or can be based on existing literature (Petersen et al., 2015). Keywording, originally proposed by Petersen et al. (2008), offers a means for creating a classification scheme. It mainly consists of two steps: (1) identification of keywords and concepts from abstracts; and (2) grouping and refining. Franzago et al. (2016) further explain that all identified keywords and concepts should be combined together to clearly identify the context, nature, and contribution of the research. They also suggest that, by using a clustering operation on them, one can obtain a set of representative clusters of keywords.

Figure 2.2.: Survey from Carver et al. (2013) shows the two most difficult and time-consuming steps are paper selection and data extraction.

Meanwhile, with the ever-growing number of research publications, especially in artificial intelligence, one particular problem for current systematic mapping methodologies is that they are typically conducted in a manual way, while the underlying procedures are quite time-consuming and tedious, especially if done on a large set of documents. The survey from Carver et al. (2013) discusses the barriers of manual work in the systematic literature review process, especially in the context of paper selection and data extraction. However, with recent advancements in text-mining algorithms and NLP techniques, we are interested in exploring how such algorithms can help in automating (parts of) the manual work within the systematic mapping studies procedure.

2.1.2. Previous Work of Systematic Mapping Studies

Although the framework of systematic mapping studies proposed by Petersen et al. (2008) and Petersen et al. (2015) is widely used in recent studies, each systematic mapping study can have a different focus, trying to address possibly different research questions, which consequently leads to different implementations. Febrero et al. (2014) conducted a systematic mapping study to obtain a panorama and a taxonomy of Software Reliability Modeling (SRM), especially capturing the different models used in papers. In total 503 papers were selected after five iterations of a manual

selection process. When they found that the proposed classification scheme failed to address the selected studies, they proposed an extended classification scheme using keyword clustering, where the keywords were provided by both the authors and the extraction process. The groups of keyword clusters were identified based on an existing library taxonomy (i.e., the IEEE taxonomy). Ahmad and Babar (2016) conducted a systematic mapping study on software architectures for robotic systems with the aim to identify and classify the existing solutions, research progress and directions. In order to create the taxonomy of research themes, they classified the relevant studies from generic, to thematic and sub-thematic, and then re-classified the overlapping themes. The systematic review results were mapped to different dimensions, such as year of publication or research themes. Mohammed et al. (2017) conducted a systematic mapping study to identify the existing software security approaches used in the software development life-cycle. They conducted a quality assessment after the initial selection by inclusion/exclusion criteria, as a two-fold selection. 118 final studies were selected, which identified security approaches categorized into five named groups. The classification scheme used in this study concerned authors, publication venues, source types, used study strategies and demographic information. Vakkuri and Abrahamsson (2018) conducted a systematic mapping study exploring key concepts of ethics used in current autonomous and AI systems. During screening, pre-exclusion (document type, source type, article language) was done automatically in databases and manual screening was done for the rest of the criteria. 37 re-occurring keywords were extracted from 83 papers, which were then classified into 9 categories. The classification process was based on linguistic similarity, ontological similarity, family resemblance of keywords and similarity in usage.
The results addressed more general topics under separate AI branches. Overall, all previous works mentioned above were done mostly by manual work, though database search techniques and existing tools from digital libraries were commonly used to aid the process. Vakkuri and Abrahamsson (2018) reduced part of the manual screening work by including exclusion criteria in a database search step. Their research provides detailed examples of incorporating modern techniques in systematic mapping studies.

2.2. NLP Methods for Automation Support

With the rapid development of machine learning and natural language processing (NLP) techniques in recent years, more and more researchers tend to apply these techniques to reduce human effort. Text-mining (TM), similar to the process of a human understanding text, aims at retrieving relevant, high-quality information from text. Text-mining techniques are typically composed of algorithms drawn from different disciplines such as information retrieval, data mining, machine learning, natural language processing (NLP) and knowledge management (Feldman, Sanger, et al., 2007). Also, recent studies from Feng et al. (2017), Marshall and Wallace (2019) and Olorisade et al. (2016) present surveys analyzing state-of-the-art text mining and machine learning techniques implemented to automate query search and screening of papers in systematic literature reviews.

2.2.1. NLP Techniques for Conducting Search

Current research includes query word expansion, citation "snowballing", etc. Zdravevski et al. (2019) conducted a systematic review case study using an NLP toolkit (including lemmatization) to enhance search and identify relevant papers. They keep

query search but expand the search keywords with their synonyms (taken from WordNet). Marshall and Wallace (2019) propose a possible future application, a semantic search engine, which replaces plain keyword search with semantic queries. Several similar systems already exist in medical research, such as PubMed and Thalia.

2.2.2. NLP Techniques for Screening of Papers

Automatic Text Classification (ATC) is widely applied when screening papers. When considering paper selection as binary text classification, modern rule-based or machine learning based classifiers can be applied to this task. Typical supervised learning techniques require constructing a dataset with training and test data. However, supervised learning often suffers from the problem of an insufficiently large labeled dataset. A popular solution to this is semi-supervised learning (active learning) to reduce the cost of data collection, for example using a proxy Support Vector Machine classifier, where partial human annotation is involved (Kontonatsios et al., 2017; Miwa et al., 2014; Z. Yu and Menzies, 2019). Data for manual annotation are selected by certain criteria. The most recent research is FASTREAD from Z. Yu and Menzies (2019). FASTREAD sets a stopping criterion at a target recall (in their case approx. 95%) while reducing the manual review effort to 10%-20%. The authors of FASTREAD mainly address an existing problem, that of selection bias.
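The screening loop behind such active-learning tools can be sketched as follows. This is a toy illustration, not the FASTREAD implementation: the synthetic "papers", the feature vectors, the seed-labeling strategy and the 95% stopping threshold are all assumptions made here for demonstration.

```python
# Toy sketch of active-learning paper screening in the spirit of FASTREAD
# (Z. Yu and Menzies, 2019). All data and parameters here are illustrative.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic "papers": 200 feature vectors, a minority of which are relevant.
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 1.5).astype(int)  # hidden relevance labels

# Seed the labeled set with one known-relevant and a few irrelevant papers.
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
labeled = [pos[0]] + list(neg[:9])
pool = [i for i in range(len(y)) if i not in labeled]

target_recall = 0.95
found = {i for i in labeled if y[i] == 1}

while pool and len(found) / y.sum() < target_recall:
    clf = LinearSVC().fit(X[labeled], y[labeled])
    # Certainty sampling: review the pool paper most likely to be relevant.
    scores = clf.decision_function(X[pool])
    pick = pool.pop(int(np.argmax(scores)))
    labeled.append(pick)              # a human would read this paper here
    if y[pick] == 1:
        found.add(pick)

recall = len(found) / y.sum()
print(f"recall={recall:.2f} after reviewing {len(labeled)}/{len(y)} papers")
```

On this synthetic data the loop typically reaches the recall target after reviewing only a fraction of the pool, which is the effort reduction the cited work reports.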

2.2.3. NLP Techniques for Keywording and Generation of Classification Scheme

Compared to screening automation, fewer works focus on automating the process of producing the classification scheme. If an existing classification scheme is used, the whole process can be seen as similar to a text classification task. Eykens et al. (2019) conducted fine-grained disciplinary category classification on social science journals, using an existing classification scheme from digital libraries as labels for the articles. Traditional supervised machine learning methods were used and evaluated. However, an existing classification scheme does not always fit the needs of different scientific research domains. Updating or generating a new classification scheme from the selected papers is widely applied in most cases. During this process, text-mining techniques, such as topic modeling, keyword extraction and sequence-to-sequence labelling, can all help to extract key information from text. Terko et al. (2019) conducted conference paper classification using traditional machine learning methods, with labels generated from topic modeling. Kim and Gil (2019) applied k-means as an unsupervised clustering method for creating the classification scheme at document level, during which they extracted features from Latent Dirichlet Allocation (LDA) results, abstracts and author-provided keywords, followed by vectorising the document features using TF-IDF and further clustering using k-means. They evaluate their results on clusters without labeling them. Even though unsupervised frameworks and clustering methods provide solutions for identifying document categories, they are still not suitable for systematic mapping studies. First, they are single-faceted, one-dimensional clusterings, which means each article will be assigned to only one category. Second, inter-relations as well as hierarchical (ontological) ordering such as sub-topics/sub-categories are missing in such clustering. Osborne et al.
(2019) proposed a semi-supervised system for mapping studies. Their system starts with ontology learning over large scholarly datasets, then refines the ontology with the help of domain experts, and finally uses knowledge bases to automatically select and classify the primary studies. Their classification scheme is

generated by selecting several ontologies from author-provided keywords as categories, and identifying equivalent ontologies (based on relations learned in ontology learning) as they appear in abstracts, keywords and titles. Their approach shows higher precision in comparison with traditional LDA and TF-IDF for ranking topic terms. However, author-provided keywords are required as input to the system, as well as candidate ontologies, and keyword attributes appear to be sometimes missing when scraping papers.
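The TF-IDF plus k-means pipeline discussed in this section can be sketched in a few lines with scikit-learn. The four toy "abstracts" and the choice of k = 2 below are assumptions for illustration only, not the setup of any of the cited studies.

```python
# Hedged sketch: cluster document abstracts by TF-IDF features with k-means,
# loosely following the kind of pipeline described by Kim and Gil (2019).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

abstracts = [
    "neural networks for image classification",
    "deep convolutional networks for image recognition",
    "systematic review of software testing methods",
    "empirical study of software testing practice",
]

# TF-IDF rows are L2-normalised by default, so Euclidean k-means here
# behaves similarly to cosine-based clustering.
X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

The two image-related abstracts and the two software-testing abstracts end up in the same clusters, which is exactly the single-faceted, one-dimensional assignment criticised above: each document gets exactly one category.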

2.3. Word Embedding Representations

Word embedding represents a class of techniques in NLP, i.e. language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to real-valued vectors in a predefined vector space. In earlier research, one-hot encoding methods (e.g. bag-of-words) were widely applied to encode word occurrence information as numerical values, which often led to high sparsity and also suffered from producing purely frequency-based representations. Modern word embedding methods have been introduced based on the distributional hypothesis, which states that similar words appear in similar contexts (Sahlgren, 2005). Words or phrases from the vocabulary are mapped to a vector space of real values. Classic word embedding methods include neural networks, dimensionality reduction on a word co-occurrence matrix, probabilistic models, etc. Usually the quality of a word embedding is affected by the quality of the training corpus; in particular, neural networks and probabilistic models are sensitive to the size of the dataset. Therefore, pre-trained word embeddings and language models, trained on large-scale corpora, have been proposed as means to transfer external knowledge into downstream NLP tasks.

2.3.1. Word2Vec

Figure 2.3.: Word2vec (CBOW and skip-gram). Figure from Bilgin and Şentürk (2017).

Word2Vec, proposed by Mikolov et al. (2013), is one of the most popular techniques. It learns word embeddings from a context window around a certain word, so that words sharing common contexts in the corpus are located close to one another in the vector space. The continuous bag-of-words (CBOW) and continuous skip-gram architectures are the two models used by Word2Vec. GloVe1 improves on the architecture of Word2Vec by adding

1https://nlp.stanford.edu/projects/glove/

global statistics over a very large corpus. Fasttext (Bojanowski et al., 2017; Joulin et al., 2016) uses character-based n-gram features to solve out-of-vocabulary problems. Compared to Word2Vec, Fasttext captures both the semantic and morphological similarities between words.
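To make the two Word2Vec architectures concrete: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its context. A minimal sketch of how such (center, context) training pairs are derived from a context window follows; the sentence and window size are toy assumptions.

```python
# Illustrative sketch: deriving skip-gram (center, context) training pairs
# from a context window, as used by Word2Vec. Sentence and window are toy.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sent = ["unsupervised", "keyphrase", "extraction", "methods"]
print(skipgram_pairs(sent, window=1))
# e.g. ("keyphrase", "unsupervised") and ("keyphrase", "extraction")
# are the pairs generated for the center word "keyphrase"
```

A CBOW model would instead group the same window into (context set, center) examples; in both cases the shared-context structure of the corpus is what places similar words close together in the learned space.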

2.3.2. Contextual Word Embedding

Recent contextualized word embedding approaches, such as BERT and ELMo, have been shown to achieve state-of-the-art performance in many downstream tasks. The main idea behind them is to encode word embeddings conditioned on the word's context. That is to say, one particular word in different contexts will have different vectors. ELMo (Peters et al., 2018) contains a bi-directional Long Short-Term Memory based language model. The final vector representation is a concatenation of the two hidden states and the raw input word vector.

2.3.3. Knowledge Graph based Embedding

A knowledge graph is a semantic network representation, built upon an existing knowledge database in the form of entities and the relationships between entities. The knowledge-graph representation helps in structuring semantic information from unstructured text and can form the basis for computer-understandable semantics. WordNet2 is a well-known knowledge database of the English lexicon mapped onto sets of cognitive synonyms. Similarly, Microsoft Concept Graph3 links "IsA" relations between a concept and an instance based on statistical probability inference. However, Microsoft Concept Graph covers a large number of concepts in different domains. ConceptNet4 is an open, multilingual knowledge graph that relates the meanings of words and phrases. It is collected from a combination of expert-created resources and crowdsourcing (Speer and Lowry-Duda, 2017). ConceptNet includes more types of relations than "IsA"; an example can be seen in Figure 2.4. Faruqui et al. (2015) first proposed a graph-based learning technique to incorporate lexicon relations from WordNet into word embeddings, namely "retrofitting". Their technique demonstrates superior performance on word-similarity evaluations. Speer and Chin (2016) extended the retrofitting process and proposed the ConceptNet Numberbatch word embeddings. Using ConceptNet embeddings benefits from the fact that they carry semi-structured, common-sense knowledge from ConceptNet, giving external knowledge of word meaning beyond the purely textual context.
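The retrofitting idea can be sketched numerically: each word vector is iteratively pulled toward the average of its lexicon neighbours while staying close to its original distributional vector. The toy vectors, the two-entry lexicon, and the equal alpha/beta weights below are assumptions for illustration, not the setup of Faruqui et al. (2015).

```python
# Minimal numpy sketch of retrofitting (after Faruqui et al., 2015):
# nudge word vectors toward their neighbours in a lexicon graph.
import numpy as np

vectors = {"car": np.array([1.0, 0.0]),
           "automobile": np.array([0.0, 1.0]),
           "banana": np.array([-1.0, 0.0])}
lexicon = {"car": ["automobile"], "automobile": ["car"], "banana": []}

def retrofit(vectors, lexicon, alpha=1.0, beta=1.0, iters=10):
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iters):
        for w, nbrs in lexicon.items():
            if not nbrs:
                continue  # words without lexicon neighbours stay put
            # q_i = (alpha * q_hat_i + beta * sum of neighbour vectors)
            #       / (alpha + beta * |neighbours|)
            num = alpha * vectors[w] + beta * sum(new[n] for n in nbrs)
            new[w] = num / (alpha + beta * len(nbrs))
    return new

out = retrofit(vectors, lexicon)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "car" and "automobile" move closer together; "banana" is untouched
print(cos(out["car"], out["automobile"]), cos(out["car"], out["banana"]))
```

The synonyms start out orthogonal (cosine 0) and end up clearly similar, while the unrelated word keeps its original vector, which is the behaviour retrofitting is designed to produce.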

2.4. Automatic Keyword Extraction

Keyword extraction is considered a fundamental task in information extraction, aiming at identifying highly representative and relevant information from unstructured text data. Usually keywords can be used as features for downstream tasks, such as summarization, clustering, indexing, knowledge graph and taxonomy construction, and many other NLP applications. There are several similar tasks:

• Automatic Term Extraction: Automatic Term Extraction (ATE) tasks, such as terminology extraction from domain-specific corpora, are usually based on large-scale corpora. Aiming at extracting terms and ontologies, ATE has been widely applied in many knowledge acquisition processes. Linguistic features together

2https://wordnet.princeton.edu/ 3https://concept.research.microsoft.com/Home/Introduction 4https://conceptnet.io/

Figure 2.4.: Example of a "conceptnet" node from the website https://conceptnet.io/

with statistical methods help identify the characteristics of terminology in text. In many cases of automatic term extraction, a domain-specific glossary or lexicon is not available, or the target domain is newly explored so that the terminology is relatively new to external knowledge resources.

• Automatic keyword/keyphrase extraction: Keyword and keyphrase extraction both identify key information at document level. Generally, a keyword should be representative of and highly correlated with its source document. Firoozeh et al. (2020) comprehensively analyse the keyness properties of keywords in different scenarios.

• Concept Extraction: Concept extraction usually has a pre-defined tagging scheme for target concepts, such as named entities in general documents and medical concepts in clinical research. Augenstein et al. (2017) define three types of key concepts for scientific publications: TASK, PROCESS and MATERIAL, which cover the fundamental objects in scientific research.

A typical process of keyword extraction contains the five steps shown in Figure 2.5. Important properties which distinguish keywords from other term expressions are designed as features for both supervised and unsupervised methods, such as Term Frequency (TF), Inverse Document Frequency (IDF), position information, word co-occurrence, etc. IDF (equation 2.1) is calculated from the total number of documents n and the number of documents df(t) that contain term t. Terms frequent in many documents will likely have low IDF. Thus, the combined TF-IDF (equation 2.2) reduces the influence of common terms, such as "the".

idf(t) = log(n / df(t))    (2.1)

tf-idf(t, d) = tf(t, d) × idf(t)    (2.2)

Research from Hulth (2003) shows the effectiveness of leveraging lexical and morphological features of keywords, by extracting noun phrase chunks based on part-of-speech tags. Adding linguistic constraints increases the informativeness and readability of extracted keyphrases.

Figure 2.5.: Automatic keywords extraction process stages. (Figure from Merrouni et al. (2019))

Without the limitation of requiring a large set of annotated data, unsupervised systems apply scoring and ranking methods to candidate keywords. TF-IDF is a simple but effective scoring mechanism to filter out irrelevant words. However, purely frequency-based scoring may ignore infrequent yet domain-specific keywords.

TextRank (Mihalcea and Tarau, 2004) is a well-known graph-based keyword extraction method, using Google's PageRank website-ranking algorithm. PageRank builds a directed graph of websites, where nodes are connected when other websites contain hyperlinks referring to them. As a result, the more in-links a website has, the more important the website is. Similarly, TextRank represents a document as a word co-occurrence graph, where a part-of-speech filter is applied and words are restricted to nouns and adjectives. Edges between two word nodes are weighted by the number of times the corresponding words co-occur within a window of W words in the document. After extracting highly ranked words, keyphrases are generated from keywords adjacent in the original documents. TextRank has shown its effectiveness independently of domain and language.

Semantic information of words is rarely used in early keyword extraction systems, as it is usually difficult to model or measure. The development of embedding representations for words and text provides possibilities to measure semantic similarity. Semantic relatedness between each candidate and its source document can be calculated as the cosine similarity of their embedding representations (equation 2.3). Papagiannopoulou and Tsoumakas (2020) utilize averaged GloVe word embeddings as phrase vector and "theme vector"5.
Bennani-Smires et al. (2018) apply Doc2Vec and Sent2Vec for document and phrase representation. Sun et al. (2020) combine various contextualized word embedding methods with the SIF weighted sentence embedding model. In this thesis, we would like to further explore the performance of semantic knowledge-graph-based word embeddings, building on the work of SIFRank.

Sim(v_NP_i, v_d) = cos(v_NP_i, v_d) = (v_NP_i · v_d) / (‖v_NP_i‖ ‖v_d‖)    (2.3)

5They define it as the most important sentence in the document.

2.5. Term Clustering and Taxonomy

Taxonomy is the practice and science of classification of things or concepts. A taxonomy organizes a certain area in a tree-like structure, where each node represents a topic, an entity or a concept. The link between a node and its sub-nodes indicates a hypernymy relation between terms. As knowledge is represented in a structured format, taxonomies contribute to information systems, knowledge management and other semantic analysis tasks. A taxonomy has also been described as a classification mechanism in recent research. Using a taxonomy as a classification scheme is more suitable in immature or evolving domains, compared to classification with fixed classes (Usman et al., 2017). Traditional methods of taxonomy generation are based on hand-crafted work. The method proposed by Nickerson et al. (2013) is widely used (Szopinski et al., 2019) (see Figure 2.6), in which there are two types of process: inductive (empirical-to-conceptual) and deductive (conceptual-to-empirical). Manually curated taxonomies may have limited coverage or be unavailable in some domains and languages (Mao et al., 2018). Recently, more and more research focuses on automatic taxonomy generation, following the inductive process, by automatically identifying and grouping terms.

Figure 2.6.: The taxonomy development method proposed by Nickerson et al. (2013)

Pattern-based methods utilize lexico-syntactic features, like Hearst patterns (Hearst, 1992), to find (x, y) term pairs matching the is-a relation in text. One typical example is "NP such as NP". Such a hypernym-hyponym pair is represented in the constructed taxonomy tree as a parent-child node pair. Pattern-based methods show their effectiveness with high precision on large corpora (Wang et al., 2017). However, even though more kinds of lexical patterns have been manually designed to fit different corpora, these hand-crafted rules still suffer from low recall. Besides, a topical taxonomy does not have strict hypernym-hyponym relations; instead, each node in a topic taxonomy can be a group of terms representing a conceptual topic (Zhang et al., 2018).

Besides manually crafted rules for pattern matching, (semi-)supervised models can be used to train a classifier on a set of labeled hypernym term pairs (Kozareva and Hovy, 2010). Recent research treats this as a relation learning and relation classification task on term pairs, or improves hypernym relations in pre-trained word embeddings (Z. Yu et al., 2015). Clustering methods focus on grouping words/terms by the similarity of their representations. Liu et al. (2012) construct a taxonomy from keywords using hierarchical clustering. Luu et al. (2016) utilize a dynamic weighting neural network to learn term embeddings for taxonomic relation identification. By learning word hypernymy, more semantic information is encoded in the embedding vectors (Wang et al., 2017).

3. Methodologies

This chapter discusses the methodologies followed to build our system, describing the overall framework and the key components involved in each step.

3.1. Architecture Overview

Figure 3.1.: Overall framework in this thesis.

Our automation methods follow the pipeline of classification scheme generation proposed by Franzago et al. (2016). The pipeline is composed of two modules: (1) keyphrase extraction from titles and abstracts; (2) clustering of the extracted keyphrases to identify categories. The overall architecture of our system is shown in Figure 3.1. Semantic similarity is used as a measurement both in scoring the document relatedness of keyphrase candidates and in grouping semantically related words during clustering. We leverage external knowledge from pre-trained word embeddings for semantic similarity.

3.2. Embedding Representation

Our method needs embedding representations for both phrases and documents. In terms of document embedding, averaging the word embeddings of a text has been widely applied, though it ignores positional information. TF-IDF weighting considers the importance of words based on frequency. SIF (Arora et al., 2017) weighting further leverages word relatedness with different topics in a large-scale corpus. These two weighting methods can be combined with different pre-trained word embeddings, and there is no need for training or fine-tuning. The effectiveness of SIF weighting has been proven in various tasks, including the keyphrase extraction framework SIFRank (Sun et al., 2020), which we use in our method.

We specifically consider the choice of embedding method for phrase representation, as we need to measure phrase-level semantic similarity in the domain relevance scoring and semantic keyphrase clustering modules. Similar to sentence embedding, averaging all word tokens is widely used, though it cannot avoid polysemy issues. For example, "cloud" in "cloud computing" has a different meaning compared to "cloud" in its general sense. Researchers have attempted to train phrase embeddings by treating phrases as single units in a word embedding learning model. This approach is limited because it requires a pre-defined set of phrases and, more importantly, our corpus is not large enough to support training phrase embeddings for probability estimation (Section 4.1). ConceptNet Numberbatch1 is generated from a common-sense knowledge graph, and thus bridges n-gram phrases to standardized concepts. ConceptNet leverages external knowledge sources to enhance the semantic representations of words and phrases beyond the sentence context.

3.3. Keyphrases Extraction

Our keyphrase extraction module is built on SIFRank (Sun et al., 2020), a state-of-the-art embedding-based method. SIFRank follows the generic keyphrase extraction pipeline of candidate selection and candidate ranking. However, embedding-based keyphrase extraction only ranks keyphrases by keyphrase-document semantic similarity, indicating "document relevance", but overlooks other task-specific features which could also be important. Specifically, our keyphrase extraction module targets systematic mapping studies on domain-specific scientific articles. Extracted keyphrases should also be: (1) domain-specific and (2) well-formed (as scientific concepts or terminologies). We incorporate the SIFRank score to measure document relevance, together with two other scoring functions measuring domain relevance and phrase quality. The three scores, defining document relevance, domain relevance and phrase quality, are combined for candidate keyphrase ranking.

The domain relevance and phrase quality scoring methods are motivated by features used in domain-specific concept mining. Concept/phrase mining is the first and key step of ontology extraction for domain knowledge graph/taxonomy construction. Concept mining focuses on extracting ontology keywords and phrases at corpus level, regardless of local document relatedness, and usually requires a large-scale domain corpus or a highly relevant labeled corpus. The main idea of joint keyphrase and concept extraction is to improve the quality of the extracted keyphrases, which should be representative both in local documents and in the global domain of the systematic mapping.

3.3.1. Document Relevance Score

A keyword of a single document should have a strong connection with that document, following the two principles of exhaustivity and specificity established by the United Nations Educational, Scientific and Cultural Organization (UNESCO, 1975). A large number of automatic keyword extraction systems focus on incorporating features from word frequencies (TF-IDF ranking), position information (PositionRank (Florescu and Caragea, 2017), MultipartiteRank (Boudin, 2018)), word co-occurrence information (TextRank (Mihalcea and Tarau, 2004)), or combinations of the above. Keyphrases, as sets of keywords, can then be generated from highly ranked keywords.

With the help of embedding techniques, semantic distance measurement has been considered next. It is based on the principle that the closer a candidate vector is to the document vector, the closer their meanings are. Semantic distance calculated at document level is called global semantics, while local-level semantics focuses on specific sections of documents, such as titles. The effectiveness of semantic distance measurement has been proven by EmbedRank (Bennani-Smires et al., 2018) and SIFRank (Sun et al., 2020) on benchmark datasets. The SIFRank score is utilized in our model for measuring document relevance. SIFRank (Sun et al., 2020) is an unsupervised embedding-based keyphrase extraction model, which reaches state-of-the-art performance in keyphrase extraction for short documents. The framework of SIFRank is shown in Figure 3.2. There are two main characteristics of SIFRank:

1https://github.com/commonsense/conceptnet-numberbatch

1. Autoregressive pre-trained language model ELMo2: ELMo is a state-of-the-art deep contextualized word representation trained on a large corpus, proposed by AllenNLP (Peters et al., 2018). First, it is generated by a bi-directional LSTM model, which combines the raw word vector with two intermediate word vectors output by the LSTM layers. Second, different from traditional fixed word embeddings, ELMo contains contextual features of each word. Third, it is character-based, so morphological clues are included for out-of-vocabulary representation.

2. Sentence embedding model SIF: SIF (Smooth Inverse Frequency) was introduced by Arora et al. (2017) as a word weighting function for unsupervised sentence embedding generation. They derive the sentence embedding as a maximum likelihood estimate of the sentence topic, where a is a hyperparameter, |s| is the number of tokens in the given sentence, and f_w represents the frequency of word w:

v_s = (1/|s|) Σ_{w∈s} (a / (a + f_w)) v_w = (1/|s|) Σ_{w∈s} Weight(w) v_w    (1)
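The SIF weighted average can be sketched in a few lines of Python. This is a minimal illustration with names of our own choosing, not the authors' implementation; in particular, the full SIF model also subtracts the projection onto the first principal component of the sentence vectors, which is omitted here.

```python
def sif_sentence_embedding(tokens, word_vectors, word_freq, a=1e-3):
    """Weighted average of word vectors: each word is scaled by a / (a + f_w),
    so frequent words contribute less to the sentence embedding."""
    dim = len(next(iter(word_vectors.values())))
    total = sum(word_freq.values())
    vec = [0.0] * dim
    for w in tokens:
        f_w = word_freq.get(w, 0) / total          # estimated unigram probability
        weight = a / (a + f_w)
        wv = word_vectors.get(w)
        if wv is None:                             # skip out-of-vocabulary tokens
            continue
        vec = [v + weight * x for v, x in zip(vec, wv)]
    return [v / len(tokens) for v in vec]
```

With toy vectors, a frequent word such as "the" receives a near-zero weight while a rare domain term dominates the resulting sentence vector.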

Figure 3.2.: The framework of the SIFRank model (Sun et al., 2020).

2https://allennlp.org/elmo

Apart from scoring semantic relatedness, certain logical regions of scientific publications usually have a high correlation with representative keyphrases. Therefore, the score of each candidate is multiplied by a weight if the candidate appears in the title. The weight is defined by the token length of the candidate phrase, considering that the problem of nested candidates may lead to redundant terms.

3.3.2. Domain Relevance Score

Finding domain-specific terms is a challenge in domain-specific tasks, especially for newly evolved domains with few resources. With a domain-specific corpus, terms with high frequency in the target domain and low frequency in other domains can be considered domain-specific terms (Navigli and Velardi, 2002). Without a domain-specific corpus, dictionary-based validation can improve the representativeness of candidate keywords within the studied domain. Structured semantic resources (e.g. WordNet) help to utilize semantic relations, such as groups of synonyms or topic-based clusters, assuming that related terms are more likely to be important than isolated ones (Firoozeh et al., 2020). In general systematic mapping studies, a glossary dictionary and domain seed keyphrases can be given with the help of human experts. In this thesis, our domain glossary terms are selected from openly available knowledge graph databases: (1) the artificial intelligence knowledge graph3 (Dessì et al., n.d.): terms with a direct link to "artificial intelligence" are extracted; (2) the machine learning taxonomy from Aminer4 (Tang, 2016). All terms are pre-processed with lowercasing. Semantic similarity between candidates and glossary terms is calculated for relevance scoring. The detailed steps are:

Step 1. All candidate keyphrases and terms in the domain glossary are mapped to vectors using pre-trained word embeddings.

Step 2. For each candidate phrase, cosine similarity is calculated between the candidate and each term in the domain glossary.

Step 3. The domain relevance score of the candidate phrase is the average of the top 20% highest similarity scores over the glossary terms.
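The three steps above can be sketched as follows, a simplified illustration assuming candidate and glossary embeddings are plain Python vectors (the function names are ours):

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def domain_relevance(candidate_vec, glossary_vecs, top_fraction=0.2):
    """Average of the top 20% highest candidate-glossary similarities (Step 3)."""
    sims = sorted((cosine(candidate_vec, g) for g in glossary_vecs), reverse=True)
    k = max(1, int(len(sims) * top_fraction))
    return sum(sims[:k]) / k
```

Averaging only the top of the similarity distribution rewards candidates close to some region of the glossary, without penalizing them for being unrelated to most other glossary terms.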

3.3.3. Phrase Quality Score

Identifying collocations as informative semantic units has been emphasized in indexing, information retrieval, terminology extraction, concept identification and many other phrase-level tasks. Collocations are usually multi-word expressions whose words co-occur frequently and have a collective meaning different from each word alone. Various features have been explored to measure the degree of dependence for ranking candidates, including frequency, the Dice coefficient, the log-likelihood ratio, point-wise mutual information, the chi-squared test, etc. In our method, point-wise mutual information (PMI) and left-right information entropy are chosen to calculate the quality of extracted phrases, as their performance has been proven in domain-specific new concept discovery (Wan et al., 2017; J. Yu et al., 2019).

Point-wise mutual information (PMI) is a well-known measurement of the interdependence of two random variables. In Natural Language Processing, point-wise mutual information is extensively used to measure co-occurrence and association strength between two words, which can help identify semantic units such as word pairs or

3http://scholkg.kmi.open.ac.uk/
4https://www.aminer.cn/dataKnowledge-Graph-for-Machine-Learning

multi-word expressions. Good collocation pairs have high PMI because the probability of co-occurrence is only slightly lower than the probabilities of occurrence of each word inside the collocation. The point-wise mutual information score of word x and word y is defined as:

PMI(x; y) ≡ log₂ ( p(x, y) / (p(x) p(y)) )    (3.1)

where p(x) and p(y) indicate the probabilities of word x and word y in the corpus, and p(x, y) is the probability that word x and word y appear together. Following the independence hypothesis underlying mutual information, we use word frequency counts over the total corpus to estimate these probabilities. Besides bigrams, PMI can also be applied to longer n-gram collocations, by calculating the PMI scores of all two-segment splits of the n-gram and selecting the minimum. For example, the score of "explainable machine learning" is the minimum of PMI(x=explainable machine, y=learning) and PMI(x=explainable, y=machine learning).

If point-wise mutual information measures the internal consistency of the word tokens inside a phrase, then left-right information entropy shows the variety of word contexts of a candidate phrase. Information entropy was proposed by Shannon as a measure of disorder: larger entropy indicates more disorder and uncertainty of a variable. Research has shown that a meaningful concept in a corpus usually has a higher frequency and a higher degree of flexibility. The idea is that the words adjacent to a string (candidate phrase) will be widely distributed if the string is meaningful, and localized if the string is a sub-string of a meaningful string (Shimohata et al., 1997).

H(t) = − Σ_{w_i ∈ w_l} p(w_i | t) log₂ p(w_i | t)    (3.2)

where w_l represents the list of words adjacent to candidate phrase t. Entropy is calculated for both the left and right sides of phrase t, and the lower one is selected as the final information entropy score.
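Equations 3.1 and 3.2 can be made concrete with a short sketch over raw frequency counts (illustrative names; counts would come from the corpus):

```python
import math

def pmi(pair_count, count_x, count_y, total):
    """Equation 3.1 with probabilities estimated from frequency counts."""
    p_xy = pair_count / total
    return math.log2(p_xy / ((count_x / total) * (count_y / total)))

def side_entropy(neighbor_counts):
    """Equation 3.2 for one side: entropy of the adjacent-word distribution,
    given a dict mapping each neighboring word to its count."""
    n = sum(neighbor_counts.values())
    return -sum((c / n) * math.log2(c / n) for c in neighbor_counts.values())

def lr_entropy(left_counts, right_counts):
    """Left-right entropy: the lower of the two sides is the final score."""
    return min(side_entropy(left_counts), side_entropy(right_counts))
```

A phrase that always follows the same word on one side gets zero entropy on that side, flagging it as a likely fragment of a longer expression.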
The limitation is that point-wise mutual information and information entropy can only measure multi-word expressions, so uni-grams do not receive these two scores. Based on the fact that uni-grams appear far less often as gold keyphrases, the acronym set extracted in the earlier step is used to assign a quality score to uni-gram candidates. In detail, the initial quality score is 0 for each candidate t, and the score is accumulated when the candidate matches the conditions in Table 3.1.

Conditions                    Score
length of t < 2 or > 4        −0.5 × |length − 3|
PMI(t) > 2                    entropy(t)
t is an acronym in text       1

Table 3.1.: Conditions and score calculation for phrase quality

Here, length indicates the number of tokens in candidate t, and the entropy score has been normalized to [0, 1].
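A sketch of the accumulation rule in Table 3.1, with illustrative names of our own (`entropy_norm` is the left-right entropy already normalized to [0, 1]):

```python
def phrase_quality(candidate_tokens, pmi_score, entropy_norm, acronyms):
    """Accumulate the phrase quality score per the conditions in Table 3.1.
    pmi_score is None for uni-grams, which have no PMI."""
    score = 0.0
    n = len(candidate_tokens)
    if n < 2 or n > 4:                       # penalize very short or long phrases
        score -= 0.5 * abs(n - 3)
    if pmi_score is not None and pmi_score > 2:
        score += entropy_norm                # reward coherent, flexible collocations
    if " ".join(candidate_tokens) in acronyms:
        score += 1                           # uni-gram acronyms get credit instead
    return score
```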

3.4. Semantic Keyphrase Clustering

The goal of clustering is to identify distinct groups in a dataset and assign a group label to each data point. This module focuses on clustering keyphrases based on their semantic similarity, defined as the cosine similarity of their embedding representations. Two clustering algorithms are explored in this module.

3.4.1. Spherical k-means

Standard k-means (MacQueen et al., 1967) is a simple and classic centroid-based partitioning algorithm, which has been successfully applied to large and high-dimensional datasets in text mining. Standard k-means aims at minimizing the within-cluster squared-error criterion, the sum of squared Euclidean distances of all data points to their cluster centers:

E = (1/N) Σ_x ‖x − μ_k(x)‖²    (3.3)

k-means clustering is initialized with a pre-defined number of clusters k and k randomly chosen starting centers; each data point is then assigned the label of its nearest center. Optimization proceeds by updating each cluster center as the mean of all in-cluster data points and relocating each data point, until convergence or a maximum number of iterations. Studies have found cosine similarity effective in quantifying the semantic similarity of high-dimensional data such as word embeddings or text documents, because the direction of a vector is more important than its magnitude (Strehl et al., 2000). Spherical k-means is k-means on a unit hypersphere, where (1) all vectors are normalized to unit length and (2) the objective function minimizes the cosine distance between vectors. Compared to standard k-means, this characteristic matches the nature of the cosine similarity measure in the high-dimensional word embedding space. Zhang et al. (2018) illustrate that when using spherical k-means for topic detection, the center direction acts as a semantic focus on the unit sphere, and the member terms of that topic fall around the center direction to represent a coherent semantic meaning.
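A toy spherical k-means makes the difference from standard k-means concrete: points live on the unit sphere, assignment maximizes cosine similarity (the dot product of unit vectors), and centers are re-normalized mean directions. This is an illustrative sketch with naive first-k initialization, not the spherecluster implementation used in our experiments:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def spherical_kmeans(points, k, iters=50):
    """k-means on the unit hypersphere using cosine similarity."""
    pts = [normalize(p) for p in points]
    centers = [list(p) for p in pts[:k]]   # naive init: first k points
    labels = [0] * len(pts)
    for _ in range(iters):
        # assign each point to the center with the highest cosine similarity
        labels = [max(range(k),
                      key=lambda c: sum(a * b for a, b in zip(p, centers[c])))
                  for p in pts]
        # re-estimate each center as the normalized mean direction of its members
        for c in range(k):
            members = [p for p, l in zip(pts, labels) if l == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
            centers[c] = normalize(mean) if members else centers[c]
    return labels
```

Real implementations use careful seeding (e.g. k-means++-style) rather than the first k points.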

3.4.2. Hierarchical Agglomerative Clustering

Agglomerative clustering and divisive clustering are the two main types of hierarchical clustering algorithms. Agglomerative clustering, as bottom-up clustering, starts with each data point as an individual cluster and then merges sub-clusters into super-clusters based on a certain distance threshold. Divisive clustering is similar but works top-down. A distance measure is used to compute the (dis-)similarity between each pair of data points, and a linkage function is needed to link neighboring clusters together. Commonly used linkage functions are average linkage (average distance of all elements), single linkage (minimum pairwise distance between elements), Ward's linkage (Ward Jr, 1963) (minimizing the total within-cluster variance), etc. The output of hierarchical agglomerative clustering is a tree structure of clusters, called a dendrogram. Without a pre-defined number of clusters k, the ultimate output is one super-cluster containing all data points, while a cut of the hierarchical tree can be defined by a certain distance threshold.
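A minimal single-linkage agglomerative sketch with a distance-threshold cut (illustrative only; our experiments use scikit-learn's implementation):

```python
def agglomerative(points, threshold, dist):
    """Bottom-up clustering: repeatedly merge the two closest clusters
    (single linkage) until the smallest inter-cluster distance exceeds
    the threshold, i.e. cut the dendrogram at that height."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters
```

Swapping the `min` for a `max` or a mean over pairwise distances turns this into complete or average linkage, respectively.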

Figure 3.3.: Example of Hierarchical Agglomerative Clustering Dendrogram

4. Experimental Evaluation

4.1. Dataset

4.1.1. Scientific Publications Dataset

Data collection plays an important role in the initial stage of a systematic mapping, determining the quality and relevance of the subsequent studies. As discussed above, keyphrase extraction from scientific articles has long been researched with various methods and several benchmark datasets, such as the INSPEC, SemEval-2010 and SemEval-2017 datasets. However, systematic mapping studies require domain-specific keyphrases, and those benchmark datasets do not focus on a specific research domain. Therefore, we collected several publications and constructed an appropriate dataset that we used throughout our evaluation. In this chapter, we present the data creation procedure and an initial analysis of our dataset. As keywording follows the step of paper selection, we assume that the selected input articles under consideration for our framework are already in-domain. Based on this assumption, we extracted a set of scientific articles from IEEE Xplore1, one of the largest online digital libraries of scientific articles, on 3rd June, 2020. The selection of papers was constrained by querying the author-provided keywords below:

• "Explainable AI" • "Explainable Articial Intelligence" • "Explainable Machine Learning"

In total, 286 scientific publications were extracted together with their meta-data attributes; we name this the XAI dataset. The "title" and "abstract" of each article were combined as input text. An example data entry is shown in Figure 4.1.

                       total #count   average in text   average #tokens   average #count per paper
Non-Controlled terms   3200           88.84%            2.6181            11.1888
Controlled terms       1536           20.73%            2.1978            4.7727

Table 4.1.: Comparative analysis of Non-Controlled indexing terms and Controlled indexing terms.

                       mean freq   unique terms   freq > 1   percentage
Non-Controlled terms   3.4910      391            176        45.01%

Table 4.2.: Term frequency analysis of Non-Controlled indexing terms.

IEEE Xplore also provides INSPEC indexing terms assigned by human experts to represent the content of a publication. Before selecting labels as gold standard

1https://ieeexplore.ieee.org/Xplore/home.jsp

Figure 4.1.: An example of scientific papers with INSPEC Controlled Indexing and Non-Controlled Indexing. Phrases in bold are present in text.

keyphrases for evaluation, we performed an initial exploratory analysis of the two kinds of INSPEC indexing terms, including their frequency and token length. Table 4.2 indicates that 45.01% of the gold keyphrases appear only once in the corpus of the whole dataset. Figure 4.2 shows that the majority of gold keyphrases are bi-grams and tri-grams. We use the "INSPEC Non-Controlled Indexing terms" attribute as gold standard, because these terms appear more often in the documents (shown in Table 4.1).

4.1.2. Synthetic Dataset for Term Clustering

In order to evaluate the semantic keyphrase clustering module in detecting categories of keyphrases, concept ontologies from the Aminer Knowledge Graph of Data Mining2 and Knowledge Graph3 (Tang, 2016) were used for ground-truth evaluation. Both datasets are two-level knowledge graphs, where the first-level ontologies are used as labels and the second-level ontologies are used as data points.

2https://www.aminer.cn/dataAMiner-Knowledge-Graph-datamining
3https://www.aminer.cn/dataAMiner-Knowledge-Graph-kg

28 Figure 4.2.: Numbers of tokens of phrases in "Controlled indexing terms", "Non-controlled indexing terms" and "candidates keyphrases". Here candidates selection applies noun phrase chunking.

Knowledge Graph — Labels: knowledge technology; Terms: semantic (web) technology, cognitive science, artificial intelligence, information processing
Data Mining — Labels: time series data; Terms: time series analysis, data streams, computational modeling, reinforcement learning, dynamic databases, big data, clustering algorithms, heterogeneous data

Table 4.3.: Example of two synthetic clustering dataset.

4.2. Implementation and Tools

Pre-processing. For each document, title and abstract are concatenated as input text. Initial experiments explored the recall of extracted candidate keyphrases under different pre-processing methods, and showed that lowercasing and punctuation removal lead to worse performance in acronym extraction, tokenization and noun phrase chunking. Based on these experiments, noun phrases with dash tags are segmented in part-of-speech matching. Also, frequent but meaningless common words would otherwise be extracted, such as "several" and "many". For pre-processing, we therefore apply dash ("-") tag removal and an extended set of common stopwords4.

Candidate Extraction. Candidate extraction is built on the framework of the SIFRank5 model, where the tokenizer and POS tagger have been changed to the SpaCy6 library, a Python-based Natural Language Processing library. The noun phrase pattern (defined in 4.1) is captured by regular expressions and parsed into a constituency tree for pattern matching.

<NN.*|JJ>* <NN.*>    (4.1)

4Stopwords list from https://www.ranks.nl/stopwords.
5https://github.com/sunyilgdx/SIFRank
6https://spacy.io/
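The effect of pattern 4.1 can be approximated with a simple scan over POS-tagged tokens: take maximal runs of adjectives and nouns, then trim trailing adjectives so the candidate ends in a noun. A rough illustration only (not the actual SIFRank/SpaCy parsing code; tag names follow the Penn Treebank convention):

```python
def extract_noun_phrases(tagged):
    """Candidates matching <NN.*|JJ>*<NN.*>: a run of nouns/adjectives
    that ends in a noun. `tagged` is a list of (word, pos_tag) pairs."""
    phrases, run = [], []
    for word, tag in tagged + [("", "END")]:   # sentinel flushes the last run
        if tag.startswith("NN") or tag == "JJ":
            run.append((word, tag))
        else:
            # drop trailing adjectives so the phrase ends with a noun
            while run and not run[-1][1].startswith("NN"):
                run.pop()
            if run:
                phrases.append(" ".join(w for w, _ in run))
            run = []
    return phrases
```

Note this greedy longest-match keeps only maximal spans, whereas a chunker can be configured to emit nested candidates as well.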

Acronym Extraction. Acronym extraction is implemented directly using the AbbreviationDetector function in ScispaCy7. Considering that acronyms are case-sensitive, we run acronym extraction before pre-processing. The pre-trained model "en_core_sci_sm" is loaded into ScispaCy as the pipeline.

Candidate Ranking. Details of the candidate scoring process are given in Section 3. For the embedding representation, the latest version of pre-trained ConceptNet Numberbatch (ConceptNet Numberbatch 19.08, English version) is used. Our domain glossary terms are selected from openly available knowledge graph databases: (1) the artificial intelligence knowledge graph8 (Dessì et al., n.d.): terms with a direct link to the term "artificial intelligence" are extracted; (2) the machine learning taxonomy from Aminer9 (Tang, 2016).

Selection of Keyphrases. Before moving on to the clustering module, post-processing should be performed as a user control, to make sure the quality of the keyphrases matches the use case. Here we define a few rule-based post-processing steps:

1. Lemmatize keyphrases to remove redundant keyphrases due to language inflection. For example, both "neural network" and "neural networks" will be lemmatized to "neural network" and assigned the higher of the two scores.

2. Calculate the average rank of the lemmatized keyphrases across different documents. The initial keyphrase list is generated by selecting keyphrases ranked above 15. The purpose is to remove outliers.

3. If a selected keyphrase is identified as an acronym, it is replaced by its original definition in the text.

4. We remove the last 20% of keyphrases based on TF-IDF scores and PMI (described in Section 3.3.3).
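Step 1 (collapsing inflectional variants while keeping the higher score) can be sketched as follows, assuming a `lemmatize` function is supplied (e.g. SpaCy's lemmatizer); the helper name is ours:

```python
def merge_variants(scored_phrases, lemmatize):
    """Collapse keyphrases that share a lemmatized form, keeping the
    higher score among the variants (post-processing Step 1)."""
    best = {}
    for phrase, score in scored_phrases:
        key = " ".join(lemmatize(tok) for tok in phrase.split())
        if key not in best or score > best[key]:
            best[key] = score
    return best
```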

Clustering Algorithms. The clustering module is built on scikit-learn (Pedregosa et al., 2011) and spherecluster10. Before clustering, each term is transformed to its embedding representation from ConceptNet Numberbatch. For datasets where the number of clusters k is unknown, we first explore the performance of clustering between 5 and 300 clusters, then select the optimal k.

4.3. Evaluation Metrics

For the purpose of exploring text mining methods for automating keywording and classification scheme generation in systematic mapping studies, the ultimate performance is measured by two criteria: the reliability of the extracted keyphrases and the potential explainability of the categories generated from them. Performance evaluation is conducted separately on the keyphrase extraction module and the semantic keyphrase clustering module.

7https://github.com/allenai/scispacy
8http://scholkg.kmi.open.ac.uk/
9https://www.aminer.cn/dataKnowledge-Graph-for-Machine-Learning
10https://pypi.org/project/spherecluster/

4.3.1. Extraction Evaluation

Automatic keyphrase extraction evaluation is based on matching annotated gold standard keyphrases against the ranked list of extracted keyphrases. INSPEC non-controlled indexing terms are assigned by professional indexers and benefit from being extractable from the text. Also, compared to author-assigned keywords, indexing terms are generally more objective (Papagiannopoulou and Tsoumakas, 2020). Thus they are used as gold standard labels for evaluating keyphrase extraction, and the traditional statistical measures of Precision, Recall and F1-score are used:

Precision = |Retrieved Keywords ∩ Gold Keywords| / |Retrieved Keywords|    (4.2)

Recall = |Retrieved Keywords ∩ Gold Keywords| / |Gold Keywords|    (4.3)

F1 = 2 × Precision × Recall / (Precision + Recall)    (4.4)

Studies have explored approximate matching strategies to deal with relevant keywords that are morphological variants, or with extracted phrases that are sub-strings of gold keyphrases. In our method, removal of morphological variants of phrases is applied to both extracted phrases and gold standard keyphrases before evaluation.
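Equations 4.2-4.4 reduce to simple set arithmetic over the retrieved and gold keyword sets; a minimal exact-match sketch:

```python
def precision_recall_f1(retrieved, gold):
    """Exact-match evaluation per equations 4.2-4.4."""
    retrieved, gold = set(retrieved), set(gold)
    tp = len(retrieved & gold)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```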

4.3.2. Clustering Evaluation

As an exploratory analysis, evaluating the quality and performance of clustering is difficult without external gold standard data. Due to the limitation that our dataset from IEEE Xplore does not provide gold standard topic categories for documents and keyphrases, clustering evaluation is based on internal cluster quality measures, including the Silhouette Coefficient and the Davies-Bouldin Index.

Silhouette Coefficient

The Silhouette Coefficient measures the similarity and dissimilarity of all samples within each cluster, by calculating the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. For each data point i in its cluster C_i, a(i) is the average distance between point i and all other elements in C_i, and b(i) is the smallest mean distance from i to all points in any other cluster (i.e. the closest "neighboring cluster" of which i is not a member) (Wikipedia contributors, 2020). The silhouette coefficient of one data point i is defined as:

s(i) = (b(i) − a(i)) / max{a(i), b(i)},  if |C_i| > 1    (4.5)

The silhouette coefficient ranges from -1 to 1, where -1 means the sample point is closer to the neighboring cluster and 1 means the sample point is close to its assigned cluster. The overall clustering quality is measured by averaging the silhouette coefficients of all data points.

Davies-Bouldin Index

The Davies-Bouldin Index is another unsupervised metric for measuring separation among generated clusters. The index is defined as the average similarity between each cluster i, for i = 1, ..., k, and its most similar cluster j (Davies and Bouldin, 1979; Pedregosa et al., 2011). Similarity is calculated as:

R_ij = (s_i + s_j) / d_ij   (4.6)

where s_i is the average distance between each point of cluster i and the center of that cluster, and d_ij is the distance between the centers of clusters i and j. The final Davies-Bouldin index is defined as:

DB = (1/k) Σ_{i=1}^{k} max_{i≠j} R_ij   (4.7)

A lower Davies-Bouldin Index indicates better separation among clusters, where each cluster is positioned far from its neighboring clusters. The lowest possible score is zero.
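Following Eq. 4.6 and 4.7 term by term, a small reference implementation (a sketch over Euclidean centroids, not the thesis's code) looks like:

```python
# Davies-Bouldin index (Eq. 4.6-4.7): for each cluster, take the worst-case
# ratio R_ij = (s_i + s_j) / d_ij against every other cluster and average.
import math

def dist(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def centroid(pts):
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))

def davies_bouldin(clusters):
    """`clusters` is a list of point lists, one list per cluster."""
    cents = [centroid(c) for c in clusters]
    # s_i: average distance of each point to its cluster center
    scatter = [sum(dist(p, c) for p in pts) / len(pts)
               for pts, c in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k
```

For two clusters of intra-cluster scatter 1 whose centers are 10 apart, the index is (1+1)/10 = 0.2.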

Calinski-Harabasz Index

The Calinski-Harabasz Index (Caliński and Harabasz, 1974; Pedregosa et al., 2011), also called the Variance Ratio Criterion, is the ratio between the between-cluster dispersion and the within-cluster dispersion. The larger the between-cluster dispersion and the smaller the within-cluster dispersion, the better the quality of the clustering. For a dataset E of size N that has been clustered into k clusters, the Calinski-Harabasz score s is defined as the ratio of the between-cluster dispersion mean and the within-cluster dispersion (Pedregosa et al., 2011):

s = [tr(B_k) / tr(W_k)] × (N − k) / (k − 1)   (4.8)

where tr(B_k) is the trace of the between-group dispersion matrix and tr(W_k) is the trace of the within-cluster dispersion matrix, defined by:

W_k = Σ_{q=1}^{k} Σ_{x ∈ C_q} (x − c_q)(x − c_q)^T   (4.9)

B_k = Σ_{q=1}^{k} n_q (c_q − c_E)(c_q − c_E)^T   (4.10)

with C_q the set of points in cluster q, c_q the center of cluster q, c_E the center of E, and n_q the number of points in cluster q.
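Since only the traces of W_k and B_k are needed, Eq. 4.8-4.10 reduce to sums of squared Euclidean distances, which makes a compact sketch possible (again illustrative, not the thesis's implementation):

```python
# Calinski-Harabasz score (Eq. 4.8-4.10): ratio of between-cluster to
# within-cluster dispersion, scaled by (N - k) / (k - 1). The traces of
# the dispersion matrices reduce to sums of squared distances.
def sqdist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def centroid(pts):
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))

def calinski_harabasz(clusters):
    """`clusters` is a list of point lists, one list per cluster."""
    all_pts = [p for c in clusters for p in c]
    n, k = len(all_pts), len(clusters)
    c_global = centroid(all_pts)
    # tr(W_k): squared distances of points to their own cluster center
    within = sum(sqdist(p, centroid(c)) for c in clusters for p in c)
    # tr(B_k): size-weighted squared distances of centers to global center
    between = sum(len(c) * sqdist(centroid(c), c_global) for c in clusters)
    return (between / within) * (n - k) / (k - 1)
```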

Two other metrics for evaluation with ground truth labels are as follows:

Purity

Purity measures the best achievable accuracy of a cluster-to-class assignment. With ground-truth labels, each cluster is assigned to the class that is most frequent in the cluster, and the accuracy of this assignment is then measured by counting the number of correctly assigned documents and dividing by N (Manning et al., 2010).
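The majority-vote definition above translates directly into a few lines (a minimal sketch of the Manning et al. formulation):

```python
# Purity: assign each cluster its majority ground-truth class, then count
# the fraction of correctly assigned items (Manning et al., 2010).
from collections import Counter

def purity(pred_labels, true_labels):
    clusters = {}
    for p, t in zip(pred_labels, true_labels):
        clusters.setdefault(p, []).append(t)
    # majority class count per cluster, summed over clusters
    correct = sum(Counter(ts).most_common(1)[0][1]
                  for ts in clusters.values())
    return correct / len(true_labels)
```

Note that purity trivially reaches 1.0 when every item gets its own cluster, which is why it is reported alongside the Adjusted Rand Index.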

Adjusted Rand Index

The Rand Index measures the proportion of pairs of data points on which the ground-truth clusters and the assigned clusters agree. Given C as a ground-truth class assignment and K the clustering, a represents the number of pairs of elements that are in the same set in C and in the same set in K, and b represents the number of pairs of elements that are in different sets in C and in different sets in K:

RI = (a + b) / C(n_samples, 2)   (4.11)

where C(n_samples, 2) is the total number of pairs of samples. To correct for chance agreement, the Adjusted Rand Index is defined as:

ARI = (RI − E[RI]) / (max(RI) − E[RI])   (4.12)
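A compact sketch of both indices follows; the expected value E[RI] is computed with the standard permutation-model formula over contingency counts (this mirrors common library implementations, not code from the thesis):

```python
# (Adjusted) Rand Index (Eq. 4.11-4.12). RI counts agreeing pairs; ARI
# corrects for chance using the permutation-model expectation.
from collections import Counter
from itertools import combinations
from math import comb

def rand_index(pred, true):
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[i] == pred[j]) == (true[i] == true[j])
                for i, j in pairs)
    return agree / len(pairs)

def adjusted_rand_index(pred, true):
    n = len(pred)
    sum_ij = sum(comb(v, 2) for v in Counter(zip(pred, true)).values())
    sum_a = sum(comb(v, 2) for v in Counter(pred).values())
    sum_b = sum(comb(v, 2) for v in Counter(true).values())
    expected = sum_a * sum_b / comb(n, 2)    # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Both measures are invariant to a permutation of cluster IDs: relabeling the clusters does not change the score.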

5. Results

This chapter presents and compares experimental results. Experiments are conducted under different settings for comparative analysis: (1) ensemble scoring and ranking methods for keyphrase extraction; (2) phrase embedding representation; (3) clustering analysis.

5.1. Keyphrase Extraction

For keyphrase extraction, three base models are selected for comparison. One is TextRank1 (Mihalcea and Tarau, 2004), a graph-based keyword extraction module. The other two are SIFRank-ELMo and SIFRank-ConceptNet, state-of-the-art embedding-based keyphrase extraction models that differ in the underlying pre-trained word embedding representation. Our ensemble method for keyphrase extraction combines the scores from the three base models with a domain relevance score and a phrase quality score, and then re-ranks the candidates.

5.1.1. Candidate Selection

The two-step keyphrase extraction framework typically starts by selecting candidates for ranking. The quality of the selected candidates determines the upper limit of extraction performance, although few studies investigate the influence of candidate selection. In keyword and keyphrase extraction research, TextRank is widely used as a baseline model due to its convenience and accessibility, but our experiments show that it does not perform as well on our dataset. We investigate whether pre-processing impacts candidate quality for our base models; in our method, pre-processing of the text consists of dash tag removal. SIFRank uses Part-of-Speech tag sequence matching to detect noun phrases. TextRank generates keyphrases from its ranked word graph, so we set N=50 in TextRank when extracting keyphrases as candidates; note that there are sometimes not enough keyphrases to generate.

                                      - pre-process           + pre-process
                                   TextRank  Noun Phrase   TextRank  Noun Phrase
total #num of candidates               4280        11202      11176        11039
average #num of candidates per doc     14.0         39.2       39.1         38.6
precision                            0.3380       0.1891     0.1973       0.2114
recall                               0.4521       0.6619     0.6891       0.7294

Table 5.1.: Analysis of candidate selection in base models. "- pre-process" means without dash tag removal and "+ pre-process" means with dash tag removal.

Table 5.1 shows that noun phrase chunking is more suitable for candidate selection on scientific data, in line with the fact that keyphrases of scientific publications mostly appear as noun phrases and multi-word expressions. Pre-processing not only improves the recall of keyphrases for both candidate selection methods, but also reduces the number of wrong candidates in noun phrase chunking. For TextRank, pre-processing substantially increases the number of candidates and the recall of keyphrases. Therefore,

1 Implemented with the pke Python library, https://github.com/boudin/pke

considering that generic keyword extraction methods such as TextRank usually focus on single words, it is necessary to include proper pre-processing of the text when fitting generic methods to a specific document type.
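The noun-phrase candidate selection used by SIFRank can be illustrated with a toy POS-pattern matcher. This sketch assumes pre-tagged tokens supplied by hand; a real pipeline would use an actual POS tagger and a richer pattern, so the tag set and pattern here are simplifying assumptions.

```python
# Toy noun-phrase candidate selection: match the POS pattern
# (ADJ|NOUN)* NOUN over pre-tagged tokens. Tags are supplied by hand
# for illustration; a real pipeline would run a POS tagger first.
import re

def noun_phrase_candidates(tagged_tokens, max_len=4):
    # one character per token: A = adjective, N = noun, x = other
    tags = "".join("A" if t == "JJ" else "N" if t.startswith("NN") else "x"
                   for _, t in tagged_tokens)
    candidates = set()
    for m in re.finditer(r"[AN]*N", tags):
        start, end = m.span()
        if end - start <= max_len:
            phrase = " ".join(w for w, _ in tagged_tokens[start:end])
            candidates.add(phrase.lower())
    return candidates

sent = [("Deep", "JJ"), ("neural", "JJ"), ("networks", "NNS"),
        ("learn", "VB"), ("hierarchical", "JJ"), ("features", "NNS")]
```

On `sent`, the greedy match yields the two maximal noun phrases "deep neural networks" and "hierarchical features" as candidates.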

5.1.2. Candidate Ranking

We first analyze the impact of title-weighted ranking (defined via the document relevance score) on the three base models, shown in Table 5.2. Extraction performance is evaluated with precision, recall and F1-score of the top-N keyphrases. Among the three base models, SIFRank-ELMo performs best for the top-5 and top-15 keyphrases, while TextRank performs best for the top-10. With the title-weighted score, extraction results show improvement for all three base models, which indicates that the logical region of keyphrase position carries important information in scientific publications. Moreover, the title-weighted SIFRank score with ELMo embeddings outperforms the other models, though the improvement is not significant.

                                Top5                      Top10                     Top15
                        P       R       F1        P       R       F1        P       R       F1
TextRank              0.4986  0.2228  0.3080    0.4411  0.3941  0.4162    0.3791  0.5066  0.4337
TextRankCF            0.5035  0.2250  0.3110    0.4495  0.4016  0.4242    0.3866  0.5166  0.4422
SIFRank-ELMo          0.5105  0.2281  0.3153    0.4327  0.3866  0.4083    0.3803  0.5072  0.4347
SIFRankCF-ELMo        0.5245  0.2344  0.3240    0.4582  0.4094  0.4324    0.3962  0.5284  0.4529
SIFRank-ConceptNet    0.5049  0.2256  0.3119    0.4257  0.3803  0.4017    0.3679  0.4906  0.4205
SIFRankCF-ConceptNet  0.5210  0.2328  0.3218    0.4407  0.3938  0.4159    0.3758  0.5013  0.4296

Table 5.2.: Comparison of the three extraction methods with their title-weighted ranking (e.g. TextRankCF uses the title-weighted score for ranking).

Our ensemble keyphrase extraction method extends the base models by combining a domain relevance score and a phrase quality score. The final ranking score is the weighted sum of the three scores, where the weights can be assigned by users. In this thesis, we optimized the weights on the evaluation data and set the weights to 0.1 for both domain relevance and phrase quality. Table 5.3 shows that the ensemble methods outperform their base models, with SIFRank-ConceptNet-ensemble performing best among all methods for the top-5 and top-10 keyphrases, though it drops slightly below SIFRank-ELMo-ensemble for the top-15 keyphrases.

                                       Top5                      Top10                     Top15
                               P       R       F1        P       R       F1        P       R       F1
TextRank                     0.4986  0.2228  0.3080    0.4411  0.3941  0.4162    0.3791  0.5066  0.4337
TextRank-ensemble            0.5287  0.2363  0.3266    0.4641  0.4147  0.4380    0.4022  0.5375  0.4601
SIFRank-ELMo                 0.5105  0.2281  0.3153    0.4327  0.3866  0.4083    0.3803  0.5072  0.4347
SIFRank-ELMo-ensemble        0.5357  0.2394  0.3309    0.4746  0.4241  0.4479    0.4166  0.5556  0.4762
SIFRank-ConceptNet           0.5049  0.2256  0.3119    0.4257  0.3803  0.4017    0.3679  0.4906  0.4205
SIFRank-ConceptNet-ensemble  0.5587  0.2497  0.3451    0.4830  0.4316  0.4559    0.4152  0.5538  0.4746

Table 5.3.: Comparison of keyphrase extraction results of the ensemble methods with the three base models.
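The weighted re-ranking step can be sketched as below. This is one plausible reading of the ensemble formula (base score plus 0.1 × domain relevance plus 0.1 × phrase quality); the score dictionaries are purely illustrative, not values from the thesis.

```python
# Ensemble re-ranking sketch: final score = base ranking score plus
# weighted domain-relevance and phrase-quality scores (weights 0.1 each,
# as tuned in the thesis). All score values below are made up.
def rerank(base, domain_rel, quality, w_d=0.1, w_q=0.1):
    final = {p: s + w_d * domain_rel.get(p, 0.0) + w_q * quality.get(p, 0.0)
             for p, s in base.items()}
    return sorted(final, key=final.get, reverse=True)

base = {"black box": 0.52, "explanation method": 0.55,
        "feature attribution": 0.50}
domain = {"feature attribution": 0.9, "black box": 0.7}
quality = {"feature attribution": 0.8, "explanation method": 0.2}
ranking = rerank(base, domain, quality)
```

Here "feature attribution" overtakes "explanation method" once domain relevance and phrase quality are added, illustrating how the extra scores demote generic terms.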

From the example of top-15 extracted keyphrases (in Figure 5.1), adding domain relevance and phrase quality reduces the rank of single words ("method", "logic", "explanation") as well as of terms with abstract meanings ("explanation method"). However, it still cannot avoid the problem of nested keyphrases with similar meanings ("black box decision making" and "black box") or wrong candidates from selection ("method outperforms").

Figure 5.1.: Example of top-15 extracted keyphrases.

5.2. Word Embedding

Word embedding techniques address the problem of encoding semantic and syntactic information and of measuring semantic similarity. A large number of studies have demonstrated the effectiveness of language-model-based word embeddings for encoding context information, such as the ELMo embeddings used in SIFRank (Sun et al., 2020). In our keyphrase extraction experiments, the SIFRank-ELMo base model also slightly outperforms the SIFRank-ConceptNet model (Table 5.2). We further use the pre-trained embeddings to encode words at the phrase level. For ConceptNet embeddings, each phrase is segmented by longest-matching terms in the embedding index and encoded as the average of their embedding vectors. Since ELMo encodes phrases token by token, we take the mean vector of all tokens in a phrase.
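The ConceptNet phrase-encoding strategy above (longest-match segmentation, then vector averaging) can be sketched as follows. The two-dimensional toy vocabulary is an illustrative assumption; real Numberbatch vectors are 300-dimensional.

```python
# Phrase encoding sketch: segment a phrase by longest matching entries in
# the embedding vocabulary, then average the segment vectors. The tiny
# 2-d vocabulary below is illustrative only.
def segment_longest_match(words, vocab):
    """Greedy left-to-right longest-match segmentation over word n-grams."""
    segments, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):      # try longest span first
            cand = "_".join(words[i:j])
            if cand in vocab or j == i + 1:     # fall back to single word
                segments.append(cand)
                i = j
                break
    return segments

def embed_phrase(phrase, vocab):
    segs = segment_longest_match(phrase.lower().split(), vocab)
    vecs = [vocab[s] for s in segs if s in vocab]
    if not vecs:
        return None                             # fully out-of-vocabulary
    dim = len(vecs[0])
    return tuple(sum(v[d] for v in vecs) / len(vecs) for d in range(dim))

vocab = {"machine_learning": (1.0, 0.0), "algorithm": (0.0, 1.0)}
vec = embed_phrase("machine learning algorithm", vocab)   # → (0.5, 0.5)
```

"machine learning algorithm" is segmented into "machine_learning" and "algorithm", so its vector is the mean of the two entries.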

ConceptNet + cosine similarity                                   ELMo + cosine similarity
model reasoning process                       0.755504           equation                                      0.81263
probabilistic learning methods                0.754907           differential equations                        0.786096
nonparametric machine learning methodology    0.74829            sylvester equation                            0.776896
machine learning algorithms                   0.745002           equations                                     0.776227
incremental hierarchical topic modeling algorithm  0.743017      estimator                                     0.767971
ontology reasoning process                    0.742769           thermal wind equation                         0.762029
probabilistic methodology                     0.742552           kinematic equations                           0.760027
machine learning methodologies                0.741665           thermal wind equations                        0.757065
machine learning methods                      0.741245           interpretable machine learnig algorithm       0.740843
analytical programming decision model         0.740591           confusion matrix                              0.739952
computational models                          0.740437           pearson correlation coefficient               0.739863
heuristic algorithm                           0.740187           fuzzy algorithm                               0.728873
inference model                               0.739704           taylor approximation                          0.728192
evolutionary computation technique            0.739564           apriori algorithm                             0.725151
computational approach                        0.739382           skelton kernel principal component analysis   0.725124
probabilistic model                           0.739312           argumentation theory                          0.724358
logic models                                  0.739127           decision problem                              0.723329
learning analytics system                     0.737555           sugeno type fuzzy inference model             0.719519
learning process                              0.736319           prediction difference method                  0.718589
algorithmic process                           0.735854           inference model                               0.717841

Table 5.4.: Comparison of top candidates by domain relevance score using ELMo embeddings and ConceptNet Numberbatch embeddings. Domain glossaries are from AI and Machine Learning.

Table 5.4 shows highly scored candidates from the domain relevance score, where the domain glossaries used are from the domains of Artificial Intelligence and Machine Learning. We find that, on very short text (e.g. concept terms, phrases), the pre-trained word embeddings and language models do not perform as well as on general long documents. Though both word embeddings can capture candidates highly relevant to the domain glossaries, the candidates ranked by the ELMo embedding cannot deal with outliers, such as "thermal wind equation" and "kinematic equations". The better quality of the domain relevance score also leads to a larger improvement in the ensemble model based on SIFRank-ConceptNet (Table 5.3).

                              Run time (including loading embeddings)
SIFRank-ELMo-ensemble                        1650.81
SIFRank-ConceptNet-ensemble                   258.13

Table 5.5.: Running time (in seconds) of the two keyphrase extraction methods. Note: the execution time of phrase quality scoring is not included, because it is computed at corpus level and applies equally to both methods.

We further compare the execution times of the ELMo-based and ConceptNet-based methods (Table 5.5), where ELMo requires roughly six times more time than ConceptNet. This is because ELMo is contextualized and generates embeddings by running a language model over the full text, while ConceptNet provides fixed word embeddings. Our earlier keyphrase extraction results do not show a large difference between the ELMo-based and ConceptNet-based methods; the reason might be that our method focuses on phrase-level semantic representation and that the corpus is domain-specific.

5.3. Clustering

In the clustering module, each keyphrase is treated as an independent ontological concept term. In phrase-level clustering, phrases are grouped together based on the cosine similarity of their embeddings. In our experiments, the keyphrases selected in the previous step are encoded with ConceptNet Numberbatch embeddings, because language-model-based embeddings (e.g. ELMo) require context to mine phrase semantics, and ELMo also requires much longer processing time. Two clustering algorithms are evaluated in our clustering module: spherical k-means and hierarchical agglomerative clustering. For the clustering experiments on the XAI dataset, keyphrases are selected from the best model in keyphrase extraction, with keyphrase post-processing and cleaning as discussed in Section 4.2. The DM data and KG data have labels from the original knowledge graph, so we can use external evaluation metrics for clustering (Adjusted Rand Index, Purity). The following settings of the two algorithms are tested:

• SP-k-means: spherical k-means.

• HAC: hierarchical agglomerative clustering, average linkage, cosine distance.
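Spherical k-means differs from standard k-means only in working on the unit sphere: vectors are length-normalized, assignment uses cosine similarity, and centroids are renormalized means. A minimal sketch under those assumptions (naive first-k initialization, not the thesis's or any library's implementation):

```python
# Minimal spherical k-means: normalize vectors to unit length, assign by
# cosine similarity (dot product on unit vectors), and renormalize the
# cluster mean as the new centroid.
import math

def unit(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return tuple(x / norm for x in v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def spherical_kmeans(vectors, k, iters=50):
    data = [unit(v) for v in vectors]
    centroids = data[:k]          # naive init: first k points
    labels = [0] * len(data)
    for _ in range(iters):
        # assign each vector to the most cosine-similar centroid
        labels = [max(range(k), key=lambda c: dot(x, centroids[c]))
                  for x in data]
        # recompute centroids as renormalized cluster means
        for c in range(k):
            members = [data[i] for i, l in enumerate(labels) if l == c]
            if members:
                centroids[c] = unit([sum(col) for col in zip(*members)])
    return labels
```

A production run would use k-means++-style seeding and multiple restarts; the point here is only the cosine-based assignment and renormalization steps.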

For the DM data and KG data, the number of clusters n matches their number of labels. For the XAI dataset, however, we do not know the best number of clusters, so we first conduct an exploratory analysis for both hierarchical agglomerative clustering and spherical k-means to identify the optimal n. The three internal evaluation metrics (introduced in Section 4.3.2) describe how well the clustering separates the data points: higher Silhouette and Calinski-Harabasz scores indicate better clustering quality, while the Davies-Bouldin Index requires a lower score for better quality.

Figure 5.2.: Results of silhouette coefficient with n clusters.
Figure 5.3.: Results of Calinski-Harabasz score with n clusters.
Figure 5.4.: Results of Davies-Bouldin Index with n clusters.

In Figure 5.4, the Davies-Bouldin Index does not reach its lowest score for either method, and in Figure 5.3 the curves show opposite trends as the number of clusters increases. Referring to the silhouette score in Figure 5.2, the curve for agglomerative clustering does not reach a peak within the range of 300 clusters, while spherical k-means reaches its highest score at 75 clusters. Moreover, both Figure 5.2 and Figure 5.3 show that spherical k-means performs slightly better than agglomerative clustering, except for the Davies-Bouldin Index in Figure 5.4.

                          Silhouette  Davies-Bouldin  Calinski-Harabasz  Adjusted Rand Index  Purity
XAI data  SP-k-means          0.1825          2.3423            15.4498                    -       -
          HAC                 0.0847          1.9726             8.7997                    -       -

Table 5.6.: Clustering analysis on the XAI publications dataset.

However, internal evaluation measures inevitably have limitations. In general, scores can vary across algorithms due to their clustering objective functions; these measures tend to favor convex clusters, so algorithms producing non-convex clusters (such as density-based methods) would score lower. Also, they are calculated only from the distribution of the data itself, so high scores on an internal measure do not necessarily translate into effective information retrieval applications (Cambridge, 2009). Moreover, the centroid distance calculation of the Davies-Bouldin Index limits the distance metric to Euclidean space, while our clustering is based on cosine distance. We set n=75 for the two clustering algorithms and evaluate clustering performance with internal metrics (shown in Table 5.6). Theoretically, the silhouette score ranges from -1 to 1, where a score close to 1 indicates better separation among clusters. However, both

clustering algorithms fall short of an ideal silhouette score, indicating unsatisfactory separation between the resulting clusters. We analyze some example cluster results from spherical k-means, as shown in Table 5.7. Terms within one cluster share similar sub-words, such as "learning" in cluster 0 and "fuzzy" in cluster 9. Sometimes terms with the same sub-word do share a central meaning (e.g. clusters 9, 11 and 55), while in many cases terms with the same adjective but different nouns are entirely unrelated (e.g. clusters 46 and 51). Moreover, the central "word" of one cluster can take on different meanings in different compounds, as the meaning of "network" differs between "knowledge_network" and "network_cybersecurity".

Cluster 0: learning_framework, active_learning, advanced_learning, learning_preference, topological_learning, deep_learning_algorithm, deep_learning_model, social_interaction, deep_learning_method, unsupervised_learning_algorithm, problem_solving, unsupervised_learning, deep_learning_outcome, deep_learning_application
Cluster 10: knowledge_network, network_interpretability, network_edge, network_issue, convolutional_network, autoencoder_network, input_network, identification_network, network_cybersecurity, transparent_network, network_flow, network_alignment, iiot_network, network_depth
Cluster 9: fuzzy_method, fuzzy_classifier, fuzzy_algorithm, fuzzy_property, fuzzy_rule, fuzzy_methodology, interpretable_fuzzy_system, inference_time, fuzzy_clustering, fuzzy_predicate, causal_reasoning, fuzzy_relation, fuzzy_inference, symbolic_reasoning
Cluster 11: detection_system, detection_accuracy, attack_detection, robust_detection, speech_detection, detection_methodology, trustworthy_detection, community_detection, object_detection_system, detection_trustworthiness, object_detection, malware_detection, anomaly_detector, effective_detector
Cluster 25: visual_data, visual_interaction, set_visualization, visual_structure, visual_aspect, visual_feedback, visual_interface, visual_collaboration, online_visualization, visual_medium, interactive_visualization
Cluster 46: high_efficiency, large_datasets, large_range, operational_level, high_correlation, high_likelihood, high_return, large_volume, intermediate_level, large_scale, traffic_volume
Cluster 51: complex_scenario, complex_interaction, complex_policy, complex_environment, complex_relationship, complex_event, complex_response, synthesis_action, synthesis_interaction, complex_event_recognition, complex_event_processing
Cluster 55: real_user, user_understanding, interest_user, user_requirement, user_action, user_study, end_user, user_trust, explicit_user, web_user, internet_service

Table 5.7.: Example cluster-wise results of spherical k-means clustering on the XAI publications dataset.

We conducted a second clustering experiment using terms from public taxonomies, with their upper nodes as ground-truth labels (described in Section 4.1). We use the Adjusted Rand Index and Purity (described in Section 4.3.2) to measure clustering results against the true labels. For both measures, higher scores mean better performance; the Adjusted Rand Index is close to 0 when the labeling is random and independent of the number of clusters. The results in Table 5.8 indicate unsatisfactory performance in identifying the categories of terms, though spherical k-means still reaches better separation of the data points. The internal evaluation scores of the ground-truth labels show silhouette scores below 0, indicating that data points lie around the edges of different clusters, so overlap among the clusters is inevitable. Distance-based clustering methods may not be suitable for identifying overlapping clusters.

                      Silhouette  Davies-Bouldin  Calinski-Harabasz  Adjusted Rand Index  Purity
DM data  Labels          -0.0502          3.1962             3.4238                    -       -
         SP-k-means       0.1511          2.4739             7.1402               0.1711  0.4502
         HAC              0.0714          2.0008             3.5055               0.0251  0.2830
KG data  Labels          -0.0409          3.0323             4.0789                    -       -
         SP-k-means       0.1413          2.7066             6.9495               0.0740  0.3821
         HAC              0.0857          1.9510             2.1505              -0.0040  0.2830

Table 5.8.: Clustering analysis on the DM dataset and KG dataset.

Overall, we analyze the possible reasons for the unsatisfactory results. First, there may be a limitation in the embedding representation of terms. The semantic-network-based word embedding (ConceptNet) was selected on the consideration that semantic meaning is encoded for each term as a concept unit. However, fine-tuning ConceptNet embeddings would require a network of domain-specific ontologies; we do not apply fine-tuning in our research, so the pre-trained embeddings have limited discriminative power in a specific domain. Second, there is a limitation related to the clustering algorithms. Generic clustering algorithms assume that data points can be separated, and the internal evaluations likewise measure the separation of clusters. We notice that the ground-truth clusters overlap in the embedding space, so the clustering results are not in line with their true labels in the semantic taxonomy.

6. Conclusion

Our research built a combined framework of unsupervised keyphrase extraction and semantic term clustering. To answer the question of how well these two modules can help in systematic mapping studies, various experiments were conducted using publications from the domain of "Explainable Artificial Intelligence (XAI)". In detail, we examined the ensemble ranking scores, ConceptNet word embeddings and clustering performance. Our methodology starts from text pre-processing, proceeds through state-of-the-art unsupervised keyphrase extraction (candidate selection and candidate ranking) and the selection of highly ranked keyphrases, and ends with semantic term clustering. We further examined clustering performance on terms from domain taxonomies.

The keyphrase extraction results demonstrate the effectiveness of ensemble ranking scores from different perspectives, such as position information, domain knowledge and statistical measures of phrase quality. Phrase quality measures effectively identify multi-word expressions with high probability in the target corpus. Domain knowledge (in the form of glossaries and a domain corpus) finds highly relevant terms, which can further be considered as constraints and external resources for weak supervision. We also found that text pre-processing matters for candidate selection quality, as noise would otherwise propagate to subsequent steps; performance is quite sensitive to dash tag removal, so cleaning of the text is necessary and essential.

Comparing different pre-trained embedding representations, we found that ConceptNet-based word embeddings perform as well as contextualized word embeddings, with much less execution time. These findings can further guide the choice of a suitable word embedding method for a given task and use case.

In the analysis of the semantic keyphrase clustering module, we demonstrated that generic clustering methods can group similar terms together, but with limited quality. We notice that the grouped terms are mostly morphologically similar, with the same or similar sub-words, and that the ground-truth term clusters overlap with each other in the word embedding space. Thus, upper-level semantic relations between terms can hardly be identified and grouped by clustering methods. In original systematic mapping studies, keywords are selected and clustered according to the background knowledge of human experts; encoding semantic meaning in a machine-understandable way has always been a challenging task. This thesis project concludes with ideas and applications for automatic keywording and classification schemes. We still suggest the participation of human experts in refining keywords and selecting high-quality keyword clusters based on use cases. With minimal user control, the combined system can potentially address the research purpose of systematic mapping studies while reducing human effort.

6.1. Future Work

Due to time constraints, several ideas are left to be explored in future work. State-of-the-art unsupervised keyphrase extraction is limited by the quality of candidate selection, so generative models may help to identify unseen but representative keyphrases. Another idea is to incorporate ontology extraction and linking methods in order to further refine the keyphrases.

In terms of semantic term clustering, though we have demonstrated that ConceptNet works better for term-level semantic representation thanks to its large semantic concept network, it still covers a limited number of terms as ontology concepts. For a new domain and for domain-specific terms, finding a better semantic representation is still necessary. Keyphrase analysis can also be explored in other ways, such as clustering on a semantic graph network or a co-occurrence network. Above all, we strongly hope that our research offers a new perspective on automating the keywording and classification scheme steps of systematic mapping studies, towards faster and more convenient solutions in an open research knowledge era.

A. Clustering Results

Table A.1 shows a larger set of cluster results for semantic term clustering using spherical k-means. Each row represents a cluster, where the selected terms are ranked by their distance to the cluster center.

1: computational_intelligence, neural_network, artificial_intelligence, artificial_neural_network
2: xai_technique, visualization_technique, optimisation_technique, storytelling_technique
3: ai_algorithm, ai_technique, ai_system, ai_process
4: artificial_intelligence_scientist, artificial_intelligence_research, artificial_intelligence_technology, artificial_intelligence_field
5: decision_support_system, efficient_decision_support_system, comprehensive_decision_support_system, computerised_decision_support_system
6: fuzzy_system, hierarchical_fuzzy_system, fuzzy_system_complexity, evolutionary_fuzzy_system
7: quality_assessment, quality_evaluation_assurance_level, quality_criterion, quality_monitoring
8: human_robot, human_robot_interaction, human_robot_collaboration, human_robot_team
9: deep_learning_technique, deep_learning_method, learning_technique, deep_learning_approach
10: practical_application, application_testing, specific_application, successful_application
11: machine_learning_method, machine_learning_technique, traditional_machine_learning_method, machine_learning_methodology
12: anomaly_detection, hyperspectral_anomaly_detection, anomaly_detection_problem, anomaly_detector
13: contextual_knowledge, rich_contextual_knowledge, cognitive_context_knowledge, dnn_knowledge
14: interpretable_classifier, explainable_classifier, interpretable_ml_classifier, interpretable_machine_learning
15: potential_risk, risk, potential_vulnerability, potential_investment
16: algorithm, apriori_algorithm, heuristic_algorithm, evolutionary_algorithm
17: multivariate_time_series_data, time_series_data, continuous_time_series_data, time_series_analysis
18: xai_method, method_outperforms, vae_method, practical_method
19: statistical_analysis, empirical_analysis, outlier_analysis, comparative_analysis
20: research_challenge, critical_research_challenge, open_research_challenge, challenging_research_problem
21: object_detection, object_detection_system, object_detection_framework, interpretable_object_detection
22: high_level_feature, high_level_image_feature, high_level_policy, high_level_visual_attribute
23: emotion_understanding, intuitive_understanding, emotion_recognition, emotion_expression
24: predictive_model, prediction_model, model_prediction, reliable_predictive_model
25: cyber_security_application, security_application, cyber_security, cyber_physical_system_security
26: quantitative_evaluation, qualitative_evaluation, empirical_evaluation, quantitative_threat_assessment
27: autoencoder_network, convolutional_network, input_network, discriminator_network
28: spatial_structure, spatial_structure_pattern, implicit_spatial_structure, spatial_structural_information
29: visual_analytics, visual_analytics_workflow, visual_analytics_tool, visual_analytics_framework
30: white_box, white_box_solution, white_box_method, black_box_decision_making
31: multi_scale_segmentation, regional_multi_scale_method, regional_multi_scale_concept, regional_multi_scale_approach
32: system_design, holistic_system_design, design_methodology, feedforward_design_methodology
33: classification_algorithm, classification_dataset, binary_classification, node_classification
34: deep_neural_network_model, neural_network_model, deep_neural_network_prediction, deep_neural_network_software
35: software, opensource_software, software_system, software_analytics
36: adversarial_attack, adversarial_attack_algorithm, adversarial_approach, attack_strategy
37: scientific_community, research_community, industrial_testing_community, intelligent_community
38: decision_process, decision_making_process, decision_finding_process, algorithmic_decision_making_process
39: semantic_interaction, semantic_interaction_foraging, interaction_schema, collaborative_semantic_inference
40: task_understanding, visual_understanding_task, speech_understanding_task, learning_task
41: user_understanding, understanding_user_expectation, user_cognition, explicit_user
42: visual_representation, main_visual_representation, deep_visual_representation, vectorial_representation

Table A.1.: (Parts of) cluster-wise results of semantic term clustering using spherical k-means on the XAI publications dataset.

Bibliography

Ahmad, Aakash and Muhammad Ali Babar (2016). “Software architectures for robotic systems: A systematic mapping study”. Journal of Systems and Software 122, pp. 16– 39. issn: 0164-1212. doi: https://doi.org/10.1016/j.jss.2016.08.039. url: http: //www.sciencedirect.com/science/article/pii/S0164121216301479. Arora, Sanjeev, Yingyu Liang, and Tengyu Ma (2017). “A Simple but Tough-to-Beat Baseline for Sentence Embeddings”. Augenstein, Isabelle, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew McCallum (2017). “SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientic Publications”. CoRR abs/1704.02853. arXiv: 1704.02853. url: http://arxiv.org/abs/1704.02853. Bennani-Smires, Kamil, Claudiu Musat, Martin Jaggi, Andreea Hossmann, and Michael Baeriswyl (2018). “EmbedRank: Unsupervised Keyphrase Extraction using Sentence Embeddings”. ArXiv abs/1801.04470. Bilgin, M. and İ. F. Şentürk (2017). “ on Twitter data with semi- supervised Doc2Vec”. In: 2017 International Conference on Computer Science and Engineering (UBMK), pp. 661–666. Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov (2017). “En- riching word vectors with subword information”. Transactions of the Association for Computational Linguistics 5, pp. 135–146. Boudin, Florian (2018). “Unsupervised Keyphrase Extraction with Multipartite Graphs”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). New Orleans, Louisiana: Association for Computational Linguistics, June 2018, pp. 667–672. doi: 10.18653/v1/N18-2105. url: https://www.aclweb.org/anthology/ N18-2105. Caliński, Tadeusz and Harabasz JA (1974). “A Dendrite Method for Cluster Analysis”. Communications in Statistics - Theory and Methods 3 (Jan. 1974), pp. 1–27. doi: 10.1080/03610927408827101. Cambridge, UP (2009). “Introduction to information retrieval”. Carver, J. C., E. Hassler, E. 
Hernandes, and N. A. Kraft (2013). “Identifying Barriers to the Systematic Literature Review Process”. In: 2013 ACM / IEEE International Symposium on Empirical Software Engineering and Measurement. Oct. 2013, pp. 203– 212. doi: 10.1109/ESEM.2013.28. Davies, David L and Donald W Bouldin (1979). “A cluster separation measure”. IEEE transactions on pattern analysis and machine intelligence 2, pp. 224–227. Dessı, Danilo, Francesco Osborne, Diego Reforgiato Recupero, Davide Buscaldi, Enrico Motta, and Harald Sack (n.d.). “AI-KG: an Automatically Generated Knowledge Graph of Articial Intelligence” (). Eykens, Joshua, Raf Guns, and Tim CE Engels (2019). “Article Level Classication of Publications in Sociology: An Experimental Assessment of Supervised Machine Learning Approaches”. In: Proceedings of the 17th conference of the International Society for Scientometrics and Informetrics. Vol. 1, pp. 738–743. Faruqui, Manaal, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith (2015). “Retrotting Word Vectors to Semantic Lexicons”. In: Proceedings of NAACL.

Febrero, Felipe, Coral Calero, and Mª Ángeles Moraga (2014). “A Systematic Mapping Study of Software Reliability Modeling”. Information and Software Technology 56.8, pp. 839–849. issn: 0950-5849. doi: 10.1016/j.infsof.2014.03.006. url: http://www.sciencedirect.com/science/article/pii/S0950584914000676.
Feldman, Ronen, James Sanger, et al. (2007). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
Feng, L., Y. K. Chiam, and S. K. Lo (2017). “Text-Mining Techniques and Tools for Systematic Literature Reviews: A Systematic Literature Review”. In: 2017 24th Asia-Pacific Software Engineering Conference (APSEC). Dec. 2017, pp. 41–50. doi: 10.1109/APSEC.2017.10.
Firoozeh, Nazanin, Adeline Nazarenko, Fabrice Alizon, and Béatrice Daille (2020). “Keyword extraction: Issues and methods”. Natural Language Engineering 26.3, pp. 259–291.
Florescu, Corina and Cornelia Caragea (2017). “PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents”. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, July 2017, pp. 1105–1115. doi: 10.18653/v1/P17-1102. url: https://www.aclweb.org/anthology/P17-1102.
Franzago, Mirco, Davide Di Ruscio, Ivano Malavolta, and Henry Muccini (2016). “Protocol for a Systematic Mapping Study on Collaborative Model-Driven Software Engineering”. CoRR abs/1611.02619. arXiv: 1611.02619. url: http://arxiv.org/abs/1611.02619.
Hearst, Marti A. (1992). “Automatic Acquisition of Hyponyms from Large Text Corpora”. In: COLING 1992 Volume 2: The 15th International Conference on Computational Linguistics. url: https://www.aclweb.org/anthology/C92-2082.
Hulth, Anette (2003). “Improved automatic keyword extraction given more linguistic knowledge”. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216–223.
Joulin, Armand, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov (2016). “FastText.zip: Compressing text classification models”. arXiv preprint arXiv:1612.03651.
Kim, Sang-Woon and Joon-Min Gil (2019). “Research paper classification systems based on TF-IDF and LDA schemes”. Human-centric Computing and Information Sciences 9.1, p. 30.
Kontonatsios, Georgios, Austin J Brockmeier, Piotr Przybyła, John McNaught, Tingting Mu, John Y Goulermas, and Sophia Ananiadou (2017). “A semi-supervised approach using label propagation to support citation screening”. Journal of Biomedical Informatics 72, pp. 67–76.
Kozareva, Zornitsa and Eduard H. Hovy (2010). “A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web”. In: EMNLP.
Liu, Xueqing, Yangqiu Song, Shixia Liu, and Haixun Wang (2012). “Automatic taxonomy construction from keywords”. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1433–1441.
Luu, Anh Tuan, Yi Tay, Siu Cheung Hui, and See Kiong Ng (2016). “Learning Term Embeddings for Taxonomic Relation Identification Using Dynamic Weighting Neural Network”. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 403–413. doi: 10.18653/v1/D16-1039. url: https://www.aclweb.org/anthology/D16-1039.

MacQueen, James et al. (1967). “Some methods for classification and analysis of multivariate observations”. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1. 14. Oakland, CA, USA, pp. 281–297.
Manning, Christopher, Prabhakar Raghavan, and Hinrich Schütze (2010). “Introduction to information retrieval”. Natural Language Engineering 16.1, pp. 100–103.
Mao, Yuning, Xiang Ren, Jiaming Shen, Xiaotao Gu, and Jiawei Han (2018). “End-to-End Reinforcement Learning for Automatic Taxonomy Induction”. In: ACL.
Marshall, Iain J and Byron C Wallace (2019). “Toward systematic review automation: a practical guide to using machine learning tools in research synthesis”. Systematic Reviews 8.1, p. 163.
Merrouni, Zakariae Alami, Bouchra Frikh, and Brahim Ouhbi (2019). “Automatic keyphrase extraction: a survey and trends”. Journal of Intelligent Information Systems, pp. 1–34.
Mihalcea, Rada and Paul Tarau (2004). “TextRank: Bringing Order into Text”. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Barcelona, Spain: Association for Computational Linguistics, July 2004, pp. 404–411. url: https://www.aclweb.org/anthology/W04-3252.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean (2013). “Efficient estimation of word representations in vector space”. arXiv preprint arXiv:1301.3781.
Miwa, Makoto, James Thomas, Alison O’Mara-Eves, and Sophia Ananiadou (2014). “Reducing systematic review workload through certainty-based screening”. Journal of Biomedical Informatics 51, pp. 242–253. issn: 1532-0464. doi: 10.1016/j.jbi.2014.06.005. url: http://www.sciencedirect.com/science/article/pii/S1532046414001439.
Mohammed, Nabil M., Mahmood Niazi, Mohammad Alshayeb, and Sajjad Mahmood (2017). “Exploring software security approaches in software development lifecycle: A systematic mapping study”. Computer Standards & Interfaces 50, pp. 107–115. issn: 0920-5489. doi: 10.1016/j.csi.2016.10.001. url: http://www.sciencedirect.com/science/article/pii/S0920548916301155.
Navigli, Roberto and Paola Velardi (2002). “Semantic Interpretation of Terminological Strings”.
Nickerson, Robert C, Upkar Varshney, and Jan Muntermann (2013). “A method for taxonomy development and its application in information systems”. European Journal of Information Systems 22.3, pp. 336–359.
Olorisade, Babatunde K, Ed de Quincey, Pearl Brereton, and Peter Andras (2016). “A critical analysis of studies that address the use of text mining for citation screening in systematic reviews”. In: Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering, pp. 1–11.
Osborne, Francesco, Henry Muccini, Patricia Lago, and Enrico Motta (2019). “Reducing the Effort for Systematic Reviews in Software Engineering”. ArXiv abs/1908.06676.
Papagiannopoulou, Eirini and Grigorios Tsoumakas (2020). “A review of keyphrase extraction”. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10.2, e1339.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). “Scikit-learn: Machine Learning in Python”. Journal of Machine Learning Research 12, pp. 2825–2830.
Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer (2018). “Deep contextualized word representations”. In: Proc. of NAACL.

Petersen, Kai, Robert Feldt, Shahid Mujtaba, and Michael Mattsson (2008). “Systematic Mapping Studies in Software Engineering”. Proceedings of the 12th International Conference on Evaluation and Assessment in Software Engineering 17 (June 2008).
Petersen, Kai, Sairam Vakkalanka, and Ludwik Kuzniarz (2015). “Guidelines for conducting systematic mapping studies in software engineering: An update”. Information and Software Technology 64, pp. 1–18.
Sahlgren, Magnus (2005). “An introduction to random indexing”. In: Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering.
Shimohata, Sayori, Toshiyuki Sugio, and Junji Nagata (1997). “Retrieving collocations by co-occurrences and word order constraints”. In: 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pp. 476–481.
Speer, Robert and Joshua Chin (2016). “An ensemble method to produce high-quality word embeddings”. arXiv preprint arXiv:1604.01692.
Speer, Robyn and Joanna Lowry-Duda (2017). “ConceptNet at SemEval-2017 Task 2: Extending Word Embeddings with Multilingual Relational Knowledge”. CoRR abs/1704.03560. arXiv: 1704.03560. url: http://arxiv.org/abs/1704.03560.
Strehl, Alexander, Joydeep Ghosh, and Raymond Mooney (2000). “Impact of Similarity Measures on Web-page Clustering”. In: Workshop on Artificial Intelligence for Web Search (AAAI 2000). AAAI, pp. 58–64.
Sun, Y., H. Qiu, Y. Zheng, Z. Wang, and C. Zhang (2020). “SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-Trained Language Model”. IEEE Access 8, pp. 10896–10906.
Szopinski, Daniel, Thorsten Schoormann, and Dennis Kundisch (2019). “Because your taxonomy is worth it: Towards a framework for taxonomy evaluation”. In: Proceedings of the Twenty-Seventh European Conference on Information Systems (ECIS), Stockholm. Vol. 2019.
Tang, Jie (2016). “AMiner: Toward understanding big scholar data”. In: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 467–467.
Terko, Ajša, Emir Žunić, and Dženana Ðonko (2019). “NeurIPS Conference Papers Classification Based on Topic Modeling”. In: 2019 XXVII International Conference on Information, Communication and Automation Technologies (ICAT). IEEE, pp. 1–5.
Usman, Muhammad, Ricardo Britto, Jürgen Börstler, and Emilia Mendes (2017). “Taxonomies in software engineering: A systematic mapping study and a revised taxonomy development method”. Information and Software Technology 85, pp. 43–59. issn: 0950-5849. doi: 10.1016/j.infsof.2017.01.006. url: http://www.sciencedirect.com/science/article/pii/S0950584917300472.
Vakkuri, V. and P. Abrahamsson (2018). “The Key Concepts of Ethics of Artificial Intelligence”. In: 2018 IEEE International Conference on Engineering, Technology and Innovation (ICE/ITMC). June 2018, pp. 1–6. doi: 10.1109/ICE.2018.8436265.
Wan, Jing, Lidong Xing, Shuwu Zhang, and Wei Liang (2017). “Concept Discovery of Specific Field based on Conditional Random Field and Information Entropy”.
Wang, Chengyu, Xiaofeng He, and Aoying Zhou (2017). “A short survey on taxonomy learning from text corpora: Issues, resources and recent advances”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1190–1203.
Ward Jr, Joe H (1963). “Hierarchical grouping to optimize an objective function”. Journal of the American Statistical Association 58.301, pp. 236–244.
Wieringa, Roel, Neil Maiden, Nancy Mead, and Colette Rolland (2006). “Requirements engineering paper classification and evaluation criteria: a proposal and a discussion”. Requirements Engineering 11.1, pp. 102–107.

Wikipedia contributors (2020). Silhouette (clustering) — Wikipedia, The Free Encyclopedia. [Online; accessed 6-August-2020]. url: https://en.wikipedia.org/w/index.php?title=Silhouette_(clustering)&oldid=954469735.
Yu, Jie, Rongrong Chen, Lingyu Xu, and Dongdong Wang (2019). “Concept extraction for structured text using entropy weight method”. In: 2019 IEEE Symposium on Computers and Communications (ISCC), pp. 1–6.
Yu, Zhe and Tim Menzies (2019). “FAST2: An intelligent assistant for finding relevant papers”. Expert Systems with Applications 120, pp. 57–71.
Yu, Zheng, Haixun Wang, Xuemin Lin, and Min Wang (2015). “Learning term embeddings for hypernymy identification”. In: Twenty-Fourth International Joint Conference on Artificial Intelligence.
Zdravevski, Eftim, Petre Lameski, Vladimir Trajkovik, Ivan Chorbev, Rossitza Goleva, Nuno Pombo, and Nuno M Garcia (2019). “Automation in systematic, scoping and rapid reviews by an NLP toolkit: a case study in enhanced living environments”. In: Enhanced Living Environments. Springer, pp. 1–18.
Zhang, Chao, Fangbo Tao, Xiusi Chen, Jiaming Shen, Meng Jiang, Brian Sadler, Michelle Vanni, and Jiawei Han (2018). “TaxoGen: Unsupervised topic taxonomy construction by adaptive term embedding and clustering”. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2701–2709.
