Automated Extraction of Feature and Variability Information from Natural Language Requirement Specifications

Automated Extraction of Feature and Variability Information from Natural Language Requirement Specifications D I S S E R TA T I O N zur Erlangung des akademischen Grades Doktoringenieur (Dr.-Ing.) angenommen durch die Fakultät für Informatik der Otto-von-Guericke-Universität Magdeburg von M.Sc. Yang Li geb. am 14.06.1987 in Hebei, China Gutachterinnen/Gutachter Prof. Dr. Gunter Saake Prof. Dr. Andreas Nürnberger Prof. Dr. Rick Rabiser Magdeburg, den 30.10.2020 ii Abstract Software Product Lines support structured reuse of software artifacts to realize the maintenance and evolution of the typically large number of variants, which promotes the industrialization of software development, especially for software-intensive prod- ucts. Feature and variability information extraction from different artifacts is an indispensable activity to support the systematic integration of single software sys- tems and software product line. However, for a legacy system, it is non-trivial to gain information about commonalities and differences of the variants. Beyond manually extracting commonalities and variabilities, a variety of approaches, such as feature location in source code and feature extraction in requirements, has been proposed to provide automatic identification of features and their variation points. Compared with source code, requirements contain more complete variability information and provide traceability links to other artifacts from early development phases. In this thesis, we provide a systematic literature review, which contains a multi-dimensional overview of feature extraction approaches from natural language documents. Based on the observations from studies, we provide feasible and accurate approaches to improve the efficiency of feature extraction. To achieve this goal, we first explore the application of deep learning technologies in feature extraction. Second, we propose a hybrid approach based on multiple natural language processing and data mining techniques to extract features and variability information. Third, in order to provide understandable notations for features, we propose an approach combining keyword extraction and machine learning methods to predict feature-related terms. Fourth, we apply the proposed feature extraction approaches to analyze the requirements from a real-world scenario in practice, where we adjust the framework and combine other algorithms in terms of the specialities of real-world requirements. We empirically present how our proposed approaches can be used to extract features and variation points, while results show the usage of the proposed approaches can benefit the extraction process. Zusammenfassung Software-Produktlinien unterstutzen¨ die strukturierte Wiederverwendung von Soft- ware Artefakten, um die Wartung und Weiterentwicklung der normalerweise großen Anzahl von Varianten zu realisieren, was die Industrialisierung der Softwareen- twicklung insbesondere fur¨ softwareintensive Produkte fördert. Die Extraktion von Feature und Variabilitätsinformationen aus verschiedenen Artefakten ist eine un- verzichtbare Aktivität, um die systematische Integration einzelner Softwaresysteme und Software-Produktlinie zu unterstutzen.¨ Fur¨ ein Altsystem ist es jedoch nicht trivial, Informationen uber¨ Gemeinsamkeiten und Unterschiede der Varianten zu erhalten. Neben dem manuellen Extrahieren von Gemeinsamkeiten und Variabil- itäten wurden vielfältige Ansätze vorgeschlagen, z. B. die Position von Features im Quellcode und die Extraktion von Features in Anforderungen, um Features und ihre Variationspunkte automatisch zu identifizieren. Im Vergleich zum Quellcode enthal- ten die Anforderungen umfassendere Variabilitätsinformationen und bieten Ruck-¨ verfolgbarkeitsverknupfungen¨ zu anderen Artefakten aus fruhen¨ Phasen der Softwa- reentwicklung. In dieser Arbeit bieten wir eine systematische Literaturrecherche, die einen multidimensionalen Uberblick¨ uber¨ Ansätze zur Feature-Extraktion aus Doku- menten in naturlicher¨ Sprache enthält. Basierend auf den Beobachtungen aus dieser Studie schlagen wir praktikable und genaue Ansätze zur Verbesserung der Effizienz der Feature-Extraktion vor. Um dieses Ziel zu erreichen, untersuchen wir zunächst die Anwendung von Deep-Learning-Technologien bei der Feature-Extraktion. Zweit- ens schlagen wir einen hybriden Ansatz vor, der auf mehreren Techniken zur Verar- beitung naturlicher¨ Sprache und Data-Mining basiert, um Informationen von Fea- ture und Variabilität zu extrahieren. Daruber¨ hinaus präsentieren wir einen Ansatz, der Schlusselwortextraktion¨ und Methoden des maschinellen Lernens kombiniert, um feature-bezogene Termini vorherzusagen, damit verständliche Notationen fur¨ Features bereitgestellt werden können. Schließlich wenden wir die zuvor präsen- tierten Ansätze zur Feature-Extraktion an, um die Anforderungen aus einem realen Szenario in der Praxis zu analysieren, wobei wir das Framework anpassen und andere Algorithmen im Hinblick auf die Besonderheiten realer Anforderungen kombinieren. Empirisch präsentieren wir, wie von uns gestellte Ansätze verwendet werden können, um Features und Variationspunkte zu extrahieren. Zugleich zeigen die Ergebnisse, dass die Verwendung dieser Ansätze dem Extraktionsprozess zugutekommen kann. Acknowledgements I would like to express my deepest gratitude to Gunter Saake. He gave me the op- portunity to pursue my Ph.D. under his supervision, provided an excellent research environment, and gave me the freedom to choose my research direction. I would like to thank Sandro Schulze for his valuable and constructive suggestions during the planning and development of my research. His attentive guidance, comments, and feedback improved my academic writing skills. During the last four years, we had numerous fruitful discussions that had a major impact on my research. His willingness to give his time so generously has been very much appreciated. I would like to offer my special thanks to all my colleagues in our team without whom I could not achieve fruitful discussions. The advice given by them has been a great help in my research. I am particularly grateful for the assistance given by Xiao Chen, Sebastian Krieter, Jacob Kruger,¨ Wolfram Fenske, David Broneske, Juliana Alves Pereira, Mustafa Al-Hajjaji, Gabriel Campero Durand, Fabian Benduhn, Jens Meinicke, and Anja Buch. They not only supported me in my research, but also helped me solve the problems encountered in life. I would also like to thank all the researchers I met on the path of my research. In particular, I thank Thomas Fogdal, Helene Scherrebeck, Jiahua Xu, Stefania Gnesi, and Laura Semini. The collaboration, feedback, and comments from them had a great impact on my research. Moreover, I would like to thank Andreas Nurnberger¨ and Rick Rabiser for being the reviewers of my thesis. Last but not least, I want to thank my girlfriend Bowen who brings a lot of joy and happiness to our life. I am very grateful to my parents and brother for their selfless support and love. Contents List of Figures xiv List of Tables xv List of Acronyms xvii 1 Introduction 1 1.1 Goal of the Thesis . .2 1.2 Structure of the Thesis . .3 2 Background 5 2.1 Software Product Line Engineering . .5 2.1.1 Domain Engineering . .5 2.1.2 Application Engineering . .8 2.1.3 Feature Model . .9 2.1.4 Gap between SPL and Traditional Software Reuse . 10 2.2 Natural Language Processing . 11 2.2.1 Preprocessing . 11 2.2.2 WordEmbedding......................... 13 2.2.3 Recognizing Textual Entailment . 15 2.3 Summary ................................. 16 3 Current Research on Feature and Variability Extraction 17 3.1 Review Methodology . 18 3.1.1 Need for a Review . 18 3.1.2 Research Questions . 18 3.1.3 Search Strategy . 19 3.1.4 Conducting the Review . 20 3.2 Results................................... 22 3.2.1 Results of Studies Search . 22 3.2.2 Answering Research Questions . 26 3.3 Discussion................................. 35 3.4 ThreatstoValidity ............................ 36 3.5 RelatedWork ............................... 36 3.6 Summary ................................. 37 4 An Initial Self-Learning Structure for Feature Extraction 39 4.1 Methodology ............................... 41 x Contents 4.1.1 Overview ............................. 41 4.1.2 Laplacian Eigenmaps . 42 4.1.3 Convolutional Neural Network . 43 4.1.4 Clustering . 45 4.2 Preliminary Result . 45 4.2.1 Discussion . 45 4.3 RelatedWork ............................... 47 4.4 Summary ................................. 48 5 VarMine: Reverse Engineering Variability in A Hybrid Way 49 5.1 VarMine in a Nutshell . 50 5.2 Semantic Similarity Network . 51 5.2.1 Word Level Similarity . 51 5.2.2 Requirement Level Similarity . 52 5.3 Feature and Variability Extraction . 55 5.3.1 Feature Extraction . 55 5.3.2 Optionality and Group Constraints Detection . 58 5.3.3 Cross-Tree Constraints Detection . 59 5.4 Evaluation................................. 61 5.4.1 Research Questions . 62 5.4.2 Case Study Description . 62 5.4.3 Clustering Evaluation . 64 5.4.4 Feature Model Evaluation . 65 5.4.5 Comparison with SOVA and ArborCraft . 71 5.4.6 Answering RQs . 74 5.5 ThreatstoValidity ............................ 75 5.6 RelatedWork ............................... 75 5.7 Summary ................................. 77 6 The Inference of the Notions of Features 79 6.1 Methodology ............................... 81 6.1.1 Dataset Generation . 81 6.1.2 Dataset Preprocessing . 85 6.1.3 Training Process . 87 6.2 Evaluation................................. 88

Automated Extraction of Feature and Variability Information from Natural Language Requirement Specifications

10 Oriented Principal Component Analysis for Feature Extraction

Feature Selection/Extraction

Feature Extraction (PCA & LDA)

Feature Extraction for Image Selection Using Machine Learning

Time Series Feature Extraction for Industrial Big Data (Iiot) Applications

Machine Learning Feature Extraction Based on Binary Pixel Quantiﬁcation Using Low-Resolution Images for Application of Unmanned Ground Vehicles in Apple Orchards

Towards Reproducible Meta-Feature Extraction

Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data Robert T

Feature Extraction Using Dimensionality Reduction Techniques: Capturing the Human Perspective

Unsupervised Feature Extraction for Reinforcement Learning

Alignment-Based Topic Extraction Using Word Embedding

Deep Learning Feature Extraction Approach for Hematopoietic Cancer Subtype Classiﬁcation