Text Similarity Analysis for Test Suite Minimization

EXAMENSARBETE INOM DATALOGI OCH DATATEKNIK, AVANCERAD NIVÅ, 30 HP STOCKHOLM, SVERIGE 2020 Text Similarity Analysis for Test Suite Minimization HUGO HAGGREN KTH SKOLAN FÖR ELEKTROTEKNIK OCH DATAVETENSKAP Text Similarity Analysis for Test Suite Minimization HUGO HAGGREN Master in Machine Learning Date: November 2, 2020 Supervisor: Sahar Tahvili Examiner: Anne Håkansson School of Electrical Engineering and Computer Science Host company: Ericsson AB, Global Artificial Intelligence Accelerator (GAIA) Swedish title: Textlikhetsanalys för minimering av testsamlingar iii Abstract Software testing is the most expensive phase in the software development life cycle. It is thus understandable why test optimization is a crucial area in the software development domain. In software testing, the gradual increase of test cases demands large portions of testing resources (budget and time). Test Suite Minimization is considered a potential approach to deal with the test suite size problem. Several test suite minimization techniques have been proposed to efficiently address the test suite size problem. Proposing a good solution for test suite minimization is a challenging task, where several parameters such as code coverage, requirement coverage, and testing cost need to be considered before removing a test case from the testing cycle. This thesis proposes and evaluates two different NLP-based approaches for similarity analysis between manual integration test cases, which can be employed for test suite minimization. One approach is based on syntactic text similarity analysis and the other is a machine learning based semantic approach. The feasibility of the proposed solutions is studied through analysis of industrial use cases at Ericsson AB in Sweden. The results show that the semantic approach barely manages to outperform the syntactic approach. While both approaches show promise, subsequent studies will have to be done to further evaluate the semantic similarity based method. iv Sammanfattning Mjukvarutestning är den mest kostsamma fasen inom mjukvaruutveckling. Därför är det förståeligt varför testoptimering är ett kritiskt område inom mjuk- varubranschen. Inom mjukvarutestning ställer den gradvisa ökningen av testfall stora krav på testresurser (budget och tid). Test Suite Minimization an- ses vara ett potentiellt tillvägagångssätt för att hantera problemet med växan- de testsamlingar. Flera minimiseringsmetoder har föreslagits för att effektivt hantera testsamlingars storleksproblem. Att föreslå en bra lösning för minimering av antal testfall är en utmanande uppgift, där flera parametrar som kod- täckning, kravtäckning och testkostnad måste övervägas innan man tar bort ett testfall från testcykeln. Denna uppsats föreslår och utvärderar två olika NLP- baserade metoder för likhetsanalys mellan testfall för manuell integration, som kan användas för minimering av testsamlingar. Den ena metoden baseras på syntaktisk textlikhetsanalys, medan den andra är en maskininlärningsbaserad semantisk strategi. Genomförbarheten av de föreslagna lösningarna studeras genom analys av industriella användningsfall hos Ericsson AB i Sverige. Re- sultaten visar att den semantiska metoden knappt lyckas överträffa den syntak- tiska metoden. Medan båda tillvägagångssätten visar lovande resultat, måste efterföljande studier göras för att ytterligare utvärdera den semantiska likhets- baserade metoden. v Acknowledgments I would like to thank my supervisor at Ericsson, Sahar Tahvili. Thank you for helping in any way possible throughout the project and giving me a wonderful time at Ericsson. Furthermore, I would like to thank Cristina Landin at Eric- sson for providing the labeled data for this project and always being available for questions regarding existing software testing procedures. I also want to thank Auwn Muhammad for assisting the project in the form of consultation and practical assitance. Last but not least I would like to thank my examiner at KTH, Anne Håkansson. Thank you for always being available for questions and your extensive feedback on the report throughout the project. Sincerely, Hugo Haggren Contents 1 Introduction 1 1.1 Background . .1 1.2 Problem Statement . .2 1.3 Purpose . .3 1.4 Goal . .3 1.4.1 Benefits, Ethics and Sustainability . .3 1.5 Methodology . .4 1.5.1 Research Philosophy . .4 1.5.2 Resarch Methods . .5 1.5.3 Research Approach . .5 1.6 Stakeholder . .5 1.7 Delimitation . .5 1.8 Outline . .6 2 Theoretical Background 7 2.1 Software Testing . .7 2.1.1 Test Suite Minimization . .9 2.1.2 Manual Testing . .9 2.2 Natural Language Processing . .9 2.3 Machine Learning . 10 2.3.1 Artificial Neural Networks . 10 2.3.2 Deep Learning . 11 2.4 Paragraph Vectors . 11 2.4.1 Word2Vec . 11 2.4.2 Doc2Vec . 12 2.5 The Transformer Model . 13 2.5.1 SBERT . 14 2.6 Syntactic Similarity . 14 2.6.1 Levenshtein Distance . 14 vi CONTENTS vii 2.7 Density-Based Clustering . 15 2.7.1 Cosine Similarity . 15 2.8 Related Work . 16 3 Research Methods and Methodologies 17 3.1 Research Strategy . 17 3.2 Data Collection . 18 3.3 Data Analysis . 19 3.3.1 Visualization . 19 3.4 Quality Assurance . 19 3.4.1 Evaluation Metrics . 20 3.5 System Development . 21 4 Requirements and Design 22 4.1 Requirements . 22 4.2 Initial Design . 23 4.3 Final Design . 23 5 Implementation and Results 25 5.1 Data . 25 5.2 Data Labeling . 26 5.3 Syntactic Similarity Analysis . 26 5.4 Semantic Similarity Analysis . 27 5.4.1 Feature Vector Generation and Clustering . 27 5.5 Results . 28 5.5.1 Syntactic Similarity . 28 5.5.2 Semantic Similarity Analysis . 29 6 Evaluation and Implications 32 6.1 Evaluation . 32 6.1.1 Syntactic Evaluation . 32 6.1.2 Evaluation of Semantic Models . 34 6.2 Implications . 35 6.3 Threats to Validity . 36 7 Conclusions and Future Work 38 7.1 Discussion . 38 7.2 Future Work . 39 viii CONTENTS Bibliography 41 Chapter 1 Introduction In any industry it is always crucial that the product or service works as in- tended. Software development is no exception. Ensuring the quality of software, requires it to be tested rigorously. Hence, software testing plays a vital role in the software development life cycle. In fact, it takes up to 50% of the to- tal development cost [1]. Therefore, it is in any developer’s interest to optimize the software testing process in terms of cost, time, and resources [2]. To ensure the validity of tests, testers make use of test cases. A test case is defined as a set of test inputs, execution instructions, and expected results, developed for a particular objective [3]. Usually, a large number of test cases are created (manually or automatically) for testing a product [4]. Test cases are commonly grouped with other test cases that test a certain requirement [5]. These groups are called test suites. One way of optimizing a testing process is to remove any redundant test cases in a test suite. This process is called test suite minimization. It is formally defined as techniques used to minimize the testing cost in terms of execution time and resources [6]. The main objective of test case minimization is to generate a representative set from test cases that satisfy all the requirements as the original test suite with a minimum number of test cases [6, 5]. 1.1 Background Software testing can generally be divided into two main groups: automated testing and manual testing [1]. Automated testing is when each and every step of the testing procedure is automated and without manual operations [1]. In a manual testing procedure, however, all testing artifacts (e.g. requirement specification, test cases) are written by humans in natural language [7]. 1 2 CHAPTER 1. INTRODUCTION This opens up to the possibility of using natural language processing (NLP) techniques to optimize the testing process. NLP is a sub-field of computer science and linguistics which aims to find methods that enable computers to understand human language [8]. The area of NLP this thesis focuses on is text similarity analysis. Text similarity analysis consists of finding similarities between words, sentences, or documents [9]. There are two main types of text similarity: (1) syntactic similarity and (2) semantic similarity [9]. The syntactic similarity is the similarity of two words based on what characters they’re constructed off. The syntactic similarity does not take into account the meaning of the words, which is where semantic similarity comes in. Semantic similarity is how similar the under- lying meaning in two words is [9]. For instance "Paris" and "Stockholm" are string-wise two very different worlds, but semantically they are similar since they are both capital cities. 1.2 Problem Statement Software testing often takes up a large part of the software development process. This process, however, can be very time and resource consuming and require many manual operators. This consequently, can lead to large costs. To minimize testing times and costs one has to find ways to optimize the software testing process. This is the general, big-picture problem this thesis aims to tackle. With this problem in mind the main research question of this thesis can thus be formulated as follows: How can text similarity analysis be used for test optimization and test suite minimization? In order to analyze the research question of this thesis, the following steps will be performed: 1. Selecting appropriate algorithms for text similarity analysis. 2. Comparing the performance of selected algorithms. 3. Proposing the best solution for test optimization purposes using the similarities between test cases. CHAPTER 1. INTRODUCTION 3 With these steps it will be possible to come to a conclusion on whether the proposed algorithms can be a viable alternative for test optimization and how they can be best applied. 1.3 Purpose The purpose of this thesis is to explore and present how text similarity techniques can be applied for test optimization purposes. This is done by present- ing and analyzing a novel text similarity-based approach to test-suite minimization together with the results of the mentioned approaches when applied to a test suite.

Text Similarity Analysis for Test Suite Minimization

Intelligent Chat Bot

An Ensemble Regression Approach for Ocr Error Correction

Practice with Python

NLP - Assignment 2

3 Dictionaries and Tolerant Retrieval

Use of Word Embedding to Generate Similar Words and Misspellings for Training Purpose in Chatbot Development

Feature Combination for Measuring Sentence Similarity

The Research of Weighted Community Partition Based on Simhash

Hybrid Algorithm for Approximate String Matching to Be Used for Information Retrieval Surbhi Arora, Ira Pandey

Levenshtein Distance Based Information Retrieval Veena G, Jalaja G BNM Institute of Technology, Visvesvaraya Technological University

Thesis Submitted in Partial Fulﬁlment for the Degree of Master of Computing (Advanced) at Research School of Computer Science the Australian National University

Effective Search Space Reduction for Spell Correction Using Character Neural Embeddings