
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS, STOCKHOLM, SWEDEN 2020

Text Similarity Analysis for Test Suite Minimization

HUGO HAGGREN

KTH SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Machine Learning
Date: November 2, 2020
Supervisor: Sahar Tahvili
Examiner: Anne Håkansson
School of Electrical Engineering and Computer Science
Host company: Ericsson AB, Global Artificial Intelligence Accelerator (GAIA)
Swedish title: Textlikhetsanalys för minimering av testsamlingar


Abstract

Software testing is the most expensive phase in the software development life cycle. It is thus understandable why test optimization is a crucial area in the software development domain. In software testing, the gradual increase of test cases demands large portions of testing resources (budget and time). Test Suite Minimization is considered a potential approach to deal with the test suite size problem. Several test suite minimization techniques have been proposed to efficiently address the test suite size problem. Proposing a good solution for test suite minimization is a challenging task, where several parameters such as code coverage, requirement coverage, and testing cost need to be considered before removing a test case from the testing cycle. This thesis proposes and evaluates two different NLP-based approaches for similarity analysis between manual integration test cases, which can be employed for test suite minimization. One approach is based on syntactic text similarity analysis and the other is a machine learning based semantic approach. The feasibility of the proposed solutions is studied through analysis of industrial use cases at Ericsson AB in Sweden. The results show that the semantic approach barely manages to outperform the syntactic approach. While both approaches show promise, subsequent studies will have to be done to further evaluate the semantic similarity based method.

Sammanfattning

Mjukvarutestning är den mest kostsamma fasen inom mjukvaruutveckling. Därför är det förståeligt varför testoptimering är ett kritiskt område inom mjukvarubranschen. Inom mjukvarutestning ställer den gradvisa ökningen av testfall stora krav på testresurser (budget och tid). Test Suite Minimization anses vara ett potentiellt tillvägagångssätt för att hantera problemet med växande testsamlingar. Flera minimiseringsmetoder har föreslagits för att effektivt hantera testsamlingars storleksproblem. Att föreslå en bra lösning för minimering av antal testfall är en utmanande uppgift, där flera parametrar som kodtäckning, kravtäckning och testkostnad måste övervägas innan man tar bort ett testfall från testcykeln. Denna uppsats föreslår och utvärderar två olika NLP-baserade metoder för likhetsanalys mellan testfall för manuell integration, som kan användas för minimering av testsamlingar. Den ena metoden baseras på syntaktisk textlikhetsanalys, medan den andra är en maskininlärningsbaserad semantisk strategi. Genomförbarheten av de föreslagna lösningarna studeras genom analys av industriella användningsfall hos Ericsson AB i Sverige. Resultaten visar att den semantiska metoden knappt lyckas överträffa den syntaktiska metoden. Medan båda tillvägagångssätten visar lovande resultat, måste efterföljande studier göras för att ytterligare utvärdera den semantiska likhetsbaserade metoden.

Acknowledgments

I would like to thank my supervisor at Ericsson, Sahar Tahvili. Thank you for helping in any way possible throughout the project and giving me a wonderful time at Ericsson. Furthermore, I would like to thank Cristina Landin at Ericsson for providing the labeled data for this project and always being available for questions regarding existing software testing procedures. I also want to thank Auwn Muhammad for assisting the project in the form of consultation and practical assistance. Last but not least I would like to thank my examiner at KTH, Anne Håkansson. Thank you for always being available for questions and your extensive feedback on the report throughout the project.

Sincerely,
Hugo Haggren

Contents

1 Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Purpose
  1.4 Goal
    1.4.1 Benefits, Ethics and Sustainability
  1.5 Methodology
    1.5.1 Research Philosophy
    1.5.2 Research Methods
    1.5.3 Research Approach
  1.6 Stakeholder
  1.7 Delimitation
  1.8 Outline

2 Theoretical Background
  2.1 Software Testing
    2.1.1 Test Suite Minimization
    2.1.2 Manual Testing
  2.2 Natural Language Processing
  2.3 Machine Learning
    2.3.1 Artificial Neural Networks
    2.3.2 Deep Learning
  2.4 Paragraph Vectors
    2.4.1 Word2Vec
    2.4.2 Doc2Vec
  2.5 The Transformer Model
    2.5.1 SBERT
  2.6 Syntactic Similarity
    2.6.1 Levenshtein Distance
  2.7 Density-Based Clustering
    2.7.1 Cosine Similarity
  2.8 Related Work

3 Research Methods and Methodologies
  3.1 Research Strategy
  3.2 Data Collection
  3.3 Data Analysis
    3.3.1 Visualization
  3.4 Quality Assurance
    3.4.1 Evaluation Metrics
  3.5 System Development

4 Requirements and Design
  4.1 Requirements
  4.2 Initial Design
  4.3 Final Design

5 Implementation and Results
  5.1 Data
  5.2 Data Labeling
  5.3 Syntactic Similarity Analysis
  5.4 Semantic Similarity Analysis
    5.4.1 Feature Vector Generation and Clustering
  5.5 Results
    5.5.1 Syntactic Similarity
    5.5.2 Semantic Similarity Analysis

6 Evaluation and Implications
  6.1 Evaluation
    6.1.1 Syntactic Evaluation
    6.1.2 Evaluation of Semantic Models
  6.2 Implications
  6.3 Threats to Validity

7 Conclusions and Future Work
  7.1 Discussion
  7.2 Future Work

Bibliography

Chapter 1

Introduction

In any industry it is always crucial that the product or service works as intended. Software development is no exception. Ensuring the quality of software requires it to be tested rigorously. Hence, software testing plays a vital role in the software development life cycle. In fact, it takes up to 50% of the total development cost [1]. Therefore, it is in any developer's interest to optimize the software testing process in terms of cost, time, and resources [2]. To ensure the validity of tests, testers make use of test cases. A test case is defined as a set of test inputs, execution instructions, and expected results, developed for a particular objective [3]. Usually, a large number of test cases are created (manually or automatically) for testing a product [4]. Test cases are commonly grouped with other test cases that test a certain requirement [5]. These groups are called test suites. One way of optimizing a testing process is to remove any redundant test cases in a test suite. This process is called test suite minimization. It is formally defined as techniques used to minimize the testing cost in terms of execution time and resources [6]. The main objective of test suite minimization is to generate a representative set of test cases that satisfies the same requirements as the original test suite with a minimum number of test cases [6, 5].

1.1 Background

Software testing can generally be divided into two main groups: automated testing and manual testing [1]. Automated testing is when each and every step of the testing procedure is automated, without manual operations [1]. In a manual testing procedure, however, all testing artifacts (e.g. requirement specification, test cases) are written by humans in natural language [7].


This opens up the possibility of using natural language processing (NLP) techniques to optimize the testing process. NLP is a sub-field of computer science that aims to find methods that enable computers to understand human language [8]. The area of NLP this thesis focuses on is text similarity analysis. Text similarity analysis consists of finding similarities between words, sentences, or documents [9]. There are two main types of text similarity: (1) syntactic similarity and (2) semantic similarity [9]. Syntactic similarity is the similarity of two words based on what characters they are constructed of. Syntactic similarity does not take into account the meaning of the words, which is where semantic similarity comes in. Semantic similarity is how similar the underlying meaning of two words is [9]. For instance, "Paris" and "Stockholm" are string-wise two very different words, but semantically they are similar since they are both capital cities.

1.2 Problem Statement

Software testing often takes up a large part of the software development process. This process, however, can be very time and resource consuming and requires many manual operators. Consequently, this can lead to large costs. To minimize testing times and costs one has to find ways to optimize the software testing process. This is the general, big-picture problem this thesis aims to tackle. With this problem in mind the main research question of this thesis can thus be formulated as follows:

How can text similarity analysis be used for test optimization and test suite minimization?

In order to analyze the research question of this thesis, the following steps will be performed:

1. Selecting appropriate algorithms for text similarity analysis.

2. Comparing the performance of selected algorithms.

3. Proposing the best solution for test optimization purposes using the similarities between test cases.

With these steps it will be possible to come to a conclusion on whether the proposed algorithms can be a viable alternative for test optimization and how they can be best applied.

1.3 Purpose

The purpose of this thesis is to explore and present how text similarity techniques can be applied for test optimization purposes. This is done by presenting and analyzing a novel text similarity-based approach to test suite minimization together with the results of the mentioned approaches when applied to a test suite. Furthermore, the thesis discusses the applicability of the proposed approach to general cases.

1.4 Goal

The main goal of this degree project is to find a viable, text similarity-based solution for optimizing a software testing process. Additionally, the goal is to provide an option for software development companies and software testing practitioners that aim to improve their software testing process.

1.4.1 Benefits, Ethics and Sustainability

Considering the fact that this thesis aims to reduce the testing cost and also optimize the software testing process, all software developers can benefit from this work. In a general sense, having a more efficient testing process will benefit the entire software industry. With less time and resources being spent on testing, more can be focused on other useful aspects. This not only benefits the industry with increased profits but also the general consumer by improving the quality of products. In our modern society, software often fulfills roles that can be incredibly crucial, such as in banking, infrastructure, or health care. The importance of such software working as intended is thus very high, and any error or improper implementation can lead to large monetary loss or the loss of human lives. It is thus ethically important that such software is rigorously tested. Consequently, when aiming to optimize the testing process it is important to not compromise on the rigorousness of the tests in favor of efficiency. This is something this thesis takes heavily into account.

While optimizing testing processes can be beneficial by lowering costs, it is also beneficial for the sustainability of society. Sustainability here is defined as behavior that works against the depletion of earth’s natural resources [10]. Having more streamlined, efficient testing processes would lower the required resources and thus the process would become more sustainable.

1.5 Methodology

When conducting research such as this degree project there are two main categories in which the research methodology can fall. These are the Quantitative and Qualitative research methods [11]. Quantitative research consists of analyzing objective measurements and numerical values in experiments and tests to come to a conclusion about a theory or hypothesis. Qualitative research on the other hand is more subjective and uses meanings, opinions, or behaviors to reach conclusions [11]. This degree project employs the Quantitative research method because of its mathematical nature.

1.5.1 Research Philosophy

The philosophical assumptions of a research project are the underlying values that steer the direction of the research and determine what is considered valid research or not [11]. There are many different philosophical ideas that can be applied, but the four main assumptions are: Positivism, Realism, Interpretivism, and Criticalism [11]. Positivism [12] assumes that all things studied are objectively given and can be described by variables that can be independently observed. Realism [11, 12] subscribes to the view that the world around us exists independently of what we as humans think or perceive. Interpretivism [13] is the idea that humans have inherent meaning, as opposed to physical phenomena. This meaning is what interpretivism studies and is usually embodied as opinions or ideas. Criticalism [11, 12] assumes that social reality is constituted by historical and cultural events. The focus lies on problems arising from the historical and cultural reality, such as oppression or domination. This thesis is based on the philosophical assumptions of positivism since it views the data used as objectively given. Moreover, statistical methods are applied to analyze the data without inferring any opinions or values.

1.5.2 Research Methods

Research methods are the blueprints for how the research should be conducted and how to accomplish the research task [11]. There are two main groups of research methods, Experimental and Non-Experimental. Experimental research examines the cause-and-effect relationship between variables [14]. Non-Experimental research aims to describe relationships between variables but does not try to find any causal relationships [14]. The method in this thesis is experimental since it aims to show a causal relationship between text similarity and test case redundancy.

1.5.3 Research Approach

Research approaches are different ideas on how to come to conclusions from scientific results [11]. The two most common types are Deductive and Inductive reasoning. Deductive reasoning is when the conclusion is inferred logically from a set of established premises [13]. Inductive reasoning is when there is no clear logical argument between the conclusion and supporting premises. The conclusion is deemed to be supported by the premises rather than proved by them [13]. The reasoning in this thesis is deductive since it uses previously established theory in text similarity analysis to come to new conclusions about test case redundancy.

1.6 Stakeholder

This thesis work was done in collaboration with Ericsson AB. Ericsson AB is a telecommunication company and produces, among many products, the latest technology in Radio Base Stations (RBS). An RBS is a radio transceiver used in wireless communication [15]. These base stations need to be tested and Ericsson has been looking into new ways to optimize their testing process. This collaboration was done as a way for Ericsson to explore new testing alternatives. All work for this degree project was performed at the Ericsson offices in Stockholm, Sweden. The data analyzed is in its entirety collected from Ericsson RBS products.

1.7 Delimitation

One delimitation of this degree project is that the test suite studied is provided by Ericsson and not selected using a strategic method. More specifically, this

thesis only studies tests made for radio base station products. The results and conclusions will thus mainly be applicable to these products. Furthermore, this thesis project focuses solely on the test case descriptions, that is, the text files describing the test case. At Ericsson, there is more data in the testing process that could be analyzed, such as test requirements.

1.8 Outline

The ensuing chapters of this thesis are organized in the following way. Chapter 2 provides relevant theory and recent relevant work in the field of test optimization. In Chapter 3 the engineering related methodologies and methods are presented. Chapters 4 and 5 describe the details of the development and implementation of the system used for test optimization. Chapter 5 also presents the obtained results from the used models. In Chapter 6, the evaluation of the results is presented together with implications and a discussion of the validity of the study. Finally, Chapter 7 concludes the thesis by summarizing the most important insights, discussing the implemented methods, and presenting possible future directions of the present work.

Chapter 2

Theoretical Background

In this chapter, a theoretical background is presented. This theory is what is needed to follow the procedure of this thesis. First, the relevant background in software testing is presented. Afterward, the background relevant to the algorithms applied in this project is presented.

2.1 Software Testing

The typical software development life cycle (SDLC) model consists of six different phases: (i) requirement gathering and analysis, (ii) design, (iii) implementation, (iv) testing, (v) deployment, and (vi) maintenance [1]. Software testing is the most expensive phase in the software development life cycle [1]; thus test optimization plays a major role in the testing domain. Software testing can be sectioned into the following levels (illustrated in Figure 2.1):

Figure 2.1: Levels of Testing


Unit testing is the most basic level of software testing and consists of functionality tests on a code level, usually performed by the programmer. Integration testing consists of combining individual modules and testing them as a group. System testing is when an entire system is tested to make sure it fulfills its requirements. Finally, acceptance testing is testing usually done by a user or customer before accepting or purchasing a system or product. The test cases that are analyzed in this thesis are all from integration testing procedures. Based on the testing level and the testing process, software testing can take up to 50% of total development costs [1]. Therefore, test optimization has received much attention over the last few decades in both industry and academia. In industry, having a more accurate and cost-efficient testing process is always demanded. On the other hand, utilizing new techniques such as machine learning and natural language processing has recently become a hot topic in the testing domain [1]. The five main components of test optimization are:

• Test suite minimization: Generating a representative set from a test suite that satisfies the same requirements as the original test suite with a minimum number of test cases [16].

• Test case selection: Selecting and evaluating a subset of generated test cases for execution is one technique to optimize the testing process [17].

• Test case prioritization: All generated test cases should be ranked for execution in such a way that test cases of higher importance are ranked higher [18].

• Test case scheduling: Selecting and prioritizing test cases dynamically for execution based on their execution results [19].

• Test automation: Makes use of special software tools to control the execution of tests and then compares actual test results with predicted or expected results [20].

Employing all the above approaches for test optimization can lead to a more accurate and efficient testing process. Moreover, the mentioned approaches are recognized as a multi-criteria decision-making problem, i.e. making decisions in the presence of multiple criteria (in this case, the properties of test cases). The test optimization problem is based on the differences and similarities between test cases. Thus, the properties (criteria) of test cases need to be detected before making any decision such as test case reduction.

2.1.1 Test Suite Minimization

The test suite minimization problem is formulated by Elbaum et al. [21] as:

Definition 2.1. Given: A test suite $T$ and a set of requirements $r_1, r_2, \ldots, r_n$ that must be satisfied to provide the desired test coverage of the program.
Problem: Find $T' \subseteq T$ such that $T'$ satisfies all $r_i$ and, for every $T'' \subseteq T$ that satisfies all $r_i$, $|T'| \leq |T''|$.

Given subsets $T_1, T_2, \ldots, T_n$ of $T$, one associated with each of the $r_i$ such that any one of the test cases $t_j$ belonging to $T_i$ can be used to test $r_i$, a test suite that satisfies all $r_i$ must contain at least one test case from each $T_i$. Such a set is called a hitting set of the $T_i$. Maximum reduction is achieved with the minimum cardinality hitting set of the $T_i$. The problem of finding the minimum cardinality hitting set is, however, intractable [22], so in practice approximate solutions are typically used.
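Because finding the minimum cardinality hitting set exactly is intractable, greedy approximations are commonly used in practice. The sketch below is purely illustrative and is not part of the approach evaluated in this thesis; the requirement-to-test-case mapping is hypothetical.

```python
# Illustrative greedy approximation of the minimum-cardinality hitting set.
# requirement_to_tests maps each requirement r_i to the set of test cases
# that can satisfy it (hypothetical example data, not the studied test suite).
requirement_to_tests = {
    "r1": {"tc1", "tc2"},
    "r2": {"tc2", "tc3"},
    "r3": {"tc4"},
}

def greedy_minimize(requirement_to_tests):
    """Repeatedly pick the test case covering the most unsatisfied requirements."""
    uncovered = set(requirement_to_tests)
    selected = set()
    while uncovered:
        # Count how many still-uncovered requirements each test case satisfies.
        coverage = {}
        for req in uncovered:
            for tc in requirement_to_tests[req]:
                coverage[tc] = coverage.get(tc, 0) + 1
        best = max(coverage, key=coverage.get)
        selected.add(best)
        uncovered = {r for r in uncovered if best not in requirement_to_tests[r]}
    return selected

print(greedy_minimize(requirement_to_tests))  # e.g. {'tc2', 'tc4'}
```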

2.1.2 Manual Testing

Manual testing is a type of software testing where testers manually create and execute test cases without using any automation tools. Manual testing is the most primitive of all testing types and helps find bugs in the software system. While manual testing requires more effort, it is necessary to check automation feasibility [1]. In a manual testing procedure, all test cases are written in natural language text. Therefore, applying natural language processing techniques can help to distinguish test cases from each other.

2.2 Natural Language Processing

Natural language processing (NLP) is a branch of computer science that aims to enable the communication between humans and computers. The ultimate goal of NLP is to build software that analyzes, understands and generates human languages naturally [23]. Natural language processing interprets natural language by deriving meaning from an input in order to give information as an output [23]. To do this it is necessary to have a way of determining similarities and differences between pieces of text. One way of doing this is to compare two texts by analyzing their syntactic similarity. In the context of text, syntactic refers to the actual characters the words are made up of. The more identical characters in the same places, the more syntactically similar two texts can be considered. This similarity is usually quantified using some sort

of mathematical quantification. This thesis chooses to focus on Levenshtein distance [24]. Another way of comparing two texts or paragraphs (a paragraph in this case could be anything from a sentence to an entire document) is to first transform the texts into mathematical vector representations, so called feature vectors, and then compare the two feature vectors with each other. A feature vector is simply a vector that in some way or form represents the features of the text. The most simple way of creating feature vectors for some number of paragraphs would be to use a word frequency model. Each unique word in the collection of paragraphs is represented by one dimension in the resulting vector space, and the frequency of each unique word is the resulting vector's value in the corresponding dimension. While this method has its use cases, it only really manages to encapsulate the syntactic information of the paragraph. In this thesis the focus will lie on feature vector models that aim to embed the semantics of a paragraph into the feature vector.
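To make the word frequency model concrete, the following sketch builds such feature vectors by hand for two made-up paragraphs; it also illustrates why these vectors capture syntactic rather than semantic information.

```python
# Minimal word-frequency (bag-of-words) feature vectors for two toy paragraphs.
paragraphs = [
    "verify the radio link after restart",
    "check the radio link after reboot",
]

# One dimension per unique word across the whole collection.
vocabulary = sorted({word for p in paragraphs for word in p.split()})

def to_feature_vector(paragraph):
    words = paragraph.split()
    return [words.count(word) for word in vocabulary]

vectors = [to_feature_vector(p) for p in paragraphs]
print(vocabulary)
print(vectors)
# "restart" and "reboot" end up in unrelated dimensions even though they mean
# roughly the same thing, which is the limitation discussed above.
```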

2.3 Machine Learning

Machine Learning is a broad bracket term that encompasses a large number of techniques used to understand and draw conclusions from data [25]. Generally, machine learning techniques can be classified as either supervised or unsupervised. Supervised machine learning involves learning a mathematical model that predicts an output based on one or many inputs, using the correct outputs as a learning tool [26]. In unsupervised learning, the model is given inputs without any correct outputs as a learning tool. This leads to the model learning relationships and behaviors from the input features that may not have been previously detected [26]. Supervised models are usually referred to as classification models while unsupervised models are referred to as clustering models.

2.3.1 Artificial Neural Networks

A machine learning algorithm that is relevant for this thesis is Artificial Neural Networks (ANN). ANNs were originally introduced with the goal of constructing a mathematical representation of the human brain [27]. The goal of a neural network is to find a function $f(\mathbf{x}, \boldsymbol{\theta})$ that approximates some target function $f^{*}(\mathbf{x})$ by finding the best weights $\boldsymbol{\theta}$. Instead of real, physical neurons that send signals via synapses, an ANN has nodes of activation functions that


take real values as input and output a signal. These nodes are ordered in layers and connected via edges. The connections in the network carry the weights that are adjusted during learning. The simplest form of an ANN is the feed-forward network [27]. The weight adjustment is commonly done with an algorithm called back-propagation. Back-propagation [28] works by traversing back through the network after each training run and updating the weights of each node by computing the gradient of the cost function with respect to the weights, where the cost function is a function aiming to describe the error in the network.

2.3.2 Deep Learning

Deep Learning is a subfield of machine learning that encompasses algorithms based upon artificial neural networks. A key ingredient of deep learning methods is that they include representation learning. Representation learning is the ability of machine learning models to learn, from raw input data, the mathematical feature representations needed for classification. This is highly beneficial since this type of feature extraction is otherwise done manually by a human before application. In this thesis we make use of this by accessing these types of self-learned feature representations of text documents.

2.4 Paragraph Vectors

Paragraph Vectors, also known as Doc2Vec [29], is an unsupervised algorithm proposed by Le and Mikolov in 2014. The purpose of Doc2Vec is to try to encompass the semantic similarity of the analyzed paragraphs in the resulting vectors which the model gives as output. Doc2Vec is heavily built upon its predecessor, Word2Vec [30]. Thus, Word2Vec is presented below to give a full understanding of the Doc2Vec algorithm.

2.4.1 Word2Vec

Word2Vec [30] was introduced in 2013 by Mikolov et al. The aim of the algorithm was to map words in a corpus to feature vectors in a vector space where semantically similar words lie closer to each other. The algorithm works by using a neural network, trained on a large collection of written text, to learn the requested word representations. Mikolov et al. propose two variations of Word2Vec, Continuous Bag-Of-Words (CBOW) and Continuous Skip-Gram. These are presented individually below. Both models are trained using back-propagation.


Figure 2.2: Model architecture for Continuous Bag-Of-Words and Continuous Skip-Gram.

Continuous Bag-Of-Words

The general idea of Continuous Bag-Of-Words is that a neural network is trained to predict a word given its context. The context is defined as the n words surrounding the one being predicted [30]. However, instead of outputting the predicted word, the network outputs a continuous vector of predetermined dimension. The model architecture of CBOW is visualized in Figure 2.2a. It is a one-layer network and it starts off by taking the context words as input in the form of one-hot-encoded vectors. It then projects the sum of the context word vectors into a continuous word vector. The matrix that represents the projection layer is essentially what is learned when we train the network.

Continuous Skip-Gram

The Continuous Skip-Gram model (Figure 2.2b) does more or less the opposite of CBOW. Instead of trying to predict a word given its context, it tries to predict the context from the word. So, given a word, Continuous Skip-Gram tries to predict all the other words in a given range around it [30]. Just like CBOW, the output given is a continuous vector for each word in the data set it is trained on.
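Both variants are available in the gensim library, which is also used for Doc2Vec later in this thesis. The snippet below is a minimal sketch on a toy corpus; the corpus and parameter values are illustrative and are not those used in the study.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustrative only).
sentences = [
    ["configure", "the", "radio", "unit"],
    ["verify", "the", "radio", "link"],
]

# sg=0 selects Continuous Bag-Of-Words, sg=1 selects Continuous Skip-Gram.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Each word is mapped to a continuous feature vector.
print(cbow.wv["radio"].shape)                   # (50,)
print(skipgram.wv.most_similar("radio", topn=2))
```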

2.4.2 Doc2Vec

The Paragraph Vector algorithm [29] (Doc2Vec) is heavily inspired by its predecessor Word2Vec [30], and more specifically the Continuous Bag-Of-Words model. As can be seen from Figure 2.3, the only difference to CBOW is that we add a paragraph vector d to the input. In practice this paragraph vector works just as another context word, and the context it adds is simply which paragraph the current words belong to. Doc2Vec learns both the word and

paragraph vectors of the entire corpus during training, as matrices D and W. It then predicts vectors of new paragraphs by adding columns to D and performing gradient descent while holding the rest of the parameters constant [29].


Figure 2.3: Doc2Vec model architecture.

2.5 The Transformer Model

Recurrent Neural Networks (RNN) [31] are a deep learning architecture that is especially well suited for ordered sequential data, such as natural language. What sets RNNs apart from regular feed-forward networks is that RNNs have the ability to store previous information in an internal state. This allows the network to have somewhat of a "memory". RNNs have long been the standard in several NLP applications [32], until recently (2017) when Vaswani et al. introduced the Transformer Model [32]. The Transformer Model, much like RNNs, is a neural network that works very well with sequential data. However, the Transformer Model has an addition that sets it apart, namely the concept of attention. The purpose of the attention mechanism is to decide, for each step in the processing of a text sequence, which other parts of the sequence are relevant. This is similar to how Word2Vec uses a context window to get a context for each word. However, the attention mechanism takes the entire text sequence into consideration. The structure of the transformer consists of an encoder and a decoder [32]. The encoder maps the input to a continuous representation $\mathbf{z} = (z_1, \ldots, z_n)$. The decoder then takes this $\mathbf{z}$ and generates an output sequence $(y_1, \ldots, y_m)$ one element at a time. For each generated element the decoder also takes the

previously generated element as an additional input.

2.5.1 SBERT

Sentence-BERT [33] is an extension of the pre-trained transformer model BERT [34]. BERT (Bidirectional Encoder Representations from Transformers) is an implementation of a transformer model made by researchers at Google AI Language. Instead of working as a traditional transformer like BERT, and outputting a natural text sequence from an inputted one, SBERT outputs a vector representation of the input text sequence. It manages this by adding a pooling operation to the output of BERT to derive a fixed-size feature vector. A pooling operation is a way of reducing the feature representation to a lower dimension.
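A minimal usage sketch of the sentence-transformers package is shown below. The sentences are made up; the model name is the pre-trained model used later in Chapter 5.

```python
from sentence_transformers import SentenceTransformer
from numpy import dot
from numpy.linalg import norm

# Pre-trained SBERT model (the same model name is used in Chapter 5).
model = SentenceTransformer("bert-base-nli-mean-tokens")

sentences = ["Restart the radio unit.", "Reboot the radio hardware."]
embeddings = model.encode(sentences)  # one fixed-size vector per sentence

# Cosine similarity between the two sentence embeddings.
similarity = dot(embeddings[0], embeddings[1]) / (norm(embeddings[0]) * norm(embeddings[1]))
print(embeddings.shape, similarity)   # (2, 768) and a value in [-1, 1]
```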

2.6 Syntactic Similarity

When doing any kind of similarity analysis of text, the simplest approach is to compare the two texts by analyzing their syntactic similarity. In the context of text, syntactic refers to the actual characters the words are made up of. The more identical characters in the same places, the more syntactically similar two texts can be considered. This similarity is usually quantified using some sort of distance measure. In this thesis we choose to focus on Levenshtein distance [24].

2.6.1 Levenshtein Distance

The Levenshtein distance [24] between two strings is defined as the minimum number of character alterations that need to be made to turn one string into the other. Character alterations here refer to substitutions, additions, or removals of characters. Mathematically, the Levenshtein distance algorithm can be described as [35],

$$
l_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0, \\
\min
\begin{cases}
l_{a,b}(i-1,j) + 1 \\
l_{a,b}(i,j-1) + 1 \\
l_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
\end{cases} & \text{otherwise,}
\end{cases}
\tag{2.1}
$$

where $a$ and $b$ are the two strings to be compared and $i$ and $j$ are the current indices being evaluated, starting at $|a|$ and $|b|$ respectively. Furthermore, the theoretical maximum of the Levenshtein distance is the length of the longer string. This is because if the strings have no characters in common, the algorithm reduces to first substituting all the characters in the shorter string and then adding the rest. This upper limit makes it possible to define a ratio of similarity between the two strings $a$ and $b$ as follows,

$$
l_{ratio}(a,b) = 1 - \frac{l_{a,b}(|a|,|b|)}{\max(|a|,|b|)}.
\tag{2.2}
$$

This ratio is what is used in this thesis to compare test case documents on a syntactic level.
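As an illustration, Equations 2.1 and 2.2 translate directly into the following unoptimized Python sketch; the actual implementation in Chapter 5 instead relies on the python-Levenshtein package.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Dynamic-programming version of the recurrence in Equation 2.1."""
    # dp[i][j] holds l_{a,b}(i, j), i.e. the distance between a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if min(i, j) == 0:
                dp[i][j] = max(i, j)
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + 1,                           # deletion
                    dp[i][j - 1] + 1,                           # insertion
                    dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # substitution
                )
    return dp[len(a)][len(b)]

def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity ratio from Equation 2.2."""
    if max(len(a), len(b)) == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / max(len(a), len(b))

print(levenshtein_ratio("kitten", "sitting"))  # 3 edits -> 1 - 3/7 ≈ 0.571
```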

2.7 Density-Based Clustering

Clustering is an unsupervised machine learning technique that tries to group data into so called clusters based on some kind of similarity. A common clustering algorithm is k-means [36], which takes the desired number of clusters as input and then tries to maximize the sum of the pairwise similarity measures within each cluster. Unlike k-means, density-based clustering [37] does not take any predetermined cluster count. It calculates the clusters simply based on the density of data points. Contiguous areas of high density are considered clusters and contiguous areas of low density are considered noise. In this project an implementation of density-based clustering called HDBSCAN [38] is used. It extends regular density-based clustering by using a hierarchical density estimate. It generates a simplified hierarchy and extracts the most significant clusters.
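A minimal HDBSCAN usage sketch on synthetic data is shown below. Note that the hdbscan package itself labels noise points as -1; the relabeling described in Chapter 5, where zero denotes outliers, is specific to this thesis. The data here is made up.

```python
import numpy as np
import hdbscan

# Synthetic 2-D data: two dense blobs plus some scattered noise points.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0, 0), scale=0.3, size=(40, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.3, size=(40, 2))
noise = rng.uniform(low=-2, high=7, size=(10, 2))
data = np.vstack([blob_a, blob_b, noise])

# No number of clusters is given; only a minimum cluster size.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(data)

print(set(labels))  # cluster indices 0, 1, ... and -1 for noise
```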

2.7.1 Cosine Similarity

The cosine similarity is a similarity measure commonly used in natural language processing applications. It is defined simply as the cosine of the angle between the two vectors under analysis. It can be derived from the definition of the dot product as follows. First, the common definition of the vector dot product is introduced, $\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta$. Then, the cosine factor is isolated in the following way,

$$
\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}.
\tag{2.3}
$$

This final equation is the desired similarity measure.
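For completeness, Equation 2.3 translates directly into a few lines of NumPy; the vectors below are arbitrary examples.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b (Equation 2.3)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # 1.0, since b is a scaled copy of a
```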

2.8 Related Work

The work of this degree project is based partially on the work previously done by Tahvili et al. [39]. In their paper they analyze the semantic similarity between test case descriptions to conclude dependencies within the test suite. In this method, the Doc2Vec [29] algorithm is first applied to a test suite and then two clustering algorithms are applied for evaluation. Here Doc2Vec manages to achieve an F1-score of 0.75.

Text similarity analysis has previously been applied in the area of test optimization. Flemström et al. [40] present a way of using text similarity techniques for test case prioritization, defined as the problem of ordering the execution of test cases within a test suite with the goal of maximizing some criteria. Thomas et al. [41] also apply text similarity methods for test case prioritization purposes. They, however, use topic modeling. Topic modeling is a statistical technique used to find the relevant topics in a collection of documents. Unterkalmsteiner et al. [42] present a text similarity based approach to test case selection, defined as identifying test cases relevant to a change in the system under test. In their work they apply Latent Semantic Analysis (LSA) [43]. LSA performs a quantitative content analysis on a set of documents by counting the prevalence of each word in each document and then grouping similar documents that share the same words [43].

Garousi et al. [44] present an extensive study of the prevalence of NLP applications in software testing. From their study one can conclude that the most commonly applied technique is the Stanford Parser [45], a Java toolkit of probabilistic natural language parsers.

Chapter 3

Research Methods and Methodologies

This chapter presents theory regarding conducting research. It introduces the different types of research strategies, data collection methods, and data analysis methods that can be used in a research project. It also presents concepts in quality assurance that are used to ensure the quality of this research project. Lastly, the methodology followed in this research project is presented, together with the used analysis and evaluation methods.

3.1 Research Strategy

A research strategy is the set of guidelines that a research project follows [11]. This includes organizing, planning, designing and conducting the research [11]. For quantitative research such as this degree project, the most common strategies are: Experimental Research, Ex post facto Research, Surveys, and Case Study [11].

• Experimental Research [11] aims to control for all the factors that can influence the outcome of the experiments. In experimental research the researcher tries to find cause and effect relationships by manipulating variables [46].

• Ex post facto Research is similar to experimental research but is instead performed after the data has been collected [11]. Thus, it does not control for all variables.


• Surveys can be both quantitative or qualitative. They assess the characteristics of a wide range of subjects [11]. They usually, but not necessarily, make use of questionnaires [46].

• Case Study is an empirical strategy that aims to analyze a phenomenon in a natural context [46]. It achieves this by studying individual cases.

In this degree project the case study strategy is employed. The case under study is radio base stations (RBS) which are used in wireless communications. Text similarity analysis is used to apply the concept of test suite minimization to the real life setting of telecommunications.

3.2 Data Collection

The act of data collection can be divided into two main categories. These are primary and secondary data collection [47]. Primary data is data collected specifically for the research problem at hand. Secondary data, on the other hand, is data that is acquired from some previously performed data collection, e.g. a previous study or a public database. Furthermore, in the area of primary data collection there are several ways of performing data collection. For quantitative research these include: Experiments, Questionnaires, and Case Study [11].

• Experimental data collection collects a large amount of data with the aim of incorporating the variables under study.

• Questionnaires are simply data collection by asking questions. The data collected can be either quantitative in the form of numerical values or qualitative in the form of open-ended, subjective questions.

• Case Study data collection is when the collected data is from a specific, small population that is under study. This is used together with the case study strategy.

The data in this project is collected by extracting the test case text files from five different RBSs. From each RBS the following number of test cases is collected; RBS-1: 96, RBS-2: 81, RBS-3: 771, RBS-4: 86, RBS-5: 105. This is a case study collection which collects data from five specific RBSs. These are the cases under study in this project and they are a smaller population in the context of radio base stations and general telecommunication products.

3.3 Data Analysis

Data analysis is any method used to analyze the data acquired during data collection. The two commonly used types of data analysis for quantitative research are Statistics and Computational Mathematics [11]. Statistics can be either descriptive or inferential [14]. Descriptive statistics aim to describe the data under analysis, while inferential statistics try to infer something about the population that the data describes [14]. Computational mathematics consists of using numerical methods, models or simulations to analyze the data [11]. In this thesis numerical methods are used to analyze the test case text files. Levenshtein distance is used to analyze the test cases on a syntactic level. Artificial neural networks and transformer models are used to transform the text files into feature vectors, and then clustering is used to draw conclusions from the vectors.

3.3.1 Visualization

In data analysis it can be important to visualize the data or acquired results to get insights. In this thesis the machine learning method T-distributed Stochastic Neighbor Embedding (t-SNE) [48] is used to visualize the results acquired from clustering. The aim of the method is to project the high dimensional feature vectors used during clustering to a lower dimension so that they can be visualized in a two or three dimensional space. It works by introducing a joint probability distribution that works as a similarity measure between the high dimensional vectors. The algorithm then learns the corresponding low dimensional vectors and their probability distribution by minimizing the difference between the distributions.
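A sketch of how such a projection can be produced with scikit-learn is shown below; the feature vectors and cluster labels are random stand-ins for the ones produced later in this thesis.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-in for the high-dimensional feature vectors (e.g. 100- or 768-dimensional).
rng = np.random.default_rng(0)
features = rng.normal(size=(417, 100))
labels = rng.integers(0, 5, size=417)  # stand-in cluster labels

# Project down to two dimensions for visualization.
projected = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

plt.scatter(projected[:, 0], projected[:, 1], c=labels, s=10)
plt.title("t-SNE projection of test case feature vectors")
plt.show()
```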

3.4 Quality Assurance

Quality Assurance is an important part of research that aims to show that the research done is valid and trustworthy. To ensure its quality, a deductive, quantitative research project needs to discuss validity, reliability, replicability and ethics [11].

• Validity can refer to construct, internal, and external validity. Construct validity is the assurance that the test instruments are actually measuring what they are designed to measure. Internal validity is the assurance that there is a clear cause-and-effect relationship in the conclusion of

the study. External validity considers the generalizability of the presented approach and findings.

• Reliability is the consistency of the measurements. This means that the measurements, under the same conditions, have to produce sufficiently close to the same results when reproduced.

• Replicability is the criterion that other researchers need to be able to replicate the research.

• Ethics are the moral principles behind the way studies are planned, performed and presented [11].

The reliability of this thesis is ensured by documenting the process of implementation and isolating any randomness affecting the measurements. The threats to the validity and the replicability of this thesis are discussed in Chapter 6, and the ethical ramifications have previously been discussed in Chapter 1.

3.4.1 Evaluation Metrics

In research it is important to have a proper way to evaluate the results. In this thesis the goal is to evaluate classifying algorithms. The simplest way to evaluate a classifier is to take its accuracy, defined as the number of correct classifications divided by the total number of classifications. In this project, however, the more descriptive F1-score [49] is used. The F1-score is used as a proper evaluation metric to show the performance of the proposed solution. The F1-score represents a harmonic relationship between precision and recall. Precision and recall can be defined as:

• Precision: The number of correctly detected similarities over the total number of detected similarities.

• Recall: The number of correctly detected similarities over the total number of existing similarities.

Using the definition of precision and recall, the F1-score can then be defined as

$$
\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.
\tag{3.1}
$$

It is important to note that the F1-score in itself is not always a perfect measurement. It can be hard to get a good overview without knowing the underlying values of precision and recall. Precision and recall show if there is any bias

in the system, while the F1-score gives an overall performance measure. It is thus important to not only report the F1-score but to include precision and recall as well. In this thesis precision, recall and F1-score are used to evaluate the results after performing clustering on the feature vector representations of test cases. While these metrics are usually used for classifiers, they can also be applied to clustering algorithms by simply treating the acquired clusters as class labels.
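To make the pairwise evaluation concrete, the sketch below computes precision, recall, and F1-score from hypothetical ground truth and predicted labels for a handful of test case pairs; the values are made up and do not come from the study data.

```python
# Hypothetical pairwise labels: 1 means "the pair is similar enough to be redundant".
ground_truth = [1, 1, 0, 1, 0, 0, 1, 0]
predicted    = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for gt, p in zip(ground_truth, predicted) if gt == 1 and p == 1)
fp = sum(1 for gt, p in zip(ground_truth, predicted) if gt == 0 and p == 1)
fn = sum(1 for gt, p in zip(ground_truth, predicted) if gt == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```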

3.5 System Development

In this case study a system is developed for test suite minimization to study the possibility of using text similarity analysis for these purposes. The development of the computer system used is done using the Waterfall Model [50]. It is a methodology that is widely used in software system development. The waterfall model consists of five development stages [51]. These are:

1. Requirements: This stage is a definition of the requirements on the system. In this project the requirements on the system were gathered via verbal interviews with the stakeholder.

2. Design: The design stage is where the gathered requirements are used to create a high level design of the desired system.

3. Implementation: This is the stage where the design is implemented into a working system.

4. Testing: This stage is for testing the system to make sure it works as intended and that all requirements are fulfilled. In this project the testing stage consists of applying the system to the cases under study, namely the radio base station test suites.

5. Maintenance: When the system has been finished and put into use it may need modifications, such as improvements or corrections. These are attended to in the maintenance stage. This stage is not a part of this study since it can only be done after the system has been implemented on more cases.

The system development following the waterfall model is presented in Chapters 4 and 5.

Chapter 4

Requirements and Design

In this chapter the requirements that were put on the system are presented together with the developed system design. Throughout the research project the system design went through two iterations. Thus, two system designs are presented in this chapter, the initial one and the final one.

4.1 Requirements

The requirements on the system design were developed through repeated verbal interviews together with Ericsson. The system requirements can be formulated as follows:

1. The system is able to minimize a test suite based on only the text based descriptions of the test cases.

2. The system utilizes both syntactic and semantic text analysis to mini- mize the test suite.

3. The system is able to filter out syntactically identical test case descrip- tions.

4. The system is able to calculate feature vectors from test case descriptions, which have semantic meaning embedded in them.

5. The system can use clustering based on the feature vectors to find test cases which are similar enough to be considered redundant.

These requirements are the basis on which the design of this system is built.


4.2 Initial Design

The requirements on the system were used to develop a design of a potential system that would take a set of test cases as input and output a minimized set of test cases. However, because of complications discovered during the project the design had to be altered and redesigned. Below, in Figure 4.1, the initial design is shown, and in the next section the final design is presented.


Figure 4.1: Diagram showing the initial system design.

The initial design consists of four steps. In step one, the test case description files are first given pair-wise syntactic similarity scores. This is done by using the Levenshtein distance and calculating the $l_{ratio}$ between each test case pair in the test suite. The test cases that have a 100% (i.e. $l_{ratio} = 1.0$) similarity score with some other test case are considered redundant because they are duplicates. In step two the semantic feature vector representations of the remaining test cases are computed by applying either the Doc2Vec or the SBERT model to each test case description file. In step three these feature vector representations are then used to cluster the test cases with the HDBSCAN clustering algorithm. Each cluster of test cases will then be regarded as test cases that are similar enough to be considered redundant. The fourth and final step consists of test suite minimization. This is the step where the test cases that have been determined as redundant are removed from the test suite to optimize the testing process.

4.3 Final Design

During the project it was concluded that the test case description files were not entirely conclusive in describing the tests. In practice this meant that software testers would sometimes make use of information sources other than the description files while performing a test. The description files could sometimes refer to other such sources. The consequence of this was that it would not be

possible to rule out test case pairs with 100% syntactic similarity in the first step of the initial design. The design was thus changed to the one in Figure 4.2. Instead of using syntactic similarity to rule out test cases beforehand as in the initial design, this final design uses the syntactic similarity as an alternative to the semantic similarity and compares the two approaches. In step 1 the semantic feature vectors are calculated for each test case in the test suite using both Doc2Vec and SBERT. In this step the $l_{ratio}$ is also calculated for each test case pair just as in step 1 of the initial design; however, no test cases are removed in this step. In step 2 the feature vectors are clustered using HDBSCAN. In the final step the test suite is minimized by removing redundant test cases. In this design test cases are considered redundant if they are either in a cluster with another test case, or if they have an $l_{ratio}$ to another test case that is larger than a certain threshold. The results of using syntactic similarity analysis will be considered as the baseline of performance. If the results from the semantic similarity approach can outperform the baseline it will be clear that the semantic analysis does in fact manage to identify and quantify more information about the similarity between test cases than the syntactic similarity approach.
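A sketch of the redundancy decision in this final design is given below; the cluster labels, ratio values, and threshold are illustrative placeholders rather than the actual study data or the threshold used in the evaluation.

```python
# Illustrative redundancy decision for the final design:
# a test case is redundant if it shares a cluster with another test case,
# or if its Levenshtein ratio to another test case exceeds a threshold.
SYNTACTIC_THRESHOLD = 0.95  # hypothetical value, not tuned in the study

cluster_labels = {"tc1": 1, "tc2": 1, "tc3": 0, "tc4": 2}   # 0 = noise/outlier
lratio = {("tc1", "tc3"): 0.97, ("tc2", "tc4"): 0.60}       # pairwise ratios

def is_redundant(tc, cluster_labels, lratio):
    # Redundant via semantic clustering: shares a non-noise cluster with another test case.
    label = cluster_labels[tc]
    if label != 0 and any(other != tc and lbl == label for other, lbl in cluster_labels.items()):
        return True
    # Redundant via syntactic similarity: very high ratio to some other test case.
    return any(tc in pair and ratio > SYNTACTIC_THRESHOLD for pair, ratio in lratio.items())

for tc in cluster_labels:
    print(tc, is_redundant(tc, cluster_labels, lratio))
```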


Figure 4.2: Diagram showing the final system design.

Chapter 5

Implementation and Results

In this chapter the implementation and the procedure followed during this study are explained and illustrated. After that, the results of applying the syntactic similarity approach are presented together with the results of the semantic similarity approach. The implementation follows the finalized design presented in Figure 4.2. The Levenshtein distance presented in Section 2.6 is used for the syntactic text similarity analysis applied in step 1 of the presented design. For the semantic text similarity analysis, the two feature vector models Doc2Vec and SBERT are applied to the test suite. The resulting vectors are then clustered using the HDBSCAN clustering algorithm, and lastly visualized using the t-SNE visualization algorithm. All the implementation in this project was done using Python with various libraries.

5.1 Data

The data set used in this study is a corpus consisting of 417 text documents. These documents are all description files describing test cases that are currently in use at Ericsson. The files vary in length, ranging from 41 characters for the smallest file to 15880 characters for the largest one. The descriptions are in plain English with the intention to be read by a human software tester. The 417 text documents each correspond to one test case from one out of five different radio base stations (RBS). The data set was pre-processed by converting all the documents from Microsoft Word format (.docx) to raw text files (.txt). In Figure 5.1 one can see an example of how the test case description files can look.


Test Case Description

Test Case Number and Name

Purpose: The purpose of this test case, describing what is being tested.

Prerequisite (Optional): The pre-conditions that must be met before executing this test.

Configure: The conditions and settings that the hardware should have before the test.

Procedure: The procedure of the testing, containing the steps of how to carry out this test.

Pass Criteria: The pass criteria of this test case, describing the conditions that must be met for this test case to be considered successful.

Figure 5.1: An example of the structure of the test case description files used in this study.

5.2 Data Labeling

The labeling of the test case documents was done by a domain-relevant subject matter expert (SME) at Ericsson. The labeling is done by having the SME inspect two test case description files and conclude whether they are similar enough for one of them to be considered redundant. A consequence of this labeling method is that it takes a very large number of test case comparisons to label the entire data set of 417 test case description files. Every test case needs to be compared to all other test cases, and thus (417 × 416) ÷ 2 = 86736 comparisons are required. This is simply unfeasible for one person to perform during the duration of this project. To cut down on the number of comparisons needed from the SME, the labeling is done by only evaluating the test case pairs with $l_{ratio}$ larger than 0.8. This cuts down the number of comparisons to 225.

5.3 Syntactic Similarity Analysis

The syntactic analysis is done by measuring the Levenshtein ratio between all test cases. Since each test case needs to be compared with all other test cases,

a total of (417 × 416) ÷ 2 = 86736 comparisons are made between test cases. First the Levenshtein distance is calculated by using a pre-built Python package called python-Levenshtein [52]. Then, the $l_{ratio}$ is calculated for each test case pair in the test suite according to Equation 2.2.
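A sketch of this pairwise computation is shown below. The document contents are placeholders; Levenshtein.distance is the python-Levenshtein call referred to above, and the ratio follows Equation 2.2.

```python
import Levenshtein  # python-Levenshtein package

# Placeholder texts standing in for the 417 test case description files.
documents = {"tc1": "verify radio link", "tc2": "verify radio links", "tc3": "configure carrier"}

def l_ratio(a: str, b: str) -> float:
    """Levenshtein ratio as defined in Equation 2.2."""
    return 1 - Levenshtein.distance(a, b) / max(len(a), len(b))

names = list(documents)
pairwise = {}
for i in range(len(names)):
    for j in range(i + 1, len(names)):          # (n * (n - 1)) / 2 comparisons
        a, b = names[i], names[j]
        pairwise[(a, b)] = l_ratio(documents[a], documents[b])

print(pairwise)
```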

5.4 Semantic Similarity Analysis

After the syntactic analysis is carried out, the semantic analysis is performed. For this, two different feature vector models are used, namely Doc2Vec [29] and SBERT [33]. The implementation used for Doc2Vec is the one in the gensim [53] library. Gensim is an open-source Python library for natural language processing, with a heavy focus on unsupervised semantic modelling. For Sentence-BERT the original implementation written by Reimers et al. [33] is used. It is written in Python and can be installed as a Python package. This code can be found at: https://github.com/UKPLab/sentence-transformers.

5.4.1 Feature Vector Generation and Clustering

The Doc2Vec model is trained on the test case data set for 100 epochs using a vector size of 100. The context window size is set to five. All words that have a frequency less than or equal to one are ignored. To account for randomness in the model, the training together with the following clustering was iterated 100 times. Unlike Doc2Vec, which is trained specifically on the test case data set, the SBERT model is entirely pre-trained. The pre-trained network model that was used was 'bert-base-nli-mean-tokens', a model that was specifically made for semantic text similarity purposes. The output feature vector size is hard-coded into the network model and can thus not be modified. It is set to 768. After the document feature vectors from the two models are acquired, these vector representations of the test case documents are given as input to the HDBSCAN clustering algorithm. The clustering is done using the hdbscan Python package. The distance metric used for cluster boundary determination is cosine similarity, defined in Equation 2.3. For each feature vector model two clustering runs are done. One using a minimum cluster size of two and one with minimum cluster size set to three. The result of the clustering is a label for each test case document in the test suite. Each label is a number representing the cluster that the test case was categorized in, ranging from one to the total number of clusters. If a test case is labeled zero, however, it means

When each clustering result is acquired, it is visualized by applying the t-SNE algorithm, which projects the data set into a two-dimensional space to give a visual representation of the clustering.
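
A minimal sketch of that visualization step, assuming the `vectors` and `labels` arrays from the clustering sketch above, could look as follows (the colour map choice is arbitrary).

```python
# Sketch of the t-SNE projection used for the cluster plots.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_clusters(vectors: np.ndarray, labels: np.ndarray) -> None:
    # Project the document feature vectors down to two dimensions.
    # Perplexity must be smaller than the number of samples.
    tsne = TSNE(n_components=2, perplexity=min(30, len(vectors) - 1), random_state=0)
    points = tsne.fit_transform(vectors)
    # Leave out noisy samples (label -1), as in figures 5.3-5.6.
    mask = labels != -1
    plt.scatter(points[mask, 0], points[mask, 1], c=labels[mask], cmap="tab20")
    plt.xlabel("t-SNE dimension 1")
    plt.ylabel("t-SNE dimension 2")
    plt.show()
```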

5.5 Results

This section presents the results of this study. First, the results of the syntactic text analysis are presented. After that, we present the results from the clustering done during the semantic text analysis.

5.5.1 Syntactic Similarity

The distribution of the syntactic similarity between test cases is interesting to analyze since it tells us how much of our test suite is syntactically similar. Figure 5.2 below shows the distribution of test case pairs with lratio larger than 0.8. Out of the test case pairs with a Levenshtein ratio larger than 0.8, 44 had a ratio of 0.99 − 1.0. This means that a significant percentage (∼ 10%) of the test suite under analysis is practically identical to at least one other test case in the test suite. While these test cases may or may not be redundant, it shows that there is at least a large amount of repeated text in the test suite, and thus there is definite room for test optimization.

Figure 5.2: The distribution of test case document pairs with lratio larger than 0.8 (number of test case pairs per Levenshtein ratio bin, 0.80–1.0).

5.5.2 Semantic Similarity Analysis

In this section the results from the semantic similarity analysis are presented in the form of evaluation metrics and t-SNE clustering visualizations. Figures 5.3 and 5.4 show the t-SNE visualization of the HDBSCAN clustering using Doc2Vec and SBERT respectively. Figures 5.5 and 5.6 show the same, but with the minimum cluster size set to 3 instead of 2. The total number of clusters was 77 for Doc2Vec and 76 for SBERT; with minimum cluster size 3 they were 45 and 42. Note that samples that were not categorized in a cluster are not visualized. Comparing the figures with different minimum cluster sizes makes it clear that the clusters are much sparser when using minimum size 3, just as one might expect.

Figure 5.3: t-SNE visualization of the clustered test cases using Doc2Vec feature vectors. The minimum cluster size is set to 2. Noisy samples are not included.

Figure 5.4: t-SNE visualization of the clustered test cases using SBERT feature vectors. The minimum cluster size is set to 2. Noisy samples are not included.

Figure 5.5: t-SNE visualization of the clustered test cases using Doc2Vec feature vectors. The minimum cluster size is set to 3. Noisy samples are not included.

Figure 5.6: t-SNE visualization of the clustered test cases using SBERT feature vectors. The minimum cluster size is set to 3. Noisy samples are not included.

Chapter 6

Evaluation and Implications

In this chapter, the evaluation of the results from the syntactic and semantic analysis is presented in the form of evaluation metrics. The resulting implications are then discussed, and lastly the threats to the validity of this study are presented.

6.1 Evaluation

The evaluation in this thesis was done using the evaluation metrics precision, recall, and F1-score, described in section 3.4.1. The labeled data acquired from the subject matter expert at Ericsson was used as the ground truth for the evaluation. As previously mentioned, only a portion of the total data set was labeled. Because of this, the evaluation is done by taking each pair that the SME had labeled and comparing its label with the result of applying either the syntactic or the semantic approach to the same test case pair.

6.1.1 Syntactic Evaluation

To be able to compare the acquired results with the labeled data it is necessary to have a result that consists of the same type of labels, i.e. the result needs to consist of similar or non-similar test case classifications. Since the lratio is a continuous value from 0 to 1.0, it is necessary to decide upon a threshold for when to consider two test cases similar or non-similar. However, instead of deciding upon an arbitrary threshold and evaluating it, the evaluation metrics are calculated for 100 evenly spaced thresholds between lratio = 0.8 and lratio = 1.0. The resulting metrics are visualized in figures 6.1, 6.2 and 6.3 below. Observe that the threshold that achieves the highest metrics for both classes simultaneously lies at around 0.91, where the two curves intersect in each figure. The highest F1-score achieved on either of the two classes is around 0.7; at the intersection, however, the F1-score is around 0.6.
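
The sweep itself can be reproduced in a sketch like the one below; the labeled pair structure and the toy values are assumptions, and scikit-learn's precision_recall_fscore_support stands in for however the metrics were actually computed.

```python
# Hedged sketch of the threshold sweep over the SME-labeled pairs.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labeled pairs: lratio and SME judgement (1 = similar, 0 = non-similar).
labeled_pairs = {
    ("TC_001", "TC_017"): (0.99, 1),
    ("TC_004", "TC_120"): (0.86, 0),
    ("TC_010", "TC_200"): (0.93, 1),
    ("TC_002", "TC_050"): (0.81, 0),
}
ratios = np.array([lr for lr, _ in labeled_pairs.values()])
truth = np.array([label for _, label in labeled_pairs.values()])

results = []
# 100 evenly spaced thresholds between 0.8 and 1.0.
for threshold in np.linspace(0.8, 1.0, 100):
    predicted = (ratios >= threshold).astype(int)
    # One precision/recall/F1 value per class: index 0 = non-similar, 1 = similar.
    p, r, f1, _ = precision_recall_fscore_support(
        truth, predicted, labels=[0, 1], zero_division=0
    )
    results.append((threshold, p, r, f1))

print(results[0])
```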

Figure 6.1: Precision for the Non-similar and Similar class for varying threshold cutoffs.

Figure 6.2: Recall for the Non-similar and Similar class for varying threshold cutoffs.

Figure 6.3: F1-score for the Non-similar and Similar class for varying threshold cutoffs.

6.1.2 Evaluation of Semantic Models

For the evaluation of the semantic feature vector models, two test cases are considered similar if they are both categorized in the same cluster. The similar/non-similar labels obtained from the clustering are compared with the labels from the SME to calculate precision, recall and F1-score. Table 6.1 shows the precision, recall and F1-score for the HDBSCAN clustering with the minimum cluster size set to two. As we can see, both SBERT and Doc2Vec achieve high precision and low recall on the Non-similar class. Conversely, they both get low precision and high recall on the Similar class.

Table 6.1: Evaluation metrics for HDBSCAN clustering using Doc2Vec and SBERT document embeddings. Minimum cluster size is set to 2. Observe that no standard deviation is shown for SBERT since the output of this model is deterministic.

DOC2VEC
Class          Precision       Recall          F1-score
Similar        0.509 ± 0.01    0.983 ± 0.01    0.670 ± 0.01
Non-similar    0.757 ± 0.15    0.051 ± 0.01    0.096 ± 0.02

SBERT
Class          Precision       Recall          F1-score
Similar        0.538           0.980           0.695
Non-similar    0.889           0.160           0.271

Table 6.2: Evaluation metrics for HDBSCAN clustering using Doc2Vec and SBERT document embeddings. Minimum cluster size is set to 3.

DOC2VEC
Class          Precision       Recall          F1-score
Similar        0.503 ± 0.01    0.832 ± 0.03    0.627 ± 0.02
Non-similar    0.518 ± 0.05    0.180 ± 0.02    0.267 ± 0.03

SBERT
Class          Precision       Recall          F1-score
Similar        0.535           0.760           0.628
Non-similar    0.586           0.340           0.430

The results from the HDBSCAN clustering using 3 as the minimum cluster size can be seen in table 6.2. One can see that the balance between precision and recall for both classes and algorithms has evened out somewhat. We also see an increase in both algorithms' F1-scores for the Non-similar class and a decrease for the Similar class. In both settings, Doc2Vec manages a higher F1-score for the Similar class than SBERT does, while SBERT gets a higher score for the Non-similar class.

6.2 Implications

In figure 6.3 it can be seen that the F1-score for the similar class decreases as the threshold approaches 1.0. When two test case descriptions have lratio = 1.0 it means that syntactically they are completely identical, i.e. they are complete duplicates of each other. Thus, when the threshold is set to 1.0, a test case is only considered similar to another if they are completely syntactically identical. The fact that the performance increases when the threshold is lowered shows that even though some test cases might not be completely syntactically identical, they are still considered similar enough to be redundant by the SME. This result is promising since it shows that the syntax of the test cases might not be the only factor taken into account in the similarity judgment made by the SME.

When comparing the F1-scores of the syntactic and semantic approaches, it is clear that neither of the semantic models could outperform the syntactic Levenshtein approach when classifying the non-similar class. Furthermore, they barely manage to get a higher score for the similar class. It can also be seen in table 6.1 that the semantic models achieved a much higher precision than recall on the non-similar class when using a minimum cluster size of 2. What this means in practice is that a large proportion of the test case pairs that were classified as non-similar were also considered non-similar by the SME. However, the low recall implies that a large part of the pairs considered non-similar by the SME were classified as similar by the models. For the similar class the case is reversed: the models manage to classify a large proportion of the similar test case pairs correctly, but only around half of the pairs classified as similar were considered similar by the SME.

6.3 Threats to Validity

The threats to validity, the limitations, and the challenges faced in conducting the present study are discussed in this section.

1. Construct Validity: The major construct validity threat in this study is the way that the syntactic and semantic similarities between test cases are measured. Utilizing just the test case descriptions for similarity detection may not be sufficient in other testing processes. Sometimes analyzing other testing artifacts, such as requirement specifications, standards and test records, might help to detect the semantic similarities as well.

2. Internal Validity: The biggest threat to the internal validity of this thesis is the data labeling done by the SME. As previously stated, because of time and labour constraints, only test case pairs with lratio > 0.8 were labeled. A consequence of this is that all the test case pairs analyzed by the semantic models also have a significant syntactic similarity to each other. This makes it hard to tell from the results whether any additional similarity was captured by the models or whether they simply managed to detect the syntactic similarity as well.

3. External Validity: A threat to the external validity of this study is the fact that the study was done on radio base stations produced by Ericsson, using specific test setups and stations developed specifically for these products. While test case descriptions are common in testing processes, there is no guarantee that the results of this thesis will be applicable to all possible test case types.

4. Replicability: The biggest replicability threat of this study is also the data labeling. Since the labeling was done by a human SME, it is inevitable that some subjective judgment had an effect on the outcome of the labeling. This means that using data labeling from a different SME might yield a different result and thus lower the reliability of the study. However, it is assumed that the objectiveness of the SME is sufficient such that the result is not greatly affected.

Chapter 7

Conclusions and Future Work

In this thesis two approaches to test suite minimization using text similarity analysis are proposed and evaluated: one syntactic similarity approach and one semantic similarity approach. For the semantic text analysis, two machine learning based models have been applied and evaluated against each other. These models have also been compared to the syntactic text similarity approach based on Levenshtein distance. The semantic and syntactic approaches have been evaluated using a data labeling from a subject matter expert in the field of software testing. The results showed that the semantic models could not outperform the syntactic approach at identifying similar test cases for test suite minimization. Returning to the research question of this thesis, "How can text similarity analysis be used for test suite minimization?", we can conclude that both syntactic and semantic similarity analysis show promise as a way of finding candidate test cases for reducing the test suite. However, it is unclear whether semantic similarity can give any further insight than syntactic analysis. To reach a clearer verdict, future studies will have to be done.

7.1 Discussion

The method used for syntactic similarity in this thesis, Levenshtein distance, was chosen mainly for its simplicity and its availability in public code libraries. Levenshtein distance is a more descriptive syntactic similarity measure than methods such as Longest Common Subsequence [54], which only considers insertions and deletions, not substitutions. However, the Damerau–Levenshtein distance [55] builds upon the Levenshtein distance and additionally allows transposition of two adjacent characters. Using this measure instead of the Levenshtein distance may have given a better representation of the syntactic similarity between test cases.

In this thesis the two feature vector models Doc2Vec and SBERT were used for semantic text analysis together with the clustering algorithm HDBSCAN. The choice of Doc2Vec and HDBSCAN was made because of the promising results shown when these algorithms were applied in a software testing context in the work of Tahvili et al. [39]. SBERT, on the other hand, was mainly chosen for the attention that transformer models have received in the media, primarily in the form of OpenAI's transformer model GPT-2 [56]. The reason for not using that model instead was that there was no easy way to extract feature vector representations from it, so it could not be used for this project.

An interesting aspect of this study that can be discussed is the concept of semantics when applied to entire documents, in this case test case descriptions. When applied to words, it is very intuitive what it means for two words to be semantically similar; "Paris" and "Stockholm", for example, are quite obviously semantically similar. How does this concept scale up when we talk about semantically similar documents? This is substantially less clear than the simpler word case. In the paper where Doc2Vec was introduced, the authors show that the algorithm can distinguish positive from negative texts when applied to movie reviews [29]. In that case, a document is semantically similar to another if they are both positive or both negative reviews. For our test case description files we want the models to find test cases that test the same things, so in our case this is the definition of two semantically similar test cases. Whether or not the models used in this thesis can identify this "similarity" is unclear, and the results of this study might even suggest that they cannot.

7.2 Future Work

As mentioned in the previous chapter, a threat to the construct validity of this study is that only the test case descriptions are analyzed when the test case similarities are measured. This introduces the requirement that the description files be complete in their information about the test case, which might not always be the case in practice. A possible direction for future work could thus be to incorporate more of the resources available to a manual tester when executing test cases into the similarity measurements. Another interesting continuation of this study would be to test the same approach with a larger and more diverse labeling of the data set. Of course, this would be hard to do without access to the data provided by Ericsson. However, a future study where a different data set is used could still be of interest if a more randomly based labeling is done.

Bibliography

[1] S. Tahvili. "Multi-Criteria Optimization of System Integration Testing". PhD thesis. Malardalen University, Dec. 2018. isbn: 978-91-7485-414-5.
[2] S. Tahvili et al. "Cost-Benefit Analysis of Using Dependency Knowledge at Integration Testing". In: The 17th Int. Conf. On Product-Focused Software Process Improvement. 2016.
[3] S. Khan, A. Nadeem, and A. Awais. "TestFilter: A Statement-Coverage Based Test Case Reduction Technique". In: (Dec. 2006). doi: 10.1109/INMIC.2006.358177.
[4] S. Tahvili et al. "A Novel Methodology to Classify Test Cases Using Natural Language Processing and Imbalanced Learning". In: Engineering Applications of Artificial Intelligence 95 (2020), pp. 1–13.
[5] A. Leitner et al. "Efficient Unit Test Case Minimization". In: Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering. ASE '07. Atlanta, Georgia, USA: Association for Computing Machinery, 2007, pp. 417–420. isbn: 9781595938824.
[6] R. Singh and M. Santosh. "Test Case Minimization Techniques: A Review". In: IJERT 2 (Dec. 2013), pp. 1048–1056.
[7] C. Landin et al. "Cluster-Based Parallel Testing Using Latent Semantic Analysis". In: The Second IEEE International Conference On Artificial Intelligence Testing. Apr. 2020.
[8] M. Bates. "Models of natural language understanding". In: Proceedings of the National Academy of Sciences 92.22 (1995), pp. 9977–9982. issn: 0027-8424.
[9] W. Gomaa, A. Fahmy, et al. "A survey of text similarity approaches". In: International Journal of Computer Applications 68.13 (2013), pp. 13–18.
[10] S. Dresner. The principles of sustainability. Earthscan, 2008.
[11] A. Håkansson. "Portal of research methods and methodologies for research projects and degree projects". In: The 2013 World Congress in Computer Science, Computer Engineering, and Applied Computing WORLDCOMP 2013; Las Vegas, Nevada, USA, 22-25 July. CSREA Press USA. 2013, pp. 67–73.
[12] M. Myers. Qualitative research in business and management. Sage Publications Limited, 2019.
[13] M. Saunders, P. Lewis, and A. Thornhill. "Understanding research philosophies and approaches". In: Research Methods for Business Students 4 (Jan. 2009), pp. 106–135.
[14] N. Salkind and T. Rainwater. Exploring research. Pearson Prentice Hall, Upper Saddle River, NJ, 2006.
[15] Ericsson AB. Base stations and networks. [Online; accessed 9-May-2020]. 2020. url: https://www.ericsson.com/en/about-us/sustainability-and-corporate-responsibility/responsible-business/radio-waves-and-health/base-stations-and-networks.
[16] E. Cruciani et al. "Scalable Approaches for Test Suite Reduction". In: Proceedings of the 41st International Conference on Software Engineering. ICSE '19. Montreal, Quebec, Canada: IEEE Press, 2019, pp. 419–429.
[17] M. Grindal et al. "An evaluation of combination strategies for test case selection". In: Empirical Software Engineering 11.4 (Dec. 2006), pp. 583–611. issn: 1573-7616.
[18] G. Rothermel et al. "Prioritizing test cases for regression testing". In: IEEE Transactions on Software Engineering 27.10 (Oct. 2001), pp. 929–948. issn: 0098-5589.
[19] S. Tahvili et al. "sOrTES: A Supportive Tool for Stochastic Scheduling of Manual Integration Test Cases". In: Journal of IEEE Access (Jan. 2019), pp. 1–19.
[20] "IEEE Standard for Software and System Test Documentation". In: IEEE Std 829-2008 (July 2008), pp. 1–150.
[21] S. Elbaum, A. Malishevsky, and G. Rothermel. "Incorporating varying test costs and fault severities into test case prioritization". In: Proceedings of the 23rd International Conference on Software Engineering. ICSE 2001. May 2001, pp. 329–338.
[22] J. Jones and M. Harrold. "Test-suite reduction and prioritization for modified condition/decision coverage". In: IEEE Transactions on Software Engineering 29.3 (Mar. 2003), pp. 195–209.
[23] A. Håkansson and R. L. Hartung. THE ARTIFICIAL INTELLIGENCE BOOK - Concepts, Areas, Techniques, Applications. Studentlitteratur. isbn: 9789144125992.
[24] V. Levenshtein. "Binary codes capable of correcting deletions, insertions and reversals". In: Soviet Physics Doklady 10.8 (Feb. 1966). Doklady Akademii Nauk SSSR, V163 No4 845-848 1965, pp. 707–710.
[25] J. Gareth et al. An introduction to statistical learning. Vol. 112. Springer, 2013.
[26] H. Haggren and P. Amethier. Data-Driven Predictions of Outcome for an Internet-Delivered Treatment Against Anxiety Disorders. Bachelor Thesis. 2018. url: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-230705.
[27] C. Bishop. Pattern recognition and machine learning. Springer, 2006.
[28] D. Rumelhart, G. Hinton, and R. Williams. Learning internal representations by error propagation. Tech. rep. California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[29] Q. Le and T. Mikolov. "Distributed representations of sentences and documents". In: International conference on machine learning. 2014, pp. 1188–1196.
[30] T. Mikolov et al. Efficient Estimation of Word Representations in Vector Space. 2013. arXiv: 1301.3781 [cs.CL].
[31] D. Rumelhart, G. Hinton, and R. Williams. "Learning representations by back-propagating errors". In: Nature 323.6088 (1986), pp. 533–536.
[32] A. Vaswani et al. "Attention Is All You Need". In: CoRR abs/1706.03762 (2017). arXiv: 1706.03762.
[33] N. Reimers and I. Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. 2019. arXiv: 1908.10084 [cs.CL].
[34] J. Devlin et al. "BERT: Pre-training of deep bidirectional transformers for language understanding". In: arXiv preprint arXiv:1810.04805 (2018).
[35] C. Yuanyuan et al. "A hybrid approach for measuring semantic similarity based on IC-weighted path distance in WordNet". In: Journal of Intelligent Information Systems 51.1 (2018), pp. 23–47. issn: 1573-7675.
[36] S. Lloyd. "Least squares quantization in PCM". In: IEEE Transactions on Information Theory 28.2 (1982), pp. 129–137.
[37] H. Kriegel et al. "Density-based clustering". In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1.3 (2011), pp. 231–240.
[38] R. Campello, D. Moulavi, and J. Sander. "Density-based clustering based on hierarchical density estimates". In: Pacific-Asia conference on knowledge discovery and data mining. Springer. 2013, pp. 160–172.
[39] S. Tahvili et al. "Automated Functional Dependency Detection Between Test Cases Using Doc2Vec and Clustering". In: The First IEEE International Conference On Artificial Intelligence Testing. Apr. 2019.
[40] D. Flemström et al. "Similarity-based prioritization of test case automation". In: Software Quality Journal (2018), pp. 1–29. issn: 1573-1367.
[41] S. Thomas et al. "Static test case prioritization using topic models". In: Empirical Software Engineering 19.1 (2014), pp. 182–212.
[42] M. Unterkalmsteiner et al. "Large-scale Information Retrieval in Software Engineering - an Experience Report from Industrial Application". In: Empirical Software Engineering 21.6 (2016), pp. 2324–2365.
[43] T. Landauer, P. Foltz, and D. Laham. "An introduction to latent semantic analysis". In: Discourse Processes 25.2-3 (1998), pp. 259–284.
[44] V. Garousi, S. Bauer, and M. Felderer. "NLP-assisted software testing: a systematic review". In: arXiv preprint arXiv:1806.00696 (2018).
[45] Stanford University. The Stanford Parser: A statistical parser. [Online; accessed 13-May-2020]. 2020. url: https://nlp.stanford.edu/software/lex-parser.shtml.
[46] P. Runeson and M. Höst. "Guidelines for conducting and reporting case study research in software engineering". In: Empirical Software Engineering 14.2 (Dec. 2008), p. 131. issn: 1573-7616.
[47] J. Hox and H. Boeije. "Data collection, primary versus secondary". In: (2005).
[48] L. van der Maaten and G. Hinton. "Visualizing data using t-SNE". In: Journal of Machine Learning Research 9.Nov (2008), pp. 2579–2605.
[49] G. Salton and D. Harman. Information retrieval. John Wiley and Sons Ltd., 2003.
[50] W. Royce. "Managing the development of large software systems: concepts and techniques". In: Proceedings of the 9th International Conference on Software Engineering. 1987, pp. 328–338.
[51] A. Alshamrani and A. Bahattab. "A comparison between three SDLC models: waterfall model, spiral model, and Incremental/Iterative model". In: International Journal of Computer Science Issues (IJCSI) 12.1 (2015), p. 106.
[52] D. Necas. python-Levenshtein 0.12.0. [Online; accessed 13-Aug-2020]. 2020. url: https://pypi.org/project/python-Levenshtein/.
[53] R. Rehurek. Gensim. [Online; accessed 16-Aug-2020]. 2020. url: https://radimrehurek.com/gensim/.
[54] A. Apostolico and Z. Galil. Pattern Matching Algorithms. Oxford University Press, 1997. isbn: 9780195354348. url: https://books.google.se/books?id=mFd_grFyiT4C.
[55] F. J. Damerau. "A Technique for Computer Detection and Correction of Spelling Errors". In: Commun. ACM 7.3 (Mar. 1964), pp. 171–176. issn: 0001-0782. doi: 10.1145/363958.363994. url: https://doi.org/10.1145/363958.363994.
[56] A. Radford et al. "Language models are unsupervised multitask learners". In: OpenAI Blog 1.8 (2019).

TRITA-EECS-EX-2020:917
