Master's Thesis Project

Challenges of Large-Scale Software Testing and the Role of Quality Characteristics: An Empirical Study

Author: Eyuel T. Belay
Supervisor: Michael Felderer
Examiner: Francesco Flammini
Term: VT19
Subject: Computer Science
Level: Master's
Course code: 4DV50E
Credits: 15

Abstract Information technology now influences every walk of life, and our lives increasingly depend on software and its functionality. Therefore, the development of high-quality software products is indispensable. In recent years, there has also been increasing demand for high-quality software products, and delivering high-quality software products and services is not possible at no cost. Furthermore, software systems have become complex and challenging to develop, test, and maintain because of scale. Therefore, with increasing complexity in large-scale software development, testing has become a crucial issue affecting the quality of software products. In this paper, large-scale software testing challenges concerning quality, and their respective mitigations, are reviewed using a systematic literature review and interviews. Existing literature regarding large-scale software development deals with issues such as requirement and security challenges, but research regarding large-scale software testing and its mitigations has not been dealt with profoundly.

In this study, a total of 2710 articles published between 1995 and 2020 were collected: 1137 (42%) from IEEE, 733 (27%) from Scopus, and 840 (31%) from Web of Science. Sixty-four relevant articles were selected using a systematic literature review. To include missed but relevant articles, snowballing was applied, adding 32 further articles. From the resulting 96 articles, a total of 81 challenges of large-scale software testing were identified: 32 (40%) performance, 10 (12%) security, 10 (12%) maintainability, 7 (9%) reliability, 6 (8%) compatibility, 10 (12%) general, 3 (4%) functional suitability, 2 (2%) usability, and 1 (1%) portability. The author identified many challenges concerning the performance, security, reliability, maintainability, and compatibility quality attributes, but few concerning functional suitability, portability, and usability. The result of the study can be used as a guideline in large-scale software testing projects to pinpoint potential challenges and act accordingly.

Keywords: Software testing, large-scale, large-scale software testing, software testing challenges, software quality.

Abbreviations
SLR = Systematic Literature Review
DSE = Dynamic Symbolic Execution
ML = Machine Learning
AI = Artificial Intelligence
LST = Large-Scale Software Testing
RE = Requirement Engineering
RTT = Round Trip Time

Contents
1 Introduction
  1.1 Problem Statement
  1.2 Objectives of the Study
  1.3 Motivation
  1.4 Contributions
  1.5 Related Work
  1.6 Report Structure
2 Background
  2.1 Software Quality
  2.2 Large-Scale Systems
  2.3 Large-Scale Software Development and LST
  2.4 Large-Scale vs. Small-Scale Software Challenges
  2.5 LST and Its Challenges
3 Methodology
  3.1 Systematic Literature Review (SLR)
    3.1.1 Planning the Review
  3.2 Snowballing
  3.3 Text-Mining Techniques
    3.3.1 Conducting the Review
    3.3.2 Reporting the Review
  3.4 Semi-Structured Interview
  3.5 Threats to Validity of SLR and Interview
  3.6 Reliability and Validity of Research
  3.7 The Trustworthiness of the Result
  3.8 Ethical Considerations
4 Result
  4.1 Summary of All Challenges Identified through SLR and Interview
  4.2 Summary Result and Mitigation of SLR and Interview
5 Analysis of the Result
  5.1 Summary of the Result
  5.2 Co-Occurrence Analysis
  5.3 Functional Suitability
    5.3.1 Reflection about Challenges Related to Functional Suitability
  5.4 Performance Efficiency
    5.4.1 Reflection about Challenges Related to Performance Efficiency
  5.5 Compatibility
    5.5.1 Reflection about Challenges Related to Compatibility
  5.6 Usability
    5.6.1 Reflection about Challenges Related to Usability
  5.7 Portability
    5.7.1 Reflection about Challenges Related to Portability
  5.8 Reliability
    5.8.1 Reflection about Challenges Related to Reliability
  5.9 Security
    5.9.1 Reflection about Challenges Related to Security
  5.10 Maintainability
    5.10.1 Reflection about Challenges Related to Maintainability
  5.11 General
    5.11.1 Reflection about Challenges Related to General Challenges
6 Discussions and Limitations of the Study
  6.1 Limitations of the Study
7 Conclusions and Future Directions
8 References

Appendix 1 Result

Appendix 2 Interview Questions

1 Introduction Tim Berners-Lee forecasted in 1996 that, to improve computing infrastructures, performance-related issues such as robustness, efficiency, and availability would be valuable in the years to come [1]. In the decades since, the world has fallen under the dramatic influence of information technology. For example, in the past few years, the immense, broad, and diverse impact of technology on human life, both negative and positive, has been felt by many [2]. The growing demand for IT systems therefore requires software engineers to be innovative in dealing with scalability and complexity issues in order to deliver high-quality services. Software systems have become complex and challenging to develop and maintain because of scale. For example, computer programs designed for small scale do not always work when scaled up to large scale [3]. Besides, maintaining software products is the most expensive activity in the software development life cycle [4]. Additionally, there are challenges related to the scale of software development processes; some are security-related [5], and others are technical challenges such as planning, code quality, integration, task prioritization, and knowledge-related challenges [6]. In this paper, LST challenges are presented from the perspective of software quality attributes. Quality is "the extent to which a system or its components should comply with customer needs or requirement specifications" (IEEE Std 610.12-1990). The anticipated principal qualities of large-scale software systems are reliability, security, deployability, and safety [7]. Moreover, large-scale software development and testing are associated with many challenges that stem from the core attributes used to define large-scale software systems and LST [8].

1.1 Problem statement Challenges affecting small-scale software testing include emerging requirements, customer involvement, security-related issues, and the level of expertise of developers [9]. Challenges related to LST activities, on the other hand, can be associated with a lack of coordination and the complexity of testing itself [10]. Furthermore, large-scale development and testing are associated with many challenges that stem from the core attributes that define large-scale testing, yet these challenges are not entirely understood and explored [8]. Literature regarding LST in relation to standard quality characteristics of large-scale systems is hardly touched, not completely understood, or limited to a few quality attributes. For example, existing research regarding large-scale software development deals with requirement challenges [11] and security challenges [5, 6, 9]. Therefore, an SLR and interviews were used to identify challenges and mitigation strategies of LST concerning software quality attributes. Research regarding LST challenges concerning quality thus needs further investigation. The research questions of this study are:


RQ1. What are the key challenges of large-scale software testing hindering the development of quality software products?
RQ2. What are the possible mitigations to large-scale software testing challenges concerning quality?

An SLR was conducted to answer these research questions. After challenges and mitigations had been identified through the SLR, a semi-structured interview was conducted to identify missing challenges and additional mitigations. Hence, the result of this study is a frame of reference for LST challenges and corresponding mitigations concerning software quality attributes. Additionally, the result of this study shows the state of the art of LST.

1.2 Objectives of the study The main objectives of this study, listed below, are elicited following a systematic literature review [12].
O1 Identify the key challenges of LST concerning quality.
O2 Map the key challenges of LST to software quality attributes.
O3 Identify the possible mitigations for the key challenges.

1.3 Motivation Testing in the context of large-scale software development with regard to software quality is a hardly touched research topic; LST is not entirely understood and explored. The analysis of how software testing is conducted in large-scale development needs a more in-depth exploration of the subject. For example, existing literature deals mainly with security [5]. Furthermore, software maintenance is expensive, so fixing defects at the earliest stage of the development cycle is indispensable, as it lowers the cost of maintenance at later stages [4]. Identifying large-scale software testing challenges and their mitigations is crucial for developing high-quality software. In addition, identifying challenges and corresponding mitigations in LST will enable developers and testers to respond to challenges proactively. It also serves as a frame of reference. In this study, the focus is on identifying LST challenges and mitigations. Both agile and non-agile software development processes, regardless of their scale, are affected by challenges. For example, agile development methods are more suitable for small, co-located teams, whereas large-scale agile software development processes are affected by challenges related to scaling projects [13]. Similarly, non-agile large-scale software development processes are affected by challenges such as a lack of effective inter-team collaboration [14].


1.4 Contributions The result of this paper can be used as a guideline in planning large-scale software development projects. It can also be used as a guideline for LST. For future research, the result of this paper can serve as an inspiration for determining further challenges affecting LST. It is also possible to investigate whether the identified challenges and corresponding mitigations are valid in large-scale agile software development.

1.5 Related work In this section, related research is briefly presented. Testing is an essential part of the software development process, which enables us to validate and verify software quality with respect to customer requirements. For example, Shete and Jadhav conducted an empirical study on the effect of test cases on the quality of software and concluded that "software tested using test cases delivers quality software product" [15]. Furthermore, the quality attribute maintainability and issues related to evolution are addressed by [16]: software running on servers for prolonged durations shows an increasing demand for maintainability as it evolves [16]. Test case selection is also a challenging task; Kazmi indicated that regression test case selection is affected by attributes such as cost, fault detection capacity, and coverage [17]. Also, testing at large scale is associated with many challenges that stem from the attributes used to define large-scale systems [8]. Different large-scale challenges have been identified during development and testing, for instance security-related [5], performance-related [18], safety-, compatibility-, and security-related [19], and critical-system-related challenges [20, 21]. Thus, several techniques have been proposed to allocate scarce resources such as budget, time, people, and hardware efficiently and effectively when dealing with large-scale systems [22, 23]. Felderer and Ramler also proposed how to manage complexity effectively and efficiently with risk-based testing [24, 25]. There are hardly any research papers that focus on challenges and mitigations in the context of the software quality attributes of LST; existing literature regarding testing focuses on only a few software quality attributes, for instance security.

1.6 Report structure Chapter 2, Background, contains the core concepts covered in this study. Chapter 3, Methodology, describes the scientific approach applied to answer the research questions and solve the research problem. Chapter 4, Result, contains a summary of the identified challenges, mitigations, and quality attributes. Chapter 5, Analysis of the Result, presents the results of the SLR and the interviews conducted in the target group. Chapter 6, Discussions and Limitations of the Study, discusses the findings and their limitations. Chapter 7, Conclusions and Future Directions, covers the overall conclusion and possible directions for future work.


2 Background In this section, key concepts about LST are presented.

2.1 Software quality Several quality models have been introduced in the software engineering literature, for instance McCall's quality model [26], Boehm's quality model [27], the FURPS quality model [28], and ISO Standard 25010 (1). In this paper, the ISO Standard 25010 product quality model is used to map the identified LST challenges to their associated software quality attributes, since ISO Standard 25010 is the current international standard software product quality model. Definition of quality: according to IEEE (IEEE Std 610.12-1990), software product quality is the degree to which a system, component, or process meets specified requirements and customers' needs or expectations. In the context of large-scale software systems, reliability, security, safety, and deployability are the core quality attributes worth paying attention to [7]. According to ISO/IEC 25010 (1), the cornerstone of software product quality is the quality model, which dictates the relevant quality characteristics to consider when evaluating software product properties; refer to Table 1 below.

Software Product Quality
Functional Suitability: functional completeness, functional correctness, functional appropriateness
Performance Efficiency: time behavior, resource utilization, capacity
Compatibility: co-existence, interoperability
Usability: appropriateness recognizability, learnability, operability, user error protection, user interface aesthetics, accessibility
Reliability: maturity, availability, fault tolerance, recoverability
Security: confidentiality, integrity, non-repudiation, authenticity, accountability
Maintainability: modularity, reusability, analysability, modifiability, testability
Portability: adaptability, installability, replaceability

Table 1: ISO 25010 software product quality model.

2.2 Large-scale systems Systems are not all the same; they differ significantly in several important respects, such as system characteristics and organizational characteristics. In this thesis, the author utilizes the large-scale model proposed by Firesmith [8] to define large-scale software development and LST. The model describes the core attributes associated with the development and testing of a large-scale software system and its organization. The core attributes that define large-scale systems are organizational characteristics (governance, autonomy) and system characteristics (complexity, size, evolution, emergence, distribution, reuse, heterogeneity, and variability), as

(1) https://iso25000.com/index.php/en/iso-25000-standards/iso-25010?limit=3&limitstart=0


illustrated in Figure 1 [8]. Moreover, the core attributes applied to the system can also be applied recursively to the components of the system.

Figure 1. Meters measuring the scalability of a software system [8]

2.3 Large-scale software development and LST The spectrum of scalability illustrated in Figure 1 defines large-scale software development and LST. Short definitions of the core attributes are listed below [8, 128].
Autonomy: in a large-scale testing environment, independent testing teams require resource allocation and a degree of autonomy to handle their testing tasks effectively.
Governance: in a large-scale system, testing teams require resources and degrees of freedom to handle their testing tasks appropriately and to achieve their testing objectives.
Complexity: the degree to which the system is difficult to understand and analyze due to different factors (a large number of components connected by different interfaces, etc.). Maintenance and evolution are directly affected by the complexity of a system.


Evolution: the degree to which, in terms of rate and impact, the requirements, test suites, code, technology, etc. of a system or component change over time. Aligning code to tests and requirements to tests, maintaining outdated tests, and pruning dead tests are thus issues that need attention.
Emergence: the system shows new emergent behavior due to integration among components or different services. Introducing ML and AI can help identify negative emergent behavior.
Size: as scale grows, a large set of test cases needs to be developed, managed, and maintained.
Variability: the degree to which core components or the system appear in different versions and variants. More attention is thus needed for reusability and backward traceability.
Heterogeneity: the degree to which components or test suites differ from each other in terms of requirements, goals, technology, etc., so that striving for homogeneity (a single tool, infrastructure, etc.) is an issue.
Distribution: the degree to which components are distributed across different physical locations and virtual places, which adds a burden for testers. Teams might be located in different geographical areas, and the coordination of teams needs to be addressed.
Reuse: the degree to which components or test suites are reused and/or modularized for easier maintenance and reduced maintenance cost.

2.4 Large-scale vs. small-scale software challenges van der Heijden et al., in the paper "An empirical perspective on security challenges in large-scale agile software development," stated different security challenges, categorized as presented in [5]; see the list below:
1. General challenges (resource allocation)
2. Agile challenges (common understanding of security, implementing low-overhead security)
3. Large-scale agile challenges (aligning security objectives in a distributed setting, common understanding of security, and integration of low-overhead tools)
Some challenges belong to both large-scale and small-scale development. However, some challenges are scale-dependent; for example, faults hidden at small scale may surface at large scale [29]. Small-scale software challenges, for instance customer involvement, emerging requirements, implicit security requirements and security awareness, and expertise among developers, are encountered in agile development [9]. More emphasis should thus be given to developer security awareness, expertise, adequate customer involvement, and continuously improving the development process for security [9].


2.5 LST and its challenges Large-scale software engineering poses many challenges to testing. The main testing-related challenges of large-scale software engineering are the coordination of testing activities [10], coping with complexity [10], non-determinism [30], and flakiness [31]. Nielsen suggested interoperability testing to mitigate the coordination challenge, whereas a combination of pair-wise testing and orthogonal testing is used as mitigation for coping with complexity [10]. According to Nielsen, the combinatorial explosion is the key problem in coping with complexity [10]. Challenges related to LST concerning quality, and their mitigations, are summarized in Chapter 4.


3 Methodology

A systematic literature review and interviews were conducted to answer the research questions. This study is organized into two phases: in the first phase, a systematic literature review complemented by the snowballing technique was conducted to elicit challenges and corresponding mitigations; in the second phase, semi-structured interviews were conducted to identify additional challenges and mitigations. The flow of the research strategy used in this thesis is illustrated in Figure 2 below.

[Figure 2: SLR → snowballing → list of challenges and mitigations → semi-structured interview to identify missing mitigations and unforeseen challenges → analysis of mitigations]

Figure 2. The research process for identifying challenges and corresponding mitigations using SLR, snowballing, and interviews.

3.1 Systematic literature review (SLR) A systematic literature review is referred to as a secondary study, meaning that no firsthand data is collected; only published articles are used. The systematic literature review approach followed in this paper is the one proposed by [12, 32], which consists of three phases: planning the review, conducting the review, and reporting the review. These three phases are explained below.

3.1.1 Planning the review During planning, the researcher defines a strategy for study selection, which is the crucial first step of any SLR to ensure the completeness of the collection of primary studies; the quality of an SLR's results depends largely on its primary studies. In this phase, the researcher selected the source repositories or databases, formulated the search queries, and defined inclusion and exclusion criteria based on the research questions. Web of Science, Scopus, and IEEE were selected to locate and download relevant articles. The reasons for choosing these repositories were their multidisciplinary and international coverage, their large number of records, and the opportunity to elicit the landscape and the state of the art of LST [33, 34]. This thesis aims to identify the challenges of large-scale software testing from the context of software quality characteristics, and their mitigations. Therefore, the general research questions in this study are: What are the challenges of large-scale software testing hindering the development of quality software products? What are the possible mitigations? In this phase, the process was instantiated by entering the search keywords into the source repositories mentioned above.


The returned set of studies was then filtered using inclusion and exclusion criteria so that only relevant scientific papers remained. Scopus, Web of Science, and IEEE were used to extract the relevant scientific literature. Identification and formulation of the search keyword were done systematically. First, the primary keywords were identified or derived from the research questions. Second, synonyms or alternative words for the primary keywords were identified. Finally, the search keyword was formulated by joining alternative terms with boolean OR and binding terms with boolean AND. The queries used in the corresponding repositories are the following:

Web of Science: the query below returned 840 articles.

TOPIC: (((challenge OR problem) AND (large-scale OR "large scale") AND ("software testing" OR "testing") AND (software OR application)))
Refined by: DOCUMENT TYPES: (ARTICLE OR PROCEEDINGS PAPER)
Timespan: 1995-2020. Indexes: SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, ESCI.

IEEE: the query below returned 1137 articles.

(Challenge OR problem) AND (large-scale OR 'large scale') AND ('software testing' OR testing) AND (software OR application)

Scopus: the query below returned 733 articles.

TITLE-ABS ( ( challenge OR problem ) AND ( large-scale OR 'large scale' ) AND ( "software testing" OR "testing" ) AND ( software OR application ) ) AND ( LIMIT-TO ( DOCTYPE , "cp" ) OR LIMIT-TO ( DOCTYPE , "ar" ) OR LIMIT-TO ( DOCTYPE , "ip" ) ) AND ( LIMIT-TO ( SUBJAREA , "ENGI" ) OR LIMIT-TO ( SUBJAREA , "COMP" ) ) AND ( LIMIT-TO ( PUBYEAR , 2020 ) OR ( LIMIT-TO ( PUBYEAR , 2019 ) OR LIMIT-TO ( PUBYEAR , 2018 ) OR LIMIT-TO ( PUBYEAR , 2017 ) OR LIMIT-TO ( PUBYEAR , 2016 ) OR LIMIT-TO ( PUBYEAR , 2015 ) OR LIMIT-TO ( PUBYEAR , 2014 ) OR LIMIT-TO ( PUBYEAR , 2013 ) OR LIMIT-TO ( PUBYEAR , 2012 ) OR LIMIT-TO ( PUBYEAR , 2011 ) OR LIMIT-TO ( PUBYEAR , 2010 ) OR LIMIT-TO ( PUBYEAR , 2009 ) OR LIMIT-TO ( PUBYEAR , 2008 ) OR LIMIT-TO ( PUBYEAR , 2007 ) OR LIMIT-TO ( PUBYEAR , 2006 ) OR LIMIT-TO ( PUBYEAR , 2005 ) OR LIMIT-TO ( PUBYEAR , 2004 ) OR LIMIT-TO ( PUBYEAR , 2003 ) OR LIMIT-TO ( PUBYEAR , 2002 ) OR LIMIT-TO ( PUBYEAR , 2001 ) OR LIMIT-TO ( PUBYEAR , 2000 ) OR LIMIT-TO ( PUBYEAR , 1999 ) OR LIMIT-TO ( PUBYEAR , 1998 ) OR LIMIT-TO ( PUBYEAR , 1997 ) OR LIMIT-TO ( PUBYEAR , 1996 ) OR LIMIT-TO ( PUBYEAR , 1995 ) ) AND ( LIMIT-TO ( LANGUAGE , "English" ) )

Applying the final search keyword to the selected repositories retrieved a total of 2710 research papers. The title and abstract of each article were carefully analyzed manually to filter out irrelevant articles. In the first iteration, 500 articles were selected out of 2710. In the second iteration, Zotero and Mendeley were used to remove duplicates, and a total of 373 articles remained.


In the last iteration, the full research papers were carefully analyzed with the aid of Adobe PDF Reader, applying the inclusion and exclusion criteria and a detailed review of each paper. Finally, 64 research papers were chosen for the SLR.

Inclusion criteria
- Year of publication between 1995 and 2020.
- Topic relevance assessed using thematic analysis.
- Peer-reviewed conference and journal papers.
- Articles published in IEEE, Web of Science, and Scopus.

Exclusion criteria
- Articles published outside the year range 1995-2020.
- Irrelevant articles.
- Articles published in non-peer-reviewed outlets, reports, books, and non-scientific journals.
- Non-English papers.

Excluded articles had irrelevant keywords and titles such as large-scale gesture recognition, integrated circuits and systems-on-chip, human genome and technology challenges in screening single-gene disorders, VLSI circuits, ultra-large-scale software systems, large-scale global optimization based on an ABC algorithm with guest-guided strategy and opposite-based learning technique, etc. The collected research papers are summarized in Table 2 below.

Date | Repository | Collected | Selected | After duplicate removal
1995-2020 | IEEE | 1137 | 143 |
1995-2020 | Scopus | 733 | 220 |
1995-2020 | Web of Science | 840 | 137 |
Total | | 2710 | 500 | 373 (Zotero and Mendeley were used to remove duplicates)
Final papers selected after manual review: 64

Table 2. Collected research papers for SLR from 1995-2020.

3.2 Snowballing After collecting relevant research papers and excluding irrelevant literature, 64 articles remained. To include possibly missing but relevant literature, forward and backward searching was conducted [35]. Forward searching checks the relevance of papers citing the selected papers, while backward searching includes relevant papers from the reference lists of the selected articles. The iteration continues until no new paper is found, following the snowballing technique presented in [35].
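The snowballing procedure can be summarized as a fixed-point iteration over the set of selected papers. The sketch below is a minimal illustration of that loop, not tooling used in this thesis; fetch_references, fetch_citing, and is_relevant are hypothetical placeholders for the lookups and relevance judgments the reviewer performs manually.

```python
# Illustrative sketch of iterative backward/forward snowballing [35].
# fetch_references, fetch_citing, and is_relevant are hypothetical
# placeholders for the manual lookups the reviewer performs.

def snowball(start_set, fetch_references, fetch_citing, is_relevant):
    """Expand a start set of papers until an iteration adds nothing new."""
    selected = set(start_set)
    frontier = set(start_set)
    while frontier:
        candidates = set()
        for paper in frontier:
            candidates |= set(fetch_references(paper))  # backward search
            candidates |= set(fetch_citing(paper))      # forward search
        # Keep only relevant papers that are not already selected.
        new_papers = {p for p in candidates
                      if p not in selected and is_relevant(p)}
        selected |= new_papers
        frontier = new_papers  # iterate until no new paper is found
    return selected
```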


3.3 Text-mining techniques The co-occurrence of terms was obtained by applying text-mining techniques to the final selected papers. The number of papers selected through the SLR and the number included after snowballing are shown in Table 3.

Final selected papers for analysis (1995-2020) | Included papers through snowballing | Common papers
64 | 32 | 0
Final papers selected after manual review: 96

Table 3. The number of papers selected from the SLR and through snowballing. The total number of collected articles is shown in Figure 3 below.

[Figure 3: bar chart of collected articles (2710), 1995-2020 - IEEE: 1137, Scopus: 733, Web of Science: 840]

Figure 3. Collected articles from the repositories, 1995-2020. The total number of selected articles, including those found through snowballing, is shown in Figure 4 below.

[Figure 4: bar chart of selected articles (96) - SLR: 64, snowballing: 32]

Figure 4. Final selected articles, 1995-2020.


3.3.1 Conducting the review Zotero and Mendeley were used to remove possible duplicates before the review and data extraction started. Relevant articles were included by applying the inclusion and exclusion criteria. The collected data were checked against the RQs, and all extracted data were then reviewed several times to check whether any incorrect data extraction had been performed. Furthermore, the collected data were carefully checked, reviewed several times against the RQs, and mapped precisely to zero or more quality attributes.

3.3.2 Reporting the review The reporting of the review is summarized in a table, as illustrated in Chapter 4. Large-scale software testing challenges and their mitigations are listed along with the ISO software quality attributes.

3.4 Semi-structured interview This part is the continuation of the SLR and was conducted to elicit additional challenges and mitigations to the challenges identified during the SLR. The sample space includes testers/quality assurance engineers, programmers, test and project managers, DevOps engineers, system analysts, and technologists currently working in IT companies, telecom companies, finance and banking, the gaming industry, etc. A total of ten respondents were interviewed (one analyst, two developers, one dev lead system architect, two software quality assurance engineers, two testers, one combined developer and tester, and one test manager).

3.5 Threats to validity of SLR and Interview

Quality assessment needs to be addressed to obtain a valid SLR and interview. The potential threats to the validity of the SLR include an incorrect search keyword, incorrect selection of research articles, and incorrect data extraction or mapping. Quality assessment with respect to the SLR is believed to be adequately addressed, since the search keyword was refined several times before the final keyword was reached. The research papers were properly managed (duplicates, incomplete articles, etc.) using Zotero and Mendeley, and the selection of papers followed the guidelines and steps proposed by [12, 32].

The research papers were peer-reviewed journal articles and conference papers selected from IEEE, Web of Science, and Scopus. Finally, the papers were checked against the RQs and the inclusion and exclusion criteria before selection, and the mapping of each challenge to the respective product quality attributes was checked several times. Thus, quality assessment is assumed to be addressed: the final refined set of papers matches the RQs, and the selection of publications is considered valid.

One potential threat to the validity of this research during the interviews is the coverage of the sample space; as described above, the sample space covers a range of roles and domains and was assumed to be adequate.


3.6 Reliability and validity of research

Reliability and validity are the two most important features to evaluate in research; the evaluation may refer to a tool, a method, etc. Reliability refers to the stability of findings, whereas validity represents the truthfulness of outcomes [36]. Validity in the context of qualitative research refers to the appropriateness of the measures used, the accuracy of the analysis of the results, and the generalizability of the findings [37].

The author of this thesis used both primary and secondary data sources to identify the challenges of large-scale testing concerning software quality attributes. First, the researcher used an SLR to collect secondary data to answer the research questions; then, based on that result, interviews were used to gather primary data. The study yielded a consistent result and can therefore be considered reliable.

3.7 The trustworthiness of the result

In relevance to this thesis, the four aspects of trustworthiness (credibility, transferability, dependability, and confirmability) are discussed as presented in [38]. Credibility relates to internal validity; transferability refers to external validity and the generalizability or applicability of findings; dependability focuses on the consistency, reliability, and stability of the data; and confirmability concerns neutrality and objectivity [38]. Reliability and validity are discussed in Section 3.6; the applicability of the result is discussed in Section 1.4.

3.8 Ethical considerations

Following the privacy and confidentiality considerations discussed by [39], the confidentiality of respondents' data is well protected. Following Creswell, ethical issues in qualitative research are addressed throughout the investigation: before conducting the study, when beginning the study, and while collecting, analyzing, reporting, and publishing the data [40, 127]. For example, the researcher informed the respondents about the purpose of the research and how the data would be collected, and preserved the respondents' privacy by aliasing their identities in publications.


4 Result In this section, the results of the systematic literature review and the interviews are presented.

4.1 Summary of all challenges identified through SLR and Interview

The respondents of the interviews are shown in Table 4 below.

No | Respondent | Current Position | Experience | Company
1 | R1 | Software Quality Assurance Engineer | 4 years | Telecom
2 | R2 | System Analyst and Team Leader | 10 years | Telecom
3 | R3 | Tester | 4 years | Game
4 | R4 | Dev Lead System Architect | 10 years | Consult
5 | R5 | Team Leader | 6 years | Consult
6 | R6 | Programmer | 6-10 years | IT company
7 | R7 | Quality Assurance Engineer | 6 years | Telecom
8 | R8 | Tester | 4 years | Telecom
9 | R9 | Developer | 10+ years | Consult
10 | R10 | Test Manager | 8 years | Bank, Finance, Telecom

Table 4. Respondents of the interviews.

A summary of all challenges from the SLR and the interviews is shown in Table 5 below.

No | Challenges (in the large-scale system) | Quality

1. Database access code [41] | Functional Suitability
2. Mapping requirements to testing [42, R1-R5, R10] | Functional Suitability
3. Developing a common understanding of (requirement, configuration, security, performance, risk) [5, R1-R3, R5, R6, R8, R10] | Functional Suitability, Compatibility, Performance, Security
4. Identifying a few performance counters from highly redundant data [43] | Performance
5. Deciding the right time to stop performance testing, especially: (a) when the performance data generated is repetitive [44]; (b) when the system is tested with simulated operations data [45] | Performance
6. Generating a test case dataset that mimics the actual data [46, R10] | Performance
7. Lack of test equipment, or difficulty mimicking real scenarios in performance testing (e.g., in the telecom domain) [R2, R7, R9, R10, 21] | Performance
8. Effectively identifying, detecting, or investigating bugs or faults: (a) when bugs are scale-dependent [29]; (b) when the large-scale system possesses big data [47]; (c) when identifying bugs and improving the test infrastructure [48, 49]; (d) in a highly configurable large-scale system [50, 51, 52, R3]; (e) with large numbers of false positives and false negatives (using code mining) [53]; (f) for a large combination of features [52]; (g) when performance bugs are workload-dependent [54]; (h) in large-scale data processing (e.g., Adobe Analytics) [55]; (i) when identifying critical bugs (in components, APIs, etc.) [R7] | Performance
9. In a large-scale system, effectively: (a) predicting faults [31, 56, 57]; (b) predicting and maintaining flaky tests [31] | Performance
10. Performing: (a) load testing on web applications running under heavy loads (data tsunami) [58]; (b) performance testing effectively, efficiently, and fully [R2, R3]; (c) load testing in the cloud, due to configuration, capacity allocation, and tuning of multiple system resources [59]; (d) load testing under environment variation [18]; (e) testing of systems (critical, parallel, and distributed) that demand intensive computation and multidimensional testing [20, 21] | Performance
11. Performance testing (e.g., load testing) challenges: (a) detecting performance deviations of components in load tests [60, 61]; (b) obtaining accurate performance testing results (for cloud applications) in the presence of performance fluctuations [62]; (c) performing load testing in a cost-effective and time-efficient manner [63]; (d) evaluating the effectiveness of load-testing analysis techniques [63]; (e) identifying the appropriate load to stress a system so as to expose performance issues [54] | Performance
12. Managing duplicate bugs and providing reasonable compensation in Crowdsourced Mobile Application Testing (CMAT) [64] | Performance
13. Bug debugging and bug report diagnosis are expensive, time-consuming, and challenging (in large-scale systems or systems of systems) [65] | Performance
14. The classical approach to load testing is hard to apply in parallel and frequently executed delivery pipelines (CI/CD) [66] | Performance
15. Effective fault or bug localization (both single-fault and multi-fault) has a long turn-around time and is challenging [67, 68, 69] | Performance
16. Achieving the required system performance and maintaining it while reducing cost [70] | Performance
17. In a large-scale microservices architecture, where thousands of coexisting applications must work cooperatively: (a) task scheduling [71] and resource scheduling [72]; (b) performance modeling [71] | Performance
18. Analytics-driven load-testing challenges [73]: (a) during test design, ensuring performance quality when the system is deployed in a production environment; (b) during test execution, reducing test execution effort; (c) during test analysis, determining whether a test passes or fails | Performance
19. For multi-component, multi-version, configurable, evolving integrated systems: (a) performing compatibility testing for a large combination of features in a multi-component configurable system [52, 74, R10]; (b) performing incremental and prioritized compatibility testing [52, 75, 76, 77, R10]; (c) interoperability and dynamic reconfigurability [19]; (d) test optimization for compatibility testing while assuring coverage [78, 79]; (e) selecting the environment for compatibility testing (due to limited resources and the large number of variants to test) [80] | Compatibility
20. Platform compatibility: business without a sense of ethics has a great impact on compatibility testing [R7] | Compatibility
21. Usability and/or accessibility testing challenges: (a) low priority and little acknowledgment (time, resources), e.g., in the telecom domain [R1-R4]; (b) effectively detecting duplicate crowd-testing reports [81] | Usability
22. LST challenges related to reliability: (a) service disruption due to the difficulty of reproducing bugs [R1, 82]; (b) achieving a system with an acceptable level of reliability [R2]; (c) identifying the root cause of failures [83, R1]; (d) testing for correctness and reliability [84]; (e) worst-case test input generation (from dynamic symbolic execution) [85, 86]; (f) optimal testing resource allocation (due to reliability constraints and limited resources) [22, 23, 87, 88, 89, 90, 91, 92, 93, 94, R8]; (g) validation and testing of the safety, reliability, and security of autonomous cars [95] | Reliability
23. Security testing challenges: (a) aligning security objectives in a distributed setting [5]; (b) integrating low-overhead security testing tools [5]; (c) security verification and coping with security threats [R8]; (d) lack of a hacker's mindset when security testing is done [R7]; (e) level of trust (infrastructure, platform, tools, and security expertise) [R3]; (f) effective detection of Distributed Denial of Service (DDoS) attacks on Software-Defined Networks (SDN) [96]; (g) security-oriented risk assessment [97]; (h) safety- and security-critical systems [98]; (i) Static Application Security Testing (SAST) [99]; (j) security-oriented code auditing [100] | Security
24. Assuring quality when enterprise systems are deployed in an environment containing many thousands of interconnected systems [101] | Portability
25. Maintainability challenges: (a) maintaining outdated test suites efficiently and effectively [R3, R4, R9, R10, 102]; (b) effectively pruning outdated test suites [R3, R4]; (c) keeping track of faults while doing regression testing [R9]; (d) estimating the impact of changes [102]; (e) building and maintaining test suites for continuously updated regression testing [102]; (f) effective test case selection for regression tests [103]; (g) reusing class-based test cases [104]; (h) optimizing continuous integration (CI) testing [105]; (i) cost-effective regression testing on a database [106] | Maintainability
26. Test suite optimization challenges: (a) ensuring a "realistic" optimized suite [107]; (b) test optimization to speed up distributed GUI testing [108] | Maintainability
27. General testing challenges: (a) modeling and testing real-world scenarios at an acceptable level of abstraction [R2, R3]; (b) coordination of testing activities (integration and end-to-end testing) [R4-R6, 72]; (c) cross-functional communication [R7]; (d) verification and validation of automotive systems [109]; (e) decision making regarding the adoption of cloud computing (CC) for software testing [110]; (f) enabling efficient and effective testing of big-data analytics in real-world settings [111]; (g) making agile testing understandable and accessible to various stakeholders [112]; (h) testing the safety, reliability, and security of embedded systems [113]; (i) improving and optimizing testing (infrastructure, process) and reducing testing cost [48, 106, 114, 115, 116, 117, 124]; (j) designing optimized combinatorial test suites for large-scale systems that require complex input structures [118, 119] | General

Table 5. Summary of identified challenges from the SLR and the interviews.

The summary of the result (challenges for each quality attribute) is shown in Figure 5 below.


[Figure 5: pie chart of challenges per quality attribute - Performance Efficiency 40%, Security 12%, Maintainability 12%, General 12%, Reliability 9%, Compatibility 8%, Functional Suitability 4%, Usability 2%, Portability 1%]

Figure 5. Summary of identified challenges in relation to quality. The summary of the result (challenges of large-scale testing) is shown in Figure 6 below.

[Figure 6: bar chart of the number of challenges for each quality attribute]

Figure 6. Summary of identified challenges in relation to quality using a bar graph.

4.2 Summary result and mitigation of SLR and Interview

The result of this study, that is, the list of challenges and corresponding mitigations, is attached in Appendix 1. The author identified 81 challenges from 96 papers.


5 Analysis of the result In this section, a summary of the result, a co-occurrence analysis, and an analysis of the result with respect to the ISO software product qualities are presented. The analysis followed a thematic approach for identifying thematic structures with respect to the software product qualities. This section is therefore organized along the eight software quality characteristics and their sub-characteristics. Additionally, the results from the semi-structured interviews are analyzed under the software quality characteristics and sub-characteristics.

5.1 Summary of the result In this thesis, a total of 2710 articles published between 1995 and 2020 were collected: 1137 (42%) from IEEE, 733 (27%) from Scopus, and 840 (31%) from Web of Science. Of these, 64 articles were selected through the SLR, and 32 additional articles were added using the snowballing technique, for a total of 96 articles. The challenges identified through the SLR and the interviews are categorized under eight software qualities. From the 96 articles, a total of 81 LST challenges were identified: 32 (40%) performance efficiency, 10 (12%) security, 10 (12%) maintainability, 7 (9%) reliability, 6 (8%) compatibility, 10 (12%) general testing challenges, 3 (4%) functional suitability, 2 (2%) usability, and 1 (1%) portability. The result shows that more challenges were identified for the performance efficiency, security, maintainability, reliability, and compatibility quality attributes, and fewer for the functional suitability, usability, and portability product quality attributes.

5.2 Co-occurrence analysis To gain insight into the current state of the art analyzed through the SLR and interviews, co-occurring concepts were extracted using text-mining techniques. The co-occurrence analysis was done using a script in R on the 96 selected articles. The results of the co-occurrence analysis are shown in Figures 8 and 9 below. The frequency of terms is illustrated in Figure 7.


Figure 7. Frequency of terms in the SLR (96 articles).
The terms 'software', 'large', 'scale', 'test', and 'challenge' are the most frequent terms, as illustrated in the frequency distribution graph in Figure 7. It is evident that the most frequent terms are the key terms that constitute the search keyword. The co-occurrence of terms with the term 'test' as the focus of the analysis is shown in Figure 8 below. The R code used to generate the visualization was obtained from an online course repository (2). The co-occurrence of the terms 'test' and 'requirement' implies that mapping requirements to testing is an important task in large-scale software testing. Similarly, 'test' co-occurs with 'automated', 'agile', 'regression', 'case', and 'combinatorial', the most prominent terms in the visualization, indicating that agile testing, regression testing, and combinatorial testing are among the significant techniques in large-scale software testing.
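The actual analysis was performed with the R script referenced above. As a language-neutral illustration of the underlying idea, the toy sketch below counts sentence-level co-occurrences of a focus term; the three-document corpus is invented for the example and is not the 96-article corpus.

```python
# Toy illustration of term co-occurrence counting (the thesis used an R
# script; see the tutorial in footnote 2). Counts how often other terms
# appear in the same sentence as a focus term.
from collections import Counter
import re

docs = [
    "Regression test selection is a challenge in large scale software testing.",
    "Automated test generation and combinatorial test design reduce effort.",
    "Mapping each requirement to a test is hard at large scale.",
]

def cooccurrences(documents, focus="test"):
    counts = Counter()
    for doc in documents:
        for sentence in re.split(r"[.!?]", doc):
            tokens = re.findall(r"[a-z]+", sentence.lower())
            if focus in tokens:
                counts.update(t for t in set(tokens) if t != focus)
    return counts

print(cooccurrences(docs, focus="test").most_common(5))
```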

(2) https://tm4ss.github.io/docs/Tutorial_5_Co-occurrence.html


Figure 8. Co-occurrence of terms with the term 'test'.
The co-occurrence of terms for 'performance challenge' from the years 2012-2020 is illustrated in Figure 9 below. Some of the tasks visible in the visualization, such as finding bugs (a challenge at large scale when bugs are scale-dependent) and worst-case input generation (a challenge for stress testing in large-scale systems), also emerge in the thematic analysis of the literature.


Figure 9. Co-occurrence of terms for 'performance challenge' from the years 2012-2020.

5.3 Functional suitability Functional suitability has three sub-characteristics: functional completeness, functional correctness, and functional appropriateness (ISO/IEC 25010 (1)). The first, functional completeness, is the degree to which the set of functions covers all specified tasks and objectives. The second, functional correctness, is the degree to which a system or product provides correct results with the needed degree of precision. The third, functional appropriateness, is the degree to which the functions facilitate the accomplishment of the intended tasks and objectives (1). Under functional suitability, three challenges were identified.


1. Database access code is black-boxed, so it is a grey area how good the written test cases are, and functional and performance problems are hard to identify [41]. This challenge affects all three functional suitability sub-characteristics.
- Completeness: it is not easy to verify whether all functions are fully implemented.
- Correctness: it is challenging to detect functional problems or issues.
- Appropriateness: it is not easy to show whether functions accomplish and fulfill the specified requirements.
Mitigation: [41] identifies five framework-specific database access patterns to address this challenge; refer to [41] for details.
2. Alignment or mapping of requirements to testing is a challenge in large-scale software development [42], especially when the requirement is unclear or wrong [R1-R5, R10]. It is also challenging to write useful, high-quality tests when there are no documented requirements [R1]. This challenge affects all three functional suitability sub-characteristics.
- Completeness: it is hard to verify whether all requirements are fully implemented or addressed in the solution.
- Correctness: it is challenging to verify the correctness of functions if the alignment of requirements to testing is missing.
- Appropriateness: it is also hard to verify that all objectives or tasks of the requirement are adequately addressed.
Mitigation: adopt more agile RE practices and agile testing [42]. Development teams should include testers, analysts, programmers, etc. [R1-R5, R10], and testers must be involved from the beginning of systemization through to product deployment.
3. Developing a common understanding of requirements, security, performance, risk, and compatibility is a challenge [R1-R3, R5, R8]. Inexperienced testers often lack a full understanding of the project. The lack of technical analysts to determine the technical risk associated with compatibility, due to multiple machine configurations and multi-version systems [R8], and the risk associated with security, is challenging [R6, R10].

This challenge affects all three functional suitability sub-characteristics.
- Completeness, correctness: if requirements are missing or unclear, it is hard to verify their completeness and correctness.
- Appropriateness: it is also challenging to ensure that functions fulfill all tasks and objectives.
Mitigation: facilitate successive meetings and communication [5, R1-R3, R5] and include expertise on the teams [R1, R8]; companies should sign a common agreement on requirements among the teams [R5]. To mitigate the risk associated with configurations, include expertise on the teams [R1, R8], reduce configuration changes, and automatically execute assertions [R8]. To mitigate the risk associated with security, provide proper security documentation, a correct understanding of security principles, and security


metrics that testers can understand and follow [R10], together with communication and a dynamic security matrix [R10].
Challenges related to functional suitability also emanate from the software development methodology a company follows, such as agile, waterfall, or iterative development [R10]. This challenge affects all three functional suitability sub-characteristics: completeness, correctness, and appropriateness. The quality of software development is affected by proper development practices, quality risk assessment, and a strict test strategy [R10]; the ability to follow a rigorous test strategy is in turn affected by the methodology the company follows.
Mitigation: teams should agree on, or have a common understanding of, the risk areas, the strategies for handling risk, and the test strategies.

5.3.1 Reflection about Challenges Related to Functional Suitability There are generally few challenges under functional suitability compared to other quality attributes. The identified LST challenges under functional suitability emanate from poor requirement management and poor software design and testing methodology. For example, the mapping of requirements to testing is the first-class challenge concerning functional suitability. Best practices to circumvent these challenges are requirement documentation, involving cross-functional team members from the inception of the project to facilitate common understanding, practicing model-based testing, and following a suitable development strategy.

5.4 Performance Efficiency Performance efficiency has three sub-characteristics: time behavior, resource utilization, and capacity (1).

Time behavior: "the degree to which the response and processing times and throughput rates of a product or system, when performing its functions, meet requirements."
Resource utilization: "the degree to which the amounts and types of resources used by a product or system, when performing its functions, meet requirements."
Capacity: "the degree to which the maximum limits of a product or system parameter meet requirements."

1. Lack of test equipment to mimic real scenarios in performance testing [R2, R7]. Simulating a production server locally and doing performance testing for an integrated, parallel, distributed, and dependable system is a challenging task [R9, R10, 21]. A system that yours depends on may not allow you to send many requests, since the system would otherwise break, which makes performance testing challenging.

This challenge affects all three performance efficiency sub-characteristics.


 Capacity it is difficult to test a maximum limit capacity of a system that meets a requirement if the test equipment is not efficient or if the production server is not simulated locally.  Time behavior  it is difficult to test the response, and the processing speed of a system meets a requirement if test equipment is not efficient or if the production server is not simulated locally.  Resource utilization it is difficult to test the amount & type of resource used by the system if test equipment that mimics the real scenarios is not sufficient or if the production server is not simulated locally. Mitigation  Use better test equipment if possible [R2] and complement performance testing with field testing, especially in the telecom domain, to mimic real scenarios[R2, R3, R7]. Other mitigation could be service virtualization or emulating the actual system that your system depends on locally [R9, R10]. 2. Identify a few performance counters from redundant performance data [43]. This challenge affects the three performance efficiency sub characteristics.  Capacity, resource utilization, and time behavior it is difficult to test resource utilization, capacity, and response time of a system that meets a requirement specification if smaller data set is not properly identified from the redundant performance data. Mitigation  The author proposes principal component analysis (PCA) to reduce performance counter to a manageable smaller dataset [43]. 3. Challenges in line with the decision for the right time to stop testing are: -  Deciding when is the right time to stop performance testing is a challenge, especially when performance data generated is repetitive [44] or/and when it’s is tested with simulated operations data [45]. Mitigation  [44] proposes to stop performance testing when the generated data became highly repetitive, and repetitiveness becomes stabilized [44].  It’s is also a challenge to decide the right time to stop [45]. Mitigation  [45] proposed mitigation compare generated stop time by an algorithm with the known perfect stop time.

These challenges affect performance efficiency sub characteristics.  Resource utilization It’s difficult to test resource utilization of a system that meets a requirement specification if performance data generation is repetitive since difficult to decide when to stop testing. 4. Challenge on creating a test case data set that mimics the actual data set is a challenge [R10, 46]. For different reasons, performance testing is not done in the production server with real data instead it’s done using a performance testing server and with data that mimic the actual data. This challenge affects the three performance efficiency sub characteristics.  Capacity, resource utilization, and time behavior it is difficult to test resource utilization, capacity, and a response time of a system that
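As an illustration of the PCA idea proposed in [43], the sketch below reduces a synthetic matrix of 30 redundant performance counters to a handful of principal components using scikit-learn. The generated data and the 95% variance threshold are assumptions made for the example, not values taken from [43].

```python
# Sketch of the PCA idea from [43]: reduce many redundant performance
# counters to a few components. The counter matrix here is synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))                     # 3 underlying signals
noise = 0.05 * rng.normal(size=(200, 30))
counters = base @ rng.normal(size=(3, 30)) + noise   # 30 redundant counters

X = StandardScaler().fit_transform(counters)
pca = PCA(n_components=0.95)  # keep components explaining 95% of variance
reduced = pca.fit_transform(X)
print(f"{counters.shape[1]} counters reduced to {reduced.shape[1]} components")
```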


3. Challenges in deciding the right time to stop testing:
- Deciding when to stop performance testing is a challenge, especially when the performance data generated is repetitive [44] and/or when the system is tested with simulated operations data [45].
Mitigation: [44] proposes stopping performance testing when the generated data becomes highly repetitive and the repetitiveness stabilizes. For the difficulty of deciding the right stop time, [45] proposes comparing the stop time generated by an algorithm with the known perfect stop time.
These challenges affect the performance efficiency sub-characteristics.
- Resource utilization: it is difficult to test that the resource utilization of a system meets the requirement specification if the generated performance data is repetitive, since it is difficult to decide when to stop testing.
4. Creating a test case dataset that mimics the actual data is a challenge [R10, 46]. For various reasons, performance testing is not done on the production server with real data; instead, it is done on a performance testing server with data that mimics the actual data. This challenge affects all three performance efficiency sub-characteristics.
- Capacity, resource utilization, and time behavior: it is difficult to test that the resource utilization, capacity, and response time of a system meet the requirement specification if a sufficient test dataset is not created from the actual dataset.
Mitigation: use data sanitization techniques to alter large datasets in a controlled manner, obfuscating sensitive information while preserving data integrity, relationships, and data equivalence classes [46]; a sketch of the idea follows.
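The sketch below illustrates one simple form of such sanitization, assuming a toy customer/order table: sensitive values are replaced by deterministic salted hashes, so identical inputs map to identical tokens and joins and equivalence classes survive. The scheme and the data are illustrative only, not the exact technique of [46].

```python
# Sketch of test data sanitization (cf. [46]): obfuscate sensitive fields
# deterministically so referential relationships and equivalence classes
# are preserved. The tables and salt below are illustrative, not from [46].
import hashlib

SALT = b"test-data-salt"  # illustrative fixed salt

def pseudonymize(value: str) -> str:
    # Same input always maps to the same token, so joins across
    # tables and equivalence classes are preserved.
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

customers = [{"id": "C001", "name": "Alice"}, {"id": "C002", "name": "Bob"}]
orders = [{"order": 1, "customer_id": "C001"},
          {"order": 2, "customer_id": "C001"}]

san_customers = [{"id": pseudonymize(c["id"]),
                  "name": pseudonymize(c["name"])} for c in customers]
san_orders = [{"order": o["order"],
               "customer_id": pseudonymize(o["customer_id"])} for o in orders]
print(san_customers)
print(san_orders)  # both orders still point at the same (obfuscated) customer
```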

5. Environment-variation-based load testing challenges: adjusting the test environment dynamically to perform load testing for a system deployed on heterogeneous platforms is a challenge [18]. This challenge affects all three performance efficiency sub-characteristics.
- Capacity, resource utilization, and time behavior: it is difficult to test that the resource utilization, capacity, and response time of a system meet the requirement specification if the test environment is not adjusted dynamically.
Mitigation: [18] developed a modeling technique that leverages the results of existing environment-variation-based load tests and performs better than baseline models when predicting the performance of realistic runs.
6. Challenges related to bugs and faults in large-scale systems: it is a challenge to effectively identify, detect, or investigate bugs or faults
a. When bugs are scale-dependent [29]. Mitigation: the authors developed WUKONG, which can automatically and scalably diagnose such bugs [29].
b. When the large-scale system possesses big data [47]. Mitigation: principal component analysis (PCA) can be used to detect faults [47].
c. When identifying bugs and improving the test infrastructure [48, 49]. Mitigation: apply model-based testing [48, 49].
d. In a highly configurable system [50, 51, 52, R3]. Mitigation: the authors developed NAR-Miner [50], which extracts negative association programming rules and detects their violations to find bugs, and genetic configuration sampling [51], which learns a sampling strategy for fault detection in highly configurable systems.
e. For a large combination of features [52]. Mitigation: the authors combine random and combinatorial testing [52].
f. Due to large numbers of false positives and false negatives (using code mining) [53]. Mitigation: the authors developed EAntMiner to improve the effectiveness of code mining [53].
g. When performance bugs are workload-dependent [54]. Mitigation: the authors developed DYNAMO to find an adequate workload [54].
h. In large-scale data processing (e.g., Adobe Analytics) [55]. Mitigation: apply combinatorial (e.g., pair-wise) testing [55]; a minimal pair-wise sketch follows this challenge list.

i. Fault investigation is time-consuming and challenging [R3]. Mitigation: incorporate machine learning (ML) and AI-based testing into the testing process [R2].

These challenges affect all three performance efficiency sub-characteristics.
- Capacity, time behavior, and resource utilization: it is difficult to test whether the capacity, response time, processing speed, and resource utilization of a system meet the requirements if a fault is present in the system.
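As referenced in item g above, the following is an illustrative workload search, not the published DYNAMO algorithm [54]: ramp up the load until the measured response time degrades past a threshold, which is how workload-dependent performance bugs are exposed. The `measure_response_time` hook, the SLA threshold, and the stub system are all assumptions.

```python
def find_adequate_workload(measure_response_time, sla_ms=200.0,
                           start=10, step=10, max_users=10_000):
    """Return the smallest concurrency level that violates the SLA."""
    users = start
    while users <= max_users:
        if measure_response_time(users) > sla_ms:
            return users
        users += step
    return None  # no degradation observed in the tested range

# Stub system under test: latency degrades sharply above ~500 users.
print(find_adequate_workload(lambda u: 50 + (u / 500) ** 8 * 100))
```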

In large-scale systems it is also a challenge to effectively:
a. Predict faults [31, 56]. Mitigation: software fault prediction algorithms should be carefully examined and combined, which gives better results than using a single machine learning technique [56]. The authors used a heterogeneous ensemble fault prediction model to predict different faults in different data sets and reduce error [57].
b. Predict and maintain flaky tests [31]. Mitigation: use machine learning (ML), AI, and Bayesian networks to classify and predict flaky tests [31]; a minimal sketch appears below.

These challenges affect all three performance efficiency sub-characteristics.
- Capacity, time behavior, and resource utilization: it is difficult to test whether the capacity, response time, processing speed, and resource utilization of a system meet the requirements if a fault due to flakiness is present in the system.
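A minimal sketch of ML-based flaky-test prediction in the spirit of [31], using scikit-learn; the per-test features and the tiny training set are illustrative assumptions rather than a validated model.

```python
from sklearn.ensemble import RandomForestClassifier

# Features per test: [avg runtime (s), uses network?, uses sleep?,
#                     reruns whose verdicts disagreed]
X = [[0.2, 0, 0, 0], [5.1, 1, 1, 3], [1.3, 0, 1, 1],
     [0.4, 0, 0, 0], [7.9, 1, 0, 4], [2.2, 1, 1, 2]]
y = [0, 1, 1, 0, 1, 1]  # 1 = observed flaky in CI history

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[4.0, 1, 1, 2]]))  # likely flaky -> quarantine or rerun
```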

7. Challenges related to performance testing:
a. Performing load testing on web applications running under heavy loads (a data tsunami) is challenging [58]. Mitigation: peer-to-peer load testing is valuable for isolating bottleneck problems [58].
b. Doing performance testing effectively, efficiently, and completely [R2, R3]. For instance, it is difficult to test the system fully if the required response time (RTT) is not compatible with the test environment [R2, R3]. Mitigation: field testing is an option in the telecom domain [R2, R3].
c. Performing load testing in the cloud is challenging due to configuration, capacity allocation, and the tuning of multiple system resources [59]. Mitigation: SvLoad is proposed to compare the performance of cache and server [59].
d. Detecting performance deviations in components during load tests is challenging [60, 61]. Mitigation: [60] presents principal component analysis (PCA) to identify subsystems that show performance deviations in load tests; [61] automates load tests using a supervised (search-based) and an unsupervised approach to uncover performance deviations.

e. Obtaining accurate performance testing results for cloud applications is challenging due to performance fluctuations [62]. Mitigation: the authors propose a cloud performance testing methodology called PT4Cloud, which provides reliable stop conditions to obtain highly accurate performance distributions with confidence bands [62].
f. Performing load testing in a cost-effective and time-efficient manner is challenging [63, R4], as is evaluating the effectiveness of load testing analysis techniques [63]. Mitigation: among statistical test analysis techniques, control charts, descriptive statistics, and regression trees yield the best performance [63]; a minimal control-chart check is sketched after this list. To speed up testing and reduce testing costs in general, apply the best practices of large-scale testing and containerization (e.g., Docker) [R4].
g. Managing duplicate bugs and providing reasonable compensation in Crowdsourced Mobile Application Testing (CMAT) is challenging [64]. Mitigation: the authors offer guidelines on managing duplicate and invalid bug reports, together with a compensation schema for testers, to make CMAT effective and address the challenges.
h. Bug debugging and bug report diagnosis are expensive, time-consuming, and challenging in large-scale systems and systems of systems [65]. Mitigation: the authors propose an automated approach that assists the debugging process by reconstructing the execution path, which saves time.
i. Identifying the appropriate load to stress a system and expose performance issues is challenging [54]. Mitigation: the authors propose DYNAMO, which automatically finds an adequate workload for a web application [54].

These challenges affect all three performance efficiency sub-characteristics.

- Capacity, time behavior, and resource utilization: it is difficult to test whether the capacity, response time, processing speed, and resource utilization of a system meet the requirements if the appropriate load to stress the system is not identified correctly. It is also difficult to find performance deviations in components during load tests if systems are not fully load tested.
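As referenced in item f above, a minimal control-chart check, one of the load-test analysis techniques reported effective in [63]: flag a run whose mean response time falls outside the baseline mean plus or minus three standard deviations. The baseline numbers and build names are illustrative.

```python
import statistics

baseline = [101, 98, 103, 99, 100, 102, 97, 101]  # healthy runs (ms)
mu = statistics.mean(baseline)
sigma = statistics.stdev(baseline)
lower, upper = mu - 3 * sigma, mu + 3 * sigma

for run, mean_ms in [("build-101", 100.5), ("build-102", 131.0)]:
    status = "OK" if lower <= mean_ms <= upper else "DEVIATION"
    print(run, status)  # build-102 is flagged as a performance deviation
```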

8. Classical load testing in CI/CD: the classical approach to load testing is hard to apply in parallel, frequently executed delivery pipelines (CI/CD) [66]. Mitigation: the author designed context-specific, low-overhead load testing for continuous software engineering, spanning the testing process from measurement data recorded in production to load test results; for further detail, see [66].

9. Effective fault or bug localization, for both single and multiple faults, has a long turn-around time and is challenging [67, 68, 69]. These challenges affect all three performance efficiency sub-characteristics.

- Capacity, time behavior, and resource utilization: it is difficult to test whether the capacity, response time, processing speed, and resource utilization of a system meet the requirements if faults are not localized effectively or if faulty components are not identified.
Mitigation: for single-fault localization, spectrum-based fault localization (SFL) methods are effective and efficient, while for multi-fault localization the fast software multi-fault localization framework (FSMFL), based on genetic algorithms, addresses multiple faults simultaneously [69]. For single faults, fault localization based on complex network theory for single-fault programs (FLCN-S) improves localization effectiveness [68], and program slicing spectrum-based software fault localization (PSS-SFL) locates faults more precisely [126]. Classification can be used to predict the location of bugs and reduce turn-around time [67]. A minimal SFL sketch follows.
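The following is a minimal spectrum-based fault localization sketch using the Ochiai suspiciousness metric, one common SFL formula; it is not the FSMFL or FLCN-S implementation, and the coverage data and test verdicts are illustrative.

```python
from math import sqrt

coverage = {            # statement -> tests that execute it
    "s1": {"t1", "t2", "t3"},
    "s2": {"t1", "t3"},
    "s3": {"t2"},
}
failing, passing = {"t1", "t3"}, {"t2"}

def ochiai(stmt):
    ef = len(coverage[stmt] & failing)   # failing tests covering stmt
    ep = len(coverage[stmt] & passing)   # passing tests covering stmt
    denom = sqrt(len(failing) * (ef + ep))
    return ef / denom if denom else 0.0

# Statements covered mostly by failing tests rank as most suspicious.
for stmt in sorted(coverage, key=ochiai, reverse=True):
    print(stmt, round(ochiai(stmt), 2))  # s2 tops the ranking
```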

10. Challenge of achieving and maintaining system performance at the desired level: achieving the required system performance, and maintaining it while reducing cost to the desired level, is a challenging task in large-scale systems [70]. This challenge affects all three performance efficiency sub-characteristics.
- Capacity, resource utilization, and time behavior: it is difficult to test whether the resource utilization, capacity, and response time of a system meet the requirement specification if the system under test does not achieve and maintain the required performance efficiency.
Mitigation: establish performance quality gates (performance trending, speed and efficiency, reliability) at various points of the software integration phases to avoid or reduce unforeseen system performance degradations [70].
11. Challenges of microservices: in a large-scale microservices architecture, where thousands of coexisting applications must work cooperatively, the challenges are task scheduling [71], resource scheduling [72], and performance modeling [71].

Mitigation: [71] designed a performance modeling and prediction method for microservice-based applications, formulated the microservice-based application workflow scheduling problem for minimum end-to-end delay under a user-specified budget constraint (MAWS-BC), and proposed a heuristic microservice scheduling algorithm. Kubernetes provides a deployment tool, Kubemark, to address performance and scalability issues [72].
12. Analytics-driven load testing challenges: designing, executing, and analyzing a load test can be very difficult due to the scale of the test (e.g., simulating millions of users and analyzing terabytes of data) [73]. The analytics-driven load testing challenges are:
- The test design challenge: ensuring performance quality when the system is deployed in a production environment.
- The test execution challenge: reducing test execution effort.

- The test analysis challenge: determining whether a test passes or fails.
This challenge affects two performance efficiency sub-characteristics.
- Resource utilization and time behavior: it is difficult to test whether the resource utilization and response time of a system meet the requirement specification if load testing is not properly designed, executed, and analyzed.
Mitigation: [73] suggests (1) for test design, aggregating and assessing user-level behavior when user behavior can affect system quality; (2) for test execution, having the performance analyst quickly determine whether a test is invalid and stop it; (3) for test analysis, using keywords to locate error logs and, if logs are the only indication of anomalies, using the Z-test to identify anomalous event sequences (a minimal sketch appears at the end of this subsection).
13. Challenges of testing large-scale (critical, distributed, etc.) systems:
- Testing large-scale critical systems demands intensive computation, multidimensional testing, and a combination of different kinds of testing (functional, random, etc.), so it is challenging [20]. Mitigation: for critical systems, [20] identified testing workflows and designed an automated framework, ATLAS, to perform extensive tests on large-scale critical software.
- It is very difficult to test parallel, distributed, and dependable systems sufficiently, since doing so demands concurrency [21]. Mitigation: for distributed systems, the authors proposed D-Cloud, a testing environment for dependable parallel and distributed systems based on cloud computing technology, which runs multiple test cases simultaneously and enables automatic configuration [21].
These challenges affect all three performance efficiency sub-characteristics.
- Capacity: it is difficult to test whether the maximum capacity of a system meets its requirement if the system demands intensive computation and concurrency.
- Time behavior: it is difficult to test whether the response time and processing speed of a critical system meet their requirements, since such systems demand intensive computation and concurrency.
- Resource utilization: it is difficult to test the amount and type of resources used by a critical system that demands intensive computation and concurrency.
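As referenced above, a minimal sketch of the Z-test suggestion in [73]: compare the error-log rate of a load-test run against a known baseline rate to flag anomalous event sequences. The counts and the baseline rate are illustrative.

```python
from math import sqrt

def z_score(errors, events, baseline_rate):
    # One-proportion z-test: is this run's error rate consistent with
    # the baseline rate, given the number of observed events?
    p = errors / events
    se = sqrt(baseline_rate * (1 - baseline_rate) / events)
    return (p - baseline_rate) / se

z = z_score(errors=58, events=10_000, baseline_rate=0.004)
print(f"z = {z:.1f}", "-> anomalous" if abs(z) > 1.96 else "-> normal")
```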

5.4.1 Reflection about Challenges Related to Performance Efficiency
Performance testing is the most significant task in LST. Fault investigation is one of the most common and time-consuming performance-related challenges, and incorporating ML and AI into the testing process is one of the recommended mitigations for it.

5.5 Compatibility
Compatibility is the degree to which a system, component, or product can exchange information with other systems, components, or products, and/or perform its required functions while sharing the same software or hardware environment3. Compatibility is composed of two sub-characteristics, co-existence and interoperability3. The first, interoperability, refers to the degree to which two or more systems, products, or components can exchange information and use the exchanged information3. The second, co-existence, refers to the degree to which a system, component, or product performs its required tasks or functions efficiently while sharing the same resources or environment, without affecting others negatively3.

Co-existence- “The degree to which a product can perform its required functions efficiently while sharing a common environment and resources with other products, without a detrimental impact on any other product.” Interoperability- “The degree to which two or more systems, products or components can exchange information and use the information that has been exchanged.”

Challenges identified in this section are:
1. Challenge of compatibility testing for multi-component, highly configurable, multi-version evolving systems [R10, 74], including compatibility testing for a large combination of features in a multi-component configurable system [52, R10, 74]. Similarly, it is infeasible to test and configure a system when large combinations of features are selected in a large and complex system [52].

This challenge affects the two compatibility sub-characteristics.
- Co-existence and interoperability: it is difficult for systems or components to perform their required tasks and/or exchange messages efficiently while sharing the same resources and/or environment if the combination of features is not properly configured or if all potential configurations are not fully tested.
Mitigation: combine random testing with combinatorial testing [52]; [74] proposed Rachet, an efficient and scalable approach to compatibility testing; prioritize configurations and execute the highly prioritized ones [R10]; reduce configuration as much as possible [R8].
2. Challenge of incremental and prioritized compatibility testing for multi-component, multi-version, highly configurable evolving systems [52, 75-77, R10]. This challenge affects the two compatibility sub-characteristics.

3 https://iso25000.com/index.php/en/iso-25000-standards/iso-25010?limit=3&start=3

- Co-existence and interoperability: it is difficult for systems or components to perform their required functions efficiently and/or communicate with each other while sharing a common resource or environment if the system is not configured and tested in an incremental and prioritized way.

Mitigation: [77] presented an approach that supports incremental and prioritized compatibility testing with cache-aware configuration for component-based multi-version systems that evolve independently over time.
3. Challenge of interoperability when different systems are integrated: the demand for highly reliable dynamic configuration of a large-scale distributed system raises challenges related to dynamic configuration, interoperability, and trustworthiness (security) [19]. This challenge affects the two compatibility sub-characteristics.
- Co-existence and interoperability: it is difficult for systems or components to perform their required tasks efficiently and/or communicate with each other while sharing the same resources and/or environment if components are not interoperable or are untrustworthy.
Mitigation: use a homogeneous environment as much as possible and have the API specification agreement signed off by the teams [R5]; use service-oriented architecture (SOA) to enhance interoperability [R5].
4. Challenge of test optimization while assuring coverage for compatibility testing [78, 79]. This challenge affects the two compatibility sub-characteristics.
- Co-existence and interoperability: it is difficult for systems or components to perform their required tasks and/or exchange messages efficiently while sharing the same resources and/or environment if the system is not tested with optimized test suites, since it is infeasible to test all potential configurations [78, 79].

Mitigation  Authors proposed a test case selection approach by analyzing the impact of configuration changes with static program slicing [78], and pair-wise [52, 125] as a mitigation for the challenge. 5. Challenge on selecting environment for compatibility testing due to limited resources and large variants to test [80]. This challenge affects the two compatibility sub characteristics.  Co-existence, interoperability It’s difficult for systems or components to perform their required tasks or/and to share messages efficiently while sharing the same resource or/and environment if the environment is selected appropriately and a combination of features is not properly configured and tested. Mitigation  Authors proposed an evolutionary algorithm using ML to select the best configurations and environment sensitivity measures to discover defects [80].

6. Lack of knowledge and a common understanding of configuration, risk, security, and performance [R8], and a lack of technical analysts to determine the technical risk associated with compatibility and other quality attributes (security, performance, etc.) in multi-configuration, multi-version systems. This challenge affects the two compatibility sub-characteristics [R8].
- Co-existence and interoperability: it is difficult for systems or components to perform their required tasks and/or exchange messages efficiently while sharing the same resources and/or environment if the technical risks related to compatibility are not properly handled.
Mitigation: include technical risk-analysis expertise in the team, reduce configuration changes, and apply best practices for assertion and mutation [R8].
7. Platform compatibility challenge: business conducted without a sense of ethics affects compatibility negatively. For instance, the Enhanced Interior Gateway Routing Protocol (EIGRP) is a better routing protocol than Open Shortest Path First (OSPF), but EIGRP interacts only with Cisco routers, whereas OSPF is compatible with routers from different vendors [R7]. This challenge affects the two compatibility sub-characteristics.
- Co-existence and interoperability: it is difficult for systems or components to perform their required tasks and/or exchange messages efficiently while sharing the same resources and/or environment if protocols are not compatible with one another [R7].
Mitigation: build a common platform through which products, protocols, and components interact and communicate with one another [R7].

5.5.1 Reflection about Challenges Related to Compatibility
The integration of heterogeneous systems raises compatibility issues due to multiple configurations and versioning, making it difficult to test all potential configurations. Determining in advance the technical risks of a system related to compatibility is critical, especially for critical and dependable large-scale systems. Technical analysts play a crucial role in reducing the occurrence of risks, and prioritizing compatibility according to relevance and risk factors is also important.

5.6 Usability
Usability is composed of the sub-characteristics appropriateness recognizability, learnability, operability, user error protection, user interface aesthetics, and accessibility3, defined as follows:

Appropriateness recognizability - “The degree to which users can recognize whether a product or system is appropriate for their needs.” Learnability- “The degree to which a product or system can be used by specified users to achieve specified goals of learning to use the product or system with effectiveness, efficiency, freedom from risk, and satisfaction in a specified context of use.”

Operability- “The degree to which a product or system has attributes that make it easy to operate and control.”
User error protection- “The degree to which a system protects users against making errors.”
User interface aesthetics- “The degree to which a user interface enables pleasing and satisfying interaction for the user.”
Accessibility- “The degree to which a product or system can be used by people with the widest range of characteristics and capabilities to achieve a specified goal in a specified context of use.”

Identified challenges in this section are:
1. Challenges related to usability:
- Usability and/or accessibility testing is given low priority and few resources (time, staff), for example in the telecom domain [R1-R4]. When usability or accessibility is less acknowledged, the designer cannot design the GUI according to the interests of the target groups or users [R3]. Mitigation: for the accessibility challenge, outsource to save resources such as time, cost, and people [R3], and use focus groups for usability testing [R4].

- Detecting duplicate crowd-testing reports effectively is a challenge [81]. Mitigation: the authors propose SETU, a hierarchical algorithm that combines information from the screenshots and the textual descriptions (hence the name) to detect duplicate crowd-testing reports [81].

These challenges affect two usability sub-characteristics.
- Accessibility and operability: it is difficult for a product to be widely accessible to its target group if it is designed without regard to the interests of the users or target groups, and critical bugs in a component or API cannot easily be identified if the product is not easily operable or accessible.

5.6.1 Reflection about Challenges Related to Usability
Usability testing is less acknowledged in large-scale software testing, and it is difficult to test a GUI in relation to accessibility. Since usability testing can enable the identification of critical bugs, addressing it is worthwhile.

5.7 Reliability
Reliability is composed of the sub-characteristics maturity, availability, fault tolerance, and recoverability3, defined as follows:

Maturity- “The degree to which a system, product or component meets needs for reliability under normal operation.”

Availability- “The degree to which a system, product or component is operational and accessible when required for use.” Fault tolerance- “The degree to which a system, product or component operates as intended despite the presence of hardware or software faults.” Recoverability- “The degree to which, in the event of an interruption or a failure, a product or system can recover the data directly affected and re-establish the desired state of the system.”

Identified challenges in this section are:
1. Service disruption challenge: service disruptions are hard to handle because of the difficulty of reproducing bugs [R1, 82]. Service disruptions surface during the stress testing done to verify the reliability of a system. Service disruption faults are extremely critical and are often related to loss of life, financial loss, and customer dissatisfaction. Unfortunately, a non-reproducible service disruption fault (one that appears only once) is hard to fix due to time constraints and difficulty; only reproducible service disruption faults (those that appear at least twice) are fixed before the product is delivered [R1]. Encountering non-reproducible service disruption faults during stress testing is therefore the main challenge, and as a result the system is not reliable. This challenge affects three reliability sub-characteristics.
- Availability: it is difficult for a component or system to be operational and accessible when required for use if service disruption faults occur or if bugs are difficult to reproduce.
- Fault tolerance: it is difficult for a component or system to operate as intended when a service disruption fault occurs or if bugs are difficult to reproduce.
- Recoverability: it is difficult for a component or system to recover its data when a service disruption fault occurs or if bugs are difficult to reproduce.
Mitigation: assign enough resources (time, programmers) to fix any service disruption fault, even one that appeared only once; the trade-off is between time to service delivery and quality [R1]. Generate worst-case reproducible test inputs from dynamic symbolic execution [85, 86]. [82] provides guidance and a set of suggestions for successfully replicating performance bugs and improving the bug reproduction success rate.
2. Achieving an acceptable level of reliability is challenging [R2]. It is hard to find a system that is 100% reliable, and achieving an acceptable level of reliability is both challenging and project-dependent. One source of this challenge is the test environment: if one intends to achieve a reliable system, then the test environment (hardware, software, etc.) should itself be reliable. Hardware can be affected by factors such as heating, bit flips, and interfaces (format conversions back and forth), while software can be affected by factors such as wrong scripts and misunderstandings [R2]. This challenge affects all four reliability sub-characteristics.

 Availability it might be difficult for a component or a system to be operational and accessible when required for use if a system is not achieved to an acceptable level of reliability.  Fault tolerance  it is difficult for a component or a system to operate as intended if a system is not achieved to an acceptable level of reliability.  Recoverability it is difficult for a component or a system to recover the data if a system is not achieved to an acceptable level of reliability.  Maturityit is difficult for a component or a system meets the needs for reliability under a normal operation if a system is not achieved to an acceptable level of reliability. Mitigation  Use CI/CD and versioning [R3], Cooling the system if hardware error due to heating [R2], Test the test script if the error comes from the test script (Write test script that tests the test script itself) [R2], No mitigation if the hardware error comes due to bit-flip [R2] are few of the possible mitigations. 3. Challenge to discover the root cause of failure [R1, 83]. Understand how robustness tests can be broken down from system level to low-level perspective to understand the root cause failures is time-consuming and a challenge [R1, 83]. Detecting the root cause of performance regressions is also a challenge [120].

This challenge affects the reliability sub-characteristics.
- Availability: it is difficult for a system or component to remain operational when failures whose root cause is unknown occur.
- Recoverability: it is difficult for a system or component to recover data directly affected by a failure or interruption.
- Fault tolerance: it is difficult for a system or component to operate as intended when such failures occur.
Mitigation: collect all possible logs, automate a system that manages root cause analysis, and involve all teams [R2]; [120] extensively evaluates and studies the root causes of performance regressions at the commit level.
4. Testing the correctness and reliability of large-scale cloud-based applications is costly and challenging, and traditional simulation- or emulation-based alternatives are too abstract or too slow for testing [84]. This challenge affects three reliability sub-characteristics.
- Availability, fault tolerance, and recoverability: it is difficult for a component or system to be operational and accessible when required for use, to operate as intended, or to recover its data if a large-scale cloud-based application is not reliable.
Mitigation: a testing approach that combines simulation and emulation in a cloud simulator running on a single processor [84].

5. Generating scalable worst-case test inputs from dynamic symbolic execution for use in stress or performance testing is challenging [85, 86]. Automatic test input generation by dynamic symbolic execution (DSE) is difficult to scale up due to path space explosion. This challenge affects:

 Availability it is difficult for a component or a system to be operational and accessible when required for use if a system or component is not fully tested for stress testing through worst-case test inputs.  Fault tolerance  it is difficult for a component or a system to operate as intended if components or systems are not fully tested for stress testing to test the systems or components fully in advance.  Recoverability it is difficult for a component or a system to recover the data if system or components are not tested in advance for stress testing through worst-case test inputs. It can also affect sub-characteristics of performance efficiency (resource utilization, time behavior)  Resource utilization, time behavior it is difficult to test resource utilization, a response time of a system that meets a requirement specification if test input generation by DSE is required high computational cost and path space explode. Mitigation  Authors [86] proposed algorithm PySE and modeling technique XSTRESSOR [85] generate dynamic worst-case test inputs by reinforcement learning and incrementally update the branch policy and by inferring path conditions, respectively. The author proposed to use a combination of dynamic symbolic execution (DSE) and random search [86]. 6. Given a reliability constraint and limited resources, optimal resource allocation is a challenging task [22-23, 87-94]. For instance, minimizing software testing cost under limited available resources. Lack of resources when performing robustness testing [R8] This challenge affects the three reliability sub characteristics.  Availability it is difficult for a component or a system to be operational and accessible when required for use if there is a lack of resources (memory, hardware, etc.).  Fault tolerance  it is difficult for a component or a system to operate as intended if there is a lack of resources (memory, hardware, etc.).  Recoverability it is difficult for a component or a system to recover the data if there is a lack of resources (memory, hardware, skilled professionals, tools, etc.). Mitigation  Using a genetic algorithm to solve the optimization problem through learning from historical data [23]. Others authors proposed different alternative mitigations genetic algorithm [22, 23], Harmonic distance-based multi-objective evolutionary algorithm [87, 88], combination of software reliability growth models (SRGMs) and systematic and sequential

These include a genetic algorithm [22, 23], a harmonic-distance-based multi-objective evolutionary algorithm [87, 88], a combination of software reliability growth models (SRGMs) and a systematic, sequential algorithm [89], an optimization algorithm based on the Lagrange multiplier method [90], search-based strategies [91], a multi-objective evolutionary algorithm [92], a genetic local search algorithm [93], and a combination of SRGMs and a genetic algorithm [94]; a toy allocation sketch follows below. In relation to robustness testing, checking the log history, acting if the problem relates to garbage collection, and increasing memory are a few of the possible mitigations suggested by [R8].
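As referenced above, the following is a toy greedy allocation, much simpler than the genetic algorithms of [22, 23], meant only to illustrate the underlying optimization problem: spend a fixed testing budget where it buys the largest reliability gain. The module names, detection rates, and the exponential reliability growth model are all assumptions.

```python
from math import exp

modules = {"ui": 0.020, "core": 0.035, "db": 0.010}  # fault detection rates
budget_hours, step = 300, 10
spent = {m: 0 for m in modules}

def gain(m):
    # Marginal drop in the expected residual-fault fraction exp(-b*t)
    # if we give module m another `step` hours of testing.
    b, t = modules[m], spent[m]
    return exp(-b * t) - exp(-b * (t + step))

for _ in range(budget_hours // step):
    spent[max(modules, key=gain)] += step

print(spent)  # the greedy loop equalizes marginal gains across modules
```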

7. Validating and testing the safety, reliability, and security of an autonomous car is challenging [95]. This challenge affects:
- Availability, fault tolerance, and recoverability: it is difficult for a component or system to be operational and accessible when required for use, to operate as intended, or to recover its data if an autonomous car is not reliable, safe, and secure.
Mitigation: one suggested mitigation is that all stakeholders agree on a common protocol and on inclusive, thoughtful solutions that satisfy consumer, industry, and government requirements, regulations, and policies [95].

5.7.1 Reflection about Challenges Related to Reliability
Detecting and investigating the root causes of failures are the most challenging endeavors in LST. Non-reproducible faults also disrupt services; such faults are inherent, for example, in the telecom sector. Clear, well-documented requirements, collecting the log history, careful resource allocation, and generating worst-case test inputs are the most important measures for mitigating reliability-related challenges.

5.8 Security
Security is composed of the sub-characteristics confidentiality, integrity, non-repudiation, accountability, and authenticity4, defined as follows:

Confidentiality - “The degree to which a product or system ensures that data are accessible only to those authorized to have access.” Integrity- “The degree to which a system, product or component prevents unauthorized access to, or modification of, computer programs or data.” Non-repudiation- “The degree to which actions or events can be proven to have taken place so that the events or actions cannot be repudiated later.” Accountability- “The degree to which the actions of an entity can be traced uniquely to the entity.” Authenticity- “The degree to which the identity of a subject or resource can be proved to be the one claimed.”

4 https://iso25000.com/index.php/en/iso-25000-standards/iso-25010?limit=3&start=6

1. Developing a common understanding of security activities, the risks associated with security, and security awareness among teams is a challenge [R1-R3, R6, R10, 5]. When teams lack knowledge of security (risk, test coverage, metrics, etc.), the product may not be fully tested [R10] and open ports may not be properly tested [R1, R2]. This challenge affects three security sub-characteristics.
- Confidentiality: it is difficult for a system or component to ensure that data are accessible only to authorized users if teams do not have a common understanding of security activities.
- Integrity: it is difficult for a system or component to prevent unauthorized access to, or modification of, data if teams do not have a common understanding of security activities.
- Authenticity: it is difficult for a system or component to prove that resources are accessed only by authorized and privileged users if teams do not have a common understanding of security activities.
Mitigation: knowledge sharing and meetings [R1-R3, R6]; good security documentation and security metrics for testers to follow [R10].
2. Lack of a hacker's mindset when security testing is done [R7, R10]. This challenge affects two security sub-characteristics.
- Confidentiality: it is difficult for a system or component to ensure that data are accessible only to authorized users if teams lack a hacker's mindset when security testing is done.
- Integrity: it is difficult for a system or component to prevent unauthorized access to, or modification of, data if teams lack a hacker's mindset when security testing is done.
Mitigation: properly understand security principles and security verification, communicate with a security specialist, and share knowledge [R7, R10]; keep knowledge updated via a dynamic security matrix [R10].

3. Level of trust: how much one trusts the infrastructure, platform, tools, and security experts determines to what extent one exposes the system for security testing [R3]. This challenge affects two security sub-characteristics.
- Confidentiality: it is difficult for a system or component to ensure that data are accessible only to authorized users if a security expert lacks a sense of ethics or if the system is not fully exposed to security testing.
- Integrity: it is difficult for a system or component to prevent unauthorized access or modification if a security expert lacks a sense of ethics.
Mitigation: outsource security testing to a consultancy; deploy the web service on a trusted cloud, for instance Amazon Web Services [R7].
4. Coping with security threats and security verification is a challenge [R8]. This challenge affects three security sub-characteristics.

 Confidentiality it is difficult for a system or a component to ensure data are only accessible by the authorized users if teams don’t cope with security threat and security verification.  Integrity it is difficult for a system or a component to prevent unauthorized access to, modification of data if teams don’t cope with security threat and security verification.  Authenticity it is difficult for a system or a component to prove the accessibility of resources is only from authorized and privileged users if teams don’t cope with security threat and security verification. Mitigation  Upgrade and share knowledge, cover the high priorities secrete issues [R8]. Deploy the system in a secure cloud [R7]. Security specialist creates a dynamic security matrix and should be up to date on security metrics, attacks, etc.[R10]. 5. Alignment of security objectives in distributed settings and the Integration of low-overhead security testing tools are challenges in large-scale agile [5]. This challenge affects the two security sub characteristics.  Confidentiality it’s difficult for a system or a component to ensure data are only accessible by the authorized users if the security requirement is not fully mapped to testing.  Integrity it’s difficult for a system or a component to prevent unauthorized access to, modification of data if security alignment is not properly implemented. Mitigation Knowledge sharing and meeting [R1-R3, R6]. Serious meeting with security expertise [R6]. 6. Effective detection of Denial of Service (DDoS) on of Software Defined Network (SDN) is challenging in a large-scale system [96] This challenge affects the two security sub characteristics.  Authenticity it’s difficult for a system or a component to prove the accessibility of resources is only from authorized and privileged users if SDN is not effectively prevented from Denial of Service (DDoS) attack.  Accountability - It’s difficult for the actions of an entity can be traced uniquely to the entity if Denial of Service (DDoS) attacks of SDN not detected effectively.

Mitigation: a deep-learning (DL) based convolutional neural network (CNN) is the proposed mitigation [96].

7. Security-oriented risk assessment is challenging [97]. This challenge affects two security sub-characteristics.
- Confidentiality: it is difficult for a system or component to ensure that data are accessible only to authorized users if security risks and security code auditing are not properly addressed.
- Integrity: it is difficult for a system or component to prevent unauthorized access to, or modification of, data if security risks and security code auditing are not properly addressed.

Mitigation Author combines security testing and risk assessment using method and tools, ACOMAT, [97] is the proposed solution. 8. Testing large-scale systems with safety and security: - critical relevance efficiently is a challenge due to a large number of requirements for several products variant and various configurations [98]. This challenge affects the two security sub characteristics.  Confidentiality it is difficult for a system or a component to ensure data are only accessible by the authorized users if safety and security-critical relevance are not tested efficiently.  Integrity it is difficult for a system or a component to prevent unauthorized access to, modification of data if safety and security-critical relevance is not tested efficiently.

Mitigation  Awareness and communication among teams, Keep the testbed environment simple and easily manageable [98]. 9. Security challenges due to introducing Static application security testing (SAST) in large-scale are[99]  Technical (High false positive and false negative rates)  Non-technical (Insufficient security awareness among the teams and integrating SAST in agile). This challenge affects the two security sub characteristics.  Confidentiality it’s difficult for a system or a component to ensure data are only accessible by the authorized users if SAST introduces High false positive and false negative rates.  Integrity it’s difficult for a system or a component to prevent unauthorized access to, modification of data if SAST introduces High false positive and false negative rates. Mitigation Work closely with a small team; pick a handful security problem first; measure the outcome; plan and invest enough resources & infrastructure (Introducing SAST requires significant resources); Execute scans regularly (SAST is not a one-time effort that solves all security issues on-and-for-all); plan your changes and updates, etc. are the possible mitigation.

10. Security-oriented code auditing is expensive, time-consuming, and challenging in a large-scale system [100]. This challenge affects two security sub-characteristics.
- Confidentiality: it is difficult for a system or component to ensure that data are accessible only to authorized users if security risks and security code auditing are not properly addressed.
- Integrity: it is difficult for a system or component to prevent unauthorized access to, or modification of, data if security risks and security code auditing are not properly addressed.
Mitigation: combining code auditing tools such as PScan with security expertise [100] is one proposed mitigation.

5.8.1 Reflection about Challenges Related to Security
The main security issue in LST is that team members lack adequate knowledge about security-related risks and security metrics. Testers also lack the skill to assess whether security requirements are implemented in systems. Team leaders should ensure that security expertise is available in the team, in addition to facilitating knowledge sharing.

5.9 Portability
Portability is composed of the sub-characteristics adaptability, installability, and replaceability4, described below:

Adaptability- refers to how easily a product or system can effectively and efficiently adapt to evolving hardware, software, or other environments.
Installability- refers to how easily a product can be successfully installed in and/or uninstalled from a specified environment.
Replaceability- refers to the degree to which a product or component can be replaced by another for the same purpose in the same environment.

One challenge was identified in this section.
1. It is a challenge to assure quality for an enterprise system that contains thousands of interconnected subsystems deployed in an enterprise environment [101]. This challenge affects the three portability sub-characteristics.
- Adaptability: it is not easy for an enterprise system that contains thousands of interconnected subsystems to adapt efficiently and effectively to an evolving environment, hardware, or software.
- Installability: successfully installing and/or uninstalling a product in its environment may not be easy for an enterprise system that contains thousands of interconnected subsystems.
- Replaceability: maintaining the existing quality by replacing an existing component with another for the same purpose in the same environment may not be easy for systems that contain thousands of interconnected subsystems.
Mitigation: [101] proposed the use of a scalable emulation environment, Reaco2.

5.9.1 Reflection about Challenges Related to Portability
Assuring the quality of deployed enterprise systems is the only portability challenge identified in this study; portability does not appear to be a burning issue in LST.

5.10 Maintainability
Maintainability is composed of the sub-characteristics modularity, reusability, analyzability, modifiability, and testability4, defined as follows:

Modularity- “The degree to which a system or computer program is composed of discrete components such that a change to one component has minimal impact on other components.”
Reusability- “The degree to which an asset can be used in more than one system, or in building other assets.”
Analyzability- “The degree of effectiveness and efficiency with which it is possible to assess the impact on a product or system of an intended change to one or more of its parts, or to diagnose a product for deficiencies or causes of failures, or to identify parts to be modified.”
Modifiability- refers to the degree of effectiveness and efficiency with which a product can be modified without introducing defects.
Testability- “The degree of effectiveness and efficiency with which test criteria can be established for a system, product or component and tests can be performed to determine whether those criteria have been met.”

1. Maintaining outdated test suites [R3, R4, R9, R10, 102] and pruning test suites [R3, R4] efficiently and effectively are challenging tasks; maintaining outdated test suites is the first-class maintainability challenge in a large-scale system. This challenge affects one maintainability sub-characteristic.
- Modifiability: it is difficult to modify the outdated test cases of a component without introducing defects into other components that the test suite depends on. It is also difficult to prune test suites from an existing component without introducing defects into the other components.
Mitigation: clear documentation of dependencies and writing testable code [R3]; apply test case reduction or regression testing reduction techniques [117]; use skilled testers and apply best practices to refactor and prune legacy test suites [R3, R4].
2. Maintaining test suites and doing regression testing [R4, R9-R10], and keeping track of faults during regression, especially with unstable tests [R9], are challenges in a large-scale system. This challenge affects one maintainability sub-characteristic.

 Modifiability It is difficult to maintain the test cases or component or a system without introducing any defects on other components that a test suit depends on. Mitigation  Clear documentation, good architect, writing testable code [R3], Use CI/CD Pipeline (Continuous Integration, Continuous Delivery) and container like docker[R4, R9]. 3. Challenges on estimating the impact of challenges: - Different services provided by large-scale web applications/web services must be maintained & improved

As a result, estimating the impact of changes is challenging [102]. This challenge affects one maintainability sub-characteristic.
- Modifiability: it is difficult to modify the outdated test suite of a component without introducing defects into other components that the test suite depends on if it is not possible to estimate the impact of changes in advance.
Mitigation: the author used the tool REQAnalytics to continuously update regression tests [102].
4. Test case maintenance challenges for regression testing of large-scale web applications/web services [102] are:
- Estimating the impact of changes.
- Building and maintaining a test suite for continuously updated regression testing.
This challenge affects one maintainability sub-characteristic.

 Modifiability It is difficult to modify and maintain changes efficiently and effectively without introducing any defects if the impact of change is not correctly estimated in advance or/and if test suits are not built and maintained for continuous regression. Mitigation  Continuously updating regression tests to test the current behavior after requested changes using the tool, REQAnalytics[105]. 5. Challenge on test case selection of regression: -an Effective test case selection for regression test is challenging [103]. This challenge affects one maintainability sub characteristics.

 Modifiability it is difficult to maintain the test cases or component or a system without introducing any defects on other components that a test suit depends on. Mitigation  [103] proposed an automated method of source code-based regression test selection for Software product lines (SPLs) and the method reduces the repetition of the selection procedure and minimizes the in-depth analysis effort for source code and test cases based on the commonality and variability of a product family [103]. 6. Challenge in reusing class-based test cases: -Frameworks are introduced to reduce costs, make reusable design and implementation, and increase the maintainability of software through the deployment. The key challenges of testing in relation to frameworks are Development, evolution & maintenance of test cases to ensure the framework operates appropriately in a given application [104]. This challenge affects one maintainability sub characteristics.  Testability it’s difficult to ensure the testability of a module or a system if test suites are not reused. Mitigation  To increase the maintainability of software products, reusable test cases increase could be used because an entirely new set of test cases does not have to be generated each time the framework is deployed.

7. Challenge of optimizing continuous integration (CI) testing: optimizing CI to handle code changes using regression testing faces long-standing scalability problems due to test redundancy [105]. This challenge affects one maintainability sub-characteristic.

 Modifiability it is difficult to optimize and scale code changes efficiently and effectively without introducing any defects. Mitigation  [105] proposes an algorithm to optimize the CI regression test selection. The algorithm uses coverage metrics to classify tests as (unique, totally redundant, or partially redundant). Then classify partially redundant test cases as being likely or not likely to detect failures, based on their historical fault-detection effectiveness. 8. Challenge on regression testing for cost-effectiveness: - Regression testing of database applications is costly and challenging. Analysts can rely on generating synthetic data or production data based on (i.) combinatorial techniques or (ii.) operational profiles - combinatorial test suites might not be representative of system operations and automating testing with production data is impractical [106].  Modifiability it’s difficult to modify and maintain changes efficiently and effectively without introducing any defects if regression testing of a database is not effectively done. Mitigation Using production data and synthetic data generated through classification tree models of the input domain. Combinatorial testing strategies are effective, both in terms of the number of regression faults discovered but also, more surprisingly, in terms of the importance of these faults [106]. 9. Ensuring a “realistic” and optimized test suite from outdated test cases is challenging since test cases are outdated due to the evolving nature of a system [107]. Without realistic up to date optimized test suites. It is hard to do continuous validation of performance test workload [121]. For instance, test optimization is helpful to speed up distributed GUI testing but achieving an optimized test to speed up testing time for usability testing in the distributed environment is challenging[108] This challenge affects one maintainability sub characteristics.  Modifiability it is difficult to modify test suits efficiently and effectively without introducing any defects. Mitigation  [107] suggested the use of an algorithm, Multi-level random walk that optimizes and provides a reduced test case [107]. Apply test case reduction techniques or regression testing reduction techniques [117, 122]. [121] proposed a way to validate whether a performance test resembles the field workload and, if not, determines how they differ then update tests to eliminate such differences, which creates more realistic tests [121]. [108] purposed the Atom-task precondition technique, to filtering the test before running the parallel testing process to optimize the testing time and

This optimizes the testing time, avoids worker bottleneck problems, and scales to large-scale testing in a distributed environment [108].
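The sketch below is one plausible reading of the coverage-based classification in [105], not the published algorithm: a test is unique if it covers a line no other test covers, totally redundant if its coverage is contained in a single other test, and partially redundant otherwise. The coverage sets are illustrative.

```python
coverage = {
    "t1": {1, 2, 3, 4},   # line 1 is covered by no other test -> unique
    "t2": {2, 3},         # contained in t1 -> totally redundant
    "t3": {4, 5, 6},      # line 6 is unique to t3 -> unique
    "t4": {3, 5},         # overlaps others but is no one's subset
}

def classify(test):
    others = set().union(*(c for t, c in coverage.items() if t != test))
    if coverage[test] - others:
        return "unique"
    if any(coverage[test] <= c for t, c in coverage.items() if t != test):
        return "totally redundant"
    return "partially redundant"

for test in coverage:
    print(test, "->", classify(test))
```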

5.10.1 Reflection about Challenges Related to Maintainability
Maintaining test suites is difficult because they can become outdated in a large-scale environment, and it is similarly difficult to maintain regression testing in a large-scale system. Requirement documents should be clear enough to guide the maintenance of systems: when testers must maintain outdated test cases without a clear maintenance strategy, a clear requirement specification, or good maintenance documentation, the task becomes very difficult. Best practices for dealing with outdated test cases are to write testable code and to adopt a DevOps strategy, i.e., continuous development and continuous integration.

5.11 General
In this section, the general challenges that do not belong to a single software quality attribute are analyzed.
1. Modeling and testing a real-world scenario at an acceptable level of abstraction is a challenging task [R2, R3]; models to simulate such scenarios exist but are inaccurate. Mitigation: design a model that handles the specialized scenarios [R2].
2. Testing coordination challenge: due to the heterogeneity and complexity of large-scale systems, coordinating testing activities is challenging [R5]. With microservice architectures, where thousands of coexisting applications must work cooperatively, integration, end-to-end, and performance testing are challenging [R4, R6, 72], since some components may not behave as expected or may not be ready [R5].

Mitigation Use proper DevApp system with different versions that can be rolled back if needed [R5, R4]. Write testable & clean code, make reusable of code & test suits, and use containerization (ex-docker) [R4, R6]. Kubernetes provide a deployment tool, Kubemark to address performance and scalability issue [72]. Coordination challenges could be mitigated by interoperability testing [10]. 3. Communication challenges: - In large-scale system due to size, complexity, and multi-disciplinary cross-functional communication is a challenge [R7]. Both internal teams and external stakeholder's communication are challenging in large-scale systems in general. For instance, internal communication challenges might come from the ambiguity of requirements [R7]. Mitigation Communication challenge due to requirement ambiguity can be mitigated by having a clear requirement through meeting and discussion [R7].

4. The increase in variability and complexity of automotive systems due to the growing electrification and digitalization of vehicles poses verification and validation challenges [109]. Mitigation: these challenges are mitigated using TIGRE, a product line testing methodology [109].
5. Decision-making regarding the adoption of cloud computing (CC) for software testing is challenging [110]. Mitigation: [110] proposed an adoption assessment model based on a fuzzy multi-attribute decision-making algorithm for adopting CC for software testing.
6. Challenge of big data analytics: efficient and effective testing of big data analytics systems (e.g., Google's MapReduce, Apache Hadoop, and Apache Spark) is a challenge, since they process terabytes and petabytes of data [111]. Mitigation: [111] proposed BigTest for efficient and effective testing of big data analytics.

7. Challenges during the testing activities of large-scale agile projects in relation to large sets of test artifacts [112] are keeping the artifacts:
- Understandable and accessible to various stakeholders.
- Traceable to requirements.
- Easily maintainable as the software evolves.
Mitigation: use testing tools that leverage a domain-specific language to streamline functional testing in agile projects. Key features of such tools include test template generation from user stories, model-based automation, test inventory synchronization, and centralized test tagging [112].

8. Challenge of testing quality using model-based testing: it is a challenge to test qualities (reliability, safety, security, etc.) of a large-scale, especially embedded, system in a model-based way, in two respects: (i) testing the model dynamically and (ii) improving the efficiency of model-based testing [113]. Mitigation: an enhanced adaptive random testing approach for generating test cases for AADL model-based testing was investigated and found to be more efficient than traditional random testing.
9. Challenges of test management and test processes:
- Improving, managing, and optimizing testing (infrastructure and process) and reducing testing cost in large-scale (distributed, critical, etc.) systems is challenging [48, 106, 114, 115, 116, 117]. Reducing development and testing costs is possible by detecting defects at an early stage [115].
- Conducting effective testing at less cost while avoiding risk [115].
- Deciding whether additional testing is necessary [115].

Mitigation [115] used and integrated testing tools at an early stage (eXVantage, a tool for coverage testing, debugging, and performance profiling), effective test management (example, communication is a key for effective test management) could be used to address the challenge; Apply model-based testing to improve test infrastructure [48]. Use process- oriented metrics-based assessment (POMBA) to monitor a testing process [114], Use a strategy “Software design for testability” to write clean code & improve quality and apply techniques to check for correctness of a program, for instance, “Test condition oracle” [106, 117].

- Managing all test cases [116]. Mitigation: [116] discusses recommendations on how to do, and better leverage, test case management.
- Speeding up the testing process and achieving computationally cost-effective testing is a challenge in large-scale systems [124]. Mitigation: [124] proposed grid computing, together with a swarm-intelligence task scheduling approach that tackles the dynamism and performance heterogeneity problems of a grid computing environment, with the expectation of reducing testing task completion time.

10. Challenge of combinatorial testing and generating optimized test suites: large-scale software systems commonly require structurally complex test inputs. The identified challenges related to combinatorial testing and test suite generation are:
- Designing a small number of suitable combinatorial test suites that protect against failures caused by a variety of factors and interactions; when the number of factors is large, this becomes a challenge, since it requires a large number of test cases [118]. Mitigation: combinatorial testing can generate a small test suite from a large combinatorial space, and the Hadoop distributed computing system can be used to increase the performance of running the algorithms on large test cases [118].
- Automatically generating a reasonable number of tests that are behaviorally diverse and legal to use for testing is a challenging task [119].
- Applying pair-wise combinatorial testing when a system requires complex input structures that cannot be formed from arbitrary combinations is a challenge [119]. Mitigation: constraint-based test generation can produce structurally complex inputs but has limited scalability; a more scalable approach is combinatorial test generation, such as pair-wise testing, but it is challenging to use when complex inputs are required. ComKorat, which unifies combinatorial generation methods and constraint-based generation approaches, is effective [119].

11. Challenges of improving testing during maintenance at a large-scale IT service provider:


Key challenges in maintenance testing were related to the recording of test cases, inadequate testing resources, test documentation, presenting test documentation to customers, and poor visibility of testing in the early phases of software development [123]. This challenge affects one maintainability sub-characteristic:

- Modifiability: It is difficult to modify and maintain changes efficiently and effectively, without introducing defects, during the maintenance process.
Mitigation: Testing resources should be planned for, and testing carried out at, the agreed time; detailed instructions for maintenance testing should be provided; and ready-made test cases would speed up customer testing and make it easier [123].

5.11.1 Reflection about General Challenges

The identified general challenges mostly relate to the testing process, test management, challenges originating from specific testing types and modeling, and testing real scenarios.

6 Discussions and Limitations of the Study The core objectives of this study were to identify the key challenges of LST, map them against software quality attributes, and identify possible mitigations, using a systematic literature review and interviews. In this thesis, 81 challenges were identified and categorized by software quality attribute. Software maintenance is an expensive task [4]; therefore, identifying and detecting software bugs at the earliest stage is essential to lower maintenance costs. The results of the study can be used to guide software-testing endeavors: testers and project managers can use them as guidance to plan and manage their work, and they can serve to articulate checklists, requirements, and guidelines for testers to effectively undertake large-scale software quality testing.

6.1 Limitations of the Study This thesis was written in partial fulfillment of the Master of Science in Software Technology. The systematic literature review was carried out entirely by the author of this thesis; hence, the quality of the review could be affected, as the selection and review of relevant articles is based on an individual's judgment, although the author performed the task carefully. Finally, interviewee privacy is another limitation of this study: because of organizational privacy policies, the interviewees were restricted to listing only challenges and mitigations that are not specifically security-related.


7 Conclusions and Future directions The results of this study show that software testing can be assisted by testing tools designed to examine software from a wider variety of perspectives. Statistical techniques, data-analytics methods, machine learning, and managerial skills can also be used to undertake software testing at large scale. In this chapter, the conclusions drawn from the study are summarized.

Functional suitability generally has few challenges compared to other quality attributes. Its inherent difficulties emanate from poor software design, poor requirement management, and an ill-managed testing methodology. It is essential to have requirement documentation, to involve team members from the inception of the project to expedite a common understanding, and to follow a suitable development strategy.

The most significant challenges in large-scale software testing fall under performance testing. For example, performance testing cannot be run on the production environment, and it is difficult to mimic real scenarios due to a lack of equipment. Also, preparing test-case datasets and investigating faults are time-consuming, and it is difficult to identify the appropriate load with which to stress the system.

Compatibility is an issue in integrated heterogeneous systems with multiple configurations and versioning issues; it is infeasible to test all potential configurations. Besides, it is critical to identify the technical risks of a system in advance, especially for critical and dependable large-scale systems. Mitigations identified for compatibility-related challenges include using technical analysts to reduce the occurrence of risk and prioritizing compatibility testing according to relevance and risk factors.

Usability testing is less recognized in large-scale software testing, and it is difficult to test for accessibility and to identify critical bugs through usability testing. Therefore, addressing usability testing is valuable.

Regarding the reliability quality attribute, investigating and detecting the root causes of failures are the most challenging endeavors in LST. Non-reproducible faults disrupt services; such faults are inherent in the telecom sector. It is advisable to keep well-documented requirements, log histories, and resource allocations to cope with reliability-related challenges.

The second most significant group of challenges in LST falls under security and maintainability. Security is a crucial issue in any software product, yet software testing and development team members often lack adequate knowledge about security-related risks and about coping with security threats; for example, testers may lack the skill to assess and test whether security-related requirements are implemented in a system. It is advisable to involve team members with security expertise and to facilitate knowledge sharing. Maintenance testing is problematic since test cases get outdated at large scale as the software evolves, and maintaining outdated test cases in a large-scale system is similarly challenging. It is advisable to keep the requirement documentation clear enough to guide the maintenance of systems and facilitate mitigation.


Generally, after evaluating the results, the following key points were identified.
1. Current research endeavors and challenges mainly focus on performance, security, maintainability, reliability, and compatibility; these therefore need the most attention when performing LST.
2. On the other hand, LST seems to overlook usability, functional suitability, and portability.
3. It is recommended to use machine learning and AI to advance large-scale software testing, simplify testers' work, and increase productivity.
4. The sources of large-scale testing challenges are project-dependent but often relate to the core attributes of complexity, evolution, heterogeneity, and size.
5. Performing software testing in large-scale settings is infeasible in some cases due to the evolving nature of the systems.
6. A kick-off or serious meeting before the project starts is vital to clarify the requirements and reduce ambiguity.
7. Testing the performance, security, and compatibility of large-scale software systems is a skill set demanded of professionals.

From the results, the following future directions are suggested.
1. Testing of containerized systems.
2. Challenges of software testing with respect to the core attributes that define large-scale systems.
3. Large-scale software testing following agile principles.
4. Model-based testing for large-scale software development.
5. Incorporating ML and AI in large-scale systems.


8 References [1] T. Berners-Lee, "WWW: past, present, and future," in Computer, vol. 29, no. 10, pp. 69-77, Oct. 1996. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=539724&isnumber =11670 [2] D. Leigha,” Assistive Technologies for People With Diverse Abilities”, Journal of Intellectual & Developmental Disability”, 39:4, 381-382, 2014 [3] N. Jindal, J. Yang, V. Lotrich, J. Byrd and B. Sanders, "Abstract Interpretation: Testing at Scale without Testing at Scale," Second International Workshop on Software Engineering for High-Performance Computing in Computational Science and Engineering, New Orleans, LA, 2014, pp. 1-5, 2014 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7017325&isnumb er=7017320 [4] E. Ogheneovo,” On the relationship between software complexity and maintenance costs”, 2014. [5] A. Van Der Heijden, C. Broasca, and A. Serebrenik, “An empirical perspective on security challenges in large-scale agile software development,” in International Symposium on Empirical Software Engineering and Measurement, 2018. [6] N. Sekitoleko, F. Evbota, E. Knauss, A. Sandberg, M. Chaudron, & H. Olsson, ” Technical dependency challenges in large-scale agile software development”, 2014. [7] T. Bianchi, D. S. Santos, and K. R. Felizardo, "Quality Attributes of Systems-of- Systems: A Systematic Literature Review," 2015 IEEE/ACM 3rd International Workshop on Software Engineering for Systems-of-Systems, Florence, 2015, pp. 23- 30. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7179220&isnumb er=7179208 [8] D. Firesmith, "Profiling Systems Using the Defining Characteristics of Systems of Systems (SoS)", 2010 [9] S. Bartsch, "Practitioners' Perspectives on Security in Agile Development," 2011 Sixth International Conference on Availability, Reliability, and Security, Vienna, 2011, pp. 479-484.

URL: http://ieeexplore.ieee.org.proxy.lnu.se/stamp/stamp.jsp?tp=&arnumber=6046 004&isnumber=6045921 [10] B. Nielsen, P. G. Larsen, J. Fitzgerald, J. Woodcock, and J. Peleska, “Systems of systems engineering: Basic concepts, model-based techniques, and research directions,” ACM Computing Surveys, vol. 48, no. 2, 2015.


[11] W. Alsaqaf, M. Daneva, and R. Wieringa, “Quality requirements challenges in the context of large-scale distributed agile: An empirical study,” Inf. Softw. Technol., vol. 110, pp. 39–55, 2019. [12] Kitchenham, “Systematic literature reviews in software engineering – A systematic literature review”. 2004. [13] T. Dingsøyr, B. Moe, E. Fægri, T. E., & E. Seim. A, “Exploring software development at the very large-scale: a revelatory case study and research agenda for agile method adaptation”, 2018, Empirical Software Engineering, 23(1), 490- 520. [14] S. Bick, K. Spohrer, R. Hoda, A. Scheerer, & A. Heinzl, “Coordination challenges in large-scale software development: a case study of planning misalignment in hybrid settings”.2018, IEEE Transactions on Software Engineering, 44(10), 932-950, [15] N. Shete and A. Jadhav, "An empirical study of test cases in software testing," International Conference on Information Communication and Embedded Systems (ICICES 2014), Chennai, 2014, pp. 1-5. URL: http://ieeexplore.ieee.org.proxy.lnu.se/stamp/stamp.jsp?tp=&arnumber=7033 883&isnumber=7033740 [16] Chen, Celia & Lin, Shi & Shoga, Michael & Wang, Qing & Boehm, Barry, "How do defects hurt qualities? an empirical study on characterizing a software maintainability ontology in open source software", In 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS), pp. 226-237. IEEE, 2018. [17] R. Kazmi, N. Jawawi, R. Mohamad, and I. Ghani, "Effective regression test case selection: A systematic literature review" ACM Computing Surveys (CSUR) 50, no. 2 1-32, 2017 [18] R. Gao and C. M. (Jack) Jiang, “An Exploratory Study on Assessing the Impact of Environment Variations on the Results of Load Tests,” in 2017 IEEE/ACM 14TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES (MSR 2017), 2017, pp. 379–390. [19] A. Gonzalez, E. Piel, H.-G. Gross, and M. Glandrup, “Testing Challenges of Maritime Safety and Security Systems-of-Systems,” in TACI PART 2008:TESTING: ACADEMIC AND INDUSTRIAL CONFERENCE PRACTICE AND RESEARCH TECHNIQUES, PROCEEDINGS, 2008, p. 35+. [20] Zheng Liu and Paul Mei, “Automated Testing for Large-Scale Critical Software Systems,” in 2014 5TH IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE (ICSESS), 2014, pp. 200– 203. [21] T. Hanawa, T. Banzai, H. Koizumi, R. Kanbayashi, T. Imada, and M. Sato, “Large-Scale Software Testing Environment Using Cloud Computing Technology for Dependable Parallel and Distributed Systems,” in 2010 Third International


Conference on Software Testing, Verification, and Validation Workshops, 2010, pp. 428–433. [22] Y. S. Dai, M. Xie, K. L. Poh, and B. Yang, “Optimal testing-resource allocation with genetic algorithm for modular software systems,” J. Syst. Softw., vol. 66, no. 1, pp. 47–55, 2003. [23] P. K. Kapur, A. G. Aggarwal, K. Kapoor, and G. Kaur, “Optimal testing resource allocation for modular software considering cost, testing effort and reliability using genetic algorithm,” Int. J. Reliab. Qual. Saf. Eng., vol. 16, no. 6, pp. 495–508, 2009. [24] M. Felderer and R. Ramler, “Integrating risk-based testing in industrial test processes,” Software Quality Journal, 22(3), pp. 543–575, 2014. [25] R. Ramler and M. Felderer, “A process for risk-based test strategy development and its industrial evaluation,” PROFES 2015, pp. 355–371, 2015. [26] J. A. McCall, P. K. Richards, and G. F. Walters, “Factors in Software Quality,” Nat'l Tech. Information Service, vol. 1, 2 and 3, 1977. [27] B. W. Boehm, J. R. Brown, H. Kaspar, M. Lipow, G. McLeod, and M. Merritt, “Characteristics of Software Quality,” North-Holland Publishing, Amsterdam, The Netherlands, 1978. [28] R. Grady, “Practical Software Metrics for Project Management and Process Improvement,” Prentice-Hall, Englewood Cliffs, NJ, USA, 1992. [29] B. Zhou, M. Kulkarni, and S. Bagchi, “WUKONG: Effective Diagnosis of Bugs at Large System Scales,” ACM SIGPLAN Notices, 48, 317–318, 2013. https://doi.org/10.1145/2517327.2442563 [30] A. Ulrich and H. König, “Architectures for Testing Distributed Systems,” in Testing of Communicating Systems, vol. 21, G. Csopaki, S. Dibuz, and K. Tarnay, Eds. Boston, MA: Springer US, 1999, pp. 93–108. [31] T. M. King, D. Santiago, J. Phillips, and P. J. Clarke, “Towards a Bayesian Network Model for Predicting Flaky Automated Tests,” in 2018 IEEE 18th International Conference on Software Quality, Reliability and Security Companion (QRS-C), 2018, pp. 100–107. [32] B. Kitchenham and S. Charters, “Guidelines for performing systematic literature reviews in software engineering,” 2007. [33] S. Vieira and A. Gomes, “A comparison of Scopus and Web of Science for a typical university,” Scientometrics, 81(2), 587, 2009. [34] A. Martín-Martín, E. Orduna-Malea, M. Thelwall, and D. López-Cózar, “Google Scholar, Web of Science, and Scopus: A systematic comparison of citations in 252 subject categories,” Journal of Informetrics, 12(4), 1160–1177, 2018.


[35] C. Wohlin, “Guidelines for snowballing in systematic literature studies and a replication in software engineering,” in Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, pp. 1–10, 2014. [36] L. Altheide and M. Johnson, “Criteria for assessing interpretive validity in qualitative research,” 1994. [37] M. N. K. Saunders, P. Lewis, A. Thornhill, and A. Bristow, “Understanding research philosophy and approaches to theory development,” pp. 122–161, 2015. [38] E. G. Guba, “Criteria for assessing the trustworthiness of naturalistic inquiries,” Educational Technology Research and Development, 29(2), 75–91, 1981. [39] D. Andrews, B. Nonnecke, and J. Preece, “Conducting Research on the Internet: Online Survey Design, Development and Implementation Guidelines,” International Journal of Human-Computer Interaction, 1–31, 2003. [40] J. W. Creswell and C. N. Poth, “Qualitative inquiry and research design: Choosing among five approaches,” Sage Publications, 2016. [41] T. Chen, W. Shang, A. E. Hassan, M. Nasser, and P. Flora, “Detecting Problems in the Database Access Code of Large Scale Systems - An Industrial Experience Report,” in 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C), 2016, pp. 71–80. [42] F. G. de Oliveira Neto, J. Horkoff, E. Knauss, R. Kasauli, and G. Liebel, “Challenges of Aligning Requirements Engineering and System Testing in Large-Scale Agile: A Multiple Case Study,” in 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW), 2017, pp. 315–322. [43] H. Malik, Z. M. Jiang, B. Adams, A. E. Hassan, P. Flora, and G. Hamann, “Automatic comparison of load tests to support the performance analysis of large enterprise systems,” in Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, 2011, pp. 222–231. [44] H. M. Al Ghmadi, M. D. Syer, W. Shang, and A. E. Hassan, “An automated approach for recommending when to stop performance tests,” in Proceedings - 2016 IEEE International Conference on Software Maintenance and Evolution, ICSME 2016, 2017, pp. 279–289. [45] V. Loll, “Developing and testing algorithms for stopping testing, screening, run-in of large systems or programs,” in Annual Reliability and Maintainability Symposium, 2000 Proceedings. International Symposium on Product Quality and Integrity (Cat. No.00CH37055), 2000, pp. 124–130. [46] M. De Barros, J. Shiau, C. Shang, K. Gidewall, H. Shi, and J. Forsmann, “Web services wind tunnel: On performance testing large-scale stateful web services,” in 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Proceedings, 2007, p. 612+.


[47] S. DeCelles, T. Huang, M. C. Stamm, and N. Kandasamy, “Detecting Incipient Faults in Software Systems: A Compressed Sampling-Based Approach,” in 2016 IEEE 9th International Conference on Cloud Computing (CLOUD), 2016, pp. 303– 310. [48] M. Blackburn, R. Busser, A. Nauman, and T. Morgan, “Life cycle integration use of model-based testing tools,” in 24th Digital Avionics Systems Conference, 2005, vol. 2, p. 13 pp. Vol. 2. [49] Z. Liu, W. Mo, D. Ren, G. Zhao, and M. Liu, “Bug detection in large scale nuclear power software,” in Proceedings - 23rd IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW 2012, 2012, pp. 47–52. [50] P. Bian, B. Liang, W. Shi, J. Huang, and Y. Cai, “NAR-miner: Discovering Negative Association Rules from Code for Bug Detection,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 411–422. [51] J. Xuan, Y. Gu, Z. Ren, X. Jia, and Q. Fan, “Genetic Configuration Sampling: Learning a Sampling Strategy for Fault Detection of Configurable Systems,” in Proceedings of the Genetic and Evolutionary Computation Conference Companion, 2018, pp. 1624–1631. [52] A. Arcuri and L. Briand, “Formal Analysis of the Probability of Interaction Fault Detection Using Random Testing,” IEEE Trans. Softw. Eng., vol. 38, no. 5, pp. 1088–1099, 2012. [53] P. Bian, B. Liang, Y. Zhang, C. Yang, W. Shi, and Y. Cai, “Detecting Bugs by Discovering Expectations and Their Violations,” IEEE Trans. Softw. Eng., vol. 45, no. 10, pp. 984–1001, 2019. [54] O. Huerta-Guevara, V. Ayala-Rivera, L. Murphy, and A. O. Portillo- Dominguez, “Towards an Efficient Performance Testing Through Dynamic Workload Adaptation,” in Testing Software and Systems, 2019, pp. 215–233. [55] R. Smith et al., “Applying Combinatorial Testing to Large-Scale Data Processing at Adobe,” in 2019 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2019, pp. 190–193. [56] C. W. Yohannes, T. Li, and K. Bashir, “A Three-Stage Based Ensemble Learning for Improved Software Fault Prediction: An Empirical Comparative Study,” Int. J. Comput. Intell. Syst., vol. 11, no. 1, pp. 1229–1247, 2018. [57] Goyal, Jyoti, and Bal Kishan. "Comparative Analysis Of Heterogeneous Ensemble Learning For Software Fault Prediction.", 2019. [58] J. A. Meira, E. C. d. Almeida, Y. Le Traon, and G. Sunye, “Peer-to-Peer Load Testing,” in 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation, 2012, pp. 642–647. [59] J. Noor, M. G. Hossain, M. A. Alam, A. Uddin, S. Chellappan, and A. B. M. A. Al Islam, “svLoad: An Automated Test-Driven Architecture for Load Testing in


Cloud Systems,” in 2018 IEEE Global Communications Conference (GLOBECOM), 2018, pp. 1–7. [60] H. Malik, B. Adams, A. E. Hassan, P. Flora, and G. Hamann, “Using Load Tests to Automatically Compare the Subsystems of a Large Enterprise System,” in 2010 IEEE 34th Annual Computer Software and Applications Conference, 2010, pp. 117–126. [61] H. Malik, H. Hemmati, and A. E. Hassan, “Automatic Detection of Performance Deviations in the Load Testing of Large Scale Systems,” in Proceedings of the 2013 International Conference on Software Engineering, 2013, pp. 1012–1021. [62] S. He, G. Manns, J. Saunders, W. Wang, L. Pollock, and M. Lou Soffa, “A Statistics-based Performance Testing Methodology for Cloud Applications,” in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 188–199. [63] R. Gao, Z. M. (Jack) Jiang, C. Barna, and M. Litoiu, “A Framework to Evaluate the Effectiveness of Different Load Testing Analysis Techniques,” in 2016 9TH IEEE INTERNATIONAL CONFERENCE ON SOFTWARE TESTING, VERIFICATION, AND VALIDATION (ICST), 2016, pp. 22–32. [64] R. Gao, Y. Wang, Y. Feng, Z. Chen, and W. Eric Wong, “Successes, challenges, and rethinking – an industrial investigation on crowdsourced mobile application testing,” Empir. Softw. Eng., vol. 24, no. 2, pp. 537–561, 2019. [65] A. R. Chen, “An Empirical Study on Leveraging Logs for Debugging Production Failures,” in Proceedings of the 41st International Conference on Software Engineering: Companion Proceedings, 2019, pp. 126–128. [66] H. Schulz, T. Angerstein, and A. van Hoorn, “Towards Automating Representative Load Testing in Continuous Software Engineering,” in Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, 2018, pp. 123–126. [67] L. Jonsson, D. Broman, M. Magnusson, K. Sandahl, M. Villani, and S. Eldh, “Automatic Localization of Bugs to Faulty Components in Large Scale Software Systems Using Bayesian Classification,” in 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), 2016, pp. 423–430. [68] A. Zakari, S. P. Lee, and I. A. T. Hashem, “A single fault localization technique based on failed test input,” Array, vol. 3–4, p. 100008, 2019. [69] Y. Zheng, Z. Wang, X. Fan, X. Chen, and Z. Yang, “Localizing multiple software faults based on evolution algorithm,” J. Syst. Softw., vol. 139, pp. 107–123, 2018. [70] S. Sinha, T. Dasch, and R. Ruf, “Governance and Cost Reduction through Multi-tier Preventive Performance Tests in a Large-Scale Product Line


Development,” in 2011 15th International Software Product Line Conference, 2011, pp. 295–302. [71] L. Bao, C. Wu, X. Bu, N. Ren, and M. Shen, “Performance Modeling and Workflow Scheduling of Microservice-Based Applications in Clouds,” IEEE Trans. Parallel Distrib. Syst., vol. 30, no. 9, pp. 2114–2129, 2019. [72] Q. Lei, W. Liao, Y. Jiang, M. Yang, and H. Li, “Performance and Scalability Testing Strategy Based on Kubemark,” in 2019 IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), 2019, pp. 511–516. [73] T. Chen et al., “Analytics-Driven Load Testing: An Industrial Experience Report on Load Testing of Large-Scale Systems,” in 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), 2017, pp. 243–252. [74] I.-C. Yoon, A. Sussman, A. Memon, and A. Porter, “Effective and scalable software compatibility testing,” in ISSTA’08: Proceedings of the 2008 International Symposium on Software Testing and Analysis 2008, 2008, pp. 63–73. [75] I. Yoon, A. Sussman, A. Memon, and A. Porter, “Towards Incremental Component Compatibility Testing,” in Proceedings of the 14th International ACM Sigsoft Symposium on Component-Based Software Engineering, 2011, pp. 119–128. [76] I. Yoon, A. Sussman, A. Memon, and A. Porter, “Prioritizing component compatibility tests via user preferences,” in 2009 IEEE International Conference on Software Maintenance, 2009, pp. 29–38. [77] I. Yoon, A. Sussman, A. Memon, and A. Porter, “Testing component compatibility in evolving configurations,” Inf. Softw. Technol., vol. 55, no. 2, pp. 445–458, 2013. [78] X. Qu, M. Acharya, and B. Robinson, “Impact Analysis of Configuration Change for Test Case Selection,” in 2011 IEEE 22nd International Symposium on Software Reliability Engineering, 2011, pp. 140–149. [79] S. Route, “Test Optimization Using Combinatorial Test Design: Real-World Experience in Deployment of Combinatorial Testing at Scale,” in 2017 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2017, pp. 278–279. [80 Ł. Pobereżnik, “A method for selecting environments for software compatibility testing,” in 2013 Federated Conference on Computer Science and Information Systems, 2013, pp. 1355–1360. [81] J. Wang, M. Li, S. Wang, T. Menzies, and Q. Wang, “Images don’t lie: Duplicate crowd testing reports detection with screenshot information,” Inf. Softw. Technol., vol. 110, pp. 139–155, 2019. [82] X. Han, D. Carroll, and T. Yu, “Reproducing performance bug reports in server applications: The researchers’ experiences,” J. Syst. Softw., vol. 156, pp. 268–282, 2019.


[83] S. Eldh and D. Sundmark, “Robustness Testing of Mobile Telecommunication Systems: A Case Study on Industrial Practice and Challenges,” in 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation, 2012, pp. 895–900. [84] D. Citron and A. Zlotnick, “Testing large-scale cloud management,” IBM J. Res. Dev., vol. 55, no. 6, 2011. [85] C. Saumya, J. Koo, M. Kulkarni, and S. Bagchi, “XSTRESSOR : Automatic Generation of Large-Scale Worst-Case Test Inputs by Inferring Path Conditions,” in 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), 2019, pp. 1–12. [86] J. Koo, C. Saumya, M. Kulkarni, and S. Bagchi, “PySE: Automatic Worst-Case Test Generation by Reinforcement Learning,” in 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), 2019, pp. 136–147. [87] Z. Wang, K. Tang, and X. Yao, “Multi-Objective Approaches to Optimal Testing Resource Allocation in Modular Software Systems,” IEEE Trans. Reliab., vol. 59, no. 3, pp. 563–575, 2010. [88] Z. Wang, K. Tang, and X. Yao, “A multi-objective approach to testing resource allocation in modular software systems,” in 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), 2008, pp. 1148–1153. [89] P. C. Jha, D. Gupta, B. Yang, and P. K. Kapur, “Optimal testing resource allocation during module testing considering cost, testing effort, and reliability,” Comput. Ind. Eng., vol. 57, no. 3, pp. 1122–1130, 2009. [90] C.-Y. Huang and J.-H. Lo, “Optimal resource allocation for cost and reliability of modular software systems in the testing phase,” J. Syst. Softw., vol. 79, no. 5, pp. 653–664, 2006. [91] R. Pietrantuono and S. Russo, “Search-Based Optimization for the Testing Resource Allocation Problem: Research Trends and Opportunities,” in 2018 IEEE/ACM 11th International Workshop on Search-Based Software Testing (SBST), 2018, pp. 6–12. [92] Y. Shuaishuai, F. Dong, and B. Li, “Optimal Testing Resource Allocation for modular software systems based-on multi-objective evolutionary algorithms with effective local search strategy,” in 2013 IEEE Workshop on Memetic Computing (MC), 2013, pp. 1–8. [93] R. Gao and S. Xiong, “A Genetic Local Search Algorithm for Optimal Testing Resource Allocation in Module Software Systems,” in Intelligent Computing Theories and Methodologies, 2015, pp. 13–23. [94] M. Nasar and P. Johri, “Testing Resource Allocation for Fault Detection Process,” in Smart Trends in Information Technology and Computer Communications, 2016, pp. 683–690.


[95] R. Hussain and S. Zeadally, “Autonomous Cars: Research Results, Issues, and Future Challenges,” IEEE Commun. Surv. TUTORIALS, vol. 21, no. 2, pp. 1275– 1313, 2019. [96] S. Haider, A. Akhunzada, G. Ahmed, and M. Raza, “Deep Learning-based Ensemble Convolutional Neural Network Solution for Distributed Denial of Service Detection in SDNs,” in 2019 UK/ China Emerging Technologies (UCET), 2019, pp. 1–4. [97] J. Viehmann and F. Werner, “Risk Assessment and Security Testing of Large Scale Networked Systems with RACOMAT,” in RISK ASSESSMENT AND RISK- DRIVEN TESTING, 2015, vol. 9488, pp. 3–17. [98] N. Tcholtchev, M. A. Schneider, and I. Schieferdecker, “Systematic Analysis of Practical Issues in Test Automation for Communication Based Systems,” in 2016 IEEE Ninth International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2016, pp. 250–256. [99] A. D. Brucker and U. Sodan, “Deploying static application security testing on a large scale,” in Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI), 2014, vol. P-228, pp. 91–101. [100] J. Heffley and P. Meunier, “Can source code auditing software identify common vulnerabilities and be used to evaluate software security?,” in 37th Annual Hawaii International Conference on System Sciences, 2004. Proceedings of the, 2004, p. 10 pp. [101] C. M. Hine, J.-G. Schneider, and S. Versteeg, “Reac2o: A runtime for enterprise system models,” in ASE’10 - Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, 2010, pp. 177–178. [102] P. Silva, A. C. R. Paiva, A. Restivo, and J. E. Garcia, “Automatic Test Case Generation from Usage Information,” in 2018 11TH INTERNATIONAL CONFERENCE ON THE QUALITY OF INFORMATION AND COMMUNICATIONS TECHNOLOGY (QUATIC), 2018, pp. 268–271. [103] P. Jung, S. Kang, and J. Lee, “Automated code-based test selection for software product line regression testing,” J. Syst. Softw., vol. 158, p. 110419, 2019. [104] J. Al Dallal, P. Sorenson, J. Dallal Al, and P. Sorenson, “Reusing class-based test cases for testing object-oriented framework interface classes,” J. Softw. Maint. Evol. Pract., vol. 17, no. 3, pp. 169–196, 2005. [105] D. Marijan, A. Gotlieb, and M. Liaaen, “A learning algorithm for optimizing continuous integration development and testing practice,” Softw. - Pract. Exp., vol. 49, no. 2, pp. 192–213, 2019. [106] E. Rogstad and L. Briand, “Cost-effective strategies for the regression testing of database applications: Case study and lessons learned,” J. Syst. Softw., vol. 113, pp. 257–274, Mar. 2016.


[107] Z. Chi, J. Xuan, Z. Ren, X. Xie, and H. Guo, “Multi-Level Random Walk for Software Test Suite Reduction,” IEEE Comput. Intell. Mag., vol. 12, no. 2, pp. 24–33, May 2017. [108] S. Wongkampoo and S. Kiattisin, “Atom-Task Precondition Technique to Optimize Large Scale GUI Testing Time based on Parallel Scheduling Algorithm,” in ICSEC 2017 - 21st International Computer Science and Engineering Conference 2017, Proceedings, 2017, pp. 1–5. [109] R. Ebert et al., “Applying product line testing for the electric drive system,” in ACM International Conference Proceeding Series, 2019, vol. A. [110] S. Ali and H. Li, “Moving Software Testing to the Cloud: An Adoption Assessment Model Based on Fuzzy Multi-Attribute Decision Making Algorithm,” in 2019 IEEE 6th International Conference on Industrial Engineering and Applications (ICIEA), 2019, pp. 382–386. [111] M. A. Gulzar, S. Mardani, M. Musuvathi, and M. Kim, “White-box testing of big data analytics with complex user-defined functions,” in ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 290–301. [112] T. M. King, G. Nunez, D. Santiago, A. Cando, and C. Mack, “Legend: An agile DSL toolset for web acceptance testing,” in 2014 International Symposium on Software Testing and Analysis, ISSTA 2014 - Proceedings, 2014, pp. 409–412. [113] B. Sun, Y. Dong, and H. Ye, “On Enhancing Adaptive Random Testing for AADL Model,” in 2012 9th International Conference on Ubiquitous Intelligence & Computing and 9th International Conference on Autonomic & Trusted Computing (UIC/ATC), 2012, pp. 455–461. [114] Y. T. He, H. Hecht, and R. A. Paul, “Measuring and assessing software test processes using test data,” in Fifth IEEE International Symposium on High Assurance Systems Engineering, Proceedings, 2000, pp. 259–264. [115] W. E. Wong and J. Li, “An integrated solution for testing and analyzing Java applications in an industrial setting,” in 12th Asia-Pacific Software Engineering Conference, Proceedings, 2005, pp. 576–583. [116] T. Parveen, S. Tilley, and G. Gonzalez, “A case study in test management,” in Proceedings of the Annual Southeast Conference, 2007, vol. 2007, pp. 82–87. [117] D. Suleiman, M. Alian, and A. Hudaib, “A survey on prioritization regression testing test case,” in 2017 8th International Conference on Information Technology (ICIT), 2017, pp. 854–862.


[118] R.-Z. Qi, Z.-J. Wang, Y.-H. Huang, and S.-Y. Li, “Generating Combinatorial Test Suite with Spark Based Parallel Approach [基于Spark的并行化组合测试用例 集生成方法],” Jisuanji Xuebao/Chinese J. Comput., vol. 41, no. 6, pp. 1284–1299, 2018. [119] H. Zhong, L. Zhang, and S. Khurshid, “Combinatorial Generation of Structurally Complex Test Inputs for Commercial Software Applications,” in FSE’16: PROCEEDINGS OF THE 2016 24TH ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON FOUNDATIONS OF SOFTWARE ENGINEERING, 2016, pp. 981–986. [120] J. Chen and W. Shang, “An Exploratory Study of Performance Regression Introducing Code Changes,” in 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2017, pp. 341–352. [121] M. D. Syer, W. Shang, Z. M. Jiang, and A. E. Hassan, “Continuous validation of performance test workloads,” Autom. Softw. Eng., vol. 24, no. 1, pp. 189–231, Mar. 2017. [122] M. Alian, D. Suleiman, and A. Shaout, “Test case reduction techniques- survey”, 2016, International Journal of Advanced Computer Science and Applications, 7(5), 264-275. [123] M. Jäntti and T. Kujala, “Exploring a testing during maintenance process from IT service provider’s perspective,” in The 5th International Conference on New Trends in Information Science and Service Science, 2011, vol. 2, pp. 318–323. [124] Y. Li, T. Dong, X. Zhang, Y. Song, and X. Yuan, “Large-scale software on the grid,” in 2006 IEEE INTERNATIONAL CONFERENCE ON GRANULAR COMPUTING, 2006, p. 596+. [125] Khurshid, Sarfraz, and Darko Marinov. "Reducing Combinatorial Testing Requirements Based on Equivalences with Respect to the Code Under Test." SQAMIA 2018 (2018). [126] Wen, Wanzhi, Bixin Li, Xiaobing Sun, and Jiakai Li. "Program slicing spectrum-based software fault localization." In SEKE, pp. 213-218. 2011. [127] Khan, Shahid N. "Qualitative research method: Grounded theory." International Journal of Business and Management 9, no. 11 (2014): 224- 233. [128] A. Emil, Class Lecture, Topic: "Module 1- Large-scale Terminology.” College of Engineering and Science, Blekinge Institute of Technology, Karlskrona, Dec 2019.


9 Appendix

Appendix 1 – The result of the study (challenges and mitigations). Table 6 lists the challenges and mitigations identified through the interviews and the SLR; each numbered entry gives the challenge number, the reference(s) (primary study), and the challenge with its mitigations, grouped by quality attribute.

Functional Suitability – Functional Completeness, Functional Correctness, Functional Appropriateness

1. [41] Database access code challenges:
- Writing good test cases, detecting functional problems, and detecting performance problems.
Mitigation: [41] identified framework-specific database access patterns to address the challenges.

2. [42, R1-R5, R10] Mapping requirements to testing:
- Alignment or mapping of a requirement to testing is challenging, especially when the requirement is unclear, ambiguous, or wrong.
Mitigation:
- Adopt more agile RE practices to enhance alignment (requirement to testing) and then leverage agile testing [42].
- The development team should include a tester, programmer, analyst, etc.; testers should participate from the very beginning of the systematization of a project and meet to clarify requirements [R1-R5, R10].
- Practice model-based testing [R1].

3. [5, R1-R3, R5, R6, R8, R10] Coordination (common understanding) challenges:
- Developing a common understanding of requirements, security, performance, and risk among teams, especially when a requirement is unclear, ambiguous, or not written down, is challenging [5, R2, R3, R5].
- Lack of knowledge and common understanding of configuration, risk, security, and performance [R8]; lack of technical analysts to determine the technical risk associated with compatibility due to multiple machine configurations and multi-version systems [R8]; and the risk associated with security is challenging [R6, R10].
- Junior testers often lack an understanding of the risks and critical areas of a project, so the system is not thoroughly tested even when there is a written requirement document [R1].
Mitigation:
- Communication, meetings, and knowledge sharing for auditing the requirements [R1-R6].
- To mitigate the risk associated with configurations: include expertise in the teams [R1, R8], reduce configuration changes, and automatically execute assertions [R8].
- To mitigate the risk associated with security: proper security documentation, correctly understanding security principles, security metrics for testers to understand and follow [R10], communication, and a dynamic security matrix [R10].
- Companies should sign off a common agreement on requirements among the teams [R5].

Performance Efficiency – Time Behavior, Resource Utilization, Capacity

4. [43] Identifying a few performance counters:
- It is challenging to identify a few relevant performance counters from highly redundant data.
Mitigation: Use Principal Component Analysis (PCA) to reduce the large volume of performance-counter data to a smaller, manageable set (see the sketch after entry 6).

5. [44, 45] When is the right time to stop performance testing?
- It is challenging to decide when to stop performance testing: when the generated performance data is repetitive [44], or when the system is tested with simulated operations data [45].
Mitigation:
- Stop performance testing when the generated data becomes highly repetitive and the repetitiveness stabilizes [44].
- Compare the stop times generated by the algorithm (a Weibull-based algorithm) with the known perfect stop time [45].

6. [46, R10] Creating test cases that mimic actual data:
- Creating a test-case dataset that mimics the actual data in production is challenging [R10, 46].
Mitigation: Use data sanitization, the alteration of large datasets in a controlled manner to obfuscate sensitive information while preserving data integrity, relationships, and data equivalence classes.
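The PCA-based counter reduction of entry 4 could be sketched as follows, assuming scikit-learn is available; the counter matrix here is synthetic, and the 95% variance threshold is an arbitrary illustrative choice, not the setting used in [43].

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy load-test matrix: rows = sampling intervals, columns = performance
# counters (CPU, memory, queue lengths, ...); many counters are redundant.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))                # 3 independent behaviours
counters = base @ rng.normal(size=(3, 40))      # 40 correlated counters
counters += rng.normal(scale=0.05, size=counters.shape)

scaled = StandardScaler().fit_transform(counters)
pca = PCA(n_components=0.95)   # keep components explaining 95% of variance
reduced = pca.fit_transform(scaled)

print(f"{counters.shape[1]} counters -> {reduced.shape[1]} components")
print("explained variance ratios:", pca.explained_variance_ratio_.round(3))
```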

7. [R2, R7, R9, R10, 21] Test equipment to mimic real scenarios:
- Lack of test equipment to mimic real scenarios in performance testing of telecom networks [R2, R7]. Simulating a production server locally and doing performance testing for integrated, parallel, distributed, and dependable systems is challenging [R9, R10, 21], since the system you depend on does not allow you to send that many requests without breaking.
Mitigation: Use better test equipment if possible [R2], and complement performance testing with field testing, especially in the telecom domain [R2, R3, R7], to mimic real scenarios. Other mitigations are service virtualization or locally emulating the actual system your system depends on [R9, R10].

8. [29, 47, 48, 49, 50, 51, 52, 53, 55, R3] Fault or bug detection:
It is a challenge to identify faults effectively
- when bugs are scale-dependent [29];
- when a large-scale system possesses big data [47];
- while also improving the test infrastructure [48, 49];
- in a highly configurable system [50, 51, 52, R3];
- due to large numbers of false positives and false negatives (using code mining) [53];
- for a large combination of features [52];
- when performance bugs are workload-dependent [54];
- in large-scale data processing (e.g., Adobe Analytics) [55];
- when identifying critical bugs (in a component, API, etc.) [R7].
Mitigation:
- The authors developed WUKONG, which automatically, scalably, and effectively diagnoses bugs [29].
- Principal Component Analysis (PCA) can be used to detect faults [47].
- Apply model-based testing [48, 49].
- The authors developed NAR-Miner [50], which extracts negative association programming rules and detects their violations to find bugs, and genetic configuration sampling [51], which learns a sampling strategy for fault detection in highly configurable systems.
- The authors proved mathematical theorems to validate and compare random testing and combinatorial testing on the effectiveness of finding faults and doing configuration testing, and they combine random and combinatorial testing [52].
- The author developed EAntMiner to improve the effectiveness of code mining [53].
- The authors developed DYNAMO to find an adequate workload [54].
- Apply combinatorial (e.g., pair-wise) testing [55].
- One suggested mitigation: boundary-value analysis and boundary testing [R7].

9. [R3] Fault investigation:
- In general, fault investigation in a large-scale system is a time-consuming and challenging task [R3].
Mitigation: Incorporate machine learning (ML) and AI-based testing into the testing process [R3].

10. [31, 56, 57] Fault prediction:
In a large-scale system, it is challenging to
- predict faults effectively, due to data redundancy and noise [31, 56];
- predict and maintain flaky tests [31].
Mitigation:
- Combine fault-prediction algorithms rather than using a single machine-learning technique [56]; a minimal ensemble sketch follows this entry.
- The authors used a heterogeneous ensemble and prediction model to predict different faults in different data and reduce error [57].
- Use machine learning (ML), AI, and Bayesian networks to classify, predict, and maintain flaky tests [31].
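As a minimal illustration of the ensemble idea in entry 10 (not the specific models of [56, 57]), the following combines dissimilar scikit-learn classifiers by voting; the module-metric data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Toy stand-in for module metrics (size, churn, complexity, ...) labelled
# fault-prone / not fault-prone; real fault data is similarly imbalanced.
X, y = make_classification(n_samples=500, n_features=20, weights=[0.8],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Heterogeneous ensemble: combine dissimilar learners by (soft) voting so
# that no single model's bias dominates the fault prediction.
ensemble = VotingClassifier([
    ("rf", RandomForestClassifier(random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
], voting="soft")
ensemble.fit(X_tr, y_tr)
print("held-out accuracy:", round(ensemble.score(X_te, y_te), 3))
```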


11. [18, 20, 21, 58, 59, R2, R3, R9, R10] Performance and load testing:
It is a challenge to perform
- load testing of web applications running under heavy loads (a data tsunami) [58];
- performance testing effectively, efficiently, and fully [R2, R3]; for instance, it is challenging to test a system fully if the round-trip-time (RTT) requirement is not compatible with the test environment [R2, R3];
- load testing in the cloud, due to configuration, capacity allocation, and the tuning of multiple system resources [59];
- load testing under environment variation [18]; adjusting the test environment dynamically for a system deployed on a heterogeneous platform is a challenge;
- testing of critical, parallel, and distributed systems that demand intensive computation and multidimensional testing [20, 21].
Mitigation:
- Peer-to-peer load testing is valuable for isolating bottleneck problems [58].
- Field testing is a choice in the telecom domain [R2, R3].
- SvLoad is proposed to compare the performance of cache and server [59].
- The authors developed an ensemble-based modeling technique that leverages the results of existing environment-variation load tests and performs better than baseline models when predicting the performance of realistic runs [18].
- The authors identified testing workflows and designed a framework, ATLAS, to perform extensive tests on large-scale critical software [20].
- The authors proposed D-Cloud to run multiple test cases simultaneously and to enable automatic configuration of the test environment for dependable parallel and distributed systems [21].

12. [73] Analytics-driven load testing:
Analytics-driven load testing is challenging [73]:
(a) During test design: ensuring performance quality when the system is deployed in a production environment.
Mitigation: User behavior can affect a system in two ways: (i) the number of requests by a user (workload) and (ii) individual user behavior, such as activating a task 100 times. The solution to such problems is to aggregate and assess the user-level behavior of the system.
(b) During test execution: reducing test-execution effort.
Mitigation: The performance analyst needs to be able to quickly determine that a test is invalid (e.g., a network failure or disk space running out) and stop it.
(c) During test analysis: determining whether a test passes or fails.
Mitigation: Performance analysts often use keywords to locate error logs, but this does not always work; logs are the only indication of anomalies and may not bear the names the analyst assumes. The solution is to use a statistical method, a Z-test, to identify anomalous event sequences (see the sketch after entry 14).

13. [54, 60, 61, 62, 63, R4] Performance testing (e.g., load testing):
- Detecting performance deviations in components during load tests [60, 61].
- Obtaining accurate performance-testing results (for cloud applications) despite performance fluctuations [62].
- Performing load testing in a cost-effective and time-efficient manner [63, R4].
- Evaluating the effectiveness of load-testing analysis techniques [63].
- Identifying the appropriate load to stress a system and expose performance issues [54].
Mitigation:
- The authors present Principal Component Analysis (PCA) to identify subsystems that show performance deviations in load tests [60].
- Automate load testing using supervised (search-based) and unsupervised approaches to uncover performance deviations in load tests [61].
- The authors propose a cloud performance-testing methodology, PT4Cloud, which provides reliable stop conditions to obtain highly accurate performance distributions with confidence bands [62].
- Among statistical test-analysis techniques, Control Charts, Descriptive Statistics, and Regression Trees yield the best performance [63].
- To speed up and reduce testing costs in general, apply the best testing practices of large-scale testing and containerization (e.g., Docker) [R4].
- The authors propose DYNAMO, which automatically finds an adequate workload for a web application [54].

14. [64] Duplicate bugs:
- It is challenging to manage duplicate bugs and provide reasonable compensation in Crowdsourced Mobile Application Testing (CMAT) [64].
Mitigation: The authors offer guidelines on managing duplicate and invalid bug reports, and a compensation schema for testers, to make CMAT effective and address the challenges.
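The Z-test screening mentioned in entry 12 could look roughly like the following two-proportion test per log-event type; this is a sketch of the general statistic, not the exact method of [73], and the event names and threshold are invented.

```python
import math
from collections import Counter

def anomalous_events(baseline_log, test_log, z_threshold=3.0):
    """Flag log-event types whose frequency shifts significantly.

    A two-proportion z-test per event type: compare the event's rate in
    a known-good baseline run against the current load-test run and flag
    events whose |z| exceeds the threshold as candidate anomalies.
    """
    n1, n2 = len(baseline_log), len(test_log)
    c1, c2 = Counter(baseline_log), Counter(test_log)
    flagged = {}
    for event in set(c1) | set(c2):
        p1, p2 = c1[event] / n1, c2[event] / n2
        p = (c1[event] + c2[event]) / (n1 + n2)        # pooled proportion
        se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
        z = (p2 - p1) / se if se else 0.0
        if abs(z) >= z_threshold:
            flagged[event] = round(z, 2)
    return flagged

if __name__ == "__main__":
    baseline = ["login", "query", "query", "logout"] * 250
    test = ["login", "query", "timeout", "logout"] * 250  # 'timeout' is new
    print(anomalous_events(baseline, test))
```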

15. [65] Bug debugging and bug-report diagnosis:
- Bug debugging and bug-report diagnosis are expensive, time-consuming, and challenging in large-scale systems and systems of systems [65].
Mitigation: The authors propose an automated approach that assists the debugging process by reconstructing the execution path, which saves time.

16. [66] Classical load testing in CI/CD:
- The classical approach to load testing is hard to apply in parallel, frequently executed delivery pipelines (CI/CD) [66].
Mitigation: The author designed context-specific, low-overhead load testing for continuous software engineering (deriving the testing process from measurement data recorded in production through to load-test results). For further details, see the article.

17. [67, 68, 69] Fault localization:
- Effective fault or bug localization (both single-fault and multi-fault) has a long turnaround time and is challenging [67, 68, 69].
Mitigation (a spectrum-based sketch follows entry 19):
- For single-fault localization, spectrum-based fault localization (SFL) methods are effective and efficient; for multi-fault localization, the Fast Software Multi-Fault Localization Framework (FSMFL), based on genetic algorithms, is effective at addressing multiple faults simultaneously [69].
- For a single fault, fault localization based on complex network theory for a single-fault program (FLCN-S) improves localization effectiveness in a single-fault context [68].
- Program slicing spectrum-based software fault localization (PSS-SFL) locates faults more precisely [126].
- To reduce turnaround time and predict the location of bugs, classification can be used [67].

18. [70] Achieving and maintaining system performance at the desired level:
- Achieving the required system performance and maintaining it while reducing cost [70].
Mitigation: Establish performance quality gates (performance trending, speed and efficiency, reliability) at various points of the software-integration phases to avoid or reduce unforeseen system-performance degradations [70].

19. [71, 72] Microservices:
- In a large-scale microservice architecture, where thousands of coexisting applications must work cooperatively, task scheduling [71], resource scheduling [72], and performance modeling [71] are challenging.
Mitigation:
- The authors designed a performance modeling and prediction method for microservice-based applications, formulated the Microservice-based Application Workflow Scheduling problem for minimum end-to-end delay under a user-specified Budget Constraint (MAWS-BC), and proposed a heuristic microservice scheduling algorithm [71].
- Kubernetes provides a deployment tool, Kubemark, to address performance and scalability issues [72].
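The spectrum-based fault localization of entry 17 can be illustrated with the widely used Ochiai coefficient; this is a generic sketch, not the FSMFL, FLCN-S, or PSS-SFL techniques, and the tiny coverage matrix is invented.

```python
import math

def ochiai_ranking(coverage, results):
    """Spectrum-based fault localization with the Ochiai coefficient.

    `coverage[t]` is the set of statements executed by test t, and
    `results[t]` is True when the test passed. Statements executed
    mostly by failing tests receive a higher suspiciousness score.
    """
    total_failed = sum(1 for passed in results.values() if not passed)
    statements = set().union(*coverage.values())
    scores = {}
    for stmt in statements:
        ef = sum(1 for t, stmts in coverage.items()
                 if stmt in stmts and not results[t])   # failing & covers
        ep = sum(1 for t, stmts in coverage.items()
                 if stmt in stmts and results[t])       # passing & covers
        denom = math.sqrt(total_failed * (ef + ep))
        scores[stmt] = ef / denom if denom else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    coverage = {"t1": {"s1", "s2"}, "t2": {"s2", "s3"}, "t3": {"s1", "s3"}}
    results = {"t1": True, "t2": False, "t3": False}    # s3 covered by both failures
    print(ochiai_ranking(coverage, results))
```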

Compatibility – Co-existence, Interoperability

20. [19, 52, 74, 75, 76, 77, 78, 79, 80, R10] Compatibility and configuration:
Challenges of multi-component, multi-version, configurable, and evolving integrated systems regarding compatibility:
- Performing compatibility testing for a large combination of features in a multi-component configurable system [52, 74, R10].
- Performing incremental and prioritized compatibility testing [52, 75, 76, 77, R10]; compatibility testing is challenging since it is infeasible to test all potential configurations.
- Interoperability and dynamic reconfigurability [19].
- Test optimization for compatibility testing while assuring coverage [78, 79].
- Selecting environments for compatibility testing (due to limited resources and the large number of variants to test) [80].
Mitigation:
- The authors combine random and combinatorial testing, and proved mathematical theorems to validate and compare random testing and combinatorial testing on the effectiveness of finding faults and doing configuration testing [52].
- The authors proposed Rachet for compatibility testing, which scales well and discovered incompatibilities between components [74, 75].
- The authors present a prioritization mechanism that enhances compatibility testing by examining the most important configurations first while distributing the work over a cluster of computers [76] (a prioritization sketch follows this entry).
- Prioritize and execute the highest-priority configurations first, and move some test cases out as part of the definition of done [R10].
- Reduce configuration changes as much as possible and test the combinations via automation [R8].
- The authors presented an approach that supports incremental, prioritized, and cache-aware compatibility testing of configurations for component-based multi-version systems that evolve independently over time [77].
- Pair-wise testing is recommended [125]; apply combinatorial (e.g., pair-wise) testing [79].
- Use as homogeneous an environment as possible [R5, 19], and have an API specification agreement that the teams sign off, communicating changes if there are any [19, R5].
- The author proposed a test-case selection approach that analyzes the impact of configuration changes with static program slicing [78].
- An evolutionary algorithm is proposed to select the best configurations, with environment-sensitivity measures to discover defects; machine learning (ML) is used to correlate changes in the application code [80].
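A minimal sketch of preference-based configuration prioritization in the spirit of [76], not their actual algorithm; the configurations and usage weights are hypothetical.

```python
def prioritize_configurations(configurations, usage_weights):
    """Order compatibility-test configurations by user-preference weight.

    Each configuration is a dict of component -> version; its score is
    the sum of how often users run each chosen version, so the most
    field-relevant configurations are tested first.
    """
    def score(config):
        return sum(usage_weights.get((comp, ver), 0)
                   for comp, ver in config.items())
    return sorted(configurations, key=score, reverse=True)

if __name__ == "__main__":
    configs = [
        {"db": "mysql8", "jdk": "11"},
        {"db": "mysql5", "jdk": "8"},
        {"db": "mysql8", "jdk": "8"},
    ]
    weights = {("db", "mysql8"): 70, ("db", "mysql5"): 30,
               ("jdk", "11"): 55, ("jdk", "8"): 45}
    for cfg in prioritize_configurations(configs, weights):
        print(cfg)
```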

21. [R8] Risk of compatibility and configuration:
- Lack of knowledge and common understanding of configuration, risk, security, and performance [R8].
- Lack of technical analysts to determine the technical risk associated with compatibility due to multiple machine configurations and multi-version systems [R8].
Mitigation: Include expertise in the teams, reduce configuration changes, and automatically execute assertions [R8].

22. [R7] Platform compatibility:
- Business without a sense of ethics has a significant impact on compatibility challenges and compatibility testing [R7]. For instance, the routing protocol OSPF (Open Shortest Path First) interacts with different vendors' routers, whereas the Enhanced Interior Gateway Routing Protocol (EIGRP) interacts only with Cisco systems, even though EIGRP is the better protocol [R7].
Mitigation: Build standardized settings or a common platform where components, protocols, and products interact, communicate, and remain compatible [R7].

Usability – Appropriateness Recognizability, Learnability, Operability, User Error Protection, User Interface Aesthetics, Accessibility

23. [R1-R4, 81] Usability:
- Usability and/or accessibility testing is given low priority and little acknowledgment (time, resources), e.g., in the telecom domain [R1, R2, R3, R4]. It would be easy to design the GUI according to the interests of the target groups if usability testing were acknowledged [R3].
- Detecting duplicate crowd-testing reports effectively and efficiently is difficult [81].
Mitigation:
- Outsource the task to specialized companies to save resources (time, cost, people) [R3].
- Use a focus group to check for usability (accessibility, operability, etc.) [R4].
- The authors propose Screenshots and Textual (SETU), which adopts a hierarchical algorithm combining information from the screenshots and the textual descriptions to detect duplicate crowd-testing reports (extract features, calculate similarity, then discover duplicates) [81].

Reliability – Maturity, Availability, Fault Tolerance, Recoverability

24. [R1, 82] Service disruption:
- Service disruptions are challenging due to the difficulty of reproducing bugs [R1, 82]. Non-reproducible service disruptions are hard to fix; only reproducible service-disruption faults are solved before product delivery [R1]. It is also challenging to reproduce performance bugs for performance testing [82].
Mitigation:
- Assign enough resources (time, programmers) to fix any service-disruption fault, even one that appeared only once; the trade-off is between time to service delivery and quality [R1].
- Generate worst-case reproducible test inputs from dynamic symbolic execution [59, 60].
- [82] provides guidance and a set of suggestions to successfully replicate performance bugs and improve the bug-reproduction success rate.

25. [R2] Achieving reliability at an acceptable level:
- No system is 100% reliable, but achieving a reliable system at an acceptable level of reliability is challenging. To achieve a reliable system, the test environment (both hardware and software) should itself be reliable. Hardware can be affected by different factors (heating, bit flips, and interfaces, i.e., format conversions back and forth); software can be affected by various factors (a wrong or misunderstood test script) [R2].
Mitigation:
- Cool the system if a hardware error is due to heating [R2].
- Test the test script if the error comes from the test script (write a test script that tests the test script itself) [R2].
- There is no mitigation if the hardware error comes from a bit flip [R2].
- Using CI/CD and versioning is one good way to achieve a reliable system to some extent [R2].

Mitigation  Automate to tracks and handles the root cause analysis. A system tracks bugs, faults; failure by collecting all possible logs until the trouble report is closed after the root cause is identified. A serious measure has to be taken to improve development, testing, design, process, etc. after the root cause is identified [R1].  Automate a system that manages the root-cause analysis, and mitigation can be collecting all the possible logs [R2].  Chen et al. 2017 extensively evaluate and studies performance and root cause of performance regression at the commit level [120]. 27 [84] Challenge of testing for correctness and reliability Testing for correctness and reliability of large-scale cloud-based applications is a challenging, costly, and traditional simulation, and an emulation-based alternative is too abstract or too slow for testing[84]. Mitigation  A testing approach that combines simulation and emulation in a cloud simulator that runs on a single processor[84]. 28 [85, Challenge on test input generation for stress testing 86] It’s a challenge to generate  Scalable worst-case test inputs generation (from dynamic symbolic execution) to use for stress or performance testing [85, 86]. Automatic test input generation by Dynamic symbolic execution (DSE) is challenging to scale up due to the path space explosion. Mitigation  A Proposed algorithm, PySE, to Automatic Worst-Case Test Generation by Reinforcement Learning and incrementally update


• A proposed modeling technique, XSTRESSOR, generates dynamic worst-case test inputs by inferring path conditions [85].
• The authors propose using a combination of dynamic symbolic execution (DSE) and random search [86].

29 [22, 23, 87, 88, 89, 90, 91, 92, 93, 94] Challenge of optimal resource allocation
• It is a challenge to allocate testing resources optimally, owing to reliability constraints and limited resources [61-70, R8].
Mitigation
• Different authors have proposed different options for handling the resource-allocation challenge: a genetic algorithm [22, 23], a harmonic-distance-based multi-objective evolutionary algorithm [87, 88], a combination of software reliability growth models (SRGMs) and a systematic, sequential algorithm [89], an optimization algorithm based on the Lagrange multiplier method [90], search-based strategies [91], a multi-objective evolutionary algorithm [92], a genetic local-search algorithm [93], and a combination of SRGMs and a genetic algorithm [94].
• A genetic algorithm is used to solve optimization problems; concerning reliability, it gives optimal results by learning from historical data [62]. A minimal sketch of this idea appears after row 30 below.
• One suggestion: check the log history and act accordingly; if the problem is related to garbage collection, increase memory when doing robustness testing [R8].

30 [95] Validation challenge for autonomous cars
• Validation and testing of the safety, reliability, and security of an autonomous vehicle are challenging [95].
Mitigation
• One suggested mitigation is that all stakeholders agree on a standard protocol and on inclusive, thoughtful solutions that satisfy consumer, industry, and government requirements, regulations, and policies.
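To make the genetic-algorithm mitigation in row 29 concrete, the following minimal Python sketch allocates a fixed testing budget across modules so as to maximize an assumed exponential reliability-growth objective. The module detection rates, the fitness function, and all GA parameters are hypothetical illustrations, not taken from [22, 23] or [62].

import math
import random

# Hypothetical per-module fault-detection rates for an exponential
# reliability-growth model: R_i(e) = 1 - exp(-b_i * e).
DETECTION_RATES = [0.8, 0.5, 1.2, 0.3, 0.9]
BUDGET = 100.0                    # total person-days available for testing
POP_SIZE, GENERATIONS, MUT_STD = 40, 200, 0.1

def normalize(shares):
    """Scale raw shares so the allocation spends exactly the budget."""
    total = sum(shares)
    return [BUDGET * s / total for s in shares]

def fitness(ind):
    """Sum over modules of the expected fraction of faults detected."""
    return sum(1 - math.exp(-b * e)
               for b, e in zip(DETECTION_RATES, normalize(ind)))

def crossover(a, b):
    """Uniform crossover: each gene is inherited from either parent."""
    return [random.choice(pair) for pair in zip(a, b)]

def mutate(ind):
    """Gaussian perturbation, clipped so shares stay positive."""
    return [max(1e-6, g + random.gauss(0, MUT_STD)) for g in ind]

def allocate():
    pop = [[random.random() for _ in DETECTION_RATES] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        pop.sort(key=fitness, reverse=True)
        elite = pop[:POP_SIZE // 4]           # keep the best quarter
        pop = elite + [mutate(crossover(random.choice(elite),
                                        random.choice(elite)))
                       for _ in range(POP_SIZE - len(elite))]
    return normalize(max(pop, key=fitness))

print([round(e, 1) for e in allocate()])  # effort per module, in person-days

In practice the fitness would come from a calibrated SRGM per module and constraint handling would be more careful, but the search loop itself stays this simple.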

Security – Confidentiality, Integrity, Non-repudiation, Authenticity, Accountability

31 [R1-R3, R6, R7, R10] Challenges of a common understanding
• Developing a common understanding of security and of the risk associated with security is a challenge [R1-R3, R6, R10]. Testers often lack a hacker's mindset when security testing is done [R7, R10].
Mitigation
A few suggested mitigations:
• Knowledge sharing and meetings [R1-R3, R6].
• Good security documentation and security metrics that a tester can understand and follow [R10].
• Properly understand security principles and security verification [R7, R10]; communicate with security specialists frequently, keep your knowledge up to date, and maintain a dynamic security matrix [R10].
• Outsource the job to a consultancy if there are no dedicated security-testing teams; deploy the system in a secure cloud, e.g., Amazon Web Services [R7].


32 [5] Unique security challenges in large-scale agile
• General coordination challenges in a multi-team environment (alignment of security objectives in distributed settings and development of a common understanding of security activity among teams) [5].
• General quality-assurance challenges (integration of low-overhead security testing tools) [5].
Mitigation
• Knowledge sharing and meetings [R1-R3, R6].
• A serious meeting with security experts is essential [R6].

33 [R8] Challenge of coping with security threats
• Security verification and coping with security threats are challenging [R8].
Mitigation
• Make sure to cover the high-priority security issues; upgrade and share knowledge on security [R8].
• Good security documentation and security metrics that a tester can understand and follow [R10].
• A security specialist creates a dynamic security matrix and stays up to date on security metrics, attacks, etc. [R10].
• Deploy the system in a secure cloud [R7].

34 [R3] Challenge of the level of trust
• Establishing trust in the infrastructure, platform, tools, and security expertise is challenging [R3].
Mitigation
• Outsourcing security testing to a consultancy can be a mitigation.
• Deploy the web service on a trusted cloud, for instance, Amazon Web Services.

35 [96] Challenge of DDoS on SDN networks
• Effective detection of Distributed Denial of Service (DDoS) attacks on Software-Defined Networks (SDNs) is challenging [96].
Mitigation
• The authors' proposed framework, a deep-learning (DL) based CNN (Convolutional Neural Network), achieves high detection accuracy (99.48%) of DDoS in SDNs. An illustrative sketch of a CNN-based flow classifier appears after row 38, at the end of this Security subsection.

36 [97, 100] Security-testing challenges
• Security-oriented risk assessment is challenging [97].
• Security-oriented code auditing is challenging, expensive, and time-consuming [100].
Mitigation
• The authors used a combination of security testing and risk assessment, supported by methods and tools (e.g., ACOMAT) [97].
• Use software code-auditing tools such as PScan in combination with security expertise [100].

37 [99] Security challenge due to static application security testing (SAST)
Integrating SAST at large scale introduces challenges:

• Technical challenges (high false-positive and false-negative rates) and non-technical challenges (insufficient security awareness among the teams and integrating SAST into agile).
Mitigation


• Work closely with a small team.
• Pick a handful of security problems first.
• Measure the outcome.
• Plan and invest enough resources and infrastructure (introducing SAST requires significant resources).
• Execute scans regularly, since SAST is not a one-time effort that solves all security issues once and for all.
• Plan your changes and updates, etc.

38 [98] Testing challenge in systems with safety- and security-critical relevance
• Handling testing (e.g., continuous regression testing) efficiently in large-scale systems with safety- and security-critical relevance is a challenge because of the large number of requirements across several product variants and configurations [98].
Mitigation
• Keep the testbed environment simple and easily manageable.
• Increase awareness and communication among teams.
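As referenced in row 35, a deep-learning DDoS detector can be sketched with a small 1-D CNN over windows of flow statistics. The Keras fragment below is illustrative only: the layer sizes, the six flow features, and the random placeholder data are assumptions, not the architecture evaluated in [96].

import numpy as np
import tensorflow as tf

# Hypothetical input: each sample is a window of 20 consecutive flow-statistic
# vectors (e.g., packet count, byte count, duration, flag ratios - 6 features).
WINDOW, FEATURES = 20, 6

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, FEATURES)),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # benign vs. DDoS
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Placeholder random data stands in for labeled SDN flow captures.
X = np.random.rand(256, WINDOW, FEATURES).astype("float32")
y = np.random.randint(0, 2, size=(256, 1))
model.fit(X, y, epochs=2, batch_size=32, verbose=0)

A real pipeline would train on labeled SDN flow captures and report precision and recall rather than accuracy alone, since attack traffic is usually a small minority of flows.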

Maintainability – Modularity, Reusability, Analysability, Modifiability, Testability

39 [R3, R4, R9, R10, 102] Challenges of maintaining regression testing
It is a challenge
• To manage an outdated test suite efficiently and effectively [R3, R4, R9, R10, 102].
• To build and maintain a test suite for continuously updated regression testing [102].
Mitigation
• Make regression testing part of the Definition of Done [R10], then keep the regression tests all green.
• Work with a CI/CD pipeline (continuous integration, continuous delivery), good architecture, and clear requirements, and use Docker [R3, R4, R9].

• The author used a tool, REQAnalytics, to continuously update regression tests [102].
• Apply test-case reduction or regression-testing reduction techniques (requirement-based, coverage-based, genetic algorithm, etc.) [117]; a minimal coverage-based sketch appears after row 41 below.

40 [102] Challenge of the impact of changes
The different services provided by large-scale web applications/web services must be maintained and improved over time, since infrastructure and technology evolve rapidly. As a result,
• It is a challenge to estimate the impact of changes [102].
Mitigation
• The author used a tool, REQAnalytics, to continuously update regression tests [102].

41 [R3, R4] Challenge of pruning outdated test suites
• Effectively pruning outdated test suites from existing modules, or refactoring legacy test suites, is challenging [R3, R4].
Mitigation
• Clear documentation of dependencies, writing testable code, skilled testers, and applying best practices to refactor and prune legacy test suites [R3, R4].
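As a concrete illustration of the coverage-based reduction techniques cited in row 39 [117], the following Python sketch applies the classic greedy set-cover heuristic. The test names and requirement coverage are invented for the example.

# Greedy set-cover reduction: keep the smallest set of tests that still
# covers every requirement (or code element) covered by the full suite.
coverage = {
    "test_login":    {"req1", "req2"},
    "test_logout":   {"req2"},
    "test_checkout": {"req3", "req4"},
    "test_search":   {"req1", "req4"},
}

def reduce_suite(coverage):
    uncovered = set().union(*coverage.values())
    kept = []
    while uncovered:
        # Pick the test that covers the most still-uncovered items.
        best = max(coverage, key=lambda t: len(coverage[t] & uncovered))
        if not coverage[best] & uncovered:
            break  # safety guard; cannot trigger while uncovered is reachable
        kept.append(best)
        uncovered -= coverage[best]
    return kept

print(reduce_suite(coverage))  # ['test_login', 'test_checkout'] covers all four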


42 [R9] Regression-testing challenge
In large-scale regression testing, it is a challenge
• To keep track of faults while doing regression testing, especially with unstable tests [R9].
Mitigation
• Work with a CI/CD pipeline; test in smaller chunks [R9].

43 [103] Challenge of regression test-case selection
• Effective test-case selection for regression testing is a challenge [103].
Mitigation
• The author proposed an automated method of source-code-based regression-test selection for software product lines (SPLs). The method reduces repetition of the selection procedure and minimizes the in-depth analysis effort for source code and test cases, based on the commonality and variability of a product family.

44 [107, 108, 121] Challenge of an optimized test suite
• Ensuring a "realistic" and optimized test suite from outdated test cases is challenging, since test cases become outdated as a system evolves [107]. Without realistic, up-to-date, optimized test suites, it is hard to continuously validate performance-test workloads [121]. For instance, test optimization helps speed up distributed GUI testing, but achieving an optimized test suite that shortens testing time for usability testing in a distributed environment is challenging [108].

Mitigation
• Chi, Zongzheng et al. (2017) suggested an algorithm (multi-level random walk) that optimizes and reduces test cases [107]. Apply test-case reduction or regression-testing reduction techniques [117, 122]. [121] proposed a way to validate whether a performance test resembles the field workload and, if not, to determine how they differ and update the tests to eliminate such differences, creating more realistic tests [121].
• S. Wongkampoo and S. Kiattisin (2017) proposed the Atom-Task precondition technique, which filters tests before running the parallel testing process, to optimize testing time, avoid worker-bottleneck problems, and scale to large-scale testing in a distributed environment [108].

45 [104] Challenge of reusing class-based test cases
Frameworks are introduced to reduce costs, make designs and implementations reusable, and increase the maintainability of software across deployments. The key testing challenges concerning frameworks are

• Development, evolution, and maintenance of test cases that ensure the framework operates appropriately in a given application.
Mitigation
• Reusable test cases could be used to increase the maintainability of software products, since a brand-new set of test cases does not have to be written each time the framework is deployed; a minimal sketch of such a reusable test class follows.
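Row 45's mitigation, reusing test cases across framework deployments, can be sketched as a shared "contract" test class that each deployment specializes. The storage contract, adapter, and test names below are hypothetical; the pattern, not any particular API, is the point. Run with pytest.

# Reusable contract tests: the test bodies are written once and reused for
# every deployment of the (fictitious) storage framework.
class StorageContract:
    """Subclass and set `adapter` to reuse these tests for a new deployment.
    pytest only collects classes named Test*, so this base never runs alone."""
    adapter = None

    def test_roundtrip(self):
        self.adapter.save("key", "value")
        assert self.adapter.load("key") == "value"

    def test_missing_key(self):
        assert self.adapter.load("no-such-key") is None

class InMemoryAdapter:
    """One concrete deployment; a real one might wrap a database client."""
    def __init__(self):
        self._data = {}
    def save(self, key, value):
        self._data[key] = value
    def load(self, key):
        return self._data.get(key)

# Reusing the whole suite for this deployment costs two lines:
class TestInMemoryDeployment(StorageContract):
    adapter = InMemoryAdapter()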


46 [105] Challenge of optimizing continuous-integration testing practice
Large-scale systems are characterized by evolution (frequent change) and large software size.

• Optimizing continuous integration (CI) to handle code changes via regression testing faces long-standing scalability problems due to test redundancy [105].
Mitigation
• [105] proposes an algorithm to optimize CI regression-test selection. The algorithm uses coverage metrics to classify tests as unique, totally redundant, or partially redundant, and then classifies partially redundant test cases as likely or unlikely to detect failures, based on their historical fault-detection effectiveness. A minimal sketch of the coverage-based classification appears after row 48 below.

47 [106] Challenge of cost-effective regression testing of databases
• Regression testing of database applications is costly and challenging. Analysts can rely on synthetic data or production data generated based on (i) combinatorial techniques or (ii) operational profiles. A combinatorial test suite might not be representative of system operations, and automating testing with production data is impractical.
Mitigation
• Use production data together with synthetic data generated through classification-tree models of the input domain. Combinatorial testing strategies are essential in terms of both the number of regression faults discovered and the importance of those faults.

48 [123] Challenge of improving testing during the maintenance process at large-scale IT service providers
• Key challenges in maintenance testing relate to the recording of test cases, inadequate testing resources, test documentation, presenting test documentation to customers, and poor visibility of testing in the early phases of software development.
Mitigation
• Collect information for the control and monitoring of testing.
• Provide detailed delivery notes of changes.
• Schedule and inform more accurately.
• Have the test environment ready for testing at the agreed time, and freeze the source code at the agreed time.
• Plan testing resources for the agreed time, and carry out the testing at the agreed time.
• Provide detailed instructions for maintenance testing.
• Ready-made test cases would speed up customer testing and make it easier.
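As referenced in row 46, the unique / totally redundant / partially redundant classification can be sketched with plain set operations over per-test coverage. The coverage data and exact predicates are illustrative assumptions; [105] additionally weighs historical fault-detection effectiveness, which is omitted here.

# Per-test coverage sets (hypothetical code elements a, b, c, d, e).
coverage = {
    "t1": {"a", "b", "c"},   # fully covered by t2 and t3 together
    "t2": {"a", "b"},
    "t3": {"c", "d"},
    "t4": {"e"},             # the only test covering 'e'
}

def classify(test):
    """Classify a test relative to the rest of the suite."""
    others = set().union(*(c for t, c in coverage.items() if t != test))
    own = coverage[test]
    if own.isdisjoint(others):
        return "unique"               # nothing it covers is covered elsewhere
    if own - others:
        return "partially redundant"  # adds some coverage, shares the rest
    return "totally redundant"        # adds no coverage of its own

for t in coverage:
    print(t, "->", classify(t))
# t1 -> totally redundant, t2 -> totally redundant,
# t3 -> partially redundant, t4 -> unique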

Portability – Adaptability, Installability, Replaceability


49 [101] Challenge of assuring quality for enterprise systems
• Assuring the quality of enterprise systems that must be deployed in an environment containing many thousands of interconnected systems is a challenge.
Mitigation
• The author proposes the use of a scalable emulation environment, Reac2o, enabling real-time system testing in large-scale settings.

General -

50 [R2, R3] Challenge of modeling real-world scenarios
It is a challenge
• To model and test a real-world situation at an acceptable level of abstraction [R2, R3]. Models exist to simulate such situations, but they are inaccurate.
Mitigation
• Design a model that handles the specialized scenarios [R2].

51 [R4-R6, 72] Challenge of coordinating testing activities
Due to the heterogeneity and complexity of large-scale systems,
• It is a challenge to coordinate testing activities [R5]. With microservices, where thousands of coexisting applications must work cooperatively, integration and end-to-end testing are challenging [R4, R6, 72], since some components may not behave as expected or may not be ready for integration testing [R5].
Mitigation
• Use a proper DevOps system with versioning so that releases can be rolled back if needed [R4, R5]. Write testable, clean code; reuse code and test suites; and use containerization (e.g., Docker) [R4, R6].
• Kubernetes provides a tool, Kubemark, to address performance and scalability issues [72].
• It could also be mitigated by interoperability testing [10].

52 [R7] Challenge of communication
In large-scale systems, because of their size, complexity, and multi-disciplinary nature,
• Cross-functional communication is a challenge [R7]. Both internal (team) and external (stakeholder) communication are challenging in large-scale systems in general. For instance, internal communication challenges might arise from ambiguous requirements [R7].

Mitigation
• Communication challenges due to requirement ambiguity can be mitigated by clarifying the requirements through meetings and discussion [R7].

53 [109] Challenge of verification and validation
The increase in the variability and complexity of automotive systems, driven by the growing electrification and digitalization of vehicles, poses verification and validation challenges [109].
Mitigation
• The challenges are mitigated using TIGRE, a product-line testing methodology.


54 [110] Challenge of cloud adoption
• Decision-making regarding the adoption of cloud computing (CC) for software testing is challenging [110].
Mitigation
• The author proposed an adoption-assessment model, based on a fuzzy multi-attribute decision-making algorithm, for adopting CC for software testing.

55 [111] Challenge of testing big-data analytics
• Efficient and effective testing of big-data analytics (e.g., Google's MapReduce, Apache Hadoop, and Apache Spark) is a challenge, since such systems process terabytes and petabytes of data [111].
Mitigation
• The authors proposed BigTest for efficient and effective testing of big-data analytics.

56 [112] Challenge of agile web acceptance testing
In agile development, collaboration among customers, business analysts, domain experts, developers, and testers is essential. However, in rapidly developed, large-scale agile projects, the challenge during testing activities is keeping large sets of test artifacts
a) understandable and accessible to various stakeholders,
b) traceable to requirements, and
c) easily maintainable as the software evolves.
Mitigation
• Use testing tools that leverage a domain-specific language to streamline functional testing in agile projects.
• Key features of such tools include test-template generation from user stories, model-based automation, test-inventory synchronization, and centralized test tagging.

57 [113] In large-scale systems, testing the safety, reliability, and security of embedded systems through model-based testing is challenging in two respects: I. testing the model dynamically, and II. improving the efficiency of model-based testing [113].
Mitigation
• An enhanced adaptive random testing approach, investigated to generate test cases for AADL(6) model-based testing, was found to be more efficient than traditional random testing.

58 [48, 106, 114, 115, 116, 117, 124] Challenge of test management and the test process
• Improving, managing, and optimizing testing (infrastructure, process) and reducing testing cost in large-scale (distributed, critical, etc.) systems is challenging [48, 106, 114, 115, 116, 117].
• Reducing development and testing costs by detecting defects at an early stage is a challenge [115].
• Deciding whether additional testing is necessary [115].
• Conducting effective testing at lower cost while avoiding risk [115].

6 (Architecture Analysis & Design Language – used to model the software and hardware architecture of an embedded, real-time system)


• Managing test cases [116].
• Achieving computationally cost-effective testing that speeds up the testing process is a challenge in large-scale systems [124].
Mitigation
• The authors used a tool, eXVantage, for coverage testing, debugging, and performance profiling [115].
• The authors discussed recommendations on how to perform and better leverage test-case management [116].
• Apply model-based testing to improve the test infrastructure [48]; use process-oriented metrics-based assessment (POMBA) to monitor a testing process [114]; use a "software design for testability" strategy to write clean code and improve quality; and apply techniques to check program correctness, for instance a "test condition oracle" [106, 117].
• Use grid computing (or distributed computing) [124]. A swarm-intelligence task-scheduling approach has also been proposed to tackle the dynamism and performance-heterogeneity problems of a grid-computing environment, with the expectation of reducing test-task completion time.

59 [118, 119] Challenge of designing optimized combinatorial test suites for systems that require complex input structures
Large-scale software systems commonly require structurally complex test inputs. Thus, it is a challenge
• To design optimized combinatorial test suites that protect against failures caused by a variety of factors [118].
• To generate optimized tests that are behaviorally diverse and legal to use for testing [119].
• To apply pair-wise combinatorial testing when a system requires complex input structures that cannot be formed from arbitrary combinations [119].
Mitigation
• Combinatorial pair-wise testing is used to generate an optimized test suite from an ample combinatorial space; the Hadoop distributed computing system can be used to speed up running the algorithms on large test suites [118]. A minimal pairwise-generation sketch follows this row.
• The authors proposed an efficient approach, ComKorat, which unifies pair-wise testing (for scalability) and constraint-based generation (for structurally complex inputs) [119].
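As referenced in row 59, pair-wise generation can be illustrated with a small greedy generator. The parameters and values below are invented; real pairwise tools (and ComKorat [119]) avoid enumerating the full Cartesian product, which this toy version does for clarity.

from itertools import combinations, product

# Hypothetical configuration space; the systems in [118, 119] are far larger.
PARAMS = {
    "os":      ["linux", "windows", "macos"],
    "browser": ["chrome", "firefox"],
    "locale":  ["en", "de", "jp"],
}

def pairs_of(test):
    """All cross-parameter (param, value) pairs this test exercises together."""
    return {((p1, test[p1]), (p2, test[p2]))
            for p1, p2 in combinations(PARAMS, 2)}

def greedy_pairwise():
    # Every cross-parameter value pair must appear in at least one test.
    required = {((p1, v1), (p2, v2))
                for p1, p2 in combinations(PARAMS, 2)
                for v1 in PARAMS[p1] for v2 in PARAMS[p2]}
    candidates = [dict(zip(PARAMS, combo))
                  for combo in product(*PARAMS.values())]
    suite = []
    while required:
        # Pick the candidate covering the most still-uncovered pairs.
        best = max(candidates, key=lambda t: len(pairs_of(t) & required))
        suite.append(best)
        required -= pairs_of(best)
    return suite

suite = greedy_pairwise()
print(f"{len(suite)} pairwise tests instead of 18 exhaustive combinations")

The os x locale pairs alone force at least nine tests here, so the greedy result is close to optimal; the savings over exhaustive combination grow rapidly as parameters multiply.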

Appendix 2 – Interview Questions
Interview questions
1. Thank you for accepting my invitation to this interview. This interview is meant to collect data for answering my research questions in partial fulfillment of my master's thesis. I would like your consent to record your voice; in accordance with the GDPR and other relevant personal data


acts, you will be treated anonymously, and your voice recording will be deleted once the data is transcribed. Is that OK?

Table 7. Interview Questions
Q1 Would you please tell me about yourself?
Q2 What is your current (or previous) role in your organization?
Q3 Have you been working on large-scale software development and/or testing projects?
Q4 How many years have you worked in the software engineering field as a developer and/or tester?
Q5 What are the key challenges of large-scale software testing that hinder the development of quality software products? That is, what are the key challenges of large-scale software testing concerning the product quality attributes (functional suitability, performance efficiency, compatibility, usability, reliability, security, maintainability, and portability)?
Q7 What are the possible mitigations of large-scale software testing challenges?
Q8 Of the core system attributes that define large-scale (complexity, size, evolution, emergence, heterogeneity, variability, reusability, distribution), which are the sources of large-scale testing challenges in your domain?
