MASARYK UNIVERSITY FACULTY OF INFORMATICS

Corner cases and possible improvements of automated testing

MASTER'S THESIS

Petra Mikova

Brno, Spring 2020

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Petra Mikova

Advisor: RNDr. Adam Rambousek, Ph.D.

Acknowledgements

I would like to thank my advisors Mgr. Jiri Vanek and RNDr. Adam Rambousek, Ph.D. for their help, valuable advice and support, and the members of OpenJDK QE team in Red Hat for their expertise and input.

Abstract

The aim of this thesis is to analyze the current state of OpenJDK QE in Red Hat, to identify the biggest issues and to resolve them. That is achieved both by outlining new tools and by reusing existing tools and software. Most of the problematic corner cases mentioned in this thesis are not product specific, making the solutions applicable to other QE teams. The last part of the thesis is dedicated to improving automated result processing and reporting.

Keywords

OpenJDK, Jenkins, quality engineering, test automation

Contents

Introduction

1 Automation in testing

2 OpenJDK
  2.1 What is OpenJDK
  2.2 OpenJDK and Red Hat
  2.3 OpenJDK QE

3 Test automation tooling
  3.1 Jenkins
    3.1.1 Architecture
  3.2 Beaker
    3.2.1 Beaker usage
    3.2.2 Beaker machine pool
    3.2.3 Downsides

4 Major issues of OpenJDK QE
  4.1 Mixing results of various projects
    4.1.1 Sorting and comparing results
    4.1.2 Separating jobs
  4.2 Incomplete matrix of current jobs
    4.2.1 Problems caused by manual cloning
  4.3 Template job generation

5 O-tool
  5.1 The O-tool idea
  5.2 Build and test matrix
    5.2.1 Matrix declaration
  5.3 Job generation
  5.4 Implementation

6 Common issues and handling
  6.1 Hanging Jenkins jobs
    6.1.1 Aborting of long-running jobs
    6.1.2 Implied consequences
  6.2 Product download issues
    6.2.1 Inaccessible remote server
    6.2.2 Insufficient executor storage
  6.3 Known test failures
    6.3.1 Old tests with compatibility issues
    6.3.2 Testcase timeouts
    6.3.3 Dependency issues
    6.3.4 False negatives of unmaintained test suites

7 Test result evaluation and processing
  7.1 New available result reporting
  7.2 Result comparison
    7.2.1 Narrowing the case
    7.2.2 Results comparing
    7.2.3 Jenkins report plugin
  7.3 Test fail processing possible improvements
    7.3.1 Type of the fail recording
    7.3.2 Broader history discovery
    7.3.3 Automatic reproducer extraction
    7.3.4 Test author search
    7.3.5 Blame a commit

8 Conclusion

Bibliography

List of Tables

List of Figures

3.1 Master node with several additional machines, that are used as slave nodes, split into several executor slots.
3.2 Artificial node that gathers all Jenkins jobs that are running in Beaker.
4.1 Different builds originating from different forests mixed in one test job.
5.1 JDK project representation and configuration.
5.2 JDK test project representation and configuration.
6.1 A stable test suite build trend.
6.2 Unstable test suite with manually shut down hanging jobs.
6.3 Current build step that starts the executor and runs the test suite.
6.4 Build Timeout plugin version of execute shell.
6.5 Example configuration for node monitoring.
7.1 Results of jtreg test suite displayed in the Jenkins Report JCK plugin view.
7.2 Comparison of previous and current run of jtreg test suite shows a new failing test.

Introduction

These days, information technology and software are an integral part of our lives. They are in the cars we drive, the planes that fly above our heads, the phones in our hands and in the banks that control the flow of money. Technology is everywhere and it is so intertwined with our lives that the cost of a single failure can be unimaginable. Failures do not always cost us only money; they can cost us lives or knowledge, or cause instability in our society. That makes technology reliability a key asset.

Quality engineering is an inseparable part of software development. Each day, thousands of quality engineers and testers find mistakes and errors in software that could, if released, cause serious issues to people and property and break customer trust. They find new ways in which the software can be misused, broken or exploited, but also check that already found issues do not appear again.

Programming languages and their libraries are like any other software product. However, their impact can be much wider than that of most single-purpose applications, because developers use them to create their software. A single bug can affect hundreds of applications and expose unexpected vulnerabilities.

The Java platform has a special place in the software world; it is one of the most favored programming languages, it is heavily used in financial services, data centres and other important industries, and it is the official language for Android development, which currently dominates the world of mobile operating systems. That is also because Java is considered a secure and reliable platform. The expectations of the community are high and the developers working on the implementations work hard to make sure they are fulfilled. There are many implementations of the Java platform. The open source implementation is called OpenJDK and it has a wide range of contributors from many companies such as Oracle, Red Hat, IBM, Amazon, Azul and more.

The goal of this thesis is to analyze the current state of automated testing of OpenJDK in the Red Hat OpenJDK QE team, to identify the biggest issues that are being struggled with and to outline possible solutions to these problems. The problems are usually not product or

company specific, so the designed solutions are applicable to other software and can be used by other engineers for similar purposes.

First, we will put the thesis into context - explain why it is important to introduce automated solutions and briefly acquaint the reader with OpenJDK. Then we will introduce the most important tools currently used by the team that are either directly related to the issues or a key part of the testing process. After that, we will analyze the major problems in the testing process and test suites and either try to find an existing solution for these problems, or propose one. At last, we will outline possible changes in results processing that could have a significantly beneficial impact on the workflow, as it seems to be one of the bigger bottlenecks in the testing process. Those changes could either be implemented as a result processing tool, or be used to extend existing programs created by the members of the team.

1 Automation in testing

"A major problem of the software industry is its inability to develop error- free software." [1] The quote above pretty much sums up the current situation in software development. Today's software products are becoming increasingly complex, but the need for reliability, stability, performance and functionality persists. That makes quality engineer• ing and software testing an integral part of software development, because with growing code base and functionality, it gets harder to ensure the quality of the product. The increasing requirements for testing are then reflected in higher need of manpower in quality assurance and engineering. Even though manual testing has its perks and is useful especially for cases that are difficult to automate, the automatic testing is usually more cost- effective, less error prone and more accurate, especially in the regres• sion testing 1, because these test suites tend to grow with every bug found. These days, automation is a key factor in quality engineering. Once created, the automated tests are easy to repeat and reproduce, fast in execution and if configured correctly, much more reliable. It enables us to maximize usage of resources and in general, it increases produc• tivity, especially in the time, when manpower is much more expensive than machine resources. However, the test automation is not just about writing and main• taining the tests. It is also about the scripts that execute them, mecha• nisms that record and collect the results and also the interpretation of the results. In general we can say, the more tests and results we have, the more we want and need to automate the process.

1. Regression testing is the repetitive re-running of tests that ensures that the developed and tested software performs the same way after a change.

2 OpenJDK

2.1 What is OpenJDK

OpenJDK is an open source implementation of Java Standard Edition. It is licensed under the GNU General Public License v2 with Classpath Exception [2]. There is a number of projects that are actively maintained by the community, the most known probably being JDK 8, JDK 11 and the JDK Project¹, but there are also various ports to alternative architectures (e.g. aarch64, PowerPC), or alternative implementations including features that are not a part of some of the main OpenJDK projects (e.g. Java Flight Recorder, Shenandoah GC, Graal, OpenJ9).

"Under the new JDK Project release model, new feature releases are planned to be released in a strict 6 month cadence." [3] That means that every six months, a new major version of OpenJDK is released. These releases are so called short term support releases (STS), so with a new major version released, their support is usually dropped unless someone steps up and continues maintaining them, which happens quite often - for example, Red Hat takes care of the upstream projects OpenJDK 7, 8 and 11. Oracle intends to designate an LTS (long term support) release every three years [4]. Given that the last LTS JDK release was JDK 11, it is expected that the next LTS should be JDK 17. However, this is not set in stone and a possible delay would not be unusual. Even though it is not a rule, other vendors that provide support for (Open)JDK are expected to usually follow this model.

1. The JDK Project is a long term project with the goal of producing open source reference implementations of the Java SE Platform.

2.2 OpenJDK and Red Hat

Red Hat is one of many companies that actively contribute to the development of OpenJDK, as it provides some of the OpenJDK projects as a part of its portfolio. "Red Hat has been a member of the OpenJDK community since 2007 and is one of the largest contributors to the project." [5] After Oracle's announcement that they will no longer provide free binary downloads or write patches for JDK releases after six months, Andrew Haley of Red Hat took over the leadership of both the OpenJDK 8 and OpenJDK 11 update projects [6].

As the Red Hat OpenJDK team took over the responsibility of maintaining the said projects, along with the help of the community, it was also necessary to appropriately extend the testing of OpenJDK to make sure the updates and backports did not affect the functionality and quality of the OpenJDK code base.

With the ownership of the OpenJDK 8 and 11 projects, there also arose a requirement for reference implementations - a sort of standard, example implementation of the project specifications. Hand in hand with that came a need for reference "vanilla" builds that are built using only upstream code, with the exception of essential patches that ensure buildability on older systems. Considering these reference builds and other forked projects, the number of tested builds rose significantly and new build variants and features keep coming into OpenJDK.

2.3 OpenJDK QE

Testing an implementation of the Java Platform is not a trivial thing to do, let alone testing several versions of it. With a growing number of supported variations of builds and packages, the workload is swiftly exceeding the scope that can be managed by humans. Even though there is already powerful automation in place, the OpenJDK QE team has issues adapting its testing processes to the arising challenges. It is not only the growing number of builds and configurations that need to be tested, but also an enormous number of tests that makes their job more difficult each day. With the constantly changing codebase of OpenJDK, it becomes challenging to adapt the tests to new patches, to analyze and process the results on time and to deliver the product in a timely manner.

3 Test automation tooling

In this chapter, we will briefly introduce the tools, technologies and software that are used for testing and test automation and that will often be mentioned in this thesis.

3.1 Jenkins

"Jenkins is a self-contained, open source automation server which can be used to automate all sorts of tasks related to building, testing, and delivering or deploying software." [7] It is a widely used tool for software build and test automation, that is easily extensible and deployable. It can be installed through Docker, native system packages, but also run stan• dalone with JRE, because it is written and built using Java, making it a cross platform product. And even though there are existing platforms that do not support Jenkins, that usually means, they do not support Java, so it is not required to test OpenJDK there. The team uses it to run most of the automated tests and builds. The bare Jenkins can be extended by numerous plugins [8] that can implement extra functionality. The plugins can provide support for various build tools, generate test reports in various formats, provide charts, implement pipelines, improve virtual machine management and much more. Currently, there are over 1500 Jenkins plugins avail• able in the official repository, but there are many more available in user repositories. The OpenJDK team uses some of these plugins, but they also have their own plugin implementations, that are hosted on GitHub.

3.1.1 Architecture

"A basic Jenkins installation operates as a standalone entity." [9]. The mas• ter node usually controls most of the general tasks like job execution, source polling, configuration management, jobs distribution and result processing. The number of executors is scalable by slave agent services that are run on other machines. The workload is then being distributed to other systems, depending on the master node configuration. That

6 3- TEST AUTOMATION TOOLING makes Jenkins a very extensible system; the computing capacity can be expanded by providing more resources, that are running slave agent, to the master node. The OpenJDK team is using a lot of different machines in its Jenkins instance. These include benchmarking machines and powerful ppc641e, aarch64, s390x and Intel 64 servers with hundreds of cores. Since most of the machines in the pool are different in terms of resources, each machine has a designated number of slots (1-n), so more jobs can run on them in parallel. These slots do not have assigned resources, so it is up to the administrator to ensure there are enough resources for all slots. To guarantee that only relevant jobs are run on their respective architectures, there is a label system. Each node can have labels that can be then used in job configuration. •


Figure 3.1: Master node with several additional machines that are used as slave nodes, split into several executor slots.
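To make the node and label layout visible outside the GUI, the master's JSON REST API can be queried. The sketch below simply fetches the node list; it is a minimal illustration that assumes an instance reachable at JENKINS_URL, omits authentication, and leaves JSON parsing to a real tool.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/*
 * Lists the nodes attached to a Jenkins master by querying the JSON REST API.
 * JENKINS_URL is an assumption; authentication is omitted for brevity.
 */
public class ListJenkinsNodes {
    public static void main(String[] args) throws Exception {
        String jenkinsUrl = System.getenv().getOrDefault("JENKINS_URL", "http://localhost:8080");
        // /computer/api/json describes every node, its executor slots and labels.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create(jenkinsUrl + "/computer/api/json")).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // A real tool would parse the JSON and extract displayName, numExecutors
        // and the assigned labels; printing it is enough for this illustration.
        System.out.println(response.body());
    }
}
```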

3.2 Beaker

Centralized labs that provide machines and VMs on demand are a good way to manage resource distribution in a big company. They help to maximize resource utilization, increase average core usage and reduce costs. In this chapter, we will introduce one of the tools that can be used to manage distributed systems as one big lab and to enhance automated testing.


"Beaker is open-source software for managing and automating labs of test computers. "[10] Using Beaker, users can manage and provision systems from multiple labs, that act as one automated inventory of systems. These machines can be used for various tasks scheduling and execution, and storing and viewing results.

3.2.1 Beaker usage

OpenJDK QE does not have its own instance of Beaker, but it intensively uses Red Hat's labs, which are accessible via Beaker. The times when their private machines alone were able to manage the load are long gone, and now there are scripts in place that automatically request and reserve Beaker machines, run Jenkins jobs on them and return the machines back to the Beaker pool once the execution of the tasks is done, while copying the relevant results and resources back to the local infrastructure.

The Beaker job is run as an ordinary Jenkins job that occupies one of the master executor slots, however it consumes almost no resources and instead delegates the task to Beaker, waits for the job to finish, uploads the results and ends. In this manner, hundreds of slots are opened on the master, allowing hundreds of jobs to run in parallel in Beaker.
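The delegation pattern described above can be pictured roughly as follows. BeakerClient and its methods are hypothetical wrappers around the beaker-client command line tool; the sketch only illustrates the submit-wait-fetch flow of such a thin Jenkins job.

```java
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

/*
 * Conceptual sketch of the "thin" Jenkins job described above: it only
 * delegates the work to Beaker, waits for the remote run to finish and
 * copies the results back. BeakerClient and its methods are hypothetical
 * wrappers, not a real API.
 */
public class BeakerDelegatingJob {

    interface BeakerClient {
        String submit(Path jobXml);               // returns a Beaker job id
        boolean isFinished(String jobId);
        void fetchResults(String jobId, Path target);
    }

    static void run(BeakerClient beaker, Path jobXml, Path resultDir) throws InterruptedException {
        String jobId = beaker.submit(jobXml);
        // The executor slot stays occupied, but consumes almost no resources:
        // the loop just sleeps and polls until Beaker reports completion.
        while (!beaker.isFinished(jobId)) {
            TimeUnit.MINUTES.sleep(5);
        }
        // Copy logs and test results back to the local infrastructure so that
        // they are archived with the Jenkins build.
        beaker.fetchResults(jobId, resultDir);
    }
}
```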



Figure 3.2: Artificial node that gathers all Jenkins jobs that are running in Beaker.

3.2.2 Beaker machine pool

Beaker can provide both hardware and virtual machines. Those can have very different specifications, so it is necessary to specify resource requirements for each job. Red Hat's instance of Beaker takes care of the installation and setup of the machines, which are then passed to the user.

The powerful machines in the labs can significantly shrink the execution time and handle resource-intensive tests that would not work properly in the OpenJDK QE infrastructure. It also allows the team to run long-running jobs that would otherwise block the executors, or RAM intensive suites testing garbage collection. From the additional slots perspective, it is a nice addition to the existing team infrastructure. It helps balance the peaks when a new release comes, which used to cause queues several days long. It also frees local machine slots to be used for Windows testing, as Red Hat's Beaker currently does not support Windows, speeding up the generic QA multiple times.


3.2.3 Downsides

There are some disadvantages related to the Beaker usage. OpenJDK tests are very sensitive to the system configuration, environment variables and any installed software. That makes some of the test suites very hard to deploy on these provisioned machines, despite aggressive purge and kickstart scripts. Therefore, for some of the test suites it is actually beneficial to keep them in our infrastructure, because our custom VMs are easier to control and always provide the same system, even though the test suite execution time can be longer and the wait for an executor can also take some time.

Another disadvantage is that the borrowed system is cleaned after the task is finished, and the VMs are usually destroyed. That brings a reproducibility issue, because it is nearly impossible to get the exact same system, even though the recipes are archived for some time. Nevertheless, these issues are very rare and usually can be reproduced in our infrastructure with some engineer intervention.

In the end, using Beaker, or any other machine provisioning system, is very beneficial and it can provide an enormous amount of resources that are useful especially during the peak times. The disadvantages are very small compared to the benefits and most of them can be avoided by carefully selecting the test suites.

4 Major issues of OpenJDK QE

4.1 Mixing results of various projects

The first major issue is testing multiple builds of the same major OpenJDK version (7, 8, 11, 14, X) with different build types in a single Jenkins job. That happened due to the rapid growth of the test matrix, which used to be really small; the testing used to be done by just a few hundred jobs, but the number of binaries and sources being tested has multiplied since then. And since the capabilities of the team were limited, they went with adding targets to existing jobs, because it was the path of least resistance. However, after years of growth, it proved to be a terrible approach, because the current matrix is covered by several thousand jobs.

This way, an example job called reproducers-security-ojdk11-rhel7-x64 is running the "reproducers security" test suite with all OpenJDK 11 builds, with no extra runtime configuration, on a RHEL 7 virtual machine. That means we mix the test results of at least 7 projects in one test job. It also means a single job tests two operating systems (current and future RHEL) and up to three build types (slowdebug, fastdebug and release). Even though that might have seemed like a good idea at the beginning, with the given growth, it led to messy and non-transparent results.

That causes several problems in interpreting the results:

• Different project builds have their own set of "known" failures, so it is hard to see regressions when the number of fails changes with each tested project change

• If there is a different type of failure (e.g. a change in the stack trace) of the same test in two consecutive runs, the current tools do not mark it as a regression, because only the name of the test is checked

• The current tooling does not provide an easy comparison of two NVRAs that have not been tested concurrently


• Fastdebug build results are mixed up with regular ones. That causes false regressions, as the tests often time out on fastdebug builds.

• Static and dynamic build results are mixed up in one test job

• Shenandoah and non-Shenandoah forest results were mixed in one test job, which required additional workarounds to handle failures due to unknown Java options

4.1.1 Sorting and comparing results

One approach to solving this problem is to sort the results from each job with a program or a script that would take the NVRA¹ and sort the results into several different views. That would enable the user to see separate results for each project (rpms x upstream x custom), build variant (slowdebug x fastdebug x release) and OS release (current RHEL vs. future RHEL). There is a big obstacle in the way, and that is the correct matching of the NVRA to each category. Even though the NVRA string gives us a lot of information, it is very difficult to process it automatically, because the name of the build is currently not deterministic and some of the necessary build information is not trivial to access.

1. NVRA - name, version, release, architecture - a naming convention for rpms that identifies each build and where it comes from.



Figure 4.1: Different builds originating from different forests mixed in one test job.

The name inconsistency is caused by the fact that the binaries are built from several different sources. The names of the builds then follow different conventions, depending on their origin:

• The upstream build name is generated from a tag, which can have many different forms, depending on the project tagging it. Different forests have different tag conventions; those depend on many factors, like the release type (general availability / early access), the maintainer of the forest or the development cycle.

• The custom builds are named by the engineer who creates and uploads the build, so it is up to them to decide how and whether they will follow the naming conventions. Any mistake (e.g. a typographic error) can prevent correct identification.


• The rpms are named by the build system, but e.g. minor versions depend on the build root that was used by the system or the user for building it. Sometimes it is impossible to say which minor version the build has just from the NVRA, so the build tag has to be checked, but the tag information is not trivial to get.

Those and other differences make this approach difficult to implement. The name matching could be solved by regular expressions, however creating these could get really tricky, and changing tags, or in general any change in naming, could instantly break it, meaning the name matching regex would often need changes.
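For illustration, a regex-based matcher could look roughly like the sketch below. The pattern only covers a simplified rpm-style name-version-release.arch form; upstream and custom names such as the ones shown in Figure 4.1 often do not match it at all, which is exactly the fragility described above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 * Illustration of the regex-based approach discussed above. The pattern is a
 * simplified assumption (name-version-release.arch, as used by rpm); the real
 * upstream and custom build names are far less regular, which is exactly why
 * this approach is fragile.
 */
public class NvraParser {

    record Nvra(String name, String version, String release, String arch) {}

    // e.g. "java-11-openjdk-11.0.6.10-0.el8.x86_64"
    private static final Pattern NVRA = Pattern.compile(
            "^(?<name>.+)-(?<version>[^-]+)-(?<release>[^-]+)\\.(?<arch>[^.]+)$");

    static Optional<Nvra> parse(String candidate) {
        Matcher m = NVRA.matcher(candidate);
        if (!m.matches()) {
            return Optional.empty(); // unexpected naming convention
        }
        return Optional.of(new Nvra(m.group("name"), m.group("version"),
                m.group("release"), m.group("arch")));
    }

    public static void main(String[] args) {
        System.out.println(parse("java-11-openjdk-11.0.6.10-0.el8.x86_64"));
        // Custom or upstream names often do not match at all:
        System.out.println(parse("jdk.11.0.6.10-20200110.0.updates.upstream"));
    }
}
```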

4.1.2 Separating jobs

Creating a separate job for each project, build variant, test variant, operating system and architecture could be another approach to this issue. It would result in a multiplied number of jobs and we would end up with a test job for each project and build variant. The transparency of the jobs and results would be much better and it should not bring any additional work for the team, since the number of results will remain the same, they will only be distributed across more test jobs. That also results in parallel execution of these jobs, so assuming there are enough executors available, we are able to test more binaries in a shorter amount of time.

One thing that could be considered a disadvantage is losing the quick comparison among multiple projects (upstream vs. rpms) and build configurations (fastdebug vs. release). This was slightly more visible in the current approach, but it was not working very well anyway, so there is not much to lose. However, this comparison is quite important and must be achieved differently.

4.2 Incomplete matrix of current jobs

The current state of the OpenJDK QE Jenkins job matrix is not ideal, because the jobs are being cloned manually, usually by using the sed² command on job directories, since there are too many of them to be cloned one by one. That causes redundancy in some cases and incomplete test sets in others. In the end, there are a lot of jobs, but no one knows whether everything is tested as it should be, and in every configuration we should be testing.

2. Sed is a stream editor used for basic text transformations on an input stream of a file or a pipeline.

4.2.1 Problems caused by manual cloning

Every time a new project, test suite, build variant or test variant is added, the jobs have to be cloned again. Currently, the new jobs are being added manually, either directly via the Jenkins GUI, or by mass cloning using sed on the directory tree.

This approach leaves a lot of space for human error during the cloning process. And with each iteration of cloning, the error can spread wider, leaving unnoticed gaps in the matrix. Also, mass maintenance of these jobs is a big problem, because there is no clear specification of what we have and what we want to have, and there are just too many jobs to check manually.

Manual cloning via Jenkins GUI

When the jobs are being cloned via the Jenkins GUI, given the amount of jobs, it takes an unreasonable amount of time, and there is a high probability that some of the jobs will be forgotten and will not be cloned. This breaks the test matrix and the error will spread with each subsequent cloning. Once a job is missing for a given OpenJDK version and configuration, it will also be missing in OpenJDK + 1.

There has already been an effort to fix the matrix and create a tool that checked whether any jobs were missing. It stabilized the situation for a while, but then it broke again, and with the growing number of test suites, platforms, projects, build variants and test variants, where the matrix is different for each OpenJDK, it became unmaintainable.
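A matrix-completeness check of the kind mentioned above can be sketched as follows; the dimensions, the job naming scheme and the JENKINS_HOME/jobs layout are illustrative assumptions rather than the team's real configuration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

/*
 * Sketch of a matrix-completeness check: build the set of expected job names
 * from the declared dimensions and subtract the job directories that actually
 * exist in JENKINS_HOME/jobs. The dimensions, the naming scheme and the path
 * are illustrative assumptions only.
 */
public class MissingJobsReport {
    public static void main(String[] args) throws IOException {
        List<String> jdks = List.of("ojdk8", "ojdk11");
        List<String> suites = List.of("tck", "jtreg", "reproducers-security");
        List<String> platforms = List.of("rhel7-x64", "rhel8-x64", "f31-x64");

        Set<String> expected = new HashSet<>();
        for (String suite : suites)
            for (String jdk : jdks)
                for (String platform : platforms)
                    expected.add(suite + "-" + jdk + "-" + platform);

        Set<String> existing = new HashSet<>();
        try (Stream<Path> dirs = Files.list(Path.of(System.getenv("JENKINS_HOME"), "jobs"))) {
            dirs.forEach(dir -> existing.add(dir.getFileName().toString()));
        }

        expected.removeAll(existing);
        expected.forEach(missing -> System.out.println("missing job: " + missing));
    }
}
```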

Mass cloning using sed

The mass cloning via the command line using the sed command is done on the Jenkins directory tree, where all configurations are copied with changed parameters. It is definitely a faster approach, but one mistake in the command could have a catastrophic impact on the testing process and cripple the team infrastructure until fixed.

It clones all jobs from the JDK - 1 matrix, so it is less likely to "forget" to clone a job. Unfortunately, if there is a mistake in the existing set of jobs we use for cloning, it will copy the mistake as well. And with one problem gone, another appears, because this approach can create jobs that do not make sense for the JDK + 1. So in the end, more useless jobs will appear that can never be executed, or worse, will be run even though there is no need for them. This wastes resources and engineering time, and does not completely solve the human error issue.

Since both of these approaches are far from reliable, there is a clear need for automated, template-based generation of test jobs. However, that is not a trivial task. There are several existing tools that target this issue. We will introduce a few of them in the following section to demonstrate the principle they work on.

4.3 Template job generation

Creating and maintaining hundreds of Jenkins jobs is usually too complicated to be done manually, especially when the job set changes over time. With that many jobs, it is also difficult to know what you already have, making each change a complex task and leaving a lot of space for human error.

The template job generation in Jenkins is a widely covered topic, with several existing implementations. The most common idea usually lies in one base job that is used to create more jobs. There are several ways to apply this approach.

The Jenkins Job DSL plugin [11] base job can be scripted to define simple jobs or entire pipelines of jobs. The DSL exposes methods that are used to configure the jobs. It supports many Jenkins plugins, so even the plugin configuration can be generated by these methods. In the end, a Job object is returned that can then be used or passed around. The job configurations can then be stored in a version control system.


The Job Generator plugin [12] is used to generate jobs or pipelines from parameterized templates. It is also implemented as a separate job that generates other jobs depending on the user input parameters - it basically fills a prepared template with user defined content. It also works with other Jenkins plugins that support expanding variables.

The Jenkins Job Builder [13] is a Python project that uses a different approach and generates the jobs from descriptions saved in YAML or JSON files. That makes the job descriptions human readable and easier to verify. It provides a flexible template system, making the creation of similar jobs easier.

Even though the mentioned tools provide various approaches to the issue, none of them fits our use case as well as we need it to. While all of the tools are able to generate the jobs, only the JJB can help us keep information about what is generated in a human readable form, while being able to add, modify and archive jobs. But with other requirements, like several templates for partial configuration that are then combined into a resulting job, the readability disappears, slowly converging into the idea of a simple front end that might be more user friendly than text files.

To sum up, the problems were very difficult to cover with the existing tools that were examined. However, they served well as an inspiration for a new project that was designed to specifically target the issue of the constantly changing test coverage matrix and the related platform, provider, task, test and build configurations and the resulting job generation - the OpenJDK builder, shortened to O-tool.

5 O-tool

The O-tool is a tool designed to serve two major purposes - to store information about the current test matrix in a human readable form, and to dynamically create, modify and archive Jenkins jobs based on the information stored in the matrix.

5.1 The O-tool idea

The motivation for this tool emerged as the team needs a highly customizable Jenkins job generator. The investigated existing options were always missing a chunk of functionality, and they would be harder to bend than it would be to reimplement the whole idea from scratch. The requirements for the tool are the following:

1. For each major version of OpenJDK there are several forks and several development branches (so called projects) to be built and tested

2. Each project, build variant and test variant must have separate test jobs

3. The test matrix must be clearly visible, customizable and extensible

4. Each test project requires a different set of binaries (existence of whitelists and blacklists is crucial)

5. Jenkins jobs are automatically generated and regenerated depending on the matrix

6. The tool elements follow strict naming conventions

7. Matrix can be declared and corresponding jobs can be generated

8. The O-tool knows what is built and tested where and can be extended to use this information for other purposes


5.2 Build and test matrix

As already mentioned, the test matrix for OpenJDK is massive. When we try to describe all possible configurations of the test jobs, we end up with a multidimensional matrix. However, the data can also be interpreted as a two dimensional table. With a simple script, I was able to get approximate numbers for our current test and build variants and to build a two dimensional model of our test matrix, because doing this manually would be nearly impossible.

The two dimensions of the matrix are set to be the build variant and the test variant. It was the most logical arrangement we came up with - it logically splits the set into two parts and also clearly states what is tested where. A build variant is a set of configurations that are used during the build time of the OpenJDK project. For our case, that is architecture, operating system, debug mode of the build and development kit type. Test variants include architecture, operating system, platform provider, test task, display protocol, garbage collector, crypto setup, agent and various other OpenJDK switch options. Together, they define the Jenkins job - what is tested, where it is tested and how it is tested.

The total number of combinations is enormous - if we considered every possible combination of the build and test elements we currently use, there would be about 650 000 columns and 520 000 rows in the two-dimensional model. That equals about 341 billion combinations. However, this maximal matrix contains a lot of combinations that do not make sense (e.g. testing Windows builds on Linux or testing binaries built on the Intel x86_64 architecture on s390x). If we exclude all invalid combinations, we are at approximately 150 000 test jobs.

That is still a huge number. But this matrix also contains a lot of combinations that will never be used. Even though such a build and test combination is not logically wrong (the test will execute successfully), it does not make sense to e.g. test installability or garbage collection with different Linux graphic environments, because no related functionality is actually touched by the test suite. The reason why the matrix should be free of these combinations is that most of them will produce identical output every time (they either all fail, or all pass). That is not necessarily a bad thing, however there would be consequences in terms of machine load, which would rise significantly, and the amount of results needing processing also grows linearly with the number of jobs. Therefore, it is important to scale it down to the smallest possible set that still ensures full test coverage. It is expected that there should be about 10 000 jobs.
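A minimal model of such a matrix and its pruning could look like the following sketch; the variant fields and the validity rules are example assumptions, not the real O-tool data model.

```java
import java.util.List;

/*
 * Minimal model of the two-dimensional matrix described above: build variants
 * on one axis, test variants on the other, with a filter that throws away the
 * combinations that can never work (the concrete rules here are only examples).
 */
public class TestMatrix {

    record BuildVariant(String os, String arch, String debugMode) {}
    record TestVariant(String os, String arch, String suite) {}

    // Example rules: a binary is only tested on the architecture it was built
    // for, and Windows builds are never tested on Linux (and vice versa).
    static boolean isValid(BuildVariant b, TestVariant t) {
        if (!b.arch().equals(t.arch())) return false;
        boolean buildIsWindows = b.os().startsWith("win");
        boolean testIsWindows = t.os().startsWith("win");
        return buildIsWindows == testIsWindows;
    }

    public static void main(String[] args) {
        List<BuildVariant> builds = List.of(
                new BuildVariant("rhel8", "x86_64", "release"),
                new BuildVariant("win2019", "x86_64", "release"),
                new BuildVariant("rhel8", "s390x", "fastdebug"));
        List<TestVariant> tests = List.of(
                new TestVariant("rhel8", "x86_64", "jtreg"),
                new TestVariant("rhel8", "s390x", "tck"),
                new TestVariant("win2019", "x86_64", "jtreg"));

        long jobs = builds.stream()
                .flatMap(b -> tests.stream().filter(t -> isValid(b, t)))
                .count();
        System.out.println("valid build/test combinations: " + jobs);
    }
}
```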

5.2.1 Matrix declaration

Once the matrix is shrunk to the smallest sensible set, it needs to be declared to match a test plan, because the base matrix covers every test combination that can be tested and makes sense to be tested. However, that does not mean that everything should be tested everywhere it makes sense, because other factors are in play - such as the low availability of resources (e.g. alternative architectures like s390x), limited resources for processing the results, or generally low interest in some combinations. Usually, the test plan covers only the configurations that are supported. In our case, we also need to include various upstream variants; we cannot possibly cover everything, so picking a viable set of test and build variants that will ensure sufficient test coverage is essential.

Build variants

• Debug mode
There are three base debug modes in OpenJDK - release, fastdebug and slowdebug. The release build is the most important one, because it is used in both production and development. It should be run with every viable combination to make sure it works, as it is probably the most used debug variant of an OpenJDK build.
The fastdebug build uses optimizations as release builds do, but it ships with debug information and uses asserts, making it slower than release builds. That means it is nearly unusable for performance testing, and it also reflects in test execution time. Huge test suites run days or even weeks with a fastdebug build, resulting in significant delays in daily regression testing and regular test timeouts. However, for small and fast test suites it is usually valuable, as it is more sensitive to failures than release builds.
The slowdebug build also uses asserts and ships with debug information, but it does not have any optimizations. That makes it extremely slow and basically useless for testing. In this case, only sanity tests like installability or buildability are a viable option.

• JVM
Another tested variant of the JDK builds is the JVM. There are several implementations of the JVM - the default for OpenJDK is HotSpot, but there is also OpenJ9¹, Substrate VM², Zero³ and others. The default JVM should always be covered by all test suites, as we are testing upstream reference implementations and also use it in most production builds, whereas the secondary JVMs like Zero we mostly want to test just for buildability and maybe with some key test suites.
Unfortunately, the variants do not end here, as there are also several build variants of the JVM, where the behavior, GCC or the sources are changed and the build results in a heavily modified JVM. In some cases, like Graal [14] (or Mandrel), only specialized suites apply, because the build creates another binary which, when run, uses the Substrate VM and its own implementation of GC. In this case, the standard tests are not used.
On the other hand, there are build variants like Java Flight Recorder [15] or Link Time Optimization [16] turned on/off that touch the VM functionality, but the testing coverage should be the same as for the classic variant, because it still produces the same binaries with additional functionality. The benchmarking importance grows in this case, as we need to make sure that new features do not have a damaging impact on the overall performance of the build.
Especially the latter variants result in a huge amount of new jobs. Even though the matrix should be full in most cases, it would take a big chunk of the available resources. So the most important test suites (e.g. regression suites), fast tests and benchmarks should be covered first, leaving the other suites in the "nice to have" category.

1. https://www.eclipse.org/openj9/
2. https://github.com/oracle/graal/tree/master/substratevm
3. https://openjdk.java.net/projects/zero/

• Build targets and package splitting
There are two basic build targets that can come out of an OpenJDK build - a JRE (Java Runtime Environment), which contains only the core classes, the JVM and supporting files that are enough for executing a Java application, and a JDK (Java Development Kit) - a software development environment that provides the JRE, a compiler, a documentation generator, an archiver and other tools necessary to both develop and run Java applications.
That gives us an idea of how the testing should be arranged - the JRE should be tested only by suites that use only the runtime, meaning only precompiled binaries can be used for testing. The SDK should be tested with everything we have, as it contains both the interpreter and the compiler.
There is one RPM specific case that needs to be included in the matrix - the JRE-headless. It is basically a headless JRE that needs to be tested on headless systems. Since it is a JRE, it can only test runtime specific (precompiled) suites, and as it is headless, it is expected to run only non-graphic tests.

Test variants

• Display protocol
The basic display protocol variants relevant for OpenJDK are Wayland and X11. A special case of this variant is headless build testing, which is done on headless systems that do not contain most of the libraries that support graphical interfaces.
From the nature of the variant, it is not viable to run any tests that do not contain graphic elements. Those can be various cryptographic tests, GC tests, performance tests and others.

• Garbage collector
The garbage collector's codebase is a part of the JDK, so in the case of multiple GC testing, there should be a test variant for each GC in most test suites. However, there are tests that are not GC dependent. Those can be various integration tests that only install the software, check the contents of the binaries and the permissions of files, and test other interactions of the binary with the system. Even though passing a GC option does not break these test suites, the GC never gets invoked, so the results would be duplicated with every GC variant, wasting both human and computing resources.

• Crypto setup
The crypto setup testing is mostly related to modifying the java.security configuration file, which, depending on the settings, can forbid or hide some security algorithms. The purpose of the testing is to make sure the OpenJDK API stays consistent during the changes, and that there is an implementation of all provided algorithms. In this case, it is key to run suites that are using crypto libraries - e.g. cryptographic test suites and frameworks, the specification tests and regression testing (a small sketch of such a check follows after this list). On the other hand, it is not viable to use most GC specific test suites, graphical tests or benchmarks.

• JFR
Flight Recorder is a low-overhead data collection framework that is used to troubleshoot the HotSpot JVM and Java applications. It is implemented as a part of OpenJDK itself and can be turned on/off at build time and at run time.
In this case, the emphasis is primarily on performance testing, however most standard test suites should be run with it, possibly excluding the GC specialized test suites that usually run for several days and would bring very little value compared to the resource usage.
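As mentioned in the crypto setup item above, a check of this kind essentially asks the JDK which algorithms it still advertises after the configuration change. A minimal sketch, with example algorithm names, might look like this:

```java
import java.security.Security;
import java.util.Set;

/*
 * Sketch of what a crypto-setup check boils down to: after the java.security
 * configuration (or a system crypto policy) is changed, verify that the JDK
 * still advertises a consistent set of algorithms. The asserted algorithm
 * names are examples only.
 */
public class CryptoSetupCheck {
    public static void main(String[] args) {
        Set<String> ciphers = Security.getAlgorithms("Cipher");
        Set<String> signatures = Security.getAlgorithms("Signature");

        System.out.println("Ciphers:    " + ciphers);
        System.out.println("Signatures: " + signatures);

        // A disabled or hidden algorithm should disappear from the lists,
        // while the ones required by the test plan must stay present.
        if (!ciphers.contains("AES")) {
            throw new AssertionError("AES cipher is not provided by this build");
        }
    }
}
```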

Static vs. dynamic builds

There are two different kinds of builds that are produced and supported - static builds and dynamic builds. The static builds are linked against libraries that are copied into the source code, whereas the dynamic builds are linked against system libraries, so the built binary expects the libraries to stay the same and the links to point in the right direction.

That is a key difference, because the dynamic builds are only supported and tested on the systems where they have been built. In contrast, the static builds are expected to be forward compatible, meaning that they must be tested on the system where they were built and on all newer versions of the OS. As that is quite a lot of jobs, we should be testing at least the oldest system build everywhere, while the others should only be tested for buildability, keeping in mind that the resource pool is limited.

Compatible architectures (multilib)

Even though the 32-bit architectures are being dropped and are mostly not supported, there are some cases when 32-bit builds can be used, usually to run 32-bit applications on a 64-bit system. Until the support is completely dropped, multilib needs to be tested at least on the Intel architecture. As many tests are not expected to be run with 32-bit binaries and would need to be adapted, only the most important suites or compatible tests should be run with multilib.

Other limitations

Some of the test variants might be features that are not supported on all architectures or operating systems (e.g. Shenandoah is only supported on Intel and Arm architectures). In this case, the tests must be restricted to the supported platforms only, further lowering the number of the matrix elements.


These are only examples of already existing elements. The tool can be customized to contain more test and build variants if new requirements arise. In that case, all the jobs will be renamed once the variant is added and the configuration files are updated and regenerated, making mass maintenance simple and transparent.

The ultimate goal is to generate this matrix automatically. Even though the demands are more or less clearly defined, unless the resources are unlimited, it will be very difficult to do it without any human interference, as the job generation and processing order needs prioritization, which is not always clear, or deterministic.

5.3 Job generation

Once the matrix is settled and it is decided what should be built and tested, we need to handle the automatic generation of the jobs. The process itself has several steps.

The source of most of the information are configuration files that are already present on the server. These files define build providers (where the packages, binaries and sources come from), platforms (architecture and operating system), Java platforms, test variants (garbage collectors, display protocols, crypto setup), build variants (debug mode, JVM) and the test jobs. These configurations are combined in the JDK projects and JDK test projects. Since the matrix is way too wide to be used as a model, the combinations are displayed in a tree structure that is created by the user. This way, hundreds or thousands of jobs can be created for each project.

A project represents a separate source of the tested product. That can be a server providing pre-built packages or binaries, but also an upstream repository containing the sources of OpenJDK. But it always contains only one forest of OpenJDK.

A JDK project represents a scenario that starts with sources, which are first built and then tested. So each branch of the tree represents a chain of a build of OpenJDK and the following test of the resulting binary. This chain results in two Jenkins jobs - a build job and a test job.


In Figure 5.1, you can see a graphic visualization of an example JDK project that will result in two separate jobs. A build job is specified by the first half of the tree, where we declare a build platform (OS and architecture), a build provider (VM or Beaker) and the build variants listed in Section 5.2.1. The following test job then uses the result of this build job. That test job is defined by the other half of the tree. It has another declaration of a platform (OS and architecture), because we can build binaries on one system (e.g. RHEL 6) and test them on another (e.g. RHEL 7), assuming they are compatible. Then it also needs a provider (VM or Beaker), a task (test suite) and the test variants listed in Section 5.2.1.

A JDK test project represents only one test job - it is useful in cases when the binaries are already built by external entities and we just pull and test them. In this case, each branch represents only one Jenkins test job, because the binaries are already pre-built. A graphic representation of a JDK test project can be seen in Figure 5.2. It looks similar to a JDK project, but it is missing the build part, because it reuses already built binaries. However, we need to specify a platform (OS and architecture), the debug mode and the JRE/SDK of the sources that should be downloaded. The test job part of the tree is identical with the JDK project.

Once the paths are properly set up and saved, the configuration files are merged together for each branch of the tree to create a job template in the form of an .xml file, and the tree is transformed into a list of jobs. Another list of jobs is created from the existing objects. The lists are then compared to remove the unchanged elements. The changed jobs are then either archived, changed, revived or generated, depending on their previous state. That is done by generating a build.xml file that is saved in the generated job directories in the Jenkins home. The build.xml file is then passed to Jenkins via the CLI API and the configuration from the disk is reloaded to reflect the changes in the jobs.
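The reconciliation of the wanted job list with the existing one can be sketched as follows; the Job record and the action names are simplified assumptions, not the actual O-tool implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/*
 * Sketch of the reconciliation step described above: the tree of declared
 * projects is flattened into a list of wanted jobs, compared with the jobs
 * that already exist, and each job is then created, updated or archived.
 * The Job type and the action names are simplified assumptions.
 */
public class JobReconciler {

    record Job(String name, String configXml) {}

    enum Action { CREATE, UPDATE, ARCHIVE, KEEP }

    static Map<String, Action> plan(Set<Job> wanted, Set<Job> existing) {
        Map<String, Job> existingByName = new HashMap<>();
        existing.forEach(j -> existingByName.put(j.name(), j));

        Map<String, Action> actions = new HashMap<>();
        for (Job job : wanted) {
            Job current = existingByName.remove(job.name());
            if (current == null) {
                actions.put(job.name(), Action.CREATE);
            } else if (!current.configXml().equals(job.configXml())) {
                actions.put(job.name(), Action.UPDATE);   // regenerate the config
            } else {
                actions.put(job.name(), Action.KEEP);     // unchanged, skip it
            }
        }
        // Anything left over is no longer declared in the matrix.
        existingByName.keySet().forEach(name -> actions.put(name, Action.ARCHIVE));
        return actions;
    }
}
```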


Figure 5.1: JDK project representation and configuration.


Figure 5.2: JDK test project representation and configuration.


5.4 Implementation

The OpenJDK builder is currently being developed and it is also deployed on the production server. It is done as an extension of an existing tool called Fake Koji⁴, which used to serve as a storage for builds, logs and source snapshots. Now it serves multiple purposes, creates a nice solution for the two issues mentioned above and provides an extensible API to mass-manage Jenkins jobs.

4. https://github.com/judovana/jenkins-scm-koji-plugin

6 Common issues and handling

In this part of the thesis, we will mention other issues that occur during the test runs and we will try to find ways to solve or process them automatically, without any interference from the engineers, because from what I have seen, it can easily take several days before an engineer actually notices this kind of failure. This could be significantly improved by proper reactions that either solve the issue, or notify the engineer that there is an issue.

The ideal situation would be if the failures could be resolved automatically. That is possible for most of the simple issues that have deterministic reactions - e.g. a VM that fails to start is always solved by rescheduling. If the reaction is easy to implement, the issue could be instantly solved or prevented without any human intervention.

When the solution is not trivial and requires human attention, it is absolutely crucial to give the engineer as much relevant information as possible about the failure. Usually, if the engineers know what is wrong, where it broke and ideally get a recommendation about how to fix it, it is easier for them to estimate how much time it will take, and it is also more likely that they will not put it off and will fix it immediately instead. However, picking out only the relevant information is tricky, because if we overwhelm the engineer with a lot of data, we will probably end up right where we started.

6.1 Hanging Jenkins jobs

A stuck Jenkins job is an occasional troublemaker. It is not a common problem, however, when it happens, usually due to a stuck process or a disconnected executor, the Jenkins job does not fail properly and continues to run indefinitely. These zombie jobs are harmless, but they are also useless and block executor slots (and computing resources). They also block their respective jobs, so other builds pile up in the queue waiting to be run. That is, until someone kills the hanging jobs manually.

Beaker jobs have an extra limitation, namely that all the machines are lent for a maximum duration of 5 days (a custom script configuration). The jobs can finish earlier; in that case, the machine is returned once the test run is finished. But after 5 days, the machine is automatically returned to Beaker and all running code is killed. In this case, the Jenkins job should be stopped and finished with a failure status. That does not always happen and it can also lead to a hanging job.

The average execution time of OpenJDK test suites can be obtained from the Jenkins build time trend. It varies from 10-30 minutes for some sanity tests, up to 3-5 days for long jobs that are usually testing garbage collection or compiling and running huge amounts of code. But generally, we can identify a time frame needed for each test suite, sometimes even for each OpenJDK major version.
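Such a per-suite time frame can be turned into a timeout value mechanically, for example by allowing a generous multiple of the longest successful run, capped by the 5 day Beaker lease. The sketch below uses arbitrary example factors:

```java
import java.time.Duration;
import java.util.List;

/*
 * Sketch of how a per-suite timeout could be derived from the build time
 * trend: take the historical durations of successful runs and allow a
 * generous multiple of the typical value. The factor and the cap are
 * arbitrary example values, not a recommendation from the thesis.
 */
public class TimeoutEstimator {

    static Duration suggestTimeout(List<Duration> successfulRuns) {
        Duration longest = successfulRuns.stream()
                .max(Duration::compareTo)
                .orElse(Duration.ofHours(4));          // fallback for new suites
        Duration proposal = longest.multipliedBy(2);   // tolerate some variance
        Duration cap = Duration.ofDays(5);             // Beaker lends machines for 5 days at most
        return proposal.compareTo(cap) < 0 ? proposal : cap;
    }

    public static void main(String[] args) {
        List<Duration> jcstressRuns = List.of(
                Duration.ofHours(27), Duration.ofHours(43), Duration.ofHours(30));
        System.out.println("suggested timeout: " + suggestTimeout(jcstressRuns));
    }
}
```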


Figure 6.1: A stable test suite build trend

Figure 6.1 shows an example of a stable test suite: a jcstress job for OpenJDK 11, tested on the Fedora 30 operating system. This test suite is run in Beaker, which means it might run on a different machine every time.


However, from the trend, we can observe that the jcstress run usually takes from one to two days. You can also see some failed test runs, which suggest that the test probably got stuck somewhere during the execution and was automatically killed after 5 days, when the Beaker machine was returned. This could have been avoided and the test could have been rescheduled much earlier.


Figure 6.2: Unstable test suite with manually shut down hanging jobs.

Figure 6.2 shows an example of an unstable test suite. Notice that there is an aborted 15-day-long test execution. That job run was killed manually, after someone noticed a hanging job in the Jenkins GUI. So if we consider an average execution time of two to five days, it used almost triple the amount of time and resources to produce nothing. If a test suite has serious issues and its jobs hang regularly, as in Figure 6.2, it should definitely be cared for: it can be fixed or updated to a more stable version. However, if the issues are rather rare, it is usually not a faulty test, and there should be an automated way to kill and reschedule these occasionally hanging jobs.

6.1.1 Aborting of long-running jobs

There is an existing Jenkins Build Timeout plugin [17] that can handle these issues. It allows a job to be automatically aborted if the build reaches a set timeout value. The plugin behavior is configured for each job separately, so we can set a different timeout value for every test suite run; we can also set a different value for each OpenJDK version (that could be very helpful e.g. in the case of the JTREG test suite, where the execution time is very different for each JDK).


The timeout option can be applied to any Jenkins build step. The OpenJDK team currently uses only one build step, execute shell, so the adoption of the plugin would be quite simple.


Figure 6.3: Current build step that starts the executor and runs the test suite.

With the Build Timeout plugin, the build step gains more options. We set the timeout to the desired value (e.g. two days), and we set the action that should be performed in case the timeout is exceeded: either fail the build, or abort it.


Figure 6.4: Build Timeout plugin version of execute shell.

It is a very small change that significantly improves the current situation with long-running jobs. And since it is a job configuration entry, it is only a change in the job's configuration .xml file, which can easily be done by extending the job template in the O-tool with the average execution time for each test suite, followed by a regeneration of all affected configuration files.
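To illustrate, below is a minimal sketch of the fragment such a template could emit into the generated config.xml. The element names are assumed from the usual Build Timeout plugin layout and should be verified against the deployed plugin version; the two-day default is also an assumption.

# Sketch: emit a Build Timeout wrapper into a generated job config.xml.
# The XML element names below are an assumption about the build-timeout
# plugin's config.xml layout; TIMEOUT_MINUTES would be filled in by the
# O-tool job template per test suite.
TIMEOUT_MINUTES=${TIMEOUT_MINUTES:-2880}   # e.g. two days
cat <<EOF
<buildWrappers>
  <hudson.plugins.build__timeout.BuildTimeoutWrapper>
    <strategy class="hudson.plugins.build_timeout.impl.AbsoluteTimeOutStrategy">
      <timeoutMinutes>${TIMEOUT_MINUTES}</timeoutMinutes>
    </strategy>
    <operationList>
      <hudson.plugins.build__timeout.operations.AbortOperation/>
    </operationList>
  </hudson.plugins.build__timeout.BuildTimeoutWrapper>
</buildWrappers>
EOF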


6.1.2 Implied consequences

There is a consequence that must be taken into consideration when using this plugin: the possibility of hidden regressions. That is very likely to happen if a regression prolongs the test suite execution, so the run could be aborted before it finishes. In the case of a regular error that occurs every time, this would be discovered fairly quickly, because the test suite would never finish and someone would eventually start investigating the issue. However, if the test suite catches e.g. a race condition1 that does not occur every time, it could easily be mistaken for a "general fail". Giving the test suite a wider timeout window should help in these cases, but it should not be multiplied more than once, because if the job runs several times longer than usual, it is most likely a stuck thread or process that will never finish.

1. A race condition usually occurs when processes or threads depend on a shared state; it is a situation when a device or system attempts to perform multiple operations on the same resource at the same time.

6.2 Product download issues

The team tests OpenJDK binaries and sources that are usually provided as tarballs (tar.gz / tar.xz), zips in the case of Windows builds, or RPM installation packages for RHEL, Fedora and EPEL. These are downloaded from Mercurial repositories, Koji2, Brew3 and various internal repositories or storage. There are two points of failure in this phase of testing: either the remote server does not act as expected, or there can be issues on the local server. We will describe the two problems that are most likely to be encountered.

2. Koji is the Fedora RPM build system. It is used by packagers to create Fedora package builds for various architectures.
3. Brew is the Red Hat RPM build system.

6.2.1 Inaccessible remote server

This is an occasional issue that occurs at the very early stage of the job. Since a lot of builds are downloaded from remote repositories or servers, network issues cannot be prevented, because we usually cannot anticipate when the remote sources won't be available or when the network will fail. The only resolution for this issue is rescheduling the failed job. Since it can take an engineer up to several days to notice this failure, it should be detected automatically. However, the automatic rescheduling should not be careless, because if the remote server is not accessible for several hours or more, it could result in tens or even hundreds of failed job runs, which is definitely not what we want to see. A reasonable algorithm to deal with these issues should be in place. There are several ways to approach the rescheduling of failing jobs:

1. Fixed time intervals
We can try to reschedule the job at regular intervals that are wide enough to give the server time to recover, e.g. every one or two hours. It is the easiest option to implement; however, in the case of a long unavailability, it does not really help and still produces a lot of failures.

2. Graded time intervals
A slightly improved approach is to increase the length of the gap between retries. It is better in the case of short outages, as the initial retry interval is short, but as the interval grows with each reschedule, it also reduces the number of rescheduled jobs in the long run.

3. Limited number of tries
Adding a fixed number of tries could reduce the number of failed jobs significantly; however, it raises the question of if and when the engineer notices the failures and reschedules the job manually once the server becomes available again. On the other hand, it protects the Jenkins job from filling up with failed retries, especially if the servers are down for a longer period of time. A combination of the last two approaches is sketched after this list.
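A minimal sketch combining graded intervals with a limited number of tries is shown below; the download URL, the initial one-hour gap and the limit of five tries are illustrative assumptions, not the team's actual script or values.

#!/bin/bash
# Sketch: retry a download with graded (doubling) intervals and a limited
# number of tries before giving up and failing the job.
# PRODUCT_URL and MAX_TRIES are illustrative; real values would come from
# the job configuration.
PRODUCT_URL="${PRODUCT_URL:?set the tested build URL}"
MAX_TRIES=5
wait_seconds=3600          # start with a one-hour gap

for (( try = 1; try <= MAX_TRIES; try++ )); do
    if curl --fail --silent --show-error -O "$PRODUCT_URL"; then
        echo "Download succeeded on try $try"
        exit 0
    fi
    echo "Try $try/$MAX_TRIES failed, waiting $wait_seconds s before retrying"
    sleep "$wait_seconds"
    wait_seconds=$(( wait_seconds * 2 ))   # grade the interval
done

echo "Remote server still unreachable after $MAX_TRIES tries" >&2
exit 1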

6.2.2 Insufficient executor storage

Once in a while, there is a situation where executors do not have enough space on disk. It does not happen very often; however, when it does, it usually means dozens or hundreds of failed jobs, depending on when someone notices it, because the slots on the executor keep filling up with new jobs that fail when they try to download the tested OpenJDK.

Failure prevention

In an ideal world, we would like to prevent job failures altogether. That means detecting executors that are running out of disk space and shutting them down in advance. We can either use the monitoring already implemented in Jenkins, or some external tools.

» Jenkins preventive node monitoring

There is a basic node check functionality present in Jenkins that is able to monitor free space on the executors, and if a given threshold is reached, it marks the affected executor as offline, not allowing the scheduler to use it for any other jobs. If configured properly, it can prevent most of the failures, as the executor should be offline before problems occur. It also allows the other running jobs to finish while the executor is being marked as offline. However, this should be more of an insurance or a last resort solution, because we want the executors to be available as much as possible.


Figure 6.5: Example configuration for node monitoring (the built-in checks include Architecture, Clock Difference, Free Disk Space, Free Swap Space, Free Temp Space and Response Time, with configurable free space thresholds).

The big advantage of this feature is that it is built into Jenkins itself; it does not need any extra services or software and does not affect the executors. It is also a fully automated soft off-switch: it marks the executor offline, allowing the running jobs to finish as usual, and it does not require any human interaction. That is extremely valuable especially during holidays and weekends, when engineers are not present to do it themselves. A disadvantage is that the shutdown of the executor can go unnoticed and it may stay marked offline for a long period of time, losing the resources of the machine. It also gives no prior warning to the engineers about the declining amount of free space. In general, Jenkins provides no information about the health of the executor. But that can be improved by using existing monitoring Jenkins extensions like Java Melody4.

4. https://wiki.jenkins.io/display/jenkins/monitoring


» External monitoring tools

Another way to monitor the health of executors is to use external monitoring tools that are deployed on each executor separately (e.g. the open source monitoring tool NetData5). These tools are able to watch various metrics, like RAM and CPU usage, network traffic, disk space and much more. They are then able to send notifications to the users via several channels like email or chat. These notifications can be very useful, because they notify engineers instantly and, if configured properly, they can be a good preventive tool against disk filling. They are also able to display various metrics over time, so they provide extra information about the events on the server. A big disadvantage is that this usually requires a person to evaluate the data and react to it properly. So if no one is there to respond, it solves nothing. It is also separately running software that takes some resources from the system, making it unusable for benchmarking machines, because it could easily affect the results. In general, both the internal and external monitors of the nodes have good arguments for using them, and in most cases it can be beneficial to use both. The external tools can be a good first line of defense and trigger preventive actions, while the Jenkins preventive monitoring makes sure that disk filling beyond acceptable limits marks the executor offline to avoid failed builds.

Post task disk space cleanup

We should try to prevent the filling of the disk space, if possible. Currently, there is a script in the process that cleans up the executor workspace prior to the test run. The workspace cannot be cleaned in the script execution phase, because there is a post task running the Jenkins archiver plugin that needs the results to be present in the workspace so it can archive the artifacts. However, the workspace could be cleaned after the results are archived.

5. https://github.com/netdata/netdata


» Workspace Cleanup plugin

There is an existing plugin that handles the workspace cleanup. The cleanup can be executed either as a build wrapper, which runs before the build starts, or as a post build step, which runs after the build is done. Order matters with the post build steps. There are several post tasks to be run: destroying the VM, artifact archiving, JTREG report generation, Textfinder plugin execution and more. We need to remove the workspace after all of these are done, to be sure that we do not delete the results before they can be saved and processed. One of the reasons why we have to clean up the space after the script runs and not before is that the tested product is downloaded at the very first stage of the job (when the SCM plugin detects a change and starts pulling sources or binaries from the remote server). So along with the workspace, all the downloaded content would be gone. The second reason is that we usually need the biggest amount of free space when the job starts, for the downloaded bits, the VM and other files. So if there is not enough space, the job fails before the cleanup task even starts.

Executor cleanup

The basic executor cleanup is quite a trivial task and it should be automated. The most usual causes of a full hard drive are big job workspaces and leftovers of old VMs or improperly destroyed VM clones. Both are easily accessible and identifiable. This cleanup can be done in several different ways:

1. Cleanup of an offline executor
This is the easier of the two approaches, because once the executor is off, we can delete all VMs and workspaces, since they cannot be in use (a combined cleanup sketch is shown after this list). The following command gives us a nice view of all VMs registered on the system:

$ vagrant global-status


The output provides us with IDs that can be used to kill all the VMs with the vagrant destroy ID command.

id       name    provider state   directory
d64666c  default libvirt  running /tmp/jtreg...
88cc8a0  default libvirt  running /tmp/jcstress...
98ee0ca  default libvirt  running /tmp/jsf...

Another step is to remove all old workspaces by:

$ rm -rf /mnt/raid/jenkins/workspace/*

The tricky part here is turning off the executor. Because there are multiple slots on each executor, it is supposed to be shut down only after all slots are empty. With day-long test executions, one running slot can block several more while waiting for the shutdown. It is usually better to kill the running jobs, turn off the executor and clean it up. But this is something that should not be done automatically, without an engineer knowing about it. That makes this approach unusable for regular cleaning, but it could be useful for an occasional cleanup (e.g. when the machine is unused, updated, rebooted, or the master is shut down). It could also be used after the executor is automatically marked offline because it was running out of disk space.

2. Cleanup of an online executor
This option is much more flexible, because it does not require the executor to be shut down. But it is also trickier, because when cleaning up workspaces and VMs, we must be able to identify which jobs are currently running on the executor, so that we do not remove their VM and workspace, and there is no easy way to do this via the Jenkins API. Nevertheless, this approach fits the idea of regular and on-demand cleaning, because it does not require shutting down the slaves, so other jobs can keep running on the executor while it is being cleaned. That could be helpful when e.g. some monitoring determines that the amount of free space is hitting a pre-configured limit, and it could even trigger an automated cleanup.
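A minimal sketch of the offline-executor variant follows, assuming the Vagrant/libvirt setup and the workspace path from the examples above; it is meant to be run only once the executor is marked offline and all slots are empty.

#!/bin/bash
# Sketch: clean up an executor that has been taken offline.
# Assumes the VMs are managed by Vagrant with the libvirt provider and the
# Jenkins workspaces live under /mnt/raid/jenkins/workspace, as above.
set -eu

# Destroy every VM known to Vagrant on this machine; --prune drops stale
# entries, and the awk filter keeps only rows whose provider column is libvirt.
vagrant global-status --prune \
    | awk '$3 == "libvirt" { print $1 }' \
    | while read -r vm_id; do
          vagrant destroy -f "$vm_id"
      done

# Remove all leftover job workspaces.
rm -rf /mnt/raid/jenkins/workspace/*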


Implementing either option should significantly reduce the current issues with executor disk space. With properly configured Jenkins nodes, the executors should be shut down before jobs start to fail, reducing the number of jobs failed due to insufficient disk space to zero.

6.3 Known test failures

While analyzing the results of various test suites, I found out that there are hundreds of test failures that are being ignored on a regular basis. These include various timeout issues, improper VM handling, irregular setup errors and other test issues, but also more or less regularly failing test cases. According to the team, most of the regular failures were briefly looked over and declared broken tests. However, in many cases, there is no one who has the time to fix them. This way, the broken tests keep piling up and it is difficult to keep track, making it easier to overlook new failures. We can divide these failures into several categories, depending on the reason why the tests are failing.

6.3.1 Old tests with compatibility issues

There are many old tests that are incompatible with new versions of OpenJDK, especially after project Jigsaw was introduced. The modularity fixed a lot of exploits that are no longer possible, but it also brought significant changes, so some of the tests for OpenJDK 8 end with compilation errors or runtime issues. These tests are usually quite complex, so fixing a test requires a lot of effort and digging into the bugs it covers. Also, some of the bugs are very old, so they might be inaccessible to the public, and the binaries affected by the bugs are often unavailable. This makes fixing the test even harder, because you need to make sure that you did not break the test, and often you have no build to test it on. Even though the QE is slowly working on fixing these issues, it clearly needs more effort and people to fix them all, because new failures appear with every release. I would consider involving OpenJDK engineers in fixing these issues, because they work on OpenJDK and it should take them significantly less time to fix them, especially if they were the ones who wrote the tests. From what I have seen, engineers usually write these internal tests due to security embargoes: very few people have access to the patches that are essential to understand what the problem is and what the test should look like. Since every push and every test can be tracked to its author, they should be contacted if their test is failing and QE is unable to fix it in a reasonable amount of time.

6.3.2 Testcase timeouts

Another known issue that appears frequently in the results are timeouting tests. I found several test suites containing test cases that were more or less regularly timing out. The timeouts are usually set for each test or test suite, to make sure the tests do not run forever. If a test starts to time out, it can mean many different things: a performance issue, lack of resources, a stuck process or a broken test. Lack of resources is the only issue that can be prevented and does not require human resources. There are two possible scenarios: either the test is running in a VM, or in Beaker. When the test is timing out in the VM due to insufficient performance, there is not much we can do. The testing VMs have a fixed RAM size and a fixed number of processor cores. So the sensible thing to do here is to move the timeouting test suites to Beaker, where we can get more powerful machines, or to extend the timeout limit, to be able to finish the test with what we have. With Beaker, it gets much more interesting. The Red Hat Beaker pool counts several thousands of machines with various configurations. So we can get a basic machine with a few gigabytes of RAM and a few cores, or a powerful one with hundreds of gigabytes of RAM and hundreds of cores. Beaker just gives you whatever is available. The Beaker job XMLs are highly customizable [18], as it is expected that users will have requirements for the assigned machines. In our case, the file is created automatically by a shell script, so it can be templated with environment variables that can be declared for each suite and architecture, while using default values for suites that do not have any resource requirements. These limit values should be as small as possible, to get the biggest possible pool of machines.

Listing 6.1: Part of a script generating the Beaker job XML, specifying host requirements

$(
# Reconstructed fragment: the heredoc bodies are indicative only; the exact
# element names follow the Beaker job XML schema [18].
if [ -n "${model:-}" ] ; then
cat <<- EOF2
<cpu><model op="=" value="${model}"/></cpu>
EOF2
fi
if [ -n "${min_processors:-}" ] ; then
cat <<- EOF2
<cpu_count op="&gt;=" value="${min_processors}"/>
EOF2
fi
if [ -n "${min_memory:-}" ] ; then
cat <<- EOF2
<memory op="&gt;=" value="${min_memory}"/>
EOF2
fi
)

The hostRequires element allows the user to specify the resources needed for the test to run smoothly. In the example above, you can see a part of a script that generates a Beaker recipe with an optional CPU model, a minimal number of processors and a minimal amount of RAM. There are many more elements in the .xml template [18] that can be used to specify the system requirements. Using more powerful Beaker machines should help with the timeouting issues and, as a nice side effect, it can shorten the execution time. The values can be further adjusted per architecture, as some of the secondary architectures might not be as available as x86_64, or their testing could require fewer resources for a successful run.

6.3.3 Dependency issues

This is an RPM specific issue that happens when some other package needs a buildroot override. When the OpenJDK packages are built while the buildroot is overridden, they are built with dependencies that are not released yet. This results in installation errors, when the OpenJDK RPM requires packages that are not in the repositories yet. Because the packages cannot be installed properly, around 99% of the tests fail. There is currently only a single framework that does not require the dependencies to be satisfied in order to work, because it does not install the binaries, it only unpacks them. Even though the tests fail early during the test execution (approximately within 10-20 minutes), it wastes the resources, blocks the slots and reserves machines that could have been used for other tasks. In most cases, the packages with broken dependencies will be rebuilt with a clean buildroot anyway, because the release usually cannot wait for the release of other packages, so it is also useless to test them. To solve this problem, some form of mass disable of the broken NVRA should be introduced. The current infrastructure of OpenJDK QE uses a text file called "processed.txt" to store the NVRAs of already tested builds. It should be possible to extract all affected jobs from the O-tool, so a script or an API that is able to write the broken NVRA to all relevant files could be a rather quick and effective fix for the issue.
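A minimal sketch of such a mass disable follows. The jobs directory, the way the affected job list is obtained from the O-tool and the O-tool URL are placeholder assumptions; only the "processed.txt" convention comes from the description above.

#!/bin/bash
# Sketch: mass-disable a broken NVRA by marking it as already processed in
# every affected job, so the jobs do not waste machines on it.
set -eu

BROKEN_NVRA="$1"
JENKINS_JOBS_DIR="${JENKINS_JOBS_DIR:-/var/lib/jenkins/jobs}"   # assumed path

# Hypothetical helper that asks the O-tool which jobs test the given NVRA;
# a real implementation would query the actual O-tool API.
list_affected_jobs() {
    curl -s "http://otool.example.com/api/jobs?nvra=$1"
}

list_affected_jobs "$BROKEN_NVRA" | while read -r job; do
    processed="$JENKINS_JOBS_DIR/$job/processed.txt"
    # Append only if the NVRA is not recorded yet, so reruns stay idempotent.
    if [ -f "$processed" ] && ! grep -qxF "$BROKEN_NVRA" "$processed"; then
        echo "$BROKEN_NVRA" >> "$processed"
        echo "Disabled $BROKEN_NVRA in $job"
    fi
done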

6.3.4 False negatives of unmaintained test suites

OpenJDK QE runs several tens of test suites for each OpenJDK version, project, build variant and test variant. The number of tests in each test suite varies: smaller test suites like TPS6 have dozens of tests, mid-sized applications like Wildfly7 usually have a few thousand tests, and the biggest test suites, like JCK, can have hundreds of thousands of tests. Unfortunately, not all those test suites are properly maintained, so there are tests that more or less regularly fail. There is an effort to fix these issues; however, not all of them are in our power to fix. We can divide the test suites into three categories.

Internal test suites

The QE team, along with developers, is co-creating and co-maintaining internal test suites that keep growing. These are usually specific to a given environment, so they do not have much use upstream. It is a mix of tests for known (fixed) bugs, usual troublemakers, reproducers created by other teams, tests made as a reaction to reports from the customers, or embargoed security reproducers that test vulnerabilities fixed in previous versions of OpenJDK. These are in general well maintained; there is about 1% of failing tests that should be fixed. However, even one failing test case means that the results need to be checked to see if the error or exception remains the same. The most tempting approach to get rid of these fake failures is an exclusion of the broken tests; however, this is more of a workaround, and it also reduces the number of tests, which is not the goal of this effort.

Open source test suites and frameworks

OpenJDK QE is using and contributing to several open source test suites, benchmarks, frameworks and tools for OpenJDK testing. In general, any Java project can be used to test OpenJDK; however, it is important to pick well maintained projects that are easily configured and whose runs can be automated. These test suites have their source code available and definitely can be fixed: issues can mostly be solved by opening a PR or an issue on GitHub, or with a custom patch applied on the sources. However, the human resources are very limited, which results in the fact that no one pays too much attention to the known failures, no matter if they are repeating or occurring without any visible pattern. After several years, some of the less stable test suites became so broken that they have an informative-only character, and unless they start to fail in the hundreds, no one cares. Another pain point is that the test suites are not frequently updated. They are usually updated only when there is a problem with the new version of OpenJDK. The argument is that updates usually bring changes that are reflected in the results, and the test suite can require additional configuration. It can also cause new failures, which can be misidentified as JDK issues if the engineer does not notice there was an update recently. Lastly, it breaks the baseline that is used for comparison of the results. The issue with updating the test suites is that an update can be incompatible with older major versions of OpenJDK. This forces the team to use multiple versions of one framework. Initially, it does not bring any extra work, since the runs are automated; however, it can break over time, as the OpenJDK changes and the old versions of frameworks are rarely maintained, so they have to be patched, or some tests have to be ignored. This could be changed e.g. by assigning each test suite to a responsible person, who would fix the test suite and then maintain it regularly. Another solution, from what I have seen from other teams, is to dedicate some percentage of each team member's time to work on these tests. The ultimate solution is creating a JDK specific exclude list that would just ignore the given set of tests. It is already applied in some cases, e.g. when there are different variants of a test for some OpenJDK major versions, when a test applies to libraries that were added or removed in a given version, or when a different (usually older) version of a test suite is required. Although, this should be used only for tests that will never be fixed or cannot be fixed, because excluding tests makes them forgettable. So if the test can and will be fixed eventually, it should rather be failing than excluded.

6. TPS tests the installability of RPMs.
7. https://wildfly.org/


Proprietary test suites

The biggest proprietary test suite that we currently run is the Java Compatibility Kit (JCK) [19]. It is the largest test suite we run: currently, it has hundreds of thousands of tests, depending on the configuration and the OpenJDK version. The JCK is created for each release to ensure that the implementation is compliant with the specification of the platform. We have to run the JCK on all of our builds in order to be sure that the implementation can be called OpenJDK. The JCK is available to us thanks to the OCTLA, the OpenJDK Community TCK License Agreement [20]. The JCK is obtained as a binary and we are not allowed to modify it. We usually have no power over the proprietary test suites; e.g. if a test fails in JCK, it must be communicated back to Oracle and they either exclude or fix the test. We cannot fix or change these test suites, even if they are broken. Luckily, we run only a few of them and, in general, long-time failing tests in proprietary suites are rather rare.

7 Test result evaluation and processing

The current test result management is not ideal. There are very few tools for results comparison being used, and they do only a basic result change detection, with some more advanced regression scoring mechanisms that allow the engineer to detect new failures in benchmarks and bigger test suites. We will explore new, effective ways in which the results could be processed automatically, finding and identifying failures, so the engineers can spend less time reading logs and looking for issues.

7.1 New available result reporting

The first step is notifying the engineer that there are new builds tested and ready to be checked. With several thousands of jobs, it is not ideal to use email notifications, because many builds can be processed each day and it would result in spam. It is also not very convenient to look straight into the Jenkins GUI, because with that many jobs and builds, it is very easy to miss something. The engineers are usually well aware of the most important builds. The NVRAs of the builds are posted on mailing lists when they are built and expected to be properly tested. The internal builds are triggered manually, so these are also less likely to be missed. But there are builds flowing into the infrastructure daily via automated hooks, and those can easily go through unnoticed. Since each build takes up to several days to go through the full test kit, it is not viable to wait for all tests to finish before checking the results. Since there are new results every day, it can be beneficial to look for new results every day. The new runs view should be condensed and posted somewhere, along with links to the respective jobs. There is a tool in the team that takes care of this, called the daily report. A nice solution could be a script or a simple tool that would gather the results processed each day, week or other defined time period. This way, the engineers would be able to easily see new job runs every day. It could also be enhanced by separating the results by the job result (passed, unstable or failed), or by sorting the results per NVRA, project and other factors, so only the relevant results could be checked.
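As a rough illustration of how such a daily gathering could start, the sketch below queries the standard Jenkins JSON API for jobs whose last build finished in the past day and groups them by result. The Jenkins URL is a placeholder and the jq tool is assumed to be available; a real daily report would add links per project and NVRA.

#!/bin/bash
# Sketch: list Jenkins job runs finished in the last day, grouped by result,
# as a starting point for a daily report.
JENKINS_URL="${JENKINS_URL:-https://jenkins.example.com}"
SINCE_MS=$(( ($(date +%s) - 24*3600) * 1000 ))   # Jenkins timestamps are epoch millis

curl -s "$JENKINS_URL/api/json?tree=jobs[name,lastBuild[number,result,timestamp,url]]" \
  | jq -r --argjson since "$SINCE_MS" '
      .jobs[]
      | select(.lastBuild != null and .lastBuild.timestamp >= $since)
      | "\(.lastBuild.result)\t\(.name) #\(.lastBuild.number)\t\(.lastBuild.url)"' \
  | sort    # sorting by the result column groups SUCCESS/UNSTABLE/FAILURE runs together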


7.2 Result comparison

The next step is usually result comparison: comparing new test results against older ones and looking for regressions. In software testing, the simplest approach is usually a comparison with the previous version of the product. Since we have a lot of (ir)relevant test failures that have not been properly resolved, we need to check that there are no new failures rather than simply checking whether the tests fail or not. However, in our case, the comparison is more complicated, because we are testing several variants of the same product. We need to see the situation from several angles.

7.2.1 Narrowing the case

After a regression is discovered, it is important to evaluate the extent of the problem. To find the cause of a failure, it is essential to know where and under what conditions it occurs, because it helps to narrow down where we should look for the bug.

1. Major version
There can be bugs that are specific to one version of OpenJDK. That can be a bug that is fixed in newer versions but not backported yet, or a bug that was introduced by a backport.

2. Operating system
There can be some unwanted interaction between the JDK and the operating system or system libraries.

3. Architecture
There can be bugs that are architecture specific, especially at the bytecode or instruction set level.

4. Build variant
Testing fastdebug builds tends to unveil more issues than the regular builds. Also, a different VM can be a cause of unexpected behavior.


5. Test variant
Each test variant usually exercises a part of the code that is otherwise not used (e.g. using a different garbage collector). When a bug occurs only with a specific test variant, it gives a hint about where to look.

Finding these differences can be managed by grouping all test results for a given NVRA and comparing the failures. Assuming we have results for the full set, it should give us enough information to know if the problem is the JDK itself, some feature, or an external factor. There is one more angle from which we could look at it, and that is the project.

» Project
There can be an issue with the build process of the RPMs, or bugs that are specific to only one of the projects.

That was not possible to do simply by grouping results by NVRA, because the build naming convention was completely different, as the NVRA string is different for every project, let alone the human-dependent naming of manually uploaded binaries. With the stricter naming conventions introduced by the O-tool, the mapping of NVRAs of different projects to one version should be easier, as the builds now have deterministic names. There are still a few smaller obstacles, like several RPM releases of one build, but that should only raise the total number of matched results.

7.2.2 Results comparing

From the Jenkins point of view, regressions are discovered either by comparing results of several runs of one job, or of one NVRA across several jobs. Both of these comparisons give us vital information that is necessary for a fast and successful fix of discovered bugs. There are several different checks that should be done in order to gain as much information as possible.

1. Previous tested version comparison


This check is the easiest to implement and is already a part of the JTREG report. It basically compares the last result of the same job to the current one. This is important to do for every result, because it will discover when exactly a new regression occurred. With results for all runs, it is easy to spot whether it is a valid regression that needs to be fixed and reported, or just a flaky test that sometimes fails e.g. due to network issues. A minimal sketch of comparing two failure lists in this way is shown after this list.

2. Previous released version comparison
This check is important, because during the work on the new version, several binaries are usually built and tested. In this fashion, failures can be overlooked, because a failure is currently detected only if it is new, and tests that started to fail during the development can get lost in the expected failures that occur when the patches are applied on the sources. It would also be very handy for the release process, because the engineer responsible for the quality assurance could just compare the last released binaries with the current ones, skipping the many in-between results of the partially patched binaries. It saves the engineer time and speeds up the release process. The last released version is not easy to get, and the information about it needs to be stored somehow or passed as an argument by the user. For the RPMs, we should be able to get the information from the build system, as the released builds have extra tags. But for the manual uploads and vanilla builds, we can only take the last built binary with the previous version (usually it is bumped by 1 somewhere in the version string), while the source bumps are tracked as build numbers. But this method is only semi-reliable and it is up to the QE to decide with what precision this check should be implemented.

3. Cross variant comparison
For cross variant comparison, it is generally better to limit the range to one test suite and look for differences there, due to the high volume of results. Each test variant of a test suite has its own Jenkins job. To compare different variants, we need to sort jobs per project, to see which jobs have test failures and which do not.


Comparing across different test variants can give us a quick idea about whether it is a GC or architecture specific issue, an environment issue, or something else. If it fails everywhere, it points to an issue in the common parts of the JDK. The same data view could be implemented for build variants, as those are another break point. Fastdebug builds or builds with JFR on/off can have bugs that are specific to the given build variant, and a view clearly showing that would be nice to have. As the O-tool provides the necessary information about which jobs belong to which project and build variant, it is more a matter of querying the data and creating a human-readable view over it than inventing a new way to match the jobs.

4. Cross project comparison
Lastly, we need to know which projects are impacted by the bug. In O-tool terminology, each project represents either an OpenJDK forest that is being built and tested, or another single source of binaries that are test-only. That gives us the important information about which forest(s) are impacted, whether it is a JDK or distribution issue, and where the bug spread via backports. As mentioned before, the O-tool is aware of which jobs belong under which project, so getting them is not an issue. However, with the number of jobs of a single test suite for a project, the most challenging part will be how to display the results so they are easily readable and understandable.
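The core of all four checks is comparing two lists of failed tests. A minimal sketch of that comparison is shown below; the file locations are an assumption, as in practice the lists would be extracted from the archived job artifacts or the report plugin data.

#!/bin/bash
# Sketch: compare the failed-test lists of two Jenkins runs (or of two jobs
# testing the same NVRA) and print only the newly failing tests.
OLD_FAILURES="$1"    # failures of the previous run, one test name per line
NEW_FAILURES="$2"    # failures of the current run

# comm -13 prints lines unique to the second (sorted) input, i.e. new failures.
comm -13 <(sort -u "$OLD_FAILURES") <(sort -u "$NEW_FAILURES")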

7.2.3 Jenkins report plugin

One of the tools that is heavily used for result processing is the jenkins-report-jck plugin1. Despite its slightly misleading name, this plugin can show and compare unit test, TestNG, jtreg and JCK results. It can display a summary, details from the logs, and most importantly a diff of the previous and the current run. The tool works pretty well, assuming the tested NVRAs are in chronological order and we always want to compare the latest result with the previous one. The biggest issue, mixing multiple sources and platforms in one job, disappeared with the deployment of the O-tool. Every run of a Jenkins job that has this plugin configured provides a view containing a summary of the run tests. That also includes a list of failed tests and an expanded view that shows the problematic part of the code for each test. This is important for regression tracking, because we can look into the past to see when the test started to fail.

1. https://github.com/judovana/jenkins-report-jck


Figure 7.1: Results of jtreg test suite displayed in the Jenkins Report JCK plugin view


It also contains a section where we can see changes in failures compared to the last run, and a list of all tests that have been run and the difference between them.


Figure 7.2: Comparison of previous and current run of jtreg test suite shows a new failing test.

This plugin provides a nice way to compare two consecutive runs of the same job. Unfortunately, the plugin is unable to track the regression development over time; it can only compare results with the last run, and it does not provide information about the overall test suite health. That functionality is currently substituted by graphs that show how the test suite is doing over a given period of time. It is also unable to compare results across several jobs, and it does not check the stack traces of the failed tests. However, proper comparison against other jobs is a key feature that needs to be implemented in order to achieve effective result processing and regression triage, and this plugin could be a great inspiration for how it can be done.


7.3 Test fail processing possible improvements

We have already established that there are test failures that are known, failing regularly, but not yet fixed. As far as the current automated checks go, the test suite is checked for the number of failures and for which tests are failing. If the number and names of the tests match, no regression is reported. This is not enough and should be improved, because the exception or error can change and go unnoticed.

7.3.1 Type of the fail recording

The current check should be extended with an exception matching algorithm. It should inspect the test log and check the part of the log where the failure occurs. In most cases, the log outputs should be identical, especially if only an error code is returned. This is a necessary check, because already failing tests can start to fail differently; e.g. a test failure can change to a compilation failure because of a change in the API. Currently, this is not noticed by any tooling and passes as no new regression. This is wrong and it should be noticed and reported.
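A very simple sketch of such a check is shown below: it treats the first exception or error line of the test's section in the log as the "failure signature" and compares it between two runs. The log paths and the signature extraction are illustrative assumptions; real logs would need a more careful parser.

#!/bin/bash
# Sketch: compare the failure signature of a known-failing test between the
# previous and the current run, and report when the failure changes its type.
OLD_LOG="$1"
NEW_LOG="$2"
TEST_NAME="$3"       # e.g. some/test/Name.java#Name

signature() {
    # Take the part of the log belonging to the test and keep the first line
    # that looks like an exception, an error or a FAILED marker.
    grep -A 50 -F "$TEST_NAME" "$1" \
        | grep -m 1 -E '([A-Za-z.]+(Exception|Error))|FAILED'
}

old_sig=$(signature "$OLD_LOG")
new_sig=$(signature "$NEW_LOG")

if [ "$old_sig" != "$new_sig" ]; then
    echo "Known failure of $TEST_NAME changed its signature:"
    echo "  old: $old_sig"
    echo "  new: $new_sig"
fi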

7.3.2 Broader history discovery

Unless an engineer decides to dig deeper, only the last available results are checked in the comparison. This is useful for seeing new regressions; however, it does not consider occurrences of the same regression in the past. That can be a valuable asset when looking for a fix, but also a way to track irreproducible bugs. Depending on the resources and the search speed, this could also be done for different test variants of the same build and test suite. It can also give a hint whether it is a regular problem, or a new failure that needs attention.

7.3.3 Automatic reproducer extraction

For some test suites, it is easy to extract a single test and run it, to have a quick way to verify a potential fix. Yet, it is not implemented for the vast majority of the suites. That is a shame, because when a test fails, the developers usually require an extracted reproducer anyway.


There is already one existing and working tool in the team that is able to extract reproducers from JCK 8. If this was done for the other big test suites (e.g. Lucene, Wildfly), it could simplify the workflow significantly for both testers and developers. A lot of information about a failing test, including the name of the failing class or script, is already well formatted in XML or JSON files for processing purposes. The part that remains to be done is finding the failing class in the source code and extracting it with minimal dependencies. That is necessary in order for the engineers to be able to compile the reproducer code and run it themselves. The source code can be compressed into a single tarball that can be archived as a part of the job artifacts.
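A minimal sketch of the extraction step follows. The directory layout, the way the failing class name is obtained and the resolution of dependencies are assumptions; a real tool would also pull in the classes the test depends on.

#!/bin/bash
# Sketch: copy the source of a failing test class out of the test suite
# sources and archive it as a job artifact.
set -eu

FAILING_CLASS="$1"                      # e.g. org.example.FooTest (assumed input)
SUITE_SRC="${SUITE_SRC:-./testsuite/src}"
OUT_DIR="reproducer"

# Turn the fully qualified class name into a source path and copy it,
# preserving the package directories.
rel_path="$(echo "$FAILING_CLASS" | tr '.' '/').java"
mkdir -p "$OUT_DIR/$(dirname "$rel_path")"
cp "$SUITE_SRC/$rel_path" "$OUT_DIR/$rel_path"

# Pack the reproducer so it can be archived by the Jenkins archiver plugin.
tar -czf reproducer.tar.gz "$OUT_DIR"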

7.3.4 Test author search

Some of the internal test suites (e.g. reproducers) have an author annotation in the header of the test. Extracting this information and including it in the minimal report would help with the assignment of the issues. Sometimes, the authors of the test suites are developers, but most of the tests were at least modified by the QE people, so identifying a person who already knows the test could make the investigation faster. Currently, the JTREG report and other report plugins simply return the failed tests and their stack traces. There should be another line at the end of each test, stating the author name, which can be extracted from the test source file.
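The extraction itself can be as simple as the sketch below; the @author javadoc tag is an assumption about how the internal tests are annotated.

#!/bin/bash
# Sketch: extract the author annotation from the source file of a failing
# test, so it can be appended to the report.
TEST_SOURCE="$1"      # path to the failing test's .java file

author=$(grep -m 1 -oE '@author[[:space:]]+.*' "$TEST_SOURCE" \
             | sed 's/@author[[:space:]]*//')
echo "Author of $(basename "$TEST_SOURCE"): ${author:-unknown}"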

7.3.5 Blame a commit

Finding the commit that caused the failure can speed up the investigation. It can help us identify whether the change was intentional and what changed, so we can more easily determine the cause of the failure. As each commit comes with a committer, it also gives us a person to contact who can verify the test. "In addition to being primarily for version control, Git also provides a couple commands to help you debug your source code projects." [10] We will briefly introduce the git commands that can be a big help when looking for a bug in the sources.


Git blame / git annotate

Git blame2 is used for file annotation. If we find a regression and are able to identify which file(s) are a potential suspect, we can use git blame to get more information. The git blame command annotates every line of a given file with the hash of the commit that was the last one to change that particular line, and the name of the author of the commit. This way, we can find the date and author of the commit, but most importantly, we get the hash of the commit that could have caused the problem in the first place. Therefore, we want to take the sources of the OpenJDK that is being tested, find the .java files that are present in the core of the failed test's stack trace, and annotate them. After annotating the files, we should look for the lines that are present in the stack trace of the failed test, to see when they were last changed, and keep the commit hash. We can also search and annotate related classes (e.g. return types, argument types) to find possible changes. The commit hash should be used to find the whole commit (and related commits) to see which classes and files were changed and to get the whole picture (there can be more files causing the issue, not just the one we identified). We can also get the commit messages and bug numbers that show what the commit was supposed to fix. Lastly, the author name tells us who created, backported or pushed the commit, so the author can be tracked down and notified that their commit could have introduced a bug. All this information can then be presented to the engineer in the form of an auto-generated report, which saves a lot of manual work. It should also contain links to the test and build sources, the original stack trace and other information about the failure.
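A brief illustration of these steps is shown below; the file path and line number are placeholders standing in for a real stack trace entry.

#!/bin/bash
# Sketch: annotate the line from a failed test's stack trace and look up the
# commit that last touched it.
FILE="src/java.base/share/classes/java/util/SomeClass.java"   # assumed suspect file
LINE=123                                                       # assumed suspect line

# Who last changed this exact line, and in which commit?
git blame -L "$LINE,$LINE" --line-porcelain "$FILE" | head -n 3

# Take the commit hash from the plain blame output (strip the '^' that marks
# boundary commits) and show the full commit: message, bug id and changed files.
commit=$(git blame -L "$LINE,$LINE" "$FILE" | awk '{print $1}' | tr -d '^')
git show --stat "$commit"
git log -1 --format='Author: %an <%ae>' "$commit"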

Git bisect

"The git bisect tool is an incredibly helpful debugging tool used to find which specific commit was the first one to introduce a bug or problem by doing an automatic binary search." [10] It is a great tool to look for older bugs, especially if they are buried somewhere in thousands of commits. You basically give the command a good commit, that does

2. Git blame and git annotate are very similar commands, that only differ in output format. Git annotate exists only for backward compatibility.

57 7- TEST RESULT EVALUATION AND PROCESSING not have the bug and a bad commit, that has it, and git picks a commit in the middle and asks, if the commit is good, or bad. This way, it continues up to the point, where you find the bad commit. It is a great approach, when you don't know where to look, but it is not as useful in case of OpenJDK, as it might be for other software. The main reason is that the build of OpenJDK can take several hours, depending on the version, architecture and hardware. And we need to rebuild the JDK each time we want to run the test (that can also take some time), if the commit is good or bad. Even with a few tries, we can look for the bug for days or even weeks. Another thing is, that most of the tested forests are built and tested regularly, some even per each tag, so we should be able to look in the past results and see the bug. If the test was not run in the past, we can re-test the binaries with older builds, that are all archived, and it would probably still be faster, then bisecting and building the forest over and over again. Nevertheless, git bisect is a very useful tool and should be consid• ered as an option for occasional bug hunting, if it fits the situation, but in our case, it is not a good idea for bug search automation.
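The tag name and the test script below are placeholders; git bisect run re-runs the given script on each candidate commit and uses its exit code to mark the commit good or bad.

#!/bin/bash
# Sketch: basic git bisect usage with an automated test script.
git bisect start
git bisect bad HEAD                 # the current tip shows the failure
git bisect good jdk-11.0.5+10       # assumed last known good tag
git bisect run ./run-failing-test.sh
git bisect reset                    # return to the original checkout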

Git versus Mercurial

There seems to be an inconsistency with the git approach: OpenJDK sources are developed using Mercurial. There are three reasons why this is not a big deal:

1. hg annotate
Mercurial has a very similar feature called hg annotate. It has almost the same usage as git blame, so it could be used instead. The reader might be wondering why we did not focus on Mercurial instead, when OpenJDK is developed using Mercurial. The reason is project Skara [21], which is currently under development in the OpenJDK community.

2. Project Skara
This is a project that is focused on investigating alternative SCM and code review options. It already has two JEPs planned:


» JEP 357: Migrate from Mercurial to Git [22]
» JEP 369: Migrate to GitHub [23]

So assuming that Skara will happen eventually, we should focus on Git rather than on Mercurial, as future projects will be using Git and old projects will either be migrated or slowly die out. However, that puts another question in front of us: what will we do until project Skara is implemented?

3. AdoptOpenJDK GitHub

The AdoptOpenJDK3 community repositories can be a big help to cover this gap. They mirror most OpenJDK projects on their GitHub4, so these mirrors could be used for the time being.

3. AdoptOpenJDK [24] started in 2017 to provide open and reproducible build and test systems for OpenJDK.
4. https://github.com/AdoptOpenJDK/

8 Conclusion

The aim of this thesis was an analysis of the current state of automated testing of OpenJDK. The goal of the analysis was to identify major issues of the current testing process and outline ways in which the problems could be solved, and also to come up with suggestions for improving the current results processing, which heavily depends on the engineers' work.

First, the thesis was set into context; we introduced the OpenJDK project, Red Hat's role in it, and the most important tools used for test automation and development. We explored the process of OpenJDK testing. The analysis discovered two major problems that were a result of the growing number of builds and binaries that needed to be tested: mixed projects in a single job, and the inability to effectively maintain the test matrix. We introduced a solution in the form of a tool that is able to maintain the Jenkins jobs while keeping all the important information about the matrix. It is extensible to provide an API to work with the jobs and test results. We also covered the topic of the test matrix declaration, where we outlined how the build and test variants should be covered, and why it is not trivial to fully automate the generation of such a matrix.

The analysis covered other corner case issues in the automation, like infrastructure problems, test issues and false negatives. For each problem, one or more possible solutions were outlined, or already existing tools were found and tested.

Lastly, we explored the process of test result analysis and processing. We found that the process is not yet automated, and we outlined how the results could be handled in an automated way, cooperating with the existing automation in the form of the O-tool and the reporting tools in use.

The thesis does not serve a sole purpose for a single team. The ideas, drafts and mentioned tools are usually customizable, reusable and therefore applicable for other QE teams who might struggle with similar issues; the thesis also discusses multiple solutions for most issues, giving an overview of the available tools.

Bibliography

1. SINGH, Yogesh. Software Testing. Cambridge University Press, 2011. ISBN 9781107012967.
2. OpenJDK FAQ [online]. Oracle Corporation and/or its affiliates, 2020 [visited on 2020-06-15]. Available from: https://openjdk.java.net/faq/.
3. JDK Update Releases [online]. Oracle Corporation and/or its affiliates, 2020 [visited on 2020-07-06]. Available from: https://openjdk.java.net/projects/jdk-updates/.
4. Oracle Java SE Support Roadmap [online]. Oracle, 2020 [visited on 2020-05-12]. Available from: https://www.oracle.com/java/technologies/java-se-support-roadmap.html.
5. RED HAT BECOMES STEWARD TO OPENJDK 8 AND OPENJDK 11. UNIX Update. 2019, vol. 30, no. 5. ISSN 08928088.
6. HALEY, Andrew. OpenJDK 8 and 11: Still in safe hands [online]. Red Hat Inc., 2019 [visited on 2020-04-19]. Available from: https://developers.redhat.com/blog/2019/05/22/openjdk-8-and-11-still-in-safe-hands/.
7. Jenkins User Documentation [online] [visited on 2020-07-06]. Available from: https://www.jenkins.io/doc/.
8. Managing plugins [online] [visited on 2020-04-19]. Available from: https://www.jenkins.io/doc/book/managing/plugins/.
9. MCALLISTER, Jonathan. Mastering Jenkins. Packt Publishing, 2015. Community Experience Distilled. ISBN 9781784390891.
10. CHACON, Scott; STRAUB, Ben. Pro Git. Apress, 2014. Available also from: https://git-scm.com/book/en/v2.
11. Job DSL [online] [visited on 2020-06-06]. Available from: https://plugins.jenkins.io/job-dsl/.
12. Job Generator [online] [visited on 2020-06-06]. Available from: https://plugins.jenkins.io/jobgenerator/.
13. Jenkins Job Builder [online] [visited on 2020-06-06]. Available from: https://jenkins-job-builder.readthedocs.io/en/latest/.


14. GraalVM Documentation [online]. Oracle Corporation and/or its affiliates, 2020 [visited on 2020-07-10]. Available from: https://www.graalvm.org/docs/.
15. Java Flight Recorder [online]. Oracle Corporation and/or its affiliates, 2020 [visited on 2020-07-10]. Available from: https://docs.oracle.com/javacomponents/jmc-5-4/jfr-runtime-guide/about.htm#JFRUH170.
16. ENGELEN, Johan. Link Time Optimization (LTO), C++/D cross-language optimization [online] [visited on 2020-07-10]. Available from: http://johanengelen.github.io/ldc/2016/11/10/Link-Time-Optimization-LDC.html.
17. Build Timeout [online] [visited on 2020-05-10]. Available from: https://plugins.jenkins.io/build-timeout/.
18. Beaker XML schema [online] [visited on 2020-06-06]. Available from: https://beaker-project.org/docs/_downloads/beaker-job.rng.
19. JCKDistilled [online]. IcedTea [visited on 2020-07-12]. Available from: https://icedtea.classpath.org/wiki/JCKDistilled.
20. Gaining access to the JCK [online]. Oracle Corporation and/or its affiliates, 2020 [visited on 2020-06-10]. Available from: https://openjdk.java.net/groups/conformance/JckAccess/.
21. Skara [online]. Oracle Corporation and/or its affiliates, 2020 [visited on 2020-06-10]. Available from: https://openjdk.java.net/projects/skara/.
22. JEP 357: Migrate from Mercurial to Git [online]. Oracle Corporation and/or its affiliates, 2020 [visited on 2020-07-09]. Available from: https://openjdk.java.net/jeps/357.
23. JEP 369: Migrate to GitHub [online]. Oracle Corporation and/or its affiliates, 2020 [visited on 2020-07-09]. Available from: https://openjdk.java.net/jeps/369.
24. About AdoptOpenJDK [online]. AdoptOpenJDK [visited on 2020-07-09]. Available from: https://adoptopenjdk.net/about.html.
