Exploring methods for dependency management in multi-repositories

Design science research at Saab Training and Simulation

Main Subject area: Computer Engineering
Author: Oskar Persson, Samuel Svensson
Supervisor: Ragnar Nohre
JÖNKÖPING 2021-07-15

This final thesis has been carried out at the School of Engineering at Jönköping University within Computer Engineering. The authors are responsible for the presented opinions, conclusions, and results.

Examiner: Florian Westphal
Supervisor: Ragnar Nohre
Scope: 15 hp (first-cycle education)
Date: 2021-07-15


Abstract

Dependency problems are to developers what sneezing is to people with pollen allergies in the spring: an everyday problem. This is especially true when working in multi-repositories. The dependency problems occur as a byproduct of enabling developers to work on different components of a project in smaller teams, where everything is version controlled. Nearly all developers use version control systems, such as Git, Mercurial, or Subversion. While version control systems have helped developers for nearly 40 years and are constantly being updated, some functionality is still missing. One example is a good way of managing dependencies that allows developers to download projects without having to handle dependency problems manually. The solutions that version control systems offer to help manage dependencies (e.g., Git’s submodules or Mercurial’s subrepositories) do not give developers a fail-safe way to download or build a project that contains dependency problems. In this study, a case study was conducted at Saab Training and Simulation to explore methods for dependency management and to discuss and highlight some of the problems that emerge when working with dependencies in multi-repositories. An argument can be made that the functionality of dependency management systems, both package managers and the solutions built into version control systems, has not kept up with how dependencies are used in development today. In this paper, a novel approach to dependency management is introduced that makes it possible to describe dependencies dynamically by describing the usages of a repository (such as simulation of hardware or the main project). The functionality required to handle such a system is also discussed. By re-opening the dialog about dependency management and describing the problems that arise in such environments, the goal is to inspire further research within these areas.
Keywords: dependency management, dependency problems, dependency conflicts, dependency hell, version control systems, git, mercurial, subversion.


Table of contents

Abstract
Table of contents
1 Introduction
1.1 Problem statement
1.2 Current research
1.3 Purpose and research questions
1.4 Scope and limitations
1.5 Disposition
2 Method and implementation
2.1 Data collection
2.2 Artifact evaluation
2.3 Validity and reliability
2.4 Considerations
3 Theoretical framework
3.1 Git
3.1.1 Submodules
3.1.2 Subtrees
3.2 Mercurial
3.2.1 Subrepositories
3.3 Subversion
3.3.1 Externals
3.4 Google Repo
3.5 Package managers
4 Challenges with dependency management
4.1 Diamond dependency problems
4.2 Circular dependencies
4.3 Alternative dependencies
4.4 Workflow
4.5 Merging strategies
5 Results
5.1 Evaluation criteria
5.2 Artifact iterations
5.2.1 Iteration 1
5.2.2 Iteration 2
5.2.3 Iteration 3
5.2.4 Iteration 4
5.2.5 Iteration 5
5.3 Manifest
5.4 Overlay
5.4.1 Gathering dependencies
5.4.2 Dependency mitigation
5.4.3 Cloning
5.5 Guidelines
5.6 Research questions
6 Discussion
6.1 Result discussion
6.2 Method discussion
6.3 Limitations
6.4 Related work
7 Conclusions and further research
7.1 Conclusions
7.1.1 Practical implications
7.1.2 Scientific implication
7.2 Further research
8 Acknowledgments
9 References


1 Introduction

Developing software is a continuously ongoing process. According to Lehman, the law of continuing change, “A program that is used and that as an implementation of its specification reflects some other reality undergoes continual change or becomes progressively less useful”, applies to all software systems (Lehman; Lehman, 1980, p. 1068). Today, many larger systems are not developed monolithically but are instead divided into several components, both internal and external (Florisson & Mycroft, 2015). This complicates development, as components depend on other components in the system and are usually developed by different teams. Even though most systems have moved away from monolithic development, the laws stated by (Lehman, 1980) still apply to component-based software systems (Abate et al., 2012; Lehman & Ramil, 2000). The component-based development strategy is closely related to ideas such as code reuse, but it places greater demands on those in charge of planning, managing, and executing the evolution of software (Lehman & Ramil, 2000). Systems are becoming increasingly complex, and companies produce more and more code at an increasing velocity (Vouillon & Cosmo, 2013). How, then, do they store their code so that developers can keep track of changes, dependencies, issues, and the authors of the changes? In most cases, the answer is a version control system (VCS) (Otte, 2009; Spinellis, 2005). Over the years many VCS have emerged (Ruparelia, 2010), and many of them handle version control in different ways. VCS can be divided into two classes or families, each with different attributes that need to be considered when choosing which VCS suits the needs of a company or project (Zolkifli et al., 2018).

Some companies choose to keep all their code in a single repository, a so-called mono-repository, while others use multiple repositories for their projects (Jaspan et al., 2018). Both approaches have advantages and disadvantages. Figure 1 below illustrates the differences between mono- and multi-repositories.


Figure 1 Illustration of how components and projects are stored in multi- vs mono-repositories.

In this paper, a project is defined as a top-level repository that uses components on which it depends. A component is defined as code that has been extracted into a reusable building block for projects. As Figure 1 shows, the most notable difference is that in multi-repositories (MR), every part of the system has its own repository and, by extension, its own version control. The dependencies used in an MR are accessed by adding them as components (also called submodules in some VCS) to the project. The components are repositories in their own right and therefore have their own versions and backlogs, which makes dependency problems with components more complex. This is mostly because the version of a dependency does not automatically follow the latest version of that component, but instead needs to be updated separately in every project that uses it. Because components and projects are versioned separately, it is possible to target specific versions of a component. In a mono-repo, all parts of the system are in the same repository, which therefore allows only one version of each component. Handling problems with version-controlled dependencies is nothing new, and over the years it has been studied extensively. Even though the problem has been alleviated, there is still more to do to achieve problem-free management of dependencies (Abate et al., 2020). However, dependency management between components in repositories lacks research and is arguably an important area to explore, which is what this paper aims to do.
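To make the versioning difference concrete, the following minimal Python sketch models a multi-repository setup in which each project pins its own version of a shared component. The names (`Repo`, `stale_pins`, the example repositories and version strings) are hypothetical and chosen only for illustration; they do not describe the structure used at Saab T&S.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """A repository with its own version and its own pinned dependencies."""
    name: str
    version: str
    # Component name -> version this repository currently targets (hypothetical data).
    pins: dict[str, str] = field(default_factory=dict)


def stale_pins(repo: Repo, latest: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {component: (pinned, latest)} for every pin that lags behind."""
    return {
        component: (pinned, latest[component])
        for component, pinned in repo.pins.items()
        if component in latest and pinned != latest[component]
    }


# Two projects use the same component but target different versions,
# which is possible in an MR and impossible in a mono-repo.
latest_versions = {"logging": "2.0"}
project_a = Repo("project_a", "1.0", {"logging": "1.2"})
project_b = Repo("project_b", "3.1", {"logging": "2.0"})

report_a = stale_pins(project_a, latest_versions)
report_b = stale_pins(project_b, latest_versions)
```

Here `project_a` still targets an older version of the component while `project_b` already follows the latest one, so each project has to be updated separately.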


1.1 Problem statement

The current research on dependency management takes a wide perspective, but there is a clear gap in how dependencies between repositories should be managed. The scientific gap in handling dependency problems partly lies in “… a solution to an installation problem may exist for one system but not another.” (Florisson & Mycroft, 2015, p. 2). Most VCS have functionality to handle the use of dependencies, but usually only one level down, which creates problems in systems that are built from multiple layers of components that are all developed in individual repositories. There is currently no way to handle alternative dependencies, which in short are different instantiations of a component that are used under different circumstances. When looking at dependencies in MR, a range of problems can arise if the dependency management is not done carefully. While many, if not most, dependency problems are avoided by working in a mono-repo where everything is versioned together, other areas of development suffer instead. There are a few reasons why some companies, Saab Training and Simulation (Saab T&S) included, cannot transition to working in a mono-repo. One is the need to continuously develop and version control components individually, which is important for tracking changes as well as testing code. Another is the ability to easily split into smaller teams where every team has its own area of responsibility. This means that developers do not have to download the entire codebase of a large mono-repo when they only need to fix a small bug in one component. Finally, many components are used in multiple projects, which means that adding a specific version of a component as a dependency is easier and more efficient in an MR.
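The idea of alternative dependencies can be illustrated with a small Python sketch in which every dependency lists the usages that need it, and only the dependencies matching the requested usage are selected. The description format, the component names, and the usage labels are assumptions made for this example; they are not the format of any existing VCS feature.

```python
# Hypothetical dependency description: component -> usages that require it.
DEPENDENCIES = {
    "radio_driver": {"usages": {"main"}},
    "radio_simulator": {"usages": {"simulation"}},
    "math_utils": {"usages": {"main", "simulation"}},
    "test_harness": {"usages": {"test"}},
}


def select_dependencies(dependencies: dict, usage: str) -> list[str]:
    """Return the names of the components needed for the given usage."""
    return sorted(
        name for name, entry in dependencies.items() if usage in entry["usages"]
    )


main_deps = select_dependencies(DEPENDENCIES, "main")
simulation_deps = select_dependencies(DEPENDENCIES, "simulation")
```

In this sketch the hardware driver and its simulated counterpart are alternative instantiations of the same role, and a developer working on the simulation would not need to download the driver or the test harness.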

1.2 Current research

A study conducted with Google engineers who had worked on both MR projects and mono-repos showed that 79% preferred working in mono-repos. The two most cited reasons were:

• The ease of dependency management
• The ease of updating dependent code

The engineers also stated how much they “loved not having to deal with the diamond dependency problem” (Jaspan et al., 2018, p. 233). This is one reason to move to mono-repos, but it does not work for all kinds of projects. Just as there are several pros to mono-repos, there are also cons; e.g., in a mono-repo only one version of a dependency exists at any given time, which can cause problems because different parts of a system may require different versions of a dependency (Florisson & Mycroft, 2015). Current studies in dependency management focus almost exclusively on package management, where software is used to download, implement, and upgrade packages or dependencies. The currently most popular VCS (Git, Mercurial, and Subversion) all have some sort of component management. Git uses a recursive approach to get all submodules in such systems, but this can lead to strange results such as retrieving multiple versions of the same component or multiple copies of the same component, which might lead to memory leaks; see the Circular dependencies chapter. As stated by (Abate et al., 2020), solving dependency problems is hard and has been proven to be NP-complete (a class of problems for which no efficient general algorithm is known, so a solver may in the worst case have to try all possible combinations; refer to (Cook, 1971; Lewis, 1983) for additional information). Therefore, a dependency solver might not be a feasible option, both because of the complexity of the problem and because it is an optimization problem: different versions of a component might be needed in different situations. The tools that support the evolution of component-based systems need to improve and be studied further, which will only be possible if dependency solving is treated as its own separate concern (Abate et al., 2012; Lehman & Ramil, 2000). (Lehman & Ramil, 2000) continue by saying, “These [aforementioned] tools are essential if mankind is to survive its ever-growing dependence on computers, that is on software” (Lehman & Ramil, 2000, p. 254). With that in mind, a step in the right direction would be a method for effectively managing component dependencies so that the problems can be handled.
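The cost of trying all possible combinations can be illustrated with a deliberately naive exhaustive search in Python. This is only a sketch under assumed inputs (the component names, version numbers, and constraints are invented), not a practical dependency solver and not the approach taken in this study. The number of candidate combinations is the product of the number of available versions per component, so it grows quickly as components are added.

```python
from itertools import product


def solve(available, constraints):
    """Try every combination of versions until all constraints hold.

    available:   {component: [candidate versions]}
    constraints: list of functions taking an assignment {component: version}
                 and returning True if the assignment is acceptable.
    Returns (assignment or None, number of combinations tried).
    """
    names = sorted(available)
    tried = 0
    for combo in product(*(available[name] for name in names)):
        tried += 1
        assignment = dict(zip(names, combo))
        if all(check(assignment) for check in constraints):
            return assignment, tried
    return None, tried


# Hypothetical components with 2, 2 and 3 versions: 2 * 2 * 3 = 12 combinations.
available = {"A": [1, 2], "B": [1, 2], "C": [1, 2, 3]}
constraints = [
    lambda s: s["A"] == 2,        # the project needs version 2 of A
    lambda s: s["B"] == s["A"],   # B must be released in lockstep with A
    lambda s: s["C"] == 3,        # only the newest C is compatible
]

assignment, tried = solve(available, constraints)
```

In this example the only valid assignment happens to be the last of the twelve combinations in the search order, so the search has to visit all of them before it succeeds.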

1.3 Purpose and research questions

Following the problem statement, it is evident that not all companies can use a mono-repo to circumvent the dependency problems that occur. Consequently, the purpose of this study is:

To research potential ways of handling dependency problems in multi-repositories as well as explore the possibility of producing a method that facilitates dependency management.

Dependency problems are a hassle for engineers working in MR (Jaspan et al., 2018); hence the first research question is:

[RQ1]: Why do dependency problems occur?

When using VCS, it is currently not possible to see potential dependency problems. As a result, the problems are not found until the project is downloaded (and manually checked) or built; thus, the second research question is:

[RQ2]: How can dependency problems be identified before they become an issue?

For certain companies, such as Saab T&S, it is not feasible to transition to working in mono-repos; therefore, the third research question is:

[RQ3]: How can dependency problems be handled?

1.4 Scope and limitations

Dependency problems exist for all kinds of dependencies, but the focus in this paper is on dependencies that are created using submodules/components in an MR; mono-repository solutions are therefore not taken into account. This paper does not aim to automate the process of solving dependency problems, but rather to detect the problems. The artifact produced in this paper is a method for dependency management. A delimitation of this study is that the case study was conducted with one company (Saab T&S) and its projects. This means that their way of using MR in VCS is applied, which can differ from how other companies do it.

1.5 Disposition

The remainder of the paper is structured as follows. In the Method and implementation chapter, the design science research method is explained, together with how it is implemented in this report. The subchapters explain the data collection and the evaluation of the artifact. The Theoretical framework chapter takes a deeper look at how different types of VCS work. Three individual VCS are reviewed (Git, Mercurial, Subversion), and their pros and cons are presented and discussed. The chapter also covers package managers and Google Repo. In the Challenges with dependency management chapter, the dependency problems that emerged during the study are explained further, along with a discussion of how workflow problems affect development in MR. In the Results chapter, the resulting artifact is presented along with the evaluation criteria. The functionalities of the artifact are explained in more depth, and a subchapter describes the iterations of the artifact and how they were conducted within the Design Science Research framework.

The Discussion chapter discusses the results of the research and what could have been done differently. The method used in this research is also evaluated and reviewed in terms of how well it worked in this context. The Conclusions and further research chapter covers the conclusions of the study, as well as the potential practical and/or scientific implications this study might have. Finally, what needs to be researched further is presented. The structure of this study’s chapters, as discussed above, is presented in Figure 2 to help the reader visualize how they are connected.

Figure 2 Structure of the study


2 Method and implementation

To understand why dependency problems appear, how they are tracked, and how they are managed, they are studied in the software development industry. To achieve this, a study of dependency problems is conducted, together with research into how they are handled and an exploration of potential methods and guidelines for managing dependencies in MR. The framework and guidelines for design science in information systems presented by (Hevner et al., 2004) have been used in this study with further inspiration from (Hevner, 2007; Mettler et al., 2014), where the focus is on developing an artifact to mend this information problem. Simplified, the design science research (DSR) framework consists of seven guidelines and has an iterative development method in which the focus shifts between designing an artifact and evaluating it against the guidelines presented in (Hevner et al., 2004). This iterative approach can be seen as three separate cycles: the relevance cycle, the rigor cycle, and the design cycle. The relevance cycle provides the requirements for the research, i.e., the problem to be addressed, but also defines acceptance criteria for the artifact. Further iterations of the relevance cycle may lead to a restatement of the requirements as discovered through actual experience (Hevner, 2007). The rigor cycle provides the artifact with knowledge from state-of-the-art research. The knowledge base is not only the current research but also the extensions to theories and methods made during the research. Contributions to the knowledge base are important because they are the main selling point to the academic audience (Hevner, 2007). The design cycle is the main part of DSR. However, even though this cycle is somewhat independent of the other cycles during the actual research, it is equally important to convincingly ground the design/evolution as well as the evaluation in both relevance and rigor. The design cycle does, however, call for multiple iterations before any contributions are put into the rigor and relevance cycles (Hevner, 2007).


Figure 3 below illustrates how the DSR framework was adapted to the research approach. The environment (Organizations, People, and Technology) provides an in-depth understanding of the industry and the need for alternative solutions for dependency management. Particularly relevant are organizations engaged in development where products need to be rigorously tested. The people who shape the environment and who could potentially benefit from dependency management methods are developers in fields where current approaches are a nuisance. Domain experts in IT architecture and in development and IT operations (DevOps) have been part of the focus group to achieve a true representation of how development is done in the industry. Saab T&S represented the organizational aspect of this DSR. All the products at Saab T&S require testing and tracking of specific versions of products.

Figure 3 Research approach based on DSR, adapted from (Hevner et al., 2004).

The knowledge base (Foundations and Methodologies) is represented by research on version control systems, package managers, and dependency management systems as well as theories for dependency solving. To support the research process, observations are made through a case study that examines methods for dependency management in depth in the business environment, complemented by controlled experiments and simulations. In this framework, the artifact is the method for dependency management. The artifact and its evaluation criteria are developed iteratively and together form the foundation through which [RQ1] and [RQ2] will be answered. Table 1 below summarizes how the guidelines for DSR are applied to the research process.


Table 1 Description of how the guidelines in (Hevner et al., 2004) are applied in the research.

Guideline | Description

Guideline 1: Design as an Artifact | In this study, the artifact will consist of a method for dependency management.

Guideline 2: Problem Relevance | The problem relevance is to address problems in dependency management, in software development practices where multi-repositories are a necessary structure.

Guideline 3: Design Evaluation | The design evaluation will have both an observation approach through a case study and an experimental approach where the artifact’s usability is evaluated in a controlled environment. Furthermore, evaluation criteria will be developed during the beginning of the study through the relevance cycle. The artifact is then evaluated against these criteria.

Guideline 4: Research Contributions | In this study, the main research contribution is the artifact (the method for dependency management), but also the foundations of why dependency problems occur and any additional information concerning the nature of the dependency problems that are found during the research.

Guideline 5: Research Rigor | The knowledge base provides the study with research by describing how dependency management and dependency problems are handled in the present time.

Guideline 6: Design as a Search Process | The goal has been to present a method as a foundation for future research, in order to strive for problem-free dependency management. Methods during the research have either been dismissed or refined.

Guideline 7: Communication of Research | Because DSR must be presented both to technology-oriented as well as management-oriented audiences, the goal is to describe the method and functionality both in detail and in an organizational context. It is important for the technology-oriented audience that the artifact is described in enough detail to be implemented. For management-oriented audiences, the functionality of the artifact will be described more generally, which addresses the fundamental questions of DSR: What utility does the new artifact provide? And what demonstrates that utility?

2.1 Data collection

To achieve sufficient relevance, the work has been conducted in close collaboration with domain experts in version control and software architecture at Saab T&S. Access to Saab’s codebase and practices has given a realistic view of how larger corporations might structure their work, but also of what the drawbacks are. More on that follows in the Theoretical framework and Challenges with dependency management chapters. The data collection consists of the results from testing the artifact. The different iterations have different goals as their main focus. If, for example, the goal of an iteration is to detect dependency problems in a project located on a local computer, the data consists of the dependency collisions, presented in plain text and later evaluated.

[RQ1]: Why do dependency problems occur? will be answered through the observation part of the design evaluation in DSR, where the artifact is studied in the business environment through a case study. The question is answered during the development, together with the current knowledge base.

[RQ2]: How can dependency problems be identified before they become an issue? will be answered through analysis of why dependency problems occur and demonstrated via the artifact.

[RQ3]: How can dependency problems be handled? will be answered through the experiments and testing part of the DSR. Dependency problems are not black and white, and different dependency structures affect the complexity of solving them differently. This is why tests and experiments are conducted according to DSR.

2.2 Artifact evaluation

Together with the experts from Saab, a set of functionalities is developed iteratively to achieve a relevant artifact. Every iteration of the artifact is presented to Saab T&S and is tested and evaluated by domain experts. Depending on how well the artifact meets one or more of the functionalities, it is either developed further to meet more of the criteria or changed to take a different approach to the problem.


2.3 Validity and reliability

Validity is achieved by testing the different iterations of the artifact in test environments that share the same structure as projects at Saab T&S. This makes it possible to see whether the current iteration of the artifact does what it is supposed to do. The validity of the report could have been improved had the artifact been a complete solution to the problem, which was not feasible given the time frame and the complexity of the problem. The focus is rather on the method, i.e., how dependency problems can be approached. The benefit of building these test projects instead of using the artifact directly on Saab’s codebase is that, when minor changes have been made to the artifact, it can be tested again on projects where it should in theory work the same. It also guarantees that Saab’s repositories are not unintentionally altered when testing the artifact. Reliability is achieved by presenting the end method partly as pseudo-code. This allows others to grasp the approach independently of the programming language they use. More importantly, pseudo-code allows an explanation of what needs to be done in general terms, while the main focus remains on the method. The reliability could also have been higher if this study had focused more on a complete solution than on the method. As explained above, this was not done because of both the time frame and the complexity of the problem.

2.4 Considerations

A consideration made in this study is that the end method is just that, a method, with guidelines on how code can be (and usually is) structured. Another consideration to keep in mind is that artifacts constructed in DSR are generally not complete information systems that are used in practice; instead, they are innovations that define ideas (Hevner et al., 2004). Solving the problem of dependency management in VCS requires more time than is granted for a thesis, which is why a method and guidelines were chosen. These will help start the discussion and allow future research to continue from where this study left off.


3 Theoretical framework

In this chapter, some different types of VCS are presented with their pros and cons, as well as how they handle dependencies. According to the latest information gathered from OpenHub, 72% of all repositories are Git repositories, while Subversion has 23% and Mercurial has 1% (OpenHub). Stack Overflow’s latest survey that involved version control was made in 2018, where almost 90% of all respondents answered that Git was their choice of VCS (StackOverflow, 2018). There are two main types of VCS (see Figure 4), the first being distributed version control systems (DVCS). In a DVCS, the source code and history of a project are not only stored remotely; each contributor to a repository also has a local copy of the repository that, in essence, can be seen as a separate server, where changes can be committed locally without needing to merge them into the remote server. Therefore, every developer has a copy of the whole history of the repository. The second type of VCS is centralized version control systems (CVCS). In these VCS, there is one remote server that contains all history of the code. Therefore, if a developer wants to change anything in a project, they must download a specific version of the code to their local computer and make the changes. After that, they have to commit the changes straight into the master repository.

Figure 4 Illustration of how CVCS and DVCS work, respectively.

The main differences between the two are as follows:

• CVCS is generally easier to learn.
• DVCS enables the user to work offline.
• DVCS is faster because the user does not need to communicate with the remote server for every command.
• Working on branches is easier in DVCS, because the user has the entire history of the code.
• If a project has a long history or contains large files, DVCS can take more time and space than CVCS, because in CVCS the user does not need the entire history on their local computer.
• If the main server goes down, the user can still recover the code from the local repository in DVCS, while in CVCS only the remote server contains all the code.
• There are fewer merge conflicts in DVCS because every developer is working on their own piece of code.

(De Alwis & Sillito, 2009; Otte, 2009).

3.1 Git

One of the most popular modern VCS, if not the most popular, is Git (Chacon & Straub, 2014). Git is an actively maintained open-source project originally developed in 2005 by Linus Torvalds, the famous creator of Linux. Many software projects, both commercial and open-source, rely on Git for version control. Git is built on a distributed architecture and therefore falls under the category DVCS (Chacon & Straub, 2014). While working on a project, it is very likely that there is a need to use another project’s code, whether from another repository developed by the same organization or from a third-party developer. This is a common situation: developers want to treat the two as separate projects but still be able to use and incorporate the other code into the main project. Git has a few solutions for this, which are explained in the Submodules and Subtrees subchapters.

3.1.1 Submodules

Git submodules allow a repository to keep another Git repository as a subdirectory of the main repository (Chacon & Straub, 2014, pp. 249-277). Simply put, a submodule is a reference to a Git repository with a specific commit targeted, so that it does not automatically change to the newest version. It is also possible to target a specific branch and to switch back and forth in the history of the repository. Submodules are static and do not update automatically when the referenced repository is updated; instead, they need to be updated manually when the developer wants to change the version of the submodule (Loeliger & McCullough, 2012). Git can track multiple submodules in a repository, but in each repository only the next level of submodules is tracked. Therefore, the full hierarchy of submodules is only visible when the whole repository, including all the submodules, is downloaded. This creates problems like the diamond dependency problem (DDP), where multiple versions of a repository might be used in a single project. This problem is likely to happen, but the main issue is not that it can occur but that there are no warnings or indications when it has.
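How a diamond dependency can go unnoticed, and how it could be detected once the full hierarchy is known, can be sketched in Python as follows. The sketch assumes the hierarchy has already been gathered into a dictionary in which each repository at a given commit only lists its direct submodules, mirroring the fact that only the next level is tracked. The repository names and commit identifiers are invented for the example, and no Git commands are involved.

```python
from collections import defaultdict

# Hypothetical hierarchy: (repository, commit) -> direct submodules as (name, pinned commit).
DIRECT_SUBMODULES = {
    ("app", "a1"): [("gui", "g1"), ("net", "n1")],
    ("gui", "g1"): [("core", "c1")],
    ("net", "n1"): [("core", "c2")],
    ("core", "c1"): [],
    ("core", "c2"): [],
}


def find_conflicts(root, direct):
    """Walk the whole hierarchy and report components pinned at more than one commit."""
    commits_seen = defaultdict(set)  # component name -> set of pinned commits
    visited = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in visited:  # also protects against circular references
            continue
        visited.add(node)
        name, commit = node
        commits_seen[name].add(commit)
        stack.extend(direct.get(node, []))
    return {
        name: sorted(commits)
        for name, commits in commits_seen.items()
        if len(commits) > 1
    }


conflicts = find_conflicts(("app", "a1"), DIRECT_SUBMODULES)
```

Neither `gui` nor `net` contains a conflict when inspected on its own; the two different commits of `core` only become visible when the hierarchy is traversed from the top-level project.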


Furthermore, submodules do not support the use of alternative dependencies. When downloading a project and its submodules, it is not possible to avoid downloading submodules that are only used for testing unless the submodules are downloaded manually. Sometimes, in projects that use submodules, development is done simultaneously across the whole project, i.e., it requires changes in multiple repositories at different levels of the hierarchy. This creates a problem because a submodule is just a reference to another repository at a certain point in time. When updating the version of a submodule, the update needs to be done from the bottom of the hierarchy and up. If this is not done correctly, the project using submodules can reference a version that is not yet committed in the submodule. This problem escalates further when the code is about to be pulled into the main branch of the development, which is often done via peer-reviewed pull requests. To avoid referencing a non-existing commit, the pull requests also need to be handled in the same order. This disturbs the development workflow, as submodules are an advanced feature with a potentially steep learning curve for team members.
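The bottom-up update order described above corresponds to a topological ordering of the submodule hierarchy. The following Python sketch uses the standard library module `graphlib` (available from Python 3.9) to compute such an order; the repository names are hypothetical, and the sketch only determines the order in which updates and pull requests would need to be handled, without running any Git commands.

```python
from graphlib import TopologicalSorter

# Hypothetical hierarchy: repository -> the submodules it references.
SUBMODULES = {
    "app": {"gui", "net"},
    "gui": {"core"},
    "net": {"core"},
    "core": set(),
}


def update_order(submodules):
    """Return an order in which every repository comes after all of its submodules."""
    # TopologicalSorter expects node -> predecessors, i.e. what must be handled first.
    return list(TopologicalSorter(submodules).static_order())


order = update_order(SUBMODULES)
```

With this hierarchy, `core` is committed first and `app` last, while `gui` and `net` may be handled in either order. If the hierarchy contained a circular dependency, `static_order` would raise `graphlib.CycleError` instead of returning an order.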

3.1.2 Subtrees Git subtrees are another solution for incorporating another repository into the main repository. With subtrees there are no nested repositories; there is only one repository. This means that the lifecycle and history of the incorporated repository get merged with the main repository. A subtree is basically a merge between two repositories, where the whole history of the subtree is merged into the history of the main project. This renders the histories of the two projects indistinguishable, see Figure 5 below.

Figure 5 Illustration of a Git subtree add. As the merged repository (B in the figure above) is no longer distinguishable from the main repository, changes to its source code are changes to the main repository. Therefore, it is possible to change the source code of the incorporated component. This is not necessarily a bad thing, but as reused components are versioned, there is no convenient way to see if a component’s source code has been changed. Because the source code is also merged, projects can become large very quickly, which often requires a lot of memory. This creates a challenge when working on such a project because the developer cannot choose to download only the main project but must instead download the full project with all its dependencies. This is very impractical when working with dependencies such as tests or alternative dependencies that are not required in every situation. However, when working with subtrees there is no need to update in a specific order, because the changes in subtrees are only present in the main project’s repository and developers do not need to know that they are working on another repository’s code. The problems arise when it is time to contribute the changed subtree’s code upstream, as the person doing this needs to be aware of specific workflows and commands. If the commits are not merged back upstream to the original repository, the subtree and the repository that it mirrors can no longer be seen as the same repository, as development can continue separately in each.

3.2 Mercurial Mercurial is a free DVCS that was developed in 2005 by Matt Mackall. It is supported on Microsoft Windows as well as Unix-like systems, such as macOS and Linux. The major design goals of Mercurial include high performance, scalability, and decentralization. The biggest difference between Mercurial and Git is how branches work. In Mercurial, the branch is embedded in each commit, so a branch is a set of changesets, while in Git a branch is a pointer to a particular changeset. If a commit is made on a certain branch, it will always remain on that branch, which means the branch cannot be deleted or renamed, because such actions would change the history of the commits. It is, however, possible to close branches. For more information regarding Mercurial, see (O'Sullivan, 2009).

3.2.1 Subrepositories Mercurial’s subrepositories are the equivalent of Git’s submodules. They work in a similar way, by adding a reference to a specific commit of another repository. One difference compared to Git’s submodules is that Mercurial makes sure that the subrepository commits referenced by the root repository are available on the server. This means that it is not possible to push changes to the root repository without recursively pushing all the changes of the subrepositories as well. In the bigger picture, Git and Mercurial work in a similar fashion.

3.3 Subversion Subversion, abbreviated SVN, was initially released in 2000 and was developed by CollabNet as a successor to the Concurrent Versions System (CVS). It is still one of the most popular VCSs and one of the most popular CVCSs, if not the most popular.

3.3.1 Externals SVN handles the equivalent of submodules through the svn:externals property. One downside of externals is that nested externals cannot be listed, e.g., if you have external B inside external A. Another downside is that if you have made changes to a file and its externals, you must run the commit command on all working copies, as a commit does not recurse into any externals (Collins-Sussman et al.).

3.4 Google repo Google repo is a tool built on top of Git that helps developers manage projects that are distributed over several repositories. The tool lets developers manage a decentralized project containing many repositories as if it were a centralized project: it helps to sort changes into the different repositories and to upload the changes to the review system used. Repo is not meant to replace Git but was created to optimize the workflow in Android development. This solution relies on Git for version control, diffs, branching, and edits, while Repo is used to upload to the Gerrit revision control system and to automate parts of the Android development workflow. Gerrit is a code review system for projects that use Git. Gerrit uses a more centralized approach to Git, allowing all users to submit changes, which are automatically merged if they pass the code review. Repo uses manifest files to link Git projects into the super project. The manifest contains information about the dependencies in the Git projects used, and it is stored in a super repository, or manifest repository, that is only used for keeping track of the dependencies. Handling dependencies this way makes it easier to track synchronized development because Repo allows each dependency to be described either as non-deterministic (each repository version can be defined as a floating head) or as deterministic (pointing to a specific version or commit). This allows developers to always reference the newest version during development and then lock into specific versions for testing and documentation later in the process. It also has a few downsides, e.g., that it requires an extra repository for management, and that the non-deterministic relationship makes it hard to reconstruct older versions because it is not possible to track what versions of dependencies were used at a specific time. There are some parts of this implementation that could prove useful.
For example, when developing a project that is located in multiple repositories with dependencies connecting them, Repo simplifies the workflow by creating branches automatically in all repositories and merging them simultaneously when the commits or pull requests are made. Another useful part is the utility to configure dependencies as both deterministic and non-deterministic.

3.5 Package managers What is a package? A package is a type of archive that contains a computer program in the form of source code or a binary and additional metadata about the package. For a given package, the metadata contains data about other packages that need to be downloaded i.e., dependencies, checksums, and sometimes additional information (Abate et al., 2020; Cappos et al., 2008).

First introduced in the early 1990s, package managers (PMs) have been used in development and to support the life cycle of software components (Abate et al., 2020). A PM is a software tool that helps automate the process of installing, removing, and updating computer programs for a user or operating system (Florisson & Mycroft, 2015). One of the main responsibilities of a PM is dependency solving when a request for a package is made. The PM gathers information about the current environment in which the package is to be installed, the available packages in the database, and additional user preferences, e.g., to only install strictly required packages. As its output, the dependency solver presents an update plan, i.e., actions to take in a specific order to reach the requested status of the system (Abate et al., 2020). Dependency hell is a common phrase that describes a multitude of problems in dependency management, e.g., DDP and circular dependency problems (CDP). It arises because traditional PMs insist on having only one version of a package available at a given time. This leads to conflicts because packages may require different versions of a dependency. Systems only allow a single version of a package mainly because it can be impossible to link multiple versions of the same library into a single executable. Linking two versions of the same library into a single program may also result in values being passed between the two library versions, thereby violating data abstraction and type safety (Florisson & Mycroft, 2015). A package’s dependencies are mainly described as version ranges, in contrast to version control, where a dependency refers to a specific version. This effectively loosens the constraints on dependencies and thereby makes dependency solving easier. However, as packages are static objects, they complicate the development process, because a downloaded package only contains the source code or binary and is not version controlled.
For example, if development is done in an environment where packages are used and a developer wants to fix a bug in a package, it is first necessary to download the repository with the version control, commit the changes, make a new package, and upload it to the package manager’s database. This creates an ineffective workflow if dependencies in a repository are treated as packages. There are also challenges in the auditing of packages because there is no way for a developer to know if a security issue affecting a dependency also affects their program (Abate et al., 2020; Cappos et al., 2008). PMs solve dependency problems mainly by using version ranges for dependencies: all dependencies in the environment are evaluated to find a valid solution in the given context. As argued by Abate et al. (2012), dependency solving should be treated as a separate concern and could thereby be transferable to other methods of managing dependencies.
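To make the idea of dependency solving over version ranges concrete, the following minimal Python sketch picks a version of one package that satisfies every range requested for it. The function names, the inclusive (lowest, highest) range representation, and the version numbers are illustrative assumptions and are not taken from any particular package manager.

```python
from typing import Optional

Version = tuple[int, int, int]


def parse(text: str) -> Version:
    """Turn '1.2.0' into (1, 2, 0) so that versions compare numerically."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def solve(available: list[str], ranges: list[tuple[str, str]]) -> Optional[str]:
    """Return the newest available version inside every requested range, or None."""
    candidates = [
        v for v in available
        if all(parse(low) <= parse(v) <= parse(high) for low, high in ranges)
    ]
    return max(candidates, key=parse) if candidates else None


# Two packages request overlapping ranges of the same (hypothetical) dependency C.
available_c = ["1.0.0", "1.1.0", "1.2.0", "1.3.0"]
print(solve(available_c, [("1.0.0", "1.2.0"), ("1.1.0", "1.3.0")]))  # 1.2.0
# Disjoint ranges have no common version, which is the conflict described above.
print(solve(available_c, [("1.0.0", "1.0.0"), ("1.2.0", "1.3.0")]))  # None
```

Choosing the newest candidate is only one possible preference; a solver could equally prefer the oldest version or the one closest to what is already installed.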

4 Challenges with dependency management In the following chapter, the different dependency problems are analyzed and compared against the guidelines for what a dependency management system needs to accomplish. The chapter also highlights some of the problems that have emerged during the research. The information about the dependency problems came mainly from the domain experts at Saab T&S, who face them every day, and partly from literature, for example (Fan et al., 2020; Jaspan et al., 2018).

4.1 Diamond Dependency Problems One of the most common problems that can occur in MR is the DDP, see Figure 6 below.

Figure 6 Diamond dependency problem. Simplified, the DDP occurs when a project uses multiple dependencies that in turn use the same underlying dependency. This in itself is not a problem; it is a common and often necessary architectural pattern. The problem emerges when dependencies are not updated synchronously and therefore use two different versions of the lower-level dependency. This can cause several problems, so it is important to track which dependencies a project is using, not only on the first layer but for the full hierarchy of dependencies (Abate et al., 2020). A dependency conflict in MR is complex because a dependency is a reference to a specific version. When trying to solve such conflicts, it is only possible to affect the dependencies on the project’s own level. The problem (see Figure 7) is that if A (1.0.0) uses dependency C (1.1.0) and B (1.0.0) uses dependency C (1.2.0), you need to find a version of A that points to the correct (or necessary) version of C. In this case, A (1.X.Y) would have to point to C (1.2.0) for the DDP to be handled.

Figure 7 Illustration on how to solve a DDP. This problem gets increasingly difficult as more components are added, because of mutually incompatible versions of the same component. As explained in the Problem statement, the only solution is to check every version of every component to find the lowest common denominator, see Figure 8 below.

Figure 8 Expansion of Figure 7 with another submodule added (D). To explain the figures above further (Figure 7 & Figure 8), the current version of A is not necessarily the latest version of that component, but rather the version the project is using at that moment. Every version of a component that uses a submodule refers to a specific version of the submodule. In the above example, A (1.0.0) refers to C (1.1.0). A later version of A might point to a later version of C. It is not possible to change which version of C that A (1.0.0) refers to. Instead, the conflict is solved by finding a specific version of A that points to C (1.2.0), which in this case could be A (1.1.0).
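The search described above can be sketched in a few lines of Python. The table of pinned versions is hypothetical and only mirrors the example in the text, and the choice to keep B fixed while searching the versions of A is an assumption made for the illustration.

```python
# Hypothetical history: each version of A and B pins one specific version of C.
PINS = {
    "A": {"1.0.0": "1.1.0", "1.1.0": "1.2.0"},
    "B": {"1.0.0": "1.2.0"},
}


def versions_pinning(component: str, wanted_c: str) -> list[str]:
    """List the versions of a component whose reference to C equals wanted_c."""
    return [v for v, c in PINS[component].items() if c == wanted_c]


def resolve_diamond(current: dict[str, str]) -> dict[str, str]:
    """Keep B fixed and look for a version of A that pins the same C as B."""
    target_c = PINS["B"][current["B"]]
    matches = versions_pinning("A", target_c)
    if not matches:
        raise LookupError(f"no version of A references C ({target_c})")
    return {"A": matches[0], "B": current["B"], "C": target_c}


print(resolve_diamond({"A": "1.0.0", "B": "1.0.0"}))
# {'A': '1.1.0', 'B': '1.0.0', 'C': '1.2.0'}
```

With more components the same search has to be repeated for every combination of versions, which is why the problem grows quickly.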

4.2 Circular dependencies Another problem regarding dependency management is CDP, which is not as common as DDP but can have devastating consequences. CDP means that there is a relation between two or more components that either directly or indirectly depend on each other to function. One of the unwanted effects of CDP is memory leaks, see Figure 9 below, where A depends on B, which depends on C, which in turn depends on A. If a developer tries to download a project of that structure from Git using the command “git clone --recurse-submodules”, it will lead to an infinite recursion until memory is full.

Figure 9 Visualization of circular dependencies. The CDP might also start a domino effect whenever a local change in one component spreads to other components, which makes the program unable to compile or run. This can make it difficult for a developer to debug and find where the error began. The problems categorized under CDP occur for a few different reasons. The main reasons are tight coupling and low cohesion. The level of coupling is based on how connected two components or methods in code are to one another. Bieman & Kang (1995) explain cohesion like this: “Cohesion refers to the “relatedness” of module components.” (Bieman & Kang, 1995, p. 259). What they mean is how well methods, for example, fit in a certain component. If many parts of a component depend on other component(s), it can be close to impossible to re-use only that component, see Figure 10 below.

Figure 10 Illustration of tight coupling and low cohesion in code. There is a solution to a few of the CDP, by instead writing code that has loose coupling and high cohesion, see Figure 11.

Figure 11 Illustration on loose coupling and high cohesion in code. The dots in each component (in Figure 10 & Figure 11) represent pieces of code that are related to each other. Blue lines represent code using other code from the same component. Red lines represent usage of code from another component. By structuring the work like this, the potential domino effect (see Figure 9) can be mitigated, or at least become easier to debug due to fewer interconnected parts between components.
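A circular structure such as the one in Figure 9 can be detected on a description of the dependencies before anything is downloaded. The following Python sketch uses a depth-first search for this; the graph representation (a dictionary from component name to the components it depends on) and the function name are assumptions made for the illustration.

```python
def find_cycle(graph: dict[str, list[str]]) -> list[str]:
    """Return one dependency cycle as a path such as ['A', 'B', 'C', 'A'], or []."""
    finished: set[str] = set()

    def visit(node: str, path: list[str]) -> list[str]:
        if node in path:  # node is already on the current chain: a cycle
            return path[path.index(node):] + [node]
        if node in finished:  # fully explored earlier without finding a cycle
            return []
        for dep in graph.get(node, []):
            cycle = visit(dep, path + [node])
            if cycle:
                return cycle
        finished.add(node)
        return []

    for start in graph:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return []


print(find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))  # ['A', 'B', 'C', 'A']
print(find_cycle({"A": ["B", "C"], "B": ["C"], "C": []}))  # []
```

Reporting the whole path, and not only the fact that a cycle exists, gives the developer a starting point for decoupling the components involved.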

4.3 Alternative dependencies In some cases, it is necessary to have multiple versions or instantiations of a component that are used in different circumstances. This could be when testing or when building the complete system; an example of this is shown in Figure 12 below.

Figure 12 Visualization of two mutually exclusive instantiations of a component. The need for dependencies that are only necessary in some situations has a further implication. Inside a repository, there can be internal dependencies, i.e., different source code files that depend on each other, or a testing application that depends on the source code. Different usages of a repository might need different external dependencies, e.g., the testing application might need some external test framework, while in development the simulated hardware version of a dependency should be used, see Figure 13 below for an example. In this case, the dependencies for the test framework and the simulated version of component A are only used in the test build for the repo. While developing in the product build, the only dependencies used are for components A and B. The conclusion is that there is not a single description of a repository’s dependencies; the dependencies vary between the different usages of that repository.

Figure 13 Illustration of usages (U1 & U2) linked to their respective dependencies. This leads to the argument that dependencies need to be described dynamically, which contradicts how most dependency management systems describe dependencies today.

4.4 Workflow Another problem that surfaces when developing in an MR environment is that of developing in multiple repositories at the same time, which is sometimes necessary to make dependencies work together. When developing in two repositories, where one is dependent on the other, it is important that updates are done in the correct order. In the figure below, A has a dependency on B. If updates are made to B, A must in turn reference the new version of B(V3). If A pushes an update to the origin server that references the new version of B(V3) before that version is pushed to the server, A will reference a version of B that does not yet exist on the server. This will create failures in the update chain, see Figure 14 below.

Figure 14 Visualization of the incorrect order of updates when working in multi-repositories. Instead, the changes to B should be pushed to the server first, and once they are accepted, the new version of A(V2.0) that references the new version of B(V3) can be pushed to the server, see Figure 15 below.

Figure 15 Visualization of the correct order of updates when working in multi-repositories. This creates something that is from now on referred to as the update problem. The problem gets more complex the more components there are in a project, and even more so because dependencies may exist in multiple locations in the tree of dependencies. This creates chains of dependencies that need to be updated in a specific order to have valid references. Google Repo (described in the chapter Google repo) handles this elegantly by providing support for both committing changes and pushing changes to the server. The update problem can also be avoided by always referencing the newest version of a dependency, but as some products need to be traceable and testable against specific versions, this is not a feasible option in most cases. What differs between products is mainly how updates are applied and whether the product supports Firmware over the Air (FOTA). If a bug is found on a FOTA-supported product, a patch can be applied easily over the internet. For products without FOTA support it is quite the opposite: if a patch needs to be applied, it is necessary to apply the patch to the product directly. Therefore, rigorous testing and tracking are necessary due to the increased effort required to apply a patch or update. This adds even more complexity to the problem, because if each dependency targets a specific version, most dependencies will not reference the latest versions. So, what happens if a project makes changes to components that are not on the newest version? Should these changes be branched out? How is it possible to effectively avoid that all projects use their own branches of components? These are all questions that surface when exploring these kinds of problems. However, the problem with the workflow still exists and there is a need to improve the workflow of developers in all kinds of software development.
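The update problem amounts to pushing repositories in an order where every dependency is pushed before the repositories that reference it. As a minimal sketch, assuming a hypothetical chain where A references B and B references C, such an order can be computed with a topological sort from the Python standard library.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical project: each repository maps to the repositories it references.
references = {
    "A": {"B"},
    "B": {"C"},
    "C": set(),
}

# TopologicalSorter treats the mapped values as predecessors, so a repository
# is yielded only after everything it references has been yielded.
push_order = list(TopologicalSorter(references).static_order())
print(push_order)  # ['C', 'B', 'A']
```

If the references contain a circular dependency, `static_order()` raises `graphlib.CycleError`, so no valid push order exists in that case.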

4.5 Merging strategies Another topic regarding dependencies is how a pull request is merged into another branch. Because dependencies reference specific commits, it is possible to target other branches than the main branch in a repository. Each commit has an identifier (in Git, a SHA-1 hash) that is based in part on its parent, i.e., the commit before. Consider what happens when a developer starts working on new features in a new branch while another development team adds new commits to the main branch. The result is a forked history. If these changes are relevant to the features the developer is working on, there are two options: merging or rebasing. Merging basically creates a merge commit, which in this case adds the changes on the main branch to the history of the feature branch. However, this means that if the main branch is continuously updated with new commits, the developer will have to create an extra merge commit every time there are changes on the main branch that are relevant to the features. This can make the history of the branch very difficult to understand.

The other option is rebasing. A rebase moves the entire history of the feature branch to the tip of the main branch, effectively incorporating all new changes on the main branch. The benefit of this is a much cleaner history. However, when rebasing, the commits on the branch get new parents, and thereby new identifiers. This means that the entire history of the feature branch has different identifiers. Why is this important? If a feature branch is referenced as a component and that branch is later rebased onto the main branch, the commits that are referenced do not exist anymore. The same goes for the squash feature, which allows a developer to squash the history of a branch into a single commit; as commit identifiers are based on the actual content, the squashed commit will have a different identifier. Consider a dependency that references a commit on a branch. This might be a problem depending on what strategy is used when making a pull request to take the new features into the main branch. If the pull request is done with a merge, the commit still exists, but if it is done by rebasing or squashing, the commit will not exist anymore, which breaks the dependency. All these utilities are useful in their own right and should not be forbidden. However, it is important to know about their consequences. One way of avoiding this is to only allow dependencies to reference the main branch, because that branch should not change its history.
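The effect of a rebase on identifiers can be illustrated with a simplified model in Python. The sketch hashes the commit content together with the parent identifier; this is an assumption made for the illustration and not Git’s actual object format, which also includes the tree, author, and timestamps.

```python
import hashlib
from typing import Optional


def commit_id(content: str, parent: Optional[str]) -> str:
    """Hash the commit content together with the parent id (simplified model)."""
    payload = f"parent {parent}\n{content}".encode()
    return hashlib.sha1(payload).hexdigest()


main_1 = commit_id("initial work", None)
main_2 = commit_id("change by another team", main_1)

# The same change committed on a feature branch that forked from main_1 ...
feature = commit_id("new feature", main_1)
# ... and replayed on top of main_2 by a rebase.
rebased = commit_id("new feature", main_2)

print(feature != rebased)  # True: a reference to `feature` no longer matches
```

The content of the change is identical in both cases; only the parent differs, which is enough to give the rebased commit a new identifier.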

5 Results In the following chapter, the recommended solution is presented, along with the iterations of the artifact. The proposed solution for dependency management in this paper is a manifest solution. This involves describing every component and its dependencies not only for a single usage but for the alternative usages as well, thereby providing the utility of describing dependencies dynamically. The manifest is accompanied by an overlay program for a VCS, in this case Git, as this is Saab T&S’s current standard. For the overlay, the functionality and theories are explained. The remainder of the chapter is split into subchapters, first presenting the Evaluation criteria used to develop the artifact, followed by the Artifact iterations. As the artifact has two parts, a Manifest and an Overlay, these are presented in separate subchapters, followed by Guidelines and Research questions.

5.1 Evaluation criteria The evaluation criteria have been continuously developed through relevance cycles. These are the final product of evaluation criteria and are deemed the most important utility for a method for dependency management. These criteria are used in the next chapter to evaluate the artifact.

• Full versioning of a component including tags for release or similar tags.
• The possibility to choose alternative dependencies that the current user wants, for example choosing between a component for simulation or the full version of that component, see Figure 12 and Figure 13 in Alternative dependencies.
• In dependency conflicts, the optimization needs to be considered, i.e., it should be possible to request components of:
  o Exactly version N
  o Version N or newer
  o Version N or older
• It should be possible to represent dependencies to an interface and its implementation or implementations.
• Avoid the problems that relate to circular dependencies as well as notifying developers about dependency conflicts.

5.2 Artifact iterations At the beginning of the research, there were several iterations in both the relevance and rigor cycles to further investigate what problems occur when managing dependencies in an MR. It started out with an interest in why the DDP happens and how it can be mitigated. In the next subchapters, the main artifact, i.e., the design iterations, is presented.

5.2.1 Iteration 1 The first iteration of the artifact used Git submodules, with Git commands written to recurse through all submodules and parse out the names of all components along with their versions. Doing it this way made it possible to see DDPs, but without doing anything about them. The direct drawback is that the user must first download the project (which contains the DDPs) and then run the script to get notified about which DDPs exist in that project. Another drawback is the risk of infinite recursion while downloading all the files, which rendered this solution fruitless. This iteration to create the initial artifact was mostly done in the relevance cycle of DSR, where input from Saab’s domain experts shaped the necessary characteristics of the artifact. The rigor part of DSR mainly consisted of researching what current VCSs enable developers to do, and how they work.

5.2.2 Iteration 2 The second iteration was an attempt to only download a shallow (only one version) clone of all components of a project. The problem emerging from this was that Git defines shallow as downloading a project’s history truncated to depth X, where depth 1 means only the latest version. This could have been part of a possible solution if only the specified version at a certain depth was downloaded, and not all the newer versions as well. While working on this iteration, Saab gave continuous feedback on potential ways of using Git to choose a particular version of a project. These were then checked against the knowledge base, which showed why they would not work.

5.2.3 Iteration 3 The third iteration moved away from Git submodules due to a lack of customization options. The continuous conversations with Saab’s experts led to the requirement to be able to define alternative dependencies. The first approach to this was to use a manifest file and allow a dependency to target multiple repositories, i.e., a dependency might have two implementations. In this way, a user would have to select which of these implementations to use at the moment. This iteration started on the relevance side of DSR, where the lack of customization options was prevalent. The rigor cycle concerned how different systems, applications, or projects handle dependencies. This led to a similar usage of manifest files, but without clear-cut contents in them.

5.2.4 Iteration 4 The fourth iteration came as a result of the discovery that it is not only a dependency that varies between different implementations, but that a whole repository might have different dependencies altogether depending on how it is used. Considering this, it was concluded that a dependency might not only change between different usages, but it might also not even exist. Therefore, usages were described and implemented in the manifest, where one usage of a repository describes a dependency that is only needed for that specific usage. This was mainly achieved by the relevance cycle of DSR, where the characteristics of how dependencies and repositories may vary between different usages were discovered while structuring how a manifest file could work.

5.2.5 Iteration 5 The manifests themselves need an overlay that reads them and handles the download of the required versions of components. The fifth iteration of the artifact was about how to read the manifest files and download the dependencies’ manifest files. This was also iterated a few times (which will not be presented here) until the final solution was found, see Overlay. This type of solution mostly came from the rigor cycle, by researching how manifests are used by other systems, e.g., Android Studio, for developing applications for Android mobile phones. Because that kind of development differs from many other kinds of development, another rigor cycle was necessary, this time with the focus on how workflows differ and how this affects development where it is not possible to reference the newest possible version of dependencies.

5.3 Manifest To represent dependencies, an XML-style document called a manifest is used. In the manifest, there are two main tags: dependency and usage. The dependency tag contains attributes: the required ones are the URL and either a version or a commit, while the optional ones are name, branch, and usage. Simply put, the URL is the unique identifier of the repository, and the version makes it possible to target a specific version, a range of versions, or a commit id. For example, a dependency on a specific commit can be represented as:

Or a range of versions can be represented as:

The range can be described with wildcard characters such as “1.2.0-1.*”, where * represents any version that has the same major version and the same or a newer minor version. In this way, a version can be described as a snapshot in time by the commit, as a fixed range, or as an open range. As the range can be open, it is possible to set a dependency to always point to the newest version available during development. This is represented as:

The optional attribute name allows users to specify the folder name that the component will be stored under; by default, the folder is named after the repository, i.e., the last part of the URL. The optional attribute branch makes it possible to target specific branches; if it is not given, the main branch is chosen, or the branch that matches the specified version tag. The optional attribute usage provides the utility to chain together the usages between components. If usage is not defined, the default is to target the first usage described in the manifest of that component.
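The dependency attributes described above can be illustrated with a small Python sketch that reads a manifest fragment and checks versions against a range such as “1.2.0-1.*”. The exact XML syntax, the attribute spelling, the URLs, and the commit id are assumptions made for the illustration, and the range check is only one possible reading of the wildcard rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical manifest fragment; URLs, commit id, and attribute spelling are assumed.
MANIFEST = """
<manifest>
  <dependency url="https://example.com/repos/component-c.git" commit="3f2a9c1"/>
  <dependency url="https://example.com/repos/component-b.git" version="1.2.0-1.*"
              name="b" branch="main"/>
</manifest>
"""


def read_dependencies(manifest_xml):
    """Return the attributes of every dependency tag as a list of dictionaries."""
    root = ET.fromstring(manifest_xml.strip())
    return [dict(dep.attrib) for dep in root.findall("dependency")]


def in_range(version, spec):
    """Check a version against a 'lowest-highest' range such as '1.2.0-1.*'."""
    low, high = spec.split("-")
    ver = [int(p) for p in version.split(".")]
    if ver < [int(p) for p in low.split(".")]:
        return False
    # A '*' in the upper bound accepts any value at that and all later positions.
    for have, limit in zip(ver, high.split(".")):
        if limit == "*":
            return True
        if have != int(limit):
            return have < int(limit)
    return True


for dep in read_dependencies(MANIFEST):
    print(dep["url"], dep.get("commit") or dep.get("version"))

print(in_range("1.4.2", "1.2.0-1.*"))  # True: same major, newer minor
print(in_range("1.1.9", "1.2.0-1.*"))  # False: below the lower bound
print(in_range("2.0.0", "1.2.0-1.*"))  # False: different major version
```

The first dependency is a snapshot pinned to a commit, while the second accepts any version from 1.2.0 up to the newest 1.x release.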

Specifying usages simplifies the process of gathering dependencies, as it allows the process to be done automatically. The usage tag exists to allow alternative dependencies as well as interfaces that support multiple implementations. The usage tag describes a collection of dependencies because dependencies can vary between different usages of a repository, see Figure 13 in Alternative dependencies. The usage tag has two attributes: the required id, which is the identifier, and an optional description. A usage is represented as follows:
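As an illustration of the usage tag, the following Python sketch reads a manifest with two usages that loosely follow Figure 13 and returns the dependencies of one of them. The nesting of dependency tags inside a usage tag, the ids, and the URLs are assumptions; only the id and description attributes and the rule of defaulting to the first usage come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical manifest with two usages; ids, URLs, and nesting are assumptions.
MANIFEST = """
<manifest>
  <usage id="product" description="Full product build">
    <dependency url="https://example.com/repos/component-a.git" version="1.0.0-1.*"/>
    <dependency url="https://example.com/repos/component-b.git" version="1.0.0-1.*"/>
  </usage>
  <usage id="test" description="Test build with simulated hardware">
    <dependency url="https://example.com/repos/component-a-sim.git" version="1.0.0-1.*"/>
    <dependency url="https://example.com/repos/test-framework.git" version="2.0.0-2.*"/>
  </usage>
</manifest>
"""


def dependencies_for(manifest_xml, usage_id=None):
    """Return the dependency URLs of one usage; default to the first usage."""
    usages = ET.fromstring(manifest_xml.strip()).findall("usage")
    if usage_id is None:
        chosen = usages[0]
    else:
        chosen = next(u for u in usages if u.get("id") == usage_id)
    return [dep.get("url") for dep in chosen.findall("dependency")]


print(dependencies_for(MANIFEST))          # dependencies of the "product" usage
print(dependencies_for(MANIFEST, "test"))  # dependencies of the "test" usage
```

The test framework and the simulated component only appear when the test usage is requested, so a product build never has to download them.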

This format is not supported by VCSs, which means an overlay is necessary to use this manifest. In the next part, the overlay and its functionality will be explained.

5.4 Overlay The overlay is a program designed to download, update, and manage the dependencies in a manifest file. The methods presented in this chapter are focused on Git as the VCS, but as the methods are general they can be implemented with any VCSs of a similar structure. As such, the following subchapters will describe the functionality and theories of the overlay.

5.4.1 Gathering dependencies When cloning a repository with all its dependencies, the first part is to gather and map the dependencies for a repository. Here dependencies are considered unique by their URL and their range of versions or specific commit. As dependencies are described on every level of the project, the algorithm to get all the dependencies will do so recursively, see Figure 16 below.

Figure 16 Algorithm used to gather all the dependencies in a manifest while avoiding infinite recursion. The recursive gathering might lead to duplicates of the same version of dependencies, as well as circular dependencies. To prevent this, the algorithm checks if the dependency is already present in the list of dependencies before downloading it. However, as argued before, circular dependencies are not good practice, and it is therefore important to notify the user of these problems. Every manifest can have multiple usages, which means that if the usage is not defined in the dependency, the overlay must prompt the developer to choose which usage to use.
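The gathering step can be sketched in Python as follows. This is not the algorithm in Figure 16 itself: it uses a queue instead of recursion, and the dictionary `MANIFESTS` is a hypothetical in-memory stand-in for downloading and reading each dependency’s manifest. The check against the list of already gathered dependencies is what makes the walk terminate.

```python
from collections import deque

# Hypothetical stand-in for fetching a manifest: each (url, version) pair maps
# to the dependencies that its manifest declares.
MANIFESTS = {
    ("main", "1.0.0"): [("a", "1.0.0"), ("b", "1.0.0")],
    ("a", "1.0.0"): [("c", "1.1.0")],
    ("b", "1.0.0"): [("c", "1.1.0"), ("main", "1.0.0")],  # points back to main
    ("c", "1.1.0"): [],
}


def gather(root):
    """Collect every unique dependency reachable from root, visiting each once."""
    seen = [root]
    revisited = []  # references to something already gathered, to report to the user
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for dep in MANIFESTS.get(current, []):
            if dep in seen:
                revisited.append((current, dep))
                continue  # already gathered: skip it so the walk terminates
            seen.append(dep)
            queue.append(dep)
    return seen, revisited


deps, revisited = gather(("main", "1.0.0"))
print(deps)       # each (url, version) pair appears exactly once
print(revisited)  # duplicate and circular references that were skipped
```

The skipped references contain both harmless duplicates and the reference from b back to main; telling the two apart requires a cycle check on top of this list.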


5.4.2 Dependency mitigation

To mitigate or reduce dependency problems, it is important that the entire system of dependencies is visible to a developer. Even if it is possible to reduce the risks of dependency problems such as the CDP, it is important not only to avoid the problems they cause but also to reduce their causes. However, decoupling a CDP is generally hard to do automatically, as it requires developer expertise.

The output of the gathering stage is a list of unique dependencies, i.e., no two entries share both URL and version (or version range). It is therefore possible to check whether there are dependencies with the same URL but different versions, i.e., the DDP. Dependency problems are generally hard to solve, partly because of the complexity of the problem but also because a dependency needs to reference a specific version. This is explained in more depth in the chapter Challenges with dependency management. It might therefore not be possible to solve dependency conflicts automatically in all cases. However, if dependencies are expressed as ranges of versions instead of specific commits, the complexity of the problem is reduced. This opens the possibility both to solve a conflict and to optimize the solution by letting a developer choose where in the range of possible solutions the dependency points, e.g., exactly version N, version N or newer, or version N or older. The solution presented here therefore does not merely focus on solving these problems automatically but on making it visible to a developer where the problems exist so that they can be fixed.
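Both ideas, finding dependencies that share a URL but differ in version and narrowing version ranges, can be sketched as below. This is a hedged illustration: it assumes versions are represented as (MAJOR, MINOR, PATCH) tuples and ranges as inclusive (low, high) pairs with `None` for an open end. The function names and URLs are hypothetical.

```python
from collections import defaultdict


def find_version_conflicts(dependencies):
    """Group (url, version) pairs by URL and keep URLs with more than one version."""
    versions_by_url = defaultdict(set)
    for url, version in dependencies:
        versions_by_url[url].add(version)
    return {url: sorted(v) for url, v in versions_by_url.items() if len(v) > 1}


def intersect_ranges(ranges):
    """Intersect inclusive (low, high) version ranges; None means unbounded.

    Returns the common (low, high) range, or None when the ranges do not
    overlap and a developer has to resolve the conflict manually.
    """
    low, high = None, None
    for lo, hi in ranges:
        if lo is not None and (low is None or lo > low):
            low = lo      # tighten the lower bound
        if hi is not None and (high is None or hi < high):
            high = hi     # tighten the upper bound
    if low is not None and high is not None and low > high:
        return None
    return (low, high)


deps = [
    ("https://example.com/lib.git", (1, 2, 0)),
    ("https://example.com/lib.git", (1, 4, 0)),
    ("https://example.com/log.git", (2, 0, 0)),
]
conflicts = find_version_conflicts(deps)

# "1.2.0 or newer" combined with "1.4.0 or older" leaves a usable range.
common = intersect_ranges([((1, 2, 0), None), (None, (1, 4, 0))])
# "2.0.0 or newer" combined with "1.4.0 or older" cannot be satisfied.
disjoint = intersect_ranges([((2, 0, 0), None), (None, (1, 4, 0))])
```

When the intersection is not empty, a tool could let the developer pick any version inside it; when it is empty, the conflict can only be shown, not solved.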

5.4.3 Cloning

For the cloning of the repositories, the main functionality of the VCS is used, but with scripts that clone all the dependencies at the same time. Instead of placing dependencies at multiple levels, the main project is downloaded at the top level and has one folder containing all dependencies, so they are not spread over multiple levels of the project. With this approach, it is more convenient to work with components because they exist in only one location. Furthermore, describing where dependencies are located becomes easier when building projects. These cloned dependencies are in essence also repositories: they contain both the source code and the metadata about the repository. It is not necessary to track changes to the dependencies in the top-level project, which is why the folder containing the dependencies is marked as ignored. However, as they are repositories, they are still trackable at their own level. By doing this, the functionality of the VCS is preserved. Furthermore, this allows developers to switch between different usages of a repository, because files can be changed or deleted in those folders without the VCS tracking those changes.
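The layout described above could be produced by a script along the following lines. This is a sketch under assumptions: the folder name `deps`, the function names, and the example URL are hypothetical, and the code only builds the Git commands rather than running them.

```python
def plan_clone(dependencies, deps_dir="deps"):
    """Build the git commands that place every dependency in one flat folder.

    `dependencies` is a list of (url, revision) pairs, where the revision is
    a tag, branch, or commit. Nothing is executed; the commands are returned
    so that they can be inspected or run later.
    """
    commands = []
    for url, revision in dependencies:
        # Derive the folder name from the last part of the URL.
        name = url.rstrip("/").rsplit("/", 1)[-1]
        if name.endswith(".git"):
            name = name[:-4]
        target = f"{deps_dir}/{name}"
        commands.append(["git", "clone", url, target])
        commands.append(["git", "-C", target, "checkout", revision])
    return commands


def ignore_entry(deps_dir="deps"):
    """Line for the top-level .gitignore so the dependency folder is not tracked."""
    return f"/{deps_dir}/"


commands = plan_clone([("https://example.com/hw-driver.git", "v2.0.1")])
```

Each command list could be passed to `subprocess.run(command, check=True)`. Because every clone under the dependency folder keeps its own `.git` directory, it remains a full repository even though the top-level project ignores the folder.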

5.5 Guidelines

Having a company-wide naming convention for how versions are presented is one step of the way. One of the most common conventions is semantic versioning of components, which looks like this: MAJOR.MINOR.PATCH. The MAJOR version is what it sounds like, a major update. This could be anything from removing functionality to changing how the current functionality works. Generally, when the major number increases (e.g., from 1 to 2), the component is no longer backward compatible and is therefore not expected to work with projects using the older version. The MINOR version indicates minor changes. These are changes that might add functionality but are generally still backward compatible, so the component should be able to work with the older overlays or interfaces. PATCH is a patch or build number. This often covers bug fixes or other small changes that do not affect the usage of the component. Structuring version numbers in this manner makes it easier to handle version incompatibilities.

Worth mentioning, however, is that some hardware cannot be versioned semantically in this way. Saab T&S names hardware using article numbers and letters to define major and minor updates. The article number defines the major version: whenever a breaking (non-backward-compatible) change occurs, the article number changes. With minor changes, the letter changes, from a to z. If updates continue past the end of the alphabet, the lettering starts over at “aa”.
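The two conventions can be illustrated with a short sketch. The compatibility rule below (same MAJOR, and MINOR.PATCH at least as new) is a simplification of the "generally backward compatible" guideline, and the order after “aa” (“ab”, “ac”, and so on) is an assumption, since the text only states that the lettering starts over at “aa”. The function names are hypothetical.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(required, available):
    """True if `available` can stand in for `required`.

    Simplified rule: the MAJOR numbers must match, and the available
    MINOR.PATCH must be at least as new as the required one.
    """
    req, got = parse_version(required), parse_version(available)
    return got[0] == req[0] and got[1:] >= req[1:]


def next_revision(letters):
    """Next hardware revision letter: a..z, then aa, ab, ... (assumed order)."""
    chars = list(letters)
    i = len(chars) - 1
    while i >= 0:
        if chars[i] != "z":
            chars[i] = chr(ord(chars[i]) + 1)
            return "".join(chars)
        chars[i] = "a"   # roll over this position and carry to the left
        i -= 1
    return "a" + "".join(chars)
```

For example, `is_compatible("1.2.0", "1.4.3")` holds while `is_compatible("1.2.0", "2.0.0")` does not, and `next_revision("z")` gives `"aa"`.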

5.6 Research questions

The purpose of this study, as presented in the chapter Purpose and research questions, was to research potential ways of handling dependency problems in multi-repositories and to explore the possibility of producing a method that facilitates dependency management. With this as the foundation of the study, the research questions can be answered.

[RQ1]: Why do dependency problems occur?

When working in an MR, dependency problems occur as a byproduct of enabling developers to work with their own project, or their own component of a project, while having that component version controlled. This makes the work more bite-sized and easier to handle, as opposed to working in a mono-repo where all the code is in the same repository.

[RQ2]: How can dependency problems be identified before they become an issue?

The manifest-oriented method presented in this report was made with this research question in mind. By analyzing why different dependency problems occur, countermeasures can be taken to handle them, in this case by using manifest files. Having manifest files along with an overlay that handles them makes it easier for a developer to see linked dependencies, and therefore to see when two different components use the same dependency in different versions.

[RQ3]: How can dependency problems be handled?

The overlay, used in conjunction with manifest files, enables the developer to download only one version of every dependency, which prevents a number of potential problems, such as infinite recursion and the DDP.


6 Discussion

This chapter discusses the results of the study: why we ended up with the solution we did, why that solution is good, and what could be improved in the future. Limitations of the study, as well as related work, are also covered.

6.1 Result discussion

The result of this study was a product of exploring how dependency management can become easier for developers, as well as research on how problems can be avoided. To achieve our result, we drew ideas from different solutions, such as the functionality of VCSs, package managers, and Google Repo, along with a continuous dialog with experts at Saab T&S. Our solution might not work for everyone, but it could be used as a foundation for others.

We believe the suggested method for managing dependencies addresses the requirements for such a system. Full versioning was achieved by keeping a VCS in the loop; in our experiments we used Git, but in theory the method could be used with any VCS with the same characteristics. The manifest supports both interfacing and alternative dependencies, and by highlighting the need to describe dependencies dynamically instead of statically, we hope that others will continue our work towards more dynamic dependency management systems. However, managing dependency conflicts proved more challenging than first predicted due to the complexity of the problem (explained in the chapter Challenges with dependency management). Dependency conflicts cannot always be solved automatically, and it is therefore not always possible to request dependencies of specific versions. Instead, we argue that the best way forward is to track the dependencies and make developers aware of the conflicts. This is somewhat contradictory to what Abate et al. (2020) state about the need to track the full hierarchy of dependencies. We argue that it is enough to have tools that track the hierarchy of dependencies. If the whole hierarchy of dependencies is located in the top-level project, there is a risk of introducing data redundancy concerning which dependencies are actually used.

During the research, a range of more complex problems surfaced, such as those discussed in the chapter Challenges with dependency management. By describing these problems, we hope to open new dialogs and possible future research on them. While most of the research conducted in this area touches on package managers in one way or another, we landed on the manifest solution. This was mostly because of the inconveniences related to package managers, which made them a less feasible solution for Saab T&S. Package managers work well in a system where the dependencies are static. When development is done in many parts of the system, they create inconvenient workflows: every time a change is made to one part of a system in development, a new package has to be created and uploaded to the server before it can be used. Using a manifest to track the dependencies in projects, and within the components themselves, was instead a more fitting solution for Saab T&S due to the possibility of describing dependencies dynamically.

6.2 Method discussion

The DSR method used in this study worked very well. However, we found it challenging to apply the rigor cycle, as the current research on dependency management is largely focused on package managers and dependency mapping systems. Both the design cycle and the relevance cycle, on the other hand, worked excellently due to our continuous communication with the domain experts at Saab T&S. The method would be optimal when applied in an area that has been researched for a long time and that is closely related to developers and their opinions. The area of dependency management nearly requires feedback from developers (in our case, experts working in the industry), because they have a lot of knowledge about how it is done today and what needs to be improved. As argued earlier, the current research in this area is somewhat limited, and we hope that it will be easier to conduct DSR as the knowledge base grows.

6.3 Limitations

We believe the purpose of this study helped us stay on track for its entire duration, which led to clear answers to our research questions. Our answer to the third research question, “How can dependency problems be handled?”, is not the only solution; it is one solution to a very complex problem. The relevance cycle of DSR was very advantageous for answering our research questions, because it meant talking to the experts at Saab T&S, who work with these types of problems every day and have extensive knowledge in the area.

A limitation of this study is that our solution, while being a method for using manifest files with an overlay, is not a finished product. This was mainly because of the time frame of the study. In addition, while it is easy to find tags and relevant branches, such as release branches, in Git, this might not be true for other VCSs or company structures.

The validity and reliability of this study were achieved to a certain degree, owing to extensive testing on Git projects that we structured to contain the different dependency problems that exist. They could have been improved if the time frame, the complexity of the problem, or the scope of the research had allowed it. Given more time to research this subject, we would hope to automate the process of generating manifest files, which would further strengthen the validity and reliability of this study.

One thing we would have done differently is to visit Saab’s office earlier in the study to get direct feedback and to discuss the problems we faced in person. We did not go there early in the study, both because of the ongoing pandemic and because we did not realize how much it would help us. Doing so could have improved our view of the problems, and of what to include in the solution, faster. When we worked at the office, we could show our take on a solution directly and get feedback on it, which gave us a faster relevance cycle.

6.4 Related work

Software dependency management has many sides, and in this paper we have focused on dependency management in multi-repositories. The need for properly managing dependencies in component-based software development has been pointed out by Lehman and Ramil (2000), and Vieira and Richardson (2002) argue that dependencies should be treated as a first-class problem. Quite a few papers discuss software dependencies and dependency problems (also referred to as DLL hell and dependency hell); Abate et al. (2020), Fan et al. (2020), Florisson and Mycroft (2015), and Jaspan et al. (2018) are a few that have provided us with good descriptions of these problems. However, the research is generally limited to PMs and their dependency solving, or to the mapping of complex systems (Fan et al., 2020; Sangal et al., 2005). Abate et al. (2012, 2020) discuss the practice of dependency solving as a separate concern, thoroughly review state-of-the-art PMs, and argue that they are not up to the task. However, we argue that a PM is not a suitable solution due to the nature of how these systems work.

As there is limited research on dependency management in MR, we have reviewed the functionality that VCSs offer for handling dependencies. For this we have mostly relied on documentation and studies of these systems: Chacon and Straub (2014), Loeliger and McCullough (2012), and Perez De Rosso and Jackson (2013) for Git, Pilato et al. (2008) for Subversion, and O'Sullivan (2009) for Mercurial. Jaspan et al. (2018) provided us with an in-depth case study of why Google is moving towards a mono-repository, precisely to avoid many of the dependency problems presented in this paper. Even though we have argued that a PM is not an adequate solution in this situation, many good package managers exist; see Wikipedia (2021).


7 Conclusions and further research

In this chapter, we conclude and summarize our research, both its practical and its scientific implications, and suggest directions for future research.

7.1 Conclusions

In this study, we used the DSR framework to explore the practices of dependency management and to suggest a novel method for dependency management. Our aim was to re-open the dialog about how dependency management can be improved and to bring our take on a potential solution to the table. VCSs are essential for every kind of developer; everyone from individual developers to large corporations uses them. VCSs neither work nor look the same today as they did twenty years ago. By acknowledging the current drawbacks and limitations of VCSs, we hope to help their continued development by showing our solution, which could give developers ideas on how to improve VCSs in the future.

7.1.1 Practical implications

This study has the potential to have large implications for the industry, where we hope to ease dependency management for developers, especially for those working on many projects or components simultaneously. Companies today spend a lot of resources on making sure that projects can be built and on avoiding dependency problems. If this task required fewer resources, more focus could be put on development or on other areas of a company.

7.1.2 Scientific implication

The current research on dependency management is limited, apart from research about package managers or research describing the problems that arise while managing dependencies. Our goal is that this study provides a foundation for others to continue researching this area of computer science.

7.2 Further research

Our solution is only a start for what we believe can be done with dependency management. Enabling automated handling of dependencies to avoid problems in VCSs is a topic that needs further research. It would prove helpful for developers, since fewer resources would be spent on managing dependencies. As stated earlier, many interesting questions and problems arise when analyzing the problems in MR, for example how the workflow can be improved to avoid some of the inconveniences with merging strategies as well as update problems. Mitigating the effects of these problems will be of great value to developers. More research is necessary for the development of tools and methods for the management of, and workflow with, dependencies. One way to start is by researching how to automate the process of generating manifest files, which would increase efficiency as well as minimize the risk of human error.

8 Acknowledgments

First off, we would like to thank our supervisor, Ragnar Nohre, for his insightful feedback during this thesis period. We are grateful to everybody at Saab Training and Simulation who helped make this research possible by allowing us to study their work with dependency management. A special thank you goes to the domain experts in DevOps and IT architecture who took the time to help us grasp the complex problems in dependency management.


9 References

Abate, P., Di Cosmo, R., Gousios, G., & Zacchiroli, S. (2020). Dependency solving is still hard, but we are getting better at it. 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER).
Abate, P., Di Cosmo, R., Treinen, R., & Zacchiroli, S. (2012). Dependency solving: a separate concern in component evolution management. Journal of Systems and Software, 85(10), 2228-2240.
Bieman, J. M., & Kang, B.-K. (1995). Cohesion and reuse in an object-oriented system. ACM SIGSOFT Software Engineering Notes, 20(SI), 259-262.
Cappos, J., Samuel, J., Baker, S., & Hartman, J. H. (2008). A look in the mirror: Attacks on package managers.
Chacon, S., & Straub, B. (2014). Pro Git. Springer Nature.
Collins-Sussman, B., Fitzpatrick, B. W., & Pilato, C. M. Externals Definitions. Retrieved April 5th from http://svnbook.red-bean.com/en/1.7/svn.advanced.externals.html
Cook, S. A. (1971). The complexity of theorem-proving procedures.
De Alwis, B., & Sillito, J. (2009). Why are software projects moving from centralized to decentralized version control systems?
Fan, G., Wang, C., Wu, R., Xiao, X., Shi, Q., & Zhang, C. (2020). Escaping dependency hell: finding build dependency errors with the unified dependency graph.
Florisson, M., & Mycroft, A. (2015). Towards a Theory of Packages.
Hevner, A. R. (2007). A three cycle view of design science research. Scandinavian Journal of Information Systems, 19(2), 4.
Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design science in information systems research. MIS Quarterly, 75-105.
Jaspan, C., Jorde, M., Knight, A., Sadowski, C., Smith, E. K., Winter, C., & Murphy-Hill, E. (2018). Advantages and disadvantages of a monolithic repository: a case study at Google. Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice.
Lehman, M. M. (1980). Programs, life cycles, and laws of software evolution. Proceedings of the IEEE, 68(9), 1060-1076. https://doi.org/10.1109/PROC.1980.11805
Lehman, M. M. (1996). Laws of software evolution revisited.
Lehman, M. M., & Ramil, J. F. (2000). Software evolution in the age of component-based software engineering. IEE Proceedings. Software, 147(6), 249. https://doi.org/10.1049/ip-sen:20000922
Lewis, H. R. (1983). Michael R. Garey and David S. Johnson. Computers and intractability. A guide to the theory of NP-completeness. W. H. Freeman and Company, San Francisco 1979, x + 338 pp. Journal of Symbolic Logic, 48(2), 498-500. https://doi.org/10.2307/2273574
Loeliger, J., & McCullough, M. (2012). Version Control with Git: Powerful tools and techniques for collaborative software development. O'Reilly Media, Inc.
Mettler, T., Eurich, M., & Winter, R. (2014). On the Use of Experiments in Design Science Research: A Proposition of an Evaluation Framework. Communications of the Association for Information Systems, 34, 223-240. https://doi.org/10.17705/1CAIS.03410
O'Sullivan, B. (2009). Mercurial: The Definitive Guide. O'Reilly Media, Inc.
OpenHub. Compare Repositories. Open Hub. Retrieved April 4th from https://www.openhub.net/repositories/compare
Otte, S. (2009). Version control systems. Computer Systems and Telematics, 11-13.
Perez De Rosso, S., & Jackson, D. (2013). What's wrong with git? A conceptual design analysis. ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software.
Pilato, C. M., Collins-Sussman, B., & Fitzpatrick, B. W. (2008). Version control with Subversion: next generation open source version control. O'Reilly Media, Inc.
Ruparelia, N. B. (2010). The history of version control. ACM SIGSOFT Software Engineering Notes, 35(1), 5-9.
Sangal, N., Jordan, E., Sinha, V., & Jackson, D. (2005). Using dependency models to manage complex software architecture.
Spinellis, D. (2005). Version control systems. IEEE Software, 22(5), 108-109. https://doi.org/10.1109/MS.2005.140
StackOverflow. (2018). Developer Survey Results 2018. Retrieved April 4th from https://insights.stackoverflow.com/survey/2018#work-_-version-control
Vieira, M., & Richardson, D. (2002). The role of dependencies in component-based systems evolution. Proceedings of the International Workshop on Principles of Software Evolution.
Vouillon, J., & Cosmo, R. D. (2013). On software component co-installability. ACM Transactions on Software Engineering and Methodology (TOSEM), 22(4), 1-35.
Wikipedia. (2021). Package Managers. Retrieved May 10 from
Zolkifli, N. N., Ngah, A., & Deraman, A. (2018). Version control system: A review. Procedia Computer Science, 135, 408-415.
