Broken Sets in Software Repository Evolution
Total Page:16
File Type:pdf, Size:1020Kb
Broken Sets in Software Repository Evolution Jer´ omeˆ Vouillon Roberto Di Cosmo CNRS, PPS UMR 7126, Univ Paris Diderot, Sorbonne Paris Cite´ Univ Paris Diderot, Sorbonne Paris Cite´ PPS, UMR 7126, CNRS, F-75205 [email protected] [email protected] Abstract—Modern software systems are built by composing 1 Package: tesseract-ocr components drawn from large repositories, whose size and 2 Source: tesseract (2.04-2.1) 3 Version: 2.04-2.1+b1 complexity increase at a fast pace. Software systems built with 4 Depends: libc6 (>= 2.2.5), libgcc1 (>= 1:4.1.1), components from a release of a repository should be seamlessly 5 libjpeg8 (>= 8c), libstdc++6 (>= 4.1.1), upgradeable using components from the next release. Unfortu- 6 libtiff4, zlib1g (>= 1:1.1.4), nately, users are often confronted with sets of components that 7 tesseract-ocr-eng | tesseract-ocr-language 8 were installed together, but cannot be upgraded together to the 9 Package: tesseract-ocr-eng latest version from the new repository. Identifying these broken 10 Source: tesseract-eng sets can be of great help for a quality assurance team, that could 11 Version: 3.02-2 examine and fix these issues well before they reach the end user. 12 C o n f l i c t s : tesseract-ocr (<< 3.02-2) Building on previous work on component co-installability, we show that it is possible to find these broken sets for any Fig. 1. Inter-package relationships of tesseract-ocr, an optical character two releases of a component repository, computing extremely recognition engine, and tesseract-ocr-eng, the english language pack, efficiently a concise representation of these upgrade issues, as found on 20 February 2012, in the testing suite of the Debian GNU/Linux together with informative graphical explanations. distribution. A tool implementing the algorithm presented in this paper is available as free software, and is able to process the evolution between two major releases of the Debian GNU/Linux distri- components which were co-installable in the old repository, bution in just a few seconds. These results make it possible to but become non co-installable in the new release. integrate seamlessly this analysis in a repository development process. Sets of components with this property, which we call broken sets, are particularly damaging in the evolution of a I. INTRODUCTION repository: their existence means that there may be perfectly functional deployments based on the old repository that will Component-based software architectures, maintained in a be disrupted as soon as one tries to upgrade their components distributed fashion and evolving at a very quick pace are to the version in the new repository. The configuration in nowadays commonplace, in particular in the world of free and Example I is a real world example of such a broken set: the open source software (FOSS). Components are usually made tesseract optical character recognition program, which available via a repository, and are equipped with metadata, is split in several related packages1, was perfectly functional such as dependencies and conflicts, that specify concisely the before February 20th 2012, but on that day, the introduction of contexts in which a component can or cannot be installed. the updated english language pack tesseract-ocr-eng A typical example taken from the Debian GNU/Linux made the installation of this program temporarily impossible distribution, is shown in Figure I, where we can see that for the English language. the logical language used for expressing dependencies and Broken sets need to be identified early in the evolution conflicts is quite powerful, as it allows conjunctions (symbol process, and fixed well before the release of the new ‘,’), disjunctions (symbol ‘|’) and version constraints. repository, but finding broken sets in repositories whose size Maintaining and evolving component repositories is an is in the tens of thousands of components is a daunting task: important task and requires an extensive quality assurance there may be exponentially many broken sets, and listing process: besides traditional issues concerning the bugs in the them all would be both computationally unfeasible and of no code inside each component, the quality of a large component use for a quality assurance team, which would be flooded repository rests also on how well components can be combined under the error reports. with each other, a property known as co-installability [24]. Similarly to what happens with the source code of a single In this paper, we describe a highly efficient algorithm that component, which is passed through regression tests to ensure solves the problem by finding a very small subset of the broken that one did not re-introduce issues that were fixed before [18], sets, which subsume all the others, and is close to minimal, in we are naturally led to check whether there is any set of a sense that will be made precise later on. Work performed at the IRILL centre for Free Software Research and 1Essentially, tesseract-ocr for the core engine, and Innovation (http://www.irill.org). tesseract-ocr-lang for all supported language lang We have developed a tool that implements this algorithm are co-installable in a repository R if they are all included in and finds such minimal problematic configurations on real some healthy installation I of the repository. component repositories in a few seconds; it also provides very concise explanations that allow to identify the origin of the B. Repository Transformations problem. We found several such issues in the evolution of the To identify co-installability issues in a single repository, Debian distribution using the tool. [24] introduced a set of repository transformations that al- The paper is organised as follows: Section II briefly recalls low to extract from a large repository a much smaller the basic notions about packages and dependencies, and the co-installability kernel which is equivalent as far as co- formally certified semantic preserving repository transforma- installability is concerned. The gain allowed by these transfor- tions developed in [24] which our work builds upon, and mations is impressive: starting from a Ubuntu repository con- then presents the key ideas of our approach. Section III taining 7277 packages, 31069 dependencies and 82 conflicts, presents some of the results obtained by our tool on real-world one goes down to a kernel containing only 100 elements, with GNU/Linux distributions. The technical development follows: 29 dependencies and 60 conflicts (similar gains are obtained Section IV provides a precise formalisation of all the transfor- on Debian or Mandriva repositories). A description of these mations and simplifications that can be performed to reduce transformations can be found in section 2 of [24], where they dramatically the search space for explanations; Section V are also shown correct (the proofs are certified in Coq [20]): describes in detail an algorithm that uses these results and for completeness we recall here the main steps involved. builds a set of broken sets that cover all upgrade issues, and we 1) Flattening Dependencies: The first transformation re- show in Section VI how we provide minimal explanations that places the dependency function D of a repository R = contain all the relevant information for understanding the cause (P; D; C) with a special flattened form Db with two key of these broken sets. We discuss related work in Section VII properties: first, by applying a sort of transitive closure to D, and Section VIII concludes. it directly describes all dependencies of each package, so if 1 k1 1 kn II. OVERALL APPROACH Db (π) = ffπ1; :::; π1 g; :::fπn; :::; πn gg; then the packages πj are all the packages relevant for installing package π, While the concrete details may vary from one technology i and only them. On the example repository of Figure 2a, this to the other, the core metadata found in component based amounts to adding a dependency from d to f, and one from systems always allows to express a few fundamental prop- d to g (Figure 2b). Second, since we are only interested erties: a component, called package in the following, may in co-installability, we can prune the expanded dependency depend on a combination of components, expressed as a function by removing any dependency containing a package conjunction of disjunctions of components, and a component with no conflicts. On the example repository of Figure 2a, may conflict with a combination of components, expressed this pruning phase removes the dependencies from d to e, as a conjunction. Extra properties like versioned constraints from d to g and from e to g, leading to the repository of (e.g. libc6 (>= 2.2.5) in Figure I) can be preprocessed Figure 2c. Finally, to render the repository more homogeneous out [14], so we focus on a core dependency system containing D adds a self dependency to each package with conflicts, a binary symmetric conflict relation, and a dependency func- b leading to Figure 2d. Packages d, e and f now have the same tion D(π) = ffπ1; :::; π1 g; :::; fπk; :::; πk gg that is satisfied 1 n1 1 nk dependencies, which reflects the intuition that they behave the when for each i at least one of the πi is installed. We follow j same way as far as co-installability is concerned. the notations of earlier works [14], [24], that we briefly recall. These three operations (transitive closure, pruning, and addi- A. Repositories tion of self conflicts) leave co-installability invariant (Theorem A repository is a tuple R = (P; D; C ) where P is a 6 in [24]). They are performed in a single pass, and can be finite set of packages, D : P !P(P(P)) is the dependency formally defined as follows; given a repository (P; D; C ), the function (we write P(X) for the set of subsets of X), and flattened dependency function Db is the smallest function (with C , a symmetric irreflexive relation over P, is the conflict respect to point-wise inclusion) such that: relation.