The Ultimate Debian Database: Consolidating Bazaar Metadata for Quality Assurance and Data Mining

Lucas Nussbaum, LORIA / Nancy-Université, Nancy, France. Email: [email protected]
Stefano Zacchiroli, Université Paris Diderot, PPS, UMR 7126, Paris, France. Email: [email protected]

Abstract—FLOSS distributions like RedHat and Debian require a lot more complex infrastructures than most other FLOSS projects. In the case of community-driven distributions like Debian, the development of such an infrastructure is often not very organized, leading to new data sources being added in an impromptu manner while hackers set up new services that gain acceptance in the community. Mixing and matching data is then harder than it should be, albeit being badly needed for Quality Assurance and data mining. Massive refactoring and integration is not a viable solution either, due to the constraints imposed by the bazaar development model.

This paper presents the Ultimate Debian Database (UDD),1 which is the countermeasure adopted by the Debian project to the above "data hell". UDD gathers data from various data sources into a single, central SQL database, turning Quality Assurance needs that could not be easily implemented before into simple SQL queries. The paper also discusses the customs that have contributed to the data hell, the lessons learnt while designing UDD, and its applications and potentialities for data mining on FLOSS distributions.

Keywords—open source; distribution; data warehouse; quality assurance; data mining

I. INTRODUCTION

Most FLOSS (Free/Libre Open Source Software) projects require some infrastructure: a version control system (VCS) to store code, a bug tracker to organize development work, mailing lists for user and developer communication, and a homepage for "marketing" purposes. The needs are usually that simple, and are also similar across projects. That similarity has paved the way to the diffusion of forge software and providers of one-size-fits-all FLOSS infrastructures, which have been steadily growing for the past decade [1]. Such a de facto standardization of infrastructure eases all tasks that require mining and comparing facts about, for instance, all projects hosted on the very same (or across very similar) forge(s) (e.g. [2], [3]).

While infrastructure standardization is common among FLOSS development projects—whose aim is to develop and periodically release software components, usually in source form—it is not as common among FLOSS distributions. The role of such distributions (e.g. Debian, Fedora, Ubuntu, openSUSE, Gentoo, to name just a few) is to mediate between "upstream" development projects and their final users. In doing so, distributions try to ease software installation and configuration, often by means of the package abstraction [4]. The peculiar role of distributions makes their infrastructure needs different from those of development projects; some of their peculiar needs are:

• archive management that enables developers to release new packages, usually to fix bugs or in response to new upstream releases;
• build bots (or build daemons) that rebuild packages on architectures other than those available to individual maintainers;
• support for distribution-wide rebuilds to evaluate the impact of core toolchain (e.g. gcc) changes [5];
• multi-package bug tracking able to scale up to millions of bugs;
• automated testing to check that packages adhere to some distribution policy, which might change over time;
• developer tracking to quickly identify developers whose interest fades away [6];
• support for different lines of development (or suites, e.g. stable, unstable, . . . ), corresponding to independently evolving sets of packages.

The sheer amount of packages to be managed, the resources needed, the relatively small number of FLOSS distributions, and the existence of distribution-specific customizations of the above processes have hindered the standardization of distribution infrastructures: every distribution has developed its own infrastructure. Still, the ability to combine information coming from different distribution data sources is compelling for distributions, as it is for researchers willing to mine and study facts about them.

The need of researchers to have convenient access to distribution data is easily explained: they want to focus on data mining rather than on the gazillion of technologies under which the data are hidden.

Stefano Zacchiroli is partially supported by the European Community FP7, MANCOOSI project, grant agreement n. 214898. Unless otherwise stated, cited URLs have been retrieved on 12/01/2010.
1 http://udd.debian.org

As an example of distribution needs to combine data, we consider Quality Assurance (QA) in the context of Debian, a very large distribution currently formed by about 27,000 binary packages [7]. The task boils down to identifying sub-standard quality packages which are in need of specific actions such as technical changes, assignment to a new maintainer, or removal from the archive.

Example 1: The authors, as members of the Debian QA team, often face the following question:2

    Which release critical bugs affect source packages in the testing suite, sorted by decreasing package popularity?

2 A few Debian terminology explanations are in order: release critical (RC) bugs are those whose severity makes a package unsuitable for release; testing is the suite meant to become the next stable release.

The affected packages are in urgent need of attention, given how popular and at the same time how buggy they are. Implementing a monitor that periodically answers the above question turns out to be surprisingly difficult in Debian, due to the amount of involved data sources (binary and source package metadata, bug tracker, suite status, popularity, . . . ) and, more importantly, due to their heterogeneity.

This paper reports about the countermeasure that has been adopted to such a difficulty—called the Ultimate Debian Database (UDD)—and how it implements data integration for the purposes of both data mining and Quality Assurance.

Paper structure: Section II discusses the data integration problem in Debian and the peculiar factors which have originated it, while Section III details the architecture of UDD. Section IV gives an overview of the current status of UDD and some of its figures. Some applications are presented in Section V, and a comparison with related work is made in Section VI.

II. THE DEBIAN DATA HELL

The difficulty of joining together different data sources, which has plagued the Debian project for several years, is colloquially referred to as the Debian data hell. Before analyzing UDD as a solution to it, we summarize here the factors that have caused the Debian data hell:

• heterogeneity
• community inertia
• long history
• maintainers' programming and design skills
• tight coupling and service distribution

Discussing these factors highlights how the data hell cannot easily appear in distributions whose infrastructures are run by (wise) companies, and how it is likely to plague only purely community-driven distributions. Among mainstream distributions, that boils down to Debian itself and Gentoo.3

3 Private communication with Gentoo developers has indeed confirmed that they suffer, at present, from similar problems.

Heterogeneity: The basic cause of the Debian data hell is of course heterogeneity. Even though the basic entities are fairly clear albeit complex (packages, bugs, . . . ), the services dealing with them have been implemented according to sharply different design choices. To give an idea of that, Table I shows the core parts of the Debian infrastructure, summarizing, for each service, the main design choices in terms of implementation language and data storage format: languages vary from system-level languages such as C to scripting languages (Shell, Perl, Python) and even XSLT; data formats vary from databases of all kinds (SQL, BerkeleyDB, LDAP) to ad-hoc textual languages and even plain mailboxes(!).

Table I. DESIGN CHOICES IN THE DEBIAN INFRASTRUCTURE: PROGRAMMING LANGUAGE AND DATA FORMAT

Purpose | Name | Implementation language and data storage format
Archive management | dak | Python, problem-specific PostgreSQL database. Data exported as text files formatted similarly to (but not equal to) RFC 2822 email.
Distribution of package builds over build daemons | wanna-build, buildd.debian.org | PostgreSQL database (now), Berkeley DB (until recently). Data exported on the Web, with an interface not suitable for massive retrieval.
Bug tracker | debbugs, bugs.debian.org | Perl, per-bug text files and mailboxes. Data exported on the Web; SOAP API available, but not suitable for massive retrieval.
Management of developer accounts | Debian LDAP, keyring.debian.org | LDAP (accessible using standard LDAP tools and APIs), GnuPG keyrings.
Developers identity tracking (email addresses) | carnivore | Perl, Berkeley DB.
History of package uploads | — | Mailing list archives.
Upstream version tracking | DEHS, dehs..debian.org | PHP, PostgreSQL database.
Package popularity | popcon.debian.org | Perl, all results exported as a line-oriented text database, emails (to record the submissions).
Policy conformance check (à la continuous integration) | lintian.debian.org | Perl, all results exported on the web as a line-oriented text database.
Package classification | debtags | C++, all data exported on the web as a line-oriented text database.
Package tracking system | packages.qa.debian.org | Python, XSLT, per-package XML files. SOAP API available, but not suitable for massive retrieval.
Web site | www.debian.org | WML (Website META Language), static HTML pages.

Community inertia: Abstracting over that heterogeneity is a typical integration problem [8]. What is peculiar is the context, and that in turn impacts on the viable solutions. For instance, one cannot decide from scratch to switch all the data storage to a specific database, for several reasons:

• There is no central authority that can take the decision; it must be taken by the community. In particular, the decision must be agreed upon by the current maintainers of the affected services. It would be otherwise hard to complete the switch, and hijacking of such important infrastructure maintenance is unlikely.
• Those maintainers are often fond of their original design choices (they made them because they liked them) and do not necessarily care about the "greater good" of data integration. This is neither malevolent nor surprising: it goes along with the bazaar model, where hackers mainly scratch their own itches [9].
• Even the developer community at large often lacks the vision of the greater good, because data hell drawbacks are mostly experienced by a few cross-cutting teams (such as QA) and by researchers, who have a hard time mining data. That contributes to making it even harder to find consensus (and manpower) around infrastructure-wide changes: if it is not broken, don't fix it.

We observe that among distribution developers, the most common "itches" are packaging-related, not maintaining the infrastructure needed to achieve that. That fully explains the choice made by companies backing distributions (e.g. Suse for openSUSE, RedHat for Fedora, Canonical for Ubuntu, etc.) of maintaining the distribution infrastructure. That enables distribution developers to concentrate their efforts on tasks other than maintaining the infrastructure. In those contexts, community inertia (for what concerns the infrastructure) vanishes: the company can decide—and then pay to achieve it—to massively refactor services to obtain uniformity and ease data integration.

Long history: Having now passed 15 years of history, Debian is one of the oldest distributions. New components of its infrastructure have been added periodically along its lifetime, as soon as someone implemented them and they gained acceptance among developers. The technology chosen to implement services reflects the hacker trends at the time of the first implementation. For instance, it is clear now that XML technologies are not particularly welcome among Debian developers (mostly because they are seen as bloated), but they were "trendy" when the Package Tracking System (PTS)4 was first implemented in 2002. That has implied that only a few people are able/willing to hack on the PTS nowadays, and inertia hinders its re-engineering (if it is not broken . . . ). Another peculiar example is the Debian website, which is now stuck with a technology (WML) that has been abandoned upstream. Conversely, the long history of Debian contributes to its interest as a subject for observation [10], [11], [12]: UDD is meant to ease that, trying to compensate for the drawbacks that a long history has induced.

4 http://packages.qa.debian.org

Maintainers' programming and design skills: Members of the Debian community are good packagers and debuggers, but not necessarily good software developers or architects. When it occurs to them to create a new infrastructure component, they can make bad design choices due to their lack of experience. Later on, inertia makes it hard to fix them once the service gains acceptance.

Tight coupling and service distribution: The lack of a company providing the infrastructure means that hardware resources are sought via sponsoring and are scattered around the planet, where sponsors can host them. Services are nevertheless tightly coupled and at the same time open, making their data publicly available to project developers. Hence inter-service dependencies are neither known a priori by the service maintainer, nor easily identifiable. This factor augments the resistance to change, because the community grows accustomed to the experience that any change, no matter how small, will break something.

Summarizing the discussed factors, we obtain a rather depressing scenario for both Quality Assurance investigations and data mining, which both need an easy way to correlate distribution data sources together. UDD, which we describe next, provides a solution to that, still playing by the rules of a purely community-driven distribution.

III. ARCHITECTURE

The development of the Ultimate Debian Database (UDD) started in 2008 with the aim of providing a unique solution to the data hell problem, by creating a central database where the data located in the various Debian sources would be imported. It was decided early on that such a database would only act as a replica of the canonical data, and would not attempt to replace them: it was not considered possible at the time to win over the discussed community inertia.

A. Goals and Scope

UDD was designed and developed with three different kinds of target applications in mind: quality assurance, collaboration with derivatives, and data mining.

Quality Assurance: The Debian QA team works to improve the quality of the distribution as a whole. That includes spotting poor-quality, buggy, or unmaintained packages among the ≈15,000 source packages in Debian, and maintainers that neglect their packages among the ≈1,000 official Debian developers and the tens of thousands of contributors. Such tasks require combining different data sources about package quality (bugs, results of automated tests, package popularity, . . . ) and maintainer activity (uploads, activity on mailing lists, . . . ). Before UDD, the identification of neglected packages (and neglecting maintainers) relied mainly on manual reporting by users or other developers and on a few semi-automated tools [13] (for maintainers).

Collaboration with derivatives: Debian plays an important role as the root for many derivative distributions, including Ubuntu, Sidux, and the educational distribution Skolelinux. Collaborating with those other distributions requires the exchange of information about packages and bugs to be able to efficiently share development efforts. For instance, when a package is modified in Ubuntu to fix a bug reported by an Ubuntu user, it is important that the Debian maintainer is aware of the change, to be able to integrate it in the Debian package and therefore keep the delta between Ubuntu and Debian at a manageable level. The use of proprietary or difficult-to-access data storage on both sides limits the opportunities for data exchange. Coincidentally, before UDD the exchange of information between the two distributions was limited to the respective lists of packages and their version numbers.

Data mining: As the largest FLOSS distribution, as one of the oldest, and as a purely community-driven distribution, Debian provides many opportunities for data mining of interesting facts. However, it is often discouraging to mine Debian because of the efforts required to overcome the various problems described in Section II. Since UDD has been developed by Debian developers with a lot of Debian expertise, it was possible to design a database schema which avoids the usual traps that people encounter when data mining Debian, like the use of the same word—package—for different objects—source package, binary package. As a result, UDD enables people with less Debian expertise to mine Debian, lowering the entry barrier dramatically. While it was not the main goal for UDD, it is an interesting offspring.

Figure 1. Data flow in the Ultimate Debian Database.

B. Data Flow

The architecture of UDD, whose data flow is depicted in Figure 1, is rather straightforward. For each kind of data source (e.g. a specific bug tracking system or package archive) there is a gatherer software component which implements an abstract interface and abstracts over any data source instance. By using the gatherer API one can, intuitively, obtain a set of INSERT queries that performs the injection of a specific data source into UDD. The code of gatherers is where the complexity and heterogeneity of accessing the various data sources is hidden, shielding UDD users from them. The main UDD updater periodically invokes gatherers. Periods vary from gatherer to gatherer, to cope with data sources updated with varying frequencies. The updater is also responsible for maintaining some auxiliary tables and can be triggered out of band if needed.
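To make the gatherer contract concrete, the following is a minimal sketch of the kind of statement batch a gatherer could hand over for injection, using a popcon-like data source as an example. It is illustrative only: the column list and the sample rows are assumptions, not the actual gatherer output or UDD schema, and the full-reload-inside-a-transaction pattern anticipates the design choices discussed below.

-- Illustrative sketch, not actual gatherer output: reload one data source
-- inside a single transaction so readers always see a consistent snapshot.
begin;
delete from popcon;                    -- complete reload rather than a partial update
insert into popcon (package, insts)    -- column names assumed for illustration
values ('pkg-a', 12345),               -- placeholder rows parsed from the external source
       ('pkg-b', 12340);
commit;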

C. Design Choices

The above described architecture descends from a few key design choices, which are detailed below.

General-purpose, user-friendliness: First, UDD has not been developed to answer a specific set of queries. While QA needs have provided an initial query test bench, the database schema is meant to be general, with as few assumptions as possible on the final use case. Second, when the choice was possible, staying as close as possible to the structure of the original data was favored over performance. For instance, surrogate keys5 have been avoided and real data (package names, bug IDs, . . . ) have been preferred, even when that has meant using an aggregation of columns as primary key. For users, who are usually familiar with the original data source, this simplifies query development. As evidence of that, we have witnessed several external contributions of new gatherers by people who were not involved in the initial development of UDD. According to contributors, the ability to query data via SQL and to join them with other data sources already injected into UDD was valuable. Nevertheless, this choice hinders neither the development of applications using UDD as a basis, nor the optimization for those applications by means of indexes.

5 A surrogate key is an arbitrary unique identifier (usually a sequential number) used as a key, which is not derived from any application data.

Inconsistency preservation: Due to the nature of data organization in Debian (a lot of distributed services, without much shared data), inconsistencies can exist among services. For example, it is possible to report bugs against, and track the popularity of, nonexistent packages. Even more so, different services use different names for basic concepts such as package: while for archive management services a package is defined as the 4-tuple ⟨distribution, suite, package, version⟩, it is simply defined as the package name in the popularity contest.

In data integration, it is often tempting to remove inconsistencies from the data to simplify the modeling. In UDD inconsistencies have been preserved, because they are likely to be interesting for several use cases, including Quality Assurance. Inconsistencies provide a lot of useful information—e.g. one could investigate the packages that are installed on user systems, but are not provided by Debian, to find popular software that needs official packaging—but they also make it harder to write correct queries. To mitigate this problem, sanitized versions of specific data are provided via SQL VIEWs as needed.

An important consequence of maintaining inconsistencies is that it is often not possible to link data from different data sources together with foreign keys. It might then be difficult for the user to determine how to use such data together. To mitigate this, UDD enforces consistent naming conventions for the same concept regardless of its origin data source. For example, an important distinction in Debian exists between source packages, which are called source in UDD, and binary packages, called package. This sometimes means drifting away from the original naming, e.g. in the wanna-build service (source package build management), where source packages are called packages.
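Section V (footnote 7) mentions one such convenience view, sources_popcon, which exposes a single aggregated popularity value per source package. The definition below is only a sketch of what such a sanitizing view could look like; the join path and column names are assumptions, not the actual UDD definition.

-- Sketch only: a possible shape for a sanitizing convenience view in the
-- spirit of sources_popcon; join path and column names are assumptions.
create view sources_popcon as
select s.source, max(p.insts) as insts   -- one aggregated popularity value per source package
from sources s
join packages b on b.source = s.source   -- binary packages built from that source
join popcon p on p.package = b.package   -- reported installations per binary package
group by s.source;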
Faithful gathering: Some form of correctness in data gathering is essential. For instance, various QA monitors are used to mail people about the status of their packages. Doing that on the basis of incorrect data can be perceived as spamming by recipient developers, possibly diminishing the usefulness of the monitor and the likelihood of its acceptance in the community. Researchers data mining UDD have similar concerns. Since UDD cannot "fix" incorrectness in the external data sources, and given that time lags are the norm in data warehousing solutions, the notion of correctness for UDD is faithful gathering: after each gatherer invocation, tables are requested to contain a representation of the external data source at the time of gathering.

The main consequence of this desideratum is the avoidance of partial updates—i.e. processes in which deltas between the previous table status and the data source are computed, and then applied to obtain the new table status. The chosen alternative solution is to perform complete data reloads at each run. The additional benefit is simplicity: UDD can always be thrown away and rebuilt from scratch by simply re-running all data gatherers. Since data reloads are potentially slower than incremental updates, transactions are used to always offer consistent snapshots to users.

Storage of historical data: Ideally, UDD aims at storing both current state and historical information about gathered data sources, given that both might be useful to pursue its goals. Unfortunately, due to size issues (see Section IV for some figures), we cannot currently afford to take periodic yet naive snapshots of all its tables, in a way that would enable querying historical data via SQL. Two partial solutions have been implemented to alleviate the drawbacks of such a limitation. The first one is the periodic publishing of complete database dumps, which are currently taken every 2 days and made available via the UDD home page. Independent parties can periodically retrieve the dumps and inject them into an external data warehousing solution with the resources to support the time axis. Interest by the FLOSSmole project [14] in doing that has already been expressed6 and is being implemented.

6 http://lists.debian.org/debian-qa/2009/06/msg00005.html

Additionally, the choice of complete data reloads has been traded off in a companion yet separate database called UDD history. Daily, the UDD updater takes snapshots of specific aggregated values and stores them in history tables where each tuple is associated to a validity timestamp. For instance, the history.sources_count table contains snapshots of total source package counts, such as how many packages declare a specific version control system, how many have been switched to new source formats, or even more simply the daily total number of packages in the archive. Interesting trends in package maintenance are being observed within Debian on the basis of such data; for instance, Figure 2 shows the evolution of version control system usage for package maintenance over the last 2 years.

Figure 2. Historical evolution of Version Control System usage to maintain Debian source packages, based on UDD data.

Unlike UDD, the history database cannot be recreated from scratch, therefore it is subject to specific backup care.
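As a concrete illustration of how the history tables can be used, the query sketch below retrieves the data behind a trend like the one in Figure 2. The column names (ts, vcs, n_sources) are assumptions made for the sake of the example, not the actual schema of history.sources_count.

-- Sketch only: VCS usage trend from a UDD history table; column names
-- (ts, vcs, n_sources) are assumed for illustration, not the actual schema.
select ts::date as day, vcs, n_sources
from history.sources_count
where vcs is not null
order by day, vcs;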

IV. CURRENT STATUS

Currently, UDD sports 17 gatherers (see Table II), covering all the core parts of the Debian infrastructure, and several other parts that have been added on a by-need basis. The gatherers import data into a PostgreSQL database, which comprises 60 tables and over 7 million tuples, for a total size of 3.8 GB (excluding indexes). A full SQL dump is about 450 MB of gzip-ed SQL statements. Unsurprisingly, the packages table is the largest one (764,000 tuples, for about 1.4 GB), as it stores binary package information for each recent Debian release and architecture (Debian has supported 10 architectures or more since 2002).

Table II. ULTIMATE DEBIAN DATABASE STATUS: AVAILABLE GATHERERS

Gatherer Name | Description and Notes | Corresponding UDD Tables
Sources and Packages | Information about source and binary packages for each Debian release and architecture, obtained by parsing the RFC 2822-formatted Sources and Packages files. Also used to import the same information for Ubuntu to a different set of tables. Due to the size of the packages table, a packages_summary table is also provided, but contains less information (only the most useful fields, and no architecture-specific information). | sources, packages, packages_summary, uploaders, ubuntu_sources, ubuntu_packages, ubuntu_packages_summary, ubuntu_uploaders
Debian bugs | Status of all Debian bugs (archived or not), and their computed status (whether they affect specific Debian releases or not). The gatherer is written in Perl, leveraging the native API of the Debbugs bug tracking system. | bugs, bugs_fixed_in, bugs_found_in, bugs_merged_with, bugs_packages, bugs_tags, bugs_usertags and their archived_bugs_* counterparts, as well as all_bugs_* VIEWs
Ubuntu bugs | Status of all open bugs in Ubuntu. Imported using the text interface, which still requires one HTTP request per bug (i.e. hammering). | ubuntu_bugs, ubuntu_bugs_duplicates, ubuntu_bugs_subscribers, ubuntu_bugs_tags, ubuntu_bugs_tasks
Popularity contest | Information about package popularity (number of reported installations on user systems). Also used to import the same information for Ubuntu. | popcon, ubuntu_popcon
Upload history | History of package uploads (since 2002) to Debian unstable and experimental. Generated by parsing the mailing list archives for the debian-devel-changes@ mailing list. | upload_history, upload_history_architecture, upload_history_closes
Upstream status (DEHS) | Latest version of the upstream software. Used to check that the Debian version is up-to-date. | dehs
Packages tags (debtags) | Set of per-package tags, used to classify packages in a faceted way. | debtags
Lintian results | Results of Lintian checks (a policy conformance checker). | lintian
Buildd status (wanna-build) | Status of each package on the build daemons (has it been built successfully? failed to build? waiting in the queue?). | wannabuild
Developers identities (carnivore) | Information about the different names, emails and PGP keys each Debian developer or contributor uses, enabling to link different emails to the same developer. | carnivore_emails, carnivore_keys, carnivore_login, carnivore_names
Official developers (Debian LDAP) | List of official Debian developers, and information about their activity (last email, last upload, . . . ). | ldap
Orphaned packages | List of packages that have been orphaned by their maintainer, with the time of the orphanage. Built from the bugs table and parsing of the bug logs. | orphaned_packages
Package removals | Information about packages that were removed from the Debian archive. | package_removal, package_removal_batch
Screenshots | Links to screenshots provided by screenshots.debian.net. | screenshots
Localized packages descriptions | Translated package descriptions, provided by the ddtp.debian.net project. | ddtp
New packages | Packages currently being reviewed, before their acceptance in the Debian archive. | new_packages, new_sources
Testing migrations | Historical information about package migrations to the testing development suite. | migrations

The UDD schema went through several iterations and improvements, and now matches our goals both regarding completeness (most Debian data sources are imported into UDD) and ease of use. The full database schema is not reported in this paper due to space constraints, but is available on the UDD home page. The part of the schema about Debian bugs is shown in Figure 3 to give an idea of the challenges to be faced when implementing a gatherer.

Figure 3. UDD schema corresponding to the Debian bug tracking system. All the information stored in the original service is available from UDD, as well as some convenience columns (source, affects_unstable, affects_testing, etc.) which are computed during data injection.

A. Importing Debian bugs data

The Debian bug tracking system—called Debbugs—was initially developed more than 10 years ago. It is often seen as a dinosaur, because its web interface is limited to viewing the current status of a bug: all modifications to bug reports, including followups, happen via a mail interface. Debbugs data organization has not changed substantially since then.

Each bug is internally stored using essentially two files:

• a log file, that contains the various emails that were exchanged about that bug, and
• a short summary file, that contains a few lines of metadata about the bug.

In UDD, the choice was made to only import bug metadata, ignoring bug logs. That requires reading 500,000+ summary files, which provides an interesting system programming challenge when efficiency is sought. "Obviously", the format of the summary file is only documented in the Debbugs sources.

Unlike most BTSs, where a bug has two distinguished open/closed states, in Debbugs the state of a bug depends on a specific version. It is hence possible to describe that a bug is fixed in the unstable suite, while the stable suite is still affected. Based on an analysis of the version tree for each package, and on the knowledge of the status of each suite, Debbugs computes which versions and suites are affected by the bug. Importing only the raw bug data into UDD, leaving the "affected by" computation to the user, would not be particularly user-friendly. Therefore, it was decided to do the computation for each bug during the import. As a result, the Debbugs gatherer imports raw information about versions marked as affected or unaffected (tables bugs_fixed_in, bugs_found_in), computes the status of each bug, and stores it (bugs table, columns affects_stable, affects_testing, etc.).
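As a hedged illustration of how the raw and the precomputed data relate, the sketch below lists bugs recorded as found in some version but never marked as fixed, alongside one of the precomputed flags. It only shows how bugs, bugs_found_in and bugs_fixed_in can be joined; it is not an official UDD query.

-- Sketch only: bugs found in some version and fixed in none, together with a
-- precomputed flag; meant to illustrate the table relationships, nothing more.
select b.id, b.package, b.severity, b.affects_unstable
from bugs b
where exists (select 1 from bugs_found_in f where f.id = b.id)
  and not exists (select 1 from bugs_fixed_in x where x.id = b.id)
order by b.severity desc;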

V. UDD APPLICATIONS

In this section we give some representative examples of the possibilities offered by UDD.

A. Popular yet buggy packages

Let's start with the QA need of monitoring very popular and yet buggy packages that we have detailed in Example 1. That need is easily fulfilled by the following SQL query, which joins together tables about bugs, package popularity, and source packages:7

select sources.source, id, insts, title
from bugs
join sources on sources.source = bugs.source
join sources_popcon on sources_popcon.source = bugs.source
where severity >= 'serious'
  and distribution = 'debian'
  and release = 'squeeze'
  and affects_testing
order by insts desc

An excerpt of the query results is shown in Table III; the query execution time is < 1s.

7 sources_popcon is a convenience view exposing popcon for each source package, and using max(insts) to aggregate the results.

Table III. RESULTS OF THE QUERY IMPLEMENTING EXAMPLE 1 (EXCERPT)

source | id | insts | title
ncurses | 553'239 | 87'420 | missing-dependency-on-libc . . .
ncurses | 563'272 | 87'420 | libtic.so dangling symlink . . .
pam | 539'163 | 87'415 | profiles with no auth . . .
· · ·
fossology | 550'653 | 2 | depends on xpdf-utils
python-carrot | 560'583 | 2 | FTBFS: ImportError: . . .

B. Bapase: tracking neglected and orphaned packages

With more than 15,000 source packages, finding a needle in a haystack is not necessarily harder than finding those packages that have been neglected by their maintainers, or those that have been affected by important bugs for long. Being able to combine all the data sources that are imported into UDD has provided a huge help to that end. Bapase (for BAd PAckage SEarch)8 is a web interface to UDD that provides an overview of packages matching various QA criteria. The following criteria are currently implemented:

• Orphaned packages. When maintainers are no longer interested in a package, they can orphan it and put it up for adoption by another developer. However, over time, the number of orphaned packages tends to increase (because many of the orphaned packages were orphaned for a good reason, and are only of limited interest). By providing a global view of orphaned packages, together with information on popularity, age of orphanage, and number of bugs affecting the package, Bapase gives QA team members enough information to take a decision on the future of the package.
• Packages maintained via NMUs. When a maintainer is not responding in a timely manner to problems, it is possible for another developer to prepare and upload a fixed package. Such a procedure is called a Non-Maintainer Upload (NMU). Packages which have been NMUed several times are likely to be neglected by their maintainers, and it might be a good idea to give the maintenance of such packages to more available developers.
• Packages which fail to migrate to testing. The Debian release process includes the migration of packages from the unstable to the testing suite, subject to quality requirements (e.g. a package must not have RC bugs to be eligible). When packages fail to migrate for a long time, it is often a sign of severe problems with the packages, or of negligence by the maintainer. Such packages are often good candidates either for removal from the Debian archive, or for a maintainer change.
• Packages not maintained by official developers. The sponsorship process gives the possibility to non-developer contributors to maintain packages, by using an official developer as a proxy. While packages maintained by unofficial developers are not necessarily of low quality, it is not uncommon that such maintainers quickly lose interest in the packages they maintain. To spot the involved packages, a surprisingly large number of data sources are required. One needs to get the list of official developers from the ldap table and combine it with the information from carnivore to get all the email addresses that official developers use. Then, using the sources and uploaders tables, packages maintained or co-maintained by official developers need to be excluded. Finally, the data common to all Bapase searches need to be retrieved, using the orphaned_packages, migrations, upload_history, bugs and popcon tables. Despite the number of JOINs involved, the query takes less than 20 seconds to execute.

In addition to the above monitoring, Bapase supports the process of tracking the actions that should be taken by the QA team to improve the situation. Those two steps provide a unique tool to improve the quality of Debian by shedding some light on the darker areas of the package archive.

8 http://udd.debian.org/cgi-bin/bapase.cgi

C. Collaboration with Debian derivatives

As we highlighted already, the Debian project is more than its distribution: it is rather the center of an ecosystem of derivative distributions which bring mutual benefits to all involved parties. As a data warehousing solution that does not incur additional costs in the injected data sources, UDD looks like the perfect place where to store, for comparative processing, information about all derivatives. This aspect is particularly interesting for Ubuntu, which is nowadays the major Debian derivative. This importance is already acknowledged by the current set of gatherers, which inject into UDD information about Ubuntu source and binary packages, Launchpad bugs, and popularity information. Having such information in UDD has enabled Debian developers with scarce information on Ubuntu inner workings to integrate Ubuntu versions and available patches in the two most popular developer dashboards: the Package Tracking System, and the Debian Developer's Packages Overview (DDPO).9

9 http://qa.debian.org/developer.php

Example 2: As an example of a derivative need that can be easily fulfilled by UDD, the following query determines which packages are more recent in Debian than in Ubuntu, and which therefore need either synchronization or merging on the Ubuntu side:

select ubuntu.source, ubuntu.version,
       debian.version, insts
from ubuntu_sources ubuntu,
     sources_uniq debian,
     ubuntu_sources_popcon upopcon
where ubuntu.distribution = 'ubuntu'
  and ubuntu.release = 'lucid'
  and debian.distribution = 'debian'
  and debian.release = 'squeeze'
  and ubuntu.source = debian.source
  and ubuntu.version < debian.version
  and ubuntu.source = upopcon.source
order by insts desc

The query takes about 1.5s to execute. An enhanced version of it is used as the official list of merges in Ubuntu.10

10 http://people.ubuntuwire.com/~lucas/merges.html

Example 3: Debian benefits not only from code contributions by derivatives, but also from their precious feedback. To that end, the following query gives an overview of packages available in Ubuntu but not in Debian, sorted by their (Ubuntu) popularity. All such packages are good targets to be packaged in Debian too, most likely by simply reusing the already available packaging.

select src.source, insts
from ubuntu_sources src,
     ubuntu_sources_popcon pc
where distribution = 'ubuntu'
  and release = 'lucid'
  and src.source = pc.source
  and src.source not in
      (select source from sources
       where distribution = 'debian')
order by insts desc

The query runs in < 1s. Its live results are available on the UDD home page, together with those of other sample queries.11

11 http://udd.debian.org/cgi-bin/

VI. RELATED WORK

Data integration issues have been known for quite some time in the database research community [8]. In that respect, UDD is a data warehousing solution [15] for the Debian project. To the best of the authors' knowledge, it is the first solution discussing and taking into account the specifics of purely community-driven FLOSS distributions.

Several data mining studies have considered the Debian distribution as their subject due to its peculiarities in size, complexity, and history (e.g. [7], [10], [11], [12], [16], [17], [18]). The relationship of those works with the present one is manifold. We postulate that UDD will: (1) ease similar research by enabling researchers to direct their efforts to data mining rather than acquisition, (2) enable more complex data mining on patterns emerging from joint data sources (as hinted by the 2010 MSR challenge [19], which proposed UDD as data source), (3) provide data which are to some extent validated by the Debian community, thus relieving researchers from risks like that of using outdated sources.

Other projects have been providing integrated data about FLOSS environments; most notably: FLOSSmole [14] (a collaborative repository of research data about FLOSS projects), SRDA [20] (periodic publishing of SourceForge.net project data), FLOSSMetrics [21] (a large scale database of information and metrics about FLOSS development), and an experience by Mockus to index source code data coming from as many forges as possible [22]. A first difference between those efforts and UDD is in scope: UDD integrates data from distributions—and in particular from Debian-based distributions—rather than data from development projects. This has both disadvantages (e.g. RPM-based distributions are not tracked) and advantages, such as a more complete data span and freshness within the intended scope. The second relevant difference is in the ability and ease of joining data sources together, which is a key design principle of UDD. We consider the interest of FLOSSmole in UDD to be evidence of such advantages.

On the technical side, the only known data integration solutions that strictly relate to UDD are Launchpad12 and Mole. The former is the all-in-one infrastructure developed by Canonical Ltd. for the Ubuntu distribution. Launchpad does offer an API, but one that is geared toward per-package actions, offering no facility to correlate data sources with an archive-wide scope. As evidence of that, the monitoring of which packages in Ubuntu need merging with Debian is currently based on UDD, which at present seems to be the best choice to data mine the Ubuntu distribution. To a lesser extent of integration, other company-backed distributions have infrastructures similar to Launchpad, with similar massive retrieval issues. Mole13 is an effort to collect Debian data sources in a central place; as such it is very similar to UDD and in fact has inspired it. However, Mole has made the design choice to offer as its API a set of BerkeleyDB files published on the web. That choice hinders access (data must be downloaded locally), data correlation (joins have to be made by hand), and fire-and-forget testing (SQL-like interactive query environments are scarce). As a result, UDD has nowadays more data sources than Mole, fresher data, and more users.

12 https://launchpad.net
13 http://wiki.debian.org/Mole

VII. CONCLUSIONS

The Ultimate Debian Database (UDD) is a data warehousing solution for data sources about the Debian project. It was born to solve the so-called "data hell" problem—an intrinsic difficulty in correlating distribution data to highlight package and maintainer metrics which are relevant for Quality Assurance (QA)—but it is by no means QA-specific. UDD is as suited for data mining research as it is for QA, which is in fact purpose-specific data mining. The generality of UDD design choices, together with the research attention on Debian which is not fading after 15 years of history, offers an unparalleled, easy to use platform to establish facts about Debian. From an ethical point of view, UDD makes Debian a better citizen in both the FLOSS and research ecosystems, by providing an open interface to all of its data—a feature that company-based distributions are usually unwilling to offer. That aspect can be leveraged by derivative distributions to import Debian data in their own infrastructures.

The main current limitation of UDD is scarce resource availability, which makes it impractical to store the history of all of its data. A companion database and future cooperation with FLOSSmole contribute to diminish the resulting disadvantages. Another limitation descends from the sampling nature of data injection, which is common to data warehousing solutions: data are periodically pulled by UDD, rather than pushed to it as soon as they change. There is therefore no guarantee that all data will be eventually available in UDD. More tightly integrated data schemes, where data are guaranteed to be fed into UDD at each change, do not seem feasible at present given the peculiarities of the Debian infrastructure.

We look forward to new exciting data mining discoveries about the Debian project, powered by UDD!

ACKNOWLEDGMENTS

The authors would like to thank all UDD contributors, and in particular: Christian von Essen and Marc Brockschmidt (student and co-mentor in the Google Summer of Code which witnessed the first UDD implementation), Olivier Berger for his support and FLOSSmole contacts, Andreas Tille who contributed several gatherers, the Debian community at large, the "German cabal" and the Debian System Administrators for their UDD hosting and support.

REFERENCES

[1] A. Deshpande and D. Riehle, "The total growth of open source," in OSS 2008: 4th Conference on Open Source Systems. Springer, 2008, p. 197.
[2] H. Kagdi, M. Collard, and J. Maletic, "A survey and taxonomy of approaches for mining software repositories in the context of software evolution," Software Maintenance and Evolution: Research and Practice, vol. 19, no. 2, pp. 77–131, 2007.
[3] J. Howison and K. Crowston, "The perils and pitfalls of mining SourceForge," in MSR 2004: International Workshop on Mining Software Repositories, 2004, pp. 7–11.
[4] R. Di Cosmo, P. Trezentos, and S. Zacchiroli, "Package upgrades in FOSS distributions: Details and challenges," in HotSWUp'08, 2008.
[5] L. Nussbaum, "Rebuilding Debian using distributed computing," in CLADE'09: 7th International Workshop on Challenges of Large Applications in Distributed Environments. ACM, 2009, pp. 11–16.
[6] M. Michlmayr and B. M. Hill, "Quality and the reliance on individuals in free software projects," in 3rd Workshop on Open Source Software Engineering, 2003, pp. 105–109.
[7] J. González-Barahona, G. Robles, M. Michlmayr, J. Amor, and D. German, "Macro-level software evolution: a case study of a large software compilation," Empirical Software Engineering, vol. 14, no. 3, pp. 262–285, 2009.
[8] M. Lenzerini, "Data integration: a theoretical perspective," in PODS'02: 21st Symposium on Principles of Database Systems. ACM, 2002, pp. 233–246.
[9] E. Raymond, The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary. O'Reilly, 2001.
[10] G. Robles, J. M. González-Barahona, M. Michlmayr, and J. J. Amor, "Mining large software compilations over time: another perspective of software evolution," in MSR 2006: 3rd International Workshop on Mining Software Repositories. ACM, 2006, pp. 3–9.
[11] I. Herraiz, G. Robles, R. Capilla, and J. M. González-Barahona, "Managing libre software distributions under a product line approach," in COMPSAC, 2008, pp. 1221–1225.
[12] P. Abate, R. D. Cosmo, J. Boender, and S. Zacchiroli, "Strong dependencies between software components," in ESEM 2009: 3rd International Symposium on Empirical Software Engineering and Measurement, 2009, pp. 89–99.
[13] M. Michlmayr, "Managing volunteer activity in free software projects," in 2004 USENIX Annual Technical Conference, FREENIX Track, 2004, pp. 93–102.
[14] J. Howison, M. Conklin, and K. Crowston, "FLOSSmole: A collaborative repository for FLOSS research data and analyses," International Journal of Information Technology and Web Engineering, vol. 1, no. 3, pp. 17–26, 2006.
[15] D. Theodoratos and T. Sellis, "Data warehouse configuration," in VLDB 1997: 23rd International Conference on Very Large Data Bases. ACM, 1997, pp. 126–135.
[16] J. González-Barahona, M. Perez, P. de las Heras Quirós, J. González, and V. Olivera, "Counting potatoes: the size of Debian 2.2," Upgrade Magazine, vol. 2, no. 6, pp. 60–66, 2001.
[17] T. Maillart, D. Sornette, S. Spaeth, and G. von Krogh, "Empirical tests of Zipf's law mechanism in open source Linux distribution," Physical Review Letters, vol. 101, no. 21, p. 218701, 2008.
[18] N. LaBelle and E. Wallingford, "Inter-package dependency networks in open-source software," submitted to Journal of Theoretical Computer Science, 2005.
[19] A. Hindle, "MSR mining challenge," http://msr.uwaterloo.ca/msr2010/challenge/, 2010, retrieved 12/01/2010.
[20] "SRDA: SourceForge.net research data," http://www.nd.edu/~oss/Data/data.html, retrieved 12/01/2010.
[21] I. Herraiz, D. Izquierdo-Cortazar, and F. Rivas-Hernández, "FLOSSMetrics: Free/libre/open source software metrics," in CSMR. IEEE, 2009, pp. 281–284.
[22] A. Mockus, "Amassing and indexing a large sample of version control systems: Towards the census of public source code history," in MSR 2009: 6th International Workshop on Mining Software Repositories, 2009, pp. 11–20.