Parallelism and Conflicting Changes in Git Version Control Systems Hoai Le Nguyen, Claudia-Lavinia Ignat
To cite this version: Hoai Le Nguyen, Claudia-Lavinia Ignat. Parallelism and conflicting changes in Git version control systems. IWCES’17 - The Fifteenth International Workshop on Collaborative Editing Systems, Feb 2017, Portland, Oregon, United States. hal-01588482

HAL Id: hal-01588482
https://hal.inria.fr/hal-01588482
Submitted on 16 Sep 2017

Hoai Le Nguyen
Université de Lorraine, F-54000; Inria, F-54600; CNRS, F-54500

Claudia-Lavinia Ignat
Inria, F-54600; Université de Lorraine, F-54000; CNRS, F-54500

ABSTRACT

Version control systems such as Git support parallel collaborative work and have become widespread in the open-source community. While Git offers some very interesting features, resolving the collaborative conflicts that arise during synchronization of parallel changes is a time-consuming task. In this paper we present an analysis of concurrency and conflicts in the official Git repositories of four projects: Rails, IkiWiki, Samba and Linux Kernel. We also analyze how often users decide to roll back to a previous document version when the integration process results in a conflict. Finally, we discuss the mechanism adopted by Git to consider changes made on two consecutive lines as conflicting.

CCS CONCEPTS

• Human-centered computing → Empirical studies in collaborative and social computing; • Software and its engineering → Software configuration management and version control systems;

KEYWORDS

version control systems, parallel work, conflicts

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
IWCES’17, Portland, Oregon, USA
© 2017 Copyright held by the owner/author(s).

1 INTRODUCTION

Computer supported collaboration is an increasingly common occurrence that becomes mandatory in academia and industry, where the members of a team are distributed across organizations and work at different moments of time. Version control systems are popular tools that support parallel work over shared projects. These systems keep a history of versions of the documents published by users. These document versions have associated meta-data such as publication date, author and a short description of the modifications, which facilitates detecting a certain problem or recovering deleted contributions.

As a version control system supports parallel work on the same documents, it has to offer support for synchronization of parallel changes on those documents. Version control systems can be classified into centralized version control systems (CVCSs) and decentralized version control systems (DVCSs).

CVCSs such as CVS[1] and Subversion[5] rely on a client-server architecture. The server keeps a complete history of versions while clients keep only a local copy of the shared documents. Users can modify their local copy of the shared documents in parallel and synchronize with the central server in order to publish their contributions and make them visible to the other collaborators. This work mode is called the copy-modify-merge paradigm.

DVCSs such as Git[8], Mercurial[14] and Darcs[6] became popular around 2005. These systems rely on a peer-to-peer architecture where each client keeps the history of versions plus a local copy of the shared documents. Users can work in isolation on their local repositories. They can also synchronize their local repository with those of other collaborators. In practice, the core development team organizes at least one repository as the primary repository, where the latest approved changes can be found. Contributors can clone from this official repository. However, only the core development team has write access to commit directly. Other contributors need to use the pull-based development model, in which a contributor creates a pull-request for their changes. A member of the core team then inspects the changes and decides whether or not to pull and merge the contributor's changes into the main repository. In some cases, contributors are requested to update or add more changes before their pull-request is accepted. Nowadays, the pull-based model is naturally supported by web-based hosting services such as GitHub [9] and Bitbucket [3].

Even though CVCSs are largely used in academia and industry, they feature some limitations due to their centralized architecture: they offer limited scalability and limited fault tolerance, administration costs are not shared, and data centralization in the hands of a single vendor is an inherent threat to privacy. DVCSs, on the other hand, overcome these limitations of centralized architectures. Their peer-to-peer architecture allows them to support a large number of users and to tolerate faults. Each user keeps a copy of the history and decides with whom to share it, without storing it on a central server. These features made DVCSs widely used in the domain of open software development, where projects often have a large number of contributors. Parallelism of activities is very important for supporting collaboration, but merging different parallel contributions might require a significant period of time, which can reduce the productivity gain.

Our objective is to study the parallelism of activities in a DVCS such as Git[8]. In Git, users can synchronize their changes with other users working in parallel with them. In this process, a merge is performed between local changes and remote changes. If concurrent changes refer to the same file, we say that these changes are conflicting. Conflicts on a file can be automatically resolved or they need user intervention for their resolution. We call the former category automatically resolved conflicts and the latter category unresolved conflicts. If conflicting changes refer to different non-adjacent lines of the file, the conflict is automatically resolved by the system. If, on the contrary, conflicting changes refer to the same or adjacent lines of the file, the conflict cannot be automatically resolved and the user has to resolve it manually. Unresolved conflicts also occur if a file is renamed and modified/deleted concurrently, if it is modified and deleted concurrently, if it is renamed concurrently by two users, if a user renames a file with the same name that another user gives to a concurrently created file, and if two users concurrently add two files with the same name.

Understanding how often conflicts happen and how users resolve them can help in proposing a merging mechanism that minimizes the conflicts that users have to resolve manually.

The rest of the paper is organized as follows. In Section 2 we briefly explain related works. Section 3 presents our measurements of conflicts during the merge process. Section 4 discusses some limitations of analyzing Git reposito-

[...] with the same name. The study found that only about 0.0035% of all updates made to non-directory files resulted in conflicts, and among them less than one third could not be resolved automatically. The authors mentioned that the conflicts that cannot be resolved automatically are any update/update concurrent changes on source code or text files, as they have arbitrary semantics and therefore require user intervention. Note that in contrast to the definition of conflicts used in [16], in the terminology of version control systems two updates done on the same file (source code or textual) lead to non-automatically resolved conflicts only if the updates refer to the same line in the file. All update/remove conflicts required human intervention, and about 0.018% of all naming conflicts led to name conflicts which have to be resolved by humans.

In [15], the authors presented a study about parallel changes in the context of a large software development organization and project. The study analyses the complete change and quality history of a subsystem of the Lucent Technologies' 5ESS over a period of 12 years. Each set of change requests representing all or part of a solution to a problem was recorded by the system. When a change from this set was made on a file, the system kept track of the lines added, edited or deleted. This set of changes composes a delta. It was found that 12.5% of all deltas were made to the same file by different developers within a day, and that 3% of these deltas made within a day by different developers physically overlap. However, interference of these deltas is analyzed over a quite large period of time (1 day), and not all these deltas are performed concurrently.

In [19], the authors investigated four large open-source projects (GCC, JBOSS, JEDIT and PYTHON) and found that in CVS the integration rate is very low (between 0.26% and 0.54%), and the conflict rate is between 23% and 47%. They indicated that parallel changes within single files (integration) are rare and have a small impact on the development process. However, they frequently affect the same location and cannot be integrated automatically by CVS.
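The line-based rule stated in the introduction (concurrent changes to the same file are resolved automatically only when they touch different, non-adjacent lines) can be illustrated with a minimal sketch. It is not Git's merge implementation: it assumes, for illustration only, that each side's changes are given as inclusive line ranges in the common ancestor version of the file, and the names `Hunk`, `hunks_collide` and `classify_file_conflict` are hypothetical. Renames, deletions and files added under the same name are not modelled.

```python
from typing import List, Tuple

# Assumption: a hunk is an inclusive (first_line, last_line) range,
# numbered against the common ancestor version of the file.
Hunk = Tuple[int, int]


def hunks_collide(a: Hunk, b: Hunk) -> bool:
    """Return True if two hunks share a line or touch adjacent lines."""
    # Widening each range by one line turns "adjacent" into "overlapping",
    # so a single interval-overlap test covers both cases.
    return a[0] <= b[1] + 1 and b[0] <= a[1] + 1


def classify_file_conflict(local: List[Hunk], remote: List[Hunk]) -> str:
    """Classify concurrent changes to one file using the paper's terminology."""
    if not local or not remote:
        # Only one side changed the file, so the changes are not concurrent.
        return "no conflict"
    for l in local:
        for r in remote:
            if hunks_collide(l, r):
                # Same or adjacent lines: the user has to resolve it manually.
                return "unresolved"
    # Both sides changed the file, but only on distant lines.
    return "automatically resolved"


if __name__ == "__main__":
    print(classify_file_conflict([(1, 2)], [(10, 12)]))  # distant lines
    print(classify_file_conflict([(3, 3)], [(4, 4)]))    # adjacent lines
```

The pairwise comparison is quadratic in the number of hunks, which keeps the sketch short; sorting both hunk lists and sweeping them once would be the natural choice for large files.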