Guide to Selecting the Right SCM for Globally Distributed Software Development

James Creasy Senior Director, Product Management, WANdisco. June 2013.

COMPANION GUIDE Best SCM Tools for Global Software Development

Table of Contents

Introduction
Audience
Industry Problem
Challenges
Suitability of Existing Solutions
Summary Chart
Deep Dive - Perforce
Deep Dive - GitHub Enterprise
True Active-Active
How to Select an SCM system for Modern Global Software Development
    Feature checklist
    Where to learn more about products with these features
About WANdisco

WANdisco, Inc. follows a policy of continuous development and reserves the right to alter, without prior notice, the specifications and descriptions outlined in this document. No part of this document shall be deemed to be part of any contract or warranty. WANdisco, Inc. retains the sole proprietary rights to all information contained in this document. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopy, recording, or otherwise, without prior written permission of WANdisco, Inc. or its duly appointed authorized representatives. WANdisco and the WANdisco logo are trademarks. All other marks are the property of their respective owners. Apache, , and Apache HBase are trademarks of Apache Software Foundation (ASF).

Page 2 of 2 © WANdisco. All rights reserved. Best SCM Tools for Global Software Development

Best SCM Tools for Global Software Development

Introduction

Globally distributed development teams have gradually become the standard for a significant number of enterprise software development projects. What started as “outsourcing,” with tasks delegated from a central hub to remote sites, has evolved into a distributed model with teams assembled based on availability and skill set rather than location. Approximately 15 years ago, a typical model might have been a U.S.-based company outsourcing tasks to India, but the current model reflects globally distributed collaboration between highly skilled teams in India, multiple locations across U.S. time zones, teams in Europe and others.1 Corporate acquisitions have also accelerated this trend.

Likewise, IT and administrative teams responsible for creating and supporting the technical infrastructure have evolved. Despite today’s web-like development model, however, existing and legacy tools for SCM have largely failed to keep pace.

The new paradigm of global software development demands a new class of SCM tooling. This paper will examine existing approaches and recommend the solutions that best meet the requirements of that paradigm. We’ll also examine the trends shaping SCM, and plot a safe course for the future of Global SCM.

Audience

This paper is intended for anyone who uses, designs, maintains, supports, or otherwise interacts with the SCM infrastructure for global software development at their company. This might be an SCM administrator, an Enterprise Architect, a CTO or CIO, or a developer looking for better solutions to recommend to management.

1. Fryer, K. & Gothe, M. Global software development and delivery: Trends and challenges. Retrieved from http://www.ibm.com/developerworks/rational/library/edge/08/jan08/fryer_gothe/index.html


Industry Problem

The need for SCM systems to support global software development has largely outstripped the capabilities of available solutions. Much has been written about the issues and methods of addressing communication between teams separated by multiple time zones.2 Issues surrounding cultural differences, project planning, work transfer, IP protection and more have also been well explored (if not completely resolved). However, almost every analysis highlights the lack of an SCM tool that can provide the type of global synchronization needed for effective collaboration.3

Since the product of a software development project is ultimately software, it’s clear that even within the broader ALM stack the ultimate vessel of collaboration is the SCM system. This makes the suitability of that tool critical to a successful project. What makes a tool suitable for the type of global synchronization needed for worldwide collaboration?

Performance

Performance continues to be the primary measure of distributed SCM deployments. After all, what performed adequately on a LAN now horribly underperforms on a WAN. What factors influence the performance of an SCM system in a WAN environment?

More than bandwidth

Achieving high performance in a WAN environment means minimizing remote connections over high latency lines. Although bandwidth is a commonly discussed metric of performance, latency can often be the more significant component.4

No technology can ever entirely eliminate cross-continental latency because it is ultimately bound by the speed of light, but does this mean unresponsive SCM systems are the only choice?

Replication

We can sidestep the need to connect to a geographically remote and high latency central server through replication of data to locally resident nodes. Using replication, developers and other users can interact with the SCM system over a LAN connection.

Note: Replication can be called many things; caching, proxies, and replicas are all forms of data replication.

Therefore replication is a key requirement for global software development SCM tools.

There is, however, another aspect to WAN environments that often eludes attention. WAN-based distributed computing is more than a high-latency LAN environment; it’s also subject to frequent stresses not found in its short-wire cousin.

Failure tolerant

An effective global SCM solution must be highly failure tolerant when deployed in a WAN environment, partly because software code is notoriously intolerant of wrong or missing data, but also because distributed systems experience much higher failure rates than traditional single machine systems deployed over a LAN.

Note: Further explanation of Peter Deutsch’s now famous “Fallacies of Distributed Computing”5 explores in detail the areas where distributed systems can fail.

Clearly failure tolerant replication is a key ingredient for supporting effective global software development.

2. De Souza, C. R. B. Global Software Development: Challenges and Perspectives. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.4102
3. Ebert, C. & De Neve, P. Surviving Global Software Development. Retrieved from http://e-learningup.org.in/UploadArticlePDFFiles/Surviving%20Globalb508f193-d3c4-436e-b8bb-f07fdab5da49.pdf
4. Cheshire, S. (1996) It’s the Latency, Stupid. Retrieved from http://www.stuartcheshire.org/rants/Latency.html
5. Rotem-Gal-Oz, A. Retrieved from http://www.rgoarchitects.com/files/fallacies.pdf

Challenges

What types of failures are likely to affect WAN replication of SCM systems? What features or architectures allow replication to be highly failure tolerant?

Replication failure scenarios

Businesses using global SCM solutions with a single point of failure (master-slave replication) typically rely on cold or warm failover, but such solutions are subject to unavoidable data loss in the time between the master going offline and the failover starting up. In the case of an earthquake or major storm, the original master may be months away from repair, and the new master presents a new single point of failure. If the failover server experiences an outage or malfunction, significant permanent data loss could occur.

Disaster Recovery (DR): These dramatic, location-based failures are referred to as “disaster recovery scenarios”. Global SCM systems should have robust disaster recovery capabilities.

Failures are not always catastrophic. A flaky connection to a remote site is one example, but even this scenario hides potentially painful consequences, because it’s hard to tell the difference between a slow computer and a dead computer. Suppose you are employing the conventional technique of post-commit replication. If the target of the replication is subject to an unreliable connection, it can be difficult to tell if you need to resend the update, or if your submission will be a duplicate. If the update never made it there, but the next update does, now two computers in the network are (likely silently) diverging. Disruptive, likely manual, untangling of divergence-related issues will require time from your administrators and development staff, time better spent on your product.
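The resend-or-duplicate ambiguity described above can be made concrete with a small sketch. It assumes, purely for illustration, that each update carries a consecutive sequence number. The `Replica` class and its `receive` method are hypothetical names and do not correspond to any product discussed in this paper. With numbering in place, a resend is recognized as a duplicate and a skipped update is reported as a gap instead of being applied silently.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Replica:
    """Toy replica that applies numbered updates from one upstream server."""
    applied: Dict[int, str] = field(default_factory=dict)  # sequence number -> payload
    last_seq: int = 0  # highest sequence number applied so far

    def receive(self, seq: int, payload: str) -> str:
        if seq <= self.last_seq:
            return "duplicate"   # already applied, so a resend is harmless
        if seq != self.last_seq + 1:
            return "gap"         # an earlier update is missing; do not diverge
        self.applied[seq] = payload
        self.last_seq = seq
        return "applied"


replica = Replica()
print(replica.receive(1, "r1: fix typo"))     # applied
print(replica.receive(1, "r1: fix typo"))     # duplicate
print(replica.receive(3, "r3: add feature"))  # gap
```

Without some equivalent of the sequence check, the third call would succeed and the replica would quietly drift from its source, which is the silent divergence the paragraph warns about.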


High Availability (HA): The above condition would commonly be referred to as a High Availability scenario, where a system becomes unavailable during the time needed to discover and unwind the errors. Likewise, if network connectivity is lost the system could periodically become unavailable, impacting developer productivity and even lengthening product release cycles. Even scheduled maintenance windows are losses of availability.

Note: High Availability and Disaster Recovery are often referenced together in an acronym, such as HA/DR or just HADR.

Performance impacts: Less dramatic, but likely just as impactful, are the effects of poor performance using remote connections. Where a code-sync might take minutes for a LAN connected developer near the central server, it could take hours over a high latency line to India. Even with a local slave replica, how far behind the master does that slave run? What’s the cost of poor replication in terms of developer productivity even when the system is running ideally?

Unfortunately, these types of problems are only the tip of the iceberg. Coordinating replication reliably is challenging.

Replication is hard

It takes little more than a Google search on keywords like “replication failures” to turn up pages of complex instructions on diagnosing and repairing replication failures. Furthermore, the conditions under which these steps are taken are not usually the most enjoyable. For example, imagine (or remember) the pressure of reversing data corruption in the last critical days before a product release. In many industries, time-to-market is a key competitive business driver.

Replication is hard and failures are disruptive. With this in mind, we’ll take a look at the two main approaches used in replication.

Replication paradigms

There are two main paradigms for replication:

1. Master-slave
2. Active-active

Master-slave replication: Master-slave replication refers to a system in which one computer acts as the recipient of all writes, replicating to the slave servers, which may respond directly to read requests (Figure A). This fulfills the requirement for performing local (LAN-based) reads. Products may describe the components of master-slave architecture with words like “proxy” or “replica”.

Figure A. Master-slave replication

However, there are multiple issues with master-slave replication:


1. Single point of failure
2. WAN writes
3. Data corruption

The master is a single point of failure: lose that and all collaborative development is essentially at a standstill (Figure B). All writes have to propagate to the master, resulting in slow WAN speed writes that are bottlenecked to a single computer. Failures can result in hard-to-find code and data corruption that requires substantial manual effort to recover from.

Figure B. Single point of failure

Active-active replication: True active-active replication carries none of these risk factors and is highly failure tolerant in a WAN environment. In this type of replication, every node is an equal peer and can accept writes as well as support reads, so all interaction with the system is local with LAN responsiveness (Figure C).

Figure C. Active-active replication

Active-active replication is also safe and transparent. When a node goes offline, the activity load is seamlessly transferred to the remaining nodes (Figure D). No manual intervention is needed, and, in most cases, users of the system remain blissfully unaware of any outage. This type of system is called “self-healing.”6 When the isolated node comes back online, it silently catches up to the others and continues working. There is no single point of failure, offering HADR with maximum safety of your data.

Figure D. “Self-Healing” Active-active replication

Note: There are lesser forms of replication that misleadingly call themselves “active-active”. One clue that a technology calling itself active-active is in fact an imitation is any mention of reconciling conflicts. True active-active does not require manual conflict resolution: the algorithm prevents conflicts from occurring.

Therefore, true active-active replication is another indicator of a best choice of SCM tool supporting global software development.

6. See http://www.wandisco.com/products-faq#self_healing
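The self-healing behavior described above, where a node drops out, the others carry on, and the returning node catches up, can be sketched in a few lines. This is a toy model under a strong assumption and is not a description of WANdisco’s algorithm, which this paper does not detail. The assumption is that every write has already been given one agreed position in a shared, ordered log; the `Node` class and the `write` function are hypothetical names.

```python
from typing import Dict, List, Tuple


class Node:
    """Toy peer that applies writes strictly in an agreed global order."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self.applied = 0                 # number of log entries applied so far
        self.files: Dict[str, str] = {}  # path -> content

    def catch_up(self, log: List[Tuple[str, str]]) -> None:
        # Apply every entry this node has not seen yet, in log order.
        for path, content in log[self.applied:]:
            self.files[path] = content
        self.applied = len(log)


def write(log: List[Tuple[str, str]], nodes: List[Node], path: str, content: str) -> None:
    # Assumption: appending to `log` stands in for the agreement step that
    # gives each write a single position in the global order.
    log.append((path, content))
    for node in nodes:
        if node.online:
            node.catch_up(log)


log: List[Tuple[str, str]] = []
nodes = [Node("us"), Node("eu"), Node("in")]

write(log, nodes, "README", "v1")
nodes[2].online = False            # the "in" node drops off the network
write(log, nodes, "README", "v2")  # the remaining nodes keep accepting writes
nodes[2].online = True
nodes[2].catch_up(log)             # the rejoining node replays what it missed

print(all(n.files == nodes[0].files for n in nodes))  # True
```

Because every node applies the same entries in the same order, there is nothing to reconcile when the isolated node returns. The hard part, which the sketch skips entirely, is reaching agreement on that order across unreliable WAN links.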


Suitability of Existing Solutions

Now that we have identified true active-active replication as a key component for any SCM tool supporting global software development efforts, we have a clear lens with which to evaluate our options.

Unfortunately, all existing SCM systems, except two, fail to offer true active-active replication that effectively supports global software development needs with performance, scalability, high availability and disaster recovery.

Note: The selection of a specific SCM tool for a specific deployment lies outside the scope of this paper. This companion guide addresses the suitability of select existing solutions to the particular challenges of global software development. Solutions discussed in this paper all have some measure of industry adoption.

Existing SCM systems can be grouped into a few categories based on multi-site and replication capabilities. The categories are outlined below, followed by a chart that summarizes available SCM options.

Single machine architectures

Until quite recently, SCM systems were built with a “single machine” architecture; that is, they were applications designed to run on a single piece of hardware. They often provided acceptable performance over a fast connection when such a connection was available. When connections between geographically distributed locations were first used, the limitations of single machine architecture quickly became apparent.

Connecting to a shared server in a high-latency WAN environment wreaks havoc on the performance of single machine architectures. What might take a few minutes on a LAN connection can take hours on a WAN connection.

As explored above, these systems lack the replication needed to support LAN-based access to local data and are not a best choice for supporting global software development projects.

Examples: Subversion (without replication), GitHub Enterprise, Atlassian Stash, Perforce (without replication), Team Foundation Server, ClearCase (without replication).
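A rough back-of-the-envelope model shows why the same operation degrades so sharply over a WAN. The sketch below assumes a chatty protocol that issues its round trips one after another. All figures (round-trip count, round-trip times, payload size, link speeds) are hypothetical values chosen for illustration, not measurements of any product.

```python
def transfer_seconds(round_trips: int, rtt_s: float,
                     payload_mb: float, bandwidth_mbps: float) -> float:
    """Estimate = time spent waiting on round trips + time spent moving bytes."""
    latency_cost = round_trips * rtt_s
    bandwidth_cost = payload_mb * 8 / bandwidth_mbps  # megabytes -> megabits
    return latency_cost + bandwidth_cost


# Hypothetical sync: 20,000 sequential request/response exchanges, 500 MB of data.
lan = transfer_seconds(round_trips=20_000, rtt_s=0.0005, payload_mb=500, bandwidth_mbps=1000)
wan = transfer_seconds(round_trips=20_000, rtt_s=0.250, payload_mb=500, bandwidth_mbps=100)

print(f"LAN: {lan:.0f} s, WAN: {wan / 60:.0f} min")  # LAN: 14 s, WAN: 84 min
```

In this example the WAN link has one tenth of the bandwidth, yet almost all of the added time comes from the round trips rather than from moving the bytes, which is why a faster pipe alone does little for a remote site.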


Single Machine Architecture + Master-slave replication

A number of systems originally architected for single machines have added master-slave replication to at least partially address the needs of global software development. As explored above, master-slave style replication carries risks to the correctness of data as well as introducing performance limitations. Master-slave incorporates single points of failure that impede effective high availability and disaster recovery.

Examples: Subversion (with SVNSync), Perforce (with Perforce proxies and replica servers), Team Foundation Server (with proxy).

Externally Hosted

In this model, repositories are self-service hosted with administration via a web interface. Externally hosted repositories typically suffer from a lack of multi-site nodes, forcing remote users to connect over high latency connections, as well as a single point of failure at the web interface. There are also concerns about hosting core company IP on a publicly accessible (but presumably permissioned) website. Vendors know these objections; each has a corresponding inside-the-firewall version.

Examples: GitHub, Atlassian BitBucket.

Partitioned Mastership

Rather than try to provide simultaneous access, this model locks writes to all branches (partitioning) except for one. Manual coordination is used to change the mastership (the right to commit code) and prevent conflicts. Extra branches must be created to support geographical locations rather than SCM requirements. It’s a restrictive, legacy model liked by few.

Example: ClearCase MultiSite.

Disconnected

Often labeled with the term “distributed,” these tools are more accurately described as “disconnected.” While these tools take a step in the right direction by being able to reconcile repositories across a WAN, true distributed computing implies coordination, and current tools in this category require manually coordinated replication (or replication coordinated by ad-hoc scripts).

Examples: Git, Mercurial.


Summary Chart

Single Machine Architecture
Products: Subversion, GitHub Enterprise, Atlassian Stash, Perforce, Team Foundation Server, ClearCase
Assessment: Architected for a single machine,7 these solutions are a single point of failure and force remote workers to communicate through high latency WAN connections. Very long read and write times are the result, killing productivity.

Single Machine Architecture + Master-slave replication
Products: Subversion + SVNSync, Perforce with replication, Team Foundation Server with TFS Proxy, AccuRev with AccuReplica8
Assessment: Master-slave always has a single point of failure. Master-slave has a variety of other failure points. Writes for all but the master site are remote writes. Writes have to propagate over the WAN to the master and then back down to the slave servers before they are available for local read.

Externally Hosted
Products: GitHub, Atlassian BitBucket
Assessment: These are a single point of failure for all users. Git allows local commits and other operations, so developers are able to continue to work at reduced capacity. Many companies do not want company IP, particularly their software code, to be externally hosted.

Partitioned Mastership
Products: ClearCase MultiSite
Assessment: Expensive. Proven to be difficult in practice. Branch mastership particularly troublesome. Does not account for many failure conditions.9

Disconnected
Products: Git, Mercurial
Assessment: No active replication between repositories. The so-called “distributed” tools are more accurately referred to as “disconnected” tools because there is no intelligent coordination between the disconnected nodes. Remote sites are faced with long clone/push times over a WAN.

True active-active
Products: SVN MultiSite, Git MultiSite
Assessment: Accounts for all failure and contention conditions, supports true simultaneous, globally distributed development with LAN-speed reads and writes for all users.

7. Or single interface backed by multiple load balanced machines
8. AccuReplica: http://www.rfpconnect.com/accurev/product/accureplica
9. Extensive list of ClearCase MultiSite failure conditions: http://www.ibm.com/developerworks/rational/library/replicate-repositories-clearcase-multisite/index.html


Deep Dive - Perforce

Perforce uses master-slave replication architecture to support global deployments via caching proxies and forwarding replicas.

As noted previously, master-slave replication suffers from several significant shortcomings. The master server is a single point of failure, both in terms of hardware failure and connectivity failure. Many of Perforce’s recent so-called new features are actually increased investment in fragile master-slave replication.

Caching Proxy

The caching proxy stores file contents for local access. Unless the proxy is manually pre-populated, the Perforce caching proxy incurs the cost of transmission at the exact worst time possible: the moment the file is requested. Subsequent requests for the file are served locally and bypass retransmission over the WAN. Manual cleanup is required to avoid running out of disk space.

Replica servers

A read-only replica passes writes across the WAN to the master, which must then redistribute the approved writes to the replicas. A build replica filters replication to build farms supporting Continuous Integration (CI).

Chained replication

Chained replication increases the number of steps for data to get to the users. Each additional step is a WAN communication subject to connectivity/transmission failure. Hubs where several replicas share another replica form local single points of failure, which complicates recovery after a failure.

Filtered Replication

Perforce supports a method of filtering replication; however, a major reason that filtered replication is required is that the Perforce server generates an unusual amount of write traffic, much of which comes from what would normally be pure read operations such as p4 sync. Filtered replication also eliminates the potential for High Availability and Disaster Recovery at the replicated site, because the filtering removes data a slave server would need in order to act as a master.
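The caching proxy trade-off described earlier in this section (a cold request pays the WAN cost at the moment of the request, later requests are served locally, and nothing is cleaned up automatically) is the general behavior of a read-through cache. The sketch below is a generic illustration and does not reproduce Perforce’s implementation or API. `ReadThroughCache`, `fetch_remote`, and the depot-style path are hypothetical.

```python
from typing import Callable, Dict, Iterable


class ReadThroughCache:
    """Toy caching proxy: serves local copies and fetches over the WAN on a miss."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self.fetch_remote = fetch_remote   # hypothetical stand-in for a WAN request
        self.store: Dict[str, bytes] = {}  # grows without bound: nothing is evicted
        self.wan_fetches = 0

    def get(self, path: str) -> bytes:
        if path not in self.store:         # cold miss: pay the WAN cost right now
            self.store[path] = self.fetch_remote(path)
            self.wan_fetches += 1
        return self.store[path]            # warm hit: served from local storage

    def prepopulate(self, paths: Iterable[str]) -> None:
        for path in paths:                 # pay the WAN cost ahead of time instead
            self.get(path)


cache = ReadThroughCache(lambda path: f"contents of {path}".encode())
cache.get("//depot/main/app.c")  # first request goes over the WAN
cache.get("//depot/main/app.c")  # second request is local
print(cache.wan_fetches)  # 1
```

Calling `prepopulate` before developers start work moves the WAN cost off the critical path, at the price of an extra manual or scripted step. Only reads benefit either way, because writes still travel to the master.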
Conclusion

Although at first glance Perforce seems to offer an impressive array of replication features, it is actually increasing the complexity of a fragile architecture built on the insecure ground of the master-slave paradigm. Perforce replication has significant limitations when attempting to support modern global software development.


Deep Dive - GitHub Enterprise

GitHub Enterprise forms a single point of failure for all users and offers no multi-site or replication capabilities: remote users must connect to the GitHub Enterprise instance over a WAN. The instance itself is stored in a virtual machine (VM) to which there is no root access for the customer. Backups are accomplished through snapshots of the VM and crashes are addressed by restarting the VM image.

GitHub Enterprise, while featuring the powerful tool Git, is not proven to scale in enterprise development environments. It provides self-provisioned hosting of Git repositories that includes a suite of lightweight ALM tools; however, enterprise development requirements can easily exceed both Git’s scalability limits and GitHub Enterprise’s own. The size of repositories and the number of concurrent developers are common limitations. You might be tempted to spin up another GitHub Enterprise instance to scale out load, but since there are no known tools to reconcile GitHub Enterprise instances, code divergence is inevitable.

GitHub Enterprise’s complete lack of multi-site capabilities means that remote users are always at the mercy of low bandwidth WAN connections, connectivity lapses, and central instance outages. Git itself does provide developers some capability to commit and search history during an outage of the central instance, but when the shared repository is not available, collaboration is sharply limited.

Conclusion

GitHub Enterprise is a single point of failure, saddled with its own scaling limitations as well as those inherited from Git, and has no multi-site or replication features, which makes it an especially poor fit for supporting modern global software development.


True Active-Active

The ideal solution for supporting global software development is an SCM system that employs true active-active replication to deliver scalability and LAN read and write responsiveness for developers while also delivering High Availability, Disaster Recovery and, most importantly, maximum assurance of data correctness across the globally distributed software codebase.

WANdisco SVN MultiSite and Git MultiSite are the first and only SCM systems to employ true active-active replication for Global SCM capabilities.10

Open source is already the de facto standard for SCM and enterprise SCM. Subversion is now under active and accelerating development, powered by teams of corporate committers, such as the five committers at WANdisco. Subversion’s vision statement has turned it increasingly to the types of features needed for the largest enterprise deployments. Adding best in class performance, scalability, and enterprise security and governance features closes the gap with commercial systems.

Git has enjoyed recent and rapid adoption in the enterprise, and WANdisco Git MultiSite delivers all the same benefits of true active-active replication.

Conclusion

The complex and failure-prone world of global software development may appear intimidating, but as we have seen, the multitude of error conditions and recovery scenarios boils down to one simple requirement: true active-active replication is the key technology that meets all the requirements, allowing globally distributed software teams to collaborate easily and safely on a common codebase.

10. Excepting only the WANdisco CVS MultiSite product. CVS is currently considered deep legacy and so is not addressed here.


How to Select an SCM system for Modern Global Software Development

The following checklist summarizes the key features to look for when selecting an SCM system suitable for the substantial challenges of global software development.

Feature checklist

✓ Implements true active-active replication
✓ Industry standard feature set
✓ Comprehensive list of integrations to ALM tools
✓ Proven track record in production environments
✓ Provides excellent High Availability and Disaster Recovery
✓ Cost-effective
✓ Actively developed
✓ Development roadmap contains enterprise features
✓ Supports a wide range of developer sophistication
✓ Powerful feature set
✓ Auditability and governance

Where to learn more about products with these features

Visit the SVN MultiSite and Git MultiSite sections of wandisco.com for more information on true active-active replication of Subversion and Git repositories

Learn more about WANdisco’s team of Subversion committers and our Director of Subversion

See Subversion’s accelerating vision for enterprise-level software development at the official Subversion roadmap

Check out WANdisco’s acclaimed Subversion and Git enterprise support

Learn how true active-active replication removes the single point of failure for Hadoop Big Data deployments

Contact our Sales department at [email protected]


About WANdisco

WANdisco (Wide Area Network Distributed Computing) provides enterprise-ready, non-stop software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability. WANdisco’s products are differentiated by the company’s patented, non-stop data replication technology, serving crucial high availability requirements for 100% uptime globally for both Hadoop Big Data and Application Lifecycle Management (ALM) applications.

SVN MultiSite

WANdisco SVN MultiSite is replication, mirroring and clustering software that enables enterprise performance, scalability, and backup, as well as 24/7 availability for globally distributed Apache Subversion deployments. SVN MultiSite removes the single point of failure (SPOF) risk, potential bottleneck and WAN latency issues associated with a central Subversion server by turning distributed repositories into mirrors of each other and providing continuous hot-backup and automatic failover across every site in the cluster. SVN MultiSite provides developers collaborating around the globe with LAN-speed performance, eliminates downtime, prevents server overload, and resolves merge conflicts quickly.

SVN MultiSite Plus

SVN MultiSite Plus builds on SVN MultiSite’s unique ability to eliminate the single point of failure, performance bottleneck and WAN latency of a central Subversion server by incorporating the latest enhancements to WANdisco’s patented replication technology and implementing them at Subversion’s file system layer. As a result, SVN MultiSite Plus eliminates up to 90 percent of the communication overhead between Subversion clients and the server at each location, allows servers and repositories to be added and removed on the fly without downtime, and supports all Subversion protocols. SVN MultiSite Plus delivers major availability, performance and scalability improvements over standard SVN MultiSite, as well as greater flexibility.
Git MultiSite

Git MultiSite uses WANdisco’s patented replication technology to provide LAN-speed Git access and collaboration to developers everywhere, even across a WAN. It eliminates the single-point-of-failure, scalability, and performance bottlenecks of a central master repository, allowing enterprises to realize the true promise of distributed version control. With Git MultiSite, all fully writeable replicas of the master repository servers are peers, providing global disaster recovery, delivering optimal performance, and eliminating downtime for planned and unplanned outages.

WANdisco Training and Support

WANdisco offers a range of training and support options, including a wide selection of webinars and training video modules for developers, administrators, and managers looking to boost their skills in Apache Subversion. More than a first line of defense, WANdisco’s team consists of experts in open source software providing in-depth professional assistance. Support centers around the globe enable WANdisco to provide secure, high quality live assistance to customers located anywhere in the world whenever they need it. Service options include 24/7 online, phone and email support, guaranteed response times, online delivery of fixes and enhancements, System Health Check, Bug Buddy, 8 Hours of Free Consulting or Training, Indemnification Coverage, and more. Free community support is also available. For more information, see: www.wandisco.com/support.

World Headquarters
5000 Executive Pkwy
Suite 270
San Ramon, CA 94583

Europe
Electric Works
Sheffield Digital Campus
Sheffield S1 2BJ

Asia Pacific
Level 6, Oomori StationBox Bldg
2-1-2, Sanno, Ota-City
Tokyo 143-0023 JAPAN

US Toll Free: 1-877-WANDISC (926-3472)
Outside US: +1-925-380-1728
EU: +44 (0)114 3039985
Email: [email protected]