Gamification for Enforcing Coding Conventions

Christian R. Prause
DLR Space Administration
Königswinterer Str. 522-524, Bonn, Germany
[email protected]

Matthias Jarke
RWTH Aachen, Institut i5
Aachen, Germany
[email protected]

ABSTRACT
Software is a knowledge-intensive product, which can only evolve if there is effective and efficient information exchange between developers. Complying with coding conventions improves information exchange by improving the readability of source code. However, without some form of enforcement, compliance with coding conventions is limited. We look at the problem of information exchange in code and propose gamification as a way to motivate developers to invest in compliance. Our concept consists of a technical prototype and its integration into a Scrum environment. By means of two experiments with agile software teams and subsequent surveys, we show that gamification can effectively improve adherence to coding conventions.

Categories and Subject Descriptors
D.2.3 [Software Engineering]: Coding Tools and Techniques; D.2.9 [Software Engineering]: Management

General Terms
Human Factors, Documentation, Management, Measurement

Keywords
code style, gamification, experiment

1. INTRODUCTION
The standard ISO-9126 establishes a basic model of software quality based on six quality characteristics (e.g., functionality, maintainability), each subdivided into several sub-characteristics (e.g., analysability, changeability). These characteristics can be roughly divided into characteristics of external and internal quality. The famous iceberg metaphor illustrates this relationship: external quality is the iceberg's surfaced part that can easily be seen by users of the software. Internal quality is the part dangerously hidden below the surface. Among the submerged parts, the maintainability quality characteristic means how cost-effectively developers can continuously improve and evolve the software. Maintainability is broken down into the sub-characteristics analyzability, changeability, stability, testability and maintainability compliance. They describe how easily places to be changed and causes of faults can be located, how well future changes are supported and can be realized, how much the software avoids unexpected effects from changes, how well the software supports validation efforts, and how compliant the interior of the software is to maintainability standards and conventions. Maintainability is like an internal version of the usability quality characteristic, i.e., how easily developers can understand whether source code is suitable for certain purposes and how to use it for particular tasks and in specific conditions, how well inner workings can be learned, how much developers stay in control of the changes they make, how attractive the inner workings of the software appear to the developers, and how compliant they are to standards, conventions, style guides or regulations. When we speak of understandability, learning, attractiveness, and compliance in this paper, we mean this internal view and not the view from the end-user's external usability perspective.

Maintainability is important for the evolvability of software. We argue that it is related to knowledge and information exchange in the team. As a hidden quality characteristic, however, it is easily overlooked, and requires efforts and discipline that are not granted. In particular, human behavior is controlled by trading off the valences¹ of alternative actions all the time. If the valence of complying with conventions is not high enough, developers will easily be drawn to other activities. The dilemma of information exchange commences. The more freedom developers have, the more important such valence aspects are. Internal un-quality, also known as technical debt, increases (see Section 2).

We propose gamification as a solution. Gamification refers to the use of design elements characteristic for games in non-game contexts. It has gained considerable attention in domains like productivity, finance, health, education, news, and media. In this sense, gamification does not mean "play" but means to optimize one's behavior with respect to the rules, in the sense of mechanism design [18, 55]. In a real-world context (programming) with real-world effects (readability of code and rewards), gamification is what connects the real world on both sides using mechanisms known from games. For this approach to work, two different integrations must be achieved.

¹In psychology, the term valence means the intrinsic attractiveness or aversiveness (i.e., positive or negative valence, respectively) of an event, object, or situation [22].

PREPRINT. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive version was published in ESEC/FSE'15, August 30 – September 4, 2015, Bergamo, Italy. ACM 978-1-4503-3675-8/15/08. http://dx.doi.org/10.1145/2786805.2786806. Self-archived at: https://www.drprause.de/publications/

Firstly, a technical solution must be integrated into the development environment so that the necessary development data can be obtained and prepared for gamification. For every developer, a reputation score is calculated that averages the compliance with coding conventions of the files the developer contributed to. Secondly, the gamification must be integrated socially to affect the developers' valence judgments so that desirable behaviors will emerge more often. We rely on the typical game design elements of personal scores, transparent rules, a leader board, and real-world individual and group incentives that can be earned as a team (see Section 3).

In order to evaluate our game design, we set up two parallel experiments with teams following the agile Scrum methodology. We distributed postgraduate developers with similar skills onto two teams of green-field projects. Both teams would run through an intervention and a control phase, resulting in experimental between-group and within-group designs. Surveys using questionnaires were scheduled with each team at the end of their intervention period to capture qualitative and quantitative information about the gamification. Furthermore, participation in customer meetings and feedback from the project leaders enrich our perception of what happened during the experiments (see Section 4).

Evaluation of the experiments shows a clear effect of the interventions in both teams. Both teams invested the efforts necessary to achieve the designated convention compliance. The game elements had the desired effects, and feedback from the developers suggests that the investments in convention compliance were worth the efforts (see Section 5; threats to validity are discussed in Section 6).

Coding conventions are widely recognized as a means to improve the internal quality of software. Much work has been put into conventions, commercial tools relating convention compliance and technical debt, and the enforcement of conventions with techniques like reviews or strict tool-based enforcement. We compare our work to such approaches. The advantages of our approach are comparably low personnel efforts and higher flexibility on the developers' side. Furthermore, other research successfully applied gamification to shape developer behavior (see Section 7). Our contributions lead us to conclude that gamification works to improve coding convention compliance and is worth further research (see Section 8).

2. CODING CONVENTIONS
Software is a highly knowledge-intensive product. Its source code results from a non-algorithmic, experimental activity of discovery and invention. Developers identify issues, experiment with possible solutions, and come to decisions in a process involving justifications, alternatives and trade-offs [9]. Source code is "the only precise description of the behaviour of the system" [24], and it must be both executable by machines and understandable to humans. A developer must first understand the code before being able to successfully implement changes. Code therefore plays a key role in the information exchange between developers and for preserving knowledge. If the necessary knowledge is not preserved well and easy to retrieve for a human, i.e., readable, software becomes costly to maintain and evolve [14]. When internal quality decreases, efforts in re-work increase (called "un-quality" by Philip B. Crosby), and technical debt grows into a danger to the project [13]. Some therefore even call not putting efforts into such information exchange unprofessional "anti-team behaviour" [33]. If information exchange is so important, why are developers "notorious" [44] for their dislike of writing down knowledge in a readable form [26]?

The literature on programming lists countless different motives: time and schedule pressure; missing responsibility, education and professionalism; lack of incentives; carelessness, incompetence, "sapient" developers; human nature; additional (cognitive) effort; or operant conditioning, perceived boredom and pathological gambling. Writing down information in a readable way costs a developer his precious time and (cognitive) effort but has little immediate value for himself. The potential values pay off much later and mainly benefit others [11, 15, 23, 30, 32, 33, 36, 41, 43, 48, 49, 56].

We note that information exchange through code is an instance of the more general dilemma of information exchange (cf. [16]): A developer must constantly choose between implementing new functionality and spending efforts on information exchange. Writing helpful source code comments and thinking about the effect on readability of every single keystroke does not work on autopilot [48]. Developers subconsciously weigh up motives like time pressure, cognitive effort, and the joy of creating new functionality against other motives like the desire to conform to management expectations, pro-social information exchange, and the satisfaction of achieving quality. According to psychological theory, these factors, weighted by personal expectations and beliefs, sum up on the pro and con sides of the decision. The developer will decide for the alternative with the greatest expected valence.

Developers will disregard information exchange when its expected valence is lower than that of other activities. A social dilemma then commences if it is still valuable to the team as a whole. And indeed, the value for the sending developer is lower than for his audience: A developer who wrote the code himself will have a deep understanding of its functions, inner workings, side effects, and constraints. He will not need to read that knowledge from the code; at least not in the near future while his memory has not faded and the code remains unchanged (cf. [38]). However, other developers miss that knowledge. For example, they reinvent the wheel over and over again [28], wander like explorers of a new world looking for ways to solve problems [35], and cannot use their advanced skills [51]. They have to learn the code first to successfully implement changes.

Readability is a judgment of how easy source code is to understand, i.e., how easily the knowledge encoded in it can be retrieved by a human [14]. Coding conventions support information exchange by improving understandability and readability [50]. They facilitate communication and collaboration between developers by maintaining uniformity [31]. Nordberg compares conventions to homeowners' associations that support quality by reducing individuality. They prevent code from getting unreadable due to mixtures of unconventional styles [36] and therefore help to reduce what Graham [23] calls the common-room effect, which makes code written by different developers feel bleak and abandoned, and accumulate cruft.

The benefit of coding conventions often is not so much that each rule is well-grounded. In fact, many conventions implement common wisdom, memes of a project's coding culture, and subjective opinions about aesthetics [36]. Take Linus Torvalds' famous comment as an example:

"First off, I'd suggest printing out a copy of the GNU coding standards, and NOT read it. Burn them, it's a great symbolic gesture." From an objective point of view, many rules might not stand scientific scrutiny. Their benefit is that they reduce style deviations, which are nothing but distracting noise when reading code [50]. While a human-language text remains understandable in principle despite such deviations, who would seriously doubt the benefits of orthographic and typographic conventions like spelling, hyphenation, capitalization, word breaks, emphasis, punctuation, line length or spacing? Conventions are low-hanging fruits that are an important and pragmatic ingredient to making software better, which is supported by broad interest in the topic (see Section 7). We are therefore convinced that they are relevant for internal quality.

3. GAMIFYING COMPLIANCE
Gamification is related to mechanism design. It uses game design elements for a non-playful purpose in a non-game context. It does not aim to create a full-fledged game for enjoyment. Instead, it employs design elements that are known to work from games to set up structured rules for a competitive strife toward goals for its players [18]. Embedded in a real-world setting, the additional rules increase the valence of putting effort into desirable behaviors. As a result, developers will more often choose to comply with coding rules. Our gamification of coding conventions consists of two parts:

1. a technical component (the CollabReview prototype) that integrates into the software development process to mine the necessary data (Section 3.1), and
2. game elements that have valence for developers and motivate them to invest in complying with coding conventions (Section 3.2).

3.1 Technical Component: Obtaining Data
At the heart of the technical component is a reputation system. It gathers reputation statements by observing users' actions and collecting feedback. The statements are then integrated into a model of the developers' reputation, expressed as personal scores [21].
The personal reputation score reflects the average compliance of a developer's code with the coding conventions, ranging from +10 (fully compliant) to -10 (extremely deviant). For determining it, the reputation system must

1. determine the compliance of each source file (Section 3.1.1),
2. assess who is responsible for each file (Section 3.1.2), and
3. integrate both into the reputation score (Section 3.1.3).

Note that contribution quantity is not represented in the reputation score. The rationale is that we do not want to measure performance but to promote compliance.

3.1.1 Determining Compliance
Checkstyle is an open source static analysis tool for Java. It reports different rule violations like formatting, Javadoc-related, magic numbers, and many more, using four severity levels: ignore, info, warning and error. It is highly configurable and can be made to support almost any coding standard. The Sun Code Conventions and Google Java Style are supported natively [1]. We automatically compute the compliance judgment −10 ≤ q(f) ≤ 10 for each file f from the density of reported violations, i.e., the number of violations v_c in the top three severity levels divided by the number of non-empty lines c_sloc:

q(f) = max(10 − (10 × v_c) / c_sloc, −10)
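As an illustration of how the violation count v_c can be obtained, the following sketch counts the violations of the top three severity levels in a Checkstyle XML report (e.g., produced with Checkstyle's "-f xml" output option). The class name and the decision to read a report file rather than invoke Checkstyle's API are our own assumptions; the paper does not describe CollabReview's Checkstyle integration at this level of detail.

import java.io.File;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Counts violations of the top three severity levels in a Checkstyle XML report.
 *  Illustrative sketch only, not the CollabReview implementation. */
public final class ViolationCounter {

    // Checkstyle's four severities are ignore, info, warning and error;
    // only the top three are counted for q(f).
    private static final Set<String> COUNTED_SEVERITIES = Set.of("info", "warning", "error");

    public static int countViolations(File checkstyleXmlReport) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(checkstyleXmlReport);
        NodeList reported = doc.getElementsByTagName("error"); // one element per reported violation
        int count = 0;
        for (int i = 0; i < reported.getLength(); i++) {
            Element violation = (Element) reported.item(i);
            if (COUNTED_SEVERITIES.contains(violation.getAttribute("severity"))) {
                count++;
            }
        }
        return count;
    }
}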
3.1.2 Assessing Responsibility
The developer who added or last modified a line of code owns it and is responsible for it. Using an algorithm similar to Subversion's blame command, responsibility is mined from the code's revision repository. The idea is that any new version of a file is created by manipulating the contents of its parent. Reconstructing these manipulations means solving the string editing problem. It results in an edit script with the shortest series of changes (insertions, deletions and substitutions) that transforms the parent into the new version.
Our algorithm differs in three aspects from Subversion: First, it is indifferent to white space changes. Second, clone detection finds if a code block is moved unchanged within or between files, to retain original authorship. Third, the parent of a new version is any file or version that is most similar to it, which is not necessarily the direct previous version of the same file. The rationale behind all modifications is to retain the responsibility of contributors even though others may make minor changes to the same code [40].
The results of this process are responsibility ratios 0 ≤ r(d, f) ≤ 1, where r(d, f) is the responsibility of developer d for file f. r(d, f) is proportional to the number of characters in lines owned by d in f. For any given file f, Σ_d r(d, f) = 1, regardless of its size.

3.1.3 Computing Reputation Scores
On average, a developer should have a reputation score that approximates the average compliance of her contributions. For reasons of fairness, a file to which she has contributed more (or which is large) should have more influence on the score than a file for which she has less responsibility (or which is smaller). Reputation k(d) is defined as the average of the compliance of code q(f) over all files F in the project, weighted by size w(f) and the responsibility of the respective developer r(d, f):

k(d) = ( Σ_{f∈F} q(f) · r(d, f) · w(f) ) / ( Σ_{f∈F} r(d, f) · w(f) )

The source code of the CollabReview platform is available as open source (see [3]).
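To make the two formulas concrete, the following is a minimal sketch (not the actual CollabReview implementation) of how q(f) and k(d) can be combined. The class and record names, the use of plain maps for responsibility ratios, and the choice of the non-empty line count as the size weight w(f) are our own illustrative assumptions; violation counts are assumed to have been obtained beforehand, e.g., from a Checkstyle report as sketched above.

import java.util.List;
import java.util.Map;

/** Illustrative sketch of the compliance and reputation formulas from Section 3.1. */
public final class ReputationSketch {

    /** A source file with its analysis result: violations in the top three severity
     *  levels, the number of non-empty lines (c_sloc), and r(d, f) per developer. */
    public record FileStats(String path, int violations, int nonEmptyLines,
                            Map<String, Double> responsibility) { }

    /** Compliance q(f) = max(10 - 10 * v_c / c_sloc, -10). */
    public static double compliance(FileStats f) {
        if (f.nonEmptyLines() == 0) {
            return 10.0; // assumption: an empty file cannot violate any convention
        }
        double q = 10.0 - (10.0 * f.violations()) / f.nonEmptyLines();
        return Math.max(q, -10.0);
    }

    /** Reputation k(d): average of q(f) over all files, weighted by size w(f)
     *  and the developer's responsibility r(d, f). Here w(f) is taken to be c_sloc. */
    public static double reputation(String developer, List<FileStats> files) {
        double weightedSum = 0.0;
        double weightTotal = 0.0;
        for (FileStats f : files) {
            double r = f.responsibility().getOrDefault(developer, 0.0); // 0 <= r(d, f) <= 1
            double w = f.nonEmptyLines();                               // size weight w(f)
            weightedSum += compliance(f) * r * w;
            weightTotal += r * w;
        }
        return weightTotal == 0.0 ? 0.0 : weightedSum / weightTotal;
    }
}

In this sketch a developer with no responsibility in any file gets a neutral score of 0; how CollabReview handles that corner case is not specified in the paper.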

3.2 Social Component: Valence and Influence
Reputation scores are an essential component of the gamification. Yet their mere existence does not affect developers because, by itself, it has no valence for them. Research shows that users will value reputation quite exactly as much as it provides valence [29]. The elements of gamification are then what presents valence to users and lets them pursue higher reputation scores. Common examples of such elements are leader boards, levels or badges [18]. The choice of gamification elements must be calibrated to the specifics of the environment (cf. [42]). Farmer and Glass give a guide to designing gamification elements using reputation systems [21].
When we created the design, we had knowledge from previous experiments and studies. We had thoroughly studied theoretical work, conducted expert workshops, and interviewed software engineering experts to identify potential problems. Additionally, we had tested the technical aspects of the reputation computations to gain confidence in their correctness. In general, it is important to have a good understanding of the users and their organizational environment.

3.2.1 Personal Scores
Personal scores are the technical basis of most gamification designs. Choosing an environment where a score can be obtained is therefore essential. For example, responsibility can be difficult to determine in pair programming situations when two developers work together but only one commits to the repository.
Presenting individual scores lets team members develop a feeling of self-efficacy when they see how their actions affect their score. It helps them to learn and optimize their behavior and to understand the game. We therefore disclosed the personal scores to the developers.

3.2.2 Transparent Rules
In our experiment, developers would perform distributed development, mostly working off-site. Such a setting hinders informal interaction, information exchange, and the establishing of resilient social structures. Consequently, instructions regarding the game — like briefings, goals and reputation reports — would have to be complete, precise and self-explanatory. A gamification can easily fail due to a lack of understanding of how to "play".
Furthermore, communication of reputation scores is a critical ingredient. Scores must be communicated in a way that developers take note of them, easily understand them, and can relate them to their actions. An important element hence are intermediate reputation reports sent to the developers via email. Each report explains how scores are computed. It contains the relevant computation basis of the compliance of each file q(f) (including the reported violations) and the two main contributors with the highest responsibility r(d, f) for the file.
While the reports were generated automatically, we sent them manually as emails to the team's focal point, who was then responsible for distributing them further in the team. This manual process increases the visibility and perceived importance of the reports. For the same reason, we kept the total number of reports sent below ten.

3.2.3 Leader Board
In a survey we conducted earlier, the typical elements used in gamifications (levels, badges, leader boards) were deemed to be ineffective and to have only limited valence. A problem with leader boards, in particular, is that they emphasize the top ranks, while having little motivational influence on the vast majority of the remaining ranks in the middle. An exception are leader boards with negative reputation, which were strongly unpopular. For example, technical debt is valuable as a means of conveying overall information about internal quality. However, it is not suitable as reputation, i.e., everybody would have his own share of technical debt.
The compliance reports still contained a table of reputation scores. However, it was not intended to function as a leader board. Its purpose was to inform the team of its overall progress and to create transparency within the team. It shows developers their own and the others' scores to support self-coordination of group work, who needs support, and what has to be done to achieve the rewards.

3.2.4 Individual and Group Incentives
The main driving force of the gamification is a mixed individual and group incentive that can be earned either by an individual or by the whole team. Group incentives have the advantage over individual incentives that they have a stronger effect because members attempt to manage one another through peer pressure and other social sanctions. At the same time, this can lead to friction in the team. Adding individual incentives reduces this risk [25]. Furthermore, group rewards allow developers to help each other.
The group incentive was announced to be given to the whole team if each team member achieved at least a score of k(d) ≥ 9 (which is no more than one violation per 10 lines of code) by the end date. While the condition took into account the score of each developer individually, developers could help one another by fixing violations in someone else's files. Still, personal scores allowed the teams to manage conformance. The target threshold of 9 out of 10 possible reputation points was chosen not to be too easy but also not extremely difficult to achieve.
For the case that the group incentive was completely out of reach due to too many deniers, an individual incentive was also announced. This individual incentive would only be given to developers if the team failed to achieve the team goal. The purpose of the individual incentive was as a fallback and to reduce friction in the team if several developers would not consider the prize worthwhile achieving. In that case, the individual prize was announced to be given to the developers with the highest reputation scores at the end of the experiment. On the one hand, at least developers interested in the prize would be able to achieve it. On the other hand, a team of deniers could not decide as a whole to not go for the prize because the highest-scoring developers would still get it.
An effective reward must be valuable and seem attainable. In an earlier experiment, we offered prize money of several dozen Euros in a student lab for ranking first. As only one developer could earn the prize, the expectancy-value was perceived as too low, and consequently the reward turned out not to be attractive. We therefore offered the prize to the second ranked as well. Instead of money, we offered a bonus of one third of a grade on the final project grade. No reward was offered for the third rank because three willing developers were about one third of the members of a team. We considered three developers a critical mass that would be strong enough to push the whole team towards the group goal situation instead.
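As a small illustration of the reward logic described above, the following sketch evaluates the mixed incentive: the team reward is granted if every member reaches k(d) ≥ 9 at the end date; otherwise the two highest-scoring developers receive the individual reward. The class and method names are hypothetical; the paper does not prescribe an implementation of this rule.

import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical evaluation of the mixed group/individual incentive (Section 3.2.4). */
public final class IncentiveRule {

    public static final double TEAM_THRESHOLD = 9.0; // at most ~1 violation per 10 lines of code

    /** Returns the names of the developers who earn a reward at the end date. */
    public static List<String> winners(Map<String, Double> finalScores) {
        boolean teamGoalReached = finalScores.values().stream()
                .allMatch(score -> score >= TEAM_THRESHOLD);
        if (teamGoalReached) {
            // Group incentive: the whole team is rewarded.
            return List.copyOf(finalScores.keySet());
        }
        // Fallback: individual incentive for the two highest-scoring developers.
        return finalScores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(2)
                .map(Map.Entry::getKey)
                .toList();
    }
}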

4. SETUP OF THE EXPERIMENT
We aimed our experiment at the gold standard of experimentation: Out of two groups, one group receives the intervention while the other one does not. Ideally, any observed difference can then be attributed to the intervention. The two questions that drive our experiment are:
Principal Question: Does gamification of coding convention enforcement lead to improved compliance?
The results of a rigorous experiment alone should provide sufficient evidence for the above question. However, surveying the test subjects after the experiment can increase confidence in the results by revealing processes and motives that were in effect. Therefore, it is valuable to investigate how the gamification intervention itself is perceived by the subjects. For example, if the gamification is present in the subjects' minds during the experiment and is considered important, then it is quite probable that it also influences their behavior. Qualitative investigations of these motives can provide hints for improvement and further explanations.
Secondary Question: Does gamification cause undesirable social effects?
Another important aspect is that gamification should not have negative side effects on the team itself. For example, group incentives can cause friction [25], or the gamification could over-emphasize convention compliance. For investigating the second question, we analyze survey responses and have course instructors report suspicious incidents.

4.1 Frame Conditions
As part of the teaching activities of a local post-graduate teaching institution, two independent software teams executed separate projects following the agile Scrum methodology. The teams consisted of rather experienced developers, but also included some with only limited experience (see Section 4.2). To avoid potential problems with legacy code, both projects were green-field projects. It should be noted that the purpose of the projects was not the experiment; they were supposed to deliver actual functionality/software. The gamification experiment was only a piggyback. Therefore both projects had different functional topics, while still being set up very similarly from a developmental point of view.
The duration of the projects was more than two months, including 47 work days (excluding weekends and several public holidays). Each team was responsible for organizing itself, e.g., resource allocation or scheduling. However, one Scrum sprint retrospective and a review meeting with the instructors acting as customers had to take place at the end of each week. The course instructors and the experimenter were different persons. The experimenter only appeared as a guest at the respective intervention start and ending meetings.
The size and organization of the projects allowed for an experiment with both within-group and between-group designs. Team A starts its project while receiving the CollabReview intervention. During this time, the performance of Team A is compared to the performance of Team B (between-group). After about half of the project duration, the intervention ends for Team A and begins for Team B. At the end, the performance of Team B is compared to its prior performance without the intervention (within-group). The performance of both teams is monitored all the time.

4.2 Distribution onto Teams
For a successful between-group experiment, both groups need to be as similar as possible. We manually distributed the developers onto the two projects in advance.
In the end, the two teams consisted of nine and eight developers, respectively. Additionally, developers were distributed in such a way that different skill levels were evenly distributed. We determined skill levels by means of a short questionnaire that developers filled out. The questionnaire asked them to assess their Java coding skills, other coding skills, experience in Scrum, knowledge of software engineering processes, and practice with collaboration tools like Subversion. The collected self-assessments were combined into a single overall average skill score. (The combined per-developer score has been included for reference in the final results in Figure 2. Three developers joined Team A last-minute, so no self-assessment data is available for them.) In the end, average skill scores between the teams differed by less than five percent. On these scales, the developers considered themselves rather experienced in programming in general and with Java, and were knowledgeable of the software development process. They had some experience with collaboration tools but little with Scrum.
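Such a skill-balanced distribution can be obtained, for example, with a simple greedy assignment: sort developers by their self-assessed skill score and always add the next developer to the currently weaker team. This sketch is our own illustration under that assumption; the paper only states that developers were distributed manually so that skill levels were even.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Illustrative greedy assignment of developers to two skill-balanced teams. */
public final class TeamBalancer {

    public record Assignment(List<String> teamA, List<String> teamB) { }

    public static Assignment balance(Map<String, Double> skillScores) {
        List<String> teamA = new ArrayList<>();
        List<String> teamB = new ArrayList<>();
        double sumA = 0.0;
        double sumB = 0.0;
        // Strongest developers first, so the running totals stay close.
        List<Map.Entry<String, Double>> sorted = skillScores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .toList();
        for (Map.Entry<String, Double> dev : sorted) {
            if (sumA <= sumB) {
                teamA.add(dev.getKey());
                sumA += dev.getValue();
            } else {
                teamB.add(dev.getKey());
                sumB += dev.getValue();
            }
        }
        return new Assignment(teamA, teamB);
    }
}

A real assignment would additionally constrain the team sizes (here nine and eight developers) and handle developers without self-assessment data.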
4.3 Experiment Time Schedule
The end of the intervention for Team A and the start for Team B were fixed at half of the project duration, i.e., after about one month, i.e., 22 work days excluding weekends and public holidays. Development in both teams started after about two weeks because earlier software development life cycle phases (e.g., requirements elicitation) had to deliver early results first. While the intervention had been announced to Team A on the first day, the first compliance report was sent to Team A only after two weeks, when development was really ongoing (Day 10). The intervention ended for Team A, and started for Team B, on Day 23.

4.4 Intervention and Briefing
Each intervention started with an instructional email to the respective team's Scrum Master, who was responsible for circulating it within his team. The instructional emails described the gamification and contained the first compliance report. It displayed the developers' current reputation scores, and details on responsibilities and detected violations as hints for improving reputation.
The teams were also informed that any intermediate reports sent to them were not important for the reward. Only the final score at the end of the respective intervention period was relevant. That meant that both teams should have enough time to carefully approach a better score. Team B received the explicit advice to carefully improve style and fix problems over time, and not to heedlessly fix them immediately, or hope to fix them in the last few days.
The teams also received the following additional hints:
• If developers liked, they could use the static analysis tool themselves to check their code before check-in.
• Most IDEs like Eclipse are capable of automatically fixing style issues, which should easily let a developer get rid of several violations without manual effort.
• While every developer had his own score, they could help each other by fixing someone else's code. They could win individually or as a team.

4.4.1 Surveys after the Intervention
At the end of each team's intervention, a questionnaire was given to the developers to gain a better understanding of how CollabReview was perceived. We wanted to learn for future designs. Each question contained a 5-point Likert scale ("How much...") and a free text item ("Why was it...") for capturing feedback both quantitatively and qualitatively.

It asked how present scores were in the developer's consciousness, how important the gamification was for them, how fair it was, how well the developer understood its computation, how much the developer liked the way reputation was computed, how well it was fitting into the development process, how acceptable the gamification was, how much influence it had on the developer's way of programming, how much additional effort it caused for the developer, how much the code readability improved due to the intervention, whether it would have made a difference for the developer to only see a team's average score instead of individual scores, and whether the developer had privacy concerns. (The identifiers in italics re-appear in the next section.)

5. RESULTS AND EVALUATION
This section presents the collected reputation data, reflects on observations made during the experiment, reproduces results from the final questionnaire, and finally interprets and summarizes the results.

5.1 Collected Reputation Scores
Figure 1 shows how reputation developed over the course of the experiment. The continuous lines are the average reputation scores of Team A in red and Team B in blue. The "average" reputation is weighted by each developer's level of contribution to the project. It has the same value as if all code had been written by a single developer. In addition, the figure depicts all developers' individual reputation scores with dotted lines in the team's color.
Four events are explicitly marked with thick gray lines. These are the start and end dates of the intervention of Team A and Team B, respectively. Team A received their first compliance report (and the experimental instructions) on the tenth day, when their code base first contained more than 200 lines of code.
The average reputation of Team A does not change very much during the first two weeks of their intervention. Instead, an average reputation of about 5 seems to be achieved naturally without much attention. The reputation of Team A then suddenly goes up to the maximum possible score (10) in the last few days before End A. From then on, after the intervention, average reputation slowly begins to decrease.
As opposed to this, Team B starts out with an average reputation of 5. During the next weeks, reputation slightly lowers until reaching a "natural" 3 at the day of Start B. For two and a half weeks, Team B steadily works its way towards the team reward threshold of 9. They then use the remainder of their intervention period (until End B) to further improve their scores a bit, while never reaching the maximum of 10.
Both teams reached the average reputation needed to be awarded the team reward at the end of their intervention period. While Team A seems to have been more deadline-oriented, Team B slowly worked toward the goal and then maintained a safe margin over the threshold.
Translated into violation density, the reputation scores mean: At the start of their respective interventions, Team A had ≈ 0.65 violations per line of code, while Team B had ≈ 0.75. Until the end of their interventions, Team A improved compliance to zero violations per line of code, while Team B improved compliance to almost zero violations.
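These densities follow directly from inverting the compliance formula of Section 3.1.1 (as long as the score is not clamped at −10): for a score q, the violation density is

v_c / c_sloc = (10 − q) / 10

For example, a density of 0.65 violations per non-empty line corresponds to q = 10 − 10 × 0.65 = 3.5, a density of 0.75 to q = 2.5, and the team threshold of q ≥ 9 corresponds to at most 0.1 violations per non-empty line, i.e., one violation per 10 lines of code, as stated in Section 3.2.4.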
5.2 Observations
This subsection reflects on a few observations that we made during the experiment. As the teams were distributed and the conduct of the work was in their own responsibility, observations were restricted to the weekly Scrum review meetings. However, a few anecdotes emerged:
Developer A3 misread the briefing of the intervention when reputation scores were first revealed to his team. He reported spending the whole night trying to fix problems but finally gave up as he saw no chance of reaching the target score. In the Scrum review, he was frustrated and expressed feeling offended because he had worked so much and now feared he would not get the prize. But his team could appease him, as he had misread the briefing and still had enough time left to fix his reputation score.
During the next weeks, however, A3 did not improve his score very much. Instead, his team helped him fix problems with his code only the day before the deadline. In the Scrum retrospective at the end of the intervention he argued that the conventions were driving him mad as they were taking him too much time. He continued that it was only his team members who saved him. Interestingly, he followed up on his original statement several weeks later, when the intervention of Team A had long ended. He said that he had actually learned to appreciate the better readability resulting from convention compliance and, to his dismay, had noticed its decay after the intervention had ended.
B7 also stated that he had someone else improve the code for him. Indeed, it seemed to be a strategy in both teams to have certain members specialize in fixing compliance when the deadline approached. CollabReview leaves this flexibility to developers while at the same time encouraging them to efficiently and autonomously negotiate improvement work.
The members of Team A increased their scores overhastily right before their intervention ended. This could be the reason why they went for a perfect 10 instead of the necessary 9: They had no time left and had to get it right with one shot, so they did not risk a single violation. A problem with this approach was that they broke their software right before the presentation at the Scrum review. Some functionality got lost in the huddle.

5.3 Questionnaire Feedback
Figure 2 summarizes opinions about the intervention. As the items additional effort and concern capture negative opinions, the responses shown in the table were inverted for easier visual comparison. Row averages (in parentheses) are presented as an indication of "overall feeling" that should be treated with care. On average, Team B (3.9) perceived the intervention better than Team A (3.2).
Present: The valence of the team prize was a major reason why the gamification had a high presence for developers. But developers also said that they cared for the readability of their source code, found the intervention helpful, and wanted to have a high score. Some developers of Team A admitted to having started caring more for the compliance of their code when approaching the deadline. The idea of paying some attention all the time was more widespread in Team B.
Fair: The gamification was perceived as quite fair. The biggest concern was about the computation of scores. Developers with many contributions could (in absolute numbers) catch more violations, and thus have a lower score. Vice versa, fewer contributions meant fewer potential convention violations. These developers considered the neglect of quantity demotivating. For some developers, a concern was that Checkstyle did not support aspects like identifier names while being too picky about aspects like trailing spaces.

[Figure 1: line chart of reputation scores (1 to 10) over project days 0 to 46; vertical lines mark Start A, End A, Start B and End B.]

Figure 1: Average team scores (Team A red, Team B blue) and scores of individual team members (corresponding colors, dotted)

Indeed, this is a valid remark regarding Checkstyle, yet not so much regarding the enforcement. Other developers, however, found that Checkstyle was making fair judgments, and that scores were helpful and gave them the feeling of having achieved something.
Important: The gamification was important to the majority of developers. Reasons why they found it important were the prizes, wanting to finish on a top rank in the leader board, the desire to write clean code, and using it as a way to reflect on one's own work. Those who found it less important said that they preferred to focus on functionality.
Understand: The majority of developers felt that they understood how reputation was computed. A bit of experimenting and studying the reports was still necessary to understand the score and how to influence it. This aspect was particularly relevant for Team A because they started to take care of the scores rather late.
Like: In Team A, the gamification caused some distress for functionality-oriented developers due to their needing help from others. Developers in Team B generally liked the intervention from an educational point of view and because they were kept alert about readability. They had no feeling of a contest but felt they were collaborating nicely. They particularly enjoyed that all team members were encouraged to comply with the conventions.
Fitting: In general, CollabReview was perceived as fitting quite well into the development environment. However, developers noticed that the high pressure coming from the Scrum environment was further increased by one more aspect that they had to take care of. A recurring topic was that the compliance aspect was perceived as distracting (cf. [48]). In particular, developers wanting to focus more on functionality felt they were forced to invest in compliance as well. One developer criticized that individual scores would not fit well with group work. However, proponents held that clean code benefited the team as it was easier to read for all others, and that reputation scores were just the right motivation. Scores allowed them to see who cared about compliance, and to organize themselves to react accordingly. The scores fitted well into the project because functionality and readability were both in focus now, and it served the educational purposes as it supported learning.
Acceptable: Except for its potential of causing competition within the team and penalizing functionality-focused behavior, the gamification was received well. Developers pointed out its positive motivating effects, and accepted it due to its fairness, quality of measurement, its "telling the truth" and the possibility to win as a team.
Influence: Developers who felt they were writing clean code right from the start felt they were influenced very little. Those who felt influenced named reasons like the desire to write clean code, wanting to win the prize, or finding what they learned about style helpful. Some developers changed their behavior to take care of style right from the start and to develop in a more organized way. B8 admitted to having actively influenced others so that everybody would have a high score. While Team A had to dedicate the last days before the deadline to fixing styles, Team B adapted their way of working early and did not feel influenced.
Additional effort: Developers who integrated automated checks into their normal way of working reported little additional effort. They used IDE plug-ins to do most of the work and accepted that they had to dedicate a little extra time. Developers who fixed problems later reported higher efforts. They spent whole days fixing their own and others' code. The finickiness of Checkstyle made them review their code several times before committing it to the repository. Using IDE functions to format code and running Checkstyle themselves would have reduced this effort. Fixing compliance violations in code that they did not understand anymore was a major problem for them.
Readability improved: A controversy over whether readability had improved revolved around Checkstyle. Some developers accused it of not addressing issues like identifier naming but checking seemingly unimportant things like formatting. Yet, others found that quality had indeed improved. Convention compliance made the code easier to read, while the required Javadoc comments added to understandability.
Difference: Broad consensus was that individual scores made a difference compared to one overall score. Reputation was first perceived as interfering with collaboration.

But developers then reported to have learned how to use it as a tool to see who needed help and to identify developers from whom they could learn. It inspired them to get a good reputation score, and aroused some unexplainable "competitive spirit" in A2 that, as he said, made him work hard.
Concerns: Publishing scores was not seen as a privacy issue as long as the scores stayed inside the team. Instead, developers stated that they had nothing to hide and even published scores to others outside their team because it helped them to organize themselves and to better understand their team.

5.4 Answers to Research Questions
This subsection interprets the results with respect to the two research questions.

5.4.1 Does Gamification of Coding Convention Enforcement Lead to Improved Compliance?
Reputation depends directly on measured convention compliance. The history of the reputation scores clearly shows that the intervention had the intended effect on the individual and average scores of both development teams. It can be seen that for both teams, reputation scores were lower at intervention start than at its end. For both teams, the checking tool employed in the intervention found (almost) zero convention problems at the end. Furthermore, compliance increases for both teams at the end of their intervention as compared to the respective other team (at End A, Team A has better scores than Team B; at End B, Team B has better scores than Team A). We therefore conclude that gamification of coding conventions leads to improved compliance.
The surveys additionally support this finding: In both teams, the intervention was perceived as present and highly important. Moreover, the developers understood the gamification quite well and knew what to do. They felt that it had an active influence on them that would not have been there if individual scores had not been published.

[Figure 2: Likert-scaled opinions about the intervention, plus self-assessed skill (see Section 4.2). One row per developer (A1–A9, B1–B8) plus team and overall averages; columns: skill, present, fair, important, understandable, likable, fitting, acceptable, influence, additional effort (inverted), readability improvement, difference, concern (inverted), and overall average feeling. Table not reproduced in this copy.]

5.4.2 Does Gamification Cause Undesirable Social Effects?
The second research question, whether gamification causes undesirable social effects, is more difficult to substantiate. The reason is that only indirect data from surveys and observations is available. Indications that our gamification caused undesirable effects come mainly from Team A: the anecdote of the developer who got no sleep for one night, the slight breaking of product functionality before the presentation, and an overall tendency of a dislike of the gamification. A1 and A8 liked it, while A3 at first strongly disliked it and only later changed his mind (yet recognizing the benefits it had brought to his code). Potential problems were that some competition inside the teams was caused, that it caused understanding problems, and that it made some developers feel a bit indebted to colleagues who fixed their violations.
In Team B, the overall reception was very positive. For both teams, the gamification was welcomed. The only issues raised with acceptability were the (potential for) competition and the penalizing of mass contributors, and these were outweighed by motivating effects and perceived fairness. Only two developers reported slight concerns (about individual scores and about a too strong influence on their behavior), while all others were not concerned. No extreme developer friction or crowding out of primary project goals occurred. Quite the contrary, developers helped each other to reach the goals. Developers did not report the publishing of the scores as a privacy concern. The row averages regarding overall feeling indicate that no developer had strong adverse feelings, which would certainly have been the case if something bad had happened.
The results for Team A are not as good. This may have several reasons. The project managers (who had seen no reputation scores) gauged that, in general, Team A was weaker and less organized than Team B. Developers of Team A were mostly functionality-oriented. Moreover, they used a development framework that recommended slightly different coding styles, which led to controversial opinions about the usefulness of Checkstyle's conventions. By addressing their scores only lately, the hastiness caused additional troubles (broken features).
In summary, the gamification was perceived as rather fair, likable, fitting into the situation and acceptable, and caused only few concerns. Serious friction did not occur in the experiment; the broken-features issue was probably due to a general organizational weakness in Team A. Additionally, as a lesson from this issue, Team B was more carefully introduced to the intervention, emphasizing that enough time was left. We therefore negate the question and conclude that gamification does not cause undesirable social effects.

5.5 Lessons Learned
Compared to Team B, Team A reported costs we deem unnecessarily high, and from which we derive our first lessons: One reason might be organizational weaknesses, also observed by instructors, that might have become intensified. Use caution when introducing gamification! Secondly, coding rules were not calibrated to the third-party libraries used. The enforced conventions should be adapted to the project situation! Thirdly, Team A tried to achieve perfect 10 scores to not risk missing the goal (which was intentionally not set to perfect scores). However, according to the eighty-twenty rule, the last few percent are the most costly ones to achieve. The goal was therefore intentionally set below the perfect score (to 9); but this must also be recognized and realized by the team! Fourthly, higher effort resulted from not using automated tools for fixing problems like formatting. Hence, available tools should be employed by the teams where possible! Fifthly, Team A addressed violations not promptly but rather late, which is (i) usually more costly in itself, and (ii) leads to being able to benefit from readability improvements only lately. Weeks after answering our survey, A3 said he actually learned the value of cleaner code only after seeing its decay. So, gamification should reward and promote immediate adaptation of behaviors! Sixthly, for Team B, a positive difference between additional effort and benefits from readability improvements stands out. Feedback indicates that they had a relaxed and planned approach to dealing with reported violations. They fixed issues during programming, saw it as a normal task they had to dedicate time to, and used tools to automatically fix most problems. Gamification works best in a relaxed, planned but continuous way!

5.6 Summary
The CollabReview concept was successfully validated in two projects, where it fitted well into Scrum development. The primary goal of increasing coding convention compliance was clearly reached. It was seen as a good support for achieving better readability. Developers reported that it made them consider both functionality and readability when programming. The interventions were perceived well by the developers and we found only few indications of adverse effects. As a result of the experiment, a reasonable readability improvement was reached cost-effectively.

6. THREATS TO VALIDITY
Regarding internal validity, the evaluation was designed towards the gold standard of experimentation. Results obtained with and without intervention were clearly distinguishable. A between-groups and a within-group experimental design were realized. The developers had similar educational backgrounds and were assigned to the two groups in such a way that a similar average skill level was present in both groups. Of course, both groups did not consist of perfectly equal individuals. The work context of both teams was comparable because the projects were similar with respect to process model (Scrum), tool support, programming language (Java), expected quality level (prototype), domain (web application), the same project managers, and the like. The size of the groups was quite small in statistical terms but a typical size for agile development. The results from the conducted surveys are congruent with the statistical findings, so that they support confidence. Developer opinions are evaluated partly based on averages of Likert data. There is an ongoing controversy about Likert data and parametric statistics. However, Norman concludes that statistics are sufficiently robust and safe to use with Likert data, and that this assumption is backed by empirical literature [37].
Regarding external validity, the environment of the experiment was designed to reflect a typical development environment. But recruiting industrial development teams for experimental research is almost impossible [54]. The results of our research might therefore not simply extend onto highly professionalized industrial projects. Replication is necessary. However, besides such environments, there is a lot more software development going on in different domains that is often overlooked, for example, in teaching, start-up projects, amateur open source, or scientific programming (cf. [34]). Such environments are much more similar to the environment in which our experiment took place.
Some lasting learning effects may have occurred. The decay in reputation after the intervention for Team A supports that it was the intervention which was responsible for the increase. But reputation does not fall abruptly after the intervention although much more code was produced. It only decreases by about 15%. This observation is congruent with the developers' statements that they had learned things like immediately writing comments, or keeping an eye on style.
The compliance report listed for each file the main contributor, the file's compliance rating and the detected violations. This information helped developers to understand sufficiently well how scores were computed. The developers uttered concerns about the finickiness of Checkstyle. Still they found that it led to more readable code.

7. RELATED WORK
Coding conventions are often relied on as a first measure to improve internal quality (cf. [36]). Therefore, it is no surprise that they are a dime a dozen (cf. [31]). Famous examples are the GNU coding conventions or MISRA-C, which forbids constructs that tend to be unsafe. This is not to say that internal quality is all about coding conventions. For example, see the discussion of technical debt and its many derivatives like design, testing or documentation debt [13]. Regarding source code, researchers have looked into aspects like the problem of complexity (cf. [53]) and, complementary to it, readability [14].

657 readability [14]. Nevertheless, an important ingredient to ing code noise. However, this information exchange aspect readability is coding conventions (see Section 2). is often neglected in practice. Investing efforts in making As of today, it is not possible to fully automatically as- code more easily understandable has a high cost for the in- sess the readability or quality of code. However, style vi- dividual but benefits are spread over the whole team. This olations and smells are significant hindrances for readabil- imbalance is a major reason for the appearance of the infor- ity [50]. Consequently, readability can be improved through mation exchange dilemma. We coding conventions that judged through style violations and code smells like white space, line length, comments, magic • relate problems of unclean code to the information ex- numbers and others. Buse and Weimer [14] showed that change dilemma, text-level analysis, on average, is better at judging read- • propose gamification as a way to motivate developers ability than any human assessor. Lee et al. [31] analyzed to comply to coding conventions, several open source projects with respect to Checkstyle and the above readability metric. They empirically substanti- • design a solution for implementing such gamification, ated claims that consistent style and readability are corre- lated; i.e. that less readable source files have a higher rule • explain how to integrate it into the social environment violation density. Smit et al. [47] assessed the importance of a development process, and of 71 rules from code conventions for maintainability, and • conducted two experiments with agile teams where found that they can be discriminated into more and less se- gamification improved convention compliance, without vere rules. The metric we use for compliance is that of Smit major detrimental social effects. et al. but without considering severity levels. Due to this important role of coding conventions, professional tools like Gamification was accepted as it gave developers a tool for SonarQube [4] or SQuORE [5] rely on conventions checking self-management. Developers noted additional effort and tools (e.g., Checkstyle) to assess internal quality. They even potential hassles, which could, however, be controlled by go as far as quantifying technical debt in dollars based on better (self-)organization, support of respective tools, prompt this. However, Boogerd and Moonen find that some coding fixing of violations, and not taking it too serious. Checkstyle rules can also have detrimental effects to reliability [12]. was perceived as valuable although it has some shortcom- Various kinds of review processes, more or less formal ings: being picky but dumb (e.g., not checking the sensible- ones, and distributed and asynchronous ones, using auto- ness of identifier naming) and requiring adaptation to the matic scheduling, exist to enforce coding standards and guide- peculiarities of the project. Overall, developers welcomed lines (cf. [10]) Our approach to enforcing conventions dif- gamification for improving readability, for supporting self- fers from others in that it does not aim to enforce total organization, and for causing a lower additional cost than compliance to conventions in each and every situation but the value of its resulting benefits. still pushes developers towards compliance. 

Online communities like Coderwall or TopCoder [2, 6] have long applied gamification to developers. In communities of web developers, increasing rewards leads to increased contributions of the desired kinds [19]. Anonymous bidding for implementations can even establish a market equilibrium for quality improvements [8]. Singer and colleagues apply gamification in development teams to promote good development practices like frequent commits [45, 46]. Dencheva et al. used gamification to improve contribution to a knowledge management platform, taking quantity into account as well [17]. Contributor analysis as input to reputation systems is, for instance, researched in the domain of wikis [7, 27].

8. CONCLUSION
Source code is a medium of communication and the only precise documentation of implementation details in software development. It therefore has a central role for the preservation of valuable knowledge. Following coding conventions can make code more easily understandable for other developers by constraining expressive freedom and avoiding distracting code noise. However, this information exchange aspect is often neglected in practice. Investing effort in making code more easily understandable has a high cost for the individual, but benefits are spread over the whole team. This imbalance is a major reason for the appearance of the information exchange dilemma. We

• relate problems of unclean code to the information exchange dilemma,
• propose gamification as a way to motivate developers to comply to coding conventions,
• design a solution for implementing such gamification,
• explain how to integrate it into the social environment of a development process, and
• conduct two experiments with agile teams where gamification improved convention compliance, without major detrimental social effects.

Gamification was accepted as it gave developers a tool for self-management. Developers noted additional effort and potential hassles, which could, however, be controlled by better (self-)organization, support of respective tools, prompt fixing of violations, and not taking it too seriously. Checkstyle was perceived as valuable although it has some shortcomings: being picky but dumb (e.g., not checking the sensibleness of identifier naming) and requiring adaptation to the peculiarities of the project. Overall, developers welcomed gamification for improving readability, for supporting self-organization, and for causing a lower additional cost than the value of its resulting benefits.

Our implementation may be better suited to environments where rigid quality controls are not in place and where functionality deadlines are not too tight, e.g., academic teaching, amateur open source, or scientific programming. At least, we are convinced that our study is representative for such environments. In the future, we want to replicate our experiments in other environments to make a transition towards industrial settings, and run trials with more long-term projects. Our results will hopefully serve us to convince further (industrial) development teams to try this approach.

Gamification may grant more flexibility for managing quality. It allows developers to fix each other's code, while effectively motivating them to invest in compliance themselves. It might replace rigid enforcement mechanisms for coding conventions in some cases. By motivating developers to invest in compliance, they have the flexibility to adapt their compliance efforts to other project pressures. This might be more efficient because if a fixed compliance level is set at too low a value, then it will not promote sufficient investments in compliance; if it is set too high, it will unnecessarily burden developers. Compliance efforts should be fair, and the general conditions should be set in such a way that the required efforts are borne by all and are not only at the expense of responsible team members.

9. ACKNOWLEDGMENTS
We would like to thank the experiment participants for their cooperation, Andreas Zimmermann and Alexander Schneider for kindly hosting the experiments, Markus Eisenhauer for his advice with the experimental design, and Katrin Wolter and many anonymous reviewers for their helpful comments.

10. REFERENCES
[1] Checkstyle. http://checkstyle.sourceforge.net/.
[2] Coderwall. http://coderwall.com/.
[3] Collabreview. http://sourceforge.net/projects/collabreview/.
[4] SonarQube. http://www.sonarqube.org/.
[5] SQuORE. http://www.squoring.com/.
[6] TopCoder. http://www.topcoder.com/.
[7] B. T. Adler, K. Chatterjee, L. de Alfaro, M. Faella, I. Pye, and V. Raman. Assigning trust to Wikipedia content. In Intl. Symp. on Wikis, 2008.
[8] D. F. Bacon, Y. Chen, D. Parkes, and M. Rao. A market-based approach to software evolution. In OOPSLA Companion. ACM, 2009.
[9] S. C. Bailin. Software development as knowledge creation. Int. J. of Applied Software Technology, 3(1):75–89, 1997.
[10] M. Bernhart, S. Reiterer, K. Matt, A. Mauczka, and T. Grechenig. A task-based code review process and tool to comply with the DO-278/ED-109 standard for air traffic management software development: An industrial case study. In Intl. Symp. on High-Assurance Systems Engineering (HASE), pages 182–187, 2011.
[11] B. Boehm. Get ready for agile methods, with care. IEEE Computer, 35:64–69, 2002.
[12] C. Boogerd and L. Moonen. Assessing the value of coding standards: An empirical study. In ICSM, pages 277–286. IEEE, 2008.
[13] N. Brown, Y. Cai, Y. Guo, R. Kazman, M. Kim, P. Kruchten, E. Lim, A. MacCormack, R. Nord, I. Ozkaya, R. Sangwan, C. Seaman, K. Sullivan, and N. Zazworka. Managing technical debt in software-reliant systems. In Future of Software Engineering Research. ACM, 2010.
[14] R. P. L. Buse and W. R. Weimer. Learning a metric for code readability. IEEE Transactions on Software Engineering, 36(4):546–558, July 2010.
[15] C. Connell. Creating poetry in software code. Boston Globe, 1999.
[16] U. Cress, J. Kimmerle, and F. W. Hesse. Information exchange with shared databases as a social dilemma: The effect of metaknowledge, bonus systems and costs. Comm. Research, 33(5):370–390, 2006.
[17] S. Dencheva, C. R. Prause, and W. Prinz. Dynamic self-moderation in a corporate wiki to improve participation and contribution quality. In ECSCW. Springer, 2011.
[18] S. Deterding, D. Dixon, R. Khaled, and L. Nacke. From game design elements to gamefulness: Defining "gamification". In Intl. Academic MindTrek Conf.: Envisioning Future Media Environments. ACM, 2011.
[19] D. DiPalantino and M. Vojnovic. Crowdsourcing and all-pay auctions. In 10th Conference on Electronic Commerce. ACM, 2009.
[20] E. Murphy-Hill, T. Barik, and A. P. Black. Interactive ambient visualizations for soft advice. Information Visualization, 12(2):107–132.
[21] F. R. Farmer and B. Glass. Building Web Reputation Systems. O'Reilly, March 2010.
[22] N. H. Frijda. The Emotions (Studies in Emotion and Social Interaction). Cambridge University Press, 1987.
[23] P. Graham. Hackers & Painters: Big Ideas from the Computer Age: Essays on the Art of Programming. O'Reilly & Associates, Inc., 2004.
[24] M. Harman. Why source code analysis and manipulation will always be important. In SCAM. IEEE Computer Society, 2010.
[25] R. L. Heneman and C. V. Hippel. Balancing group and individual rewards: Rewarding individual contributions to the team. Compensation & Benefits Review, 27(63):63–68, 1995.
[26] J. D. Herbsleb and D. Moitra. Global software development. IEEE Software, 18(2):16–20, 2001.
[27] B. Hoisl. Motivate online community contributions using social rewarding techniques: A focus on wiki systems. Master's thesis, Vienna University, 2007.
[28] W. S. Humphrey. Managing Technical People. Addison-Wesley, 1997.
[29] A. Jøsang, R. Ismail, and C. Boyd. A survey of trust and reputation systems for online service provision. Decision Support Systems, 43(2):618–644, 2007.
[30] P. Kruchten, R. L. Nord, and I. Ozkaya. Technical debt: From metaphor to theory and practice. IEEE Software, 29(6):18–21, 2012.
[31] T. Lee, J. B. Lee, and H. P. In. A study of different coding styles affecting code readability. IJSEIA, 7(5):413–422, 2013.
[32] D. Leonard-Barton. Implementing structured software methodologies: A case of innovation in process technology. Interfaces, 17(3):6–17, May/June 1987.
[33] A. Lynex and P. Layzell. Organisational considerations for software reuse. Annals of Software Engineering, 5:105–124, 1998. doi:10.1023/A:1018928608749.
[34] Z. Merali. Computational science: Error, why scientific programming does not compute. Nature, 467(7317):775–777, 2010.
[35] P. Merson. Ultimate architecture enforcement: Custom checks enforced at code-commit time. In Conference on Systems, Programming, & Applications: Software for Humanity, pages 153–160. ACM, 2013.
[36] M. E. Nordberg, III. Managing code ownership. IEEE Software, 20(2):26–33, March/April 2003.
[37] G. Norman. Likert scales, levels of measurement and the "laws" of statistics. Adv. in Health Sci. Educ., 15:625–632, 2010.
[38] C. Parnin. A cognitive neuroscience perspective on memory for programming tasks. In PPIG, 2010.
[39] R. Plösch, H. Gruber, C. Körner, and M. Saft. A method for continuous code quality management using static analysis. In QUATIC, pages 370–375. IEEE Computer Society, 2010.
[40] C. R. Prause. Maintaining fine-grained code metadata regardless of moving, copying and merging. In SCAM. IEEE Computer Society, 2009.
[41] C. R. Prause and Z. Durdik. Architectural design and documentation: Waste in agile development? In ICSSP. IEEE, 2012.
[42] C. R. Prause, J. Nonnen, and M. Vinkovits. A field experiment on gamification of code quality in agile development. In PPIG, 2012.
[43] P. Seibel. Coders at Work: Reflections on the Craft of Programming. Apress, 2009.
[44] B. Selic. Agile documentation, anyone? IEEE Software, 26(6):11–12, Nov/Dec 2009.
[45] L. Singer and K. Schneider. Influencing the adoption of software engineering methods using social software. In ICSE. IEEE, 2012.
[46] L. Singer and K. Schneider. It was a bit of a race: Gamification of version control. In Intl. Workshop on Games and Software Engineering, 2012.
[47] M. Smit, B. Gergel, H. J. Hoover, and E. Stroulia. Code convention adherence in evolving software. In Intl. Conf. on Software Maintenance (ICSM), pages 504–507. IEEE, 2011.
[48] D. Spinellis. Reading, writing, and code. ACM Queue, October 2003.
[49] D. Spinellis. Code documentation. IEEE Software, 27:18–19, 2010.
[50] D. Spinellis. elyts edoc. IEEE Software, 28:104–103, March 2011.
[51] E. Tryggeseth. Report from an experiment: Impact of documentation on maintenance. Empirical Software Engineering, 2:201–207, 1997.
[52] S. Vashishtha and A. Gupta. Automated code reviews with Checkstyle, part 2, November 2008.
[53] E. J. Weyuker. Evaluating software complexity measures. IEEE Trans. Software Eng., 14(9):1357–1365, 1988.
[54] E. J. Weyuker. Empirical software engineering research: The good, the bad, the ugly. In Symp. on Empirical Softw. Eng. & Measurement. IEEE, 2011.
[55] J. R. Whitson. Gaming the quantified self. Surveillance & Society, 1:163–176, 2011.
[56] S. Wray. How pair programming really works. IEEE Software, 27:50–55, 2009.