Thinking Aloud About Confusing Code: A Qualitative Investigation of Program Comprehension and Atoms of Confusion
Dan Gopstein, New York University, New York, USA
Anne-Laure Fayard, New York University, New York, USA
Sven Apel, Saarland University, Saarland Informatics Campus, Germany
Justin Cappos, New York University, New York, USA

ABSTRACT

Atoms of confusion are small patterns of code that have been empirically validated to be difficult to hand-evaluate by programmers. Previous research focused on defining and quantifying this phenomenon, but not on explaining or critiquing it. In this work, we address core omissions in the body of work on atoms of confusion, focusing on the 'how' and 'why' of programmer misunderstanding.

We performed a think-aloud study in which we observed programmers, both professionals and students, as they hand-evaluated confusing code. We performed a qualitative analysis of the data and found several surprising results, which explain previous results, outline avenues of further research, and suggest improvements to the research methodology.

A notable observation is that correct hand-evaluations do not imply understanding, and incorrect evaluations do not imply misunderstanding. We believe this and other observations may be used to improve future studies and models of program comprehension. We argue that thinking of confusion as an atomic construct may pose challenges to formulating new candidates for atoms of confusion. Ultimately, we question whether hand-evaluation correctness is, itself, a sufficient instrument to study program comprehension.

CCS CONCEPTS

• Software and its engineering → Software usability.

KEYWORDS

Program Understanding; Think-Aloud Study; Atoms of Confusion

ACM Reference Format:
Dan Gopstein, Anne-Laure Fayard, Sven Apel, Justin Cappos. 2020. Thinking Aloud about Confusing Code: A Qualitative Investigation of Program Comprehension and Atoms of Confusion. In Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '20), November 8–13, 2020, Virtual Event, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3368089.3409714

© 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7043-1/20/11...$15.00.

1 INTRODUCTION

Previous work on atoms of confusion [11] introduced a methodology for discovering, measuring, and validating programmer misunderstanding in a precise way. An atom of confusion is the smallest snippet of code that will often confuse a programmer as to what the code's output is. Previous work measured the correctness rates of programmers hand-evaluating confusing snippets and compared the rates to those for functionally equivalent code hypothesized to be less confusing. Between the minimality of the code snippets and their comparison against controls, the research on atoms of confusion was designed to be both precise and accurate. Gopstein et al. [11] applied this protocol in an experiment with 73 participants and analyzed the results with modern statistical techniques.

The study performed by Gopstein et al. was significant in that it was empirical, objective, and quantitative. Code was found to be confusing or readily understandable based on experimentation, not theory; the observations were based on performance, not opinion; and the extent of confusion could be precisely quantified. Thus, the experiment was designed to maximize internal validity [20]. By using minimal code snippets, Gopstein et al. could be sure that they were measuring only precise code constructs. By using functionally equivalent code samples as controls, they were able to demonstrate a direct relationship between the code and programmer confusion.

Despite the precision and accuracy of Gopstein et al.'s design, it can tell us only the outcome of programmers' performance, not how or why they behaved that way. How can we know that the causes of confusion are those put forth by the researchers? How can we know that misunderstandings amongst multiple programmers are homogeneous? How can we even know that hand-evaluation captures all types of misunderstanding?

In short, Gopstein et al.'s strong focus on internal validity and objectivist rigor does not tell the whole story. We study the same fundamental code snippets and hand-evaluation protocols as Gopstein et al., but augment the setting by having programmers think aloud as they participate, followed by a semi-structured interview and discussion. This unique perspective on an existing methodological framework allows us to understand and scrutinize existing work. Our experience with conducting a qualitative study after a quantitative experiment leads us to believe the original experiment could likely have been improved if a lightweight qualitative study had been performed as a pilot during the design of the original quantitative experiment.

Our study offers insights into previous results as well as several surprising observations that contradict previous assumptions, including:

• The origins of incorrect beliefs about semantics differ across programmers.
• Errors evaluating atom-containing code are often caused by other, unrelated aspects of the code snippet.
• Correct evaluation of a snippet does not mean a programmer understood its semantics.
• Our study reveals new types of potential atoms.

In Section 4, we outline descriptions of how and why programmers made mistakes, or avoided doing so in surprising ways. This provides insight into how to more accurately interpret the results of Gopstein et al., as well as of other hand-evaluation program comprehension experiments. In Section 5, we turn an eye to future research and propose potential improvements and new research questions.

A complete replication package for this study is provided at https://atomsofconfusion.com/2020-think-aloud. The goal of the replication package is to facilitate the understanding of our methods and observations, as well as to encourage others to perform similar studies of their own. While we provide the outcomes of our analysis, we recommend that anyone using this package stays open to new themes that might emerge. The package contains:

• Preparatory material, including all code snippets used and the scripts enlisted to assign them to subjects.
• Interview instructions, including a pre-flight checklist, meta-protocol, and universal answer key used to increase the reproducibility of the semi-structured interviews.
• Raw data, including anonymized transcripts and scans of the subjects' written notes from each interview.
• Analysis, including the labels assigned to the transcripts during open coding, and the codebook used in that process.

to determine which code patterns were truly more confusing than their counterpart. Of the 19 proposed atoms, 15 met the statistical significance required to be considered a confirmed atom.

Following the original studies, the notion of atoms of confusion has been shown to be common in practice and correlated with negative code-quality indicators such as bug density and security vulnerabilities [12]. The concept has also been investigated with the open-source community through opinion surveys and pull requests [17]. This line of investigation confirms that atoms of confusion are indeed confusing and prevalent across several dimensions. However, there has yet to be an investigation into the mechanism by which that misunderstanding occurs. Our research sets out to explain the phenomena observed in previous studies.

In an effort to expand the concept of atoms of confusion beyond just the C language, Castor adapted it to the Swift programming language [4]. Castor used new methods of finding confusing patterns, such as measuring infrequency of occurrence in large code bases and soliciting expert opinion. We propose that observing programmers in a think-aloud study is an acutely effective means of identifying specific sources of misunderstanding.

Qualitative Research in Software Engineering. With the goals of understanding how confusion arises in programmers and improving methodologies for future research, we chose a qualitative method to explain previous results and explore potential new research designs. Despite the positives of Grounded Theory, we decided it was not a good fit for our study, as we were already familiar with pre-existing literature in the field, and the semi-structured nature of our inquiry was slightly too rigid to fully benefit from Grounded Theory. Still, we took many lessons both from primary sources on the technique [6, 9] and from descriptions written specifically for the software engineering field [22]. We used techniques recommended in these texts, such as continuous data analysis, (open) coding, and memoing. Perhaps the best high-level description