Masaryk University Faculty of Informatics

Usability Analysis of TLS API Documentation

Master’s Thesis

Bc. Matěj Grabovský

Brno, Spring 2020

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Bc. Matěj Grabovský

Advisor: RNDr. Martin Ukrop

Acknowledgements

I thank my dearest Kateřina for her fair gaiety, affectionate assiduities, firm support and constant sense of belonging. This work would not be presented here as is without her presence and input. I thank my advisor Martin for his boundless patience, invaluable advice, wise guidance and support in all phases of our work. My thanks also go to Lydia for her priceless help and advice on writing and data analysis. Though many of my friends and family remain unnamed here, my gratitude goes to them undiluted, for they supported me selflessly throughout the time and provided a much needed relief and distraction innumerable times.

Abstract

This thesis deals with the usability of selected TLS libraries, their documentation, code samples and related resources. We describe the design and results of our initial exploratory study and a follow-up study, both performed with students of the Faculty of Informatics. The results show that the selected libraries’ documentation suffers from many usability problems. Although these issues are not specific to TLS libraries, their consequences might be more serious. Quantitative rating of the severity of selected problems suggests that the libraries are comparable. Further research is necessary in order to determine what kind of examples are most useful to developers and, especially, how to compel developers to heed the research community’s recommendations.

Keywords

developer experience, documentation, human factors, usability, usable security, user study, TLS

Contents

1 Introduction

2 Exploratory Study
2.1 Design and Methods
2.2 Data Analysis
2.3 Setting and Participants
2.4 Results

3 Methodology
3.1 Research Questions
3.2 Study Design
3.3 Task Design
3.4 Setting
3.5 Participants
3.6 Data Collection, Processing and Analysis
3.7 Limitations

4 Results
4.1 Identified Issues
4.2 Severity of Obstacles
4.3 Other Sources of Information
4.4 Supplementary Analyses

5 Discussion
5.1 Obstacles
5.2 Other Sources

6 Related Work

7 Conclusion
7.1 Future Work

Bibliography

A Pre-Task Questionnaire

B Post-Task Questionnaire

C Assignment Microsite Snapshot

1 Introduction

TLS and Usability. Transport Layer Security (TLS) is a family of security protocols for secure, encrypted and authenticated communication in computer networks. It is most widely deployed in the HTTPS (HTTP over TLS) protocol, which serves hundreds of millions of websites around the world, with the numbers growing steadily. The Let’s Encrypt (2020) initiative claims to have provided TLS certificates to 225 million websites. According to (2020), more than 90% of all pages loaded in the Google Chrome browser are served over HTTPS. NetMarketShare (2020) estimates 87% of all web traffic is encrypted. The importance of TLS for the modern Internet is indisputable. However, several studies have shown that the protocol implementations are misused and misunderstood by developers, risking their users’ data and privacy (Egele et al. 2013; Georgiev et al. 2012; Krüger et al. 2018). Moreover, it has been shown that security and cryptographic interfaces in general are hard to use and too easy to misuse (Acar, Backes, Fahl, Garfinkel, et al. 2017; Iacono and Gorski 2017; Nadi et al. 2016). Even professional developers tend to struggle with writing code that is both secure and functionally correct (Acar, Stransky, et al. 2017; Georgiev et al. 2012; Krombholz et al. 2017).

Exploratory Study. It is thus clear that the usability of security interfaces, their documentation and other resources is essential for their practical security. In order to generate new research questions at the intersection of TLS and developer experience (DX) research, we performed a small qualitative exploratory study in 2018. Nine IT students participated in the study, tasked to create a small application with three established C-language TLS libraries. The results revealed and confirmed a few usability factors considered by developers when working with them. They also hinted at the indispensability of quality documentation and code samples, in line with previous research (Acar, Backes, Fahl, Garfinkel, et al. 2017; Nadi et al. 2016).

Follow-up Study. We therefore decided to narrow down on this particular aspect of library usability in a larger, more focused follow-up study. We designed a mixed-methods user study, again with IT students at our university as participants. We aimed to identify common issues in documentation, quantify the severity of some issues, and observe patterns in the use of external sources of information. This time, we selected two popular libraries for each of two programming languages, diversifying the field of possibilities. We also opted for a more realistic task of modifying an existing application rather than creating a new one. We then analysed the results of our follow-up study, identified the causes and implications of various issues, and situated our findings within the existing body of research.

Thesis Outline. The thesis is organised as follows. Chapter 2 lays out the design, execution and results of our initial exploratory study on the usability of TLS programming interfaces. In Chapter 3, we set the research questions for our follow-up study and detail the methods we chose to answer them with. Chapter 4 then lists the results from our analyses of the qualitative and quantitative data we collected. In Chapter 5, we situate the results within a wider context, interpret them and speculate on possible implications. Chapter 6 surveys the related research and shows where our work fits in. Finally, Chapter 7 sums up our findings, their possible limitations and potential directions for further research.

Acknowledgement of Contributions

It bears mentioning that the work presented here is in part the result of a collaboration with my advisor, RNDr. Martin Ukrop, and a consultant, Dr.-Ing. Lydia Kraus. They both advised me and provided feedback in all stages of the research process – designing the studies, analysing and understanding data, and presenting results. It would be too difficult to list their contributions precisely, so I present here merely a short summary of the key moments. Martin Ukrop helped design both studies and the tasks, administered the assignments in both experiments, anonymised the collected data and preprocessed them before the analysis phase. He also served as the second coder in coding qualitative data.

Lydia Kraus assisted with the analysis of qualitative data, creating the codebook and calculating interrater agreement in the exploratory study. She also advised me in various stages of analysing the data in the follow-up study. Moreover, we wrote the work-in-progress short paper presenting our exploratory study collaboratively. Unless otherwise noted, all the remaining work presented on the following pages is my own with all due responsibility.

2 Exploratory Study

While the usability of lower-level cryptographic libraries has been under active scrutiny by the usable security and DX research community (Acar, Backes, Fahl, Garfinkel, et al. 2017; Arzt et al. 2015; Egele et al. 2013; Nadi et al. 2016), the issues specific to TLS-related APIs and the usability of concrete TLS libraries have not yet been thoroughly investigated, to our knowledge. Therefore, following the agenda set forth by Acar, Fahl and Mazurek (2016), we designed and performed a small exploratory user study with the aim to uncover potential TLS-specific usability issues. We limited our scope to TLS libraries, their APIs, their documentation and other related resources.

2.1 Design and Methods

Since we were aiming to generate new research questions and hypotheses in this area, we opted for a purely qualitative study with open-ended data. Qualitative research methods had been used successfully in several studies before (Iacono and Gorski 2017; Krombholz et al. 2017; Naiakshina, Danilova, Tiefenau, Herzog, et al. 2017; Naiakshina, Danilova, Tiefenau and Smith 2018), which reinforced our choice of methodology. Due to the exploratory nature of our endeavour, we established no research questions a priori. Instead, we designed a programming task to be completed by the participants in three different libraries with a subsequent written report focusing on elements of the usability of the API. We chose three C libraries with TLS support for our experiment: OpenSSL, GnuTLS and mbed TLS (formerly PolarSSL). At the time of writing, OpenSSL is arguably the most popular cryptographic library (per Internet-wide scans; Nemec et al. 2017). GnuTLS aims to be a full-featured GPL-compatible alternative, although with a substantially different API. The API of mbed TLS is also distinct and we expected interesting data could be generated from comparisons with the two mainstream libraries. The programming task consisted of implementing a simple HTTPS client with each of the libraries, starting from a basic code skeleton with library initialisation and clean-up code.

We specified that the client connect to www.example.com, enforcing TLS version at least 1.2, verify its certificate and gracefully close the connection. We explicitly required that the certificate check include at least a check for expiration, hostname match, chain validity and revocation status (either via CRL or OCSP). For testing the implementations, we recommended the BadSSL website¹, which serves subdomains presenting crafted server certificates with various flaws (such as an expired certificate or a revoked one) and server settings (e.g. different ciphersuites or TLS versions). We then asked the participants for an open-ended written reflection at least one page long with 5–10 specific points comparing the APIs. We also provided a set of twelve questions sampled from a previously published usability questionnaire (Wijayarathna, Arachchilage and Slay 2017) for inspiration. These were questions such as “Do you find the API abstraction level appropriate to the task?”, “How easy is it to stop in the middle of work and check the progress?” or “When reading code that uses the API, is it easy to tell what each section of code does?”
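The testing procedure recommended above can be sketched as a small harness that pairs each required check with a BadSSL endpoint that should fail it. This is only an illustration written with the Python standard library (the study itself used C libraries); the names `BAD_ENDPOINTS`, `make_context`, `handshake_rejected` and `run_checks` are hypothetical, and the listed hostnames are assumed to be served by badssl.com as they were at the time of writing.

```python
import socket
import ssl

# Required check -> (host, port) of a BadSSL endpoint expected to FAIL it.
# The hostnames are assumptions about what badssl.com currently serves.
BAD_ENDPOINTS = {
    "expiration": ("expired.badssl.com", 443),
    "hostname match": ("wrong.host.badssl.com", 443),
    "chain validity": ("untrusted-root.badssl.com", 443),
    "revocation status": ("revoked.badssl.com", 443),
    "TLS version >= 1.2": ("tls-v1-1.badssl.com", 1011),
}


def make_context():
    """Client context with system roots, hostname checking and TLS >= 1.2."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def handshake_rejected(host, port, ctx, timeout=10.0):
    """Return True if the TLS handshake with host:port is refused by the client."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return False  # handshake succeeded, so the flaw went unnoticed
    except ssl.SSLError:
        # Includes ssl.SSLCertVerificationError and protocol version failures.
        return True


def run_checks(probe=handshake_rejected):
    """Map each required check to True if the flawed endpoint was rejected."""
    ctx = make_context()
    return {
        check: probe(host, port, ctx)
        for check, (host, port) in BAD_ENDPOINTS.items()
    }
```

Calling `run_checks()` needs network access. Note that a default Python context performs no revocation checking, so the revocation entry would be expected to pass the handshake unless CRLs are loaded and a flag such as `ssl.VERIFY_CRL_CHECK_LEAF` is set.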

2.2 Data Analysis

We first tested each submitted application for task compliance, i.e. whether it connects to www.example.com, whether it enforces TLS 1.2 for the connection and whether all the required checks are performed. We tested the latter by modifying the source code and pointing the URL to the respective subdomains of badssl.com. Some of the submissions either failed to compile, or failed to connect to www.example.com or to the homepage of badssl.com – these cases were then not further tested.

1. https://badssl.com

The submitted reports were then subject to thematic analysis using an inductive approach. We split the reports into roughly 240 chunks, or fragments, approximately corresponding to sentences or meaningful, informational statements. Two researchers processed all the data using initial coding, which revealed some codes correlated with the inspirational questions. We then discussed the created codes, applied focused coding to the data to identify repeating themes and compiled a codebook from the most prominent ones. A third, independent coder was invited to apply the coding scheme to the same data. We then measured interrater agreement (Cohen’s κ = .62, p < .001), which turned out to be substantial (Landis and Koch 1977).
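To make the agreement statistic concrete, the following sketch computes Cohen’s κ for two coders and maps the value onto the bands of Landis and Koch (1977). It assumes, for simplicity, that each fragment receives exactly one code from each coder; the function names and the example labels are ours and are not the study’s data or tooling.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical codes."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the two raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal code frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical code throughout
    return (p_o - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Verbal label for kappa following Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"
```

For instance, `landis_koch(0.62)` returns `"substantial"`, which matches the interpretation given above.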

2.3 Setting and Participants

The study took place in autumn 2018 at the Faculty of Informatics, Masaryk University. The task was assigned as homework in the course PA193 Secure coding principles and practices. The course focuses on software security from the programmer’s point of view and discusses topics such as code review, static analysis, formal verification etc. It is compulsory for graduate IT security students, and assumes basic familiarity with applied cryptography and programming in C or C++. The experiment took place in a week dedicated to usable security and usability of security APIs, with an introductory lecture and a practical, hands-on seminar. The students were informed that their responses would be anonymised and used in our research on usable security. Out of twelve students enrolled in the course at the time, nine completed the assignment. Students were included in the study by default and, although they were allowed to opt out, none chose to.

2.4 Results

Although with only nine participants the sample was small and the conclusions of the study were not clear-cut, we identified several patterns in the reports. Most participants chose to express their opinion unbound by the inspirational questions we provided, and although they often answered them one way or another, none answered (implicitly or explicitly) more than five of the twelve items. In the following paragraphs, we outline the themes and mentioned usability factors we identified in the reports, and accompany them with authentic quotes for illustration, where suitable. In short, we noted diverse preferences for the tested libraries and distinguished five usability factors mentioned by the participants: quality and structure of documentation and code samples, the API abstraction level, consistency of return values and error handling, naming of entities, and efficiency.

Diverse Library Preferences. Among our participants, no single library was generally preferred over another. The reasons expressed for one preference or the other were disparate, sometimes even contradicting one another. This shows diverse, yet not clearly delineated preferences among developers.

Documentation and Code Samples. Perhaps the most prominent theme relates to documentation and examples. The opinions were diverse. With GnuTLS, one was “generally satisfied” (P9), while another found it “worst” (P6). Regarding mbed TLS, one participant “liked it the most” (P7), yet another thought it was “not rich in details” (P9). And similarly for OpenSSL. Likewise, sometimes the libraries had “somewhat helpful examples” (P3, GnuTLS), “nice example programs” (P2, GnuTLS) or “nice, easy to understand examples” (P6, mbed TLS); other times, the samples had “some issues” (P1, OpenSSL) or were simply “bad” (P8, mbed TLS) or “poor” (P6, GnuTLS).

Abstraction Level. Opinions on the API abstraction level diverged significantly as well; some described the libraries as “well encapsulated, as expected” (P5, GnuTLS) or as having “best abstraction and usage” (P6, OpenSSL). Others pointed out flaws: “[GnuTLS] seems to be more lower level than the previous ones” (P7), “a lot of things have to be programmed manually” (P3, OpenSSL). One participant was not satisfied with any of the libraries: “they mix low-level affairs (such as TCP sockets) with high-level interfaces […], force the developer to write critical code by hand.” (P3, all libraries) Six participants specifically mention GnuTLS’ choice to offload socket handling to the developer: “when you want to connect to the specific port, you need to use your own socket […] which is extremely annoying” (P1, GnuTLS), “I basically got stuck while trying to connect via the socket.” (P7, GnuTLS)

Return Values and Error Handling. Return values were mostly perceived well in GnuTLS and mbed TLS, as being consistent and “uniform”. On the other hand, OpenSSL was perceived as problematic: “it does not match the typical /C convention for returned success/error code.” (P4) Error handling was perceived both positively: “Thanks to the error handling, I could stop […] and quickly verify that there’s no problem so far” (P1, OpenSSL), and negatively: “issue regarding not showing enough details about reasons for failing the verification was not resolved.” (P2, GnuTLS)

Entity Naming. The names of functions, arguments and constants were perceived almost exclusively positively across the libraries. The participants found the names intuitive and understandable: “[OpenSSL has] in the most of cases well named structures/types” (P8), “The functions are quite straightforward, you can deduce what they are supposed to do from their names” (P1, mbed TLS), “All the GnuTLS APIs used in this assignment have understandable names” (P5). Only one participant pointed out a “highly non intuitive” (P1) constant name in mbed TLS.

Efficiency. Although we did not employ any quantitative measure of code size or program complexity, one of the inspirational questions asked for subjective assessment of the required amount of code. Most comments generally judged it as appropriate. One participant mentioned the amount of necessary boilerplate: “We need to initialize six or so structures […] just to establish a basic TLS tunnel” (P3, mbed TLS). One participant found the amount of GnuTLS code for certificate verification “too large” (P5) and another noted the same for OpenSSL, which was “larger than the other APIs. […] mostly due to a lot of settings for certificate verification, which is not done implicitly. [With GnuTLS] the expiration, host name and CRL should be checked by default.” (P1)

We also checked how successful the participants were in completing the assigned task. Although some security checks seemed to have been relatively easy to perform or implement, e.g. check for certificate expiration or hostname check, none of the participants succeeded in correctly checking for certificate revocation. This appears to relate to the fact that none of the libraries have easy-to-use, clearly described facilities for handling certificate revocation. This reveals a serious flaw in the TLS ecosystem.

The study design and preliminary results were presented as “Usability Insights from Establishing TLS Connections”, a work-in-progress paper at the SantaCrypt 2019 workshop (Mikulášská kryptobesídka) on 5 December 2019. A replication of this study was performed in autumn 2019 in the same setting with 26 new participants. At the time of writing, preliminary results seem to confirm all of the previously identified themes. A more thorough analysis of the new data may suggest further directions for research. We plan to analyse the data thoroughly and publish a paper summarising all three studies in the near future.

3 Methodology

Guided by the conclusions of the exploratory study, we set out to design a follow-up study with more focus on a particular part of the usability of TLS libraries, yet still providing enough room for the participant’s individual expression and for the possibility of discovering novel interactions. In this chapter, we describe the design of our follow-up study and explain our choices at various steps of the process.

3.1 Research Questions

For the follow-up study, we decided to focus on the status quo in the usability of popular TLS libraries’ documentation, how developers interact with it and what other resources they use when programming with TLS interfaces. We also decided to consider the slightly more realistic scenario of modifying an already existing implementation of an application using TLS rather than writing such an application, albeit simple, from scratch. Pursuing the outlined direction, we designed a three-part user (developer) study motivated by four research questions:

RQ 1. What issues related to library documentation do developers encounter when working with existing TLS code?

RQ 2. Which of the ten previously described obstacles in API documentation (Uddin and Robillard 2015) do they encounter in TLS libraries? How severe do they perceive them to be? Why?

RQ 3. Is the severity of obstacles perceived differently between libraries?

RQ 4. What online sources of information other than the official documentation do developers use when working with existing TLS code? What circumstances lead them to use these other sources?

3.2 Study Design

In an attempt to answer these questions in a satisfactory way, we opted for a mixed-methods user study. Being interested in perceived flaws in current, popular APIs, we selected two languages, C and Python, both being popular among developers (Stack Overflow 2019). For each of the languages, we selected two relatively popular TLS-capable libraries – GnuTLS¹ (used in the exploratory study as well) and libtls (a more usable, developer-resistant take on a TLS library; part of LibreSSL²) for C, and the Python standard library ssl module³ (henceforth abbreviated as “pyssl”) and pyOpenSSL⁴.

It is worth noting that while both pyssl and pyOpenSSL rely on OpenSSL as a backend, pyOpenSSL is a “thin wrapper”, i.e. the API is mostly identical to OpenSSL’s with slight changes to accommodate programming language differences. On the other hand, pyssl, being part of the standard library, aims to provide a more “Pythonic” interface to TLS. We chose not to include OpenSSL in our study as the students were more likely to have had prior experience with the API. The library is used in assignments in other courses and students in our exploratory study indicated previous experience in other contexts as well, for instance in their daily jobs or in a hobby project.

We tested our application source codes with the following library versions (mostly the latest available) and recommended that the students use them for their solutions as well: GnuTLS 3.6.10, LibreSSL 2.9.2, Python 3.8 and pyOpenSSL 19.0.0.

At the beginning, the participants could choose either language, but would have to complete the tasks using both of the corresponding libraries. Since all of the libraries have distinct APIs and documentation formats, this allows the participants to experience and compare different programming approaches, and allows us to gather a wider range of input with regard to possible obstacles occurring “in the wild”.

3.3 Task Design

The experimental task itself consisted of three parts: a pre-task question- naire, a programming task and a post-task questionnaire. The instructions were published, along with the required source code, in the form of a single- page microsite on a subdomain of the Faculty of Informatics, with no access restrictions. (See Appendix C for a snapshot of the website.)

1. https://gnutls.org/
2. https://www.libressl.org/
3. https://docs.python.org/3.8/library/ssl.html
4. https://www.pyopenssl.org/en/stable/

The instructions for the first part were published first. The pre-task questionnaire was open for answers for 9 days, after which it was closed (submitted answers could not be changed and new ones could not be submitted). Instructions for the remaining two parts were then published and solutions were accepted for 12 days. The following subsections describe the three parts in more detail.

3.3.1 Pre-Task Questionnaire

In the first part, the students were allowed to choose one of the two programming languages (C or Python) they would use for the rest of the homework. Two variants (one for each programming language) of the same questionnaire (with the corresponding libraries switched in) were presented in the university’s information system. The students would choose which programming language they prefer by submitting either of these two variants. Since it was not technically possible to restrict access to the other variant of the questionnaire if the student had already completed one, it was possible for the students to submit solutions to the first part for both programming languages. Indeed, in our case two students did so, although they then completed the remaining parts of the assignment using only one language.

The pre-task questionnaire was designed with two goals in mind: a) to collect information on the participant’s background and previous experience, and b) to compel the participant to browse and use the library’s documentation.

In the questionnaire, the students were first asked about their relevant prior experience: how long they had been coding in general; their self-assessed skill in the chosen programming language (on a 6-point verbal rating scale: “None”, “Poor”, “Fair”, “Good”, “Very good” and “Excellent”); their self-assessed knowledge of TLS (on the same 6-point scale); and if they had worked programmatically with TLS or either of the assigned libraries before (and in what context).

They were then asked several “priming” questions to entice them to find, browse and use the official documentation of both libraries to find information before starting with the programming task. We opted for this mechanism since we were interested in their opinions on the documentation and we wanted to make sure that they were exposed to it early on, before they decided to look for a different source of information.

These “priming” questions/tasks included: writing down the URL of the official documentation; writing down the function, argument, attribute or other element of the API they would use to a) explicitly perform the TLS handshake, b) force a specific TLS version for the connection, or c) set a trusted root. We also asked them if they saw any possible issues with the documentation if they were to use it in their development. This set of questions was repeated for both libraries corresponding to the chosen language. Due to technical limitations of the questionnaires, the order of the libraries as displayed could not be randomised, thus the order was GnuTLS → libtls, and pyssl → pyOpenSSL, respectively, for all participants. The full questionnaire can be seen in Appendix A.

3.3.2 Programming Task

After the first part’s deadline, the instructions for the second and third parts were published, with 12 days allotted for submitting the solution.

In the second part, the students were given the source code of two applications implementing a simple HTTPS client, each using one of the corresponding libraries. For C, these applications consisted of 147 (GnuTLS) and 111 (libtls) lines of code (excluding comments and blank lines); for Python, they were 47 (pyssl) and 68 (pyOpenSSL) lines long. The task was to implement two small modifications to these programs: a) to set a specific self-signed certificate as the only trusted root authority, and b) to enforce that the connection only succeed for TLS version 1.2 or greater.

We found these two modifications simple enough not to discourage the participants and cause them to drop out, but also complex enough that the typical programmer has to look for library-specific information (be it in the documentation or elsewhere). At the same time, they resemble small real-world tasks that one might be assigned in a daily job or might encounter in a hobby project. (TLS 1.0 and 1.1 have been deprecated since March 2020.)

Notice that both of the modifications relate back to the “priming” questions posed in the pre-task questionnaire, where the goal was to find functions that set the trusted root and force a specific TLS version.
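For the Python standard library module, the two modifications could be approached roughly as in the sketch below. It is written under our own assumptions and is not the reference solution or any participant’s submission; the function name `make_pinned_context` and the file name `root.pem` are hypothetical.

```python
import ssl


def make_pinned_context(ca_file=None):
    """Client context trusting only the given root and requiring TLS >= 1.2."""
    # PROTOCOL_TLS_CLIENT turns on hostname checking and CERT_REQUIRED.
    # Unlike ssl.create_default_context(), a bare SSLContext starts with an
    # empty trust store, so no system roots are trusted.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Modification b): refuse TLS 1.0 and 1.1.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Modification a): make the supplied certificate the only trusted root.
    if ca_file is not None:
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

A client would then call, for example, `make_pinned_context("root.pem").wrap_socket(sock, server_hostname=host)` on a connected socket. The key design choice is to avoid `ssl.create_default_context()`, which would add the system roots alongside the self-signed one.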

For testing their implementations, we recommended that the students use our dedicated microsite (for testing the root authority setting) and BadSSL⁵, also used in the exploratory study (see Section 2.1). In contrast to our exploratory study, where the programming task was to write an HTTPS client basically from scratch, in the follow-up study we opted for the modification of an existing implementation. This scenario seems more realistic, as many programming jobs consist in large part of maintaining and improving legacy software (Grams 2019).

3.3.3 Post-Task Questionnaire

After completing the programming task for each library, the participants were asked to reflect on the task and to try to identify any issues they encountered when working with the libraries’ documentation. For each library, they were asked to rate how difficult they found the task on a 7-point Likert scale ranging from “Very easy” to “Very difficult”. This scale (also known as the Single Easy Question or SEQ) is a post-task difficulty scale which is both simple and reliable (Sauro and Dumas 2009; Tedesco and Tullis 2006).

The participants were then asked to assess how serious (if at all) they found ten selected issues in relation to the library’s documentation on a 5-point rating scale (“No opinion”, “Not a problem”, “Moderate”, “Severe” and “Blocker”). These ten obstacles were taken from Uddin and Robillard (2015) and included: ambiguity, bloat, excess structural information, fragmentation, incompleteness, inconsistency, incorrectness, obsoleteness, tangled information, and unexplained examples. Moreover, for each obstacle they rated as moderately or more serious, they were to describe why and where specifically they found the documentation problematic in that regard.

The participants were also asked to list other sources of information or URLs they used to complete the task and briefly describe why they were useful for them.

Finally, for the purposes of the course, they were asked to reflect on the whole assignment, to list the realisations or new insights they gained, and to optionally comment on the assignment in general. This kind of feedback is valuable to the course instructors.
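One possible way to encode the obstacle ratings described above is sketched here. The ordinal values, the treatment of “No opinion” as a missing value and the function names are our illustrative assumptions rather than the encoding used in the analysis; the example ratings are made up.

```python
# The ten obstacles taken from Uddin and Robillard (2015), in questionnaire order.
OBSTACLES = (
    "ambiguity", "bloat", "excess structural information", "fragmentation",
    "incompleteness", "inconsistency", "incorrectness", "obsoleteness",
    "tangled information", "unexplained examples",
)

# Assumed ordinal encoding; "No opinion" is deliberately absent (missing value).
SEVERITY = {"Not a problem": 0, "Moderate": 1, "Severe": 2, "Blocker": 3}


def encode(rating):
    """Return the ordinal severity, or None for "No opinion"."""
    return SEVERITY.get(rating)


def needs_explanation(ratings):
    """List the obstacles rated "Moderate" or worse, which require a comment."""
    flagged = []
    for obstacle in OBSTACLES:
        value = encode(ratings.get(obstacle, "No opinion"))
        if value is not None and value >= SEVERITY["Moderate"]:
            flagged.append(obstacle)
    return flagged
```

For a hypothetical respondent who rates bloat as “Severe” and fragmentation as “Moderate”, `needs_explanation` returns those two obstacles, mirroring the rule that an explanation was requested only for obstacles rated moderately or more serious.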

5. https://badssl.com

The questionnaire in full can be seen in Appendix B.

3.4 Setting

The experiment was performed in November 2019 at the Faculty of Informatics, Masaryk University. The task was assigned as the fifth assignment in the course PV079 Applied Cryptography, which is an intermediate graduate-level security-oriented course with lectures and tutorials. It is aimed at IT students interested in applications of cryptography and security protocols. The course assumes basic familiarity with the principles of cryptography. The students were informed that their submitted solutions would be used for purposes of an API usability study and were informed of the possibility to opt out of the study without any repercussions (participation was the default). (See Appendix C.)

3.5 Participants

All in all, 50 students submitted their solution to at least one part of the assignment. 44 students handed in solutions to all three parts of the assignment. We limit any further processing and analysis to these 44 participants. Six of them used C for the last two parts, 38 chose Python. Two handed in their solutions to both variants of the pre-task questionnaire, though they then completed the rest in only one language.

The sample is quite diverse when it comes to previous experience and programming skill. The reported number of years coding (regardless of programming language, including any education) ranged from one to fifteen, with median and mode equal to five. 17 participants reported coding no more than four years, 23 participants reported between five and eight years and only three participants claimed 9 years or more. One participant did not answer this question. Most participants (37, or 80% of all answers) estimate their skill with the language of choice as fair or good. At the same time, 68% (30 participants) describe their knowledge of TLS as none or poor and 64% (28) had never worked with it programmatically before. See Table 3.1 for more details.

Question                               Answer        C  Python  Total
How would you describe your current    None          0       0      0
programming skill in C/Python?         Poor          0       4      4
                                       Fair          1      18     19
                                       Good          4      14     18
                                       Very good     1       3      4
                                       Excellent     1       0      1
How would you describe your current    None          0       8      8
knowledge of TLS?                      Poor          2      20     22
                                       Fair          2       7      9
                                       Good          1       2      3
                                       Very good     1       1      2
                                       Excellent     0       0      0
How many times have you worked with    Never         1      27     28
TLS programmatically?                  Once          2       4      6
                                       A few times   3       6      9
                                       Many times    0       1      1

Table 3.1: Summary of participants by their self-reported programming skill and TLS experience. (Note that two participants completed the first part in both variants, hence the numbers in the first block add up to two more than expected.)

We also obtained information on enrollment in two related courses from the university information system. The two courses are PV181 Laboratory of security and applied cryptography, a hands-on seminar aimed at applying the students’ knowledge of cryptographic principles, algorithms and protocols, and PA193 Secure coding principles and practices, an advanced graduate-level course with lectures and tutorials focused on a broader range of topics in software security, clean code, program verification etc. Both of these courses touch upon cryptographic libraries and TLS, and have featured an assignment in OpenSSL, LibreSSL or GnuTLS in recent years. Moreover, the exploratory study described in Chapter 2 was performed in autumn 2018 and replicated in autumn 2019 in PA193. Of our participants, two thirds (29) had never attended PA193 or PV181, seven were attending both in parallel with PV079. See Table 3.2 for more numbers. Of the 39 participants in the (pre-task) Python group, 37 had never used either of the libraries. Those who had used one at least once mention a job and school-related work as the relevant contexts. Of the seven participants in


PA193 Secure coding principles Never In parallel Before PV181 Lab. sec. Never 29 1 & app. crypto In parallel 3 7 1 Before 3

Table 3.2: Summary of participants by enrollment in related courses.

GnuTLS libtls pyssl pyOpenSSL Q: How many times Never 2 6 37 38 have you used the Once 2 1 1 library before? A few times 3 1 1 Many times

Table 3.3: Summary of participants’ previous experience with the chosen libraries. (Note that two participants completed the first part in both variants, hence the numbers for each language add up to one more than expected.)

the C group, 6 had never used libtls, but only two had no prior experience with GnuTLS. All of those who had any prior experience mention PA193 and/or PV181 as the context of use. See Table 3.3 for detailed numbers.

3.6 Data Collection, Processing and Analysis

After the deadline of the assignment, the solutions were first corrected and scored for the purposes of the course. Subsequently, all three parts relevant to the study (pre- and post-task questionnaires and the modified source code) were pseudonymised and collected into a unified form for the purposes of the study.

As mentioned above, of the 50 total participants, we discarded six who did not complete all three parts of the task from any further analysis. In the pseudonymisation process, each participant was assigned a unique identifier; participants who completed the latter parts in C were assigned the numbers 101–106, while the Python group received the series 201–238.

The following subsections describe in detail the steps we took in processing and analysing the collected data, both open-ended and quantitative.


We did not test or analyse the submitted modified applications for correctness as it was not the focus of our study.

For preprocessing and keeping track of the collected data, we used GNU PSPP (GNU Project 2020) and jamovi (The jamovi project 2020). For data wrangling, exploratory data analysis and statistical modelling and testing, we used R (R Core Team 2020) and several packages from the Comprehensive R Archive Network (CRAN).

3.6.1 Qualitative Analysis

Of the qualitative, open-ended data we collected, we did not thoroughly analyse all items. We checked if the participants correctly answered the priming questions in the pre-task questionnaire, briefly checked repeating obstacles identified after using the documentation and surveyed other sources of information used in completing the task.

The heart of our analysis consisted of the thematic analysis of the reasoning behind obstacle severity ratings (questions 3 and 7 in the post-task questionnaire). We loosely followed the procedure of Braun and Clarke (2006), using a codebook instead of a thematic map. With one researcher leading the discovery and codebook preparation, we iterated three times before settling on a final version.

At the beginning, one researcher familiarised himself with the submitted responses, reading through them several times, marking reoccurring themes, and looking for similarities or differences between responses. A first iteration of the codebook was created using initial coding (also known as open coding; Saldaña 2013), establishing several data-based codes and a subsequent consolidation of similar or overlapping codes.

Responses from the six C-group participants were then split into 105 chunks. (These were adjusted before every iteration as needed.) These chunks were then each assigned up to two codes from the codebook by two independent raters. After the first round of coding, we calculated a measure of interrater agreement (Cohen’s weighted κ; Cohen 1968) κw = .675 (95% CI .596–.754), which is deemed good according to Regier et al. (2013).

After a discussion between the two raters, a second codebook was created. The rest of the responses were split as well and about 50% of those were coded by one researcher. After reflecting on the second round of coding, a third and final iteration of the codebook was created.
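For readers unfamiliar with the agreement statistic mentioned above, the following is a minimal Python sketch of Cohen's weighted κ, written as one minus the ratio of observed to chance-expected disagreement. The text does not state which weighting scheme was applied to the (up to two) codes per chunk, so the disagreement weights are left as a parameter; the `nominal` weights and the four example codes are hypothetical and are not data from the study.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def weighted_kappa(
    rater_a: Sequence[Hashable],
    rater_b: Sequence[Hashable],
    disagreement: Callable[[Hashable, Hashable], float],
) -> float:
    """Cohen's weighted kappa: 1 - observed / chance-expected disagreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of chunks")
    n = len(rater_a)

    # Mean disagreement weight over the chunks coded by both raters.
    observed = sum(disagreement(a, b) for a, b in zip(rater_a, rater_b)) / n

    # Disagreement expected if the raters assigned codes independently,
    # each according to their own marginal code frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (count_a / n) * (count_b / n) * disagreement(a, b)
        for a, count_a in freq_a.items()
        for b, count_b in freq_b.items()
    )
    if expected == 0:
        raise ValueError("kappa is undefined when no disagreement is expected by chance")
    return 1.0 - observed / expected


def nominal(a: Hashable, b: Hashable) -> float:
    """0/1 weights (assumption): reduces the statistic to unweighted kappa."""
    return 0.0 if a == b else 1.0


# Hypothetical codes for four chunks (not data from the study).
first = ["missing", "missing", "unhelpful", "unhelpful"]
second = ["missing", "unhelpful", "unhelpful", "unhelpful"]
kappa = weighted_kappa(first, second, nominal)
```

Passing a different `disagreement` function, for instance one that gives partial credit when two code pairs overlap, would yield a weighted variant without changing the rest of the computation.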


Using the final codebook, one researcher assigned codes to all (583) fragments, while a second one assigned codes only to 202. We then computed agreement from the intersection, κw = .727 (95% CI .676–.779), which also falls in the good range. It is worth noting that the improvement in the calculated agreement is practically tiny (95% CI −.042–.146) and not statistically significant (Z = 1.081, p = .28).

In the final codebook, our focus was on gathering issues the participants encountered in the documentation in line with our research questions. Hence we decided not to assign codes to hypothetical statements and statements relating to the task itself or other aspects not of interest.
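The comparison of the two agreement values can be sketched in Python as follows. The text does not say how the test was computed; this sketch assumes symmetric normal-approximation confidence intervals and treats the two κ estimates as independent, and the helper names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_95 = STD_NORMAL.inv_cdf(0.975)  # about 1.96


def se_from_ci(lower: float, upper: float) -> float:
    """Standard error implied by a symmetric 95% confidence interval."""
    return (upper - lower) / (2 * Z_95)


def compare_kappas(k1, ci1, k2, ci2):
    """Return (difference, 95% CI of the difference, Z, two-sided p)."""
    diff = k2 - k1
    # Assumes independent estimates, so the variances simply add up.
    se_diff = sqrt(se_from_ci(*ci1) ** 2 + se_from_ci(*ci2) ** 2)
    z = diff / se_diff
    p = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    ci = (diff - Z_95 * se_diff, diff + Z_95 * se_diff)
    return diff, ci, z, p


# Agreement values and intervals from the first and the final coding round.
diff, ci, z, p = compare_kappas(0.675, (0.596, 0.754), 0.727, (0.676, 0.779))
print(f"diff={diff:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f}), Z={z:.3f}, p={p:.2f}")
```

Under these assumptions the sketch arrives at the same interval for the difference and the same Z and p values as reported in the text.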

3.6.2 Quantitative Analysis

To test whether the obstacles in one library were rated as more severe than in the other we built and estimated several multilevel (mixed-effects) ordinal (ordered probit) models using the ordinal R package (Christensen 2019). For this modelling we pairwise deleted “no opinion” ratings, as they have no informational value for this part.

The ordered probit model assumes a normally distributed latent variable “behind” the ordinal ratings and estimates thresholds along which the continuous variable splits into the rating categories. The multilevel structure allows us to account for variation in ratings between participants as well as between obstacles. It is also effective in the presence of missing data.

We tried fitting models with individual-level predictors such as years spent coding, TLS experience and TLS skill (with no interactions), but these were not significant improvements (judged by likelihood-ratio test). In the end, we chose the most parsimonious model with the lowest value of the Akaike information criterion (AIC; Akaike 1974). The model includes a single binary fixed effect for the library and two crossed random effects – individual-level and obstacle-level random intercepts.
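To make the latent-variable idea concrete, the following Python sketch shows how an ordered probit model turns a linear predictor and a set of thresholds into category probabilities. It is an illustration only and not the fitting procedure of the ordinal package: the function name, the number of categories, the thresholds and all effect values are hypothetical and are not estimates from the study.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def rating_probabilities(eta: float, thresholds: list[float]) -> list[float]:
    """Probability of each ordinal category under an ordered probit model.

    eta        -- linear predictor (fixed effect plus random intercepts)
    thresholds -- increasing cut points; K - 1 of them give K categories
    """
    if sorted(thresholds) != list(thresholds):
        raise ValueError("thresholds must be in increasing order")
    # Cumulative probabilities P(Y <= k) = Phi(tau_k - eta), padded with 0 and 1.
    cumulative = [0.0] + [STD_NORMAL.cdf(t - eta) for t in thresholds] + [1.0]
    return [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]


# Hypothetical values for illustration only (not estimates from the study).
thresholds = [-1.0, 0.0, 1.0]      # three cut points, four severity categories
library_effect = 0.5               # fixed effect of the binary library indicator
participant_intercept = -0.2       # random intercept of one participant
obstacle_intercept = 0.3           # random intercept of one obstacle

baseline = rating_probabilities(participant_intercept + obstacle_intercept, thresholds)
other = rating_probabilities(
    library_effect + participant_intercept + obstacle_intercept, thresholds
)
```

A positive library effect shifts the latent variable upwards, so `other` puts more probability on the highest severity category and less on the lowest than `baseline` does.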

3.7 Limitations

We are aware of several limitations in the design of our study one should keep in mind when interpreting the results and considering their applicability.


One large drawback lies in our sample bias. Although previous studies have shown computer science students may serve as acceptable proxies to professionals (Krombholz et al. 2017; Naiakshina, Danilova, Gerlitz, et al. 2019; Yakdan et al. 2016), TLS is a fairly narrow and specific region of the programming landscape. The difference between an IT student and a professional developer in this case might be substantial.

Moreover, our participants were specifically students of IT security or were at least interested in the area. This might have further skewed the results. For instance, the APIs tested in our study might be used by non-security-oriented developers as well and their experiences, needs and expectations might be still different.

A related issue is that of framing. Since our task was administered as an assignment to university students, the execution might have been affected by various factors irrelevant to real-world software development (grading, seminars, questionnaires, etc.). And vice versa, significant real-world problems may remain hidden to our eyes. However, previous research suggests framing might not be a significant factor in some cases (Naiakshina, Danilova, Gerlitz, et al. 2019; Naiakshina, Danilova, Tiefenau, Herzog, et al. 2017).

Another issue is the order of tasks and the related effect of learning. Due to the technical limitations of our tools, we fixed the order of items in the questionnaires, which might have slightly affected some quantitative measures. Similarly, although we suggested no particular order for the programming task, most participants presumably followed the pattern of the questionnaires.

Indeed, some participants reported learning a particular concept or approach in one library and then applying it with a small modification in the other, particularly in the Python group.
Furthermore, our programming task concerned a small, albeit (in our opinion) realistic, scenario and two relatively minor modifications of an existing application. Many facets of the TLS libraries, such as APIs for working with X.509 certificates, remain unexplored.

On a related note, the TLS interfaces of the tested libraries can be regarded as relatively low-level and it is somewhat unlikely that a typical developer would use them frequently. Many applications nowadays rely on HTTPS and REST or SOAP architectures rather than communication over pure TLS. Therefore, a more “popular” approach would focus on the usability of higher-level APIs, such as cURL for C, Python’s urllib, http.client or Requests, or Java’s javax.net.ssl.

Finally, while C and Python are among the most popular programming languages according to the Stack Overflow Developer Survey (Stack Overflow 2019), we have neglected a large portion of developers, libraries and potential usability problems in our design. It is inevitable that the TLS libraries of C++, Java or JavaScript will have their own idiosyncrasies which are worth exploring as well.

4 Results

In this chapter, we report on the results obtained from the quantitative and qualitative analysis of the collected data. The first, exclusively qualitative section summarises several themes we identified in participants’ responses in relation to encountered obstacles.

The second, mainly quantitative section attempts to quantify how severe the participants found a set of particular obstacles in the libraries. The third, somewhat mixed section is dedicated to external resources the participants used in completing the tasks outside the main documentation.

The final section then touches upon several minor analyses or summaries of the collected data we performed.

4.1 Identified Issues

Similarly to our exploratory study (see Chapter 2), we observed many distinct, sometimes contradicting preferences for multiple elements of usability in the qualitative data from the follow-up study. Although we identified no grand, overarching theme or story present in a majority of responses, several consistent themes could be constructed.

The following subsections describe eight classes of issues our participants experienced when using the libraries’ documentation: missing examples, unhelpful examples, missing information, too much information, obsoleteness, difficult wayfinding, lack of transparency, and confusion & uncertainty.

4.1.1 Missing Examples

Many participants struggled with the lack of relevant and clearly marked examples of usage of the API. This problem was observed in all of the tested libraries: “there was no valid sample examples provided in the documentation” (P106, libtls); “I find the library incomplete and lacking examples” (P105, GnuTLS); “there were almost no examples at all” (P207, pyssl); “there were no examples at all!” (P217, pyOpenSSL)

Clearly, developers expect examples to be present in the documentation and recognise their importance and helpfulness in the process of learning and using the library: “there are no examples and I think this would have helped me a lot” (P238, pyOpenSSL); “since I’ve no experience at all I want to see example of everything I plan to use.” (P232, pyssl)

[T]hanks to the short “snippets” at the beginning of the page I was able to quickly resolute the mess in my head when differentiating the “socket” and the “context”. (P218, pyssl)

Examples are so important for developers that their lack in the documentation often forces them to search elsewhere:

[T]here were NO code examples for usage of any of the functions I needed, I needed to search elsewhere. (P104, libtls)

I tired to use some other websites […] to see more examples on different cases. (P225, pyssl)

Though sometimes in vain: “when searching through internet, the only available examples were composed of only a basic ‘if good return ok if not good return no-ok’.” (P236, pyOpenSSL)

It is worth noting that one of the ten obstacles from Uddin and Robillard (2015) we used in our post-task questionnaire was “unexplained examples”. Several participants expressed despair at the word “unexplained” where “lack of” would be more fitting:

I wrote “Not a problem” because there is not any answer that correspond to the fact that THERE IS NOT EXAMPLE IN THE DOCUMENTATION. So, in fact, it is a huge problem […] (P231, pyOpenSSL)

Regarding unexplained examples and incosistency: I have chosen this because it is probably closest to my actual issue: missing examples. (P201, pyssl)

In whole documentation there isn’t any example and I think that missing example is worse than an unexplained example. (P226, pyOpenSSL)

4.1.2 Unhelpful Examples

Even if examples were present, participants did not find them helpful or relevant to their goals: “[T]here are barely any examples, certainly not any useful for me.” (P232, pyssl)

Some criticised their limited scope; they found them too simplistic to be usable for the task: “the examples are defined at one place, but they describe very limited implementation” (P105, GnuTLS); “very simple examples, explaining too little on how to work with ” (P213, pyssl); “A more concrete example […] in a defined use case can really add some consistency to the documentation.” (P236, pyssl)

Others called for examples of specific functions or concepts: “A more detailled example of how to use this callback function would be really appreciated” (P236, pyOpenSSL); “I could not find an example using potentially more versions of the protocol” (P206, pyOpenSSL); “I was lacking more information and examples to method Context.set_options()” (P220, pyOpenSSL); “[the examples] should have demonstrated more possible TLS versions supported by the lib.” (P105, GnuTLS)

A participant suggested providing more examples showing one piece of functionality at a time:

It can be improved by giving examples with every function that has been written instead of giving a single example for multiple functions together that is difficult to understand. So examples should be further subdivided for ease of understanding. (P105, GnuTLS)

One participant would welcome instructions on when to use the samples: “it was not explained when I should actually use them.” (P238, pyssl) Another was looking to compare the possible ways to use the API:

[It would] help me to have two examples of codes with the same functionality with and without the usage of the SSLContext. (P238, pyssl)

Sometimes the problem is not the lack or low quality of examples but rather their unexpected or obscured placement. One participant found examples of priority strings in GnuTLS beneficial for understanding, but was perplexed by their unexpected location: “The useful examples of actual priority strings are located at the very end. After I read them, I could understand the verbose definition […], but why on earth are those examples at the end?” (P101, GnuTLS)

Other participants noted the importance of proximity of samples to the functions they demonstrate:

Don’t know why the examples cannot be right under every function’s documentation. Or at least at the end of every “chapter” (examples for working with sockets at the end of the part about Sockets…). (P232, pyOpenSSL)

The examples were (mostly) at the beginning and at the end of the documentation. This was problematic for me as a beginner (P238, pyssl)

[M]ore examples closer to function params, constants would be helpful. (P228, pyssl)

4.1.3 Missing Information

Another frequent complaint related to the complete absence of certain kinds of information in the documentation. Often, what was missing was the description of an argument, “Many functions are not describing some parameters at all” (P207, pyssl); its type, “often types of parameters and/or return values were missing” (P215, pyOpenSSL), “There are missing types of arguments […] I’m not sure if it’s expecting address as a string, tuple or AF_INET” (P226, pyOpenSSL); or the values it takes – in one instance, the value was “in examples, but not explained anywhere in the HTML single-page documentation.” (P101, GnuTLS)

The functions themselves were also often explained poorly or not at all: “there wasn’t enough explanation of functions” (P211, pyOpenSSL); “very vague description of methods for me” (P217, pyOpenSSL); “I just wanted to know what exactly [an element] does. In the documentation, there is no detailed information about that” (P219, pyOpenSSL)

One participant described this as their biggest issue: “[T]he functions themselves are left unexplained along with their arguments (that was THE problem for me).” (P101, GnuTLS)

Other times, the explanation of some key concepts or design decisions was missing: “I still do not understand why are there two class with the same goal” (P231, pyssl); “In the task, I had a problem understanding the concept of the context” (P238, pyssl); “For example, in the case of flags, I was not sure how to use them” (P228, pyOpenSSL).

One such concept was the self-signed certificate: “The main problem for me was to find information about self-signed certifications” (P229, pyOpenSSL);

[A] big problem in this documentation: self-signed certificate. There a chapter about it but it is never explained how to accept self-signed certificate when you provide a root CA to trust. (P236, pyssl)
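As an illustration of the kind of example the last participant was missing, the following minimal Python sketch uses the standard ssl module to extend the set of trusted certificates by one additional PEM file while keeping the system roots. The function name and the file name in the comment are hypothetical. Whether a particular self-signed certificate then passes verification still depends on its contents (for example the host names it lists) and on the Python and OpenSSL versions in use, so this is a sketch of one common approach rather than a complete recipe.

```python
import ssl
from typing import Optional


def build_client_context(extra_trusted_cert: Optional[str] = None) -> ssl.SSLContext:
    """Client context that verifies the server against the system CAs and,
    optionally, one additional PEM file (hypothetical path supplied by caller)."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    if extra_trusted_cert is not None:
        # Adds the certificate(s) in the file to the trusted set; the
        # system roots loaded above remain trusted as well.
        context.load_verify_locations(cafile=extra_trusted_cert)
    return context


# Pass e.g. "server-selfsigned.pem" (hypothetical file) to extend the trust store.
context = build_client_context()
```

Verification stays switched on in this approach; only the set of trust anchors changes.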


Consequently, participants had to resort to external sources to gain knowledge, even if only to be nudged in a certain direction: “[A Q&A answer] helped me to found cert_regs param. After that, documentation was sufficient.” (P207, pyssl) More on this later in Section 4.3.

4.1.4 Too Much Information

In contrast to the previous issue, participants were in some cases overwhelmed by the amount of information the documentation presented to them. This often led to confusion, indecision and frustration: “I was little confused to use which function out of the plethora of functions given in the documentation” (P160, GnuTLS); “There is so much text about unimportant parts” (P101, GnuTLS); “There is a ridiculous amount of text descriptions.” (P208, pyssl)

A very specific issue mentioned by at least two participants was the large number of functions listed on one page in libtls: “[T]he URL […] i get after searching ”tls_config_set_ca_file” has many unneccasry packages displayed” (P105, libtls); “tls_config_set_ca() was little bit tangled as it had many related methods using file, path, etc etc.” (P106, libtls)

One participant pointed out including historic information made “the documentation unnecessary bloated and confusing.” (P201, pyssl) Others complained about unnecessarily detailed descriptions: “there were long texts for each parameter description, some could be one-lined” (P237, pyOpenSSL); “sometimes the descriptions were too much extensive […] I would suggest shorten the descriptions.” (P224, pyssl)

While one participant prefers a lot of information as a beginner, “for a beginner like me, more the information, more it helps in improving my knowledge” (P103), another is of the opposite stance: “It is extensive and unclear for a new user.” (P106, GnuTLS)

4.1.5 Obsoleteness

While deprecation warnings and information about obsolete functions are useful when working with legacy code, how they are presented plays an important role, too. Sometimes this kind of information is interspersed among function descriptions, other times it is collected in a separate changelog. Several participants discussed the role of historic information in the documentation and its effect on readability.

Some participants observe the inclusion of historic information clutters the documentation, leads to confusion and makes navigation more difficult:

[T]he “deprecated” info (and version specifications - e.g. this is since version XY, this is not…) were quite annoying when scrolling and trying to extract the correct information. It caused me big trouble when navigating. (P218, pyssl)

Though opinions are also mixed here – some like to have the information close by: “it’s great that they have a change log in the documentation” (P226, pyssl); “Fortunately, the version where each new method has been introduced is specified, and deprecated method are also indicated” (P231, pyssl); some would prefer a separate page: “for me it would be better if the documentation would not show deprecated methods and just move them to other page or so” (P222, pyssl); “I just do not think it is a good idea to include deprecated functionality in the currently viewed version. It is not a problem to find the documentation of a chosen version.” (P201, pyssl)

One participant pointed out the apparent uselessness of a version switch in pyssl documentation: “As for the version specifications in the text - Why it is then the version selector at the top of the page?” (P218, pyssl)

One participant compares the documentation of pyOpenSSL with pyssl: “it was better than the previous documentation, mostly because of the missing ‘new from version, deprecated…’ statements - even if they were there, they were not directly the part of the text itself” (P218, pyOpenSSL) A different participant finds the form of changelog in pyOpenSSL problematic as it is not immediately clear if a function is deprecated or not:

[T]he documentation main page does not say it the function is deprecated or not. We can think that all the functions are up to date, but during the reply of this question, I found that if go to “Backward compatibility” […] I find that function are written deprecated here […] This function is deprecated but in the main page, it is not written. […] In this example, we can see it is problematic because we need to read all the changelog to sure the function is not deprecated. (P231, pyOpenSSL)

One participant pointed out an inconsistency in Python’s deprecation warnings: “Multiple places where ‘New in python 3.6’ and ‘Deprecated in python 3.6’ are shown at the same time” (P205, pyssl).

Other participants show a different inconsistency – supposedly deprecated functions are recommended for use in other parts of documentation: “E.g. ssl.OP_NO_SSLv2 is supposed to be deprecated but another part of documentation - SSLContext.options is recommending its usage” (P207, pyssl); “I’ve run into various constants / methods that are deprecated, or some is deprecated to use with other method that is not deprecated and so on.” (P222, pyssl)

What is worse, even if deprecation warnings are present, they do not suggest which element to use instead:

[M]any parts refer to obsolete/deprecated API elements, sometimes without referring to the “current” way, ie. the documentation for ssl.OP_NO_TLSv1_1 says it’s deprecated but does not disclose the proper way. (P215, pyssl)

[M]any times I found something that could be used for solution, but it was deprecated in the newest version, this was the biggest problem of the library. (P224, pyssl)

One participant describes a chain of deprecated constants whose description points from one to the next:

When I looked for some information about how to set a specific TLS version, I first looked at the function SSLContext(protocol=PROTOCOL_TLS). This sections told me that I should use one of PROTOCOL_* constants. So I looked at them and found out that all of them are already deprecated and I have to use “protocol PROTOCOL_TLS with flags like OP_NO_SSLv3 instead.” So I looked at the OP_NO_* flags and again found that all of them are deprecated.

Only in the section ssl.OP_NO_TLSv1, there is information that I should use “the new SSLContext.minimum_version and SSLContext.maximum_version instead” – I wonder why this is not written bellow all the possible flags, or even better right in the documentation of the SSLContext() where there is information about setting a specific version. (P219, pyssl)
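For illustration, the end of the chain this participant followed can be sketched in a few lines of Python using the attributes named in the quote. This is a minimal sketch, assuming Python 3.7 or newer built against a reasonably recent OpenSSL; the choice of TLS 1.2 as the lower bound is only an example.

```python
import ssl

# PROTOCOL_TLS_CLIENT negotiates the highest version both peers support
# and turns on certificate and hostname verification.
context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
context.load_default_certs()

# Restrict the acceptable range instead of combining OP_NO_* flags.
context.minimum_version = ssl.TLSVersion.TLSv1_2
context.maximum_version = ssl.TLSVersion.MAXIMUM_SUPPORTED
```

A short snippet of this kind next to the description of SSLContext() is the sort of pointer the participant was asking for.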

Obsolete functions are even used in (the few) code samples:

[The examples] seem to be either old (for example the part regarding TLS version is deprecated - however I am using this in my solution as well, the new functions are not in the example at all) (P201, pyssl)

4.1.6 Difficult Wayfinding

It is paramount for documentation that it provide all the relevant, up-to-date, frequently needed information, and to provide it in a clear and organised fashion. For newcomers to the library, it is also important they are able to determine their current position within the documentation. Unfortunately, both of these principles were often broken in the tested libraries.

Some participants express their difficulties with navigation in general: “Navigating it is sort of difficult” (P101, GnuTLS); “lots of information and very bad navigation in page” (P237, pyOpenSSL); “Very difficult to navigate the official documentation.” (P223, pyOpenSSL)

Others specifically mention fragmentation and scattering of information as a root cause of the difficulties: “Most of the information is spread in more than one page” (P106, GnuTLS); “The information […] seems for me to be fragmented […] which made it confusing and time consuming” (P220, pyssl); “a long page not split into smaller logical parts was not very accessible” (P215, pyssl); “most of the time I couldn’t see which method is for what objcet.” (P237, pyOpenSSL)

A few participants noted the lack of structure, categorisation or hierarchy: “I miss a fine structure that would make it easier” (P250, pyssl); “The manual is not well categorized” (P105, libtls); “[A]s there was no proper structure and flow […] it was very difficult for me to understand the methods” (P160, libtls)

One participant suggested a layout for the documentation that would be helpful to them: “The order i would recommend is to have the syntax, description of all arguments followed by a small example to implement that function having any prerequisite or post condition.” (P105, GnuTLS)

One participant speculated the difficulties might have stemmed from inexperience and compared the situation in both Python libraries:

The documentation is not very easy to navigate through for me, because I never worked with TLS before so I was not sure what exactly I am looking for, but pyOpenSSL documentation seemed more user-friendly to me. (P235, pyssl, pyOpenSSL)

Another participant agreed with the comparison: “I found it also a bit disarranged, but I found it easier to work with this documentation than with the documentation of the standard Python TLS library.” (P221) A different participant felt the same unwelcoming vibe as a newcomer:

The arrangement of information […] is good for proper documentation for developer with experience working with this library, but it is not arranged in an easily understood way for first time user. (P234, pyssl)

The search functionality in libtls was mentioned as being “not interactive and user friendly.” (P105) Two participants struggled with its stringency: “One has to know about exact name of package to search about it” (P105); “one needs to formulate the search the really carefully to find the specific thing.” (P105)

One participant preferred to use a search engine to find information in GnuTLS documentation: “Most of the time, I can find the needed information inside the documentation way faster by googling.” (P101)

One participant struggled with navigation in pyssl “especially as the table of contents in the sidebar was not ‘sticky’ and you needed to scroll all the way up to use it.” (P215, pyssl)

4.1.7 Lack of Transparency

Transparency of the API, its concepts and decisions behind its design is important in order for users of the library to build an effective mental image of the interface. This in turn enables faster learning and more efficient usage of the library. The tested libraries have some deficiencies in this regard.

Lack of precise, relevant information tends to muddle understanding and leads to an opaque interface rather than a transparent one: “I could not figure out what is going to happen if I use a particular option” (P210, pyOpenSSL); “By documentation I’m not sure if it returns server’s or my certificate. In PEM or DER format?” (P226, pyOpenSSL)

The lack of transparency, clear explanations of key concepts and relevant examples leads the developers to experimenting with the library in order to understand it, or learning by trying.

[M]aybe I would feel more confident when trying some of those things on some example snippet/fiddle… (P218, pyssl, pyOpenSSL)

I found [the appropriate constant] by pure luck using VS Code autocompletion tool (by trying “PEM”) and scrolling through options. (P101, GnuTLS)

I ended up sloving it with “trial and error” method. (P220, pyOpenSSL)


One developer was particularly interested in certain conditions not mentioned in the documentation and was forced to experiment to find the answers to their questions:

Even when I read the provided link to SSLContext […] There wasn’t enough information. E.g. what if I specify protocol version and client does not adapt to servers choice? Will there be an exception or something else? I needed to discover this myself by trying.

[…] I discovered how to use ssl._version but still wasn’t sure about the result. Then I tried it with badssl and gained confidence that it probably works. (P207, pyssl)

While this approach may be beneficial in general, in the absence of clear security guidance and lacking domain knowledge, it may lead to suboptimal results. One participant compared pyssl and pyOpenSSL in how they support this experimentation:

After some time i was able to browse the documentation [of pyssl] fast and trying out things. With the PyOpenSSL library i had different feeling, the more i looked into the documentation and when i was trying some things the more i was nervous from it. The errors i was getting were also not very helpful. (P222)

4.1.8 Confusion and Uncertainty Confusion and uncertainty are not among the emotions one might want to inspire in working programmers, least so if they are working on security- critical code. Unfortunately, using the tested libraries seems to lead to such reactions. For instance, many participants expressed confusion over the documen- tation, some of its parts or its functions: “little bit confusing at first (a lotof data in one place)” (P104, GnuTLS); “really confusing, for search one needs to know the exact name of the function or part of it.” (P104, libtls) Explanations of various concepts were among frequent causes of con- fusion: “Priority strings explanation was little confusing” (P106, GnuTLS); “The OpenSSL.SSL.SSLv23_METHOD is quite confusing” (P231, pyOpenSSL). Confusion and misunderstanding or poor understanding of the library and its concepts often leads to situations where the developers don’t know which function or element of the API they should use to achieve their goals:


[F]or some functions […] with a similar name and goal, it could be difficult to know which one is the most adequate for the current situation. (P204, pyOpenSSL)

Two participants felt overwhelmed by the sheer number of options with no guidance on which one to use: “I was little confused to use which function out of the plethora of functions given in the documentation” (P160, GnuTLS); “As i said …too many modules…” (P105, GnuTLS)

Others were struggling to understand when and how to correctly use elements they were already familiar with: “So I don’t know in which case using set_options(OP_NO_*) is required.” (P204, pyOpenSSL)

For example, in the case of flags, I was not sure how to use them. […] I was not sure if I have to use context.set_options() for every flag or can OR them (P228, pyOpenSSL)
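The uncertainty voiced by P228 concerns how option flags combine. As an illustration only, the following sketch uses Python’s standard ssl module rather than pyOpenSSL (an assumption made so that the example needs no third-party package) to show that option flags are bit masks: setting two flags in one step with OR gives the same result as adding them one at a time. The two flags are chosen arbitrarily. pyOpenSSL’s Context.set_options is documented as adding the given options to those already set, so the same reasoning should carry over, although that is not exercised here.

```python
import ssl

# Two flags chosen only for illustration; both are plain bit masks.
FLAGS = (ssl.OP_NO_COMPRESSION, ssl.OP_NO_TICKET)


def context_with_combined_flags() -> ssl.SSLContext:
    """Set both flags in a single step by OR-ing them together."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.options |= FLAGS[0] | FLAGS[1]
    return ctx


def context_with_separate_flags() -> ssl.SSLContext:
    """Add the flags one at a time; flags set earlier stay set."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    for flag in FLAGS:
        ctx.options |= flag
    return ctx


combined = context_with_combined_flags()
separate = context_with_separate_flags()
print(combined.options == separate.options)
```

Both contexts end up with the same option mask, which is the behaviour the participant was unsure about.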

A participant expressed uncertainty about how to use the API even after reading its description carefully: “there was a description of callback but after I read it I still wasn’t sure how to construct the callback.” (P207, pyOpenSSL)

Uncertainty about the correctness of their code and the inability to check or troubleshoot it easily was another pain point for some. So was the lack of feedback and transparency: “Then I was unsure whether I have to use c_rehash. (It worked for me without rehashing but maybe it should have not worked.)” (P238, pyOpenSSL)

[I’m] not sure if it was my mistake. I tried using set priority function to set the priority string to include TLS version 1.2 and 1.3. But it was not working for TLS v.1.3. I used format […] as given in the documentation. (P106, GnuTLS)

This made me crazy, once I tried this with crocs CA required, I went to badssl, there has been an error, that there is no crocs CA. After change of minimum_version to TLSv1_2, it stopped working. I don’t know why. (P214, pyssl)

Some opted to cope with the lack of clear, relevant information by asking the question they wanted to have answered: “I’m not sure if it returns server’s or my certificate. In PEM or DER format?” (P226, pyOpenSSL); “E.g. what if I specify protocol version and client does not adapt to servers choice? Will there be an exception or something else?” (P207, pyssl)
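P207’s question can be made concrete with a small sketch. It uses Python’s standard ssl module; the helper names make_client_context and negotiated_version are hypothetical and not part of any library. To our understanding of the ssl module, a handshake in which the peers cannot agree on a protocol version raises ssl.SSLError, which the sketch reports instead of propagating.

```python
import socket
import ssl


def make_client_context(minimum=ssl.TLSVersion.TLSv1_2):
    """Client context with certificate checks on and a protocol floor."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = minimum
    return ctx


def negotiated_version(host, port=443, minimum=ssl.TLSVersion.TLSv1_2, timeout=5.0):
    """Return the negotiated protocol name, or a short failure description."""
    ctx = make_client_context(minimum)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version()
    except ssl.SSLError as exc:
        # Raised when the handshake fails, e.g. no common protocol version.
        return f"handshake failed: {exc.reason}"
```

Calling negotiated_version with a host that only supports older protocol versions than the requested minimum would be expected to return the failure string rather than a protocol name; network errors unrelated to TLS are deliberately left to propagate.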


Figure 4.1: Reported severity ratings of the ten obstacles across libraries. Panels: GnuTLS, libtls, pyOpenSSL, pyssl. Obstacles: Ambiguity, Bloat, Unexplained examples, Excess structural information, Fragmentation, Incompleteness, Inconsistency, Incorrectness, Obsoleteness, Tangled information. Severity levels: No opinion, Not a problem, Moderate, Severe, Blocker.

The structure of the documentation, placement of its parts or fragmentation was another frequent culprit: “The information […] seems for me to be fragmented across more topics, which made it confusing and time consuming for me” (P220, pyssl); “Having information related to role as a client and role as a server on the same page is confusing.” (P208, pyssl)

4.2 Severity of Obstacles

After completing the programming task, we asked the participants to rate how severe they found each of ten selected obstacles (see Section 3.3.3 for details). The gross results are visualised in Fig. 4.1.

70% of all answers on severity were “not a problem”; 7% amounted to “no opinion”. None of the obstacles were rated “blocker” for the C libraries.

The obstacles with the most “no opinion” answers included Excess structural information, Unexplained examples and Incorrectness. Only one participant mentioned using an IDE, hence Excess structural information (“description [contained information] which could be easily obtained through modern IDEs”) could have been irrelevant to most participants. Unexplained examples might have been irrelevant as well, due to the lack of any examples at all (see the previous section). Incorrectness may be difficult to judge for developers not knowledgeable in the domain.

                                     C                Python
Fixed effect
  libtls/pyssl (standard error)      −1.02 (0.267)    0.23 (0.110)
                                     (p = .0001)      (p = .037)
Random effects (intercept)
  Between-participant std. dev.      0.756            0.471
  Between-obstacle std. dev.         0.266            0.430
Thresholds
  Not a problem | Moderate           −0.19 (0.364)    1.04 (0.18)
  Moderate | Severe                  0.76 (0.373)     1.90 (0.192)
  Severe | Blocker                   —                2.75 (0.228)
Model fit
  AIC                                189.26           917.65
  log likelihood                     −89.63           −452.83

Table 4.1: Summary of the estimated parameters of the selected ordinal model for obstacle severity rating.
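As a quick plausibility check of the model-fit rows in Table 4.1, the AIC can be recomputed from the log likelihood as AIC = 2k − 2 log L (Akaike 1974). The parameter counts k below are an assumption, not stated in the table: one fixed effect, two random-intercept standard deviations and one parameter per estimated threshold (two for C, three for Python). The sketch is illustrative and is not the original analysis code.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_likelihood


# Assumed parameter counts: 1 fixed effect + 2 random-effect std. devs + thresholds.
K_C = 1 + 2 + 2       # two thresholds estimated in the C model
K_PYTHON = 1 + 2 + 3  # three thresholds estimated in the Python model

aic_c = aic(-89.63, K_C)
aic_python = aic(-452.83, K_PYTHON)
print(f"C: {aic_c:.2f}, Python: {aic_python:.2f}")
```

Under these assumptions, the recomputed values agree with the reported 189.26 and 917.65 up to rounding of the log likelihood.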

Comparison Between Libraries. To test whether the participants rated the severity of obstacles in one library generally higher than in the other, we estimated a mixed-effects ordinal (ordered probit) model. Among several candidate models, we selected one with a single fixed effect for the library and two crossed random effects – individual-level and obstacle-level random intercepts. The parameters estimated for each language group separately can be seen in Table 4.1.

The parameter of interest is estimated in the row titled “libtls/pyssl”, which shows that libtls’ and pyOpenSSL’s obstacles were, in general, rated as significantly less severe than GnuTLS’s and pyssl’s, respectively. In other words, participants were more likely to rate the same obstacles as more severe in GnuTLS than in libtls.

It is worth noting that while the library fixed effect is statistically significant (tested using Wald and likelihood-ratio tests) in both language models, the standard errors of the estimates are considerable. For instance, the 95% Wald confidence interval for the libtls coefficient is approximately [−1.54, −0.5], which is perhaps too wide to draw any definite conclusions from a probit model.

It is likely that the relatively poor fit of the model stems from the small sample size, especially in the C group. (This is further supported by the large estimated between-participant intercept variance.) A different estimation method as well as a fully Bayesian approach might also improve the resulting estimates.
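The Wald interval quoted above can be reproduced from the estimates and standard errors in Table 4.1. The sketch below is an illustration, not the original analysis code; it assumes the usual normal approximation, and the helper name wald_summary is hypothetical.

```python
from statistics import NormalDist


def wald_summary(estimate: float, std_error: float, level: float = 0.95):
    """Return (lower, upper, two-sided p-value) under a normal approximation."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z_crit * std_error
    z = estimate / std_error                  # Wald statistic for H0: coefficient = 0
    p_value = 2 * (1 - normal.cdf(abs(z)))
    return estimate - half_width, estimate + half_width, p_value


c_low, c_high, c_p = wald_summary(-1.02, 0.267)     # libtls coefficient, C group
py_low, py_high, py_p = wald_summary(0.23, 0.110)   # pyssl coefficient, Python group
print(f"C: [{c_low:.2f}, {c_high:.2f}], Python: [{py_low:.2f}, {py_high:.2f}]")
```

The interval width scales directly with the standard error, which is why the small C group yields an interval spanning roughly one unit on the probit scale.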

Comparison Between Obstacles. To test whether certain obstacles were generally rated as more severe than others across all libraries, we estimated a different multilevel ordinal model with a single fixed effect for the obstacles and two crossed random effects – individual-level and library-level random intercepts.

After adjusting for multiple comparisons (using the Holm–Bonferroni method), the following obstacles were rated as significantly more severe than the reference level (Excess structural information): Ambiguity, Unexplained examples, Fragmentation and Incompleteness.
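The Holm–Bonferroni adjustment mentioned above can be sketched as follows. The p-values and the labels A to E are hypothetical placeholders, not results from our data; the sketch only illustrates the step-down rule of comparing the i-th smallest of m p-values with α/(m − i + 1) and stopping at the first failure.

```python
def holm_bonferroni(p_values: dict, alpha: float = 0.05) -> dict:
    """Map each label to True if its null hypothesis is rejected by Holm's procedure."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rejected = {label: False for label in p_values}
    for rank, (label, p) in enumerate(ordered):  # rank is 0-based
        if p <= alpha / (m - rank):
            rejected[label] = True
        else:
            break  # step-down: all larger p-values are retained as well
    return rejected


# Hypothetical p-values for five comparisons against a reference level.
example = {"A": 0.001, "B": 0.012, "C": 0.04, "D": 0.03, "E": 0.2}
decisions = holm_bonferroni(example)
print(decisions)
```

In this made-up example only A and B survive the adjustment, although C and D would pass an unadjusted 5% threshold.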

4.3 Other Sources of Information

In the post-task questionnaire, we asked the participants to list the URLs of resources other than the official library documentation that they used while completing the task, and to briefly explain why they found them useful. We analysed these answers by categorising these URLs, counting their occurrences and briefly surveying the participants’ reasoning.

Categorisation. We gathered resources of a similar type (e.g. Q&A, tutorial, etc.) into ten groups and divided these groups among three broad categories: Vendor – resources provided or maintained by the library authors; Community – resources created or published by the community; and General – other unrelated websites and resources, e.g. a search engine. A summary of the categorisation and the corresponding counts can be found in Table 4.2.

Category/Group              Src.  Ment.   Pt.  Description
Vendor                      5     1+5     6
  Official example          2     1+2     3    A code sample provided by the library developers elsewhere than in the main documentation.
  Bug report                3     0+3     3    A bug or issue reported in the library’s bug tracker. A comment on such a report.
Community                   40    15+41   24
  Q&A                       11    0+19    15   A question posted on a Q&A site, such as Stack Overflow.
  Community example         14    5+13    12   A code sample created by the community.
  Community tutorial        8     3+9     9    A tutorial or walkthrough created by the community.
  Unofficial documentation  7     7+0     5    Library documentation or manual pages hosted elsewhere than on the official domain.
General                     6     1+15    12
  Search engine             1∗    1+9     8    Using a search engine to look for information.
  Wikipedia                 1     0+2     1
  Talk                      2     0+2     1    A recording or transcript of a presentation or talk.
  Foreign documentation     2     0+2     2    Documentation of a library other than the one the code is being written for.

∗ No specific search queries mentioned.

Table 4.2: Categorisation of information sources used by participants while completing the task. Columns: Src. is the number of distinct sources (URLs) within the group. Ment. is the sum of mentions of URLs in the group, in the format C+Python mentions (double-counting if a source is mentioned by the same participant for both libraries). Pt. counts how many participants mentioned a URL in the group at least once.

Observed Patterns. It is clear from the summary that community-provided resources were the most popular among our participants. More than a half (24 or 55%) referenced at least one URL in this category, versus 6 and 12 for the other two categories. 40 distinct URLs were listed in this category, in contrast with 11 in total for the remaining two, with 56 total mentions, more than twice as many as the other two categories combined.

It is worth noting that no participants in the C group mentioned using a Q&A site, such as Stack Overflow. The opposite imbalance was present in using documentation hosted unofficially, which was mentioned by five C participants and none in the Python group. This group of resources consists almost exclusively of HTML versions of the libraries’ manual pages.¹ Although this may be partly caused by the small sample size (6 C vs. 38 Python participants), it might be an interesting pattern to analyse further. It may be explained by the fact that the Python libraries only host their documentation centrally, while manual pages (which also often contain the documentation of C APIs) have many web mirrors. Therefore, developers searching for information on C libraries are more likely to end up on a hosted manual page. It may also mean the selected libraries are not so popular as to have many Stack Overflow questions.

Eight participants explicitly mentioned the use of a search engine (some mention Google specifically, but “googling” may refer to the generic act of searching), although they list no specific search terms. Two participants mentioned using Stack Overflow, but provide no specifics either.

One interesting type of resource is the bug report. Three participants listed three different pyOpenSSL bug reports and all of them argue they helped them with understanding the API; e.g. “This issue explained to me how to use the set_options function.” (P236) This might be a reason for the developers to incorporate helpful comments from bug reports into their documentation to improve its usability.

Another potentially interesting kind of resource in this domain is a video or recording of a talk. One participant listed two videos that “gave me a better understanding of TLS overall and […] code examples using the standard library.” (P211, pyssl)

1. These were on the domains https://man7.org/linux/man-pages/, https://linux.die.net/man/ and https://sortix.org/man/.

Participant Reasoning. While knowing what websites developers use when programming with TLS is essential in understanding the libraries’ usability, we also need to ask why they choose these resources. We should also like to know whether they prefer them over the official documentation, and if so, why.

As would be expected after some preceding results (Sections 4.1.1 and 4.1.2), participants were often looking for examples of use of the API they were working with: “I needed […] an example of a TLS client with these API tu understand how to use each part of the API.” (P231, pyssl) The results were frequently satisfactory: “It helped me to see a bigger chunk of code which uses the features from the library” (P238, pyssl); “This source was very useful for me […] and after seeing it, I knew immediately what I had to do.” (P212, pyssl)

Many participants link external information with improved understanding of the API: “I used the code as the reference to confirm my understandings […] This link helped me in gaining my knowledge” (P103, GnuTLS); “example of the usage - good for understanding what is needed” (P104, GnuTLS); “very helpful to understand in which part what one should do” (P104, libtls).

Two participants succeeded in resolving an issue with the API thanks to a Q&A site: “Thanks to this I have resolved the problem” (P205, pyOpenSSL); “I found a solution to why I was getting ECONNRESET error.” (P203, pyOpenSSL) Two other participants found an element of the API on a Q&A site and proceeded to complete the task with just the documentation: “This helped me to found cert_regs param. After that, documentation was sufficient” (P207, pyssl); “God bless stackoverflow - It was just what I needed […] it pointed me to the ‘options’ and I have found it in the documentation afterwards and implemented it.” (P218, pyssl)

Sometimes, an unofficial guide was much more useful and usable than an official one, as one participant compares: “This page is clear, easy to navigate, solved similar problems that I needed - just easy to understand in my opinion and it worked like a charm.” (P218, pyssl) A different participant said this about the Fedora Defensive Coding Guide:

I used this website for reference as it was clearly giving the TLS client structure in a single page. All the necessary modules were explained in one go. That made it simple and clear. It followed the correct flow also. (P106, GnuTLS)

Other times, external resources were not easy to find. One participant described their struggles with web search, as their learning style essentially relied on examples:


[Usually] I tried to find an implementation example from stackoverflow and to get the idea of how to actually use each function […] My learning method is by example but it’s hard to find or maybe I did not use the right keyword. (P210)

Another participant described how hard it was to find pyssl-related information:

I did try to do a google search for “python ssl module verify tls self signed certificate” which I assumed would return heap of answers. It did not. The name of the module “ssl” is so short and used in other context that relevant stack overflow / blog article wasn’t for my search in the first 3 pages of the search result. At that point I gave up and returned to the official documentation. (P208, pyssl)

On the other hand, several participants were completely satisfied with the official documentation and felt no need to search elsewhere: “No other sources were needed, so I consider the documentation to be good” (P101, GnuTLS, libtls); “The documentation was sufficient” (P224, pyssl, pyOpenSSL); “I did not have to use any other resources, I found the documentation alone sufficient.” (P202, pyssl, pyOpenSSL)

In summary, the reasons for using and searching for external resources were disparate, but more or less all of them seem to link back to the issues identified previously; namely, missing or unhelpful code samples, missing information in descriptions, lacking explanations of key concepts and difficult navigation.

Quality of Resources. We did not analyse these resources in detail, e.g. by assessing their veracity, correctness or security. However, a few brief observations are in order nonetheless.

The Fedora Defensive Coding Guide, mentioned by four participants, was last updated in 2014 (six years ago). Using it therefore increases the risk of using a deprecated or even insecure API element or the wrong settings.

One participant mentioned using a code sample from CycloneSSL, a library completely unrelated to any of the tested ones. It is not clear how it was useful to them.

Two participants linked to an example from an earlier version of pyOpenSSL, which is no longer included in its source, but can be found in a GitHub fork of the project.


Figure 4.2: Likert plot of the perceived difficulty of the task by library. Rows: GnuTLS, libtls, pyOpenSSL, pyssl. Perceived difficulty is rated on a scale from 1 to 7.

4.4 Supplementary Analyses

Besides our primary work on identifying and quantifying obstacles, we also analysed some of the remaining data we collected during the experiment. The following subsections briefly summarise the results.

4.4.1 Pre-Task Priming Questions

For the sake of completeness, we briefly checked the answers to the pre-task priming questions, where our goal was to compel the participants to use the library documentation before embarking on the programming task. We found most answers were correct in the sense of functionality, though not perfect. Some were referring to outdated information or obsolete functions, for instance. An unusually large number of participants named an incorrect function for setting the trusted root certificate. We think this might be caused by the wording of the question, which allows for multiple plausible interpretations.

4.4.2 Post-Task Difficulty

When asked how difficult they found the task in each library, the participants rated the experience mostly positively. 68% of the ratings on a 7-point scale were between one and three, with four being the midpoint. 11% of the ratings were above the midpoint and only one participant assessed the task as 7 (very difficult), in pyssl. Fig. 4.2 summarises the reported ratings graphically. The difference in perceived difficulty is not statistically significant for either language (Monte Carlo Kruskal–Wallis test, χ² = 1.52, p = .3 for C; χ² = 0.159, p = .7 for Python).


4.4.3 Post-Task Preferences

At the very end of the task, as a (poor) proxy for general satisfaction with the libraries, we asked which of the two libraries the participants would prefer to use for their project. Neither of the libraries was generally preferred over the other in either language. In the C group, three participants would choose libtls, two GnuTLS and one expressed no preference. In Python, 19 would pick pyssl, 14 would go for pyOpenSSL and five had no preference. The slight difference in the Python group is not statistically significant (one-sided exact binomial test, 95% CI 0.42–1.0, p = .24).

This finding further supports the conclusions of our exploratory study, where no prevalent preference was found among participants, either. Instead, the participants showed widely varied preferences and disparate reasons for their decisions.
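The p-value of the binomial test can be reproduced by hand. The sketch assumes, as the reported figures suggest, that the five participants without a preference were excluded (n = 33) and that the null hypothesis is an even split between the two libraries; the helper name binomial_upper_tail is hypothetical and this is not the original analysis code.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, prob: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, prob)."""
    return sum(
        comb(trials, k) * prob**k * (1 - prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )


PYSSL, PYOPENSSL = 19, 14  # stated preferences in the Python group
p_value = binomial_upper_tail(PYSSL, PYSSL + PYOPENSSL)
print(f"p = {p_value:.2f}")  # prints "p = 0.24"
```

A split of 19 to 14 is thus well within what an even preference would produce by chance.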

5 Discussion

In this chapter, we summarise our findings, attempt to use them to answer our research questions, as well as compare them with some of the literature. For ease of reading, we repeat the research questions we had established:

RQ 1. What issues related to library documentation do developers encounter when working with existing TLS code?

RQ 2. Which of the ten previously described obstacles in API documentation (Uddin and Robillard 2015) do they encounter in TLS libraries? How severe do they perceive them to be? Why?

RQ 3. Is the severity of obstacles perceived differently between libraries?

RQ 4. What online sources of information other than the official documentation do developers use when working with existing TLS code? What circumstances lead them to use these other sources?

5.1 Obstacles

RQ 1. The data to support an answer to RQ 1 were presented in Section 4.1. In short, we identified eight classes of documentation issues. It is worth noting that none of the problems are specific to TLS libraries, but their impact may be more serious in the area of security than in other libraries.

A big problem was the absence of code samples. Even when they were present, our participants found them irrelevant or unhelpful, or too basic to be applicable in their use case. Developers also clearly expect examples to be present in the documentation; if of appropriate quality, they increase productivity and correctness of the resulting code (Acar, Backes, Fahl, Garfinkel, et al. 2017; Sohan et al. 2017). This observation confirms various prior studies (Nadi et al. 2016; Robillard 2009; Robillard and Deline 2011).

If examples are not present, developers are forced to search for them elsewhere on the Internet (see also Section 4.3). However, library developers have little control over external content and this might lead to incorrect or insecure results (Acar, Backes, Fahl, Kim, et al. 2016; Fischer et al. 2017; Ukrop and Matyas 2018).


Participants also cited missing or incomplete descriptions of functions, arguments, constants and other elements of the API as a frequent obstacle. This often leads to confusion, hinders learning and encourages looking for information in external sources, again, with the same potential issues. Lack of official information also obscures the API and its design choices. As previous research suggests (Green and Smith 2016; Robillard and Deline 2011), making the API penetrable and easy to learn is essential for usability. This is also embodied in Nielsen’s heuristic of visibility of system status (Nielsen and Molich 1990).

The opposite problem of too much information, or bloat, was also mentioned several times, with similar consequences for beginners. Our data also suggest that the visual and logical structure (or hierarchy) of the documentation is an important factor, as well as navigation elements and search capabilities. This is in line with Meng, Steinhardt and Schubert (2019) and Robillard (2009).

Obsolete information and the presentation of deprecated elements was a contentious topic. While developers expect deprecated functions to be present in documentation when working with legacy code, they might not agree on how such information should be presented. Some of our participants preferred a separate changelog with deprecation warnings, others liked the interspersing of historic information in the main document. The latter is also recommended by Gorski and Iacono (2016).

Finally, a few participants reported experimenting with the API in order to learn or understand it. This stems from the previously mentioned missing descriptions but also from the absence of a testing mode, a feature called for in prior research (Gorski and Iacono 2016; Green and Smith 2016).

RQ 2. In order to answer RQ 2, we performed a statistical test on the collected quantitative data. Of the ten obstacles presented in Uddin and Robillard (2015), four were generally rated as more severe among participants across all libraries: ambiguity, unexplained examples, fragmentation and incompleteness.

These four obstacles were also among the top five obstacles by developer-reported severity in the original paper (incorrectness being the remaining one). Note that “unexplained examples” were used by many participants as a substitute for the category of “missing examples”, which does not appear in the original article.

RQ 3. To answer RQ 3, we fitted an ordinal model, which confirmed a statistically significant difference between severity ratings – the obstacles of GnuTLS and pyssl were generally rated as more severe than those of their counterparts. However, the model fit is far from perfect. The effect size for the Python group is very small and the uncertainty in the C group is fairly large, presumably due to the small sample (six participants). With these limitations in mind, we discourage drawing any definite conclusions from this analysis.

In line with our exploratory study, we found that neither of the libraries was preferred over the other. This reflects the diverse preferences and expectations of developers.

5.2 Other Sources

RQ 4. In answering the last research question, we analysed lists of websites reported to have been used by the participants. The large majority of the URLs (used by 55% of all participants) were resources created or maintained by the programming community, such as Q&A sites, tutorials and examples. It bears noting that none of the C participants referenced any Q&A site.

We also found two surprising sources of information: bug reports and videos. These findings might help library developers decide how to better structure documentation and where to invest their efforts.

6 Related Work

Our work lies at the intersection of multiple areas of active research: developer experience, usability of cryptographic APIs and TLS specifically, and documentation usability.

Fahl, Harbach, Muders, et al. (2012) and Fahl, Harbach, Perl, et al. (2013) investigate misuse of TLS in Android and iOS mobile applications. They find that 8–10% of them are vulnerable to man-in-the-middle attacks. Subsequent investigation reveals the flaws are mostly due to poor usability of the programming interfaces. Following up, Georgiev et al. (2012) suggest a novel approach of declaratively configuring rather than imperatively programming the TLS connection, with positive results.

Acar, Backes, Fahl, Garfinkel, et al. (2017) study the usability of cryptographic libraries in Python. They conclude that although simpler APIs are beneficial for security, reducing cognitive load, the availability of quality documentation, code samples and wide functionality are still more important. However, their study only focused on fairly low-level operations, not on cryptographic protocols, such as TLS.

On a similar note, Sohan et al. (2017) verify with professional developers that usage examples for REST APIs are essential for many, and that their absence in the documentation lowers productivity and satisfaction significantly. They also present four recommendations for API developers. While the findings regarding examples are relevant, the recommendations are specific to REST APIs and do not bear much weight outside of the domain.

Robillard (2009) and Robillard and Deline (2011) studied issues that developers encountered when learning new APIs, with a focus on documentation. Among others, they confirm the usual obstacles of missing examples, missing information or problems with navigation. Furthermore, they argue for considering factors such as “documentation of intent” and “penetrability” when designing API documentation.

Gorski and Iacono (2016) and Green and Smith (2016) compiled lists of usability heuristics and recommendations for cryptographic and security libraries, collected from literature. These lists reinforce the ideas of user- or developer-centered design, and call for ease of use, ease of learning, testability and predictability, in order to meet developers’ expectations.

Uddin and Robillard (2015) identify ten common problems in API documentation perceived by professional developers and quantify their severity and prevalence in a bigger survey. Although these obstacles relate to API documentation in general, we reused the list in the quantitative section of our post-task questionnaire as well.

More recently, Aghajani et al. (2019) developed a comprehensive taxonomy of documentation issues by analyzing a large sample of content from Q&A sites, mailing lists and discussions. They also suggest several concrete steps for library developers to improve their documentation as well as areas where future research should focus.

Meng, Steinhardt and Schubert (2019) observed developers’ use of documentation while working with an API. Among other things, they note that consistent navigation, search functionality and clean code samples are beneficial to the experience and productivity.

7 Conclusion

In the previous chapters, we explored the design and results of our small qualitative exploratory study and of a larger, more focused mixed-methods follow-up study. In the latter, we focused on the usability issues in the documentation of TLS libraries as perceived by developers, or IT students as a proxy in our case.

We found that the lack of helpful, relevant examples, clear explanations and attractive, easy-to-navigate documentation interfaces is still prevalent in today’s library documentation, even after several years of usability studies bringing these issues to light.

It appears that none of the identified issues are endemic to TLS-related interfaces. TLS libraries seem to be plagued by the same ills as other cryptographic APIs and software libraries in general. However, these issues might have a more severe impact in the context of a security library than in other cases.

Although we would like our findings to serve as an immediate call to action to TLS library maintainers or a guide in the design of new libraries and their documentation, this seems unlikely given the state of previous research and the extent of its applications.

7.1 Future Work

Bearing in mind the limitations of our work (as laid out in Section 3.7), we should like to perform a similar or an extended study with professional software developers as well, in order to identify what kind of differences there might be in relation to security interfaces.

An experimental study oriented on a larger programming task, a wider range of functionality or a different kind of interface (such as cURL, urllib) or protocol (OAuth, X.509) may be in order as well. Such a study may reveal similar usability problems in other widely deployed technologies.

Going deeper in our line of research, one might want to identify what constitutes, for instance, a quality code sample, what makes an API description easy to understand, or what navigation elements make library documentation easy to use. These are clearly important questions that, to our knowledge, have not yet been satisfactorily answered in the field of security.


On the other hand, observing that the recommendations of a long series of research papers are not being implemented even several years after publication, we might need to turn our attention to the maintainers and developers of these libraries. Our task should be to identify the obstacles they face in improving the documentation and in creating a welcoming atmosphere for beginners as well as other developers.

Bibliography

Acar, Yasemin, Michael Backes, Sascha Fahl, Simson Garfinkel, et al. (2017). “Comparing the Usability of Cryptographic APIs”. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 154–171. doi: 10.1109/sp.2017.52.
Acar, Yasemin, Michael Backes, Sascha Fahl, Doowon Kim, et al. (2016). “You Get Where You’re Looking for: The Impact of Information Sources on Code Security”. In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 289–305. doi: 10.1109/sp.2016.25.
Acar, Yasemin, Sascha Fahl and Michelle L Mazurek (2016). “You are not your developer, either: A research agenda for usable security and privacy research beyond end users”. In: 2016 IEEE Cybersecurity Development (SecDev). IEEE, pp. 3–8.
Acar, Yasemin, Christian Stransky, et al. (2017). “Security Developer Studies with GitHub Users: Exploring a Convenience Sample”. In: Thirteenth Symposium on Usable Privacy and Security (SOUPS 2017). Santa Clara, CA: USENIX Association, pp. 81–95.
Aghajani, Emad et al. (2019). “Software Documentation Issues Unveiled”. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, pp. 1199–1210.
Akaike, Hirotugu (1974). “A new look at the statistical model identification”. In: IEEE Transactions on Automatic Control 19.6, pp. 716–723. doi: 10.1109/TAC.1974.1100705.
Arzt, Steven et al. (2015). “Towards Secure Integration of Cryptographic Software”. In: 2015 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!). Onward! 2015. Pittsburgh, PA, USA: ACM, pp. 1–13. doi: 10.1145/2814228.2814229.
Braun, Virginia and Victoria Clarke (2006). “Using thematic analysis in psychology”. In: Qualitative Research in Psychology 3.2, pp. 77–101. doi: 10.1191/1478088706qp063oa.
Christensen, R. H. B. (2019). ordinal – Regression Models for Ordinal Data. R package version 2019.12-10. url: https://CRAN.R-project.org/package=ordinal.
Cohen, Jacob (1968). “Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit”. In: Psychological Bulletin 70.4, p. 213.
Egele, Manuel et al. (2013). “An Empirical Study of Cryptographic Misuse in Android Applications”. In: Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. CCS ’13. Berlin, Germany: ACM, pp. 73–84. doi: 10.1145/2508859.2516693.
Fahl, Sascha, Marian Harbach, Thomas Muders, et al. (2012). “Why Eve and Mallory love Android: An analysis of Android SSL (in)security”. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security. ACM, pp. 50–61. doi: 10.1145/2382196.2382205.
Fahl, Sascha, Marian Harbach, Henning Perl, et al. (2013). “Rethinking SSL Development in an Appified World”. In: Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. CCS ’13. Berlin, Germany: ACM, pp. 49–60. doi: 10.1145/2508859.2516655.
Fischer, Felix et al. (2017). “Stack Overflow Considered Harmful? The Impact of Copy&Paste on Android Application Security”. In: 2017 IEEE Symposium on Security and Privacy (SP). IEEE, pp. 121–136. doi: 10.1109/sp.2017.31.
Georgiev, Martin et al. (2012). “The Most Dangerous Code in the World: Validating SSL Certificates in Non-Browser Software”. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security. ACM, pp. 38–49. doi: 10.1145/2382196.2382204.
GNU Project (2020). GNU PSPP. Version 1.2.0. Boston, MA, USA.
Google (2020). Google Transparency Report: HTTPS Encryption on the Web. url: https://transparencyreport.google.com/https (visited on July 19, 2020).
Gorski, Peter Leo and Luigi Lo Iacono (2016). “Towards the Usability Evaluation of Security APIs”. In: Proceedings of the Tenth International Symposium on Human Aspects of Information Security and Assurance, pp. 252–265.
Grams, Chris (Mar. 14, 2019). Developers spend 30% of their time on code maintenance: our latest survey results, part 3. url: https://blog.tidelift.com/developers-spend-30-of-their-time-on-code-maintenance-our-latest-survey-results-part-3 (visited on July 10, 2020).
Green, Matthew and Matthew Smith (Sept. 2016). “Developers are Not the Enemy!: The Need for Usable Security APIs”. In: IEEE Security & Privacy 14, pp. 40–46. doi: 10.1109/msp.2016.111.
Iacono, Luigi Lo and Peter Leo Gorski (2017). “I Do and I Understand. Not Yet True for Security APIs. So Sad”. In: Proceedings of the 2nd European Workshop on Usable Security. EuroUSEC ’17. Paris, France: Internet Society. doi: 10.14722/eurousec.2017.23015.


Krombholz, Katharina et al. (2017). ““I Have No Idea What I’m Doing” – On the Usability of Deploying HTTPS”. In: 26th USENIX Security Symposium (USENIX Security 17), pp. 1339–1356.
Krüger, Stefan et al. (2018). “CrySL: An Extensible Approach to Validating the Correct Usage of Cryptographic APIs”. In: 32nd European Conference on Object-Oriented Programming (ECOOP 2018). Ed. by Todd Millstein. Vol. 109. Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany, 10:1–10:27. isbn: 978-3-95977-079-8. doi: 10.4230/LIPIcs.ECOOP.2018.10.
Landis, J. Richard and Gary G. Koch (1977). “The measurement of observer agreement for categorical data”. In: Biometrics 33.1, pp. 159–174. doi: 10.2307/2529310.
Let’s Encrypt (2020). Let’s Encrypt – Free SSL/TLS Certificates. url: https://letsencrypt.org/ (visited on July 19, 2020).
Meng, Michael, Stephanie Steinhardt and Andreas Schubert (2019). “How Developers Use API Documentation: An Observation Study”. In: Communication Design Quarterly Review 7.2, pp. 40–49.
Nadi, Sarah et al. (2016). ““Jumping Through Hoops”: Why Do Java Developers Struggle with Cryptography APIs?” In: Proceedings of the 38th International Conference on Software Engineering. ICSE ’16. Austin, Texas: ACM, pp. 935–946. doi: 10.1145/2884781.2884790.
Naiakshina, Alena, Anastasia Danilova, Eva Gerlitz, et al. (2019). ““If You Want, I Can Store the Encrypted Password”: A Password-Storage Field Study with Freelance Developers”. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. CHI ’19. Glasgow, Scotland, UK: Association for Computing Machinery, pp. 1–12. isbn: 978-1-4503-5970-2. doi: 10.1145/3290605.3300370.
Naiakshina, Alena, Anastasia Danilova, Christian Tiefenau, Marco Herzog, et al. (2017). “Why Do Developers Get Password Storage Wrong?: A Qualitative Usability Study”. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. CCS ’17. Dallas, Texas, USA: ACM, pp. 311–328. isbn: 978-1-4503-4946-8. doi: 10.1145/3133956.3134082.
Naiakshina, Alena, Anastasia Danilova, Christian Tiefenau and Matthew Smith (Aug. 2018). “Deception Task Design in Developer Password Studies: Exploring a Student Sample”. In: Fourteenth Symposium on Usable Privacy and Security (SOUPS 2018). Baltimore, MD: USENIX Association, pp. 297–313. isbn: 978-1-939133-10-6.


Nemec, Matus et al. (2017). “Measuring Popularity of Cryptographic Libraries in Internet-Wide Scans”. In: Proceedings of the 33rd Annual Computer Security Applications Conference (ACSAC 2017). ACM. doi: 10.1145/3134600.3134612.
NetMarketShare (2020). Market share for mobile, browsers, operating systems and search engines. url: https://netmarketshare.com/ (visited on July 19, 2020).
Nielsen, Jakob and Rolf Molich (1990). “Heuristic Evaluation of User Interfaces”. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI ’90. Seattle, Washington, USA: ACM, pp. 249–256. doi: 10.1145/97243.97281.
The jamovi project (2020). jamovi. Version 1.2. url: https://www.jamovi.org/.
R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. url: https://www.R-project.org/.
Regier, Darrel A. et al. (2013). “DSM-5 Field Trials in the United States and Canada, Part II: Test-Retest Reliability of Selected Categorical Diagnoses”. In: American Journal of Psychiatry 170.1, pp. 59–70. doi: 10.1176/appi.ajp.2012.12070999.
Robillard, Martin P. (Nov. 2009). “What Makes APIs Hard to Learn? Answers from Developers”. In: IEEE Software 26.6, pp. 27–34. doi: 10.1109/MS.2009.193.
Robillard, Martin P. and Robert Deline (Dec. 2011). “A Field Study of API Learning Obstacles”. In: Empirical Software Engineering 16.6, pp. 703–732. issn: 1382-3256. doi: 10.1007/s10664-010-9150-8.
Saldaña, Johnny (2013). The Coding Manual for Qualitative Researchers. 2nd ed. Sage Publications. isbn: 978-1-44624-736-5.
Sauro, Jeff and Joseph S Dumas (2009). “Comparison of three one-question, post-task usability questionnaires”. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1599–1608. doi: 10.1145/1518701.1518946.
Sohan, SM et al. (2017). “A study of the effectiveness of usage examples in REST API documentation”. In: 2017 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, pp. 53–61.
Stack Overflow (2019). Stack Overflow Developer Survey 2020. url: https://insights.stackoverflow.com/survey/2020 (visited on July 7, 2020).


Tedesco, Donna and Tom Tullis (2006). “A comparison of methods for eliciting post-task subjective ratings in usability testing”. In: Usability Professionals Association (UPA) 2006, pp. 1–9.
Uddin, Gias and Martin P. Robillard (2015). “How API Documentation Fails”. In: IEEE Software 32.4, pp. 68–75. doi: 10.1109/MS.2014.80.
Ukrop, Martin and Vashek Matyas (2018). “Why Johnny the Developer Can’t Work with Public Key Certificates: An Experimental Study of OpenSSL Usability”. In: Topics in Cryptology – CT-RSA 2018: The Cryptographers’ Track at the RSA Conference 2018. Springer International Publishing, pp. 45–64. doi: 10.1007/978-3-319-76953-0_3.
Wijayarathna, Chamila, Nalin A. G. Arachchilage and Jill Slay (2017). “A Generic Cognitive Dimensions Questionnaire to Evaluate the Usability of Security APIs”. In: Human Aspects of Information Security, Privacy and Trust. Ed. by Theo Tryfonas. Springer International Publishing, pp. 160–173.
Yakdan, Khaled et al. (2016). “Helping johnny to analyze malware: A usability-optimized decompiler and malware analysis user study”. In: 2016 IEEE Symposium on Security and Privacy (SP). IEEE, pp. 158–177.

A Pre-Task Questionnaire

This is the first part of the fifth assignment in PV079. The full assignment overview can be found at […].

Questions 1–5: General previous experience

This section poses several questions on your previous experience.

1. Including any education, how many years have you been coding (regardless of the language)? [Input for number of years]
2. How would you describe your current programming skill with C/Python?[1] [Dropdown with the options “None”, “Poor”, “Fair”, “Good”, “Very good”, “Excellent”]
3. How would you describe your current knowledge of TLS? (What TLS is, how it’s used, how the protocol is structured, etc.) [Dropdown with the options “None”, “Poor”, “Fair”, “Good”, “Very good”, “Excellent”]
4. How many times have you worked with TLS programmatically (in code)? [Dropdown with the options “Never”, “Once”, “A few times”, “Many times”]
5. Please briefly describe the context in which you worked with TLS programmatically.
   • What have you done?
   • When have you done it? (recently, some years ago, …)
   • Why did you do it? (part of your job, course assignment, curiosity, …)
   If you have never worked with TLS programmatically, leave the answer empty. [Text input]

[1] Only the selected programming language was displayed.

Questions 6–12: Library 1

All the questions in this section will refer to Library 1. [Note: Library 1 was GnuTLS for C and pyssl for Python. Library 2 was libtls and pyOpenSSL, respectively.]

Note: You CAN (and probably will have to) use the Internet and browse the documentation while answering these questions.

6. How many times have you used Library 1 before? [Dropdown with the options “Never”, “Once”, “A few times”, “Many times”]
7. Please briefly describe the context in which you used Library 1.
   • What have you done?
   • When have you done it? (recently, some years ago, …)
   • Why did you do it? (part of your job, course assignment, curiosity, …)
   If you have never used it, leave the answer empty. [Text input]
8. What is the URL of the official documentation of Library 1? [Text input]
9. Browse/search the documentation. What function/argument/attribute would you use to explicitly perform the TLS handshake? Please write the name of a specific function/functions (and arguments or attributes, if relevant). [Text input]
10. Browse/search the documentation. What function/argument/attribute would you use to force a specific TLS version in the connected session? Please write the name of a specific function/functions (and arguments or attributes, if relevant). [Text input]
11. Browse/search the documentation. What function/argument/attribute would you use to configure the set of trusted root certificates? Please write the name of a specific function/functions (and arguments or attributes, if relevant). [Text input]
12. You have now interacted with the documentation of Library 1 a bit. Imagine you are using this documentation to help you with your development. Do you see any possible obstacles you might face while using it? [Text input]
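As a point of orientation for questions 9–11, the following minimal sketch shows where the three operations can be found in the `ssl` module of the Python standard library (the library called pyssl above). It is an illustration only and was not part of the questionnaire. The helper names `build_context` and `connect_with_explicit_handshake` are hypothetical, and pinning one version by setting the minimum and maximum to the same value is just one possible reading of "force a specific TLS version".

```python
import socket
import ssl


def build_context(version=ssl.TLSVersion.TLSv1_2, cafile=None):
    """Create a client context restricted to a single TLS version.

    If `cafile` is given, only the certificates in that file are trusted
    (question 11); otherwise the system default trust store is loaded.
    """
    ctx = ssl.create_default_context(cafile=cafile)
    # Question 10: allow exactly one protocol version.
    ctx.minimum_version = version
    ctx.maximum_version = version
    return ctx


def connect_with_explicit_handshake(ctx, host, port=443):
    """Open a TCP connection and run the TLS handshake explicitly."""
    raw = socket.create_connection((host, port))
    # Postpone the handshake so that it can be triggered by hand.
    tls = ctx.wrap_socket(
        raw, server_hostname=host, do_handshake_on_connect=False
    )
    tls.do_handshake()  # Question 9: the explicit handshake call.
    return tls
```

A caller would build the context once and reuse it, for example `connect_with_explicit_handshake(build_context(), "example.org")`, where the host name is a placeholder.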

Questions 13–19: Library 2

Same as the previous section but with Library 2 (libtls or pyOpenSSL) in place of Library 1.

B Post-Task Questionnaire

This is the third part of the fifth assignment in PV079. The full assignment overview can be found at […]. Answer these questions AFTER you have completed the programming task for the particular library. Longer free-text answers are expected.

Questions 1–4: Library 1

All the questions in this section will refer to Library 1. [Note: Library 1 was GnuTLS for C and pyssl for Python. Library 2 was libtls and pyOpenSSL, respectively.]

1. Overall, how difficult was the task in Library 1? [Dropdown with the options “1 (very easy)” up to “7 (very difficult)”]
2. For the official documentation of Library 1, consider the following potential issues. For each, decide whether it was a problem and, if so, how severe you found it.
   Incompleteness (The description of an API element or topic wasn’t where you expected it to be.) [Dropdown with the options “Not a problem”, “Moderate (kind of irritating)”, “Severe (I wasted a lot of time on this but figured it out)”, “Blocker (I could not get past it)”, “No opinion”]
   The same pattern repeated for the remaining nine obstacles:
   • Ambiguity (The description of an API element was mostly complete but unclear.)
   • Unexplained examples (A code example was insufficiently explained.)
   • Obsoleteness (The documentation on a topic referred to a previous version of the API.)
   • Inconsistency (The documentation of elements meant to be combined didn’t agree.)
   • Incorrectness (Some information was incorrect.)
   • Bloat (The description of an API element or topic was verbose or excessively extensive.)
   • Fragmentation (The information related to an element or topic was fragmented or scattered over too many pages or sections.)
   • Excess structural information (The description of an element contained redundant information about the element’s syntax or structure, which could be easily obtained through modern IDEs.)
   • Tangled information (The description of an API element or topic was tangled with information you didn’t need.)
3. For each item in the question above where you selected something other than “Not a problem” or “No opinion”, please briefly describe why and where specifically you found it problematic. [Text input]
4. What other sources of information did you use to solve the task? Write a list of URLs. For each URL, summarize in a single sentence why it was useful to you. [Text input]

Questions 5–8: Library 2

Same as the previous section but with Library 2 (libtls or pyOpenSSL) in place of Library 1.

Questions 9–12: Comparison and further comments

This is a concluding section.

9. You have now worked with both Library 1 and Library 2. If you could choose, which library would you use in your project? [Dropdown with the options “I prefer Library 1”, “I prefer Library 2”, “I have no preference”]
10. Why do you have such (non-)preference? Please describe your reasoning. [Text input]
11. Now reflect on the whole assignment. What have you gained/learned/realized? List at least three specific things. Don’t constrain yourself to hard skills: It may be emotions towards certain library APIs, hated/loved documentation features, new theoretical knowledge on TLS, usability insights on different APIs, … [Text input]
12. If you have any further comments regarding this assignment, feel free to write them here. [Text input]

C Assignment Microsite Snapshot

This assignment is focused on programming with TLS-related APIs and using the accompanying documentation. It consists of three parts:

1. the initial set of questions answered via the Information system,
2. the programming task in C or Python using the selected APIs, and
3. the structured reflection of the programming task (again guided by the questions in the Information system).

Several comments on the whole assignment follow:

• For successful completion of the assignment, submission of all three parts is necessary. That is, if you don’t submit the first part, you cannot get any points from the second and third parts.
• The assignment can be completed either in C or in Python (you have to choose at the beginning and then stick to the choice). That is, you do EITHER the C version of all three assignment parts OR the Python version of all three parts.
• The first part has to be completed in the first week, the latter two parts in the second week (deadline details below).

C.1 The initial questionnaire

• Deadline: Tuesday 19. 11. 23:59
• Estimated time: 1–2 hours
• Awarded score: 0–2 points
• Submission: IS questionnaire Assignment 5, part 1 C version or Python version

In this questionnaire, you will first answer a set of questions regarding your previous experience with TLS libraries (if any) and your programming skill in general. Then, you will be asked to use the documentation of the selected two libraries (in C or Python, according to your choice). Note that you don’t have to finish all the questions in one session – the questionnaire can be edited multiple times until the deadline.

C.2 The programming task

• Task published: Wednesday 20. 11. around 10:00
• Deadline: Sunday 1. 12. 23:59
• Estimated time: 4–6 hours
• Awarded score: 0–4 points
• Task files: C codes, Python codes
• Submission: IS homework vault Assignment 5, part 2

In the second part of the assignment, you are given the source code of a simple TLS client implemented in two different libraries in the language of your choice (C or Python, selected in the first part). For C, these are GnuTLS and libtls (part of LibreSSL), and for Python these are pyOpenSSL and the ssl module from the standard library. Your task is to understand the given source code and implement two small modifications in both programs:

• Adjust the client so that only the CRoCS CA is trusted as a root authority. The CRoCS CA certificate is included in the ZIP file with client source codes.
• Adjust the client to refuse connections with TLS version lower than 1.2.

Submit a single ZIP file hw05.zip with the same structure and files as the task ZIP you downloaded (that is, the only changes in files will be your code adjustments). Some recommendations and suggestions follow.

• For testing the local authority, you can use the website […], the certificate of which was issued by the CRoCS CA.
• For testing the supported TLS version, you can use badssl.com.
• While working on this part, you may use whatever resources you want (the library’s documentation and code samples, search online or use any other resources). However, if you use any sources besides the official documentation, please note down the URLs (they are needed in the third part of the assignment, see below).
• The reflection in the third part of the assignment has separate parts concerning each of the libraries. It may be a good idea to fill in the corresponding part of the reflection right after you have finished the programming task with the first library (and then fill in the rest after you have finished the other part of the programming task).
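For the Python variant, the two modifications requested in this task can be sketched with the standard `ssl` module as follows. This is a minimal illustration under stated assumptions, not the reference solution of the assignment: the function name `make_restricted_context` and the file name `crocs-ca.pem` are hypothetical, and the original client code handed out to students is not reproduced here.

```python
import ssl


def make_restricted_context(ca_file=None):
    """Build a client context that trusts only the given CA file and
    refuses any TLS version lower than 1.2.

    `ca_file` is a path to a PEM file with the root certificate
    (for example a hypothetical "crocs-ca.pem").
    """
    # PROTOCOL_TLS_CLIENT enables certificate and host name checking
    # and, unlike ssl.create_default_context(), loads no system CAs.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Refuse TLS 1.1 and older during the handshake.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if ca_file is not None:
        # The trust store now holds only the certificates from ca_file.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

Starting from an empty trust store is the design choice that matters here: adding the CA file to a context that already holds the system roots would extend the set of trusted authorities instead of restricting it.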

C.3 The reflection

• Task published: Wednesday 20. 11. around 10:00
• Deadline: Sunday 1. 12. 23:59
• Estimated time: 1–2 hours
• Awarded score: 0–4 points
• Submission: IS questionnaire Assignment 5, part 3 C version or Python version

The third part asks you to reflect on the use of the two libraries and their documentation and describe them in multiple aspects. In the questionnaire, there is a request to list URLs of all sources you used (apart from the official documentation). Keep this in mind while doing the programming task. Note that you don’t have to finish all the questions in one session – the questionnaire can be edited multiple times until the deadline. Please do take the time to answer these questions thoughtfully and diligently. The points will be awarded manually according to the completeness of your answers.

Research use

After the assignment is marked, the data will be anonymized and used for statistical and qualitative analysis in research of API usability at the Centre for Research on Cryptography and Security, Masaryk University. Your identity will not be present in the processed data. Direct quotes from your answers may be used to a limited extent in the summarizing research publications (you’ll be referred to as “one of the participants”).
