On the Usability and Security of Password-Based User Authentication

Dissertation

submitted for the degree of Doktor-Ingenieur to the Faculty of Electrical Engineering and Information Technology at Ruhr-Universität Bochum

submitted by

Maximilian Golla, born in Schweinfurt

Bochum, March 27, 2019
Date of the oral examination: May 29, 2019

Examiner: Prof. Dr.-Ing. Markus Dürmuth, Ruhr-Universität Bochum

Second examiner: Prof. Dr. rer. nat. Sascha Fahl, Leibniz Universität Hannover

Abstract

Passwords’ security and usability problems have been studied for decades. Still, passwords remain the primary authenticator in computer systems. With the increasing number of services that require authentication, users, administrators, and system developers face new challenges like the threats caused by weak passwords or password reuse. To better protect their users, services have deployed solutions to reinforce password-based authentication, mostly by considering authentication factors other than passwords. At the same time, legacy problems, such as system designers’ and users’ incorrect mental models of attacker capabilities, hinder the adoption of healthy password practices. This thesis studies four key aspects of password-based user authentication: password recovery, password strength meters, password-reuse notifications, and cracking-resistant password managers.

First, we explore password recovery mechanisms. When users are forced to comply with complicated password composition requirements or expiration policies, they cannot be blamed for forgetting their passwords. However, currently deployed knowledge-based recovery mechanisms are heavily biased by users’ selection and thus insecure. We propose a selection-bias-free fallback authentication system that both relieves the user from memorizing a secret and performs well over longer periods of time.

Second, we analyze password strength meters. During account registration, users can benefit from additional guidance and feedback provided by password strength meters. However, in a large-scale survey, we found many meters that are inaccurate yet used on popular websites or in password managers. To support developers and system designers, we provide metrics, guidance, and tools to improve their meters.

Third, we look into communicating the threats caused by password reuse. Currently, password reuse is one of the most pressing security issues in password-based authentication.
Proactive checks by service providers against their user database for matches with leaked credentials are one technique services deploy to limit the success rate of password-reuse attacks. Communicating this security issue is a challenging task, as it involves a complexity that is difficult for users to understand but requires immediate action to prevent harm. We show that users’ mental models regarding the imminent threat of password reuse are incomplete and oftentimes wrong. We then provide guidelines for system designers and developers to improve their password-reuse notifications.

Finally, we examine the security of cracking-resistant password managers. Password managers can help to deal with the increasing number of passwords and accounts. By synchronizing the password-protected vault file with cloud services, attackers have a higher chance to obtain and thus successfully decrypt the vault. Cracking-resistant managers help to mitigate this problem by concealing whether a guessed master password is correct or not. We show that the current proposal is vulnerable to distribution-based attacks, which can distinguish real from decoy vaults, and propose a more secure construction.

Kurzfassung

The usability and security problems of passwords have been studied for decades. Nevertheless, passwords remain the primary means of authentication in computer systems. With the increasing number of services that require authentication, users, administrators, and system developers face new challenges, such as the threats posed by the use of weak passwords or the reuse of passwords. To better protect users and to strengthen password-based authentication, services have developed novel solutions that take more than one authentication factor into account. At the same time, legacy problems, such as users’ and system developers’ incorrect mental models of attacker capabilities, delay the spread of secure password practices. This dissertation addresses four key aspects of password-based user authentication: password recovery, password strength meters, notifications about password reuse, and cracking-resistant password managers.

First, we consider the topic of password recovery. When users are forced to comply with complicated policies on the composition or rotation of passwords, they must not be blamed for forgetting their passwords. Currently deployed knowledge-based recovery mechanisms are strongly influenced by users’ preferences and choices and are therefore vulnerable. We therefore propose a scheme that is free of bias introduced by its users, relieves users of the need to memorize a secret, and performs well over long periods of time.

Next, we address password strength meters. During account creation, users can benefit from additional guidance and feedback provided by so-called password strength meters. In a large-scale study, however, we found many meters that work inaccurately even though they are used on popular websites or in password managers. To support developers and system designers, we therefore offer metrics, guidance, and tools for improving strength meters.

Subsequently, we examine the reuse of passwords. Password reuse is currently one of the most pressing security problems in password-based authentication. Proactive checks by service providers, matching their user database against credentials that have been stolen and published on the Internet, are one technique for limiting the success rate of password-reuse attacks. Subsequently communicating this security problem to the user is a challenging task, as the issue presents a complexity that is hard for users to understand, yet requires immediate action to avert impending harm. We show that users’ mental models of the threat posed by password reuse are incomplete and often wrong. We then develop guidance that helps system developers improve their communication with users.

Finally, we analyze the security of password managers. Password managers can help users cope with the growing number of passwords and accounts. However, by synchronizing the password-protected vault file with cloud services, attackers gain a higher chance of stealing the vault and decrypting it successfully. Cracking-resistant password managers help to mitigate this problem by concealing whether a guess at the master password is correct or not. We show that the current proposal for constructing such managers is vulnerable to distribution-based attacks, and we propose a more secure construction.

Acknowledgements

First, I would like to express my deepest gratitude to my advisor, Prof. Dr. Markus Dürmuth, for his expertise, thoughts, patience, and guidance. I am extremely thankful for receiving the opportunity to join the Mobile Security group at the Ruhr University Bochum in September 2014. Moreover, I want to thank Prof. Dr. Blase Ur for hosting my visit to the SUPERgroup at the University of Chicago in the summer of 2017. Through this mentorship, I grew as a researcher and built a global network of collaborators. Likewise, I thank Prof. Dr. Thorsten Holz and Prof. Dr. Sascha Fahl for sharing their thoughts in uncertain times and pushing me to finish this dissertation.

I was fortunate to have wonderful colleagues who supported me every day, including Katharina, Kai, David, Theodor, Florian, Philipp, Jan, Nicolai, Merlin, Martin, Christine, Teemu, Dennis, Moritz, Andre, Nadine, Johannes, Thomas, Felix, and all the others. The support of these people was absolutely indispensable. Thanks for sharing all the great moments that really made this an unforgettable experience. I am also grateful for having the opportunity to work with awesome collaborators who all became good friends over the past few years, including Miranda, Adam, Per, Elissa, Claude, Fatma, and Elizabeth. Moreover, I want to thank Nils, Konstantinos, and Philip, especially for their expertise, technical support, and discussions.

Most of all, thanks to my loving and supportive family. I would like to thank my parents, Barbara and Jürgen, for always being there and for supporting every step I have taken. I would also like to thank my dear sister Stefanie for always being my creative and emotional guide. The entire journey would not have been possible without my grandparents Ottilia and Hans-Georg. I would not be where I am without their endless support throughout the years. They have always been there for me, and I am eternally grateful for that. Finally, I wish to thank Lea for her continuous support, understanding, patience, and love; I owe you everything.

Contents

1 Introduction
  1.1 Motivation
  1.2 Challenges and Contributions
  1.3 List of Publications
  1.4 Overview and Structure

2 Preliminaries
  2.1 Introduction to User Authentication
  2.2 Password Recovery
  2.3 Password Strength
  2.4 Password Reuse
  2.5 Password Managers

3 Password Recovery
  3.1 Introduction
  3.2 Background
  3.3 The MooneyAuth Scheme
  3.4 Experiment 1: Pre-Study
  3.5 Experiment 2: Long-term Behavior Study
  3.6 Experiment 3: MooneyAuth Study
  3.7 Security Analysis and Discussion
  3.8 Conclusion

4 Password Strength
  4.1 Introduction
  4.2 Password Strength Meters
  4.3 Evaluated Password Datasets
  4.4 Similarity Measures
  4.5 Evaluation
  4.6 Results
  4.7 Conclusion

5 Password Reuse
  5.1 Introduction
  5.2 Study 1
  5.3 Study 1 Results
  5.4 Password-Reuse Notification Goals
  5.5 Study 2
  5.6 Study 2 Results
  5.7 Limitations
  5.8 Discussion

6 Password Management
  6.1 Introduction
  6.2 Cracking-Resistant Vaults
  6.3 Static and Adaptive NLEs
  6.4 The (In-)Security of Static NLEs
  6.5 Cracking NoCrack
  6.6 Adaptive NLEs Based on Markov Models
  6.7 Conclusion

7 Summary and Future Work
  7.1 Summary and Key Results
  7.2 Directions for Future Work
  7.3 Concluding Remarks

List of Figures

List of Tables

A Password Recovery
  A.1 Survey Data: Pre-Study and Long-term Study
  A.2 Survey Data: MooneyAuth Study

B Password Strength
  B.1 Comparison: Academia, PW Managers, and OSs
  B.2 Comparison: Websites and Previous Work

C Password Reuse
  C.1 Study 1 Survey Instrument
  C.2 Study 2 Survey Instrument
  C.3 Study 1 Notifications
  C.4 Study 2 Notifications

Bibliography

Only amateurs attack machines; professionals target people.

— Bruce Schneier

1 Introduction

Contents
1.1 Motivation
1.2 Challenges and Contributions
  1.2.1 Password Recovery
  1.2.2 Password Strength
  1.2.3 Password Reuse
  1.2.4 Password Management
1.3 List of Publications
1.4 Overview and Structure

1.1. Motivation

Passwords—many times declared dead—continue to dominate when it comes to user authentication on the Web. Internet services have realized that passwords will not be replaced in the near future [108]. Thus, they came up with solutions to reinforce password-based authentication, mostly by considering additional factors other than passwords [27]. The list of protection measures that services deployed in the past focused on guessability and includes composition policies [143], expiration policies [253], password strength meters [231], blacklist checks [103], hashing [18], and rate-limiting [76]. However, threats that do not depend on the strength of passwords, such as phishing [165] and password-stealing malware [224], are completely unaffected by those protection mechanisms. Today, with the increasing number of services that require authentication, users, administrators, and system developers face new challenges.

Over the past ten years, billions of credentials have been leaked due to vulnerable web applications [53] and unprotected backups [17]. As a result of insufficient key stretching [96] and advancements in GPU-based hashing [180, 216], attackers were able to crack the majority of the hashed passwords. Because users tend to reuse passwords across services, attackers can match email addresses with leaked credentials to exploit password reuse in so-called credential stuffing attacks [51, 121]. To improve the account security of their users, services started to deploy protection mechanisms that reinforce password-based authentication, like two-factor authentication (2FA), risk-based authentication, and proactive password-reuse checks [39, 224].

Two-factor authentication [205] is known from the enterprise context in the form of hardware tokens. Nowadays, it is implemented via one-time passwords sent by SMS or generated in an app on the user’s smartphone.
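The app-generated one-time passwords mentioned above typically follow the HOTP/TOTP standards (RFC 4226 and RFC 6238): an HMAC over a shared secret and a moving factor (a counter, or the current 30-second time window) is truncated to a short decimal code. The following is a minimal illustrative sketch, not a hardened implementation:

```python
import hashlib
import hmac
import struct
import time

def hotp(key: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226)."""
    msg = struct.pack(">Q", counter)  # 8-byte big-endian counter
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(key: bytes, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over the time window."""
    return hotp(key, int(time.time()) // step, digits)

# RFC 4226 test vector: this key at counter 0 yields "755224".
print(hotp(b"12345678901234567890", 0))  # 755224
```

Because both sides derive the code from the same secret and time window, the server can verify a code without any message from the user’s device ever containing the secret itself.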
Risk-based authentication [63, 177] is used to protect accounts if an unrecognized device or an unusual sign-in location is detected. In such cases, the website will ask for additional verification and notify the user via email. Services implement password-reuse checks [154, 224] to counter the imminent threat of password-reuse attacks. Here, services run large lists of leaked email and password pairs against their user database to identify vulnerable accounts.

With those new security mechanisms, novel usability and privacy issues emerge. A number of studies reported on the usability of two-factor authentication [50, 145, 191] and found issues with the mental and physical workload and the number of authentication steps involved. 2FA is also featured in recent potential privacy violations, as Facebook has been accused of abusing phone numbers registered with the service for purposes other than 2FA [111]. Others have studied the usability issues that arise when services try to share security- and privacy-related information with their users. This work includes studies on the effectiveness of password reset emails [119], post-breach risk perceptions [255], and password-reuse notifications [92]. Password advice is overwhelmingly complex [194], and system designers and users often have incorrect mental models [48]. Thus, it is not surprising that the adoption rates of healthy password practices are low [126, 189].

Users struggle with services asking for advanced composition requirements and expiration policies [125, 143]. Meanwhile, an updated publication [98] replaced the ad hoc, character-class-driven strength heuristic by NIST [38] that formed the basis for those policies. Fragments of the old recommendation are still in use in other, not-yet-updated standards and might have caused the prevalent misconceptions about password guessability and strength among users and system designers [160].
This thesis designs and implements solutions to help users and system developers deal with the new threats and legacy problems described. Specifically, we address the currently insecure and burdensome knowledge-based fallback authentication solutions, inaccurate password meters and models of password strength, communicating the results of proactive password-reuse checks and how to protect oneself against attacks, as well as the ongoing effort to improve the adoption rates of password managers and their security.

1.2. Challenges and Contributions

In this thesis, we make several contributions to password-based user authentication. Our work aims to support end users with their daily usability and security challenges. We also propose means for developers and system administrators that enable them to make more informed decisions that ultimately affect users, too. While there are several areas in the context of password-based authentication that are in need of improvement, this thesis focuses on the following four topics.

1.2.1. Password Recovery

The memorability of passwords is one of the biggest challenges users face in authentication. Thus, trading security for memorability is a common coping strategy of users [217]. If users cannot remember their passwords, password recovery helps them to regain access to their accounts. The well-known insecurity of personal knowledge questions [25] makes password recovery difficult to achieve without the use of out-of-band communication. However, user authentication requires alternatives that do not rely on email or SMS. In Chapter 3, we describe a new authentication system, called MooneyAuth, that utilizes implicit learning and thus relieves the user from the burden of actively memorizing a secret. Due to its long-lasting priming effect, it is particularly suited for password recovery. Moreover, its security does not suffer from a selection bias that is inherent and strongly pronounced in the currently deployed personal knowledge questions.
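To illustrate how an implicit-memory scheme can turn labeling performance into an authentication decision, consider the following sketch. It is not the protocol from Chapter 3; the scoring rule, data layout, and threshold are simplifying assumptions. The idea: a legitimate user, primed on a subset of degraded images during enrollment, recognizes those images noticeably more often than unprimed control images, while an attacker performs similarly on both sets.

```python
def recognition_rate(answers: dict[str, bool]) -> float:
    """Fraction of images labeled correctly (image id -> labeled correctly?)."""
    return sum(answers.values()) / len(answers) if answers else 0.0

def authentication_score(primed: dict[str, bool],
                         unprimed: dict[str, bool]) -> float:
    # The priming effect shows up as a performance gap between the two sets.
    return recognition_rate(primed) - recognition_rate(unprimed)

def accept(primed: dict[str, bool], unprimed: dict[str, bool],
           threshold: float = 0.25) -> bool:
    # The threshold is an illustrative assumption, not a recommended value.
    return authentication_score(primed, unprimed) >= threshold

# A legitimate user: strong on primed images, weak on unprimed controls.
user_primed = {f"img{i}": i < 8 for i in range(10)}    # 80% correct
user_unprimed = {f"ctl{i}": i < 3 for i in range(10)}  # 30% correct
print(accept(user_primed, user_unprimed))  # True
```

An attacker who never saw the primed images would score close to zero on the gap and fall below the threshold.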

1.2.2. Password Strength

When it comes to password creation, user choice is heavily biased, rendering passwords insecure. Users can benefit from support systems like password strength meters that estimate the guessability of passwords and can provide additional feedback and guidance [229]. Unfortunately, strength meters have a bad reputation [220], mostly because they are often based on ad hoc, inaccurate strength estimation algorithms. In Chapter 4, we investigate the accuracy of password strength meters. To measure the accuracy of a strength meter, one compares the meter’s output to an ideal reference. After identifying a suitable similarity metric for such a comparison, we conduct a large-scale evaluation and measure the accuracy of a multitude of password strength meters. We provide guidance, recommendations, and tools to help developers and administrators deploy more accurate meters and ultimately help users create better passwords.
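To make the comparison concrete: one simple candidate similarity metric is a rank correlation coefficient, such as Spearman’s rho, between the meter’s scores and reference guess numbers for the same passwords. The metric choice and the toy data below are illustrative assumptions only, not the evaluation from Chapter 4:

```python
def ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank sequences."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

meter_scores = [1, 2, 2, 3, 4]               # hypothetical meter output
guess_numbers = [1e2, 1e4, 1e5, 1e8, 1e12]   # hypothetical reference
print(round(spearman(meter_scores, guess_numbers), 3))  # 0.975
```

A value near 1 means the meter orders passwords almost exactly as the reference does; the tie between the second and third password is what keeps the toy example below a perfect score.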

1.2.3. Password Reuse

It is unreasonable to expect users to maintain dozens of distinct and secure passwords. A widespread coping strategy of users is to reuse passwords across sites [178]. Due to the large number of leaked credentials, service providers are facing an increasing number of malicious password-reuse attacks that cannot be prevented using traditional rate-limiting mechanisms. Services run large lists of leaked credentials against their user database to identify password reuse [39, 224]. Communicating the results of such reuse checks is challenging for service providers. In Chapter 5, we explore the design space around password-reuse notifications. We analyze notifications sent by real companies, ask users about the root cause of receiving such notifications and what actions they might take in response, and explore their mental models regarding the imminent threat of password reuse. Based on these findings, we establish best practices that system designers should consider to maximize the effectiveness of their notifications. Finally, we discuss measures other than notifications that should be considered for holistically addressing password reuse.
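A proactive reuse check of this kind can be pictured as follows. The sketch assumes a PBKDF2-hashed credential store and toy data; a real deployment works with whatever password hash the service already stores and with far larger lists of leaked pairs:

```python
import hashlib
import hmac
import os

def hash_pw(password: str, salt: bytes) -> bytes:
    """Derive a stored credential hash (PBKDF2-HMAC-SHA256, per-user salt)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

# Hypothetical user database: email -> (salt, password hash)
salt = os.urandom(16)
user_db = {"alice@example.com": (salt, hash_pw("hunter2", salt))}

# Hypothetical leak of email/password pairs found on the Internet.
leaked_credentials = [("alice@example.com", "hunter2"),
                      ("bob@example.com", "letmein")]

def vulnerable_accounts(db, leak):
    """Flag accounts whose stored hash matches a leaked password."""
    flagged = []
    for email, password in leak:
        if email in db:
            s, h = db[email]
            # Constant-time comparison avoids timing side channels.
            if hmac.compare_digest(h, hash_pw(password, s)):
                flagged.append(email)
    return flagged

print(vulnerable_accounts(user_db, leaked_credentials))  # ['alice@example.com']
```

Accounts flagged this way are exactly the ones for which a notification (and, typically, a forced password reset) becomes necessary.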

1.2.4. Password Management

Password managers can help users to manage their increasing number of passwords and accounts. To increase the convenience of cross-device usage, managers synchronize the protected vault file with cloud services. If attackers can access the service, they can obtain the vault and successfully decrypt it when it is protected by an insufficient master password. Cracking-resistant managers help to mitigate this problem by concealing whether a guessed master password is correct or not. In Chapter 6, we analyze the security of a cracking-resistant password vault construction. NoCrack [46] is such a cracking-resistant construction; it is based on a Natural Language Encoder (NLE) that can generate plausible-looking decoy vaults on the fly. We show that one can distinguish real from decoy vaults with high accuracy, based on the difference in the distribution of the passwords stored in the real and the decoy vaults. We further evaluate additional signals, such as passwords containing the username, password reuse, and composition policies, that should be considered in the construction of cracking-resistant password vaults. As a possible solution, we propose adaptive NLEs, an alternative construction that is more secure.
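The core idea of a distribution-based attack can be sketched in a few lines: score each vault that a candidate master password decrypts to by how likely its passwords are under a model trained on leaked, human-chosen passwords, and try the highest-scoring candidates first. The character unigram model and toy vaults below are illustrative assumptions; the attack in Chapter 6 relies on considerably stronger models:

```python
import math
from collections import Counter

# Tiny stand-in training set of leaked, human-chosen passwords.
leaked = ["password", "123456", "iloveyou", "password1", "qwerty"]

# Character unigram model with add-one smoothing over a 256-symbol alphabet.
counts = Counter(c for pw in leaked for c in pw)
total = sum(counts.values())
VOCAB = 256

def log_prob(password: str) -> float:
    """Log-likelihood of a password under the unigram model."""
    return sum(math.log((counts[c] + 1) / (total + VOCAB)) for c in password)

def vault_score(vault: list[str]) -> float:
    """Average per-password log-likelihood of a decrypted vault candidate."""
    return sum(log_prob(pw) for pw in vault) / len(vault)

real_vault = ["password1", "iloveyou2"]   # human-chosen passwords
decoy_vault = ["xq7#Vz!p", "K9@wLmQ2"]    # toy stand-in for NLE output

ranked = sorted([("real", real_vault), ("decoy", decoy_vault)],
                key=lambda v: vault_score(v[1]), reverse=True)
print([name for name, _ in ranked])  # ['real', 'decoy']
```

Because the real vault’s passwords fit the leaked-password distribution better than the decoys, it is ranked first; a secure construction must make decoys indistinguishable under exactly this kind of scoring.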

1.3. List of Publications

Below is a list of peer-reviewed publications that form the primary chapters of this thesis. The research described in those publications was carried out in collaboration with students, colleagues, and other members of the respective research projects and includes work that was conducted during a research visit at the University of Chicago.

• Chapter 3: Password Recovery C. Castelluccia, M. Dürmuth, M. Golla, and F. Deniz, “Towards Implicit Visual Memory-Based Authentication,” in Symposium on Network and Distributed System Security (NDSS ’17). San Diego, California, USA: ISOC, Feb. 2017.

• Chapter 4: Password Strength M. Golla and M. Dürmuth, “On the Accuracy of Password Strength Meters,” in ACM Conference on Computer and Communications Security (CCS ’18). Toronto, Ontario, Canada: ACM, Oct. 2018, pp. 1567–1582.

• Chapter 5: Password Reuse M. Golla, M. Wei, J. Hainline, L. Filipe, M. Dürmuth, E. Redmiles, and B. Ur, “What was that site doing with my Facebook password? Designing Password-Reuse Notifications,” in ACM Conference on Computer and Communications Security (CCS ’18). Toronto, Ontario, Canada: ACM, Oct. 2018, pp. 1549–1566.

• Chapter 6: Password Management M. Golla, B. Beuscher, and M. Dürmuth, “On the Security of Cracking-Resistant Password Vaults,” in ACM Conference on Computer and Communications Security (CCS ’16). Vienna, Austria: ACM, Oct. 2016, pp. 1230–1241.

Moreover, parts of the following peer-reviewed publications have been used to build the foundation of this thesis. The list is in chronological order.

• M. Golla and M. Dürmuth, “Analyzing 4 Million Real-World Personal Knowledge Questions,” in International Conference on Passwords (PASSWORDS ’15). Cambridge, United Kingdom: Springer, Dec. 2015, pp. 39–44.

• M. Golla, D. Detering, and M. Dürmuth, “EmojiAuth: Quantifying the Security of Emoji-based Authentication,” in Workshop on Usable Security (USEC ’17). San Diego, CA, USA: ISOC, Feb. 2017.

• M. Golla, D. V. Bailey, and M. Dürmuth, “I want my money back! Limiting Online Password-Guessing Financially,” in Who Are You?! Adventures in Authentication Workshop (WAY ’17). Santa Clara, CA, USA: USENIX, Jul. 2017.

• M. Wei, M. Golla, and B. Ur, “The Password Doesn’t Fall Far: How Service Influences Password Choice,” in Who Are You?! Adventures in Authentication Workshop (WAY ’18). Baltimore, MD, USA: USENIX, Aug. 2018.

• M. Golla, B. Hahn, K. Meyer zu Selhausen, H. Hosseini, and M. Dürmuth, “Bars, Badges, and High Scores: On the Impact of Password Strength Visualizations,” in Who Are You?! Adventures in Authentication Workshop (WAY ’18). Baltimore, MD, USA: USENIX, Aug. 2018.

• M. Golla, T. Schnitzler, and M. Dürmuth, “Will any password do?: Exploring Rate-Limiting on the Web,” in Who Are You?! Adventures in Authentication Workshop (WAY ’18). Baltimore, MD, USA: USENIX, Aug. 2018.

• W. He, M. Golla, R. Padhi, J. Ofek, M. Dürmuth, E. Fernandes, and B. Ur, “Rethinking Access Control and Authentication for the Home Internet of Things,” in USENIX Security Symposium (SSYM ’18). Baltimore, MD, USA: USENIX, Aug. 2018, pp. 255–272.

• P. Markert, M. Golla, E. Stobert, and M. Dürmuth, “A Comparative Long-Term Study of Fallback Authentication,” in Workshop on Usable Security and Privacy (USEC ’19). San Diego, CA, USA: ISOC, Feb. 2019.

• M. Golla, J. Rimkus, A. J. Aviv, and M. Dürmuth, “On the In-Accuracy and Influence of Android Pattern Strength Meters,” in Workshop on Usable Security and Privacy (USEC ’19). San Diego, CA, USA: ISOC, Feb. 2019.

• E. Liu, A. Nakanishi, M. Golla, D. Cash, and B. Ur, “Reasoning Analytically About Password-Cracking Software,” in IEEE Symposium on Security and Privacy (SP ’19). San Francisco, CA, USA: IEEE, May 2019, pp. 380–397.

1.4. Overview and Structure

The remainder of this thesis is structured as follows:

• In Chapter 2, we provide an overview of prior work and present the necessary background on topics such as studying passwords, password choice, fallback authentication, password strength metrics, and guessing algorithms. We elaborate on the password-reuse issue and security warning design, as well as the usability and security aspects of password managers.

• In Chapter 3, we introduce MooneyAuth, an authentication scheme that utilizes implicit memory. It is a selection-bias-free scheme that stands out by offering good long-term performance and promises a reduction of cognitive effort on the users’ side.

• In Chapter 4, we investigate how accurate password strength meters are by proposing a set of requirements that strength meters should fulfill and testing various similarity metrics that can be used to measure the accuracy of meters. In a large-scale survey, we compare a wide range of meter designs and provide guidance for developers to improve their meters.

• In Chapter 5, we explore password-reuse notifications. Such notifications are sent by services to inform users about their accounts’ vulnerability to password-reuse attacks. The notifications are challenging to design, as they try to explain a complex security issue and ultimately try to convince the user to change their behavior to prevent further harm.

• In Chapter 6, we describe a distribution-based attack against cracking-resistant password vaults. Furthermore, we show how to build an adaptive Natural Language Encoder based on Markov models that prevents this kind of attack.

• In Chapter 7, we summarize the thesis by listing our key findings and providing directions for future work.

Give a man an 0-day and he’ll have access for a day, teach a man to phish and he’ll have access for life.

— @thegrugq

2 Preliminaries

Contents
2.1 Introduction to User Authentication
  2.1.1 Studying Passwords
  2.1.2 Password Choice
2.2 Password Recovery
  2.2.1 Out-of-Band Communication
  2.2.2 Personal Knowledge Questions
  2.2.3 Social Authentication
2.3 Password Strength
  2.3.1 Entropy-Based Strength Metrics
  2.3.2 Guessing Algorithms
  2.3.3 Password Strength Meters
  2.3.4 Creating Stronger Passwords
2.4 Password Reuse
  2.4.1 Protection Mechanisms
  2.4.2 Security Warnings and Notifications
2.5 Password Managers
  2.5.1 Usability Assessments and Adaptation
  2.5.2 Security Analyses and Issues

In the following, we review material related to password-based user authentication. In particular, after a short introduction to user authentication, we focus on aspects of studying passwords and related user behavior (Sec. 2.1). Next, we have a look at password recovery and the security and usability issues of the currently deployed systems (Sec. 2.2). Afterward, we provide the required background on password strength estimation and password guessing, as well as strength meters, user perception of strength, and password advice (Sec. 2.3). Subsequently, we explain the threats caused by password reuse, introduce protection mechanisms, and give a brief overview of the topic of security warnings and notifications (Sec. 2.4). Finally, we review the related work on password managers (Sec. 2.5).

2.1. Introduction to User Authentication

Existing user authentication schemes are commonly based on something you know, such as passwords, something you have, such as hardware tokens, or something you are, such as biometrics. Many authentication schemes suffer from the competing requirements of security and usability [26], which are hard to fulfill simultaneously. Despite substantial effort to improve the state of the art, currently deployed schemes are far from optimal [1]. Password-based authentication is widely used on the Web, as services require an authentication system that is easy to understand for laypersons, does not rely on additional hardware, and is supported by a broad ecosystem. However, users seem to disfavor password-based authentication [104, 109]; thus, alternatives [156, 221] are becoming necessary. Yet, password replacements cannot always offer the same benefits [26] and face adoption barriers [71]. Hence, password-based authentication will stay for the foreseeable future [26, 108].

2.1.1. Studying Passwords

Researching the usability and security of text passwords is a challenging task. Obtaining reliable ground truth in password research is difficult, and studies using real-world data are rare [24, 25]. In practice, proposals are evaluated using password leaks [88], which can raise ethical questions. The majority of studies that used leaked passwords focused on the English language sphere; unfortunately, other languages are not intensively studied [149].

Most findings in the area of usable security are based on lab or online studies [159] that require convincing role-played scenarios. While user studies can reach hundreds or thousands of users, most of them rely on self-reporting, which raises questions about their ecological validity [243]. Moreover, passwords can be studied using interviews [217], which are suitable for detailed questions but force users to recall what they “normally” do. As authentication is a secondary task [200], this technique can involve a self-reporting bias that is reinforced by asking users to recall past events. Diary studies [107, 125], in which users are asked to record instances of their habits instead of recalling them later, avoid this bias but can only be conducted with a limited number of users. More involved is the use of a monitoring system installed on the users’ computers that tries to measure “real-world” behavior, like Carnegie Mellon University’s Security Behavior Observatory (SBO) [178].

Fahl et al. [72] have analyzed the ecological validity of password user studies. In their study, they role-played the enrollment into a university identity management system and compared the observed data with the passwords that students used in the real identity system. At the same time, they tested whether conducting the study in a lab environment or online leads to different results, and whether mentioning that the study is about passwords has any effect on the outcome.
They found that about 25 % of their participants used a real password and argued that insights gained in user studies can be useful data points. However, the authors also emphasized that studies need to be carefully designed, especially those interested in the memorability of passwords.

Findings based on self-reported instead of measured data are another source of error that must be considered [193]. Other biases that are well known to impact user studies include social desirability, learning and fatigue, as well as participation and recruiting biases [148, 188]. Redmiles et al. [190] have studied the impact of conducting security and privacy surveys using Amazon Mechanical Turk (MTurk). The authors tested how well MTurk results generalize to a broader population by comparing them to a census-representative web panel and a probabilistic telephone sample. They reported that the comparably low-cost MTurk results are a good proxy for the other samples. Unfortunately, they also found significant differences for older and less educated users, who are often underrepresented in studies but in particular need of usable security and privacy solutions.

2.1.2. Password Choice

In the following, we explain how users usually interact with passwords. Jakobsson and Dhiman [128] found that users produce passwords using only a small set of rules and components such as dictionary words, replacement strategies, and misspellings. Studying these passwords by their semantic theme reveals that they often relate to pets’ names, people’s names, or dates [236]. Also common are concepts relating to love, profanity, animals, food, and money [237]. Other frequent terms include sports teams, geographic locations, or song lyrics [147]. Keyboard walks or patterns are also common passwords [31]. Furthermore, Wei et al. found that passwords often include the name or semantic theme of the service for which they are created [244]. Wash et al. [243] studied six weeks of password use with the help of a custom data collection tool installed on the participants’ computers. They combined their measurements with a survey asking about beliefs and behaviors. Their design allowed them to compare self-reported intentions with actual behavior. Unfortunately, they found a low correlation between intended password use and actual behavior. Besides the reuse of passwords, the authors confirmed a finding from previous works [72, 75, 82, 217]: there seems to be a hard limit of around 5–6 different passwords that users actively remember and use. Pearman et al. [178] studied the password habits of 154 users over six months using a behavior observatory. They measured an average password strength of approximately 10^12 guesses (based on estimates by a neural network [162, 229]). Regarding reuse, they found that the average participant reused 79 % of their passwords and that the average password is related to at least 3.66 other passwords. Similar to others, they found that password reuse was rarely limited to a single account category.
Moreover, they found that strength is a statistically significant factor in predicting reuse: they observed that the stronger a password is, the less likely it is to be reused. Hanamsagar et al. [106] studied the reasoning behind bad password habits. Similar to Wash et al. [243], they found that users’ intentions often mismatch their practice. Reasons for this include misconceptions about attacks regarding password strength and reuse. Convenience is another reason, as users prefer to trade security for memorability. Most importantly, while users intend not to reuse passwords across account categories (banking vs. social media), they do so in practice.

2.2. Password Recovery

The memorability of passwords is one of the biggest challenges users face in authentication. Thus, trading security for memorability is a common coping strategy of users [217]. If users cannot remember their passwords, fallback authentication, sometimes called password recovery or reset, helps them regain access to their accounts. The usability requirements for fallback authentication differ from those of primary authentication. Depending on the deployed system, long-term memorability is more critical, due to the missing rehearsal of the secret. An analysis of the deployed fallback authentication system at Google by Bonneau et al. [25] found that the system is most often used within time frames of 6 to 18 months. Moreover, authentication time and workload are less critical than in primary authentication systems, as fallback systems are not intended for daily use. Finally, the deployed rate-limiting can be stricter. There are many prominent examples of targeted attacks that exploited fallback authentication [113, 252].

2.2.1. Out-of-Band Communication

The most frequently used systems are based on password reset via out-of-band communication. In this case, a registered email address or mobile phone number [80] receives a new temporary password or a time-limited password reset code or link. However, receiving such password reset messages can be risky if not correctly implemented [29], and can be error-prone if the contact details on record are outdated. Furthermore, not all users like the idea of providing their cellphone number or email address due to privacy concerns [111]. Moreover, receiving an SMS or an email is not always possible, e. g., if the receiving smartphone is out of reach or has run out of battery.

2.2.2. Personal Knowledge Questions

The second most used fallback authentication system on the Web is personal knowledge questions (PKQs), sometimes called cognitive passwords or security questions. The earliest study of PKQs was conducted in 1990 by Zviran and Haga [256], which attributed good usability and security properties to cognitive passwords. However, by today’s standards, the security of PKQs is low, as several studies have shown. Griffith et al. [100] showed that the Mother’s Maiden Name question, which is frequently used as a PKQ, can often be derived from public databases, rendering it insecure. Rosenblum [198] has shown that private information about persons can often be inferred from social networking sites. This information can also be used to narrow down potential answers to security questions. The secrecy of PKQ answers in the age of Facebook was studied by Rabkin [186], whereas Bonneau et al. studied the entropy of names [28]. Schechter et al. [202] demonstrated that for a number of such security questions the answers can often be guessed easily. A more general discussion on designing security questions, including usability, privacy, and security, is given by Just [135]. An alternative and potentially better domain of security questions, namely questions about personal preferences similar to those used on online dating sites, was studied by Jakobsson et al. [129]. They found that preference-based questions are more secure than most other commonly used questions. Bonneau et al. [25] have evaluated real-world PKQs at Google. They found that some answers are quite predictable, in part because some users do not answer truthfully, which lowers the overall security. Golla et al. [88] analyzed the security of real-world PKQs leaked after the hack of an online dating website. Micallef and Arachchilage [164] studied how gamification can be used to improve the usability of system-generated answers to avatar-based security questions.

2.2.3. Social Authentication

Fallback authentication using information about the social graph of a user, so-called social authentication, has been explored by Brainard et al. [32] and Schechter et al. [203]. In 2011, Facebook deployed such a social fallback scheme, which utilizes designated trustees called Trusted Friends [69]. In this scheme, the user is asked to provide three codes that are sent to a list of friends. Due to a design flaw [130] that exploited the fact that the list of friends was not predefined but selected after access was lost, Facebook changed details of the scheme, which is nowadays called Trusted Contacts [70]. Today, users can set up three to five friends who receive a recovery code via email in case the user has forgotten the password. By collecting three or more codes, one can reset the password. However, due to the protocol design, recovery times can rise from hours to days, which is a potential drawback of this approach.

2.3. Password Strength

Next, we provide a detailed overview of password strength estimation and guessing. Trying to estimate the strength of a password as a measure to defend against guessing attacks has a long history. In 1979, Morris and Thompson [169] did password checking by attempting to crack hashed passwords. The ones successfully cracked were marked as weak, and the users were notified. Subsequently, systems began to check the strength of a password before accepting it, using proactive checkers with certain rule sets that try to exclude weak passwords [19, 142, 214].

2.3.1. Entropy-Based Strength Metrics

In the past, mathematical metrics for guessing resistance based on entropy estimations have been used to calculate password strength. The advantage of an entropy estimation approach is that it always models a best-case attacker and does not introduce bias from a specific password cracking software or configuration. The SP 800-63-2 Electronic Authentication Guideline [38] by NIST includes ad hoc heuristics to estimate the entropy of a single password based on password length, compliance with a composition policy, and a common dictionary check. The most recent version, SP 800-63B [98], no longer recommends this basic estimation heuristic, due to its inaccuracy [246]. Joseph Bonneau [24] discussed several well-known entropy-related metrics to estimate the strength of an entire password distribution. He reviewed why Shannon entropy is an inappropriate measure of guessing difficulty. Furthermore, he described an accurate metric, called α-guesswork, which is related to a metric by Pliam [182], and which incorporates the fact that an attacker may not want to guess the passwords of all accounts but may stop after some successful guesses.
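The intuition behind such partial-guessing metrics can be illustrated with a short sketch. This is our own simplified reading of α-guesswork, not Bonneau’s reference formulation, and the toy distribution is invented for illustration: an attacker guesses passwords in decreasing order of probability and stops once a fraction α of accounts is expected to be compromised.

```python
def alpha_guesswork(probs, alpha):
    """Simplified sketch of alpha-guesswork: expected guesses per account
    when the attacker guesses in optimal order and stops after an expected
    fraction `alpha` of accounts has been compromised."""
    p = sorted(probs, reverse=True)      # optimal order: most likely first
    covered, work, mu = 0.0, 0.0, 0
    for i, pi in enumerate(p, start=1):
        covered += pi                    # fraction of accounts cracked so far
        work += pi * i                   # accounts cracked at guess number i
        mu = i
        if covered >= alpha:             # cut-off point mu_alpha reached
            break
    # accounts that remain uncracked still cost mu guesses each
    return work + (1.0 - covered) * mu

# Toy distribution: one very common password, many rarer ones.
dist = [0.5] + [0.05] * 10
print(alpha_guesswork(dist, 0.5))  # → 1.0 (one guess already covers 50 %)
```

A single Shannon-entropy number would hide exactly this effect: the metric grows slowly for small α because the head of the distribution is so heavy.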

2.3.2. Guessing Algorithms

Guessing passwords is in many ways related to password strength. The optimal strategy is to guess passwords in decreasing order of likelihood, i. e., most frequent passwords first. There are different proposals to enumerate passwords with decreasing likelihood, in other words, with increasing strength. In practice, the most relevant way to guess passwords is GPU-based password cracking, which uses large dictionaries and ad hoc mangling rules to generate password candidates in a transformation-based attack. To measure the strength of a password, the number of guessing attempts made by real-world password cracking tools such as Hashcat and John the Ripper [233] can be used. This method provides very accurate reflections of password strength, but it can be sensitive to the choice of guessing tools and their configuration, and it is usually time and resource intensive. Recently, Liu et al. [151] described how to reason efficiently about such transformation-based password cracking approaches without needing to enumerate guesses. A different approach to password guessing and strength estimation relies on probabilistic password models that are trained on some corpus of passwords that ideally represents the target password distribution. The literature on building such relatively accurate password models includes several proposals:

1. Markov Models: In 2005, Narayanan and Shmatikov [170] proposed their use to overcome some problems of dictionary-based attacks. The idea behind Markov models is based on the observation that subsequent tokens, such as letters in a text, are rarely independently chosen and can often be accurately modeled based on a short history of tokens. In 2012, Dürmuth et al. [43, 66] improved the approach by generating password candidates according to their occurrence probabilities, i. e., by guessing the most likely passwords first. In 2014, Ma et al. [158] discussed other sources of improvement such as smoothing, backoff models, and issues related to data sparsity.

2. Probabilistic Context-Free Grammars: In 2009, Weir et al. [247] suggested a method to exploit the structural patterns of a password leak by associating a probability with the members of the distribution defined by an underlying context-free grammar (CFG). In 2014, Veras et al. [236] extended the approach by building a semantically meaningful PCFG-based password guesser. In 2016, Wang et al. [239] proposed a fuzzy PCFG that is built by learning how mangling rules are used to modify a base dictionary to match a training distribution of stronger passwords.

3. Neural Networks: In 2016, Melicher et al. [162, 229] proposed using long short-term memory (LSTM) recurrent neural networks (RNNs). Similar to Markov models, their approach calculates the probability of a subsequent character in a password based on the previous characters using a multi-layer neural network.
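The core idea shared by the Markov-model line of work above can be sketched in a few lines. This is a minimal order-1 character model with start/end markers; the tiny training list is purely illustrative, whereas real models train on millions of leaked passwords and need the smoothing and backoff techniques discussed by Ma et al.

```python
from collections import defaultdict

def train_markov(passwords):
    """Count order-1 character transitions, including start (^) and end ($)
    markers, over a training corpus of passwords."""
    counts = defaultdict(lambda: defaultdict(int))
    for pw in passwords:
        chars = ["^"] + list(pw) + ["$"]
        for prev, cur in zip(chars, chars[1:]):
            counts[prev][cur] += 1
    return counts

def probability(counts, pw):
    """Probability the model assigns to pw. No smoothing: an unseen
    transition makes the whole password probability zero."""
    p = 1.0
    chars = ["^"] + list(pw) + ["$"]
    for prev, cur in zip(chars, chars[1:]):
        total = sum(counts[prev].values())
        if total == 0 or counts[prev][cur] == 0:
            return 0.0
        p *= counts[prev][cur] / total
    return p

model = train_markov(["password", "passw0rd", "pass123"])
# Strings resembling the training data receive a higher probability,
# i. e., they would be guessed earlier and count as weaker.
print(probability(model, "password") > probability(model, "zzzz"))  # → True
```

Candidate enumeration in decreasing probability order, as introduced by Dürmuth et al., then amounts to a best-first traversal over exactly these transition probabilities.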

In 2010, Dell’Amico et al. [60] conducted an empirical study on the effectiveness of different password guessing approaches. Schechter et al. [204] classified passwords as weak by counting the number of times a specific password is present in the password database. Kelley et al. [140] proposed a guess-number calculator to determine if and when a given password-guessing algorithm would guess a specific password. Another study targeting probabilistic password modeling approaches was done by Ma et al. [158]. Ur et al. [233] did a large-scale comparison and found that running a single guessing algorithm often yields a very poor estimate of password strength. Dell’Amico and Filippone [59] proposed the use of Monte Carlo methods to estimate the number of guesses required to find passwords that are far beyond practical enumeration. The security threat of targeted online guessing attacks was analyzed by Wang et al. [240].
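The Monte Carlo idea of Dell’Amico and Filippone can be sketched as follows. This is a simplified illustration under a toy model we invented for the example: because the samples are drawn from the model itself, a sample with model probability q is weighted by 1/q, which makes the weighted count an unbiased estimate of how many passwords the model ranks above a given probability, without enumerating them.

```python
import random

def estimate_guess_number(sample_probs, p, n):
    """Monte Carlo estimate (in the spirit of Dell'Amico and Filippone) of
    the number of passwords the model ranks strictly above probability p.
    `sample_probs` holds the model probabilities of n passwords sampled
    from the model itself; importance weight 1/q corrects the sampling."""
    return sum(1.0 / q for q in sample_probs if q > p) / n

# Toy model: an explicit distribution we can sample from directly.
# (Real models sample via Markov chains, PCFGs, or RNNs instead.)
random.seed(7)
passwords = {"123456": 0.4, "password": 0.3, "qwerty": 0.2, "dragon": 0.1}
probs = list(passwords.values())
samples = random.choices(probs, weights=probs, k=100_000)

# "qwerty" truly has two passwords ranked above it.
print(round(estimate_guess_number(samples, passwords["qwerty"], len(samples))))  # → 2
```

For realistic models the same estimator yields guess numbers around 10^14 and beyond from only a few million samples, which is what makes strength estimation for strong passwords tractable.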

2.3.3. Password Strength Meters

It is challenging for users to estimate the strength of a password. Ur et al. [230] investigated the relationship between users’ perceptions of password strength and actual strength. They found severe misconceptions for passwords that include digits like astley123, keyboard walks like 1qaz2wsx3edc, and common phrases like iloveyou. A password strength meter (PSM), also called strength meter or password meter, can help users choose more secure passwords by displaying a representation of the estimated strength. It either tries to nudge or to force a user to pick a password that provides a reasonable level of security in terms of guessing resistance. Widely adopted and intensively studied are password strength meters that display the estimated password strength as a bar, often horizontally oriented, typically with colors changing from red via yellow to green. Beyond these simple, informative meters, motivators such as fear appeals [234] and peer pressure [68] have been studied. Sometimes meters also include a textual representation of the estimated guessing resistance, e. g., [Weak, Good, Strong]. Recent academic proposals [229] tell users what is wrong with their password and how to improve it by offering additional specific guidance and feedback. In the past, ad hoc solutions like counts of lower- and uppercase characters, digits, and symbols (LUDS) have been used as a strength metric. Such meters are known to not accurately capture strength [43, 57, 89, 246]. However, they are still in use, and they are the main reason for the sometimes reported poor quality of strength meters on the Web [220]. Many of the password strength meters in the current literature are based on the aforementioned password guessing approaches. Proposals include neural networks by Melicher et al. [162] and Ur et al. [229], PCFGs by Houshmand and Aggarwal [115] and Wang et al. [239], and Markov models by Castelluccia et al. [43]. Furthermore, there is a meter that uses a set of advanced heuristics by Wheeler [248], the official NIST entropy estimation [38], and others [68, 102, 231].

Ur et al. [231] found that strength meters, depending on the visual feedback, led users to create longer passwords or caused them to place less importance on satisfying the meter. Egelman et al. [68] studied the impact of password strength meters on the password selection process and found that meters result in stronger passwords when users are forced to change existing passwords on “important” accounts. De Carné de Carnavalet and Mannan [57] analyzed deployed strength meters in 2014. They found evidence that the commonly used meters are highly inconsistent and fail to provide coherent feedback.

2.3.4. Creating Stronger Passwords

Besides judging and visualizing the strength of a password, meters can also assist users in creating stronger passwords by providing additional help. Shay et al. [208] explored how feedback and guidance could improve password creation under specific composition policies. They found that giving real-time feedback prevents users from making errors while creating strong passwords. Their three-step guided password-creation process made password creation easier but resulted in weak passwords. Ur et al. [232] conducted a qualitative interview study and asked participants about their general password creation strategies and inspirations. Based on their observations, they suggested reworking abstract advice like “include digits” to be more specific, like “consider inserting digits into the middle.” Later, Ur et al. [229] implemented this suggestion by adding a detailed data-driven feedback functionality using 21 heuristics that try to teach a user how to improve a specific candidate password. Similar to zxcvbn [248], their heuristics consider common words and phrases, character substitutions, typical locations of uppercase letters and digits, and keyboard patterns. They found that the feedback functionality produced more secure, but not less memorable, passwords than a traditional bar-based meter without the detailed feedback functionality. Common security advice given to users for protecting their online accounts is to use strong passwords, not to reuse passwords for more than one account, not to write them down, and ideally to change them frequently [194]. These recommendations require substantial cognitive effort and are almost impossible for end users to follow [160], as shown by the escalating password reuse problem [178]. This observation has started a process of adopting more realistic recommendations [171].
For example, to group accounts and reuse passwords within the same category [77], or to no longer recommend frequent password changes [172].

2.4. Password Reuse

As described before, it is unreasonable to expect users to maintain dozens of distinct and secure passwords. Although online account providers employ new methods beyond password-based authentication to improve security, such as 2FA [50, 205] and risk-based authentication [63, 79, 165], solutions such as password managers face adoption barriers [71]. Accounts, therefore, remain vulnerable to a number of password-related attacks [233]. Various studies over the years have found that users reuse a majority of their passwords across sites [54], as users have dozens of accounts but only a few passwords that they cycle through [72, 75, 178, 217, 243]. Users reuse passwords to minimize the burden of memorization [77, 82], and they do so especially often for accounts they consider to be of low value [82, 217]. Even if users do not reuse passwords verbatim, they often modify existing passwords when creating new ones [143, 178, 209]. For several years now, password breaches have been frequent, with billions of credentials already reported stolen [121]. Leveraging stolen credentials enables attackers to perform online guessing with some success. Password reuse amplifies the severity of all password attacks [11, 54, 105, 127, 152, 183, 241]. Once login credentials are compromised, all accounts with those same credentials become vulnerable to credential stuffing attacks [51, 121, 176]. Thomas et al. accumulated over 1.79 billion non-unique usernames and passwords from credential leaks, finding that 7–25 % of those credentials would enable attackers to log into a compromised account holder’s Google account [224]. Credential stuffing, which automates logging into as many sites as possible with stolen login credentials, generates more than 90 % of login traffic on many of the world’s largest websites and mobile apps [207]. Once accounts have been compromised, attackers may use them to send spam, obtain financial data, or distribute malware [173, 223].

2.4.1. Protection Mechanisms

Common protection mechanisms against online password guessing, such as rate-limiting and throttling, often do not apply to password-reuse attacks. An overview of rate-limiting techniques (CAPTCHAs, blocking, and account locking) is given by Golla et al. [86, 90] and Florêncio et al. [76]; such techniques can have severe usability issues [36]. Some protection against password-reuse attacks is offered by risk-based authentication [79, 165], which includes behavior [165] and browser fingerprinting [6]. Related to password reuse are new attempts to proactively blacklist millions of leaked passwords [120, 185] and a personalized strength meter that considers a user’s history of leaked credentials [176]. Some users adopt password managers [157] that can generate random passwords with no reuse, or activate two-factor authentication [50], which can help to enhance account security.
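At its core, the proactive blacklisting mentioned above can be sketched as a normalized lookup. The tiny list and the simple normalization rule below are our own illustrative assumptions; deployed checks use corpora of millions of leaked passwords and considerably fuzzier matching.

```python
# Illustrative stand-in for a corpus of millions of leaked passwords.
LEAKED = {"123456", "password", "qwerty", "iloveyou"}

def is_blacklisted(candidate, blacklist=LEAKED):
    """Reject a candidate whose lowercased core (with trailing digits and
    '!' stripped) appears in a leak corpus. Sketch of proactive checking
    under a deliberately simple normalization rule."""
    lowered = candidate.lower()
    normalized = lowered.rstrip("0123456789!")
    return lowered in blacklist or normalized in blacklist

print(is_blacklisted("Password1!"))  # → True: trivial variant of a leaked value
print(is_blacklisted("vK#9tTqLm2"))  # → False
```

The normalization step matters: without it, attackers’ standard mangling rules (appending digits or punctuation) would bypass an exact-match blacklist immediately.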

2.4.2. Security Warnings and Notifications

For service providers, it is a challenging task to communicate account security issues to the user. A large body of prior work has researched security warnings and notifications. Some examples include encouraging users to adopt 2FA [191] and detecting phishing [67, 210]. In their study of 25 million warning impressions in Google Chrome and Mozilla Firefox, Akhawe et al. found that user experience has a significant impact on behavior and that users often do look at warnings [5], contrary to other findings that users are susceptible to habituation and often ignore Web warnings [33, 34]. Jenkins et al. evaluated the efficacy of just-in-time fear appeals in warnings at preventing users from reusing passwords,

finding that such appeals resulted in a significant decrease in pass- word reuse [131]. Zou et al. studied reactions to notifications of the Equifax data breach [255]. Huh et al. studied the notification LinkedIn sent users after their password database was breached, finding that less than half of participants changed their LinkedIn password upon receiving this notification [119].

2.5. Password Managers

While using a password manager might not be a solution for everyone, managers can help users with their weak and reused passwords. Password managers are intended to generate and store a user’s passwords, such that the user is no longer required to remember them. Password managers support users in generating and retrieving strong passwords by storing them in an encrypted vault file and suggesting unique and secure passwords during account creation. The vault file is encrypted using a key derived from a single, so-called master password. Password managers are shipped with browsers or integrated into them using extensions. Besides basic operations such as storing and retrieving passwords, managers do more to help users with their account security. For example, they often integrate password generators, password strength meters, and blacklist checks. Moreover, they sometimes implement a password auditing functionality that checks for compromised passwords, weak passwords, or password reuse. For convenience, managers often offer synchronization of the vault across devices using cloud services.

2.5.1. Usability Assessments and Adaptation

The low adoption of password managers has been intensively studied. A paper by Chiasson et al. [48] was one of the first works to report on users and their inaccurate or incomplete mental models that hinder the adoption of password managers. Karole et al. [138] compared online, phone, and USB key-based password managers and found that users were not comfortable trusting an online manager with their passwords. Mark Ciampa [49] compared a bookmarklet that generates site-specific passwords with a cloud-based password manager. While users liked the manager more, the author also noted that users continue to believe that creating and memorizing strong passwords is their responsibility and should not be entrusted to a password manager. Stobert and Biddle [217] conducted interviews to learn more about users’ coping strategies for managing a large number of accounts and passwords. Some of their participants expressed distrust in password managing software. Moreover, the authors reported that for those who used a manager integrated into their browser, it remained unclear whether anyone made use of the password creation mechanisms that are vital to prevent reuse and generate strong passwords.

Florêncio et al. [77] recommended grouping accounts for reuse. Their main motivation is that users have a fixed effort budget, and the goal is to minimize the total expected loss from different classes of attacks with regard to common coping strategies. They found strategies that rule out weak passwords or reuse to be suboptimal and suggested reusing passwords within smaller groups of, e. g., high-, medium-, and low-value accounts. Unfortunately, a couple of works have demonstrated that users reuse passwords across account categories and mix high- and low-value accounts [106, 178].

Stobert and Biddle [218] interviewed 15 expert users regarding their password managing strategies. While experts still relied on reusing passwords, they only did so for low-value accounts. The majority, 12 out of 15 experts, reported using a password manager and generating random passwords for their high-value accounts. In comparison to average users, experts use dedicated password managers more frequently, but they also rely on their browsers to store their passwords. Alkaldi and Renaud [7] investigated why users adopt or reject password managers. They found a lack of awareness, as users did not know about the existence of password managers. Moreover, they reported that people believe their current password practices are secure and do not need to change. Finally, as a number of respondents mentioned trust issues, the authors highlighted the importance of explaining to users how password managers work and protect accounts. Fagan et al. [71] studied non-users and users of password managers to explore reasons for the low adoption rates of password managing software. Similar to previous work, non-users expressed security concerns, mostly due to a lack of understanding of the technology. Interestingly, the authors also found a higher prevalence of suspicion among non-users when logging into a website. They recommended better explaining the purpose of a password manager and the technologies behind it. Lyastani et al. [157] evaluated the impact of managers on password strength and reuse. They highlighted the importance of the password creation strategy: the usage of a password manager alone does not guarantee strong and different passwords for every website. Instead, they found that it depends on how users interact with the software.
They suggested better supporting users throughout the entire process, from password creation, through storage, to password entry, such that users’ old passwords get replaced by new, strong, random strings.

2.5.2. Security Analyses and Issues

Besides the usability and adoption aspects, the security problems of password managers have been analyzed. Gasti and Rasmussen [81] analyzed the formal security of encrypted vault storage formats. They provided two realistic security models and evaluated a number of popular password managers. Unfortunately, they found most vault storage formats to be insecure. Zhao et al. [254] compared two cloud-based password managers and found vulnerabilities regarding locally saved master passwords and unprotected vault information. Li et al. [150] examined the security of web-based password managers. They found flaws in the function for generating one-time passwords, the use of bookmarklets, and a password sharing feature. Silver et al. [93, 211] described an attack that abuses the autofill functionality. They described multiple ways to execute their so-called Sweep Attack by injecting JavaScript on the fly into the victim’s browser. Stock et al. [219] analyzed the potential threat posed by cross-site scripting (XSS) attacks. The lack of support for password managers in mobile operating systems such as Android or iOS has led to a series of security issues. Fahl et al. [73] studied the security of password managers on mobile Android devices. They found security problems due to the lack of support and integration into Android, which allows attackers to steal account credentials once they are copied to the clipboard. Besides, they found issues related to storage encryption, key derivation, and TLS certificate pinning. Casati and Visconti [41] showed the security impact on password managers when used on rooted Android devices. After Fahl et al.’s investigation [73] and Google’s effort [136] to improve the situation, Aonzo et al. [9] found more security vulnerabilities caused by the improper support for password managers. They found that password managers on Android used to rely on attacker-controllable package names to identify apps.
In their analysis, they showed how Android apps are vulnerable to “hidden field” attacks and how password managers can be tricked into suggesting and autofilling credentials into malicious apps that disguise themselves as legitimate. Further, they exploited a technology called Instant Apps, which allows an adversary to gain full UI control and phish for credentials that are stored in the manager software. Huber et al. [117] have reported other vulnerabilities of password managers running on Android. Tavis Ormandy found various security vulnerabilities in popular password managers like 1Password, Keeper, and LastPass, which were related to inter-process communication and browser extensions [95, 110].

Do not allow me to forget you.

— Gabriel García Márquez

3. Password Recovery

Contents

3.1 Introduction
    3.1.1 Contributions
    3.1.2 Outline
3.2 Background
3.3 The MooneyAuth Scheme
    3.3.1 Description
    3.3.2 Adversary Model
    3.3.3 Static Scoring
    3.3.4 Dynamic Scoring
3.4 Experiment 1: Pre-Study
    3.4.1 Experimental Setup
    3.4.2 Matching Labels
    3.4.3 User Participation
    3.4.4 Results
3.5 Experiment 2: Long-term Behavior Study
    3.5.1 Experimental Setup
    3.5.2 User Participation
    3.5.3 Results
3.6 Experiment 3: MooneyAuth Study
    3.6.1 Experimental Setup
    3.6.2 User Participation
    3.6.3 Results
3.7 Security Analysis and Discussion
3.8 Conclusion

3.1. Introduction

In the following, we describe a new type of knowledge-based authentication scheme called MooneyAuth. It eases the high cognitive load of explicit passwords and thus has the potential to improve the usability and security of knowledge-based authentication. In particular, we study how implicit memory can be used to design a secure and usable authentication scheme. Current knowledge-based authentication schemes are based on explicit memory, where users are asked to create a random combination of characters as their authentication secret and to explicitly provide this secret at the time of authentication. Such secrets are usually very difficult to memorize, as one has to work consciously to remember them. In contrast, with an implicit memory-based scheme, users first learn an association between a task and its solution. This learned association is then used as the authentication secret. Because a situation stored in implicit memory is remembered with less effort [199, 201], almost unconsciously, such an authentication scheme relieves users of the high cognitive burden of remembering an explicit password. In this work, we built a novel, operational authentication scheme utilizing implicit memory based on Mooney images [168]. A Mooney image is a degraded two-tone image of a single object. This object is usually hard to recognize at first sight and becomes easier to recognize once the original image has been presented to the user. Our scheme is composed of two phases: (i) In the enrollment phase, the user learns the association between a set of Mooney images, their original versions, and labels describing the content of the image. This process is also called “priming.” (ii) During the authentication phase, a larger set of Mooney images, including the primed Mooney images from the enrollment phase, is displayed to the user. The user is then asked to provide a label for the hidden object in each Mooney image.
Using our dynamic scoring algorithm, the system computes an authentication score and grants or denies access accordingly. Due to its relatively slow enrollment and authentication, but high long-term memorability, MooneyAuth seems particularly suited for fallback authentication. We conducted three experiments to identify practical parameters, to measure long-term effects, and to determine the performance of the scheme. We conducted Experiment 1 over the course of 25 days with 360 participants, of which 230 finished both phases. The results of this experiment were used for parameter selection. To identify long-term priming effects of Mooney images, we re-invited the participants after 264 days in Experiment 2. To validate the overall performance of the scheme, we performed Experiment 3 with 70 new participants over the course of 21 days.

3.1.1. Contributions

Our contributions include:

1. We present a novel authentication scheme, based on implicit visual memory, that outperforms existing ones in terms of false acceptance and false rejection rates, as well as the time required for authentication.

2. To decide whether a user successfully passes the authentication phase, the inputs have to be evaluated, i. e., "scored." We propose an alternative scoring technique, dynamic scoring, which is inspired by the notion of self-information, also known as surprisal. We show that our scoring technique substantially outperforms the static scoring proposed in previous work by Denning et al. [62].

3. We demonstrate the practicability of our scheme by implementing it and conducting three experiments. The results show that MooneyAuth substantially outperforms current implicit memory-based authentication schemes [62].

4. We are the first to study long-term priming effects of Mooney images over a period as long as 8.5 months. The results reveal a substantial long-term priming effect for Mooney images, which implies that MooneyAuth is suited for fallback authentication with long intervals between enrollment and authentication.

The contributions of this work resulted from a collaboration with Claude Castelluccia, Markus Dürmuth, and Fatma Deniz and the support of Inria Grenoble and the University of California, Berkeley.

3.1.2. Outline

This chapter is structured as follows: Section 3.2 introduces the concept of implicit memory and Mooney images. Our scheme is described in Section 3.3. Then, we present details on the three experiments we performed: the pre-study for estimating the required parameters in Section 3.4, a long-term study showing that the priming effect of Mooney images lasts over time in Section 3.5, and the main study demonstrating the general performance of the scheme in Section 3.6. We discuss security properties in Section 3.7 and conclude with some final remarks in Section 3.8.

3.2. Background

Next, we provide an overview of the related work in the area of implicit memory-based authentication and introduce Mooney images.

Associative and Repetitive Memory-Based Authentication

Work by Bonneau and Schechter [30] demonstrated that users are capable of remembering cryptographically strong secrets via spaced repetition. In their experiment, they enabled users to learn a limited number of strong authentication secrets by displaying an additional code that was required to log in. This code did not change and was only shown after an annoying delay that increased with every login attempt. Users were thus motivated to accelerate the login procedure by entering the code from memory instead of waiting for it to display, having subliminally learned it by heart through its continuous repetition. After several such fast and successful logins, the code was extended. A similar user study by Blocki et al. [20] improved on the repetition idea. Based on so-called Person-Action-Object (PAO) stories, they combined associative and repetitive memory. They asked their participants to invent a story based on a shown photo, a user-chosen famous person, and a randomly selected action-object pair that served as the authentication secret. In contrast to [30], the users were able to see the complete secret at once and were told that they were required to learn it. As a result, the users were able to remember their secrets for longer periods with fewer rehearsals, thanks to the PAO story mnemonic.

Implicit Memory-Based Authentication

Applying knowledge about how humans store and recall information to user authentication was first done by Weinshall and Kirkpatrick [245] in 2004. However, their scheme made use of the explicit characterization of images stored in human memory, not implicit memory, and its performance is unsuitable for deployment. In 2011, Denning et al. [62] proposed a scheme that utilizes implicit memory. They presented an authentication scheme based on implicitly learning associations between degraded drawings of familiar objects (e. g., animals, vehicles, or tools) and their complete drawings. Each degraded drawing was created by using fragmented lines instead of continuous lines. In their paper, the authors presented a preliminary authentication scheme and performed a user study. Their results show that many of these drawings exhibit a (small) priming effect, but for all but two of the tested images, this effect is too small to be used in an authentication scheme. As the authors acknowledge, the viability of such a system depends on being able to systematically identify or create images with a sufficiently strong priming effect. Our idea builds on this work to propose a complete and efficient system. We show that Mooney images provide the strong priming effect necessary to implement such a practical scheme, and we build a real prototype. In 2012, Bojinov et al. [22] proposed a scheme that resists coercion attacks, where the user is forcibly asked by an attacker to reveal the key. The proposed scheme is based on a Serial Interception Sequence Learning (SISL) task that was integrated into the Guitar Hero video game. While the secret can be used for authentication, the participant cannot be forced to reveal it since the user has no conscious knowledge of it. The authors performed a number of user studies using Amazon's Mechanical Turk to validate their scheme.
Although the proposed idea is very interesting, performance results show that their scheme is not practical for real-world applications: the registration phase takes more than 45 minutes for a single password and puts a high cognitive burden on users. In 2018, Joudaki et al. [132] showed how to improve the usability of system-assigned passphrases using implicit learning techniques. Their system utilizes contextual cueing in the form of a 2-dimensional spatial arrangement of distractors, and semantic priming by displaying semantically related prime words. In a user study, the authors were able to improve recall rates and login times for 4-word system-assigned passphrases.

Explicit vs. Implicit Memory

Explicit memory is a type of memory based on the intentional recollection of information with the purpose of consciously recalling it at a later time. We use this type of memory, also referred to as declarative memory, constantly in our daily life [84], for example, when we remember the time of our flight the next day, recall our address, or the string of characters that forms our password. In contrast, implicit memory relies on the unintentional recollection of information. In this case, we are not aware of the specific information we stored in our memory, but we can easily recall it.

This type of memory, also referred to as nondeclarative memory, can usually be observed in habitual behavior, such as riding a bicycle or playing an instrument [84]. The cognitive and neural mechanisms of explicit and implicit memory are not entirely understood [83]. Some studies suggest distinct mechanisms for explicit and implicit memory [166, 199], whereas others suggest a joint mechanism [16, 227]. One way to trigger implicit memory is an effect called priming [44, 163]. Priming occurs when previous exposure (conscious or unconscious) to a stimulus affects the performance of a subsequent task. For example, when a series of images with specific objects (primes) is presented to participants, their recognition performance (e. g., time and correctness) for a similar object in another or the same image presented later improves. Throughout this work, we use priming effects that are based on repetition and association. In a first enrollment phase, we present participants with an association between a thresholded Mooney image and the original image with a label. In a second authentication phase, we repeat the previously primed Mooney image (among other non-primed Mooney images) and measure the recognition performance for the repeated image. In some cases, priming has been shown to have long-lasting effects [44].

Mooney Images

A Mooney image is a thresholded, two-tone image showing a single object. This object is hard to recognize at first sight, with recognition times ranging from seconds to minutes [123]. In some cases, the recognition is abrupt and gives rise to a feeling of having solved a difficult problem (also known as the aha-feeling or Eureka effect) [141]. An example of a Mooney image is presented in Figure 3.1.¹

¹ To understand the effect of Mooney images, we suggest the reader spend some time trying to identify the object in Figure 3.1, and then look at Figure 3.7 on page 65.

This abrupt recognition can happen intrinsically [123], after the contour of the object is marked [222], or after presenting the subject with the original image [64, 116, 155].

Figure 3.1.: Example of a Mooney image.

Once a subject has seen the original grayscale image from which the Mooney image is generated, recognition is much accelerated. The value of using Mooney images for authentication is that they are very likely to trigger brain processes involved in implicit memory [12]. Implicit memory, as stated above, does not require direct conscious involvement but operates with less effort than explicit memory. Triggering implicit memory for authentication is therefore desirable, as it reduces the cognitive load for users. Priming is one way to trigger implicit memory, and Mooney images are an excellent stimulus for priming participants to specific concepts.

3.3. The MooneyAuth Scheme

In the next section, we describe the basic construction of our authentication scheme. We first describe how Mooney images are generated and then present the two phases of our protocol: enrollment and authentication.

Mooney Image Generation

In this work, we use an extended set of two-tone Mooney images that contains not only faces, as used originally [168], but also objects (e. g., animals, fruits, or tools) of different types [64, 123]. We selected our Mooney images from an automatically generated two-tone Mooney image database [123]. This database is based on a large number of images collected from the Web and was created following these steps:

1. Concrete nouns were selected from a linguistic database [251] (based on the directness of reference to sense experience, and capacity to arouse nonverbal images, cf. [123]). These words were used as search terms to automatically download images from an online image database.

2. The images were converted to grayscale and smoothed using a 2D smoothing operation with a Gaussian kernel (σ = 2 pixels and full width at half maximum (FWHM) = 5 pixels).

3. The images were resized to 350 × 350 pixels (subsampled with an appropriate scale factor). These parameters were selected to create Mooney images that are hard for a user to recognize at first sight [124]. The smoothing operation is particularly important for the results, as the thresholding algorithm applied in the next step operates better on smoothed than on unsmoothed images.

4. The smoothed and resized images were thresholded using a histogram-based thresholding algorithm (Otsu's method [174]) to generate the Mooney images. This thresholding method assumes that each image has two classes of pixels: a foreground and a background. For each possible threshold, the algorithm iteratively computes the separability of the two classes and converges when the maximum separability is reached.
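The generation steps above can be sketched in a few lines of Python, assuming NumPy and SciPy are available. The parameter values (σ = 2, 350 × 350 pixels, Otsu's method) follow steps 2-4; the function names and the hand-rolled Otsu routine are ours, not the database authors' code:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def otsu_threshold(gray):
    # Otsu's method: pick the threshold maximizing between-class variance
    # of the two assumed pixel classes (foreground and background).
    hist, _ = np.histogram(gray, bins=256, range=(0.0, 1.0))
    total = hist.sum()
    cum_w = np.cumsum(hist)                    # class-0 pixel counts
    cum_mu = np.cumsum(hist * np.arange(256))  # class-0 intensity mass
    mu_total = cum_mu[-1]
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = cum_w[t - 1], total - cum_w[t - 1]
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_mu[t - 1] / w0
        mu1 = (mu_total - cum_mu[t - 1]) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t / 255.0

def make_mooney(gray, sigma=2.0, size=350):
    # Steps 2-4: smooth, resize to size x size, then binarize with Otsu.
    smoothed = gaussian_filter(gray, sigma=sigma)
    scale = size / np.array(smoothed.shape, dtype=float)
    resized = zoom(smoothed, scale, order=1)
    return (resized > otsu_threshold(resized)).astype(np.uint8)
```

As noted above, smoothing before thresholding matters: Otsu's method separates the two pixel classes more cleanly on a smoothed histogram.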

Once the images are automatically downloaded and thresholded, a manual clean-up session by human subjects needs to be performed. This manual cleaning is necessary because some images automatically downloaded from the Web may not include the object that corresponds to the search word (e. g., cat) and hence need to be removed from the image set [61]. Subsequently, a selection of suitable Mooney images took place. While the original Mooney image database contained 330 images [123], for our experiments we considered images with a mean recognition time of 5 seconds or longer, resulting in 250 images. We further reduced this set to 120 images to obtain enough samples per image for an estimated 100 participants. A suitable Mooney image for the purpose of this application is an image that is difficult to recognize without a previous explicit presentation of the original image. At the same time, if the user has seen the original image, then the user should be able to correctly identify and label the hidden object. This procedure makes use of implicit memory, as the users first learn the association between the original image and the corresponding Mooney image without an explicit effort. As in the example of riding a bike, users usually do not remember the details of the original image but can name the hidden object in the Mooney image when they have previously seen the original image. For some images, the object shown in the Mooney image can be recognized by a non-primed user as well, but only after a relatively long time, whereas primed users will recognize it almost instantly. Therefore, within this work, we treat images with a recognition time beyond a set threshold as "likely not primed."

3.3.1. Description

We use a set of images I and their corresponding Mooney images.

Enrollment (Priming) Phase

• When a new user is enrolled, the server first assigns two disjoint subsets IP, IN ⊂ I with |IP| = |IN| = k to the user. IP reflects the primed images, IN the non-primed images.

• The subset IP is then used to prime the user. During this session, first a Mooney image, then the original image together with a label describing the object in the image is presented to the user. This procedure creates an association between the original image, the correct label of the image, and the corresponding Mooney image.

Authentication (Recall) Phase

• At the beginning of the authentication phase, the two subsets IP, IN for this user are retrieved from the database. The primed and non-primed Mooney images (IP ∪ IN) are then presented to the user in pseudo-randomized order. For each Mooney image presentation, the user is requested to type in the label of the object that the image contains, or to skip the image if the user is not able to recognize an object.

• Two metrics are then computed for each image:

(i) The correctness of the label is computed by comparing the typed label to a list of previously defined labels. This is achieved with a distance metric that measures how closely the label provided by the user matches the defined labels.

(ii) The recognition time, i. e., the time between displaying the image and the first keystroke. If the recognition time is longer than 20 seconds, we treat the image as if the label were incorrect, i. e., "likely non-primed." (We chose 20 seconds as a threshold because we expect the recognition of primed images to occur almost instantaneously, while still allowing the user to hesitate for a couple of seconds before starting to type the label. From our experience, recognition without being primed takes closer to a minute.)
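The 20-second rule can be expressed as a simple predicate; this is a sketch, and the names are ours:

```python
def effective_correct(label_correct: bool, recognition_time_s: float,
                      timeout_s: float = 20.0) -> bool:
    # Treat slow answers as "likely non-primed": a label only counts as
    # correct if the time to the first keystroke is within the threshold.
    return label_correct and recognition_time_s <= timeout_s
```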

• Authentication is based on the hypothesis that the user labels the primed images correctly more often (and faster) than those Mooney images that the user was not primed on. Sometimes primed images will be labeled incorrectly and vice versa. To tolerate some of these errors, we compute a score from the correct and incorrect labels and accept a user if the achieved score is above a specific threshold. There are several ways to perform this scoring. After the necessary terminology is introduced in the next section, we will discuss two scoring methods.

Terminology

This section introduces some of the notation used throughout this work. For one specific image with index i (which is displayed to the user), there are four possible events to consider: the image was or was not primed for the user (i. e., it is in IP or in IN), and the user provides a correct or an incorrect label for the image. We denote the probability that a (randomly chosen) user correctly labels a primed image with pi, and the probability that a user correctly labels a non-primed image with ni. We expect pi to be larger than ni, and we denote the difference by di := pi − ni. A positive di indicates that priming is working for this image. For reasonably well-working priming, images should have di > 0.5. (Those are called "ideal" in [62], which is slightly misleading, as "ideal" in a strict sense would mean di = 1.) In Section 3.4 we will see that 1/3 of the images have di > 0.5, i. e., we can identify a good number of images that work well for our authentication scheme.

3.3.2. Adversary Model

We consider a strong adversary that has detailed information about the image database I, but has no information about the subsets IP and IN .

1. We assume the adversary knows the correct labels for all images in I. This is a strong assumption, as a substantial fraction of the images is hard to label for humans who are not primed. The rationale is that a motivated attacker may spend substantial effort labeling the images, automated image search facilities might reveal the source image, or algorithmic classifiers may be able to label images. (We are unaware of any algorithm that can identify objects in Mooney images, but we cannot guarantee that such an algorithm does not exist; thus, we assume the attacker (artificially) knows all labels.)

2. We assume the adversary knows the probabilities ni and pi. While knowing the exact values requires substantial work by the attacker (basically replicating our study), obtaining approximations is relatively easy, and one should not rely on an assumed bound on their accuracy.

3. The adversary is free to answer the questions at any time, i. e., the answer times can be freely manipulated. (Even though the adversary cannot gain any advantage from this with the current prototype, this may be relevant for alternative implementations that more carefully take the answer time into account.)

Consequently, the security of the scheme relies solely on the partition of the shown images into primed and non-primed images, i. e., the sets IP and IN.

3.3.3. Static Scoring

One straightforward scoring strategy, used by Denning et al. [62], is what we call static scoring. We briefly describe static scoring here so that we can later compare our new scoring strategy, dynamic scoring, with it. There are four basic events that can occur for a single image:

• A primed image (with index i) is
  – labeled correctly: occurs with probability pi, assigned score sp,c;
  – labeled incorrectly: occurs with probability 1 − pi, assigned score sp,f.

• A non-primed image (with index i) is
  – labeled correctly: occurs with probability ni, assigned score sn,c;
  – labeled incorrectly: occurs with probability 1 − ni, assigned score sn,f.

Static scoring assigns the value 1 to the two "good" events, i. e., sp,c = 1, sn,f = 1, and 0 to the two "bad" events, sp,f = 0, sn,c = 0. In other words, this scoring strategy counts the "good" events that happened.

3.3.4. Dynamic Scoring

Static scoring does not differentiate between different probability values and thus loses information. For that reason, we propose an alternative method, dynamic scoring, which takes inspiration from the notion of self-information, or surprisal, a well-known concept in information theory [206]. Self-information denotes the information content associated with a single event, as opposed to entropy, which is a property of an entire distribution.

The self-information I(E*) of an event E* with probability pi is defined as

    I(E*) = − log(pi),    (3.1)

where we use logarithms to base e throughout this work. For dynamic scoring, we score each event with its surprisal, i. e.,

    sp,c = ln(pi),    sp,f = ln(1 − pi),
    sn,c = ln(ni),    sn,f = ln(1 − ni).

Note that we invert the sign of I(E*) so that a higher score refers to a better match, i. e., "less surprisal." Thus, the scores are negative. For an intuition on why dynamic scoring improves on static scoring, consider the event E* that the user wrongly labels a primed image. Let us assume a fixed "priming effect," i. e., the difference di = pi − ni = 0.5 is constant. We first consider an image with pi = 0.5 (and thus ni = 0), i. e., the primed image is labeled correctly and incorrectly with the same probability. In this case, the event E* carries little information, as it is a plausible outcome for a legitimate (primed) user. Second, we consider the case pi = 1 (and thus ni = 0.5); then every primed image will be labeled correctly by the legitimate (primed) user. Thus, if event E* happens, we can be certain that it is not the legitimate user participating in the protocol. Static scoring gives the same score 0 in both cases, while dynamic scoring gives a score of −∞ in the second case and thus indicates that this event can only be caused by an impostor.
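The two scoring strategies can be sketched as follows. This is a minimal Python illustration of the definitions above; the function and variable names are ours, not taken from the prototype:

```python
import math

def static_score(events):
    # events: list of (primed: bool, correct: bool), one entry per image.
    # Static scoring counts the two "good" events: a primed image labeled
    # correctly, or a non-primed image labeled incorrectly.
    return sum(1 for primed, correct in events if primed == correct)

def dynamic_score(events, p, n):
    # Dynamic scoring adds the (sign-inverted) surprisal of each event.
    # p[i] / n[i]: probability that a primed / non-primed user labels
    # image i correctly. Scores are negative; impossible events give -inf.
    score = 0.0
    for i, (primed, correct) in enumerate(events):
        if primed:
            prob = p[i] if correct else 1 - p[i]
        else:
            prob = n[i] if correct else 1 - n[i]
        score += math.log(prob) if prob > 0 else float("-inf")
    return score
```

For the pi = 1 example above, a primed image labeled incorrectly has probability 0 and contributes −∞, immediately flagging an impostor, while static scoring would merely add 0.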

Legitimate (Primed) User Score  For the legitimate user, the expected value of the score Si for a single image with index i equals

    E(Si) = 1/2 · (pi · ln(pi) + (1 − pi) · ln(1 − pi))
          + 1/2 · (ni · ln(ni) + (1 − ni) · ln(1 − ni)),    (3.2)

which equals the negative average of the Shannon entropies of Bernoulli-distributed random variables B1,pi and B1,ni with means pi and ni, respectively,

    E(Si) = −1/2 · (H(B1,pi) + H(B1,ni)),    (3.3)

where H(X) denotes the Shannon entropy, i. e., the expected value of the self-information, H(X) = E[I(X)].
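The relationship between the expected legitimate score and the Bernoulli entropies can be checked numerically. Below is a sketch (function names are ours); note that since the dynamic scores are negative, the expectation comes out as the negative of the average entropy:

```python
import math

def entropy(q):
    # Shannon entropy (in nats) of a Bernoulli(q) random variable.
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log(q) + (1 - q) * math.log(1 - q))

def expected_legit_score(p_i, n_i):
    # Eq. (3.2): expected per-image dynamic score of a legitimate
    # (primed) user, with the image primed or non-primed w.p. 1/2 each.
    def part(q):
        return q * math.log(q) + (1 - q) * math.log(1 - q) if 0 < q < 1 else 0.0
    return 0.5 * part(p_i) + 0.5 * part(n_i)
```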

Adversary Score  The adversary does not know whether the image was primed or not (this is what the security of the scheme rests on). Recall that we assume the adversary knows the labels and the probabilities pi and ni. We assume that the same number of images is primed and non-primed, so that a single random image is primed with probability 1/2. An adversary can decide to give the correct label or the wrong label, based on the known probabilities. If the adversary gives the correct label, the expected score (for that single image) will be

(sp,c + sn,c)/2 = (ln(pi) + ln(ni))/2, and if the adversary gives the incorrect label, the score will be

(sp,f + sn,f )/2 = (ln(1 − pi) + ln(1 − ni))/2.

So an adversary can calculate both values and pick the one that has a higher expected score.
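The adversary's per-image best response follows directly from these two expressions; a minimal sketch (names are ours):

```python
import math

def adversary_best_score(p_i, n_i):
    # The attacker does not know whether image i was primed, so each
    # option's expected score averages the primed and non-primed cases.
    ln = lambda x: math.log(x) if x > 0 else float("-inf")
    correct = (ln(p_i) + ln(n_i)) / 2        # expected score if answering correctly
    incorrect = (ln(1 - p_i) + ln(1 - n_i)) / 2  # ... if answering incorrectly
    return max(correct, incorrect)
```

For an image with a strong priming effect (pi high, ni low), both options score poorly for the attacker, which is exactly what makes such images useful for the scheme.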

3.4. Experiment 1: Pre-Study

A pre-study was performed to identify the critical parameters of the MooneyAuth scheme. The critical parameters are: (1) pi, the probability that a primed image is correctly labeled, and (2) ni, the probability that a non-primed image is labeled correctly. From these values, we can then derive (3) the size k of the sets IP and IN. These parameters were then used in the following experiments. While there is no ethics committee covering this type of study at the organizations involved in this research, there are strict laws and privacy regulations in place that must be obeyed. The experiments comply with these regulations. The data we collected about a participant cannot be linked back to a respondent, as the data is recorded in quite broad categories only. We did not collect any personal identifiers (IP address, device identifier, name, or similar) and did not use third-party components that may log such data. Before any data was recorded, the respondents were informed about the purpose of the experiment, how the contributed data would be managed, and that they could leave the experiment at any time.

3.4.1. Experimental Setup

We used a total of 120 images. For each participant, we used 10 primed images (|IP| = 10) and 20 non-primed images (|IN| = 20), i. e., an asymmetric distribution of primed and non-primed images, randomly selected from the 120 images. Choosing |IN| to be larger than |IP| helps to speed up the enrollment process. We developed a web application to conduct the experiment and measure the parameters pi and ni.

Enrollment (Priming) Phase  For the enrollment phase, a random subset of |IP| = 10 Mooney images was selected for each participant. Priming consisted of four steps:

(i) Introduction: The experiment started with a brief introduction and an explanation of how the experiment would proceed. We provided participants with a written explanation on the web page stating that this study was about an alternative web-based authentication scheme. Participants were informed about the two experimental phases (enrollment and authentication). They were further informed that they would be contacted via email, after the enrollment phase, to take part in the authentication phase. In addition, we provided a link with further information about Mooney images and implicit memory for interested participants.

(ii) Priming 1: For each image from the subset IP, we first presented the Mooney image for 3.5 seconds, then the original gray-scale image for another 3.5 seconds, and then again the Mooney image. To make the shift between the images more comprehensible, we gradually transitioned between them, i. e., fading out the first image while fading in the second. A label (a single English word) describing the hidden object in the image was displayed during the presentation of the original gray-scale image. We consider this approach a reasonable tradeoff between giving enough time to prime the image and limiting the time spent on the enrollment process.

(iii) Survey: After the first priming phase, the participants were asked to fill out a short questionnaire with basic questions such as age, field of work, gender, and opinion about the usability of current web authentication systems. This survey was intended to provide the participants with a short break before the second priming phase. In addition, we used the data collected from this survey for a statistical assessment of the participants.

(iv) Priming 2: In the second priming phase, we repeated the first priming phase for the same 10 images in a new pseudo-randomized order. Overall, users saw each Mooney image and its corresponding gray-scale image twice.

Authentication (Recall) Phase  Participants were invited via email to take part in the authentication phase. Each participant was provided with an individual link. In order to test how long the effects of priming and the authentication performance lasted, we performed the authentication phase in two separate groups at two different points in time (approximately two weeks apart). The authentication phase was composed of two main steps:

(i) Introduction: Before the authentication started, the task was described. Each participant was asked to view each Mooney image and to label the hidden object in the image as fast as possible. Participants were specifically asked to label each image using a single English word. Importantly, participants were asked to label the images regardless of what they had seen in the priming phase. If they could not identify the hidden object (possibly because the image was not used in the priming phase), they were asked to press the "I don't know" button. These instructions were provided in written form on the web page.

(ii) Authentication: For each returning participant, we selected a subset IN ⊂ I \ IP of size |IN| = 20. All Mooney images from the entire set IP ∪ IN were presented to the participant for labeling in random order. The interface used for this labeling task can be seen in Figure 3.2.

Figure 3.2.: Screenshot of the user interface during authentication.

Please note that the website as used in Experiment 1 had a bug, which led to a layout change caused by an information banner fading out during the labeling process in the authentication phase. This could have led participants to click the "I don't know" button accidentally instead of selecting the text entry field. It seems very unlikely that this bug affected the results: we did not receive any feedback from participants mentioning the issue, the misclick related to the fading could only have occurred in specific instances with slow Internet connections, and when we filtered out the participants that may have been affected based on the text input time, the overall results even slightly improved. Furthermore, we fixed this potential issue for Experiments 2 and 3, which report very similar results. This confirms that the bug had minimal or no influence on the results.

Implementation

To perform the experiments, we implemented a web application based on the Model, View, Controller (MVC) design pattern. The front-end (View) is based on the Bootstrap framework to accelerate development, the back-end (Model and Controller) is written in PHP, and data is stored in a MySQL database. To compute an edit distance during the authentication phase, we used a C implementation of the Damerau-Levenshtein algorithm, which was included as an external PHP module. Data was transmitted using transport layer security (TLS) to protect the privacy of the participants. To comply with the federal data protection act and privacy laws, users were informed about what data was collected and had to consent to the processing and storage of the data. Collected data was stored in encrypted form. We used the free web analytics software Piwik on the web server to derive statistics about the web application's usage. Every user was able to opt out, and the Do Not Track (DNT) HTTP header was honored.

3.4.2. Matching Labels

For each image, we created a small set of correct labels (typically two to five). All labels were converted to lowercase before comparison. We computed the Damerau-Levenshtein distance (string edit distance considering insertion, deletion, substitution, and transposition of adjacent letters) between the provided label and all given labels for that image. If one label had a distance less than or equal to 1, we marked the input as correct. This ensures that a variety of typical deviations is accepted, such as simple spelling errors, plural endings, and British/American spelling differences. Although the use of an open text field to provide answers has drawbacks with regard to entry time and error rate (especially on mobile devices), we decided not to use alternative methods such as selecting the correct answer from multiple choices. Previous work has shown that using multiple-choice answers leads to higher recognition rates for non-primed Mooney images [155]. First, the number of choices gives a lower bound for ni, and second, providing a choice of labels already exhibits priming effects.
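The matching rule can be sketched as follows. This uses the restricted Damerau-Levenshtein (optimal string alignment) distance, which covers the four operations named above; the function names are ours, and the actual prototype used a C implementation instead:

```python
def osa_distance(a, b):
    # Optimal string alignment distance: insertion, deletion,
    # substitution, and transposition of adjacent letters.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def label_matches(user_label, accepted_labels):
    # A label counts as correct if, after lowercasing, it is within
    # edit distance 1 of any of the accepted labels for the image.
    user = user_label.strip().lower()
    return any(osa_distance(user, ref.lower()) <= 1 for ref in accepted_labels)
```

The distance-1 tolerance is what lets "cats" match "cat" while still rejecting unrelated labels.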

3.4.3. User Participation

Participants were recruited via several email distribution lists and social media. To motivate participants, we raffled gift cards among those who finished both phases. For this experiment, 360 people started the enrollment phase. We sent out 323 invite emails for the authentication phase because 37 participants had not finished their enrollment (6 stopped at the introduction tutorial, 16 during the first priming, 6 during the survey, 9 during the second priming). Of those re-invited to the authentication phase, 230 finished, 6 started but did not finish, and 87 never tried to start the phase. A high dropout rate between enrollment and authentication was expected, as we had neither verified email addresses during the enrollment of Experiment 1 nor filtered obviously fake email addresses. Furthermore, misclassification of our invite email as spam or junk email might have occurred as well, which would explain the high number of users who did not even try to start the authentication phase. We collected, with users' consent, basic statistics such as country of origin and timing from the server logs, as well as the results from a survey; a summary of the statistics can be found in Appendix A.1. Please note that the reported numbers of the questionnaire in the appendix differ from the actual number of participants, as providing answers was not mandatory. About four out of five participants were between 20 and 30 years old, but all age groups were represented, and about four out of five were male. Most participants were from France and Germany due to the mailing lists we used, but people from over 30 countries participated. The majority of them liked using MooneyAuth. As a result of the sampling process, the participants in this and the following experiments are skewed towards young and male participants working in the sciences.
Previous work found no evidence indicating differences in recognition rates of Mooney images for gender or occupation of the primed participants [123, 124, 141, 155]. Unfor- tunately, there is no data available on priming effects in different age groups.

3.4.4. Results

We now present the results of our first experiment that helped us to estimate and test parameters (e. g., labeling).

Table 3.1.: Statistics on duration and average event probability.

                             Duration (in days)          Results
                           Median    Mean     SD     ∅ pi    ∅ ni    ∅ di
  Experiment 1                 20    18.0    8.8    0.648   0.219   0.429
  Experiment 2
    – Batch 1                   9     8.7    2.2    0.726   0.226   0.500
    – Batch 2                  25    25.1    4.2    0.586   0.215   0.371
    – Batch 3                 264   264.3    3.8    0.499   0.252   0.247
  Experiment 3                 21    19.9    4.7    0.642   0.203   0.439

Estimating pi, ni, and di

The main result of Experiment 1 is the estimation of the parameters pi and ni for the tested images. We find that the average difference d over the individual di = pi − ni, which is a good indicator for the overall performance, is 0.43. This is a fundamental improvement over the previous work [62], which achieved an average difference of d = 0.07. A more detailed view is given in the plot in Figure 3.3, which shows these parameters for each individual image. Each data point indicates one image, with the positions on the x-axis (y-axis) representing the empirical values for pi (ni). The plot shows our main result for the full dataset. To improve comparability with previous findings, we printed our results as an overlay on top of the plot from previous work [62]. One might object that the compared time frames are not the same (20 days and 28 days). However, we show in Section 3.5 that our Mooney image priming effect declines only moderately over time, allowing one to consider this a fair comparison. The (diagonal) lines are intended to help in the comparison of the results in the layered graph. The small solid line in the top of the graph (pi = 0.07 + ni) indicates the average difference d of the previous work [62]; the bold solid line in the bottom of the graph represents the average for our system (pi = 0.43 + ni), while the third solid line (pi = 0.5 + ni) indicates the line with di = 0.5.

Response Time

A summary is given in Table 3.2. The average time to label an image is around 10 seconds, with a high standard deviation (the maximum can exceed 10 minutes). Median values are more robust to outliers; they are closely grouped together (7.25 − 7.89 seconds). The only exception are the correctly labeled primed images, which were labeled substantially faster (a median of 6.30 seconds).

Strict vs. Relaxed Labeling

The way we test labels for correctness may obviously affect the measured values (and thus the performance of the scheme).

Figure 3.3.: Priming effect comparison: pi versus ni plot for our scheme after 20 days (points, blue) and previous work [62] after 28 days (stars, black).

Table 3.2.: Statistics on the timing (in seconds) for the image labeling.

  Event                 Median    Mean      SD
  Primed/correct          6.30    8.62    8.76
  Primed/false            7.25   10.24   12.54
  Non-primed/correct      7.89   11.17   24.83
  Non-primed/false        7.30    9.55   15.38

To evaluate whether the strict labeling, as described in Section 3.4.2, gives reasonable results, or whether more sophisticated measures (e. g., a lexical database that includes synsets to find related words) needed to be taken, we additionally assessed the quality of the comparison by hand. We tested all labels that were classified as “wrong” in the automatic test. In this manual “clean up session,” we added some labels to the set of accepted labels that were synonymous to existing labels but missed in the original creation of the label sets (e. g., we added “carafe” for an image showing a “pitcher”), some generalized terms (e. g., “animal” instead of “tiger”), and very similar species that were easy to confuse in the images (e. g., “bee” and “ant”). We grouped those labels as “similar,” and everything else as “wrong” as before. Contrary to our expectation, relaxed labeling slightly worsens the performance. While for strict labeling we have d = 0.43, for the relaxed labeling we have d = 0.42, a small but noticeable difference. This might be explained by the fact that some “similar” cases, in particular generalizations, are so general that they can be guessed (e. g., 77 of the 120 images were showing animals). Consequently, in all the following studies we used strict labeling, which in addition is computable without human intervention.

3.5. Experiment 2: Long-term Behavior Study

It is well known that, in principle, priming can persist over very long periods of time [44]. However, this has not been established for priming on Mooney images. In a second experiment, we measured the long-term effects of the priming.

3.5.1. Experimental Setup

This experiment is an extension of the first experiment. It extends Experiment 1 in two aspects: first, we divided the data gathered in Experiment 1 into two batches by the time between enrollment and authentication; second, we re-invited the participants of Experiment 1 after approximately 8.5 months and measured the decline of pi and ni over time. Therefore, we can compare three different batches (9, 25, and 264 days); details are listed in Table 3.1.

3.5.2. User Participation

People from the first batch were invited to authentication approximately 10 days after the first invitation to the enrollment, people from the second batch after about three and a half weeks. For each participant, we measured the time between priming and authentication. For the first batch, this difference has a median of 9 days; for the second batch, it has a median of 25 days. For the third batch, the median is 264 days. Further details on the participants are given in Appendix A.1.

3.5.3. Results

Detailed information is provided in Figure 3.4 and Table 3.1. We see a moderate decline of the priming effect over the first couple of weeks: the average value of the di is 0.500 for the first batch and 0.371 for the second batch, both for strict labeling. However, over longer times, the decline becomes much less pronounced; in fact, 264 days after the initial priming we still measure an average di of 0.247.

Figure 3.4.: Priming effect decline over time: pi versus ni plot for (a) the first batch after 9 days, (b) the second batch after 25 days, and (c) the third batch after 264 days. Each panel overlays the previous work's average (p = n + 0.07), our batch average (p = n + 0.500, 0.371, and 0.247, respectively), and the line with di = 0.5.

This is shown in more detail in Figure 3.4, which shows scatter plots for pi and ni, separated for the first batch (9 days), second batch (25 days), and third batch (264 days). We can additionally see that even in the third batch, there is a substantial number of images with a di greater than 0.5 (dashed line on the lower right).

Additionally, Table 3.1 shows the average pi and ni for each batch.

As expected, the values for ni do not vary over time (as no priming took place), but the values for pi do change.

3.6. Experiment 3: MooneyAuth Study

Based on the findings of our first study, we conducted a third study with the estimated parameters pi and ni. This experiment is designed as a realistic test of the overall performance of the authentication scheme.

3.6.1. Experimental Setup

The experimental setup was very similar to the setup of our first experiment as described in Section 3.4.1. The main difference is the reduced set of images. We used a subset of 20 images of the original image database, the same subset for all users, and computed a random partition of this reduced database for each participant. We selected those images with the best performance in the first experiment, i. e., those images with the highest values di = pi − ni. The selected images had values di between 0.79 and 0.57, on average 0.643. For each user, we used 10 primed and 10 non-primed images, i. e., |IP | = |IN | = 10. There were no changes to the enrollment phase. The authentication phase worked as before, but as we learned in Experiment 1 that strict labeling outperforms relaxed labeling, we only used the strict labeling. The goal of this experiment was to evaluate the suitability of the authentication method, including potential cross-contamination of the memory when several images with good priming effects were learned by a single user (an effect we could not study in the first experiment). The measured and presented statistics are tailored toward this goal.
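The per-user split into primed and non-primed halves can be sketched as follows; this is a minimal illustration of the random partition described above (|IP | = |IN | = 10 out of the shared 20-image set), with hypothetical names.

```python
import random

REDUCED_SET = list(range(20))  # indices of the 20 best-performing images

def partition_images(user_seed=None):
    """Randomly partition the shared image set into |IP| = |IN| = 10
    primed and non-primed images for one participant."""
    rng = random.Random(user_seed)
    images = REDUCED_SET[:]
    rng.shuffle(images)
    return set(images[:10]), set(images[10:])  # (IP, IN)
```

Because the partition is random per participant, there is no user-choice bias in the secret IP (a property discussed in Section 3.7).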

3.6.2. User Participation

Participants were recruited via email distribution lists. We took several measures to prevent participants of the first or second experiment from also participating in the third: we used (mostly) disjoint mailing lists, asked users in the questionnaire whether they had participated before, filtered out all login email addresses that had participated in the first experiment, and placed a cookie that allowed us to detect multiple participations. Participation in both studies had to be prevented because the images in the third experiment are a subset of those in the first, so being primed on some images in the first experiment could disturb the results of the third. However, the effect of duplicate participants is small, as the overlap of primed images and the 20 images in the third experiment is less than two on average. Again, we raffled gift cards among those who finished both phases. About half of the 70 participants in this experiment were between 20 and 30 years old, but all age groups were represented. About 3 out of 4 were male. Most participants were from France and Germany because of the mailing lists we used. The results of the questionnaire are shown in Appendix A.2.

3.6.3. Results

The main result of this experiment is a precise estimation of the performance of the proposed authentication scheme. In addition, we compare the static and the dynamic scoring strategy.

Figure 3.5.: Distribution of measured (bars, dark blue) and estimated (solid line, light blue) scores for (a) static scoring and (b) dynamic scoring.

Table 3.3.: Performance of the scheme for |IP | = |IN | = 10.

               Target    Score       Resulting FRR
               FAR       Thres.     Sim.       Meas.
  Static       0.1 %        17     48.9 %      76 %
  scoring      0.5 %        16     27.6 %      67 %
               1.0 %        15     13.0 %      56 %
  Dynamic      0.1 %       -16      0.30 %     2.86 %
  scoring      0.5 %       -16      0.30 %     2.86 %
               1.0 %       -16      0.30 %     2.86 %

Performance

The complete graphs illustrating the distribution of scores are shown in Figure 3.5, both for dynamic scoring and static scoring. The x-axes give the scores assigned to a run (rounded to integers if necessary), and the y-axes the relative frequency. The dark blue bars give the measured distribution determined in the third experiment, while the light blue solid line gives the estimated distribution of score values for a legitimate user (see Section 3.6.3, using the estimated parameters pi and ni from above). The red dashed line gives the distribution of an impostor using the optimal strategy as described in Section 3.3.4. We measure the performance of the scheme in terms of the false acceptance and false reject rates. The false acceptance rate (FAR) is an indicator for the security of the protocol; it gives the likelihood that an impostor is (falsely) classified as a legitimate user, i. e., “accepted.” For fallback authentication schemes (which can apply strict rate-limiting and other techniques to limit the capabilities of an impostor), FARs in the range of 0.001 to 0.01 can be considered acceptable (Denning et al. [62] considered a FAR of 0.005). For a given FAR, we can determine the threshold that meets this FAR, which provides us with the false reject rate (FRR), i. e., the probability that a legitimate user is denied access to the system. Denning et al. [62] considered an FRR of 0.025 to be acceptable.
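The relationship between target FAR, accept threshold, and resulting FRR can be illustrated with a short sketch: given empirical score samples for legitimate users and for impostors, pick the lowest threshold whose empirical FAR stays below the target and read off the FRR at that threshold. This is an illustrative computation with hypothetical names, not the evaluation code used in the experiments.

```python
def frr_at_target_far(user_scores, impostor_scores, target_far):
    """Return (threshold, far, frr): the lowest threshold t (accept iff
    score >= t) whose empirical FAR is <= target_far, and the FRR there."""
    candidates = sorted(set(user_scores) | set(impostor_scores))
    candidates.append(candidates[-1] + 1)  # threshold that rejects everything
    for t in candidates:
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if far <= target_far:
            frr = sum(s < t for s in user_scores) / len(user_scores)
            return t, far, frr
```

With toy data — user scores [5..10] and impostor scores [0..5] — a target FAR of 10 % forces the threshold up to 6, rejecting the one legitimate user who scored 5.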

Table 3.4.: The overall timing (in seconds) for Experiment 3.

                        Mean   Median   Max   Min    SD    Var
  Enroll - Tutorial       24       23    58    13     8      58
  Enroll - Priming 1     113      107   170    99    15     221
  Enroll - Survey         51       47   143    23    20     390
  Enroll - Priming 2     105      101   186    94    14     196
  Enroll - Total         294  (5 minutes)
  Auth - Tutorial         28       24    97     3    16     262
  Auth - Labeling        177      163   472    70    75    5597
  Auth - Total           206  (3.5 minutes)

Figure 3.5 and Table 3.3 depict the basic performance of the proposed scheme. We can see that for the dynamic scoring, the scheme achieves simulated FRRs of 0.3 % for FARs between 1 % and 0.1 %, and measured FRRs of 2.86 %. (While it may be surprising that the measured FRRs are higher than the simulated FRRs, note that only two participants reached a dynamic score of −16, and they are solely responsible for the relatively high FRR.) Still, an FRR of 2.86 % is well within the bounds of previous work. Some statistics about the duration of the experiments and the properties of the used Mooney images are summarized in Table 3.1. Statistics about the duration of each phase are given in Table 3.4. For example, it shows that the enrollment phase took 5.0 min on average (including tutorial and questionnaire), and the authentication phase 3.5 min.

The Simulation

Besides the measured data from the user experiment, we use simulated numbers to provide additional insights. These simulations are based on the estimated parameters pi, ni determined in the first experiment, where we selected the 20 best images and used those pi, ni.

We simulated 100 000 authentication attempts as follows:

• Choose random subsets IP and IN from the available images.

• Simulate a user (primed on IP) logging in, based on the collected probabilities pi, ni, and compute the score.

• Simulate an optimal adversary (as defined above), and compute the score.

An interesting observation is that the simulation based on the probability values from the previous experiment is relatively accurate. We can see that the shape of the simulated distribution (light blue solid line) closely resembles the shape of the measured distribution (dark blue bars). The only substantial difference is that the measured distribution is shifted towards lower values, i. e., the mean changes from −8.45 to −9.6 (for the dynamic scoring), and from 16.5 to 14.4 (for the static scoring). In other words, the performance we measured is slightly worse than predicted by the simulation, which can have several plausible reasons: (i) The time difference between enrollment and authentication for Experiment 1 (when estimating the di) was slightly shorter than for Experiment 3 (mean duration of approx. 18 days vs. 20 days). (ii) Being primed on several images with good priming properties in parallel may cross-contaminate the participant's memory and thus worsen the overall recall. However, from this experiment, we see that even if this effect plays a role, its influence is relatively small. We can also see that the dynamic scoring substantially outperforms the static scoring. Table 3.3 lists, for several target FARs, the resulting FRRs, both for dynamic and static scoring. We see that for all listed FARs, the resulting FRRs are substantially better with dynamic scoring, both for the measured data and for the simulated values.
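The Monte Carlo structure of this simulation can be sketched as follows. For illustration, the score is a simplified count (+1 per correctly labeled primed image, −1 per correctly labeled non-primed image); the actual static and dynamic scoring functions of Section 3.3 differ, but the sampling structure is the same. The names and the scoring simplification are ours.

```python
import random

def simulate_attempt(p, n, rng):
    """One simulated login of a user primed on IP: image i in IP is labeled
    correctly with probability p[i], image j in IN with probability n[j].
    Simplified score: +1 per correct primed, -1 per correct non-primed."""
    correct_primed = sum(rng.random() < pi for pi in p)
    correct_nonprimed = sum(rng.random() < ni for ni in n)
    return correct_primed - correct_nonprimed

def simulate(p, n, runs=100_000, seed=0):
    """Distribution of legitimate-user scores over many simulated logins."""
    rng = random.Random(seed)
    return [simulate_attempt(p, n, rng) for _ in range(runs)]

# Under this simplified scoring, an adversary who knows every label answers
# all images correctly and always scores |IP| - |IN| = 0, which is why the
# non-primed images neutralize a label-knowing attacker.
```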

3.7. Security Analysis and Discussion

In the proposed authentication scheme, the priming effect of Mooney images is used to help users memorize their authentication secret, using implicit instead of explicit memory. However, it is important to note that, similarly to graphical authentication schemes based on explicit memory, the security of this scheme relies on the subset IP only and does not depend on the properties of the Mooney images. Our security model considers a powerful attacker who (artificially) knows the solution (label) for every image and still fails to authenticate. (There is an indirect dependency, however, as a weak priming effect will typically be compensated by a lower threshold to control the false acceptance rate and thus make attacks easier.) Once a catalog of images with good priming properties is used (e. g., di > 0.5, see Figure 3.6), the scheme is resilient to rate-limited guessing attacks. Note that all users of the authenticating service share the same set I of such images (IP , IN ⊂ I). Thus, selecting the images is a one-time task.

Figure 3.6.: Example images with longtime (264 days) low (top, red) and high (bottom, green) priming effects.

The secret used for authentication is the set of primed images IP , which is a subset of all images presented to the user in the authentication phase, IP ∪ IN . Effectively, IP is a randomly chosen subset, so there is no bias of user choice involved (in contrast to passwords and many other schemes), which facilitates the security analysis. The authentication score computed by our scheme is based not only on the primed images the user can identify, but also on the non-primed images that an adversary is not able to determine. As a consequence, an adversary that can decode Mooney images without going through the priming phase has no advantage in breaking the security of the proposed scheme, as long as it is unknown on which images the victim was primed. Also, a user connecting to the server under a false username and obtaining the presented images does not affect the security. To avoid intersection attacks, it is mandatory that the same set of non-primed images IN is presented at each login attempt.

Just like most other schemes, our scheme is susceptible to phishing attacks: An attacker can query the authentication server for the images, present them to the legitimate user, record the timings, and replay those to the server. All standard measures to prevent phishing attacks apply here as well. Furthermore, an active phishing attack is required, i. e., the attacker needs to query the server to get the correct set of images, which may be detected on the server's side. While passwords can be stored (relatively) securely on the login server using iterated password hashes and random salts to prevent guessing attacks, this is not feasible for a large range of fallback authentication methods; e. g., for knowledge questions, approximate answers should also be counted, which typically requires storing the solution in plaintext. Similarly, there is no (obvious) way to store the secret information (i. e., the indices describing the set IP) for our scheme in protected form. Guessing attacks against our scheme can be avoided just as for other fallback authentication schemes, for example, by putting substantial limits on the guessing rate (e. g., one attempt per day) and adding a lock-out period: if account recovery is initiated, the original owner is notified, e. g., via the stored email address, and has 24 hours to abort the recovery if it was started by somebody else. All of these measures are implemented for other schemes as well. We did not study interference properties, i. e., how well a user can remember a secret when using the system on several servers in parallel (with different sets of primed images IP). However, most smaller websites use fallback authentication by email or use a single sign-on solution. Thus, a more involved fallback authentication scheme like ours will mostly be of interest to large email providers or social networking sites, and a user will only use very few parallel instances.
Our experiments were conducted on a limited set of 20 Mooney images, out of which 10 were primed. This raises the question of whether retrieval of primed Mooney images gets harder when more images are primed. An experiment by Ludmer et al. [155] used Mooney images to explore memory retrieval in the human brain by priming users with 30 randomly selected images. They have shown that if the solution to a primed Mooney image is retained one week after priming, it is essentially retained to the same degree three weeks afterward. This suggests that even when a larger set of Mooney images is used, retrieval of primed Mooney images is likely to persist over longer periods of time.

3.8. Conclusion

Authentication schemes based on implicit memory relieve the user of the burden of actively remembering a secret (such as a complicated password). This work presents a new implicit memory-based authentication scheme that significantly improves on previous work by using a more efficient imprinting mechanism, namely Mooney images, and by optimizing the scoring mechanism. We implemented a comprehensive prototype and analyzed the performance and security of our proposal in a series of experiments. Results are promising and show that our scheme is particularly suited for applications where timing is not overly critical, such as fallback authentication.

Figure 3.7.: Mooney image and the corresponding original image.

Why judge when it’s only a matter of perception.

— Haresh Sippy

4. Password Strength

Contents

4.1 Introduction
4.2 Password Strength Meters
    4.2.1 Approximating Strength
    4.2.2 Measuring Accuracy
4.3 Evaluated Password Datasets
    4.3.1 Influencing Factors
    4.3.2 Datasets
    4.3.3 Reference
4.4 Similarity Measures
    4.4.1 Test Cases
    4.4.2 Testing Different Metrics
    4.4.3 Reference Validation
    4.4.4 Recommendation
    4.4.5 Sampling
4.5 Evaluation
    4.5.1 Use Cases
    4.5.2 Selected Meters
    4.5.3 Querying Meters
4.6 Results
    4.6.1 Overall Performance
    4.6.2 Effect of Quantization
    4.6.3 Performance Over Time
    4.6.4 Recent Proposals and Future Directions
    4.6.5 Limitations
4.7 Conclusion

4.1. Introduction

Password strength meters (PSMs) are designed to help with one of the central problems of passwords, namely weak user-chosen passwords. From leaked password lists we learn that up to 20 % of passwords are covered by a list of only 5 000 common passwords [235]. A password strength meter displays an estimation of the strength of a password as it is chosen by the user, and either helps or forces the user to pick passwords that are strong enough to provide an acceptable level of security. The accuracy with which a PSM measures the actual strength of passwords is crucial; as people are known to be influenced by PSMs [231], or even forced to comply, an inaccurate PSM can do more harm than good. If weak passwords are rated strong, users might end up choosing such a password, actually harming security; similarly, if strong passwords are rated weak, the meter drives people away from those strong passwords. Traditionally, ad hoc approaches such as counts of lower- and uppercase characters, digits, and symbols (LUDS) have been used to measure the strength of passwords. Although it is well known that these do not accurately capture password strength [246], they are still used in practice. Nowadays, more sound constructions for PSMs based on precise models capturing user choice have been proposed, e. g., based on Markov models [43], probabilistic context-free grammars [115, 239], neural networks [162, 229], and others [248]. Surprisingly, very little work has been performed on a fair comparison of these different proposals, and it remains unclear which password meter is best suited for the task of estimating password strength. Even worse, we lack consensus on how to determine the accuracy of strength meters, with the techniques used ranging from Spearman and Kendall correlation to ad hoc measures.
In the following, we propose a sound methodology for measuring the accuracy of PSMs, based on a clear set of requirements and a careful selection of a metric, and we use this metric to compare a variety of different meters.

Contributions

In more detail, our contributions are:

1. We discuss properties an accurate strength meter needs to fulfill, and create a number of test cases from these requirements.

2. We report tests of 19 candidate measures (and have tested several more) from a wide range of types and select good metrics for the accuracy of strength meters.

3. We address the challenge of estimating the accuracy from limited datasets and show that meters can be reasonably approximated with a small number of random samples.

4. We provide an extensive overview of the current state of the art of strength meters.

5. We use the derived measures and provide a comparison of a broad selection of 45 password meters in 81 variations, ranging from academic proposals over meters deployed in password managers and operating systems to meters in practical use on websites.

More important than the results of this work are the methods we developed. They provide means to select suitable similarity metrics that match the requirements of specific use cases. We hope that this will foster the future development of strength meters and simplify the selection process for service operators.

The contributions of this work resulted from a collaboration with Markus Dürmuth.

4.2. Password Strength Meters

Next, we discuss password strength meters and how to measure their accuracy.

4.2.1. Approximating Strength

“Weak” passwords such as passw0rd or abc123 are not insecure per se (e. g., based on some “magical” property). They are insecure because they are commonly chosen by humans, and thus an adversary trying to guess passwords will guess those common passwords early in an attack. (Similar observations have been made by Wang et al. [239].) An ideal strength meter thus assigns each password its likelihood, e. g., approximated by the relative frequency from a large enough password corpus. However, this straightforward idea is hard to use in practice: The relative frequencies can in principle be accurately approximated for “relatively likely” passwords (cf. [24]), e. g., those that are particularly relevant for online guessing attacks. Estimating frequencies for less likely passwords, relevant for offline guessing attacks, is next to impossible due to the amount of data required. Therefore, practical strength meters should aim at approximating the true strength using compactly representable functions. The traditional LUDS meter allows for a very compact representation (of a few bytes), at the cost of limited accuracy [246], while other approaches based on Markov models [43] or PCFGs [239, 247] have been demonstrated to be more accurate, at the expense of increased storage size. For the remainder of this work, we assume a PSM is a mechanism f that takes as input a password, i. e., a string of characters over an alphabet Σ, and outputs a score or strength value: f : Σ∗ → R. We assume the score to be a real-valued number. Some meters aim at providing an estimate for the probability of a password (e. g., [43, 115, 239]), i. e., values are in the interval [0, 1]; others aim at estimating the guess number (e. g., [162, 229, 248]), i. e., are integer-valued. Most meters deployed at websites output a textual description of the password strength, e. g., [Too short, Weak, Fair, Good, Strong] for Google's PSM; in this case, we convert these textual descriptions to natural numbers between 1 and the number of classes. PSMs can be either informative, when they are used merely to inform the user about the strength of the password (nudging the user towards more secure choices), or enforcing, when passwords that are considered weak are not accepted by the system. Most deployed systems we analyzed actually fall in the middle, enforcing a certain minimal strength and informing (and nudging) the user towards more secure passwords beyond those minimal requirements.
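Treating all these output types as a single real-valued score f(pw) can be sketched as follows. The textual scale is the Google one quoted in the text, and the conversion of textual classes to natural numbers between 1 and the number of classes follows the rule stated above; the function name is ours.

```python
# Textual scale of Google's PSM, as quoted in the text.
GOOGLE_SCALE = ["Too short", "Weak", "Fair", "Good", "Strong"]

def as_score(meter_output, scale=GOOGLE_SCALE):
    """Normalize a PSM output to a real number: probabilities and guess
    numbers pass through unchanged, textual classes are mapped to their
    1-based position on the meter's scale."""
    if isinstance(meter_output, str):
        return float(scale.index(meter_output) + 1)
    return float(meter_output)
```

With this normalization, a probability-based meter, a guess-number-based meter, and a textual meter can all be compared against the same reference.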

4.2.2. Measuring Accuracy

Accuracy is one of the central factors of PSMs, and several PSMs have been proposed over the past few years. However, little work has been done towards a fair comparison of different meters, and even on the question of what constitutes a fair comparison, there is no agreement. The preferred method to measure the accuracy of a strength meter is to compare it to an ideal reference, measuring the similarity between the reference and the meter output. This idea is based on the intuition that weak passwords are those that are common and have been used before [43, 239, 248]. However, the techniques for comparing reference and tested meters in previous work were ad hoc and ranged from measures counting overestimation errors to rank correlation metrics. In the following, we systematically study which measures are most suited for performing this comparison. Specifically, we show that previously used similarity measures have significant shortcomings limiting their validity and usefulness. Before discussing specific similarity measures, it is instructive to consider properties that these measures should fulfill. To this goal, we specify for which differences between the meter and the reference a measure should yield high and low similarity. There is no absolute truth in which requirements are desirable, and for specific applications, there may be additional requirements. We provide a list of requirements based on extensive experience with passwords and PSMs, and believe it captures requirements suitable for common online use. By explicitly stating the desired requirements, the selection process becomes much more transparent, and we will see that most previously used similarity measures fail to fulfill even some fundamental requirements, highlighting the importance of a systematic treatment. (Specific test cases derived from these abstract requirements are provided in the following section.)

1. Tolerance to Monotonic Transformations: The output scores given by strength meters are often not directly comparable. A score can be based on the number of guessing attempts, on different forms of entropy, on arbitrary scales like [Weak, Fair, Strong] vs. [Terrible, Weak, Good, Excellent, Fantastic], and on other home-brewed measures of strength. Assuming that the underlying sorting of passwords is identical, these differences can be modeled as monotone functions. A good similarity measure should tolerate such monotone transformations and assign high similarity to such transformed strength estimations.

2. Tolerance to Quantization: A particular case of monotone transformations is quantization, e. g., strength meters that divide the reported values into a small number of bins, often three to five. A good similarity measure should tolerate such quantization. Note that with a very low number of bins, e. g., 2 bins [reject, accept], the comparison becomes less meaningful, and scores will typically be low, even for otherwise reasonable measures. In the case of an enforcing PSM, the strength policy becomes particularly interesting: all passwords that are not accepted effectively end up in the lowest bin (commonly called "Too short," "Too easily guessed," or "Too weak"). The stricter the policy is set, the larger this lowest bin gets, reducing the overall precision. A good similarity measure should tolerate moderately large reject-bins.

3. Tolerance to Noise: Small deviations in the strength estimations are frequent, based on slight differences in the used models, the training data, or other factors. A good measure should tolerate such minor deviations.

4. Sensitivity to Large Errors: While small differences do not have a significant effect on the usefulness of a strength meter, large deviations, in particular overestimates, can cause harm. A good measure needs to be sensitive to large variations in strength, even for a small set of passwords.

5. Approximation Precision: A similarity score is easier to compute, and thus more useful, if it does not require full knowledge of the meter. Specifically, strength meters deployed on websites limit the number of samples one can obtain, either through the slowness of the process or through more specific restrictions, like the number of allowed queries. Thus, a good measure should be easy to approximate from a limited number of samples.

4.3. Evaluated Password Datasets

Next, we discuss factors that influence password choice, introduce the datasets that we will use to evaluate a broad selection of PSMs, and describe our reference password distribution which we will use for comparing different accuracy metrics.

4.3.1. Influencing Factors

When evaluating strength meters, one must consider that password strength is contextual and influenced by many factors [11, 75, 178].

1. Password leaks originating from a single web service follow a distribution that is partially specific to this site. For example, the password "linkedin" appears in the LinkedIn leak with a probability of 0.12 %, but does not appear in the RockYou leak. In contrast, the password "rockyou" appears with a probability of 0.06 % in the RockYou leak, but only with a probability of 0.000028 % in the LinkedIn leak. Often, passwords from a service reflect the category of the service and include the name or semantic theme of the service [244].

2. Website administrators often enforce password composition policies [143] (e. g., requiring the password to contain a digit or to be of a certain length) that force users into choosing different passwords which are compliant with the respective policy.

3. Florêncio et al. showed that avoiding weak passwords or password reuse entirely becomes impossible with a growing number of accounts. If no password manager is used, grouping accounts and reusing passwords becomes the only viable solution [77]. Given a fixed time-effort budget [14], it is sub-optimal to spend the same amount of effort on all accounts. Florêncio et al. [76] proposed to classify accounts into categories from "don't-care" to "ultra-sensitive" based on, e. g., the consequences of account compromise.

4. A password strength meter might be tuned and more intensively tested with a specific password leak. Specifically, academic meter proposals, which are based on probabilistic password models, require a lot of real-world password data. Some strength meters even include small blacklists of very common passwords.

While it is difficult to avoid all factors, we try to minimize their influence by testing three very different datasets in our experiments that differ by service, policy, and leak date. We selected the datasets to allow easy verification and generate reproducible results based on publicly available data. Our findings are limited to predominantly English-speaking users and their password preferences.

To reason about the strength of a password distribution considering a best-case attacker, we provide the min-entropy H∞ as a lower bound and the partial guessing entropy (α-guesswork) Gα for α = 0.25, as described by Bonneau [24].
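For reference, the two estimates can be written out explicitly; the following restates Bonneau's definitions [24] in our own transcription, with p₁ ≥ p₂ ≥ … the password probabilities in descending order:

```latex
% Min-entropy: a lower bound that only reflects the single most
% probable password p_1 (a best-case attacker's first guess).
H_\infty = -\log_2 p_1

% alpha-guesswork: expected guessing effort to compromise a fraction
% alpha of accounts. \mu_\alpha is the smallest number of guesses
% covering probability mass alpha, and \lambda_{\mu_\alpha} the mass
% actually covered:
\mu_\alpha = \min\Big\{ j : \sum_{i=1}^{j} p_i \ge \alpha \Big\},
\qquad
\lambda_{\mu_\alpha} = \sum_{i=1}^{\mu_\alpha} p_i,
\qquad
G_\alpha = \big(1 - \lambda_{\mu_\alpha}\big)\,\mu_\alpha
         + \sum_{i=1}^{\mu_\alpha} i\, p_i .
```

The tilde-marked variants reported in Table 4.1 are these quantities converted to bits (Bonneau's effective key-length conversion).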

4.3.2. Datasets

An overview of the datasets, which are described in the following, is given in Table 4.1:

Table 4.1.: Evaluated Datasets²

Name        Year  Service         Policy         H̃∞    G̃0.25
RockYou     2009  Social Games    5+             6.81  15.89
LinkedIn    2012  Social Network  6+             7.27  19.08
000Webhost  2016  Web Hosting    6+ [a-Z][0-9]   9.26  20.69

• RockYou: This is a well-established leak used extensively in previous work: 32 million plaintext passwords leaked from the RockYou web service in December 2009 via an SQL injection attack, which means that no bias was introduced. We include RockYou in our evaluation because of its popularity in the community. However, its passwords should be considered relatively weak (G̃0.25 = 16 bits).

• LinkedIn: The social networking website LinkedIn was hacked in June 2012. The full leak became public in late 2016. The leak contains an SQL database dump that includes approx. 163 million unsalted SHA-1 hashes. In the following, we use a 98.68 % recovered plaintext version resulting in approx. 161 million plaintext passwords. We expect the bias introduced by ignoring 1.32 % of (presumably strong) passwords to be low, as we are mostly interested in passwords whose probability can reasonably be approximated by their count. We include LinkedIn in our evaluation because we consider its passwords to be a reasonable candidate for medium-strong passwords (G̃0.25 = 19 bits).

² We list the active policy at the time when the data breach happened.

• 000Webhost: Leaked from a free web space provider for PHP and MySQL applications. The data breach became public in October 2015. The leak contains 15 million plaintext passwords. Based on the official statement, a hacker breached the server by exploiting a bug in an outdated PHP version, which again means that no bias was introduced. We include 000Webhost in our evaluation because of its enforcement of a lowercase-and-digits password composition policy, which results in a different password distribution containing relatively strong passwords (G̃0.25 = 21 bits).

To avoid processing errors in later steps (querying online meters), we cleaned the leaks by removing all passwords that were longer than 256 characters or contained non-ASCII characters. This cleaning step removed 0.06 %, 0.09 %, and 0.19 % of the passwords from RockYou, LinkedIn, and 000Webhost, respectively.
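The cleaning step can be sketched as follows (a Python sketch; the thesis does not specify its tooling, so the function and its name are our own):

```python
# Drop passwords longer than 256 characters or containing non-ASCII
# characters, mirroring the cleaning step described in the text.
def clean(passwords):
    return [pw for pw in passwords
            if len(pw) <= 256 and all(ord(ch) < 128 for ch in pw)]
```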

4.3.3. Reference

To reason about various candidate metrics that might be suitable to measure the accuracy of a strength meter, we created a fourth dataset. The dataset only contains the frequent passwords of the LinkedIn leak. We have chosen LinkedIn because it was the largest leak at our disposal. As has been shown by Bonneau [24] and Wang et al. [239], approximating strength for unlikely passwords is error-prone. To avoid such approximation errors, we limited the LinkedIn file to ASCII passwords that occur 10 or more times (count ≥ 10), which resulted in a reference password file containing approx. 1 million unique passwords. We use the dataset as a) ideal reference and as b) strength meter output. For this, we divided the set into two disjoint sets REF-A and REF-B of about equal size by random sampling. In the following experiments, REF-A will be used as the reference, whereas REF-B will be used as a basis for the test cases and thus simulates the meter output. The experiments described in Section 4.4.2 operate on the count values; if the password abc123 occurs 36,482 times in LinkedIn, then REF-A and REF-B each include a count value of ∼18,240 for this password. In Section 4.4.3, we report on the reliability of this reference by performing additional tests that include uncommon passwords, as well as the other leaks (RockYou and 000Webhost).
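The REF-A/REF-B split can be sketched as follows; we assume a {password: count} dictionary and randomly assign each occurrence to one of the two halves, so each password ends up with about half its count in each set (function and names are our own):

```python
import random

# Split a {password: count} multiset into two disjoint halves, REF-A
# and REF-B, by randomly assigning each occurrence to one of the sets.
def split_reference(counts, seed=42):
    rng = random.Random(seed)
    ref_a, ref_b = {}, {}
    for pw, n in counts.items():
        a = sum(rng.random() < 0.5 for _ in range(n))  # binomial split
        if a:
            ref_a[pw] = a
        if n - a:
            ref_b[pw] = n - a
    return ref_a, ref_b
```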

4.4. Similarity Measures

Next, we describe the process of selecting a suitable similarity metric.

4.4.1. Test Cases

Monotonic Transformations

We prepared several cases to test a measure's tolerance to monotonic transformations:

• DOUBLE: For this test case we double the count values in REF-B. This represents the case that two strength meters use a different scale (e. g., one sets the cutoff for the strong class at a different threshold than the other). This would naturally occur when two strength meters use the expected time to crack a password (such as zxcvbn [248]) but assume different speeds of the cracking hardware.

• HALF: For this test case, we halve the count values in REF-B before applying the measure to calculate the similarity with the ideal strength meter.

• LOG: For this test case, we take the logarithm to base 2 of the count values in REF-B. This occurs naturally when one strength meter reports strength in "expected number of guesses," and another in "bits of entropy."

• SQR/SQRT: Further, we added test cases by applying the square operation and the square root to REF-B, respectively.
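The monotonic-transformation test cases above can be generated as in the following Python sketch (the dictionary representation and names are our assumptions, not from the thesis):

```python
import math

# Each test case maps the count values in REF-B through a monotone
# function, as described in the text.
TRANSFORMS = {
    "DOUBLE": lambda c: 2 * c,
    "HALF":   lambda c: c / 2,
    "LOG":    lambda c: math.log2(c),
    "SQR":    lambda c: c * c,
    "SQRT":   lambda c: math.sqrt(c),
}

def apply_transform(name, ref_b):
    """Apply one named monotone transform to a {password: count} dict."""
    f = TRANSFORMS[name]
    return {pw: f(count) for pw, count in ref_b.items()}
```

Since all five functions are monotone, they preserve the relative ordering of passwords, which is exactly what requirement 1 asks a good similarity measure to tolerate.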

Quantization

A substantial fraction of online meters uses binned output. Thus, such test cases are highly relevant in practice.

• Q4-equi/Q10-equi: For this test case, we use quantization into four/ten bins, with about the same number of passwords per bin (counting with multiplicities).

• Q4-alt/Q10-alt: Similar to the test case above, we use quantization into four/ten bins, but in this case, splitting into bins of equal size based on unique passwords (without counting multiplicities).
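The two quantization variants can be sketched as follows (our own sketch; bin 1 holds the weakest, i. e., most frequent, passwords, and the mass per bin is measured either in occurrences or in unique passwords):

```python
# Quantize a {password: count} dict into `bins` bins of roughly equal
# mass: by occurrences ("equi", with_multiplicities=True) or by unique
# passwords ("alt", with_multiplicities=False).
def quantize(counts, bins, with_multiplicities):
    items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(c for _, c in items) if with_multiplicities else len(items)
    out, mass, b = {}, 0, 1
    for pw, c in items:
        out[pw] = b
        mass += c if with_multiplicities else 1
        if b < bins and mass >= b * total / bins:
            b += 1  # cumulative mass passed the next bin boundary
    return out
```

With a skewed distribution, the two variants differ: counting multiplicities, a single very common password can fill the first bin on its own.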

Disturbances

We have a number of test cases testing the tolerance and sensitivity to disturbances in the data.

• RAND: We use random values drawn from a uniform distribution between 1 and the maximum count value. This test case can be seen as a calibration for low similarity, as any match only happens by chance.

• ADD-RAND: We add small random disturbances to REF-B drawn according to a uniform distribution between 1 and the respective count of a password.

• INV-WEAK-5: We modify the weakest 5 % of passwords (with multiplicities), by setting their usually very large count to 0 (i. e., we invert their scoring to very strong).

• INV-STRONG-5: We modify the strongest 5 % of passwords (with multiplicities), by setting their usually very small count to the maximum count value (i. e., we invert their scoring to very weak).
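Two of the disturbance test cases can be sketched as follows (our own transcription; the names follow the text, the implementation details are assumptions):

```python
import random

def add_rand(ref_b, seed=1):
    """ADD-RAND: add uniform noise between 1 and the password's count."""
    rng = random.Random(seed)
    return {pw: c + rng.randint(1, c) for pw, c in ref_b.items()}

def invert_weakest(ref_b, fraction=0.05):
    """INV-WEAK-5: set the counts of the weakest `fraction` of passwords
    (with multiplicities) to 0, i.e., rate them very strong."""
    items = sorted(ref_b.items(), key=lambda kv: kv[1], reverse=True)
    budget = fraction * sum(c for _, c in items)
    out, mass = dict(ref_b), 0
    for pw, c in items:
        if mass >= budget:
            break
        out[pw] = 0  # invert: very frequent (weak) scored as very strong
        mass += c
    return out
```

INV-STRONG-5 is symmetric: the least frequent 5 % (with multiplicities) receive the maximum count value instead.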

4.4.2. Testing Different Metrics

Next, we describe possible similarity measures and evaluate them. The results are shown in Table 4.2, Table 4.3, and Table 4.4; we will discuss these results in depth in the remainder of this section.

Table 4.2.: (Weighted) Correlation Metrics.

Group      Test Cases   Sim.  Pear.  Spear.  Kend.  wPear.  wSpear.
Monotonic  REF-B        H      1.00   0.73   0.56    1.00    0.99
           DOUBLE       H      1.00   0.73   0.56    1.00    0.99
           HALF         H      1.00   0.73   0.56    1.00    0.99
           LOG          H      0.13   0.73   0.56    0.49    0.99
           SQR          H      0.93   0.73   0.56    0.99    0.99
           SQRT         H      0.45   0.73   0.56    0.96    0.99
Quant.     Q4-alt       H      0.05   0.89   0.78    0.08    0.73
           Q10-alt      H      0.06   0.91   0.80    0.09    0.86
           Q4-equi      H      0.12   0.72   0.61    0.21    0.97
           Q10-equi     H      0.11   0.90   0.80    0.25    0.99
Disturb.   RAND         L      0.00   0.00   0.00    0.12    0.04
           ADD-RAND     H      0.99   0.70   0.54    1.00    0.99
           INV-WEAK-5   M      0.25   0.73   0.56   -0.04    0.70
           INV-STRO-5   M     -0.02  -0.13   0.01    0.50    0.72

Sim.: Expected similarity with REF-A: L=Low, M=Medium, H=High similarity.

Correlation Metrics

A straightforward way to measure similarity, which has been used in most prior work, is the correlation between the reference and the observed values.

Pearson Correlation Coefficient: It is defined as the covariance divided by the product of both standard deviations. Pearson correlation has several problems as a similarity measure for PSMs: First, it is sensitive to monotonic transformations (e. g., the correlation of REF-A and LOG is 0.13), which is undesirable. Even worse, it is highly sensitive to quantization (which we typically encounter for most web-based meters); the correlation between REF-A and the quantized versions Q4-equi/Q10-equi/Q4-alt/Q10-alt is close to zero (between approximately 0.05 and 0.12). Another issue is that it does not capture well the case INV-STRONG-5, where 5 % of strong passwords are given a weak score (arguably not a big problem at all), yet the similarity drops to around zero (−0.02). Two properties of Pearson correlation underlie this undesirable behavior. First, it is a parametric measure and computed from the given values (instead of, e. g., ranks, as in Spearman correlation), which makes it sensitive to non-linear transformations of the data. Second, it gives each data point equal weight, even though the weak passwords have a much higher count (by definition); thus, relatively speaking, Pearson correlation weights deviations for strong passwords more strongly.

Spearman Rank Correlation Coefficient: It is defined as Pearson correlation over the ranks of the data; thus, it is based on the ranks of the (sorted) data only. Spearman correlation has been used by previous work on password strength [43, 239]. Spearman is robust against monotonic transformations and quite tolerant to quantization, which is an improvement over Pearson. Still, it gives too much weight to strong passwords, similar to Pearson correlation. One additional problem is visible for Spearman: the correlation between REF-A and REF-B should be (close to) 1, as we expect perfect correlation; however, Table 4.2 shows a correlation of 0.73. The underlying reason is again the missing weights, which leads to the situation that the strong passwords dominate the similarity score (around 50 % of passwords have a count of less than 20 in REF-A), and the (small) errors from sampling on those strong passwords pull the score from 1 (what would be expected) down to around 0.7.

Kendall Rank Correlation Coefficient: Kendall's tau coefficient is quite similar to Spearman correlation, but conceptually simpler (it only takes into account whether ranks are wrong and the direction of the error, but not how big the difference is). Previous work, in fact, showed very similar results for Spearman and Kendall [239]. However, it has the disadvantage that naïve implementations (as in standard R) are computationally expensive for larger samples, requiring O(N²) operations.

While Kendall is expected to be robust to monotonic transformations, a problem similar to Spearman reduces the correlation to 0.56. Furthermore, adding randomness (ADD-RAND) introduces enough variation to reduce the similarity to 0.54, and the impact of quantization is stronger than for Spearman.

Weighted Correlation Metrics

One common problem with the above correlation measures is that they treat frequent and infrequent passwords as equally weighted data points, i. e., an error in a single infrequent password is rated equally to an error in a frequent password, which affects many more accounts. Weighted correlation measures give specific weights to the data points, which we take to be the frequency in the reference dataset. (To the best of our knowledge, neither weighted Pearson nor weighted Spearman correlation has been used to compare PSMs before.)

Weighted Pearson Correlation: It is defined as ordinary Pearson correlation, but each data point is weighted; we use the counts in the reference (REF-A) as the weight vector. This similarity measure exhibits similar problems as unweighted Pearson correlation for monotonic transformations, as expected, even though the effect is less pronounced, and it remains highly sensitive to quantized data.

Weighted Spearman Correlation: It is defined as weighted Pearson correlation on the ranks. It is the most promising similarity measure considered so far. Like ordinary (unweighted) Spearman correlation, it handles monotonic transformations well. It gives a correlation close to 1 to the test case REF-B (and to the other monotone transformations, including most quantized test cases, which was problematic before), as the weights now prevent the over-representation of strong passwords. It also handles the INV-WEAK-5 and INV-STRONG-5 cases well, where it assigns roughly the same correlation of around 0.7 to both cases, a moderate but noticeable drop from 1.
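The measure can be sketched in Python as follows (the thesis's experiments use the R package wCorr; this re-implementation and its names are our own):

```python
import math

# Weighted Spearman: weighted Pearson correlation computed on the
# ranks, with the reference counts as weights; ties receive the
# average rank.

def ranks(values):
    """1-based ranks; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def weighted_pearson(x, y, w):
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)

def weighted_spearman(reference, meter, weights):
    """Weighted Pearson correlation on the ranks of both inputs."""
    return weighted_pearson(ranks(reference), ranks(meter), weights)
```

Because it operates on ranks, any monotone transformation of the meter output (DOUBLE, LOG, SQRT) leaves the score unchanged, which is exactly requirement 1.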

Table 4.3.: (Weighted) Mean Error Metrics.

Test Cases    MAE     MSE      rMAE  rMSE  wrMAE  wrMSE
REF-B         4       54       0.16  0.05   2.77  13
DOUBLE        28      300402   0.16  0.05   2.77  13
HALF          14      75703    0.16  0.05   2.77  13
LOG           24      301526   0.16  0.05   2.77  13
SQR           3.E+05  7.E+16   0.16  0.05   2.77  13
SQRT          23      300014   0.16  0.05   2.77  13
Q4-alt        25      301711   0.10  0.02   5.41  18803
Q10-alt       22      301465   0.09  0.01   2.79  3006
Q4-equi       26      301766   0.16  0.04   3.76  29
Q10-equi      25      301607   0.09  0.02   1.83  6
RAND          3.E+05  9.E+10   0.33  0.17  22.78  1.E+05
ADD-RAND      15      29799    0.17  0.05   3.09  16
INV-WEAK-5    6       283087   0.16  0.05   5.52  1.E+06
INV-STRO-5    1.E+05  7.E+10   0.37  0.19  14.93  28742

Mean Error Metrics

Another set of similarity measures is mean square error (MSE) and related concepts. We tested variations, inspired by the above results and techniques used in previous work.

Mean Absolute Error (MAE): It is defined as the average absolute error, with equal weight for each data point. A similar measure was used in a publication by Wheeler [248], where a logarithmic error was used. Our test cases reveal the following problems: It is highly sensitive to monotonic transformations, even linear ones (and previous work [248] needed to adapt the scales of the meters to get a reasonable comparison). Large deviations in the rating of single passwords have only moderate impact on the similarity (due to taking absolute errors only). Its sensitivity to deviations in frequent passwords is low (the error for INV-WEAK-5 is 6, only marginally larger than the error due to random sampling, i. e., REF-B with an error of 4).

Mean Squared Error (MSE): It is defined as the average over the squared errors, giving more weight to large deviations. The properties of MSE are very similar to those of MAE.

Ranked Mean Absolute/Squared Error (rMAE/rMSE): Here we first rank the data (assigning ties the average rank) and compute the MAE or MSE of the ranks. As expected, the resulting measures are resistant to monotonic transformations. However, as they are non-weighted, they fail to capture errors for a few frequent passwords (INV-WEAK-5). This means that in the badly performing PSM test case (INV-WEAK-5), both rMAE and rMSE fail to show any difference to the reference, making them unsuitable.

Weighted Mean Error Metrics

All error measures discussed in the previous subsection are unweighted and thus fail to capture errors in a few frequent passwords. In this subsection, we consider weighted variants.

Weighted and Ranked Mean Abs./Sq. Err. (wrMAE/wrMSE): When we use both ranked and weighted data points, the resulting similarity measure becomes more discriminative, e. g., it allows distinguishing the INV-WEAK-5 and DOUBLE test cases. Both measures work very well on our test cases (remember that lower values mean higher similarity) and seem to be a reasonable choice.
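A sketch of wrMAE/wrMSE (our own Python transcription; ranks are average ranks, weights are the reference counts):

```python
def ranks(values):
    """1-based ranks; tied values share the average rank."""
    s = sorted(values)
    first = {}
    for i, v in enumerate(s):
        first.setdefault(v, i + 1)          # rank of first occurrence
    last = {v: len(s) - s[::-1].index(v) for v in set(values)}
    return [(first[v] + last[v]) / 2 for v in values]

def wr_mean_error(ref_counts, meter_scores, squared=False):
    """wrMAE (or wrMSE if squared=True): mean absolute/squared rank
    error, weighted by the reference counts."""
    rx, ry = ranks(ref_counts), ranks(meter_scores)
    p = 2 if squared else 1
    return sum(w * abs(a - b) ** p
               for w, a, b in zip(ref_counts, rx, ry)) / sum(ref_counts)
```

Ranking gives tolerance to monotone transformations; weighting makes an error on a frequent (weak) password count proportionally more.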

Table 4.4.: (Weighted) One-Sided/Pairwise Error Metrics.

Test Cases    wrLAE  wrLSE   PE    PE-5  PU    wPE   wPE-5  wPU
REF-B          1.39  7       1.00  0.68  0.95  0.96  0.24   0.99
DOUBLE         1.39  7       1.00  0.68  0.95  0.96  0.24   0.99
HALF           1.39  7       1.00  0.68  0.95  0.96  0.24   0.99
LOG            1.39  7       1.00  0.68  0.95  0.96  0.24   0.99
SQR            1.39  7       1.00  0.68  0.95  0.96  0.24   0.99
SQRT           1.39  7       1.00  0.68  0.95  0.96  0.24   0.99
Q4-alt         4.34  18799   1.00  0.69  0.75  1.00  0.81   0.75
Q10-alt        1.86  3003    1.00  0.54  0.90  1.00  0.20   0.90
Q4-equi        2.58  25      1.00  0.78  0.36  1.00  0.36   0.75
Q10-equi       0.96  3       1.00  0.57  0.71  1.00  0.19   0.90
RAND          20.37  1.E+05  1.00  0.90  1.00  1.00  0.92   1.00
ADD-RAND       1.62  9       1.00  0.70  0.97  0.97  0.26   0.99
INV-WEAK-5     4.13  1.E+06  1.00  0.68  0.95  1.00  0.29   0.99
INV-STRO-5    12.89  28727   1.00  0.96  0.92  1.00  0.99   0.98

One-Sided Error Metrics

As described before, password strength approximations can be under- or overestimates. Previous work [162, 248] observed that a meter underestimating the security of strong passwords (e. g., INV-STRO-5) is less problematic than one overestimating the strength of weak passwords (e. g., INV-WEAK-5). The former results in a user simply selecting another (presumably secure) password, whereas in the latter case the user believes to have selected a secure password that is in reality weak.

Weighted and Ranked Mean Abs./Squared One-Sided Lower Error (wrLAE/wrLSE): One can define versions of MAE/MSE that only take one-sided errors into account. If such a measure operates on count values, this approach favors meters that generally underestimate security: a meter that rates all passwords insecure (i. e., assigns a high count value) will get a high rating. This can be prevented by operating on ranked data. On the tested datasets and test cases, the resulting measures wrLAE/wrLSE perform similarly to their two-sided versions wrMAE/wrMSE. A likely explanation is that wrLAE/wrLSE operate on ranked data; therefore, overestimating the strength of one password generally leads to underestimating the strength of another password. (Weights and squared differences (wrLSE) mean that the results still can differ; however, these effects seem to even out on the dataset that we considered.) For applications that call for one-sided metrics, one should consider non-ranked similarity metrics at the cost of losing the ability to tolerate monotonic transformations.

Pairwise Error Metrics

In preliminary tests, we observed that several similarity measures give a low similarity score to quantized data. This behavior is undesirable, as heavy quantization loses information about the distribution. We tried to address this problem by designing a similarity score that is based on two individual metrics: an error metric, which describes how many passwords are not in the "correct" order, and a utility metric, which describes whether the meter provides "useful" and discriminative output. To illustrate this problem, consider a strength meter with binary output, where only a few very strong passwords are "accepted" and all other passwords are "rejected." This meter would have a low error rating, as it mostly preserves the order, but also a low utility rating, as most passwords end up in the same bin. Both metrics are based on ranks. We evaluated several variants of this basic idea.

Pairwise Error/Utility Rate (PE/PU): These consider the relative ranking of all pairs of passwords. PE considers the fraction of pairs where the meter and the reference disagree (where a tie in one of the two is not counted as a disagreement), whereas PU considers the fraction of pairs that the meter distinguishes, i. e., where it does not output a tie. (A meter outputting the same strength for all passwords, i. e., using a single bin, has a PE of 0, but also a PU of 0.)

Pairwise Error Rate More Than 5 % (PE-5): As small deviations are typically considered a non-problem, for this variant we tolerate any deviation that is less than 5 % (in terms of rank) and do not count it towards the error.
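The unweighted pairwise rates can be sketched as follows (our own transcription; the rank-based 5 % tolerance of PE-5 is omitted for brevity):

```python
from itertools import combinations

# PE: fraction of pairs ranked in opposite order by meter and
# reference (ties in either do not count as a disagreement).
# PU: fraction of pairs the meter distinguishes (no tie).
def pairwise_rates(reference, meter):
    pairs = list(combinations(range(len(reference)), 2))
    err = tie = 0
    for i, j in pairs:
        dr, dm = reference[i] - reference[j], meter[i] - meter[j]
        if dm == 0:
            tie += 1          # meter cannot distinguish this pair
        elif dr * dm < 0:
            err += 1          # opposite order, neither side tied
    n = len(pairs)
    return err / n, (n - tie) / n
```

The weighted variants (wPE/wPU/wPE-5) additionally weight each pair with the product of the two passwords' probabilities.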

Weighted Pairwise Error Metrics

We have argued before that unweighted measures that do not take the specific probabilities of passwords into account systematically bias the results.

Weighted Pairwise Error/Utility Rate (wPE/wPU) and Weighted Pairwise Error Rate More Than 5 % (wPE-5): We use weighted versions of the three measures introduced before, where we weight each pair with the product of the probabilities of the two passwords.

Implementation: All measures are implemented using R v3.4.4 (March 2018). For Pearson and Spearman, we use standard R. For weighted Pearson and Spearman, we use the wCorr package.³ For calculating the Kendall correlation, we use an O(n log n) optimized version from pcaPP.⁴

4.4.3. Reference Validation

To confirm our findings and test the reliability of our reference, which is based on the common LinkedIn passwords, we repeated our analysis using RockYou and 000Webhost. The leaks differ in size; thus, the resulting number of tested passwords was different. While the reference had approx. 1 million passwords that occurred 10 or more times, RockYou only includes 250 000 and 000Webhost only 62 000 such unique passwords. Across the different leaks, we observed only minor differences. The tendencies for the correlation, mean error, one-sided error, and pairwise error metrics remain the same independent of the tested password leak. For example, for the three leaks, the wSpear. metric results vary only by around ±0.04 across all test cases. Furthermore, we repeated our tests with a LinkedIn set that included uncommon passwords (count ≥ 2). Including uncommon passwords is expected to be more error-prone [24, 239]. While the common variant included approx. 1 million passwords, the uncommon version consisted of 31 million unique passwords. Our results show that the tendencies remain the same. For example, for the uncommon variant, the wSpear. metric results vary by around ±0.07 across all test cases. To summarize, in those additional tests, we found only minor differences in the behavior of the similarity measures across password leaks. Moreover, including uncommon passwords had a bigger, albeit overall still negligible, impact on the results.

³ Package: wCorr (Weighted Correlations), Version 1.9.1, May 2017, https://cran.r-project.org/package=wCorr, as of March 27, 2019.
⁴ Package: pcaPP (Robust PCA by PP), Version 1.9-73, January 2018, https://cran.r-project.org/package=pcaPP, as of March 27, 2019.

4.4.4. Recommendation

We reported results for 19 candidates: we considered five correlation-based similarity measures, six variants based on mean absolute/squared error, as well as eight one-sided/pairwise error metrics, and evaluated them on a number of test cases. Those tests included commonly observed cases like logarithmic transformation and quantization, but also meters that incorrectly judge strength, simulated via disturbances. We have seen that measures that are not weighted largely fail to capture essential aspects of the (highly skewed) distributions of passwords. Consequently, sensible measures should be weighted. Furthermore, we observed that measures based not on ranks (but rather on values) are generally too sensitive to monotonic transformations and quantization to be useful. In our evaluation, the metrics wSpear., wrMAE, wrMSE, wrLAE, wrLSE, and wPE-5/wPU are weighted and ranked metrics that performed well and seem suitable as comparison metrics. For the remainder of this work, we have selected weighted Spearman correlation. It is not perfect, especially on quantized output, and it does not differentiate between under- and overestimating strength, but it performed well on most test cases. Furthermore, it is a standard metric and easy to interpret, relatively easy to approximate from sampled data (cf. Section 4.4.5), and implementations are readily available. Also, (unweighted) Spearman correlation has been used before to evaluate strength meters [43, 239].

4.4.5. Sampling

Collecting data from online sources is often cumbersome (e. g., previous work [57] that evaluated data from (online) password strength meters went through great effort to collect large amounts of data). Therefore, we want to determine confidence intervals for our measures to select the amount of data we need to collect. Determining accurate bounds is non-trivial, and to the best of our knowledge, no bounds are known that are applicable to our problem.

We determine empirical confidence intervals for the weighted Spearman measure (as it was the most promising one in the previous section) by repeated sub-sampling from the reference REF-A and the test cases. We sample subsets of varying sizes, ranging from 100 to 10 000, and compute the similarity on those subsets, using the full data available to determine the strength score of the reference. We repeat this process 10 000 times and determine the interval that contains 95 % of all similarity values (with 2.5 % larger and 2.5 % smaller). We report both the width of the interval and the actual interval. We perform this process for two different test cases, namely Q4-equi and LOG. While this does not give a formal guarantee that the actual value lies in this interval, it gives us reasonable confidence and determines rough boundaries. Note that this process only takes into account (random) errors caused by sampling; it does not take into account any systematic errors that may be introduced, e. g., by the smaller sample size. The summary of results is shown in Table 4.5.

Table 4.5.: (Empirical) confidence intervals for REF-A vs. Q4-equi/LOG for different sample sizes and the weighted Spearman similarity measure (using 10 000 iterations and a 5 % confidence level). Reported is the width of the confidence interval, as well as the boundaries (in brackets).

          100 Samples           1000 Samples          10 000 Samples
Q4-equi   0.146 [0.852–0.998]   0.044 [0.928–0.972]   0.027 [0.940–0.966]
LOG       0.081 [0.919–1.000]   0.024 [0.974–0.998]   0.011 [0.985–0.997]

An example histogram of the correlation values is given in Figure 4.1, determined for weighted Spearman correlation with 10 000 samples and 10 000 iterations. The histogram has a median of 0.990, a min./max. of 0.980/1.000, and 2.5 %/97.5 % percentiles of 0.985/0.997, resulting in a 95 % confidence interval of width 0.011. We see that, as expected, the width of the confidence interval decreases with increasing sample size. For weighted Spearman, we find widths of 0.027 and 0.011, respectively, and we will assume differences greater than 0.05 to be significant.

[Figure 4.1: histogram; x-axis: Weighted Spearman Correlation (0.985 to 0.995), y-axis: Frequency (0 to 1000)]

Figure 4.1.: Histogram example for the monotonic transformation error LOG using weighted Spearman correlation, 10 000 samples, and 10 000 iterations.
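The subsampling procedure behind Table 4.5 and Figure 4.1 can be sketched as follows (our own transcription; `similarity` stands for any similarity function evaluated on a subsample):

```python
import random

# Repeatedly sub-sample, compute the similarity on the subset, and
# report the empirical interval containing `level` of all values.
def empirical_ci(similarity, data, sample_size, iterations=10_000,
                 level=0.95, seed=0):
    rng = random.Random(seed)
    scores = sorted(
        similarity(rng.sample(data, sample_size)) for _ in range(iterations)
    )
    lo = scores[int(iterations * (1 - level) / 2)]
    hi = scores[int(iterations * (1 + level) / 2) - 1]
    return lo, hi
```

As in the text, this captures only random sampling error, not systematic errors introduced by the smaller sample size.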

4.5. Evaluation

Guided by practical requirements, we distinguish between two different application scenarios.

1. PSMs deployed to protect online accounts, i. e., the most prevalent online logins for social networks, email providers, etc. For online accounts, the operator can and should implement measures to limit the effectiveness of online guessing attacks, such as rate-limiting. Typically, one considers between 100 and 1000 allowed guesses within 30 days [26, 90, 98, 239, 240].

2. Strength meters deployed to protect local authentication, such as hard disk encryption. In this scenario, the number of guesses the adversary can test is only limited by the adversary's computational power; in real-world attacks, the number of guesses per day on a single GPU is on the order of 10⁹ to 10¹² guesses [97]; some even consider up to 10¹⁴ guesses to be reasonable [78].

4.5.1. Use Cases

Online Guessing: For online account PSMs, techniques such as rate-limiting can reduce the risk of online guessing attacks. Based on previous work [239, 240], which describes 1000 guesses as a reasonable limit for an online guessing attack, and on the assumption that the attacker acts rationally and guesses the most likely passwords first, we deduce that the set of passwords most relevant for this kind of attack is the 10 000 most likely passwords. If each user avoids passwords from the "easier half" of this set, overall security will improve greatly. Sampling strength meter scores for these 10 000 passwords from all three datasets (RockYou, LinkedIn, 000Webhost) would put an unnecessary burden on the server infrastructure and might even trigger server-side alerts. Given the results on sampling accuracy in Section 4.4.5, we avoid such implications by restricting ourselves to 1000 samples from these online services: out of the 10 000 most common RockYou passwords, we uniformly sample 1000 passwords. We repeat this process for LinkedIn and 000Webhost, respectively, to obtain three different online guessing datasets. For the most likely passwords, we have very accurate frequency estimates, so for those common passwords, we can use the sample frequency as ground truth for their strength in an online attack.
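The construction of an online guessing dataset can be sketched as follows; this is a simplified illustration, and the function name, input format, and seed are hypothetical.

```python
import random

def sample_online_set(freq_list, top_n=10_000, sample_n=1_000, seed=42):
    """freq_list: (password, count) pairs sorted by descending count.

    Returns a uniform sample of sample_n passwords from the top_n most
    common ones; the sample frequency serves as the online ground truth.
    """
    top = freq_list[:top_n]
    rng = random.Random(seed)
    return dict(rng.sample(top, sample_n))
```

Repeating this once per leak (RockYou, LinkedIn, 000Webhost) yields the three evaluation sets described above.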

Offline Guessing: For offline guessing attacks, there is no limit on the number of attempts an attacker can perform; the number depends only on the attacker's computing capabilities and the password hashing function deployed for storing the password. Consequently, the sample frequency in the datasets does not provide useful information about strength in an offline attack, as the number of guesses (by far) exceeds the size of the dataset. Instead, we use the performance of common guessing tools as the reference. More specifically, we use the results of the Password Guessability Service (PGS) [233], which allows researchers to send lists of passwords (in plaintext); the service evaluates when these passwords would be guessed by common configurations of widely used password guessing tools. Ur et al. [233] found that the attribute min_auto is a good measure of resistance to guessing attacks, even compared with human password experts. We use this recommended configuration without a password composition policy (1class1) as the ground truth for the offline guessing evaluation. For each of the three datasets (RockYou, LinkedIn, 000Webhost) we sampled 10 000 passwords to obtain the three different offline guessing datasets.
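Conceptually, min_auto takes the minimum over the per-tool guess numbers of a password. A minimal sketch is shown below; treating never-guessed passwords as infinite is our own convention for illustration, not necessarily PGS behavior.

```python
def min_auto(guess_numbers):
    """guess_numbers: mapping of tool name -> guess number
    (None if the tool never guessed the password)."""
    guessed = [g for g in guess_numbers.values() if g is not None]
    return min(guessed) if guessed else float("inf")
```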

4.5.2. Selected Meters

Academic Proposals We considered different meter proposals from the literature. The list is in chronological order.

• Heuristic/NIST: In 2004 the NIST published SP 800-63 Ver. 1.0 [38], which includes a heuristic for entropy estimation based on length and compliance with a composition policy. The heuristic also awards a bonus if the password passes a common dictionary check. The latest version, SP 800-63B [98] from June 2017, no longer includes this ad hoc heuristic.
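The retired SP 800-63 heuristic can be approximated as follows. This is a simplified sketch: the original scales the dictionary bonus with password length, which we collapse into a flat 6-bit bonus for short passwords.

```python
def nist_entropy_bits(password, passes_dictionary_check=False,
                      composition_bonus=False):
    """Approximation of the NIST SP 800-63 (2004) entropy heuristic."""
    bits = 0.0
    for i in range(len(password)):
        if i == 0:
            bits += 4.0        # first character
        elif i < 8:
            bits += 2.0        # characters 2-8
        elif i < 20:
            bits += 1.5        # characters 9-20
        else:
            bits += 1.0        # characters 21 and beyond
    if composition_bonus:
        bits += 6.0            # upper-case and non-alphabetic characters used
    if passes_dictionary_check and len(password) < 20:
        bits += 6.0            # simplified: the original tapers this with length
    return bits
```

Note that the heuristic depends almost entirely on length and policy compliance, which is why it cannot distinguish password from a random eight-character string.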

• Markov Model/OMEN: In 2012 Castelluccia et al. [43] proposed training n-gram Markov models on the passwords of a service to provide accurate strength estimations. The estimation is thus based on the probabilities of the n-grams a password is composed of. The meter provides adaptive estimations based on a target distribution but is limited to server-side implementations.
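The underlying idea, scoring a password by the product of its smoothed n-gram transition probabilities, can be sketched as follows. The smoothing constant and alphabet size are illustrative, not OMEN's actual parameters.

```python
from collections import defaultdict

START, END = "\x02", "\x03"

def train(passwords, n=3):
    # count n-gram transitions: (n-1)-character context -> next character
    counts = defaultdict(lambda: defaultdict(int))
    for pw in passwords:
        s = START * (n - 1) + pw + END
        for i in range(len(s) - n + 1):
            counts[s[i:i + n - 1]][s[i + n - 1]] += 1
    return counts

def probability(pw, counts, n=3, alpha=0.1, alphabet=98):
    # add-alpha smoothing over an assumed alphabet (95 printable + markers)
    s = START * (n - 1) + pw + END
    p = 1.0
    for i in range(len(s) - n + 1):
        ctx, ch = s[i:i + n - 1], s[i + n - 1]
        total = sum(counts[ctx].values())
        p *= (counts[ctx][ch] + alpha) / (total + alpha * alphabet)
    return p
```

Training on a matching distribution is what makes the estimates adaptive: n-grams common in the target service receive high probabilities, so passwords built from them are rated weak.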

• Heuristic/Comp8: In 2012 Ur et al. [231] investigated how PSMs can be used to nudge users towards stronger passwords. They outlined a scoring algorithm derived from a composition policy called "Comprehensive 8." Due to the lack of better alternatives, this scoring function was used to estimate the strength of a password. While this LUDS-based approach should no longer be used, we include it for completeness.

• Heuristics/zxcvbn: In 2012 Daniel Wheeler described a PSM in a Dropbox Inc. blog post. It is based on advanced heuristics that extend the LUDS approach by including dictionaries and considering leetspeak, keyboard walks, and more. Due to its easy-to-integrate design, it is deployed on many websites. The meter was backed by a scientific analysis [248] in 2016.

• PCFG/fuzzyPSM: In 2012 Houshmand and Aggarwal [115] proposed a system to analyze the strength of a password using a PCFG-based approach. In 2016 Wang et al. [239] extended the concept by proposing a fuzzy PCFG that models password strength based on which mangling rules are required to modify a base dictionary to match a training distribution of stronger passwords.

• Heuristic/Eleven: In 2013 Egelman et al. [68] studied password selection in the presence of PSMs. For their meter, they decided to use a strength metric similar to NIST's. Like LUDS approaches, the meter considers character set sizes and length.

• RNN/DD-PSM: In 2016 Melicher et al. [162] proposed using a recurrent neural network for probabilistic password modeling. For our analysis, we use the guess number estimations provided by the RNN. The authors also describe a method that allows a client-side implementation using a special encoding and a Bloom filter. In 2017, Ur et al. [229] extended the concept by adding data-driven feedback using 21 heuristics that try to explain how to improve the password choice. We use Ur's website⁵ for additional measurements.

• Heuristic/LPSE: In 2018 Guo et al. [102] proposed a lightweight client-side meter based on cosine-length and password-edit distance similarity. It transforms a password into a LUDS vector and compares it to a standardized strong-password vector using the aforementioned similarity measures.

⁵ Data-Driven PSM: https://cups.cs.cmu.edu/meter/, as of March 27, 2019.

Password Managers We also tested meters that protect high-value encrypted password vaults. If no further protection mechanism is deployed [87], the security of cloud-based password managers in particular depends on the use of a high-entropy secret. Thus, service providers need to give accurate strength estimates. Furthermore, vaults that offer the ability to store user-chosen credentials might show strength estimates for their stored secrets, too. For our work, we analyzed the PSMs of 11 popular password managers, including 1Password [3], Bitwarden [257], Dashlane [55], Enpass [213], KeePass [196], Keeper [139], LastPass [153], and more.

Popular Websites We queried password strength meters from popular web services within the top 100 ranking published by Alexa Internet. Our samples include large sites like Apple, Baidu, Dropbox, Facebook, Google, Microsoft, reddit, Twitter, Sina Weibo, Yandex, and more. For better comparability, we tried to include sites that were queried in previous work by de Carné de Carnavalet and Mannan [57]. The authors published their findings on all the meters' code via a "Password Multi-Checker Tool" on a self-hosted website [58], allowing one to compare their results with the currently implemented versions.

Operating Systems We analyzed password strength meters from standard operating systems. Microsoft's Windows and Apple's macOS do not provide a strength estimation during account creation. However, Apple includes a Password Assistant with a strength estimation functionality that is used by the Keychain Access application while generating passwords, e. g., for file encryption. Canonical's Ubuntu distribution shows a PSM during account creation and hard disk encryption setup. It is part of the graphical live CD installer and based on Mozilla's Seamonkey PSM function.

4.5.3. Querying Meters

To query the website PSMs, we used techniques similar to previous work [57]. For JavaScript and server-side implementations, we used the Selenium framework [118] to automate a headless Google Chrome browser. As all JavaScript (that involves no server communication) is evaluated on the client only, one can obtain large quantities of strength estimates in a short period. The browser automation approach allows executing JavaScript; thus, we could also obtain intermediate strength estimation results that are not displayed in the UI of the password strength meter but do exist in the Document Object Model (DOM) of the browser.

For the academic proposals, a more involved approach was required, as some meters require training or no implementation was available. For the training, we sampled 10 million passwords from each sanitized dataset (excluding the respective offline and online passwords). Please note, not all meters make full use of all available training data, e. g., fuzzyPSM, OMEN, and the RNN-based approach have specific requirements. The Markov model approach used by OMEN does not allow training passwords shorter than the n-gram size. Similarly, fuzzyPSM does not require a training corpus larger than 1 million passwords. For the RNN-based approach, we were forced to limit the training set to passwords no longer than 30 characters.

For Comp8, LPSE, and fuzzyPSM we contacted the authors, who kindly shared their source code or evaluated their implementation and shared the results with us. For the RNN PSM, we used an implementation by Melicher et al. [161]. We tested multiple variants: i) Based on the guess numbers of a pre-trained (generic) password distribution. ii) Based on a client-side JavaScript implementation (using a different password composition policy) [228]. iii) Based on the guess numbers of a self-trained RNN using a matching distribution (targeted), following the recommended construction guidelines.
iv) Based on a self-trained RNN using a matching distribution (targeted) including a Bloom filter made of the top 2 million training set passwords (following the recommendations in the original paper [162]).

For the Markov approach, we modified a version of the Ordered Markov ENumerator (OMEN) [8] by Dürmuth et al. [66] (a password guesser implementing the approach of Castelluccia et al. [43]) and used the aforementioned training set to obtain strength estimates. As this approach uses quantized (level-based) strength estimates only, we also implemented an approach described by Golla et al. [87] that outputs probabilities instead of quantized scores and increases precision by training one model per password length. For zxcvbn, we used the official JavaScript implementation [65]. For NIST, we used a JavaScript implementation of the meter [56]. We built a dictionary consisting of the top 100 000 passwords from Mark Burnett's 10 million password list [37], which has been used as a blacklist by previous work [103].

For most of the password managers (1Password, Bitwarden, Keeper, etc.), we were able to query their respective web interface versions using Selenium. For RoboForm and True Key we automated the respective Chrome extensions using Selenium. For KeePass, we used the KPScript plugin on Windows [197]. For Dashlane we used the Appium framework [133] and its Windows Driver to automate the Windows desktop. While analyzing Enpass's PSM, we found that it uses the official zxcvbn implementation [212] (including the same dictionaries); thus, we did not query Enpass and instead report the zxcvbn results.

For the operating systems, we queried Ubuntu's PSM using the original Python script from the Ubiquity source code [40]. For macOS we used PyObjC [175], a Python Objective-C bridge, to query Apple's Security Foundation framework.

4.6. Results

Next, we present and discuss the results of the evaluation of the different meters, both for the online and offline use case. The academic PSM results are summarized in Table 4.6. We present the results for the online use case on the top and for the offline use case on the bottom. We report results for both use cases for all strength meters, even though some are designed for one specific use case only (e. g., password meters deployed on websites are intended for the online use case). We computed the weighted Spearman correlation as the best similarity score selected in Section 4.4. The full table for all 81 password strength meter variations can be found in Appendix B.

The primary results are separated into four categories (academic proposals, password managers, operating systems, and websites). A fifth category is included for comparison and is based on the "Password Multi-Checker Tool" [58] from previous work [57]. A version of our results that allows an easier comparison, provides bar charts of the quantizations, and more can be found online [91]. Please note, not all tested mechanisms, like ID: 5A/B NIST or ID: 33 Have I Been Pwned?, are intended to be used as a strength meter. Thus, the reported results for those estimators cannot be directly compared with others, as their parameters can likely be tuned to perform better.

4.6.1. Overall Performance

The three best-performing academic meters are ID: 6 fuzzyPSM (0.899–1.000), ID: 7C RNN Target (0.860–0.965), and ID: 4C Markov (Multi) (0.721–0.998) for both online and offline use cases. A number of other PSM variants perform well, including ID: 8A zxcvbn (Guess Number) (0.554–0.999).

For password managers, we found ID: 13A KeePass and ID: 14B Keeper to be accurately ranking meters (0.284–0.884). Furthermore, we found the zxcvbn-based ID: 17B RoboForm to be precise (0.528–0.962). Across the binning PSMs we found some of the zxcvbn (Score)-based meters, e. g., ID: 17A RoboForm (Q4), ID: 17C RoboForm Business (Q6), ID: 18 True Key (Q5), and ID: 12 Enpass (Q5), to be accurately ranking (0.341–0.827). The PSM in ID: 10A Bitwarden shows significant problems. Similar, but less pronounced, are the inaccuracies of ID: 16B LogMeOnce and ID: 19A Zoho Vault. All three are LUDS-based meters.

Table 4.6.: We computed the weighted Spearman correlation as the best similarity score (cf. Section 4.4). We highlight if and on how many bins a meter quantized its output and list whether a meter runs on client- or server-side.

Online Attacker
ID  Meter                        T.  Q.   RY      LI      0W
1A  Comprehensive8 [231]         C   -   -0.652  -0.589   0.251
1B  Comprehensive8 [231]         C   Q5  -0.331  -0.084   0.409
2   Eleven [68]                  C   -    0.670   0.912   0.492
3   LPSE [102]                   C   Q3   0.584   0.669   0.508
4A  Markov (OMEN) [43]           S   -    0.721   0.697   0.410
4B  Markov (Single) [87]         S   -    0.718   0.998   0.817
4C  Markov (Multi) [87]          S   -    0.721   0.998   0.902
5A  NIST [38]                    C   -    0.670   0.912   0.492
5B  NIST (w. Dict.) [38]         C   -    0.669   0.910   0.472
6   PCFG (fuzzyPSM) [239]        S   -    1.000   0.994   0.963
7A  RNN Generic [162]            C   -    0.632   0.542   0.427
7B  RNN Generic (Web) [229]      C   -    0.473   0.649   0.421
7C  RNN Target [162]             C   -    0.951   0.913   0.965
7D  RNN Target (w. Blo.) [162]   C   -    0.951   0.913   0.965
8A  zxcvbn (Guess No.) [248]     C   -    0.989   0.990   0.554
8B  zxcvbn (Score) [248]         C   Q5   0.341   0.490   0.359

Offline Attacker
ID  Meter                        T.  Q.   RY      LI      0W
1A  Comprehensive8 [231]         C   -   -0.476  -0.616   0.441
1B  Comprehensive8 [231]         C   Q5  -0.128  -0.123   0.421
2   Eleven [68]                  C   -    0.755   0.951   0.733
3   LPSE [102]                   C   Q3   0.544   0.718   0.693
4A  Markov (OMEN) [43]           S   -    0.701   0.669   0.660
4B  Markov (Single) [87]         S   -    0.828   0.991   0.872
4C  Markov (Multi) [87]          S   -    0.997   0.995   0.777
5A  NIST [38]                    C   -    0.755   0.951   0.733
5B  NIST (w. Dict.) [38]         C   -    0.756   0.953   0.816
6   PCFG (fuzzyPSM) [239]        S   -    0.998   0.999   0.899
7A  RNN Generic [162]            C   -    0.535   0.520   0.800
7B  RNN Generic (Web) [229]      C   -    0.449   0.688   0.777
7C  RNN Target [162]             C   -    0.896   0.860   0.885
7D  RNN Target (w. Blo.) [162]   C   -    0.896   0.860   0.882
8A  zxcvbn (Guess No.) [248]     C   -    0.989   0.999   0.868
8B  zxcvbn (Score) [248]         C   Q5   0.373   0.567   0.817

Type (T): C=Client; S=Server
Quantization (Q): Q3–Q5=Number of bins, e. g., Q3=[Weak, Good, Strong]
Dataset: RY=RockYou; LI=LinkedIn; 0W=000Webhost

When it comes to PSMs used by current operating systems, we found very negative results. While macOS does not prominently display its PSM, Ubuntu uses the PSM during account creation and hard disk encryption. We found both meters to perform poorly. An analysis of Ubuntu's PSM source code revealed a LUDS meter: the approach counts the number of uppercase, digit, and symbol characters and multiplies them by some magic constants. The estimation function is a re-implementation of Mozilla's Seamonkey meter, which dates back to 2006. The first bug reports about the poor quality and inconsistency of its assessments date back to 2012 [15].
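The heuristic has roughly the following shape. This is a reconstruction of the widely circulated Seamonkey formula; treat the exact caps and constants as illustrative.

```python
def seamonkey_strength(pw):
    # LUDS heuristic: character-class counts are capped,
    # then multiplied by "magic constants" and summed
    length  = min(len(pw), 5)
    digits  = min(sum(c.isdigit() for c in pw), 3)
    symbols = min(sum(not c.isalnum() for c in pw), 3)
    upper   = min(sum(c.isupper() for c in pw), 3)
    score = (length * 10 - 20) + digits * 10 + symbols * 15 + upper * 10
    return max(0, min(score, 100))  # clamp to 0-100
```

Because length is capped and no dictionary is consulted, a 20-character all-lowercase passphrase scores the same as a 5-character one, which explains the poor ranking quality observed above.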

While the weighted Spearman correlation decreases due to the effects of quantization (cf. Section 4.6.2), we observed a relatively good accuracy for some of the website PSMs, too. Examples are the non-binning ID: 35B Microsoft (v3), ID: 25A Best Buy, ID: 28A Drupal, and ID: 41A Twitter PSMs (0.424–0.951). Across the binning PSMs we found some of the zxcvbn (Score)-based meters, e. g., ID: 36 reddit (Q5) and ID: 40A Twitch (Q5) (0.197–0.817), and some non-zxcvbn-based PSMs, like ID: 41B Twitter (Q5), ID: 32A Google (Q5), and ID: 35A Microsoft (v3) (Q4), to be accurately ranking (0.487–0.763). Based on our measurements, ID: 33 Have I Been Pwned? performs excellently. This is likely due to the fact that all tested datasets are part of the Pwned Passwords list [122]; thus, different results are expected if non-breached passwords are evaluated. Surprising are the results of ID: 27A Dropbox (0.056–0.611), the developers of the zxcvbn meter. On their website, they rely on a Q4 score-based implementation (they discard the first bin, i. e., all passwords with a guess number below 10³). Based on our ID: 8B zxcvbn (Score) findings, we expected better results. Note that the results of ID: 27B Dropbox, using an older implementation from previous work [57], show similarly low performance.

To summarize, the academic contributions to strength estimation outperform many other meters and have brought up several concepts that improve the estimations. Some other factors may contribute to those meters performing well: specifically, for the academic proposals we often have continuous scores, and the meters are trained on the distribution.

The PSMs in password managers perform reasonably well, with a few exceptions (ID: 10A Bitwarden, ID: 16B LogMeOnce, and ID: 19A Zoho Vault). Similar to previous work, we measured a good accuracy for ID: 13A KeePass. The high accuracy of ID: 17B RoboForm and others is explained by the use of zxcvbn.

PSMs in operating systems are not popular, even though, when present, they are used for security-sensitive operations. Current implementations are LUDS-based meters that lack any helpful guidance and should be replaced.

PSMs used on websites do reasonably well, with correlations up to 0.951. Most of the website PSMs we evaluated are client-side JavaScript meters; hybrids are also popular. Server-side implementations were rare in our evaluation set. We speculate there are several reasons why websites are not using better meters: lack of awareness, lack of guidance on the quality of meters, and the usually larger size of academic meters that need to store model parameters.

4.6.2. Effect of Quantization

Almost all PSMs on websites provide a binned output, as users rely on tangible feedback like Weak or Strong instead of a more abstract probability or guess number. Binning will reduce the accuracy of a meter; however, weighted Spearman is relatively robust against this effect. In the following, we qualitatively analyze the estimates of binning meters to provide a more intuitive way to compare the weighted Spearman correlation with the over- and underestimates of the meters. For this, we use the PGS [233] min_auto guess number and a logarithmic binning similar to zxcvbn (Score). The results are visualized in Figure 4.2.

(a) ID: 13B (KeePass), wSpear: 0.002

Min. Guess Number   Very Weak   Weak   Moderate   Strong   Very Strong
≥ 1e10              3003        19     2          0        0
≥ 1e8               1660        0      0          0        0
≥ 1e6               2141        0      0          0        0
≥ 1e3               2650        0      0          0        0
≥ 1e0               525         0      0          0        0

(b) ID: 8B (zxcvbn, Score), wSpear: 0.567

Min. Guess Number   0      1      2     3     4
≥ 1e10              0      156    984   827   1057
≥ 1e8               2      434    930   262   32
≥ 1e6               7      1328   665   125   16
≥ 1e3               278    2093   220   56    3
≥ 1e0               482    43     0     0     0

(c) ID: 41B (Twitter), wSpear: 0.665

Min. Guess Number   Obvious   Weak   Good   Strong   Very Strong
≥ 1e10              0         486    988    1361     189
≥ 1e8               0         549    782    282      47
≥ 1e6               0         1095   863    176      7
≥ 1e3               49        2021   542    31       7
≥ 1e0               329       184    12     0        0

Figure 4.2.: Quantization effect of different PSMs: Visualization of the number of passwords per bin. An accurate and correctly binning PSM produces a diagonal green line. A meter that only assigns the weak/strong bin is visualized via a vertical bar on the left (purple)/right (red). If the bin thresholds are incorrectly chosen (cf. Figure 4.2a), the quantization degrades the precision, the relative ranking is lost, and the correlation degrades.
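The logarithmic binning used for the rows of Figure 4.2 (and, similarly, for zxcvbn's five-level score) can be sketched as follows; the thresholds are the bin edges from the figure.

```python
def log_bin(guess_number, thresholds=(1e3, 1e6, 1e8, 1e10)):
    """Map a guess number onto five bins (0-4) via logarithmic thresholds."""
    return sum(guess_number >= t for t in thresholds)
```

A password guessed after 10⁷ attempts thus lands in bin 2 of 5, regardless of how close it is to either threshold, which is exactly the precision loss quantization introduces.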

The ≥ 10³ | V Bin

This bin includes passwords that are weak (10³ ≤ guess number < 10⁶) but misjudged to be strong. An analysis of ID: 41B Twitter's bin revealed weaknesses in detecting keyboard walks and leet transformations. The password !QAZxsw2 (a keyboard walk on US keyboards) as well as P@$$w0rd, jessica#1, and password@123 were incorrectly ranked. ID: 8B zxcvbn (Score)'s bin includes misdosamores and mardelplata (film and city names), as well as oportunidades (a Spanish term).

The ≥ 10¹⁰ | II Bin

This bin includes passwords that are strong but misjudged by the meter to be somewhat weak. ID: 41B Twitter's bin revealed problems with digit-only passwords like 9371161366 that are usually cracked using a Mask attack. Furthermore, it includes all-lowercase phrases like itsababydog that attackers crack by running a Combinator attack using two dictionaries. ID: 8B zxcvbn (Score)'s bin includes zxcvbvcxz (a variation of the meter's name-giving keyboard walk), usually cracked via a dictionary or Mask attack. Additionally, we found phrases like atlantasports, which is likely cracked with a Combinator attack.

To further study the binning effect for real-world data, we look at those meters that provide both quantized and non-quantized feedback. Interesting is the case of ID: 13B KeePass (Q5). While the strength estimates of ID: 13A KeePass are relatively precise (0.393–0.884), the binning had severe consequences. The PSM enforces very strong passwords (i. e., "Very weak" for scores < 64 bit), resulting in the majority of passwords falling into the weakest bin. (KeePass does not display this textual feedback in its password manager software.) In comparison, we see that for ID: 9A 1Password (0.276–0.807) the binning of the strength estimates by ID: 9B 1Password (Q5) (0.276–0.813) had close to no effect on the accuracy. There are cases where binning improves the score, e. g., the unbinned version ID: 10A Bitwarden (−0.635–0.676) performs substantially worse than the binned version ID: 10B Bitwarden (Q3) (0.258–0.725). The reason is that binning can eliminate some types of errors of a meter, depending on the precise binning boundaries.

4.6.3. Performance Over Time

It is interesting and illustrative to compare our results with those from previous work [57], which were collected in 2013. By analyzing the matching set of websites, one can observe positive and negative developments over the past 5 years. First of all, ID: 26A vs. 26B (China Railway) as well as ID: 31B vs. 31C (Fedex) did not change. Second, some meters, most notably ID: 23A vs. 23C (Apple), as well as ID: 27A vs. 27B (Dropbox) and ID: 28B vs. 28C (Drupal), show a degraded accuracy. When analyzing the Apple PSM, we found server-side blacklisting of all RockYou passwords in combination with a basic LUDS approach that checks for symbols and length. Finally, we can report a positive development for ID: 32A vs. 32B (Google), ID: 42B vs. 42C (Yandex), and ID: 41B vs. 41C (Twitter). Note that Twitter changed its quantization over time; thus, the results are not directly comparable. Considering all websites in our set, one observes slightly better performing PSMs than in 2013, but no significant change. One can observe that the majority of websites did change their PSM over time. (During our crawler development and data collection, we observed how reddit changed its meter from a simple LUDS approach to zxcvbn.)

4.6.4. Recent Proposals and Future Directions

Recent academic proposals are among the best-scoring approaches. Specifically, instances of Markov models, PCFGs, RNNs, as well as zxcvbn, score excellently. Figure 4.3 shows the distribution of the strength estimations for the LinkedIn offline dataset. On the x-axis, one can see increasing password count numbers. This way, strong (less common) passwords are on the left; weak (more common) passwords are on the right.

While none of the approaches outperforms the others in terms of accuracy, other factors become more critical, for example, decreasing the dependency of accuracy on the password distribution. We chose the different password evaluation sets to show the varying performance based on the (trained) distribution. A representative example of this is ID: 7B RNN Generic (Web) (0.421–0.777) vs. ID: 7C RNN Target (0.860–0.965). It shows the performance of a generic meter (using a different composition policy) in comparison to a targeted (distribution-matching) trained variant.

Also relevant are the storage requirements of the meters: while the inaccurate and straightforward LUDS meters can fit into a couple of bytes, the n-gram databases of meters based on Markov models can occupy hundreds of megabytes to gigabytes of disk space. The current database underlying the Have I Been Pwned? meter requires 30 GB (or enough trust to send partial password hashes to a third-party service). Optimized variants fit into 860 MB of memory using a Bloom filter [225]; zxcvbn has a size of around 800 KB. Decreasing the size of meters while maintaining high accuracy seems a worthwhile research goal. The RNN approach by Melicher et al. [162] is a first good example that reasonably accurate meters can fit into a couple of megabytes.

Furthermore, we need to better understand and mitigate the negative effects of quantization. While the non-binned academic proposals perform well, we need to find a way to transfer their accuracy to the quantized output that is required for users. In the set of evaluated password strength meters we found score-based (i. e., > 42), percentage-based (i. e., > 75 %), and logarithmic (i. e., ≥ 10⁶) binning approaches, equal-sized and unequal-sized bins, magic constants, and rule-based binning.
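As an illustration of this space/accuracy trade-off, a client-side blocklist of the most common training-set passwords can be held in a compact Bloom filter at the cost of a small false-positive rate. The sketch below is generic; the sizes and hash count are illustrative and not the parameters of the optimized variant cited above.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # derive k bit positions from salted SHA-256 digests
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

A filter of a few megabits can hold millions of common passwords; false positives only cause a password to be rated weaker than necessary, which is a safe failure mode for a meter.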
Finally, what we have learned from the success of zxcvbn (which is implemented at several sites according to our findings) is that providing readily deployable implementations in multiple programming languages helps adoption.

[Figure 4.3 comprises four panels plotting estimated strength over password counts from 10⁰ to 10⁵: (a) Markov (probability), (b) PCFG (probability), (c) RNN (guesses), and (d) zxcvbn (guesses).]

Figure 4.3.: Distributions of strength estimations: Increasing password counts on the x-axis. Strong (less common) passwords are on the left; weak (more common) passwords are on the right. Estimated strength values (measured as probability/guess number) are on the y-axis.

4.6.5. Limitations

Although we carefully selected the datasets for our evaluation, we only simulated real-world password choice using breached passwords. As mentioned in Section 4.3, password distributions are influenced by many factors; thus, the three evaluated sets only reflect a small set of passwords. In particular, they mostly reflect an English-speaking community. The impact of composition policies was not studied, primarily because constraining the passwords in each leak to those that satisfy a given policy does not reflect real user behavior [140].

While weighted Spearman correlation was the best in our set of tested metrics for quantized strength estimations, it still is not perfectly accurate. Thus, results for quantized meters need to be interpreted carefully. In general, the lower the number of bins, the less precise the results, as explained in Section 4.4. Furthermore, different application contexts may place different demands on meters and thus may require different choices of similarity metrics. Finally, measuring accuracy alone is not enough to assess the overall performance of a meter. Usability and deployability aspects are vital for a complete assessment but are not part of our analysis.

4.7. Conclusion

In this work, we considered the accuracy of password strength meters. We have demonstrated that the currently used measures for determining the accuracy of strength meters (such as Pearson and Kendall correlation) are not precise. We compared different similarity measures and argued that weighted Spearman correlation is best suited to precisely and robustly estimate the accuracy of strength meters.

We applied this measure to 45 different strength meters and determined their accuracy for an online and an offline use case. We found that the academic PSM proposals based on Markov models, PCFGs, and RNNs perform best. We also found several websites and password managers to have quite accurate strength meters. However, the password strength meters used in practice are less accurate than academic proposals, and we see no significant improvement in meter accuracy when comparing with meters from 5 years ago.

High accuracy is one important aspect that impacts the security of a password strength meter. Also vital are usability and deployability aspects; those are independent of the presented work. We hope our work aids further improvements of PSMs, provides helpful guidance and a metric for the selection of accurate PSMs, and thus helps to improve the situation of password-based user authentication.

But....urm.... what was another site doing with my Facebook password in the first place?

— Anonymous

5 Password Reuse

Contents
5.1  Introduction . . . 110
5.2  Study 1 . . . 113
     5.2.1  Recruitment and Survey Structure . . . 113
     5.2.2  Conditions . . . 114
     5.2.3  Analysis Methods and Metrics . . . 117
5.3  Study 1 Results . . . 119
     5.3.1  Respondents . . . 119
     5.3.2  Notification Response . . . 119
     5.3.3  Understanding of the Notification . . . 123
     5.3.4  Intended Response to Notification . . . 123
     5.3.5  Reactions to Structure and Delivery . . . 125
5.4  Password-Reuse Notification Goals . . . 126
5.5  Study 2 . . . 127
     5.5.1  Study 2 Conditions . . . 128
     5.5.2  Study 2 Structure and Recruitment . . . 131
     5.5.3  Analysis Method and Metrics . . . 132
5.6  Study 2 Results . . . 133
     5.6.1  Respondents . . . 133
     5.6.2  Perceived Causes of the Scenario . . . 134
     5.6.3  Creating New Passwords . . . 136
     5.6.4  Taking Other Security-Related Actions . . . 140
     5.6.5  Perceptions of the Notification . . . 144
5.7  Limitations . . . 145
5.8  Discussion . . . 146
     5.8.1  Best Practices . . . 146
     5.8.2  Addressing Persistent Misunderstandings . . . 147

5.1. Introduction

People reuse passwords [54, 178, 243]. An average user may have hundreds of different online accounts [75, 178, 243], and passwords are unlikely to be completely replaced anytime soon [26]. As password managers [71, 150] and single sign-on systems [13, 221] have low adoption, password reuse is a common coping strategy.

Password reuse has major ramifications for the security of online accounts. A breach of one account provider's password database puts at risk accounts on other services where login credentials are the same as, or even just similar to [240], the breached accounts. Attackers that target large leaks of passwords stored using computationally expensive hash functions (e. g., bcrypt) exploit this password reuse in offline guessing [94]. Attackers try to match identifiers like usernames and email addresses to previously cracked credentials. They then transform the already known passwords to increase their likelihood of correctly guessing passwords [51, 105, 238].

Password breaches are common. The website haveibeenpwned.com counts billions of compromised credentials due to data breaches, including from high-profile services like Adobe, Dropbox, LinkedIn, MySpace, Sony, Tumblr, and Yahoo! [120, 179]. Thomas et al. estimated that 7–25 % of passwords traded on black-market forums match high-value targets like Google accounts [224].

Account providers send a variety of notifications about situations potentially caused by password reuse. We refer to all such notifications as password-reuse notifications, regardless of whether password reuse is explicitly mentioned. To protect their users, some providers proactively monitor black-market sources for passwords stolen from other sites, searching for matches in their own password database [154]. Once aware of such situations, these providers send notifications to affected users, encouraging them to change their password.
Password-reuse notifications also include notifications about suspicious login attempts, which may have been triggered by a password-reuse attack, or notifications requiring a password reset after a data breach. In a recent example, Twitter asked users to change not only their Twitter passwords, but also passwords on services where they had reused their Twitter password [4].

Surprisingly little is known about how users interpret or respond to password-reuse notifications, and how the design of such notifications impacts users’ understanding and risk perception. Current password-reuse notifications vary widely, and despite the frequency with which such notifications are sent, no best practices have been outlined. This paucity of knowledge contrasts with the large and rich literature investigating the design of warnings and notifications about other security-critical tasks, including detecting phishing [67, 210], TLS-protected browsing [5, 74], malware [33, 35], and 2FA [191]. Many studies have aimed to help users make better passwords [68, 162, 229] or measured the prevalence of password reuse [54, 127, 183, 224]. This work is the first to explore how to inform users about situations caused by password reuse and help them recover from the resultant consequences.

Password-reuse notifications face the herculean task of helping users understand and respond to a convoluted situation. Users have posted on Twitter about their confusion about receiving such notifications. For example, one tweet about a notification asked: “What was another site doing with my Facebook password in the first place?” This may be because understanding the risks of password reuse requires knowledge of how attackers leverage breaches to compromise accounts on other services. Password-reuse notifications must address this underlying complexity to convince users to replace reused passwords across all sites with a new, unique password for each account.
We explain the complexity of these issues from the perspective of a fictitious company, AcmeCo, which we adopt for the remainder of this work. We conducted two complementary user studies about password-reuse notifications. First, we sought to understand how users interpret and perceive existing notifications. We collected 24 notifications sent by real companies in situations that may have been caused by password reuse. We chose six notifications whose characteristics were representative of the full 24. In Study 1, we conducted a scenario-based online survey in which 180 Mechanical Turk workers saw one of these six notifications (Sec. 5.2). We asked respondents why they might have received such a notification, the feelings the notification elicits, and what actions they might take in response. Respondents reported they would be alarmed and confused, and that they would intend to take action in response to receiving these notifications (Sec. 5.3). Some notifications were more effective than others at encouraging a response. Ultimately, though, respondents misattributed the potential root cause of receiving these (real, previously deployed) notifications. Only 20.6 % mentioned the breach of another company’s password database as a potential cause, and only 18.8 % mentioned password reuse as a factor.

Based on respondents’ perceptions and responses (Sec. 5.4), we identified five design goals for password-reuse notifications that integrated characteristics of notifications that were effective in Study 1 and improved upon characteristics that were less effective. We then conducted a follow-up study to analyze a model notification we believed achieved all five design goals (Sec. 5.5). This notification explicitly describes password reuse and the breach of another provider as the cause of the notification. Additionally, it forces a password reset, encourages other beneficial security actions, and is delivered through multiple mediums. Study 2 was again a scenario-based survey in which 588 Mechanical Turk workers saw one of 15 variants of this model notification. While Study 2 respondents perceived our model notification as official and urgent, they nonetheless misattributed the root cause of the notification (Sec. 5.6). Many respondents did not perceive password reuse as a potential cause of the situation. Additionally, although nearly all respondents stated intentions to change one or more passwords, most reported plans to create these “new” passwords by reusing other passwords of theirs, leaving them vulnerable to similar attacks in the future. From our collected results, we establish best practices for maximizing the effectiveness of password-reuse notifications. However, because password-reuse notifications may not be sufficient on their own, we conclude with a discussion of additional steps for holistically addressing password reuse (Sec. 5.8).

The contributions of this work resulted from a collaboration with Mi- randa Wei, Juliette Hainline, Lydia Filipe, Markus Dürmuth, Elissa Redmiles, and Blase Ur and the support of the University of Chicago and the University of Maryland.

5.2. Study 1

Study 1 explored current password-reuse notifications. It investigated user perceptions of, and reactions to, such notifications. Both Study 1 and Study 2 were approved by the Social and Behavioral Sciences Institutional Review Board at the University of Chicago. As both Study 1 and Study 2 rely on respondents’ self-reports of their feelings and the actions they would intend to take, percentages reported below should not be taken as ground truth. Rather, we use our survey findings to inform the design of improved password-reuse notifications. While observing notification response in the wild may produce more accurate absolute reports of behavioral response, such observational studies do not allow us to understand why people react in certain ways or how to improve those reactions. Thus, similar to prior work on SSL warnings [74], we use Study 1 to identify potential areas of improvement for current password-reuse notifications, developing a model notice that we evaluate in Study 2.

5.2.1. Recruitment and Survey Structure

We recruited participants on Amazon’s Mechanical Turk, requiring that workers be 18 years or older, live in the US, and have a 95 %+ approval rate. We advertised our study as a survey about “online account notifications.” To avoid recruitment biases, we did not mention security or privacy. Study 1 was a scenario-based survey expected to take 15 minutes. Respondents were compensated $2.50. Respondents were first introduced to the survey scenario: “In the following, you will be asked to imagine that your name is Jo Doe. You have an online account with a major company called AcmeCo and can access your account through both a website and a mobile application. Imagine that this account is important to you and that it is like other accounts you may have, such as for email, banking, or social media.” Then, respondents were presented with one of six password-reuse notifications (cf. Section 5.2.2).

Three sets of questions followed. The first set measured respondents’ overall understanding of the notification by asking what may have caused it to be sent through two open-ended questions: “In your own words, please describe what this notification is telling you” and “In your own words, please describe all factors that may have caused you to receive this notification.” The second set asked respondents to list three feelings they might have and three actions they might take upon receiving the notification, and why. The third set presented seven statements, in randomized order, about perceptions of the effectiveness of the notification’s explanation of the situation, its delivery method, and its apparent legitimacy. Respondents gave a Likert-scale response and free-text justification for each. Finally, respondents reported the following demographic information: gender, age, highest degree attained, and technical expertise. Appendix C.1 contains the full text of the survey.

5.2.2. Conditions

In Study 1 we evaluated six real notifications used by online account providers. To collect such notifications, four members of the research team searched for notifications sent by major online account providers after known data breaches that had been posted online or on social media. We deemed a notification in scope if the potential risk may have originated from password reuse. We verified all notifications as legitimate (not phishing) by cross-referencing Twitter accounts, company security blogs, and news articles. We collected 24 real notifications about password reuse. To select a set of representative notifications, we used affinity diagramming [112] to categorize and group similar notifications. Three members of the research team created separate affinity diagrams for major types of variations. We uncovered stark differences in the degree to which a cause was explained, what actions were required or suggested, and how the notification was delivered.

Table 5.1.: Prominent characteristics of the six Study 1 notifications. The name of the condition identifies the provider that currently uses that text. Conditions: Netflix, LinkedIn, Instagram, Google email, Facebook, and Google red bar. The table marks, per condition, which elements are explicitly mentioned (password reuse, outside breach, outside security incident, suspicious activity, review activity, forced password reset, recommended password reset) and the delivery method (browser, email, mobile).

From the 24 notifications, we selected six that captured the range of variation within and across these three dimensions. Table 5.1 summarizes the notifications, which we refer to with the name of the provider who originally sent that notification. To avoid priming respondents with biases they might have about the companies that originally sent these notifications, as well as to minimize potential confounds from the visual layout of the notification, we visually rebranded all notifications to be from a hypothetical online account provider “AcmeCo.” Figure 5.1 depicts the rebranded LinkedIn notification. The five other notifications are in Appendix C.3.

Figure 5.1.: A notification we tested, rebranded from LinkedIn.

Prior to launching the study, we conducted cognitive interviews to refine the survey wording iteratively and verify the intelligibility of questions. A limitation of survey studies is that responses can suffer from self-report and social desirability biases that may affect accuracy. Respondents’ reported reactions may differ from their reactions had they received the notification in real life. In line with survey best practices, we worked to minimize relevant biases through the aforementioned pre-testing and by using softening language to minimize social-desirability bias [146]. Despite potential biases, related work has shown that while survey responses to security messages may be biased, they correlate strongly with real-world reactions [193]. Our results should thus be interpreted as trends of user behavior rather than precise frequency estimates.

5.2.3. Analysis Methods and Metrics

We collected both quantitative and qualitative data. Our quantitative analysis centered on the seven statements to which participants responded on scales (four on Likert scales and three on other scales), which we treated as ordinal.

To evaluate whether responses differed significantly across notifications while controlling for the effects of demographic factors, we built ordinal logistic regression models. In each model, the dependent variable was the set of Likert-scale responses to a given statement. We used the following independent variables: the notification the respondent saw; the respondent’s age; the respondent’s gender; the respondent’s level of education; and the respondent’s technical background. All independent variables were treated as categorical; we selected the most prevalent categorical value as the baseline. We chose the LinkedIn notification as the baseline category for the notification term as it was most representative (as determined through affinity diagramming) of the 24 messages we originally collected.

In particular, we built parsimonious regression models using stepwise backward-elimination, minimizing AIC. All of these parsimonious final models contained the notification term. To determine whether this notification term was significant, we compared these final models to their analogous null models (removing the notification term) to calculate an omnibus p-value, which we report as the regression p-value. Furthermore, we report significant individual factors in the regression by providing that factor’s log-adjusted regression coefficient (e. g., odds ratio, denoted OR) and p-value. If this omnibus test was significant, we performed pairwise comparisons between notifications using the Mann-Whitney U test, for which we report the test statistic (U) and the p-value. We set α = .05 for all tests and used the Holm method to correct for multiple testing within a family of tests.
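The pairwise-comparison and correction steps above can be sketched in pure Python. This is a simplified illustration only: the condition names and Likert responses below are made up, the Mann-Whitney p-value uses a normal approximation, and the variance term omits the tie correction; the actual analysis used standard statistical software:

```python
import math
from itertools import combinations

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation
    (variance without tie correction: a simplification)."""
    n1, n2 = len(x), len(y)
    vals = x + y
    order = sorted(range(n1 + n2), key=lambda i: vals[i])
    ranks = [0.0] * (n1 + n2)
    i = 0
    while i < n1 + n2:  # assign average ranks to tied values
        j = i
        while j + 1 < n1 + n2 and vals[order[j + 1]] == vals[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu + 0.5) / sigma  # continuity correction
    p = 2 * 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return u, min(1.0, p)

def holm(p_values):
    """Holm step-down correction; returns adjusted p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted, running_max = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max  # enforce monotonicity
    return adjusted

# Hypothetical Likert responses (1 = strongly disagree ... 5 = strongly agree).
responses = {
    "LinkedIn": [4, 5, 4, 3, 5, 4, 4, 5],
    "Facebook": [2, 1, 2, 3, 1, 2, 2, 1],
    "Instagram": [3, 4, 3, 3, 4, 3, 4, 3],
}
raw = {pair: mann_whitney_u(responses[pair[0]], responses[pair[1]])[1]
       for pair in combinations(responses, 2)}
corrected = dict(zip(raw, holm(list(raw.values()))))
```

The Holm correction is applied within each family of pairwise tests, keeping the family-wise error rate at α = .05 despite the multiple comparisons.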
Finally, we analyzed responses to open-answer survey questions via qualitative coding. A member of the research team read the responses and performed a thematic analysis, iteratively updating the codebook as necessary. The researcher then used axial coding for consolidation and clarification, resulting in 11 themes for the causes of receiving the presented notification. To focus on recurring themes, we report codes that occurred for at least 10 % of responses. We also performed a thematic analysis of respondents’ free-text explanations in the third set of questions to more fully understand why respondents answered the questions the way they did. This process was largely the same as the one for the first section of questions, resulting in four or more codes for each question.

In addition, respondents provided in free text three feelings and three intended actions in response to the notification. We cleaned responses to normalize tense differences and misspellings. As the survey asked for these in any order, responses were not ranked during analysis. We used the NRC Word-Emotion Association Lexicon [167] to group feelings as positive, neutral, or negative.
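The cleaning and sentiment-grouping step can be sketched as follows. The tiny lexicon and normalization table here are hypothetical stand-ins; the real analysis used the full NRC EmoLex, which covers thousands of words:

```python
from collections import Counter

# Hypothetical stand-in for the NRC Word-Emotion Association Lexicon.
LEXICON = {
    "worried": "negative", "afraid": "negative", "anxious": "negative",
    "angry": "negative", "safe": "positive", "relieved": "positive",
    "curious": "neutral",
}

# Collapse tense variants and common misspellings to lexicon entries
# (illustrative examples only).
NORMALIZE = {
    "worry": "worried", "worrying": "worried", "scared": "afraid",
    "relief": "relieved", "curiosity": "curious",
}

def sentiment_proportions(feelings):
    cleaned = [NORMALIZE.get(f.strip().lower(), f.strip().lower())
               for f in feelings]
    counts = Counter(LEXICON.get(word, "unknown") for word in cleaned)
    total = sum(counts.values())
    return {sentiment: n / total for sentiment, n in counts.items()}

proportions = sentiment_proportions(
    ["Worried", "worry", "afraid", "safe", "relief", "curiosity"])
```

Because respondents listed feelings in any order, each entry is treated as an unranked token before being mapped to a sentiment category.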

5.3. Study 1 Results

In Study 1, we found that the current password-reuse notifications we tested elicit worry and fear. While the notifications do motivate some respondents to report intending to change their passwords, respondents do not report intending to change their passwords in sufficiently security-enhancing ways. For example, many respondents report planning to make small adjustments to existing passwords, which will likely leave them susceptible to password-reuse attacks. This lack of sufficient action may be attributed in part to notification confusion. A majority of respondents report not understanding the notification, and their mental model may, therefore, be insufficient to elicit an appropriate response.

5.3.1. Respondents

180 people responded to our survey. Their ages ranged from 18 to 74 years, though most respondents were between 25 and 34 years old. 44.4 % of our respondents were female, and a majority (62.8 %) of respondents had a two-year or higher degree. 70.6 % of respondents reported no experience (education or job) in a technical field.

5.3.2. Notification Response

Figure 5.2 highlights respondents’ reactions to the notifications.

Notifications elicited negative responses. Of the 540 feelings reported by respondents, worried, afraid, and anxious were the main responses to receiving a password-reuse notification. Figure 5.3 displays feelings reported by ten or more respondents. Fortunately, some positive feelings, such as safe or relieved, were also common. As the notifications communicate potential risks to accounts, it makes sense that an overall negative sentiment dominated. However, a password-reuse notification should ideally induce more positive responses, as it is ultimately helping users.

Figure 5.2.: Respondents reported their agreement of whether the notification was sent via the appropriate method, could be ignored without consequence, would be sent by real companies, and explained how to resolve the situation. Respondents also reported their priority of taking action in response to the notification. Finally, respondents reported the level of concern they would expect to have upon receiving the notification, and whether they had received such notifications before. (Panels: Method, Consequence, Real, Resolve, Action, Concern, and Before.)

Figure 5.3.: Sentiment analysis of respondents’ reported feelings upon receiving a notification (using NRC EmoLex [167]). Feelings shown, grouped as negative, neutral, or positive: worried, afraid, anxious, annoyance, concerned, nervous, confusion, angry, surprised, safe, suspicious, irritating, frustrated, upset, relieved, and curiosity.

Notifications were concerning. Across notifications, most respondents (66.7 %) reported that they would feel extremely or moderately concerned upon receiving the notification. R56 explained, “The potential for losing an account and sensitive information is something to be concerned about. Anyone who wouldn’t feel concerned is either ignorant or lying.” Only 3.3 % reported no concern. Respondents’ reported concern differed significantly across notifications (regression p = 0.003). Respondents found the Facebook (OR = 3.3, p = 0.011) and the Google email notifications (OR = 4.1, p = 0.003) more concerning than the LinkedIn notification, the control in our regression. Respondents also reported a greater concern about receiving the Facebook notification (U = 674.5, p = 0.019) and the Google email notification (U = 730.5, p = 0.011) than the Instagram notification. 89.7 % reported the Facebook notification as concerning, 83.9 % reported the Google email notification as concerning, 54.9 % reported the LinkedIn notification as concerning, and 53.1 % reported the Instagram notification as concerning.

Ignoring the notifications would have consequences. Most respondents disagreed or strongly disagreed that ignoring the notification they received would not have consequences (77.1 %). Responses differed significantly across notifications (regression p = .045). Respondents noted potential consequences that included harm to their account “because hackers could steal my info” (R150), as well as “being locked out of my accounts” (R84). However, a sizable minority was unsure (16.7 %). These “unsure” respondents wanted to get more information from AcmeCo, which shows the importance of clearly communicating the situation at hand. Finally, a few respondents were dismissive of any consequences: “Acme has so many accounts that the chances that my account is hacked are pretty slim” (R81).

Facebook and Google email notifications are a high priority. Across notifications, responses about the priority of taking action differed significantly (regression p = .012). Compared to the LinkedIn notification, a significantly larger fraction of respondents reported that taking action in response to the Facebook (OR = 4.3, p = 0.003) and Google email (OR = 3.0, p = 0.022) notifications would be a high priority. Significantly more respondents reported the same for the Facebook notification relative to the Instagram notification (U = 633.0, p = 0.044). 100 % of respondents who received the Facebook notification, 93.5 % of those who received the Google email notification, 80.6 % of those who received the LinkedIn notification, and 71.0 % of those who received the Instagram notification reported that taking action in response to their respective notifications would be a high priority. We hypothesize respondents perceived the Facebook and Google email notifications to be a higher priority because the Facebook notification prohibited users from logging in, and Google’s email included a prominent red color.

Nearly all respondents indicated that taking action in response to the notification was a priority. Across all notifications, 95.6 % of respondents indicated that taking action in response to receiving the notification would be either a very high, high, or medium priority. In their free-response justifications, 76.6 % of respondents explained that they wanted to protect their personal information or prevent unauthorized account access. 29.4 % of responses specified that the high priority was due to time sensitivity: “The quicker I act, the safer my account will be” (R54).

5.3.3. Understanding of the Notification

Few respondents recognized the notification’s real cause. We asked respondents to describe all factors that may have caused them to receive that notification. Most respondents believed that the notification was sent because of circumstances beyond their control. R171 was typical in failing to account for password-reuse attacks as a cause, stating, “The chances of someone guessing that I use the same password are still incredibly low. Still, I would be worried that the password might be too common.” 60 % of respondents attributed the notification to someone hacking their account or unsuccessfully attempting to log in. While this makes sense, as some notifications convey that someone may have tried to log in to their account, this is not the full truth: the login may have been attempted as part of a password-reuse attack. Further, 21.1 % of respondents believed that it may have been sent in error, as a false alarm due to the real user of the account using a new device, signing in from a new location, or entering the incorrect password too many times. A minority mentioned the potential real cause of the notification: either a data breach (20.6 %) or password reuse by the account holder (18.8 %).

5.3.4. Intended Response to Notification

Most respondents do not intend to change their password. While most respondents agreed that taking action was a priority, they disagreed on what to do and volunteered a wide variety of examples. Respondents wrote that they would take actions such as changing their password (29.3 % of respondents), investigating the situation (18.6 %), and logging into their account (15.4 %) in response to receiving the notification. While the self-reported intention to change their password was the most common, it is nevertheless extremely low as an absolute percentage. This is a cause for concern, as securing an account through password changes should be a priority for all users in situations of password reuse.

Overall, respondents found the notifications informative. A majority of respondents (62.8 %) either agreed or strongly agreed that the notification they received explained how to resolve the situation, and 58.3 % found that it gave specific, clear instructions. However, 26.7 % believed it did not do so, and 10.0 % of respondents indicated that resolving the situation would require more background information. As R137 explained, “I need more information as to what happened before I just blindly change my password.”

Prominent explanations perceived as most informative. We observed significant differences across notifications in respondents’ perceptions of whether the notification explained how to resolve the situation (regression p < 0.001). The agreement that the notification explained the situation differed starkly across notifications: LinkedIn (80.6 % of respondents agreed or strongly agreed), Facebook (75.9 %), Netflix (75.0 %), Instagram (68.7 %), Google email (54.9 %), and Google red bar (21.6 %). Agreement was higher for the LinkedIn notification than for the Google email (OR = 0.2, p < 0.001) and the Google red bar (OR = .03, p < 0.001) notifications. Compared to the Google red bar notification, agreement was also significantly higher for the Facebook (U = 646.5, p < 0.001), Google email (U = 642, p = 0.011), Instagram (U = 177.5, p < 0.001), and Netflix (U = 131.0, p < 0.001) notifications. The low reported percentages for the Google email and Google red bar notifications make sense because both notifications had a link that had to be clicked for more information and explanation. The other notifications had more detail and instructions in the notification itself.

5.3.5. Reactions to Structure and Delivery

Most respondents agreed that the notification they received used the appropriate method of contacting them (65.0 %), primarily because it was easy, convenient, or fast (58.3 % of respondents). However, some respondents would have preferred a more immediate method (17.8 %) or multiple methods (11.6 %). Agreement about the method’s appropriateness differed across notifications (regression p < .001).

Email perceived as the most legitimate delivery method. The Google email, LinkedIn, and Netflix notifications, all sent by email, were reported to be delivered with the most appropriate method and to seem the most legitimate. This is perhaps due to some respondents’ justification that email is official (10 %), and that they may have seen similar email notifications in the past. Respondents were more likely to report that the LinkedIn notification was appropriate than the Facebook (OR = 0.2, p < 0.001), Google red bar (OR = 0.1, p < 0.001), and Instagram (OR = 0.3, p < 0.010) notifications. For the LinkedIn, Instagram, Facebook, and Google red bar notifications, respectively, 98.6 %, 62.5 %, 51.7 %, and 48.2 % of respondents reported agreement that the notification they received was delivered with the appropriate method. Fewer respondents found the Facebook notification appropriate than the Google email notification (U = 265, p = 0.049). Respondents’ expectations regarding real companies sending the notification also differed across conditions (regression p = .012). While 96.7 % of respondents who saw the LinkedIn notification reported expecting real companies would send it, only 67.7 % reported the same for the Instagram notification (OR = 0.2, p = 0.003).

Notifications were relevant to real situations. Most respondents agreed (86.7 %) that they would expect real companies to send notifications like these when necessary. Respondents reported receiving notifications similar to this one in the past: 52.1 % of respondents indicated receiving a similar notification a few times, and 9.5 % many times.

Those who had received similar notifications explained that they were from sign-ins on other devices (20.0 %) or financial services (13.3 %).

5.4. Password-Reuse Notification Goals

Password-reuse notifications take on a challenging task, as the situation at hand is the cumulative result of multiple parties’ actions. Further, the level of risk to convey and the appropriate actions to suggest or require are not always clear. While research has investigated best practices for other types of security notifications (cf. Section 2.4.2), we sought to create a framework for evaluating password-reuse notifications. Drawing on the Study 1 results, we identified five goals that effective notifications should sufficiently achieve: timeliness, legitimacy, action, background, and trust. We used these goals as a framework to evaluate notifications in Study 2.

1. Notifications should reach their intended audience in a timely manner. A notification about a compromised password is only useful if the user sees the notification in time to create a new one.

2. Notifications should be perceived as legitimate. Some respondents in Study 1 were hesitant to trust our notifications, believing that they might be phishing. The presence of hyperlinks was cited as an indicator of phishing, and a few respondents were skeptical of any email that required password changes.

3. A password-reuse notification should lead to actions that improve the security of the directly affected account. Ideally, this would include taking productive actions for other accounts that may be at risk (i. e., those where similar passwords were used), as well as advising against unproductive or unrelated actions.

4. The background information provided by a notification should be easily understood. In Study 1, 12.8 % of respondents were confused by how one service “got” their passwords for another service. Not all users will understand the mechanisms behind password databases or cryptographic hashes, but the root cause of the notification (password reuse) must be clearly conveyed.

5. Notifications should improve trust between providers and users. Account providers send notifications to increase the security of users’ accounts with that provider, as well as potentially with other providers, too. Therefore, notifications should aim to engender users’ trust.

5.5. Study 2

In Study 1, we found that the content of a password-reuse notification impacted respondents’ understanding of the situation at hand, as well as whether they would intend to take action in response. In Study 2, we sought to better isolate the factors of effective notifications by exploring the impact of making small changes to the content or delivery of these notifications. Our design of Study 2 focuses on key results from Study 1, along with the goals outlined in Section 5.4. We had six core research questions for Study 2.

First, we consider the delivery medium. The timeliness of a notification is largely determined by how it is sent to the recipient. Mobile push notifications interrupt the current workflow, whereas emails or in-app notifications require users to actively check those sources. The delivery medium of the notification may also change respondents’ perception of the legitimacy of the notification.

• RQ 1A: How does the delivery medium of a password-reuse notification affect its perceived effectiveness?

• RQ 1B: If you, an online account provider, are breached, how important is the delivery medium in which you send your notification?

Next, we consider mentions of suspicious account activity and the nature of the data breach. These details address the goal of providing adequate background for users to understand the situation.

• RQ 2: How does explicitly identifying the root causes of the incident influence the notification’s effectiveness?

• RQ 3: How does mentioning suspicious account activity influence the notification’s effectiveness?

Depending on the importance of the account and the incident, notifications should force a password reset.

• RQ 4: If a password change is only recommended, instead of required, will users report that they would change their passwords?

Finally, we consider various security suggestions beyond password changes. We hypothesize that these suggestions could improve the user’s trust in the account provider by appearing to demonstrate proactive approaches to security.

• RQ 5A: Is it important to explicitly recommend password changes on other sites in a notification?

• RQ 5B: Is it important to explicitly recommend pro-security actions (e. g., 2FA, adopting a password manager) in a notification?

• RQ 5C: If your service is breached, is it important to explicitly recommend password changes on other sites and pro-security behaviors beyond changing your password?

• RQ 6: Will users report taking pro-security actions if they are not explicitly mentioned in a password-reuse notification?

5.5.1. Study 2 Conditions

We began developing our Study 2 conditions by creating a model notification (shown in Figure 5.4) that synthesized the individual aspects of notifications that were most successful in Study 1, filling in gaps relative to our aforementioned design goals. To disambiguate the impact of each aspect of the model notification’s content and delivery, we created 14 additional variants of the model, each of which differed in a targeted way. These variants, as detailed in Table 5.2, reflect changes in the delivery method, description of the security incident, mention of account activity, suggested remediation, reference to other accounts, and additional pro-security actions mentioned. Each respondent was randomly assigned to see either the model notification or one of these fourteen variants. When presenting our results, we refer to these variants using multi-part names based on the nomenclature defined in Table 5.2. Special attention was given to increase the likelihood that our respondents perceived the notification as legitimate, rather than as phishing. Appendix C.4 contains additional images of the variants.

Figure 5.4.: Our model notification with the parts that varied highlighted in color and further specified in Table 5.2.

Table 5.2.: Notification dimensions varied in Study 2.

Delivery Medium
• model: Delivered by email
• inApp: Mobile in-app
• mobile: Mobile push notification and in-app

Incident
• model: This incident was likely a data breach of a service unrelated to AcmeCo, but because many people reuse similar passwords on multiple sites, your AcmeCo login information may have been affected.
• usBreach: This incident was likely a data breach of one of our services.
• vagueCause: —

Account Activity
• model: While we have not detected any suspicious activity on your AcmeCo account, . . . as a precaution.
• suspicious: Because we have detected suspicious activity on your AcmeCo account, . . .
• omitActivity: —

Remediation
• model: . . . you must create a new password . . .
• recommend: . . . we recommend that you create a new password.

Other Accounts
• model: Change all similar passwords on other accounts.
• noOthers: —

Extra Actions
• model: To further improve your online security, we recommend: • Enabling AcmeCo’s Two-Factor Authentication. • Using a password manager.
• noExtras: —
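The multi-part condition names used throughout the results (e. g., model-{usBreach}-{mobile}) combine the model baseline with the Table 5.2 keyword of every dimension that deviates from it. A minimal sketch of this naming scheme; the function name is ours, not from the thesis:

```python
def condition_name(deviations):
    """Build a Study 2 condition name from the Table 5.2 keywords
    of the dimensions that deviate from the model notification."""
    return "-".join(["model"] + ["{%s}" % d for d in deviations])

# The unchanged model notification has no deviations:
condition_name([])                      # 'model'
condition_name(["usBreach", "mobile"])  # 'model-{usBreach}-{mobile}'
```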

5.5.2. Study 2 Structure and Recruitment

We recruited respondents on Amazon’s Mechanical Turk, again advertising a survey about online account notifications with no mention of security or privacy. Requirements for participation were the same as for Study 1, and participation in both Study 1 and Study 2 was prevented. The survey was again scenario-based, but this survey was structured into five sections and added additional questions to explore topics raised during the analysis of Study 1. Each respondent was compensated with $2.50 for completing the 15-minute survey. Respondents were introduced to the survey scenario with the same text as Study 1 (cf. Section 5.2.1) and were then presented with their assigned notification.

The first section of survey questions measured respondents’ overall reported conceptions of the notification with questions similar to Study 1, but with a key modification: respondents were given eleven closed-answer choices of the causes of receiving the notification. We chose to give closed-answer choices to measure explicitly whether or not they expected some factor might have caused the situation, rather than relying only on what they thought to write unprompted. We based these choices on the responses to Study 1’s open-ended version. The second section asked whether respondents would intend to change their passwords for AcmeCo, as well as for other accounts. This section also contained follow-up questions about why they would or would not intend to change their passwords, as well as how they would create and memorize such passwords. The third section asked them to report their security perceptions of, and likelihood to take, ten actions beyond changing their password. These actions were again closed-answer and were selected based on free-text responses from Study 1. The fourth section asked respondents about their perceptions of the notification with questions based on the corresponding section from Study 1 but modified to align with Study 2’s research questions.

The final section solicited the following demographic information: gender, age, highest educational degree attained, and technical expertise. We also asked respondents to report any previous experiences being notified about data breaches and history of having others gain unauthorized access to their online accounts. Appendix C.2 contains the survey text. As in Study 1, responses are reported behavioral intentions, rather than actual behavior. We again mitigated biases with softening language and pre-testing.

5.5.3. Analysis Method and Metrics

We again use regression models in our analysis. We had both binary (whether respondents selected each of the eleven potential causes, whether respondents reported intending to change any passwords or take the ten additional actions) and ordinal (responses on scales regarding perceptions of the notification, as well as Likert-scale agreement with the security benefit of the ten actions) dependent variables. For binary dependent variables, we built logistic regression models. For ordinal dependent variables, we built ordinal logistic regression models. The independent variables were the notification, all covariates used in Study 1 (the respondent’s age range, gender, education level, and technical background), whether the respondent had ever been notified that their information was exposed in a data breach, and whether the respondent had experienced unauthorized access to an online account. These final two variables are proxies for prior experience with breaches [187, 192]. All independent variables were treated as categorical.

As in Study 1, we built parsimonious models through backward elimination. The full regression tables are again contained in our companion technical report. To determine whether the omnibus notification term was significant, we compared these final models to their analogous null models (removing the notification term) to calculate an omnibus p-value, which we report as the regression p-value. If the notification term was removed in backward elimination, we treated the notification as non-significant. For significant individual factors, we report the odds ratio and p-value.

When the omnibus notification term was significant, we made 18 comparisons between pairs of notifications to investigate our six research questions directly. For ordinal data, we used Mann-Whitney U tests (reporting U and the p-value). For categorical data more naturally expressed as a contingency table (e. g., whether and how respondents intended to change their password), we performed χ² tests if all cell counts were greater than five, and Fisher’s Exact Test (denoted FET) if they were not. We again set α = .05 and used Holm correction within each family of tests. Finally, in a process analogous to that for Study 1, we qualitatively coded free-response data.
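The Holm correction applied within each family of tests can be sketched as follows; this is a standard step-down Holm-Bonferroni procedure in plain Python, not code from the thesis, and the p-values in the usage example are invented for illustration:

```python
def holm_correction(p_values, alpha=0.05):
    """Step-down Holm-Bonferroni: return, per hypothesis, whether it is
    rejected at family-wise error rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
    rejected = [False] * m
    for rank, i in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # all remaining (larger) p-values also fail
    return rejected

# Within a family of three tests, only the smallest p-value survives here:
holm_correction([0.001, 0.04, 0.03])  # [True, False, False]
```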

5.6. Study 2 Results

Across all variants of the model notification, respondents reported anticipating serious consequences to ignoring the notification and reported believing that changing their password would benefit their account security. While a majority of respondents indicated that they would intend to change their passwords, their intended password-creation strategies would continue to expose them to password-reuse attacks. Unfortunately, many respondents did not perceive password reuse to be the root cause of the situation. We found that adding extra security suggestions increases perceived risks, which may help the notification convey the seriousness of the situation and the need to take action. Omitting information about account activity or being vague about the origin of the security incident, however, warps perceptions of the situation.

5.6.1. Respondents

There were 588 respondents in Study 2. Most respondents were between the ages of 25 and 34 (44.6 %), although 11.2 % were younger and 44.1 % were older. 48.4 % of the respondents identified as female.

Over half of our respondents had a two- or four-year degree, and 10.8 % held higher degrees. A quarter of respondents reported experience (education or job) in technical fields. 53.2 % of respondents in Study 2 indicated that they had been affected by a prior data breach. Most respondents were notified via email (55.9 %), although receiving physical mail (17.3 %) and reading the news or browsing social media (18.2 %) were other common notification methods. The most common data breach mentioned by respondents was Equifax (12.1 % of respondents). Less than one-third of respondents reported unauthorized access to an account. Of the 188 respondents who reported someone had gained unauthorized access to one of their online accounts, 23 personally knew the attacker, whereas 155 did not.

5.6.2. Perceived Causes of the Scenario

Respondents did not perceive reuse to be a cause of the situation. From among eleven potential causes of receiving the notification, we asked respondents to choose all they felt applied. Unfortunately, across all notifications, a minority of respondents chose “you reused the same or similar passwords for multiple online accounts” as a potential cause, even though many variants of the notification mentioned password reuse. For example, the model notification (control condition) explained that their “AcmeCo account login and password may have been compromised” due to a data breach of a service unrelated to AcmeCo “because many people reuse similar passwords on multiple sites.” Nonetheless, only 44.7 % of respondents who saw the model notification chose password reuse as a cause of receiving the notification.

The rate of selecting password reuse as a cause varied by condition (regression p < .001). Among all variants, model-{suspicious} (named using the keywords in Table 5.2) was most effective at conveying that password reuse was a potential cause. This variant augmented the model notification by noting suspicious activity had been detected on the account. Nonetheless, only 57.9 % of respondents chose password reuse as a possible cause, which did not differ significantly from the control. Unsurprisingly, four variants that mentioned that AcmeCo itself suffered a breach had significantly lower rates of choosing password reuse as a cause:

• model-{usBreach}-{mobile} (2.4 %, OR = 0.03, p < .001);

• model-{usBreach}-{inApp} (2.4 %, OR = 0.03, p = .001);

• model-{usBreach}-{noOthers} (10.0 %, OR = 0.13, p = .001);

• model-{usBreach}-{noOthers}-{noExtras} (2.6 %, OR = 0.03, p = .001).

For model-{vagueCause}, 10.3 % of respondents chose reuse as a cause, which was also significantly lower than the control (OR = 0.16, p = .003). This is notable because that notification mentions a vaguely worded “potential security incident” that may have led to a credential compromise, typical of many widely deployed notifications even when password reuse is the culprit.

Respondents also rarely chose “you have a weak password for your AcmeCo account” as a potential cause. Across conditions, only 15.0 % of respondents selected this option. This did vary significantly by condition (regression p = .011), though we only observed a significant difference for conditions investigating the impact of mentioning suspicious activity (RQ 3). While 38.9 % of respondents indicated a weak password as a potential cause for model-{suspicious}, which mentioned suspicious activity, only 4.9 % did so for model-{omitActivity}, which did not (FET, p = .009).

In contrast, across all notifications, respondents commonly chose that “AcmeCo was hacked” (49.0 % of respondents) or that a company unrelated to AcmeCo was hacked (41.7 %). Note, however, that conditions varied in whether they reported that AcmeCo or some other company was breached, so these frequencies and the significant differences between conditions (both regressions p < .001) are unsurprising. More surprisingly, across conditions respondents selected three additional potential causes at higher rates than password reuse: “Someone hacked your AcmeCo account” (32.5 % of respondents); “AcmeCo conducts regular security checks and this is just a standard security notification” (28.2 %); “Someone is trying to gain unauthorized access to your account by sending this email” (27.4 %). These did not vary significantly across conditions.
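The Fisher’s Exact Test used for small-cell comparisons like the one above can be sketched from first principles with the hypergeometric distribution; this is an illustrative stdlib-only implementation, not the study’s analysis code:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    sum the probabilities of all tables with the same margins that are
    no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

# Fisher's classic "lady tasting tea" table gives the known p = 34/70:
round(fisher_exact_two_sided(3, 1, 1, 3), 4)  # 0.4857
```

In practice one would use a vetted statistics package, but the sketch makes explicit what the reported FET p-values measure.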

5.6.3. Creating New Passwords

Respondents rated whether fifteen potential actions would improve their account security. Six of these actions related to password changes. In addition, respondents selected whether or not “if [they] received this notification about an online account [they] had with a real company” they would change their password for that company. For brevity, we refer to this below as changing their AcmeCo password. We also asked them to report their likelihood to take five actions related to changing passwords, but for services other than the one that sent them the notification.

Most respondents perceived unique passwords as good for security. Overall, 86.0 % of respondents agreed that changing their AcmeCo password to “a completely new password unrelated to the old one” would improve their account security. Most cited a “better safe than sorry” rationale for changing their password. For example, R55 wrote, “It would bring me peace of mind to know I had done what I could to protect myself and my account.” Yet, 34.6 % also answered that changing their AcmeCo password to “a modification. . . of the old one” would improve their account security, while 26.0 % answered similarly about changing their AcmeCo password “to a password I use for another online account.”

To prevent password-reuse attacks, users should have a unique password for each account, and 84.1 % of respondents agreed that doing so would improve their account security. However, a concerningly large fraction of respondents — 50.2 % — agreed that changing “all of my similar passwords on other online accounts to one new password” would improve their security. Unfortunately, doing so makes them susceptible to future password-reuse attacks. We did not observe significant differences in responses across conditions for any of these six actions related to password changes.

Figure 5.5.: Respondents’ intentions for creating new passwords for their account on AcmeCo (who sent the notification) and on other providers. Respondents could change all passwords, only passwords that were the same or similar, only passwords for important accounts, or none at all.

If they received our notification in real life, respondents would change their password, but ineffectively. The vast majority of respondents — 90.3 % — reported they would change their passwords if, in real life, they received the notification they saw (Figure 5.5). However, among these respondents, only 1.4 % of them said they would change their password to something completely unrelated. Additionally, only 9.7 % of them said they would use a password manager or their browser to generate the password.

The majority of respondents’ new passwords would continue to expose their accounts to the same risks (Figure 5.6). Most respondents — 59.0 % — reported intending to create their new AcmeCo password by changing a few characters in the old password, while 11.4 % reported intending to simply reuse another password they already used elsewhere. In reality, these strategies would not truly resolve their problems and would continue to facilitate password-reuse attacks. Attackers have adapted to users’ tendency to modify passwords in small ways (e. g., common character substitutions, insertions, and capitalizations) and apply such common transformations in password-reuse attacks [54, 238]. Furthermore, self-reported intentions typically over-report actual behavior [226], suggesting that these results may already be overly optimistic.

Figure 5.6.: For the respondents who intended to change their password, this figure shows their stated strategies for doing so for their account on AcmeCo and on other providers. They could generate a new password with a password manager or browser, make a completely new password, modify the old password, reuse a password they already use, or apply some other strategy.

Respondents’ stated likelihood to “leave [their AcmeCo] password as-is” varied by condition (regression p = .012). Respondents who saw model-{noOthers}-{noExtras} were more likely to say they would keep their current password than those who saw model (OR = 2.4, p = .042) or model-{noOthers} (W = 533, p = .040). Respondents were also more likely to state the same if they had not previously received a data-breach notification (OR = 1.4, p = .034) or if they had a background in technology (OR = 1.5, p = .038). We hypothesize this last result may stem from overconfidence.

Some perceptions of security also varied across demographic factors. Female respondents were more likely to rate having unique passwords for all accounts as secure (OR = 1.5, p = .012) and less likely to rate keeping their current password as secure (OR = 0.6, p = .008). Respondents who had not previously received a data-breach notification were more likely to rate modifying their old password as secure (OR = 1.4, p = .017) and less likely to rate changing it to something unrelated as secure (OR = 0.6, p = .002). Surprisingly, respondents with a background in technology were also less likely to rate the latter as secure (OR = 0.6, p = .007).

Those who avoid password changes may do so due to suspicion of notifications or invincibility beliefs. Of the 52 respondents (9 % of the total) who said they would not change their password, 25 reported that it was because they would need to verify that the notification was legitimate, rather than a phishing attack. R534 elaborated that they would “wait and go to AcmeCo’s website and see what was going on first.” Eight others said they would not change their password because of memorability concerns. Seventeen respondents expressed different beliefs of invincibility: eleven said they use unique passwords on every account and thus would not worry about one password being compromised, while six believed their passwords were strong enough to eliminate the risk of compromise. As R2 wrote, “It is a very good password and I doubt someone would waste the time trying to crack it.” While non-experts have difficulties judging password strength [232], Pearman et al. observed in an in-situ study of 154 participants an average password strength that could resist up to 10^12 guesses [178]. At the same time, real-world offline guessing attacks are on the order of 10^9 to 10^12 guesses per day on a single GPU, even against memory-hard hash functions like scrypt [52, 97]. Others consider 10^14 guesses realistic in offline attacks [78]. While rate-limiting and risk-based authentication slow online guessing [90], password reuse remains a threat [152, 224].
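To put these orders of magnitude in perspective, a back-of-the-envelope calculation (our illustration, using only the rates cited above):

```python
def days_to_exhaust(guess_resistance, guesses_per_day):
    """Days an offline attacker needs to try guess_resistance guesses
    at the given guessing rate."""
    return guess_resistance / guesses_per_day

# A password that withstands 10**12 guesses (the average resistance
# Pearman et al. observed) survives:
days_to_exhaust(10**12, 10**9)   # 1000.0 days at the slow end (scrypt)
days_to_exhaust(10**12, 10**12)  # 1.0 day at the fast end
```

So even an average-strength password can fall within days to an offline attacker, which undercuts the invincibility belief quoted above.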

Most respondents believed changing passwords for other accounts with similar passwords would improve security, yet they did not intend to do so. Even though, as previously mentioned, 84.1 % of respondents agreed that using unique passwords for each of their accounts would improve security, 35.2 % of respondents reported that they did not intend to change passwords for any accounts other than AcmeCo. As Figure 5.5 shows, an additional 15.6 % only intended to do so for accounts where they used exactly the same password, while 14.1 % only intended to change passwords for important accounts.

The largest portion of respondents would not change their passwords on other accounts because they did not perceive connections between the account addressed by the notification and any other account. R69 explained, “Unless I heard from a company that was hacked, I’m not concerned.” Furthermore, respondents believed that if the account providers were unrelated, then the risks to account security also must be unrelated. A few respondents speculated that the threats were unrelated because “a potential hacker likely doesn’t know my additional accounts exist” (R23). Unfortunately, because reuse of both usernames and passwords across services is common [54, 238], attackers know to try the same or similar credentials across unrelated services.

In contrast, only 15.3 % of respondents reported intending to change their passwords on all other accounts, while 19.7 % reported intending to change all passwords that were similar to the one that was compromised. Unfortunately, even for respondents who said they would intend to change other passwords, their intended password-creation strategies would leave many at risk. The majority of respondents again intended to either modify (46.5 %) or directly reuse (9.6 %) passwords they already used elsewhere, as shown in Figure 5.6.
On a more positive note, 38.8 % of respondents reported intending to use a password manager or browser to generate these other passwords, which balances usability and security for changing many passwords at once.

5.6.4. Taking Other Security-Related Actions

For nine additional actions unrelated to password changes, respondents again rated their expectation of how these actions impact security, as well as their likelihood to take these actions upon receiving the notification. Notifications should encourage actions that are both productive and relevant for addressing password reuse. To account for these nuances, we included four actions that can potentially address password reuse, as well as five that are only tangentially related to the situation, as shown in Figure 5.7.

Notifications encourage 2FA adoption, yet are less effective at encouraging the use of password managers. The notifications had a divergent impact on two of the actions most relevant to mitigating threats from password reuse: enabling 2FA and using a password manager. While 83.3 % of respondents agreed that enabling 2FA would improve their security and 64.0 % rated it likely that they would do so, only 44.3 % agreed that using a password manager would improve their security, and only 37.3 % rated it likely they would adopt one after receiving the notification. In contrast, 78.8 % of respondents agreed changing their password more frequently would improve security, and 51.9 % rated it likely they would do so. Furthermore, 80.9 % of respondents agreed that reviewing the recent activity on their account would improve security, and 89.5 % rated it likely they would do so.

Notification variants did not impact the likelihood of taking these actions. Which notification respondents saw did not significantly impact their stated likelihood of taking any of these nine actions. However, some demographic factors did. Respondents with a background in technology expressed a higher likelihood of using a password manager (OR = 1.7, p = .002), using an identity theft protection service (OR = 1.6, p = .008), and changing the way they lock their phone (OR = 1.4, p = .033) upon receiving the notification. Finally, female respondents expressed being more likely to review the activity on their account (OR = 1.5, p = .022).

Figure 5.7.: Respondents’ perceptions of whether actions would increase security and their stated intention of taking those actions upon receiving the notification. We group actions by whether they relate to password reuse. [Two panels, “Action improves security?” (Strongly agree to Strongly disagree) and “Intention to take action?” (Very likely to Very unlikely), cover nine actions: Enable 2FA, Use a password manager, Update security questions, Review recent activity, Update software more frequently, Lock phone, Lock computer, Change password more frequently, Use identity protection service.]

Different notification variants minimally impacted security perceptions. The agreement that updating their account’s security questions would improve security varied across the different notifications (regression p = .012), though we did not observe the notification to significantly impact perceptions of any of the other eight actions. Compared to respondents who saw model, those who saw model-{vagueCause} (OR = 2.7, p = .017) or model-{suspicious} (OR = 3.5, p = .004) were more likely to agree that updating their security questions would improve security.

We observed the same effect for three notifications that mentioned that AcmeCo itself had been breached: model-{usBreach}-{mobile} (OR = 3.6, p = .002), model-{usBreach}-{inApp} (OR = 2.4, p = .035), and model-{usBreach}-{noOthers} (OR = 2.6, p = .026). Female respondents were more likely to agree that it would improve security (OR = 1.4, p = .047), while those who had never received a data-breach notification were less likely to do so (OR = 0.6, p = .009).

Demographic factors were correlated with variations in respondents’ perceptions of how these actions impacted security. Female respondents were more likely to agree that using an identity theft protection service (OR = 1.6, p = .003), changing their password more frequently (OR = 1.7, p = .001), and changing how they lock their computer (OR = 1.5, p = .011) would improve security. Respondents with a background in technology (OR = 0.5, p < .001) and those who had never received a data-breach notification (OR = 0.7, p = .015) were also less likely to agree with this statement. Respondents with a background in technology were also less likely to agree that changing their password in the future improves security (OR = 0.7, p = .023), while those who had never received a data-breach notification were less likely to agree that updating software improves security (OR = 0.7, p = .043).

5.6.5. Perceptions of the Notification

Most respondents would act in response within 24 hours. We found that most respondents would anticipate seeing and acting on the notification within a short period of time, despite our notifications varying in delivery method. 87.4 % reported that they would see the notification within 24 hours, and 84.5 % would intend to take action within 24 hours; responses did not vary significantly across notifications. Respondents strongly preferred that account providers contact them via email (90.0 %), although SMS (43.9 %), mobile app (32.5 %), and mobile push notification (29.1 %) were also favorable options. Interestingly, this stated preference for email notifications conflicts with some respondents’ hesitation to take action because of phishing concerns (Section 5.6.3).

Respondents’ trust was lower when AcmeCo suffered a breach. The level of reported trust varied significantly across notification conditions (regression p = .004). Compared to model, the reported trust of the provider was, perhaps unsurprisingly, lower for model-{usBreach}-{inApp}, which stated that AcmeCo itself was breached (OR = 0.3, p = .003). In their free-response justification, 13.8 % of respondents overall reported decreased trust because they believed it to be AcmeCo’s responsibility to prevent such breaches. On the contrary, 8.3 % of respondents’ trust did not change, as “any company is bound to have security breaches” (R196). However, across conditions, 45.2 % of respondents increased trust in AcmeCo as a result of the notification, and 35.8 % reported no change. This was because the notification conveyed a prioritization of their safety (29.3 %) or proactive and transparent policies (13.3 %). An additional 15.1 % of respondents believed that such a notification was simply expected of a company.

Experience with technology and data breaches impacted perceptions. In our models, we also compared the responses of respondents who had prior experience with data breaches to those who had no such experiences. Respondents who had never been notified about being in a breach reported that receiving the notification would lead to greater trust in AcmeCo compared to those who had previously received a data-breach notification (OR = 1.7, p = .002). Respondents who had never received such a notification were also more likely to agree that they would not know why they received such a notification (OR = 1.6, p = .002), more likely to perceive the notification as official (OR = 1.6, p = .009), and less likely to expect companies to send notifications like the one they saw (OR = 0.7, p = .043). This may be because prior experience gives respondents some expectations of provider behavior.

Respondents who reported a background in technology were more likely to agree that they would not know why they received such a notification (OR = 1.6, p = .008) and more likely to agree that ignoring the notification would have no consequences (OR = 1.6, p = .006). They were also less likely to agree that they expected companies to send such notifications (OR = 0.6, p = .010), less likely to agree that they would believe such a notification was official (OR = 0.5, p < .001), and less likely to prioritize taking action in response (OR = 0.7, p = .031). They were also less likely to agree that the notification explained how to resolve the situation (OR = 0.6, p = .007) and less likely to report that they would feel grateful about receiving the notification (OR = 0.6, p < .001).

5.7. Limitations

Like many survey studies, our results suffer from self-report biases. Respondents may have answered questions according to social desirability: selecting the answer they believe they should select, rather than their true answer [144]. To mitigate this bias, we did not explain that this was a study about security, and we included softening language in sensitive questions to remind respondents that people may have many different responses. That said, stated intentions are typically an upper bound on actual behavior [226]. As many respondents’ intended actions would still leave them vulnerable to further attacks, reality may be even worse. This would be consistent with other researchers’ finding that LinkedIn’s actual breach notification was ineffective at prompting password resets [119].

Finally, we report on a convenience sample of MTurk workers receiving our hypothetical notifications. Such a design is inherently limited in its ecological validity. However, given that such notifications have rarely been studied, testing notifications for the first time in the field and potentially causing respondents to think they had been breached would create too high of a potential risk to human subjects. As in prior work on other types of notification messages [74], we chose to conduct a formative, controlled study to inform future research on password-reuse notifications.

5.8. Discussion

We performed the first systematic study of how users understand and intend to respond to security notifications about situations related to password reuse. Through two complementary user studies, we identified best practices for the design of password-reuse notifications. Further, we identified where notifications are destined to fall short in helping users fully remediate password reuse issues. Our formative study lays the groundwork for future field studies. We recommend future work that does not rely on self-reporting, instead testing the best practices we developed for password-reuse notifications in more ecologically valid situations.

5.8.1. Best Practices

Our Study 1 results led us to identify five key design goals for password-reuse notifications (Section 5.4). Some goals (e. g., timeliness) are obvious from general guidelines about warning design, but the importance of providing an adequate background, as well as the subtle considerations around engendering trust, are more specific to the domain of password reuse. Our model notification in Study 2 performed the best according to these goals, suggesting these best practices for designing password-reuse notifications:

• The notification should be very explicit about the root causes of the situation, i. e., password reuse and a data breach.
• Providers should force a password reset on their service.
• The notification should strongly encourage changing similar passwords on other accounts and thoroughly explain why doing so staves off attacks.
• The notification should explicitly encourage enabling 2FA and using password managers.
• Notifications should be sent via both email and more immediate channels (e. g., a blocking notification upon login).

Therefore, we propose the wording of Figure 5.8 as a model notification for situations related to password reuse. Unfortunately, few real-world notifications currently follow these best practices. Table 5.3 compares the 24 real-world notifications we collected in Study 1 to the best practices we identified. None of these notifications met all of the best practices we established. In short, there is much room for improvement in widely deployed notifications.

5.8.2. Addressing Persistent Misunderstandings

While the real notifications we tested in Study 1 were successful in arousing concern, many respondents were unaware of the correct actions to take in response. The model notification we synthesized for Study 2, and its variants, were successful in spurring the vast majority of respondents to report that they would change their password on the site that sent them the notification. This apparent success was tempered, however, by respondents reporting that their new password would often be a minor variation on their previous passwords, or even simply a password reused verbatim from another account. Furthermore, many respondents reported that they would leave their passwords unchanged on providers other than the one that sent them the notification.

Figure 5.8.: The notification we found to be the most effective, relative to our notification goals, for respondents in Study 2.

Table 5.3.: How 24 real-world password-reuse notifications compare to our Study 2 model notification’s best practices. Notifications are identified by their sender (and other details if we collected multiple from the same provider).

Notification | Delivered via email / Mentions password reuse / Forces password change / Suggests changing similar passwords / Suggests 2FA and password manager

Adobe X X
Amazon X X X X
Carbonite X X X X
Digital Ocean X X X X
Edmodo X X
Evernote X X X
Facebook (Accessed) X X
Facebook (Confirm Identity)
Facebook (Logged In) X
Freelancer X X X
Google (2-Step) X
Google (Someone Has . . . ) X
Google (Suspicious)
Houzz X X X
Instagram X
LinkedIn X X
Microsoft X
Netflix X X X X
Pinterest (Read-Only) X
Pinterest (Suspicious) X X
Sony X
SoundCloud X X
Spirit X X
Spotify X X X X

Collectively, these decisions would leave users vulnerable to future attacks leveraging password reuse. The model notification had mixed success at encouraging respondents to take two other actions that could potentially mitigate password reuse. Users’ adoption of 2FA erects another barrier for attackers in exploiting reused credentials, and nearly two-thirds of respondents reported being likely to do so after receiving the notifications we tested. Users adopting a password manager and using it to generate unique, strong passwords for each site is among the small number of usable solutions to combat password reuse, yet under 40 % of respondents reported being likely to do so. These intentions did not vary significantly whether or not the notification explicitly encouraged respondents to take these actions. Future work could investigate whether describing the exact situation the given user is in even more explicitly, as well as why these particular actions are crucial in mitigation, might be more successful. Our work thus underscores that it is unreasonable to expect users to maintain dozens of distinct and secure passwords simply by telling them to do so. Although notifications are a critical source of information to incite positive change in users’ online security behaviors, they are only a band-aid on a gaping wound.
In addition to improving notifications, we recommend devising ecosystem-level strategies to combat password reuse. Individual account providers cannot prevent password reuse across services without direct cooperation with others [241]. As our respondents already expressed much confusion about how providers “had this information in the first place, who they got it from and how they got it” (R159), other actors may be better positioned to make a difference. Password managers and web browsers have a unique viewpoint on the full spectrum of a user’s passwords that individual providers do not. Specifically, they have the opportunity to identify and prevent password reuse when users create, change, or import passwords. Unfortunately, current implementations of many password managers and browsers permit users to reuse passwords across accounts, often not even warning those users about why this is problematic. This behavior could be out of fear that users would not use those password managers or browsers if they felt burdened by onerous actions. Future work should thus investigate how password managers and browsers can be more explicit in preventing password reuse while maintaining a positive user experience. The current state of password reuse results from many actors’ decisions. Remediation will require the contributions of many more.

A small leak will sink a great ship.

— Benjamin Franklin

6. Password Management

Contents
6.1 Introduction
  6.1.1 Contributions
  6.1.2 Outline
6.2 Cracking-Resistant Vaults
  6.2.1 NoCrack and Natural Language Encoder
  6.2.2 Attacker Model
6.3 Static and Adaptive NLEs
  6.3.1 Static Distribution NLEs
  6.3.2 Adaptive Distribution NLEs
6.4 The (In-)Security of Static NLEs
  6.4.1 Distinguishing Real from Decoy Vaults
  6.4.2 Datasets
  6.4.3 Experiments for Entire Vaults
  6.4.4 Experiments for Single Passwords
6.5 Cracking NoCrack
  6.5.1 Best: Combining the Factors
  6.5.2 Further Remarks
6.6 Adaptive NLEs Based on Markov Models
  6.6.1 Static NLEs Based on Markov Models
  6.6.2 Baseline Performance
  6.6.3 Adaptive Construction
  6.6.4 Performance of the Adaptive NLE
  6.6.5 Security of the Adaptive NLE
  6.6.6 Limitations of the Adaptive NLE
6.7 Conclusion

6.1. Introduction

To relieve users of the burden of memorizing a large number of passwords, or of reusing them across sites, many security experts recommend the use of password vaults (also called password managers) [184]. Password vaults store passwords, and usually also domains and usernames, in an encrypted container, where the encryption key is derived from a master password using a key derivation function (KDF) [137]. Vaults can store both user-chosen passwords, which can be chosen stronger since the user does not have to memorize them, and randomly chosen, cryptographically strong passwords that are often generated by the password manager. To facilitate migrating the stored passwords to a new or secondary device, and to back up data to prevent loss, many vaults offer the possibility to store the encrypted vault with an online service. Password vaults stored online are a promising target for attackers. This is not a theoretical problem: LastPass was the target of a suspected breach in 2011 and 2015 [99]. Other work analyzing the security of online password managers [150, 211] has shown a number of weaknesses, including vulnerabilities that allow exfiltration of stored vaults. An attacker can try to recover the missing master password once the encrypted vault has been stolen [26, 81, 138]. In an offline guessing attack, the attacker can try candidate (master) passwords, trial-decrypt the vault, and then verify whether the candidate was correct by observing the result of the decryption. Often, decryption with an incorrect master password will yield an error or a malformed header, allowing easy identification of wrong candidates. This kind of attack is “offline,” as no interaction with an online service is required, and the correctness of a guess can be verified locally. The number of password guesses an attacker can try is almost unbounded, limited only by the computational resources at the attacker’s disposal.
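To make the offline setting concrete, the following minimal sketch shows how explicit feedback from decryption lets an attacker verify master password guesses locally. The header check, toy XOR keystream cipher (standing in for AES), vault layout, and iteration count are all invented for illustration and do not match any real vault format.

```python
import hashlib

MAGIC = b"VAULT1"  # hypothetical plaintext header that betrays a correct guess

def derive_key(master_pw: str, salt: bytes) -> bytes:
    # KDF step: PBKDF2-HMAC-SHA256 (iteration count chosen for the demo).
    return hashlib.pbkdf2_hmac("sha256", master_pw.encode(), salt, 10_000)

def xor_stream(key: bytes, data: bytes) -> bytes:
    # Toy keystream cipher standing in for AES -- NOT secure, demo only.
    stream = hashlib.sha256(key).digest()
    while len(stream) < len(data):
        stream += hashlib.sha256(stream).digest()
    return bytes(a ^ b for a, b in zip(data, stream))

def offline_guess(ciphertext: bytes, salt: bytes, candidates):
    # Trial-decrypt every candidate; the recognizable header provides the
    # explicit feedback that makes guesses verifiable without any online
    # interaction -- only compute power bounds the number of attempts.
    for pw in candidates:
        plaintext = xor_stream(derive_key(pw, salt), ciphertext)
        if plaintext.startswith(MAGIC):
            return pw, plaintext[len(MAGIC):]
    return None, None

# Demo: a vault encrypted under a weak master password falls immediately.
salt = b"\x00" * 16
ct = xor_stream(derive_key("hunter2", salt), MAGIC + b"example.com:s3cret")
pw, vault = offline_guess(ct, salt, ["123456", "password", "hunter2"])
```

Removing the recognizable header is exactly the countermeasure the rest of this chapter is concerned with: without it, every trial decryption looks equally plausible.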
No data is publicly available describing how users choose their master passwords. Utilizing the available information for normal account passwords [23, 66], and given the recent advancements in GPU and FPGA design [250], we postulate that user-chosen master passwords can also be guessed in limited time [85, 215]. For current vaults [2, 195], the number of guesses per day on a single GPU is on the order of 10^9 to 10^12. The GPU-based password cracking software Hashcat [216] supports a variety of popular vault storage formats and their respective master password hashes, including 1Password [3], KeePass [196], and LastPass [153]. Thus, unthrottled guessing attacks constitute a major threat. In principle, this problem can be solved by avoiding giving feedback about the successful decryption. However, not only explicit feedback in the form of error messages needs to be avoided, but also implicit feedback, i. e., in the form of “implausible” passwords in the vault. One solution was devised in the Kamouflage scheme [21], which constructs so-called decoy vaults that are generated during a setup phase, are encrypted under similarly structured decoy master passwords, and return predefined plausible password vaults. An attacker will obtain multiple vaults in an offline guessing attack and needs to decide, e. g., via online verification, which vault is the correct one. A newer proposal, NoCrack [46], improved on this approach by not only offering a predefined (constant) number of decoy vaults but by generating new and plausible decoy vaults on the fly for each (wrong) master password candidate. Additionally, its authors discovered a flaw in the generation of Kamouflage’s decoy master passwords that led to an improved attack against the scheme. Generating reasonable-looking decoy vaults is not a trivial task.
NoCrack uses techniques from Honey Encryption (HE) and Natural Language Encoders (NLEs) based on Probabilistic Context-Free Grammars (PCFGs) to generate plausible decoy vaults. PCFGs have been shown to model password distributions quite accurately in the past [247]. In a preliminary evaluation, the authors showed that basic machine learning attacks are not able to distinguish real from decoy vaults.

6.1.1. Contributions

The features used in the security analysis of the NoCrack vaults were quite simplistic: repeat counts, edit distances, and n-gram frequency statistics were used as input to the machine learning step. We show that techniques exist that can distinguish real from decoy vaults with high accuracy. Our technique is based on the distribution of the passwords in the vaults, which can be easily measured by an attacker simply by trial-decrypting a vault with a number of wrong master passwords. By determining the similarity between distributions (we use Kullback–Leibler divergence as a measure of similarity), one can see that the distribution of passwords in the decoy vaults generated by NoCrack is substantially different from the distribution of other password lists. This enables us to rank the correct vault significantly higher (up to a median rank of 1.97 % for real-world vaults and 0.1 % for artificial vaults composed from real-world password lists). We show that this is not a problem unique to NoCrack, but is also caused by differences in various password lists. Based on the observation that this problem persists for many different ways to choose decoy vaults, we propose the notion of adaptive NLEs, where the generated distribution of decoy vaults depends on the actual values stored in the vault. Finally, we evaluate additional signals that enable one to distinguish real from decoy vaults even better. We show that additional information, such as the correlation of usernames and passwords, password reuse, or password composition policies, should be considered by the NLE. In the case of NoCrack, this results in a mean rank of 2.4 % and a 1st quartile of 0.56 %, an improvement by a factor of 40 and 170, respectively. To summarize, our contributions include:

1. We show that techniques exist that can distinguish real from decoy vaults with high accuracy for NoCrack.

2. We propose the notion of adaptive NLEs, where the generated distribution of decoy vaults depends on the actual values stored in the vault.

3. We evaluate signals that enable one to even better distinguish real from decoy vaults via additional information such as user- names and password policies.

The contributions of this work resulted from a collaboration with Benedict Beuscher and Markus Dürmuth.

6.1.2. Outline

In Section 6.2 we will review some material about cracking-resistant password vaults and introduce the attacker model that we consider. In Section 6.3 we will define the concept of adaptive NLEs, as opposed to static NLEs. In Section 6.4 we will show that for static NLEs and specifically for NoCrack, KL divergence is able to distinguish between real and decoy vaults with high accuracy. In Section 6.5 we will describe some more factors that can be used to distinguish real from decoy vaults. In Section 6.6 we will define an adaptive NLE based on Markov models and demonstrate its much stronger resistance against attacks compared to static NLEs.

6.2. Cracking-Resistant Vaults

In the following, we introduce the required notions, review some material about cracking-resistant password vaults, and describe the attacker model that we consider. Recall that an offline guessing attack describes an attack where one can verify guesses without contacting an online service. Thus, the number of tests that can be performed is limited only by the available computational resources. In contrast, an online service can implement rate-limiting and risk-based authentication to limit the exposure in online attack scenarios.

Kamouflage

Bojinov et al. [21] have proposed a solution to this problem by designing a password vault that resists offline guessing attacks. Their Kamouflage system uses a “hide-in-an-explicit-list” approach. For this, they generate a large number (they suggest 10^7 for medium security) of decoy vaults that are stored besides the real vault. Even after correctly guessing a master password, the attacker does not know whether decrypting the vault with the master password leads to the real vault or one of the decoy vaults present in the file. To generate plausible-looking decoys, they were required to consider that multiple passwords are generated for the same user. Therefore, they implemented a solution that generates decoys by assigning probabilities to password templates that are derived by a process similar to the concept of using a PCFG. They tokenize every given domain password and reuse those tokens across the passwords in the vault, where a token represents a word or number of a certain length. Subsequently, they validate the tokens using a dictionary and flag unknown tokens for manual user review. Based on those derived tokens, they generate plausible-looking decoys via a dictionary. A potential drawback of this approach is the storage overhead required to save a large number of decoy vaults. Chatterjee et al. [46] broke the scheme by abusing the revealed structure of the master password, once any (real or decoy) vault master password has been guessed. By identifying the tokens (e. g., “Kamouflage16” → L10D2) of the found password, one is able to narrow down the search space significantly and speed up the search for the remaining master passwords. This flaw even degrades the resistance against offline guessing attacks to a lower level than that of traditionally encrypted vaults that do not use decoys at all.
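The tokenization step can be sketched as follows. This is a rough approximation of Kamouflage-style templates (runs of letters, digits, and symbols collapsed to typed lengths); the scheme's actual rules, e. g., dictionary validation of word tokens, are more involved.

```python
import re

def template(password: str) -> str:
    # Collapse a password into a token template: runs of letters become
    # L<len>, runs of digits D<len>, and runs of other symbols S<len>.
    out = []
    for m in re.finditer(r"[A-Za-z]+|\d+|[^A-Za-z\d]+", password):
        run = m.group()
        kind = "L" if run[0].isalpha() else "D" if run[0].isdigit() else "S"
        out.append(f"{kind}{len(run)}")
    return "".join(out)
```

For example, `template("Kamouflage16")` yields `"L10D2"`, the structure an attacker learns from any successfully guessed (real or decoy) master password, which is exactly what the Chatterjee et al. attack exploits to prune the search space.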

Honey Encryption

Honey Encryption (HE), introduced by Juels and Ristenpart [134], produces a ciphertext that, when decrypted with an incorrect key, yields plausible-looking decoy plaintexts, called honey messages. It addresses the problem of encrypting plaintext using low-entropy keys, such as keys derived from passwords, by applying a specialized encoding mechanism first and encrypting the result afterward. The key challenge of building such a system is the construction of a randomized message encoding scheme, called a distribution-transforming encoder (DTE). Creating such an encoder is relatively easy for some classes of data, like a random string, but very challenging for natural language text like real-world passwords. The authors showed the usefulness of Honey Encryption for password-based encryption schemes by building a system for encrypting RSA secret keys and credit card numbers. Utilizing an HE scheme for encrypting a password vault provides, in contrast to Kamouflage’s solution, security beyond the traditional offline work bound. In other words, it is ensured that an attack is never less expensive than attacking a traditional vault.

Figure 6.1.: Design of NoCrack (simplified). This schematic omits details, e. g., domain hashing or separated handling of human-chosen and computer-generated passwords. The Natural Language Encoder (NLE) decodes a bit string to a password and encodes vice versa.
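The core idea of a DTE can be illustrated with inverse transform sampling over a toy password distribution. The distribution and names below are invented for illustration; a real encoder such as NoCrack's operates on PCFG rule probabilities and bit strings rather than floats over three passwords.

```python
import bisect
import random

# Toy message distribution and its cumulative distribution function (CDF).
DIST = {"123456": 0.5, "password": 0.3, "letmein": 0.2}
PWDS = list(DIST)
CDF = []
acc = 0.0
for p in PWDS:
    acc += DIST[p]
    CDF.append(acc)

def decode(seed: float) -> str:
    # Any seed in [0, 1) yields a plausible password -- the honey message.
    return PWDS[bisect.bisect_right(CDF, seed)]

def encode(pwd: str, rng=random) -> float:
    # Pick a point uniformly inside pwd's probability interval, so that
    # decode(encode(pwd)) == pwd while encodings of the true message are
    # indistinguishable from uniformly random seeds.
    i = PWDS.index(pwd)
    lo = CDF[i - 1] if i > 0 else 0.0
    return rng.uniform(lo, CDF[i])
```

Encrypting `encode(pwd)` instead of `pwd` is what removes the feedback channel: decrypting with a wrong key yields some other seed, which decodes to a different, but plausible, password.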

6.2.1. NoCrack and Natural Language Encoder

Chatterjee et al. [46] used the idea of Honey Encryption to provide more secure password vaults, extending the concept previously applied in Kamouflage. An overview of the simplified architecture of NoCrack is given in Figure 6.1. Basically, the idea is to output “plausible-looking” password vaults for each master password used for trial-decrypting a vault (whereas Kamouflage only outputs decoy vaults for a small number of wrong master passwords). Here, the challenge is to generate plausible-looking vaults on the fly, which is achieved by building a new kind of DTE for encoding and decoding natural language, named Natural Language Encoder (NLE). For this, they built a DTE that takes natural language, i. e., a password pwd, as input and encodes it to a random bit string S. Conversely, decoding any random bit string S outputs a plausible plaintext, i. e., a password pwd. A number of promising candidates for NLEs are available (and are often already used for password guessing [42, 66, 247] or password strength meters [43]): Probabilistic Context-Free Grammars (PCFGs), n-gram Markov models, password samplers, and more.

NoCrack’s Approach: A PCFG Model-based NLE

NoCrack implements two different DTEs, called UNIF (uniform) for randomly chosen passwords and SG (sub-grammar) for human-chosen passwords. While the former is straightforward to build, the construction of the latter is an unsolved problem and a challenging task. The authors of NoCrack sketched two approaches for this, an n-gram model and a PCFG-based solution, and described the latter in detail. PCFGs can be used to model a distribution of strings in a language, i. e., passwords, by associating a probability with the members of the distribution defined by an underlying context-free grammar (CFG). In language theory, a CFG is a formal grammar that is defined by a lexicographically ordered set of production rules. A rule is specified as a pair (A, α), where A is a non-terminal and α is a terminal or non-terminal symbol. One can build a PCFG by assigning probabilities to such CFG rules. Due to the need to be able to encode every occurring password, there is a special catch-all rule with a very low probability.

For a given PCFG, a password may be specified by a sequence of probabilities. Based on the first probability in the sequence, a rule from the set is selected for the start symbol S, producing the children of S in a parse tree. Via recursion, the complete parse tree is built. They showed that a rule set probability can be represented as an integer. By this, given a PCFG, one can represent a password by a non-unique vector of integers. In the encoding step of the NoCrack NLE, the vector of probabilities that defines a parse tree is selected uniformly from all such vectors producing the given password. During decoding, the given vector of probabilities, and thus the parse tree, is used to rebuild the encoded password. The authors proposed an SG (sub-grammar) NLE that is used to build a vault from the described single-password NLE (MPW). Applying a single-password NLE multiple times produces a vault of unrelated passwords, which is rather insecure. Instead, the SG approach tries to build related passwords to simulate normal user behavior more accurately. For this, all passwords stored in the real vault are parsed using the trained PCFG model. After this, a new sub-grammar PCFG that consists of the cumulative set of rules used during the parsing is built. Finally, the rule probabilities are copied from the original PCFG and renormalized over the sub-grammar PCFG. Please note that some special rules, like the aforementioned catch-all rule, are always part of the used grammar. To be able to decode the SG-encoded passwords, the SG is encoded as well and is always the first part of the NLE’s output. For full details, we refer the interested reader to the original paper [46].
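Sampling decoy passwords from a PCFG can be sketched as follows. The grammar and probabilities here are invented toy values; NoCrack trains its grammar on leaked password lists and additionally supports encoding parse trees back to integer vectors.

```python
import random

# Toy PCFG: non-terminals map to weighted expansions; any symbol without
# rules is treated as a terminal string. Illustrative values only.
RULES = {
    "S":  [(["W", "D"], 0.6), (["W", "X"], 0.4)],  # word+digits or word+symbol
    "W":  [(["love"], 0.5), (["monkey"], 0.5)],
    "D":  [(["123"], 0.7), (["2024"], 0.3)],
    "X":  [(["!"], 1.0)],
}

def sample(symbol: str, rng: random.Random) -> str:
    # Draw an expansion for `symbol` according to the rule probabilities,
    # then expand recursively; terminal symbols are returned verbatim.
    if symbol not in RULES:
        return symbol
    r, acc = rng.random(), 0.0
    for rhs, p in RULES[symbol]:
        acc += p
        if r < acc:
            return "".join(sample(s, rng) for s in rhs)
    return "".join(sample(s, rng) for s in RULES[symbol][-1][0])

rng = random.Random(7)
decoys = [sample("S", rng) for _ in range(5)]  # e.g., "love123", "monkey!"
```

The sub-grammar idea corresponds to restricting `RULES` to exactly those productions used when parsing the real vault, so that decoys for one vault look internally related, just like a single user's passwords.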

The NoCrack Implementation

The authors of NoCrack implemented a full version [45] of a Honey Encryption-based password vault. The implementation must be considered a prototype, which does not always correctly decrypt a vault and occasionally crashes. However, we did not focus on the manager software itself, like the implemented client–server model and encryption, but rather on the concept of the actual PCFG model-based NLE. NoCrack, as currently implemented, does not store usernames, so technically we have to assume that usernames are known. Usernames are typically not considered secret, so we expect them to be easily retrievable, e. g., via the email address, or they may even be publicly known.

6.2.2. Attacker Model

While cracking-resistant vaults, of course, are subject to all the normal attacks that can be launched against vaults, here we concentrate on so-called online-verification attacks, where an attacker performs online queries to verify which vault (resulting from an offline guessing attack) is the correct one.

1. Trial-Decryption: The attacker trial-decrypts the given encoded vault with N master password candidates according to a password distribution that is believed to be suitable for the task.

This yields a number of candidate vaults cv_1, …, cv_N.

2. Ranking of Vault Candidates: The attacker ranks the vault candidates such that the “more likely correct” vaults are (hopefully) ranked near the top. The original paper uses machine learning to rank the more likely candidate vaults to the top.

3. Online Verification: Finally, the attacker uses online guessing to verify the correctness of the vaults, starting with the highly ranked vaults. The number of online verification attempts the attacker is able to perform depends on the countermeasures implemented by the service and can vary wildly. Note that even a single service with very weak defenses has a great impact on the security of the complete vault.

The security of a vault scheme against this type of attack is best measured by the distribution of ranks of the real vault. To this end, it is not necessary to create millions of decoy passwords; as those are chosen uniformly by the NLE, it is good enough to observe the ranking of the real vault in a much smaller set of decoy vaults. The average rank of the real vault among the decoy vaults is a good measure of the average defense against online-verification attacks.

6.3. Static and Adaptive NLEs

A central aspect of an NLE used in a password vault is the distribution of its generated decoy vaults. It needs to generate decoy vaults that cannot be easily distinguished from the real vault. Technically, this distribution exists for traditional vault software as well, with two vaults, the “error vault” ⊥ and the correct vault, having a non-zero probability. For Kamouflage, this distribution has a limited number of vaults with non-zero probability. To decrypt the vault, the NLE construction, as used in NoCrack, gets a bit string as input. This bit string is generated by applying a KDF to the used master password. Thus, the input distribution for the NLE is (close to) a uniform distribution and (practically) independent of the distribution of the guessed master passwords. We distinguish two variants of NLEs:

(i) Static NLEs: an NLE where the generated distribution of decoy vaults is independent of the actual values stored in the vault.

(ii) Adaptive NLEs: an NLE where the generated distribution of decoy vaults depends on the actual values stored in the vault.

6.3.1. Static Distribution NLEs

NoCrack and all NLEs that follow the schematic in Figure 6.1 are necessarily static, as no information about the stored vault is available to the NLE. Thus, the distribution of decoy vaults is necessarily independent of the passwords stored in the vault. Static NLEs seem to be a logical and conservative choice for password vaults. An attacker can easily approximate the generated distribution, and if the distribution transports information about the passwords stored in the vault, an attacker might be able to extract information about the vault from this easily accessible distribution. However, as we will show in the following section, static NLEs have one major drawback and are susceptible to online-verification attacks. In brief, the problem is that the distribution of decoy vaults needs to be fixed at one point before the actual passwords are stored. But password distributions differ substantially from one service to another, where reasons include a different password policy (which may vary over time), a different user base, different (perceived) security requirements, and much more. A previously fixed distribution will not be able to handle these vast differences, which even change over time. Furthermore, storing a strong password in such a vault will make breaking the vault easier; a counter-intuitive behavior that we should avoid at all cost.

6.3.2. Adaptive Distribution NLEs

A potential solution to the problem described in Section 6.3.1 is offered by adaptive NLEs, where the chosen distribution of decoy vaults depends on the passwords stored in the vault. This makes it unnecessary to “predict” the changes in password distributions over time at creation time of the software, as the distribution can adapt to relevant changes from the stored passwords. This raises another security concern: when the distribution depends on the stored passwords, will knowledge of the distribution help in recovering the stored passwords? In Section 6.6 we will show an adaptive NLE based on Markov models, which resists online-verification attacks much better than a static NLE; additionally, the adaptive property does not undermine the security of the scheme.

6.4. The (In-)Security of Static NLEs

Our first main contribution shows that static NLEs, and especially NoCrack, suffer from a severe weakness that substantially limits their security. In brief, we show how an attacker can efficiently distinguish the distribution of passwords generated by NoCrack from real vaults, and we argue that this is at least partially a fundamental limitation of static NLEs.

6.4.1. Distinguishing Real from Decoy Vaults

In the attack scenario described in Section 6.2.2 and introduced by Chatterjee et al., the adversary has created a list of candidate vaults by decrypting the encrypted vault with a number of candidate master passwords. We can assume that the correct master password is in the list of candidate passwords and thus the correct vault is among the list of candidate vaults. Then an attacker wants to rank the available candidate vaults so that the average position of the true vault in the list of candidate vaults is near the top of the list. A “perfect” NLE would lead to an average rank of 50 %, as there would be no way to distinguish real from decoy vaults and thus the ranking could do no better than guessing. Chatterjee et al. tested attacks based on generic ML algorithms. We devise an alternative attack targeting the similarity of the observed distributions, based on the Kullback–Leibler divergence.

Kullback–Leibler Divergence

The Kullback–Leibler divergence (KL divergence) is a measure of the difference between two probability distributions. Given two probability distributions P and Q, the KL divergence is defined as

    D_KL(P ‖ Q) = Σ_{z ∈ supp(P)} P[z] · log( P[z] / Q[z] ),    (6.1)

provided that supp(P) ⊂ supp(Q), and ∞ otherwise. We use logarithms to base 2 throughout this work. The measure is not symmetric in P and Q.
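Equation 6.1 translates directly into code; a minimal implementation over dictionaries of relative frequencies:

```python
import math

def kl_divergence(p: dict, q: dict) -> float:
    # D_KL(P || Q) from Equation 6.1: sum over supp(P), log base 2.
    # Returns infinity when supp(P) is not contained in supp(Q).
    total = 0.0
    for z, pz in p.items():
        if pz == 0.0:
            continue  # z outside supp(P) contributes nothing
        qz = q.get(z, 0.0)
        if qz == 0.0:
            return math.inf  # P puts mass where Q has none
        total += pz * math.log2(pz / qz)
    return total
```

Note that `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ, matching the remark that the measure is not symmetric.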

Setup

The setup follows the attack model described in Section 6.2.2. We highlight the deviations from there in the following.

1. Determining the Distribution P_decoy of Decoy Vaults: Static NLEs have the (defining) property that the distribution of decoy vaults is constant. This distribution can be obtained in two ways. Either it can be approximated by repeatedly sampling passwords from the distribution by evaluating the KDF and trial-decrypting the vault (similarly, but less computationally expensive, one can choose and decode a uniformly random bit string). We use this method in the current section, and we determine the influence of the sample size on the accuracy of the attack in Section 6.4.3. Or, for some NLEs, it is possible to use a theoretical argument to derive a mathematical description of the distribution via the source code. We use this method in Section 6.6.1 for Markov model-based NLEs.

2. Trial-Decryption: The attacker trial-decrypts the given encoded vault with master password candidates, yielding candidate vaults cv_1, …, cv_N.

3. Ranking of Vault Candidates: For the ranking in this experiment we use the similarity of the password distributions as measured by the KL divergence. We compute the similarity scores

   s_i := D_KL(P̂_cv_i ‖ P̂_decoy)  for i = 1, …, N,    (6.2)

for each candidate vault cv_i. Here, P̂_cv_i is the distribution derived from the vault cv_i based on relative frequencies, and P̂_decoy is the distribution derived from the empirical measurements in the first step, again based on relative frequencies. We rank the candidate vaults based on the score s_i, where a higher s_i means a larger distance from the decoy distribution and thus a more likely real vault.

4. Online Verification: Finally, the attacker uses online guessing to verify the correctness of the vaults, starting with the higher-ranked vaults.

6.4. The (In-)Security of Static NLEs 167
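Putting steps 2 and 3 together (omitting the actual KDF evaluation and trial decryption), the scoring and ranking stage can be sketched as follows; a toy illustration under our own naming, with candidate vaults given as plain password lists:

```python
from collections import Counter
from math import log2

def kl_score(vault, decoy_counts, decoy_total):
    """D_KL(vault distribution || decoy distribution) in bits.
    A password never seen among the decoy samples yields inf,
    which is itself strong evidence for a real vault."""
    score = 0.0
    for pwd, c in Counter(vault).items():
        p = c / len(vault)
        dq = decoy_counts.get(pwd, 0)
        if dq == 0:
            return float('inf')
        score += p * log2(p / (dq / decoy_total))
    return score

def rank_candidates(candidate_vaults, decoy_counts):
    """Return candidate indices ordered from most to least likely
    real (step 3); online verification (step 4) follows this order."""
    total = sum(decoy_counts.values())
    scores = [kl_score(v, decoy_counts, total) for v in candidate_vaults]
    return sorted(range(len(candidate_vaults)), key=lambda i: -scores[i])
```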

6.4.2. Datasets

For completeness, we give a brief description of the datasets used in the following evaluations.

• The PBVault leak is likely from before June 2011 and has a substantial overlap with credentials that were obtained via keystroke logging and HTML form injection by the ZEUS Trojan. Chatterjee et al. used this list to evaluate their NLE approach and made the file available along with the NoCrack source code. The file contains username and password pairs. To the best of our knowledge, PBVault is the only publicly available list of password vaults. Detailed statistics on the file, called Pastebin, can be found in the NoCrack paper [46].

• The RockYou list contains 32 million plaintext passwords. It was leaked in December 2009 through an SQL injection attack.

• The Gmail list contains over 5 million Google Mail credentials and was posted in September 2014 in a Russian Bitcoin forum.

• The Yahoo leak is from July 2012 and was obtained by attacking the Yahoo! Voices publishing service; it contains around 450 thousand passwords.

• The MySpace list contains around 55,000 passwords that were posted in October 2006. The passwords in the list were obtained via a phishing attack.

6.4.3. Experiments for Entire Vaults

First, we present results using the KL divergence on full vaults. For comparison, we use the same set of password vaults used in the evaluation of NoCrack.

Setup

This experiment follows the description in Section 6.4.1.

1. The decoy distribution P̂_decoy is approximated by the relative frequencies of 30 million samples of entire vaults from the NoCrack distribution, obtained by repeatedly decrypting a vault using a wrong master password and querying it for the passwords of 50 well-known login domains.

2. As data for the ranking we use 1,000 vaults, where one "real vault" is chosen from PBVault and 999 "decoy vaults" are chosen from a list of decoy vaults, obtained by repeatedly decrypting a NoCrack vault with a wrong master password and querying it for the passwords of a specific number of login domains. Decoy vaults are chosen to have the same size as the real vault. Note that this list is disjoint from the list used for approximating P̂_decoy.

3. The ranking is performed using the KL divergence D_KL(P̂_cv_i ‖ P̂_decoy).

4. This experiment is repeated 100 times for each vault in PBVault, choosing fresh decoy vaults in each iteration.

Results

The results of this experiment are summarized in Table 6.1 and Table 6.2. They show that the real vault cv_real is ranked on average among the 6.20 % most likely vaults, thus reducing the amount of online guessing by approximately a factor of 16. This is significantly better than the best-reported attack by Chatterjee et al. using machine learning [46], which achieved an average rank of 39.70 % for the combined feature vector. The situation is, however, even worse when we take a closer look at how the ranks of the real vaults are distributed. In fact, this distribution is skewed to the right, with a median of only 1.97 % and a 1st quartile of 1.02 %. This means that half the vaults are among the top-ranked 1.97 % of the vaults, reducing the amount of online guessing by a factor of 50, and 25 % of the vaults are ranked among the top 1.02 %, reducing the amount of online guessing by a factor of 98. Note that we provide an attack against NoCrack taking into account further information in Section 6.5.1, which brings down the average rank to 2.43 % and the 1st quartile to 0.56 %. For comparability, we also report the results separated by vault size, see Table 6.1. The results vary little with the vault size; mainly the mean gets smaller for larger vaults, most likely because more data is available for the comparison.

Table 6.1.: Rank results based on a KL divergence attack on entire vaults, where smaller numbers mean a more efficient attack. Decoy vaults are chosen from the NoCrack distribution. Real vaults are chosen from the PBVault distribution. We list results for varying classes of vault sizes.

KL Div.: NoCrack vs. PBVault (Sample Size: 30 × 10^6, By Vault Size)

Vault Size    Mean     Q0.25    Median
2-3           9.56 %   0.86 %   2.08 %
4-8           5.97 %   0.97 %   1.86 %
9-50          3.14 %   1.12 %   1.69 %
All (2-50)    6.20 %   1.02 %   1.97 %

Influence of the Size of the Training Set

In order to determine the influence of the number of samples for approximating the decoy distribution, we ran the experiment with varying numbers of samples. The results are summarized in Table 6.2. As expected, more samples provide better results, while the improvements become less pronounced beyond 10 million samples. Note that this depends on the NLE used; we found different behavior for the NLE based on Markov models, which we introduce in Section 6.6. Furthermore, while a KDF that is computationally expensive to evaluate will slow down the trial decryptions, it cannot effectively slow down sampling of the decoy distribution. Instead, the NLE can be queried directly by providing a random bit string S as input for the NLE (cf. Figure 6.1).

Table 6.2.: Rank results based on a KL divergence attack on entire vaults, where smaller numbers mean a more efficient attack. Decoy vaults are chosen from the NoCrack distribution. Real vaults are chosen from the PBVault distribution. We list results for varying numbers of samples for the reference distribution.

KL Div.: NoCrack vs. PBVault (Vault Size: 2-50, By Sample Size)

Sample Size    Mean      Q0.25    Median
100,000        12.83 %   3.99 %   7.99 %
300,000        10.39 %   3.18 %   6.54 %
1,000,000       8.48 %   2.27 %   4.46 %
3,000,000       7.39 %   1.75 %   3.36 %
10,000,000      6.59 %   1.32 %   2.63 %
30,000,000      6.20 %   1.02 %   1.97 %

6.4.4. Experiments for Single Passwords

The previous experiments have demonstrated that there is a significant difference between the distribution of passwords sampled from NoCrack and from the PBVault leak, and that the KL divergence is a suitable tool to distinguish the two. We strive to better understand how similar password lists are, and how accurately the KL divergence can distinguish them. All available password lists we are aware of, with the exception of PBVault, contain single passwords only. Therefore, we sample one single random password from a number of NoCrack (SG) vaults to obtain independent password samples. This approach produces artificial vaults containing unrelated passwords, like NoCrack (MPW), that can be meaningfully compared with other artificial vaults built from lists of single passwords.

Setup

In this experiment, we perform a pairwise comparison between two password lists. We denote one list as “real password list” R, the other as “decoy password list” D.

1. The decoy list D is split into two (disjoint) parts D_v and D_ref, where D_v is used for picking decoy vaults and D_ref is used as an approximation of the reference distribution. As some lists (e. g., MySpace) are quite small, we repeatedly split D for each new vault.

2. We pick one "real" vault of size 10 from the set R, where we select the passwords independently of each other. We pick 999 "decoy" vaults of size 10 from D_v, again with independent passwords.

3. When computing the KL divergence, we use the set D_ref to obtain the reference distribution (which is disjoint from D_v).
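The split-and-sample procedure above can be sketched as follows; a minimal illustration under our own naming, assuming password lists are plain Python lists and the split fraction is our choice:

```python
import random

def split_decoy_list(decoy_list, ref_fraction=0.5, rng=None):
    """Split D into disjoint parts D_v (for picking decoy vaults)
    and D_ref (for the reference distribution); the split is
    redrawn for each new vault, as in the experiment setup."""
    rng = rng or random.Random()
    shuffled = decoy_list[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ref_fraction)
    return shuffled[cut:], shuffled[:cut]   # D_v, D_ref

def sample_artificial_vault(password_list, size=10, rng=None):
    """Draw `size` independent passwords (with replacement) to form
    an artificial vault of unrelated passwords."""
    rng = rng or random.Random()
    return [rng.choice(password_list) for _ in range(size)]
```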

Results

The results are summarized in Table 6.3. First, we see that NoCrack performs worse on these artificial vaults, i. e., the KL divergence is substantially better at distinguishing the distribution generated by NoCrack from these artificial vaults. As we used independent passwords for NoCrack as well, the differences that allow us to distinguish them are caused by the distribution of passwords. The most likely reason for this behavior is that 10 independently chosen passwords carry more information than 10 passwords with a high degree of reuse, as observed in the PBVault leak. Second, as one can see in Figure 6.2, the distributions of RockYou, Yahoo, and Gmail are rather hard to distinguish; apparently, their distributions are relatively similar. Third, MySpace is quite different from RockYou, Yahoo, and Gmail, and is relatively easy to distinguish from them. Lastly, as expected, comparing one distribution against itself yields an average rank of 50 %. Note that these results are not symmetric, i. e., it makes a difference which distribution the decoys are chosen from.

Table 6.3.: Rank results based on KL divergence of artificial (independently selected) vaults of size 10. Note, for easier comparison we also report numbers for the static Markov NLE, which is introduced in Section 6.6.

All vs. All: Average Rank

            Leak D: RockYou                Leak D: Gmail
Leak R      Mean      Q0.25     Median     Mean      Q0.25     Median
RockYou     50.54 %   25.03 %   50.85 %    50.70 %   25.53 %   50.35 %
Gmail       28.04 %    7.01 %   20.12 %    50.16 %   25.53 %   50.15 %
Yahoo       27.54 %    7.01 %   20.62 %    40.05 %   15.92 %   36.04 %
MySpace     27.46 %    6.91 %   19.82 %    28.33 %    8.21 %   21.02 %

            Leak D: Yahoo                  Leak D: MySpace
Leak R      Mean      Q0.25     Median     Mean      Q0.25     Median
RockYou     49.20 %   23.02 %   49.85 %    10.53 %    1.90 %    4.00 %
Gmail       40.61 %   16.42 %   36.94 %     8.32 %    1.60 %    3.20 %
Yahoo       50.38 %   25.40 %   50.80 %     7.45 %    1.50 %    3.10 %
MySpace     29.07 %    9.01 %   22.62 %    50.98 %   26.80 %   51.65 %

            Leak D: NoCrack (MPW)          Leak D: Static Markov (MPW)
Leak R      Mean      Q0.25     Median     Mean      Q0.25     Median
RockYou      0.18 %    0.10 %    0.10 %    44.49 %   15.82 %   41.44 %
Gmail        0.12 %    0.10 %    0.10 %     8.28 %    0.10 %    0.10 %
Yahoo        0.11 %    0.10 %    0.10 %     5.32 %    0.10 %    0.10 %
MySpace      0.11 %    0.10 %    0.10 %    11.93 %    0.10 %    0.60 %

[Figure 6.2: six rank-distribution panels (Probability vs. Rank), for the pairs RockYou-RockYou, RockYou-Gmail, RockYou-Yahoo!, RockYou-MySpace, RockYou-NoCrack, and RockYou-Markov.]

Figure 6.2.: Result distribution of the KL divergence experiment for single (unrelated) passwords. The real vault is sampled from RockYou; the decoy vaults are sampled from distribution approximations of real-world passwords and artificial ones, i. e., NoCrack (MPW) and static Markov (MPW).

6.5. Cracking NoCrack

We have seen a first criterion (the KL divergence) for distinguishing real vaults from vaults drawn according to the decoy distribution. Next, we will consider several more criteria that are based on structural differences of the vault.

Correlation and Dependence

As already mentioned, but not examined, by Bojinov et al. [21] and Chatterjee et al. [46], additional data that influences human password choice [236] might be helpful in determining the real vault. In fact, research has shown that background information about the user helps to guess passwords [42, 181] and to answer personal knowledge questions [186, 202]. We evaluated this assumption by using the available usernames or email addresses from PBVault and measuring the effect on the ranking success. We considered an overlap between the username and password as an indicator for the vault being real, but preliminary experiments have shown that this is only a weak indicator. Therefore, we give this factor a small weight compared to the KL divergence, so it essentially resolves ties between two vaults with the same KL divergence. We converted both the username and the password to lowercase and reverted some leetspeak transformations. If the username was a substring of the password, we gave a score of 2; if the edit distance was below a threshold, we gave a score of 1. Results from this experiment are summarized in Table 6.4. They show that the median rank for the real vault in NoCrack is 2.10 %, thus slightly worse than the KL divergence attack result (median of 1.97 %). Note, for easier comparison we also report numbers for the static Markov NLE, which is introduced in Section 6.6. For Markov, we see a decrease in the median ranking result to 7.22 %, compared to the KL divergence attack with a median of 14.24 %.
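The username-password correlation score described above can be sketched as follows; a minimal illustration under our own assumptions: the leetspeak table and the edit-distance threshold are illustrative choices, as the exact values are not specified in the text.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Illustrative leetspeak reversals; the exact table is our assumption.
LEET = str.maketrans('401357$@', 'aolestsa')

def correlation_score(username, password, threshold=3):
    """2 if the normalized username is a substring of the password,
    1 if their edit distance is below a threshold, else 0."""
    u = username.lower().translate(LEET)
    p = password.lower().translate(LEET)
    if u and u in p:
        return 2
    if edit_distance(u, p) < threshold:
        return 1
    return 0
```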

Table 6.4.: Rank results based on four different attacks against entire vaults, where smaller numbers mean a more efficient attack. Note, for easier comparison we also report numbers for the static Markov NLE, which is introduced in Section 6.6. Decoy vaults are chosen from the NoCrack or Markov distribution; real vaults are chosen from the PBVault distribution. We list the results for varying classes of vault sizes.

Correlation Attack (PBVault)
            NoCrack: 30 × 10^6            Static Markov
Vault Size  Mean     Q0.25    Median      Mean      Q0.25    Median
2-3         9.56 %   0.88 %   2.10 %      30.46 %   0.39 %   14.66 %
4-8         6.20 %   1.00 %   2.75 %      25.65 %   0.33 %    8.69 %
9-50        3.41 %   0.53 %   1.88 %      19.31 %   0.12 %    2.10 %
All         6.36 %   0.92 %   2.10 %      25.08 %   0.18 %    7.22 %

Reuse Attack (PBVault)
            NoCrack: 30 × 10^6            Static Markov
Vault Size  Mean     Q0.25    Median      Mean      Q0.25    Median
2-3         9.59 %   0.95 %   2.11 %      31.48 %   0.49 %   16.82 %
4-8         5.97 %   0.96 %   1.94 %      26.86 %   0.17 %    9.72 %
9-50        3.14 %   1.12 %   1.74 %      24.83 %   1.18 %   12.69 %
All         6.21 %   0.99 %   1.99 %      27.76 %   0.39 %   14.28 %

Policy Attack (PBVault)
            NoCrack: 30 × 10^6            Static Markov
Vault Size  Mean     Q0.25    Median      Mean      Q0.25    Median
2-3         3.43 %   0.70 %   1.42 %      20.03 %   0.48 %   15.29 %
4-8         2.44 %   0.72 %   1.34 %      17.47 %   0.18 %    8.96 %
9-50        1.74 %   0.85 %   1.26 %      17.07 %   1.20 %   11.24 %
All         2.54 %   0.80 %   1.37 %      18.31 %   0.38 %   12.82 %

Best Attack (PBVault)
            NoCrack: 30 × 10^6            Static Markov
Vault Size  Mean     Q0.25    Median      Mean      Q0.25    Median
2-3         3.36 %   0.67 %   1.44 %      19.11 %   0.10 %   13.64 %
4-8         2.43 %   0.73 %   1.34 %      16.51 %   0.11 %    7.88 %
9-50        1.53 %   0.10 %   1.01 %      12.69 %   0.15 %    1.53 %
All         2.43 %   0.56 %   1.31 %      16.15 %   0.17 %    6.54 %

Password Reuse

It is well known that users tend to reuse passwords across services. Reported numbers differ and range from around 10 % to around 60 % [11, 54, 77]. Hence, we expect to find this amount of reuse in the vaults as well. NoCrack simulates password reuse by decoding sub-grammars (see Section 6.2.1), and for our Markov model-based NLE we implemented a similar solution. If implemented incorrectly, i. e., if the NLE outputs vaults with an unrealistic amount of password reuse, this can be used as an indicator for real vaults as well. Similar to the correlation feature of Section 6.5, preliminary tests have shown that reuse is a relatively weak indicator; thus, again, we use it with a small weight only, mostly to break ties in the KL divergence. For each vault, we calculate the reuse rate, i. e., given two randomly chosen passwords from the vault, the probability that these two are equal. In addition, we calculated a reuse rate for "similar" passwords, where similarity is measured by the Levenshtein edit distance for thresholds ranging from 1 to 5. This measure has been used before [11] in the context of reuse. Finally, we use a weighted average of these six reuse rates as the final indicator. Results from these experiments are summarized in Table 6.4. We see that the results do not vary greatly. The median rank for the real vault with NoCrack is 1.99 %, thus not improving on the KL divergence attack result (median of 1.97 %). For Markov, we see the same, with a median ranking result of 14.28 % compared to the KL divergence attack with a median of 14.24 %. These results show that both the NoCrack NLE and the Markov NLE accurately simulate the available data in PBVault. However, there is no other data available to cross-check these results, and we expect this attack to perform better on fresh data, which may show different password-reuse behavior.
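The reuse rates described above can be computed from password-pair counts; a minimal sketch, where the threshold-based similarity variant and the (unspecified) weights are our illustrative choices:

```python
from itertools import combinations

def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def reuse_rate(vault, max_distance=0):
    """Fraction of password pairs whose edit distance is at most
    max_distance; max_distance=0 gives the exact-reuse rate."""
    pairs = list(combinations(vault, 2))
    if not pairs:
        return 0.0
    hits = sum(levenshtein(a, b) <= max_distance for a, b in pairs)
    return hits / len(pairs)

def reuse_indicator(vault, weights=(1, 1, 1, 1, 1, 1)):
    """Weighted average of the six reuse rates (thresholds 0..5);
    the uniform default weights are our assumption."""
    rates = [reuse_rate(vault, d) for d in range(6)]
    return sum(w * r for w, r in zip(weights, rates)) / sum(weights)
```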

Password Policies

Many websites enforce password-composition policies on the passwords of their users. These rules differ from site to site and may change over time. Thus, it is difficult for an NLE to create passwords for a specific site that adhere to the imposed rules without "overdoing" it and choosing unrealistically strong passwords. Bojinov et al. [21] surveyed policies for the Kamouflage system and found that the majority of large sites apply a minimum-length criterion, e. g., at least 8 characters, that should be considered by an NLE. We found that Chatterjee et al. [46] reported policy compliance for their UNIF NLE, which builds computer-generated passwords, but not for the main NLE, the vault-generating SG. In the following experiments, we assume that the user has at least one account stored in the vault that requires a minimum password length of 8. If decrypting the vault yields a shorter password for this specific account, we discard this vault as being non-compliant. Results from this experiment are summarized in Table 6.4. They show that the median rank for the real vault in NoCrack is 1.37 %, thus improving on the KL divergence attack result (median of 1.97 %). For Markov, we see the same, with a median ranking result of 12.82 % compared to the KL divergence attack with a median of 14.24 %. In principle, it is possible to prevent attacks based on the violation of password policies. One would need to keep track of password policies for the sites of interest and modify the encoder to only generate compliant passwords. However, this task is complicated by the fact that policies change over time and are not yet available in a machine-readable format [114].
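The policy check reduces to a simple filter over candidate vaults; a minimal sketch, assuming vaults are represented as dicts mapping domains to passwords (the representation and the domain name are ours):

```python
def policy_compliant(vault, domain, min_length=8):
    """A candidate vault cannot be real if the password it decrypts
    to for a policy-enforcing domain violates that domain's known
    minimum-length policy."""
    pwd = vault.get(domain)
    return pwd is not None and len(pwd) >= min_length

def filter_candidates(candidate_vaults, domain, min_length=8):
    """Discard non-compliant candidate vaults before ranking."""
    return [v for v in candidate_vaults
            if policy_compliant(v, domain, min_length)]
```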

6.5.1. Best: Combining the Factors

Finally, we combine the features Policy, Correlation, and KL divergence into an overall classifier. The results of this experiment are summarized in Table 6.4. They show that the median rank for the real vault in NoCrack is 1.31 %, thus improving on the KL divergence attack result (median of 1.97 %). For Markov, we see the same, with a median ranking result of 6.54 % compared to the KL divergence attack with a median of 14.24 %. In Figure 6.3 we depict the summarized attack results against the NoCrack NLE for different training sizes, showing, analogous to Section 6.4.3, how an increased training set improves the ranking across all classifiers.

Figure 6.3.: The median rank of all PBVault vault sizes (2-50).

6.5.2. Further Remarks

Besides the already discussed structural differences of passwords, there are some more criteria that might be considered. First of all, knowing a leaked password from a website might be a great way to distinguish real from decoy vaults. Furthermore, as already mentioned by Chatterjee et al. [46], if a vault is stolen twice, the security falls back to that of a conventional PBE. The lack of real-world sample data does not allow experiments on the correlation of master passwords and corresponding domain passwords. Furthermore, the security pledge of NoCrack is somewhat counterintuitive, as using a more unique (secure) self-chosen domain password facilitates distinguishing. Finally, if a website reports that an entered password was correct in the past, as Facebook does, it might further help an attacker to find the real vault, even if the user changed some domain passwords after the vault was stolen.

6.6. Adaptive NLEs Based on Markov Models

Next, we describe a (static) NLE based on Markov models which has better properties than the PCFG-based NLE from NoCrack. Then we show how to turn this NLE into an adaptive NLE, and how this can improve the resistance against ranking attacks.

6.6.1. Static NLEs Based on Markov Models

Markov models are tools for modeling stochastic processes and are widely used in natural language processing, e. g., for automatic speech recognition. Nowadays, they have established themselves as an important tool for password guessing [66, 158, 170, 215] as well as for measuring password strength [43]. In fact, an NLE based on Markov models was briefly tested by the authors of NoCrack [46] but dismissed in favor of a PCFG-based NLE.

Markov Models

In an n-gram Markov model, one models the probability of the next token in a string based on a prefix of length n − 1. Hence, for a given sequence of tokens c_1, …, c_m, an n-gram Markov model estimates its probability as

   P(c_1, …, c_m) = P(c_1, …, c_{n−1}) · ∏_{i=n}^{m} P(c_i | c_{i−n+1}, …, c_{i−1}).    (6.3)

We use 4-grams, which offer a good trade-off between memory consumption, speed, and accuracy [66, 158], and we use all 95 printable ASCII characters. The required initial probabilities (IP) P(c_1, …, c_3) and transition probabilities (TP) P(c_4 | c_1, …, c_3) are estimated as the relative frequencies in the training data, for which we use the RockYou dataset. We apply simple additive smoothing to account for unseen n-grams. We train individual Markov models for each length in the range of 4 to 12 characters.
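A minimal sketch of such a 4-gram model: training counts the IP and TP tables, and the probability estimate applies additive smoothing. For brevity, the sketch collapses the per-length models into one and the smoothing constant is our choice; it is trained on a toy word list instead of RockYou.

```python
from collections import Counter

N = 4
ALPHA = 0.01  # additive-smoothing constant (our choice)
CHARSET = 95  # printable ASCII characters

def train(passwords):
    """Count initial 3-grams (IP) and 4-gram transitions (TP)."""
    ip, tp = Counter(), Counter()
    for pwd in passwords:
        if len(pwd) < N:
            continue
        ip[pwd[:N - 1]] += 1
        for i in range(N - 1, len(pwd)):
            tp[pwd[i - N + 1:i + 1]] += 1
    return ip, tp

def probability(pwd, ip, tp):
    """P(pwd) under the 4-gram model with additive smoothing."""
    ip_total = sum(ip.values())
    p = (ip[pwd[:N - 1]] + ALPHA) / (ip_total + ALPHA * CHARSET ** (N - 1))
    for i in range(N - 1, len(pwd)):
        prefix = pwd[i - N + 1:i]
        # total count of transitions out of this 3-gram prefix
        prefix_total = sum(c for g, c in tp.items() if g[:N - 1] == prefix)
        p *= (tp[pwd[i - N + 1:i + 1]] + ALPHA) / (prefix_total + ALPHA * CHARSET)
    return p
```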

Encoding of a Password pwd

The encoding is a (probabilistic) mapping from the set of passwords to bit strings. To compute this encoding, we fix an ordering of n-grams (e. g., the alphabetic ordering). For each transition probability, i. e., for each prefix of length 3, the fixed order gives us a partition of the interval [0, 1) into segments whose lengths correspond exactly to the transition probabilities. For a given transition in pwd, we determine the corresponding segment [a, b) (where b − a = P(x_4 | x_1, …, x_3)). From this segment, we sample a uniformly chosen value s. This process is repeated for all transitions in the string pwd, and likewise for the initial probability and for the length of the password. Finally, this process yields a vector S⃗ = (s_1, …, s_{len(pwd)−1}) of len(pwd) − 1 values. The vector S⃗ can be encoded into a binary string using techniques similar to previous work [46].

Decoding of a Vector S⃗

The (deterministic) decoding of a vector S⃗ is straightforward. The first value s_1 determines a length l for the password, after deciding in which segment it falls. In addition, this tells us which Markov model to use. The value s_2 determines the first 3-gram of pwd, and the subsequent values s_3, …, s_{l−1} determine the remaining transition n-grams.
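The segment-based encode/decode pair can be illustrated on a single categorical distribution, i. e., one transition; a toy sketch with a hypothetical three-symbol alphabet:

```python
import random

def segments(dist):
    """Partition [0, 1) into per-symbol segments [a, b) in a fixed
    (sorted) symbol order, with lengths equal to the probabilities."""
    bounds, a = {}, 0.0
    for sym in sorted(dist):
        bounds[sym] = (a, a + dist[sym])
        a += dist[sym]
    return bounds

def encode(sym, dist, rng):
    """Probabilistic encoding: a uniform sample from sym's segment."""
    a, b = segments(dist)[sym]
    return rng.uniform(a, b)

def decode(s, dist):
    """Deterministic decoding: find the segment that s falls into."""
    for sym, (a, b) in segments(dist).items():
        if a <= s < b:
            return sym
    raise ValueError('s outside [0, 1)')
```

Decoding a uniformly random value samples a symbol according to the distribution, which is exactly the property that makes decoy vaults follow the model's distribution under wrong master passwords.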

Handling Vaults

To simulate password reuse in a way similar to NoCrack's SG model, we generate vaults with related passwords. We determine the desired level of password reuse from the vaults in the PBVault set. We measured both exact reuse as well as reuse of similar passwords with a small Levenshtein distance (cf. [11]). The measured reuse rates are (48.52, 9.81, 4.17, 2.74, 2.08, 2.72) for Levenshtein distances of 0 to 5, respectively. Constructing vaults to match a given vector of reuse rates is not trivial, as there is a high level of interaction between similar passwords. We construct vaults by selecting a "base password" and using it for a fraction M_0 of the passwords in the vault (exact reuse). Furthermore, we add fractions M_1, …, M_5 of passwords with a Levenshtein distance of 1, …, 5 to the base password, respectively. The remainder of the vault is filled up with unrelated passwords. All these passwords relate to each other, so the actual fraction of passwords with an edit distance of 1 will in general be higher than M_0. We empirically determined values M⃗ = (M_i) such that the reuse rates match the empirical results given above. We used values M⃗ = (0.66, 0.06, 0.02, 0.01, 0.015), adding Gaussian noise with σ² of (0.06, 0.034, 0.008, 0.004, 0.012), respectively. The related passwords are determined by modifying the last transition probability, which models most of the modified reuse found in practice [54]. More sophisticated approaches can be tested in real-world implementations, for example, by considering more than the last n-gram position and more precisely simulating user behavior [232, 236].

6.6.2. Baseline Performance

First, we determine how well this static NLE performs. To this end, we rerun the experiments based on the KL divergence.

Kullback–Leibler Divergence Attack

We first describe the results for entire vaults. The setup is similar to the one described in Section 6.4.3, i. e., we choose real vaults from the PBVault list and decoy vaults according to the distribution generated by the Markov model. For determining the reference distribution, we slightly deviate from the previous approach of sampling the distribution empirically. There are two reasons for this. First, for Markov models it is easy to extract an explicit description of the probability distribution from the code, namely by copying the IP and TP tables. This information is more accurate than an approximation based on sampling and thus preferable. Second, it turned out that the probability distribution generated by Markov models is much more "spread out" than the distribution generated by NoCrack, which is concentrated on fewer values. (For illustration, we sampled 1.5 M passwords from both distributions. We obtained 250 k unique passwords for NoCrack and 1.25 M unique passwords for Markov.) The results are shown in Table 6.5; comparable results for NoCrack can be found in Table 6.1. We see that the Markov NLE is substantially more robust against this attack, with an average rank of 27.8 % (NoCrack: 6.2 %) and a median of 14.2 % (NoCrack: 2.0 %). Interestingly, for the weak vaults there is no big difference: Q0.25 is 0.4 % and 1.0 %, respectively. The results for artificial vaults (independently chosen, size 10) can be found in Table 6.3. Here we see that Markov performs similarly to NoCrack, with a median of 0.1 %, and only slightly better for the mean. The only exception is the comparison with RockYou, for which it performs relatively close to random. For the other lists, the median is 0.1 % each, equal to NoCrack, only the mean being slightly better with values between 5.3 % and 11.9 %.

Table 6.5.: Rank results based on a KL divergence attack on entire vaults, where smaller numbers mean a more efficient attack. Decoy vaults are chosen from the static Markov or adaptive Markov distribution; real vaults are chosen from the PBVault distribution. For better comparability to previous work [46], we list results for varying classes of vault sizes.

KL Divergence Attack (PBVault)
            Static Markov                 Adaptive Markov
Vault Size  Mean      Q0.25    Median     Mean      Q0.25     Median
2-3         31.50 %   0.45 %   16.83 %    42.21 %   12.66 %   33.01 %
4-8         26.88 %   0.17 %    9.71 %    39.55 %   11.36 %   32.32 %
9-50        24.81 %   1.21 %   12.49 %    38.63 %    4.46 %   36.83 %
All         27.77 %   0.39 %   14.24 %    40.12 %    9.12 %   35.14 %

Machine Learning Attacks

We also re-created the original attack based on machine learning in order to check how well the Markov NLE fares against it. The NoCrack paper gives only limited details. In the full version [46] of the paper, the authors report that their best-performing ML engine was a Support Vector Machine (SVM) with a radial basis function kernel. They constructed four SVM-based classifiers, one for each of the following feature vectors: repeat count (including numbers for the uniqueness of passwords, leetspeak transformations, capitalization, and tokens within a vault); edit distance (including numbers for password pairs of different edit distances); n-gram structure (comprising percentage frequencies of the most popular n-grams, to characterize token reuse); combined (a combination of the three features above). We re-created the classifier with the information available and some deliberate tuning of the parameters, obtaining a classifier that shows performance similar to the original one. The results both for vaults and artificial vaults (called MPW in NoCrack's terminology) are shown in Table 6.6. To facilitate the comparison between the original SVM and our re-implementation, we list the results reported for NoCrack by Chatterjee et al. [46] as well. We see that both NoCrack and Markov perform very similarly against this classifier and that the values are very close to the reported ones for vaults and similar for MPW. However, this is mostly a sign of the ineptness of the used features, as we have seen that the KL divergence is substantially better at ranking than this classifier.

Table 6.6.: Ranking results for the re-created ML classifier, for NoCrack and static Markov, both with MPW and SG. To facilitate the comparison between the original SVM and our re-implementation, we list the results reported for NoCrack by Chatterjee et al. [46] as well.

            ML Single (MPW)                 ML Vaults (SG)
Feature     [46]     NoCr.    S. Mark.     [46]      NoCr.     S. Mark.
Repeat ct.  0.60 %   4.71 %   3.43 %      37.40 %   38.45 %   45.30 %
Edit dist.  1.20 %   1.71 %   1.28 %      41.40 %   35.52 %   35.10 %
n-gram      0.60 %   2.83 %   2.13 %      38.50 %   38.90 %   32.72 %
Combined    1.00 %   0.54 %   0.40 %      39.70 %   37.80 %   40.91 %

6.6.3. Adaptive Construction

The basic idea of the adaptive construction is to modify the n-gram model such that it does not assign very low probabilities to passwords that actually appear in the vault. Otherwise, the appearance of a very improbable password in a candidate vault would be a strong signal for the real vault. Therefore, we modify the transition probabilities of an n-gram model as follows: (i) For each password occurring in the vault, we choose one n-gram from that password at random and increase its probability by multiplying it by 5. (ii) For all remaining n-grams, we increase the probability by a factor of 5 with a probability of 20 %. (iii) Finally, we re-normalize all probabilities. Here, the constants 5 and 20 % were determined empirically to work well and provide reasonable security guarantees (see below). In the next subsection, we establish that the resulting (adaptive) NLE prevents online-verification attacks much better than previously seen (static) NLEs, and we discuss the security implications of the adaptive property in the subsequent section.
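Steps (i)-(iii) can be sketched as follows; a minimal illustration operating on a single flat table of n-gram probabilities, whereas the real model boosts entries of the per-prefix TP tables:

```python
import random

def boost(probs, vault, n=4, factor=5, p_random=0.2, rng=None):
    """Adaptively re-weight n-gram probabilities:
    (i)   boost one random n-gram of every vault password,
    (ii)  boost each remaining n-gram with probability p_random,
    (iii) re-normalize. N-grams absent from `probs` are skipped."""
    rng = rng or random.Random()
    boosted = dict(probs)
    vault_grams = set()
    for pwd in vault:
        grams = [pwd[i:i + n] for i in range(len(pwd) - n + 1)]
        vault_grams.add(rng.choice(grams))
    for g in boosted:
        if g in vault_grams or rng.random() < p_random:
            boosted[g] *= factor
    total = sum(boosted.values())
    return {g: pr / total for g, pr in boosted.items()}
```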

6.6.4. Performance of the Adaptive NLE

To evaluate the performance of the adaptive NLE, we rerun the same experiment as before using the KL divergence and the PBVault dataset. The results are summarized in Table 6.5. They show that the real vault cv_real is ranked on average among the 40.12 % most likely vaults, thus increasing the amount of online guessing substantially. Note specifically that the 1st quartile rose from 0.39 % for the static Markov NLE to 9.12 % for the adaptive Markov NLE. We tested several other boosting constants (2, 4, 5, 6, 8, 10), which resulted in mean values of (33.71 %, 39.38 %, 40.12 %, 40.36 %, 41.56 %, 43.4 %). We considered 5 to be suitable, as beyond that the improvements are small.

6.6.5. Security of the Adaptive NLE

We have to assume that an attacker is able to determine which n-grams have been boosted in the process, either because the attacker knows which corpus the original n-gram model has been trained on, or because the attacker is able to notice deviations from a "normal" distribution. In this case, it might be possible to infer information about the passwords stored in the vault. (In fact, if we only boosted n-grams for passwords in the vault and omitted the random boosting of other n-gram probabilities, this would give rise to an easy and very efficient attack.) Next, we will show that the information that an attacker can infer is very limited. Let B be the set of those n-grams that have been boosted (and assume this set is known to the attacker). Consider a password pwd, which might or might not be in the vault, and let N be the set of n-grams that pwd contains. Depending on whether pwd is in the vault or not, the size of the intersection N ∩ B will change (it will be larger on average if pwd is in the vault, as in this case |N ∩ B| is guaranteed to be at least 1). We consider the influence that learning the value i := |N ∩ B| has on the probability of seeing a password pwd_0. The ratio between

P(pwd = pwd_0) and P(pwd = pwd_0 | i = i_0) can be estimated as follows, using Bayes’ rule, writing f(k; n, p) for the probability mass function of the Binomial distribution, len(pwd) for the length of a password, and p = 0.2:

    P(pwd = pwd_0 | i = i_0) / P(pwd = pwd_0)
        = P(i = i_0 | pwd = pwd_0) / P(i = i_0)                          (6.4)
        = f(i_0 − 1; len(pwd) − 3, p) / f(i_0; len(pwd) − 2, p)
        < 1 / (1 − p) = 1.25

In other words, even an adversary knowing the exact set of boosted n-grams can increase the estimate of the probability of the correct password by a factor of at most 1.25, which has a very limited effect on the guessing behavior in an online guessing attack. The exact influence on password guessing depends on the precise distribution of passwords, or more specifically on the attacker’s belief about the password distribution, and is thus hard to quantify precisely.
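The limited leak can be illustrated numerically. The following sketch is a toy illustration (not code from the thesis); the password length, and hence the number of n-grams m = len(pwd) − 2, is an assumed example value. It compares the distribution of i = |N ∩ B| for a password that is in the vault (one n-gram boosted for sure, the remaining m − 1 boosted independently with p = 0.2) with that of a password that is not (all m n-grams boosted with p = 0.2):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability mass function f(k; n, p) of the Binomial distribution."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def dist_in_vault(m, p):
    """Distribution of i = |N ∩ B| if pwd is in the vault:
    one n-gram is boosted for sure, the other m - 1 with probability p."""
    return {i: binom_pmf(i - 1, m - 1, p) for i in range(1, m + 1)}

def dist_not_in_vault(m, p):
    """Distribution of i if pwd is not in the vault:
    each of its m n-grams is boosted independently with probability p."""
    return {i: binom_pmf(i, m, p) for i in range(0, m + 1)}

m, p = 6, 0.2          # e.g., an 8-character password has 6 trigrams
d_in = dist_in_vault(m, p)
d_out = dist_not_in_vault(m, p)

mean_in = sum(i * q for i, q in d_in.items())    # 1 + (m - 1) * p = 2.0
mean_out = sum(i * q for i, q in d_out.items())  # m * p = 1.2
```

As the text states, |N ∩ B| is larger on average for an in-vault password (2.0 vs. 1.2 in this example), but the two distributions overlap heavily, so observing i reveals little about vault membership.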

6.6.6. Limitations of the Adaptive NLE

Finally, we discuss some limitations of adaptive NLEs, and of the Markov-based adaptive NLE in particular. Adaptive NLEs demonstrate an interesting direction to overcome fundamental limitations of static NLEs, as we have demonstrated in Section 6.4. However, more work is required to better understand the mechanisms for providing adaptive NLEs and to quantify their security guarantees. Our method for implementing adaptive NLEs based on n-gram models is a first step toward realizing adaptive NLEs. The technique is straightforward, but better methods may exist. Note that we are unaware of an easy and promising way to base adaptive NLEs on PCFGs. The parameters that we used were determined empirically and seem to work well, but a more systematic treatment may reveal parameters with better overall performance. Altogether, we consider adaptive NLEs to be work in progress.

So far, we have considered vaults that do not change over time. If a new password is added to a vault, one possible approach is to re-encode the entire vault, as described in Section 6.6.3. Then the construction is vulnerable to the same intersection attack as NoCrack: given the same vault before and after adding a new password, the correct master password will decrypt both vaults to the same set of passwords, whereas a wrong master password will, with high probability, decrypt to different decoys. It is unclear how this problem can be avoided, both for static and adaptive vaults.
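The intersection attack on vault updates can be simulated with a toy model. Here the decoder is a hypothetical stand-in (a deterministic, hash-based decoy generator, not NoCrack’s actual construction, and all names and values are made up): a wrong master password decodes the two snapshots to unrelated decoy sets, while the correct one decodes them consistently, so only the correct candidate survives the subset test.

```python
import hashlib

REAL_VAULT_V1 = {"hunter2", "letmein", "qwerty12"}
REAL_VAULT_V2 = REAL_VAULT_V1 | {"newpass9"}   # one password was added
CORRECT_MPW = "correct horse"

def decoy_set(snapshot_id, mpw, size):
    """Stand-in decoder: derives a deterministic decoy set from the
    ciphertext snapshot and the master-password candidate."""
    pwds = set()
    i = 0
    while len(pwds) < size:
        h = hashlib.sha256(f"{snapshot_id}|{mpw}|{i}".encode()).hexdigest()
        pwds.add(h[:8])
        i += 1
    return pwds

def decrypt(snapshot_id, real_set, mpw):
    """Correct master password yields the real set; any other
    candidate yields a plausible-looking decoy set."""
    if mpw == CORRECT_MPW:
        return set(real_set)
    return decoy_set(snapshot_id, mpw, len(real_set))

candidates = ["123456", "password", "correct horse", "iloveyou"]

# Intersection attack: only the correct master password decodes the
# earlier snapshot to a subset of the later one.
survivors = [mpw for mpw in candidates
             if decrypt("v1", REAL_VAULT_V1, mpw)
                <= decrypt("v2", REAL_VAULT_V2, mpw)]
```

For wrong candidates, the decoys of the two snapshots are independent, so the subset relation fails with overwhelming probability; the attack thus isolates the correct master password from just two vault snapshots.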

6.7. Conclusion

There are various attacks against cracking-resistant vaults, of which distribution-based attacks are only one possible class. We showed that the proposed NoCrack NLE, which is based on a PCFG model, is too simple. We highlighted that the inability of the applied SVM-based machine-learning engine to distinguish real from decoy vaults does not serve as a lower-bound security guarantee. Rather, we provided a distribution-based attack, utilizing KL divergence, that can distinguish real from decoy vaults. Additionally, we described further issues that need to be considered for the construction of a well-performing NLE. Next, we demonstrated that our proposed n-gram model outperforms the PCFG-based solution. Moreover, we introduced the notion of adaptive NLEs, where the generated distribution of decoy vaults depends on the actual passwords stored in the vault. This makes it unnecessary to “predict” changes in password distributions over time, an inherent flaw of static NLEs. Unfortunately, the lack of real-world statistics and sample data on vaults remains an obstacle for research on vault security.

Everything is going to be fine in the end. If it’s not fine it’s not the end.

— Oscar Wilde

7 Summary and Future Work

Contents

7.1 Summary and Key Results ...... 190
7.2 Directions for Future Work ...... 192
7.3 Concluding Remarks ...... 195

7.1. Summary and Key Results

In this thesis, we analyzed security and usability problems and proposed solutions for four different aspects of user authentication.

In Chapter 3, we described a new authentication system, called MooneyAuth, that utilizes implicit learning for authentication. In comparison to previous work, our proposal shows a high performance even over longer time spans of up to 8.5 months, which makes it particularly suitable for fallback authentication. While the scheme still has usability issues, such as the long authentication time owed to the labeling task and potential interference issues, it represents a step toward utilizing implicit memory for user authentication and has inspired others.

In Chapter 4, we investigated the accuracy of password strength meters. In particular, we identified a list of requirements that similarity measures must fulfill. By applying synthetic disturbances to a password distribution, we tested a list of 19 metrics and recommended the use of a weighted rank-based similarity measure to counter monotonic transformations and quantization. We also showed that a limited number of approximately 1000 to 10 000 samples is enough to estimate the accuracy of a meter. In a large-scale evaluation, we compared strength meters from websites, password managers, and operating systems, as well as measurements from previous work. We found that academic proposals offer high accuracy, and confirmed previous work that reported on the low accuracy of many website meters. Our analysis, the provided guidance, and the developed tools can help system designers and meter developers to improve the current situation.

In Chapter 5, we explored password-reuse notifications. We conducted two complementary user studies about password-reuse notifications. In the first study, we chose six notifications sent by real companies and surveyed participants about what actions they might take in response.
We found that respondents misattributed the potential root cause of receiving such notifications. Based on respondents’ perceptions, we developed five design goals for password-reuse notifications and conducted a follow-up study to analyze 15 variants of a model password-reuse notification. While respondents perceived these notifications as official and urgent, 55 % of the respondents nonetheless misattributed the root cause of receiving these notifications. Although nearly 90 % of the respondents stated intentions to change their passwords, about 60 % reported plans to create these “new” passwords by reusing other passwords of theirs, leaving them vulnerable to similar attacks in the future. Based on these findings, we established five best practices for maximizing the effectiveness of the notification. However, we also reasoned about why we believe notifications on their own may not be sufficient and discussed other measures for holistically addressing password reuse.

In Chapter 6, we analyzed the security of a cracking-resistant password vault. To be exact, we analyzed NoCrack, an instance of a cracking-resistant password vault that uses a Natural Language Encoder (NLE) to generate plausible-looking decoy vaults on the fly for each (wrong) master password candidate. We showed that one can distinguish real from decoy vaults with high accuracy, based on the difference in the distribution of the passwords. We used the Kullback–Leibler divergence as a measure of similarity to rank the correct vault significantly higher, achieving a median rank of 1.97 % for real-world vaults. We then evaluated additional signals, such as passwords containing the username, password reuse, and composition policies, that enabled us to distinguish real from decoy vaults even better. As a possible solution, we proposed the notion of adaptive NLEs, where the generated distribution of decoy vaults depends on the actual values stored in the vault.
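The core of the distribution-based attack can be sketched as follows. This is a deliberately simplified toy version (character unigrams instead of the n-gram models used in the thesis; the alphabet, vault sizes, and decoy generator are made-up values): candidate vaults whose empirical password distribution diverges most from the decoy-generating model are ranked as the most likely real vaults.

```python
import math
import random

ALPHABET = "abcdefghij"
MODEL = {c: 1 / len(ALPHABET) for c in ALPHABET}  # decoy-generating model

def kl_divergence(vault, model):
    """KL divergence between a vault's empirical character
    distribution and the model distribution."""
    counts, total = {}, 0
    for pwd in vault:
        for c in pwd:
            counts[c] = counts.get(c, 0) + 1
            total += 1
    return sum((n / total) * math.log((n / total) / model[c])
               for c, n in counts.items())

def random_decoy(rng, n_pwds=5, length=8):
    """Sample a decoy vault directly from the model distribution."""
    return ["".join(rng.choice(ALPHABET) for _ in range(length))
            for _ in range(n_pwds)]

rng = random.Random(0)
decoys = [random_decoy(rng) for _ in range(50)]
real_vault = ["aaaaaaab"] * 5   # heavy password reuse -> skewed distribution

# Rank all candidate vaults by divergence, most divergent first.
candidates = [real_vault] + decoys
ranking = sorted(candidates, key=lambda v: kl_divergence(v, MODEL),
                 reverse=True)
rank_of_real = ranking.index(real_vault)   # 0 = most suspicious
```

In this toy setting, the reuse-induced skew of the real vault makes it the most divergent candidate. The actual attack uses full n-gram models and the additional signals mentioned above, but the ranking principle is the same.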

In summary, we consider the contributions made in those four areas to be helpful in building secure and usable user authentication systems. By applying different methodologies, such as conducting user studies and running large-scale measurements, we contribute towards solving issues in different stages of research. From basic research (implicit memory-based authentication and cracking-resistant vaults) to applied research (password strength meters and password-reuse notifications), our work contributes to holistically addressing issues by considering the problems of users and system developers.

7.2. Directions for Future Work

During our research, we created a number of different research artifacts, including websites, prototypes, code, and datasets. To facilitate replication, reproducible results, and further improvements, we shared the majority of those artifacts. While working on the four topics presented in this thesis, we identified new research directions that will be interesting to explore in the future.

Reducing the Authentication Time in MooneyAuth A usability issue of the MooneyAuth scheme is the time required to label the images. As noted before, timing and workload are not as crucial for fallback authentication as they are for primary authentication. Still, they are a factor when it comes to real-world deployments. In the MooneyAuth study, where users were asked to label up to 20 images, we observed that the average authentication took 3.5 minutes. Moreover, typing labels on a keyboard is error-prone (which we addressed) and is also considered tedious by users. In future work, one could explore alternative user interfaces to avoid the labeling task altogether. For this, one might test alternatives that rely on the recognition of images given various cues. Moreover, one could also investigate a variant where users label the images by selecting from a list of predefined labels.

Investigate the Influence of Accuracy in a Strength Meter While we argue that accuracy is one of the driving factors for the performance of password strength meters, its influence on password strength is so far not well understood. In theory, inaccurate meters can be harmful, as they misguide the user, but the question of whether it is the accuracy or the mere presence of a meter that improves password strength remains unexplored. Observations by Ur et al. [230] and Aviv et al. [10] about users’ perception of strength raise questions about the influence of accuracy on password selection. Moreover, they motivate exploring a strength meter that follows perceived instead of actual strength, which could help to actively reduce weak authentication choices while not wasting users’ cognitive effort in the “don’t care” region of password security [78].
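The accuracy notion discussed here (cf. Chapter 4) builds on a weighted rank-based similarity measure. The following sketch is a simplified illustration, not the thesis’s exact evaluation code; the tie handling and the example weights are assumptions. It computes a weighted Spearman correlation between a meter’s strength estimates and a reference strength, which, being rank-based, is invariant under monotonic transformations of either score:

```python
def ranks(values):
    """1-based average ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def weighted_pearson(x, y, w):
    """Pearson correlation with per-sample weights."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sw
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / sw
    return cov / (vx * vy) ** 0.5

def weighted_spearman(x, y, w):
    """Weighted rank correlation: Pearson on ranks, weighted, e.g.,
    by how often an attacker would try the respective password."""
    return weighted_pearson(ranks(x), ranks(y), w)

# A meter that applies a monotonic transform (here: cubing) to the
# reference strength still obtains a perfect score:
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
meter = [s ** 3 for s in reference]
weights = [5, 4, 3, 2, 1]   # emphasize weak, frequently guessed passwords
score = weighted_spearman(reference, meter, weights)
```

A perfectly anti-correlated meter would score −1.0, while quantization into a few bins only reduces the score moderately thanks to the tie handling.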

Improving Browser-Based Password Managers We noted that password reuse is an ecosystem-level problem that requires holistic solutions. A key component is the password managers that are integrated into web browsers, as they can observe the full spectrum of a user’s passwords. As noted by Lyastani et al. [157], it is essential that existing passwords get replaced by new, strong, random strings. Recent changes to Apple’s Safari and Google’s Chrome browser [242] help users to choose secure passwords for their new accounts. However, for existing accounts, there are no mechanisms to replace vulnerable passwords. Moreover, current interfaces do not support users enough in preventing password reuse. Future work should thus investigate how password managers and browsers can be more explicit in preventing password reuse while maintaining a positive user experience.

Usability Aspects of a Cracking-Resistant Password Vault Security-wise, cracking-resistant vaults are better than traditional vaults. Even in the worst case, where a flaw exists that allows an attacker to distinguish between real and decoy vaults, a cracking-resistant vault would still provide the same level of security as a traditional vault. In this case, a strong master password and a slow key derivation function are most often the only limiting factors preventing a successful attack. However, even if one assumes a sufficiently secure cracking-resistant vault can be built, one must consider its usability implications. If an attacker cannot tell the difference between the real and the decoy vault, how can the legitimate user? Worse, if a new credential is accidentally saved to a decoy vault, the user will lose access to the account. As cracking-resistant vaults are still in the early stages of their development, more work on their usability aspects is required. At the same time, building a cracking-resistant vault that only stores indistinguishable random strings could be the first step toward a real-world deployment.

Outlook Besides those concrete recommendations, we would also like to outline other aspects of user authentication that we imagine will be interesting future directions.

Over the past few years, we observed the deployment of risk-based authentication (RBA) mechanisms by some larger services that try to detect suspicious account activity [63, 177]. As RBA is based on soft factors, such as browser fingerprinting and user behavior, services do not share many details about the inner workings of those systems [39, 79]. Studying these continuously adapting and learning systems is a challenging task and potentially error-prone, e. g., it might require labeled authentication attempts, a multitude of “used” real-world accounts, or actual behavior of users interacting with the service [249]. We suggest exploring RBA systems to learn more about their security benefits and issues, such as their authentication performance, false positive rates, and related usability problems.

Another question is how services can help each other to counter online guessing attacks via collaboration. We imagine that password-reuse attacks, in particular, are easier to prevent if services share some knowledge about ongoing attacks. Related to this question is a first proposal by Wang and Reiter [241] that suggests a private set-membership test protocol that allows checking whether a user has a similar password at another service. Given a global view on the current attack landscape, we imagine it will be possible to limit the success rate of ongoing guessing attacks.

Finally, future research should consider dealing with legacy issues, caused by security practices and advice that can, at best, be rated outdated, to be one of the main challenges. In the long term, researchers should investigate users’ mental models and find ways to address their misconceptions. In particular, misconceptions about password strength and attacker capabilities should be investigated.

7.3. Concluding Remarks

Considering the number of breached services and leaked credentials, one might think the sooner passwords are replaced, the better. Until then, we have observed strategies adopted by system designers in an effort to secure this decades-old ecosystem. With new proposals like W3C’s Web Authentication [156], vendors and services have started to explore a password-less future for the Web that relies on biometrics [47] and hardware tokens such as security keys and smart devices [101]. It continues to be our responsibility as researchers to improve existing and develop new, usable, deployable, and secure user authentication systems for the years to come.

List of Figures

3.1 Example Mooney image ...... 36
3.2 User interface during authentication ...... 47
3.3 Priming effect comparison ...... 52
3.4 Priming effect decline over time ...... 55
3.5 Distribution of static/dynamic scores ...... 58
3.6 Priming effect example images ...... 62
3.7 Example Mooney image and the corresponding original ...... 65

4.1 Histogram of a monotonic transformation error ...... 89
4.2 Quantization effect of different PSMs ...... 100
4.3 Distributions of strength estimations ...... 105

5.1 Rebranded example notification ...... 116
5.2 Respondents’ priority of taking actions ...... 120
5.3 Sentiment analysis of respondents’ reported feelings ...... 121
5.4 Our model password-reuse notification ...... 129
5.5 Respondents’ intentions for creating new passwords ...... 137
5.6 Respondents’ password changing strategies ...... 138
5.7 Respondents’ perceptions regarding actions’ effectiveness ...... 142
5.8 The most effective password-reuse notification ...... 148

6.1 Design of NoCrack (simplified) ...... 159
6.2 KL divergence distributions for single passwords ...... 173
6.3 Attack results summary (by vault size) ...... 178

C.1 Study 1 Facebook notification ...... 219
C.2 Study 1 Google email notification ...... 220
C.3 Study 1 Google red bar notification ...... 221
C.4 Study 1 Instagram notification ...... 222
C.5 Study 1 LinkedIn notification ...... 223
C.6 Study 1 Netflix notification ...... 224
C.7 Study 2 model-{mobile} notification ...... 225
C.8 Study 2 model-{inApp} notification ...... 226

List of Tables

3.1 Statistics on duration and average event probability ...... 50
3.2 Statistics on the timing for the labeling task ...... 52
3.3 Performance of the scheme ...... 59
3.4 Statistics on the overall timing ...... 60

4.1 Overview of evaluated datasets ...... 75
4.2 (Weighted) Correlation Metrics ...... 79
4.3 (Weighted) Mean Error Metrics ...... 82
4.4 (Weighted) One-Sided/Pairwise Error Metrics ...... 83
4.5 Confidence intervals for different sample sizes ...... 88
4.6 Performance of academic strength meter proposals ...... 97

5.1 Characteristics of the six tested notifications ...... 115
5.2 The varied notification dimensions ...... 130
5.3 Comparison of real-world password-reuse notifications ...... 149

6.1 Rank results for entire vaults (by vault sizes) ...... 169
6.2 Rank results for entire vaults (by sample sizes) ...... 170
6.3 Rank results for artificial vaults ...... 172
6.4 Rank results for entire vaults (using additional criteria) ...... 175
6.5 Rank results for entire vaults (static vs. adaptive NLE) ...... 183
6.6 Ranking results (ML classifier) ...... 184

B.1 Online use case: Academia, PW Managers, OSs ...... 206
B.2 Offline use case: Academia, PW Managers, OSs ...... 207
B.3 Online use case: Websites and Previous Work ...... 208
B.4 Offline use case: Websites and Previous Work ...... 209

A Password Recovery

Appendices for Chapter 3, based on the publication:

C. Castelluccia, M. Dürmuth, M. Golla, and F. Deniz, “Towards Im- plicit Visual Memory-Based Authentication,” in Symposium on Net- work and Distributed System Security (NDSS ’17). San Diego, Cali- fornia, USA: ISOC, Feb. 2017.

Includes:

• Demographics and Questionnaire:
  – Pre-Study and Long-term Study
  – MooneyAuth Study

A.1. Survey Data: Pre-Study and Long-term Study

                                              1st batch    2nd batch    3rd batch
                                              (9 days)     (25 days)    (264 days)
                                              No.     %    No.     %    No.     %
Age                                           97  100.0    129  100.0   124  100.0
  20-30                                       61   62.9     66   51.2    69   55.6
  31-40                                       27   27.8     40   31.0    38   30.6
  41-49                                        6    6.2     14   10.9    12    9.7
  50+                                          3    3.1      9    7.0     5    4.0
Gender                                        97  100.0    129  100.0   124  100.0
  male                                        81   83.5    101   78.3    97   78.2
  female                                      16   16.5     27   20.9    27   21.8
  other                                       --     --      1    0.8    --     --
Country                                       97  100.0    129  100.0   124  100.0
  France                                      40   41.2     55   42.6    54   43.5
  Germany                                     41   42.3     44   34.1    45   36.3
  other                                       16   16.5     30   23.3    25   20.2
Native English speaker                        97  100.0    129  100.0   124  100.0
  Yes                                          6    6.2      4    3.1     7    5.6
  No                                          91   93.8    125   96.9   117   94.4
Profession                                    97  100.0    129  100.0   124  100.0
  Administration                               4    4.1      3    2.3     4    3.2
  Arts                                        --     --     --     --    --     --
  Engineering                                 38   39.2     55   42.6    52   41.9
  Humanities                                  --     --      2    1.6    --     --
  Life science                                 1    1.0      1    0.8     1    0.8
  Science                                     53   54.6     66   51.2    65   52.4
  other                                        1    1.0      2    1.6     2    1.6
Heard of Mooney images                        97  100.0    129  100.0   124  100.0
  Worked with before                          --     --     --     --    --     --
  Heard of before                             10   10.3     11    8.5     8    6.5
  none                                        87   89.7    118   91.5   116   93.5
Passwords are easy to remember                97  100.0    129  100.0   124  100.0
  Strongly agree                               4    4.1      3    2.3     4    3.2
  Agree                                       30   30.9     32   24.8    36   29.0
  Neither agree nor disagree                  33   34.0     46   35.7    38   30.6
  Disagree                                    27   27.8     39   30.2    39   31.5
  Strongly disagree                            3    3.1      9    7.0     7    5.6
Passwords are secure                          97  100.0    129  100.0   124  100.0
  Strongly agree                               2    2.1      2    1.6     2    1.6
  Agree                                       29   29.9     35   27.1    37   29.8
  Neither agree nor disagree                  36   37.1     38   29.5    37   29.8
  Disagree                                    25   25.8     42   32.6    38   30.6
  Strongly disagree                            5    5.2     12    9.3    10    8.1
Mooney images are interesting to work with    96  100.0    129  100.0   124  100.0
  Strongly agree                               8    8.3      5    3.9     9    7.3
  Agree                                       46   47.9     70   54.3    57   46.0
  Neither agree nor disagree                  36   37.5     39   30.2    48   38.7
  Disagree                                     6    6.3     13   10.1     9    7.3
  Strongly disagree                           --     --      2    1.6     1    0.8
Using Mooney images is funny                  97  100.0    129  100.0   124  100.0
  Strongly agree                               6    6.2      7    5.4     7    5.6
  Agree                                       37   38.1     52   40.3    42   33.9
  Neither agree nor disagree                  40   41.2     56   43.4    59   47.6
  Disagree                                    13   13.4     10    7.8    14   11.3
  Strongly disagree                            1    1.0      4    3.1     2    1.6

A.2. Survey Data: MooneyAuth Study

                                              (21 days)
                                              No.   Percent
Age                                           70    100.0
  20-29                                       39     55.7
  30-39                                       22     31.4
  40-49                                        6      8.6
  50-59                                        2      2.9
  60+                                          1      1.4
Gender                                        70    100.0
  male                                        54     77.1
  female                                      15     21.4
  other                                        1      1.4
Nationality                                   70    100.0
  France                                      29     41.4
  Germany                                     12     17.1
  USA                                          9     12.9
  other                                       20     28.6
Country you are completing this in            70    100.0
  France                                      37     52.9
  USA                                         16     22.9
  Germany                                     13     18.6
  other                                        4      5.7
Native English speaker                        70    100.0
  Yes                                         10     14.3
  No                                          60     85.7
Profession                                    70    100.0
  Arts                                         2      2.9
  Business                                     2      2.9
  Engineering                                 24     34.3
  Humanities                                   1      1.4
  Life science                                 6      8.6
  Science                                     35     50.0
  other                                       --       --
Heard of Mooney images                        70    100.0
  Worked with before                           3      4.3
  Heard of before                             18     25.7
  none                                        49     70.0
Mooney images are easy to remember            70    100.0
  Strongly agree                               1      1.4
  Agree                                       25     35.7
  Neither agree nor disagree                  35     50.0
  Disagree                                     9     12.9
  Strongly disagree                           --       --

B Password Strength

Appendices for Chapter 4, based on the publication:

M. Golla and M. Dürmuth, “On the Accuracy of Password Strength Meters,” in ACM Conference on Computer and Communications Se- curity (CCS ’18). Toronto, Ontario, Canada: ACM, Oct. 2018, pp. 1567–1582.

Includes:

• Accuracy Comparison:
  – Academia, Password Managers, and Operating Systems
  – Websites and Previous Work

B.1. Comparison: Academia, PW Managers, and OSs

Table B.1.: PSM accuracy (wSpear.) online use case.

                                                             Online Attacker
ID   Meter                          T.  Q.  Visu.            RY      LI      WH

Academic Proposals
1A   Comprehensive8 [231]           C   -   -               -0.652  -0.589   0.251
1B   Comprehensive8 [231]           C   Q5  Text            -0.331  -0.084   0.409
2    Eleven [68]                    C   -   -                0.670   0.912   0.492
3    LPSE [102]                     C   Q3  -                0.584   0.669   0.508
4A   Markov (OMEN) [43]             S   -   -                0.721   0.697   0.410
4B   Markov (Single) [87]           S   -   -                0.718   0.998   0.817
4C   Markov (Multi) [87]            S   -   -                0.721   0.998   0.902
5A   NIST [38]                      C   -   -                0.670   0.912   0.492
5B   NIST (w. Dict.) [38]           C   -   -                0.669   0.910   0.472
6    PCFG (fuzzyPSM) [239]          S   -   -                1.000   0.994   0.963
7A   RNN Generic [162]              C   -   -                0.632   0.542   0.427
7B   RNN Generic (Web) [229]        C   -   -                0.473   0.649   0.421
7C   RNN Target [162]               C   -   -                0.951   0.913   0.965
7D   RNN Target (w. Blo.) [162]     C   -   -                0.951   0.913   0.965
8A   zxcvbn (Guess No.) [248]       C   -   -                0.989   0.990   0.554
8B   zxcvbn (Score) [248]           C   Q5  -                0.341   0.490   0.359

Password Managers
9A   1Password (Web)                C   -   Bar              0.276   0.433   0.441
9B   1Password (Web)                C   Q5  Text(Int.)       0.276   0.433   0.407
10A  Bitwarden (Web)                C   -   Bar             -0.635  -0.490   0.418
10B  Bitwarden (Web)                C   Q3  Text(Int.)       0.258   0.372   0.494
11   Dashlane 5.5 (Windows)         C   -   Text             0.686   0.785   0.241
12   Enpass 5.6.8 (Windows)6        C   Q5  Bar a. Text[Z]   0.341   0.490   0.359
13A  KeePass 2.38 (Windows)         C   -   Bar              0.856   0.785   0.393
13B  KeePass 2.38 (Windows)         C   Q5  Text             0.000   0.000   0.045
14A  Keeper (Web)                   C   Q5  Bar              0.200   0.258   0.400
14B  Keeper (Web)                   C   -   Score(Int.)      0.805   0.719   0.284
15   LastPass (Web)                 C   Q5  Bar[Z]           0.197   0.428   0.266
16A  LogMeOnce (Web)                C   -   Bar              0.425   0.559   0.245
16B  LogMeOnce (Web)                C   Q5  Text             0.053   0.138   0.315
17A  RoboForm 8.4.8.8 (Chro.)       C   Q4  Text[Z]          0.740   0.773   0.477
17B  RoboForm 8.4.8.8 (Chro.)       C   -   Score(Int.)[Z]   0.685   0.932   0.528
17C  RoboForm Business (Web)        C   Q6  Text[Z]          0.523   0.693   0.402
18   True Key 2.8.5711 (Chro.)      C   Q5  Text[Z]          0.341   0.490   0.359
19A  Zoho Vault (Web)               C   Q3  Bar a. Text      0.088   0.134   0.120
19B  Zoho Vault (Web)               C   -   Score(Int.)      0.464   0.502   0.509

Operating Systems
20A  macOS High Sierra 10.13.4      C   -   Bar             -0.667  -0.513   0.450
20B  macOS High Sierra 10.13.4      C   Q4  Text             0.072   0.204   0.469
20C  macOS High Sierra 10.13.4      C   -   Bar(Hover)      -0.667  -0.513   0.449
21A  Ubuntu 18.04 (Ubiquity)        C   Q5  Text            -0.849  -0.808  -0.141
21B  Ubuntu 18.04 (Ubiquity)        C   -   Score(Int.)     -0.818  -0.817  -0.189

6 Not crawled; our analysis revealed the use of plain zxcvbn (Score), MID 8B.

Table B.2.: PSM accuracy (wSpear.) offline use case.

                                                             Offline Attacker
ID   Meter                          T.  Q.  Visu.            RY      LI      WH

Academic Proposals
1A   Comprehensive8 [231]           C   -   -               -0.476  -0.616   0.441
1B   Comprehensive8 [231]           C   Q5  Text            -0.128  -0.123   0.421
2    Eleven [68]                    C   -   -                0.755   0.951   0.733
3    LPSE [102]                     C   Q3  -                0.544   0.718   0.693
4A   Markov (OMEN) [43]             S   -   -                0.701   0.669   0.660
4B   Markov (Single) [87]           S   -   -                0.828   0.991   0.872
4C   Markov (Multi) [87]            S   -   -                0.997   0.995   0.777
5A   NIST [38]                      C   -   -                0.755   0.951   0.733
5B   NIST (w. Dict.) [38]           C   -   -                0.756   0.953   0.816
6    PCFG (fuzzyPSM) [239]          S   -   -                0.998   0.999   0.899
7A   RNN Generic [162]              C   -   -                0.535   0.520   0.800
7B   RNN Generic (Web) [229]        C   -   -                0.449   0.688   0.777
7C   RNN Target [162]               C   -   -                0.896   0.860   0.885
7D   RNN Target (w. Blo.) [162]     C   -   -                0.896   0.860   0.882
8A   zxcvbn (Guess No.) [248]       C   -   -                0.989   0.999   0.868
8B   zxcvbn (Score) [248]           C   Q5  -                0.373   0.567   0.817

Password Managers
9A   1Password (Web)                C   -   Bar              0.401   0.621   0.807
9B   1Password (Web)                C   Q5  Text(Int.)       0.401   0.621   0.813
10A  Bitwarden (Web)                C   -   Bar             -0.457  -0.540   0.676
10B  Bitwarden (Web)                C   Q3  Text(Int.)       0.333   0.340   0.725
11   Dashlane 5.5 (Windows)         C   -   Text             0.698   0.820   0.410
12   Enpass 5.6.8 (Windows)6        C   Q5  Bar a. Text[Z]   0.373   0.567   0.817
13A  KeePass 2.38 (Windows)         C   -   Bar              0.884   0.870   0.744
13B  KeePass 2.38 (Windows)         C   Q5  Text             0.003   0.002   0.321
14A  Keeper (Web)                   C   Q5  Bar              0.223   0.238   0.589
14B  Keeper (Web)                   C   -   Score(Int.)      0.869   0.824   0.476
15   LastPass (Web)                 C   Q5  Bar[Z]           0.232   0.510   0.717
16A  LogMeOnce (Web)                C   -   Bar              0.410   0.602   0.503
16B  LogMeOnce (Web)                C   Q5  Text             0.070   0.130   0.541
17A  RoboForm 8.4.8.8 (Chro.)       C   Q4  Text[Z]          0.711   0.827   0.759
17B  RoboForm 8.4.8.8 (Chro.)       C   -   Score(Int.)[Z]   0.781   0.962   0.725
17C  RoboForm Business (Web)        C   Q6  Text[Z]          0.553   0.727   0.738
18   True Key 2.8.5711 (Chro.)      C   Q5  Text[Z]          0.373   0.567   0.817
19A  Zoho Vault (Web)               C   Q3  Bar a. Text      0.104   0.107   0.506
19B  Zoho Vault (Web)               C   -   Score(Int.)      0.450   0.468   0.727

Operating Systems
20A  macOS High Sierra 10.13.4      C   -   Bar             -0.488  -0.569   0.726
20B  macOS High Sierra 10.13.4      C   Q4  Text             0.094   0.171   0.728
20C  macOS High Sierra 10.13.4      C   -   Bar(Hover)      -0.488  -0.569   0.727
21A  Ubuntu 18.04 (Ubiquity)        C   Q5  Text            -0.792  -0.851   0.132
21B  Ubuntu 18.04 (Ubiquity)        C   -   Score(Int.)     -0.779  -0.855   0.002

Type (T): C=Client; S=Server; H=Hybrid
Quantization (Q): Q3–Q6=Number of bins, e. g., Q3=[Weak, Good, Strong]
Visualization: Bar=Bar meter; Text=Textual; (Int.)=Internal value; [Z]=zxcvbn
Dataset: RY=RockYou; LI=LinkedIn; WH=000WebHost

B.2. Comparison: Websites and Previous Work

Table B.3.: PSM accuracy (wSpear.) online use case.

                                                             Online Attacker
ID   Meter                          T.  Q.  Visu.            RY      LI      WH

Websites
22   Airbnb                         C   Q3  Text             0.054   0.113   0.331
23A  Apple                          H   Q4  Bar              0.000   0.020   0.102
23B  Apple                          H   Q3  Text             0.000   0.020   0.102
24   Baidu                          S   Q3  Text             0.829   0.828   0.154
25A  Best Buy                       C   -   Bar              0.676   0.912   0.424
25B  Best Buy                       C   Q3  Text             0.074   0.102   0.331
26A  China Railway (12306.cn)       C   Q3  Bar              0.161   0.226   0.346
27A  Dropbox                        C   Q4  Bar[Z]           0.056   0.094   0.087
28A  Drupal 8.5.3                   C   -   Bar              0.677   0.788   0.490
28B  Drupal 8.5.3                   C   Q4  Text             0.022   0.019   0.157
29   eBay (PW Change)               H   Q4  Bar              0.031   0.120  -0.373
30   Facebook (PW Change)           C   Q4  Text            -0.066   0.372   0.498
31A  FedEx                          C   Q3  Bar              0.000   0.090   0.147
31B  FedEx                          C   Q5  Score(Int.)      0.000   0.090   0.147
32A  Google                         H   Q5  Bar              0.522   0.692   0.586
33   Have I Been Pwned?             S   -   Text             0.992   0.996   0.739
34   The Home Depot                 C   Q3  Bar a. Text      0.362   0.548   0.475
35A  Microsoft (v3)7                C   Q4  Bar              0.521   0.694   0.487
35B  Microsoft (v3)7                C   -   Score(Int.)      0.670   0.912   0.491
36   reddit                         C   Q5  Bar[Z]           0.341   0.490   0.359
37   Sony                           C   Q3  Bar              0.115   0.185   0.188
38   Sina Weibo                     C   Q4  Text             0.427   0.779   0.544
39   Tumblr                         S   Q6  Bar              0.499   0.576   0.165
40A  Twitch                         C   Q5  Bar[Z]           0.197   0.428   0.266
40B  Twitch                         C   Q3  Text[Z]          0.197   0.427   0.300
41A  Twitter                        H   -   Bar              0.643   0.637   0.509
41B  Twitter                        H   Q5  Score(Int.)      0.554   0.629   0.585
42A  Yandex                         H   -   Bar             -0.392  -0.082   0.494
42B  Yandex                         H   Q4  Text             0.370   0.724   0.502

de Carné de Carnavalet and Mannan (2014) [57]
23C  Apple                          H   Q4  -                0.521   0.694   0.462
26B  China Railway (12306.cn)       C   Q3  -                0.160   0.226   0.346
27B  Dropbox                        C   Q5  - [Z]            0.085   0.104   0.121
28C  Drupal                         C   Q4  -               -0.187   0.256   0.148
31C  Fedex                          C   Q5  -                0.000   0.090   0.147
32B  Google                         S   Q5  -                0.521   0.694   0.507
35C  Microsoft (v3)                 C   Q4  -                0.521   0.694   0.487
43   PayPal                         H   Q4  -                0.521   0.694   0.444
44   QQ                             C   Q4  -                0.844   0.874   0.492
41C  Twitter                        C   Q6  -                0.223   0.638   0.514
45   Yahoo!                         C   Q4  -               -0.187   0.256   0.142
42C  Yandex                         S   Q4  -                0.150   0.581   0.398

7 MID 35A/35B were deprecated by Microsoft in 2016.

Table B.4.: PSM accuracy (wSpear.) offline use case.

                                                             Offline Attacker
ID   Meter                          T.  Q.  Visu.            RY      LI      WH

Websites
22   Airbnb                         C   Q3  Text             0.063   0.141   0.605
23A  Apple                          H   Q4  Bar              0.000   0.046   0.345
23B  Apple                          H   Q3  Text             0.000   0.046   0.345
24   Baidu                          S   Q3  Text             0.825   0.875   0.350
25A  Best Buy                       C   -   Bar              0.765   0.949   0.645
25B  Best Buy                       C   Q3  Text             0.077   0.095   0.511
26A  China Railway (12306.cn)       C   Q3  Bar              0.166   0.197   0.571
27A  Dropbox                        C   Q4  Bar[Z]           0.076   0.108   0.611
28A  Drupal 8.5.3                   C   -   Bar              0.688   0.822   0.732
28B  Drupal 8.5.3                   C   Q4  Text             0.022   0.039   0.356
29   eBay (PW Change)               H   Q4  Bar              0.157   0.081  -0.146
30   Facebook (PW Change)           C   Q4  Text             0.118   0.339   0.725
31A  FedEx                          C   Q3  Bar              0.007   0.071   0.345
31B  FedEx                          C   Q5  Score(Int.)      0.007   0.071   0.345
32A  Google                         H   Q5  Bar              0.551   0.729   0.763
33   Have I Been Pwned?             S   -   Text             0.991   0.997   0.939
34   The Home Depot                 C   Q3  Bar a. Text      0.420   0.604   0.731
35A  Microsoft (v3)7                C   Q4  Bar              0.551   0.726   0.724
35B  Microsoft (v3)7                C   -   Score(Int.)      0.755   0.951   0.734
36   reddit                         C   Q5  Bar[Z]           0.373   0.567   0.817
37   Sony                           C   Q3  Bar              0.128   0.169   0.511
38   Sina Weibo                     C   Q4  Text             0.500   0.803   0.502
39   Tumblr                         S   Q6  Bar              0.550   0.514   0.394
40A  Twitch                         C   Q5  Bar[Z]           0.232   0.510   0.717
40B  Twitch                         C   Q3  Text[Z]          0.232   0.510   0.712
41A  Twitter                        H   -   Bar              0.581   0.674   0.769
41B  Twitter                        H   Q5  Score(Int.)      0.526   0.665   0.681
42A  Yandex                         H   -   Bar             -0.224  -0.117   0.733
42B  Yandex                         H   Q4  Text             0.475   0.775   0.799

de Carné de Carnavalet and Mannan (2014) [57]
23C  Apple                          H   Q4  -                0.551   0.726   0.707
26B  China Railway (12306.cn)       C   Q3  -                0.165   0.197   0.571
27B  Dropbox                        C   Q5  - [Z]            0.121   0.131   0.654
28C  Drupal                         C   Q4  -               -0.095   0.233   0.350
31C  Fedex                          C   Q5  -                0.007   0.071   0.345
32B  Google                         S   Q5  -                0.551   0.726   0.717
35C  Microsoft (v3)                 C   Q4  -                0.551   0.726   0.724
43   PayPal                         H   Q4  -                0.552   0.727   0.736
44   QQ                             C   Q4  -                0.867   0.918   0.721
41C  Twitter                        C   Q6  -                0.271   0.674   0.658
45   Yahoo!                         C   Q4  -               -0.095   0.233   0.346
42C  Yandex                         S   Q4  -                0.217   0.624   0.709

Type (T): C=Client; S=Server; H=Hybrid
Quantization (Q): Q3–Q6=Number of bins, e. g., Q3=[Weak, Good, Strong]
Visualization: Bar=Bar meter; Text=Textual; (Int.)=Internal value; [Z]=zxcvbn
Dataset: RY=RockYou; LI=LinkedIn; WH=000WebHost

C Password Reuse

Appendices for Chapter 5, based on the publication:

M. Golla, M. Wei, J. Hainline, L. Filipe, M. Dürmuth, E. Redmiles, and B. Ur, “What was that site doing with my Facebook password? De- signing Password-Reuse Notifications,” in ACM Conference on Com- puter and Communications Security (CCS ’18). Toronto, Ontario, Canada: ACM, Oct. 2018, pp. 1549–1566.

Includes:

• Survey Instrument: Study 1 and Study 2

• Notification Screenshots: Study 1 and Study 2

C.1. Study 1 Survey Instrument

Introduction In the following survey, you will be asked to imagine that your name is Jo Doe. You have an online account with a major company called AcmeCo and can access your account through both a website and a mobile application. Imagine that this account is important to you, and that it is like other accounts you may have, such as for email, banking, or social media. This survey should take approximately 15 minutes to complete.

Because the notifications were delivered through different channels, we inserted wording appropriate to the notification. For example, for LinkedIn, we used the following:
Prompt: Imagine that you receive, through email from AcmeCo,
VerbPrompt: receiving this notification through email from AcmeCo
NounPrompt: this notification through email from AcmeCo
PastTensePrompt: received this notification through email from AcmeCo

Prompt the following notification:

(A screenshot of a password reuse notification.)

In your own words, please describe what this notification is telling you.

In your own words, please describe all of the factors that may have caused you to receive this notification.

Please list three feelings you might have after receiving this notification. F1: , F2: , F3:

Please list three actions you might take after receiving this notification. A1: , A2: , A3:

The first feeling you listed was (display F1). Please explain why you might feel this way.

The second feeling you listed was (display F2). Please explain why you might feel this way.

The third feeling you listed was (display F3). Please explain why you might feel this way.

The first action you listed was (display A1). Please explain why you might take this action.

The second action you listed was (display A2). Please explain why you might take this action.

The third action you listed was (display A3). Please explain why you might take this action.

I feel that NounPrompt explained to me how to resolve the situation. Strongly agree Agree Neither agree nor disagree Disagree Strongly disagree Don’t know Why?

Notifications can be received in many different ways, such as through email, on a webpage, or in a mobile app. Please select the answer choice that most closely matches how you feel about the following statement:

I feel that NounPrompt uses the appropriate method of contacting me. Strongly agree Agree Neither agree nor disagree Disagree Strongly disagree Don’t know Why?

I feel that ignoring NounPrompt would not have any consequences. Strongly agree Agree Neither agree nor disagree Disagree Strongly disagree Don’t know Why?

For me, taking action in response to VerbPrompt would be a Very high priority High priority Medium priority Low priority Not a priority Don’t know Why?

I would feel ___ about VerbPrompt. Extremely concerned Moderately concerned Somewhat concerned Slightly concerned Not at all concerned Don’t know Why?

I would expect real companies to send notifications like this one when necessary. Strongly agree Agree Neither agree nor disagree Disagree Strongly disagree Don’t know Why?

I have received notifications similar to this one in the past. Never A few times Many times Don’t know

Briefly describe the notifications, if any, that you have received. Please include the context in which you received the notifications and who sent them.

C.2. Study 2 Survey Instrument

Introduction
In the following survey, you will be asked to imagine that your name is Jo Doe. You have an online account with a major company called AcmeCo and can access your account through both a website and a mobile application. Imagine that this account is important to you, and that it is like other accounts you may have, such as for email, banking, or social media. This survey should take approximately 15 minutes to complete.

(Show notification and explain delivery method.)

Initial Questions
In your own words, please describe what this notification is telling you.

What may have caused you to receive this notification? Please check all that apply.
• Someone hacked your AcmeCo account. AcmeCo noticed suspicious activity, such as logins from an unexpected location, a new device being used, or multiple unsuccessful logins.
• Your AcmeCo account has not been hacked. Instead, you simply logged in from a new location or device, or accidentally entered the wrong password too many times.
• AcmeCo was hacked.
• A company unrelated to AcmeCo was hacked.
• You reused the same or similar passwords for multiple online accounts.
• Someone is trying to gain unauthorized access to your account by sending this email.
• AcmeCo conducts regular security checks and this is just a standard security notification.
• You have a weak password for your AcmeCo account.
• AcmeCo sent this by mistake.
• You went to a malicious website or downloaded malicious software.
• AcmeCo requires you to regularly change your password (e. g., every 90 days).
• Don’t know

Password Change Actions
If you received this notification about an online account you had with a real company, which of the following best describes what you would do about passwords for that account? I would keep my password the same. I would change my password. Don’t know Why?

(If “I would change my password” is selected) What would you use for your new password on that account?
Something related to the old password, but a few characters different.
Something completely unrelated to the old password.
A password that I already use for other accounts.
A password generated by a password manager or browser.
Other

(If “I would change my password” is selected) How would you try to remember your new password for that account? Select all that apply.
• Write it down (e. g., in a diary, on a sticky note).
• Use a password manager.
• Just try to remember it.
• Save it on my computer (e. g., in a document).
• Save it on my phone (e. g., in a note).
• Other

If you received this notification about an online account you had with a real company, which of the following best describes what you would do about passwords on other accounts? Please select all that apply.
• I would change all of my passwords I have on other accounts.
• I would change my passwords only for other accounts where I use the same password.
• I would change my passwords only for other accounts where I use similar passwords.
• I would change my passwords only for really important accounts (e. g., bank account).
• I would keep my passwords the same.
• Don’t know.

Why?

(If any of the first four from above were selected) What would you use for your new password(s) on those other accounts?
Something related to the old password, but a few characters different.
Something completely unrelated to the old password.
A password that I already use for other accounts.
A password generated by a password manager or browser.
Other

(If any of the first four from above were selected) How would you try to remember your new password(s) for those other accounts? Select all that apply.
• Write it down on paper (e. g., in a diary, on a sticky note).
• Use a password manager.
• Just try to remember it.
• Save it on my computer (e. g., in a document).
• Save it on my phone (e. g., in a note).
• Other

People have different reactions and responses to notifications about their online accounts. If you received this notification about an online account you had with a real company, how likely would you be to take the following actions? (Answer for each) Very Unlikely Unlikely Neither likely nor unlikely Likely Very Likely Don’t Know
• Enable Two-Factor Authentication.
• Use a password manager.
• Update my security questions.
• Review my recent account activity.
• Leave my password as-is.
• Commit to change my password more frequently in the future.
• Sign up for an account with a company offering identity theft protection.

• Update the software on my devices more regularly.
• Add a/Change my current password to lock my computer.
• Add a/Change my current password, PIN, pattern, fingerprint, etc. to lock my phone.

There are many different actions that people could take in response to notifications about their online accounts. Please select the answer choice that most closely matches how you feel about the following statements: If I received this notification about an online account I had with a real company, it would improve my account security if I . . . (Answer for each) Strongly agree Agree Neither agree nor disagree Disagree Strongly disagree Don’t Know

• . . . enabled Two-Factor Authentication.
• . . . used a password manager.
• . . . changed my password for this account to a new password that is a modification (changing a few characters) of the old one.
• . . . changed my password for this account to a completely new password unrelated to the old one.
• . . . changed my password for this account to a password I use for another online account.
• . . . used unique passwords for each of my online accounts.
• . . . changed all of my similar passwords on other online accounts to one new password.
• . . . updated my security questions.
• . . . reviewed my recent activity.
• . . . left my password as-is.
• . . . committed to change my password more frequently in the future.
• . . . signed up for an account with a company offering identity theft protection.
• . . . updated the software on my devices more regularly.
• . . . added a/changed my current password, PIN, pattern, fingerprint, etc. to lock my phone.
• . . . added a/changed my current password to lock my computer.

Notifications can be received in many different ways, such as through email, on a webpage, or in a mobile app. Please select the answer choice that most closely matches how you feel about the following statement: I feel that this notification uses the appropriate method of contacting me. Strongly Agree Agree Neither agree nor disagree Disagree Strongly disagree Don’t know

If you were to receive a similar notification about an online account you had with a real company, how would you want to be contacted? Please select all that apply.
• Email
• Pop-up notification on mobile, such as if you received an SMS
• Text message
• Website on desktop or mobile browser
• In the mobile app
• Phone call
• Physical mail
• Other

Given that I received NounPrompt, I would probably see this notification: Within 3 hours Within 24 hours Within 3 days Within a week After a week Never Don’t know

After receiving this notification, I would probably take action: Within 3 hours Within 24 hours Within 3 days Within a week After a week Never Don’t know

I would expect real companies to send notifications like this one when necessary. Strongly Agree Agree Neither agree nor disagree Disagree Strongly disagree Don’t know

If I received this notification about an online account I had with a real company, I would believe that this was an official notification sent by that company. Strongly Agree Agree Neither agree nor disagree Disagree Strongly disagree Don’t know

I feel that ignoring this notification would not have any consequences. Strongly Agree Agree Neither agree nor disagree Disagree Strongly disagree Don’t know

For me, taking action in response to VerbPrompt would be a: Very high priority High priority Medium priority Low priority Not a priority Don’t know

If I received this notification about an online account I had with a real company, I would feel grateful. Strongly Agree Agree Neither agree nor disagree Disagree Strongly disagree Don’t know

This notification adequately explains what is going on with my online account. Strongly Agree Agree Neither agree nor disagree Disagree Strongly disagree Don’t know

If I received this notification about an online account I had with a real company, I wouldn’t know why I received this notification. Strongly Agree Agree Neither agree nor disagree Disagree Strongly disagree Don’t know

I feel that NounPrompt explained to me how to resolve the situation. Strongly Agree Agree Neither agree nor disagree Disagree Strongly disagree Don’t know

People may have many different responses to receiving notifications about their online accounts. Please select the answer choice that most closely matches how you feel about the following statement: After receiving this notification, my trust in AcmeCo would: Significantly increase Increase Neither increase nor decrease Decrease Significantly decrease Don’t know Why?

To your knowledge, has anyone ever gained unauthorized access to one of your online accounts? Yes No Don’t know

(If yes selected) Who do you think accessed your online account? Please select all that apply.
• Someone you know personally
• Someone you don’t know personally
• Don’t know

(If yes selected) Please describe what happened.

Do any of your accounts require you to change your password regularly (e. g., every 90 days)? Yes No Don’t know

(If yes selected) Please describe how you were informed of this regular password change policy.

Have you ever been notified that your information was exposed in a data breach? Yes No Don’t know

(If yes selected) Please describe how you found out and what happened.

With what gender do you identify? Female Male Non-binary Other Prefer not to say

What is your age? 18-24 25-34 35-44 45-54 55-64 65-74 75 or older Prefer not to say

What is the highest degree or level of school you have completed? Some high school High school Some college Trade, technical, or vocational training Associate’s Degree Bachelor’s Degree Master’s Degree Professional degree Doctorate Prefer not to say

Which of the following best describes your educational background or job field? I have an education in, or work in, the field of computer science, computer engineering or IT. I do not have an education in, nor do I work in, the field of computer science, computer engineering or IT. Prefer not to say

(Optional) Do you have any final thoughts or questions about today’s study?

C.3. Study 1 Notifications

The six notifications used for Study 1 are shown below. Notifications are identified by their original sender, although all notifications were rebranded as AcmeCo for the purposes of the survey.

Figure C.1.: Study 1 Facebook notification.

Figure C.2.: Study 1 Google email notification.

Figure C.3.: Study 1 Google red bar notification.

Figure C.4.: Study 1 Instagram notification.

Figure C.5.: Study 1 LinkedIn notification.

Figure C.6.: Study 1 Netflix notification.

C.4. Study 2 Notifications

Two of the three variants of the model notification’s delivery medium (mobile and inApp) used for Study 2 are shown below. The third variant (email) was provided in the body of Chapter 5. The text variations of the notifications are shown in Chapter 5, in Figure 5.4 on page 129, with the corresponding changes in Table 5.2 on page 130.

Figure C.7.: Study 2 model-{mobile} notification.

Figure C.8.: Study 2 model-{inApp} notification.

Bibliography

[1] A. Adams and M. A. Sasse, “Users Are Not the Enemy,” Communications of the ACM, vol. 42, no. 12, pp. 40–46, Dec. 1999.

[2] AgileBits, Inc., “1Password Support: Technical Document – OPVault Format,” Dec. 2012, https://support.1password.com/opvault-design, as of March 27, 2019.

[3] AgileBits, Inc., “1Password (Web) – Password Manager,” May 2018, https://1password.com, as of March 27, 2019.

[4] P. Agrawal, “Twitter – Keeping Your Account Secure,” May 2018, https://blog.twitter.com/official/en_us/topics/company/2018/keeping-your-account-secure.html, as of March 27, 2019.

[5] D. Akhawe and A. P. Felt, “Alice in Warningland: A Large-Scale Field Study of Browser Security Warning Effectiveness,” in USENIX Security Symposium (SSYM ’13). Washington, District of Columbia, USA: USENIX, Aug. 2013, pp. 257–272.

[6] F. Alaca and P. C. Van Oorschot, “Device Fingerprinting for Augmenting Web Authentication: Classification and Analysis of Methods,” in Annual Conference on Computer Security Applications (ACSAC ’16). Los Angeles, California, USA: ACM, Dec. 2016, pp. 289–301.

[7] N. Alkaldi and K. Renaud, “Why Do People Adopt, or Reject, Smartphone Password Managers?” in European Workshop on Usable Security (EuroUSEC ’16). Darmstadt, Germany: ISOC, Jul. 2016.

[8] F. Angelstorf and F. Juckel, “OMEN v0.3.0 – C Implementation of a Markov Model-based Password Guesser,” Mar. 2017, https://github.com/RUB-SysSec/OMEN, as of March 27, 2019.

[9] S. Aonzo, A. Merlo, G. Tavella, and Y. Fratantonio, “Phishing Attacks on Modern Android,” in ACM Conference on Computer and Communications Security (CCS ’18). Toronto, Ontario, Canada: ACM, Oct. 2018, pp. 1788–1801.

[10] A. J. Aviv and D. Fichter, “Understanding Visual Perceptions of Usability and Security of Android’s Graphical Password Pattern,” in Annual Computer Security Applications Conference (ACSAC ’14). New Orleans, Louisiana, USA: ACM, Dec. 2014, pp. 286–295.

[11] D. V. Bailey, M. Dürmuth, and C. Paar, “Statistics on Password Re-use and Adaptive Strength for Financial Accounts,” in Security and Cryptography for Networks (SCN ’14). Amalfi, Italy: Springer, Sep. 2014, pp. 218–235.

[12] M. D. Barense, J. K. W. Ngo, L. H. T. Hung, and M. A. Peterson, “Interactions of Memory and Perception in Amnesia: The Figure-Ground Perspective,” Cerebral Cortex, vol. 22, no. 11, pp. 2680–2691, Nov. 2012.

[13] L. Bauer, C. Bravo-Lillo, E. Fragkaki, and W. Melicher, “A Comparison of Users’ Perceptions of and Willingness to Use Google, Facebook, and Google+ Single-sign-on Functionality,” in Workshop on Digital Identity Management (DIM ’13). Berlin, Germany: ACM, Nov. 2013, pp. 25–36.

[14] A. Beautement, M. A. Sasse, and M. Wonham, “The Compliance Budget: Managing Security Behaviour in Organisations,” in New Security Paradigms Workshop (NSPW ’08). Lake Tahoe, California, USA: ACM, Sep. 2008, pp. 47–58.

[15] S. Benvenuti, “Ubiquity – Ubuntu Should Encourage Stronger Passwords,” Sep. 2012, https://bugs.launchpad.net/ubuntu/+source/ubiquity/+bug/1044868, as of March 27, 2019.

[16] C. J. Berry, D. R. Shanks, M. Speekenbrink, and R. N. A. Henson, “Models of Recognition, Repetition Priming, and Fluency: Exploring a New Framework,” Psychological Review, vol. 119, no. 1, pp. 40–79, Jan. 2012.

[17] J. Biggs, “Spammers Expose Over a Billion Email Addresses After Failed Backup,” Mar. 2017, https://techcrunch.com/2017/03/06/spammers-expose-billions-of-emails-after-failed-backup/, as of March 27, 2019.

[18] A. Biryukov, D. Dinu, and D. Khovratovich, “Argon2: The Memory-Hard Function for Password Hashing and Other Applications,” Jul. 2015, https://github.com/P-H-C/phc-winner-argon2, as of March 27, 2019.

[19] M. Bishop and D. V. Klein, “Improving System Security via Proactive Password Checking,” Computers & Security, vol. 14, no. 3, pp. 233–249, 1995.

[20] J. Blocki, S. Komanduri, L. F. Cranor, and A. Datta, “Spaced Repetition and Mnemonics Enable Recall of Multiple Strong Passwords,” in Symposium on Network and Distributed System Security (NDSS ’15). San Diego, California, USA: ISOC, Feb. 2015.

[21] H. Bojinov, E. Bursztein, X. Boyen, and D. Boneh, “Kamouflage: Loss-Resistant Password Management,” in European Symposium on Research in Computer Security (ESORICS ’10). Athens, Greece: Springer, Sep. 2010, pp. 286–302.

[22] H. Bojinov, D. Sanchez, P. Reber, D. Boneh, and P. Lincoln, “Neuroscience Meets Cryptography: Designing Crypto Primitives Secure Against Rubber Hose Attacks,” in USENIX Security Symposium (SSYM ’12). Bellevue, Washington, USA: USENIX, Aug. 2012, pp. 129–141.

[23] J. Bonneau, “Guessing Human-Chosen Secrets,” Ph.D. dissertation, University of Cambridge, 2012.

[24] J. Bonneau, “The Science of Guessing: Analyzing an Anonymized Corpus of 70 Million Passwords,” in IEEE Symposium on Security and Privacy (SP ’12). San Jose, California, USA: IEEE, May 2012, pp. 538–552.

[25] J. Bonneau, E. Bursztein, I. Caron, R. Jackson, and M. Williamson, “Secrets, Lies, and Account Recovery: Lessons from the Use of Personal Knowledge Questions at Google,” in The World Wide Web Conference (WWW ’15). Florence, Italy: ACM, May 2015, pp. 141–150.

[26] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “The Quest to Replace Passwords: A Framework for Comparative Evaluation of Web Authentication Schemes,” in IEEE Symposium on Security and Privacy (SP ’12). San Jose, California, USA: IEEE, May 2012, pp. 553–567.

[27] J. Bonneau, C. Herley, P. C. Van Oorschot, and F. Stajano, “Passwords and the Evolution of Imperfect Authentication,” Communications of the ACM, vol. 58, no. 7, pp. 78–87, Jun. 2015.

[28] J. Bonneau, M. Just, and G. Matthews, “What’s in a Name? Evaluating Statistical Attacks on Personal Knowledge Questions,” in Financial Cryptography and Data Security (FC ’10). Tenerife, Canary Islands, Spain: Springer, Jan. 2010, pp. 98–113.

[29] J. Bonneau and S. Preibusch, “The Password Thicket: Technical and Market Failures in Human Authentication on the Web,” in Workshop on the Economics of Information Security (WEIS ’10). Cambridge, Massachusetts, USA: ACM, Jun. 2010.

[30] J. Bonneau and S. Schechter, “Towards Reliable Storage of 56-bit Secrets in Human Memory,” in USENIX Security Symposium (SSYM ’14). San Diego, California, USA: USENIX, Aug. 2014, pp. 607–623.

[31] J. Bonneau and E. Shutova, “Linguistic Properties of Multi-word Passphrases,” in Workshop on Usable Security (USEC ’12). Kralendijk, Bonaire: Springer, Mar. 2012, pp. 1–12.

[32] J. Brainard, A. Juels, R. L. Rivest, M. Szydlo, and M. Yung, “Fourth-Factor Authentication: Somebody You Know,” in ACM Conference on Computer and Communications Security (CCS ’06). Alexandria, Virginia, USA: ACM, Oct. 2006, pp. 168–178.

[33] C. Bravo-Lillo, L. F. Cranor, J. Downs, and S. Komanduri, “Bridging the Gap in Computer Security Warnings: A Mental Model Approach,” IEEE Security & Privacy, vol. 9, no. 2, pp. 18–26, Mar. 2011.

[34] C. Bravo-Lillo, L. F. Cranor, S. Komanduri, S. Schechter, and M. Sleeper, “Harder to Ignore? Revisiting Pop-Up Fatigue and Approaches to Prevent It,” in Symposium on Usable Privacy and Security (SOUPS ’14). Menlo Park, California, USA: USENIX, Jul. 2014, pp. 105–111.

[35] C. Bravo-Lillo, S. Komanduri, L. F. Cranor, R. W. Reeder, M. Sleeper, J. Downs, and S. Schechter, “Your Attention Please: Designing Security-Decision UIs to Make Genuine Risks Harder to Ignore,” in Symposium on Usable Privacy and Security (SOUPS ’13). Newcastle, United Kingdom: ACM, Jul. 2013, pp. 6:1–6:12.

[36] S. Brostoff and M. A. Sasse, ““Ten Strikes and You’re Out”: Increasing the Number of Login Attempts can Improve Password Usability,” in Workshop on Human-Computer Interaction and Security Systems (HCISEC ’03). Fort Lauderdale, Florida, USA: ACM, Apr. 2003.

[37] M. Burnett, “Today I Am Releasing Ten Million Passwords,” Feb. 2015, https://xato.net/today-i-am-releasing-ten-million-passwords-b6278bbe7495, as of March 27, 2019.

[38] W. E. Burr, D. F. Dodson, and W. T. Polk, “Electronic Authentication Guideline: NIST Special Publication 800-63-2,” Aug. 2013.

[39] P. Canahuati, “Facebook: Keeping Passwords Secure,” Mar. 2019, https://newsroom.fb.com/news/2019/03/keeping-passwords-secure/, as of March 27, 2019.

[40] J. Carranza and Contributors, “Ubiquity – Ubuntu Live CD Installer,” May 2018, https://launchpad.net/ubuntu/+source/ubiquity, as of March 27, 2019.

[41] L. Casati and A. Visconti, “Exploiting a Bad User Practice to Retrieve Data Leakage on Android Password Managers,” in Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS ’17). Torino, Italy: Springer, Jul. 2017, pp. 952–958.

[42] C. Castelluccia, A. Chaabane, M. Dürmuth, and D. Perito, “When Privacy Meets Security: Leveraging Personal Information for Password Cracking,” CoRR, vol. abs/1304.6584, pp. 1–16, Apr. 2013.

[43] C. Castelluccia, M. Dürmuth, and D. Perito, “Adaptive Password-Strength Meters from Markov Models,” in Symposium on Network and Distributed System Security (NDSS ’12). San Diego, California, USA: ISOC, Feb. 2012.

[44] C. B. Cave, “Very Long-Lasting Priming in Picture Naming,” Psychological Science, vol. 8, no. 4, pp. 322–325, Jul. 1997.

[45] R. Chatterjee, “NoCrack Password Vault,” Sep. 2015, https://github.com/rchatterjee/nocrack, as of March 27, 2019.

[46] R. Chatterjee, J. Bonneau, A. Juels, and T. Ristenpart, “Cracking-Resistant Password Vaults using Natural Language Encoders,” in IEEE Symposium on Security and Privacy (SP ’15). San Jose, California, USA: IEEE, May 2015, pp. 481–498.

[47] I. Cherapau, I. Muslukhov, N. Asanka, and K. Beznosov, “On the Impact of Touch ID on iPhone Passcodes,” in Symposium on Usable Privacy and Security (SOUPS ’15). Ottawa, Canada: USENIX, Jul. 2015, pp. 257–276.

[48] S. Chiasson, P. C. Van Oorschot, and R. Biddle, “A Usability Study and Critique of Two Password Managers,” in USENIX Security Symposium (SSYM ’06). Vancouver, British Columbia, Canada: USENIX, Jul. 2006, pp. 1–16.

[49] M. Ciampa, “A Comparison of User Preferences for Browser Password Managers,” Journal of Applied Security Research, vol. 8, no. 4, pp. 455–466, Sep. 2013.

[50] J. Colnago, S. Devlin, M. Oates, C. Swoopes, L. Bauer, L. F. Cranor, and N. Christin, ““It’s Not Actually That Horrible”: Exploring Adoption of Two-Factor Authentication at a University,” in ACM Conference on Human Factors in Computing Systems (CHI ’18). Montreal, Quebec, Canada: ACM, Apr. 2018, pp. 456:1–456:11.

[51] S. Croley (“Chick3nman”), “Abusing Password Reuse at Scale: Bcrypt and Beyond,” Aug. 2018, https://www.youtube.com/watch?v=5su3_Py8iMQ, as of March 27, 2019.

[52] S. Croley (“Chick3nman”), “NVIDIA Titan RTX Hashcat Benchmarks,” Mar. 2019, https://gist.github.com/Chick3nman/5d261c5798cf4f3867fe7035ef6dd49f, as of March 27, 2019.

[53] N. Cubrilovic, “RockYou Hack: From Bad To Worse,” Dec. 2009, https://techcrunch.com/2009/12/14/rockyou-hack-security-myspace-facebook-passwords/, as of March 27, 2019.

[54] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang, “The Tangled Web of Password Reuse,” in Symposium on Network and Distributed System Security (NDSS ’14). San Diego, California, USA: ISOC, Feb. 2014.

[55] Dashlane, Inc., “Dashlane (Windows) – Password Manager,” May 2018, https://www.dashlane.com, as of March 27, 2019.

[56] “dcopi”, “NIST – Password Strength Meter Example,” Jan. 2013, https://github.com/dcopi/PWStrength, as of March 27, 2019.

[57] X. de Carné de Carnavalet and M. Mannan, “From Very Weak to Very Strong: Analyzing Password-Strength Meters,” in Symposium on Network and Distributed System Security (NDSS ’14). San Diego, California, USA: ISOC, Feb. 2014.

[58] X. de Carné de Carnavalet and M. Mannan, “Password Multi-Checker Tool,” Feb. 2014, https://madiba.encs.concordia.ca/software/passwordchecker/, as of March 27, 2019.

[59] M. Dell’Amico and M. Filippone, “Monte Carlo Strength Evaluation: Fast and Reliable Password Checking,” in ACM Conference on Computer and Communications Security (CCS ’15). Denver, Colorado, USA: ACM, Oct. 2015, pp. 158–169.

[60] M. Dell’Amico, P. Michiardi, and Y. Roudier, “Password Strength: An Empirical Analysis,” in Conference on Information Communications (INFOCOM ’10). San Diego, California, USA: IEEE, Mar. 2010, pp. 983–991.

[61] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’09). Miami, Florida, USA: IEEE, Jun. 2009, pp. 248–255.

[62] T. Denning, K. Bowers, M. van Dijk, and A. Juels, “Exploring Implicit Memory for Painless Password Recovery,” in ACM Conference on Human Factors in Computing Systems (CHI ’11). Vancouver, British Columbia, Canada: ACM, May 2011, pp. 2615–2618.

[63] P. Diwanji, “Google: Detecting Suspicious Account Activity,” Mar. 2010, https://security.googleblog.com/2010/03/detecting-suspicious-account-activity.html, as of March 27, 2019.

[64] R. J. Dolan, G. R. Fink, E. T. Rolls, M. Booth, A. J. Holmes, R. S. Frackowiak, and K. J. Friston, “How the Brain Learns to See Objects and Faces in an Impoverished Context,” Nature, vol. 389, no. 6651, pp. 596–599, Oct. 1997.

[65] Dropbox, Inc. and Contributors, “zxcvbn v4.4.2 – JavaScript Implementation of the zxcvbn Strength Meter,” Feb. 2017, https://github.com/dropbox/zxcvbn, as of March 27, 2019.

[66] M. Dürmuth, F. Angelstorf, C. Castelluccia, D. Perito, and A. Chaabane, “OMEN: Faster Password Guessing Using an Ordered Markov Enumerator,” in International Symposium on Engineering Secure Software and Systems (ESSoS ’15). Milan, Italy: Springer, Mar. 2015, pp. 119–132.

[67] S. Egelman, L. F. Cranor, and J. Hong, “You’ve Been Warned: An Empirical Study of the Effectiveness of Web Browser Phishing Warnings,” in ACM Conference on Human Factors in Computing Systems (CHI ’08). Florence, Italy: ACM, Apr. 2008, pp. 1065–1074.

[68] S. Egelman, A. Sotirakopoulos, I. Muslukhov, K. Beznosov, and C. Herley, “Does My Password Go Up to Eleven?: The Impact of Password Meters on Password Selection,” in ACM Conference on Human Factors in Computing Systems (CHI ’13). Paris, France: ACM, Apr. 2013, pp. 2379–2388.

[69] Facebook Security, “Facebook: Introducing Trusted Friends,” Oct. 2011, https://www.facebook.com/notes/facebook-security/national-cybersecurity-awareness-month-updates/10150335022240766/, as of March 27, 2019.

[70] Facebook Security, “Facebook: Introducing Trusted Contacts,” May 2013, https://www.facebook.com/notes/facebook-security/introducing-trusted-contacts/10151362774980766/, as of March 27, 2019.

[71] M. Fagan, Y. Albayram, M. M. H. Khan, and R. Buck, “An Investigation Into Users’ Considerations Towards Using Password Managers,” Human-Centric Computing and Information Sciences, vol. 7, no. 1, Mar. 2017.

[72] S. Fahl, M. Harbach, Y. Acar, and M. Smith, “On the Ecological Validity of a Password Study,” in Symposium on Usable Privacy and Security (SOUPS ’13). Newcastle, United Kingdom: ACM, Jul. 2013, pp. 13:1–13:13.

[73] S. Fahl, M. Harbach, M. Oltrogge, T. Muders, and M. Smith, “Hey, You, Get Off of My Clipboard: On How Usability Trumps Security in Android Password Managers,” in Financial Cryptography and Data Security (FC ’13). Okinawa, Japan: Springer, Apr. 2013, pp. 144–161.

[74] A. P. Felt, A. Ainslie, R. W. Reeder, S. Consolvo, S. Thyagaraja, A. Bettes, H. Harris, and J. Grimes, “Improving SSL warnings: Comprehension and Adherence,” in ACM Conference on Human Factors in Computing Systems (CHI ’15). Seoul, Republic of Korea: ACM, Apr. 2015, pp. 2893–2902.

[75] D. Florêncio and C. Herley, “A Large-scale Study of Web Password Habits,” in The World Wide Web Conference (WWW ’07). Banff, Alberta, Canada: ACM, May 2007, pp. 657–666.

[76] D. Florêncio, C. Herley, and P. C. Van Oorschot, “An Administrator’s Guide to Internet Password Research,” in Large Installation System Administration Conference (LISA ’14). Seattle, Washington, USA: USENIX, Nov. 2014, pp. 44–61.

[77] D. Florêncio, C. Herley, and P. C. Van Oorschot, “Password Portfolios and the Finite-Effort User: Sustainably Managing Large Numbers of Accounts,” in USENIX Security Symposium (SSYM ’14). San Diego, California, USA: USENIX, Aug. 2014, pp. 575–590.

[78] D. Florêncio, C. Herley, and P. C. Van Oorschot, “Pushing on String: The “Don’t Care” Region of Password Strength,” Communications of the ACM, vol. 59, no. 11, pp. 66–74, Oct. 2016.

[79] D. M. Freeman, S. Jain, M. Dürmuth, B. Biggio, and G. Giacinto, “Who Are You? A Statistical Approach to Measuring User Authenticity,” in Symposium on Network and Distributed System Security (NDSS ’16). San Diego, California, USA: ISOC, Feb. 2016.

[80] S. L. Garfinkel, “Email-Based Identification and Authentication: An Alternative to PKI?” IEEE Security & Privacy, vol. 1, no. 6, pp. 20–26, Nov. 2003.

[81] P. Gasti and K. B. Rasmussen, “On the Security of Password Manager Database Formats,” in European Symposium on Research in Computer Security (ESORICS ’12). Pisa, Italy: Springer, Sep. 2012, pp. 770–787.

[82] S. Gaw and E. W. Felten, “Password Management Strategies for Online Accounts,” in Symposium on Usable Privacy and Security (SOUPS ’06). Pittsburgh, Pennsylvania, USA: ACM, Jul. 2006, pp. 44–55.

[83] M. S. Gazzaniga, R. B. Ivry, and G. R. Mangun, Cognitive Neuroscience: The Biology of the Mind, 4th ed. New York, New York, USA: W. W. Norton & Company, Inc., 2013.

[84] M. A. Gluck, E. Mercado, and C. E. Myers, Learning and Memory: From Brain to Behavior, 3rd ed. New York, New York, USA: Worth Publishers, 2016.

[85] J. Goldberg, “On Hashcat and Strong Master Passwords as Your Best Protection,” Apr. 2013, https://blog.agilebits.com/2013/04/16/1password-hashcat-strong-master-passwords/, as of March 27, 2019.

[86] M. Golla, D. V. Bailey, and M. Dürmuth, ““I want my money back!” Limiting Online Password-Guessing Financially,” in Who Are You?! Adventures in Authentication Workshop (WAY ’17). Santa Clara, California, USA: USENIX, Jul. 2017.

[87] M. Golla, B. Beuscher, and M. Dürmuth, “On the Security of Cracking-Resistant Password Vaults,” in ACM Conference on Computer and Communications Security (CCS ’16). Vienna, Austria: ACM, Oct. 2016, pp. 1230–1241.

[88] M. Golla and M. Dürmuth, “Analyzing 4 Million Real-World Personal Knowledge Questions (Short Paper),” in International Conference on Passwords (PASSWORDS ’15). Cambridge, United Kingdom: Springer, Dec. 2015, pp. 39–44.

[89] M. Golla and M. Dürmuth, “On the Accuracy of Password Strength Meters,” in ACM Conference on Computer and Communications Security (CCS ’18). Toronto, Ontario, Canada: ACM, Oct. 2018, pp. 1567–1582.

[90] M. Golla, T. Schnitzler, and M. Dürmuth, ““Will Any Password Do?” Exploring Rate-Limiting on the Web,” in Who Are You?! Adventures in Authentication Workshop (WAY ’18). Baltimore, Maryland, USA: USENIX, Aug. 2018.

[91] M. Golla, I. Sertkaya, and M. Dürmuth, “Password Strength Meter Comparison Website,” May 2018, https://password-meter-comparison.org, as of March 27, 2019.

[92] M. Golla, M. Wei, J. Hainline, L. Filipe, M. Dürmuth, E. Redmiles, and B. Ur, ““What was that site doing with my Facebook password?” Designing Password-Reuse Notifications,” in ACM Conference on Computer and Communications Security (CCS ’18). Toronto, Ontario, Canada: ACM, Oct. 2018, pp. 1549–1566.

[93] R. Gonzalez, E. Y. Chen, and C. Jackson, “Automated Password Extraction Attack on Modern Password Managers,” CoRR, vol. abs/1309.1416, pp. 1–7, Sep. 2013.

[94] D. Goodin, “Why Passwords Have Never Been Weaker–and Crackers Have Never Been Stronger,” Aug. 2012, https://arstechnica.com/information-technology/2012/08/passwords-under-assault/, as of March 27, 2019.

[95] D. Goodin, “For 8 Days Windows Offered a Preloaded Password Manager With a Plugin Vulnerability,” Dec. 2017, https://arstechnica.com/information-technology/2017/12/for-8-days-windows-offered-a-preloaded-password-manager-with-a-plugin-vulnerability/, as of March 27, 2019.

[96] J. M. Gosney (“epixoip”), “How LinkedIn’s Password Sloppiness Hurts Us All,” Jun. 2016, https://arstechnica.com/information-technology/2016/06/how-linkedins-password-sloppiness-hurts-us-all/, as of March 27, 2019.

[97] J. M. Gosney (“epixoip”), “Nvidia GTX 1080 Ti Hashcat Benchmarks,” Apr. 2017, https://gist.github.com/epixoip/ace60d09981be09544fdd35005051505, as of March 27, 2019.

[98] P. A. Grassi, J. L. Fenton, and W. E. Burr, “Digital Identity Guidelines – Authentication and Lifecycle Management: NIST Special Publication 800-63B,” Jun. 2017.

[99] A. Greenberg, “Password Manager LastPass Got Breached Hard,” Jun. 2015, https://www.wired.com/2015/06/hack-brief-password-manager-lastpass-got-breached-hard/, as of March 27, 2019.

[100] V. Griffith and M. Jakobsson, “Messin’ with Texas: Deriving Mother’s Maiden Names Using Public Records,” in Applied Cryptography and Network Security (ACNS ’05). New York, New York, USA: Springer, Jun. 2005, pp. 91–103.

[101] E. Grosse and M. Upadhyay, “Authentication at Scale,” IEEE Security & Privacy, vol. 11, no. 1, pp. 15–22, Jan. 2013.

[102] Y. Guo and Z. Zhang, “LPSE: Lightweight Password-Strength Estimation for Password Meters,” Computers & Security, vol. 73, pp. 507–518, Mar. 2018.

[103] H. Habib, J. Colnago, W. Melicher, B. Ur, S. M. Segreti, L. Bauer, N. Christin, and L. F. Cranor, “Password Creation in the Presence of Blacklists,” in Workshop on Usable Security (USEC ’17). San Diego, California, USA: ISOC, Feb. 2017.

[104] R. Hackett, “Yahoo: Sayonara, Passwords,” Oct. 2015, http://fortune.com/2015/10/16/yahoo-password-security/, as of March 27, 2019.

[105] W. Han, Z. Li, M. Ni, G. Gu, and W. Xu, “Shadow Attacks Based on Password Reuses: A Quantitative Empirical Analysis,” IEEE Transactions on Dependable and Secure Computing, vol. 15, no. 2, pp. 309–320, Apr. 2018.

[106] A. Hanamsagar, S. S. Woo, C. Kanich, and J. Mirkovic, “Leveraging Semantic Transformation to Investigate Password Habits and Their Causes,” in ACM Conference on Human Factors in Computing Systems (CHI ’18). Montreal, Quebec, Canada: ACM, Apr. 2018, pp. 570:1–570:12.

[107] E. Hayashi and J. Hong, “A Diary Study of Password Usage in Daily Life,” in ACM Conference on Human Factors in Computing Systems (CHI ’11). Vancouver, British Columbia, Canada: ACM, May 2011, pp. 2627–2630.

[108] C. Herley and P. C. Van Oorschot, “A Research Agenda Acknowledging the Persistence of Passwords,” IEEE Security & Privacy, vol. 10, no. 1, pp. 28–36, Jan. 2012.

[109] A. Hern, “Google Aims to Kill Passwords,” May 2016, https://www.theguardian.com/technology/2016/may/24/google- passwords-android, as of March 27, 2019.

[110] A. Hern, “LastPass Warns Users to Exercise Caution While It Fixes ’Major’ Vulnerability,” Mar. 2017, https://www.theguardian.com/technology/2017/mar/30/lastpass-warns-users-to-exercise-caution-while-it-fixes-major-vulnerability, as of March 27, 2019.

[111] A. Hern, “Facebook Faces Backlash Over Users’ Safety Phone Numbers,” Mar. 2019, https://www.theguardian.com/technology/2019/mar/04/facebook-faces-backlash-over-users-safety-phone-numbers, as of March 27, 2019.

[112] K. Holtzblatt and H. Beyer, Contextual Design – Design for Life, 2nd ed. San Francisco, California, USA: Elsevier, 2016.

[113] M. Honan, “How Apple and Amazon Security Flaws Led to My Epic Hacking,” Aug. 2012, http://www.wired.com/2012/08/apple-amazon-mat-honan-hacking/, as of March 27, 2019.

[114] M. Horsch, M. Schlipf, J. Braun, and J. Buchmann, “Password Requirements Markup Language,” in Australasian Conference on Information Security and Privacy (ACISP ’16). Melbourne, Victoria, Australia: Springer, Jul. 2016, pp. 426–439.

[115] S. Houshmand and S. Aggarwal, “Building Better Passwords Using Probabilistic Techniques,” in Annual Computer Security Applications Conference (ACSAC ’12). Orlando, Florida, USA: ACM, Dec. 2012, pp. 109–118.

[116] P.-J. Hsieh, E. Vul, and N. Kanwisher, “Recognition Alters the Spatial Pattern of fMRI Activation in Early Retinotopic Cortex,” Journal of Neurophysiology, vol. 103, no. 3, pp. 1501–1507, Jan. 2010.

[117] S. Huber, S. Rasthofer, and S. Arzt, “Extracting All Your Secrets: Vulnerabilities in Android Password Managers,” Apr. 2017, https://conference.hitb.org/hitbsecconf2017ams/sessions/extracting-all-your-secrets-vulnerabilities-in-android-password-managers/, as of March 27, 2019.

[118] J. Huggins and S. Contributors, “Selenium – Web Browser Automation,” May 2017, http://www.seleniumhq.org, as of March 27, 2019.

[119] J. H. Huh, H. Kim, S. S. Rayala, R. B. Bobba, and K. Beznosov, “I’m Too Busy to Reset My LinkedIn Password: On the Effectiveness of Password Reset Emails,” in ACM Conference on Human Factors in Computing Systems (CHI ’17). Denver, Colorado, USA: ACM, May 2017, pp. 387–391.

[120] T. Hunt, “Have I Been Pwned? – Check If Your Email Has Been Compromised in a Data Breach,” Dec. 2013, https://haveibeenpwned.com, as of March 27, 2019.

[121] T. Hunt, “Password Reuse, Credential Stuffing and Another Billion Records in Have I Been Pwned?” May 2017, https://www.troyhunt.com/password-reuse-credential-stuffing-and-another-1-billion-records-in-have-i-been-pwned/, as of March 27, 2019.

[122] T. Hunt, “I’ve Just Launched “Pwned Passwords” V2 With Half a Billion Passwords for Download,” Feb. 2018, https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/, as of March 27, 2019.

[123] F. Imamoglu, T. Kahnt, C. Koch, and J.-D. Haynes, “Changes in Functional Connectivity Support Conscious Object Recognition,” NeuroImage, vol. 63, no. 4, pp. 1909–1917, Dec. 2012.

[124] F. Imamoglu, C. Koch, and J.-D. Haynes, “MoonBase: Generating a Database of Two-Tone Mooney Images,” Journal of Vision, vol. 13, no. 9, pp. 50–50, Jul. 2013.

[125] P. G. Inglesant and M. A. Sasse, “The True Cost of Unusable Password Policies: Password Use in the Wild,” in ACM Conference on Human Factors in Computing Systems (CHI ’10). Atlanta, Georgia, USA: ACM, Apr. 2010, pp. 383–392.

[126] I. Ion, R. W. Reeder, and S. Consolvo, ““...No one Can Hack My Mind”: Comparing Expert and Non-Expert Security Practices,” in Symposium on Usable Privacy and Security (SOUPS ’15). Ottawa, Ontario, Canada: USENIX, Jul. 2015, pp. 327–346.

[127] D. Jaeger, C. Pelchen, H. Graupner, F. Cheng, and C. Meinel, “Analysis of Publicly Leaked Credentials and the Long Story of Password (Re-)use,” in International Conference on Passwords (PASSWORDS ’16). Bochum, Germany: Springer, Dec. 2016.

[128] M. Jakobsson and M. Dhiman, “The Benefits of Understanding Passwords,” in Workshop on Hot Topics in Security (HotSec ’12). Bellevue, Washington, USA: USENIX, Aug. 2012.

[129] M. Jakobsson, E. Stolterman, S. Wetzel, and L. Yang, “Love and Authentication,” in ACM Conference on Human Factors in Computing Systems (CHI ’08). Florence, Italy: ACM, Apr. 2008, pp. 197–200.

[130] A. Javed, D. Bletgen, F. Kohlar, M. Dürmuth, and J. Schwenk, “Secure Fallback Authentication and the Trusted Friend Attack,” in International Distributed Computing Systems Workshops (ICDCSW ’14). Madrid, Spain: IEEE, Jun. 2014, pp. 22–28.

[131] A. Jenkins, M. Anandarajan, and R. D’Ovidio, “’All that Glitters is not Gold’: The Role of Impression Management in Data Breach Notification,” Western Journal of Communication, vol. 78, no. 3, pp. 337–357, Jan. 2014.

[132] Z. Joudaki, J. Thorpe, and M. V. Martin, “Reinforcing System-Assigned Passphrases Through Implicit Learning,” in ACM Conference on Computer and Communications Security (CCS ’18). Toronto, Ontario, Canada: ACM, Oct. 2018, pp. 1533–1548.

[133] JSFoundation, Inc., “Appium – Automation Made Awesome,” May 2018, http://appium.io, as of March 27, 2019.

[134] A. Juels and T. Ristenpart, “Honey Encryption: Security Beyond the Brute-Force Bound,” in Advances in Cryptology – EUROCRYPT 2014 (EUROCRYPT ’14). Copenhagen, Denmark: Springer, May 2014, pp. 293–310.

[135] M. Just, “Designing and Evaluating Challenge-Question Systems,” IEEE Security & Privacy, vol. 2, no. 5, pp. 32–39, Oct. 2004.

[136] W. Kalicinski, “Google: Getting Your Android App Ready for Autofill,” Nov. 2017, https://android-developers.googleblog.com/2017/11/getting-your-android-app-ready-for.html, as of March 27, 2019.

[137] B. S. Kaliski, “PKCS #5: Password-Based Cryptography Specification Version 2.0,” Internet Requests for Comments, RFC Editor, RFC 2898, Sep. 2000. [Online]. Available: https://tools.ietf.org/html/rfc2898

[138] A. Karole, N. Saxena, and N. Christin, “A Comparative Usability Evaluation of Traditional Password Managers,” in International Conference on Information Security and Cryptology (ICISC ’10). Seoul, Korea: Springer, Dec. 2010, pp. 233–251.

[139] Keeper Security, Inc., “Keeper (Web) – Password Manager,” May 2018, https://keepersecurity.com, as of March 27, 2019.

[140] P. G. Kelley, S. Komanduri, M. L. Mazurek, R. Shay, T. Vidas, L. Bauer, N. Christin, L. F. Cranor, and J. López, “Guess Again (and Again and Again): Measuring Password Strength by Simulating Password-Cracking Algorithms,” in IEEE Symposium on Security and Privacy (SP ’12). San Jose, California, USA: IEEE, May 2012, pp. 523–537.

[141] J. M. Kizilirmak, J. Galvao Gomes da Silva, F. Imamoglu, and A. Richardson-Klavehn, “Generation and the Subjective Feeling of “Aha!” Are Independently Related to Learning From Insight,” Psychological Research, vol. 80, no. 6, pp. 1059–1074, Aug. 2016.

[142] D. V. Klein, ““Foiling the Cracker”: A Survey of, and Improvements to, Password Security,” in USENIX Security Workshop (SSYM ’90). Portland, Oregon, USA: USENIX, Aug. 1990, pp. 5–14.

[143] S. Komanduri, R. Shay, P. G. Kelley, M. L. Mazurek, L. Bauer, N. Christin, L. F. Cranor, and S. Egelman, “Of Passwords and People: Measuring the Effect of Password-Composition Policies,” in ACM Conference on Human Factors in Computing Systems (CHI ’11). Vancouver, British Columbia, Canada: ACM, May 2011, pp. 2595–2604.

[144] F. Kreuter, S. Presser, and R. Tourangeau, “Social Desirability Bias in CATI, IVR, and Web Surveys: The Effects of Mode and Question Sensitivity,” Public Opinion Quarterly, vol. 72, no. 5, pp. 847–865, Dec. 2008.

[145] K. Krol, E. Philippou, E. De Cristofaro, and M. A. Sasse, ““They brought in the horrible key ring thing!” Analysing the Usability of Two-Factor Authentication in UK Online Banking,” in Symposium on Network and Distributed System Security (NDSS ’15). San Diego, California, USA: ISOC, Feb. 2015.

[146] J. A. Krosnick, “Survey Research,” Annual Review of Psychology, vol. 50, no. 1, pp. 537–567, Feb. 1999.

[147] C. Kuo, S. Romanosky, and L. F. Cranor, “Human Selection of Mnemonic Phrase-based Passwords,” in Symposium on Usable Privacy and Security (SOUPS ’06). Pittsburgh, Pennsylvania, USA: ACM, Jul. 2006, pp. 67–78.

[148] J. Lazar, J. H. Feng, and H. Hochheiser, Research Methods in Human-Computer Interaction, 2nd ed. San Francisco, California, USA: Morgan Kaufmann, 2017.

[149] Z. Li, W. Han, and W. Xu, “A Large-Scale Empirical Analysis of Chinese Web Passwords,” in USENIX Security Symposium (SSYM ’14). San Diego, California, USA: USENIX, Aug. 2014, pp. 559–574.

[150] Z. Li, W. He, D. Akhawe, and D. Song, “The Emperor’s New Password Manager: Security Analysis of Web-based Password Managers,” in USENIX Security Symposium (SSYM ’14). San Diego, California, USA: USENIX, Aug. 2014, pp. 465–479.

[151] E. Liu, A. Nakanishi, M. Golla, D. Cash, and B. Ur, “Reasoning Analytically About Password-Cracking Software,” in IEEE Symposium on Security and Privacy (SP ’19). San Francisco, California, USA: IEEE, May 2019, pp. 380–397.

[152] D. Logan, “British Airways Among Latest Breaches,” Network Security, vol. 2015, no. 4, pp. 2–20, Apr. 2015.

[153] LogMeIn, Inc., “LastPass (Web) – Password Manager,” May 2018, https://www.lastpass.com, as of March 27, 2019.

[154] C. Long, “Facebook – Keeping Passwords Secure,” Oct. 2014, https://www.facebook.com/notes/protect-the-graph/keeping-passwords-secure/1519937431579736/, as of March 27, 2019.

[155] R. Ludmer, Y. Dudai, and N. Rubin, “Uncovering Camouflage: Amygdala Activation Predicts Long-Term Memory of Induced Perceptual Insight,” Neuron, vol. 69, no. 5, pp. 1002–1014, Mar. 2011.

[156] E. Lundberg, J. Jones, A. Kumar, D. Balfanz, A. Czeskis, A. H. Liao, M. B. Jones, J. Hodges, and R. Lindemann, “Web Authentication: An API for Accessing Public Key Credentials – Level 1,” Mar. 2019, https://www.w3.org/TR/2019/REC-webauthn-1-20190304/, as of March 27, 2019.

[157] S. G. Lyastani, M. Schilling, S. Fahl, M. Backes, and S. Bugiel, ““Better managed than memorized?” Studying the Impact of Managers on Password Strength and Reuse,” in USENIX Security Symposium (SSYM ’18). Baltimore, Maryland, USA: USENIX, Aug. 2018, pp. 203–220.

[158] J. Ma, W. Yang, M. Luo, and N. Li, “A Study of Probabilistic Password Models,” in IEEE Symposium on Security and Privacy (SP ’14). San Jose, California, USA: IEEE, May 2014, pp. 689–704.

[159] M. L. Mazurek, S. Komanduri, T. Vidas, L. Bauer, N. Christin, L. F. Cranor, P. G. Kelley, R. Shay, and B. Ur, “Measuring Password Guessability for an Entire University,” in ACM Conference on Computer and Communications Security (CCS ’13). Berlin, Germany: ACM, Nov. 2013, pp. 173–186.

[160] R. McMillan, “The Man Who Wrote Those Password Rules Has a New Tip: N3v$r M1nd!” Aug. 2017, https://www.wsj.com/articles/the-man-who-wrote-those-password-rules-has-a-new-tip-n3v-r-m1-d-1502124118, as of March 27, 2019.

[161] W. Melicher, “Source Code – Cracking Passwords with Neural Networks,” May 2017, https://github.com/cupslab/neural_network_cracking, as of March 27, 2019.

[162] W. Melicher, B. Ur, S. M. Segreti, S. Komanduri, L. Bauer, N. Christin, and L. F. Cranor, “Fast, Lean, and Accurate: Modeling Password Guessability Using Neural Networks,” in USENIX Security Symposium (SSYM ’16). Austin, Texas, USA: USENIX, Aug. 2016, pp. 175–191.

[163] D. E. Meyer and R. W. Schvaneveldt, “Facilitation in Recognizing Pairs of Words: Evidence of a Dependence Between Retrieval Operations,” Journal of Experimental Psychology, vol. 90, no. 2, pp. 227–234, Oct. 1971.

[164] N. Micallef and N. A. G. Arachchilage, “A Gamified Approach to Improve Users’ Memorability of Fall-back,” in Who Are You?! Adventures in Authentication Workshop (WAY ’17). Santa Clara, California, USA: USENIX, Jul. 2017.

[165] G. Milka, “Anatomy of Account Takeover,” in USENIX Enigma Conference (Enigma ’18). Santa Clara, California, USA: USENIX, Jan. 2018.

[166] D. B. Mitchell, A. S. Brown, and D. R. Murphy, “Dissociations Between Procedural and Episodic Memory: Effects of Time and Aging,” Psychology and Aging, vol. 5, no. 2, pp. 264–276, Jun. 1990.

[167] S. M. Mohammad and P. D. Turney, “Crowdsourcing a Word-Emotion Association Lexicon,” Computational Intelligence, vol. 29, no. 3, pp. 436–465, Sep. 2012.

[168] C. M. Mooney, “Age in the Development of Closure Ability in Children,” Canadian Journal of Psychology, vol. 11, no. 4, pp. 219–226, Dec. 1957.

[169] R. Morris and K. Thompson, “Password Security: A Case History,” Communications of the ACM, vol. 22, no. 11, pp. 594–597, Nov. 1979.

[170] A. Narayanan and V. Shmatikov, “Fast Dictionary Attacks on Passwords Using Time-Space Tradeoff,” in ACM Conference on Computer and Communications Security (CCS ’05). Alexandria, Virginia, USA: ACM, Oct. 2005, pp. 364–372.

[171] National Cyber Security Centre, “Password Guidance: Simplifying Your Approach,” Jan. 2016, https://www.ncsc.gov.uk/guidance/password-guidance-simplifying-your-approach, as of March 27, 2019.

[172] National Cyber Security Centre, “The Problems with Forcing Regular Password Expiry,” Dec. 2016, https://www.ncsc.gov.uk/articles/problems-forcing-regular-password-expiry, as of March 27, 2019.

[173] J. Onaolapo, E. Mariconti, and G. Stringhini, “What Happens After You Are Pwnd: Understanding the Use of Leaked Webmail Credentials in the Wild,” in Internet Measurement Conference (IMC ’16). Santa Monica, California, USA: ACM, Nov. 2016, pp. 65–79.

[174] N. Otsu, “A Threshold Selection Method from Gray-Level Histograms,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, Jan. 1979.

[175] R. Oussoren, “PyObjC – The Python Objective-C Bridge,” May 2018, https://pythonhosted.org/pyobjc/, as of March 27, 2019.

[176] B. Pal, T. Daniel, R. Chatterjee, and T. Ristenpart, “Beyond Credential Stuffing: Password Similarity Models using Neural Networks,” in IEEE Symposium on Security and Privacy (SP ’19). San Francisco, California, USA: IEEE, May 2019, pp. 866–883.

[177] C. Palow, “After Watching This Talk, You’ll Never Look at Passwords the Same Again,” Nov. 2013, https://vimeo.com/80460475, as of March 27, 2019.

[178] S. Pearman, J. Thomas, P. E. Naeini, H. Habib, L. Bauer, N. Christin, L. F. Cranor, S. Egelman, and A. Forget, “Let’s Go in for a Closer Look: Observing Passwords in Their Natural Habitat,” in ACM Conference on Computer and Communications Security (CCS ’17). Dallas, Texas, USA: ACM, Oct. 2017, pp. 295–310.

[179] N. Perlroth, “All 3 Billion Yahoo Accounts Were Affected by 2013 Attack,” Oct. 2017, https://www.nytimes.com/2017/10/03/technology/yahoo-hack-3-billion-users.html, as of March 27, 2019.

[180] A. Peslyak (“Solar Designer”) and Community, “John the Ripper,” Jul. 1996, http://www.openwall.com/john/, as of March 27, 2019.

[181] A. Peslyak (“Solar Designer”) and Community, “John the Ripper’s Cracking Modes: The “Single Crack” Mode,” May 2013, http://www.openwall.com/john/doc/MODES.shtml, as of March 27, 2019.

[182] J. O. Pliam, “On the Incomparability of Entropy and Marginal Guesswork in Brute-Force Attacks,” in International Conference in Cryptology in India (INDOCRYPT ’00). Calcutta, India: Springer, Dec. 2000, pp. 67–79.

[183] P. Poornachandran, M. Nithun, S. Pal, A. Ashok, and A. Ajayan, “Password Reuse Behavior: How Massive Online Data Breaches Impacts Personal Data in Web,” in Innovations in Computer Science and Engineering (ICICSE ’15). Hyderabad, India: Springer, Aug. 2015, pp. 199–210.

[184] S. Profis, “The Guide to Password Security,” Jan. 2016, http://www.cnet.com/how-to/the-guide-to-password-security-and-why-you-should-care/, as of March 27, 2019.

[185] J. Pullman, K. Thomas, and E. Bursztein, “Password Checkup: Protect Your Accounts From Data Breaches With Password Checkup,” Feb. 2019, https://security.googleblog.com/2019/02/protect-your-accounts-from-data.html, as of March 27, 2019.

[186] A. Rabkin, “Personal Knowledge Questions for Fallback Authentication: Security Questions in the Era of Facebook,” in Symposium on Usable Privacy and Security (SOUPS ’08). Pittsburgh, Pennsylvania, USA: ACM, Jul. 2008, pp. 13–23.

[187] E. Rader, R. Wash, and B. Brooks, “Stories As Informal Lessons About Security,” in Symposium on Usable Privacy and Security (SOUPS ’12). Washington, District of Columbia, USA: ACM, Jul. 2012, pp. 6:1–6:17.

[188] E. M. Redmiles, Y. Acar, S. Fahl, and M. L. Mazurek, “A Summary of Survey Methodology Best Practices for Security and Privacy Researchers,” UM Computer Science Department, Technical Report CS-TR-5055, May 2017.

[189] E. M. Redmiles, S. Kross, and M. L. Mazurek, “How I Learned to Be Secure: A Census-Representative Survey of Security Advice Sources and Behavior,” in ACM Conference on Computer and Communications Security (CCS ’16). Vienna, Austria: ACM, Oct. 2016, pp. 666–677.

[190] E. M. Redmiles, S. Kross, and M. L. Mazurek, “How Well Do My Results Generalize? Comparing Security and Privacy Survey Results from MTurk, Web, and Telephone Samples,” in IEEE Symposium on Security and Privacy (SP ’19). San Francisco, California, USA: IEEE, May 2019, pp. 227–244.

[191] E. M. Redmiles, E. Liu, and M. L. Mazurek, “You Want Me To Do What? A Design Study of Two-Factor Authentication Messages,” in Who Are You?! Adventures in Authentication Workshop (WAY ’17). Santa Clara, California, USA: USENIX, Jul. 2017.

[192] E. M. Redmiles, A. R. Malone, and M. L. Mazurek, “I Think They’re Trying to Tell Me Something: Advice Sources and Selection for Digital Security,” in IEEE Symposium on Security and Privacy (SP ’16). San Jose, California, USA: IEEE, May 2016, pp. 272–288.

[193] E. M. Redmiles, Z. Zhu, S. Kross, D. Kuchhal, T. Dumitras, and M. L. Mazurek, “Asking for a Friend: Evaluating Response Biases in Security User Studies,” in ACM Conference on Computer and Communications Security (CCS ’18). Toronto, Ontario, Canada: ACM, Oct. 2018, pp. 1238–1255.

[194] R. W. Reeder, I. Ion, and S. Consolvo, “152 Simple Steps to Stay Safe Online: Security Advice for Non-Tech-Savvy Users,” IEEE Security & Privacy, vol. 15, no. 5, pp. 55–64, Oct. 2017.

[195] D. Reichl, “KeePass Help Center: Protection against Dictionary Attacks,” Jun. 2016, http://keepass.info/help/base/security.html, as of March 27, 2019.

[196] D. Reichl, “KeePass (Windows) – Password Manager,” May 2018, http://keepass.info/help/kb/pw_quality_est.html, as of March 27, 2019.

[197] D. Reichl, “KPScript (Windows) – Scripting KeePass,” May 2018, http://keepass.info/help/v2_dev/scr_index.html, as of March 27, 2019.

[198] D. Rosenblum, “What Anyone Can Know: The Privacy Risks of Social Networking Sites,” IEEE Security & Privacy, vol. 5, no. 3, pp. 40–49, Jun. 2007.

[199] M. D. Rugg, R. E. Mark, P. Walla, A. M. Schloerscheidt, C. S. Birch, and K. Allan, “Dissociation of the Neural Correlates of Implicit and Explicit Memory,” Nature, vol. 392, no. 6676, pp. 595–598, Apr. 1998.

[200] M. A. Sasse, M. Steves, K. Krol, and D. Chisnell, “The Great Authentication Fatigue – And How to Overcome It,” in International Conference on Cross-Cultural Design (CCD ’14). Heraklion, Crete, Greece: Springer, Jun. 2014, pp. 228–239.

[201] D. L. Schacter and R. D. Badgaiyan, “Neuroimaging of Priming: New Perspectives on Implicit and Explicit Memory,” Current Directions in Psychological Science, vol. 10, no. 1, pp. 1–4, Feb. 2001.

[202] S. Schechter, A. J. B. Brush, and S. Egelman, “It’s No Secret. Measuring the Security and Reliability of Authentication via “Secret” Questions,” in IEEE Symposium on Security and Privacy (SP ’09). Oakland, California, USA: IEEE, May 2009, pp. 375–390.

[203] S. Schechter, S. Egelman, and R. W. Reeder, “It’s Not What You Know, But Who You Know: A Social Approach to Last-Resort Authentication,” in ACM Conference on Human Factors in Computing Systems (CHI ’09). Boston, Massachusetts, USA: ACM, Apr. 2009, pp. 1983–1992.

[204] S. Schechter, C. Herley, and M. Mitzenmacher, “Popularity Is Everything: A New Approach to Protecting Passwords from Statistical-Guessing Attacks,” in Workshop on Hot Topics in Security (HotSec ’10). Washington, District of Columbia, USA: USENIX, Aug. 2010.

[205] N. Shah, “Google: Advanced Sign-In Security for Your Google Account,” Dec. 2011, https://googleblog.blogspot.com/2011/02/advanced-sign-in-security-for-your.html, as of March 27, 2019.

[206] C. E. Shannon, “A Mathematical Theory of Communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, Jul. 1948.

[207] Shape Security, Inc., “The 2017 Credential Spill Report,” Jan. 2017, http://info.shapesecurity.com/2017-Credential-Spill-Report-w.html, as of March 27, 2019.

[208] R. Shay, L. Bauer, N. Christin, L. F. Cranor, A. Forget, S. Komanduri, M. L. Mazurek, W. Melicher, S. M. Segreti, and B. Ur, “A Spoonful of Sugar?: The Impact of Guidance and Feedback on Password-Creation Behavior,” in ACM Conference on Human Factors in Computing Systems (CHI ’15). Seoul, Republic of Korea: ACM, Apr. 2015, pp. 2903–2912.

[209] R. Shay, S. Komanduri, P. G. Kelley, P. G. Leon, M. L. Mazurek, L. Bauer, N. Christin, and L. F. Cranor, “Encountering Stronger Password Requirements: User Attitudes and Behaviors,” in Symposium on Usable Privacy and Security (SOUPS ’10). Redmond, Washington, USA: ACM, Jul. 2010, pp. 2:1–2:20.

[210] S. Sheng, B. Magnien, P. Kumaraguru, A. Acquisti, L. F. Cranor, J. Hong, and E. Nunge, “Anti-Phishing Phil: The Design and Evaluation of a Game That Teaches People Not to Fall for Phish,” in Symposium on Usable Privacy and Security (SOUPS ’07). Pittsburgh, Pennsylvania, USA: ACM, Jul. 2007, pp. 88–99.

[211] D. Silver, S. Jana, D. Boneh, E. Chen, and C. Jackson, “Password Managers: Attacks and Defenses,” in USENIX Security Symposium (SSYM ’14). San Diego, California, USA: USENIX, Aug. 2014, pp. 449–464.

[212] Sinew Software Systems, Pvt. Ltd., “Enpass Release Notes – Use of the zxcvbn Strength Meter,” Dec. 2016, https://www.enpass.io/release-notes/windowspc/, as of March 27, 2019.

[213] Sinew Software Systems, Pvt. Ltd., “Enpass (Windows) – Password Manager,” May 2018, https://www.enpass.io, as of March 27, 2019.

[214] E. H. Spafford, “Observing Reusable Password Choices,” in USENIX Security Symposium (SSYM ’92). Berkeley, California, USA: USENIX, Sep. 1992, pp. 299–312.

[215] J. Steube (“atom”), “Introducing the PRINCE Attack-Mode,” in International Conference on Passwords (PASSWORDS ’14). Trondheim, Norway: Springer, Dec. 2014, pp. 1–42.

[216] J. Steube (“atom”) and Community, “Hashcat,” Jun. 2016, https://hashcat.net/hashcat/, as of March 27, 2019.

[217] E. Stobert and R. Biddle, “The Password Life Cycle: User Behaviour in Managing Passwords,” in Symposium on Usable Privacy and Security (SOUPS ’14). Menlo Park, California, USA: USENIX, Jul. 2014, pp. 243–255.

[218] E. Stobert and R. Biddle, “Expert Password Management,” in International Conference on Passwords (PASSWORDS ’15). Cambridge, United Kingdom: Springer, Dec. 2015, pp. 3–20.

[219] B. Stock and M. Johns, “Protecting Users Against XSS-based Password Manager Abuse,” in ACM Symposium on Information, Computer and Communications Security (ASIA CCS ’14). Kyoto, Japan: ACM, Jun. 2014, pp. 183–194.

[220] M. Stockley, “Why You Can’t Trust Password Strength Meters,” Mar. 2015, https://nakedsecurity.sophos.com/2015/03/02/why-you-cant-trust-password-strength-meters/, as of March 27, 2019.

[221] S.-T. Sun, E. Pospisil, I. Muslukhov, N. Dindar, K. Hawkey, and K. Beznosov, “What Makes Users Refuse Web Single Sign-on?: An Empirical Investigation of OpenID,” in Symposium on Usable Privacy and Security (SOUPS ’11). Pittsburgh, Pennsylvania, USA: ACM, Jul. 2011, pp. 4:1–4:20.

[222] C. Tallon-Baudry, O. Bertrand, C. Delpuech, and J. Pernier, “Oscillatory Gamma-Band (30-70 Hz) Activity Induced by a Visual Search Task in Humans,” The Journal of Neuroscience, vol. 17, no. 2, pp. 722–734, Jan. 1997.

[223] K. Thomas, F. Li, C. Grier, and V. Paxson, “Consequences of Connectivity: Characterizing Account Hijacking on Twitter,” in ACM Conference on Computer and Communications Security (CCS ’14). Scottsdale, Arizona, USA: ACM, Nov. 2014, pp. 489–500.

[224] K. Thomas, F. Li, A. Zand, J. Barrett, J. Ranieri, L. Invernizzi, Y. Markov, O. Comanescu, V. Eranti, A. Moscicki, D. Margolis, V. Paxson, and E. Bursztein, “Data Breaches, Phishing, or Malware? Understanding the Risks of Stolen Credentials,” in ACM Conference on Computer and Communications Security (CCS ’17). Dallas, Texas, USA: ACM, Oct. 2017, pp. 1421–1434.

[225] R. Tilley, “Blooming Password,” Jun. 2018, https://www.bloomingpassword.fun, as of March 27, 2019.

[226] R. Tourangeau and T. Yan, “Sensitive Questions in Surveys,” Psychological Bulletin, vol. 133, no. 5, pp. 859–883, Sep. 2007.

[227] N. B. Turk-Browne, D.-J. Yi, and M. M. Chun, “Linking Implicit and Explicit Memory: Common Encoding Factors and Shared Representations,” Neuron, vol. 49, no. 6, pp. 917–927, Mar. 2006.

[228] B. Ur, “Source Code – Data-Driven Password Meter,” Aug. 2017, https://github.com/cupslab/password_meter, as of March 27, 2019.

[229] B. Ur, F. Alfieri, M. Aung, L. Bauer, N. Christin, J. Colnago, L. F. Cranor, H. Dixon, P. E. Naeini, H. Habib, N. Johnson, and W. Melicher, “Design and Evaluation of a Data-Driven Password Meter,” in ACM Conference on Human Factors in Computing Systems (CHI ’17). Denver, Colorado, USA: ACM, May 2017, pp. 3775–3786.

[230] B. Ur, J. Bees, S. M. Segreti, L. Bauer, N. Christin, and L. F. Cranor, “Do Users’ Perceptions of Password Security Match Reality?” in ACM Conference on Human Factors in Computing Systems (CHI ’16). Santa Clara, California, USA: ACM, May 2016, pp. 3748–3760.

[231] B. Ur, P. G. Kelley, S. Komanduri, J. Lee, M. Maass, M. L. Mazurek, T. Passaro, R. Shay, T. Vidas, L. Bauer, N. Christin, and L. F. Cranor, “How Does Your Password Measure Up? The Effect of Strength Meters on Password Creation,” in USENIX Security Symposium (SSYM ’12). Bellevue, Washington, USA: USENIX, Aug. 2012, pp. 65–80.

[232] B. Ur, F. Noma, J. Bees, S. M. Segreti, R. Shay, L. Bauer, N. Christin, and L. F. Cranor, ““I Added ‘!’ at the End to Make It Secure”: Observing Password Creation in the Lab,” in Symposium on Usable Privacy and Security (SOUPS ’15). Ottawa, Ontario, Canada: USENIX, Jul. 2015, pp. 123–140.

[233] B. Ur, S. M. Segreti, L. Bauer, N. Christin, L. F. Cranor, S. Komanduri, D. Kurilova, M. L. Mazurek, W. Melicher, and R. Shay, “Measuring Real-World Accuracies and Biases in Modeling Password Guessability,” in USENIX Security Symposium (SSYM ’15). Washington, District of Columbia, USA: USENIX, Aug. 2015, pp. 463–481.

[234] A. Vance, D. Eargle, K. Ouimet, and D. Straub, “Enhancing Password Security through Interactive Fear Appeals: A Web-based Field Experiment,” in Hawaii International Conference on System Sciences (HICSS ’13). Wailea, Maui, Hawaii, USA: IEEE, Jan. 2013, pp. 2988–2997.

[235] A. Vance, “If Your Password Is 123456, Just Make It HackMe,” Jan. 2010, http://www.nytimes.com/2010/01/21/technology/21password.html, as of March 27, 2019.

[236] R. Veras, C. Collins, and J. Thorpe, “On the Semantic Patterns of Passwords and their Security Impact,” in Symposium on Network and Distributed System Security (NDSS ’14). San Diego, California, USA: ISOC, Feb. 2014.

[237] R. Veras, J. Thorpe, and C. Collins, “Visualizing Semantics in Passwords: The Role of Dates,” in Symposium on Visualization for Cyber Security (VizSec ’12). Seattle, Washington, USA: ACM, Oct. 2012, pp. 88–95.

[238] C. Wang, S. T. Jan, H. Hu, D. Bossart, and G. Wang, “The Next Domino to Fall: Empirical Analysis of User Passwords across Online Services,” in ACM Conference on Data and Application Security and Privacy (CODASPY ’18). Tempe, Arizona, USA: ACM, Mar. 2018, pp. 196–203.

[239] D. Wang, D. He, H. Cheng, and P. Wang, “fuzzyPSM: A New Password Strength Meter Using Fuzzy Probabilistic Context-Free Grammars,” in Conference on Dependable Systems and Networks (DSN ’16). Toulouse, France: IEEE, Jun. 2016, pp. 595–606.

[240] D. Wang, Z. Zhang, P. Wang, J. Yan, and X. Huang, “Targeted Online Password Guessing: An Underestimated Threat,” in ACM Conference on Computer and Communications Security (CCS ’16). Vienna, Austria: ACM, Oct. 2016, pp. 1242–1254.

[241] K. C. Wang and M. K. Reiter, “How to End Password Reuse on the Web,” in Symposium on Network and Distributed System Security (NDSS ’19). San Diego, California, USA: ISOC, Feb. 2019.

[242] T. Warren, “Google’s new Chrome Launches With an Updated Password Manager,” Sep. 2018, https://www.theverge.com/2018/9/4/17814516/google-chrome-new-design-features, as of March 27, 2019.

[243] R. Wash, E. Rader, R. Berman, and Z. Wellmer, “Understanding Password Choices: How Frequently Entered Passwords are Re-used Across Websites,” in Symposium on Usable Privacy and Security (SOUPS ’16). Denver, Colorado, USA: USENIX, Jul. 2016, pp. 175–188.

[244] M. Wei, M. Golla, and B. Ur, “The Password Doesn’t Fall Far: How Service Influences Password Choice,” in Who Are You?! Adventures in Authentication Workshop (WAY ’18). Baltimore, Maryland, USA: USENIX, Aug. 2018.

[245] D. Weinshall and S. Kirkpatrick, “Passwords You’ll Never Forget, But Can’t Recall,” in ACM Conference on Human Factors in Computing Systems (CHI ’04). Vienna, Austria: ACM, Apr. 2004, pp. 1399–1402.

[246] M. Weir, S. Aggarwal, M. Collins, and H. Stern, “Testing Metrics for Password Creation Policies by Attacking Large Sets of Revealed Passwords,” in ACM Conference on Computer and Communications Security (CCS ’10). Chicago, Illinois, USA: ACM, Oct. 2010, pp. 162–175.

[247] M. Weir, S. Aggarwal, B. d. Medeiros, and B. Glodek, “Password Cracking Using Probabilistic Context-Free Grammars,” in IEEE Symposium on Security and Privacy (SP ’09). Oakland, California, USA: IEEE, May 2009, pp. 391–405.

[248] D. L. Wheeler, “zxcvbn: Low-Budget Password Strength Estimation,” in USENIX Security Symposium (SSYM ’16). Austin, Texas, USA: USENIX, Aug. 2016, pp. 157–173.

[249] S. Wiefling, L. L. Iacono, and M. Dürmuth, “Is This Really You? An Empirical Study on Risk-Based Authentication Applied in the Wild,” in International Conference on ICT Systems Security and Privacy Protection (IFIP SEC ’19). Lisbon, Portugal: IFIP, Jun. 2019, pp. 134–148.

[250] F. Wiemer and R. Zimmermann, “High-Speed Implementation of bcrypt Password Search Using Special-Purpose Hardware,” in International Conference on ReConFigurable Computing and FPGAs (ReConFig ’14). Cancun, Mexico: IEEE, Dec. 2014, pp. 1–6.

[251] M. Wilson, “MRC Psycholinguistic Database: Machine-Usable Dictionary, Version 2.00,” Behavior Research Methods, Instruments, & Computers, vol. 20, no. 1, pp. 6–10, Jan. 1988.

[252] K. Zetter, “Palin E-Mail Hacker Says It Was Easy,” Sep. 2008, https://www.wired.com/2008/09/palin-e-mail-ha/, as of March 27, 2019.

[253] Y. Zhang, F. Monrose, and M. K. Reiter, “The Security of Modern Password Expiration: An Algorithmic Framework and Empirical Analysis,” in ACM Conference on Computer and Communications Security (CCS ’10). Chicago, Illinois, USA: ACM, Oct. 2010, pp. 176–186.

[254] R. Zhao, C. Yue, and K. Sun, “A Security Analysis of Two Commercial Browser and Cloud Based Password Managers,” in IEEE International Conference on Social Computing (SocialCom ’13). Alexandria, Virginia, USA: IEEE, Sep. 2013, pp. 448–453.

[255] Y. Zou, A. H. Mhaidli, A. McCall, and F. Schaub, ““I’ve Got Nothing to Lose”: Consumers’ Risk Perceptions and Protective Actions after the Equifax Data Breach,” in Symposium on Usable Privacy and Security (SOUPS ’18). Baltimore, Maryland, USA: USENIX, Aug. 2018, pp. 197–216.

[256] M. Zviran and W. J. Haga, “User Authentication by Cognitive Passwords: An Empirical Assessment,” in Jerusalem Conference on Information Technology (JCIT ’90). Jerusalem, Israel: IEEE, Oct. 1990, pp. 137–144.

[257] 8bit Solutions, LLC, “bitwarden (Web) – Password Manager,” May 2018, https://bitwarden.com, as of March 27, 2019.