
DeepMnemonic: Mnemonic Generation via Deep Attentive Encoder-Decoder Model

Yao Cheng, Chang Xu, Zhen Hai*, and Yingjiu Li

Abstract—Strong passwords are fundamental to the security of password-based user authentication systems. In recent years, much effort has been made to evaluate password strength or to generate strong passwords. Unfortunately, the usability or memorability of strong passwords has been largely neglected. In this paper, we aim to bridge the gap between strong password generation and the usability of strong passwords. We propose to automatically generate textual password mnemonics, i.e., natural language sentences, which are intended to help users better memorize passwords. We introduce DeepMnemonic, a deep attentive encoder-decoder framework which takes a password as input and then automatically generates a mnemonic sentence for the password. We conduct extensive experiments to evaluate DeepMnemonic on real-world datasets. The experimental results demonstrate that DeepMnemonic outperforms a well-known baseline in generating semantically meaningful mnemonic sentences. Moreover, a user study further validates that the mnemonic sentences generated by DeepMnemonic are useful in helping users memorize strong passwords.

Index Terms—Password mnemonic generation, deep neural network, attention mechanism, neural machine translation.

1 INTRODUCTION

Nowadays, user authentication is the key to ensuring the security of user accounts for most online services, such as email, e-commerce, and online banking. Although various authentication schemes have emerged in recent years, e.g., pattern-based or biometric-based authentication, password-based authentication remains a prevailing choice in most real-world applications, and its security relies on the difficulty of cracking passwords. Choosing strong passwords thus becomes extremely important and necessary.

In reality, service providers often present password policies to aid users in creating strong passwords. Such policies may require that a password be longer than a pre-defined minimum length, or contain multiple types of characters (letters, numbers, and special characters). Such policies are expected to guide the generation of passwords that are resistant to password attacks [1], but users tend to choose passwords that are easy to memorize in practice [2]. As a result, password policies may not be as effective as expected [3].

Much effort has been made to evaluate the strength of passwords [4] [5] [6] or to generate strong passwords [7] [8] [9]. Various methods, on the other hand, have been presented to crack passwords, assess whether passwords are sufficiently strong or not, or reduce password guessability [10] [11] [12]. However, no rigorous effort has been made to address the memorability of strong passwords. Strong passwords are usually difficult for users to memorize because their entropy values are beyond many users' memorability [13] [14] [15]. The memorability of strong passwords has become one of the biggest hindrances to the wide adoption of strong passwords in real-world applications.

One possible approach to addressing this problem is to generate passwords that are not only secure but also easy to remember. Different strategies have been applied to generating word-based memorable passwords, such as pronounceable passwords [16] [17], meaningful passwords [18], and passwords concatenating random words [19]. Unfortunately, some of such strategies were demonstrated to be vulnerable to certain attacks [18] [20], and moreover, there is no theoretical guarantee that the generated passwords are indeed strong. In addition, various user-defined rules were adopted for the generation of expression-based memorable passwords [21] [20] [22]. One limitation of this approach is that the strength of the generated passwords largely depends on the uncertainty of the phrases chosen by users. This approach also tends to suffer from the low quality of the generation rules [22].

Instead of generating strong and memorable passwords, recent effort has been made to assist users in remembering passwords using external tools, such as helper cards [23], hint images [24] [25], and engaging games [26]. In particular, Jeyaraman and Topkara [27] proposed a heuristic method that relies on textual hint headlines to deal with the memorability issue of strong passwords. Given a password, they proposed to search and find an existing natural language headline that suggests the password from a given corpus (i.e., Reuter Corpus Volume 1). If the search is unsuccessful, they would then use an external semantic lexicon named WordNet [28] to replace certain words in an existing headline with their synonyms. In this way, they could find a headline or a variant headline and use it as a hint for memorizing a given password.

• Yao Cheng is with Huawei International, Singapore. E-mail: [email protected]
• Chang Xu is with Data61, CSIRO, Australia. E-mail: [email protected]
• *Corresponding author. Zhen Hai is with Institute for Infocomm Research, A*STAR, Singapore. E-mail: [email protected]
• Yingjiu Li is with the Department of Computer and Information Science, University of Oregon, Eugene, Oregon, United States. E-mail: [email protected]

Manuscript submitted on June 3, 2019; accepted on April 2, 2020.

This approach is subject to the following limitations: (i) It can only handle passwords composed of alphabetic characters; manual intervention is needed to tackle digits or special characters, which are non-trivial for the composition of strong passwords in reality. (ii) Meaningful hint headlines are often missing by simply searching the given corpus, while creating variant headlines at the word level by using an external lexicon may result in syntactically incorrect or semantically inconsistent headline sentences. (iii) It only works for short passwords that consist of 6 or 7 characters due to the limited lengths of the headlines in the corpus.

In this work, given strong passwords, we propose to automatically generate mnemonics, i.e., natural language sentences, to bridge the gap between the security and memorability of the passwords. Specifically, we introduce a new mnemonic generation system named DeepMnemonic, which is capable of generating a semantically meaningful mnemonic sentence for each given textual password. The core of DeepMnemonic is an attentive encoder-decoder neural network. It learns to transform any given user password into a mnemonic sentence by first encoding the password character sequence and then decoding the encoded information into the mnemonic sentence via a specially designed attention strategy. Both encoder and decoder are implemented with recurrent neural networks that can capture the contextual dependency among pairwise input passwords and mnemonic sentences. The key insight of DeepMnemonic is inspired by a cognitive psychology theory [29], which states that the human ability to memorize and recall certain information is positively influenced by associating additional semantic content with the information [27]. The natural language mnemonic sentences generated by DeepMnemonic can provide such semantically meaningful content for users to remember and recall the given strong passwords.

Different from the existing textual password hint systems [27], DeepMnemonic enjoys several unique properties: (i) Feature-free: DeepMnemonic is an end-to-end solution, which aims to directly map a given textual password to its corresponding mnemonic sentence, and no manual feature engineering is needed in the learning process. (ii) Long-password-friendly: The attention-based recurrent neural network component in DeepMnemonic is capable of identifying salient information in long passwords for generating semantically meaningful sentences. (iii) Learning-adaptive: The learning of DeepMnemonic is fitted to different types of password-sentence training pairs so as to process diversified passwords and meet various mnemonic generation requirements.

In summary, we have made the following main contributions in this work:

• We introduce DeepMnemonic, a mnemonic sentence generation system, to help users remember strong passwords. DeepMnemonic utilizes an encoder-decoder neural network model to generate meaningful mnemonic sentences for given strong passwords.
• We quantitatively evaluate the capability of DeepMnemonic in mnemonic sentence generation. Our experimental results show that DeepMnemonic achieves 99.09% MP (Mnemonic Proportion) and 16.47 BLEU-4 (BiLingual Evaluation Understudy), outperforming an n-gram language model baseline (83.62% MP and 5.09 BLEU-4).
• We conduct a user study to qualitatively evaluate the helpfulness of DeepMnemonic. The results demonstrate that DeepMnemonic helps users with a 54.47% decrease in time spent on remembering passwords and a 57.14% decrease in recall error, measured by the edit distance between each pair of a given password and its recalled version.

The rest of this paper is organized as follows. Section 2 describes the password mnemonic generation problem and its application scenario. Section 3 introduces DeepMnemonic, a deep attentive encoder-decoder system that can generate mnemonic sentences for given textual passwords. In Section 4 and Section 5, we evaluate DeepMnemonic quantitatively and qualitatively in an experiment and a user study, respectively. Section 6 discusses several usability issues of DeepMnemonic. Section 7 summarizes the related work, and Section 8 concludes this paper.

2 PROBLEM STATEMENT AND APPLICATION SCENARIO

Given a strong textual password that often consists of random characters, the aim of this work is to assist common users in remembering the password properly. According to a famous cognitive psychology theory [29], in order to memorize a piece of information, it is helpful to associate some external tools or contextual contents with the information. More specifically, the cognitive psychology theory [29] explains that the information to be remembered can be a list of items, for example, a list of characters in a password. As for the semantic content, it is not limited to a vivid image representing, symbolizing, or suggesting each item of the information. It indicates that, in order to better memorize a piece of information, users can construct such semantic content and associate it with the information. Inspired by the theory, we propose DeepMnemonic, an intelligent system that helps users construct the semantic content by suggesting a meaningful mnemonic sentence corresponding to each character in the password. For example, given a text password Tcahm,"Wac?" to be remembered, a semantic content can be The child asked her mother, "Who are children?", where the first letter of each word in the mnemonic sentence suggests and corresponds to each character in the password.

Technically, the problem of password mnemonic generation is closely related to the machine translation problem in natural language processing. Both tasks aim to address a sequence-to-sequence learning problem. Specifically, in machine translation, the goal is to generate a translated sentence in a target language given a sentence in a source language, whereas in our case, we focus on generating a mnemonic sentence for a given password (which is a sequence of characters). We propose to exploit a neural sequence-to-sequence language model [30] to translate a given password (a character sequence) into a meaningful mnemonic sentence (a word sequence).

Fig. 1: An applicable scenario of DeepMnemonic in the password-based authentication system.

Formally, let D be a set of N pairs of password and mnemonic sentence (X_n, Y_n) (n ∈ {1, ..., N}), where X_n denotes a password of T characters, X_n = x_1, x_2, ..., x_T, in which a character x_t (t ∈ {1, ..., T}) can be an alphabetic, punctuation, digital, or special character, while Y_n refers to a mnemonic sentence of T dictionary terms, Y_n = y_1, y_2, ..., y_T, in which a term y_t (t ∈ {1, ..., T}) can be a word, punctuation mark, number, or special token. In the following, we use words or tokens to refer to dictionary terms. For each pair of password and mnemonic sentence, a one-to-one mapping relationship exists between the characters of the password and the words of the mnemonic sentence, where each character of the password may appear at a certain position of the corresponding mnemonic word. For example, given a text password O,y,slt., a mnemonic sentence can be Oh, yes, something like that., where each alphabetic character of the password corresponds to the first letter of the respective word in the mnemonic sentence.

We formulate mnemonic sentence generation as a natural language generation problem. Given a (strong) password X_n, the problem is to automatically generate a semantically coherent and meaningful mnemonic sentence Y_n of the same length as X_n. Our solution, named DeepMnemonic, employs a neural attentive encoder-decoder language model [30] to generate textual mnemonic sentences for passwords. DeepMnemonic builds on a sequence-to-sequence learning framework, which encodes an input password (i.e., a sequence of characters) and automatically generates a mnemonic sentence (i.e., a sequence of words) based on the encoded contextual information.

Figure 1 shows an applicable scenario of DeepMnemonic in the conventional password-based authentication system. When a user needs to create a password for registration (step 1 and step 2), the authentication service requests the password generator to generate a strong password (step 3). Typically, the generated password may not be easy for the user to remember due to the randomness of the password [13]. To improve the memorability of the strong password, the password generator requests DeepMnemonic to produce a mnemonic sentence corresponding to the generated password (step 4). Once the password generator receives the response from DeepMnemonic (step 5), it returns the generated password as well as the corresponding mnemonic sentence to the service (step 6). Finally, the service distributes the password and mnemonic information to the user via a secure channel (step 7).

Notice that DeepMnemonic does not modify the strong passwords generated by the password generator, and hence does not compromise the security of the generated passwords, under the assumption that users keep both passwords and associated mnemonic sentences secure. The passwords' resilience to existing password attacks, such as brute force attacks, dictionary attacks, and guessing attacks, relies on strong password generation, which has been well studied independently of our research. DeepMnemonic does not generate strong passwords but focuses on the usability of strong passwords. It can generate a corresponding mnemonic sentence for a password so as to overcome the hindrance to the practical use of strong password generation techniques.
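To make the one-to-one correspondence above concrete, the following minimal Python sketch derives a password from a mnemonic sentence using the first-character rule just illustrated. The function name and the simplified tokenization are ours, not the paper's exact pipeline:

```python
import re

def derive_password(sentence: str) -> str:
    """Derive a password from a mnemonic sentence by keeping the first
    character of every token (word, number, or punctuation mark) in order.
    A simplified sketch of the rule illustrated above."""
    tokens = re.findall(r"[A-Za-z']+|\d+|\S", sentence)
    return "".join(token[0] for token in tokens)

# The running example from this section:
assert derive_password("Oh, yes, something like that.") == "O,y,slt."
```

The same kind of rule is used to construct the ground-truth pairs in Section 4.1.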

3 METHODOLOGY

3.1 Overview

DeepMnemonic applies a deep attentive encoder-decoder learning model [30] to learn a sequence-to-sequence mapping from an input password (i.e., a character sequence) to an output mnemonic sentence (i.e., a word sequence). The encoder of DeepMnemonic captures the underlying contextual meaning of a password from its sequence of characters, while the decoder generates a corresponding mnemonic sentence based on the encoded content of the password. Typically, the lengths of the given passwords are variable, and strong passwords may consist of long sequences of random characters. DeepMnemonic may lose focus in its generation process if it treats all characters of an input password equally. To tackle this issue, DeepMnemonic exploits an attention mechanism [30] to dynamically determine which parts of an input password are more relevant to generating a semantically meaningful mnemonic sentence.

Figure 2 shows the encoder-decoder learning framework of DeepMnemonic. The encoder first takes as input a textual password X_n of length T,

$$X_n = (x_1, x_2, \dots, x_t, \dots, x_T),$$

where $x_t \in \mathbb{R}^{V_C}$ (t ∈ {1, ..., T}) denotes a character, and $V_C$ is the size of the input character vocabulary. It derives a hidden context-aware summary or "meaning" of the sequence of characters from the input password via a bidirectional recurrent neural network (BiRNN) [31]. Based on the encoded hidden "meaning" of the password, the decoder of DeepMnemonic utilizes an attentive mechanism [30] to generate the corresponding mnemonic sentence Y_n word by word,

$$Y_n = (y_1, y_2, \dots, y_t, \dots, y_T),$$

where $y_t \in \mathbb{R}^{V_W}$ denotes a word, and $V_W$ is the size of the output word vocabulary.

Fig. 2: The encoder-decoder learning framework of DeepMnemonic.

3.2 Encoder

Given an input password X_n, we employ a BiRNN to derive the hidden semantic representation as the meaning of the password. BiRNN consists of both forward and backward processes, where the forward process reads the password character sequence in the original order, while the backward process reads the sequence in the reverse order. BiRNN can thus capture contextual patterns in both directions (left and right) for each password character by summarizing the information not only from the preceding characters but also from the following ones in the sequence.

Formally, by using the gated recurrent units (GRU) [32], which are a popular choice for modern RNNs, the forward process of BiRNN computes the hidden state $\overrightarrow{h}_t$ at a given time step t as follows. First, the forward reset gate $\overrightarrow{r}_t$ is computed as

$$\overrightarrow{r}_t = \sigma\big(\overrightarrow{W}^{(r)} E^{(C)} x_t + \overrightarrow{U}^{(r)} \overrightarrow{h}_{t-1}\big), \qquad (1)$$

where $\sigma$ is the sigmoid function, $E^{(C)} \in \mathbb{R}^{m \times V_C}$ is the character embedding matrix for the password characters, $\overrightarrow{W}^{(r)} \in \mathbb{R}^{n \times m}$ and $\overrightarrow{U}^{(r)} \in \mathbb{R}^{n \times n}$ are trainable parameter matrices, and m and n refer to the dimensionalities of the character embedding and the hidden state vector, respectively.

Similarly, the forward update gate $\overrightarrow{z}_t$ is computed as

$$\overrightarrow{z}_t = \sigma\big(\overrightarrow{W}^{(z)} E^{(C)} x_t + \overrightarrow{U}^{(z)} \overrightarrow{h}_{t-1}\big), \qquad (2)$$

where $\overrightarrow{W}^{(z)} \in \mathbb{R}^{n \times m}$ and $\overrightarrow{U}^{(z)} \in \mathbb{R}^{n \times n}$ are trainable parameter matrices.

Then, the forward hidden state vector $\overrightarrow{h}_t$ is computed as

$$\overrightarrow{h}_t = (1 - \overrightarrow{z}_t) \circ \overrightarrow{h}_{t-1} + \overrightarrow{z}_t \circ \tilde{\overrightarrow{h}}_t, \qquad (3)$$

where

$$\tilde{\overrightarrow{h}}_t = \tanh\big(\overrightarrow{W} E^{(C)} x_t + \overrightarrow{U}[\overrightarrow{r}_t \circ \overrightarrow{h}_{t-1}]\big),$$

both $\overrightarrow{W} \in \mathbb{R}^{n \times m}$ and $\overrightarrow{U} \in \mathbb{R}^{n \times n}$ are trainable parameters, and $\circ$ refers to element-wise multiplication.

Following the aforementioned steps, the backward hidden state $\overleftarrow{h}_t$ can be computed reversely for any given time step t. Note that the character embedding matrix $E^{(C)}$ is shared by both forward and backward processes of BiRNN. Next, we concatenate the forward and backward hidden states $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ for each time step t, and derive a set of overall hidden states $(h_1, \dots, h_T)$, where

$$h_t = \begin{bmatrix} \overrightarrow{h}_t \\ \overleftarrow{h}_t \end{bmatrix}. \qquad (4)$$
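The forward and backward GRU passes of Eqs. (1)-(4) can be sketched directly in NumPy. This is a toy illustration with randomly initialized parameters and naming of our own choosing, not the trained DeepMnemonic encoder:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_emb, h_prev, P):
    """One GRU step mirroring Eqs. (1)-(3). P maps names to the trainable
    matrices: W, W_r, W_z (n x m) and U, U_r, U_z (n x n)."""
    r = sigmoid(P["W_r"] @ x_emb + P["U_r"] @ h_prev)          # reset gate, Eq. (1)
    z = sigmoid(P["W_z"] @ x_emb + P["U_z"] @ h_prev)          # update gate, Eq. (2)
    h_tilde = np.tanh(P["W"] @ x_emb + P["U"] @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                    # Eq. (3)

def birnn_encode(char_ids, E_C, P_fwd, P_bwd):
    """Run forward and backward passes over a password's character ids and
    concatenate the two states per time step, as in Eq. (4)."""
    n = P_fwd["U"].shape[0]
    fwd, bwd = [], [None] * len(char_ids)
    h = np.zeros(n)
    for t in range(len(char_ids)):                 # left-to-right pass
        h = gru_step(E_C[:, char_ids[t]], h, P_fwd)
        fwd.append(h)
    h = np.zeros(n)
    for t in reversed(range(len(char_ids))):       # right-to-left pass
        h = gru_step(E_C[:, char_ids[t]], h, P_bwd)
        bwd[t] = h
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]  # each h_t has size 2n

# Toy usage with random parameters (m: embedding size, n: hidden size):
rng = np.random.default_rng(0)
m, n, V_C = 8, 16, 64
E_C = rng.normal(size=(m, V_C))
params = lambda: {k: rng.normal(scale=0.1, size=(n, m) if k.startswith("W") else (n, n))
                  for k in ("W", "W_r", "W_z", "U", "U_r", "U_z")}
states = birnn_encode([3, 1, 4, 1, 5], E_C, params(), params())
print(len(states), states[0].shape)  # 5 (32,)
```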

3.3 Decoder

Based on the encoded hidden states $(h_1, \dots, h_T)$ of the input password, the decoder of DeepMnemonic employs an attention mechanism to evaluate the importance of individual hidden states. Then, it derives an overall context-aware representation of the hidden states in order to generate each target word in the corresponding mnemonic sentence.

In particular, for each encoded hidden state $h_t$ at time step t, the decoder first computes an attentive weight $\alpha_{t',t}$ with regard to the contextual state $s_{t'-1}$ at time step t′,

$$\alpha_{t',t} = \frac{\exp(e_{t',t})}{\sum_{\tau=1}^{T} \exp(e_{t',\tau})}, \qquad (5)$$

where

$$e_{t',t} = v^{(a)\top} \tanh\big(W^{(a)} s_{t'-1} + U^{(a)} h_t\big)$$

is an alignment model that evaluates the relevance between the input hidden state $h_t$ and the previous output hidden state $s_{t'-1}$. Here $v^{(a)} \in \mathbb{R}^{n'}$, $W^{(a)} \in \mathbb{R}^{n' \times n}$, and $U^{(a)} \in \mathbb{R}^{n' \times 2n}$ are trainable parameters, and n′ refers to the dimensionality of the hidden state vector of the attention layer.

Then, the attentive context vector $c_{t'}$ is computed via a weighted sum,

$$c_{t'} = \sum_{t=1}^{T} \alpha_{t',t} h_t. \qquad (6)$$

We can then compute the hidden state $s_{t'}$ of the decoder at time t′, given the hidden states of the encoder, via a decoder GRU as below,

$$s_{t'} = (1 - z_{t'}) \circ s_{t'-1} + z_{t'} \circ \tilde{s}_{t'}, \qquad (7)$$

where

$$\tilde{s}_{t'} = \tanh\big(W E^{(W)} y_{t'-1} + U[r_{t'} \circ s_{t'-1}] + C c_{t'}\big),$$
$$z_{t'} = \sigma\big(W^{(z)} E^{(W)} y_{t'-1} + U^{(z)} s_{t'-1} + C^{(z)} c_{t'}\big),$$
$$r_{t'} = \sigma\big(W^{(r)} E^{(W)} y_{t'-1} + U^{(r)} s_{t'-1} + C^{(r)} c_{t'}\big). \qquad (8)$$

In the above formulas, $E^{(W)} \in \mathbb{R}^{m \times V_W}$ is the word embedding matrix for the output mnemonic sentence; $W, W^{(z)}, W^{(r)} \in \mathbb{R}^{n \times m}$, $U, U^{(z)}, U^{(r)} \in \mathbb{R}^{n \times n}$, and $C, C^{(z)}, C^{(r)} \in \mathbb{R}^{n \times 2n}$ are all trainable parameters; m and n are the dimensionalities of the word embedding and the hidden state vector, respectively. Note that the initial hidden state of the decoder $s_0$ is computed as

$$s_0 = \tanh\big(W^{(s)} \overleftarrow{h}_1\big), \qquad (9)$$

where $W^{(s)} \in \mathbb{R}^{n \times n}$ is the parameter matrix.
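Eqs. (5)-(8) translate into a compact sketch as well; again, this is a minimal NumPy illustration under our own parameter naming rather than the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention(s_prev, H, v_a, W_a, U_a):
    """Attention weights over the encoder states H (a list of 2n-vectors),
    Eq. (5), and the resulting context vector c_{t'}, Eq. (6)."""
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in H])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                        # softmax over the T characters
    c = sum(a * h for a, h in zip(alpha, H))    # weighted sum, Eq. (6)
    return alpha, c

def decoder_step(s_prev, y_prev_emb, c, P):
    """One decoder GRU step conditioned on the context vector, Eqs. (7)-(8).
    P maps names to the trainable matrices W*, U*, C* of Eq. (8)."""
    r = sigmoid(P["W_r"] @ y_prev_emb + P["U_r"] @ s_prev + P["C_r"] @ c)
    z = sigmoid(P["W_z"] @ y_prev_emb + P["U_z"] @ s_prev + P["C_z"] @ c)
    s_tilde = np.tanh(P["W"] @ y_prev_emb + P["U"] @ (r * s_prev) + P["C"] @ c)
    return (1.0 - z) * s_prev + z * s_tilde     # s_{t'}, Eq. (7)
```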

Next, given the decoder state $s_{t'}$, the attentive context vector $c_{t'}$, and the previously generated word $y_{t'-1}$, we follow [30] and define the probability for generating the target word $y_{t'}$ as follows,

$$p(y_{t'} \mid s_{t'}, c_{t'}, y_{t'-1}) \propto \exp\big(y_{t'}^\top W^{(o)} l_{t'}\big), \qquad (10)$$

where

$$l_{t'} = \big[\max\{\tilde{l}_{t',2k-1}, \tilde{l}_{t',2k}\}\big]^\top_{k=1,\dots,K}$$

and $W^{(o)} \in \mathbb{R}^{V_W \times K}$. The hidden state $\tilde{l}_{t'}$ is computed as follows,

$$\tilde{l}_{t'} = U^{(o)} s_{t'} + V^{(o)} E^{(W)} y_{t'-1} + C^{(o)} c_{t'}, \qquad (11)$$

where $U^{(o)} \in \mathbb{R}^{2K \times n}$, $V^{(o)} \in \mathbb{R}^{2K \times m}$, and $C^{(o)} \in \mathbb{R}^{2K \times 2n}$ are trainable parameters.

4 EXPERIMENTS

We conducted quantitative experiments on a large-scale publicly available dataset to evaluate DeepMnemonic for mnemonic sentence generation. We also carried out a case study to understand the influence of password lengths and digital/special characters on the utility of DeepMnemonic. In addition, we provide a visualized understanding of the attention mechanism in DeepMnemonic.

4.1 Dataset

"Webis-Simple-Sentences-17" is a publicly available data source for analyzing expression-based passwords, and the sentences in the dataset are similar to human-chosen mnemonics in terms of syllable distribution [22]. After filtering out those sentences that contain non-English words, we derived a password from each remaining sentence by concatenating the first letter of each word and the remaining special characters (including punctuation marks and numerical digits) in the original order as shown in the sentence. For example, the password O,y,slt. can be derived from the following sentence: Oh, yes, something like that. In this way, we collected totally 500,000 pairs of passwords and sentences, forming a ground-truth dataset for building DeepMnemonic. It is worth noting that, in creating the ground truth data, it is also possible to derive a password by concatenating the last (or a middle) letter of each word and the remaining special characters in each sentence.

In preprocessing the ground-truth data, all words of each sentence were converted to lowercase, which was helpful to reduce the size of the vocabulary for language generation. In particular, by following the given passwords, we can easily restore the corresponding words of the generated mnemonic sentences to uppercase wherever applicable. In addition, extra boundary symbols were inserted at the start and the end of each sentence to indicate its boundary. A special unknown symbol was also included in the vocabulary, to be used in the generation of sentences when no appropriate words can be predicted. In the pairwise ground-truth dataset, the minimum and maximum lengths of passwords are 8 and 16, respectively, which are compatible with the password length requirements in many authentication systems.

From the ground-truth dataset, we randomly selected 450,000 pairs as training data, and used the remaining 50,000 for testing. Among the training data, 20% of the pairs were randomly held out as validation data for model selection. After preprocessing, we obtained a vocabulary of 109,584 words (tokens) for the mnemonic sentence generation task.

4.2 Experimental Setting

In DeepMnemonic, the hidden layer sizes of both encoder and decoder were set to 256, which were optimally selected using the validation set. The dropout strategy has been previously shown effective in preventing neural network models from overfitting [33], and the dropout rate was set to 0.2 in our experiments.

Note that no extra information, e.g., pre-defined generation rules, is required beyond the training data. Once the training process is done, DeepMnemonic can be used to automatically generate a mnemonic sentence for any given password.

4.2.1 Beam Search

Simply generating the best word that achieves the highest predictive probability at each time step may not always result in an overall semantically meaningful sentence in practice. Therefore, a left-to-right beam search strategy [34] was employed to find the most likely mnemonic sentence [35] [30] for each given password.

In particular, the beam-search-based decoder stores a predetermined number b (beam width) of partial sentences, where each partial sentence is a prefix of a candidate mnemonic. At each time step, each partial sentence in the beam grows with a possible mnemonic word from the decoder vocabulary. Clearly, this process would greatly increase the number of candidate sentences. To overcome this issue, only the b most likely candidates are maintained in terms of their predicted probabilities. Once the end-of-sentence symbol is appended to a candidate mnemonic, it is included in the set of full mnemonic sentences. In general, the wider the beam width b is, the more candidate mnemonics the decoder searches over, and thus the better the results that could be achieved.
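A hedged sketch of such a beam search decoder is given below; `step_fn` is a hypothetical stand-in for the attentive decoder of Section 3.3, assumed to supply the per-step word log-probabilities:

```python
def beam_search(step_fn, init_state, eos_id, b=5, max_len=20):
    """Left-to-right beam search sketch. `step_fn(prefix, state)` is assumed
    to return a list of (word_id, log_prob, new_state) continuations."""
    beam = [([], 0.0, init_state)]      # (prefix, cumulative log-prob, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score, state in beam:
            for word_id, logp, new_state in step_fn(prefix, state):
                candidates.append((prefix + [word_id], score + logp, new_state))
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beam = []
        for prefix, score, state in candidates[:b]:   # keep the b best partials
            if prefix[-1] == eos_id:
                finished.append((prefix, score))      # end-of-sentence emitted
            else:
                beam.append((prefix, score, state))
        if not beam:
            break
    return max(finished, key=lambda cand: cand[1], default=None)
```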
4.2.2 Baseline

The n-gram language model is widely known as one of the dominant methods for probabilistic language modeling [36]. As a non-parametric learning method, it primarily utilizes the preceding sequence of n − 1 words to estimate a conditional probability for the prediction of the current word. Various values of n were tested for the n-gram model, and the bigram language model (n = 2) achieved decent generation performance and was chosen as the baseline to benchmark the proposed DeepMnemonic.

The bigram language model depends on the preceding word for estimating the conditional probability, and then generates the current word that has the highest probability. Therefore, the bigram model is unable to automatically figure out the relationship between a given password and its mnemonic sentence, i.e., the correspondence between the alphabetic characters of the password and the first letters of the mnemonic words in the sentence. To properly apply the bigram language model to password mnemonic generation, we manually adopted a pre-defined generation rule. Specifically, to generate a mnemonic word, it is required that the word not only achieve the highest conditional probability (given its preceding word), but its first letter must also be identical to the corresponding character in the given password.
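A sketch of this constrained baseline follows; the probability table, vocabulary, and token names are our assumptions for illustration:

```python
def bigram_mnemonic(password, bigram_prob, vocab, start="<s>", unk="<unk>"):
    """Constrained bigram baseline sketch: at each position, emit the word
    with the highest bigram probability whose first character matches the
    password character; non-alphanumeric characters are copied through.
    `bigram_prob[(prev, w)]` is an assumed probability lookup table."""
    words, prev = [], start
    for ch in password:
        if not ch.isalnum():
            words.append(ch)                     # punctuation kept verbatim
            continue
        candidates = [w for w in vocab
                      if w[0] == ch.lower() and bigram_prob.get((prev, w), 0.0) > 0.0]
        if not candidates:
            words.append(unk)                    # no word satisfies both constraints
            continue
        prev = max(candidates, key=lambda w: bigram_prob[(prev, w)])
        words.append(prev)
    return " ".join(words)
```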

4.2.3 Evaluation Metrics

Two metrics were used to evaluate the quality of the mnemonic sentences generated by DeepMnemonic and the baseline method. One metric is BLEU (BiLingual Evaluation Understudy), which is one of the most popular metrics for evaluating the quality of machine translation in natural language processing [37]. Following [30], we used BLEU to evaluate the quality of the generated mnemonic sentences with respect to the ground-truth sentences in the test set, where the quality refers to the correspondence between each pair of generated sentence and ground-truth sentence. In other words, the closer the generated mnemonic sentence is to the ground-truth sentence, the more meaningful and consistent it is. BLEU-n is defined as follows (the higher, the better):

$$\text{BLEU-}n = B \cdot \exp\Big(\frac{1}{N'} \sum_{n=1}^{N'} \log p_n\Big),$$

where N′ is the maximum length of n-grams and N′ = 4 was used in the experiments. B refers to the brevity penalty, computed as

$$B = \begin{cases} 1 & \text{if } c > r, \\ e^{1-r/c} & \text{if } c \le r, \end{cases}$$

where c is the size of the generated mnemonic set, and r is the size of the ground-truth sentence set. The modified n-gram precision $p_n$ is computed as

$$p_n = \frac{\sum_{C \in H} \sum_{\text{n-gram} \in C} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{C' \in H} \sum_{\text{n-gram}' \in C'} \mathrm{Count}(\text{n-gram}')},$$

where H is the set of generated mnemonic sentences, Count(n-gram′) is the number of n-gram′s in a generated mnemonic, and Count_clip(n-gram) is the clipped number of the n-grams of a generated mnemonic with regard to the corresponding ground-truth mnemonic sentence.

The other metric, Mnemonic Proportion (MP), is specifically defined for testing the matching proportion between pairwise passwords and mnemonic sentences. Specifically, given each pair of password X_n and generated mnemonic sentence Y_n, MP calculates the proportion of the cases where each word y_t of mnemonic Y_n does start with the corresponding character x_t in password X_n, as shown below:

$$\mathrm{MP} = \frac{\sum_{n=1}^{N} \mathrm{MP}_n}{N},$$

where N is the size of the test set. Then MP_n is defined as

$$\mathrm{MP}_n = \frac{|\{y_t \mid y_t \in Y_n \text{ and } \mathrm{FirstLetter}(y_t) = x_t\}|}{|X_n|},$$

where |X_n| is the length of password X_n, and FirstLetter(·) is a function that returns the first letter of a word.
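MP is straightforward to implement; a small sketch assuming whitespace-tokenized sentences (the function name is ours):

```python
def mnemonic_proportion(pairs):
    """Compute MP over (password, generated sentence) pairs, following the
    definition above: the average fraction of positions whose mnemonic
    token starts with the corresponding password character."""
    def mp_n(password, sentence):
        tokens = sentence.split()
        hits = sum(1 for ch, tok in zip(password, tokens)
                   if tok.lower().startswith(ch.lower()))
        return hits / len(password)
    return sum(mp_n(p, s) for p, s in pairs) / len(pairs)

# The pair from Section 2 matches perfectly:
print(mnemonic_proportion([("O,y,slt.", "oh , yes , something like that .")]))  # 1.0
```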
4.3 Experimental Results

This section reports the results of mnemonic sentence generation via DeepMnemonic and the baseline model in terms of MP and BLEU. DeepMnemonic ran on an Nvidia Tesla P100 GPU with 16GB memory. Its training process with 450,000 training data pairs took around 15 hours, and the inference on the entire test data set with 50,000 passwords took about 5 minutes.

4.3.1 MP

Table 1 shows the MP results of DeepMnemonic and the bigram language model (the higher, the better). As we can see, DeepMnemonic achieves much better results compared to Bigram over different beam widths.

TABLE 1: The MP results of DeepMnemonic and the bigram language model (Bigram) given beam widths (b) of 1 and 5.

        DeepMnemonic   Bigram
b = 1   99.09%         83.62%
b = 5   99.15%         98.44%

Given beam width b = 1, DeepMnemonic attains an MP value of 99.09%, while Bigram only achieves 83.62% MP. When the beam width increases to 5, the Bigram MP increases up to 98.44%, but is still lower than that of DeepMnemonic. Surprisingly, increasing the beam width does not lead to significant gains for DeepMnemonic. This suggests that DeepMnemonic can achieve a high MP value given a small beam width b = 1, and it is not sensitive to the choice of the width of the beam search.

4.3.2 BLEU

This section reports the BLEU-n scores for the generated sentences with regard to the ground-truth mnemonic sentences [37] [35]. Table 2 lists the BLEU-n results with the order n ∈ {1, 2, 3, 4} (N′ = 4) for DeepMnemonic and Bigram given different beam widths b ∈ {1, 5}. Overall, DeepMnemonic outperforms Bigram significantly and consistently in all cases.

TABLE 2: The BLEU scores of DeepMnemonic and the bigram language model (Bigram) given beam widths (b) of 1 and 5.

         DeepMnemonic        Bigram
         b = 1    b = 5      b = 1    b = 5
BLEU-1   45.17    45.29      28.82    37.22
BLEU-2   30.26    30.38      14.72    22.07
BLEU-3   22.33    22.44      8.49     14.29
BLEU-4   16.47    16.57      5.09     9.48

If the beam width is fixed at b = 1 or b = 5, the BLEU-n score for each model decreases as the order n grows from 1 to 4. This is expected, as increasing the order n results in a stricter evaluation of BLEU for the generated sentences, which is consistent with the findings in [37]. In other words, not only are the words between each pair of the generated and the ground-truth sentences required to be identical, but the order of the words in the n-gram also needs to be exactly the same.

DeepMnemonic achieves the best BLEU-1 of 45.29 compared to the baseline Bigram given the beam width b = 5. Roughly, this means that about 45 out of 100 generated mnemonic words (unigrams) are identically matched to the ground truth. In addition, the results again show that DeepMnemonic is robust to the width of the beam search. In contrast, the BLEU-n values (from 1 to 4) of Bigram improve significantly by increasing the beam width from b = 1 to b = 5. But the best BLEU-1 (37.22) of Bigram at b = 5 is still lower than that of DeepMnemonic at b = 1.

In addition, it is worth noting that BLEU-n is designed to measure the correspondence between each pair of generated mnemonic sentence and ground-truth sentence. However, it is possible that a generated mnemonic sentence is helpful in assisting users in memorizing a given password, but has a low BLEU-n score because it does not match its ground truth very well. To mitigate this issue, Section 5 further conducts a user study about the helpfulness of DeepMnemonic, where participants are invited to memorize and recall the assigned passwords using mnemonic sentences generated by DeepMnemonic.

4.4 Case Study

In this section, we conduct a case study to provide a complementary understanding of the generated mnemonic sentences. The case study reveals the influence of password lengths and digital/special characters on mnemonic generation. Moreover, we visualize the effectiveness of the attention mechanism in DeepMnemonic to explain its semantic meaningfulness.

Given a list of 10 randomly selected passwords of different lengths, Table 3 shows the corresponding mnemonic sentences generated via the DeepMnemonic and Bigram methods. Note that the row "Original" shows the ground-truth sentences from our ground-truth dataset.

4.4.1 Password Length

It can be observed that, for short passwords, all the sentences generated by either DeepMnemonic or Bigram match all the password characters (i.e., 100% MP). However, as the password length grows, the unknown token begins to appear more frequently in the mnemonic sentences generated by Bigram compared to DeepMnemonic. The main reason is that, when generating a sentence, it is sometimes difficult for the Bigram model to find a word that not only identically matches the given password character but also has a conditional probability greater than zero given the preceding word. For example, Bigram begins to generate the unknown token from the third position onwards in Case 7. Given the first generated word "but", the model identifies the next word "z2", which has the highest probability and also starts with the character "z" of the given password. However, when continuing to generate the next word based on "z2", Bigram fails to find any words that start with the password character "e" and also have a positive conditional probability. As a result, Bigram generates an unknown token instead. In contrast, DeepMnemonic does not have such an issue, and for each given password, it can complete the automatic generation of an entire meaningful sentence.

Figure 3 plots the impact of password lengths on MP values at b = 5. We can observe that for both Bigram and DeepMnemonic, MP values decrease as the passwords become longer. The longer the passwords are, the more difficult the task of generating semantically sensible sentences is. It is clear that the MP values of DeepMnemonic are always better than those of Bigram at different password lengths, suggesting that DeepMnemonic generates better mnemonic sentences from the given passwords. For example, the mnemonic sentence generated by DeepMnemonic in Case 9 in Table 3 is more sensible and memorable than that by Bigram, although both sentences match all characters of the given password.

Fig. 3: Impact of the password length on MP values (%) given b = 5.

4.4.2 Digital and Special Characters

In many password generation systems, digital and special characters play an important role in creating strong passwords. DeepMnemonic is able to handle such characters when generating mnemonic sentences for strong passwords. As shown in Case 3 of Table 3, DeepMnemonic generates a digital sequence "1937" from the number "1" in the given password, and then it concatenates previously generated words to form a meaningful expression, "in february 1937", for the given password subsequence "iF1". In contrast, Bigram generates the words "it for 1" for the same subsequence, which is less readable and not much help for remembering the password. Overall, the mnemonic sentence generated by Bigram is not as meaningful as the sentence by DeepMnemonic. In Case 6, DeepMnemonic generates a sensible mnemonic phrase, "75 casualties in 24 hours", from the password segment "7ci2h", and also produces an overall semantically meaningful mnemonic sentence. Although Bigram generates a meaningful phrase, "70 countries in 2006 have", from the same password segment, unfortunately, it ends up with a grammatically incorrect and semantically inconsistent sentence.

In general, it is challenging to handle special characters of passwords such as "/" and "*" when generating mnemonic sentences. For example, in Case 2, given the slash character "/" between "a" and "e" of the password, DeepMnemonic generates a sensible phrase, "author / editor", where "/" normally represents "or", while the Bigram model outputs a strange phrase, "a / etc", for the same password segment. In Case 4 of Table 3, following the password segment "*g*", DeepMnemonic generates a reasonable phrase, "*good*". Though it is not identical to the original one ("*giddy*"), semantically, this phrase can be used to highlight the meaning of "good" in the generated sentence. However, Bigram fails to generate either a short sensible phrase or an entire semantically meaningful sentence.

One more interesting example is the colon symbol ":" in Case 10, which is typically used to indicate the start of an utterance. Surprisingly, DeepMnemonic generates a name, "alan" (Alan), for the character "A" in front of the colon ":", and then generates an utterance for the subsequence of characters following the colon symbol. Bigram fails again to handle this special character.

TABLE 3: Randomly selected passwords and the corresponding mnemonic sentences generated by DeepMnemonic and Bigram (Original means that the mnemonic sentences come from the ground-truth dataset).

Case 1 (Short)
  Password:     Y m s , b t o .
  Original:     you'll miss some , but that's okay .
  DeepMnemonic: you may say , but that's ok .
  Bigram:       you might say , but the other .

Case 2 (Short)
  Password:     S I b a a / e ?
  Original:     should i buy a acoustic / electric ?
  DeepMnemonic: should i be an author / editor ?
  Bigram:       so i bought at a / etc ?

Case 3 (Short)
  Password:     T f i d i F 1 .
  Original:     this finding is diagrammed in figure 1 .
  DeepMnemonic: the festival is due in february 1937 .
  Bigram:       the first i do it for 1 .

Case 4 (Medium)
  Password:     B I j c d t * g * t .
  Original:     but i just can't do the * giddy * thing .
  DeepMnemonic: but i just can't do the * good * thing .
  Bigram:       but i just can't do they * go

Case 5 (Medium)
  Password:     T c a h m , " W a c ? "
  Original:     the child asks his mother , " what are circles ? "
  DeepMnemonic: the child asked her mother , " who are children ? "
  Bigram:       they can also have more , " we are called ? "

Case 6 (Medium)
  Password:     D t m t b h s 7 c i 2 h ?
  Original:     does this mean the book has sold 7 copies in 24 hours ?
  DeepMnemonic: does this mean that because he sees 75 casualties in 24 hours ?
  Bigram:       during the more than before he said 70 countries in 2006 have ?

Case 7 (Medium)
  Password:     B Z e p a l i h h t c h .
  Original:     but zeus ever pursued and longed in his heart to catch her .
  DeepMnemonic: but zelotes ever pushed a line in his head to crucify him .
  Bigram:       but z2

Case 8 (Medium)
  Password:     T , I w l t r t l o K J .
  Original:     today , i would like to recognize the life of ken jablonski .
  DeepMnemonic: today , i would like to read the letter of karl james .
  Bigram:       this , it was like to read the lord of knowledge

Case 9 (Long)
  Password:     D t s w i N - D , l a t a p .
  Original:     during the six weekends in new - delhi , lunch and tea are provided .
  DeepMnemonic: during the second week in november - december , lakes awoke to a pond .
  Bigram:       during the same way is no - day , look at the air pollution .

Case 10 (Long)
  Password:     A : Y , b t e n t b a i a l .
  Original:     ac : yes , but the egg needs to be altered in a lab .
  DeepMnemonic: alan : yeah , but that's exactly not the best album in a league .
  Bigram:       a : yes , but the early next to be an important as long .

4.4.3 Attentive Generation of Mnemonic Sentences

Generally, a mnemonic sentence that not only literally covers the characters of the given password but is also semantically meaningful is more useful for users to memorize every password character properly.

As shown in Case 8 of Table 3, DeepMnemonic generates the mnemonic sentence "today, i would like to read the letter of karl james." from password "T,IwltrtloKJ." Although this is not identical to the ground-truth sentence, i.e., "today, i would like to recognize the life of ken jablonski.", our generated mnemonic sentence seems easier for users to follow and recall the corresponding password. To be specific, DeepMnemonic decodes the given password segment "KJ" as a name, "karl james", and surprisingly, it turns out to be a new person name that does not even appear in the training data. This demonstrates the ability of DeepMnemonic to capture the semantic context of input passwords as well as the relationship between each pair of password and mnemonic sentence. This also shows that the attention mechanism in DeepMnemonic captures not only the full view of an input password context but also its salient parts, both of which are useful for mnemonic word inference at each generation step.

Figure 4 visualizes the attention weights in the alignment between each pair of input password (y-axis) and generated mnemonic sentence (x-axis) for Case 5 and Case 6. Each pixel denotes the attention weight α_{t′,t} of the t-th password character with regard to the t′-th target mnemonic word in grayscale (0: black, 1: white).
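Such a heat map can be rendered directly from the attention matrix of Eq. (5); a small matplotlib sketch (our illustration, not the paper's plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_attention(alpha, password_chars, mnemonic_words):
    """Render an attention matrix as a grayscale heat map (0: black,
    1: white), with mnemonic words on the x-axis and password characters
    on the y-axis as in Fig. 4. `alpha[t_prime, t]` holds the weights of
    Eq. (5), so the matrix is transposed for display."""
    fig, ax = plt.subplots()
    ax.imshow(np.asarray(alpha).T, cmap="gray", vmin=0.0, vmax=1.0)
    ax.set_xticks(range(len(mnemonic_words)), labels=mnemonic_words, rotation=90)
    ax.set_yticks(range(len(password_chars)), labels=list(password_chars))
    fig.tight_layout()
    plt.show()
```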

Fig. 4: The heat map of the attention weights α_{t′,t} in grayscale (0: black, 1: white): (a) Case 5; (b) Case 6.

The brighter the pixel is, the more important the password character is to the generation of the corresponding mnemonic word. From the heat map, we can observe that DeepMnemonic concentrates on the important characters of an input password when it decodes individual target mnemonic words in the generation phase. In addition, the attention layer also takes into account the contextual neighboring characters of each password character for mnemonic sentence generation. Thanks to the attention mechanism, DeepMnemonic automatically learns the alignment and discovers which characters in the input password are more important for generating a semantically meaningful mnemonic sentence.

5 USER STUDY

This section conducts a user study to analyze the usability of DeepMnemonic, and validates that the mnemonic sentences generated by DeepMnemonic are helpful for users to memorize passwords.

5.1 User Study Setting

In this user study, we recruited 24 participants from universities and institutes, and randomly divided them into three groups, labeled as group 0, group 1, and group 2. The lengths of the passwords assigned to the three groups were 8, 12, and 16, respectively. This user study includes three main tasks, i.e., password memorization, password recall, and an online questionnaire. The password memorization and password recall tasks were conducted face-to-face in our lab, where we gave task instructions and measured the participants' performance using a timer.

Figure 5 illustrates the user study procedure. In Memorization Phase I, without the aid of mnemonic sentences, each participant was asked to memorize an assigned password p1, where the time the participant used for memorization is t1. Then, after a period of time, in Recall Phase I, each participant was asked to recall and write down the assigned password p1, and the recalled password is r1. In Memorization Phase II, with the aid of the mnemonic sentences generated by DeepMnemonic, each participant was asked to memorize a different assigned password p2, where the time the participant used for memorization is t2. Later, in Recall Phase II, the recalled version of p2 is termed r2. Finally, the online questionnaire was designed to evaluate the participants' experiences of the whole user study process. Following the forgetting curve theory [38], we set the time gap between password memorization and password recall to 48 hours.¹

Fig. 5: The user study procedure (instruction, password memorization phase I, password recall phase I, password memorization phase II, password recall phase II, and online questionnaire phase).

1. The detailed user study design as well as the ethical considerations can be found at https://goo.gl/KtwuGB

5.2 User Study Results

Figure 6 shows the average results over all participants within each group (i.e., with the same password length) with and without the aid of the mnemonic sentences generated by DeepMnemonic.

Figure 6a shows the time costs in memorizing the passwords without (t1) and with (t2) the aid of mnemonic sentences. Note that t2 includes the time for memorizing the associated mnemonic sentence in addition to the password. It is observed that t2 is lower than t1, especially for memorizing longer passwords. We conducted a statistical t-test, which can work well with our small set of samples that approximate the normal distribution [39] [40]. The difference between the two sets of time costs is statistically significant at a significance level α = 0.01. This indicates that, by using the mnemonic sentences generated by DeepMnemonic, the participants are able to memorize the passwords more quickly than without the aid of any mnemonics. For group 2, who memorized passwords of length 16, the observed difference is more significant, i.e., 110 seconds versus 42.6 seconds with p-value = 0.007. Clearly, DeepMnemonic shows its capability of assisting in memorizing passwords more effectively, and it is especially helpful for memorizing relatively long passwords.

Figure 6b evaluates the password recall in terms of the edit distance between each pair of recalled password r and assigned password p (the smaller, the better). Overall, the average edit distance with the aid of DeepMnemonic is smaller than that without using DeepMnemonic. The differences become even larger when the password lengths become longer. For group 2, where the password length is 16, the average edit distance with the aid of DeepMnemonic is significantly smaller than that without the aid. Our statistical test shows that the difference is significant given the level α = 0.01. However, for shorter passwords in group 0, where the password length is 8, the average edit distances with and without using DeepMnemonic are almost the same (0.75).
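The recall error used here is the edit distance between an assigned password and its recalled version; for reference, a standard Levenshtein-distance sketch (our illustration):

```python
def edit_distance(assigned: str, recalled: str) -> int:
    """Levenshtein distance used as the recall-error measure between an
    assigned password and its recalled version (the smaller, the better)."""
    m, n = len(assigned), len(recalled)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if assigned[i - 1] == recalled[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[m][n]
```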

Fig. 6: The comparison results of different user groups with and without using DeepMnemonic. We compare the memorization time cost (a), edit distance (b), and complete recall ratio (c) of the three groups without and with the aid of DeepMnemonic. For each comparison, the results for different password lengths are shown separately. (Group averages recovered from the plotted data, without/with DeepMnemonic: memorization time (s): 10.9/9.8 for 8 characters, 55.4/27.9 for 12 characters, 110.0/42.6 for 16 characters; edit distance: 0.75/0.75, 0.875/0.5, and 3.625/1.0, respectively.)

Strongly disagree Disagree Neutral Agree Strongly agree You are perplexed by remembering passwords. case, using0 mnemonic2 sentences may9 incur10 extra burdens3 remember”24 and “wording is easy to remember”, respec- The passwords assigned to you during the user study are strong passwords. 0 1 0 18 5 24 The passwords assigned to you during the user study are easy to remember. for memorizing1 short passwords.9 3 8 3 tively.24 With regard to grammatical correctness, about 8.3% The gist of the mnemonic sentence generated by DeepMnemonic is easy to remember exactly. This can0 also be discovered0 from7 the complete14 recall ratio3 of24 participants noticed a few grammar errors existed in the The mnemonic sentence generated by DeepMnemonic are easy to remember exactly. in Figure 6c0 (the higher,2 the better).6 It shows14 that users can2 generated24 sentences. For example, in Case 6 of Table 3, the The mnemonics generated by DeepMnemonic is grammatically correct. 0 2 5 11 6 24 The mnemonics generated by DeepMnemonic is semantically meaningful. recall 8-character0 passwords1 better6 without13 the burden4 of simple24 present tense in the sentence may not be rigorously DeepMnemonic is generally helpful in remembering the passwords in general. memorizing0 extra mnemonic0 sentences.5 However,15 the help-4 reasonable.24 The majority of the participants (70.9%) agreed fulness of mnemonic sentences becomes obvious for users to that the mnemonic sentences generated by DeepMnemonic memorizeStrongly disagree 12-characterDisagree andNeutral 16-characterAgree passwords.Strongly Overall agree are meaningful. Overall, 79.2% of participants recognized You are perplexed by remembering passwords. the mnemonic0 sentences2 generated9 by DeepMnemonic10 are3 that24 DeepMnemonic is generally helpful for remembering The passwords assigned to you during the user study are strong passwords. 0 1 0 18 5 24 The passwords assigned to you during the user study are easy to remember. effective in1 improving9 the performance3 of password8 recall,3 the24 given passwords. The gist of the mnemonic sentence generated by DeepMnemonic is easy to remember exactly. especially for0 longer/stronger0 passwords.7 14 3 24 The mnemonic sentence generated by DeepMnemonic are easy to remember exactly. In addition,0 we analyzed2 the results6 of online14 question-2 24 The mnemonics generated by DeepMnemonic is grammatically correct. 0 2 5 11 6 624 DISCUSSION The mnemonics generated by DeepMnemonic is semantically meaningful. naire to evaluate0 the user1 experience.6 Figure13 7 shows that4 24 DeepMnemonic is generally helpful in remembering the passwords in general. only 8.3% of0 the participants0 indicated5 that15 they are not4 In24 this section, we discuss several issues related to the perplexed by memorization of passwords, which clearly usability of DeepMnemonic. suggestsStrongly disagree that rememberingDisagree Neutral passwordsAgree is indeedStrongly a difficult agree You are perplexed by remembering passwords. task.0.0% Almost 96%8.3% of participants37.5% agreed41.7% that the passwords12.5% 6.1 Learning from Multiple Strong Password Genera- The passwords assigned to you during the user study are strong passwords. 0.0% 4.2% 0.0% 75.0% 20.8% The passwords assigned to you during the user study are easy to remember. assigned4.2% to them37.5% are secure12.5% and strong.33.3% About 46%12.5% of them tion Rules The gist of the mnemonic sentence generated by DeepMnemonic is easy to remember exactly. 
considered0.0% their0.0% assigned passwords29.2% easy58.3% to remember.12.5% We In general, the successful application of DeepMnemonic The mnemonic sentence generated by DeepMnemonic are easy to remember exactly. note0.0% that the majority8.3% of them25.0% come from58.3% group 08.3%, where The mnemonics generated by DeepMnemonic is grammatically correct. to automatic mnemonic generation largely depends on the their0.0% assigned passwords8.3% are20.8% not very45.8% long, and25.0% are thus The mnemonics generated by DeepMnemonic is semantically meaningful. 0.0% 4.2% 25.0% 54.2% 16.7% large-scale training data used for building the sequence-to- DeepMnemonic is generally helpful in remembering the passwords in general. not very0.0% difficult0.0% to remember.20.8% 62.5% 16.7% sequence learning model. Recall that in our experiments, we generated pairwise password and mnemonic sentence Strongly disagree Disagree Neutral Agree Strongly agree training data from Webis-Simple-Sentences-17. For each You are perplexed by remembering passwords. 0.0% 8.3% 37.5% 41.7% 12.5% 100.0%sentence in the dataset, we followed one strong password The passwords assigned to you during the user study are strong passwords. 0.0%generation rule proposed in [22], and thus concatenated 0.0% 4.2% 0.0% 75.0% 20.8% 100.0%all the first letter of each word and special characters The passwords assigned to you during the user study are easy to remember. 0.0% 4.2% 37.5% 12.5% 33.3% 12.5% 100.0%in the sentence so as to create a password. It is worth The gist of the mnemonic sentence generated by DeepMnemonic is easy to remember noting that DeepMnemonic is not limited to this single exactly. 0.0%strong password generation rule adopted for constructing 0.0% 0.0% 29.2% 58.3% 12.5% 100.0% The wording of the mnemonic sentence generated by DeepMnemonic is easy to the pairwise training data. It is possible that DeepMnemonic remember exactly. 0.0%can also be trained on other datasets constructed using 0.0% 8.3% 25.0% 58.3% 8.3% 100.0%different password generation rules, such as concatenating The mnemonics generated by DeepMnemonic is grammatically correct. 0.0% 0.0% 8.3% 20.8% 45.8% 25.0% 100.0%the last letter of each word instead of the first letters in The mnemonics generated by DeepMnemonic is semantically meaningful. 0.0%deriving passwords. Thanks to the use of the sequence-to- 0.0% 4.2% 25.0% 54.2% 16.7% 100.0%sequence learning model, DeepMnemonic is able to learn DeepMnemonic is generally helpful in remembering the passwords in general. 0.0% 0.0% 0.0% 20.8% 62.5% 16.7% 100.0%language generation patterns automatically from the train- ing data, and map passwords to mnemonic sentences in Fig. 7: Online questionnaire analysis results. testing. Therefore, it is possible to train multiple encoder- decoder language generation models in DeepMnemonic A large portion of the participants agreed that both gist using different password generation rules. As a result, and exact words of each generated mnemonic sentence are users have flexibility to choose from multiple generated helpful and easy to remember. Almost 71% and 67% of mnemonic sentences for a given password, which may participants agreed on the two points, i.e., “gist is easy to further improve the usability of DeepMnemonic. 11
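Under this view, the derivation rule is just a parameter of the data construction; a variant of the Section 2 sketch makes this concrete (the helper and its `pick` argument are ours, for illustration):

```python
import re

def derive_password_with_rule(sentence: str, pick=lambda w: w[0]) -> str:
    """Variant of derive_password from Section 2: `pick` selects which
    character of each word to keep (first letter by default; e.g.,
    lambda w: w[-1] for the last-letter rule mentioned above)."""
    out = []
    for token in re.findall(r"[A-Za-z']+|\d+|\S", sentence):
        out.append(pick(token) if token[0].isalpha() else token[0])
    return "".join(out)

# Last-letter rule applied to the running example:
print(derive_password_with_rule("Oh, yes, something like that.",
                                pick=lambda w: w[-1]))  # "h,s,get."
```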

6.2 Baseline Selection study lies in the age distribution of the participants. The The n-gram language model used in our comparison eval- ages of the participants range from 24 to 40, which are not uation is one of the most well-known language generation representative of younger or older population. Indeed, this models. We have tested various values of n (n ∈ [1, 4]). Since is also a challenge in many other user studies [41]. Our user the bigram language model achieved the best performance study suggests that DeepMnemonic is generally helpful for among different n, it was selected as the baseline. young and middle-aged people. It remains a future work to We discovered that when using a unigram language evaluate the DeepMnemonic against other age groups. model to map from a password sequence to a mnemonic sequence, the model cannot leverage on the context, i.e., 6.5 Lowercase and Uppercase Password Characters preceding words. As a result, the words that appear more DeepMnemonic does not handle lowercase and uppercase frequently in training data are considered more important, characters when generating mnemonic sentences, and thus and are then assigned with a higher probability for predic- cannot help users distinguish between them. We have tion. For example, given a password character “t”, it is very tried to build a variant DeepMnemonic model, which likely that the unigram language model would generate the can generate sentences that differentiate lowercase and word “the”, regardless of the preceding contextual words. uppercase characters. However, its performance (e.g., BLEU In addition, trigram (n = 3) and 4-gram based language and MP) is not as good as that of current DeepMnemonic models built on the pairwise training data often suffer from model. We plan to deal with this lowercase and uppercase sparse co-occurrence information, and output the unknown character issue in the future work. In addition, users may token . It is more likely that these models, compared have multiple passwords across various platforms. It would to the bigram models, generate empty list of candidate be interesting to extend DeepMnemonic to help users words from two or more contextual preceding words and memorize their passwords across different platforms. a target password character. According to our experimental n results, when is larger than 2, the language models have 6.6 Generating Mnemonics in Different Languages more difficulty in generating complete sentences. Currently, the proposed DeepMnemonic has been shown to be effective in generating mnemonics for passwords in 6.3 Evaluation Metric English. Similar to the sequence-to-sequence machine transla- BLEU is one of the most popular automated metrics for tion task in natural language processing [30], the underlying evaluating the quality of machine translation [37] in natural encoder-decoder model of DeepMnemonic can encode any language processing. The quality is often considered to be input password into a semantic vectorial representation and the matching degree between each pair of generated sen- decode the representation into a semantically meaningful tence and ground-truth sentence. The closer the generated mnemonic sentence in the target language. When suitable sentence is to the ground-truth, the better the language training datasets in other languages (e.g., Chinese) are generation is. 
6.3 Evaluation Metric

BLEU is one of the most popular automated metrics for evaluating the quality of machine translation [37] in natural language processing. The quality is often considered to be the matching degree between each pair of generated sentence and ground-truth sentence: the closer the generated sentence is to the ground truth, the better the language generation is. Our quantitative experimental results show that DeepMnemonic significantly outperforms the well-established bigram language model in terms of BLEU.

We note that BLEU is a good metric for evaluating the quality of machine translation, where each generated or translated sentence should be as close to the reference translation (ground truth) as possible. The state-of-the-art BLEU score is 34.8 for machine translation using deep neural networks [35]. However, BLEU may not be a fair metric for evaluating password mnemonic generation, which aims to assist users in password memorization. The generated mnemonic sentences sometimes do not match the reference sentences (ground truth) perfectly, which may lead to poor BLEU scores. Such mnemonic sentences may nevertheless be helpful for memorizing passwords, as long as they are grammatically correct and semantically meaningful by themselves, as shown in Table 3. The metric MP proposed in Section 4.3.1 is thus designed to complement BLEU in evaluating mnemonic generation.

6.4 User Study Scale and Population

In our user study, we recruited 24 participants to evaluate the usability of DeepMnemonic. Despite the small scale, the user study shows that DeepMnemonic is helpful in assisting users in better memorizing passwords, especially when the passwords are relatively long. A limitation of this user study lies in the age distribution of the participants. The ages of the participants range from 24 to 40, which is not representative of younger or older populations. Indeed, this is also a challenge in many other user studies [41]. Our user study suggests that DeepMnemonic is generally helpful for young and middle-aged people. It remains future work to evaluate DeepMnemonic with other age groups.

6.5 Lowercase and Uppercase Password Characters

DeepMnemonic does not distinguish between lowercase and uppercase characters when generating mnemonic sentences, and thus cannot help users distinguish between them. We have tried to build a variant of the DeepMnemonic model that generates sentences differentiating lowercase from uppercase characters. However, its performance (e.g., BLEU and MP) is not as good as that of the current DeepMnemonic model. We plan to address this lowercase and uppercase character issue in future work. In addition, users may have multiple passwords across various platforms. It would be interesting to extend DeepMnemonic to help users memorize their passwords across different platforms.

6.6 Generating Mnemonics in Different Languages

Currently, the proposed DeepMnemonic has been shown to be effective in generating mnemonics for passwords in English. Similar to the sequence-to-sequence machine translation task in natural language processing [30], the underlying encoder-decoder model of DeepMnemonic can encode any input password into a semantic vectorial representation and decode the representation into a semantically meaningful mnemonic sentence in the target language. When suitable training datasets in other languages (e.g., Chinese) are available, DeepMnemonic can easily be adapted to mnemonic sentence generation in those languages.

7 RELATED WORK

7.1 Strong Password Evaluation and Generation

A lot of effort has been made in the past to assess the strength of passwords or to generate strong passwords. The strength of a password can typically be evaluated by two common metrics, namely, entropy and guessability. The entropy measures how unpredictable a password is by considering the length of the password and the distribution of its characters. One limitation of the entropy-based measurement [42] is that it only supplies users with rough approximations of password strength [3] [43] [44]. In contrast, the guessability-based measurement, which is defined as the number of guesses required to break a password, has become increasingly popular. To evaluate the guessability of a password, one key step is to identify an algorithm for password cracking, such as the Probabilistic Context-Free Grammar model (PCFG) [10] [11] [6] and the Markov n-grams model [12] [4]. These cracking algorithms exploit the password distributions that are derived from various password datasets disclosed in previous security incidents. It has been revealed that password datasets collected from different user groups demonstrate different distributions [45] [46]. The guessability of a password can then be measured using such a cracking algorithm.
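For a password whose characters are drawn uniformly at random, the entropy measure discussed above is commonly approximated from the password length and the size of the character pool. The snippet below shows this textbook estimate; it is our own illustration of the metric, not a specific strength meter from the cited works.

```python
import math

def random_password_entropy_bits(length: int, pool_size: int) -> float:
    """Entropy (in bits) of a password whose `length` characters are
    drawn uniformly from a pool of `pool_size` symbols."""
    return length * math.log2(pool_size)

# An 8-character password over lowercase letters and digits (36 symbols)
# versus a 12-character password over the 94 printable ASCII symbols.
print(round(random_password_entropy_bits(8, 36), 1))   # 41.4 bits
print(round(random_password_entropy_bits(12, 94), 1))  # 78.7 bits
```

Real user-chosen passwords are far from uniformly random, which is precisely why such entropy estimates are only rough approximations [3] [43] [44] and why guessability-based measurement has gained popularity.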

Previous studies have also evaluated password strength by measuring the popularity of passwords. Schechter et al. [47] evaluated user-chosen passwords by identifying undesirably popular passwords. Passwords within a certain popularity threshold are considered secure under probabilistic attacks. It is argued that existing password creation policies can be replaced with popularity limits so as to strengthen both the security and usability of user authentication systems.

Strong password generation aims to strengthen passwords by changing or adding characters to the passwords. Generation of persuasive text passwords is one such approach, which inserts one to four characters at random positions in a given password [7] [8]. Recently, Houshmand and Aggarwal [9] presented an analyze-modify method to generate strong passwords. They first evaluated password strength using PCFG, and then modified the identified weak passwords based on a set of editing rules. One limitation of strong password generation is that it does not take into consideration the usability or memorability of the passwords. As a consequence, the generated passwords may not be user-friendly in practice.

7.2 Memorable Password Generation

A variety of strategies have been employed by service providers to create easy-to-remember passwords. One strategy is to generate pronounceable passwords [16] [18]. For example, the pronounceable password "kilakefe52" is comprised of a random sequence of vowel-consonant pairs [16]. Another strategy is to create memorable passwords that comprise multiple components with certain meanings [17]. For example, the password "#FreDDi17%" can be divided into three meaningful parts, i.e., "#" (identity), "FreDDi" (name), and "17%" (percentage). A third strategy is to generate passwords by simply concatenating a list of random words with hyphens [19]. The benefit of this strategy is twofold: (i) it is usually difficult to guess the unrelated words of the generated passwords; and (ii) longer passwords often lead to higher entropy and security [48]. Although the passwords generated via the aforementioned strategies are memorable, their strengths have not been systematically evaluated, and some of them are vulnerable to password attacks [20].

One typical approach to generating memorable yet strong passwords is to rely on meaningful language expressions, making use of specific parts of given reference sentences or phrases according to certain generation rules [21] [20] [22]. For example, the password "Ilwm." can be generated by joining the first character of each word and the punctuation of the following sentence: "I love watching movies." The expression-based passwords generated by this approach are often supposed to be stronger than those selected intuitively by users [13]; given the reference sentences or phrases, the usability cost of memorizing the generated passwords is almost comparable to that of memorizing intuitive ones. Yang et al. [20] demonstrated that the security level of an expression-based password is largely affected by its generation rules, which was also validated by Kiesel et al. [22]. However, it is not clear how various generation rules at different security levels affect the usability cost of memorizing the generated passwords. Due to users' behavioral tendency toward picking easy-to-remember reference sentences [7], the memorable passwords generated from such human-selected language expressions may be easy for attackers to guess.

Researchers have studied various tips for creating mnemonic passwords, including sentence substitution (SenSub), keyboard change (KbCg), using a formula (UsForm), and special character insertion (SpIns) [49]. Alphapwd [50] is a memorable password generation strategy based on password shapes. In order to create memorable passwords, Alphapwd requires a user to remember a short sequence of letters that are shown in a larger size on top of a normal keyboard; the user may then derive his/her strong password easily from the keyboard by following the strokes of the short sequence of large-size letters. The strengths of these generated passwords are evaluated using probabilistic attack algorithms, such as the PCFG algorithm and Markov models trained on previously disclosed password datasets. Ghazvininejad and Knight [51] proposed to generate passwords in the form of English word sequences whose lengths range from 31.2 to 87.7 characters on average. However, experiments revealed that it is difficult for common users to reproduce the exact wording of such sequences, even if they manage to remember the gist of these sentences [52]. Being complementary to these memorable password generation methods, DeepMnemonic is designed to generate mnemonic sentences for any given password instead of generating new passwords.

7.3 Password Mnemonics

To improve the memorability of given strong passwords, various types of hints or mnemonic tools have been used.
Atallah et al. [53] were the first to propose the concept of using funny jingles for memorizing a randomly generated password. One challenge of this proposal is the generation of related funny jingles. Several other efforts have been made to assist users in remembering passwords via external tools, such as helper cards [23], hint images [24] [25], and engaging games [26]. Jeyaraman and Topkara [27] proposed to match given passwords to textual headlines, or variants thereof, selected from a given corpus in order to assist users in remembering passwords. One drawback of this approach is that the generated variant headlines may be syntactically incorrect or semantically inconsistent. Due to the limited length of headline text, the approach can only generate hints for short passwords and largely fails for long and strong passwords. The memorability of the passwords generated by this approach has not been evaluated.

7.4 Natural Language Generation

In natural language processing, statistical language models are used to generate meaningful sentences by computing the joint probabilities of sequences of words from the dictionary of a given language. One of the dominant methods for probabilistic language modeling is the n-gram language model [36], which is a non-parametric learning algorithm. It relies on the preceding sequence of n − 1 words to estimate the conditional probability for predicting (generating) the current n-th word.
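Concretely, the n-gram model estimates this conditional probability from corpus counts; the standard maximum-likelihood form (see, e.g., [36]) is written below for reference.

```latex
% Maximum-likelihood estimate of the n-gram conditional probability:
% the chance of word w_i given its n-1 predecessors is the ratio of the
% n-gram count to the count of the (n-1)-gram context.
P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
    = \frac{\operatorname{count}(w_{i-n+1}, \ldots, w_{i-1}, w_i)}
           {\operatorname{count}(w_{i-n+1}, \ldots, w_{i-1})}
```

For the bigram case (n = 2) used as our baseline, this reduces to P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).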
Recently, neural network based language models have become increasingly popular in natural language generation tasks. Bengio et al. [54] proposed a generic neural probabilistic language model that can simultaneously learn the distributed representation of each word and the joint probability function of word sequences. Sutskever et al. [35] proposed a sequence-to-sequence neural language model to address the machine translation problem. One key benefit of their model is that, given a sentence in the source language, it can automatically generate a translated sentence in the target language. In order to improve machine translation, Bahdanau et al. [30] introduced an attentive alignment strategy to enhance the sequence-to-sequence neural language model, which learns to dynamically pay more attention to the salient parts of the input sentence when generating the translated sentence.
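As a rough illustration of this attentive alignment, the sketch below computes additive (Bahdanau-style [30]) attention weights over a sequence of encoder states for a single decoder step. The dimensions, random parameters, and variable names are our own illustrative assumptions and do not reflect DeepMnemonic's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, attn_dim = 5, 8, 8, 6   # illustrative sizes

H = rng.normal(size=(T, enc_dim))      # encoder states, one per password char
s_prev = rng.normal(size=(dec_dim,))   # decoder state from the previous step

# Learnable parameters of the additive scoring function (random here).
W_enc = rng.normal(size=(enc_dim, attn_dim))
W_dec = rng.normal(size=(dec_dim, attn_dim))
v = rng.normal(size=(attn_dim,))

# Score each encoder state against the decoder state, normalize with a
# softmax, and build the context vector as the weighted sum of states.
scores = np.tanh(H @ W_enc + s_prev @ W_dec) @ v    # shape (T,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                            # attention distribution
context = weights @ H                               # context for this step

print(np.round(weights, 3))  # how strongly each input character is attended
```

At each decoding step the context vector is recomputed, which is what lets the decoder focus on different password characters as it emits successive mnemonic words.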
In this paper, we exploit natural language translation techniques to generate human-readable and semantically meaningful mnemonic sentences from any given password so as to help users memorize strong passwords.

8 CONCLUSION

In this work, we have proposed DeepMnemonic, a deep neural network based approach to the automatic generation of mnemonic sentences for any given textual password. DeepMnemonic builds upon an attentive encoder-decoder language generation framework and works by translating an input sequence of password characters into a natural language sentence of mnemonic words. DeepMnemonic is designed to bridge the gap between strong password generation and the usability of strong passwords. Experimental results show that DeepMnemonic is capable of generating semantically meaningful mnemonic sentences. A user study was conducted to evaluate the usability of DeepMnemonic; it shows that the generated mnemonic sentences are helpful in memorizing strong passwords. Specifically, with the aid of DeepMnemonic, the time used for remembering a password is largely reduced, and the password recall quality is also significantly improved. In the future, we plan to train DeepMnemonic using more comprehensive and diverse training data.

REFERENCES

[1] R. W. Proctor, M.-C. Lien, K.-P. L. Vu, E. E. Schultz, and G. Salvendy, "Improving computer security for authentication of users: Influence of proactive password restrictions," Behavior Research Methods, vol. 34, no. 2, pp. 163–169, 2002.
[2] B. Ur, F. Noma, J. Bees, S. M. Segreti, R. Shay, L. Bauer, N. Christin, and L. F. Cranor, "I added '!' at the end to make it secure: Observing password creation in the lab," in Proc. SOUPS, 2015.
[3] M. Weir, S. Aggarwal, M. Collins, and H. Stern, "Testing metrics for password creation policies by attacking large sets of revealed passwords," in Proceedings of the 17th ACM Conference on Computer and Communications Security. ACM, 2010, pp. 162–175.
[4] J. Ma, W. Yang, M. Luo, and N. Li, "A study of probabilistic password models," in Security and Privacy (SP), 2014 IEEE Symposium on. IEEE, 2014, pp. 689–704.
[5] W. Melicher, B. Ur, S. M. Segreti, S. Komanduri, L. Bauer, N. Christin, and L. F. Cranor, "Fast, lean, and accurate: Modeling password guessability using neural networks," in USENIX Security Symposium, 2016, pp. 175–191.
[6] D. Wang, D. He, H. Cheng, and P. Wang, "fuzzyPSM: A new password strength meter using fuzzy probabilistic context-free grammars," in 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 2016, pp. 595–606.
[7] A. Forget, S. Chiasson, P. C. Van Oorschot, and R. Biddle, "Improving text passwords through persuasion," in Proceedings of the 4th Symposium on Usable Privacy and Security, 2008, pp. 1–12.
[8] A. Forget, S. Chiasson, P. C. Van Oorschot, and R. Biddle, "Persuasion for stronger passwords: Motivation and pilot study," in International Conference on Persuasive Technology. Springer, 2008, pp. 140–150.
[9] S. Houshmand and S. Aggarwal, "Building better passwords using probabilistic techniques," in Proceedings of the 28th Annual Computer Security Applications Conference. ACM, 2012, pp. 109–118.
[10] M. Weir, S. Aggarwal, B. De Medeiros, and B. Glodek, "Password cracking using probabilistic context-free grammars," in Security and Privacy, 2009 30th IEEE Symposium on. IEEE, 2009, pp. 391–405.
[11] R. Veras, C. Collins, and J. Thorpe, "On semantic patterns of passwords and their security impact," in NDSS, 2014.
[12] C. Castelluccia, M. Dürmuth, and D. Perito, "Adaptive password-strength meters from Markov models," in NDSS, 2012.
[13] J. Yan, A. Blackwell, R. Anderson, and A. Grant, "Password memorability and security: Empirical results," IEEE Security & Privacy, vol. 2, no. 5, pp. 25–31, 2004.
[14] J. R. Anderson and C. J. Lebiere, The Atomic Components of Thought. Psychology Press, 2014.
[15] G. A. Miller, "The magical number seven, plus or minus two: Some limits on our capacity for processing information," Psychological Review, vol. 63, no. 2, p. 81, 1956.
[16] WarpConduit Computing, "Password generator," https://www.warpconduit.net/password-generator/, 2017.
[17] R. Accettura, "SafePassword," https://www.safepasswd.com/, 2017.
[18] Bad Neighborhood, "Mnemonic strong password generator," http://www.bad-neighborhood.com/password-generator.htm, 2017.
[19] edoceo, "Mnemonic password generator," http://edoceo.com/utilitas/mnemonic-password-generator, 2017.
[20] W. Yang, N. Li, O. Chowdhury, A. Xiong, and R. W. Proctor, "An empirical study of mnemonic sentence-based password generation strategies," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016, pp. 1216–1229.
[21] C. Kuo, S. Romanosky, and L. F. Cranor, "Human selection of mnemonic phrase-based passwords," in Proceedings of the Second Symposium on Usable Privacy and Security, ser. SOUPS '06. New York, NY, USA: ACM, 2006, pp. 67–78.
[22] J. Kiesel, B. Stein, and S. Lucks, "A large-scale analysis of the mnemonic password advice," in Proceedings of NDSS, 2017.
[23] U. Topkara, M. J. Atallah, and M. Topkara, "Passwords decay, words endure: Secure and re-usable multiple password mnemonics," in Proceedings of the 2007 ACM Symposium on Applied Computing. ACM, 2007, pp. 292–299.
[24] M. Fukumitsu, T. Katoh, B. B. Bista, and T. Takata, "A proposal of an associating image-based password creating method and a development of a password creating support system," in Advanced Information Networking and Applications (AINA), 2010 24th IEEE International Conference on. IEEE, 2010, pp. 438–445.
[25] K. A. Juang, S. Ranganayakulu, and J. S. Greenstein, "Using system-generated mnemonics to improve the usability and security of password authentication," in Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 56, no. 1. SAGE Publications, 2012, pp. 506–510.
[26] J. Doolani, "Improving memorization and long term recall of system assigned passwords," Ph.D. dissertation, 2016.
[27] S. Jeyaraman and U. Topkara, "Have the cake and eat it too: Infusing usability into text-password based authentication systems," in 21st Annual Computer Security Applications Conference. IEEE, 2005.
[28] Princeton University, "WordNet," https://wordnet.princeton.edu/.
[29] G. H. Bower, "Analysis of a mnemonic device: Modern psychology uncovers the powerful components of an ancient system for improving memory," American Scientist, vol. 58, no. 5, pp. 496–510, 1970.

[30] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the 2015 International Conference on Learning Representations (ICLR), 2015.
[31] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[32] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
[33] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[34] P. Koehn, "Pharaoh: A beam search decoder for phrase-based statistical machine translation models," in Conference of the Association for Machine Translation in the Americas. Springer, 2004, pp. 115–124.
[35] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[36] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[37] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 311–318.
[38] H. Ebbinghaus, "Memory: A contribution to experimental psychology," Annals of Neurosciences, vol. 20, no. 4, p. 155, 2013.
[39] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.
[40] N. K. Bakirov and G. J. Szekely, "Student's t-test for Gaussian scale mixtures," Journal of Mathematical Sciences, vol. 139, no. 3, pp. 6497–6505, 2006.
[41] Y. Li, Y. Cheng, Y. Li, and R. H. Deng, "What you see is not what you get: Leakage-resilient password entry schemes for smart glasses," in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. ACM, 2017, pp. 327–333.
[42] NIST, "Digital identity guidelines," https://pages.nist.gov/800-63-3/, 2017.
[43] P. G. Kelley, S. Komanduri, M. L. Mazurek, R. Shay, T. Vidas, L. Bauer, N. Christin, L. F. Cranor, and J. Lopez, "Guess again (and again and again): Measuring password strength by simulating password-cracking algorithms," in Security and Privacy (SP), 2012 IEEE Symposium on. IEEE, 2012, pp. 523–537.
[44] B. Ur, P. G. Kelley, S. Komanduri, J. Lee, M. Maass, M. L. Mazurek, T. Passaro, R. Shay, T. Vidas, L. Bauer et al., "How does your password measure up? The effect of strength meters on password creation," in USENIX Security Symposium, 2012, pp. 65–80.
[45] D. Wang, P. Wang, D. He, and Y. Tian, "Birthday, name and bifacial-security: Understanding passwords of Chinese web users," in 28th USENIX Security Symposium (USENIX Security 19), 2019, pp. 1537–1555.
[46] D. Wang, Q. Gu, X. Huang, and P. Wang, "Understanding human-chosen PINs: Characteristics, distribution and security," in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. ACM, 2017, pp. 372–385.
[47] S. Schechter, C. Herley, and M. Mitzenmacher, "Popularity is everything: A new approach to protecting passwords from statistical-guessing attacks," in Proceedings of the 5th USENIX Conference on Hot Topics in Security. USENIX Association, 2010, pp. 1–8.
[48] R. Shay, P. G. Kelley, S. Komanduri, M. L. Mazurek, B. Ur, T. Vidas, L. Bauer, N. Christin, and L. F. Cranor, "Correct horse battery staple: Exploring the usability of system-assigned passphrases," in Proceedings of the Eighth Symposium on Usable Privacy and Security, ser. SOUPS '12. New York, NY, USA: ACM, 2012, pp. 7:1–7:20.
[49] B. Ye, Y. Guo, L. Zhang, and X. Guo, "An empirical study of mnemonic password creation tips," Computers & Security, vol. 85, pp. 41–50, 2019.
[50] J. Song, D. Wang, Z. Yun, and X. Han, "Alphapwd: A password generation strategy based on mnemonic shape," IEEE Access, vol. 7, pp. 119052–119059, 2019.
[51] M. Ghazvininejad and K. Knight, "How to memorize a random 60-bit string," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), 2015.
[52] H. A. Yajam, Y. K. Ahmadabadi, and M. Akhaee, "PapiaPass: Sentence-based passwords using dependency trees," in 13th International Iranian Society of Cryptology Conference on Information Security and Cryptology (ISCISC). IEEE, 2016, pp. 91–96.
[53] M. J. Atallah, C. J. McDonough, V. Raskin, and S. Nirenburg, "Natural language processing for information assurance and security: An overview and implementations," in Proceedings of the 2000 Workshop on New Security Paradigms. ACM, 2001, pp. 51–65.
[54] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
Yao Cheng is currently a senior researcher at Huawei International in Singapore. She received her Ph.D. degree in Computer Science and Technology from the University of Chinese Academy of Sciences. Her research interests include security and privacy in deep learning systems, blockchain technology applications, Android framework vulnerability analysis, mobile application security analysis, and mobile malware detection.

Chang Xu is currently a Postdoctoral Fellow at Data61, CSIRO, Australia. He received his PhD degree in Computer Science from Nanyang Technological University in March 2017. His current interests include robust and explainable deep neural models for natural language processing.

Zhen Hai received the PhD degree in Computer Science and Engineering from Nanyang Technological University in 2014. He has been with the Institute for Infocomm Research, A*STAR, Singapore, since 2015. His research interests include natural language processing, text mining, sentiment analysis, information security, and machine learning. He has been invited to serve on the program committees of leading conferences including SIGIR, ACL, EMNLP, AACL, AAAI, IJCAI, and CIKM.

Yingjiu Li is currently a Ripple Professor in the Department of Computer and Information Science at the University of Oregon. His research interests include IoT security and privacy, mobile and system security, applied cryptography and cloud security, and data application security and privacy. He has published over 140 technical papers in international conferences and journals, and has served on the program committees of over 80 international conferences and workshops, including top-tier cybersecurity conferences.