Unblind Your Apps: Predicting Natural-Language Labels for Mobile GUI Components by Deep Learning

Jieshan Chen (Australian National University, Australia), Chunyang Chen∗ (Monash University, Australia), Zhenchang Xing† (Australian National University, Australia), Xiwei Xu (Data61, CSIRO, Australia), Liming Zhu†‡ (Australian National University, Australia), Guoqiang Li∗ (Shanghai Jiao Tong University, China), Jinshui Wang∗ (Fujian University of Technology, China)

∗Corresponding author. †Also with Data61, CSIRO. ‡Also with University of New South Wales.

ABSTRACT
According to the World Health Organization (WHO), it is estimated that approximately 1.3 billion people live with some form of vision impairment globally, of whom 36 million are blind. Due to their disability, engaging this minority in society is a challenging problem. The recent rise of smart mobile phones provides a new solution by enabling blind users' convenient access to the information and services needed for understanding the world. Users with vision impairment can adopt the screen reader embedded in the mobile operating systems to read the content of each screen within the app, and use gestures to interact with the phone. However, the prerequisite of using screen readers is that developers have to add natural-language labels to the image-based components when they are developing the app. Unfortunately, more than 77% of apps have issues of missing labels, according to our analysis of 10,408 Android apps. Most of these issues are caused by developers' lack of awareness and knowledge in considering the minority. And even if developers want to add labels to UI components, they may not come up with concise and clear descriptions, as most of them have no visual issues themselves. To overcome these challenges, we develop a deep-learning based model, called LabelDroid, to automatically predict the labels of image-based buttons by learning from large-scale commercial apps in Google Play. The experimental results show that our model can make accurate predictions and the generated labels are of higher quality than those from real Android developers.

CCS CONCEPTS
• Human-centered computing → Accessibility systems and tools; Empirical studies in accessibility; • Software and its engineering → Software usability.

KEYWORDS
Accessibility, neural networks, user interface, image-based buttons, content description

ACM Reference Format:
Jieshan Chen, Chunyang Chen, Zhenchang Xing, Xiwei Xu, Liming Zhu, Guoqiang Li, and Jinshui Wang. 2020. Unblind Your Apps: Predicting Natural-Language Labels for Mobile GUI Components by Deep Learning. In 42nd International Conference on Software Engineering (ICSE '20), May 23–29, 2020, Seoul, Republic of Korea. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3377811.3380327

1 INTRODUCTION
Given millions of mobile apps in Google Play [11] and App Store [8], smart phones are playing increasingly important roles in daily life. They are conveniently used to access a wide variety of services such as reading, shopping, chatting, etc. Unfortunately, many apps remain difficult or impossible to access for people with disabilities. For example, a well-designed user interface (UI) in Figure 1 often has elements that don't require an explicit label to indicate their purpose to the user.
A checkbox next to an item in a task list application has a fairly obvious purpose for normal users, as does a trash can in a file manager application. However, to users with vision impairment, especially for the blind, other UI cues are needed. According to the World Health Organization (WHO) [4], it is estimated that approximately 1.3 billion people live with some form of vision impairment globally, of whom 36 million are blind. Compared with the normal users, they may be more eager to use the mobile apps to enrich their lives, as they need those apps to represent their eyes.

Figure 1: Example of UI components and labels.
Figure 2: Source code for setting up labels for the "add playlist" button (which is indeed a clickable ImageView).

Ensuring full access to the wealth of information and services provided by mobile apps is a matter of social justice [54].

Fortunately, the mobile platforms have begun to support app accessibility by screen readers (e.g., TalkBack in Android [12] and VoiceOver in IOS [20]) for users with vision impairment to interact with apps. Once developers add labels to UI elements in their apps, the UI can be read out loud to the user by those screen readers. For example, when browsing the screen in Figure 1 with a screen reader, users will hear the clickable options such as "navigate up", "play", "add to queue", etc. for interaction. The screen readers also allow users to explore the view using gestures, while also audibly describing what's on the screen. This is useful for people with vision impairments who cannot see the screen well enough to understand what is there, or select what they need to.

Despite the usefulness of screen readers for accessibility, there is a prerequisite for them functioning well, i.e., the existence of labels for the UI components within the apps. In detail, the Android Accessibility Developer Checklist guides developers to "provide content descriptions for UI components that do not have visible text" [6]. Without such content descriptions¹, the Android TalkBack screen reader cannot provide meaningful feedback to a user to interact with the app.

Although individual apps can improve their accessibility in many ways, the most fundamental principle is adhering to platform accessibility guidelines [6]. However, according to our empirical study in Section 3, more than 77% apps out of 10,408 apps miss such labels for the image-based buttons, resulting in the blind's inaccessibility to the apps. Considering that most app designers and developers are of no vision issues, they may lack awareness or knowledge of those guidelines targeting blind users. To assist developers with spotting those accessibility issues, many practical tools are developed such as Android Lint [1], Accessibility Scanner [5], and other accessibility testing frameworks [10, 18]. However, none of these tools can help fix the label-missing issues. Even if developers or designers can locate these issues, they may still not be aware how to add concise, easy-to-understand descriptions to the GUI components for users with vision impairment. For example, many developers may add the label "add" to the button in Figure 2 rather than "add playlist" which is more informative about the action.

To overcome those challenges, we develop a deep learning based model to automatically predict the content description. Note that we only target the image-based buttons in this work as these buttons are important proxies for users to interact with apps, and cannot be read directly by the screen reader without labels. Given the UI image-based components, our model can understand their semantics based on the collected big data, and return the possible label to components missing content descriptions. We believe that it can not only assist developers in efficiently filling in the content description of UI components when developing the app, but also enable users with vision impairment to access mobile apps. Inspired by image captioning, we adopt the CNN and transformer encoder-decoder for predicting the labels based on the large-scale dataset. The experiments show that our LabelDroid can achieve 65% exact match and 0.697 ROUGE-L score which outperforms both state-of-the-art baselines. We also demonstrate that the predictions from our model are of higher quality than those from junior Android developers. The experimental results and feedback from these developers confirm the effectiveness of our LabelDroid.

Our contributions can be summarized as follows:
• To our best knowledge, this is the first work to automatically predict the label of UI components for supporting app accessibility. We hope this work can invoke the community's attention in maintaining the accessibility of mobile apps.
• We carry out a motivational empirical study for investigating how well the current apps support the accessibility for users with vision impairment.
• We construct a large-scale dataset of high-quality content descriptions for existing UI components. We release the dataset² for enabling other researchers' future research.

¹ "labels" and "content description" refer to the same meaning and we use them interchangeably in this paper.
² https://github.com/chenjshnn/LabelDroid

2 ANDROID ACCESSIBILITY BACKGROUND
2.1 Content Description of UI Component

Table 1: Statistics of label missing situation

Element         | #Miss/#Apps           | #Miss/#Screens           | #Miss/#Elements
ImageButton     | 4,843/7,814 (61.98%)  | 98,427/219,302 (44.88%)  | 241,236/423,172 (57.01%)
Clickable Image | 5,497/7,421 (74.07%)  | 92,491/139,831 (66.14%)  | 305,012/397,790 (76.68%)
Total           | 8,054/10,408 (77.38%) | 169,149/278,234 (60.79%) | 546,248/820,962 (66.54%)

Figure 3: Examples of image-based buttons. ○1 clickable ImageView; ○2 ○3 ImageButton.

Android UI is composed of many different components to achieve the interaction between the app and the users. For each component of Android UI, there is an attribute called android:contentDescription in which a natural-language string can be added by the developers for illustrating the functionality of this component. This need is parallel to the need for alt-text for images on the web. To add the label "add playlist" to the button in Figure 2, developers usually add a reference to the android:contentDescription in the layout xml file, and that reference refers to the resource file string.xml which saves all text used in the application. Note that the content description will not be displayed on the screen, but can only be read by the screen reader.

2.2 Screen Reader
According to the screen reader user survey by WebAIM in 2017 [2], 90.9% of respondents who were blind or visually impaired used screen readers on a smart phone. As the two largest organisations that facilitate mobile technology and the app market, Google and Apple provide the screen readers (TalkBack in Android [12] and VoiceOver in IOS [20]) for users with vision impairment to access the mobile apps. As VoiceOver is similar to TalkBack and this work studies Android apps, we only introduce TalkBack. TalkBack is an accessibility service for Android phones and tablets which helps blind and vision-impaired users interact with their devices more easily. It is a system application and comes pre-installed on most Android devices. By using TalkBack, a user can use gestures, such as swiping left to right, on the screen to traverse the components shown on the screen. When one component is in focus, there is an audible description given by reading text or the content description. When you move your finger around the screen, TalkBack reacts, reading out blocks of text or telling you when you've found a button. Apart from reading the screen, TalkBack also allows you to interact with the mobile apps with different gestures. For example, users can quickly return to the home page by drawing lines from bottom to top and then from right to left using fingers without the need to locate the home button. TalkBack also provides local and global context menus, which respectively enable users to define the type of the next focused items (e.g., characters, headings or links) and to change global settings.

2.3 Android Classes of Image-Based Buttons
There are many different types of UI components [3] when developers are developing the UI such as TextView, ImageView, EditText, Button, etc. According to our observation, the image-based nature of these classes of buttons makes it necessary to add natural-language annotations to them for two reasons. First, these components can be clicked, i.e., they are more important proxies for interaction with users than static components like TextView. Second, the screen readers cannot directly read the image into natural language. In order to properly label an image-based button such that it interacts properly with screen readers, alternative text descriptions must be added in the button's content description field. Figure 3 shows two kinds of image-based buttons.

2.3.1 Clickable Images. Images can be rendered in an app using the Android API class android.widget.ImageView [14]. If the clickable property is set to true, the image functions as a button (○1 in Figure 3). We call such components Clickable Images. Different from normal images of which the clickable property is set to false, all Clickable Images are non-decorative, and a null string label can be regarded as a missing label accessibility barrier.

2.3.2 Image Button. Image Buttons are implemented by the Android API class android.widget.ImageButton [13]. This is a sub-class of the Clickable Image's ImageView class. As its name suggests, Image Buttons are buttons that visually present an image rather than text (○2 ○3 in Figure 3).

3 MOTIVATIONAL MINING STUDY
While the main focus and contribution of this work is developing a model for predicting the content description of image-based buttons, we still carry out an empirical study to understand whether the development team adds labels to the image-based buttons during their app development. The current status of image-based button labeling is also the motivation of this study. But note that this motivational study just aims to provide an initial analysis towards developers supporting users with vision impairment, and a more comprehensive empirical study would be needed to deeply understand it.

3.1 Data Collection
To investigate how well the apps support the users with vision impairment, we randomly crawl 19,127 apps from Google Play [11], belonging to 25 categories with the installation number ranging from 1K to 100M. We adopt the app explorer [32] to automatically explore different screens in the app by various actions (e.g., click, edit, scroll). During the exploration, we take the screenshot of the app GUI, and also dump the run-time front-end code which identifies each element's type (e.g., TextView, Button), coordinates in the screenshots, content description, and other metadata. Note that our explorer can only successfully collect GUIs in 15,087 apps. After removing all duplicates by checking the screenshots, we finally collect 394,489 GUI screenshots from 15,087 apps. Within the collected data, 278,234 (70.53%) screenshots from 10,408 apps contain image-based buttons including clickable images and image buttons, which forms the dataset we analyse in this study.
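The dumped runtime front-end code makes such missing labels directly measurable. The snippet below is a minimal illustrative sketch, not the authors' actual analysis pipeline, of how image-based buttons without content descriptions could be flagged in one dumped UI hierarchy; it assumes a UIAutomator-style XML dump whose nodes carry class, content-desc, clickable and bounds attributes, and the file name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Classes treated as image-based buttons in this paper:
# ImageButton, and ImageView with clickable="true" (Clickable Image).
IMAGE_BUTTON = "android.widget.ImageButton"
IMAGE_VIEW = "android.widget.ImageView"

def missing_label_buttons(xml_path):
    """Return (class, bounds) of image-based buttons whose content description is empty."""
    tree = ET.parse(xml_path)
    missing = []
    for node in tree.iter("node"):
        cls = node.get("class", "")
        clickable = node.get("clickable") == "true"
        label = (node.get("content-desc") or "").strip()
        is_image_button = cls == IMAGE_BUTTON
        is_clickable_image = cls == IMAGE_VIEW and clickable
        if (is_image_button or is_clickable_image) and not label:
            missing.append((cls, node.get("bounds")))
    return missing

if __name__ == "__main__":
    # "screen_dump.xml" is a placeholder for one dumped GUI hierarchy
    for cls, bounds in missing_label_buttons("screen_dump.xml"):
        print(f"missing label: {cls} at {bounds}")
```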

Figure 4: The distribution of the category of applications with different rate of image-based buttons missing content description.
Figure 5: Box-plot for missing rate distribution of all apps with different installation numbers.

3.2 Current Status of Image-Based Button Labeling in Android Apps
Table 1 shows that 4,843 out of 7,814 apps (61.98%) have image buttons without labels, and 5,497 out of 7,421 apps (74.07%) have clickable images without labels. Among all 278,234 screens, 169,149 (60.79%) of them have at least one element without explicit labels, including 57.01% of image buttons and 76.68% of clickable images within these apps. It means that more than half of image-based buttons have no labels. These statistics confirm the severity of the button labeling issues which may significantly hinder the app accessibility to users with vision impairment.

We then further analyze the button labeling issues for different categories of mobile apps. As seen in Figure 4, the button labeling issues exist widely across different app categories, but some categories have a higher percentage of label-missing buttons. For example, 72% of apps in Personalization, 71.6% of apps in Game, and 71.8% of apps in Photography have more than 80% of image-based buttons without labels. Personalization and Photography apps are mostly about updating the screen background, the alignment of app icons, and viewing pictures, which are mainly for visual comfort. The Game apps always need both screen viewing and instantaneous interactions which are rather challenging for visually impaired users. That is why developers rarely consider adding labels to image-based buttons within these apps, although these minority users deserve the right to the entertainment. In contrast, about 40% of apps in the Finance, Business, Transportation, Productivity categories have relatively more complete labels for image-based buttons, with less than 20% of labels missing. The reason accounting for that phenomenon may be that the extensive usage of these apps among blind users invokes developers' attention, though further improvement is needed.

To explore if the popular apps have better performance in adding labels to image-based buttons, we draw a box plot of label missing rate for apps with different installation numbers in Figure 5. There are 11 different ranges of installation number according to the statistics in Google Play. However, out of our expectation, many buttons in popular apps still lack labels. Such issues for popular apps may pose more negative influence as those apps have a larger group of audience. We also conduct a Spearman rank-order correlation test [67] between the app installation number and the label-missing rate. The correlation coefficient is 0.046, showing a very weak relationship between these two factors. This result further proves that the accessibility issue is a common problem regardless of the popularity of applications. Therefore, it is worth developing a new method to solve this problem.

Summary: By analyzing 10,408 existing apps crawled from Google Play, we find that more than 77% of them have at least one image-based button missing labels. Such phenomenon is further exacerbated for app categories highly related to pictures such as personalization, game, photography. However, out of our expectation, the popular apps do not behave better in accessibility than the unpopular ones. These findings confirm the severity of label missing issues, and motivate our model development for automatically predicting the labels for image-based buttons.

4 APPROACH
Rendering a UI widget into its corresponding content description is the typical task of image captioning. The general process of our approach is to firstly extract image features using CNN [55], and encode these extracted informative features into a tensor using an encoder module. Based on the encoded information, the decoder module generates outputs (which is a sequence of words) conditioned on this tensor and previous outputs. Different from the traditional image captioning methods based on CNN and RNN models or neural translation based on RNN encoder-decoder [33, 34, 44], we adopt the Transformer model [68] in this work. The overview of our approach can be seen in Figure 6.

4.1 Visual Feature Extraction
To extract the visual features from the input button image, we adopt the convolutional neural network [46] which is widely used in the software engineering domain [30, 35, 79]. CNN-based models can automatically learn latent features from a large image database, which has outperformed hand-crafted features in many computer vision tasks [52, 65]. CNN mainly contains two kinds of layers, i.e., convolutional layers and pooling layers.

Convolutional Layer. A convolutional layer performs a linear transformation of the input vector to extract and compress the salient information of the input. Normally, an image is comprised of an H × W × C matrix where H, W, C represent the height, width, and channel of this image respectively.

Figure 6: Overview of our approach

Each value in this matrix is in the range of 0 to 255. A convolution operation uses several small trainable matrices (called kernels) to slide over the image along the width and height in a specified stride. Each movement computes a value of the output at the corresponding position by calculating the sum of the element-wise product of the current kernel and the current sub-matrix of the image. One kernel performs one kind of feature extraction, and computes one layer of the final feature map. For example, for an image I ∈ R^{H×W×C}, if we feed it into a convolutional layer with stride one and k kernels, the output will have the dimension of H' × W' × k, where H' and W' are the numbers of movements along the height and width.

Pooling Layer. A pooling layer is used to down-sample the current input size to mitigate the computational complexity required to process the input, by extracting dominant features of the input which are invariant to position and rotation. It uses a fixed-length window to slide over the image with a fixed stride and summarises the current scanned sub-region into one value. Normally, the stride is the same as the window's size so that the model can filter meaningless features while maintaining salient ones. There are many kinds of strategies to summarise a sub-region. For example, max-pooling takes the maximum value in the sub-region to summarise the current patch, while average pooling takes the mean of all values in the current patch as the output value.

4.2 Visual Semantic Encoding
To further encode the visual features extracted from the CNN, we first embed them using a fully-connected layer and then adopt the encoder based on the Transformer model which was first introduced for the machine translation task. We select the Transformer model as our encoder and decoder due to two reasons. First, it overcomes the challenge of long-term dependencies [48], since it concentrates on all relationships between any two input vectors. Second, it supports parallel learning because it is not sequential learning and all latent vectors can be computed at the same time, resulting in shorter training time than the RNN model. Within the encoder, there are two kinds of sublayers: the multi-head self-attention layer and the position-wise feed-forward layer.

Multi-head Self-attention Layer. Given a sequence of input vectors X = [x_1, x_2, ..., x_n]^T (X ∈ R^{n×d_embed}), a self-attention layer first computes a set of query (Q), key (K), value (V) vectors (Q/K ∈ R^{d_k×n}, V ∈ R^{d_v×n}) and then calculates the scores for each position vector with vectors at all positions. Note that image-based buttons are artificial images rendered by the compiler in a specified order, i.e., from left to right, from top to bottom. Therefore, we also consider the sequential spatial information to capture the dependency between the top-left and bottom-right features extracted by the CNN model. For position i, the scores are computed by taking the dot product of the query vector q_i with all key vectors k_j (j ∈ 1, 2, ..., n).

In order to get a more stable gradient, the scores are then divided by \sqrt{d_k}. After that, we apply a softmax operation to normalize all scores so that they add up to 1. The final output of the self-attention layer is to multiply each value vector by its corresponding softmax score and then sum them up. The matrix formula of this procedure is:

Self\_Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) V    (1)

where Q = W_q^e X^T, K = W_k^e X^T, V = W_v^e X^T, and W_q^e ∈ R^{d_k×d_embed}, W_k^e ∈ R^{d_k×d_embed}, W_v^e ∈ R^{d_v×d_embed} are the trainable weight matrices. The softmax(QK^T / \sqrt{d_k}) term can be regarded as how each feature (Q) of the feature sequence from the CNN model is influenced by all other features in the sequence (K), and the result is the weight applied to all features V.

The multi-head self-attention layer uses multiple sets of query, key, value vectors and computes multiple outputs. It allows the model to learn different representation sub-spaces. We then concatenate all the outputs and multiply them with a matrix W_o^e ∈ R^{d_model×h·d_v} (where h is the number of heads) to summarise the information of the multi-head attention.

Feed Forward Layer. Then a position-wise feed-forward layer is applied to each position. The feed-forward layers have the same structure of two fully connected layers, but each layer is trained separately. The feed-forward layer is represented as:

Feed\_forward(Z) = W_2^e × (W_1^e × Z + b_1^e) + b_2^e    (2)

where W_1^e ∈ R^{d_ff×d_model}, W_2^e ∈ R^{d_model×d_ff}, b_1^e ∈ R^{d_ff}, b_2^e ∈ R^{d_model} are the trainable parameters of the two fully connected layers.
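To make Eq. (1) and the multi-head mechanism concrete, the following PyTorch sketch implements scaled dot-product self-attention over the CNN feature sequence. It is a simplified stand-in rather than the authors' released implementation; the dimension names (d_embed, d_k, d_v, d_model, h) mirror the symbols above, and the default values follow Section 5.2.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Simplified multi-head self-attention: Eq. (1) applied h times and concatenated."""
    def __init__(self, d_embed=512, d_model=64, h=8, d_k=64, d_v=64):
        super().__init__()
        self.h, self.d_k = h, d_k
        self.w_q = nn.Linear(d_embed, h * d_k, bias=False)  # W_q^e for all heads
        self.w_k = nn.Linear(d_embed, h * d_k, bias=False)  # W_k^e
        self.w_v = nn.Linear(d_embed, h * d_v, bias=False)  # W_v^e
        self.w_o = nn.Linear(h * d_v, d_model, bias=False)  # W_o^e, summarises the heads

    def forward(self, x):                # x: (batch, n, d_embed) CNN feature sequence
        b, n, _ = x.shape
        q = self.w_q(x).view(b, n, self.h, -1).transpose(1, 2)  # (b, h, n, d_k)
        k = self.w_k(x).view(b, n, self.h, -1).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.h, -1).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)  # QK^T / sqrt(d_k)
        attn = F.softmax(scores, dim=-1)                         # weights over all positions
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)       # concatenate the heads
        return self.w_o(out)
```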

Besides, for each sub-layer, the Transformer model applies a residual connection [46] and layer normalization [23]. The equation is LayerNorm(x + Sublayer(x)), where x and Sublayer(x) are the input and output of the current sub-layer. For the input embedding, the Transformer model also applies position encoding to encode the relative position of the sequence.

4.3 Content Description Generation
As mentioned in Section 4.2, the encoder module is comprised of N stacks of layers and each layer consists of a multi-head self-attention sub-layer and a feed-forward layer. Similar to the encoder, the decoder module is also of M-stack layers but with two main differences. First, an additional cross-attention layer is inserted between the two sub-layers. It takes the output of the top layer of the encoder module to compute query and key vectors so that it can help capture the relationship between inputs and targets. Second, it is masked to the right in order to prevent attending to future positions.

Cross-attention Layer. Given the current time t, the max length of content description L, previous outputs S^d (∈ R^{d_model×L}) from the self-attention sub-layer, and output Z^e (∈ R^{d_model×n}) from the encoder module, the cross-attention sub-layer can be formulated as:

Cross\_Attention(Q^e, K^e, V^d) = softmax(Q^e (K^e)^T / \sqrt{d_k}) V^d    (3)

where Q^e = W_q^d Z^e, K^e = W_k^d Z^e, V^d = W_v^d S^d, and W_q^d ∈ R^{d_k×d_model}, W_k^d ∈ R^{d_k×d_model}, W_v^d ∈ R^{d_v×d_model}. Note that we mask S_k^d = 0 (for k ≥ t) since we currently do not know future values.

Final Projection. After that, we apply a linear softmax operation to the output of the top layer of the decoder module to predict the next word. Given the output D ∈ R^{d_model×L} from the decoder module, we have Y' = Softmax(D · W_o^d + b_o^d) and take the t-th output of Y' as the next predicted word. During training, all words can be computed at the same time by masking S_k^d = 0 (for k ≥ t) with different t. Note that while training, we compute the prediction based on the ground-truth labels, i.e., for time t, the predicted word y'_t is based on the ground-truth sub-sequence [y_0, y_1, ..., y_{t-1}]. In comparison, in the period of inference (validation/test), we compute words one by one based on previously predicted words, i.e., for time t, the predicted word y'_t is based on the predicted sub-sequence [y'_0, y'_1, ..., y'_{t-1}].

Loss Function. We adopt the Kullback-Leibler (KL) divergence loss [53] (also called relative entropy loss) to train our model. It is a natural method to measure the difference between the generated probability distribution q and the reference probability distribution p. Note that there is no difference between the cross entropy loss and the KL divergence since D_kl(p‖q) = H(p, q) − H(p), where H(p) is constant.

Figure 7: Example of our dataset

5 IMPLEMENTATION
To implement our model, we further process the data analysed in Section 3 by filtering out the noisy data to construct large-scale pairs of image-based buttons and corresponding content descriptions for training and testing the model. Then, we introduce the detailed parameters and training process of our model.

5.1 Data Preprocessing
For each screenshot collected by our tool, we also dump the runtime XML which includes the position and type of all GUI components within the screenshot. To obtain all image-based buttons, we crop the GUI screenshot by parsing the coordinates in the XML file. However, the same GUI may be visited many times, and different GUIs within one app may share the same components. For example, a menu icon may appear at the top of all GUI screenshots within the app. To remove duplicates, we first remove all repeated GUI screenshots by checking if their corresponding XML files are the same. After that, we further remove duplicate image-based buttons if they are exactly the same by pixel value. But duplicate buttons may not be 100% the same in pixels, so we further remove duplicate buttons if their coordinate bounds and labels are the same, because some buttons would appear in a fixed position but with a different background within the same app. For example, for the "back" button at the top-left position in Figure 1, once the user navigates to another news page, the background of this button will change while the functionality remains.

Apart from the duplicate image-based buttons, we also remove low-quality labels to ensure the quality of our training model. We manually observe 500 randomly selected image-based buttons from Section 3, and summarise three types of meaningless labels. First, the labels of some image-based buttons contain the class of elements such as "image button", "button with image", etc. Second, the labels contain the app's name. For example, the label of all buttons in the app Ringtone Maker is "ringtone maker". Third, some labels may be unfinished placeholders such as "test", "content description", "untitled", "none". We write rules to filter out all of them, and the full list of meaningless labels can be found on our website.

After removing the non-informative labels, we translate all non-English labels of image-based buttons to English by adopting the Google Translate API. For each label, we add <start>, <end>

Table 2: Details of our accessibility dataset.

           | #App  | #Screenshot | #Element
Train      | 6,175 | 10,566      | 15,595
Validation | 714   | 1,204       | 1,759
Test       | 705   | 1,375       | 1,879
Total      | 7,594 | 13,145      | 19,233

tokens to the start and the end of the sequence. We also replace the low-frequency words (appearing less than five times) with an <unk> token. To enable mini-batch training, we need to add a <pad> token to pad the word sequences of labels into a fixed length. Note that the maximum number of words for one label is 15 in this work.

After the data cleaning, we finally collect in total 19,233 pairs of image-based buttons and content descriptions from 7,594 apps. Note that the app number is smaller than that in Section 3 as the apps with no or uninformative labels are removed. We split the cleaned dataset into training, validation³ and testing sets. For each app category, we randomly select 80% of apps for training, 10% for validation and the rest 10% for testing. Table 2 shows that there are 15,595 image-based buttons from 6,175 apps as the training set, 1,759 buttons from 714 apps as the validation set and 1,879 buttons from 705 apps as the testing set. The dataset can also be downloaded from our site.

³ for tuning the hyperparameters and preventing overfitting

5.2 Model Implementation
We use the ResNet-101 architecture [46] pretrained on the MS COCO dataset [57] as our CNN module. As seen in the leftmost part of Figure 6, it consists of a convolution layer, a max pooling layer, and four types of blocks with different numbers of blocks (denoted in different colors). Each type of block is comprised of three convolutional layers with different settings and implements an identity shortcut connection which is the core idea of ResNet. Instead of approximating the target output of the current block, it approximates the residual between the current input and the target output, and then the target output can be computed by adding the predicted residual and the original input vector. This technique not only simplifies the training task, but also reduces the number of filters. In our model, we remove the last global average pooling layer of ResNet-101 to compute a sequence of inputs for the consequent encoder-decoder model.

For the transformer encoder-decoder, we take N = 3, d_embed = 512, d_k = d_v = d_model = 64, d_ff = 2048, h = 8 and the vocabulary size as 633. We train the CNN and the encoder-decoder model in an end-to-end manner using the KL divergence loss [53]. We use the Adam optimizer [51] with β1 = 0.9, β2 = 0.98 and ε = 10^{-9} and change the learning rate according to the formula learning_rate = d_model^{-0.5} × min(step_num^{-0.5}, step_num × warmup_steps^{-1.5}) to train the model, where step_num is the current iteration number of the training batch and the first warmup_steps training steps are used to accelerate the training process by increasing the learning rate at the early stage of training. Our implementation uses PyTorch [17] on a machine with an Intel i7-7800X CPU, 64G RAM and an NVIDIA GeForce GTX 1080 Ti GPU.

6 EVALUATION
We evaluate our LabelDroid in three aspects, i.e., accuracy with automated testing, and generality and usefulness with a user study.

6.1 Evaluation Metric
To evaluate the performance of our model, we adopt five widely-used evaluation metrics including exact match, BLEU [62], METEOR [26], ROUGE [56], and CIDEr [69], inspired by related works about image captioning. The first metric we use is the exact match rate, i.e., the percentage of testing pairs whose predicted content description exactly matches the ground truth. Exact match is a binary metric, i.e., 0 if there is any difference, otherwise 1. It cannot tell the extent to which a generated content description differs from the ground truth. For example, the ground truth content description may contain 4 words, but no matter whether there are one or 4 differences between the prediction and the ground truth, exact match will regard them as 0. Therefore, we also adopt other metrics. BLEU is an automatic evaluation metric widely used in machine translation studies. It calculates the similarity of machine-generated translations and human-created reference translations (i.e., ground truth). BLEU is defined as the product of n-gram precision and a brevity penalty. As most content descriptions for image-based buttons are short, we measure the BLEU value by setting n as 1, 2, 3, 4, represented as BLEU@1, BLEU@2, BLEU@3, BLEU@4.

METEOR [26] (Metric for Evaluation of Translation with Explicit ORdering) is another metric used for machine translation evaluation. It is proposed to fix some disadvantages of BLEU which ignores the existence of synonyms and the recall ratio. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [56] is a set of metrics based on recall rate, and we use ROUGE-L, which calculates the similarity between the predicted sentence and the reference based on the longest common subsequence (short for LCS). CIDEr (Consensus-Based Image Description Evaluation) [69] uses term frequency inverse document frequency (tf-idf) [64] to calculate the weights in a reference sentence s_ij for different n-grams w_k, because it is intuitive to believe that a rare n-gram contains more information than a common one. We use CIDEr-D, which additionally implements a length-based Gaussian penalty and is more robust to gaming. We then divide CIDEr-D by 10 to normalize the score into the range between 0 and 1. We still refer to CIDEr-D/10 as CIDEr for brevity.

All of these metrics give a real value in the range [0,1] and are usually expressed as a percentage. The higher the metric score, the more similar the machine-generated content description is to the ground truth. If the predicted results exactly match the ground truth, the score of these metrics is 1 (100%). We compute these metrics using the coco-caption code [39].

6.2 Baselines
We set up two state-of-the-art methods which are widely used for image captioning as the baselines to compare with our content description generation method. The first baseline adopts a CNN model to encode the visual features as the encoder and a LSTM (long-short term memory unit) as the decoder for generating the content description [48, 70]. The second baseline also adopts the encoder-decoder framework. Although it adopts the CNN model as the encoder, it uses another CNN model for generating the output [22] as the decoder. The output projection layer in the last CNN decoder performs a linear transformation and softmax, mapping the output vector to the dimension of the vocabulary size and getting word probabilities. Both methods take the same CNN encoder as ours, and also the same datasets for training, validation and testing. We denote the two baselines as CNN+LSTM and CNN+CNN for brevity.
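Before turning to the results, the sketch below illustrates how the exact match rate and BLEU@n can be computed for a list of predicted labels. It uses NLTK's sentence-level BLEU as a convenient stand-in for the coco-caption code [39] used in the paper, so absolute scores would differ slightly; the example labels are hypothetical.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def exact_match_rate(predictions, references):
    """Fraction of predicted labels identical to the ground truth."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

def bleu_at_n(predictions, references, n=4):
    """Average sentence-level BLEU@n over all test pairs (uniform n-gram weights)."""
    weights = tuple([1.0 / n] * n)
    smooth = SmoothingFunction().method1  # short labels need smoothing
    scores = [
        sentence_bleu([r.split()], p.split(), weights=weights, smoothing_function=smooth)
        for p, r in zip(predictions, references)
    ]
    return sum(scores) / len(scores)

# Hypothetical predictions and ground-truth labels
preds = ["previous track", "open navigation drawer", "add"]
refs = ["previous track", "menu", "add playlist"]
print(exact_match_rate(preds, refs), bleu_at_n(preds, refs, n=1))
```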

Table 3: Results of accuracy evaluation

Method     | Exact match | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr
CNN+LSTM   | 62.5%       | 0.669  | 0.648  | 0.545  | 0.461  | 0.420  | 0.667   | 0.317
CNN+CNN    | 62.5%       | 0.667  | 0.645  | 0.555  | 0.503  | 0.421  | 0.669   | 0.318
LabelDroid | 65.0%       | 0.689  | 0.663  | 0.566  | 0.513  | 0.444  | 0.697   | 0.333

6.3 Accuracy Evaluation
We use the randomly selected 10% of apps including 1,879 image-based buttons as the test data for accuracy evaluation. None of the test data appears in the model training.

Figure 8: Performance distribution of different app category

6.3.1 Overall Performance. Table 3 shows the overall performance of all methods. The performance of the two baselines is very similar and they both achieve a 62.5% exact match rate, but the CNN+CNN model is slightly higher in other metrics. In contrast with the baselines, the generated labels from our LabelDroid for 65% of image-based buttons exactly match the ground truth. And the average BLEU@1, BLEU@2, BLEU@3, BLEU@4, METEOR, ROUGE-L and CIDEr of our method are 0.689, 0.663, 0.566, 0.513, 0.444, 0.697, 0.333. Compared with the two state-of-the-art baselines, our LabelDroid outperforms in all metrics and gains about a 2% to 11.3% increase. We conduct the Mann–Whitney U test [59] between these three models among all testing metrics. Since we have three inferential statistical tests, we apply the Benjamini & Hochberg (BH) method [27] to correct p-values. Results show the improvement of our model is significant in all comparisons (p-value < 0.01)⁴.

⁴ The detailed p-values are listed in https://github.com/chenjshnn/LabelDroid

To show the generalization of our model, we also calculate the performance of our model in different app categories as seen in Figure 8. We find that our model is not sensitive to the app category, i.e., steady performance across different app categories. In addition, Figure 8 also demonstrates the generalization of our model. Even if there are very few apps in some categories (e.g., medical, personalization, libraries and demo) for training, the performance of our model is not significantly degraded in these categories.

6.3.2 Qualitative Performance with Baselines. To further explore why our model behaves better than the other baselines, we analyze the image-based buttons which are correctly predicted by our model, but wrongly predicted by the baselines. Some representative examples can be seen in Table 4 as the qualitative observation of all methods' performance. In general, our method shows the capacity to generate different lengths of labels, while CNN+LSTM prefers medium-length labels and CNN+CNN tends to generate short labels.

Our model captures the fine-grained information inside the given image-based button. For example, the CNN+CNN model wrongly predicts "next" for the "back" button (E1), as the difference between "back" and "next" buttons is the arrow direction. It also predicts "next" for the "cycle repeat mode" button (E5), as there is one right-direction arrow. Similar reasons also lead to mistakes in E2.

Our model is good at generating long-sequence labels due to the self-attention and cross-attention mechanisms. Such mechanisms can find the relationship between the patch of the input image and the token of the output label. For example, our model can predict the correct labels such as "exchange origin and destination points" (E3) and "open in google maps" (E6). Although CNN+LSTM can generate correct labels for E4 and E5, it does not work well for E3 and E6.

In addition, our model is more robust to noise than the baselines. For example, although there is a "noisy" background in E7, our model can still successfully recognize the "watch" label for it. In contrast, the CNN+LSTM and CNN+CNN are distracted by the colorful background information with "" as the output.

6.3.3 Common Causes for Generation Errors. We randomly sample 5% of the wrongly generated labels for the image-based buttons. We manually study the differences between these generated labels and their ground truth. Our qualitative analysis identifies three common causes of the generation errors.

(1) Our model makes mistakes due to the characteristics of the input. Some image-based buttons are very special and totally different from the training data. For example, the E1 in Table 5 is rather different from the normal social-media sharing button. It aggregates the icons of WhatsApp, Twitter, Facebook and message with some rotation and overlap. Some buttons are visually similar to others but with totally different labels. The "route" button (E2) in Table 5 includes a right arrow which also frequently appears in "next" buttons. (2) Our prediction is an alternative to the ground truth for certain image-based buttons. For example, the label of E3 is "menu", but our model predicts "open navigation drawer". Although the prediction is totally different from the ground truth in terms of the words, they convey the same meaning which can

Table 4: Examples of wrong predictions in baselines

ID         | E1    | E2             | E3                                      | E4                 | E5                | E6                  | E7
Button     | (button images)
CNN+LSTM   | start | play           | <unk>                                   | cycle shuffle mode | cycle repeat mode |                     | color swatch
CNN+CNN    | next  | previous track | call                                    | cycle repeat mode  | next              | open drawer         | trip check
LabelDroid | back  | previous track | exchange origin and destination points | cycle shuffle mode | cycle repeat mode | open in google maps | watch

Table 5: Common causes for generation failure.

ID           | E1           | E2          | E3                     | E4
Cause        | Special case | Model error | Alternative            | Wrong ground truth
Button       | (button images)
LabelDroid   | <unk>        | next        | open navigation drawer | download
Ground truth | share note   | route       | menu                   | story content image

Figure 9: Distribution of app acceptability scores by human annotators (A1, A2, A3) and the model (M).

be understood by the blind users. (3) A small amount of ground truth is not the right ground truth. For example, some developers annotate E4 as "story content image" although it is a "download" button. Although our prediction is different from the ground truth, we believe that our generated label is more suitable for it. This observation also indicates the potential of our model in identifying wrong/uninformative content descriptions for image-based buttons. We manually allocate the 5% (98) failure cases of our model into these three reasons. 54 cases account for model errors, especially for special cases, 41 cases are alternative labels to the ground truth, and the last three cases are right but with a wrong ground truth. It shows that the accuracy of our model is highly underestimated.

6.4 Generalization and Usefulness Evaluation
To further confirm the generalization and usefulness of our model, we randomly select 12 apps in which there are missing labels of image-based buttons. Therefore, all data of these apps do not appear in our training/testing data. We would like to see the labeling quality from both our LabelDroid and human developers.

6.4.1 Procedures. To ensure the representativeness of the test data, the 12 apps that we select have at least 1M installations (popular apps often influence more users), with at least 15 screenshots. These 12 apps belong to 10 categories. We then crop elements from UI screenshots and filter out duplicates by comparing the raw pixels with all previously cropped elements. Finally, we collect 156 missing-label image-based buttons, i.e., 13 buttons on average for each app. All buttons are fed to our model for predicting their labels (denoted as M). To compare the quality of labels from our LabelDroid and human annotators, we recruit three PhD students and research staff (denoted as A1, A2, A3) from our school to create the content descriptions for all image-based buttons. All of them have at least one year of experience in Android app development, so they can be regarded as junior app developers. Before the experiment, they are required to read the accessibility guidelines [6, 19] and we demo them the example labels for some randomly selected image-based buttons (not in our dataset). During the experiment, they are shown the target image-based buttons highlighted in the whole UI (similar to Figure 1), and also the metadata about the app including the app name, app category, etc. All participants carried out the experiments independently without any discussions with each other.

As there is no ground truth for these buttons, we recruit one professional developer (evaluator) with prior experience in accessibility service during app development to manually check how good the annotators' labels are. Instead of telling if the result is right or not, we specify a new evaluation metric for human evaluators called acceptability score according to the acceptability criterion [45]. Given one predicted content description for the button, the human evaluator will assign a 5-point Likert scale [28, 61] with 1 being least satisfied and 5 being most satisfied. Each result from LabelDroid and the human annotators will be evaluated by the evaluator, and the final acceptability score for each app is the average score of all its image-based buttons. Note that we do not tell the human evaluator which label is from developers or our model to avoid potential bias. To check that the human evaluator is capable and careful during the evaluation, we manually insert 4 cases which contain 2 intentional wrong labels and 2 suitable content descriptions (not in the 156 testing set) which are carefully created by all authors together. After the experiment, we ask them to give some informal comments about the experiment, and we also briefly introduce LabelDroid to the developers and the evaluator and get some feedback from them.

6.4.2 Results. Table 6⁵ summarizes the information of the selected 12 apps and the acceptability scores of the generated labels. The average acceptability scores for the three developers vary much, from 3.06 (A1) to 3.62 (A3). But our model achieves a 3.97 acceptability score

⁵ Detailed results are at https://github.com/chenjshnn/LabelDroid

Table 6: The acceptability score (AS) and the standard deviation for 12 completely unseen apps. * denotes p < 0.05.

ID | Package name | Category | #Installation | #Image-based button | AS-M | AS-A1 | AS-A2 | AS-A3
1 | com.handmark.sportcaster | sports | 5M - 10M | 8 | 4.63 (0.48) | 3.13 (0.78) | 3.75 (1.20) | 4.38 (0.99)
2 | com.hola.launcher | personalization | 100M - 500M | 10 | 4.40 (0.92) | 3.20 (1.08) | 3.50 (1.75) | 3.40 (1.56)
3 | com.realbyteapps.moneymanagerfree | finance | 1M - 5M | 24 | 4.29 (1.10) | 3.42 (1.29) | 3.75 (1.45) | 3.83 (1.55)
4 | com.jiubang.browser | communication | 5M - 10M | 11 | 4.18 (1.34) | 3.27 (1.21) | 3.73 (1.54) | 3.91 (1.38)
5 | audio.mp3.music.player | media_and_video | 5M - 10M | 26 | 4.08 (1.24) | 2.85 (1.06) | 2.81 (1.62) | 3.50 (1.62)
6 | audio.mp3.mp3player | music_and_audio | 1M - 5M | 16 | 4.00 (1.27) | 2.75 (1.15) | 3.31 (1.53) | 3.25 (1.39)
7 | com.locon.housing | lifestyle | 1M - 5M | 10 | 4.00 (0.77) | 3.50 (1.12) | 3.60 (1.28) | 4.40 (0.80)
8 | com.gau.go.launcherex.gowidget.weatherwidget | weather | 50M - 100M | 12 | 3.42 (1.66) | 2.92 (1.38) | 3.00 (1.78) | 3.42 (1.80)
9 | com.appxy.tinyscanner | business | 1M - 5M | 13 | 3.85 (1.23) | 3.31 (1.20) | 3.08 (1.59) | 3.38 (1.44)
10 | com.jobkorea.app | business | 1M - 5M | 15 | 3.60 (1.67) | 3.27 (1.57) | 3.13 (1.67) | 3.60 (1.54)
11 | browser4g.fast.internetwebexplorer | communication | 1M - 5M | 4 | 3.25 (1.79) | 2.00 (0.71) | 2.50 (1.12) | 2.50 (1.66)
12 | com.rcplus | social | 1M - 5M | 7 | 3.14 (1.55) | 2.00 (1.20) | 2.71 (1.58) | 3.57 (1.29)
AVERAGE | | | | 13 | 3.97* (1.33) | 3.06 (1.26) | 3.27 (1.60) | 3.62 (1.52)
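The per-app scores in Table 6 are averages (with standard deviations) of 5-point acceptability ratings over that app's buttons, and the comparison reported below relies on the Wilcoxon signed-rank test. The following is a small sketch of both computations with scipy, run on hypothetical per-button ratings rather than the study's actual data.

```python
from statistics import mean, stdev
from scipy.stats import wilcoxon

# Hypothetical 5-point acceptability ratings for one app's image-based buttons
ratings_model = [5, 4, 5, 3, 5, 4, 4, 5]   # LabelDroid (M)
ratings_a1    = [3, 4, 2, 3, 4, 3, 3, 4]   # junior developer A1

# Per-app acceptability score: mean (and standard deviation) over buttons, as in Table 6
print(f"AS-M  = {mean(ratings_model):.2f} ({stdev(ratings_model):.2f})")
print(f"AS-A1 = {mean(ratings_a1):.2f} ({stdev(ratings_a1):.2f})")

# Paired, non-parametric comparison of the two raters' scores on the same buttons
stat, p_value = wilcoxon(ratings_model, ratings_a1)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p_value:.4f}")
```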

Table 7: Examples of generalization.

ID E1 E2 E3 E4 E5

Button

M next song add to favorites open ad previous song clear query A1 change to the next song in playlist add the mp3 as favorite show more details about SVIP paly the former one clean content A2 play the next song like check play the last song close A3 next like enter last close which significantly outperforms three developers by 30.0%, 21.6%, applies to A2/A3’s labels for E2 and such short labels do not contain 9.7%. The evaluator rates 51.3% of labels generated from Label- enough information. (3) Some manual labels may be ambiguous Droid as highly acceptable (5 point), as opposed to 18.59%, 33.33%, which may confuse users. For example, A2 and A3 annotate “play 44.23% from three developers. Figure 9 shows that our model be- the last song” or “last” to “previous song” button (E4) which may haves betters in most apps compared with three human annotators. mislead users that clicking this button will come to the final song These results show that the quality of content description from our in the playlist. (4) Developers may make mistakes especially when model is higher than that from junior Android app developers. Note they are adding content descriptions to many buttons. For example, that the evaluator is reliable as both 2 intentional inserted wrong A2/A3 use “close” to label a “clear query” buttons (E5). We further labels and 2 good labels get 1 and 5 acceptability score as expected. manually check 135 low-quality (acceptability score = 1) labels from To understand the significance of the differences between four annotators into these four categories. 18 cases are verbose labels, 21 kinds of content description, we carry out the Wilcoxon signed-rank of them are uninformative, six cases are ambiguous which would test [74] between the scores of our model and each annotator and confuse users, and the majority, 90 cases are wrong. between the scores of any two annotators. It is the non-parametric We also receive some informal feedback from the developers version of the paired T-test and widely used to evaluate the differ- and the evaluator. Some developers mention that one image-based ence between two related paired sample from the same probability button may have different labels in different context, but they are distribution. The test results suggest that the generated labels from not very sure if the created labels from them are suitable or not. our model are significantly better than that of developers푝 ( -value Most of them never consider adding the labels to UI components < 0.01 for A1, A2, and < 0.05 for A3)6. during their app development and curious how the screen reader For some buttons, the evaluator gives very low acceptability works for the app. All of them are interested in our LabelDroid score to the labels from developers. According to our observation, and tell that the automatic generation of content descriptions for we summarise four reasons accounting for those bad cases and icons will definitely improve the user experience in using the screen give some examples in Table 7. (1) Some developers are prone to reader. All of these feedbacks indicate their unawareness of app write long labels for image-based buttons like the developer A1. accessibility and also confirm the value of our tool. Although the long label can fully describe the button (E1, E2), it is too verbose for blind users especially when there are many image- 7 RELATED WORK based buttons within one page. 
(2) Some developers give too short Mobile devices are ubiquitous, and mobile apps are widely used labels which may not be informative enough for users. For example, for different tasks in people’s daily life. Consequently, there are A2 and A3 annotate the “add to favorite” button as “like” (E2). many research works for ensuring the quality of mobile apps [29, Since this button will trigger an additional action (add this song to 43, 47]. Most of them are investigating the apps’ functional and favorite list), “like” could not express this meaning. The same reason non-functional properties like compatibility [73], performance [58, 6The p-values are adjusted by Benjamin & Hochberg method [27]. All detailed p-values 81], energy-efficiency [24, 25], GUI design [31, 36], GUI animation are listed in https://github.com/chenjshnn/LabelDroid linting [80], localization [72] and privacy and security [37, 38, 40, Unblind Your Apps: Predicting Natural-Language Labels for Mobile GUI Components by Deep Learning ICSE ’20, May 23–29, 2020, Seoul, Republic of Korea

42, 76]. However, few of them are studying the accessibility issues, users with vision impairment) such as manual testing, and auto- especially for users with vision impairment which is focused in our mated testing with analysis tools. First, for manual testing, the work. developers can use the built-in screen readers (e.g., TalkBack [12] for Android, VoiceOver [20] for IOS) to interact with their Android 7.1 App Accessibility Guideline device without seeing the screen. During that process, developers Google and Apple are the primary organizations that facilitate can find out if the spoken feedback for each element conveys its mobile technology and the app marketplace by Android and IOS purpose. Similarly, the Accessibility Scanner app [5] scans the spec- platforms. With the awareness of the need to create more accessible ified screen and provides suggestions to improve the accessibility apps, both of them have released developer and designer guidelines of your app including content labels, clickable items, color contrast, for accessibility [6, 15] which include not only the accessibility etc. The shortcoming of this tool is that the developers must run it design principles, but also the documents for using assistive tech- in each screen of the app to get the results. Such manual exploration nologies embedding in the operating system [15], and testing tools of the application might not scale for larger apps or frequent testing, or suits for ensuring the app accessibility. The World Wide Web and developers may miss some functionalities or elements during Consortium (W3C) has released their web accessibility guideline the manual testing. long time ago [21] And now they are working towards adapting Second, developers can also automate accessibility tasks by re- their web accessibility guideline [21] by adding mobile character- sorting testing frameworks like Android Lint, Espresso and Robolec- istics into mobile platforms. Although it is highly encouraged to tric, etc. The Android Lint [1] is a static tool for checking all files of follow these guidelines, they are often ignored by developers. Differ- an Android project, showing lint warnings for various accessibility ent from these guidelines, our work is specific to users with vision issues including missing content descriptions and providing links impairment and predicts the label during the developing process to the places in the source code containing these issues. Apart from without requiring developers to fully understand long guidelines. the static-analysis tools, there are also testing frameworks such as Espresso [10] and Robolectric [18] which can also check accessi- 7.2 App Accessibility Studies for Blind Users bility issues dynamically during the testing execution. And there are counterparts for IOS apps like Earl-Grey [9] and KIF [16]. Note Many works in Human-Computer Interaction area have explored that all of these tools are based on official testing framework. For the accessibility issues of small-scale mobile apps [63, 75] in differ- example, Espresso, Robolectric and Accessibility Scanner are based ent categories such as in health [71], smart cities [60] and govern- on Android’s Accessibility Testing Framework [7]. ment engagement [50]. Although they explore different accessibility Although all of these tools are beneficial for the accessibility issues, the lack of descriptions for image-based components has testing, there are still three problems with them. 
7.2 App Accessibility Studies for Blind Users
Many works in the Human-Computer Interaction area have explored the accessibility issues of small-scale sets of mobile apps [63, 75] in different categories such as health [71], smart cities [60] and government engagement [50]. Although they explore different accessibility issues, the lack of descriptions for image-based components has been commonly and explicitly noted as a significant problem in these works. Park et al. [63] rated the severity as well as the frequency of errors, and missing labels was rated as the most severe of ten kinds of accessibility issues. Kane et al. [49] carried out a study of mobile device adoption and accessibility for people with visual and motor disabilities. Ross et al. [66] examined image-based button labeling in a relatively large-scale set of Android apps and identified some common labeling issues within these apps. Different from their works, our study includes not only the largest-scale analysis of image-based button labeling issues, but also a solution that addresses those issues with a model that predicts the label of the image.

There are also some works targeting locating and solving accessibility issues, especially for users with vision impairment. Eler et al. [41] developed an automated test generation model to dynamically test mobile apps. Zhang et al. [78] leveraged crowdsourcing to annotate GUI elements that lack the original content descriptions. For other accessibility issues, they further developed an approach that deploys interaction proxies for runtime repair and enhancement of mobile application accessibility [77] without referring to the source code. Although these works can also help ensure the quality of mobile accessibility, they still require much effort from developers. Instead, the model proposed in our work can automatically recommend labels for image-based components, and developers can directly use them or modify them for their own apps.
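The missing-label problem reported in these studies ultimately comes down to image-based widgets that declare no android:contentDescription. The snippet below is a minimal sketch, written by us purely for illustration, of how such a check could be run over a directory of layout XML files; it is not the instrumentation used in any of the cited studies, and the directory path in main is a hypothetical example.

import java.io.File
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Namespace used by android:* attributes in layout files.
private const val ANDROID_NS = "http://schemas.android.com/apk/res/android"

// Image-based widgets that need a natural-language label for screen readers.
private val IMAGE_WIDGETS = setOf("ImageButton", "ImageView")

fun findUnlabelledImageWidgets(layoutDir: File): List<String> {
    val factory = DocumentBuilderFactory.newInstance().apply { isNamespaceAware = true }
    val findings = mutableListOf<String>()
    layoutDir.walkTopDown()
        .filter { it.isFile && it.extension == "xml" }
        .forEach { file ->
            // Skip files that are not well-formed XML instead of aborting the scan.
            val doc = runCatching { factory.newDocumentBuilder().parse(file) }
                .getOrNull() ?: return@forEach
            val elements = doc.getElementsByTagName("*")
            for (i in 0 until elements.length) {
                val element = elements.item(i) as Element
                val missingLabel = element.tagName in IMAGE_WIDGETS &&
                    element.getAttributeNS(ANDROID_NS, "contentDescription").isBlank()
                if (missingLabel) {
                    findings += "${file.name}: <${element.tagName}> has no contentDescription"
                }
            }
        }
    return findings
}

fun main() {
    // Hypothetical location of an app's layout resources.
    findUnlabelledImageWidgets(File("app/src/main/res/layout")).forEach(::println)
}

A check like this can only flag where a label is missing; producing the label text itself is what the model proposed in this work aims to do.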

7.3 App Accessibility Testing Tools
It is also worth mentioning some related non-academic projects. There are mainly two strategies for testing app accessibility for users with vision impairment: manual testing, and automated testing with analysis tools. First, for manual testing, developers can use the built-in screen readers (e.g., TalkBack [12] for Android, VoiceOver [20] for iOS) to interact with their device without seeing the screen. During that process, developers can find out whether the spoken feedback for each element conveys its purpose. Similarly, the Accessibility Scanner app [5] scans the specified screen and provides suggestions to improve the accessibility of the app, covering content labels, clickable items, color contrast, etc. The shortcoming of this tool is that developers must run it on each screen of the app to get the results. Such manual exploration of the application might not scale to larger apps or frequent testing, and developers may miss some functionalities or elements during the manual testing.

Second, developers can also automate accessibility tasks by resorting to testing frameworks such as Android Lint, Espresso and Robolectric. Android Lint [1] is a static tool that checks all files of an Android project, shows lint warnings for various accessibility issues, including missing content descriptions, and provides links to the places in the source code containing these issues. Apart from such static-analysis tools, there are also testing frameworks such as Espresso [10] and Robolectric [18], which can check accessibility issues dynamically during test execution; counterparts for iOS apps include EarlGrey [9] and KIF [16]. Note that all of these tools are based on the official testing framework; for example, Espresso, Robolectric and Accessibility Scanner are based on Android's Accessibility Testing Framework [7].

Although all of these tools are beneficial for accessibility testing, there are still two problems with them. First, they require developers to be well aware of and knowledgeable about these tools, and to understand the necessity of accessibility testing. Second, all of this testing is reactive to existing accessibility issues, which may have already harmed the users of the app before the issues are fixed. In addition to such reactive testing, we also need a more proactive mechanism of accessibility assurance that automatically predicts the content labels and reminds developers to fill them into the app. The goal of our work is to develop a proactive content labeling model which can complement the reactive testing mechanism.
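As an illustration of the automated route, accessibility checks can be switched on inside an Espresso instrumentation test so that every view interaction is additionally validated by Android's Accessibility Testing Framework [7]. The sketch below is our own and only indicative: the activity and button id (PlayerActivity, R.id.btn_add_to_favorite) are the same hypothetical names used earlier, and the exact setup may differ across library versions.

import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.accessibility.AccessibilityChecks
import androidx.test.espresso.action.ViewActions.click
import androidx.test.espresso.matcher.ViewMatchers.withId
import androidx.test.ext.junit.rules.ActivityScenarioRule
import org.junit.BeforeClass
import org.junit.Rule
import org.junit.Test

class PlayerAccessibilityTest {

    @get:Rule
    val activityRule = ActivityScenarioRule(PlayerActivity::class.java)

    companion object {
        @BeforeClass
        @JvmStatic
        fun enableAccessibilityChecks() {
            // Every ViewAction performed by Espresso will now also run the
            // Accessibility Testing Framework checks (missing content labels,
            // touch target size, contrast, ...) and fail the test on violations.
            AccessibilityChecks.enable().setRunChecksFromRootView(true)
        }
    }

    @Test
    fun favoriteButtonPassesAccessibilityChecks() {
        // Any interaction with the view triggers the enabled checks.
        onView(withId(R.id.btn_add_to_favorite)).perform(click())
    }
}

Such a check, like the tools above, remains reactive: it reports that a label is missing only after the test runs, whereas the model proposed in this work suggests the label text while the developer is still writing the code.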

8 CONCLUSION AND FUTURE WORK
More than 77% of apps have at least one image-based button without a natural-language label that can be read out to users with vision impairment. Considering that most app designers and developers have no vision issues themselves, they may not understand how to write suitable labels. To overcome this problem, we propose a deep-learning model based on a CNN and a transformer encoder-decoder that learns to predict the labels of given image-based buttons.

We hope that this work can draw the community's attention to app accessibility. In the future, we will first improve our model to achieve better quality by taking app metadata into consideration. Second, we will also test the quality of existing labels by checking whether the descriptions are concise and informative.

REFERENCES
[1] 2011. Android Lint - Project Site. http://tools.android.com/tips/lint.
[2] 2017. Screen Reader Survey. https://webaim.org/projects/screenreadersurvey7/.
[3] 2018. android.widget | Android Developers. https://developer.android.com/reference/android/widget/package-summary.
[4] 2018. Blindness and vision impairment. https://www.who.int/en/news-room/fact-sheets/detail/blindness-and-visual-impairment.
[5] 2019. Accessibility Scanner. https://play.google.com/store/apps/details?id=com.google.android.apps.accessibility.auditor.
[6] 2019. Android Accessibility Guideline. https://developer.android.com/guide/topics/ui/accessibility/apps.
[7] 2019. Android's Accessibility Testing Framework. https://github.com/google/Accessibility-Test-Framework-for-Android.
[8] 2019. Apple App Store. https://www.apple.com/au/ios/app-store/.
[9] 2019. EarlGrey. https://github.com/google/EarlGrey.
[10] 2019. Espresso | Android Developers. https://developer.android.com/training/testing/espresso.
[11] 2019. Google Play Store. https://play.google.com.
[12] 2019. Google TalkBack source code. https://github.com/google/talkback.
[13] 2019. Image Button. https://developer.android.com/reference/android/widget/ImageButton.
[14] 2019. ImageView. https://developer.android.com/reference/android/widget/ImageView.
[15] 2019. iOS Accessibility Guideline. https://developer.apple.com/accessibility/ios/.
[16] 2019. KIF. https://github.com/kif-framework/KIF.
[17] 2019. PyTorch. https://pytorch.org/.
[18] 2019. Robolectric. http://robolectric.org/.
[19] 2019. Talkback Guideline. https://support.google.com/accessibility/android/answer/6283677?hl=en.
[20] 2019. VoiceOver. https://cloud.google.com/translate/docs/.
[21] 2019. World Wide Web Consortium Accessibility. https://www.w3.org/standards/webdesign/accessibility.
[22] Jyoti Aneja, Aditya Deshpande, and Alexander G Schwing. 2018. Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5561–5570.
[23] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
[24] Abhijeet Banerjee, Hai-Feng Guo, and Abhik Roychoudhury. 2016. Debugging energy-efficiency related field failures in mobile apps. In Proceedings of the International Conference on Mobile Software Engineering and Systems. ACM, 127–138.
[25] Abhijeet Banerjee and Abhik Roychoudhury. 2016. Automated re-factoring of Android apps to enhance energy-efficiency. In 2016 IEEE/ACM International Conference on Mobile Software Engineering and Systems (MOBILESoft). IEEE, 139–150.
[26] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.
[27] Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 1 (1995), 289–300.
[28] John Brooke et al. 1996. SUS - A quick and dirty usability scale. Usability Evaluation in Industry 189, 194 (1996), 4–7.
[29] Margaret Butler. 2010. Android: Changing the mobile landscape. IEEE Pervasive Computing 10, 1 (2010), 4–7.
[30] Chunyang Chen, Xi Chen, Jiamou Sun, Zhenchang Xing, and Guoqiang Li. 2018. Data-driven proactive policy assurance of post quality in community Q&A sites. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 1–22.
[31] Chunyang Chen, Sidong Feng, Zhenchang Xing, Linda Liu, Shengdong Zhao, and Jinshui Wang. 2019. Gallery DC: Design Search and Knowledge Discovery through Auto-created GUI Component Gallery. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–22.
[32] Chunyang Chen, Ting Su, Guozhu Meng, Zhenchang Xing, and Yang Liu. 2018. From UI design image to GUI skeleton: a neural machine translator to bootstrap mobile GUI implementation. In Proceedings of the 40th International Conference on Software Engineering. ACM, 665–676.
[33] Chunyang Chen, Zhenchang Xing, and Yang Liu. 2017. By the community & for the community: a deep learning approach to assist collaborative editing in Q&A sites. Proceedings of the ACM on Human-Computer Interaction 1, CSCW (2017), 1–21.
[34] Chunyang Chen, Zhenchang Xing, Yang Liu, and Kent Long Xiong Ong. 2019. Mining likely analogical APIs across third-party libraries via large-scale unsupervised API semantics embedding. IEEE Transactions on Software Engineering (2019).
[35] Guibin Chen, Chunyang Chen, Zhenchang Xing, and Bowen Xu. 2016. Learning a dual-language vector space for domain-specific cross-lingual question retrieval. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 744–755.
[36] Sen Chen, Lingling Fan, Chunyang Chen, Ting Su, Wenhe Li, Yang Liu, and Lihua Xu. 2019. StoryDroid: Automated generation of storyboard for Android apps. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 596–607.
[37] Sen Chen, Lingling Fan, Chunyang Chen, Minhui Xue, Yang Liu, and Lihua Xu. 2019. GUI-Squatting Attack: Automated Generation of Android Phishing Apps. IEEE Transactions on Dependable and Secure Computing (2019).
[38] Sen Chen, Minhui Xue, Lingling Fan, Shuang Hao, Lihua Xu, Haojin Zhu, and Bo Li. 2018. Automated poisoning attacks and defenses in malware detection systems: An adversarial machine learning approach. Computers & Security 73 (2018), 326–344.
[39] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015).
[40] Tobias Dehling, Fangjian Gao, Stephan Schneider, and Ali Sunyaev. 2015. Exploring the far side of mobile health: information security and privacy of mobile health apps on iOS and Android. JMIR mHealth and uHealth 3, 1 (2015), e8.
[41] Marcelo Medeiros Eler, José Miguel Rojas, Yan Ge, and Gordon Fraser. 2018. Automated accessibility testing of mobile apps. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST). IEEE, 116–126.
[42] Ruitao Feng, Sen Chen, Xiaofei Xie, Lei Ma, Guozhu Meng, Yang Liu, and Shang-Wei Lin. 2019. MobiDroid: A Performance-Sensitive Malware Detection System on Mobile Platform. In 2019 24th International Conference on Engineering of Complex Computer Systems (ICECCS). IEEE, 61–70.
[43] Bin Fu, Jialiu Lin, Lei Li, Christos Faloutsos, Jason Hong, and Norman Sadeh. 2013. Why people hate your app: Making sense of user feedback in a mobile app store. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1276–1284.
[44] Sa Gao, Chunyang Chen, Zhenchang Xing, Yukun Ma, Wen Song, and Shang-Wei Lin. 2019. A neural model for method name generation from functional description. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 414–421.
[45] Isao Goto, Ka-Po Chow, Bin Lu, Eiichiro Sumita, and Benjamin K Tsou. 2013. Overview of the Patent Machine Translation Task at the NTCIR-10 Workshop. In NTCIR.
[46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[47] Geoffrey Hecht, Omar Benomar, Romain Rouvoy, Naouel Moha, and Laurence Duchien. 2015. Tracking the software quality of Android applications along their evolution (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 236–247.
[48] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[49] Shaun K Kane, Chandrika Jayant, Jacob O Wobbrock, and Richard E Ladner. 2009. Freedom to roam: a study of mobile device adoption and accessibility for people with visual and motor disabilities. In Proceedings of the 11th International ACM SIGACCESS Conference on Computers and Accessibility. ACM, 115–122.
[50] Bridgett A King and Norman E Youngblood. 2016. E-government in Alabama: An analysis of county voting and election website content, usability, accessibility, and mobile readiness. Government Information Quarterly 33, 4 (2016), 715–726.
[51] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[52] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[53] Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 1 (1951), 79–86.
[54] Richard E Ladner. 2015. Design for user empowerment. Interactions 22, 2 (2015), 24–29.
[55] Yann LeCun, Yoshua Bengio, et al. 1995. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks 3361, 10 (1995), 1995.
[56] Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. 150–157.
[57] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, 740–755.
[58] Mario Linares-Vasquez, Christopher Vendome, Qi Luo, and Denys Poshyvanyk. 2015. How developers detect and fix performance bottlenecks in Android apps. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 352–361.
[59] Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics (1947), 50–60.
[60] Higinio Mora, Virgilio Gilart-Iglesias, Raquel Pérez-del Hoyo, and María Andújar-Montoya. 2017. A comprehensive system for monitoring urban accessibility in smart cities. Sensors 17, 8 (2017), 1834.
[61] Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Learning to Generate Pseudo-Code from Source Code Using Statistical Machine Translation (T). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 574–584.
[62] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 311–318.
[63] Kyudong Park, Taedong Goh, and Hyo-Jeong So. 2014. Toward accessible mobile application design: developing mobile application accessibility guidelines for people with visual impairment. In Proceedings of HCI Korea. Hanbit Media, Inc., 31–38.
[64] Juan Ramos et al. 2003. Using TF-IDF to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, Vol. 242. Piscataway, NJ, 133–142.
[65] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91–99.
[66] Anne Spencer Ross, Xiaoyi Zhang, James Fogarty, and Jacob O Wobbrock. 2018. Examining image-based button labeling for accessibility in Android apps through large-scale analysis. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility. ACM, 119–130.
[67] Ch Spearman. 2010. The proof and measurement of association between two things. International Journal of Epidemiology 39, 5 (2010), 1137–1150.
[68] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[69] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4566–4575.
[70] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3156–3164.
[71] Fahui Wang. 2012. Measurement, optimization, and impact of health care accessibility: a methodological review. Annals of the Association of American Geographers 102, 5 (2012), 1104–1112.
[72] Xu Wang, Chunyang Chen, and Zhenchang Xing. 2019. Domain-specific machine translation with recurrent neural network for software localization. Empirical Software Engineering 24, 6 (2019), 3514–3545.
[73] Lili Wei, Yepang Liu, and Shing-Chi Cheung. 2016. Taming Android fragmentation: Characterizing and detecting compatibility issues for Android apps. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 226–237.
[74] Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics. Springer, 196–202.
[75] Shunguo Yan and PG Ramachandran. 2019. The current status of accessibility in mobile apps. ACM Transactions on Accessible Computing (TACCESS) 12, 1 (2019), 3.
[76] Zhenlong Yuan, Yongqiang Lu, and Yibo Xue. 2016. DroidDetector: Android malware characterization and detection using deep learning. Tsinghua Science and Technology 21, 1 (2016), 114–123.
[77] Xiaoyi Zhang, Anne Spencer Ross, Anat Caspi, James Fogarty, and Jacob O Wobbrock. 2017. Interaction proxies for runtime repair and enhancement of mobile application accessibility. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 6024–6037.
[78] Xiaoyi Zhang, Anne Spencer Ross, and James Fogarty. 2018. Robust Annotation of Mobile Application Interfaces in Methods for Accessibility Repair and Enhancement. In The 31st Annual ACM Symposium on User Interface Software and Technology. ACM, 609–621.
[79] Dehai Zhao, Zhenchang Xing, Chunyang Chen, Xin Xia, and Guoqiang Li. 2019. ActionNet: vision-based workflow action recognition from programming screencasts. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 350–361.
[80] Dehai Zhao, Zhenchang Xing, Chunyang Chen, Xiwei Xu, Liming Zhu, Guoqiang Li, and Jinshui Wang. 2020. Seenomaly: Vision-Based Linting of GUI Animation Effects Against Design-Don't Guidelines. In 42nd International Conference on Software Engineering (ICSE '20). ACM, New York, NY, 12 pages. https://doi.org/10.1145/3377811.3380411
[81] Hui Zhao, Min Chen, Meikang Qiu, Keke Gai, and Meiqin Liu. 2016. A novel pre-cache schema for high performance Android system. Future Generation Computer Systems 56 (2016), 766–772.