Text2App: A Framework for Creating Android Apps from Text Descriptions Masum Hasan1*, Kazi Sajeed Mehrab1*, Wasi Uddin Ahmad2, and Rifat Shahriyar1
1Bangladesh University of Engineering and Technology (BUET) 2University of California, Los Angeles (UCLA) [email protected], [email protected], [email protected] [email protected]
Abstract Natural language description: Create an app with a textbox, a button named “Speak”, and a We present Text2App – a framework that al- text2speech. When the button is lows users to create functional Android appli- clicked, speak the text in the text box.
cations from natural language specifications. Simplified App Representation: The conventional method of source code gen- On click plication with a substantially smaller number
lows seq2seq networks to learn complex ap- plication structures with less overhead. In or- Literal Dictionary: der to train sequence models, we introduce a { “STRING0”: “Speak” data synthesis method grounded in a human } survey. We demonstrate that Text2App gener- alizes well to unseen combination of app com- Figure 1: An example app created by our system that ponents and it is capable of handling noisy nat- speaks the textbox text on button press. The natural lan- ural language instructions. We explore the pos- guage is machine translated to a simpler intermediate sibility of creating applications from highly ab- formal language which is compiled into an app source stract instructions by coupling our system with code. Literals are separated before machine translation. GPT-3 – a large pretrained language model. We perform an extensive human evaluation and identify the capabilities and limitations 2016), with the aspiration to automatically gen- of our system. The source code, a ready-to- run demo notebook, and a demo video are erate full-fledged software systems further down publicly available at https://github.com/ the road. To date, however, the task of source code text2app/Text2App. generation has turned out to be highly difficult – the arXiv:2104.08301v2 [cs.CL] 7 Jul 2021 best deep neural networks, consisting of hundreds 1 Introduction of millions of parameters and trained with hundreds of gigabytes of data, fails to achieve an accuracy Mobile application developers often have to build higher than 20% (Ahmad et al., 2021; Lu et al., applications from natural language requirements 2021). Till date, the ambition to produce software provided by their clients or managers. An auto- automatically from natural language descriptions mated tool to build functional applications from has remained a distant reality. such natural language descriptions will signifi- In this work, we present Text2App, a novel cantly value this application development process. pipeline to generate Android Mobile Applications For many years, researchers have been trying to (app) from natural language (NL) descriptions. In- generate source code from natural language de- stead of an end-to-end learning-based model, we scriptions (Yin and Neubig, 2017; Ling et al., break down the challenging task of app develop- * Equal contribution ment into modular components and apply learning- based methods only where necessary. We create a application with a substantially smaller number of formal language named Simplified App Represen- tokens, allowing seq2seq models to generate intri- tation (SAR), to represent an app with a minimal cate apps in a few decoding steps, which otherwise number of tokens and train a sequence-to-sequence would be unsolvable by current sequential models. neural network to generate this formal representa- We design a formal language named Simplified tion from an NL. Fig.1 shows an example of a for- App Representation (SAR) that captures the app mal representation created from a given NL. Using design, components, and functionalities in a small a custom-made compiler, we convert the simpli- number of tokens (Section 3.1). We further develop fied formal representation to the application source a SAR compiler that converts a SAR to an appli- code from which a functional app can be built. We cation source code (Section 3.2). Using MIT App create a data synthesis method and a BERT-based Inventor1 – a popular, accessible application devel- NL augmentation method to synthesize realistic opment tool – the source code can be compiled to NL-SAR parallel corpus. functional app in a matter of minutes. Training a We demonstrate that the compact app represen- sequence-to-sequence neural network for translat- tation allows seq2seq models to generate app from ing a natural language to SAR requires a parallel significantly noisy input, even being able to pre- NL-SAR corpus. However, human annotation of dict combinations it has not seen during training. such a corpus is difficult, and it limits our capa- Moreover, we open source our implementation to bility to add new components and functionalities. the community, and lay down the groundwork to Instead, we conduct a human survey to understand extend the features and functionalities of Text2App user perception of text-based app development and beyond what we demonstrated in this paper. app description pattern (Section 3.3), and based on this survey, we create a data synthesis method to 2 Related Works automatically generate fluent natural language de- Historically, deep learning based program genera- scriptions of apps along with corresponding SARs tion tended to focus on generating unit functions or (Section 3.4). To make sure our synthetic dataset methods from natural language instructions using is not monotonous, we introduce a BERT based sequence-to-sequence or sequence-to-tree architec- data augmentation method (Section 3.5). Using the tures (Ling et al., 2016; Yin and Neubig, 2017; synthesized and augmented parallel NL-SAR data, Brockschmidt et al., 2019; Parisotto et al., 2017; we train multiple sequence-to-sequence neural net- Rabinovich et al., 2017; Ahmad et al., 2021; Lu works to predict SAR from a given text description et al., 2021). The other type of works in program of an app, which is then compiled into functional generation that sparked researcher’s interest is gen- apps (Section 3.6). Fig.2 describes each step in erating GUI source code from a screenshot, hand- our natural language to app generation process. We drawn image, or text description of the GUI (Bel- also discuss how a pretrained language model, such tramelli, 2018; Jain et al., 2019; Robinson, 2019; as GPT-3, can be used with our system as an exter- Zhu et al., 2019; Moran et al., 2020; Kolthoff, nal knowledge-base for simplifying abstract human 2019). These works are limited to generating GUI instructions (Section 3.7). Our system is built to be design only, and does not naturally extend to func- modular, where each module is self-contained: in- tionality based programming. To the best of our dependent and with a single, well-defined purpose. knowledge, ours is the first work on developing This allows us to modify one part of the system working software with interdependent functional without affecting the others and debug the system components from natural language description. to pinpoint any error. Literals like strings, numbers are separated dur- 3 Text2App ing the preprocessing and are re-introduced during compilation. Contrary to conventional program- Text2App is a framework that aims to build oper- ming languages, unless a user specifies a detail of ational mobile applications from natural language an app component, a suitable default is assumed. (NL) specifications. We make this possible by trans- This allows the user to describe an app more natu- lating a specification to an intermediate, compact, rally and also reduces unnecessary overhead from formal representation which is compiled into the the sequential model. application source code in a later step. This inter- mediate language helps our system represent an 1https://appinventor.mit.edu/ Simplified App { "STRING0": "Speak" } Representation (SAR) Create an app with a create an app with a ... Natural Language Tokenized and Description of App Pre-processed Seq2seq Neural Network
{ "STRING0": "Speak" }
.apk MIT App Inventor MIT App Inventor SAR Compiler Ready to Use! backend Compiler Source (.scm, .bky)
Figure 2: Text2App Prediction Pipeline. A given text is formatted and passed to a seq2seq network to be translated into SAR. Using a SAR Compiler, it is converted to App Inventor project, which can be built into an application.
3.1 Simplified App Representation (SAR) an animated ball has properties ‘color’, ‘speed’, SAR is an abstract, intermediate, formal language ‘radius’, etc. Such a property is called an argu- that represents a mobile application in our system. ment (’ heventsi ‘
’ project using a custom written compiler. The heventsi →h eventi heventsi | heventi project is then compiled into functioning app (.apk) heventi → kens and fetches their corresponding templates. By tokens. One functionality in our system is a tu- fetching and modifying the predefined templates ple containing an event, an action, and a value. with the user specified arguments, we generate the
Sample
Components SAR Event Action Value
NL
Figure 3: Automatic synthesis of NL and SAR parallel corpus. Bold-italic indicates text is selected stochastically.
3.3 Survey on Natural Language based App Original: Create an app that has an audio player with Development source string0, a switch. If the switch is flipped, play player. In order to understand how a user would perceive Augmentation: Create an app that has an external a system like Text2App, early in our study, we player with source string0, a switch. If the switch gets performed a semi-structured human survey among flipped, play player. participants with some programming experience. We asked them to describe several mobile apps Table 1: BERT mask filling based data augmentation from a given set of app components. We received method. Mutated words are highlighted green. a total of 57 responses2 from 30 participants. 36 out of the 57 responses contained enough details for app creation. 19 responses were not detailed, from a predefined list. When the components, ar- and would require more details and knowledge to guments, and their functionalities are selected, we be converted to app (e.g. “make a photo editing stochastically create a natural language description app” – requires the system to know what a photo by sampling natural language snippets from pre- editor is). The observations from this survey helps defined lists. Furthermore, we deterministically us to create a data synthesis method, and it works create a SAR representation of an app and the func- as a general guideline for our study design. tionalities. Figure3 demonstrates creation of a 3.4 Synthesising Natural Language and SAR simple app with three components. The random Parallel Data selection process and repetition of components al- lows our synthesis method to create wide variety Training a seq2seq model to generate SAR from of apps. natural language would require an NL-SAR par- allel corpus. Based on the findings of our survey described in Section 3.3, we develop a data syn- thesis method for generating natural language app 3.5 BERT Based NL Augmentation description and SAR data parallelly. First, from a list of allowed components, we randomly select a Data augmentation is common practice in com- certain number of components. Most components puter vision, where an image is rotated, cropped, (i.e. button, textbox) are allowed to be re- scaled, or corrupted, in order to increase the data peated a certain number of times, but some compo- size and introducing variation to the dataset. To nents (e.g. text2speech, accelerometer) add diversity to our synthetic dataset, we propose can only appear once. The selected components a data augmentation method where we mask a cer- are sorted into three groups, event component, ac- tain percentage of words in our dataset using the tion component, and value component (detailed in Masked Language Modeling (MLM) property of Section 3.1). For each event, action, and value com- pretrained BERT (Devlin et al., 2019) and sample ponents, their functionalities are selected randomly contextually correct alternate words. Table1 shows 2http://bit.ly/AppDescriptions an example of the augmentation technique. 3.6 NL to SAR Translation using Seq2Seq cluded data. These establishes the models’ ability Networks to generalize beyond the patterns it was trained We generate 50,000 unique NL and SAR parallel on. From Table2 we can see that augmenting the data using our data synthesis method, and mutate training dataset notably improves all models (up to 1% of the natural language tokens. We split this 22.04%). We also see that the RoBERTa initialized dataset into train-validation-test sets in 8:1:1 ratio model performs best in all evaluation categories. and train three different models. Note that, all predictions reported in Table2 are Pointer Network: We train a Pointer Network valid SAR format. (See et al., 2017) consisting of a randomly ini- Human Evaluation. To evaluate our system tialized bidirectional LSTM encoder with hidden with real world natural language, we conduct a layers of size 500 and 250. survey with 13 Computer Science undergraduate Transformer with pretrained encoders: We volunteers. We provided the participants short create two sequence-to-sequence Transformer videos of 10 mobile applications, and asked them 3 (Vaswani et al., 2017) networks each having 12 to describe the application in their own language . encoder layers and 6 decoder layers. Every layer We provided them with the component names and has 12 self-attention heads of size 64. The hidden one example NL. We collected total 112 responses dimension is 768. The encoder of one of the models and generated the SAR from these responses us- is initialized with RoBERTa base (Liu et al., 2020) ing Text2App system. The evaluation contained pretrained weights, and the other one with Code- application SARs with different lengths (min 10, BERT base (Feng et al., 2020) pretrained weights. max 42) Upon comparing the generated SAR with the ground truth SAR data we found a BLEU-1 3.7 Simplifying Abstract Natural Language score of 54.99, and exact match 4/111. The labeled Instructions using GPT-3 data and predictions are made publicly available4. Upon manual inspection, we found that 25 out of In our survey sessions (Section 3.3) we found that the 112 predictions are either correct or function- some abstract instructions require external knowl- ally correct predictions of the user provided NL. edge to be converted to applications (e.g. “Create This represents the difficulty of our task and the a photo editor app” – expects knowledge how a complexity of natural language. Observing the suc- photo editor looks and works). Large pretrained au- cessful and unsuccessful predictions, we find the toregressive language models (LMs), have shown patterns emerge as shown in Table4. to understand abstract natural language concepts and even explain them in simple terms (Mayn, 5 Scope, Limitation, and Future Work 2020). Using the few shot prediction capability of GPT-3 (Brown et al., 2020), we experiment with Text2App is the first attempt of an NL based app simplifying abstract app concepts. We find that development tool, and our core contribution of our although this method is promising, the LM fails to project lies in the development endeavour of build- limit the prediction to our current capability (Table ing the SAR, the SAR compiler, and the SAR- 3). The GPT-3 prompts are provided in Appendix NL parallel data synthesizer. Currently it sup- D. ports app creation from 12 components (i.e. ‘cam- era’, ‘textbox’, ‘button’, ‘text2speech’, ‘ball’, ‘ac- 4 Evaluation celerometer’, ‘video player’, ‘switch’, ‘player’, Automatic Evaluation. We evaluate the three ‘label’, ‘timepicker’, ‘passwordtextbox’), 4 events seq2seq networks mentioned in Section 3.6 – Point- (i.e. button click, switch flip, accelerometer shaken, erNetwork, seq2seq Transformer initialized with ball flung), and more than 10 actions (e.g. Take RoBERTa, and seq2seq Transformer initialized a picture, Start/stop video/audio, Speak a given with CodeBERT. We evaluate the models in 3 differ- text or a textbox, Bounce ball, set speed/color of ent settings – firstly, in a held out test set, secondly, the ball, Set label, etc.). We have covered the first with increasing amount of mutation in the test set 3 out of 4 fundamentals of Android applications (2%, 5% 10%) (Section 3.5). We also trained (i.e. Activity, Services, Content Provider, Broad- separate models using data excluding specific cast Receiver). Most components supported by in combinations of components (
Table 2: Comparison between Pointer Network and seq2seq Transformer with encoder initialized with RoBERTa and CodeBERT pretrained weights. BLEU indicates BLEU-1 and Exact Match (EM) is shown in percent.
1. number adding app - make an app with a textbox, a Observation Example textbox, and a button named ”+”. Successful predictions NL: Create an app that has a SAR:
User expects the auto- NL: A typical registration matic system to have form with necessary text fields 3. browser app - create an app with a textbox, a button inherent understanding and a submit button. named “go”, and a button named “back”. When the button of the real world. “go” is pressed, go to the url in the textbox. When the button “back” is pressed, go back to the previous page. Model is biased to- NL: Create an app which wards synthesizer key- has a moving ball. (‘mov- 4. Google front page - make an app with a textbox, a but- words. ing’ frequently appears with ton named “google”, and a button named “search”. When accelerometer, model predicts the button “google” is pressed, search google. When the accelerometer component.) button “search” is pressed, search the web. System misses compo- NL: Create an app that has a nents not specified. textbox labelled ”Insert Text”, Table 3: Abstract instructions to simpler app descrip- when the device is shaken, tion using GPT-3. Example 1, 2, was constrained speak the text in the textbox within our allowed set of functionalities. 3, 4, intro- duced concepts that are not yet supported in Text2App Table 4: Observation from human labeled data predic- (marked in red and italic). tions
MIT App Inventor follows a similar tree-like pat- tern, and thus can be represented as SAR. We are inviting open source community to contribute to and help grow this project. In future, we would like to experiment with generative model based mediate representation of the application. The in- data synthesis, more reliable language model based termediate formal representation allows to describe NL simplification techniques, and potentially app an app with significantly smaller number of tokens development in native languages. than native app development languages. We also 6 Conclusion design a data synthesis method guided by a human survey, that automatically generates fluent natural In this paper, we explore creating functional mo- language app descriptions and their formal repre- bile applications from natural language text de- sentations. Our AI aware design approach for a scriptions using seq2seq networks. We propose formal language can guide future programming lan- Text2App, a novel framework for natural language guage and frameworks development, where further to app translation with the help of a simpler inter- source code generation works can benefit from. Acknowledgement 1536–1547, Online. Association for Computational Linguistics. We thank Prof. Zhijia Zhao from UCR for propos- ing the problem that inspired this project idea. Vanita Jain, Piyush Agrawal, Subham Banga, Rishabh We also thank OpenAI, Google Colaboratory, Kapoor, and Shashwat Gulyani. 2019. Sketch2code: Transformation of sketches to ui in real-time using Hugging Face, MIT App Inventor community, the deep neural network. survey participants, and Prof. Anindya Iqbal for feedback regarding modularity. This project was K. Kolthoff. 2019. Automatic generation of g raphical user interface prototypes from unrestricted natural funded under the ‘Innovation Fund’ by the ICT language requirements. In 2019 34th IEEE/ACM In- Division, Government of the People’s Republic of ternational Conference on Automated Software En- Bangladesh. gineering (ASE), pages 1234–1237. Wang Ling, Phil Blunsom, Edward Grefenstette, References Karl Moritz Hermann, Toma´sˇ Kociskˇ y,´ Fumin Wang, and Andrew Senior. 2016. Latent predictor Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi networks for code generation. In Proceedings of the Ray, and Kai-Wei Chang. 2021. Unified pre-training 54th Annual Meeting of the Association for Compu- for program understanding and generation. In Pro- tational Linguistics (Volume 1: Long Papers), pages ceedings of the 2021 Conference of the North Amer- 599–609, Berlin, Germany. Association for Compu- ican Chapter of the Association for Computational tational Linguistics. Linguistics. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Tony Beltramelli. 2018. pix2code: Generating code dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, from a graphical user interface screenshot. In Pro- Luke Zettlemoyer, and Veselin Stoyanov. 2020. ceedings of the ACM SIGCHI Symposium on Engi- Ro{bert}a: A robustly optimized {bert} pretraining neering Interactive Computing Systems, pages 1–6. approach.
Marc Brockschmidt, Miltiadis Allamanis, Alexander L. Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Gaunt, and Oleksandr Polozov. 2019. Generative Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, code modeling with graphs. In International Con- Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Li- ference on Learning Representations. dong Zhou, Linjun Shou, Long Zhou, Michele Tu- fano, Ming Gong, Ming Zhou, Nan Duan, Neel Sun- Tom Brown, Benjamin Mann, Nick Ryder, Melanie daresan, Shao Kun Deng, Shengyu Fu, and Shujie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Liu. 2021. Codexglue: A machine learning bench- Arvind Neelakantan, Pranav Shyam, Girish Sastry, mark dataset for code understanding and generation. Amanda Askell, Sandhini Agarwal, Ariel Herbert- CoRR, abs/2102.04664. Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Andrew Mayn. 2020. Openai api alchemy: Sum- Clemens Winter, Chris Hesse, Mark Chen, Eric marization – @andrewmayne. https:// Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, andrewmayneblog.wordpress.com/2020/06/ Jack Clark, Christopher Berner, Sam McCandlish, 13/openai-api-alchemy-summarization/. Alec Radford, Ilya Sutskever, and Dario Amodei. (Accessed on 03/22/2021). 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, K. Moran, C. Bernal-Cardenas,´ M. Curcio, R. Bonett, volume 33, pages 1877–1901. Curran Associates, and D. Poshyvanyk. 2020. Machine learning-based Inc. prototyping of graphical user interfaces for mobile apps. IEEE Transactions on Software Engineering, Jacob Devlin, Ming-Wei Chang, Kenton Lee, and 46(2):196–221. Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language under- Emilio Parisotto, Abdel rahman Mohamed, Rishabh standing. In Proceedings of the 2019 Conference Singh, Lihong Li, Dengyong Zhou, and Pushmeet of the North American Chapter of the Association Kohli. 2017. Neuro-symbolic program synthesis. for Computational Linguistics: Human Language In Proceedings of the 5th International Conference Technologies, Volume 1 (Long and Short Papers), on Learning Representations (ICLR 2017), Toulon, pages 4171–4186, Minneapolis, Minnesota. Associ- France. ation for Computational Linguistics. Maxim Rabinovich, Mitchell Stern, and Dan Klein. Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xi- 2017. Abstract syntax networks for code generation aocheng Feng, Ming Gong, Linjun Shou, Bing Qin, and semantic parsing. In Proceedings of the 55th An- Ting Liu, Daxin Jiang, and Ming Zhou. 2020. Code- nual Meeting of the Association for Computational BERT: A pre-trained model for programming and Linguistics (Volume 1: Long Papers), pages 1139– natural languages. In Findings of the Association 1149, Vancouver, Canada. Association for Computa- for Computational Linguistics: EMNLP 2020, pages tional Linguistics. Alex Robinson. 2019. Sketch2code: Generating a web- site from a paper mockup. Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer- generator networks. In Proceedings of the 55th An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073– 1083, Vancouver, Canada. Association for Computa- tional Linguistics. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Pro- cessing Systems, volume 30. Curran Associates, Inc. Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers), pages 440–450, Vancouver, Canada. Association for Computational Linguistics. Zhihao Zhu, Zhan Xue, and Zejian Yuan. 2019. Auto- matic graphics program generation using attention- based hierarchical decoder. In Computer Vision – ACCV 2018, pages 181–196, Cham. Springer Inter- national Publishing. Supplementary Material: Appendices
A SAR Compilation Algorithm "$Type":"TextBox","$Version":"6", "Hint":"Hint for TextBox1","Uuid":"913409813"},{"$Name": "Button1", "$Type":"Button", Algorithm 1: Compiling SAR to .scm, .bky "$Version":"6","Text":"Speak","Uuid": "955068562"}, 1 Input SAR tokens, LiteralDict {"$Name":"TextToSpeech1", 2 Output .scm, .bky "$Type":"TextToSpeech","$Version":"5", "Uuid":"1305598760"}]}} 3 scm = initializeSCM() bky = |# initializeBKY() Listing 1: A sample scm file representing the visual 4 token in complist for do components of the app in Fig. 1. The blue portion lists 5 if isComponentStart(token) then the components of the app. 6 uuid = generateNumUUID() 7 n = getCompNum(token) C Blockly (.bky) Logical Components 8 if token.hasArgument() then 9 args = fetchArgs(token,
#| Listing 2: A sample bky file representing the logical $JSON components of the app in Fig. 1. The colored lines {"authURL":["ai2.appinventor.mit.edu"], represent different blocks. "YaVersion":"208", "Source":"Form", "Properties":{"$Name":"Screen1","$Type": "Form","$Version":"27", "AppName":"speak_it","Title":"Screen1", "Uuid":"0", "$Components":[{"$Name":"TextBox1", D GPT-3 Prompt
How to make an app with these components : button, switch, textbox, accelerometer, audio player, video player, text2speech random video player app. – make an app with a video player with a random video, a button named “play” and a button named “pause”. When the first button is pressed, start the video. When the second button is pressed pause the video. A time speaking app – an app with a button, a clock and a text2speech. When the button is clicked, speak the time. Display time app – create an app with a button, a timepicker, and a label. When the button is pressed, set the label to the time. A messeging app – create an app with a with a textbox, and a button named ”send”, and a label. When the button is pressed, set label to textbox text. Login form – create an app with a textbox, a passwordbox, and a button named ”login”. Search interface – make an application with a textbox, and a button named “search”. siren app – create an app with a music player with source “siren sound.mp3”, and a button. When the button is pressed, play the audio. An arithmatic addition app gui – make an app with a textbox, a textbox, and a button named “+”. vibration alert app – create an app with an accelerometer, and a text2speech. When the accelerometer is shaken, speak “vibration detected”. {A new prompt} –
Table 5: Prompt used to generate app description with GPT-3. A new unseen prompt is added at the end and the model is tasked to continue generating text in the same pattern. This method is known as Few Shot text generation (Brown et al., 2020).