Text2App: A Framework for Creating Android Apps from Text Descriptions Masum Hasan1*, Kazi Sajeed Mehrab1*, Wasi Uddin Ahmad2, and Rifat Shahriyar1

1Bangladesh University of Engineering and Technology (BUET) 2University of California, Los Angeles (UCLA) [email protected], [email protected], [email protected] [email protected]

Abstract Natural language description: Create an app with a textbox, a button named “Speak”, and a We present Text2App – a framework that al- text2speech. When the button is lows users to create functional Android appli- clicked, speak the text in the text box.

cations from natural language specifications. Simplified App Representation: The conventional method of source code gen- On click eration tries to generate source code directly, Happy text to which is impractical for creating complex soft- forming natural language into an abstract in- termediate formal language representing an ap- plication with a substantially smaller number of tokens. The intermediate formal representa- tion is then compiled into target source codes. This abstraction of programming details al- lows seq2seq networks to learn complex ap- plication structures with less overhead. In or- Literal Dictionary: der to train sequence models, we introduce a { “STRING0”: “Speak” data synthesis method grounded in a human } survey. We demonstrate that Text2App gener- alizes well to unseen combination of app com- Figure 1: An example app created by our system that ponents and it is capable of handling noisy nat- speaks the textbox text on button press. The natural lan- ural language instructions. We explore the pos- guage is machine translated to a simpler intermediate sibility of creating applications from highly ab- formal language which is compiled into an app source stract instructions by coupling our system with code. Literals are separated before machine translation. GPT-3 – a large pretrained language model. We perform an extensive human evaluation and identify the capabilities and limitations 2016), with the aspiration to automatically gen- of our system. The source code, a ready-to- run demo notebook, and a demo video are erate full-fledged software systems further down publicly available at https://github.com/ the road. To date, however, the task of source code text2app/Text2App. generation has turned out to be highly difficult – the arXiv:2104.08301v2 [cs.CL] 7 Jul 2021 best deep neural networks, consisting of hundreds 1 Introduction of millions of parameters and trained with hundreds of gigabytes of data, fails to achieve an accuracy Mobile application developers often have to build higher than 20% (Ahmad et al., 2021; Lu et al., applications from natural language requirements 2021). Till date, the ambition to produce software provided by their clients or managers. An auto- automatically from natural language descriptions mated tool to build functional applications from has remained a distant reality. such natural language descriptions will signifi- In this work, we present Text2App, a novel cantly value this application development process. pipeline to generate Android Mobile Applications For many years, researchers have been trying to (app) from natural language (NL) descriptions. In- generate source code from natural language de- stead of an end-to-end learning-based model, we scriptions (Yin and Neubig, 2017; Ling et al., break down the challenging task of app develop- * Equal contribution ment into modular components and apply learning- based methods only where necessary. We create a application with a substantially smaller number of formal language named Simplified App Represen- tokens, allowing seq2seq models to generate intri- tation (SAR), to represent an app with a minimal cate apps in a few decoding steps, which otherwise number of tokens and train a sequence-to-sequence would be unsolvable by current sequential models. neural network to generate this formal representa- We design a formal language named Simplified tion from an NL. Fig.1 shows an example of a for- App Representation (SAR) that captures the app mal representation created from a given NL. Using design, components, and functionalities in a small a custom-made compiler, we convert the simpli- number of tokens (Section 3.1). We further develop fied formal representation to the application source a SAR compiler that converts a SAR to an appli- code from which a functional app can be built. We cation source code (Section 3.2). Using MIT App create a data synthesis method and a BERT-based Inventor1 – a popular, accessible application devel- NL augmentation method to synthesize realistic opment tool – the source code can be compiled to NL-SAR parallel corpus. functional app in a matter of minutes. Training a We demonstrate that the compact app represen- sequence-to-sequence neural network for translat- tation allows seq2seq models to generate app from ing a natural language to SAR requires a parallel significantly noisy input, even being able to pre- NL-SAR corpus. However, human annotation of dict combinations it has not seen during training. such a corpus is difficult, and it limits our capa- Moreover, we open source our implementation to bility to add new components and functionalities. the community, and lay down the groundwork to Instead, we conduct a human survey to understand extend the features and functionalities of Text2App user perception of text-based app development and beyond what we demonstrated in this paper. app description pattern (Section 3.3), and based on this survey, we create a data synthesis method to 2 Related Works automatically generate fluent natural language de- Historically, deep learning based program genera- scriptions of apps along with corresponding SARs tion tended to focus on generating unit functions or (Section 3.4). To make sure our synthetic dataset methods from natural language instructions using is not monotonous, we introduce a BERT based sequence-to-sequence or sequence-to-tree architec- data augmentation method (Section 3.5). Using the tures (Ling et al., 2016; Yin and Neubig, 2017; synthesized and augmented parallel NL-SAR data, Brockschmidt et al., 2019; Parisotto et al., 2017; we train multiple sequence-to-sequence neural net- Rabinovich et al., 2017; Ahmad et al., 2021; Lu works to predict SAR from a given text description et al., 2021). The other type of works in program of an app, which is then compiled into functional generation that sparked researcher’s interest is gen- apps (Section 3.6). Fig.2 describes each step in erating GUI source code from a screenshot, hand- our natural language to app generation process. We drawn image, or text description of the GUI (Bel- also discuss how a pretrained language model, such tramelli, 2018; Jain et al., 2019; Robinson, 2019; as GPT-3, can be used with our system as an exter- Zhu et al., 2019; Moran et al., 2020; Kolthoff, nal knowledge-base for simplifying abstract human 2019). These works are limited to generating GUI instructions (Section 3.7). Our system is built to be design only, and does not naturally extend to func- modular, where each module is self-contained: in- tionality based programming. To the best of our dependent and with a single, well-defined purpose. knowledge, ours is the first work on developing This allows us to modify one part of the system working software with interdependent functional without affecting the others and debug the system components from natural language description. to pinpoint any error. Literals like strings, numbers are separated dur- 3 Text2App ing the preprocessing and are re-introduced during compilation. Contrary to conventional program- Text2App is a framework that aims to build oper- ming languages, unless a user specifies a detail of ational mobile applications from natural language an app component, a suitable default is assumed. (NL) specifications. We make this possible by trans- This allows the user to describe an app more natu- lating a specification to an intermediate, compact, rally and also reduces unnecessary overhead from formal representation which is compiled into the the sequential model. application source code in a later step. This inter- mediate language helps our system represent an 1https://appinventor.mit.edu/ Simplified App { "STRING0": "Speak" } Representation (SAR) Create an app with a create an app with a textbox and a button text box and a button Encoder Decoder ... when ... ... Natural Language Tokenized and Description of App Pre-processed Seq2seq Neural Network

{ "STRING0": "Speak" }

.apk MIT App Inventor MIT App Inventor SAR Compiler Ready to Use! backend Compiler Source (.scm, .bky)

Figure 2: Text2App Prediction Pipeline. A given text is formatted and passed to a seq2seq network to be translated into SAR. Using a SAR Compiler, it is converted to App Inventor project, which can be built into an application.

3.1 Simplified App Representation (SAR) an animated ball has properties ‘color’, ‘speed’, SAR is an abstract, intermediate, formal language ‘radius’, etc. Such a property is called an argu- that represents a mobile application in our system. ment (). The values of such arguments are We design SAR to be minimal and compact, at the indicated by . As an example, Figure1 same time, to completely describe an application. shows the SAR of an app containing button and We formally define the Context Free Grammar of text2speech. event triggers SAR using the following production rules: the action from the text2speech compo- nent, which uses - the text in hSARi →h screensi textbox1 as a value. hscreensi →h screeni ‘’ hscreensi | hscreeni hscreeni →h complisti hcodei hcomplisti → ‘ ’ hcompsi ‘ ’ 3.2 Converting SAR to Mobile Apps hcompsi →h compsi hcompi | hcompi hcompi → ‘’ hargsi ‘’ | ‘’ We convert SAR to MIT App Inventor (MIT AI) hcodei → ‘’ heventsi ‘’ project using a custom written compiler. The heventsi →h eventi heventsi | heventi project is then compiled into functioning app (.apk) heventi → hactionsi ‘ ’ ‘ ’ using MIT AI server. MIT AI is a popular tool for hactionsi →h actionsi hactioni | hactioni app development large community of active devel- hactioni → ‘’ hargsi ‘’ opers, rich and growing functionalities. MIT AI hargsi →h argi hargsi | hargi file structure mainly consists of a Scheme (.scm) hargi → ‘’‘’‘’ | ‘’ file consisting of components and their properties and a Blockly (.bky) file consisting code functional- Here, is the starting symbol and the ities. Appendix A shows an algorithm for our SAR tokens inside quotes are terminals. A mobile ap- to source conversion process. Appendix B and C plication in our system firstly consists of screens. respectively shows the .scm and .bky files for the Each screen contains an ordered list of visible (e.g. example shown in Fig.1. video player, textbox) or invisible (e.g. accelerom- eter, text2speech) components, which are identi- The SAR tokens have corresponding predefined fied with the tokens. Next, the template source codes. The compiler parses the to- application logic is defined within the kens and fetches their corresponding templates. By tokens. One functionality in our system is a tu- fetching and modifying the predefined templates ple containing an event, an action, and a value. with the user specified arguments, we generate the is an external or internal process that .scm and .bky files from SAR. The files are then triggers an action. is a process that compressed into an MIT AI project file (.aia). This performs a certain operation. Both and has to be uploaded to the publicly available MIT components often have properties that AI server, after which the user can debug the app, determine their identity or behavior. For example, or download it as an executable (.apk) file. Allowed components

Sample

format. (i.e. clear com- when clicked open the rear 2. twitter app - make an app with a textbox, a button ponent list followed by camera and capture an image. named “tweet”, and a label. When the button is pressed, functionality.) set the label to textbox text. Many predictions are NL: Create an app that can SAR: label is wrong. (Missing mention of a button.) User expects the auto- NL: A typical registration matic system to have form with necessary text fields 3. browser app - create an app with a textbox, a button inherent understanding and a submit button. named “go”, and a button named “back”. When the button of the real world. “go” is pressed, go to the url in the textbox. When the button “back” is pressed, go back to the previous page. Model is biased to- NL: Create an app which wards synthesizer key- has a moving ball. (‘mov- 4. Google front page - make an app with a textbox, a but- words. ing’ frequently appears with ton named “google”, and a button named “search”. When accelerometer, model predicts the button “google” is pressed, search google. When the accelerometer component.) button “search” is pressed, search the web. System misses compo- NL: Create an app that has a nents not specified. textbox labelled ”Insert Text”, Table 3: Abstract instructions to simpler app descrip- when the device is shaken, tion using GPT-3. Example 1, 2, was constrained speak the text in the textbox within our allowed set of functionalities. 3, 4, intro- duced concepts that are not yet supported in Text2App Table 4: Observation from human labeled data predic- (marked in red and italic). tions

MIT App Inventor follows a similar tree-like pat- tern, and thus can be represented as SAR. We are inviting open source community to contribute to and help grow this project. In future, we would like to experiment with generative model based mediate representation of the application. The in- data synthesis, more reliable language model based termediate formal representation allows to describe NL simplification techniques, and potentially app an app with significantly smaller number of tokens development in native languages. than native app development languages. We also 6 Conclusion design a data synthesis method guided by a human survey, that automatically generates fluent natural In this paper, we explore creating functional mo- language app descriptions and their formal repre- bile applications from natural language text de- sentations. Our AI aware design approach for a scriptions using seq2seq networks. We propose formal language can guide future programming lan- Text2App, a novel framework for natural language guage and frameworks development, where further to app translation with the help of a simpler inter- source code generation works can benefit from. Acknowledgement 1536–1547, Online. Association for Computational Linguistics. We thank Prof. Zhijia Zhao from UCR for propos- ing the problem that inspired this project idea. Vanita Jain, Piyush Agrawal, Subham Banga, Rishabh We also thank OpenAI, Google Colaboratory, Kapoor, and Shashwat Gulyani. 2019. Sketch2code: Transformation of sketches to ui in real-time using Hugging Face, MIT App Inventor community, the deep neural network. survey participants, and Prof. Anindya Iqbal for feedback regarding modularity. This project was K. Kolthoff. 2019. Automatic generation of g raphical user interface prototypes from unrestricted natural funded under the ‘Innovation Fund’ by the ICT language requirements. In 2019 34th IEEE/ACM In- Division, Government of the People’s Republic of ternational Conference on Automated Software En- Bangladesh. gineering (ASE), pages 1234–1237. Wang Ling, Phil Blunsom, Edward Grefenstette, References Karl Moritz Hermann, Toma´sˇ Kociskˇ y,´ Fumin Wang, and Andrew Senior. 2016. Latent predictor Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi networks for code generation. In Proceedings of the Ray, and Kai-Wei Chang. 2021. Unified pre-training 54th Annual Meeting of the Association for Compu- for program understanding and generation. In Pro- tational Linguistics (Volume 1: Long Papers), pages ceedings of the 2021 Conference of the North Amer- 599–609, Berlin, Germany. Association for Compu- ican Chapter of the Association for Computational tational Linguistics. Linguistics. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Tony Beltramelli. 2018. pix2code: Generating code dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, from a graphical user interface screenshot. In Pro- Luke Zettlemoyer, and Veselin Stoyanov. 2020. ceedings of the ACM SIGCHI Symposium on Engi- Ro{bert}a: A robustly optimized {bert} pretraining neering Interactive Computing Systems, pages 1–6. approach.

Marc Brockschmidt, Miltiadis Allamanis, Alexander L. Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Gaunt, and Oleksandr Polozov. 2019. Generative Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, code modeling with graphs. In International Con- Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Li- ference on Learning Representations. dong Zhou, Linjun Shou, Long Zhou, Michele Tu- fano, Ming Gong, Ming Zhou, Nan Duan, Neel Sun- Tom Brown, Benjamin Mann, Nick Ryder, Melanie daresan, Shao Kun Deng, Shengyu Fu, and Shujie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Liu. 2021. Codexglue: A bench- Arvind Neelakantan, Pranav Shyam, Girish Sastry, mark dataset for code understanding and generation. Amanda Askell, Sandhini Agarwal, Ariel Herbert- CoRR, abs/2102.04664. Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Andrew Mayn. 2020. Openai api alchemy: Sum- Clemens Winter, Chris Hesse, Mark Chen, Eric marization – @andrewmayne. https:// Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, andrewmayneblog.wordpress.com/2020/06/ Jack Clark, Christopher Berner, Sam McCandlish, 13/-api-alchemy-summarization/. Alec Radford, Ilya Sutskever, and Dario Amodei. (Accessed on 03/22/2021). 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, K. Moran, C. Bernal-Cardenas,´ M. Curcio, R. Bonett, volume 33, pages 1877–1901. Curran Associates, and D. Poshyvanyk. 2020. Machine learning-based Inc. prototyping of graphical user interfaces for mobile apps. IEEE Transactions on Software Engineering, Jacob Devlin, Ming-Wei Chang, Kenton Lee, and 46(2):196–221. Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language under- Emilio Parisotto, Abdel rahman Mohamed, Rishabh standing. In Proceedings of the 2019 Conference Singh, Lihong Li, Dengyong Zhou, and Pushmeet of the North American Chapter of the Association Kohli. 2017. Neuro-symbolic program synthesis. for Computational Linguistics: Human Language In Proceedings of the 5th International Conference Technologies, Volume 1 (Long and Short Papers), on Learning Representations (ICLR 2017), Toulon, pages 4171–4186, Minneapolis, Minnesota. Associ- France. ation for Computational Linguistics. Maxim Rabinovich, Mitchell Stern, and Dan Klein. Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xi- 2017. Abstract syntax networks for code generation aocheng Feng, Ming Gong, Linjun Shou, Bing Qin, and semantic parsing. In Proceedings of the 55th An- Ting Liu, Daxin Jiang, and Ming Zhou. 2020. Code- nual Meeting of the Association for Computational BERT: A pre-trained model for programming and Linguistics (Volume 1: Long Papers), pages 1139– natural languages. In Findings of the Association 1149, Vancouver, Canada. Association for Computa- for Computational Linguistics: EMNLP 2020, pages tional Linguistics. Alex Robinson. 2019. Sketch2code: Generating a web- site from a paper mockup. Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer- generator networks. In Proceedings of the 55th An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073– 1083, Vancouver, Canada. Association for Computa- tional Linguistics. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Pro- cessing Systems, volume 30. Curran Associates, Inc. Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers), pages 440–450, Vancouver, Canada. Association for Computational Linguistics. Zhihao Zhu, Zhan Xue, and Zejian Yuan. 2019. Auto- matic graphics program generation using attention- based hierarchical decoder. In Computer Vision – ACCV 2018, pages 181–196, Cham. Springer Inter- national Publishing. Supplementary Material: Appendices

A SAR Compilation Algorithm "$Type":"TextBox","$Version":"6", "Hint":"Hint for TextBox1","Uuid":"913409813"},{"$Name": "Button1", "$Type":"Button", Algorithm 1: Compiling SAR to .scm, .bky "$Version":"6","Text":"Speak","Uuid": "955068562"}, 1 Input SAR tokens, LiteralDict {"$Name":"TextToSpeech1", 2 Output .scm, .bky "$Type":"TextToSpeech","$Version":"5", "Uuid":"1305598760"}]}} 3 scm = initializeSCM() bky = |# initializeBKY() Listing 1: A sample scm file representing the visual 4 token in complist for do components of the app in Fig. 1. The blue portion lists 5 if isComponentStart(token) then the components of the app. 6 uuid = generateNumUUID() 7 n = getCompNum(token) C Blockly (.bky) Logical Components 8 if token.hasArgument() then 9 args = fetchArgs(token, 10 t = getTemplate(token) 14 for token in code do Button1 15 t = getTemplate(token) then 20 val = closestMatchingFile() TextToSpeech1 21 t.set(val) 22 else 25 t.set(number) 26 t.set(uuid) TextBox1 27 bky.add(t) Text 28 write(bky) B Sceme (.scm) File for Visual Components

#| Listing 2: A sample bky file representing the logical $JSON components of the app in Fig. 1. The colored lines {"authURL":["ai2.appinventor.mit.edu"], represent different blocks. "YaVersion":"208", "Source":"Form", "Properties":{"$Name":"Screen1","$Type": "Form","$Version":"27", "AppName":"speak_it","Title":"Screen1", "Uuid":"0", "$Components":[{"$Name":"TextBox1", D GPT-3 Prompt

How to make an app with these components : button, switch, textbox, accelerometer, audio player, video player, text2speech random video player app. – make an app with a video player with a random video, a button named “play” and a button named “pause”. When the first button is pressed, start the video. When the second button is pressed pause the video. A time speaking app – an app with a button, a clock and a text2speech. When the button is clicked, speak the time. Display time app – create an app with a button, a timepicker, and a label. When the button is pressed, set the label to the time. A messeging app – create an app with a with a textbox, and a button named ”send”, and a label. When the button is pressed, set label to textbox text. Login form – create an app with a textbox, a passwordbox, and a button named ”login”. Search interface – make an application with a textbox, and a button named “search”. siren app – create an app with a music player with source “siren sound.mp3”, and a button. When the button is pressed, play the audio. An arithmatic addition app gui – make an app with a textbox, a textbox, and a button named “+”. vibration alert app – create an app with an accelerometer, and a text2speech. When the accelerometer is shaken, speak “vibration detected”. {A new prompt} –

Table 5: Prompt used to generate app description with GPT-3. A new unseen prompt is added at the end and the model is tasked to continue generating text in the same pattern. This method is known as Few Shot text generation (Brown et al., 2020).