Learning Programmatic Idioms for Scalable Semantic Parsing

Srinivasan Iyer†, Alvin Cheung§ and Luke Zettlemoyer†‡
†Paul G. Allen School of Computer Science and Engineering, Univ. of Washington, Seattle, WA
{sviyer, lsz}@cs.washington.edu
§Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA
[email protected]
‡Facebook AI Research, Seattle
[email protected]

Abstract

Programmers typically organize executable source code using high-level coding patterns or idiomatic structures such as nested loops, exception handlers and recursive blocks, rather than as individual code tokens. In contrast, state of the art (SOTA) semantic parsers still map natural language instructions to source code by building the code syntax tree one node at a time. In this paper, we introduce an iterative method to extract code idioms from large source code corpora by repeatedly collapsing most-frequent depth-2 subtrees of their syntax trees, and we train semantic parsers to apply these idioms during decoding. Applying idiom-based decoding on a recent context-dependent semantic parsing task improves the SOTA by 2.2% BLEU score while reducing training time by more than 50%. This improved speed enables us to scale up the model by training on an extended training set that is 5× larger, to further move up the SOTA by an additional 2.3% BLEU and 0.9% exact match. Finally, idioms also significantly improve accuracy of semantic parsing to SQL on the ATIS-SQL dataset when training data is limited.

Figure 1: (a) Syntax tree based decoding for semantic parsing uses as many as 11 rules (steps) to produce the outer structure of a very frequently used if-then-else code snippet. (b) Direct application of an if-then-else idiom during decoding leads to improved accuracy and training time.

1 Introduction

When programmers translate Natural Language (NL) specifications into executable source code, they typically start with a high-level plan of the major structures required, such as nested loops, conditionals, etc., and then proceed to fill in specific details into these components. We refer to these high-level structures (Figure 1(b)) as code idioms (Allamanis and Sutton, 2014). In this paper, we demonstrate how learning to use code idioms leads to an improvement in model accuracy and training time for the task of semantic parsing, i.e., mapping intents in NL into general purpose source code (Iyer et al., 2017; Ling et al., 2016).

State-of-the-art semantic parsers are neural encoder-decoder models, where decoding is guided by the target programming language grammar (Yin and Neubig, 2017; Rabinovich et al., 2017; Iyer et al., 2018) to ensure syntactically valid programs. For general purpose programming languages with large formal grammars, this can easily lead to long decoding paths even for short snippets of code. For example, Figure 1 shows an intermediate parse tree for a generic if-then-else code snippet, for which the decoder requires as many as eleven decoding steps before ultimately filling in the slots for the if condition, the then expression and the else expression.

However, the if-then-else block can be seen as a higher level structure, such as shown in Figure 1(b), that can be applied in one decoding step and reused in many different programs. In this paper, we refer to frequently recurring subtrees of programmatic parse trees as code idioms, and we equip semantic parsers with the ability to learn and directly generate idiomatic structures as in Figure 1(b).

We introduce a simple iterative method to extract idioms from a dataset of programs by repeatedly collapsing the most frequent depth-2 subtrees of syntax parse trees. Analogous to the byte pair encoding (BPE) method (Gage, 1994; Sennrich et al., 2016) that creates new subtokens of words by repeatedly combining frequently occurring adjacent pairs of subtokens, our method takes a depth-2 syntax subtree and replaces it with a tree of depth 1 by removing all the internal nodes. This method is in contrast with the approach using probabilistic tree substitution grammars (pTSG) taken by Allamanis and Sutton (2014), who use the explanation quality of an idiom to prioritize idioms that are more interesting, with an end goal of suggesting useful idioms to programmers using IDEs. Once idioms are extracted, we greedily apply them to semantic parsing training sets to provide supervision for learning to apply idioms.

We evaluate our approach on two semantic parsing tasks that map NL into 1) general-purpose source code, and 2) executable SQL queries, respectively. On the first task, i.e., context dependent semantic parsing (Iyer et al., 2018) using the CONCODE dataset, we improve the state of the art (SOTA) by 2.2% BLEU score. Furthermore, generating source code using idioms results in a more than 50% reduction in the number of decoding steps, which cuts training time to less than half, from 27 to 13 hours. Taking advantage of this reduced training time, we further push the SOTA on CONCODE to an EM of 13.4 and a BLEU score of 28.9 by training on an extended version of the training set (with 5× the number of training examples). On the second task, i.e., mapping NL utterances into SQL queries for a flight information database (ATIS-SQL; Iyer et al. (2017)), using idioms significantly improves denotational accuracy over SOTA models when a limited amount of training data is used, and also marginally outperforms the SOTA when the full training set is used (more details in Section 7).

2 Related Work

Neural encoder-decoder models have proved effective in mapping NL to logical forms (Dong and Lapata, 2016) and also for directly producing general purpose programs (Iyer et al., 2017, 2018). Ling et al. (2016) use a sequence-to-sequence model with attention and a copy mechanism to generate source code. Instead of directly generating a sequence of code tokens, recent methods focus on constrained decoding mechanisms to generate syntactically correct output using a decoder that is either grammar-aware or has a dynamically determined modular structure paralleling the structure of the abstract syntax tree (AST) of the code (Rabinovich et al., 2017; Krishnamurthy et al., 2017; Yin and Neubig, 2017). Iyer et al. (2018) use a similar decoding approach but use a specialized context encoder for the task of context-dependent code generation. We augment these neural encoder-decoder models with the ability to decode in terms of frequently occurring higher level idiomatic structures to achieve gains in accuracy and training time.

Another related method to produce source code is using sketches, which are code snippets containing slots in the place of low-level information such as variable names, method arguments, and literals. Dong and Lapata (2018) generate such sketches using programming language-specific sketch creation rules and use them as intermediate representations to train token-based seq2seq models that convert NL to logical forms. Hayati et al. (2018) retrieve sketches from a large training corpus and modify them for the current input; Murali et al. (2018) use a combination of neural learning and type-guided combinatorial search to convert existing sketches into executable programs, whereas Nye et al. (2019) additionally generate the sketches before synthesising programs. Our idiom-based decoder learns to produce commonly used subtrees of programming syntax trees in one decoding step, where the non-terminal leaves function as slots that can be subsequently expanded in a grammar-aware fashion. Code idioms can be roughly viewed as a tree-structured generalization of sketches that can be automatically extracted from large code corpora for any programming language and, unlike sketches, can also be nested with other idioms or grammar rules.

More closely related to the idioms that we use for decoding is Allamanis and Sutton (2014), who develop a system (HAGGIS) to automatically mine idioms from large code bases. They focus on finding interesting and explainable idioms, e.g., those that can be included as preset code templates in programming IDEs. Instead, we learn frequently used idioms that can be easily associated with NL phrases in our dataset. The production of large subtrees in a single step directly translates to a large speedup in training and inference. Concurrent with our research, Shin et al. (2019) also develop a system (PATOIS) for idiom-based semantic parsing and demonstrate its benefits on the Hearthstone (Ling et al., 2016) and Spider (Yu et al., 2018) datasets. While we extract idioms by collapsing frequently occurring depth-2 AST subtrees and apply them greedily during training, they use non-parametric Bayesian inference for idiom extraction and train neural models to either apply entire idioms or generate their full bodies.

3 Idiom Aware Encoder-Decoder Models

We aim to train semantic parsers having the ability to use idioms during code generation. To do this, we first extract frequently used idioms from the training set, and then provide them as supervision to the semantic parser's learning algorithm.

Formally, if a semantic parser decoder is guided by a grammar G = (N, Σ, R), where N and Σ are the sets of non-terminals and terminals respectively, and R is the set of production rules of the form A → β, with A ∈ N and β ∈ (N ∪ Σ)*, we would like to construct an idiom set I with rules of the form B → γ, with B ∈ N and γ ∈ (N ∪ Σ)*, such that B ⇒≥2 γ under G, i.e., γ can be derived in two or more steps from B under G. For the example in Figure 1, R would contain rules for expanding each non-terminal, such as Statement → if ParExpr Statement IfOrElse and ParExpr → ( Expr ), whereas I would contain the idiomatic rule Statement → if ( Expr ) { Expr ; } else { Expr ; }. The decoder builds trees from Ĝ = (N, Σ, R ∪ I). Although the sets of valid programs under G and Ĝ are exactly the same, this introduction of ambiguous rules into G in the form of idioms presents an opportunity to learn shorter derivations. In the next two sections, we describe the idiom extraction process, i.e., how I is chosen, and the idiom application process, i.e., how the decoder is trained to learn to apply idioms.

4 Idiom Extraction

Algorithm 1 describes the procedure to add idiomatic rules I to the regular production rules R. Our goal is to populate the set I by identifying frequently occurring idioms (subtrees) from the programs in training set D. Since enumerating all subtrees of every AST in the training set is infeasible, we observe that all subtrees s′ of a frequently occurring subtree s are at least as frequent as s, so we take a bottom-up approach by repeatedly collapsing the most frequent depth-2 subtrees.

Algorithm 1: Idiom Extraction
1   Procedure Extract-Idioms(D, G, K)
      Input: D → training programs
      Input: G → parsing grammar
      Input: K → number of idioms
2     T ← {}                                  // stores all parse trees
3     for d ∈ D do
4         T ← T ∪ Parse-Tree(d, G)
5     end
6     I ← {}
7     for i ← 1 to K do
8         s ← Most-Frequent-Depth-2-Subtree(T)
9         for t ∈ T do
10            t ← Collapse-Subtree(t, s)
11        end
12        I ← I ∪ {s}
13    end
14  end
15  Procedure Collapse-Subtree(t, s)
      Input: t → parse tree
      Input: s → depth-2 subtree
16    while s subtree-of t do
17        Replace s in t with Collapse(s)
18    end
19  end
20  Procedure Collapse(s)
      Input: s → depth-2 subtree
21    frontier ← leaves(s)
22    return Tree(root(s), frontier)
23  end

Intuitively, this can be viewed as a particular kind of generalization of the BPE algorithm (Gage, 1994; Sennrich et al., 2016) for sequences, where new subtokens are created by repeatedly combining frequently occurring adjacent pairs of subtokens.
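To make the collapsing step concrete, the following is a minimal Python sketch of this BPE-style extraction loop. It assumes, for simplicity, that every idiom merges the rule at a node with the rule at exactly one of its expanded children (a special case of the depth-2 subtrees used in Algorithm 1); the Node class and helper names are illustrative rather than part of any released implementation.

from collections import Counter
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A parse-tree node: a grammar symbol with its children (empty for leaves)."""
    symbol: str
    children: List["Node"] = field(default_factory=list)

    def rule(self):
        # The production applied at this node, e.g. ("Statement", ("if", "ParExpr", "Statement")).
        return (self.symbol, tuple(c.symbol for c in self.children))

def depth2_subtrees(root):
    """Yield (node, child_index) pairs, each identifying a depth-2 subtree:
    the rule at node together with the rule at one expanded child."""
    stack = [root]
    while stack:
        node = stack.pop()
        for i, child in enumerate(node.children):
            if child.children:
                yield node, i
            stack.append(child)

def collapse(node, i):
    """Collapse the depth-2 subtree at (node, i): splice the child's children
    into the parent, turning two rules into one idiomatic rule."""
    child = node.children[i]
    node.children[i:i + 1] = child.children

def extract_idioms(trees, k):
    """Greedily collapse the most frequent parent/child rule pair k times,
    mirroring the iterative extraction loop of Algorithm 1."""
    idioms = []
    for _ in range(k):
        counts = Counter(
            (node.rule(), i, node.children[i].rule())
            for tree in trees
            for node, i in depth2_subtrees(tree)
        )
        if not counts:
            break
        best = counts.most_common(1)[0][0]
        for tree in trees:                      # replace every occurrence in the corpus
            changed = True
            while changed:
                changed = False
                for node, i in depth2_subtrees(tree):
                    if (node.rule(), i, node.children[i].rule()) == best:
                        collapse(node, i)
                        changed = True
                        break
        idioms.append(best)
    return idioms

Because collapsed rules are themselves counted in later iterations, repeated application grows idioms beyond depth 2, analogous to Figure 2(c)-(d).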


Note that subtrees of parse trees have an additional constraint, i.e., either all or none of the children of a non-terminal node are included, since a grammar rule has to be used entirely or not at all.

We perform idiom extraction in an iterative fashion. We first populate T with all parse trees of programs in D using grammar G (Step 4). Each iteration then comprises retrieving the most frequent depth-2 subtree s from T (Step 8), followed by post-processing T to replace all occurrences of s in T with a collapsed (depth-1) version of s (Steps 10 and 17). The collapse function (Step 20) simply takes a subtree, removes all its internal nodes and attaches its leaves directly to its root (Step 22). The collapsed version of s is a new idiomatic rule (a depth-1 subtree), which we add to our set of idioms I (Step 12). We illustrate two iterations of this algorithm in Figure 2 ((a)-(b) and (c)-(d)). Assuming (a) is the most frequent depth-2 subtree in the dataset, it is transformed into the idiomatic rule in (b). Larger idiomatic trees are learned by combining several depth-2 subtrees as the algorithm progresses. This is shown in Figure 2(c), which contains the idiom extracted in (b) within it owing to the post-processing of the dataset after idiom (b) is extracted (Step 10 of Algorithm 1), which effectively makes the idiom in (d) a depth-3 idiom. We perform idiom extraction for K iterations. In our experiments, we vary the value of K based on the number of idioms we would like to extract.

Figure 2: Two steps of the idiom extraction process described in Algorithm 1. (a) First, we find the most frequent depth-2 syntax subtree under grammar G in dataset D, collapse it to produce a new production rule (b), and replace all occurrences in the dataset with the collapsed version. Next, (c) is the most frequent depth-2 subtree, which now happens to contain (b) within it. (c) is collapsed to form an idiom (d), which is effectively a depth-3 idiom.

5 Model Training with Idioms

Once a set of idioms I is obtained, we next train semantic parsing models to apply these idioms while decoding. We do this by supervising grammar rule generation in the decoder using a compressed set of rules for each example, obtained using the idiom set I (see Algorithm 2). More concretely, we first obtain the parse tree t_i (or grammar rule set p_i) for each training program y_i under grammar G (Step 3) and then greedily collapse each depth-2 subtree in t_i corresponding to every idiom in I (Step 5). Once t_i cannot be further collapsed, we translate t_i into production rules r_i based on the collapsed tree, with |r_i| ≤ |p_i| (Step 7). This process is illustrated in Figure 3, where we perform two applications of the first idiom from Figure 2(b), followed by one application of the second idiom from Figure 2(d), after which the tree cannot be further compressed using those two idioms. The final tree can be represented using |r_i| = 2 rules instead of the original |p_i| = 5 rules. The decoder is then trained similarly to previous approaches (Yin and Neubig, 2017) using the compressed set of rules. We observe a rule set compression of more than 50% in our experiments (Section 7).

Algorithm 2: Training Example Compression
1   Procedure Compress(D, G, I)
      Input: D → training programs
      Input: G → parsing grammar
      Input: I → idiom set
2     for d ∈ D do
3         t ← Parse-Tree(d, G)
4         for i ∈ I do
5             t ← Collapse-Subtree(t, i)
6         end
7         Train decoder using Production-Rules(t)
8     end
9   end

Figure 3: (i) Application of the two idioms extracted in Figure 2 on a new training example. (ii) We first perform two applications of the idiom in Figure 2b, (iii) followed by an application of the idiom in Figure 2d. The resulting tree can be represented with just 2 parsing rules instead of 5.
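Continuing the illustrative sketch from Section 4, the greedy compression of a single training example (Algorithm 2) could look as follows; apply_idioms and production_rules are our names, reusing the Node representation defined earlier, and are not the paper's implementation.

def apply_idioms(tree, idioms):
    """Greedily re-apply extracted idioms to one parse tree (Step 5 of
    Algorithm 2), collapsing until no idiom matches anywhere in the tree."""
    idiom_set = set(idioms)
    changed = True
    while changed:
        changed = False
        for node, i in depth2_subtrees(tree):
            if (node.rule(), i, node.children[i].rule()) in idiom_set:
                collapse(node, i)
                changed = True
                break
    return tree

def production_rules(tree):
    """Flatten the (possibly collapsed) tree into the rule sequence r_i that
    supervises the decoder; |r_i| <= |p_i|, the uncollapsed rule count."""
    rules = []
    stack = [tree]
    while stack:
        node = stack.pop()
        if node.children:
            rules.append(node.rule())
            stack.extend(reversed(node.children))
    return rules

The decoder is then supervised with production_rules(apply_idioms(tree, idioms)) in place of the original, longer rule sequence.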

Figure 4: Context dependent code generation task of Iyer et al. (2018), which involves mapping an NL query together with a set of context variables and methods (each having an identifier name and data type) (a) into source code, represented as a sequence of grammar rules (b).

Figure 5: Example NL utterance with its corresponding executable SQL query from the ATIS-SQL dataset.

6 Experimental Setup

We apply our approach on 1) the context dependent encoder-decoder model of Iyer et al. (2018) on the CONCODE dataset, where we outperform an improved version of their best model, and 2) the task of mapping NL utterances to SQL queries on the ATIS-SQL dataset (Iyer et al., 2017), where an idiom-based model using the full training set outperforms the SOTA, also achieving significant gains when using a reduced training set.

6.1 Context Dependent Semantic Parsing

The CONCODE task involves mapping an NL query together with a class context, comprising a list of variables (with types) and methods (with return types), into the source code of a class member function. Figure 4(a) shows an example where the context comprises variables and methods (with types) that would normally exist in a class that implements a vector, such as vecElements and dotProduct(). Conditioned on this context, the task involves mapping the NL query Adds a scalar to this vector in place into a sequence of parsing rules to generate the source code in Figure 4(b). Formally, the task is: given an NL utterance q, a set of context variables {v_i} with types {t_i}, and a set of context methods {m_i} with return types {r_i}, predict the set of parsing rules {a_i} of the target program. Their best performing model is a neural encoder-decoder with a context-aware encoder and a decoder that produces a sequence of Java grammar rules.
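For concreteness, one instance of this task can be pictured as the following Python container; this is a sketch only, and the field names are ours rather than the dataset's actual serialization format.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ConcodeExample:
    nl: str                           # query q, e.g. "Adds a scalar to this vector in place"
    variables: List[Tuple[str, str]]  # (type t_i, name v_i), e.g. ("double[]", "vecElements")
    methods: List[Tuple[str, str]]    # (return type r_i, name m_i), e.g. ("float", "dotProduct")
    target_rules: List[str]           # parsing rules {a_i} of the member function to be predicted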

Baseline Model: We follow the approach of Iyer et al. (2018) with three major modifications in their encoder that yield improvements in both speed and accuracy (Iyer-Simp). First, in addition to camel-case splitting of identifier tokens, we use byte-pair encoding (BPE) (Sennrich et al., 2016) on all NL tokens, identifier names and types, and embed all these BPE tokens using a single embedding matrix. Next, we replace their RNN that contextualizes the subtokens of identifiers and types with an average of the subtoken embeddings instead. Finally, we consolidate their three separate RNNs for contextualizing NL, variable names with types, and method names with types into a single shared RNN, which greatly reduces the number of model parameters. Formally, let {q_i} represent the set of BPE tokens of the NL, and let {t_ij}, {v_ij}, {r_ij} and {m_ij} represent the j-th BPE token of the i-th variable type, variable name, method return type, and method name, respectively. First, all these elements are embedded using a BPE token embedding matrix B to give us q_i, t_ij, v_ij, r_ij and m_ij. Using Bi-LSTM f, the encoder then computes:

h_1, ..., h_z = f(q_1, ..., q_z)          (1)
v_i = Avg(v_i1, ..., v_ij)                (2)
similarly compute m_i, t_i, r_i           (3)
t̂_i, v̂_i = f(t_i, v_i)                    (4)
r̂_i, m̂_i = f(r_i, m_i)                    (5)

Model                    Exact        BLEU
Seq2Seq†                 3.2 (2.9)    23.5 (21.0)
Seq2Prod†                6.7 (5.6)    21.3 (20.6)
Iyer et al. (2018)†      8.6 (7.1)    22.1 (21.3)
Iyer-Simp                12.5 (9.8)   24.4 (23.2)
Iyer-Simp + 200 idioms   12.2 (9.8)   26.6 (24.0)

Table 1: Exact Match and BLEU scores for our simplified model (Iyer-Simp) with and without idioms, compared with results from Iyer et al. (2018)† on the test (validation) set of CONCODE. Iyer-Simp achieves significantly better EM and BLEU score and reduces training time from 40 hours to 27 hours. Augmenting the decoding process with 200 code idioms further pushes up BLEU and reduces training time to 13 hours.
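The encoder computation in Eqs. (1)-(5) can be sketched in PyTorch as below, assuming a single context variable and a single context method per example; the class and argument names are ours, and the attention mechanism, copying, and the grammar-rule decoder of the full model are omitted.

import torch
import torch.nn as nn

class IyerSimpEncoder(nn.Module):
    def __init__(self, bpe_vocab_size, emb_dim, hidden_dim):
        super().__init__()
        # Single BPE embedding matrix B shared by NL tokens, identifier names and types.
        self.B = nn.Embedding(bpe_vocab_size, emb_dim)
        # Single shared Bi-LSTM f, replacing three separate context RNNs.
        self.f = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def avg_embed(self, subtoken_ids):
        # Eqs. (2)-(3): average the BPE subtoken embeddings of an identifier or type.
        return self.B(subtoken_ids).mean(dim=1)

    def forward(self, nl_ids, var_type_ids, var_name_ids, ret_type_ids, meth_name_ids):
        # Eq. (1): contextualize NL subtokens with the shared Bi-LSTM.
        h, _ = self.f(self.B(nl_ids))                      # (batch, z, 2 * hidden_dim)
        t = self.avg_embed(var_type_ids)                   # t_i
        v = self.avg_embed(var_name_ids)                   # v_i
        r = self.avg_embed(ret_type_ids)                   # r_i
        m = self.avg_embed(meth_name_ids)                  # m_i
        # Eqs. (4)-(5): run each (type, name) pair through the same Bi-LSTM.
        tv, _ = self.f(torch.stack([t, v], dim=1))
        rm, _ = self.f(torch.stack([r, m], dim=1))
        t_hat, v_hat = tv[:, 0], tv[:, 1]
        r_hat, m_hat = rm[:, 0], rm[:, 1]
        # h and the hatted context states feed the decoder's attention.
        return h, (t_hat, v_hat, r_hat, m_hat)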

Then h_1, ..., h_z and t̂_i, v̂_i, r̂_i, m̂_i are passed to the attention mechanism in the decoder, exactly as in Iyer et al. (2018). The decoder remains the same as described in Iyer et al. (2018), and produces a probability distribution over grammar rules at each time step (full details in the Supplementary Materials). This forms our baseline model (Iyer-Simp).

Idiom Aware Training: To utilize idioms, we augment this decoder by retrieving the top-K most frequent idioms from the training set (Algorithm 1), followed by post-processing the training set by greedily applying these idioms (Algorithm 2); we denote this model as Iyer-Simp-K. We evaluate all our models on the CONCODE dataset, which was created using Java source files from github.com. It contains 100K tuples of (NL, code, context) for training, 2K tuples for development, and 2K tuples for testing. We use a BPE vocabulary of 10K tokens (for matrix B) and get the best validation set results using the original hyperparameters used by Iyer et al. (2018). Since idiom aware training is significantly faster than training without idioms, it enables us to train on an additional 400K training examples that Iyer et al. (2018) released as part of CONCODE. We report exact match accuracy, corpus-level BLEU score (which serves as a measure of partial credit) (Papineni et al., 2002), and training time for all these configurations.

Model                 Exact   BLEU   Training Time (h)
Seq2Seq†              2.9     21.0   12
Seq2Prod†             5.6     20.6   36
Iyer et al. (2018)†   7.1     21.3   40
Iyer-Simp             9.8     23.2   27
 + 100 idioms         9.8     24.5   15
 + 200 idioms         9.8     24.0   13
 + 300 idioms         9.6     23.8   12
 + 400 idioms         9.7     23.8   11
 + 600 idioms         9.9     22.7   11

Table 2: Variation in Exact Match, BLEU score, and training time on the validation set of CONCODE with the number of idioms used. After the top-200 idioms are used, accuracies start to reduce, since using more specialized idioms can hurt model generalization. Training time plateaus after considering the top-600 idioms.
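Purely as an illustration of this preprocessing, the extraction and compression sketches from Sections 4 and 5 could be wired together as follows for an Iyer-Simp-K style setup; parse_java, train_set and the example fields are hypothetical stand-ins, not the actual CONCODE pipeline.

K = 200
# extract_idioms collapses the trees it is given, so parse a fresh copy of
# each program for the compression pass.
idioms = extract_idioms([parse_java(ex.code) for ex in train_set], K)
compressed = [
    production_rules(apply_idioms(parse_java(ex.code), idioms))
    for ex in train_set
]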

6.2 Semantic Parsing to SQL

This task involves mapping NL utterances into executable SQL queries. We use the ATIS-SQL dataset (Iyer et al., 2017), which contains NL questions posed to a flight database, together with their SQL queries and a database with 25 tables to execute them against (Figure 5 shows an example). The dataset is split into 4,379 training, 491 validation, and 448 testing examples following Kwiatkowski et al. (2011).

The SOTA by Iyer et al. (2017) is a Seq2Seq model with attention and achieves a denotational accuracy of 82.5% on the test set. Since our idiom-based approach requires a model that uses grammar-rule based decoding, we use a modified version of the Seq2Prod model described in Iyer et al. (2018) (based on Yin and Neubig (2017)) as a baseline model (Seq2Prod), and augment the decoder with SQL idioms (Seq2Prod-K).

Seq2Prod is an encoder-decoder model, where the encoder executes an n-layer bi-LSTM over NL embeddings and passes the final layer LSTM hidden states to an attention mechanism in the decoder. Note that the Seq2Prod encoder described in Iyer et al. (2018) encodes a concatenated sequence of NL and context, but ATIS-SQL instances do not include contextual information. Thus, if q_i represents each lemmatized token of the NL, the tokens are first embedded using a token embedding matrix B to give us q_i. Using Bi-LSTM f, the encoder then computes:

h_1, ..., h_z = f(q_1, ..., q_z)          (6)

Then h_1, ..., h_z are passed to the attention mechanism in the decoder.

The sequential LSTM decoder uses attention and produces a sequence of grammar rules {a_t}. The decoder hidden state at time t, s_t, is computed based on an embedding of the current non-terminal n_t to be expanded, an embedding of the previous production rule a_{t-1}, an embedding of the parent production rule par(n_t) that produced n_t, the previous decoder state s_{t-1}, and the decoder state of the LSTM cell that produced n_t, denoted as s_{n_t}:

s_t = LSTM_f(n_t, a_{t-1}, par(n_t), s_{t-1}, s_{n_t})          (7)

s_t is then used for attention and finally produces a distribution over grammar rules.

We make two modifications in this decoder. First, we remove the dependence of LSTM_f on the parent LSTM cell state s_{n_t}. Second, instead of using direct embeddings of rules a_{t-1} and par(n_t) in LSTM_f, we use another Bi-LSTM across the left and right sides of the rule (using a separator symbol SEP) and use the final hidden state as input to LSTM_f instead. More concretely, if a grammar rule is represented as A → B_1 ... B_n, then:

Emb(A → B_1 ... B_n) = LSTM_g({A, SEP, B_1, ..., B_n})
s_t = LSTM_f(n_t, Emb(a_{t-1}), Emb(par(n_t)), s_{t-1})          (8)

This modification can help the LSTM_f cell locate the position of n_t within rules a_{t-1} and par(n_t), especially for lengthy idiomatic rules. We present a full description of this model with all hyperparameters in the supplementary materials.

Idiom Aware Training: As before, we augment the set of decoder grammar rules with the top-K idioms extracted from ATIS-SQL. To represent SQL queries as grammar rules, we use the Python sqlparse package.
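As a rough illustration only, a SQL query can be turned into a tree compatible with the collapsing sketch from Section 4 using sqlparse, as below; the paper does not specify its exact grammar-rule conversion, so this tree shape is an assumption (Node is the illustrative class defined earlier).

import sqlparse

def sql_tree(token):
    # Group tokens (Statement, Where, Identifier, ...) become internal nodes;
    # leaf tokens keep their surface form. Whitespace is dropped.
    if token.is_group:
        kids = [sql_tree(t) for t in token.tokens if not t.is_whitespace]
        return Node(type(token).__name__, kids)
    return Node(token.value)

query = 'SELECT DISTINCT f1.flight_id FROM flight f1 WHERE f1.airline_code = "UA"'
tree = sql_tree(sqlparse.parse(query)[0])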


Figure 7: Examples of idioms learned from CONCODE. (a)-(c) represent idioms for instantiation of a new object, exception handling and integer-based looping respectively. (d) represents an idiom for applying a very commonly used library method (System.out.println). (e) is a combination of various idioms viz. an if-then idiom, an idiom for throwing exceptions and finally, reusing idiom (a) for creating new objects.

7 Results and Discussion

Table 1 presents exact match and BLEU scores on the original CONCODE train/validation/test split. Iyer-Simp yields a large improvement of 3.9 EM and 2.2 BLEU over the best model of Iyer et al. (2018), while also being significantly faster (27 hours for 30 training epochs as compared to 40 hours). Using a reduced BPE vocabulary makes the model memory efficient, which allows us to use a larger batch size that in turn speeds up training. Furthermore, using 200 code idioms further improves BLEU by 2.2% while maintaining comparable EM accuracy. Using the top-200 idioms results in a target AST compression of more than 50%, which results in fewer decoder RNN steps being performed. This reduces training time further by more than 50%, from 27 hours to 13 hours.

In Table 2, we illustrate the variation in EM, BLEU and training time with the number of idioms. We find that 200 idioms performs best overall in terms of balancing accuracy and training time. Adding more idioms continues to reduce training time, but accuracy also suffers. Since we permit idioms to contain identifier names to capture frequently used library methods, having too many idioms hurts generalization, especially since the test set is built using repositories disjoint from the training set. Finally, the amount of compression, and therefore the training time, plateaus after the top-600 idioms are incorporated.

Compared to the model of Iyer et al. (2018), our significantly reduced training time enables us to train on their extended training set. We run Iyer-Simp using 400 idioms (taking advantage of even lower training time) on up to 5 times the amount of data, while making sure that we do not include in training any NL from the validation or test sets. Since the set of idioms learned from the original training set is quite general, we directly use it rather than relearn the idioms from scratch. We report EM and BLEU scores for different amounts of training data on the same validation and test sets as CONCODE in Table 3. In general, accuracies increase with the amount of data, with the best model achieving a BLEU score of 28.9 and an EM of 13.4.

Training data   Exact         BLEU
1× Train        12.0 (9.7)    26.3 (23.8)
2× Train        13.0 (10.3)   28.4 (25.2)
3× Train        13.3 (10.4)   28.6 (26.5)
5× Train        13.4 (11.0)   28.9 (26.6)

Table 3: Exact Match and BLEU scores on the test (validation) set of CONCODE by training Iyer-Simp-400 on the extended training set released by Iyer et al. (2018). Significant improvements in training speed after incorporating idioms make training on large amounts of data possible.

Figure 7 shows example idioms extracted from CONCODE: (a) is an idiom to construct a new object with arguments, (b) represents a try-catch block, and (c) is an integer-based for loop.

In (e), we show how small idioms are combined to form larger ones; it combines an if-then idiom with a throw-exception idiom, which throws an object instantiated using idiom (a). The decoder also learns idioms to directly generate common library methods such as System.out.println( StringLiteral ) in one decoding step (d).

For the NL to SQL task, we report denotational accuracy in Table 4. We observe that Seq2Prod underperforms the Seq2Seq model of Iyer et al. (2017), most likely because a SQL query parse is much longer than the original query. This is remedied by using the top-400 idioms, which compress the decoded sequence size, marginally outperforming the SOTA (83.2%). Finegan-Dollak et al. (2018) observed that the SQL structures in ATIS-SQL are repeated numerous times in both train and test sets, thus facilitating Seq2Seq models to memorize these structures without explicit idiom supervision. To test a scenario with limited repetition of structures, we compare Seq2Seq with Seq2Prod-K for limited training data (increments of 20%) and observe (Figure 6) that idioms are additionally helpful with less training data, consistent with our intuition.

Model                   Accuracy
Iyer et al. (2017)†     82.5
Seq2Prod                79.1
Seq2Prod + 400 idioms   83.2

Table 4: Denotational accuracy for Seq2Prod with and without idioms, compared with results from Iyer et al. (2017)† on the test set of ATIS using SQL queries. Results averaged over 3 runs.

Figure 6: Accuracy of the Seq2Seq baseline (Iyer et al., 2017) compared with Seq2Prod-K on the test set of ATIS-SQL for varying fractions of training data. Idioms significantly boost accuracy when training data is scarce. (Mean and Std. Dev. computed across 3 runs.) The 20-40% settings use 100 idioms and the rest use 400 idioms.

8 Conclusions

We presented a general approach to make semantic parsers aware of target idiomatic structures, by first identifying frequently used idioms, followed by providing models with supervision to apply these idioms. We demonstrated this approach on the task of context dependent code generation, where we achieved a new SOTA in EM accuracy and BLEU score. We also found that decoding using idioms significantly reduces training time and allows us to train on significantly larger datasets. Finally, our approach also outperformed the SOTA for a semantic parsing to SQL task on ATIS-SQL, with significant improvements under a limited training data regime.

Acknowledgements

University of Washington's efforts in this research were supported in part by DARPA (FA8750-16-2-0032), the ARO (ARO-W911NF-16-1-0121), the NSF (IIS-1252835, IIS-1562364, IIS-1546083, IIS-1651489, OAC-1739419), the DOE (DE-SC0016260), the Intel-NSF CAPA center, and gifts from Adobe, Amazon, Google, Huawei, and NVIDIA. The authors thank the anonymous reviewers for their helpful comments.

References

Miltiadis Allamanis and Charles Sutton. 2014. Mining idioms from source code. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 472–483. ACM.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33–43. Association for Computational Linguistics.

Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 731–742.

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-SQL evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360.

Philip Gage. 1994. A new algorithm for data compression. The C Users Journal, 12(2):23–38.

Shirley Anugrah Hayati, Raphael Olivier, Pravalika Avvaru, Pengcheng Yin, Anthony Tomasic, and Graham Neubig. 2018. Retrieval-based neural code generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 925–930. Association for Computational Linguistics.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 963–973, Vancouver, Canada. Association for Computational Linguistics.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1643–1652. Association for Computational Linguistics.

Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1516–1526, Copenhagen, Denmark. Association for Computational Linguistics.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1512–1523. Association for Computational Linguistics.

Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Fumin Wang, and Andrew Senior. 2016. Latent predictor networks for code generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 599–609. Association for Computational Linguistics.

Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, and Chris Jermaine. 2018. Neural sketch learning for conditional program generation.

Maxwell Nye, Luke Hewitt, Joshua Tenenbaum, and Armando Solar-Lezama. 2019. Learning to infer program sketches. arXiv preprint arXiv:1902.06349.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1139–1149, Vancouver, Canada. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Richard Shin, Miltiadis Allamanis, Marc Brockschmidt, and Oleksandr Polozov. 2019. Program synthesis and semantic parsing with learned code idioms. arXiv preprint arXiv:1906.10816.

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–450, Vancouver, Canada. Association for Computational Linguistics.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921.
