Language to Code: Learning Semantic Parsers for If-This-Then-That Recipes

Chris Quirk (Microsoft Research, Redmond, WA, USA) — [email protected]
Raymond Mooney* (UT Austin, Austin, TX, USA) — [email protected]
Michel Galley (Microsoft Research, Redmond, WA, USA) — [email protected]

* Work performed while visiting Microsoft Research.

Abstract

Using natural language to write programs is a touchstone problem for computational linguistics. We present an approach that learns to map natural-language descriptions of simple "if-then" rules to executable code. By training and testing on a large corpus of naturally-occurring programs (called "recipes") and their natural language descriptions, we demonstrate the ability to effectively map language to code. We compare a number of semantic parsing approaches on the highly noisy training data collected from ordinary users, and find that loosely synchronous systems perform best.

1 Introduction

The ability to program computers using natural language would clearly allow novice users to more effectively utilize modern information technology. Work in semantic parsing has explored mapping natural language to formal domain-specific programming languages such as database queries (Woods, 1977; Zelle and Mooney, 1996; Berant et al., 2013), commands to robots (Kate et al., 2005), operating systems (Branavan et al., 2009), smartphones (Le et al., 2013), and spreadsheets (Gulwani and Marron, 2014). Developing such language-to-code translators has generally required specific dedicated efforts to manually construct parsers or large corpora of suitable training examples.

An interesting subset of the possible program space is if-then "recipes," simple rules that allow users to control many aspects of their digital life including smart devices. Automatically parsing these recipes represents a step toward complex natural language programming, moving beyond single commands toward compositional statements with control flow.

Several services, such as IFTTT, allow users to create simple programs with "triggers" and "actions." For example, one can program their Phillips Hue light bulbs to flash red and blue when the Cubs hit a home run. A somewhat complicated GUI allows users to construct these recipes based on a set of information "channels." These channels represent many types of information. Weather, news, and financial services have provided constant updates through web services. Home automation sensors and controllers such as motion detectors, thermostats, location sensors, garage door openers, etc. are also available. Users can then describe the recipes they have constructed in natural language and publish them.

Our goal is to build semantic parsers that allow users to describe recipes in natural language and have them automatically mapped to executable code. We have collected 114,408 recipe-description pairs from the http://ifttt.com website. Because users often provided short or incomplete English descriptions, the resulting data is extremely noisy for the task of training a semantic parser. Therefore, we have constructed semantic-parser learners that utilize and adapt ideas from several previous approaches (Kate and Mooney, 2006; Wong and Mooney, 2006) to learn an effective interpreter from such noisy training data. We present results on our collected IFTTT corpus demonstrating that our best approach produces more accurate programs than several competing baselines. By exploiting such "found data" on the web, semantic parsers for natural-language programming can potentially be developed with minimal effort.


2 Background

We take an approach to semantic parsing that directly exploits the formal grammar of the target meaning representation language, in our case IFTTT recipes. Given supervised training data in the form of natural-language sentences each paired with their corresponding IFTTT recipe, we learn to introduce productions from the formal-language grammar into the derivation of the target program based on expressions in the natural-language input. This approach originated with the SILT system (Kate et al., 2005) and was further developed in the WASP (Wong and Mooney, 2006; Wong and Mooney, 2007b) and KRISP (Kate and Mooney, 2006) systems.

WASP casts semantic parsing as a syntax-based statistical machine translation (SMT) task, where a synchronous context-free grammar (SCFG) (Wu, 1997; Chiang, 2005; Galley et al., 2006) is used to model the translation of natural language into a formal meaning representation. It uses statistical models developed for syntax-based SMT for lexical learning and parse disambiguation. Productions in the formal-language grammar are used to construct synchronous rules that simultaneously model the generation of the natural language. WASP was subsequently "inverted" to use the same synchronous grammar to generate natural language from the formal language (Wong and Mooney, 2007a).

KRISP uses classifiers trained using a Support-Vector Machine (SVM) to introduce productions in the derivation of the formal translation. The productions of the formal-language grammar are treated like semantic concepts to be recognized from natural-language expressions. For each production, an SVM classifier is trained using a string subsequence kernel (Lodhi et al., 2002). Each classifier can then estimate the probability that a given natural-language substring introduces a production into the derivation of the target representation. During semantic parsing, these classifiers are employed to estimate probabilities on different substrings of the sentence to compositionally build the most probable meaning representation for the sentence. Unlike WASP, whose synchronous grammar needs to be able to directly parse the input, KRISP's approach to "soft matching" productions allows it to produce a parse for any input sentence. Consequently, KRISP was shown to be much more robust to noisy training data than previous approaches to semantic parsing (Kate and Mooney, 2006).

Since our "found data" for IFTTT is extremely noisy, we have taken an approach similar to KRISP; however, we use a probabilistic log-linear text classifier rather than an SVM to recognize productions.

This method of assembling well-formed programs guided by a natural language query bears some resemblance to Keyword Programming (Little and Miller, 2007). In that approach, users enter natural language queries in the middle of an existing program; this query drives a search for programs that are relevant to the query and fit within the surrounding program. However, the function used to score derivations is a simple matching heuristic relying on the overlap between query terms and program identifiers. Our approach uses machine learning to build a correspondence between queries and recipes based on parallel data.

There is also a large body of work applying Combinatory Categorial Grammars to semantic parsing, starting with Zettlemoyer and Collins (2005). Depending on the set of combinators used, this approach can capture more expressive languages than synchronous context-free MT. In practice, however, synchronous MT systems have competitive accuracy scores (Andreas et al., 2013). Therefore, we have not yet evaluated CCG on this task.

3 If-this-then-that recipes

The recipes considered in this paper are diverse and powerful despite being simple in structure. Each recipe always contains exactly one trigger and one action. Whenever the conditions of the trigger are satisfied, the action is performed. The resulting recipes can perform tasks such as home automation ("turn on my lights when I arrive home"), home security ("text me if the door opens"), organization ("add receipt emails to a spreadsheet"), and much more ("remind me to drink water if I've been at a bar for more than two hours"). Triggers and actions are drawn from a wide range of channels that must be activated by each user. These channels can represent many entities and services, including devices (such as Android devices or WeMo light switches) and knowledge sources (such as ESPN or Gmail). Each channel exposes a set of functions for both trigger and action.

Several services such as IFTTT, Tasker, and Llama allow users to author if-this-then-that recipes. IFTTT is unique in that it hosts a large set of recipes along with descriptions and other metadata. Users of this site construct recipes using a GUI interface to select the trigger, action, and the parameters for both trigger and action.

After the recipe is authored, the user must provide a description and optional set of notes for this recipe and publish the recipe. Other users can browse and use these published recipes; if a user particularly likes a recipe, they can mark it as a favorite.

As of January 2015, we found 114,408 recipes on http://ifttt.com. Among the available recipes we encountered a total of 160 channels. In total, we found 552 trigger functions from 128 of those channels, and 229 action functions from 99 channels, for a total of 781 functions. Each recipe includes a number of pieces of information: description,[1] note, author, number of uses, etc. 99.98% of the entries have a description, and 35% contain a note. Based on availability, we focused primarily on the description, though there are cases where the note is a more explicit representation of program intent.

[1] The IFTTT site refers to this as "title".

The recipes at http://ifttt.com are represented as HTML forms, with combo boxes, inline maps, and other HTML UI components allowing end users to select functions and their parameters. This is convenient for end users, but difficult for automated approaches. We constructed a formal grammar of possible program structures, and from each HTML form we extracted an abstract syntax tree (AST) conforming to this grammar. We model this as a context-free grammar, though this assumption is violated in some cases. Consider the program in Figure 1, where some of the parameters used by the action are provided by the trigger.

This data could be used in a variety of ways. Recipes could be suggested to users based on their activities or interests, for instance, or one could train a natural language generation system to give a readable description of code.

In this paper, the paired natural language descriptions and abstract syntax trees serve as training data for semantic parsing. Given a description, a system must produce the AST for an IFTTT recipe. We note in passing that the data was constructed in the opposite direction: users first implemented the recipe and then provided a description afterwards. Ideal data for our application would instead start with the description and construct the recipe based on this description. Yet the data is unusually large and diverse, making it interesting training data for mapping natural language to code.

4 Program synthesis methods

We consider a number of methods to map the natural language description of a problem into its formal program representation.

4.1 Program retrieval

One natural baseline is retrieval. Multiple users could potentially have similar needs and therefore author similar or even identical programs. Given a novel description, we can search for the closest description in a table of program-description pairs, and return the associated program. We explored several text-similarity metrics, and found that string edit distance over the unmodified character sequence achieved best performance on the development set. As the corpus of program-description pairs becomes larger, this baseline should increase in quality and coverage.
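For concreteness, this baseline can be summarized in a few lines. The sketch below uses Levenshtein distance over raw character sequences, which matches the description above; the specific tie-breaking and data layout are illustrative assumptions rather than details stated in the paper.

```python
def edit_distance(a, b):
    # Standard Levenshtein distance over unmodified character sequences.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def retrieve(description, training_pairs):
    # training_pairs: list of (description, program AST) from the training set.
    # Return the program whose training description is closest to the query.
    return min(training_pairs,
               key=lambda pair: edit_distance(description, pair[0]))[1]
```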

4.2 Machine Translation

The downside to retrieval is that it cannot generalize. Phrase-based SMT systems (Och et al., 1999; Koehn et al., 2003) can be seen as an incremental step beyond retrieval: they segment the training data and attempt to match and assemble those segments at runtime. If the phrase length is unbounded, retrieval is almost a special case: it could return whole programs from the training data when the description matches exactly. In addition, they can find subprograms that are relevant to portions of the input, and assemble those subprograms into whole programs.

As a baseline, we adopt a recent approach (Andreas et al., 2013) that casts semantic parsing as phrasal translation. First, the ASTs are converted into flat sequences of code tokens using a pre-order left-to-right traversal. The tokens are annotated with their arity, which is sufficient to reconstruct the tree given a well-formed sequence of tokens using a simple stack algorithm. Given this parallel corpus of language and code tokens, we train a conventional statistical machine translation system that is similar in structure and performance to Moses (Koehn et al., 2007). We gather the k-best translations, retaining the first such output that can be successfully converted into a well-formed program according to the formal grammar. Integration of the well-formedness constraint into decoding would likely produce better translations, but would require more modifications to the MT system.
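To make the linearization concrete, here is a small sketch of the arity-annotated pre-order traversal and its stack-based inverse. The Node class and the "label/arity" token format are illustrative assumptions, not the paper's actual serialization.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def linearize(node):
    # Pre-order, left-to-right traversal; each token carries its arity.
    tokens = [f"{node.label}/{len(node.children)}"]
    for child in node.children:
        tokens.extend(linearize(child))
    return tokens

def delinearize(tokens):
    # Rebuild the tree with a stack: each token opens a node that
    # still expects `arity` children.
    root = None
    stack = []  # list of (node, number of children still expected)
    for tok in tokens:
        label, arity = tok.rsplit("/", 1)
        node = Node(label)
        if stack:
            parent, remaining = stack[-1]
            parent.children.append(node)
            if remaining == 1:
                stack.pop()
            else:
                stack[-1] = (parent, remaining - 1)
        else:
            root = node
        if int(arity) > 0:
            stack.append((node, int(arity)))
    return root
```

In practice one would also check that the token sequence is consumed exactly and the stack ends empty before accepting an MT hypothesis as a well-formed program.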

[Figure 1, rendered here as text: IF trigger THEN action. (A) Channels: Android Phone Call (trigger), Google Drive (action). (B) Functions: Any phone call missed (trigger), Add row to spreadsheet (action). (C) Parameters: Spreadsheet name, Formatted row, and Drive folder path, with field values including "IFTTT Android", "missed", and the trigger-supplied fields {{OccurredAt}}, {{FromNumber}}, and {{ContactName}}. Recipe description: "Archive your missed calls from Android to Google Drive".]

Figure 1: Example recipe with description, with nodes corresponding to (a) Channels, (b) Functions, and (c) Parameters indicated with specific boxes. Note how some of the fields in braces, such as OccurredAt, depend on the trigger.

Approaches to semantic parsing inspired by machine translation have proven effective when the data is very parallel. In the IFTTT dataset, however, the available pairs are not particularly clean. Word alignment quality suffers, and production extraction suffers in turn. Descriptions in this corpus are often quite telegraphic (e.g., "Instagram to Facebook") or express unnecessary pieces of information, or are downright unintelligible ("2Mrl14"). Approaches that rely heavily on lexicalized information and assume a one-to-one correspondence between source and target (at the phrase, if not the word level) struggle in this setting.

4.3 Generation without alignment

An alternate approach is to treat the source language as context and a general direction, rather than a hard constraint. The target derivation can be produced primarily according to the formal grammar while guided by features from the source language. For each production in the formal grammar, we can train a binary classifier intended to predict whether that production should be present in the derivation. This classifier uses general features of the source sentence. Note how this allows productions to be inferred based on context: although a description might never explicitly say that a production is necessary, the surrounding context might strongly imply it.

We assign probabilities to derivations by looking at each production independently. A derivation either uses or does not use each production. For each production used in the derivation, we multiply by the probability of its inclusion. Likewise for each production not used in the derivation, we multiply by one minus the probability of its inclusion.

Let G = (V, Σ, R, S) be the formal grammar with non-terminals V, terminal vocabulary Σ, productions R, and start symbol S. E represents a source sentence, and D a formal derivation tree for that sentence. R(D) is the set of productions in that derivation. The score of a derivation is the following product:

    P(D | E) = ∏_{r ∈ R(D)} P(r | E) · ∏_{r ∈ R \ R(D)} P(¬r | E)

The binary classifiers are log-linear models over features, F, of the input string: P(r | E) ∝ exp(θ_r^⊤ F(E)).

4.3.1 Training

For each production, we train a binary classifier predicting its presence or absence. Given a training set of parallel descriptions and programs, we create |R| binary classifier training sets, one for each classifier. We currently use a small set of simple features: word unigrams and bigrams, and character trigrams.
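As a concrete illustration of these per-production classifiers, the sketch below extracts the stated features (word unigrams and bigrams, character trigrams) and scores a candidate derivation with the product above in log space. The use of scikit-learn's LogisticRegression and the helper names are assumptions made for exposition, not the authors' implementation.

```python
import math
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(description):
    # Word unigrams/bigrams and character trigrams, as listed in Section 4.3.1.
    words = description.lower().split()
    feats = {}
    for w in words:
        feats[f"uni={w}"] = 1.0
    for a, b in zip(words, words[1:]):
        feats[f"bi={a}_{b}"] = 1.0
    for i in range(len(description) - 2):
        feats[f"tri={description[i:i+3]}"] = 1.0
    return feats

def train_production_classifiers(pairs, all_productions):
    # pairs: list of (description, set of productions in the gold AST).
    vec = DictVectorizer()
    X = vec.fit_transform([features(d) for d, _ in pairs])
    classifiers = {}
    for r in all_productions:
        y = [r in prods for _, prods in pairs]
        if len(set(y)) > 1:  # need both positive and negative examples
            classifiers[r] = LogisticRegression(max_iter=1000).fit(X, y)
    return vec, classifiers

def derivation_log_prob(description, derivation_productions, vec, classifiers):
    # log P(D|E) = sum over used productions of log P(r|E)
    #            + sum over unused productions of log(1 - P(r|E)).
    x = vec.transform([features(description)])
    total = 0.0
    for r, clf in classifiers.items():
        p = clf.predict_proba(x)[0, list(clf.classes_).index(True)]
        p = min(max(p, 1e-6), 1.0 - 1e-6)  # guard against log(0)
        total += math.log(p) if r in derivation_productions else math.log(1.0 - p)
    return total
```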

4.3.2 Inference

When presented with a novel utterance, E, our system must find the best code corresponding to that utterance. We use a top-down, left-to-right generation strategy, where each search node contains a stack of symbols yet to be expanded and a log probability. The initial node is ⟨[S], 0⟩, and a node is complete when its stack of non-terminals is empty. Given a search node with a non-terminal as its first symbol on the stack, we expand with any production for that symbol, putting its yield onto the stack and updating the node cost to include its derivation score:

    ⟨[X, α], p⟩  ⇒  ⟨[β, α], p + log P(X → β | E)⟩    if (X → β) ∈ R

If the first stack item is a terminal, it is scanned:

    ⟨[a, α], p⟩  ⇒  ⟨[α], p⟩    if a ∈ Σ

Using these inference rules, we utilize a simple greedy approach that only accounts for the productions included in the derivation. To account for the negative ∏_{r ∈ R \ R(D)} P(¬r | E) factors, we use a beam search, and rerank the n-best final outcomes from this search based on the probability of all productions that are not included. Partial derivations are grouped into beams according to the number of productions in that derivation.

4.4 Loosely synchronous generation

The above method learns distributions over productions given the input, but treats the sentence as an undifferentiated bag of linguistic features. The syntax of the source sentence is not leveraged at all, nor is any correspondence between the language syntax and the program structure used. Often the pairs are not in sufficient correspondence to suggest synchronous approaches, but some loose correspondence to maintain at least a notion of coverage could be helpful.

We pursue an approach similar to KRISP (Kate and Mooney, 2006), with several differences. First, rather than a string kernel SVM, we use a log-linear model with character and word n-gram features.[2] Second, we allow the model to consider both span-internal features and contextual features.

[2] We have a preference for log-linear models given their robustness to hyperparameter settings, ease of optimization, and flexible incorporation of features. An SVM trained with similar features should have similar performance, though.

This approach explicitly models the correspondence between nodes in the code side and tokens in the language. Unlike standard MT systems, word alignment is not used as a hard constraint. Instead, this phrasal correspondence is induced as part of model training.

We define a semantic derivation D of a natural language sentence E as a program AST where each production in the AST is augmented with a span. The substrings covered by the children of a production must not overlap, and the substring covered by the parent must be the concatenation of the substrings covered by the children. Figure 2 shows a sample semantic derivation.

[Figure 2, rendered here as text: the derivation for the description "Call me if the Cubs score" (tokens 1-6). IF[1-6] dominates TRIGGER[3-6] and ACTION[1-2]; TRIGGER[3-6] expands to ESPN[3-6], then New in-game update[3-6], with parameter Chicago Cubs[5-5]; ACTION[1-2] expands to Phone call[1-2], then Call my phone[1-2].]

Figure 2: An example training pair with its semantic derivation. Note the correspondence between formal language and natural language denoted with indices and spans.

The core components of KRISP are string-kernel classifiers P(r, i..j | E) denoting the probability that a production r in the AST covers the span of words i..j in the sentence E. Here, i < j are positions in the sentence indicating the span of tokens most relevant to this production. In other words, the substring E[i..j] denotes the production r with probability P(r, i..j | E). The probability of a semantic derivation D is defined as follows:

    P(D | E) = ∏_{(r, i..j) ∈ D} P(r, i..j | E)

That is, we assume that each production is independent of all others, and is conditioned only on the string to which it is aligned. This can be seen as a refinement of the above production classification approach using a notion of correspondence.

Rather than using string kernels, we use logistic regression classifiers with word unigram, word bigram, and character trigram features. Unlike KRISP, we include features from both inside and outside the substring. Consider the production "Phone call → Call my phone" with span 1-2 from Figure 2. Word unigram features indicate that "call" and "me" are inside the span; the remaining words are outside the span. Word bigram features indicate that "call me" is inside the span, "me if" is on the boundary of the span, and all remaining bigrams are outside the span.
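The sketch below shows one way to realize such span-conditioned features with inside/outside/boundary n-grams, mirroring the "Phone call → Call my phone" example; the feature names and exact templates are illustrative assumptions rather than the paper's feature set.

```python
def span_features(tokens, i, j):
    # Features for a candidate span [i, j) over the token list:
    # unigrams and bigrams marked as inside, outside, or on the span boundary,
    # plus character trigrams of the covered substring.
    feats = {}
    for k, w in enumerate(tokens):
        side = "in" if i <= k < j else "out"
        feats[f"uni_{side}={w}"] = 1.0
    for k, (a, b) in enumerate(zip(tokens, tokens[1:])):
        if i <= k and k + 1 < j:
            side = "in"          # both tokens inside the span
        elif k + 1 < i or k >= j:
            side = "out"         # both tokens outside the span
        else:
            side = "boundary"    # bigram straddles the span edge
        feats[f"bi_{side}={a}_{b}"] = 1.0
    covered = " ".join(tokens[i:j])
    for k in range(len(covered) - 2):
        feats[f"tri={covered[k:k+3]}"] = 1.0
    return feats
```

A logistic-regression classifier per production can then be trained on these features exactly as in the Section 4.3 sketch, and the derivation score becomes the product of P(r, i..j | E) over the spans in the derivation.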

882 to the set of positive instances. Otherwise, it is Language Code added to the set of negative instances, and the best Recipes 77,495 77,495 derivation constrained to match the gold standard Train Tokens 527,368 1,776,010 AST is found and added to the positive instances. Vocabulary 58,102 140,871 Given this revised training data, the classifiers are retrained. After each pass through the training data, Recipes 5,171 5,171 we evaluate the current model on the development Dev Tokens 37,541 110,074 set. This procedure is repeated until development- Vocabulary 7,741 14,804 set performance begins to fall. Recipes 4,294 4,294 4.4.2 Inference Test Tokens 28,214 94,367 Vocabulary 6,782 13,969 To find the most probable derivation according to the grammar, KRISP uses a variation on Earley parsing. This is similar to the inference method Table 1: Statistics of the data after cleaning and separating into training, development, and test sets. In each case, the from Section 4.3.2, but each item now additionally number of recipes, tokens (including punctuation, etc.) and maintains a position and a span. Inference proceeds vocabulary size are included. left-to-right through the source string. The natural language may present information in a different overfitting to the linguistic style of a particular au- order than the formal language, so all permutations thor. Table 1 presents summary statistics for the of rules are considered during inference. resulting data. We found this inference procedure to be quite Although certain trigger-action pairs occur much slow for larger data sets, especially because wide more often than others, the recipes in this data beams were needed to prevent search failure. To are quite diverse. The top 10 trigger-action pairs speed up inference, we used scores from the account for 14% of the recipes; the top 100 account position-independent classifiers as completion-cost for 37%; the top 1000 account for 72%. estimates. The completion-cost estimate for a given sym- 5.1 Metrics bol is defined recursively. Terminals have a cost of To evaluate system performance, several different zero. Productions have a completion cost of the log measures are employed. Ideally a system would probability of the production given the sentence, output exactly the correct abstract syntax tree. One plus the completion cost of all non-terminal sym- measure is to count the number of exact matches, bols. The completion cost for a non-terminal is though almost all methods receive a score of 0.4 the max cost of any production rooted in that non- Alternatively, we can look at the AST as a set of terminal. Computing this cost requires traversing productions, computing balanced F-measure. This all productions in the grammar for each sentence. is a much more forgiving measure, giving partial Given a partial hypothesis, we use exact scores credit for partially correct results, though it has the for the left-corner subtree that has been fully con- caveat that all errors are counted equally. structed, and completion estimates for all the sym- Correctly assigning the trigger and action is the bols and productions whose left and right spans are most important, especially because some of the pa- not yet fully instantiated. rameter values are tailored for particular users. For example, “turn off my lights when I leave home” 5 Experimental Evaluation requires a “home” location, which varies for each Next we evaluate the accuracy of these approaches. user. 
Therefore, we also measure accuracy at iden- The 114,408 recipes described in Section 3 were tifying the correct trigger and action, both at the first cleaned and tokenized. We kept only one channel and function level. recipe per unique description, after mapping to low- 5.2 Human comparison ercase and normalizing punctuation.3 Finally the recipes were split by author, randomly assigning One remaining difficulty is that multiple programs each to training, development, or test, to prevent may be equally correct. Some descriptions are very difficult to interpret, even for humans. Second, 3We found many recipes with the same description, likely copies of some initial recipe made by different users. We 4Retrieval gets an exact match 3.7% of the time, likely due selected one representative using a deterministic heuristic. to near-duplicates from copied recipes.

5.2 Human comparison

One remaining difficulty is that multiple programs may be equally correct. Some descriptions are very difficult to interpret, even for humans. Second, multiple channels may provide similar functionality: both Phillips Hue and WeMo channels provide the ability to turn on lights. Even a well-authored description may not clarify which channel should be used. Finally, many descriptions are underspecified. For instance, the description "notify me if it rains" does not specify whether the user should receive an Android notification, an iOS notification, an email, or an SMS. This is difficult to capture with an automatic metric.

To address the prevalence and impact of underspecification and ambiguity in descriptions, we asked humans to perform a very similar task. Human annotators on Amazon Mechanical Turk ("turkers") were presented with recipe descriptions and asked to identify the correct channel and function (but not parameters). Turkers received careful instructions and several sample description-recipe pairs, then were asked to specify the best recipe for each input. We requested they try their best to find an action and a trigger even when presented with vague or ambiguous descriptions, but they could tag inputs as 'unintelligible' if they were unable to make an educated guess. Turkers created recipes only for English descriptions, applying the label 'non-English' otherwise. Five recipes were gathered for each description. The resulting recipes are not exactly gold, as the turkers have limited training at the task. However, we imposed stringent qualification requirements to control the annotation quality.[5]

[5] Turkers must have 95% HIT approval rating and be native speakers of English (as an approximation of the latter, we required Turkers be from the U.S.). Manual inspection of annotation on a control set drawn from the training data ensured there was no apparent spam.

Our workers were in fair agreement with one another and the gold standard, producing high quality annotation at wages calibrated to local minimum wage. We measure turker agreement with Krippendorff's α (Krippendorff, 1980), which is a statistical measure of agreement between any number of coders. Unlike Cohen's κ (Cohen, 1960), the α statistic does not require that coders be the same for each unit of analysis. This property is particularly desirable in our case, since turkers generally differ across HITs. A value of α = 1 indicates perfect agreement, while α ≤ 0 suggests the absence of agreement or systematic disagreement. Agreement measures on the Mechanical Turk data are shown in Table 2.

                            Trigger            Action
                            C       C+F        C       C+F
  # of categories           128     552        99      229
  All                       .592    .492       .596    .532
  Intelligible English      .687    .528       .731    .627

Table 2: Annotator agreement as measured by Krippendorff's α coefficient (Krippendorff, 1980). Agreement is measured on either channel (C) or channel and function (C+F), and on either the full test set (4,294 recipes) or its English and intelligible subset (2,262 recipes).

This shows encouraging levels of agreement for both the trigger and the action, especially considering the large number of categories. Krippendorff (1980) advocates a 0.67 cutoff to allow a "tentative conclusion" of agreement, and turkers are relatively close to that level for both trigger and action channels. However, it is important to note that the coding scheme used by turkers is not mutually exclusive, as several triggers and actions (e.g., "SMS" vs. "Android SMS" actions) accomplish similar effects. Thus, our levels of agreement are likely to be greater than suggested by measures in the table. Finally, we also measured agreement on the English and intelligible subset of the data, as we found that confusion between the two labels "non-English" and "unintelligible" was relatively high. As shown in the table, this substantially increased levels of agreement, up to the point where α for both trigger and action channels are above the 0.67 cutoff drawing a tentative conclusion of agreement.

5.3 Systems and baselines

The retrieval method searches for the closest description in the training data based on character string-edit-distance and returns the recipe for that training program. The phrasal method uses phrase-based machine translation to generate candidate outputs, searching the resulting n-best candidates for the first well-formed recipe. After exploring multiple word alignment approaches, we found that an unsupervised feature-rich method (Berg-Kirkpatrick et al., 2010) worked best, leveraging features of string similarity between the description and the code. We ran MERT on the development data to tune parameters. We used a phrasal decoder with performance similar to Moses. The synchronous grammar method, a recreation of WASP, uses the same word alignment as above, but extracts synchronous grammar rules from the parallel data (Wong and Mooney, 2006). The classifier approach described in Section 4.3 is independent of word alignment. Finally, the posclass approach from Section 4.4 derives its own derivation structure from the data.
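For reference, the agreement statistic reported in Table 2 can be computed for nominal labels as below. This is a generic sketch of Krippendorff's α (one minus observed over expected disagreement, computed from a coincidence matrix), not the authors' evaluation script, and the data layout (one dict of coder → label per unit) is an assumed convention.

```python
from collections import Counter, defaultdict

def krippendorff_alpha_nominal(units):
    # units: list of dicts mapping coder id -> label for one unit of analysis;
    # coders may differ from unit to unit, and units with <2 labels are skipped.
    coincidences = defaultdict(float)   # (label_a, label_b) -> weight
    for unit in units:
        labels = list(unit.values())
        m = len(labels)
        if m < 2:
            continue
        for a in range(m):
            for b in range(m):
                if a != b:
                    coincidences[(labels[a], labels[b])] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), w in coincidences.items():
        totals[a] += w
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    return 1.0 - observed / expected if expected else 1.0
```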

The human annotations are used to establish the mturk human-performance baseline by taking the majority selection of the trigger and action over 5 HITs for each description and comparing the result to the gold standard. The oracleturk human-performance baseline shows how often at least one of the turkers agreed with the gold standard.

In addition, we evaluated all systems on a subset of the test data where at least three human-generated recipes agreed with the gold standard. This subset represents those programs that are easily reproducible by human workers. A good method should strive to achieve 100% accuracy on this set, and we should perhaps not be overly concerned about the remaining examples where humans disagree about the correct interpretation.

5.4 Results and discussion

Table 3 summarizes the main evaluation results. Most of the measures are in concordance.

                  Channel   +Func   Prod F1
  (a) All: 4,294 recipes
  retrieval         28.2     19.3     40.8
  phrasal           17.3     10.0     34.8
  sync              16.2      9.5     34.9
  classifier        46.3     33.0     47.3
  posclass          47.4     34.5     48.0
  mturk             33.4     22.6     n/a
  oracleturk        48.8     37.8     n/a

  (b) Omit non-English: 3,741 recipes
  retrieval         28.9     20.2     41.7
  phrasal           19.3     11.3     35.3
  sync              18.1     10.6     35.1
  classifier        48.8     35.2     48.4
  posclass          50.0     36.9     49.3
  mturk             38.4     26.0     n/a
  oracleturk        56.0     43.5     n/a

  (c) Omit non-English & unintelligible: 2,262 recipes
  retrieval         36.8     25.4     49.0
  phrasal           27.8     16.4     39.9
  sync              26.7     15.5     37.6
  classifier        64.8     47.2     56.5
  posclass          67.2     50.4     57.7
  mturk             59.0     41.5     n/a
  oracleturk        86.2     59.4     n/a

  (d) ≥ 3 turkers agree with gold: 758 recipes
  retrieval         43.3     32.3     56.2
  phrasal           37.2     23.5     45.5
  sync              36.5     24.1     42.8
  classifier        79.3     66.2     65.0
  posclass          81.4     71.0     66.5
  mturk            100.0    100.0     n/a
  oracleturk       100.0    100.0     n/a

Table 3: Evaluation results. The first column measures how often the channels are selected correctly for both trigger and action (e.g. Android Phone Call and Google Drive in Figure 1). The next column measures how often both the channel and function are correctly selected for both trigger and action (e.g. Android Phone Call::Any phone call missed and Google Drive::Add row to spreadsheet). The last column shows balanced F-measure against the gold tree over all productions in the proposed derivation, from the root production down to the lowest parameter. We show results on (a) the full test data; (b) omitting descriptions marked as non-English by a majority of the crowdsourced workers; (c) omitting descriptions marked as either non-English or unintelligible by the crowd; and (d) only recipes where at least three of five workers agreed with the gold standard.

Interestingly, retrieval outperforms the phrasal MT baseline. With a sufficiently long phrase limit, phrasal MT approaches retrieval, but with a few crucial differences. First, phrasal requires an exact match of some substring of the input to some substring of the training data, where retrieval can skip over words. Second, the phrases are heavily dependent on word alignment; we find the word alignment techniques struggle with the noisy IFTTT descriptions. Sync performs similarly to phrasal. The underspecified descriptions challenge assumptions in synchronous grammars: much of the target structure is implied rather than stated.

In contrast, the classification method performs quite well. Some productions may be very likely given a prior alone, or may be inferred given other productions and the need for a well-formed derivation. Augmenting this information with positional information as in posclass can help with the attribution problem. Consider the input "Download Facebook Photos you're tagged in to [...]": we would like the token "Facebook" to invoke only the trigger, not the action. We believe further gains could come from better modeling of the correspondence between derivation and natural language.

We find that semantic parsing systems have accuracy nearly as high or even higher than turkers in certain conditions. There are several reasons for this. First, many of the channels overlap in functionality (Gmail vs. email, or Android SMS vs. SMS); likewise functions may be very closely related (Post a tweet vs. Post a tweet with an image). All the systems with access to thousands of training pairs are at a strong advantage; they can, for instance, more effectively break such ties by learning a prior over which channels are more likely. Turkers, on the other hand, have neither specific training at this job nor a background corpus and more frequently disagree with the gold standard. Second, there are a number of non-English and unintelligible descriptions. Although the turkers were asked to skip these sentences, the machine-learning systems may still correctly predict the channel and action, since the training set also contains non-English and cryptic descriptions. For the cases where humans agree with each other and with the gold standard, the best automated system (posclass) does fairly well, getting 81% channel and 71% function accuracy.

Table 4 has some sample outputs from the posclass system, showing both examples where the system is effective and where it struggles to find the intended interpretation.

  (a) INPUT:  Park in garage when snow tomorrow
      IFTTT:  Weather : Tomorrow's forecast calls for ⇒ SMS : Send me an SMS
      OUTPUT: Weather : Tomorrow's forecast calls for ⇒ SMS : Send me an SMS

  (b) INPUT:  Suas fotos do instagr.am salvas no dropbox
              [Portuguese: "Your instagr.am photos saved to Dropbox"]
      IFTTT:  Instagram : Any new photo by you ⇒ Dropbox : Add file from URL
      OUTPUT: Instagram : Any new photo by you ⇒ Dropbox : Add file from URL

  (c) INPUT:  Foursquare check-in archive
      IFTTT:  Foursquare : Any new check-in ⇒ Evernote : Create a note
      OUTPUT: Foursquare : Any new check-in ⇒ Google Drive : Add row to spreadsheet

  (d) INPUT:  if i post something on blogger it will post it to wordpress
      IFTTT:  Blogger : Any new post ⇒ WordPress : Create a post
      OUTPUT: Feed : New feed item ⇒ Blogger : Create a post

  (e) INPUT:  Endless loop!
      IFTTT:  Gmail : New email in inbox from ⇒ Gmail : Send an email
      OUTPUT: SMS : Send IFTTT any SMS ⇒ Philips hue : Turn on color loop

Table 4: Example output from the posclass system. For each input instance, we show the original query, the recipe originally authored through IFTTT, and our system output. Instance (a) demonstrates a case where the correct program is produced even though the input is rather tricky. Even the Portuguese query of (b) is correctly predicted, though keywords help here. In instance (c), the query is underspecified, and the system predicts that archiving should be done in Google Drive rather than Evernote. Instance (d) shows how we sometimes confuse the trigger and action. Certain queries, such as (e), would require very deep inference: the IFTTT recipe sets up an endless email loop, where our system assembles a strange interpretation based on keyword match.

6 Conclusions

The primary goal of this paper is to highlight a new application and dataset for semantic parsing. Although if-this-then-that recipes have a limited structure, many potential recipes are possible. This is a small step toward broad program synthesis from natural language, but is driven by real user data for modern hi-tech applications. To encourage further exploration, we are releasing the URLs of recipes along with turker annotations at http://research.microsoft.com/lang2code/.

The best performing results came from a loosely synchronous approach. We believe this is a very promising direction: most work inspired by parsing or machine translation has assumed a strong connection between the description and the operable semantic representation. In practical situations, however, many elements of the semantic representation may only be implied by the description, rather than explicitly stated. As we tackle domains with greater complexity, identifying implied but necessary information will be even more important.

Underspecified descriptions open up new interface possibilities as well. This paper considered only single-turn interactions, where the user describes a request and the system responds with an interpretation. An important next step would be to engage the user in an interactive dialogue to confirm and refine the user's intent and develop a fully-functional correct program.

Acknowledgments

The authors would like to thank William Dolan and the anonymous reviewers for their helpful advice and suggestions.

References

Jacob Andreas, Andreas Vlachos, and Stephen Clark. 2013. Semantic parsing as machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 47–52, Sofia, Bulgaria, August. Association for Computational Linguistics.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-13).

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 582–590, Los Angeles, California, June. Association for Computational Linguistics.

S.R.K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP), Singapore.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 263–270, Ann Arbor, MI.

J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL), pages 961–968, Sydney, Australia, July.

Sumit Gulwani and Mark Marron. 2014. NLyze: Interactive programming by natural language for spreadsheet data analysis and manipulation. In SIGMOD.

Rohit J. Kate and Raymond J. Mooney. 2006. Using string-kernels for learning semantic parsers. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 913–920, Sydney, Australia, July. Association for Computational Linguistics.

R. J. Kate, Y. W. Wong, and R. J. Mooney. 2005. Learning to transform natural to formal languages. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05), pages 1062–1068, Pittsburgh, PA, July.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48–54, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June.

Klaus Krippendorff. 1980. Content Analysis: An Introduction to its Methodology. Sage Publications, Beverly Hills, CA.

Vu Le, Sumit Gulwani, and Zhendong Su. 2013. SmartSynth: Synthesizing smartphone automation scripts from natural language. In MobiSys.

Greg Little and Robert C. Miller. 2007. Keyword programming in Java. In Proceedings of the Twenty-second IEEE/ACM International Conference on Automated Software Engineering, ASE '07, pages 84–93, New York, NY, USA. ACM.

Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444.

Franz Josef Och, Christoph Tillmann, and Hermann Ney. 1999. Improved alignment models for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 20–28, University of Maryland, College Park, MD, June.

Yuk Wah Wong and Raymond Mooney. 2006. Learning for semantic parsing with statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 439–446, New York City, USA, June. Association for Computational Linguistics.

Yuk Wah Wong and Raymond J. Mooney. 2007a. Generation by inverting a semantic parser that uses statistical machine translation. In Proceedings of Human Language Technologies: The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pages 172–179, Rochester, NY.

Yuk Wah Wong and Raymond J. Mooney. 2007b. Learning synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), pages 960–967, Prague, Czech Republic, June.

William A. Woods. 1977. Lunar rocks in natural English: Explorations in natural language question answering. In Antonio Zampoli, editor, Linguistic Structures Processing. Elsevier North-Holland, New York.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 1050–1055, Portland, OR, August.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the 21st Conference on Uncertainty in AI, pages 658–666.
