Automatic Synthesis of Parallel Unix Commands and Pipelines with KumQuat

Jiasi Shen, Martin Rinard, Nikos Vasilakis
MIT EECS & CSAIL, USA

arXiv:2012.15443v2 [cs.PL] 22 Aug 2021

Abstract

We present KumQuat, a system for automatically generating data parallel implementations of Unix shell commands and pipelines. The generated parallel versions split input streams, execute multiple instantiations of the original pipeline commands to process the splits in parallel, then combine the resulting parallel outputs to produce the final output stream. KumQuat automatically synthesizes the combine operators, with a domain-specific combiner language acting as a strong regularizer that promotes efficient inference of correct combiners.

We evaluate KumQuat on 70 benchmark scripts that together have a total of 427 stages. KumQuat synthesizes a correct combiner for 113 of the 121 unique commands that appear in these benchmark scripts. The synthesis times vary between 39 seconds and 331 seconds with a median of 60 seconds. We present experimental results that show that these combiners enable the effective parallelization of our benchmark scripts.

1 Introduction

The Unix shell, working in tandem with the wide range of commands it supports, provides a convenient programming environment for many stream processing computations. Shell commands, which can be written in multiple languages, typically execute sequentially on a single processor. This sequential execution often leaves available data parallelism, in which a command operates on different parts of an input stream in parallel, unexploited. This observation has motivated the development of systems that exploit data parallelism in shell pipelines [18, 26]. A key prerequisite is obtaining the combiners required to merge the resulting multiple parallel output streams correctly into a single output stream. Previous systems rely on developers to manually implement such combiners and associate them with their corresponding shell commands [18, 26].

We present a new system, KumQuat, for automatically exploiting data parallelism available in Unix pipelines. Working with the commands in the pipeline as black boxes, KumQuat automatically generates inputs that explore the behavior of the command to infer and automatically generate a combiner for the command. This capability enables KumQuat to automatically generate data parallel versions of Unix pipelines, including pipelines that contain new commands or command options for which combiners were previously unavailable.

KumQuat targets commands that can be expressed as data parallel divide-and-conquer computations with two phases:¹ the first phase executes the original, unmodified command in parallel on disjoint parts of the input; the second phase combines the partial results from the first phase to obtain the final output. KumQuat generates candidate combiners, then repeatedly feeds selected inputs to parallelized versions of the command that use the candidate combiners. A comparison of the resulting outputs with corresponding outputs from the original serial version of the command enables KumQuat to identify a correct combiner for the command. A domain-specific combiner language acts as a strong regularizer that promotes efficient learning of correct combiners.

¹ There is no requirement that the actual internal implementation must be structured as a divide-and-conquer computation. Because KumQuat interacts with the command as a black box, the requirement is instead only that the computation that it implements can be expressed in this way.

The resulting (automatically generated) parallel computation executes directly in the same environment and with the same program and data locations as the original sequential command.

This paper makes the following contributions:

• Algorithm: We present a new algorithm that automatically synthesizes combiners for parallel and distributed versions of Unix commands. The resulting synthesized combiners enable the automatic generation of parallel and distributed versions of Unix commands and pipelines.

• Domain-Specific Language: We present a domain-specific language for combiner operators. This language supports both the class of combiner operators relevant to this domain and an efficient algorithm for automatically synthesizing these combiner operators.

• Correct Combiners: We present theorems (Theorem 2 and Theorem 4) that characterize when the combiner synthesis algorithm will identify a correct combiner for a given set of parallel output streams. We also present an analysis of the interaction between our input generation algorithm, our benchmark commands, and our combiner synthesis algorithm. This analysis identifies why KumQuat generates correct combiners for the 113 of 121 benchmark commands for which correct combiners exist, identifies command patterns that ensure correct combiner synthesis and correct data parallel execution, and provides insight into the rationale behind the KumQuat design and the reasons why the KumQuat design can effectively exploit data parallelism available in its target class of commands.

• Experimental Results: We present experimental results that characterize the effectiveness of KumQuat on a set of 70 benchmark scripts that together have a total of 477 commands, among which 121 are unique and process an input stream. The results show that KumQuat can effectively synthesize combiners for the majority of our benchmark commands and that these synthesized combiners enable effective parallelizations of our benchmark Unix pipelines.

2 Example

    cat $IN | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn

Figure 1. Example pipeline that computes word frequencies [2, 26].

Figure 1 presents an example pipeline that we use to illustrate KumQuat. The pipeline implements a computation that counts the frequency of words in an input document. The six commands in the pipeline (1) read the input document, (2) break the document into lines of words, (3) translate the words into lower case, (4) sort the words, (5) remove duplicates and prepend each unique word with a count, and (6) sort the words on their counts in reverse order.

The pipeline conforms to a standard Unix model that structures computations as pipelines of building-block commands that process character streams. The example commands process the streams as lines of words, with the lines and words separated by delimiters, in the example newline and space. As the commands process lines, words, or characters, they apply a function to each unit and either output the result of the function (commands "tr -cs A-Za-z '\n'" and "tr A-Z a-z"), sort the units according to a certain order (commands "sort" and "sort -rn"), or accumulate a result that is output when the command finishes reading the input stream and terminates (command "uniq -c").

To exploit the data parallelism available in this computation, KumQuat splits the input data stream into substreams, then instantiates the commands to process the input substreams in parallel. The result is a set of parallel output substreams that must then be combined to obtain the single final output stream of the command.

Different commands often require different combine operators. The combine operator for command "tr A-Z a-z" [...] function, which may depend on the sort flag that specifies the comparison function. The "uniq -c" command produces a stream of (count, word) pairs. Given two streams $y_1$ and $y_2$, the combine operator compares the word in the last line of $y_1$ with the word in the first line of $y_2$. If they are the same, it concatenates $y_1$ and $y_2$ but combines the last and first lines to include the sum of the two word counts. Otherwise it simply concatenates $y_1$ and $y_2$. As these examples highlight, selecting an appropriate combine operator for each command is a critical step for obtaining a correct parallel execution.

The default KumQuat parallel computation applies the combine operator after the parallel execution of each command to obtain a single output stream for that command. In many cases, however, it is possible to enhance parallel performance by eliminating intermediate combine operators so that the output substreams for one command feed directly into the input substreams for the direct parallel execution of the next command. KumQuat therefore applies an optimization that automatically eliminates intermediate combine operators when possible (Section 3.5).

Model of Computation: KumQuat targets commands $f$ that have a combine operator $g$ that satisfies

$$f(x_1 \mathbin{+\!+} x_2) = g(f(x_1), f(x_2))$$

for all input streams $x_1$, $x_2$, where the streams are (potentially recursively) structured as units separated by delimiters. KumQuat currently targets character streams structured as lines with the newline delimiter, so that $x_1$ and $x_2$ terminate with newlines. $\mathbin{+\!+}$ denotes string concatenation. A key step in the parallelization of $f$ is the synthesis of a correct combiner $g$ for $f$. To focus the synthesis algorithm on a productive space of candidate combiners, KumQuat works with combiners expressible in a domain-specific combiner language (Figure 3).

Combiner Synthesis: To infer a combiner $g$ for a command $f$, KumQuat works with a set of candidate combiner functions. In the current KumQuat implementation, this set consists of all combiner functions with seven or fewer nodes in the DSL abstract syntax tree. KumQuat [...]
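
To make the combiner condition concrete, the following is a minimal Python sketch of the check that a candidate combiner g agrees with the serial command f on split inputs. It is an illustration under stated assumptions and not KumQuat's implementation: the commands are modelled as pure Python functions rather than run as black-box processes, the three candidates and the two hand-picked input pairs are hypothetical stand-ins for the combiner DSL and the input generation algorithm, and `uniq_c` emits a simplified "count word" line without the padding that real `uniq -c` produces.

```python
import heapq

# --- Stand-ins for three commands from Figure 1 (assumptions, not real processes) ---

def lower(stream):
    """Stand-in for `tr A-Z a-z` (ASCII input assumed)."""
    return stream.lower()

def sort_lines(stream):
    """Stand-in for `sort` using plain code-point ordering."""
    return "".join(sorted(stream.splitlines(keepends=True)))

def uniq_c(stream):
    """Stand-in for `uniq -c`, emitting simplified "count word" lines."""
    runs = []  # one [count, word] entry per run of adjacent equal lines
    for word in stream.splitlines():
        if runs and runs[-1][1] == word:
            runs[-1][0] += 1
        else:
            runs.append([1, word])
    return "".join(f"{count} {word}\n" for count, word in runs)

# --- Hypothetical candidate combiners g(y1, y2) ---

def concat(y1, y2):
    """Plain string concatenation."""
    return y1 + y2

def merge(y1, y2):
    """Merge two already-sorted line streams."""
    return "".join(heapq.merge(y1.splitlines(keepends=True),
                               y2.splitlines(keepends=True)))

def stitch_counts(y1, y2):
    """Concatenate, summing the counts if the boundary lines hold the same word."""
    if not y1 or not y2:
        return y1 + y2
    head, tail = y1.splitlines(), y2.splitlines()
    c1, _, w1 = head[-1].partition(" ")
    c2, _, w2 = tail[0].partition(" ")
    if not (c1.isdigit() and c2.isdigit()):
        return None  # not "count word" output, so this candidate does not apply
    if w1 != w2:
        return y1 + y2
    merged = f"{int(c1) + int(c2)} {w1}"
    return "".join(line + "\n" for line in head[:-1] + [merged] + tail[1:])

CANDIDATES = {"concat": concat, "merge": merge, "stitch_counts": stitch_counts}

# Hand-picked input pairs; every stream ends with a newline, as the model requires.
PAIRS = [("a\na\nb\n", "b\nc\n"), ("b\n", "a\n")]

def plausible(f, pairs):
    """Names of candidates g with f(x1 ++ x2) == g(f(x1), f(x2)) on every pair."""
    return [name for name, g in CANDIDATES.items()
            if all(f(x1 + x2) == g(f(x1), f(x2)) for x1, x2 in pairs)]

if __name__ == "__main__":
    for f in (lower, sort_lines, uniq_c):
        print(f.__name__, plausible(f, PAIRS))
```

On these two input pairs, only `concat` survives for `lower`, only `merge` for `sort_lines`, and only `stitch_counts` for `uniq_c`. Agreement on a finite set of inputs only eliminates wrong candidates; it does not by itself prove that a surviving candidate is correct for all inputs.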