Xutools: Unix Commands for Processing Next-Generation Structured Text
Total Page:16
File Type:pdf, Size:1020Kb
Submitted for publication. Author Copy - do not redistribute. XUTools: Unix Commands for Processing Next-Generation Structured Text Gabriel A. Weaver Sean W. Smith Dartmouth College Dartmouth College Keywords: Text processing, Configuration Management, de meaningful structures via nested blocks. Natural- Change Management, Automation, Tools, Unix language documents may be divided up into chapters, paragraphs, and sentences. Sentences themselves may be Abstract further parsed. Configuration files and programming lan- guages, similarly, may be divided up into blocks of code. Traditional Unix tools operate on sequences of chara- For example, Cisco IOS interface blocks or C function cters, bytes, fields, lines, and files. However, modern blocks are information units that network administrators practitioners often want to manipulate files in terms of and developers reference daily. a variety of language-specific constructs—C functions, The prevalence of multi-line, nested-block-structured Cisco IOS interface blocks, and XML elements, to na- formats has left a capability gap for traditional tools. To- me a few. These language-specific structures quite often day, if practitioners want to extract interfaces from a Ci- lie beyond the regular languages upon which Unix text- sco IOS router configuration file, they must craft an invo- processing tools can practically compute. In this paper, cation for sed(1) we propose eXtended Unix text-processing tools (xuto- sed -n ’/ˆinterface ols) and present implementations that enable practitio- ATM0/,/ˆ!/\{/ˆ\!d;p;\}’ ners to extract (xugrep), count (xuwc), and compare (xudiff) texts in terms of language-specific structures. Using our xutools, practitioners need only type We motivate, design, and evaluate our tools around real- world use cases from network and system administrators, xugrep ’//ios:interface’ security consultants, and software engineers from a va- We designed and built eXtended Unix text-processing riety of domains including the power grid, healthcare, tools (xutools) so that practitioners could process files in and education. terms of the language constructs appropriate to the pro- blem at hand—even if these languages lie beyond regular 1 Introduction expressions. Besides “eXtending Unix,” we also chose the xu prefix During our fieldwork, we observed the need to generalize from the Greek word xulon, denoting “tree” or “staff.” Unix text processing tools so that practitioners can pro- We find the first sense especially appropriate given that cess the right type of string for the job at hand. We thus xutools operate on parse trees and process texts (traditio- need to extend traditional Unix tools because many mo- nally printed on trees). The second sense is appropriate dern, structured-text formats break assumptions of tradi- because xutools are designed to support IT staff in their tional Unix tools. real-world needs. Finally, xulon comes from the word Traditional Unix tools operate on sequences of cha- xuw , meaning to scrape, and our xutools, particularly racters, bytes, fields, lines, and files. When lines and xugrep, are well-suited to scraping content from docu- files correspond to language-specific constructs, traditio- ment trees. nal Unix tools work well. For example, lines of Apache In order to support a variety of languages, our xutools log files correspond to HTTP requests. rely upon a modular grammar library, as well as xupath, a However, there are many other language-specific con- general purpose querying language for structured text, w- structs besides the line. Many file formats found in mar- hich we based upon XPath. Although xutools may even- kup, configuration, and programming languages enco- tually encompass a broad range of Unix tools, this paper 1 presents xutools that extract (xugrep), count (xuwc), and compare (xudiff) structured text formats. xugrep xuwc xudiff Our xutools generalize the class of languages that w- XML e can practically process on the command-line. Recall (e.g., [33]) that a language is a set of strings and a re- Router 2.1,3.1 2.2,3.2 2.3,3.3 gular language is a set of strings that can be recognized Configs 4.1,5.1 4.2,5.2 4.3,5.3 by some finite automaton or equivalently, by a regular C expression. In other words, we can write a regular ex- pression to recognize whether a string is in a given lan- guage if and only if that language is regular. In contrast, a context-free language is a set of strings that can be Table 1: For each section of our paper, we will discuss recognized by a finite automaton with a stack. In practi- each of our tools relative to one or more real-world use ce, this means that we can write a context-free grammar cases that use the following structured-file formats. to recognize a string in a context-free language. All re- gular languages are context-free (they just don’t use the stack)—but not vice-versa. For example, the language of 2 Motivating use cases all strings with properly-nested parentheses is context- free but not regular. One cannot solve the parenthesis- During our fieldwork we observed the need to generalize matching problem with regular expressions. Unix text-processing tools. Specifically, we worked with In traditional Unix text-processing tools, most of the various network and system administrators and auditors types of strings (bytes, fields, characters, files, lines) a- of major electrical power utilities in the United States re either directly matched or indirectly split via a regular and discussed their challenges. We have also met with expression. In contrast, many of the languages in which network administrators at Dartmouth College. In addi- practitioners are interested (XML elements, JSON sub- tion, we received much feedback from a poster on our trees, Cisco IOS interfaces, C function blocks), are not tools [42] and recorded and analyzed these interactions at regular—but are context-free. This is no coincidence si- http://www.xutools.com/. As a result, we have nce the syntax of many programming languages was tra- discussed our work with practitioners from the RedHat ditionally specified via context-free grammars. Context- Security Response team. Finally, we have met with the free languages allow practitioners to recognize strings CEO of a network security company based in Germany that possess a recursive or hierarchical structure. Further- regarding uses for our tools. more, as languages evolve and acquire new constructs, a The intent of xutools is to help practitioners process grammatical description of a language is easily extended the right type of string for the job at hand. Users (such (even for regular languages) [1]. as network and system administrators, security consul- Furthermore, Unix text-processing tools tend to ope- tants, and software engineers) from a variety of domains rate on all the lines in a given input stream. For exam- (including the power grid, healthcare, and education) ha- ple grep(1) reports lines that contain a substring that ve emphasized the need for tools that let them operate on matches a regular expression, diff(1) reports all dif- strings in languages that are relevant to them. ferences between two files, and wc(1) counts all of the bytes, characters, words or lines in a file [32, 38, 39]. O- Many interesting strings are not in regular langua- ften, however, practitioners are interested in reporting a ges. Current Unix text-processing tools do not allow matching line in terms of the block in which that match practitioners to operate on blocks of text nested arbitra- occurs; they are interested in reporting line-level differe- rily deep or to report results with respect to that nesting. nces between two router configurations in terms of inter- However, many practitioners must process these blocks face blocks; or they are interested in counting the lines in order to solve problems they encounter. within each function block in C source. We extend Unix Figures1 and2 illustrate the mismatch between strings tools to report results in terms of contexts other than the of interest in modern file formats and the strings upon file. which current Unix text-processing tools operate. This Paper: We motivate (Section2), design and im- 2.1 Use cases for rethinking grep(1) plement (Section3), evaluate (Section4), and show the novelty of (Section5) our tools relative to real-world e- NVD-XML: Very recently, practitioners at the Re- xamples. Table1 illustrates our path through each of the- dHat Security Response Team wanted a way to grep se sections. Finally, in Section6 we discuss future work the XML feed for the National Vulnerability Database and in Section7 we conclude. (NVD).1 Specifically, they wanted to know how many 2 Regular Context-Free traditional unix tools xutools cut cat head, xugrep csplit tail uniq xudiff xuwc sgrep diff grep wc XyDiff xmllint Coccinelle CIMDiff Figure 2: Our xutools improve on prior work by systematically extending Unix tools to operate on the broader class of languages that are encountered in modern file formats. NVD entries contained the string cpe:/a:redhat, the vulnerability score of these entries, and how many XML elements in the feed contain cpe:/a:redhat. Traditional grep cannot handle this use case because it requires us to solve the parenthesis-matching problem. This limitation motivates the capability to be able to re- port matches with respect to a given context. Perl In contrast, we need a grep(1) that can handle Java JSON strings in context-free languages because when we ex- Context-Free YAML tract XML elements, we want to ensure that every ope- and beyond C ning tag is matched by a closing tag. Multiple XML elements may share the same closing tag, and XML e- paragraphs XML lements may be nested arbitrarily deep. Therefore, we need parenthesis matching to recognize XML elements.