Meteor/Sopremo: An Extensible Query Language and Operator Model

Arvid Heise*, Astrid Rheinländer†, Marcus Leich‡, Ulf Leser†, Felix Naumann*
* Hasso Plattner Institute, Potsdam, Germany: {arvid.heise, felix.naumann}@hpi.uni-potsdam.de
† Humboldt-Universität zu Berlin, Germany: {rheinlae, leser}@informatik.hu-berlin.de
‡ Technische Universität Berlin, Germany: [email protected]

ABSTRACT

Recently, quite a few query and scripting languages for MapReduce-based systems have been developed to ease formulating complex data analysis tasks. However, existing tools mainly provide basic operators for rather simple analyses, such as aggregating or filtering. Analytic functionality for advanced applications, such as data cleansing or information extraction, can only be embedded in user-defined functions, where the semantics is hidden from the query compiler and optimizer. In this paper, we present a language that treats application-specific functions as first-class operators, so that operator semantics can be evaluated and exploited for optimization at compile time.

We present Sopremo, a semantically rich operator model, and Meteor, an extensible query language that is grounded in Sopremo. Sopremo also provides a programming framework that allows users to easily develop and integrate extensions with their respective operators and instantiations. Meteor's syntax is operator-oriented and uses a Json-like data model to support applications that analyze semi- and unstructured data. Meteor queries are translated into data flow programs of operator instantiations, i.e., concrete implementations of the involved Sopremo operators. Using a real-world example, we show how operators from different applications can be combined for writing complex analytical queries.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Int. Workshop on End-to-end Management of Big Data 2012, Istanbul, Turkey.
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

1. INTRODUCTION

Today, we are flooded with data that is generated in various scientific areas, on the web, or in business applications. For example, Twitter produces 340M messages per day as of March 2012 (http://blog.twitter.com/2012/03/twitter-turns-six.html), which can be analyzed in order to gain insights on emerging trends [21]. Often, the available data sets contain (near-exact) duplicates, unstructured text, or both in combination. Therefore, extracting relevant information and removing duplicates are fundamental steps in various data analysis pipelines [1].

Data flow systems [3, 4, 19] as a generalization of the MapReduce programming model [11] have gained much attention in this context, as they promise to ease writing scalable code for analytical tasks on huge amounts of data. However, developing data flow programs for non-relational workloads like information extraction (IE), data cleansing (DC), or data mining (DM) can become quite complex. For example, IE often involves complex preprocessing steps such as text segmentation, linguistic annotation, or stop word removal before the actual entities or relationships can be determined. IE algorithms themselves are often complex, and a re-use of existing algorithms and tools is often necessary for ad-hoc data and text analysis.

Quite a few languages for expressing complex analysis data flows in the form of queries or scripts [6, 9, 14, 23, 27] have been developed recently. These tools often provide only basic operators for simple, SQL-like analyses, such as aggregating, joining, or filtering data. Analytic functionality, which is required for more complex tasks, must be embedded in user-defined functions (UDFs), where the UDF's semantics is hidden from the query compiler and optimizer. Moreover, a combination of operators that were initially developed for different use cases and application domains is often difficult, since operators may differ in terms of expected input, produced output, or background information that is needed by an operator to work properly.

In this paper, we propose to treat use-case-specific functions as first-class operators of a data flow scripting language instead of writing UDFs. A main advantage of this approach is that the operator's semantics can be accessed at compile time and potentially be used for data flow optimization, or for detecting syntactically correct, but semantically erroneous queries. We present Meteor and Sopremo, an extensible query language and a semantically rich operator model for the Stratosphere data flow system. Both tools are integrated in the system's software stack and can be tested by downloading Stratosphere (http://www.stratosphere.eu).

Sopremo provides a programming framework that allows users to define custom packages, the respective operators, and their instantiations. Sopremo already contains a set of base operators from the relational algebra, plus two sets of additional operators for IE and DC. Further operators, e.g., for performing statistical tests and for boilerplate detection on web data, are in development.
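The claim that compile-time access to operator semantics enables early error detection can be made concrete with a small sketch. The following Python toy (all class and method names are our own, not Sopremo's actual API) registers operator signatures and rejects a misconfigured operator before anything is executed:

```python
# Toy sketch (hypothetical names, not Sopremo's actual API): a registry
# that knows each operator's required properties can reject a misconfigured
# operator at compile time, before any cluster resources are used.

class OperatorSignature:
    def __init__(self, name, required_properties):
        self.name = name
        self.required_properties = set(required_properties)

class OperatorRegistry:
    def __init__(self):
        self._ops = {}

    def register(self, signature):
        self._ops[signature.name] = signature

    def validate(self, name, properties):
        """Return a list of error messages for one operator usage."""
        sig = self._ops.get(name)
        if sig is None:
            return ["unknown operator: " + name]
        missing = sorted(sig.required_properties - set(properties))
        return ["%s: missing property '%s'" % (name, p) for p in missing]

registry = OperatorRegistry()
registry.register(OperatorSignature("filter", ["condition"]))
registry.register(OperatorSignature("join", ["condition", "projection"]))

ok = registry.validate("filter", {"condition": "age < 20"})          # -> []
bad = registry.validate("join", {"condition": "$a.name == $b.name"})
# -> ["join: missing property 'projection'"]
```

A plain UDF offers no such hook: its configuration errors only surface at run time on the cluster.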

[Figure 2 (diagram): the Stratosphere stack, from top to bottom: a query given as a Meteor script, the Meteor parser (language layer), the Sopremo compiler with the extensible operator model (producing a Sopremo plan), the Pact compiler and optimizer (programming model, producing a Pact program), and the Nephele scheduler (execution engine, executing an execution graph).]

Figure 2: Architecture of the Stratosphere system.

Data analysis programs for Sopremo are specified in Meteor, a declarative language inspired by Jaql [5]. Meteor builds upon the Json data model, which allows the system to process a variety of different types of data. Meteor contains several simplifications of Jaql that are necessary to seamlessly add new operators to the language and to support n-ary input and output operators. Meteor queries are parsed and translated into data flow programs of Sopremo operator instantiations, i.e., concrete implementations of the involved operators.

We use the following scenario as our running example in this paper. We give only an abstract description of the task here; details on how to express this use case with Meteor and Sopremo are explained in Sections 4 and 5.

Figure 1: Abstract workflow for retrieving mentions of teen-aged students in news articles.

Example: Suppose a county school board wants to find out which of its teen-aged students carry out voluntary work and thus have appeared in recent news articles. Our input is a set of spreadsheets containing information on all students from all schools in the county and a corpus of news articles. The task is to return all news articles that mention at least one student being under 20. As displayed in Fig. 1, we have to perform multiple operations to accomplish this task. First, we have to eliminate duplicates across the spreadsheets, since students might have changed school. Afterwards, we need to filter the spreadsheets and keep only records of teenagers. Concerning the news data, we need to analyze the textual contents, extract the names of people mentioned in the articles, and group the news articles by person names. Finally, we need to join both data sets on the persons' full names and return the result.

The remainder of this paper is structured as follows. In Sect. 2, we present background information on the Stratosphere system and illustrate how Sopremo and Meteor fit into the system's architecture. We give insights into technical details and base operators of Sopremo as well as language features of Meteor in Sect. 3. In Sect. 4, we show how both Sopremo and Meteor can be extended. Section 5 discusses related work. Finally, we conclude and present ideas for future work in Sect. 6.

2. THE STRATOSPHERE SYSTEM

We implemented Meteor and Sopremo within Stratosphere, a system for parallel data analysis. Stratosphere consists of four major components; next to Meteor and Sopremo, it comprises the Pact programming model [2, 3] and the Nephele execution engine [25].

Figure 2 displays the complete stack of the Stratosphere system. End-users specify data analysis tasks by writing Meteor queries. Such a query is parsed and translated into a Sopremo plan, a directed acyclic graph (DAG) of interconnected high-level data processing operators. Each operator consumes and processes one or multiple input data sets and produces one or multiple output data sets. Operators are semantically rich, i.e., the concrete instantiation (data processing algorithm) of an operator defines in which way data sets are partitioned and processed in parallel. Often, there are different instantiations available for each operator, which are expressed with Pacts. These instantiations have different properties with respect to runtime, memory consumption, quality, etc. Choosing the 'right' algorithm will be part of a cost-based optimizer (see Sect. 6).

Pact is a generalization of the MapReduce programming paradigm [11]. A Pact operator consists of a second-order function (InputContract), a first-order function (the UDF), and a so-called OutputContract. Next to Map and Reduce, Pact defines three additional second-order functions that each process two input streams, namely Match, CoGroup, and Cross. Each Pact is responsible for partitioning the input data and calling a user-defined first-order function. Pact uses schema-free tuples as its data model, i.e., the schema of any tuple is up to the interpretation of the user code. Pact programs are compiled into data flow graphs, which are then deployed and executed by Nephele. During translation, the Pact compiler performs physical optimization and reorderings [18] to improve the parallel execution.

Finally, Nephele [25] interprets data flow graphs and distributes tasks to the computation nodes. Nephele is designed to run in heterogeneous environments, and is also capable of exploiting the elasticity of clouds by booking and releasing machine instances at runtime.

Both Pact and Nephele may also act as a starting point for experienced users to perform data analyses, i.e., by writing Java programs against the Pact API or by providing data flow graphs as a configuration for Nephele. However, this is cumbersome and requires intimate knowledge of the system's execution layer. With Meteor, we focus on the end-user's perspective when working with Stratosphere. Specifically, we present the language layer Meteor and the algebraic layer Sopremo, which together enable end-users to formulate and execute queries in Stratosphere. Plus, the modular design of Sopremo and Meteor also allows programmers to develop their own operator algebra and language extensions, to plug in different scripting languages such as Pig or Hive, or even to create graphical user interfaces where analysis workflows can be specified.

3. OPERATOR AND LANGUAGE LAYER

In this section, we show how users write queries in Meteor. We also explain how such a query is first translated into a Sopremo plan and finally into a Pact program. First, we informally define the basic concepts used in Meteor and Sopremo.

The data model builds upon Json to support unstructured and semi-structured applications. A Sopremo or Meteor value thus represents a tree structure consisting of objects, arrays, and atomic values. A data set is an unordered collection of such trees under bag semantics. The base model does not support any constraints such as schema definitions in the first place, but those may be enforced implicitly through specific operators.

An operator acts as a building block for a query. It consumes one or more input data sets, processes the data, and produces one or more output data sets. Operator sub-types share common characteristics, but are specialized versions of the generic operator, e.g., IE operators for entity and relationship annotation require different algorithms. Operators are instantiated to reflect specific configurations and adjustments, e.g., a join is an operator, but a join over the attribute id is an operator instantiation. A Sopremo plan is a directed acyclic graph of interconnected operator instantiations.

3.1 The Meteor Query Language

Meteor is an operator-oriented query language that focuses on the analysis of large-scale, semi- and unstructured data. Users compose queries as a sequence of operators that reflect the desired data flow. By importing application-specific operators, users can use Meteor to process data for a wide range of applications. The internal data format builds upon Json, but Meteor supports additional input and output formats.

We start by showing and explaining the Meteor query for our running example (see Fig. 1), but we defer the discussion of operators that are specific to IE and DC applications to Sect. 4.

Our running example involves operators from the domains of data cleansing and information extraction. Thus, Line 1f first imports the packages cleansing and ie. Consequently, all operators and functions become accessible that are defined in these packages.

     1 using cleansing;
     2 using ie;
     3
     4 $students = read from 'students.csv';
     5 $students = remove duplicates $students
     6   where average(levenshtein(name),
     7                 dateSim(birthDay)) > 0.95
     8   retain maxDate(enrollmentDate);
     9 $teens = filter $stud in $students
    10   where (now() - $stud.birthDay).year < 20;
    11
    12 $articles = read from 'news.json';
    13 $articles = annotate sentences $articles
    14   using morphAdorner;
    15 $articles = annotate entities in $articles
    16   using type.person and regex 'names.txt';
    17 $peopleInNews = pivot $articles around
    18   $person = $article.annotations[*].entity
    19   into {
    20     name: $person,
    21     articles: $articles
    22   };
    23
    24 $teensInNews = join $teen in $teens,
    25   $person in $peopleInNews
    26   where $teen.name == $person.name
    27   into {
    28     student: $teen,
    29     articles: $person.articles[*].url
    30   };
    31
    32 write $teensInNews to 'result.json';

Listing 1: Meteor query for the running example.

Line 4ff reads the spreadsheet of students and associates the variable $students with that data set. Then all students are removed that are listed in multiple schools (see Sect. 4.2). Line 9 filters all students that have an age less than 20. Further, Line 12ff reads the articles, annotates the persons in the articles (see Sect. 4.1), and restructures the data set such that each entry of the data set $peopleInNews consists of the person's name and a list of articles mentioning him or her. Finally, both data sets are joined on person name in Line 24 and written to a file.

The goal of Meteor is to support a large variety of applications, each with its own specialized operators that are dynamically imported. To facilitate the correct specification of a query with arbitrary, dynamically imported operators, all operators in Meteor have a uniform syntax. The general form of operators is given by the excerpt of the EBNF in Listing 2. Basically, the specification of an operator starts with the multi-word operator name, followed by the list of its inputs. It concludes with a list of property specifications that each consist of the property name and the corresponding value. Additionally, inputs may be assigned aliases that act as iteration variables and are especially useful to identify the left and right input in self-joins.

    1 operator   ::= name+ inputs? properties? ';'
    2 inputs     ::= (alias 'in')? variable (',' inputs)?
    3 properties ::= property properties?
    4 property   ::= property_name expression
    5 variable   ::= '$' name

Listing 2: Excerpt of Meteor's EBNF grammar.

The script in Listing 1 consists of eight operators. In the following, we highlight some operators to explain the grammar rules. The first operator read has no input and one property from with a constant string expression. The third operator filter has one input $students with alias $stud and the property where with a boolean expression. Finally, join has two inputs with respective aliases and a where and an into property.

A complete list of the mostly relational, standard operators is given in Table 1. The third column shows the configurable properties of the operators, e.g., filter has a property condition, which is specified in Meteor after the where keyword.

    Operator    Inputs  Properties              Comment
    filter      1       condition
    transform   1       projection
    join        n       condition, projection
    group       n       keys, aggregation       co-group with n > 1
    intersect   n       -                       set intersection
    union       n       -                       set union
    subtract    n       -                       set difference
    union all   n       -                       bag union
    replace     1       path, dictionary        dictionary lookup
    pivot       1       path                    (un)nesting of trees
    split       1       path                    denormalizes arrays

Table 1: Core operators.

All operators except intersect, union, and subtract use bag semantics similar to relational DBMSs. The three exceptions implicitly convert incoming bags into sets and always return sets. In contrast, union all performs only a concatenation of the inputs. The three operators replace, pivot, and split are especially designed to work with semi-structured data.

[Figure 3 (diagram): the Sopremo plan for the query in Listing 1. The News Article source feeds a Sentence Annotation (Instantiation: MorphAdorner) and an Entity Annotation (Instantiation: Regex, Path: "./names.txt") followed by a Pivotization (Path: person name); the Students source feeds a Duplicate Removal (Sim Measure: Average, Threshold: 0.95) followed by a Selection (Condition: comparison of the computed year with 20); a Join (Condition: equality on name) produces the Results.]

Figure 3: Sopremo plan for query in Listing 1.

To import application-specific operators, Meteor users import packages with using. Users can also access operators with a package prefix without prior import, e.g., ie:annotate. Thus, users may specify which operator to use in cases where two packages use the same name for different operators. We explain the underlying mechanism for supporting namespaces in Sect. 3.2.

Additionally to operators, Meteor allows users to define and import functions. Function definitions inside a Meteor script have the same expressive power as a new Meteor script. These functions serve to shorten a script that repeatedly uses a given sequence of operators or expressions. Additionally, users can register Java functions with the Meteor function javaudf if they need functionality that cannot be expressed in Meteor or is more efficient in Java. Java functions are also automatically registered during package import if the package contains built-in, application-specific functions, such as the function levenshtein in our running example that computes the Levenshtein distance.

The syntax of Meteor is inspired by Jaql [6], a scripting language for Hadoop developed by IBM. Jaql follows a functional approach to let users specify the operators in a query (except for six base operators, such as join or filter). In contrast, Meteor users configure operators in a more object-oriented way. Also, the syntax of Meteor is tailored towards the package concept and thus simpler than Jaql in two ways. First, Meteor enforces the convention that all variables must start with $ to help humans and machines better distinguish between operators and variables. For example, the assignment filter = group filter by filter is valid Jaql but would not compile in Meteor. Second, we dropped the pipe notation of Jaql, e.g., $teens = $students -> filter. Pipe notation makes sense only for operators that have one input and one output, but Sopremo also supports operators that have multiple inputs and multiple outputs.

The Meteor parser parses any given Meteor query into an abstract syntax tree and then translates it into a logical execution plan of Sopremo operators. The plan is then handled by the Sopremo layer.

3.2 The Sopremo Operator Model

Sopremo is a framework to manage an extensible collection of semantically rich operators organized into packages. It acts as a target for the Meteor parser and ultimately produces an executable Pact program. Nevertheless, we designed Sopremo to be a common base for additional query languages such as XQuery or even GUI builders.

Figure 3 depicts the Sopremo plan that Meteor returns for our running example. Relational operators co-exist with application-specific operators, e.g., data cleansing and information extraction operations such as remove duplicates and annotate persons. All variables in a Meteor script are replaced by edges, which represent the flow of data.

Operators may have several properties, e.g., the remove duplicates operator has a similarity measure and a threshold as properties. The values of properties belong to a set of expressions that process individual Sopremo values or groups thereof.
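As a rough illustration of such property expressions, the following Python sketch (class names are ours, not Sopremo's actual classes) models a selection condition as small expression objects evaluated against Json-like records; the age computation from Listing 1 is simplified by assuming an already computed "age" field:

```python
# Sketch of a property expression evaluated per record (class names are
# ours, not Sopremo's). The selection condition of the running example,
# (now() - $stud.birthDay).year < 20, is simplified here by assuming the
# age has already been computed into an "age" field.

class Constant:
    def __init__(self, value):
        self.value = value

    def evaluate(self, record):
        return self.value

class FieldAccess:
    def __init__(self, field):
        self.field = field

    def evaluate(self, record):
        return record[self.field]

class LessThan:
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def evaluate(self, record):
        return self.left.evaluate(record) < self.right.evaluate(record)

condition = LessThan(FieldAccess("age"), Constant(20))
students = [{"name": "Ann", "age": 17}, {"name": "Bob", "age": 23}]
teens = [s for s in students if condition.evaluate(s)]  # keeps only Ann
```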
These expressions can be nested to form trees to perform more complex calculations and transformations. For example, the selection condition of the example plan (see Fig. 3) compares the year of the calculated age with the constant 20 for each value in the data set $students.

[Figure 4 (diagram): the pivot operator as a Sopremo operator (left, Path: person name) and as a partial Pact plan (right): a Map that, for each article, emits one [annotation.entity, article] pair per annotation, followed by a Reduce keyed on the entity that emits { name: entity, articles: article* }.]

[Figure 5 (diagram): the complete Pact plan for the running example: a Map (Sentence Annotation, article.annotations <- findSentences(article)) and a Map (Entity Annotation, article.annotations <- findEntities(article)) over the news articles; a Self-Cross (Duplicate Removal, emitting [student1, student2] if sim(student1, student2) > threshold) and a CoGroup (Duplicate Removal, emitting a student if it has no duplicates) followed by a Map (Selection) over the students; a Map and a Reduce (Pivotization); and a final Match (Join) keyed on the name that merges both inputs into the Results.]

Sopremo operators and packages can be developed independently. To be able to use and combine operators from different packages, operators must be self-descriptive in two ways. First, each operator provides meta information about itself, including its own configurable properties and certain characteristics that can be used during optimization. Second, all operator instantiations must define in which way they are executed and parallelized. In particular, each operator must provide at least one Pact workflow that executes the desired operation.
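To make the idea of such a partial Pact workflow concrete, here is a toy Python simulation (our own simplification, not the actual Pact API) of how a pivot around person entities can decompose into a map phase and a grouping reduce phase; for brevity, only article URLs are nested under each name:

```python
# Toy simulation (our simplification, not the actual Pact API) of a
# pivot instantiation as a partial Map/Reduce workflow: the map phase
# emits one (entity, article) pair per annotation, the reduce phase
# groups by entity and nests the articles under each name.
from collections import defaultdict

def map_pivot(article):
    for annotation in article["annotations"]:
        yield annotation["entity"], article

def reduce_pivot(pairs):
    groups = defaultdict(list)
    for entity, article in pairs:
        groups[entity].append(article["url"])
    return [{"name": name, "articles": urls}
            for name, urls in sorted(groups.items())]

articles = [
    {"url": "a1", "annotations": [{"entity": "Ann"}, {"entity": "Bob"}]},
    {"url": "a2", "annotations": [{"entity": "Ann"}]},
]
pairs = [pair for article in articles for pair in map_pivot(article)]
people_in_news = reduce_pivot(pairs)
# -> [{'name': 'Ann', 'articles': ['a1', 'a2']},
#     {'name': 'Bob', 'articles': ['a1']}]
```

In the real system, the map and reduce phases run in parallel on partitioned data; the sketch only shows the data reshaping they perform.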
Figure 4 shows how the instantiated operator pivot side is implemented as a partial Pact plan that consists of a Map and a Reduce. The Pact plan for operators are only partial Results because they cannot be executed due to the missing sources and sinks. Further, these partial Pact plans have the same amounts of incoming and outgoing data flows as the corre- sponding Sopremo operator. For complex Sopremo opera- Figure 5: Pact plan. tors such as remove duplicates, partial plans may easily consist of 20 Pacts. Operator instantiations may also reuse other operators be seen in Fig. 5. Please note that we here used naive du- in their implementation. These composite operators recur- plicate detection for visualization purposes only. Section 4.2 sively translate the reused operators into partial Pact plans discusses further alternatives. and rewire the input and outputs to form even larger partial To improve the runtime efficiency of a compiled plan, the Pact plans. Composition of operators reduces implementa- translation process is augmented with two more steps: First, tion complexity. Future improvements of a reused operator a Sopremo plan is logically optimized (work in progress, see also improve the composite operator. Sect. 6). A separate, physical optimization is performed Additionally, operators may also have different implemen- later in the Pact layer [18]. Second, Pact uses a flat, schema- tation strategies. The strategy can either be selected by less data model that is necessary to reorder Pacts. For pure properties or by an optimizer. For example, a Sopremo join Pact programs, the schema interpretation is performed by with an arbitrary join condition may require a theta join Pact users in their UDFs. However, the additional semantics with a cross Pact, while a join with a equality condition can of Sopremo operator allows Sopremo to infer an efficient data be efficiently executed with a match Pact. 
layout and bridge the gap between the flat data model of the Pact model and the nested data model of Sopremo. Meteor 3.3 Query Compilation or Sopremo users thus do not have to specify the data layout In Sopremo, all operator instantiations have a direct Pact explicitly. implementation. Thus, the compilation of a complete So- Figure 6 summarizes the complete process of translating premo plan consists of two steps. First, all operator instan- a Meteor query into a Pact plan. Meteor users formulate tiations in the Sopremo plan are translated into partial Pact a query that is parsed into a Sopremo plan. To import plans. Second, the inputs and outputs of the partial Pact packages, Meteor requests the package loader of Sopremo to plans are rewired to form a single, consistent Pact plan. The inspect the packages and register the discovered operators result of the translation process for our running example can and predefined functions. Meteor uses this information to reads formulates results query Meteor Operator # Inputs Subtypes Properties Query annotate 1 entities, instantiation, relations, path installs sentences, or creates Meteor Parser Sopremo Plan tokens, packages pos, Sopremo stems, Packages Logical Operator Schema lemmas Optimization Information Registry Inferencer replace 1 token, instantiation, Extraction Package entity, projection ... Loader fields, condition Pact Data Function Global Pact Plan Confi- path Mining Registry Schema Compiler guration unnest 1 fields join n condition Reflective Sopremo Transformation to Pact aggregate n key

Pact split into 1 sentences, condition, Result Pact & Nephele Plan tokens, instantiation ngrams Figure 6: Architecture of Sopremo. extract 1 entity, instantiation relation validate the script and translate it into a Sopremo plan. The Table 2: Information extraction operators. Top: ba- plan is analyzed by the schema inferencer to obtain a global sic operators, bottom: complex operators. schema that is used in conjunction with the Sopremo plan to create a consistent Pact plan. before their translation and may issue warnings or errors if properties are conflicting or if prerequisites are missing. Sec- 3.4 Query execution ondly, Sopremo operators may contain errors. In this case, The previous sections describe how a Meteor query is suc- Meteor shows a detailed stack trace of the erroneous oper- cessively translated into a Pact program. This happens once ator on the master node. This essentially means that the at query time on the master node of the cluster. During Sopremo plan is reconstructed at execution time and stack this process, Sopremo injects code into all Pacts to bridge traces from the cluster nodes are transferred to the master the conceptional gap between Pacts and Sopremo operators node. Furthermore, Sopremo optionally adds debug infor- during execution time. In this section, we briefly outline mation to allow values to be traced along a script execution how Sopremo influences the actual execution of a translated to ease debugging of Sopremo operators as well as Meteor query on a cluster. scripts. The standard execution of a Pact plan involves four steps: First, compute nodes are allocated by Nephele and the UDFs are distributed to each allocated node. Second, the UDF is 4. CONCRETE PACKAGES instantiated on each worker thread and a callback is called to Meteor and Sopremo are both extensible with application- allow the UDF to configure itself with information specified specific packages and operators. In order to avoid conflicts at query time. 
Third, the UDF is called for each input tuple when combining operators from different packages, Meteor or partition and finally, the UDF is disposed. Sopremo adds supports namespaces. In this section, we introduce two glue code to the second and third step of that process. packages for IE and DC and explain available operators in Most importantly, Sopremo uses the configuration call- each package. back to deserialize the context of the current Pact. The context includes the globally inferred schema, the function 4.1 Information Extraction registry with all user-defined and imported functions, and Generally, an IE operator transforms some input text data also the values of the properties. The context additionally into some output by applying a function to the input. Dif- contains debug information that allow users to exactly iden- ferent operators are dedicated to different purposes. As dis- tify which (composite) operator and which step a Pact be- played in Table 2, we distinguish between basic and complex longs to in case of errors. operators. The set of basic operators comprises annotate, Afterwards, the context is used to properly deserialize in- replace, unnest, join, and aggregate. The set of complex op- coming Pact values into Sopremo values for each call of the erators contains by now a split into and an extract operator. UDFs of the Pacts. The values are then used by the imple- Operators can be specified further by sub-types. This also mentation of the corresponding operator to calculate out- allows for an easy extension of operators by creating new going Sopremo values. Finally, these values are serialized sub-types if desired. In the following, we briefly describe again into schema free tuples. the available operators and sub-types. When the query successfully finishes, it produces one or The annotate operator is used to add information to a more output files that are encoded in Json unless otherwise given input text. 
When combining operators from different packages, Meteor supports namespaces. In this section, we introduce two packages for IE and DC and explain the available operators in each package.

4.1 Information Extraction

Generally, an IE operator transforms some input text data into some output by applying a function to the input. Different operators are dedicated to different purposes. As displayed in Table 2, we distinguish between basic and complex operators. The set of basic operators comprises annotate, replace, unnest, join, and aggregate. The set of complex operators by now contains a split into and an extract operator. Operators can be specified further by sub-types. This also allows for an easy extension of operators by creating new sub-types if desired. In the following, we briefly describe the available operators and sub-types.

The annotate operator is used to add information to a given input text. By now, we support annotating entities, binary relations between entities, and the annotation of structural information, such as sentences or tokens, and linguistic information on POS tags, stems, or lemmas. For each subclass there are different IE algorithms available that differ heavily in terms of runtime, startup costs, memory consumption, and quality. Both baseline variants and specialized third-party libraries are available as operator instantiations. A user can specify a variant in his/her query by adding the keyword "using" and the name of the algorithm. For example, extract relations using svm performs relation extraction using a support vector machine based approach, whereas extract relations using co-occ performs relation extraction based on co-occurrences of entities in the same scope.

Replace is responsible for either replacing existing annotations (tokens or entities) or entire fields in the Json record with some user-defined value. Example applications of replace are stop word removal (e.g., by specifying a list of stop words that is searched for and an empty string as replacement for each matched stop word), entity blinding as a preprocessing step for relation extraction, or entity normalization. Compared to the replace operator in the core package, this replace operator is more extensive, since it not only performs string replacements, but is also capable of replacing entire fields or complex annotations.

We use unnest to flatten nested annotations, e.g., to split up a bag of entity annotations of the same type. Join is used to merge records or annotations. The semantics of the join operator is that if record a and record b shall be joined into record c, and a and b contain annotations of the same type, these annotations are unioned in the output record c. The decision whether a and b shall be joined is taken by evaluating a join condition, which might be fuzzy. Finally, the aggregate operator merges records and existing annotations based on a user-defined key that is specified by the keyword by. The merge semantics of aggregate is similar to join, i.e., if records a, ..., k share the same key, all records are merged into a single record l, such that all annotation objects contained in a, ..., k are unioned by field type.
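The union semantics of join and aggregate can be sketched as follows. This is a simplified model that assumes annotations are stored per record under an "annotations" field keyed by type; the actual Sopremo record layout may differ:

```python
def merge_annotations(records):
    # Union the annotation objects of all input records by field type, as
    # done by join (two records) and aggregate (all records of one group).
    merged = {}
    for record in records:
        for ann_type, annotations in record.get("annotations", {}).items():
            bucket = merged.setdefault(ann_type, [])
            for ann in annotations:
                if ann not in bucket:  # union semantics, i.e., no duplicates
                    bucket.append(ann)
    return {"annotations": merged}

a = {"annotations": {"entity": [{"text": "IBM"}], "sentence": [{"start": 0}]}}
b = {"annotations": {"entity": [{"text": "IBM"}, {"text": "SAP"}]}}
```

Merging `a` and `b` yields a single record whose "entity" annotations contain "IBM" exactly once together with "SAP", while the "sentence" annotation of `a` is preserved.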
The complex operators split into and extract can be composed from basic operators. For example, split into sentences can be performed by first annotating sentence boundaries in a given input text (i.e., a whole document). Internally, sentence annotations are stored in a list that contains the start and end position of the individual sentences. This list needs to be unnested, such that a single Json record is produced for each sentence annotation. Finally, the original input text needs to be replaced by a substring of itself, which is defined by the start and end position of a sentence. Similarly, the extract operator for entities or relations can be expressed using a series of annotate, filter, replace, and unnest operations. However, for both complex operators and their sub-types, there are also more efficient instantiations available that perform unnesting, filtering, and replacements of fields or annotations in a single step. In particular, operators and their sub-types can be improved incrementally in terms of quality and runtime, i.e., simply by adding a new operator instantiation.
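The composition of split into sentences from annotate, unnest, and replace can be sketched as a plain-Python data flow. The naive period-based boundary detection stands in for a real sentence splitter and is purely illustrative, as is the record layout:

```python
def annotate_sentences(record):
    # annotate: store sentence boundaries as a list of {start, end} spans.
    text, spans, start = record["text"], [], 0
    for i, ch in enumerate(text):
        if ch == ".":
            spans.append({"start": start, "end": i + 1})
            start = i + 2  # skip the separating space
    record["sentences"] = spans
    return record

def unnest_sentences(record):
    # unnest: produce one output record per sentence annotation.
    for span in record["sentences"]:
        yield {"text": record["text"], "sentence": span}

def replace_with_sentence(record):
    # replace: substitute the text by the substring of the annotated span.
    span = record["sentence"]
    return {"text": record["text"][span["start"]:span["end"]]}

def split_into_sentences(record):
    return [replace_with_sentence(r)
            for r in unnest_sentences(annotate_sentences(record))]
```

For example, `split_into_sentences({"text": "First. Second."})` yields one record per sentence: `[{"text": "First."}, {"text": "Second."}]`.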
Lines 11 to 15 in Listing 1 display a fully specified Meteor query for annotating person names in news articles as part of our running example. In Line 12f., we annotate sentence boundaries in the input text by applying the MorphAdorner³ sentence splitting algorithm to the input (specified by using). Sentence splitting is in our example a preprocessing step for person name annotation, because we require that a person name does not go beyond the scope of a sentence. In Line 14f., we annotate person names using an algorithm based on matching regular expressions. The expression "type.person" specifies the concrete entity type that is annotated, and it is part of the Json object that will be created and added to the input by the annotation operator. Information on concrete entity types is useful for subsequent IE operators, such as an annotate relation operator that annotates relations between persons and companies. We also define the set of relevant expressions for person names by specifying the filename "names.txt". Internally, the regular expression based entity annotation algorithm loads this file first and then matches the input text.

All underlying operator implementations are built using the Sopremo programming framework. The implementations of the basic operators annotate, replace, and unnest and their sub-types build upon Map contracts, which process each input item independently. For merge, we use a Match contract, and aggregate is built using CoGroup and Reduce. The different sub-type instantiations might have dependencies on other annotation operator variants that need to be executed in advance. For example, a text tokenization algorithm might need information on sentence boundaries, and thus annotate sentences needs to be performed first. By now, users need to resolve these dependencies by hand, but we plan to integrate a dependency resolving algorithm in the Sopremo compiler (see Section 6).

³ http://morphadorner.northwestern.edu

Operator               Inputs  Properties
scrub                  1       rules
split records          1       projections
detect duplicates      1       sim, threshold, sorting or partition key
fuse                   1       strategies, weights
remove duplicates      1       sim, threshold, sorting or partition key, strategy
link/cluster records   n       sim, threshold, sorting or partition key, projection

Table 3: Data cleansing operators. Top: basic operators, bottom: complex operators.

4.2 Data Cleansing

The DC package comprises operators that are commonly used to improve data quality. Important tasks are duplicate detection, integration of several independent data sources, data fusion, and schema alignment. Similar to IE operators, the package consists of basic operators and complex operators, as listed in Table 3. The last column of the table also shows all properties that can be configured by users.

The basic operators improve the quality of one data source. The scrub operator applies a set of rules to a data source to correct data errors and normalize values for later steps. Further, split records helps to align schemata by projecting the data set into smaller data sets. In contrast to a simple projection, this operator maintains relationships between the fragments of one record. Next, detect duplicates aims to identify all pairs of records that represent the same real-world entity. Finally, fuse resolves data conflicts between duplicates using data fusion strategies.
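The rule-based semantics of scrub can be sketched as follows. The representation of rules as per-field lists of normalization functions is hypothetical; the actual rule language of the scrub operator is described in reference [15]:

```python
# Illustrative sketch of scrub: a rule set maps each field to normalization
# functions that are applied in order to correct errors and normalize values.

def scrub(record, rules):
    cleaned = dict(record)  # do not mutate the input record
    for field, field_rules in rules.items():
        for rule in field_rules:
            cleaned[field] = rule(cleaned[field])
    return cleaned

rules = {
    "name": [str.strip, str.title],                 # normalize capitalization
    "enrollment": [lambda v: v.replace("/", "-")],  # normalize date separators
}
student = {"name": "  jane doe ", "enrollment": "2012/08/01"}
```

Scrubbing the sample record yields `{"name": "Jane Doe", "enrollment": "2012-08-01"}`, a normalized value for each configured field.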
The complex operators are composed of the previously mentioned operators. We already briefly introduced the remove duplicates operator in Section 3 in our running example. The operator combines the basic operators detect duplicates and fuse. Further, the two operators link records and cluster records find duplicates across multiple data sources and are used to integrate data sources or to generate connections between data sources, e.g., owl:sameAs links in the LOD cloud. The latter operator additionally calculates the transitive closure of the found duplicates, for instance by using an adaption of the three-phase algorithm of Katz and Kider [20].

We already described the data integration operators, namely scrub, split records, link/cluster records, and fuse, in Reference [15]. In the following, we explain the detect duplicates and the remove duplicates operators and their properties in more detail.

In the running example (see Listing 1), we receive a list of enrolled students of all schools in the county. Naturally, the database is dirty: students change schools for various reasons even within a school year and may be listed in two schools. Hence, we use the remove duplicates operator to detect duplicate entries and choose the best representation of each student.
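The transitive closure of pairwise duplicates can be illustrated with a simple sequential union-find; cluster records itself uses an adaption of the parallel algorithm of Katz and Kider [20], so this is a minimal stand-in, not the actual implementation:

```python
def cluster_duplicates(num_records, duplicate_pairs):
    # Compute the transitive closure of pairwise duplicate decisions with a
    # union-find structure: records connected by a chain of duplicate pairs
    # end up in the same cluster.
    parent = list(range(num_records))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for record in range(num_records):
        clusters.setdefault(find(record), []).append(record)
    return sorted(clusters.values())
```

For example, `cluster_duplicates(5, [(0, 1), (1, 2), (3, 4)])` returns `[[0, 1, 2], [3, 4]]`: the pairs (0, 1) and (1, 2) are transitively merged into one cluster.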
Duplicate detection would naively require comparing all entries with each other, resulting in huge computational costs. In practice, various candidate selection techniques have been proposed that specifically select potential duplicates to avoid the Cartesian product and speed up duplicate detection at the cost of recall. However, they introduce new properties that the user must set or that need to be inferred automatically. For example, the popular sorted neighborhood method [16] needs sorting keys and a window size as additional parameters, while standard blocking [12] requires a blocking key.

To decide whether a candidate pair is a duplicate, the similarity of the entries is calculated and compared against a threshold. The properties similarity measure and threshold are specified in one comparison expression in Meteor, but can be used individually for optimization. For instance, a similarity measure involving the Jaccard distance and a high threshold can be efficiently executed with a set similarity join [24].
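The two steps above, candidate selection via the sorted neighborhood method [16] followed by a thresholded similarity comparison, can be sketched in plain Python. The record layout and parameter names are illustrative, not actual Meteor syntax, and the comparison uses Jaccard similarity (one minus the Jaccard distance mentioned above):

```python
def jaccard(a, b):
    # Jaccard similarity of two token sets.
    return len(a & b) / len(a | b)

def sorted_neighborhood(records, sorting_key, window):
    # Sort by the user-supplied sorting key and compare each record only to
    # its neighbors within a sliding window, avoiding the Cartesian product.
    ordered = sorted(records, key=sorting_key)
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            yield ordered[i], ordered[j]

def detect_duplicates(records, sorting_key, window, sim, threshold):
    # A candidate pair is a duplicate if its similarity reaches the threshold.
    return [(a, b)
            for a, b in sorted_neighborhood(records, sorting_key, window)
            if sim(a, b) >= threshold]

students = [
    {"name": "Jane Doe", "school": "A"},
    {"name": "Jane M. Doe", "school": "B"},
    {"name": "Richard Roe", "school": "C"},
]
name_tokens = lambda r: set(r["name"].lower().split())
duplicates = detect_duplicates(
    students,
    sorting_key=lambda r: r["name"],
    window=3,
    sim=lambda a, b: jaccard(name_tokens(a), name_tokens(b)),
    threshold=0.6,
)
```

On the sample data, only the two "Jane Doe" variants exceed the threshold and are reported as a duplicate pair.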
Finally, in our running example, we use data fusion to choose the record with the most current enrollment date and drop all other records in the cluster. However, in general, we can specify in detail which datum we prefer for which attribute. We may choose to complete missing data, to use the minimum or maximum value, to aggregate all values, or we can even use other attributes or meta-information to decide which value to use. We implemented most of the strategies that were presented by Bleiholder and Naumann [7].

The presented data cleansing package covers a wide range of data cleansing applications, but will be continuously improved to incorporate new operators and new implementations.
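The attribute-wise fusion strategies described above can be sketched as follows, in the spirit of Bleiholder and Naumann [7]; the record layout and the concrete strategy functions are illustrative, not the actual configuration syntax of the fuse operator:

```python
# Illustrative sketch of fuse: resolve conflicts in a duplicate cluster
# attribute by attribute, using one strategy per attribute.

def fuse(cluster, strategies):
    fused = {}
    for attribute, strategy in strategies.items():
        # Missing data (None) is completed from the other records.
        values = [r[attribute] for r in cluster if r.get(attribute) is not None]
        fused[attribute] = strategy(values)
    return fused

strategies = {
    "name": lambda vs: vs[0],  # take the first non-null value
    "credits": sum,            # aggregate all values
    "enrollment": max,         # most current enrollment date (ISO strings)
}
cluster = [
    {"name": "Jane Doe", "credits": 12, "enrollment": "2011-08-01"},
    {"name": None, "credits": 6, "enrollment": "2012-08-01"},
]
```

Fusing the sample cluster completes the missing name, aggregates the credits, and keeps the most current enrollment date: `{"name": "Jane Doe", "credits": 18, "enrollment": "2012-08-01"}`.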
5. RELATED WORK

Building a high level language layer that abstracts from underlying massively parallel data processing systems is a quite common concept. For Hadoop [26] there exist several such layers, like Pig [14, 22], Jaql [5, 6], and Hive [23].

Pig is a parallel processing environment for data flow programs [14, 22]. Programs are formulated in Pig Latin, a declarative language that can describe directed acyclic graphs (DAGs) of operators. In these DAGs, data originates from sources and is passed on to operators that may filter, join, group, or project data until the data reaches a sink, which is used to store the results. Data passed between operators has a flexible structure that allows atomic values, tuples, bags, and maps. Each field of complex types may be of any data type, with the only exception being map, which requires atomic keys. Operators can be parameterized with expressions that allow for UDFs and thus enable the extension of the language with user code. During compilation, Pig Latin programs are first transformed into a DAG of operators, which resembles logical query plans of relational database systems and facilitates similar optimization strategies. Optimized logical plans are compiled into a sequence of MapReduce jobs that can be submitted to the Hadoop job manager for execution. In addition to functions, Meteor and Sopremo allow for the definition of completely new operators. As a result, Meteor programs can be highly expressive, since the desired functionality does not need to be expressed with a fixed set of operators.

Jaql [5, 6] is a software stack structured similarly to Pig. It consists of a functional scripting language, a MapReduce compiler, and the Hadoop execution engine. Similar to Pig and Meteor, a Jaql program also forms a DAG of operators. However, in Jaql, operators are expressed as functions. Programs are therefore expressed as function compositions. Just like in other functional languages, functions are first-class citizens in Jaql and can be assigned to variables. The data model of Jaql is Json, which supports the use of atomic values, arrays, and records. Extensibility of the Jaql language is provided directly at the language level. Physical Jaql execution plans are expressed with plain Jaql syntax. In contrast to Meteor and Sopremo, Jaql needs no language extensions to express new functionality. The source-to-source translator that converts a Jaql program into an execution plan applies greedy rewriting rules for logical optimization, e.g., filter push-downs. Currently, Jaql does not have a physical optimizer. Since Sopremo plans are compiled to Pact programs, Sopremo benefits from both the logical and the physical optimization components in Stratosphere.

Hive provides data warehouse functionality and is modeled like a traditional relational database system [23]. Hive relies on Hadoop for query processing and uses HDFS as storage layer. Queries are formulated in HiveQL, a SQL dialect that includes a subset of standard SQL and adds Hive-specific extensions. Queries are compiled and rewritten by a rule-based optimizer and executed as MapReduce jobs. Hive's data model is not strictly relational, since it supports structured attributes that include nested lists and maps. In terms of extensibility, Hive supports UDFs similar to SQL. Again, in contrast to Meteor and Sopremo, no additional operators can be defined, and physical optimization is currently not supported.

The Asterix Query Language (AQL) is the high level query layer of the Asterix system, which executes on the Hyracks data processing engine [4, 8]. AQL queries are centered around nested FLWOR expressions, which originate from XQuery. The language offers a rich data model that supports nested records, lists, and enumerations. Queries are compiled to an intermediate representation, Algebricks, which is in turn compiled to a DAG of operators that are executed in parallel on Hyracks. In terms of the features provided, AQL and Meteor/Sopremo/Stratosphere are very similar; both support UDFs. In contrast to Meteor, customized operators are currently not supported.

Some high level languages for expressing IE and DC tasks declaratively have also been developed. SystemT's AQL [10] and AJAX [13] are such languages and are both heavily influenced by SQL. Both languages provide relational operators in addition to operators tailored for the extraction and aggregation of relevant document spans or for data cleansing, respectively. XClean is tailored to perform DC in XML files [17]. Here, DC programs are written in XClean/PL using a library of pre-defined operators. XClean programs are then compiled to XQuery and executed using a standard XQuery processor.

6. CONCLUSION

In this paper, we presented a high level language layer for parallel data processing engines. Meteor, a declarative scripting language influenced by Jaql, enables end users to express sophisticated data flows. Sopremo, the underlying operator layer, provides a modular and highly extensible set of application-specific operators. Currently, pre-defined relational operators as well as packages for IE and DC are available. We have shown that operators from these packages can easily be used together in a single Meteor program to implement advanced use cases that require functionality from both packages.

Future work on Meteor and Sopremo will focus on extensions of their current capabilities as well as on optimization techniques and improved compatibility. One of our objectives is to provide packages that facilitate a variety of use cases like scientific data processing or business intelligence.
Instead of relying on a different domain specific language (DSL) for each use case, Meteor and Sopremo are meant to reach the efficiency of DSLs without sacrificing the flexibility of general purpose languages by providing tailor-made, yet interoperable domain specific operator packages.

Our current focus is the development of sophisticated natural language processing solutions. Beside the raw functionality, we also plan to create a metadata system that models precisely both the required input and the generated output properties of the data that is passed to and from each operator. The resulting ontology will allow for type checking to recognize the use of operators on incompatible data before runtime, e.g., the application of a syntactic parser that requires POS tags to a collection of documents without such tags.

Building on top of the metadata system, we would also like to optimize the execution of Meteor scripts to runtime parameters of their execution environment. Specifically, we intend to account for concurrent and repeated execution of Meteor programs. If one considers several Meteor programs that perform analysis tasks on periodically updated documents, it is very likely that these tasks share certain processing steps such as word tokenization, sentence splitting, or language detection. In this case, it could be beneficial to merge the corresponding portions of the programs into a single Meteor job that passes its result to the remaining portions of the original jobs.

In a similar fashion, the execution engine could employ the metadata system to monitor how often certain computations are performed on certain data sets. If frequent re-computations are observed, the system could decide to materialize the corresponding data set and replace the computations in the corresponding Meteor programs with data access routines.

Finally, we would like to enable compatibility with different execution environments. While currently each non-composite Sopremo operator can return its Pact implementation, we would like to extend this functionality to cover other execution engines as well. By adding implementations for different execution engines, such as Hyracks or Hadoop, for all non-composite operators, entire Sopremo programs could be executed on these engines.

7. ACKNOWLEDGMENTS

Arvid Heise and Astrid Rheinländer are funded by the German Research Foundation under grant "FOR 1036: Stratosphere – Information Management on the Cloud." Marcus Leich is funded by the European Institute of Technology (EIT). We would like to thank all members of the Stratosphere team, especially Volker Markl and Fabian Hueske, for valuable discussions and support.

8. REFERENCES
[1] D. Agrawal, P. Bernstein, E. Bertino, S. Davidson, U. Dayal, M. Franklin, J. Gehrke, L. Haas, A. Halevy, J. Han, H. Jagadish, A. Labrinidis, S. Madden, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, K. Ross, C. Shahabi, D. Suciu, S. Vaithyanathan, and J. Widom. Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States. http://imsc.usc.edu/research/bigdatawhitepaper.pdf, 2012.
[2] A. Alexandrov, D. Battré, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, and D. Warneke. Massively parallel data analysis with PACTs on Nephele. Proceedings of the VLDB Endowment (PVLDB), 3(2), 2010.
[3] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In Symposium on Cloud Computing (SoCC), 2010.
[4] A. Behm, V. R. Borkar, M. J. Carey, R. Grover, C. Li, N. Onose, R. Vernica, A. Deutsch, Y. Papakonstantinou, and V. J. Tsotras. Asterix: Towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases, 29(3), 2011.
[5] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. Proceedings of the VLDB Endowment (PVLDB), 4(12), 2011.
[6] K. S. Beyer, V. Ercegovac, R. Krishnamurthy, S. Raghavan, J. Rao, F. Reiss, E. J. Shekita, D. E. Simmen, S. Tata, S. Vaithyanathan, and H. Zhu. Towards a scalable enterprise content analytics platform. IEEE Data Engineering Bulletin, 32(1), 2009.
[7] J. Bleiholder and F. Naumann. Declarative data fusion – syntax, semantics, and implementation. In Advances in Databases and Information Systems (ADBIS), 2005.
[8] V. Borkar, M. J. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proceedings of the International Conference on Data Engineering (ICDE), 2011.
[9] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment (PVLDB), 1(2), 2008.
[10] L. Chiticariu, R. Krishnamurthy, Y. Li, S. Raghavan, F. R. Reiss, and S. Vaithyanathan. SystemT: An algebraic approach to declarative information extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2010.
[11] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Symposium on Operating Systems Design and Implementation (OSDI), 2004.
[12] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2007.
[13] H. Galhardas, D. Florescu, D. Shasha, and E. Simon. AJAX: An extensible data cleaning tool. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2000.
[14] A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of Map-Reduce: The Pig experience. Proceedings of the VLDB Endowment (PVLDB), 2(2), 2009.
[15] A. Heise and F. Naumann. Integrating open government data with Stratosphere for more transparency. Web Semantics: Science, Services and Agents on the World Wide Web, 2012.
[16] M. A. Hernández and S. J. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), 1995.
[17] M. Herschel and I. Manolescu. Declarative XML data cleaning with XClean. In Proceedings of the Conference on Advanced Information Systems Engineering (CAiSE), 2007.
[18] F. Hueske, M. Peters, M. J. Sax, A. Rheinländer, R. Bergmann, A. Krettek, and K. Tzoumas. Opening the black boxes in data flow optimization. In Proceedings of the International Conference on Very Large Databases (VLDB), 2012.
[19] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. Proceedings of the ACM International Conference on Special Interest Group on Operating Systems (SIGOPS), 41(3), 2007.
[20] G. J. Katz and J. T. Kider, Jr. All-pairs shortest-paths for large graphs on the GPU. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, 2008.
[21] B. O'Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith. From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of the Fourth International Conference on Weblogs and Social Media (ICWSM), 2010.
[22] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2008.
[23] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A warehousing solution over a Map-Reduce framework. Proceedings of the VLDB Endowment (PVLDB), 2(2), 2009.
[24] R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2010.
[25] D. Warneke and O. Kao. Nephele: Efficient parallel data processing in the cloud. In Workshop on Many-Task Computing on Grids and Supercomputers (SC-MTAGS), 2009.
[26] T. White. Hadoop: The Definitive Guide. First edition, 2009.
[27] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Symposium on Operating Systems Design and Implementation (OSDI), 2008.