An Event-Driven ETL System for the On-Line Processing of Heterogeneous Sensor Data

Demonstration StreamLoader: An Event-Driven ETL System for the On-line Processing of Heterogeneous Sensor Data M. Mesiti, L. Ferrari, S. Valtolina, M. S. Dao, K. Zettsu G. Licari, G. Galliani Universal Communication Research Institute Dip. di Informatica, Università di Milano, Italy NICT, Kyoto, Japan {mesiti,lferrari,valtolina}@di.unimi.it {dao.minhson,zettsu}@nict.go.jp ABSTRACT a geographical area), in thematics (data about traffic jams ETL (Extraction-Transform-Load) tools, traditionally de- vs data about pollutions). Therefore, there is the need of veloped to operate offline on historical data for feeding Data- ETL (Extract-Transform-Load) operations that can be ap- warehouses, need to be enhanced to deal with big and fresh plied on data streams for their reconciliation. These opera- data and be executed at network level during data streams tions should be applied during data acquisition and bound acquisition. In this paper, we present StreamLoader, a Web with reactive capabilities in order to properly identify the application for the specification of conceptual ETL dataflows relevant streams when abnormal events occur and under- on heterogeneous sensor data that leverages the peculiari- take the proper actions. Finally, the specification and actu- ties of network configuration, data stream management, and ation of the ETL operations should be efficiently performed specification and deployment of ETL operations in a pro- on-line and on fresh and timely data in order to properly grammable network. It can be used for feeding traditional/ handling big real-time data streams. All these technical re- real-time data-warehouses or visual analytic tools. quirements should be addressed in graphical, user-friendly environments supporting the user in the design and execution of the operations. 1. MOTIVATION Many systems have been proposed for configuring pro- Nowadays we are witnesses of the proliferations of differ- grammable networks ([4, 2, 9]), for data stream management ent sensor devises able to produce heterogeneous types of (e.g. Niagara [14], TelegraphCQ [5], Borealis [1]), for the data that can be profitable used for detecting, handling and specification and actuation of ETL operations (e.g. [16, 10, advising people of the verification of events such as natu- 13]) and dataflow (e.g. Talend www.talend.com, Pentaho- ral disasters (like flooding, storming, extreme temperatures kettle www.pentaho.com, CoverETL www.cloveretl.com), etc.), traffic congestions (due to accidents, strikes, football and for complex event processing (e.g. Apache S4 [15], matches), and social web interactions. Beside the physi- Storm [12], StreamBase www.tibco.com). However, all of cal sensors, able to detect data about physical phenomena them are quite complex to use, seldom provide web GUIs for (like, temperature, humidity, wind, rain, pressure, level of designing and monitoring dataflows and are not integrated see water), there is a proliferation of social sensors able to in a single tool. This limits their use in the management of collect data from people (like, twitter data, traffic informa- emergency situations. tion, train or flight schedule). These data are characterized In this paper we propose StreamLoader, a Web applica- both from the temporal, spatial and thematic dimensions tion for the specification of conceptual ETL dataflows on that can be exploited from for the identification of meaning- heterogeneous sensors to be applied during data stream ac- ful events in a given context. quisition on a programmable network. It exploits different Several challenges should be faced for handling sensors technologies (AngularJS, Cytoscape, SparkJava) for prov- and their data especially in emergency situations. First, ing the graphical environment. StreamLoader is equipped sensors (both physical and social) are located in different with an interactive environment that supports the user in networks and made available by different institutes, agen- charge of handling events to discover the sensors useful in cies and NPOs. In this context, network configuration, sen- a given situation, specify the adequate dataflow for extract- sor detection and discovery are difficult issues to be solved. ing, filtering, integrating, (eventually) storing, and analyze Moreover, data produced by sensors are heterogeneous in the data coming from the identified sensors, optimize the structures (different types), in spatial and/or temporal gran- schedule for the execution of the dataflow and visualize the ularities (e.g. temperature in a room versus temperatures in results. By exploiting samples produced by the involved sensors, the user can easily debug the developed dataflow. Once the dataflow is consistent (i.e. it can be soundly ac- tivated at network level), the translation is automatically invoked. Then, the underlying network is configured and the processes associated with the ETL operators executed. At this point, logs of the execution of the dataflow compo- c 2016, Copyright is with the authors. Published in Proc. 19th Inter- nents are directly shown in the interactive environment to national Conference on Extending Database Technology (EDBT), March provide statistics on the dataflow execution. 15-18, 2016 - Bordeaux, France: ISBN 978-3-89318-070-7, on OpenPro- ceedings.org. Distribution of this paper is permitted under the terms of the In the remainder, Section 2 presents the requirements we Creative Commons license CC-by-nc-nd 4.0 Series ISSN: 2367-2005 628 10.5441/002/edbt.2016.65 started from. The main characteristics of the system and of the interactive environment are discussed in Section 3. Section 4 presents the features of the system we wish to demonstrate and discusses the system significance. 2. StreamLoader REQUIREMENTS Starting from the idea to develop a tool that is easy to use for people without a computer science background, the following requirements have been posed at the basis of the development of our tool. Dynamic (and automatic) configuration of ETL dataflow on events. Starting from several data streams, the user in- terface should offer the possibility to identify relevant sources of information depending on the verification of events. For example, suppose we have several sensors for detecting the humidity and temperature of a given area; apparent tem- Figure 1: StreamLoader Architecture perature represents the temperature that is perceived by humans and depends on both temperature and humidity. language and processing techniques to the domain of net- The computation and acquisition of the apparent tempera- working. In [8] an extension of the declarative network- ture in a given area can be triggered when the temperature ◦ ing [4] approach has been proposed that consists of two is greater than 24 C. Events can be used both for trigger- components: declarative service networking (DSN) and net- ing or stopping the acquisition and elaboration of streams. work control protocol stacks (SCN). DSN provides a method Moreover, ETL dataflows should be generated on the fly to model and describe a high-level network of information depending on the needs, immediately actuated and its ex- services for an application, which includes service discov- ecution monitored in the same tool. The dataflow should ery, service monitoring, execution control, and service mes- become \live" and give execution feedbacks to the user. sage exchanges. SCN aims at capturing application require- ETL operations for integrating heterogeneous streams. ments and requesting appropriate configuration to the net- The heterogeneity of the data flows requires the application work platform more directly and effectively. The network of common operations developed in the context of data inte- control protocol stack interprets the DSN description and gration and data fusion in order to identify different repre- dynamically coordinates the network configurations, such sentations of the same real world entities. For this purpose, as data flows, segmentations, and QoS parameters. The the system should be equipped with transformation opera- dataflow graphically described by the user should be then tions: (1) for changing the unit of measure (e.g. from yards translated in DSN/SCN language to be actuated in the net- to meters) or geographical coordinates (from one standard to work and monitored. another one); (2) for introducing virtual properties relying on the values assumed by other attributes (e.g. the appar- 3. StreamLoader OVERVIEW ent temperature discussed above); (3) for checking that data Architecture. Figure 1 reports the architecture of the Stream- conform to given validation rules (e.g. dates conforming to Loader system. At the bottom there is a network. Each given patterns). Moreover, operation for filtering data rely- node of the network is in charge of managing a bunch of ing on different conditions should be included and for culling sensors and can execute the proposed ETL stream process- data belonging to a temporal interval or a geographical area ing operations. Sensors are handled through a distributed according to a reducing factor. Finally, operations for aggre- publish-subscribe system [3]. Each time a sensor is pub- gating and joining streams should be included for combining lished, its type, schema, and frequency of data generation information coming from different streams. are made available to subscribers. Discovery of sensor data sources. The large amount of Our Web environment is made available for the design and sensors with different levels of availability that can be mon- monitoring of dataflows. When a conceptual dataflow

An Event-Driven ETL System for the On-Line Processing of Heterogeneous Sensor Data

The Application of Data Fusion Method in the Analysis of Ocean and Meteorology Observation Data

Causal Inference and Data Fusion in Econometrics∗

Data Fusion – Resolving Data Conflicts for Integration

Data Mining and Fusion Techniques for Wsns As a Source of the Big Data Mohamed Mostafa Fouada,B,E,F, Nour E

Data Fusion and Modeling for Construction Management Knowledge Discovery

Data Fusion by Matrix Factorization

An Overview of Iot Sensor Data Processing, Fusion, and Analysis Techniques

Information Acquisition in Data Fusion Systems

CS 520 Data Integration, Warehousing, and Provenance

Cheminformatics and the Semantic Web: Adding Value with Linked Data and Enhanced Provenance

Information Quality Assessment for Data Fusion Systems

The Use of Data Fusion on Multiple Substructural Analysis Based GA Runs