Data streams permeate almost every aspect of computing and data communication. From medical imaging to YouTube videos, from stock market data to sports results, and from seismic sensor data to baby monitors, data streams play an increasing role in almost every aspect of our lives. They can be found in every application area imaginable, including medicine, finance, education, engineering, the physical sciences, media, entertainment and the arts. Although increasing processing capacity has made data stream processing more accessible, increases in storage capacities and network bandwidths are making it possible to store and access ever-increasing volumes of data. This in turn leads to an increase in demand for processing the available data streams; in the case of multimedia archives, for example, there is an acute need to index, search and summarise the data if it is to remain useful. Furthermore, new applications with increasing resource requirements will continue to emerge, and many of these applications will require processing resources beyond the capabilities of much of today's existing infrastructure.

Large organisations, such as media corporations and telecommunications providers, are building significant standing infrastructures for specialized data stream processing, but these are not suited to meeting the diverse and transient requirements of a broad range of users, from home musicians to scientific collaborators. Requirements of economy, flexibility, scalability, mobility and location militate against this approach. These users need techniques that allow them to dynamically build and control transient infrastructures for processing their data streams on demand. Questions arising from our previous work are:
• Where can the necessary compute power be found to meet these needs?
• Can cycle-harvesting (opportunistic) services handle these streams?
• Can the use of on-demand virtual machines assist?
• What are the real issues in frequency/volume?
• What kind of stream datatypes yield most benefit?
• What kind of opportunistic resources are best?
• What categories of stream execution are best handled this way?
• Can idle cores and GPUs be effectively exploited?

We will explore architectures that would allow elements from existing infrastructures to be accessed and composed on demand, enabling new possibilities for communities for whom it is infeasible, uneconomical or unnecessary to build a dedicated, standing infrastructure. For example, a school might use on-demand infrastructure for indexing, summarizing and searching an archive of audio and video streams from practical science experiments. Scientists might construct an on-demand infrastructure to process the output data stream from a radio telescope or particle accelerator, with the possibility of performing further stream processing to visualise the output. Alternatively, an on-demand infrastructure might be created to transform an archived video stream for playback on a low-power mobile device.

1 Image of An Bradán Feasa courtesy of Oisín Mac Suibhne and Alanna Avant, http://www.suibhne.com/suibhne.html

Figure 1: On-demand stream processing stack

We propose a comprehensive programme of research encompassing the end-to-end challenges presented by stream processing using infrastructure on demand. Figure 1 illustrates the relationship between three work packages:

WP1 – Applications  We will examine the requirements of a diverse range of applications involving the acquisition, synthesis and processing of data streams. A number of specific applications will be investigated as a means of identifying challenges and evaluating solutions.

WP2 – Infrastructure on demand  Supporting the requirements of a diverse range of applications is a challenging problem, especially as they may be executed on computing resources distributed both geographically and across administrative boundaries. We propose to address this challenge by investigating techniques for dynamically building infrastructure on demand. We will incorporate the use of existing infrastructure resources, but will also investigate the addition of hooks to existing infrastructures that enable the dynamic creation of virtual machines (VMs). Such VMs could be configured to meet the requirements of specific stream processing applications. Although such facilities would be useful across a range of processing models, special consideration will be given to the needs of data stream processing applications.

WP3 – Stream processing extensions  An infrastructure-on-demand facility, by itself, will not provide for the execution of complex stream processing applications. We will develop stream processing extensions that will provide (i) a framework within which applications can describe infrastructural, quality-of-service, fault-tolerance and other requirements; (ii) the ability to execute complex stream processing workflows using infrastructure on demand; and (iii) interfaces that will give both existing and new applications access to the stream processing facility.

1 WP1: Applications

This work package will investigate stream processing applications of various types, and will be led by Dr. Jonathan Dukes. Stream processing applications consist of stream acquisition, stream synthesis and stream processing. Figure 2 illustrates these stream flows.

Figure 2: Stream flows

Stream Acquisition

Streams of data come into existence either through the acquisition of external information or by synthesis. External information comes either from interfaces to the natural world or from other computing systems. Such information may already be in the form of a time series of discrete values, in which case it can be used directly as a data stream, or it may be a continuous variable that must be sampled to create a data stream suitable for further processing. Many scientific grand challenges produce extremely large data streams, of the order of petabytes per year. The prime global example is the CERN LHC [1]. Examples of interest to Irish scientists (from the recently submitted HEA PRTLI4 e-PSI proposal) are the LHCb experiment [2], HESS [3], CTA [4] and the Solar Dynamics Observatory (SDO) [5]. While these operate on a grand scale, smaller-scale examples such as streaming data from weather stations or from seismic sensors are also of interest. Several information and monitoring systems can already be used to acquire data streams, including the very widely deployed GridICE [6] and R-GMA [7, 8, 9] (the latter specifically targeted at streams). For the purposes of this research we propose to acquire H.323 [10] and AccessGrid [11] audio and video streams. For non-audio/video data streams we intend to acquire streams using R-GMA, and in particular to avail of the HESS, CTA and SDO sources if possible.

Stream Synthesis

The great majority of synthesised streams are audio or video streams. Common applications are audio synthesis (voice and music), avatar video synthesis, navigable world synthesis, and scientific visualisation. Audio is generally synthesised directly and fed to the sound subsystems of desktop PCs. Video stream synthesis is more demanding, and raises interesting questions. Quite complex video streams may be synthesised directly, e.g. using DirectX [12] or OpenGL [13], and displayed using PCs' graphics accelerators. Rendering engines can be used for more complex streams: for example, OpenGL can be rendered on clusters running Chromium [14]. For immersive visual environments a "cave" can be used. The various scales on which rendering can be performed are illustrated in Figure 3. "Caves" and desktop PCs serve only one user at a time per system, whereas rendering engines can be constructed to serve multiple users while scaling performance per user.

Stream Processing

Data streams, whether acquired or synthesised, may be processed to produce one or more new or transformed streams. Processing may be performed on general-purpose architectures or using specialized stream processing hardware (e.g. GPUs or CPUs with specialized stream processing capabilities). Processed streams may be processed further, rendered (e.g. in the case of video streams) or stored for future use.

1.1 Motivating Application Areas

A number of specific application areas will provide focal points for our investigation of on-demand infrastructure for stream processing. In each application area, we will identify specific requirements and challenges and develop proof-of-concept applications. Work in each of the identified application areas will build on past and ongoing research and expertise in the Department of Computer Science at Trinity College Dublin.

Figure 3: Rendering engine performance scale spectrum

1.1.1 Scientific Visualisation

Early work by the proposers has concentrated on dedicated multi-user rendering engines; recently, a multi-user, multi-modal, multi-scale grid-enabled visualisation engine [15] has been constructed, including a 9-node 2-D SCI torus running Chromium for rendering [16]. We will explore the possibility of integrating on-demand resources into rendering engines, which raises interesting questions:
• What are the limitations in creating on-demand engines at user-specified scales?
• Is it feasible to reliably run a complex engine such as Chromium on transient resources?
• Does this model permit the steering of visualisations?
• How do such engines scale in performance relative to the number of nodes?
• What are the key contributors to time-variance in setup, execution and teardown?

1.1.2 Multimedia

Use of multimedia data streams in entertainment, communication, education, scientific, medical and other applications is growing rapidly. IPTV services, which deliver familiar television services to clients over an IP network, are being rolled out by many service providers (e.g. BT, AT&T, France Telecom and Deutsche Telekom). New services, such as Joost [17], claim to change how we watch television. Digital music downloads are challenging physical media sales for market share. Services such as YouTube and Google Video provide channels for the distribution of this content. More recently, research has emerged in an area referred to as CARPE: capture (or continuous) archival and retrieval of personal experience [18]. MyLifeBits [19] is one example of ongoing work in this area. More everyday examples are audio and video acquisition for voice over IP (VoIP) using SIP protocols [20] or Skype [21], and videoconferencing using H.323 services like NetMeeting [22] and Polycom [23], or other systems such as VRVS [24], AccessGrid [11] or EVO [25].

Past successes in capturing, creating, storing and transporting multimedia data are driving an explosion in the volume of multimedia data that can be accessed. This in turn leads to a new "grand challenge" for multimedia research: to "make capturing, storing, finding, and using digital media an everyday occurrence in our computing environment" [26]. Although significant progress is being made in the areas of multimedia digital asset management and multimedia information retrieval, substantial challenges remain. Open challenges include querying video content based on temporal concepts (e.g. a sequence of shots in a dialogue between two people) and automated summarization of video content (although progress has been made in areas such as sports) [27]. A related challenge is the development of high-performance algorithms and infrastructures that can scale in tandem with the volume of multimedia content available [28]. We will explore on-demand solutions.
• Is the scalability of on-demand resources and associated algorithms sufficient to meet the growing multimedia and information management challenges?
• Is it feasible to perform multimedia processing on transient resources?

Despite developments in the "nuts & bolts infrastructure" that underpins today's multimedia streaming applications [26], distributing multimedia content is still demanding on network and server resources, particularly as new applications and services continue to increase the volume of data available. Content Distribution Networks (CDNs) (e.g. Akamai [29]) play an important role in allowing distributed multimedia streaming services to scale in the number of clients. By strategically replicating content at the "edges" of the network, close to clients, resource bottlenecks around central servers can be avoided. Peer-to-peer content distribution solutions, such as CoopNet [30] and SplitStream [31], provide an alternative approach, as do hybrid approaches that use peer-to-peer networks in conjunction with content distribution networks. Asymmetry and contention are important factors. We will explore the relationship between content distribution and content processing using on-demand resources, which poses further questions about locality:
• Will the availability of on-demand infrastructure for stream processing influence content distribution decisions?
• Conversely, will content distribution influence where we perform stream processing?

1.1.3 Applications Targeting Specialized Stream Processing Hardware

Two disparate example applications that can harness bitstream processing are cryptography and biotechnology. Both involve largely independent bulk data processing, though their data processing needs differ considerably, e.g. floating-point versus integer processing, sparse versus compact coherent memory access patterns. These application spaces place diverse demands on highly parallel architectures, and yet there is relatively little research relating them to the emerging hardware paradigm (GPUs, etc.). Early work by the proposers has started with a successful implementation of the AES private-key block cipher on pre-DirectX10 graphics processors [32]. We will extend this to other private-key block ciphers using post-DirectX10 compatible cards. On top of these cryptographic algorithms, various modes of operation will be applied. Public-key cryptographic algorithms such as RSA and Diffie-Hellman will also be developed. Indications are that suitable algorithms within this field will display significant increases in performance. As well as exploring offloading work to on-demand GPUs, etc., we will investigate extracting further parallelism by harnessing multiples of these on demand, and their application to encryption for tunnelled networking.

In biotechnology an interesting case is molecular docking. We will extend the AutoDock [33] implementation for emerging highly parallel hardware. This complements our existing research with RCSI (FightMalaria@Home [34]) to harness eHiTS [35], SGA [36] and BOINC [37]. AutoDock predicts the interactions of ligands with biomacromolecular targets for use in fields such as computer-aided drug design. The focus will be on the acceleration of the affinity grid generation algorithms used to represent energy interactions of probe atoms and macromolecules (see the sketch following the question below). We will also analyse the core docking algorithms, which use pre-calculated affinity grids in combination with ligands to determine energy interactions, the ultimate goal of docking. Currently the docking problem is one that trades accuracy of the docking procedure against processing requirements. AutoDock has been used in distributed computing applications such as FightAids@Home [38]. We hope to see improvements in work-unit throughput over current client approaches that run solely on the CPU, in a similar manner to the related Folding@Home project [39]. The ultimate (somewhat distant) goal is near-interactive visualisation and steering of docking computations, all exploiting on-demand stream processing resources.
• How effectively can GPUs, etc., be instanced within on-demand virtualised contexts?
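To indicate the kind of decomposition we have in mind, the minimal Python sketch below partitions an affinity-grid calculation into independent work units for dispatch to opportunistic GPU or CPU resources. All names here (WorkUnit, split_affinity_grid, prefers_gpu) are hypothetical illustrations; they are not part of AutoDock, BOINC or eHiTS.

```python
# Hypothetical sketch: partitioning an affinity-grid calculation into
# independent work units for opportunistic GPU/CPU resources.
from dataclasses import dataclass
from typing import List

@dataclass
class WorkUnit:
    probe_atom: str          # probe atom type whose affinity map is computed
    z_start: int             # first grid plane handled by this unit
    z_end: int               # last grid plane (exclusive)
    prefers_gpu: bool        # hint for the matchmaker/scheduler

def split_affinity_grid(probe_atoms: List[str], grid_planes: int,
                        planes_per_unit: int = 16) -> List[WorkUnit]:
    """Slice the grid along one axis so each unit can be computed independently."""
    units = []
    for atom in probe_atoms:
        for z in range(0, grid_planes, planes_per_unit):
            units.append(WorkUnit(atom, z,
                                  min(z + planes_per_unit, grid_planes),
                                  prefers_gpu=True))
    return units

if __name__ == "__main__":
    units = split_affinity_grid(["C", "N", "O", "H"], grid_planes=128)
    print(f"{len(units)} independent work units")   # 4 atoms x 8 slabs = 32
```

Each unit could then be advertised to the infrastructure-on-demand layer described in WP2 and executed wherever a suitable (GPU-capable, if possible) resource is found.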

2 WP2: Infrastructure on demand

This work package deals with enabling technologies for creating infrastructure on demand, and will be led by Dr. Stephen Childs. These technologies allow for the dynamic creation of stream processing systems by facilitating the creation and management of customised processing nodes; providing support for the description, discovery and allocation of resources; and allowing secure usage and accounting of resources across institutional boundaries. There are a number of trends that make infrastructure on demand an interesting problem. Firstly, applications are becoming more and more specialised. Even for a well-defined application (e.g. SIP-compliant Internet telephony), there are many competing implementations differentiated by factors such as functionality, implementation approach, or even just their appearance. Users typically have their own favourite application or software development framework, to which they remain loyal. Generic execution environments with a "lowest common denominator" software base force all users to adapt to an externally-imposed method of working. Ideally, users need to be able to comprehensively customise the environment in which their programs are run. In order to deploy customised environments, users must be able to describe their requirements in detail. This is currently an open problem. We will investigate suitable description mechanisms, both for manipulation by users and for automatic processing by software agents.

Another trend that cannot be ignored is the emergence of substantial standing infrastructures with well-established management and security systems. These may be as simple as a small office network, or as complex as a worldwide grid. Such infrastructures often have spare capacity, which could be made accessible (either for altruistic reasons or for financial reward) if suitable mechanisms were available. Such access should be provided securely, with minimal disruption to existing configurations. We will explore methods of "injecting" external execution environments into existing infrastructures in a manner that respects the operational policies of those infrastructures.

2.1 Virtual Machines On Demand

With the widespread availability of high-quality open-source virtual machine implementations such as Xen [40], the research frontier is now the coordination and management of large numbers of virtual machines. The Globus project is currently working on Virtual Workspaces [41], a generalised architecture for the management of virtual execution environments (VEEs), i.e. any dynamic, isolated, customisable space in which to perform processing. VEEs can be implemented in various ways: the simplest is a user account, and the most powerful is a fully-fledged virtual machine. The Virtual Workspaces software builds on the Globus Toolkit 4 (GT4) [42] web-services-based Grid software.

Data stream processing applications may require a specialised environment incorporating application-specific libraries and tools. Ideally, users would like to specify their environment requirements in detail, leaving the system to configure compute nodes accordingly. This may be possible by simple software installation on existing nodes, or it may require the creation of new compute nodes from filesystem images supplied by the user. The general availability of virtual machine (VM) technology makes the latter approach feasible. Isolated/tunnelled and customised networking may also be required.
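As a purely illustrative example of the level of detail such an environment specification might carry, the sketch below encodes a hypothetical requirement as a plain Python structure. None of the field names correspond to an existing schema (Quattor, GLUE or otherwise); they merely indicate the kind of information a user might supply.

```python
# Hypothetical environment-requirement description for a stream processing node.
# Field names are illustrative only and do not follow any existing schema.
environment_request = {
    "cpu": {"arch": "x86_64", "cores": 2, "min_clock_ghz": 2.0},
    "memory_mb": 2048,
    "accelerator": {"type": "gpu", "optional": True},    # use a GPU if available
    "os": {"family": "linux", "distribution": "any"},
    "software": [                                         # packages the image must provide
        {"name": "ffmpeg", "version": ">=0.4"},
        {"name": "python", "version": ">=2.4"},
    ],
    "network": {
        "bandwidth_mbps": 100,        # sustained throughput needed for the stream
        "isolated_vlan": True,        # tunnelled/custom networking, as discussed above
    },
    "lifetime_hours": 6,              # how long the VM should remain provisioned
}

def summarise(req: dict) -> str:
    """Produce a one-line summary suitable for logging or advertisement."""
    return (f"{req['cpu']['cores']}x{req['cpu']['arch']}, "
            f"{req['memory_mb']} MB RAM, {len(req['software'])} packages")

print(summarise(environment_request))
```

A description of this kind could either be satisfied by installing software on an existing node or be used to select and localise a filesystem image, as discussed below.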
We will explore the integration of VM technology with existing remote job execution systems (such as Condor or gLite). Our aim is to provide dynamic creation and configuration of compute nodes and networks in a manner compatible with existing middleware, to enable the on-demand creation of stream processing infrastructure composed of multiple custom compute nodes and virtual networks. A range of processing environments could be made available, varying in customisability, stability and longevity. The problem of efficiently deploying arbitrarily customisable execution environments is currently an active research area. There is little agreement on how to describe an execution environment. We envisage that resource providers and clients will need a range of options: automated installation of new execution environments from scratch using standard package repositories; deployment of node instances based on "template" filesystems; and deployment of fully-configured, certified, signed filesystems (virtual appliances) that can be rapidly deployed with minimal localisation.

A potential starting point for machine description is the XML format used for configuration profiles in the Quattor [43] fabric management system. This schema incorporates hardware information, service configuration, network parameters and software package lists, providing a comprehensive picture of a machine's configuration. Profiles could be shipped to hosting servers, which could either install the machines from scratch or use copy-on-write techniques to rapidly reconfigure an existing stored filesystem image. References to software packages could be adapted to allow standard packages to be sourced from a local mirror rather than from the original location.

User requirements will drive the need for virtual resources with a variety of properties, from long-running Xen VMs running on specialised hosting nodes to transient, lightweight Cooperative Linux [44] nodes using idle capacity on desktop PCs. We will need to explore the use of general-purpose tools that can create and manage VMs hosted on a variety of technologies.

We will investigate techniques for securely managing and deploying virtual machine images in a manner compatible with existing system administration procedures. This work will build on an existing deployment of Condor on Cooperative Linux virtual machines hosted on PCs in undergraduate computer labs. Solutions will first be tested in a testbed environment and will be migrated to a production environment as soon as they have proved stable. This will allow for analysis of user experience and for realistic interactions with system administrators, ensuring the approach taken will be deployable and suitable for widespread adoption.

Figure 4 shows a possible architecture for a Sruth resource provider site. Software managers access an entry point directly to install template images into the local image database and to control the deployment of virtual services. Sruth publishes into supported information systems: the entry point advertises the location of the services it supports and the image database advertises the capabilities of installed system images.

Figure 4: Possible Site Architecture

Remote users access the site using standard methods: the site appears to have the superset of capabilities available on installed images. When a job is targeted to the site, the entry point negotiates with the local resource management system (LRMS) to instantiate new VM(s) on compute containers if needed, and to target the job to the appropriate virtual node. The VM containers are physical machines configured with a VMM and a network-accessible management interface. VMs for long-running services are hosted on service containers; VMs for compute nodes are hosted on compute containers. Entry points on multiple sites would allow rapid provision of resources based on time-varying demand. Building on standard protocols (e.g. WS-Agreement) would enable uniform negotiation of short- and long-term virtual leases across multiple sites.
• How can we manage the scale and variety of resources made available through on-demand techniques?
• Can on-demand services be made sufficiently responsive to meet future stream processing and protocol demands?

2.2 Discovery and Allocation of On-Demand Virtual Resources

The use of dynamic virtual machines and the exploitation of idle capacity introduce extremely distributed, transient resources covering a wide spectrum of processing power and configurations. Therefore, providers need methods for describing the resources they offer so that users can locate the specific resources they require. Conversely, users need to be able to describe the execution environment that they need to run their code. Cooperating parties need to be able to agree on a common language so that expectations can be met. This implies a need for resource description schemas for use in resource description, discovery and customisation.

The Condor system [45] exploits idle resources to process high-throughput batch jobs. It uses ClassAds to describe both resources and user requirements. A matchmaker compares job requirements to available resources and selects the best place to run the job. In the Grid world, resource brokers play a similar role to the Condor matchmaker, but on a wider scale; they search global information systems to match job requirements with capabilities published by sites. The GLUE schema is used to describe resource capabilities; a job description language (e.g. JDL, a modified version of ClassAds) is used to describe job requirements.

We aim to exploit resources located on a variety of infrastructures and owned by a wide range of providers. Our approach will need to take into account the following realities: the existence of multiple resource management middlewares (which requires interoperability services); the complexity of application workflows (which requires workflow engines); and the diversity of policies (which requires support for federation and translation). We will investigate an agent-based approach, where resources will be represented by production agents; these will interact with social agents who act on behalf of users. Social agents negotiate to acquire resources, acting to fulfil the requirements of users within the framework of policies specified by themselves and by resource providers. Therefore, much of the complexity of resource allocation should be handled by autonomous agents that negotiate with target infrastructures on behalf of clients, taking into account any constraints provided by the user. This approach builds on existing work [46, 47, 36, 48] that explores such issues in a Grid environment, where production agents typically model entry points to Grid infrastructures. We aim to extend this work to incorporate hierarchies of production agents reaching down to basic infrastructural elements such as virtual machines. The higher levels are discussed further in Section 3.

These agents will largely operate on the control plane, and so will not be directly involved in the transport of streamed data. However, it will be necessary to feed information important for streamed data into the pool of properties navigated by agents. In addition to machine properties such as compute power, memory size and installed software, network-related properties such as round-trip time (RTT) will need to be published to allow locality to be taken into account when matchmaking. We will also investigate methods for bridging between the differing schemas used by various target infrastructures. The resources published at a particular location will fluctuate due to the presence of transient compute resources and virtual machines.
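The following minimal Python sketch illustrates the style of requirement/resource matchmaking described above, including a round-trip-time attribute so that locality can influence ranking. It uses simplified dictionaries rather than real ClassAd, GLUE or JDL syntax, and all attribute names are hypothetical.

```python
# Simplified matchmaking sketch (not real ClassAd/GLUE/JDL syntax; attribute
# names are illustrative only).
resources = [
    {"name": "vm-lab-01",  "cores": 2, "memory_mb": 2048, "gpu": False, "rtt_ms": 3},
    {"name": "vm-grid-17", "cores": 8, "memory_mb": 8192, "gpu": True,  "rtt_ms": 45},
    {"name": "vm-lab-02",  "cores": 4, "memory_mb": 4096, "gpu": True,  "rtt_ms": 5},
]

job_requirements = {"min_cores": 2, "min_memory_mb": 4096, "needs_gpu": True}

def matches(resource, req):
    """True if a published resource satisfies the job's requirements."""
    return (resource["cores"] >= req["min_cores"]
            and resource["memory_mb"] >= req["min_memory_mb"]
            and (resource["gpu"] or not req["needs_gpu"]))

def best_match(resources, req):
    """Among matching resources, prefer the lowest RTT (locality-aware ranking)."""
    candidates = [r for r in resources if matches(r, req)]
    return min(candidates, key=lambda r: r["rtt_ms"]) if candidates else None

print(best_match(resources, job_requirements))   # -> vm-lab-02
```

In practice the published attributes would fluctuate as transient resources come and go, which is one reason why independent verification of advertised capacity (discussed next) becomes important.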
We will also explore the enforcement of "truth in advertising" within a resource description framework. In many current information systems, the resource provider is wholly responsible for ensuring that the descriptions they publish are accurate: there is no independent verification of their claims. As networks of cooperation expand and the links of trust between resource providers and clients stretch, it will be increasingly important to validate that the resources being offered actually provide the capacity advertised. We will investigate methods of automatically submitting tamper-proof benchmarks to independently validate the performance of advertised resources.
• What level of detail about execution environments is needed by stream processing applications?
• Can we flexibly and securely manage allocations or reservations of time to users from a wide range of communities?

2.3 Generic Execution Services for Stream Processing

The VM-on-demand services will directly support existing stream data processing applications on standard CPUs by providing the customised execution environments needed. For example, a multimedia streaming client will specify its processing, memory and networking requirements, as well as all the software it requires (e.g. specific versions of the OS and libraries).

We will explore the integration of virtual machine invocation with existing distributed execution frameworks. Automatic creation of VMs based on user requirements will provide transparency for application developers. We will also investigate policies for satisfying jobs with low-latency requirements. For example, it may be desirable to pre-allocate a number of VMs within a site to immediately service such jobs without the overhead of VM creation; this is analogous to the thread pool maintained by a web server for handling download requests (a minimal sketch follows the questions below).

As well as these generic execution services, extensions will be needed to support specific requirements of stream processing applications (e.g. timeliness, composition of services and workflows). These extended services are the subject of Section 3.
• How can automatic invocation of custom VMs be integrated into dedicated and opportunistic distributed execution systems?
• What are the best policies for enabling low-latency access to custom execution environments?
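As a minimal sketch of the pre-allocation policy mentioned above, the Python fragment below maintains a small pool of "warm" VMs so that low-latency jobs can bypass VM creation. The class and method names are hypothetical and do not correspond to any existing middleware.

```python
import collections

class WarmVMPool:
    """Hypothetical pool of pre-created VMs, analogous to a web server's thread pool."""

    def __init__(self, create_vm, warm_size=4):
        self._create_vm = create_vm               # callable that provisions one VM (slow path)
        self._warm_size = warm_size
        self._idle = collections.deque(create_vm() for _ in range(warm_size))

    def acquire(self):
        """Return a ready VM immediately if one is warm, otherwise create one (slow)."""
        if self._idle:
            return self._idle.popleft()
        return self._create_vm()

    def release(self, vm):
        """Return a VM to the pool, or discard it if the pool is already full."""
        if len(self._idle) < self._warm_size:
            self._idle.append(vm)
        # else: the VM would be torn down to free capacity (omitted here)

# Example usage with a stand-in for real VM provisioning.
counter = iter(range(1000))
pool = WarmVMPool(create_vm=lambda: f"vm-{next(counter)}", warm_size=2)
vm = pool.acquire()       # served from the warm pool, no creation delay
pool.release(vm)
```

The appropriate warm-pool size, and whether pre-allocated VMs are generic or specialised per application class, are exactly the kinds of policy question this sub-task will study.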

2.4 Specialised Execution Services for Stream Processing

The demanding requirements of multimedia applications have driven the development of specialised stream processing hardware, to the extent that such hardware is now almost ubiquitous. For example, modern CPUs have built-in instruction set support for stream processing, many commodity computers include a high-performance graphics processing unit (GPU), and specialised devices for encoding and decoding media streams are commonplace. We will explore methods for making this largely untapped stream processing capacity available on demand alongside general-purpose processing resources.

The primary bitstream processing architectures of interest are commodity GPUs and the Cell Broadband Engine. Both provide a highly parallel hardware solution, though in differing ways. The GPU applies a more rigid parallel processing model, imposing various restrictions on memory access and flow control flexibility. These restrictions allow a greater proportion of transistor logic to be expended on processing, and thus the theoretical performance of the GPU is higher than that of both the Cell and CPU architectures. Intel recently announced a prototype 80-core processor which is similar to the GPU design (currently up to 128 processors), consisting of a large array of identical processing cores. This approach yields better theoretical performance, but poses the greatest software development difficulty in attempting to leverage the potential throughput. The Cell offers a compromise by combining a single general-purpose processor with eight simpler processors on a single die, which eases its adoption by existing and new applications. In practice, future architectures will most likely lie somewhere between the current multi-core CPUs and the simpler, more numerous cores of the GPU approach. Sruth will be targeted at both the GPU and the Cell.

As part of our infrastructure-on-demand framework, we will explore techniques for making specialised processors available for stream processing. This will include description of such processors' capabilities to enable them to be advertised.
• How can GPUs and Cell processors be made accessible within a distributed computing environment?
• How can GPUs and Cell processors be securely shared between VMs and hosting machines?

2.5 Security for On-Demand Virtual Resources

Authorisation, authentication and accounting (AAA) are fundamental to any distributed computing infrastructure. It is essential that resource providers can reliably identify prospective clients, can control access in a fine-grained manner, and can track resource usage and enforce quotas. The introduction of transient resources and inter-institutional usage patterns adds new dimensions to the already-complex problem of distributed AAA.

National research and education networks (NRENs) such as HEAnet are pioneering federated identity schemes (e.g. eduroam). The Virtual Organisation Membership Service (VOMS) is an authorisation framework developed in the Grid community that supports the creation of roles using extensions to standard X.509 grid proxies. Globus Toolkit 4 provides good support for specifying and evaluating complex policies. Maintaining end-to-end trust relationships between clients and resource providers across multiple target infrastructures is a challenge. There are also issues of trust involved in sending potentially sensitive data to be processed on transient nodes; encapsulation and isolation of execution environments can help.

Authorisation is a problem with many dimensions. At its most basic, authorisation consists of access control decisions: is a particular user allowed access to the system based on the credentials they present? However, this is actually a special case of a more interesting question: how much access should a particular user be granted? Traditionally, operating systems have supported a rather static set of permissions (e.g. read, write and execute) on resources; in a streaming environment, permissions must include concepts of timeliness, allowing resource shares and response times to be specified. We will investigate the specification of such parameters, and their enforcement on varied resources. It will also be important to investigate translation to and from schemes already commonly used by resource providers. Such policies will provide a foundation for higher-level resource management systems such as agent-based schemes.

In addition to basic access control and resource allocation issues, resource providers need the ability to control the kinds of applications that run on their systems. A site that is unwilling to allow arbitrary code to be run on its resources may be willing to run specific trusted applications (e.g. scientific code exploring an area of research in line with the goals of the site). In this case, the site will need to be able to identify incoming applications or virtual appliances securely, and to restrict access to recognised code signed by a trusted party.
• Can on-demand approaches co-exist with existing security mechanisms and policies?
• Can on-demand environments provide sufficient isolation from hosting computers and networks to ensure security is not compromised?

3 WP3: Stream Processing Extensions

WP3 will investigate extension services for supporting stream processing applications on top of infrastructure on demand. Gabriele Pierantoni, the principal developer of Social Grid Agents, will lead this work package. Infrastructure on demand provides for the execution of jobs within customized processing environments, which can be configured to meet specified requirements. An infrastructure-on-demand facility by itself, however, will not provide for the execution of stream processing workflows, which may have requirements relating to, for example, inter-node dependencies, end-to-end quality-of-service and fault-tolerance. An analogous problem can be found in the Condor high-throughput batch processing system: while Condor provides for the batch execution of jobs on idle computing resources, it does not take into account data dependencies between two or more jobs. Condor's DAGMan meta-scheduler extends the Condor batch processing system to provide such functionality.

Similar provisions will be needed to facilitate the execution of stream processing workflows using infrastructure on demand. These provisions, which we refer to as stream processing extensions, must provide a means of executing both existing and new stream processing applications using infrastructure on demand. The extensions, however, will need to go much further than simple job execution. Applications will need a rich framework in which to describe end-to-end quality-of-service, fault-tolerance and other parameters and requirements. Mechanisms will also be required for composing complex workflows from available infrastructural resources, according to specific high- and low-level criteria and across multiple infrastructural providers. To the greatest extent possible, applications that use existing frameworks and communication protocols should be supported, alongside applications that are specifically developed to use the proposed infrastructure-on-demand facility.

3.1 Stream Processing Workflow Description

The workflow topology for a stream processing application can be represented as a directed graph, where nodes represent the processing elements required by the application and edges represent the flow of data streams between nodes. A stream processing workflow may be as trivial as a single node with one or more input data streams and zero or more output data streams. More complex workflows may describe schemes for parallel stream processing on distributed resources using, for example, pipelining, multiplexing or hybrid approaches.

For example, multimedia workflow nodes may represent multimedia stream sources (e.g. a media file, a network-attached video camera or a remote multimedia archive), stream processing tasks (e.g. video colour adjustment or overlay mixing) and stream sinks (e.g. a device that will render the stream or a remote multimedia archive). Indeed, this approach is used by existing multimedia processing frameworks, such as Microsoft's DirectShow SDK [49]. Although stream processing jobs described in this manner are typically executed on a single, stand-alone computer, distributing multimedia processing tasks over networked computers has been proposed in the past (e.g. [50]).
A workflow description for an application will be used throughout the lifetime of its execution, from the initial provision of infrastructure on demand, through the scheduling of individual workflow jobs, to monitoring and managing the execution of the workflow and accounting for the resources consumed. We will build on schemes for describing both batch processing jobs (e.g. Condor ClassAds) and workflows (e.g. Chromium) to provide a framework for describing stream processing workflows. The proposed framework must go beyond simply describing the workflow topology. Application developers, administrators and users will need to be able to specify detailed infrastructural requirements and scheduling parameters for each processing node, as well as quality-of-service, locality, fault-tolerance and other criteria (an illustrative sketch follows the question below).
• How does one describe the diverse requirements of individual applications?
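As a purely illustrative sketch of such a description, the Python fragment below represents a small multimedia workflow as a directed graph with per-node requirements and per-edge quality-of-service annotations. The structure and field names are hypothetical; they do not follow ClassAds, DirectShow or any existing workflow language.

```python
# Hypothetical workflow description: a directed graph with per-node
# requirements and per-edge QoS annotations (field names are illustrative).
workflow = {
    "nodes": {
        "camera":    {"kind": "source", "requirements": {"location": "lab-3"}},
        "transcode": {"kind": "process",
                      "requirements": {"cores": 4, "gpu": True, "software": ["ffmpeg"]},
                      "fault_tolerance": {"restart_on_failure": True}},
        "archive":   {"kind": "sink", "requirements": {"storage_gb": 500}},
    },
    "edges": [
        {"from": "camera", "to": "transcode",
         "qos": {"max_latency_ms": 200, "min_bandwidth_mbps": 25}},
        {"from": "transcode", "to": "archive",
         "qos": {"min_bandwidth_mbps": 8}},
    ],
}

def sources(wf):
    """Nodes with no incoming edges; a scheduler would start placement here."""
    targets = {e["to"] for e in wf["edges"]}
    return [n for n in wf["nodes"] if n not in targets]

print(sources(workflow))   # -> ['camera']
```

A richer framework would add scheduling parameters, locality hints and accounting metadata to the same graph structure, which is the subject of the following sub-sections.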

3.2 Stream Processing Scheduler

The underlying processing facilities exposed as infrastructure on demand will each have their own mechanisms for scheduling jobs for execution. While the infrastructure-on-demand facility will hide the differences between disparate resource providers, the underlying systems are still oriented towards batch processing. With few exceptions (such as Condor's DAGMan and WebCom-G), these underlying systems do not support the execution of workflows.

Scheduling complex stream processing workflows will require a stream processing scheduler that operates at a higher level, above the scheduling of individual processing jobs. Since our proposed approach is to dynamically configure infrastructure on demand to meet specific application requirements, the stream processing scheduler will need to fulfil an expanded role, requesting resources from the infrastructure-on-demand facility to meet application requirements. Although there are many efficient constraint-driven algorithms (e.g. Dijkstra's algorithm, dynamic programming or hill climbing), the extent to which resources can be selected independently by a scheduler, using only a static workflow description provided by an application, is an interesting open question. For example, it seems likely that the resource requirements for a simple scientific visualisation application, servicing a single user, could be specified sufficiently using a workflow description. A multimedia stream processing application, however, that interacts with a content distribution network to service multiple users, might benefit from playing an active role in the selection of resources, particularly with regard to locality.

Ultimately, the stream processing scheduler will have to instantiate each node in the workflow description by scheduling a job for execution on the underlying processing resources (a placement sketch follows the questions below). Furthermore, on-demand infrastructures created to service a particular user request may be long-running and may be required to execute independently of an interactive user session. Stream processing workflows will need to be monitored to ensure they execute reliably and within the performance or quality-of-service bounds specified in the application's workflow description. Support for redundancy and failover must be provided to enable processing to continue even if nodes fail. This will be particularly important if transient nodes (e.g. compute nodes provided by a Condor pool) are used in the computation. How can node failure be hidden from applications and users? What measures are needed to provide sufficient fault-tolerance? What impact will these measures have on application performance?

Monitoring the execution of stream processing workflows will also provide accounting information for processing, communication, storage and other resources. This might conceivably be required, for example, to pay for underlying resources used by the application. A high-level approach that performs accounting across entire workflows, rather than individual nodes, will be required.
• How can the large space of potential execution environments be efficiently searched to find the best place for a particular process to run?
• How can we compose the located resources to construct a stream processing workflow that satisfies individual application requirements?
• How can we exploit spatial locality when selecting the best place for a particular process to run?
• What happens when a node's execution is pre-empted to return control to its rightful user?
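To make the scheduler's expanded role concrete, the following minimal Python sketch greedily places each node of a workflow (in the style of the description sketched in Section 3.1) onto the candidate resource with the lowest RTT that satisfies the node's requirements. It is a deliberately naive illustration with hypothetical names, not the algorithm we expect to adopt.

```python
# Naive, illustrative workflow placement: for each node, pick the matching
# resource with the lowest round-trip time. All names are hypothetical.
def satisfies(resource, requirements):
    return (resource["cores"] >= requirements.get("cores", 1)
            and (resource["gpu"] or not requirements.get("gpu", False)))

def place_workflow(nodes, resources):
    """Map each workflow node name to a resource name, or raise if none fits."""
    placement = {}
    for name, spec in nodes.items():
        candidates = [r for r in resources if satisfies(r, spec["requirements"])]
        if not candidates:
            raise RuntimeError(f"no resource satisfies node '{name}'")
        placement[name] = min(candidates, key=lambda r: r["rtt_ms"])["name"]
    return placement

nodes = {
    "camera":    {"requirements": {}},
    "transcode": {"requirements": {"cores": 4, "gpu": True}},
    "archive":   {"requirements": {"cores": 1}},
}
resources = [
    {"name": "vm-lab-02",  "cores": 4, "gpu": True,  "rtt_ms": 5},
    {"name": "vm-grid-17", "cores": 8, "gpu": True,  "rtt_ms": 45},
    {"name": "vm-lab-01",  "cores": 2, "gpu": False, "rtt_ms": 3},
]
print(place_workflow(nodes, resources))
# -> camera and archive on vm-lab-01, transcode on vm-lab-02
```

A real scheduler would additionally consider inter-node edge QoS, fault-tolerance requirements, and the possibility of pre-emption on opportunistic nodes; the sketch only shows why per-node matchmaking alone is insufficient.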

3.3 The Social Grid Agent Model

The initial prototype of Social Grid Agents (Pierantoni et al., 2007 [46, 47, 36, 48]) addresses the problem of the growing complexity of grid computing (multiple grids, complex jobs, multiple actors and different allocation scenarios) subject to various constraints (different relationships among resource owners, institutions and users; and the need for a flexible and scalable solution able to adapt to future middleware with unknown characteristics). The architecture, described in Figure 5, divides the agents into two broad categories with different behaviours and relationships: Production Grid Agents and Social Grid Agents.

Figure 5: Social and Production Grid Agents model.

Production Grid Agents are wrappers around one or more existing grid services, such as job submission, security and file storage, and are controlled by one or more Social Grid Agents. Social Grid Agents evaluate the requests that they receive, compute the required factors and engage in social and economic transactions to obtain the needed factors. Once the factors are obtained, the Social Grid Agents instruct the Production Grid Agents under their control to define a production topology that encompasses all the needed services.

Social Grid Agents are not only concerned with the exchange of factors and the definition of production topologies; they are also used to define social topologies capable of reflecting the diverse and possibly complex relationships among resource owners, users and institutions.

On close inspection, this model appears very well suited to the requirements of the Stream Processing Scheduler. A Production Agent provides implicit interoperability, as arbitrary low-level services can be wrapped in their own specific agent; the stream processing layer is thus isolated from underlying details. Social Agents support complex, high-level, market-oriented resource allocation. They can take the global view or, if needed, a more in-depth local view. They support workflow across differing stream processing infrastructures. They may react to changing conditions by reallocations or redundant allocations, particularly when degradation occurs. We therefore propose to explore this model in the context of stream processing, not just for scheduling but also for coarse-grain workflow and other decision making.
• Can an agent-based approach meet the complex matchmaking requirements presented by on-demand stream processing resources?

3.4 Access Mechanisms and Application Interfaces

Many of the applications that we will explore are already supported by existing software frameworks and protocols. For example, a multimedia streaming client might use the RTSP protocol to request and control the playback of a multimedia stream. Similarly, a visualisation application might use the Chromium parallel rendering system. If a stream processing facility is to be widely adopted, it should support existing applications with little or no modification, in addition to new applications developed specifically to target the use of infrastructure on demand. The provision of access mechanisms that give existing applications access to the infrastructure-on-demand facility will be an important aspect of our proposed research.

For example, a media player client on a mobile device, which uses the RTSP protocol to request the playback of a video stream, requires the video stream to be transcoded before it can be rendered. Limited processing, memory and network capacity, however, prohibits transcoding the stream on the device itself. Instead, the RTSP request to set up the stream could trigger the execution of a stored workflow description (a sketch follows the questions below). The workflow retrieves the stream identified by the mobile client, transcodes it according to the parameters supplied in the RTSP request, and delivers the transcoded stream to the client.

An alternative example is provided by the Chromium system [14] for interactive rendering on clusters of workstations, which uses graph-based representations of stream processing workflows. A Chromium configuration usually launches a mothership instance and one or more crappfaker and crserver instances. Chromium provides a facility that partially automates this procedure. By extending this facility, existing Chromium configurations would be able to make use of our proposed infrastructure-on-demand facility.
• What interfaces will be required to make the service available to application developers and end users?
• Can the programming overhead of using this framework, and of gaining its advantages in full, be kept to an acceptable level?
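The sketch below illustrates, in hypothetical Python, how an RTSP SETUP request might be mapped onto a stored workflow template whose parameters (resolution, bitrate) are filled in from the client's request. It is not based on any real RTSP library, nor on a finalised Sruth interface; all names are illustrative.

```python
# Hypothetical access mechanism: an RTSP SETUP request selects and
# parameterises a stored transcoding workflow. No real RTSP stack is used.
WORKFLOW_TEMPLATES = {
    "transcode-for-mobile": {
        "nodes": ["archive-source", "transcode", "rtp-sink"],
        "parameters": {"resolution": None, "bitrate_kbps": None},
    },
}

def handle_rtsp_setup(stream_url, client_capabilities):
    """Choose a workflow template and fill in parameters from the client request."""
    template = WORKFLOW_TEMPLATES["transcode-for-mobile"]
    instance = {
        "source": stream_url,
        "nodes": list(template["nodes"]),
        "parameters": {
            "resolution": client_capabilities.get("resolution", "320x240"),
            "bitrate_kbps": client_capabilities.get("bitrate_kbps", 256),
        },
    }
    # In a full system this description would be handed to the stream
    # processing scheduler for placement on on-demand resources.
    return instance

print(handle_rtsp_setup("rtsp://archive.example/lecture01",
                        {"resolution": "176x144", "bitrate_kbps": 128}))
```

The point of the sketch is that the existing client remains unmodified: it speaks RTSP as before, while the access mechanism translates the request into a workflow instantiation behind the scenes.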

3.5 Stream Query Services

Streams can be queried, for example as in the Borealis [51] and TelegraphCQ [52] stream query processing systems. If a data stream has a flat structure composed of known simple fields, then it may be viewed as a serialised two-dimensional table of rows and columns. Queries could be posed that select only rows of interest. Once posed, a query could remain in force until explicitly terminated. This query processing could be considered to be a three-port filter component inserted into the data stream, with input, query and output ports. Multiple filters could be composed, i.e. input and output ports could be chained, into parallel/serial data flow graphs, each filter subject to its own query. Intuitively, a filter is a black box with these ports (a sketch is given after the questions below).

Naively, one might construct production agents that instantiate the components, and then feed the data between agents as needed. But then the data stream flows via the stream services layer rather than the infrastructure layers, thereby mixing what should be orthogonal layers. A better approach might be to construct agents that instantiate the components and simply direct the data flows, but a more interesting approach would be to construct production agents that deal only with the definition of the ports, not their instantiation or operation, i.e. to construct Stream Schema Agents. The actual instantiation of components and creation of flows would be devolved to associated stream processing and transport technologies. The schema could be generalised to the definition of an input table, an operation and an output table. Query operations could be executed by query engines. Other operations could be executed by appropriate on-demand resources (i.e. execution units). The question of how to dynamically compose the three ports then maps to a resource matching and allocation problem that is amenable to solution using market-oriented social agents.

Our initial unpublished research has explored the use of R-GMA [7, 8, 9], a stream-oriented relational grid information system that we have contributed to since its inception. Input streams are inserted into R-GMA source tables using SQL INSERT statements, and output streams to destination tables result from queries using SQL SELECT statements. We will explore the instantiation of such components, the execution of data stream flows, dynamic queries, query stream flows, and the use of an LRMS (such as Condor) with query stream flows, with and without on-demand virtual resources, and with and without specialised stream processors. Query streams are a very interesting concept.
• How can one dynamically compose query stream ports?
• How can one synchronise data and query stream flows?
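The three-port filter idea can be sketched conceptually as follows. The Python fragment below models a filter with input, query and output ports, where the query is a simple predicate over row dictionaries standing in for an SQL WHERE clause. It is not R-GMA, Borealis or TelegraphCQ code, and all names are illustrative.

```python
# Conceptual three-port filter: rows arrive on the input port, the query port
# holds a standing predicate (a stand-in for an SQL WHERE clause), and
# selected rows leave via the output port. Not based on any existing system.
class StreamFilter:
    def __init__(self, query):
        self.query = query           # query port: predicate over a row dict
        self.downstream = []         # output port: chained filters or sinks

    def connect(self, consumer):
        self.downstream.append(consumer)
        return consumer

    def push(self, row):             # input port
        if self.query(row):
            for consumer in self.downstream:
                consumer.push(row)

class PrintSink:
    def push(self, row):
        print(row)

# Compose two filters in series, each subject to its own standing query.
temperature = StreamFilter(lambda r: r["sensor"] == "temperature")
alarms = temperature.connect(StreamFilter(lambda r: r["value"] > 30.0))
alarms.connect(PrintSink())

for row in [{"sensor": "temperature", "value": 21.5},
            {"sensor": "temperature", "value": 33.2},
            {"sensor": "seismic", "value": 0.7}]:
    temperature.push(row)            # only the 33.2 reading reaches the sink
```

In the Stream Schema Agent approach discussed above, agents would deal only with the definition of the input, query and output ports; the instantiation of components like these, and the transport of rows between them, would be devolved to the underlying stream processing and transport technologies.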

4 Infrastructure

The proposers have specialist expertise in e-infrastructures through their involvement in running the national computational Grid infrastructure (Grid-Ireland), where Dr. Coghlan is Director of the Operations Centre and Dr. Childs is the Deputy Grid Manager. The Grid-Ireland infrastructure provides an existing set of static resources, to which we propose to add on-demand virtual machines. The methodology for test and deployment will follow Grid-Ireland's standard process. In this model, development and experimental work is performed within the isolated TestGrid environment [53]; the certified services that result are then rolled out to the production grid using an automated deployment architecture [54]. In addition, the proposers currently operate an expanding set of opportunistic on-demand resources hosted on VMs in undergraduate teaching laboratories. These resources, which run the Condor middleware, have been successfully used for executing jobs, both independently and in conjunction with the production EGEE middleware. All of the hosting machines include graphics accelerators (GPUs); we intend to add a Cell processor and to explore techniques for making all of these available as the nucleus of a national on-demand stream processing capacity.
• Can on-demand stream processing resources be harnessed by a national Grid?

5 Project Management

The project will be organised into work packages (WPs) as described above. Each WP will have a WP leader, who is responsible for managing the WP actions and team members, and for ensuring the successful execution of project tasks. In addition, the WP leaders will be responsible for managing the research output of their WP; this will include publication in appropriate conferences and journals. All project members will make use of the existing infrastructure for collaborative work available within the CAG research group: this includes a wiki, web server, source control (CVS, Subversion) and issue tracking systems (e.g. Request Tracker or Trac). The provisional assignment of work package leaders is as follows:

WP  Leader               Expertise
1   Jonathan Dukes       Real-time streaming, multimedia, opportunistic computing
2   Stephen Childs       Virtual machines, Grid, system management software
3   Gabriele Pierantoni  Software agents, Web services, Globus Toolkit

The programme of research will be conducted within the Computer Architecture and Grid Research Group of Trinity College Dublin (https://www.cs.tcd.ie/research_groups/cag/), and so governance and management are very self-contained. The project coordinator (Dr. Brian Coghlan, research group leader) has overall responsibility for the project schedule and budget, as well as for reporting to SFI. He will track project expenditure and provide the interface to administrative officers at SFI and TCD. He will also have overall responsibility for research outputs, in particular liaison with TCD's Research Office to ensure that the maximum benefit is obtained from IP generated by the project. The technical progress of the project will be monitored by the Project Executive, composed of the proposers. They are responsible for taking a project-wide view of progress, monitoring expenditure, and coordinating the protection of IP. The regular operational meetings are as follows:
• Weekly WP Progress Updates: short per-WP review meetings for the members of each WP, focussing on technical progress. Collaborators from other WPs will participate when required.
• Monthly Project Management Meetings: detailed technical assessments of project progress relative to milestones, with a focus on WP interdependencies and associated risks.
• Half-Yearly All Hands Meetings: presentation of technical results achieved by WPs to the whole project team, to promote deeper collaboration and feedback within the project.

Major milestones and deliverables are (M for milestone / D for deliverable):

Year  Item  Detail
1     M1.1  Completion of requirements analysis
      D1.1  Analysis of motivating applications requirements
      D1.2  Analysis of infrastructural requirements of stream processing applications
      D1.3  Analysis of stream processing extension (SPE) requirements
2     M2.1  Completion of basic stream processing prototypes
      D2.1  Prototype of motivating streaming applications
      D2.2  Prototype of basic infrastructure-on-demand (IOD) services
      D2.3  Prototype of basic stream processing extensions
3     M3.1  Advanced stream processing prototypes
      D3.1  Advanced IOD services, e.g. agent-based discovery, specialized processing architectures
      D3.2  Advanced SPEs, e.g. workflow agents, access mechanisms, stream query processing
4     M4.1  Performance analysis, software tuning, documentation and final demonstration system
      D4.1  Documentation, final services and final demonstration system
      D4.2  Final motivating applications, integrated with on-demand services

6 Gantt Chart

Figure 6: Project Gantt Chart

References

[1] Large Hadron Collider (LHC). URL http://lhc.web.cern.ch
[2] The Large Hadron Collider beauty experiment (LHCb). URL http://lhcb.web.cern.ch
[3] HESS Project. URL http://www.mpi-hd.mpg.de/hfm/HESS/HESS.html
[4] Cherenkov Telescope Array (CTA). URL http://www.mpi-hd.mpg.de/hfm/CTA/
[5] Solar Dynamics Observatory (SDO). URL http://sdo.gsfc.nasa.gov/
[6] GridICE. URL http://gridice.forge.cnaf.infn.it/
[7] S. Fisher, Relational model for information and monitoring (2001).
[8] B. Coghlan, A. Djaoui, S. Fisher, J. Magowan, M. Oevers, Time, information services and the grid, in: Proc. BNCOD 2001 - Advances in Database Systems, edited by O'Neill, K.D., and Read, B.J., RAL-CONF-2001-003, Oxford, 2001.
[9] The Relational Grid Monitoring Architecture (R-GMA). URL http://www.r-gma.org/
[10] H.323. URL http://www.itu.int/rec/T-REC-H.323/e
[11] Access Grid. URL http://www.accessgrid.org/
[12] Microsoft DirectX. URL http://msdn.microsoft.com/directx
[13] OpenGL. URL http://www.opengl.org
[14] G. Humphreys, M. Houston, Y. Ng, R. Frank, S. Ahern, P. Kirchner, J. Klosowski, Chromium: A stream processing framework for interactive graphics on clusters, in: Proceedings of the 29th International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 2002.
[15] R. Watson, S. Maad, B. Coghlan, Multiscale multimodal visualization on a grid, in: Proc. Cracow Grid Workshop 2006 (CGW06), Academic Computer Centre CYFRONET AGH, Cracow, Poland, 2006.
[16] R. Watson, S. Maad, B. Coghlan, Integrating a common visualization service into a metagrid, in: Proc. Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA06), Umea, Sweden, 2006.
[17] Joost. URL http://www.joost.com
[18] J. Gemmell, H. Sundaram, Capture, archival, and retrieval of personal experience, IEEE Multimedia (2006) 18–19.
[19] J. Gemmell, G. Bell, R. Lueder, MyLifeBits: a personal database for everything, Communications of the ACM 49 (1) (2006) 88–95.
[20] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J. Peterson, R. Sparks, M. Handley, E. Schooler, SIP: Session Initiation Protocol, IETF RFC 3261 (June 2002).
[21] Skype. URL http://www.skype.com/
[22] Microsoft NetMeeting. URL http://www.microsoft.com/windows/NetMeeting/default.asp
[23] Polycom. URL http://www.polycom.com
[24] VRVS. URL http://www.vrvs.org
[25] EVO. URL http://evo.vrvs.org
[26] L. Rowe, R. Jain, ACM SIGMM retreat report on future directions in multimedia research, ACM Transactions on Multimedia Computing, Communications and Applications 1 (1) (2005) 3–13.
[27] A. Smeaton, Techniques used and open challenges to the analysis, indexing and retrieval of digital video, Information Systems Journal (in print).
[28] M. Lew, N. Sebe, C. Djeraba, R. Jain, Content-based multimedia information retrieval: state of the art and challenges, ACM Transactions on Multimedia Computing, Communications and Applications (2006) 1–19.
[29] G. Palis, A. Vakali, Insight and perspectives for content delivery networks, Communications of the ACM 49 (1) (2006) 101–106.
[30] V. Padmanabhan, H. Wang, P. Chou, K. Sripanidkulchai, Distributing streaming media content using cooperative networking, in: 14th International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV), 2004.
[31] M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron, A. Singh, SplitStream: High-bandwidth multicast in cooperative environments, in: 19th ACM Symposium on Operating Systems Principles (SOSP), 2003.
[32] O. Harrison, J. Waldron, AES encryption implementation and analysis on commodity graphics processing units, in: Proc. Workshop on Cryptographic Hardware and Embedded Systems, Vienna, Austria, 2007 (to appear).
[33] AutoDock. URL http://autodock.scripps.edu
[34] FightMalaria@Home. URL http://bioinformatics.rcsi.ie:8080/mmg/faces/Projects.jsp?form1:relatedLinkMalaria_submittedField=form1:relatedLinkMalaria
[35] eHiTS. URL http://www.simbiosys.ca/ehits/
[36] G. Pierantoni, E. Kenny, B. Coghlan, A prototype of a social and economic based resource allocation system in grid computing, in: Proc. 6th International Symposium on Parallel and Distributed Computing (ISPDC 2007), IEEE Computer Society Press, Hagenberg, Austria, 2007.
[37] BOINC. URL http://boinc.berkeley.edu/
[38] FightAids@Home. URL http://fightaidsathome.scripps.edu
[39] Folding@Home. URL http://folding.stanford.edu
[40] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, A. Warfield, Xen and the Art of Virtualization, in: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, ACM, 2003.
[41] K. Keahey, I. Foster, T. Freeman, X. Zhang, Virtual Workspaces: Achieving Quality of Service and Quality of Life in the Grid, Scientific Programming Journal.
[42] I. Foster, Globus Toolkit version 4: Software for service-oriented systems, in: IFIP International Conference on Network and Parallel Computing, no. 3779 in LNCS, Springer-Verlag, 2005, pp. 2–13.
[43] R. G. Leiva, M. B. Lopez, G. C. Melia, B. C. Marco, L. Cons, P. Poznanski, A. Washbrook, E. Ferro, A. Holt, Quattor: Tools and techniques for the configuration, installation and management of large-scale Grid computing fabrics, Journal of Grid Computing 2 (4).
[44] Cooperative Linux. URL http://www.colinux.org/
[45] M. Litzkow, M. Livny, M. Mutka, Condor - A hunter of idle workstations, in: Proceedings of the 8th International Conference on Distributed Computing Systems, 1988.
[46] G. Pierantoni, E. Kenny, B. Coghlan, An agent-based architecture for grid societies, in: PARA 06, Umea, Sweden, 2006.
[47] G. Pierantoni, B. Coghlan, O. Lyttleton, D. O'Callaghan, E. Kenny, S. Maad, G. Quigley, Social grid agents as a metagrid technology: An approach for flexible resource allocation in heterogeneous grid middlewares, in: Proc. Cracow Grid Workshop 2006 (CGW06), Academic Computer Centre CYFRONET AGH, Cracow, Poland, 2006.
[48] G. Pierantoni, E. Kenny, B. Coghlan, An architecture based on a social-economic approach for flexible grid resource allocation, in: Cracow Grid Workshop (CGW05), Cracow, Poland, 2005.
[49] Microsoft DirectShow. URL http://msdn2.microsoft.com/en-us/library/ms783323.aspx
[50] M. Lohse, M. Repplinger, P. Slusallek, An open middleware architecture for network-integrated multimedia, in: Joint International Workshop on Interactive Distributed Multimedia Systems / Protocols for Multimedia Systems, 2002.
[51] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, S. Zdonik, The design of the Borealis stream processing engine, in: Proceedings of the 2nd Biennial Conference on Innovative Data Systems Research (CIDR'05), 2005.
[52] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, M. Shah, TelegraphCQ: Continuous dataflow processing for an uncertain world, in: Proceedings of the 1st Biennial Conference on Innovative Data Systems Research (CIDR'03), 2003.
[53] S. Childs, B. Coghlan, D. O'Callaghan, G. Quigley, J. Walsh, E. Kenny, A virtual TestGrid, or how to replicate a national grid, in: Proceedings of the EXPGRID Workshop on Experimental Grid Testbeds for the Assessment of Large-Scale Distributed Applications and Tools, Paris, 2006.
[54] B. Coghlan, J. Walsh, D. O'Callaghan, The Grid-Ireland deployment architecture, in: P. M. Sloot, A. G. Hoekstra, T. Priol, A. Reinefeld, M. Bubak (Eds.), Advances in Grid Computing - EGC 2005, LNCS 3470, Springer, Amsterdam, The Netherlands, 2005.
