Efficient Stream Data Management


From Big Data to Fast Data: Efficient Stream Data Management

Alexandru Costan

To cite this version: Alexandru Costan. From Big Data to Fast Data: Efficient Stream Data Management. Distributed, Parallel, and Cluster Computing [cs.DC]. ENS Rennes, 2019. tel-02059437v2

HAL Id: tel-02059437
https://hal.archives-ouvertes.fr/tel-02059437v2
Submitted on 14 Mar 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

MathSTIC Doctoral School

HABILITATION À DIRIGER DES RECHERCHES (habilitation to supervise research)
Discipline: Computer Science
Presented before the École Normale Supérieure de Rennes, under the seal of Université Bretagne Loire, by Alexandru Costan
Prepared at IRISA (Institut de Recherche en Informatique et Systèmes Aléatoires)

Defended in Bruz on 14 March 2019, before a jury composed of:

  • Rosa Badia / reviewer: Research Director, Barcelona Supercomputing Center, Spain
  • Luc Bougé / examiner: University Professor, ENS Rennes, France
  • Valentin Cristea / examiner: University Professor, Université Politehnica de Bucarest, Romania
  • Christian Pérez / reviewer: Research Director, Inria, France
  • Michael Schöttner / reviewer: University Professor, Université de Düsseldorf, Germany
  • Patrick Valduriez / examiner: Research Director, Inria, France

Abstract

This manuscript provides a synthetic overview of my research journey since my PhD defense.
The document does not claim to present my work in its entirety, but focuses on the contributions to data management in support of stream processing. These results address all stages of the stream processing pipeline: data collection and in-transit processing at the edge, transfer towards the cloud processing sites, ingestion and persistent storage. I start by presenting the general context of stream data management in light of the recent transition from Big to Fast Data. After highlighting the challenges at the data level associated with batch and real-time analytics, I introduce a subjective overview of my proposals to address them. They bring solutions to the problems of in-transit stream storage and processing, fast data transfers, distributed metadata management, dynamic ingestion and transactional storage. The integration of these solutions into functional prototypes and the results of the large-scale experimental evaluations on clusters, clouds and supercomputers demonstrate their effectiveness for several real-life applications ranging from neuroscience to LHC nuclear physics. Finally, these contributions are put into the perspective of the High Performance Computing - Big Data convergence.

Keywords: Big Data; stream processing; storage; data management; data analytics; transactions; data transfers; metadata management; in-transit processing; workflow management; HPC.

Contents

Foreword

1 Introduction
  1.1 The need for real-time processing
    1.1.1 Motivating use-case: autonomous cars
    1.1.2 Solution: stream computing in real-time
  1.2 The challenge of data management for streams
  1.3 Mission statement
  1.4 Objectives

Part I. Context: Stream Processing in the Clouds

2 Big Data Processing: Batch-based Analytics of Historical Data
  2.1 Batch processing with MapReduce: the execution model for Big Data
    2.1.1 MapReduce extensions
  2.2 Big Data processing frameworks
    2.2.1 From Hadoop to Yarn
    2.2.2 Workflow management systems
  2.3 Big Data management
    2.3.1 Data storage
    2.3.2 Data transfer
  2.4 Discussion: challenges

3 The World Beyond Batch: Streaming Real-Time Fast Data
  3.1 Stream computing
    3.1.1 Unbounded streaming vs. bounded batch
    3.1.2 Windowing
    3.1.3 State management
    3.1.4 Correctness
  3.2 Fast Data processing frameworks
    3.2.1 Micro-batching with Apache Spark
    3.2.2 True streaming with Apache Flink
    3.2.3 Performance comparison of Spark and Flink
    3.2.4 Other frameworks
  3.3 Fast Data management
    3.3.1 Data ingestion
    3.3.2 Data storage
  3.4 Discussion: challenges

4 The Lambda Architecture: Unified Stream and Batch Processing
  4.1 Unified processing model
    4.1.1 The case for batch-processing
  4.2 Limitations of the Lambda architecture
    4.2.1 High complexity of two separate computing paths
    4.2.2 Lack of support for global transactions
  4.3 Research agenda

Part II. From Sensors to the Cloud: Stream Data Collection and Pre-processing

5 DataSteward: Using Dedicated Nodes for In-Transit Storage and Processing
  5.1 A storage service on dedicated compute nodes
    5.1.1 Design principles
    5.1.2 Architectural overview
    5.1.3 Zoom on the dedicated nodes selection in the cloud
  5.2 In-transit data processing
    5.2.1 Data services for scientific applications
  5.3 Evaluation and perspectives
    5.3.1 Data storage evaluation
    5.3.2 Gains of in-transit processing for scientific applications
    5.3.3 Going further

6 JetStream: Fast Stream Transfer
  6.1 Modelling the stream transfer in the context of clouds
    6.1.1 Zoom on the event delivery latency
    6.1.2 Multi-route streaming
  6.2 The JetStream transfer middleware
    6.2.1 Adaptive batching for stream transfers
    6.2.2 Architecture overview
  6.3 Experimental evaluation
    6.3.1 Individual vs. batch-based event transfers
    6.3.2 Adapting to context changes
    6.3.3 Benefits of multi-route streaming
    6.3.4 JetStream in support of a real-life LHC application
    6.3.5 Towards stream transfer "as a Service"

7 Small Files Metadata Support for Geo-Distributed Clouds
  7.1 Strategies for multi-site metadata management
    7.1.1 Centralized Metadata (Baseline)
    7.1.2 Replicated Metadata (on Each Site)
    7.1.3 Decentralized, Non-Replicated Metadata
    7.1.4 Decentralized Metadata with Local Replication
    7.1.5 Matching strategies to processing patterns
  7.2 One step further: managing workflow hot metadata
    7.2.1 Architecture
    7.2.2 Protocols for hot metadata
    7.2.3 Towards dynamic hot metadata
  7.3 Implementation and results
    7.3.1 Benefits of decentralized metadata management
    7.3.2 Separate handling of hot and cold metadata

Part III. Scalable Stream Ingestion and Storage

8 KerA: Scalable Data Ingestion for Stream Processing
  8.1 Impact of ingestion on stream processing
    8.1.1 Time domains
    8.1.2 Static vs. dynamic partitioning
    8.1.3 Record access
  8.2 KerA overview and architecture
    8.2.1 Models
    8.2.2 Favoring parallelism: consumer and producer protocols
    8.2.3 Architecture and implementation
    8.2.4 Fast crash recovery for low-latency continuous processing
  8.3 Experimental evaluation
    8.3.1 Setup and methodology
    8.3.2 Results
    8.3.3 Discussion

9 Týr: Transactional, Scalable Storage for Streams
  9.1 Blobs for stream storage
  9.2 Design principles and architecture
    9.2.1 Predictable data distribution
    9.2.2 Transparent multi-version concurrency control
    9.2.3 ACID transactional semantics
    9.2.4 Atomic transform operations
  9.3 Protocols and implementation
    9.3.1 Lightweight transaction protocol
    9.3.2 Handling reads: direct, multi-chunk and transactional protocols
    9.3.3 Handling writes: transactional protocol, atomic transforms
    9.3.4 Implementation details
  9.4 Real-time, transactional data aggregation in support of system monitoring
    9.4.1 Transactional read/write performance
    9.4.2 Horizontal scalability

Part IV. Perspectives

10 Stream Storage for HPC and Big Data Convergence
  10.1 HPC and BDA: divergent stacks, convergent storage needs
    10.1.1 Comparative overview of the HPC and BDA stacks
    10.1.2 HPC and BDA storage
    10.1.3 Challenges of storage convergence between HPC and BDA
  10.2 Blobs as a storage model for convergence
    10.2.1 General overview, intuition and methodology