DMLA: A Dynamic Model-Based Lambda Architecture for Learning and Recognition of Features in Big Data

DMLA: A DYNAMIC MODEL-BASED LAMBDA ARCHITECTURE FOR LEARNING AND RECOGNITION OF FEATURES IN BIG DATA

A THESIS IN Computer Science

Presented to the Faculty of the University of Missouri-Kansas City in partial fulfillment of the requirements for the degree MASTER OF SCIENCE

By RAVI KIRAN YADAVALLI
B.Tech, Jawaharlal Nehru Technological University – Hyderabad, India, 2013

Kansas City, Missouri 2016

©2016 RAVI KIRAN YADAVALLI ALL RIGHTS RESERVED

Ravi Kiran Yadavalli, Candidate for the Master of Science Degree
University of Missouri-Kansas City, 2016

ABSTRACT

Real-time event modeling and recognition is a major research area that has yet to reach its full potential. In the search for systems that can meet the challenges posed by data growth, several big data ecosystems have evolved. These ecosystems currently span various architectural models, each aimed at solving a particular real-time problem. There is increasing demand for a dynamic architecture that combines real-time processing and computational intelligence in a single workflow to handle fast-changing business environments effectively. To the best of our knowledge, there has been no attempt to support a distributed machine-learning paradigm by separating learning and recognition tasks using big data ecosystems. The focus of our study is to design a distributed machine-learning model by evaluating various machine-learning algorithms for event detection learning and predictive analysis with different features in audio domains. We propose an integrated architectural model, called DMLA, to handle real-time problems; it can enrich the information level while reducing the overhead of dealing with diverse architectural constraints.

The DMLA architecture is a variant of the Lambda Architecture that combines the power of Apache Spark, Apache Storm (Heron), and Apache Kafka to handle massive amounts of data using both streaming and batch processing techniques. The primary aim of this study is to demonstrate how DMLA recognizes real-time, real-world events (e.g., fire alarm alerts, babies needing immediate attention) that require a quick response from users. Detection of contextual information and dynamic use of the appropriate model are distributed among the components of the DMLA architecture. In the DMLA framework, a dynamic predictive model, learned from the training data in Spark, is loaded, based on the context information, into a Storm topology to recognize or predict possible events. The event-based, context-aware solution was designed for real-time, real-world events. In Spark-based learning, the highest accuracy among several machine-learning models was over 80%, and the Storm topology model achieved a recognition rate of 75% at best. We verify that the proposed architecture is effective for real-time event-based recognition in audio domains.

APPROVAL PAGE

The faculty listed below, appointed by the Dean of the School of Computing and Engineering, have examined a thesis titled “DMLA: A Dynamic Model-based Lambda Architecture for Learning and Recognition of Features in Big Data” presented by Ravi Kiran Yadavalli, candidate for the Master of Science degree, and certify that in their opinion, it is worthy of acceptance.

Supervisory Committee

Yugyung Lee, Ph.D., Committee Chair, School of Computing and Engineering
Yongjie Zheng, Ph.D., School of Computing and Engineering
Sejun Song, Ph.D., School of Computing and Engineering

TABLE OF CONTENTS

ABSTRACT
ILLUSTRATIONS
TABLES
1. INTRODUCTION
  1.1 Motivation
  1.2 Problem Statement
  1.3 Proposed Solution
2. BACKGROUND AND RELATED WORK
  2.1 Terminology
  2.2 Related Work
    2.2.1 Big Data Streaming Tools and Frameworks
    2.2.2 Evaluation on Current Stream Processing Frameworks
3. PROPOSED FRAMEWORK
  3.1 Overview
  3.2 Dynamic Recognition
  3.3 Feature Extraction Flow
  3.4 Apache Spark Workflow
  3.5 Apache Storm Workflow
  3.6 Apache Kafka and REST API
  3.7 Features on JAudio
  3.8 Context Aware Model
    3.8.1 Home Context
    3.8.2 Classroom Context
    3.8.3 Outdoor Context
    3.8.4 Office Context
    3.8.5 Contextual features
4. RESULTS AND EVALUATION
  4.1 Apache Spark
    4.1.1 Machine Learning Algorithms
  4.2 Evaluation
    4.2.1 Feature Based Analysis
    4.2.2 Audio File VS Feature Data
5. CONCLUSION AND FUTURE WORK
  5.1 Conclusion
  5.2 Limitations
  5.3 Future Scope
REFERENCES
VITA

ILLUSTRATIONS

Figure 1: Hadoop vs Spark Runtime Performance
Figure 2: Storm Topology Architecture
Figure 3: Streaming Applications Workflow
Figure 4: Lambda Architecture