MULTIMEDIA PROCESSING USING HPCC SYSTEMS

by

Vishnu Chinta

A Thesis Submitted to the Faculty of

The College of Engineering & Computer Science

In Partial Fulfillment of the Requirements for the Degree of

Master of Science

Florida Atlantic University

Boca Raton, FL

May 2017

Copyright by Vishnu Chinta 2017


ACKNOWLEDGEMENTS

I would like to express gratitude to my advisor, Dr. Hari Kalva, for his constant guidance and support in helping me complete this thesis successfully. His enthusiasm, inspiration and his efforts to explain all aspects clearly and simply helped me throughout.

My sincere thanks also go to my committee members Dr. Borko Furht and Dr. Xingquan Zhu for their valuable comments, suggestions and inputs to the thesis.

I would also like to thank LexisNexis for making this study possible by providing funding, tools and all the technical support I have received. I would also like to thank all my professors during my time at FAU for the invaluable lessons I have learnt in their classes.

This work has been funded by the NSF Award No. 1464537, Industry/University Cooperative Research Center, Phase II under NSF 13-542. I would like to thank the NSF for this.

Many thanks to my friends at FAU and my fellow lab mates in the Multimedia lab for the enthusiastic support and interesting times we have had during the past two years.

Last but not least, I would like to thank my family for being a constant source of support and encouragement during my Master's.

ABSTRACT

Author: Vishnu Chinta

Title: Multimedia Big Data Processing Using HPCC Systems

Institution: Florida Atlantic University

Thesis Advisor: Dr. Hari Kalva

Degree: Master of Science

Year: 2017

More data is being created now than ever before, and this data can take any form: textual, multimedia, spatial, and so on. To process this data, several big data processing platforms have been developed, including Hadoop, based on the MapReduce model, and LexisNexis’ HPCC Systems.

In this thesis we evaluate the HPCC Systems framework with a special interest in multimedia data analysis and propose a framework for multimedia data processing.

It is important to note that multimedia data encompasses a wide variety of data, including but not limited to image, video, audio, and even textual data. While developing a unified framework for such a wide variety of data, we have to consider the computational complexity of dealing with the data. Preliminary results show that HPCC can potentially reduce this computational complexity significantly.

MULTIMEDIA BIG DATA PROCESSING USING HPCC SYSTEMS

LIST OF FIGURES

CHAPTER 1

INTRODUCTION

1.1 Motivation

1.2 Main Contributions

1.3 Overview of Thesis

1.4 Glossary of Terms

CHAPTER 2

BACKGROUND AND RELATED WORK

2.1 Hadoop Ecosystem

2.2 HPCC Systems

2.3 Similar Studies

2.3.1 HIPI - Hadoop Image Processing Interface

2.3.2 CURT MapReduce: Caching and Utilizing Results of Tasks for MapReduce on Cloud Computing

2.3.3 Optimized Multiple Platforms for Big Data Analysis

2.3.4 MapReduce Vocabulary Tree: An Approach for Large Scale Image Indexing and Search in the Cloud

2.3.5 Extensible Video Processing Framework in Apache Hadoop

CHAPTER 3

PROBLEM DESCRIPTION

3.1 Multimedia Data Processing

3.2 Multimedia Big Data in HPCC

3.3 Proposed System Architecture

CHAPTER 4

PERFORMANCE EVALUATION

4.1 Experimental Setup

4.2 Implementation

4.3 Dataset

4.4 Dataset Processing

4.5 Decoding Image Data

4.6 Query Processing

4.6.1 Metadata Processing

4.6.2 Feature Extraction

4.7 Data Visualization

CHAPTER 5

CONCLUSION AND FUTURE WORK

5.1 Conclusion

5.2 Other Applications

5.3 Future Work

BIBLIOGRAPHY

LIST OF FIGURES

Figure 2.1 Hadoop Distributed File System Architecture

Figure 2.2 HPCC THOR Cluster

Figure 2.3 How HIPI parallelizes image processing

Figure 3.1 Proposed System Architecture

Figure 4.1 Sample images from dataset

Figure 4.2 Spray Timings

Figure 4.3 Execution graph generated in ECL Watch

Figure 4.4 Impact of dataset size on query processing

Figure 4.5 Execution times for finding images with occurrences of 'Cat' in the United States

Figure 4.6 Sample Images returned from query

Figure 4.7 Sample Visualization

LIST OF TABLES

Table 4.1 Spray Timings

Table 4.2 Sample of Query Results

Table 4.3 Breakdown of query execution times for 100M records

Table 4.4 Execution times for different input record sizes

Table 4.5 Execution times for query finding images with occurrences of 'Cat' in the United States

CHAPTER 1

INTRODUCTION

Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include different types, such as structured/unstructured and streaming/batch, and different sizes, from terabytes to petabytes. Big data is a term applied to datasets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency. Big data is generally defined using the 3Vs (volume, variety and velocity): volume refers to the amount of data, variety refers to the number of types of data, and velocity refers to the speed of data processing. To process this big data, several big data processing platforms have been developed, including Hadoop, based on the MapReduce model, and LexisNexis’ HPCC Systems.

1.1 Motivation

More data is being created now than ever before, and this data can take any form: textual, multimedia, spatial, and so on. In this emerging era of technology filled with multimedia data, which makes up about two thirds of internet traffic, it is important to have systems capable of handling this data. Most of the research in the big data field has concentrated on text-based analytics. Multimedia big data computing is a new topic that focuses on all aspects of distributed computing systems that enable massive-scale image and video analytics.

The purpose of this study is the optimized integration of image processing algorithms into High Performance Computing Cluster (HPCC) Systems and the design of a data pipeline capable of efficient pre-processing and processing of multimedia data.

HPCC is an open-source, massively parallel processing computing platform developed by LexisNexis and used for solving big data problems.

This thesis aims to compare the performance and architecture of HPCC with the Hadoop platform while running image processing tasks. The Apache Hadoop platform is widely used for big data analytics and is also being used for multimedia applications. Selecting the right platform for the problem at hand is critical to achieving the best possible performance. The objective is to compare HPCC and Hadoop by considering the system architecture, the programming model, and benchmarks in evaluating the two platforms for processing multimedia big data. Feature extraction and multimedia content retrieval were used for benchmarking the platforms.

1.2 Main Contributions

Multimedia big data processing differs from other kinds of big data tasks, such as text or log file processing, because multimedia content is very diverse and the resources used for task completion vary for each individual image or video. With this in mind, efficient usage of resources with load-balancing techniques becomes important.

The key contributions of this thesis include:

1. Identification of challenges in optimizing multimedia big data frameworks.

2. Performance evaluation of the framework with respect to multimedia data.

3. Complexity reduction and optimization of queries.

1.3 Overview of Thesis

This thesis is structured as follows: Chapter 2 presents background and related work in multimedia data and big data processing. Chapter 3 describes the problem pertaining to multimedia big data processing. Chapter 4 presents the results of experiments carried out to address the thesis's goal, along with their interpretation. Finally, Chapter 5 contains conclusions and suggestions for future work.

1.4 Glossary of Terms

Multimedia Data: Data that consists of various media types like text, audio, video, and animation.

Big Data: Datasets that are so large or complex that traditional data processing application software is inadequate to deal with them.

Feature: Any “interesting” part of an image like edges, corners etc.

Feature Detection: The process of finding interesting points (i.e. features) in an image.

Feature Extraction: It is the process of identifying the relevant and representative features (feature vectors) from an image to reduce the amount of data to process.

CHAPTER 2

BACKGROUND AND RELATED WORK

Multimedia data has been growing at an amazing speed and accounts for more than 70% of unstructured data. It can be considered "big data" not just on the basis of its huge volume, but also because it can be analyzed to gain insight and information in a wide range of applications. This data can be analyzed to extract useful knowledge and understand its semantics.

2.1 Hadoop Ecosystem

The Hadoop ecosystem has evolved from the Apache Hadoop implementation of the MapReduce paradigm. A MapReduce job is a unit of work that consists of the input data, the associated Map and Reduce programs, and user-specified configuration information.

MapReduce is a programming model (White, 2012) for processing and generating large data sets. The map function processes a key/value pair to generate a set of intermediate key/value pairs, and the reduce function merges all intermediate values associated with the same intermediate key. Hadoop provides the Hadoop Distributed File System (HDFS) for storing data. HDFS is a filesystem designed for storing large files with streaming data access patterns, running on commodity hardware.

HDFS has a master/slave architecture with one master namenode, which controls data I/O and manages the filespace, and multiple slave datanodes, which manage storage on the nodes they are attached to. HDFS employs a block-based storage system, with typical block sizes of 64 or 128 MB, which minimizes the cost of seeks. When a file on HDFS is to be read, the client makes a function call to the namenode, which returns the addresses of the datanodes where the initial blocks of the file are stored, and data is then streamed from those datanodes to the client. Optimizations such as sorting datanodes in order of proximity to the client minimize latency and increase I/O efficiency.

Figure 2.1 Hadoop Distributed File System Architecture

2.2 HPCC Systems

HPCC (Middleton, 2016) is a solution for data-intensive cloud computing developed by LexisNexis. Meeting all the requirements of such a platform required the design and implementation of two distinct cluster processing environments, each of which could be optimized independently for its parallel data processing purpose.

The first of these platforms is called a Data Refinery or THOR. The second of the parallel data processing platforms designed and implemented by LexisNexis is called the Rapid Data Delivery Engine or ROXIE.

The Thor system cluster (Middleton, 2011) is implemented using a master/slave approach with a single master process and multiple slave processes which provide a parallel job execution environment. Unlike the block-based storage in Hadoop, Thor employs a record-oriented storage system. These records can be of fixed or variable length and support a variety of formats like XML and CSV. Records can also contain nested child datasets. Record I/O is buffered in large blocks to reduce latency and improve throughput.

Files to be loaded to a Thor cluster are typically first transferred to a landing zone from some external location, then a process called spraying is used to split the file and load it to the nodes of a Thor cluster. The initial spraying process divides the file on user-specified record boundaries and distributes the data as evenly as possible with records in sequential order across the available processes in the cluster. These characteristics make Thor ideal for processing huge volumes of data for tasks like ETL processing, record linking and entity resolution, large-scale ad-hoc complex analytics, and creation of keyed data and indexes to support high-performance structured queries and data warehouse applications.

Figure 2.2 shows an HPCC Systems multi-cluster setup. The figure shows a THOR processing cluster which is similar to the Google and Hadoop MapReduce platforms with respect to its function, filesystem, execution, and capabilities, but offers higher performance.

The Roxie system is designed as an online high-performance structured query and analysis platform, or data warehouse, delivering the parallel data access and processing requirements of online applications through web services interfaces, supporting thousands of simultaneous queries and users with sub-second response times.

Enterprise Control Language (ECL) is the programming language used in HPCC Systems. It is a dataflow-oriented, declarative, non-procedural, parallel language designed for data-intensive computing. A key feature of ECL and HPCC is control over the distribution of data across the computing cluster, which helps avoid data skew and allows execution of both local operations, performed on the data local to each node, and global operations, performed across nodes. ECL is compiled into highly optimized C++ code for execution on the HPCC Systems platform, and can be used for complex data processing and analysis jobs on a Thor cluster or for comprehensive query and report processing on a Roxie cluster. ECL allows inline C++ functions to be incorporated into ECL programs, and external programs in other languages can be incorporated and parallelized through a PIPE facility.
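As a minimal illustrative sketch of this declarative, dataflow style (not code from the thesis; the names WordRec, words and wordCounts are used only for the example), the following ECL defines a record layout, a small inline dataset, and a grouped aggregation that the compiler turns into optimized, parallel C++:

WordRec := RECORD
    STRING word;
END;

// Small inline dataset used only for illustration.
words := DATASET([{'cat'}, {'dog'}, {'cat'}, {'bird'}], WordRec);

// Group by word and count occurrences; the platform distributes
// this aggregation across the nodes of the cluster.
wordCounts := TABLE(words, {word, UNSIGNED4 cnt := COUNT(GROUP)}, word);

OUTPUT(SORT(wordCounts, -cnt));

The programmer states what result is wanted; how the work is partitioned across Thor nodes is left to the compiler and runtime.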

The use of HPCC Systems is proposed because it enables users to leverage a multi-cluster environment to speed up the running time of any computationally intensive algorithm. It is open source and easy to set up. HPCC Systems is also preferred because of its scalability with respect to code reuse, irrespective of the size of the dataset and the number of cluster nodes. It provides a programming abstraction and parallel runtime that hide the complexities of fault tolerance and data parallelism.

The scope of this study is to show that the HPCC system is an ideal system for running resource-intensive image processing tasks at a massive scale.


Figure 2.2 HPCC THOR Cluster

2.3 Similar Studies

2.3.1 HIPI - Hadoop Image Processing Interface

HIPI (Sweeney, 2011) is an image processing library designed to be used with the Apache Hadoop MapReduce parallel programming framework. HIPI facilitates efficient and high-throughput image processing with MapReduce-style parallel programs typically executed on a cluster. It provides a solution for storing a large collection of images on the Hadoop Distributed File System (HDFS) and making them available for efficient distributed processing. HIPI also provides integration with OpenCV. The system manages storage using HIPI image bundles, which store many images in one large file. A HIPI image bundle consists of two files: a data file containing concatenated images and an index file that holds information about the images and their offsets. While executing jobs, the images in a bundle are distributed across map nodes.

Figure 2.3 How HIPI parallelizes image processing

2.3.2 CURT MapReduce: Caching and Utilizing Results of Tasks for MapReduce on Cloud Computing

In iterative applications, the output from reduce tasks is fed back into the mapper for the next iteration. Since duplicate datasets may appear in different data splits created by the run-time system, resources are wasted processing the duplicated data. To avoid this, Huang et al. (Huang, 2016) propose a method for caching task results and reusing them to avoid overhead in executing tasks.

At runtime, the system partitions inputs of Mapper and Reducer tasks according to the Intelligent Square Divisor algorithm while also caching the intermediate results. By mapping the input-to-output tuple, the system uses cached output wherever necessary.

Before assigning data to a map task, the CURT system checks whether the same data is already present on disk. If so, the data is skipped and the cached results are reused.

Since the overhead of caching results may reduce performance if the cached results aren’t used, the system suspends cache creation on demand according to the result of the Intelligent Square Divisor algorithm. MD5 hashing is used to create the input-to-output tuple and prevent data collisions in the cache.

2.3.3 Optimized Multiple Platforms for Big Data Analysis

The objective of the study by Chang et al. (Chang, 2016) was to realize a multiple-platform big data processing system with high performance, high availability, and high scalability that is compatible with any existing business intelligence and analysis tool.

The system monitors the state of the cluster and dynamically adjusts the VMs' resource allocation. A caching mechanism is particularly useful in environments with many duplicate SQL commands: every search query requires time and resources, which can be reduced when the cache hits. Therefore, this study designed a high-speed in-memory cache and a large-capacity in-disk cache.

The study found that even a homogeneous analysis platform still exhibits significant performance differences under various test conditions. Automatically detecting the cluster state through the program and choosing the best platform for data retrieval can save considerable time. In addition, the cache design can reduce retrieval time to less than 5 seconds.


2.3.4 MapReduce Vocabulary Tree: An Approach for Large Scale Image Indexing and Search in the Cloud

Gomes et al. (Gomes, 2016) evaluate MapReduce solutions for parallelizing keypoint detection and descriptor extraction from images in large collections, and vocabulary tree construction based on those descriptors. The indexing phase is done in an offline batch process on the Apache Hadoop framework, in which feature vectors are extracted from images and a vocabulary tree is created in the following steps.

- Database Preprocessing

A Hadoop mapper program downloads images, extracts their features in parallel, and outputs tuples of image features in JSON format.

- Clustering and tree building

K-means clustering is used to create the vocabulary tree, with the slight modification of using Hamming distance instead of Euclidean distance. This is done by implementing the algorithm as a mapper program. Each mapper calculates the nearest centroid for each descriptor. The reducer aggregates all centroid values along with the number of descriptors assigned to each.

-Post Processing

Once the tree is built, weights of leaf nodes are calculated and histograms of the training images are computed. Then, the histogram of a training image is obtained by counting the number of occurrences of its keypoint descriptors at each leaf node while traversing the tree and concatenating the counts onto a vector.

In the querying phase, an index is generated for a query image and used to search against the stored database. In this phase, the centroids information is used to convert the previously built vocabulary tree into a Binary Search Tree. For each key-point descriptor of the query image, when the nearest leaf node to it is found, the counter of that node is incremented. After all descriptors are propagated, the counters are concatenated into a vector to form the histogram of visual words of the query image. Once the histogram of visual words of the query image has been computed, it is matched with those of the indexed images, in order to find the K most similar images in the database.

2.3.5 Extensible Video Processing Framework in Apache Hadoop

The system reported by Ryu et al. (Ryu, 2013) demonstrates the use of the Hadoop environment to parallelize video encoding using FFMPEG and image processing using OpenCV by implementing a face tracking system on top of the framework.

To read the video files, HDFS FFMPEG connectors are used. Map tasks process the videos across multiple nodes by decomposing the video processing jobs into Group of Pictures (GOP) elements; each GOP element is processed by the FFMPEG library, and reduce tasks are used to refine the processed data returned from the map phase.

CHAPTER 3

PROBLEM DESCRIPTION

3.1 Multimedia Data Processing

Multimedia data processing poses challenges different from those faced in processing structured, textual data. Since multimedia data is highly heterogeneous, there is a semantic gap between the low-level data points that can be extracted for analysis and the high-level meaning of the content, and this poses a great challenge in multimedia data analysis tasks. The other major challenge when dealing with multimedia big data is intent expression: in multimedia data processing, the query intent can generally be represented only by text, but text can express only very limited query intent.

3.2 Multimedia Big Data in HPCC

The main problem this thesis aims to address is how to leverage HPCC Systems to process large-scale image databases. Distributed processing of large-scale image datasets is still a relatively new concept, and to the best of our knowledge there is no single system capable of optimized, scalable processing of multimedia data in a distributed computing cluster. We believe HPCC Systems can be leveraged to create such a system.

3.3 Proposed System Architecture

Figure 3.1 Proposed System Architecture

As seen in Figure 3.1, the first component of the framework is the data processing component, where the input datasets are cleaned, preprocessed, and processed so that data points can be extracted and analyzed to execute queries. Once the data is processed, the extracted data can be broadly categorized into metadata and actual image data. The metadata consists of information about geolocation, Exif data, and user tags, among other data points about the images in the dataset. This metadata can be used to optimize queries and improve the performance of the framework.

To run queries on image data, the images have to be decoded and features of importance must be extracted. This is done in the feature extraction phase, where multiple nodes of the THOR cluster perform these tasks in parallel in a highly efficient and optimized manner. The feature extraction phase can be executed multiple times based on the input queries and the features required. The two parallel clusters unique to HPCC Systems allow for parallel execution of the feature extraction and metadata processing tasks.

The application layer is where the user interacts with the system, providing parameters for queries, obtaining and visualizing results, and deriving other meaningful information.

CHAPTER 4

PERFORMANCE EVALUATION

4.1 Experimental Setup

The framework consists of processing the metadata and reading the images into the HPCC platform, decoding the JPEG image data, and using the decoded JPEG data and metadata to execute queries that retrieve images. For applicable queries, visualizations can be generated to make the huge amounts of data easier to understand. The implementation is carried out in the following stages:

• Reading the metadata files into the HPCC Systems cluster.

• Reading the image dataset into the HPCC Systems cluster.

• Decoding image data to compute RGB values which can be used for extracting further image features.

• Retrieving a set of images based on metadata query parameters and extracted image features.

• Visualizing query results in a web application.

4.2 Implementation

The first step of the implementation was to set up HPCC Systems computational clusters. A cluster with 11 nodes was set up, comprising one ROXIE node and ten THOR nodes. Each node has 20 Intel(R) Xeon(R) CPUs and 128 GB of memory.

4.3 Dataset

We have used the YFCC100M dataset (Thomee, 2016), which has a total of 100 million media objects, of which approximately 99.2 million are images and 0.8 million are videos, all uploaded to Flickr between 2004 and 2014 and published under a CC commercial or noncommercial license. The dataset contains a tab-separated list of image and video records. Each record of the list has fields describing various attributes, such as image/video identifier, image/video hash, date taken, date uploaded, capture device, title, description, user tags, and machine tags, among others. Three expansion packs have recently been added to the dataset. In the Exif metadata pack, each line refers to an image/video and its set of Exif metadata items. Similarly, the autotag metadata pack describes the presence of a variety of concepts, such as people, animals, objects, food, events, architecture, and scenery, generated for each image/video using deep learning, and the location metadata pack describes the location where the image/video was taken in a human-readable format instead of geographic coordinates.

Due to resource constraints, the full YFCC100M dataset was not used for the image processing tasks. Instead, the Yahoo! Flickr Creative Commons Images Tagged with Ten Concepts dataset, which consists of 200,000 images, was used. Images in the dataset are categorized into 10 groups. Figure 4.1 shows 10 sample images from the dataset in the categories, from left to right: (a) People (b) Wedding (c) Beach (d) Sky (e) Nature (f) Travel (g) Food (h) London (i) Music (j) 2012.


Figure 4.1 Sample images from dataset

4.4 Dataset Processing

The tab separated metadata files are sprayed onto the HPCC Systems cluster.

There are different options when spraying files (Boca Raton Documentation Team, 2015) onto the HPCC Systems cluster; the delimited option is the best fit for the metadata files. After the metadata files are sprayed onto the cluster as logical files, the next step is to read them using ECL code. To read the logical files, a record structure and a DATASET are defined.

Table 4.1 Spray Timings

Records (millions)    Time (seconds)
1                     15
10                    176
20                    291
40                    597
60                    897
80                    1632
100                   2039


Figure 4.2 Spray Timings

For the main YFCC100M dataset, the record structure looks like this:

EXPORT imagedata := RECORD
    STRING field1;
    STRING field2;
    // ... field3 through field24 ...
    STRING field25;
END;

And the DATASET is defined, with the contents of the logical file bound to it, in this manner:

yfcc100mdata := DATASET('~100mdata::yfcc100m_dataset', imagedata, CSV);

To execute queries on this data, the data must be bound to tables in the following manner:

ImagesTable := Table(yfcc100mdata);
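As a hedged illustration (not part of the original implementation), a couple of simple sanity-check queries can be run against these definitions once the spray completes; the filter value 'cat' and the choice of field10 as the user-tags column are assumptions made only for this example:

IMPORT STD;

// Total number of metadata records available on the cluster.
OUTPUT(COUNT(yfcc100mdata), NAMED('RecordCount'));

// A small sample of records whose assumed user-tag field contains 'cat'.
catSample := yfcc100mdata(STD.Str.Find(field10, 'cat', 1) > 0);
OUTPUT(CHOOSEN(catSample, 10), NAMED('CatSample'));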

The next step is to spray the images onto the cluster for processing. In HPCC Systems, images are represented as BLOBs (Binary Large Objects) (Boca Raton Documentation Team, 2015). The HPCC client tools (Boca Raton Documentation Team, 2014) provide a command line interface which is useful here. A spray action command of the following form is executed to spray the files onto the cluster:

dfuplus action=spray srcip=10.149.0.39 srcfile=/var/lib/HPCCSystems/mydropzone/*.jpeg
    server=hpcc.fau.edu username=****** password=*****
    dstname=100mdata::imagesdata dstcluster=mythor
    PREFIX=filename,filesize nosplit=1

where the nosplit=1 option ensures that the contents of each JPEG are kept together and not split across the THOR nodes, which improves performance, and PREFIX=filename,filesize is used to define the record structure of the image dataset in the following manner:

imageRecord := RECORD
    STRING filename;
    UNSIGNED8 RecPos{virtual(fileposition)};
    DATA images;
END;

The file contents of the JPEG image are stored in the field images, which is of the DATA type, a packed hexadecimal data block of n bytes. Similar to the metadata files, a DATASET is defined to access the sprayed data:

imageData := DATASET('~100mdata::imagesdata', imageRecord, FLAT);

4.5 Decoding Image Data

As there is no previous work on HPCC Systems related to image processing, there are no existing ECL methods to decode and process image data. After evaluating several methods, such as integrating HPCC with R (Shetye, 2013) and using the existing image processing packages in R, and running shell scripts in a parallelized manner using ECL commands in the HPCC framework, implementing a JPEG decoder in C++ to extract useful data from the images was chosen as the most efficient method. A simple open-source implementation of the JPEG decoder (Fiedler) was chosen and modified to fulfil our requirements. The BEGINC++ structure allows execution of inline C++ code within ECL. The records of the imageData DATASET generated in section 4.4 are passed as a parameter to the JPEG decoder. The JPEG decoder decodes the image data and returns average R, G and B values for each image, which are added as columns to the dataset. These additional data points can be used for feature extraction tasks in later stages of the system.

4.6 Query Processing

As discussed in Chapter 2, multimedia content retrieval requires processing of semantic as well as textual information. In this framework, the metadata processing component of the system executes ROXIE queries to process the textual information. Semantic data processing is handled by the data analysis component, where THOR queries are executed to extract image features.

4.6.1 Metadata Processing

As described in section 4.3, the YFCC100M dataset includes extensive metadata. Queries can be executed on this metadata as a first step in retrieving multimedia content. The ROXIE queries are executed on the tables generated in section 4.4 to return records based on query parameters.

The following query is used to return a subset of images shot in the United States:

USImages := ImagesPlaces(STD.Str.Find(field2, 'United+States:Country', 1) != 0);

As a next step, a table is generated in which each record contains the image ID and the ZIP code where the image was taken:

ZipImages := PROJECT(USImages,
    TRANSFORM(ResultTable,
        ZipPos := STD.Str.Find(LEFT.field2, 'Zip');
        SELF.field1 := LEFT.field1;
        SELF.field2 := LEFT.field2[ZipPos-6..ZipPos-2]));

In a similar manner, a table can be generated for other query parameters, and a JOIN operation between the resultant tables generates records that meet all query parameter conditions. The following queries generate a table, SortedResults, which sorts images by the camera used and the ZIP code where the image was taken:

ResultImages := JOIN(s1, ZipImages, LEFT.field2 = RIGHT.field1,
    TRANSFORM(MyOutRec,
        SELF.ID := LEFT.field2;
        SELF.Camera := LEFT.field8;
        SELF.Zip := RIGHT.field2));

SortedResults := SORT(TABLE(ResultImages, {ID, Camera, Zip}), Camera);

Table 4.2 Sample of Query Results

id           camera                          zip
3581619                                      99950
1279993733   NIKON+CORPORATION+NIKON+D80     99950
1280004445   NIKON+CORPORATION+NIKON+D80     99950
1280017713   NIKON+CORPORATION+NIKON+D80     99950
1280027063   NIKON+CORPORATION+NIKON+D80     99950
1280030081   NIKON+CORPORATION+NIKON+D80     99950
4773621262   NIKON+D80                       99950
7680788898   Apple+iPhone+4S                 99950
7681111346                                   99950
7689205058   PENTAX+K20D                     99950
7689212758   PENTAX+K20D                     99950
7689217870   PENTAX+K20D                     99950
7689233396   PENTAX+K20D                     99950

When a query is executed in ECL, an execution graph is generated: a visual representation of the dataflows for the query, which also shows statistics for the job. The graph for the query discussed above is shown in Figure 4.3.

Figure 4.3 Execution graph generated in ECL Watch

ECL Watch (Boca Raton Documentation Team) also reports job execution times. The execution times for the query discussed above on 100 million records are:

Table 4.3 Breakdown of query execution times for 100M records

Name                 Time (seconds)
Compile Time         0.834
Mythor               213.784
Total cluster time   213.784

Table 4.4 shows the execution times and the number of results returned, for varying subsets of input records, for the query that finds images taken with Canon cameras in the United States, sorted by the ZIP code where the image was taken.

Table 4.4 Execution times for different input record sizes

Number of Records (millions)   Total Cluster Time (seconds)   Number of Records Returned
1                              1.81                           19590
10                             6.571                          195161
20                             67.26                          389673
40                             127.914                        779775
60                             111.6                          1170118
80                             159.583                        3637625
100                            213.784                        6133023


Figure 4.4 Impact of dataset size on query processing

The following table and graph show the execution times for the query finding images with occurrences of 'Cat' in the United States.

Table 4.5 Execution times for query finding images with occurrences of 'Cat' in the United States

Number of Records (millions)   Total Cluster Time (seconds)   Number of Records Returned
1                              0.63                           63
10                             4.272                          807
20                             7.921                          1631
40                             15.642                         3527
60                             23.325                         4856
80                             44.29                          15002
100                            60.04                          25431


Figure 4.5 Execution times for finding images with occurrences of 'Cat' in the United States


Figure 4.6 Sample Images returned from query

4.6.2 Feature Extraction

In the data analysis component of the system, image features are extracted based on the RGB values calculated for each image in section 4.5. Queries similar to the ones discussed in section 4.6.1 are executed on the THOR cluster to extract the image features.

For example, the RGB color space is converted to HSV and the Hue component is used to compute the dominant color of each image.
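As a hedged sketch of this step (the dataset name decodedImages and the column names avgR, avgG and avgB are assumptions standing in for the averaged values produced in section 4.5, and RgbToHue, HueRec, toHue and imageHues are illustrative names), the hue angle can be computed per record with an ECL function and a PROJECT:

// Convert average R, G, B values (assumed range 0-255) to the HSV hue angle.
REAL4 RgbToHue(REAL4 r, REAL4 g, REAL4 b) := FUNCTION
    mx    := MAX(r, g, b);
    mn    := MIN(r, g, b);
    delta := mx - mn;
    raw   := MAP(delta = 0 => 0.0,
                 mx = r    => 60.0 * (g - b) / delta,
                 mx = g    => 60.0 * (b - r) / delta + 120.0,
                              60.0 * (r - g) / delta + 240.0);
    RETURN IF(raw < 0, raw + 360.0, raw);
END;

HueRec := RECORD
    STRING filename;
    REAL4  hue;        // degrees in [0, 360)
END;

HueRec toHue(RECORDOF(decodedImages) L) := TRANSFORM
    SELF.filename := L.filename;
    SELF.hue      := RgbToHue(L.avgR, L.avgG, L.avgB);
END;

// Hue per image; a dominant-colour label can then be derived by bucketing
// the hue angle into named colour ranges.
imageHues := PROJECT(decodedImages, toHue(LEFT));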

4.7 Data Visualization

In the application layer component of the system, the results of the queries are visualized in an easy-to-understand manner. D3.js visualizations are used to plot images in a simple web application. Figure 4.7 shows an example visualization that plots six sample images based on geographic location.


Figure 4.7 Sample Visualization

CHAPTER 5

CONCLUSION AND FUTURE WORK

5.1 Conclusion

In this thesis we have provided a unified framework for efficient processing of multimedia big data. By leveraging the power of LexisNexis’ HPCC Systems, we were able to configure a multi-node cluster capable of processing a dataset of a hundred million images. One of the challenges we faced was the lack of any image processing libraries in HPCC Systems or ECL. The enormous cost of obtaining such a large image dataset meant we had to work with a fairly small subset for the image processing tasks. The metadata files were analyzed, and simple, efficient image processing algorithms were used to extract features. Because of HPCC Systems’ unique multi-cluster design, these jobs could be executed in parallel. Data points extracted from both were used to execute sample queries to retrieve images. HPCC Systems ensures optimized and efficient execution of these queries.

5.2 Other Applications

The core contributions of this thesis can be extended or modified to suit a variety of applications beyond multimedia content retrieval. By analyzing spatiotemporal data points and running a few complex image processing algorithms, we can improve the accuracy of many prediction engines that previously did not use image data. For example, systems that predict weather patterns can analyze images from the past and use spatiotemporal data to improve accuracy.

5.3 Future Work

Due to resource and time constraints, we were not able to develop the system to its full capabilities. We would like to extend the framework to support the processing of videos as well. Given more time, it would also be useful to compare the system to a similar implementation in a different distributed computing framework and on computing clusters with more GPU power.

BIBLIOGRAPHY

A.M. Middleton, D.A. Bayliss, G. Halliday, A. Challa, and B. Furht, “The HPCC/ECL Platform for Big Data,” Chapter 6 in the book “Big Data Technologies and Applications,” Springer Publishers, 2016

Chang, B. R., Tsai, H. F., & Wang, Y. A. (2016, April). Optimized Multiple Platforms for Big Data Analysis. In Multimedia Big Data (BigMM), 2016 IEEE Second International Conference on (pp. 155-158). IEEE.

Fiedler, Martin. “Nano JPEG Decoder”.

Gomes, Herman M., et al. "MapReduce Vocabulary Tree: An Approach for Large Scale Image Indexing and Search in the Cloud." Multimedia Big Data (BigMM), 2016 IEEE Second International Conference on. IEEE, 2016.

“HPCC Systems: ECL Programmers Guide. Boca Raton Documentation Team,” 2015.

“HPCC Systems: HPCC Client Tools. Boca Raton Documentation Team,” 2014.

“HPCC Systems: Using ECL Watch. Boca Raton Documentation Team.”

Huang, T. C., Chu, K. C., Zeng, X. Y., Chen, J. R., & Shieh, C. K. (2016, April). CURT MapReduce: Caching and Utilizing Results of Tasks for MapReduce on Cloud Computing. In Multimedia Big Data (BigMM), 2016 IEEE Second International Conference on (pp. 149-154). IEEE.

Middleton, “HPCC Systems: Introduction to HPCC (high performance computing cluster),” 2011.

Ryu, Chungmo, et al. "Extensible video processing framework in apache hadoop." Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on. Vol. 2. IEEE, 2013.

Shetye, Dinesh. “Package ‘rHpcc’”, 2013.

Sweeney, Chris, et al. "HIPI: A Hadoop image processing interface for image-based tasks." University of Virginia (2011).

Thomee, Bart, et al. "YFCC100M: The new data in multimedia research." Communications of the ACM 59.2 (2016): 64-73.


White, Tom. Hadoop: The definitive guide. " O'Reilly Media, Inc.", 2012.

White, Brandyn, et al. "Web-scale computer vision using mapreduce for multimedia data mining." Proceedings of the Tenth International Workshop on Multimedia Data Mining. ACM, 2010.
