MULTIMEDIA PROCESSING USING HPCC SYSTEMS

by

Vishnu Chinta

A Thesis Submitted to the Faculty of

The College of Engineering & Computer Science

In Partial Fulfillment of the Requirements for the Degree of

Master of Science

Florida Atlantic University

Boca Raton, FL

May 2017

Copyright by Vishnu Chinta 2017


ACKNOWLEDGEMENTS

I would like to express gratitude to my advisor, Dr. Hari Kalva, for his constant guidance and support in helping me complete this thesis successfully. His enthusiasm, inspiration and his efforts to explain all aspects clearly and simply helped me throughout.

My sincere thanks also go to my committee members Dr. Borko Furht and Dr. Xingquan Zhu for their valuable comments, suggestions and inputs to the thesis.

I would also like to thank LexisNexis for making this study possible by providing funding, tools and all the technical support I have received. I would also like to thank all my professors during my time at FAU for the invaluable lessons I have learnt in their classes.

This work has been funded by the NSF Award No. 1464537, Industry/University Cooperative Research Center, Phase II under NSF 13-542. I would like to thank the NSF for this.

Many thanks to my friends at FAU and my fellow lab mates in the Multimedia lab for the enthusiastic support and interesting times we have had during the past two years.

Last but not least, I would like to thank my family for being a constant source of support and encouragement during my Master's.

ABSTRACT

Author: Vishnu Chinta

Title: Multimedia Big Data Processing Using HPCC Systems

Institution: Florida Atlantic University

Thesis Advisor: Dr. Hari Kalva

Degree: Master of Science

Year: 2017

More data is being created now than ever before, and this data can take any form: textual, multimedia, spatial, and so on. To process this data, several big data processing platforms have been developed, including Hadoop, based on the MapReduce model, and LexisNexis’ HPCC Systems.

In this thesis we evaluate the HPCC Systems framework with a special interest in multimedia data analysis and propose a framework for multimedia data processing.

It is important to note that multimedia data encompasses a wide variety of data, including but not limited to image, video, audio, and even textual data. While developing a unified framework for such a wide variety of data, we have to consider the computational complexity of dealing with the data. Preliminary results show that HPCC can potentially reduce this computational complexity significantly.

MULTIMEDIA BIG DATA PROCESSING USING HPCC SYSTEMS

LIST OF FIGURES

CHAPTER 1

INTRODUCTION

1.1 Motivation

1.2 Main Contributions

1.3 Overview of Thesis

1.4 Glossary of Terms

CHAPTER 2

BACKGROUND AND RELATED WORK

2.1 Hadoop Ecosystem

2.2 HPCC Systems

2.3 Similar Studies

2.3.1 HIPI - Hadoop Image Processing Interface

2.3.2 CURT MapReduce: Caching and Utilizing Results of Tasks for MapReduce on Cloud Computing

2.3.3 Optimized Multiple Platforms for Big Data Analysis

2.3.4 MapReduce Vocabulary Tree: An Approach for Large Scale Image Indexing and Search in the Cloud

2.3.5 Extensible Video Processing Framework in Apache Hadoop

CHAPTER 3

PROBLEM DESCRIPTION

3.1 Multimedia Data Processing

3.2 Multimedia Big Data in HPCC

3.3 Proposed System Architecture

CHAPTER 4

PERFORMANCE EVALUATION

4.1 Experimental Setup

4.2 Implementation

4.3 Dataset

4.4 Dataset Processing

4.5 Decoding Image Data

4.6 Query Processing

4.6.1 Metadata Processing

4.6.2 Feature Extraction

4.7 Data Visualization

CHAPTER 5

CONCLUSION AND FUTURE WORK

5.1 Conclusion

5.2 Other Applications

5.3 Future Work

BIBLIOGRAPHY

LIST OF FIGURES

Figure 2.1 Hadoop Distributed File System Architecture

Figure 2.2 HPCC THOR Cluster

Figure 2.3 How HIPI parallelizes image processing

Figure 3.1 Proposed System Architecture

Figure 4.1 Sample images from dataset

Figure 4.2 Spray Timings

Figure 4.3 Execution graph generated in ECL Watch

Figure 4.4 Impact of dataset size on query processing

Figure 4.5 Execution times for finding images with occurrences of 'Cat' in the United States

Figure 4.6 Sample Images returned from query

Figure 4.7 Sample Visualization

LIST OF TABLES

Table 4.1 Spray Timings

Table 4.2 Sample of Query Results

Table 4.3 Breakdown of query execution times for 100M records

Table 4.4 Execution times for different input record sizes

Table 4.5 Execution times for query finding images with occurrences of 'Cat' in the United States

CHAPTER 1

INTRODUCTION

Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include different types, such as structured/unstructured and streaming/batch, and different sizes, from terabytes to petabytes. Big data is a term applied to datasets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency. Big data is generally defined using the 3Vs (volume, variety and velocity): volume refers to the amount of data, variety refers to the number of types of data, and velocity refers to the speed of data processing. To process this big data, several big data processing platforms have been developed, including Hadoop, based on the MapReduce model, and LexisNexis’ HPCC Systems.

1.1 Motivation

More data is being created now than ever before, and this data can take any form: textual, multimedia, spatial, and so on. In this emerging era of technology filled with multimedia data, which makes up about two thirds of internet traffic, it is important to have systems capable of handling this data. Most of the research in the big data field has concentrated on text-based analytics. Multimedia big data computing is a new topic that focuses on all aspects of distributed computing systems that enable massive-scale image and video analytics.

The purpose of this study is the optimized integration of image processing algorithms into High Performance Computing Cluster (HPCC) Systems and the design of a data pipeline capable of efficient pre-processing and processing of multimedia data.

HPCC is an open-source, massively parallel processing computing platform developed by LexisNexis and used for solving big data problems.

This thesis aims to compare the performance and architecture of HPCC with the Hadoop platform while running image processing tasks. The Apache Hadoop platform is widely used for big data analytics and is also being used for multimedia applications. Selecting the right platform for the problem at hand is critical to achieving the best possible performance. The objective is to compare HPCC and Hadoop by considering the system architecture, the programming model, and benchmarks in evaluating the two platforms for processing multimedia big data. Feature extraction and multimedia content retrieval were used for benchmarking the platforms.

1.2 Main Contributions

Multimedia big data processing differs from other kinds of big data tasks, such as text or log file processing, because multimedia content is very diverse and the resources used for task completion vary for each individual image or video. With this in mind, efficient usage of resources with load-balancing techniques becomes important.

The key contributions of this thesis include:

1. Identification of challenges in optimizing multimedia big data frameworks.

2. Performance evaluation of the framework with respect to multimedia data.

3. Complexity reduction and optimization of queries.

1.3 Overview of Thesis

This thesis is structured as follows: Chapter 2 presents background and related work in multimedia data and big data processing. Chapter 3 describes the problem pertaining to multimedia big data processing. Chapter 4 presents the results of experiments carried out to address the thesis's goal, along with their interpretation. Finally, Chapter 5 contains conclusions and suggestions for future work.

1.4 Glossary of Terms

Multimedia Data: Data that consists of various media types like text, audio, video, and animation.

Big Data: Datasets that are so large or complex that traditional data processing application software is inadequate to deal with them.

Feature: Any “interesting” part of an image like edges, corners etc.

Feature Detection: The process of finding interesting points (i.e. features) in an image.

Feature Extraction: It is the process of identifying the relevant and representative features (feature vectors) from an image to reduce the amount of data to process.

CHAPTER 2

BACKGROUND AND RELATED WORK

Multimedia data has been growing at an amazing speed and accounts for more than 70% of unstructured data. It can be considered "big data" not just on the basis of its huge volume, but also because it can be analyzed to gain insight and information in a wide range of applications. This data can be analyzed to extract useful knowledge and understand its semantics.

2.1 Hadoop Ecosystem

The Hadoop ecosystem has evolved from the Apache Hadoop implementation of the MapReduce paradigm. A MapReduce job is a unit of work that consists of the input data, the associated Map and Reduce programs, and user-specified configuration information.

MapReduce is a programming model (White, 2012) for processing and generating large data sets. The map function processes a key/value pair to generate a set of intermediate key/value pairs, and the reduce function merges all intermediate values associated with the same intermediate key. Hadoop provides the Hadoop Distributed File System (HDFS) for storing data. HDFS is a filesystem designed for storing large files with streaming data access patterns, running on commodity hardware.

HDFS has a master/slave architecture with one master namenode, which controls data I/O and manages the filespace, and multiple slave datanodes, which manage storage on the nodes they are attached to. HDFS employs a block-based storage system, with typical block sizes of 64 or 128 MB, which minimizes the cost of seeks. When a file on HDFS is to be read, the client makes a function call to the namenode, which returns the addresses of the datanodes where the initial blocks of the file are stored, and data is then streamed from those datanodes to the client. Optimizations such as sorting datanodes in order of proximity to the client minimize latency and increase I/O efficiency.

Figure 2.1 Hadoop Distributed File System Architecture

2.2 HPCC Systems

HPCC (Middleton, 2016) is a solution for data-intensive cloud computing developed by LexisNexis. Meeting all the requirements of such a platform required the design and implementation of two distinct cluster processing environments, each of which could be optimized independently for its parallel data processing purpose.

The first of these platforms is called a Data Refinery or THOR. The second of the parallel data processing platforms designed and implemented by LexisNexis is called the Rapid Data Delivery Engine or ROXIE.

The Thor system cluster (Middleton, 2011) is implemented using a master/slave approach with a single master process and multiple slave processes which provide a parallel job execution environment. Unlike the block-based storage in Hadoop, Thor employs a record-oriented storage system. These records can be of fixed or variable length and support a variety of formats like XML and CSV. Records can also contain nested child datasets. Record I/O is buffered in large blocks to reduce latency and improve throughput.

Files to be loaded to a Thor cluster are typically first transferred to a landing zone from some external location, then a process called spraying is used to split the file and load it to the nodes of a Thor cluster. The initial spraying process divides the file on user-specified record boundaries and distributes the data as evenly as possible with records in sequential order across the available processes in the cluster. These characteristics make Thor ideal for processing huge volumes of data for tasks like ETL processing, record linking and entity resolution, large-scale ad-hoc complex analytics, and creation of keyed data and indexes to support high-performance structured queries and data warehouse applications.

Figure 2.2 shows an HPCC Systems multi-cluster setup. The figure shows a THOR processing cluster which is similar to the Google and Hadoop MapReduce platforms with respect to its function, filesystem, execution, and capabilities, but offers higher performance.

The Roxie system is designed as an online high-performance structured query and analysis platform, or data warehouse, delivering the parallel data access and processing requirements of online applications through web services interfaces, supporting thousands of simultaneous queries and users with sub-second response times.

Enterprise Control Language (ECL) is the programming language used in HPCC Systems. It is a dataflow-oriented, declarative, non-procedural, parallel language designed for data-intensive computing. A key feature of ECL and HPCC is control over the distribution of data across the computing cluster, which helps avoid data skew and allows execution of both local operations, performed on the data local to each node, and global operations, performed across nodes. ECL is compiled into highly optimized C++ code for execution on the HPCC Systems platform, and can be used for complex data processing and analysis jobs on a Thor cluster or for comprehensive query and report processing on a Roxie cluster. ECL allows inline C++ functions to be incorporated into ECL programs, and external programs in other languages can be incorporated and parallelized through a PIPE facility.
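As a minimal illustrative sketch of this declarative, dataflow style (not code from the thesis; the names WordRec, words and wordCounts are used only for the example), the following ECL defines a record layout, a small inline dataset, and a grouped aggregation that the compiler turns into optimized, parallel C++:

WordRec := RECORD
    STRING word;
END;

// Small inline dataset used only for illustration.
words := DATASET([{'cat'}, {'dog'}, {'cat'}, {'bird'}], WordRec);

// Group by word and count occurrences; the platform distributes
// this aggregation across the nodes of the cluster.
wordCounts := TABLE(words, {word, UNSIGNED4 cnt := COUNT(GROUP)}, word);

OUTPUT(SORT(wordCounts, -cnt));

The programmer states what result is wanted; how the work is partitioned across Thor nodes is left to the compiler and runtime.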

The use of HPCC Systems is proposed because it enables users to leverage a multi-cluster environment to speed up the running time of any computationally intensive algorithm. It is open source and easy to set up. HPCC Systems is also preferred because of its scalability with respect to code reuse, irrespective of the size of the dataset and the number of cluster nodes. It provides a programming abstraction and parallel runtime that hide the complexities of fault tolerance and data parallelism.

The scope of this study is to show that the HPCC system is an ideal system for running resource-intensive image processing tasks at a massive scale.


Figure 2.2 HPCC THOR Cluster

2.3 Similar Studies

2.3.1 HIPI - Hadoop Image Processing Interface

HIPI (Sweeney, 2011) is an image processing library designed to be used with the Apache Hadoop MapReduce parallel programming framework. HIPI facilitates efficient and high-throughput image processing with MapReduce-style parallel programs typically executed on a cluster. It provides a solution for storing a large collection of images on the Hadoop Distributed File System (HDFS) and making them available for efficient distributed processing. HIPI also provides integration with OpenCV. The system manages storage using HIPI image bundles, which store many images in one large file. A HIPI image bundle consists of two files: a data file containing concatenated images and an index file that holds information about the images and their offsets. While executing jobs, the images in a bundle are distributed across map nodes.

Figure 2.3 How HIPI parallelizes image processing

2.3.2 CURT MapReduce: Caching and Utilizing Results of Tasks for MapReduce on Cloud Computing

In iterative applications, the output from reduce tasks is fed back into the mapper for the next iteration. Since duplicate datasets may appear in different data splits created by the run-time system, resources are wasted processing the duplicated data. To avoid this, Huang et al. (Huang, 2016) propose a method for caching task results and reusing them to avoid overhead in executing tasks.

At runtime, the system partitions inputs of Mapper and Reducer tasks according to the Intelligent Square Divisor algorithm while also caching the intermediate results. By mapping the input-to-output tuple, the system uses cached output wherever necessary.

Before assigning data to a map task, the CURT system checks whether the same data is already present on disk. If so, the data is skipped and the cached results are reused.

Since the overhead of caching results may reduce performance if the cached results aren’t used, the system suspends cache creation on demand according to the result of the Intelligent Square Divisor algorithm. MD5 hashing is used to create the input-to-output tuple and prevent data collisions in the cache.

2.3.3 Optimized Multiple Platforms for Big Data Analysis

The objective of the study by Chang et al. (Chang, 2016) was to realize a multiple-platform big data processing system with high performance, high availability, and high scalability that is compatible with any existing business intelligence and analysis tool.

The system monitors the state of the cluster and dynamically adjusts the VMs' resource allocation. A caching mechanism is particularly useful in environments with many duplicate SQL commands: every search query requires time and resources, which can be reduced when the cache hits. Therefore, this study designed a high-speed in-memory cache and a large-capacity in-disk cache.

The study found that even a homogeneous analysis platform still exhibits significant performance differences under various test conditions. Automatically detecting the cluster state through the program and choosing the best platform for data retrieval can save considerable time. In addition, the cache design can reduce retrieval time to less than 5 seconds.


2.3.4 MapReduce Vocabulary Tree: An Approach for Large Scale Image Indexing and Search in the Cloud

Gomes et al. (Gomes, 2016) evaluate MapReduce solutions for parallelizing keypoint detection and descriptor extraction from images in large collections, and vocabulary tree construction based on those descriptors. The indexing phase is done in an offline batch process on the Apache Hadoop framework, in which feature vectors are extracted from images and a vocabulary tree is created in the following steps.

- Database Preprocessing

A Hadoop mapper program downloads images, extracts their features in parallel, and outputs tuples of image features in JSON format.

- Clustering and tree building

K-means clustering is used to create the vocabulary tree, with the slight modification of using Hamming distance instead of Euclidean distance. This is done by implementing the algorithm as a mapper program. Each mapper calculates the nearest centroid for each descriptor. The reducer aggregates all centroid values along with the number of descriptors assigned to each.

-Post Processing

Once the tree is built, weights of leaf nodes are calculated and histograms of the training images are computed. Then, the histogram of a training image is obtained by counting the number of occurrences of its keypoint descriptors at each leaf node while traversing the tree and concatenating the counts onto a vector.

In the querying phase, an index is generated for a query image and used to search against the stored database. In this phase, the centroids information is used to convert the previously built vocabulary tree into a Binary Search Tree. For each key-point descriptor of the query image, when the nearest leaf node to it is found, the counter of that node is incremented. After all descriptors are propagated, the counters are concatenated into a vector to form the histogram of visual words of the query image. Once the histogram of visual words of the query image has been computed, it is matched with those of the indexed images, in order to find the K most similar images in the database.

2.3.5 Extensible Video Processing Framework in Apache Hadoop

The system reported by Ryu et al. (Ryu, 2013) demonstrates the use of the Hadoop environment to parallelize video encoding using FFMPEG and image processing using OpenCV by implementing a face tracking system on top of the framework.

To read the video files, HDFS FFMPEG connectors are used. Map tasks process the videos across multiple nodes by decomposing the video processing jobs into Group of Pictures (GOP) elements; each GOP element is processed by the FFMPEG library, and reduce tasks are used to refine the processed data returned from the map phase.

CHAPTER 3

PROBLEM DESCRIPTION

3.1 Multimedia Data Processing

Multimedia data processing poses challenges different from those faced in processing structured, textual data. Since multimedia data is highly heterogeneous, there is a semantic gap between the low-level data points that can be extracted for analysis and the high-level meaning of the content, and this poses a great challenge in multimedia data analysis tasks. The other major challenge when dealing with multimedia big data is intent expression: in multimedia data processing, the query intent can generally be represented only by text, but text can express only very limited query intent.

3.2 Multimedia Big Data in HPCC

The main problem this thesis aims to address is how to leverage HPCC Systems to process large-scale image databases. Distributed processing of large-scale image datasets is still a relatively new concept, and to the best of our knowledge there is no single system capable of optimized, scalable processing of multimedia data in a distributed computing cluster. We believe HPCC Systems can be leveraged to create such a system.

3.3 Proposed System Architecture

Figure 3.1 Proposed System Architecture

As seen in Figure 3.1, the first component of the framework is the data processing component, where the input datasets are cleaned, preprocessed, and processed so that data points can be extracted and analyzed to execute queries. Once the data is processed, the extracted data can be broadly categorized into metadata and actual image data. The metadata consists of information about geolocation, Exif data, and user tags, among other data points about the images in the dataset. This metadata can be used to optimize queries and improve the performance of the framework.

To run queries on image data, the images have to be decoded and features of importance must be extracted. This is done in the feature extraction phase, where multiple nodes of the THOR cluster perform these tasks in parallel in a highly efficient and optimized manner. The feature extraction phase can be executed multiple times based on the input queries and the features required. The two parallel clusters unique to HPCC Systems allow for parallel execution of the feature extraction and metadata processing tasks.

The application layer is where the user interacts with the system, providing parameters for queries, obtaining and visualizing results, and deriving other meaningful information.

CHAPTER 4

PERFORMANCE EVALUATION

4.1 Experimental Setup

The framework consists of processing the metadata and reading the images into the HPCC platform, decoding the JPEG image data, and using the decoded JPEG data and metadata to execute queries that retrieve images. For applicable queries, visualizations can be generated to make the huge amounts of data easier to understand. The implementation is carried out in the following stages:

• Reading the metadata files into the HPCC Systems cluster.

• Reading the image dataset into the HPCC Systems cluster.

• Decoding image data to compute RGB values which can be used for extracting further image features.

• Retrieving a set of images based on metadata query parameters and extracted image features.

• Visualizing query results in a web application.

4.2 Implementation

The first step of the implementation was to set up HPCC Systems computational clusters. A cluster with 11 nodes was set up, comprising one ROXIE node and ten THOR nodes. Each node has 20 Intel(R) Xeon(R) CPUs and 128 GB of memory.

4.3 Dataset

We have used the YFCC100M dataset (Thomee, 2016), which has a total of 100 million media objects, of which approximately 99.2 million are images and 0.8 million are videos, all uploaded to Flickr between 2004 and 2014 and published under a CC commercial or noncommercial license. The dataset contains a tab-separated list of image and video records. Each record of the list has fields describing various attributes, such as image/video identifier, image/video hash, date taken, date uploaded, capture device, title, description, user tags, and machine tags, among others. Three expansion packs have recently been added to the dataset. In the Exif metadata pack, each line refers to an image/video and its set of Exif metadata items. Similarly, the autotag metadata pack describes the presence of a variety of concepts, such as people, animals, objects, food, events, architecture, and scenery, generated for each image/video using deep learning, and the location metadata pack describes the location where the image/video was taken in a human-readable format instead of geographic coordinates.

Due to resource constraints, the full YFCC100M dataset was not used for the image processing tasks. Instead, the Yahoo! Flickr Creative Commons Images Tagged with Ten Concepts dataset, which consists of 200,000 images, was used. Images in the dataset are categorized into 10 groups. Figure 4.1 shows 10 sample images from the dataset in the categories, from left to right: (a) People (b) Wedding (c) Beach (d) Sky (e) Nature (f) Travel (g) Food (h) London (i) Music (j) 2012.


Figure 4.1 Sample images from dataset

4.4 Dataset Processing

The tab separated metadata files are sprayed onto the HPCC Systems cluster.

There are different options when spraying files (Boca Raton Documentation Team, 2015) onto the HPCC Systems cluster; the delimited option is the best fit for the metadata files. After the metadata files are sprayed onto the cluster as logical files, the next step is to read them using ECL code. To read the logical files, a record structure and a DATASET are defined.

Table 4.1 Spray Timings

Records (millions)    Time (seconds)
1                     15
10                    176
20                    291
40                    597
60                    897
80                    1632
100                   2039


Figure 4.2 Spray Timings

For the main YFCC100M dataset, the record structure looks like this:

EXPORT imagedata := RECORD
    STRING field1;
    STRING field2;
    // ... field3 through field24 ...
    STRING field25;
END;

And the DATASET is defined, with the contents of the logical file bound to it, in this manner:

yfcc100mdata := DATASET('~100mdata::yfcc100m_dataset', imagedata, CSV);

To execute queries on this data, the data must be bound to tables in the following manner:

ImagesTable := Table(yfcc100mdata);
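As a hedged illustration (not part of the original implementation), a couple of simple sanity-check queries can be run against these definitions once the spray completes; the filter value 'cat' and the choice of field10 as the user-tags column are assumptions made only for this example:

IMPORT STD;

// Total number of metadata records available on the cluster.
OUTPUT(COUNT(yfcc100mdata), NAMED('RecordCount'));

// A small sample of records whose assumed user-tag field contains 'cat'.
catSample := yfcc100mdata(STD.Str.Find(field10, 'cat', 1) > 0);
OUTPUT(CHOOSEN(catSample, 10), NAMED('CatSample'));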

The next step is to spray the images onto the cluster for processing. In HPCC Systems, images are represented as BLOBs (Binary Large Objects) (Boca Raton Documentation Team, 2015). The HPCC client tools (Boca Raton Documentation Team, 2014) provide a command line interface which is useful here. A spray action command of the following form is executed to spray the files onto the cluster:

dfuplus action=spray srcip=10.149.0.39 srcfile=/var/lib/HPCCSystems/mydropzone/*.jpeg
    server=hpcc.fau.edu username=****** password=*****
    dstname=100mdata::imagesdata dstcluster=mythor
    PREFIX=filename,filesize nosplit=1

where the nosplit=1 option ensures that the contents of each JPEG are kept together and not split across the THOR nodes, which improves performance, and PREFIX=filename,filesize is used to define the record structure of the image dataset in the following manner:

imageRecord := RECORD
    STRING filename;
    UNSIGNED8 RecPos{virtual(fileposition)};
    DATA images;
END;

The file contents of the JPEG image are stored in the field images, which is of the DATA type, a packed hexadecimal data block of n bytes. Similar to the metadata files, a DATASET is defined to access the sprayed data:

imageData := DATASET('~100mdata::imagesdata', imageRecord, FLAT);

4.5 Decoding Image Data

As there is no previous work on HPCC Systems related to image processing, there are no existing ECL methods to decode and process image data. After evaluating several methods, such as integrating HPCC with R (Shetye, 2013) and using the existing image processing packages in R, and running shell scripts in a parallelized manner using ECL commands in the HPCC framework, implementing a JPEG decoder in C++ to extract useful data from the images was chosen as the most efficient method. A simple open-source implementation of the JPEG decoder (Fiedler) was chosen and modified to fulfil our requirements. The BEGINC++ structure allows execution of inline C++ code within ECL. The records of the imageData DATASET generated in section 4.4 are passed as a parameter to the JPEG decoder. The JPEG decoder decodes the image data and returns average R, G and B values for each image, which are added as columns to the dataset. These additional data points can be used for feature extraction tasks in later stages of the system.

4.6 Query Processing

As discussed in Chapter 2, multimedia content retrieval requires processing of semantic as well as textual information. In this framework, the metadata processing component of the system executes ROXIE queries to process the textual information. Semantic data processing is handled by the data analysis component, where THOR queries are executed to extract image features.

4.6.1 Metadata Processing

As described in section 4.3, the YFCC100M dataset includes extensive metadata. Queries can be executed on this metadata as a first step in retrieving multimedia content. The ROXIE queries are executed on the tables generated in section 4.4 to return records based on query parameters.

The following query is used to return a subset of images shot in the United States:

USImages := ImagesPlaces(STD.Str.Find(field2, 'United+States:Country', 1) != 0);

As a next step, a table is generated in which each record contains the image ID and the ZIP code where the image was taken:

ZipImages := PROJECT(USImages,
    TRANSFORM(ResultTable,
        ZipPos := STD.Str.Find(LEFT.field2, 'Zip');
        SELF.field1 := LEFT.field1;
        SELF.field2 := LEFT.field2[ZipPos-6..ZipPos-2]));

In a similar manner, a table can be generated for other query parameters, and a JOIN operation between the resultant tables generates records that meet all query parameter conditions. The following queries generate a table, SortedResults, which sorts images by the camera used and the ZIP code where the image was taken:

ResultImages := JOIN(s1, ZipImages, LEFT.field2 = RIGHT.field1,
    TRANSFORM(MyOutRec,
        SELF.ID := LEFT.field2;
        SELF.Camera := LEFT.field8;
        SELF.Zip := RIGHT.field2));

SortedResults := SORT(TABLE(ResultImages, {ID, Camera, Zip}), Camera);

Table 4.2 Sample of Query Results

id           camera                          zip
3581619                                      99950
1279993733   NIKON+CORPORATION+NIKON+D80     99950
1280004445   NIKON+CORPORATION+NIKON+D80     99950
1280017713   NIKON+CORPORATION+NIKON+D80     99950
1280027063   NIKON+CORPORATION+NIKON+D80     99950
1280030081   NIKON+CORPORATION+NIKON+D80     99950
4773621262   NIKON+D80                       99950
7680788898   Apple+iPhone+4S                 99950
7681111346                                   99950
7689205058   PENTAX+K20D                     99950
7689212758   PENTAX+K20D                     99950
7689217870   PENTAX+K20D                     99950
7689233396   PENTAX+K20D                     99950

When a query is executed in ECL, an execution graph is generated: a visual representation of the dataflows for the query, which also shows statistics for the job. The graph for the query discussed above is shown in Figure 4.3.

Figure 4.3 Execution graph generated in ECL Watch

ECL Watch (Boca Raton Documentation Team) also reports job execution times. The execution times for the query discussed above on 100 million records are:

Table 4.3 Breakdown of query execution times for 100M records

Name                 Time (seconds)
Compile Time         0.834
Mythor               213.784
Total cluster time   213.784

Table 4.4 shows the execution times and the number of results returned, for varying subsets of input records, for the query that finds images taken with Canon cameras in the United States, sorted by the ZIP code where the image was taken.

Table 4.4 Execution times for different input record sizes

Number of Records (millions)   Total Cluster Time (seconds)   Number of Records Returned
1                              1.81                           19590
10                             6.571                          195161
20                             67.26                          389673
40                             127.914                        779775
60                             111.6                          1170118
80                             159.583                        3637625
100                            213.784                        6133023


Figure 4.4 Impact of dataset size on query processing

The following table and graph show the execution times for the query finding images with occurrences of 'Cat' in the United States.

Table 4.5 Execution times for query finding images with occurrences of 'Cat' in the United States

Number of Records (millions)   Total Cluster Time (seconds)   Number of Records Returned
1                              0.63                           63
10                             4.272                          807
20                             7.921                          1631
40                             15.642                         3527
60                             23.325                         4856
80                             44.29                          15002
100                            60.04                          25431


Figure 4.5 Execution times for finding images with occurrences of 'Cat' in the United States


Figure 4.6 Sample Images returned from query

4.6.2 Feature Extraction

In the data analysis component of the system, image features are extracted based on the RGB values calculated for each image in section 4.5. Queries similar to the ones discussed in section 4.6.1 are executed on the THOR cluster to extract the image features.

For example, the RGB color space is converted to HSV and the Hue component is used to compute the dominant color of each image.
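As a hedged sketch of this step (the dataset name decodedImages and the column names avgR, avgG and avgB are assumptions standing in for the averaged values produced in section 4.5, and RgbToHue, HueRec, toHue and imageHues are illustrative names), the hue angle can be computed per record with an ECL function and a PROJECT:

// Convert average R, G, B values (assumed range 0-255) to the HSV hue angle.
REAL4 RgbToHue(REAL4 r, REAL4 g, REAL4 b) := FUNCTION
    mx    := MAX(r, g, b);
    mn    := MIN(r, g, b);
    delta := mx - mn;
    raw   := MAP(delta = 0 => 0.0,
                 mx = r    => 60.0 * (g - b) / delta,
                 mx = g    => 60.0 * (b - r) / delta + 120.0,
                              60.0 * (r - g) / delta + 240.0);
    RETURN IF(raw < 0, raw + 360.0, raw);
END;

HueRec := RECORD
    STRING filename;
    REAL4  hue;        // degrees in [0, 360)
END;

HueRec toHue(RECORDOF(decodedImages) L) := TRANSFORM
    SELF.filename := L.filename;
    SELF.hue      := RgbToHue(L.avgR, L.avgG, L.avgB);
END;

// Hue per image; a dominant-colour label can then be derived by bucketing
// the hue angle into named colour ranges.
imageHues := PROJECT(decodedImages, toHue(LEFT));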

4.7 Data Visualization

In the application layer component of the system, the results of the queries are visualized in an easy-to-understand manner. D3.js visualizations are used to plot images in a simple web application. Figure 4.7 shows an example visualization that plots six sample images based on geographic location.


Figure 4.7 Sample Visualization

CHAPTER 5

CONCLUSION AND FUTURE WORK

5.1 Conclusion

In this thesis we have provided a unified framework for efficient processing of multimedia big data. By leveraging the power of LexisNexis’ HPCC Systems, we were able to configure a multi-node cluster capable of processing a dataset of a hundred million images. One of the challenges we faced was the lack of any image processing libraries in HPCC Systems or ECL. The enormous cost of obtaining such a large image dataset meant we had to work with a fairly small subset for the image processing tasks. The metadata files were analyzed, and simple, efficient image processing algorithms were used to extract features. Because of HPCC Systems’ unique multi-cluster design, these jobs could be executed in parallel. Data points extracted from both were used to execute sample queries to retrieve images. HPCC Systems ensures optimized and efficient execution of these queries.

5.2 Other Applications

The core contributions of this thesis can be extended or modified to suit a variety of applications beyond multimedia content retrieval. By analyzing spatiotemporal data points and running a few complex image processing algorithms, we can improve the accuracy of many prediction engines that previously did not use image data. For example, systems that predict weather patterns can analyze images from the past and use spatiotemporal data to improve accuracy.

5.3 Future Work

Due to resource and time constraints, we were not able to develop the system to its full capabilities. We would like to extend the framework to support the processing of videos as well. Given more time, it would also be useful to compare the system to a similar implementation in a different distributed computing framework and on computing clusters with more GPU power.

BIBLIOGRAPHY

A.M. Middleton, D.A. Bayliss, G. Halliday, A. Challa, and B. Furht, “The HPCC/ECL Platform for Big Data,” Chapter 6 in the book “Big Data Technologies and Applications,” Springer Publishers, 2016

Chang, B. R., Tsai, H. F., & Wang, Y. A. (2016, April). Optimized Multiple Platforms for Big Data Analysis. In Multimedia Big Data (BigMM), 2016 IEEE Second International Conference on (pp. 155-158). IEEE.

Fiedler, Martin. “Nano JPEG Decoder”.

Gomes, Herman M., et al. "MapReduce Vocabulary Tree: An Approach for Large Scale Image Indexing and Search in the Cloud." Multimedia Big Data (BigMM), 2016 IEEE Second International Conference on. IEEE, 2016.

“HPCC Systems: ECL Programmers Guide. Boca Raton Documentation Team,” 2015.

“HPCC Systems: HPCC Client Tools. Boca Raton Documentation Team,” 2014.

“HPCC Systems: Using ECL Watch. Boca Raton Documentation Team.”

Huang, T. C., Chu, K. C., Zeng, X. Y., Chen, J. R., & Shieh, C. K. (2016, April). CURT MapReduce: Caching and Utilizing Results of Tasks for MapReduce on Cloud Computing. In Multimedia Big Data (BigMM), 2016 IEEE Second International Conference on (pp. 149-154). IEEE.

Middleton, “HPCC Systems: Introduction to HPCC (high performance computing cluster),” 2011.

Ryu, Chungmo, et al. "Extensible video processing framework in apache hadoop." Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on. Vol. 2. IEEE, 2013.

Shetye, Dinesh. “Package ‘rHpcc’”, 2013.

Sweeney, Chris, et al. "HIPI: A Hadoop image processing interface for image-based tasks." University of Virginia (2011).

Thomee, Bart, et al. "YFCC100M: The new data in multimedia research." Communications of the ACM 59.2 (2016): 64-73.


White, Tom. Hadoop: The definitive guide. " O'Reilly Media, Inc.", 2012.

White, Brandyn, et al. "Web-scale computer vision using mapreduce for multimedia data mining." Proceedings of the Tenth International Workshop on Multimedia Data Mining. ACM, 2010.
