A Survey of Big Data Machine Learning Applications Optimization in Cloud Data Centers and Networks

Sanaa Hamid Mohamed, Student Member, IEEE, Taisir E.H. El-Gorashi, Member, IEEE, and Jaafar M.H. Elmirghani, Senior Member, IEEE

Abstract— This survey article reviews the challenges associated with deploying and optimizing big data applications and machine learning algorithms in cloud data centers and networks. The MapReduce programming model and its widely-used open-source implementation, Hadoop, are enabling the development of a large number of cloud-based services and big data applications. MapReduce and Hadoop thus enable efficient and accelerated data-intensive computations and analytics. These services usually utilize commodity clusters within geographically-distributed data centers and provide cost-effective and elastic solutions. However, the increasing traffic between and within the data centers that migrate, store, and process big data is becoming a bottleneck that calls for enhanced infrastructures capable of reducing congestion and power consumption. Moreover, enterprises with multiple tenants requesting various big data services are challenged by the need to optimize the leasing of their resources at reduced running costs and power consumption while avoiding under- or over-utilization. In this survey, we present a summary of the characteristics of various big data programming models and applications and provide a review of cloud computing infrastructures and related technologies, such as virtualization and software-defined networking, that increasingly support big data systems. Moreover, we provide a brief review of data center topologies, routing protocols, and traffic characteristics, and emphasize the implications of big data on such cloud data centers and their supporting networks. Wide-ranging efforts have been devoted to optimizing systems that handle big data in terms of various application performance metrics and/or infrastructure energy efficiency. This survey aims to summarize some of these studies, which are classified according to their focus into applications-level, networking-level, or data center-level optimizations. Finally, some insights and future research directions are provided.

Index Terms— Big Data, MapReduce, Machine Learning, Data Streaming, Cloud Computing, Cloud Networking, Software-Defined Networking (SDN), Virtual Machines (VM), Network Function Virtualization (NFV), Containers, Data Center Networking (DCN), Energy Efficiency, Completion Time, Scheduling, Routing.

I INTRODUCTION The evolving paradigm of big data is essential for critical advancements in data processing models and the underlying acquisition, transmission, and storage infrastructures [1]. Big data differs from traditional data in being potentially unstructured, rapidly generated, continuously changing, and massively produced by a large number of distributed users or devices. Typically, big data workloads are transferred into powerful data centers containing sufficient storage and processing units for real-time or batch computations and analysis. A widely used characterization for big data is the "5V" notion which describes big data through its unique attributes of Volume, Velocity, Variety, Veracity, and Value [2]. In this characterization, the volume refers to the vast amount of data produced, which is usually measured in Exabytes (i.e. 2^60 or 10^18 bytes) or Zettabytes (i.e. 2^70 or 10^21 bytes), while the velocity reflects the high speed or rate of data generation and hence potentially the short-lived useful lifetime of data. Variety indicates that big data can be composed of different types of data which can be categorized into structured and unstructured. An example of structured data is bank transactions which can fit into relational database systems, and an example of unstructured data is social media content that could be a mix of text, photos, animated Graphics Interchange Format (GIF) images, audio files, and videos contained in the same element (e.g. a tweet, or a post). The veracity measures the trustworthiness of the data as some generated portions could be erroneous or inaccurate, while the value measures the ability of the user or owner of the data to extract useful information from the data.

The global data volume in 2020 is predicted to be around 40,000 Exabytes, which represents a growth factor of 300 compared to the global data volume in 2005 [3]. The global data volume was estimated at about 640 Exabytes in 2010 [4] and at about 2,700 Exabytes in 2015 [5]. This huge growth in data volumes is the result of continuous developments in various applications that generate massive and rich content related to a wide range of human activities. For example, online business transactions are expected to reach a rate of 450 Billion transactions per day by 2020 [4]. Social media platforms such as Facebook, LinkedIn, and Twitter, which have between 300 Million and 2 Billion subscribers accessing them through web browsers on personal computers (PCs) or through applications installed on tablets and smart phones, are enriching the Internet with content in the range of several Terabytes (2^40 bytes) per day [5]. Analyzing the thematic connections between subscribers, for example by grouping people with similar interests, is opening remarkable opportunities for targeted marketing and e-commerce. Moreover, the subscribers' behaviours and preferences, tracked through their activities, clickstreams, requests, and collected web log files, can be analyzed with big data mining tools for profound psychological, economic, business-oriented, and product improvement studies [6], [7]. To accelerate the delay-sensitive operations of web searching and indexing, distributed programming models for big data such as MapReduce were developed [8]. MapReduce is a powerful, reliable, and cost-effective programming model that performs parallel processing for large distributed datasets. These features have enabled the development of different distributed programming big data solutions and cloud computing applications.

Fig. 1. Big data communication, networking, and processing infrastructure, and examples of big data applications.

A wide range of applications are considered big data applications, ranging from data-intensive scientific applications that require extensive computations to applications with massive datasets that require manipulation, such as in earth sciences, astronomy, nanotechnology, genomics, and bioinformatics [9]. Typically, the computations, simulations, and modelling in such applications are carried out in High Performance Computing (HPC) clusters with the aid of distributed and grid computing. However, the growth of datasets beyond the capacities of these systems, in addition to the desire to share datasets for scientific research collaborations in some disciplines, is encouraging the utilization of big data applications in cloud computing infrastructures with commodity devices for scientific computations despite the resultant performance and cost tradeoffs [10].

With the prevalence of mobile applications and services that have extensive computational and storage demands exceeding the capabilities of current smart phones, emerging technologies such as Mobile Cloud Computing (MCC) were developed [11]. In MCC, the computational and storage demands of applications are outsourced to powerful remote servers (or nearby servers, as in Mobile Edge Computing (MEC)) over the Internet. As a result, on-demand rich services such as video streaming, interactive video, and online gaming can be effectively delivered to capacity- and battery-limited devices. Video content accounted for 51% of the total mobile data traffic in 2012 [11], and is predicted to account for 78% of an expected total volume of 49 Exabytes by 2021 [12]. Due to these huge demands, in addition to the large sizes of video files, big video data platforms are facing several challenges related to video streaming, storage, and management, while needing to meet strict quality-of-experience (QoE) requirements [13]. In addition to mobile devices, the wide range of everyday physical objects that are increasingly interconnected for automated operations has formed what is known as the Internet-of-Things (IoT). In IoT systems, the underlying communication and networking infrastructure is typically integrated with big data computing systems for data collection, analysis, and decision-making. Several technologies such as RFID, low power communication technologies, Machine-to-Machine (M2M) communications, and wireless sensor networking (WSN) have been suggested for improved IoT communications and networking infrastructures [14]. To process the big data generated by IoT devices, different solutions such as cloud and fog computing were proposed [15]-[31]. Existing cloud computing infrastructures could be utilized by aggregating and processing big data in powerful central data centers. Alternatively, data could be processed at the edge, where fog computing units, typically with limited processing capacities compared to the cloud, are utilized [32]. Edge computing reduces both the traffic in core networks and the latency, by being closer to end devices. The connected devices could be sensors gathering different real-time measurements, or actuators performing automated control operations in industrial, agricultural, or smart building applications. IoT can support vehicle communication to realize smart transportation systems. IoT can also support medical applications such as wearables and telecare applications for remote treatment, diagnosis, and monitoring [33]. With this variety in IoT devices, the number of Internet-connected things is expected to exceed 50 Billion by 2020, and the services provided by IoT are expected to add $15 Trillion to the global Gross Domestic Product (GDP) in the next 20 years [14]. Figure 1 provides a generalized big data communication, networking, and processing infrastructure and examples of applications that can utilize it.

Fig.2. Classification of big data applications optimization studies.

Achieving the full potential of big data requires a multidisciplinary collaboration between computer scientists, engineers, data scientists, as well as statisticians and other stakeholders [4]. It also calls for huge investments and developments by enterprises and other organizations to improve big data processing, management, and analytics infrastructures to enhance decision making and service offerings. Moreover, there is an urgent need to integrate new big data applications with existing Application Program Interfaces (API) such as the Structured Query Language (SQL), and the R language for statistical computing. More than $15 billion have already been invested in big data systems by several leading Information Technology (IT) companies such as IBM, Oracle, Microsoft, SAP, and HP [34]. One of the challenges of big data in enterprise and cloud infrastructures is the existence of various workloads and tenants with different Service Level Agreement (SLA) requirements that need to be hosted on the same set of clusters. An early solution to this challenge at the application level is to utilize a distributed system to control the access to and sharing of data within the clusters [35]. At the infrastructure level, solutions such as Virtual Machines (VMs) or containers dedicated to each application or tenant were utilized to support the isolation between their assigned resources [1], [34]. Big data systems are also challenged by security, privacy, and governance related concerns. Furthermore, as the computational demands of the increasing data volumes are exceeding the capabilities of existing commodity infrastructures, enhanced and energy efficient processing and networking infrastructures for big data have to be investigated and optimized.

This survey paper aims to summarize a wide range of studies that use different state-of-the-art and emerging networking and computing technologies to optimize and enhance big data applications and systems in terms of various performance metrics such as completion time, data locality, load balancing, fairness, reliability, and resource utilization, and/or their energy efficiency. Due to the popularity and wide applicability of the MapReduce programming model, its related optimization studies will be the main focus in this survey. Moreover, optimization studies for big data management and streaming, in addition to generic cloud computing applications and bulk data transfer, are also considered. The optimization studies in this survey are classified according to their focus into applications-level, cloud networking-level, and data center-level studies as summarized in Figure 2. The first category, at the application level, targets the studies that extend or optimize existing framework parameters and mechanisms, such as optimizing job and data placements and scheduling, in addition to developing benchmarks, traces, and simulators [36]-[119]. As the use of big data applications is evolving from clusters with controlled environments into cloud environments with geo-distributed data centers, several additional challenges are encountered. The second category, at the networking level, focuses on optimizing cloud networking infrastructures for big data applications and includes inter-data center networking, virtual machine assignment, and bulk data transfer optimization studies [120]-[203]. The increasing data volumes and processing demands are also challenging the data centers that store and process big data. The third category, at the data center level, targets optimizing the topologies, routing, and scheduling in data centers for big data applications, in addition to the studies that utilize, demonstrate, and suggest scaled-up computing and networking infrastructures to replace commodity hardware in the future [204]-[311]. For the performance evaluations in the aforementioned studies, either realistic traces or deployments in experimental testbed clusters are utilized.

Although several big data related surveys and tutorials are available, to the best of our knowledge, none has extensively addressed optimizing big data applications while considering the technological aspects of their hosting cloud data centers and networking infrastructures. The tutorial in [1] and the survey in [312] considered mapping the role of cloud computing, IoT, data centers, and applications to the acquisition, storage, and processing of big data. The authors in [313]-[316] extensively surveyed the advances in big data processing frameworks and compared their components, usage, and performance. A review of benchmarking big data systems is provided in [317]. The surveys in [318], [319] focused on optimizing job scheduling at the application level, while the survey in [320] additionally tackled extensions, tuning, hardware acceleration, security, and energy efficiency for MapReduce. The environmental impacts of big data and its use in greening applications and systems were discussed in [321]. Security and privacy concerns of MapReduce in cloud environments were discussed in [322], while the challenges and requirements of geo-distributed batch and streaming big data frameworks were outlined in [323]. The surveys in [324]-[326] addressed the use of big data analytics to optimize wired and wireless networks, while the survey in [327] overviewed big data mathematical representations for networking optimizations. The scheduling of flows in data centers for big data is addressed in [328], and the impact of data center frameworks on scheduling and resource allocation is surveyed in [329] for three big data applications.

This survey paper is structured as follows: For the convenience of the reader, brief overviews of the state-of-the-art and advances in big data programming models and frameworks, cloud computing and its related technologies, and cloud data centers are provided before the corresponding optimization studies. Section II reviews the characteristics of big data programming models and existing batch processing, stream processing, and storage management applications, while Section III summarizes the applications-focused optimization studies. Section IV discusses the prominence of cloud computing for big data applications, and the implications of big data applications on cloud networks. It also reviews some related technologies that support big data and cloud computing systems such as machine and network virtualization, and Software-Defined Networking (SDN). Section V summarizes the cloud networking-focused optimization studies. Section VI briefly reviews data center topologies, traffic characteristics, and routing protocols, while Section VII summarizes the data center-focused optimization studies. Finally, Section VIII provides future research directions, and Section IX concludes the survey. Key acronyms are listed in Appendix A below.

APPENDIX A LIST OF KEY ACRONYMS
AM Application Master.
ACID Atomicity, Consistency, Isolation, and Durability.
BASE Basically Available Soft-state Eventual consistency.
CAPEX Capital Expenditure.
CPU Central Processing Unit.
CSP Cloud Service Provider.
DAG Directed Acyclic Graph.
DCN Data Center Networking.
DFS Distributed File System.
DVFS Dynamic Voltage and Frequency Scaling.
EC2 Elastic Compute Cloud.
EON Elastic Optical Network.
ETSI European Telecom Standards Institute.
HDFS Hadoop Distributed File System.
HFS Hadoop Fair Scheduler.
HPC High Performance Computing.
I/O Input/Output.
ILP Integer Linear Programming.
IP Internet Protocol.
ISP Internet Service Provider.
JT Job Tracker.
JVM Java Virtual Machine.
MILP Mixed Integer Linear Programming.
NFV Network Function Virtualization.
NM Node Manager.
NoSQL Not only SQL.
OF OpenFlow.
O-OFDM Optical Orthogonal Frequency Division Multiplexing.
OPEX Operational Expenses.
OVS Open vSwitch.
P2P Peer-to-Peer.
PON Passive Optical Network.
QoE Quality of Experience.
QoS Quality of Service.
RAM Random Access Memory.
RDBMS Relational Data Base Management System.
RDD Resilient Distributed Datasets.
RM Resource Manager.
ROADM Reconfigurable Optical Add Drop Multiplexer.
SDN Software Defined Networking.
SDON Software Defined Optical Networks.
SLA Service Level Agreement.
SQL Structured Query Language.
TE Traffic Engineering.
ToR Top-of-Rack.
TT Task Tracker.
VM Virtual Machine.
VNE Virtual Network Embedding.
VNF Virtual Network Function.
WAN Wide Area Network.
WC WordCount.
WDM Wavelength Division Multiplexing.
WSS Wavelength Selective Switch.
YARN Yet Another Resource Negotiator.

II PROGRAMMING MODELS, PLATFORMS, AND APPLICATIONS FOR BIG DATA ANALYTICS: This Section reviews some of the programming models developed to provide parallel computation for big data. These programming models include MapReduce [8], Dryad [330], and Cloud Dataflow [331]. The input data in these models can be generally categorized into bounded and unbounded data depending on the application's processing speed requirements. In applications that have no strict processing speed requirements, the input data can be aggregated, bounded, and then processed in batch mode. In applications that require instantaneous analysis of continuous data flows (i.e. unbounded data), streaming mode is utilized. The MapReduce programming model, with the support of open source platforms such as Hadoop, has been widely used for efficient, scalable, and reliable computations at reduced costs, especially for batch processing. The maturity of such platforms has supported the development of several scalable commercial or open-source big data applications and services for processing and data storage management. The rest of this Section is organized as follows: Subsection II-A illustrates the aforementioned programming models, while Subsection II-B briefly describes the characteristics and components of Apache Hadoop and its related applications. Subsection II-C focuses on big data storage management applications, while Subsection II-D summarizes some of the in-memory big data applications. Subsection II-E covers distributed graph processing. Finally, Subsections II-F and II-G elaborate on big data stream processing applications and the Lambda hybrid architecture, respectively. A. Programming Models: The MapReduce programming model was introduced by Google in 2003 as a cost-effective solution for processing massive data sets. MapReduce utilizes distributed computations in commodity clusters that run in two phases, map and reduce, which are adapted from the Lisp functional programming language [8]. The MapReduce user defines the required functions of each phase by using a programming language such as C++, Java, or Python, and then submits the code as a single MapReduce job to process the data. The user also defines a set of parameters to configure the job. Each MapReduce job consists of a number of map tasks that depends on the input data size, and a number of reduce tasks that depends on the configuration. Each map task is assigned to process a unique portion of the input data set, preferably available locally, and hence can run independently from other map tasks. The processing starts by transforming the input data into the key-value schema and applying the map function to compute other key-value pairs, also known as the intermediate results. These results are then shuffled to reduce tasks according to their keys, where each reduce task is assigned to process intermediate results with a unique set of keys. Finally, each reduce task generates the final outputs. The internal operational details of MapReduce, such as assigning the nodes within the cluster to map or reduce tasks, partitioning the input data, task scheduling, fault tolerance, and inter-machine communications, are typically performed by the run-time system and are hidden from the users. The input and output data files are typically managed by a Distributed File System (DFS) that provides a unified view of the files and their details, and allows various operations such as replication, read, and write operations.
An example of DFSs is the Google File System (GFS) [332], which is a fault-tolerant, reliable, and scalable chunk-based distributed file system designed to support MapReduce in Google's commodity servers. The typical chunk size in GFS is 64 MB and each chunk is replicated in different nodes, with a default of 3 replicas, to support fault-tolerance. The components of a typical MapReduce cluster and the implementation details of the MapReduce programming model are illustrated in Figure 3. One of the nodes in the cluster is set to be a Master, while the others are set to be Workers and are assigned to either map or reduce tasks by the master. Besides task assignment, the master is also responsible for monitoring the performance of the running tasks and for checking the statuses of the nodes within the cluster. The master also manages the information about the location of the running jobs, data replicas, and intermediate results. The detailed steps of implementing a MapReduce job are as follows [8]: 1) The MapReduce code is copied to all cluster nodes. In this code, the user defines the map and reduce functions and provides additional parameters such as input and output data types, names of the output files, and the number of reduce workers. 2) The master assigns map and reduce tasks to available workers, where typically there are more map workers than reduce workers and each map worker is assigned several map tasks. 3) The input data, in the form of key-value pairs, is split into smaller partitions. The splits (S) and their replicas (SR) are distributed in the map workers' local disks as illustrated in Figure 3. The splits are then processed concurrently in their assigned map workers according to their scheduling. Each map function produces intermediate results (IR) consisting of the intermediate key-value pairs. These results are then materialized (i.e. saved persistently) in the local disks of the map workers. 4) The intermediate results are divided into (R) parts to be processed by R reduce workers. Partitioning can be done through hash functions (e.g. hash(key) mod R) to ensure that each key is assigned to only one reduce worker. The locations of the hashed intermediate results and their file sizes are sent to the master node.

Fig. 3. Google’s MapReduce cluster components and the programming model implementation.

5) A reduce task is composed of shuffle, sort, and reduce phases. The shuffle phase can start when 5% of the map results are generated; however, the final reduce phase cannot start until all map tasks are completed. For shuffling, each reduce worker obtains the locations of the intermediate pairs with the keys assigned to it and fetches the corresponding results from the map workers' local disks, typically via the HyperText Transfer Protocol (HTTP). 6) Each reduce worker then sorts its intermediate results by the keys. The sorting is performed in the Random Access Memory (RAM) if the intermediate results can fit; otherwise, external sort-merge algorithms are used. The sorting groups all the occurrences of the same key and forms the shuffled intermediate results (SIR). 7) Each reduce worker applies the assigned user-defined reduce function on the shuffled data to generate the final key-value pairs output (O). The final output files are then saved in the distributed file system. In MapReduce, fault-tolerance is achieved by re-executing the failed tasks. Failures can occur due to hardware causes such as disk failures, running out of disk space or memory, and socket timeouts. Each map or reduce task can be in one of three states: idle, in-progress, or completed. If an in-progress map task fails, the master changes its status to idle to allow it to be re-scheduled on an available map node containing a replica of the data. If a map worker fails while having some completed map tasks, all its map tasks must be re-scheduled as the intermediate results, which are only saved on local disks, are no longer accessible. In this case, all reduce workers must be re-scheduled to obtain the correct and complete set of intermediate results. If a reduce worker fails, only its in-progress tasks are re-scheduled as the results of completed reduce tasks are saved in the distributed file system. To improve MapReduce performance, speculative execution can be activated, where backup tasks are created to speed up the lagging in-progress tasks known as stragglers [8]. The MapReduce programming model is capable of solving several common programming problems such as word count and sort, in addition to implementing complex graph processing, data mining, and machine learning applications. However, the speed requirements of some computations might not be satisfied by MapReduce due to several limitations [333]. Moreover, developing efficient MapReduce applications requires advanced programming skills to fit the computations into the map and reduce pipeline, and deep knowledge of the underlying infrastructure to properly configure and optimize a wide range of parameters [334]-[337]. One of MapReduce's limitations is that transferring non-local input data to map workers and shuffling intermediate results to reduce workers typically require intensive networking bandwidth and disk I/O operations. Early efforts to minimize the effects of these bottlenecks included maximizing data locality, where the computations are carried out closer to the data [8]. Another limitation is due to the fault-tolerance mechanism that requires materializing the entire output of MapReduce jobs in the disks managed by the DFS before it becomes accessible for further computations. Hence, MapReduce is generally less suitable for interactive and iterative computations that require repetitive access to results.
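To make the map, shuffle, and reduce phases above concrete, the following is a minimal WordCount sketch in the style of Hadoop Streaming, where user-defined map and reduce functions read key-value pairs from standard input and write them to standard output (Python is assumed here purely for illustration; file names and cluster configuration are omitted):

```python
#!/usr/bin/env python3
"""Illustrative WordCount in the Hadoop Streaming style; not tied to a specific cluster.
Run with argument "map" for the map phase and "reduce" (or no argument) for the reduce phase."""
import itertools
import sys


def mapper():
    # Each map task reads its local input split and emits intermediate (word, 1) pairs.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    # The framework routes each key to one reduce worker (e.g. hash(word) mod R) and sorts
    # by key, so all pairs for a word arrive contiguously and can be summed in one pass.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

When such scripts are handed to the framework, the run-time system takes care of splitting the input, shuffling and sorting the intermediate pairs, and re-executing failed tasks, as described in the steps above.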
An implementation variation known as MapReduce Online [338] supports shuffling by utilizing RAM resources to pipeline intermediate results between the map and reduce stages before the materialization. Several other programming models were developed as variants of MapReduce [314]. One of these variants is Dryad, which is a high-performance general-purpose distributed programming model for parallel applications with coarse-grain data [330]. In Dryad, a Directed Acyclic Graph (DAG) is used to describe the jobs by representing the computations as the graph vertices and the data communication patterns as the graph edges. The Job Manager within Dryad schedules the vertices, which contain sequential programs, to run concurrently on a set of machines that are available at run time. These machines can be different cores within the same multi-core PC or can be thousands of machines within a large data center. Dryad provides fault-tolerance and efficient resource utilization by allowing graph modification during the computations. Unlike MapReduce, which restricts the programmer to a single input file and a single output file, Dryad allows an arbitrary number of input and output files. As a successor to MapReduce, Google has recently introduced "Cloud Dataflow", which is a unified programming model with enhanced processing capabilities for bounded and unbounded data. It provides a balance between correctness, latency, and cost when processing massive, unbounded, and out-of-order data [331]. In Cloud Dataflow, the data is represented as tuples containing the key, value, event-time, and the required time window for the processing. This supports sophisticated user requirements such as event-time ordering of results by utilizing data windowing, which divides the data streams into finite chunks to be processed in groups. Cloud Dataflow utilizes the features of both FlumeJava, which is a batch engine, and MillWheel, which is a streaming engine [331]. The core primitives of Cloud Dataflow are ParDo, which is an element-wise generic parallel processing function, and GroupByKey and GroupByKeyAndWindow, which aggregate data with the same key according to the user requests. B. Apache Hadoop Architecture and Related Software: Hadoop, which is currently under the auspices of the Apache Software Foundation, is an open source software framework written in Java for reliable and scalable distributed computing [339]. This framework was initiated by Doug Cutting, who utilized the MapReduce programming model for indexing web crawls, and was open-sourced by Yahoo in 2005. Beside Apache, several organizations have developed and customized Hadoop distributions tailored to their infrastructures, such as HortonWorks, Cloudera, Amazon Web Services (AWS), Pivotal, and MapR Technologies. The Hadoop ecosystem allows other programs to run on the same infrastructure as MapReduce, which made it a natural choice for enterprise big data platforms [35]. The basic components of the first versions of Hadoop (Hadoop 1.x) are depicted in Figure 4. These versions contain a layer for the Hadoop Distributed File System (HDFS) and a layer for the MapReduce 1.0 engine, which resembles Google's MapReduce, and can host other applications in the top layer. The MapReduce 1.0 layer follows the master-slave architecture. The master is a single node containing a Job Tracker (JT), while each slave node contains a Task Tracker (TT). The JT handles job assignment and scheduling and maintains the data and metadata of jobs, in addition to resource information.
It also monitors the liveness of TTs and the availability of their resources through periodic heartbeat messages, typically every 3 seconds. Each TT contains a predefined set of slots. Once it accepts a map or a reduce task, it launches a Java Virtual Machine (JVM) in one of its slots to perform the task, and periodically updates the JT with the task status [339]. The HDFS layer consists of a name node in the master and a data node in each slave node. The name node stores the details of the data nodes and the addresses of the data blocks and their replicas. It also checks the data nodes via heartbeat messages and manages load balancing. For reliability, a secondary name node is typically assigned to save snapshots of the primary name node. As in GFS, the default block size in HDFS is 64 MB and three replicas are maintained for each block for fault-tolerance, performance improvement, and load balancing. Beside GFS and HDFS, several distributed file systems were developed, such as Amazon's Simple Storage Service (S3), the Moose File System (MFS), the Kosmos distributed file system (KFS), and Colossus [314], [340]. The default task scheduling mechanisms in Hadoop are First-In First-Out (FIFO), the capacity scheduler, and the Hadoop Fair Scheduler (HFS). FIFO schedules the jobs according to their arrival time, which leads to undesirable delays in environments with a mix of long batch jobs and small interactive jobs [319]. The capacity scheduler, developed at Yahoo, reserves a pool containing minimum resource guarantees for each user, and hence suits systems with multiple users [319]. FIFO scheduling is then used for the jobs of the same user. The fair scheduler, developed at Facebook, dynamically allocates the resources equally between jobs. It thus improves the response time of small jobs [46].
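The fair sharing idea can be illustrated with a simplified max-min allocation of task slots (a sketch only, not the actual HFS implementation): small jobs receive their full demand while large batch jobs absorb the remaining slots.

```python
# Simplified max-min style sharing of task slots among running jobs; an illustration
# of the fair scheduling idea, not the Hadoop Fair Scheduler code.
def fair_share(total_slots, demands):
    """demands: dict mapping job name -> number of tasks it can currently run."""
    alloc = {job: 0 for job in demands}
    remaining = total_slots
    active = {job for job, demand in demands.items() if demand > 0}
    while remaining > 0 and active:
        share = max(remaining // len(active), 1)
        for job in list(active):
            give = min(share, demands[job] - alloc[job], remaining)
            alloc[job] += give
            remaining -= give
            if alloc[job] >= demands[job]:
                active.discard(job)   # this job's demand is fully met
            if remaining == 0:
                break
    return alloc

# A small interactive job gets its full demand; the batch job absorbs the rest.
print(fair_share(10, {"batch_job": 20, "interactive_job": 3}))
# {'batch_job': 7, 'interactive_job': 3}
```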

Fig. 4. Framework components in Hadoop: (a) Hadoop 1.x, and (b) Hadoop 2.x.

Hadoop 2.x, which is also depicted in Figure 4, introduced a resource management platform named YARN (Yet Another Resource Negotiator) [341]. YARN decouples the resource management infrastructure from the processing components and enables the coexistence of different processing frameworks beside MapReduce, which increases the flexibility in big data clusters. In YARN, the JT and TT are replaced with three components, which are the Resource Manager (RM), the Node Manager (NM), and the Application Master (AM). The RM is a per-cluster global resource manager which runs as a daemon on a dedicated node. It contains a scheduler that dynamically leases the available cluster resources in the form of containers (further explained in Subsection IV-B4), which are considered as logical bundles (e.g. 2 GB RAM, 1 Central Processing Unit (CPU) core), among competing MapReduce jobs and other applications according to their demands and scheduling priorities. An NM is a per-server daemon that is responsible for monitoring the health of its physical node, tracking its container assignments, and managing the containers' lifecycle (i.e. starting and killing). An AM is a per-application container that manages the resource consumption and the execution flow of the jobs, and also handles the fault-tolerance tasks. The AM, which typically needs to harness resources from several nodes to finish its job, issues a resource request to the RM indicating the required number of containers, the required resources per container, and the locality preferences. Figure 5 illustrates the differences between Hadoop 1.x and Hadoop 2.x with YARN. A detailed study of various releases of Hadoop is presented in [342]. The authors also provided a brief comparison of these releases in terms of their energy efficiency and performance.
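The resource-request abstraction can be pictured with a small illustrative model (the class and field names below are hypothetical and do not correspond to the actual YARN client API), in which an AM asks the RM for a number of containers with given resources and locality hints:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContainerRequest:
    """Toy model of an AM's request to the RM; not the real YARN API."""
    num_containers: int
    memory_mb: int                 # RAM per container
    vcores: int                    # CPU cores per container
    preferred_nodes: List[str] = field(default_factory=list)  # data-locality hints
    priority: int = 0


# An AM asking for 10 containers of 2 GB / 1 vCore each, preferring the nodes that
# hold its input splits so that tasks can be scheduled close to their data.
request = ContainerRequest(num_containers=10, memory_mb=2048, vcores=1,
                           preferred_nodes=["node-17", "node-42"])
print(request)
```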

Fig. 5. Comparison of clusters with (a) Google's MapReduce, (b) Hadoop 1.x, and (c) Hadoop 2.x with YARN.

YARN increases the resource allocation flexibility in MapReduce 2 as it utilizes flexible containers for resource allocation and hence eliminates the fixed resource slots used in MapReduce 1. However, this advantage comes at the expense of added system complexity and a slight increase in power consumption when compared to MapReduce 1 [342]. Another difference is that the shuffling of intermediate results in MapReduce 2 is performed via auxiliary services that preserve a container's output before killing it. The communications between the AMs, NMs, and the RM are heartbeat-based. If a node fails, the NM sends a heartbeat indicating the failure to the RM, which in turn informs the affected AMs. If an in-progress job fails, the NM marks it as idle and re-executes it. If an AM fails, the RM restarts it and synchronizes its tasks. If an RM fails, an old checkpoint is used and a secondary RM is activated [341]. A wide range of applications and programming frameworks can run natively in Hadoop as depicted in Figure 4. The difference in their implementation between the Hadoop versions is that in the 1.x versions they are forced to follow the MapReduce framework, while in the 2.x versions they are no longer restricted to it. Examples of these applications and frameworks are Pig [343], Tez [344], Hive [345], HBase [314], Storm [346], Giraph for graph processing, and Mahout for machine learning [315], [340]. Due to the lack of built-in declarative languages in Hadoop, Pig, Tez, and Hive were introduced to support querying and to replace ad-hoc user-written programs which were hard to maintain and reuse. Pig is composed of an execution engine and a declarative scripting language named Pig Latin that compiles SQL-like queries into an equivalent set of sequenced MapReduce jobs. Pig hides the complexity of MapReduce and directly provides advanced operations such as filtering, joining, and ordering [343]. Tez is a flexible input-processor-output runtime model that transforms queries into an abstracted DAG where the vertices represent the parallel tasks, and the edges represent the data movement between different map and reduce stages [344]. It supports in-memory operations which makes it more suitable for interactive processing than MapReduce. Hive [345] is a data warehouse software developed at Facebook. It contains a declarative language, HiveQL, that automatically generates MapReduce jobs from SQL-like user queries.
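To illustrate this declarative style, the snippet below uses Spark SQL as an analogous stand-in for HiveQL (the input file clicks.json and its columns are hypothetical); the engine plans and executes the query as distributed jobs instead of requiring hand-written MapReduce code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("declarative-query-example").getOrCreate()

# Register a (hypothetical) clickstream dataset as a table that can be queried declaratively.
spark.read.json("clicks.json").createOrReplaceTempView("clicks")

# The SQL below is compiled into a distributed execution plan by the engine,
# much as HiveQL queries are compiled into MapReduce (or Tez) jobs.
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM clicks
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```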

C. Big Data Storage Management Applications: As with the programming models, the increasing data volume is challenging legacy storage management systems. Using centralized relational database systems with big data will typically lead to several inefficiencies. This has encouraged the use of large-scale distributed cloud-based data storage management systems known as "key-value stores" or "Not only SQL (NoSQL)" systems. It is well known that no single computing system is capable of providing effective processing for all types of workloads. To design a generic platform that suits several types of requests and workloads, some trade-offs have to be considered. A set of measures for these trade-offs in data storage management systems follows the Consistency, Availability, and Partition-tolerance (CAP) theorem, which states that any distributed system can only satisfy two of these three properties. Consistency reflects the fact that all replicas of an entry must have the same value at all times, and that read operations should return the latest value of that entry. Availability implies that the requested operations should always be allowed and performed promptly, while partition-tolerance indicates that the system can function if some parts of it are disconnected [313]. Most of the traditional Relational Data Base Management Systems (RDBMS), which utilize SQL for querying, are classified as "CA", which stands for consistency and availability. As partition-tolerance is a key requirement in cloud distributed infrastructures, NoSQL systems are classified as either "CP", where the availability is relaxed, or "AP", where the consistency is relaxed and replaced by eventual, timeline, or session consistency. For transactions to be processed concurrently, RDBMS are required to provide the Atomicity, Consistency, Isolation, and Durability (ACID) guarantees. Most RDBMS rely on expensive shared-memory or shared-disk hardware to ensure high performance [314]. On the other hand, NoSQL systems utilize commodity shared-nothing hardware while ensuring scalability, fault-tolerance, and cost-effectiveness, traded for consistency. Thus, the ACID guarantees are typically relaxed or replaced by the Basically Available Soft-state Eventual consistency (BASE) guarantees. While RDBMS are considered mature and well-established, NoSQL systems still lack experienced programmers and most of these systems are still in pre-production phases [313]. Several big data storage management applications were developed for commercial internal production workloads such as BigTable [347], PNUTS [348], and DynamoDB [349]. BigTable is a column-oriented storage system developed at Google for managing structured large-scale data sets at the petabyte scale. It only supports single-row transactions and performs an atomic read-modify-write sequence that relies on a distributed locking system called Chubby. PNUTS is a massive-scale database system designed at Yahoo to support their web-based applications, while DynamoDB is a highly scalable and available distributed key-value data store developed at Amazon to support their cloud-based applications [340]. Examples of open-source big data management systems include HBase, HadoopDB, and Cassandra [314]. HBase is a key-value column-oriented database management system, while HadoopDB is a hybrid system that combines the scalability of MapReduce with the performance guarantees of parallel databases.
In HadoopDB, the queries are expressed in SQL but are executed in parallel through the MapReduce framework [313], [350]. Cassandra is a highly scalable, eventually consistent, distributed structured key-value store developed at Facebook [351]. Examples of other systems that integrate indexing capabilities with MapReduce are Hadoop++ [352] and the Hadoop Aggressive Indexing Library (HAIL). Hadoop++ adds indexing and joining capabilities to Hadoop without changing its framework, while HAIL creates different clustered indexes for each data replica and supports multi-attribute querying [315]. D. Distributed In-memory Processing: Distributed in-memory processing systems are widely preferred for iterative, interactive, and real-time big data applications due to the bottleneck of the relatively slow data materialization processes in existing disk-based systems [316]. Running these applications in legacy MapReduce systems requires repetitive materialization as map and reduce tasks have to be re-launched iteratively. This in turn wastes disk Input/Output (I/O), CPU, and network bandwidth resources [315]. In-memory systems use fast memory units such as Dynamic Random Access Memory (DRAM) or cache units to provide rapid access to data during run-time and avoid slow disk I/O operations. However, as memory units are volatile, in-memory systems are required to adopt advanced mechanisms to guarantee fault-tolerance and data durability. Examples of in-memory NoSQL databases are RAMCloud [353] and HANA [354]. RAMCloud aggregates DRAM resources from thousands of commodity servers to provide a large-scale database system with low latency. Distributed cache systems can also enhance the performance of large-scale web applications. Examples of these systems are Redis, which is an in-memory data structure store, and Memcached, which is a light-weight in-memory key-value object caching system with a strict Least Recently Used (LRU) eviction mechanism [316]. To support in-memory interactive and iterative data processing, Spark, which is a scalable data analytics platform written in Scala, was introduced [355]. In Spark, the data is stored as Resilient Distributed Datasets (RDD) [356], which are a general-purpose and fault-tolerant abstraction for data sharing in distributed computations. RDDs are created in memory through coarse-grained deterministic transformations on datasets such as map, flatmap, filter, join, and GroupByKey. In cases of insufficient RAM, Spark performs lazy materialization of RDDs. RDDs are immutable, and applying further transformations creates new RDDs. This provides fault-tolerance without the need for replication, as the information about the transformations that created each RDD can be retrieved and reapplied to obtain the lost portions. E. Distributed Graph Processing: The partitioning and processing of graphs (i.e. descriptions of entities by vertices and of their relationships by connecting edges) is considered a key class of big data applications, especially for social networks that contain graphs with up to Billions of entities and edges [357]. Most big data graph processing applications utilize in-memory systems due to the iterative nature of their algorithms. Pregel [358], developed at Google, targets the processing of massive graphs on distributed clusters of commodity machines by using the Bulk Synchronous Parallel (BSP) programming model.
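The vertex-centric, superstep-based style of BSP graph processing can be sketched with a toy PageRank loop (illustrative plain Python, not the Pregel or Giraph API): in every superstep each vertex sends messages to its neighbours, and after a synchronization barrier all vertices update their state from the messages they received.

```python
# Toy vertex-centric PageRank run as synchronized supersteps (BSP style).
def run_pagerank(graph, num_supersteps=10, d=0.85):
    """graph: dict mapping each vertex to a list of its out-neighbours."""
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    for _ in range(num_supersteps):
        # Compute phase: each vertex sends rank/out-degree to its neighbours as messages.
        messages = {v: [] for v in graph}
        for v, neighbours in graph.items():
            if neighbours:
                share = ranks[v] / len(neighbours)
                for u in neighbours:
                    messages[u].append(share)
        # Synchronization barrier, then every vertex updates its state from received messages.
        ranks = {v: (1 - d) / n + d * sum(messages[v]) for v in graph}
    return ranks

print(run_pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```

In a distributed setting, the vertices (and hence the message exchanges) would be partitioned across workers, which is exactly where the partitioning strategies discussed next matter.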
Giraph, which is the open-source implementation of Pregel, uses online hashing or range-based partitioning to dispatch sub-graphs to workers [39]. Trinity is a distributed graph engine that optimizes distributed memory usage and communication cost under the assumption that the whole graph is partitioned across a distributed memory cloud [359]. Other examples of distributed graph processing applications are GraphLab [360], which utilizes asynchronous distributed shared memory, and PowerGraph [361], which focuses on the efficient partitioning of graphs with power-law distributions. F. Big Data Streaming Applications: Applications related to e-commerce and social networking typically receive virtually unbounded data, also known as streams, that are asynchronously generated by a huge number of sources or users at uncontrollable rates. Stream processing systems are required to provide low latency and efficient processing that copes with the arrival rate of the events within the streams. If they do not cope, the processing of some events is dropped, leading to load shedding, which is undesirable. Conceptually, a streaming system's model is composed of continuously arriving input data applied to almost static queries, while the batch and RDBMS systems model is composed of different queries applied to static data [362]. Stream processing can be performed in batch systems; however, this leads to several inefficiencies. Firstly, the implementation is not straightforward as it requires transforming the streams into partitions of batch data before processing. Secondly, considering relatively large partitions increases the processing latency, while considering small partitions increases the overheads of segmenting and of maintaining the inter-segment dependencies, in addition to the fault-tolerance overheads caused by the frequent materialization. Main Memory MapReduce (M3R) [363], MillWheel [364], Storm [346], Yahoo S4 [365], and Spark Streaming [366] are examples of stream processing systems. M3R was introduced to provide high reliability for interactive and continuous queries over terabytes of streamed data in small clusters [363]. However, it does not provide adequate resilience guarantees as it caches the results entirely in memory. MillWheel is a scalable, fault-tolerant, low-latency stream processing engine developed at Google [364] that allows time-based aggregations and provides fine-grained check-pointing. Storm was developed at Twitter as a distributed and fault-tolerant platform for real-time processing of streams [346]. It utilizes two primitives, spouts and bolts, to apply transformations to data streams, also named tuples, in a reliable and distributed manner. The spouts define the stream sources, while the bolts perform the required computations on the tuples and emit the resultant modified tuples to other bolts. The computations in Storm are described as graphs where each node is either a spout or a bolt, and the edges are the routes of the tuples. Storm relies on different grouping methods to specify the distribution of the tuples between the nodes. These include shuffle grouping, where streams are partitioned randomly; field grouping, where partitioning is performed according to a defined criterion; all grouping, where the streams are sent to all bolts; and global grouping, where all streams are copied to a single bolt. The Yahoo Simple Scalable Streaming System (S4) is a general-purpose distributed stream processing engine that provides low latency and fault-tolerance [365].
In S4, the computations are performed by identical Processing Elements (PEs) that are logically hosted in Processing Nodes (PNs). The coordination between PEs is performed by "ZooKeeper", which is an open-source cluster management application. The PEs either emit results to another PE or publish them, while PNs are responsible for listening to events, executing the required operations, dispatching events, and emitting output events. Spark Streaming [366], which is extended from Spark, introduced a stream programming model named discretized streams (D-Streams) to provide consistency, fault tolerance, and efficient integration with batch systems. Two types of operations can be used: transformation operations that produce new D-Streams, and output operations that save the resultant RDDs to HDFS. Spark Streaming allows users to seamlessly combine streaming, batch, and interactive queries. It also contains stateful operators such as windowing, incremental aggregation, and time-skewed joins. G. The Lambda Architecture: The Lambda architecture is a hybrid big data processing architecture that provides concurrent arbitrary processing and querying of batch and real-time data in a single entity [367]. As depicted in Figure 6, it consists of three layers: the batch layer, the speed layer, and the serving layer. The batch layer contains a batch system such as Hadoop or Spark to support the querying and processing of archived or stored data, while the speed layer contains a streaming system to provide low-latency processing of incoming input data in real-time. The input data is fed to both the batch and speed layers in parallel to produce two sets of results that represent the batch and real-time views of the queries, respectively. The serving layer's role is to combine both results and produce the finalized results. The Lambda architecture is typically integrated with general messaging systems such as Kafka to aggregate the queries of users. Kafka is a high-performance, open-source messaging system developed at LinkedIn for the purpose of aggregating high-throughput log files [368]. Although the Lambda architecture introduces a high level of flexibility by combining the benefits of both real-time and batch systems, it lacks simplicity due to the need to maintain two systems. Moreover, the Lambda architecture encounters some limitations with event-oriented applications [367].

Fig. 6. Components of the Lambda architecture.
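A minimal sketch of the serving layer's merge step described above is shown below (illustrative Python; the view contents are hypothetical and stand for results materialized by the batch layer and incrementally maintained by the speed layer):

```python
# Precomputed batch view (e.g. produced by a periodic Hadoop/Spark job over archived data).
batch_view = {"page_A": 10_240, "page_B": 3_310}
# Incremental real-time view (e.g. maintained by a stream processor since the last batch run).
speed_view = {"page_A": 57, "page_C": 4}


def serve_count(key):
    # The serving layer answers a query by merging the (stale but complete) batch view
    # with the (fresh but partial) speed view.
    return batch_view.get(key, 0) + speed_view.get(key, 0)


print(serve_count("page_A"))  # 10297
```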

III APPLICATIONS-FOCUSED OPTIMIZATION STUDIES This Section discusses some big data applications optimization studies performed at the application level. Optimizations at this level include tuning application parameters rather than using the default settings, or modifying the programming models themselves to obtain enhanced performance. The rest of this Section is organized as follows: Subsection III-A considers studies focusing on optimizing data and/or job and task placements. Subsection III-B summarizes studies that optimize job scheduling, while Subsection III-C describes and evaluates studies focusing on reducing the completion time of jobs. Finally, Subsection III-D provides a summary of big data benchmarking suites, traces, and simulators. Table I provides a summary of the studies presented in the first three Subsections while focusing on their objectives, targeted platform, optimization tools, benchmarks used, and experimental setup and/or simulation environments. A. Optimized Jobs/Data Placements: The early work in [36] addressed the problem of optimizing data placement in homogeneous Hadoop clusters. A dynamic algorithm that assigns data fragments according to the nodes' processing capacities was proposed to reduce data movement between over-utilized and under-utilized servers. However, the effects of replication were not considered. Typically, most of the nodes within a Hadoop cluster are kept active to provide high data availability, which is energy inefficient. In [37], the energy consumption of Hadoop clusters was reduced by dynamically switching off unused nodes while ensuring that replicas of all data sets are contained in a covering subset which is kept always active. Although the energy consumption was reduced, the running time of jobs was increased. Dynamic sizing and locality-aware scheduling in dedicated MapReduce clusters with three types of workloads, namely batch MapReduce jobs, web applications, and interactive MapReduce jobs, were addressed in [38]. A balance between performance and energy saving was achieved by allocating the optimum number of servers and delaying batch workloads while considering the data locality and delay constraints of web and interactive workloads. A Markov Decision Process (MDP) model was considered to reflect the stochastic nature of job arrivals, and an event-driven simulator was used to empirically validate the optimality of the proposed algorithms. Energy savings between 30% and 59% were achieved over the no-allocation strategy. Graph and iterative applications typically have highly dependent and skewed workloads as some portions of the datasets are accessed for processing more than others. The study in [39] proposed a Dependency Aware Locality for MapReduce (DALM) algorithm to process highly skewed and dependent input data. The algorithm was tested with Giraph and was found to reduce the cross-server traffic by up to 50%. Several studies addressed optimizing reduce task placement to minimize networking traffic by following greedy approaches. These approaches degrade the performance as they maximize the intermediate data locality for the current reduce tasks without considering the effects on the input data locality of map tasks in the following MapReduce jobs [40]. The theoretical and experimental study in [40] addressed optimizing reduce task data locality for sequential MapReduce jobs and achieved up to 20% improvement in performance compared to the greedy approaches.
The efficiency of indexing, grouping, and join querying operations can be significantly affected by the framework's data placement strategy. In [41], a data block allocation approach was proposed to reduce job execution time and improve query performance in Hadoop clusters. The default random data placement policy of Hadoop 1.0.3 was modified by using k-means clustering techniques to co-allocate related data blocks on the same clusters while considering the default replication factor. The results indicated a query performance improvement of up to 25%. The study in [42] addressed the poor resource utilization caused by misconfiguration in HBase. An algorithm named HConfig for semi-automating resource allocation and bulk data loading was proposed, and an improvement between 2% and 3.7% was achieved compared to the default settings. Several recent studies considered job or data placement optimizations in Hadoop 2.x with YARN. The impact of data locality on job completion times in YARN with the delay scheduler was addressed in [43]. To measure the data locality achieved, a YARN Location Simulator (YLocSim) tool was developed, validated, and compared with experimental results in a private cluster. Moreover, the effects of the inherent resource allocation imbalance, which increases with data locality, and of the redundant I/O operations due to requests for the same data by different containers, were studied. Existing job schedulers ignore the partitioning skew caused by uneven volumes of intermediate data shuffled to reducers. In [44], a framework to control the resource allocation of reduce tasks, named Dynamic REsource Allocation technique for MapReduce with Partitioning Skew (DREAMS), was developed while considering data replication. Experimental results show an improvement of up to 2.29 times compared to native YARN. Inappropriate configurations in YARN lead to degraded performance as the resource usage of tasks typically varies during execution, while the resources initially assigned to them are fixed. In [45], JellyFish, which is a self-tuning system based on YARN, was proposed to reduce the completion time and improve utilization by considering elastic containers, performing online reconfigurations, and rescheduling idle resources. The study considered the most critical subset of YARN parameters, and an average improvement of 65% was achieved compared to default YARN for repetitive jobs. B. Jobs Scheduling: Default scheduling mechanisms such as FIFO, the capacity scheduler, and HFS can lead to undesirable performance, especially with mixtures of small interactive and long batch jobs [318]. Several studies suggested enhanced scheduling mechanisms to overcome the inefficiencies of the default schedulers in terms of resource utilization, job makespan (i.e. the completion time of the last task), or energy efficiency. The conflict between fairness in scheduling and data locality was addressed in [46]. To maximize data locality, some tasks assigned according to fair scheduling are intentionally delayed until a node containing the corresponding data is available. An algorithm named delay scheduling was implemented on HFS and tested on commercial and private clusters using Hive workloads from Facebook. It was found that only short delays were required to achieve almost 100% locality. The response times of small jobs were improved by about five times at the expense of moderately slowing down larger jobs.
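The core rule of delay scheduling can be sketched as follows (an illustrative simplification with a toy Job class, not the HFS code from [46]): a job at the head of the fair-share queue declines a bounded number of scheduling opportunities rather than launch a non-local task.

```python
# Illustrative simplification of delay scheduling; Job is a toy stand-in, not HFS code.
class Job:
    def __init__(self, name, local_nodes, max_skips=3):
        self.name, self.local_nodes, self.max_skips = name, set(local_nodes), max_skips
        self.skipped = 0

    def offer_slot(self, node):
        """Called when `node` has a free slot and this job heads the fair-share queue."""
        if node in self.local_nodes:
            self.skipped = 0
            return f"launch local task of {self.name} on {node}"
        if self.skipped < self.max_skips:
            self.skipped += 1          # decline and wait briefly for a data-local node
            return None                # the slot is offered to the next job instead
        return f"launch non-local task of {self.name} on {node}"  # bounded wait expired


job = Job("hive_query", local_nodes=["node-3", "node-7"])
for free_node in ["node-1", "node-2", "node-3"]:
    print(free_node, "->", job.offer_slot(free_node))
```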
Quincy, a slot-based scheduler for Dryad, was implemented as a minimum-cost flow algorithm in [47] to achieve fairness and data locality for concurrent distributed jobs. The scheduling problem was encoded as a graph where the edges represent the competing demands of locality and fairness. Quincy provided better performance than queue-based schedulers and was capable of effectively reducing the network traffic. In [48], a data replication-aware scheduling algorithm was proposed to reduce the network traffic and speculative executions. The proposed map jobs scheduler was implemented in two waves. Firstly, the empty slots in the cluster were filled based on the number of hosted maps and the replication schemes. Secondly, a run-time scheduler was utilized to increase the locality and balance the intermediate data distribution for the shuffling phase. A map scheduler that balances locality and load balancing was proposed in [49]. In this work, the computing cluster is modeled as a time-slotted system where job arrivals are modeled as Bernoulli random variables and the service time is modelled as a geometric random variable with different mean values based on data locality. The scheduler utilized 'Join the Shortest Queue' (JSQ) and maximum weight policies and was proven to be throughput and delay optimal under heavy-traffic conditions. In [50], a Resource-aware Adaptive Scheduling (RAS) mechanism was proposed to improve resource utilization and achieve user-defined completion time goals. RAS utilizes offline job profiling information to enable dynamic adjustments of the number of slots in each machine while meeting the different requirements of the map and reduce phases. FLEX is a flexible scheduling allocation scheme introduced in [51] to optimize the response time, makespan, and SLA compliance. FLEX ensures minimum slot guarantees for jobs, as in the fair scheduler, and additionally introduces maximum slot guarantees. Experimental results showed that FLEX outperformed the fair scheduler by up to 30% in terms of the response time. The studies in [52], [53] scheduled MapReduce jobs according to a priori known job size information to optimize Hadoop clusters performance. HFSP in [52] focused on resource allocation fairness and the reduction of the response time among concurrent interactive and batch jobs by using the concepts of an ageing function and virtual time. LsPS in [53] is a self-tuning two-tier scheduler, where the first tier schedules among multiple users and the second tier schedules the jobs of each user. It aimed to improve the average response time by adaptively selecting between fair and FIFO schedulers. Prior job information was also utilized in [54] for offline MapReduce scheduling while taking server assignments into consideration. The authors in [55] reported that shuffling in MapReduce accounted for about a third of the overall jobs completion time. A joint scheduling mechanism for the map, shuffle, and reduce phases that considered their dependencies was implemented via linear programs and heuristics with precedence constraints. The study in [56] focused on improving the CPU utilization by dynamically scheduling waiting map tasks to nodes with high I/O waiting time. The execution time of I/O intensive jobs with the proposed scheduler was improved by 23% compared to FIFO. To reduce the energy consumption of performing MapReduce jobs, the work in [57] proposed an energy-aware scheduling mechanism while considering the Dynamic Voltage and Frequency Scaling (DVFS) settings for the cluster.
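The DVFS-based scheduling in [57] builds on the observation that dynamic CPU power falls roughly with the cube of the clock frequency while the execution time grows only linearly as the frequency drops, so the lowest frequency that still meets the deadline typically minimizes energy. A minimal sketch under these textbook assumptions (not the authors' models; the power figure and frequencies are hypothetical) is given below.

```python
# Minimal sketch of deadline-aware DVFS frequency selection (illustrative only;
# assumes dynamic power ~ f^3 and runtime ~ 1/f, a common textbook simplification).
def pick_frequency(frequencies_ghz, runtime_at_max_s, deadline_s, p_max_w=100.0):
    f_max = max(frequencies_ghz)
    best = None
    for f in frequencies_ghz:
        runtime = runtime_at_max_s * f_max / f          # slower clock, longer run
        power = p_max_w * (f / f_max) ** 3              # cubic dynamic-power scaling
        energy = power * runtime
        if runtime <= deadline_s and (best is None or energy < best[2]):
            best = (f, runtime, energy)
    return best   # (frequency, runtime, energy) or None if no setting meets the deadline

print(pick_frequency([1.2, 1.8, 2.4], runtime_at_max_s=600, deadline_s=1000))
```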
DVFS was experimentally utilized in [58] to reduce the energy consumption of computation-intensive workloads. In [59], two heuristics, namely EMRSA-I and EMRSA-II, were developed to assign map and reduce tasks to slots with the goal of reducing the energy consumption while satisfying SLAs. The energy consumption was reduced by about 40% compared to default schedulers at the expense of an increased jobs makespan. In [60], a dynamic slot allocation framework, DynamicMR, was proposed to improve the efficiency of Hadoop. A Dynamic Hadoop Slot Allocation (DHSA) algorithm was utilized to relax the strict slot allocation to either map or reduce tasks by enabling their reallocation to achieve a dynamic map-to-reduce ratio. DynamicMR outperformed YARN by 2%-9%, and considerably reduced the network contention caused by the high number of over-utilized reduce tasks. Unlike scheduling at the task level, the work in [61] introduced PRISM, which is a fine-grained resource-aware scheduler. PRISM, which divides tasks into phases each with its own resource usage profile, provided a 1.3 times reduction in the running time compared to default schedulers. Several recent studies considered optimizing MapReduce scheduling under YARN. HaSTE in [62] utilized the fine-grained CPU and RAM resources management capabilities of YARN to schedule map and reduce tasks under dependency constraints without the need for prior knowledge of tasks execution times. HaSTE improved the resources utilization and the makespan even for mixed workloads. The study in [63] suggested an SLA-aware energy-efficient scheduling scheme based on DVFS for Hadoop with YARN. The scheme, which was applied to the per-application AM, reduces the CPU frequency when tasks finish before their expected completion time. In [64], a priority-based resource scheduler for a streaming system named Quasit was proposed. The scheduler dynamically modifies the data paths of lower priority tuples to allow faster processing of higher priority tuples in vehicular traffic streams.
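The dynamic slot allocation idea behind DHSA in [60] can be illustrated with a minimal sketch (the slot counts are hypothetical and this is not the authors' code): idle slots of one type are temporarily lent to the other phase whenever its demand exceeds its own slot pool.

```python
# Minimal sketch of dynamic map/reduce slot allocation in the spirit of DHSA
# (illustrative only; all numbers are hypothetical).
def allocate_slots(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Return (maps_to_launch, reduces_to_launch) for one scheduling round."""
    maps = min(pending_maps, map_slots)
    reduces = min(pending_reduces, reduce_slots)
    # Lend idle reduce slots to waiting map tasks, and vice versa.
    idle_reduce = reduce_slots - reduces
    idle_map = map_slots - maps
    maps += min(pending_maps - maps, idle_reduce)
    reduces += min(pending_reduces - reduces, idle_map)
    return maps, reduces

# Early in a job: many map tasks, no reduce tasks ready yet.
print(allocate_slots(map_slots=8, reduce_slots=4, pending_maps=20, pending_reduces=0))
# Late in a job: map phase finished, reduce tasks dominate.
print(allocate_slots(map_slots=8, reduce_slots=4, pending_maps=0, pending_reduces=10))
```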

C. Completion Time:
The reduction of the jobs overall completion time was considered in several studies with the aim of improving the SLA or reducing the power consumption in the underlying clusters. Numerical evaluations were utilized in [65] to test two power-efficient resource allocation approaches in a pool of MapReduce clusters. The algorithms developed aimed to reduce the end-to-end delay or the energy consumption while considering the availability as an SLA metric. In an effort to provide predictable services to deadline-constrained jobs, the work in [66] suggested a Resource and Deadline-aware Hadoop scheduler (RDS). RDS is based on online optimization and a self-learning completion time estimator for future tasks, which makes it suitable for dynamic Hadoop clusters with mixtures of energy sources and workloads. The work in [67] focused on reducing the completion time of small jobs that account for the majority of the jobs in production Hadoop clusters. The proposed scheduler, Fair4S, achieved an improvement by a factor of 7 compared to the fair scheduler, where 80% of small jobs waited less than 4 seconds to be served. In [68], a Bipartite Graph MapReduce scheduler (BGMRS) was proposed for deadline-constrained MapReduce jobs in clusters with heterogeneous nodes and dynamic job execution times. By optimizing scheduling and resources allocations, BGMRS reduced the deadline miss ratio by 79% and the completion time by 36% compared to the fair scheduler. The authors in [69] developed an Automatic Resource Inference and Allocation (ARIA) framework based on jobs profiling to reduce the completion time of MapReduce jobs in shared clusters. A Service Level Objective (SLO) scheduler was developed to utilize the predicted completion times and determine the schedule of resources allocation for tasks to meet soft deadlines. The study in [70] proposed a Dynamic Priority Multi-Queue Scheduler (DPMQS) to reduce the completion time of map tasks in heterogeneous environments. DPMQS increased both the data locality and the priority of map jobs that are near completion. Optimizing the scheduling of mixed MapReduce-like workloads was considered in [71] through offline and online algorithms that determine the order of tasks that minimizes the weighted sum of completion times. The authors in [72] also emphasized the role of optimizing jobs ordering and slots configurations in reducing the total completion time for offline jobs and proposed algorithms that can improve non-optimized Hadoop by up to 80%. The work in [73] considered optimizing four NoSQL databases (i.e. HBase, Cassandra, Hive, and HadoopDB) by reducing the Waiting Energy Consumption (WEC) caused by idle nodes waiting for job assignments, I/O operations, or results from other nodes. RoPE was proposed in [74] to reduce the response time of relational queries performed as cascaded MapReduce jobs in SCOPE, which is a parallel processing engine used by Microsoft. A profiler for code and data properties was used to improve future invocations of the same queries. RoPE achieved 2× improvements in the response time for 95% of Bing's production jobs while using 1.5× less resources. Job failures lead to a large increase in completion time. To improve the performance of Hadoop under failures, a modified MapReduce workflow with a fine-grained fault-tolerance mechanism called BEneath the Task Level (BeTL) was proposed in [75]. BeTL allows generating more files during the shuffling to create more checkpoints.
It improved the performance of Hadoop by 6.6% under no failures and by up to 51% under failures. The work in [76] proposed four multi-queue size-based scheduling policies to reduce the jobs slowdown variability, where the slowdown reflects the idle time spent waiting for resources or I/O operations. Several factors such as parameter sensitivity, load unbalance, heavy traffic, and fairness were considered. The work in [77] optimized the number of reduce tasks, their configurations, and memory allocations based on profiling the intermediate results size. The results indicated that job failures due to insufficient memory were avoided and that the completion time was reduced by up to 88.79% compared to legacy memory allocation approaches. To improve the performance of MapReduce in memory-constrained systems, Mammoth was proposed in [78] to provide global memory management. In Mammoth, related map and reduce tasks were launched in a single Java Virtual Machine (JVM) as threads that share the memory at run time. Mammoth actively pushes intermediate results to reducers, unlike Hadoop which passively pulls them from disks. A rule-based heuristic was used to prioritize memory allocations and revocations among map, shuffle, and reduce operations. Mammoth was found to be 5.19 times faster than Hadoop 1 and to outperform Spark for interactive and iterative jobs when the memory is insufficient [78]. An automatic skew mitigation approach, SkewTune, was proposed and optimized in [79]. SkewTune detects different types of skew and effectively re-partitions the unprocessed data of the stragglers to process it on idle nodes. The results indicated a reduction by a factor of 4 in the completion time for workloads with skew and minimal overhead for workloads without skew.
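Profile-driven deadline scheduling of the kind used by ARIA [69] rests on simple makespan bounds: a phase with n tasks of average duration avg and maximum duration max executed on s slots takes roughly between (n/s)·avg and ((n-1)/s)·avg + max. The sketch below uses such bounds to size a slot allocation for a soft deadline; it is a simplification of the ARIA model, and the profile numbers are hypothetical.

```python
# Minimal sketch of profile-driven slot sizing for a soft deadline,
# in the spirit of ARIA's makespan bounds (illustrative; numbers are hypothetical).
def phase_bounds(num_tasks, slots, avg, mx):
    """Classic lower/upper makespan bounds for greedy task assignment."""
    low = num_tasks / slots * avg
    up = (num_tasks - 1) / slots * avg + mx
    return low, up

def min_slots_for_deadline(profile, deadline):
    """Smallest equal map/reduce slot count whose upper-bound estimate meets the deadline."""
    for slots in range(1, 1000):
        _, t_map = phase_bounds(profile["maps"], slots, profile["map_avg"], profile["map_max"])
        _, t_red = phase_bounds(profile["reduces"], slots, profile["red_avg"], profile["red_max"])
        if t_map + profile["shuffle_avg"] + t_red <= deadline:
            return slots
    return None

job_profile = {"maps": 200, "map_avg": 20.0, "map_max": 45.0,
               "reduces": 40, "red_avg": 60.0, "red_max": 90.0,
               "shuffle_avg": 120.0}                     # seconds, hypothetical profile
print(min_slots_for_deadline(job_profile, deadline=600.0))
```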

TABLE I SUMMARY OF APPLICATIONS-FOCUSED OPTIMIZATIONS STUDIES Ref Objective Application Tools Benchmarks/workloads Experimental Setup/Simulation environment [36]* Optimize data placement Hadoop 1.x Dynamic Grep, WordCount 5 heterogeneous nodes with Intel's in heterogeneous clusters algorithm (Core 2 Duo, Celeron, Pentium 3) [37]* Reduce energy Hadoop 0.20.0 Modification to Webdata_sort, data_scan 36 nodes (8 CPU cores, 32GB RAM, Gigabit consumption by Hadoop from Gridmix benchmark NIC, two disks), 48-port HP ProCurve 2810- scaling down unutilized and defining (16-128 GB) 48G switch clusters covering subset [38]* Balance energy - locality-aware Geometric distribution Event-driven simulations consumption scheduler for tasks arrival rate and performance via Markov Decision Process [39]* Dependency-Aware Hadoop 1.2.1, Modification to 3.5 GB Graph from 4 nodes (Intel i7 3.4 GHz Quad, 16GB DDR3 Locality for MapReduce Giraph 1.0.0 HDFS replication Wikimedia database, RAM, 1 TB disk, 1 Gbps NIC, NETGEAR 8- (DALM) scheme and 2.1 GB of public social port switch scheduling networks data algorithm [40]* Optimizing reduce task Hadoop 1.x Classical stochastic Grep (10 GB), 7 slave nodes (4 cores 2.933 GHz CPU, locality for sequential sequential Sort (15 GB, 4.3 GB) 32KB cache, 6GB RAM, 72GB disk) MapReduce jobs assignment [41]* Optimizing data Hadoop 1.0.3, k-means clustering 920 GB business ad-hoc 10 nodes; 1 NameNode (6 2.6 GHz CPU cores, placements for query with Hive 0.10.0 for data queries from TPC-H 16GB RAM, 1TB SATA), 9 DataNodes (Intel operations placement, HDFS i5, 4GB RAM, 300GB disk), 1 Gbps Ethernet extension to support customized data placement [42]* HConfig: Configuring HBase 0.96.2, Algorithms to semi- YCSB Benchmark 13 nodes (1 manager, 3 coordinators, 9 data loading in HBase Hadoop 2.2.0 automate data workers), and 40 nodes (1 manager, 3 clusters loading and coordinators, 36 workers) with (AMD resource allocation Opeteron CPU, 8GB RAM, 2 SATA 1TB disks), 1 Gigabit Ethernet [43]* Impact of data locality Hadoop 2.3.0, Modify Rumen and 8GB of synthetic text for One node (4 CPU cores, 16GB RAM), 16 in YARN on completion Sqoop, Pig, YARN Grep, WordCount, 10GB nodes (2 CPU cores, 8GB RAM), 1 Gigabit time and I/O operations Hive Scheduler Load TPC-H queries Ethernet, YARN Location Simulator Simulator (SLS) (YLocSim) to report data locality [44]* DREAMS: dynamic YARN 2.4.0 Additional feature WC, Inverted Index, k- 21 Xen-based virtual machines (4 2GHz reduce tasks resources in YARN means, classification, CPU cores, 8GB RAM, 80GB disk) allocation DataJoin, Sort, Histo- movies (5GB, 27GB) [45]* JellyFish: Self tuning Hadoop 2.x Parameters tuning, PUMA benchmark 4 nodes (dual Intel 6 cores Xeon system based on YARN resources (TeraSort, WordCount, CPU, 15MB L3 cache, 7GB RAM, rescheduling Grep, Inverted Index) 320GB disk) Gigabit Ethernet via elastic containers [46]° Delay scheduling: Hadoop 0.20 Algorithms Facebook traces (text 100 nodes in Amazon EC2 (4 2 GHz core, 4 balance fairness and data implemented search, simple filtering disks, 15GB RAM, 1 Gbps links), 100 nodes locality in HDFS selection, aggregation, (8 CPU cores, 4 disks), 1 Gbps Ethernet join) [47]° Quincy: Fair Scheduling Dryad Graph-based Sort(40,160,320) GB, 243 nodes in 8 racks (16GB RAM, with locality and fairness algorithms Join (11.8, 41.8) GB, 2 2.6 GHz dual core AMD) 48-port PageRank(240) GB, Gigabit Ethernet per rack WordCount (0.1-5) GB [48]° Maestro: Replica- Hadoop 0.19.0, Heuristic GridMix (Sort, 20 virtual nodes in 
local cluster, 100 Grid5000 aware Scheduling Hadoop 0.21.0 WordCount) nodes (2 GHz dual core AMD, 2GB RAM, (2.5,1,12.5,200) GB 80GB disk) [49]° Map task scheduling to - Join Shortest Parameters from search Simulations for a 400 machines cluster balance data locality and Queue, Max Weight jobs in databases load balancing Scheduler in heavy traffic queues [50]° Resource-aware Hadoop 0.23 Scheduler and Gridmix Benchmark 22 nodes (64-bit 2.8 GHz Intel Xeon, Adaptive Scheduling job profiler (Sort,Combine,Select) 2GB RAM), Gigabit Ethernet (RAS) [51]° FLEX: flexible allocation Hadoop 0.20.0 Standalone plug-in 1.8 TB synthesized 26 nodes (3 GHz Intel Xeon, 13 4-core blades scheduling scheme or data, and GridMix2 in one rack and 13 8-core blades in second add-on module in rack FAIR [52]° HFSP: size- Pig, Job size estimator PigMix (1,10,100 20 workers with TaskTracker based scheduling Hadoop 1.x and algorithm GB and 1TB) (4 CPU cores, 8GB RAM) [53]° LsPS: self-tuning Hadoop 1.x Plug-in WordCount, Grep, EC2 m1.large (7.5GB RAM, 850GB disk), 11 two-tiers scheduler scheduler PiEstimator, Sort nodes (1 master, 10 slaves), trace-driven simulations [54]° Joint scheduling of Hadoop 1.2.0 3-approximation WordCount (43.7 GB 16 nodes EC2 VM (1 GHz CPU, MapReduce in servers algorithm, heuristic Wikipedia document 1.7GB memory, 160GB disk) package) [55]° Joint Scheduling of - Linear Synthesized workloads Event-based simulations processing and shuffle programming, heuristics [56]° Dynamic slot scheduling Hadoop 1.0.3 Dynamic algorithm Sort (3,6,9,12,15) GB 2 masters (12 CPU cores 1.9 GHz AMD, 32GB for I/O intensive jobs based RAM), 4,8,16 slaves (Intel Core i5 1.9 GHz, on I/O and CPU 8GB RAM) statistics [57]° Energy-efficient - Polynomial time Synthesized workloads MATLAB simulations Scheduling constant-factor approximation algorithm [58]° Energy efficiency for Hadoop 0.20.0 DVFS control in Sort, CloudBurst, 8 nodes (AMD Opteron quad-core 2380 with comp- source code Matrix Multiplication DVFS support, 2 64 kB L1 caches) Gigabit utation intensive and via external Ethernet workloads scheduler [59]° Energy-aware Hadoop 0.19.1 Scheduling TeraSort, Page Rank, 2 nodes (24GB RAM,16 2.4 GHz CPU cores, Scheduling algorithms k-means 1TB Disk), 2 nodes (16GB RAM,16 2.4 GHz EMRSA-I, CPU cores, 1TB Disk) EMRSA-II [60]° DynamicMR: slot Hadoop 1.2.1 Algorithms: PI- PUMA benchmark 10 nodes (Intel's X5675, 3.07 allocation in shared DHSA, PD-DHSA, GHz, 24GB RAM, 56GB disk) clusters SEPB, slot pre- scheduling [61]° PRISM: Fine-grained Hadoop 0.20.2 Phase-level Gridmix 2, 10 nodes; 1 master, 15 slaves (Quad-core Xeon resource aware scheduling PUMA benchmark E5606, 8GB RAM, 100GB disk), Gigabit scheduling algorithm Ethernet [62]° HaSTE: scheduling in Hadoop YARN dynamic WordCount (14,10.5) 8 nodes (8 CPU cores, 8GB RAM) YARN to reduce 2.2.0 programming GB, Terasort 15GB, makespan algorithm WordMean 10.5GB, PiEstimate [63]° SLA-aware energy - DVFS in the per- Sort, Matrix CloudSim-based simulations for 30 servers efficient scheduling in applications master Multiplications with network speeds between 100-200 Mbps Hadoop YARN [64]° Priority-based resource Quasit Meta-scheduler 40 million tuples of 5 nodes; 1 master, 4 workers (AMD scheduling for realistic 3 hours Athlon64 3800+, 2GB RAM) Distributed Stream vehicular traffic traces Processing Systems [65]† Power-efficient resources - Algorithms - Arena simulator allocation and mean end- to-end delay minimization [66]† Resource and Deadline- Hadoop 1.2.1 Receding horizon PUMA benchmark 
21 Virtual Machines, 1 master and aware scheduling in control algorithm, (WordCount, TeraSort, 20 slaves (1 CPU core, 2GB RAM) dynamic Hadoop clusters self-learning Grep) completion time estimator [67]† Fair4S: Modified Hadoop 0.19 Scheduler and Synthesized workloads Discrete event-based Fair Scheduler modi- generated by Ankus MapReduce simulator fication to JobTracker [68]† Deadline-Constrained Hadoop 1.2.1 Bipartite Graph WordCount, 20 virtual machines in 4 physical machines MapReduce Scheduling Modelling Sort, Grep (Quad core 3.3 GHz, 32 GB RAM, 1TB disk), to perform BGMRS Gigabit Ethernet. MATLAB simulations for scheduler 3500 nodes [69]† ARIA (Automatic Hadoop 0.20.2 Service level WordCount, Sort, 66 nodes (4 AMD CPU cores, 8GB RAM, 2 Resource Inference and Objective Bayesian classification, 160GB disks) in two racks, Gigabit Ethernet Allocation) scheduler to meet TF-IDF from Mahout, soft deadlines WikiTrends, twitter workload [70]† Scheduling for Hadoop 1.2.4 Dynamic Priority Text search, Sort, Cluster-1: 6 nodes (dual core CPU 3.2 GHz, improved response time multi-queue WordCount, Page Rank 2GB RAM, 250GB disk), Cluster-2: 10 nodes scheduler as a plug- (dual core CPU 3.2 GHz, 2GB RAM, 250GB in disk), Gigabit Ethernet [71]† Scheduling for fast - 3-approximation Synthesized workload Simulations completion in algorithms; MapReduce-like systems OFFA, Online heuristic ONA [72]† Dynamic Job Ordering Hadoop 1.x Greedy algorithms PUMA Benchmark 20 nodes EC2 Extra Large instances (4 and slot configuration based on (WordCount, Sort, CPU cores, 15GB RAM, 4 420GB disks) Johnson’s rule for Grep), synthesized 2-stage flow shop Facebook traces [73]† Reduction of idle HDFS 0.20.2, Energy Loading, Grep, Selection, 12 nodes ( Intel i5-2300, 8GB RAM, 1TB disk) energy consumption HBase 0.90.3, consumption Aggregation, join due to waiting in NoSQL Hive 0.71, model Cassandra 1.0.3, HadoopDB 0.1.1 [74]† RoPE: reduce response SCOPE Query optimizer 80 jobs from major Bing’s Production cluster (tens of thousands time of relational that business groups of 64 bit, multi-core, commodity servers) querying piggyback job execution [75]† BEneath the Task Level Hadoop 2.2.0 Algorithms to Hibench Benchmark 16 nodes in Windows Azure (1 CPU core, (BeTL) modify (WordCount, Hive 1.75GB RAM, 60GB disk), Gigabit Ethernet fine-grained checkpoint MapReduce queries) strategy workflow [76]† Job Slowdown Hadoop 1.0.0 Four algorithms; SWIM traces, 6 nodes (24 dual-quad core, 24GB RAM, Variability reduction FBQ, Grep, sort, WordCount 50TB storage), 1 Gigabit Ethernet and 20 TAGS, SITA, Gbit/s InfiBand, Mumak simulations COMP [77]† Memory-aware reduce Hadoop 1.2.1 Mnemonic PUMA benchmarks 1 node (4 CPU cores, 7GB RAM) tasks configurations mechanism (InvertedIndex, of Reduce tasks to determine Ranked InvertedIndex, number Self Join, Sequence Count, WordCount) [78]† SkewTune: mitigating Hadoop 0.21.1 Modification to job Inverted Index, 20 nodes cluster (2 GHz quad-core CPU, skew in MapReduce tracker and task PageRank, 16GB RAM, 2 750GB disks) applications tracker CloudBurst [79] † Mammoth: global Hadoop 1.0.1 Memory WordCount, Sort, 17 nodes (2 8 CPU cores 2.6 GHz Intel Xeon memory management management WordCount with E5-2670, 32GB memory, 300GB SAS disk) via the public pool Combiner *Jobs/data placement, °Jobs Scheduling, †Completion time.

D. Benchmarking Suites, Production Traces, Modelling, Profiling Techniques, and Simulators for Big Data Applications:
1) Benchmarking Suites: Understanding the complex characteristics of big data workloads is an essential step toward optimizing the configuration of the framework parameters used and identifying the sources of bottlenecks in the underlying clusters. As with many legacy applications, MapReduce and other big data frameworks are supported by several standard benchmarking suites such as [80]-[88]. These benchmarks have been widely utilized to evaluate the performance of big data applications in different infrastructures either experimentally, analytically, or via simulations, as summarized for the optimization studies in Tables I, III, V, and VI. Moreover, they can be used in production environments for initial tuning and debugging purposes, in addition to stress-testing and bottleneck analysis before the actual run of intended commercial services. The workloads contained in these benchmarks are typically defined by their semantics and run on previously collected or randomly generated datasets. Examples are text retrieval-based (e.g. Word-Count (WC), Word-Count with Combiner (WCC), and Sort) and web search-based (e.g. Grep, Inverted Index, and Page Rank) workloads. WC and WCC calculate the occurrences of each word in large distributed documents. WC and WCC differ in that the reduction is performed entirely at the reduce stage in WC, while it is done partially at the map stage with the aid of a combiner in WCC (a minimal sketch contrasting the two is given after Table II below). Sort generates alphabetically sorted output from input documents. Grep finds the matches of regular expressions (regex) in input files, while Inverted Index generates a word-to-document indexing for a list of documents. PageRank is a link analysis algorithm that measures the popularity of web pages based on their referral by other websites. In addition to the above examples, computations related to graphs and to machine learning such as k-means clustering are also considered in benchmarking big data applications. These workloads vary in being I/O, memory, or CPU intensive. For example, Grep and Sort are I/O intensive, while WC, PageRank, and k-means are CPU intensive [8]. This should be considered in optimization studies to correctly address bottlenecks and targeted resources. For a detailed review of benchmarking different specialized big data systems, the reader is referred to the survey in [317], where workloads generation techniques, input data generation, and assessment metrics are extensively reviewed. Here, we briefly describe some examples of the widely used big data benchmarks and their workloads as summarized in Table II and listed below:
1) GridMix [80]: GridMix is the standard benchmark included within the Hadoop distribution. GridMix has three versions and provides a mix of workloads synthesized from traces generated by Rumen [369] from Hadoop clusters.
2) HiBench1 [81]: HiBench is a comprehensive benchmark suite provided by Intel for big data. HiBench provides a wide range of workloads to evaluate computation speed, system throughput, and resource utilization.
3) HcBench [82]: HcBench is a Hadoop benchmark provided by Intel that includes workloads with a mix of CPU, storage, and network intensive jobs with Gamma inter-job arrival times for realistic cluster evaluations.
4) PUMA [83]: PUMA is a MapReduce benchmark developed at Purdue University that contains workloads with different computational and networking demands.
5) Hive Benchmark [84]: The Hive benchmark contains 4 queries and targets comparing Hadoop with Pig.
6) PigMix [85]: PigMix is a benchmark that contains 12 query types to test the latency and scalability performance of Apache Pig and MapReduce.
7) BigBench [86]: BigBench is an industrial-based benchmark that provides queries over structured, semi-structured, and unstructured data.
8) TPC-H [87]: The TPC-H benchmark, provided by the Transaction Processing Performance Council (TPC), allows generating realistic datasets and performing several business-oriented ad-hoc queries. Thus, it can be used to evaluate NoSQL and RDBMS scalability, processing power, and throughput.
9) YCSB2 [88]: The Yahoo Cloud Serving Benchmark (YCSB) targets testing the inserting, reading, updating, and scanning operations in database-like systems. YCSB contains 20 records data sets and provides a tool to create the workloads.

TABLE II
EXAMPLES OF BENCHMARKING SUITES FOR BIG DATA APPLICATIONS AND THEIR WORKLOADS
Benchmark | Application | Workloads
GridMix [80] | Hadoop | Synthetic loadjob, Synthetic sleepjob
HiBench [81] | Hadoop, SQL, Kafka, Spark Streaming | Micro Benchmarks (Sort, WordCount, TeraSort, Sleep, enhanced DFSIO to test HDFS throughput), Machine Learning (Bayesian Classification, k-means, Logistic Regression, Alternating Least Squares, Gradient Boosting Trees, Linear Regression, Latent Dirichlet Allocation, Principal Components Analysis, Random Forest, Support Vector Machine, Singular Value Decomposition), SQL (Scan, Join, Aggregate), Websearch Benchmarks (PageRank, Nutch indexing), Graph Benchmark (NWeight), Streaming Benchmarks (Identity, Repartition, Stateful WC, Fixwindow)
HcBench [82] | Hadoop, Hive, Mahout | Telco-CDR (Call Data Records) interactive queries, Hive workloads (PageRank URLs, aggregates by source, average of PageRank), k-means clustering iterative jobs in machine learning, and Terasort
PUMA [83] | Hadoop | Micro Benchmarks (Grep, WordCount, TeraSort), term vector, inverted index, self-join, adjacency-list, k-means, classification, histogram, histogram ratings, sequence count, ranked inverted index
Hive [84] | Hive | Grep selection, Ranking selection, user visits aggregation, user visits join
PigMix [85] | Pig | Explode, fr join, join, distinct agg, anti-join, large group by key, nested split, group all, order by 1 field, order by multiple fields, distinct + union, multi-store
BigBench [86] | RDBMS, NoSQL | Business queries (cross-selling, customer micro-segmentation, sentiment analysis, enhancing multi-channel customer experience, assortment and pricing optimization, performance transparency, return analysis, inventory management, price comparison)
TPC-H [87] | RDBMS, NoSQL | Ad-hoc queries for New Customer Web Service, Change Payment Method, Create Order, Shipping, Stock Management Process, Order Status, New Products Web Service Interaction, Product Detail, Change Item
YCSB [88] | Cassandra, HBase, Yahoo's PNUTS | Update heavy, Read heavy, Read only, Read latest, Short ranges
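To make the WC/WCC distinction described earlier concrete, the following sketch (plain Python rather than a Hadoop job) shows how a map-side combiner shrinks the intermediate data that would otherwise be shuffled to the reducers, while leaving the final counts unchanged.

```python
# Sketch of WordCount with and without a map-side combiner (illustrative only).
from collections import Counter
from itertools import chain

def map_wc(line):
    # WC: emit one (word, 1) pair per occurrence; all pairs are shuffled.
    return [(w, 1) for w in line.split()]

def map_wcc(line):
    # WCC: combine locally, so at most one pair per distinct word is shuffled.
    return list(Counter(line.split()).items())

def reduce_counts(pairs):
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data big clusters", "big data centers and networks"]
wc_pairs = list(chain.from_iterable(map_wc(l) for l in lines))
wcc_pairs = list(chain.from_iterable(map_wcc(l) for l in lines))
print("shuffled pairs:", len(wc_pairs), "vs", len(wcc_pairs))
print(reduce_counts(wc_pairs) == reduce_counts(wcc_pairs))   # same final result
```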

2) Production Traces:

1 Available at: https://github.com/intel-hadoop/hibench
2 Available at: https://github.com/brianfrankcooper/YCSB
As the above-mentioned benchmarks are defined by their semantics, where the code functionalities are known and the jobs are submitted by a single user to run on deterministic datasets, they might be incapable of fully representing production environment workloads where realistic mixtures of workloads with different data sizes and inter-arrival times coexist for multiple users. Information about such realistic workloads can be provided in the form of traces collected from previously submitted jobs in production clusters. Although sharing such information for production environments is hindered by confidentiality, legal, and business restrictions, several companies have provided archived job traces while normalizing the resources usage. Examples of realistic evaluations based on publicly available production traces were conducted by Yahoo [89], Facebook [90], [91], Cloudera and Facebook [92], Google [93], [94], IBM [95], and in clusters with scientific [96] or business-critical workloads [97]. These traces describe various job features such as the number of tasks, data characteristics (i.e. input, shuffle, and output data sizes and their ratios), and completion time, in addition to job inter-arrival times and resources usage, without revealing information about the semantics or users' sensitive data. Then, synthesized workloads can be generated by running dummy codes that artificially generate shuffle and output data sizes matching the trace on randomly generated data. To accurately capture the characteristics in the trace while running in testing clusters with a scale smaller than production clusters, effective sampling should be performed. Traces based on ten months of log files for data-intensive MapReduce jobs running in the M45 supercomputing production Hadoop cluster for Yahoo were characterized in [89] according to utilization, job patterns, and sources of failures. Facebook traces collected from 600 nodes over 6 months and Yahoo traces were utilized in [90] to provide comparisons and insights for MapReduce jobs in production clusters. Both traces were classified by k-means clustering according to the number of jobs, input, shuffle, and output data sizes, map and reduce tasks, and job durations into 10 and 8 bins for the Facebook and Yahoo traces, respectively. Both traces indicated input data sizes ranging between kBytes and TBytes. The workloads are labeled as small jobs, which constitute most of the jobs, load jobs with only map tasks, in addition to expand, aggregate, and transformation jobs, based on input, shuffle, and output data sizes. A framework that properly samples the traces to synthesize representative and scaled-down workloads for use in smaller clusters was proposed, and sleep requests in Hadoop were utilized to emulate the inter-arrivals of jobs. Facebook traces from 3000 machines with a total data size of 12 TBytes for MapReduce workloads with significant interactive analysis were utilized in [91] to evaluate the energy efficiency of allocating more jobs to idle nodes in interactive job clusters. Six Cloudera traces from e-commerce, telecommunications, media, and retail users' workloads and Facebook traces collected over a year for 2 million jobs were analyzed in [92]. Several insights for production jobs, such as the weekly time series and the burstiness of submissions, were provided.
These traces are available in a public repository which also contains a synthesizing tool, SWIM3, which is integrated with Hadoop. Although the previously-mentioned traces contain rich information about various job characteristics, the lack of per-job resources utilization information makes them only partially representative for estimating workload resource demands. Google workload traces were characterized in [93] and [94] using k-means clustering based on their duration and per-task CPU and memory usage to aid in capacity planning, forecasting demand growth, and improving task scheduling. Insights such as "most tasks are short", "the duration of long and short tasks follows a bimodal distribution", and "most of the resources are taken by a few intensive jobs" were provided. The traces4 were collected from a 12k-machine cluster in 2011 and include information about scheduling requests, taken actions, task submission times and normalized resources usage, and machine availability [94]. However, disk and networking resources were not covered. The traces collected from IBM-based private clouds for banking, communication, e-business, production, and telecommunication industries in [95] further considered disk and file system usage in addition to CPU and memory. The inter-dependencies between the resources were measured, and disk and memory resources were found to be negatively correlated, indicating the potential benefit of co-locating memory-intensive and disk-intensive tasks. A user-centric study was conducted in [96] based on traces5 collected from three research clusters; OPENCLOUD, M45, and WEB MINING. Workloads, configurations, in addition to resources usage and sharing information were used to address the gaps between data scientists' needs and systems design. Evaluations of preferred applications and the penalties of using default parameters were provided.
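Trace-driven workload synthesis of the kind performed by the SWIM-style tooling above, i.e. sampling job sizes from a trace and emulating inter-arrival times, can be sketched as follows; the trace entries, scaling factor, and arrival rate are hypothetical placeholders rather than any published trace format.

```python
# Minimal sketch of synthesizing a scaled-down workload from a job trace
# (illustrative; trace entries and the arrival rate are hypothetical).
import random

trace = [  # (input_GB, shuffle_GB, output_GB) per job, as reported in a trace
    (0.001, 0.0005, 0.0001), (0.2, 0.05, 0.01), (150.0, 40.0, 5.0), (2.0, 0.7, 0.1),
]

def synthesize(num_jobs, scale=0.1, mean_interarrival=30.0, seed=1):
    """Sample jobs from the trace, scale data sizes down for a small test
    cluster, and draw exponential (Poisson-process) inter-arrival times."""
    rng = random.Random(seed)
    t = 0.0
    workload = []
    for _ in range(num_jobs):
        t += rng.expovariate(1.0 / mean_interarrival)
        inp, shuf, out = rng.choice(trace)
        workload.append({"submit_s": round(t, 1),
                         "input_GB": inp * scale,
                         "shuffle_GB": shuf * scale,
                         "output_GB": out * scale})
    return workload

for job in synthesize(5):
    print(job)
```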

3 Available at: https://github.com/SWIMprojectucb/swim/wiki
4 Available at: https://github.com/google/cluster-data
5 Available at: http://www.pdl.cmu.edu/HLA/
Two large-scale and long-term traces6 collected from distributed data centers for business-critical workloads, such as financial simulators, were utilized in [97] to provide basic statistics, correlations, and time-pattern analysis. Full characteristics for the provisioned and actual usage of CPU, memory, disk I/O, and network I/O throughput resources were presented. However, no information about the inter-arrival times was provided. The Ankus MapReduce workload synthesizer was developed as part of the study in [67] based on e-commerce traces collected from Taobao, which is a 2k-node production Hadoop cluster. The inter-arrival times in the traces were found to be Poisson.
3) Modelling and Profiling Techniques: Statistical-based characterization and modelling were considered for big data workloads based on production cluster traces as in [98], or benchmarks as in [99] and [101]. Also, different profiling and workload modelling studies such as [102]-[105] were conducted with the aim of automating cluster configurations or estimating different performance metrics, such as the completion time, based on the selected configurations and resources availability. A statistical-driven workloads generator was developed in [98] to evaluate the energy efficiency of MapReduce clusters under different scales, configurations, and scheduling policies. A framework to ease publishing production traces anonymously based on inter-arrival times, job mixes, resources usage, idle time, and data sizes was also proposed. The framework provides non-parametric statistics such as the averages and standard deviations and five-number percentile summaries (1st, 25th, 50th, 75th, and 99th) for inter-arrival times and data sizes. Statistical modelling for GridMix, Hive, and HiBench workloads based on principal component analysis for 45 metrics and regression models was performed in [99] to provide performance estimations for Hadoop clusters under different workloads and configurations. BigDataBench7 was utilized in [100] to examine the performance of 11 representative big data workloads in modern superscalar out-of-order processors. Correlation analysis was performed to identify the key factors that affect the Cycles Per Instruction (CPI) count for each workload. Keddah8 in [101] was proposed as a toolchain to profile, empirically characterize, and reproduce Hadoop traffic based on end-host or switch traffic measurements. Flow-level traffic models can be derived by Keddah for use with network simulators with varied settings that affect networking requirements such as the replication factor, cluster size, split size, and number of reducers. Results based on TeraSort, PageRank, and k-means workloads in dedicated and cloud-based clusters indicated high correlation with Keddah-based simulation results. To automate Hadoop cluster configurations, the authors in [102], [103] developed a cost-based optimization that utilizes an online profiler and a What-if Engine. The profiler uses the BTrace Java-based dynamic instrumentation tool to collect job profiles at run time for the data flows (i.e. input data and shuffle volumes) and to calculate the cost based on the program, input data, resources, and configuration parameters at task granularity.
The What-if Engine contains a cost-based optimizer that utilizes a task scheduler simulator and model-based optimization to estimate the costs if different combinations of cost variables are used. This suits just-in-time configuration of the computations that run periodically in production clusters, where the profiler can trial the execution on a small subset of the cluster to obtain the cost function, and the What-if Engine then obtains the best configurations to be used for the rest of the workload. In [104], a Hadoop performance model is introduced. It estimates the completion time and resources usage based on Locally Weighted Linear Regression and Lagrange Multipliers, respectively. The non-overlapped and overlapped phases of shuffling and the number of reduce waves were carefully considered. Polynomial regression is used in [105] to estimate the CPU usage in clock cycles for MapReduce jobs based on the number of map and reduce tasks.
4) Simulators: Tuning big data applications in clusters requires time-consuming and error-prone evaluations for a wide range of configurations and parameters. Moreover, the availability of a dedicated cluster or an experimental setup is not always guaranteed due to their high deployment costs. To ease tuning the parameters and to study the behaviour of big data applications in different environments, several simulation tools were proposed [106]-[119], and [148]. These simulators differ in their engines, scalability, flexibility with parameters, level of detail, and the support for additional features such as multiple disks and data skew. Mumak [106] is an Apache Hadoop simulator included in its distribution to simulate the behaviour of large Hadoop clusters by replaying previously generated traces. A built-in tool, Rumen [369], is included in Hadoop to generate these traces by extracting previous jobs' information from their log files.

6 Available at: http://gwa.ewi.tudelft.nl/datasets/Bitbrains
7 Available at: http://prof.ict.ac.cn/BigDataBench
8 Available at: https://github.com/deng113jie/keddah
Rumen collects more than 40 properties of the tasks. In addition, it provides the topology information to Mumak. However, Mumak simplifies the simulations by assuming that the reduce phase starts only after the map phase finishes; thus, it does not provide accurate modelling for shuffling and provides only rough completion time estimations. SimMR is proposed in [107] as a MapReduce simulator that focuses on modelling different resource allocation and scheduling approaches. SimMR is capable of replaying real workload traces as well as synthetic traces based on the statistical properties of the workloads. It relies on a discrete event-based simulator engine that accurately emulates Job Tracker decisions in Hadoop for map/reduce slot allocation, and a pluggable scheduling policy engine that simulates decisions based on the available resources. SimMR was tested on a 66-node cluster and was found to be more accurate and two orders of magnitude faster than Mumak. It simplifies the node modelling by assuming several cores but only one disk. MRSim9 is a discrete event-based MapReduce simulator that relies on SimJava and GridSim to test workload behaviour in terms of completion time and utilization [108]. The user provides cluster topology information and job specifications such as the number of map and reduce tasks, the data layout (i.e. locations and replication factor), and algorithm descriptions (i.e. number of CPU instructions per record and average record size) for the simulations. MRSim focuses on modelling multi-cores, a single disk, and network traffic, in addition to memory, buffers, merge, parallel copy, and sort parameters. The results of MRSim were validated on a single-rack cluster with four nodes. The authors in [109] proposed MR-cloudsim as a simulation tool for MapReduce in cloud computing data centers. MR-cloudsim is based on the widely used open-source cloud systems event-driven simulator, CloudSim [110]. In MR-cloudsim, several simplifications, such as assigning one reduce per map and allowing the reduce phase to start only after the map phase finishes, are assumed. To assist with MapReduce cluster design and testing, MRPerf10 is proposed in [111] to provide fine-grained simulations while focusing on modelling the activities inside the nodes, the disk and data parameters, and the inter- and intra-rack network configurations. MRPerf is based on Network Simulator-2 (ns-2)11, which is a packet-level simulator, and DiskSim12, which is an advanced disk simulator. A MapReduce heuristic is used to model Hadoop behaviour and perform scheduling. In MRPerf, the user provides the topology information, the data layout, and the job specifications, and obtains a detailed phase-level trace that contains information about the completion time and the volume of transferred data. However, MRPerf needs several minutes per evaluation and has the limitation of modelling only one replica and a single disk per node. Also, it does not enable speculative execution modelling and simplifies I/O and computation processes by not overlapping them. MRemu13 in [112] is an emulation-based framework based on Mininet 2.0 for MapReduce performance evaluations in terms of completion time in different data center networks. The user can emulate arbitrary network topologies and assign bandwidth, packet loss ratio, and latency parameters.
Moreover, network control based on SDN can be emulated. A layered simulation architecture, CSMethod, is proposed in [113] to map the interactions between different software and hardware entities at the cluster level with big data applications. Detailed models for JVMs, the name nodes, data nodes, JT, TT, and scheduling were developed. Furthermore, diverse hardware choices such as components, specifications, and topologies were considered. Similar approaches were also used in [114] and [115] to simulate NoSQL and Hive applications, respectively. As part of their study, the authors in [136] developed a network flow-level discrete-event simulator named PurSim to aid with simulating MapReduce executions in up to 200 virtual machines. A few recent articles considered the modelling and simulation of YARN environments [116]-[119]. The Yarn Scheduler Load Simulator (SLS) is included in the Hadoop distribution to be used with Rumen to evaluate the performance of different YARN scheduling algorithms with different workloads [116]. SLS utilizes a single JVM to exercise a real RM with thread-based simulators for the NM and AM. However, SLS ignores simulating the network effects as the NM and AM simulators interact with the RM only via heartbeat events. The authors in [117] suggested an extension of SLS with an SDN-based network emulator, MaxiNet, and a data center traffic generator, DCT2, to add realistic modelling for network and traffic in YARN environments and to emulate the interactions between job scheduling and flow scheduling. The Real-Time ABS language is utilized in [118] to develop the ABS-YARN simulator that focuses on prototyping YARN and modelling job executions.

9 Available at: http://code.google.com/p/mrsim
10 Available at: https://github.com/guanying/mrperf
11 Available at: http://www.isi.edu/nsnam/ns
12 Available at: http://www.pdl.cmu.edu/DiskSim/
13 Available at: https://github.com/mvneves/mremu
YARNsim in [119] is a parallel discrete-event simulator for YARN that provides comprehensive protocol-level accuracy simulations for task executions and data flow. Detailed modelling of networking, HDFS, data skew, and I/O read and write latencies is provided. However, YARNsim simplifies scheduling policies and fault tolerance. A comprehensive set of Hadoop benchmarks, in addition to bioinformatics clustering applications, were utilized for experimental validation and an average error of 10% was achieved.
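At their core, trace-replaying simulators such as SimMR and MRSim combine a slot pool, task durations taken from a trace or profile, and an event loop that assigns waiting tasks to free slots. The toy sketch below captures only that core (it deliberately ignores shuffling, networking, data skew, and failures, which the surveyed simulators model in far more detail, and assumes at least one slot of each type).

```python
# Toy discrete-event simulator for a single MapReduce job (illustrative only:
# no shuffle, network, or failure modelling, unlike the simulators surveyed).
import heapq

def simulate(map_durations, reduce_durations, map_slots, reduce_slots):
    clock, events = 0.0, []          # events: (finish_time, slot_type)
    pending = [("map", d) for d in map_durations]
    reduces = [("reduce", d) for d in reduce_durations]
    free = {"map": map_slots, "reduce": reduce_slots}
    maps_left = len(map_durations)

    while pending or events:
        # Launch as many waiting tasks as there are free slots.
        still_waiting = []
        for kind, dur in pending:
            if free[kind] > 0:
                free[kind] -= 1
                heapq.heappush(events, (clock + dur, kind))
            else:
                still_waiting.append((kind, dur))
        pending = still_waiting
        # Advance to the next task completion.
        clock, kind = heapq.heappop(events)
        free[kind] += 1
        if kind == "map":
            maps_left -= 1
            if maps_left == 0:       # reduce phase starts after the map phase
                pending.extend(reduces)
    return clock

print(simulate(map_durations=[10] * 12, reduce_durations=[30] * 4,
               map_slots=4, reduce_slots=2))   # -> 90.0 for this toy job
```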

IV. CLOUD COMPUTING SERVICES, DEPLOYMENT MODELS, AND RELATED TECHNOLOGIES:
Cloud computing aims to enable seamless access for multiple users or tenants to a pool of computational, storage, and networking resources that typically reside within and between several geographically distributed data centers. Unlike traditional IT services that are limited by localized resources inaccessible by remote computing units, cloud computing services allow dynamic outsourcing to software and/or hardware resources. Hence, they can provide scalable and large-scale computational solutions while increasing the resources utilization. Moreover, cloud computing considerably reduces both the capital expenditure (CAPEX) and operational expenses (OPEX) of software and hardware and increases the resilience. Consequently, it is continuing to encourage wide deployments of large-scale Internet-based services by various organizations and enterprises [370]. Although MapReduce and many other big data frameworks were originally provisioned for use in local clusters under controlled environments, there is an increasing number of cloud computing-based big data application realizations and services despite the incurred overheads and challenges. This Section provides a brief overview of cloud computing concepts and discusses the challenges and implications of deploying big data applications in cloud computing environments. Moreover, it reviews some related emerging technologies that support computing and networking resource provisioning in cloud computing infrastructures and can hence improve the performance of big data applications. These related technologies aim mainly to virtualize and softwarize networking and computing systems at different levels to provide agile and flexible future-proof solutions. The rest of this Section is organized as follows: Subsection IV-A reviews cloud computing services and deployment models while Subsection IV-B reviews machine and network virtualization, Network Function Virtualization (NFV), containers, and Software-Defined Networking (SDN) technologies. Subsection IV-C discusses the requirements and the challenges of deploying big data applications in cloud computing infrastructures, while Subsection IV-D illustrates the options for deploying big data applications in geo-distributed clouds. Finally, Subsection IV-E presents the implications of big data on cloud networking infrastructures.
A. Cloud Computing Services and Deployment Models:
The ability to share hardware, software, development platforms, network, or applications resources between multiple users enables cloud computing systems to provide what can be described as anything-as-a-service (XaaS). Cloud computing services can be categorized according to the outsourced resources and end-users' privileges into Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS), and Infrastructure-as-a-Service (IaaS) [371]. SaaS provides on-demand Internet-based services and applications to end-users without providing the privilege of controlling or accessing the hardware, network, or development platform resources. Examples of SaaS are Salesforce, Google Apps, and Microsoft Office 365. In the PaaS model, end-users have access privileges to the platform which enables them to develop, control, and upgrade their own cloud applications, but not to the underlying hardware. Thus, more flexibility is provided without the need for owning, operating, and maintaining the hardware.
Microsoft Azure, AWS Elastic Beanstalk, and Google App Engine are examples of PaaS provided as a pay-as-you-go service. The IaaS model provides end-users with extra provisioning privileges that allow them to fully control the hardware, the operating systems, and the applications development platforms [372]. Examples of IaaS are Amazon Elastic Compute Cloud (EC2), Rackspace, and Google Compute Engine. Based on their accessibility and ownership, cloud infrastructures have four deployment models which are public, private, hybrid, and community clouds [313]. Providers of public clouds allow users to rent cloud resources so they can accelerate the launching of their businesses, services, and applications. The advantages include the elimination of CAPEX and OPEX costs, and the reduction of risks especially for start-up companies. The disadvantage is however the lack of fine-grained control over data storage, network, or security settings. In the private cloud model, the infrastructure is exclusive for use by the owners. Although the costs are high, the full control of the infrastructure typically leads to guaranteed performance, reliability, and security metrics. In the hybrid clouds model, a private cloud may offload or burst some of its workload to a public cloud due to exceeding its capacity, for costs reduction, or to achieve other performance metrics [121], [142]. Hybrid clouds combine the benefits of both private and public clouds but at the expense of additional controlling overhead. Finally, community clouds are owned by several organizations to serve customers with shared and specific interests and needs [313].
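The bursting behaviour of the hybrid model described above can be illustrated with a small sketch (the capacities and the public-cloud price are hypothetical): requests are kept in the private cloud while capacity remains, and the overflow is sent to the public cloud.

```python
# Minimal sketch of a hybrid-cloud bursting decision (illustrative only;
# capacities and the cost figure are hypothetical).
def place_requests(requests_cores, private_capacity, public_cost_per_core_hour=0.05):
    private_used, public_cores = 0, 0
    for cores in requests_cores:
        if private_used + cores <= private_capacity:
            private_used += cores          # keep the workload in the private cloud
        else:
            public_cores += cores          # burst the overflow to the public cloud
    return {"private_used": private_used,
            "public_cores": public_cores,
            "hourly_public_cost": public_cores * public_cost_per_core_hour}

print(place_requests([16, 32, 8, 64, 24], private_capacity=96))
```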

B. Cloud Computing and Big Data Related Technologies: 1) Machine Virtualization: Machine virtualization abstracts the CPU, memory, network, and disk resources and provides isolated interfaces to allow several Virtual Machines (VMs), also called instances, to coexist in the same physical machine while having different Operating Systems (OS) [373]. Cloud computing relies heavily on machine virtualization as it allows dynamic resources sharing among different workloads, applications, or tenants while isolating them [374]. Moreover, VMs are widely considered in the billing structures of pay-as-you-go cloud services as they can be leased with different prices based on pre-defined sets of resources and configurations. For example, IaaS is typically provided in the form of on-demand VMs such as on EC2 [375] and Google Compute engine [376]. Examples of EC2 offerings are the general purpose t2 VMs with nano, micro, small, medium, large, xlarge, and 2xlarge options, in addition to compute, memory, or storage optimized offers [375]. Google Compute engine offers standard machines (e.g. n1-standard-64), high memory, and high CPU machines [376]. Figure 7(a) illustrates the components of a physical machine with VMs.

Fig. 7. Physical machines with (a) VMs and (b) containers.

Virtualization increases the utilization and energy efficiency of cloud infrastructures as, with proper management, VMs can be efficiently assigned and seamlessly migrated while running, and can thus be consolidated into fewer physical machines. Then, more machines can be switched to low-power or sleep modes. Managing and monitoring VMs in cloud data centers have been realized through several management platforms (i.e. hypervisors) such as Xen [377], KVM [378], and VMware [379]. Also, virtual infrastructure management tools such as OpenNebula and VMware vSphere were introduced to support the management of virtualized resources in heterogeneous environments such as hybrid clouds [374]. However, compared to "bare-metal" implementations (i.e. native use of physical machines), virtualization can add performance overheads related to the additional software layer, memory management requirements, and the virtualization of I/O and networking interfaces [380], [381]. Moreover, additional networking overheads are encountered when migrating VMs between several clouds for load balancing and power saving purposes through commercial or dedicated inter-data center networks [382]. These migrations, if done concurrently by different service providers, can lead to increased network congestion, exceeded delay limits, and hence service performance fluctuations and violations of SLAs [383]. In the context of VMs, several studies considered optimizing their placements and resource allocation to improve big data applications performance. Also, overcoming the overheads of using VMs with big data applications is considered, as detailed in Subsection V-B.
2) Network Virtualization: To complement the benefits of the mature VM technologies used in cloud computing systems, virtualizing networking resources has also been considered [384], [385]. Network virtualization enables efficient use of cloud networking resources by allowing multiple heterogeneous Virtual Networks (VNs), also known as slices, composed of virtual links and nodes (i.e. switches and routers), to coexist on the same physical network (substrate network). As with VMs, VNs can be created, updated, migrated, and deleted upon need and hence allow customizable and scalable on-demand allocations. The assignment and mapping of the VNs onto the physical network resources can be performed through offline or online Virtual Network Embedding (VNE) algorithms [386]. VNE can target different goals such as maximizing revenue and resilience through effective link remapping and virtual node migrations, in addition to energy efficiency by consolidating over-provisioned resources [387], [388]. Figure 8 illustrates the concept of VNs and VNE where three different VNs share the physical network. Several challenges can be encountered with network virtualization due to the large scale and the heterogeneous and autonomous nature of cloud networking infrastructures [384]. Also, different infrastructure providers (InPs) can have conflicting goals and non-unified QoS measures for their network services.
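The VM consolidation idea mentioned above is often illustrated with bin-packing heuristics such as first-fit decreasing; the sketch below is such a toy (the VM demands are hypothetical, and production placement additionally weighs migration cost, SLAs, and network load).

```python
# Minimal sketch of VM consolidation via first-fit decreasing bin packing
# (illustrative; real placement also considers migration cost, SLAs, and traffic).
def consolidate(vm_demands, host_capacity):
    hosts = []                                    # each host is a list of VM demands
    for demand in sorted(vm_demands, reverse=True):
        for host in hosts:
            if sum(host) + demand <= host_capacity:
                host.append(demand)               # fits on an already-active host
                break
        else:
            hosts.append([demand])                # otherwise power on a new host
    return hosts

vms = [0.5, 0.2, 0.7, 0.3, 0.4, 0.1, 0.6]         # normalized CPU demands (hypothetical)
active_hosts = consolidate(vms, host_capacity=1.0)
print(len(active_hosts), "active hosts:", active_hosts)
```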

Fig. 8. The concept of virtual networks and VNE.

3) Network Function Virtualization (NFV): Network Function Virtualization (NFV) is the detachment of networking functions (NFs) from their proprietary equipment (i.e. middleboxes and appliances) [389]-[393]. Various NFs such as routing, load balancing, firewalls, multimedia caching, QoS monitoring, gateways, proxies, Deep Packet Inspection (DPI), Dynamic Host Configuration Protocol (DHCP), and Network Address Translation (NAT), in addition to mobile networking and telecom operator services such as BaseBand Units (BBUs) and Mobility Management Entities (MMEs), can then be provided as software instances virtualized in data centers, servers, or high capacity edge networking devices [394], [395]. The state-of-the-art implementations of such functions in proprietary hardware are characterized by relatively high OPEX and CAPEX, and slow per-device upgrades and development cycles. NFV allows rapid development of each NF independently, which supports the release and upgrade of new and existing networking and telecom services in a cost-effective, scalable, and agile manner. Such virtualized NF instances can be created, replicated, terminated, and migrated as needed to elastically improve the handling of the evolving functions and the increasing big data traffic to be acquired, transferred, cached, or processed by those functions. Moreover, NFV can increase the energy efficiency, for example by consolidating and scaling the resource usage of BBUs in cloud environments and virtualized MME pools in base stations [396]-[399]. A key effort to standardize NFV was initiated in 2012 by the European Telecommunications Standards Institute (ETSI), which was selected by seven leading telecom operators to provide the requirements, deployment recommendations, and unified terminologies for NFV. According to ETSI, the components of a Virtual Network Function (VNF) architecture are illustrated in Figure 9 and are listed below [393], [394]:
• The networking services which are delivered as VNFs.
• The NFV Infrastructure (NFVI) composed of the software and hardware to virtualize the computing and networking infrastructure required to realize and host the VNFs.
• The Management and Orchestration (MANO) which manages the life-cycle of VNFs and their resources usage.
NFV aided by cloud computing infrastructures is considered a key enabler for future 5G networks that target innovation, agility, and programmability in networking services for multi-tenant usage with heterogeneous requirements and demands [393], [400]. To achieve this, telecom operators are required to transform their infrastructure design and management, for example by integrating VNFs with Cloud Radio Access Networks (C-RANs) in the front-haul network (e.g. virtualized BBUs that support multiple Remote Radio Heads (RRHs) and MMEs) [401], [402], and with virtualized Evolved Packet Core networks (vEPC) in the back-haul network [390], [391]. Moreover, and to provide isolated and unified end-to-end network slices in 5G, wireless and spectrum resources in access [403] and wireless networks [404], [405] are also virtualized and integrated with cloud infrastructures and VNFs. Among the challenges with NFV deployments are the need for the coexistence of NFV-enabled and legacy equipment, and the sensitivity of VNFs to the underlying NFVI specifications [406].
Further challenges are encountered in the optimized allocation of NFVI resources to VNFs while keeping the cost and energy consumption low, and in the composition of end-to-end chained services where traffic has to pass through an ordered set of VNFs to deliver the required functionality [394].

Fig. 9. The concept of NFV and the ETSI-NFV architecture [393].
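The service chaining challenge above can be made concrete with a minimal, hypothetical sketch (not an ETSI MANO implementation): an ordered chain of VNFs is greedily assigned to NFVI nodes with sufficient spare CPU. All function names, demands, and capacities below are invented for illustration only.

```python
# Minimal greedy placement of an ordered VNF chain onto NFVI nodes.
# All names, CPU demands, and capacities are hypothetical examples.
def place_chain(chain, node_capacity):
    """Assign each (vnf, cpu_demand) in order to the first node with spare CPU."""
    placement, remaining = {}, dict(node_capacity)
    for vnf, cpu in chain:
        for node, free in remaining.items():
            if free >= cpu:
                placement[vnf] = node
                remaining[node] -= cpu
                break
        else:
            raise RuntimeError(f"no NFVI capacity left for {vnf}")
    return placement

# Example chain: firewall -> DPI -> NAT, placed over three NFVI nodes.
chain = [("firewall", 2), ("dpi", 4), ("nat", 1)]
print(place_chain(chain, {"node-a": 4, "node-b": 4, "node-c": 4}))
```

A production orchestrator would additionally route the inter-VNF links, enforce the chain order across them, and account for cost and energy objectives, all of which this sketch deliberately omits.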

4) Container-based Virtualization: Container-based virtualization is a recently introduced technology for cloud computing infrastructures proposed to reduce the overheads of VMs, as containers can help provide up to five times better performance [407]. While hypervisors perform the isolation at the hardware level and require an independent guest OS for each VM, containers perform the isolation at the OS level and can thus be regarded as a lightweight alternative to VMs [408]. In each physical machine, the containers share the OS kernel and provide isolation by having different user spaces, for example by using Linux kernel containment (LKC) user-space interfaces. Figure 7 compares the components of physical machines when hosting VMs (Figure 7(a)) and containers (Figure 7(b)). In addition to sharing the OS, containers can also share the binary and library files of the applications running on them. With these features, containers can be deployed, migrated, restarted, and terminated faster than VMs and can be deployed in larger numbers on a single machine. However, due to sharing the OS, containers can be less secure than VMs. To improve their security, containers can be deployed inside VMs and share the guest OS [409], at the cost of reduced performance. Some examples of Linux-based container engines are Linux-VServer, OpenVZ, and Linux Containers (LXC). The performance isolation, ease of resource management, and overheads of these systems for MapReduce clusters have been addressed in [410], and near bare-metal performance was reported. Docker [407] is an open-source and widely-used container manager that extends LKC with kernel and application APIs within the containers [411]. Docker has been used to launch the containers within YARN (e.g. in Hadoop 2.7.2) to provide better software environment assembly and consistency, and elasticity in assigning resources to different components within YARN (e.g. map or reduce containers) [412]. Besides YARN, other cloud resource management platforms, such as Mesos and Quasar, have successfully adopted containers for resource management [160]. Similar to YARN, V-Hadoop was proposed in [413] to enable the use of Linux containers for Hadoop. V-Hadoop scales the number of containers used according to resource usage and availability in cloud environments.

5) Software-defined Networking (SDN): Software-defined networking (SDN) is an evolving networking paradigm that separates the control plane, which generates the rules for where and how data packets are forwarded, from the data plane, which handles the packets received at the device according to agreed rules [414]. Legacy networks, as depicted in Figure 10(a), contain networking devices with vendor-specific integrated data and control planes that are hard to update and scale, while software-defined networks introduce a broader level of programmability and flexibility in networking operations and provide a unified view of the entire network. SDN architectures can have centralized or semi-centralized controllers distributed across the network [415] to monitor and operate networking devices such as switches and routers while treating them as simple forwarding elements [416], as illustrated in Figure 10(b). Software-defined networks have three main layers, namely the infrastructure, control, and application layers.
The infrastructure layer, which is composed of SDN-enabled devices, interacts with the control layer through a Southbound Interface (SBI), while the application layer connects to the control layer through a Northbound Interface (NBI) [415], [417]. Several protocols for the SBI, such as OpenFlow, were developed as an effort to standardize and realize SDN architectures [418]-[420]. OpenFlow performs the switching at the flow granularity (i.e. a group of sequenced packets with a common set of header fields), where each forwarding element contains a flow table that receives rule updates dynamically from the SDN controllers. Examples of available centralized OpenFlow controllers are NOX, POX, Trema, Ryu, Floodlight, Beacon, and Maestro, while examples of distributed controllers are Onix, ONOS, and HyperFlow [420]-[422]. Software-based OpenFlow switches such as Open vSwitch (OVS) have also been introduced to enable SDN in virtualized environments and to ease the interactions between hypervisors, container engines, various application elements, and software-based SDN controllers while connecting several physical machines [423]. An additional tool that aids SDN control at a finer granularity is the Programming Protocol-independent Packet Processors (P4) high-level language, which enables arbitrary modifications to packets in forwarding elements at line rate [424]. The flexibility and agility benefits that result from adopting SDN in large-scale networks have been experimentally validated in several testbeds [425] and in commercial Wide Area Networks (WANs) that interconnect geo-distributed cloud data centers, such as Google's B4 WAN [426] and Microsoft's Software-driven WAN (SWAN) [427]. SDN in WANs supports various Traffic Engineering (TE) operations and improves congestion management, load balancing, and fault tolerance. Also, SDN relaxes the over-provisioning requirements of traditional WANs, which are typically kept at 30-60% utilization to preserve resilience and performance [421], [422]. The programmability of SDN enables fine-grained integration of adaptive routing and flow aggregation protocols with additional resource allocation and energy-aware algorithms across different network layers [428]. This allows dynamic configuration and ease of implementation for networking applications with scalability requirements and dynamic traffic. These concepts have found widespread use in cloud computing services and big data applications [429], [430], NFV [391]-[393], and wireless networks [405], in addition to intra data center networking [421], [431], as will be discussed in Subsection VI-E.

Fig. 10. Architectures of (a) legacy and (b) SDN-enabled networks.
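The SBI interaction can be made concrete with a minimal sketch for the Ryu controller mentioned above (one of several possible controllers). The application below simply installs a low-priority table-miss rule that floods packets, only to illustrate how a controller pushes OpenFlow flow-table entries to forwarding elements; it is run with ryu-manager and is not a production controller application.

```python
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class MinimalHub(app_manager.RyuApp):
    """Install a single table-miss rule that floods packets on each switch."""
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def on_switch_features(self, ev):
        dp = ev.msg.datapath
        parser, ofp = dp.ofproto_parser, dp.ofproto
        match = parser.OFPMatch()                        # match every packet
        actions = [parser.OFPActionOutput(ofp.OFPP_FLOOD)]
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
        # Push the flow-table entry to the forwarding element over the SBI.
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=0,
                                      match=match, instructions=inst))
```

Real controller applications would instead compute per-flow rules (e.g. learning-switch forwarding or TE policies) and update them dynamically as traffic changes.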

6) Advances in Optical Networks: Transport networks, as depicted in Figure 1, are composed of a core WAN, typically connected as a mesh to link cities in a country or across continents, Metropolitan Area Networks (MANs) connected in ring or star topologies, and access networks mostly based on Passive Optical Network (PON) technologies and topologies [432]. Core networks use fiber optic communication technologies due to their high capacity, reach, and energy efficiency. High-end Internet Protocol (IP) routers in core networks are widely integrated with optical switches to form IP over Wavelength-Division-Multiplexing (WDM) architectures. The IP layer aggregates the traffic from metro and access networks and connects with the optical layer through short reach interfaces. The optical layer is composed of optical switching devices, mainly Reconfigurable Optical Add Drop Multiplexers (ROADMs) realized by Wavelength Selective Switches (WSSs), or Optical Cross Connect (OXC) switches, to drop wavelengths of terminating traffic, insert wavelengths of newly received traffic, and convert wavelengths, if necessary, to groom transit traffic. In addition, transponders for modulation and demodulation and Optical-Electrical-Optical (O/E/O) conversions are required [433]. Between the nodes, fiber links spanning up to thousands of kilometers are utilized. Due to physical impairments in fibers causing optical power losses and pulse dispersion, amplification is required, mainly with Erbium-Doped Fiber Amplifiers (EDFAs) at fixed distances. Reamplification, Reshaping, and Retiming (3R) regenerators can also be installed at distances that depend on the line rate. Such networks have sophisticated heterogeneous configurations and are typically configured to be static for long periods to reduce labor risks and costs. However, the increasing aggregated traffic due to Internet-based services with big data workloads makes legacy infrastructures incapable of serving future demands efficiently. This calls for improvements at different layers [434]-[437]. Legacy Coarse or Dense WDM (CWDM or DWDM) systems, standardized by the International Telecommunication Union (ITU), utilized transponders based on On-Off Keying (i.e. intensity) modulation and direct detection, with carriers in the Conventional band (C-band) (1530-1565 nm), 50 GHz spacing between channels, and up to 96 channels [438]. Typically, Single-Mode Fiber (SMF) links with a Single Line Rate (SLR) of 10 Gbps or 40 Gbps per channel are utilized. To increase the capacity of existing fiber plants and to improve the spectral efficiency, the use of Mixed Line Rates (MLR) and advanced modulation formats such as duobinary and Differential Quadrature Phase Shift Keying (DQPSK) was proposed [438]. The rates in MLR systems can be adjusted so that short-reach traffic is transported at high rates and long-reach traffic at lower rates to reduce regeneration requirements. With the advances in digital coherent detection enabled by high sampling rate Digital-to-Analogue Converters (DACs), Digital Signal Processors (DSPs), pre and post digital Chromatic Dispersion (CD) compensation, and Forward Error Correction (FEC) [439]-[441], the use of coherent higher order modulation formats such as QPSK and Quadrature Amplitude Modulation (QAM), in addition to polarization multiplexing, was also proposed to enhance the spectral efficiency.
DSPs can realize matched filters for detection and hence can enable transmission at the Nyquist limit for Inter-Symbol Interference (ISI)-free transmission, which relaxes the 50 GHz guard band requirements between adjacent WDM channels, allowing more channels in the C-band [440]. The use of the Long-band (L-band) (1565-1625 nm) and the Short-band (S-band) (1460-1530 nm) was also proposed to accommodate more channels; however, costly and complex amplifiers such as Raman amplifiers are required, in addition to careful impairment and non-linearity compensation approaches [440], [442]. Space Division Multiplexing (SDM)-based solutions that enable wavelength reuse in the fiber are also considered to increase link capacities. However, these solutions call for the replacement of SMFs with Multi-Mode Fibers (MMFs) (i.e. with a large core diameter to accommodate several lightpaths propagating in different modes) or Multi-Core Fibers (MCFs) (i.e. with several cores within the fiber) to spatially separate lightpaths that use the same carrier [442]. To further improve the spectral efficiency and allow dynamic bandwidth allocation, the concept of superchannels in Elastic Optical Networks (EONs) was also introduced [443]-[446]. A superchannel is composed of bundled spatial or spectral channels with variable bandwidths at a granularity of 12.5 GHz, as defined by the ITU-T SG15/G.694.1 FlexGrid. These channels can be transmitted as a single entity with guard bands only between superchannels. Such channels can be constructed with Nyquist WDM or with coherent Optical Orthogonal Frequency Division Multiplexing (O-OFDM), which overlaps adjacent subcarriers [447]-[449]. Such flexibility in bandwidth assignment in FlexGrids requires programmable and adaptive networking equipment such as Bandwidth-Variable Transponders (BV-Ts), Bandwidth-Variable Wavelength Selective Switches (BV-WSSs), and Colorless, Directionless, and Contentionless (CDC) ROADMs, as detailed in [435], [436], and [442]. The modulation format, line rate, and bandwidth of each channel can then be dynamically configured with SDN control [436] to provide programmable and agile Software-Defined Optical Networks with fine-grained control at the optical component and IP layer levels [188], [450].
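As a rough illustration of the arithmetic behind the FlexGrid granularity discussed above, the following back-of-the-envelope sketch compares the number of fixed 50 GHz channels with the number of 12.5 GHz FlexGrid slots that fit in the C-band, assuming only the 1530-1565 nm band edges quoted above (commercial 96-channel systems use slightly wider band definitions).

```python
# Rough C-band capacity comparison: fixed 50 GHz grid vs 12.5 GHz FlexGrid slots.
# Illustration of the arithmetic only, not a full EON design.
C = 299_792_458  # speed of light, m/s

f_high = C / 1530e-9                      # ~195.9 THz
f_low = C / 1565e-9                       # ~191.6 THz
band_ghz = (f_high - f_low) / 1e9         # ~4,380 GHz of C-band spectrum

fixed_channels = int(band_ghz // 50)      # ~87 channels on a rigid 50 GHz grid
flexgrid_slots = int(band_ghz // 12.5)    # ~350 slots of 12.5 GHz each

print(f"C-band width ~ {band_ghz:.0f} GHz")
print(f"Fixed 50 GHz grid: ~{fixed_channels} channels")
print(f"FlexGrid (12.5 GHz slots): ~{flexgrid_slots} slots to bundle into superchannels")
```

The finer slots do not by themselves add spectrum; the gain comes from bundling them into right-sized superchannels with guard bands only between superchannels, as described above.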

C. Challenges and requirements for Big Data Applications deployments in Cloud Environments: Although Hadoop and other big data frameworks were originally designed and provisioned to run in dedicated clusters under controlled environments, several cloud-based services, enabled by leasing resources from public, private, or hybrid clouds, are offering big data computations to public users aiming to increase the profit and utilization of cloud infrastructures [451]. Examples of such services are Amazon Elastic MapReduce (EMR)14, Microsoft's Azure HDInsight15, VMWare's Serengeti project [452], Cloud MapReduce [120], and Resilin [453]. Cloud-based implementations are increasingly considered as cost-effective, powerful, and scalable delivery models for big data analytics and can be preferred over cluster-based deployments, especially for interactive workloads [454]. However, to suit cloud computing environments, conventional big data applications are required to adopt several changes in their frameworks [124], [148]. Also, to suit big data applications and to meet the SLAs and QoS metrics for the heterogeneous demands of multiple users, cloud computing infrastructures are required to provide on-demand elastic and resilient services with adequate processing, storage, memory, and networking resources [126]. Here we identify and discuss some key challenges and requirements for big data applications in cloud environments:

Framework modifications: Single-site cluster implementations typically couple data and compute nodes (e.g. data nodes and task JVMs coexist in the same machines) to realize data locality. In virtualized cloud environments, and because elasticity is favored for computing while resilience is favored for storage, the two entities are typically decoupled or split [142]. For example, VMWare follows this split Hadoop architecture, and Amazon EMR utilizes S3 for storage and EC2 instances for computing. This requires an additional data loading step into the computing VMs before running jobs [148]; a brief illustrative sketch of this step is given at the end of this list. Such transfers are typically free of charge within the same geographical zone to promote the use of both services, but are charged if carried between different zones. For cloud-based RDBMSs, and as discussed in Section II-C, the ACID requirements are replaced by the BASE requirements, where either the availability or the consistency is relaxed to guarantee partition tolerance, which is a critical requirement in cloud environments [455]. Another challenge associated with the scale and heterogeneity of cloud implementations is that application debugging cannot be performed in smaller configurations as in single-site environments and requires tracing at the actual scale [313].

Offer selection and pricing: The availability of a large number of competing Cloud Service Providers (CSPs), each offering VMs with different characteristics, can be confusing to users, who should define their policies and requirements. To ease optimizing Hadoop-based applications for inexperienced users, some platforms such as Amazon EC216 provide a guiding script to aid in estimating computing requirements and tuning configurations.

14 Available at: https://aws.amazon.com/emr/
15 Available at: https://azure.microsoft.com/en-gb/services/hdinsight/
However, unified methods for estimation and comparison are not available. CSPs are also challenged with resource allocation while meeting revenue goals, as surveyed in [456], where different economic and pricing models at different networking layers are considered.

Resource allocation: Cloud-based deployments for big data applications require dynamic resource allocation at run-time as the arrivals and volumes of newly generated data are unprecedented and production workloads might change periodically or irregularly. Fixed resource allocation or pre-provisioning can lead either to under-provisioning, and hence QoS violations, or to over-provisioning, which leads to increased costs [120]. Moreover, cloud-based computations might experience performance variations due to interference caused by shared resource usage. Such variations require careful resource allocation, especially for scientific workflows, which also require elasticity in resource assignment as the requirements of different stages vary [457]. Data intensive applications and scientific workflows in clouds have data management challenges (i.e. transfers between storage and compute VMs) and data transfer bottlenecks over WANs [458].

QoS and SLA guarantees: In cloud environments, guaranteeing and maintaining QoS metrics such as response time, processing time, trust, security, failure rate, and maintenance time to meet SLAs, which are the agreements between CSPs and users about service requirements and violation penalties, is a challenging task due to several factors, as discussed in the survey in [459]. With multiple tenants sharing the infrastructure, QoS metrics for each user should be predictable and independent of the co-existence with other users, which requires efficient resource allocation and performance monitoring. A QoS metric that directly impacts the revenue is the response time of interactive services. It was reported in [460] and [461] that a latency of 100 ms in search results caused a 1% loss in Amazon's sales, while a latency of 500 ms caused a 20% drop in sales, and a speed-up of 5 seconds resulted in a 10% increase in sales for Google. QoE in video services is also impacted by latency. It was measured in [462] that with more than 2 seconds of delay in content delivery, 60% of the users abandon the service. Different interactive services depending on big data analytics are expected to have similar impacts on revenue and customer behaviors.

Resilience: Increasing the reliability, availability, and data confidentiality, and ensuring the continuity of services provided by cloud infrastructures and applications against cyber-attacks or system failures, are critical requirements, especially for sensitive services such as banking and healthcare applications. Resiliency in cloud infrastructures is approached at different layers, including the computing and networking hardware layer, the middleware and virtualization layer, and the applications layer, through efficient replication, checkpointing, and cloud collaboration, as extensively surveyed in [463].

Energy consumption: A drawback of cloud-based applications is that the energy consumption at both the network and end-user devices can exceed the energy consumption of local deployments [464], and [465].
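The extra data-loading step noted under framework modifications can be sketched as follows with the AWS SDK for Python (boto3); the bucket, key, and local path are hypothetical placeholders, and a real EMR job would normally read directly from S3 or stage many objects in parallel.

```python
# Illustrative only: stage one S3 input object onto a compute VM's local disk
# before running a job, as needed when storage and compute are decoupled.
import boto3

def stage_input_locally(bucket: str, key: str, local_path: str) -> None:
    """Download a single input object from S3 to the local filesystem."""
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, local_path)

if __name__ == "__main__":
    stage_input_locally("example-input-bucket", "logs/part-00000", "/tmp/part-00000")
```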
D. Options for Big Data Applications deployment in Geo-distributed Cloud Environments: CSPs place their workloads and content in geo-distributed data centers to improve the quality of their services, and rely on multiple Internet Service Providers (ISPs) or dedicated WANs [426], [427] to connect these data centers. Advantages such as load balancing, increased capacity, availability and resilience against catastrophic failures, and reduced latency by being close to the users are attained [440]. Such environments attract commercial and non-commercial big data analytics as they match the geo-distributed nature of data generation and provide scales beyond single-cluster implementations. Thus, there is a recent interest in utilizing geo-distributed data centers for big data applications despite the challenging requirements, as surveyed in [323] and suggested in [466]. Figure 11 illustrates different scenarios for deploying big data applications in geo-distributed data centers and compares single-site clusters (case A) with various geo-distributed big data deployments including bulk data transfers (case B), cloud bursting (case C), and different implementations of geo-distributed big data frameworks (case D), as outlined in [177]. Case B refers to legacy and big data bulk transfers between geo-distributed data centers, including data backups, content replication, and VM migrations, which can reach between 330 TB and 3.3 PB a month [164], requiring cost- and performance-optimized scheduling and routing. Case C is for hybrid frameworks, where a private cloud bursts some of its workloads to public clouds for various goals such as reducing costs or accelerating completion time. Such scenarios require interfacing with public cloud frameworks, in addition to profiling for performance estimation to ensure that bursting yields a gain compared to localized computations.

16 Available at: https://wiki.apache.org/hadoop/AmazonEC2

Fig. 11. Different scenarios for deploying big data applications in geo-distributed cloud environments.
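For case C in Figure 11, the decision to burst ultimately reduces to whether uploading the data and computing remotely beats finishing locally. The following sketch illustrates that break-even check with purely hypothetical figures for data size, WAN bandwidth, and public-cloud speed-up; it is an illustration of the profiling idea, not a method from the surveyed works.

```python
# Illustrative cloud-bursting break-even check (not from the surveyed works).
def should_burst(data_gb: float, wan_gbps: float,
                 local_hours: float, remote_speedup: float) -> bool:
    upload_hours = (data_gb * 8) / (wan_gbps * 3600)  # GB -> Gb, seconds -> hours
    remote_hours = local_hours / remote_speedup
    return upload_hours + remote_hours < local_hours

# Example: 500 GB of input over a 1 Gb/s WAN link, a 10-hour local run, and 4x
# more public-cloud capacity: ~1.1 h upload + 2.5 h compute, so bursting wins.
print(should_burst(data_gb=500, wan_gbps=1.0, local_hours=10, remote_speedup=4))
```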

Geo-distributed big data frameworks in public clouds have three main deployment modes [177]. Case D-1 copies all input data into a single data center prior to processing. This case is cost-effective and less complex in terms of framework management and computations, but encounters delays due to WAN bottlenecks. Case D-2 distributes the map-like tasks of a single job across geo-distributed locations to locally process input data. Then, intermediate results are shuffled across the geo-distributed locations to a single location where the final reduce-like tasks compute the final output. This arrangement suits workloads whose intermediate data sizes are much smaller than the input data. Although networking overheads can be reduced compared to case D-1, this case requires complex geo-aware control of the distributed framework components and experiences task performance fluctuations under heterogeneous cloud computing capabilities. The last case, D-3, creates a separate job in each location and transmits the individual output results to a single location where a final job aggregates the results. This realization relaxes the fine-grained control requirements of the distributed framework in D-2, but is considered costly due to the large number of jobs. Also, it only suits associative workloads such as 'averages' or 'counting', where the computations can be performed progressively in stages. Although such options provide flexibility in implementing geo-distributed big data applications, some unavoidable obstacles can be encountered. Due to privacy and governance regulations, copying data beyond their regions might be prohibited [467]. Also, due to HDFS federation (i.e. data nodes cannot register in other locations governed by a different organization), and the possibility of using different DFSs, additional code to unify data retrieval is required.

E. Big Data Implications on the Energy Consumption of Cloud Networking Infrastructures: Driven by the economic, environmental, and social impacts of the increased CAPEX, OPEX, Global Greenhouse Gas (GHG) emissions, and carbon footprints resulting from the expanding demands for Internet-based services, tremendous efforts have been devoted by industry and academia to reduce the power consumption and increase the energy efficiency of transport networks [468]-[472]. These services, empowered by fog, edge, and cloud computing and various big data frameworks, incur huge traffic loads on networking infrastructures and computational loads on hosting data centers, which in turn increase the power consumption and carbon footprints of these infrastructures [1], [473]-[475]. The energy efficiency of utilizing different technologies for wireless access networks has been addressed in [476]-[481], and for wired PONs and hybrid access networks in [482]-[488]. Core networks, which interconnect cloud data centers with metro and access networks containing IoT and user devices, transport huge amounts of aggregated traffic. Therefore, optimizing core networks plays an important role in improving the energy efficiency of the cloud networking infrastructures challenged by big data.
The reduction of energy consumption and carbon footprint in core networks, mainly IP over WDM networks, has been widely considered in the literature by optimizing the design of their systems, devices, and/or routing protocols [432], [489]-[506], utilizing renewable energy sources [507]-[513], and optimizing the resource assignment and content placement of different Internet-based applications [387], [514]-[527]. The early positioning study in [489] on greening the Internet addressed the impact of coordinated and uncoordinated sleeps (for line cards, crossbars, and main processors within switches) on routing protocols such as Open Shortest Path First (OSPF) and Internal Border Gateway Protocol (IBGP). Factors such as how, when, and where to put devices to sleep, and the overheads of redirecting the traffic and awakening the devices, were addressed. The study pointed out that energy savings are feasible but challenging due to the modifications required in devices and protocols. In [490], several energy minimization approaches were proposed, such as Dynamic Voltage Scaling (DVS) and Dynamic Frequency Scaling (DFS) at the circuit level, and efficient routing based on equipment with efficient energy profiles at the network level. The consideration of switching off idle nodes and rate adaptation has also been reported in [491]. The energy efficiency of the bypass and non-bypass virtual topologies and traffic grooming schemes in IP over WDM networks has been assessed in [492] through Mixed Integer Linear Programming (MILP) and heuristic methods. The non-bypass approach requires O/E/O conversion of lightpaths (i.e. traffic carried optically over fiber links and optical devices) at all intermediate nodes so that traffic is processed electronically in the IP layer and routed onto the following lightpaths. On the other hand, the bypass approach omits the need for O/E/O conversion at intermediate nodes and hence reduces the number of IP router ports needed, achieving power consumption savings between 25% and 45% compared to the non-bypass approach. In [493], a joint optimization of the physical topology of core IP over WDM networks, the energy consumption, and the average propagation delay was considered under bypass or non-bypass virtual topologies for symmetric and asymmetric traffic profiles. An additional 10% saving was achieved compared to the work in [492]. Traffic-focused optimizations for IP over WDM networks were also considered, for example in [494]-[499]. Optimizing static and dynamic traffic scheduling and grooming was considered in [494]-[497] in normal and post-disaster situations to reduce the energy consumption and demand blocking ratio. Techniques such as utilizing excess capacity, traffic filtering, protection path utilization, and service differentiation were examined. To achieve a lossless reduction of transmitted traffic, the use of Network Coding (NC) in non-bypass IP over WDM networks was proposed in [498], [499]. Network-coded ports encode bidirectional traffic flows via XOR operations and hence reduce the number of router ports required compared to uncoded ports. The energy efficiency and resilience trade-offs in IP over WDM networks were also addressed, as in [500]-[503]. The impact on the energy consumption of restoration after link cuts and core node failures was addressed in [500]. An energy efficient NC-based 1+1 protection scheme was proposed in [501] and [502], where the encoding of multiple flows sharing protection paths in non-bypass IP over WDM networks was optimized.
MILP models, heuristics, and closed-form expressions for the networking power consumption as a function of the hop count, network size, and demands indicated power savings of up to 37% compared to conventional 1+1 protection. The authors in [503] optimized the traffic grooming and the assignment of router ports to protection or working links under different protection schemes while considering sleep modes for protection ports and cards. Up to 40% saving in the power consumption was achieved. Utilizing renewable energy resources such as solar and wind to reduce non-renewable energy usage in IP over WDM networks with data centers was proposed in [507] and [508]. Factors such as the average availability of renewable energy and its transmission losses, regular and inter data center traffic, and the network topology were considered to optimize the locations of the data centers, and an average reduction of 73% in non-renewable energy usage was achieved. The work in [509] considered periodic reconfiguration of virtual topologies in IP over WDM networks based on a "follow the sun, follow the wind" operational strategy. Renewable energy was also considered in IP over WDM networks for cloud computing to green their traffic routing [510], content distribution [511], service migration [512], and VNE assignments [513]. The energy efficiency of Information-Centric Networks (ICNs) and Content Distribution Networks (CDNs) was extensively surveyed in [512] and [514], respectively. CDNs are scalable services that cache popular content throughout ISP infrastructures, while ICNs support name-based routing to ease access to content. The placement of data centers and their content in IP over WDM core nodes was addressed in [516] while considering the energy consumption, propagation delay, and users' upload and download traffic. An early effort to green the Internet [517] suggested distributing Nano Data Centres (NaDa) next to home gateways to provide various caching services. In [518], the energy efficiency of Video-on-Demand (VoD) services was examined by evaluating five strategic caching locations in core, metro, and access networks. The work in [519]-[521] addressed the energy efficiency of Internet Protocol Television (IPTV) services by optimizing video content caching in IP over WDM networks while considering the size and power consumption of the caches and the popularity of the content. To maximize cache hit rates, the dynamics of TV viewing behaviors throughout the day were explored. Several optimized content replacement strategies were proposed, and up to 89% power consumption reduction was achieved compared to networks with no caching. The energy efficiency of Peer-to-Peer (P2P) protocol-based CDNs in IP over WDM networks was examined in [522] while considering the network topology, content replication, and the behaviors of users. In [523], the energy efficiency and performance of various cloud computing services over non-bypass IP over WDM networks under centralized and distributed computing modes were considered. Energy-aware MILP models were developed to optimize the number, location, capacity, and contents of the clouds for three cloud services, namely content delivery, Storage-as-a-Service (SaaS), and virtual machine (VM)-based applications. An energy efficient cloud content delivery heuristic (DEER-CD) and a real-time VM placement heuristic (DEER-VM) were developed to minimize the power consumption of these services.
The results showed that replicating popular content and services in several clouds yielded 43% power savings compared to centralized placements. The placement of VMs in IP over WDM networks for cloud computing was optimized in [524] while considering their workloads, intra-VM traffic, number of users, and replica distribution, and an energy saving of 23% was achieved compared to single-location placements. The computing and networking energy efficiency of cloud services realized with VMs and VNs in scenarios using a server, a data center, or multiple geo-distributed data centers was considered in [387]. A Real-time heuristic for Energy Optimized Virtual Network Embedding (REOViNE), which considered the delay, client locations, load distribution, and efficient energy profiles for data centers, was proposed, and up to 60% power savings were achieved compared to bandwidth-cost-optimized VNE. Moreover, the spectral and energy efficiencies of O-OFDM with adaptive modulation formats and Adaptive Link Rate (ALR) power profiles were examined. To bridge the gap between traffic growth and networking energy efficiency in wired access, mobile, and core networks, GreenTouch17, a leading Information and Communication Technology (ICT) research consortium composed of fifty industrial and academic contributors, was formed in 2010 to provide architectures and specifications targeting energy efficiency improvements by a factor of 1000 in 2020 compared to 2010. As part of the GreenTouch recommendations, and to provide ISP operators with a road map for the energy efficient design of cloud networks, the work in [504], [506] proposed a combined consideration of the IP over WDM design approaches and the cloud networking-focused approaches in [387] and [523]. The design approaches jointly consider optical bypass, sleep modes for components, efficient protection, MLR, and optimized topology and routing, in addition to improvements in hardware, where two scenarios, Business-As-Usual (BAU) and BAU with GreenTouch improvements, are examined. Evaluations on the AT&T core network with 7 realistic data center locations and 2020 projected traffic, based on the Cisco Visual Network Index (VNI) forecast and a population-based gravity model, indicated energy efficiency improvements of 315x compared to 2010 core networks. Focusing on big data and its applications, the work in [526], [527] addressed improving the energy efficiency of transport networks while considering the different "5V" characteristics of big data and suggested progressive processing in intermediate nodes as the data traverse from sources to central data centers. A tapered network that utilizes limited processing capabilities in core nodes, in addition to 2 optimally selected cloud data centers, is proposed in [527], and an energy consumption reduction of 76% is achieved compared to centralized processing. In [527], the effect of data volumes on the energy consumption is also examined. The work in the above mentioned two papers is extended in [173], [174] and is further explained in Section V with other energy efficiency and/or performance-related studies for big data applications in cloud networking environments.
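As a toy illustration of the XOR-based encoding behind the network-coded ports in [498], [499], [501], and [502] (and not code from those works), the snippet below shows how a relay can forward a single coded packet for two counter-directional flows, with each endpoint recovering the other's payload using the packet it already holds; the payloads are hypothetical and assumed to be of equal length.

```python
# Toy illustration of XOR network coding for two counter-directional flows.
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

pkt_a_to_b = b"payload from A"
pkt_b_to_a = b"payload from B"

coded = xor_bytes(pkt_a_to_b, pkt_b_to_a)      # one coded transmission at the relay

recovered_at_b = xor_bytes(coded, pkt_b_to_a)  # B XORs with the packet it sent
recovered_at_a = xor_bytes(coded, pkt_a_to_b)  # A XORs with the packet it sent

assert recovered_at_b == pkt_a_to_b and recovered_at_a == pkt_b_to_a
```

In the surveyed schemes this idea is applied at the router-port level so that fewer ports (and hence less power) are needed for the same bidirectional traffic, with the optimization of which flows to encode left to the MILP and heuristic models cited above.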

V. CLOUD NETWORKING-FOCUSED OPTIMIZATION STUDIES
This Section addresses a number of network-level optimization studies that target big data applications deployed partially or fully in virtualized cloud environments. Unlike in dedicated cluster implementations, cloud-based applications encounter additional challenges in meeting QoS and SLA requirements, such as the heterogeneity of infrastructures and the uncertainty of resource availability.

17 Available at: www.greentouch.org
Thus, a wide range of optimization objectives and utilizations of cloud technologies and infrastructures are proposed. This Section is organized as follows: Subsection V-A focuses on cloud resource management and optimization for applications, while Subsection V-B addresses VM and container placement and resource allocation optimization studies. Subsection V-C discusses optimizations for big data bulk transfers between geo-distributed data centers and for inter data center networking with big data traffic. Finally, Subsection V-D summarizes studies that optimize big data applications while utilizing SDN and NFV. The studies presented in this Section are summarized in Tables III and IV for generic cloud-based and specific big data frameworks, respectively.

A. Cloud Resource Management: To avoid undesired under- or over-provisioning in cloud-based MapReduce systems, the authors in [120] provided a theoretically-based mechanism for resource scaling at run-time based on three sufficient conditions that satisfy hard-deadline QoS metrics. A dynamic resource allocation policy was examined in [121] to meet soft deadlines of map tasks in hybrid clouds while minimizing their budget. The policy utilized local resources in addition to public cloud resources to allow concurrent executions and hence reduce the completion time. In [122], a Hyper-Heuristic Scheduling Algorithm (HHSA) was proposed to reduce the makespan of Hadoop map tasks in virtualized cloud environments. HHSA was found to outperform the FIFO and Fair schedulers for jobs with heterogeneous map tasks. To jointly optimize resource allocation and scheduling for MapReduce jobs in clusters and clouds while considering data locality and meeting earliest start time, execution time, and deadline SLAs, the work in [123] proposed a MapReduce Constraint Programming-based Resource Management (MRCP-RM) algorithm. An average reduction of 63% in missed deadlines was achieved compared to existing schedulers. Cloud RESource Provisioning (CRESP) was proposed in [124] to minimize the cost of running MapReduce workloads in public clouds. A reliable cost model that allocates MapReduce workloads while meeting deadlines is obtained by quantifying the relations between the completion time, the size of the input data, and the required resources. However, a linear relation was assumed between data size and computation cost, which does not suit intensive computations. The energy efficiency of running a mix of MapReduce and video streaming applications in P2P and community clouds was considered in [125]. Results based on simulations that estimate the processing and networking power consumption indicated that community clouds are more efficient than P2P clouds for MapReduce workloads. Due to the varying prices, in addition to the heterogeneity in hardware configurations and provisioning mechanisms among different cloud providers, choosing a service that ensures adequate and cost-efficient resources for big data applications is considered a challenging task for users and application providers. In [126], a real-time QoS-aware multi-criteria decision making technique was proposed to ease the selection of IaaS clouds for applications with strict delay requirements such as interactive online games and real-time mobile cloud applications. Factors such as the cost, hardware features, and location, in addition to WAN QoS, were considered. Typically, cloud providers adopt resource-centric interfaces, where the users are fully responsible for defining their requirements and selecting corresponding resources. As these selections are usually non-optimal in terms of performance and utilization, Bazaar was alternatively proposed in [127] as a job-centric interface. Users only provide high-level performance and cost goals, and the provider then allocates the corresponding computing and networking resources. Bazaar increased the accepted request rate by between 3% and 14%, as it exploited the fact that several sets of resources lead to similar performance and hence provide more flexibility in resource assignment. In [128], an accurate projection-based model was proposed to aid the selection of VM and network bandwidth resources in public clouds with homogeneous nodes.
The model profiles MapReduce applications on small clusters and projects the predicted resources onto larger cluster requirements. The concurrency of in-memory sort, external merge-sort, and disk read/write speeds were considered. To address the heterogeneity in cloud systems caused by regular server replacements due to failures, the work in [129] proposed Application to Datacenter Server Mapping (ADSM) as a QoS-aware recommendation system. Results with 10 different server configurations indicated up to 2.5x efficiency improvement over a heterogeneity-oblivious mapping scheme with minimal training overhead. The authors in [130] considered a strategy for resource placement to provide high-availability cloud applications. A multi-constrained MILP model was developed to minimize the total downtime of cloud infrastructures while considering the interdependencies and redundancies between applications, and the Mean Time To Failure (MTTF) and Mean Time To Recover (MTTR) of the underlying infrastructure and application components. Up to 52% performance improvement was obtained compared to the OpenStack Nova scheduler. The studies in [131]-[134] focused on the management of bandwidth resources in cloud environments. The authors in [131] proposed CloudMirror to guarantee bandwidth allocation for interactive and multi-tier cloud applications. A network abstraction that utilizes tenants' knowledge of application bandwidth requirements, based on their communication patterns, was used. Then, a workload placement algorithm that balances locality and availability, and a runtime mechanism to enforce the bandwidth allocation, were used. The results indicated that CloudMirror can accept more requests while improving the availability from 20% to 70%. Falloc was proposed in [132] to provide VM-level bandwidth guarantees in IaaS environments. First, a minimum base bandwidth is guaranteed for each VM, then a proportional share policy through a Nash bargaining game approach is utilized to allocate excess bandwidth. Compared to fixed bandwidth reservation, an improvement of 16% in job completion time was achieved. ElasticSwitch in [133] is a distributed solution that can be implemented in hypervisors to provide minimum bandwidth guarantees and then dynamically allocate residual bandwidth to VMs in a work-conserving manner. Small testbed-based results showed successful guarantees even for challenging traffic patterns such as one-to-all traffic, and improved link utilization to between 75% and 99%. A practical distributed system to provide fine-grained bandwidth allocation among tenants, EyeQ, was examined in [134]. To provide minimum bandwidth guarantees and resolve contention at the edge, EyeQ maintains at each host a Sender EyeQ Module (SEM) that controls the transmission rate and a Receiver EyeQ Module (REM) that measures the received rate. These modules can be implemented in the codebase of virtual switches in untrusted environments or as a shim layer in non-virtualized trusted environments. For traffic with different transport protocols, EyeQ maintained the 99.9th percentile of latency equivalent to that of dedicated links. The authors in [135] addressed optimizing NoSQL database deployments in hybrid cloud environments. A cost-aware resource provisioning algorithm that considers the existence of different SLAs among multiple cloud tiers was proposed to meet query QoS requirements while reducing their cost.
In [136], the need for dynamic provisioning of cloud CPU and RAM resources was addressed for real-time streaming applications such as Storm. Jackson open queueing network theory was utilized to model arbitrary operations such as loops, splits, and joins. The fairness of assigning Spark-based jobs in geo-distributed environments was addressed in [137] through max-min fairness scheduling. For up to 5 concurrent sort jobs, and compared to the default Spark scheduler, job completion time improvements of up to 66% were achieved. The work in [138] outlined that the memoryless fair resource allocation strategies of YARN are not suitable for cloud computing environments as they violate desirable allocation properties. To address this, two fair resource allocation mechanisms, Long-Term Resource Fairness (LTRF) for single-level allocation and Hierarchical LTRF (H-LTRF), were proposed and implemented in YARN. The results indicated improvements compared to existing schedulers in YARN; however, only RAM resources were considered. While considering CPU, memory, storage, network, and data resource management in multi-tenant Hadoop environments, the authors in [139] utilized a meta-data scheme, a meta-data manager, and a multi-tenant scheduler to introduce multi-tenancy features into YARN and HDFS. Kerberos authentication was also proposed to enhance the security and access control of Hadoop.
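To make the fairness notion used in [137] and related schedulers concrete, the following is a minimal sketch of max-min fair sharing of a single resource by progressive filling; it illustrates the general technique rather than the scheduler proposed in that work, and the capacity and demand figures are hypothetical.

```python
# Progressive-filling max-min fair allocation of one resource (illustrative).
def max_min_fair(capacity, demands):
    alloc = [0.0] * len(demands)
    unsatisfied = set(range(len(demands)))
    remaining = float(capacity)
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)       # equal split of what is left
        for i in list(unsatisfied):
            grant = min(share, demands[i] - alloc[i])
            alloc[i] += grant
            remaining -= grant
            if alloc[i] >= demands[i] - 1e-9:      # demand fully met
                unsatisfied.remove(i)
    return alloc

# Example: 100 bandwidth units shared by demands of 10, 40 and 80 units yields
# approximately [10, 40, 50]: small demands are met and the rest is split evenly.
print(max_min_fair(100, [10, 40, 80]))
```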

B. Virtual Machines and Containers Assignments: The efficiency of VMs for big data applications relies heavily on the resource allocation scheme used. To ease sizing the virtualized resources for cloud MapReduce clusters and meeting the requirements of non-expert users, the authors in [140] suggested Elastisizer, which automates the selection of both VM sizes and job configurations. Elastisizer utilizes Starfish detailed profiling [103], in addition to white-box and black-box techniques, to estimate the virtualized resource needs of workloads. Cura was proposed in [141] to provide cost-effective provisioning for interactive MapReduce jobs by utilizing Starfish profiling, VM-aware scheduling, and online VM reconfigurations. Reductions of 80% in costs and 65% in job completion time were achieved, but at the cost of increased risk for cloud providers as they fully decide on resource allocation and on satisfying users' QoS metrics. Cloud providers typically over-provision resources to ensure meeting SLAs; however, this leads to poor utilization. The work in [142] utilized the over-provisioning of interactive transactional applications in hybrid virtual and native clusters by concurrently running MapReduce batch jobs to increase the utilization. HybridMR, a 2-phase hierarchical scheduler, was proposed to dynamically configure the native and virtual clusters. The first phase categorizes the incoming jobs according to their expected virtualization overhead, then decides on placing them in physical or virtual machines. The second phase is composed of a dynamic resource manager (DRM) at run-time and an Interference Prevention System (IPS) to reduce the interference between tasks in co-hosted VMs and tasks co-located in the same VM. The results indicated an improvement of 40% in completion time compared with fully virtualized systems, and of 45% in utilization compared to native systems, with minimal performance overhead. The differences in the requirements of long and short duration MapReduce jobs were utilized in [143] through fine-grained resource allocation and scheduling. A trade-off between scheduling speed and accuracy was considered, where accurate constraint programming was utilized to maximize the revenue of long jobs, and two heuristics, first-fit and best-fit, were utilized to schedule small jobs. As a result, the number of accommodated jobs increased by 18%, leading to an improvement of 30% in revenue. However, simplifying assumptions were made, such as each job running on only one machine, no preemption, and homogeneous VMs. The elasticity of VM assignments was utilized in several studies to reduce the energy consumption of running big data applications in cloud environments [144]-[147]. In [144], an online Energy-Aware Rolling-Horizon (EARH) scheduling algorithm was proposed to dynamically adjust the scale of VMs to meet the real-time requirements of aperiodic independent big data jobs while reducing the energy consumption. Cost and Energy-Aware Scheduling (CEAS) was proposed in [145] to reduce the cost and energy consumption of workflows while meeting their deadlines by utilizing DVFS and five sub-algorithms. The work in [146] considered spatial and temporal placement algorithms for VMs that run repeated batch MapReduce jobs. It was found that spatio-temporal placements outperform spatial-only placements in terms of both energy saving and job performance. The work in [147] considered energy efficient VM placement while focusing on the characteristics of MapReduce workloads.
Two algorithms, TRP, which trades off VM duration against resource utilization, and VCS, which enhances the utilization of active servers, were utilized to reduce the energy consumption. Savings of about 16% and 13% were obtained compared to the Recipe Packing (RP) and Bin Packing (BP) heuristics, respectively. Several studies addressed data locality improvements in virtualized environments [148]-[151]. To reduce the impact of sudden increases in shuffling traffic on the performance of clouds, Purlieus was proposed in [148] as a cloud-based MapReduce system. The system suggests using dedicated clouds with preloaded data and utilizes separate storage and compute units, which is different from conventional MapReduce clouds, where the data must be imported before the start of the processing. Purlieus improved the input and intermediate data locality by coupling data and VM placements during both map and reduce phases via heuristics that consider the type of MapReduce jobs. The challenge of seamlessly loading data into VMs is addressed by loop-back mounting and VM disk attachment techniques. The results indicated an average reduction in job execution time of 50% and in traffic between VMs of up to 70%. However, it assumed knowledge of dataset loads, job arrival rates, and mean execution times. The headroom CPU resources that cloud providers typically preserve to accommodate demand peaks were utilized in [149] to maximize the data locality of Hadoop tasks. These resources were dynamically adjusted at run-time via hot-plugging. If unstarted tasks in a VM have their data locally, headroom CPU resources are reserved for that VM while equivalent resources are removed from other VMs to keep the price constant for the users. For the allocation and deallocation of resources, two Dynamic Resource Reconfiguration (DRR) schedulers were proposed: a synchronous DRR that utilizes CPU headroom resources only when they are available, and a queue-based DRR that delays the scheduling of tasks to their locality-matched VMs until headroom resources are available. An average improvement in throughput of 15% was achieved. DRR differs from Purlieus in that it assumes no prior knowledge and schedules dynamically; however, the locality of reduce tasks was not considered. To tackle resource usage interference between VMs in virtual MapReduce clusters, which can cause performance degradation, the authors in [150] proposed an interference- and locality-aware (ILA) scheduling algorithm based on delay scheduling and task performance prediction models. Based on observations that the highest interference is in the I/O resources, ILA was tested under different storage architectures. ILA achieved speed-ups of 1.5-6.5 times for individual jobs and a 1.9 times improvement in throughput compared with four state-of-the-art schedulers, interference-aware-only scheduling, and locality-aware-only scheduling. The authors in [151] proposed CAM, a cloud platform that supports cloud MapReduce by determining the optimal placement of new VMs requested by users while considering the physical location and the distribution of workloads and input data. CAM exposes networking topology information to Hadoop and integrates the IBM ISAAC protocol with GPFS-SNC for the storage layer, which allows co-locating related data blocks to increase the locality of VM image disks. As a result, the cross-rack traffic related to the operating system and MapReduce was reduced by 3 times, and the execution duration was reduced by 8.6 times compared to state-of-the-art resource allocation schemes.
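Several of the placement schemes above build on classic packing heuristics, such as the first-fit and best-fit algorithms used in [143] and the bin-packing baselines compared against in [147]. The sketch below illustrates both heuristics for packing hypothetical VM vCPU demands onto identical hosts, where every host left unopened is a candidate for a sleep mode; it is a generic illustration, not the schedulers from those works.

```python
# Illustrative first-fit and best-fit packing of VM vCPU demands onto hosts.
def first_fit(demands, capacity):
    hosts = []                                # residual capacity of opened hosts
    for d in demands:
        for i, free in enumerate(hosts):
            if free >= d:                     # first host that still fits the VM
                hosts[i] -= d
                break
        else:
            hosts.append(capacity - d)        # open a new host
    return len(hosts)

def best_fit(demands, capacity):
    hosts = []
    for d in demands:
        fitting = [i for i, free in enumerate(hosts) if free >= d]
        if fitting:
            i = min(fitting, key=lambda i: hosts[i])   # tightest remaining space
            hosts[i] -= d
        else:
            hosts.append(capacity - d)
    return len(hosts)

vm_vcpus = [4, 8, 2, 6, 4, 2, 8]              # hypothetical VM demands
print(first_fit(vm_vcpus, 16), best_fit(vm_vcpus, 16))  # hosts opened by each
```

Production placement additionally accounts for multiple resource dimensions, interference, and data locality, which is where the surveyed schemes differ from these baselines.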
To increase the revenue and reliability of business-critical workloads in virtualized environments, the authors in [152] proposed Mnemos, a self-expressive resource management and scheduling architecture. Mnemos utilized existing system monitoring and VM management tools, in addition to a portfolio scheduler and the Nebu topology-aware scheduler, to adapt the policies according to workload types. Mnemos was found to respond well to workload changes and to reduce costs and risks. Similarly, to provide automated real-time sizing of NoSQL clusters based on workloads and user-defined policies, TIRAMOLA, an open-source framework for use in IaaS platforms, was suggested in [153]. TIRAMOLA integrates VM-based monitoring with a Markov Decision Process (MDP) to decide on the addition and removal of VMs while monitoring the client and server sides. The results showed successful VM resizing that tracks the typical sinusoidal variations in commercial workloads while using different load smoothing techniques to avoid VM assignment oscillations (i.e. VMs being continuously added and removed without being useful to workloads). To enhance current Xen-based VM schedulers, which do not fully support MapReduce workloads, a MapReduce Group Scheduler (MRG) was proposed in [154]. MRG aims to improve the scheduling efficiency by aggregating the I/O requests of VMs located in the same physical machine to reduce context switching, and to improve fairness by assigning weights to VMs. MRG was tested with speculative execution and on multiprocessor systems, and an improvement of 30% was achieved compared to Xen's default credit scheduler. In [155], a large segment scheme and the Xen I/O ring mechanism between front-end and back-end drivers were utilized to optimize block I/O operations for Hadoop. As a result, CPU usage was reduced by a third during I/O, and the throughput was improved by 10%. A few recent studies addressed Hadoop improvements in Linux container-based systems [156]-[160]. FlexTuner was proposed in [156] to allow users and developers to jointly configure the network topology and routing algorithms between Linux containers, and to analyze communication costs for applications. Compared with Docker containers that are connected through a single virtual Ethernet bridge, FlexTuner uses virtualized Ethernet links to connect the containers and the hosts (i.e. each container has its own host name and IP address). The authors in [157] utilized a modified k-nearest neighbour algorithm to select the best configurations for newly submitted jobs based on information from similar past jobs in Hadoop clusters based on Docker containers. A 28% gain in performance was obtained compared to default YARN configurations. An automated configuration for Twister, a dedicated MapReduce implementation for iterative computation workloads in container-driven clouds, was proposed in [158]. The traditional client/server communication model (e.g. RPC) was replaced with a publish/subscribe model via the NaradaBrokering open-source messaging infrastructure. A negligible difference in job completion time was found between physical and container-based implementations. To improve the networking performance of Linux container-based Hadoop in cloud environments, the authors in [159] proposed the use of IEEE 802.3ad link aggregation, which provides separate interfaces to instances and then aggregates the traffic at the physical interface.
Completion time results indicated performance comparable to bare-metal implementations and improvements of up to 33.73% compared to regular TCP. EHadoop in [160] tackled the impact of saturated WAN resources on the completion time and costs of running MapReduce in elastic cloud clusters. A network-aware scheduler that profiles jobs and monitors network bandwidth availability was proposed to schedule the jobs on the optimum number of containers so that costs are reduced and network contention is avoided. Compared to the default YARN scheduler, a 30% reduction in costs was achieved.

C. Bulk Transfers and Inter Data Center Networking (DCN): Several studies considered optimizing bulk data transfers for backups, replication, and data migration between geo-distributed data centers in terms of link utilization, cost, completion time, or energy efficiency [161]-[170]. The early work in [161] proposed NetStitcher to utilize the unused bandwidth between multiple data centers for bulk transfers. NetStitcher used future bandwidth availability information (e.g. a fixed free slot between 3 and 6 AM) and intermediate data centers between the source and destination in a store-and-forward fashion to overcome availability mismatches between non-neighbouring nodes. Moreover, bulk data was split and routed over multiple paths to overcome intermediate node storage constraints. Postcard in [162] further considered different file types and multiple source-destination pairs, and proposed a time-slotted model to capture the dynamics of storing and forwarding. An online optimization algorithm was developed to utilize the variability in bandwidth prices to reduce the routing costs. GlobeAny was proposed in [163] as an online profit-driven scheduling algorithm for inter data center data transfers requested by multiple cloud applications. Request admission, store-and-forward routing, and bandwidth allocation were considered jointly to maximize the cloud provider's profit. Furthermore, differentiated services can be provided by adjusting application weights to satisfy different priority and latency requirements. CloudMPcast was suggested in [164] to optimize bulk transfers from a source to multiple geo-distributed destinations through overlay distribution trees that minimize costs without affecting transfer times, while considering full mesh connections and store-and-forward routing. The charging models used by public cloud service providers, which are characterized by flat costs depending on location and by discounts for transfers exceeding thresholds in the range of terabytes, were utilized. Results indicated improved utilization and savings for customers of 10% and 60% for Azure and EC2, respectively, compared to direct transfers. Similarly, the work in [165] optimized big data broadcasting over heterogeneous networks by constructing a pipelined broadcast tree. An algorithm was developed to select the optimum uplink rate and construct an optimal LockStep Broadcast Tree (LSBT). The authors in [166] and [167] focused on improving bulk transfers while considering the optical networking and routing layer. In [166], a joint optimization of the backup site and the data transmission paths was proposed to accelerate regular backups. The results indicated that it is better to select the sites and the paths separately. The spectrum management agility of elastic optical networks, realized by bandwidth-variable transponders and wavelength selective switches, was utilized in [167] to handle the uncertainty and heterogeneity of big data traffic by adaptively routing backups, which were modeled as dynamic anycasts. End-system solutions, besides networking solutions, were proposed in [168] to reduce the energy consumption of data transfers between geo-distributed data centers. Factors such as file sizes, TCP pipelining, parallelism, and concurrency were considered. Moreover, data transfer power models based on power meter measurements, which consider CPU, memory, hard disk, and NIC resources, were developed. To emulate geo-distributed transfers in testbeds, artificial delays obtained by setting the round-trip time (RTT) were considered.
A similar RTT-based emulation approach was also utilized in [169] and [170]. To assist users in finding the best configurations, the work in [169] investigated the effects of the distance on the delay and the traffic characteristics in geo-distributed data centers while considering different block sizes, locality, and replicas. The authors in [170] experimentally addressed the cost and speed efficiency of data movements into a single data center for MapReduce workloads. Two online algorithms, Online Lazy Migration (OLM) and Randomized Fixed Horizon Control (RFHC), were proposed, and the authors concluded that MapReduce computations are generally better processed in a single data center. On the other hand, several other studies have utilized the fact that, for most production jobs, the intermediate data generated is much smaller than the input data, and they proposed alternative solutions based on distributed computations [171]-[174]. To reduce the computational and networking costs of processing globally collected data, the authors in [171] proposed a parallel algorithm to jointly consider data-centric job placement and network coding-based routing to avoid duplicate transmissions of intermediate data. An improvement of 70% was achieved compared to simple data aggregation and regular routing. To decrease MapReduce inter data center traffic while providing predictable completion times, the authors in [172] also addressed the joint optimization of input data routing and task placement. The bandwidth between data centers and the ratio of input data to intermediate data were determined by profiling, and the global allocation of map and reduce tasks was performed so that the maximum time required for input data and intermediate data transmissions is minimized. A reduction of 55% was achieved compared to the centralized approach. The authors in [173], [174] improved the energy efficiency of bypass IP over WDM transport networks while considering the different 5V attributes of big data. The use of distributed Processing Nodes (PNs) in the sources and/or intermediate core nodes was suggested to progressively process the data chunks before reaching cloud data centers, and hence reduce the network usage. In [173], a joint optimization of the processing in PNs or cloud data centers and the routing of data chunks with different volumes and processing requirements yielded up to 52% reduction in the power consumption. The capacities and power usage effectiveness (PUE) of the PNs and the availability of the required processing frameworks were also examined. The impact of the velocity of big data on the power consumption of the network was further considered in [174] where two processing modes were examined: expedited (i.e. real-time), which is delay optimized, and relaxed (i.e. batch), which is energy optimized. Power consumption savings ranging between 60% for the fully relaxed mode and 15% for the fully expedited mode (due to its additional processing requirements) were achieved compared to single-location processing without progressive processing. Several studies addressed detailed optimizations of the configurations and scheduling of the MapReduce framework in geo-distributed environments, for example to reduce the costs [175]-[177], the completion time [178], [179], and for hybrid clouds [180], [181], while a few addressed optimizing geo-distributed data streaming, querying, and graph processing applications [182]-[187].
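A back-of-the-envelope comparison helps explain why such progressive processing saves network resources. The sketch below is our illustration (not the MILP models of [173], [174]) with assumed chunk sizes, hop counts, and a 10:1 data reduction ratio.

```python
# Back-of-the-envelope comparison: shipping raw chunks to a central data
# center versus processing them at a nearby processing node (PN) and
# forwarding only the reduced output (all parameters are assumptions).

def network_volume(chunks_gb, hops_to_dc, hops_to_pn, hops_pn_to_dc,
                   reduction_ratio):
    """Return (centralized, progressive) volume-distance products in GB*hops."""
    centralized = sum(c * hops_to_dc for c in chunks_gb)
    progressive = sum(c * hops_to_pn + c * reduction_ratio * hops_pn_to_dc
                      for c in chunks_gb)
    return centralized, progressive

# A 10:1 reduction at a PN two hops away versus a data center six hops away.
print(network_volume([120, 80, 200], hops_to_dc=6, hops_to_pn=2,
                     hops_pn_to_dc=4, reduction_ratio=0.1))
```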
Among these studies, an inter data center allocation framework was proposed in [175] to reduce the costs of running large jobs while considering the variabilities in electricity and bandwidth prices. The framework considered job indivisibility (i.e. the ability to run in data centers without the need for inter communications), data center processing capacities, and time constraints. A StochAstic power redUction schEme (SAVE) was proposed in [176] to schedule batch jobs arriving at a front-end server onto geo-distributed back-end clusters while trading off power costs and delay. Scheduling in SAVE was based on two time scales: the first allocates jobs and activates/deactivates servers, and the second, which is finer, manages service rates by controlling CPU power levels. To reduce the cost and execution time of sequential MapReduce jobs in geo-distributed environments, G-MR in [177] utilized Data Transformation Graphs (DTGs) to select an optimal execution sequence detailed by data movements and processing locations. G-MR used approximating functions to estimate completion times for map and reduce tasks, and the sizes of intermediate and output data. Replicas were considered by constructing and comparing a DTG for each. Moreover, the effects of choosing different providers (i.e. different inter data center bandwidths, and networking and compute costs) with data centers in the US east coast, US west coast, Europe, and Asia were considered. The work in [178] addressed some geo-distributed Hadoop limitations through prediction-based job localization, map input data pre-fetching to balance the completion times of local and remote map tasks, and by modifying the HDFS data placement policy to be sub-cluster aware to reduce output data writing time. An improvement of 48.6% was achieved for the completion time of reduce tasks. The authors in [179] addressed the placement of MapReduce jobs in geo-distributed clouds with limited WAN bandwidth. A Community Detection-based Scheduling (CDS) algorithm was utilized to reduce the WAN data transmission volume and hence reduce the overall completion time while considering the dependencies between tasks, workloads, and data centers. Compared to a hyper-graph partition based scheduling and a greedy algorithm, the transmitted data volume was reduced by 40.7% and the completion time was reduced by 35.8%. A performance model was proposed in [180] to estimate the speed-up in completion time for iterative MapReduce jobs when considering cloud bursting. The model focused on weak links between on-premise and off-premise VMs and suggested asynchronous rack-aware replica placement and replica-aware scheduling to avoid additional data transmission to off-premise VMs if a task is scheduled before its data is replicated there. Experimental results with up to 15 on-premise VMs and 3 off-premise VMs indicated between 1-10% error in the model-based estimation. Compared to ARIA [69], up to one order of magnitude higher accuracy was achieved. BStream was proposed in [181] as a cloud bursting framework that optimally couples batch processing in resource-limited internal clouds (IC) with stream processing in external clouds (EC). To account for the delays incurred when uploading data to the EC, BStream matched the stream processing rate with the data upload rate by allocating enough EC resources. Storm was used for stream processing with the aid of Nimbus, Zookeeper, and LevelDB for coordination, and Kafka servers to temporarily store data in the EC.
BStream assumed homogeneous resources and no data skew, and suggested performing partial reduction in the IC if the tasks are associative; alternatively, final reduce tasks run only in the IC. Two checkpointing strategies were developed to optimize the reduce phase by overlapping the transmission of reduce results back to the IC with input data bursting and processing in the EC. However, users need to write the same tasks twice (i.e. once for Hadoop and once for Storm). SAGE was proposed in [182] as an efficient service-oriented geo-distributed data streaming framework. SAGE suggested the use of loosely coupled cloud queues to manage and process streams locally and then utilized independent global aggregator services for the final results. Compared to single-location processing, a performance improvement by a factor of 3.3 was achieved. Moreover, an economic model was developed to optimize the resource usage and costs of the decomposed services. However, the impact of limited WAN bandwidths on the performance was not addressed. To address this, the authors in [183] proposed JetStream, which utilizes structured Online Analytical Processing (OLAP) data cubes to aggregate data streams in edge nodes and adaptively filters (i.e. degrades) the data to match the instantaneous capacity of the WAN at the expense of analysis quality. The work in [184] aimed to reduce the inter data center networking cost while meeting the SLAs of big data stream processing in geo-distributed data centers by jointly optimizing VM placements and flow balancing. Each task was replicated in multiple VMs and the distribution of data streams between them was optimized. The results revealed improvements compared to single-VM task allocation. The computational and communication costs of streaming workflows in geo-distributed data centers were considered in [185]. An Extended Streaming Workflow Graph (ESWG)-based allocation algorithm that accounts for semantic types (i.e. sequential, parallel, and choice for incoming and outgoing patterns) and the price diversity of computational and communication resources was constructed. The algorithm assumes that each task is allocated a VM in each data center, and divides the graph into sub-graphs to simplify the allocation solutions. The results indicated improved performance and reduced costs compared to greedy, computational-cost-only, and communication-cost-only approaches. To optimize query processing latency and the usage of limited-bandwidth, variably-priced WANs in geo-distributed Spark-like systems, Iridium was proposed in [186]. An online heuristic was used that iteratively optimizes the placement of data prior to query arrival, based on its value per byte, as well as the placement of processing for complex DAGs of tasks. Iridium was found to speed up queries by 3-19 times and reduce WAN bandwidth usage by 15-64% compared to single-location processing and unmodified Spark. The work in [187] proposed G-Cut for efficient geo-distributed static and dynamic graph partitioning. Two optimization phases were suggested to account for the usage constraints and the heterogeneity of WANs. The first phase improves the greedy vertex-cut partitioning approach of PowerGraph to be geo-aware and hence reduces inter data center data transfers. The second phase adopts a refinement heuristic to consider the variability in WAN bottlenecks. Compared to unmodified PowerGraph, reductions of up to 58% in the completion time and 70% in WAN usage were achieved.
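The bandwidth-adaptive degradation idea of JetStream [183] can be sketched as choosing the finest aggregation level whose output rate fits the currently available WAN capacity; the level names and rates below are assumptions for illustration only.

```python
# Toy sketch of bandwidth-adaptive degradation in the spirit of JetStream
# [183]: choose the finest aggregation level of an OLAP-style cube whose
# output rate fits the instantaneous WAN capacity (levels are assumptions).

LEVELS = [                             # ordered from finest to coarsest
    ("per-second, per-URL", 400.0),    # Mbps produced at this granularity
    ("per-minute, per-URL", 80.0),
    ("per-minute, per-domain", 12.0),
    ("per-hour, per-domain", 1.5),
]

def choose_degradation(wan_capacity_mbps):
    """Return the finest level whose rate does not exceed the WAN capacity,
    falling back to the coarsest level if none fits."""
    for name, rate in LEVELS:
        if rate <= wan_capacity_mbps:
            return name, rate
    return LEVELS[-1]

print(choose_degradation(50.0))   # -> ('per-minute, per-domain', 12.0)
```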

D. SDN, EON, and NFV-based Optimization Studies: Static transport circuit-switched networks cannot provide bandwidth guarantees to applications as they have low visibility of the transported traffic. The early work in [188] utilized OpenFlow as a unified control plane for the packet-switched and circuit-switched layers to enable application-aware routing that can meet advanced requirements such as low latency or low jitter. OpenFlow was shown to enable differentiated services and resilience for applications with different priorities. To provide latency and protection guarantees to applications, the Application Centric IP/Optical Networks Orchestration (ACINO) project was proposed in [189]. ACINO provides applications with the ability to specify reliability and latency requirements, and then grooms the traffic of applications with similar requirements into specific optical services. Moreover, an IP layer restoration mechanism was introduced to assist optical layer restoration. A Zero Touch Provisioning, Operation and Management (ZTP/ZTPOM) model was proposed in [190]. It utilized SDN and NFV to support Cloud Service Delivery Infrastructures (CSDI) by automating the provisioning of storage and networking resources of multiple cloud providers. However, these studies considered generalized large-scale web, voice, and video applications, and did not focus on big data applications. The authors in [191] focused on big data transmission and demonstrated the benefits of using Nyquist superchannels in software-defined optical networks for high speed transmission. Routing, Modulation Level, and Spectrum Allocation (RMLSA) optimizations were carried out to select paths and modulation levels that improve the transmission speed while ensuring flexible spectrum allocation and high spectral efficiency. Compared to traditional Routing and Spectrum Assignment (RSA), an improvement of 77% was achieved. However, only bandwidth requirements were considered. To control geo-distributed shuffling traffic that typically saturates local and wide area network switches, the work in [192] suggested an OpenFlow (OF) over Ethernet enabled Hadoop implementation to dynamically control flow paths and define OF-enabled queues with different priorities to accelerate jobs. Experiments with sort workloads demonstrated that OF-enabled Hadoop outperformed standard Hadoop. Palantir was proposed in [193] as an SDN service in Hadoop to ease obtaining adequate network proximity information, such as rack-awareness within the data center and data center-awareness in geo-distributed deployments, without revealing security-sensitive information. Two optimization algorithms were proposed: the first for data center-aware replica placement and task scheduling, and the second for proactive data prefetching to accelerate tasks with non-local data. To automate the provisioning of computing and networking resources in hybrid clouds for big data applications, VersaStack was proposed in [194] as an SDN-based full-stack model-driven orchestration. VersaStack computes models for the virtual cloud network, manages VM placement, computes L2 paths, performs Single Root Input/Output Virtualization (SR-IOV) stitching, and manages Ceph, which is a networked block storage system. Enhancing bulk data transfers has been considered with the aid of SDN and elastic optical inter data center networking in [195]-[199].
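Before turning to those bulk-transfer studies, the data center-aware replica placement idea behind Palantir [193] can be illustrated with a short sketch; the demand-driven placement rule, data structures, and node names below are our assumptions, not Palantir's actual algorithm.

```python
# Illustrative data center-aware replica placement (inspired by the idea in
# [193]): place replicas of a block in the data centers that are expected to
# read it most often, one rack per replica (all inputs are made-up examples).

def place_replicas(expected_reads, nodes_by_dc_rack, replicas=3):
    """expected_reads: {data_center: expected read count for this block}
    nodes_by_dc_rack: {(data_center, rack): [node, ...]}
    Returns chosen nodes, visiting data centers in demand order and taking
    one node from an unused rack in each."""
    chosen, used_racks = [], set()
    for dc in sorted(expected_reads, key=expected_reads.get, reverse=True):
        for (d, rack), nodes in nodes_by_dc_rack.items():
            if d == dc and (d, rack) not in used_racks and nodes:
                chosen.append(nodes[0])
                used_racks.add((d, rack))
                break
        if len(chosen) == replicas:
            break
    return chosen

nodes = {("dc1", "r1"): ["n11"], ("dc1", "r2"): ["n12"],
         ("dc2", "r1"): ["n21"], ("dc3", "r1"): ["n31"]}
print(place_replicas({"dc1": 40, "dc2": 25, "dc3": 5}, nodes))
```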
To overcome the limitations of current inter data center WANs with pre-allocated static links and capacities for big data transfers, the authors in [195] proposed the Open Transport Switch (OTS), a lightweight virtual switch that abstracts the control of transport network resources for hybrid clouds. OpenFlow was extended to manage packet-optical cross connects (XCONs) through additional messaging capabilities. OTS allows applications to optimally compute end-to-end paths across multiple domains or to just submit bandwidth requirements to OTS. Two modes of operation were suggested: implicit and explicit. The former allows the SDN controller to communicate only with edge switches, which are located between different transport domains (e.g. Ethernet, Multiprotocol Label Switching (MPLS), OTN), while the latter unifies the control of all the components across different domains. Malleable reservations for bulk data transfers in EONs were proposed in [196]. While accounting for flow-oriented reservations with fixed bandwidth allocations, the RSA of the bulk transfers was adjusted dynamically to increase the utilization of the EON and to reclaim fragmented 2-D spectrum. The authors in [197] considered dynamic routing for different data chunks in bulk data transfers between geo-distributed data centers by utilizing an SDN centralized controller and distributed gateway servers in the data centers. Task admission control, routing, and the store-and-forward mechanism were considered to maximize the utilization of dedicated links between the data centers. To boost high priority transfers when the networking resources are insufficient, the transmission of lower priority chunks can be paused. These chunks are then temporarily stored in intermediate gateway servers and forwarded later at a carefully computed time. To further enhance the transfer performance, three dynamic algorithms were proposed: bandwidth reserving, dynamic adjustment, and future demand friendly. Owan was proposed in [198] as an approach that can help reduce the completion time of deadline-constrained, business-oriented bulk transfers in WANs by utilizing a centralized SDN controller that controls ROADMs and packet switches. Network throughput was maximized by joint optimization of the network topology, routing, and transfer rates while assuming prior knowledge of demands. Topology reconfigurations were achieved by changing the connectivity between router and ROADM ports, for example to double the capacity of certain links. Simulated annealing was utilized to update the network gradually and hence ease the reconfigurations. Moreover, consistency was ensured while applying the updates to avoid packet drops. Compared to packet-layer control only, the results indicated 4.45 times faster transfers and 1.36 times lower completion time. Flexgrid optical technologies were proposed with SDN control to meet SLAs and increase the revenue of inter data center transfers in [199]. The Application-Based Network Operations (ABNO) architecture, a centralized control-plane architecture being standardized by the IETF, was utilized to manage application-oriented semantics that identify data volumes and completion time requirements. Heuristics for Routing and Scheduled Spectrum Allocation (RSSA) were proposed to adjust BV transponders and optical cross-connects (OXCs) to ensure sufficient resources and hence meet deadlines.
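The simulated-annealing search used by Owan [198] to explore topology configurations follows the standard annealing pattern sketched below; the cooling schedule, neighbourhood move, and toy throughput objective are stand-ins and not the Owan implementation.

```python
# Generic simulated-annealing skeleton of the kind used by Owan [198] to
# search over topology configurations (textbook SA loop with a stand-in
# objective; link set, demands, and budget below are made up).

import math
import random

def simulated_annealing(initial_state, neighbour, score, steps=5000,
                        t_start=1.0, t_end=0.01, seed=0):
    """Maximize score(state) while cooling geometrically from t_start to t_end."""
    rng = random.Random(seed)
    state, best = initial_state, initial_state
    for i in range(steps):
        temp = t_start * (t_end / t_start) ** (i / steps)
        cand = neighbour(state, rng)
        delta = score(cand) - score(state)
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            state = cand                 # accept improving or some worse moves
        if score(state) > score(best):
            best = state
    return best

# Toy usage: pick which router<->ROADM port pairs get a second wavelength,
# doubling a link's capacity, under a budget of reconfigurable ports.
LINKS, BUDGET = ["A-B", "B-C", "C-D", "A-D"], 2
demand = {"A-B": 18.0, "B-C": 7.0, "C-D": 14.0, "A-D": 4.0}

def score(doubled):
    return sum(min(demand[l], 20.0 if l in doubled else 10.0) for l in LINKS)

def neighbour(doubled, rng):
    new = set(doubled) ^ {rng.choice(LINKS)}
    return frozenset(new) if len(new) <= BUDGET else doubled

print(sorted(simulated_annealing(frozenset(), neighbour, score)))
```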
In [199], if workloads with high priority and the nearest completion times lack resources, the resources assigned to the lowest priority workloads can be dynamically reduced. A few recent studies addressed utilizing NFV to improve the performance of big data applications [200]-[203]. An NFV framework for genome-centric networking was proposed in [200] to address the challenges of exchanging, between data centers, both genetic data with large file sizes (e.g. 3.2 GB) and the VM images that contain the processing software. To reduce the traffic, SDN and NFV-based caches were utilized to store parts of the files to be processed along the path. A signalling protocol with on-path and off-path message distribution was proposed to discover the cache resources. NFV enables service chaining by using sequences of VNFs to perform fundamental network operations, such as firewalling and deep packet inspection (DPI), on traffic. The work in [201] addressed optimizing the placement of VNF instances to minimize the communication costs between connected VNF instances while considering the volumes of big data traffic. An extended network service chain graph was proposed and the joint optimization of VNFs and the flows between them was considered. The ETSI NFV specifications were followed in [202] and [203] to enhance cloud infrastructures with big data video workloads. The authors in [202] implemented an elastic CDN as a virtualized network function to deliver live TV and VoD services via the standard MPEG Dynamic Adaptive Streaming over HTTP (MPEG-DASH). A centralized CDN manager was proposed to control the virtualized CDN functions, such as request admission and intermediate and leaf cache contents and capacity, while reducing the networking costs and ensuring high QoE. NUBOMEDIA was proposed in [203] as an elastic PaaS with an integrated API stack that combines Real-time Multimedia Communication (RTC) with multimedia big data processing, such as video content analysis (VCA) and augmented reality (AR), with the aid of NFV. The results indicated that NUBOMEDIA scaled well with increasing requests while maintaining the QoS for WebRTC workloads. A summary of the cloud networking-focused optimization studies for big data applications discussed is given in Table III. Table IV, in contrast, focuses on generic cloud applications and provides a summary in a fashion similar to that in Table III.
TABLE III SUMMARY OF CLOUD NETWORKING-FOCUSED OPTIMIZATION STUDIES FOR BIG DATA APPLICATIONS Ref Objective Application Tools Benchmarks/workloads Experimental Setup/Simulation environment [120]* Resources scaling to avoid MapReduce Theoretical HiBench (TeraSort Emulator for IaaS and over and under provisioning analysis and WordCount) cloud-based MapReduce [121]* Dynamic resources allocation in Hadoop 1.x Policy applied WordCount Aneka platform; 4 IBM X3200 M3 servers hybrid clouds to meet soft deadlines In the master (4 CPU cores, 3 GB RAM) and a master (2 cores, 2 GB RAM) EC2 m1.small instances.
[122]* Hyper-Heuristic Scheduling Algorithm Hadoop 1.1.3 simulated annealing, genetic fMRI, protein annotation, CloudSim simulator (4 VMs in a data center), (HHSA) for cloud computing systems algorithm, swarm, ant colony PSPLIB, Grep, TeraSort Single node (Intel i7, 64GB RAM) multi-node cluster with 4 VMs (8 cores, 16GB RAM) [123]* MapReduce jobs resource Hadoop 1.2.1 Matchmaking and scheduling Synthetic workloads, 11 nodes cluster in EC2 (Intel Xeon allocation and scheduling algorithm (MRCP-RM) WordCount CPU, 3.75 GB RAM), simulations [124]* Cloud RESource Provisioning (CRESP) Hadoop 1.0.3 Profiling and WordCount, Sort, 16 nodes cluster (4 Quad-core CPU, 16GB to minimize cost or completion time machine learning TableJoin, PageRank RAM, 2 500GB disks), EC2 small instances (1 CPU, 1.7GB RAM, 160GB disk) [127]* Bazaar: job-centric interface for Hadoop 0.20.2 Profiling+ greedy heuristic Sort, WordCount, GridMix, TF- Prediction: 35 Emulab nodes (from data analytics applications in clouds for resources selection IDF, LinkGraph Cloudera), evaluation on 26 Emulab nodes [128]* Public cloud resources allocation to Hadoop 1.x Naive and linear Sort, Word Count, 33 nodes (4 cores, 4 GB RAM, meet completion time requirements profiling GridMix 1TB disk for data), 1 Gbps Ethernet [132]* Falloc: VM-level bandwidth Hadoop 1.0.0 Nash bargaining Random VM-VM traffic, 16 nodes cluster with 64 VMs, 1 Gbps guarantees in IaaS environments game, rate control WordCount, Sort, join Trace-driven Mininet simulations [135]* Cost model for NoSQL Elasticsearch Profiling + look- YCSB 6 nodes with mixture of VMs with in public and private clouds 0.20.6 ahead optimization (2-4 CPU, 2.4-3.2 GHz, 3-8GB RAM) [136]* Dynamic resource scheduling for Data Storm DRS Video logo detection, 6 nodes (Intel quad core CPU, Stream Management Systems (DSMS) scheduler frequent pattern detection 8 GB RAM), LAN network [137]* Fairness scheduler for Spark jobs Apache Linear program, Sort workloads for 12 EC2 instances in 6 geo-distributed In geo-distributed data centers Spark heuristic multiple jobs data centers (2 CPU, 7.5GB RAM, 32GB SSD [138]* Long Term Fair Resource Hadoop 2.2.0 LTYARN scheduler Facebook traces, Purdue suite, 10 nodes cluster (2 Intel X5675 Allocation in cloud computing In YARN Hive workloads CPUs„ 24GB RAM) to emulate EC2 TPC-H, Spark ML t2.medium (2 CPU cores, 4 GB RAM) [139]* Introduce multi-tenancy Hadoop 2.6.0 Modification to WordCount, Grep, 8 nodes cluster(2.4 GHz CPU, 3GB features in Hadoop YARN and HDFS PiEstimator RAM, 200 GB HDD), 1 GB Ethernet [140]° Elastisizer: automating virtualized Hadoop 1.x Profiling, white Link Graph, Join, TF-IDF, Various EC2 instances cluster sizing for cloud MapReduce and black-box methods TeraSort, WordCount [141]° Cura: cost-effective cloud Hadoop 1.x Profiling, VM-aware Facebook-like traces 20 nodes KVM-based cluster (16 provisioning for MapReduce scheduling, online generated by SWIM cores CPU, 16GB RAM) in two racks, VM reconfiguring 1 Gbps Ethernet, Java based simulations [142]° HybridMR: MapReduce scheduler in Hadoop 0.22.0 2-phase hierarchical Sort, WC, PiEst, DistGrep, 24 nodes (dual core 2.4 GHz CPU, 4GB hybrid native/virtualized clusters scheduler, profiling twitter, K-means, TestDFSIO RAM), 48 VMs (1 CPU core, 1 GB RAM) [146]° Energy-aware MapReduce in clouds Hadoop 1.x BinCardinality, BinDu- Sort, WordCount, 6 nodes (dual core CPU, 2GB RAM, by spatio-temporal placement tradeoffs ration, RandomFF, recipe PiEstimator 250GB disk), 3 VM types, 1 Gbps Ethernet. 
[147]° Energy-efficient VM placement in IaaS Hadoop 1.1.0 Tight Recipe Packing (TRP), Sort, TeraSort, 10 nodes Xen-based cluster (4 cores cloud data centers running MapReduce Virtual Cluster Scaling (VCS) WordCount CPU, 6GB RAM, 600GB storage) Gigabit Ethernet, Java-based simulations [148]° Purlieus: optimizing data and VM Hadoop 1.x Heuristics based Grep, Permutation 2 racks, 20 nodes, 20 VMs/job (4 2 GHz placements for MapReduce in clouds on k-club generator, Sort CPUs, 4GB RAM), KVM as hypervisor, 1 Gbps network, simulations (PurSim) [149]° Locality-aware Dynamic Hadoop 0.20.2 Hot-plugging, synchronous Hive workloads (Grep, 100 nodes EC2 cluster (8 CPU, 7 GB RAM), Resource Reconfiguration (DRR) and queue-based scheduler select aggregation, join) 6 nodes cluster (6 CPU, 16GB RAM), With 5 VMs/node (2CPU, 2GB) [150]° Interference and locality-aware Hadoop Meta task scheduler TeraSort, Grep RWrite, 12 physical machines in 2 racks forming MapReduce tasks scheduling in 0.20.205 WCount, TeraGen 72-nodes Xen-based cluster (12 CPU cores, virtual clusters 32GB, 500GB disk), 1 Gbps Ethernet [151]° CAM: topology-aware Vanilla Scheduler based on Hive workloads 4 machines (2 quad CPU, 48GB RAM), resources manager for Hadoop min-cost flow problem KVM hypervisor, 1 Gbps Ethernet, 23 VMs MapReduce in clouds (2 CPU, 1GB RAM), simulations (PurSim) [152]° Self-expressive management for business Hadoop and Mnemos: portfolio scheduler, Real-world business-critical Simulations critical workloads in clouds Microsoft HPC Nedu: virtualization scheduler traces from 1300 VMs [153]° TIRAMOLA: open-source Hadoop 1.0.1, Ganglia for monitoring, YCSB benchmark OpenStack cactus cluster with 20 clients framework to automate VM HBase 0.92.0 Markov Decision Process VMs (2 CPU cores, 4GB RAM), and resizing in NoSQL clusters 16 server VMS (4 CPU cores, 8GB RAM) [154]° Enhanced Xen scheduler Hadoop 0.20.2 modification to Xen Word Count, Grep, sort, Machine type 1 (2.8 GHz 2 core CPU, 2MB for MapReduce hypervisor, 2-level Hive benchmarks L2 cache, 3GB RAM) Machine type 2 (3 scheduling policy GHz 2 core CPU, 6MB L2 cache, 6GB RAM) [155]° Enhancing Hadoop Hadoop 1.x Optimizing Xen TestDFSIO (Intel i7 CPU, 16GB RAM, SSD 128GB disk) on Xen-based VMs I/O ring mechanism [157]° Automatic configurations for Hadoop 2.x Lightweight custom k- HiBench workloads (k- 5 nodes (2 Quad-core 2.8 GHz CPU, 12 MB L2 Hadoop on Docker containers nearest neighbour heuristic Means, Bayes, PageRank) cache, 8GB RAM, 320GB disk), 1 Gbps Ethernet [158]° Iterative MapReduce Twister Pub/sub broker network k-means clustering 2 Docker containers on Docker containers via NaradaBrokering [159]° Improving networking performance Hadoop 2.7.3 Link aggregation via TestDFSIO 2 ProLiant DL385 G6 servers (2 6-cores AMD, 32GB in Container-based Hadoop IEEE 802.3ad standard RAM), IEEE 802.3ad-enabled Gigabit Ethernet [160]° EHadoop: Network-aware scheduling YARN Online profiling, HiBench (sort, 30 nodes, 4 data and 20-25 compute (8 CPU in elastic MapReduce clusters FIFO and fair schedulers WordCount, PageRank) cores, 8GB RAM), 800 Mbps network [169]† Improving Hadoop in Hadoop 2.2.0 Statistical analysis of WordCount, 18 nodes cluster in 3 sub-clusters, 1 Gbps network, each geo-distributed data centers traffic log data TeraSort sub-cluster contains 3 racks and 6 data nodes [171]† Data-centric cloud architecture Hadoop 1.x parallel algorithm for job Aggregate, expand, Simulations for 12 nodes in EC2 for geo-distributed MapReduce scheduling and data routing transform, 
summary [172]† MapReduce in geo-distributed data Hive Chance-constrained Hive traces Simulations for 12 data centers with network bandwidth centers with predictable completion time optimization uniformly distributed between 100Mbps and 1Gbps [173]† Balancing processing location and type, Hadoop 1.x, MILP, Input data chunks (10 -330)GB NSFNET, Cost239, and Italian networks with 14 and energy consumption in big data others heuristic with output sizes (0.01-330)GB processing nodes and 2 cloud data centers networks [174]† Balancing processing speed and energy Hadoop 1.x MILP Input data chunks (10 -220)GB NSFNET network with 14 processing nodes consumption in big data networks with output sizes (0.2-220)GB and 2 cloud data centers [177]† G-MR: MapReduce framework for Hadoop 1.x Data Transformation CensusData, WordCount, 10 large instances from 4 EC2 data centers Geographically-distributed data centers. Graph (DTG) algorithm KNN, Yahoo, Google traces (4 CPU cores, 7.5GB RAM), 2 VICCI clusters [178]† Hadoop system-level improvements Hadoop 1.x Sub-cluster scheduling and WordCount, Grep 20 nodes in 2 sub-clusters, cross access latency in geo-distributed data centers reduce output placement of 200 ms, and subcluster latency of 1 ms [180]† Performance model for iterative Hadoop 2.6.0 Profiling, rack-aware Iterative GREP, k-means, 8 nodes, 4 (4 CPU cores, 500GB HDD, 4GB RAM), MapReduce with cloud bursting scheduler HiBench, PageRank 4 (2_8 CPU cores, 1 TB HDD, 64GB RAM), KVM [181]† BStream: Cloud bursting Hadoop 0.23.0, Profiling, modification WordCount, MultiWord- Cluster: 21 nodes (2 2GHz CPU, 4GB framework for MapReduce Storm 0.8.1 to YARN Count, InvertedIndex, , RAM, 500GB disk), Cloud: 12 nodes sort (Quad 3.06GHz CPU, 8GB RAM) [187]† G-Cut: geo-distributed PowerGraph modification to PowerGraph PageRank, Shortest Single EC2 instances, graph partitioning partitioning algorithm Source Path (SSSP), Subgraph Simulations on Grid5000 Isomorphism (SI) on 5 real- world graphs [192]‡ OpenFlow (OF)-enabled Matsu, modified JT, MalStone Benchmark, Open Cloud Consortium Project infra- Hadoop over Ethernet Hadoop 1.x Open vSwitch (OVS) sort structure with 10 nodes, 3 data centers, OF-enabled switches, 10Gbps network [193]‡ Palantir: SDN service to obtain Hadoop 1.x Service module Facebook traces Testbed with 32 nodes in 4 data centers, network proximity for Hadoop in Floodlight each with 2 racks, Pronto 3290 OpenFlow core switch and OVS in Top-of-Rack (ToR) switches *Cloud resources management, °VM and containers assignments, †Bulk transfers and inter DCN, ‡SDN and NFV-based.

TABLE IV SUMMARY OF CLOUD NETWORKING-FOCUSED OPTIMIZATION STUDIES FOR GENERIC CLOUD APPLICATIONS Ref Objective Tools Benchmarks/workloads Experimental Setup/Simulation environment [125]* Energy efficiency in cloud data energy consumption MapReduce synthesized Simulations centers with different granularities evaluation workloads, video streaming [126]* Cloud recommendation system based Analytic Hierarchy Process Information about cloud Single machine as master, server from NeCTAR on multicriteria QoS optimization (AHP) decision making providers (e.g. prices, locations) cloud, small EC2 instance, and C3.8xlarge EC2 instance [129]* QoS and heterogeneity-aware Application Lightweight controller Single and multi-threaded std Homogeneous 40 nodes clusters with 10 configurations, to Datacenter Server Mapping (ASDM) in cluster schedulers benchmarks, Microsoft workloads heterogeneous 40 nodes cluster with 10 machine types [130]* Highly available cloud MILP model for availability- Randomly generated 50 nodes cluster with total applications and services aware VM placements MTTF and MTTR (32 CPU cores, 30GB RAM) [131]* CloudMirror: Applications-based network TAG modeling Empirical and synthesized Simulations for tree-based cluster with 2048 servers abstraction and workloads placements with placement algorithm workloads high availability [133]* ElasticSwitch: Work-conserving minimum Guarantee Partitioning (GP), Shuffling traffic 100 nodes testbed (4 3GHz CPU, 8GB RAM) bandwidth guarantees for cloud computing Rate Allocation (RA), OVS [134]* EyeQ: Network performance Sender and receiver Shuffling traffic, 16 nodes cluster (Quad core CPU), 10 Gbps NIC Isolation at the servers EyeQ modules Memcached traffic packet-level simulations [143]° Multi-resource scheduling in Constrained programming, Google cluster Trace-driven simulations for cloud data centers with VMs first-fit, best-fit heuristics Traces (1024 nodes, 3-tier tree topology) [144]° Real-time scheduling for Rolling-horizon Google cloud Simulations via tasks in virtualized clouds optimization Tracelogs CloudSim toolkit [145]° Cost and energy aware scheduling VM selection/reuse, tasks Montage, LIGO, Simulations via For deadline-constrained tasks in clouds with merging/slaking heuristics SIPHT, CyberShake CloudSim toolkit [156]° FlexTuner: Flexible container- Modified Mininet, MPICH2 version 1.3, Simulation for one VM based tuning system iperf tools NAS Parallel Benchmarks (2GB RAM, 1 CPU core) [161]† NetStitcher: system for inter Multipath and multi-hop, Video and content Equinix topology, 49 CDN (Quad Xeon data center bulk transfers store-and-forward algorithms sharing traces CPU, 4GB RAM, 3TB disk), 1 Gbps links [162]† Costs reduction for Convex optimizations, Uniform distribution Simulations for inter inter-data center traffic time-slotted model for file sizes DCN with 20 data centers [163]† Profit-driven traffic scheduling for inter Lyapunov Uniform distribution Simulations for 7 nodes in EC2 DCN with multi cloud applications optimizations for arrival rate with 20 different cloud applications [164]† CloudMPcast: Bulk transfers in multi ILP and Heuristic based Data backup, Trace-driven simulations for 14 data centers with CSP pricing models on Steiner Tree Problem video distribution data centers from EC2 and Azure [165]† Reduce completion time of big data LockStep Broadcast - Numerical evaluations broadcasting in heterogeneous clouds Tree (LSBT) algorithm [166]† Optimized regular data backup in ILP and Transfer of Simulations for US backbone
Geographically-distributed data centers heuristics 1.35 PBytes network topology with 6 data centers [167]† Optimize migration and backups in Greedy anycast Poisson arrival demands, Simulations for EON-based geo-distributed DCNs algorithms -ve exponential service time NSFNET network [168]† Energy saving in end- Power modelling by 20 and 100GB data sets Servers (Intel Xeon, 6GB RAM, 500 GB disk), (AMD FX, to-end data transfers linear regression with different file sizes 10GB RAM, 2TB disk), Yokogawa WT210 power meter [170]† Reducing costs of migrating geo- An offline and 2 Meteorological 22 nodes to emulate 8 data centers, 8 gateways, and dispersed big data to the cloud online algorithms data traces 6 user-side gateways, additional node for routing control [175]† Reduction of electricity and bandwidth Distributed 22k jobs from 4 data centers with varying electricity costs in inter DCNs with large jobs algorithms Google cluster traces and bandwidth costs throughout the day [176]† Power cost reduction in distributed data Two time scales Facebook Simulations for 7 geo-distributed data centers centers for delay-tolerant workloads scheduling algorithm traces [179]† Tasks scheduling and WAN bandwidth Community Detection 2000 files with Simulations in China-VO allocation for big data analytics based Scheduling (CDS) different sizes network with 5 data centers [182]† SAGE: service-oriented architecture Azure Queues 5 streaming services Azure public cloud for data streaming in public cloud with synthetic benchmark (North and West US and North and West Europe) [183]† JetStream: execution engine for OLAP cubes, 140 Million HTTP requests (51GB) VICCI testbed to emulate Coral CDN data streaming in WAN adaptive data filtering [184]† Networking cost reduction for geo- MILP model, Multiple Streams with four Simulations for distributed big data stream processing VM Placement algorithm different semantics NSFNET network [185]† Reduction of streaming workflows MILP and 500 streaming workflows Simulations for costs in geo-distributed data centers 2 heuristics each with 100 tasks 20 data centers [186]† Iridium: low-latency geo- Linear program, Bing, Conviva, Facebook, TPC- EC2 instances in 8 worldwide regions, distributed data analytics online heuristic DS queries, AMPLab Big-Data trace-driven simulations [188]‡ Application-aware Aggregation and TE OpenFlow, Voice, video, 7 GE Quanta packet switched, 3 Ciena CoreDirector in converged packet-circuit networks POX controller and web traffic hybrid switch, 6 PCs with random traffic generators [189]‡ Application-Centric IP/optical SDN Bulk transfers, dynamic Software for controller and use Network Orchestration (ACINO) orchestrator 5G services, security, CDN cases for ACINO infrastructure [190]‡ Zero Touch Provisioning (ZTP) Network automation Bioinformatics, GEANT network and use for managing multiple clouds with SDN and NFV UHD video editing cases for ZTP/ZTPOM [191]‡ Routing, Modulation level and Spec- MILP, Bulk data Emulation of NSFNET trum Allocation for bulk transfers NOX controller transfers network with OF-enabled WSSs [194]‡ VersaStack: full-stack model-driven Resources modelling and Reading/writing from/to AWS and Google VMs, OpenStack-based VMs orchestrator for VMs in hybrid clouds orchestration workflow Ceph parallel storage with SR-IOV interfaces, 100G SDN network [195]‡ Open Transport Switch OTS prototype Bulk data ESnet Long Island Metropolitan (OTS) for Cloud bursting based on OpenFlow transfers Area Network (LIMAN) testbed [196]‡ Malleable Reservation
for efficient MILP, dynamic Bulk data Simulations on NSFNET bulk data transfer in EONs programming transfers [197]‡ SDN-based bulk data Dynamic algorithm, Bulk data 10 data centers (IBM BladeCenter HS23 cluster), 10 OF- transfers orchestration OpenFlow, Beacon transfers enabled HP3500 switches, 10 server-based gateways [198]‡ Owan: SDN-based traffic management Simulated Bulk data Prototype, testbed, simulations system for bulk transfers over WAN annealing transfers for ISP and inter data center WAN [199]‡ SDN-managed Bulk transfers ABNO, ILP Bulk data OMNeT++ simulations for Telefonica (TEL), British in Flexgrid inter DCNs heuristics transfers Telecom (BT), and Deutsche Telekom (DT) networks [200]‡ Genome-Centric cloud-based NetServ, algorithms Genome 3-server clusters with total (160 CPU cores, 498GB networking and processing off-path signaling data RAM, and 37TB storage), emulating GEANT Network [201]‡ VNF placement for MILP, Random DAGs 10 nodes network with random big data processing heuristics for chained NFs communication cost values [202]‡ NFV-based CDN network for OpenStack-based HD and full Leaf cache on SYNERGY testbed (1 DC with 3 servers big data video distribution CDN manager HD videos each with Intel i7 CPU, 16GB RAM, 1TB HDD) [203]‡ NUBOMEDIA: real-time multimedia PaaS-based WebRTC 3 media server on communication and processing APIs data KVM-based instances *Cloud resources management, °VM and containers assignments, †Bulk transfers and inter DCN, ‡SDN and NFV-based.

VI. DATA CENTERS TOPOLOGIES, ROUTING PROTOCOLS, TRAFFIC CHARACTERISTICS, AND ENERGY EFFICIENCY:

Data centers can be classically defined as large warehouses that host thousands of servers, switches, and storage devices to provide various data processing and retrieval services [528]. Intra Data Center Networking (DCN), defined by the topology (i.e. the connections between the servers and switches), link capacities, switching technologies, and routing protocols, is an important design aspect that impacts the performance, power consumption, scalability, resilience, and cost. Data centers have been successfully hosting legacy web applications but are challenged by the need to host an increasing number of big data and cloud-based applications with elastic requirements and multi-tenant, heterogeneous workloads. Such requirements are continuously challenging data center architectures to improve their scalability, agility, and energy efficiency while providing high performance and low latency.
The rest of this Section is organized as follows: Subsection VI-A reviews electronic switching-based data centers, while Subsection VI-B reviews proposed and demonstrated hybrid electronic/optical and optical switching-based data centers. Subsection VI-C briefly describes HPC clusters and disaggregated data centers. Subsection VI-D presents traffic characteristics in cloud data centers, while Subsection VI-E reviews intra DCN routing protocols and scheduling mechanisms. Finally, Subsection VI-F addresses the energy efficiency in data centers.
A. Electronic Switching Data Centers: Extensive surveys on the categorization and characterization of different data center topologies and infrastructures are presented in [529]-[535]. In what follows, we briefly review some of the state-of-the-art electronic switching DCN topologies while emphasizing their suitability for big data and cloud applications. Servers in data centers are typically organized in “racks” where each rack typically accommodates between 16 and 32 servers. A Top-of-Rack (ToR) switch (also known as an access or edge switch) is used to provide direct connections between the rack's servers and indirect connections with other racks via higher-layer switches according to the DCN topology. Most legacy DCNs have a multi-rooted tree structure where the ToR layer is connected either to an upper core layer (two-tier) or to upper aggregation and core layers (three-tier) [528]. For various improvement purposes, alternative designs based on Clos networks, flattened connections with high-radix switches, unstructured connections, and wireless transceivers were also considered. These architectures can be classified as switch-centric as the servers are only connected to ToR switches and the routing functionalities are exclusive to the switches. Another class of DCNs, known as server-centric, utilizes servers (or sets of servers) with multiport NICs and software-based routing to aid traffic forwarding. A brief description of some electronic switching DCNs is provided below and some examples of small topologies are illustrated in Figure 12 showing the architecture in each case:
• Three-tier data centers [528]: Three-tier designs have access, aggregation, and core layers (Figure 12(a)). Different subsets of ToR/access switches are connected to aggregation switches which connect to core switches with higher capacity to ensure all-to-all rack connectivity. This increases the over-subscription ratio as the bisection bandwidth between different layers varies due to link sharing.
Supported by firewall, load balancing, and security features in their expensive switches, three-tier data centers were well-suited for legacy Internet-based services with dominant north-south traffic and were widely adopted in production data centers.
• k-ary Fat-tree [536]: Fat-tree was proposed to provide 1:1 oversubscription and multiple equal-cost paths between servers in a cost-effective manner by utilizing commodity switches with the same number of ports (k) at all layers (a sizing sketch is given after this list). Fat-tree organizes sets of equal numbers of edge and aggregation switches in pods and connects each pod as a complete bipartite graph. Each edge switch is connected to a fixed number of servers and each pod is connected to all core switches, forming a folded-Clos network (Figure 12(b)). The Fat-tree architecture is widely considered in industry and research [529], indicating its efficiency with various workloads. However, its wiring complexities increase massively with scaling.
• VL2 [537]: VL2 is a three-tier Clos-based topology with 1:1 oversubscription proposed to provide performance isolation, load balancing, resilience, and agility in workload placement by using a flat layer-2 addressing scheme with address resolution. VL2 suits virtualization and multi-tenancy; however, its wiring complexities are high.
• Flattened ButterFLY (FBFLY) [538]: FBFLY is a cost-efficient topology that flattens k-ary n-fly butterfly networks into k-ary n-flat networks by merging the n switches in each row into a single high-radix switch. FBFLYs improve the path diversity of butterfly networks and achieve folded-Clos network performance under load-balanced traffic at half the cost. However, with random adversarial traffic patterns, both load balancing and routing become challenging; thus, FBFLY is not widely considered for big data applications.
• HyperX [539]: HyperX is a direct network of switches proposed as an extension to hypercube and flattened butterfly networks. Further design flexibility is provided as several regular or general configurations are possible. For load-balanced traffic, HyperX achieved the performance of folded Clos with fewer switches. Like FBFLY, HyperX did not explicitly target improving big data applications.
• Spine-leaf (e.g. [540], [541]): Spine-leaf DCNs are folded Clos-based architectures that gained widespread adoption by industry as they utilize commercially-available high-capacity and high-radix switches. Spine-leaf allows flexibility in the number of spine switches, leaf switches, and servers per leaf, and in the link capacities at all layers (e.g. in Figure 12(c)). Hence, controllable oversubscription according to cost-performance trade-offs can be attained. Their commercial usage indicates acceptable performance with big data and cloud applications. However, wiring complexities are still high.
• BCube [542] and MDCube [543]: BCube is a generalized hypercube-based architecture that targets modular data centers with scales that fit in shipping containers. The scaling in BCube is recursive, where the first building block “BCube0” is composed of n servers and an n-port commodity switch, and the kth level (i.e. BCubek) is composed of n BCubek-1 blocks and n^k n-port switches. Figure 12(d) shows a BCube1 with n=4. To support multipath routing and provide low latency, high bisection bandwidth, and fault tolerance, BCube utilizes switches and servers equipped with multiple ports so that servers connect to switches at different levels.
BCube is hence suitable for several traffic patterns, such as one-to-one, one-to-many, one-to-all, and all-to-all, which arise in big data workloads. However, at large scales, bottlenecks between lower and higher levels increase and the address space has to be overwritten. For larger scales, MDCube in [543] was proposed to interconnect BCube containers by high speed links in 1D or 2D connections.
• CamCube [544]: CamCube is a server-centric architecture that directly connects servers in a 3D torus topology where each server is connected to its 6 neighbouring servers. In CamCube, switches are not needed for intra-DCN routing, which reduces costs and energy consumption. A greedy key-based routing algorithm, symbiotic routing, is utilized at the servers, which enables application-specific routing and arbitrary in-network functions, such as caching and aggregation at each hop, that can improve big data analytics [214]. However, CamCube might not suit delay-sensitive workloads at a high number of servers due to routing complexity, longer paths, and high store-and-forward delays.

• DCell [545]: DCellk is a recursively-scaled data center that utilizes a commodity switch per DCell0 pod to connect its servers, and the remaining ports of the (k+1) server ports for direct connections with servers in other pods of the same level and in higher-level pods. Figure 12(e) shows a DCell1 with 4 servers per pod. DCell provides high bandwidth, scalability, and fault-tolerance at low cost. In addition, under all-to-all, many-to-one, and one-to-many traffic patterns, DCell achieves balanced routing, which ensures high performance for big data applications. However, as it scales, longer paths between servers in different levels are required.
• FiConn [546]: FiConn is a server-centric DCN that utilizes switches and dual-port servers to recursively scale while maintaining a low diameter and high bandwidth at reduced cost and wiring complexity compared to BCube and DCell. In FiConn0, one port in each server is connected to the switch, and at each level, half of the remaining ports in the pods are reserved for connections with servers in the next level. For example, Figure 12(f) shows a FiConn2 with 4 servers per FiConn0. Real-time requirements are supported by employing a small diameter and hop-by-hop traffic-aware routing according to the network condition. This also improves the handling of the bursty traffic of big data applications.
• Unstructured data centers with random connections: With the aim of reducing the average path lengths and easing incremental expansions, unstructured DCNs based on random graphs such as Jellyfish [547], Scafida [548], and Small-World Data Center (SWDC) [549] were proposed. Jellyfish [547] creates random connections between homogeneous or heterogeneous ToR switches and connects hosts to the remaining ports to support incremental scaling while achieving higher throughput due to low average path lengths. Scafida [548] is an asymmetric scale-free data center that incrementally scales under limits on the longest path length. In Scafida, two disjoint paths are assigned per switch pair to ensure high resilience. SWDC [549] includes its servers in the routing and connects them using small-world-inspired connection distributions. Unstructured DCNs, however, have routing and wiring complexities, and their performance with big data workloads is not widely addressed.
• Data centers with wireless 60 GHz radio transceivers: To improve the performance of tree-based DCNs without additional wiring complexity, the use of wireless transceivers at servers or ToR switches was also proposed [550], [551].
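To make the scaling properties of the topologies above concrete, the short calculation below applies the standard sizing formulas for fat-tree [536] and BCube [542]; the chosen k and n values are examples only.

```python
# Quick sizing check for some of the topologies above (standard formulas for
# fat-tree [536] and BCube [542]; k and n below are example values only).

def fat_tree(k):
    """k-port switches: k pods, (k/2)^2 core switches, k^3/4 servers."""
    servers = k ** 3 // 4
    edge = agg = k * (k // 2)
    core = (k // 2) ** 2
    return {"servers": servers, "switches": edge + agg + core}

def bcube(n, k):
    """BCube_k from n-port switches: n^(k+1) servers, (k+1)*n^k switches."""
    return {"servers": n ** (k + 1), "switches": (k + 1) * n ** k}

print(fat_tree(48))   # {'servers': 27648, 'switches': 2880}
print(bcube(8, 3))    # {'servers': 4096, 'switches': 2048}
```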

Fig. 12. Examples of electronic switching DCNs (a) Three-tier, (b) Fat-tree, (c) Spine-leaf, (d) BCube, (e) DCell, and (f) FiConn.
B. Hybrid Electronic/Optical and All Optical Switching Data Centers: Optical switching technologies have been proposed for full or partial use in DCNs as solutions to overcome the bandwidth limitations of electronic switching, reduce costs, and improve performance and energy efficiency [552]-[557]. Such technologies eliminate the need for O/E/O conversion at intermediate hops and make the interconnections data-rate agnostic. Hybrid architectures add Optical Circuit Switching (OCS), typically realized with Micro-Electro-Mechanical System Switches (MEMSs) or free-space links, to enhance the capacity of an existing Electronic Packet Switching (EPS) network. To benefit from both technologies, bursty traffic (i.e. mice flows) is handled by the EPS while bulky traffic (i.e. elephant flows) is offloaded to the OCS (a toy classification sketch is given after the list below). MEMS-based OCS requires a reconfiguration time on the scale of ms or µs before setting up paths between pairs of ToR switches, and because packet headers are not processed, external control is needed for the reconfigurations. Another shortcoming of MEMSs is their limited port count. WDM technology can increase the capacity of ports without a huge increase in power consumption [270], resolve wavelength contention, and reduce wiring complexities at the cost of additional devices for multiplexing and de-multiplexing, and fast-tuning lasers and tuneable transceivers at ToRs or servers. In addition, most of the advances in optical networking discussed in Subsection IV-B6 have also been considered for DCNs, such as OFDM [558], PON technologies [559]-[573], and EONs [574]-[576]. In hybrid and all optical DCNs, both active and passive components were considered. The passive components, including fibers, waveguides, splitters, couplers, Arrayed Waveguide Gratings (AWGs), and Arrayed Waveguide Grating Routers (AWGRs), do not consume power but suffer insertion, crosstalk, and attenuation losses. Active components include Wavelength Selective Switches (WSSs), which can be configured to route different sets out of a total of M wavelengths at an input port to N different output ports (i.e. a 1×N switch), MEMSs, Semiconductor Optical Amplifiers (SOAs), which can provide switching times in the range of ns, Tuneable Wavelength Converters (TWCs), and Mach-Zehnder Interferometers (MZIs), which are external modulators based on controllable phase shifts in split optical signals. In addition to OCS, Optical Packet Switching (OPS) [577]-[580] was also considered, with or without intermediate electronic buffering. Examples of hybrid electrical/optical and all optical switching DCNs are summarized below, and some are illustrated in Figure 13:
• c-Through [581]: In c-Through, electronic ToR switches are connected to a two-tier EPS network and a MEMS-based OCS, as depicted in Figure 13(a). The EPS maintains persistent but low bandwidth connections between all ToRs and handles mice flows, while the OCS must be configured to provide high bandwidth links between pairs of ToRs at a time to handle elephant flows. As the MEMSs used have ms switching times, c-Through was only proven to improve the performance of workloads with slowly varying traffic.
• Helios [582]: In Helios, electronic ToR switches are connected to a single tier containing an arbitrary number of EPSs and MEMS-based OCSs, as in Figure 13(b).
Helios performs WDM multiplexing on the OCS links and hence requires WDM transceivers in the ToRs. Due to its complex control, Helios was only demonstrated to improve the performance of applications with second-scale traffic stability.
• Mordia [583]: Mordia is a 24-port OCS prototype based on a ring connection between ToRs, each with a 2D MEMS-based WSS, that provides an 11.5 µs reconfiguration time at 65% of the efficiency of electronic switching. Mordia can support unicast, multicast, and broadcast circuits, and enables offloading of both long and short flows, which makes it suitable for big data workloads. However, it has limited scalability as each source-destination pair needs a dedicated wavelength.
• Optical Switching Architecture (OSA) [584] / Proteus [585]: OSA and Proteus utilize a single MEMS-based optical switching matrix to dynamically change the physical topology of electronic ToR connections. Each ToR is connected to the MEMS via an optical module that contains multiplexers/demultiplexers for WDM, a WSS, circulators, and couplers, as depicted in Figure 13(c). This flexible design allows multiple connections per ToR to handle elephant flows and eliminates blocking for mice flows by enabling multi-hop connections via relaying ToRs. OSA was examined with bulk transfers and mice flows, and minimal overheads were reported while achieving 60%-100% of the non-blocking bisection bandwidth.
• Data center Optical Switch (DOS) [586]: DOS utilizes an (N+1)-port AWGR to connect N electronic ToR switches through OPS with the aid of optical label extractors, as shown in Figure 13(d). Each ToR is connected via a TWC to the AWGR, which enables it to connect to one other ToR at a time, while each ToR can receive from multiple ToRs simultaneously. The remaining ports of the AWGR are connected to an electronic buffer to resolve contention between transmitting ToRs. DOS suits applications with bursty traffic patterns; however, its disadvantages include the limited scalability of AWGRs and the power-hungry buffering.
• Petabit [587]: Petabit is a bufferless, high-radix OPS architecture that utilizes three stages of AWGRs (input, central, and output) in addition to TWCs, as depicted in Figure 13(e). At each stage, the wavelength can be tuned to a different one according to the contention at the next stage. However, electronic buffering at the ToRs and effective scheduling are required to achieve high throughput. Petabit can scale without impacting latency and thus can guarantee high performance for applications even at large scales.
• Free-Space Optics (FSO)-based data centers: Using FSO-based interconnects with ceiling mirrors to link ToRs in DCNs was proposed by several studies such as FireFly [588] and Patch Panels [589].
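As referenced above, a toy flow classifier illustrates the elephant/mice split used in hybrid EPS/OCS fabrics; the byte threshold and flow records are illustrative assumptions rather than values from the cited systems.

```python
# Toy flow classifier for a hybrid EPS/OCS fabric: elephant flows above a byte
# threshold are steered to optical circuits, everything else stays on the
# packet network (threshold and flow records are illustrative assumptions).

ELEPHANT_BYTES = 100 * 1024 * 1024   # 100 MB, an assumed cutoff

def split_flows(flows):
    """flows: list of dicts with 'src_rack', 'dst_rack', 'bytes'.
    Returns (ocs_requests, eps_flows): circuit requests are aggregated per
    rack pair because an OCS port serves one rack-to-rack circuit at a time."""
    ocs, eps = {}, []
    for f in flows:
        if f["bytes"] >= ELEPHANT_BYTES:
            pair = (f["src_rack"], f["dst_rack"])
            ocs[pair] = ocs.get(pair, 0) + f["bytes"]
        else:
            eps.append(f)
    return ocs, eps

demo = [
    {"src_rack": "r1", "dst_rack": "r7", "bytes": 2_000_000_000},  # shuffle
    {"src_rack": "r1", "dst_rack": "r7", "bytes": 500_000_000},
    {"src_rack": "r2", "dst_rack": "r3", "bytes": 40_000},         # RPC
]
print(split_flows(demo))
```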

Fig. 13. Examples of hybrid/all optical switching DCNs (a) c-Through, (b) Helios, (c) OSA/Proteus, (d) DOS, and (e) Petabit.

C. HPC Clusters and Disaggregated Data Centers: The DCNs summarized in the previous two Subsections target production and enterprise data centers with general-purpose usage. Alternatively, HPC clusters target a specific set of compute-intensive applications and are thus designed with specialized servers and accelerators such as Graphical Processing Units (GPUs), in addition to dedicated high-speed interconnects to parallel storage systems such as InfiniBand and Myrinet, torus or mesh topologies, and photonic interconnects for chip-to-chip, board-to-board, blade-to-blade, and rack-to-rack links [590]-[592]. According to the International Data Corporation (IDC), more than 67% of HPC facilities performed big data analytics [78]. Another variation of conventional data centers is disaggregating the CPU, memory, IO, and network resources at the rack, pod, or entire data center level to achieve utilization and power efficiency advantages over legacy single-box servers [593]-[599]. Combinations of big data applications with uncorrelated resource demands can be deployed in disaggregated data centers with higher utilization, energy efficiency, and performance [593]. D. Characteristics of Traffic inside Data Centers: Traffic characteristics within enterprise, cloud, social networking, and university campus data centers have been reported and analysed in [600]-[604] to provide several insights about traffic patterns, volume variations, and congestion, in addition to various flow statistics such as their duration, arrivals, and inter-arrival times. Intra data center traffic is mainly composed of different mixes of data center application traffic, including retrieval services and big data analytics, and of provisioning operations such as data transfers, replication, and backups. The first three pioneering studies by Microsoft Research [600]-[602] pointed out that the traffic monitoring tools used by ISPs in WANs do not suit data center environments as their traffic characteristics are not statistically similar. The authors in [600] utilized low-overhead socket-level instrumentation at 1500 servers to collect application-aware traffic information. This cluster contained diverse workloads including MapReduce-style workloads that generated 10GB per server per day. Two traffic patterns were defined: Work-Seeks-Bandwidth, for engineered applications with high locality, and Scatter-Gather, for applications that require servers to push or pull data from several others. It was found that 90% of the traffic stays inside the racks and that 80% of the flows last less than 10 s, while less than 0.1% last more than 200 s. The traffic patterns showed transient or stable spikes, and the flow inter-arrivals formed periodic short-term bursts spaced by an average of 15 ms and a maximum of 10 s. In [601], an empirical study of traffic patterns in 19 tree-based two-tier and three-tier data centers for web-based services was carried out based on coarse-grained measurements of link utilization and packet loss rates taken from Simple Network Management Protocol (SNMP) logs at routers and switches every five minutes. Average link loads were found to be highest at the core switches, while the highest packet losses were found at edge (i.e. ToR) switches. Additionally, fine-grained packet-level statistics at five edge switches in a smaller cluster showed a clear ON-OFF intensity and log-normal inter-arrivals during ON intervals.
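As a toy illustration of the reported ON-OFF behaviour, the sketch below generates synthetic packet arrivals with log-normal inter-arrival times inside exponentially distributed ON periods; all distribution parameters are assumptions and are not fitted to the measurements in [601].

```python
# Synthetic per-port arrival generator echoing the ON-OFF behaviour reported
# in [601]: log-normal packet inter-arrivals inside ON periods separated by
# idle OFF gaps (all distribution parameters are illustrative, not fitted).

import random

def on_off_arrivals(duration_s, on_mean_s=0.05, off_mean_s=0.2,
                    lognorm_mu=-9.0, lognorm_sigma=1.0, seed=1):
    """Return packet arrival timestamps (seconds) over [0, duration_s)."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while t < duration_s:
        on_end = t + rng.expovariate(1.0 / on_mean_s)       # ON period length
        while t < min(on_end, duration_s):
            t += rng.lognormvariate(lognorm_mu, lognorm_sigma)
            arrivals.append(t)
        t = on_end + rng.expovariate(1.0 / off_mean_s)       # OFF gap
    return [a for a in arrivals if a < duration_s]

pkts = on_off_arrivals(1.0)
print(len(pkts), "packets in 1 s of synthetic trace")
```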
Reverse engineering methods to obtain fine-grained characteristics from SNMP logs were suggested for data center traffic generators and simulators. SNMP logs were also utilized in [602] to study the traffic empirically while considering broader data center usages and topologies. These included 10 data centers with two-tier, three-tier, star-like, and Middle-of-Rack switch-based (i.e. connecting servers in several racks) topologies for university campuses and private enterprises, in addition to web-based services and big data analytics cloud data centers. The send/receive patterns of applications including authentication services, HTTP and secure HTTP-based services, and local-use applications were examined at the flow level to measure their effects on link utilization, congestion, and packet drop rate. It was found that 80% of the traffic stayed inside the racks in cloud data centers, while 40-90% left the racks in university and enterprise data centers. Fine-grained packet traces from selected switches in 4 DCNs indicated that 80% of the flows are small (i.e. ≤ 10 kB), that fewer than 10,000 flows per second per rack were active, and that inter-arrivals were less than 10 µs for 2-13% of the flows. The recent studies [603] and [604] presented the traffic characteristics inside Facebook's 4-tier Clos-based data center that hosts hundreds of thousands of 10 Gbps servers. In [603], data center-wide monitoring tools and per-host packet-level traces were utilized to characterize the traffic while focusing on its locality, stability, and predictability. The practice in this architecture is to assign each machine a single role (e.g. cache, web server, Hadoop), to localize each type of workload in certain racks, and to vary the level of oversubscription according to the workload needs. Traffic was found to be neither fully rack-local nor all-to-all, and without ON-OFF behaviour. Also, it was found that servers communicated with up to hundreds of other servers concurrently, most of the flows were long-lived, and non-Hadoop packets were <200 Bytes. To capture fine-grained network behaviours such as µbursts (i.e. high utilization events lasting <1 ms), high-resolution measurements at the rack level with a granularity of tens to hundreds of µs were utilized in [604] for the same data center and application sets as in the previous study. The measurements were based on a developed counter collection framework that polls packet counters and buffer utilization statistics every 25 µs and 50 µs, respectively. It was found that high utilization events were short-lived, as more than 70% of bursts lasted at most tens of µs, and that the load is very unbalanced, as web and Hadoop racks had hot downlink ports while cache racks had hot uplink ports. 90% of bursts lasted <200 µs for all application types and <50 µs for web racks, and the highest tail was recorded for Hadoop racks at 0.5 ms. It was noticed that the packets included in µbursts are larger than those outside them, that µbursts were caused by application behavioural changes, and that the arrival rate of µbursts was not Poisson, with 40% of inter-arrivals being >100 µs for cache and web racks. Regarding the impact of µbursts on shared buffers, Hadoop racks had ports with buffers at >50% utilization, while web and cache racks had a maximum of 71% and 64% of their port buffers at high utilization. Latency and packet loss measurements between VMs in different public clouds were performed in [605] through a developed tool, PTPmesh, which aids cloud users in monitoring network conditions.
Results for the one-way messaging delay between data centers in the same and in different clouds were shown to range from µs to ms values. Specific traffic measurements for big data applications were presented in [279], where three traffic patterns were reported: single peaks, repeated fixed-width peaks, and peaks with varying heights and widths. To forecast Hadoop traffic demands in cloud data centers, HadoopWatch, which utilizes real-time file system monitoring tools, was proposed in [606]. HadoopWatch monitors the meta-data and log files of Hadoop to extract accurate traffic information for importing, exporting, and shuffling data flows a few seconds before they enter the network. Based on coflow information, a greedy joint optimization of flow scheduling and routing improved job completion times by about 14%. Several recent studies tackled improving the profiling, estimation, and generation of data center traffic to aid in examining DCN TE, congestion control, and load balancing methods [607]-[610]. The authors in [607] applied sketch-based streaming algorithms to profile data center traffic while considering its skewness among different services. To model the spatial and temporal characteristics of traffic in large-scale data center systems, ECHO was developed in [610] as a scalable, topology- and workload-independent modeling scheme that utilizes hierarchical Markov Chains at the granularities of groups of racks, racks, and servers. CREATE, a fine-grained Traffic Matrix (TM) estimation scheme proposed in [608], utilizes the sparsity of traffic between ToR switches to remove underutilized links; the correlation coefficients between ToR switches in the reduced topology are then extracted from SNMP counters and service placement logs to estimate the TM. In [609], random number generators for Poisson shot-noise processes were utilized to design a realistic flow-level TM generator between hosts in tree-based DCNs, which considers flow arrival rates, the ratio of flows staying in the same rack, durations, and sizes, and which can be used with packet-level traffic generators for DCN simulations.

E. Intra Data Centers Routing Protocols and Traffic Scheduling Mechanisms: Routing protocols, which define the rules for choosing the paths for flows or flowlets between source and destination servers, were extensively surveyed in [611]-[616]. Routing in DCNs can be static or adaptive, where path assignments change dynamically according to criteria measured by a feedback mechanism. Adaptive routing or traffic scheduling can be centralized, where a single controller gathers network-wide information and distributes routing and rate decisions to switches and servers, or distributed, where the decisions are taken independently by the switches or servers based on a partial view of the network. Centralized mechanisms provide optimal decisions but have limited scalability, while distributed mechanisms are scalable but not always optimal. Tree-based data centers such as three-tier designs typically utilize VLANs with the Spanning Tree Protocol (STP), a simple Layer 2 protocol that eliminates loops by disabling redundant links and forcing the traffic to route through core switches. Spine-leaf DCNs typically use improved protocols such as Transparent Interconnection of Lots of Links (TRILL) or Shortest Path Bridging (SPB) that enable the utilization of all available links while ensuring loop-free routing.
CONGA [617] was proposed as a distributed flowlet routing mechanism for spine-leaf data centers that achieves load balancing by utilizing leaf-to-leaf congestion feedback. Improved tree-based DCNs such as Fat-tree and server-centric DCNs require routing protocols designed closely with their topological properties to fully exploit the topology. For example, Fat-tree requires specific routing with two-level forwarding tables for servers with fixed pre-defined addresses [536]. To select paths according to network bandwidth utilization in Fat-tree, DARD was proposed in [618] as a host-based distributed adaptive routing protocol. For agility, VL2 uses two addresses for servers: a Location-specific Address (LA) and an Application-specific Address (AA) [537]. For packet forwarding, VL2 employs Equal Cost Multi-Path (ECMP), a static Layer 3 routing protocol that distributes flows to paths by hashing, and Valiant Load Balancing (VLB), which randomly selects intermediate nodes between a source and a destination. BCube employs a source routing protocol (BSR) [542], DCell adopts a distributed routing protocol (DFR) [545], and JellyFish [547] uses a k-shortest paths algorithm. c-Through uses Edmonds' algorithm to obtain the MEMS configurations from the traffic matrix, and the ToR switch traffic is then sent via VLAN-based routing to the OCS or the EPS [581]. Helios has a complex control scheme with three modules: a Topology Manager (TM), a Circuit Switch Manager (CSM), and a Pod Switch Manager (PSM) [582]. Mordia utilizes a Traffic Matrix Scheduling (TMS) algorithm that obtains effective short-lived circuit schedules, based on predicted demands, which are applied to configure the MEMS and WSSs sequentially. OSA and Proteus solve a maximum-weight b-matching problem to enable the connection of multiple ToR switches, configure the WSSs to match capacities, and then use shortest path-based routing [584]. Using the Internet's Transmission Control Protocol (TCP) in data center environments has been shown to be inefficient [613] due to the different nature of data center traffic, the higher sensitivity to incast, and the key requirement of minimizing the Flow Completion Time (FCT) for data center applications. Thus, different transport protocols were proposed for DCNs. DCTCP [619] provides similar or better throughput than TCP and guarantees low Round Trip Times (RTT) by actively controlling the queue lengths in the switches. MPTCP [620] splits flows into sub-flows and balances the load across several paths via linked congestion control; however, it might perform excessive splitting, which requires extensive CPU and memory resources at end hosts. D2TCP [621] is a Deadline-aware Data center TCP protocol that considers single paths for flows and performs load balancing. For FCT reduction, D3, proposed in [622], uses flow deadline information to control the transmission rate. pFabric [623] and PDQ [624] enable the prioritization of the flows closest to completion, while DeTail [625] splits flows and performs adaptive load balancing based on queue occupancy to reduce the highest FCT. Alternatively, the centralized schedulers Orchestra [258], Varys [262], and Baraat [263], which are further elaborated in Subsection VII-C, target reducing the completion time of coflows, which are sets of flows with application-related semantics such as the intermediate data shuffling in MapReduce.
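To make the ECMP behaviour described above concrete, the following simplified sketch (an illustration of hash-based path selection in general, not the implementation of any particular switch or of the cited protocols) hashes a flow's 5-tuple so that all packets of the flow follow the same equal-cost path, which is also why colliding elephant flows can congest a single path:

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Static ECMP-style selection: hash the flow 5-tuple and use the
    result modulo the number of equal-cost paths. Every packet of the
    same flow maps to the same path, so large flows can collide."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return next_hops[digest % len(next_hops)]

# Hypothetical equal-cost uplinks from a leaf switch.
paths = ["agg-sw-1", "agg-sw-2", "agg-sw-3", "agg-sw-4"]
print(ecmp_next_hop("10.0.1.2", "10.0.3.4", 40123, 50010, "tcp", paths))
```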
Virtualization in data centers was surveyed in [626] with a focus on routing and resource management, and in [627] with a focus on techniques for network isolation in multi-tenant data centers. SDN has been widely considered for data centers as it can improve load balancing and congestion detection and mitigation [421], [431], [628]. To allow users to reserve bandwidth in data centers for their VM-to-VM communications, SecondNet was proposed in [629] as a centralized controller that determines the rate and path for each user's flow. Hedera [630] detects elephant flows and maximizes their throughput via a centralized SDN-based controller. ElasticTree [631] improves the energy efficiency of Fat-tree DCNs by dynamically switching off sets of links and switches while meeting demands and maintaining fault tolerance.

F. Energy Efficiency in Data Centers: The energy consumption in data centers is attributed to servers and storage, networking devices, cooling, power provisioning, and lighting facilities, with shares of 26%, 10%, 50%, 11%, and 3% of the total energy consumption, respectively [632]. As the energy consumption of servers is becoming more proportional to their load, and hence their energy efficiency is improving faster, the share attributed to networking is expected to increase [633]. In [632], techniques for modeling data center energy consumption were comprehensively surveyed and divided into hardware-centric and software-centric approaches. In [634], green metrics, including the Power Usage Effectiveness (PUE) (defined as the total facility power over the IT equipment power), and measurement tools that can characterize emissions were surveyed to aid in sustaining distributed data centers. Several studies considered reducing the energy consumption and costs in data centers at different levels [635]-[639]. At the hardware level, dynamically switching off idle components, proposing hardware with inherently more efficient components, DVFS, and utilizing optical networking elements were considered. For example, to improve the energy proportionality of Ethernet switches, the Energy Efficient Ethernet (EEE) standard [639] was developed. EEE enables three states for interfaces: active, idle with no transmission, and low-power idle (i.e. deep sleep). Although EEE has gained industrial adoption, its activation is often not advised due to uncertainty about its impact on application performance [283]. Placing workloads and VMs into fewer servers and scheduling tasks to shave peak power usage were also proposed to balance the power consumption and utilization in data centers.
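As an illustrative calculation only, taking the energy breakdown quoted above at face value and treating servers, storage, and networking (26% + 10%) as the IT load, the corresponding PUE would be roughly:

```latex
\mathrm{PUE} \;=\; \frac{P_{\text{total facility}}}{P_{\text{IT equipment}}}
\;\approx\; \frac{100\%}{26\% + 10\%} \;\approx\; 2.8
```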

VII. DATA CENTERS-FOCUSED OPTIMIZATION STUDIES

This Section summarizes a number of big data application optimization studies that consider the characteristics of the hosting data centers, including improving their designs and protocols or analyzing the impact of their computing and networking parameters on the applications’ performance. Subsection VII-A addresses the performance, scalability, flexibility, and energy consumption improvements and tradeoffs for big data applications under various data center topologies and design considerations. Subsection VII-B focuses on studies that improve intra data center routing protocols to enhance the performance of big data applications while improving the load balancing and utilization of the data centers. Subsection VII-C discusses flow, coflow, and job scheduling optimization studies that target different application and data center performance goals. Finally, Subsection VII-D addresses studies that utilize advanced technologies to scale up big data infrastructures and improve their performance. The studies presented in this Section are summarized in Tables V and VI.

A. Data Center Topology: Evaluating the performance and energy efficiency of big data applications in different data center topologies was considered in [204]-[209]. The authors in [204] modeled Hadoop clusters with up to 4 ToR switches and a core switch to measure the influence of the network on the performance. Several simplifications, such as homogeneous servers and uniform data distribution, were applied, and model-based and experimental evaluations indicated that Hadoop scaled well under 9 different cluster configurations. The MRPerf simulator was utilized in [205] to study the effect of the data center topology on the performance of Hadoop while considering several parameters related to the cluster (e.g. CPU, RAM, and disk resources), the configuration (e.g. chunk size, number of map and reduce slots), and the framework (e.g. data placement and task scheduling). DCell was compared to star and double-rack clusters with 72 nodes under the assumptions of a single replica and no speculative execution, and was found to improve sorting by 99% compared to double-rack clusters. The authors in [206] extended the CloudSim simulator [110] into CloudSimExMapReduce to estimate the completion time of jobs in different data center topologies with different workload distributions. Compared to a hypothetically optimal topology for MapReduce with a dedicated link for each intermediate data shuffling flow, CamCube provided the best performance. Different levels of intermediate data skew were also examined and worse performance was reported for all the topologies. In [207], we examined the effects of the data center network topology on the performance and energy efficiency of shuffling operations in MapReduce with sort workloads in data centers with electronic, hybrid, and all-optical switching technologies and different rate/server values. The results indicated that the optical switching technologies reduced the average power consumption by 54% compared to electronic switching data centers while providing comparable performance. In [208], the Network Power Effectiveness (NPE), defined as the ratio between the aggregate throughput and the power consumption, was evaluated for six electronic switching data center topologies under regular and energy-aware routing. The power consumption of the switches, and of the server NIC ports and CPU cores used to process and forward packets in server-centric topologies, was considered.
The results indicated that FBFLY achieved the highest NPE, followed by the server-centric data centers, and that NPE is only slightly impacted by the topology size, as the number of switches scales almost linearly with the data center size for the topologies examined. Design choices such as link speeds, oversubscription ratios, and buffer sizes in spine-leaf architectures were examined by simulations in [209] with realistic web search queries having Poisson arrivals and a heavy-tailed size distribution. It was found that ECMP is efficient only at link capacities higher than 10 Gbps, as 10 Gbps links resulted in a 40% performance degradation compared to an ideal non-blocking switch. Higher oversubscription ratios degraded the performance only at loads of 60% and higher. Examining the spine and leaf switch queue sizes revealed that it is better to keep them consistent and that additional capacity is more beneficial at the leaf switches. Flexible Fat-tree was proposed in [210] as an improvement and generalization of the Fat-tree topology in [536] to achieve higher aggregate bandwidth and richer paths by allowing an uneven number of aggregation and access switches in the pods. With more aggregation switches, shuffling results indicated about 50% improvement in the completion time. As a cost-effective solution to improve oversubscribed production data centers, the concept of flyways, where additional on-demand wireless or wired links are provisioned between congested ToRs, was introduced in [211] and further examined in [212]. Under sparse ToR-to-ToR bandwidth requirements, the results indicated that a few flyways allocated at the right ToR switches improved the performance by up to 50%, bringing it closer to that of 1:1 DCNs. The flyways can be 802.11g, 802.11n, or 60 GHz wireless links, or random wired connections between a subset of the ToR switches via commodity switches. The wired connections, however, cannot help if the congestion is between unconnected ToRs. A central controller was proposed to gather the demands and utilize MPLS to forward the traffic over the oversubscribed link or one of the flyways. In [213], a spectrum-efficient and failure-tolerant design for wireless data centers with 60 GHz transceivers was examined for data mining applications. A spherical rack architecture based on a bimodal degree distribution for the servers’ connections was proposed to reduce the hop count and hence the transmission time compared to a traditional wireless data center with cylindrical racks and same-degree Cayley graph connections. Challenges related to interference, path loss, and the optimization of hub server selection were addressed to improve the data transmission rate. Moreover, the efficiency of executing MapReduce in failure-prone environments (due to software and hardware failures) was simulated. Several big data frameworks that tailor their computations to the data center topology or exploit its properties were proposed in [214]-[216]. Camdoop [214] is a MapReduce-like system that runs on CamCube and exploits its topology by aggregating the intermediate data along the path to the reduce workers. A window-based flow control protocol and independent disjoint spanning trees with the same root per reduce worker were used to provide load balancing. Camdoop achieved improvements over switch-centric Hadoop and Dryad deployments, and over Camdoop variants running TCP or without aggregation.
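The benefit of such in-network aggregation can be illustrated with a minimal sketch (a generic combiner illustration, not Camdoop's implementation): partial word counts arriving at an intermediate hop are merged before being forwarded, so the volume sent towards the reduce worker shrinks:

```python
def aggregate_at_hop(incoming_partials):
    """Merge partial (key -> count) dictionaries arriving at an on-path
    node so that only one combined record per key is forwarded upstream,
    mimicking the traffic reduction of on-path aggregation."""
    merged = {}
    for partial in incoming_partials:
        for key, count in partial.items():
            merged[key] = merged.get(key, 0) + count
    return merged

# Two children each forward partial counts; the hop forwards one merged dict.
children = [{"cat": 3, "dog": 1}, {"cat": 2, "fish": 5}]
print(aggregate_at_hop(children))   # {'cat': 5, 'dog': 1, 'fish': 5}
```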
In [215], network-awareness and the utilization of existing or attached networking hardware were proposed to improve the performance of query applications. In-network processing in ToR switches with attached Network-as-a-Service (NaaS) boxes was examined to partially reduce the data and hence reduce bandwidth usage and increase the query throughput. For API transparency, a shim layer is added to perform software-based custom routing of the traffic through the NaaS boxes. RAMCube, a RAM-based key-value store for BCube [542], was proposed in [216] to address the false failure detection in large data centers caused by network congestion, the blockage of entire racks due to ToR switch failures, and the traffic congestion during recovery. A symmetric multi-ring arrangement that restricts failure detection and recovery to one hop in BCube is proposed to provide fast fault recovery. Experimental evaluation of the throughput under a single switch failure with 1 GbE NICs in the servers indicated that a maximum of 8.3 seconds is needed to fully transmit the data from a recovery server to a backup server. The topologies of data centers were also considered in optimizing VM assignments for various applications, as in [217]-[220]. A Traffic-aware VM Placement Problem (TVMPP) was proposed in [217] to improve the scalability of data centers. TVMPP follows a two-tier approximation algorithm that leverages knowledge of the traffic demands and the data center topology to co-locate VMs with heavy mutual traffic in nearby hosts. First, the hosts and the VMs are clustered separately and a 1-to-1 mapping that minimizes the aggregate traffic cost is performed; then, each VM within each cluster is assigned to a single host. The gain of TVMPP over random placements was examined for different traffic patterns, and the results indicated that multi-level architectures such as BCube benefit more than tree-based architectures and that heterogeneous traffic leads to larger benefits. To tackle intra data center network performance variability in multi-tenant data centers with network-unaware VM assignments, the work in [218] proposed Oktopus, an online network abstraction and virtualization framework that offers minimum bandwidth guarantees. Oktopus formulates virtual or oversubscribed virtual clusters to suit different types of cloud applications in terms of the bandwidth requirements between their VMs. These allocations are based on greedy heuristics that are exposed to the data center topology, the residual bandwidth in links, and the current VM allocation. The results showed that allocating VMs while accounting for the oversubscription ratio improved the completion time and reduced tenant costs by up to 75% while maintaining the provider's revenue. In [219], a communication cost minimization-based heuristic, Traffic Amount First (TAF), was proposed for VM-to-PM assignments under architectural and resource constraints and was examined in three data center topologies. Inter-VM traffic was reduced by placing VMs with higher mutual traffic in the same PM as much as possible. A topology-independent resource allocation algorithm, NetDEO, was designed in [220] based on swarm optimization to gradually reallocate existing VMs and allocate newly accepted VMs by matching resources and availability. NetDEO maintains the performance during network and server upgrades and accounts for the topology in the VM placements.
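A greatly simplified sketch of traffic-aware placement (not the two-tier TVMPP algorithm of [217] or the Oktopus heuristics of [218]) captures the basic idea of co-locating the most heavily communicating VM pairs; the VM names, capacities, and traffic values below are hypothetical:

```python
def colocate_heavy_pairs(vm_traffic, host_capacity, num_hosts):
    """Simplified traffic-aware placement: repeatedly take the VM pair
    with the largest mutual traffic and place both VMs on the same host
    if capacity allows, so heavy traffic stays host-local."""
    hosts = {h: [] for h in range(num_hosts)}
    placed = {}
    pairs = sorted(vm_traffic.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), _ in pairs:
        for h, vms in hosts.items():
            free = host_capacity - len(vms)
            need = (a not in placed) + (b not in placed)
            # Co-locate only if both VMs can end up on this host.
            if need and free >= need and all(placed.get(v, h) == h for v in (a, b)):
                for v in (a, b):
                    if v not in placed:
                        vms.append(v)
                        placed[v] = h
                break
    return placed

traffic = {("vm1", "vm2"): 900, ("vm3", "vm4"): 600, ("vm1", "vm3"): 50}
print(colocate_heavy_pairs(traffic, host_capacity=2, num_hosts=3))
```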
The performance of big data applications in SDN-controlled electronic and hybrid electronic/optical switching data center topologies was considered in [221]-[228]. To evaluate the impact of networking configurations on the performance of big data applications in SDN-controlled multi-rack data centers before deployment, a Flow Optimized Route Configuration Engine (FORCE) was proposed in [221]. FORCE emulates the building of virtual topologies with OVS over the physical network controlled by SDN, enabling the network to be optimized and the application performance to be enhanced at run-time; improvements by up to 2.5 times were achieved. To address big data applications and their need for frequent reconfigurations, the work in [222] examined run-time ToR-level SDN-based topology modification in a hybrid data center with a core MEMS switch and electrical Ethernet-based switches. Different topology construction and routing mechanisms that jointly optimize the performance and network utilization were proposed for single aggregation, shuffling, and partially overlapping aggregation communication patterns. The authors accounted for reconfiguration delays by starting the applications early and accounted for consistency in routing table updates. The work in [223] experimentally examined the performance of MapReduce in two hybrid electronic/optical switching data centers, namely c-Through and Helios, with SDN control. An “observe-analyze-act” control framework based on OpenFlow was utilized to configure the OCS and the packet networks. The authors addressed the hardware and software challenges and emphasized the need for near real-time analysis of application requirements to optimally obtain hybrid switch scheduling decisions. The work in [224] addressed the challenges associated with handling long-lived elephant flows of background applications while running Hadoop jobs in Helios hybrid data centers with SDN control. Although SDN control of electronic switches can provide alternative routes to reduce congestion and allow prioritizing packets in the queues, such techniques largely increase the switches' CPU and RAM requirements in the presence of elephant flows. Alternatively, [224] proposed detecting elephant flows and redirecting them via the high-bandwidth OCS to improve the performance of Hadoop. To reduce the latency of multi-hop connections between servers in 2D Torus data centers, the work in [225] proposed the use of SDN-controlled MEMS to bypass electronic links and directly connect the servers. Results based on an emulation testbed and an all-to-all traffic pattern indicated that optical bypassing can reduce the end-to-end latency for 11 of the 15 hosts by 11%. To improve the efficiency of multicasting and incasting in workloads such as HDFS reads, joins, VM provisioning, and in-cluster software updates, a hybrid architecture based on Optical Space Switches (OSS) was proposed in [226] to establish point-to-point links on-demand that connect passive splitters and combiners. The splitters transparently duplicate the data optically at line rate and the combiners aggregate incast traffic under orchestration system control with TDM. Compared with small-scale electronic non-blocking switches, similar performance was obtained, indicating potential gains for the optical accelerators at larger scales where non-blocking performance is not attainable with electronic switches. Unlike the above work, which fully offloads multicast traffic to the optical layer, HERO [227] was proposed to integrate optical passive splitters and FSO modules with electronic switching to handle multicast traffic.
During the optical switch configuration, HERO multicasts through the electronic switches and then migrates to optical multicasting. HERO exhibited a linear increase in the completion time with increasing flow sizes, and significantly outperformed the binomial and ring Message Passing Interface (MPI) broadcasting algorithms over electronic switching for messages of at most 12 kBytes and larger than 12 kBytes, respectively. A software-defined and controlled hybrid OPS and OCS data center was proposed and examined in [228] for dynamic multi-tenant Virtual Data Center (VDC) provisioning. A topology manager was utilized to build the VDCs with different provisioning granularities (i.e. macro and micro) in an Architecture-on-Demand (AoD) node with OPS and OCS modules in addition to ToRs and servers. For VM placement, a network-aware heuristic was used that jointly considers computing and networking resources and takes the wavelength continuity, the heterogeneity of the optical devices, and the VM dynamics into account. Improvements of 30.1% and 14.6% were achieved by the hybrid topology with 8 and 18 wavelengths per ToR switch, respectively, compared to using OCS only. The work in [229] proposed a 3D MEMS crossbar to connect server blades for scalable stream processing systems. Software-based control, routing, and scheduling mechanisms were utilized to adapt the topology to the computational needs of the processing graph while accounting for the MEMS reconfiguration and protocol stack delays. To overcome the high blocking ratio and scalability limitations of single-hop MEMS-based OCS, the authors in [230] proposed a distributed multi-hop OCS that integrates WDM and SDM technologies with multi-rooted tree-based data centers. A multi-wavelength optical switch based on Microring Resonators (MR) was designed to ensure fast switching. A modification to Fat-tree was proposed in which half of the electronic core, aggregation, and access switches are replaced with the OCS, and distributed control was utilized at each optical switch with low-bandwidth copper links to interconnect the switches and combine the control with the EPS. Compared with Hedera and DARD, a much faster optical path setup (i.e. 126.144 µs) was achieved with much lower control messaging overhead.

B. Data Center Routing: In [231], the placement problem of reduce tasks in multi-rack environments was analyzed to decrease cross-rack traffic based on two greedy approaches. Compared to original Hadoop with random reduce task placement, up to 32% speedup in completion time was achieved across different workloads. A scalable DCN-aware load balancing technique for key distribution and routing in the shuffling phase of MapReduce was proposed in [232] while considering DCN bandwidth constraints and addressing data skewness. A centralized heuristic with two subproblems, network flow and load balancing, was examined and compared to three state-of-the-art techniques: the load balancing-based LPT, the fairness-based LEEN, and the default routing algorithm with hash-based key distribution in MapReduce. For synthetic and realistic traffic, the network-aware load balancing algorithm outperformed the others by 40% in terms of completion time and achieved a maximum load per reduce task comparable to that of LPT.
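The intuition behind such cross-rack-aware placement can be shown with a minimal sketch (not the specific greedy algorithms of [231] or the heuristic of [232]): each reduce partition is placed on the rack that already stores most of its map output, so only the remaining bytes cross racks during shuffling; the partition and rack names below are hypothetical:

```python
def place_reduce_tasks(partition_bytes_per_rack):
    """Greedy reduce-task placement sketch: assign each reduce partition
    to the rack holding most of its map output, so the bytes crossing
    racks during shuffle are minimized partition by partition.
    partition_bytes_per_rack: dict partition -> {rack: bytes of map output}."""
    placement, cross_rack = {}, 0
    for part, per_rack in partition_bytes_per_rack.items():
        best_rack = max(per_rack, key=per_rack.get)
        placement[part] = best_rack
        cross_rack += sum(b for r, b in per_rack.items() if r != best_rack)
    return placement, cross_rack

data = {"p0": {"rackA": 700, "rackB": 100}, "p1": {"rackA": 50, "rackB": 400}}
print(place_reduce_tasks(data))   # p0 -> rackA, p1 -> rackB, 150 bytes cross racks
```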
To improve shuffling and reduce its routing costs under varying data sizes and data reduction ratios, a joint intermediate data partitioning and aggregation scheme was proposed in [233]. A decomposition-based distributed online algorithm was proposed to dynamically adjust the data partitioning by assigning keys with larger data sizes to reduce tasks closer to the map tasks, while optimizing the placement and migration of aggregators that merge the traffic of the same key from multiple map tasks before sending it to remote reduce tasks. For large-scale computations (i.e. 100 keys), and compared to random hash-based partitioning with no aggregation and with a random placement of 6 aggregators, the scheme reduced the completion time by 50% and 26%, respectively. DCP was proposed in [234] as an efficient and distributed cache sharing protocol that reduces the intra data center traffic in Fat-tree data centers by eliminating the need to retransmit redundant packets. It utilizes a packet cache in each switch for the elimination and a Bloom filter header field to store and share cached packet information among switches. Simulation results for an 8-ary Fat-tree data center showed that DCP eliminated 40-60% of the retransmissions. To use the bandwidth in BCube data centers effectively, the work in [235] proposed and optimized two schemes for in-network aggregation at the servers and switches: the first, for incast traffic, was modeled as a minimal incast aggregation tree problem, and the second, for shuffling traffic, was modeled as a minimal shuffle aggregation subgraph problem. Efficient 2-approximation algorithms, named IRS-based and SRS-based, were suggested for the incast and shuffling traffic, respectively. Moreover, an effective forwarding scheme based on in-switch and in-packet Bloom filters was utilized, at a cost of 10 Bytes per packet, to ease the identification of related flows. The results for the SRS-based scheme revealed a traffic reduction of 32.87% for a small BCube and 53.33% for a large-scale BCube with 262,144 servers. The optimization of data transfer throughput in scientific applications was addressed in [236] while considering the impact of combining different levels of TCP flow pipelining, parallelism, and concurrency, and the heterogeneity of file sizes. Recommendations were presented, such as using pipelining only for file sizes smaller than a threshold related to the bandwidth-delay product and with different levels for different file size ranges. To improve the throughput, routing scalability, and upgrade flexibility in electronic switching data centers with random topologies, Space Shuffle (S2) was proposed in [237] as a greedy key-based multi-path routing mechanism that operates in multiple ring spaces. Results based on fine-grained packet-level simulations that considered the finite sizes of shared buffers and forwarding tables and the acknowledgment (ACK) packets indicated improved scalability compared to Jellyfish, and higher throughput and shorter paths than SWDC and Fat-tree; however, the overheads associated with packet reordering were not considered. An oblivious distributed adaptive routing scheme for Clos-based data centers was proposed in [238] and was proven, via approximate Markov models and simulations, to converge to non-blocking assignments with minimal out-of-order packet delivery when the flows are sent at half the available bandwidth.
While transmitting at the full bandwidth with strictly or rearrangeably non-blocking routing resulted in exponential convergence times, the proposed approach converged in less than 80 µs in a 1152-node cluster and showed a weak dependency on the network size, at the cost of a minimal delay due to retransmitting the first packets of redirected flows. The work in [239] utilized stochastic permutations to generate bursty traffic flows and statistically evaluated the expected throughput of several Layer 2 single-path and multi-path routing protocols in oversubscribed spine-leaf data centers. These included deterministic single-path selection based on hashing the source or destination, worst-case optimal single-path routing, ECMP-like flow-level multipathing, and stateful Round Robin-based packet-level multipathing (packet spraying). Simulation results indicated that the throughput of the ECMP-like multipath routing is lower than that of the deterministic single path due to flow collisions, as 40% of the flows experienced a 2-fold slowdown, while packet spraying outperformed all the examined protocols. The authors in [240] proposed DCMPTCP to improve MPTCP through three mechanisms: Fallback for rACk-local Traffic (FACT) to reduce the unnecessary creation of sub-flows for rack-local traffic, an ADAptive packet scheduler (ADA) to estimate flow lengths and enhance their division, and Signal sHARE control (SHARE) to enhance the congestion control for short flows with many-to-one patterns. Compared with two congestion control mechanisms typically used with MPTCP, LIA and XMP, a 40% reduction in the transmission time of inter-rack flows was achieved. The energy efficiency of routing big data application traffic in data centers was considered in [241]-[243]. In [241], preemptive flow scheduling and energy-efficient routing were combined to improve the utilization in Fat-tree data centers by maximizing the sharing of switches while exclusively assigning each flow to the links it needs during its schedule. Compared to link bandwidth sharing with ECMP, and to flow preemption with ECMP, additional energy savings of 40% and 30% were achieved, respectively, at the cost of an increased average completion time. To improve the energy efficiency of MapReduce-like systems, the work in [242] examined combining VM assignment with TE. A heuristic, GEERA, was designed to first cluster the VMs via minimum k-cut and then assign them via local search, while accounting for the traffic patterns. Compared with other load-balancing techniques (i.e. Randomized Shortest Path (RSP), and the integral optimal and fractional solutions), an additional 20% energy saving was achieved on average, with total average savings of 60% in Fat-tree and 30% in BCube data centers. A GreenDCN framework was proposed in [243] to green switch-centric data center networks (with Fat-tree as the focus) by optimizing VM placement and TE while considering network features, such as the regularity and roles of switches, and the applications' traffic patterns. The energy-efficient VM assignment algorithm (OptEEA) merges VMs with heavy mutual traffic into super-VMs, assigns jobs to different pods through k-means clustering, and finally assigns the super-VMs to racks through minimum k-cut. Then, the energy-aware routing algorithm (EER) utilizes the first-fit decreasing algorithm and MPTCP to balance the traffic across a minimized number of switches. Compared with greedy VM assignment and shortest path routing, GreenDCN achieved a 50% reduction in the energy consumption.
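Since the energy-aware routing step in several of these schemes boils down to packing flows onto as few network elements as possible, a minimal first-fit decreasing sketch is shown below; it is an illustrative simplification, not the EER algorithm of [243] or ElasticTree [631], and the flow demands are hypothetical:

```python
def consolidate_flows(flow_demands_gbps, link_capacity_gbps):
    """First-fit decreasing: sort flows by demand and pack each one into
    the first already-active link with enough residual capacity, opening
    a new link only when necessary. Unused links can then be powered off."""
    active_links = []          # residual capacity of each powered-on link
    assignment = {}
    for flow, demand in sorted(flow_demands_gbps.items(),
                               key=lambda kv: kv[1], reverse=True):
        for i, residual in enumerate(active_links):
            if residual >= demand:
                active_links[i] -= demand
                assignment[flow] = i
                break
        else:                  # no existing link fits: power on a new one
            active_links.append(link_capacity_gbps - demand)
            assignment[flow] = len(active_links) - 1
    return assignment, len(active_links)

flows = {"f1": 6.0, "f2": 4.0, "f3": 3.0, "f4": 1.0}
print(consolidate_flows(flows, link_capacity_gbps=10.0))  # only 2 links stay on
```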
The optimization of routing traffic between VMs in multi-tenant data centers was considered in [244]-[246]. VirtualKnotter [244] utilizes a two-step heuristic to optimize VM placement in virtualized data centers while accounting for the congestion caused by core switch oversubscription and unbalanced workload placement. While other TE schemes operate on fixed source and destination pairs, VirtualKnotter considers reallocating them by optimizing VM migration to further reduce the congestion. Compared to a baseline clustering-based VM placement, a 53% reduction in congestion was achieved for production data center traffic. To allow the effective multiplexing of applications with different routing requirements, the authors in [245] proposed an online Topology Switching (TS) abstraction that defines a different logical topology and routing mechanism per application according to its goals (e.g. bisection bandwidth, isolation, and resilience). Results based on simulations for an 8-ary Fat-tree indicated that the tasks achieved their goals with TS, while unified routing via ECMP with shortest paths and isolation based on VLANs failed to guarantee the different goals. The work in [246] addressed the fairness of bandwidth allocation and link sharing between multiple applications in private cloud data centers. A distributed algorithm based on dual decomposition was utilized to assign link bandwidths to flows by maximizing the social welfare across the applications while maintaining performance-centric fairness with controlled relaxation. The authors assumed that the bottlenecks are at the access links of the PMs and considered workloads where half of the tasks communicate with the other half with no data skew. To evaluate the algorithm, two different scenarios for application allocation and communication requirements were used, and the results indicated the flexibility and the fast convergence of the proposed algorithm. SDN-based solutions to optimize the routing of big data application traffic in various data center topologies were discussed in [247]-[257]. MILPFlow was developed in [247] as a routing optimization tool set that utilizes MILP modeling based on an SDN topology description, which defines the characteristics of the data center, to deploy data path rules in OpenFlow-based controllers. To improve the routing of shuffling traffic in Fat-tree data centers, an application-aware SDN routing scheme was proposed in [248]. The scheme includes a Fat-tree manager, a MapReduce manager, a link load monitor, and a routing component. The Fat-tree manager maintains information about the properties of the different connections to prioritize the assignment of flows with less path flexibility. Emulation results indicated a reduction in the shuffling time of 20% and 10% compared to Round Robin-based ECMP without and with data skew, respectively, and a reduction of around 60% compared to Spanning Tree and the Floodlight forwarding module (shortest path-based routing). To enhance the shuffling between map and reduce VMs under background traffic in OpenFlow-controlled clusters, the work in [249] suggested dynamically assigning flows to queues with different rates in OVS and LINC software switches. Results showed that prioritizing Hadoop traffic and providing more bandwidth to straggling reduce tasks can reduce the completion time by 42% compared to solely using a 50 Mbps queue. In [250], a dynamic algorithm for workload balancing between different racks in a Hadoop cluster was proposed.
The algorithm estimates the processing capability of each rack and accordingly modifies the allocation of unfinished tasks towards the racks with the least completion time and higher computing capacity. The proposed algorithm was found to decrease the completion time of the slowest jobs by 50%. A network Overlay Framework (NoF) was proposed in [251] to guarantee the bandwidth requirements of Hadoop traffic at run time. NoF achieves this by defining network topologies, setting communication paths, and prioritizing traffic. A fully virtualized spine-leaf cluster was utilized to examine the impact on job execution time of redirecting Hadoop flows through the overlay network controlled by NoF, and a reduction in the completion time of 18-66% was achieved. An Application-Aware Network (AAN) platform with SDN-based adaptive TE was examined in [252] to control the traffic in Hadoop clusters at a fine-grained level and achieve better performance and resource allocation. An Address Resolution Protocol (ARP) resolver algorithm that avoids flooding was proposed instead of STP to update the routing tables. Compared with MapReduce over oversubscribed links, the SDN-based TE resulted in completion time improvements between 16× and 337× for different workloads. To dynamically reduce the volume of shuffling traffic and green the data exchange, the work in [253] utilized spate coding-based middleboxes under SDN control in 2-tier data centers. The scheme uses a sampler to obtain side information (i.e. traffic from map to reduce tasks residing in the same node) in addition to the coder/decoder at the middleboxes. Then, OpenFlow is used to multicast the coded packets according to an optimized grouping of multicast trees. Compared to native vanilla Hadoop, reductions in the exchanged volume of 43% and 59% were achieved when allocating the middlebox in the aggregation switch and in a ToR switch, respectively; compared to Camdoop-like in-network aggregation, the reductions were 13% and 38%. As an improvement of MPTCP, the authors in [254] proposed a responsive centralized scheme that adds subflows dynamically and selects the best route according to the current traffic conditions in SDN-controlled Fat-tree data centers. A controller that employs Hedera's demand estimation to perform subflow route calculation and path selection, and a monitor per server to adjust the number of subflows dynamically, were utilized. Compared with ECMP and Hedera under shuffling with background traffic, the responsive scheme achieved a 12% improvement in completion time while utilizing a lower number of subflows. To overcome the scalability issues of centralized OpenFlow-based controllers, the work in [255] proposed LocalFlow, a distributed algorithm at each switch that exploits the knowledge of its active flows and defines the forwarding rules at the flowlet, individual flow, and sub-flow resolutions. LocalFlow targets data centers with symmetric topologies and can tolerate asymmetries caused by link and node failures. It allows, but minimizes, the spatial splitting of flows, aggregating flows per destination and splitting only if the load imbalance exceeds a slack threshold. To avoid the pitfalls of packet reordering, the duplicate acknowledgment (dup-ACK) threshold at end hosts is slightly increased. LocalFlow improved the throughput by up to 171%, 19%, and 23% compared to ECMP, MPTCP, and Hedera, respectively.
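The slack-based splitting rule can be illustrated in a greatly simplified form (the actual LocalFlow algorithm in [255] splits flows more carefully; even splitting and the loads below are only illustrative assumptions):

```python
def split_decision(candidate_link_loads, aggregate_demand, slack=0.1):
    """Illustrative slack-threshold rule: send the whole per-destination
    aggregate on the least-loaded link unless doing so would exceed the
    balanced load by more than a slack fraction, in which case split it
    evenly across the candidate links."""
    ideal = (sum(candidate_link_loads) + aggregate_demand) / len(candidate_link_loads)
    least = min(range(len(candidate_link_loads)), key=candidate_link_loads.__getitem__)
    unsplit_load = candidate_link_loads[least] + aggregate_demand
    if unsplit_load <= (1.0 + slack) * ideal:
        return {least: aggregate_demand}          # no splitting needed
    share = aggregate_demand / len(candidate_link_loads)
    return {i: share for i in range(len(candidate_link_loads))}  # split evenly

print(split_decision([2.0, 2.5, 3.0], aggregate_demand=4.0, slack=0.1))
```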
XPath was proposed in [256] to allow different applications to explicitly route their flows without the overheads of establishing paths and dynamically adding them to the routing tables of commodity switches. A two-step compression algorithm was utilized to enable pre-installing a very large number of desired paths into the IP tables of commodity switches in large-scale data centers despite their limited table sizes. First, to reduce the number of unique path IDs, non-conflicting paths, such as convergent and disjoint ones, are clustered into path sets and each set is assigned a single ID. Second, the sets are mapped to IP addresses based on Longest Prefix Matching (LPM) to reduce the number of IP table entries. A logically centralized SDN-based controller can then be utilized to dynamically update the IDs of the paths, instead of the routing tables, and to handle failures. For MapReduce shuffling traffic, XPath enabled choosing non-conflicting parallel paths and achieved a 3× completion time reduction compared to ECMP. The recent work in [257] discussed the need for applying Machine Learning (ML) techniques to bridge the gap between large-scale DCN topological information and various networking applications such as traffic prediction, monitoring, routing, and workload placement. A methodology named Topology2Vec was proposed to generate topology representations in the form of low-dimensional vectors from the actual topologies while accounting for the dynamics of link and node availability due to failures or congestion. To demonstrate the usability of Topology2Vec, the problem of placing SDN controllers to reduce end-to-end delays was addressed. A summary of the data center-focused optimization studies is given in Table V.

TABLE V SUMMARY OF DATA CENTER -FOCUSED OPTIMIZATION STUDIES - I Ref Objective Application/ Tools Benchmarks/workloads Experimental Setup/Simulation environment platform [204]* Hadoop performance model Hadoop Modelling for MapReduce program to 160 servers in 5 racks (4 CPU cores, 8GB RAM, 2 250GB disk), for multi-rack clusters 0.21.0 processing time read, encrypt, then sort Gigabit Ethernet Cisco Catalyst 3750G and 2960-S, Bonnie++, Netperf [205]* Simulation approach to evaluate Hadoop 1.x ns-2 network simulator, TeraSort, MRPerf-based simulations for (single rack, double rack, tree-based data center topology effects on Hadoop DiskSim, C++, Tcl, Search, Index and DCell) topologies with 2, 4, 8, 16 nodes each with (2 Xeon Quad Python 2.5 GHz core, 4 750GB SATA), 1 Gbps Ethernet [206]* Evaluating job finish time for MapReduce Hadoop 1.x CloudSimExMapReduce Scientific Simulations for hierachical, fat-tree, CamCube, DCell, and workloads in different data centers topologies simulator, CloudDynTop workloads hypothetically optimal topology for MapReduce, 6-9 servers, 1 Gbps and 10 Gbps links [207]* Optimizing shuffling operations in electronic, Google’s MILP Sort Spine-leaf, Fat-tree, BCube, DCell, c-Through, hybrid, and optical data center topologies MapReduce Helios, Proteus topologies with 16 nodes, 10 Gbps links [208]* Evaluation of data centers Network - Regular and energy- One-to-one and all-to-all Simulations for Fat-tree, Vl2, FBFLY, BCube, DCell, FiConn with Power Effectiveness (NPE) aware TCP traffic flows similar network diameter, ∼8000 or ∼100 servers, and a mix of 1, 10 Flyways: wireless or wired additional links routing algorithms GbE links on-demand between congested ToR switches [209]* Data path performance evaluation - Network simulation Realistic workload with a OMNET++-based Simulations for spine-leaf data center (100 servers in spine-leaf architectures cradle mix of in 4 racks) with oversubscription ratios (1:1, 2.5:1, 5:1) and 10, 40, small and large jobs, and 100 Gigabit Ethernet bursty traffic

[210]* FFTree: Adjustment to pods in Fat-tree Google’s Flexibility in pods design WordCount Ns3-based simulations for 16 Data Nodes in Linux containers for higher bandwidth and flexibility MapReduce defined by edge offset, connected by TapBridge NetDevice BulkSendApplication [211]*, Flyways: wireless or wired additional links MapReduce- Central controller, MPLS Production MapReduce Simulation for 1500 servers in 75 racks with 20 server per rack) [212]* on-demand between congested ToR switches like data label workloads and additional Flyways with (0.1, 0.6, 1, and 2) Gbps capacity per mining switching, optimization Flyway problem

[213]* Failure-tolerant and spectrum efficient - Bimodal degree 125 Mbytes input data, Simulations for a spherical rack with 200 servers 10 of them are hub wireless data center for big data applications distribution, space and 20 Mbytes intermediate servers, time division multiple data access scheme [214]* Camdoop: MapReduce-like system Hadoop key-based routing, in- WordCount, Sort 27 servers CamCube (Quad Core 2.27 GHz CPU, 12GB RAM, 32GB in CamCube data centers 0.20.2, network prcessing, SSD), 1 Gbps Quadport NIC, 2 1 Gbps dual NIC, packet level Dryad, spanning tree simulator for 512 server CamCube Camdoop [215]* In-network distributed query processing Apache Solr NaaS box attached to Queries from Solr cluster (1 master, 6 workers), 1 Gbps Ethernet for servers, 10 system to increase throughput ToR for in-network 75 clients Gbps for NaaS box processing [216]* RAMCube : BCube-oriented design Key-value RPC through Ethernet, Key-value workloads BCube(4,1) with 16 servers (2.27GHz Quad core CPU, 16GB RAM, for resilient key-value store store one hop with set, get, delete 7200 RPM 1TB disk) and 8 8-port DLink GbE, ServerSwitch 1 Gbps allocation to recovery operations NIC in each server server [217]* Traffic-Aware VM Placement Problem - Two-tier approximate VM-to-VM global and Tree, VL2, Fat-tree, and BCube data centers with (TVMPP) in data centers algorithm partitioned 1024 VMs and traffic, production traces [218]* Oktopus: Intra data center network - Greedy allocation Symmetric and 25 nodes with 5 ToRs and core switch (2.27GHz CPU, 4GB RAM, 1 virtualization for predictable performance algorithms, rate-limiting asymmetric VM-to- VM Gbps), Simulations for multi-tenant data center with 16000 servers in at end hosts static and dynamic traffic 4 racks and 4 VMs per server [219]* VM to PM mapping based on architectural - Traffic Amount First Uniformly random VM- Simulations for Fat-tree, VL2, and Tree-based data centers in large- and resources constraints (TAF) heuristic to-VM traffic scale (60 VMs) and small-scale (10 VMs) settings [220]* NetDEO: VM placement in data centers Multi-tier Meta-heuristic based on Synthesized traces Simulations for data center networks with heterogeneous servers data centers and efficient system upgrade web Simulated annealing and different topologies (non-homogeneous Tree, FatTree, BCube) applications [221]* Emulator for SDN-based data centers Hadoop 1.x OVS Simulated Hadoop shuffle Testbed with 1 primary server (2.4 GHz, 4GB RAM), 12 client with reconfigurable topologies traffic generator workstations (dual-core CPU, 2GB RAM), 2 Pica8 Pronto 3290 SDN- enabled Gigabit Ethernet switches) [222]* SDN-approach in Application-Aware Hadoop 2.x Ryu, OVS, lightweight HiBench benchmark (sort, 1 master and 8 slaves in GENI testbed (2.67GHz CPU, 8GB RAM) Networks (AAN) to improve Hadoop REST-API, ARP resolver word count, scan, join, 100 Mbps per link. 
PageRank) [223]* Application-aware run-time SDN-based - 2D Torus topology - Hybrid data center with OpenFlow-enabled ToR electrical switches control for data centers with Hadoop configuration algorithms connected to an electrical core switch and MEMS-based core optical switch [224]* Experimental evaluation for MapReduce Hadoop 1.x Topology manager, TrintonSort (900 GB) and 24 servers in 4 racks, 5 Gbps packet link and 10 Gbps optical link in c-Through and Helios OpenFlow, circuit switch TCP long-lived traffic Monaco switches, Glimmerglass MEMS manager [225]* Measuring latency in Torus-based hybrid - Network Orchestrator Real-time network 2D Torus network testbed constructed with 4 MEMS sliced from a optical/electrical switching data centers Module (NOM) traffic via Iperf 96×96 MEMS- based CrossFiber LiteSwitcM and two Pica8 OpenFlow switches to emulate 16 NIC [226]* Acceleration of incast and multicast traffic - Floodlight, Redis pub/sub Two sets of 50 multicast 8 nodes hybrid cluster testbed with optical splitters and with on-demand passive optical modules messages, jobs (500MB-5GB) with combiners connected with controlled Optical Space Switch (OSS) integer program to 4 and 7 groups configure OSS [227]* HERO: Hybrid accelerated delivery MPICH for SDN controller, greedy 50 multicast flows with Small free-space optics mutlicast testbed with passive optical 1:9 for multicast traffic Message optical links assignment uniform flows splitters for throughput and delay evaluation, flow-level simulations Passing algorithm (100MB-1GB) and for 100 racks in spine and leaf hybrid topology Interface groups (10-100) (MPI) [228]* Multi-tenant virtual optical data center Virtual Data OpenDayLight, network- 2 VDC, randomly FPGA optoelectronics 12×10GE ToRs, SOA-based OPS switch, with network-aware VM placement Center aware VM placement generated 500 Polatis 192×192 MEMS- based switch as optical back-plane, (VDC) based on variance VDCs with Poisson simulations for 20/40 servers per rack, 8 racks composition requests arrival

[229]* Reconfigurable 3D MEMS based data center System S Scheduling Optimizer for Multiple jobs for 3 IBM Bladecenter-H chassis, 4 HS21 blade servers per chassis, for optimized stream computing systems Distributed Applications streaming applications Calient 320×320-port 3D MEMS (SODA) [230]* Hybrid switching architecture for cloud - Multi wavelength optical 3 synthetic traffic patterns OPNET simulations for the proposed hybrid OCS and EPS switching applications in Fat-tree data centers switch, distributed with mice and elephant in Fat-tree with 128 servers to evaluate the average end-to-end delay control for scheduling flows and the network throughput. [231]° Reduce tasks placements to Hadoop 1.x Linear-time greedy WordCount, Grep, Cluster with 4 racks; A, B , C, and D with 7, 5, 4, and 4 servers (Intel reduce cross-rack traffic algorithm, binary search PageRank, k-means, Xeon E5504, E5520, E5620, and E5620 CPUs, 8GB RAM), 1 GbE Frequency Pattern Match links [232]° Data center network-aware load balancing to - Greedy Synthetic and Wikipedia Simulations for 12-ary Fat-tree with 1 Gbps links optimize shuffling in MapReduce with skewed algorithm page-to-page link datasets data [233]° Joint intermediate data partitioning and - Profiling, decomposition- Dump files in Wikimedia, Simulations for three-tier data center with 20 nodes aggregation to improve the shuffling phase of based distributed WordCount, random MapReduce algorithm shuffling [234]° DCP: Distributed cache protocol to reduce - In-memory store, Randomly generated Simulations for Fat-tree with k=16 (1024 servers, 64 core switches, redundant packets transmission in Fat-tree bloom filter header packets with Zipf 28 aggregation, 128 edge), simulations distribution [235]° In-network aggregation in BCube for Hadoop 1.x IRS-based incast, SRS- WordCount with 61 VMs in 6 nodes emulated BCube (2 8-cores CPU, 24GB RAM, scalable and efficient data shuffling based shuffle, Bloom combiner 1TB disk), large-scale simulations for BCube(8,5) with filter [236]° Application-level TCP tuning for data transfers - Recursive Chunk Bulk data transfers high-speed networking testbeds and cloud networks through pipelining, parallelism, and concurrency Division, Parallelism- (512KB-2GB) per file Emulab-based emulations, AWS EC2 instances Concurrency Pipelining [237]° Space Shuffle (S2): greedy routing on multiple - Greediest routing, Random permutation Fine-grained packet-level event-based simulations for rings to improve throughput, scalability, and MILP traffic the proposed data center, Jellyfish, SWDC, and Fat-tree flexibility [238]° Non-blocking distributed oblivious adaptive - Approx. 
Markov chain Random permutation OMNet++-based simulations for InfiniBand network (three-level Clos routing in Clos-based data centers for big data models to predict traffic of 245KB flows DCN with 24 input, 24 output, and 48 intermediate switches and 1152 applications convergence time nodes), 40 Gbps links [239]° Evaluation for different routing protocols on - Stochastic permutations Bursty traffic, delay- Flow-level simulations for spine and leaf data center the performance of spine and leaf data centers for traffic generation sensitive workloads with 8 spine switches, 16 leaf switches, and 128 end nodes [240]° DCMPTCP: improved MPTCP for load - Fallback for rACk-local Many-to-one traffic, data Ns3-based simulations for Spine and leaf data center with 8 spine, 8 balancing in data centers with rack local and Traffic, ADAptive packet mining and web search leaf switches, and 64 nodes per leaf switch, 10 and 40 Gbps links inter racks traffic scheduler traffic [241]° Greening data centers by Flow Preemption - Algorithm for the 10k flows wit exponential Simulations for 24-ary Fat-tree (FP) and Energy-Aware Routing (EAR) FP and EAR scheme distribution with mean of 64MB [242]° Improving the energy efficiency of routing in - Approximate-algorithm Uniform random traffic Fat-tree (4-ary and 8-ary), BCube(2,2) and BCube(8,2) data centers by joint VM assignments and TE (GEERA) and number of VMs

[243]° GreenDCN: scalable framework to green data - Time-slotted algorithms; Synthetic jobs with Simulations for 8-ary and 12-ary Fat-tree data centers, 2 VMs per center network through optimized VM optEEA, EER normal distribution for server, identical switches with 300W max power and max processing placement and TE no. of servers speed of 1 Tbps [244]° VirtualKnotter: efficient online VM placement - Multiway θ-Kernighan- Synthetic and realistic Dynamic simulations and migration to reduce congestion in data Lin and simulated traffic centers annealing [245]° Topology switching to allow multiple - Allocator, centralized Randomly generated Simulations for 16-ary Fat-tree with 1 Gbps links applications to use different routing schemes topology server routing tasks [246]° Performance-centric fairness in links bandwidth - Gradient projection- Two scenarios for Simulations for a private data center with homogeneous nodes allocation in private clouds data centers based distributed applications traffic algorithm [247]° MILPFlow: a Tool set for modeling and data - MILP, Video streaming, Mininet-based emulation for 4-ary Fat-tree in VirtualBox-based paths deployment in OpenFlow-based data OVS Iperf traffic VMs, 1 GbE NIC centers [248]° Application-aware SDN routing Hadoop 1.x Floodlight controller, WordCount EstiNet-based emulation for 20 OpenFlow switches in 4-ary Fat-tree in data centers managers, links monitor, with 16 nodes routing component [249]° Hadoop acceleration in an Cloudera Floodlight, OVS and Sort (0.4 MB - 4GB), 10 VMs in 3 nodes in Cloudera connected by a physical switch OpenFlow-based cluster distribution LINC switches Iperf for background of Hadoop traffic [250]° SDN-controlled dynamic workload balancing - Balancing algorithm WordCount Mumak-based simulations for a data center with three racks to improve completion time of Hadoop jobs based on estimation and prediction [251]° SDN-based Network Overlay Framework (NoF) Hadoop Configuration engine, TeraGen, TeraSort, Virtualized testbed with spine-leaf virtual switches, VirtualBox for to define networks based on applications 2.3.0 OVS, POX controller iperf for background VMs requirements traffic [252]° SDN-approach in Application-Aware Hadoop 1.x Ryu, OVS, lightweight HiBench benchmark (sort, 1 master and 8 slaves in GENI testbed (2.67GHz CPU, 8GB RAM), Networks (AAN) to improve Hadoop REST-API, ARP resolver word count, scan, join, 100 Mbps per link PageRank) [253]° Dynamic control for data volumes through SDN Vanilla Sampler, spate coding- TeraSort, Grep Prototype: 2-tier data center with 8 nodes (12 cores CPU, 128GB control and spade coding in cloud data centers Hadoop based middleboxes for RAM, 1TB disk), Testbed: 8 VMs controlled by Citrix XenServer and coding/decoding connected with OVS [254]° Responsive multipath TCP for optimized - Demand estimation and Random, permutation, NS3-based simulations for 8-ary Fat-tree with 1 Gbps links and shuffling in SDN-based data centers route calculation and shuffling traffic customized SDN controller algorithms [255]° LocalFlow: local link balancing for scalable - Switch-local pcap packet traces, Packet-level network simulations (stand-alone and htsim-based) for and optimal flow routing in data centers algorithm MapReduce-style flows 8-ary and 16-ary Fat-tree, VL2, oversubscribed variations [256]° XPath: explicit flow-level path control in - 2 step compression Random TCP Testbed: 6-ary Fat-tree testbed with Pronto Broadcom 48-port commodity switches in data centers algorithm connections, 
sequential Ethernet switches, Algorithm evaluation for BCube, Fat-tree, HyperX, read/write, shuffling and VL2 with different scales [257]° Topology2Vec: DCN representation learning - Biased random walk Real-world Internet Simulations for networking applications sampling ML-based topologies from the network partitioning Topology Zoo * Data center topology, ° Data center routing.
C. Scheduling of Flows, Coflows, and Jobs in Data Centers:
Scheduling big data traffic at the flow level was addressed in [258]-[261]. Orchestra [258], a task-aware centralized cluster manager, aimed to reduce the average completion time of various data transfer patterns for batch, iterative, and interactive workloads. Within each transfer, a Transfer Controller (TC) was used for monitoring and updating the sources associated with destination sets. For broadcast, a BitTorrent-like protocol, Cornet, was utilized, and for shuffle, an algorithm named Weighted Shuffle Scheduling (WSS) was used. Orchestra allows scheduling at the transfer level between application stages, where concurrent transfers belonging to the same job can be optimized through an Inter-Transfer Controller (ITC) that can utilize FIFO, fair, and priority-based scheduling. Broadcast was improved by 4.5 times compared to the status quo broadcast mechanism in native Hadoop, and high priority transfers were improved by 1.7 times. Seawall [259] is an edge scheduler that uses a hypervisor-based mechanism to ensure fair sharing of DCN links between tenants. FlowComb was proposed in [260] as a centralized and transparent network management framework to improve the utilization of clusters and reduce the processing time of big data applications. FlowComb utilized software agents installed in the nodes to detect transfer requests and report to the centralized engine, and OpenFlow to enforce forwarding rules and paths. An improvement of 35% in the completion time of sort workloads was achieved with 60% path enforcement. In [261], distributed flow scheduling was proposed to achieve adaptive routing and load balancing through DiFS. For several traffic patterns, DiFS achieved better aggregate throughput compared to ECMP, and comparable or higher throughput compared to Hedera and DARD. Optimizing coflow scheduling, which is more application-aware than flow scheduling, to minimize the completion time of workloads was the focus in [262]-[265]. Varys was proposed in [262] as a coordinated inter-coflow scheduler in data centers targeting predictable performance and reduced completion time for big data applications. A greedy coflow scheduler and a per-flow rate allocator were utilized to identify the slowest flow in a coflow and adjust the rate of the accompanying flows to the slowest one, so that networking resources can be used by other coflows. Trace-driven simulations indicated that Varys achieved 3.66×, 5.53×, and 5.65× improvements compared to fair sharing, per-flow scheduling, and FIFO, respectively. Baraat in [263] suggested decentralized task-aware scheduling for coflows to reduce their tail completion times. FIFO with limited multiplexing (FIFO-LM) is used to schedule the tasks while avoiding head-of-line blocking for small flows by changing the multiplexing level when heavy tasks arrive. Compared to pFabric [622] and Orchestra, the completion time of 95% of MapReduce workloads was reduced by 43% and 93%, respectively.
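To make the coflow-level rate adjustment described above concrete, the following minimal Python sketch illustrates the slowest-flow ("bottleneck") allocation idea used by Varys-like schedulers [262]: every flow in a coflow is paced to finish together with the coflow's slowest flow, which frees residual bandwidth for other coflows. The data structures, single-bottleneck-link model, and function names are illustrative assumptions, not the actual Varys implementation.

```python
def allocate_coflow_rates(flows, link_bw):
    """Slowest-flow ('bottleneck') rate allocation for one coflow.

    flows: list of dicts {'id', 'bytes', 'link'} - remaining bytes and the
           bottleneck link each flow traverses (illustrative structure).
    link_bw: dict mapping link -> available bandwidth in bytes/s.
    Returns per-flow rates such that every flow finishes with the slowest
    one, leaving residual bandwidth for other coflows.
    """
    # Coflow completion time if every flow used its full link bandwidth.
    slowest = max(f['bytes'] / link_bw[f['link']] for f in flows)
    # Pace all flows to finish together at the coflow's bottleneck time.
    return {f['id']: f['bytes'] / slowest for f in flows}

# Example: two shuffle flows belonging to the same coflow.
flows = [{'id': 'f1', 'bytes': 8e9, 'link': 'rackA-up'},
         {'id': 'f2', 'bytes': 2e9, 'link': 'rackB-up'}]
link_bw = {'rackA-up': 1e9, 'rackB-up': 1e9}   # 1 GB/s each
print(allocate_coflow_rates(flows, link_bw))   # f1 at 1 GB/s, f2 at 0.25 GB/s
```

In Varys itself, this per-coflow allocation is combined with a Smallest-Effective-Bottleneck-First (SEBF) ordering across coflows, as summarized in Table VI.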
A decentralized coflow-aware scheduling system (D-CAS) that dynamically and explicitly sets the priorities of flows according to the maximum load at the senders was proposed in [264] to minimize coflow completion time. D-CAS achieved a performance close to Varys, with up to 15% difference, and outperformed Baraat by 1.4 and 4 times for homogeneous and heterogeneous workloads, respectively. Rapier in [265] integrated routing and scheduling at the coflow level to optimize the performance of big data applications in DCNs with commodity switches. Compared to Varys with ECMP and to optimized routing alone, improvements of about 80% and 60% in coflow completion time were achieved, respectively. Scheduling traffic in DCNs within SDN environments was addressed in [266]-[269]. Online scheduling of multicast flows in Fat-tree DCNs was considered in [266] to ensure bounds on congestion and improve the throughput and load balancing. A centralized Bounded Congestion Multicast Scheduling (BCMS) algorithm was developed for use with OpenFlow, and improvements compared to VLB and random scheduling were reported. Pythia in [267] focused on improving the network performance under skewed workloads. Run-time prediction of intermediate data sizes and centralized OpenFlow-based control were utilized, and up to 46% reduction in completion time was obtained compared to ECMP. An SDN OpenFlow network with Combined Input and Crosspoint Queued (CICQ) switches was proposed in [268] to dynamically schedule the packets of different big data applications. An application-aware computing and networking resource allocation scheme based on neural network predictions was developed to allocate adequate resources so that the SLA is minimally violated. For five different DCN applications, the scheme achieved an SLA violation rate below 4% at the cost of a 9.21% increase in the energy consumption. The authors in [269] proposed and experimentally demonstrated a heuristic for Bandwidth-Aware Scheduling with SDN (BASS) to minimize the job completion time in Hadoop clusters. The heuristic prioritizes local task scheduling but considers remote assignment, together with the link assignment for the associated data transfers, whenever this reduces the total completion time. BASS was found to reduce the completion time by 10%. The work reported in [270]-[273] focused on scheduling traffic in optical and hybrid DCNs. In [270], the gaps between the performance of OCS interconnects and the requirements of latency-sensitive applications were addressed. Two scheduling algorithms, a centralized Static Circuit Flexible Topology (SCFT) and a distributed Flexible Circuit, Flexible Topology (FCFT), were proposed, and up to 2.44× improvement over Mordia [283] was achieved. Resource allocation in NEPHELE was addressed in [271] while accounting for the SDN controller delay and the random allocation of iterative MapReduce tasks. Compared to Mordia, NEPHELE uses multiple WDM rings instead of one to enable efficient use of resources, and introduces application-aware and feedback-based synchronous slotted scheduling algorithms. In [272], different end-to-end continuous-time scheduling algorithms were designed for a hybrid packet optical DCN proposed by the same authors, and were compared using random traffic. Effective traffic scheduling for a proposed Packet-Switched Optical Network (PSON) DCN with space switches and layers of AWGRs was examined in [273].
To treat traffic flows differently and optimally based on their type, three machine-learning flow detection algorithms were used and compared in terms of accuracy and classification speed. Scheduling algorithms that consider the priority of flows and the occupancy of buffers were proposed, and improvements in terms of packet loss ratio and average delay compared to Round Robin were reported. The authors in [274] maximized the throughput of data retrieval for single and multiple applications in tree-based DCNs while accounting for link bandwidths and the allocation of data replicas. The proposed approximation algorithm achieved near-optimal performance and improved retrieval time compared to random scheduling. A topology-aware heuristic for data placement that minimizes remote data access costs in geo-distributed MapReduce clusters was proposed in [275] based on an optimized replica-balanced distribution tree. CoMan was proposed in [276] to improve bandwidth utilization and reduce the completion time of several big data frameworks in multiplexed data centers. A Virtual Link Group (VLG) abstraction was utilized to define a shared bandwidth resource pool, and an approximation algorithm was developed. Compared to ECMP with ElasticSwitch [133], the bandwidth utilization and completion time were improved by 2.83× and 6.68×, respectively. The authors in [277] proposed probabilistic map and reduce task scheduling algorithms that consider the data center topology and the link statuses in computing the cost and latency of data transmission. Based on these calculations, the algorithm incorporates randomness in the assignments, where the tasks with the least transmission time are scheduled on the available slots with higher probability. Thus, locality and completion time are balanced without enforcement. Compared to a coupling scheduling method and to fair scheduling, reductions of 17% and 46% were achieved. A Network-Aware Scheduler (NAS) was examined in [278] for improving the shuffling phase in multi-rack clusters. NAS utilizes three algorithms: Map Task Scheduling (MTS) to balance cross-node traffic caused by skewness, Congestion-Avoidance Reduce Task Scheduling (CA-RTS) to reduce cross-rack traffic, and Congestion-Reduction RTS (CR-RTS) to prioritize jobs with light shuffle traffic. Compared to state-of-the-art schedulers such as fair and delay scheduling, improvements in throughput by up to 62%, and reductions in the average completion time by up to 44% and in cross-rack traffic by up to 40%, were reported. A fine-grained framework, Time-Interleaved Virtual Cluster (TIVC), that accounts for the dynamics of big data applications' bandwidth requirements was proposed and tested in a system named PROTEUS in [279]. Based on observing, via profiling four workloads, that the traffic peaked only during 30-60% of the execution time, TIVC targeted reducing the over-provisioning of bandwidth and allowing overlapped scheduling of jobs. First, the user's application was profiled and a cost model based on bandwidth cap and performance trade-offs was generated, then a dynamic programming-based spatio-temporal job allocation algorithm was used to maximize the utilization and revenue. Compared with Oktopus in [218] for mixed batch jobs, a reduction of 34.5% in the completion time was achieved. For dynamically arriving jobs at 80% load, PROTEUS reduced the rejection ratio to 3.4%, compared with the 9.5% reported for Oktopus.
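The temporal-interleaving idea behind TIVC/PROTEUS [279] can be illustrated with a simple admission test: a new job is accepted only if its time-varying bandwidth profile, added to the profiles of already-admitted jobs, never exceeds the link capacity in any time slot. The sketch below is a simplified single-link model with made-up per-slot profiles; it is not the profiling or dynamic-programming allocation actually used in PROTEUS.

```python
def can_admit(existing, new_job, link_cap):
    """Temporal-interleaving admission test in the spirit of TIVC/PROTEUS.

    existing: list of per-slot bandwidth profiles (e.g., in Gbps) of
              already-admitted jobs; new_job: profile of the candidate job.
    The job is admitted only if the summed demand stays within link_cap in
    every slot. Single-link model and values are illustrative only.
    """
    horizon = max([len(new_job)] + [len(p) for p in existing])
    for t in range(horizon):
        demand = sum(p[t] for p in existing if t < len(p))
        demand += new_job[t] if t < len(new_job) else 0
        if demand > link_cap:
            return False
    return True

# Two admitted jobs whose shuffle peaks do not overlap, and two candidates.
admitted = [[2, 8, 2, 0], [0, 2, 8, 2]]
print(can_admit(admitted, [6, 0, 0, 6], link_cap=10))   # True
print(can_admit(admitted, [0, 6, 0, 0], link_cap=10))   # False (slot 1: 8+2+6)
```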
The authors in [280] explored the benefits of planning ahead for MapReduce and DAG-based production workloads and developed Corral, which jointly optimizes data placement and task scheduling at the rack level. Corral modified HDFS and the AM to enable rack selection for data and task placement and hence increase locality and reduce the overall cross-rack traffic. Compared to native YARN with the capacity scheduler, the makespan was reduced by up to 33% while cross-rack traffic was reduced by up to 90%. Simulation results for integrating Corral with Varys [262] showed a clear improvement over using the capacity scheduler with Varys and over using Corral with TCP, indicating the importance of integrating network and data/task scheduling solutions. The authors in [281] proposed Mercury as an extension of YARN that enables hybrid resource management in large clusters by allowing the AMs of applications to request either guaranteed containers via centralized scheduling or queueable containers via distributed scheduling, according to their needs. Compared to default YARN, an average improvement of 35% in throughput was obtained. The work in [282] optimized the allocation of container resources in terms of reduced congestion and improved data locality by considering the networking resources and the data locations. Compared to the default placement in YARN, a 67% reduction in completion time for network-intensive workloads was obtained. The energy efficiency of data centers through workload and traffic scheduling was considered in [283]-[286]. The work in [283] examined the performance-energy tradeoffs when using the Low Power Idle (LPI) link sleep mode of the EEE standard [639] with MapReduce workloads. The timing parameters of entering and leaving the LPI mode in 10GbE links were optimized while utilizing packet coalescing (i.e., delaying outgoing packets during the quiet mode and aggregating them for transmission in the following active mode). Depending on the oversubscription ratio and the workloads, EEE achieved power savings of between 5 and 8 times compared to legacy Ethernet. A detailed study of the optimum packet coalescing settings under different TCP settings and with shallow and deep buffers (i.e., 128 kB and 10 MB per port) was provided, and an additional improvement by a factor of two was achieved. Scheduling applications on fewer machines to save energy in the networking devices of large-scale data centers was considered in [284] through a Hierarchical Scheduling Algorithm (HSA). The workloads were first allocated to nodes according to their subsets (i.e., connectivity with ToR switches), then according to the levels of the higher-layer switches and the associated data transmission costs. HSA was tested with a Dynamic Max Node Sorting Method and two classical bin packing solutions, and the proposed method resulted in improved performance and stability. Willow in [285] aimed to reduce the energy consumption of switches in Fat-tree DCNs through SDN-based dynamic scheduling while accounting for the performance of network-limited elastic flows. The number of activated switches, as well as their usage duration, was considered in computing the routing path per flow. Compared to ECMP and classical heuristics (i.e., simulated annealing and particle swarm optimization), up to 60% savings were achieved. The authors in [286] proposed JouleMR as a green-aware and cost-effective task and job scheduling framework for MapReduce workloads that maximizes renewable energy usage while accounting for the dynamic pricing of brown energy and for renewable energy storage.
To obtain a job/task scheduling plan, performance-energy consumption models were utilized to guide a two-phase heuristic. The first phase optimizes the start times of tasks/jobs to reduce the brown energy usage, and the second phase optimizes the resources assigned per task/job to further reduce it while satisfying soft deadlines. Compared to Hadoop with YARN, a reduction of 35% in brown energy usage and of 21% in overall energy consumption was obtained.
D. Performance Improvements based on Advances in Computing, Storage, and Networking Technologies:
Comparisons of scale-up and scale-out systems for Hadoop were discussed in [287]-[290]. In [287], the authors showed that running sub-terascale Hadoop workloads on a single scaled-up server can be more efficient than using scale-out clusters. For the scale-up system, several optimizations were suggested for the storage, the number of concurrent tasks, the heartbeats, the JVM heap size, and the shuffling operation. A hybrid Hadoop cluster with scale-out and scale-up servers was proposed in [288]. Based on profiling the performance of workloads on scale-out and scale-up clusters, a scheduler that assigns each job to the better-suited cluster was designed. A study measuring the impact on HDFS of improved networking, such as InfiniBand and 10 Gigabit Ethernet, of the selected protocol, and of SSD storage systems was presented in [289]. With HDD storage, enhanced networking achieved up to 100% performance improvement, while with SSDs the improvement was by up to 219%. The authors in [290] discussed the need to optimize task scheduling and data placement for data-centric workloads in compute-centric (i.e., HPC) clusters. A comprehensive evaluation was conducted to study the impact of storing intermediate RDDs in RAM, on local SSDs, and on the cluster's parallel file system (which requires write locks) on Spark workloads with different compute and I/O requirements. An Enhanced Load Balancer scheduler was suggested, and a Congestion-Aware Task Dispatching to SSD mechanism was also proposed. Enhancements of 25% and 41.2% were achieved. It was also found that increasing the chunk size from 32 MB to 128 MB reduced the job execution time by 15.9% as scheduling overheads were reduced. The performance of big data applications under SSD storage was addressed in [291]-[294]. The work in [291] examined the performance improvements and economics (i.e., cost-per-performance) of the partial and full use of SSDs in MapReduce clusters. A comparison between SATA Hard Disk Drives (HDDs) and Peripheral Component Interconnect express (PCIe) SSDs with the same aggregate I/O bandwidth showed that SSDs achieved up to 70% better performance at 2.4× the cost of HDDs. When upgrading HDD systems with SSDs, proper configurations with multiple HDFSs should be considered. The authors in [292] discussed the need to modify Hadoop frameworks when using SSDs and proposed a direct map output collection method, a pre-read model for HDFS, a reduce scheduler to address skewness, and a modified data block placement policy to extend the lifetime of SSDs. The first two methods achieved improvements of 30% and 18%, the scheduler produced an improvement of 12%, and the overall improvement was 21%. To accelerate I/O-intensive and memory-intensive MapReduce jobs, mpCache, which dynamically caches input data from HDDs into SATA SSDs, was proposed and examined in [293], and average speedups of 2× were gained.
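As a rough illustration of SSD-based input caching in the spirit of mpCache [293], the sketch below admits HDFS input splits into a bounded SSD cache and evicts them in LRU order. The class name, API, sizes, and the plain LRU policy are illustrative assumptions; mpCache's actual design includes an admission control policy and a main-cache replacement scheme (see Table VI).

```python
from collections import OrderedDict

class SSDCache:
    """Minimal SSD input-split cache: hot HDFS splits are copied from HDD
    into a bounded SSD cache and evicted in LRU order. Illustrative only,
    not mpCache's actual design."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()              # split_id -> size in bytes

    def read(self, split_id, size, read_from_hdd):
        if split_id in self.entries:              # SSD hit: fast path
            self.entries.move_to_end(split_id)
            return 'ssd'
        read_from_hdd(split_id)                   # miss: serve this read from HDD
        while self.used + size > self.capacity and self.entries:
            _, evicted_size = self.entries.popitem(last=False)   # evict LRU
            self.used -= evicted_size
        if size <= self.capacity:                 # admit into the SSD cache
            self.entries[split_id] = size
            self.used += size
        return 'hdd'

cache = SSDCache(capacity_bytes=2 * 128 * 1024**2)   # room for two 128 MB splits
fetch = lambda s: None                                # stand-in for an HDD read
print([cache.read(s, 128 * 1024**2, fetch) for s in ['a', 'b', 'a', 'c', 'a']])
# ['hdd', 'hdd', 'ssd', 'hdd', 'ssd']  (split a stays hot; c evicts the LRU split b)
```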
The aggregate performance of Non-Volatile Memory Express (NVMe) SSDs with a high number of Docker containers for database applications was evaluated in [294]. Under an optimized selection of the instance count, up to 6× higher write throughput can be obtained compared to a single instance. With multiple applications, the performance can degrade by 50%. Optimizing data transfers from networked storage systems was addressed in [295]-[297]. Google traces were used in [295] to examine the performance of Hadoop with Network Attached Storage (NAS). MRPerf [111] was utilized to evaluate the slowdown factor of using a NAS, which requires rack-remote transfers, instead of conventional Direct Attached Storage (DAS) for different numbers of racks and different workloads. The results showed that CPU-intensive workloads had the smallest slowdown factor and that TeraSort workloads can benefit the most from an assumed enhanced NAS. FlexDAS in [296] was proposed as a switch based on SAS expanders that forms a Disk Area Network (DAN) to provide flexibility in connecting computing nodes to, and disconnecting them from, HDDs while maintaining the I/O performance of DAS. Results based on a prototype showed that FlexDAS can detach and attach an HDD in 1.16 seconds, and for three I/O-intensive applications it achieved the same I/O performance as DAS. The work in [297] addressed optimizing and balancing the I/O operations required for terabit-scale data transfers from centralized Parallel File Systems (PFS) over WANs for competing scientific applications. An end-to-end layout-aware optimization based on a scheduling algorithm for the sources (i.e., disks) and for the sinks (i.e., requesting servers) was utilized to coordinate the use of the network and storage and to reduce I/O imbalances caused by congestion at some disks. Remote Direct Memory Access (RDMA), which enables zero-copy transfers (i.e., accessing a remote machine's RAM without involving its operating system), was utilized in [298], [299] to improve the performance of big data applications. Fast Remote Memory (FaRM) was proposed in [298] as a system for key-value and graph stores that regards the RAM of the machines in a cluster as a shared address space. FaRM utilizes one-sided RDMA reads for lock-free serializable access, RDMA writes for fast message passing, and single-machine transactions with optimized locality. A kernel driver, PhyCo, was developed to enable addressing 2 GB memory regions and overcome the limited number of entries in NIC page tables. Compared to TCP/IP, FaRM improved the request rate by 11× and 9× for request sizes of 16 and 512 bytes, respectively. Hadoop-A in [299] reduced the overheads of materializing intermediate data in MapReduce by proposing a virtual shuffling mechanism that utilizes RDMA over InfiniBand and optimizes the use of memory buffers. Improvements of up to 27% compared to default shuffling and a 12% reduction in power consumption were achieved.
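The shared address space abstraction used by FaRM [298] can be pictured with the toy model below, in which every address maps to a 2 GB region owned by one machine and a read fetches the value without running any code on the owning machine. Real FaRM relies on RDMA-capable NICs and the PhyCo kernel driver; neither is modeled here, and the class, the region-to-machine mapping, and all names are illustrative assumptions.

```python
class SharedAddressSpace:
    """Toy model of a cluster-wide shared address space (FaRM-like idea):
    reads fetch remote regions directly ('one-sided'), i.e. without running
    application code on the owning machine. Local simulation only."""
    REGION = 2 * 1024**3                          # 2 GB regions (size mentioned for PhyCo)

    def __init__(self, machines):
        self.machines = machines
        self.memory = {m: {} for m in machines}   # per-machine (region, offset) -> value

    def locate(self, addr):
        region = addr // self.REGION
        return self.machines[region % len(self.machines)], region

    def write(self, addr, value):
        machine, region = self.locate(addr)
        self.memory[machine][(region, addr % self.REGION)] = value

    def read(self, addr):                         # stand-in for a one-sided read
        machine, region = self.locate(addr)
        return self.memory[machine].get((region, addr % self.REGION))

space = SharedAddressSpace(['m0', 'm1', 'm2'])
space.write(5 * 2 * 1024**3 + 42, b'value')       # lands in region 5, owned by m2
print(space.read(5 * 2 * 1024**3 + 42))           # b'value'
```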
The author in [301] proposed a system for exascale computation that includes low-power processors, byte-addressable NAND flash as the main memory, and Dynamic RAM (DRAM) as a cache for the flash. A total of 2550 nodes in 51 boards were designed to have their main memories connected as a Hoffman-Singleton graph, which provides 2.6 PB capacity and 9.1 TB/s bisection bandwidth, and reduces the latency to at most four hops. Mars in [302] was proposed as a MapReduce runtime system that utilized NVIDIA GPUs, AMD GPUs, and multi-core CPUs to accelerate the computations. To ease the programming of MapReduce on GPUs, the MarsCUDA programming framework was introduced to automate the task partitioning, data allocation, and thread allocation to key-value pairs. The results showed that Mars outperformed Phoenix, a CPU-only MapReduce system for multi-core CPUs, by 22 times. The authors in [303] proposed coupling the use of CPUs and GPUs to accelerate map tasks with the use of MPTCP to reduce shuffling time, thereby improving the performance and reliability of Hadoop. FPMR was proposed in [304] as a framework that helps developers boost MapReduce for machine learning and data mining on FPGAs while masking the details of task scheduling, data management, and communications between the FPGAs and the CPU. A processor scheduler was implemented in the FPGA to control the assignment of map and reduce tasks, and a Common Data Path (CDP) was built to perform data transfers to several tasks without redundancy. As an example, a compute-intensive ranking algorithm, RankBoost, was mapped onto FPMR. Compared to a CPU-only implementation, speedups of 16.74× and 31.8× were achieved without and with the CDP, respectively. SODA in [305] introduced a framework for software-defined accelerators based on FPGAs. To target heterogeneous architectures, SODA utilizes a component-based programming model for the accelerators and dataflow execution phases to parallelize the tasks with an out-of-order scheduling scheme. For a Constrained Shortest Path Finding (CSPF) algorithm for SDN applications with 128 nodes, an acceleration of 43.75× was achieved. To accelerate MapReduce with high energy efficiency in hosting data centers, the authors in [306] extended the current High Level Synthesis (HLS) toolflow to include the MapReduce framework, which allows customizing the acceleration and scaling to the data center level. To allow an easy transition from C/C++ to Register Transfer Level (RTL) designs, the tasks and data paths in MapReduce were decoupled, the accessible memory for each task was isolated, and a dataflow computational approach was enforced. The throughput was improved by up to 4.3× while achieving a two-orders-of-magnitude improvement in energy efficiency compared to a multi-core processor. A multilevel NoSQL cache, implemented as an FPGA-based in-NIC cache and a software-based in-kernel cache, was proposed in [307] to accelerate NoSQL processing. To efficiently process entries with different sizes, multiple heterogeneous Processing Elements (PEs) (i.e., a string PE, a hash PE, a list PE, and a set PE) were utilized instead of pipelining. FPGA-based NICs were also utilized in [308] to perform one-at-a-time processing for data streaming instead of the micro-batch processing of Apache Spark Streaming, and filtering the data in the NIC was suggested to reduce the volume of processed data. Experimental results demonstrated that the WordCount throughput was improved by 22× and the change point detection latency was reduced by 94.12%.
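The contrast between NIC-side filtering with one-at-a-time processing, as explored in [308], and host-side processing can be sketched as follows: a filter predicate standing in for the NIC logic drops irrelevant records before they reach the host, which then processes the survivors record by record rather than in micro-batches. The predicate, record format, and WordCount host logic are illustrative assumptions only.

```python
def nic_filter(records, predicate):
    """Stand-in for NIC-side filtering: records that fail the predicate are
    dropped before reaching the host, and survivors are forwarded one at a
    time rather than accumulated into micro-batches. Illustrative only."""
    for rec in records:
        if predicate(rec):
            yield rec                    # forwarded to the host immediately

def host_wordcount(stream):
    counts = {}
    for line in stream:                  # one-at-a-time processing on the host
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

lines = ['error disk full', 'ok heartbeat', 'error net down', 'ok heartbeat']
print(host_wordcount(nic_filter(lines, lambda l: l.startswith('error'))))
# {'error': 2, 'disk': 1, 'full': 1, 'net': 1, 'down': 1}
```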
The authors in [309] examined the challenges associated with partitioning large graphs for the Graphlet counting algorithm from bioinformatics and proposed a framework for reconfigurable acceleration in FPGAs that achieved an order-of-magnitude improvement and better scalability compared to processing on quad-core CPUs. The challenges for big data applications in disaggregated data centers were discussed in [310], [311]. In [310], the network requirements for supporting big data applications in disaggregated data centers were discussed. The key findings were that existing networking hardware providing between 40 Gbps and 100 Gbps data rates is sufficient for rack-scale, and even data center-scale, disaggregation with some workloads, provided that the network software is improved and the protocols used are deadline-aware. In [311], the authors demonstrated the feasibility of composable rack-scale disaggregation for heterogeneous big data workloads, and a negligible latency impact was recorded for some of the workloads, such as Memcached. Table VI provides the second part of the summary of data center-focused optimization studies reported in this section.

TABLE VI SUMMARY OF DATA CENTER -FOCUSED OPTIMIZATION STUDIES - II Ref Objective Application/ Tools Benchmarks/workloads Experimental Setup/Simulation environment platform [258]* Orchestra: Task-aware cluster management Spark BitTorrent-like protocol, Facebook traces EC2 instances, DETERlab cluster to reduce the completion time of data transfers Weighted Shuffle Scheduling [259]* Seawall: Centralized network bandwidth - Shim layer, Strawman Web workloads Cosmos; 60 nodes in 3 racks (Xeon 2.27 GHz CPU, allocation scheme in multitenant environments bandwidth allocator 4GB RAM), 1 Gbps NIC ports [260]* FlowComb: Transparent network-management Hadoop 1.x Software agents, Sort 10GB 14 nodes framework for high utilization and fast OpenFlow processing [261]* Distributed Flow Scheduling (DiFS) for - Imbalance Detection and, Deterministic, random, Packet-level stand-alone simulator for Fat-tree adaptive routing in hirarchical data centers Explicit Adaption and shuffling flows (120 with 16 hosts and 1024 hosts algorithm GB) [262]* Varys: Centralized co-flow scheduling to reduce - Smallest-Effective- Facebook traces Extra large high-memory instances in a 100-machine EC2 cluster, completion time of data-intensive applications Bottleneck First (SEBF) task-level trace-driven simulations heuristic [263]* Baraat: Decentralized task-aware scheduler for Memchached FIFO limited Bing, Cosmos, 20 nodes in 5 racks testbed (Quad core, 4GB RAM), 1 GbE co-flows in DCNs to reduce tail completion time multiplexing, Smart and Facebook traces flow-based simulations, ns-2 large-scale simulations Priority Class (SPC) [264]* D-CAS: A decentralized coflow-aware - Online 2-approximation Facebook traces Python-based trace-driven simulations scheduling in large-scale DCNs scheduling algorithm [265]* Rapier: coflow-aware joint routing - Online heuristic, Many-to-many Spine-leaf DCN with 9 nodes (4 CPU cores,8GB RAM, 500GB disk) and scheduling in DCNs OpenFlow communication patterns large-scale event-based flow-level simulations for 512 nodes Fat-tree and VL2 [266]* Low-complexity online multicast - BCMS algorithm Unicast and mixed Event-driven flow simulations for 32-ary Fat-tree flows scheduling in Fat-tree DCNs synthetic traffic [267]* Pythia: real-time prediction of data Hadoop 1.1.2 Application-specific HiBench benckmarks: 10 nodes (12 CPU cores, 128 GB RAM, 1 SATA disk) in communication volume to accelerate instrumentation, Sort and Nutch index two racks, OpenFlow-enabled ToRs MapReduce OpenFlow [268]* Application-aware Resources Allocation - Neural networks Different applications Cloudsim-based simulations for 20 servers with 80 VMs scheme for SDN-based data centers traffic [269]* BASS: Bandwidth-Aware Scheduling Hadoop 1.2.1 OVS, time-slotted WordCount, Sort 5-nodes cluster with SDN in Hadoop clusters bandwidth allocation algorithm [270]* Scheduling in dynamic OCS-switching data - SCFT and FCFT NAS Parallel Benchmark OCSEMU emulator with 20 nodes (E5520 CPU, centers for delay-sensitive applications scheduling algorithms (NPB) suite 24 GB RAM, 500 GB disk) [271]* Slotted resource allocation in an - Online and incremental Iterative MapReduce OMNET++ 4.3.1-based packet-level simulations SDN control-based hybrid DCN scheduling heuristics workloads

[272]* Scheduling algorithms in hybrid - MILP, heuristics 10k-20k random 1024 nodes in two-tier network packet optical data centers requests (4 core MEMS switches, and 64 ToRs) [273]* Scheduling in packet-switched optical DCNs - Mahout, C4.5, and Naïve Random traffic Simulations for 80 ToRs connected by two 40×40 AWGRs Bayes Discretization and two 1 × 2 space switches (NBD) [274]* Max-throughput data transfer scheduling - MILP, heuristic 1-1, many-1 Bulk Simulations for three-tier and Fat-tree DCNs with 128 nodes and to minimize data retrieval time transfers VL2 with 160 nodes. [275]* Topology-aware data placement Hadoop 2.2.0 Heuristic WordCount, TeraSort, 18 nodes, TopoSim MapReduce simulator in geo-distributed data centers scheduler k-means [276]* CoMan: Global in-network management in - 3/2 approximation 55 flows from Emulation for a Fat-tree DCN with 10 switches and 8 servers multiplexed data centres algorithm 10 applications (using Pica8 3297), trace-driven simulations [277]* Network-aware MapReduce tasks placement Hadoop 1.2.1 Probabilistic tasks Wordcount, Grep from Palmetto HPC platform with 1978 slave nodes (16 cores CPU, 16GB to reduce transmission costs in DCNs scheduling algorithm, BigDataBench, Terasort RAM, 3GB disk), 10 GbE [278]* Network-aware Scheduler (NAS) Hadoop 2.x MTS, CA-RTS, and CR- Facebook traces 40 nodes in 8 racks cluster with 1 GbE for cross-rack links for high performance shuffling RTS scheduling Trace-driven event-based simulations for 600 nodes in 30 racks with algorithms 200 users

[279]* PROTEUS: system for Temporally Interleaved Hive, Profiling, Sort, WordCount, Join, Profiling: 33 nodes (4 cores 2.4GHz CPU, 4GB RAM), Gigabit Virtual Cluster (TIVC) abstraction for multi- Hadoop dynamic programming aggregation switch tenant data centers 0.20.0 prototype: 18-nodes in three-tier data center with NetFPGA switches, simulations [280]* Corral: offline scheduling framework to reduce Hadoop 2.4 Offline planner, SWIM and Microsoft 210 nodes (32 cores) in 7 racks, 10GbE cross-rack traffic for production workloads cluster scheduler traces, TPC-H large-scale simulations for 2000 nodes in 50 racks [281]* Mercury: Fine-grained hybrid resources Hadoop 2.4.1 Daemon in each node, Gridmix and Microsoft’s 256 nodes cluster (2 8-core CPU, 128GB RAM, 10 3TB disks), management and scheduling extensions to YARN production traces 10 Gbps in rack, 6 Gbps between racks [282]* Optimizing containers allocation to reduce Hadoop 2.7, Simulated annealing, k-means, 8 nodes (8-core CPU, 32GB RAM, SATA disk) in a Fat-tree topology congestion and improve data locality Apache Flink modification to AM connected components 1 GbE, OpenFlow 1.1 0.9 [283]* Examining the impact of the Energy Hadoop 1.x EEE in switches, TeraSort, Search, MRperf-based packet-level simulations Efficient Ethernet on MapReduce workloads packets coalescing Index for two racks cluster with up to 80 nodes [284]* Energy-aware hierarchical applications - k th max node sorting, Uniform / normal random C++ based simulations for up to 4096 nodes scheduling in large DCNs dynamic max node applications demands with 32, 64, 128, and 256 ports switches sorting algorithms [285]* Willow: Energy-efficient SDN-based flows Hadoop 1.x Online greedy Locally generated 16 nodes (AMD CPU, 2GB RAM), 1 GbE, Simulations for Fat-tree scheduling in data centers with network-limited approximation algorithm MapReduce traces and Fat-tree with disabled core switches, 1 GbE workloads [286]* JouleMR: Cost-effective and energy-aware Hadoop 2.6.0 Two-phase heuristic TeraSort, GridMix 10 nodes cluster (2 6-cores CPUs, 24GB RAM, 500GB disk), 10 GbE MapReduce jobs and tasks scheduling Facebook traces simulations for 600 nodes in tree-structured DCN, with1GbE links [287]* Evaluation of scale-up vs scale-out Vanilla Parameters optimization Log processing, select, 16 nodes (Quad-core CPU, 12GB RAM, 160GB HDD, 32GB SSD), 1 systems for Hadoop workloads Hadoop to Hadoop in scale-up aggregate, join, TeraSort, GbE, Dell PowerEdge910 (4 8-core CPU, 512GB RAM, 2 600GB servers k-means, indexing HDD, 8 240GB SSD) [288]° Hybrid scale-out and scale-up Hadoop 1.2.1 Automatic Hadoop WordCount, Grep, 12 nodes (2 4-core CPU, 16GB RAM, 193GB HDD), 10 Gbps Hadoop cluster parameters optimization, Terasort, TestDFSIO, Mytinet, 2 nodes (4 6-core CPU, 505GB RAM, 91GB HDD), 10 Gbps scheduler to select cluster Facebook traces Mytinet, OrangeFS [289]° Measuring the impact of InfiniBand and Hadoop Testing different network Sort, random write, 10 nodes cluster (Quad-core CPU, 6GB RAM, 250GB HDD or (up to 10Gigabit on HDFS 0.20.0 interfaces and sequential write 4), 64GB SSD), InfiniBand DDR 16 Gbps links and 10GbE technologies [290]° Optimizing HPC clusters for dual compute- Spark 0.7.0 Enhanced Load Balancer, GroupBy, Grep, Logistic Hyperion cluster with 101 nodes ( 2 8-core CPUs, 64GB RAM, centric and data-centric computations Congestion-Aware Task Regression (LR) 128GB SSD) in 2 racks, InfiniBand QDR 32 Gbps links Dispatching [291]° Cost-per-performance evaluation of MapReduce Hadoop 2.x 
Standalone evaluation Teragen, Terasort, (8 core CPU, 48GB RAM ) 10 GbE, single rack, SSDs 1.3TB, HHDs clusters with SSDs and HDDs with different storage and Teravalidate, WordCount, 2TB setups 6 HDDs, 11 HDDs, 1 SSD, (6 HHDs +1 SSD) application shuffle, HDFS configurations [292]° Improving Hadoop framework for Hadoop Modified map data Terasort, DFSIO 16 nodes (8 core CPU, 256GB RAM, 600GB HDD, 512GB SATA better utilization of SSDs 2.6.0, handling, pre-read in SSD, 10GbE) 8 nodes (6 core CPU, 256GB RAM, 250GB HDD, Spark 1.3 HDFS, reduce scheduler, 800GB NVMe SSD, 10GbE) Ceres-II 39 nodes (6 core CPU, 64GB placement policy RAM, 960GB NVMe SSD, 40GbE) [293]° mpCache: SSD-based acceleration for Hadoop 2.2.0 Modified HDFS, Grep, WC - Wikipedia 7 nodes (2 8-core CPUs, 20MB cache, 32GB RAM, 2TB SATA MapReduce workloads in commodity clusters admission control classification - Netflix, HDD, 2 160GB SATA SSD) policy, main cache PUMA replacement scheme, [294]° Evaluation of IO-intensive applications Cassandra Docker’s data volume Cassandra-stress, (32 core hyper-threaded CPU, 20480KB cache, 128GB RAM, with Docker containers 3.0, TPC-C benchmark and 3 960GB NVMe SSD) MySQL 5.7, FIO 2.2 [295]° Examining the performance of different Hadoop - Modification to MRPerf Synthesized Google MRperf-based simulations for 2 and 4 racks clusters schedulers and impact of NAS on Hadoop to implement Fair share traces for 9174 jobs scheduler and Quincy [296]° Improving the flexibility of direct attached Hadoop FlexDAS switch, SAS TeraSort for 1TB, YCSB 12 nodes (4 core CPU, 48GB RAM), 10GbE and 48 external HDDs storage via a Disk Area Network (DAN) 2.0.5, expanders, Host Bus COSBench 0.3.0.0 Cassandra Adapters (HBAs) Swift 1.10.0 [297]° Optimizing bulk data transfers from parallel Scientific Layout-aware source Offline Storage Tables Spider storage system at the Oak Ridge Leadership computing file systems via layout-aware I/O operations applications algorithm, layout-aware (OSTs) with 20MB and Facility with 96 DataDirect Network (DDN) S2A9900 RAID sink algorithm 256MB files controllers for 13400 1TB SATA HDDs [298]° FaRM: Fast Remote Memory system using Key-value Circular buffer for YCSB 20 nodes (2 8-core CPUs, 128GB RAM, 240GB SSD), 40Gbps RDMA to improve in-memory key-value store and graph RDMA messaging, RDMA over Converged Ethernet (RoCE) stores PhyCo kernel driver, ZooKeeper [299]° HadoopA: Virtual shuffling for efficient Hadoop virtual shuffling through TeraSort, Grep, 21-nodes cluster (dual socket quad-core 2.13 GHz Intel Xeon, 8 GB data movement in MapReduce 0.20.0 three-level segment WordCount, and RAM, 500 GB disk, 8 PCI-Express 2.0 bus), InfiniBand QDR switch, near-demand merging, Hive workloads 48-port 10GigE and merging sub-trees

[300]° Micro-architectural characterization of scale- - Intel VTune for Caching, NoSQL, PowerEdge M1000e (2 Intel 6-core CPUs, 32KB out workloads characterization MapReduce, media L1 cache, 256KB L2 cache, 12MB L3 cache, 24GB RAM) PARSEC, SPEC2006, TPC-E [301]° A 2 PetaFLOP, 3 Petabyte, 9 TB/s - Hoffman-Singleton - 2,550 nodes (64-bit Boston chip, 64GB DDR3 SDRAM, 1TB NAND and 90 kW cabinet for exascale computations topology flash)

[302]° Mars: Accelerating MapReduce Hadoop 1.x CUDA, Hadoop String match, matrix 240-core NVIDIA GTX280+ 4-core CPU), (128-core NVIDIA with GPUs streaming, GPU Prefix multiplication, MC, 8800GTX + 4-core CPU), (320 core ATI Radeon HD 3870 + 2-core Sum routine, GPU Black-scholes, similarity CPU) Bitonic Sort routine score, PCA [303]° Using GPUs and MPTCP to Hadoop 1.x CUDA, Hadoop pipes, Terasort, WordCount and Node1 (Intel i7 920, 24GB RAM, 4 1TB HDD), node2 (Intel Quad improve Hadoop performance MPTCP PiEstimate DataGen Q9400, NVIDIA GT530 GPU, 4GB RAM, 500GB HDD), heterogeneous 5 nodes [304]° FPMR: a MapReduce framework - On-chip processor RankBoost 1 node (Intel Pentium 4, 4GB RAM), Altera Stratix II EP2S180F1508 on FPGA to accelerate RankBoost scheduler, Common Data FPGA, Quartus II 8.1 and ModelSim 6.1-based simulations Path (CDP)

[305]° SODA: software defined acceleration for big - Vivado high level Constrained Shortest Path Xilinx Zynq FPGA, ARM Cortex processor data with heterogeneous reconfigurable synthesis tools, out-of- Finding (CSPF) for SDN multicore FPGA resources order scheduling algorithm [306]° HLSMapReduceFlow: Synthesizable - High-Level Synthesis- WordCount, histogram, Virtex-7 FPGA MapReduce Accelerator for FPGA-coupled Data based MapReduce Matrix multiplication, Centers dataflow linear regression, PCA, k- means [307]° FPGA-based in-NIC and software-based NoSQL DRAM and multi USR, SYS, APP, NetFPGA-10G ( Xilinx Virtex-5 XC5VTX240T) of NetFPGA, in-Kernel caches for NoSQL processing elements in ETC, VAR NetFPGA, Netfilter framework for kernel cache [308]° FPGA-based processing in NIC Spark 1.6.0, one-at-a-time WordCount, change NetFPGA-10G (Xilinx Virtex-5 XC5VTX240TFFG1759-2) as NIC in for Spark streaming Scala 2.10.5 methodology operations point detection server node (Intel core i5 CPU, 8GB DRAM) and a client node [309]° FPGA-acceleration for large - Graph Processing Graphlet counting Convey HC-1 server with Virtex-5 LX330s FPGA graph processing Elements (GPEs), algorithm memory interconnect network, run-time management unit [310]° Network requirements for big data Hadoop, Page-level memory WordCount, sort, EC2 instances (m3.2xlarge, c3.4xlarge), with virtual private network application in disaggregated data centers Spark, access, block-level collaborative (VPC), simulations and emulations GraphLab, distributed data filtering, YCSB Timely placement RDMA and dataflow, integrated NICs Memchached, HERD, SparkSQL [311]° A composable architecture for rack-scale Memchached, Empirical approaches for 100k Memcached H3 RAID array with 12 SATA drives and single PCIe Gen3 × 4 port, disaggregation for big data computing Giraph, resources provisioning, operations, PCIe switch with 16 Gen2 × 4 port, host bus adapter connected to Cassandra PCIe switches TopKPagerank, 10k IBM 3650M4 Cassandra operations * Scheduling in data centers, °Performance improvements based on advances in technologies. VIII. SUMMARY OF OPTIMIZATION STUDIES AND FUTURE RESEARCH DIRECTIONS: Tremendous efforts were devoted in the last decade to address various challenges in optimizing the deployment of big data applications and their infrastructures. The ever increasing data volumes and the heterogeneous requirements of available and proposed frameworks in dedicated and multi-tenant clusters are still calling for further improvements of different aspects of these infrastructures. We categorized the optimization studies of big data applications into three broad categories which are at the applications-level, cloud networking- level, and at the data center-level. Application-level studies focused on the efforts to improve the configuration or structure of the frameworks. These studies were further categorized into optimizing jobs and data placements, jobs scheduling, and completion time, in addition to benchmarking, production traces, modeling, profiling and simulators for big data applications. Studies at the cloud networking level addressed the optimization beyond single cluster deployments, for example for transporting big data and/or for in-network processing of big data. These studies were further categorized into cloud resource management, virtual assignments and container assignments, bulk transfers and inter data center networking, and SDN, EON, and NFV optimization studies. 
Finally, the data center-level studies focused on optimizing the design of the clusters to improve the performance of the applications. These studies were further categorized into topology studies, routing studies, studies on the scheduling of flows, coflows, and jobs, and studies on advances in computing, storage, and networking technologies. In what follows, we highlight key challenges for big data application deployments and provide some insights into research gaps and future directions.
Big Data Volumes: In most of the studies, there is a huge gap between the actual volumes of big data and the tested traffic or workload volumes. The main reason is the relatively high cost of experimenting in large clusters or renting IaaS. This calls for improving existing simulators or performance models to enable accurate testing of systems at large scale. Data volumes are continuing to grow beyond the capabilities of existing systems due to many bottlenecks in computing and networking. This will require continued investment in scale-up systems that incorporate different technologies such as SDN and advanced optical networks for intra and inter data center networking. Another key challenge is related to the veracity and value of big data, which calls for cleansing techniques prior to processing to eliminate unnecessary computations.
Workload characteristics and their modeling: Big data workload characteristics and the available frameworks will keep changing and evolving. Most big data studies address only the MapReduce framework, while few consider other variations such as key-value stores, streaming, and graph processing applications, or a mixture of applications. Also, several studies have utilized scaled and outdated production traces, where only high-level statistics are available, or a subset of workloads from micro-benchmarks for their evaluations, which might not be very representative. Thus, there is a need for more comprehensive, enhanced, and updated benchmarking suites and production traces that reflect more recent frameworks and workload characteristics.
Resource allocation: Different workloads can have uncorrelated CPU, I/O, memory, and networking resource requirements, and placing them together can improve the overall infrastructure utilization. However, isolating resources such as cache, network, and I/O via recent management platforms at large scale is still challenging. Improving some of the resources can change the optimum configurations for applications. For example, improving the networking can reduce the need to maintain data locality and hence reduce the stress on task scheduling. Also, improving the CPU can change CPU-bound workloads into I/O-bound or memory-bound workloads. Another challenge is that there is still a gap between the users' knowledge of their requirements and the resources they lease, which leads to non-optimal configurations and hence wasted resources, delayed responses, and higher energy consumption. More tools to aid users in understanding their requirements or their workloads are required for better resource utilization.
Performance in proposed clusters: Most of the research that enhances big data frameworks, reported in Section III, was carried out in scale-out clusters while overlooking the physical layer characteristics and focusing on the framework details.
Alternatively, most of the studies in Sections V and VII have considered custom-built simulators, SDN switch-based emulations, or small scale-up/out clusters while focusing on the hardware impact, considering only a subset of configurations, and oversimplifying the frameworks' characteristics. Although it might sound more economical to scale out infrastructures as the computational requirements increase, this might not be sufficient for applications with strict QoS requirements, as scaling out depends on a high level of parallelism, which is constrained by the network between the nodes. Hence, improving the networking and scaling up the data centers are required and are gaining research and industrial attention. When improving DCNs, many tradeoffs should be considered, including scalability, agility, and end-to-end delays, in addition to the complexity of the routing and scheduling mechanisms required to fully exploit the higher-bandwidth links. This will mostly be satisfied by application-centric DCN architectures that utilize SDN to dynamically vary the topologies at run-time to match the requirements of deployed applications.
Cluster heterogeneity: The potential existence of hardware with different specifications, for example due to replacements in very large clusters, can lead to completion time imbalance among tasks. This requires more attention in big data studies and accurate profiling of the performance of all nodes so that task scheduling can be improved to reduce the imbalance.
Multi-tenancy environments: Multi-tenancy is a key requirement enabled by the virtualization of cloud data centers, where the workloads of different users are multiplexed in the same infrastructure to improve the utilization. In current cloud services, static cluster configurations and manual adjustments are still followed but are not optimal. There is still a lack of dynamic job allocation and scheduling for multiple users. In such environments, the performance isolation between users who share resources such as the network and I/O is not yet widely addressed. Also, fairness between users and optimal pricing while maintaining acceptable QoS for all users is still a challenging research topic requiring more comprehensive multi-objective studies.
Geo-distributed frameworks: The challenges faced in geo-distributed frameworks were summarized in Subsection IV-C. These include the need to modify frameworks that were originally designed for single-cluster deployments. There is a need for new frameworks for offers and pricing, optimal resource allocation in heterogeneous clusters, QoS guarantees, resilience, and energy consumption minimization. A key challenge with workloads in geo-distributed frameworks is that not all workloads can be divided, as the whole data set may need to reside in a single cluster; in addition, transporting some data sets from remote locations can incur high data-access latency. Extended SDN control between transport networks and within data centers is a promising research area to jointly optimize path computations, provision distributed resources, and reduce job completion times, and to improve big data applications and frameworks in geo-distributed environments.
Power consumption: Current infrastructures have a trade-off between energy efficiency and application performance. Most cloud providers still favor over-provisioning to meet SLAs over reducing the power consumption.
Reducing the power consumption while maintaining the performance is an area that should be explored further in designing future systems. For example, it is attractive to consider more energy-efficient components and more efficient geo-distributed frameworks that reduce the need for transporting big data. It is also attractive to perform progressive computations in the network as the data transits it, while considering agile technologies such as SDN, VNE, and NFV, which can improve the energy efficiency of big data applications. However, the impact of these strategies on application performance should be comprehensively addressed.

IX. CONCLUSIONS
Improving the performance and efficiency of big data frameworks and applications is an ongoing critical research area, as they are becoming the mainstream approach for implementing various services, including data analytics and machine learning at large scale. These are services that continue to grow in importance. Support should also be developed for other services deployed fully or partially in cloud and fog computing environments with ever-increasing volumes of generated data. Big data interacts with systems at different levels, starting from the acquisition of data from users and IoT devices through wireless and wired access networks, to transmission through WANs, into different types of data centers for storage and processing via different frameworks. This paper surveyed big data applications and the technologies and network infrastructure needed to implement them. It identified a number of key challenges and research gaps in optimizing big data applications and infrastructures. It has also comprehensively summarized early and recent efforts towards improving the performance and/or energy efficiency of such big data applications at different layers. For the convenience of readers with different backgrounds, brief descriptions of big data applications and frameworks, cloud computing and related emerging technologies, and data centers are provided in Sections II, IV, and VI, respectively. The optimization studies that appear in Section III focus on the frameworks, those in Section V focus on cloud networking, and those in Section VII focus on data centers, while comprehensive summaries are given in Tables I-VI. The survey paid attention to a range of existing and proposed technologies and focused on different frameworks and applications including MapReduce, data streaming, and graph processing. It considered different optimization metrics (e.g., completion time, fairness, cost, profit, and energy consumption), reported studies that used different representative workloads, optimization tools, and mathematical techniques, and covered simulation-based and experimental evaluations in clouds and prototypes. We provided some future research directions in Section VIII to aid researchers in identifying the limitations of current solutions and hence determine the areas where future technologies should be developed in order to improve big data applications, their infrastructures, and their performance.

ACKNOWLEDGEMENTS
Sanaa Hamid Mohamed would like to acknowledge Doctoral Training Award (DTA) funding from the UK Engineering and Physical Sciences Research Council (EPSRC). This work was supported by the Engineering and Physical Sciences Research Council through the INTERNET (EP/H040536/1), STAR (EP/K016873/1), and TOWS (EP/S016570/1) projects. All data are provided in full in the results section of this paper.

Kachris, “High-level synthesizable dataflow MapReduce accelerator for FPGA- coupled data centers,” in Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), 2015 International Conference on, July 2015, pp. 26–33. [307] Y. Tokusashi and H. Matsutani, “Multilevel NoSQL Cache Combining In-NIC and In-Kernel Approaches,” IEEE Micro, vol. 37, no. 5, pp. 44–51, September 2017. [308] K. Nakamura, A. Hayashi, and H. Matsutani, “An FPGA-based low-latency network processing for spark streaming,” in 2016 IEEE International Conference on Big Data (Big Data), Dec 2016, pp. 2410–2415. [309] B. Betkaoui, D. B. Thomas, W. Luk, and N. Przulj, “A framework for FPGA acceleration of large graph problems: Graphlet counting case study,” in 2011 International Conference on Field-Programmable Technology, Dec 2011, pp. 1–8. [310] P. X. Gao, A. Narayan, S. Karandikar, J. Carreira, S. Han, R. Agarwal, S. Ratnasamy, and S. Shenker, “Network Requirements for Resource Disaggregation,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). Savannah, GA: USENIX Association, 2016, pp. 249–264. [311] C.-S. Li, H. Franke, C. Parris, B. Abali, M. Kesavan, and V. Chang, “Composable architecture for rack scale big data computing,” Future Generation Computer Systems, vol. 67, pp. 180 – 193, 2017. [312] M. Chen, S. Mao, and Y. Liu, “Big Data: A Survey,” Mob. Netw. Appl., vol. 19, no. 2, pp. 171–209, Apr. 2014. [313] S. Sakr, A. Liu, D. Batista, and M. Alomari, “A Survey of Large Scale Data Management Approaches in Cloud Environments,” Communications Surveys Tutorials, IEEE, vol. 13, no. 3, pp. 311–336, Third 2011. [314] L. Jiamin and F. Jun, “A Survey of MapReduce Based Parallel Processing Technologies,” Communications, China, vol. 11, no. 14, pp. 146–155, Supplement 2014. [315] Y. Zhang, T. Cao, S. Li, X. Tian, L. Yuan, H. Jia, and A. V. Vasilakos, “Parallel Processing Systems for Big Data: A Survey,” Proceedings of the IEEE, vol. 104, no. 11, pp. 2114–2136, Nov 2016. [316] H. Zhang, G. Chen, B. C. Ooi, K. L. Tan, and M. Zhang, “In-Memory Big Data Management and Processing: A Survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 7, pp. 1920–1948, July 2015. [317] R. Han, L. K. John, and J. Zhan, “Benchmarking Big Data Systems: A Review,” IEEE Transactions on Services Computing, vol. PP, no. 99, pp. 1–1, 2017. [318] G. Rumi, C. Colella, and D. Ardagna, “Optimization Techniques within the Hadoop Eco-system: A Survey,” in Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2014 16th International Symposium on, Sept 2014, pp. 437–444. [319] B. T. Rao and L. S. S. Reddy, “Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments,” CoRR, vol. abs/1207.0780, 2012. [320] R. Li, H. Hu, H. Li, Y. Wu, and J. Yang, “Mapreduce parallel programming model: A state-of-the-art survey,” International Journal of Parallel Programming, vol. 44, no. 4, pp. 832–866, Aug 2016. [321] J. Wu, S. Guo, J. Li, and D. Zeng, “Big Data Meet Green Challenges: Big Data Toward Green Applications,” IEEE Systems Journal, vol. 10, no. 3, pp. 888–900, Sept 2016. [322] P. Derbeko, S. Dolev, E. Gudes, and S. Sharma, “Security and privacy aspects in mapreduce on clouds: A survey,” Computer Science Review, vol. 20, pp. 1 – 28, 2016. [323] S. Dolev, P. Florissi, E. Gudes, S. Sharma, and I. Singer, “A Survey on Geographically Distributed Big- Data Processing using MapReduce,” IEEE Transactions on Big Data, vol. PP, no. 99, pp. 1–1, 2017. [324] M. Hadi, A. Lawey, T. 
El-Gorashi, and J. Elmirghani, “Big Data Analytics for Wireless and Wired Network Design: A Survey,” Computer Networks, vol. 132, pp. 180–199, February 2018. [325] J. Wang, Y. Wu, N. Yen, S. Guo, and Z. Cheng, “Big Data Analytics for Emergency Communication Networks: A Survey,” IEEE Communications Surveys Tutorials, vol. 18, no. 3, pp. 1758–1778, thirdquarter 2016. [326] X. Cao, L. Liu, Y. Cheng, and X. Shen, “Towards Energy-Efficient Wireless Networking in the Big Data Era: A Survey,” IEEE Communications Surveys Tutorials, vol. PP, no. 99, pp. 1–1, 2017. [327] S. Yu, M. Liu, W. Dou, X. Liu, and S. Zhou, “Networking for Big Data: A Survey,” IEEE Communications Surveys Tutorials, vol. 19, no. 1, pp. 531–549, Firstquarter 2017. [328] S. Wang, J. Zhang, T. Huang, J. Liu, T. Pan, and Y. Liu, “A Survey of Coflow Scheduling Schemes for Data Center Networks,” IEEE Communications Magazine, vol. 56, no. 6, pp. 179–185, June 2018. [329] K. Wang, Q. Zhou, S. Guo, and J. Luo, “Cluster Frameworks for Efficient Scheduling and Resource Allocation in Data Center Networks: A Survey,” IEEE Communications Surveys Tutorials, pp. 1–1, 2018. [330] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: Distributed Data-parallel Programs from Sequential Building Blocks,” SIGOPS Oper. Syst. Rev., vol. 41, no. 3, pp. 59–72, Mar. 2007. [331] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández- Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle, “The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of- Order Data Processing,” Proceedings of the VLDB Endowment, vol. 8, pp. 1792–1803, 2015. [332] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google File System,” SIGOPS Oper. Syst. Rev., vol. 37, no. 5, pp. 29–43, Oct. 2003. [333] V. Kalavri and V. Vlassov, “MapReduce: Limitations, Optimizations and Open Issues,” in 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, July 2013, pp. 1031– 1038. [334] S. Babu, “Towards Automatic Optimization of MapReduce Programs,” in Proceedings of the 1st ACM Symposium on Cloud Computing, ser. SoCC ’10. New York, NY, USA: ACM, 2010, pp. 137–142. [335] P. Lama and X. Zhou, “AROMA: Automated Resource Allocation and Configuration of Mapreduce Environment in the Cloud,” in Proceedings of the 9th International Conference on Autonomic Computing, ser. ICAC ’12. New York, NY, USA: ACM, 2012, pp. 63–72. [336] A. Rabkin and R. Katz, “How Hadoop Clusters Break,” Software, IEEE, vol. 30, no. 4, pp. 88–94, July 2013. [337] D. Cheng, J. Rao, Y. Guo, C. Jiang, and X. Zhou, “Improving Performance of Heterogeneous MapReduce Clusters with Adaptive Task Tuning,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 3, pp. 774–786, March 2017. [338] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears, “MapReduce Online,” in Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI’10. Berkeley, CA, USA: USENIX Association, 2010, pp. 21–21. [339] T. White, Hadoop: The Definitive Guide, 1st ed. O’Reilly Media, Inc., 2009. [340] C. Ji, Y. Li, W. Qiu, U. Awada, and K. Li, “Big Data Processing in Cloud Computing Environments,” in Pervasive Systems, Algorithms and Networks (ISPAN), 2012 12th International Symposium on, Dec 2012, pp. 17–23. [341] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. 
Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache Hadoop YARN: Yet Another Resource Negotiator,” in Proceedings of the 4th Annual Symposium on Cloud Computing, ser. SOCC ’13. New York, NY, USA: ACM, 2013, pp. 5:1–5:16. [342] I. Polato, D. Barbosa, A. Hindle, and F. Kon, “Hadoop branching: Architectural impacts on energy and performance,” in Green Computing Conference and Sustainable Computing Conference (IGSC), 2015 Sixth International, Dec 2015, pp. 1–4. [343] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig Latin: A Not-so-foreign Language for Data Processing,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’08. New York, NY, USA: ACM, 2008, pp. 1099–1110. [344] B. Saha, H. Shah, S. Seth, G. Vijayaraghavan, A. Murthy, and C. Curino, “Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’15. New York, NY, USA: ACM, 2015, pp. 1357–1369. [345] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, “Hive: A Warehousing Solution over a Map-reduce Framework,” Proc. VLDB Endow., vol. 2, no. 2, pp. 1626–1629, Aug. 2009. [346] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy, “Storm@twitter,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’14. New York, NY, USA: ACM, 2014, pp. 147–156. [347] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A Distributed Storage System for Structured Data,” ACM Trans. Comput. Syst., vol. 26, no. 2, pp. 4:1–4:26, Jun. 2008. [348] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni, “PNUTS: Yahoo!’s Hosted Data Serving Platform,” Proc. VLDB Endow., vol. 1, no. 2, pp. 1277–1288, Aug. 2008. [349] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo: Amazon’s Highly Available Key-value Store,” SIGOPS Oper. Syst. Rev., vol. 41, no. 6, pp. 205–220, Oct. 2007. [350] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads,” Proc. VLDB Endow., vol. 2, no. 1, pp. 922–933, Aug. 2009. [351] A. Lakshman and P. Malik, “Cassandra: A Decentralized Structured Storage System,” SIGOPS Oper. Syst. Rev., vol. 44, no. 2, pp. 35–40, Apr. 2010. [352] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, “Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing),” Proc. VLDB Endow., vol. 3, no. 1-2, pp. 515–529, Sep. 2010. [353] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman, “The Case for RAMClouds: Scalable High-performance Storage Entirely in DRAM,” SIGOPS Oper. Syst. Rev., vol. 43, no. 4, pp. 92–105, Jan. 2010. [354] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner, “SAP HANA Database: Data Management for Modern Business Applications,” SIGMOD Rec., vol. 40, no. 4, pp. 45–51, Jan. 2012. [355] M. Zaharia, M. Chowdhury, M. J. Franklin, S. 
Shenker, and I. Stoica, “Spark: Cluster Computing with Working Sets,” in Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud’10. Berkeley, CA, USA: USENIX Association, 2010, pp. 10–10. [356] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing,” in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI’12. Berkeley, CA, USA: USENIX Association, 2012, pp. 2–2. [357] A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan, “One Trillion Edges: Graph Processing at Facebook-scale,” Proc. VLDB Endow., vol. 8, no. 12, pp. 1804–1815, Aug. 2015. [358] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: A System for Large-scale Graph Processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’10. New York, NY, USA: ACM, 2010, pp. 135–146. [359] B. Shao, H. Wang, and Y. Li, “Trinity: A Distributed Graph Engine on a Memory Cloud,” in Proceedings of SIGMOD 2013. ACM SIGMOD, June 2013. [360] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud,” Proc. VLDB Endow., vol. 5, no. 8, pp. 716– 727, Apr. 2012. [361] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs,” in Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). Hollywood, CA: USENIX, 2012, pp. 17–30. [362] D. Bernstein, “The Emerging Hadoop, Analytics, Stream Stack for Big Data,” Cloud Computing, IEEE, vol. 1, no. 4, pp. 84–86, Nov 2014. [363] A. M. Aly, A. Sallam, B. M. Gnanasekaran, L. V. Nguyen-Dinh, W. G. Aref, M. Ouzzani, and A. Ghafoor, “M3: Stream Processing on Main-Memory MapReduce,” in 2012 IEEE 28th International Conference on Data Engineering, April 2012, pp. 1253–1256. [364] T. Akidau, A. Balikov, K. Bekiroglu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle, “Mill-Wheel: Fault-Tolerant Stream Processing at Internet Scale,” in Very Large Data Bases, 2013, pp. 734–746. [365] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, “S4: Distributed Stream Computing Platform,” in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, ser. ICDMW ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 170–177. [366] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica, “Discretized Streams: An Efficient and Fault-tolerant Model for Stream Processing on Large Clusters,” in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud’12. Berkeley, CA, USA: USENIX Association, 2012, pp. 10–10. [367] N. Marz and J. Warren, Big Data: Principles and Best Practices of Scalable Realtime Data Systems, 1st ed. Greenwich, CT, USA: Manning Publications Co., 2015. [368] J. Kreps, N. Narkhede, and J. Rao, “Kafka: A distributed messaging system for log processing,” in Proceedings of 6th International Workshop on Networking Meets Databases (NetDB), Athens, Greece, 2011. [369] Apache Hadoop Rumen. (Cited on 2016, Dec). [Online]. Available: https://hadoop.apache.org/docs/stable/hadoop-rumen/Rumen.html [370] M. Sadiku, S. Musa, and O. 
Momoh, “Cloud Computing: Opportunities and Challenges,” Potentials, IEEE, vol. 33, no. 1, pp. 34–36, Jan 2014. [371] B. Biocic, D. Tomic, and D. Ogrizovic, “Economics of the cloud computing,” in MIPRO, 2011 Proceedings of the 34th International Convention, May 2011, pp. 1438–1442. [372] N. da Fonseca and R. Boutaba, Cloud Architectures, Networks, Services, and Management. Wiley-IEEE Press, 2015, p. 432. [373] J. E. Smith and R. Nair, “The architecture of virtual machines,” Computer, vol. 38, no. 5, pp. 32–38, May 2005. [374] B. Sotomayor, R. S. Montero, I. M. Llorente, and I. Foster, “Virtual Infrastructure Management in Private and Hybrid Clouds,” IEEE Internet Computing, vol. 13, no. 5, pp. 14–22, Sept 2009. [375] Amazon EC2. (Cited on 2017, Dec). [Online]. Available: https: //aws.amazon.com/ec2 [376] Google Compute Engine Pricing. (Cited on 2017, Dec). [Online]. Available: https://cloud.google.com/compute/pricing [377] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, “Xen and the Art of Virtualization,” SIGOPS Oper. Syst. Rev., vol. 37, no. 5, pp. 164–177, Oct. 2003. [378] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, “kvm: the Linux Virtual Machine Monitor,” in Proceedings of the Linux Symposium, vol. 1, Ottawa, Ontario, Canada, Jun. 2007, pp. 225–230. [379] F. Guthrie, S. Lowe, and K. Coleman, VMware vSphere Design, 2nd ed. Alameda, CA, USA: SYBEX Inc., 2013. [380] T. Kooburat and M. Swift, “The Best of Both Worlds with On-demand Virtualization,” in Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, ser. HotOS’13. Berkeley, CA, USA: USENIX Association, 2011, pp. 4–4. [381] D. Venzano and P. Michiardi, “A Measurement Study of Data-Intensive Network Traffic Patterns in a Private Cloud,” in 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing, Dec 2013, pp. 476–481. [382] F. Xu, F. Liu, H. Jin, and A. Vasilakos, “Managing Performance Overhead of Virtual Machines in Cloud Computing: A Survey, State of the Art, and Future Directions,” Proceedings of the IEEE, vol. 102, no. 1, pp. 11– 31, Jan 2014. [383] G. Wang and T. S. E. Ng, “The Impact of Virtualization on Network Performance of Amazon EC2 Data Center,” in Proceedings of the 29th Conference on Information Communications, ser. INFOCOM’10. Piscataway, NJ, USA: IEEE Press, 2010, pp. 1163–1171. [384] Q. Duan, Y. Yan, and A. V. Vasilakos, “A Survey on Service-Oriented Network Virtualization Toward Convergence of Networking and Cloud Computing,” IEEE Transactions on Network and Service Management, vol. 9, no. 4, pp. 373–392, December 2012. [385] R. Jain and S. Paul, “Network virtualization and software defined networking for cloud computing: a survey,” Communications Magazine, IEEE, vol. 51, no. 11, pp. 24–31, November 2013. [386] A. Fischer, J. F. Botero, M. T. Beck, H. de Meer, and X. Hesselbach, “Virtual Network Embedding: A Survey,” IEEE Communications Surveys Tutorials, vol. 15, no. 4, pp. 1888–1906, Fourth 2013. [387] L. Nonde, T. El-Gorashi, and J. Elmirghani, “Energy Efficient Virtual Network Embedding for Cloud Networks,” Lightwave Technology, Journal of, vol. 33, no. 9, pp. 1828–1849, May 2015. [388] L. Nonde, T. E. H. Elgorashi, and J. M. H. Elmirghani, “Cloud Virtual Network Embedding: Profit, Power and Acceptance,” in 2015 IEEE Global Communications Conference (GLOBECOM), Dec 2015, pp. 1– 6. [389] R. Mijumbi, J. Serrat, J. Gorricho, N. Bouten, F. De Turck, and R. 
Boutaba, “Network Function Virtualization: State-of-the-Art and Research Challenges,” Communications Surveys Tutorials, IEEE, vol. 18, no. 1, pp. 236–262, Firstquarter 2016. [390] H. Hawilo, A. Shami, M. Mirahmadi, and R. Asal, “NFV: state of the art, challenges, and implementation in next generation mobile networks (vEPC),” IEEE Network, vol. 28, no. 6, pp. 18–26, Nov 2014. [391] V. Nguyen, A. Brunstrom, K. Grinnemo, and J. Taheri, “SDN/NFVBased Mobile Packet Core Network Architectures: A Survey,” IEEE Communications Surveys Tutorials, vol. 19, no. 3, pp. 1567–1602, thirdquarter 2017. [392] D. A. Temesgene, J. Núñez-Martínez, and P. Dini, “Softwarization and Optimization for Sustainable Future Mobile Networks: A Survey,” IEEE Access, vol. 5, pp. 25 421–25 436, 2017. [393] I. Afolabi, T. Taleb, K. Samdanis, A. Ksentini, and H. Flinck, “Network Slicing & Softwarization: A Survey on Principles, Enabling Technologies & Solutions,” IEEE Communications Surveys Tutorials, pp. 1–1, 2018. [394] J. G. Herrera and J. F. Botero, “Resource Allocation in NFV: A Comprehensive Survey,” IEEE Transactions on Network and Service Management, vol. 13, no. 3, pp. 518–532, Sept 2016. [395] L. Peterson, A. Al-Shabibi, T. Anshutz, S. Baker, A. Bavier, S. Das, J. Hart, G. Palukar, and W. Snow, “Central office re-architected as a data center,” IEEE Communications Magazine, vol. 54, no. 10, pp. 96–101, October 2016. [396] A. N. Al-Quzweeni, A. Q. Lawey, T. E. H. Elgorashi, and J. M. H. Elmirghani, “Optimized Energy Aware 5G Network Function Virtualization,” IEEE Access, pp. 1–1, 2019. [397] A. Al-Quzweeni, A. Lawey, T. El-Gorashi, and J. M. H. Elmirghani, “A framework for energy efficient NFV in 5G networks,” in 2016 18th International Conference on Transparent Optical Networks (ICTON), July 2016, pp. 1–4. [398] A. Al-Quzweeni, T. E. H. El-Gorashi, L. Nonde, and J. M. H. Elmirghani, “Energy efficient network function virtualization in 5G networks,” in 2015 17th International Conference on Transparent Optical Networks (ICTON), July 2015, pp. 1–4. [399] J. Zhang, Y. Ji, X. Xu, H. Li, Y. Zhao, and J. Zhang, “Energy efficient baseband unit aggregation in cloud radio and optical access networks,” IEEE/OSA Journal of Optical Communications and Networking, vol. 8, no. 11, pp. 893–901, Nov 2016. [400] M. Peng, Y. Li, Z. Zhao, and C. Wang, “System architecture and key technologies for 5G heterogeneous cloud radio access networks,” IEEE Network, vol. 29, no. 2, pp. 6–14, March 2015. [401] M. Peng, Y. Sun, X. Li, Z. Mao, and C. Wang, “Recent Advances in Cloud Radio Access Networks: System Architectures, Key Techniques, and Open Issues,” IEEE Communications Surveys Tutorials, vol. 18, no. 3, pp. 2282–2308, thirdquarter 2016. [402] A. Checko, H. L. Christiansen, Y. Yan, L. Scolari, G. Kardaras, M. S. Berger, and L. Dittmann, “Cloud RAN for Mobile Networks; A Technology Overview,” IEEE Communications Surveys Tutorials, vol. 17, no. 1, pp. 405–426, Firstquarter 2015. [403] L. Velasco, L. M. Contreras, G. Ferraris, A. Stavdas, F. Cugini, M. Wiegand, and J. P. Fernandez-Palacios, “A service-oriented hybrid access network and clouds architecture,” IEEE Communications Magazine, vol. 53, no. 4, pp. 159–165, April 2015. [404] M. Kalil, A. Al-Dweik, M. F. A. Sharkh, A. Shami, and A. Refaey, “A Framework for Joint Wireless Network Virtualization and Cloud Radio Access Networks for Next Generation Wireless Networks,” IEEE Access, vol. 5, pp. 20 814–20 827, 2017. [405] M. Richart, J. Baliosian, J. Serrat, and J. L. 
Gorricho, “Resource Slicing in Virtual Wireless Networks: A Survey,” IEEE Transactions on Network and Service Management, vol. 13, no. 3, pp. 462–476, Sept 2016. [406] R. Bolla, R. Bruschi, F. Davoli, C. Lombardo, J. F. Pajo, and O. R. Sanchez, “The dark side of network functions virtualization: A perspective on the technological sustainability,” in 2017 IEEE International Conference on Communications (ICC), May 2017, pp. 1–7. [407] D. Bernstein, “Containers and Cloud: From LXC to Docker to Kubernetes,” Cloud Computing, IEEE, vol. 1, no. 3, pp. 81–84, Sept 2014. [408] C. Pahl and B. Lee, “Containers and Clusters for Edge Cloud Architectures – A Technology Review,” in 2015 3rd International Conference on Future Internet of Things and Cloud, Aug 2015, pp. 379–386. [409] I. Mavridis and H. Karatza, “Performance and Overhead Study of Containers Running on Top of Virtual Machines,” in 2017 IEEE 19th Conference on Business Informatics (CBI), vol. 02, July 2017, pp. 32– 38. [410] M. G. Xavier, M. V. Neves, and C. A. F. D. Rose, “A Performance Comparison of Container-Based Virtualization Systems for MapReduce Clusters,” in 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, Feb 2014, pp. 299– 306. [411] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, “An updated performance comparison of virtual machines and Linux containers,” in Performance Analysis of Systems and Software (ISPASS), 2015 IEEE International Symposium on, March 2015, pp. 171–172. [412] Docker Container Executor. (Cited on 2017, Mar). [Online]. Available: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/DockerContainerExecutor.html [413] S. Radhakrishnan, B. J. Muscedere, and K. Daudjee, “V-Hadoop: Virtualized Hadoop using containers,” in 2016 IEEE 15th International Symposium on Network Computing and Applications (NCA), vol. 00, Oct. 2016, pp. 237–241. [414] D. Kreutz, F. M. V. Ramos, P. E. Veríssimo, C. E. Rothenberg, S. Azodolmolky, and S. Uhlig, “Software- Defined Networking: A Comprehensive Survey,” Proceedings of the IEEE, vol. 103, no. 1, pp. 14–76, Jan 2015. [415] F. Bannour, S. Souihi, and A. Mellouk, “Distributed SDN Control: Survey, Taxonomy, and Challenges,” IEEE Communications Surveys Tutorials, vol. 20, no. 1, pp. 333–354, Firstquarter 2018. [416] B. Nunes, M. Mendonca, X.-N. Nguyen, K. Obraczka, and T. Turletti, “A Survey of Software-Defined Networking: Past, Present, and Future of Programmable Networks,” Communications Surveys Tutorials, IEEE, vol. 16, no. 3, pp. 1617–1634, Third 2014. [417] W. Xia, Y. Wen, C. H. Foh, D. Niyato, and H. Xie, “A Survey on Software-Defined Networking,” IEEE Communications Surveys Tutorials, vol. 17, no. 1, pp. 27–51, Firstquarter 2015. [418] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, “Openflow: Enabling innovation in campus networks,” SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, pp. 69–74, Mar. 2008. [419] F. Hu, Q. Hao, and K. Bao, “A Survey on Software-Defined Network and OpenFlow: From Concept to Implementation,” IEEE Communications Surveys Tutorials, vol. 16, no. 4, pp. 2181–2206, Fourthquarter 2014. [420] A. Lara, A. Kolasani, and B. Ramamurthy, “Network Innovation using OpenFlow: A Survey,” IEEE Communications Surveys Tutorials, vol. 16, no. 1, pp. 493–512, First 2014. [421] A. Mendiola, J. Astorga, E. Jacob, and M. 
Higuero, “A Survey on the Contributions of Software-Defined Networking to Traffic Engineering,” IEEE Communications Surveys Tutorials, vol. 19, no. 2, pp. 918–953, Secondquarter 2017. [422] O. Michel and E. Keller, “SDN in wide-area networks: A survey,” in 2017 Fourth International Conference on Software Defined Systems (SDS), May 2017, pp. 37–42. [423] B. Pfaff, J. Pettit, T. Koponen, E. J. Jackson, A. Zhou, J. Rajahalme, J. Gross, A. Wang, J. Stringer, P. Shelar, K. Amidon, and M. Casado, “The Design and Implementation of Open vSwitch,” in Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI’15. Berkeley, CA, USA: USENIX Association, 2015, pp. 117–130. [424] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker, “P4: Programming Protocol-independent Packet Processors,” SIGCOMM Comput. Commun. Rev., vol. 44, no. 3, pp. 87–95, Jul. 2014. [425] T. Huang, F. R. Yu, C. Zhang, J. Liu, J. Zhang, and Y. Liu, “A Survey on Large-Scale Software Defined Networking (SDN) Testbeds: Approaches and Challenges,” IEEE Communications Surveys Tutorials, vol. 19, no. 2, pp. 891–917, Secondquarter 2017. [426] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu, J. Zolla, U. Hölzle, S. Stuart, and A. Vahdat, “B4: Experience with a Globally-deployed Software Defined Wan,” SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 3–14, Aug. 2013. [427] C.-Y. Hong, S. Kandula, R. Mahajan, M. Zhang, V. Gill, M. Nanduri, and R. Wattenhofer, “Achieving High Utilization with Software-driven WAN,” SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 15–26, Aug. 2013. [428] J. Wang, Y. Yan, and L. Dittmann, “Design of energy efficient optical networks with software enabled integrated control plane,” Networks, IET, vol. 4, no. 1, pp. 30–36, 2015. [429] L. Cui, F. R. Yu, and Q. Yan, “When big data meets software-defined networking: SDN for big data and big data for SDN,” IEEE Network, vol. 30, no. 1, pp. 58–65, January 2016. [430] H. Huang, H. Yin, G. Min, H. Jiang, J. Zhang, and Y. Wu, “Data- Driven Information Plane in Software- Defined Networking,” IEEE Communications Magazine, vol. 55, no. 6, pp. 218–224, 2017. [431] T. Hafeez, N. Ahmed, B. Ahmed, and A. W. Malik, “Detection and Mitigation of Congestion in SDN Enabled Data Center Networks: A Survey,” IEEE Access, vol. 6, pp. 1730–1740, 2018. [432] Y. Zhang, P. Chowdhury, M. Tornatore, and B. Mukherjee, “Energy Efficiency in Telecom Optical Networks,” Communications Surveys Tutorials, IEEE, vol. 12, no. 4, pp. 441–458, Fourth 2010. [433] R. Ramaswami, K. N. Sivarajan, and G. H. Sasaki, Optical Networks, 3rd ed. Morgan Kaufmann, 2010. [434] H. Yin, Y. Jiang, C. Lin, Y. Luo, and Y. Liu, “Big data: transforming the design philosophy of future internet,” Network, IEEE, vol. 28, no. 4, pp. 14–19, July 2014. [435] K.-I. Kitayama, A. Hiramatsu, M. Fukui, T. Tsuritani, N. Yamanaka, S. Okamoto, M. Jinno, and M. Koga, “Photonic Network Vision 2020 - Toward Smart Photonic Cloud,” Lightwave Technology, Journal of, vol. 32, no. 16, pp. 2760–2770, Aug 2014. [436] A. S. Thyagaturu, A. Mercian, M. P. McGarry, M. Reisslein, and W. Kellerer, “Software Defined Optical Networks (SDONs): A Comprehensive Survey,” IEEE Communications Surveys Tutorials, vol. 18, no. 4, pp. 2738–2786, Fourthquarter 2016. [437] Y. Yin, L. Liu, R. Proietti, and S. J. B. 
Yoo, “Software Defined Elastic Optical Networks for Cloud Computing,” IEEE Network, vol. 31, no. 1, pp. 4–10, January 2017. [438] A. Nag, M. Tornatore, and B. Mukherjee, “Optical Network Design With Mixed Line Rates and Multiple Modulation Formats,” Journal of Lightwave Technology, vol. 28, no. 4, pp. 466–475, Feb 2010. [439] Y. Ji, J. Zhang, Y. Zhao, H. Li, Q. Yang, C. Ge, Q. Xiong, D. Xue, J. Yu, and S. Qiu, “All Optical Switching Networks With Energy- Efficient Technologies From Components Level to Network Level,” IEEE Journal on Selected Areas in Communications, vol. 32, no. 8, pp. 1600–1614, Aug 2014. [440] X. Zhao, V. Vusirikala, B. Koley, V. Kamalov, and T. Hofmeister, “The prospect of inter-data-center optical networks,” IEEE Communications Magazine, vol. 51, no. 9, pp. 32–38, September 2013. [441] G. Tzimpragos, C. Kachris, I. B. Djordjevic, M. Cvijetic, D. Soudris, and I. Tomkos, “A Survey on FEC Codes for 100 G and Beyond Optical Networks,” IEEE Communications Surveys Tutorials, vol. 18, no. 1, pp. 209–221, Firstquarter 2016. [442] D. M. Marom, P. D. Colbourne, A. D’errico, N. K. Fontaine, Y. Ikuma, R. Proietti, L. Zong, J. M. Rivas- Moscoso, and I. Tomkos, “Survey of photonic switching architectures and technologies in support of spatially and spectrally flexible optical networking [invited],” IEEE/OSA Journal of Optical Communications and Networking, vol. 9, no. 1, pp. 1–26, Jan 2017. [443] X. Yu, M. Tornatore, M. Xia, J. Wang, J. Zhang, Y. Zhao, J. Zhang, and B. Mukherjee, “Migration from fixed grid to flexible grid in optical networks,” IEEE Communications Magazine, vol. 53, no. 2, pp. 34–43, Feb 2015. [444] M. Jinno, H. Takara, B. Kozicki, Y. Tsukishima, Y. Sone, and S. Matsuoka, “Spectrum-efficient and scalable elastic optical path network: architecture, benefits, and enabling technologies,” IEEE Communications Magazine, vol. 47, no. 11, pp. 66–73, November 2009. [445] O. Gerstel, M. Jinno, A. Lord, and S. J. B. Yoo, “Elastic optical networking: a new dawn for the optical layer?” IEEE Communications Magazine, vol. 50, no. 2, pp. s12–s20, February 2012. [446] B. C. Chatterjee, N. Sarma, and E. Oki, “Routing and Spectrum Allocation in Elastic Optical Networks: A Tutorial,” IEEE Communications Surveys Tutorials, vol. 17, no. 3, pp. 1776–1800, thirdquarter 2015. [447] G. Zhang, M. D. Leenheer, A. Morea, and B. Mukherjee, “A Survey on OFDM-Based Elastic Core Optical Networking,” IEEE Communications Surveys Tutorials, vol. 15, no. 1, pp. 65–87, First 2013. [448] A. Klekamp, U. Gebhard, and F. Ilchmann, “Energy and Cost Efficiency of Adaptive and Mixed-Line-Rate IP Over DWDM Networks,” Journal of Lightwave Technology, vol. 30, no. 2, pp. 215–221, Jan 2012. [449] T. E. El-Gorashi, X. Dong, and J. M. Elmirghani, “Green optical orthogonal frequency-division multiplexing networks,” IET Optoelectronics, vol. 8, pp. 137–148(11), June 2014. [450] H. Harai, H. Furukawa, K. Fujikawa, T. Miyazawa, and N. Wada, “Optical Packet and Circuit Integrated Networks and Software Defined Networking Extension,” Journal of Lightwave Technology, vol. 32, no. 16, pp. 2751–2759, Aug 2014. [451] IT center, Intel, “Big Data in the Cloud: Converging Technologies,” Solution Brief, 2015. [452] Project Serengeti: There’s a Virtual Elephant in my Datacenter. (Cited on 2018, May). [Online]. Available: https://octo.vmware.com/project-serengeti-theres-a-virtual-elephant-in-my-datacenter/ [453] A. Iordache, C. Morin, N. Parlavantzas, E. Feller, and P. 
Riteau, “Resilin: Elastic MapReduce over Multiple Clouds,” in 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, May 2013, pp. 261–268. [454] D. Wang and J. Liu, “Optimizing big data processing performance in the public cloud: opportunities and approaches,” Network, IEEE, vol. 29, no. 5, pp. 31–35, September 2015. [455] D. Agrawal, S. Das, and A. El Abbadi, “Big Data and Cloud Computing: Current State and Future Opportunities,” in Proceedings of the 14th International Conference on Extending Database Technology, ser. EDBT/ICDT ’11. New York, NY, USA: ACM, 2011, pp. 530–533. [456] N. C. Luong, P. Wang, D. Niyato, Y. Wen, and Z. Han, “Resource Management in Cloud Networking Using Economic Analysis and Pricing Models: A Survey,” IEEE Communications Surveys Tutorials, vol. 19, no. 2, pp. 954–1001, Secondquarter 2017. [457] Y. Zhao, X. Fei, I. Raicu, and S. Lu, “Opportunities and Challenges in Running Scientific Workflows on the Cloud,” in 2011 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, Oct 2011, pp. 455–462. [458] E.-S. Jung and R. Kettimuthu, “Challenges and Opportunities for Data- Intensive Computing in the Cloud,” Computer, vol. 47, no. 12, pp. 82–85, 2014. [459] M. H. Ghahramani, M. Zhou, and C. T. Hon, “Toward cloud computing QoS architecture: analysis of cloud systems and cloud services,” IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 1, pp. 6–18, Jan 2017. [460] Latency is Everywhere and it Costs you Sales - How to Crush it. (Cited on 2017, Dec). [Online]. Available: http://highscalability.com/latency-everywhere-and-it-costs-you-sales-how-crush-it [461] R. Kohavi, R. M. Henne, and D. Sommerfield, “Practical Guide to Controlled Experiments on the Web: Listen to Your Customers Not to the Hippo,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’07. New York, NY, USA: ACM, 2007, pp. 959–967. [462] S. S. Krishnan and R. K. Sitaraman, “Video Stream Quality Impacts Viewer Behavior: Inferring Causality Using Quasi-experimental Designs,” in Proceedings of the 2012 Internet Measurement Conference, ser. IMC ’12. New York, NY, USA: ACM, 2012, pp. 211–224. [463] C. Colman-Meixner, C. Develder, M. Tornatore, and B. Mukherjee, “A Survey on Resiliency Techniques in Cloud Computing Infrastructures and Applications,” IEEE Communications Surveys Tutorials, vol. 18, no. 3, pp. 2244–2281, thirdquarter 2016. [464] A. Vishwanath, F. Jalali, K. Hinton, T. Alpcan, R. W. A. Ayre, and R. S. Tucker, “Energy Consumption Comparison of Interactive Cloud-Based and Local Applications,” IEEE Journal on Selected Areas in Communications, vol. 33, no. 4, pp. 616–626, April 2015. [465] J. Baliga, R. W. A. Ayre, K. Hinton, and R. S. Tucker, “Green Cloud Computing: Balancing Energy in Processing, Storage, and Transport,” Proceedings of the IEEE, vol. 99, no. 1, pp. 149–167, Jan 2011. [466] H. Zhang, Q. Zhang, Z. Zhou, X. Du, W. Yu, and M. Guizani, “Processing geo-dispersed big data in an advanced mapreduce framework,” Network, IEEE, vol. 29, no. 5, pp. 24–30, September 2015. [467] A. Vulimiri, C. Curino, P. B. Godfrey, T. Jungblut, J. Padhye, and G. Varghese, “Global Analytics in the Face of Bandwidth and Regulatory Constraints,” in Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI’15. Berkeley, CA, USA: USENIX Association, 2015, pp. 323– 336. [468] F. Idzikowski, L. Chiaraviglio, A. Cianfrani, J. L. Vizcaíno, M. Polverini, and Y. 
Ye, “A Survey on Energy- Aware Design and Operation of Core Networks,” IEEE Communications Surveys Tutorials, vol. 18, no. 2, pp. 1453–1499, Secondquarter 2016. [469] W. V. Heddeghem, B. Lannoo, D. Colle, M. Pickavet, and P. Demeester, “A Quantitative Survey of the Power Saving Potential in IP-Over-WDM Backbone Networks,” IEEE Communications Surveys Tutorials, vol. 18, no. 1, pp. 706–731, Firstquarter 2016. [470] M. N. Dharmaweera, R. Parthiban, and Y. A. ¸Sekercio˘glu, “Toward a Power-Efficient Backbone Network: The State of Research,” IEEE Communications Surveys Tutorials, vol. 17, no. 1, pp. 198–227, Firstquarter 2015. [471] R. S. Tucker, “Green Optical Communications—Part I: Energy Limitations in Transport,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 17, no. 2, pp. 245–260, March 2011. [472] R. S. Tucker, “Green Optical Communications—Part II: Energy Limitations in Networks,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 17, no. 2, pp. 261–274, March 2011. [473] W. V. Heddeghem, M. D. Groote, W. Vereecken, D. Colle, M. Pickavet, and P. Demeester, “Energy- efficiency in telecommunications networks: Link-by-link versus end-to-end grooming,” in 2010 14th Conference on Optical Network Design and Modeling (ONDM), Feb 2010, pp. 1–6. [474] A. Fehske, G. Fettweis, J. Malmodin, and G. Biczok, “The global footprint of mobile communications: The ecological and economic perspective,” IEEE Communications Magazine, vol. 49, no. 8, pp. 55–62, August 2011. [475] H. A. Alharbi, M. Musa, T. E. H. El-Gorashi, and J. M. H. Elmirghani, “Real-Time Emissions of Telecom Core Networks,” in 2018 20th International Conference on Transparent Optical Networks (ICTON), July 2018, pp. 1–4. [476] D. Feng, C. Jiang, G. Lim, L. J. Cimini, G. Feng, and G. Y. Li, “A survey of energy-efficient wireless communications,” IEEE Communications Surveys Tutorials, vol. 15, no. 1, pp. 167–178, First 2013. [477] L. Budzisz, F. Ganji, G. Rizzo, M. A. Marsan, M. Meo, Y. Zhang, G. Koutitas, L. Tassiulas, S. Lambert, B. Lannoo, M. Pickavet, A. Conte, I. Haratcherev, and A. Wolisz, “Dynamic Resource Provisioning for Energy Efficiency in Wireless Access Networks: A Survey and an Outlook,” IEEE Communications Surveys Tutorials, vol. 16, no. 4, pp. 2259–2285, Fourthquarter 2014. [478] M. Ismail,W. Zhuang, E. Serpedin, and K. Qaraqe, “A Survey on Green Mobile Networking: From The Perspectives of Network Operators and Mobile Users,” IEEE Communications Surveys Tutorials, vol. 17, no. 3, pp. 1535–1556, thirdquarter 2015. [479] K. Gomez, R. Riggio, T. Rasheed, and F. Granelli, “Analysing the energy consumption behaviour of WiFi networks,” in 2011 IEEE Online Conference on Green Communications, Sept 2011, pp. 98–104. [480] S. Xiao, X. Zhou, D. Feng, Y. Yuan-Wu, G. Y. Li, and W. Guo, “Energy-Efficient Mobile Association in Heterogeneous Networks With Device-to-Device Communications,” IEEE Transactions on Wireless Communications, vol. 15, no. 8, pp. 5260–5271, Aug 2016. [481] A. Abrol and R. K. Jha, “Power Optimization in 5G Networks: A Step Towards GrEEn Communication,” IEEE Access, vol. 4, pp. 1355–1374, 2016. [482] L. Valcarenghi, D. P. Van, P. G. Raponi, P. Castoldi, D. R. Campelo, S. Wong, S. Yen, L. G. Kazovsky, and S. Yamashita, “Energy efficiency in passive optical networks: where, when, and how?” IEEE Network, vol. 26, no. 6, pp. 61–68, November 2012. [483] J. Kani, S. Shimazu, N. Yoshimoto, and H. 
Hadama, “Energy-efficient optical access networks: issues and technologies,” IEEE Communications Magazine, vol. 51, no. 2, pp. S22–S26, February 2013. [484] P. Vetter, D. Suvakovic, H. Chow, P. Anthapadmanabhan, K. Kanonakis, K. Lee, F. Saliou, X. Yin, and B. Lannoo, “Energy-efficiency improvements for optical access,” IEEE Communications Magazine, vol. 52, no. 4, pp. 136–144, April 2014. [485] B. Skubic, E. I. de Betou, T. Ayhan, and S. Dahlfort, “Energy-efficient next-generation optical access networks,” IEEE Communications Magazine, vol. 50, no. 1, pp. 122–127, January 2012. [486] E. Goma, M. Canini, A. Lopez, N. Laoutaris, D. Kostic, P. Rodriguez, R. Stanojevic, and P. Yague, “Insomnia in the Access (or How to Curb Access Network Related Energy Consumption),” Proceedings of the ACM SIGCOMM 2011 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, 2011. [487] A. S. Gowda, A. R. Dhaini, L. G. Kazovsky, H. Yang, S. T. Abraha, and A. Ng’oma, “Towards Green Optical/Wireless In-Building Networks: Radio-Over-Fiber,” Journal of Lightwave Technology, vol. 32, no. 20, pp. 3545–3556, Oct 2014. [488] B. Kantarci and H. T. Mouftah, “Energy efficiency in the extendedreach fiber-wireless access networks,” IEEE Network, vol. 26, no. 2, pp. 28–35, March 2012. [489] M. Gupta and S. Singh, “Greening of the Internet,” in Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, ser. SIGCOMM ’03. New York, NY, USA: ACM, 2003, pp. 19–26. [490] J. C. C. Restrepo, C. G. Gruber, and C. M. Machuca, “Energy Profile Aware Routing,” in 2009 IEEE International Conference on Communications Workshops, June 2009, pp. 1–5. [491] S. Nedevschi, L. Popa, G. Iannaccone, S. Ratnasamy, and D. Wetherall, “Reducing Network Energy Consumption via Sleeping and Rateadaptation,” in Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, ser. NSDI’08. Berkeley, CA, USA: USENIX Association, 2008, pp. 323– 336. [492] G. Shen and R. Tucker, “Energy-Minimized Design for IP Over WDM Networks,” Optical Communications and Networking, IEEE/OSA Journal of, vol. 1, no. 1, pp. 176–186, June 2009. [493] X. Dong, T. E. H. El-Gorashi, and J. M. H. Elmirghani, “On the Energy Efficiency of Physical Topology Design for IP Over WDM Networks,” Journal of Lightwave Technology, vol. 30, no. 12, pp. 1931–1942, June 2012. [494] S. Zhang, D. Shen, and C. K. Chan, “Energy-Efficient Traffic Grooming in WDM Networks With Scheduled Time Traffic,” Journal of Lightwave Technology, vol. 29, no. 17, pp. 2577–2584, Sept 2011. [495] Z. H. Nasralla, T. E. H. El-Gorashi, M. O. I. Musa, and J. M. H. Elmirghani, “Energy-Efficient Traffic Scheduling in IP over WDM Networks,” in 2015 9th International Conference on Next Generation Mobile Applications, Services and Technologies, Sept 2015, pp. 161–164. [496] Z. H. Nasralla and T. E. H. El-Gorashi and M. O. I. Musa and J. M. H. Elmirghani, “Routing post-disaster traffic floods in optical core networks,” in 2016 International Conference on Optical Network Design and Modeling (ONDM), May 2016, pp. 1–5. [497] Z. H. Nasralla, M. O. I. Musa, T. E. H. El-Gorashi, and J. M. H. Elmirghani, “Routing post-disaster traffic floods heuristics,” in 2016 18th International Conference on Transparent Optical Networks (ICTON), July 2016, pp. 1–4. [498] M. O. I. Musa, T. E. H. El-Gorashi, and J. M. H. 
Elmirghani, “Network coding for energy efficiency in bypass IP/WDM networks,” in 2016 18th International Conference on Transparent Optical Networks (ICTON), July 2016, pp. 1–3. [499] M. O. I. Musa and T. E. H. El-Gorashi and J. M. H. Elmirghani, “Energy efficient core networks using network coding,” in 2015 17th International Conference on Transparent Optical Networks (ICTON), July 2015, pp. 1–4. [500] T. E. H. El-Gorashi, X. Dong, A. Lawey, and J. M. H. Elmirghani, “Core network physical topology design for energy efficiency and resilience,” in 2013 15th International Conference on Transparent Optical Networks (ICTON), June 2013, pp. 1–7. [501] M. Musa, T. Elgorashi, and J. Elmirghani, “Energy efficient survivable IP-over-WDM networks with network coding,” IEEE/OSA Journal of Optical Communications and Networking, vol. 9, no. 3, pp. 207–217, March 2017. [502] M. Musa and T. Elgorashi and J. Elmirghani, “Bounds for energy efficient survivable IP over WDM networks with network coding,” IEEE/OSA Journal of Optical Communications and Networking, vol. 10, no. 5, pp. 471–481, May 2018. [503] Y. Li, L. Zhu, S. K. Bose, and G. Shen, “Energy-Saving in IP Over WDM Networks by Putting Protection Router Cards to Sleep,” Journal of Lightwave Technology, vol. 36, no. 14, pp. 3003–3017, July 2018. [504] J. M. H. Elmirghani, L. Nonde, A. Q. Lawey, T. E. H. El-Gorashi, M. O. I. Musa, X. Dong, K. Hinton, and T. Klein, “Energy efficiency measures for future core networks,” in 2017 Optical Fiber Communications Conference and Exhibition (OFC), March 2017, pp. 1–3. [505] J. M. H. Elmirghani, T. Klein, K. Hinton, L. Nonde, A. Q. Lawey, T. E. H. El-Gorashi, M. O. I. Musa, and X. Dong, “GreenTouch GreenMeter core network energy-efficiency improvement measures and optimization,” IEEE/OSA Journal of Optical Communications and Networking, vol. 10, no. 2, pp. A250–A269, Feb 2018. [506] M. O. I. Musa, T. El-Gorashi, and J. M. H. Elmirghani, “Bounds on GreenTouch GreenMeter Network Energy Efficiency,” Journal of Lightwave Technology, pp. 1–1, 2018. [507] X. Dong, T. El-Gorashi, and J. Elmirghani, “IP Over WDM Networks Employing Renewable Energy Sources,” Lightwave Technology, Journal of, vol. 29, no. 1, pp. 3–14, Jan 2011. [508] X. Dong, T. El-Gorashi, and J. M. H. Elmirghani, “Green IP over WDM Networks: Solar and Wind Renewable Sources and Data Centres,” in 2011 IEEE Global Telecommunications Conference – GLOBECOM 2011, Dec 2011, pp. 1–6. [509] G. Shen, Y. Lui, and S. K. Bose, ““Follow the Sun, Follow the Wind” Lightpath Virtual Topology Reconfiguration in IP Over WDM Network,” Journal of Lightwave Technology, vol. 32, no. 11, pp. 2094– 2105, June 2014. [510] M. Gattulli, M. Tornatore, R. Fiandra, and A. Pattavina, “Low- Emissions Routing for Cloud Computing in IP-over-WDM Networks with Data Centers,” IEEE Journal on Selected Areas in Communications, vol. 32, no. 1, pp. 28–38, January 2014. [511] A. Q. Lawey, T. E. H. El-Gorashi, and J. M. H. Elmirghani, “Renewable energy in distributed energy efficient content delivery clouds,” in 2015 IEEE International Conference on Communications (ICC), June 2015, pp. 128–134. [512] S. K. Dey and A. Adhya, “Delay-aware green service migration schemes for data center traffic,” IEEE/OSA Journal of Optical Communications and Networking, vol. 8, no. 12, pp. 962–975, December 2016. [513] L. Nonde, T. E. H. Elgorashi, and J. M. H. Elmirgahni, “Virtual Network Embedding Employing Renewable Energy Sources,” in 2016 IEEE Global Communications Conference (GLOBECOM), Dec 2016, pp. 1–6. 

Sanaa Hamid Mohamed received the B.Sc. degree (honors) in electrical and electronic engineering from the University of Khartoum, Sudan, in 2009 and the M.Sc. degree in electrical engineering from the American University of Sharjah (AUS), United Arab Emirates, in 2013. She received a full-time graduate teaching assistantship in the MSELE program at AUS in 2011, and a Doctoral Training Award (DTA) from the UK Engineering and Physical Sciences Research Council (EPSRC) to fund her PhD studies at the University of Leeds in 2015. She is currently a PhD student at the Institute of Communication and Power Networks, University of Leeds, UK. She held research and teaching positions at the University of Khartoum, AUS, Sudan University of Science and Technology, and Khalifa University between 2009 and 2013. She has been an IEEE member since 2008 and is a member of the IEEE Communications, Photonics, Computer, Cloud Computing, Software Defined Networks, and Sustainable ICT societies. Her research interests include wireless communications, optical communications, optical networking, software-defined networking, data center networking, big data analytics, and cloud and fog computing.

Taisir E. H. El-Gorashi received the B.S. degree (first-class Hons.) in electrical and electronic engineering from the University of Khartoum, Khartoum, Sudan, in 2004, the M.Sc. degree (with distinction) in photonic and communication systems from the University of Wales, Swansea, UK, in 2005, and the PhD degree in optical networking from the University of Leeds, Leeds, UK, in 2010. She is currently a Lecturer in optical networks in the School of Electrical and Electronic Engineering, University of Leeds. Previously, she held a Postdoctoral Research post at the University of Leeds (2010–2014), where she focused on the energy efficiency of optical networks, investigating the use of renewable energy in core networks, green IP over WDM networks with data centers, energy-efficient physical topology design, the energy efficiency of content distribution networks, distributed cloud computing, network virtualization, and big data. In 2012, she was a BT Research Fellow, where she developed energy-efficient hybrid wireless-optical broadband access networks and explored the dynamics of TV viewing behavior and program popularity. The energy efficiency techniques developed during her postdoctoral research contributed 3 of the 8 carefully chosen core network energy efficiency improvement measures recommended by the GreenTouch consortium for every operator network worldwide. Her work has led to several invited talks at GreenTouch, Bell Labs, the Optical Network Design and Modelling conference, the Optical Fiber Communications conference, the International Conference on Computer Communications, the EU Future Internet Assembly, the IEEE Sustainable ICT Summit, and the IEEE 5G World Forum, and to collaborations with Nokia and Huawei.

Jaafar M. H. Elmirghani (M’92–SM’99) is the Director of the Institute of Communication and Power Networks within the School of Electronic and Electrical Engineering, University of Leeds, UK. He joined Leeds in 2007; prior to that (2000–2007), as Chair in Optical Communications at the University of Wales Swansea, he founded, developed, and directed the Institute of Advanced Telecommunications (IAT) and the Technium Digital (TD), a technology incubator/spin-off hub. He has provided outstanding leadership in a number of large research projects at the IAT and TD. He received the Ph.D. in the synchronization of optical systems and optical receiver design from the University of Huddersfield, UK, in 1994 and the DSc in Communication Systems and Networks from the University of Leeds, UK, in 2014. He has co-authored Photonic Switching Technology: Systems and Networks (Wiley) and has published over 500 papers. His research interests are in optical systems and networks. Prof. Elmirghani is a Fellow of the IET, a Fellow of the Institute of Physics, and a Senior Member of the IEEE. He was Chairman of the IEEE ComSoc Transmission, Access and Optical Systems Technical Committee, Chairman of the IEEE ComSoc Signal Processing and Communications Electronics Technical Committee, and an editor of IEEE Communications Magazine. He was founding Chair of the Advanced Signal Processing for Communication Symposium, which started at IEEE GLOBECOM’99 and has continued since at every ICC and GLOBECOM. Prof. Elmirghani was also founding Chair of the first IEEE ICC/GLOBECOM optical symposium at GLOBECOM’00, the Future Photonic Network Technologies, Architectures and Protocols Symposium, which continues to date under different names. He was the founding Chair of the first Green Track at ICC/GLOBECOM at GLOBECOM 2011, and since 2012 has been Chair of the IEEE Sustainable ICT Initiative, a pan-IEEE Societies initiative responsible for green and sustainable ICT activities across the IEEE, within the IEEE Technical Activities Board (TAB) Future Directions Committee (FDC) and the IEEE Communications Society. He has served on the technical program committees of 38 IEEE ICC/GLOBECOM conferences between 1995 and 2019, including 18 times as Symposium Chair. He received the IEEE Communications Society Hal Sobol Award and the IEEE ComSoc Chapter Achievement Award for excellence in chapter activities (both in 2005), the University of Wales Swansea Outstanding Research Achievement Award in 2006, the IEEE Communications Society Signal Processing and Communication Electronics Outstanding Service Award in 2009, a best paper award at IEEE ICC 2013, and the IEEE ComSoc Transmission, Access and Optical Systems Outstanding Service Award in 2015 in recognition of “Leadership and Contributions to the Area of Green Communications”. He also received the GreenTouch 1000x Award in 2015 for “pioneering research contributions to the field of energy efficiency in telecommunications”, the 2016 IET Optoelectronics Premium Award, and, shared with 6 GreenTouch innovators, the 2016 Edison Award in the “Collective Disruption” category for their work on the GreenMeter, an international competition; these awards are clear evidence of his seminal contributions to green communications, which have a lasting impact on the environment and society. He is currently an editor of IET Optoelectronics, the Journal of Optical Communications, IEEE Communications Surveys and Tutorials, and the IEEE Journal on Selected Areas in Communications Series on Green Communications and Networking.
He was Co-Chair of the GreenTouch Wired, Core and Access Networks Working Group, an adviser to the Commonwealth Scholarship Commission, a member of the Royal Society International Joint Projects Panel, and a member of the Engineering and Physical Sciences Research Council (EPSRC) College. He was Principal Investigator (PI) of the £6m EPSRC INTelligent Energy awaRe NETworks (INTERNET) Programme Grant, 2010–2016, and is currently PI of the £6.6m EPSRC Terabit Bidirectional Multi-user Optical Wireless System (TOWS) for 6G LiFi Programme Grant, 2019–2024. He has been awarded in excess of £30 million in grants to date from EPSRC, the EU, and industry, and has held prestigious fellowships funded by the Royal Society and by BT. He was an IEEE ComSoc Distinguished Lecturer from 2013 to 2016.