
The University of Manchester Research

Citation for published version (APA): Zhang, N., & Lahmer, I. (2014). MapReduce: MR Model Abstraction for Future Security Study. In the 7th International Conference on Security of Information and Networks (SIN'14). Association for Computing Machinery.

MapReduce: MR Model Abstraction for Future Security Study

Ibrahim Lahmer, Ning Zhang
School of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
[email protected]  [email protected]

ABSTRACT

MapReduce is a parallel programming paradigm proposed to process large amounts of data in a distributed setting. Since its introduction, there have been efforts to improve the architecture of this model, making it more efficient, secure and scalable. In parallel with these developments, there are also efforts to implement and deploy MapReduce, and one of its most popular open-source implementations is Hadoop. Some more recent practical MapReduce implementations have made architectural changes to the original MapReduce model, e.g., those by Facebook and IBM. These architectural changes may have implications for the design of solutions to secure MapReduce. To take account of these changes and to inform any future design of such solutions, this paper attempts to build a generic MapReduce computation model capturing the main features and properties of the most recent MapReduce implementations.

Keywords

MR: MapReduce; DFS: Distributed File System; YARN: Yet Another Resource Negotiator; Master node; Resource Manager; JobTracker; TaskTrackers; NameNode; DataNodes.

1. INTRODUCTION

MapReduce, as a parallel and distributed programming framework, is becoming more and more widely used [17]. Hadoop, an implementation of the MapReduce framework, has been adopted by many companies, including major IT players such as Facebook, eBay, IBM and Yahoo [2][16]. Recently, a team of Facebook engineers [5] claimed to have developed a new MapReduce topology called Corona. Yahoo developers [10] and IBM developers [11] have also announced that they have developed and used a new MapReduce implementation, YARN. These new implementations of MapReduce use architectures that differ from the one used by Hadoop, the previous implementation. For example, in Corona and YARN, the functions played by a single master node in Hadoop are split into two sets, played respectively by two master nodes: a Resource Manager and a JobTracker. In other words, Corona and YARN use two masters to perform the functions that are performed by the single master in Hadoop.

These architectural changes in the new MapReduce implementations bring a number of benefits, such as improved performance and a reduced chance of creating bottlenecks in the system [5][11]. They also have implications for other aspects of the system, such as security provisioning. For example, these changes alter the way different system components interact, and how and from which entities various requests and responses are generated. This leads to changes in the risk landscape of the system, and therefore in how security should be provided. However, existing security solution designs for MapReduce have not taken these architectural changes into account. For example, Cohen and Acharya (2013) designed an encryption service and examined it against the old MapReduce framework version [1]; the scheme encrypts data that is uploaded by the client and stored in the DFS. Somu et al. (2014) proposed another security service [12]: an authentication service by which the client can be authenticated to the MapReduce framework. Their designs perform well in terms of security and performance in the old MapReduce version, but they assume that there is only one master node (the JobTracker) to which the client authenticates and submits the job. This master node also retrieves the input data splits uploaded to the DFS and allocates the TaskTrackers (worker nodes). In the current MapReduce framework implementations this is no longer the case: there are two master nodes. One master node, the Resource Manager, is responsible for job submission, providing job IDs and the paths of job files to clients; the client then uploads the input data files into the DFS. The other master node, the JobTracker, retrieves the input splits uploaded by the client to the DFS and allocates the TaskTrackers assigned by the Resource Manager. This raises security concerns regarding the way the MR components interact and access the data stored in the DFS. For instance, in the new MR framework implementations the client authenticates to the Resource Manager rather than to the JobTracker, and the client's input splits are retrieved by the other master node. This means that the JobTracker also needs the proper secret keys to access the DFS and retrieve the input splits once the client's job submission is completed. Also, in the new MR implementations, the worker daemons have to authenticate to the Resource Manager, rather than to the JobTracker, to obtain admission to the MR framework.

The authors' main research task is to investigate and design security solutions for MapReduce applications, as the designs of some security services for MapReduce are architecture dependent, and as the existing MapReduce security solutions are largely homogeneous and only secure the entrance of a job execution [13][9]. For this, the solution should be designed based on a generic MapReduce framework. However, first, there is no generic MapReduce model in the literature that captures the recent changes in the MapReduce architecture. Second, through literature research on MapReduce architectures [2][4] and existing MapReduce applications, such as those used by [5][10][11], we have identified a mismatch between the original MapReduce architecture and what has been deployed in recent real-life applications. As a first step, this paper tries to bridge this gap by presenting an abstract MapReduce model capturing the features and properties of the most recent deployments of MapReduce. We investigate the existing MapReduce architecture model, synthesize from these real-life deployments and, in future steps, will build and present generic security solution requirements and a design for such a model.

The remaining part of this paper is structured as follows. Section 2 introduces MapReduce, its architecture and architectural components, and explains, at a generic level, how it is typically deployed in a Cloud environment. Section 3 analyses, in detail, a simple example of a job execution process using MapReduce and, based on this analysis, presents an abstract model of MapReduce execution. It starts with an example job execution process and, based on this example, constructs an abstract model for MapReduce highlighting the procedures and interactions among its architectural components during a job execution. Finally, Section 4 concludes the paper and outlines future work.

2. MAPREDUCE ARCHITECTURE

This section introduces the MapReduce framework, highlighting its physical and logical components, and explains how the framework may be deployed in practice.

2.1 An Overview

MapReduce is a programming model that supports parallel processing and distributed resource sharing. It uses a set of distributed nodes (called servers) that work collaboratively to process a large amount of data or to execute a specified job. A job execution in MapReduce is carried out in two distinctive phases: Map and Reduce. In the Map phase, a set of nodes, called Mappers, process and convert one set of data (the input) into another set of data (intermediate results). The input data is broken into key-value pairs called tuples. In the Reduce phase, a number of nodes, called Reducers, combine and process the intermediate results to produce a smaller set of tuples, the job result. As indicated in the framework's name, in MapReduce, Map tasks are always executed before Reduce tasks [17]. Figure 1 shows the generic sequence of the process in MapReduce applications.

Figure 1: A typical MapReduce task sequence [3]
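To make the two phases concrete, the following is a minimal, framework-independent sketch in plain Python, assuming a simple word-count job; the function names and the in-memory shuffle step are illustrative assumptions only and do not correspond to the API of Hadoop, YARN or Corona.

from collections import defaultdict

# Minimal, framework-independent sketch of the Map and Reduce phases using a
# word-count job. Names and the in-memory shuffle are illustrative only.

def map_phase(records):
    # Map: turn each input record into (key, value) tuples.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Group intermediate values by key (done by the framework in practice).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values of each key into the final result tuples.
    return {key: sum(values) for key, values in groups.items()}

if __name__ == "__main__":
    records = ["map reduce map", "reduce reduce map"]
    print(reduce_phase(shuffle(map_phase(records))))  # {'map': 3, 'reduce': 3}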
2.2 MapReduce Components

A MapReduce system consists of a number of components, which are summarized in Figure 2. As shown in the figure, these components can largely be classified into two main groups (called clusters). One is the Distributed File System (DFS, or Distributed Storage), which is used to store data files during job executions; the DFS consists of one name node and multiple data nodes. The other is the Processing Framework (PF) cluster, which is used to carry out job executions (or distributed computations); the PF consists of two master nodes and a number of worker nodes. In addition to these physical components, there are also logical components (i.e. software components) and data components. In the following, we discuss the functionalities of these components [4][5][15][16].

2.2.1 Physical Components

Physical components are the actual physical nodes used to host software components, clients and servers. Depending on the roles played by the servers, they may be master workers or slave workers. The corresponding nodes hosting these components are therefore referred to as user machines, master worker nodes (or master nodes, in short) and slave worker nodes (or slave nodes, in short); a small illustrative summary of this topology follows the list below.

1. User Machines: a user machine runs a client application, which largely performs two major tasks: i) submit a job to the MapReduce master node, and ii) specify the configuration, trigger and monitor the job execution [8].

2. Master nodes: in Figure 2, master nodes are of the following types.

(a) Central Cluster Manager (also called Resource Manager): this node is responsible for managing the job schedules and the cluster resources, by tracking the nodes in the cluster and the amount of free resources on these nodes. There is one such node per cluster.

(b) JobTracker (also called Master Node or MRApplication Master): it runs a program, called the master daemon, that is responsible for managing (monitoring and tracking the progress of) each individual job execution. It also negotiates with the Resource Manager regarding the use of resources (containers, i.e. slaves). Usually, there are one or more JobTrackers per cluster. In the older MapReduce framework version, there is only one master node, which plays the roles of both Resource Manager and JobTracker.

(c) NameNode (also called Catalogue Server or Metadata Node): this node manages and maintains a set of DataNodes, i.e. a file system and metadata for all the directories and files stored in the DataNodes. It is usually part of the DFS cluster and there is one per cluster.

3. Slave nodes: they are of two types, TaskTrackers and DataNodes.

(a) TaskTracker: a TaskTracker node may host a single Mapper, a single Reducer or both. Usually, there is more than one TaskTracker node per cluster. As shown in Figure 2, each TaskTracker runs the worker daemon to execute multiple tasks.

(b) DataNode: this node is also part of the DFS and is sometimes called a Storage Server. It is the node where the actual data files are stored, in units of data blocks (each being 64 MB or 128 MB), and shared. Each data block has its own block ID and, for fault-tolerance and performance considerations, is replicated on a number of DataNodes. A DataNode periodically sends a report to the NameNode listing the blocks it stores, as shown in Figure 2. There are typically multiple data nodes in a cluster. In some cases, the DataNode may be the same node as the TaskTracker node for performance considerations [15][17].
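The physical topology described above can be summarized, purely for illustration, as a small data structure. The class and field names below are our own modelling assumptions (they are not part of any MapReduce implementation); they reflect one Resource Manager, one or more JobTrackers, one NameNode and a set of slave nodes that often co-locate the TaskTracker and DataNode roles.

from dataclasses import dataclass, field
from typing import List

# Illustrative summary of the physical components above, assuming the
# two-master (Resource Manager + JobTracker) topology described in the text.

@dataclass
class SlaveNode:
    host: str
    runs_task_tracker: bool = True   # hosts Mappers and/or Reducers
    runs_data_node: bool = True      # stores DFS blocks (often co-located)

@dataclass
class Cluster:
    resource_manager: str                                   # one per cluster
    name_node: str                                          # one per DFS cluster
    job_trackers: List[str] = field(default_factory=list)   # one or more
    slaves: List[SlaveNode] = field(default_factory=list)

cluster = Cluster(
    resource_manager="rm-0",
    name_node="nn-0",
    job_trackers=["jt-0"],
    slaves=[SlaveNode(f"worker-{i}") for i in range(4)],
)
print(len(cluster.slaves), "slave nodes tracked by", cluster.resource_manager)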

2.2.2 Logical Components and Data

Logical components, or programs, are executed on the above-mentioned physical components to accomplish a computational task or to process some data. These logical components and the types of data involved are explained below.

1. Master daemon: executed on the JobTracker node; it performs the tasks assigned to the JobTracker.

2. Worker daemon: executed on a TaskTracker node. Its function is to launch and manage the mapper and reducer child daemons (i.e. the mission of a worker daemon is to execute the Map and Reduce tasks).

3. Job history daemon: also executed on the JobTracker node. Its function is to store and archive the job history for later interrogation by the user, if desired.

4. Input data: broadly refers to the various input data and/or files used in a MapReduce computation. It includes a job configuration file, input data files and computed input split information. The data in an input data file are parsed into key-value pairs by the Mappers, which then produce intermediate buffered data.

5. Intermediate Buffered Data: the output of the Map phase. Assuming the number of reducers used in a job execution is R, the intermediate data is a set of R regions called partitions (i.e. for each reducer assigned to a job execution, there is one data partition output from the Map phase to process).

6. Shuffle Handler: a service designed to sort and group the intermediate key-value pairs according to the key (an example in the next section explains what this key is and how the key-value pairs are used to represent the data); a brief sketch also appears after this list. It runs on the TaskTracker node.

7. Output Data: the final result produced by the Reducers.

8. Job Configuration: a set of variables and parameters specific to a certain job and specified by the user.

Figure 2 depicts both the physical and logical components of the MapReduce framework, organized in the two cluster categories.

Figure 2: MapReduce framework components
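As a rough sketch of items 5 and 6 above, the snippet below splits intermediate key-value pairs into R partitions (one per Reducer) and groups each partition by key. A simple deterministic hash partitioner is assumed here for illustration; real frameworks typically let the job configuration override the partitioning function.

from collections import defaultdict

# Sketch of items 5 and 6: intermediate (key, value) pairs are split into R
# partitions (one per Reducer) and each partition is grouped/sorted by key.
# The deterministic toy hash partitioner below is an illustrative assumption.

R = 2  # number of Reducers configured for the job

def partition(key, num_reducers=R):
    # Stable toy hash so the partition assignment is reproducible.
    return sum(ord(c) for c in key) % num_reducers

def shuffle_and_group(intermediate_pairs):
    partitions = [defaultdict(list) for _ in range(R)]
    for key, value in intermediate_pairs:
        partitions[partition(key)][key].append(value)
    # Keys within a partition are presented to its Reducer in sorted order.
    return [dict(sorted(p.items())) for p in partitions]

pairs = [("York", 18), ("Manch.", 19), ("York", 11), ("Leeds", 20)]
for i, part in enumerate(shuffle_and_group(pairs)):
    print(f"partition {i}: {part}")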

2.3 Implementing and Hosting MapReduce Components in the Cloud

MapReduce components are typically hosted in a virtual environment by a MapReduce provider. Virtualization allows multiple independent operating system instances to run on a single physical node, and allows physical resources to be shared by multiple users [8]. The MapReduce framework is hosted and implemented in such an environment because of the advantages provided by this technology, for instance server consolidation, increased uptime (e.g. fault tolerance, live migration, high availability, storage migration) and the isolation of application tasks within one physical server [7].

The MapReduce application components running on a cloud system are hosted on top of Virtual Machines (VMs). Figure 3 shows how the different components of a MapReduce application, such as the master node (master daemon), worker nodes (worker daemons), DataNodes and NameNode, are hosted. They can run on top of the same physical server. They are isolated using virtualization host machines and virtual switches, and they share physical resources (e.g. CPUs and RAM). At the network level, they are isolated using virtual switches and VLAN mechanisms. Communications among sets of MapReduce components hosted in different physical servers are carried out through the external network fabric. The hypervisor is the core of the virtualization technique: as a middleware layer, it is responsible for creating, managing and executing VM operating system instances and for sharing the physical resources.

There are a number of cloud service delivery models [6], but the following two are used to host a MapReduce application. They differ in the services provided by the cloud provider:

1. Infrastructure-as-a-Service (IaaS): in this delivery model, the customer allocates virtual resources as needed. The IaaS provider delivers the required storage and networking devices in the form of a wrapped service. The provider also meets basic security needs, including physical security and system security such as firewalls, Intrusion Prevention Systems (IPS) and Intrusion Detection Systems (IDS). A MapReduce application can be implemented using this service delivery model, in which the application is managed directly by the MR client through the master nodes.

2. Platform-as-a-Service (PaaS): this service delivery model offers facilities such as Software Development Kits (SDKs) (e.g. Python and Java) which customers can use to write their own programs and develop their own applications. With this service delivery model, customers do not have to manage the required hardware or software components, such as VMs, virtual switches, CPUs and RAM. A MapReduce application can also be implemented using this cloud service delivery model.

Figure 3: Hosting MapReduce components in the Cloud
3. A GENERIC MAPREDUCE COMPUTATION MODEL

The components of the MapReduce model execute and process the input data according to a specific execution data work-flow. This work-flow involves a number of interactions between model components. This section, starting with a practical use-case scenario, attempts to build an abstract model of MapReduce computation capturing the data flows, as well as the interactions among MapReduce components, during a job execution. This abstract model can serve as a basis for our future design of a security solution for MapReduce.

3.1 MapReduce Job Execution: an Example

To describe the data execution work-flow in brief, a simplified practical example is illustrated. Suppose that there is a large number of records containing the daily temperatures of UK cities registered over one year. They are stored in a number of database files, as shown in Figure 4, and these files are stored in a number of data blocks in the DFS. In this example, the job is submitted by the client to the master, and the duty of the job is to find the maximum temperature of each city (each city has more than one record per file).

Figure 4: A number of files storing the temperatures of UK cities (each record holds a record number, a city name, a temperature in degrees Celsius and other data)

To simplify this example for explanation purposes, consider a number of files n = 10, each file stored in one data block, and each input split of the same size as a data block. Figure 5 shows, as a layout, the MapReduce components involved in processing the input data for this task. First, the client (user) determines the job configuration parameters, the input data format and the location of the input data. In our example, the input files are divided into input splits (10 input splits), and there is one input split per Mapper to process. They are assigned to the Mappers by the master based on the job configuration files and the parameters involved [14]. Each Mapper reads and parses its input split into key-value pairs (in our example, the key is the city name and the value is its temperature). The Mapper then starts the Map execution task (in this example, the Map phase finds the highest temperature for each city in each input block, i.e. file). The output of the Mappers is the intermediate buffered data, which is written locally in a number of data partitions equal to the number of Reducers (in our example, two data partitions and two Reducers). After the Mappers have completed their task, each partition is assigned to a Reducer. The Reducers then start their execution task, which is to find the maximum temperature per city, but before that the Reducers themselves have to perform the shuffle process. In the shuffle process, the intermediate data (within the same partition) is sorted and grouped based on the key. (For example, the city York, i.e. a key, has a set of temperature values 18, 11, 23, 9 as the output of different Mappers, as listed in the intermediate result shuffle table in Figure 5.) The final output of each Reduce task, which is a list of cities and the maximum temperature associated with each city, is written into separate files on the DFS, as shown in Figure 5.
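The example can be reproduced end-to-end in plain Python as follows. Each list plays the role of one input split processed by one Mapper; the temperature values are illustrative rather than the exact figures of Figure 4, and the single-process shuffle stands in for the distributed one.

from collections import defaultdict

# End-to-end sketch of the max-temperature job above in plain Python. Each
# list stands for one input split handled by one Mapper; the values are
# illustrative, not the exact figures of Figure 4.

splits = [
    [("York", 0), ("Manch.", 19), ("Leeds", 9), ("Leeds", 20), ("London", 13)],
    [("Manch.", 15), ("Manch.", 10), ("York", 9), ("London", 20), ("Durham", 20)],
]

def map_task(split):
    # Map phase: highest temperature per city within one input split.
    local_max = {}
    for city, temp in split:
        local_max[city] = max(temp, local_max.get(city, temp))
    return list(local_max.items())

def shuffle(intermediate):
    # Shuffle: group the per-split maxima by city (the key).
    groups = defaultdict(list)
    for pairs in intermediate:
        for city, temp in pairs:
            groups[city].append(temp)
    return groups

def reduce_task(groups):
    # Reduce phase: overall maximum temperature per city.
    return {city: max(temps) for city, temps in groups.items()}

print(reduce_task(shuffle([map_task(s) for s in splits])))
# {'York': 9, 'Manch.': 19, 'Leeds': 20, 'London': 20, 'Durham': 20}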

Figure 5: A simplified practical example of the data execution flow in the MapReduce model (P: partition, exe: execution)

3.2 Interactions of MapReduce Components

The following points describe a typical execution flow of a job executed by the MapReduce framework, from the moment the job is submitted by the client to the moment it is successfully completed. This data execution work-flow is further illustrated in Figure 6.

Figure 6: Data execution work-flow in the MapReduce model

1. First, the user (on the client node) runs the MapReduce program (Job Client) after being authenticated to the master node using his credentials, usually a username and password (Figure 6 (1)(2)). The master node is either the Resource Manager, in the new version of the MapReduce framework, or the JobTracker itself, in the previous version. This initial authentication is done using Kerberos. The Kerberos protocol involves a Key Distribution Centre (KDC) server, with which the client and the server are registered. The KDC contains an Authentication Service (AS) and a Ticket Granting Service (TGS). The client uses his credentials to make an authentication request to the AS in order to obtain a Ticket Granting Ticket (TGT). The AS issues a TGT to the client, who then sends it to the TGS to request a service ticket. Once the client has the service ticket, he uses it to access the designated resource (i.e. service).

2. The client requests a job ID (i.e. a new application ID) and the path to which the job definition files should be written (Figure 6 (3)).

3. Once the client has acquired the job ID (Figure 6 (4)), the client contacts the NameNode to start writing into the DFS (Figure 6 (5)). The client authenticates to the NameNode using the service ticket (which contains the address of the remote service/resource as well as the secret) issued by the TGS following the first interaction between the client and the AS.

4. The client is redirected to the DataNode and starts copying the job resources (the configuration file and the input splits computed by the Job Client) to the specified directories (Figure 6 (6)).

5. Once the copying is completed, the client informs the master node that the job is ready to be launched; this is called client job submission (Figure 6 (7)).

6. The Resource Manager allocates the JobTracker, and the master daemon (Application Master) is launched on this JobTracker (Figure 6 (8)(9)). This daemon creates an object instance to represent the job and keeps track of the progress of the job's tasks.

7. In order for the JobTracker to create a list of tasks to be run, it retrieves the input splits (computed by the client) from the DFS (Figure 6 (10)), with one Map task object assigned to each input split. The JobTracker also determines the Reduce task objects.

8. The JobTracker requests TaskTrackers from the Resource Manager for all Map and Reduce tasks (Figure 6 (11)). In this request, for performance reasons, the JobTracker sends the data location of each Map task, trying to obtain a Mapper node close to its input split (a small sketch of this locality preference follows the list).

9. Once the tasks (either Map or Reduce tasks) have been assigned to TaskTrackers, the mapper and reducer child daemons are launched by the worker daemon (Figure 6 (13)(14)). However, before the Map tasks start, the worker daemon retrieves the job resources from the DFS (DataNodes) and copies them locally (Figure 6 (12)). The Map task also parses the input data into key-value pairs, as explained in the previous example.

10. After the Map tasks have been completed by the Mappers, the Reducers start to read the output data of the Map tasks (Figure 6 (15)). Each Reducer is assigned one data partition, as explained in the previous example. The Reducer shuffles the Map-phase output data (i.e. the intermediate data) in its data partition. The partitions were specified earlier by the master using the user configuration file. The intermediate data of a given partition might be located on different nodes, as it is the output of different mapper child daemons; so the reducer child daemon connects to the designated worker nodes (daemons), where the partial data of the indexed partition is stored locally, to gain access to the desired data (Figure 6 (15)).

11. Once the Reduce task has been executed and completed, the worker child daemon writes its output into a file on the DFS (Figure 6 (16)).

12. Finally, the Reducer (worker daemon) informs the master node (JobTracker) whether the task has been completed successfully or not (Figure 6 (17)). Once all the Reducers involved in the submitted job have completed the tasks assigned to them, they notify the master node (JobTracker), and the JobTracker updates the client with the job status (Figure 6 (18)).
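The locality-aware request made in step 8 can be sketched as follows; the data structures, node names and the fallback rule are illustrative assumptions rather than the actual Resource Manager protocol.

# Sketch of the locality-aware container request in step 8: for each Map task
# the JobTracker asks for a TaskTracker on a node that already holds the
# task's input split, falling back to any free node otherwise. The data
# structures and the fallback rule are illustrative assumptions.

block_locations = {            # from the NameNode: split -> replica locations
    "split-0": ["worker-1", "worker-3"],
    "split-1": ["worker-2"],
    "split-2": ["worker-4", "worker-1"],
}
free_task_trackers = {"worker-1", "worker-2", "worker-5"}   # from the Resource Manager

def assign_mappers(locations, free_nodes):
    assignment = {}
    for split, replicas in locations.items():
        local = [node for node in replicas if node in free_nodes]
        # Prefer a data-local node; otherwise take any free TaskTracker.
        assignment[split] = local[0] if local else sorted(free_nodes)[0]
    return assignment

print(assign_mappers(block_locations, free_task_trackers))
# {'split-0': 'worker-1', 'split-1': 'worker-2', 'split-2': 'worker-1'}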
4. CONCLUSION

In this paper, we have examined real-life applications of the MapReduce model. Based on this examination, we have built an abstract MapReduce use-case model that captures, at a generic level, the main characteristics of a MapReduce execution, the functionality of each of its components and the interactions among the components in executing a computational job. This abstract model, which reflects the recent MapReduce architectural changes, will be used in our future work on designing a cost-effective security solution for MapReduce, because to study any security service for such a model it is necessary to understand its anatomy, including that of the most recent MapReduce implementations.

5. REFERENCES

[1] Jason Cohen and Subatra Acharya. Towards a trusted Hadoop storage platform: design considerations of an AES based encryption scheme with TPM rooted key protections. In Ubiquitous Intelligence and Computing, 2013 IEEE 10th International Conference on and 10th International Conference on Autonomic and Trusted Computing (UIC/ATC), pages 444–451. IEEE, 2013.

[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[3] Developers. MapReduce for App Engine. Google Inc., 2014.

[4] James Dyer and Ning Zhang. Security issues relating to inadequate authentication in MapReduce applications. In High Performance Computing and Simulation (HPCS), 2013 International Conference on, pages 281–288. IEEE, 2013.

[5] Facebook Engineers. Under the Hood: scheduling MapReduce jobs more efficiently with Corona. Facebook Inc., 2012.

[6] Diogo A. B. Fernandes, Liliana F. B. Soares, João V. Gomes, Mário M. Freire, and Pedro R. M. Inácio. Security issues in cloud environments: a survey. International Journal of Information Security, 13(2):113–170, 2014.

[7] David Marshall. Top 10 benefits of server virtualization. InfoWorld IT, 2011.

[8] George Ou. Introduction to server virtualization. TechRepublic, 2006.

[9] G. S. Sadasivam, K. A. Kumari, and S. Rubika. A novel authentication service for Hadoop in cloud environment. In Cloud Computing in Emerging Markets (CCEM), 2012 IEEE International Conference on, pages 1–6. IEEE, 2012.

[10] Sumeet Singh. Hadoop at Yahoo!: more than ever before. Yahoo! developer's network, 2013.

[11] Sumeet Singh. Introduction to YARN. Hadoop developer and administrator, IBM developer's team, 2013.

[12] Nivethitha Somu, A. Gangaa, and V. S. Shankar Sriram. Authentication service in Hadoop using one time pad. Indian Journal of Science and Technology, 7(4):56–62, 2014.

[13] Nivethitha Somu, A. Gangaa, and V. S. Shankar Sriram. Authentication service in Hadoop using one time pad. Indian Journal of Science and Technology, 7(4):56–62, 2014.

[14] Hadoop Tutorial Team. Hadoop architecture. Tangient LLC, 2013.

[15] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.

[16] Jing Xiao and Zhiwei Xiao. High-integrity MapReduce computation in cloud with speculative execution. In Theoretical and Mathematical Foundations of Computer Science, pages 397–404. Springer, 2011.

[17] Paul Zikopoulos, Chris Eaton, et al. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Osborne Media, 2012.