International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835, Volume-4, Issue-2, Feb.-2017, http://iraj.in

A STUDY ON DISTRIBUTED FILE SYSTEMS: AN EXAMPLE OF NFS, CEPH, HADOOP

MAHMUT UNVER, ATILLA ERGUZEN
Computer Engineering, Kırıkkale University, Turkey

Abstract- Distributed File Systems (DFSs) are systems used in both local and wide area networks that pool disks, storage areas, and resources. Nowadays, DFSs make it possible to handle big data, large-scale computation, and transactions. DFSs are classified based on their working logic and architecture. These classifications are based on fault tolerance, replication, naming, synchronization, and purpose of design. In this study, the general design of DFSs is examined first. Then the results are discussed based on the advantages and disadvantages of Ceph, Hadoop, and the Network File System (NFS), which are commonly used today.

Keywords- Distributed file system, Network file system (NFS), Hadoop, Ceph, fault tolerance, synchronization, replication, naming, operating system.

I. INTRODUCTION

Computer systems have gone through major evolutions. The first was the development of powerful microprocessors in the 1980s, from 8-bit to 64-bit processing. These computers were as powerful as mainframes, and at the same time their instruction-processing cost was low. The second evolution was the widespread use of high-speed local networks with a large number of nodes, which made it possible to transfer 1 gigabit of data per second. As a result of these developments, distributed systems that use multiple computers connected by high-speed networks appeared as an alternative to a single powerful computer with one processor [1].

The first DFSs were developed in the 1970s. These were storage systems connected through an FTP-like structure, and they were not commonly used because of their limited storage space. L. Svoboda reported the first study on DFSs [2], and several DFSs, such as LOCUS, ACORN, SWALLOW, and XDFS, were developed in that period. Work on DFSs has continued until now. Today's DFSs are generally designed analogously to classical time-sharing systems. They are generally based on UNIX file systems. Their purpose is to combine the files and storage systems of different computers [3].

DFSs process data generated in different ways on digital platforms, and they do so safely, efficiently, and rapidly. The rapid growth of data and the need for rapid access to it have caused data storage resources to grow. This large increase in data created a new concept, Big Data. At the same time, distributed file systems are used to process big data and to perform operations quickly. Distributed file systems are now being used effectively by cloud systems. In a DFS, files are stored on one or more computers, each of which is a server, and computers called clients access those files as if they were a single file [4].

DFSs were designed for different goals. For example, the purpose of the Andrew File System (AFS) is to provide a DFS that can support up to 5000 clients [5]. The Network File System (NFS) uses the Remote Procedure Call (RPC) communication model. RPC creates an intermediate layer between server and client. The client performs operations without knowing the server's file system. This method allows clients and servers with different file systems to work together smoothly [6].

The purpose of the Google File System (GFS) is to work with big data. This is achieved by using a large amount of low-cost equipment. Another DFS with a very different structure is XFS. It keeps very large files stable. Also, XFS does not have a central server; the entire file system is distributed over the clients. Ceph separates data from the metadata that describes it, and it replicates data to increase the system's fault tolerance.

In this study, DFSs are compared using specific classifications. The introduction gives general information about DFSs. The second part describes the general architectural structures of DFSs and explains the basic concepts. The third part determines and explains the classification criteria used in the comparison. The fourth part describes currently active DFSs according to the criteria specified in the third part. The last part presents the results and comparisons.

II. GENERAL STRUCTURE OF DISTRIBUTED FILE SYSTEMS

The overall design goal of DFSs is to use fewer local hardware resources by sharing hardware resources. Besides the hardware advantages, a DFS also has advantages in managing files, which is equally important in the general design. For example, attention has been paid to the level of transparency of the DFS in order to overcome access problems caused by the network [7]. DFSs are designed to provide file services to file system clients. In this structure, clients use interfaces to create, delete, read, and write files and to perform directory operations. The operating system used to perform these operations may be a distributed operating system or an intermediate layer between the operating system and the distributed file system [8].

Fig.1. The remote access model.

DFS architectures are generally based on three structures:
-Server-Client based structures
-Cluster based structures
-Symmetric structures

The Server-Client based architecture has been used extensively in DFSs. There are two server models in this architecture.

Fig.2. Upload/download access model.

The first is the remote access model. In this model, the client is offered an interface with various file operations. File operations are performed through this interface, and the server has to respond to each request. The second is the upload/download model. Unlike the remote access model, in this model the client downloads the file it will process and accesses it locally. The Server-Client model is used in NFS. Nowadays, NFS is becoming the most used DFS [1].

A cluster-based architecture does not have a single server. There are multiple servers in the system. One of the servers is the master server, which keeps the metadata of the data. The other servers are chunk servers. With more than one chunk server, the system can handle multiple clients at the same time. With this architecture, very large data can be processed. An example of this architecture is the Google File System (GFS).

Fig.3. Cluster-based architecture.

The most important difference between DFSs with a symmetric architecture is whether they build a file system on top of a distributed storage layer or whether whole files are stored on the participating nodes. This architecture consists of three separate layers. The first layer provides basic decentralized lookup facilities. The middle layer is a fully distributed block-oriented storage layer. The top layer implements a file system [1].

III. CLASSIFICATION CRITERIA

DFSs have several classification criteria that affect server quality. The most important of these are as follows:

A. Fault Tolerance: When any part of the distributed system fails, the failure is tolerated without being noticed by the client [1].
B. Transparency: The distributed system looks like a single server to the client. It is the most important criterion affecting system design.
C. Replication: More than one copy of the files used in the system is created and stored in the distributed system. This improves reliability. If one copy is not accessible, the system continues to work using another copy.
D. Synchronization: Copies of a file exist on different servers. A change that a client makes to one copy is also made to the other copies.
E. Naming: Names identify all resources in the distributed system: computers, services, users, and remote objects. The distributed system must name objects consistently; otherwise, the objects cannot be accessed.

IV. DISTRIBUTED FILE SYSTEMS

1.1. Network File System (NFS)

Development of NFS started in 1984 at Sun Microsystems. It is the most widely used and implemented DFS on UNIX systems. It uses the Remote Procedure Call (RPC) model for communication [9].

Fig.4. NFS architecture.

The latest version is NFS version 4. Its basic design is a distributed implementation of the classic UNIX file system. A virtual file system is used, which works like an intermediate layer. This allows clients to work easily with different file systems. It is an interface placed between operating system calls and file system calls. In the latest version, more than one operation can be sent in a single RPC.

… at the same time. These are object-based storage, block-based storage, and a file system. The most important features of Ceph are reliability and scalability.

Metadata is the data that holds information about the data. In distributed systems in general, the metadata is located on separate servers, and data cannot be accessed when the metadata is not available. Ceph does not need a metadata server to locate data. Instead, it uses an algorithm, called CRUSH, that determines the location of the data. Clients use this algorithm to determine the position of the data and read it. With this algorithm, the problem of unreachable metadata does not arise.

In Ceph, more than one copy of the data is kept, distributed across the servers. This is how it performs replication.

According to workload measurements, Ceph has very good input/output performance. It has scalable metadata management that allows up to
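
The idea behind computed placement in Ceph, where clients derive the location of data from an algorithm instead of asking a metadata server, can be sketched as follows. This is not the CRUSH algorithm itself. It is a minimal illustration that substitutes rendezvous (highest-random-weight) hashing, and the function names `locate` and `_score` as well as the `osd-*` server labels are assumptions made only for this example.

```python
import hashlib


def _score(object_name: str, server: str) -> int:
    """Deterministic pseudo-random score for an (object, server) pair."""
    digest = hashlib.sha256(f"{object_name}:{server}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def locate(object_name: str, servers: list[str], replicas: int = 2) -> list[str]:
    """Return the servers that should hold the copies of an object.

    Every client that knows the same server list computes the same answer,
    so no lookup on a central metadata server is needed.
    """
    if replicas > len(servers):
        raise ValueError("not enough servers for the requested replica count")
    # Rank servers by their score for this object and keep the top ones.
    ranked = sorted(servers, key=lambda s: _score(object_name, s), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    cluster = ["osd-a", "osd-b", "osd-c", "osd-d"]  # hypothetical server names
    print(locate("dataset-001", cluster))
```

Because the ranking depends only on the object name and the server names, removing the first server in the result promotes the second copy to first place without moving any other data. This loosely mirrors the replication and fault tolerance criteria discussed above, under the stated simplifications.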