A STUDY ON DISTRIBUTED FILE SYSTEMS: AN EXAMPLE OF NFS, CEPH, HADOOP

1MAHMUT UNVER, 2ATILLA ERGUZEN

1,2Computer Engineering, Kırıkkale University, Turkey E-mail: [email protected], [email protected]

Abstract- Distributed File Systems (DFS) are systems that allow discs, storage areas and other resources to be shared across both local and wide area networks. Nowadays, DFSs make it possible to handle big data and to perform large-scale computations and transactions. DFSs can be classified according to their working logic and architecture; the classification here is based on fault tolerance, replication, naming, synchronization and purpose of design. In this study, the general design of DFSs is examined first. The results are then discussed in terms of the advantages and disadvantages of Ceph, Hadoop and the Network File System (NFS), which are commonly used today.

Keywords- Distributed File Systems (DFS), Network File System (NFS), Hadoop, Ceph, fault tolerance, synchronization, replication, naming, operating system.

I. INTRODUCTION

Computer systems have undergone large evolutions until now. The first was the development of strong microprocessors in the 1980s, from 8-bit to 64-bit processing. These computers became as strong as mainframe computers while their command processing costs stayed low. The second evolution was the widespread use of local networks with high speed and large-scale nodes, which made it possible to transfer a gigabit of data in a second. As a result of these developments, distributed systems using multiple computers connected by high-speed networks appeared, rather than a single strong computer with one processor [1].

The first DFSs were developed in the 1970s. These were storage systems connected with an FTP-like structure, and they were not commonly used due to their limited storage space. L. Svoboda reported the first study on DFSs [2], and various DFSs such as LOCUS, ACORN, SWALLOW and XDFS were developed in this period. Studies on DFSs have continued until now. Today's DFSs are generally designed analogously to classical time-sharing systems, and they generally take the UNIX file system as their base. The purpose of such a system is the combination of the files and storage systems of different computers [3].

DFSs process data generated in different ways on digital data platforms, and they do so safely, efficiently and rapidly. The rapid growth of data and the need for rapid access to it have caused data storage resources to grow. This big increase in data created a new concept, Big Data. Distributed file systems are used to process big data and to perform operations on it quickly; they have emerged and are now being used effectively by cloud systems. A DFS file is stored on one or more computers, each of which is a server, and computers called clients access those files as if they were on a single machine [4].

Different DFSs were designed for different goals. For example, the Andrew File System (AFS) was designed as a distributed system that can support up to 5000 clients [5]. The Network File System (NFS) uses the Remote Procedure Call (RPC) communication model; RPC creates an intermediate layer between server and client, so the client performs operations without knowing the server's file system. This method allows clients and servers with different file systems to run together smoothly [6]. The purpose of the Google File System (GFS) is to work with big data, which is achieved by using a lot of low-cost equipment. Another DFS with a very different structure is XFS; it keeps very large files stable, and it does not have a generic server: the entire file system is distributed over the clients. Ceph decomposes the data and the metadata holding information about the data; it replicates them and so increases the system's fault tolerance.

In this study, DFSs were compared using specific classifications. The introduction of this work gives general information about DFSs. In the second part, the general architectural structures of DFSs are described and the basic concepts are explained. In the third chapter, the classification criteria to be compared are determined and explained. In the fourth chapter, currently active DFSs are described according to the criteria specified in the third chapter. In the last part, results and comparisons are presented.

II. GENERAL STRUCTURE OF DISTRIBUTED FILE SYSTEMS

The overall design goal of DFSs is to use fewer local hardware resources by sharing hardware resources. Besides the hardware advantages, a DFS also has advantages in managing files, and this is important in its general design. For example, attention

Proceedings of the 54th IRES International Conference, Florence, Italy, 28th-29th December 2016, ISBN: 978-93-86291-70-7

has been paid to the level of transparency of the DFS in order to overcome access problems caused by the network [7]. A DFS is designed to provide file services to file system clients. In this structure, clients use interfaces to create, delete, read and write files and to perform directory operations. The operating system used to perform these operations may be a distributed operating system, or an intermediate layer may be placed between the operating system and the distributed file system [8].

In a cluster-based DFS, the master server keeps the metadata of the data, while the other servers are chunk servers. With more than one chunk server, multiple clients can be served at the same time, and very large data can be processed with this architecture. An example of this architecture is the Google File System (GFS).

Fig.3. Cluster-based architecture.
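As a minimal sketch of this division of labor (a toy illustration, not the real GFS protocol; all class and method names here are hypothetical), the master holds only metadata that maps file chunks to chunk servers, while the chunk servers hold the actual data:

```python
# Hypothetical sketch of a cluster-based DFS: the master keeps only
# metadata (which chunk server holds which chunk); chunk servers hold data.

class ChunkServer:
    def __init__(self):
        self.chunks = {}                  # chunk_id -> bytes

    def read(self, chunk_id):
        return self.chunks[chunk_id]

    def write(self, chunk_id, data):
        self.chunks[chunk_id] = data

class MasterServer:
    """Holds metadata only: maps (file, chunk index) to a chunk server."""
    def __init__(self, chunk_servers):
        self.chunk_servers = chunk_servers
        self.metadata = {}                # (filename, index) -> (server_id, chunk_id)

    def allocate(self, filename, index):
        # Trivial placement policy for the sketch: hash over the servers.
        server_id = hash((filename, index)) % len(self.chunk_servers)
        chunk_id = f"{filename}#{index}"
        self.metadata[(filename, index)] = (server_id, chunk_id)
        return server_id, chunk_id

    def locate(self, filename, index):
        return self.metadata[(filename, index)]

# The client asks the master only for locations; the data itself flows
# directly between client and chunk servers, so the master does not
# become a data bottleneck and many clients can be served at once.
servers = [ChunkServer() for _ in range(3)]
master = MasterServer(servers)
sid, cid = master.allocate("report.txt", 0)
servers[sid].write(cid, b"hello distributed world")
sid, cid = master.locate("report.txt", 0)
print(servers[sid].read(cid))             # b'hello distributed world'
```

The key property illustrated is that metadata traffic and data traffic are separated, which is what lets a single master coordinate many chunk servers.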

Fig.1. The remote access model.

The architecture of a DFS is generally based on three structures:
- Server-client based structures
- Cluster-based structures
- Symmetric structures

The server-client based architecture has been used extensively in DFSs. There are two server models in this architecture. The first is the remote access model: the client is given an interface with various file operations, file operations are performed through this interface, and the server has to respond to each request.

Fig.2. The upload/download access model.

The second model is the upload/download model. Unlike the remote access model, this model downloads the file that the client will process, and the client then accesses the file locally. The server-client model is used in NFS, which is nowadays the most widely used DFS [1].

The cluster-based architecture does not have a single server; there are multiple servers in the system, and one of the servers is the master server.

The most important design difference among DFS servers with a symmetric architecture is whether they build a file system on top of a distributed storage layer or store all files directly in the participating nodes. This architecture consists of three separate layers: the first layer provides basic decentralized lookup facilities, the middle layer is a fully distributed block-oriented storage layer, and the top layer implements the file system [1].

III. CLASSIFICATION CRITERIA

DFSs have several classification criteria that affect their qualities. The most important of these are as follows:

A. Fault tolerance: When any part of the distributed system fails, the failure is tolerated without being noticed by the client [1].
B. Transparency: The distributed system looks like a single server to the client. This is the most important criterion affecting system design.
C. Replication: More than one copy of each file used in the system is created and stored in the distributed system. This improves reliability: if one copy is not accessible, the system continues to work using another copy.
D. Synchronization: Copies of a file exist on different servers; a change made by a client to one copy is also applied to the other copies.
E. Naming: Names identify all resources in the distributed system: computers, services, users and remote objects. The distributed system must provide consistent naming of objects; otherwise the objects cannot be accessed.

IV. DISTRIBUTED FILE SYSTEMS

1.1. Network File System (NFS)
Development of NFS started in 1984; the project was developed by Sun Microsystems. It is the most widely used and implemented DFS on UNIX systems. It uses

the Remote Procedure Call (RPC) model for communication [9].

Fig.4. NFS architecture.

The latest version is NFS version 4. Its basic design structure is the distributed execution of the classic UNIX file system. A virtual file system (VFS) is used; it works as an intermediate layer, which allows clients to work easily with different file systems. This interface is placed between operating system calls and file system calls. In the latest version, more than one command can be sent in a single RPC.

Fault tolerance is high in NFS. Information about the status of files is kept, and in case of an error originating from the client, the server is notified. No file replication is done in NFS; the entire system is replicated. Files are cached: the copy in the cache is compared with the copy on the server, and if the timestamps differ, the file has been changed and the cached copy is discarded [10].

NFS does not use a synchronization method, because files are not replicated; operations are performed on a single server. This is the upload/download model described earlier: the client that made the last modification to a file sends the latest data to the server. When a client wants to open a file that is cached and locked, the file is updated by revalidating it from the server. Consistency is ensured in this way [11][12]. With NFS, only the file system can be shared; printers and modems cannot be shared. The objects to be shared may be part of a directory or a file.

With NFS, the installation of a local disk for each application is not required; applications can be shared via the server, and the same machine can be both server and client. As a result, NFS reduces the cost of data storage.

1.2. CEPH
Ceph is open source. Today, this object-based distributed file system is increasing in popularity. The first report on it was published by Weil et al. in 2006 [13]. It was acquired by Red Hat in 2014, and seven major versions have been developed; the last stable version is "Giant". Ceph offers three different storage architectures at the same time: object-based storage, block-based storage and a file system.

The most important features of Ceph are reliability and scalability. Metadata is the data that holds information about the data. In most distributed systems, the data and the metadata describing it are located on separate servers, and the data cannot be accessed when the metadata is not available. Ceph does not need a metadata server; instead, it uses an algorithm, called CRUSH, that determines the location of the stored data. Clients use this algorithm to determine the position of a data set and read it, so the problem of not being able to reach the metadata does not occur.

In Ceph, more than one copy of the data is kept distributed over the servers; replication is performed in this way.

According to workload measurements, Ceph has very good input/output performance. It has scalable metadata management that allows up to 250,000 metadata transactions per second, and it can be integrated into well-known cloud and virtualization systems (such as OpenStack, CloudStack and VMware).

Fig.5. Ceph system architecture.

1.3. HADOOP
Hadoop provides a distributed file system known as the Hadoop Distributed File System (HDFS). Hadoop uses the high-level Java programming language [14] and the MapReduce architecture. It is a framework that allows the analysis and transformation of very big data clusters.

Fig.6. Hadoop architecture.
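Hadoop jobs themselves are written in Java, but the MapReduce idea the framework builds on can be sketched compactly (a toy word-count illustration, not the Hadoop API): a map phase emits key/value pairs, a shuffle phase groups values by key, and a reduce phase folds each group into a result.

```python
# Toy illustration of the MapReduce model used by Hadoop.
# Not the Hadoop Java API; all function names here are illustrative.
from collections import defaultdict

def map_phase(document):
    # Emit one (word, 1) pair per word occurrence.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group intermediate values by key; in Hadoop this step is
    # performed by the framework between mappers and reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Fold each group of values into a single count.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data needs big storage", "data storage"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"], counts["storage"])   # 2 2 2
```

Because the map and reduce phases are independent per document and per key, the framework can run them in parallel across many nodes, which is what lets Hadoop scale to very large data sets.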

The Hadoop Distributed File System (HDFS) can reliably store very large data sets. HDFS is also designed to stream these data sets to the client application with high bandwidth [15].

Hadoop works with backup copies of the data, assuming that compute elements and storage may fail [16]. If a fault occurs, the copy that resides on another node is copied again; this keeps the data safe. Hadoop is scalable and works on petabyte dimensions [17].

Hadoop is used by many large companies today and is preferred in both industrial and academic settings. Companies such as Twitter, LinkedIn, eBay, AOL, Alibaba, Yahoo, Facebook, Adobe and IBM use Hadoop [18].

V. COMPARISON AND CONCLUSIONS

Today, distributed file systems are used in large-scale companies as well as in small-scale companies and projects. With the growth of the data produced in recent years, they are preferred especially for storing and managing big data. Cloud data storage is a subservice of distributed systems, and distributed systems are used in cloud systems within the Internet.

NFS, one of the DFSs discussed in this work, is essentially a distributed version of the UNIX file system. It is used to store big data and to share hardware and software resources.

The fact that Ceph does not need a metadata server, its most distinctive feature, ensures that the system can work reliably and makes it stand out. Due to its scalability, big data can be processed with Ceph. There are over 200 developers and more than 50 supporting companies behind it, which makes it ideal for low-budget institutions.

Hadoop can handle the very large data that many companies nowadays must process. It contains many stacked servers. Unlike Ceph, it keeps data and metadata on separate servers. Fault tolerance makes it safe and scalable, and it also establishes a base for the cloud technology heavily used at present. It may well become the most widely used DFS in the future.

REFERENCES

[1] A.S. Tanenbaum and M.V. Steen, Distributed Systems: Principles and Paradigms, 2nd ed., USA: Pearson Prentice Hall, 2006.
[2] L. Svoboda, "File Servers for Network-Based Distributed Systems," ACM Computing Surveys, vol. 16, no. 4, pp. 353-398, 1984.
[3] U. Ergun, S. Eken and A. Sayar, "Guncel Dagitik Dosya Sistemlerinin Karsilastirilmali Analizi," in 6. Muhendislik ve Teknoloji Sempozyumu, Ankara, Turkey, 2013.
[4] P.J. Braam, "The Coda Distributed File System," Linux Journal, vol. 50, no. 6, pp. 10-20, 1998.
[5] M. Satyanarayanan, "Scalable, Secure and Highly Available Distributed File Access," Computer, vol. 23, no. 5, pp. 9-18, 1990.
[6] A. Siegel, K. Birman and K. Marzullo, "Deceit: A Flexible Distributed File System," in Management of Replicated Data, Houston, TX, USA, 1990.
[7] "Coda web site," [Online]. Available: http://www.coda.cs.cmu.edu/ljpaper/lj.html. [Accessed 16 11 2016].
[8] E. Levy, "Distributed File Systems: Concepts and Examples," ACM Computing Surveys, vol. 22, no. 4, pp. 321-374, 1990.
[9] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh and B. Lyon, "Design and Implementation of the Sun Network File System," in Proceedings of the USENIX Conference, Portland, 1985.
[10] G. Coulouris, J. Dollimore and T. Kindberg, Distributed Systems: Concepts and Design, USA: Addison-Wesley, 2011.
[11] C. Juszczak, "Improving the Performance and Correctness of an NFS Server," in Proc. Summer USENIX, USA, 1990.
[12] B. Karasulu and S. Korukoğlu, "Modern Dağıtık Dosya Sistemlerinin Yapısal Karşılaştırılması," in X. Akademik Bilisim, Çanakkale, Turkey, 2008.
[13] S. Weil, S. Brandt, E. Miller, D. Long and C. Maltzahn, "Ceph: A Scalable, High-Performance Distributed File System," in 7th Symposium on Operating Systems Design and Implementation (OSDI), USENIX, USA, 2006, pp. 307-320.
[14] M. Grossman, M. Breternitz and V. Sarkar, "HadoopCL2: Motivating the Design of a Distributed, Heterogeneous Programming System with Machine-Learning Applications," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 3, pp. 762-775, 2016.
[15] N. Shankaran and R. Sharma, "Cloud Storage Systems - A Survey," Indiana University, USA, 2011.
[16] K. Shvachko, H. Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System," in IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), NV, USA, 2010, pp. 1-10.
[17] G. Yavuz, S. Aytekin and M. Akçay, "Apache Hadoop ve Dagıtık Sistemler Üzerindeki Rolü," Journal of the Institute of Science & Technology of Dumlupinar University, no. 27, pp. 43-54, 2012.
[18] "Hadoop PoweredBy," [Online]. Available: https://wiki.apache.org/hadoop/PoweredBy. [Accessed 15 11 2016].


