IMPROVING THE COMMUNICATION PERFORMANCE OF DISTRIBUTED ANIMATION RENDERING USING BITTORRENT

ASST. PROF. NAMFON ASSAWAMEKIN, PH.D. EKASIT KIJSIPONGSE, PH.D.

THE RESEARCH WAS FINANCIALLY SUPPORTED BY THE UNIVERSITY OF THE THAI CHAMBER OF COMMERCE 2013

Title : Improving the Communication Performance of Distributed Animation Rendering Using BitTorrent File System

Main researcher : Asst. Prof. Namfon Assawamekin, Ph.D. School of Science and Technology, University of the Thai Chamber of Commerce Co-researcher : Ekasit Kijsipongse, Ph.D. National Electronics and Computer Technology Center

Year of accomplishment : 2013 No. of pages : 46

Key words : animation rendering, BitTorrent, distributed file system, peer-to-peer

ABSTRACT*

Rendering is a crucial process in the production of computer generated animation movies. It executes a computer program to transform 3D models into a series of still images, which will eventually be sequenced into a movie. Due to the size and complexity of 3D models, the rendering process becomes a tedious, time-consuming and unproductive task on a single machine. As a result, animation rendering is commonly carried out in a distributed computing environment where a number of computers execute in parallel to speed up the rendering process. In accordance with the distribution of computing, the data dissemination to all computers also needs certain mechanisms which allow large 3D models to be efficiently moved to those distributed computers to ensure the reduction of time and cost in animation production. In this report, we present and evaluate the BitTorrent File System (BTFS) for improving the communication performance of distributed animation rendering. The BTFS provides an efficient, secure and transparent distributed file system which decouples the applications from the complicated communication mechanism. By having the data disseminated in a peer-to-peer manner and using local caches, the rendering time can be reduced. A performance comparison using a production-grade 3D animation shows that BTFS outperforms traditional distributed file systems by up to a factor of 4 in our test configuration.

*The research was financially supported by the University of the Thai Chamber of Commerce.

CONTENTS

Page
ABSTRACT i
LIST OF TABLES v
LIST OF FIGURES vi

CHAPTER 1 INTRODUCTION 1
1.1 Introduction 1
1.2 Research Objectives 2
1.3 Research Scope 2
1.4 Research Methodology 3
1.5 Research Contributions 3

CHAPTER 2 RESEARCH BACKGROUND AND RELATED WORK 4
2.1 Distributed File System (DFS) 4
2.1.1 Network File System (NFS) 4
2.1.2 Server Message Block (SMB) / Common Internet File System (CIFS) 5
2.1.3 Andrew File System (AFS) / Coda 6
2.1.4 MogileFS 6
2.1.5 Hadoop Distributed File System (HDFS) 7
2.2 Peer-to-Peer File Sharing 7
2.2.1 Napster 7
2.2.2 Gnutella 8
2.2.3 Freenet 8
2.2.4 OceanStore 8
2.2.5 KaZaA / FastTrack 9
2.2.6 BitTorrent 9

2.3 BitTorrent Protocol 9
2.3.1 .torrent 10
2.3.2 Tracker 10
2.3.3 Peer 11
2.4 Related Work 11

CHAPTER 3 DESIGN AND IMPLEMENTATION 15
3.1 Metadata Server 15
3.2 Seeder 17
3.3 Tracker 18
3.4 BTFS Client 18
3.5 Mapping to POSIX Semantics 19
3.6 Consistency Model 20
3.6.1 Attribute and File Cache Management 21
3.6.2 Write Back Policy 21
3.7 Security 22
3.7.1 Authentication, Authorization and Access Control 22
3.7.2 Data Integrity 23
3.7.3 Confidentiality 23
3.8 Load Balancing and Fault Tolerance 24
3.9 Global Configuration Management 25
3.10 Garbage Collection 26
3.11 Operation 26

CHAPTER 4 EVALUATION AND EXPERIMENTS 27
4.1 Testbed System Configuration 27
4.2 Render Data and Software 28
4.3 Performance Comparison of BTFS and SMB File System 31
4.3.1 Small Job 31
4.3.2 Medium Job 32
4.3.3 Large Job 33
4.4 Peer Contribution under BTFS File System 34
4.5 Load Balance of BTFS with Multiple Seeders 36
4.6 BTFS Replication Performance 37
4.7 Operation Breakdown 39

CHAPTER 5 CONCLUSIONS 42

REFERENCES 43
BIOGRAPHY 46

LIST OF TABLES

Table Page
2.1 Comparison of BitTorrent-based data dissemination in distributed computing environments 13
3.1 Supported file system operations 19
3.2 Mapping from BTFS to POSIX semantics 20
4.1 Characteristics of the testing data 29
4.2 Number of files and disk usages on each seeder 37


LIST OF FIGURES

Figure Page
3.1 BitTorrent file system architecture 15
3.2 Mapping from ZooKeeper to BTFS file system 17
3.3 Global configuration file 25
4.1 Testbed system 28
4.2 Job status 30
4.3 Node status 30
4.4 Outbound network traffic from file server (seeder) for small job 32
4.5 Outbound network traffic from file server (seeder) for medium job 33
4.6 Outbound network traffic from file server (seeder) for large job 34
4.7 Breakdown of inbound traffic under BTFS for small job 35
4.8 Breakdown of inbound traffic under BTFS for medium job 35
4.9 Breakdown of inbound traffic under BTFS for large job 36
4.10 Load of each seeder 37
4.11 Data transfer time of multiple replicas 38
4.12 Speedup of multiple replicas 38
4.13 Read operation breakdown 40
4.14 Write operation breakdown 41

CHAPTER 1 INTRODUCTION

This chapter provides an introduction to distributed animation rendering and the problems addressed by this research (Section 1.1). The research objectives are defined in Section 1.2, followed by the scope of the research in Section 1.3. Section 1.4 describes the research methodology. Section 1.5 summarizes the key contributions of this research.

1.1 Introduction

Animation rendering is a process that transforms 3D models into hundreds of thousands of image frames to be composed into a movie. The rendering process is very compute intensive and time consuming. A single frame of an industrial-level animation can take several hours to render on a single machine. Animation rendering is therefore typically carried out on a set of high-performance computers where distributed rendering takes place: frames are distributed and rendered across many machines in a network to reduce the overall rendering time. Distributed rendering comes in many flavors such as render farms and volunteer-based rendering. In a render farm, machines are dedicated to the rendering task and all machines are tightly coupled with high-bandwidth interconnects. In volunteer-based rendering [1], machines are loosely connected by the public Internet and the machine owners provide the idle time of their computing resources for rendering. For example, Renderfarm.fi [2] is a large-scale, volunteer, loosely-coupled rendering service that distributes the rendering process over the Internet.

Originally, volunteer computing uses central servers for distributing data and computing to clients [3]. There is no notion of data exchange between clients. However, in animation rendering, the 3D models and related library files are large. These input files have to be transferred to the clients before the rendering can begin. The transfer time is significant due to the latency of the public Internet. Additionally, centralized servers can become overloaded when too many clients request the data, which slows down the rendering process. Since the same files may be used by several clients at almost the same time, there is a great opportunity to coordinate the file transfer among clients in a peer-to-peer (P2P) manner to reduce the data transfer time. With the P2P file sharing model, a client who has already downloaded the whole file or some parts of it from the central servers can share the file with other clients, so that they can directly download the file (or parts of it) from that client instead of from the central servers. As there are more copies of the files on different clients, less data needs to be downloaded from the central servers.

To allow rendering applications to transparently access the shared files over the P2P model without modifying the applications, it is necessary to implement the P2P file sharing service as a file system layer in the operating system. Thus, transferring the files from several peers across the network is invisible to the applications, and the shared files can be treated the same as files on local disks. Security is another important issue when the render data has to be shared among peers. Peers without the correct permission must not be able to access the data. In typical peer-to-peer file sharing this security issue is not the main concern, as opposed to animation rendering, where unintended data disclosure has to be minimized.

1.2 Research Objectives

1. To design and develop the BitTorrent file system (BTFS) for efficient data sharing among distributed compute nodes with transparent and secure access for applications.
2. To use the BitTorrent file system to improve the communication performance of distributed animation rendering.

1.3 Research Scope

1. The BitTorrent file system is developed for Linux-based systems.
2. Testing is carried out using open source rendering software.

3. The BitTorrent protocol is only used for reading input data for rendering from BTFS. The output image frames written back to BTFS are transferred to the central file server by other protocols.

1.4 Research Methodology

The distributed rendering environment that we focus on consists of a group of central servers and a large number of distributed render clients which may be located in local or wide area networks. Users submit rendering jobs to the servers, and a job scheduler dispatches the jobs to render on the clients. Each job specifies the rendering program with all necessary arguments, including the names of input and output files as well as the number of frames to render. All input files which are necessary for a client to render the jobs are initially stored on the servers. These input files must eventually be transferred to the client on which the job is executed. They can be transferred from either the servers or peers, depending on whichever is best. The output files are later transferred back from the clients to the servers directly.

In animation rendering, a job that is dispatched to run on different clients requires the same input data. So, when a client needs the input data for a job, the data may already exist on other clients in whole or in part. The client can coordinate with other clients to download the input data from them in a peer-to-peer (P2P) manner instead of from the central file servers. We apply the BitTorrent protocol as the means to disseminate the data among clients.

1.5 Research Contributions

1. This research designs and develops the BitTorrent file system to provide a secure, efficient and transparent data dissemination system based on the BitTorrent protocol.
2. This research applies the BitTorrent file system to improve the communication performance of volunteer-based distributed rendering.

CHAPTER 2 RESEARCH BACKGROUND AND RELATED WORK

This chapter provides necessary background information and gives the overview of relevant research work.

2.1 Distributed File System (DFS)

Distributed File System (DFS) is a technique of storing and accessing files on remote storage. Typically, a DFS uses one or more central servers to store files that can be accessed, with suitable authorization rights, by any number of remote clients in the network. Much like how an operating system organizes files into a hierarchical structure (directories), a DFS usually follows a similar organization, including the naming convention, when referring to remote files. When a client retrieves a file from the server, the file appears as a normal file on the client machine, and the users are able to work with the file in the same way as if it were stored locally. When the user finishes working with the file, the current version of the file can be returned over the network to the server, where it is stored for retrieval at a later time.

The DFSs are beneficial because they make it easier to access files from multiple clients and can provide a large and reliable storage system such that client machines do not need to use their local resources to store files. The DFS may also provide location transparency and redundancy to improve data availability in the case of failure or heavy load by allowing files to be replicated to multiple different locations. We give a brief explanation of some well-known DFSs as follows.

2.1.1 Network File System (NFS)

Network File System (NFS) [4] is a client/server application designed by Sun Microsystems in 1984. It allows a user on a client computer to transparently access files over a network in a manner similar to how local storage is accessed. The NFS provides access to shared files through an interface called the Virtual File System (VFS) running on top of the TCP/IP protocol. With NFS, computers connected to a network operate as clients while accessing remote files and as servers while providing remote users access to locally shared files. The NFS standards are publicly available and widely used. Like many other protocols, NFS is implemented on top of a Remote Procedure Call (RPC) package to help simplify protocol definition, implementation and maintenance. It uses an External Data Representation (XDR) specification to describe protocols in a machine- and system-independent way. Despite the fact that NFS is designed to be easily portable to other operating systems and machine architectures, it is mostly used in Unix environments. Another drawback of NFS is that it is not easy to scale.

2.1.2 Server Message Block (SMB) / Common Internet File System (CIFS)

Server Message Block (SMB) [5] is another protocol that allows a client computer to read and write files from/to a server computer in a network. The SMB protocol can be used over the Internet on top of the TCP/IP protocol or other network protocols (e.g., Internetwork Packet Exchange and NetBIOS / NetBEUI). Using the SMB protocol, a client application can read, create and update files on a remote server as well as use other resources (i.e., printers, mail slots and named pipes). The SMB protocol originated at IBM and has gone through a number of developments. A given client and server may implement different sets of protocol variations, which they negotiate before starting a session.

Microsoft has submitted a specification of SMB to the Internet Engineering Task Force (IETF), called the Common Internet File System (CIFS). The CIFS is a newer protocol that provides more features and is backward compatible with SMB. Microsoft Windows operating systems since Windows 95 include client and server SMB protocol support. For Unix systems, Samba [6], an open source implementation, is available. The SMB is often used to share files across Unix and Windows platforms over intranets and the Internet.

Since the SMB file system is mature, standardized and easy to set up, it has become one of the most frequently used distributed file systems in distributed rendering. Several users can collaborate on an animation project, while the client machines in render farms access the project's data through the SMB file system.

2.1.3 Andrew File System (AFS) / Coda

Andrew File System (AFS) [7] is the first distributed networked file system using a set of servers to present a homogeneous, location-transparent file name space to all client workstations. It was developed at Carnegie Mellon University as part of the Andrew project. Its primary use is in distributed computing, exploiting a persistent cache on clients which caches both file and directory data.

Coda [8] is an advanced networked file system developed at CMU since 1987. Coda descends from AFS version 2 and adds several significant features to the AFS offerings: disconnected operation with reintegration, server replication with resolution of diverging replicas and bandwidth adaptation. Coda replication is supported by a multi-RPC protocol that allows multiple requests to be dispatched and responses to be collated without serializing remote procedure calls.

2.1.4 MogileFS

MogileFS [9] is an open source distributed file system which emphasizes data archiving and deployment on commodity hardware. MogileFS is fault-tolerant (no single point of failure) by spreading data and metadata over different server nodes, where the automatic file replication level depends on the type of the file. MogileFS defines three different kinds of nodes (trackers, database and storage) of which multiple instances may exist in a given configuration. A tracker is responsible for handling client sessions and requests. A database node stores file system metadata and storage nodes store the actual data. MogileFS is designed to be an application-level utility in which it leaves the responsibility of file management ultimately up to the application. Although MogileFS can be accessed through a variety of client APIs and libraries, it requires application modification as it does not provide a POSIX or block device interface to clients.

2.1.5 Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS) [10], unlike other DFSs, is designed to be highly fault-tolerant and can be deployed on low-cost commodity hardware. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. Since it is very likely that one of these machines can fail, the detection of faults and quick automatic recovery from them is a core architectural goal of HDFS. HDFS keeps multiple replicas of file blocks in many machines. The number of replicas is defined by replication factor which is configurable per file. HDFS provides high throughput access to application data and is suitable for applications that have very large files. HDFS files are immutable, meaning that once created they cannot be changed later.

2.2 Peer-to-Peer File Sharing

Peer-to-peer (P2P) file sharing can be described as the distribution and sharing of digital documents and computer files using P2P technology. Users access the media files (e.g., books, music, movies and games) or exchange data with others over the Internet using a specialized P2P software program. This program searches and transfers the desired content from other computers on a P2P network. The architecture of P2P networks is classified into structured or unstructured architecture, both of which can be further subdivided into centralized (client-server) or decentralized ones. The hybrid architecture merges the advantages offered by different architecture types to boost the overall system performance and reliability.

2.2.1 Napster

Napster [11] is the first commercial P2P file sharing system. It makes use of a simple client-server or centralized architecture, where the server indexes the files its clients have. Napster was used to share music and audio files in MP3 format. The basic interaction is that a peer sends a search request to the central server, waits for a reply from it and downloads the searched file from the node to which the server points.

However, Napster suffered from the limitations of all centralized systems: its central server was shut down due to copyright infringement, which ended its era.

2.2.2 Gnutella

Gnutella [12] appeared after Napster and follows the concept of decentralized unstructured P2P systems. Gnutella has no centralized servers to index the files. The search for files is realized via basic message flooding. Nevertheless, this method of searching suffers from numerous drawbacks. For instance, the network is never stable because peers frequently disconnect. The network stops functioning satisfactorily as the bandwidth cost of the search increases exponentially with the number of users searched. When the network grows large enough, it becomes saturated and often suffers enormous delays. As a result, queries are often dropped and searches produce unsatisfactory results since only a minor portion of peers is searched.

2.2.3 Freenet

Freenet [13] employs a decentralized unstructured P2P architecture for a distributed information storage and retrieval system. It has no central servers and is designed to address many concerns on privacy and availability. Freenet operates as a location-independent distributed file system across many individual computers that allows files to be inserted, stored and requested anonymously. Nonetheless, Freenet has the problem of redundant data replication, and it prevents end users from controlling how data is allocated in Freenet's disk space. Additionally, this structure leads to slow and poor system performance.

2.2.4 OceanStore

OceanStore [14] is a decentralized P2P file sharing system designed to use a cooperative utility model in which consumers pay the service providers certain fees to ensure access to persistent storage. The service providers in turn use the utility model to form agreements and share resources. The fundamental unit in OceanStore is the persistent object. Each object is named by a globally unique identifier (GUID). Objects are replicated and stored on multiple servers. This replication provides availability in the presence of network partitions and durability against failure and attack. However, it also incurs the overhead of redundant data replication.

2.2.5 KaZaA / FastTrack

KaZaA [15] was one of the biggest P2P file sharing systems, built on a hybrid architecture between Napster and Gnutella. KaZaA has no central servers, but it promotes some peers to be more prominent than others. These peers are called supernodes in contrast to ordinary nodes. Each supernode is a powerful peer that collects a list of shared files from a number of ordinary nodes that connect to it. An ordinary node sends a request for files to its supernode, which in turn communicates with other supernodes to find the files. Then, the file is transferred directly between two ordinary nodes. KaZaA was shut down due to several infringement lawsuits.

2.2.6 BitTorrent

BitTorrent [16] is one of the most popular P2P file sharing protocols for distributing large amounts of data over the Internet. BitTorrent is rather different from all of the P2P file sharing systems considered above. First, its focus is fast and fair download of files; for example, peers that actively upload to other peers will probably download faster. However, the anonymity of users is sacrificed to a certain extent for this goal. Second, BitTorrent does not employ its own search algorithm, but utilizes external search facilities to find the downloadable files. The BitTorrent protocol is explained further in the next section.

2.3 BitTorrent Protocol

BitTorrent is primarily designed to reduce the download time for large and popular files and to lessen the load on the file servers. When a user begins downloading a file, the BitTorrent system will locate multiple computers with the same file and begin downloading the file from several computers in parallel. Since most Internet Service Providers (ISPs) offer much faster download speeds than upload speeds, downloading from multiple computers can significantly increase the file transfer rate. BitTorrent operation involves three main components, the .torrent file, the tracker and the peers, each of which is described below.

2.3.1 .torrent

A user who wants to distribute a file creates a small file called a .torrent that acts as the key to initiate the sharing of the file. The .torrent does not contain the content of the file; rather, it contains information about the file: its length, the hashing information for verifying integrity and the URL of the tracker(s). The user may also serve the file through a BitTorrent client, known as a seeder. The .torrent file must be distributed to other users by conventional means (web site, email, catalog service, etc.). When another user wants to download the file, he/she must obtain the corresponding .torrent file and then open it in a BitTorrent client to start exchanging the file with peers.

2.3.2 Tracker

The tracker plays a central role in peer communication in the BitTorrent system. It helps peers locate one another by keeping track of the seeders and downloaders of a particular file. Peers wishing to participate in the file sharing must first obtain the .torrent file to find out which tracker holds the tracking information and other important information about the file. Then, peers communicate with the tracker to get a list of peers currently connected to it who are participating in that file exchange. This list may not contain all the possible peers that are interested in the file, but only some randomly chosen peers; this is done in order to even out the load on the network. Once the list of peers is obtained, the peer exchange (download/upload) process begins. The tracker is not directly involved in any data transfer and does not have a copy of the file(s) for the torrents it tracks. Peers must report statistics to the tracker periodically and in exchange receive updated information about new peers to which they can connect or about peers which have left. A BitTorrent tracker is commonly implemented as an HTTP/HTTPS server.

2.3.3 Peer

A peer is an instance of a BitTorrent client running on a computer on the Internet which exchanges files with other clients. Each file is exchanged in pieces. A peer does not necessarily have all the pieces of a file; it may have only some of them. To exchange a file, a peer consults the .torrent to find the tracker, to which it reports what pieces it has, its IP address and the port on which the BitTorrent client application is running. Each peer polls the tracker for information about other peers with which to exchange pieces. Each piece can be downloaded concurrently from different peers. A peer who has completely downloaded the file becomes a seeder.

Peers continuously exchange pieces with other peers in the network until they have obtained the complete set of file pieces needed to reassemble the original file. The order in which pieces are requested from other peers in BitTorrent is optimized to improve their download rates. For initial downloaders (peers with no pieces) it is important that they obtain a complete piece as quickly as possible, so they use a random piece selection algorithm for selecting a piece to download. Once a peer has at least one complete piece, it changes its piece selection strategy, choosing to download pieces that are rare in the network in order to increase the number of replicas of those pieces.
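The following is a minimal, self-contained C sketch of this two-phase piece-selection policy (random first piece, then rarest first). It is illustrative only, not code taken from any BitTorrent client, and it assumes the per-piece availability counts have already been gathered from the peers' bitfields.

/* Minimal illustration of the piece-selection policy described above.
 * have[i]  : non-zero if we already hold piece i
 * avail[i] : number of connected peers advertising piece i
 * Returns the index of the next piece to request, or -1 if none is missing. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int select_piece(const int *have, const int *avail, int n_pieces, int n_have)
{
    if (n_have == 0) {
        /* Random-first phase: pick any missing piece so that a complete
         * piece (and hence something to upload) is obtained quickly. */
        int missing = n_pieces, k;
        for (int i = 0; i < n_pieces; i++)
            if (have[i]) missing--;
        if (missing == 0) return -1;
        k = rand() % missing;
        for (int i = 0; i < n_pieces; i++)
            if (!have[i] && k-- == 0) return i;
    }
    /* Rarest-first phase: request the missing piece with the fewest replicas,
     * which increases the replica count of rare pieces in the swarm. */
    int best = -1;
    for (int i = 0; i < n_pieces; i++)
        if (!have[i] && avail[i] > 0 && (best < 0 || avail[i] < avail[best]))
            best = i;
    return best;
}

int main(void)
{
    int have[]  = { 1, 0, 0, 1, 0 };
    int avail[] = { 3, 1, 4, 2, 2 };
    srand((unsigned)time(NULL));
    /* With two pieces already held, the rarest missing piece (index 1) is chosen. */
    printf("next piece: %d\n", select_piece(have, avail, 5, 2));
    return 0;
}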

2.4 Related Work

BitTorrent has been utilized in many applications to improve the performance of data transfer. For example, Kaplan et al. [17] proposed GridTorrent, which makes use of BitTorrent to efficiently distribute data for scientific applications in a Grid computing environment. Their implementation supports access control on the shared data similar to Unix file permissions. Around the same time, Zissimos et al. [18] independently created GridTorrent, which integrates BitTorrent with Globus Grid middleware components. They replaced the ".torrent" metainfo file with the Replica Location Service (RLS) in Globus to simplify the bootstrapping step of the native BitTorrent protocol.

For volunteer-based computing, Costa et al. [19] applied BitTorrent to optimize data distribution on BOINC [20], the open infrastructure for volunteer computing. They showed that BitTorrent can reduce the network load at servers significantly while having minimal impact on the computing time at clients. Wei et al. [21, 22] implemented BitTorrent in computational desktop grid platforms, including XtremWeb [23], to coordinate data distribution among users solving scientific problems. They showed that even though the BitTorrent protocol has more overhead than typical file transfer protocols, it can outperform them when distributing large files to a large number of nodes.

BURP [24] is a volunteer-based distributed rendering system built on the barebone BOINC middleware. The executable and input files are downloaded from centralized servers. Renderfarm.fi [2] is a volunteer-based distributed rendering service that uses BURP as its core component. According to the current information, Renderfarm.fi uses mirror servers to allow faster downloads to clients; the BitTorrent protocol has not yet been implemented to speed up the data distribution. vSwarm [25] is a community render farm based on the concept of volunteer computing. A unique feature of vSwarm is that the vSwarm client is a virtual machine that runs the rendering process so that all distributed rendering jobs are executed under the same environment. However, to the best of our knowledge, it uses centralized servers to distribute data to clients.

Although the aforementioned works are similar to ours, several points remain distinct. Firstly, other implementations use different file namespaces within their applications. Secondly, there is no cache management at the client side for data that are repeatedly used. Thirdly, access control to shared data is not implemented. Next, most of the implementations are not transparent to the applications running at the upper layer. Lastly, the evaluations were done with scientific or even synthetic applications. The comparison is summarized in Table 2.1.

Table 2.1 Comparison of BitTorrent-based data dissemination in distributed computing environments

Kaplan et al. [17]
  Bootstrapping and tracking: The .torrent is obtained from the catalogue service. The tracker is a variant of the BitTorrent tracker that includes additional features like access control.
  Namespace: Every shared file has a filename.
  Access control: Unix file permission (public, group and user level access).
  Client cache: Not specified.
  Application transparency: Not transparent.
  Application and testing platform: A synthetic scientific application on 3 nodes.

Zissimos et al. [18] (GridTorrent)
  Bootstrapping and tracking: Use the Replica Location Service (RLS) as the catalogue and tracker services.
  Namespace: Every shared file is referred to by a URL.
  Access control: Grid security infrastructure based on digital certificates.
  Client cache: Not specified.
  Application transparency: Not transparent.
  Application and testing platform: Unspecified applications on 18 PlanetLab nodes.

Costa et al. [19]
  Bootstrapping and tracking: The .torrent file is generated from a shared file and stored at the central servers along with the file. The native BitTorrent tracker is used.
  Namespace: Every shared file has a path and filename.
  Access control: Not specified.
  Client cache: Use a simple timer to delete files.
  Application transparency: Not transparent. The .torrent is used as the input file in the job description.
  Application and testing platform: A synthetic application on 312 nodes from the Grid'5000 testbed.

Wei et al. [21, 22]
  Bootstrapping and tracking: The catalogue service provides the necessary information for file transfer. The native BitTorrent tracker is used.
  Namespace: Flat file names identified by UUID.
  Access control: Not specified.
  Client cache: Not specified.
  Application transparency: Not transparent.
  Application and testing platform: A synthetic application on a 64-node heterogeneous cluster.

Our approach
  Bootstrapping and tracking: Use the file name to obtain the .torrent from the catalogue service. The native BitTorrent tracker is used.
  Namespace: Every shared file has a unique name in the Unix file namespace.
  Access control: Access Control List for create/read/write/delete access.
  Client cache: Local cache at clients with the Least Recently Used (LRU) cache replacement policy.
  Application transparency: Transparent since it is implemented in the file system layer.
  Application and testing platform: Animation rendering on 5 distributed sites.

CHAPTER 3 DESIGN AND IMPLEMENTATION

We design the BitTorrent File System (BTFS) to function at the file system layer in the Linux operating system and to provide applications with a scalable, fault-tolerant, distributed file system with transparent P2P data dissemination and a persistent local cache to improve the performance of data transfer in distributed rendering. The implementation of BTFS is based on File System in Userspace (FUSE) [26] as shown in Figure 3.1. The system consists of 4 main components: metadata server, seeder, tracker and BTFS client, which are described below.

Figure 3.1 BitTorrent file system architecture (render clients run the FUSE-based BTFS client with a local cache and a BitTorrent client; they obtain .torrents from the metadata server if permission is granted, update peer information with the tracker, and download files from the seeder/central file server or exchange them with other BitTorrent clients)

3.1 Metadata Server

The metadata server provides the information about the file and directory structure of the BTFS file namespace, such as file size or last modified time. All files have to be registered with the metadata server and unregistered when they are no longer needed. Registration is possible only if the path is not already occupied by another file in the same namespace. The metadata server is also responsible for enforcing file permissions and serving clients the torrent information of the requested files if permission is granted. Each file in BTFS is associated with an individual torrent. In accordance with the BitTorrent protocol [16], the torrent information includes the filename, file size, hash information, number of pieces, and seeder and tracker URLs, like any other .torrent used in BitTorrent P2P file sharing. In the current work, there is a single metadata server for each namespace.

We use Apache ZooKeeper 3.4.3 [27] to implement the metadata server. ZooKeeper is a distributed coordination service for distributed systems. It organizes data into a hierarchy of nodes similar to files and directories. The top-level directories in ZooKeeper consist of the /btfs and /config nodes. The /btfs node holds the root directory of a BTFS namespace; it is where the user files and directories are placed. The BTFS users will only see the files and directories under the /btfs node as normal POSIX files on their computers. The /config node stores the global system configuration for BTFS clients. We manage the client configuration through ZooKeeper so that configuration updates can be distributed to all clients easily.

Each file and directory in the BTFS namespace is associated with a descendant node under the /btfs node. We store the BTFS file (or directory) attributes in a ZooKeeper node's data. The essential BTFS attributes include the Universally Unique IDentifier (UUID), last modification time, file size and encryption key. These attributes will be mapped into the appropriate fields in the POSIX stat structure as described later. The UUID is used internally by all BTFS components as a unique ID to refer to the file (similar to what an inode number is). Figure 3.2 shows the ZooKeeper hierarchical structure and the corresponding BTFS file system. All files have different UUIDs. A torrent node in the ZooKeeper tree is created as a child of each user file node; it is used to store the torrent information of a file to be shared in BTFS.
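As an illustration of this layout, the following C sketch registers a file node and its torrent child node with the ZooKeeper C API. It is not the BTFS source code: the attribute encoding (a printf'd string), the UUID value and the use of the open ACL are simplifications, and the parent directories under /btfs are assumed to already exist.

#include <stdio.h>
#include <string.h>
#include <zookeeper/zookeeper.h>

int main(void)
{
    /* Connect to the metadata server (ZooKeeper, default port 2181). */
    zhandle_t *zh = zookeeper_init("192.168.1.1:2181", NULL, 30000, NULL, NULL, 0);
    if (!zh) return 1;

    /* Serialize the BTFS attributes (uuid, size, mtime); the string encoding
     * and the UUID value here are illustrative, not the real BTFS format. */
    char attrs[256];
    snprintf(attrs, sizeof(attrs),
             "uuid=HYPOTHETICAL-UUID;size=1048576;mtime=1370000000");

    /* Register the file node under /btfs (parents /btfs/Mary/model assumed to
     * exist); the call fails if the path is already occupied. The open ACL is
     * used for brevity; the real system attaches ACLs as in Section 3.7.1. */
    int rc = zoo_create(zh, "/btfs/Mary/model/a.blend", attrs, (int)strlen(attrs),
                        &ZOO_OPEN_ACL_UNSAFE, 0, NULL, 0);
    if (rc != ZOK)
        fprintf(stderr, "register failed: %s\n", zerror(rc));

    /* Store the torrent information in a child node of the file node. */
    const char *torrent = "(bencoded .torrent bytes)";
    rc = zoo_create(zh, "/btfs/Mary/model/a.blend/a.blend.torrent",
                    torrent, (int)strlen(torrent), &ZOO_OPEN_ACL_UNSAFE, 0, NULL, 0);
    if (rc != ZOK)
        fprintf(stderr, "torrent node failed: %s\n", zerror(rc));

    zookeeper_close(zh);
    return 0;
}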

Figure 3.2 Mapping from ZooKeeper to BTFS file system: (a) ZooKeeper tree structure; (b) BTFS file system (the user file /btfs/Mary/model/a.blend has a child node a.blend.torrent holding its torrent information)

3.2 Seeder

Every file in BTFS must be uploaded to the central file servers, called seeders, which are responsible for permanently serving files to BTFS clients. Initially, files are available only at the seeders. The first client accessing a file always connects to the seeders to retrieve it. Other clients may or may not need to get the file from the seeders. Seeders should be located in the public network so that any client can always reach them. In a more complex deployment, there can be multiple (and possibly remotely distributed) seeders to concurrently serve the files, or a single file can be replicated on multiple seeders for parallel download if necessary.

We use WEBDAV servers to implement the BTFS seeders. The WEBDAV protocol is used by the BTFS clients to manipulate files in the seeders, including uploading and downloading files. When the files are stored in the seeders, they are referenced by their UUIDs. Since all files are uniquely identified by their UUIDs, which are unlikely to collide, there is no need to maintain a directory structure in the seeders; all files are stored in the same directory on the seeders for simplicity. In the current implementation, we use Lighttpd 1.4.31 [28].

3.3 Tracker

According to the BitTorrent protocol, trackers are required to coordinate all BitTorrent peers to share files. This is also the case with BTFS clients, which run the protocol. We currently deploy opentracker [29]. In fact, any BitTorrent tracker, including public trackers, can be used.

3.4 BTFS Client

The BTFS client is a core component that glues the other components together. It runs on client machines where rendering software accesses the shared files. BTFS clients intercept all file system calls such as open() and read() from any applications which request access to files in the BTFS file system. For reading a file, BTFS clients contact the metadata server for the attributes and torrent information of the file, and create a BitTorrent client thread to download the file from seeders and peers. The downloaded file is temporarily cached in the local storage of the client and then passed to the requesting application. The file is kept as long as the temporary space is available; otherwise, the cache replacement algorithm is invoked to clear unused files. We use the Least Recently Used (LRU) policy for cache replacement. Future requests for the file will be redirected to the cached file, if available, to reduce network traffic. The cached file can be exchanged with other BTFS clients as in P2P file sharing as well.

When writing a new file, BTFS clients perform the writing operation locally. After the written file has been closed, the client generates a UUID for the file. Then, the file is uploaded to the seeder and the BTFS client registers the file with the metadata server to complete the operation. The POSIX attributes of the file are stored in the associated ZooKeeper node. Other basic operations such as deleting a file, listing files, and creating and removing a directory are summarized in Table 3.1. It should be noted that the current implementation does not yet support rewriting existing files.

The BTFS client is implemented as a user-space file system. It is built on top of several libraries: FUSE 2.8.6 [26], neon 0.29.6 [30], ZooKeeper C library 3.4.3 [31], SQLite 3.7.7 [32], OpenSSL 1.0.0 [33] and Rasterbar's libtorrent 0.16.0 [34].

Table 3.1 Supported file system operations
init: Read configuration files. Initialize the cache database. Establish the ZooKeeper connection.
getattr: Retrieve BTFS attribute data from the associated ZooKeeper node.
mknod: Create an associated ZooKeeper node.
open: Get torrent data from ZooKeeper. Download (via the BitTorrent protocol) the file into local storage and open it.
create: Create an associated ZooKeeper node. Create a local file for writing.
read: Read blocks from the local file.
write: Write blocks to the local file.
release: Close the local file. If the file is locally updated, upload it to the seeders via the WEBDAV protocol and update the BTFS attributes in the ZooKeeper node.
mkdir: Create an associated ZooKeeper node.
opendir: Check whether the associated ZooKeeper node exists.
readdir: Get the children list and their BTFS attributes from ZooKeeper nodes.
unlink: Remove the associated ZooKeeper node and its torrent child node.
rmdir: Remove the associated ZooKeeper node.
destroy: Close the cache database. Close the ZooKeeper connection.
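To make the operations in Table 3.1 concrete, the sketch below shows the shape of a FUSE 2.x client with two handlers filled in. This is an illustrative skeleton, not the BTFS implementation: btfs_lookup_attr() and btfs_fetch_to_cache() are hypothetical stubs standing in for the ZooKeeper query and the BitTorrent download into the local cache, and the stat defaults follow Table 3.2. It would be compiled with the flags reported by `pkg-config fuse --cflags --libs`.

#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

struct btfs_attr { off_t size; time_t mtime; };

/* Hypothetical stub: the real client queries the ZooKeeper metadata server. */
static int btfs_lookup_attr(const char *path, struct btfs_attr *a)
{
    (void)path; a->size = 0; a->mtime = 0; return 0;
}

/* Hypothetical stub: the real client downloads the file via BitTorrent into
 * the ~/.btfs disk cache (or finds it there) and returns the cached path. */
static int btfs_fetch_to_cache(const char *path, char *cache_path)
{
    (void)path; cache_path[0] = '\0'; return -1;
}

/* getattr: fill a POSIX stat from the BTFS attributes, defaults as in Table 3.2. */
static int btfs_getattr(const char *path, struct stat *st)
{
    struct btfs_attr a;
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0777;
        st->st_nlink = 1;
        return 0;
    }
    if (btfs_lookup_attr(path, &a) != 0)
        return -ENOENT;
    st->st_mode  = S_IFREG | 0777;   /* real access control is the ZooKeeper ACL */
    st->st_nlink = 1;
    st->st_size  = a.size;
    st->st_atime = st->st_mtime = st->st_ctime = a.mtime;
    return 0;
}

/* open: make sure the file is in the local disk cache, then open the cached copy. */
static int btfs_open(const char *path, struct fuse_file_info *fi)
{
    char cache_path[4096];
    if (btfs_fetch_to_cache(path, cache_path) != 0)
        return -EIO;
    int fd = open(cache_path, O_RDONLY);
    if (fd < 0) return -errno;
    fi->fh = fd;
    return 0;
}

static struct fuse_operations btfs_ops = {
    .getattr = btfs_getattr,
    .open    = btfs_open,
    /* read, write, release, mkdir, ... would follow Table 3.1 */
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &btfs_ops, NULL);
}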

3.5 Mapping to POSIX Semantics

The POSIX attributes of files and directories in the BTFS file system are partially obtained from the BTFS attributes which are stored in the associated ZooKeeper node's data. However, some POSIX attributes are mismatched or not efficient to store in the BTFS attributes, so they are assigned default values. The list of BTFS attributes, the mapping from BTFS attributes to the POSIX stat structure and the default values are shown in Table 3.2. Note that the permission mode is set to the default value 777, but this does not mean that access is given to everyone. The exact permission is controlled by ZooKeeper's security model, which cannot be aligned with the POSIX permission mode, as explained later.

Table 3.2 Mapping from BTFS to POSIX semantics
size -> st_size: File size in bytes.
mtime -> st_atime, st_mtime, st_ctime: Only the last modification time is maintained; it is copied to the last access and last status change times.
uuid -> not applicable: Universally unique identifier of each file.
iv -> not applicable: Initialization vector.
key -> not applicable: Encryption/decryption key.
st_dev (default 0): ID of device.
st_ino (default 0): Inode number.
st_mode (default 777): Permission mode.
st_nlink (default 1): Number of hard links.
st_uid, st_gid (default 0, 0): Owner and group ID.
st_rdev (default 0): Device ID for special files.
st_blksize (default 0): Preferred block size for efficient I/O.
st_blocks (default 0): Number of 512-byte blocks.

3.6 Consistency Model

To improve performance, BTFS employs a weak consistency model which is simple but works efficiently in distributed rendering. The attributes of the files obtained from the metadata server are maintained in a memory cache until they expire, so reading a file is not guaranteed to return the latest version of the file due to stale attributes in the cache. If a file is updated, the changes are not propagated to the metadata server and seeders until the file has been closed. Multiple writers do not corrupt the integrity of the file; the last writer wins. The following describes the implementation of our consistency model in detail.

3.6.1 Attribute and File Cache Management

The result of each getattr() system call is cached in memory (the memory cache) for a specific time (60 seconds in the current implementation). This can significantly reduce the time and network traffic needed to access the metadata server. However, this trades off consistency if the file system is under heavy updates.

All files that have been downloaded are cached in local storage (the disk cache). When a user tries to read a file, the open() system call checks whether the file exists in the cache. If this is the case, the cached file is opened. When the cached file is accessed, we update data in the cache database, which includes the last access time and the frequency of access. A cache manager thread implements the Least Recently Used (LRU) cache policy to manage the cache space. When the cache is full (the total size of files in the cache exceeds the cache size), the oldest files are removed from the cache first. Note that the cache size is a soft limit, as the cache manager monitors and removes files from the cache on a periodic basis. The cache location is under the ~/.btfs directory of each user on a client machine. It is possible for the same user to mount multiple BTFS file systems, given that the ~/.btfs directory complies with POSIX semantics (e.g., it is not on NFS).
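The following C sketch illustrates this periodic LRU sweep using the SQLite 3 C API that the client builds on. The cache database schema (a table cache(path, size, atime)) and the 10 GB soft limit are assumptions made only for the example; the actual BTFS schema is not described in this report.

#include <stdio.h>
#include <unistd.h>
#include <sqlite3.h>

/* Illustrative 10 GB soft limit; BTFS reads the real value from its configuration. */
#define CACHE_SOFT_LIMIT (10LL * 1024 * 1024 * 1024)

static long long cache_total_size(sqlite3 *db)
{
    sqlite3_stmt *st = NULL;
    long long total = 0;
    if (sqlite3_prepare_v2(db, "SELECT COALESCE(SUM(size),0) FROM cache",
                           -1, &st, NULL) == SQLITE_OK
        && sqlite3_step(st) == SQLITE_ROW)
        total = sqlite3_column_int64(st, 0);
    sqlite3_finalize(st);
    return total;
}

/* Evict least-recently-used cached files until the total drops below the soft limit. */
static void lru_sweep(sqlite3 *db)
{
    while (cache_total_size(db) > CACHE_SOFT_LIMIT) {
        sqlite3_stmt *st = NULL, *del = NULL;
        if (sqlite3_prepare_v2(db, "SELECT path FROM cache ORDER BY atime ASC LIMIT 1",
                               -1, &st, NULL) != SQLITE_OK)
            return;
        if (sqlite3_step(st) != SQLITE_ROW) { sqlite3_finalize(st); return; }
        const char *victim = (const char *)sqlite3_column_text(st, 0);
        if (!victim) { sqlite3_finalize(st); return; }

        unlink(victim);                         /* remove the cached file itself */
        if (sqlite3_prepare_v2(db, "DELETE FROM cache WHERE path = ?1",
                               -1, &del, NULL) == SQLITE_OK) {
            sqlite3_bind_text(del, 1, victim, -1, SQLITE_TRANSIENT);
            sqlite3_step(del);                  /* ... and its bookkeeping row */
            sqlite3_finalize(del);
        }
        sqlite3_finalize(st);
    }
}

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("btfs-cache.db", &db) != SQLITE_OK) return 1;
    lru_sweep(db);
    sqlite3_close(db);
    return 0;
}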

3.6.2 Write Back Policy

We deploy a write-back policy for files that are locally updated. The files are sent back to the metadata server and seeders after they have been closed, thus resulting in weak consistency. It is possible for other BTFS clients to read the old version of the file if they access it during the write-back period, but the file will never be corrupted since the updated file is assigned a new UUID. Both the old file and the new file coexist in the seeders until the next garbage collection, which removes the old file from the seeders.

3.7 Security

Since BTFS is intended to be used in a wide area network, it needs to guard against many cyber attacks such as unauthorized access and data corruption. We have designed the security mechanism of BTFS as follows.

3.7.1 Authentication, Authorization and Access Control

BTFS requires all users to log in before accessing the file system metadata on ZooKeeper. Each user sends a username and password (which may be different from the username and password used to log in to his/her machine) for authentication with ZooKeeper. If the username and password are valid, the user is authenticated. Currently, we have developed a file-based authentication plug-in for ZooKeeper. The password file is stored on the same machine as ZooKeeper and is simply a list of the valid usernames and passwords of all users. In BTFS, a user can be a member of any group. We also maintain the group memberships in a group file stored on the ZooKeeper machine. In future work, we plan to use an LDAP server to store password and group information.

Once the user has been authenticated, he/she must have authorization to access files or directories in the BTFS file system. We implement BTFS authorization using ZooKeeper's Access Control Lists (ACLs). An ACL is associated with a file or a directory in BTFS. It consists of a list of user or group permissions. The authenticated user is inspected against the ACL to check whether access is granted. At least one user or group permission is required to gain access. Since ZooKeeper's ACLs are much finer-grained than POSIX permission modes, we simply map any ACL to the 777 POSIX permission mode with owner and group set to root (uid and gid = 0) for simplicity. However, this does not override the access control enforced by the BTFS file system. When a new file or directory is created, its ACL inherits the ACL of the parent directory by default.
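As an illustration, the C sketch below attaches a per-user ACL to a new node using the ZooKeeper C API's digest scheme, reusing a connection handle such as the one opened in the earlier sketch. The digest id string would normally encode username:base64(SHA1(username:password)); a placeholder value is shown here, and the real BTFS plug-in and group handling are more involved.

#include <string.h>
#include <zookeeper/zookeeper.h>

/* Create a BTFS metadata node whose ACL grants access only to user "mary"
 * under ZooKeeper's digest scheme. The digest id is a placeholder, not a
 * real credential; zh is an already-open handle (see the earlier sketch). */
int create_private_node(zhandle_t *zh, const char *path)
{
    /* Present this client's credentials (username:password). */
    zoo_add_auth(zh, "digest", "mary:secret", strlen("mary:secret"), NULL, NULL);

    /* One ACL entry: mary may read, write, create and delete; nobody else. */
    struct Id mary = { "digest", "mary:PLACEHOLDER_BASE64_SHA1_DIGEST" };
    struct ACL entries[1];
    entries[0].perms = ZOO_PERM_READ | ZOO_PERM_WRITE | ZOO_PERM_CREATE | ZOO_PERM_DELETE;
    entries[0].id = mary;
    struct ACL_vector acl = { 1, entries };

    return zoo_create(zh, path, "", 0, &acl, 0, NULL, 0);
}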

Another essential security setting concerns the seeders: since they are open for public access, it is necessary to protect them from abuse as well. In particular, normal users should be allowed only the GET and PUT commands. Other commands such as DELETE or PROPFIND (for directory listing) must be strictly prohibited; otherwise, malicious users could mount denial-of-service attacks by removing seed files.

The messages between BTFS clients and the metadata server must not be sent in plaintext, since critical information such as usernames and passwords could easily be exposed to eavesdroppers. Unfortunately, the ZooKeeper C client API does not support encrypted connections in the current version (SSL encryption is only supported in the Java APIs since 3.4.0). Although we believe that such support will be available in future releases of ZooKeeper, a workaround has to be employed for the moment. So, we wrap the connection between ZooKeeper and the BTFS clients with SSL encryption by using stunnel [35], which can protect secret information and resist replay attacks.

3.7.2 Data Integrity

Every file in the BTFS file system is accompanied by a .torrent which contains the 20-byte SHA1 message digest (hash) of the file's content. When a BTFS client wants to read the file, it must first obtain the valid .torrent of the file from the metadata server. Thus, the BTFS client can ensure data integrity by validating the downloaded file against the message digest.
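The verification step can be sketched as follows. Note that in the actual BitTorrent protocol the .torrent holds one SHA1 digest per piece; the sketch below simplifies this to a single whole-file digest to show the check itself.

#include <string.h>
#include <openssl/sha.h>

/* Verify the integrity of downloaded content against the SHA1 digest taken
 * from the .torrent. Real BitTorrent keeps one digest per piece; hashing the
 * whole buffer here is a simplification. Returns 1 if intact. Link with -lcrypto. */
int btfs_verify(const unsigned char *data, size_t len,
                const unsigned char expected[SHA_DIGEST_LENGTH])
{
    unsigned char digest[SHA_DIGEST_LENGTH];
    SHA1(data, len, digest);
    return memcmp(digest, expected, SHA_DIGEST_LENGTH) == 0;
}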

3.7.3 Confidentiality

Since the BitTorrent protocol does not enforce the confidentiality of the data, we protect the files by using encryption. The files are always stored as ciphertext in the seeders and in the local cache on each BTFS client. When a BTFS client creates a file, it also randomly generates the Initialization Vector (IV) and the key for encrypting/decrypting the file and stores them in the BTFS attributes at the metadata server (the IV and key could be stored in the comment of the torrent node to reduce the overhead of attribute queries). Data written into the file are encrypted on-the-fly with 128-bit AES in CTR mode. The encrypted file is uploaded to the seeders and shared with other BTFS clients when closed, following the write-back policy. For another BTFS client to read the file, it must know the IV and the key, which can only be obtained from the metadata server if it is authorized.

The BTFS client decrypts the file while reading. The encryption/decryption is done in the FUSE file system layer and is thus transparent to the applications.
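A sketch of such an encryption routine with the OpenSSL EVP interface is shown below. It is illustrative only; note that EVP_aes_128_ctr() requires OpenSSL 1.0.1 or later, slightly newer than the 1.0.0 release cited above, and because CTR mode is a stream cipher the same routine with the corresponding Decrypt calls performs decryption.

#include <stdio.h>
#include <openssl/evp.h>

/* Encrypt in_len bytes with AES-128 in CTR mode via the OpenSSL EVP interface.
 * CTR produces ciphertext of the same length as the input and needs no
 * padding/Final step. Link with -lcrypto. */
static int btfs_encrypt_ctr(const unsigned char key[16], const unsigned char iv[16],
                            const unsigned char *in, int in_len, unsigned char *out)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len = 0, ok = 0;
    if (ctx
        && EVP_EncryptInit_ex(ctx, EVP_aes_128_ctr(), NULL, key, iv) == 1
        && EVP_EncryptUpdate(ctx, out, &len, in, in_len) == 1)
        ok = 1;
    EVP_CIPHER_CTX_free(ctx);
    return ok ? len : -1;
}

int main(void)
{
    unsigned char key[16] = {0}, iv[16] = {0};   /* demo-only all-zero key/IV */
    unsigned char in[] = "render data", out[sizeof(in)];
    int n = btfs_encrypt_ctr(key, iv, in, (int)sizeof(in), out);
    printf("encrypted %d bytes\n", n);
    return 0;
}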

3.8 Load Balancing and Fault Tolerance

The BTFS file system has been designed to tolerate failures; there is no single point of failure in our implementation. The metadata server implemented by ZooKeeper can be replicated to many nodes in a master/slave configuration. Data stored in multiple ZooKeeper servers are kept consistent by a quorum-based protocol. In the presence of some node failures, the operation of the metadata servers continues as long as the majority of servers agree on the data. BTFS clients can connect to any one of the metadata servers, so the load is distributed across multiple metadata servers. For writes, all operations are forwarded to the master server. Since in animation rendering data are read far more often than written, update operations do not place too much load on the master server.

When a BTFS client creates a file, it randomly chooses one of the seeders for uploading the file, which tends to balance the load across multiple seeders. BTFS seeders can also fail. However, they are implemented using WEBDAV servers which work independently of each other. We can set up multiple seeders and partition the BTFS file system across them. If some seeders fail, only part of the file system is affected; the remaining parts are intact. It should be emphasized that files on the failed seeders are completely lost only if no BTFS client still has the files cached. We can increase the level of fault tolerance by allowing a particular file to be replicated on multiple seeders. Each BTFS client randomly selects the seeders to hold the replicas. The number of replicas is defined by the replication parameter in the configuration file, which is described later.

To prevent a single tracker failure, multiple trackers have already been proposed in the BitTorrent protocol specification extension [36] to alleviate the problem. With this extension, if some trackers fail, the clients randomly try to connect to one of the trackers that is still functioning. However, it is possible that clients sharing the same file connect to different trackers, which causes each client to see only a partial peer list, so some of them will not cooperate in the file transfer as expected. To address this problem, the libtorrent library provides a variant extension that sends the announce message to all trackers at the expense of additional network traffic. Our BTFS clients can be configured to support either the native extension or the libtorrent variant.

3.9 Global Configuration Management

The global configuration of the BTFS file system is stored in the /config/.btfsrc.xml node in the metadata server. Each client reads the content of this node during the initialization phase. The configuration is written in an XML format, which is exemplified in Figure 3.3. The configuration is self-descriptive; for instance, the seeder sections define information about each seeder such as its IP address and port, and the replication section defines the replication parameter. Setting the replication parameter to more than one allows multiple replicas to be stored on different seeders. The BTFS clients will upload a file to multiple seeders according to this parameter.

Figure 3.3 Global configuration file
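Because the XML mark-up of Figure 3.3 was not preserved in this copy, the following is a reconstruction with hypothetical element names; only the host addresses, ports, repository path, credentials, replication value and tracker announce URLs are taken from the original figure.

<!-- Hypothetical reconstruction of Figure 3.3: element names are illustrative;
     the values are those shown in the original figure. -->
<btfs>
  <replication>1</replication>
  <seeder>
    <host>203.185.96.47</host>
    <port>8080</port>
    <path>/repository/default</path>
    <user>admin</user>
    <password>admin</password>
  </seeder>
  <seeder>
    <host>203.185.96.48</host>
    <port>8080</port>
    <path>/repository/default</path>
    <user>admin</user>
    <password>admin</password>
  </seeder>
  <tracker>http://203.185.96.47:6969/announce</tracker>
  <tracker>http://203.185.96.48:6969/announce</tracker>
</btfs>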

3.10 Garbage Collection

When a file is updated or removed from the BTFS file system, the associated ZooKeeper node is also updated or removed, respectively. However, the files in the WEBDAV seeders remain intact. To reclaim the space on the seeders, a separate process running on the WEBDAV seeders performs garbage collection. The process runs periodically to collect all current UUIDs from ZooKeeper and compares them with all files in the seeders. For a file without a matching UUID, if its last access time is older than a predefined threshold, the process removes it.
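A simplified C sketch of such a collector is given below. It assumes the seeder stores every file flat under the repository directory named by its UUID, and the ZooKeeper lookup is reduced to a hypothetical stub uuid_is_registered() that always answers yes, so this sketch deletes nothing; the real collector must walk the ZooKeeper tree to build the set of live UUIDs.

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define REPO_DIR   "/repository/default"     /* flat UUID-named repository (Figure 3.3) */
#define GRACE_SECS (24 * 3600)               /* assumed grace period: one day */

/* Hypothetical stub: always claims the UUID is still registered, so this sketch
 * deletes nothing. The real collector queries ZooKeeper for the live UUID set. */
static int uuid_is_registered(const char *uuid) { (void)uuid; return 1; }

static void btfs_gc(void)
{
    DIR *d = opendir(REPO_DIR);
    if (!d) return;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.') continue;           /* skip . and .. */
        char path[4096];
        snprintf(path, sizeof(path), "%s/%s", REPO_DIR, e->d_name);

        struct stat st;
        if (stat(path, &st) != 0) continue;

        /* Remove only files that are no longer registered AND have not been
         * accessed within the grace period, as described above. */
        if (!uuid_is_registered(e->d_name) && time(NULL) - st.st_atime > GRACE_SECS)
            unlink(path);
    }
    closedir(d);
}

int main(void) { btfs_gc(); return 0; }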

3.11 Operation

BTFS is implemented as a file system in user space. Each user can mount the BTFS file system on a mount point in the local directory tree for which he/she has permission. The command to mount the BTFS file system is as follows.

$btfsmount 192.168.1.1 /home/user1/btfs

The first argument is the IP address of the metadata Server. The second argument is the local mount point. Once the BTFS file system is mounted, the user can access the files and directories as if they were files or directories in the local storage.

CHAPTER 4 EVALUATION AND EXPERIMENTS

This chapter describes the experimental setup and the preliminary results when using BTFS in distributed rendering.

4.1 Testbed System Configuration

We have carried out the experiments on a testbed system working as a distributed render farm. The testbed consists of a set of servers located at a central site and multiple clients from different remote sites to simulate volunteer-based distributed rendering in which users donate their desktop or notebook computers for rendering the animation of a specific project.

We set up the testbed on 5 remote sites, i.e., NECTEC, INET, CAT, UTCC and CSLOXINFO, as illustrated in Figure 4.1, in which NECTEC is chosen as the central site to place all the servers. We allocate 7 clients from the remaining sites. For the hardware specification, all client machines have 4 CPU cores and 4 GB RAM with at least 50 GB of free hard disk space. Their CPU speeds vary slightly between 2.2 and 2.6 GHz. They connect to the Internet with public IP addresses, bypassing any enterprise firewalls. The network bandwidth between sites varies slightly over time around 100 Mb/s bidirectional. However, the egress bandwidth from the central site to the remote sites is throttled to 10 Mb/s for the experiments. All machines run Linux CentOS 6.2.

Figure 4.1 Testbed system (the central site NECTEC is connected over the Internet to render clients INET1, INET2, UTCC, CAT1, CAT2, CSLOXINFO1 and CSLOXINFO2)

4.2 Render Data and Software

The data used in the experiments are from the Big Buck Bunny project [37], an open animation movie initiated by the Blender software development team. The entire animation has a running time of 10 minutes and is generated from more than 400 files totaling 1.2 GB of data. The project has publicly released all 3D model, image and texture files that were used during its production. The animation is composed of 13 scenes separated into top-level folders under the project directory. Each scene may further be broken into sub-scenes, each of which is stored as a .blend file.

When rendering a .blend file, other files which are referenced from the current file have to exist. Different scenes have distinct computational requirements. Some scenes can finish rendering in a few minutes and use only a small amount of memory, while others may take an hour and require much more memory. Thus, we select scenes that represent different computational requirements as small, medium and large jobs, as shown in Table 4.1. The input size is the total size of the .blend file and all referenced files required for rendering. Note that the rendering time is measured with the files on local disk. In fact, several scenes in the Big Buck Bunny project require more memory than what we have (4 GB) in the testbed, in which case they cause the rendering to fail. Such overly large scenes, such as 01_intro/01.blend, are excluded from our consideration.

Table 4.1 Characteristics of the testing data
Small job: scene 12_peach/03.blend, 28 frames, 650 MB memory, average 1:30 min. per frame, 90 MB input size
Medium job: scene 01_intro/02.blend, 93 frames, 2,500 MB memory, average 4:50 min. per frame, 40 MB input size
Large job: scene 02_rabbit/02.blend, 91 frames, 3,500 MB memory, average 9:06 min. per frame, 290 MB input size

In distributed rendering, different frames of a job are rendered concurrently on different client machines to reduce the overall rendering time, so a job scheduler must manage which frame is assigned to which machine. We deploy DrQueue 0.63.4 [38], a job scheduler for distributed render farms, on the testbed. The DrQueue master process is installed on the server at the NECTEC site; the DrQueue slave process and Blender 2.49 [39], the open source rendering software, are installed on all clients.
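
The sketch below is not DrQueue's actual scheduling code; it is a simplified illustration, under the assumption of static partitioning, of how a frame range can be split into per-node tasks. DrQueue itself typically dispatches frames to slaves as they become free, so the real assignment is finer-grained than this.

from dataclasses import dataclass

@dataclass
class RenderTask:
    node: str
    blend_file: str
    start_frame: int
    end_frame: int

def partition_frames(blend_file, start, end, nodes):
    """Assign contiguous, roughly equal frame chunks to each render node."""
    total = end - start + 1
    chunk = -(-total // len(nodes))            # ceiling division
    tasks = []
    for i, node in enumerate(nodes):
        s = start + i * chunk
        if s > end:
            break
        tasks.append(RenderTask(node, blend_file, s, min(s + chunk - 1, end)))
    return tasks

# Example: the medium job (01_intro/02.blend, 93 frames) over the 7 testbed clients.
clients = ["UTCC", "INET1", "INET2", "CSLOXINFO1", "CSLOXINFO2", "CAT1", "CAT2"]
for task in partition_frames("01_intro/02.blend", 1, 93, clients):
    print(task)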

Basically, to submit a job, users must configure the path of the .blend file, the start and end frames, as well as other render parameters. Figure 4.2 shows the DrQueue GUI for monitoring job status: jobs 0-2 have finished, job 3 was cancelled by the user after 7 frames had been rendered, and job 4 is currently running. The status of the render clients is illustrated in Figure 4.3, in which computers 0-6 are each rendering a frame from a job.

Figure 4.2 Job status

Figure 4.3 Node status

For a client to execute a job, the input files (i.e., the .blend file and all referenced files) must always be accessible from that client. Typically, an instance of a DFS is used to give the clients access to all the required files.

4.3 Performance Comparison of BTFS and SMB File System

Since the SMB file system is a widely used DFS in distributed rendering, we compare the performance of BTFS against it. We set up a Samba server version 3.5.10 [6] at the NECTEC site to hold the entire project data. For a client to access files from Samba, there are two options: FuseSMB [40] and Linux CIFS [41]. Both are SMB/CIFS clients, but the former is a user-space client that mounts the SMB file system through Fuse, whereas the latter is the kernel-based client that requires root permission to mount.

We submit jobs to the testbed with the selected scenes mentioned in Section 4.2. However, very large jobs are not included, since there are not enough computing resources to render them. All jobs are submitted at 25% of full HD resolution (480x270 pixels). In all cases, a single file server (or seeder, in the case of BTFS) is used. We then measure the amount of data transferred from the file server over the time the job runs on the testbed.
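
The measurement itself can be done with standard operating system counters. The sketch below is not the tool used in the experiments; it merely shows one way to sample the transmitted-bytes counter of a network interface from /proc/net/dev on Linux and report the outbound rate in KB/s. The interface name and sampling interval are assumptions.

import time

def tx_bytes(interface="eth0"):
    """Return the cumulative transmitted-bytes counter of one interface."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(interface + ":"):
                fields = line.split(":")[1].split()
                return int(fields[8])          # 9th field is TX bytes
    raise ValueError("interface %s not found" % interface)

def sample_rate(interface="eth0", interval=5):
    """Yield the outbound rate in KB/s, averaged over each interval."""
    prev = tx_bytes(interface)
    while True:
        time.sleep(interval)
        cur = tx_bytes(interface)
        yield (cur - prev) / 1024.0 / interval
        prev = cur

# for rate in sample_rate():
#     print("%.1f KB/s" % rate)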

4.3.1 Small Job

For the small job, the total rendering time varies with the file system used: the job takes approximately 840, 1,260 and 3,660 seconds to finish under BTFS, Linux CIFS and Fuse-SMB, respectively. Figure 4.4 also shows the data transfer rate (in KB/s) from the file server. Fuse-SMB clearly performs the worst, since data are transferred from the file server all the time. Linux CIFS is much better than Fuse-SMB: its data transfer happens only during the early period of rendering (0-800 s) and then drops sharply until the job finishes. This is due to their internal differences even though both implement the same SMB file system; most of the improvement of Linux CIFS comes from the pagecache, which automatically caches files in free memory. BTFS has the least data transfer from the file server, since BTFS clients can share data with each other in the P2P manner.


Figure 4.4 Outbound network traffic from file server (seeder) for small job (data transfer in KB/s vs. time in s; series: BTFS, Fuse-SMB, CIFS)

4.3.2 Medium Job

For the medium job, rendering takes approximately 4,320, 4,500 and 6,000 seconds to finish under BTFS, Linux CIFS and Fuse-SMB, respectively. As with the small job, Fuse-SMB performs the worst in both rendering time and the amount of network traffic from the file server, as shown in Figure 4.5, where the traffic stays high the whole time. While both BTFS and Linux CIFS transfer data from the file server only at the beginning, BTFS generates about half the traffic and has a slightly shorter rendering time than Linux CIFS.


Figure 4.5 Outbound network traffic from file server (seeder) for medium job (data transfer in KB/s vs. time in s; series: BTFS, Fuse-SMB, CIFS)

4.3.3 Large Job

In this case, the large job takes approximately 7,680, 18,960 and 28,500 seconds to finish under BTFS, Linux CIFS and Fuse-SMB, respectively. Figure 4.6 shows the traffic load on the file server under the different file systems. Fuse-SMB again performs the worst, for the same reason as in the small and medium jobs. Interestingly, Linux CIFS takes more than twice the rendering time of BTFS and shows continuous network activity throughout the run. Since the large job consumes the entire memory of the machine during rendering, no free memory is left for the pagecache, so Linux CIFS must reload files from the file server every time. In contrast, BTFS stores files in persistent storage (disk) and is therefore unaffected by the loss of the pagecache; the traffic load on the file server is greatly reduced.


Figure 4.6 Outbound network traffic from file server (seeder) for large job (data transfer in KB/s vs. time in s; series: BTFS, Fuse-SMB, CIFS)

4.4 Peer Contribution under BTFS File System

Figures 4.7, 4.8 and 4.9 show the breakdown of inbound traffic at each render node for the small, medium and large jobs, respectively. Each render node downloads data from the seeder as well as exchanging data with other peers. Although different fractions of data are downloaded from each peer, the cumulative contribution of all peers is the major source of inbound traffic for all jobs; for instance, approximately 80% of UTCC's data are downloaded from peers. We also observe that nodes located in the same site are likely to share more data with each other. In Figure 4.7, INET1 is the largest contributor of data to INET2, which is, in turn, the second largest contributor to INET1. As another example, a large amount of data is exchanged between CAT1 and CAT2 in Figures 4.8 and 4.9.
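
The breakdown shown in these figures can be reproduced from per-connection accounting on each render node. The sketch below assumes a hypothetical log of (destination node, source, bytes) records; the numbers are illustrative placeholders, not measured values.

from collections import defaultdict

# (destination render node, source peer or seeder, megabytes received) -- illustrative only
transfers = [
    ("UTCC", "BTFS Seeder", 20), ("UTCC", "INET1", 45), ("UTCC", "CAT1", 35),
    ("INET2", "INET1", 60), ("INET2", "BTFS Seeder", 25),
]

def inbound_breakdown(records):
    """Return {node: {source: fraction of that node's inbound data}}."""
    totals = defaultdict(float)
    by_source = defaultdict(lambda: defaultdict(float))
    for node, source, size in records:
        totals[node] += size
        by_source[node][source] += size
    return {node: {src: size / totals[node] for src, size in sources.items()}
            for node, sources in by_source.items()}

for node, sources in inbound_breakdown(transfers).items():
    print(node, {src: "%.0f%%" % (frac * 100) for src, frac in sources.items()})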


Figure 4.7 Breakdown of inbound traffic under BTFS for small job (fraction of data transfer at each render node, by source: BTFS Seeder, UTCC, INET1, INET2, CSLOXINFO1, CSLOXINFO2, CAT1, CAT2)

Figure 4.8 Breakdown of inbound traffic under BTFS for medium job (fraction of data transfer at each render node, by source: BTFS Seeder, UTCC, INET1, INET2, CSLOXINFO1, CSLOXINFO2, CAT1, CAT2)


Figure 4.9 Breakdown of inbound traffic under BTFS for large job (fraction of data transfer at each render node, by source: BTFS Seeder, UTCC, INET1, INET2, CSLOXINFO1, CSLOXINFO2, CAT1, CAT2)

4.5 Load Balance of BTFS with Multiple Seeders

In this experiment, we set up a cluster of 4 seeders with identical configurations to share the load; the aggregate performance of all seeders is 4 MB/s. We set the number of replicas to 1 (via the corresponding section in the global configuration file), which causes a single copy of each file to be uploaded to a randomly chosen seeder. Then, the entire Big Buck Bunny project is put into the BTFS file system. All 7 clients run a program that continuously reads a random distinct file from the project. Peer-to-peer data exchange is turned off so that data are downloaded solely from the seeders. Figure 4.10 depicts the load of each seeder over time. The load varies and is spread across all seeders, although it is not perfectly balanced, which keeps the aggregate bandwidth below 4 MB/s. Ideally, the load could be equal if the distribution of files were based on file size rather than randomness. In fact, each seeder holds roughly 100 files but has quite different disk usage, as shown in Table 4.2; the load on seeders 1 and 3 is higher than on the others because most of the project data is placed on them.
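
The size-based placement suggested above can be approximated by a simple greedy policy: upload the largest files first, each to the currently least-loaded seeder. The sketch below compares this with random placement; the file list is synthetic and the function names are illustrative, not part of BTFS.

import heapq
import random

def place_random(files, n_seeders):
    """Random placement, as used in the experiment (one replica per file)."""
    usage = [0] * n_seeders
    for _, size in files:
        usage[random.randrange(n_seeders)] += size
    return usage

def place_by_size(files, n_seeders):
    """Greedy size-based placement: largest file onto the least-loaded seeder."""
    heap = [(0, i) for i in range(n_seeders)]      # (current usage, seeder id)
    heapq.heapify(heap)
    usage = [0] * n_seeders
    for _, size in sorted(files, key=lambda f: f[1], reverse=True):
        load, i = heapq.heappop(heap)
        usage[i] = load + size
        heapq.heappush(heap, (usage[i], i))
    return usage

files = [("file%d" % i, random.randint(1, 30)) for i in range(435)]   # ~435 synthetic files
print("random placement    :", place_random(files, 4))
print("size-based placement:", place_by_size(files, 4))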

Figure 4.10 Load of each seeder (data transfer in KB/s vs. time in s, stacked for Seeders 1-4)

Table 4.2 Number of files and disk usage on each seeder

                   Seeder 1   Seeder 2   Seeder 3   Seeder 4   Total
No. of Files       119        103        112        101        435
Disk Usage (MB)    422        259        310        229        1,220

4.6 BTFS Replication Performance

In this experiment, we replicate files onto multiple seeders and measure the time for a single BTFS client to retrieve the files from different numbers of seeders. Note that the number of replicas a BTFS client creates for a file is controlled by the corresponding parameter in the global configuration file. We vary the number of seeders from 1 to 4 and allow files to be replicated to all available seeders, so the BTFS client can download several parts of a file in parallel from these seeders. Figure 4.11 shows the download time of the data required for rendering the large job, i.e. scene 02_rabbit/02.blend (rendering time not included). This job has 58 referenced files and a total data size of 290 MB. The transfer time using only 1 seeder is slightly over 300 s; with more seeders (and thus more replicas), the time decreases accordingly, and the best performance is obtained with 4 seeders.

Figure 4.12 demonstrates the speedup of the transfer time as the number of seeders increases. The speedup with n seeders is calculated as the ratio of the single-seeder transfer time to the transfer time using n seeders, i.e. speedup(n) = T(1)/T(n), so the ideal speedup equals n. The reduction in transfer time and the resulting speedup are not optimal compared with the ideal case because of some overhead in BTFS, and as the number of seeders increases, this overhead tends to grow as well. However, as long as the overhead is not too high, replication improves the performance of BTFS.

Figure 4.11 Data transfer time of multiple replicas (time in s vs. number of seeders, 1-4; series: Ideal, BTFS)

Figure 4.12 Speedup of multiple replicas (speedup vs. number of seeders, 1-4; series: Ideal, BTFS)

4.7 Operation Breakdown

To understand the overhead of BTFS, we measure the time spent in each step of the read and write operations of the BTFS file system. In this experiment, we set up only one seeder and one BTFS client, and vary the file size from 0.5 MB to 20 MB to see how the overhead grows. For a read operation, the BTFS client executes the following steps in order:

1) Get Metadata: get the file attributes from the metadata server.
2) Get Torrent: get the torrent information of the file from the metadata server.
3) Download File: download the file from the seeder and store it on local disk.
4) Update Cache DB: update the cache information in the local database.
5) Read Local: read the file locally.
6) Decrypt: decrypt the file using the retrieved key.
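
The following Python sketch condenses this read path into a single timed routine. The helper objects (metadata_client, torrent_client, cache_db, decrypt) are hypothetical stand-ins for the corresponding BTFS components, not the real implementation; the sketch only shows how the per-step times reported in Figure 4.13 can be attributed.

import time

def btfs_read(path, metadata_client, torrent_client, cache_db, decrypt):
    timings = {}

    t = time.time()
    attrs = metadata_client.get_attributes(path)          # 1) Get Metadata
    timings["get_metadata"] = time.time() - t

    t = time.time()
    torrent = metadata_client.get_torrent(path)           # 2) Get Torrent
    timings["get_torrent"] = time.time() - t

    t = time.time()
    local_path = torrent_client.download(torrent)         # 3) Download File to local disk
    timings["download"] = time.time() - t

    t = time.time()
    cache_db.record(path, local_path, attrs)              # 4) Update Cache DB
    timings["update_cache_db"] = time.time() - t

    t = time.time()
    with open(local_path, "rb") as f:                     # 5) Read Local
        ciphertext = f.read()
    timings["read_local"] = time.time() - t

    t = time.time()
    data = decrypt(ciphertext)                            # 6) Decrypt with the retrieved key
    timings["decrypt"] = time.time() - t

    return data, timings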

Figure 4.13 shows the time of all steps in milliseconds for reading files of different sizes, except the download time, which is presented in seconds on the secondary Y-axis. As the file gets larger, the download time increases proportionally. The cumulative time of all other steps is considered the overhead, and it also grows with file size, most of it coming from the decryption step. However, the overhead is less than a second and grows much more slowly than the download time, so for files larger than 1 MB this overhead is negligible. Although the operation breakdown depends on many factors, such as bandwidth, the number of peers and the CPU speed of the system, it still gives useful insight into the internals.


Figure 4.13 Read operation breakdown (BTFS operation time in ms and file download time in s, for file sizes 0.5-20 MB; steps: Get Metadata, Get Torrent, Update Cache DB, Read Local, Decrypt, Download File)

For a write operation, the BTFS client carries out the following steps in order:

1) Get Metadata: get the parent directory's ACL from the metadata server.
2) Write Local: write the file to local disk.
3) Encrypt: encrypt the file using a generated key.
4) Update Cache DB: update the cache information in the local database.
5) Create Torrent: calculate the file hash and create the torrent information.
6) Upload File: upload the file to a seeder.
7) Update Metadata: update the file attributes and put the torrent information into the metadata server.
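
A matching sketch of the write path is given below, using the same hypothetical helper objects; a small context manager replaces the explicit timing code of the read example. Again, this illustrates the step sequence behind Figure 4.14, not the actual BTFS code.

import time
from contextlib import contextmanager

@contextmanager
def timed(timings, step):
    t = time.time()
    yield
    timings[step] = time.time() - t

def btfs_write(path, data, local_path, metadata_client, torrent_client, cache_db, crypto):
    timings = {}
    with timed(timings, "get_metadata"):        # 1) parent ACL from the metadata server
        acl = metadata_client.get_parent_acl(path)
    with timed(timings, "write_local"):         # 2) write the file to local disk
        with open(local_path, "wb") as f:
            f.write(data)
    with timed(timings, "encrypt"):             # 3) encrypt the local file with a generated key
        key = crypto.generate_key()
        crypto.encrypt_file(local_path, key)
    with timed(timings, "update_cache_db"):     # 4) record the file in the local cache database
        cache_db.record(path, local_path)
    with timed(timings, "create_torrent"):      # 5) hash the file and build torrent information
        torrent = torrent_client.create_torrent(local_path)
    with timed(timings, "upload"):              # 6) upload the file to a seeder
        torrent_client.upload(local_path)
    with timed(timings, "update_metadata"):     # 7) publish attributes and torrent information
        metadata_client.update(path, torrent)
    return timings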

Similarly, Figure 4.14 shows the overhead and the upload time of a BTFS client when writing files of different sizes. The overhead is measured in milliseconds and includes all steps except the upload time, which is presented in seconds on the secondary Y-axis. Overall, the overhead of writing a file is larger than that of reading a file, but the total overhead is still less than a second in all cases, with encryption taking the largest share. The upload, encryption, local write and torrent creation steps clearly grow as the file becomes larger. However, the upload time grows faster than the overhead, which again becomes negligible when writing files larger than 1 MB.

Figure 4.14 Write operation breakdown (BTFS operation time in ms and file upload time in s, for file sizes 0.5-20 MB; steps: Get Metadata, Write Local, Encrypt, Update Cache DB, Create Torrent, Update Metadata, Upload File)

CHAPTER 5 CONCLUSIONS

We have presented the design and implementation of the BitTorrent File System, or BTFS, which aims to reduce data transfer time and improve the performance of distributed rendering. BTFS allows rendering software, as well as other applications, to transparently share and exchange data in a peer-to-peer manner. Many components of BTFS are built around well-developed open source software and standard protocols so that BTFS inherits their capability, maturity and stability. We have carried out experiments on a testbed using a production-grade 3D animation. The results show that distributed rendering with BTFS performs better than with traditional network file systems, as evidenced by shorter rendering times and a lower load on the server.

BTFS can also be used in other distributed applications that require disseminating a large amount of data to many nodes in a short time. However, the “last writer wins” consistency model used in BTFS might not be suitable for all applications; a release consistency model, which requires lock and unlock operations when accessing files, should further be implemented in BTFS as it is common in many applications. Besides, even though the metadata server is replicated, it could become a performance bottleneck if there is a large number of metadata updates to the master server; sharding the metadata across multiple metadata servers would ensure the scalability of BTFS in a stronger sense. Partial file requests and data deduplication in BTFS are also worth investigating in future work.

REFERENCES

[1] “Distributed Rendering,” Available at http://www.isgtw.org/visualization/distributed-rendering/, September 7, 2011.
[2] “Free Rendering by the for the People,” Available at http://www.renderfarm.fi/, 2012.
[3] “Volunteer Computing,” Available at http://boinc.berkeley.edu/trac/wiki/VolunteerComputing/, 2012.
[4] R. Sandberg et al., “Design and Implementation of the Sun Network Filesystem,” In USENIX 1985 Summer Conference Proceedings, 1985, pp. 119-130.
[5] “Common Internet File System (CIFS),” Available at http://www.cifs.com/, 2013.
[6] “Samba – Opening Windows to a Wider World,” Available at http://www.samba.org/, July 26, 2011.
[7] J.H. Howard et al., “Scale and Performance in a Distributed File System,” ACM Transactions on Computer Systems, Vol.6, No.1, February 1988, pp. 51-81.
[8] “Coda File System,” Available at http://www.coda.cs.cmu.edu/, 2013.
[9] “MogileFS,” Available at https://github.com/mogilefs/, 2013.
[10] K. Shvachko et al., “The Hadoop Distributed File System,” In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Lake Tahoe, Nevada, USA, May 6-7, 2010, pp. 1-10.
[11] “Napster,” Available at http://www.napster.com/, 2013.
[12] T. Klingberg and R. Manfredi, “Gnutella Protocol Development,” Available at http://rfc-gnutella.sourceforge.net/src/rfc-0_6-draft.html, June 2002.
[13] I. Clarke et al., “Protecting Free Expression Online with Freenet,” IEEE Internet Computing, Vol.6, No.1, January-February 2002, pp. 40-49.
[14] J. Kubiatowicz et al., “OceanStore: An Architecture for Global-Scale Persistent Storage,” In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2000), Cambridge, Massachusetts, USA, November 2000, pp. 190-201.
[15] “Kazaa Lite,” Available at http://kazaa-lite.en.softonic.com/, 2013.
[16] B. Cohen, “Incentives Build Robustness in BitTorrent,” Workshop on Economics of Peer-to-Peer Systems, Berkeley, CA, USA, May 22, 2003.

[17] A. Kaplan, G.C. Fox, and G.v. Laszewski, “GridTorrent Framework: A High-Performance Data Transfer and Data Sharing Framework for Scientific Computing,” Grid Computing Environments Workshop, Reno, Nevada, USA, November 11-12, 2007.
[18] A. Zissimos et al., “GridTorrent: Optimizing Data Transfers in the Grid with Collaborative Sharing,” Proceedings of the 11th Panhellenic Conference on Informatics (PCI 2007), Patras, Greece, May 18-20, 2007.
[19] F. Costa et al., “Optimizing the Data Distribution Layer of BOINC with BitTorrent,” Proceedings of the 2008 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, Florida, USA, April 14-18, 2008, pp. 1-8.
[20] D.P. Anderson, “BOINC: A System for Public-Resource Computing and Storage,” Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing (GRID 2004), Pittsburgh, Pennsylvania, USA, November 8, 2004, pp. 4-10.
[21] B. Wei, G. Fedak, and F. Cappello, “Collaborative Data Distribution with BitTorrent for Computational Desktop Grids,” Proceedings of the 4th International Symposium on Parallel and Distributed Computing (ISPDC 2005), France, July 4-6, 2005.
[22] B. Wei, G. Fedak, and F. Cappello, “Towards Efficient Data Distribution on Computational Desktop Grids with BitTorrent,” Future Generation Computer Systems, Vol.23, No.8, November 2007, pp. 983-989.
[23] F. Cappello et al., “Computing on Large-Scale Distributed Systems: XtremWeb Architecture, Programming Models, Security, Tests and Convergence with Grid,” Future Generation Computer Systems, Vol.21, No.3, March 1, 2005, pp. 417-437.
[24] “BURP: the Big and Ugly Rendering Project,” Available at http://burp.renderfarming.net/, 2012.
[25] “vSwarm: Free Render Farm,” Available at http://www.vswarm.com/, 2012.
[26] “FUSE: Filesystem in Userspace,” Available at http://fuse.sourceforge.net/, 2012.
[27] “Apache ZooKeeper,” Available at http://zookeeper.apache.org/, 2012.
[28] “Lighttpd,” Available at http://www.lighttpd.net/download/, 2012.
[29] “opentracker – An Open and Free BitTorrent Tracker,” Available at http://erdgeist.org/arts/software/opentracker/, 2012.

[30] “neon HTTP and WebDAV Client Library,” Available at http://www.webdav.org/neon/, 2013.
[31] “Apache ZooKeeper,” Available at http://zookeeper.apache.org/, 2012.
[32] “SQLite,” Available at http://www.sqlite.org/, 2013.
[33] “OpenSSL: The Open Source Toolkit for SSL/TLS,” Available at http://www.openssl.org/, 2009.
[34] “libtorrent,” Available at http://www.rasterbar.com/products/libtorrent/, 2005.
[35] “stunnel,” Available at https://www.stunnel.org/index.html/, 2013.
[36] “BitTorrent Multitracker Metadata Extension,” Available at http://www.bittorrent.org/beps/bep_0012.html, February 14, 2008.
[37] “Big Buck Bunny,” Available at http://www.bigbuckbunny.org/, June 2010.
[38] “DrQueue, the Open Source Distributed Render Queue,” Available at http://www.drqueue.org/, 2013.
[39] “Blender,” Available at http://www.blender.org/, 2009.
[40] “SMB for Fuse,” Available at http://www.ricardis.tudelft.nl/~vincent/fusesmb/, 2007.
[41] “The Linux Kernel Archives,” Available at https://www.kernel.org/, 2012.

BIOGRAPHY

MAIN RESEARCHER
NAME: Asst. Prof. Namfon Assawamekin, Ph.D.
INSTITUTIONS ATTENDED: University of the Thai Chamber of Commerce, 1995, Bachelor of Science (Computer Science); Chulalongkorn University, 1999, Master of Science (Computer Science); Mahidol University, 2009, Doctor of Philosophy (Computer Science)
EMPLOYMENT ADDRESS: University of the Thai Chamber of Commerce, 126/1 Vibhavadee-Rangsit Rd., Dindaeng, Bangkok 10400, THAILAND
Tel. +66(0) 2697-6506
E-mail: [email protected], [email protected]

CO-RESEARCHER
NAME: Ekasit Kijsipongse, Ph.D.
INSTITUTIONS ATTENDED: Chulalongkorn University, 1991, Bachelor of Engineering (Industrial Engineering); Asian Institute of Technology, 1994, Master of Engineering (Computer Science); Mahidol University, 2009, Doctor of Philosophy (Computer Science)
EMPLOYMENT ADDRESS: National Electronics and Computer Technology Center, 112 Thailand Science Park, Phahonyothin Road, Khlong Nueng, Khlong Luang, Pathumthani 12120, THAILAND
Tel. +66(0) 2564-6900
E-mail: [email protected]