2. Related Work
File prefetching is an effective technique for improving file access performance, and it brings two major advantages. First, applications execute more smoothly because they hit more often in the file cache. Second, less "burst" load is placed on the network, because prefetching is performed only when network bandwidth is available rather than on demand. On the other hand, prefetching has two main costs. One is the CPU cycles expended by the client in deciding when and what to prefetch; cycles are spent both on gathering the information needed to make prefetching decisions and on actually executing the prefetches. The other is the network bandwidth wasted when prefetch decisions inevitably prove less than perfect [5].

Several distributed file systems use file prefetching to improve performance, and prefetching has also been applied to the Web to tolerate Web access latency. The following briefly describes work involving file prefetching in distributed file systems, Web prefetching, and automatic prefetching.

2.1 Distributed file system

Nowadays, network file systems [11] are increasingly mature. A network file system is usually called a distributed file system, because shared files and directories can be served to many different computers on a network. A distributed file system lets users on different computing systems access their own files through the same directory structure. The primary concept behind a distributed file system is the client/server model: one computer system provides services that other computers access. The machine that supplies the service is the server, and the one that requests it is the client. For everyone who uses or manages computers, a distributed file system provides many important advantages. A user can access files in the same way from one computer or from several different ones.
For system administrators, a distributed file system can simplify management tasks such as backup and recovery, and can centralize standard administrative work such as creating and deleting accounts or monitoring system usage. Because high-speed networks, powerful local servers, and time-tested distributed file systems all work well in the open-source environment, distributed file systems are easier to use than ever before. The following is a brief introduction to several distributed file systems.

● AFS (Andrew File System) & OpenAFS

AFS [12][13] is a distributed file system product, pioneered at Carnegie Mellon University and supported and developed as a product by Transarc Corporation (now IBM Pittsburgh Labs). It offers a client/server architecture for file sharing, providing location independence, scalability, and transparent migration capabilities for data. IBM branched the source of the AFS product and made a copy available for community development and maintenance, calling the release OpenAFS; the original release is distributed as OpenAFS 1.0. OpenAFS is based on the client/server architecture model and is aimed at WAN applications. To reduce network transmission delay, OpenAFS relies on caching. When a user opens a file to edit, the file server sends the first blocks to the client device; since the user will very likely want to edit the rest of the file as well, the OpenAFS file server fetches the whole file and sends it to the client's cache. When the user continues editing, the rest of the file is already in the local cache, so no further network transmission occurs until the file is written back.

● NFS (Network File System)

The Network File System (NFS) [14] is the most popular distributed file system; almost all UNIX and UNIX-like systems can use NFS. It was developed to allow machines to mount a disk partition on a remote machine as if it were on a local hard drive.
This allows fast, seamless sharing of files across a network. It also gives unwanted people the potential to access your hard drive over the network (and thereby possibly read your email, delete your files, or break into your system) if it is set up incorrectly, so the security aspects of any NFS deployment deserve careful attention. The latest version, NFS version 4 [15], is a distributed file system protocol that owes its heritage to NFS protocol versions 2 [16] and 3 [17]. Unlike earlier versions, the NFS version 4 protocol supports traditional file access while integrating support for file locking and the mount protocol. In addition, support for strong security (and its negotiation), compound operations, client caching, and internationalization has been added, and attention has been paid to making NFS version 4 operate well in an Internet environment.

● Coda

Coda [18][19] is an advanced networked file system. It has been developed at CMU since 1987 by M. Satyanarayanan's systems group in the SCS department. Coda is a branch of AFS, produced from the source code of AFS version 2, so it shares many functions with AFS. Its main goal is to cope with weak or even nonexistent network connectivity. Coda was the first distributed file system to support disconnected operation for mobile computing: a user can continue to operate on cached files even while the network is disconnected, and upon reconnection the client device automatically synchronizes the files in the local cache with those on the file server.

● GFS (Global File System)

Sistina GFS [20] is an advanced, mature, and scalable file system. Recognized as the de facto cluster file system on Linux, Sistina GFS is a highly stable solution for enterprise and technical computing applications requiring reliable access to data.
Sistina GFS allows multiple servers on a Storage Area Network (SAN) to have read and write access to a single file system on shared SAN devices, delivering the strength, safety, and simplicity demanded by enterprise and technical computing environments.

● InterMezzo

InterMezzo [21] is a distributed file system with a focus on high availability. It is designed to be suitable for server replication, mobile computing, managing system software on large clusters, and maintaining high-availability clusters; for example, it offers disconnected operation and automatic recovery from network outages. InterMezzo is an Open Source (GPL) project and entered the Linux kernel at version 2.4.15.

● Sprite

Sprite [22][23] is a research operating system developed at the University of California, Berkeley, by John Ousterhout's research group. Sprite is a distributed operating system that provides a single system image to a cluster of workstations. It achieves very high file system performance through client and server caching, and it supports process migration to take advantage of idle machines. It was used as a testbed for research in log-structured file systems, striped file systems, crash recovery, and RAID file systems, among other things.

● xFS

xFS [24] is a serverless file system that attempts to provide low-latency, high-bandwidth access to file system data by distributing the functionality of the server among the clients. The typical duties of a server include maintaining cache coherence, locating data, and servicing disk requests. The developers of xFS designed cache coherence protocols that use the collective memory of the clients as a system-wide cache. By reducing the amount of redundant caching among clients and allowing the memory of idle machines to be utilized, cooperative caching can lower read latency by reducing the number of requests that must go to disk.
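The cooperative-caching idea just described can be illustrated with a minimal sketch. This is not xFS's actual protocol or API; all class and variable names here are hypothetical, and the point is only the read path: local cache first, then a peer's cache, and disk only as a last resort.

```python
# Illustrative sketch of cooperative caching (hypothetical names,
# not xFS's real protocol): a read is served from the local cache,
# then from a peer's cache, and only as a last resort from disk.

class Client:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # block id -> data

class CooperativeCache:
    def __init__(self, clients, disk):
        self.clients = clients
        self.disk = disk          # block id -> data
        self.disk_reads = 0       # count requests that reach disk

    def read(self, client, block):
        if block in client.cache:                 # 1. local hit
            return client.cache[block]
        for peer in self.clients:                 # 2. peer hit: no disk access
            if peer is not client and block in peer.cache:
                data = peer.cache[block]
                client.cache[block] = data
                return data
        self.disk_reads += 1                      # 3. miss: go to disk
        data = self.disk[block]
        client.cache[block] = data
        return data

a, b = Client("a"), Client("b")
cc = CooperativeCache([a, b], disk={"blk0": "data0"})
cc.read(a, "blk0")        # first access goes to disk
cc.read(b, "blk0")        # served from a's cache, avoiding disk
print(cc.disk_reads)      # -> 1
```

The second client's read is satisfied out of the first client's memory, which is exactly how cooperative caching turns the clients' collective memory into a system-wide cache and reduces disk requests.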
The task of locating data in xFS is distributed by making each client responsible for servicing requests on a subset of the files. File data is striped across multiple clients to provide high bandwidth, and the striped data includes parity information that can be used to reconstruct a missing stripe segment when, for example, a machine is down. In this way, no node is a single point of failure.

2.2 Web prefetching

People use the World Wide Web (WWW) because it gives quick and easy access to a tremendous variety of information in remote locations. Users do not like to wait for their results; they tend to avoid or complain about Web pages that take a long time to retrieve. In other words, users care about Web latency. In distributed information systems like the World Wide Web, prefetching techniques attempt to predict the future requests of users based on past history, as observed at the client, server, or proxy. Prefetching for the Web is an active area of study with considerable practical importance; its objective is to reduce user-perceived latency. Web servers are in a better position to make predictions about future references, since they log a significant part of the requests made by all Internet clients for the resources they own. The prediction engine can be implemented by exchanging messages between the server and clients, with the server piggybacking information about predicted resources onto regular response messages, thereby avoiding the establishment of any new TCP connections.

WMo is a prefetching algorithm that accounts for the three factors that characterize the performance of predictive Web prefetching algorithms [25][26]: a) the order of dependencies between page accesses, b) the noise present in user accesses (i.e., accesses that are not part of a pattern), and c) the ordering of accesses within access sequences. However, such an approach has no way to pre-retrieve documents that are newly created or have never been visited before.
For example, when a client enters a new Web site, all the anchored URLs of its pages are fresh, and none of them will be prefetched by these approaches.
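A minimal first-order sketch makes both points above concrete: history-based prediction of the next page, and the inherent inability to predict never-visited pages. This is only an illustration, not the WMo algorithm itself, which additionally handles higher-order dependencies, noise, and ordering within access sequences; all names here are hypothetical.

```python
from collections import defaultdict, Counter

# First-order prediction sketch (illustrative, not WMo): count
# page-to-page transitions observed in past sessions and predict
# the most frequently observed successor for prefetching.

class FirstOrderPredictor:
    def __init__(self):
        self.transitions = defaultdict(Counter)  # page -> successor counts

    def train(self, session):
        for cur, nxt in zip(session, session[1:]):
            self.transitions[cur][nxt] += 1

    def predict(self, page):
        # Returns the most likely next page, or None for a page with
        # no history -- which is why newly created or never-visited
        # documents can never be prefetched by such approaches.
        successors = self.transitions.get(page)
        if not successors:
            return None
        return successors.most_common(1)[0][0]

p = FirstOrderPredictor()
p.train(["index", "news", "sports"])
p.train(["index", "news", "weather"])
p.train(["index", "about"])
print(p.predict("index"))     # -> news
print(p.predict("brandnew"))  # -> None (never-visited page)
```

Here "index" was followed by "news" in two of three observed sessions, so "news" is predicted, while a freshly created page has no transition history at all and yields no prediction.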