TierStore: A Distributed Filesystem for Challenged Networks in Developing Regions

Michael Demmer, Bowei Du, and Eric Brewer
University of California, Berkeley
{demmer,bowei,brewer}@cs.berkeley.edu

Abstract

TierStore is a distributed filesystem that simplifies the development and deployment of applications in challenged network environments, such as those in developing regions. For effective support of bandwidth-constrained and intermittent connectivity, it uses the Delay Tolerant Networking store-and-forward network overlay and a publish/subscribe-based multicast replication protocol. TierStore provides a standard filesystem interface and a single-object coherence approach to conflict resolution which, when augmented with application-specific handlers, is both sufficient for many useful applications and simple to reason about for programmers. In this paper, we show how these properties enable easy adaptation and robust deployment of applications even in highly intermittent networks and demonstrate the flexibility and bandwidth savings of our prototype with initial evaluation results.

1 Introduction

The limited infrastructure in developing regions both hinders the deployment of information technology and magnifies the need for it. In spite of the challenges, a variety of simple information systems have shown real impact on health care, education, commerce and productivity [19, 34]. For example, in Tanzania, data collection related to causes of child deaths led to a reallocation of resources and a 40% reduction in child mortality (from 16% to 9%) [4, 7].

Yet in many places, the options for network connectivity are quite limited. Although cellular networks are growing rapidly, they remain a largely urban and costly phenomenon, and although satellite networks have coverage in most rural areas, they too are extremely expensive [30]. For these and other networking technologies, power problems and coverage gaps cause connectivity to vary over time and location.

To address these challenges, various groups have used novel approaches for connectivity in real-world applications. The Wizzy Digital Courier system [36] distributes educational content among schools in South Africa by delaying dialup access until night time, when rates are cheaper. DakNet [22] provides e-mail and web connectivity by copying data to a USB drive or hard disk and then physically carrying the drive, sometimes via motorcycles. Finally, Ca:sh [1] uses PDAs to gather rural health care data, also relying on physical device transport to overcome the lack of connectivity. These projects demonstrate the value of information distribution applications in developing regions, yet they all essentially started from scratch and thus use ad-hoc solutions with little leverage from previous work.

This combination of demand and obstacles reveals the need for a flexible application framework for “challenged” networks. Broadly speaking, challenged networks lack the ability to support reliable, low-latency, end-to-end communication sessions that typify both the phone network and the Internet. Yet many important applications can still work well despite low data rates and frequent or lengthy disconnections; examples include e-mail, voicemail, data collection, news distribution, e-government, and correspondence education. The challenge lies in implementing systems and protocols to adapt applications to the demands of the environment.

Thus our central goal is to provide a general purpose framework to support applications in challenged networks, with the following key properties: First, to adapt existing applications and develop new ones with minimal effort, the system should offer a familiar and easy-to-use filesystem interface. To deal with intermittent networks, applications must operate unimpeded while disconnected, and easily resolve update conflicts that may occur as a result. Finally, to address the networking challenges, replication protocols need to be able to leverage a range of network transports, as appropriate for particular environments, and efficiently distribute application data.

USENIX Association FAST ’08: 6th USENIX Conference on File and Storage Technologies

As we describe in the remainder of this paper, TierStore is a distributed filesystem that offers these properties. Section 2 describes the high-level design of the system, followed by a discussion of related work in Section 3. Section 4 describes the details of how the system operates. Section 5 discusses some applications we have developed to demonstrate flexibility. Section 6 presents an initial evaluation, and we conclude in Section 7.

2 TierStore Design

The goal of TierStore is to provide a distributed filesystem service for applications in bandwidth-constrained and/or intermittent network environments. To achieve these aims, we claim no fundamentally new mechanisms; rather, we argue that TierStore is a novel synthesis of well-known techniques and, most importantly, is an effective platform for application deployment.

TierStore uses the Delay Tolerant Networking (DTN) bundle protocol [11, 28] for all inter-node messaging. DTN defines an overlay network architecture for challenged environments that forwards messages among nodes using a variety of transport technologies, including traditional approaches and long-latency “sneakernet” links. Messages may also be buffered in persistent storage during connection outages and/or retransmitted due to message loss. Using DTN allows TierStore to adapt naturally to a range of network conditions and to use the solution(s) most appropriate for a particular environment.

To simplify application development, TierStore implements a standard filesystem interface that can be accessed and updated at multiple nodes in the network. Any modifications to the shared filesystem state are both applied locally and encoded as update messages that are lazily distributed to other nodes in the network. Because nodes may be disconnected for long periods of time, the design favors availability at the potential expense of consistency [12]. This decision is critical to allow applications to function unimpeded in many environments.

The filesystem layer implements traditional NFS-like semantics, including close-to-open consistency, hard and soft links, and standard UNIX group, owner, and permission semantics. As such, many interesting and useful applications can be deployed on a TierStore system without (much) modification, as they often already use the filesystem for communication of shared state between application instances. For example, several implementations of e-mail, log collection, and wiki packages are already written to use the filesystem for shared state and have simple data distribution patterns, and are therefore straightforward to deploy using TierStore. Also, these applications are either already conflict-free in the ways that they interact with shared storage or can be easily made conflict-free with simple extensions.

Based in part on these observations, TierStore implements a single-object coherence policy for conflict management, meaning that only concurrent updates to the same file are flagged as conflicts. We have found that this simple model, coupled with application-specific conflict resolution handlers, is both sufficient for many useful applications and easy to reason about for programmers. It is also a natural consequence of offering a filesystem interface, as UNIX filesystems do not naturally expose a mechanism for multiple-file atomic updates.

When conflicts do occur, TierStore exposes all information about the conflicting update through the filesystem interface, allowing either automatic resolution by application-specific scripts or manual intervention by a user. For more complex applications for which single-file coherence is insufficient, the base system is extensible to allow the addition of application-specific meta-objects (discussed in Section 4.12). These objects can be used to group a set of user-visible files that need to be updated atomically into a single TierStore object.

To distribute data efficiently over low-bandwidth network links, TierStore allows the shared data to be partitioned into fine-grained publications, currently defined as disjoint subtrees of the filesystem namespace. Nodes can then subscribe to receive updates to only their publications of interest, rather than requiring all shared state to be replicated. This model maps quite naturally to the needs of real applications (e.g. users’ mailboxes and folders, portions of web sites, or regional data collection). Finally, TierStore nodes are organized into a multicast-like distribution tree to limit redundant update transmissions over low-bandwidth links.

3 Related Work

Several existing systems offer distributed storage services with varying network assumptions; here we briefly discuss why none fully satisfies our design goals.

One general approach has been to adapt traditional network file systems such as NFS and AFS for use in constrained network environments. For example, the Low-Bandwidth File System (LBFS) [18] implements a modified NFS protocol that significantly reduces bandwidth consumption. However, LBFS maintains NFS’s focus on consistency rather than availability in the presence of partitions [12]; thus even though it addresses the bandwidth problems, it is unsuitable for intermittent connectivity.

Coda [16] extends AFS to support disconnected operation. In Coda, clients register for a subset of files to be “hoarded”, i.e. to be available when offline, and modifications made while disconnected are merged with the server state when the client reconnects. Due to its AFS heritage, Coda has a client-server model that imposes restrictions on the network topology, so it is not amenable to cases in which there may not be a clear client-server relationship and where intermittency might occur at multiple points in the network. This limits the deployability of Coda in many real-world environments that we target.

Protocols such as rsync [33], Unison [24] and OfflineIMAP [20] can efficiently replicate file or application state for availability while disconnected. These systems provide pairwise synchronization of data between nodes, so they require external ad-hoc mechanisms for multiple-node replication. More fundamentally, in a shared data store that is being updated by multiple parties, no single node has the correct state that should be replicated to all others. Instead, it is the collection of each node’s updates (additions, modifications, and deletions) that needs to be replicated throughout the network to bring everyone up to date. Capturing these update semantics through pair-wise synchronization of system state is challenging and in some cases impossible.

Bayou [23, 32] uses an epidemic propagation protocol among mobile nodes and has a strong consistency model. When conflicts occur, it will roll back updates and then roll forward to reapply them and resolve conflicts as needed. However, this flexibility and expressiveness comes at a cost: applications need to be rewritten to use the Bayou shared database, and the system assumes that data is fully replicated at every node. It also assumes that rollback is always possible, but in a system with human users, rollback might require undoing the actions of the users as well. TierStore sacrifices the expressiveness of Bayou’s semantic-level updates in favor of the simplicity of a state-based system.

PRACTI [2] is a replicated storage system that uses a Bayou-like replication protocol, enhanced with summaries of aggregated metadata to enable multi-object consistency without full database content replication. However, the invalidation-based protocol of PRACTI implies that for strong consistency semantics, it must retrieve invalidated objects on demand. Since these requests may block during network outages, PRACTI either performs poorly in these cases or must fall back to simpler consistency models, thus no longer providing arbitrary consistency. Also, as in the case of Bayou, PRACTI requires a new programming environment with special semantics for reading and writing objects, increasing the burden on the application programmer.

Dynamo [8] implements a key/value data store with a goal of maximum availability during network partitions. It supports reduced consistency and uses many techniques similar to those used in TierStore, such as version vectors for conflict detection and application-specific resolution. However, Dynamo does not offer a full hierarchical namespace, which is needed for some applications, and it is targeted for data center environments, whereas our design is focused on a more widely distributed topology.

Haggle [29] is a clean-slate design for networking and data distribution targeted for mobile devices. It shares many design characteristics with DTN, including a flexible naming framework, multiple network transports, and late binding of message destinations. The Haggle system model incorporates shared storage between applications and the network, but it is oriented around publishing and querying for messages, not providing a replicated storage service. Thus applications must be rewritten to use the Haggle APIs or adapted using network proxies.

Finally, the systems that are closest to TierStore in design are optimistically concurrent peer-to-peer file systems such as Ficus [21] and Rumor [15]. Like TierStore, Ficus implements a shared file system with single-file consistency semantics and automatic resolution hooks for update conflicts. However, the Ficus log-exchange protocols are not well suited for long-latency (i.e. sneakernet) links, since they require multiple round trips for synchronization. Also, update conflicts must be resolved before the file becomes available, which can degrade availability in cases where an immediate resolution to the conflict is not possible. In contrast, TierStore allows conflicting partitions to continue to make progress.

Rumor is an external user-level synchronization system that builds upon the Ficus work. It uses Ficus’ techniques for conflict resolution and update propagation, and is thus similarly unsuitable in our target environment.

4 TierStore in Detail

This section describes the implementation of TierStore. First we give a brief overview of the various components of TierStore, shown in Figure 1, then we delve into more detail as the section progresses.

4.1 System Components

As discussed above, TierStore implements a standard filesystem abstraction, i.e., a persistent repository for file objects and a hierarchical namespace to organize those files. Applications interface with TierStore using one of two filesystem interfaces, either FUSE [13] (Filesystem in Userspace) or NFS [27]. Typically we use NFS over a loopback mount, though a single TierStore node could export a shared filesystem to a number of users in a well-connected LAN environment over NFS.

File and system data are stored in persistent storage repositories that lie at the core of the system. Read access to data passes through the view resolver, which handles conflicts and presents a self-consistent filesystem to applications. Modifications to the filesystem are encapsulated as updates and forwarded to the update manager

where they are applied to the persistent repositories and forwarded to the subscription manager.

The subscription manager uses the DTN network to distribute updates to and from other nodes. Updates that arrive from the network are forwarded to the update manager, where they are processed and applied to the persistent repository in the same way as local modifications.

[Figure 1: Block diagram showing the major components of the TierStore system. Arrows indicate the flow of information between components.]

4.2 Objects, Mappings, and Guids

TierStore objects derive from two basic types: data objects are regular files that contain arbitrary user data, except for symbolic links, which have a well-specified format. Containers implement directories by storing a set of mappings: tuples of (guid, name, version, view).

A guid uniquely identifies an object, independent from its location in the filesystem, akin to an inode number in the UNIX filesystem, though with global scope. Each node in a TierStore deployment is configured with a unique identity by an administrator, and guids are defined as a tuple (node, time) of the node identity where an object was created and a strictly increasing local time counter.

The name is the user-specified filename in the container. The version defines the logical time when the mapping was created in the history of system updates, and the view identifies the node that created the mapping (not necessarily the node that originally created the object). Versions and views are discussed further below.

4.3 Versions

Each node increments a local update counter after every new object creation or modification to the filesystem namespace (i.e. rename or delete). This counter is used to uniquely identify the particular update in the history of modifications made at the local node, and is persistently serialized to disk to survive reboots.

A collection of update counters from multiple nodes defines a version vector and tracks the logical ordering of updates for a file or mapping. As mentioned above, each mapping contains a version vector. Although each version vector conceptually has a column for all nodes in the system, in practice we only include columns for nodes that have modified a particular mapping or the corresponding object, which is all that is required for the single-object coherence model.

Thus a newly created mapping has only a single entry in its version vector, in the column of the creating node. If a second node were to subsequently update the same mapping, say by renaming the file, then the new mapping’s version vector would include the old version in the creating node’s column, plus the newly incremented update counter from the second node. Thus the new vector would subsume the old one in the version sequence.

We expect TierStore deployments to be relatively small-scale (at most hundreds of nodes in a single system), which keeps the maximum length of the vectors to a reasonable bound. Furthermore, most of the time, files are updated at an even smaller number of sites, so the size of the version vectors should not be a performance problem. We could, however, adopt techniques similar to those used in Dynamo [8] to truncate old entries from the vector if this were to become a performance limitation.

We also use version vectors to detect missing updates. The subscription manager records a log of the versions for all updates that have been received from the network. Since each modification causes exactly one update counter to be incremented, the subscription manager detects missing updates by looking for holes in the version sequence. Although the DTN network protocols retransmit lost messages to ensure reliable delivery, a fallback repair protocol detects missing updates and can request them from a peer.

4.4 Persistent Repositories

The core of the system has a set of persistent repositories for system state. The object repository is implemented using regular UNIX files named with the object guid. For data objects, each entry simply stores the contents of the given file. For container objects, each file stores a log of updates to the name/guid/view tuple set, periodically compressed to truncate redundant entries. We use a log

instead of a vector of mappings for better performance on modifications to large directories.

Each object (data and container) has a corresponding entry in the metadata repository, also implemented using files named with the object guid. These entries contain the system metadata, e.g. user/group/mode/permissions, that are typically stored in an inode. They also contain a vector of all the mappings where the object is located in the filesystem hierarchy.

With this design, mapping state is duplicated in the entries of the metadata table and in the individual container data files. This is a deliberate design decision: knowing the vector of objects in a container is needed for efficient directory listing and path traversal, while storing the set of mappings for an object is needed to update the object mappings without knowing its current location(s) in the namespace, simplifying the replication protocols.

To deal with the fact that the two repositories might be out of sync after a system crash, we use a write-ahead log for all updates. Because the updates are idempotent (as discussed below), we simply replay uncommitted updates after a system crash to ensure that the system state is consistent. We also implement a simple write-through cache for both persistent repositories to improve read performance on frequently accessed files.

4.5 Updates

The filesystem layer translates application operations (e.g. write, rename, creat, unlink) into two basic update operations: CREATE and MAP, the format of which is shown in Figure 2. These updates are then applied locally to the persistent repository and distributed over the network to other nodes.

[Figure 2: Contents of the core TierStore update messages. CREATE updates add objects to the system; MAP updates bind objects to location(s) in the namespace.]

CREATE updates add new objects to the system but do not make them visible in the filesystem namespace. Each CREATE is a tuple (object guid, object type, version, publication id, filesystem metadata, object data). These updates have no dependencies, so they are immediately applied to the persistent database upon reception, and they are idempotent since the binding of a guid to object data never changes (see the next subsection).

MAP updates bind objects into the filesystem namespace. Each MAP update contains the guid of an object and a vector of (name, container guid, view, version) tuples that specify the location(s) where the object should be mapped into the namespace. Although in most cases a file is mapped into only a single location, multiple mappings may be needed to properly handle hard links and some conflicts (described below).

Because TierStore implements a single-object coherence model, MAP updates can be applied as long as a node has previously received CREATE updates for the object and the container(s) where the object is to be mapped. This dependency is easily checked by looking up the relevant guids in the metadata repository and does not depend on other MAP messages having been received. If the necessary CREATE updates have not yet arrived, the MAP update is put into a deferred update queue for later processing when the other updates are received.

An important design decision related to MAP messages is that they contain no indication of any obsolete mapping(s) to remove from the namespace. That is because each MAP message implicitly removes all older mappings for the given object and for the given location(s) in the namespace, computed based on the logical version vectors. As described above, the current location(s) of an object can be easily looked up in the metadata repository using the object guid.

Thus, as shown in Figure 3, to process a MAP message, TierStore first looks up the object and container(s) using their respective guids in the metadata repository. If they both exist, then it compares the versions of the mappings in the message with those stored in the repository. If the new message contains more recent mappings, TierStore applies the new set of relevant mappings to the repository. If the message contains old mappings, it is discarded. In case the versions are incomparable (i.e. updates occurred simultaneously), then there is a conflict and both conflicting mappings are applied to the repository to be resolved later (see below). Therefore, MAP messages are also idempotent, since any obsolete mappings contained within them are ignored in favor of the more recent ones that are already in the repository.

4.6 Immutable Objects and Deletion

These two message types are sufficient because TierStore objects are immutable. A file modification is implemented by copying an object, applying the change, and

installing the modified copy in place of the old one (with a new CREATE and MAP). Thus the binding of a guid to particular file content is persistent for the life of the system. This model has been used by other systems such as Oceanstore [26], for the advantage that write-write conflicts are handled as name conflicts (two objects being put in the same namespace location), so we can use a single mechanism to handle both types of conflicts.

An obvious disadvantage is the need to distribute whole objects, even for small changes. To address this issue, the filesystem layer only “freezes” an object (i.e. issues a CREATE and MAP update) after the application closes the file, not after each call to write. In addition, we plan to integrate other well-known techniques, such as sending deltas of previous versions or encoding the objects as a vector of segments and only sending modified segments (as in LBFS [18]). However, when using these techniques, care would have to be taken to avoid round trips in long-latency environments.

When an object is no longer needed, either because it was explicitly removed with unlink or because a new object was mapped into the same location through an edit or rename, we do not immediately delete it, but instead we map it into a special trash container. This step is necessary because some other node may have concurrently mapped the object into a different location in the namespace, and we need to hold onto the object to potentially resolve the conflict.

In our current prototype, objects are eventually removed from the trash container after a long interval (e.g. multiple days), after which we assume no more updates will arrive for the object. This simple method has been sufficient in practice, though a more sophisticated distributed garbage collection scheme such as that used in Ficus [21] would be more robust.

[Figure 3: Flowchart of the decision process when applying MAP updates.]

4.7 Publications and Subscriptions

One of the key design goals for TierStore is to enable fine-grained sharing of application state. To that end, TierStore applications divide the overall filesystem namespace into disjoint covering subsets called publications. Our current implementation defines a publication as a tuple (container, depth) that includes any mappings and objects in the subtree that is rooted at the given container, up to the given depth. Any containers that are created at the leaves of this subtree are themselves the root of new publications. By default, new publications have infinite depth; custom-depth publications are created through a special administrative interface.

TierStore nodes then have subscriptions to an arbitrary set of publications; once a node is subscribed to a publication, it receives and transmits updates for the objects in that publication among all other subscribed nodes. The subscription manager component handles registering and responding to subscription interest and informing the DTN layer to set up forwarding state accordingly. It interacts with the update manager to be notified of local updates for distribution and to apply updates received from the network to the data store.

Because nodes can subscribe to an arbitrary set of publications and thus receive a subset of updates to the whole namespace, each publication defines a separate version vector space. In other words, the combination of (node, publication, update counter) is unique across the system. This means that a node knows it has received all updates for a publication when the version vector space is fully packed and has no holes.

To bootstrap the system, all nodes have a default subscription to the special root container “/” with a depth of 1. Thus whenever any node creates an object (or a container) in the root directory, the object is distributed to all other nodes in the system. However, because the root subscription is at depth 1, all containers within the root directory are themselves the root for new publications, so application state can be partitioned.

To subscribe to other publications, users create a symbolic link in a special /.subscriptions/ directory to point to the root container of a publication. This operation is detected by the Subscription Manager, which then sets up the appropriate subscription state. This design allows applications to manage their interest sets without the need for a custom programming interface.
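The subscription model above can be illustrated with a small sketch. This is not TierStore code: the `enclosing_publication` helper and path-based publication table are hypothetical simplifications (the real system identifies containers by guid and manages subscriptions through the /.subscriptions/ symlink convention), but the covering rule is the same: a path belongs to the deepest publication root that still covers it within that publication's depth, with "/" always subscribed at depth 1.

```python
# Hypothetical sketch of the (container, depth) publication model.
# A publication table maps a root directory to a depth; None means
# infinite depth. "/" is always present with depth 1 (bootstrap rule).

def enclosing_publication(path, publications):
    """Return the (root, depth) publication whose subtree covers `path`."""
    parts = [p for p in path.split("/") if p]
    best = None
    for root, depth in publications.items():
        root_parts = [p for p in root.split("/") if p]
        if parts[:len(root_parts)] != root_parts:
            continue  # this root is not an ancestor of the path
        levels_below = len(parts) - len(root_parts)
        if depth is not None and levels_below > depth:
            continue  # path lies deeper than this publication extends
        if best is None or len(root_parts) > best[2]:
            best = (root, depth, len(root_parts))  # prefer deepest root
    return None if best is None else (best[0], best[1])

pubs = {"/": 1, "/mail": None, "/web/site1": None}
print(enclosing_publication("/mail/inbox/msg1", pubs))       # ('/mail', None)
print(enclosing_publication("/web/site1/index.html", pubs))  # ('/web/site1', None)
print(enclosing_publication("/readme.txt", pubs))            # ('/', 1)
```

Note how a file created directly under "/" falls into the depth-1 root publication and is therefore replicated everywhere, while anything under /mail or /web/site1 is only replicated to nodes subscribed to those publications.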

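The completeness test on a publication's version vector space ("fully packed and has no holes") can be sketched as follows. This is an illustrative Python sketch; the class and method names are ours, not TierStore's:

```python
from collections import defaultdict

class PublicationState:
    """Tracks which update counters have been received, per originating
    node, for a single publication (a simplified stand-in for TierStore's
    internal state)."""

    def __init__(self):
        self.received = defaultdict(set)  # node id -> set of update counters

    def record(self, node, counter):
        # Updates may arrive duplicated or out of order; a set absorbs both.
        self.received[node].add(counter)

    def fully_packed(self):
        # Complete when every node's counters form the gapless range 1..max.
        return all(counters == set(range(1, max(counters) + 1))
                   for counters in self.received.values())

pub = PublicationState()
for node, counter in [("A", 1), ("A", 3), ("B", 1)]:
    pub.record(node, counter)
assert not pub.fully_packed()  # update (A,2) is missing: a hole
pub.record("A", 2)
assert pub.fully_packed()      # now gapless for both A and B
```

Because the counter space is per publication, a node subscribing to only a subset of the namespace can still decide, locally, whether it has everything for that publication.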
40 FAST '08: 6th USENIX Conference on File and Storage Technologies USENIX Association

4.8 Update Distribution

To deal with intermittent or long-delay links, the TierStore update protocol is biased heavily towards avoiding round trips. Thus unlike systems based on log exchange (e.g. Bayou, Ficus, or PRACTI), TierStore nodes proactively generate updates and send them to other nodes when local filesystem operations occur.

TierStore integrates with the DTN reference implementation [9] and uses the bundle protocol [28] for all inter-node messaging. The system is designed with minimal demands on the networking stack: simply that all updates for a publication eventually propagate to the subscribed nodes. In particular, TierStore can handle duplicate or out-of-order message arrivals using the versioning mechanisms described above.

This design allows TierStore to take advantage of the intermittency tolerance and multiple transport layer features of DTN. In contrast with systems based on log exchange, TierStore does not assume there is ever a low-latency bidirectional connection between nodes, so it can be deployed on a wide range of network technologies including sneakernet or broadcast links. Using DTN also naturally enables optimizations such as routing smaller MAP updates over low-latency, but possibly expensive links, while sending large CREATE updates over less expensive but long-latency links, or configuring different publications to use different DTN priorities.

However, for low-bandwidth environments, it is also important that updates be efficiently distributed throughout the network to avoid overwhelming low-capacity links. Despite some research efforts on the topic of multicast in DTNs [38], there currently exists no implementation of a robust multicast routing protocol for DTNs. Thus in our current implementation, TierStore nodes in a given deployment are configured by hand in a static multicast distribution tree, whereby each node (except the root) has a link to its parent node and to zero or more child nodes. Nodes are added or removed by editing configuration files and restarting the affected nodes. Given the small scale and simple topologies of our current deployments, this manual configuration has been sufficient thus far. However, we plan to investigate the topic of a general publish/subscribe network protocol suitable for DTNs in future work.

In this simple scheme, when an update is generated, TierStore forwards it to the DTN stack for transmission to the parent and to each child in the distribution tree. DTN queues the update in persistent storage, and ensures reliable delivery through the use of custody transfer and retransmissions. Arriving messages are re-forwarded to the other peers (not back to the sending node) so updates eventually reach all nodes in the system.

4.9 Views and Conflicts

Each mapping contains a view that identifies the TierStore node that created the mapping. During normal operation, the notion of views is hidden from the user; however, views are important when dealing with conflicts. A conflict occurs when operations are concurrently made at different nodes, resulting in incomparable logical version vectors. In TierStore's single-object coherence model, there are only two types of conflicts: a name conflict occurs when two different objects are mapped to the same location by different nodes, while a location conflict occurs when the same object is mapped to different locations by different nodes.

Recall that all mappings are tagged with their respective view identifiers, so a container may contain multiple mappings for the same name, but in different views. The job of the View Resolver (see Figure 1) is to present a coherent filesystem to the user, in which two files cannot appear in the same location, and a single file cannot appear in multiple locations. Hard links are an obvious exception to this latter case, in which the user deliberately maps a file in multiple locations, so the view resolver is careful to distinguish hard links from location conflicts.

The default policy to manage conflicts in TierStore appends each conflicting mapping name with .#X, where X is the identity of the node that generated the conflicting mapping. This approach retains both versions of the conflicted file for the user to access, similar to how CVS handles an update conflict. However, locally generated mappings retain their original name after view resolution and are not modified with the .#X suffix. This means that the filesystem structure may differ at different points in the network, yet also that nodes always "see" mappings that they have generated locally, regardless of any conflicting updates that may have occurred at other locations.

Although it is perhaps non-intuitive, we believe this to be an important decision that aids the portability of unmodified applications, since their local file modifications do not "disappear" if another node makes a conflicting update to the file or location. This also means that application state remains self-consistent even in the face of conflicts and, most importantly, is sufficient to handle conflicts for many applications. Still, conflicting mappings would persist in the system unless resolved by some user action. Resolution can be manual or automatic; we describe both in the following sections.

4.10 Manual Conflict Resolution

For unstructured data with indeterminate semantics (such as the case of general file sharing), conflicts can be manually resolved by users at any point in the network by using the standard filesystem interface to either remove or rename the conflicting mappings. Figure 4 shows an example of how a name conflict is caused, what each filesystem presents to the user at each step, and how the conflict is eventually resolved.

Updates (nodes A and B exchange C1/M1, C2/M2, and M3):
  C1: guid1, "A"                          C2: guid2, "B"
  M1: guid1, "/foo", view A, ver (A,1)    M2: guid2, "/foo", view B, ver (B,1)
  M3: guid2, "/bar", view A, ver (A,2)(B,1)

Step  Node A action          Node A FS view               Node B action     Node B FS view
1     write(/foo, "A")       /foo -> "A"                  write(/foo, "B")  /foo -> "B"
2     receive C2, M2         /foo -> "A", /foo.#B -> "B"  receive C1, M1    /foo -> "B", /foo.#A -> "A"
3     rename(/foo.#B, /bar)  /foo -> "A", /bar -> "B"     (none)            /foo -> "B", /foo.#A -> "A"
4     (none)                 /foo -> "A", /bar -> "B"     receive M3        /foo -> "A", /bar -> "B"

Figure 4: Update sequence demonstrating a name conflict and a user's resolution. Each row in the table shows the actions that occur at each node and the nodes' respective views of the filesystem. In step 1, nodes A and B make concurrent writes to the same file /foo, generating separate create and mapping updates (C1, M1, C2, and M2) and applying them locally. In step 2, the updates are exchanged, causing both nodes to display conflicting versions of the file (though in different ways). In step 3, node A resolves the conflict by renaming /foo.#B to /bar, which generates a new mapping (M3). Finally, in step 4, M3 is received at B and the conflict is resolved.

When using the filesystem interface, applications do not necessarily include all the context necessary to infer user intent. Therefore an important policy decision is whether operations should implicitly resolve conflicts or let them linger in the system by default. As in the example shown in Figure 4, once the name conflict occurs in step 2, if the user were to write some new contents to /foo, should the new file contents replace both conflicting mappings or just one of them?

The current policy in TierStore is to leave the conflicting mappings in the system until they are explicitly resolved by the user (e.g. by removing the conflicted name), as shown in the example. Although this policy means that conflicting mappings may persist indefinitely if not resolved, it is the most conservative policy and we believe the most intuitive as well, though it may not be appropriate for all environments or applications.

4.11 Automatic Conflict Resolution

Application writers can also configure a custom per-container view resolution routine that is triggered when the system detects a conflict in that container. The interface is a single function with the following signature:

resolve(local view, locations, names) → resolved

The operands are as follows: local view is the local node identity, locations is a list of the mappings that are in conflict with respect to location, and names is a list of mappings that are in conflict with respect to names. The function returns resolved, which is the list of non-conflicting mappings that should be visible to the user. The only requirements on the implementation of the resolve function are that it is deterministic based on its operands and that its output mappings have no conflicts.

In fact, the default view resolver implementation described above is implemented as a resolve function that appends the disambiguating suffix for visible filenames. In addition, the resolver described in Section 5.1 is another example of a custom view resolver that safely merges mail file status information encoded in the maildir filename. Finally, a built-in view resolver detects identical object contents with conflicting versions and automatically resolves them, rather than presenting them to the user as vacuous conflicts.

An important feature of the resolve function is that it creates no new updates; rather, it takes the updates that exist and presents a self-consistent file system to the user. This avoids problems in which multiple nodes independently resolve a conflict, yet the resolution updates themselves conflict [14]. Although a side effect of this design is that conflicts may persist in the system indefinitely, they are often eventually cleaned up since modifications to merged files will obsolete the conflicting updates.

4.12 Object Extensions

Another way to extend TierStore with application-specific support is the ability to register custom types for data objects and containers. The current implementation supports C++ object subclassing of the base object and container classes, whereby the default implementations of file and directory access functions can be overridden to provide alternative semantics.
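As a concrete illustration of the Section 4.11 interface, the default suffix-appending policy might be written as a resolve function along these lines (a Python sketch; the actual implementation is in C++ and its mapping type is not shown in the paper):

```python
from collections import namedtuple

# Hypothetical mapping record: a name bound to an object guid, tagged with
# the view (node identity) that generated it.
Mapping = namedtuple("Mapping", ["name", "view", "guid"])

def resolve(local_view, locations, names):
    """Return a conflict-free list of mappings to present to the user.

    Locally generated mappings keep their original names; remote conflicting
    mappings are disambiguated with a .#X suffix, where X is the generating
    node. Sorting keeps the output deterministic in the operands."""
    resolved = []
    for m in sorted(set(locations) | set(names)):
        if m.view == local_view:
            resolved.append(m)  # local mappings are never renamed
        else:
            resolved.append(m._replace(name=m.name + ".#" + m.view))
    return resolved

# Node A resolving the name conflict of Figure 4, step 2:
out = resolve("A", [], [Mapping("/foo", "A", "guid1"),
                        Mapping("/foo", "B", "guid2")])
assert [m.name for m in out] == ["/foo", "/foo.#B"]
```

Running the same function at node B yields /foo.#A and /foo instead, which matches how the two nodes display the conflict differently in Figure 4.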

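In spirit, the extension hooks resemble the following sketch (Python stand-ins for the paper's C++ base classes; all class names here are ours):

```python
class StoreObject:
    """Base data object with default byte-oriented access functions."""

    def __init__(self):
        self._data = b""

    def read(self):
        return self._data

    def write(self, data):
        self._data = data

class ReadOnlyObject(StoreObject):
    """A registered custom type that overrides the default write()
    to provide alternative semantics, here refusing local modification."""

    def write(self, data):
        raise PermissionError("read-only object at this replica")

obj = StoreObject()
obj.write(b"hello")
assert obj.read() == b"hello"

ro = ReadOnlyObject()
try:
    ro.write(b"x")
    modified = True
except PermissionError:
    modified = False
assert not modified
```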
For example, this extension could be used to implement a conflict-free, append-only "log object". In this case, the log object would in fact be a container, though it would present itself to the user as if it were a normal file. If a user appends a chunk of data to the log (i.e. opens the file, seeks to the end, writes the data, and closes the file), the custom type handlers would create a new object for the appended data chunk and add it to the log object container with a unique name. Reading from the log object would simply concatenate all chunks in the container using the partial order of the contained objects' version vectors, along with some deterministic tiebreaker. In this way multiple locations may concurrently append data to a file without worrying about conflicts, and the system would transparently merge updates into a coherent file.

4.13 Security

Although we have not focused on security features within TierStore itself, security guarantees can be effectively implemented at complementary layers.

Though TierStore nodes are distributed, the system is designed to operate within a single administrative scope, similar to how one would deploy an NFS or CIFS share. In particular, the system is not designed for untrusted, federated sharing in a peer-to-peer manner, but rather to be provisioned in a cooperative network of storage replicas for a particular application or set of applications. Therefore, we assume that configuration of network connections, definition of policies for access control, and provisioning of storage resources are handled via external mechanisms that are most appropriate for a given deployment. In our experience, most organizations that are candidates to use TierStore already follow this model for their system deployments.

For data security and privacy, TierStore supports the standard UNIX file access-control mechanisms for users and groups. For stronger authenticity or confidentiality guarantees, the system can of course store and replicate encrypted files, as file contents are not interpreted, except by an application-specific automatic conflict resolver that depends on the file contents.

At the network level, TierStore leverages the recent work in the DTN community on security protocols [31] to protect the routing infrastructure and to provide message security and confidentiality.

4.14 Metadata

Currently, our TierStore prototype handles metadata updates such as chown, chmod, or utimes by applying them only to the local repository. In most cases, the operations occur before updates are generated for an object, so the intended modifications are properly conveyed in the CREATE message for the given object. However, if a metadata update occurs long after an object was created, then the effects of the operation are not known throughout the network until another change is made to the file contents.

Because the applications we have used so far do not depend on propagation of metadata, this shortcoming has not been an issue in practice. However, we plan to add a new META update message to contain the modified metadata as well as a new metadata version vector in each object. A separate version vector space is preferable to allow metadata operations to proceed in parallel with mapping operations and to not trigger false conflicts. Conflicting metadata updates would be resolved by a deterministic policy (e.g. take the intersection of permission bits, the later modification time, etc.).

5 TierStore Applications

In this section we describe the initial set of applications we have adapted to use TierStore, showing how the simple filesystem interface and conflict model allows us to leverage existing implementations extensively.

5.1 E-mail Access

One of the original applications that motivated the development of TierStore was e-mail, as it is the most popular and fastest-growing application in developing regions. In prior work, we found that commonly used web-mail interfaces are inefficient for congested and intermittent networks [10]. These results, plus the desire to extend the reach of e-mail applications to places without a direct connection to the Internet, motivate the development of an improved mechanism for e-mail access.

It is important to distinguish between e-mail delivery and e-mail access. In the case of e-mail delivery, one simply has to route messages to the appropriate (single) destination endpoint, perhaps using storage within the network to handle temporary transmission failures. Existing protocols such as SMTP or a similar DTN-based variant are adequate for this task. For e-mail access, users need to receive and send messages, modify message state, organize mail into folders, and delete messages, all while potentially disconnected, and perhaps at different locations; existing access protocols like IMAP or POP require clients to make a TCP connection to a central mail server. Although this model works well for good-quality networks, in challenged environments users may not be able to get or send new mail if the network happens to be unavailable or is too expensive at the time when they access their data.

In the TierStore model, all e-mail state is stored in the filesystem and replicated to any nodes in the system where a user is likely to access their mail. An off-the-shelf IMAP server (e.g. courier [6]) runs at each of these endpoints and uses the shared TierStore filesystem to store users' mailboxes and folders. Each user's mail data is grouped into a separate publication, and via an administrative interface, users can instruct the TierStore daemon to subscribe to their publications.

We use the maildir [3] format for mailboxes, which was designed to provide safe mailbox access without needing file locks, even over NFS. In maildir, each message is a uniquely named independent file, so when a mailbox is replicated using TierStore, most operations are trivially conflict free. For example, a disconnected user may modify existing message state or move messages to other mailboxes while new messages are simultaneously arriving without conflict.

However, it is possible for conflicts to occur in the case of user mobility. For example, if a user accesses mail at one location and then moves to another location before all updates have fully propagated, then the message state flags (i.e. passed, replied, seen, draft, etc.) may be out of sync on the two systems. In maildir, these flags are encoded as characters appended to the message filename. Thus if one update sets a certain state, while another concurrently sets a different state, the TierStore system will detect a location conflict on the message object.

To best handle this case, we wrote a simple conflict resolver that computes the union of all the state flags for a message, and presents the unified name through the filesystem interface. In this way, the fact that there was an underlying conflict in the TierStore object hierarchy is never exposed to the application, and the state is safely resolved. Any subsequent state modifications would then subsume both conflicting mappings and clean up the underlying (yet invisible) conflict.

5.2 Content Distribution

TierStore is a natural platform to support content distribution. At the publisher node, an administrator can arbitrarily manipulate files in a shared repository, divided into publications by content type. Replicas would be configured with read-only access to the publication to ensure that the application is trivially conflict-free (since all modifications happen at one location). The distributed content can then be served by a standard web server or simply accessed directly through the filesystem.

As we discuss further in Section 6.2, using TierStore for content distribution is more efficient and easier to administer than traditional approaches such as rsync [33]. In particular, TierStore's support for multicast distribution provides an efficient delivery mechanism for many networks that would require ad-hoc scripting to achieve with point-to-point synchronization solutions. Also, the use of the DTN overlay network enables easier integration of transport technologies such as satellite broadcast [17] or sneakernet and opens up potential optimizations such as sending some content with a higher priority.

5.3 Offline Web Access

Although systems for offline web browsing have existed for some time, most operate under the assumption that the client node will have periodic direct Internet access, i.e. will be "online", to download content that can later be served when "offline". However, for poorly connected sites or those with no direct connection at all, TierStore can support a more efficient model, where selected web sites are crawled periodically at a well-connected location, and the cached content is then replicated.

Implementing this model in TierStore turned out to be quite simple. We configured the wwwoffle proxy [37] to use TierStore as the filesystem for its cache directories. By running web crawls at a well-connected site through the proxy, all downloaded objects are put in the wwwoffle data store, and TierStore replicates them to other nodes. Because wwwoffle uses files for internal state, if a remote user requests a URL that is not in cache, wwwoffle records the request in a file within TierStore. This request is eventually replicated to a well-connected node that will crawl the requested URL, again storing the results in the replicated data store.

We ran an early deployment of TierStore and wwwoffle to accelerate web access in the Community Information Center kiosks in rural Cambodia [5]. For this deployment, the goal was to enable accelerated web access to selected web sites, but still allow direct access to the rest of the Internet. Therefore, we configured the wwwoffle servers at remote nodes to always use the cached copy of the selected sites, but to never cache data for other sites, and at a well-connected node, we periodically crawled the selected sites. Since the sites changed much less frequently than they were viewed, the use of TierStore, even on a continuously connected (but slow) network link, was able to accelerate the access.

5.4 Data Collection

Data collection represents a general class of applications that TierStore can support well. The basic data flow model for these applications involves generating log records or collecting survey samples at poorly connected edge nodes and replicating these samples to a well-connected site.

Although at a fundamental level it may be sufficient to use a messaging interface such as e-mail, SMS, or DTN bundling for this application, the TierStore design offers a number of key advantages. In many cases, the local node wants or needs to have access to the data after it

           CREATE         READ           WRITE         GETDIR         STAT           RENAME
Local      1.72 (0.04)    16.75 (0.08)   1.61 (0.01)   7.39 (0.01)    3.00 (0.01)    27.00 (0.2)
FUSE       3.88 (0.1)     20.31 (0.08)   1.90 (0.8)    8.46 (0.01)    3.18 (0.005)   30.04 (0.07)
NFS        11.69 (0.09)   19.75 (0.06)   42.56 (0.6)   8.17 (0.01)    3.76 (0.01)    36.03 (0.03)
TierStore  7.13 (0.06)    21.54 (0.2)    2.75 (0.3)    15.38 (0.01)   3.19 (0.01)    38.39 (0.05)

Table 1: Microbenchmarks for various file system operations for local Ext3, loopback-mounted NFS, a passthrough FUSE layer, and TierStore. Runtime is in seconds averaged over five runs, with the standard error in parentheses.

has been collected, thus some form of local storage is necessary anyway. Also, there may be multiple destinations for the data; many situations exist in which field workers operate from a rural office that is then connected to a larger urban headquarters, and the pub/sub system of replication allows nodes at all these locations to register data interest in any number of sample sets.

Furthermore, certain data collection applications can benefit greatly from fine-grained control over the units of data replication. For example, consider a census or medical survey being conducted on portable devices such as PDAs or cell phones by a number of field workers. Although replicating all collected data samples to every device would likely overwhelm the limited storage resources on the devices, it would be easy to set up publications such that the list of which samples had been collected would be replicated to each device to avoid duplicates.

Finally, this application is trivially conflict free. Each device or user can be given a distinct directory for samples, and/or the files used for the samples themselves can be named uniquely in common directories.

5.5 Wiki Collaboration

Group collaboration applications such as online Wiki sites or portals generally involve a set of web scripts that manipulate page revisions and inter-page references in a back-end infrastructure. The subset of common wiki software that uses simple files (instead of SQL databases) is generally quite easy to adapt to TierStore.

For example, PmWiki [25] stores each Wiki page as an individual file in the configured wiki.d directory. The files each contain a custom revision format that records the history of updates to each file. By configuring the wiki.d directory to be inside of TierStore, multiple nodes can update the same shared site even when potentially disconnected.

Of course, simultaneous edits to the same wiki page at different locations can easily result in conflicts. In this case, it is actually safe to do nothing at all to resolve the conflicts, since at any location, the wiki would still be in a self-consistent state. However, users would no longer easily see each other's updates (since one of the conflicting versions would be renamed as described in Section 4.9), limiting the utility of the application.

Resolving these types of conflicts is also straightforward. PmWiki (like many wiki packages) contains built-in support for managing simultaneous edits to the same page by presenting a user with diff output and asking for confirmation before committing the changes. Thus the conflict resolver simply renames the conflicting files in such a way that the web scripts prompt the user to manually resolve the conflict at a later time.

6 Evaluation

In this section we present some initial evaluation results to demonstrate the viability of TierStore as a platform. First we run some microbenchmarks to demonstrate that the TierStore filesystem interface has competitive performance with traditional filesystems. Then we describe experiments where we show the efficacy of TierStore for content distribution on a simulation of a challenged network. Finally, we discuss ongoing deployments of TierStore in real-world scenarios.

6.1 Microbenchmarks

This set of experiments compares TierStore's filesystem interface with three other systems: Local is the Linux Ext3 file system; NFS is a loopback mount of an NFS server running in user mode; FUSE is a fusexmp instance that simply passes file system operations through the user space daemon to the local file system. All of the benchmarks were run on a 1.8 GHz Pentium 4 with 1 GB of memory and a 40 GB 7200 RPM EIDE disk, running Debian 4.0 and the 2.6.18 Linux kernel.

For each filesystem, we ran several benchmark tests: CREATE creates 10,000 sequentially named empty files. READ performs 10,000,000 16 kilobyte read() calls at random offsets of a one megabyte file. WRITE performs 10,000,000 16k write() calls to append to a file; the file was truncated to 0 bytes after every 1,000 writes. GETDIR issues 1,000 getdir() requests on a directory containing 800 files. STAT issues 1,000,000 stat calls to a single file. Finally, RENAME performs 10,000 rename() operations to change a single file back and forth between two filenames. Table 1 summarizes the results of our ex-

Figure 5: Network model for the emulab experiments. [Diagram: a root node connects over fiber links (100 Mb/s, 0 ms) to city nodes; cities connect over satellite links (128 kb/s, 300 ms) to towns and villages; villages connect local machines over modem links (56 kb/s, 10 ms).]

Figure 6: Total network traffic consumed when synchronizing educational content on an Emulab simulation of a challenged network in developing regions. As the network outage increases, the performance of TierStore relative to both end-to-end and hop-by-hop rsync improves. [Bar chart: traffic in MB (0-250) for Rsync e2e, Rsync hop, and TierStore at 0%, 10%, and 25% downtime, under both single and multiple subscriptions.]

periments. Run times are measured in seconds, averaged over five runs, with the standard error in parentheses.

The goal of these experiments is to show that existing applications, written with standard filesystem performance in mind, can be deployed on TierStore without worrying about performance barriers. These results support this goal, as in many cases the TierStore system performance is as good as that of traditional systems. The cases where the TierStore performance is worse are due to some inefficiencies in how we interact with FUSE and the lack of optimizations on the backend database.

6.2 Multi-node Distribution

In another set of experiments, we used the Emulab [35] environment to evaluate the TierStore replication protocol on a challenged network similar to those found in developing regions.

To simulate this target environment, we set up a network topology consisting of a single root node, with a well-connected "fiber" link (100 Mbps, 0 ms delay) to two nodes in other "cities". We then connect each of these city nodes over a "satellite" link (128 kbps, 300 ms delay) to an additional node in a "village". In turn, each village connects to five local computers over "dialup" links (56 kbps, 10 ms delay). Figure 5 shows the network model for this experiment.

To model the fact that real-world network links are both bandwidth-constrained and intermittent, we ran a periodic process to randomly add and remove firewall rules that block transfer traffic on the simulated dialup links. Specifically, the process ran through each link once per second, comparing a random variable to a threshold parameter chosen to achieve the desired downtime percentage, and turning on the firewall (blocking the link) if the threshold was met. It then re-opened a blocked link after waiting 20 seconds to ensure that all transport connections closed.

We ran experiments to evaluate TierStore's performance for electronic distribution of educational content, comparing TierStore to rsync [33]. We then measured the time and bandwidth required to transfer 7 MB of multimedia data from the root node to the ten edge nodes.

We ran two sets of experiments, one in which all data is replicated to all nodes (single subscription), and another in which portions of the data are distributed to different subsets of the edge nodes (multiple subscriptions). The results from our experiments are shown in Figure 6.

We compared TierStore to rsync in two configurations. The end-to-end model (rsync e2e) is the typical use case for rsync, in which separate rsync processes are run from the root node to each of the edge nodes until all the data is transferred. As can be seen from the graphs, however, this model has quite poor performance, as a large amount of duplicate data must be transferred over the constrained links, resulting in more total traffic and a corresponding increase in the amount of time to transfer (not shown). As a result, TierStore uses less than half of the bandwidth of rsync in all cases. This result, although unsurprising, demonstrates the value of the multicast-like distribution model of TierStore in avoiding unnecessary traffic over a constrained network link.

To offer a fairer comparison, we also ran rsync in a hop-by-hop mode, in which each node distributed content to its downstream neighbor. In this case, rsync performs much better, as there is less redundant transfer of data over the constrained link. Still, TierStore adapts better to intermittent network conditions as the outage percentage increases. This is primarily because rsync has no easy way to detect when the distribution is complete, so it must repeatedly exchange state even if there is no new data to transmit. This distinction demonstrates the benefits of the push-based distribution model of TierStore as compared to state exchange when running over bandwidth-constrained or intermittent networks.

Finally, although this latter mode of rsync essentially duplicates the multicast-like distribution model of TierStore, rsync is significantly more complicated to administer. In TierStore, edge nodes simply register their interest for portions of the content, and the multicast replication occurs transparently, with the DTN stack taking care of re-starting transport connections when they

break. In contrast, multicast distribution with rsync required end-to-end application-specific synchronization processes, configured with aggressive retry loops at each hop in the network, making sure to avoid re-distributing partially transferred files multiple times, which was both tedious and error prone.

6.3 Ongoing Deployments

We are currently working on several TierStore deployments in developing countries. One such project is supporting community radio stations in Guinea-Bissau, a small West African country characterized by a large number of islands and poor infrastructure. For many of the islands' residents, the main source of information comes from the small radio stations that produce and broadcast local content. TierStore is being used to distribute recordings from these stations throughout the country to help bridge the communication barriers among islands. Because of the poor infrastructure, connecting these stations is challenging, requiring solutions like intermittent long-distance WiFi links or sneakernet approaches like carrying USB drives on small boats, both of which can be used transparently by the DTN transport layer.

The project is using an existing content management system to manage the radio programs over a web interface. This system proved to be straightforward to integrate with TierStore, again because it was already designed to use the filesystem to store application state, and replicating this state was an easy way to distribute the data. We are encouraged by early successes with the integration and are currently in the process of preparing a deployment for some time in the next several months.

7 Conclusions

In this paper we described TierStore, a distributed filesystem for challenged networks in developing regions. Our approach stems from three core beliefs: the first is that dealing with intermittent connectivity is a necessary part of deploying robust applications in developing regions, thus network solutions like DTN are critical. Second, a replicated filesystem is a natural interface for applications and can greatly reduce the burden of adapting applications to the intermittent environment. Finally, a focus on conflict avoidance and a single-object coherence model is both sufficient for many useful applications and also eases the challenge of programming. Our initial results are encouraging, and we hope to gain additional insights through deployment experiences.

Acknowledgements

Thanks to the anonymous reviewers and to our shepherd, Margo Seltzer, for providing insightful feedback on earlier versions of this paper. Thanks also to Pauline Tweedie, the Asia Foundation, Samnang Yuth Vireak, Bunhoen Tan, and the other operators and staff of the Cambodia CIC project for providing us with access to their networks and help with our prototype deployment of TierStore.

This material is based upon work supported by the National Science Foundation under Grant Number 0326582 and by the Defense Advanced Research Projects Agency under Grant Number 1275918.

Availability

TierStore is freely available open-source software. Please contact the authors to obtain a copy.

References

[1] Vishwanath Anantraman, Tarjei Mikkelsen, Reshma Khilnani, Vikram S. Kumar, Rao Machiraju, Alex Pentland, and Lucila Ohno-Machado. Handheld computers for rural healthcare, experiences in a large scale implementation. In Proc. of the 2nd Development by Design Workshop (DYD02), 2002.

[2] Nalini Belaramani, Mike Dahlin, Lei Gao, Amol Nayate, Arun Venkataramani, Praveen Yalagandula, and Jiandan Zheng. PRACTI replication. In Proc. of the 3rd ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI), San Jose, CA, May 2006.

[3] D. J. Bernstein. Using maildir format. http://cr.yp.to/proto/maildir.html.

[4] Eric Brewer, Michael Demmer, Bowei Du, Melissa Ho, Matthew Kam, Sergiu Nedevschi, Joyojeet Pal, Rabin Patra, Sonesh Surana, and Kevin Fall. The case for technology in developing regions. IEEE Computer, 38(6):25–38, June 2005.

[5] Cambodia Community Information Centers. http://www.cambodiacic.info.

[6] Courier Mail Server. http://www.courier-mta.org.

[7] Don de Savigny, Harun Kasale, Conrad Mbuya, and Graham Reid. In Focus: Fixing Health Systems. International Development Research Centre, 2004.

[8] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's Highly Available Key-value Store. In Proc. of the 21st ACM Symposium on Operating Systems Principles (SOSP), Stevenson, WA, 2007.

[9] Delay Tolerant Networking Reference Implementation. http://www.dtnrg.org/wiki/Code.

[10] Bowei Du, Michael Demmer, and Eric Brewer. Analysis of WWW Traffic in Cambodia and Ghana. In Proc. of the 15th International Conference on World Wide Web (WWW), 2006.

[11] Kevin Fall. A Delay-Tolerant Network Architecture for Challenged Internets. In Proc. of the ACM Symposium on Communications Architectures & Protocols (SIGCOMM), 2003.

[12] Armando Fox and Eric Brewer. Harvest, yield and scalable tolerant systems. In Proc. of the 7th Workshop on Hot Topics in Operating Systems (HotOS), 1999.

[13] Fuse: Filesystem in Userspace. http://fuse.sf.net.

[14] Michael B. Greenwald, Sanjeev Khanna, Keshav Kunal, Benjamin C. Pierce, and Alan Schmitt. Agreeing to Agree: Conflict Resolution for Optimistically Replicated Data. In Proc. of the International Symposium on Distributed Computing (DISC), 2006.

[15] Richard G. Guy, Peter L. Reiher, David Ratner, Michial Gunter, Wilkie Ma, and Gerald J. Popek. Rumor: Mobile Data Access Through Optimistic Peer-to-Peer Replication. In Proc. of the ACM International Conference on Conceptual Modeling (ER) Workshop on Mobile Data Access, pages 254–265, 1998.

[16] James J. Kistler and M. Satyanarayanan. Disconnected Operation in the Coda File System. In Proc. of the 13th ACM Symposium on Operating Systems Principles (SOSP), 1991.

[17] Dirk Kutscher, Janico Greifenberg, and Kevin Loos. Scalable DTN Distribution over Uni-Directional Links. In Proc. of the SIGCOMM Workshop on Networked Systems in Developing Regions (NSDR), August 2007.

[18] Athicha Muthitacharoen, Benjie Chen, and David Mazieres. A Low-Bandwidth Network File System. In Proc. of the 18th ACM Symposium on Operating Systems Principles (SOSP), 2001.

[19] Sergiu Nedevschi, Joyojeet Pal, Rabin Patra, and Eric Brewer. A Multi-disciplinary Approach to Studying Village Internet Kiosk Initiatives: The case of Akshaya. In Proc. of Policy Options and Models for Bridging Digital Divides, March 2005.

[20] OfflineIMAP. http://software.complete.org/offlineimap.

[21] T. W. Page, R. G. Guy, J. S. Heidemann, D. Ratner, P. Reiher, A. Goel, G. H. Kuenning, and G. J. Popek. Perspectives on optimistically replicated peer-to-peer filing. Software—Practice and Experience, 28(2):155–180, February 1998.

[22] Alex (Sandy) Pentland, Richard Fletcher, and Amir Hasson. DakNet: Rethinking Connectivity in Developing Nations. IEEE Computer, 37(1):78–83, January 2004.

[23] Karin Petersen, Mike J. Spreitzer, Douglas B. Terry, Marvin M. Theimer, and Alan J. Demers. Flexible Update Propagation for Weakly Consistent Replication. In Proc. of the 16th ACM Symposium on Operating Systems Principles (SOSP), 1997.

[24] Benjamin C. Pierce and Jerome Vouillon. What's in Unison? A Formal Specification and Reference Implementation of a File Synchronizer. Technical Report MS-CIS-03-36, Univ. of Pennsylvania, 2004.

[25] PmWiki. http://www.pmwiki.org/.

[26] Sean Rhea, Patrick Eaton, Dennis Geels, Hakim Weatherspoon, Ben Zhao, and John Kubiatowicz. Pond: the OceanStore Prototype. In Proc. of the 2nd USENIX Conference on File and Storage Technologies (FAST), March 2003.

[27] Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. Design and Implementation of the Sun Network Filesystem. In Proc. of the USENIX Summer Technical Conference, Portland, OR, 1985.

[28] Keith Scott and Scott Burleigh. RFC 5050: Bundle Protocol Specification, 2007.

[29] Jing Su, James Scott, Pan Hui, Eben Upton, Meng How Lim, Christophe Diot, Jon Crowcroft, Ashvin Goel, and Eyal de Lara. Haggle: Clean-slate Networking for Mobile Devices. Technical Report UCAM-CL-TR-680, University of Cambridge, Computer Laboratory, January 2007.

[30] Lakshminarayanan Subramanian, Sonesh Surana, Rabin Patra, Sergiu Nedevschi, Melissa Ho, Eric Brewer, and Anmol Sheth. Rethinking Wireless in the Developing World. In Proc. of the 5th Workshop on Hot Topics in Networks (HotNets), November 2006.

[31] Susan Symington, Stephen Farrell, and Howard Weiss. Bundle Security Protocol Specification. Internet Draft draft-irtf-dtnrg-bundle-security-04.txt, September 2007. Work in Progress.

[32] Douglas B. Terry, Marvin M. Theimer, Karin Petersen, Alan J. Demers, Mike J. Spreitzer, and Carl H. Hauser. Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System. In Proc. of the 15th ACM Symposium on Operating Systems Principles (SOSP), 1995.

[33] A. Tridgell and P. MacKerras. The rsync algorithm. Technical Report TR-CS-96-05, Australian National Univ., June 1996.

[34] Voxiva. http://www.voxiva.com/.

[35] Brian White, Jay Lepreau, Leigh Stoller, Robert Ricci, Shashi Guruprasad, Mac Newbold, Mike Hibler, Chad Barb, and Abhijeet Joglekar. An Integrated Experimental Environment for Distributed Systems and Networks. In Proc. of the 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI), December 2002.

[36] Wizzy Digital Courier. http://www.wizzy.org.za/.

[37] WWWOFFLE: World Wide Web Offline Explorer. http://www.gedanken.demon.co.uk/wwwoffle/.

[38] Wenrui Zhao, Mostafa Ammar, and Ellen Zegura. Multicasting in Delay Tolerant Networks: Semantic Models and Routing Algorithms. In Proc. of the ACM SIGCOMM Workshop on Delay-Tolerant Networking (WDTN), 2005.

All Internet URLs in citations are valid as of January 2008.
