USENIX Association

Proceedings of the FREENIX Track: 2003 USENIX Annual Technical Conference

San Antonio, Texas, USA June 9-14, 2003

THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

© 2003 by The USENIX Association. All Rights Reserved. For more information about the USENIX Association: Phone: 1 510 528 8649; FAX: 1 510 548 5738; Email: [email protected]; WWW: http://www.usenix.org. Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.

U-P2P: A Peer-to-Peer Framework for Universal Resource Sharing and Discovery

Neal Arthorne, Babak Esfandiari, Aloke Mukherjee
Department of Systems and Computer Engineering
Carleton University
Ottawa, Ontario, Canada
[email protected] [email protected] [email protected]

Abstract

We present U-P2P, an open source framework for developing, deploying and discovering file-sharing communities. We address the problem of search in peer-to-peer file sharing by allowing the end user to add metadata to shared documents. Each file-sharing community allows the sharing of a particular structured document type. Communities are themselves modeled as structured documents, thus enabling their sharing and discovery just like any other document. The creator of a particular community specifies, among other properties, the document type that it shares and the deployment model. U-P2P's extensible architecture allows developers to create new properties or extend existing ones. For example, developers can provide new deployment models or custom privacy and authentication features. U-P2P makes use of other open source projects such as Jakarta Tomcat and eXist, an XML database system.

1 Introduction

The current success of peer-to-peer (P2P) file sharing applications has highlighted the benefits of resource distribution and redundancy. However, to truly exploit such advantages, a few roadblocks remain to be cleared. In particular, search and discovery of resources are still quite difficult. Most known approaches rely on simple schemes such as a search for the resource name or type. File sharing communities are most efficient when the name of the file carries most if not all of the needed information. As a result, most file swapping communities are restricted to swapping music and video files. Even when the exchange of non-music files is possible, the difficulty of finding such files has been a sufficient deterrent. Lack of metadata is the main problem.

Another roadblock to more general applications of P2P has been the difficulty of creating communities for specific purposes. This arises partially from the difficulty of defining custom metadata about different types of files. Another important problem is the current fractured state of peer-to-peer communities. Again, the lack of metadata about communities makes it difficult to know what there is to look for in the first place.

We propose a peer-to-peer framework called Universal Peer-to-Peer (U-P2P) that simplifies the sharing of custom metadata formats as well as the creation, configuration and discovery of peer-to-peer communities. In U-P2P, each community is described in part by the metadata of the files it exchanges. The format of a community's metadata is specified using the XML Schema language, allowing new communities centered on that file type to be created in any text editor. U-P2P allows these custom resources to be created and shared as in traditional file-sharing services.

By logical extension, the description of the community itself (the metadata format of its files, the protocol used for search, etc.) is encapsulated in an XML file. In U-P2P, creating, sharing or discovering a community follows the same principles as creating, sharing or discovering a file within that community. As a result, file-swapping communities such as Napster or Kazaa can be seen as instances of U-P2P devoted to sharing a few specific file types and utilizing a given P2P protocol.

2 Related Work

Our focus is on the discovery problem in peer-to-peer systems. In Napster [1], the only files that could be shared on the network were MP3 audio files. Search was based on filenames and relied on users encoding the artist and title of each song in the MP3's title. Although metadata such as encoding rate could be used to sort the results, there was no way to search on these metadata or define other parameters. On a Gnutella [2] network, any type of file can be shared; however, there is no explicit metadata handling. Search strings are passed around without processing between peers and their interpretation is left to the peer. Each peer must implement its own search algorithm using the search string

as input. Most Gnutella implementations, like Napster, simply return filenames that contain the search string. Gnutella does permit the design of overlay protocols to encode and decode metadata from search strings. Some Gnutella developers have proposed using this approach for richer metadata searches [3, 4]. Schemas are defined for common file types: for example an audio file might be defined to have properties such as artist, title, bit rate, album, etc. The schema defines a structured format for searching MP3 metadata that is sent as a search string to other Gnutella nodes. Responding clients use the query to search local files annotated using the schema and return the results using the same structured format. In practice though, all members of a given community must be able to speak a common language in order to communicate.

Other P2P systems such as FastTrack [5], Opencola [6] or Bitzi [7] propose variations on that idea, but they are still limited to a number of predefined schemas. The latter two have the capability to extract metadata information from a file given the file format, but this is only possible for certain formats.

Clearly, there is a need for shared ontologies if we want to allow search for any type of file. In the Internet world, the XML Schema format [8] is the current choice to represent ontologies, replacing Document Type Definition (DTD) [9], the schema language defined in the original XML specification. XML Schema supports the creation of custom and complex data types for XML tags, which is essential to describe aggregated or composite resources. For richer semantic descriptions, for example to allow software agents to perform search instead of humans, there can be a need to describe relationships between resources. RDF [10] and more generally the Semantic Web [11] effort address that need. The Edutella project [12] is a technology for distributed learning that uses a P2P network for sharing RDF-formatted metadata. However, in Edutella metadata is the resource, not the means to describe one.

A P2P system using a semantic layering approach is not without its drawbacks. First and foremost is getting users of the system to supply metadata for their resources. The addition of metadata requires user-friendly tools for authoring RDF and schemas and a simplified approach for users who are not familiar with XML languages.

What we propose in U-P2P starts, as in Edutella, with the sharing and discovery of metadata, described using XML Schema. Once metadata is discovered it is used to instantiate a particular resource, which is in turn shared. The next section gives a high-level description of the principles behind U-P2P as well as its design.

3 Background: XML Schema

U-P2P relies heavily on XML Schema and the following sections include some examples of XML Schema and a brief description of the structure of XML Schema documents.

XML Schema is a Recommendation [8] by the World Wide Web Consortium for a schema language for XML. It constitutes a set of markup tags, themselves written in XML, that are used to describe both the structure of an XML document and the datatypes of individual elements and attributes. XML Schema uses the XML Namespace Recommendation that allows authors to prefix their own tags with a specific namespace. It also allows importing types from other schemas both in the current namespace and from other namespaces.

XML Schema can be used to validate XML documents using a schema processor. XML documents themselves can link to their schema using a schemaLocation attribute. However, this is only a suggested method for specifying a schema for the document, and documents may use other types of schemas such as DTDs or not require validation at all.

In XML Schema, there are four basic units that can be used to describe parts of an XML document: elements, attributes, simpleTypes and complexTypes.

Elements - Elements are the tags used in XML to label a piece of data. They have a datatype associated with them that is either defined at the same time as the element (anonymously) or is a complexType or simpleType. Elements can contain subelements, attributes and can also be part of an XML Namespace. The structure of an element can be controlled in XML Schema using sequences, choices and the same restrictions used to derive simple and complex types. For example an author can specify that an element called 'name' should consist of an exact sequence of the two elements 'firstname' and 'lastname'.

Attributes - Attributes are properties that belong to a parent element. They can only use simpleTypes and they cannot contain any elements or attributes. Attributes can be optional or mandatory within an element.

Simple Types - These types are either built-in data types specified in the XML Schema Recommendation or Simple Types derived from the built-in or existing types (XML Schema defines some derived types such as date and time). The Recommendation includes many useful data types such as string and decimal, that can easily be manipulated into more specialized Simple Types.

Complex Types - These types are a sequence or choice between a set of elements and/or attributes. They are reusable (when defined on a global level) and can be used in any part of the current schema. Once a Complex Type is defined an element can use it by specifying the

type attribute equal to the name of the Complex Type. An example of a Complex Type follows:

[Listing not preserved: Complex Type definition]

An example conforming to this schema:

[Listing markup not preserved; surviving element values: 1125 Colonel By Drive, Ottawa, K1S 5B6, Ontario, Canada]

Having defined the above example, the type can then be used in an element anywhere in the schema or even [...]

XML Schema is powerful enough to describe any complex XML structures and the support for data types means the contents of shared documents are validated using the lexical representation of the data. When sharing large volumes of documents, it is useful to have full validation at the entry of the system as this will prevent the spread of unstructured data that can't be properly searched and indexed.

4 U-P2P Concepts

U-P2P provides four fundamental services: search, create, browse local and view. Each of these services is provided in the context of a community. We can imagine the existence of a "stamps" community that trades pictures and descriptions of stamps from around the world. On entering the stamp community the U-P2P Search function offers fields to search for stamps of a given year, and/or from a given country. Similarly, the Create function prompts you to upload a picture of the stamp, Browse Local shows the stamp objects that you have already downloaded and View displays a picture as well as the attributes for one of your downloaded stamps.

In traditional P2P applications this functionality would require downloading a client that knows about the format of a stamp object and contains customized search, create and view screens for such an object. In U-P2P, we use the power of metadata to simplify this task. Consider the following XML schema describing a stamp object:

[Listing not preserved: XML schema for a stamp object]

[...] of a stamp object and their data types. It is possible to generate a form from this specification using XML Stylesheet Language Transformations (XSLT) [13]. Here is an excerpt from a stylesheet that generates a search form from the above XML schema:

    <xsl:template name="SchemaTemplate"
                  match="*[local-name()='schema']">
      ...
      Search for a Resource
      ...
      Enter keywords in any of the fields below to perform a search.
      ...
      Property  Value
      ...

Similarly, stylesheets can be produced that render Create forms or display a stamp object. With nothing more than a few XML documents (a schema and stylesheets for creating, searching and displaying), it is possible to define a whole new P2P file-sharing application! In fact, U-P2P provides default stylesheets for handling display and forms for resources made up of common types, making the transformation processes transparent to the novice user. Figure 1 shows the relationships between the U-P2P functions, and how they are accessed through XSL transformations.

So how can a stamp community be found or shared in the first place? The idea in U-P2P is to see a file-sharing community as just another type of resource. This is analogous to the idea of a class in object-oriented programming which specifies the structure of objects. In pure object-oriented languages such as Smalltalk, a class is merely another type of object whose structure is specified by a metaclass. Traversing the analogy in the opposite direction, a specific U-P2P community can be seen as a class instantiated by a more general metaclass: a

Community-sharing community (in short: a community community) shares Community objects:

• an object is an instance of its class, which is an instance of its metaclass
• mp3 belongs to mp3 community, which belongs to community community

[Figure 1: Generation of resource-specific displays. Diagram: a resource instantiates the resource XML Schema; XSL stylesheets yield the Create form, the Search form and the resource View.]

In U-P2P the problem of discovering the existence of a community is thus reduced to the problem of finding an object. This provides a standard way to discover the existence of resource-sharing communities.

To facilitate this, U-P2P comes packaged with one "bootstrap" schema that can be used to search for and, more importantly, create communities:

[Listing not preserved: bootstrap community schema]

As can be seen, a community can have many attributes: keywords, deployment protocol, security... This means that a community can be created by choosing specific values for such attributes (e.g. keywords: stamps, Canadian; deployment: Napster-style). The search for a community is made in similar fashion, by filling out a similar form. As of now however, we only provide one type of deployment protocol, "Napster-style", in which the peer that creates the community also acts as a broker. Other deployment models, such as "Gnutella-style" (no broker required) and "centralized repository" (a client-server option with no local copies of files) are currently being developed. This means that searching for a document could either be completely decentralized, or that on the other extreme full persistence of files could be assured by a central storage of files. Deployment decisions are thus left entirely up to the creator of the given community. Also, we have not yet explored various security or privacy schemes. Such possibilities are discussed later in this paper, in the section on design. The modular aspect of our design should hopefully allow the open source development community to provide support for many more of these attributes.

The schema combined with default stylesheets as described above allows U-P2P to become an engine for creating and searching for communities which trade all sorts of different types of files. A stamp collector using U-P2P for the first time will go to the community search form and type "stamp" into the keyword field. Upon finding a community of collectors, he or she might [...] of entering the community. Once in the community the collector can perform all the actions one would expect of a file-sharing application devoted specifically to sharing [...] search form and finally the root community view, the highest level of abstraction in U-P2P.

Like other file-sharing services, U-P2P consists of a client and a file server running on a user's computer. The client part of U-P2P is implemented using Java Server Pages (JSPs) [15] running on a local web server. The prototype uses Jakarta Tomcat [16], but any server capable of serving JSPs may be used. The user connects to the U-P2P network by pointing their browser at the address of the local web server, typically localhost:8080.
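Section 3 argues for validating documents as they enter the system. The following is a minimal Java sketch of such a check, using the standard javax.xml.validation API of the JDK. The stamp schema, its element names (country, year) and the class name StampValidation are assumptions made for illustration only; they are not the schema or code shipped with U-P2P.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class StampValidation {
    // Hypothetical stamp schema: element names and types are assumptions.
    static final String STAMP_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='stamp'><xs:complexType><xs:sequence>"
      + "<xs:element name='country' type='xs:string'/>"
      + "<xs:element name='year' type='xs:gYear'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /**
     * Returns true if the XML document conforms to the given XML Schema.
     * A malformed schema or document also yields false in this sketch.
     */
    static boolean isValid(String xsd, String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String good = "<stamp><country>Canada</country><year>1967</year></stamp>";
        String bad = "<stamp><country>Canada</country><year>sixty-seven</year></stamp>";
        System.out.println(isValid(STAMP_XSD, good));
        System.out.println(isValid(STAMP_XSD, bad));
    }
}
```

Rejecting the second document at upload time is what keeps untyped values such as a spelled-out year out of the searchable index.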

Figure 2: The "stamps" community

The web server acts as GUI, and dispatches search and create requests to the other peers. In the current prototype, U-P2P follows the Napster model. This means that there is also a central server that acts as a database for information about all shared objects in the system. As with Napster, the information about the location of files is stored centrally but file transfers are conducted between peers. We are considering other possible peer configurations, such as the Gnutella distributed model and a hybrid one like FastTrack.

[Figure 3: The U-P2P Architecture. Components shown: Web browser, Jakarta Tomcat (Java Servlets and JSPs), WebAdapter, Repository (XMLdb), PeerNetworkAdapter, P2P network.]

5.1 Architectural Model

U-P2P consists of three major components shown in Figure 3: the WebAdapter, PeerNetworkAdapter, and the Repository. These components form the core of U-P2P and provide all the services needed to share and discover resources on a Peer-to-Peer network.

WebAdapter - Glues the components together and provides a single point of access for the user interface. This component currently supports a browser-based GUI, but could be replaced for alternate GUIs.

Repository - Stores all shared XML resources in a persistent XML database (using the XML:DB Database API [17]) and provides local search capabilities to the WebAdapter and the PeerNetworkAdapter. This allows for distributed P2P topologies: each node must be able to execute searches against its own set of shared resources.

PeerNetworkAdapter - Provides an interface to the underlying Peer-to-Peer network and is responsible for servicing search requests, publishing resources and downloading resources from the network.

The above components are modeled as Java interfaces, with their implementations as DefaultWebAdapter, DefaultRepository and GenericPeerAdapter respectively. Additional classes include:

FileMapper - Maps resource IDs to real files on the local file system. When a file is 'uploaded' it is assigned an ID and mapped without modifying the file. The FileMapper is persistent in case of shutdown of the U-P2P client and file mappings are restored on startup.

FileMapEntry - Holds a reference to a resource file and all its attachments, pulled from the resource in the upload process. Attachment names must be unique within a resource. Resource IDs are generated from the content of the XML file using an MD5 hashing function. When hashing, a special ResourceProcessor is used that omits any attachment links within the XML. This allows the hash to stay consistent when the links are changed by another peer upon download of the resource.

BasePeerNetworkAdapter - A skeleton class that holds a reference to a DefaultRepository and implements the accessor for the repository. The GenericPeerAdapter is the generic P2P implementation included with U-P2P, which follows a Napster-type model. Any developer wishing to provide an alternative peer-to-peer deployment, such as a fully distributed one, or one that would plug into an existing network, would have to provide a different adapter.

DatabaseAdapter - Performs the dirty details needed

to get an XML database up and running and to configure the port that it runs on. The current implementation uses eXist 0.8, an open source XML database [18]. If a switch were made to a different XMLdb implementation, this class would be sub-classed or replaced.

The class diagram in Figure 4 illustrates the relationships between these classes.

[Figure 5: U-P2P Servlets. Diagram: the Create activity uses create.jsp, CreateServlet and UploadServlet; View uses view.jsp; Search uses search.jsp, SearchServlet and displayResults.jsp; Download uses download.jsp; external download requests are served by DownloadServlet; DatabaseViewer is a diagnostic tool. The servlets communicate with the WebAdapter.]

The web-based user interface requires that dynamic pages be served up to the user for such activities as Search, Create and View. The general flow of events that occur in one of these activities involves the user accessing a JSP, the JSP submitting a form to a Servlet and the Servlet talking to the WebAdapter and then returning through a JSP.

Figure 5 shows the pages and Servlets involved in each activity as well as the two extra Servlets needed for diagnostic reasons and for servicing download requests.

5.2 Security in U-P2P

In peer-to-peer systems, data integrity, authentication and authorization are concerns that are shared with other network communication systems. In U-P2P we are concerned with a layering of meta-data that is used on top of an existing network that may or may not be secure. For this reason, the bulk of security measures are left up to the network adapter used to communicate with the underlying network. Each peer network has its own requirements for security. Thus it would not be suitable for U-P2P to impose a minimum level of security for all networks using the U-P2P layering: this would restrict the ability to join public, non-secure networks such as Gnutella or Freenet. Instead, it is up to the founder of a community to decide which network adapter to use: the level of security provided by the adapter will then be used in all network communication.

It is reasonable to assume that a set of standard adapters could be made available alongside a secured version of each adapter. The currently available centralized peer-to-peer adapter could, for example, communicate through Secure Sockets Layer (SSL) [19] channels and wrap all XML resources with an XML Signature [20]. This would ensure that the content of the resources has not been tampered with while in transit or when shared by another user. The current Tomcat 4 platform used by U-P2P has full provisions for SSL communication, but the current release of U-P2P is not secured. U-P2P also uses the Java Servlet standard that provides role-based security suitable for deployment and integration with existing infrastructure.

It should be noted that the current implementation of U-P2P uses MD5 sums to generate a unique ID when a resource is first uploaded to the network. This ID is not intended for security purposes, but as a simple means to check if multiple users are sharing the same resource. In a future release of U-P2P, the core of U-P2P could make use of the MD5 sum and XML Signatures for integrity and authentication of shared resources, with the security of communications remaining in the network adapter.

6 Case Studies

Two separate case studies have been developed to study the usefulness and the performance aspects of U-P2P. The first case study attempts to solve the known problem of searching for design patterns. However, as the number of design patterns that are properly documented and formatted is not sufficient to observe the behavior of U-P2P under high load, another series of tests was run using the Genome database.

6.1 Pattern Repository

The Carleton Pattern Repository [21] was started in 1999. It served as a repository for software design patterns and provided extensive search capabilities over a small list of patterns. The patterns were represented in XML using a DTD designed especially for the repository project [22]. The original pattern repository project included plans to expand the repository to a distributed model.

The distributed model proposed was for each author group to have a repository server with a fixed list of the other servers in the network. The servers would form a highly distributed mesh and send out their searches to all other servers. This model was not implemented and evidently, no one else has pursued the idea.

Using the DTD as a basis, we have developed an XML Schema for representing design patterns [23]. This is used as the basis of a file-sharing community for design patterns. In addition to the schema a custom stylesheet is

[Figure 4: U-P2P core classes. Class diagram showing WebAdapter, PeerNetworkAdapter, BasePeerNetworkAdapter, Repository, DefaultWebAdapter, GenericPeerAdapter, FileMapper, DefaultRepository, FileMapEntry and DatabaseAdapter.]

required to render this complex object since the default stylesheet is tailored to simpler formats. Another design problem is deciding which parts of the design pattern should be indexed. For simplicity, our current implementation of U-P2P uploads the entire XML pattern file to make it fully searchable.

To our knowledge, prior to our work there has been no way to share design patterns in a peer-to-peer fashion that incorporates meta-data search. When fully implemented this U-P2P based system will expand the benefits of peer-to-peer file-sharing to this area. Such a system would allow computer scientists and students to publish a rich collection of patterns into an underlying peer-to-peer network, search them using rich queries and replicate popular patterns to increase their accessibility. The community-discovery aspect could also be used to access sub-communities devoted to different classes of design patterns or based on different underlying networks.

6.2 Drosophila Genome Project

A performance evaluation was made using a client and server running on a single computer. The performance characteristics for creating objects and searching were measured. Running on the same node removes the effect of network latency and emphasizes the delays incurred by the software and the XML database. There were two major challenges to measuring U-P2P's performance characteristics. The first challenge was having multiple communities, each with a large collection of objects. The second was automating the process of publishing and searching for objects.

Creating a community and populating it with files manually was too time-consuming so we sought out archives containing large numbers of XML objects. One example of such a repository can be found at the Berkeley Drosophila Genome Project (BDGP) [24]. This project maintains a database of annotations of different parts of the Drosophila (fruitfly) genome. The project makes these annotations available in different forms including as a large XML file. This file was split such that each annotation was stored in an individual file. Another pre-requisite for U-P2P communities is a valid XML schema. As with many existing XML repositories, the BDGP repository is described using the older DTD technology. This has been transformed into XML Schema with the help of the dtd2xs [25] package.

Although the BDGP collection was not intended for a peer-to-peer environment, it does present one possible peer-to-peer application. The BDGP is part of FlyBase, a distributed project involving scientists at many different institutions. Each annotation contains a collection of information that can be enriched as information is gathered. With U-P2P each institution (or even each scientist) could maintain their own local database of annotations. By searching on a specific gene's name these multiple annotations could be discovered, viewed and

[Figure 6: Evolution of search time based on community size. Plot title: Average search time in Fruitfly Community; x-axis: Hundreds of Files in Community; y-axis: Search time in seconds.]

[Figure 7: Time to publish based on community size. Plot title: Time to publish a file; x-axis: File number out of 100; y-axes: Time in seconds, File size in bytes; series: File size, Time to add.]

dynamically reconciled, rather than requiring curation of each annotation at a central repository.

Manually publishing and searching for large numbers of files was not practical, so these operations were automated. An XMLRPC interface was designed for the publish function. A command-line XMLRPC client was written to take advantage of this capability, allowing the rapid addition of multiple objects. Search used a different approach, relying on the URL to encode an XPath search expression for a servlet designed to handle XPath formatted queries. These two command-line tools can then be used to script more complex experiments.

The time to publish and search was measured as repository size increases. One hundred files were published to the Fruitfly Community. Measurements are shown in Figures 6 and 7. Figure 6 shows that publishing time increases in a linear fashion with number of files created. A set of fifty search queries was executed after one hundred files were added to the community. The same queries were repeated every one hundred files, up to fifteen hundred files. Finally each set of fifty execution times for various search queries was averaged and plotted versus the number of files. Figure 7 illustrates that the initial publishing of objects is costly: two orders of magnitude slower than the steady state time to publish. An explanation for this is the overhead required to create a new collection, a new object and whatever other internal data structures are used by the XML database. Eventually this time settles down to a steady state. On a Pentium 4 system this value was 1s.

[Figure 8: Time to publish when increasing the number of communities. Plot title: Time to add first 10 files to Fruitfly Community; x-axis: File number; y-axes: File size, Time to add in ms; series: File sizes, No existing communities, 1-3 existing communities.]

Another interesting set of experiments was performed to determine whether publish and search times are affected by the presence of other communities. The first experiment (Figure 8) measures the time to add the first 10 files to a community as the number of existing communities is increased from none to three. Each community was created using 100 or so objects: the Fruitfly community was then created and the time to add the first 10 files recorded.

Adding a file when there are no published objects incurs a performance penalty but publishing times quickly converge to the 1s figure mentioned previously. Adding the first object to a community in the presence of other communities does not incur this penalty. Publishing times for objects in the new community are comparable to the time it would take to add more objects to one of the existing communities. This indicates that it is the total number of objects in the system and not the number of objects in a community which determines how long it will take to publish a new object.

In Figure 9, four searches were performed in the Fruitfly community and the time for results to return recorded. The same four searches were performed within the community in the presence of between one and three other communities (each with about 100 files). The first of the searches took about twice as long without the presence of other communities than with, but the subsequent three searches took about the same time regardless of the number of communities. This indicates that search times converge more quickly than publish times and that the existence of other communities does not affect search times.

[Figure 9: Search time when increasing the number of communities. Plot "Average search time": time in ms (0 to 1200) against search number (1 to 4), with one series each for no, one, two and three existing communities.]

6.3 Further Comments

A permanent U-P2P server has been set up (http://24.102.27.174:3300/up2p) and the Fruitfly gene annotations have been published there. In addition three other collections of XML documents have been published:

• DBLP [26] - references to Computer Science related papers, theses and proceedings
• EDGAR [27] - filings with the US Securities and Exchange Commission (SEC)
• Gene Ontology [28] - database of gene-related concepts

These are all pure XML communities: objects do not contain embedded objects. A few hundred objects have been published in each community. These can be searched by accessing the client running at the permanent location or via a U-P2P peer installed elsewhere. In addition, Chemistry Markup Language (CML) researchers also attempted to create a community for sharing molecules using this server.

In general these preliminary performance results indicate that base performance is a general problem: publish times of 1s on a Pentium 4 are too slow. However, the quick convergence of publish and search times, their slow linear growth as the number of files increases and practical experience indicate that a U-P2P node is scalable and can accommodate a few thousand files before becoming too slow to be useful. One approach to remedying this problem would be to upgrade to the latest version of XML:DB, which promises performance enhancements. Another option would be to use the file system for storing documents, although this makes rich search more difficult. One concrete improvement is the ability to designate other nodes to be central servers for new communities. This offloads requests as well as storage space, allowing the network as a whole to serve many more objects.

7 Conclusion and Future Work

U-P2P is a peer-to-peer framework that allows a user to describe, share and discover communities just like any other resource. Once a community is found, its schema and associated stylesheets are downloaded and are used to perform search and publishing of resources specific to that community. Communities play a generative role similar to metaclasses in object-oriented languages. Possible applications of U-P2P include sharing resources such as resumes, knowledge management in a corporate setting, or distributed repositories for design patterns and software components.

A major direction for future work is in demonstrating the protocol independence of U-P2P. By developing PeerNetworkAdapters to interface to existing networks such as Freenet or Gnutella, U-P2P could become a meta-data layer that would provide an enhanced community-based search capability.

U-P2P communities may also be extended beyond the schema format to include definition of parameters such as protocol, security and authentication. These various parameters define a design space for peer-to-peer systems. Instead of envisioning a single system that is all things to all people, a more practical approach is to consider which methods are most appropriate to the application. The system should allow important features such as security or authentication to be decoupled and mixed and matched.

Despite some database performance issues that we believe can be overcome, as it stands U-P2P can be easily used for distributed sharing of XML documents, while exploiting structured metadata to optimize search.

Availability

U-P2P is an open source application licensed under the GPL, and makes use of other open source products such as Jakarta Tomcat [16] and eXist [18]. Complete source code and documentation as well as guides and presentations are accessible at http://u-p2p.sourceforge.net.

Acknowledgments

This work is supported by the Natural Sciences and Engineering Research Council of Canada and the Foundry Program of Carleton University. Valuable feedback on U-P2P has been received from Peter Murray-Rust and his team at the University of Cambridge, UK. The design pattern case study would not have been possible without the help of Darrell Ferguson and Dwight Deugo of the Carleton School of Computer Science.

References

[1] Napster, http://www.napster.com Accessed on 07 April 2003 06:00 UTC.
[2] Gnutella, http://gnutella.wego.com Accessed on 07 April 2003 06:00 UTC.
[3] Thadani, Sumeet, Meta Information Searches on the Gnutella Network, http://www.limewire.com/index.jsp/metainfo_searches Accessed on 07 April 2003 06:00 UTC.
[4] Thadani, Sumeet, Meta Data searches on the Gnutella Network (addendum), http://www..com/developer/MetaProposal2.htm Accessed on 07 April 2003 06:00 UTC.
[5] FastTrack, now owned by Sharman Networks and found in Kazaa Media Desktop, http://www.kazaa.com Accessed on 07 April 2003 06:00 UTC.
[6] OpenCola project, http://www.opencola.com Accessed on 07 April 2003 06:00 UTC.
[7] Bitzi, http://www.bitzi.com Accessed on 07 April 2003 06:00 UTC.
[8] XML Schema Part 0: Primer, W3C Recommendation, 2 May 2001, http://www.w3.org/TR/xmlschema-0/ Accessed on 07 April 2003 06:00 UTC.
[9] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation, 6 October 2000, http://www.w3.org/TR/REC-xml Accessed on 07 April 2003 06:00 UTC.
[10] Resource Description Framework, W3C, http://www.w3.org/RDF/ Accessed on 07 April 2003 06:00 UTC.
[11] Tim Berners-Lee, James Hendler and Ora Lassila, The Semantic Web, Scientific American, May 2001.
[12] Nejdl, Wolfgang et al, EDUTELLA: A P2P Networking Infrastructure Based on RDF, http://edutella.jxta.org/reports/edutella-whitepaper.pdf Accessed on 07 April 2003 06:00 UTC.
[13] Extensible Stylesheet Language Transformations (XSLT) Version 1.0, W3C Recommendation, 16 November 1999, James Clark (Editor), http://www.w3.org/TR/xslt Accessed on 07 April 2003 06:00 UTC.
[14] Gong Li, JXTA: A Network Programming Environment, IEEE Internet Computing Vol 5. No. 3. May/June 2001.
[15] JavaServer Pages, Sun Microsystems, http://java.sun.com/products/jsp/ Accessed on 07 April 2003 06:00 UTC.
[16] Jakarta Tomcat, Apache Software Foundation, http://jakarta.apache.org/tomcat Accessed on 07 April 2003 06:00 UTC.
[17] XML:DB API, Working Draft, 20 September 2001, Kimbro Staken (Editor), http://www.xmldb.org/xapi/xapi-draft.html Accessed on 07 April 2003 06:00 UTC.
[18] eXist 0.8, http://exist.sourceforge.net Accessed on 07 April 2003 06:00 UTC.
[19] The SSL Protocol Version 3.0, Internet-Draft, 18 November 1996, http://wp.netscape.com/eng/ssl3/draft302.txt Accessed on 07 April 2003 06:00 UTC.
[20] XML-Signature Syntax and Processing, W3C Recommendation, 12 February 2002, http://www.w3.org/TR/xmldsig-core/ Accessed on 07 April 2003 06:00 UTC.
[21] Dwight Deugo, Darrell Ferguson, Carleton Pattern Repository, http://muffin.nexus.carleton.ca/~darrell/repo/ Accessed in July 2002.
[22] Dwight Deugo, D. Ferguson, XML Specification for Design Patterns, Proceedings of the Second International Conference on Internet Computing (IC’2001): 407-412.
[23] Neal Arthorne, An XML Schema for Design Patterns, http://chat.carleton.ca/~narthorn/project/patterns/pattern.xsd Accessed on 07 April 2003 06:00 UTC.
[24] G.M. Rubin, Around the Genomes: The Drosophila Genome Project, Genome Research (1996) 6:71-79.
[25] DTD2XS, http://puvogel.informatik.med.uni-giessen.de/lumrix/ Accessed on 07 April 2003 06:00 UTC.
[26] DBLP, http://dblp.uni-trier.de/xml/ Accessed on 07 April 2003 06:00 UTC.
[27] EDGAR, http://bulk.resource.org/edgar/xml/ Accessed on 07 April 2003 06:00 UTC.
[28] The Gene Ontology Consortium, Gene Ontology: tool for the unification of biology, Nature Genetics 25: 25-29.
