Rapid Application Evolution and Integration Through Document Metamorphosis
Total Page:16
File Type:pdf, Size:1020Kb
Rapid Application Evolution and Integration Through Document Metamorphosis Paul M. Aoki Ian E. Smith James D. Thornton Xerox Palo Alto Research Center 3333 Coyote Hill Road Palo Alto, CA 94304-1314 USA Abstract (J2EE) provides the JDBC and Enterprise JavaBeans (EJB) APIs for database access and the Java Message The Harland document management system Service (JMS) API for message queuing.1 Applications implements a data model in which document can use reliable storage and queuing as a means of (object) structure can be altered by mixin-style decoupling data producers from data consumers [37]. By multiple inheritance at any time. This kind of adding a layer of indirection, decoupling often contributes structural fluidity has long been supported by to both modularity and scalability. knowledge-base management systems, but its use Frameworks like J2EE, however, provide quite has primarily been in support of reasoning and distinct facilities for the communication of messages and inference. In this paper, we report our the general management of persistent data for an experiences building and supporting several non- application, despite the fact that messages will often trivial applications on top of this data model. concern and revolve around significant persistent data. Based on these experiences, we argue that They do not provide the flexibility to allow persistent structural fluidity is convenient for data-intensive records of application objects themselves to be the applications other than knowledge-base communication vehicles within decoupled applications. management. Specifically, we suggest that this In fact, many frameworks do not make any decoupling of flexible data model is a natural fit for the parts of applications convenient because they do not decoupled programming methodology that arises provide useful structuring facilities for (e.g.) message naturally when using enterprise component headers and metadata. This is true even of tuplespace frameworks. models [5], which take unification of data management and communication as a primary goal. 1 Introduction The Bantam API, developed at Xerox PARC, defines a data model in which the structure of data for persistent Modern application frameworks provide a wide variety of facilities that both simplify deployment and enhance the “objects” can be altered by metamorphosis – mixin-style scalability of data-intensive applications. To exploit these multiple inheritance [33] – at any time. With this flexibility, an object that is part of the persistent state of frameworks and their services, programmers break the application can also be the vehicle for communication applications into modular sub-programs (e.g., “servlets”) between producers and consumers, augmented with and reusable components. These sub-programs generally necessary data in-place as an alternative to message communicate through application program interfaces (APIs) provided by the framework. headers. At the same time, parts of the application can Application frameworks generally include database evolve in a decoupled way by reusing data objects in non- interfering ways. This kind of structural fluidity has long access interfaces and message queue support. For been supported by knowledge-base systems, but its use example, the Java 2 Platform Enterprise Edition [32] Permission to copy without fee all or part of this material is granted 1 provided that the copies are not made or distributed for direct Throughout this paper, we will primarily use Java standards commercial advantage, the VLDB copyright notice and the title of the and APIs as examples when we refer to various types of publication and its date appear, and notice is given that copying is by software infrastructure. This is due to their wide exposure and permission of the Very Large Data Base Endowment. To copy is not intended as a slight to other systems, particularly those otherwise, or to republish, requires a fee and/or special permission from that predate their Java equivalents (e.g., IBM MQSeries or the Endowment Proceedings of the 27th VLDB Conference, Oracle Advanced Queuing Option in the case of message Roma, Italy, 2001 queue systems). 1 has primarily been in support of reasoning and inference Documents are primarily persistent containers for (e.g., the dynamic construction of ontologies [10]). properties. However, the Bantam model also includes With the Bantam API, we have been able to gain collection documents (documents containing other experience with structural fluidity for several non-trivial documents as members) and content documents applications outside the context of traditional knowledge- (documents having file contents separate from their base management. We suggest that this flexible data properties and accessible through a read/write API). A model is a natural fit for the decoupled programming document may have an arbitrary number of properties, methodology that arises naturally when using application each one uniquely identified by a simple text name. frameworks. Bantam promotes a program structure that A document property has a name and one or more “feels” like object persistence, but encourages queue-like values. A value is typically a simple data value, such as decoupling an integer or a string.3 Properties with multiple values The paper is organized as follows. In Section 2, we have bag semantics, i.e., there may be duplicates and no give a brief overview of the data model underlying consistent ordering is guaranteed. Individual property Bantam and the current implementation of that data model values may be added to or removed from documents at in the Harland system. In Section 3, we describe how we any time as long as schema constraints are not violated. documented our experiences – both the systems we A schema is a named group of properties with studied and the means by which we gathered the constraint definitions. A schema defines a regular information. Section 4 provides an overview of a Bantam structure that individual documents may have and application as context for Section 5, which describes the typically corresponds to a “role” that a document object specific lessons we learned. In the final two sections, we may play within an application. The constraints on a place Bantam and Harland in their context in the property specify the data type of its values and restrict the academic literature and then summarize our findings. number of values. For example, the abstraction of a “to- do list item” might be defined by a schema with four 2 Current system properties: Subject, a single text string; Received and Deadline, single date fields; and Categories, The rest of the paper requires some understanding of both an optional text string property that may have many Bantam and Harland. Since neither has appeared in the 2 values. Schemas may overlap (contain the same property literature to date, we provide a brief overview. In this or properties), but definition of inconsistent schemas is section, we first describe the data model supported by the disallowed. Bantam API. We then sketch the current implementation A document with properties satisfying all the of the Bantam API in the Harland document management constraints of a given schema is said to conform to that system. schema. A schema may also be enforced on a conforming 2.1 Data model document, which causes the system to reject any modifications that would change the document to become The current model evolved from a simple property store. non-conforming. A document does not have to have a Property stores are an old idea, dating back at least to single or even primary schema; it may have any number logic databases based on (physical) k-ary relations [20]. of schemas enforced, or none at all. Schemas are Our notion of a property is more complex than that of enforced on individual documents and may be enforced or such systems. unenforced at any time as long as there are no conflicts. The Bantam data model features three primary In addition, documents may have any number of elements: documents, properties and schemas. properties that are not part of any enforced schemas, again Documents roughly correspond to persistent objects and as long as there are no conflicts with the enforced have system-generated identifiers. Properties are named constraints. data values associated with a document. Schemas are Queries may be used to retrieve sets of documents named groups of properties and impose constraints (e.g., based on the values of their properties and/or the schemas type and arity) on their member properties. A Bantam enforced upon them. A variety of common query data space consists of a set of documents and their operators for property values are supported and composite properties, with schemas defining significant structures queries may be built using AND, OR, and NOT but the that documents can assume. model provides no support for joins. There are some additional query operators to address collection membership, content indexing, and single vs. multiple 2 Additional, detailed information about the Bantam API and the values. Harland system may be obtained from the system documentation [34] contained in the public Harland software distribution. As of this writing, this distribution can be obtained from the Xerox PARC Web site, 3 The current implementation, written in Java, does permit http://www.parc.xerox.com/harland/. arbitrary Serializable objects as values. 2 email Bantam-style schemas first appeared in a support library schema to-do schema for Placeless Documents. Thus Harland was grounded From: String from the beginning in software development experience. To: String Subject: String Subject: String 2.2 Implementation Received: Date Received: Date Deadline: Date Categories: {String} Harland is our current implementation of the Bantam API. The entire system is written in Java. The current release (2) (3) contains about 26,000 lines of code (26 KLOC). Figure 2 provides an abstract view of the Harland internal architecture. Harland is organized into five main sections, indicated by the shaded backgrounds in the figure. The first two sections implement a middleware document (1) query processor and a content management interface, Figure 1: Bantam schema usage.