
Proc. Int. Conf. on Dublin Core and Metadata for e-Communities 2002: 139-146 © Firenze University Press

Using Dublin Core to Build a Common Data Architecture

Sandra Fricker Hostetter Rohm and Haas Company, Knowledge Center [email protected]

DC-2002, October, 13-17 - Florence, Italy

Abstract

The corporate world is drowning in disparate data. Data elements, field names, column names, row names, labels, metatags, etc. seem to reproduce at whim. Librarians have been battling data disparity for over a century with tools like controlled vocabularies and classification schemes. Data Administrators have been waging their own war using data dictionaries and naming conventions. Both camps have had limited success. A common data architecture bridges the gap between the worlds of tabular (structured) and non-tabular (unstructured) data to provide a total solution and clear understanding of all data. Using the Dublin Core Metadata Element Set Version 1.1 and its Information Resource concept as building blocks, the Rohm and Haas Company Knowledge Center has created a common data architecture for use in the implementation of an electronic document management system (EDMS). This platform independent framework, when fully implemented, will provide the ability to create specific subsets of enterprise data on demand, enable interoperability with other internal or external systems, and reduce cycle time when migrating to the next generation tool.

Keywords: common data architecture, CDA, document management, platform independent framework, data resource management, metadata, Dublin Core, controlled vocabularies

1. A new hybrid

Organizing information has become a core competency for corporations. Moving from a paper-based world to an electronic-based one is a difficult and lengthy transformation. Paper forced us to behave in certain ways because of physical limitations associated with its tangibility. However, paper also had inherent strengths in its universality and this is something we have taken for granted.

Blending the features of paper and electronic formats is an enormous challenge. We must create something new. The plant world provides us with a helpful analogy. A hybrid plant is the combination of two separate entities into something completely new and unique, yet shares the attributes of both parent plants. This does not happen by accident. Two different species of plants will not merge to create a new one without purposeful human intervention, management, and care. And therein lie both the problem and the opportunity.

In the past, tabular and non-tabular data have been managed and accessed in very different ways. However, the ever-demanding user population wants to see all the available data integrated together and presented in a manner individually tailored to their specific needs. It has become impossible to separately manage non-tabular data and tabular data. This demands we address seemingly mutually exclusive issues in a way that satisfies all parties. The creation of a common data architecture is the most effective way to bridge the gap between all types of data.

2. Metadata management in a document managed world

Controlling the metadata used to describe items deposited in a document management system is critical to effective search and retrieval, in partnership with the dueling aspects of a full-text environment: instant gratification and lack of discrimination. At the Rohm and Haas Company, Dublin Core was a good starting point and became the basis for the document class and document properties structure “dictated” by the EDMS. From the beginning, our goal was to create a platform independent framework that would meet the following needs: (1) enable the creation of specific subsets of enterprise data on demand; (2) provide future interoperability with other internal and external systems; (3) reduce cycle time when migrating from “today’s tool” to the next generation of document management software without excessive rework.

The Dublin Core data elements as implemented in the EDMS at the Rohm and Haas Company function as the common metadata. All document classes have these properties, though it is not mandatory that the properties be populated. Eventually, three of these Dublin Core based properties (DC.Title, DC.Date.issued, DC.Publisher) will be required, and DC.Publisher will have a Rohm and Haas specific controlled scheme to reflect the company’s business unit structure.

3. The common data architecture approach

A common data architecture (CDA) “is a formal, comprehensive, data architecture that provides a common context within which ALL DATA are understood and integrated”. A CDA has the following basic components: data subjects, data characteristics, and data characteristic variations. A data subject is “a person, place, thing, concept, or event that is of interest to the organization and about which data are captured and maintained”. A data characteristic is “an individual characteristic that describes a data subject”. A data characteristic variation “represents a difference in the format, content, or meaning of a specific data characteristic” (Brackett, 1994, p. 31, p. 39).

At first glance, a standard like the Dublin Core Metadata Element Set Version 1.1 looks like it might be a common data architecture. However, under closer scrutiny, its deficiencies become more obvious. Dublin Core violates a core principle of data management by mixing different facts within a single field. DC.Creator can represent a person or an organization. The ideal data management equation is 1 Fact = 1 Field. In Dublin Core’s well-intended effort to be simple yet fully extensible, it is also very non-specific. This leads us down the tempting path to the never-ending crosswalk. Cross walking happens only at the physical level, requires an excessive amount of work, and yields minimal understanding. Instead, if we move beyond the traditional physical level analysis and cross-reference to a common data architecture created at the logical level, we gain a true common context for understanding all data.

4. How to build a common data architecture

Building a common data architecture involves five major steps. It is a reiterative process that may take several months to become an accurate reflection of the organizational situation and will require occasional readjustments over time. Since a common data architecture represents a living, breathing organization that grows and changes, it too must be refreshed as needed.

4.1 Defining the “pivotal” data subject

The first step is to identify, formally name, and define the pivotal data subject. The pivotal data subject is the most central business concept. All related concepts will be organized around this data subject. The pivotal data subject for the EDMS was the software defined object “Document Class”. We adopted the Dublin Core terminology for “Information Resource” and broadened the definition as follows:

Information Resource
An Information Resource is a set of data in context, recorded in any medium of expression (text, audio, video, graphic, digital) that is meaningful, relevant, and understandable to one or more people at a point in time or for a period of time. Traditionally, an Information Resource is recorded on some medium, such as a document, a web page, a diagram, and so on. In the broad sense, however, an Information Resource could be a person or a team of people.

An Information Resource in this data architecture represents a version of an Information Resource when there is more than one version produced. The Information Resource. System Identifier changes for each version. The Information Resource Document. Number that is assigned as an Information Property Item through Information Resource Property remains the same across versions and identifies the Information Resource, and the Information Resource. Version Identifier uniquely identifies the version of that Information Resource.

Note that the system identifier as defined in this data architecture is the system identifier of the home system where data about information resources are stored. Any other foreign identifiers from other systems where data about information resources are stored are assigned as an Information Property Item through Information Resource Property.

Note that there are non-EDMS versions of an Information Resource, such as web page versions, that may not have a date, version identifier, URL change, and so on. There is no way to know or distinguish versions of this type.

4.2 Defining the data characteristics

The second step is to identify, formally name, and define the data characteristics of the pivotal data subject. Examples include:

Information Resource. Title
The official title of the Information Resource, such as “The Importance of Adding Property Data to a Panagon Document.” This is the name by which the Information Resource is formally known.
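The names used in these definitions follow the pattern that the cross-reference table in section 7 gives in its header: Data Subject. Data Characteristic, Data Characteristic Variation. As a rough illustration only, the pattern could be handled in code as sketched below. The `FormalName` class and `parse_formal_name` function are hypothetical names introduced here, and splitting on the first period and the last comma is an assumption rather than a rule stated in the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FormalName:
    """Hypothetical holder for 'Data Subject. Data Characteristic, Variation'."""
    data_subject: str
    data_characteristic: str
    variation: Optional[str] = None

    def __str__(self) -> str:
        base = f"{self.data_subject}. {self.data_characteristic}"
        return f"{base}, {self.variation}" if self.variation else base


def parse_formal_name(text: str) -> FormalName:
    """Split a formal name on the first '. ' and, if present, the last ', '.

    Assumption: data subject names never contain '. ' and variation names
    never contain ', '.
    """
    subject, sep, rest = text.partition(". ")
    if not sep:
        raise ValueError(f"no data characteristic found in {text!r}")
    characteristic, comma, variation = rest.rpartition(", ")
    if not comma:
        # No variation given, e.g. "Information Resource. Title".
        return FormalName(subject.strip(), rest.strip())
    return FormalName(subject.strip(), characteristic.strip(), variation.strip())


title = parse_formal_name("Information Resource. Title")
person = parse_formal_name("Information Contributor. Person Name, Complete Inverted")
```

Keeping the three parts separate means the data subject can be used on its own to group every characteristic defined for it.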

Information Resource. System Identifier
The system assigned identifier in the home system that uniquely identifies an Information Resource. This is not the same as the system identifier that identifies an Information Resource in an EDMS system or any other foreign system documenting Information Resources. The Information Resource. System Identifier changes for each version of an Information Resource. The Information Resource. Version Identifier identifies the version of the Information Resource.

Information Resource. Version Identifier
The version number of the Information Resource. The versions are typically, but not necessarily, assigned sequentially from 1. In some foreign systems or standards, the version identifier may be appended to the system identifier. In this data architecture, the version identifier is maintained separate from the system identifier.

Information Resource Subtype. Code
Information Resource Subtype indicates a more detailed classification of documents within Information Resource Type. Not every Information Resource Type will have Information Resource Subtypes.

Information Resource Type. Code
The code that uniquely identifies an Information Resource Type, such as LNBK for the Information Resource Type Laboratory Notebook.

4.3 Defining the qualifying data subjects

The third step is to identify, formally name, and define any qualifying data subjects and their data characteristics. We used the Dublin Core Metadata Element Set Version 1.1 as the basic building blocks. Examples include:

Information Contributor
An Information Contributor is any person or organization that contributes in any way to an Information Resource. A person may be an author, a researcher that provides material, or a reviewer, and an organization may be a service or professional organization. Information Resource Contributor connects an Information Contributor to an Information Resource. Information Resource Contributor Role identifies the specific role played by an Information Contributor.

Information Property Group
An Information Property Group is a set of related Information Property Items. The structure of Information Property Groups and Information Property Items allows a variety of reference tables or enumerated lists to be defined for assignment to an Information Resource through Information Resource Property. Information Property Group represents a controlled set of reference tables.

Information Property Item
Information Property Item is one reference item from a set of reference items commonly held by an Information Resource. Each Information Property Item belongs to an Information Property Group. Information Resource Property assigns the Information Property Items to Information Resources.

Information Property Item Alias
An Information Property Item can have different names in different systems or standards. There is no uniform name that transcends all systems and standards. Information Property Item Alias documents all of the alias names for foreign Information Property Items in various systems and standards, and their originating system or standard. The preferred name is shown in Information Property Item. Name.

Information Resource Contributor
An Information Resource can have many different Information Contributors, and an Information Contributor can contribute to many different Information Resources. Information Resource Contributor designates a specific Information Contributor for a specific Information Resource. Information Resource Contributor Role identifies the specific role performed by an Information Resource Contributor.

Information Resource Contributor Role
An Information Contributor can perform different roles with respect to an Information Resource. Information Resource Contributor Role is a reference table identifying the roles that an Information Contributor can perform for an Information Resource.

Information Resource Property
An Information Resource can be characterized by many different Information Property Items, and an Information Property Item can characterize many different Information Resources. Information Resource Property assigns a specific Information Property Item to a specific Information Resource. If that Information Property Item requires additional data, such as a date or description, those data are provided in the data characteristics described below.

Information Resource Property Validity
An Information Resource Type has a set of Information Property Items that are valid and can be assigned to an Information Resource belonging to that Information Resource Type. Information Resource Property Validity indicates the valid assignments of Information Property Items. Note that this data subject is set up to show only the valid assignments of an Information Property Item for an Information Resource Type. If an Information Property Item appears, then that Information Property Item is valid for the Information Resource Type. If an Information Property Item does not appear, then that Information Property Item is not valid for the Information Resource Type.
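The property-related data subjects defined so far (Information Property Group, Information Property Item, Information Resource Property, and Information Resource Property Validity) can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions, not the Rohm and Haas implementation: the class names, the `assign` function, the sample identifier, and the contents of the validity set are all hypothetical. The group and item names and the LNBK type code are taken from examples elsewhere in the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PropertyItem:
    """One reference item; every item belongs to exactly one group."""
    group: str  # Information Property Group name
    name: str   # Information Property Item name


@dataclass
class InformationResource:
    system_identifier: str
    type_code: str  # Information Resource Type. Code, e.g. "LNBK"
    # Information Resource Property: (item, value) assignments.
    properties: list = field(default_factory=list)


# Information Resource Property Validity: only valid (type, item) pairs
# appear. The pairs below are made up for this example.
VALIDITY = {
    ("LNBK", PropertyItem("Information Resource Date", "Creation Date")),
    ("LNBK", PropertyItem("Information Resource Description", "Content Description")),
}


def assign(resource: InformationResource, item: PropertyItem, value: str) -> None:
    """Assign an item to a resource only if the pair appears in VALIDITY."""
    if (resource.type_code, item) not in VALIDITY:
        raise ValueError(
            f"{item.name!r} is not valid for resource type {resource.type_code!r}"
        )
    resource.properties.append((item, value))


notebook = InformationResource(system_identifier="SYS-0001", type_code="LNBK")
assign(notebook, PropertyItem("Information Resource Date", "Creation Date"), "2002-10-13")
```

Because validity is stored as data rather than as structure, adding or retiring an item changes only the contents of the reference sets, not the shape of the classes.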

Information Resource Publisher
An Information Resource can be published by more than one Publisher, and a Publisher can publish more than one Information Resource. Information Resource Publisher identifies the publication of an Information Resource by a specific Publisher.

Information Resource Relationship
An Information Resource can have a relationship with other Information Resources, such as reference to another Information Resource, material included from another Information Resource, and so on. Information Resource Relationship identifies a specific relationship between two Information Resources. Information Resource Relationship Type identifies the specific type of relationship between Information Resources.

Information Resource Relationship Type
Information Resource Relationship Type is a reference table that identifies the specific type of relationship between two Information Resources identified in Information Resource Relationship.

Information Resource Subtype
Information Resource Subtype indicates a more detailed classification of documents within Information Resource Type. Not every Information Resource Type will have Information Resource Subtypes.

Information Resource Type
Information Resource Type is a broad grouping of Information Resources that designates the nature or genre of the content of the Information Resource. It describes general categories, functions, or aggregation levels of the content of Information Resources.

Information Security Group
An Information Resource can have different levels of security classification governing which individuals or organizations can access that Information Resource. Information Security Group is a reference table designating the broad levels of security for an Information Resource. Information Security Subgroup identifies a more detailed grouping of security.

Information Security Subgroup
Information Security Groups can have a more detailed level of classification. Information Security Subgroup provides the detailed levels of security classification within Information Security Group.

Publisher
A Publisher is any organization, internal or external to Rohm and Haas, that formally publishes an Information Resource. Note that this current definition is limited to Information Resources. As the common data architecture is enhanced, this definition may be altered to include the publishers of other material not considered an Information Resource.

4.4 Creating a visual representation of the relationships

The fourth step is to create a visual representation of how all the data subjects relate to each other. In Figure 1 the relationships are depicted in a manner based on data modeling techniques outlined below:

Arrows moving away from a data subject represent a one-to-many relationship between the data subjects. For example, a single Information Resource may have many Information Resource Contributors. An Information Contributor (DC.Creator or DC.Contributor) is any person or organization that contributes in any way to an Information Resource. A person may be an author, a researcher that provides material, or a reviewer, and an organization may be a service or a professional organization.

Arrows moving towards a data subject represent a many-to-one relationship. For example, an Information Contributor may be an Information Resource Contributor to many different Information Resources. Information Resource Contributor designates a specific Information Contributor for a specific Information Resource. Information Resource Contributor Role identifies the specific role performed by an Information Resource Contributor.

Two arrows represent a relationship between two Information Resources. For example, an Information Resource can have a relationship with other Information Resources, such as reference to another Information Resource, material included from another Information Resource, etc. Information Resource Relationship identifies a specific relationship between two Information Resources. Information Resource Relationship Type identifies the specific type of relationship between Information Resources.

Multiple arrows going in the same direction in sequence represent a hierarchy relationship. For example, Information Resource Subtype indicates a more detailed classification of documents within Information Resource Type. However, not every Information Resource Type will have Information Resource Subtypes.

Arrows going towards each other and intersecting at the same data subject represent an assignment relationship. For example, Information Resource Contributor connects an Information Contributor to an Information Resource. Information Resource Contributor Role identifies the specific role played by an Information Contributor (the various Information Resource Contributor Roles that an Information Contributor can perform for an Information Resource are stored in a reference table; this is discussed in more detail under heading 5, Properties make the data world go ‘round). We assign an Information Contributor to an Information Resource and then we give the Information Contributor a specific role. The same kind of assignment relationship exists for Publisher. An Information Resource could be published by two different publishers. The print copy could be published by a different entity than the electronic version and the electronic and print versions could be the same content or might be different content.

Figure 1. [Diagram of the data subjects and their relationships. Boxes shown: Information Contributor, Information Resource Property Validity, Information Resource Type, Information Property Group, Information Resource Contributor, Information Resource Contributor Role, Information Resource Subtype, Information Property Item, Information Resource Property, Information Resource, Information Resource Relationship, Information Resource Relationship Type, Information Property Item Alias, Publisher, Information Resource Publisher, Information Security Subgroup, Information Security Group.]

4.5 Testing and adjustment

The final step is to test the resulting common data architecture and adjust as needed. This can be done by trying it out on another system or conducting business use cases.

5. Properties make the data world go ‘round

Properties (fields, attributes, characteristics, features, metatags) help us understand more about the content and context of the information resource to which they belong. Common properties are universal. Everyone in the organization cares about these properties. It is important to limit the names and display labels of these common properties so we can effectively share them and mean the same thing. Special or custom properties apply only to a small subset of information resources, but their names and labels should be limited also. Limiting the values for most properties helps keep the context meaningful and clear.

Because an Information Resource may have many different Information Resource Property Items, we need to resolve the many-to-many relationship and figure out a way to assign them to the specific Information Resource. We define the Information Resource Property Items first, and then assign them. By structuring things in this manner, Information Property Groups and Information Property Items within those groups can become ineffective at any time without altering the structure of the data resource (Figure 2).

Information Resource Property Items for a specific Information Resource are kept in reference tables called Information Property Groups. An Information Resource Property is a qualifying Data Subject which assigns Information Resource Property Items, via the Information Resource Property Groups structure, to a specific Information Resource.

Information Property Group is a reference table of reference tables. Information Property Item is a specific value in a reference table. All Information Property Items must belong to an Information Property Group. This portion of the CDA represents “a of controlled vocabularies”. These reference tables are documented as data subjects, but their definitions clearly identify them as reference tables and not true data subjects.

Figure 2. [Diagram relating the data subjects Information Resource, Information Resource Property, Information Property Item, and Information Property Group.]

An example of an Information Resource Property Group is Information Resource Description. An Information Resource can have many associated descriptions, such as content, spatial, physical format, temporal, and rights. Information Resource Description identifies each of the descriptions that can be assigned to an Information Resource. Examples of Information Property Items for the Information Resource Property Group called “Information Resource Description” are Content Description, Spatial Description, Physical Format Description, Temporal Description, and Rights Description. Other examples of Information Property Groups include: Information Resource Date, Information Resource Identifier, Information Resource Library, Information Resource Subject, Language, and Non-Enumerated Feature.

6. Documenting the common data architecture

We are formally documenting our common data architecture in the Data Resource Guide. The Data Resource Guide is a proprietary Microsoft Access software application which contains tables on the Common Data Architecture side for data subject, data characteristic, data characteristic variation, data code set, and data code. On the Data Product side (e.g. EDMS, Dublin Core Metadata Element Set Version 1.1, etc.) it has tables for data product type, data product, data product group, data product unit, and data product code. It also has tables for data product cross-referencing. The inclusion of a reporting feature enables the data resource administrator to see how multiple data products relate to each other and what data elements they share.

7. Cross referencing Dublin Core to the CDA

When we cross-reference the Dublin Core Metadata Element Set Version 1.1 to the common data architecture it yields the following results.

Dublin Core Element Label
    Common Data Architecture Equivalent (Data Subject. Data Characteristic, Data Characteristic Variation)

Title
    Information Resource. Title, Variable

Creator
    Creator can be either a person or an organization. The cross-references are identified for each Creator variant.
    Information Contributor. Person Name, Complete Inverted
        Comment: Information Resource Contributor Role. Name, Formal = 'Creator'
    Information Contributor. Organization Name, Variable
        Comment: Information Resource Contributor Role. Name, Formal = 'Creator'

Subject
    An exact cross-reference is indeterminate based on the definition of Subject and the lack of a specific controlled vocabulary or formal classification scheme. Any implementation could use one or more controlled vocabularies or formal classification schemes. The best cross-reference approach is to identify each specific controlled vocabulary or formal classification scheme used under the Dublin Core standard, document it as a reference table in the common data architecture, and then prepare a cross-reference to that reference table.
    Business Unit Classification Scheme. Name, Formal
        Comment: Information Property Group. Name, Formal = Information Resource Subject
        Comment: Information Property Group. Name, Formal is indeterminate and needs to be determined for each data occurrence.

Description
    Description is defined as a reference table in the common data architecture as Information Resource Description. The specific types of descriptions, such as table of contents, abstract, etc. are reference items in that reference table.
    Information Resource Property. Description, Dublin Core
        Comment: Information Property Group = Information Resource Description
        Comment: Information Property Item. Name, Formal is variable and needs to be determined for each data occurrence.

Publisher
    Publisher. Name, Variable
        Comment: The publisher name should be used as the cross-reference.

Contributor
    Contributor can be either a person or an organization. The cross-references are identified for each Contributor variant.
    Information Contributor. Person Name, Complete Inverted
        Comment: Information Resource Contributor Role. Name, Formal is variable and needs to be determined for each data occurrence.
    Information Contributor. Organization Name, Variable
        Comment: Information Resource Contributor Role. Name, Formal is variable and needs to be determined for each data occurrence.

Date
    Date is defined as a reference table in the common data architecture as Information Resource Date. The specific types of dates, such as Available Date, Creation Date, Issued Date, Modified Date, Valid Date, etc. are reference items in that reference table.
    Information Resource Property. Date, ISO 8601
        Comment: Information Property Group. Name, Formal = Information Resource Date
        Comment: Information Property Item. Name, Formal is variable and needs to be determined for each data occurrence.

Type
    Information Resource Type. Name, Dublin Core
        Comment: If a controlled vocabulary other than the list of Dublin Core types is used, it needs to be documented as a data product unit variant and cross-referenced to an appropriate reference table in the common data architecture.

Format
    Format is a specific type of description which is defined as a reference table in the common data architecture as Information Resource Description. The specific types of descriptions, such as text, audio, etc. are reference items in that reference table.
    Information Resource Property. Description, Dublin Core
        Comment: Information Property Group. Name, Formal = Information Resource Description
        Comment: Information Resource Description. Name, Formal is variable and needs to be determined for each data occurrence.

Identifier
    Identifier is defined as a reference table in the common data architecture as Information Resource Identifier. The specific types of identifiers, such as URI, ISBN, etc., are reference items in that reference table.
    Information Resource Property. Value, Variable
        Comment: Information Resource Property = Information Resource Identifier
        Comment: Resource Description. Name, Formal is variable and needs to be determined for each data occurrence.

Source
    Source represents a relationship between two Information Resources as defined in Information Resource Relationship. The identifier of the Source in Dublin Core must be determined, the system identifier located, and that system identifier used in Information Resource Relationship. System Identifier. The specific types of relationships, such as source, and so on, are defined in Information Resource Reference Type.
    Information Resource Property. Value, Identifier Variable
        Comment: Information Resource Relationship Type. Name, Formal = Source

Language
    Language is a multiple-fact data item for the language and the country associated with the language. The cross-references are identified for each language variant.
    Language. Code, ISO 639
    Country. Code, ISO 3166

Relation
    Relation represents a relationship between two Information Resources as defined in Information Resource Relationship. The identifier of the Source in Dublin Core must be determined, the system identifier located, and that system identifier used in Information Resource Relationship. System Identifier. The specific types of relationships, such as source, etc. are defined in Information Resource Reference Type.
    Information Resource Property. Value, Identifier Variable
        Comment: The Information Resource Relationship Type. Name, Formal is indeterminate and needs to be identified for each data occurrence.

Coverage
    Coverage is a specific type of description which is defined as a reference table in the common data architecture as Information Resource Description. The specific types of coverage, such as spatial, temporal, etc. are reference items in that reference table. Information Resource Description. Name is variable and needs to be determined for each data occurrence.
    Information Resource Property. Description, Dublin Core
        Comment: Information Resource Property. Name, Formal = Description
        Comment: Information Property Item. Name, Formal = Spatial Description
        Comment: Information Property Item. Name, Formal = Temporal Description
        Comment: Information Property Item. Name, Formal = Jurisdiction Description

Rights
    Rights is a specific type of description which is defined as a reference table in the common data architecture as Information Resource Description. The specific types of rights, such as copyright, royalty, and so on, are reference items in that reference table.
    Information Resource Property. Description, Dublin Core
        Comment: Information Property Group. Name, Formal = Information Resource Description
        Comment: Information Property Item. Name, Formal is variable and needs to be determined for each data occurrence.

8. Next steps

This common data architecture is currently a work-in-progress. Full documentation of the common data architecture in a Data Resource Guide must be completed as well as the final cross-referencing of the EDMS metadata and Dublin Core Metadata Element Set 1.1. The creation of a thesaurus component is essential to making the CDA content available to the wider community of general system users and the individuals who develop and design new database applications. We envision a “Data Element Supermarket” where developers can shop for the field name desired, find its variations (code, name, acronym), and learn its single source and history of use in other systems. We have created a good foundation, but there is still much work to be done before the true value can be realized.

References

1. Brackett, M., 1994. Data Sharing Using a Common Data Architecture. New York: John Wiley & Sons, Inc.

Acknowledgements

This work would not have been possible without the professional advice and consultation expertise of Michael H. Brackett. The author is grateful for his personal encouragement and support.