Do Metadata Models meet IQ Requirements?



Claudia Rolker
Forschungszentrum Informatik (FZI)
Haid-und-Neu-Str., Karlsruhe, Germany
rolker@fzi.de

Felix Naumann
Humboldt-Universität zu Berlin
Unter den Linden, Berlin, Germany
naumann@dbis.informatik.hu-berlin.de

Abstract

Research has recognized the importance of analyzing information quality (IQ) for many different applications: The success of data integration greatly depends on the quality of the individual data. In statistical applications, poor data quality often leads to wrong conclusions. High information quality is literally a vital property of hospital information systems. Poor data quality of stock price information services can lead to economically wrong decisions.

Several projects have analyzed this need for IQ metadata and have proposed a set of IQ criteria or attributes which can be used to properly assess information quality. In this paper we survey and compare these approaches. In a second step we take a look at existing prominent proposals of metadata models, especially those on the Internet. Then we match these models to the requirements of information quality modeling. Finally, we propose a procedure for the quality assurance of metadata models.

Introduction

The quality of information is becoming increasingly important, not only because of the rapid growth of the Internet and its implication for the information industry. Also, the anarchic nature of the Internet has made industry and researchers aware of this issue. As awareness of quality issues amongst information professionals grows, their demands for high quality information will increase. There is a clear need for the industry to respond to these requirements, and this also represents a genuine market opportunity [Inf].



This research was supported by the German Research Society, Berlin-Brandenburg Graduate School in Distributed Information Systems (DFG grant no. GRK).

The autonomy of WWW information sources prevents information seekers from directly controlling the quality of the information they receive. Rather, users of such information sources must resort to analyzing the quality of the information once it is retrieved and use the analysis for future queries. Research has recognized the importance of analyzing information quality (IQ) for many different applications [WS, Red]. As a result, several projects have emerged to find a general measure for information quality. While the application domains differ, from structured multidatabases or data warehouse applications to retrieval systems for unstructured information, the approaches to measure IQ are all similar: Domain experts define a set of IQ criteria that are deemed to be important to the field, or a general set such as that of Wang and Strong [WS] is chosen. Next, assessment methods for each criterion are developed. These methods include questionnaires for subjective criteria, calibration methods, etc. Finally, some way of summarizing the results is given, so one is able to qualitatively compare whole sources, query execution plans, or pieces of information. All approaches heavily rely on metadata, especially quality metadata. IQ criteria are of no use if no score for them is found. A dimension which cannot be assessed does not contribute to a comparison of sources.
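The summarizing step described above can be illustrated with a minimal sketch. It assumes a simple weighted average as the summary function; the criterion names, weights, and scores are hypothetical and only serve the illustration, and none of the surveyed projects is claimed to use exactly this function. Criteria without a score are skipped, mirroring the observation that a dimension which cannot be assessed does not contribute to the comparison.

```python
def overall_quality(scores, weights):
    """Weighted average of the IQ scores that could actually be assessed.

    scores:  criterion name -> score in [0, 1], or None if not assessed
    weights: criterion name -> non-negative importance weight
    """
    assessed = {c: s for c, s in scores.items() if s is not None and c in weights}
    total_weight = sum(weights[c] for c in assessed)
    if total_weight == 0:
        return None  # nothing assessed, so no comparison is possible
    return sum(weights[c] * s for c, s in assessed.items()) / total_weight


# Hypothetical weights and scores, for illustration only.
weights = {"accuracy": 0.5, "timeliness": 0.3, "believability": 0.2}
source_a = {"accuracy": 0.9, "timeliness": 0.4, "believability": None}
source_b = {"accuracy": 0.6, "timeliness": 0.8, "believability": 0.7}

# Rank the sources by their summarized quality, best first.
ranking = sorted(
    [("source_a", overall_quality(source_a, weights)),
     ("source_b", overall_quality(source_b, weights))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

Renormalizing by the weights of the assessed criteria is one design choice among several; treating a missing score as zero would instead penalize sources that provide little metadata.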

On the other hand, information providers have recognized the need to describe the products they offer and provide this metadata. Obviously, this provider metadata will not directly address IQ: no information source will admit that its information or data is outdated or inaccurate. It rather covers aspects of authorship, title, etc. Such particulars can only be evaluated to indirectly find IQ ratings. The creation date of a document reveals its age, the publisher may have a good or bad reputation, etc.

Our goal is to bridge the gap between IQ metadata requirements and actual metadata that is already provided by many sources. To this end, we first analyze the most important proposed sets of IQ criteria, i.e., the "wish list" of information brokers and information consumers (section "Information Quality Metadata Requirements"). The next section takes a look at the most widespread metadata models that already exist and are used by many providers (section "Metadata Models"). The main contribution of this paper is a comparison of the IQ metadata requirements with the metadata models: we show how IQ criteria can be derived from existing metadata (section "Matching Requirements and Metadata Models"). The paper ends with a proposal to let metadata registries assure the quality of metadata models in the future (section "Quality Assurance by Metadata Registries: a Proposal") and with a further outlook onto certification authorities for metadata instances with respect to their quality (section "Conclusions and Outlook").

Information Quality Metadata Requirements

This section reviews several projects concerned with information quality. Some provide research from a global viewpoint and define IQ in a very general way. Others have concentrated either on certain quality aspects or on certain application domains for IQ. All reviewed projects have in common that IQ is defined as some set of quality criteria, i.e., that quality is made up of many facets. All projects face the problem of assessing values for the criteria. In the scope of this work, we view these criteria as metadata for the data being analyzed. Thus, a list of criteria can be viewed as metadata requirements, or a "wish list" of criteria one would like to evaluate.

What follows is a short summary of the mentioned projects. Instead of listing each set in each section, we have summarized the IQ criteria of the projects in Table 1. The actual criteria names may slightly differ but have been adapted appropriately. We have classified the criteria into four sets: Content-related criteria concern the actual information that is retrieved. Technical criteria measure aspects that are determined by software and hardware. Intellectual criteria are made up of very subjective criteria like believability. Instantiation-related criteria concern the presentation of the information.

TDQM

Total Data Quality Management (TDQM) is a project at MIT aimed at providing an empirical foundation for data quality. Wang and Strong have empirically identified fifteen IQ criteria regarded by data consumers as the most important [WS]. The authors have classified these criteria into "intrinsic quality", "accessibility", "contextual quality", and "representational quality". Their framework has already been used effectively in industry and government. To our best knowledge, this is the only empirical study in this field, and it has thus often been used as a research basis for other projects (see below).

IQ criteria for molecular biology information systems

Based on the criteria of the TDQM model, we have adapted the set to suit the integration of molecular biology information systems (MBIS) in a mediator-based architecture [NLF]. Due to the nature of this architecture and the underlying relational model, the TDQM criteria were modified: Two criteria (response time and price) were added to account for the Internet setting of the approach; some criteria were interpreted in a new manner to account for the integration aspect of the approach. Criteria such as objectivity or concise representation were dropped, since in a relational data model a query result is simply a table.

For the process of planning queries against such a distributed and heterogeneous system, three classes of criteria were distinguished: source-specific, query-specific, and attribute-specific criteria.

Notions of service quality

Weikum has developed a different classification of IQ criteria [Wei]. He distinguishes system-centric, process-centric, and information-centric criteria. The set of criteria in [Wei] was put together in an informal manner with no claim for completeness. However, in our eyes, Weikum does provide several new criteria, such as latency, which play an increasingly important role in new information systems, especially in WWW settings. Each criterion is thoroughly discussed, again in an informal manner.

DWQ

Data Warehouse Quality (DWQ) is an Esprit-funded project to analyze the meaning of data quality for data warehouses and to produce a formal model of information quality to enable design optimization of data warehouses [JV]. Again, the approach is based on the empirical studies of Wang and Strong [WS]. However, the focus lies on data warehouse specific aspects, such as the quality of aggregated data. The authors develop a model for IQ metadata management in a data warehouse setting.

SCOUG

Measurement of the quality of databases was the subject of the Southern California Online User Group (SCOUG) Annual Retreat. The brainstorming session resulted in a checklist of criteria which fall into broad categories [Bas]. These criteria are the most frequently referenced ones within the database area. Although the focus lies on the evaluation of database performance, including categories like documentation and customer training, its similarity to the quality measures described above is obvious.

Chen et al

With a focus on World Wide Web query processing, Chen et al. propose a set of quality criteria from an information server viewpoint [CZW]. In their setting, a user can specify quality requirements along with the query. Under heavy workload, the WWW server must then simultaneously process multiple queries and still meet the quality requirements. To this end, the authors present a scheduling algorithm that is based on the time-relevant criteria, such as response time or network delay. The other IQ criteria are only briefly discussed.

Metadata Models

Metadata models have been developed for many different purposes. One of the first applications was that of modeling bibliographic information for libraries. Recently, the problem of describing information in general through metadata has received much attention. The abundance of information that is nowadays accessible through the Internet and WWW makes it necessary to describe the provided information in a concise, uniform, and easily understandable and interpretable way. Without such a description, an information seeker will drown in non-relevant information and may not find the desired information even though it is available.

In the following sections, we present several projects that attempt to set up a common metadata model for WWW information in documents and gain general acceptance in the Internet community. We have tried to cover the most important projects and have summarized the attributes of these metadata models in Table 2. The attribute names may slightly differ but have been adapted appropriately.

Category / IQ Criteria | TDQM | MBIS | Weikum | DWQ | SCOUG | Chen

Content-related Criteria
  Accuracy: Yes Yes Yes Yes Yes Yes
  Documentation: Yes
  Relevancy: Yes Yes Yes Yes
  Value-Added: Yes Yes
  Completeness: Yes Yes Yes Yes Yes Yes
  Interpretability: Yes Yes

Technical Criteria
  Timeliness: Yes Yes Yes Yes Yes Yes
  Reliability: Yes
  Latency: Yes Yes
  Performability: Yes Yes
  Response time: Yes Yes Yes
  Security: Yes Yes Yes
  Accessibility: Yes Yes Yes Yes Yes
  Price: Yes Yes Yes
  Customer Support: Yes

Intellectual Criteria
  Believability: Yes Yes Yes Yes Yes
  Reputation: Yes Yes Yes
  Objectivity: Yes

Instantiation-related Criteria
  Verifiability: Yes
  Amount of data: Yes Yes Yes
  Understandability: Yes Yes
  Concise representation: Yes
  Consistent representation: Yes Yes Yes Yes Yes

Table 1: Metadata Requirements for Information Quality

Dublin Core

The Dublin Core Metadata Initiative has developed a metadata element set intended to facilitate the discovery of electronic resources [Dub]. It evolved from a series of workshops with participants from many different application domains. The element set is widespread across many types of information systems, from digital libraries to museums and many other electronic document collections. Dublin Core is especially widespread in HTML documents, where the META tag is used: <META NAME="DC.Title" CONTENT="MyTitle">
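To make the META tag convention concrete, the following sketch reads Dublin Core elements from an HTML document using Python's standard html.parser module. The class and function names are hypothetical, and the "DC." name prefix is taken from the example tag above; documents that follow other naming conventions would need a different test.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collects META tags whose NAME starts with 'DC.' (case-insensitive)."""

    def __init__(self):
        super().__init__()
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.upper().startswith("DC."):
            element = name[3:].lower()
            content = attributes.get("content") or ""
            self.elements.setdefault(element, []).append(content)


def extract_dublin_core(html_text):
    """Return a dict mapping Dublin Core element names to lists of values."""
    parser = DublinCoreParser()
    parser.feed(html_text)
    parser.close()
    return parser.elements


sample = '<html><head><META NAME="DC.Title" CONTENT="MyTitle"></head></html>'
print(extract_dublin_core(sample))  # prints {'title': ['MyTitle']}
```

Values are kept in lists because an element may be repeated within one document.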

STARTS

In the Stanford Proposal for Internet Meta-Searching (STARTS) project, a list of required metadata fields for documents is proposed [GCGM]. It is based on the use attributes of Z39.50/GILS (see the sections "Z39.50 Attribute Set BIB-1" and "Z39.50 Profile GILS"). The list was developed by researchers and practitioners from large Internet companies in a number of workshops. The Dublin Core standard (see the section "Dublin Core") was later integrated.

STARTS also proposes a list of metadata fields to describe the query capabilities of an information source. These fields help solve the problems of source selection and rank-merging the results. While this metadata may also be relevant to assessing IQ in some situations, it is not considered here.

Z39.50 Attribute Set BIB-1

Z39.50 is an ANSI and ISO standard that describes the communication between a client and a metadata server, mainly with respect to searching. Originally, it was developed for the communication interoperability of libraries.

Z39.50 is independent of any application area. A profile specifies how to use the various functions defined by Z39.50 in a specific application area. A profile also specifies which attribute set to use. The attribute set BIB-1 [Z] describes bibliographic metadata and comprises a set of attributes. BIB-1 allows describing bibliographic data by several identification schemas and keyword lists. Each schema or keyword list corresponds to one BIB-1 attribute; e.g., there are several subject attributes, each of them referring to a different keyword list. In Table 2 these attributes are summarized by content.

Z39.50 Profile GILS

GILS [Elia] stands for Global Information Locator Service or for Government Information Locator Service. Originally, the latter was understood under this acronym; it was developed from an initiative in the United States. The Environment and Natural Resources Management Project of the G7 adopted the Government Information Locator Service as a model for the Global Information Locator Service. From the perspective of standards and technology there is no difference between them.

GILS is not only a means to describe books or datasets, but also to provide information about people, events, meetings, artifacts, rocks, etc. The Z39.50 profile comprises a large number of attributes [Elib]. These attributes are very detailed, so they are summarized by content in Table 2; e.g., several GILS attributes correspond to the distributor attribute in Table 2.

DIF

The Directory Interchange Format (DIF) was originally developed to make scientific US governmental catalogues describing data groups interoperable [Glo, Ols]. DIF consists of a set of data fields, some of which are mandatory.

The DIF standard was developed in a number of workshops, and based on it the data catalogue Global Change Master Directory (GCMD) was created. Today the GCMD staff is the maintenance agency of the DIF standard.

Matching Requirements and Metadata Models

Having introduced both a number of desired IQ criteria sets and a number of metadata attribute sets currently in use, the question arises where and how well they meet. Is it possible to derive values for the IQ criteria from existing metadata? The answer, unfortunately, is no, at least not in a straightforward manner. The following section discusses how and how well metadata attributes help in determining IQ criteria scores. We do not examine each criterion in detail, but look into a few exemplary criteria, one from each class of Table 1. Similar arguments hold for the other criteria of the respective class.

Attribute | Dublin Core | STARTS | BIB-1 | GILS | DIF

  Title: Yes Yes Yes Yes Yes
  Author or Creator: Yes Yes Yes Yes Yes
  Subject and Keywords: Yes Yes Yes Yes
  Description: Yes Yes Yes Yes
  Publisher/Distributor: Yes Yes Yes Yes
  Other Contributor: Yes
  Date: Yes Yes Yes Yes Yes
  Last Review Date: Yes
  Future Review Date: Yes
  Resource Type: Yes Yes
  Format: Yes
  Storage Medium: Yes Yes
  Resource Identifier: Yes Yes Yes Yes Yes
  Identifier Type: Yes Yes Yes
  Cross References: Yes Yes Yes Yes
  Source: Yes Yes
  Language: Yes Yes Yes Yes
  Relation: Yes Yes Yes
  Coverage: Yes Yes Yes Yes
  Rights Management: Yes Yes
  Document text: Yes Yes
  Sensor name: Yes
  Parameter measured: Yes
  Quality Assurance Method: Yes

Table 2: Metadata Attribute Proposals

Relevancy. Wang and Strong define relevancy as "the extent to which data are applicable and helpful for the task at hand" [WS]. Relevancy is an often used criterion in the field of information retrieval. A document or piece of information is considered to be relevant to the query if the keywords of the query appear often and/or in prominent positions in the document. Thus, the metadata attributes Coverage, Title, Subject/Keywords, and Description are of help in determining Relevancy. Especially Title and Subject/Keywords explicitly point out prominent representatives of the information content.

Even with the help of these attributes, determining the relevancy of information is error-prone. For instance, a query for the term "jaguar" at any WWW search engine will retrieve document links both for the animal and the automobile. If the user had the animal in mind, the links to automobile sites should have been considered as not relevant.
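The use of these attributes for relevancy can be sketched as follows. The scoring scheme is an assumption made for illustration: query terms are counted per attribute, and matches in Title and Subject/Keywords receive hypothetical higher weights because these attributes name prominent representatives of the content. The example records are invented. Both receive a positive score for "jaguar", which reproduces the ambiguity discussed above.

```python
import re

# Hypothetical weights: prominent attributes count more than the description.
ATTRIBUTE_WEIGHTS = {"title": 3.0, "subject": 2.0, "description": 1.0, "coverage": 1.0}


def tokenize(text):
    """Split text into lower-case alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevancy_score(query, metadata):
    """Weighted count of query-term occurrences in the metadata attributes."""
    terms = set(tokenize(query))
    score = 0.0
    for attribute, weight in ATTRIBUTE_WEIGHTS.items():
        tokens = tokenize(metadata.get(attribute, ""))
        score += weight * sum(1 for token in tokens if token in terms)
    return score


# Invented metadata records for two documents.
animal = {
    "title": "The jaguar in the wild",
    "subject": "jaguar, big cats",
    "description": "Habitat and diet of the jaguar.",
}
car = {
    "title": "Jaguar XJ review",
    "subject": "cars",
    "description": "A test drive of the new Jaguar.",
}

animal_score = relevancy_score("jaguar", animal)
car_score = relevancy_score("jaguar", car)
```

A score of this kind ranks documents by keyword prominence only; it cannot tell which sense of the term the user intended.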

Response Time. The response time criterion measures the delay between submission of a query by the user and reception of the complete response from the information system. The score for this criterion depends on unknown factors such as network traffic, server workload, etc. These aspects are hardly predictable. Another factor is the type and complexity of the user query. Again, this cannot be predicted; however, it can be taken into account once the query is posed and a query execution plan is developed.

A third aspect plays an important role: the technical equipment of the information server. Metadata on the equipment can be derived from the Publisher attribute and the Storage Medium attribute. Storage Medium can directly be translated to some time factor. To derive a factor from the Publisher attribute, further investigations on the publisher's hardware and software are necessary, for instance by directly contacting the publisher or website provider.

In conclusion, existing metadata attributes hardly contribute to the response time criterion. A more realistic approach to determine the scores is to (a) keep statistics on previous queries and (b) employ calibration techniques as proposed in [Spi].
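Approach (a), keeping statistics on previous queries, might look like the following sketch. It assumes that the mean of the observed delays is an acceptable estimate; the class and function names are hypothetical, and the calibration techniques of [Spi] are not modeled here.

```python
import time
from collections import defaultdict


class ResponseTimeStats:
    """Keeps the observed response times (in seconds) per information source."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, source, seconds):
        self._samples[source].append(seconds)

    def estimate(self, source):
        """Mean of the recorded response times, or None if nothing was recorded."""
        samples = self._samples.get(source)
        if not samples:
            return None
        return sum(samples) / len(samples)


def timed_call(stats, source, query_function, *args):
    """Run a query against a source and record how long the call took."""
    start = time.perf_counter()
    result = query_function(*args)
    stats.record(source, time.perf_counter() - start)
    return result
```

A mean hides the variation caused by network traffic and server workload; a median or a high percentile over a sliding window may be a more robust choice in practice.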

Believability. When querying autonomous information sources, believability is an especially important criterion. Apart from simply providing information, a source must convince the user that this information is "accepted or regarded as true, real, and credible" [WS].

The main source for believability is the author or creator of the information. Thus, the Author/Creator and the Contributor attributes are helpful in determining a score. However, this cannot be done automatically. First, a user-defined mapping of authors to believability scores must be created. Obviously, this mapping is very subjective and must be newly created for each user.
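A user-defined mapping of this kind can be sketched as a simple lookup. The author names, the scores, the neutral default for unknown authors, and the choice of averaging over all listed names are all assumptions made for illustration.

```python
DEFAULT_SCORE = 0.5  # assumed neutral score for authors the user has not rated


def believability(metadata, author_scores, default=DEFAULT_SCORE):
    """Average the user's scores over the creators and contributors of a document.

    metadata:      dict with optional 'creator' and 'contributor' lists of names
    author_scores: one user's personal mapping from author name to score in [0, 1]
    """
    names = list(metadata.get("creator", [])) + list(metadata.get("contributor", []))
    if not names:
        return default
    return sum(author_scores.get(name, default) for name in names) / len(names)


# Two hypothetical users rate the same (invented) authors differently.
user_one = {"Agency A": 0.9, "Anonymous": 0.2}
user_two = {"Agency A": 0.4}
document = {"creator": ["Agency A"], "contributor": ["Anonymous"]}

score_one = believability(document, user_one)
score_two = believability(document, user_two)
```

The same document receives different scores for different users, which reflects the subjective nature of the criterion.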

Determining IQ scores for all intellectual criteria is a very difficult task. Not only are these criteria of an extremely subjective nature; one must also assume that information sources will be very resourceful in trying to find ways to improve believability without improving the correctness of the information itself. A common authority, as proposed in the next section, might help determine and control the scores.

Verifiability. When believability is not as high as it could be, the quality of information can greatly improve if it is verifiable through a second source. The verification process can be supported by the attributes Resource Identifier, Relation, and Cross References. Relation and cross references may point to another source where the information can be verified. A global identifier will help identify the object or information in that other source. Thus, the content of the attributes does not directly contribute to verifiability, but their existence does improve information quality.

Figure 1 summarizes the discussion above and additionally gives matches for all criteria not examined. Similar considerations have led to each of the matchings.

Figure 1: Matching required IQ criteria and existing general-purpose metadata attributes. The figure connects the content-related and technical criteria (left) and the intellectual and instantiation-related criteria (right) with the metadata attributes (center): Date, Coverage, Author/Creator, Contributor, Title, Subject/Keywords, Description, Resource Type, Resource Identifier, Language, Relation, Cross References, Publisher, Format, Storage Medium, Rights Management.

Quality Assurance by Metadata Registries: a Proposal

Metadata registries are set up to avoid multiple development of similar metadata schemata and to ensure interoperability between the metadata schemata at both syntactic and semantic levels. In [Gai], a metadata registry is defined as a publicly accessible system that records the semantics, structure, and interchange formats of any type of metadata. A formal authority, or agency, maintains and manages the development and evolution of a metadata registry. The authority is responsible for policies pertaining to registry contents and operation.

There are some metadata registries already running on the Web, for instance MetadataNet [Dis] or ROADS [Mic]. Moreover, standardization organizations are currently developing a framework for metadata registries [DL, Fra, Fra].

Each metadata registry expects the metadata to be described in a standardized schema language like the following:

• An important member of these specifications is XML. XML is the Extensible Markup Language [Worb], extensible because it is not a fixed format like HTML. It is designed to enable the use of SGML on the World Wide Web. SGML is the Standard Generalized Markup Language (ISO 8879), the international standard for defining descriptions of the structure and content of different types of electronic documents. Document types are specified through Document Type Definitions (DTDs). A DTD is a file (or several files to be used together), written in XML's declaration syntax, which contains a formal definition of a particular type of document. We propose to include attributes for quality metadata in such a definition. The simple structure of a DTD will then allow easy evaluation of the quality of a source or document.

• The Platform for Internet Content Selection (PICS) specifies a labeling infrastructure to enhance HTML headers [Wora]. While it was originally created to attach ratings to WWW material that is inappropriate for children, the approach has been adapted to support various metadata tasks. PICS is supported by the W3 Consortium. Again, the inclusion of additional attributes for quality metadata can assist in finding and selecting relevant information or documents.

• The Resource Description Framework (RDF) is an infrastructure that enables the encoding, exchange, and reuse of structured metadata and is an application of XML [Worc]. It additionally provides a means for publishing both human-readable and machine-processable vocabularies designed to encourage the reuse and extension of metadata semantics among disparate information communities. RDF imposes needed structural constraints to provide unambiguous methods of expressing semantics.
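The proposal to include quality attributes in a document type definition can be illustrated with a small sketch. The DTD, the element and attribute names (record, quality, accuracy, timeliness), and the score values are all hypothetical. Python's standard xml.etree.ElementTree module is used to read the record; it parses the document but does not validate it against the DTD, so a validating parser would be needed to enforce the declaration.

```python
import xml.etree.ElementTree as ET

# A hypothetical metadata record whose DTD declares quality attributes.
RECORD = """<?xml version="1.0"?>
<!DOCTYPE record [
  <!ELEMENT record (title, creator, quality)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT creator (#PCDATA)>
  <!ELEMENT quality EMPTY>
  <!ATTLIST quality
            accuracy   CDATA #REQUIRED
            timeliness CDATA #REQUIRED>
]>
<record>
  <title>MyTitle</title>
  <creator>Some Author</creator>
  <quality accuracy="0.9" timeliness="0.4"/>
</record>
"""


def quality_scores(xml_text):
    """Return the quality attributes of a record as a dict of floats."""
    root = ET.fromstring(xml_text)
    quality = root.find("quality")
    if quality is None:
        return {}
    return {name: float(value) for name, value in quality.attrib.items()}


scores = quality_scores(RECORD)
```

Because the quality metadata sits in one flat element, a consumer can read all scores of a record with a single lookup.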

We propose that metadata registries should not only register metadata, but also keep an eye on the usefulness of the registered metadata models for quality reasoning. The aim should be that all registered metadata models fulfill a certain level of quality by requiring a minimal set of quality criteria. Once this measure is implemented, information seekers will greatly profit from the new metadata. For instance, users will be able to choose between an accurate but somewhat slow information source and one that is fast but inaccurate to a certain degree. Information systems that integrate many sources (meta information systems) will also benefit, since they could combine sources in a way that produces qualitatively better results, rather than arbitrarily combining sources as is done today.
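A registration check of this kind could be sketched as follows. The mapping from criteria to supporting attributes is taken from the matching discussion earlier in the paper; the function names, the set of required criteria, and the rule that one supporting attribute suffices to cover a criterion are assumptions made for illustration.

```python
# Attributes that help determine each criterion, as discussed in the
# matching section (only the four criteria examined there are listed).
CRITERION_TO_ATTRIBUTES = {
    "relevancy": {"Coverage", "Title", "Subject/Keywords", "Description"},
    "response time": {"Publisher", "Storage Medium"},
    "believability": {"Author/Creator", "Contributor"},
    "verifiability": {"Resource Identifier", "Relation", "Cross References"},
}


def uncovered_criteria(model_attributes, required_criteria):
    """Return the required criteria that no attribute of the model supports."""
    attributes = set(model_attributes)
    return sorted(
        criterion
        for criterion in required_criteria
        if not (CRITERION_TO_ATTRIBUTES.get(criterion, set()) & attributes)
    )


def may_register(model_attributes, required_criteria):
    """A model passes if every required criterion is covered by some attribute."""
    return not uncovered_criteria(model_attributes, required_criteria)


# Hypothetical minimal set of criteria and a candidate metadata model.
required = ["relevancy", "believability", "verifiability"]
candidate = ["Title", "Author/Creator", "Date"]
missing = uncovered_criteria(candidate, required)
```

Reporting the uncovered criteria, rather than only accepting or rejecting a model, tells the metadata developer which attributes are still needed.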

Conclusions and Outlook

With the help of metadata registries, quality assurance of metadata models can be achieved, as during the registration process the metadata developer could be required to show that the metadata model covers the IQ criteria. Having quality-assured metadata models is one step, but it is also important that all metadata instances of a registered and quality-assured metadata model provide values for these attributes. Of course, these values must be correct and believable. To this end, a certification authority is needed which takes care of the quality of the produced metadata instances. The examination of instances with respect to their quality is an unsolved problem and can probably only be achieved by carrying out spot checks. While this procedure may seem expensive, the benefits of accessing and using certified quality information are obvious.

Whereas metadata registries and schema languages for the description of metadata exist, the task of quality assurance executed by registries and the need for quality-certifying authorities are still open issues. Without such centralized control, WWW information system designers and users must rely on the somewhat inaccurate and subjective methods described in the section "Matching Requirements and Metadata Models".

In conclusion, there is a long way to go for metadata models until they meet the requirements to evaluate information quality. On the other hand, it is inevitable that quality analyzers must compromise in their need for metadata. A middle ground may be provided by metadata registries. These authorities can combine and match the desires of users or systems requiring high quality information on the one side and the possibilities of the information providers on the other side.

References

[Bas] Reva Basch. Measuring the Quality of the Data: Report on the Fourth Annual SCOUG Retreat. Database Searcher, October.

[CZW] Ying Chen, Qiang Zhu, and Nengbin Wang. Query processing with quality control in the World Wide Web. World Wide Web.

[DL] D-Lib Magazine. Metadata Registries Workshop, April, Washington, DC: Summary. httpmirroredukolnacuklis journalsdlib dlibdlibmayclipshtml, May.

[Dis] Distributed Systems Technology Centre. MetadataNet: Metadata Schema Registry and Metadata Tools Services. httpmetadatanet, June.

[Dub] Dublin Core Metadata Initiative. httppurlorgdcindexhtm

[Elia] Eliot Christian, US Geological Survey. GILS FAQ. httpwwwgilsnetfaq html

[Elib] Eliot Christian, US Geological Survey. GILS Metadata Elements. httpwww gilsnetelementhtmltable

[Fra] Frank Olken. Workshop Report: Joint Workshop on Metadata Registries. http wwwlblgovolkenEPAWorkshopreporthtml, Dec.

[Fra] Frank Olken and John McCarthy. Metadata Registries: Averting a Tower of XML Babel. httpwwwlblgovolkenmendelwcpapersxtechabstract html, January.

[Gai] Gail Clement and Pete Winn. Dublin Core User's Guide Glossary. http webstereffemcomdublinglossaryhtm, June.

[GCGM] Luis Gravano, Chen-Chuan K. Chang, and Hector Garcia-Molina. STARTS: Stanford proposal for internet meta-searching. In Proc. of the ACM SIGMOD Conference.

[Glo] Global Change Master Directory. Directory Interchange Format (DIF) Manual. ftpnssdcagsfcnasagovMDDOCDIFMANUALPS, April.

[Inf] Information Market Observatory (IMO). The Quality of Electronic Information Products and Services. httpwwwecholuimpactimohtml, September.

[JV] M. Jarke and Y. Vassiliou. Data warehouse quality design: A review of the DWQ project. In Proc. 2nd Conference on Information Quality, MIT, Boston.

[Mic] Michael Day. ROADS Metadata Registry. httpwwwukolnacukmetadata roadstemplates, Feb.

[NLF] Felix Naumann, Ulf Leser, and Johann Christoph Freytag. Quality-driven integration of heterogenous information systems. In Proc. of the Int. Conf. on Very Large Databases, Edinburgh, UK.

[Ols] Olsen. Global Change Master Directory: Directory Interchange Format (DIF) Writer's Guide, Version. httpgcmdgsfcnasagovdifguidedifman html, May.

[Red] Thomas C. Redman. The impact of poor data quality in the typical enterprise. Communications of the ACM.

[Spi] Myra Spiliopoulou. A calibration mechanism identifying the optimization technique of a multidatabase participant. In Proc. of the Conf. on Parallel and Distributed Computing Systems (PDCS), Dijon, France, Sept.

[Wei] Gerhard Weikum. Towards guaranteed quality and dependability of information systems. In Proc. of the Conf. Datenbanksysteme in Büro, Technik und Wissenschaft, Freiburg, Germany.

[Wora] World-Wide Web Consortium. Platform for Internet Content Selection. http://www.w3.org/PICS

[Worb] World-Wide Web Consortium (W3C). Extensible Markup Language (XML). http://www.w3.org/XML, June.

[Worc] World-Wide Web Consortium (W3C). RDF. http://www.w3.org/RDF

[WS] Richard Y. Wang and Diane M. Strong. Beyond accuracy: What data quality means to data consumers. Journal on Management of Information Systems.

[Z] Z39.50 Implementors Group and Z39.50 Maintenance Agency. Attribute Set BIB-1 (Z39.50): Semantics. httplcweblocgovzagencydefnsbib html, Sep.