Do Metadata Models Meet IQ Requirements?
Felix Naumann
Humboldt-Universität zu Berlin
Unter den Linden, Berlin, Germany
naumann@dbis.informatik.hu-berlin.de

Claudia Rolker
Forschungszentrum Informatik (FZI)
Haid-und-Neu-Str., Karlsruhe, Germany
rolker@fzi.de

This research was supported by the German Research Society, Berlin-Brandenburg Graduate School in Distributed Information Systems (DFG grant no. GRK).

Abstract

Research has recognized the importance of analyzing information quality (IQ) for many different applications. The success of data integration greatly depends on the quality of the individual data. In statistical applications, poor data quality often leads to wrong conclusions. High information quality is literally a vital property of hospital information systems. Poor data quality of stock price information services can lead to economically wrong decisions. Several projects have analyzed this need for IQ metadata and have proposed a set of IQ criteria or attributes which can be used to properly assess information quality. In this paper we survey and compare these approaches. In a second step we take a look at existing prominent proposals of metadata models, especially those on the Internet. Then we match these models to the requirements of information quality modeling. Finally, we propose a quality assurance procedure for metadata models.

Introduction

The quality of information is becoming increasingly important, not only because of the rapid growth of the Internet and its implication for the information industry. Also, the anarchic nature of the Internet has made industry and researchers aware of this issue. As awareness of quality issues amongst information professionals grows, their demands for high-quality information will increase. There is a clear need for the industry to respond to these requirements, and this also represents a genuine market opportunity [Inf].

The autonomy of WWW information sources prevents information seekers from directly
controlling the quality of the information they receive. Rather, users of such information sources must resort to analyzing the quality of the information once it is retrieved and use the analysis for future queries.

Research has recognized the importance of analyzing information quality (IQ) for many different applications [WS, Red]. As a result, several projects have emerged to find a general measure for information quality. While the application domains differ, from structured multidatabases or data warehouse applications to retrieval systems for unstructured information, the approaches to measure IQ are all similar. Domain experts define a set of IQ criteria that are deemed to be important to the field, or a general set such as that of Wang and Strong [WS] is chosen. Next, assessment methods for each criterion are developed. These methods include questionnaires for subjective criteria, calibration methods, etc. Finally, some way of summarizing the results is given, so one is able to qualitatively compare whole sources, query execution plans, or pieces of information.

All approaches heavily rely on metadata, especially quality metadata. IQ criteria are of no use if no score for them is found. A dimension which cannot be assessed does not contribute to a comparison of sources.

On the other hand, information providers have recognized the need to describe the products they offer and provide this metadata. Obviously, this provider metadata will not directly address IQ. No information source will admit their information or data to be outdated or inaccurate. It rather covers aspects of authorship, title, etc. Such particulars can only be evaluated to indirectly find IQ ratings: the creation date of a document reveals its age, the publisher may have a good or bad reputation, etc.

Our goal is to bridge the gap between IQ metadata requirements and actual metadata that is already provided by many sources. To this end, we first analyze the most important proposed sets of IQ criteria, i.e., the wish list of
information brokers and information consumers. We then take a look at the most widespread metadata models that already exist and are used by many providers. The main contribution of this paper is a comparison of the IQ metadata requirements with the metadata models: we show how IQ criteria can be derived from existing metadata. The paper ends with a proposal to let metadata registries assure the quality of metadata models in the future, and with a further outlook onto certification authorities for metadata instances with respect to their quality.

Information Quality Metadata Requirements

This section reviews several projects concerned with information quality. Some provide research from a global viewpoint and define IQ in a very general way. Others have concentrated either on certain quality aspects or on certain application domains for IQ. All reviewed projects have in common that IQ is defined as some set of quality criteria, i.e., that quality is made up of many facets. All projects face the problem of assessing values for the criteria. In the scope of this work, we view these criteria as metadata for the data being analyzed. Thus, a list of criteria can be viewed as metadata requirements, or a wish list of criteria one would like to evaluate.

What follows is a short summary of the mentioned projects. Instead of listing each set in each section, we have summarized the IQ criteria of the projects in a table. The actual criteria names may slightly differ but have been adapted appropriately. We have classified the criteria into four sets: Content-related criteria concern the actual information that is retrieved. Technical criteria measure aspects that are determined by software and hardware. Intellectual criteria are made up of very subjective criteria like believability. Instantiation-related criteria concern the presentation of the information.

TDQM. Total Data Quality Management is a project at MIT aimed at providing
an empirical foundation for data quality. Wang and Strong have empirically identified fifteen IQ criteria regarded by data consumers as the most important [WS]. The authors have classified these criteria into intrinsic quality, accessibility, contextual quality, and representational quality. Their framework has already been used effectively in industry and government. To the best of our knowledge, this is the only empirical study in this field, and it has thus often been used as a research basis for other projects (see below).

IQ criteria for molecular biology information systems. Based on the criteria of the TDQM model, we have adapted the set to suit the integration of molecular biology information systems (MBIS) in a mediator-based architecture [NLF]. Due to the nature of this architecture and the underlying relational model, the TDQM criteria were modified. Two criteria, response time and price, were added to account for the Internet setting of the approach; some criteria were interpreted in a new manner to account for the integration aspect of the approach. Criteria such as objectivity or concise representation were dropped, since in a relational data model a query result is simply a table. For the process of planning queries against such a distributed and heterogeneous system, three classes of criteria were distinguished: source-specific, query-specific, and attribute-specific criteria.

Notions of service quality. Weikum has developed a different classification of IQ criteria [Wei]. He distinguishes system-centric, process-centric, and information-centric criteria. The set of criteria in [Wei] was put together in an informal manner with no claim for completeness. However, in our eyes Weikum does provide several new criteria, such as latency, which play an increasingly important role in new information systems, especially in WWW settings. Each criterion is thoroughly discussed, again in an informal manner.

DWQ. Data Warehouse Quality (DWQ) is an Esprit-funded project to analyze the meaning of data quality for data warehouses and
to produce a formal model of information quality to enable design optimization of data warehouses [JV]. Again, the approach is based on the empirical studies of Wang and Strong [WS]. However, the focus lies on data-warehouse-specific aspects, such as the quality of aggregated data. The authors develop a model for IQ metadata management in a data warehouse setting.

SCOUG. Measurement of the quality of databases was the subject of the Southern California Online User Group (SCOUG) Annual Retreat. The brainstorming session resulted in a checklist of criteria which fall into broad categories [Bas]. These criteria are the most frequently referenced ones within the database area. Although the focus lies on the evaluation of database performance, including categories like documentation and customer training, its similarity to the quality measures described above is obvious.

Chen et al. With a focus on World Wide Web query processing, Chen et al. propose a set of quality criteria from an information server viewpoint [CZW]. In their setting, a user can specify quality requirements along with the query. Under heavy workload, the WWW server must then simultaneously process multiple queries and still meet the quality requirements. To this end, the authors present a scheduling algorithm that is based on time-relevant criteria such as response time or network delay. The other IQ criteria are only briefly discussed.

Metadata Models

Metadata models have been developed for many different purposes. One of the first applications was that of modeling bibliographic information for libraries. Recently, the