PROPOSAL FOR A NEW ACM PUBLICATION: ACM Journal of Data and Information Quality (ACM JDIQ)

October 31, 2006

Contacts for the proposed ACM Journal of Data and Information Quality (ACM JDIQ)

Dr. Yang Lee (MIS-EIC)
Associate Professor
College of Business Administration
Northeastern University
214 Hayden Hall, 360 Huntington Avenue
Boston, MA 02115-5000
Phone: (617) 373-5052; Fax: (617) 739-9367
Email: [email protected]

Dr. Stuart Madnick (CS-EIC)
J N Maguire Professor of Information Technology and Professor of Engineering Systems
Massachusetts Institute of Technology
Room E53-321
Cambridge, MA 02142
Phone: (617) 253-6671; Fax: (617) 253-3321
Email: [email protected]

Dr. Elizabeth Pierce (Managing Editor)
Associate Professor
Department of Information Science
University of Arkansas at Little Rock
ETAS Building, Room 258
2801 South University Avenue
Little Rock, Arkansas 72204
Phone: (501) 569-3488; Fax: (501) 569-7049
Email: [email protected]

1. JUSTIFICATION

1.1 Background for this Proposal

Today's organizations are increasingly investing in technology to collect, store, and process vast volumes of data. Even so, they often find themselves stymied in their efforts to translate this data into meaningful insights that can be used to improve business processes and to make better decisions. The reasons for this difficulty can often be traced to issues of quality. These involve both technical issues, such as data that is inconsistent, inaccurate, incomplete, or out-of-date, and the fact that many organizations lack a cohesive strategy across the enterprise to ensure that the right stakeholders have the right information in the right format at the right place and time. In recent years, several terms have emerged to refer to these issues, such as Information Quality and Data Quality. We have chosen to name this journal Data and Information Quality to cover the full range of issues and will generally use those terms interchangeably in this proposal.
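Several of these technical issues lend themselves to simple, quantifiable checks. The following minimal sketch is only illustrative (the two-field record schema, sample values, and one-year freshness window are assumptions, not part of this proposal); it shows how dimensions such as completeness and timeliness might be scored over a set of records:

```python
from datetime import datetime, timedelta

# Illustrative customer records; None marks a missing value.
RECORDS = [
    {"name": "A. Smith", "email": "a.smith@example.com", "updated": datetime(2006, 9, 1)},
    {"name": "B. Jones", "email": None, "updated": datetime(2004, 1, 15)},
]

def completeness(records, fields):
    """Fraction of required field values that are actually present."""
    present = sum(1 for r in records for f in fields if r.get(f) is not None)
    return present / (len(records) * len(fields))

def timeliness(records, now, max_age):
    """Fraction of records updated within an acceptable age window."""
    fresh = sum(1 for r in records if now - r["updated"] <= max_age)
    return fresh / len(records)

now = datetime(2006, 10, 31)
print("completeness:", completeness(RECORDS, ["name", "email"]))    # 0.75
print("timeliness:", timeliness(RECORDS, now, timedelta(days=365)))  # 0.5
```

Even simple scores of this kind make "fitness for use" measurable enough to track and compare across sources; richer dimensions such as accuracy or consistency would require reference data or integrity rules.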
Complicating matters is the fact that today's organizations need to do more with their data if they are to compete effectively. Data quality, as measured by fitness for use in a particular application, is a major consideration when discussing issues such as data privacy and protection, data lineage and provenance, enterprise architecture, data mining, data cleaning, and data integration processes such as entity resolution and master data management. Particularly in the area of data integration, organizations must grapple with how to deal with incomplete customer data, inaccurate or conflicting data, and fuzzy data, and with the prospect of developing measures of confidence for the information produced in this environment. Even more daunting is the reality that even if organizations get the creation and management of information right for current stakeholders, there is always the prospect of future stakeholders to consider. How does one ensure that information will remain accessible, trustworthy, and meaningful over the long term in the face of rapidly changing computing and storage technologies? What types of models, methods, and metadata will be needed to represent, preserve, and query data lineage and provenance, possibly for centuries to come?

Research on information quality that addresses these issues is not new. Several disciplines, such as statistics, library science, accounting, computer science, and management information systems, have examined these issues before. What is new, however, is a movement towards a unified body of knowledge that addresses information quality in its entirety rather than in a piecemeal fashion. Stuart Madnick and Richard Wang established the roots for this movement when they created the Total Data Quality Management (TDQM) Research Program at MIT in the early 1990s (Madnick & Wang, 1992). One of the program's early successes was the establishment of the annual International Conference on Information Quality (ICIQ), which remains the main outlet for the dissemination of cutting-edge information quality research. Since 1996, academics and practitioners from around the world have come together each year to exchange ideas on how to define, measure, analyze, and improve information quality. These collaborations over the past decade have produced a growing body of knowledge specifically geared to the study and improvement of information quality.

Naumann et al. (2005) observe that over the years the ICIQ conference has spawned the SIGMOD workshops on Information Quality in Information Systems and the CAiSE Workshop on Information Quality. Germany has built a large community, the German Society for Information Quality, which organizes regular conferences and workshops. DAMA-I (Data Management Association International) and IAIDQ (International Association for Information and Data Quality) also routinely host industry conferences, seminars, and workshops on information quality topics around the globe. Furthermore, information quality tracks are now appearing at the major information and computing conferences: in May 2006, IRMA devoted two sessions to information and data quality at its annual conference, and in August 2006, AMCIS featured a track devoted to information and data quality issues.

In addition to examining information quality from the enterprise perspective, there are also great opportunities for studying data quality from the perspective of how it interacts with other information technology initiatives, as well as how to preserve the future quality of data. For example, information quality plays a significant role in the entity resolution literature. A recent video presentation by Hector Garcia-Molina (Research Channel, 2006) on Generic Entity Resolution considered data quality from two angles: dealing with source data that is inaccurate, incomplete, conflicting, or fuzzy, and quantifying the quality (i.e., confidence) associated with the matched and merged information once the entity resolution process is complete. Books on customer data integration and master data management, such as the one recently published by Dyche and Levy (2006), devote entire chapters to data quality. Many companies, such as IBM, Acxiom, and SAS, have either built or acquired data quality tools as part of data integration suites designed to help their customers store, manage, and mine large volumes of data. Finally, in the area of data privacy and protection, the question of how to ensure that only the "right" people are given the "right" access to the "right" data often involves a discussion of data quality.
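To give a rough sense of the confidence question described above, the sketch below pairs records from two sources using a crude string-similarity rule and carries the match score forward as a confidence value on the merged record. The similarity measure, threshold, and merge rule are illustrative assumptions only; they are not taken from the Garcia-Molina presentation:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude string similarity in [0, 1]; a stand-in for a real matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_and_merge(source_a, source_b, threshold=0.85):
    """Merge record pairs whose names are similar enough, attaching
    the similarity score as the merged record's confidence."""
    merged = []
    for ra in source_a:
        for rb in source_b:
            score = similarity(ra["name"], rb["name"])
            if score >= threshold:
                merged.append({
                    "name": ra["name"],
                    # Prefer whichever source has a non-missing phone.
                    "phone": ra.get("phone") or rb.get("phone"),
                    "confidence": round(score, 2),
                })
    return merged

a = [{"name": "Jonathan Q. Public", "phone": None}]
b = [{"name": "Jonathon Q Public", "phone": "555-0100"}]
print(match_and_merge(a, b))
```

A production entity resolution pipeline would propagate such confidence values through successive merges, which is precisely the kind of quantification at issue in this line of research.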
Within the last few years, concerns over how to preserve the future quality of data, particularly within the digital domain, have emerged as well. One sign of the growing data curation movement is the formation of the UK's Digital Curation Centre (DCC, 2006). The DCC supports UK institutions in their efforts to store, manage, and preserve their scholarly and scientific data so as to ensure its enhancement and its continuing long-term use. In addition to its second annual conference, to be held in November 2006, the DCC plans to publish an international journal to showcase research devoted to data curation. Data curation initiatives are also taking place among many scientific organizations, such as the National Cancer Institute, the National Virtual Collaboratory for Earthquake Engineering Research, and the Human Genome Project. While some aspects of data curation, such as the durability of the digital medium itself and technology obsolescence, are not quality issues in and of themselves, the goal of preserving the "fitness of use" of these materials over the long term is an information quality issue. Thus, research into how best to develop adequate policies and documentation to preserve the future quality of data, whether it is scholarly, scientific, economic, or historical in nature, is an important extension to the current information quality body of knowledge.

Projects focusing on data curation and data quality are now starting in major computer science departments: for example, the polygen data model, the information product map (IPMAP), metadata for data quality dimensions, the extension of the conceptual entity-relationship (ER) model to Quality ER (QER), and quality contexts at MIT and Louisiana State University; and data lineage, provenance, and entity resolution at Stanford University, the University of Pennsylvania, Edinburgh, and Toronto. Most recently, a first workshop on data cleaning, CleanDB, was held at the VLDB conference (2006, Korea).

In summary, we feel there exists today a large pool of researchers and practitioners capable of producing a sufficient volume of high-quality research papers covering all aspects of information quality to support a journal devoted to this subject.

1.2 Rationale for new publication

The proposed journal would view information quality research from multiple perspectives, such as:

1. The enterprise view of information quality
2. Database issues such as data lineage and provenance, incomplete data, and data cleaning
3. The relationship between data quality and other computer science/information technology initiatives
4. The challenge of preserving the quality of information for future generations

The enterprise view of information quality

The field of information quality from the enterprise's perspective has evolved significantly over the last two decades. Originally, the