The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. 38, Part II

STUDY ON THE QUALITY MANAGEMENT AND THE CONTROL—A CASE STUDY OF THE EARTH SYSTEM SCIENCE DATA SHARING PROJECT

Chongliang Sun*, Juanle Wang

Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, [email protected]

Commission II, WG VI/4

KEY WORDS: Scientific data, Sharing, Data Quality, Management, Control

ABSTRACT:

With the rapid accumulation of scientific data, a mass of data has formed and keeps growing every minute. How to manage this mass of data effectively and control its quality scientifically has therefore become an urgent problem. Based on the authors' experience with data quality management in the Earth System Science Data Sharing Project, this paper analyzes several current problems in scientific data management: (1) the data quality problem is still unsolved, (2) data quality management lacks a design for the data living period (life cycle), and (3) the data quality management strategy is often unclear. With reference to data management regulations at home and abroad, the paper discusses an elementary method that combines data classification with quality management, together with some solving measures, through a case study of the Earth System Science Data Sharing Project and its soil vector data set, the distribution map of the national soil at a scale of 1:4,000,000 (1980s). Finally, the paper gives an outlook on data management and quality control.

1. INTRODUCTION

With the globalization of information and the economy, scientific data has become a vital strategic resource for national economic and social development. It plays a key role in the national economy, social development, national security, and public services.

With the rapid development of networks, and especially of the scientific data sharing network, scientific data has grown into massive data. How to manage these data effectively and control their quality efficiently, so as to raise the level of scientific data production, processing, storage, sharing, and use, has become a hot topic. This paper analyzes the problem based on working experience in data management and puts forward a method for improving the efficiency of scientific data sharing.

2. THE CURRENT PROBLEMS

For many years, the uncertainty of scientific data[1] and the traditional data management system have resulted in data resources being managed separately and used in isolation, so a data quality evaluation system has not yet formed. It is therefore difficult to exchange and share data, which limits the wide use of data across disciplines, branches, territories, and industries.

Scientific data quality management covers much content, a large scope, and many domains. It includes the quality of the data itself as well as quality problems arising during data production.

Based on long-term work on data quality control, the paper analyzes several main problems of data quality management: (1) the data quality problem still needs to be solved, (2) data quality management lacks a design for the data living period, and (3) the data quality management strategy is often unclear.

3. ANALYSIS OF DATA QUALITY MANAGEMENT

Data quality refers to the accordance between the data and the objective state that the data describe, throughout data production, processing, storage, sharing, and use. For scientific data, it can be described as the degree of match during these processes. The quality indexes of scientific data include authenticity, completeness, and self-consistency. Data process quality includes transfer quality, storage quality, use quality, etc.[2] For spatial data, the quality indexes mainly consist of completeness, logical consistency, position accuracy, property accuracy, temporal accuracy, and some narration of the data.[3]

Data quality management takes place in the stages of data production, storage, transfer, sharing, and use, during which management problems deserve more attention in practice. Data quality management should meet requirements in the following aspects: data format, data completeness, and data readability.

Data format: Scientific data has special storage formats, such as the vector format, grid format, and property format, which are selected according to the data characteristics.

Data completeness: The completeness of scientific data mainly refers to whether the data cover the range of the concrete entity

* Corresponding author. E-mail: [email protected].


entirely or not[3]. Scientific data stored in a special format should form a complete entity that follows the specific requirements.

Data veracity (readability): Scientific data, as an integral whole, should be readable by the given software, which should recognize the given parameters exactly, including accurate position information, temporal information, and theme information. For example, soil vector data in geoscience should open in GIS software such as ArcGIS or SuperMap, and its parameters should be readable from the property sheet.

In general, according to the data type, the quality evaluation elements can be divided into two kinds: quantitative elements and non-quantitative elements. The quantitative elements include:

Completeness: superabundance and scarcity.
Logical consistency: concept, codomain, format, topology.
Position accuracy: absolute, relative, grid.
Temporal accuracy: temporal survey accuracy, temporal consistency, temporal correctness.
Theme accuracy: classification correctness, non-quantitative property accuracy, quantitative property accuracy.

The non-quantitative elements include the data-use object, purpose, data log, etc.

Highly efficient management of scientific data depends on a high-quality management method. To manage the data comprehensively, this paper introduces the concepts of completeness, superabundance, and scarcity; they differ from those in ISO 19113 and ISO 19114 and are used only in this paper, so they are not elaborated here. On this basis, we analyze the management requirements in depth and put forward a method for quality management: analyze the living period of the data and build it into data quality management.

4. THE LIVING PERIOD OF DATA FOR THE DATA QUALITY MANAGEMENT

Like any other process, scientific data has its own living period (life cycle), which includes production, storage, and use, through to the end of the data's life. Different periods have different data quality disciplines, so the data quality control flow is separated into the following parts: the data production period, data process period, data storage period, data sharing period, and data use period. Quality control measures should be applied in all of these periods. The logical frame of data quality management is described in Figure 1.

5. THE IMPLEMENTATION OF THE DATA QUALITY MANAGEMENT STRATEGY

Given the above analysis, the practical operation of data quality management should include two parts: the data quality evaluation method and the control of the evaluation indexes.

5.1 The evaluation method and flow

Scientific data comes in several types, each with its own characteristics, so several evaluation methods are needed. For vector data, the proper evaluation method is one based on classic mathematics, while for property data the right method may be one based on statistical mathematics, chaos mathematics[4], or rough set theory[5]. In general, however, the evaluation flow follows the same control chart, and the main evaluation content consists of the data quality evaluation method, the evaluation operator, the evaluation date, and the evaluation result, etc.

5.2 Data quality evaluation indexes control

According to the characteristics of scientific data, the data types can be divided into vector data, grid data, property data, and the metadata used during the sharing period (see Figure 2). The data quality evaluation indexes should therefore include indexes for each data type: the vector data evaluation index, grid data evaluation index, property data evaluation index, and metadata evaluation index, etc.

Vector/grid data indexes include the following factors: storage format, projection parameter, data coding, data scale, polygon closing, data document, data name criteria, data version, and the last evaluation rank.

Property data evaluation indexes include the following factors: storage format, field inspection, record inspection, data document inspection, data name criteria, data version, and the last evaluation rank.

Metadata evaluation indexes include the following factors: metadata ID, metadata name criteria, abstract standard, responsibility information, keyword evaluation, and data services mode (off-line/on-line).

The data types are shown in Figure 2.
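The index lists of Section 5.2 can be written down as a small lookup table. The following Python sketch is only an illustration under stated assumptions: the names `EVALUATION_INDEXES` and `missing_indexes` are hypothetical and do not come from the project, and the index names are transcribed from the lists in Section 5.2.

```python
# Evaluation indexes per data type, transcribed from Section 5.2.
# All identifiers here are hypothetical names chosen for illustration.
VECTOR_GRID_INDEXES = [
    "storage format",
    "projection parameter",
    "data coding",
    "data scale",
    "polygon closing",
    "data document",
    "data name criteria",
    "data version",
    "last evaluation rank",
]

EVALUATION_INDEXES = {
    # Vector and grid data share one index list in the paper.
    "vector": list(VECTOR_GRID_INDEXES),
    "grid": list(VECTOR_GRID_INDEXES),
    "property": [
        "storage format",
        "field inspection",
        "record inspection",
        "data document inspection",
        "data name criteria",
        "data version",
        "last evaluation rank",
    ],
    "metadata": [
        "metadata ID",
        "metadata name criteria",
        "abstract standard",
        "responsibility information",
        "keyword evaluation",
        "data services mode",
    ],
}


def missing_indexes(data_type, evaluated):
    """Return the indexes of `data_type` not yet evaluated, in listed order."""
    done = set(evaluated)
    return [name for name in EVALUATION_INDEXES[data_type] if name not in done]
```

For example, `missing_indexes("metadata", ["metadata ID", "abstract standard"])` returns the four remaining metadata indexes, which gives a simple completeness check before an evaluation is recorded.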

[Figure: flow chart of five stages: Data production → Data process → Data storage → Data sharing → Data use]

Figure 1. The logical stream of data quality management
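The requirement that quality control measures cover every period in Figure 1 can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the names `Stage`, `EvaluationRecord`, and `uncovered_stages` are hypothetical, and the record fields simply mirror the evaluation content listed in Section 5.1 (method, operator, date, result).

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Stage(Enum):
    """The five periods of the data living period shown in Figure 1."""
    PRODUCTION = "data production"
    PROCESS = "data process"
    STORAGE = "data storage"
    SHARING = "data sharing"
    USE = "data use"


@dataclass(frozen=True)
class EvaluationRecord:
    """One quality evaluation; fields follow the content named in Section 5.1."""
    stage: Stage
    method: str         # e.g. "classic mathematics" for vector data
    operator: str       # who carried out the evaluation
    evaluated_on: date  # evaluation date
    result: float       # evaluation result, 0 to 100


def uncovered_stages(records):
    """Return the stages that have no evaluation record yet, in Figure 1 order."""
    covered = {record.stage for record in records}
    return [stage for stage in Stage if stage not in covered]
```

Calling `uncovered_stages` on the records of one data set lists the periods that still lack a quality evaluation, which is one simple way to check that control is applied throughout the living period.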


[Figure: classification tree. Scientific data: vector data, grid data, property data, metadata, other types of data……]

Figure 2. The classification of the scientific data

The relationship among the data evaluation indexes of the different data types is described in Figure 3.

The evaluation result R is calculated by the following formula:

$$R = \sum_{i} \mathrm{Value}_i \times \mathrm{Weight}_i \qquad (1)$$

where $\mathrm{Value}_i$ = evaluation value of index $i$, ranging from 0 to 100; $\mathrm{Weight}_i$ = evaluation weight of index $i$, ranging from 0 to 1. The sum runs over all evaluation indexes of the data type.

The values and weights are assigned according to the criteria in the Data quality management guidelines of the Institute of Geographic Sciences and Natural Resources Research, CAS. The result is divided into four classes, as shown in Table 1.

Table 1. Evaluation values and their ranking classification for data evaluation

Evaluation value   90~100    75~89   60~74       <60
Ranking            perfect   good    qualified   disqualified

6. CASE STUDY

The scientific data sharing project was put forward by the Ministry of Science and Technology (MOST), China, in 2003 to start the data sharing work, and it entered its operational period in 2006. As a key part of the project, the Earth System Science Data Sharing Platform (ESSDSP) has more than 46000 registered users and more than 24 TB of scientific data. Managing such a large amount of data efficiently is a crucial problem in the construction of the platform, especially for data management.

Following the method discussed in this paper, the ESSDSP data are also divided into vector data, grid data, property data, metadata, etc. Using the respective evaluation indexes, we select a soil vector data set for a case study: the distribution map of the national soil at a scale of 1:4,000,000 (1980s). Table 2 gives the index values and weights for this soil data set.

Table 2. Index values and weights for vector data evaluation

Index    Projection parameter   Data coding     Data scale     Polygon close
Value    100                    100             100            98
Weight   0.3                    0.15            0.1            0.2

Index    Document inspection    Name criteria   Data version   Evaluation ranking (R)
Value    80                     70              100            94.6
Weight   0.1                    0.1             0.05
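Formula (1) and the class boundaries in Table 1 can be checked against the values in Table 2 with a short Python sketch. It is an illustration only: the function and variable names are hypothetical, the class labels paraphrase Table 1, and scores that fall between the printed ranges (for example 89.5) are assumed to belong to the lower class, which the paper does not specify.

```python
# Index values (0-100) and weights (0-1) copied from Table 2.
SOIL_VECTOR_INDEXES = {
    "projection parameter": (100, 0.30),
    "data coding": (100, 0.15),
    "data scale": (100, 0.10),
    "polygon close": (98, 0.20),
    "document inspection": (80, 0.10),
    "name criteria": (70, 0.10),
    "data version": (100, 0.05),
}

# Lower bounds of the Table 1 classes, highest first.
# Assumption: fractional scores between printed ranges fall into the lower class.
RANK_THRESHOLDS = [(90, "perfect"), (75, "good"), (60, "qualified")]


def evaluation_result(indexes):
    """Return R = sum(value_i * weight_i) over all indexes, as in formula (1)."""
    for name, (value, weight) in indexes.items():
        if not 0 <= value <= 100:
            raise ValueError(f"value of {name!r} must be within 0-100")
        if not 0 <= weight <= 1:
            raise ValueError(f"weight of {name!r} must be within 0-1")
    return sum(value * weight for value, weight in indexes.values())


def ranking(score):
    """Map an evaluation result to its Table 1 class."""
    for lower_bound, label in RANK_THRESHOLDS:
        if score >= lower_bound:
            return label
    return "disqualified"


if __name__ == "__main__":
    r = evaluation_result(SOIL_VECTOR_INDEXES)
    print(round(r, 1), ranking(r))  # prints: 94.6 perfect
```

The seven weights in Table 2 add up to 1, so R stays on the same 0 to 100 scale as the individual index values. The sketch does not enforce this, since the paper does not state it as a rule.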

With formula (1), the evaluation result R of this soil data set is 94.6. According to Table 1, this soil vector data set therefore belongs to the "perfect" class, and its quality control is also rated perfect.

7. CONCLUSIONS AND PERSPECTIVE

This study shows that, at the current level of understanding, this data evaluation method is suitable for data management and can handle the data efficiently, which lays a favourable foundation for the scientific data sharing project.

Future work will continue to focus on the study of data management methods. Combining this method with other methods is also important.

[Figure: control index chart in three panels. Evaluation indexes of the vector/grid data: projection parameters, data coding, data scale, polygon close, data name criteria, data version, document inspection. Evaluation indexes of the property data: storage format, fields inspection, record inspection, document inspection, name criteria, data version. Evaluation indexes of the metadata: metadata ID, metadata document name & name criteria, abstract, responsibility information, keyword accuracy, service mode (off/on-line).]

Figure 3. Control index chart of the data quality evaluation

References from Books:

Shi Wenzhong, 2005. Principle of Modelling Uncertainties in Spatial Data and Analysis. The Science Press, Beijing, pp. 3-7.

Awrejcewicz J, 1989. Bifurcation and chaos in simple dynamical systems. Singapore: World Scientific.

Pawlak Z, 1982. Rough Sets. International Journal of Computer and Information Sciences, 11(5):341-356.

References from Other Literature:

The scientific data management group of the MOST, 2006. The data quality management discipline of the Scientific data sharing projects. Beijing, China.

The Institute of Geographic Sciences and Natural Resources Research, CAS, 2009. Data quality management guidelines. Beijing, China.

References from websites:

China KDD, "About the data quality." http://www.dmresearch.net/research/shujucangku/2008/0805/124049_2.html (accessed 6 Nov. 2009)

Baidu, "Data quality control of the spatial data." http://baike.baidu.com/view/125958.htm (accessed 28 Sep. 2009)

Acknowledgements

The paper was funded by the Earth System Science Data Sharing Program (2005DKA32300); the National Natural Science Foundation of China (40801180, 40771146); the Open Research Funding Program of LGISEM; the National 863 Plan project (2006AA01A120); and the forward project for young researchers of IGSNRR: study on the intelligent search for geo-metadata based on geographic ontology.
