GIS Data Quality

Ollivier and Company - BrightStar TRAINING

Contents

GIS Data Quality
Biography
What is Data Quality?
Definition
Accuracy
Data Profiling
Data Integrity
Consistency
Completeness
Validity
Timeliness
Accessibility
Security
Data Scrubbing
Metadata collected by profiling
Extract Transform Load (ETL)
Extract
Transform
Load
Exercises

Biography

Ollivier and Company are GIS consultants specialising in ESRI products for local and regional government applications. Principal Kim Ollivier has worked as a civil engineer at both technical and management levels within New Zealand local authorities and private industry.

Kim started his career in civil engineering. After some years overseas he returned to New Zealand to work in regional government, on buildings, roads, water and sewerage systems. He moved into computing full time as the manager of PrimeShare, an engineering computer service bureau, applying computers to engineering problems, which led to GIS systems. Since 1989 he has installed and supported Geographic Information Systems in over 50 organisations including 20 District and Regional Councils throughout New Zealand.

In 1996 he set up his own consultancy based in Auckland with a particular focus on GIS applications and software development. He has specialised in innovative internet mapping tools, cadastral and services mapping, data translation and analysis.

Ollivier & Co have formed associations with several other consultants to handle large projects and bring in other skills, notably Explorer Graphics. The company is a business partner with Eagle Technology for ESRI products.

The company now has a range of reformatted spatial data designed for GIS users under the Corax brand.

He runs training courses on Python, Geoprocessing Tools using ArcGIS and data translation using Safe Software’s FME ETL tools.

He is a trustee of the Te Araroa Trust.

What is Data Quality?

Definition

Data are of high quality:

“If they are fit for their intended uses in operations, decision making and planning.”

(J.M. Juran)

Alternatively, the data are deemed of high quality:

“If they correctly represent the real-world construct to which they refer.”

These two views can often be in disagreement, even about the same set of data used for the same purpose. The second view is very relevant to spatial data models. Should we use vector or raster? What scale? Continuous or discrete values?

Data quality measurements that simply count the percentage of correct records are not helpful and do not build a business case for spending resources and effort on improving data quality. The measurements should be user oriented, reflecting the cost of the errors in actual use.

http://en.wikipedia.org/wiki/Data_quality

Accuracy

Measured values have both an accuracy and a precision. This is particularly relevant to coordinates.

Fig 1 Numeric accuracy concepts

Accuracy is also a concept in statistical measurements of true/false results: the accuracy is the proportion of true results (both true positives and true negatives) in the population. It can be obtained as a simple aggregate of the profiling used to identify errors.


Fig 2 Summary of data accuracy (relevant, missing and erroneous records)

Completeness Score = Relevant / (Relevant + Missing)

Accuracy Score = (Relevant - Errors) / Relevant

Overall Score = (Relevant - Errors) / (Relevant + Missing)
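
As a minimal worked example in Python (the counts below are made up for illustration, not taken from any real dataset), the three scores can be computed directly:

# Illustrative record counts (not from a real dataset)
relevant = 950   # relevant records present in the dataset
missing = 50     # relevant records that are absent
errors = 30      # relevant records that contain errors

completeness_score = relevant / (relevant + missing)       # 0.95
accuracy_score = (relevant - errors) / relevant            # ~0.968
overall_score = (relevant - errors) / (relevant + missing) # 0.92

print(f"Completeness {completeness_score:.0%}, Accuracy {accuracy_score:.0%}, Overall {overall_score:.0%}")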

Data Profiling

Data Profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:

1. Find out whether existing data can easily be used for other purposes
2. Give metrics on data quality, including whether the data conforms to company standards
3. Assess the risk involved in integrating data for new applications, including the challenges of joins
4. Assess whether metadata accurately describes the actual values in the source database
5. Understand data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can incur time delays and cost overruns.
6. Have an enterprise view of all data, for uses such as master data management, where key data is needed, or data governance for improving data quality

Some companies also look at data profiling as a way to involve business users in what traditionally has been an IT function. Business users can often provide context about the data, giving meaning to columns of data that are poorly defined by metadata and documentation.

Typical types of profile metadata sought are listed below; a minimal collection sketch follows the list:

• Domain: whether the data in the column conforms to a set of discrete values or a range of numeric values
• Type: alphabetic or numeric
• Pattern: a regular expression
• Frequency counts
• Statistics:
  o minimum value
  o maximum value
  o mean value (average)
  o median value
  o modal value
  o standard deviation
  o count
  o frequency
• Interdependency:
  o within a table
  o between tables
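
A few of these profile statistics can be collected with a short Python sketch; the file name, column name and expected pattern below are hypothetical examples, not part of any course dataset:

# Minimal profiling sketch for one column of a CSV attribute table.
import csv
import re
import statistics
from collections import Counter

pattern = re.compile(r"^\d{4}$")   # assumed domain pattern, e.g. a 4-digit code

values = []
pattern_failures = 0
with open("parcels.csv", newline="") as f:          # hypothetical input file
    for row in csv.DictReader(f):
        value = row["zone_code"].strip()            # hypothetical column
        values.append(value)
        if not pattern.match(value):
            pattern_failures += 1

numeric = [float(v) for v in values if v.replace(".", "", 1).isdigit()]
print("count:", len(values))
print("top frequencies:", Counter(values).most_common(5))
print("values failing pattern:", pattern_failures)
if numeric:
    print("min:", min(numeric), "max:", max(numeric))
    print("mean:", statistics.mean(numeric), "median:", statistics.median(numeric))
    print("std dev:", statistics.pstdev(numeric))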

Data Integrity

Data integrity refers to the validity of data. It can be compromised in a number of ways:

• Human errors when data is entered
• Errors that occur when data is transmitted from one database to another
• Software bugs or viruses
• Hardware malfunctions, such as disk crashes
• Natural disasters, such as fires and floods
• Different encoding schemes, e.g. Unicode vs ASCII

There are many ways to minimize these threats to data integrity. These include:

• Backing up data regularly
• Controlling access to data via security mechanisms
• Designing user interfaces that prevent the input of invalid data
• Using error detection and correction software when transmitting data

Consistency

Measures the discrepancies between different attributes that have an interdependent relationship. GIS systems are particularly sensitive to inconsistency because different sources are commonly integrated in one map, where the differences are glaringly obvious.

Completeness

Are all fields populated? Are all features present? How can you tell?
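
One simple way to tell whether all fields are populated, sketched below, is to count the share of non-empty values per field; the input file is a hypothetical CSV export of an attribute table:

# Minimal completeness sketch: percentage of populated values per field.
import csv

with open("roads.csv", newline="") as f:            # hypothetical export
    reader = csv.DictReader(f)
    fields = reader.fieldnames or []
    populated = {field: 0 for field in fields}
    total = 0
    for row in reader:
        total += 1
        for field in fields:
            value = row.get(field)
            if value is not None and value.strip() != "":
                populated[field] += 1

for field in fields:
    share = populated[field] / total if total else 0.0
    print(f"{field}: {share:.1%} populated")

Missing features (as opposed to missing attribute values) usually need an external reference, for example comparing feature counts against an authoritative source.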

Validity

Valuation zoning codes are not the classification used by councils; they exist for consistent valuation comparisons.

Timeliness

Accessibility

Security

These are six fundamental, atomic, non-overlapping attributes of information that are protected by information security measures. Defined by Donn B. Parker, renowned security consultant and writer, they are confidentiality, possession, integrity, authenticity, availability and utility.

Confidentiality

Restrictions on the accessibility and dissemination of information.

Possession

The ownership or control of information, as distinct from confidentiality. For example, if confidential information such as a user ID-password combination is in a sealed container and the container is stolen, the owner justifiably feels that there has been a breach of security even if the container remains closed (this is a breach of possession or control over the information).

Data integrity

The quality of correctness, completeness, wholeness, soundness and compliance with the intention of the creators of the data. It is achieved by preventing accidental or deliberate but unauthorized insertion, modification or destruction of data in a database.

Authenticity

The correct attribution of origin, such as the authorship of an e-mail message, or the correct description of information, such as a data field that is properly named.

Availability

The accessibility of a system resource in a timely manner; for example, the measurement of a system's uptime. Online or offline. Random access or serial.

Utility

Usefulness; fitness for a particular use. For example, if data are encrypted and the decryption key is unavailable, the security failure is in the lack of utility of the data (they are still confidential, possessed, integral, authentic and available).

Data Scrubbing

Motivation for Scrubbing

Administratively, incorrect or inconsistent data can lead to false conclusions and misdirected investments on both public and private scales. For instance, the government may want to analyze population census figures to decide which regions require further spending and investment on infrastructure and services. In this case, it will be important to have access to reliable data to avoid erroneous fiscal decisions.

In the business world, incorrect data can be costly. Many companies use customer information databases that record data like contact information, addresses, and preferences.

Data Quality

High quality data needs to pass a set of quality criteria. Those include:

• Accuracy : An aggregated value over the criteria of integrity, consistency and density
• Integrity : An aggregated value over the criteria of completeness and validity
• Completeness : Achieved by correcting data containing anomalies
• Validity : Approximated by the amount of data satisfying integrity constraints
• Consistency : Concerns contradictions and syntactical anomalies
• Uniformity : Directly related to irregularities
• Density : The ratio of missing values to the total number of values ought to be known
• Uniqueness : Related to the amount of duplicates in the data

The Process of Scrubbing

• Data Auditing : The data is audited with the use of statistical methods to detect anomalies and contradictions. This eventually gives an indication of the characteristics of the anomalies and their locations.

• Workflow specification : The detection and removal of anomalies is performed by a sequence of operations on the data known as the workflow. It is specified after the process of auditing the data and is crucial in achieving the end product of high quality data. In order to achieve a proper workflow, the causes of the anomalies and errors in the data have to be closely considered. If for instance we find that an anomaly is a result of typing errors in data input stages, we could add a better validation tool to the input form or dropdown menu choices.

• Workflow execution : In this stage, the workflow is executed after its specification is complete and its correctness is verified. The implementation of the workflow should be efficient even on large sets of data, which inevitably poses a trade-off because the execution of an operation can be computationally expensive.

• Post-Processing and Controlling : After executing the cleansing workflow, the results are inspected to verify correctness. Data that could not be corrected during execution of the workflow are manually corrected if possible. The result is a new cycle in the data cleansing process where the data is audited again to allow the specification of an additional workflow to further cleanse the data by automatic processing.

Methods Used for Data Cleansing

• Parsing : Parsing in data cleansing is performed for the detection of syntax errors. A parser decides whether a string of data is acceptable within the allowed data specification, in much the same way a parser works with grammars and languages; tools such as LEX, YACC or regular expressions can be used.
• Independent Values : If there are independent values that provide a consistency test, then questionable values can be identified. For example, a point address must fall in a zone.
• Data Transformation : Transformation allows the mapping of the data from their given format into the format expected by the appropriate application. This includes value conversions or translation functions (e.g. projection of coordinates).

• Duplicate Elimination : Duplicate detection requires an algorithm for determining whether data contains duplicate representations of the same entity. Usually, data is sorted by a key that brings duplicate entries closer together for faster identification. Python sets and dictionaries are very useful here (see the sketch after this list).

• Statistical Methods : By analyzing the data using values such as the mean, standard deviation and range, or clustering algorithms, it is possible for an expert (you!) to find values that are unexpected and thus erroneous. Although the correction of such data may be difficult since the true value is not known, it can be resolved by setting the values to an average or other statistical value. Statistical methods can also be used to handle missing values, which can be replaced by one or more plausible values that are usually obtained by extensive algorithms. This is frequently done for rainfall records.
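
As a rough sketch of the last two methods, the Python snippet below eliminates duplicates keyed on an identifier and flags values far from the mean; the records and field names are illustrative only, not drawn from any course dataset:

# Duplicate elimination with a dict, plus a simple mean/standard deviation check.
import statistics

records = [{"parcel_id": f"A{i}", "area": 650.0 + 10 * i} for i in range(10)]
records.append({"parcel_id": "A3", "area": 680.0})      # duplicate representation of A3
records.append({"parcel_id": "B99", "area": 98000.0})   # suspiciously large area

# Keep the first record seen for each key.
unique = {}
for record in records:
    unique.setdefault(record["parcel_id"], record)
deduplicated = list(unique.values())
print("kept", len(deduplicated), "of", len(records), "records")

# Flag values more than three standard deviations from the mean.
areas = [r["area"] for r in deduplicated]
mean = statistics.mean(areas)
stdev = statistics.pstdev(areas)
for record in deduplicated:
    if stdev and abs(record["area"] - mean) > 3 * stdev:
        print("possible outlier:", record)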

Challenges and Problems

• Error Correction and loss of information : The most challenging problem within data scrubbing remains the correction of values to remove duplicates and invalid entries. In many cases, the available information on such anomalies is limited and insufficient to determine the necessary transformations or corrections leaving the deletion of such entries as the only plausible solution. The deletion of data though, leads to loss of information which can be particularly costly if there is a large amount of deleted data.

• Maintenance of Cleaned Data : Data scrubbing is an expensive and time-consuming process, so after having performed data cleansing and achieved a data collection free of errors, one would want to avoid re-cleaning the data in its entirety after some values in the collection change. The process should only be repeated on values that have changed, which means that a cleaning lineage needs to be kept; this in turn requires efficient data collection and management techniques. If possible, a feedback loop should be designed to fix the external data source.

• Data Cleaning Framework : In many cases it will not be possible to derive a complete data cleansing graph to guide the process in advance. This makes data cleansing an iterative process involving significant exploration and interaction which may require a framework in the form of a collection of methods for error detection and elimination in addition to data auditing. This can be integrated with other data processing stages like integration and maintenance.

Metadata collected by profiling

This is all information about the database, not just the ISO XML file. It would ideally contain:

• Up-to-date data model and table schema
• Entity Relationship diagram
• All codes and their meaning
• Business rules implemented in forms
• Data Quality Rules
• Data Quality scores
• Summary report

It should all be automated so that successive runs can be compared.
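
One hedged sketch of comparing successive runs, assuming each profiling run is saved as a simple dictionary of metric names and values (the metrics shown are hypothetical):

# Compare two successive profiling summaries and report what changed.
previous_run = {"row_count": 120500, "null_zone_codes": 340, "duplicate_ids": 12}
current_run = {"row_count": 121200, "null_zone_codes": 298, "duplicate_ids": 15}

for metric in sorted(previous_run.keys() | current_run.keys()):
    before = previous_run.get(metric)
    after = current_run.get(metric)
    if before != after:
        print(f"{metric}: {before} -> {after}")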

Corax example: matching report, spreadsheet

FGDC, NZ Metadata Standard or ISO?
http://www.linz.govt.nz/about-linz/news-publications-and-consultations/consultation-projects-and-reviews/nzgms/index.aspx

XML or HTML? Does it really describe the dataset?
Examples from the Net: GSA, Geography Network
Tools to build it: mp

http://www.fgdc.gov/metadata

Extract Transform Load (ETL)

Extract

The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization and format. Common data source formats are relational databases and flat files. Extraction converts the data into a format for transformation processing.

An intrinsic part of the extraction involves parsing the extracted data to check whether it meets an expected pattern or structure. If not, the features may be diverted into a rejects file.
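
A minimal extract sketch, assuming a CSV source and a hypothetical feature identifier pattern, might divert non-conforming rows like this:

# Read rows from a source file, check each against an expected pattern,
# and divert failures into a rejects file. File and field names are hypothetical.
import csv
import re

expected_id = re.compile(r"^[A-Z]{2}\d{6}$")   # assumed identifier pattern

with open("source_features.csv", newline="") as src, \
     open("rejects.csv", "w", newline="") as rej:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(rej, fieldnames=reader.fieldnames)
    writer.writeheader()
    accepted = []
    for row in reader:
        if expected_id.match(row.get("feature_id", "") or ""):
            accepted.append(row)       # passed on to the transform stage
        else:
            writer.writerow(row)       # diverted into the rejects file

print(len(accepted), "rows accepted for transformation")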

Transform

The transform stage applies a series of rules or functions to the extracted data from the source to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the end target:

• Selecting only certain columns to load (or selecting null columns not to load)
• Translating coded values (e.g. if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female); this calls for automated data cleaning, as no manual cleaning occurs during ETL
• Encoding free-form values (e.g. mapping "Male" to "1" and "Mr" to M)
• Deriving a new calculated value (e.g. sale_amount = qty * unit_price)
• Filtering
• Sorting
• Joining data from multiple sources (e.g. lookup, merge)
• Aggregation (for example, rollup: summarizing multiple rows of data, such as total sales for each store and for each region)
• Generating surrogate-key values (autonumber or GUID)
• Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
• Splitting a column into multiple columns (e.g. putting a comma-separated list specified as a string in one column into individual values in different columns)
• Applying any form of simple or complex validation. If validation fails, it may result in a full, partial or no rejection of the data, and thus none, some or all of the data is handed over to the next step, depending on the rule design and exception handling.

Many of the above transformations may result in exceptions, for example when a code translation parses an unknown code in the extracted data.
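
A toy sketch of a few of these transformations (translating a coded value, deriving a calculated value, and collecting exceptions for unknown codes); all names and codes are illustrative:

# Minimal transform sketch: code translation, derived value, exception handling.
GENDER_CODES = {"1": "M", "2": "F"}   # source code -> warehouse code

def transform(row, exceptions):
    out = dict(row)
    code = row.get("gender_code")
    if code not in GENDER_CODES:
        exceptions.append(row)                     # unknown code becomes an exception
        return None
    out["gender"] = GENDER_CODES[code]             # translate coded value
    out["sale_amount"] = float(row["qty"]) * float(row["unit_price"])  # derived value
    return out

exceptions = []
rows = [
    {"gender_code": "1", "qty": "3", "unit_price": "2.50"},
    {"gender_code": "9", "qty": "1", "unit_price": "4.00"},   # unknown code
]
transformed = [t for r in rows if (t := transform(r, exceptions)) is not None]
print(transformed)
print(len(exceptions), "exception rows")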

Load

The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative, updated data every week, while other DWs (or even other parts of the same DW) may add new data in a time-stamped form, for example hourly. The timing and scope of whether to replace or append are strategic design choices that depend on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW.

As the load phase interacts with a database, the constraints defined in the database schema (as well as in triggers activated upon data load) apply, for example uniqueness, referential integrity and mandatory fields, which also contribute to the overall data quality performance of the ETL process.
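
As a small illustration of schema constraints participating in the load phase, the sketch below loads rows into a hypothetical SQLite table with a primary key and a mandatory field, and counts the rows the constraints reject:

# Minimal load sketch: schema constraints reject bad rows during load.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcels (parcel_id TEXT PRIMARY KEY, area REAL NOT NULL)")

rows = [("A1", 650.0), ("A2", 720.0), ("A1", 655.0), ("A3", None)]  # duplicate id, missing area
loaded, rejected = 0, 0
for row in rows:
    try:
        conn.execute("INSERT INTO parcels VALUES (?, ?)", row)
        loaded += 1
    except sqlite3.IntegrityError:
        rejected += 1   # uniqueness or mandatory-field violation

conn.commit()
print(f"loaded {loaded}, rejected {rejected}")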

Exercises

1. Pick your own representative dataset to analyse
2. Data Gazing – browse one of the datasets looking for characteristics
3. Case Study – outline a GIS data quality system
4. Assessment – split into pairs and interview each other about their dataset
5. Examine Metadata – consider its utility
6. Rules Exercise – examine the DVR datasets and devise some rules
7. Safe Software – ETL in action
8. Maintenance of the Quality – open discussion
