GIS Data Quality

Ollivier and Company - BrightStar TRAINING

Contents

GIS Data Quality
Biography
What is Data Quality?
Definition
Accuracy
Data Profiling
Data Integrity
Consistency
Completeness
Validity
Timeliness
Accessibility
Security
Data Scrubbing
Metadata collected by profiling
Extract Transform Load (ETL)
Extract
Transform
Load
Exercises

Biography

Ollivier and Company are GIS consultants specialising in ESRI products for local and regional government applications. Principal Kim Ollivier has worked as a civil engineer at both technical and management levels within New Zealand local authorities and private industry.

Kim started his career in civil engineering. After some years overseas he returned to New Zealand to work in regional government, on buildings, roads, water and sewerage systems. He moved into computing full time as the manager of PrimeShare, an engineering computer service bureau, applying computers to engineering problems, which led to GIS systems. Since 1989 he has installed and supported Geographic Information Systems in over 50 organisations including 20 District and Regional Councils throughout New Zealand.

In 1996 he set up his own consultancy based in Auckland with a particular focus on GIS applications and software development. He has specialised in innovative internet mapping tools, cadastral and services mapping, data translation and analysis.

Ollivier & Co have formed associations with several other consultants to handle large projects and bring in other skills, notably Explorer Graphics. The company is a business partner with Eagle Technology for ESRI products.

The company now has a range of reformatted spatial data designed for GIS users under the Corax brand.

He runs training courses on Python, Geoprocessing Tools using ArcGIS and data translation using Safe Software’s FME ETL tools.

He is a trustee of the Te Araroa Trust.

What is Data Quality?

Definition

Data are of high quality:

“If they are fit for their intended uses in operations, decision making and planning.”

(J.M. Juran)

Alternatively, the data are deemed of high quality:

“If they correctly represent the real-world construct to which they refer.”

These two views can often be in disagreement, even about the same set of data used for the same purpose. The second view is very relevant to spatial data models. Should we use vector or raster? What scale? Continuous or discrete values?

Data quality measurements that simply count the percentage of correct records are not helpful and do not build a business case for spending resources and effort on improving data quality. The measurements should be user oriented, reflecting the cost of the errors in actual use.

http://en.wikipedia.org/wiki/Data_quality

Accuracy

Measured values have both an accuracy and a precision. This is particularly relevant to coordinates.

Fig 1 Numeric accuracy concepts

Accuracy is also a concept in statistical measurements of true/false results: the accuracy is the proportion of true results (both true positives and true negatives) in the population. It can be obtained as a simple aggregate of the profiling used to identify errors.


Fig 2 Summary of data accuracy (relevant, missing and erroneous records)

Completeness Score = Relevant / (Relevant + Missing)

Accuracy Score = (Relevant - Errors) / Relevant

Overall Score = (Relevant - Errors) / (Relevant + Missing)
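
As a minimal worked example in Python (the counts below are made up for illustration, not taken from any real dataset), the three scores can be computed directly:

# Illustrative record counts (not from a real dataset)
relevant = 950   # relevant records present in the dataset
missing = 50     # relevant records that are absent
errors = 30      # relevant records that contain errors

completeness_score = relevant / (relevant + missing)       # 0.95
accuracy_score = (relevant - errors) / relevant            # ~0.968
overall_score = (relevant - errors) / (relevant + missing) # 0.92

print(f"Completeness {completeness_score:.0%}, Accuracy {accuracy_score:.0%}, Overall {overall_score:.0%}")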

Data Profiling

Data Profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:

1. Find out whether existing data can easily be used for other purposes
2. Give metrics on data quality, including whether the data conforms to company standards
3. Assess the risk involved in integrating data for new applications, including the challenges of joins
4. Assess whether metadata accurately describes the actual values in the source database
5. Understand data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can incur time delays and cost overruns.
6. Have an enterprise view of all data, for uses such as master data management, where key data is needed, or data governance for improving data quality

Some companies also look at data profiling as a way to involve business users in what traditionally has been an IT function. Business users can often provide context about the data, giving meaning to columns of data that are poorly defined by metadata and documentation.

Typical types of profile metadata sought are listed below; a minimal collection sketch follows the list:

• Domain: whether the data in the column conforms to a set of discrete values or a range of numeric values
• Type: alphabetic or numeric
• Pattern: a regular expression
• Frequency counts
• Statistics:
  o minimum value
  o maximum value
  o mean value (average)
  o median value
  o modal value
  o standard deviation
  o count
  o frequency
• Interdependency:
  o within a table
  o between tables
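
A few of these profile statistics can be collected with a short Python sketch; the file name, column name and expected pattern below are hypothetical examples, not part of any course dataset:

# Minimal profiling sketch for one column of a CSV attribute table.
import csv
import re
import statistics
from collections import Counter

pattern = re.compile(r"^\d{4}$")   # assumed domain pattern, e.g. a 4-digit code

values = []
pattern_failures = 0
with open("parcels.csv", newline="") as f:          # hypothetical input file
    for row in csv.DictReader(f):
        value = row["zone_code"].strip()            # hypothetical column
        values.append(value)
        if not pattern.match(value):
            pattern_failures += 1

numeric = [float(v) for v in values if v.replace(".", "", 1).isdigit()]
print("count:", len(values))
print("top frequencies:", Counter(values).most_common(5))
print("values failing pattern:", pattern_failures)
if numeric:
    print("min:", min(numeric), "max:", max(numeric))
    print("mean:", statistics.mean(numeric), "median:", statistics.median(numeric))
    print("std dev:", statistics.pstdev(numeric))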

Data Integrity

Data integrity refers to the validity of data. It can be compromised in a number of ways:

• Human errors when data is entered
• Errors that occur when data is transmitted from one database to another
• Software bugs or viruses
• Hardware malfunctions, such as disk crashes
• Natural disasters, such as fires and floods
• Different encoding schemes, e.g. Unicode vs ASCII

There are many ways to minimize these threats to data integrity. These include:

• Backing up data regularly
• Controlling access to data via security mechanisms
• Designing user interfaces that prevent the input of invalid data
• Using error detection and correction software when transmitting data

Consistency

Measures the discrepancies between different attributes that have an interdependent relationship. GIS systems are particularly sensitive to inconsistency because different sources are commonly integrated in one map, where the differences are glaringly obvious.

Completeness

Are all fields populated? Are all features present? How can you tell?
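
One simple way to tell whether all fields are populated, sketched below, is to count the share of non-empty values per field; the input file is a hypothetical CSV export of an attribute table:

# Minimal completeness sketch: percentage of populated values per field.
import csv

with open("roads.csv", newline="") as f:            # hypothetical export
    reader = csv.DictReader(f)
    fields = reader.fieldnames or []
    populated = {field: 0 for field in fields}
    total = 0
    for row in reader:
        total += 1
        for field in fields:
            value = row.get(field)
            if value is not None and value.strip() != "":
                populated[field] += 1

for field in fields:
    share = populated[field] / total if total else 0.0
    print(f"{field}: {share:.1%} populated")

Missing features (as opposed to missing attribute values) usually need an external reference, for example comparing feature counts against an authoritative source.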

Validity

Valuation zoning codes are not the classification used by councils; they exist for consistent valuation comparisons.

Timeliness

Accessibility

Security

These are six fundamental, atomic, non-overlapping attributes of information that are protected by information security measures. Defined by Donn B. Parker, renowned security consultant and writer, they are confidentiality, possession, integrity, authenticity, availability and utility.

Confidentiality

Restrictions on the accessibility and dissemination of information.

Possession

The ownership or control of information, as distinct from confidentiality. For example, if confidential information such as a user ID-password combination is in a sealed container and the container is stolen, the owner justifiably feels that there has been a breach of security even if the container remains closed (this is a breach of possession or control over the information).

Data integrity

The quality of correctness, completeness, wholeness, soundness and compliance with the intention of the creators of the data. It is achieved by preventing accidental or deliberate but unauthorized insertion, modification or destruction of data in a database.

Authenticity

The correct attribution of origin, such as the authorship of an e-mail message, or the correct description of information, such as a data field that is properly named.

Availability

The accessibility of a system resource in a timely manner; for example, the measurement of a system's uptime. Online or offline. Random access or serial.

Utility

Usefulness; fitness for a particular use. For example, if data are encrypted and the decryption key is unavailable, the security failure is in the lack of utility of the data (they are still confidential, possessed, integral, authentic and available).

Data Scrubbing

Motivation for Scrubbing

Administratively, incorrect or inconsistent data can lead to false conclusions and misdirected investments on both public and private scales. For instance, the government may want to analyze population census figures to decide which regions require further spending and investment on infrastructure and services. In this case, it will be important to have access to reliable data to avoid erroneous fiscal decisions.

In the business world, incorrect data can be costly. Many companies use customer information databases that record data like contact information, addresses, and preferences.

Data Quality

High quality data needs to pass a set of quality criteria. Those include:

• Accuracy : An aggregated value over the criteria of integrity, consistency and density
• Integrity : An aggregated value over the criteria of completeness and validity
• Completeness : Achieved by correcting data containing anomalies
• Validity : Approximated by the amount of data satisfying integrity constraints
• Consistency : Concerns contradictions and syntactical anomalies
• Uniformity : Directly related to irregularities
• Density : The ratio of missing values to the total number of values ought to be known
• Uniqueness : Related to the amount of duplicates in the data

The Process of Scrubbing

• Data Auditing : The data is audited with the use of statistical methods to detect anomalies and contradictions. This eventually gives an indication of the characteristics of the anomalies and their locations.

• Workflow specification : The detection and removal of anomalies is performed by a sequence of operations on the data known as the workflow. It is specified after the process of auditing the data and is crucial in achieving the end product of high quality data. In order to achieve a proper workflow, the causes of the anomalies and errors in the data have to be closely considered. If for instance we find that an anomaly is a result of typing errors in data input stages, we could add a better validation tool to the input form or dropdown menu choices.

• Workflow execution : In this stage, the workflow is executed after its specification is complete and its correctness is verified. The implementation of the workflow should be efficient even on large sets of data, which inevitably poses a trade-off because the execution of an operation can be computationally expensive.

• Post-Processing and Controlling : After executing the cleansing workflow, the results are inspected to verify correctness. Data that could not be corrected during execution of the workflow are manually corrected if possible. The result is a new cycle in the data cleansing process where the data is audited again to allow the specification of an additional workflow to further cleanse the data by automatic processing.

Methods Used for Data Cleansing

• Parsing : Parsing in data cleansing is performed for the detection of syntax errors. A parser decides whether a string of data is acceptable within the allowed data specification, in much the same way a parser works with grammars and languages; tools such as LEX, YACC or regular expressions can be used.
• Independent Values : If there are independent values that provide a consistency test, then questionable values can be identified. For example, a point address must fall in a zone.
• Data Transformation : Transformation allows the mapping of the data from their given format into the format expected by the appropriate application. This includes value conversions or translation functions (e.g. projection of coordinates).

• Duplicate Elimination : Duplicate detection requires an algorithm for determining whether data contains duplicate representations of the same entity. Usually, data is sorted by a key that brings duplicate entries closer together for faster identification. Python sets and dictionaries are very useful here (see the sketch after this list).

• Statistical Methods : By analyzing the data using values such as the mean, standard deviation and range, or clustering algorithms, it is possible for an expert (you!) to find values that are unexpected and thus erroneous. Although the correction of such data may be difficult since the true value is not known, it can be resolved by setting the values to an average or other statistical value. Statistical methods can also be used to handle missing values, which can be replaced by one or more plausible values that are usually obtained by extensive algorithms. This is frequently done for rainfall records.
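
As a rough sketch of the last two methods, the Python snippet below eliminates duplicates keyed on an identifier and flags values far from the mean; the records and field names are illustrative only, not drawn from any course dataset:

# Duplicate elimination with a dict, plus a simple mean/standard deviation check.
import statistics

records = [{"parcel_id": f"A{i}", "area": 650.0 + 10 * i} for i in range(10)]
records.append({"parcel_id": "A3", "area": 680.0})      # duplicate representation of A3
records.append({"parcel_id": "B99", "area": 98000.0})   # suspiciously large area

# Keep the first record seen for each key.
unique = {}
for record in records:
    unique.setdefault(record["parcel_id"], record)
deduplicated = list(unique.values())
print("kept", len(deduplicated), "of", len(records), "records")

# Flag values more than three standard deviations from the mean.
areas = [r["area"] for r in deduplicated]
mean = statistics.mean(areas)
stdev = statistics.pstdev(areas)
for record in deduplicated:
    if stdev and abs(record["area"] - mean) > 3 * stdev:
        print("possible outlier:", record)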

Challenges and Problems

• Error Correction and loss of information : The most challenging problem within data scrubbing remains the correction of values to remove duplicates and invalid entries. In many cases, the available information on such anomalies is limited and insufficient to determine the necessary transformations or corrections leaving the deletion of such entries as the only plausible solution. The deletion of data though, leads to loss of information which can be particularly costly if there is a large amount of deleted data.

• Maintenance of Cleaned Data : Data scrubbing is an expensive and time-consuming process, so after having performed data cleansing and achieved a data collection free of errors, one would want to avoid re-cleaning the data in its entirety after some values in the collection change. The process should only be repeated on values that have changed, which means that a cleaning lineage needs to be kept; this in turn requires efficient data collection and management techniques. If possible, a feedback loop should be designed to fix the external data source.

• Data Cleaning Framework : In many cases it will not be possible to derive a complete data cleansing graph to guide the process in advance. This makes data cleansing an iterative process involving significant exploration and interaction which may require a framework in the form of a collection of methods for error detection and elimination in addition to data auditing. This can be integrated with other data processing stages like integration and maintenance.

Metadata collected by profiling

This is all information about the database, not just the ISO XML file. It would ideally contain:

• Up-to-date data model and table schema
• Entity Relationship diagram
• All codes and their meaning
• Business rules implemented in forms
• Data Quality Rules
• Data Quality scores
• Summary report

It should all be automated so that successive runs can be compared.
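
One hedged sketch of comparing successive runs, assuming each profiling run is saved as a simple dictionary of metric names and values (the metrics shown are hypothetical):

# Compare two successive profiling summaries and report what changed.
previous_run = {"row_count": 120500, "null_zone_codes": 340, "duplicate_ids": 12}
current_run = {"row_count": 121200, "null_zone_codes": 298, "duplicate_ids": 15}

for metric in sorted(previous_run.keys() | current_run.keys()):
    before = previous_run.get(metric)
    after = current_run.get(metric)
    if before != after:
        print(f"{metric}: {before} -> {after}")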

Corax example: matching report, spreadsheet

FGDC, NZ Metadata Standard or ISO?
http://www.linz.govt.nz/about-linz/news-publications-and-consultations/consultation-projects-and-reviews/nzgms/index.aspx

XML or HTML? Does it really describe the dataset?
Examples from the Net: GSA, Geography Network
Tools to build it: mp

http://www.fgdc.gov/metadata

Extract Transform Load (ETL)

Extract

The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization and format. Common data source formats are relational databases and flat files. Extraction converts the data into a format for transformation processing.

An intrinsic part of the extraction involves parsing the extracted data to check whether it meets an expected pattern or structure. If not, the features may be diverted into a rejects file.
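
A minimal extract sketch, assuming a CSV source and a hypothetical feature identifier pattern, might divert non-conforming rows like this:

# Read rows from a source file, check each against an expected pattern,
# and divert failures into a rejects file. File and field names are hypothetical.
import csv
import re

expected_id = re.compile(r"^[A-Z]{2}\d{6}$")   # assumed identifier pattern

with open("source_features.csv", newline="") as src, \
     open("rejects.csv", "w", newline="") as rej:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(rej, fieldnames=reader.fieldnames)
    writer.writeheader()
    accepted = []
    for row in reader:
        if expected_id.match(row.get("feature_id", "") or ""):
            accepted.append(row)       # passed on to the transform stage
        else:
            writer.writerow(row)       # diverted into the rejects file

print(len(accepted), "rows accepted for transformation")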

Transform

The transform stage applies a series of rules or functions to the extracted data from the source to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the end target:

• Selecting only certain columns to load (or selecting null columns not to load)
• Translating coded values (e.g. if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female); this calls for automated data cleaning, as no manual cleaning occurs during ETL
• Encoding free-form values (e.g. mapping "Male" to "1" and "Mr" to M)
• Deriving a new calculated value (e.g. sale_amount = qty * unit_price)
• Filtering
• Sorting
• Joining data from multiple sources (e.g. lookup, merge)
• Aggregation (for example, rollup: summarizing multiple rows of data, such as total sales for each store and for each region)
• Generating surrogate-key values (autonumber or GUID)
• Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
• Splitting a column into multiple columns (e.g. putting a comma-separated list specified as a string in one column into individual values in different columns)
• Applying any form of simple or complex validation. If validation fails, it may result in a full, partial or no rejection of the data, and thus none, some or all of the data is handed over to the next step, depending on the rule design and exception handling.

Many of the above transformations may result in exceptions, for example when a code translation parses an unknown code in the extracted data.
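
A toy sketch of a few of these transformations (translating a coded value, deriving a calculated value, and collecting exceptions for unknown codes); all names and codes are illustrative:

# Minimal transform sketch: code translation, derived value, exception handling.
GENDER_CODES = {"1": "M", "2": "F"}   # source code -> warehouse code

def transform(row, exceptions):
    out = dict(row)
    code = row.get("gender_code")
    if code not in GENDER_CODES:
        exceptions.append(row)                     # unknown code becomes an exception
        return None
    out["gender"] = GENDER_CODES[code]             # translate coded value
    out["sale_amount"] = float(row["qty"]) * float(row["unit_price"])  # derived value
    return out

exceptions = []
rows = [
    {"gender_code": "1", "qty": "3", "unit_price": "2.50"},
    {"gender_code": "9", "qty": "1", "unit_price": "4.00"},   # unknown code
]
transformed = [t for r in rows if (t := transform(r, exceptions)) is not None]
print(transformed)
print(len(exceptions), "exception rows")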

Load

The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative, updated data every week, while other DWs (or even other parts of the same DW) may add new data in a time-stamped form, for example hourly. The timing and scope of whether to replace or append are strategic design choices that depend on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW.

As the load phase interacts with a database, the constraints defined in the database schema (as well as in triggers activated upon data load) apply, for example uniqueness, referential integrity and mandatory fields, which also contribute to the overall data quality performance of the ETL process.
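
As a small illustration of schema constraints participating in the load phase, the sketch below loads rows into a hypothetical SQLite table with a primary key and a mandatory field, and counts the rows the constraints reject:

# Minimal load sketch: schema constraints reject bad rows during load.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcels (parcel_id TEXT PRIMARY KEY, area REAL NOT NULL)")

rows = [("A1", 650.0), ("A2", 720.0), ("A1", 655.0), ("A3", None)]  # duplicate id, missing area
loaded, rejected = 0, 0
for row in rows:
    try:
        conn.execute("INSERT INTO parcels VALUES (?, ?)", row)
        loaded += 1
    except sqlite3.IntegrityError:
        rejected += 1   # uniqueness or mandatory-field violation

conn.commit()
print(f"loaded {loaded}, rejected {rejected}")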

Exercises

1. Pick your own representative dataset to analyse
2. Data Gazing – browse one of the datasets looking for characteristics
3. Case Study – outline a GIS data quality system
4. Assessment – split into pairs and interview each other about their dataset
5. Examine Metadata – consider its utility
6. Rules Exercise – examine the DVR datasets and devise some rules
7. Safe Software – ETL in action
8. Maintenance of the Quality – open discussion
