Error, Accuracy and Precision
DATA QUALITY
Error, Accuracy and Precision
Managing Error

Data Quality
• Until recently, GIS developers and users paid little attention to problems caused by error, inaccuracy and imprecision in spatial datasets
• There was awareness that all data suffers from inaccuracy and imprecision, but the effects on GIS problems and solutions were not considered
• It is now generally recognized that error, inaccuracy and imprecision can “make or break” GIS projects, and can make the results of a GIS analysis worthless

Data Quality
• In spatial analyses done manually, the analyst can easily align map boundaries so that they overlap and register
• An automated GIS cannot do this unless it is programmed to recognize “undershoots, overshoots, and slivers” and connect the lines
• The level of data quality must be made clear for the GIS to operate correctly
• Assessing the quality of the data, however, may be costly
• Data quality generally refers to the relative accuracy and precision of a particular GIS database
  – Error encompasses both the imprecision of data and its inaccuracies

Data Quality - Assessment
• Data quality testing expenses are directly proportional to the degree of accuracy and precision needed
• The cost includes not only the expense of testing the data, but also the production delays that result from testing
• The costs of testing must be balanced against the costs of using lower-quality data or accepting a lower data quality standard
  – Example: Quality testing of twine vs. climbing rope

Data Quality - Components
• Aronoff divides the characteristics of data quality into 3 categories consisting of 9 components:
  – Micro Level components
    • Positional accuracy
    • Attribute accuracy
    • Logical consistency
    • Resolution
  – Macro Level components
    • Completeness
    • Time
    • Lineage
  – Usage components
    • Accessibility of data
    • Direct and Indirect costs of data

Data Quality - Accuracy
• Accuracy is the degree to which information in a digital database (or on a map) matches true values
  – Spatial data is usually a generalization of the real world, so it is difficult to identify a true value; instead, work with values accepted to be true
• Accuracy pertains to the quality of the data and the number of errors in the dataset
• Accuracy of the database may have little relationship to the accuracy of products computed from the database
  – Accuracy of a slope from a DEM is not easily related to the accuracy of the elevations of the DEM itself
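One way to see why the accuracy of a derived product is hard to read off from the accuracy of its source is the minimal Python sketch below. It assumes (our assumption, not the slides') that slope is computed from just two elevations a fixed distance apart and that their errors are independent with the same standard deviation. The function name and the sample numbers are hypothetical.

```python
import math

def slope_error_sd(elevation_sd_m, spacing_m):
    """Standard deviation of a two-point slope (rise over run).

    Assumes two independent elevation errors with the same SD, so the
    SD of their difference is sqrt(2) * elevation_sd_m.
    """
    return math.sqrt(2) * elevation_sd_m / spacing_m

# Same 1 m elevation accuracy, two hypothetical grid spacings
for spacing in (30.0, 10.0):
    sd = slope_error_sd(1.0, spacing)
    print(f"spacing {spacing} m -> slope SD {sd:.3f} ({sd * 100:.1f}% grade)")
```

Under these assumptions the same elevation accuracy gives a threefold difference in slope accuracy, depending only on point spacing. Real DEM errors are usually spatially correlated, which changes the result again.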
© Vicki Drake 1 SMC Fall 2000

Data Quality - Precision
• Precision refers to the level of measurement and exactness of description in a GIS database
  – The level of precision required varies greatly by application
  – Highly precise data can be difficult and costly to collect
  – High precision does not guarantee high accuracy

Data Quality – Positional Accuracy
• Positional accuracy is defined as the closeness of locational information (usually coordinates) to the true position
• Positional accuracy applies to both horizontal and vertical positions
• Positional accuracy is a function of the scale at which a map (paper or digital) was created
• Mapping standards used by USGS state:
  – “requirements for meeting horizontal accuracy as 90% of all measurable points must be within 1/30th of an inch for maps at a scale of 1:20,000 or larger, and 1/50th of an inch for maps at scales smaller than 1:20,000.”

Data Quality – Accuracy Standards
• 1:1,200 ± 3.33 feet
• 1:2,400 ± 6.67 feet
• 1:4,800 ± 13.33 feet
• 1:10,000 ± 27.78 feet
• 1:12,000 ± 33.33 feet
• 1:24,000 ± 40.00 feet
• 1:63,360 ± 105.60 feet
• 1:100,000 ± 166.67 feet
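The tolerances listed above follow from the quoted rule by simple arithmetic: the map tolerance in inches times the scale denominator gives inches on the ground, and dividing by 12 gives feet. A short Python sketch of that calculation follows. The function name is hypothetical, and the handling of exactly 1:20,000 follows the wording of the quotation above.

```python
def horizontal_tolerance_feet(scale_denominator):
    """Ground tolerance in feet for a map at 1:scale_denominator."""
    # 1/30 inch at 1:20,000 or larger scales, 1/50 inch at smaller scales
    divisor = 30 if scale_denominator <= 20_000 else 50
    ground_inches = scale_denominator / divisor  # tolerance on the ground, inches
    return ground_inches / 12                    # inches -> feet

for denom in (1_200, 2_400, 4_800, 10_000, 12_000, 24_000, 63_360, 100_000):
    print(f"1:{denom:,}  ± {horizontal_tolerance_feet(denom):.2f} feet")
```

Note that the tolerance at 1:24,000 (40 feet) is only slightly larger than at 1:12,000 (33.33 feet) because the rule switches from 1/30 inch to 1/50 inch between them.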
Data Quality – Accuracy Standards
• Accuracy standards imply that a point or line drawn on a map has a “probable” location within a certain area
• False accuracy and false precision can occur when locational information is read from a map to levels of accuracy and precision beyond those at which the map was created
  – Accuracy and precision are tied to the original map scale and do not change even if the user “zooms” in and out in a computer system
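One way to avoid reporting false precision is to round stored coordinates to the level the source scale can support. The Python sketch below is only an illustration: the helper name, the rounding rule (round to the decimal place matching the size of the tolerance) and the sample coordinate are all assumptions, and the tolerance used is the ±40 feet listed above for 1:24,000.

```python
import math

def round_to_accuracy(value, accuracy):
    """Round value to the decimal place matching the size of accuracy."""
    # e.g. accuracy 12.192 -> nearest 10; accuracy 0.5 -> nearest 0.1
    digits = -math.floor(math.log10(accuracy))
    return round(value, digits)

FEET_TO_M = 0.3048
accuracy_m = 40.0 * FEET_TO_M   # 1:24,000 tolerance, about 12.2 m
northing = 4182345.678          # hypothetical digitized coordinate, metres

print(round_to_accuracy(northing, accuracy_m))
```

Zooming in on the display changes none of this: the supportable precision is fixed by the source map, not by the screen scale.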
Data Quality – Positional Accuracy
• Conventionally, maps are accurate to roughly one line width, or 0.5 mm
  – Equivalent to 12 m on 1:24,000, or 125 m on 1:250,000 maps
  – A typical UTM coordinate pair might be:
    • Easting 57924.349 m, Northing 5194732.247 m
    • If the database was digitized from a 1:24,000 sheet, the last four digits of each coordinate could be spurious

Positional Accuracy – “Testing”
• How to “test” for positional accuracy:
  – Use an independent source of higher accuracy
    • Find a larger scale map
    • Use a GPS
    • Use raw survey data
  – Use internal evidence
    • Unclosed polygons and overshoot or undershoot line junctions are indications of inaccuracy; the size of the gaps may be used as a measure of positional accuracy
  – Compute accuracy from knowledge of the errors introduced by different sources
    • 1 mm in source document
    • 0.5 mm in map registration for digitizing

Data Quality – Positional Accuracy
• Two other components of positional accuracy:
  – (1) Bias
  – (2) Precision
• Bias – systematic discrepancies between represented and true position. Measured by average positional error of sample points.
• Precision – dispersion of the positional errors of data elements. Estimated by calculating the standard deviation of the errors at test points. A low SD means error dispersion is low and errors tend to be relatively small.

Data Quality – Attribute Accuracy
• The non-spatial data linked to location may also be inaccurate or imprecise
• Attribute accuracy is the closeness of attribute values to their true value
• Location does not change with time, but attributes often do
• Attributes may be discrete or continuous
  – Discrete attributes may have a finite number of values, e.g., land use, vegetation type, etc.
  – Continuous attributes may have an infinite number of values, e.g., elevation, property value, isotherms, isohyets, etc.

Data Quality – Attribute Accuracy
• Attribute accuracy must be analyzed in different ways depending on the nature of the data
• Continuous attributes (surfaces), such as on a DEM or TIN:
  – Accuracy is expressed as measurement error
    • e.g. elevation accurate to 1 m
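The bias and precision measures defined above for positional accuracy can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name and the test points are hypothetical, errors are taken per axis as database position minus reference position, and the sample standard deviation (n − 1 denominator) is an assumption.

```python
import statistics

def bias_and_precision(measured, reference):
    """Return ((bias_x, bias_y), (sd_x, sd_y)) for paired test points."""
    dx = [m[0] - r[0] for m, r in zip(measured, reference)]
    dy = [m[1] - r[1] for m, r in zip(measured, reference)]
    bias = (statistics.mean(dx), statistics.mean(dy))         # systematic shift
    precision = (statistics.stdev(dx), statistics.stdev(dy))  # dispersion of errors
    return bias, precision

# Hypothetical test points (metres): database positions vs. a higher-accuracy source
reference = [(100.0, 200.0), (150.0, 250.0), (300.0, 120.0), (410.0, 330.0)]
measured = [(102.0, 199.0), (153.0, 250.0), (301.0, 118.0), (412.0, 329.0)]

bias, precision = bias_and_precision(measured, reference)
print("bias (x, y):", bias)
print("precision SD (x, y):", precision)
```

In this made-up sample the database is shifted about 2 m east and 1 m south of the reference (bias), while the scatter around that shift is under 1 m (precision). The same mean-and-SD approach applies to the measurement error of a continuous attribute such as elevation.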
Data Quality – Attribute Accuracy
• Categorical attributes such as classified polygons
  – Are the categories appropriate, detailed and defined?
  – Gross errors, such as a polygon classified as “A” when it should be “B”, are simple but unlikely
    • e.g. land use is shopping center instead of golf course
  – More likely, the polygon will be heterogeneous
    • e.g. vegetation zones where the area may be 70% A and 30% B
  – Worse, A and B may not be well-defined, or the class may not be easily identified as either A or B
    • e.g. soil classifications are typically fuzzy
    • At the center of the polygon there is confidence that the class is A, but it is more like B at the edges

Data Quality – Attribute Accuracy
• In testing attribute accuracy:
  – An error of omission occurs when a point’s class on the ground is incorrectly recorded in the database
  – An error of commission occurs when the class recorded in the database does not exist on the ground

Data Quality – Conceptual Accuracy
• GIS depend on the abstraction and classification of real-world phenomena
• Users determine what amount of information is used, and how it is classified into appropriate categories
• Users may use inappropriate categories or misclassify information
  – e.g., classifying cities by voting behavior is an ineffective way to study fertility patterns
  – e.g., failing to classify power lines by voltage could limit the effectiveness of a GIS designed to manage an electric utility’s infrastructure
  – e.g., drainage systems in a watershed are studied by classifying tributary rivers and streams by “order”; miscounting can lead to misclassification

Data Quality – Logical Consistency
• Logical consistency is how well logical relations among data elements are maintained
• It is the internal consistency of the data structure, and particularly applies to topological consistency
• Is the database consistent with its definitions?
  – If there are polygons, do they close?
  – Is there exactly one label within each polygon?
  – Are there nodes wherever arcs cross, or do arcs sometimes cross without forming nodes?

Data Quality – Logical Consistency
• An illogical employment of information in a database would be to map some forest stand boundaries to the center of an adjacent road, and others to the edge of the same road
• Building a residential subdivision on a floodplain is dangerous, unless the user also compares the proposed plots with floodplain maps where variations in flood potential have been recorded into the GIS database and are used in the comparison
• Mapping a reservoir without taking into consideration the annual (or even daily) fluctuation of its water level is another example. Setting a standard outline for the reservoir and placing it in each layer provides a baseline for comparisons

Data Quality – Logical Consistency
• Information stored in a database can be used illogically
• Information stored in a GIS database must be used and compared carefully if it is to yield useful results
• A GIS is typically unable to warn the user if inappropriate comparisons are being made, or if data are being used incorrectly
• Rules need to be employed to ensure that the characteristics of the real-world phenomena are being modeled correctly in the GIS
  – Logical consistency is best addressed before data are entered into a GIS database

Data Quality - Resolution
• The resolution of a data set is the smallest discernible unit or smallest unit represented
  – Remote sensing – spatial resolution
  – Thematic maps – the minimum mapping unit is the smallest object represented on the map
    • Factors determining resolution for the minimum mapping unit include: end use of map, legibility, drafting expense and known accuracy of source

Data Quality - Resolution
• Digital geographic data in a GIS database have no specific scale and can be displayed at any scale, so the minimum mapping unit can be extremely small
• The levels of accuracy and precision built into the database during creation will, however, limit the display scale for the data
  – Using a GIS, a 1:50,000 scale map could be produced using 1:500,000 digitized data, but the accuracy and precision will still be 1:500,000 quality (the original level)

Data Quality - Completeness
• Macro level components refer to the data set as a whole and are evaluated by judgment, not true testing
• Completeness concerns the degree to which the data exhaust the universe of possible items
  – Are all possible objects included in the database?
  – Is the database affected by rules of selection, generalization and scale?
• Other aspects of completeness include completeness of coverage, classification, and verification
  – Completeness of coverage is the proportion of data available for the area of interest
    • Progressively updated data sets, a “patchwork” of more recent data, may work for the current status of a resource
    • Older, more complete data sets work best for comparative analysis where consistency is important

Data Quality - Completeness
• Completeness of classification assesses how well the chosen classification is able to represent the data
• All data should be encoded at the selected level of detail
• Differences may occur owing to the individuals or organizations that produced the maps
  – Different government agencies mapping adjacent areas may have “boundary-matching” problems, even though the maps are accurate in terms of position and classification

Data Quality - Completeness
• Completeness of verification examines the amount and distribution of field measurements or other independent sources of information used to develop the data
• In geology, standard field data techniques use solid lines for mapping visible and verifiable rock-type boundaries, while inferred boundaries are indicated with dashed lines (“Air Geology”)
  – A data quality check for the geologist’s field data

Data Quality – Time
• A critical factor of any database is time
• Demographic information is time-sensitive, changing significantly even over the course of a year
• Land use information is also time-sensitive, as many areas of the world experience rapid urbanization, especially in less-developed countries near large cities
• Time of year is another factor in data collection, because of seasonal changes in crop type
  – e.g., spring wheat or summer wheat, summer vegetables or fall vegetables

Data Quality - Time
• The time aspect of data quality is usually the date of the source material
• Topographic maps may include the original source date, as well as the dates of any recent updates
• Date of acquisition is also a factor for geographic information that changes rapidly over time
  – Forestry maps are generally updated on a 5-10 year basis
  – Agricultural maps are updated as rapidly as weekly during growing seasons

Data Quality - Lineage
• Lineage is a record of the data sources and of the operations which created the database
• Lineage is often a useful indicator of accuracy
• The lineage of a data set, then, is its history: the source data and processing steps used to produce it
  – How was it digitized, from what documents?
  – When was the data collected?
  – What agency collected the data?
  – What steps were used to process the data?

Data Quality – Usage and Accessibility
• Data quality components of usage are specific to the resources of an organization
• The same data may be too expensive for one organization and considered inexpensive by another, depending on each organization’s financial resources, needs and demands
• Accessibility refers to the ease of obtaining and using data
  – Some data may be restricted because it is privately held, deemed a matter of national defense, or subject to the privacy rights of citizens

Data Quality - Costs
• Purchasing a data set from a vendor can be done for a known direct cost
• Ascertaining the costs of producing or generating the data “in-house” sometimes requires a little more study
• The true costs of the data include the costs of purchasing equipment, training employees, and other support systems, all of which must be factored into the final cost
• Indirect costs include all the time and materials used to generate the data and to use the data
Data Quality – Sources of Error
• Many sources of error can affect the quality of a GIS dataset
• Error is introduced at almost every step of database creation
• Some errors are quite obvious, but others can be difficult to discern, and few will be automatically identified by the GIS itself
• It is the user’s responsibility, then, to prevent errors in the GIS and not be “lulled” into a false sense of accuracy and precision unwarranted by the available data