A Workflow for Geocoding South African Addresses
Alexandria van Rensburg
University of Cape Town

Minor Dissertation presented in partial fulfilment of the requirements for the degree of Master of Philosophy in the Department of Computer Science
University of Cape Town
January 2015

Supervisor: Sonia Berman

The copyright of this thesis vests in the author. No quotation from it or information derived from it is to be published without full acknowledgement of the source. The thesis is to be used for private study or non-commercial research purposes only.

Published by the University of Cape Town (UCT) in terms of the non-exclusive license granted to UCT by the author.

Plagiarism Declaration

I know the meaning of plagiarism and declare that all of the work in this thesis, save for that which is properly acknowledged, is my own.

Signature: _________________________

Abstract

Many industries have long used Geographical Information Systems (GIS) for spatial analysis. In many parts of the world, however, GIS has gained less popularity because of inaccurate geocoding methods and a lack of data standardization. Commercial services can also be expensive, so smaller businesses have been reluctant to make a financial commitment to spatial analytics. This thesis discusses the challenges specific to South Africa as well as the challenges inherent in bad address data. The main goal of this research is to highlight the potential error rates of geocoded user-captured address data and to provide a workflow that can be followed to reduce the error rate without intensive manual data cleansing. We developed a six-step workflow and software package to prepare address data for spatial analysis and determine the potential error rate. We used three methods of geocoding: a gazetteer postal code file, a free web API and an international commercial product.
To protect the privacy of the clients and the businesses, addresses were aggregated to the precision of a postcode or suburb centroid. Geocoding results were analysed before and after each step. Two businesses were analysed: a mid- to large-scale business with a large structured client address database, and a small private business with a 20-year-old unstructured client address database. The companies are from two completely different industries: the larger is in the financial industry and the smaller is an independent magazine in the publishing industry. The address data of both businesses underwent elementising, cleansing and standardization. Their data was mapped and displayed for all three geocoding methods, with all three showing significant error rates. Discrepancies between the three methods were quantified using the Great Circle Distance formula. In both instances, we found that an acceptance threshold of 22.2 km (0.2°) is recommended to distinguish usable results from those requiring human intervention. Once the data was cleansed and there was a proven confidence level in it, it could be mapped in various ways for the businesses to utilize. Demographics and marketing segments could be added to enhance the understanding of the existing client base.

Acknowledgements

I would like to thank my family and friends for their inspiration and never-ending faith in me. I would also like to express my gratitude to my supervisor, Sonia Berman, for her advice, her resolve, and her patience. My sincere thanks also go out to my colleagues and my incredibly supportive company, who have encouraged me every step of the way. Lastly, my amazing friends Sharna and Heidi have been invaluably helpful in this process and are always there with moral support. I am wholeheartedly appreciative.

Table of Contents

Plagiarism Declaration
Abstract
Acknowledgements
Table of Contents
List of figures
List of tables
1 Introduction
1.1 Scope
1.2 Thesis outline
2 Background
2.1 Data quality
2.2 Improving address data quality
2.2.1 Categories of address cleansing
2.3 Geocoding Accuracy
2.4 Geocoding Methods Explored
2.4.1 Verifying geocoded results
2.5 GIS and Spatial Analytics
2.6 The Importance of Place
2.6.1 Definition of place
2.7 The Various Uses for GIS
2.7.1 Agriculture
2.7.2 Banking and finance
2.7.3 CRM systems
2.7.4 Government
2.7.5 Healthcare
2.7.6 Insurance
2.8 Geo Demographics/Tapestry Segments
2.9 South African Postcode Geography
3 Workflow Design
3.1 Workflow prerequisites
3.2 Address Elementising
3.3 Address Cleansing
3.3.1 Interpreting address cleansing codes
3.3.2 SSRS reports and originating table updates
3.4 Aggregation
3.5 Geocoding
3.5.1 Final geocoded dataset
3.6 Comparison
3.7 Result Extraction
3.8 Workflow Overview
4 Workflow Application and Results